Title: [Help Wanted] Optimize binary matmul kernel · Issue #5 · FasterDecoding/BitDelta · GitHub
Description: The current kernel is fairly slow compared to the theoretical optimum, considering the small memory footprint of weight deltas. So right now it functions more as a proof of concept (e.g., it outperforms naive simultaneous inference). Can ...
Opengraph URL: https://github.com/FasterDecoding/BitDelta/issues/5
Author: chromecast56 (https://github.com/chromecast56)
Date published: 2024-03-03T21:07:20.000Z
Comments: 1

The [current kernel](https://github.com/FasterDecoding/BitDelta/blob/main/bitdelta/binary_gemm_kernel.py#L297) is fairly slow compared to the theoretical optimum, considering the small memory footprint of weight deltas. So right now it functions more as a proof of concept (e.g., it outperforms naive simultaneous inference). An additional 4-8x latency improvement can be expected if it is further optimized.

I don't have much kernel optimization experience yet, though. If anyone in the OSS community is interested, I would love some help!

Afterwards, it'd be super interesting to run some benchmarks against LoRA-based multi-tenant systems like Punica/S-LoRA.
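For readers unfamiliar with what a binary matmul over weight deltas computes, here is a minimal pure-Python reference sketch. It is not the repository's kernel: the function names (`pack_signs`, `binary_matvec`, `footprint_ratio`), the bit-packing layout, and the single per-matrix `scale` are all assumptions made for illustration, and the 16-bit dense baseline in the footprint estimate is likewise assumed.

```python
# Illustrative reference only. Names and the packing layout are assumptions,
# not the actual kernel or storage format used in the repository.

def pack_signs(delta_row):
    """Pack the signs of one row of weight deltas into an int.

    Bit i is set when delta_row[i] >= 0 (sign +1); otherwise the sign is -1.
    """
    bits = 0
    for i, d in enumerate(delta_row):
        if d >= 0:
            bits |= 1 << i
    return bits


def binary_matvec(x, packed_rows, scale):
    """Compute y[j] = scale * sum_i s[j][i] * x[i], with s in {-1, +1}."""
    out = []
    for bits in packed_rows:
        acc = 0.0
        for i, xi in enumerate(x):
            # Add x[i] when the sign bit is set, subtract it otherwise.
            acc += xi if (bits >> i) & 1 else -xi
        out.append(scale * acc)
    return out


def footprint_ratio(bits_per_weight_dense=16, bits_per_weight_delta=1):
    """Dense storage divided by 1-bit delta storage, ignoring the scale."""
    return bits_per_weight_dense / bits_per_weight_delta


# Tiny example: two output rows, three input features.
delta = [[0.5, -0.25, 0.125], [-1.0, 2.0, -3.0]]
packed = [pack_signs(row) for row in delta]
y = binary_matvec([1.0, 2.0, 3.0], packed, scale=0.5)
```

Because only one bit per weight has to be read, such a kernel is expected to be limited by memory traffic, which is why the small footprint suggests headroom over the current implementation. A real GPU kernel would unpack many bits per load instead of looping element by element as this sketch does.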