Title: gh-144015: Add portable SIMD optimization for bytes.hex() by gpshead · Pull Request #143991 · python/cpython · GitHub
Description: Add SIMD optimization for bytes.hex(), bytearray.hex(), and binascii.hexlify() as well as hashlib .hexdigest() methods using portable GCC/Clang vector extensions that compile to native SIMD instructions.

- Up to 11x faster for large data (1KB+)
- 1.1-3x faster for common small data (16-64 bytes, covering md5 through sha512 digest sizes)
- Separator insertion (sep=) also benefits when bytes_per_sep >= 8
- Retains the existing scalar code for short inputs (<16 bytes) or platforms lacking SIMD instructions, no observable performance regressions there.

Supported platforms:

- x86-64: SSE2 is always available, no special flags needed
- ARM64: NEON is always available, no special flags needed
- ARM32: Requires NEON support and appropriate compiler flags (e.g., -march=native on a Raspberry Pi 3+)
- Windows/MSVC: Not supported; MSVC lacks __builtin_shufflevector, so the scalar path is used

This is compile time detection of features that are always available on the target architectures. No need for runtime feature inspection.

Benchmarked using https://github.com/python/cpython/blob/0f94c061d49821a74096e57df8dff9617b80fad7/Tools/scripts/pystrhex_benchmark.py

Performance wins confirmed across the board on x86_64 (zen2), ARM64 (RPi4), ARM32 (RPi5 running 32-bit raspbian, with compiler flags to enable it), ARM64 Apple M4.

The commit history on this branch contains earlier experiments for reference.

Issue: gh-144015

Example benchmark results (M4):

- bytes.hex() without separator: Scales extremely well, from 1.02x at 16 bytes up to 9.8x at 4KB.
- bytes.hex() with sep=32: Good gains even with separators (1.38x-5x).
- hashlib hexdigest: Modest 7-15% improvement on the hex conversion portion. The hash computation dominates total time.

bytes.hex() (no separator)

| Size | Baseline | Optimized | Speedup |
| --- | --- | --- | --- |
| 16 bytes | 22.9 ns | 22.4 ns | 1.02x |
| 32 bytes | 28.4 ns | 22.7 ns | 1.25x |
| 64 bytes | 44.4 ns | 24.4 ns | 1.82x |
| 256 bytes | 154.9 ns | 47.6 ns | 3.25x |
| 4096 bytes | 1969.2 ns | 201.6 ns | 9.8x |

bytes.hex('\n', 32) (separator every 32 bytes)

| Size | Baseline | Optimized | Speedup |
| --- | --- | --- | --- |
| 32 bytes | 48.8 ns | 35.3 ns | 1.38x |
| 64 bytes | 63.4 ns | 38.8 ns | 1.63x |
| 256 bytes | 178.7 ns | 73.0 ns | 2.45x |
| 512 bytes | 293.3 ns | 89.6 ns | 3.27x |
| 4096 bytes | 2074.2 ns | 415.5 ns | 5.0x |

hashlib hexdigest (hash + hex conversion)

| Digest | Baseline | Optimized | Speedup |
| --- | --- | --- | --- |
| md5 (16 bytes) | 238.2 ns | 231.7 ns | 1.03x |
| sha1 (20 bytes) | 210.8 ns | 197.3 ns | 1.07x |
| sha256 (32 bytes) | 214.6 ns | 200.0 ns | 1.07x |
| sha512 (64 bytes) | 282.9 ns | 255.9 ns | 1.11x |

And if you're curious about the path not taken by the end state of this PR using AVX, here that is on a zen4:

bytes.hex() without separator

| Size | Baseline | SIMD PR | AVX-512 | AVX2 |
| --- | --- | --- | --- | --- |
| 32 B | 44.7 ns | 27.4 ns (1.6x) | 29.2 ns (1.5x) | 29.0 ns (1.5x) |
| 64 B | 64.5 ns | 28.3 ns (2.3x) | 29.2 ns (2.2x) | 29.4 ns (2.2x) |
| 128 B | 104.8 ns | 31.7 ns (3.3x) | 29.0 ns (3.6x) | 30.8 ns (3.4x) |
| 256 B | 185.8 ns | 45.0 ns (4.1x) | 35.9 ns (5.2x) | 40.4 ns (4.6x) |
| 512 B | 361.1 ns | 75.3 ns (4.8x) | 55.0 ns (6.6x) | 61.4 ns (5.9x) |
| 4096 B | 2242.6 ns | 278.1 ns (8.1x) | 138.5 ns (16.2x) | 174.0 ns (12.9x) |

The SIMD PR (SSE2/SSSE3) delivers strong speedups across the board, reaching 8x at 4KB. The AVX variants push further: AVX-512 hits 16x at 4KB, AVX2 achieves 13x.
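
For readers who want to try the affected code paths on their own build, here is a minimal Python sketch. It is not the `pystrhex_benchmark.py` script linked above. The helper names `hex_paths` and `time_hex` are hypothetical and exist only for this illustration, and the timings it prints will vary by machine and interpreter build. It uses only standard-library calls: `bytes.hex()` with its `sep` and `bytes_per_sep` arguments, `binascii.hexlify()`, `hashlib`, and `timeit`.

```python
import binascii
import hashlib
import timeit
from typing import Dict, Optional


def hex_paths(data: bytes) -> Dict[str, str]:
    """Hex-encode `data` through each API the description names (hypothetical helper)."""
    return {
        "bytes.hex": data.hex(),
        "bytearray.hex": bytearray(data).hex(),
        "binascii.hexlify": binascii.hexlify(data).decode("ascii"),
    }


def time_hex(size: int, sep: Optional[str] = None, bytes_per_sep: int = 1,
             number: int = 20_000) -> float:
    """Return the mean wall time of one .hex() call in nanoseconds (hypothetical helper)."""
    # Build a deterministic payload of exactly `size` bytes.
    data = bytes(range(256)) * (size // 256) + bytes(range(size % 256))
    if sep is None:
        stmt = data.hex
    else:
        stmt = lambda: data.hex(sep, bytes_per_sep)
    total_seconds = timeit.timeit(stmt, number=number)
    return total_seconds / number * 1e9


if __name__ == "__main__":
    # All three APIs should agree on the encoded text.
    sample = bytes(range(32))
    assert len(set(hex_paths(sample).values())) == 1

    # hexdigest() is the hex encoding of digest().
    h = hashlib.sha256(b"example")
    assert h.hexdigest() == h.digest().hex()

    # Sizes mirror the rows of the no-separator table.
    for size in (16, 32, 64, 256, 4096):
        print(f"{size:>5} bytes, no separator: {time_hex(size):8.1f} ns")

    # Separator every 32 bytes, as in the second table.
    print(f" 4096 bytes, sep every 32: {time_hex(4096, chr(10), 32):8.1f} ns")
```

Running the same script against a baseline build and a build with this change gives a rough before/after comparison. Because `timeit` measures the whole Python-level call, very small inputs are dominated by call overhead rather than by the hex conversion itself.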
Opengraph URL: https://github.com/python/cpython/pull/143991