Title: gh-87729: add LOAD_SUPER_ATTR instruction for faster super() by carljm · Pull Request #103497 · python/cpython · GitHub
Description: This PR speeds up super() (by around 85%, for a simple one-level super().meth() microbenchmark) by avoiding allocation of a new single-use super() object on each use.

Microbenchmark results

With this PR:

    ➜ ./python -m pyperf timeit -s 'from superbench import b' 'b.meth()'
    .....................
    Mean +- std dev: 70.4 ns +- 1.4 ns

Without this PR:

    ➜ ./python -m pyperf timeit -s 'from superbench import b' 'b.meth()'
    .....................
    Mean +- std dev: 130 ns +- 1 ns

Microbenchmark code

    ➜ cat superbench.py
    class A:
        def meth(self):
            return 1

    class B(A):
        def meth(self):
            return super().meth()

    b = B()

Microbenchmark numbers are the same (both pre and post) if the microbenchmark is switched to use return super(B, self).meth() instead.

super() is already special-cased in the compiler to ensure the presence of the __class__ cell needed by zero-argument super(). This extends that special-casing a bit in order to compile super().meth() as

      4 LOAD_GLOBAL              0 (super)
     14 LOAD_DEREF               1 (__class__)
     16 LOAD_FAST                0 (self)
     18 LOAD_SUPER_ATTR          5 (NULL|self + meth)
     20 CALL                     0

instead of the current:

      4 LOAD_GLOBAL              1 (NULL + super)
     14 CALL                     0
     22 LOAD_ATTR                3 (NULL|self + meth)
     42 CALL                     0

Bytecode comparison for simple attribute

And compile super().attr as

      4 LOAD_GLOBAL              0 (super)
     14 LOAD_DEREF               1 (__class__)
     16 LOAD_FAST                0 (self)
     18 LOAD_SUPER_ATTR          4 (attr)

instead of the current:

      4 LOAD_GLOBAL              1 (NULL + super)
     14 CALL                     0
     22 LOAD_ATTR                2 (attr)

The new bytecode has one more instruction, but still ends up executing much faster, because it eliminates the cost of allocating a new single-use super object each time. For zero-arg super, it also eliminates dynamically figuring out each time via frame introspection where to find the self argument and __class__ cell, even though the location of both is already known at compile time.

The LOAD_GLOBAL of super remains only in order to support existing semantics in case the name super is re-bound to some other callable besides the built-in super type.
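To check which of the two bytecode shapes above a given interpreter build produces, the microbenchmark class can be disassembled with the standard `dis` module. This is only a sketch: the helper name `opnames` is made up for illustration, and whether `LOAD_SUPER_ATTR` shows up depends on whether the build includes this change.

```python
import dis


class A:
    def meth(self):
        return 1


class B(A):
    def meth(self):
        return super().meth()


def opnames(func):
    """Return the opcode names of func's bytecode, in order (hypothetical helper)."""
    return [instr.opname for instr in dis.get_instructions(func)]


names = opnames(B.meth)

# On a build with this change, the list is expected to contain
# LOAD_SUPER_ATTR; on older builds it contains LOAD_ATTR after a CALL instead.
uses_new_opcode = "LOAD_SUPER_ATTR" in names
print("LOAD_SUPER_ATTR emitted:", uses_new_opcode)
print(names)
```

Either way, `B().meth()` returns the same value; only the instruction sequence differs.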
Besides being faster, the new bytecode is preferable because it regularizes the loading of self and __class__ to use the normal LOAD_FAST and LOAD_DEREF opcodes, instead of custom code in the super object (not part of the interpreter) relying on private details of interpreter frames to load these in a bespoke way. This helps optimizers like the Cinder JIT that fully support LOAD_FAST and LOAD_DEREF but may not maintain frame locals in the same way. It also makes the bytecode more easily amenable to future optimization by a type-specializing tier 2 interpreter, because __class__ and self will now be surfaced and visible to the optimizer in the usual way, rather than hidden inside the super object.

I'll follow up with a specialization of LOAD_SUPER_ATTR for the case where we are looking up a method and a method is found (because this is a common case, and a case where the output of LOAD_SUPER_ATTR depends only on the type of self and not on the actual instance). But to simplify review, I'll do this in a separate PR. I think the benefits of this PR stand alone, even without further benefits of specialization. (ETA: the specialization is now also ready at https://github.com/carljm/cpython/compare/superopt...carljm:cpython:superopt_spec?expand=1 and increases the microbenchmark win from 85% to 2.3x.)

The frame introspection code for runtime/dynamic zero-arg super() still remains, but after this PR it would only ever be used in an odd edge case like super(*args) (if args turns out to be empty at runtime), where we can't detect at compile time whether we will have zero-arg or two-arg super().

"Odd" uses of super() (like one-argument super, use of a super object as a descriptor, etc.) are still supported and experience no change; the compiler will not emit the new LOAD_SUPER_ATTR opcode.

I chose to make the new opcode more general by using it for both (statically detectable) zero- and two-arg super.
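The call shapes discussed above can be put side by side in a small sketch. All class names here are invented for illustration, and `exec` with a custom globals dict is just one assumed way to simulate the name super being re-bound at module level; the point is only that each form keeps its existing behavior.

```python
class Base:
    def meth(self):
        return "Base.meth"


class ZeroArg(Base):
    def meth(self):
        # Statically detectable zero-arg form.
        return super().meth()


class TwoArg(Base):
    def meth(self):
        # Statically detectable two-arg form.
        return super(TwoArg, self).meth()


class StarArgs(Base):
    def meth(self):
        # Not statically detectable: args is only known to be empty at
        # runtime, so this goes through the dynamic zero-arg super() path.
        args = ()
        return super(*args).meth()


class Impostor:
    """Hypothetical stand-in bound to the global name `super` below."""

    def meth(self):
        return "Impostor.meth"


SHADOWED_SRC = '''
class Base:
    def meth(self):
        return "Base.meth"

class Child(Base):
    def meth(self):
        return super().meth()
'''


def run_with_shadowed_super():
    # The module-level name `super` is re-bound to another callable, so the
    # looked-up global is not the built-in super type when meth() runs.
    namespace = {"super": Impostor}
    exec(SHADOWED_SRC, namespace)
    return namespace["Child"]().meth()


print(ZeroArg().meth(), TwoArg().meth(), StarArgs().meth())
print(run_with_shadowed_super())
```

The first three calls all resolve to `Base.meth`; the last one calls the re-bound callable instead, which is the existing semantics the retained LOAD_GLOBAL is there to preserve.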
Optimizing zero-arg super is more important because it is more common in modern Python code, and because it also eliminates the frame introspection. But supporting two-arg super costs only one extra bit smuggled via the oparg; this seems worth it.

Real-world results and macrobenchmarks

This approach provides a speed-up of about 0.5% globally on the Instagram server real-world workload (measured recently on Python 3.10). I can work on a macrobenchmark for the pyperformance suite that exercises super() (currently it isn't significantly exercised by any benchmark). (ETA: benchmark is now ready at python/pyperformance#271; this diff improves its performance by 10%, the specialization follow-up by another 10%.)

Prior art

This PR is essentially an updated version of #24936; thanks to @vladima for the original inspiration for this approach. Notable differences from that PR:

- I avoid turning the oparg for the new opcode into a const load, preferring to pass the needed bits of information by bit-shifting the oparg instead (following the precedent of LOAD_ATTR).
- I prioritize code simplicity over performance in edge cases like when a super() attribute access raises AttributeError, which also reduces the footprint of the PR.

#30992 was an attempt to optimize super() solely using the specializing interpreter, but it was never merged because there are too many problems caused by adaptive super-instructions in the tier 1 specializing interpreter.

Issue: gh-87729
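As a rough illustration of passing flags by bit-shifting the oparg, the sketch below packs and unpacks a LOAD_SUPER_ATTR oparg. The bit layout is an assumption taken from the Python 3.12 `dis` documentation (low bit: attempt a method load; second-low bit: two-argument super; remaining bits: index into the code object's names), not from the PR text itself, and the function and type names are hypothetical.

```python
from typing import NamedTuple


class SuperAttrOparg(NamedTuple):
    """Decoded fields of a LOAD_SUPER_ATTR oparg (assumed layout)."""

    name_index: int       # index of the attribute name in co_names
    is_method_load: bool  # low bit: attempt a method load, as with LOAD_ATTR
    is_two_arg: bool      # second-low bit: super(cls, self) rather than super()


def decode_load_super_attr_oparg(oparg: int) -> SuperAttrOparg:
    """Split an oparg into its name index and two flag bits."""
    return SuperAttrOparg(
        name_index=oparg >> 2,
        is_method_load=bool(oparg & 1),
        is_two_arg=bool(oparg & 2),
    )


def encode_load_super_attr_oparg(
    name_index: int, is_method_load: bool, is_two_arg: bool
) -> int:
    """Inverse of decode_load_super_attr_oparg."""
    return (name_index << 2) | (int(is_two_arg) << 1) | int(is_method_load)


# The two oparg values that appear in the disassembly in the description:
print(decode_load_super_attr_oparg(5))  # super().meth() with a method load
print(decode_load_super_attr_oparg(4))  # super().attr, plain attribute load
```

Under this layout, both opargs from the description (5 and 4) decode to name index 1 with the two-arg bit unset, differing only in the method-load bit.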
Opengraph URL: https://github.com/python/cpython/pull/103497