Title: Specializing adaptive interpreter code object hashes are less unique · Issue #94155 · python/cpython · GitHub
Opengraph URL: https://github.com/python/cpython/issues/94155
Author: larryhastings (https://github.com/larryhastings)
Date published: 2022-06-23T04:01:32.000Z
Comments: 6

In Python 3.11 and 3.12, the hash function for a `PyCodeObject` (`code_hash()`) no longer hashes the bytecode. I assume this is because the specializing adaptive interpreter can change the bytecode however it likes, which means the bytecode is no longer immutable, which means you can't rely on it not changing, which means you can't use it as part of calculating the hash. Fair enough.

But this means that the hash of a code object no longer factors in the most unique part of a code object. Currently in 3.11 and 3.12 the hash of a code object is calculated using:

- the unqualified name of the callable
- the code object's const table (a tuple of immutable objects)
- the code object's tuple of externally-referenced names (globals, nonlocals)
- the code object's tuple of locally-referenced names (parameters, local variables, and closures)
- the total number of arguments
- the count of positional-only arguments
- the count of keyword-only arguments
- the "flags" for this code object

Which means it's not hard to construct code objects with identical hashes but different bytecode. For example:

```Python
class A:
    def method(self, a, b):
        return a + b

class B:
    def method(self, a, b):
        return a * b

class C:
    def method(self, a, b):
        return a / b


for cls in (A, B, C):
    o = cls()
    print(o.method(3, 5))
    print(hex(hash(cls.method.__code__)))
```

The hashes for `A.method.__code__`, `B.method.__code__`, and `C.method.__code__` are different in Python 3.10, but identical in Python 3.11b3, and presumably in trunk as well.

Is this scenario realistic? I don't know. Certainly I've never seen it. But it's at least _plausible_; a base class could have a tiny function that only does a little work, and a subclass could override that function and tweak the work done slightly, without relying on additional external names / closures or employing a different list of locals. It's certainly not impossible.

Obviously this is low priority: there's very little code that hashes code objects. It might cause some collisions and probing when marshal dumps a module exhibiting this behavior, because marshal maintains a hash table of the objects it writes. That's the worst side-effect I can think up.

We could mitigate this slightly by factoring more values into the code object's hash. Three come immediately to mind:

- the filename (`co_filename`)
- the first line number (`co_firstlineno`)
- the first column number (which I'm guessing is encoded in `co_positions` somehow)

With only the first two, you'd still have hash collisions from code objects defined using lambdas all on the same line:

```Python
a, b, c = (lambda a, b: a + b), (lambda a, b: a * b), (lambda a, b: a / b)

for fn in (a, b, c):
    print(fn(3, 5))
    print(hex(hash(fn.__code__)))
```

As unlikely as it is that someone would stumble over this scenario, it's even _less_ likely that it would cause a problem. But decreasing the likelihood of hash value collisions seems so wholesome and good, I think it's worth pursuing.

### Linked PRs
* gh-100183
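To make the collision and the proposed mitigation concrete, here is a rough pure-Python sketch. It does not reproduce what `code_hash()` does in C; it only gathers the fields listed in the issue into a tuple so two code objects can be compared field by field. The helper names `current_key` and `extended_key` are hypothetical, and using `co_varnames + co_cellvars + co_freevars` for the "locally-referenced names" is an assumption about how that internal tuple maps onto public attributes.

```Python
def current_key(code):
    """Approximate the fields the issue says feed the 3.11 hash (assumption)."""
    return (
        code.co_name,                # unqualified name
        code.co_consts,              # const table
        code.co_names,               # externally-referenced names
        # locally-referenced names: parameters, locals, closure variables
        code.co_varnames + code.co_cellvars + code.co_freevars,
        code.co_argcount,
        code.co_posonlyargcount,
        code.co_kwonlyargcount,
        code.co_flags,
    )


def extended_key(code):
    """Hypothetical key that also includes the first two proposed fields."""
    return current_key(code) + (code.co_filename, code.co_firstlineno)


class A:
    def method(self, a, b):
        return a + b


class B:
    def method(self, a, b):
        return a * b


# Two lambdas defined on the same line, as in the issue's second example.
add, mul = (lambda a, b: a + b), (lambda a, b: a * b)

code_a, code_b = A.method.__code__, B.method.__code__

# The bytecode differs, yet every field in the current key matches.
print(code_a.co_code != code_b.co_code)                    # True
print(current_key(code_a) == current_key(code_b))          # True

# Filename plus first line number separates the two methods...
print(extended_key(code_a) == extended_key(code_b))        # False

# ...but not lambdas that share a line.
print(extended_key(add.__code__) == extended_key(mul.__code__))  # True
```

The last comparison is why the issue suggests a column number as a third field: file and line alone cannot tell apart code objects that start on the same line.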