Title: More strict rules for group numbers and names in RE · Issue #91760 · python/cpython · GitHub
Opengraph URL: https://github.com/python/cpython/issues/91760
Author: serhiy-storchaka (https://github.com/serhiy-storchaka)
Date published: 2022-04-20T18:30:07.000Z
Comments: 2
URL: https://github.com/python/cpython/issues/91760

There were unintentional changes in parsing regular expressions between Python 2 and Python 3.

1. Group references.

   In patterns and replacement strings you can refer to a group by its number using the syntax `\N`, where N is a 1-2 digit decimal number. The number should not start with 0, because it would then be parsed as an octal escape sequence. The group number can also be used in the conditional expression `(?(N)...)` in patterns and in references `\g<N>` in replacement strings. Interestingly, in Python 3 it can be more than just a sequence of decimal digits. The following things are allowed in the group number:

   * Initial zero: `\g<01>`.
   * Spaces around the number: `\g< 1 >`.
   * Underscores: `\g<1_2>`.
   * Non-decimal digits: `\g<¹>`.
   * Non-ASCII decimal digits: `\g<१>`.

   All this is purely an implementation artifact. After `\g<` we search for the nearest `>` and pass the substring between `<` and `>` to `int()`. In another implementation we could search for the longest sequence of decimal digits, and all the above examples (except maybe the first one) would be filtered out automatically.

2. Group names.

   In `(?P<name>...)`, `(?P=name)`, `(?(name)...)` and `\g<name>` we can refer to groups by name. To avoid ambiguity there is a limitation: the name should follow the rules for identifiers. In Python 2 this means that it should contain only letters, digits and underscores and start with a non-digit. Letters and digits are ASCII-only: [A-Za-z] and [0-9].

   In Python 3 identifiers can contain non-ASCII letters and digits. That is good. But in bytes patterns and replacement strings the codes `\xaa`, `\xb2`, `\xb3`, `\xb5`, `\xb9`, `\xba`, `\xc0`-`\xd6`, `\xd8`-`\xf6`, `\xf8`-`\xff` are allowed in the group name. They correspond to the characters `ª²³µ¹ºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ` after decoding.

   This is an implementation artifact too. Bytes patterns and replacement strings are decoded with the Latin1 encoding for parsing. It simplifies and speeds up the code. There is no other reason why letters and digits in the range U+0080 to U+00FF are allowed.

   Note that in Python 3 a bytes literal can only contain printable literal characters in the ASCII range. Codes outside of this range have to be represented as octal or hexadecimal escape sequences. So supporting non-ASCII letters and digits does not add to readability.

Since the above "features" are not intentional, are not supported by most other RE engines (except `regex`, which is also written in Python), are not tested, and can change as a result of refactoring the parser, I suggest introducing stricter rules for group numbers and names.

1. A group number should only contain ASCII decimal digits in the range [0-9]. An initial 0 is not allowed except for group number 0.
2. A group name in a bytes pattern or replacement string should only contain ASCII letters and digits.

The question: do we need a deprecation period for this? I have written code for both options (with deprecation and with error) and will create PRs tomorrow.
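To make the two proposed rules concrete, here is a minimal sketch. It does not call the `re` parser on the inputs in question, because that behavior differs between Python versions. Instead it shows how lenient `int()` and Latin1 decoding are on their own, next to two hypothetical validators, `is_strict_group_number` and `is_strict_bytes_group_name`. These names are invented for illustration and are not part of the `re` module. Allowing underscores in names is an assumption based on the identifier rule described above.

```python
import re

# int() accepts more than plain ASCII digit sequences, which is how the
# lenient group numbers listed above can slip through.
lenient = {s: int(s) for s in ("01", " 1 ", "1_2", "१")}
print(lenient)  # {'01': 1, ' 1 ': 1, '1_2': 12, '१': 1}

# Hypothetical validators sketching the stricter rules (illustration only).
_STRICT_NUMBER = re.compile(r"0|[1-9][0-9]*")
# Assumption: underscores stay allowed, as in ASCII identifiers.
_STRICT_BYTES_NAME = re.compile(rb"[A-Za-z_][A-Za-z0-9_]*")


def is_strict_group_number(text: str) -> bool:
    """Rule 1: ASCII digits only, no leading zero except for "0" itself."""
    return _STRICT_NUMBER.fullmatch(text) is not None


def is_strict_bytes_group_name(name: bytes) -> bool:
    """Rule 2: ASCII letters, digits and underscores, not starting with a digit."""
    return _STRICT_BYTES_NAME.fullmatch(name) is not None


# Decoding bytes as Latin1 turns b"\xe9" into "é", which is a valid
# identifier character; this is why such bytes pass an identifier check.
print(b"caf\xe9".decode("latin-1").isidentifier())  # True
print(is_strict_bytes_group_name(b"caf\xe9"))       # False
print(is_strict_group_number("01"))                 # False
```

Under these assumptions, `"0"` and `"12"` pass the number check, while `"01"`, `" 1 "`, `"1_2"` and `"१"` are rejected even though `int()` accepts all of them.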