Title: unicodedata module needs way of accurately determining XID_START and XID_CONTINUE properties. · Issue #129117 · python/cpython · GitHub
Description: Bug report Bug description: With the unicodedata module, it is possible to determine if a Unicode character is a valid identifier start or identifier continuation character, but not in a few cases. The method is to look at unicodedata.ca...
Opengraph URL: https://github.com/python/cpython/issues/129117
{"@context":"https://schema.org","@type":"DiscussionForumPosting","headline":"unicodedata module needs way of accurately determining XID_START and XID_CONTINUE properties.","articleBody":"# Bug report\n\n### Bug description:\n\nWith the `unicodedata` module, it is possible to determine if a unicode character is a valid identifier start or identifier continuation character, _but not in a few cases_.\nThe method is to look at `unicodedata.category(c)`.\nA start character has category in `\"Lu Ll Lt Lm Lo Nl Pc\".split()`.\nA continue character has category in `\"Lu Ll Lt Lm Lo Mn Mc Nd Nl Pc\".split()`.\n\nHowever, there are several codepoints which don't match these criteria, either because they are not that type of character or because their category is different.\nHere is a complete list of the exceptions, on Python 3.13 and Unicode version 16.0:\nShould be `XID_START` but are not:\n```\n005f Pc True LOW LINE\n037a Lm True GREEK YPOGEGRAMMENI\n0e33 Lo True THAI CHARACTER SARA AM\n0eb3 Lo True LAO VOWEL SIGN AM\n203f Pc True UNDERTIE\n2040 Pc True CHARACTER TIE\n2054 Pc True INVERTED UNDERTIE\n2e2f Lm True VERTICAL TILDE\nfc5e Lo True ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM\nfc5f Lo True ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM\nfc60 Lo True ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM\nfc61 Lo True ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM\nfc62 Lo True ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM\nfc63 Lo True ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM\nfdfa Lo True ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM\nfdfb Lo True ARABIC LIGATURE JALLAJALALOUHOU\nfe33 Pc True PRESENTATION FORM FOR VERTICAL LOW LINE\nfe34 Pc True PRESENTATION FORM FOR VERTICAL WAVY LOW LINE\nfe4d Pc True DASHED LOW LINE\nfe4e Pc True CENTRELINE LOW LINE\nfe4f Pc True WAVY LOW LINE\nfe70 Lo True ARABIC FATHATAN ISOLATED FORM\nfe72 Lo True ARABIC DAMMATAN ISOLATED FORM\nfe74 Lo True ARABIC KASRATAN ISOLATED FORM\nfe76 Lo True ARABIC FATHA 
ISOLATED FORM\nfe78 Lo True ARABIC DAMMA ISOLATED FORM\nfe7a Lo True ARABIC KASRA ISOLATED FORM\nfe7c Lo True ARABIC SHADDA ISOLATED FORM\nfe7e Lo True ARABIC SUKUN ISOLATED FORM\nff3f Pc True FULLWIDTH LOW LINE\nff9e Lm True HALFWIDTH KATAKANA VOICED SOUND MARK\nff9f Lm True HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK\n```\nShould not be `XID_START` but are:\n```\n1885 Mn False MONGOLIAN LETTER ALI GALI BALUDA\n1886 Mn False MONGOLIAN LETTER ALI GALI THREE BALUDA\n2118 Sm False SCRIPT CAPITAL P\n212e So False ESTIMATED SYMBOL\n```\nShould be `XID_CONTINUE` but are not:\n```\n037a Lm True GREEK YPOGEGRAMMENI\n2e2f Lm True VERTICAL TILDE\nfc5e Lo True ARABIC LIGATURE SHADDA WITH DAMMATAN ISOLATED FORM\nfc5f Lo True ARABIC LIGATURE SHADDA WITH KASRATAN ISOLATED FORM\nfc60 Lo True ARABIC LIGATURE SHADDA WITH FATHA ISOLATED FORM\nfc61 Lo True ARABIC LIGATURE SHADDA WITH DAMMA ISOLATED FORM\nfc62 Lo True ARABIC LIGATURE SHADDA WITH KASRA ISOLATED FORM\nfc63 Lo True ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM\nfdfa Lo True ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM\nfdfb Lo True ARABIC LIGATURE JALLAJALALOUHOU\nfe70 Lo True ARABIC FATHATAN ISOLATED FORM\nfe72 Lo True ARABIC DAMMATAN ISOLATED FORM\nfe74 Lo True ARABIC KASRATAN ISOLATED FORM\nfe76 Lo True ARABIC FATHA ISOLATED FORM\nfe78 Lo True ARABIC DAMMA ISOLATED FORM\nfe7a Lo True ARABIC KASRA ISOLATED FORM\nfe7c Lo True ARABIC SHADDA ISOLATED FORM\nfe7e Lo True ARABIC SUKUN ISOLATED FORM\n```\nShould not be `XID_CONTINUE` but are:\n```\n00b7 Po False MIDDLE DOT\n0387 Po False GREEK ANO TELEIA\n1369 No False ETHIOPIC DIGIT ONE\n136a No False ETHIOPIC DIGIT TWO\n136b No False ETHIOPIC DIGIT THREE\n136c No False ETHIOPIC DIGIT FOUR\n136d No False ETHIOPIC DIGIT FIVE\n136e No False ETHIOPIC DIGIT SIX\n136f No False ETHIOPIC DIGIT SEVEN\n1370 No False ETHIOPIC DIGIT EIGHT\n1371 No False ETHIOPIC DIGIT NINE\n19da No False NEW TAI LUE THAM DIGIT ONE\n200c Cf False ZERO WIDTH NON-JOINER\n200d Cf False ZERO 
WIDTH JOINER\n2118 Sm False SCRIPT CAPITAL P\n212e So False ESTIMATED SYMBOL\n30fb Po False KATAKANA MIDDLE DOT\nff65 Po False HALFWIDTH KATAKANA MIDDLE DOT\n```\n\nMany of these exceptions are specified in the UAX#31 Section 5.1, [NFKC Modifications](https://www.unicode.org/reports/tr31/tr31-41.html#NFKC_Modifications).\n\n### Proposal\nI suggest adding two functions to the module, `unicodedata.isidstart(chr)` and `unicodedata.isidcontinue(chr)`. These return `True` if `chr` appears in the `DerivedCoreProperties.txt` file as `XID_Start` or `XID_Continue`, respectively.\n\n\n### CPython versions tested on:\n\n3.13\n\n### Operating systems tested on:\n\nWindows\n\n\u003c!-- gh-linked-prs --\u003e\n### Linked PRs\n* gh-140269\n\u003c!-- /gh-linked-prs --\u003e\n","author":{"url":"https://github.com/mrolle45","@type":"Person","name":"mrolle45"},"datePublished":"2025-01-21T06:21:56.000Z","interactionStatistic":{"@type":"InteractionCounter","interactionType":"https://schema.org/CommentAction","userInteractionCount":3},"url":"https://github.com/python/cpython/issues/129117"}
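
The category-based check described in the issue body can be sketched as follows, together with one way to look for the code points where it disagrees with the interpreter. This is a minimal sketch, not the issue author's script: the helper names are hypothetical, and it assumes that `str.isidentifier()` is an acceptable stand-in for the `XID_Start`/`XID_Continue` lookup. That assumption is imperfect, because Python's identifier rule also accepts the underscore as a start character, so the mismatches it reports need not match the lists above exactly.

```python
import unicodedata

# Category sets quoted from the issue text.
START_CATEGORIES = frozenset("Lu Ll Lt Lm Lo Nl Pc".split())
CONTINUE_CATEGORIES = frozenset("Lu Ll Lt Lm Lo Mn Mc Nd Nl Pc".split())


def category_says_start(ch: str) -> bool:
    """Approximate 'identifier start' using only the general category."""
    return unicodedata.category(ch) in START_CATEGORIES


def category_says_continue(ch: str) -> bool:
    """Approximate 'identifier continue' using only the general category."""
    return unicodedata.category(ch) in CONTINUE_CATEGORIES


def parser_accepts_start(ch: str) -> bool:
    # Assumption: a one-character string is an identifier exactly when
    # the interpreter accepts that character at the start of a name.
    return ch.isidentifier()


def parser_accepts_continue(ch: str) -> bool:
    # Assumption: prefixing a known-good start character ("a") isolates
    # the 'continue' test for the second character.
    return ("a" + ch).isidentifier()


def disagreements(stop: int = 0x110000):
    """Yield (code point, category, start_mismatch, continue_mismatch)
    for every code point below `stop` where the two approaches differ."""
    for cp in range(stop):
        ch = chr(cp)
        start_mismatch = category_says_start(ch) != parser_accepts_start(ch)
        cont_mismatch = category_says_continue(ch) != parser_accepts_continue(ch)
        if start_mismatch or cont_mismatch:
            yield cp, unicodedata.category(ch), start_mismatch, cont_mismatch
```

Iterating over `disagreements()` with the default `stop` scans the whole code space and may take a few seconds; passing a smaller `stop` limits the scan to a prefix of the range. The results depend on the Unicode version bundled with the running interpreter.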