Title: [3.8] closes bpo-37966: Fully implement the UAX #15 quick-check algorithm. (GH-15558) by miss-islington · Pull Request #15671 · python/cpython · GitHub
Description: The purpose of the unicodedata.is_normalized function is to answer the question str == unicodedata.normalize(form, str) more efficiently than writing just that, by using the "quick check" optimization described in the Unicode standard in UAX #15.

However, it turns out the code doesn't implement the full algorithm from the standard, and as a result we often miss the optimization and end up having to compute the whole normalized string after all.

Implement the standard's algorithm. This greatly speeds up unicodedata.is_normalized in many cases where our partial variant of quick-check had been returning MAYBE and the standard algorithm returns NO.

At a quick test on my desktop, the existing code takes about 4.4 ms/MB (so 4.4 ns per byte) when the partial quick-check returns MAYBE and it has to do the slow normalize-and-compare:

    $ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' -- 'unicodedata.is_normalized("NFD", s)'
    50 loops, best of 5: 4.39 msec per loop

With this patch, it gets the answer instantly (58 ns) on the same 1 MB string:

    $ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' -- 'unicodedata.is_normalized("NFD", s)'
    5000000 loops, best of 5: 58.2 nsec per loop

This restores a small optimization that the original version of this code had for the unicodedata.normalize use case. With this, that case is actually faster than in master!

    $ build.base/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' -- 'unicodedata.normalize("NFD", s)'
    500 loops, best of 5: 561 usec per loop

    $ build.dev/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' -- 'unicodedata.normalize("NFD", s)'
    500 loops, best of 5: 512 usec per loop

(cherry picked from commit 2f09413)

Co-authored-by: Greg Price <gnprice@gmail.com>

https://bugs.python.org/issue37966
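To make the quick-check idea concrete, here is a minimal pure-Python sketch of the UAX #15 loop, restricted to NFD. It is not the C code from this patch. The unicodedata module does not expose the NFD_Quick_Check property, so the hypothetical helper has_canonical_decomposition approximates it under the assumption that a character is NFD_QC=No exactly when it has a canonical decomposition or is a precomposed Hangul syllable. NFD has no MAYBE values, so this sketch only ever returns "YES" or "NO".

```python
import unicodedata

# Range of precomposed Hangul syllables, which decompose algorithmically.
HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3


def has_canonical_decomposition(ch):
    """Approximate NFD_QC=No (hypothetical helper, an assumption of this sketch)."""
    if HANGUL_FIRST <= ord(ch) <= HANGUL_LAST:
        return True
    mapping = unicodedata.decomposition(ch)
    # Compatibility mappings carry a "<tag>" prefix; canonical ones do not.
    return bool(mapping) and not mapping.startswith("<")


def nfd_quick_check(s):
    """Return "YES" or "NO" following the shape of the UAX #15 quick-check loop."""
    last_ccc = 0
    for ch in s:
        ccc = unicodedata.combining(ch)
        # Combining marks out of canonical order: definitely not normalized.
        if ccc != 0 and last_ccc > ccc:
            return "NO"
        # A character that NFD would decompose: definitely not normalized.
        if has_canonical_decomposition(ch):
            return "NO"
        last_ccc = ccc
    return "YES"


if __name__ == "__main__":
    for sample in ("\uf900" * 5, "\u0338" * 5, "e\u0301", "\u00e9"):
        slow = unicodedata.normalize("NFD", sample) == sample
        print(ascii(sample), nfd_quick_check(sample), slow)
```

For the benchmark string made of U+F900, the loop returns "NO" at the first character, which is why the patched is_normalized no longer needs to normalize the whole string.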
Opengraph URL: https://github.com/python/cpython/pull/15671