Title: GH-102613: Fast recursive globbing in `pathlib.Path.glob()` by barneygale · Pull Request #104512 · python/cpython · GitHub
Description: This PR introduces a 'walk-and-match' strategy for handling glob patterns that include a non-terminal `**` wildcard, such as `**/*.py`. For this example, the previous implementation recursively walked directories using `os.scandir()` when it expanded the `**` component, and then scanned those same directories again when it expanded the `*.py` component. This is wasteful.

In the new implementation, any components following a `**` wildcard are used to build a `re.Pattern` object, which is used to filter the results of the recursive walk. A pattern like `**/*.py` uses half as many `os.scandir()` calls as before; a pattern like `**/*/*.py` uses a third, and so on.

This new algorithm does not apply if either:

- The `follow_symlinks` argument is set to `None` (its default), or
- The pattern contains `..` components.

In these cases we fall back to the old implementation.

This PR also replaces selector classes with selector functions. These generators yield results directly rather than calling through to their successors. A new internal `Path._glob()` method chains these generators together, which simplifies the lazy algorithm and slightly improves performance. It should also be easier to understand and maintain.

Performance for the original #102613 repro case, with 400 nested `a/` directories, and matching treatment of symlinks and hidden files:

    $ ../python -m timeit -s 'import glob' 'print(glob.glob("**/*", recursive=True, include_hidden=True))'
    5 loops, best of 5: 66.2 msec per loop
    $ ../python -m timeit -s 'from pathlib import Path' 'print(list(Path(".").rglob("**/*", follow_symlinks=True)))'
    10 loops, best of 5: 22.7 msec per loop  # before this PR
    10 loops, best of 5: 16.5 msec per loop  # after this PR

These results were measured on an SSD. The improvement will be greater for slow storage (e.g. network-mounted volumes).

Issue: gh-102613
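
The walk-and-match idea can be illustrated with a minimal sketch. This is not the code from the PR: `compile_tail`, `walk_and_match`, and `demo` are hypothetical names, only `*` and `?` wildcards are handled, and symlinks are assumed not to be followed. It walks the tree once, calling `os.scandir()` once per directory, and filters every visited path with a single `re.Pattern` built from the components that follow `**`.

```python
import os
import re
import tempfile
from pathlib import Path


def compile_tail(tail):
    """Compile the components after ``**`` into one regex (hypothetical helper).

    Only ``*`` and ``?`` are supported; neither may cross a ``/`` separator.
    """
    parts = []
    for component in tail.split("/"):
        escaped = re.escape(component)
        parts.append(escaped.replace(r"\*", "[^/]*").replace(r"\?", "[^/]"))
    # The leading group stands in for ``**``: zero or more directory levels.
    return re.compile("(?:[^/]+/)*" + "/".join(parts))


def walk_and_match(root, tail):
    """Walk ``root`` once and return (sorted matches, number of scandir calls)."""
    pattern = compile_tail(tail)
    matches, scandir_calls = [], 0
    stack = [(str(root), "")]
    while stack:
        dirpath, prefix = stack.pop()
        scandir_calls += 1  # exactly one scan per directory
        with os.scandir(dirpath) as entries:
            for entry in entries:
                rel = prefix + entry.name
                if pattern.fullmatch(rel):
                    matches.append(rel)
                # Assumption: symlinked directories are not descended into.
                if entry.is_dir(follow_symlinks=False):
                    stack.append((entry.path, rel + "/"))
    return sorted(matches), scandir_calls


def demo():
    """Build a small temporary tree and match two tail patterns against it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "pkg" / "sub").mkdir(parents=True)
        for name in ("a.py", "b.txt", "pkg/c.py", "pkg/sub/d.py"):
            (root / name).touch()
        return walk_and_match(root, "*.py"), walk_and_match(root, "*/*.py")
```

In this sketch the number of `os.scandir()` calls equals the number of directories in the tree, regardless of how many components follow `**`. That is the property the description relies on when it says longer tails save proportionally more scans.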
Opengraph URL: https://github.com/python/cpython/pull/104512