René's URL Explorer Experiment


Title: Speed up open().read() pattern by reducing the number of system calls · Issue #120754 · python/cpython · GitHub

Description: Feature or enhancement Proposal: I came across some seemingly redundant fstat() and lseek() calls when working on a tool that scanned a directory of lots of small YAML files and loaded their contents as config. In tracing I found most ex...

Opengraph URL: https://github.com/python/cpython/issues/120754

X: @github

Domain: github.com


Hey, it has json ld scripts:
{"@context":"https://schema.org","@type":"DiscussionForumPosting","headline":"Speed up open().read() pattern by reducing the number of system calls","articleBody":"# Feature or enhancement\r\n\r\n### Proposal:\r\n\r\nI came across some seemingly redundant `fstat()` and `lseek()` calls when working on a tool that scanned a directory of lots of small YAML files and loaded their contents as config. In tracing I found most execution time wasn't in the Python interpreter but in system calls (on top of NFS in that case, which made some I/O calls particularly slow).\r\n\r\nI've been experimenting with a program that reads all `.rst` files in the Python `Doc` directory to try and remove some of those redundant system calls.\r\n\r\n### Test Program\r\n```python\r\nfrom pathlib import Path\r\n\r\nnlines = []\r\nfor filename in Path(\"cpython/Doc\").glob(\"**/*.rst\"):\r\n    nlines.append(len(filename.read_text()))\r\n```\r\n\r\nIn my experimentation, some tweaks to fileio can remove over 10% of the system calls the test program makes when scanning the whole `Doc` folder for `.rst` files on both macOS and Linux (I don't have a Windows machine to measure on).\r\n\r\n### Current State (9 system calls)\r\nCurrently on my Linux machine, reading a whole `.rst` file with the above code produces this series of system calls:\r\n```python\r\nopenat(AT_FDCWD, \"cpython/Doc/howto/clinic.rst\", O_RDONLY|O_CLOEXEC) = 3\r\nfstat(3, {st_mode=S_IFREG|0644, st_size=343, ...}) = 0\r\nioctl(3, TCGETS, 0x7ffe52525930)        = -1 ENOTTY (Inappropriate ioctl for device)\r\nlseek(3, 0, SEEK_CUR)                   = 0\r\nlseek(3, 0, SEEK_CUR)                   = 0\r\nfstat(3, {st_mode=S_IFREG|0644, st_size=343, ...}) = 0\r\nread(3, \":orphan:\\n\\n.. This page is retain\"..., 344) = 343\r\nread(3, \"\", 1)                          = 0\r\nclose(3)                                = 0\r\n```\r\n\r\n### Target State (~~7~~ 5 system calls)\r\nIt would be nice to get it down to (for small files, large file caveat in PR / get an additional seek):\r\n```python\r\n# Open the file\r\nopenat(AT_FDCWD, \"cpython/Doc/howto/clinic.rst\", O_RDONLY|O_CLOEXEC) = 3\r\n# Check if the open fd is a file or directory and early-exit on directories with a specialized error.\r\n# With my changes we also stash the size information from this for later use as an estimate.\r\nfstat(3, {st_mode=S_IFREG|0644, st_size=343, ...}) = 0\r\n# Read the data directly into a PyBytes\r\nread(3, \":orphan:\\n\\n.. This page is retain\"..., 344) = 343\r\n# Read the EOF marker\r\nread(3, \"\", 1)                          = 0\r\n# Close the file\r\nclose(3)                                = 0\r\n```\r\n\r\nIn a number of cases (e.g. importing modules) there is often an `fstat` followed immediately by an open / read of the file (which typically does another `fstat`), but that is an extension point and I want to keep that out of scope for now.\r\n\r\n### Questions rattling around in my head around this\r\nSome of these are likely better for Discourse / longer form discussion, happy to start threads there as appropriate.\r\n\r\n1. Is there a way to add a test for certain system calls happening with certain arguments and/or a certain amount of time? (I don't currently see a great way to write a test to make sure the number of system calls doesn't change unintentionally)\r\n2. Running a simple python script (`python simple.py` that contains `print(\"Hello, World!\")`) currently reads `simple.py` in full at least 4 times and does over 5 seeks. I have been pulling on that thread but it interacts with importlib as well as how the python compiler currently works, which I'm still trying to get my head around. 
Would removing more of those overheads be something of interest / should I keep working to get my head around it? \r\n3. We could potentially save more\r\n    1. with readv (one readv call, two iovecs). I avoided this for now because _Py_read does quite a bit.\r\n    2. dispatching multiple calls in parallel using asynchronous I/O APIs to meet the python API guarantees; I am experimenting with this (backed by relatively new Linux I/O APIs but possibly for kqueue and epoll), but it's _very_ experimental and it feels to me like it \"has to be an interpreter primitive\" to work effectively, which is complex to plumb through. Very early days though, many thoughts, not much prototype code.\r\n4. The `_blksize` member of fileio was added in bpo-21679. As far as I can tell it is not used much, either through its reflection `_blksize` in Python or elsewhere in the code. The only usage I can find is https://github.com/python/cpython/blob/main/Modules/_io/_iomodule.c#L365-L374, where we could just query for it when needed in that case to save some storage on all `fileio` objects. 
The behavior of using the stat returned st_blksize is part of the docs, so it doesn't feel like we can fully remove it.\r\n\r\n### Has this already been discussed elsewhere?\r\n\r\nThis is a minor feature, which does not need previous discussion elsewhere\r\n\r\n### Links to previous discussion of this feature:\r\n\r\n_No response_\r\n\r\n\u003c!-- gh-linked-prs --\u003e\r\n### Linked PRs\r\n* gh-120755\r\n* gh-121143\n* gh-121357\n* gh-121593\n* gh-121633\n* gh-122101\n* gh-122103\n* gh-122111\n* gh-122215\n* gh-122216\n* gh-123303\n* gh-123412\n* gh-123413\n* gh-124225\n* gh-125166\n* gh-126466\n\u003c!-- /gh-linked-prs --\u003e\r\n","author":{"url":"https://github.com/cmaloney","@type":"Person","name":"cmaloney"},"datePublished":"2024-06-19T19:36:57.000Z","interactionStatistic":{"@type":"InteractionCounter","interactionType":"https://schema.org/CommentAction","userInteractionCount":20},"url":"https://github.com/python/cpython/issues/120754"}
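
The five-call target sequence described in the issue body (open, fstat, read, read for EOF, close) can be approximated from Python with the `os` module. This is only a rough sketch under stated assumptions: the helper name `read_all` and the `FALLBACK_CHUNK` size are hypothetical, and it is not the C code in fileio that the linked PRs change. It illustrates the idea of stashing `st_size` from the single `fstat` and requesting one byte more than that estimate, so the first `read` usually returns the whole file and the second only confirms EOF.

```python
import os
import stat

# Hypothetical chunk size, used only if the file grew past the fstat estimate.
FALLBACK_CHUNK = 64 * 1024


def read_all(path):
    """Read a whole file with one open, one fstat, and (usually) two reads."""
    flags = os.O_RDONLY | getattr(os, "O_CLOEXEC", 0) | getattr(os, "O_BINARY", 0)
    fd = os.open(path, flags)
    try:
        # One fstat serves both the directory check and the size estimate.
        st = os.fstat(fd)
        if stat.S_ISDIR(st.st_mode):
            raise IsADirectoryError(path)

        # Ask for size + 1 so a short read signals that EOF is near.
        remaining = st.st_size + 1
        chunks = []
        while True:
            size = remaining if remaining > 0 else FALLBACK_CHUNK
            chunk = os.read(fd, size)
            if not chunk:  # empty read marks EOF
                break
            chunks.append(chunk)
            remaining -= len(chunk)
        return b"".join(chunks)
    finally:
        os.close(fd)
```

For a 343-byte file this issues a 344-byte read followed by a 1-byte read that returns empty, matching the shape of the target trace. Running it under a tracer such as `strace` would be one way to compare against `Path.read_bytes()`, though the exact calls observed will depend on the platform and Python version.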


Links:

Speed up open().read() pattern by reducing the number of system calls https://github.com/python/cpython/issues/120754#top
#120755 https://github.com/python/cpython/pull/120755
performance: Performance or resource usage https://github.com/python/cpython/issues?q=state%3Aopen%20label%3A%22performance%22
type-feature: A feature request or enhancement https://github.com/python/cpython/issues?q=state%3Aopen%20label%3A%22type-feature%22
cmaloney https://github.com/cmaloney
on Jun 19, 2024 https://github.com/python/cpython/issues/120754#issue-2363021533
bpo-21679 https://bugs.python.org/issue?@action=redirect&bpo=21679
https://github.com/python/cpython/blob/main/Modules/_io/_iomodule.c#L365-L374
gh-120754: Reduce system calls in full-file readall case #120755 https://github.com/python/cpython/pull/120755
GH-120754: Add a strace helper and test set of syscalls for open().read() #121143 https://github.com/python/cpython/pull/121143
gh-120754: Update estimated_size in C truncate #121357 https://github.com/python/cpython/pull/121357
GH-120754: Remove isatty call during regular open #121593 https://github.com/python/cpython/pull/121593
GH-120754: Make PY_READ_MAX smaller than max byteobject size #121633 https://github.com/python/cpython/pull/121633
gh-113977, gh-120754: Remove unbounded reads from zipfile #122101 https://github.com/python/cpython/pull/122101
GH-120754: Add more tests around seek + readall #122103 https://github.com/python/cpython/pull/122103
GH-120754: Disable buffering in Path.read_bytes #122111 https://github.com/python/cpython/pull/122111
[3.13] GH-120754: Add more tests around seek + readall (GH-122103) #122215 https://github.com/python/cpython/pull/122215
[3.12] GH-120754: Add more tests around seek + readall (GH-122103) #122216 https://github.com/python/cpython/pull/122216
Revert "GH-120754: Add a strace helper and test set of syscalls for o… #123303 https://github.com/python/cpython/pull/123303
gh-120754: Refactor I/O modules to stash whole stat result rather than individual members #123412 https://github.com/python/cpython/pull/123412
gh-120754: Add a strace helper and test set of syscalls for open().read(), Take 2 #123413 https://github.com/python/cpython/pull/123413
gh-120754: Fix memory leak in FileIO.__init__() #124225 https://github.com/python/cpython/pull/124225
gh-120754: Ensure _stat_atopen is cleared on fd change #125166 https://github.com/python/cpython/pull/125166
gh-120754: Add to io open() and .read() optimization to what's new #126466 https://github.com/python/cpython/pull/126466
