René's URL Explorer Experiment


Title: Faster decompression of gzip files · Issue #95534 · python/cpython · GitHub

Description: Pitch Decompressing gzip streams is an extremely common practice. Most web browsers support gzip decompression, as such most (virtually all) servers return gzip compressed data (when the gzip support is advertised via headers). Tar.gz fi...

Opengraph URL: https://github.com/python/cpython/issues/95534

Domain: github.com


Hey, it has json ld scripts:
{"@context":"https://schema.org","@type":"DiscussionForumPosting","headline":"Faster decompression of gzip files","articleBody":"# Pitch\n\nDecompressing gzip streams is an extremely common practice. Most web browsers support gzip decompression, as such most (virtually all) servers return gzip compressed data (when the gzip support is advertised via headers). Tar.gz files are an extremely common way to archive files. Zip files use the same DEFLATE compression internally. \n\nSpeeding this up by a non-trivial amount is therefore very advantageous.\n\n# Feature or enhancement\n\nThe current gzip reading pipeline can be improved quite a lot. This is the current way of doing things:\n\n- read `io.DEFAULT_BUFFER_SIZE` of data from a _PaddedFile object\n- feed it to a `zlib.decompressobj()` using the `decompress(raw_data, size)` function.\n- Internally decompress always starts with a 16KB buffer, regardless of the requested size. When the output data is 64KB big, it will need to resize 2 times.\n- The decompressed data is returned, anything not returned is saved in an unconsumed_tail object.\n- The unconsumed_tail is used to rewind the _PaddedFile object to the correct position.\n- The decompressed data length and crc32 are taken and these are used to update the GzipReader state.\n\nThis has some severe disadvantages when reading large blocks:\n- Gzip compresses between 50 and 90% in most cases. This means that the decompress function is going to return anywhere between 16 and 80KB maximum. When 128KB is requested from the read function this means the 128KB is not going to be filled, leading to unnecessary read requests.\n- In the above case say the typical return size is 37KB. That means due to the DEFAULT_BUFFER_SIZE in zlib being 16KB there will be two calls to resize the memory of the return object. (16-\u003e32-\u003e64). 
EDIT: actually it is three, since the end product is also resized to fit the contents (16-\u003e32-\u003e64-\u003e37).\n\nThis also has some severe disadvantages when reading small blocks.\n- When reading individual lines from a file (quite common) this actually queries an io.BufferedReader instance which reads from the _GzipReader in io.DEFAULT_BUFFER_SIZE chunks.\n- This means only 8KB is requested. But still, 8KB is read from the _PaddedFile object. With typical compression rates this means anywhere between 4KB and 7KB of unconsumed tail will be returned. This creates a new object, which means allocating memory again. The same data is reread in the next iteration. This means in case of a 70% compression rate, the same data is memcpy'd around 2 to 3 times.\n- Continuous rewinding of the _PaddedFile object\n\n## How to improve this:\n\n- Use the structure of the _bz2.BZ2Decompressor. Submitted data is copied to an internal buffer, data from this internal buffer is used to return decompress calls. This has the advantage of not endlessly allocating and destroying unconsumed_tail objects.\n- Read 128KB at once from the _PaddedFile object. And only read this data when the decompressor's `needs_input` attribute is True. This prevents querying the _PaddedFile object too much.\n- When decompress is called and a maxsize is given (say 128KB) allocate maxsize immediately, instead of allocating only 16KB and resizing later. \n\nThis prevents a lot of calls to the Python memory allocator. \n\nThis restructuring has already been implemented in [python-isal](https://github.com/pycompression/python-isal). That project is a modification of zlibmodule.c to use the [ISA-L](https://github.com/intel/isa-l) optimizations. While this did improve speed, I also looked at other ways to improve the performance. 
By restructuring the gzip module and the zlib code the Python overhead was significantly reduced.\n\nRelevant code:\n- [_GzipReader.read function](https://github.com/pycompression/python-isal/blob/v1.0.0/src/isal/igzip.py#L253)\n- [Modified BZ2Decompressor code](https://github.com/pycompression/python-isal/blob/v1.0.0/src/isal/igzip_libmodule.c)\n- [Code that sets the buffer size to the maximum size](https://github.com/pycompression/python-isal/blob/v1.0.0/src/isal/igzip_libmodule.c#L67)\n\nMost of this code can be seamlessly copied back into CPython, which I will do when I have the time. This can best be done after the 3.11 release I think.\n\n# Previous discussion\n\nNA. This is a performance enhancement, so not necessarily a new feature, but also not a bug.\n\n\u003c!-- gh-linked-prs --\u003e\n### Linked PRs\n* gh-97664\n* gh-137923\n\u003c!-- /gh-linked-prs --\u003e\n","author":{"url":"https://github.com/rhpvorderman","@type":"Person","name":"rhpvorderman"},"datePublished":"2022-08-01T13:54:11.000Z","interactionStatistic":{"@type":"InteractionCounter","interactionType":"https://schema.org/CommentAction","userInteractionCount":5},"url":"https://github.com/python/cpython/issues/95534"}
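
The reading loop proposed in the issue body (feed input only when the decompressor needs it, and cap each call's output at a fixed maximum) can be sketched in pure Python as follows. This is only an illustration of the interface, not the code from python-isal or CPython: `BufferedInflater` and `read_all` are hypothetical names, and because the leftover input is still carried through `unconsumed_tail`, the sketch does not remove the allocations the issue complains about. That saving would come from keeping the buffer inside the C extension.

```python
import gzip
import io
import zlib


class BufferedInflater:
    """Hypothetical wrapper giving zlib a BZ2Decompressor-like interface."""

    def __init__(self):
        # wbits=31 tells zlib to expect a gzip header and trailer.
        self._obj = zlib.decompressobj(wbits=31)
        self._pending = b""  # compressed input not yet consumed

    @property
    def eof(self):
        return self._obj.eof

    @property
    def needs_input(self):
        # New input is only wanted once the pending input is used up.
        return not self._pending and not self._obj.eof

    def decompress(self, data=b"", max_length=128 * 1024):
        if data:
            self._pending += data
        out = self._obj.decompress(self._pending, max_length)
        # Whatever zlib did not consume is kept for the next call.
        self._pending = self._obj.unconsumed_tail
        return out


def read_all(fileobj, chunk=128 * 1024):
    """Decompress a single-member gzip stream from fileobj."""
    inflater = BufferedInflater()
    parts = []
    while not inflater.eof:
        if inflater.needs_input:
            raw = fileobj.read(chunk)
            if not raw:
                raise EOFError("compressed stream ended before the end marker")
            parts.append(inflater.decompress(raw, chunk))
        else:
            # Drain pending input without touching the file again.
            parts.append(inflater.decompress(b"", chunk))
    return b"".join(parts)


payload = b"abc" * 100_000
assert read_all(io.BytesIO(gzip.compress(payload))) == payload
```

The file object is read only when `needs_input` is true, so there is no rewinding of the underlying file, which is the behaviour the issue asks for.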


Links:

Faster decompression of gzip files https://github.com/python/cpython/issues/95534#top
3.12 (only security fixes) https://github.com/python/cpython/issues?q=state%3Aopen%20label%3A%223.12%22
performance (Performance or resource usage) https://github.com/python/cpython/issues?q=state%3Aopen%20label%3A%22performance%22
stdlib (Standard Library Python modules in the Lib/ directory) https://github.com/python/cpython/issues?q=state%3Aopen%20label%3A%22stdlib%22
type-feature (A feature request or enhancement) https://github.com/python/cpython/issues?q=state%3Aopen%20label%3A%22type-feature%22
rhpvorderman https://github.com/rhpvorderman
on Aug 1, 2022 https://github.com/python/cpython/issues/95534#issue-1324467622
python-isal https://github.com/pycompression/python-isal
ISA-L https://github.com/intel/isa-l
_GzipReader.read function https://github.com/pycompression/python-isal/blob/v1.0.0/src/isal/igzip.py#L253
Modified BZ2Decompressor code https://github.com/pycompression/python-isal/blob/v1.0.0/src/isal/igzip_libmodule.c
Code that sets the buffer size to the maximum size https://github.com/pycompression/python-isal/blob/v1.0.0/src/isal/igzip_libmodule.c#L67
gh-95534: Improve gzip reading speed by 10% #97664 https://github.com/python/cpython/pull/97664
gh-95534: Convert ZlibDecompressor.__new__ to AC #137923 https://github.com/python/cpython/pull/137923
