Title: GH-794: Avro file read and write support by martin-traverse · Pull Request #802 · apache/arrow-java · GitHub

Description: What's Changed

Add top-level reader / writer classes for Avro files. This is a draft for discussion. I haven't written tests, doc comments etc. yet, and will do all that once we agree that the shape of the implementation is correct.

I have built a very simple implementation which just uses Avro's public components with the existing producers / consumers, following the Avro container file spec. Internally Avro uses input / output streams and heap-allocated byte arrays, so I have based our reader / writer on those elements. I have recycled buffers / streams wherever possible, without breaking into Avro's internal structures.

For compression I am using Avro's own codec implementations. Instantiation via CodecFactory is restricted to Avro's own file handling package, but the codecs themselves are mostly public, with the exception of Snappy for some unknown reason. I'd be happy to raise a ticket and ask about that, or we could just copy the Snappy implementation into our own namespace (it is a simple wrapper on Xerial).

I did look at an alternative approach, using ArrowBuf for the batch buffers with Arrow's codec implementations, which we could add to. However, there were a couple of issues:

- Since Avro uses streams / byte arrays internally, pretty much every way of getting to ArrowBuf involved going to a byte array first and then copying. To break out of that we'd need to reimplement large parts of Avro's file package, including encoders / decoders, and shade some key classes in the Avro namespace.
- Arrow's codec API assumes that compressed data is always written with the uncompressed size stored at the start of the output, which makes the codecs unusable for other formats that don't do that, including Avro. We'd need to add a new API and implement the codecs again to handle resizing the output buffer.

Given these considerations, I eventually came to think the very simple approach I've drafted might be the best option. Even if there are performance benefits to be had by switching to ArrowBuf, Channels etc., we'd need to write a lot more code, which would need to stay in sync and be maintained. We can still add overloaded constructors to wrap channels for IO.

Final point, on non-blocking mode for the reader. The approach I have used is just to insist that when blocking = false the input stream must support mark / reset, and then peek at the beginning of each batch to determine its size. Also, non-blocking readers are direct, which disables Avro's internal buffering. I'm assuming that anyone using non-blocking mode will implement their own stream and buffering logic, in which case they can add mark / reset and won't want Avro randomly reading extra bytes for an internal buffer! At least that is what I'm planning to do.

I haven't tried to estimate header size, as there isn't really a way to do that without reading the whole header. We could have something like headerBytesNeeded() to incrementally return the number of bytes still needed. The alternative is just to provide a couple of MB and assume it's enough, which probably works, but on reflection I do think we should add this, just so the API for non-blocking is "complete".

Hope this all makes sense. Please let me know your thoughts when you get a chance and then I'll do the last bit of work to get this ready.

Closes #794.
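The mark / reset peek described for non-blocking mode can be illustrated with a minimal sketch. It assumes only the block layout from the Avro container file specification: a zig-zag varint record count, a zig-zag varint byte size of the serialized (possibly compressed) data, the data itself, then a 16-byte sync marker. The class name BlockPeek and its methods are hypothetical and are not taken from the PR, whose actual implementation may differ.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper: peeks at an Avro data block header without consuming it. */
final class BlockPeek {

    static final int SYNC_SIZE = 16;        // sync marker length per the Avro spec
    static final int MAX_HEADER_BYTES = 20; // two varint longs, at most 10 bytes each

    /** Decodes one zig-zag varint long; consumed[0] accumulates the bytes read. */
    static long readLong(InputStream in, int[] consumed) throws IOException {
        long raw = 0;
        int shift = 0;
        while (true) {
            int b = in.read();
            if (b < 0) {
                throw new EOFException("stream ended inside a varint");
            }
            consumed[0]++;
            raw |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // high bit clear: last byte of this varint
            }
            shift += 7;
            if (shift > 63) {
                throw new IOException("varint longer than 10 bytes");
            }
        }
        return (raw >>> 1) ^ -(raw & 1); // undo zig-zag encoding
    }

    /**
     * Returns the total byte length of the next block (header + data + sync marker).
     * The stream position is restored before returning, so the caller can wait
     * until that many bytes are available and then read the block in one go.
     */
    static long peekBlockBytes(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("non-blocking mode needs mark / reset");
        }
        in.mark(MAX_HEADER_BYTES);
        try {
            int[] consumed = {0};
            readLong(in, consumed);                 // record count, not needed here
            long dataSize = readLong(in, consumed); // size of the serialized data
            if (dataSize < 0) {
                throw new IOException("negative block size: " + dataSize);
            }
            return consumed[0] + dataSize + SYNC_SIZE;
        } finally {
            in.reset(); // leave the stream exactly where it was
        }
    }
}
```

A caller would invoke BlockPeek.peekBlockBytes(stream) before each batch and only start decoding once its own buffering logic holds at least that many bytes. If the header itself is incomplete, the sketch throws EOFException after resetting the stream, so the caller can retry once more data has arrived.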

Opengraph URL: https://github.com/apache/arrow-java/pull/802

Links:

martin-traverse https://patch-diff.githubusercontent.com/martin-traverse
apache:main https://patch-diff.githubusercontent.com/apache/arrow-java/tree/main
martin-traverse:feature/avro-container-format https://patch-diff.githubusercontent.com/martin-traverse/arrow-java/tree/feature/avro-container-format
Conversation 8 https://patch-diff.githubusercontent.com/apache/arrow-java/pull/802
Commits 7 https://patch-diff.githubusercontent.com/apache/arrow-java/pull/802/commits
Checks 26 https://patch-diff.githubusercontent.com/apache/arrow-java/pull/802/checks
Files changed https://patch-diff.githubusercontent.com/apache/arrow-java/pull/802/files
GH-794: Avro file read and write support https://patch-diff.githubusercontent.com/apache/arrow-java/pull/802/checks#top
RC on: pull_request https://patch-diff.githubusercontent.com/apache/arrow-java/actions/runs/19610016186
Source https://patch-diff.githubusercontent.com/apache/arrow-java/actions/runs/19610016186/job/56177950562?pr=802
JNI ubuntu-latest x86_64 https://patch-diff.githubusercontent.com/apache/arrow-java/actions/runs/19610016186/job/56177961615?pr=802
JNI ubuntu-24.04-arm aarch_64 https://patch-diff.githubusercontent.com/apache/arrow-java/actions/runs/19610016186/job/56177961611?pr=802
JNI macos-15-intel x86_64 https://patch-diff.githubusercontent.com/apache/arrow-java/actions/runs/19610016186/job/56177961602?pr=802
JNI macos-14 aarch_64 https://patch-diff.githubusercontent.com/apache/arrow-java/actions/runs/19610016186/job/56177961608?pr=802
JNI windows-2022 x86_64 https://patch-diff.githubusercontent.com/apache/arrow-java/actions/runs/19610016186/job/56177961614?pr=802
Binaries https://patch-diff.githubusercontent.com/apache/arrow-java/actions/runs/19610016186/job/56178371837?pr=802
Docs https://patch-diff.githubusercontent.com/apache/arrow-java/actions/runs/19610016186/job/56178371878?pr=802
Verify https://patch-diff.githubusercontent.com/apache/arrow-java/actions/runs/19610016186/job/56178371855?pr=802
Publish docs https://patch-diff.githubusercontent.com/apache/arrow-java/actions/runs/19610016186/job/56178371964?pr=802
Upload https://patch-diff.githubusercontent.com/apache/arrow-java/actions/runs/19610016186/job/56178371940?pr=802
Test on: pull_request https://patch-diff.githubusercontent.com/apache/arrow-java/actions/runs/19610016187
AMD64 Ubuntu JDK 11 Maven 3.9.9 https://patch-diff.githubusercontent.com/apache/arrow-java/actions/runs/19610016187/job/56177950817?pr=802
AMD64 Conda JNI JDK 11 Maven 3.9.9 https://patch-diff.githubusercontent.com/apache/arrow-java/actions/runs/19610016187/job/56177950811?pr=802
AMD64 Ubuntu JDK 17 Maven 3.9.9 https://patch-diff.githubusercontent.com/apache/arrow-java/actions/runs/19610016187/job/56177950816?pr=802
AMD64 Conda JNI JDK 17 Maven 3.9.9 https://patch-diff.githubusercontent.com/apache/arrow-java/actions/runs/19610016187/job/56177950815?pr=802
AMD64 Ubuntu JDK 21 Maven 3.9.9 https://patch-diff.githubusercontent.com/apache/arrow-java/actions/runs/19610016187/job/56177950826?pr=802
AMD64 Conda JNI JDK 21 Maven 3.9.9 https://patch-diff.githubusercontent.com/apache/arrow-java/actions/runs/19610016187/job/56177950850?pr=802
AMD64 Ubuntu JDK 23 Maven 3.9.9 https://patch-diff.githubusercontent.com/apache/arrow-java/actions/runs/19610016187/job/56177950825?pr=802
AMD64 Conda JNI JDK 23 Maven 3.9.9 https://patch-diff.githubusercontent.com/apache/arrow-java/actions/runs/19610016187/job/56177950841?pr=802
AMD64 macOS 13 Java JDK 11 https://patch-diff.githubusercontent.com/apache/arrow-java/actions/runs/19610016187/job/56177950788?pr=802
AArch64 macOS latest Java JDK 11 https://patch-diff.githubusercontent.com/apache/arrow-java/actions/runs/19610016187/job/56177950808?pr=802
AMD64 Windows Server 2022 Java JDK 11 https://patch-diff.githubusercontent.com/apache/arrow-java/actions/runs/19610016187/job/56177950805?pr=802
AMD64 integration https://patch-diff.githubusercontent.com/apache/arrow-java/actions/runs/19610016187/job/56177950791?pr=802
Dev on: pull_request https://patch-diff.githubusercontent.com/apache/arrow-java/actions/runs/19610016189
pre-commit https://patch-diff.githubusercontent.com/apache/arrow-java/actions/runs/19610016189/job/56177950977?pr=802
Dev PR on: pull_request_target https://patch-diff.githubusercontent.com/apache/arrow-java/actions/runs/19610016913
Ensure PR format https://patch-diff.githubusercontent.com/apache/arrow-java/actions/runs/19610016913/job/56154454799?pr=802
Dev PR on: pull_request_target https://patch-diff.githubusercontent.com/apache/arrow-java/actions/runs/19619778637
Ensure PR format https://patch-diff.githubusercontent.com/apache/arrow-java/actions/runs/19619778637/job/56177942486?pr=802