René's URL Explorer Experiment

Title: Avro adapter - Read and write Avro container files · Issue #794 · apache/arrow-java · GitHub

Open Graph Title: Avro adapter - Read and write Avro container files · Issue #794 · apache/arrow-java

X Title: Avro adapter - Read and write Avro container files · Issue #794 · apache/arrow-java

Description: Describe the enhancement requested Part 4 in the Avro series, following on from #731. This will allow reading and writing whole files in the Avro container format as a series of batches. Each batch will correspond to one Avro file block ...

Open Graph Description: Describe the enhancement requested Part 4 in the Avro series, following on from #731. This will allow reading and writing whole files in the Avro container format as a series of batches. Each batch...

X Description: Describe the enhancement requested Part 4 in the Avro series, following on from #731. This will allow reading and writing whole files in the Avro container format as a series of batches. Each batch...

Opengraph URL: https://github.com/apache/arrow-java/issues/794

X: @github

direct link

Domain: patch-diff.githubusercontent.com

Hey, it has json ld scripts:

{"@context":"https://schema.org","@type":"DiscussionForumPosting","headline":"Avro adapter - Read and write Avro container files","articleBody":"### Describe the enhancement requested\n\nPart 4 in the Avro series, following on from #731. This will allow reading and writing whole files in the Avro container format as a series of batches. Each batch will correspond to one Avro file block and fill a single VSR. The VSR can be recycled between batches. Input and output can be to Avro encoder / decoder (set up externally) or to Java's native byte channels (which are set up with default binary encoder / decoder). To cater for async scenarios, the reader API should know how many bytes are required for a block before attempting to read it.\n\nI'd like to propose the following API - hopefully this is going in the right direction. I've taken some inspiration from ArrowFilleReader / Writer and Json Reader / Writer, but it's not identical (and they're not identical to each other). If there is a desire to line up on specific naming / conventions then certainly happy to do that, in which case I'll need a steer on exactly how it should be. Otherwise if anyone has radically different ideas of what it should look like, please do share!\n\n    class AvroFileWriter {\n\n        // Writer owns a channel / encoder and will close them\n        // VSR and optional dictionaries are not owned and will not be closed\n        // VSR can be recycled or supplied as a stream\n\n        // Avro encoder configured externally\n        public AvroFileWriter(\n            Encoder encoder,\n            VectorSchemaRoot firstBatch, \n            DictionaryProvider dictionaries)\n\n        // Sets up a defaulr binary encoder for the channel\n        public AvroFileWriter(\n            WritableByteChannel channel,\n            VectorSchemaRoot firstBatch, \n            DictionaryProvider dictionaries)\n\n        // Write the Avro header (throws if already written)\n        void writeHeader()\n\n        // Write the contents of the VSR as an Avro data block\n        // Writes header if not yet written\n        // Expects new data to be in the batch (i.e. VSR can be recycled)\n        void writeBatch()\n\n        // Reset vectors in all the producders\n        // Supports a stream of VSRs if source VSR is not recycled\n        void resetBatch(VectorSchemaRoot batch)\n\n        // Closes encoder and / or channel\n        // Does not close VSR or dictionary vectors\n        void close()\n\n    }\n  \nNow writing data looks like this:\n\n    void writeAvro(MyApp app) {\n\n        var root = app.prepareVsr();\n        var dictionaries = app.prepareDictionaries()\n\n        try (var writer = new AvroFileWriter(app.openChannel(), root, dictionaries)) {\n\n            writer.writeHeader();\n\n            // Assume recycling, loadBatch() puts fresh data into root\n            while (app.loadBatch()) {\n                writer.writeBatch()\n            }\n        }\n    }\n\nAnd then for the reader:\n\n    class AvroFileReader implements DictionaryProvider {\n\n        // Writer owns a channel / decoder and will close them\n        // Schema / VSR / dictionaries are created when header is read\n        // VSR / dictionaries are cleaned up on close\n        // Dictionaries accessible through DictionaryProvider iface\n\n        // Avro decoder configured externally\n        public AvroFileWriter(\n            Decoder decoder,\n            BufferAllocator allocator)\n\n        // Sets up a defaulr binary deocder for the channel\n        // Avro read sequentially so seekable channel not needed\n        public AvroFileWriter(\n            ReadableByteChannel channel,\n            BufferAllocator allocator)\n\n        // Read the Avro header and set up schema / VSR / dictionaries\n        void readHeader()\n\n        // Schema and VSR available after readHeader() \n        Schema getSchema()\n        VectorSchemaRoot getVectorSchemaRoot()\n\n        // Read the next Avro block and load it into the VSR\n        // Return true if successful, false if EOS\n        // Also false in non-blocking mode if need more data\n        boolean readBatch()\n\n        // Check for position and size of the next Avro data block\n        // Provides a mechanism for non-blocking / reactive styles\n        boolean hasNextBatch();\n        long nextBatchPosition();\n        long nextBatchSize();\n\n        // Closes encoder and / or channel\n        // Also closes VSR and dictionary vectors\n        void close()\n\n    }\n\nSo reading looks like this:\n\n    // Blocking style\n    void readAvro(MyApp app) {\n\n        try (var reader = new AvroFileReader(app.openChannel(), app.allocator()) {\n\n            reader.readHeader();\n            \n            app.setSchema(reader.getSchema());\n            app.setVsr(reader.getVectorSchemaRoot());\n            app.setDictionaries(reader);\n\n            while (reader.readBatch())) {\n                app.saveBatch();\n            }\n        }\n    }\n\n    // Non-blocking stage to process one batch\n    CompletionStage\u003cBoolean\u003e readAvroAsync(AvroFileReader reader) {\n\n        if (reader.hasNextBatch()) {\n\n            var start = reader.nextBatchStart();\n            var end = reader.nextBatchEnd();\n\n            return app.ensureBytesAvailable(start, end)\n                .thenApply(x -\u003e {\n\n                    if (reader.readBatch()) {\n                        app.saveBatch();\n                    }\n\n                    return reader;\n                })\n                .thenCompose(this::readAsync);\n        }\n        else {\n            return CompletableFuture.completedFuture(true);\n        }\n    }\n\nThe non-blocking read is quite important for me as I have a web service that receives bytes in a stream. There is a slight gotcha because we need the first 8 bytes of the next batch before we know its size, but we can implement hasNextBatch() without them and the probably expose the batch padding size as a constant. \n\nCompression is probably worth thinking about now - each block is compressed individually so the implementation needs to treat the contents of each block as a separate chunk, that can be fed through a codec. My guess is this is fairly straightforward for codecs that are already available so we might as well include it rather than reworking later.\n\nIf this looks broadly right I'll make a start on top of #779 ","author":{"url":"https://github.com/martin-traverse","@type":"Person","name":"martin-traverse"},"datePublished":"2025-07-07T20:30:03.000Z","interactionStatistic":{"@type":"InteractionCounter","interactionType":"https://schema.org/CommentAction","userInteractionCount":5},"url":"https://github.com/794/arrow-java/issues/794"}

route-pattern	/_view_fragments/issues/show/:user_id/:repository/:id/issue_layout(.:format)
route-controller	voltron_issues_fragments
route-action	issue_layout
fetch-nonce	v2:87ba72d2-10ff-cd57-3a90-e43814b1714a
current-catalog-service-hash	81bb79d38c15960b92d99bca9288a9108c7a47b18f2423d0f6438c5b7bcd2114
request-id	9CEE:11BDB2:14C2DF6:1A4385B:6991A18A
html-safe-nonce	2cdfcf008b86d4064c600eace203d1ea085a0f8aac47a92bac8ec85bab599c5f
visitor-payload	eyJyZWZlcnJlciI6IiIsInJlcXVlc3RfaWQiOiI5Q0VFOjExQkRCMjoxNEMyREY2OjFBNDM4NUI6Njk5MUExOEEiLCJ2aXNpdG9yX2lkIjoiNjAxMDA4MTMwMzk1NjkyMjc2MiIsInJlZ2lvbl9lZGdlIjoiaWFkIiwicmVnaW9uX3JlbmRlciI6ImlhZCJ9
visitor-hmac	bc6f67a108c0cbb2747276ab03e79de63787c6d88964eb03b7856414993cb635
hovercard-subject-tag	issue:3210178394
github-keyboard-shortcuts	repository,issues,copilot
google-site-verification	Apib7-x98H0j5cPqHWwSMm6dNU4GmODRoqxLiDzdx9I
octolytics-url	https://collector.github.com/github/collect
analytics-location	///voltron/issues_fragments/issue_layout
fb:app_id	1401488693436528
apple-itunes-app	app-id=1477376905, app-argument=https://github.com/_view_fragments/issues/show/apache/arrow-java/794/issue_layout
twitter:image	https://opengraph.githubassets.com/d77ec97872d903f43d90e21fb1b9b2227b14c6fe56f18931c964e3ed75a21fb3/apache/arrow-java/issues/794
twitter:card	summary_large_image
og:image	https://opengraph.githubassets.com/d77ec97872d903f43d90e21fb1b9b2227b14c6fe56f18931c964e3ed75a21fb3/apache/arrow-java/issues/794
og:image:alt	Describe the enhancement requested Part 4 in the Avro series, following on from #731. This will allow reading and writing whole files in the Avro container format as a series of batches. Each batch...
og:image:width	1200
og:image:height	600
og:site_name	GitHub
og:type	object
og:author:username	martin-traverse
hostname	github.com
expected-hostname	github.com
None	42c603b9d642c4a9065a51770f75e5e27132fef0e858607f5c9cb7e422831a7b
turbo-cache-control	no-preview
go-import	github.com/apache/arrow-java git https://github.com/apache/arrow-java.git
octolytics-dimension-user_id	47359
octolytics-dimension-user_login	apache
octolytics-dimension-repository_id	893682219
octolytics-dimension-repository_nwo	apache/arrow-java
octolytics-dimension-repository_public	true
octolytics-dimension-repository_is_fork	false
octolytics-dimension-repository_network_root_id	893682219
octolytics-dimension-repository_network_root_nwo	apache/arrow-java
turbo-body-classes	logged-out env-production page-responsive
disable-turbo	false
browser-stats-url	https://api.github.com/_private/browser/stats
browser-errors-url	https://api.github.com/_private/browser/errors
release	848bc6032dcc93a9a7301dcc3f379a72ba13b96e
ui-target	full
theme-color	#1e2327
color-scheme	light dark

Links:

Skip to content	https://patch-diff.githubusercontent.com/apache/arrow-java/issues/794#start-of-content
	https://patch-diff.githubusercontent.com/
Sign in	https://patch-diff.githubusercontent.com/login?return_to=https%3A%2F%2Fgithub.com%2Fapache%2Farrow-java%2Fissues%2F794
GitHub CopilotWrite better code with AI	https://github.com/features/copilot
GitHub SparkBuild and deploy intelligent apps	https://github.com/features/spark
GitHub ModelsManage and compare prompts	https://github.com/features/models
MCP RegistryNewIntegrate external tools	https://github.com/mcp
ActionsAutomate any workflow	https://github.com/features/actions
CodespacesInstant dev environments	https://github.com/features/codespaces
IssuesPlan and track work	https://github.com/features/issues
Code ReviewManage code changes	https://github.com/features/code-review
GitHub Advanced SecurityFind and fix vulnerabilities	https://github.com/security/advanced-security
Code securitySecure your code as you build	https://github.com/security/advanced-security/code-security
Secret protectionStop leaks before they start	https://github.com/security/advanced-security/secret-protection
Why GitHub	https://github.com/why-github
Documentation	https://docs.github.com
Blog	https://github.blog
Changelog	https://github.blog/changelog
Marketplace	https://github.com/marketplace
View all features	https://github.com/features
Enterprises	https://github.com/enterprise
Small and medium teams	https://github.com/team
Startups	https://github.com/enterprise/startups
Nonprofits	https://github.com/solutions/industry/nonprofits
App Modernization	https://github.com/solutions/use-case/app-modernization
DevSecOps	https://github.com/solutions/use-case/devsecops
DevOps	https://github.com/solutions/use-case/devops
CI/CD	https://github.com/solutions/use-case/ci-cd
View all use cases	https://github.com/solutions/use-case
Healthcare	https://github.com/solutions/industry/healthcare
Financial services	https://github.com/solutions/industry/financial-services
Manufacturing	https://github.com/solutions/industry/manufacturing
Government	https://github.com/solutions/industry/government
View all industries	https://github.com/solutions/industry
View all solutions	https://github.com/solutions
AI	https://github.com/resources/articles?topic=ai
Software Development	https://github.com/resources/articles?topic=software-development
DevOps	https://github.com/resources/articles?topic=devops
Security	https://github.com/resources/articles?topic=security
View all topics	https://github.com/resources/articles
Customer stories	https://github.com/customer-stories
Events & webinars	https://github.com/resources/events
Ebooks & reports	https://github.com/resources/whitepapers
Business insights	https://github.com/solutions/executive-insights
GitHub Skills	https://skills.github.com
Documentation	https://docs.github.com
Customer support	https://support.github.com
Community forum	https://github.com/orgs/community/discussions
Trust center	https://github.com/trust-center
Partners	https://github.com/partners
GitHub SponsorsFund open source developers	https://github.com/sponsors
Security Lab	https://securitylab.github.com
Maintainer Community	https://maintainers.github.com
Accelerator	https://github.com/accelerator
Archive Program	https://archiveprogram.github.com
Topics	https://github.com/topics
Trending	https://github.com/trending
Collections	https://github.com/collections
Enterprise platformAI-powered developer platform	https://github.com/enterprise
GitHub Advanced SecurityEnterprise-grade security features	https://github.com/security/advanced-security
Copilot for BusinessEnterprise-grade AI features	https://github.com/features/copilot/copilot-business
Premium SupportEnterprise-grade 24/7 support	https://github.com/premium-support
Pricing	https://github.com/pricing
Search syntax tips	https://docs.github.com/search-github/github-code-search/understanding-github-code-search-syntax
documentation	https://docs.github.com/search-github/github-code-search/understanding-github-code-search-syntax
Sign in	https://patch-diff.githubusercontent.com/login?return_to=https%3A%2F%2Fgithub.com%2Fapache%2Farrow-java%2Fissues%2F794
Sign up	https://patch-diff.githubusercontent.com/signup?ref_cta=Sign+up&ref_loc=header+logged+out&ref_page=%2F%3Cuser-name%3E%2F%3Crepo-name%3E%2Fvoltron%2Fissues_fragments%2Fissue_layout&source=header-repo&source_repo=apache%2Farrow-java
Reload	https://patch-diff.githubusercontent.com/apache/arrow-java/issues/794
Reload	https://patch-diff.githubusercontent.com/apache/arrow-java/issues/794
Reload	https://patch-diff.githubusercontent.com/apache/arrow-java/issues/794
apache	https://patch-diff.githubusercontent.com/apache
arrow-java	https://patch-diff.githubusercontent.com/apache/arrow-java
Notifications	https://patch-diff.githubusercontent.com/login?return_to=%2Fapache%2Farrow-java
Fork 111	https://patch-diff.githubusercontent.com/login?return_to=%2Fapache%2Farrow-java
Star 83	https://patch-diff.githubusercontent.com/login?return_to=%2Fapache%2Farrow-java
Code	https://patch-diff.githubusercontent.com/apache/arrow-java
Issues 403	https://patch-diff.githubusercontent.com/apache/arrow-java/issues
Pull requests 38	https://patch-diff.githubusercontent.com/apache/arrow-java/pulls
Discussions	https://patch-diff.githubusercontent.com/apache/arrow-java/discussions
Actions	https://patch-diff.githubusercontent.com/apache/arrow-java/actions
Security 0	https://patch-diff.githubusercontent.com/apache/arrow-java/security
Insights	https://patch-diff.githubusercontent.com/apache/arrow-java/pulse
Code	https://patch-diff.githubusercontent.com/apache/arrow-java
Issues	https://patch-diff.githubusercontent.com/apache/arrow-java/issues
Pull requests	https://patch-diff.githubusercontent.com/apache/arrow-java/pulls
Discussions	https://patch-diff.githubusercontent.com/apache/arrow-java/discussions
Actions	https://patch-diff.githubusercontent.com/apache/arrow-java/actions
Security	https://patch-diff.githubusercontent.com/apache/arrow-java/security
Insights	https://patch-diff.githubusercontent.com/apache/arrow-java/pulse
New issue	https://patch-diff.githubusercontent.com/login?return_to=https://github.com/apache/arrow-java/issues/794
New issue	https://patch-diff.githubusercontent.com/login?return_to=https://github.com/apache/arrow-java/issues/794
#802	https://github.com/apache/arrow-java/pull/802
Avro adapter - Read and write Avro container files	https://patch-diff.githubusercontent.com/apache/arrow-java/issues/794#top
#802	https://github.com/apache/arrow-java/pull/802
Type: enhancementNew feature or request	https://github.com/apache/arrow-java/issues?q=state%3Aopen%20label%3A%22Type%3A%20enhancement%22
19.0.0	https://github.com/apache/arrow-java/milestone/3
	https://github.com/martin-traverse
	https://github.com/martin-traverse
martin-traverse	https://github.com/martin-traverse
on Jul 7, 2025	https://github.com/apache/arrow-java/issues/794#issue-3210178394
#731	https://github.com/apache/arrow-java/issues/731
#779	https://github.com/apache/arrow-java/pull/779
Type: enhancementNew feature or request	https://github.com/apache/arrow-java/issues?q=state%3Aopen%20label%3A%22Type%3A%20enhancement%22
19.0.0No due date	https://github.com/apache/arrow-java/milestone/3
	https://github.com
Terms	https://docs.github.com/site-policy/github-terms/github-terms-of-service
Privacy	https://docs.github.com/site-policy/privacy-policies/github-privacy-statement
Security	https://github.com/security
Status	https://www.githubstatus.com/
Community	https://github.community/
Docs	https://docs.github.com/
Contact	https://support.github.com?tags=dotcom-footer

Viewport: width=device-width

URLs of crawlers that visited me.