René's URL Explorer Experiment


Title: Avro adapter - Read and write Avro container files · Issue #794 · apache/arrow-java · GitHub

Description: Describe the enhancement requested Part 4 in the Avro series, following on from #731. This will allow reading and writing whole files in the Avro container format as a series of batches. Each batch will correspond to one Avro file block ...

Opengraph URL: https://github.com/apache/arrow-java/issues/794

Domain: patch-diff.githubusercontent.com


Hey, it has json ld scripts:
{"@context":"https://schema.org","@type":"DiscussionForumPosting","headline":"Avro adapter - Read and write Avro container files","articleBody":"### Describe the enhancement requested\n\nPart 4 in the Avro series, following on from #731. This will allow reading and writing whole files in the Avro container format as a series of batches. Each batch will correspond to one Avro file block and fill a single VSR. The VSR can be recycled between batches. Input and output can be to Avro encoder / decoder (set up externally) or to Java's native byte channels (which are set up with default binary encoder / decoder). To cater for async scenarios, the reader API should know how many bytes are required for a block before attempting to read it.\n\nI'd like to propose the following API - hopefully this is going in the right direction. I've taken some inspiration from ArrowFileReader / Writer and Json Reader / Writer, but it's not identical (and they're not identical to each other). If there is a desire to line up on specific naming / conventions then I'm certainly happy to do that, in which case I'll need a steer on exactly how it should be. 
Otherwise, if anyone has radically different ideas of what it should look like, please do share!\n\n    class AvroFileWriter {\n\n        // Writer owns a channel / encoder and will close them\n        // VSR and optional dictionaries are not owned and will not be closed\n        // VSR can be recycled or supplied as a stream\n\n        // Avro encoder configured externally\n        public AvroFileWriter(\n            Encoder encoder,\n            VectorSchemaRoot firstBatch,\n            DictionaryProvider dictionaries)\n\n        // Sets up a default binary encoder for the channel\n        public AvroFileWriter(\n            WritableByteChannel channel,\n            VectorSchemaRoot firstBatch,\n            DictionaryProvider dictionaries)\n\n        // Write the Avro header (throws if already written)\n        void writeHeader()\n\n        // Write the contents of the VSR as an Avro data block\n        // Writes header if not yet written\n        // Expects new data to be in the batch (i.e. 
VSR can be recycled)\n        void writeBatch()\n\n        // Reset vectors in all the producers\n        // Supports a stream of VSRs if source VSR is not recycled\n        void resetBatch(VectorSchemaRoot batch)\n\n        // Closes encoder and / or channel\n        // Does not close VSR or dictionary vectors\n        void close()\n\n    }\n\nNow writing data looks like this:\n\n    void writeAvro(MyApp app) {\n\n        var root = app.prepareVsr();\n        var dictionaries = app.prepareDictionaries();\n\n        try (var writer = new AvroFileWriter(app.openChannel(), root, dictionaries)) {\n\n            writer.writeHeader();\n\n            // Assume recycling, loadBatch() puts fresh data into root\n            while (app.loadBatch()) {\n                writer.writeBatch();\n            }\n        }\n    }\n\nAnd then for the reader:\n\n    class AvroFileReader implements DictionaryProvider {\n\n        // Reader owns a channel / decoder and will close them\n        // Schema / VSR / dictionaries are created when header is read\n        // VSR / dictionaries are cleaned up on close\n        // Dictionaries accessible through DictionaryProvider iface\n\n        // Avro decoder configured externally\n        public AvroFileReader(\n            Decoder decoder,\n            BufferAllocator allocator)\n\n        // Sets up a default binary decoder for the channel\n        // Avro is read sequentially so a seekable channel is not needed\n        public AvroFileReader(\n            ReadableByteChannel channel,\n            BufferAllocator allocator)\n\n        // Read the Avro header and set up schema / VSR / dictionaries\n        void readHeader()\n\n        // Schema and VSR available after readHeader()\n        Schema getSchema()\n        VectorSchemaRoot getVectorSchemaRoot()\n\n        // Read the next Avro block and load it into the VSR\n        // Return true if successful, false if EOS\n        // Also false in non-blocking mode if need more data\n        boolean 
readBatch()\n\n        // Check for position and size of the next Avro data block\n        // Provides a mechanism for non-blocking / reactive styles\n        boolean hasNextBatch();\n        long nextBatchPosition();\n        long nextBatchSize();\n\n        // Closes decoder and / or channel\n        // Also closes VSR and dictionary vectors\n        void close()\n\n    }\n\nSo reading looks like this:\n\n    // Blocking style\n    void readAvro(MyApp app) {\n\n        try (var reader = new AvroFileReader(app.openChannel(), app.allocator())) {\n\n            reader.readHeader();\n\n            app.setSchema(reader.getSchema());\n            app.setVsr(reader.getVectorSchemaRoot());\n            app.setDictionaries(reader);\n\n            while (reader.readBatch()) {\n                app.saveBatch();\n            }\n        }\n    }\n\n    // Non-blocking stage to process one batch\n    CompletionStage\u003cBoolean\u003e readAvroAsync(AvroFileReader reader) {\n\n        if (reader.hasNextBatch()) {\n\n            var start = reader.nextBatchPosition();\n            var end = start + reader.nextBatchSize();\n\n            return app.ensureBytesAvailable(start, end)\n                .thenApply(x -\u003e {\n\n                    if (reader.readBatch()) {\n                        app.saveBatch();\n                    }\n\n                    return reader;\n                })\n                .thenCompose(this::readAvroAsync);\n        }\n        else {\n            return CompletableFuture.completedFuture(true);\n        }\n    }\n\nThe non-blocking read is quite important for me as I have a web service that receives bytes in a stream. There is a slight gotcha because we need the first 8 bytes of the next batch before we know its size, but we can implement hasNextBatch() without them and then probably expose the batch padding size as a constant. 
\n\nCompression is probably worth thinking about now - each block is compressed individually so the implementation needs to treat the contents of each block as a separate chunk that can be fed through a codec. My guess is this is fairly straightforward for codecs that are already available so we might as well include it rather than reworking later.\n\nIf this looks broadly right I'll make a start on top of #779 ","author":{"url":"https://github.com/martin-traverse","@type":"Person","name":"martin-traverse"},"datePublished":"2025-07-07T20:30:03.000Z","interactionStatistic":{"@type":"InteractionCounter","interactionType":"https://schema.org/CommentAction","userInteractionCount":5},"url":"https://github.com/apache/arrow-java/issues/794"}
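
The non-blocking discussion in the issue body hinges on knowing a block's size before reading it. As a rough sketch only, assuming the block layout from the Avro specification (object count and payload size as zig-zag variable-length longs, then the payload, then a 16-byte sync marker), a header peek might look like the following. `AvroBlockHeaderPeek`, `BlockHeader` and `peek` are hypothetical names used for illustration; they are not part of the proposed API, Arrow or Avro.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;
import java.util.Optional;

/** Hypothetical helper: inspects an Avro data block header without consuming it. */
final class AvroBlockHeaderPeek {

    /** Each Avro data block ends with the file's 16-byte sync marker. */
    static final int SYNC_MARKER_BYTES = 16;

    /** Decoded block header plus the number of bytes the header itself occupied. */
    static final class BlockHeader {
        final long objectCount;
        final long payloadBytes;
        final int headerBytes;

        BlockHeader(long objectCount, long payloadBytes, int headerBytes) {
            this.objectCount = objectCount;
            this.payloadBytes = payloadBytes;
            this.headerBytes = headerBytes;
        }

        /** Bytes needed from the start of the block to the end of its sync marker. */
        long totalBytes() {
            return headerBytes + payloadBytes + SYNC_MARKER_BYTES;
        }
    }

    /**
     * Returns the header if the buffer already holds both longs, or empty if
     * more bytes are needed. The caller's buffer position is left untouched.
     */
    static Optional<BlockHeader> peek(ByteBuffer buf) {
        ByteBuffer view = buf.duplicate(); // shares content, independent position
        try {
            long count = readLong(view);
            long size = readLong(view);
            return Optional.of(new BlockHeader(count, size, view.position() - buf.position()));
        } catch (BufferUnderflowException e) {
            return Optional.empty(); // header incomplete: wait for more data
        }
    }

    /** Avro long: 7 bits per byte, low-order group first, then zig-zag decoded. */
    private static long readLong(ByteBuffer view) {
        long raw = 0;
        for (int shift = 0; shift < 70; shift += 7) { // at most 10 bytes for 64 bits
            byte b = view.get();
            raw |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return (raw >>> 1) ^ -(raw & 1);
            }
        }
        throw new IllegalArgumentException("Malformed Avro long: more than 10 bytes");
    }

    public static void main(String[] args) {
        // Header for a block of 3 objects and 300 payload bytes (payload not included here)
        ByteBuffer buf = ByteBuffer.wrap(new byte[] {0x06, (byte) 0xD8, 0x04});
        peek(buf).ifPresent(h -> System.out.println(
            h.objectCount + " objects, " + h.payloadBytes + " payload bytes, "
                + h.totalBytes() + " bytes in total"));
    }
}
```

Each long occupies between 1 and 10 bytes, so the header length is not fixed. That is why the sketch returns an empty result when the header is incomplete instead of assuming a fixed-size prefix. Something like `totalBytes()` is the figure a caller could pass to an `ensureBytesAvailable`-style check before calling `readBatch()`.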

Links:

#802 https://github.com/apache/arrow-java/pull/802
Avro adapter - Read and write Avro container files https://patch-diff.githubusercontent.com/apache/arrow-java/issues/794#top
Type: enhancement (New feature or request) https://github.com/apache/arrow-java/issues?q=state%3Aopen%20label%3A%22Type%3A%20enhancement%22
19.0.0 (No due date) https://github.com/apache/arrow-java/milestone/3
martin-traverse https://github.com/martin-traverse
on Jul 7, 2025 https://github.com/apache/arrow-java/issues/794#issue-3210178394
#731 https://github.com/apache/arrow-java/issues/731
#779 https://github.com/apache/arrow-java/pull/779