Title: Avro adapter - Read and write Avro container files · Issue #794 · apache/arrow-java · GitHub
Open Graph Title: Avro adapter - Read and write Avro container files · Issue #794 · apache/arrow-java
X Title: Avro adapter - Read and write Avro container files · Issue #794 · apache/arrow-java
Description: Describe the enhancement requested Part 4 in the Avro series, following on from #731. This will allow reading and writing whole files in the Avro container format as a series of batches. Each batch will correspond to one Avro file block ...
Open Graph Description: Describe the enhancement requested Part 4 in the Avro series, following on from #731. This will allow reading and writing whole files in the Avro container format as a series of batches. Each batch...
X Description: Describe the enhancement requested Part 4 in the Avro series, following on from #731. This will allow reading and writing whole files in the Avro container format as a series of batches. Each batch...
Opengraph URL: https://github.com/apache/arrow-java/issues/794
X: @github
Domain: patch-diff.githubusercontent.com
{"@context":"https://schema.org","@type":"DiscussionForumPosting","headline":"Avro adapter - Read and write Avro container files","articleBody":"### Describe the enhancement requested\n\nPart 4 in the Avro series, following on from #731. This will allow reading and writing whole files in the Avro container format as a series of batches. Each batch will correspond to one Avro file block and fill a single VSR. The VSR can be recycled between batches. Input and output can be to Avro encoder / decoder (set up externally) or to Java's native byte channels (which are set up with default binary encoder / decoder). To cater for async scenarios, the reader API should know how many bytes are required for a block before attempting to read it.\n\nI'd like to propose the following API - hopefully this is going in the right direction. I've taken some inspiration from ArrowFilleReader / Writer and Json Reader / Writer, but it's not identical (and they're not identical to each other). If there is a desire to line up on specific naming / conventions then certainly happy to do that, in which case I'll need a steer on exactly how it should be. Otherwise if anyone has radically different ideas of what it should look like, please do share!\n\n class AvroFileWriter {\n\n // Writer owns a channel / encoder and will close them\n // VSR and optional dictionaries are not owned and will not be closed\n // VSR can be recycled or supplied as a stream\n\n // Avro encoder configured externally\n public AvroFileWriter(\n Encoder encoder,\n VectorSchemaRoot firstBatch, \n DictionaryProvider dictionaries)\n\n // Sets up a defaulr binary encoder for the channel\n public AvroFileWriter(\n WritableByteChannel channel,\n VectorSchemaRoot firstBatch, \n DictionaryProvider dictionaries)\n\n // Write the Avro header (throws if already written)\n void writeHeader()\n\n // Write the contents of the VSR as an Avro data block\n // Writes header if not yet written\n // Expects new data to be in the batch (i.e. VSR can be recycled)\n void writeBatch()\n\n // Reset vectors in all the producders\n // Supports a stream of VSRs if source VSR is not recycled\n void resetBatch(VectorSchemaRoot batch)\n\n // Closes encoder and / or channel\n // Does not close VSR or dictionary vectors\n void close()\n\n }\n \nNow writing data looks like this:\n\n void writeAvro(MyApp app) {\n\n var root = app.prepareVsr();\n var dictionaries = app.prepareDictionaries()\n\n try (var writer = new AvroFileWriter(app.openChannel(), root, dictionaries)) {\n\n writer.writeHeader();\n\n // Assume recycling, loadBatch() puts fresh data into root\n while (app.loadBatch()) {\n writer.writeBatch()\n }\n }\n }\n\nAnd then for the reader:\n\n class AvroFileReader implements DictionaryProvider {\n\n // Writer owns a channel / decoder and will close them\n // Schema / VSR / dictionaries are created when header is read\n // VSR / dictionaries are cleaned up on close\n // Dictionaries accessible through DictionaryProvider iface\n\n // Avro decoder configured externally\n public AvroFileWriter(\n Decoder decoder,\n BufferAllocator allocator)\n\n // Sets up a defaulr binary deocder for the channel\n // Avro read sequentially so seekable channel not needed\n public AvroFileWriter(\n ReadableByteChannel channel,\n BufferAllocator allocator)\n\n // Read the Avro header and set up schema / VSR / dictionaries\n void readHeader()\n\n // Schema and VSR available after readHeader() \n Schema getSchema()\n VectorSchemaRoot getVectorSchemaRoot()\n\n // Read the next Avro block and load it into the VSR\n // Return true if successful, false if EOS\n // Also false in non-blocking mode if need more data\n boolean readBatch()\n\n // Check for position and size of the next Avro data block\n // Provides a mechanism for non-blocking / reactive styles\n boolean hasNextBatch();\n long nextBatchPosition();\n long nextBatchSize();\n\n // Closes encoder and / or channel\n // Also closes VSR and dictionary vectors\n void close()\n\n }\n\nSo reading looks like this:\n\n // Blocking style\n void readAvro(MyApp app) {\n\n try (var reader = new AvroFileReader(app.openChannel(), app.allocator()) {\n\n reader.readHeader();\n \n app.setSchema(reader.getSchema());\n app.setVsr(reader.getVectorSchemaRoot());\n app.setDictionaries(reader);\n\n while (reader.readBatch())) {\n app.saveBatch();\n }\n }\n }\n\n // Non-blocking stage to process one batch\n CompletionStage\u003cBoolean\u003e readAvroAsync(AvroFileReader reader) {\n\n if (reader.hasNextBatch()) {\n\n var start = reader.nextBatchStart();\n var end = reader.nextBatchEnd();\n\n return app.ensureBytesAvailable(start, end)\n .thenApply(x -\u003e {\n\n if (reader.readBatch()) {\n app.saveBatch();\n }\n\n return reader;\n })\n .thenCompose(this::readAsync);\n }\n else {\n return CompletableFuture.completedFuture(true);\n }\n }\n\nThe non-blocking read is quite important for me as I have a web service that receives bytes in a stream. There is a slight gotcha because we need the first 8 bytes of the next batch before we know its size, but we can implement hasNextBatch() without them and the probably expose the batch padding size as a constant. \n\nCompression is probably worth thinking about now - each block is compressed individually so the implementation needs to treat the contents of each block as a separate chunk, that can be fed through a codec. My guess is this is fairly straightforward for codecs that are already available so we might as well include it rather than reworking later.\n\nIf this looks broadly right I'll make a start on top of #779 ","author":{"url":"https://github.com/martin-traverse","@type":"Person","name":"martin-traverse"},"datePublished":"2025-07-07T20:30:03.000Z","interactionStatistic":{"@type":"InteractionCounter","interactionType":"https://schema.org/CommentAction","userInteractionCount":5},"url":"https://github.com/794/arrow-java/issues/794"}
| route-pattern | /_view_fragments/issues/show/:user_id/:repository/:id/issue_layout(.:format) |
| route-controller | voltron_issues_fragments |
| route-action | issue_layout |
| fetch-nonce | v2:87ba72d2-10ff-cd57-3a90-e43814b1714a |
| current-catalog-service-hash | 81bb79d38c15960b92d99bca9288a9108c7a47b18f2423d0f6438c5b7bcd2114 |
| request-id | 9CEE:11BDB2:14C2DF6:1A4385B:6991A18A |
| html-safe-nonce | 2cdfcf008b86d4064c600eace203d1ea085a0f8aac47a92bac8ec85bab599c5f |
| visitor-payload | eyJyZWZlcnJlciI6IiIsInJlcXVlc3RfaWQiOiI5Q0VFOjExQkRCMjoxNEMyREY2OjFBNDM4NUI6Njk5MUExOEEiLCJ2aXNpdG9yX2lkIjoiNjAxMDA4MTMwMzk1NjkyMjc2MiIsInJlZ2lvbl9lZGdlIjoiaWFkIiwicmVnaW9uX3JlbmRlciI6ImlhZCJ9 |
| visitor-hmac | bc6f67a108c0cbb2747276ab03e79de63787c6d88964eb03b7856414993cb635 |
| hovercard-subject-tag | issue:3210178394 |
| github-keyboard-shortcuts | repository,issues,copilot |
| google-site-verification | Apib7-x98H0j5cPqHWwSMm6dNU4GmODRoqxLiDzdx9I |
| octolytics-url | https://collector.github.com/github/collect |
| analytics-location | / |
| fb:app_id | 1401488693436528 |
| apple-itunes-app | app-id=1477376905, app-argument=https://github.com/_view_fragments/issues/show/apache/arrow-java/794/issue_layout |
| twitter:image | https://opengraph.githubassets.com/d77ec97872d903f43d90e21fb1b9b2227b14c6fe56f18931c964e3ed75a21fb3/apache/arrow-java/issues/794 |
| twitter:card | summary_large_image |
| og:image | https://opengraph.githubassets.com/d77ec97872d903f43d90e21fb1b9b2227b14c6fe56f18931c964e3ed75a21fb3/apache/arrow-java/issues/794 |
| og:image:alt | Describe the enhancement requested Part 4 in the Avro series, following on from #731. This will allow reading and writing whole files in the Avro container format as a series of batches. Each batch... |
| og:image:width | 1200 |
| og:image:height | 600 |
| og:site_name | GitHub |
| og:type | object |
| og:author:username | martin-traverse |
| hostname | github.com |
| expected-hostname | github.com |
| None | 42c603b9d642c4a9065a51770f75e5e27132fef0e858607f5c9cb7e422831a7b |
| turbo-cache-control | no-preview |
| go-import | github.com/apache/arrow-java git https://github.com/apache/arrow-java.git |
| octolytics-dimension-user_id | 47359 |
| octolytics-dimension-user_login | apache |
| octolytics-dimension-repository_id | 893682219 |
| octolytics-dimension-repository_nwo | apache/arrow-java |
| octolytics-dimension-repository_public | true |
| octolytics-dimension-repository_is_fork | false |
| octolytics-dimension-repository_network_root_id | 893682219 |
| octolytics-dimension-repository_network_root_nwo | apache/arrow-java |
| turbo-body-classes | logged-out env-production page-responsive |
| disable-turbo | false |
| browser-stats-url | https://api.github.com/_private/browser/stats |
| browser-errors-url | https://api.github.com/_private/browser/errors |
| release | 848bc6032dcc93a9a7301dcc3f379a72ba13b96e |
| ui-target | full |
| theme-color | #1e2327 |
| color-scheme | light dark |
Links:
Viewport: width=device-width