Title: A trick solution to achieve multi-field parsing and get compression of structured data by efficient parsing performance · Issue #59 · simdjson/simdjson-java · GitHub
Description: hello, piotrrzysko, In many business scenarios, parsing multi value from json only requires path array as parameters and get the string type value only one time, usage like hive's json_tuple udf, for example: parseValue(json, 'path1', 'p...
Opengraph URL: https://github.com/simdjson/simdjson-java/issues/59
Author: heykirby (https://github.com/heykirby)
Date published: 2024-10-04T14:52:11.000Z
Comments: 2

Hello piotrrzysko,

In many business scenarios, extracting several values from a JSON document needs only an array of paths as input, and each value is read once, as a string. The usage resembles Hive's `json_tuple` UDF: for example, `parseValue(json, 'path1', 'path2', 'path3', ...)` returns `(value1, value2, value3, ...)`.

We can therefore read the values directly from the JSON text using the bit index built by simdjson. The advantage of this solution is that it avoids creating a Java object for each JSON node, which avoids garbage collection overhead, and it can prune subtrees that no path needs, which improves performance further.

A simple example:

The JSON value is `{"field1":{"field2":"value2","field3":3},"field4":["value4","value5"]}`.
The paths we want are `[$.field1.field2, $.field4.0, $.field4]`. _(`$.field4` compresses the list to a string, that is, returns it as raw JSON text; `$.field4.0` returns the first element of the list.)_
The expected return value is `[value2, value4, '["value4","value5"]']`.

Solution implementation:

First, convert the path array into a tree. A blue node marks a path whose value we want. If the value at a blue node is a container (object or array), it is compressed to a string, as with `$.field4`.
<img width="257" alt="image" src="https://github.com/user-attachments/assets/2be070d2-5ead-4004-97ce-29cea71b3456">

Second, loop through the bit index and fill the values into the path tree.
In the example above, the bit index is [0, 1, 9, 10, 11, 19, 20, 28, 29, 37, 38, 39, 40, 41, 49, 50, 51, 59, 60, 68, 69].
In the picture below, each position recorded in the bit index is marked with '#'.
The bit index marks the start and end positions of objects and arrays (`{ }` and `[ ]`), the start position of each object key and each value, the ':' between a key and its value, and the ',' between elements.
<img width="782" alt="image" src="https://github.com/user-attachments/assets/30cda774-2cdd-4bf9-a51d-d7b8919d29f3">

For the example above, we loop through the bit index and fill in the value of each node of the path tree step by step. The following is a simple flow chart:

<img width="795" alt="image" src="https://github.com/user-attachments/assets/60d0a302-3440-4086-868c-8d1975a641de">
<img width="842" alt="image" src="https://github.com/user-attachments/assets/ea58addd-e683-4672-9da3-cb13d8a69132">
<img width="970" alt="image" src="https://github.com/user-attachments/assets/c30901f3-4815-4b0a-b4d3-a01b81e51452">
<img width="1087" alt="image" src="https://github.com/user-attachments/assets/9613ffa3-0bca-4a26-8cf4-201db7e2c876">
<img width="1027" alt="image" src="https://github.com/user-attachments/assets/09ce572c-0266-4f81-9a40-1b841de9197c">
<img width="1035" alt="image" src="https://github.com/user-attachments/assets/3c64b418-3afc-456e-96b1-c7b51249dffb">
<img width="1122" alt="image" src="https://github.com/user-attachments/assets/ba1390fe-d48c-49bb-a8d0-d3b5b7903c27">

Because the path tree can be reused, parsing many JSON documents does not require building a JSON node tree for each document; only the tree for the required paths is needed. This improves parsing performance, supports compressing container values to strings, extracts multiple values at the same time, and handles the case where the JSON value on a path is null.
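The two steps described in the issue (build a path tree once, then walk the bit index and fill in the requested values) can be sketched in plain Java. This is a minimal illustration, not simdjson-java code: the names `JsonTupleSketch`, `Node`, `buildPathTree`, `structuralIndexes`, and `parseValues` are hypothetical, and the character-by-character scan in `structuralIndexes` is an assumed scalar stand-in for the index that simdjson computes with SIMD instructions. The sketch also assumes valid JSON input, returns string values without unescaping them, and maps a JSON `null` literal to a Java `null`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names are part of simdjson-java.
final class JsonTupleSketch {
    /** Path-tree node; slot >= 0 marks a requested ("blue") node. */
    static final class Node {
        final Map<String, Node> children = new HashMap<>();
        int slot = -1;
    }

    /** Builds the reusable path tree from paths such as "$.field4.0". */
    static Node buildPathTree(String... paths) {
        Node root = new Node();
        for (int i = 0; i < paths.length; i++) {
            Node node = root;
            String[] segments = paths[i].trim().split("\\.");
            for (int s = 1; s < segments.length; s++) { // segments[0] is "$"
                node = node.children.computeIfAbsent(segments[s], k -> new Node());
            }
            node.slot = i; // result position for this path
        }
        return root;
    }

    /** Scalar stand-in (assumption) for the index simdjson builds with SIMD. */
    static int[] structuralIndexes(String json) {
        List<Integer> out = new ArrayList<>();
        boolean inString = false;
        boolean inScalar = false;
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            if (inString) {
                if (c == '\\') i++; // skip the escaped character
                else if (c == '"') inString = false;
            } else if (c == '"') {
                out.add(i); inString = true; inScalar = false;
            } else if ("{}[]:,".indexOf(c) >= 0) {
                out.add(i); inScalar = false;
            } else if (Character.isWhitespace(c)) {
                inScalar = false;
            } else if (!inScalar) {
                out.add(i); inScalar = true; // first character of a number or literal
            }
        }
        return out.stream().mapToInt(Integer::intValue).toArray();
    }

    /** Returns one string (or null) per requested path; assumes valid JSON. */
    static String[] parseValues(String json, Node pathTree, int pathCount) {
        String[] results = new String[pathCount];
        walk(json, structuralIndexes(json), new int[] {0}, pathTree, results);
        return results;
    }

    // Consumes one JSON value starting at idx[pos[0]] and fills requested slots.
    private static void walk(String json, int[] idx, int[] pos, Node node, String[] results) {
        if (node == null) { // pruning: no requested path passes through this value
            int depth = 0;
            do {
                char c = json.charAt(idx[pos[0]++]);
                if (c == '{' || c == '[') depth++;
                else if (c == '}' || c == ']') depth--;
            } while (depth > 0);
            return;
        }
        int start = idx[pos[0]++];
        char first = json.charAt(start);
        int end;
        if (first == '{' || first == '[') {
            char close = (first == '{') ? '}' : ']';
            int element = 0;
            while (json.charAt(idx[pos[0]]) != close) {
                String key;
                if (first == '{') {
                    int colon = idx[pos[0] + 1];
                    key = json.substring(idx[pos[0]] + 1, json.lastIndexOf('"', colon));
                    pos[0] += 2; // skip the key and the ':'
                } else {
                    key = String.valueOf(element++);
                }
                walk(json, idx, pos, node.children.get(key), results);
                if (json.charAt(idx[pos[0]]) == ',') pos[0]++;
            }
            end = idx[pos[0]++] + 1; // one past the closing bracket
        } else {
            end = pos[0] < idx.length ? idx[pos[0]] : json.length();
        }
        if (node.slot >= 0) {
            String raw = json.substring(start, end).trim();
            if (first == '"') raw = raw.substring(1, raw.length() - 1); // no unescaping
            results[node.slot] = (first != '"' && raw.equals("null")) ? null : raw;
        }
    }
}
```

For the example document in the issue, `structuralIndexes` reproduces the listed bit index, and calling `parseValues` with a tree built from `$.field1.field2`, `$.field4.0`, and `$.field4` yields `value2`, `value4`, and `["value4","value5"]`. The tree is built once and passed in, so it can be reused for every document, and a value that no path needs is skipped by counting bracket depth over the index rather than by materializing nodes.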