Title: Ingestion with Spark: Job Management for Beam Spark Runner · Issue #362 · feast-dev/feast · GitHub
URL: https://github.com/feast-dev/feast/issues/362
# Ingestion with Spark: Job Management for Beam Spark Runner

Opened by [ches](https://github.com/ches) on 2019-12-13 · 6 comments

We would like to run ingestion on Spark (Streaming), i.e. with [the Beam Spark Runner][1]. Thus, an implementation of Feast's job management is needed.

There are a couple of factors that make this a bit less straightforward than Google Cloud Dataflow:

1. There is no standard remote/HTTP API for job submission and management built into Spark\*.
2. The Beam Spark Runner does not upload your executable job artifact and submit it for you as it does for Dataflow, both because of 1 and because there is no assumed cloud service like GCS in which to put it. Conventions vary depending on how & where organizations run Spark: they might use S3, HDFS, or an artifact repository to ferry job packages to where they're accessible from the runtime (YARN, Mesos, Kubernetes, EMR).

\* *Other than starting a [SparkContext] connected to the remote cluster, in-process in Feast Core. I feel that isn't workable for a number of reasons, not least of which are the heavy dependency on Spark as a library, and the lifecycle of streaming ingestion jobs being unnecessarily coupled to that of the Feast Core instance.*

### Planned Approach

#### Job Management

We initially plan to implement `JobManager` using ~~the Java client library for~~ [Apache Livy][2], a REST interface to Spark. This will use only an HTTP client, so it is light on dependencies and shouldn't get in the way of alternative `JobManager`s for Spark, should another organization wish to implement one for something other than Livy. _(Edit: it turns out that Livy's `livy-http-client` artifact still depends on Spark as a library; it's not a plain REST client, so we'll avoid that…)_

We have internal experience and precedent using Livy, but not for Spark Streaming applications, so we have some uncertainty about whether it can work well. In case it doesn't, we'll probably try [spark-jobserver], which does explicitly claim support for Streaming jobs.

#### Ingestion Job Artifact

We're less certain about how users should get the Feast ingestion Beam job artifact to their Spark cluster, due to the above-mentioned variation in deployments.

Roughly speaking, Feast Ingestion would be packaged as an assembly JAR that also includes `beam-runners-spark`. So, a new `ingestion-spark` module may be added to the Maven build, which is simply a POM for doing just that.

Deployment itself may then need to rely on documentation.

#### Beam Spark Runner

A minor note, but we will use the "legacy", non-portable Beam Spark Runner. As [the Beam docs][1] cover, the runner based on Spark Structured Streaming is incomplete and only supports batch jobs, and the non-portable runner is still recommended for Java-only needs.

In theory this is runtime configuration for Feast users: if they want to try the portable runner, it should be possible, but we'll most likely be testing with the non-portable one.

cc @smadarasmi

Reference issues to keep tabs on during implementation: #302, #361.

[1]: https://beam.apache.org/documentation/runners/spark/
[2]: https://livy.incubator.apache.org/
[SparkContext]: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-SparkContext.html
[spark-jobserver]: https://github.com/spark-jobserver/spark-jobserver
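To make the Livy-only-over-HTTP idea concrete, here is a minimal sketch of submitting the ingestion JAR as a Livy batch session using nothing but the standard library. The endpoint (`POST /batches`) and field names (`file`, `className`, `args`) follow Livy's REST batch API; the Livy URL, JAR URI, main class, and pipeline arguments below are placeholders for illustration, not actual Feast configuration.

```python
import json
from urllib import request


def build_batch_payload(jar_uri, main_class, args):
    """Build the JSON body for Livy's POST /batches endpoint.

    `file` must be a URI reachable from the cluster (e.g. HDFS or S3),
    which is exactly the artifact-distribution question discussed above.
    """
    return {
        "file": jar_uri,          # assembly JAR bundling beam-runners-spark
        "className": main_class,  # Beam pipeline entry point
        "args": list(args),       # pipeline options passed through
    }


def submit_batch(livy_url, payload):
    """POST the payload to Livy; returns the created batch session as a dict."""
    req = request.Request(
        livy_url.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)


# Hypothetical values: the JAR URI and class name are placeholders.
payload = build_batch_payload(
    "hdfs:///jobs/feast-ingestion-spark.jar",
    "feast.ingestion.ImportJob",
    ["--jobName=my-feature-set"],
)
```

Because the payload construction is separate from the HTTP call, a `JobManager` built this way stays a plain REST client with no Spark library dependency, which was the point of avoiding `livy-http-client`.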
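After submission, the Livy-based job management described in the issue reduces to polling batch state (`GET /batches/{id}/state`) and stopping jobs (`DELETE /batches/{id}`). Below is an illustrative mapping from Livy's documented session states onto the coarse statuses a `JobManager` might track; the state strings on the left are from Livy's REST docs, while the status names on the right are assumptions for the sketch, not Feast's actual enum.

```python
# Livy session/batch states (per the Livy REST docs) mapped to coarse
# job statuses. The right-hand names are illustrative, not Feast's enum.
_LIVY_STATE_TO_STATUS = {
    "not_started": "PENDING",
    "starting": "PENDING",
    "idle": "RUNNING",
    "busy": "RUNNING",
    "running": "RUNNING",
    "shutting_down": "RUNNING",
    "success": "COMPLETED",
    "dead": "FAILED",
    "error": "FAILED",
    "killed": "ABORTED",
}


def job_status(livy_state: str) -> str:
    """Translate a Livy state string into a coarse job status.

    Unknown states map to "UNKNOWN" rather than raising, since a job
    manager should tolerate new states across Livy versions.
    """
    return _LIVY_STATE_TO_STATUS.get(livy_state.lower(), "UNKNOWN")
```

For long-running streaming ingestion this polling loop is where the uncertainty mentioned above bites: a Streaming job never reaches `success` in normal operation, so "healthy" has to be inferred from a sustained `running` state.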