# Databricks Spark runner · Issue #764 · feast-dev/feast
https://github.com/feast-dev/feast/issues/764
Opened by [algattik](https://github.com/algattik) on 2020-06-02 · 9 comments

**Is your feature request related to a problem? Please describe.**
Running on non-GCP clouds is a common request (#367), but the lack of a fully managed Beam service comparable to Google Dataflow makes this hard. The Feast code is tied to GCP in some places.

**Describe the solution you'd like**
Databricks is a popular managed service available on Azure and AWS. It offers fully managed, optimized Spark environments with a REST API on top. Databricks is the main committer on Apache Spark, and on Azure the service is offered and supported directly by Microsoft, making it available to all Azure customers with no extra contract needed. The Spark clusters run in the customer's environment as fully managed VMs; the customer only pays while clusters are running, and storage at rest costs very little.

The Databricks runtime includes the open-source [Delta Lake](http://delta.io) storage layer, which allows cloud storage to be used efficiently as a repository for historical serving.

We're starting work on a Databricks runner that we would like to submit as a PR. This issue is a place to discuss and align with the community upfront, to ensure the PR will be accepted.

**Describe alternatives you've considered**
- The Beam Spark runner (#362) has limitations with Structured Streaming. Spark doesn't have a standard client-server API, and embedding it causes classpath issues.
- Running Beam on Flink introduces additional infrastructure management and requires a community (unsupported) Flink operator. We would also have to introduce a component to replace BigQuery.

**Additional context**

### Baseline

We will baseline the work on the 0.5.0 release.

### Ingestion - Feast Core

- A class (`feast.core.job.databricks.DatabricksJobManager`) implementing a Databricks job manager to create, monitor and control jobs. In contrast to the preexisting `feast.core.job` packages, it does not use the `feast.ingestion` Beam implementation classes or the Feast storage connectors, but calls out to the Databricks API to run jobs, passing job parameters such as the Kafka topic name. The Databricks run ID will be tracked as the Feast job `extId`.
- Databricks jobs will be run with `max_retries=-1` to retry indefinitely in case of job failure (e.g. VM host failure).
- The JobManager `updateJob` and `restartJob` will be implemented as stop+start.
- The job definition Protobuf will be extended to allow the user to specify the Databricks cluster configuration for the job (e.g. Azure F3sv2 VMs, 3 worker nodes).

### Ingestion - Spark job

- A Spark job JAR (a new Feast Maven module) will be developed. The JAR should be small, with no dependencies beyond the Spark and Delta runtimes plus the Spark Redis connector. Scala is usually the preferred programming language for concision, but Java can be used as well if required. The CI/CD process (not the Feast runtime) shall deploy the JAR to Databricks' storage.
- The Spark job will connect directly to Kafka to retrieve features, using the Kafka Spark connector.
- Feast's `ValidateFeatureRowDoFn` class, which is tied to Beam, will be split into a core logic class that depends only on common classes (`feast-datatypes`, and `feast-storage-api` for the `FailedElement` class). The core logic class must be placed in a new Feast module, so that it can be used by both the Beam ingestion module and the Spark job. The Spark job will apply the core validation logic to each incoming feature in the stream, using a UDF, and store dead-lettered messages in a dedicated Delta table.
- The Spark job will store features in Delta tables (one table per FeatureSet) on cloud storage, for historical serving.
- The Spark job will connect directly to Redis to populate data for online serving, using the Redis Spark connector.

### Historical Retriever - API change

- The Retriever API is defined to return a list of blobs containing Avro data. Feast server-side code currently enumerates blobs on GCP storage to return a list to the client. We want to change this to return a cloud storage path (directory) containing Avro files, and move the blob enumeration to the client. Motivation:
  - Do not tie server-side code to specific GCP/Azure storage client libraries.
  - Efficiency, e.g. when the client does not access the data directly but triggers a remote ML job.
  - Support the Parquet format in the future, where the directory is the Parquet "file".
- In a future step we would like to add Parquet support instead of/in addition to Avro. We expect this will lead to a significant size reduction in many cases, due to columnar compression.

### Historical Retriever - Feast Serving

- We will develop a retriever as a Feast storage connector (reader only): a class `DatabricksHistoricalRetriever` that runs a Databricks job, similarly to the ingestion job.
- The class will generate a new temporary cloud storage location in which the job output will be stored (for download by the SDK client). The implementation will call the Databricks Jobs API to run the job, and then poll the API until the job has completed.

### Historical Retriever - Spark job

- A Spark job JAR (a new Feast Maven module) will be developed, similarly to the Spark ingestion job. It will replicate the logic of the [SQL templates for BigQuery](https://github.com/feast-dev/feast/tree/v0.5.0/storage/connectors/bigquery/src/main/resources/templates).
- Depending on the complexity of this, we might propose an initial version that is not yet fully featured (e.g. working with only a single FeatureSet).

### Historical Retriever - Python SDK

- The Python SDK is currently tied directly to the GCP Storage library. This will be replaced with [Apache Libcloud](https://libcloud.apache.org/) to make the SDK compatible with Azure/AWS without adding complexity.

### Secrets management

- Feast to Databricks: a Personal Access Token must be passed along with REST API operations. The `DatabricksJobManager` will retrieve this token from the application environment.
- Databricks to cloud storage/Redis/Kafka: secrets should be preprovisioned into Databricks (Secrets API) and made available to the jobs via the Spark environment.

### Integration testing

- A Databricks API emulator (a new Feast Maven module) will be developed. The emulator will be packaged as a Docker image running a Java application that exposes a REST API compatible with a subset of the Databricks API (covering only the jobs operations used in Feast, so a very small surface). The Java application will run real local Spark applications, so it should be able to test the real Feast Spark jobs (at small scale, and using locally mounted storage instead of cloud storage).
  - I have a prototype of this working already, using the [Spark REST framework](http://sparkjava.com/) (confusing naming - it is unrelated to Apache Spark!) for simplicity and small image size. We can also refactor it to use Spring REST if desired.
- In addition, the Databricks setup can be integration-tested against the real Databricks, similar to what the BigQuery test scripts do.
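To make the "Ingestion - Feast Core" plan concrete, here is a minimal sketch of how a job manager might build a Databricks Jobs API `jobs/create` settings payload carrying `max_retries=-1`, a user-supplied cluster shape, and job parameters such as the Kafka topic. This is Python for illustration (Feast Core is Java), and the function name, main class, runtime version, and defaults are hypothetical, not the actual Feast implementation:

```python
import json

def build_ingestion_job_settings(job_name, jar_path, kafka_topic,
                                 node_type_id="Standard_F4s_v2",
                                 num_workers=3):
    """Build an illustrative Databricks jobs/create settings payload
    for a Feast ingestion job. All names and defaults are placeholders."""
    return {
        "name": job_name,
        "new_cluster": {
            # Cluster shape would come from the extended job definition Protobuf.
            "node_type_id": node_type_id,
            "num_workers": num_workers,
            "spark_version": "6.6.x-scala2.11",  # placeholder runtime version
        },
        "libraries": [{"jar": jar_path}],
        # Retry indefinitely on job failure (e.g. VM host failure).
        "max_retries": -1,
        "spark_jar_task": {
            "main_class_name": "feast.spark.ingestion.SparkIngestion",  # hypothetical
            "parameters": ["--kafka-topic", kafka_topic],
        },
    }

settings = build_ingestion_job_settings(
    "feast-ingestion-myfeatureset",
    "dbfs:/feast/jobs/feast-ingestion.jar",
    "feast-features")
print(json.dumps(settings, indent=2))
```

The run ID returned when this job is launched would then be stored as the Feast job `extId`, so the job manager can later monitor or stop the run.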
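The Retriever API change above (return a directory path, move blob enumeration to the client) could look roughly like this on the SDK side. `list_objects` is a stand-in for whatever listing call the storage driver provides (e.g. Apache Libcloud's container object listing); the point is that no cloud-specific client library leaks into server-side code:

```python
def enumerate_avro_blobs(list_objects, export_path):
    """Given a listing function and the directory path returned by the
    Retriever API, return the Avro part files to download.

    `list_objects` abstracts the cloud storage driver; it takes a prefix
    and yields full object names under it. Hypothetical helper, for
    illustration only.
    """
    prefix = export_path.rstrip("/") + "/"
    return sorted(
        name for name in list_objects(prefix)
        if name.endswith(".avro")
    )

# Usage with a canned listing (in the real SDK this would be a driver call):
fake_listing = lambda prefix: [
    prefix + "part-00000.avro",
    prefix + "part-00001.avro",
    prefix + "_SUCCESS",  # marker files are filtered out
]
print(enumerate_avro_blobs(fake_listing, "wasbs://exports/job-1234"))
```

The same filter would later accept `.parquet` parts once Parquet output is added, since the client only sees a directory of files either way.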
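The run-and-wait behaviour of the proposed `DatabricksHistoricalRetriever` (submit a job, then call the Jobs API until the run reaches a terminal state) can be sketched as below. `get_run_state` stands in for a `runs/get` REST call; the state names follow the Databricks run life-cycle, but the function itself is a hypothetical sketch, not Feast code:

```python
import time

# Terminal run life-cycle states in the Databricks Jobs API.
TERMINAL_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}

def wait_for_run(get_run_state, poll_interval_s=10.0, sleep=time.sleep):
    """Poll a Databricks run until it reaches a terminal life-cycle state.

    `get_run_state` abstracts a GET runs/get call and returns a dict like
    {"life_cycle_state": ..., "result_state": ...}. Returns the final
    state dict so the caller can check result_state and then fetch the
    job output from the temporary cloud storage location.
    """
    while True:
        state = get_run_state()
        if state.get("life_cycle_state") in TERMINAL_STATES:
            return state
        sleep(poll_interval_s)

# Usage with a canned sequence of states (no real API involved):
states = iter([
    {"life_cycle_state": "PENDING"},
    {"life_cycle_state": "RUNNING"},
    {"life_cycle_state": "TERMINATED", "result_state": "SUCCESS"},
])
final = wait_for_run(lambda: next(states), sleep=lambda _: None)
print(final["result_state"])  # SUCCESS
```

Injecting the `sleep` function keeps the loop testable without real delays, which also matters for exercising this path against the proposed Databricks API emulator.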