# Databricks Spark runner · Issue #764 · feast-dev/feast
https://github.com/feast-dev/feast/issues/764
Opened by [algattik](https://github.com/algattik) on 2020-06-02 · 9 comments

**Is your feature request related to a problem? Please describe.**
Running on non-GCP clouds is a common request (#367), but the lack of a fully managed Beam service comparable to Google Dataflow makes this hard. The Feast code is tied to GCP in some places.

**Describe the solution you'd like**
Databricks is a popular managed service available on Azure and AWS. It offers fully managed, optimized Spark environments with a REST API on top. Databricks is the main committer on Apache Spark, and on Azure the service is offered and supported directly by Microsoft, making it available to all Azure customers with no extra contract needed. The Spark clusters run in the customer's environment as fully managed VMs; the customer only pays while clusters are running, and storage at rest costs very little.

The Databricks runtime includes the open-source [Delta Lake](http://delta.io) storage layer, which allows cloud storage to be used efficiently as a repository for historical serving.

We're starting work on a Databricks runner that we would like to submit as a PR. This issue is a place to discuss and align with the community upfront, to ensure the PR will be accepted.

**Describe alternatives you've considered**
- The Beam Spark runner (#362) has limitations with Structured Streaming. Spark doesn't have a standard client-server API, and embedding it causes classpath issues.
- Running Beam on Flink introduces additional infrastructure management and requires a community (unsupported) Flink operator. We would also have to introduce a component to replace BigQuery.

**Additional context**

### Baseline

We will baseline the work on the 0.5.0 release.

### Ingestion - Feast Core

- A class (`feast.core.job.databricks.DatabricksJobManager`) implementing a Databricks job manager to create, monitor and control jobs. In contrast to the preexisting `feast.core.job` packages, it does not use the `feast.ingestion` Beam implementation classes or the Feast storage connectors, but calls out to the Databricks API to run jobs, passing job parameters such as the Kafka topic name. The Databricks run ID will be tracked as the Feast job `extId`.
- Databricks jobs will be run with `max_retries=-1` to retry indefinitely in case of job failure (e.g. VM host failure).
- The JobManager `updateJob` and `restartJob` will be implemented as stop+start.
- The job definition Protobuf will be extended to allow the user to specify the Databricks cluster configuration for the job (e.g. Azure F3sv2 VMs, 3 worker nodes).

### Ingestion - Spark job

- A Spark job JAR (a new Feast Maven module) will be developed. The JAR should be small, with no dependencies beyond the Spark and Delta runtimes plus the Spark Redis connector. Scala is usually the preferred programming language for concision, but Java can be used as well if required. The CI/CD process (not the Feast runtime) shall deploy the JAR to Databricks' storage.
- The Spark job will connect directly to Kafka to retrieve features, using the Kafka Spark connector.
- Feast's `ValidateFeatureRowDoFn` class, which is tied to Beam, will be split into a core logic class that depends only on common classes (`feast-datatypes`, and `feast-storage-api` for the `FailedElement` class). The core logic class must be placed in a new Feast module, so that it can be used by both the Beam ingestion module and the Spark job. The Spark job will apply the core validation logic to each incoming feature in the stream, using a UDF, and store dead-lettered messages in a dedicated Delta table.
- The Spark job will store features in Delta tables (one table per FeatureSet) on cloud storage, for historical serving.
- The Spark job will connect directly to Redis to populate data for online serving, using the Redis Spark connector.

### Historical Retriever - API change

- The Retriever API is defined to return a list of blobs containing Avro data. Feast server-side code currently enumerates blobs on GCP storage to return a list to the client. We want to change this to return a cloud storage path (directory) containing Avro files, and move the blob enumeration to the client. Motivation:
  - Do not tie server-side code to specific GCP/Azure storage client libraries.
  - Efficiency, e.g. when the client does not access the data directly but triggers a remote ML job.
  - Support the Parquet format in the future, where the directory is the Parquet "file".
- In a future step we would like to add Parquet support instead of/in addition to Avro. We expect this will lead to a significant size reduction in many cases, due to columnar compression.

### Historical Retriever - Feast Serving

- We will develop a retriever as a Feast storage connector (reader only): a class `DatabricksHistoricalRetriever` that runs a Databricks job, similarly to the ingestion job.
- The class will generate a new temporary cloud storage location in which the job output will be stored (for download by the SDK client). The implementation will call the Databricks Jobs API to run the job, and then poll the API until the job has completed.

### Historical Retriever - Spark job

- A Spark job JAR (a new Feast Maven module) will be developed, similarly to the Spark ingestion job. It will replicate the logic of the [SQL templates for BigQuery](https://github.com/feast-dev/feast/tree/v0.5.0/storage/connectors/bigquery/src/main/resources/templates).
- Depending on the complexity of this, we might propose an initial version that is not yet fully featured (e.g. working with only a single FeatureSet).

### Historical Retriever - Python SDK

- The Python SDK is currently tied directly to the GCP Storage library. This will be replaced with [Apache Libcloud](https://libcloud.apache.org/) to make the SDK compatible with Azure/AWS without adding complexity.

### Secrets management

- Feast to Databricks: a Personal Access Token must be passed along with REST API operations. The `DatabricksJobManager` will retrieve this token from the application environment.
- Databricks to cloud storage/Redis/Kafka: secrets should be preprovisioned into Databricks (Secrets API) and made available to the jobs via the Spark environment.

### Integration testing

- A Databricks API emulator (a new Feast Maven module) will be developed. The emulator will be packaged as a Docker image running a Java application that exposes a REST API compatible with a subset of the Databricks API (covering only the jobs operations used in Feast, so a very small surface). The Java application will run real local Spark applications, so it should be able to test the real Feast Spark jobs (at small scale, and using locally mounted storage instead of cloud storage).
  - I have a prototype of this working already, using the [Spark REST framework](http://sparkjava.com/) (confusing naming - it is unrelated to Apache Spark!) for simplicity and small image size. We can also refactor it to use Spring REST if desired.
- In addition, the Databricks setup can be integration-tested against the real Databricks, similar to what the BigQuery test scripts do.
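To make the "Ingestion - Feast Core" plan concrete, here is a minimal sketch of how a job manager might build a Databricks Jobs API `jobs/create` settings payload carrying `max_retries=-1`, a user-supplied cluster shape, and job parameters such as the Kafka topic. This is Python for illustration (Feast Core is Java), and the function name, main class, runtime version, and defaults are hypothetical, not the actual Feast implementation:

```python
import json

def build_ingestion_job_settings(job_name, jar_path, kafka_topic,
                                 node_type_id="Standard_F4s_v2",
                                 num_workers=3):
    """Build an illustrative Databricks jobs/create settings payload
    for a Feast ingestion job. All names and defaults are placeholders."""
    return {
        "name": job_name,
        "new_cluster": {
            # Cluster shape would come from the extended job definition Protobuf.
            "node_type_id": node_type_id,
            "num_workers": num_workers,
            "spark_version": "6.6.x-scala2.11",  # placeholder runtime version
        },
        "libraries": [{"jar": jar_path}],
        # Retry indefinitely on job failure (e.g. VM host failure).
        "max_retries": -1,
        "spark_jar_task": {
            "main_class_name": "feast.spark.ingestion.SparkIngestion",  # hypothetical
            "parameters": ["--kafka-topic", kafka_topic],
        },
    }

settings = build_ingestion_job_settings(
    "feast-ingestion-myfeatureset",
    "dbfs:/feast/jobs/feast-ingestion.jar",
    "feast-features")
print(json.dumps(settings, indent=2))
```

The run ID returned when this job is launched would then be stored as the Feast job `extId`, so the job manager can later monitor or stop the run.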
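The Retriever API change above (return a directory path, move blob enumeration to the client) could look roughly like this on the SDK side. `list_objects` is a stand-in for whatever listing call the storage driver provides (e.g. Apache Libcloud's container object listing); the point is that no cloud-specific client library leaks into server-side code:

```python
def enumerate_avro_blobs(list_objects, export_path):
    """Given a listing function and the directory path returned by the
    Retriever API, return the Avro part files to download.

    `list_objects` abstracts the cloud storage driver; it takes a prefix
    and yields full object names under it. Hypothetical helper, for
    illustration only.
    """
    prefix = export_path.rstrip("/") + "/"
    return sorted(
        name for name in list_objects(prefix)
        if name.endswith(".avro")
    )

# Usage with a canned listing (in the real SDK this would be a driver call):
fake_listing = lambda prefix: [
    prefix + "part-00000.avro",
    prefix + "part-00001.avro",
    prefix + "_SUCCESS",  # marker files are filtered out
]
print(enumerate_avro_blobs(fake_listing, "wasbs://exports/job-1234"))
```

The same filter would later accept `.parquet` parts once Parquet output is added, since the client only sees a directory of files either way.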
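The run-and-wait behaviour of the proposed `DatabricksHistoricalRetriever` (submit a job, then call the Jobs API until the run reaches a terminal state) can be sketched as below. `get_run_state` stands in for a `runs/get` REST call; the state names follow the Databricks run life-cycle, but the function itself is a hypothetical sketch, not Feast code:

```python
import time

# Terminal run life-cycle states in the Databricks Jobs API.
TERMINAL_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}

def wait_for_run(get_run_state, poll_interval_s=10.0, sleep=time.sleep):
    """Poll a Databricks run until it reaches a terminal life-cycle state.

    `get_run_state` abstracts a GET runs/get call and returns a dict like
    {"life_cycle_state": ..., "result_state": ...}. Returns the final
    state dict so the caller can check result_state and then fetch the
    job output from the temporary cloud storage location.
    """
    while True:
        state = get_run_state()
        if state.get("life_cycle_state") in TERMINAL_STATES:
            return state
        sleep(poll_interval_s)

# Usage with a canned sequence of states (no real API involved):
states = iter([
    {"life_cycle_state": "PENDING"},
    {"life_cycle_state": "RUNNING"},
    {"life_cycle_state": "TERMINATED", "result_state": "SUCCESS"},
])
final = wait_for_run(lambda: next(states), sleep=lambda _: None)
print(final["result_state"])  # SUCCESS
```

Injecting the `sleep` function keeps the loop testable without real delays, which also matters for exercising this path against the proposed Databricks API emulator.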