René's URL Explorer Experiment


Title: Databricks Spark runner · Issue #764 · feast-dev/feast · GitHub

Description: Is your feature request related to a problem? Please describe. Running on non-GCP clouds is a common request (#367) but the lack of a fully managed Beam service comparable to Google Dataflow makes this hard. The Feast code is tied to GCP...

Opengraph URL: https://github.com/feast-dev/feast/issues/764

Domain: github.com


Hey, it has json ld scripts:
{"@context":"https://schema.org","@type":"DiscussionForumPosting","headline":"Databricks Spark runner","articleBody":"**Is your feature request related to a problem? Please describe.**\r\nRunning on non-GCP clouds is a common request (#367) but the lack of a fully managed Beam service comparable to Google Dataflow makes this hard. The Feast code is tied to GCP in some places.\r\n\r\n**Describe the solution you'd like**\r\nDatabricks is a popular managed service available on Azure and AWS. It offers fully managed and optimized Spark environments with a REST API on top. Databricks is the main committer on Apache Spark, and on Azure the service is directly offered and supported by Microsoft, making it available to all Azure customers without needing an extra contract. The Spark clusters run in the customer's environment as fully managed VMs. The customer only pays while clusters are running; storage at rest has very little cost.\r\n\r\nThe Databricks runtime includes the open-source [Delta Lake](http://delta.io) storage layer, which allows efficiently using cloud storage as a repository for historical serving.\r\n\r\nWe're starting work on a Databricks runner that we would like to submit as a PR. This issue is a place to discuss and align with the community upfront, to ensure this PR will be accepted.\r\n\r\n**Describe alternatives you've considered**\r\n- The Beam Spark runner (#362) has limitations with Structured Streaming. Spark doesn't have a standard client-server API, and it causes classpath issues.\r\n- Running Beam on Flink introduces additional infrastructure management, and requires use of a community (unsupported) Flink operator. 
We would also have to introduce a component to replace BigQuery.\r\n\r\n**Additional context**\r\n\r\n### Baseline\r\n\r\nWe will baseline the work on the 0.5.0 release.\r\n\r\n### Ingestion - Feast Core\r\n\r\n- A class (feast.core.job.databricks.DatabricksJobManager) implementing a Databricks job manager to create, monitor and control jobs. In contrast to preexisting feast.core.job packages, it does not use the feast.ingestion Beam implementation classes, nor the Feast storage connectors, but will call out to the Databricks API to run jobs, passing job parameters such as Kafka topic name. The Databricks Run ID will be tracked as Feast job `extId`.\r\n- Databricks jobs will be run with `max_retries=-1` to retry indefinitely in case of job failure (e.g. VM host failure).\r\n- The JobManager `updateJob` and `restartJob` will be implemented as stop+start.\r\n- The job definition Protobuf will be extended to allow the user to specify the Databricks cluster configuration for the job (e.g. Azure F3sv2 VM, 3 worker nodes)\r\n\r\n### Ingestion - Spark job\r\n\r\n- A Spark job JAR (new Feast Maven module) will be developed. The JAR should be small and have no dependencies beyond the Spark and Delta runtimes + the Spark Redis connector. Scala is usually the preferred programming language for concision, but Java can be used as well if required. The CI/CD process (and not the Feast runtime) shall deploy the JAR to Databricks' storage.\r\n- The Spark job will connect directly to Kafka to retrieve features, using the Kafka Spark connector.\r\n- Feast’s ValidateFeatureRowDoFn class, which is tied to Beam, will be split into a core logic class that only has dependencies on common classes (feast-datatypes, and feast-storage-api for the FailedElement class). The core logic class must be placed into a new Feast module, so that it can be used by both the Beam Ingestion module and the Spark job. 
The Spark job will apply the core logic validation on each incoming feature in the stream, using a UDF, and store dead-lettered messages into a specific Delta table.\r\n- The Spark job will store features into Delta tables (one table per FeatureSet) on cloud storage, for historical serving.\r\n- The Spark job will connect directly to Redis to populate data for online serving, using the Redis Spark connector.\r\n\r\n### Historical Retriever - API change\r\n\r\n- The Retriever API is defined to return a list of blobs containing Avro data. Feast server-side code currently enumerates blobs on GCP storage to return a list to the client. We want to change this to return a cloud storage path (directory) containing Avro files, and move the blob enumeration work to the client. Motivation:\r\n   - Do not tie server-side code to specific GCP/Azure storage client libraries\r\n   - Efficiency, e.g. when the client does not access the data directly but triggers a remote ML job\r\n   - Support Parquet format in the future, where the directory is the Parquet \"file\"\r\n- In a future step we would like to add Parquet support instead of/in addition to Avro. We expect this will lead to a significant size reduction in many cases, due to columnar compression.\r\n\r\n### Historical Retriever - Feast Serving\r\n\r\n- We will develop a retriever as a Feast Storage connector (reader only). We will build a class DatabricksHistoricalRetriever that runs a Databricks job, similarly to the ingestion job.\r\n- The class will generate a new temporary cloud storage location in which the job output will be stored (for download by the SDK client). The implementation will call the Databricks Jobs API to run the job, and then busy-loop, calling the API until the job is completed.\r\n\r\n### Historical Retriever - Spark job\r\n\r\n- A Spark job JAR (new Feast Maven module) will be developed, similarly to the Spark ingestion job. 
It will replicate the logic done in the [SQL templates for BigQuery](https://github.com/feast-dev/feast/tree/v0.5.0/storage/connectors/bigquery/src/main/resources/templates).\r\n- Depending on the complexity of this, we might propose an initial version that is not yet fully featured (e.g. working only with a single featureset).\r\n\r\n### Historical Retriever - Python SDK\r\n\r\n- The Python SDK is currently directly tied to the GCP Storage library. This will be replaced with [Apache Libcloud](https://libcloud.apache.org/) to make the SDK compatible with Azure/AWS without adding complexity.\r\n\r\n### Secrets management\r\n\r\n- Feast to Databricks: a Personal Access Token must be passed along with REST API operations. This token will be retrieved from the application environment by the DatabricksJobManager.\r\n- Databricks to Cloud Storage/Redis/Kafka: secrets should be preprovisioned into Databricks (Secrets API), and made available to the jobs via the Spark environment.\r\n\r\n### Integration testing\r\n\r\n- A Databricks API emulator (new Feast Maven module) will be developed. The emulator will be packaged as a Docker image, running a Java application exposing a REST API that is a compatible subset of the Databricks API (but only covering the jobs operations used in Feast, so a very small surface). The Java application will run real local Spark applications, so it should be able to test the real Feast Spark jobs (at small scale, and using locally mounted storage instead of cloud storage).\r\n    - I have a prototype of this working already, using the [Spark REST framework](http://sparkjava.com/) (confusing naming - this is unrelated to Apache Spark!) for simplicity and small image size. 
We can also refactor that to use Spring REST if desired.\r\n\r\n- In addition, the Databricks setup can be integration tested against the real Databricks, similar to what the BigQuery test scripts are doing.\r\n","author":{"url":"https://github.com/algattik","@type":"Person","name":"algattik"},"datePublished":"2020-06-02T10:43:13.000Z","interactionStatistic":{"@type":"InteractionCounter","interactionType":"https://schema.org/CommentAction","userInteractionCount":9},"url":"https://github.com/feast-dev/feast/issues/764"}
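
Illustration (not part of the issue): the "Historical Retriever - Feast Serving" part of the issue body says the retriever would run a Databricks job and then busy-loop on the Jobs API until the job completes. A minimal Python sketch of such a polling loop is shown here. The function name `wait_for_run`, its parameters, and the injected `get_run` callable are hypothetical. The sketch assumes the run status has the shape returned by the Databricks Jobs API `runs/get` endpoint, with a `state` object holding `life_cycle_state` and `result_state`. The issue itself proposes a Java class for this, so the code only illustrates the control flow.

```python
import time
from typing import Callable, Dict

# Life-cycle states after which a Databricks run will not change any more
# (assumed from the Jobs API run state model).
TERMINAL_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}


def wait_for_run(
    get_run: Callable[[int], Dict],
    run_id: int,
    poll_seconds: float = 5.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_run(run_id) until the run is terminal; return its result state.

    get_run is a hypothetical callable that returns the parsed JSON of a
    runs/get call; injecting it keeps the loop testable without a network.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = get_run(run_id).get("state", {})
        life_cycle = state.get("life_cycle_state")
        if life_cycle in TERMINAL_STATES:
            # result_state may be absent (e.g. for skipped runs), so fall
            # back to the life-cycle state in that case.
            return state.get("result_state", life_cycle)
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run {run_id} did not finish in {timeout_seconds}s")
        sleep(poll_seconds)
```

In a real deployment `get_run` would wrap an authenticated HTTP GET against the workspace's `runs/get` endpoint, passing the Personal Access Token mentioned under "Secrets management" as a bearer token. A bounded timeout is added here as a design choice, so that a stuck job cannot block the serving thread indefinitely.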

Links:

Databricks Spark runner: https://github.com/feast-dev/feast/issues/764#top
keep-open: https://github.com/feast-dev/feast/issues?q=state%3Aopen%20label%3A%22keep-open%22
kind/discussion: https://github.com/feast-dev/feast/issues?q=state%3Aopen%20label%3A%22kind%2Fdiscussion%22
kind/feature (New feature or request): https://github.com/feast-dev/feast/issues?q=state%3Aopen%20label%3A%22kind%2Ffeature%22
algattik: https://github.com/algattik
on Jun 2, 2020: https://github.com/feast-dev/feast/issues/764#issue-629098142
#367: https://github.com/feast-dev/feast/issues/367
Delta Lake: http://delta.io
Ingestion with Spark: Job Management for Beam Spark Runner #362: https://github.com/feast-dev/feast/issues/362
SQL templates for BigQuery: https://github.com/feast-dev/feast/tree/v0.5.0/storage/connectors/bigquery/src/main/resources/templates
Apache Libcloud: https://libcloud.apache.org/
Spark REST framework: http://sparkjava.com/
