Title: In BigQueryRetrievalJob.to_remote_storage(), return value is incorrect (includes all parquet files created in gcs_staging_location, not those created in that specific call) · Issue #3712 · feast-dev/feast · GitHub
Description: Expected Behavior: In BigQueryRetrievalJob, when I call to_remote_storage(), the return value that I would expect would be the paths of the parquet files that have been written to GCS... Current Behavior: ...however, it turns out that the p...
Opengraph URL: https://github.com/feast-dev/feast/issues/3712
Author: crispin-ki (https://github.com/crispin-ki)
Date Published: 2023-08-07T17:28:47.000Z
Comments: 2
URL: https://github.com/feast-dev/feast/issues/3712

## Expected Behavior
In [BigQueryRetrievalJob](https://github.com/feast-dev/feast/blob/c75a01fce2d52cd18479ace748b8eb2e6c81c988/sdk/python/feast/infra/offline_stores/bigquery.py#L402), when I call [to_remote_storage](https://github.com/feast-dev/feast/blob/c75a01fce2d52cd18479ace748b8eb2e6c81c988/sdk/python/feast/infra/offline_stores/bigquery.py#L553)(), I would expect the [return value](https://github.com/feast-dev/feast/blob/c75a01fce2d52cd18479ace748b8eb2e6c81c988/sdk/python/feast/infra/offline_stores/bigquery.py#L588) to be the paths of the parquet files that were written to GCS by that call.

## Current Behavior
However, the returned paths include every parquet file ever written under the same export prefix in the staging bucket, not only those written by the current call.

For example, say you set your gcs_staging_location in your feature_store.yaml to `feast-materialize-dev` and project_id to `my_feature_store`. Then self._gcs_path, as defined [here](https://github.com/feast-dev/feast/blob/c75a01fce2d52cd18479ace748b8eb2e6c81c988/sdk/python/feast/infra/offline_stores/bigquery.py#L428-L432), will be `gs://feast-materialize-dev/my_feature_store/export/ff67c43e-7174-475f-a02c-6c7587d89731` (or some other UUID string, but you get the idea).
However, the rest of the code in the to_remote_storage method returns all paths under `gs://feast-materialize-dev/my_feature_store/export`, which is not what we want, as the parquets for this call are written to self._gcs_path.

## Steps to reproduce
You can see that the code is wrong with a simple example.

The code below is adapted from [this section](https://github.com/feast-dev/feast/blob/c75a01fce2d52cd18479ace748b8eb2e6c81c988/sdk/python/feast/infra/offline_stores/bigquery.py#L579C9-L588). In this example, imagine that the current to_remote_storage call wrote parquets under `gs://feast-materialize-dev/my_feature_store/export/19a1c772-1f91-44da-8486-ea476f027d93/`, while a previous call left some at `gs://feast-materialize-dev/my_feature_store/export/e00597db-78d5-40e1-b125-eac903802acd/`:

```python
>>> from google.cloud.storage import Client as StorageClient
>>> _gcs_path = "gs://feast-materialize-dev/my_feature_store/export/19a1c772-1f91-44da-8486-ea476f027d93"
>>> bucket, prefix = _gcs_path[len("gs://") :].split("/", 1)
>>> print(bucket)
feast-materialize-dev
>>> print(prefix)
my_feature_store/export/19a1c772-1f91-44da-8486-ea476f027d93
>>> prefix = prefix.rsplit("/", 1)[0]  # THIS IS THE LINE THAT WE DO NOT WANT
>>> print(prefix)
my_feature_store/export
>>> if prefix.startswith("/"):
...     prefix = prefix[1:]
...
>>> print(prefix)
my_feature_store/export

>>> storage_client = StorageClient()
>>> blobs = storage_client.list_blobs(bucket, prefix=prefix)
>>> results = []
>>> for b in blobs:
...     results.append(f"gs://{b.bucket.name}/{b.name}")
...
>>> print(results)
['gs://feast-materialize-dev/my_feature_store/export/19a1c772-1f91-44da-8486-ea476f027d93/000000000000.parquet', 'gs://feast-materialize-dev/my_feature_store/export/19a1c772-1f91-44da-8486-ea476f027d93/000000000001.parquet', 'gs://feast-materialize-dev/my_feature_store/export/e00597db-78d5-40e1-b125-eac903802acd/000000000000.parquet', 'gs://feast-materialize-dev/my_feature_store/export/e00597db-78d5-40e1-b125-eac903802acd/000000000001.parquet']
```

In this example, the returned list contains parquet paths that are not under self._gcs_path and therefore were not written to GCS by this call. This is not what I would expect.

## Possible Solution
The corrected code would simply not include the line `prefix = prefix.rsplit("/", 1)[0]`.
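
The proposed fix can be illustrated without touching GCS. The following is a minimal sketch, not the Feast implementation: `split_gcs_path` and `paths_for_call` are hypothetical helper names, and an in-memory list of blob names stands in for what `storage_client.list_blobs(bucket, prefix=prefix)` would return. It keeps the full prefix, including the per-call UUID segment, so only files from the current call are matched. Appending a trailing `/` before matching is an extra assumption added here to avoid matching sibling directories that merely share a leading substring.

```python
from typing import Iterable, List, Tuple


def split_gcs_path(gcs_path: str) -> Tuple[str, str]:
    """Split 'gs://bucket/a/b' into ('bucket', 'a/b'), keeping the full prefix."""
    if not gcs_path.startswith("gs://"):
        raise ValueError(f"not a GCS path: {gcs_path!r}")
    bucket, _, prefix = gcs_path[len("gs://"):].partition("/")
    # No rsplit here: the trailing UUID segment stays part of the prefix.
    return bucket, prefix.strip("/")


def paths_for_call(gcs_path: str, blob_names: Iterable[str]) -> List[str]:
    """Return gs:// URIs only for blobs under this call's export directory."""
    bucket, prefix = split_gcs_path(gcs_path)
    # Hypothetical stand-in for listing blobs with the bucket and prefix;
    # here the prefix filter is applied to an in-memory list instead.
    return [
        f"gs://{bucket}/{name}"
        for name in blob_names
        if name.startswith(prefix + "/")
    ]


EXPORT = "my_feature_store/export"
CURRENT_RUN = "19a1c772-1f91-44da-8486-ea476f027d93"
PREVIOUS_RUN = "e00597db-78d5-40e1-b125-eac903802acd"

# Blob names as they might exist in the staging bucket after two calls.
blob_names = [
    f"{EXPORT}/{CURRENT_RUN}/000000000000.parquet",
    f"{EXPORT}/{CURRENT_RUN}/000000000001.parquet",
    f"{EXPORT}/{PREVIOUS_RUN}/000000000000.parquet",
    f"{EXPORT}/{PREVIOUS_RUN}/000000000001.parquet",
]

results = paths_for_call(
    f"gs://feast-materialize-dev/{EXPORT}/{CURRENT_RUN}", blob_names
)
for path in results:
    print(path)
```

With the UUID segment kept in the prefix, only the two files from the current call are returned and the files left by the earlier call are excluded.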