Title: Some of the step tasks have been OOM Killed. · Issue #189 · modAL-python/modAL · GitHub
{"@context":"https://schema.org","@type":"DiscussionForumPosting","headline":"Some of the step tasks have been OOM Killed.","articleBody":"I am facing \"oom_kill event in StepId=866679.batch. Some of the step tasks have been OOM Killed.\" while using avg_confidence strategy for my multilabel dataset with around 38000 images of size 224. I use torch Dataloader with batch size 8 to load the data. Here's a snippet of the code covering Active Learning loop -\r\n\r\nn_queries = 14\r\nfor i in range(n_queries):\r\n if i == 0:\r\n n_instances = 8\r\n else:\r\n power += 0.25\r\n n_instances = batch(int(np.ceil(np.power(10, power))), batch_size)\r\n total_samples += n_instances\r\n n_instances_list.append(total_samples)\r\n \r\n print(f\"\\nQuery {i + 1}: Requesting {n_instances} samples.\")\r\n print(f\"Number of samples in pool before query: {X_pool.shape[0]}\")\r\n\r\n \r\n\r\n with torch.device(\"cpu\"):\r\n query_idx, _ = learner.query(X_pool, n_instances=n_instances) \r\n query_idx = np.unique(query_idx)\r\n query_idx = np.array(query_idx).flatten() \r\n\r\n # Extract the samples based on the query indices\r\n X_query = X_pool[query_idx]\r\n y_query = y_pool[query_idx]\r\n filenames_query = [filenames_pool[idx] for idx in query_idx]\r\n\r\n print(\"Shape of X_query after indexing:\", X_query.shape)\r\n\r\n if X_query.ndim != 4:\r\n raise ValueError(f\"Unexpected number of dimensions in X_query: {X_query.ndim}\")\r\n if X_query.shape[1:] != (224, 224, 3):\r\n raise ValueError(f\"Unexpected shape in X_query dimensions: {X_query.shape}\")\r\n\r\n X_cumulative = np.vstack((X_cumulative, X_query))\r\n y_cumulative = np.vstack((y_cumulative, y_query))\r\n filenames_cumulative.extend(filenames_query)\r\n\r\n save_checkpoint(i + 1, X_cumulative, y_cumulative, filenames_cumulative, save_dir)\r\n\r\n learner.teach(X=X_cumulative, y=y_cumulative)\r\n\r\n y_pred = learner.predict(X_test_np)\r\n accuracy = accuracy_score(y_test_np, y_pred)\r\n f1 = f1_score(y_test_np, y_pred, average='macro')\r\n acc_test_data.append(accuracy)\r\n f1_test_data.append(f1)\r\n\r\n print(f\"Accuracy after query {i + 1}: {accuracy}\")\r\n print(f\"F1 Score after query {i + 1}: {f1}\")\r\n\r\n\r\n # Early stopping check\r\n if f1 \u003e best_f1_score:\r\n best_f1_score = f1\r\n wait = 0 # reset the wait counter\r\n else:\r\n wait += 1 # increment the wait counter\r\n if wait \u003e= patience:\r\n print(\"Stopping early due to no improvement in F1 score.\")\r\n break\r\n\r\n # Remove queried instances from the pool\r\n X_pool = np.delete(X_pool, query_idx, axis=0)\r\n y_pool = np.delete(y_pool, query_idx, axis=0)\r\n filenames_pool = [filename for idx, filename in enumerate(filenames_pool) if idx not in query_idx]\r\n print(f\"Number of samples in pool after query: {X_pool.shape[0]}\")\r\n\r\nThis code runs well till 11 iterations but in the 12th iteration I get the OOM kill error. \r\n\r\nI am using A100 GPU with 40GB RAM which should be sufficient for this loop. Could you please help me identify what could be going wrong which leads to excessive memory requirement. Is there a bottleneck in my code that I should address? Could it be the case that for every iterarion the data is held in the main memory and can it be freed somehow without breaking the code and distorting the results. 
","author":{"url":"https://github.com/shubhamgp47","@type":"Person","name":"shubhamgp47"},"datePublished":"2024-07-27T13:07:40.000Z","interactionStatistic":{"@type":"InteractionCounter","interactionType":"https://schema.org/CommentAction","userInteractionCount":0},"url":"https://github.com/189/modAL/issues/189"}
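The query step is the other large transient: learner.query(X_pool, n_instances=...) hands the entire remaining pool to the strategy in one call, so whatever the estimator predicts for the ~38,000 pool images exists in memory at once. Assuming the acquisition score is computed per sample (which is how confidence-style strategies behave conceptually), the pool can be scored chunk by chunk and only the top-n indices kept. The sketch below is a generic stand-in under that assumption; it does not reproduce modAL's avg_confidence, and score_fn and chunk_size are illustrative.

```python
# Chunked scoring of the pool: only one small batch of predictions is alive at
# a time. Assumes estimator.predict_proba returns an (n_samples, n_classes)
# array; score_fn maps that to one utility value per sample.
import numpy as np

def chunked_query(estimator, X_pool, n_instances, score_fn, chunk_size=512):
    """Return indices of the n_instances highest-scoring pool samples."""
    scores = np.empty(len(X_pool), dtype=np.float64)
    for start in range(0, len(X_pool), chunk_size):
        stop = min(start + chunk_size, len(X_pool))
        proba = estimator.predict_proba(X_pool[start:stop])  # small batch only
        scores[start:stop] = score_fn(proba)
    top = np.argpartition(scores, -n_instances)[-n_instances:]  # top-n, unordered
    return top[np.argsort(scores[top])[::-1]]                   # order best-first

def avg_uncertainty(proba):
    # Illustrative score: higher when the mean class confidence is lower.
    return 1.0 - proba.mean(axis=1)

# Usage in place of learner.query (ActiveLearner exposes the wrapped model as
# learner.estimator):
#   query_idx = chunked_query(learner.estimator, X_pool, n_instances, avg_uncertainty)
```

An alternative that keeps learner.query unchanged is to score only a random subsample of the pool at each iteration instead of all 38,000 images, which bounds the query-time memory in the same way.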