[DLRM/Pytorch] Cuda error: illegal memory access after changing embedding size to 64 · Issue #830 · NVIDIA/DeepLearningExamples
https://github.com/NVIDIA/DeepLearningExamples/issues/830
{"@context":"https://schema.org","@type":"DiscussionForumPosting","headline":"[DLRM/Pytorch] Cuda error: illegal memory access after changing embedding size to 64","articleBody":"Related to **DLRM/Pytorch** \r\n\r\n**Describe the bug**\r\nChanged embedding size to 64 (default 128)\r\nChanged the last layer of bottom MLP size to 64 (default 128)\r\nThis caused crash as shown below.\r\n```\r\nTraceback (most recent call last):\r\n File \"/opt/conda/lib/python3.6/runpy.py\", line 193, in _run_module_as_main\r\n \"__main__\", mod_spec)\r\n File \"/opt/conda/lib/python3.6/runpy.py\", line 85, in _run_code\r\n exec(code, run_globals)\r\n File \"/workspace/dlrm/dlrm/scripts/main.py\", line 519, in \u003cmodule\u003e\r\n app.run(main)\r\n File \"/opt/conda/lib/python3.6/site-packages/absl/app.py\", line 299, in run\r\n _run_main(main, args)\r\n File \"/opt/conda/lib/python3.6/site-packages/absl/app.py\", line 250, in _run_main\r\n sys.exit(main(argv))\r\n File \"/workspace/dlrm/dlrm/scripts/main.py\", line 264, in main\r\n train(model, loss_fn, optimizer, data_loader_train, data_loader_test, scaled_lr)\r\n File \"/workspace/dlrm/dlrm/scripts/main.py\", line 361, in train\r\n loss.backward()\r\n File \"/opt/conda/lib/python3.6/site-packages/torch/tensor.py\", line 184, in backward\r\n torch.autograd.backward(self, gradient, retain_graph, create_graph)\r\n File \"/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py\", line 123, in backward\r\n allow_unreachable=True) # allow_unreachable flag\r\nRuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`\r\nException raised from createCublasHandle at ../aten/src/ATen/cuda/CublasHandlePool.cpp:8 (most recent call first):\r\nframe #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string\u003cchar, std::char_traits\u003cchar\u003e, std::allocator\u003cchar\u003e \u003e) + 0x6b (0x7ff5f440a82b in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)\r\nframe #1: \u003cunknown function\u003e + 0x327d0c2 (0x7ff4bbe1c0c2 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)\r\nframe #2: at::cuda::getCurrentCUDABlasHandle() + 0xb82 (0x7ff4bbe1d9d2 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)\r\nframe #3: \u003cunknown function\u003e + 0x326945f (0x7ff4bbe0845f in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)\r\nframe #4: at::native::addmm_out_cuda_impl(at::Tensor\u0026, at::Tensor const\u0026, at::Tensor const\u0026, at::Tensor const\u0026, c10::Scalar, c10::Scalar) + 0x78e (0x7ff4bacef5ee in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)\r\nframe #5: at::native::mm_cuda(at::Tensor const\u0026, at::Tensor const\u0026) + 0x15b (0x7ff4bacf04bb in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)\r\nframe #6: \u003cunknown function\u003e + 0x3293808 (0x7ff4bbe32808 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)\r\nframe #7: \u003cunknown function\u003e + 0x330f734 (0x7ff4bbeae734 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)\r\nframe #8: \u003cunknown function\u003e + 0x2ba029b (0x7ff537b0b29b in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)\r\nframe #9: \u003cunknown function\u003e + 0x7a8224 (0x7ff535713224 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)\r\nframe #10: at::Tensor c10::Dispatcher::call\u003cat::Tensor, at::Tensor const\u0026, at::Tensor const\u0026\u003e(c10::OperatorHandle const\u0026, at::Tensor 
const\u0026, at::Tensor const\u0026) const + 0xc5 (0x7ff5c6f346e5 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)\r\nframe #11: \u003cunknown function\u003e + 0x28fe447 (0x7ff537869447 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)\r\nframe #12: torch::autograd::generated::AddmmBackward::apply(std::vector\u003cat::Tensor, std::allocator\u003cat::Tensor\u003e \u003e\u0026\u0026) + 0x155 (0x7ff5378aeca5 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)\r\nframe #13: \u003cunknown function\u003e + 0x2ee2f75 (0x7ff537e4df75 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)\r\nframe #14: torch::autograd::Engine::evaluate_function(std::shared_ptr\u003ctorch::autograd::GraphTask\u003e\u0026, torch::autograd::Node*, torch::autograd::InputBuffer\u0026, std::shared_ptr\u003ctorch::autograd::ReadyQueue\u003e const\u0026) + 0x1808 (0x7ff537e48f68 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)\r\nframe #15: torch::autograd::Engine::thread_main(std::shared_ptr\u003ctorch::autograd::GraphTask\u003e const\u0026, bool) + 0x551 (0x7ff537e49e01 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)\r\nframe #16: torch::autograd::Engine::thread_init(int, std::shared_ptr\u003ctorch::autograd::ReadyQueue\u003e const\u0026) + 0xa3 (0x7ff537e3f863 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)\r\nframe #17: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr\u003ctorch::autograd::ReadyQueue\u003e const\u0026) + 0x50 (0x7ff5c7236b20 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)\r\nframe #18: \u003cunknown function\u003e + 0xbd6df (0x7ff5f4af76df in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)\r\nframe #19: \u003cunknown function\u003e + 0x76db (0x7ff5fffcf6db in /lib/x86_64-linux-gnu/libpthread.so.0)\r\nframe #20: clone + 0x3f (0x7ff5ffcf888f in /lib/x86_64-linux-gnu/libc.so.6)\r\n\r\nterminate called after throwing an instance of 'c10::Error'\r\n what(): CUDA error: an illegal memory access was encountered\r\nException raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):\r\nframe #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string\u003cchar, std::char_traits\u003cchar\u003e, std::allocator\u003cchar\u003e \u003e) + 0x6b (0x7ff5f440a82b in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)\r\nframe #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xc10 (0x7ff5f41a5500 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)\r\nframe #2: c10::TensorImpl::release_resources() + 0x4d (0x7ff5f43f2c9d in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)\r\nframe #3: \u003cunknown function\u003e + 0x59f1e2 (0x7ff5c724b1e2 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)\r\n\u003comitting python frames\u003e\r\nframe #16: __libc_start_main + 0xe7 (0x7ff5ffbf8b97 in /lib/x86_64-linux-gnu/libc.so.6)\r\n\r\nFatal Python error: Aborted\r\n\r\nThread 0x00007ff59fda0700 (most recent call first):\r\n\r\nThread 0x00007ff56b58b700 (most recent call first):\r\n\r\nCurrent thread 0x00007ff6003fc740 (most recent call first):\r\nAborted\r\n```\r\n\r\n**To Reproduce**\r\nuse the command line:\r\n--embedding_dim 64 --bottom_mlp_sizes 512,256,64\r\n\r\n**Expected behavior**\r\nit should not crash.\r\n\r\n**Environment**\r\nPlease provide at least:\r\n* Container version (e.g. pytorch:20.06-py3):\r\n* GPUs in the system: (e.g. 
1x Tesla V100 32GB):\r\n* CUDA driver version (e.g. 418.67):\r\n","author":{"url":"https://github.com/junshi15","@type":"Person","name":"junshi15"},"datePublished":"2021-02-13T06:18:00.000Z","interactionStatistic":{"@type":"InteractionCounter","interactionType":"https://schema.org/CommentAction","userInteractionCount":4},"url":"https://github.com/830/DeepLearningExamples/issues/830"}
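For context on the configuration in **To Reproduce**: DLRM's dot interaction stacks the bottom-MLP output together with the embedding vectors, so the bottom MLP's last layer width has to equal the embedding dimension, and `--embedding_dim 64 --bottom_mlp_sizes 512,256,64` keeps that constraint (64 == 64). Below is a minimal PyTorch sketch of that shape relationship only; it is not the repository's code, and the feature counts, table sizes, and batch size are illustrative assumptions.

```python
import torch
import torch.nn as nn

embedding_dim = 64                 # mirrors --embedding_dim 64
bottom_mlp_sizes = [512, 256, 64]  # mirrors --bottom_mlp_sizes 512,256,64
num_numerical = 13                 # assumption: Criteo-style dense feature count
num_categorical = 26               # assumption: Criteo-style sparse feature count
batch_size = 8                     # arbitrary toy batch size

# Bottom MLP whose last layer width equals embedding_dim (64).
layers, in_features = [], num_numerical
for out_features in bottom_mlp_sizes:
    layers += [nn.Linear(in_features, out_features), nn.ReLU()]
    in_features = out_features
bottom_mlp = nn.Sequential(*layers)

# One toy embedding table per categorical feature, all with the same width.
embeddings = nn.ModuleList(
    [nn.Embedding(1000, embedding_dim) for _ in range(num_categorical)]
)

dense = torch.randn(batch_size, num_numerical)
sparse = torch.randint(0, 1000, (batch_size, num_categorical))

bottom_out = bottom_mlp(dense)                              # (B, 64)
emb_out = [emb(sparse[:, i]) for i, emb in enumerate(embeddings)]

# Dot interaction: stacking only works because the widths agree (64 == 64).
stacked = torch.stack([bottom_out] + emb_out, dim=1)        # (B, 27, 64)
interactions = torch.bmm(stacked, stacked.transpose(1, 2))  # (B, 27, 27)
print(interactions.shape)  # torch.Size([8, 27, 27])
```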