Title: Improve performance on V100s · Issue #52 · PPPLDeepLearning/plasma-python · GitHub
URL: https://github.com/PPPLDeepLearning/plasma-python/issues/52
{"@context":"https://schema.org","@type":"DiscussionForumPosting","headline":"Improve performance on V100s","articleBody":"Mostly repeating private email and in-person communication on this topic for reference notes and posterity. \r\n\r\nFRNN performance on V100s on the 2x IBM AC922 systems, OLCF Summit and Princeton's Traverse cluster, is **about 3x slower** than on the P100s on Princeton's TigerGPU cluster. See the below table, which tests the performance for `d3d_0D` training on both machines as a function of batch size (as suggested by @jnkh). I have run these tests with 1, 2, 8 GPUs as well, and several datasets. \r\n\r\n\u003ctable border=\"2\" cellspacing=\"0\" cellpadding=\"6\" rules=\"groups\" frame=\"hsides\"\u003e\r\n\r\n\r\n\u003ccolgroup\u003e\r\n\u003ccol class=\"org-left\" /\u003e\r\n\r\n\u003ccol class=\"org-right\" /\u003e\r\n\r\n\u003ccol class=\"org-right\" /\u003e\r\n\r\n\u003ccol class=\"org-right\" /\u003e\r\n\r\n\u003ccol class=\"org-right\" /\u003e\r\n\r\n\u003ccol class=\"org-right\" /\u003e\r\n\u003c/colgroup\u003e\r\n\u003cthead\u003e\r\n\u003ctr\u003e\r\n\u003cth scope=\"col\" class=\"org-left\"\u003eMachine (GPU Model)\u003c/th\u003e\r\n\u003cth scope=\"col\" class=\"org-right\"\u003eN_node\u003c/th\u003e\r\n\u003cth scope=\"col\" class=\"org-right\"\u003eN_{GPU}\u003c/th\u003e\r\n\u003cth scope=\"col\" class=\"org-right\"\u003eExamples/sec\u003c/th\u003e\r\n\u003cth scope=\"col\" class=\"org-right\"\u003eSec/batch\u003c/th\u003e\r\n\u003cth scope=\"col\" class=\"org-right\"\u003eBatch size\u003c/th\u003e\r\n\u003c/tr\u003e\r\n\u003c/thead\u003e\r\n\r\n\u003ctbody\u003e\r\n\u003ctr\u003e\r\n\u003ctd class=\"org-left\"\u003eTraverse (V100)\u003c/td\u003e\r\n\u003ctd class=\"org-right\"\u003e1\u003c/td\u003e\r\n\u003ctd class=\"org-right\"\u003e4\u003c/td\u003e\r\n\u003ctd class=\"org-right\"\u003e1.35e3\u003c/td\u003e\r\n\u003ctd class=\"org-right\"\u003e0.75\u003c/td\u003e\r\n\u003ctd class=\"org-right\"\u003e1024\u003c/td\u003e\r\n\u003c/tr\u003e\r\n\r\n\r\n\u003ctr\u003e\r\n\u003ctd class=\"org-left\"\u003e\u0026#xa0;\u003c/td\u003e\r\n\u003ctd class=\"org-right\"\u003e\u0026#xa0;\u003c/td\u003e\r\n\u003ctd class=\"org-right\"\u003e\u0026#xa0;\u003c/td\u003e\r\n\u003ctd class=\"org-right\"\u003e2.53e3\u003c/td\u003e\r\n\u003ctd class=\"org-right\"\u003e0.80\u003c/td\u003e\r\n\u003ctd class=\"org-right\"\u003e2048\u003c/td\u003e\r\n\u003c/tr\u003e\r\n\r\n\r\n\u003ctr\u003e\r\n\u003ctd class=\"org-left\"\u003e\u0026#xa0;\u003c/td\u003e\r\n\u003ctd class=\"org-right\"\u003e\u0026#xa0;\u003c/td\u003e\r\n\u003ctd class=\"org-right\"\u003e\u0026#xa0;\u003c/td\u003e\r\n\u003ctd class=\"org-right\"\u003e5.20e3\u003c/td\u003e\r\n\u003ctd class=\"org-right\"\u003e0.80\u003c/td\u003e\r\n\u003ctd class=\"org-right\"\u003e4096\u003c/td\u003e\r\n\u003c/tr\u003e\r\n\u003c/tbody\u003e\r\n\r\n\u003ctbody\u003e\r\n\u003ctr\u003e\r\n\u003ctd class=\"org-left\"\u003eTigerGPU (P100)\u003c/td\u003e\r\n\u003ctd class=\"org-right\"\u003e1\u003c/td\u003e\r\n\u003ctd class=\"org-right\"\u003e4\u003c/td\u003e\r\n\u003ctd class=\"org-right\"\u003e4.30e3\u003c/td\u003e\r\n\u003ctd class=\"org-right\"\u003e0.24\u003c/td\u003e\r\n\u003ctd class=\"org-right\"\u003e1024\u003c/td\u003e\r\n\u003c/tr\u003e\r\n\r\n\r\n\u003ctr\u003e\r\n\u003ctd class=\"org-left\"\u003e\u0026#xa0;\u003c/td\u003e\r\n\u003ctd class=\"org-right\"\u003e\u0026#xa0;\u003c/td\u003e\r\n\u003ctd class=\"org-right\"\u003e\u0026#xa0;\u003c/td\u003e\r\n\u003ctd class=\"org-right\"\u003e7.70e3\u003c/td\u003e\r\n\u003ctd 
class=\"org-right\"\u003e0.26\u003c/td\u003e\r\n\u003ctd class=\"org-right\"\u003e2048\u003c/td\u003e\r\n\u003c/tr\u003e\r\n\r\n\r\n\u003ctr\u003e\r\n\u003ctd class=\"org-left\"\u003e\u0026#xa0;\u003c/td\u003e\r\n\u003ctd class=\"org-right\"\u003e\u0026#xa0;\u003c/td\u003e\r\n\u003ctd class=\"org-right\"\u003e\u0026#xa0;\u003c/td\u003e\r\n\u003ctd class=\"org-right\"\u003e1.38e4\u003c/td\u003e\r\n\u003ctd class=\"org-right\"\u003e0.30\u003c/td\u003e\r\n\u003ctd class=\"org-right\"\u003e4096\u003c/td\u003e\r\n\u003c/tr\u003e\r\n\u003c/tbody\u003e\r\n\u003c/table\u003e\r\n\r\n\r\nAt first, I suspected some issue with my Conda / MPI environment on the Power 9 architecture. However, @ge-dong and I compared figures, and we confirmed that we are both independently observing this behavior. In fact, the original modules on Traverse produced about even slower performance (20%). \r\n\r\n@ASvyatkovskiy identified the primary issue being that the TensorFlow backend for`tf.keras` or external Keras does not run the cuDNN autotuner unlike vanilla TensorFlow architecture definitions. See my notes about the autotuner in #51. The default implementations of our layers might be slower on V100 than on P100.\r\n\r\nHe opened issues about this when he first ran on Summit over 1.5 years ago:\r\nhttps://github.com/tensorflow/tensorflow/issues/18913, https://github.com/keras-team/keras/issues/9825. Related: https://github.com/keras-team/keras/issues/9321\r\n\r\nAnd proposed the following optimizations especially for V100s:\r\n- Use https://github.com/NVIDIA/nccl library to perform all-reduce directly on the GPU\r\n- Use https://github.com/NVIDIA/apex mixed precision optimizers\r\n\r\n\u003e All these things are easier to enable/add in PyTorch, which now also support distributed training natively and through Horovod.\r\n\r\nAlso, I am systematically benchmarking the `LSTM` Keras layer definition vs. `CuDNNLSTM`, which seems to be at least an order of magnitude faster. \r\n\r\n\r\n**IBM AC922 \"Traverse\" architecture details:**\r\n- Processor is 16-core Power 9 running at 2.7 GHz\r\n- Host memory 256 GB DDR4\r\n- 4 X V100 with 32 GB HBM2\r\n\r\n ","author":{"url":"https://github.com/felker","@type":"Person","name":"felker"},"datePublished":"2019-12-17T23:03:56.000Z","interactionStatistic":{"@type":"InteractionCounter","interactionType":"https://schema.org/CommentAction","userInteractionCount":0},"url":"https://github.com/52/plasma-python/issues/52"}
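For the second proposal, a sketch of NVIDIA Apex automatic mixed precision in PyTorch; the tiny LSTM and tensor shapes are stand-ins, and `O1` is just one of several `opt_level` choices:

```python
import torch
from apex import amp

model = torch.nn.LSTM(input_size=14, hidden_size=200).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# O1 patches eligible ops to fp16 so the V100 tensor cores are exercised.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

x = torch.randn(128, 32, 14, device="cuda")  # (seq_len, batch, features)
out, _ = model(x)
loss = out.pow(2).mean()

# Apex scales the loss to avoid fp16 gradient underflow.
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```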
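And the `LSTM` vs. `CuDNNLSTM` benchmark mentioned above, as a rough sketch: the shapes and sizes are arbitrary stand-ins for the real `d3d_0D` signals, and it uses the external-Keras (2.x) spellings of the layers:

```python
import time
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, CuDNNLSTM

def bench(layer_cls, batch=1024, steps=128, features=14, units=200):
    """Time one epoch of fitting a single recurrent layer on random data."""
    model = Sequential([layer_cls(units, input_shape=(steps, features))])
    model.compile(loss="mse", optimizer="adam")
    x = np.random.rand(4 * batch, steps, features)
    y = np.random.rand(4 * batch, units)
    model.fit(x, y, batch_size=batch, epochs=1, verbose=0)  # warm-up pass
    t0 = time.time()
    model.fit(x, y, batch_size=batch, epochs=1, verbose=0)
    return time.time() - t0

for cls in (LSTM, CuDNNLSTM):
    print(cls.__name__, bench(cls), "sec/epoch")
```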