Title: ETA calculation is inaccurate · Issue #55 · PPPLDeepLearning/plasma-python · GitHub
Description: Example of the current per-step (iteration) diagnostic output provided by FRNN around epoch 22 of the D3D 0D model (run on 4 V100 GPUs of Traverse): [0] step: 0 [ETA: 468568011.02s] [0.00/1789], loss: 1.05701 [1.05701] | walltime: 5.7374...
Opengraph URL: https://github.com/PPPLDeepLearning/plasma-python/issues/55
Opened by felker (https://github.com/felker) on 2020-01-07.

Example of the current per-step (iteration) diagnostic output provided by FRNN around epoch 22 of the D3D 0D model (run on 4 V100 GPUs of Traverse):
```
[0] step: 0 [ETA: 468568011.02s] [0.00/1789], loss: 1.05701 [1.05701] | walltime: 5.7374 | 8.47E+02 Examples/sec | 6.04E-01 sec/batch [92.3% calc., 7.7% sync.][batch = 512 = 128*4] [lr = 7.30E-05 = 1.83E-05*4]
```
The ETA provided in this example is clearly inaccurate (each epoch takes around 60s). Specifically, there are two types of issues:
1. The ETA computed in the first step of any epoch is always inaccurate.
2. For later epochs within a session, the ETA increases nearly monotonically for many steps before starting to decrease nearly monotonically.

### First step

For the first epoch in a given session, it gives a huge ETA since `MPI_Model.num_so_far` is zero, resulting in `work_so_far` of 0 being passed to:
https://github.com/PPPLDeepLearning/plasma-python/blob/c82ba61e339882a5af10b1052edc0348e16119f4/plasma/models/mpi_runner.py#L613-L616
causing `total_time` to explode.
- [ ] Probably should just refuse to give an ETA for the first step (or steps) of the first epoch

For later epochs within a session, it gives a minuscule ETA:
```
step: 0 [ETA: 0.55s] [1819.00/1789], loss: 0.98688 [0.98688] | walltime: 174.4240 | 8.93E+02 Examples/sec | 5.73E-01 sec/batch [96.1% calc., 3.9% sync.][batch = 512 = 128*4] [lr = 7.08E-05 = 1.77E-05*4]
```
- [ ] I think an error was introduced when I changed the 0-based indexing of the epochs 1-2 months ago.

### Later steps in later epochs

E.g. here are the ETAs for some later epoch:
```
ETA: 0.55s
ETA: 22.14
ETA: 27.98
ETA: 31.63
ETA: 35.88
ETA: 38.45
ETA: 34.89
ETA: 36.21
ETA: 35.35
ETA: 35.56
ETA: 36.04
ETA: 35.88
ETA: 35.33
ETA: 34.49
ETA: 34.73
ETA: 34.29
ETA: 34.13
ETA: 33.51
ETA: 33.16
…
ETA: 1.35s
ETA: 1.06s
ETA: 0.67s
ETA: 0.11s
ETA: -0.45
```
- [ ] Consider using the measured runtimes of the previous epochs within this session to inform the ETA in later epochs.
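
To make the three checklist items in this issue concrete, here is a minimal Python sketch of an ETA estimator. It is not the code in `mpi_runner.py`: the function `estimate_eta`, its parameters `previous_epoch_times` and `min_fraction`, and the blending rule are all assumptions made for illustration. Only the names `work_so_far` and the general idea of a total-time estimate come from the issue. The sketch refuses to report an ETA when no work has been done and no history exists, never returns a negative value, and uses the measured durations of earlier epochs as a prior that is gradually replaced by the in-epoch extrapolation.

```python
from typing import Optional, Sequence


def estimate_eta(
    work_so_far: float,
    work_total: float,
    time_so_far: float,
    previous_epoch_times: Sequence[float] = (),
    min_fraction: float = 0.05,
) -> Optional[float]:
    """Estimate seconds remaining in the current epoch, or None if unknown.

    Hypothetical helper for illustration; all names are assumptions and
    this is not the FRNN implementation.
    """
    if work_total <= 0:
        return None

    # Clamp progress to [0, 1]. This only guards against overshoot such as
    # 1819/1789; it does not fix whatever makes the counter overshoot.
    fraction = min(max(work_so_far / work_total, 0.0), 1.0)

    if previous_epoch_times:
        # Prior: mean measured duration of earlier epochs in this session.
        prior_total = sum(previous_epoch_times) / len(previous_epoch_times)
        if fraction > 0.0:
            # Trust the in-epoch extrapolation more as the epoch progresses.
            extrapolated_total = time_so_far / fraction
            total = (1.0 - fraction) * prior_total + fraction * extrapolated_total
        else:
            total = prior_total
    elif fraction > 0.0 and fraction >= min_fraction:
        # First epoch of a session: extrapolate only after enough progress.
        total = time_so_far / fraction
    else:
        # No history and too little progress: refuse to guess.
        return None

    # Remaining time = remaining share of work * estimated epoch duration.
    return (1.0 - fraction) * total
```

A caller would print something like `ETA: n/a` when the function returns `None`. The choice of a linear blend weight is arbitrary; any weighting that starts at the prior and ends at the extrapolation would address the rise-then-fall pattern shown in the ETA listing.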