# Bugs: variable number of steps per epoch, and inaccurate diagnostic count of num_so_far shots

**PPPLDeepLearning/plasma-python · Issue #63 · opened by @felker on 2020-01-14**
https://github.com/PPPLDeepLearning/plasma-python/issues/63
{"@context":"https://schema.org","@type":"DiscussionForumPosting","headline":"Bugs: variable number of steps per epoch, and inaccurate diagnostic count of num_so_far shots","articleBody":"Observed on both TigerGPU and Traverse, on `master` before and after the merge of #46. If there is a serious underlying problem, it is possibly related to #55 with the changed indexing of the epochs. \r\n\r\nThe premise of this issue is that:\r\n\u003e The number of steps (iterations) per training epoch should be roughly constant across all epochs.\r\n\r\nHowever, I am not entirely sure that this premise is correct. See below section. \r\n\r\nMini-batches are created by distributing the trimmed and resampled shot signals into **chunks** of LSTM length, typically `length=128` ms when `dt=0.001`; this is the **horizontal dimension** of a min-batch. \r\n\r\nThe other, **vertical dimension** of a mini-batch is the local `batch_size`. Ideally, each shot is uniquely \"owned\" by a **single** GPU (or model replica) for `nsteps` = `nchunks`, which depends on the particular pulse length. This varies by 1-2 orders of magnitude, with the minimum shot length `= 2*length + T_min_warning = 280` ms, typically. Ignoring any nuanced trimming of the processed shots (i.e. I floored the division into chunks):\r\n\r\n\r\n\r\n- [ ] Double check the trimming of resampled shots in order to have an integer number of chunks. Is it trimmed only at the beginning of the shot? How does `conf['training']['paths'] = True` affect this?\r\n\r\nFrom the Methods appendix of the Nature paper:\r\n\u003e... Because there is a persistent internal state between successive chunks in time, it is not possible to use more than one chunk from a given shot in a given mini-batch (chunks that are successive in the shot must also be presented to the RNN in successive mini-batches during training such that the internal state can persist correctly).\r\n\r\n\u003e To train batchwise with a batch size of M, we need M independent (that is, stemming from different shots) time slices of equal length to feed to the GPU.\r\n\r\nHowever, if `effective_batch_size = N_GPU*batch_size` is greater than the number of training shots (1734 shots, which is easy to exceed with 4 GPUs and `batch_size=512`, e.g.), then **each step must involve some shots appearing twice in the overall batch**. Even for smaller effective batch sizes, the batch generator must backfill the mini-batches with repeats at later steps in the epoch as the longer pulses require many more steps to process all the chunks in the shot. \r\n- [ ] Double check how open batch-indices are handled near the end of an epoch.\r\n\r\nFor a recent experiment with 1 GPU, `batch_size=256`, D3D 0D training, the final step of the first epoch is written to stdout as:\r\n```\r\nstep: 143 [ETA: -0.42s] [1794.00/1789], loss: 1.51778 [1.34746] | walltime: 167.4610 | 4.66E+02 Examples/sec | 5.49E-01 sec/batch [96.4% calc., 3.6% sync.][batch = 256 = 256*1] [lr = 2.00E-05 = 2.00E-05*1]\r\n```\r\nIn this example, `1794.00` is `MPIModel.num_so_far`, which is always printed out with fractional precision, but never shows anything other than integer values. Note, 1789 shots is more than in the original D3D training set due to a change in `signals.py` that I was messing around with. \r\n- [ ] This might be because the value actually is incremented by 1 when a shot's first chunk initially appears in a mini-batch index. 
By searching the stdout of FRNN with `grep -nriB 1 "seconds" output.txt`, I observe 143, 151, 264, 263, and 266 steps for the first 5 epochs. Other runs:
- For 1 GPU and `batch_size=128`: 416, 529, 529, 529, 531 steps.
- For 1 GPU and `batch_size=128` (restarted/loaded epoch 1 weights): 416, 529, 529, 528, 529, 531 steps.
- For 4 GPUs and `batch_size=128`: 74, 75, 133, 132 steps.

In other words, for the default PRNG seed, the initial epochs **within a training session** shuffle mini-batches into a schedule far closer to the optimal one than the later epochs do. See the analysis below.

This variation had not really been noticed in earlier training by @jnkh or @ge-dong, since `conf['training']['num_batches_minimum'] = 200` in their tests (as opposed to the default value of 20 in the repository's `conf.yaml`), which is much larger than the typical number of steps required for an epoch of 128 ms chunks of our original D3D dataset with `effective_batch_size=512`.
- [ ] Rename `num_batches_minimum` to `num_steps_minimum`?

It is unclear if the above variable-step phenomenon was happening on older versions of FRNN and was masked by this parameter. However, I **did** confirm that the code has always been printing out `.00` values for `MPIModel.num_so_far`.

I am not sure if this phenomenon has affected training accuracy at all.

### Multiprocessor scheduling problem

The random shuffling of shots loaded into mini-batches is effectively the *List Scheduling* algorithm applied to a shuffled list of jobs (shots) `j_i` of variable sizes `= nchunks_i`. Each batch index in `effective_batch_size` is an independent, identical "worker". The multiprocessor scheduling problem seeks the assignment of the `j_i` to the `m` workers that minimizes the makespan, i.e. the earliest time at which all jobs are completed. Here, we have no inter-job precedence/dependencies, nor individual worker constraints. Still, this problem is strongly NP-hard, since the decision variant ("Does a feasible schedule S exist that satisfies f(S) <= k?" for a given threshold k) is NP-complete.

In this particular heuristic algorithm for the scheduling problem, each successive job (shot) is assigned to the worker (batch index) that becomes free soonest, given some arbitrarily ordered input (the training buffer). In the worst case for List Scheduling, the longest shot is loaded into a mini-batch last and the makespan is maximized. Hence, it returns a makespan that is within a factor of `2 - 1/m` of the optimal value.
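For intuition on that `2 - 1/m` factor, here is the standard textbook worst-case instance (toy numbers, not FRNN shots): with `m` workers, `m*(m-1)` unit-length jobs followed by a single job of length `m` give List Scheduling a makespan of `2m - 1`, versus an optimal makespan of `m`.

```python
# Textbook worst case for List Scheduling (toy instance, not FRNN data):
# m*(m-1) unit jobs followed by one job of length m.
import numpy as np

m = 4
jobs = [1] * (m * (m - 1)) + [m]

loads = np.zeros(m)
for job in jobs:                    # greedy: next job goes to the least-loaded worker
    loads[np.argmin(loads)] += job

print(loads.max())                  # 7.0 = 2*m - 1; the optimal makespan is m = 4
```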
By contrast, the *Longest Processing Time first* (LPT) rule sorts the jobs by non-increasing processing time (largest `nchunks` to smallest) before assigning them, and returns a makespan within a factor of `4/3 - 1/(3*m)` of the optimal makespan.

Note, we are *not* trying to minimize the makespan or find the most efficient mini-batching strategy in FRNN, since we rely on the random shuffling to stabilize training. However, this analysis applied to our D3D chunks can give us some expectation of how much variability in steps/epoch is normal.

Here, I apply both algorithms to the D3D training set:
```python
import numpy as np
import plasma  # needed so the pickled Shot objects in the .npz can be unpickled
from plasma.primitives.shots import ShotList
import time
import sys
np.set_printoptions(threshold=sys.maxsize)

shot_list_path = '../processed_shotlists/d3d_0D/shot_lists_signal_group_250640798211266795112500621861190558178.npz'
data = np.load(shot_list_path, allow_pickle=True)
shot_list_train = data['shot_list_train'][()]
shot_list_train = ShotList(shot_list_train)
prepath = '/Users/felker/processed_shots/signal_group_250640798211266795112500621861190558178/'
# restore every shot (light=False loads the resampled arrays from disk)
for shot in shot_list_train.shots:
    shot.restore(prepath=prepath, light=False)

T_RNN = 128  # "length of LSTM" = truncated BPTT chunk length
# per-shot number of resampled timesteps; assumes the restored target array is shot.ttd
timesteps = np.array([len(shot.ttd) for shot in shot_list_train.shots])
nchunks = timesteps / T_RNN
effective_batch_size = 1*128

# LPT: assign jobs in order of non-increasing size to the least-loaded batch index
nchunks_sorted = np.sort(np.floor(nchunks))[::-1]
loads = np.zeros((effective_batch_size,))
scheduled_jobs = np.empty((effective_batch_size,), dtype=object)
scheduled_jobs[...] = [[] for _ in range(effective_batch_size)]
for job in nchunks_sorted:
    minload_batch_index = np.argmin(loads)
    scheduled_jobs[minload_batch_index].append(job)
    loads[minload_batch_index] += job
print("LPT makespan = {}".format(loads.max()))
# LPT guarantee: makespan <= (4/3 - 1/(3m)) * OPT
print("Optimal job schedule (makespan) >= {}".format(
    loads.max()/(4.0/3.0 - 1.0/(3.0*effective_batch_size))))

# List Scheduling: assign jobs in random order to the least-loaded batch index
np.random.seed(int(time.time()))
nchunks_shuffled = np.floor(nchunks)
np.random.shuffle(nchunks_shuffled)
loads = np.zeros((effective_batch_size,))   # reset the loads for the second schedule
scheduled_jobs = np.empty((effective_batch_size,), dtype=object)
scheduled_jobs[...] = [[] for _ in range(effective_batch_size)]
for job in nchunks_shuffled:
    minload_batch_index = np.argmin(loads)
    scheduled_jobs[minload_batch_index].append(job)
    loads[minload_batch_index] += job
print("Random List Scheduling makespan = {}".format(loads.max()))
# List Scheduling guarantee: makespan <= (2 - 1/m) * OPT
print("Optimal job schedule (makespan) >= {}".format(loads.max()/(2.0 - 1.0/effective_batch_size)))
```

For `effective_batch_size = 512`:
```
LPT makespan = 132.0
Optimal job schedule (makespan) >= 98.99934895833333
Random List Scheduling makespan = 168.0
Optimal job schedule (makespan) >= 84.08211143695014
```

For `effective_batch_size = 256`:
```
LPT makespan = 256.0
Optimal job schedule (makespan) >= 191.99869791666666
Random List Scheduling makespan = 290.0
Optimal job schedule (makespan) >= 145.28375733855185
```

For `effective_batch_size = 128`:
```
LPT makespan = 505.0
Optimal job schedule (makespan) >= 378.7473958333333
Random List Scheduling makespan = 529.0
Optimal job schedule (makespan) >= 265.5372549019608
```

The latter two cases are in line with my observations (although these makespans were computed from a slightly different training set; see the above comment about changes to `signals.py` on Traverse). Therefore, **this variability of nsteps/epoch might be expected, and not a bug.**
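For concreteness, here is the side-by-side comparison behind that statement, pairing the later-epoch step counts reported earlier in this issue with the makespans printed above (indicative only, since the makespans come from a slightly different shot list):

```python
# Later-epoch observed steps/epoch (from the grep results above) vs. the two makespans
# printed by the script; shot lists differ slightly, so treat this as indicative only.
rows = [
    # effective_batch_size, observed later-epoch steps, LPT makespan, random List Scheduling makespan
    (512, "133, 132",      132, 168),
    (256, "264, 263, 266", 256, 290),
    (128, "529, 529, 531", 505, 529),
]
print(f"{'m':>4} | {'observed steps':>15} | {'LPT':>5} | {'List':>5}")
for m, observed, lpt_makespan, list_makespan in rows:
    print(f"{m:>4} | {observed:>15} | {lpt_makespan:>5} | {list_makespan:>5}")
```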