# Bugs: variable number of steps per epoch, and inaccurate diagnostic count of num_so_far shots

**PPPLDeepLearning/plasma-python · Issue #63 · opened by @felker on 2020-01-14**
https://github.com/PPPLDeepLearning/plasma-python/issues/63
{"@context":"https://schema.org","@type":"DiscussionForumPosting","headline":"Bugs: variable number of steps per epoch, and inaccurate diagnostic count of num_so_far shots","articleBody":"Observed on both TigerGPU and Traverse, on `master` before and after the merge of #46. If there is a serious underlying problem, it is possibly related to #55 with the changed indexing of the epochs. \r\n\r\nThe premise of this issue is that:\r\n\u003e The number of steps (iterations) per training epoch should be roughly constant across all epochs.\r\n\r\nHowever, I am not entirely sure that this premise is correct. See below section. \r\n\r\nMini-batches are created by distributing the trimmed and resampled shot signals into **chunks** of LSTM length, typically `length=128` ms when `dt=0.001`; this is the **horizontal dimension** of a min-batch. \r\n\r\nThe other, **vertical dimension** of a mini-batch is the local `batch_size`. Ideally, each shot is uniquely \"owned\" by a **single** GPU (or model replica) for `nsteps` = `nchunks`, which depends on the particular pulse length. This varies by 1-2 orders of magnitude, with the minimum shot length `= 2*length + T_min_warning = 280` ms, typically. Ignoring any nuanced trimming of the processed shots (i.e. I floored the division into chunks):\r\n\r\n\r\n\r\n- [ ] Double check the trimming of resampled shots in order to have an integer number of chunks. Is it trimmed only at the beginning of the shot? How does `conf['training']['paths'] = True` affect this?\r\n\r\nFrom the Methods appendix of the Nature paper:\r\n\u003e... Because there is a persistent internal state between successive chunks in time, it is not possible to use more than one chunk from a given shot in a given mini-batch (chunks that are successive in the shot must also be presented to the RNN in successive mini-batches during training such that the internal state can persist correctly).\r\n\r\n\u003e To train batchwise with a batch size of M, we need M independent (that is, stemming from different shots) time slices of equal length to feed to the GPU.\r\n\r\nHowever, if `effective_batch_size = N_GPU*batch_size` is greater than the number of training shots (1734 shots, which is easy to exceed with 4 GPUs and `batch_size=512`, e.g.), then **each step must involve some shots appearing twice in the overall batch**. Even for smaller effective batch sizes, the batch generator must backfill the mini-batches with repeats at later steps in the epoch as the longer pulses require many more steps to process all the chunks in the shot. \r\n- [ ] Double check how open batch-indices are handled near the end of an epoch.\r\n\r\nFor a recent experiment with 1 GPU, `batch_size=256`, D3D 0D training, the final step of the first epoch is written to stdout as:\r\n```\r\nstep: 143 [ETA: -0.42s] [1794.00/1789], loss: 1.51778 [1.34746] | walltime: 167.4610 | 4.66E+02 Examples/sec | 5.49E-01 sec/batch [96.4% calc., 3.6% sync.][batch = 256 = 256*1] [lr = 2.00E-05 = 2.00E-05*1]\r\n```\r\nIn this example, `1794.00` is `MPIModel.num_so_far`, which is always printed out with fractional precision, but never shows anything other than integer values. Note, 1789 shots is more than in the original D3D training set due to a change in `signals.py` that I was messing around with. \r\n- [ ] This might be because the value actually is incremented by 1 when a shot's first chunk initially appears in a mini-batch index. 
By searching the stdout of FRNN with `grep -nriB 1 "seconds" output.txt`, I observe 143, 151, 264, 263, and 266 steps for the first 5 epochs. Other runs:
- For 1 GPU and `batch_size=128`: 416, 529, 529, 529, 531 steps.
- For 1 GPU and `batch_size=128` (restarted/loaded epoch 1 weights): 416, 529, 529, 528, 529, 531 steps.
- For 4 GPUs and `batch_size=128`: 74, 75, 133, 132 steps.

In other words, for the default PRNG seed, the initial epochs **within a training session** shuffle mini-batches into a schedule far closer to the optimal one than the later epochs do. See the analysis below.

This variation had not really been noticed in earlier training by @jnkh or @ge-dong, since `conf['training']['num_batches_minimum'] = 200` in their tests (as opposed to the default value of 20 in the repository's `conf.yaml`), which is much larger than the typical number of steps required for an epoch of 128 ms chunks of our original D3D dataset with `effective_batch_size=512`.
- [ ] Rename `num_batches_minimum` to `num_steps_minimum`?

It is unclear if the above variable-step phenomenon was happening on older versions of FRNN and was masked by this parameter. However, I **did** confirm that the code has always been printing out `.00` values for `MPIModel.num_so_far`.

I am not sure if this phenomenon has affected training accuracy at all.

### Multiprocessor scheduling problem

The random shuffling of shots loaded into mini-batches is effectively the *List Scheduling* algorithm applied to a shuffled list of jobs (shots) `j_i` of variable sizes `= nchunks_i`. Each batch index in `effective_batch_size` is an independent, identical "worker". The multiprocessor scheduling problem seeks the assignment of the `j_i` to the `m` workers that minimizes the makespan, i.e. the earliest time at which all jobs are completed. Here, we have no inter-job precedence/dependencies, nor individual worker constraints. Still, this problem is strongly NP-hard, since the decision variant ("Does a feasible schedule S exist that satisfies f(S) <= k?" for a given threshold k) is NP-complete.

In this particular heuristic algorithm for the scheduling problem, each successive job (shot) is assigned to the worker (batch index) that becomes free soonest, given some arbitrarily ordered input (the training buffer). In the worst case for List Scheduling, the longest shot is loaded into a mini-batch last and the makespan is maximized. Hence, it returns a makespan that is within a factor of `2 - 1/m` of the optimal value.
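For intuition on that `2 - 1/m` factor, here is the standard textbook worst-case instance (toy numbers, not FRNN shots): with `m` workers, `m*(m-1)` unit-length jobs followed by a single job of length `m` give List Scheduling a makespan of `2m - 1`, versus an optimal makespan of `m`.

```python
# Textbook worst case for List Scheduling (toy instance, not FRNN data):
# m*(m-1) unit jobs followed by one job of length m.
import numpy as np

m = 4
jobs = [1] * (m * (m - 1)) + [m]

loads = np.zeros(m)
for job in jobs:                    # greedy: next job goes to the least-loaded worker
    loads[np.argmin(loads)] += job

print(loads.max())                  # 7.0 = 2*m - 1; the optimal makespan is m = 4
```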
By contrast, the *Longest Processing Time first* (LPT) rule sorts the jobs by non-increasing processing time (largest `nchunks` to smallest) before assigning them, and returns a makespan within a factor of `4/3 - 1/(3*m)` of the optimal makespan.

Note, we are *not* trying to minimize the makespan or find the most efficient mini-batching strategy in FRNN, since we rely on the random shuffling to stabilize training. However, this analysis applied to our D3D chunks can give us some expectation of how much variability in steps/epoch is normal.

Here, I apply both algorithms to the D3D training set:
```python
import numpy as np
import plasma  # needed so the pickled Shot objects in the .npz can be unpickled
from plasma.primitives.shots import ShotList
import time
import sys
np.set_printoptions(threshold=sys.maxsize)

shot_list_path = '../processed_shotlists/d3d_0D/shot_lists_signal_group_250640798211266795112500621861190558178.npz'
data = np.load(shot_list_path, allow_pickle=True)
shot_list_train = data['shot_list_train'][()]
shot_list_train = ShotList(shot_list_train)
prepath = '/Users/felker/processed_shots/signal_group_250640798211266795112500621861190558178/'
# restore every shot (light=False loads the resampled arrays from disk)
for shot in shot_list_train.shots:
    shot.restore(prepath=prepath, light=False)

T_RNN = 128  # "length of LSTM" = truncated BPTT chunk length
# per-shot number of resampled timesteps; assumes the restored target array is shot.ttd
timesteps = np.array([len(shot.ttd) for shot in shot_list_train.shots])
nchunks = timesteps / T_RNN
effective_batch_size = 1*128

# LPT: assign jobs in order of non-increasing size to the least-loaded batch index
nchunks_sorted = np.sort(np.floor(nchunks))[::-1]
loads = np.zeros((effective_batch_size,))
scheduled_jobs = np.empty((effective_batch_size,), dtype=object)
scheduled_jobs[...] = [[] for _ in range(effective_batch_size)]
for job in nchunks_sorted:
    minload_batch_index = np.argmin(loads)
    scheduled_jobs[minload_batch_index].append(job)
    loads[minload_batch_index] += job
print("LPT makespan = {}".format(loads.max()))
# LPT guarantee: makespan <= (4/3 - 1/(3m)) * OPT
print("Optimal job schedule (makespan) >= {}".format(
    loads.max()/(4.0/3.0 - 1.0/(3.0*effective_batch_size))))

# List Scheduling: assign jobs in random order to the least-loaded batch index
np.random.seed(int(time.time()))
nchunks_shuffled = np.floor(nchunks)
np.random.shuffle(nchunks_shuffled)
loads = np.zeros((effective_batch_size,))   # reset the loads for the second schedule
scheduled_jobs = np.empty((effective_batch_size,), dtype=object)
scheduled_jobs[...] = [[] for _ in range(effective_batch_size)]
for job in nchunks_shuffled:
    minload_batch_index = np.argmin(loads)
    scheduled_jobs[minload_batch_index].append(job)
    loads[minload_batch_index] += job
print("Random List Scheduling makespan = {}".format(loads.max()))
# List Scheduling guarantee: makespan <= (2 - 1/m) * OPT
print("Optimal job schedule (makespan) >= {}".format(loads.max()/(2.0 - 1.0/effective_batch_size)))
```

For `effective_batch_size = 512`:
```
LPT makespan = 132.0
Optimal job schedule (makespan) >= 98.99934895833333
Random List Scheduling makespan = 168.0
Optimal job schedule (makespan) >= 84.08211143695014
```

For `effective_batch_size = 256`:
```
LPT makespan = 256.0
Optimal job schedule (makespan) >= 191.99869791666666
Random List Scheduling makespan = 290.0
Optimal job schedule (makespan) >= 145.28375733855185
```

For `effective_batch_size = 128`:
```
LPT makespan = 505.0
Optimal job schedule (makespan) >= 378.7473958333333
Random List Scheduling makespan = 529.0
Optimal job schedule (makespan) >= 265.5372549019608
```

The latter two cases are in line with my observations (although these makespans were computed from a slightly different training set; see the above comment about changes to `signals.py` on Traverse). Therefore, **this variability of nsteps/epoch might be expected, and not a bug.**
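For concreteness, here is the side-by-side comparison behind that statement, pairing the later-epoch step counts reported earlier in this issue with the makespans printed above (indicative only, since the makespans come from a slightly different shot list):

```python
# Later-epoch observed steps/epoch (from the grep results above) vs. the two makespans
# printed by the script; shot lists differ slightly, so treat this as indicative only.
rows = [
    # effective_batch_size, observed later-epoch steps, LPT makespan, random List Scheduling makespan
    (512, "133, 132",      132, 168),
    (256, "264, 263, 266", 256, 290),
    (128, "529, 529, 531", 505, 529),
]
print(f"{'m':>4} | {'observed steps':>15} | {'LPT':>5} | {'List':>5}")
for m, observed, lpt_makespan, list_makespan in rows:
    print(f"{m:>4} | {observed:>15} | {lpt_makespan:>5} | {list_makespan:>5}")
```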