René's URL Explorer Experiment


Title: Bugs: variable number of steps per epoch, and inaccurate diagnostic count of num_so_far shots · Issue #63 · PPPLDeepLearning/plasma-python · GitHub

Open Graph Title: Bugs: variable number of steps per epoch, and inaccurate diagnostic count of num_so_far shots · Issue #63 · PPPLDeepLearning/plasma-python

X Title: Bugs: variable number of steps per epoch, and inaccurate diagnostic count of num_so_far shots · Issue #63 · PPPLDeepLearning/plasma-python

Description: Observed on both TigerGPU and Traverse, on master before and after the merge of #46. If there is a serious underlying problem, it is possibly related to #55 with the changed indexing of the epochs. The premise of this issue is that: The ...

Open Graph Description: Observed on both TigerGPU and Traverse, on master before and after the merge of #46. If there is a serious underlying problem, it is possibly related to #55 with the changed indexing of the epochs....

X Description: Observed on both TigerGPU and Traverse, on master before and after the merge of #46. If there is a serious underlying problem, it is possibly related to #55 with the changed indexing of the epochs....

Open Graph URL: https://github.com/PPPLDeepLearning/plasma-python/issues/63

X: @github


Domain: github.com


Hey, it has JSON-LD scripts. Decoded:

Type: DiscussionForumPosting
Headline: Bugs: variable number of steps per epoch, and inaccurate diagnostic count of num_so_far shots
Author: felker (https://github.com/felker)
Published: 2020-01-14T23:41:05Z
Comments: 0
URL: https://github.com/PPPLDeepLearning/plasma-python/issues/63

Article body:

Observed on both TigerGPU and Traverse, on `master` before and after the merge of #46. If there is a serious underlying problem, it is possibly related to #55 with the changed indexing of the epochs.

The premise of this issue is that:

> The number of steps (iterations) per training epoch should be roughly constant across all epochs.

However, I am not entirely sure that this premise is correct. See the analysis below.

Mini-batches are created by distributing the trimmed and resampled shot signals into **chunks** of LSTM length, typically `length=128` ms when `dt=0.001`; this is the **horizontal dimension** of a mini-batch.

The other, **vertical dimension** of a mini-batch is the local `batch_size`. Ideally, each shot is uniquely "owned" by a **single** GPU (or model replica) for `nsteps` = `nchunks`, which depends on the particular pulse length. This varies by 1-2 orders of magnitude, with the minimum shot length typically `= 2*length + T_min_warning = 280` ms. Ignoring any nuanced trimming of the processed shots (i.e. I floored the division into chunks):

![d3d_training_pulse_length_histogram](https://user-images.githubusercontent.com/1410981/72389584-59818c00-36ee-11ea-88d8-741f8ca77b85.png)

- [ ] Double check the trimming of resampled shots in order to have an integer number of chunks. Is it trimmed only at the beginning of the shot? How does `conf['training']['paths'] = True` affect this?

From the Methods appendix of the Nature paper:

> ... Because there is a persistent internal state between successive chunks in time, it is not possible to use more than one chunk from a given shot in a given mini-batch (chunks that are successive in the shot must also be presented to the RNN in successive mini-batches during training such that the internal state can persist correctly).

> To train batchwise with a batch size of M, we need M independent (that is, stemming from different shots) time slices of equal length to feed to the GPU.

However, if `effective_batch_size = N_GPU*batch_size` is greater than the number of training shots (1734 shots, which is easy to exceed with 4 GPUs and `batch_size=512`, e.g.), then **each step must involve some shots appearing twice in the overall batch**. Even for smaller effective batch sizes, the batch generator must backfill the mini-batches with repeats at later steps in the epoch, since the longer pulses require many more steps to process all of their chunks (sanity-checked in the sketch after the next checklist item).

- [ ] Double check how open batch-indices are handled near the end of an epoch.
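(Not from the original issue: a minimal sanity check of the chunk arithmetic and the forced-repeat condition described above, using only numbers quoted in the text; the example pulse lengths are hypothetical.)

```python
import numpy as np

length = 128          # LSTM chunk length in timesteps (= 128 ms at dt=0.001 s)
min_shot_ms = 280     # = 2*length + T_min_warning, per the text above
pulse_lengths_ms = np.array([min_shot_ms, 1000, 5000, 20000])  # ~2 orders of magnitude

# nsteps needed to consume each shot = floor(pulse length / chunk length)
nchunks = np.floor(pulse_lengths_ms / length).astype(int)
print(nchunks)        # [  2   7  39 156]

# Repeats are forced whenever the effective batch size exceeds the shot count:
n_train_shots = 1734
for n_gpu, batch_size in [(1, 128), (1, 256), (4, 512)]:
    eff = n_gpu * batch_size
    status = "repeats forced" if eff > n_train_shots else "no forced repeats"
    print(f"effective_batch_size = {eff:4d}: {status}")
```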
{"@context":"https://schema.org","@type":"DiscussionForumPosting","headline":"Bugs: variable number of steps per epoch, and inaccurate diagnostic count of num_so_far shots","articleBody":"Observed on both TigerGPU and Traverse, on `master` before and after the merge of #46. If there is a serious underlying problem, it is possibly related to #55 with the changed indexing of the epochs. \r\n\r\nThe premise of this issue is that:\r\n\u003e The number of steps (iterations) per training epoch should be roughly constant across all epochs.\r\n\r\nHowever, I am not entirely sure that this premise is correct. See below section. \r\n\r\nMini-batches are created by distributing the trimmed and resampled shot signals into **chunks** of LSTM length, typically `length=128` ms when `dt=0.001`; this is the **horizontal dimension** of a min-batch. \r\n\r\nThe other, **vertical dimension** of a mini-batch is the local `batch_size`. Ideally, each shot is uniquely \"owned\" by a **single** GPU (or model replica) for `nsteps` = `nchunks`, which depends on the particular pulse length. This varies by 1-2 orders of magnitude, with the minimum shot length `= 2*length + T_min_warning = 280` ms, typically. Ignoring any nuanced trimming of the processed shots (i.e. I floored the division into chunks):\r\n\r\n![d3d_training_pulse_length_histogram](https://user-images.githubusercontent.com/1410981/72389584-59818c00-36ee-11ea-88d8-741f8ca77b85.png)\r\n\r\n- [ ] Double check the trimming of resampled shots in order to have an integer number of chunks. Is it trimmed only at the beginning of the shot? How does `conf['training']['paths'] = True` affect this?\r\n\r\nFrom the Methods appendix of the Nature paper:\r\n\u003e... Because there is a persistent internal state between successive chunks in time, it is not possible to use more than one chunk from a given shot in a given mini-batch (chunks that are successive in the shot must also be presented to the RNN in successive mini-batches during training such that the internal state can persist correctly).\r\n\r\n\u003e To train batchwise with a batch size of M, we need M independent (that is, stemming from different shots) time slices of equal length to feed to the GPU.\r\n\r\nHowever, if `effective_batch_size = N_GPU*batch_size` is greater than the number of training shots (1734 shots, which is easy to exceed with 4 GPUs and `batch_size=512`, e.g.), then **each step must involve some shots appearing twice in the overall batch**. Even for smaller effective batch sizes, the batch generator must backfill the mini-batches with repeats at later steps in the epoch as the longer pulses require many more steps to process all the chunks in the shot. \r\n- [ ] Double check how open batch-indices are handled near the end of an epoch.\r\n\r\nFor a recent experiment with 1 GPU, `batch_size=256`, D3D 0D training, the final step of the first epoch is written to stdout as:\r\n```\r\nstep: 143 [ETA: -0.42s] [1794.00/1789], loss: 1.51778 [1.34746] | walltime: 167.4610 | 4.66E+02 Examples/sec | 5.49E-01 sec/batch [96.4% calc., 3.6% sync.][batch = 256 = 256*1] [lr = 2.00E-05 = 2.00E-05*1]\r\n```\r\nIn this example, `1794.00` is `MPIModel.num_so_far`, which is always printed out with fractional precision, but never shows anything other than integer values. Note, 1789 shots is more than in the original D3D training set due to a change in `signals.py` that I was messing around with. 
By searching the stdout of FRNN with `grep -nriB 1 "seconds" output.txt`, I observe 143, 151, 264, 263, and 266 steps for the first 5 epochs.

- For 1 GPU and `batch_size=128`: 416, 529, 529, 529, 531 steps.
- For 1 GPU and `batch_size=128` (restarted/loaded epoch 1 weights): 416, 529, 529, 528, 529, 531 steps.
- For 4 GPUs and `batch_size=128`: 74, 75, 133, 132 steps.

In other words, for the default PRNG seed, the initial epochs **within a training session** shuffle mini-batches far closer to the optimal schedule than the later epochs do. See the analysis below.

This variation had not really been noticed in earlier training by @jnkh nor @ge-dong, since `conf['training']['num_batches_minimum'] = 200` in their tests (as opposed to the default value of 20 in the repository's `conf.yaml`), which is much larger than the typical number of steps required per epoch of 128 ms chunks with our original D3D dataset and `effective_batch_size=512`.

- [ ] Rename `num_batches_minimum` to `num_steps_minimum`?

It is unclear if the above variable-step phenomenon was happening on older versions of FRNN and was masked by this parameter. However, I **did** confirm that the code has always printed `.00` values for `MPIModel.num_so_far`.

I am not sure if this phenomenon has affected training accuracy at all.

### Multiprocessor scheduling problem

The random shuffling of shots loaded into mini-batches is effectively the *List Scheduling* algorithm applied to a shuffled list of jobs (shots) `j_i` of variable sizes `= nchunks_i`. Each batch index in `effective_batch_size` is an independent, identical "worker". The multiprocessor scheduling problem seeks the assignment of the `j_i` to the `m` workers that minimizes the makespan, i.e. the earliest time at which all jobs are completed. Here, we have no inter-job precedence/dependencies, nor individual worker constraints. Still, this problem is strongly NP-hard, since the decision-problem variant ("Does a feasible schedule S exist that satisfies f(S) <= k?" for a given threshold k) is NP-complete.

In this particular heuristic algorithm for the scheduling problem, each successive job (shot) is assigned to the worker (batch index) that becomes free soonest, given some arbitrarily ordered input (the training buffer). In the worst case for the List Scheduling algorithm, the longest shot is loaded into a mini-batch last, and the makespan is maximized. Hence, it returns a makespan within a factor of `2 - 1/m` of the optimal value.

By contrast, the *Longest Processing Time First* (LPT) rule first sorts the jobs by non-increasing processing time (largest `nchunks` to smallest) and returns a makespan within a factor of `4/3 - 1/(3*m)` of the optimal makespan.

Note, we are *not* trying to minimize the makespan or find the most efficient mini-batching strategy in FRNN, since we rely on the random shuffling to stabilize training. However, this analysis applied to our D3D chunks can give us some expectation of how much variability in steps/epoch is normal.
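(Not from the original issue: a toy illustration of the `2 - 1/m` worst case just described. With `m = 2` workers and jobs `[1, 1, 2]`, presenting the longest job last yields makespan 3, while LPT order yields the optimal makespan 2; the ratio 3/2 is exactly `2 - 1/m`.)

```python
import numpy as np

def greedy_makespan(jobs, m):
    """List Scheduling: assign each job, in the given order, to the least-loaded worker."""
    loads = np.zeros(m)
    for job in jobs:
        loads[np.argmin(loads)] += job
    return loads.max()

m = 2
jobs = [1, 1, 2]                                        # longest job arrives last
print(greedy_makespan(jobs, m))                         # 3.0 (worst case)
print(greedy_makespan(sorted(jobs, reverse=True), m))   # 2.0 (LPT order; optimal here)
```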
Here, I apply both algorithms to the D3D training set:

```python
import numpy as np
import time
import sys
from plasma.primitives.shots import ShotList

np.set_printoptions(threshold=sys.maxsize)

shot_list_path = '../processed_shotlists/d3d_0D/shot_lists_signal_group_250640798211266795112500621861190558178.npz'
data = np.load(shot_list_path, allow_pickle=True)
shot_list_train = data['shot_list_train'][()]
shot_list_train = ShotList(shot_list_train)
prepath = '/Users/felker/processed_shots/signal_group_250640798211266795112500621861190558178/'
shot_list_train.shots[0].restore(prepath=prepath, light=False)

T_RNN = 128   # "length of LSTM" = truncated BPTT chunk length
# `timesteps`: array of resampled per-shot pulse lengths; its construction
# from the restored shots is elided in this snippet
nchunks = timesteps/T_RNN
effective_batch_size = 1*128

# LPT: sort jobs by non-increasing size, then greedily assign each job to the
# least-loaded worker (batch index)
nchunks_sorted = np.sort(np.floor(nchunks))[::-1]
loads = np.zeros((effective_batch_size,))
scheduled_jobs = np.empty((effective_batch_size,), dtype=object)
scheduled_jobs[...] = [[] for _ in range(effective_batch_size)]
for job in nchunks_sorted:
    minload_batch_index = np.argmin(loads)
    scheduled_jobs[minload_batch_index].append(job)
    loads[minload_batch_index] += job
print("LPT makespan = {}".format(loads.max()))
print("Optimal job schedule (makespan) >= {}".format(
    loads.max()/(4.0/3.0 - 1.0/(3.0*effective_batch_size))))

# List Scheduling: same greedy assignment, but on a randomly shuffled job order
np.random.seed(int(time.time()))
nchunks_shuffled = np.floor(nchunks)
np.random.shuffle(nchunks_shuffled)
loads = np.zeros((effective_batch_size,))   # reset worker loads for the second schedule
scheduled_jobs = np.empty((effective_batch_size,), dtype=object)
scheduled_jobs[...] = [[] for _ in range(effective_batch_size)]
for job in nchunks_shuffled:
    minload_batch_index = np.argmin(loads)
    scheduled_jobs[minload_batch_index].append(job)
    loads[minload_batch_index] += job
print("Random List Scheduling makespan = {}".format(loads.max()))
print("Optimal job schedule (makespan) >= {}".format(
    loads.max()/(2.0 - 1.0/effective_batch_size)))
```

For `effective_batch_size = 512`:
```
LPT makespan = 132.0
Optimal job schedule (makespan) >= 98.99934895833333
Random List Scheduling makespan = 168.0
Optimal job schedule (makespan) >= 84.08211143695014
```

For `effective_batch_size = 256`:
```
LPT makespan = 256.0
Optimal job schedule (makespan) >= 191.99869791666666
Random List Scheduling makespan = 290.0
Optimal job schedule (makespan) >= 145.28375733855185
```

For `effective_batch_size = 128`:
```
LPT makespan = 505.0
Optimal job schedule (makespan) >= 378.7473958333333
Random List Scheduling makespan = 529.0
Optimal job schedule (makespan) >= 265.5372549019608
```

The latter two cases are in line with my observations (although these were computed from a slightly different training set; see the above comment about changes to `signals.py` on Traverse). Therefore, **this variability of nsteps/epoch might be expected, and not a bug.**

X Image: https://opengraph.githubassets.com/66990fb6a93a3df00706298eeb023995bb035f287efd2adb0054b6a3f933cad0/PPPLDeepLearning/plasma-python/issues/63
X Card: summary_large_image
Open Graph Image: https://opengraph.githubassets.com/66990fb6a93a3df00706298eeb023995bb035f287efd2adb0054b6a3f933cad0/PPPLDeepLearning/plasma-python/issues/63
Open Graph Image Alt: Observed on both TigerGPU and Traverse, on master before and after the merge of #46. If there is a serious underlying problem, it is possibly related to #55 with the changed indexing of the epochs....
Open Graph Image Size: 1200x600
Open Graph Site Name: GitHub
Open Graph Type: object
Open Graph Author Username: felker

Links:

PPPLDeepLearning https://github.com/PPPLDeepLearning
plasma-python https://github.com/PPPLDeepLearning/plasma-python
Bugs: variable number of steps per epoch, and inaccurate diagnostic count of num_so_far shots https://github.com/PPPLDeepLearning/plasma-python/issues/63#top
felker https://github.com/felker
on Jan 14, 2020 https://github.com/PPPLDeepLearning/plasma-python/issues/63#issue-549881802
#46 https://github.com/PPPLDeepLearning/plasma-python/pull/46
#55 https://github.com/PPPLDeepLearning/plasma-python/issues/55
https://user-images.githubusercontent.com/1410981/72389584-59818c00-36ee-11ea-88d8-741f8ca77b85.png
@jnkh https://github.com/jnkh
@ge-dong https://github.com/ge-dong

Viewport: width=device-width


URLs of crawlers that visited me.