René's URL Explorer Experiment

Title: Clarity on units of string · Issue #65 · WebAssembly/stringref · GitHub

Description: I'll start by saying that I'm not well-versed in WebAssembly specifications or proposals, but I happened to come across this proposal, and I'm quite interested in Unicode strings and how they're represented in different programming langu...

Opengraph URL: https://github.com/WebAssembly/stringref/issues/65

Domain: github.com

Hey, it has json ld scripts:

{"@context":"https://schema.org","@type":"DiscussionForumPosting","headline":"Clarity on units of string","articleBody":"I'll start by saying that I'm not well-versed in WebAssembly specifications or proposals, but I happened to come across this proposal, and I'm quite interested in Unicode strings and how they're represented in different programming languages and VMs.\r\n\r\nThis might be just a wording issue, but I think the current description of [\"What's a string?\"](https://github.com/WebAssembly/stringref/blob/a64917cd5346f8704e614c4825ebf05737ac5e64/proposals/stringref/Overview.md?plain=1#L50) could be problematic:\r\n\r\n\u003e Therefore **we define a string to be a sequence of unicode scalar values and isolated surrogates**. The code units of a Java or JavaScript string can be interpreted to encode such a sequence, in the [WTF-16](https://simonsapin.github.io/wtf-8/) encoding form.\r\n\r\nIf this is taken to mean that a string is a sequence of Unicode code points (ie, \"Unicode scalar values\" and \"isolated surrogates\", basically any integer from `0x0` to `0x10FFFF`), this does not correspond with \"WTF-16\" or JavaScript strings, since there are sequences of isolated surrogates that can't be distinctly encoded in WTF-16.\r\n\r\neg, the sequence `[U+D83D, U+DCA9]` (high surrogate, low surrogate) doesn't have an encoding form in WTF-16. Interpreting the WTF-16/UTF-16 code unit sequence `\u003cD83D DCA9\u003e` produces the sequence `[U+1F4A9]` (a Unicode scalar value, `💩`). 
There are 1048576 occurrences of such sequences (one for every code point outside of the BMP), where the \"obvious\" encoding is already used to encode a USV.\r\n\r\n```js\r\n\u003e \"\\uD83D\\uDCA9\"\r\n'💩'\r\n\u003e \"\\u{1F4A9}\"\r\n'💩'\r\n\u003e [...\"\\uD83D\\uDCA9\"].length\r\n1\r\n\u003e [...\"\\u{1F4A9}\"].length\r\n1\r\n```\r\n\r\nNote that this is different to how strings work in Python 3, where a string is indeed a sequence of any Unicode code points:\r\n```python\r\n\u003e\u003e\u003e \"\\U0000D83D\\U0000DCA9\"\r\n'\\ud83d\\udca9'\r\n\u003e\u003e\u003e \"\\U0001F4A9\"\r\n'💩'\r\n\u003e\u003e\u003e len(\"\\U0000D83D\\U0000DCA9\")\r\n2\r\n\u003e\u003e\u003e len(\"\\U0001F4A9\")\r\n1\r\n```\r\n\r\nIf this proposal is suggesting that strings work the same way as in Python 3, I think implementations will likely[0] resort to using UTF-32 in some cases, as I believe Python implementations do (I think Python implementations usually use UTF-32[1] essentially for all strings, though they will switch between using 8-bit, 16-bit and 32-bit arrays depending on the range of code points used). Other than Python, I'm not actually sure what language implementations would benefit from such a string representation.\r\n\r\nAs a side note, the section in question also lumps together Python (presumably 3) and Rust, though this might result from a misunderstanding that should hopefully be explained above. 
Rust strings are meant to be[2] valid UTF-8, hence they correspond to sequences of Unicode scalar values, but as explained above, Python strings can be any distinct sequence of Unicode code points.\r\n\r\n---\r\n\r\n[0] Another alternative would be using a variation of WTF-8 that preserves UTF-16 surrogates instead of normalising them to USVs on concatenation, though this seems a bit crazy.\r\n\r\n[1] Technically this is an extension of UTF-32, since UTF-32 itself doesn't allow encoding of code points in the surrogate range.\r\n\r\n[2] This is at least true at some API level, though technically the representation of strings in Rust is allowed to contain arbitrary bytes, where it is up to libraries to avoid emitting invalid UTF-8 to safe code: https://github.com/rust-lang/rust/issues/71033","author":{"url":"https://github.com/Maxdamantus","@type":"Person","name":"Maxdamantus"},"datePublished":"2023-10-21T02:36:31.000Z","interactionStatistic":{"@type":"InteractionCounter","interactionType":"https://schema.org/CommentAction","userInteractionCount":16},"url":"https://github.com/WebAssembly/stringref/issues/65"}
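
The central claim of the issue body above is that the code point sequence `[U+D83D, U+DCA9]` has no distinct encoding in WTF-16, because its "obvious" code units are already read as `U+1F4A9`. The following is a minimal sketch of that argument in Python. The helper names `decode_wtf16` and `encode_naive` are hypothetical and written only for this illustration; they are not part of the stringref proposal or of any library.

```python
def decode_wtf16(units):
    """Interpret 16-bit code units as code points, pairing surrogates as UTF-16 does.

    A high surrogate immediately followed by a low surrogate becomes one
    supplementary code point; any other surrogate is kept as an isolated one.
    """
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)
            i += 1
    return out


def encode_naive(code_points):
    """Encode each code point independently into 16-bit code units (illustrative only)."""
    out = []
    for cp in code_points:
        if cp >= 0x10000:
            cp -= 0x10000
            out.append(0xD800 + (cp >> 10))
            out.append(0xDC00 + (cp & 0x3FF))
        else:
            out.append(cp)
    return out


# The scalar value U+1F4A9 and the surrogate sequence [U+D83D, U+DCA9]
# produce the same code units, so decoding cannot tell them apart.
units_from_scalar = encode_naive([0x1F4A9])
units_from_surrogates = encode_naive([0xD83D, 0xDCA9])
same_units = units_from_scalar == units_from_surrogates
round_trip = decode_wtf16(units_from_surrogates)

# One colliding (high, low) pair exists for every code point outside the BMP.
colliding_pairs = (0xDBFF - 0xD800 + 1) * (0xDFFF - 0xDC00 + 1)
```

Under these assumptions, `round_trip` holds a single code point rather than the two surrogates that went in, which is the loss of information the issue describes. A representation that stores code points directly, as the Python examples in the issue suggest, would not have this collision.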

Links:

Clarity on units of string	https://github.com/WebAssembly/stringref/issues/65#top
Maxdamantus	https://github.com/Maxdamantus
on Oct 21, 2023	https://github.com/WebAssembly/stringref/issues/65#issue-1955235460
"What's a string?"	https://github.com/WebAssembly/stringref/blob/a64917cd5346f8704e614c4825ebf05737ac5e64/proposals/stringref/Overview.md?plain=1#L50
WTF-16	https://simonsapin.github.io/wtf-8/
rust-lang/rust#71033	https://github.com/rust-lang/rust/issues/71033