Title: Clarity on units of string · Issue #65 · WebAssembly/stringref · GitHub
Description: I'll start by saying that I'm not well-versed in WebAssembly specifications or proposals, but I happened to come across this proposal, and I'm quite interested in Unicode strings and how they're represented in different programming langu...
Opengraph URL: https://github.com/WebAssembly/stringref/issues/65
Author: Maxdamantus (https://github.com/Maxdamantus)
Date Published: 2023-10-21T02:36:31.000Z
Comments: 16

I'll start by saying that I'm not well-versed in WebAssembly specifications or proposals, but I happened to come across this proposal, and I'm quite interested in Unicode strings and how they're represented in different programming languages and VMs.

This might be just a wording issue, but I think the current description of ["What's a string?"](https://github.com/WebAssembly/stringref/blob/a64917cd5346f8704e614c4825ebf05737ac5e64/proposals/stringref/Overview.md?plain=1#L50) could be problematic:

> Therefore **we define a string to be a sequence of unicode scalar values and isolated surrogates**. The code units of a Java or JavaScript string can be interpreted to encode such a sequence, in the [WTF-16](https://simonsapin.github.io/wtf-8/) encoding form.

If this is taken to mean that a string is a sequence of Unicode code points (ie, "Unicode scalar values" and "isolated surrogates", basically any integer from `0x0` to `0x10FFFF`), this does not correspond with "WTF-16" or JavaScript strings, since there are sequences of isolated surrogates that can't be distinctly encoded in WTF-16.

eg, the sequence `[U+D83D, U+DCA9]` (high surrogate, low surrogate) doesn't have an encoding form in WTF-16. Interpreting the WTF-16/UTF-16 code unit sequence `<D83D DCA9>` produces the sequence `[U+1F4A9]` (a Unicode scalar value, `💩`). There are 1048576 occurrences of such sequences (one for every code point outside of the BMP), where the "obvious" encoding is already used to encode a USV.

```js
> "\uD83D\uDCA9"
'💩'
> "\u{1F4A9}"
'💩'
> [..."\uD83D\uDCA9"].length
1
> [..."\u{1F4A9}"].length
1
```

Note that this is different to how strings work in Python 3, where a string is indeed a sequence of any Unicode code points:
```python
>>> "\U0000D83D\U0000DCA9"
'\ud83d\udca9'
>>> "\U0001F4A9"
'💩'
>>> len("\U0000D83D\U0000DCA9")
2
>>> len("\U0001F4A9")
1
```

If this proposal is suggesting that strings work the same way as in Python 3, I think implementations will likely[0] resort to using UTF-32 in some cases, as I believe Python implementations do (I think Python implementations usually use UTF-32[1] essentially for all strings, though they will switch between using 8-bit, 16-bit and 32-bit arrays depending on the range of code points used). Other than Python, I'm not actually sure what language implementations would benefit from such a string representation.

As a side note, the section in question also lumps together Python (presumably 3) and Rust, though this might result from a misunderstanding that should hopefully be explained above. Rust strings are meant to be[2] valid UTF-8, hence they correspond to sequences of Unicode scalar values, but as explained above, Python strings can be any distinct sequence of Unicode code points.

---

[0] Another alternative would be using a variation of WTF-8 that preserves UTF-16 surrogates instead of normalising them to USVs on concatenation, though this seems a bit crazy.

[1] Technically this is an extension of UTF-32, since UTF-32 itself doesn't allow encoding of code points in the surrogate range.

[2] This is at least true at some API level, though technically the representation of strings in Rust is allowed to contain arbitrary bytes, where it is up to libraries to avoid emitting invalid UTF-8 to safe code: https://github.com/rust-lang/rust/issues/71033
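
For readers who want to see the collision concretely, here is a minimal Python sketch of the point made in the issue. It is an illustration under stated assumptions, not part of the stringref proposal or of any library: the helper names `encode_wtf16` and `decode_wtf16` are hypothetical, and they simply apply the usual UTF-16 surrogate arithmetic while passing lone surrogates through unchanged.

```python
# Hypothetical helpers (not from the proposal or any library): a naive
# code-point <-> 16-bit code-unit mapping in the style of WTF-16.

def encode_wtf16(code_points):
    """Map code points (surrogates allowed) to a list of 16-bit code units."""
    units = []
    for cp in code_points:
        if cp > 0xFFFF:
            # Supplementary code point: split into a surrogate pair.
            offset = cp - 0x10000
            units.append(0xD800 + (offset >> 10))
            units.append(0xDC00 + (offset & 0x3FF))
        else:
            # BMP code point, including a lone surrogate: emitted as-is.
            units.append(cp)
    return units


def decode_wtf16(units):
    """Map 16-bit code units back to code points, pairing surrogates."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # A high surrogate followed by a low one always reads as one
            # supplementary code point.
            low = units[i + 1]
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            code_points.append(unit)
            i += 1
    return code_points


scalar_value = [0x1F4A9]          # one code point: U+1F4A9
two_surrogates = [0xD83D, 0xDCA9]  # two code points: high, then low surrogate

# Both code-point sequences map to the same code units ...
same_units = encode_wtf16(scalar_value) == encode_wtf16(two_surrogates)
# ... so the two-surrogate sequence does not survive a round trip.
round_trip = decode_wtf16(encode_wtf16(two_surrogates))
# Number of (high, low) pairs whose "obvious" encoding is already taken.
colliding_pairs = (0xDBFF - 0xD800 + 1) * (0xDFFF - 0xDC00 + 1)
```

Under these assumptions, `same_units` is `True`, `round_trip` comes back as `[0x1F4A9]` rather than the original two code points, and `colliding_pairs` works out to 1024 × 1024 = 1048576, the figure quoted in the issue. A reversed pair such as `[0xDCA9, 0xD83D]` round-trips unchanged, since a low surrogate followed by a high one never forms a pair.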