Title: lxml doesn’t like control characters · Issue #96 · html5lib/html5lib-python · GitHub
Description: Same issue as #33, but with other non-whitespace C0 control characters: U+0001 to U+0008, U+000B, U+000C, U+000E to U+001F. Each of these triggers the exception below: html5lib.parse('<p>&#1;', treebuilder='lxml') html5lib.parse('<p>\x01'...
Opengraph URL: https://github.com/html5lib/html5lib-python/issues/96
{"@context":"https://schema.org","@type":"DiscussionForumPosting","headline":"lxml doesn’t like control characters","articleBody":"Same issue as #33, but with other non-whitespace C0 control characters: U+0001 to U+0008, U+000B, U+000C, U+000E to U+001F.\n\nEach of these triggers the exception below:\n\n```\nhtml5lib.parse('\u003cp\u003e\u0026#1;', treebuilder='lxml')\nhtml5lib.parse('\u003cp\u003e\\x01', treebuilder='lxml')\nhtml5lib.parse('\u003cp id=\"\u0026#1;\"\u003e', treebuilder='lxml')\nhtml5lib.parse('\u003cp id=\"\\x01\"\u003e', treebuilder='lxml')\n```\n\n```\nTraceback (most recent call last):\n File \"/tmp/a.py\", line 4, in \u003cmodule\u003e\n html5lib.parse('\u003cp\u003e\u0026#1;', treebuilder='lxml')\n File \"/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py\", line 28, in parse\n return p.parse(doc, encoding=encoding)\n File \"/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py\", line 224, in parse\n parseMeta=parseMeta, useChardet=useChardet)\n File \"/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py\", line 93, in _parse\n self.mainLoop()\n File \"/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py\", line 183, in mainLoop\n new_token = phase.processCharacters(new_token)\n File \"/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/html5parser.py\", line 991, in processCharacters\n self.tree.insertText(token[\"data\"])\n File \"/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/treebuilders/_base.py\", line 320, in insertText\n parent.insertText(data)\n File \"/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/treebuilders/etree_lxml.py\", line 240, in insertText\n builder.Element.insertText(self, data, insertBefore)\n File \"/home/simon/.virtualenvs/weasyprint/lib/python3.3/site-packages/html5lib/treebuilders/etree.py\", line 108, in insertText\n self._element.text += data\n File \"lxml.etree.pyx\", line 921, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:41467)\n File \"apihelpers.pxi\", line 652, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:18888)\n File \"apihelpers.pxi\", line 1335, in lxml.etree._utf8 (src/lxml/lxml.etree.c:24701)\nValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters\n```\n\nU+000C in text (but not in attribute values) is replaced by U+0020 with a warning:\n\n```\nDataLossWarning: Text cannot contain U+000C\n```\n\nlibxml2’s HTML parser replaces them with nothing, which I slightly prefer. Anyway, this is probably what should happen for every character that lxml doesn’t like.\n","author":{"url":"https://github.com/SimonSapin","@type":"Person","name":"SimonSapin"},"datePublished":"2013-07-22T10:17:29.000Z","interactionStatistic":{"@type":"InteractionCounter","interactionType":"https://schema.org/CommentAction","userInteractionCount":26},"url":"https://github.com/html5lib/html5lib-python/issues/96"}
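The behaviour suggested at the end of the report (dropping every character lxml rejects, as libxml2's HTML parser does) could be approximated on the caller's side with a small pre-filter. The following is only a sketch of that idea, not html5lib's actual fix: the helper name `strip_c0_controls` and its `replacement` parameter are hypothetical, and it covers only the code points listed in this issue (U+0000 is left to #33).

```python
import re

# Non-whitespace C0 control characters listed in the issue:
# U+0001-U+0008, U+000B, U+000C, U+000E-U+001F.
# Tab (U+0009), LF (U+000A) and CR (U+000D) are deliberately kept.
_C0_CONTROLS = re.compile(r"[\x01-\x08\x0b\x0c\x0e-\x1f]")


def strip_c0_controls(text, replacement=""):
    """Return `text` with the listed control characters replaced.

    Hypothetical helper: the default drops the characters entirely
    (the libxml2-like behaviour the reporter prefers); pass " " to
    substitute a space instead.
    """
    return _C0_CONTROLS.sub(replacement, text)


if __name__ == "__main__":
    cleaned = strip_c0_controls("<p id=\"\x01\">a\x0cb\tc</p>")
    print(repr(cleaned))
```

A filter like this only affects literal control characters in the input string; a numeric character reference such as `&#1;` is resolved later by the tokenizer, so it would still need handling inside the tree builder.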