Title: validators.url fails any URL whose FQDN includes consecutive hyphens (e.g. IDNA A-labels) · Issue #78 · python-validators/validators · GitHub
Description: As the title implies, validators.url chokes on URLs that contain a domain, hostname, or TLD with two or more consecutive hyphens. The issue is most troublesome when it involves URLs containing valid IDNs in A-label form: In [1]: import v...
Opengraph URL: https://github.com/python-validators/validators/issues/78
Author: nullripper (https://github.com/nullripper)
Date Published: 2018-04-04T01:51:24.000Z
Comments: 0

As the title implies, validators.url chokes on URLs that contain a domain, hostname, or TLD with two or more consecutive hyphens. The issue is most troublesome when it involves URLs containing valid IDNs in A-label form:

```
In [1]: import validators
In [2]: validators.url('http://xn--j1ail.xn--p1ai')
Out[2]: ValidationFailure(func=url, args={'public': False, 'value': 'http://xn--j1ail.xn--p1ai'})
```

This failure is caused by the fact that the regex for validators.url only allows for repetition of hyphens as part of larger groups within the host and domain name sections. These groups must begin with a non-hyphen character, thus preventing sequential hyphens. For the TLD section no such group even exists; hyphens aren't permitted at all. The relevant portion of the regex is found on lines 36-41 of url.py:

```
# host name
u"(?:(?:[a-z\u00a1-\uffff0-9]-?)*[a-z\u00a1-\uffff0-9]+)"
# domain name
u"(?:\.(?:[a-z\u00a1-\uffff0-9]-?)*[a-z\u00a1-\uffff0-9]+)*"
# TLD identifier
u"(?:\.(?:[a-z\u00a1-\uffff]{2,}))"
```

The issue also occurs when processing URLs of valid domains that have consecutive hyphens in their name. While such domain names are less common and may be frowned upon by certain registries, they are still technically valid according to the RFC. Here are the dig and whois results for one such domain:

```
; <<>> DiG 9.10.3-P4-Ubuntu <<>> @8.8.8.8 online--trading.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 31443
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;online--trading.com.		IN	A

;; ANSWER SECTION:
online--trading.com.	899	IN	A	195.110.124.133

;; Query time: 167 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Tue Apr 03 15:03:25 PDT 2018
;; MSG SIZE rcvd: 64
```

```
Domain Name: ONLINE--TRADING.COM
Registry Domain ID: 2171387112_DOMAIN_COM-VRSN
Registrar WHOIS Server: whois.register.it
Registrar URL: http://www.register.it
Updated Date: 2017-10-06T18:54:58Z
Creation Date: 2017-10-06T18:54:58Z
Registry Expiry Date: 2018-10-06T18:54:58Z
Registrar: Register.it SPA
Registrar IANA ID: 168
Registrar Abuse Contact Email: abuse@register.it
Registrar Abuse Contact Phone: +39.5520021555
Domain Status: ok https://icann.org/epp#ok
Name Server: NS1.REGISTER.IT
Name Server: NS2.REGISTER.IT
DNSSEC: unsigned
URL of the ICANN Whois Inaccuracy Complaint Form: https://www.icann.org/wicf/
```

It's arguable whether domains like this _should_ pass validators.url since they're somewhat of an edge case for everyday users. It may not be worth letting potentially erroneous URLs through just to prevent a few oddball domains from failing validation. The IDNA A-labels are a different story, though: those should absolutely pass without requiring the user to convert them beforehand. Python's built-in IDNA decoder cannot properly convert IDNA domains that are contained within URLs, so it's fairly onerous to expect the user to do that before using validators.url.

Modifying the regex to match anything that follows the IDNA A-label format is not an ideal solution since invalid A-labels can be generated using valid characters (e.g. "xn--aaaa"). Since the existing regex already checks for the Unicode characters used by IDNA U-labels, I think the ideal solution would be to isolate and convert possible IDNA hostnames before reassembling the URL and matching it against the existing regex. I've made a version of url.py that should make this fairly painless; expect my PR shortly.
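To make the two points above concrete, here is a minimal sketch. It is not the code from the promised PR, which is not shown in this issue. `LABEL` is a hypothetical standalone copy of the host-label fragment quoted from url.py, and `decode_idna_host` is a hypothetical helper name. The sketch assumes Python 3 and relies on the standard-library `idna` codec, which implements IDNA 2003 and may therefore treat some labels differently from an IDNA 2008 library.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical standalone copy of the host-label fragment quoted in the issue.
# Every optional hyphen must follow a non-hyphen character, so "--" can never match.
LABEL = re.compile("(?:[a-z\u00a1-\uffff0-9]-?)*[a-z\u00a1-\uffff0-9]+", re.IGNORECASE)


def decode_idna_host(url):
    """Return url with any decodable A-labels in its host replaced by U-labels.

    Hypothetical helper: isolates the host, converts each "xn--" label with the
    built-in idna codec, and reassembles the URL for the existing regex to check.
    """
    parts = urlsplit(url)
    userinfo, at, hostport = parts.netloc.rpartition("@")
    if not hostport or hostport.startswith("["):
        return url  # no host, or an IPv6 literal: nothing to convert
    host, colon, port = hostport.partition(":")

    labels = []
    for label in host.split("."):
        if label.lower().startswith("xn--"):
            try:
                label = label.encode("ascii").decode("idna")
            except UnicodeError:
                # Undecodable A-label: leave it untouched so that the
                # unchanged regex still rejects its consecutive hyphens.
                pass
        labels.append(label)

    netloc = userinfo + at + ".".join(labels) + colon + port
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(LABEL.fullmatch("xn--p1ai"))        # None: the fragment rejects "--"
    print(LABEL.fullmatch("online-trading"))  # a match object: single hyphens pass
    print(decode_idna_host("http://xn--j1ail.xn--p1ai"))
```

The result of `decode_idna_host` would then be passed to `validators.url` in place of the original string. Labels that fail to decode are deliberately left as they are, so the sketch does not widen what the validator accepts beyond labels the codec can convert.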