Title: Don't choke on (legitimately) invalidly encoded Unicode paths by nvie · Pull Request #467 · gitpython-developers/GitPython · GitHub
Description: We've come across path names that contain bytes that are invalid in UTF-8 encoded strings, although such paths are very rare. My assumption is that these commits were created by an old (buggy?) version of Git, and the data now lives in the tree objects. Since we return only unicode strings for the a_path and b_path properties, we cannot decode such a path and thus choke when asking for the diff.

This PR fixes that by using "replace" semantics when decoding. This effectively replaces the illegal byte \200 (or \x80) with \ufffd (= �).

Follow-up discussion

However, this also means that if you want to git-blame this file, there is no good way of referencing the path, since it is inherently a bytes path. Normally, when we pass unicode paths to git-blame via GitPython's blame API, the paths are encoded as UTF-8 right before the external command is issued. But there is no way of getting the original bytes back after the "replace" operation has happened.

Example:

- The input path is b'illegal-\x80.txt' (containing the illegal byte \x80).
- Decoding it as UTF-8 with characters replaced gives the unicode string u'illegal-\ufffd.txt' (= "illegal-�.txt").
- Encoding that string as UTF-8 gives b'illegal-\xef\xbf\xbd.txt'.
- When we then pass illegal-\xef\xbf\xbd.txt to git-blame, it will not be able to find this path.

Perhaps it would be a good idea to not only return the decoded path strings, but also provide access to the raw bytes found, e.g. by exposing a_rawpath and b_rawpath, which would always be bytes? That way, you could still have the friendly "unicode paths" for most use cases, but use bytes if you need to speak the language of Git more accurately.
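
The lossy round trip described in the example can be reproduced with nothing but Python's built-in bytes and str methods. The sketch below is illustrative only: it makes no GitPython calls, and the `DecodedPath` type and `decode_path` helper are hypothetical names invented here to show the "keep the raw bytes next to the friendly string" idea, not the API this PR adds.

```python
from typing import NamedTuple

# Example path from the description: \x80 on its own is not valid UTF-8.
raw_path = b"illegal-\x80.txt"

# Strict decoding raises, which is the failure the PR describes.
try:
    raw_path.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# "replace" semantics substitute U+FFFD for the undecodable byte.
friendly = raw_path.decode("utf-8", errors="replace")

# Re-encoding produces three bytes (EF BF BD) where one byte (80) used to be,
# so the original on-disk path can no longer be recovered from the string.
round_tripped = friendly.encode("utf-8")


class DecodedPath(NamedTuple):
    """Hypothetical container: a display string plus the untouched bytes."""
    path: str
    rawpath: bytes


def decode_path(raw: bytes) -> DecodedPath:
    """Hypothetical helper: decode leniently but keep the original bytes."""
    return DecodedPath(raw.decode("utf-8", errors="replace"), raw)


pair = decode_path(raw_path)

print(strict_ok)                  # False
print(round_tripped)              # b'illegal-\xef\xbf\xbd.txt'
print(round_tripped == raw_path)  # False
print(pair.rawpath == raw_path)   # True
```

With a pair like this, the string form can be shown to users while the bytes form is what would be handed to an external Git command.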
Opengraph URL: https://github.com/gitpython-developers/GitPython/pull/467