On Sat, Apr 25, 2020 at 8:32 PM Sam Eiderman <sameid(a)google.com> wrote:
Hi Nir,
I think latin1,
How do you think we should handle latin1 errors then? Replace on latin1 or replace on
utf-8?
Decoding from latin1 never fails, because every byte value maps to a character. If the
data was actually in some other encoding, you get junk instead of an error.
For example, the name "Jörgen":
>> "Jörgen".encode("utf-8")
b'J\xc3\xb6rgen'
If the data happens to be "utf-8", we will decode it successfully:
>> b'J\xc3\xb6rgen'.decode("utf-8")
'Jörgen'
But if the data was "latin1":
>> "Jörgen".encode("latin1")
b'J\xf6rgen'
Replacing will give:
>> b'J\xf6rgen'.decode("utf-8", errors="replace")
'J�rgen'
Falling back to "latin1":
>> b'J\xf6rgen'.decode("latin1")
'Jörgen'
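
A minimal sketch of that fallback as a function. The name decode_with_fallback and
its parameters are hypothetical, made up for illustration only:

```python
def decode_with_fallback(data, primary="utf-8", fallback="latin1"):
    """Decode bytes as `primary`; if that fails, decode as `fallback`."""
    try:
        return data.decode(primary)
    except UnicodeDecodeError:
        # latin1 maps every byte to a character, so this branch cannot
        # raise, but the result is junk if the data was not really latin1.
        return data.decode(fallback)
```

With the bytes above, decode_with_fallback(b'J\xc3\xb6rgen') and
decode_with_fallback(b'J\xf6rgen') both return 'Jörgen'.
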
But note what happens if the data was not latin1, for example cp1255 (Hebrew Alef):
>> "\u05d0".encode("cp1255")
b'\xe0'
Fallback to "latin1" will succeed, returning junk:
>> b"\xe0".decode("latin1")
'à'
Instead of the actual value:
>> b"\xe0".decode("cp1255")
'א'
Falling back to latin1 makes sense only if we know that the relevant data is usually
encoded in latin1.
You can check if this gives better results for your use case.
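
One related point, sketched here on the assumption that the real encoding becomes
known later: latin1 decoding is lossless, so junk from the fallback can be turned
back into the original bytes and decoded again. The variable names are illustrative
only:

```python
# latin1 maps each of the 256 byte values to one code point,
# so decoding never raises and encoding restores the exact bytes.
all_bytes = bytes(range(256))
text = all_bytes.decode("latin1")
assert text.encode("latin1") == all_bytes

# Junk from a wrong latin1 fallback can be repaired later,
# once the real encoding (cp1255 here) is known.
junk = b"\xe0".decode("latin1")
repaired = junk.encode("latin1").decode("cp1255")
```

This does not work with errors="replace", because the replacement character
discards the original byte.
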
Nir