Hi Nir,
I think latin1.

How do you think we should handle latin1 errors then? Replace on latin1 or replace on utf-8?

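def decode_guest_string(b):
    # Try strict decoding with each codec in turn.
    for codec in ["utf-8", "latin1"]:
        try:
            return b.decode(codec)
        except UnicodeDecodeError:
            pass
    # Last resort: never fail, mark undecodable bytes with U+FFFD.
    return b.decode("utf-8", errors="replace")

(Sketch in Python, will be implemented in C)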
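
One thing worth checking before writing the C version: as far as I can tell, latin1 maps every byte value 0x00-0xFF to a code point, so the latin1 step in the loop above can never raise and the final errors="replace" line should be unreachable. A minimal sketch of that property, where is_valid_utf8 is only an illustrative helper and not an existing function anywhere:

```python
# Every possible byte value decodes under latin1, to the code point
# with the same number, so a latin1 decode cannot raise.
for value in range(256):
    assert ord(bytes([value]).decode("latin1")) == value


def is_valid_utf8(b: bytes) -> bool:
    """Illustrative helper: True if b decodes as strict utf-8."""
    try:
        b.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Strict utf-8, by contrast, rejects many byte sequences.
assert is_valid_utf8("caf\u00e9".encode("utf-8"))
assert not is_valid_utf8(b"caf\xe9")  # lone 0xE9 is not valid utf-8
```

If that holds, the only real choice is between "utf-8 then latin1" and "utf-8 with replace"; there is no separate "replace on latin1" case.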



On Thu, Apr 23, 2020, 21:34 Nir Soffer <nsoffer@redhat.com> wrote:
On Mon, Apr 20, 2020 at 3:38 PM Sam Eiderman <sameid@google.com> wrote:
>
> The python3 bindings create unicode objects from application strings
> on the guest (i.e. installed rpm, deb packages).
> It is documented that rpm package fields such as description should be
> utf8 encoded - however in some cases they are not a valid unicode
> string,

So what are they? latin1 maybe?

Maybe use:

    try:
        value.decode("utf-8")
    except UnicodeDecodeError:
        value.decode("latin1")

This will always succeed, possibly producing garbage output, but so
does errors='replace'.
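
The two kinds of garbage are not quite equivalent, though. A rough sketch, using a made-up sample string rather than data from a real package: the latin1 fallback keeps a one-to-one mapping back to the original bytes, while errors='replace' collapses each invalid byte into U+FFFD.

```python
# Hypothetical sample: "Grüße" encoded as latin1, which is not valid utf-8.
raw = "Gr\u00fc\u00dfe".encode("latin1")

try:
    text = raw.decode("utf-8")
except UnicodeDecodeError:
    text = raw.decode("latin1")

replaced = raw.decode("utf-8", errors="replace")

# The latin1 result can be turned back into the exact original bytes.
assert text.encode("latin1") == raw

# The "replace" result cannot: the invalid bytes became U+FFFD.
assert "\ufffd" in replaced
assert replaced.encode("utf-8") != raw
```

So the fallback is lossless even when the guess about the encoding is wrong, at the cost of possibly showing mojibake instead of an obvious replacement marker.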

> on SLES11 SP4 the following packages fail to be converted to
> unicode using guestfs_int_py_fromstring() (which invokes
> PyUnicode_FromString()):
>
>  PackageKit
>  aaa_base
>  coreutils
>  dejavu
>  desktop-data-SLED
>  gnome-utils
>  hunspell
>  hunspell-32bit
>  hunspell-tools
>  libblocxx6
>  libexif
>  libgphoto2
>  libgtksourceview-2_0-0
>  libmpfr1
>  libopensc2
>  libopensc2-32bit
>  liborc-0_4-0
>  libpackagekit-glib10
>  libpixman-1-0
>  libpixman-1-0-32bit
>  libpoppler-glib4
>  libpoppler5
>  libsensors3
>  libtelepathy-glib0
>  m4
>  opensc
>  opensc-32bit
>  permissions
>  pinentry
>  poppler-tools
>  python-gtksourceview
>  splashy
>  syslog-ng
>  tar
>  tightvnc
>  xorg-x11
>  xorg-x11-xauth
>  yast2-mouse
>
> Fix this by globally changing guestfs_int_py_fromstring()
> and guestfs_int_py_fromstringsize() to decode utf-8 with the "replace"
> error handler:
>
> https://docs.python.org/3/library/codecs.html#error-handlers
>
> For example, this will decode PackageKit's description on SLES11 SP4
> the following way:
>
>     Backend: pisi
>         S.�ağlar Onur <caglar@pardus.org.tr>

What is the original text?
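
The raw header bytes are not shown in this thread, but one guess consistent with the output above is that the name is "S.Çağlar Onur", with the Ç stored as a single latin1/ISO-8859-9 byte (0xC7) and the ğ stored as valid utf-8 (0xC4 0x9F). Under that assumption, which is not verified against the actual package, the "replace" handler would produce exactly one U+FFFD:

```python
# Assumed byte layout (not verified): Ç as one byte 0xC7, ğ as utf-8 0xC4 0x9F.
raw = b"S.\xc7a\xc4\x9flar Onur <caglar@pardus.org.tr>"

decoded = raw.decode("utf-8", errors="replace")

# Only the lone 0xC7 is invalid utf-8, so only it is replaced.
assert decoded.count("\ufffd") == 1
print(decoded)
```

Dumping the real bytes of the description field would confirm or refute this.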

Nir

> Signed-off-by: Sam Eiderman <sameid@google.com>
> ---
>  python/handle.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/python/handle.c b/python/handle.c
> index 2fb8c18f0..427424707 100644
> --- a/python/handle.c
> +++ b/python/handle.c
> @@ -387,7 +387,7 @@ guestfs_int_py_fromstring (const char *str)
>  #if PY_MAJOR_VERSION < 3
>    return PyString_FromString (str);
>  #else
> -  return PyUnicode_FromString (str);
> +  return PyUnicode_Decode(str, strlen(str), "utf-8", "replace");
>  #endif
>  }
>
> @@ -397,7 +397,7 @@ guestfs_int_py_fromstringsize (const char *str, size_t size)
>  #if PY_MAJOR_VERSION < 3
>    return PyString_FromStringAndSize (str, size);
>  #else
> -  return PyUnicode_FromStringAndSize (str, size);
> +  return PyUnicode_Decode(str, size, "utf-8", "replace");
>  #endif
>  }
>
> --
> 2.26.1.301.g55bc3eb7cb9-goog