[Libguestfs] hivex: some issues (key encoding, ...) and suggested fixes

Saturday, 26 February 2011

Hi,

libhivex seems to do a great job at parsing hives most of the time, but
there are some issues with a few registry keys.

These can be worked around in the application that uses libhivex, but I
think it'd be better if libhivex handled these itself.

1. UTF16 string in REG_SZ that has garbage after the \0\0

There is code in hivex.c to handle this already but I think it has a typo:

  /* Deal with the case where Windows has allocated a large buffer
   * full of random junk, and only the first few bytes of the buffer
   * contain a genuine UTF-16 string.
   *
   * In this case, iconv would try to process the junk bytes as UTF-16
   * and inevitably find an illegal sequence (EILSEQ).  Instead, stop
   * after we find the first \0\0.
   *
   * (Found by Hilko Bengen in a fresh Windows XP SOFTWARE hive).
   */
  size_t slen = utf16_string_len_in_bytes_max (data, len);
  if (slen > len)
    len = slen;

  char *ret = windows_utf16_to_utf8 (data, len);

slen is only ever used to increase len, but I think it should decrease
it, so that the conversion stops at the first \0\0. In other words, the
comparison should probably be slen < len.
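
To make the intended behaviour concrete, here is a minimal sketch of
finding where the genuine string ends. The helper name
utf16_bytes_before_nul is made up for this example and is not hivex's
own function; the sketch also assumes the terminator sits at an even
byte offset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of hivex: return the number of bytes
 * that precede the first UTF-16 NUL code unit (00 00 at an even
 * offset).  If there is no terminator, return len rounded down to a
 * whole number of 16-bit code units.  The result never exceeds len.
 */
static size_t
utf16_bytes_before_nul (const char *data, size_t len)
{
  size_t i;

  assert (data != NULL || len == 0);

  for (i = 0; i + 1 < len; i += 2) {
    if (data[i] == 0 && data[i + 1] == 0)
      return i;                 /* stop here; the rest may be junk */
  }
  return len & ~(size_t) 1;     /* no terminator found */
}
```

The returned length, rather than the full buffer length, would then be
passed to the UTF-16 to UTF-8 conversion, so the junk after the
terminator is never seen by iconv.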

Example key where problem occurs:
software\Microsoft\MediaPlayer\Preferences> lsval
hivexsh: lsval: Invalid or incomplete multibyte or wide character
"MyPlayLists"=software\Microsoft\MediaPlayer\Preferences>

The same happens with the LcnStartLocation value under
HKLM\\SOFTWARE\\Microsoft\\Dfrg\\BootOptimizeFunction (it starts with 30
00 00 00, followed by some garbage).

Printing the value with value_value shows the following, which would be
fine if hivex stopped parsing after the first 00 00:
43 00 3A 00 5C 00 44 00 6F 00 63 00 75 00 6D 00 65 00 6E 00 74 00 73 00
20 00 61 00 6E 00 64 00 20 00 53 00 65 00 74 00 74 00 69 00 6E 00 67 00
73 00 5C 00 41 00 6C 00 6C 00 20 00 55 00 73 00 65 00 72 00 73 00 5C 00
44 00 6F 00 63 00 75 00 6D 00 65 00 6E 00 74 00 69 00 5C 00 4D 00 75 00
73 00 69 00 63 00 61 00 5C 00 53 00 61 00 6D 00 70 00 6C 00 65 00 20 00
50 00 6C 00 61 00 79 00 6C 00 69 00 73 00 74 00 73 00 00 00 64 F7 06 00
2E 40 92 7C A8 20 08 00 3C F5 06 00 70 09 92 7C C0 E4 98 7C EF 40 92 7C
BB 40 92 7C 04 01 00 00 00 DC FD 7F 00 00 00 00 02 00 00 00 39 00 00 00
C8 05 92 7C 90 97 08 00 00 00 00 00 08 00 0A 00 88 3E 92 7C 1A 02 00 00
00 00 00 00 98 97 08 00 F8 81 5D 77 B8 1B 09 00 6A 00 00 00 00 00 00 00
E0 1B 09 00 5C 01 08 00 6A 00 6C 00 00 DC FD 7F 3C F5 06 00 02 00 00 00
A0 20 08 00 60 00 00 01 43 00 3A 00 5C 00 44 00 6F 00 63 00 75 00 6D 00
65 00 6E 00 74 00 73 00 20 00 61 00 6E 00 64 00 20 00 53 00 65 00 74 00
74 00 69 00 6E 00 67 00 73 00 5C 00 41 00 6C 00 6C 00 20 00 55 00 73 00
65 00 72 00 73 00 5C 00 44 00 61 00 74 00 69 00 20 00 61 00 70 00 70 00
6C 00 69 00 63 00 61 00 7A 00 69 00 6F 00 6E 00 69 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Workaround: I use value_value if value_string fails

2. Non-ascii node names

I found a node with a \xDC (Ü) in it:
SOFTWARE\\ODBC\\ODBCINST.INI\\MS Code Page-\xDCbersetzer

hivex.c has a comment like this:
  /* AFAIK the node name is always plain ASCII, so no conversion
   * to UTF-8 is necessary.  However we do need to nul-terminate
   * the string.
   */

I think hivex should convert the node names from CP1252 (or is it
ISO-8859-1?) to UTF-8.

Workaround: I do the CP1252 -> UTF8 conversion myself for now
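
For illustration, a minimal sketch of such a conversion, assuming the
names are ISO-8859-1 (which I have not confirmed). CP1252 differs from
ISO-8859-1 only in the 0x80-0x9F range, so a real CP1252 conversion
would need a lookup table for those bytes or iconv. The function name
latin1_to_utf8 is made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: convert len bytes of ISO-8859-1 into a newly
 * allocated, NUL-terminated UTF-8 string.  Returns NULL if malloc
 * fails.  The caller must free the result.
 */
static char *
latin1_to_utf8 (const char *in, size_t len)
{
  char *out, *p;
  size_t i;

  assert (in != NULL || len == 0);

  /* Each input byte becomes at most two output bytes. */
  out = malloc (2 * len + 1);
  if (out == NULL)
    return NULL;

  p = out;
  for (i = 0; i < len; i++) {
    unsigned char c = (unsigned char) in[i];

    if (c < 0x80) {
      *p++ = (char) c;                    /* ASCII is unchanged */
    } else {
      *p++ = (char) (0xC0 | (c >> 6));    /* lead byte: 110000xx */
      *p++ = (char) (0x80 | (c & 0x3F));  /* continuation: 10xxxxxx */
    }
  }
  *p = '\0';
  return out;
}
```

With this, the \xDC byte from the node name above becomes the two
bytes C3 9C, which is the UTF-8 encoding of Ü.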

3. node_get_child is slow

This is a documentation issue: the docs should say that node_get_child
is slow, because the registry has no index of child names and every
call does a linear search over the children.

Workaround: for each node I build a map from child names to child
nodes. A lookup in that map is faster than calling node_get_child
repeatedly.
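
A rough sketch of the idea, kept independent of libhivex so that it
stands alone: the (name, handle) pairs are assumed to have been
collected once from the node's children, and the struct and function
names here are made up for this example. It sorts the pairs and then
uses binary search, and it assumes case-insensitive matching is wanted,
since registry key names are case-insensitive.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <strings.h>            /* strcasecmp (POSIX) */

/* Hypothetical index entry: one per child of a node. */
struct child_entry {
  const char *name;             /* child name, owned by the caller */
  size_t node;                  /* child handle; 0 means "not found" */
};

static int
compare_entries (const void *a, const void *b)
{
  const struct child_entry *ea = a;
  const struct child_entry *eb = b;

  return strcasecmp (ea->name, eb->name);
}

/* Sort the entries once so that lookups can use binary search. */
static void
child_index_sort (struct child_entry *entries, size_t n)
{
  assert (entries != NULL || n == 0);

  if (n > 1)
    qsort (entries, n, sizeof entries[0], compare_entries);
}

/* Return the handle of the child called name, or 0 if there is none.
 * The entries must already have been sorted by child_index_sort.
 */
static size_t
child_index_lookup (const struct child_entry *entries, size_t n,
                    const char *name)
{
  struct child_entry key = { name, 0 };
  const struct child_entry *hit;

  if (n == 0)
    return 0;
  hit = bsearch (&key, entries, n, sizeof entries[0], compare_entries);
  return hit != NULL ? hit->node : 0;
}
```

Building the index costs one pass over the children plus a sort, so it
only pays off when the same node is queried many times.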

4. hivexml output is not well-formed XML

See problems #1 and #2: if value_string and node_name are fixed so that
they return valid UTF-8 instead of dumping binary garbage, then I think
hivexml's output would pass xmllint.

Best regards,
--Edwin
