On Sat, Feb 26, 2011 at 08:56:48PM +0200, Török Edwin wrote:
 Hi,
 
 libhivex seems to do a great job at parsing hives most of the time, but
 there are some issues with a few registry keys.
 
 These can be worked around in the application that uses libhivex, but I
 think it'd be better if libhivex handled these itself.
 
 1. UTF-16 string in REG_SZ that has garbage after the \0\0
 
 There is code in hivex.c to handle this already but I think it has a typo:
 
   /* Deal with the case where Windows has allocated a large buffer
    * full of random junk, and only the first few bytes of the buffer
    * contain a genuine UTF-16 string.
    *
    * In this case, iconv would try to process the junk bytes as UTF-16
    * and inevitably find an illegal sequence (EILSEQ).  Instead, stop
    * after we find the first \0\0.
    *
    * (Found by Hilko Bengen in a fresh Windows XP SOFTWARE hive).
    */
   size_t slen = utf16_string_len_in_bytes_max (data, len);
   if (slen > len)
     len = slen;
 
   char *ret = windows_utf16_to_utf8 (data, len);
 
 slen is only used to increase length of data, but I think it should be
 decreasing it (to stop earlier). 
Yes, it's strange -- this does appear to be a bug.  The test should
be slen < len, so that len is cut back to the first \0\0.
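
For reference, a minimal standalone sketch of what the check presumably
ought to do.  utf16_len_in_bytes_max below is my stand-in for the
helper in hivex.c (an assumption about its behaviour, not a copy), and
clamp_to_first_nul is a hypothetical name used only for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Length in bytes of the UTF-16LE string at the start of `data`, not
 * counting the terminating \0\0 and never exceeding `len`.
 * Stand-in for the hivex.c helper; assumed behaviour, not a copy.
 */
static size_t
utf16_len_in_bytes_max (const char *data, size_t len)
{
  assert (data != NULL);
  size_t ret = 0;

  /* Step over one 16-bit code unit at a time until \0\0 or the end. */
  while (len >= 2 && (data[0] != 0 || data[1] != 0)) {
    data += 2;
    len -= 2;
    ret += 2;
  }
  return ret;
}

/* Number of bytes to hand to the UTF-16 -> UTF-8 conversion:
 * truncate at the first \0\0, never extend past the buffer.
 */
static size_t
clamp_to_first_nul (const char *data, size_t len)
{
  size_t slen = utf16_len_in_bytes_max (data, len);
  if (slen < len)   /* the quoted code tests slen > len instead */
    len = slen;
  return len;
}
```

Since the helper never returns more than len, the slen > len branch in
the quoted code can never fire, which is why the junk still reaches
iconv.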
[...]
 2. Non-ascii node names
 
 I found a node with a \xDC (Ü) in it:
 SOFTWARE\\ODBC\\ODBCINST.INI\\MS Code Page-\xDCbersetzer
 
 hivex.c has a comment like this:
   /* AFAIK the node name is always plain ASCII, so no conversion
    * to UTF-8 is necessary.  However we do need to nul-terminate
    * the string.
    */
 
 I think hivex should convert the node names from CP1252 (or is it
 ISO-8859-1?) to UTF-8.
 
 Workaround: I do the CP1252 -> UTF-8 conversion myself for now
A patch for this was posted, but I didn't apply it because it seems
quite risky:
https://www.redhat.com/archives/libguestfs/2010-July/msg00064.html
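
For anyone needing the application-side workaround in the meantime,
here is a rough sketch using iconv(3).  It assumes the names really are
CP1252 (which is exactly the open question above), and cp1252_to_utf8
is a hypothetical helper, not part of the hivex API:

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Convert `len` bytes of CP1252 text to a newly allocated,
 * nul-terminated UTF-8 string.  Returns NULL on error (errno set).
 * The caller frees the result.  Hypothetical helper, not hivex API.
 */
static char *
cp1252_to_utf8 (const char *input, size_t len)
{
  assert (input != NULL);

  iconv_t ic = iconv_open ("UTF-8", "CP1252");
  if (ic == (iconv_t) -1)
    return NULL;

  /* A CP1252 byte needs at most 3 UTF-8 bytes (e.g. 0x80 -> U+20AC). */
  size_t outalloc = len * 3 + 1;
  char *out = malloc (outalloc);
  if (out == NULL) {
    iconv_close (ic);
    return NULL;
  }

  char *inp = (char *) input;   /* iconv's prototype is not const-correct */
  size_t inleft = len;
  char *outp = out;
  size_t outleft = outalloc - 1;

  if (iconv (ic, &inp, &inleft, &outp, &outleft) == (size_t) -1) {
    int err = errno;            /* preserve errno across cleanup */
    iconv_close (ic);
    free (out);
    errno = err;
    return NULL;
  }

  *outp = '\0';
  iconv_close (ic);
  return out;
}
```

Note that a few byte values are unassigned in CP1252, so the conversion
can fail on some inputs even when the encoding guess is right; the
caller has to decide what to do in that case.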
 3. node_get_child is slow
 
 Documentation issue: the docs should say that node_get_child is slow
 (the registry doesn't have an index, so each call does a linear search).
 
 Workaround: I create a map of node names to children of a node; a lookup
 in that is faster than calling node_get_child repeatedly
Agreed.
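
The workaround could look roughly like the sketch below: collect the
children of a node once, sort them by name, and binary-search after
that.  struct child_entry and both functions are hypothetical
application-side names, and the size_t handle is only a placeholder for
whatever the application keeps per node.  The comparison is
case-insensitive on the assumption that lookups should behave like
registry key names, which Windows treats case-insensitively.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <strings.h>

/* One entry per child: its name and an opaque handle (hypothetical
 * stand-in for whatever the application stores per node).
 */
struct child_entry {
  const char *name;
  size_t handle;
};

static int
compare_entries (const void *a, const void *b)
{
  const struct child_entry *ea = a, *eb = b;
  return strcasecmp (ea->name, eb->name);
}

/* Sort the entries once so that later lookups can use binary search. */
static void
child_index_build (struct child_entry *entries, size_t n)
{
  assert (entries != NULL || n == 0);
  if (n > 0)
    qsort (entries, n, sizeof entries[0], compare_entries);
}

/* Look up a child by name in O(log n).  Returns NULL if not found. */
static const struct child_entry *
child_index_find (const struct child_entry *entries, size_t n,
                  const char *name)
{
  struct child_entry key = { name, 0 };

  if (n == 0)
    return NULL;
  return bsearch (&key, entries, n, sizeof entries[0], compare_entries);
}
```

Building the index costs one pass over the children plus a sort, so it
only pays off when the same node is searched many times.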
 4. hivexml output is not well-formed XML
 
 See problems #1 and #2: if value_string and node_name are fixed so that
 they don't dump the binary garbage and just return UTF-8, then I think
 hivexml's output would pass xmllint.
Shoot or fix.
Rich.
-- 
Richard Jones, Virtualization Group, Red Hat 
http://people.redhat.com/~rjones
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  
http://libguestfs.org