2016-10-11 11:56 GMT+03:00 Pino Toscano <ptoscano@redhat.com>:
On Saturday, 8 October 2016 18:27:21 CEST Matteo Cafasso wrote:
> Patch ready for merging.
> v4:
>  - check return code of tsk_fs_attr_walk
>  - pass TSK_FS_FILE_WALK_FLAG_NOSPARSE as additional flag to
>  tsk_fs_attr_walk
> After discussing with TSK authors the behaviour is clear. [1]

Thanks, this improves the situation a bit.

> In case of COMPRESSED blocks, the callback will be called for all the
> attributes no matter whether they are on disk or not (sparse). In
> such cases, the block address will be 0. [2]

Note that the API docs say:
  For compressed and sparse attributes, the address *may* be zero.
(emphasis is mine)

My concern is that, if the address in such cases is "unspecified", then
the comparisons in "attrwalk_callback" are done against a
random/uninitialized value (which would be bad).

I understand your concerns. The data will not be wrong; it is the API documentation that is misleading.
The address *will* be 0 for SPARSE blocks and *might or might not* be 0 for compressed blocks, depending on certain criteria. See below.

Also, if the block address would be zero, what's the point of having it
among the blocks tsk_fs_attr_walk() iterates over?

This is due to the way NTFS organizes information and handles compression, and to the way the API loops over the resulting blocks.

For each file or directory, there is an MFT (Master File Table) record, typically 1 KB in size, which consists of a linear sequence of attributes.
Attributes can be resident within the MFT record or non-resident, according to their size. The $DATA attribute, which stores the actual file content, is a typical example of a non-resident one.

Non-resident attributes are stored on disk in what are referred to as data runs (contiguous blocks), which are then mapped within the attribute itself. A typical file larger than 800 bytes has a $DATA attribute containing a map of data runs with their locations on disk. If the map itself is too big for the $DATA attribute (this can happen if the content is heavily fragmented), then extra records are created and their mapping is placed in a special attribute called $ATTRIBUTE_LIST. [1]
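
To make the data-run map concrete, here is a minimal sketch in C. The struct data_run type and the run_map_lookup() helper are hypothetical names invented for illustration; they are neither the NTFS on-disk encoding nor anything from TSK. The sketch only shows how a file-relative block index could be resolved through an ordered list of contiguous runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified data run: one contiguous extent on disk.
 * This is NOT the real NTFS run-list encoding. */
struct data_run {
  uint64_t start;   /* first disk block of the run */
  uint64_t length;  /* number of contiguous blocks in the run */
};

/* Translate a file-relative block index into a disk block address by
 * walking the run map in order.  Returns 0 on success, or -1 if the
 * index lies beyond the last run. */
int
run_map_lookup (const struct data_run *runs, size_t nr_runs,
                uint64_t file_block, uint64_t *disk_block)
{
  assert (runs != NULL || nr_runs == 0);
  assert (disk_block != NULL);

  for (size_t i = 0; i < nr_runs; i++) {
    if (file_block < runs[i].length) {
      *disk_block = runs[i].start + file_block;
      return 0;
    }
    /* Not in this run: skip it and keep looking in the next one. */
    file_block -= runs[i].length;
  }

  return -1;
}
```

With two runs, one of 4 blocks starting at disk block 100 and one of 2 blocks starting at disk block 500, file block 4 would resolve to disk block 500: the more fragmented the file, the longer this map gets.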

When the given file is compressed (native NTFS compression, not application-level compression), the algorithm visits each data run within the attribute and: [2]
 1 if the data run is zero-filled, marks the corresponding blocks as sparse and sets their address to 0.
 2 if compressing the data run does not save any disk block, leaves it as is.
 3 if compressing the data run saves one or more blocks, marks the spared blocks as sparse as well and sets their address to 0.

Note that the entire attribute will be marked as compressed no matter what happened to the clusters on disk.
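
The three cases above can be sketched as a small C function. Everything here is an assumption made for illustration: the names (classify_run, struct run_layout, the RUN_* constants) are hypothetical, and the real NTFS driver obviously works on actual data rather than on a precomputed compressed size. The point is only to show how many blocks end up with a real address and how many are reported with address 0 in each case.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical outcome of compressing one data run. */
enum run_outcome {
  RUN_SPARSE,       /* case 1: zero-filled, nothing stored on disk */
  RUN_STORED_AS_IS, /* case 2: compression saved no block */
  RUN_COMPRESSED    /* case 3: compression saved at least one block */
};

struct run_layout {
  enum run_outcome outcome;
  uint64_t on_disk; /* blocks that keep a real disk address */
  uint64_t sparse;  /* blocks reported with address 0 */
};

/* run_blocks: size of the run in blocks (must be > 0).
 * zero_filled: non-zero if the run contains only zeroes.
 * compressed_blocks: blocks needed after compression (assumed input). */
struct run_layout
classify_run (uint64_t run_blocks, int zero_filled,
              uint64_t compressed_blocks)
{
  struct run_layout r;

  assert (run_blocks > 0);

  if (zero_filled) {
    r.outcome = RUN_SPARSE;
    r.on_disk = 0;
    r.sparse = run_blocks;
  } else if (compressed_blocks >= run_blocks) {
    r.outcome = RUN_STORED_AS_IS;
    r.on_disk = run_blocks;
    r.sparse = 0;
  } else {
    r.outcome = RUN_COMPRESSED;
    r.on_disk = compressed_blocks;
    r.sparse = run_blocks - compressed_blocks;
  }

  return r;
}
```

Note that case 3 is the mixed one: the same run yields both blocks with a valid address and blocks with address 0, which is why the documentation can only say the address "may" be zero.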

The logic loops through all non-resident attributes (which is what we want: we want all the disk blocks allocated for that file). For each attribute, it loops over all the blocks that the attribute maps and calls the callback.

Our issue is where the flag information originates: it might come from the block (BAD/ALLOC/UNALLOC) or from the file metadata (RAW, SPARSE, COMPRESSED, CONT, META). [3]

tsk_fs_attr_walk() walks over the given attribute's blocks. When we inspect attributes of compressed files, the flag will report the *file* status (COMPRESSED), yet it will not be able to tell us what the compression algorithm did (1, 2 or 3) to that block. It will still correctly give us the address: 0 if sparse (case 1 or 3) or the actual block number otherwise (case 2 or 3).

[1] https://en.wikipedia.org/wiki/NTFS#Attribute_lists.2C_attributes.2C_and_streams
[2] http://www.digital-evidence.org/fsfa/  -  Chapter 11
[3] http://www.sleuthkit.org/sleuthkit/docs/api-docs/4.2/tsk__fs_8h.html#a1e6bf157f5d258191bf5d8ae31ee7148  -  "Note that some of these are set only by file_walk because they are file-level details, such as compression and sparse."

Pino Toscano

Libguestfs mailing list