On Thu, Aug 8, 2019 at 5:05 PM Richard W.M. Jones <rjones@redhat.com> wrote:
On Mon, Mar 18, 2019 at 06:51:26PM +0200, Daniel Erez wrote:
> After invoking transfer_service.finalize, check operation
> status by examining DiskStatus.
> This is done instead of failing after a predefined timeout
> regardless the status.
>
> * not verified *
>
> Bug-Url: https://bugzilla.redhat.com/show_bug.cgi?id=1680361
> ---
>  v2v/rhv-upload-plugin.py | 19 +++++++++++++------
>  1 file changed, 13 insertions(+), 6 deletions(-)
>
> diff --git a/v2v/rhv-upload-plugin.py b/v2v/rhv-upload-plugin.py
> index 2a950c5ed..2cefe94b0 100644
> --- a/v2v/rhv-upload-plugin.py
> +++ b/v2v/rhv-upload-plugin.py
> @@ -523,16 +523,23 @@ def close(h):
>          # waiting for the transfer object to cease to exist, which
>          # falls through to the exception case and then we can
>          # continue.
> -        endt = time.time() + timeout
> +        disk_id = disk.id
> +        start = time.time()
>          try:
>              while True:
>                  time.sleep(1)
> -                tmp = transfer_service.get()
> -                if time.time() > endt:
> -                    raise RuntimeError("timed out waiting for transfer "
> -                                       "to finalize")
> +                disk_service = h['disk_service']
> +                disk = disk_service.get()
> +                if disk.status == types.DiskStatus.LOCKED:
> +                    if time.time() > start + timeout:
> +                        raise RuntimeError("timed out waiting for transfer "
> +                                           "to finalize")
> +                    continue
> +                if disk.status == types.DiskStatus.OK:
> +                    debug("finalized after %s seconds", time.time() - start)

Maybe this was the real error? (using , instead of %)
 
> +                    break
>          except sdk.NotFoundError:
> -            pass
> +            raise RuntimeError("transfer failed: disk %s not found" % disk_id)
>
>          # Write the disk ID file.  Only do this on successful completion.
>          with builtins.open(params['diskid_file'], 'w') as fp:

Ilanit tested this patch for us.  Unfortunately it fails for reasons I
don't really understand but seem to be deep inside the RHV API:

nbdkit: python[1]: error: /var/tmp/v2v.sZm0my/rhv-upload-plugin.py: close: error: ['Traceback (most recent call last):
', '  File "/var/tmp/v2v.sZm0my/rhv-upload-plugin.py", line 577, in close
', '  File "/var/tmp/v2v.sZm0my/rhv-upload-plugin.py", line 516, in delete_disk_on_failure

This looks like failure flow - but there is no info here on the real failure. We need to understand
what was that failure.
 
', '  File "/usr/lib64/python2.7/site-packages/ovirtsdk4/services.py", line 37334, in remove
    self._internal_remove(headers, query, wait)
', '  File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 271, in _internal_remove
    return future.wait() if wait else future
', '  File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 55, in wait
    return self._code(response)
', '  File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 268, in callback
    self._check_fault(response)
', '  File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 132, in _check_fault
    self._raise_error(response, body)
', '  File "/usr/lib64/python2.7/site-packages/ovirtsdk4/service.py", line 118, in _raise_error
    raise error
', 'Error: Fault reason is "Operation Failed". Fault detail is "[Cannot remove Virtual Disk. Related operation is currently in progress. Please try again later.]". HTTP response code is 409.
']

I think this error means that the disk is currently busy, probably because of the real error.

Ilanit, can you share complete virt-v2v and daemon logs?

There is a surprising behavior in the image transfer API - you create a disk, and start
a transfer session with that disk. However once you started a the transfer, it owns
the disk, and it will delete it in case of failures, so you don't need to delete it yourself.

I guess you need something like

    h["owns_disk"] = True

After creating the disk, and once the transfer was started, set it to False.
In your cleanup flow, delete the disk only if you still own it.

The other issue here is that the error during close is hiding the real error. This is
a common problem in python 2, and the solution is typically to catch all exceptions
in the cleanup flow and log them, and make sure the original error is raised and
logged by the top error handler.

Nir