On Tue, Sep 27, 2016 at 10:26:18AM +0200, Carl-Daniel Hailfinger wrote:
On 26.09.2016 20:40, Carl-Daniel Hailfinger wrote:
> On 26.09.2016 19:20, Carl-Daniel Hailfinger wrote:
>> On 26.09.2016 14:29, Richard W.M. Jones wrote:
>>> On Mon, Sep 26, 2016 at 02:18:02PM +0200, Carl-Daniel Hailfinger wrote:
>>>> Hi,
>>>>
>>>> has anyone ever run "make check" from nbd against nbdkit with
a python
>>>> plugin? I usually get segfaults during such a run, and sometimes
various
>>>> other errors happen before the segfault, suggesting that some memory
>>>> corruption is underway.
>>>> AFAICS a pure python plugin should not be able to cause memory
corruption.
>>> Correct, a python plugin should not cause memory corruption,
>>> and nbdkit shouldn't segfault ever.
>>>
>>> Did you get a stack trace from C (not from Python)?
>> The core files were useless, but I ran nbdkit in gdb and got something...
> And another one... my close function pickles all python objects which
> were accessed in the plugin. Maybe that's what exposes the memory
> corruption on close.
> [...]
> Will try instrumenting the close function next.
The results are in.
I dumped the reference counters for various python objects inside
py_close and often the reference counters are off-by-one or off-by-two
in the crashing case.
AFAICS the python API says that once you have threads, you have to do
the GIL dance:
https://docs.python.org/2/c-api/init.html#thread-state-and-the-global-int...
. All the crashes I saw can be explained by unsynchronized accesses to
python objects.
I found the bug.
plugin_lock_request() is used to lock client requests only, but it does
_not_ lock any other calls into the plugin. Due to that, the following
calls are neither locked against each other nor are they locked against
client requests:
plugin_register (const char *_filename,
plugin_cleanup (void)
plugin_config (const char *key, const char *value)
plugin_config_complete (void)
plugin_open (struct connection *conn, int readonly)
plugin_close (struct connection *conn)
plugin_get_size (struct connection *conn)
plugin_can_write (struct connection *conn)
plugin_can_flush (struct connection *conn)
plugin_is_rotational (struct connection *conn)
plugin_can_trim (struct connection *conn)
That means dumping the plugin-internal data on connection close is not
locked against the requests within that connection, and any garbage
collection triggered by the last request may still be running while
plugin_close runs. Boom.
In general, when working with python any call into python should be
wrapped like this:
PyGILState_STATE gstate;
gstate = PyGILState_Ensure();
/* Perform Python actions here. */
result = CallSomeFunction();
/* evaluate result or handle exception */
/* Release the thread. No Python API allowed beyond this point. */
PyGILState_Release(gstate);
That would even allow us to run everything in parallel.
I have now added a call to plugin_lock_request() within plugin_close()
and will check if it still explodes. That's obviously less than what the
python developers recommend, but it may at least reduce the likelihood
of the crashes.
Just following up on some old email. Did this issue get fixed in the
end?
Rich.
--
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog:
http://rwmj.wordpress.com
virt-top is 'top' for virtual machines. Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top