On Mon, Oct 09, 2023 at 12:12:02PM +0100, Richard W.M. Jones wrote:
 On Mon, Oct 09, 2023 at 12:57:14PM +0200, Laszlo Ersek wrote:
 > On 10/9/23 09:46, Richard W.M. Jones wrote:
 > > Hi Eric, a couple of POSIX questions for you from nbdkit.
 > > 
 > > The first question is from an AUR comment on nbdkit:
 > > 
 > >   
https://aur.archlinux.org/packages/nbdkit#comment-937381
 > > 
 > > I think there's a bash-ism in the logscript parameter in this test:
 > > 
 > >  
https://gitlab.com/nbdkit/nbdkit/-/blame/master/tests/test-log-script-inf...
 > > 
 > > I believe it is happening in the $(( .. )) expression.  How do we
 > > write that so it'll work in a posix shell?
 > 
 > (I'm not Eric, but curious! :) )
 > 
 > Here I think we should just explicitly insist on bash.
 
 Unfortunately we can't do that, not easily anyway.
 
 We use bash for writing the test scripts because it's the sane choice
 for shell programming, and we declare /bin/bash (or something with
 'env') at the top of each of those scripts.  This introduces a *build*
 dependency from nbdkit to /bin/bash.  That's all fine.
 
 However when nbdkit is installed in production & runs an external
 script, it uses system(3) which uses /bin/sh.  That might not be bash,
 and indeed bash might not even be installed on the same machine as
 nbdkit. 
Correct.
 
 [Note: I think the whole Debian dash-for-/bin/sh is the stupidest idea
 I ever heard, but (a) it's a thing and (b) there's BSD and Macs where
 they don't want to use bash for licensing reasons.] 
It's not just Debian's dash, but alpine's choice of busybox (a variant
of ash with distinct differences from dash) that can also bite you
when using non-POSIX constructs.  Debian pioneered a shellcheck app
(also available in the ShellCheck package in Fedora) to try and detect
scripts that depend on bashisms; but I'm not sure it would help the
case of where you are embedding constructs that will be run down the
road under system(3) in a much larger file that is not constrained by
a limited shell.
 
 Arguably for the tests we could have some way to cause nbdkit to use
 /bin/bash instead of /bin/sh when running external scripts, but it's
 major complexity to implement. 
Anywhere you call system(3), you can pass "/path/to/known/shell -c '"
+ shell_quote(original script) + "'" as the arguments to system to
control which shell syntax you support in the original script.  But I
agree that it is not always trivial.
 
 So we gotta use a lowest common denominator for these external
 scripts, even in our tests.  Note only this one test is affected.
 
 [Aside: Windows is a whole separate kettle of fish.  Currently the
 Windows port of nbdkit doesn't support --run for other reasons, but if
 it did it would probably run some Windows-ish thing (COMMAND.COM?
 Power Shell?).  I'm not sure how Windows behaves for the other
 external commands we try to run like logscript.] 
Yeah, it's not trivial to decide how to support --filter=log
logscript=... under Windows - we could require the user to have
something like Cygwin or mingw installed so that there is a 'sh.exe'
somewhere, but that's not reliable; and native Windows scripting is so
different from anything POSIX that it's probably easier to just
declare logscript= unsupported on that platform (if the rest of the
log filter can still be used).
 
 > Shell arrays are
 > a bash-specific feature, and the extent array is deeply ingrained. See
 > especially commit df63b23b6280 ("log: Use strict shell quoting for every
 > parameter displayed in the log file.", 2021-01-04). In particular the
 > assignment
 > 
 > extents=(0x0 0x8000 "hole,zero" 0x8000 0x8000 "")
 > 
 > turns "extents" into an array variable. 
Yep, that's a definite bashism in the log filter.  Several ideas:
maybe we change to call out to bash (directly via exec*(), or
indirectly via the system("/path/to/bash -c 'original script'")
trick).  Maybe we come up with a way for the command-line client of
the log filter to specify which syntax they expect to be handed to
their log script (if they don't specify a syntax, they get output that
is parseable only by bash).  I'm open to other ideas.
 > 
 > In the bug tracker, comment
 > <
https://aur.archlinux.org/packages/nbdkit#comment-937375> says, "Thus
 > this is an upstream issue; their scripts are calling sh when they should
 > be calling bash". I think that's correct; for logscript=..., we should
 > require /bin/bash in the manual, and execute the script with /bin/bash
 > explicitly, not just system().
 
 This would create a runtime dependency from nbdkit to /bin/bash which
 I'd like to avoid. 
Only for that one filter, not for nbdkit in general.  But yeah, it's
still a rather strong dependency, and if we could avoid it, that would
be nicer.
 
 > > Secondly while looking into this I was trying variations on:
 > > 
 > >   $ POSIXLY_CORRECT=1 ./nbdkit -fv data '1 2 3' --run 'nbdinfo
$uri'
 > > 
 > > This doesn't actually cause bash to emulate a posix shell, but it does
 > > uncover a different bug:
 > > 
 > >   nbdkit: error: raw|base64|data parameter must be specified exactly once
 > > 
 > > This seems to be happening because getopt_long in wrapper.c behaves
 > > somehow differently parsing when POSIXLY_CORRECT is set.  However I
 > > couldn't work out exactly why. 
glibc's getopt_long() is indeed sensitive to POSIXLY_CORRECT - if set,
it refuses to allow options after non-options.  You can force that
behavior even when POSIXLY_CORRECT is not set by passing leading '+'
in the string handed to getopt_long(); but the manual does not list a
way to explicitly request option reordering as an extension to
override POSIXLY_CORRECT.
 > > 
 > >   
https://gitlab.com/nbdkit/nbdkit/-/blob/master/wrapper.c?ref_type=heads#L278
 > > 
 > > I guess the wrapper ought to work if POSIXLY_CORRECT is set (?)
 > 
 > IIRC, POSIXLY_CORRECT makes getopt() stop parsing options when the first
 > non-option argument (i.e., first operand) is reached. So "-f" and
"-v"
 > are taken as options, then the two arguments "data" and '1 2 3'
are
 > taken as operands,  and then "--run" and the rest are taken as operands
 > as well. 
Yes, that's why we're seeing the difference in behavior.  If you were to write:
POSIXLY_CORRECT=1 ./nbdkit -vf --run 'nbdinfo $uri' data '1 2 3'
you'd get the desired behavior, but it doesn't read as nicely.  So we
really DO like the late-option extension to be usable, which somewhat
implies that we should be (temporarily?) unsetting POSIXLY_CORRECT
from an inherited environment around the part where we construct the
passthrough arguments.
But even if we temporarily unset POSIXLY_CORRECT in the wrapper, that
still won't help that nbdkit itself will also treat '--run' as a
non-option if it occurs after the plugin name.  So if we rely on
temporarily unsetting the variable, we'd have to do it in both the
wrapper and in nbdkit proper.  Otherwise, we'll have to audit every
unit test that uses --run to rewrite them to pass options like --run
first before the plugin name.
 > 
 > I think the following loop is relevant:
 > 
 >   /* Are there any non-option arguments? */
 >   if (optind < argc) {
 >     /* Ensure any further parameters can never be parsed as options by
 >      * real nbdkit.
 >      */
 >     passthru ("--");
 > 
 >     /* The first non-option argument is the plugin name.  If it is a
 >      * short name then rewrite it.
 >      */
 >     if (is_short_name (argv[optind])) {
 >       const char *language;
 > 
 >       /* Plugins written in scripting languages. */
 >       if (is_script_plugin (argv[optind], &language)) {
 >         passthru_format ("%s/plugins/%s/.libs/nbdkit-%s-plugin." SOEXT,
 >                          builddir, language, language);
 >         passthru_format ("%s/plugins/%s/nbdkit-%s-plugin",
 >                          builddir, argv[optind], argv[optind]);
 >       }
 >       /* Otherwise normal plugins written in C or other languages that
 >        * compile to .so files.
 >        */
 >       else {
 >         passthru_format ("%s/plugins/%s/.libs/nbdkit-%s-plugin." SOEXT,
 >                          builddir, argv[optind], argv[optind]);
 >       }
 >       ++optind;
 >     }
 > 
 >     /* Everything else is passed through without rewriting. */
 >     while (optind < argc) {
 >       passthru (argv[optind]);
 >       ++optind;
 >     }
 >   }
 > 
 > With POSIXLY_CORRECT set, the "Everything else is passed through without
 > rewriting" logic extends to "--run" etc. I'm not sure how that
would
 > lead to the specific error message, though. 
It's not just that loop in wrapper.c, but also the loop in
server/main.c that is seeing --run as a non-option, at which point the
plugin complains that the config callback is being reached too many
times on key=value with the magic "data=" key.  The error is the same
as if you explicitly write:
./nbdkit -vf data data='1 2 3' data='--run' data='nbdinfo $uri'
which is not subject to option reordering regardless of
POSIXLY_CORRECT.
 
 That's annoying :-( 
Yes, if we want option reordering in spite of POSIXLY_CORRECT possibly
leaking through the environment, we have quite a bit of work to do.
Even if we don't want option reordering, we probably have quite a few
test cases that need to be rewritten to list --run before the plugin
name.
I'd actually lean towards allowing option reordering in spite of
POSIXLY_CORRECT (it makes for nicer command lines, and we already
limit the set of non-options we accept because we will then hand them
off to the .config plugin callback as name=value pairs, possibly with
the magic name).  But I'm not sure how invasive it will be to write
such a patch.  On the other hand, it should be easy enough to tweak our
CI engine to have at least one platform with POSIXLY_CORRECT=1 set in
the environment to make sure we still pass, if that's what we want.
-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  
qemu.org | 
libguestfs.org