On Fri, May 22, 2020 at 03:59:14PM -0500, Eric Blake via Libc-alpha wrote:
It has long been known that the C specification of *scanf() leaves
behavior undefined for things like
int i;
sscanf("9999999999999999", "%i", &i);
C11 7.21.6.2 P12
"Matches an optionally signed integer, whose format is the same as
expected for the subject sequence of the strtol function with the
value 0 for the base argument."
C11 7.21.6.2 P10
"If this object does not have an appropriate type, or if the result
of the conversion cannot be represented in the object, the behavior
is undefined."
as there is an overflow when consuming the input which matches the
strtol subject sequence but does not fit in the width of an int. On
my Linux system, 'man sscanf' mentions that ERANGE might be set in
such a case, but neither C nor POSIX actually requires this
behavior; other likely behaviors are storing the value mod 2^32 into
i, or storing INT_MAX into i, or ...
This is annoying - the only safe way to parse integers from
untrustworthy sources, where overflow MUST be detected, is to
manually open-code strtol() calls, which can get quite lengthy in
comparison to the concise representations possible with *scanf.
Would glibc be willing to consider a GNU extension to add an
optional flag character between '%' and the various numeric
conversion specifiers (both integral based on strto*l, and floating
point based on strtod), where we could force *scanf to treat numeric
overflow as a matching failure, rather than undefined behavior? Or
even a second flag to request that *scanf stop consuming characters
if the next character in input would cause overflow in the current
specifier, leaving that character to instead be matched to the
remainder of the format string?
Since conversion specifier forms outside the standard *also* have
undefined behavior, I see no advantage to defining that particular
undefined case vs just defining the result of the overflowing
conversion, unless you're worried the standard might later define a
conflicting definition. Neither way is amenable to configure detection
(without breaking cross compiling) without also adopting something
like my proposal on libc-coord:
https://www.openwall.com/lists/libc-coord/2020/04/22/1
BTW there is a portable only-somewhat-hideous way to do this with
sscanf: using assignment suppression combined with %n, then strtol,
etc. with the offsets produced by %n.
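
A minimal sketch of that approach, assuming input of the form
"INT,INT". The helper names field_to_int and parse_size are
hypothetical, chosen only for illustration. Because %*i stores
nothing, there is no object for the result to fail to fit in, and the
range check is left to strtol:

```c
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Convert the field s[start..end) to int using strtol with base 0.
 * Returns 0 on success, -1 on overflow or if strtol stops at a
 * different place than the scanf match did. */
static int field_to_int(const char *s, int start, int end, int *out)
{
    char *stop;
    long v;

    errno = 0;
    v = strtol(s + start, &stop, 0);
    if (stop != s + end)
        return -1;
    if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return -1;
    *out = (int)v;
    return 0;
}

/* Hypothetical example: parse "W,H" with overflow detection.
 * Returns 0 on success, -1 on a matching failure or overflow. */
static int parse_size(const char *s, int *w, int *h)
{
    int w0 = -1, w1 = -1, h0 = -1, h1 = -1;

    /* Suppressed conversions only match; each %n records the number
     * of characters consumed so far, giving the field boundaries.
     * The return value is not useful here, since neither suppressed
     * conversions nor %n are counted. */
    sscanf(s, " %n%*i%n , %n%*i%n", &w0, &w1, &h0, &h1);
    if (h1 < 0)
        return -1;  /* the format did not match through the last %n */

    if (field_to_int(s, w0, w1, w) != 0)
        return -1;
    return field_to_int(s, h0, h1, h);
}
```

With this sketch, parse_size("640,480", &w, &h) returns 0 with w and h
set to 640 and 480, while parse_size("9999999999999999,1", &w, &h)
returns -1 instead of storing an unspecified value.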
Let's suppose for argument's sake that we add '^' as a request to force
overflow to be a matching error. Then sscanf("9999999999999999",
"%^i", &i) would be well-specified to return 0, rather than
returning 1 with an unknown value assigned into i or any other
behavior that other libcs exhibit for the undefined case when the ^
is not present.
And if glibc likes the idea of such an extension, and we see an
uptick in applications actually using it, I'd also be happy to
champion the addition of such an extension in POSIX (but the POSIX
folks will definitely want to see existing practice first - both an
implementation and applications that use that implementation). The
libguestfs suite of programs is willing to be an early adopter, if
glibc is willing to pursue adding such a safety valve.
I think it would be more useful to look for existing practice where
the UB blows up in horrible ways, and if there is none (if all
implementations behave somewhat reasonably) define the intersection of
their behaviors as standard and get rid of the UB here. A new feature
will not reliably be usable for decades in portable software, but new
documentation of existing universal practice would be immediately
usable.
Rich