Bug in numeric multiplication

Started by Dean Rasheed over 10 years ago. 18 messages, pgsql-hackers.
#1 Dean Rasheed
dean.a.rasheed@gmail.com

Hi,

By chance, while testing the nearby numeric log/exp/pow patch, I came
across the following case which generates an initially puzzling
looking error on HEAD -- (5.6-1e-500) ^ (3.2-1e-200). This computation
actually works OK with that other patch, but only by blind luck. It
turns out that the underlying problem is a bug in the low-level
numeric multiplication function mul_var(). It is possible to trigger
it directly with the following carefully crafted inputs:

select 4790999999999999999999999999999999999999999999999999999999999999999999999999999999999999
* 9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999;

Result:
47909999999999999999999999999999999999999999999999999999999999999999999999999999997852304953000000000000000000000000000000000000000000000000000000000000000000000000000000000001

That answer is actually incorrect. Tweaking the input a little, it is
possible to generate a much more obviously nonsensical result:

select 4789999999999999999999999999999999999999999999999999999999999999999999999999999999999999
* 9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999;

Result:
4789999999999999999999999999999999999999999999999999999999999999999999999999999999785231+0,*000000000000000000000000000000000000000000000000000000000000000000000000000000000001

Notice those garbage digits in the middle of the number returned.

The problem is that these examples trigger an overflow of the digits
in the accumulator array in mul_var().

The number on the left in the first example consists of 21 copies of
9999, preceded by 4790. Those are chosen so that when added together
they lead to a value for maxdig in mul_var() of 21*9999 + 4790 =
214769, which is exactly equal to INT_MAX / (NBASE - 1). So this
doesn't quite trigger a normalisation of the accumulator array, and
leaves several of the digits in that array a little under INT_MAX at
the end of the main multiplication loop.

The problem then arises in the final carry propagation pass. During
this phase of the computation, the carry from one digit (which can be
a shade under INT_MAX / NBASE) is added to the next digit, and that's
where the overflow happens.

To fix that, the initial value for maxdig needs to be made larger to
leave headroom for the carry. The largest possible carry is INT_MAX /
NBASE, and maxdig is the maximum possible dig value divided by
NBASE-1, so maxdig needs to be initialised to

(INT_MAX / NBASE) / (NBASE - 1)

which is 21 for NBASE = 10000.

A new corner-case input that doesn't quite trigger an accumulator
normalisation is then 47699999... The worst case inputs are now values
like this for which the sum of a sequence of input digits is INT_MAX /
(NBASE - 1) - 21 = 214769 - 21 = 214748. So in the worst case, the
accumulator's digits can be up to 214748 * 9999 = 2147265252 in the
main multiplication loop. Then, during the carry propagation phase (or
any of the normalisation phases), the carry can be anything up to
INT_MAX / NBASE = 214748. So the maximum value that can be assigned to
any individual digit is now 2147265252 + 214748 = 2147480000, which is
now less than INT_MAX.

Patch attached.

Regards,
Dean

Attachments:

numeric-mul.patch (text/x-diff; charset=US-ASCII) +45 -2

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dean Rasheed (#1)
Re: Bug in numeric multiplication

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

The problem then arises in the final carry propagation pass. During
this phase of the computation, the carry from one digit (which can be
a shade under INT_MAX / NBASE) is added to the next digit, and that's
where the overflow happens.

Nice catch! I think the comment could use a little more work, but I'll
adjust it and push.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#2)
Re: Bug in numeric multiplication

I wrote:

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

The problem then arises in the final carry propagation pass. During
this phase of the computation, the carry from one digit (which can be
a shade under INT_MAX / NBASE) is added to the next digit, and that's
where the overflow happens.

Nice catch! I think the comment could use a little more work, but I'll
adjust it and push.

After trying to rework the comment to explain what maxdig really meant
after your changes, I came to the conclusion that it'd be better to do
it as per attached. Does this look sane to you?

regards, tom lane

Attachments:

numeric-mul-2.patch (text/x-diff; charset=us-ascii) +50 -11

#4 Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tom Lane (#3)
Re: Bug in numeric multiplication

On 21 September 2015 at 16:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

The problem then arises in the final carry propagation pass. During
this phase of the computation, the carry from one digit (which can be
a shade under INT_MAX / NBASE) is added to the next digit, and that's
where the overflow happens.

Nice catch! I think the comment could use a little more work, but I'll
adjust it and push.

After trying to rework the comment to explain what maxdig really meant
after your changes, I came to the conclusion that it'd be better to do
it as per attached. Does this look sane to you?

Yes that looks better. It's still the same amount of extra headroom
(21), but I think it's clearer your way.

Regards,
Dean


#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dean Rasheed (#4)
Re: Bug in numeric multiplication

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

On 21 September 2015 at 16:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:

After trying to rework the comment to explain what maxdig really meant
after your changes, I came to the conclusion that it'd be better to do
it as per attached. Does this look sane to you?

Yes that looks better. It's still the same amount of extra headroom
(21), but I think it's clearer your way.

OK, pushed (after further hacking on the comment ...)

regards, tom lane


#6 Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tom Lane (#5)
Re: Bug in numeric multiplication

On 21 September 2015 at 17:14, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

On 21 September 2015 at 16:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:

After trying to rework the comment to explain what maxdig really meant
after your changes, I came to the conclusion that it'd be better to do
it as per attached. Does this look sane to you?

Yes that looks better. It's still the same amount of extra headroom
(21), but I think it's clearer your way.

OK, pushed (after further hacking on the comment ...)

regards, tom lane

I just noticed that div_var_fast() has almost identical code, and so
in principle it has the same vulnerability, although it obviously only
affects the transcendental functions.

I don't actually have a test case that triggers it, but it's basically
the same algorithm, so logically it needs the same additional headroom
to avoid a possible overflow.

Regards,
Dean


#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dean Rasheed (#6)
Re: Bug in numeric multiplication

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

I just noticed that div_var_fast() has almost identical code, and so
in principle it has the same vulnerability, although it obviously only
affects the transcendental functions.
I don't actually have a test case that triggers it, but it's basically
the same algorithm, so logically it needs the same additional headroom
to avoid a possible overflow.

Hm, good point. I don't feel a compulsion to have a test case that
proves it's broken before we fix it. Do you want to send a patch?

regards, tom lane


#8 Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tom Lane (#7)
Re: Bug in numeric multiplication

On 17 November 2015 at 14:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

I just noticed that div_var_fast() has almost identical code, and so
in principle it has the same vulnerability, although it obviously only
affects the transcendental functions.
I don't actually have a test case that triggers it, but it's basically
the same algorithm, so logically it needs the same additional headroom
to avoid a possible overflow.

Hm, good point. I don't feel a compulsion to have a test case that
proves it's broken before we fix it. Do you want to send a patch?

OK, here it is. It's slightly different from mul_var, because maxdiv
is tracking absolute values and the carry may be positive or negative,
and its absolute value may be as high as INT_MAX / NBASE + 1 (when
it's negative), but otherwise the principle is the same.

Regards,
Dean

Attachments:

div_var_fast.patch (text/x-diff; charset=US-ASCII) +10 -3

#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dean Rasheed (#8)
Re: Bug in numeric multiplication

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

On 17 November 2015 at 14:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Hm, good point. I don't feel a compulsion to have a test case that
proves it's broken before we fix it. Do you want to send a patch?

OK, here it is. It's slightly different from mul_var, because maxdiv
is tracking absolute values and the carry may be positive or negative,
and it's absolute value may be as high as INT_MAX / NBASE + 1 (when
it's negative), but otherwise the principle is the same.

I pushed this, but while looking at it, my attention was drawn to this
bit down near the end of the loop:

/*
 * The dividend digit we are about to replace might still be nonzero.
 * Fold it into the next digit position. We don't need to worry about
 * overflow here since this should nearly cancel with the subtraction
 * of the divisor.
 */
div[qi + 1] += div[qi] * NBASE;

In the first place, I'm not feeling especially convinced by that handwave
about there being no risk of overflow. In the second place, this is
indisputably failing to consider whether maxdiv might need to increase.
If I add

Assert(Abs(div[qi + 1]) <= (maxdiv * (NBASE-1)));

right after this, the regression tests blow up. So it looks to me like
we've got some more work to do.

regards, tom lane


#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#9)
Re: Bug in numeric multiplication

I wrote:

I pushed this, but while looking at it, my attention was drawn to this
bit down near the end of the loop:

/*
 * The dividend digit we are about to replace might still be nonzero.
 * Fold it into the next digit position. We don't need to worry about
 * overflow here since this should nearly cancel with the subtraction
 * of the divisor.
 */
div[qi + 1] += div[qi] * NBASE;

In the first place, I'm not feeling especially convinced by that handwave
about there being no risk of overflow. In the second place, this is
indisputably failing to consider whether maxdiv might need to increase.
If I add
Assert(Abs(div[qi + 1]) <= (maxdiv * (NBASE-1)));
right after this, the regression tests blow up. So it looks to me like
we've got some more work to do.

After thinking some more about what this is doing, it seems to me that
we could avoid changing div[qi + 1] at all here, and instead deal with
leftover dividend digits by shoving them into the floating-point part of
the calculation. All that we really need to have happen as a consequence
of the above is to inject an additional "div[qi] * NBASE" term into future
calculations of fdividend. So we can do that out-of-band and avoid
mucking up the maxdiv invariant.

Also, I realized that this bit:

/*
 * All the div[] digits except possibly div[qi] are now in the
 * range 0..NBASE-1.
 */
maxdiv = Abs(newdig) / (NBASE - 1);
maxdiv = Max(maxdiv, 1);

is unduly conservative, because, while indeed div[qi] might be out of
range, there is no need to consider it in maxdiv anymore, because the new
maxdiv value only determines what will happen in future loop iterations
where the current div[qi] will be irrelevant. So we should just reset
maxdiv to 1 unconditionally here.

So that led me to the attached patch, which passes regression tests fine.
I'm not sure that it's provably okay though. The loose ends are to show
that div[qi] can't overflow an int during the divisor-subtraction step
and that "outercarry" remains bounded. Testing suggests that outercarry
can't exceed INT_MAX/NBASE, but I don't see how to prove that.

regards, tom lane

Attachments:

div_var_fast-fixes.patch (text/x-diff; charset=us-ascii) +40 -35

#11 Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tom Lane (#10)
Re: Bug in numeric multiplication

On 17 November 2015 at 23:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:

After thinking some more about what this is doing, it seems to me that
we could avoid changing div[qi + 1] at all here, and instead deal with
leftover dividend digits by shoving them into the floating-point part of
the calculation. All that we really need to have happen as a consequence
of the above is to inject an additional "div[qi] * NBASE" term into future
calculations of fdividend. So we can do that out-of-band and avoid
mucking up the maxdiv invariant.

That makes sense.

Also, I realized that this bit:

/*
 * All the div[] digits except possibly div[qi] are now in the
 * range 0..NBASE-1.
 */
maxdiv = Abs(newdig) / (NBASE - 1);
maxdiv = Max(maxdiv, 1);

is unduly conservative, because, while indeed div[qi] might be out of
range, there is no need to consider it in maxdiv anymore, because the new
maxdiv value only determines what will happen in future loop iterations
where the current div[qi] will be irrelevant. So we should just reset
maxdiv to 1 unconditionally here.

OK, makes sense too.

So that led me to the attached patch, which passes regression tests fine.

I'd feel better about that if the regression tests actually did
something that stressed this, but coming up with such a test is
non-trivial.

I'm not sure that it's provably okay though. The loose ends are to show
that div[qi] can't overflow an int during the divisor-subtraction step
and that "outercarry" remains bounded. Testing suggests that outercarry
can't exceed INT_MAX/NBASE, but I don't see how to prove that.

At the top of the loop we know that

    Abs(div[qi]) <= maxdiv * (NBASE-1)
                 <= INT_MAX - INT_MAX/NBASE - 1

Then we add Abs(qdigit) to maxdiv which potentially triggers a
normalisation step. That step adds carry to div[qi], and we know that

Abs(carry) <= INT_MAX/NBASE + 1

so the result is that Abs(div[qi]) <= INT_MAX -- i.e., it can't
overflow at that point, but it could be on the verge of doing so.

Then the divisor-subtraction step does

div[qi] -= qdigit * var2digits[0]

That looks as though it could overflow, although actually I would
expect qdigit to have the same sign as div[qi], so that this would
drive div[qi] towards zero. If you didn't want to rely on that though,
you could take advantage of the fact that this new value of div[qi] is
only needed for outercarry, so you could modify the
divisor-subtraction step to skip div[qi]:

    for (i = 1; i < istop; i++)
        div[qi + i] -= qdigit * var2digits[i];

and fold the most significant digit of the divisor-subtraction step
into the computation of outercarry so that there is definitely no
possibility of integer overflow:

    outercarry = outercarry * NBASE + (double) div[qi]
               - (double) qdigit * var2digits[0];

It's still difficult to prove that outercarry remains bounded, but to
a first approximation

qdigit ~= ( outercarry * NBASE + (double) div[qi] ) / var2digits[0]

so outercarry ought to be driven towards zero, as the comment says. It
certainly doesn't look like it will grow without bounds, but I haven't
attempted to work out what its bounds actually are.

Regards,
Dean


#12 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dean Rasheed (#11)
Re: Bug in numeric multiplication

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

On 17 November 2015 at 23:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm not sure that it's provably okay though. The loose ends are to show
that div[qi] can't overflow an int during the divisor-subtraction step
and that "outercarry" remains bounded. Testing suggests that outercarry
can't exceed INT_MAX/NBASE, but I don't see how to prove that.

At the top of the loop we know that
    Abs(div[qi]) <= maxdiv * (NBASE-1)
                 <= INT_MAX - INT_MAX/NBASE - 1
Then we add Abs(qdigit) to maxdiv which potentially triggers a
normalisation step. That step adds carry to div[qi], and we know that
Abs(carry) <= INT_MAX/NBASE + 1
so the result is that Abs(div[qi]) <= INT_MAX -- i.e., it can't
overflow at that point, but it could be on the verge of doing so.

Right.

Then the divisor-subtraction step does
div[qi] -= qdigit * var2digits[0]
That looks as though it could overflow, although actually I would
expect qdigit to have the same sign as div[qi], so that this would
drive div[qi] towards zero.

But if outercarry is nonzero, then qdigit is going to be primarily driven
by outercarry not div[qi], so I'm afraid it's mistaken to rely on it
having the right sign to cause cancellation.

If you didn't want to rely on that though,
you could take advantage of the fact that this new value of div[qi] is
only needed for outercarry, so you could modify the
divisor-subtraction step to skip div[qi]:
and fold the most significant digit of the divisor-subtraction step
into the computation of outercarry

Cute idea. Another thought I'd had is that we could change the carry
propagation loop limits so that div[qi] is normalized along with the other
digits. Then we'd have a carry leftover at the end of the loop, but we
could add it to outercarry. That makes the argument that div[qi] does
not overflow the same as for the other digits.

However, I kind of like your idea of postponing the div[qi] subtraction
to the outercarry update, because then it's very direct to see that
that update should result in near-total cancellation.

so outercarry ought to be driven towards zero, as the comment says. It
certainly doesn't look like it will grow without bounds, but I haven't
attempted to work out what its bounds actually are.

I'm kind of stuck on that too. I did some experimentation by tracking
maximum values of outercarry in the regression tests (including
numeric_big) and did not see it get larger than a couple hundred thousand,
ie more or less INT_MAX/NBASE. But I don't have an argument as to why
that would be the limit.

Another issue here is that with outercarry added into the qdigit
computation, it's no longer clear what the bounds of qdigit itself are,
so is it really safe to assume that "div[qi + i] -= qdigit * var2digits[i]"
can't overflow? Stated in other terms, why are we sure that the new
maxdiv value computed at the bottom of the carry propagation stanza is
within the safe range?

regards, tom lane


#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#12)
Re: Bug in numeric multiplication

I wrote:

I'm kind of stuck on that too. I did some experimentation by tracking
maximum values of outercarry in the regression tests (including
numeric_big) and did not see it get larger than a couple hundred thousand,
ie more or less INT_MAX/NBASE. But I don't have an argument as to why
that would be the limit.

A bit of progress on this: I've found a test case, namely

select sqrt(99999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999.0099999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999);

If one inserts no-overflow Asserts into div_var_fast, this case will trip
them, specifically this:

/*
 * The dividend digit we are about to replace might still be nonzero.
 * Fold it into the next digit position. We don't need to worry about
 * overflow here since this should nearly cancel with the subtraction
 * of the divisor.
 */
+ Assert(Abs(div[qi]) <= INT_MAX/NBASE);
div[qi + 1] += div[qi] * NBASE;

Unfortunately, there isn't any SQL-visible misbehavior in this example,
because the loop in sqrt_var is more or less self-correcting for minor
errors (and this example produces bogus results only in very low-order
digits). Most of the other calls of div_var_fast give it inputs that are
even harder to control than sqrt_var's, so finding something that did
produce visibly wrong results might take a whole lot of trial and error.

Still, this proves that we are onto a real problem.

Another issue here is that with outercarry added into the qdigit
computation, it's no longer clear what the bounds of qdigit itself are,

I concluded that that particular issue is a red herring: qdigit should
always be a fairly accurate estimate of the next output digit, so it
cannot fall very far outside the range 0..NBASE-1. Testing confirms that
the range seen in the regression tests is exactly -1 to 10000, which is
what you'd expect since there can be some roundoff error.

Also, after further thought I've been able to construct an argument why
outercarry stays bounded. See what you think of the comments in the
attached patch, which incorporates your ideas about postponing the div[qi]
calculation.

regards, tom lane

Attachments:

div-var-fast-fix-again.patch (text/x-diff; charset=us-ascii) +74 -69

#14 Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tom Lane (#13)
Re: Bug in numeric multiplication

On 18 November 2015 at 22:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:

A bit of progress on this: I've found a test case, namely

select sqrt(999999...

Wow.

Still, this proves that we are onto a real problem.

Agreed. I had a go at turning that example into something using
log(base, num) so that the result would be visible in a pure SQL test,
but I didn't have any luck.

Another issue here is that with outercarry added into the qdigit
computation, it's no longer clear what the bounds of qdigit itself are,

I concluded that that particular issue is a red herring: qdigit should
always be a fairly accurate estimate of the next output digit, so it
cannot fall very far outside the range 0..NBASE-1. Testing confirms that
the range seen in the regression tests is exactly -1 to 10000, which is
what you'd expect since there can be some roundoff error.

Right, I came to the same conclusion.

Also, after further thought I've been able to construct an argument why
outercarry stays bounded. See what you think of the comments in the
attached patch, which incorporates your ideas about postponing the div[qi]
calculation.

I think the logic is sound, but I worry that that comment is going to
be very difficult to follow.

I had a go at trying to find a simpler approach and came up with the
attached, which computes the value intended for div[qi+1] near the top
of the loop (using doubles) so that it can detect if it will overflow,
and if so it enters the normalisation block. It still needs some work
to prove that the normalised value for fnextdigit won't overflow, but
it feels like a promising, easier-to-follow approach to me. What do
you think?

Regards,
Dean

Attachments:

div_var_fast-alt-fix.patch (text/x-diff; charset=US-ASCII) +136 -116

#15 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dean Rasheed (#14)
Re: Bug in numeric multiplication

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

On 18 November 2015 at 22:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Still, this proves that we are onto a real problem.

Agreed. I had a go at turning that example into something using
log(base, num) so that the result would be visible in a pure SQL test,
but I didn't have any luck.

I experimented with that idea some, and found a test case that would
trigger the Assert during log_var's final div_var_fast call:

select log(
exp(.999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999995050012831598249352382434889825483159817)
,
exp(.99999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999009999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999));

However, the emitted answer was the same with or without my proposed
patch. So I dug deeper by putting in some debug printf's, and found that
the overflow occurs at this point:

div[81] will overflow: div[qi] = 218943 qdigit = 7673 maxdiv = 210375
div[]:
0000 0001 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 -001 9999 9999 9999 5049 9871 6840 1750 6476 1756 5110 1745 1684 0183 0860 9567 3786 7660 2671 2843 5974 6515 4013 7886 0978 7230 5372 8162 7077 2013 6185 3746 3095 8528 1595 3071 7541 7920 7950 218943 -2103498897 -2103499629 -2103499629 -2103499629 -2103504578

Note that div[qi+1], and indeed all the remaining dividend digits, are
large negative values. This is what you'd expect if the carry propagation
step hasn't run for awhile, which is a precondition for div[qi] being
large enough to cause an issue. When we compute 218943 * 10000, we will
indeed get an overflow, and the result will wrap around to some large
negative value (2^32 less than it should be). Then we will add that to
div[qi+1], and we'll get *another* overflow, wrapping what nominally
should have been a negative sum around to a positive value (2^32 more than
it should be). So the two overflows cancel and we get exactly the correct
new value of div[qi+1].

I do not know whether it's possible to devise a test case where you don't
get offsetting overflows. It may be that there's no operational bug here.
Still, the code is surely not behaving per face value.

I had a go at trying to find a simpler approach and came up with the
attached, which computes the value intended for div[qi+1] near the top
of the loop (using doubles) so that it can detect if it will overflow,
and if so it enters the normalisation block. It still needs some work
to prove that the normalised value for fnextdigit won't overflow, but
it feels like a promising, easier-to-follow approach to me. What do
you think?

I'll look at this later ...

regards, tom lane


#16 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#15)
Re: Bug in numeric multiplication

I wrote:

Note that div[qi+1], and indeed all the remaining dividend digits, are
large negative values. This is what you'd expect if the carry propagation
step hasn't run for awhile, which is a precondition for div[qi] being
large enough to cause an issue. When we compute 218943 * 10000, we will
indeed get an overflow, and the result will wrap around to some large
negative value (2^32 less than it should be). Then we will add that to
div[qi+1], and we'll get *another* overflow, wrapping what nominally
should have been a negative sum around to a positive value (2^32 more than
it should be). So the two overflows cancel and we get exactly the correct
new value of div[qi+1].

I do not know whether it's possible to devise a test case where you don't
get offsetting overflows. It may be that there's no operational bug here.
Still, the code is surely not behaving per face value.

After further thought I've pretty well convinced myself that there is
indeed no observable bug, at least as long as you assume that overflow
within the multiplication will behave as stated above. The proof goes
like this:

We already know that the divisor-subtraction step cannot cause an overflow
(modulo the question of div[qi] possibly exceeding the maxdiv limit,
which I'll get back to). So what is at stake is just the possibility of
overflow in the final update that transfers leftover digits from div[qi]
to div[qi+1].

To analyze that, consider just the first two dividend and divisor terms in
the qdigit calculation, that is during each loop iteration approximate
qdigit as
    qdigit ~= trunc((div[qi]*NBASE + div[qi+1]) /
                    (var2digits[0]*NBASE + var2digits[1]))
This holds whether or not we do a carry propagation step. Now write
what the divisor-subtraction step updates div[qi] to as
    div[qi]' = div[qi] - qdigit * var2digits[0]
and the updated div[qi+1] as
    div[qi+1]' = div[qi+1] - qdigit * var2digits[1]
So, if we temporarily disregard the possibility of overflow within the
final calculation
    div[qi+1]'' = div[qi+1]' + div[qi]'*NBASE
we can see that div[qi+1]'' is going to be
    = div[qi+1]' + div[qi]'*NBASE
    = div[qi+1] - qdigit * var2digits[1] +
      (div[qi] - qdigit * var2digits[0])*NBASE
    = div[qi]*NBASE + div[qi+1] -
      qdigit * (var2digits[0]*NBASE + var2digits[1])
Comparing that to the approximate value of qdigit, we can see that
what we've got here is a modulo calculation, and the value of div[qi+1]''
is going to cancel to zero except for a remainder modulo
var2digits[0]*NBASE + var2digits[1], which of course must fall between 0
and NBASE*NBASE+NBASE (since the truncation is towards minus infinity).

Now of course this is only an approximation. The main source of error is
that we've omitted the lower-order dividend terms. Including div[qi+2]
could change the result by at most about INT_MAX/NBASE (which is the most
it could add to the numerator in the qdigit expression above, which will
propagate more or less linearly to div[qi+1]''). Similarly, adding
div[qi+3] could add at most INT_MAX/NBASE/NBASE. So we aren't anywhere
near the overflow threshold. Including the lower-order divisor terms
could change the value of qdigit by a multiplicative factor of no more
than about 1/NBASE. Lastly, roundoff error in the floating-point part
of the calculation could cause qdigit to be off by one count either
way. None of these effects are going to let the final div[qi+1] value
get to more than two or three times NBASE squared, which is still
an order of magnitude less than INT_MAX.

Therefore, the final div[qi+1] value cannot overflow an int, and even
though it might be larger than what maxdiv would suggest, it's not going
to be large enough to cause overflow in the next loop iteration either.
It obviously won't cause overflow during carry propagation, and as for
the next divisor-subtraction step, we can do an analysis similar to the
above but approximating qdigit with just these terms:

    qdigit ~= trunc((div[qi] + div[qi+1]/NBASE) / var2digits[0])

Plugging that into

    div[qi]' = div[qi] - qdigit * var2digits[0]

shows that the updated div[qi] in any loop iteration is going to be just
about -div[qi+1]/NBASE, plus a truncation term that's between 0 and
var2digits[0], plus lower-order terms that aren't going to get you very
much past INT_MAX/NBASE. So div[qi]' is never an overflow hazard in any
loop iteration.

In short, it's impossible for the end result of the div[qi+1] update
calculation to overflow, or even to get large enough to create a problem
in the next loop iteration. However, the intermediate result
div[qi]*NBASE can overflow as per the example I showed before.

We could just ignore this on the grounds that it doesn't matter given sane
behavior of integer arithmetic. Or, if we wanted to be more paranoid, we
could do the multiplication-and-addition in int64 arithmetic; but that
would probably slow things down a bit.

I think all this deserves some code comments, but I'm not sure whether
we need to bother with casting to int64. Thoughts?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tom Lane (#16)
Re: Bug in numeric multiplication

On 24 November 2015 at 16:20, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> After further thought I've pretty well convinced myself that there is
> indeed no observable bug, at least as long as you assume that overflow
> within the multiplication will behave as stated above. The proof goes
> like this:
>
> We already know that the divisor-subtraction step cannot cause an overflow
> (modulo the question of div[qi] possibly exceeding the maxdiv limit,
> which I'll get back to). So what is at stake is just the possibility of
> overflow in the final update that transfers leftover digits from div[qi]
> to div[qi+1].
>
> To analyze that, consider just the first two dividend and divisor terms in
> the qdigit calculation, that is during each loop iteration approximate
> qdigit as
>
>     qdigit ~= trunc((div[qi]*NBASE + div[qi+1]) /
>                     (var2digits[0]*NBASE + var2digits[1]))
>
> This holds whether or not we do a carry propagation step. Now write
> what the divisor-subtraction step updates div[qi] to as
>
>     div[qi]' = div[qi] - qdigit * var2digits[0]
>
> and the updated div[qi+1] as
>
>     div[qi+1]' = div[qi+1] - qdigit * var2digits[1]
>
> So, if we temporarily disregard the possibility of overflow within the
> final calculation
>
>     div[qi+1]'' = div[qi+1]' + div[qi]'*NBASE
>
> we can see that div[qi+1]'' is going to be
>
>     = div[qi+1]' + div[qi]'*NBASE
>     = div[qi+1] - qdigit * var2digits[1] +
>           (div[qi] - qdigit * var2digits[0])*NBASE
>     = div[qi]*NBASE + div[qi+1] -
>           qdigit * (var2digits[0]*NBASE + var2digits[1])
>
> Comparing that to the approximate value of qdigit, we can see that
> what we've got here is a modulo calculation, and the value of div[qi+1]''
> is going to cancel to zero except for a remainder modulo
> var2digits[0]*NBASE + var2digits[1], which of course must fall between 0
> and NBASE*NBASE+NBASE (since the truncation is towards minus infinity).

Yes that makes sense.

> Now of course this is only an approximation. The main source of error is
> that we've omitted the lower-order dividend terms. Including div[qi+2]
> could change the result by at most about INT_MAX/NBASE (which is the most
> it could add to the numerator in the qdigit expression above, which will
> propagate more or less linearly to div[qi+1]''). Similarly, adding
> div[qi+3] could add at most INT_MAX/NBASE/NBASE. So we aren't anywhere
> near the overflow threshold. Including the lower-order divisor terms
> could change the value of qdigit by a multiplicative factor of no more
> than about 1/NBASE. Lastly, roundoff error in the floating-point part
> of the calculation could cause qdigit to be off by one count either
> way. None of these effects are going to let the final div[qi+1] value
> get to more than two or three times NBASE squared, which is still
> an order of magnitude less than INT_MAX.

Right. In fact I think div[qi+1] is even more constrained than that (see below).

> Therefore, the final div[qi+1] value cannot overflow an int, and even
> though it might be larger than what maxdiv would suggest, it's not going
> to be large enough to cause overflow in the next loop iteration either.
> It obviously won't cause overflow during carry propagation, and as for
> the next divisor-subtraction step, we can do an analysis similar to the
> above but approximating qdigit with just these terms:
>
>     qdigit ~= trunc((div[qi] + div[qi+1]/NBASE) / var2digits[0])
>
> Plugging that into
>
>     div[qi]' = div[qi] - qdigit * var2digits[0]
>
> shows that the updated div[qi] in any loop iteration is going to be just
> about -div[qi+1]/NBASE, plus a truncation term that's between 0 and
> var2digits[0], plus lower-order terms that aren't going to get you very
> much past INT_MAX/NBASE. So div[qi]' is never an overflow hazard in any
> loop iteration.
>
> In short, it's impossible for the end result of the div[qi+1] update
> calculation to overflow, or even to get large enough to create a problem
> in the next loop iteration. However, the intermediate result
> div[qi]*NBASE can overflow as per the example I showed before.

Agreed.

I tried analysing this in a different way, by considering the maximum
possible error in the floating point computations. fdividend takes up
to 4 digits from the dividend:

fdividend = div[qi] * NBASE^3 + div[qi+1] * NBASE^2
+ div[qi+2] * NBASE + div[qi+3]

and the digits being discarded are limited to a little less than
+/-INT_MAX, so the error in fdividend due to discarding digits is at
most +/-INT_MAX/NBASE. fdivisor is computed in a similar way, except
that the var2 digits are in the range [0,NBASE-1], so fdivisor is in
the range [NBASE^3, NBASE^4-1]. Therefore the maximum error in the
computation of fquotient is +/-INT_MAX/NBASE^4, which is less than 1.
So the most this can do is to cause a single integer boundary to be
crossed, and therefore qdigit will be within +/-1 of the correct
result. That's consistent with the observation that in practice qdigit
always falls in the range [-1,10000].

In fact, since we round down when computing qdigit, it should always
be correct or 1 smaller than the correct result, but the "correct"
result in this context may be equal to 10000, to compensate for an
underestimate in the previously calculated digit.

It then follows that the result of the divisor subtraction step can be
to have up to one extra copy of the divisor added to the remainder,
and this will contribute up to an extra var2digits[0]*NBASE^4 +
var2digits[1]*NBASE^3 to fdividend in the next iteration, making
fdividend up to around NBASE*fdivisor.

So fdividend will always be less than around NBASE^5, which means that
at the top of the loop div[qi] is always less than
NBASE^2+INT_MAX/NBASE (since the next digit of the divisor could be
large and negative, down to -INT_MAX).

So, in other words, the limit on the value of div[qi+1] computed at
the bottom of the loop is around NBASE^2+INT_MAX/NBASE=100214748, and
so it can't possibly overflow a 32-bit integer.

I added some logging and ran the self tests plus a few other examples
analogous to your example upthread and the largest value for div[qi+1]
logged was 100210148, which is consistent with that bound.

> We could just ignore this on the grounds that it doesn't matter given sane
> behavior of integer arithmetic. Or, if we wanted to be more paranoid, we
> could do the multiplication-and-addition in int64 arithmetic; but that
> would probably slow things down a bit.

I doubt that a single int64 computation would make any noticeable
performance difference to this computation, given all the other code
in the loop, but I think it's probably unnecessary.

> I think all this deserves some code comments, but I'm not sure whether
> we need to bother with casting to int64. Thoughts?
>
> regards, tom lane

I think probably just a comment is OK.

Regards,
Dean


#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dean Rasheed (#17)
Re: Bug in numeric multiplication

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

> On 24 November 2015 at 16:20, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> ... None of these effects are going to let the final div[qi+1] value
>> get to more than two or three times NBASE squared, which is still
>> an order of magnitude less than INT_MAX.
>
> Right. In fact I think div[qi+1] is even more constrained than that (see below).

Thanks for the additional analysis and testing.

>> I think all this deserves some code comments, but I'm not sure whether
>> we need to bother with casting to int64. Thoughts?
>
> I think probably just a comment is OK.

I'm inclined to agree. The only reason I'd worry at all is that the
compiler guys are getting ever more aggressive about claiming that any
integer overflow can result in arbitrarily-broken behavior. However,
we already have a lot of other code that depends on -fwrapv or equivalent
overflow behavior, so I doubt that this is the first place to worry
about it. The underlying hardware is certainly going to behave as we
wish, and it's pretty hard to see how a compiler could "optimize" the
assembly code in a way that would break such a simple calculation.

I'll go add some comments and call it good.

regards, tom lane
