Bug in brin_minmax_multi_distance_numeric()

Started by Tom Lane6 months ago10 messages

tgl@sss.pgh.pa.us

6 months ago

Per the discussion in [1]/messages/by-id/2091622.1753982274@sss.pgh.pa.us, I think we need something like

-	PG_RETURN_FLOAT8(DirectFunctionCall1(numeric_float8, d));
+	PG_RETURN_DATUM(DirectFunctionCall1(numeric_float8, d));

in brin_minmax_multi_distance_numeric(). Peter was proposing
that as cosmetic cleanup, but I think it's actually a functional
bug that needs to be back-patched. It is certainly broken on
32-bit machines where the Datum result of numeric_float8 will
be a pointer, so that we will convert the numeric pointer value
to a float and return that, yielding a totally-garbage distance
value. But I think it's broken on 64-bit machines too, because
we'll be interpreting the output of numeric_float8 as a uintptr_t
and applying some unwanted conversion to make that a float.

It's not clear to me exactly what this function is used for,
but I guess the consequences would be broken BRIN indexes
on numeric columns?

regards, tom lane

[1]: /messages/by-id/2091622.1753982274@sss.pgh.pa.us

Tomas Vondra

tomas@vondra.me

6 months ago

In reply to: Tom Lane (#1)

Re: Bug in brin_minmax_multi_distance_numeric()

On 7/31/25 19:33, Tom Lane wrote:

Per the discussion in [1], I think we need something like
-	PG_RETURN_FLOAT8(DirectFunctionCall1(numeric_float8, d));
+	PG_RETURN_DATUM(DirectFunctionCall1(numeric_float8, d));
in brin_minmax_multi_distance_numeric(). Peter was proposing
that as cosmetic cleanup, but I think it's actually a functional
bug that needs to be back-patched. It is certainly broken on
32-bit machines where the Datum result of numeric_float8 will
be a pointer, so that we will convert the numeric pointer value
to a float and return that, yielding a totally-garbage distance
value. But I think it's broken on 64-bit machines too, because
we'll be interpreting the output of numeric_float8 as a uintptr_t
and applying some unwanted conversion to make that a float.

Agreed it's a bug on 32-bit machines. Not sure about 64-bits.

It's not clear to me exactly what this function is used for,
but I guess the consequences would be broken BRIN indexes
on numeric columns?

Actually, no - it should not cause "broken" indexes, as in "giving
incorrect results". The distance functions determine in what order we
merge points into ranges, and if the distances are bogus then we can
build a summary that is less efficient.

regards

--
Tomas Vondra

Tom Lane

tgl@sss.pgh.pa.us

6 months ago

In reply to: Tomas Vondra (#2)

Re: Bug in brin_minmax_multi_distance_numeric()

Tomas Vondra <tomas@vondra.me> writes:

On 7/31/25 19:33, Tom Lane wrote:

... It is certainly broken on
32-bit machines where the Datum result of numeric_float8 will
be a pointer, so that we will convert the numeric pointer value
to a float and return that, yielding a totally-garbage distance
value. But I think it's broken on 64-bit machines too, because
we'll be interpreting the output of numeric_float8 as a uintptr_t
and applying some unwanted conversion to make that a float.

Agreed it's a bug on 32-bit machines. Not sure about 64-bits.

Yeah, I'm not 100% sure about that. It's certainly doing something
unexpected, but we might accidentally end up with relatively-sane
relative distance comparisons anyway. (I assume the outputs will
only be compared to other outputs of the same function, right?)
I have a vague recollection that the IEEE float format was chosen with
an eye to making comparisons cheap, ie not too much different from
integer comparisons. So the sort order might be about the same
even after incorrectly reinterpreting the bit-pattern as an int.
NaNs probably mess that up, but they would anyway.

Actually, no - it should not cause "broken" indexes, as in "giving
incorrect results". The distance functions determine in what order we
merge points into ranges, and if the distances are bogus then we can
build a summary that is less efficient.

Got it. So it might be worth reindexing such indexes after the
fix, but it's not strictly necessary.

regards, tom lane

Tomas Vondra

tomas@vondra.me

5 months ago

In reply to: Tom Lane (#3)

Re: Bug in brin_minmax_multi_distance_numeric()

On 7/31/25 20:35, Tom Lane wrote:

Tomas Vondra <tomas@vondra.me> writes:

On 7/31/25 19:33, Tom Lane wrote:

... It is certainly broken on
32-bit machines where the Datum result of numeric_float8 will
be a pointer, so that we will convert the numeric pointer value
to a float and return that, yielding a totally-garbage distance
value. But I think it's broken on 64-bit machines too, because
we'll be interpreting the output of numeric_float8 as a uintptr_t
and applying some unwanted conversion to make that a float.

Agreed it's a bug on 32-bit machines. Not sure about 64-bits.

Yeah, I'm not 100% sure about that. It's certainly doing something
unexpected, but we might accidentally end up with relatively-sane
relative distance comparisons anyway. (I assume the outputs will
only be compared to other outputs of the same function, right?)

Yes. The index accumulates values, sorts them, calculates distance
between the points, and them "merges" the closest ones. So it only
compares results of the same function.

I have a vague recollection that the IEEE float format was chosen with
an eye to making comparisons cheap, ie not too much different from
integer comparisons. So the sort order might be about the same
even after incorrectly reinterpreting the bit-pattern as an int.
NaNs probably mess that up, but they would anyway.

Actually, no - it should not cause "broken" indexes, as in "giving
incorrect results". The distance functions determine in what order we
merge points into ranges, and if the distances are bogus then we can
build a summary that is less efficient.

Got it. So it might be worth reindexing such indexes after the
fix, but it's not strictly necessary.

Exactly.

regards

--
Tomas Vondra

Peter Eisentraut

peter@eisentraut.org

5 months ago

In reply to: Tomas Vondra (#4)

Re: Bug in brin_minmax_multi_distance_numeric()

On 01.08.25 19:17, Tomas Vondra wrote:

On 7/31/25 20:35, Tom Lane wrote:

Tomas Vondra <tomas@vondra.me> writes:

On 7/31/25 19:33, Tom Lane wrote:

... It is certainly broken on
32-bit machines where the Datum result of numeric_float8 will
be a pointer, so that we will convert the numeric pointer value
to a float and return that, yielding a totally-garbage distance
value. But I think it's broken on 64-bit machines too, because
we'll be interpreting the output of numeric_float8 as a uintptr_t
and applying some unwanted conversion to make that a float.

Agreed it's a bug on 32-bit machines. Not sure about 64-bits.

Yeah, I'm not 100% sure about that. It's certainly doing something
unexpected, but we might accidentally end up with relatively-sane
relative distance comparisons anyway. (I assume the outputs will
only be compared to other outputs of the same function, right?)

Yes. The index accumulates values, sorts them, calculates distance
between the points, and them "merges" the closest ones. So it only
compares results of the same function.

I have a vague recollection that the IEEE float format was chosen with
an eye to making comparisons cheap, ie not too much different from
integer comparisons. So the sort order might be about the same
even after incorrectly reinterpreting the bit-pattern as an int.
NaNs probably mess that up, but they would anyway.

Actually, no - it should not cause "broken" indexes, as in "giving
incorrect results". The distance functions determine in what order we
merge points into ranges, and if the distances are bogus then we can
build a summary that is less efficient.

Got it. So it might be worth reindexing such indexes after the
fix, but it's not strictly necessary.

Do we want to make a separate commit for this issue that can be
backpatched and have some user-facing information attached to it?

Tom Lane

tgl@sss.pgh.pa.us

5 months ago

In reply to: Peter Eisentraut (#5)

Re: Bug in brin_minmax_multi_distance_numeric()

Peter Eisentraut <peter@eisentraut.org> writes:

Do we want to make a separate commit for this issue that can be
backpatched and have some user-facing information attached to it?

Yes, I think it ought to be committed/backpatched separately.
I was expecting Tomas to do that, but I can if he's busy ...

regards, tom lane

Tomas Vondra

tomas@vondra.me

5 months ago

In reply to: Tom Lane (#6)

Re: Bug in brin_minmax_multi_distance_numeric()

On 8/5/25 20:11, Tom Lane wrote:

Peter Eisentraut <peter@eisentraut.org> writes:

Do we want to make a separate commit for this issue that can be
backpatched and have some user-facing information attached to it?

Yes, I think it ought to be committed/backpatched separately.
I was expecting Tomas to do that, but I can if he's busy ...

Sorry, I didn't realize that - it seemed you're handling this. I can
take care of this in the next couple days, if still needed.

regards

--
Tomas Vondra

Tom Lane

tgl@sss.pgh.pa.us

5 months ago

In reply to: Tomas Vondra (#7)

Re: Bug in brin_minmax_multi_distance_numeric()

Tomas Vondra <tomas@vondra.me> writes:

On 8/5/25 20:11, Tom Lane wrote:

Yes, I think it ought to be committed/backpatched separately.
I was expecting Tomas to do that, but I can if he's busy ...

Sorry, I didn't realize that - it seemed you're handling this. I can
take care of this in the next couple days, if still needed.

Oh, I'm happy to do it if you weren't intending to.

regards, tom lane

Tomas Vondra

tomas@vondra.me

5 months ago

In reply to: Tom Lane (#8)

Re: Bug in brin_minmax_multi_distance_numeric()

On 8/5/25 22:17, Tom Lane wrote:

Tomas Vondra <tomas@vondra.me> writes:

On 8/5/25 20:11, Tom Lane wrote:

Yes, I think it ought to be committed/backpatched separately.
I was expecting Tomas to do that, but I can if he's busy ...

Sorry, I didn't realize that - it seemed you're handling this. I can
take care of this in the next couple days, if still needed.

Oh, I'm happy to do it if you weren't intending to.

If you're willing to take care of it, that'd be great. I won't have time
to deal with this until sometime next week.

Thank you.

--
Tomas Vondra

#10

Tom Lane

tgl@sss.pgh.pa.us

5 months ago

In reply to: Tom Lane (#3)

Re: Bug in brin_minmax_multi_distance_numeric()

I wrote:

Tomas Vondra <tomas@vondra.me> writes:

Agreed it's a bug on 32-bit machines. Not sure about 64-bits.

Yeah, I'm not 100% sure about that. It's certainly doing something
unexpected, but we might accidentally end up with relatively-sane
relative distance comparisons anyway. (I assume the outputs will
only be compared to other outputs of the same function, right?)
I have a vague recollection that the IEEE float format was chosen with
an eye to making comparisons cheap, ie not too much different from
integer comparisons. So the sort order might be about the same
even after incorrectly reinterpreting the bit-pattern as an int.
NaNs probably mess that up, but they would anyway.

For the archives' sake: I researched this a little more, and
verified my recollection wasn't totally inaccurate. If you compare
two IEEE-format floats using 2s-complement integer arithmetic, then:

* You will get the same less/equal/greater result as the correct
floating-point comparison result if both numbers are nonnegative,
or if one is nonnegative and the other is negative.

* You will get the opposite of the correct result (inverted sort
order) if both numbers are negative.

* If either number is NaN, things would normally not work in a
way comparable to floating-point behavior, but as long as all
the NaNs are identically represented ... which they would be ...
it'd actually sort the same way we sort NaNs, as larger than
plus-infinity.

Now that's not exactly what the broken code was doing: it was
interpreting the bits as an integer, converting that hallucinated
integer to a float, and then (later) subtracting two such floats.
However, as long as negative values aren't involved, the float
conversion would preserve sort order and thus give distance
numbers that are at least topologically sane.

Nonetheless, we had better recommend reindexing these indexes
even on 64-bit machines. Even if the distances calculated before
the fix weren't totally insane, they will be quite a bit different
from the distances calculated after the fix. So I fear that even
if an index was more or less okay beforehand, it's likely to degrade
pretty badly once we start making new merge decisions with a different
distance function.

regards, tom lane