wrong query results on bf leafhopper
Hi,
I noticed this recent BF failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2025-05-15%2008%3A10%3A04
=== dumping /home/bf/proj/bf/build-farm-17/HEAD/pgsql.build/src/test/recovery/tmp_check/regression.diffs ===
diff -U3 /home/bf/proj/bf/build-farm-17/HEAD/pgsql.build/src/test/regress/expected/memoize.out /home/bf/proj/bf/build-farm-17/HEAD/pgsql.build/src/test/recovery/tmp_check/results/memoize.out
--- /home/bf/proj/bf/build-farm-17/HEAD/pgsql.build/src/test/regress/expected/memoize.out 2025-05-15 08:10:04.211926695 +0000
+++ /home/bf/proj/bf/build-farm-17/HEAD/pgsql.build/src/test/recovery/tmp_check/results/memoize.out 2025-05-15 08:18:29.117733601 +0000
@@ -42,7 +42,7 @@
-> Nested Loop (actual rows=1000.00 loops=N)
-> Seq Scan on tenk1 t2 (actual rows=1000.00 loops=N)
Filter: (unique1 < 1000)
- Rows Removed by Filter: 9000
+ Rows Removed by Filter: 8982
-> Memoize (actual rows=1.00 loops=N)
Cache Key: t2.twenty
Cache Mode: logical
@@ -178,7 +178,7 @@
-> Nested Loop (actual rows=1000.00 loops=N)
-> Seq Scan on tenk1 t1 (actual rows=1000.00 loops=N)
Filter: (unique1 < 1000)
- Rows Removed by Filter: 9000
+ Rows Removed by Filter: 8981
-> Memoize (actual rows=1.00 loops=N)
Cache Key: t1.two, t1.twenty
Cache Mode: binary
For a moment I thought this could be a bug in memoize, but that doesn't
actually make sense - the failure isn't in memoize, it's the seqscan.
Subsequently I got worried that this is an AIO bug or such causing wrong query
results. But there are instances of this error well before AIO was
merged. E.g.
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2024-12-18%2023%3A35%3A04
The same error is also present down to 16.
In 15, I saw a potentially related error
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2024-12-16%2023%3A43%3A03
There have been other odd things on leafhopper, see e.g.:
/messages/by-id/35d87371-f3ab-42c8-9aac-bb39ab5bd987@gmail.com
/messages/by-id/Z4npAKvchWzKfb_r@paquier.xyz
Greetings,
Andres Freund
Those are two different scans, "Seq Scan on tenk1 t2" and "Seq Scan on tenk1
t1", so might this not be a bug?
On 16.05.2025 09:19, Andres Freund wrote:
Hi,
I noticed this recent BF failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2025-05-15%2008%3A10%3A04
=== dumping /home/bf/proj/bf/build-farm-17/HEAD/pgsql.build/src/test/recovery/tmp_check/regression.diffs ===
diff -U3 /home/bf/proj/bf/build-farm-17/HEAD/pgsql.build/src/test/regress/expected/memoize.out /home/bf/proj/bf/build-farm-17/HEAD/pgsql.build/src/test/recovery/tmp_check/results/memoize.out
--- /home/bf/proj/bf/build-farm-17/HEAD/pgsql.build/src/test/regress/expected/memoize.out 2025-05-15 08:10:04.211926695 +0000
+++ /home/bf/proj/bf/build-farm-17/HEAD/pgsql.build/src/test/recovery/tmp_check/results/memoize.out 2025-05-15 08:18:29.117733601 +0000
@@ -42,7 +42,7 @@
-> Nested Loop (actual rows=1000.00 loops=N)
-> Seq Scan on tenk1 t2 (actual rows=1000.00 loops=N)
Filter: (unique1 < 1000)
- Rows Removed by Filter: 9000
+ Rows Removed by Filter: 8982
-> Memoize (actual rows=1.00 loops=N)
Cache Key: t2.twenty
Cache Mode: logical
@@ -178,7 +178,7 @@
-> Nested Loop (actual rows=1000.00 loops=N)
-> Seq Scan on tenk1 t1 (actual rows=1000.00 loops=N)
Filter: (unique1 < 1000)
- Rows Removed by Filter: 9000
+ Rows Removed by Filter: 8981
-> Memoize (actual rows=1.00 loops=N)
Cache Key: t1.two, t1.twenty
Cache Mode: binary
For a moment I thought this could be a bug in memoize, but that doesn't
actually make sense - the failure isn't in memoize, it's the seqscan.
Subsequently I got worried that this is an AIO bug or such causing wrong query
results. But there are instances of this error well before AIO was
merged. E.g.
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2024-12-18%2023%3A35%3A04
The same error is also present down to 16.
In 15, I saw a potentially related error
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2024-12-16%2023%3A43%3A03
There have been other odd things on leafhopper, see e.g.:
/messages/by-id/35d87371-f3ab-42c8-9aac-bb39ab5bd987@gmail.com
/messages/by-id/Z4npAKvchWzKfb_r@paquier.xyz
Greetings,
Andres Freund
--
Regards,
Alena Rybakina
Postgres Professional
Hi Andres,
On Fri, 16 May 2025 at 22:49, Andres Freund <andres@anarazel.de> wrote:
There have been other odd things on leafhopper, see e.g.:
/messages/by-id/35d87371-f3ab-42c8-9aac-bb39ab5bd987@gmail.com
Any chance this could be linked to the openssl bug [2] highlighted
in this other hacker thread [1]? The postgres issue is quite unrelated,
but the openssl bug seems non-trivial and may be good to rule out.
To confirm, leafhopper is on Graviton4, uses openssl v3.2 and is
compiled --with-openssl. I've been unable to triage the recent
leafhopper failures myself and upgrading its openssl (to v3.3+)
has been a pending task (just to rule it out).
[bf@ip-172-31-72-114 ~]$ openssl --version
OpenSSL 3.2.2 4 Jun 2024 (Library: OpenSSL 3.2.2 4 Jun 2024)
-
robins
Reference:
1. /messages/by-id/6fxlmnyagkycru3bewa4ympknywnsswlqzvwfft3ifqqiioxlv@ax53pv7xdrc2
2. https://github.com/openssl/openssl/pull/26469
Hi,
On 2025-05-19 12:49:26 +0930, Robins Tharakan wrote:
Hi Andres,
On Fri, 16 May 2025 at 22:49, Andres Freund <andres@anarazel.de> wrote:
There have been other odd things on leafhopper, see e.g.:
/messages/by-id/35d87371-f3ab-42c8-9aac-bb39ab5bd987@gmail.com
Any chance this could be linked to the openssl bug [2] highlighted
in this other hacker thread [1]? The postgres issue is quite unrelated,
but the openssl bug seems non-trivial and may be good to rule out.
To confirm, leafhopper is on Graviton4, uses openssl v3.2 and is
compiled --with-openssl. I've been unable to triage the recent
leafhopper failures myself and upgrading its openssl (to v3.3+)
has been a pending task (just to rule it out).
I don't really see how it could conceivably be related, unless we are talking
about a general broken compiler issue or such.
Greetings,
Andres Freund
On Sat, 17 May 2025 at 01:19, Andres Freund <andres@anarazel.de> wrote:
@@ -42,7 +42,7 @@
-> Nested Loop (actual rows=1000.00 loops=N)
-> Seq Scan on tenk1 t2 (actual rows=1000.00 loops=N)
Filter: (unique1 < 1000)
- Rows Removed by Filter: 9000
+ Rows Removed by Filter: 8982
-> Memoize (actual rows=1.00 loops=N)
Cache Key: t2.twenty
Cache Mode: logical
@@ -178,7 +178,7 @@
-> Nested Loop (actual rows=1000.00 loops=N)
-> Seq Scan on tenk1 t1 (actual rows=1000.00 loops=N)
Filter: (unique1 < 1000)
- Rows Removed by Filter: 9000
+ Rows Removed by Filter: 8981
-> Memoize (actual rows=1.00 loops=N)
Cache Key: t1.two, t1.twenty
Cache Mode: binary
Note that the actual row count is 1000 still, so that pretty much
discounts corruption with the stored unique1 values. Unfortunately,
that doesn't reduce the number of possible other reasons by very much.
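(For concreteness, assuming the standard 10000-row tenk1 table: the expected
output is 1000 rows returned plus 9000 removed by the filter, while the failing
runs show 1000 returned plus only 8982 or 8981 removed - i.e. the seq scan
apparently never saw roughly 18-19 of the tuples at all.)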
For a moment I thought this could be a bug in memoize, but that doesn't
actually make sense - the failure isn't in memoize, it's the seqscan.
I don't have any bright ideas what the cause might be right now, but I
agree that it seems unlikely to be anything related to Memoize.
It might be worth adding a query like: "select count(odd),min(ctid)
from tenk1;" that should use a Seq Scan plan (ideally max(ctid) too,
but that won't be stable over CPU architectures). Maybe also "select
unique1/1000,count(odd) from tenk1 group by 1 order by 1;" so we can
see if there's any sort of consistency or pattern as to which tuples
are missing. Maybe those will provoke some ideas.
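Spelled out, those additions might look roughly like this (a sketch only; the
expected counts assume the standard 10000-row tenk1 regression table, with
unique1 spanning 0..9999):
-- should use a Seq Scan plan; on an intact table count(odd) is 10000
select count(odd), min(ctid) from tenk1;
-- buckets of 1000 consecutive unique1 values; each bucket should report
-- exactly 1000 rows, so any missing tuples show up as a short bucket
select unique1/1000, count(odd) from tenk1 group by 1 order by 1;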
David
David Rowley <dgrowleyml@gmail.com> writes:
Note that the actual row count is 1000 still, so that pretty much
discounts corruption with the stored unique1 values. Unfortunately,
that doesn't reduce the number of possible other reasons by very much.
Failures like this one [1]:
@@ -340,9 +340,13 @@
create function myinthash(myint) returns integer strict immutable language
internal as 'hashint4';
NOTICE: argument type myint is only a shell
+ERROR: ROWS is not applicable when function does not return a set
are hard to explain as anything besides "that machine is quite
broken". Whether it's flaky hardware, broken compiler, or what is
undeterminable from here, but I don't believe it's our bug. So I'm
unexcited about putting effort into it.
regards, tom lane
[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2025-05-19%2007%3A07%3A04
On Tue, 20 May 2025 at 16:07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Failures like this one [1]:
@@ -340,9 +340,13 @@
create function myinthash(myint) returns integer strict immutable language
internal as 'hashint4';
NOTICE: argument type myint is only a shell
+ERROR: ROWS is not applicable when function does not return a set
are hard to explain as anything besides "that machine is quite
broken". Whether it's flaky hardware, broken compiler, or what is
undeterminable from here, but I don't believe it's our bug. So I'm
unexcited about putting effort into it.
There are certainly much fewer moving parts in PostgreSQL code for
that one as this failure doesn't seem to rely on anything stored in
any tables or the catalogues.
I'd have thought it would be unlikely to be a compiler bug as wouldn't
that mean it'd fail every time?
Are there any Prime95-like stress testers for ARM that could be run on
this machine?
It would be good to kick this one out the pool if there's hardware issues.
David
On 5/20/25 07:50, David Rowley wrote:
On Tue, 20 May 2025 at 16:07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Failures like this one [1]:
@@ -340,9 +340,13 @@
create function myinthash(myint) returns integer strict immutable language
internal as 'hashint4';
NOTICE: argument type myint is only a shell
+ERROR: ROWS is not applicable when function does not return a set
are hard to explain as anything besides "that machine is quite
broken". Whether it's flaky hardware, broken compiler, or what is
undeterminable from here, but I don't believe it's our bug. So I'm
unexcited about putting effort into it.
There are certainly much fewer moving parts in PostgreSQL code for
that one as this failure doesn't seem to rely on anything stored in
any tables or the catalogues.
I'd have thought it would be unlikely to be a compiler bug as wouldn't
that mean it'd fail every time?
Are there any Prime95-like stress testers for ARM that could be run on
this machine?
It would be good to kick this one out the pool if there's hardware issues.
There are tools like "stress" and "stressant", etc. Works on my rpi5,
but depends on the packager.
I'd probably just look at dmesg first. In my experience hardware issues
are often pretty visible there - reports of failed I/O requests, thermal
issues on the CPU, that kind of stuff.
regards
--
Tomas Vondra
On Tue, 20 May 2025 at 15:20, David Rowley <dgrowleyml@gmail.com> wrote:
On Tue, 20 May 2025 at 16:07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Failures like this one [1]:
@@ -340,9 +340,13 @@
create function myinthash(myint) returns integer strict immutable language
internal as 'hashint4';
NOTICE: argument type myint is only a shell
+ERROR: ROWS is not applicable when function does not return a set
are hard to explain as anything besides "that machine is quite
broken". Whether it's flaky hardware, broken compiler, or what is
undeterminable from here, but I don't believe it's our bug. So I'm
unexcited about putting effort into it.
There are certainly much fewer moving parts in PostgreSQL code for
that one as this failure doesn't seem to rely on anything stored in
any tables or the catalogues.
I'd have thought it would be unlikely to be a compiler bug as wouldn't
that mean it'd fail every time?
Recently leafhopper failed again on the same test. For now I've paused it.
To rule out the compiler (and its maturity on the architecture), I'll upgrade
gcc (to nightly, or something more recent) and then re-enable to see if it
changes anything.
I didn't dive in deeper but I see that indri failed recently [1] on what seems
like the exact same test / line-number (at t/027_stream_regress.pl line 95)
that leafhopper has been tripping on recently. The error is not verbatim,
but it was a little too coincidental to not highlight here.
-
robins
Ref:
1. https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=indri&dt=2025-05-23%2020%3A30%3A07
Robins Tharakan <tharakan@gmail.com> writes:
I didn't dive in deeper but I see that indri failed recently [1] on what seems
like the exact same test / line-number (at t/027_stream_regress.pl line 95)
that leafhopper has been tripping on recently. The error is not verbatim,
but it was a little too coincidental to not highlight here.
027_stream_regress.pl is quite a large/complicated test, and for
reasons that are not clear to me it seems more prone to intermittent
timing problems than most other tests. I would not read very much
into that being the test that failed for you, especially since the
detailed symptoms are not like indri's.
regards, tom lane
Hi,
On 2025-05-28 22:51:14 +0930, Robins Tharakan wrote:
On Tue, 20 May 2025 at 15:20, David Rowley <dgrowleyml@gmail.com> wrote:
On Tue, 20 May 2025 at 16:07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Failures like this one [1]:
@@ -340,9 +340,13 @@
create function myinthash(myint) returns integer strict immutable language
internal as 'hashint4';
NOTICE: argument type myint is only a shell
+ERROR: ROWS is not applicable when function does not return a set
are hard to explain as anything besides "that machine is quite
broken". Whether it's flaky hardware, broken compiler, or what is
undeterminable from here, but I don't believe it's our bug. So I'm
unexcited about putting effort into it.
There are certainly much fewer moving parts in PostgreSQL code for
that one as this failure doesn't seem to rely on anything stored in
any tables or the catalogues.
I'd have thought it would be unlikely to be a compiler bug as wouldn't
that mean it'd fail every time?
Recently leafhopper failed again on the same test. For now I've paused it.
To rule out the compiler (and its maturity on the architecture), I'll upgrade
gcc (to nightly, or something more recent) and then re-enable to see if it
changes anything.
+1 to a gcc upgrade, gcc 11 is rather old and out of upstream support.
A kernel upgrade would be good too. My completely baseless gut feeling is that
some SIMD registers occasionally get corrupted, e.g. due to a kernel
interrupt / context switch not properly storing & restoring them. Weirdly
enough, the instrumentation code is among the pieces of PG code most
vulnerable to that because we mostly don't do enough auto-vectorizable math,
but InstrEndLoop(), InstrStopNode() etc. are trivially auto-vectorizable. I'm
pretty sure I've previously analyzed problems around this, but don't remember
the details (IA64 maybe?).
I didn't dive in deeper but I see that indri failed recently [1] on what seems
like the exact same test / line-number (at t/027_stream_regress.pl line 95)
that leafhopper has been tripping on recently. The error is not verbatim,
but it was a little too coincidental to not highlight here.
For 027_stream_regress.pl you really need to look at
regress_log_027_stream_regress.log, as that specific line just tests whether
the standard regression tests passed. The failure on indri is rather different
from your issue; I doubt there's an overlap between the problems...
I think we should spruce up 027_stream_regress.pl a bit around this. Before
the "regression tests pass" check we should
a) check if primary is still alive
b) check if standby is still alive
and then, iff a) & b) pass, in addition to printing the entire regression test
file, we should add the head and tail of regression.diffs to the failure
message, so one can quickly glean what went wrong.
Greetings,
Andres Freund
Hi,
On Thu, 29 May 2025 at 02:32, Andres Freund <andres@anarazel.de> wrote:
On 2025-05-28 22:51:14 +0930, Robins Tharakan wrote:
Recently leafhopper failed again on the same test. For now I've paused it.
To rule out the compiler (and its maturity on the architecture), I'll upgrade
gcc (to nightly, or something more recent) and then re-enable to see if it
changes anything.
+1 to a gcc upgrade, gcc 11 is rather old and out of upstream support.
Ack. I've updated leafhopper to gcc master. For now (to get the machine
green / running), I've disabled some flags, which I'll revisit in some time,
but hopefully that's not about compiler maturity - which is what I'm after
here.
A kernel upgrade would be good too. My completely baseless gut feeling is that
some SIMD registers occasionally get corrupted, e.g. due to a kernel
interrupt / context switch not properly storing & restoring them. Weirdly
enough, the instrumentation code is among the pieces of PG code most
vulnerable to that because we mostly don't do enough auto-vectorizable math,
but InstrEndLoop(), InstrStopNode() etc. are trivially auto-vectorizable. I'm
pretty sure I've previously analyzed problems around this, but don't remember
the details (IA64 maybe?).
Fair point, I'll keep that option open. Originally, the machine was spun up to
evaluate the graviton4 ec2 instance and I'd like to explore whether the
stock-kernel / kernel-updates are able to keep the instance green (and resort
to updating the kernel only if I exhaust all other options - pg / compiler
etc.).
-
robins