Non-reproducible AIO failure
My buildfarm animal sifaka just failed like this [1]:
[1] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=sifaka&dt=2025-04-23%2020%3A03%3A24&stg=recovery-check
TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"), File: "bufmgr.c", Line: 1605, PID: 79322
0 postgres 0x0000000100e3df2c ExceptionalCondition + 108
1 postgres 0x0000000100cbb154 WaitReadBuffers + 616
2 postgres 0x0000000100f072bc read_stream_next_buffer.cold.1 + 184
3 postgres 0x0000000100cb6dd8 read_stream_next_buffer + 300
4 postgres 0x00000001009b75b4 heap_fetch_next_buffer + 136
5 postgres 0x00000001009ad8c4 heapgettup + 368
6 postgres 0x00000001009ad364 heap_getnext + 104
7 postgres 0x00000001009b9770 heapam_index_build_range_scan + 700
8 postgres 0x00000001009ef4b8 spgbuild + 480
9 postgres 0x0000000100a41acc index_build + 428
10 postgres 0x0000000100a43d14 reindex_index + 752
11 postgres 0x0000000100ad09e0 ExecReindex + 1908
12 postgres 0x0000000100d04628 ProcessUtilitySlow + 1300
13 postgres 0x0000000100d03754 standard_ProcessUtility + 1528
14 pg_stat_statements.dylib 0x000000010158d7a0 pgss_ProcessUtility + 632
15 postgres 0x0000000100d02cec PortalRunUtility + 184
16 postgres 0x0000000100d02354 PortalRunMulti + 236
17 postgres 0x0000000100d01d94 PortalRun + 456
18 postgres 0x0000000100d00e10 exec_simple_query + 1308
...
2025-04-23 16:08:28.834 EDT [78788:4] LOG: client backend (PID 79322) was terminated by signal 6: Abort trap: 6
2025-04-23 16:08:28.834 EDT [78788:5] DETAIL: Failed process was running: reindex index spgist_point_idx;
Scraping the buildfarm database confirmed that there have been no
other similar failures since AIO went in, and unsurprisingly some
attempts to reproduce it manually didn't bear fruit. So I'm
suspecting a race condition with a very narrow window, but I don't
have an idea where to look. Any thoughts on how to gather more
data?
regards, tom lane
Hi,
On 2025-04-23 17:17:01 -0400, Tom Lane wrote:
My buildfarm animal sifaka just failed like this [1]:
There's nothing really special about sifaka, is there? I see -std=gnu99 and a
few debug -D cppflags, but they don't look like they could really be relevant here.
TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"), File: "bufmgr.c", Line: 1605, PID: 79322
Huh.
I assume there's no core file left over? Would be helpful to query some more
state (e.g. whether operation->io_wref is valid (think it has to be) and
whether this was a partial read or such).
0 postgres 0x0000000100e3df2c ExceptionalCondition + 108
1 postgres 0x0000000100cbb154 WaitReadBuffers + 616
2 postgres 0x0000000100f072bc read_stream_next_buffer.cold.1 + 184
3 postgres 0x0000000100cb6dd8 read_stream_next_buffer + 300
4 postgres 0x00000001009b75b4 heap_fetch_next_buffer + 136
5 postgres 0x00000001009ad8c4 heapgettup + 368
6 postgres 0x00000001009ad364 heap_getnext + 104
7 postgres 0x00000001009b9770 heapam_index_build_range_scan + 700
8 postgres 0x00000001009ef4b8 spgbuild + 480
9 postgres 0x0000000100a41acc index_build + 428
10 postgres 0x0000000100a43d14 reindex_index + 752
11 postgres 0x0000000100ad09e0 ExecReindex + 1908
12 postgres 0x0000000100d04628 ProcessUtilitySlow + 1300
13 postgres 0x0000000100d03754 standard_ProcessUtility + 1528
14 pg_stat_statements.dylib 0x000000010158d7a0 pgss_ProcessUtility + 632
15 postgres 0x0000000100d02cec PortalRunUtility + 184
16 postgres 0x0000000100d02354 PortalRunMulti + 236
17 postgres 0x0000000100d01d94 PortalRun + 456
18 postgres 0x0000000100d00e10 exec_simple_query + 1308
...
2025-04-23 16:08:28.834 EDT [78788:4] LOG: client backend (PID 79322) was terminated by signal 6: Abort trap: 6
2025-04-23 16:08:28.834 EDT [78788:5] DETAIL: Failed process was running: reindex index spgist_point_idx;
Scraping the buildfarm database confirmed that there have been no
other similar failures since AIO went in, and unsurprisingly some
attempts to reproduce it manually didn't bear fruit. So I'm
suspecting a race condition with a very narrow window, but I don't
have an idea where to look. Any thoughts on how to gather more
data?
The weird thing is that the concrete symptom here doesn't actually involve
anything that could really race - the relevant state is only updated by the
process itself. But of course there has to be something narrow here,
otherwise we'd have seen this before :(.
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2025-04-23 17:17:01 -0400, Tom Lane wrote:
My buildfarm animal sifaka just failed like this [1]:
There's nothing really special about sifaka, is there? I see -std=gnu99 and a
few debug -D cppflags, but they don't look like they could really be relevant here.
No, it's intended to be a bog-standard macOS environment.
TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"), File: "bufmgr.c", Line: 1605, PID: 79322
Huh.
I assume there's no core file left over?
No such luck, I did not have that enabled. I will turn it on though,
and also on its sibling indri.
regards, tom lane
Hello Tom and Andres,
24.04.2025 01:58, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2025-04-23 17:17:01 -0400, Tom Lane wrote:
My buildfarm animal sifaka just failed like this [1]:
There's nothing really special about sifaka, is there? I see -std=gnu99 and a
few debug -D cppflags, but they don't look like they could really be relevant here.
FWIW, that Assert has just triggered on another Mac:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=indri&dt=2025-05-23%2020%3A30%3A07
I have been trying to reproduce it locally (on Linux, FreeBSD VMs) for days,
but failed. So maybe it's a macOS-specific issue. I will try to get access
to a similar machine to reproduce it there.
Best regards,
Alexander Lakhin
Neon (https://neon.tech)
Alexander Lakhin <exclusion@gmail.com> writes:
FWIW, that Assert has just triggered on another Mac:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=indri&dt=2025-05-23%2020%3A30%3A07
Yeah, I was just looking at that too. There is a corefile
from that crash, but lldb seems unable to extract anything
from it :-(. There is a low-resolution stack trace in the
postmaster log though:
0 postgres 0x0000000105299c84 ExceptionalCondition + 108
1 postgres 0x00000001051159ac WaitReadBuffers + 616
2 postgres 0x00000001053611ec read_stream_next_buffer.cold.1 + 184
3 postgres 0x0000000105111630 read_stream_next_buffer + 300
4 postgres 0x0000000104e0b994 heap_fetch_next_buffer + 136
5 postgres 0x0000000104e018f4 heapgettup_pagemode + 204
6 postgres 0x0000000104e02010 heap_getnextslot + 84
7 postgres 0x0000000104faebb4 SeqNext + 160
Of note here is that indri and sifaka run on the same host ---
indri uses some MacPorts packages while sifaka doesn't, but
that shouldn't have anything to do with our AIO code.
I trawled the buildfarm database and confirmed that these two crashes
are our only similar reports (grepping for "PGAIO_RS_UNKNOWN"):
sysname | branch | snapshot | stage | l
---------+--------+---------------------+---------------+-------------------------------------------------------------------------------------------------------------
indri | HEAD | 2025-05-23 20:30:07 | recoveryCheck | TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"), File: "bufmgr.c", Line: 1605, PID: 20931
sifaka | HEAD | 2025-04-23 20:03:24 | recoveryCheck | TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"), File: "bufmgr.c", Line: 1605, PID: 79322
(2 rows)
So it seems that "very low-probability issue in our Mac AIO code" is
the most probable description.
regards, tom lane
On Sat, May 24, 2025 at 3:17 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
So it seems that "very low-probability issue in our Mac AIO code" is
the most probable description.
There isn't any macOS-specific AIO code so my first guess would be
that it might be due to aarch64 weak memory reordering (though Andres
speculated that it should all be one backend, huh), if it's not just
a timing luck thing. Alexander, were the other OSes you tried all on
x86?
Hello Thomas,
24.05.2025 14:42, Thomas Munro wrote:
On Sat, May 24, 2025 at 3:17 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
So it seems that "very low-probability issue in our Mac AIO code" is
the most probable description.
There isn't any macOS-specific AIO code so my first guess would be
that it might be due to aarch64 weak memory reordering (though Andres
speculated that it should all be one backend, huh), if it's not just
a timing luck thing. Alexander, were the other OSes you tried all on
x86?
As I wrote off-list before, I had tried x86_64 only, but since then I
tried to reproduce the issue on an aarch64 server with Ubuntu 24.04,
running 10, then 40 instances of t/027_stream_regress.pl in parallel. I've
also multiplied the "test: brin ..." line x10. But the issue still has not
reproduced (in 8+ hours).
However, I've managed to get an AIO-related assertion failure on macOS 14.5
(Darwin Kernel Version 23.5.0) quite easily, running 027_stream_regress.pl
the same way:
2 # +++ tap check in src/test/recovery_2 +++
3 # +++ tap check in src/test/recovery_3 +++
4 # +++ tap check in src/test/recovery_4 +++
5 # +++ tap check in src/test/recovery_5 +++
1 # +++ tap check in src/test/recovery_1 +++
7 # +++ tap check in src/test/recovery_7 +++
6 # +++ tap check in src/test/recovery_6 +++
8 # +++ tap check in src/test/recovery_8 +++
1 [16:07:31] t/027_stream_regress.pl ..
1 Dubious, test returned 29 (wstat 7424, 0x1d00)
1 All 1 subtests passed
1 [16:07:31]
1
1 Test Summary Report
1 -------------------
1 t/027_stream_regress.pl (Wstat: 7424 Tests: 1 Failed: 0)
1 Non-zero exit status: 29
1 Parse errors: No plan found in TAP output
1 Files=1, Tests=1, 197 wallclock secs ( 0.01 usr 0.01 sys + 4.47 cusr 5.79 csys = 10.28 CPU)
1 Result: FAIL
TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 161, PID: 32355
0 postgres 0x0000000104f078f4 ExceptionalCondition + 236
1 postgres 0x0000000104c0ebd4 pgaio_io_before_start + 260
2 postgres 0x0000000104c0ea94 pgaio_io_start_readv + 36
3 postgres 0x0000000104c2d4e8 FileStartReadV + 252
4 postgres 0x0000000104c807c8 mdstartreadv + 668
5 postgres 0x0000000104c83db0 smgrstartreadv + 116
6 postgres 0x0000000104c1675c AsyncReadBuffers + 1444
7 postgres 0x0000000104c15868 StartReadBuffersImpl + 1196
8 postgres 0x0000000104c153ac StartReadBuffers + 64
9 postgres 0x0000000104c129b4 read_stream_start_pending_read + 1204
10 postgres 0x0000000104c12018 read_stream_look_ahead + 812
11 postgres 0x0000000104c11cd0 read_stream_next_buffer + 1396
12 postgres 0x0000000104606c38 heap_fetch_next_buffer + 284
13 postgres 0x00000001045f79d4 heapgettup_pagemode + 192
14 postgres 0x00000001045f7fa4 heap_getnextslot + 84
15 postgres 0x000000010497a404 table_scan_getnextslot + 340
16 postgres 0x000000010497a210 SeqNext + 148
17 postgres 0x000000010497a8e4 ExecScanFetch + 772
18 postgres 0x000000010497a4b8 ExecScanExtended + 164
19 postgres 0x0000000104979bf0 ExecSeqScanWithProject + 244
20 postgres 0x0000000104953ce0 ExecProcNode + 68
21 postgres 0x000000010494f148 MultiExecPrivateHash + 64
22 postgres 0x000000010494ed48 MultiExecHash + 100
23 postgres 0x0000000104925a18 MultiExecProcNode + 180
24 postgres 0x0000000104957220 ExecHashJoinImpl + 628
25 postgres 0x00000001049566dc ExecHashJoin + 32
26 postgres 0x0000000104925958 ExecProcNodeFirst + 100
27 postgres 0x000000010491c08c ExecProcNode + 68
28 postgres 0x000000010491692c ExecutePlan + 268
29 postgres 0x000000010491679c standard_ExecutorRun + 504
30 pg_stat_statements.dylib 0x00000001055e132c pgss_ExecutorRun + 212
31 postgres 0x0000000104916580 ExecutorRun + 72
32 postgres 0x000000010491f59c ParallelQueryMain + 508
33 postgres 0x00000001046968d4 ParallelWorkerMain + 1712
34 postgres 0x0000000104b476f4 BackgroundWorkerMain + 824
35 postgres 0x0000000104b4b87c postmaster_child_launch + 492
36 postgres 0x0000000104b56610 StartBackgroundWorker + 416
37 postgres 0x0000000104b51378 maybe_start_bgworkers + 552
38 postgres 0x0000000104b540cc LaunchMissingBackgroundProcesses + 1316
39 postgres 0x0000000104b51650 ServerLoop + 616
40 postgres 0x0000000104b4fef0 PostmasterMain + 6632
41 postgres 0x00000001049bfe20 main + 952
42 dyld 0x00000001849d60e0 start + 2360
2025-05-24 16:07:23.427 EDT [22120:4] LOG: background worker "parallel worker" (PID 32355) was terminated by signal 6:
Abort trap: 6
2025-05-24 16:07:23.427 EDT [22120:5] DETAIL: Failed process was running: select count(*) from join_foo
left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
I have yet to see this Assert triggered on the buildfarm, but this one looks
interesting too.
(I can share the complete patch + script for such testing, if it can be helpful.)
Best regards,
Alexander Lakhin
Neon (https://neon.tech)
On Sun, May 25, 2025 at 9:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
Hello Thomas,
24.05.2025 14:42, Thomas Munro wrote:
On Sat, May 24, 2025 at 3:17 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
So it seems that "very low-probability issue in our Mac AIO code" is
the most probable description.
There isn't any macOS-specific AIO code so my first guess would be
that it might be due to aarch64 weak memory reordering (though Andres
speculated that it should all be one backend, huh), if it's not just
a timing luck thing. Alexander, were the other OSes you tried all on
x86?
As I wrote off-list before, I had tried x86_64 only, but since then I
tried to reproduce the issue on an aarch64 server with Ubuntu 24.04,
running 10, then 40 instances of t/027_stream_regress.pl in parallel. I've
also multiplied the "test: brin ..." line x10. But the issue still has not
reproduced (in 8+ hours).
Hmm. And I see now that this really is all in one backend. Could it
be some variation of the interrupt processing stuff from acad9093?
However, I've managed to get an AIO-related assertion failure on macOS 14.5
...
TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 161, PID: 32355
Can you get a core and print *ioh in the debugger?
Thomas Munro <thomas.munro@gmail.com> writes:
Can you get a core and print *ioh in the debugger?
So far, I've failed to get anything useful out of core files
from this failure. The trace goes back no further than
(lldb) bt
* thread #1
* frame #0: 0x000000018de39388 libsystem_kernel.dylib`__pthread_kill + 8
That's quite odd in itself: while I don't find the debugging
environment on macOS to be the greatest, it's not normally
this unhelpful.
I have managed to reproduce the "aio_ret->result.status !=
PGAIO_RS_UNKNOWN" crash about five times in total on three different
Apple Silicon machines (an M1 mini, an M4 Pro mini, and an M4 Pro
laptop). So there is definitely something wrong. The failure rate
is pretty tiny for me, though I've not tried running several test
instances concurrently.
regards, tom lane
On Sun, May 25, 2025 at 3:22 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Thomas Munro <thomas.munro@gmail.com> writes:
Can you get a core and print *ioh in the debugger?
So far, I've failed to get anything useful out of core files
from this failure. The trace goes back no further than
(lldb) bt
* thread #1
* frame #0: 0x000000018de39388 libsystem_kernel.dylib`__pthread_kill + 8
That's quite odd in itself: while I don't find the debugging
environment on macOS to be the greatest, it's not normally
this unhelpful.
(And Alexander reported the same off-list.) It's interesting that the
elog.c backtrace stuff is able to analyse the stack and it looks
normal AFAICS. Could that be interfering with the stack in the core?!
I doubt it ... I kinda wonder if the debugger might be confused about
libsystem sometimes since it has ceased to be a regular Mach-O file on
disk, but IDK; maybe gdb (from MacPorts etc) would offer a clue?
So far we have:
TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"), File: "bufmgr.c", Line: 1605, PID: 20931
0 postgres 0x0000000105299c84 ExceptionalCondition + 108
1 postgres 0x00000001051159ac WaitReadBuffers + 616
2 postgres 0x00000001053611ec read_stream_next_buffer.cold.1 + 184
3 postgres 0x0000000105111630 read_stream_next_buffer + 300
4 postgres 0x0000000104e0b994 heap_fetch_next_buffer + 136
5 postgres 0x0000000104e018f4 heapgettup_pagemode + 204
Hmm, looking around that code and wracking my brain for things that
might happen on one OS but not others, I wonder about partial I/Os.
Perhaps combined with some overlapping requests.
TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 161, PID: 32355
0 postgres 0x0000000104f078f4 ExceptionalCondition + 236
1 postgres 0x0000000104c0ebd4 pgaio_io_before_start + 260
2 postgres 0x0000000104c0ea94 pgaio_io_start_readv + 36
3 postgres 0x0000000104c2d4e8 FileStartReadV + 252
4 postgres 0x0000000104c807c8 mdstartreadv + 668
5 postgres 0x0000000104c83db0 smgrstartreadv + 116
But this one seems like a more basic confusion... wild writes
somewhere? Hmm, we need to see what's in that struct.
If we can't get a debugger to break there or a core file to be
analysable, maybe we should try logging as much info as possible at
those points to learn a bit more? I would be digging like that myself
but I haven't seen this failure on my little M4 MacBook Air yet
(Sequoia 15.5, Apple clang-1700.0.13.3). It is infected with
corporate security-ware that intercepts at least file system stuff and
slows it down and I can't even convince it to dump core files right
now. Could you guys please share your exact repro steps?
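For concreteness, the sort of extra logging meant here might look like the sketch below, applied just ahead of the failing Assert in WaitReadBuffers. This is only an assumption about what state is worth dumping; the message text is invented, and the pgaio_wref_valid() call should be swapped for whatever state accessors are actually of interest. Using PANIC has the side benefit of forcing an abort at the same spot, so a core (if obtainable) reflects the logged state:

```diff
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ WaitReadBuffers
+		/* Debugging aid (sketch): dump state before the Assert can fire */
+		if (aio_ret->result.status == PGAIO_RS_UNKNOWN)
+			elog(PANIC, "result unexpectedly PGAIO_RS_UNKNOWN: wref valid: %d",
+				 pgaio_wref_valid(&operation->io_wref));
 		Assert(aio_ret->result.status != PGAIO_RS_UNKNOWN);
```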
Thomas Munro <thomas.munro@gmail.com> writes:
Could you guys please share your exact repro steps?
I've just been running 027_stream_regress.pl over and over.
It's not a recommendable answer though because the failure
probability is tiny, under 1%. It sounded like Alexander
had a better way.
regards, tom lane
Thomas Munro <thomas.munro@gmail.com> writes:
On Sun, May 25, 2025 at 3:22 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
So far, I've failed to get anything useful out of core files
from this failure. The trace goes back no further than
(lldb) bt
* thread #1
* frame #0: 0x000000018de39388 libsystem_kernel.dylib`__pthread_kill + 8
(And Alexander reported the same off-list.) It's interesting that the
elog.c backtrace stuff is able to analyse the stack and it looks
normal AFAICS. Could that be interfering with the stack in the core?!
No, but something is. Just to make sure it wasn't totally broken,
I added a sure-to-fail Assert in a random place (I chose
pg_backend_pid), and I get both a trace in the postmaster log and a
perfectly usable core file:
TRAP: failed Assert("MyProcPid == 0"), File: "pgstatfuncs.c", Line: 692, PID: 59063
0 postgres 0x00000001031f1fa4 ExceptionalCondition + 108
1 postgres 0x00000001031672b4 pg_stat_get_backend_pid + 0
2 postgres 0x0000000102e9e598 ExecInterpExpr + 5524
3 postgres 0x0000000102edb100 ExecResult + 368
4 postgres 0x0000000102ea6418 standard_ExecutorRun + 316
(lldb) bt
* thread #1
* frame #0: 0x00000001836b5388 libsystem_kernel.dylib`__pthread_kill + 8
frame #1: 0x00000001836ee88c libsystem_pthread.dylib`pthread_kill + 296
frame #2: 0x00000001835f7c60 libsystem_c.dylib`abort + 124
frame #3: 0x000000010491dfac postgres`ExceptionalCondition(conditionName=<unavailable>, fileName=<unavailable>, lineNumber=692) at assert.c:66:2 [opt]
frame #4: 0x000000010489329c postgres`pg_backend_pid(fcinfo=<unavailable>) at pgstatfuncs.c:692:2 [opt]
frame #5: 0x00000001045ca598 postgres`ExecInterpExpr(state=0x000000013780d190, econtext=0x000000013780ce38, isnull=<unavailable>) at execExprInterp.c:0 [opt]
frame #6: 0x0000000104607100 postgres`ExecResult [inlined] ExecEvalExprNoReturn(state=<unavailable>, econtext=0x000000013780ce38) at executor.h:417:13 [opt]
frame #7: 0x00000001046070f4 postgres`ExecResult [inlined] ExecEvalExprNoReturnSwitchContext(state=<unavailable>, econtext=0x000000013780ce38) at executor.h:458:2 [opt]
The fact that I can trace through this Assert failure but not the
AIO one strongly suggests some system-level problem in the latter.
There is something rotten in the state of Denmark.
For completeness, this is with Sequoia 15.5 (latest macOS) on
an M4 Pro MacBook.
but I haven't seen this failure on my little M4 MacBook Air yet
(Sequoia 15.5, Apple clang-1700.0.13.3). It is infected with
corporate security-ware that intercepts at least file system stuff and
slows it down and I can't even convince it to dump core files right
now.
As far as that goes, if you have SIP turned on (which I'm sure a
corporate laptop would), there are extra steps needed to get a
core to happen. See [1]; that page is old, but the recipe still
works for me.
regards, tom lane
[1]: https://nasa.github.io/trick/howto_guides/How-to-dump-core-file-on-MacOS.html
On Mon, May 26, 2025 at 12:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Thomas Munro <thomas.munro@gmail.com> writes:
Could you guys please share your exact repro steps?
I've just been running 027_stream_regress.pl over and over.
It's not a recommendable answer though because the failure
probability is tiny, under 1%. It sounded like Alexander
had a better way.
Could you please share your configure options?
While flailing around in the dark and contemplating sources of
nondeterminism that might come from a small system under a lot of load
(as hinted at by Alexander's mention of running the test in parallel)
with a 1MB buffer pool (as used by 027_stream_regress.pl via Cluster.pm's
settings for replication tests), I thought about partial reads:
--- a/src/backend/storage/aio/aio_io.c
+++ b/src/backend/storage/aio/aio_io.c
@@ -128,6 +128,8 @@ pgaio_io_perform_synchronously(PgAioHandle *ioh)
result = pg_preadv(ioh->op_data.read.fd, iov,
ioh->op_data.read.iov_length,
ioh->op_data.read.offset);
+ if (result > BLCKSZ && rand() < RAND_MAX / 2)
+ result = BLCKSZ;
... and the fallback path for io_method=worker that runs IOs
synchronously when the submission queue overflows because the I/O
workers aren't keeping up:
--- a/src/backend/storage/aio/method_worker.c
+++ b/src/backend/storage/aio/method_worker.c
@@ -253,7 +253,7 @@ pgaio_worker_submit_internal(int nios, PgAioHandle *ios[])
for (int i = 0; i < nios; ++i)
{
Assert(!pgaio_worker_needs_synchronous_execution(ios[i]));
-        if (!pgaio_worker_submission_queue_insert(ios[i]))
+        if (rand() < RAND_MAX / 2 ||
+            !pgaio_worker_submission_queue_insert(ios[i]))
         {
             /*
              * We'll do it synchronously, but only after we've sent as many as
... but still no dice here...
On 5/24/25 23:00, Alexander Lakhin wrote:
...
I have yet to see this Assert triggered on the buildfarm, but this one looks
interesting too.
(I can share the complete patch + script for such testing, if it can be helpful.)
I'm interested in how you run these tests in parallel. Can you share the
patch/script?
thanks
--
Tomas Vondra
Hi,
On 2025-05-25 20:05:49 -0400, Tom Lane wrote:
Thomas Munro <thomas.munro@gmail.com> writes:
Could you guys please share your exact repro steps?
I've just been running 027_stream_regress.pl over and over.
It's not a recommendable answer though because the failure
probability is tiny, under 1%. It sounded like Alexander
had a better way.
Just FYI, I've been trying to reproduce this as well, without a single failure
so far, despite running all tests a few hundred times (~2 days) and
027_stream_regress.pl many hundreds of times (~1 day).
This is on an M4 Mac mini. I'm wondering if there's some hardware-specific
memory ordering issue or disk speed based timing issue that I'm just not
hitting.
Greetings,
Andres Freund
Hello Tomas,
27.05.2025 16:26, Tomas Vondra wrote:
I'm interested in how you run these tests in parallel. Can you share the
patch/script?
Yeah, sure. I'm running the test as follows:
rm -rf src/test/recovery_*; for i in `seq 40`; do cp -r src/test/recovery/ src/test/recovery_$i/; sed -i .bak "s|src/test/recovery|src/test/recovery_$i|" src/test/recovery_$i/Makefile; done
PROVE_TESTS="t/027*" make -s check -s -C src/test/recovery
for i in `seq 100`; do echo "iteration $i"; parallel -j40 --linebuffer --tag PROVE_TESTS="t/027*" NO_TEMP_INSTALL=1 make -s check -s -C src/test/recovery_{} ::: `seq 8` || break; done; echo -e "\007"
Alexander Lakhin
Neon (https://neon.tech)
Thomas Munro <thomas.munro@gmail.com> writes:
Could you please share your configure options?
The failures on indri and sifaka were during ordinary buildfarm
runs, you can check the animals' details on the website.
(Note those are same host machine, the difference is that
indri uses some MacPorts packages while sifaka is meant to be
a bare-bones-macOS build.)
On the other two, I believe I just used my normal Mac devel
configuration. On the M4 laptop that's
$ ./configure CC="ccache gcc" CFLAGS="-O2 -fno-common" --enable-debug --enable-cassert --with-bonjour --without-icu --enable-tap-tests --with-system-tzdata=/usr/share/zoneinfo
but I see I haven't quite got the same thing on the M1 mini:
$ ./configure CC="ccache clang" --enable-debug --enable-cassert --enable-tap-tests --with-system-tzdata=/usr/share/zoneinfo
In both cases the compiler is really
$ gcc -v
Apple clang version 17.0.0 (clang-1700.0.13.3)
Target: arm64-apple-darwin24.5.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
I did some experimentation with different -O levels to see if that
would make the failure any more or less common. But I've yet to
reproduce it at anything except -O2.
regards, tom lane
Andres Freund <andres@anarazel.de> writes:
This is on an M4 Mac mini. I'm wondering if there's some hardware-specific
memory ordering issue or disk speed based timing issue that I'm just not
hitting.
I dunno, I've seen it on three different physical machines now
(one M1, two M4 Pros). But it is darn hard to repro, for sure.
regards, tom lane
Hello hackers,
27.05.2025 16:35, Andres Freund wrote:
On 2025-05-25 20:05:49 -0400, Tom Lane wrote:
Thomas Munro<thomas.munro@gmail.com> writes:
Could you guys please share your exact repro steps?
I've just been running 027_stream_regress.pl over and over.
It's not a recommendable answer though because the failure
probability is tiny, under 1%. It sounded like Alexander
had a better way.
Just FYI, I've been trying to reproduce this as well, without a single failure
so far, despite running all tests a few hundred times (~2 days) and
027_stream_regress.pl many hundreds of times (~1 day).
This is on an M4 Mac mini. I'm wondering if there's some hardware-specific
memory ordering issue or disk speed based timing issue that I'm just not
hitting.
I'm sorry, but I need several days more to present a working reproducer.
I was lucky enough to catch the assert on my first attempt, without much
effort, but then something changed on that MacBook (it's not mine, I
connect to it remotely when it's available) and I cannot reproduce it
anymore.
Just today, I discovered that 027_stream_regress runs very slowly
there just because of shared_preload_libraries:
# after the 027_stream_regress test run
echo "shared_preload_libraries = 'pg_stat_statements'" >/tmp/extra.config
TEMP_CONFIG=/tmp/extra.config NO_TEMP_INSTALL=1 /usr/bin/time make -s check
1061,29 real 56,09 user 27,69 sys
vs
NO_TEMP_INSTALL=1 /usr/bin/time make -s check
36,42 real 27,11 user 13,98 sys
Probably it's an effect of the antivirus (I see wdavdaemon_unprivileged eating
CPU time), and I uninstalled it before, but now it's installed again
(maybe by some policy). So I definitely need more time to figure out the
exact recipe for triggering the assert.
As to the configure options, when I tried to reproduce the issue on other
(non-macOS) machines, I used options from sifaka:
-DWRITE_READ_PARSE_PLAN_TREES -DSTRESS_SORT_INT_MIN -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS,
but then I added -DREAD_STREAM_DISABLE_FAST_PATH to stress read_stream,
and then I just copied that command and ran it on the MacBook...
So I think the complete compilation command was (and I'm seeing it in
the history):
CFLAGS="-DREAD_STREAM_DISABLE_FAST_PATH -DWRITE_READ_PARSE_PLAN_TREES -DSTRESS_SORT_INT_MIN
-DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS" ./configure --enable-injection-points --enable-cassert --enable-debug
--enable-tap-tests --prefix=/tmp/pg -q && make -s -j8 && make -s install && make -s check
... then running 5 027_stream_regress tests in parallel ...
I had also applied a patch to repeat the "test: brin" line, but I'm not sure
it does matter.
Sorry for the lack of useful information again.
Best regards,
Alexander Lakhin
Neon (https://neon.tech)
Hi,
On 2025-05-27 10:12:28 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
This is on an M4 Mac mini. I'm wondering if there's some hardware-specific
memory ordering issue or disk speed based timing issue that I'm just not
hitting.
I dunno, I've seen it on three different physical machines now
(one M1, two M4 Pros). But it is darn hard to repro, for sure.
I just meant that it seems that I can't reproduce it for some as of yet
unknown reason. I've now been through 3k+ runs of 027_stream_regress, without
a single failure, so there has to be *something* different about my
environment than yours.
Darwin m4-dev 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:06:23 PDT 2024; root:xnu-11215.41.3~3/RELEASE_ARM64_T8132 arm64
cc -v
Apple clang version 16.0.0 (clang-1600.0.26.4)
Target: arm64-apple-darwin24.1.0
Thread model: posix
I guess I'll try to update to a later version and see if it repros there?
Greetings,
Andres Freund