ecpg test thread/alloc hangs on sidewinder running NetBSD 9.3

Started by Alexander Lakhin · 8 months ago · 6 messages · hackers
#1Alexander Lakhin
exclusion@gmail.com

Hello hackers,

I've noticed several timeout failures on sidewinder during this month: [1], [2], [3].
All three hangs happened at test thread/alloc:
...
ok 60        - thread/thread                              95 ms
ok 61        - thread/thread_implicit                     89 ms
ok 62        - thread/prep                               305 ms

I've installed NetBSD 9.3 (from [4]) and reproduced the issue with:
$ printf 'test: thread/alloc\n%.0s' {1..100} > src/interfaces/ecpg/test/ecpg_schedule
$ gmake -s check -C src/interfaces/ecpg/
...
ok 44        - thread/alloc                              133 ms
ok 45        - thread/alloc                              180 ms
ok 46        - thread/alloc                              129 ms
--- hang ---

ps shows:
 1283 pts/0 Is   0:00.42 | |   `-- -bash
18059 pts/0 I+   0:00.01 | |     `-- gmake -s check -C src/interfaces/ecpg/
23360 pts/0 I+   0:00.01 | |       `-- gmake -C test check
22349 pts/0 S+   0:00.27 | |         `-- ./pg_regress --expecteddir=. --dbname=ecpg1_regression,ecpg2_regression --create-role=regress_ecpg_user1,regress_ecpg_
15449 pts/0 S+   0:01.06 | |           |-- postgres -D /home/vagrant/postgresql/src/interfaces/ecpg/test/tmp_check/data -F -c listen_addresses= -k /tmp/pg_regr
 1959 ?     Is   0:00.01 | |           | |-- postgres: io worker 1
 5608 ?     Is   0:00.01 | |           | |-- postgres: autovacuum launcher
 7218 ?     Is   0:00.07 | |           | |-- postgres: io worker 0
15867 ?     Is   0:00.01 | |           | |-- postgres: io worker 2
21071 ?     Is   0:00.00 | |           | |-- postgres: logical replication launcher
22122 ?     Ss   0:00.18 | |           | |-- postgres: walwriter
22159 ?     Is   0:00.00 | |           | |-- postgres: checkpointer
24606 ?     Ss   0:00.02 | |           | `-- postgres: background writer
24407 pts/0 Sl+  0:00.03 | |           `-- /home/vagrant/postgresql/src/interfaces/ecpg/test/thread/alloc

$ gdb -p 24407

(gdb) bt
#0  0x000077122aca23fa in ___lwp_park60 () from /usr/lib/libc.so.12
#1  0x000077122b209c66 in ?? () from /usr/lib/libpthread.so.1
#2  0x000077122ad0fcb9 in je_malloc_mutex_lock_slow () from /usr/lib/libc.so.12
#3  0x000077122ad087c1 in je_arena_choose_hard () from /usr/lib/libc.so.12
#4  0x000077122acb1915 in je_tsd_tcache_data_init () from /usr/lib/libc.so.12
#5  0x000077122acb1b44 in je_tsd_tcache_enabled_data_init () from /usr/lib/libc.so.12
#6  0x000077122acaeda4 in je_tsd_fetch_slow () from /usr/lib/libc.so.12
#7  0x000077122ad08b1c in malloc () from /usr/lib/libc.so.12
#8  0x000077122be0af6b in ECPGget_sqlca () from /home/vagrant/postgresql/tmp_install/usr/local/pgsql/lib/libecpg.so.6
#9  0x000077122be0328f in ECPGconnect () from /home/vagrant/postgresql/tmp_install/usr/local/pgsql/lib/libecpg.so.6

The stack is always the same, so maybe it's an issue with jemalloc,
similar to [5].

The test continues running after gdb detach.

Upgrading to NetBSD 9.4 fixed the issue for me.

It reproduced even on a commit from 2017-01-01, so it's not clear why
there were no such timeouts before...

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2025-08-22%2011%3A29%3A27
[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2025-08-27%2016%3A35%3A01
[3]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2025-08-29%2012%3A35%3A01
[4]: https://portal.cloud.hashicorp.com/vagrant/discover/generic/netbsd9
[5]: https://github.com/jemalloc/jemalloc/issues/2402

Best regards,
Alexander

#2Mikael Kjellström
mikael.kjellstrom@mksoft.nu
In reply to: Alexander Lakhin (#1)
Re: ecpg test thread/alloc hangs on sidewinder running NetBSD 9.3

On 2025-08-30 22:00, Alexander Lakhin wrote:

Upgrading to NetBSD 9.4 fixed the issue for me.

It reproduced even on a commit from 2017-01-01, so it's not clear why
there were no such timeouts before...


Hi,

One change I did pretty recently is that I upgraded the build farm
client from 14 to 19.1 a couple of weeks ago.

I might have lowered wait_timeout initially but now it's higher (14400)
than before (10800).

I have also noticed that there are sometimes stuck processes.  Like:

netbsd9_3-pgbf# ps aux | grep ^pgbf
pgbf     1610  0.0  0.1   34068  2200 ?     I    Wed07PM   0:00.00 gmake NO_LOCALE=1 NO_TEMP_INSTALL=yes check
pgbf     3083  0.0  0.3 1903704  6164 ?     Il   22Aug25   0:00.04 /home/pgbf/buildroot/HEAD/pgsql.build/src/interfaces/ecpg/test/thr
pgbf     3101  0.0  0.2   57892  4160 ?     I    Fri03PM   0:00.01 postdrop -r
pgbf     4490  0.0  0.3 1902676  6068 ?     Il   Wed07PM   0:00.04 /home/pgbf/buildroot/REL_18_STABLE/pgsql.build/src/interfaces/ecpg
pgbf     8739  0.0  0.1   18948  2196 ?     I    22Aug25   0:00.00 gmake -C test check
pgbf     8927  0.0  0.1   20008  1780 ?     I    22Aug25   0:00.00 sh -c { cd  pgsql.build/src/interfaces/ecpg && gmake NO_LOCALE=1 N
pgbf    10380  0.0  0.2   57876  4184 ?     Is   Wed07PM   0:00.00 sendmail -FCronDaemon -odi -oem -oi -t
pgbf    10579  0.0  0.3 1900136  6172 ?     Il   Fri03PM   0:00.04 /home/pgbf/buildroot/HEAD/pgsql.build/src/interfaces/ecpg/test/thr
pgbf    11374  0.0  0.2   57872  4180 ?     Is   Fri03PM   0:00.01 sendmail -FCronDaemon -odi -oem -oi -t
pgbf    13147  0.0  0.2   43260  3804 ?     I    Fri03PM   0:00.34 ./pg_regress --expecteddir=. --dbname=ecpg1_regression,ecpg2_regre
pgbf    14484  0.0  0.0       0     0 ?     Z          -   0:00.00 (ksh)
pgbf    14711  0.0  0.0       0     0 ?     Z          -   0:00.00 (ksh)
pgbf    16040  0.0  0.2   57896  4160 ?     I    Wed07PM   0:00.01 postdrop -r
pgbf    17707  0.0  0.1   32400  2196 ?     I    22Aug25   0:00.00 gmake NO_LOCALE=1 NO_TEMP_INSTALL=yes check
pgbf    18244  0.0  0.1   30100  2204 ?     I    Fri03PM   0:00.00 gmake -C test check
pgbf    20348  0.0  0.1   20008  1780 ?     I    Fri03PM   0:00.00 sh -c { cd  pgsql.build/src/interfaces/ecpg && gmake NO_LOCALE=1 N
pgbf    22704  0.0  0.2   57896  4164 ?     I    22Aug25   0:00.01 postdrop -r
pgbf    23701  0.0  0.2   48960  3788 ?     I    Wed07PM   0:00.29 ./pg_regress --expecteddir=. --dbname=ecpg1_regression,ecpg2_regre
pgbf    24016  0.0  0.2   39292  3792 ?     I    22Aug25   0:00.29 ./pg_regress --expecteddir=. --dbname=ecpg1_regression,ecpg2_regre
pgbf    24387  0.0  0.1   30888  2200 ?     I    Fri03PM   0:00.00 gmake NO_LOCALE=1 NO_TEMP_INSTALL=yes check
pgbf    24945  0.0  0.2   57876  4176 ?     Is   22Aug25   0:00.01 sendmail -FCronDaemon -odi -oem -oi -t
pgbf    25730  0.0  0.1   18852  2216 ?     I    Wed07PM   0:00.00 gmake -C test check
pgbf    26757  0.0  0.1   22684  1776 ?     I    Wed07PM   0:00.00 sh -c { cd  pgsql.build/src/interfaces/ecpg && gmake NO_LOCALE=1 N
pgbf    29079  0.0  0.0       0     0 ?     Z          -   0:00.00 (ksh)
/Mikael

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mikael Kjellström (#2)
Re: ecpg test thread/alloc hangs on sidewinder running NetBSD 9.3

Mikael Kjellström <mikael.kjellstrom@mksoft.nu> writes:

On 2025-08-30 22:00, Alexander Lakhin wrote:

Upgrading to NetBSD 9.4 fixed the issue for me.

I have also noticed that there are sometimes stuck processes.  Like:

Yeah, this printout is consistent with Alexander's observation that
it's ecpg threading tests that are getting stuck:

pgbf     3083  0.0  0.3 1903704  6164 ?     Il   22Aug25   0:00.04 /home/pgbf/buildroot/HEAD/pgsql.build/src/interfaces/ecpg/test/thr

pgbf    10579  0.0  0.3 1900136  6172 ?     Il   Fri03PM   0:00.04 /home/pgbf/buildroot/HEAD/pgsql.build/src/interfaces/ecpg/test/thr

All the other processes belonging to pgbf are test driver processes
that would be waiting for these two.

Since we're not seeing comparable failures on other animals,
I tend to agree with his diagnosis of a bug in 9.3's malloc library.

regards, tom lane

#4Mikael Kjellström
mikael.kjellstrom@mksoft.nu
In reply to: Tom Lane (#3)
Re: ecpg test thread/alloc hangs on sidewinder running NetBSD 9.3

On 2025-08-31 16:52, Tom Lane wrote:

All the other processes belonging to pgbf are test driver processes
that would be waiting for these two.

Since we're not seeing comparable failures on other animals,
I tend to agree with his diagnosis of a bug in 9.3's malloc library.

Yes, I think I will upgrade that animal to the latest NetBSD release, 10.1.

NetBSD 9.3 is EOL anyway.

/Mikael

#5Alexander Lakhin
exclusion@gmail.com
In reply to: Mikael Kjellström (#4)
Re: ecpg test thread/alloc hangs on sidewinder running NetBSD 9.3

Hello Mikael,

31.08.2025 22:55, Mikael Kjellström wrote:

Yes, I think I will upgrade that animal to the latest NetBSD release, 10.1.

[ a bit off-topic, but still related to sidewinder ]

I've noticed an interesting failure that occurred on sidewinder recently [1]:
#   Failed test 'get an error when running within another function'
#   at t/049_wait_for_lsn.pl line 218.
# Looks like you failed 1 test of 33.
[23:19:49] t/049_wait_for_lsn.pl .................
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/33 subtests

The corresponding code in the test:
$node_standby->psql(
    'postgres',
    "SELECT pg_wal_replay_wait_wrap('${lsn3}');",
    stderr => \$stderr);
ok( $stderr =~
      /WAIT FOR must be called without an active or registered snapshot/,
    "get an error when running within another function");

But 049_wait_for_lsn_standby.log in fact contains:
2026-02-19 23:19:42.251 CET [7417:4] 049_wait_for_lsn.pl LOG: statement: SELECT pg_wal_replay_wait_wrap('2/570C0FC0');
2026-02-19 23:19:42.252 CET [7417:5] 049_wait_for_lsn.pl ERROR: WAIT FOR must be called without an active or registered snapshot
2026-02-19 23:19:42.252 CET [7417:6] 049_wait_for_lsn.pl DETAIL: WAIT FOR cannot be executed from a function or procedure, nor within a transaction with an isolation level higher than READ COMMITTED.
2026-02-19 23:19:42.252 CET [7417:7] 049_wait_for_lsn.pl CONTEXT: SQL statement "WAIT FOR LSN '2/570C0FC0';"
    PL/pgSQL function pg_wal_replay_wait_wrap(pg_lsn) line 3 at EXECUTE
2026-02-19 23:19:42.252 CET [7417:8] 049_wait_for_lsn.pl STATEMENT: SELECT pg_wal_replay_wait_wrap('2/570C0FC0');
2026-02-19 23:19:42.253 CET [7417:9] 049_wait_for_lsn.pl LOG: disconnection: session time: 0:00:00.005 user=pgbf database=postgres host=[local]

That is, the server returned the expected error, but the test didn't
receive it.

I've managed to reproduce this failure locally, on NetBSD 10.1, with the
following modification:
--- a/src/test/recovery/t/049_wait_for_lsn.pl
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -211,13 +211,20 @@ LANGUAGE plpgsql;
  ]);
  $node_primary->wait_for_catchup($node_standby);
+
+for (my $i=1; $i<=50; $i++)
+{
+my $stdout;
  $node_standby->psql(
         'postgres',
-       "SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+       "SELECT pg_wal_replay_wait_wrap('${lsn3}') /*$i*/;",
+        stdout => \$stdout,
         stderr => \$stderr);
+print("stdout: [$stdout]; stderr: [$stderr]\n");
  ok( $stderr =~
           /WAIT FOR must be called without an active or registered snapshot/,
-       "get an error when running within another function");
+       "get an error when running within another function ($i)");
+}

 # 6. Check parameter validation error cases on standby before promotion
 my $test_lsn =

when running multiple parallel test instances in a loop:
stdout: []; stderr: [psql:<stdin>:1: ERROR:  WAIT FOR must be called without an active or registered snapshot
DETAIL:  WAIT FOR cannot be executed from a function or procedure, nor within a transaction with an isolation level higher than READ COMMITTED.
CONTEXT:  SQL statement "WAIT FOR LSN '2/570C0FF8';"
PL/pgSQL function pg_wal_replay_wait_wrap(pg_lsn) line 3 at EXECUTE]
[09:33:27.469](0.062s) ok 25 - get an error when running within another function (14)
stdout: []; stderr: []
[09:33:27.636](0.168s) not ok 26 - get an error when running within another function (15)
[09:33:27.651](0.015s)
[09:33:27.652](0.001s) #   Failed test 'get an error when running within another function (15)'
#   at t/049_wait_for_lsn.pl line 225.
stdout: []; stderr: [psql:<stdin>:1: ERROR:  WAIT FOR must be called without an active or registered snapshot
DETAIL:  WAIT FOR cannot be executed from a function or procedure, nor within a transaction with an isolation level higher than READ COMMITTED.
CONTEXT:  SQL statement "WAIT FOR LSN '2/570C0FF8';"
PL/pgSQL function pg_wal_replay_wait_wrap(pg_lsn) line 3 at EXECUTE]

It reminded me of [2]. In my system, I found p5-IPC-Run-20231003.0nb1
installed. So I tried to update it with:
sudo pkg_delete p5-IPC-Run
git clone https://github.com/cpan-authors/IPC-Run.git
(cd IPC-Run*/ && perl Makefile.PL && make && sudo make install)
and it really helped -- no failure reproduced with many more runs. (I can
share my repro if needed.)

So maybe you could consider updating IPC::Run on sidewinder too...

[1]: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=sidewinder&dt=2026-02-19%2021%3A35%3A01&stg=recovery-check
[2]: /messages/by-id/fb666566-32bb-9c36-9c2e-3949b7a061bc@gmail.com

Best regards,
Alexander

#6Mikael Kjellström
mikael.kjellstrom@gmail.com
In reply to: Alexander Lakhin (#5)
Re: ecpg test thread/alloc hangs on sidewinder running NetBSD 9.3

On 2026-02-21 19:00, Alexander Lakhin wrote:

It reminded me of [2]. In my system, I found p5-IPC-Run-20231003.0nb1
installed. So I tried to update it with:
sudo pkg_delete p5-IPC-Run
git clone https://github.com/cpan-authors/IPC-Run.git
(cd IPC-Run*/ && perl Makefile.PL && make && sudo make install)
and it really helped -- no failure reproduced with many more runs. (I can
share my repro if needed.)

So maybe you could consider updating IPC::Run on sidewinder too...

I haven't upgraded the packages in a while. I just ran a pkgin update
and saw that it wanted to upgrade to "p5-IPC-Run-20250809.0", so maybe
I can try that first and see if it helps.

Anyway I have now upgraded all the packages so they should be up to
date.  Let's see if that makes a difference.

Otherwise I could do a manual install of IPC-Run.

/Mikael