ecpg test thread/alloc hangs on sidewinder running NetBSD 9.3

Started by Alexander Lakhin · 8 months ago · 6 messages · hackers
#1Alexander Lakhin
exclusion@gmail.com

Hello hackers,

I've noticed several timeout failures on sidewinder during this month: [1], [2], [3].
All three hangs happened at test thread/alloc:
...
ok 60        - thread/thread                              95 ms
ok 61        - thread/thread_implicit                     89 ms
ok 62        - thread/prep                               305 ms

I've installed NetBSD 9.3 (from [4]) and reproduced the issue with:
$ printf 'test: thread/alloc\n%.0s' {1..100} > src/interfaces/ecpg/test/ecpg_schedule
$ gmake -s check -C src/interfaces/ecpg/
...
ok 44        - thread/alloc                              133 ms
ok 45        - thread/alloc                              180 ms
ok 46        - thread/alloc                              129 ms
--- hang ---

ps shows:
 1283 pts/0 Is   0:00.42 | |   `-- -bash
18059 pts/0 I+   0:00.01 | |     `-- gmake -s check -C src/interfaces/ecpg/
23360 pts/0 I+   0:00.01 | |       `-- gmake -C test check
22349 pts/0 S+   0:00.27 | |         `-- ./pg_regress --expecteddir=. --dbname=ecpg1_regression,ecpg2_regression --create-role=regress_ecpg_user1,regress_ecpg_
15449 pts/0 S+   0:01.06 | |           |-- postgres -D /home/vagrant/postgresql/src/interfaces/ecpg/test/tmp_check/data -F -c listen_addresses= -k /tmp/pg_regr
 1959 ?     Is   0:00.01 | |           | |-- postgres: io worker 1
 5608 ?     Is   0:00.01 | |           | |-- postgres: autovacuum launcher
 7218 ?     Is   0:00.07 | |           | |-- postgres: io worker 0
15867 ?     Is   0:00.01 | |           | |-- postgres: io worker 2
21071 ?     Is   0:00.00 | |           | |-- postgres: logical replication launcher
22122 ?     Ss   0:00.18 | |           | |-- postgres: walwriter
22159 ?     Is   0:00.00 | |           | |-- postgres: checkpointer
24606 ?     Ss   0:00.02 | |           | `-- postgres: background writer
24407 pts/0 Sl+  0:00.03 | |           `-- /home/vagrant/postgresql/src/interfaces/ecpg/test/thread/alloc

$ gdb -p 24407

(gdb) bt
#0  0x000077122aca23fa in ___lwp_park60 () from /usr/lib/libc.so.12
#1  0x000077122b209c66 in ?? () from /usr/lib/libpthread.so.1
#2  0x000077122ad0fcb9 in je_malloc_mutex_lock_slow () from /usr/lib/libc.so.12
#3  0x000077122ad087c1 in je_arena_choose_hard () from /usr/lib/libc.so.12
#4  0x000077122acb1915 in je_tsd_tcache_data_init () from /usr/lib/libc.so.12
#5  0x000077122acb1b44 in je_tsd_tcache_enabled_data_init () from /usr/lib/libc.so.12
#6  0x000077122acaeda4 in je_tsd_fetch_slow () from /usr/lib/libc.so.12
#7  0x000077122ad08b1c in malloc () from /usr/lib/libc.so.12
#8  0x000077122be0af6b in ECPGget_sqlca () from /home/vagrant/postgresql/tmp_install/usr/local/pgsql/lib/libecpg.so.6
#9  0x000077122be0328f in ECPGconnect () from /home/vagrant/postgresql/tmp_install/usr/local/pgsql/lib/libecpg.so.6

The stack is always the same, so maybe it's an issue with jemalloc,
similar to [5].

The test continues running after gdb detach.

Upgrading to NetBSD 9.4 fixed the issue for me.

It reproduced even on a commit from 2017-01-01, so it's not clear why
there were no such timeouts before...

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2025-08-22%2011%3A29%3A27
[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2025-08-27%2016%3A35%3A01
[3]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2025-08-29%2012%3A35%3A01
[4]: https://portal.cloud.hashicorp.com/vagrant/discover/generic/netbsd9
[5]: https://github.com/jemalloc/jemalloc/issues/2402

Best regards,
Alexander

#2Mikael Kjellström
mikael.kjellstrom@mksoft.nu
In reply to: Alexander Lakhin (#1)
Re: ecpg test thread/alloc hangs on sidewinder running NetBSD 9.3

On 2025-08-30 22:00, Alexander Lakhin wrote:

Upgrading to NetBSD 9.4 fixed the issue for me.

It reproduced even on a commit from 2017-01-01, so it's not clear why
there were no such timeouts before...


Hi,

One change I did pretty recently is that I upgraded the build farm
client from 14 to 19.1 a couple of weeks ago.

I might have lowered wait_timeout initially but now it's higher (14400)
than before (10800).

I have also noticed that there are sometimes stuck processes.  Like:

netbsd9_3-pgbf# ps aux | grep ^pgbf
pgbf     1610  0.0  0.1   34068  2200 ?     I    Wed07PM   0:00.00 gmake NO_LOCALE=1 NO_TEMP_INSTALL=yes check
pgbf     3083  0.0  0.3 1903704  6164 ?     Il   22Aug25   0:00.04 /home/pgbf/buildroot/HEAD/pgsql.build/src/interfaces/ecpg/test/thr
pgbf     3101  0.0  0.2   57892  4160 ?     I    Fri03PM   0:00.01 postdrop -r
pgbf     4490  0.0  0.3 1902676  6068 ?     Il   Wed07PM   0:00.04 /home/pgbf/buildroot/REL_18_STABLE/pgsql.build/src/interfaces/ecpg
pgbf     8739  0.0  0.1   18948  2196 ?     I    22Aug25   0:00.00 gmake -C test check
pgbf     8927  0.0  0.1   20008  1780 ?     I    22Aug25   0:00.00 sh -c { cd  pgsql.build/src/interfaces/ecpg && gmake NO_LOCALE=1 N
pgbf    10380  0.0  0.2   57876  4184 ?     Is   Wed07PM   0:00.00 sendmail -FCronDaemon -odi -oem -oi -t
pgbf    10579  0.0  0.3 1900136  6172 ?     Il   Fri03PM   0:00.04 /home/pgbf/buildroot/HEAD/pgsql.build/src/interfaces/ecpg/test/thr
pgbf    11374  0.0  0.2   57872  4180 ?     Is   Fri03PM   0:00.01 sendmail -FCronDaemon -odi -oem -oi -t
pgbf    13147  0.0  0.2   43260  3804 ?     I    Fri03PM   0:00.34 ./pg_regress --expecteddir=. --dbname=ecpg1_regression,ecpg2_regre
pgbf    14484  0.0  0.0       0     0 ?     Z          -   0:00.00 (ksh)
pgbf    14711  0.0  0.0       0     0 ?     Z          -   0:00.00 (ksh)
pgbf    16040  0.0  0.2   57896  4160 ?     I    Wed07PM   0:00.01 postdrop -r
pgbf    17707  0.0  0.1   32400  2196 ?     I    22Aug25   0:00.00 gmake NO_LOCALE=1 NO_TEMP_INSTALL=yes check
pgbf    18244  0.0  0.1   30100  2204 ?     I    Fri03PM   0:00.00 gmake -C test check
pgbf    20348  0.0  0.1   20008  1780 ?     I    Fri03PM   0:00.00 sh -c { cd  pgsql.build/src/interfaces/ecpg && gmake NO_LOCALE=1 N
pgbf    22704  0.0  0.2   57896  4164 ?     I    22Aug25   0:00.01 postdrop -r
pgbf    23701  0.0  0.2   48960  3788 ?     I    Wed07PM   0:00.29 ./pg_regress --expecteddir=. --dbname=ecpg1_regression,ecpg2_regre
pgbf    24016  0.0  0.2   39292  3792 ?     I    22Aug25   0:00.29 ./pg_regress --expecteddir=. --dbname=ecpg1_regression,ecpg2_regre
pgbf    24387  0.0  0.1   30888  2200 ?     I    Fri03PM   0:00.00 gmake NO_LOCALE=1 NO_TEMP_INSTALL=yes check
pgbf    24945  0.0  0.2   57876  4176 ?     Is   22Aug25   0:00.01 sendmail -FCronDaemon -odi -oem -oi -t
pgbf    25730  0.0  0.1   18852  2216 ?     I    Wed07PM   0:00.00 gmake -C test check
pgbf    26757  0.0  0.1   22684  1776 ?     I    Wed07PM   0:00.00 sh -c { cd  pgsql.build/src/interfaces/ecpg && gmake NO_LOCALE=1 N
pgbf    29079  0.0  0.0       0     0 ?     Z          -   0:00.00 (ksh)
/Mikael

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mikael Kjellström (#2)
Re: ecpg test thread/alloc hangs on sidewinder running NetBSD 9.3

Mikael Kjellström <mikael.kjellstrom@mksoft.nu> writes:

On 2025-08-30 22:00, Alexander Lakhin wrote:

Upgrading to NetBSD 9.4 fixed the issue for me.

I have also noticed that there are sometimes stuck processes.  Like:

Yeah, this printout is consistent with Alexander's observation that
it's ecpg threading tests that are getting stuck:

pgbf     3083  0.0  0.3 1903704  6164 ?     Il   22Aug25   0:00.04 /home/pgbf/buildroot/HEAD/pgsql.build/src/interfaces/ecpg/test/thr

pgbf    10579  0.0  0.3 1900136  6172 ?     Il   Fri03PM   0:00.04 /home/pgbf/buildroot/HEAD/pgsql.build/src/interfaces/ecpg/test/thr

All the other processes belonging to pgbf are test driver processes
that would be waiting for these two.

Since we're not seeing comparable failures on other animals,
I tend to agree with his diagnosis of a bug in 9.3's malloc library.

regards, tom lane

#4Mikael Kjellström
mikael.kjellstrom@mksoft.nu
In reply to: Tom Lane (#3)
Re: ecpg test thread/alloc hangs on sidewinder running NetBSD 9.3

On 2025-08-31 16:52, Tom Lane wrote:

All the other processes belonging to pgbf are test driver processes
that would be waiting for these two.

Since we're not seeing comparable failures on other animals,
I tend to agree with his diagnosis of a bug in 9.3's malloc library.

Yes, I think I will upgrade that animal to the latest NetBSD release, 10.1.

NetBSD 9.3 is EOL anyway.

/Mikael

#5Alexander Lakhin
exclusion@gmail.com
In reply to: Mikael Kjellström (#4)
Re: ecpg test thread/alloc hangs on sidewinder running NetBSD 9.3

Hello Mikael,

31.08.2025 22:55, Mikael Kjellström wrote:

Yes, I think I will upgrade that animal to the latest NetBSD release, 10.1.

[ a bit off-topic, but still related to sidewinder ]

I've noticed an interesting failure that occurred on sidewinder recently [1]:
#   Failed test 'get an error when running within another function'
#   at t/049_wait_for_lsn.pl line 218.
# Looks like you failed 1 test of 33.
[23:19:49] t/049_wait_for_lsn.pl .................
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/33 subtests

The corresponding code in the test:
$node_standby->psql(
    'postgres',
    "SELECT pg_wal_replay_wait_wrap('${lsn3}');",
    stderr => \$stderr);
ok( $stderr =~
      /WAIT FOR must be called without an active or registered snapshot/,
    "get an error when running within another function");

But 049_wait_for_lsn_standby.log in fact contains:
2026-02-19 23:19:42.251 CET [7417:4] 049_wait_for_lsn.pl LOG: statement: SELECT pg_wal_replay_wait_wrap('2/570C0FC0');
2026-02-19 23:19:42.252 CET [7417:5] 049_wait_for_lsn.pl ERROR: WAIT FOR must be called without an active or registered snapshot
2026-02-19 23:19:42.252 CET [7417:6] 049_wait_for_lsn.pl DETAIL: WAIT FOR cannot be executed from a function or procedure, nor within a transaction with an isolation level higher than READ COMMITTED.
2026-02-19 23:19:42.252 CET [7417:7] 049_wait_for_lsn.pl CONTEXT: SQL statement "WAIT FOR LSN '2/570C0FC0';"
    PL/pgSQL function pg_wal_replay_wait_wrap(pg_lsn) line 3 at EXECUTE
2026-02-19 23:19:42.252 CET [7417:8] 049_wait_for_lsn.pl STATEMENT: SELECT pg_wal_replay_wait_wrap('2/570C0FC0');
2026-02-19 23:19:42.253 CET [7417:9] 049_wait_for_lsn.pl LOG: disconnection: session time: 0:00:00.005 user=pgbf database=postgres host=[local]

That is, the server returned the expected error, but the test didn't
receive it.

I've managed to reproduce this failure locally, on NetBSD 10.1, with the
following modification:
--- a/src/test/recovery/t/049_wait_for_lsn.pl
+++ b/src/test/recovery/t/049_wait_for_lsn.pl
@@ -211,13 +211,20 @@ LANGUAGE plpgsql;
  ]);
  $node_primary->wait_for_catchup($node_standby);
+
+for (my $i=1; $i<=50; $i++)
+{
+my $stdout;
  $node_standby->psql(
         'postgres',
-       "SELECT pg_wal_replay_wait_wrap('${lsn3}');",
+       "SELECT pg_wal_replay_wait_wrap('${lsn3}') /*$i*/;",
+        stdout => \$stdout,
         stderr => \$stderr);
+print("stdout: [$stdout]; stderr: [$stderr]\n");
  ok( $stderr =~
           /WAIT FOR must be called without an active or registered snapshot/,
-       "get an error when running within another function");
+       "get an error when running within another function ($i)");
+}

 # 6. Check parameter validation error cases on standby before promotion
 my $test_lsn =

when running multiple parallel test instances in a loop:
stdout: []; stderr: [psql:<stdin>:1: ERROR:  WAIT FOR must be called without an active or registered snapshot
DETAIL:  WAIT FOR cannot be executed from a function or procedure, nor within a transaction with an isolation level higher than READ COMMITTED.
CONTEXT:  SQL statement "WAIT FOR LSN '2/570C0FF8';"
PL/pgSQL function pg_wal_replay_wait_wrap(pg_lsn) line 3 at EXECUTE]
[09:33:27.469](0.062s) ok 25 - get an error when running within another function (14)
stdout: []; stderr: []
[09:33:27.636](0.168s) not ok 26 - get an error when running within another function (15)
[09:33:27.651](0.015s)
[09:33:27.652](0.001s) #   Failed test 'get an error when running within another function (15)'
#   at t/049_wait_for_lsn.pl line 225.
stdout: []; stderr: [psql:<stdin>:1: ERROR:  WAIT FOR must be called without an active or registered snapshot
DETAIL:  WAIT FOR cannot be executed from a function or procedure, nor within a transaction with an isolation level higher than READ COMMITTED.
CONTEXT:  SQL statement "WAIT FOR LSN '2/570C0FF8';"
PL/pgSQL function pg_wal_replay_wait_wrap(pg_lsn) line 3 at EXECUTE]

It reminded me of [2]. In my system, I found p5-IPC-Run-20231003.0nb1
installed. So I tried to update it with:
sudo pkg_delete p5-IPC-Run
git clone https://github.com/cpan-authors/IPC-Run.git
(cd IPC-Run*/ && perl Makefile.PL && make && sudo make install)
and it really helped -- no failure reproduced with many more runs. (I can
share my repro if needed.)

So maybe you could consider updating IPC::Run on sidewinder too...

[1]: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=sidewinder&dt=2026-02-19%2021%3A35%3A01&stg=recovery-check
[2]: /messages/by-id/fb666566-32bb-9c36-9c2e-3949b7a061bc@gmail.com

Best regards,
Alexander

#6Mikael Kjellström
mikael.kjellstrom@gmail.com
In reply to: Alexander Lakhin (#5)
Re: ecpg test thread/alloc hangs on sidewinder running NetBSD 9.3

On 2026-02-21 19:00, Alexander Lakhin wrote:

It reminded me of [2]. In my system, I found p5-IPC-Run-20231003.0nb1
installed. So I tried to update it with:
sudo pkg_delete p5-IPC-Run
git clone https://github.com/cpan-authors/IPC-Run.git
(cd IPC-Run*/ && perl Makefile.PL && make && sudo make install)
and it really helped -- no failure reproduced with many more runs. (I can
share my repro if needed.)

So maybe you could consider updating IPC::Run on sidewinder too...

I haven't upgraded the packages in a while. I just ran a pkgin update
and saw that it wanted to upgrade to "p5-IPC-Run-20250809.0", so maybe
I can try that first and see if it helps.

Anyway I have now upgraded all the packages so they should be up to
date.  Let's see if that makes a difference.

Otherwise I could do a manual install of IPC-Run.

/Mikael