BUG #18732: Segfault in pgbench on max_connections starvation

Started by PG Bug reporting form over 1 year ago | 3 messages | bugs
#1 PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 18732
Logged by: Mikhail Kot
Email address: mikhail@neon.tech
PostgreSQL version: 16.6
Operating system: Debian 12
Description:

When the number of --client connections in pgbench exceeds max_connections
in postgres, pgbench 16 sometimes exits with a segfault when a (presumably)
SSL certificate validation error occurs.

OpenSSL version: 3.0.15-1~deb12u1
pgbench version: 16.6 (f5cfc6fa898544050e821ac688adafece1ac3cff)
pgbench params: pgbench postgresql://REDACTED/neondb?sslmode=require -c 2000 -T 60 -P 1 -j 20 --protocol=prepared

#0 0x00007f097342d3f0 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#1 0x00007f097342da0a in OPENSSL_LH_retrieve () from
/lib/x86_64-linux-gnu/libcrypto.so.3
#2 0x00007f097346a283 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#3 0x00007f097340bced in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#4 0x00007f097340c122 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#5 0x00007f09733f60ba in EVP_MD_fetch () from
/lib/x86_64-linux-gnu/libcrypto.so.3
#6 0x00007f09733f67f0 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#7 0x00007f097342899f in HMAC_Init_ex () from
/lib/x86_64-linux-gnu/libcrypto.so.3
#8 0x00007f09737ba60a in pg_hmac_init (ctx=0x7f09484a7360,
key=0x7f09484a7880 "R1M6EFcoABKs", len=12)
at /home/myrrc/neon/vendor/postgres-v16/src/common/hmac_openssl.c:174
#9 0x00007f09737b54ef in scram_SaltedPassword (password=0x7f09484a7880
"R1M6EFcoABKs",
hash_type=PG_SHA256, key_length=32, salt=0x7f09484a78c0
"\260\376E\302@\0341Z\025%'\244H",
saltlen=16, iterations=4096,
result=0x7f09484a4118
"\235\245\371\260\203\243矮\357\224\305F\204F\341K\212\025$#\030CL\"\325ɑ\247\021os\340=IH\t\177",
errstr=0x7f09719fead8)
at /home/myrrc/neon/vendor/postgres-v16/src/common/scram-common.c:87
#10 0x00007f097379452c in calculate_client_proof (state=0x7f09484a40f0,
client_final_message_without_proof=0x7f0948489920
"c=cD10bHMtc2VydmVyLWVuZC1wb2ludCwstoyKkoGIYqGK5C4vgGtRjvNeDwvmGQlaYHBXl8ZybAA=,r=RbPpYlql+b/rBgDtitBWxtAdW9BcFuPI9WsP7VCILEORedB6",

result=0x7f09719feaf0 "0", errstr=0x7f09719fead8)
at
/home/myrrc/neon/vendor/postgres-v16/src/interfaces/libpq/fe-auth-scram.c:788
#11 0x00007f0973793e55 in build_client_final_message
(state=0x7f09484a40f0)
at
/home/myrrc/neon/vendor/postgres-v16/src/interfaces/libpq/fe-auth-scram.c:565
#12 0x00007f0973793403 in scram_exchange (opaq=0x7f09484a40f0,
input=0x7f09484a60a0
"r=RbPpYlql+b/rBgDtitBWxtAdW9BcFuPI9WsP7VCILEORedB6", inputlen=84,
output=0x7f09719febe0, outputlen=0x7f09719febdc, done=0x7f09719febdb,
success=0x7f09719febda)
at
/home/myrrc/neon/vendor/postgres-v16/src/interfaces/libpq/fe-auth-scram.c:255
#13 0x00007f09737b002e in pg_SASL_continue (conn=0x7f0948486540,
payloadlen=84, final=false)
#14 0x00007f09737af729 in pg_fe_sendauth (areq=11, payloadlen=84,
conn=0x7f0948486540)
at
/home/myrrc/neon/vendor/postgres-v16/src/interfaces/libpq/fe-auth.c:1139
#15 0x00007f0973798c5d in PQconnectPoll (conn=0x7f0948486540)
at
/home/myrrc/neon/vendor/postgres-v16/src/interfaces/libpq/fe-connect.c:3802
#16 0x00007f0973794c9c in connectDBComplete (conn=0x7f0948486540)
at
/home/myrrc/neon/vendor/postgres-v16/src/interfaces/libpq/fe-connect.c:2511
#17 0x00007f09737949b4 in PQconnectdbParams (keywords=0x7f09719ff890,
values=0x7f09719ff850,
expand_dbname=1) at
/home/myrrc/neon/vendor/postgres-v16/src/interfaces/libpq/fe-connect.c:685
#18 0x0000558da1510b5e in doConnect ()
at /home/myrrc/neon/vendor/postgres-v16/src/bin/pgbench/pgbench.c:1560
#19 0x0000558da15113d0 in threadRun (arg=0x558db50ebce0)
at /home/myrrc/neon/vendor/postgres-v16/src/bin/pgbench/pgbench.c:7384
#20 0x00007f09730a81c4 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#21 0x00007f097312885c in ?? () from /lib/x86_64-linux-gnu/libc.so.6

Steps to reproduce:
1. Launch a postgres server with max_connections=900
2. Launch pgbench a couple of times with -c 2000

I was also able to reproduce this error by running multiple pgbench
instances with the same launch parameters. This error doesn't reproduce on
pgbench 17.2 or 15.10.
I can provide the coredump upon request.

#2 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: PG Bug reporting form (#1)
Re: BUG #18732: Segfault in pgbench on max_connections starvation

On 03/12/2024 14:23, PG Bug reporting form wrote:

When --client connections in pgbench exceed max_connections in postgres,
pgbench 16 sometimes exits with segfault when a (presumably) ssl
certificate
validation error occurs.

...

Steps to reproduce:
1. Launch a postgres server with max_connections=900
2. Launch pgbench a couple of times with -c 2000

I was also able to reproduce this error by running multiple pgbench
instances
with same launch parameters. This error doesn't reproduce on pgbench 17.2 or
15.10
I can provide the coredump upon request.

I was able to reproduce this on both REL_16_STABLE and REL_17_STABLE.
Didn't try v15, but I presume this issue is present in all branches (see
analysis below).

Backtrace from thread 1:

#0 0x00007f19dfa55516 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#1 0x00007f19dfa55bce in OPENSSL_LH_retrieve () from
/lib/x86_64-linux-gnu/libcrypto.so.3
#2 0x00007f19dfb456d5 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#3 0x00007f19dfa2e943 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#4 0x00007f19dfa2edc1 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#5 0x00007f19dfa17eee in EVP_MD_fetch () from
/lib/x86_64-linux-gnu/libcrypto.so.3
#6 0x00007f19dfa1855b in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#7 0x00007f19dfa4c22a in HMAC_Init_ex () from
/lib/x86_64-linux-gnu/libcrypto.so.3
#8 0x00007f19e00a9296 in pg_hmac_init (ctx=ctx@entry=0x7f19cc51bb90,
key=key@entry=0x7f19cc50d560 "foo", len=len@entry=3) at
../src/common/hmac_openssl.c:180
#9 0x00007f19e00a62b0 in scram_SaltedPassword (password=0x7f19cc50d560
"foo", hash_type=<optimized out>, key_length=32, salt=<optimized out>,
saltlen=<optimized out>, iterations=4096,
result=0x7f19cc51bb08
"w\351אI\256\035\330\003y\021ւ\205\327ƿ\217Q\332\362}\a\0364\243^\324\321a\034H0\250P\314\031\177",
errstr=0x7f19dd4bb928) at ../src/common/scram-common.c:87
#10 0x00007f19e0089bcd in calculate_client_proof (state=0x7f19cc51bae0,
client_final_message_without_proof=0x7f19cc50b040
"c=cD10bHMtc2VydmVyLWVuZC1wb2ludCwsvkIO06ZPSH1cmElOgC2DbPafilVET0yej6RhzH30Rzw=,r=Wkk2fofG+RP23HT1tBMqx0ijin6taf2xdjPuJBYqBqw2853/",

result=<optimized out>, errstr=<optimized out>) at
../src/interfaces/libpq/fe-auth-scram.c:788
#11 build_client_final_message (state=0x7f19cc51bae0) at
../src/interfaces/libpq/fe-auth-scram.c:565
#12 scram_exchange (opaq=0x7f19cc51bae0, input=<optimized out>,
inputlen=<optimized out>, output=0x7f19dd4bba28, outputlen=<optimized
out>, done=<optimized out>, success=<optimized out>)
at ../src/interfaces/libpq/fe-auth-scram.c:255
#13 0x00007f19e008a642 in pg_SASL_continue (conn=0x7f19cc4ff1f0,
payloadlen=84, final=<optimized out>) at
../src/interfaces/libpq/fe-auth.c:654
#14 pg_fe_sendauth (areq=11, payloadlen=84,
conn=conn@entry=0x7f19cc4ff1f0) at ../src/interfaces/libpq/fe-auth.c:1139
#15 0x00007f19e008f756 in PQconnectPoll (conn=conn@entry=0x7f19cc4ff1f0)
at ../src/interfaces/libpq/fe-connect.c:3802
#16 0x00007f19e008bae8 in connectDBComplete
(conn=conn@entry=0x7f19cc4ff1f0) at
../src/interfaces/libpq/fe-connect.c:2511
#17 0x00007f19e008b2bf in PQconnectdbParams
(keywords=keywords@entry=0x7f19dd4bc1f0,
values=values@entry=0x7f19dd4bc1b0, expand_dbname=expand_dbname@entry=1)
at ../src/interfaces/libpq/fe-connect.c:685
#18 0x000056350c35efa5 in doConnect () at ../src/bin/pgbench/pgbench.c:1560
#19 0x000056350c35f2c5 in threadRun (arg=0x56350d1184a0) at
../src/bin/pgbench/pgbench.c:7396
#20 0x00007f19dfe1b112 in start_thread (arg=<optimized out>) at
./nptl/pthread_create.c:447
#21 0x00007f19dfe998f8 in __GI___clone3 () at
../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 2:

#0 0x00007f19dfe28a04 in _int_free_merge_chunk
(av=av@entry=0x7f19dff70ac0 <main_arena>, p=0x56350d126280, size=144) at
./malloc/malloc.c:4675
#1 0x00007f19dfe28d31 in _int_free (av=0x7f19dff70ac0 <main_arena>,
p=<optimized out>, have_lock=<optimized out>, have_lock@entry=0) at
./malloc/malloc.c:4646
#2 0x00007f19dfe2b4ff in __GI___libc_free (mem=<optimized out>) at
./malloc/malloc.c:3398
#3 0x00007f19dfa5580e in OPENSSL_LH_free () from
/lib/x86_64-linux-gnu/libcrypto.so.3
#4 0x00007f19dfb4489f in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#5 0x00007f19dfa6e0e7 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#6 0x00007f19dfb44c35 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#7 0x00007f19dfa565a5 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#8 0x00007f19dfa56aa0 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.3
#9 0x00007f19dfa5ac32 in OPENSSL_cleanup () from
/lib/x86_64-linux-gnu/libcrypto.so.3
#10 0x00007f19dfdcb1e1 in __run_exit_handlers (status=status@entry=1,
listp=0x7f19dff70680 <__exit_funcs>,
run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true)
at ./stdlib/exit.c:108
#11 0x00007f19dfdcb29a in __GI_exit (status=status@entry=1) at
./stdlib/exit.c:138
#12 0x000056350c362ae6 in threadRun (arg=<optimized out>) at
../src/bin/pgbench/pgbench.c:7399
#13 0x00007f19dfe1b112 in start_thread (arg=<optimized out>) at
./nptl/pthread_create.c:447
#14 0x00007f19dfe998f8 in __GI___clone3 () at
../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Sometimes you also get this error instead of a crash, which is
presumably another symptom of the same race condition:

pgbench (16.6, server 18devel)
starting vacuum...end.
pgbench: error: connection to server at "localhost" (::1), port 5432
failed: FATAL: sorry, too many clients already
pgbench: error: could not create connection for client 1145
pgbench: error: connection to server at "localhost" (::1), port 5432
failed: could not verify server signature: OpenSSL failure

Once I also got this:

pgbench (17.2, server 18devel)
starting vacuum...end.
pgbench: error: connection to server at "localhost" (::1), port 5432
failed: FATAL: sorry, too many clients already
pgbench: error: could not create connection for client 1045
k5_mutex_lock: Received error 22 (Invalid argument)
*** %n in writable segment detected ***

It looks like a race condition between OpenSSL's exit handler and the
HMAC_Init_ex() call in another thread. I think we could use the
OPENSSL_INIT_NO_ATEXIT option to prevent the atexit handler from
running. The OpenSSL man page on OPENSSL_init_crypto says:

OPENSSL_INIT_NO_ATEXIT

By default OpenSSL will attempt to clean itself up when the process
exits via an "atexit" handler. Using this option suppresses that
behaviour. This means that the application will have to clean up
OpenSSL explicitly using OPENSSL_cleanup().

I don't understand why that cleanup would be needed. When the program
exits, all resources are gone anyway.
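
The suspected race can be illustrated without OpenSSL. The following is a minimal sketch, not pgbench or OpenSSL code, and every name in it (busy_worker, cleanup_hook, worker_runs_during_exit_handler) is hypothetical. One thread calls exit(), which runs the atexit handler on that thread while a second thread keeps executing. The handler stands in for OPENSSL_cleanup() and the worker for a thread that is still inside HMAC_Init_ex(). The experiment runs in a forked child so that the caller survives it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static atomic_long worker_ticks;    /* advanced only by the worker thread */
static int report_fd = -1;          /* write end of the pipe to the parent */

/* Hypothetical stand-in for a thread still executing library code. */
static void *busy_worker(void *arg)
{
    (void) arg;
    for (;;)
        atomic_fetch_add(&worker_ticks, 1);
    return NULL;
}

/* Hypothetical stand-in for a library cleanup hook run via atexit(). */
static void cleanup_hook(void)
{
    long before, after;
    char verdict;

    assert(report_fd >= 0);
    before = atomic_load(&worker_ticks);
    usleep(50 * 1000);              /* pretend to tear down shared state */
    after = atomic_load(&worker_ticks);
    verdict = (after > before) ? 'R' : 'S';   /* R = worker still running */
    if (write(report_fd, &verdict, 1) != 1)
        _exit(3);
}

/*
 * Returns 1 if the worker thread made progress while the exit handler was
 * running, 0 if it did not, and -1 if the experiment could not be set up.
 */
int worker_runs_during_exit_handler(void)
{
    int fds[2];
    char verdict = '?';
    pid_t pid;

    if (pipe(fds) != 0)
        return -1;
    fflush(NULL);                   /* avoid duplicated stdio output in the child */
    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        pthread_t tid;

        close(fds[0]);
        report_fd = fds[1];
        if (atexit(cleanup_hook) != 0)
            _exit(2);
        if (pthread_create(&tid, NULL, busy_worker, NULL) != 0)
            _exit(2);
        while (atomic_load(&worker_ticks) == 0)
            usleep(1000);           /* wait until the worker has started */
        exit(1);                    /* handlers run here; the worker is not stopped first */
    }
    close(fds[1]);
    if (read(fds[0], &verdict, 1) != 1)
        verdict = '?';
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return verdict == 'R';
}
```

If the function returns 1, the worker was still making progress while the handler ran, which is the window in which a real cleanup hook could free memory that another thread is reading.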

--
Heikki Linnakangas
Neon (https://neon.tech)

#3 Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#2)
Re: BUG #18732: Segfault in pgbench on max_connections starvation

Hi,

On 2024-12-03 16:52:32 +0200, Heikki Linnakangas wrote:

It looks like a race condition between OpenSSL's exit handler and the
HMAC_Init_ex() call in another thread. I think we could use the
OPENSSL_INIT_NO_ATEXIT option to prevent the atexit handler from running.
The OpenSSL man page on OPENSSL_init_crypto says:

Using exit() while another thread is running is, IIRC, undefined behaviour,
regardless of OPENSSL_INIT_NO_ATEXIT's pointlessness. The whole atexit()
mechanism is not threadsafe; two threads calling exit() at the same time can
cause a lot of havoc.

Short term it's probably easiest to just use _exit(). Medium term I think we
should just exit individual threads - which would probably require the main
thread to not run a benchmark itself.
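
To make the short-term suggestion concrete, here is a minimal sketch of the difference between exit() and _exit(). It is not a patch for pgbench, and the names handler_ran, exit_handler and marker_fd are hypothetical. A forked child registers an atexit handler and then terminates one way or the other; the parent checks whether the handler wrote a byte to a pipe.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int marker_fd = -1;          /* write end of the pipe to the parent */

/* Hypothetical stand-in for a library cleanup hook such as OPENSSL_cleanup(). */
static void exit_handler(void)
{
    assert(marker_fd >= 0);
    if (write(marker_fd, "H", 1) != 1)
        _exit(3);
}

/*
 * Forks a child that registers exit_handler() and then terminates with
 * exit() (use_underscore_exit == 0) or _exit() (use_underscore_exit != 0).
 * Returns 1 if the handler ran, 0 if it did not, -1 on setup failure.
 */
int handler_ran(int use_underscore_exit)
{
    int fds[2];
    char byte;
    ssize_t n;
    pid_t pid;

    if (pipe(fds) != 0)
        return -1;
    fflush(NULL);                   /* avoid duplicated stdio output in the child */
    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        close(fds[0]);
        marker_fd = fds[1];
        if (atexit(exit_handler) != 0)
            _exit(2);
        if (use_underscore_exit)
            _exit(1);               /* terminates without running atexit handlers */
        exit(1);                    /* runs atexit handlers first */
    }
    close(fds[1]);
    n = read(fds[0], &byte, 1);     /* EOF (0 bytes) means the handler never wrote */
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return n == 1;
}
```

Here handler_ran(0) returns 1 and handler_ran(1) returns 0. One consequence to keep in mind: _exit() also skips flushing stdio buffers, so a caller that switches to it would presumably need to flush any pending output itself.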

By default OpenSSL will attempt to clean itself up when the process
exits via an "atexit" handler. Using this option suppresses that
behaviour. This means that the application will have to clean up
OpenSSL explicitly using OPENSSL_cleanup().

I don't understand why that cleanup would be needed. When the program exits,
all resources are gone anyway.

Somewhat random aside: This is also bad for postgres performance. Postmaster
initializes OpenSSL. When a child exits, it runs - completely pointlessly -
OPENSSL_cleanup(), which modifies a lot of data structures that have been set
up in postmaster. That, in turn, requires all those pages to be
copy-on-write'ed, just for the copies to be discarded immediately at process
exit.

Greetings,

Andres Freund