libpq contention due to gss even when not using gss

Started by Andres Freund over 1 year ago · 9 messages
#1Andres Freund
andres@anarazel.de

Hi,

To investigate a report of both postgres and pgbouncer having issues when a
lot of new connections are established, I used pgbench -C. Oddly, on an
early attempt, the bottleneck wasn't postgres+pgbouncer, it was pgbench. But
only when using TCP, not with unix sockets.

c=40;pgbench -C -n -c$c -j$c -T5 -f <(echo 'select 1') 'port=6432 host=127.0.0.1 user=test dbname=postgres password=fake'

host=127.0.0.1:                     16465
host=127.0.0.1,gssencmode=disable:  20860
host=/tmp:                          49286

Note that the server does *not* support gss, yet gss has a substantial
performance impact.

Obviously the connection rates here are absurdly high and, outside of badly
written applications, likely never practically relevant. However, the number
of cores in systems keeps going up, and this quite possibly will become
relevant in more realistic scenarios (lock contention kicks in earlier the
more cores you have).

And it doesn't seem great that something as rarely used as gss introduces
overhead to very common paths.
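
For applications that will never use GSS, the gssencmode=disable setting
from the benchmark above sidesteps the probe entirely. A minimal sketch of
setting it via the libpq API (the host/dbname values are illustrative):

    #include <stdio.h>
    #include "libpq-fe.h"

    int
    main(void)
    {
        /* Disable GSS encryption negotiation explicitly so libpq never
         * touches the GSS library; host/dbname are placeholders. */
        const char *const keywords[] = {"host", "dbname", "gssencmode", NULL};
        const char *const values[]   = {"127.0.0.1", "postgres", "disable", NULL};
        PGconn     *conn = PQconnectdbParams(keywords, values, 0);

        if (PQstatus(conn) != CONNECTION_OK)
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 0;
    }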

Here's a bottom-up profile:

- 32.10% pgbench [kernel.kallsyms] [k] queued_spin_lock_slowpath
- 32.09% queued_spin_lock_slowpath
- 16.15% futex_wake
do_futex
__x64_sys_futex
do_syscall_64
- entry_SYSCALL_64_after_hwframe
- 16.15% __GI___lll_lock_wake
- __GI___pthread_mutex_unlock_usercnt
- 5.12% gssint_select_mech_type
- 4.36% gss_inquire_attrs_for_mech
- 2.85% gss_indicate_mechs
- gss_indicate_mechs_by_attrs
- 1.58% gss_acquire_cred_from
gss_acquire_cred
pg_GSS_have_cred_cache
select_next_encryption_method (inlined)
init_allowed_encryption_methods (inlined)
PQconnectPoll
pqConnectDBStart (inlined)
PQconnectStartParams
PQconnectdbParams
doConnect

Clearly the contention originates outside of our code, but is triggered by
doing pg_GSS_have_cred_cache() every time a connection is established.
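
For context, that probe amounts to asking the GSS library for default
initiator credentials on each connection attempt. A simplified sketch of
what the check boils down to, based on the call stack above rather than the
exact libpq source:

    #include <stdbool.h>
    #include <gssapi/gssapi.h>

    /* Probe for a default credential cache.  gss_acquire_cred() is the
     * call that takes library-internal locks on every invocation, which
     * is where the futex contention in the profile comes from. */
    static bool
    have_gss_cred_cache(gss_cred_id_t *cred_out)
    {
        OM_uint32   major,
                    minor;
        gss_cred_id_t cred = GSS_C_NO_CREDENTIAL;

        major = gss_acquire_cred(&minor, GSS_C_NO_NAME, GSS_C_INDEFINITE,
                                 GSS_C_NO_OID_SET, GSS_C_INITIATE,
                                 &cred, NULL, NULL);
        if (major != GSS_S_COMPLETE)
        {
            *cred_out = NULL;
            return false;       /* no usable credential cache */
        }
        *cred_out = cred;
        return true;
    }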

Greetings,

Andres Freund

#2Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Andres Freund (#1)
Re: libpq contention due to gss even when not using gss

On Mon, Jun 10, 2024 at 11:12:12AM GMT, Andres Freund wrote:
Hi,

To investigate a report of both postgres and pgbouncer having issues when a
lot of new connections are established, I used pgbench -C. Oddly, on an
early attempt, the bottleneck wasn't postgres+pgbouncer, it was pgbench. But
only when using TCP, not with unix sockets.

c=40;pgbench -C -n -c$c -j$c -T5 -f <(echo 'select 1') 'port=6432 host=127.0.0.1 user=test dbname=postgres password=fake'

host=127.0.0.1:                     16465
host=127.0.0.1,gssencmode=disable:  20860
host=/tmp:                          49286

Note that the server does *not* support gss, yet gss has a substantial
performance impact.

Obviously the connection rates here are absurdly high and, outside of badly
written applications, likely never practically relevant. However, the number
of cores in systems keeps going up, and this quite possibly will become
relevant in more realistic scenarios (lock contention kicks in earlier the
more cores you have).

By not supporting gss I assume you mean having built with --with-gssapi,
but with only host (not hostgssenc) records in pg_hba, right?
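
For concreteness, such a pg_hba.conf would contain only plain host records,
e.g. (addresses and auth method here are illustrative):

    host          all    all    127.0.0.1/32    scram-sha-256
    # ...but no GSS-encrypted records such as:
    # hostgssenc  all    all    0.0.0.0/0       gss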

#3Andres Freund
andres@anarazel.de
In reply to: Dmitry Dolgov (#2)
Re: libpq contention due to gss even when not using gss

Hi,

On 2024-06-13 17:33:57 +0200, Dmitry Dolgov wrote:

On Mon, Jun 10, 2024 at 11:12:12AM GMT, Andres Freund wrote:
Hi,

To investigate a report of both postgres and pgbouncer having issues when a
lot of new connections are established, I used pgbench -C. Oddly, on an
early attempt, the bottleneck wasn't postgres+pgbouncer, it was pgbench. But
only when using TCP, not with unix sockets.

c=40;pgbench -C -n -c$c -j$c -T5 -f <(echo 'select 1') 'port=6432 host=127.0.0.1 user=test dbname=postgres password=fake'

host=127.0.0.1:                     16465
host=127.0.0.1,gssencmode=disable:  20860
host=/tmp:                          49286

Note that the server does *not* support gss, yet gss has a substantial
performance impact.

Obviously the connection rates here are absurdly high and, outside of badly
written applications, likely never practically relevant. However, the number
of cores in systems keeps going up, and this quite possibly will become
relevant in more realistic scenarios (lock contention kicks in earlier the
more cores you have).

By not supporting gss I assume you mean having built with --with-gssapi,
but with only host (not hostgssenc) records in pg_hba, right?

Yes, the latter. Or not having kerberos set up on the client side.

Greetings,

Andres Freund

#4Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Andres Freund (#3)
Re: libpq contention due to gss even when not using gss

On Thu, Jun 13, 2024 at 10:30:24AM GMT, Andres Freund wrote:

To investigate a report of both postgres and pgbouncer having issues when a
lot of new connections are established, I used pgbench -C. Oddly, on an
early attempt, the bottleneck wasn't postgres+pgbouncer, it was pgbench. But
only when using TCP, not with unix sockets.

c=40;pgbench -C -n -c$c -j$c -T5 -f <(echo 'select 1') 'port=6432 host=127.0.0.1 user=test dbname=postgres password=fake'

host=127.0.0.1:                     16465
host=127.0.0.1,gssencmode=disable:  20860
host=/tmp:                          49286

Note that the server does *not* support gss, yet gss has a substantial
performance impact.

Obviously the connection rates here are absurdly high and, outside of badly
written applications, likely never practically relevant. However, the number
of cores in systems keeps going up, and this quite possibly will become
relevant in more realistic scenarios (lock contention kicks in earlier the
more cores you have).

By not supporting gss I assume you mean having built with --with-gssapi,
but with only host (not hostgssenc) records in pg_hba, right?

Yes, the latter. Or not having kerberos set up on the client side.

I've been experimenting with both:

* The server is built without gssapi, but the client does support it.
This produces exactly the contention you're talking about.

* The server is built with gssapi, but pg_hba does not use it; the
client does support gssapi. In this case the difference between
gssencmode=disable/prefer is even more dramatic in my test case
(milliseconds vs seconds), because the environment has kerberos
configured (for other purposes), so gss_init_sec_context spends a
huge amount of time only to still return nothing.

At the same time, after a quick look, I don't see an easy way to avoid
that. The current implementation tries to initialize gss before getting
any confirmation from the server about whether it's supported. Doing it
the other way around would probably just shift the overhead to the
server side.
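
To make that concrete, here is a hypothetical condensation of the flow the
profile in #1 shows; the enum and function names are illustrative, not
libpq's actual code, and it reuses the credential-cache probe sketched
earlier:

    #include <stdbool.h>
    #include <gssapi/gssapi.h>

    /* Illustrative flag values, not libpq's actual representation. */
    enum enc_method { ENC_PLAIN = 1, ENC_GSSAPI = 2 };

    /* The set of candidate encryption methods is chosen client-side,
     * before the server has confirmed anything, and the GSS branch
     * runs the credential-cache probe on every connection attempt. */
    static int
    choose_encryption_methods(bool gss_allowed)   /* gssencmode=prefer/require */
    {
        int     methods = ENC_PLAIN;

        if (gss_allowed)
        {
            gss_cred_id_t cred;

            if (have_gss_cred_cache(&cred))   /* probe from the earlier sketch */
                methods |= ENC_GSSAPI;
        }
        return methods;
    }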

#5Daniel Gustafsson
daniel@yesql.se
In reply to: Dmitry Dolgov (#4)
Re: libpq contention due to gss even when not using gss

On 14 Jun 2024, at 10:46, Dmitry Dolgov <9erthalion6@gmail.com> wrote:

On Thu, Jun 13, 2024 at 10:30:24AM GMT, Andres Freund wrote:

To investigate a report of both postgres and pgbouncer having issues when a
lot of new connections are established, I used pgbench -C. Oddly, on an
early attempt, the bottleneck wasn't postgres+pgbouncer, it was pgbench. But
only when using TCP, not with unix sockets.

c=40;pgbench -C -n -c$c -j$c -T5 -f <(echo 'select 1') 'port=6432 host=127.0.0.1 user=test dbname=postgres password=fake'

host=127.0.0.1:                     16465
host=127.0.0.1,gssencmode=disable:  20860
host=/tmp:                          49286

Note that the server does *not* support gss, yet gss has a substantial
performance impact.

Obviously the connection rates here are absurdly high and, outside of badly
written applications, likely never practically relevant. However, the number
of cores in systems keeps going up, and this quite possibly will become
relevant in more realistic scenarios (lock contention kicks in earlier the
more cores you have).

By not supporting gss I assume you mean having built with --with-gssapi,
but with only host (not hostgssenc) records in pg_hba, right?

Yes, the latter. Or not having kerberos set up on the client side.

I've been experimenting with both:

* The server is built without gssapi, but the client does support it.
This produces exactly the contention you're talking about.

* The server is built with gssapi, but pg_hba does not use it; the
client does support gssapi. In this case the difference between
gssencmode=disable/prefer is even more dramatic in my test case
(milliseconds vs seconds), because the environment has kerberos
configured (for other purposes), so gss_init_sec_context spends a
huge amount of time only to still return nothing.

At the same time, after a quick look, I don't see an easy way to avoid
that. The current implementation tries to initialize gss before getting
any confirmation from the server about whether it's supported. Doing it
the other way around would probably just shift the overhead to the
server side.

The main problem seems to be that we check whether or not there is a credential
cache when we try to select encryption, before we even attempt authentication,
as a way to figure out if gssenc is at all worth trying? I experimented with
deferring it in favor of potentially cheaper heuristics in encryption selection,
but it seems hard to get around, since the other methods were even more
expensive.

--
Daniel Gustafsson

#6Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Daniel Gustafsson (#5)
Re: libpq contention due to gss even when not using gss

On Fri, Jun 14, 2024 at 12:12:55PM GMT, Daniel Gustafsson wrote:

I've been experimenting with both:

* The server is built without gssapi, but the client does support it.
This produces exactly the contention you're talking about.

* The server is built with gssapi, but pg_hba does not use it; the
client does support gssapi. In this case the difference between
gssencmode=disable/prefer is even more dramatic in my test case
(milliseconds vs seconds), because the environment has kerberos
configured (for other purposes), so gss_init_sec_context spends a
huge amount of time only to still return nothing.

At the same time, after a quick look, I don't see an easy way to avoid
that. The current implementation tries to initialize gss before getting
any confirmation from the server about whether it's supported. Doing it
the other way around would probably just shift the overhead to the
server side.

The main problem seems to be that we check whether or not there is a credential
cache when we try to select encryption, before we even attempt authentication,
as a way to figure out if gssenc is at all worth trying?

Yep, this is my understanding as well. Which other methods did you try
for checking that?

#7Andres Freund
andres@anarazel.de
In reply to: Dmitry Dolgov (#4)
Re: libpq contention due to gss even when not using gss

Hi,

On 2024-06-14 10:46:04 +0200, Dmitry Dolgov wrote:

At the same time, after a quick look, I don't see an easy way to avoid
that. The current implementation tries to initialize gss before getting
any confirmation from the server about whether it's supported. Doing it
the other way around would probably just shift the overhead to the
server side.

Initializing the gss cache at all isn't so much the problem. It's that we do
it for every connection. And that doing so requires locking inside gss. So
maybe we could just globally cache that gss isn't available, instead of
rediscovering it over and over for every new connection.
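
A minimal sketch of that idea, reusing the probe sketched earlier; the flag
name is hypothetical, and a real version would need to be thread-safe (an
atomic flag or once-style initialization), since libpq is used from threaded
applications:

    #include <stdbool.h>
    #include <gssapi/gssapi.h>

    /* Hypothetical process-wide negative cache: once we know there is
     * no GSS credential cache, skip the GSS library on later
     * connections.  NOTE: a plain static flag is not thread-safe. */
    static bool gss_known_unavailable = false;

    static bool
    have_gss_cred_cache_cached(gss_cred_id_t *cred_out)
    {
        if (gss_known_unavailable)
        {
            *cred_out = NULL;
            return false;   /* cached negative result, no gss locking */
        }
        if (!have_gss_cred_cache(cred_out))   /* probe from the earlier sketch */
        {
            gss_known_unavailable = true;
            return false;
        }
        return true;
    }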

Greetings,

Andres Freund

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#7)
Re: libpq contention due to gss even when not using gss

Andres Freund <andres@anarazel.de> writes:

Initializing the gss cache at all isn't so much the problem. It's that we do
it for every connection. And that doing so requires locking inside gss. So
maybe we could just globally cache that gss isn't available, instead of
rediscovering it over and over for every new connection.

I had the impression that krb5 already had such a cache internally.
Maybe they don't cache the "failed" state though. I doubt we'd
want to either in long-lived processes --- what if the user
installs the credential while we're running?

regards, tom lane

#9Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#8)
Re: libpq contention due to gss even when not using gss

Hi,

On 2024-06-14 12:27:12 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

Initializing the gss cache at all isn't so much the problem. It's that we do
it for every connection. And that doing so requires locking inside gss. So
maybe we could just globally cache that gss isn't available, instead of
rediscovering it over and over for every new connection.

I had the impression that krb5 already had such a cache internally.

Well, if so, it clearly doesn't seem to work very well, given that it causes
contention at ~15k lookups/sec. That's obviously a trivial number for anything
cached, even with the worst possible locking regimen.

Maybe they don't cache the "failed" state though. I doubt we'd
want to either in long-lived processes --- what if the user
installs the credential while we're running?

If we can come up with something better - cool. But it doesn't seem great that
gss introduces contention for the vast majority of folks that use libpq in
environments that never use gss.

I don't think we should cache the set of credentials when gss is actually
available on a process-wide basis, just the fact that gss isn't available at
all. I think it's very unlikely for that fact to change while an application
is running. And if it happens, requiring a restart in those cases seems an
acceptable price to pay for what is effectively a niche feature.

Greetings,

Andres Freund