Tighten SCRAM iteration parsing and bound libpq PBKDF2 work

Started by Sehrope Sarkuni2 months ago3 messageshackers

sehrope@jackdb.com

2 months ago

Hi,

Attached are some SCRAM-related patches.

The first patch fixes parsing of SCRAM iteration counts in both the backend
SCRAM verifier parser and libpq's server-first-message parser. Previously
both paths parsed into a long and then stored the result in an int, so
values in (INT_MAX, LONG_MAX] could be accepted by strtol() and then
narrowed incorrectly. I don't think this allows for any invalid logins as
the password verifier would have tried to verify a different iteration
count, but it's still wrong.

The new common helper, scram_parse_iterations(), rejects values that are
not strictly decimal digits, rejects zero, and rejects values larger than
INT_MAX. This means values with leading whitespace, signs, trailing junk,
decimal points, hex-style prefixes, and integer overflow are all rejected
rather than being partially accepted by strtol() syntax.

Patch 0002 adds a small test_scram test module for direct tests of SCRAM
helpers in src/common. Currently it covers scram_parse_iterations(). I kept
this separate from the parser fix so the bug fix can be backpatched
independently if adding a new test module is considered too invasive for
older branches.

Patches 0003 through 0005 address a separate client-side resource issue. In
a SCRAM exchange, the PBKDF2 iteration count is supplied by the server. A
hostile or misconfigured server can advertise a very large iteration count
and keep a blocking libpq connection inside scram_SaltedPassword() beyond
the caller's connect_timeout. The backend has CHECK_FOR_INTERRUPTS() inside
the loop, but the frontend previously had no equivalent.

These patches mirror the blocking connection attempt's deadline into
PGconn, add an optional interrupt callback to the common SCRAM PBKDF2
helper, and have libpq use that callback to abort once the in-flight
blocking connection deadline has expired. The test modifies a SCRAM
verifier to advertise a very large iteration count and verifies that
connect_timeout interrupts the PBKDF2 loop.

This protection applies to the blocking connection path, such as
PQconnectdb(). It does not make connect_timeout apply automatically to
applications driving PQconnectPoll() themselves.

I considered passing the connection deadline directly into the SCRAM PBKDF2
helper, but used an interrupt callback instead. That keeps the common SCRAM
code independent of libpq's timeout representation and allows the same
mechanism to support other frontend abort conditions, such as cancellation
of an in-progress connection attempt.

Patch 0006 adds a separate libpq connection parameter,
scram_max_iterations, with matching PGSCRAMMAXITERATIONS environment
variable. This places a hard client-side cap on the server-advertised SCRAM
iteration count before PBKDF2 begins. A value of 0 disables the cap.

This is useful independently of connect_timeout as it protects clients that
do not set a timeout. It also protects async PQconnectPoll() users because
it does not depend on the blocking connection deadline.

The attached patch currently uses a default cap of 100K. That is well above
PostgreSQL's normal SCRAM iteration count (4K), but it is still a
client-side behavior change for installations using unusually high SCRAM
iteration counts. For reference, we added a similar patch to pgjdbc with
the same 100K default.

For the connection timeout work, I set up a mock postgres server that
advertises a huge SCRAM iteration count. Without this patch it takes
multiple seconds to perform the client PBKDF2 derivation. It still provides
the "timeout expired" error, but only after the PBKDF2 work is complete:

$ time PGCONNECT_TIMEOUT=1 psql postgresql://user:pass@127.0.0.1/db
psql: error: connection to server at "127.0.0.1", port 5432 failed: timeout
expired

real 0m12.716s
user 0m12.613s
sys 0m0.015s

With the patches it rejects the connection attempt immediately (as it
exceeds the 100K default):

$ time LD_LIBRARY_PATH=src/interfaces/libpq PGCONNECT_TIMEOUT=1
src/bin/psql/psql postgresql://user:pass@127.0.0.1/db
psql: error: connection to server at "127.0.0.1", port 5432 failed: server
requested SCRAM iteration count 10000000, exceeding scram_max_iterations
(100000)

real 0m0.059s
user 0m0.002s
sys 0m0.005s

Explicitly overriding the max iterations and it respects the 1-second
timeout:

$ time LD_LIBRARY_PATH=src/interfaces/libpq PGCONNECT_TIMEOUT=1
PGSCRAMMAXITERATIONS=0 src/bin/psql/psql postgresql://user:pass@127.0.0.1/db
psql: error: connection to server at "127.0.0.1", port 5432 failed: could
not calculate client proof: connection timeout expired

real 0m1.005s
user 0m0.930s
sys 0m0.003s

Some specific things I would particularly like feedback on:

- Whether scram_max_iterations should default to 100K, some other value, or
0 to preserve existing behavior. The last option would still provide an
opt-in hard cap, but would not protect clients unless they configure it.
For reference, pgjdbc ships with a 100K default, but PostgreSQL deployments
with tuned scram_iterations in the 250K–1M range do exist and would need to
set the parameter explicitly on libpq upgrade.
- Whether scram_SaltedPassword() should grow the new interrupt callback
parameters directly rather than via the scram_SaltedPasswordExt() shim used
here. I went with a shim to avoid signature churn for any out-of-tree
consumers of src/common, but I have no strong attachment to the shim if
changing the signature is preferred.
- Whether the in-flight connect deadline should live on PGconn or be passed
through fe_scram_state instead. I went with PGconn on the theory that the
deadline of the current attempt is generally useful state that future auth
paths (GSSAPI, LDAP, etc.) might want to consult, but the narrower scope is
also reasonable.
- Whether the connect_timeout behavior (0003–0005) should be backpatched,
separately from the new scram_max_iterations connection parameter (0006),
which I assume is HEAD-only.
- Whether the iteration-count parsing fix (0001) should be backpatched. The
int-truncation half is safe to backpatch in isolation; tightening
parse_scram_secret() to reject zero/negative iteration counts in stored
verifiers is technically a behavior change though I think it'd be invalid
per the SCRAM RFC.
- Whether the TAP test in 0005, which depends on a 1-second connect_timeout
interrupting a doctored verifier's PBKDF2 loop, is acceptable as written.
The iteration count is large enough that the loop cannot complete in any
realistic CI scenario, but the test is timing-sensitive by nature.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/

Daniel Gustafsson

daniel@yesql.se

2 months ago

In reply to: Sehrope Sarkuni (#1)

Re: Tighten SCRAM iteration parsing and bound libpq PBKDF2 work

On 24 May 2026, at 22:04, Sehrope Sarkuni <sehrope@jackdb.com> wrote:

Hi,

Attached are some SCRAM-related patches.

Thanks for looking, a few initial comments before looking further on these.

The first patch fixes parsing of SCRAM iteration counts in both the backend SCRAM verifier parser and libpq's server-first-message parser. Previously both paths parsed into a long and then stored the result in an int, so values in (INT_MAX, LONG_MAX] could be accepted by strtol() and then narrowed incorrectly. I don't think this allows for any invalid logins as the password verifier would have tried to verify a different iteration count, but it's still wrong.

The iteration count parsing you refer to is extracting the iterations from the
stored password, it's not parsing user input in any way. The iteration count
in set in the password using scram_sha_256_iterations, which in turn can be set
by the GUC scram_iterations which is limited to 1..INT_MAX (before being
accessible as a GUC it was hardcoded to 4096).

strtol() to an int without checking if the parsed value exceeds INT_MAX isn't
good code hygiene, and we should fix that, but it cannot in practice cause
truncation AFAICS.

Can you show a sequence of generating a SCRAM secret which has an iteration
count outside of 0..INT_MAX, or one which doesn't correspond to the setting in
scram_iterations?

Patches 0003 through 0005 address a separate client-side resource issue. In a SCRAM exchange, the PBKDF2 iteration count is supplied by the server. A hostile or misconfigured server can advertise a very large iteration count and keep a blocking libpq connection inside scram_SaltedPassword() beyond the caller's connect_timeout. The backend has CHECK_FOR_INTERRUPTS() inside the loop, but the frontend previously had no equivalent.

A server which request the iteration count of the stored secret, whatever it
is, cannot be considered misconfigured or hostile. A server which sends an
incorrect iteration count in order to reject a connection after causing the
client to perform CPU work is hostile (but I'd be quite glad if a hostile
server rejected my connection rather than tricking me into logging in and
stealing/corrupting data).

Making the SCRAM calculation honor the connection timeout sounds like a good
way to address the latter.

These patches mirror the blocking connection attempt's deadline into PGconn, add an optional interrupt callback to the common SCRAM PBKDF2 helper, and have libpq use that callback to abort once the in-flight bl:qaocking connection deadline has expired. The test modifies a SCRAM verifier to advertise a very large iteration count and verifies that connect_timeout interrupts the PBKDF2 loop.

This protection applies to the blocking connection path, such as PQconnectdb(). It does not make connect_timeout apply automatically to applications driving PQconnectPoll() themselves.

I considered passing the connection deadline directly into the SCRAM PBKDF2 helper, but used an interrupt callback instead. That keeps the common SCRAM code independent of libpq's timeout representation and allows the same mechanism to support other frontend abort conditions, such as cancellation of an in-progress connection attempt.

Making sure SCRAM secret calculation doesn't exceed connection_timeout sounds
like a good idea. However, ode shared between frontend and backend should IMHO
not have parameters which only work in the frontend. I'm also not convinced
that it's a good idea to add a callback which in your case gets the entire
PGconn object. I'd prefer a check more like the CFI we have in the backend
code.

Another issue is the interval for checking. One point of allowing configurable
iteration counts was to allow constrained low-power IOT platforms to use SCRAM
by using a much lower iteration count. Only checking for timeout every 4096
iterations might be well past the connection timeout. Perhaps the interval of
checking should be a function of the iterations sent by the server?

The attached patch currently uses a default cap of 100K. That is well above PostgreSQL's normal SCRAM iteration count (4K), but it is still a client-side behavior change for installations using unusually high SCRAM iteration counts. For reference, we added a similar patch to pgjdbc with the same 100K default.

I am off the cuff -1 on adding more connection parameters for niche usecases
like this (we already have many enough to make it confusing), it will be very
hard to document what an appropriate setting is, and getting it wrong might
mean rejecting connections in err.

--
Daniel Gustafsson

Sehrope Sarkuni

sehrope@jackdb.com

2 months ago

In reply to: Daniel Gustafsson (#2)

Re: Tighten SCRAM iteration parsing and bound libpq PBKDF2 work

On Mon, May 25, 2026 at 10:13 AM Daniel Gustafsson <daniel@yesql.se> wrote:

On 24 May 2026, at 22:04, Sehrope Sarkuni <sehrope@jackdb.com> wrote:
The first patch fixes parsing of SCRAM iteration counts in both the

backend SCRAM verifier parser and libpq's server-first-message parser.
Previously both paths parsed into a long and then stored the result in an
int, so values in (INT_MAX, LONG_MAX] could be accepted by strtol() and
then narrowed incorrectly. I don't think this allows for any invalid logins
as the password verifier would have tried to verify a different iteration
count, but it's still wrong.

The iteration count parsing you refer to is extracting the iterations from
the
stored password, it's not parsing user input in any way. The iteration
count
in set in the password using scram_sha_256_iterations, which in turn can
be set
by the GUC scram_iterations which is limited to 1..INT_MAX (before being
accessible as a GUC it was hardcoded to 4096).

strtol() to an int without checking if the parsed value exceeds INT_MAX
isn't
good code hygiene, and we should fix that, but it cannot in practice cause
truncation AFAICS.

Can you show a sequence of generating a SCRAM secret which has an iteration
count outside of 0..INT_MAX, or one which doesn't correspond to the
setting in
scram_iterations?

You can directly update pg_authid to skip the empty password check
entirely. We do that in pgjdbc to test out connecting against some
nonsensical values.

A regular user can also specify the pre-hashed password in CREATE / ALTER
USER. The server will still try to verify the password to ensure that it is
not an empty string:

// src/backend/commands/user.c#L442
if (password[0] == '\0' || plain_crypt_verify(stmt->role, password, "",
&logdetail) == STATUS_OK)
{
ereport(NOTICE,
(errmsg("empty string is not a valid password, clearing password")));
new_record_nulls[Anum_pg_authid_rolpassword - 1] = true;
}

Normally that would prevent you from trying to create something with a huge
value of iterations as it'd never finish. But the long-to-int overflow
means that it finishes immediately.

postgres=# CREATE USER scram_test PASSWORD 'abcd';
CREATE ROLE

# Default of 4096 iterations
postgres=# SELECT rolpassword FROM pg_authid WHERE rolname = 'scram_test';
rolpassword

---------------------------------------------------------------------------------------------------------------------------------------
SCRAM-SHA-256$4096:PTh5Qe4SmvJnUSHsa6CJpg==$rbWsmPJVYzgjIwnVfmfZ43A0FR0eThILm+TuYpYF4M0=:gFahnsW7KHa73RW1Vk/eP6avCOWf1O2LgCW1ZgiRHtw=
(1 row)

# This hangs effectively for ever as it's trying 2^31-1 iterations to
verify that it's not empty string
postgres=# ALTER USER scram_test PASSWORD
'SCRAM-SHA-256$2147483647:PTh5Qe4SmvJnUSHsa6CJpg==$rbWsmPJVYzgjIwnVfmfZ43A0FR0eThILm+TuYpYF4M0=:gFahnsW7KHa73RW1Vk/eP6avCOWf1O2LgCW1ZgiRHtw=';
^C
Cancel request sent
ERROR: canceling statement due to user request
Time: 4812.007 ms (00:04.812)

# This completes instantly because it overflows and the empty string
password check passes. Trying to connect gives a scram iterations from the
server of "-1".
postgres=# ALTER USER scram_test PASSWORD
'SCRAM-SHA-256$2147483648:PTh5Qe4SmvJnUSHsa6CJpg==$rbWsmPJVYzgjIwnVfmfZ43A0FR0eThILm+TuYpYF4M0=:gFahnsW7KHa73RW1Vk/eP6avCOWf1O2LgCW1ZgiRHtw=';
ALTER ROLE
Time: 0.773 ms

Patches 0003 through 0005 address a separate client-side resource issue.

In a SCRAM exchange, the PBKDF2 iteration count is supplied by the server.
A hostile or misconfigured server can advertise a very large iteration
count and keep a blocking libpq connection inside scram_SaltedPassword()
beyond the caller's connect_timeout. The backend has CHECK_FOR_INTERRUPTS()
inside the loop, but the frontend previously had no equivalent.

A server which request the iteration count of the stored secret, whatever
it
is, cannot be considered misconfigured or hostile. A server which sends an
incorrect iteration count in order to reject a connection after causing the
client to perform CPU work is hostile (but I'd be quite glad if a hostile
server rejected my connection rather than tricking me into logging in and
stealing/corrupting data).

The attack vector here is a client side denial of service. I previously
submitted this to the security list and it was deemed out of scope (as a
security issue for core) as it's a client side cpu / memory issue. A real
world example of it would be a SaaS or reporting system that connects to
user specified databases. A malicious user who could specify the remote
database host / port (common in that kind of platform) could get the app to
perform the PBKDF2 iterations forever. There's plenty of other things a
malicious server can do as well (e.g., forcing allocation of infinite
memory). So again out of scope for core as a security issue, but still a
problem in specific use cases.

Making the SCRAM calculation honor the connection timeout sounds like a
good
way to address the latter.

These patches mirror the blocking connection attempt's deadline into

PGconn, add an optional interrupt callback to the common SCRAM PBKDF2
helper, and have libpq use that callback to abort once the in-flight
bl:qaocking connection deadline has expired. The test modifies a SCRAM
verifier to advertise a very large iteration count and verifies that
connect_timeout interrupts the PBKDF2 loop.

This protection applies to the blocking connection path, such as

PQconnectdb(). It does not make connect_timeout apply automatically to
applications driving PQconnectPoll() themselves.

I considered passing the connection deadline directly into the SCRAM

PBKDF2 helper, but used an interrupt callback instead. That keeps the
common SCRAM code independent of libpq's timeout representation and allows
the same mechanism to support other frontend abort conditions, such as
cancellation of an in-progress connection attempt.

Making sure SCRAM secret calculation doesn't exceed connection_timeout
sounds
like a good idea. However, ode shared between frontend and backend should
IMHO
not have parameters which only work in the frontend.

I suppose it could be enabled for the backend as well. It'd end up being an
additional if-check to skip the NULL arg.

Alternatively the backend could supply CFI/wrapper as the callback. Then
the scram code would not know anything about who's invoking it.

I'm also not convinced
that it's a good idea to add a callback which in your case gets the entire
PGconn object. I'd prefer a check more like the CFI we have in the backend
code.

Is there a front end example of that to look at? Or are you describing
something more general that does not exist yet?

Another issue is the interval for checking. One point of allowing

configurable
iteration counts was to allow constrained low-power IOT platforms to use
SCRAM
by using a much lower iteration count. Only checking for timeout every
4096
iterations might be well past the connection timeout. Perhaps the
interval of
checking should be a function of the iterations sent by the server?

The time check itself is relatively cheap compared to the hashing so
lowering it something more fine grained should not matter much. I picked
4096 to match the default scram iterations so it does not even run in the
normal case. Maybe 256?

The SCRAM RFC estimates it taking .5 seconds to run 4096 iterations on 2015
"current mobile handsets". Would figure it to be even less for devices 10+
years later. And connect timeout granularity is in seconds.

The attached patch currently uses a default cap of 100K. That is well

above PostgreSQL's normal SCRAM iteration count (4K), but it is still a
client-side behavior change for installations using unusually high SCRAM
iteration counts. For reference, we added a similar patch to pgjdbc with
the same 100K default.

I am off the cuff -1 on adding more connection parameters for niche
usecases
like this (we already have many enough to make it confusing), it will be
very
hard to document what an appropriate setting is, and getting it wrong might
mean rejecting connections in err.

For additional reference, we added similar parameters to the java driver
pgjdbc and the node driver node-postgres. Both with a 100K limit.

Both also had a bit more of a direct issue with the large values (v.s.
libpq) as the java driver would create a background thread for the login
(which would get pinned to 100% forever with a huge iteration count) and
the node driver would also hang indefinitely (as node is single threaded
and the crypto functions are not interruptible).

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/

Tighten SCRAM iteration parsing and bound libpq PBKDF2 work

Attachments: