Optimize scram_SaltedPassword performance
Hi Hackers,
While profiling a program with `perf`, I noticed that `scram_SaltedPassword`
consumed more CPU time than expected. After some investigation, I found
that the function performs many HMAC iterations (4096 rounds for
SCRAM-SHA-256), and each iteration reinitializes the HMAC context, causing
excessive overhead.
OpenSSL has an optimization for this case: when the key remains the
same, the HMAC context can be reused with a lightweight state reset by
passing NULL as the key. To take advantage of this, I introduced
`pg_hmac_reuse()`, which replaces the key with NULL when OpenSSL is used.
With this change, the performance improves by approximately **4x** (reducing
execution time from 4ms to 1ms). The built-in PostgreSQL HMAC implementation
does not support context reuse, and modifying it would require substantial
changes. Therefore, `pg_hmac_reuse()` simply calls `pg_hmac_init()` in that
case, maintaining the existing logic.
Regards,
Zixuan
Attachments:
v1-0001-Optimize-scram_SaltedPassword-with-HMAC-context-reuse.patchtext/x-patch; charset=utf-8Download
From e13d9e9c674d1cf4f671ca81b950ae010fd958a4 Mon Sep 17 00:00:00 2001
From: Zi-Xuan Fu <r33s3n6@gmail.com>
Date: Mon, 3 Feb 2025 13:33:43 +0800
Subject: [PATCH v1] Optimize "scram_SaltedPassword" with OpenSSL HMAC context reuse
`scram_SaltedPassword()` requires a large number of HMAC iterations, with
SCRAM-SHA-256 defaulting to 4096 rounds. Previously, each iteration
reinitialized the HMAC context, incurring significant overhead. However,
OpenSSL optimizes this case by allowing context reuse when the key remains
unchangedârequiring only a lightweight state reset by passing NULL as the key.
To leverage this, I introduced `pg_hmac_reuse()`, which sets the key to
NULL when OpenSSL is used. This optimization improves performance by
approximately 4x (reducing execution time from 4ms to 1ms).
The native PostgreSQL HMAC implementation does not support context reuse, and
modifying it would require substantial changes. Therefore, `pg_hmac_reuse()`
simply calls `pg_hmac_init()` in that case, maintaining the existing logic.
---
src/common/hmac.c | 11 +++++++++++
src/common/hmac_openssl.c | 12 ++++++++++++
src/common/scram-common.c | 2 +-
src/include/common/hmac.h | 1 +
4 files changed, 25 insertions(+), 1 deletion(-)
diff --git a/src/common/hmac.c b/src/common/hmac.c
index 9b85910672..6de95c44ee 100644
--- a/src/common/hmac.c
+++ b/src/common/hmac.c
@@ -214,6 +214,17 @@ pg_hmac_init(pg_hmac_ctx *ctx, const uint8 *key, size_t len)
return 0;
}
+/*
+ * pg_hmac_reuse
+ *
+ * Reuse a HMAC context with the same key. Returns 0 on success, -1 on failure.
+ */
+int
+pg_hmac_reuse(pg_hmac_ctx *ctx, const uint8 *key, size_t len)
+{
+ return pg_hmac_init(ctx, key, len);
+}
+
/*
* pg_hmac_update
*
diff --git a/src/common/hmac_openssl.c b/src/common/hmac_openssl.c
index 31cd188904..9e7a92884d 100644
--- a/src/common/hmac_openssl.c
+++ b/src/common/hmac_openssl.c
@@ -208,6 +208,18 @@ pg_hmac_init(pg_hmac_ctx *ctx, const uint8 *key, size_t len)
return 0;
}
+/*
+ * pg_hmac_reuse
+ *
+ * Reuse a HMAC context with the same key. Returns 0 on success, -1 on failure.
+ */
+int
+pg_hmac_reuse(pg_hmac_ctx *ctx, const uint8 *key, size_t len)
+{
+ /* OpenSSL skips unnecessary reinitialization if the key is NULL */
+ return pg_hmac_init(ctx, NULL, len);
+}
+
/*
* pg_hmac_update
*
diff --git a/src/common/scram-common.c b/src/common/scram-common.c
index 400900c51c..5717daff28 100644
--- a/src/common/scram-common.c
+++ b/src/common/scram-common.c
@@ -84,7 +84,7 @@ scram_SaltedPassword(const char *password,
CHECK_FOR_INTERRUPTS();
#endif
- if (pg_hmac_init(hmac_ctx, (uint8 *) password, password_len) < 0 ||
+ if (pg_hmac_reuse(hmac_ctx, (uint8 *) password, password_len) < 0 ||
pg_hmac_update(hmac_ctx, (uint8 *) Ui_prev, key_length) < 0 ||
pg_hmac_final(hmac_ctx, Ui, key_length) < 0)
{
diff --git a/src/include/common/hmac.h b/src/include/common/hmac.h
index c4d069e49a..e9a0366d6e 100644
--- a/src/include/common/hmac.h
+++ b/src/include/common/hmac.h
@@ -22,6 +22,7 @@ typedef struct pg_hmac_ctx pg_hmac_ctx;
extern pg_hmac_ctx *pg_hmac_create(pg_cryptohash_type type);
extern int pg_hmac_init(pg_hmac_ctx *ctx, const uint8 *key, size_t len);
+extern int pg_hmac_reuse(pg_hmac_ctx *ctx, const uint8 *key, size_t len);
extern int pg_hmac_update(pg_hmac_ctx *ctx, const uint8 *data, size_t len);
extern int pg_hmac_final(pg_hmac_ctx *ctx, uint8 *dest, size_t len);
extern void pg_hmac_free(pg_hmac_ctx *ctx);
--
2.34.1
03.02.2025 10:11, Zixuan Fu пишет:
Hi Hackers,
While profiling a program with `perf`, I noticed that `scram_SaltedPassword`
consumed more CPU time than expected. After some investigation, I found
that the function performs many HMAC iterations (4096 rounds for
SCRAM-SHA-256), and each iteration reinitializes the HMAC context, causing
excessive overhead.OpenSSL has an optimization for this case: when the key remains the
same, the HMAC context can be reused with a lightweight state reset by
passing NULL as the key. To take advantage of this, I introduced
`pg_hmac_reuse()`, which replaces the key with NULL when OpenSSL is used.
Good catch.
Since pg_hmac_reuse is not `static`, I'd add some checks that key is
exactly same. At least there should be
Assert(key == prev_key && len == prev_len && hash_bytes(key, len) ==
prev_hash);
Where `prev_key`, `prev_len` and `prev_hash` are static variables, filled
in `pg_hmac_init`.
I don't know, should it be `Assert`, or check that leads to `elog(ERROR)`.
`hash_bytes` is fast enough to not cause measurable slow down in production.
On the other hand, use cases are trivial enough to occasional misuses to be
caught using just `Assert`.
With this change, the performance improves by approximately **4x** (reducing
execution time from 4ms to 1ms). The built-in PostgreSQL HMAC implementation
does not support context reuse, and modifying it would require substantial
changes. Therefore, `pg_hmac_reuse()` simply calls `pg_hmac_init()` in that
case, maintaining the existing logic.
-------
regards,
Yura
On Mon, Feb 03, 2025 at 03:11:48PM +0800, Zixuan Fu wrote:
With this change, the performance improves by approximately **4x** (reducing
execution time from 4ms to 1ms). The built-in PostgreSQL HMAC implementation
does not support context reuse, and modifying it would require substantial
changes. Therefore, `pg_hmac_reuse()` simply calls `pg_hmac_init()` in that
case, maintaining the existing logic.
I suspect that keeping an interface like pg_hmac_reuse() is going to
encourage incorrect practices in the long term, for a behavior that's
hidden in OpenSSL itself. In short, while I agree that the
performance could be better based on the information you are providing
(did not test nor check OpenSSL across the branches we support), your
proposal to address the problem makes me wonder that we'd better limit
this change localized into scram_SaltedPassword() rather than increase
the footprint of the APIs in the two HMAC-specific files of
src/common/.
--
Michael
I initially intended to restrict the modification to the
scram_SaltedPassword(), but I'm unsure how to determine
whether the current implementation is OpenSSL.
Alternatively, could we modify pg_hmac_init() to allow `key=NULL`
in the native HMAC implementation, achieving semantics similar to
OpenSSL? However, without carefully designing the logic for reusing
the context, the performance difference would be minimal. In my tests,
I skipped the initialization of `i_pad` and `o_pad` when `key=NULL`,
but the execution time remained around 10ms. That said, we can avoid
using the pg_hmac_reuse(), though we may need to add additional
explanations in the comments of pg_hmac_init().
--
regards,
Zixuan
Show quoted text
On Mon, Feb 3, 2025 at 4:16 PM Michael Paquier <michael@paquier.xyz> wrote:
On Mon, Feb 03, 2025 at 03:11:48PM +0800, Zixuan Fu wrote:
With this change, the performance improves by approximately **4x** (reducing
execution time from 4ms to 1ms). The built-in PostgreSQL HMAC implementation
does not support context reuse, and modifying it would require substantial
changes. Therefore, `pg_hmac_reuse()` simply calls `pg_hmac_init()` in that
case, maintaining the existing logic.I suspect that keeping an interface like pg_hmac_reuse() is going to
encourage incorrect practices in the long term, for a behavior that's
hidden in OpenSSL itself. In short, while I agree that the
performance could be better based on the information you are providing
(did not test nor check OpenSSL across the branches we support), your
proposal to address the problem makes me wonder that we'd better limit
this change localized into scram_SaltedPassword() rather than increase
the footprint of the APIs in the two HMAC-specific files of
src/common/.
--
Michael
On 3 Feb 2025, at 08:45, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
03.02.2025 10:11, Zixuan Fu пишет:
Hi Hackers,
While profiling a program with `perf`, I noticed that `scram_SaltedPassword`
consumed more CPU time than expected. After some investigation, I found
that the function performs many HMAC iterations (4096 rounds for
SCRAM-SHA-256), and each iteration reinitializes the HMAC context, causing
excessive overhead.
While I don't disagree with speeding up this in general, the whole point of the
SCRAM iterations is to take a lot of time as a way to combat brute forcing.
Any attacker is likely to use a patched libpq (or not use libpq at all) so
clientside it doesn't matter as much but if we make it 4x faster serverside we
don't really achieve much other than making attacks more feasible.
The relevant portion from RFC 7677 §4 is:
The strength of this mechanism is dependent in part on the hash
iteration-count, as denoted by "i" in [RFC5802]. As a rule of thumb,
the hash iteration-count should be such that a modern machine will take
0.1 seconds to perform the complete algorithm; however, this is
unlikely to be practical on mobile devices and other relatively low-
performance systems. At the time this was written, the rule of thumb
gives around 15,000 iterations required; however, a hash iteration-
count of 4096 takes around 0.5 seconds on current mobile handsets.
This computational cost can be avoided by caching the ClientKey
(assuming the Salt and hash iteration-count is stable). Therefore, the
recommendation of this specification is that the hash iteration- count
SHOULD be at least 4096, but careful consideration ought to be given to
using a significantly higher value, particularly where mobile use is
less important.
The numbers are quite outdated but the gist of it holds. If we speed it up
serverside we need to counter that with a higher iteration count, and for a
rogue client it's unlikely to matter since it wont use our code anyways.
Speeding up HMAC for other usecases is a different story (but also likely to
have less measurable performance impact).
OpenSSL has an optimization for this case: when the key remains the
same, the HMAC context can be reused with a lightweight state reset by
passing NULL as the key. To take advantage of this, I introduced
`pg_hmac_reuse()`, which replaces the key with NULL when OpenSSL is used.
Where `prev_key`, `prev_len` and `prev_hash` are static variables, filled
in `pg_hmac_init`.
Storing any part of a cryptograhic calculation, let alone a key, in a static
variable doesn't seem entirely like a best practice, and it also wont be
threadsafe.
--
Daniel Gustafsson
03.02.2025 14:17, Daniel Gustafsson пишет:
On 3 Feb 2025, at 08:45, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
03.02.2025 10:11, Zixuan Fu пишет:
Hi Hackers,
While profiling a program with `perf`, I noticed that `scram_SaltedPassword`
consumed more CPU time than expected. After some investigation, I found
that the function performs many HMAC iterations (4096 rounds for
SCRAM-SHA-256), and each iteration reinitializes the HMAC context, causing
excessive overhead.While I don't disagree with speeding up this in general, the whole point of the
SCRAM iterations is to take a lot of time as a way to combat brute forcing.
Any attacker is likely to use a patched libpq (or not use libpq at all) so
clientside it doesn't matter as much but if we make it 4x faster serverside we
don't really achieve much other than making attacks more feasible.The relevant portion from RFC 7677 §4 is:
The strength of this mechanism is dependent in part on the hash
iteration-count, as denoted by "i" in [RFC5802]. As a rule of thumb,
the hash iteration-count should be such that a modern machine will take
0.1 seconds to perform the complete algorithm; however, this is
unlikely to be practical on mobile devices and other relatively low-
performance systems. At the time this was written, the rule of thumb
gives around 15,000 iterations required; however, a hash iteration-
count of 4096 takes around 0.5 seconds on current mobile handsets.
This computational cost can be avoided by caching the ClientKey
(assuming the Salt and hash iteration-count is stable). Therefore, the
recommendation of this specification is that the hash iteration- count
SHOULD be at least 4096, but careful consideration ought to be given to
using a significantly higher value, particularly where mobile use is
less important.The numbers are quite outdated but the gist of it holds. If we speed it up
serverside we need to counter that with a higher iteration count, and for a
rogue client it's unlikely to matter since it wont use our code anyways.
Speeding up HMAC for other usecases is a different story (but also likely to
have less measurable performance impact).
There is no sense to not speedup server if client can be made faster. HMAC
should protect from an attacker who have very optimized software on very
powerful hardware, since it should protect against BRUTE force against
known cipher text either.
If 4096 iterations could be performed fast in attacker's environment,
there's no point to slow down our environment. It is meaningless.
OpenSSL has an optimization for this case: when the key remains the
same, the HMAC context can be reused with a lightweight state reset by
passing NULL as the key. To take advantage of this, I introduced
`pg_hmac_reuse()`, which replaces the key with NULL when OpenSSL is used.Where `prev_key`, `prev_len` and `prev_hash` are static variables, filled
in `pg_hmac_init`.Storing any part of a cryptograhic calculation, let alone a key, in a static
variable doesn't seem entirely like a best practice, and it also wont be
threadsafe.
`prev_key` stores not the key, but pointer to key. Just to ensure
`pg_hmac_reuse` were called with exactly same key. It doesn't leak the key
itself.
Storing `prev_hash` is more considered as "storing part of key". I may
agree it is not exactly good idea if `key` is "human-brain-generated".
Thread-safety is more-or-less trivial with `thread_local` and synonyms for.
--
Yura
On Mon, Feb 03, 2025 at 12:17:27PM +0100, Daniel Gustafsson wrote:
The numbers are quite outdated but the gist of it holds. If we speed it up
serverside we need to counter that with a higher iteration count, and for a
rogue client it's unlikely to matter since it wont use our code anyways.
Speeding up HMAC for other usecases is a different story (but also likely to
have less measurable performance impact).
Thanks, I've managed to somewhat forget this point. Note that this
has come up as well when adding the GUC scram_iterations in
b577743000cd, where a lower count can be used for a faster connection.
Storing any part of a cryptograhic calculation, let alone a key, in a static
variable doesn't seem entirely like a best practice, and it also wont be
threadsafe.
Agreed.
By the way, something I have forgotten to mention yesterday.. If a
fast-path gets implemented in these HMAC APIs (perhaps not), it sounds
to me that it would be better to achieve the same for the fallback
non-OpenSSL implementation in src/common/hmac.c, as it would reflect
things properly in the interface. My 2c.
--
Michael