Aggregate versions of hashing functions (md5, sha1, etc...)

Started by Dominique Devienne10 months ago16 messagesgeneral
Jump to latest
#1Dominique Devienne
ddevienne@gmail.com

We store scientific information in PostgreSQL, and some of that is
bytea and large, thus we must "chunk it" both for performance, and not
be limited to 1GB (we do exceed that, in rare occasions).

Recently I added md5/sha1 hashing support for such values (for various
reasons, to track corruptions in our ETLs, now fixed, but also in the
future for custom smart sync and such), and was surprised to discover
there are no aggregate versions of those functions, neither the
built-in md5 one (now with bytea overload), nor for pgcrypto. Did I
miss something?

Any chance this might be added in the future?

Seems so logical to me, that these hashing functions were available
are aggregates, I can't be the first one to think of that, can it?

Thanks, --DD

PS: The holly-grail IMHO, would be built-in support for hashing, with
intelligent lazy-compute and persistence correctly reset on changes.
Probably opt-in.

PPS: Built-in Support for xxh64, or at least in official pgcrypto,
would also be nice. With aggregate!

PPPS: I remember Oracle LOBs supporting a built-in implicit mime-type
attribute on them, which was used as the content-type with served over
the built-in WebDAV extension for the DB. I wish PostgreSQL had
something like that too.

PPPPS: the lo extension is not viable for us, FWIW. All the above
should be opt-in on bytea columns IMHO.

#2Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Dominique Devienne (#1)
Re: Aggregate versions of hashing functions (md5, sha1, etc...)

On 7/10/25 04:48, Dominique Devienne wrote:

Seems so logical to me, that these hashing functions were available
are aggregates, I can't be the first one to think of that, can it?

I've been on this list since late 2002 and I don't recall this ever
being brought up. Now it is entirely possible that age has dimmed my
recall abilities:) Though a quick search seems to confirm my memory.

Thanks, --DD

--
Adrian Klaver
adrian.klaver@aklaver.com

#3Ron
ronljohnsonjr@gmail.com
In reply to: Adrian Klaver (#2)
Re: Aggregate versions of hashing functions (md5, sha1, etc...)

On Thu, Jul 10, 2025 at 12:26 PM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

On 7/10/25 04:48, Dominique Devienne wrote:

Seems so logical to me, that these hashing functions were available
are aggregates, I can't be the first one to think of that, can it?

I've been on this list since late 2002 and I don't recall this ever
being brought up. Now it is entirely possible that age has dimmed my
recall abilities:) Though a quick search seems to confirm my memory.

What even is an aggregate hash function? (I can imagine a few
possibilities, but don't want to assume.)

--
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!

#4Merlin Moncure
mmoncure@gmail.com
In reply to: Dominique Devienne (#1)
Re: Aggregate versions of hashing functions (md5, sha1, etc...)

On Thu, Jul 10, 2025 at 5:48 AM Dominique Devienne <ddevienne@gmail.com>
wrote:

We store scientific information in PostgreSQL, and some of that is
bytea and large, thus we must "chunk it" both for performance, and not
be limited to 1GB (we do exceed that, in rare occasions).

Recently I added md5/sha1 hashing support for such values (for various
reasons, to track corruptions in our ETLs, now fixed, but also in the
future for custom smart sync and such), and was surprised to discover
there are no aggregate versions of those functions, neither the
built-in md5 one (now with bytea overload), nor for pgcrypto. Did I
miss something?

Any chance this might be added in the future?

Seems so logical to me, that these hashing functions were available
are aggregates, I can't be the first one to think of that, can it?

Thanks, --DD

PS: The holly-grail IMHO, would be built-in support for hashing, with
intelligent lazy-compute and persistence correctly reset on changes.
Probably opt-in.

PPS: Built-in Support for xxh64, or at least in official pgcrypto,
would also be nice. With aggregate!

PPPS: I remember Oracle LOBs supporting a built-in implicit mime-type
attribute on them, which was used as the content-type with served over
the built-in WebDAV extension for the DB. I wish PostgreSQL had
something like that too.

PPPPS: the lo extension is not viable for us, FWIW. All the above
should be opt-in on bytea columns IMHO.

Does it have to match the md5 of the 'unchunked' variant exactly? If not,
maybe you can rig a custom aggregate that would just "hash amend" the
chunks using the n-1 chunk has as salt, this would be fast and easy, at the
cost of not matching the unchunked variant.

I had to do something very similar with compression, I had a need to be
able to compress bytea values with lz4 at SQL level due to limited support
for extensions in the cloud. it works great...if a bit slow, and requires
the database to handle the decompression.

merlin

#5Dominique Devienne
ddevienne@gmail.com
In reply to: Adrian Klaver (#2)
Re: Aggregate versions of hashing functions (md5, sha1, etc...)

On Thu, Jul 10, 2025 at 6:25 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:

On 7/10/25 04:48, Dominique Devienne wrote:

Seems so logical to me, that these hashing functions were available
are aggregates, I can't be the first one to think of that, can it?

I've been on this list since late 2002 and I don't recall this ever
being brought up. Now it is entirely possible that age has dimmed my
recall abilities:) Though a quick search seems to confirm my memory.

Hi. Given that [SQLite's SHA3 hasher][1]https://sqlite.org/src/file?name=ext/misc/shathree.c&amp;ci=tip has it (OK, for [8 months only][2]https://github.com/sqlite/sqlite/commit/797fcb8433301d995dbe7731765833c82a74a1da),
it's hardly an original idea. And when considering that `sha3_query`
(and `sha1_query` before it) have been there for years, and provide
equivalent functionality, again, this is not novel by any stretch of
the imagination.

So again, I've really surprised this hasn't come up before. --DD

[1]: https://sqlite.org/src/file?name=ext/misc/shathree.c&amp;ci=tip
[2]: https://github.com/sqlite/sqlite/commit/797fcb8433301d995dbe7731765833c82a74a1da

#6Dominique Devienne
ddevienne@gmail.com
In reply to: Ron (#3)
Re: Aggregate versions of hashing functions (md5, sha1, etc...)

On Thu, Jul 10, 2025 at 7:11 PM Ron Johnson <ronljohnsonjr@gmail.com> wrote:

On Thu, Jul 10, 2025 at 12:26 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:

On 7/10/25 04:48, Dominique Devienne wrote:

Seems so logical to me, that these hashing functions were available
are aggregates, I can't be the first one to think of that, can it?

I've been on this list since late 2002 and I don't recall this ever
being brought up. Now it is entirely possible that age has dimmed my
recall abilities:) Though a quick search seems to confirm my memory.

What even is an aggregate hash function? (I can imagine a few possibilities, but don't want to assume.)

Well, it's so obvious to me, I wonder if you're baiting me :)

Any hasher/digest inits some internal state, processes bytes,
typically in "streaming-fashion" via successive byte spans (equivalent
to PostgreSQL's bytea), and yields a digest of various length at the
end. The current md5() and pgcrypto.digest() functions roll the x1
init, xN process, and x1 finish into a single call, processing a
single bytea (or perhaps more intelligently for TOAST'ed values, the
2K "rows" of those in streaming-fashion, hopefully. Can a dev
confirm?). As an aggregate, the processing is extended to all values
aggregated. That's it. Obviously order-sensitive, so an explicit ORDER
BY inside the aggregate call is DE RIGEUR, but that's normal. As I
mentioned already, SQLite supports sha3_agg() for almost a year, and
had sha(1|3)_query, which is conceptually similar (although hashes
value types too, since multi-column and dynamically typed), for years
(8+ for sha3, probably decades for sha1).

Basically anyone who knows hashing/digests and has ever written an
aggregate UDF (in SQLite or elsewhere), understands what I'm talking
about. --DD

#7Dominique Devienne
ddevienne@gmail.com
In reply to: Merlin Moncure (#4)
Re: Aggregate versions of hashing functions (md5, sha1, etc...)

On Thu, Jul 10, 2025 at 9:03 PM Merlin Moncure <mmoncure@gmail.com> wrote:

On Thu, Jul 10, 2025 at 5:48 AM Dominique Devienne <ddevienne@gmail.com> wrote:

We store scientific information in PostgreSQL, and some of that is
bytea and large, thus we must "chunk it" both for performance, and not
be limited to 1GB (we do exceed that, in rare occasions).

Does it have to match the md5 of the 'unchunked' variant exactly?

Well, yeah, ideally.

If not, maybe you can rig a custom aggregate that would just "hash amend"

Sure, there are various ways to get a deterministic hash of such a
chunked-across-rows (logical) value. But it always involves hashing
the hash, which statistically might not be ideal. But more
importantly, it departs from the normal "serial" hashing of the full
logical hash. With the full hash, I can use various FS OS tools and
SQLite aggregate-UDFs and TBD PostgreSQL aggregate-Digests to compare
those hashes in natural and consistent ways. Working around lack of
aggregate digests in PostgreSQL forces to replicate those work-arounds
at the FS and SQLite levels, for comparisons. I.e. Not good.

I had to do something very similar with compression, I had a need to be able to compress bytea values with lz4 at SQL level due to limited support for extensions in the cloud. it works great...if a bit slow, and requires the database to handle the decompression.

The slow part is what I want to avoid. I explored various ways to
work-around true server-side aggregate hashing, and none are
satisfactory nor performant enough. This is really the kind of
primitive that must be built-in to be efficient enough to be "useful".
(and I'd argue compression, deduplication, etc... are similar issues.
BLOBs / BYTEAs are too often "limited" in RDBMSs, with people saying
store them outside the DB, and I always find so strange, and a bit of
a cope out to making it performant enough and/or full features enough.
A bit of a rant here, sorry ;)).

#8Ron
ronljohnsonjr@gmail.com
In reply to: Dominique Devienne (#6)
Re: Aggregate versions of hashing functions (md5, sha1, etc...)

On Fri, Jul 11, 2025 at 5:01 AM Dominique Devienne <ddevienne@gmail.com>
wrote:

On Thu, Jul 10, 2025 at 7:11 PM Ron Johnson <ronljohnsonjr@gmail.com>
wrote:

On Thu, Jul 10, 2025 at 12:26 PM Adrian Klaver <

adrian.klaver@aklaver.com> wrote:

On 7/10/25 04:48, Dominique Devienne wrote:

Seems so logical to me, that these hashing functions were available
are aggregates, I can't be the first one to think of that, can it?

I've been on this list since late 2002 and I don't recall this ever
being brought up. Now it is entirely possible that age has dimmed my
recall abilities:) Though a quick search seems to confirm my memory.

What even is an aggregate hash function? (I can imagine a few

possibilities, but don't want to assume.)

Well, it's so obvious to me, I wonder if you're baiting me :)

Any hasher/digest inits some internal state, processes bytes,
typically in "streaming-fashion" via successive byte spans (equivalent
to PostgreSQL's bytea), and yields a digest of various length at the
end. The current md5() and pgcrypto.digest() functions roll the x1
init, xN process, and x1 finish into a single call, processing a
single bytea (or perhaps more intelligently for TOAST'ed values, the
2K "rows" of those in streaming-fashion, hopefully. Can a dev
confirm?). As an aggregate, the processing is extended to all values
aggregated.

So it "appends" all the fields into one (virtual) mega-structure and takes
the hash on that.

It's what I expected but wanted to verify.

--
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!

#9Dominique Devienne
ddevienne@gmail.com
In reply to: Dominique Devienne (#6)
Re: Aggregate versions of hashing functions (md5, sha1, etc...)

On Fri, Jul 11, 2025 at 11:00 AM Dominique Devienne <ddevienne@gmail.com> wrote:

The current md5() and pgcrypto.digest() functions roll the x1
init, xN process, and x1 finish into a single call, processing a
single bytea (or perhaps more intelligently for TOAST'ed values, the
2K "rows" of those in streaming-fashion, hopefully. Can a dev confirm?)

FWIW, I've [asked ChatGPT about that][1]https://chatgpt.com/share/6870fe03-416c-800e-8633-a76e478a794a, and assuming it's right (md5
and pgcrypto.digest not leveraging the "substring-optimization" on
TOASTED bytea), that's an unfortunate lost opportunity, especially for
byteas reaching close to the 1GB limit. And again (sorry to lay it on
thick...), when required to manually chunk for sizes > 1GB, the lack
of aggregate is a bit crippling, I'm afraid.

So again, can a dev confirm what ChatGPT blurted out?

And if true, any interest in improving that for better TOAST support
for true streaming hashing for current scalar digests?

And of course, the main point of this thread, add (true streaming)
aggregate support in a future version?

Thanks, --DD

[1]: https://chatgpt.com/share/6870fe03-416c-800e-8633-a76e478a794a

#10Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Dominique Devienne (#5)
Re: Aggregate versions of hashing functions (md5, sha1, etc...)

On 7/11/25 01:49, Dominique Devienne wrote:

On Thu, Jul 10, 2025 at 6:25 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:

On 7/10/25 04:48, Dominique Devienne wrote:

Seems so logical to me, that these hashing functions were available
are aggregates, I can't be the first one to think of that, can it?

I've been on this list since late 2002 and I don't recall this ever
being brought up. Now it is entirely possible that age has dimmed my
recall abilities:) Though a quick search seems to confirm my memory.

Hi. Given that [SQLite's SHA3 hasher][1] has it (OK, for [8 months only][2]),
it's hardly an original idea. And when considering that `sha3_query`
(and `sha1_query` before it) have been there for years, and provide
equivalent functionality, again, this is not novel by any stretch of
the imagination.

Even if there was interest in writing the code, given that Postgres 18
is in Beta I don't see this happening for it's release. That means the
earliest it would arrive would be Fall of 2026. The alternative would be
to do like Sqlite and create an extension that incorporates the code.

So again, I've really surprised this hasn't come up before. --DD

No one else had the need?

Or they came up with workarounds.

--
Adrian Klaver
adrian.klaver@aklaver.com

#11Florents Tselai
florents.tselai@gmail.com
In reply to: Adrian Klaver (#10)
Re: Aggregate versions of hashing functions (md5, sha1, etc...)

On Fri, Jul 11, 2025, 18:27 Adrian Klaver <adrian.klaver@aklaver.com> wrote:

On 7/11/25 01:49, Dominique Devienne wrote:

On Thu, Jul 10, 2025 at 6:25 PM Adrian Klaver <adrian.klaver@aklaver.com>

wrote:

On 7/10/25 04:48, Dominique Devienne wrote:

Seems so logical to me, that these hashing functions were available
are aggregates, I can't be the first one to think of that, can it?

I've been on this list since late 2002 and I don't recall this ever
being brought up. Now it is entirely possible that age has dimmed my
recall abilities:) Though a quick search seems to confirm my memory.

Hi. Given that [SQLite's SHA3 hasher][1] has it (OK, for [8 months

only][2]),

it's hardly an original idea. And when considering that `sha3_query`
(and `sha1_query` before it) have been there for years, and provide
equivalent functionality, again, this is not novel by any stretch of
the imagination.

Even if there was interest in writing the code, given that Postgres 18
is in Beta I don't see this happening for it's release. That means the
earliest it would arrive would be Fall of 2026. The alternative would be
to do like Sqlite and create an extension that incorporates the code.

That's an ideal use case for an extension indeed .

@Dominique I'd suggest searching for these function on GitHub, just in case
there are already implementations out there.
I've seen a lot of weird aggregates out there for niche cases.

If not, it might be an interesting weekend project for me to explore.

Show quoted text
#12Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Florents Tselai (#11)
Re: Aggregate versions of hashing functions (md5, sha1, etc...)

On 7/11/25 09:03, Florents Tselai wrote:

That's an ideal use case for an extension indeed .

@Dominique I'd suggest searching for these function on GitHub, just in
case there are already implementations out there.
I've seen a lot of weird aggregates out there for niche cases.

If not, it might be an interesting weekend project for me to explore.

The SQLite extension that incorporates these functions is here:

https://sqlite.org/src/file?name=ext/misc/shathree.c&amp;ci=tip

--
Adrian Klaver
adrian.klaver@aklaver.com

#13Dominique Devienne
ddevienne@gmail.com
In reply to: Florents Tselai (#11)
Re: Aggregate versions of hashing functions (md5, sha1, etc...)

On Fri, Jul 11, 2025 at 6:05 PM Florents Tselai
<florents.tselai@gmail.com> wrote:

On Fri, Jul 11, 2025, 18:27 Adrian Klaver <adrian.klaver@aklaver.com> wrote:

[...] create an extension that incorporates the code.

That's an ideal use case for an extension indeed .

Extensions are of no use to me, unfortunately, unless built-in and
official. So if I have to wait for v19, so be it. But the ball has to
get rolling at least.

And extensions address only the missing aggregate "overloads". Not the
fact current md5() and pgcrypto.digest() are not TOAST-aware to be
"really streaming", and low-memory.

#14Merlin Moncure
mmoncure@gmail.com
In reply to: Dominique Devienne (#13)
Re: Aggregate versions of hashing functions (md5, sha1, etc...)

On Fri, Jul 11, 2025 at 10:17 AM Dominique Devienne <ddevienne@gmail.com>
wrote:

On Fri, Jul 11, 2025 at 6:05 PM Florents Tselai
<florents.tselai@gmail.com> wrote:

On Fri, Jul 11, 2025, 18:27 Adrian Klaver <adrian.klaver@aklaver.com>

wrote:

[...] create an extension that incorporates the code.

That's an ideal use case for an extension indeed .

Extensions are of no use to me, unfortunately, unless built-in and
official. So if I have to wait for v19, so be it. But the ball has to
get rolling at least.

Right -- exactly. The problem is cloud providers do not allow 3rd party
extensions. You can work around this if your extensions are available at
the SQL level, or can be built with standard extensions.

Candidly, it's going to be tough sledding to get your needs incorporated
into contrib. I'm not saying it wont happen -- let's just say holding
breath until solution is not advisable. I know exactly where you're at.

This is why I built an SQL available extension that does lz4 compression;
it's the only way to compress data locally before sending it out to AWS via
the s3 API.

Aside: This may be an unpopular position, but I think the postgres
extension system is useless for 3rd party contributions until there is some
way to introduce items in the vein of npm, pip, etc.

I think the only short term path I know of:
1. write or find a C library that does something similar to what you need
2. compile that to plv8 with emscriptem
3. Wrap with plv8 function handlers

Getting it to reasonable performance is possible if you compile WASM and
work out amortizing start up costs. However, due to how plv8 and
emscripten interact from memory standpoint, there's a lot of copying to
move memory in/out, and you will have to work under a manageable buffer
size, say 1mb. If you're curious I can take you through it.

merlin

#15Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Merlin Moncure (#14)
Re: Aggregate versions of hashing functions (md5, sha1, etc...)

On 7/11/25 13:30, Merlin Moncure wrote:

On Fri, Jul 11, 2025 at 10:17 AM Dominique Devienne <ddevienne@gmail.com

Aside: This may be an unpopular position, but I think the postgres
extension system is useless for 3rd party contributions until there is
some way to introduce items in the vein of npm, pip, etc.

Have you been following this?:

https://justatheory.com/2025/03/mini-summit-one/

merlin

--
Adrian Klaver
adrian.klaver@aklaver.com

#16Merlin Moncure
mmoncure@gmail.com
In reply to: Adrian Klaver (#15)
Re: Aggregate versions of hashing functions (md5, sha1, etc...)

On Fri, Jul 11, 2025 at 3:01 PM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

On 7/11/25 13:30, Merlin Moncure wrote:

On Fri, Jul 11, 2025 at 10:17 AM Dominique Devienne <ddevienne@gmail.com

Aside: This may be an unpopular position, but I think the postgres
extension system is useless for 3rd party contributions until there is
some way to introduce items in the vein of npm, pip, etc.

Have you been following this?:

https://justatheory.com/2025/03/mini-summit-one/

I haven't, but am in full agreement with just about everything in the
article, especially the binary packaging bits.

merlin