Can we get sha* function over text, that could be used in index?

Started by hubert depesz lubaczewskiabout 2 months ago8 messagesgeneral
Jump to latest

Hi,
So, we have various sha* functions.

And I recently got asked about using them as a based for unique index
over long texts.
Normally one would do it with md5(text), but the person asking wanted to
use sha(). and these functions work only on bytea.

And apparently - we can't.

'text-value'::bytea won't work for some specific text values.
convert_to() isn't immutable.

I figured out that I can do something like:

SELECT
sha256(
string_agg( ascii( t )::text, ',' ORDER BY idx )::bytea
)
FROM
regexp_split_to_table( 'INPUT_STRING', '' ) WITH ORDINALITY AS x ( t, idx );

But that's hardly sane solution.

I've read bug report from 2008:
/messages/by-id/48D20645.1090503@gmx.net

And while I kinda undestand, create-conversion, server-encoding, I don't
really *grok* why we can't have immutable conversion to bytea. And/or
versions of sha* functions that simply work on text.

Is it doable? How does it work in md5()? Apparently it does also work in
pgcrypto/digest(), so there should be a way to get it in core sha*
functions?

Best regards,

depesz

#2Ron
ronljohnsonjr@gmail.com
In reply to: hubert depesz lubaczewski (#1)
Re: Can we get sha* function over text, that could be used in index?

On Wed, Feb 18, 2026 at 9:58 AM hubert depesz lubaczewski <depesz@depesz.com>
wrote:

Hi,
So, we have various sha* functions.

And I recently got asked about using them as a based for unique index
over long texts.
Normally one would do it with md5(text), but the person asking wanted to
use sha().

I think I'd push back, asking them if they really need
cryptographically-secure hashing (which they most probably don't).

--
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!

#3Linus Heckemann
linus@schreibt.jetzt
In reply to: Ron (#2)
Re: Can we get sha* function over text, that could be used in index?

Hi, I'm the one who asked :)

Ron Johnson wrote:

On Wed, Feb 18, 2026 at 9:58 AM hubert depesz lubaczewski <depesz(at)depesz(dot)com>
wrote:

Hi,
So, we have various sha* functions.

And I recently got asked about using them as a based for unique index
over long texts.
Normally one would do it with md5(text), but the person asking wanted to
use sha().

I think I'd push back, asking them if they really need
cryptographically-secure hashing (which they most probably don't).

I indeed don't need cryptographically-secure hashing in this
scenario, and I will be using md5.

Nevertheless, I think using md5 in any context that isn't "to verify
digests produced in the past" is a poor decision, and postgres should
make using other hashes just as easy!

The specific context of hashing aside, it seems weird to me that:

- there is a byte-array representation of text columns, which appears to
be independent of database encoding
- there doesn't seem to be _any_ sane way to access this.

The obvious (to a naive user, like I was) approach, casting to bytea,
has exceptionally surprising behaviour: for many text strings, it does
exactly what the naive user might hope for, giving back the UTF-8
representation. But multiple distinct text strings, like '\033' and
'\x1b', convert to the same byte string! And text strings containing a
backslash that doesn't fit the bytea hex format or the bytea escape
format will fail to convert completely!

The fact that convert_to() is only stable and not immutable makes sense
to me given the effect that configuration can have on its behaviour, but
given that there does appear to be a trivially-accessible UTF-8
representation (as used by md5()) I think there should be an immutable
function that provides access to it? Is there a good reason not to? I'd
be willing to send a patch for it myself.

Linus

#4Peter Eisentraut
peter_e@gmx.net
In reply to: hubert depesz lubaczewski (#1)
Re: Can we get sha* function over text, that could be used in index?

On 18.02.26 15:58, hubert depesz lubaczewski wrote:

And while I kinda undestand, create-conversion, server-encoding, I don't
really*grok* why we can't have immutable conversion to bytea. And/or
versions of sha* functions that simply work on text.

Hash functions fundamentally work on a sequence of bytes, so bytea is
the right type. The encoding of text into bytes is complicated, so it
seems better if you handle that yourself depending on the local
requirements.

#5Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: Peter Eisentraut (#4)
Re: Can we get sha* function over text, that could be used in index?

On 2026-02-19 15:48:33 +0100, Peter Eisentraut wrote:

On 18.02.26 15:58, hubert depesz lubaczewski wrote:

And while I kinda undestand, create-conversion, server-encoding, I don't
really*grok* why we can't have immutable conversion to bytea. And/or
versions of sha* functions that simply work on text.

Hash functions fundamentally work on a sequence of bytes, so bytea is the
right type. The encoding of text into bytes is complicated,

Maybe, but it needs to be done anyway, sicne text is ultimately stored
as a sequence of bytes on disk and sent as a sequence of bytes over the
wire. So the code should be present already.

Something like
encode(s text, enc text) -> bytea
Encodes s in encoding enc. E.g. encode('Tröt!', 'utf-8')
produces \x5472c3b67421
decode(d bytea, enc text) -> text
Decodes d assuming encoding enc. E.g.
decode('\x5472c3b67421'::bytea, 'utf-8') produces 'Tröt!'.
might be generally useful.

so it seems better if you handle that yourself depending on the local
requirements.

I would probably do that kind of processing in the application code,
but I can see that one might want it in the database.

hjp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"

#6Daniel Verite
daniel@manitou-mail.org
In reply to: Linus Heckemann (#3)
Re: Can we get sha* function over text, that could be used in index?

Linus Heckemann wrote:

- there is a byte-array representation of text columns, which appears to
be independent of database encoding

Not sure what you're refering to. Both the on-disk and in-memory
representations of text/varchar are encoding-dependent.

The obvious (to a naive user, like I was) approach, casting to bytea,
has exceptionally surprising behaviour: for many text strings, it does
exactly what the naive user might hope for, giving back the UTF-8
representation. But multiple distinct text strings, like '\033' and
'\x1b', convert to the same byte string! And text strings containing a
backslash that doesn't fit the bytea hex format or the bytea escape
format will fail to convert completely!

Yes. It seems a common mistake to forget or ignore that
backslashes are special in the input text representation of bytea.
It might be not obvious from reading the doc at [1]https://www.postgresql.org/docs/current/datatype-binary.html#DATATYPE-BINARY-BYTEA-ESCAPE-FORMAT
but we just need to quote backslashes by doubling them.

AFAIK a working solution for the OP would be:
sha256(replace(colname, '\', '\\')::bytea)

The result is encoding-dependent, but that does not matter
in the context of an expression. index.
If the database ever needs to change its encoding, it will have to
be recreated entirely anyway.

[1]: https://www.postgresql.org/docs/current/datatype-binary.html#DATATYPE-BINARY-BYTEA-ESCAPE-FORMAT
https://www.postgresql.org/docs/current/datatype-binary.html#DATATYPE-BINARY-BYTEA-ESCAPE-FORMAT

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/

#7Linus Heckemann
linus@schreibt.jetzt
In reply to: Daniel Verite (#6)
Re: Can we get sha* function over text, that could be used in index?

"Daniel Verite" <daniel@manitou-mail.org> writes:

- there is a byte-array representation of text columns, which appears to
be independent of database encoding

Not sure what you're refering to. Both the on-disk and in-memory
representations of text/varchar are encoding-dependent.

Huh, I thought I'd come to the opposite conclusion from testing the
other day (md5('ÿ') giving the same results for both a utf-8 database
and a latin1 database), but I must have been holding something wrong,
because trying again does confirm this. Sorry about that!

The obvious (to a naive user, like I was) approach, casting to bytea,
has exceptionally surprising behaviour: for many text strings, it does
exactly what the naive user might hope for, giving back the UTF-8
representation. But multiple distinct text strings, like '\033' and
'\x1b', convert to the same byte string! And text strings containing a
backslash that doesn't fit the bytea hex format or the bytea escape
format will fail to convert completely!

Yes. It seems a common mistake to forget or ignore that
backslashes are special in the input text representation of bytea.
It might be not obvious from reading the doc at [1]
but we just need to quote backslashes by doubling them.

AFAIK a working solution for the OP would be:
sha256(replace(colname, '\', '\\')::bytea)

That's so simple that it feels like it should have been
obvious... Thanks! Nevertheless, it feels like a weird and potentially
computationally expensive contortion for getting from a text value to
its bytea representation.

Given their encoding-dependence, shouldn't md5() (and ::bytea, for that
matter) be marked only STABLE, not IMMUTABLE? Or is IMMUTABLE's

guaranteed to return the same results given the same arguments forever

only valid within the context of a single database?

md5() and ::bytea aside, it seems like it would be good to have both

- a function for accessing the bytea representation of a text value
without any extra steps, and

- an IMMUTABLE function for getting a UTF-8 (specifically UTF-8,
because it's the only encoding with a character set encompassing all
other supported character sets) bytea representation of a text value
regardless of the database encoding?

Cheers
Linus

#8Peter J. Holzer
hjp-pgsql@hjp.at
In reply to: Peter J. Holzer (#5)
Re: Can we get sha* function over text, that could be used in index?

On 2026-02-19 23:30:17 +0100, Peter J. Holzer wrote:

Something like
encode(s text, enc text) -> bytea
decode(d bytea, enc text) -> text
might be generally useful.

Please ignore this email. These are basically the same as convert_to()
and convert_from() which already exist.

hjp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"