[PATCH] Add crc32(text) & crc32(bytea)

Started by Aleksander Alekseevover 1 year ago17 messageshackers
Jump to latest
#1Aleksander Alekseev
aleksander@timescale.com

Hi,

While answering one of the recent questions [1]/messages/by-id/CAJ7c6TOurV4uA5Yz=aJ-ae4czL_zdFNqxbu47eyVrYFefrWoog@mail.gmail.com I wanted to use
crc32(text) and discovered that it's missing out-of-the box. Of
course, one can use `substr(md5(x), 1, 8)` with almost the same effect
but it's less convenient and could be slower (I didn't do actual
benchmarks though). Also it's incompatible with third-party software
that may calculate crc32's and store the results in PostgreSQL.

I vaguely recall that I faced this problem before. Supporting crc32
was requested on the mailing list [2]/messages/by-id/auto-000557707157@umail.ru and a number of workarounds
exist in PL/pgSQL [3]https://stackoverflow.com/questions/28179335/crc32-function-with-pl-pgsql[4]https://gist.github.com/cuber/bcf0a3a96fc9a790d96d. Since there seems to be a demand and it
costs us nothing to maintain crc32() I suggest adding it.

The proposed patch exposes our internal crc32 implementation to the
user. I chose to return a hex string similarly to md5(). In my humble
experience this is most convenient in practical use. However if the
majority believes that the function should return a bigint (in order
to fit an unsigned int32) or a bytea (as SHA* functions do), I'm fine
with whatever consensus the community reaches.

[1]: /messages/by-id/CAJ7c6TOurV4uA5Yz=aJ-ae4czL_zdFNqxbu47eyVrYFefrWoog@mail.gmail.com
[2]: /messages/by-id/auto-000557707157@umail.ru
[3]: https://stackoverflow.com/questions/28179335/crc32-function-with-pl-pgsql
[4]: https://gist.github.com/cuber/bcf0a3a96fc9a790d96d

--
Best regards,
Aleksander Alekseev

Attachments:

v1-0001-Add-crc32-text-crc32-bytea.patchapplication/octet-stream; name=v1-0001-Add-crc32-text-crc32-bytea.patchDownload+175-1
#2Nathan Bossart
nathandbossart@gmail.com
In reply to: Aleksander Alekseev (#1)
Re: [PATCH] Add crc32(text) & crc32(bytea)

On Thu, Jul 18, 2024 at 02:24:23PM +0300, Aleksander Alekseev wrote:

I vaguely recall that I faced this problem before. Supporting crc32
was requested on the mailing list [2] and a number of workarounds
exist in PL/pgSQL [3][4]. Since there seems to be a demand and it
costs us nothing to maintain crc32() I suggest adding it.

This sounds generally reasonable to me, especially given the apparent
demand. Should we also introduce crc32c() while we're at it?

--
nathan

#3Aleksander Alekseev
aleksander@timescale.com
In reply to: Nathan Bossart (#2)
Re: [PATCH] Add crc32(text) & crc32(bytea)

Hi,

This sounds generally reasonable to me, especially given the apparent
demand. Should we also introduce crc32c() while we're at it?

Might be a good idea. However I didn't see a demand for crc32c() SQL
function yet. Also I'm not sure whether the best interface for it
would be crc32c() or crc32(x, version='c') or perhaps crc32(x,
polinomial=...). I propose keeping the scope small this time.

--
Best regards,
Aleksander Alekseev

#4Nathan Bossart
nathandbossart@gmail.com
In reply to: Aleksander Alekseev (#3)
Re: [PATCH] Add crc32(text) & crc32(bytea)

On Fri, Jul 26, 2024 at 12:01:40PM +0300, Aleksander Alekseev wrote:

This sounds generally reasonable to me, especially given the apparent
demand. Should we also introduce crc32c() while we're at it?

Might be a good idea. However I didn't see a demand for crc32c() SQL
function yet. Also I'm not sure whether the best interface for it
would be crc32c() or crc32(x, version='c') or perhaps crc32(x,
polinomial=...). I propose keeping the scope small this time.

I don't think adding crc32c() would sufficiently increase the scope. We'd
use the existing implementations for both crc32() and crc32c(). And
besides, this could be useful for adding tests for that code.

+ <function>crc32</function> ( <type>text</type> )

Do we need a version of the function that takes a text input? It's easy
enough to cast to a bytea.

+ <returnvalue>text</returnvalue>

My first reaction is that we should just have this return bytea like the
SHA ones do, if for no other reason than commit 10cfce3 seems intended to
move us away from returning text for these kinds of functions. Upthread,
you mentioned the possibility of returning a bigint, too. I think I'd
still prefer bytea in case we want to add, say, crc64() or crc16() in the
future. That would allow us to keep all of these functions consistent
instead of returning different types for each. However, I understand that
returning the numeric types might be more convenient. I'm curious what
others think about this.

+        Computes the CRC32 <link linkend="functions-hash-note">hash</link> of
+        the binary string, with the result written in hexadecimal.

I'm not sure we should call the check values "hashes." Wikipedia does
include them in the "List of hash functions" page [0]https://en.wikipedia.org/wiki/List_of_hash_functions, but it seems to
deliberately avoid calling them hashes in the CRC page [1]https://en.wikipedia.org/wiki/Cyclic_redundancy_check. I'd suggest
calling them "CRC32 values" instead.

[0]: https://en.wikipedia.org/wiki/List_of_hash_functions
[1]: https://en.wikipedia.org/wiki/Cyclic_redundancy_check

--
nathan

#5Aleksander Alekseev
aleksander@timescale.com
In reply to: Nathan Bossart (#4)
Re: [PATCH] Add crc32(text) & crc32(bytea)

Hi Nathan,

I don't think adding crc32c() would sufficiently increase the scope. We'd
use the existing implementations for both crc32() and crc32c(). And
besides, this could be useful for adding tests for that code.

+ <function>crc32</function> ( <type>text</type> )

Do we need a version of the function that takes a text input? It's easy
enough to cast to a bytea.

+ <returnvalue>text</returnvalue>

My first reaction is that we should just have this return bytea like the
SHA ones do, if for no other reason than commit 10cfce3 seems intended to
move us away from returning text for these kinds of functions. Upthread,
you mentioned the possibility of returning a bigint, too. I think I'd
still prefer bytea in case we want to add, say, crc64() or crc16() in the
future. That would allow us to keep all of these functions consistent
instead of returning different types for each. However, I understand that
returning the numeric types might be more convenient. I'm curious what
others think about this.

+        Computes the CRC32 <link linkend="functions-hash-note">hash</link> of
+        the binary string, with the result written in hexadecimal.

I'm not sure we should call the check values "hashes." Wikipedia does
include them in the "List of hash functions" page [0], but it seems to
deliberately avoid calling them hashes in the CRC page [1]. I'd suggest
calling them "CRC32 values" instead.

Thanks for the code review. Here is the updated patch.

--
Best regards,
Aleksander Alekseev

Attachments:

v2-0001-Add-crc32-bytea-crc32c-bytea.patchapplication/octet-stream; name=v2-0001-Add-crc32-bytea-crc32c-bytea.patchDownload+182-1
#6Nathan Bossart
nathandbossart@gmail.com
In reply to: Aleksander Alekseev (#5)
Re: [PATCH] Add crc32(text) & crc32(bytea)
+/*
+ * Calculate CRC32 of the given data.
+ */
+static inline pg_crc32
+crc32_sz(const char *buf, int size)
+{
+	pg_crc32	crc;
+	const char *p = buf;
+
+	INIT_TRADITIONAL_CRC32(crc);
+	while (size > 0)
+	{
+		char		c = (char) (*p);
+
+		COMP_TRADITIONAL_CRC32(crc, &c, 1);
+		size--;
+		p++;
+	}
+	FIN_TRADITIONAL_CRC32(crc);
+	return crc;
+}

I'm curious why we need to do this instead of only using the macros:

INIT_TRADITIONAL_CRC32(crc);
COMP_TRADITIONAL_CRC32(crc, VARDATA_ANY(in), len);
FIN_TRADITIONAL_CRC32(crc);

+ * IDENTIFICATION
+ * src/backend/utils/adt/hashfuncs.c

Perhaps these would fit into src/backend/utils/hash/pg_crc.c?

--
nathan

#7Aleksander Alekseev
aleksander@timescale.com
In reply to: Nathan Bossart (#6)
Re: [PATCH] Add crc32(text) & crc32(bytea)

Hi,

I'm curious why we need to do this instead of only using the macros:

INIT_TRADITIONAL_CRC32(crc);
COMP_TRADITIONAL_CRC32(crc, VARDATA_ANY(in), len);
FIN_TRADITIONAL_CRC32(crc);

+ * IDENTIFICATION
+ * src/backend/utils/adt/hashfuncs.c

Perhaps these would fit into src/backend/utils/hash/pg_crc.c?

Thanks, PFA patch v3.

--
Best regards,
Aleksander Alekseev

Attachments:

v3-0001-Add-crc32-bytea-crc32c-bytea.patchapplication/octet-stream; name=v3-0001-Add-crc32-bytea-crc32c-bytea.patchDownload+145-1
#8Nathan Bossart
nathandbossart@gmail.com
In reply to: Aleksander Alekseev (#7)
Re: [PATCH] Add crc32(text) & crc32(bytea)

On Mon, Aug 05, 2024 at 04:19:45PM +0300, Aleksander Alekseev wrote:

Thanks, PFA patch v3.

This looks pretty good to me. The only point that I think deserves more
discussion is the return type. Does bytea make the most sense here? Or
should we consider int/bigint?

--
nathan

#9Aleksander Alekseev
aleksander@timescale.com
In reply to: Nathan Bossart (#8)
Re: [PATCH] Add crc32(text) & crc32(bytea)

Hi,

This looks pretty good to me. The only point that I think deserves more
discussion is the return type. Does bytea make the most sense here? Or
should we consider int/bigint?

Personally I would choose BYTEA in order to be consistent with sha*() functions.

It can be casted to TEXT if user wants a result similar to the one
md5() returns:

```
SELECT encode(crc32('PostgreSQL'), 'hex');
```

... and somewhat less convenient to BIGINT:

```
SELECT ((get_byte(crc, 0) :: bigint << 24) | (get_byte(crc, 1) << 16)
| (get_byte(crc, 2) << 8) | get_byte(crc, 3))
FROM (SELECT crc32('PostgreSQL') AS crc);
```

I don't like the `integer` option because crc32 value is typically
considered as an unsigned one and `integer` is not large enough to
represent uint32.

Perhaps we need get_int4() / get_int8() / get_numeric() as there seems
to be a demand [1]https://stackoverflow.com/questions/32944267/postgresql-converting-bytea-to-bigint[2]/messages/by-id/AANLkTikip9xs8iXc8e+Mgz1T1701i8Xk6QtbVB3KJQzX@mail.gmail.com and it will allow us to easily cast a `bytea`
value to `integer` or `bigint`. This is probably another topic though.

[1]: https://stackoverflow.com/questions/32944267/postgresql-converting-bytea-to-bigint
[2]: /messages/by-id/AANLkTikip9xs8iXc8e+Mgz1T1701i8Xk6QtbVB3KJQzX@mail.gmail.com

--
Best regards,
Aleksander Alekseev

#10Nathan Bossart
nathandbossart@gmail.com
In reply to: Aleksander Alekseev (#9)
Re: [PATCH] Add crc32(text) & crc32(bytea)

On Tue, Aug 06, 2024 at 11:04:41AM +0300, Aleksander Alekseev wrote:

Perhaps we need get_int4() / get_int8() / get_numeric() as there seems
to be a demand [1][2] and it will allow us to easily cast a `bytea`
value to `integer` or `bigint`. This is probably another topic though.

Yeah, I was surprised to learn there wasn't yet an easy way to do this.
I'm not sure how much of a factor this should play in deciding the return
value for the CRC functions, but IMHO it's a reason to reconsider returning
text as you originally proposed.

--
nathan

#11Aleksander Alekseev
aleksander@timescale.com
In reply to: Nathan Bossart (#10)
Re: [PATCH] Add crc32(text) & crc32(bytea)

Hi,

Yeah, I was surprised to learn there wasn't yet an easy way to do this.
I'm not sure how much of a factor this should play in deciding the return
value for the CRC functions, but IMHO it's a reason to reconsider returning
text as you originally proposed.

OK, here is the corrected patch v4.

--
Best regards,
Aleksander Alekseev

Attachments:

v4-0001-Add-crc32-bytea-crc32c-bytea.patchapplication/octet-stream; name=v4-0001-Add-crc32-bytea-crc32c-bytea.patchDownload+136-1
#12Peter Eisentraut
peter_e@gmx.net
In reply to: Nathan Bossart (#8)
Re: [PATCH] Add crc32(text) & crc32(bytea)

On 05.08.24 17:28, Nathan Bossart wrote:

This looks pretty good to me. The only point that I think deserves more
discussion is the return type. Does bytea make the most sense here? Or
should we consider int/bigint?

The correct return type of a CRC operation in general is some kind of
exact numerical type. Just pick the best one that fits the result. I
don't think bytea is appropriate.

#13Nathan Bossart
nathandbossart@gmail.com
In reply to: Peter Eisentraut (#12)
Re: [PATCH] Add crc32(text) & crc32(bytea)

On Thu, Aug 08, 2024 at 04:27:20PM +0200, Peter Eisentraut wrote:

On 05.08.24 17:28, Nathan Bossart wrote:

This looks pretty good to me. The only point that I think deserves more
discussion is the return type. Does bytea make the most sense here? Or
should we consider int/bigint?

The correct return type of a CRC operation in general is some kind of exact
numerical type. Just pick the best one that fits the result. I don't think
bytea is appropriate.

That would leave us either "integer" or "bigint". "integer" is more
correct from a size perspective, but will result in negative values because
it is signed. "bigint" uses twice as many bytes but won't display any CRC
values as negative.

I guess we could also choose "numeric", which would set a more sustainable
precedent if we added functions for CRC-64...

--
nathan

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#13)
Re: [PATCH] Add crc32(text) & crc32(bytea)

Nathan Bossart <nathandbossart@gmail.com> writes:

On Thu, Aug 08, 2024 at 04:27:20PM +0200, Peter Eisentraut wrote:

The correct return type of a CRC operation in general is some kind of exact
numerical type. Just pick the best one that fits the result. I don't think
bytea is appropriate.

That would leave us either "integer" or "bigint". "integer" is more
correct from a size perspective, but will result in negative values because
it is signed. "bigint" uses twice as many bytes but won't display any CRC
values as negative.

bigint seems fine to me; we have used that in other places as a
substitute for uint32, eg block numbers in contrib/pageinspect.

regards, tom lane

#15Nathan Bossart
nathandbossart@gmail.com
In reply to: Tom Lane (#14)
Re: [PATCH] Add crc32(text) & crc32(bytea)

On Thu, Aug 08, 2024 at 10:49:42AM -0400, Tom Lane wrote:

Nathan Bossart <nathandbossart@gmail.com> writes:

On Thu, Aug 08, 2024 at 04:27:20PM +0200, Peter Eisentraut wrote:

The correct return type of a CRC operation in general is some kind of exact
numerical type. Just pick the best one that fits the result. I don't think
bytea is appropriate.

That would leave us either "integer" or "bigint". "integer" is more
correct from a size perspective, but will result in negative values because
it is signed. "bigint" uses twice as many bytes but won't display any CRC
values as negative.

bigint seems fine to me; we have used that in other places as a
substitute for uint32, eg block numbers in contrib/pageinspect.

WFM. Here is what I have staged for commit.

--
nathan

Attachments:

v5-0001-Add-user-callable-CRC-functions.patchtext/plain; charset=us-asciiDownload+114-2
#16Aleksander Alekseev
aleksander@timescale.com
In reply to: Nathan Bossart (#15)
Re: [PATCH] Add crc32(text) & crc32(bytea)

Hi,

WFM. Here is what I have staged for commit.

Patch v5 LGTM.

--
Best regards,
Aleksander Alekseev

#17Nathan Bossart
nathandbossart@gmail.com
In reply to: Aleksander Alekseev (#16)
Re: [PATCH] Add crc32(text) & crc32(bytea)

On Mon, Aug 12, 2024 at 04:13:02PM +0300, Aleksander Alekseev wrote:

Patch v5 LGTM.

Committed.

--
nathan