Refactor code around GUC default_toast_compression
Hi all,
While hacking on the TOAST code, I have been annoyed more than once
with the following piece in toast_compression.h:
/*
* Built-in compression method ID. The toast compression header will store
* this in the first 2 bits of the raw length. These built-in compression
* method IDs are directly mapped to the built-in compression methods.
*
* Don't use these values for anything other than understanding the meaning
* of the raw bits from a varlena; in particular, if the goal is to identify
* a compression method, use the constants TOAST_PGLZ_COMPRESSION, etc.
* below. We might someday support more than 4 compression methods, but
* we can never have more than 4 values in this enum, because there are
* only 2 bits available in the places where this is stored.
*/
typedef enum ToastCompressionId
{
TOAST_PGLZ_COMPRESSION_ID = 0,
TOAST_LZ4_COMPRESSION_ID = 1,
TOAST_INVALID_COMPRESSION_ID = 2,
} ToastCompressionId;
This is due the fact that we have only two bits that can be used in
va_tcinfo or va_extinfo. While looking at the addition of a new
compression method, this was causing a mess, so I have hacked the
attached patch, that makes the addition of more compression methods
easier. The idea is centralized in toast_compression.c, with the
addition of a registry that knows about all the TOAST compression
methods and its meta-data:
- name
- GUC enum values.
- attcompression char value.
- varatt on-disk value.
This is coupled with a set of translation routines, used in other code
paths. This has also the merit to remove TOAST_INVALID_COMPRESSION_ID
from the list of GUC values, which did not really make sense to begin
with. I don't deny that the addition of a new compression method
would require more tweaks, particularly for the decompression part,
but I think that this is a nice cleanup anyway. This is added to the
next commit fest, to be considered for v20.
Thanks,
--
Michael
Attachments:
0001-Refactor-some-code-logic-around-GUC-default_toast_co.patchtext/plain; charset=us-asciiDownload+179-85
Hi,
On Fri, 1 May 2026 at 13:21, Michael Paquier <michael@paquier.xyz> wrote:
Hi all,
While hacking on the TOAST code, I have been annoyed more than once
with the following piece in toast_compression.h:
/*
* Built-in compression method ID. The toast compression header will store
* this in the first 2 bits of the raw length. These built-in compression
* method IDs are directly mapped to the built-in compression methods.
*
* Don't use these values for anything other than understanding the meaning
* of the raw bits from a varlena; in particular, if the goal is to
identify
* a compression method, use the constants TOAST_PGLZ_COMPRESSION, etc.
* below. We might someday support more than 4 compression methods, but
* we can never have more than 4 values in this enum, because there are
* only 2 bits available in the places where this is stored.
*/
typedef enum ToastCompressionId
{
TOAST_PGLZ_COMPRESSION_ID = 0,
TOAST_LZ4_COMPRESSION_ID = 1,
TOAST_INVALID_COMPRESSION_ID = 2,
} ToastCompressionId;This is due the fact that we have only two bits that can be used in
va_tcinfo or va_extinfo. While looking at the addition of a new
compression method, this was causing a mess, so I have hacked the
attached patch, that makes the addition of more compression methods
easier. The idea is centralized in toast_compression.c, with the
addition of a registry that knows about all the TOAST compression
methods and its meta-data:
- name
- GUC enum values.
- attcompression char value.
- varatt on-disk value.
I looked at the patch, and the refactoring direction looks reasonable
to me. I noticed a few small things worth cleaning up (although
it is for v20, just wanted to drop it here for future)
1. The patch includes an unrelated hunk in
doc/src/sgml/ref/alter_index.sgml, adding text about `ALTER INDEX ...
ATTACH PARTITION`. That looks like an accidental carry-over from another
patch and shouldn't be there ig.
2. The comment in src/include/access/toast_compression.h describing
default_toast_compression looks stale after this change? It still says
that the GUC value is one of the char values stored in
pg_attribute.attcompression, but the patch changes it to use the new
ToastCompressionGucValue enum values instead.(Maybe I'm
missing something)
3. One minor point: CompressionIdToMethod() seems to be added as a public
helper, but I could not find any callers in this patch. Also,
pg_column_compression() still keeps its own cmid-to-name switch. If the
intent is to centralize these mappings in the registry, perhaps that code
could use the new helper path as well, otherwise the unused helper may
not
be necessary yet (though it might be in future).
Thanks for the patch!
Regards,
Ayush
On Fri, May 01, 2026 at 02:44:07PM +0530, Ayush Tiwari wrote:
1. The patch includes an unrelated hunk in
doc/src/sgml/ref/alter_index.sgml, adding text about `ALTER INDEX ...
ATTACH PARTITION`. That looks like an accidental carry-over from another
patch and shouldn't be there ig.
Sorry about that. That feels like a rebase fart.
2. The comment in src/include/access/toast_compression.h describing
default_toast_compression looks stale after this change? It still says
that the GUC value is one of the char values stored in
pg_attribute.attcompression, but the patch changes it to use the new
ToastCompressionGucValue enum values instead.(Maybe I'm
missing something)
Nope, you are missing nothing. I was re-reading the patch and I think
that we could just remove the whole paragraph. Even by doing so we
lose no information.
3. One minor point: CompressionIdToMethod() seems to be added as a public
helper, but I could not find any callers in this patch.
Oops, removed. I may have used it at some point.
Also,
pg_column_compression() still keeps its own cmid-to-name switch. If the
intent is to centralize these mappings in the registry, perhaps that code
could use the new helper path as well, otherwise the unused helper may
not be necessary yet (though it might be in future).
Yes, this is part of the extra tweaks that would be needed when added
a new compression method. This part looks at a varlena pointer,
retrieves the on-disk ID. So this is left as-is on purpose, like the
direct TOAST decompress business based on varlena pointers.
Attached is a v2, to keep the CI happy for as long as we can use it.
--
Michael
Attachments:
v2-0001-Refactor-some-code-logic-around-GUC-default_toast.patchtext/plain; charset=us-asciiDownload+151-87
On May 2, 2026, at 06:43, Michael Paquier <michael@paquier.xyz> wrote:
On Fri, May 01, 2026 at 02:44:07PM +0530, Ayush Tiwari wrote:
1. The patch includes an unrelated hunk in
doc/src/sgml/ref/alter_index.sgml, adding text about `ALTER INDEX ...
ATTACH PARTITION`. That looks like an accidental carry-over from another
patch and shouldn't be there ig.Sorry about that. That feels like a rebase fart.
2. The comment in src/include/access/toast_compression.h describing
default_toast_compression looks stale after this change? It still says
that the GUC value is one of the char values stored in
pg_attribute.attcompression, but the patch changes it to use the new
ToastCompressionGucValue enum values instead.(Maybe I'm
missing something)Nope, you are missing nothing. I was re-reading the patch and I think
that we could just remove the whole paragraph. Even by doing so we
lose no information.3. One minor point: CompressionIdToMethod() seems to be added as a public
helper, but I could not find any callers in this patch.Oops, removed. I may have used it at some point.
Also,
pg_column_compression() still keeps its own cmid-to-name switch. If the
intent is to centralize these mappings in the registry, perhaps that code
could use the new helper path as well, otherwise the unused helper may
not be necessary yet (though it might be in future).Yes, this is part of the extra tweaks that would be needed when added
a new compression method. This part looks at a varlena pointer,
retrieves the on-disk ID. So this is left as-is on purpose, like the
direct TOAST decompress business based on varlena pointers.Attached is a v2, to keep the CI happy for as long as we can use it.
--
Michael
<v2-0001-Refactor-some-code-logic-around-GUC-default_toast.patch>
Overall looks good. A few small comments:
1
```
/*
* GUC support.
- *
- * default_toast_compression is an integer for purposes of the GUC machinery,
- * but the value is one of the char values defined below, as they appear in
- * pg_attribute.attcompression, e.g. TOAST_PGLZ_COMPRESSION.
*/
extern PGDLLIMPORT int default_toast_compression;
```
Before this patch, default_toast_compression stores 'p'/'l'. With this patch, it is changed to store 0/1. Would it be better to rename this variable?
Otherwise, a third-party extension that relies on this variable could silently misbehave. I understand that a major release is allowed to change API/ABI contracts, but a build failure would be better than silent misbehavior. Or at least we should document this change somewhere.
2
```
#ifdef USE_LZ4
-#define DEFAULT_TOAST_COMPRESSION TOAST_LZ4_COMPRESSION
+#define DEFAULT_TOAST_COMPRESSION TOAST_LZ4_COMPRESSION_GUC
#else
-#define DEFAULT_TOAST_COMPRESSION TOAST_PGLZ_COMPRESSION
+#define DEFAULT_TOAST_COMPRESSION TOAST_PGLZ_COMPRESSION_GUC
#endif
```
Would it better to also rename DEFAULT_TOAST_COMPRESSION to DEFAULT_TOAST_COMPRESSION_GUC.
3
```
+#define TOAST_COMPRESS_PGLZ 0
+#define TOAST_COMPRESS_LZ4 1
+#define TOAST_COMPRESS_INVALID 2
```
Now TOAST_COMPRESS_PGLZ is 0, and TOAST_PGLZ_COMPRESSION is ‘p’. When they appear together in the code, it’s hard to guess which is 0 and which is ‘p’. So, would it better to rename TOAST_COMPRESS_PGLZ to TOAST_PGLZ_COMPRESS_ID, and rename TOAST_PGLZ_COMPRESSION to TOAST_PGLZ_COMPRESS_METHOD?
4
```
/*
- * Call appropriate compression routine for the compression method.
+ * Translate the compression method char to the on-disk compression ID
+ * via the Method Registry, then dispatch to the appropriate compression
+ * routine.
*/
+ cmid = MethodToCompressionId(cmethod);
switch (cmethod)
{
case TOAST_PGLZ_COMPRESSION:
tmp = pglz_compress_datum((const varlena *) DatumGetPointer(value));
- cmid = TOAST_PGLZ_COMPRESSION_ID;
break;
case TOAST_LZ4_COMPRESSION:
tmp = lz4_compress_datum((const varlena *) DatumGetPointer(value));
- cmid = TOAST_LZ4_COMPRESSION_ID;
break;
default:
elog(ERROR, "invalid compression method %c", cmethod);
```
As the switch/default explicitly rejects invalid cmethod, I feel slightly better for readability to place "cmid = MethodToCompressionId(cmethod);" after the switch clause.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Sat, May 02, 2026 at 09:55:30AM +0800, Chao Li wrote:
Otherwise, a third-party extension that relies on this variable
could silently misbehave. I understand that a major release is
allowed to change API/ABI contracts, but a build failure would be
better than silent misbehavior. Or at least we should document this
change somewhere.Would it better to also rename DEFAULT_TOAST_COMPRESSION to DEFAULT_TOAST_COMPRESSION_GUC.
After pondering about this point, I think that you are touching
something sensible here, but not for the reason you mention: the _GUC
bits serve no actual purpose and we can keep using attcompression in
the GUC.
3 ``` +#define TOAST_COMPRESS_PGLZ 0 +#define TOAST_COMPRESS_LZ4 1 +#define TOAST_COMPRESS_INVALID 2 ```Now TOAST_COMPRESS_PGLZ is 0, and TOAST_PGLZ_COMPRESSION is
‘p’. When they appear together in the code, it’s hard to guess which
is 0 and which is ‘p’. So, would it better to rename
TOAST_COMPRESS_PGLZ to TOAST_PGLZ_COMPRESS_ID, and rename
TOAST_PGLZ_COMPRESSION to TOAST_PGLZ_COMPRESS_METHOD?
Here as well, I can get some of the confusion. We can just reuse the
same names, with _ID instead.
As the switch/default explicitly rejects invalid cmethod, I feel
slightly better for readability to place "cmid =
MethodToCompressionId(cmethod);" after the switch clause.
WFM.
At the end I have the updated version attached, which still does the
job I want it to do, just simpler.
One extra thing to keep in mind is that we may want to make
CompressionIdIsValid() smarter in the future, especially across
multiple vartag_external or varlena types if the same ID values are
shared across multiple compression methods, but would be simpler after
this patch with all this knowledge kept local to toast_compression.c.
Something similar could be said about toast_compress_datum() at some
point, once/if we get there. Another argument would be to just switch
ToastCompressionId to a uint32 and move the numbers to varatt.h, but
I'd like to be more ambitious. This patch is just my take on the
matter.
What do you think?
--
Michael