Support for 8-byte TOAST values (aka the TOAST infinite loop problem)
Hi all,
I have been looking at $subject and the many past reviews recently,
also related to some of the work related to the potential support for
zstandard compression in TOAST values, and found myself pondering
about the following message from Tom, to be reminded that nothing has
been done regarding the fact that the backend may finish in an
infinite loop once a TOAST table reaches 4 billion values:
/messages/by-id/764273.1669674269@sss.pgh.pa.us
Spoiler: I have heard of users that are in this case, and the best
thing we can do currently except raising shoulders is to use
workarounds with data externalization AFAIK, which is not nice,
usually, and users notice the problem once they see some backends
stuck in the infinite loop. I have spent some time looking at the
problem, and looked at all the proposals in this area like these ones
(I hope so at least):
https://commitfest.postgresql.org/patch/4296/
/messages/by-id/224711f9-83b7-a307-b17f-4457ab73aa0a@sigaev.ru
Anyway, it seems like nothing goes in a direction that I think would
be suited to fix the two following problems (some of the proposed
patches broke backward-compatibility, as well, and that's critical):
- The limit of TOAST values to 4 billions, because external TOAST
pointers want OIDs.
- New compression methods, see the recent proposal about zstandard at
[1]: /messages/by-id/CAFAfj_HX84EK4hyRYw50AOHOcdVi-+FFwAAPo7JHx4aShCvunQ@mail.gmail.com -- Michael
extinfo field of varatt_external has only this much data remaining.
Spoiler: I want to propose a new varatt_external dedicated to
zstandard-compressed external pointers, but that's not for this
thread.
Please find attached a patch set I have finished with while poking at
the problem, to address points 1) and 2) in the first email mentioned
at the top of this message. It is not yet ready for prime day yet
(there are a couple of things that would need adjustments), but I have
reached the point where I am going to need a consensus about what
people would be OK to have in terms of design to be able to support
multiple types of varatt_external to address these issues. And I'm OK
to consume time on that for the v19 cycle.
While hacking at (playing with?) the whole toasting and detoasting
code to understand the blast radius that this would involve, I have
quickly found that it is very annoying to have to touch at many places
of varatt.h to make variants of the existing varatt_external structure
(what we store on-disk as varlena Datum for external TOAST pointers).
Spoiler: it's my first time touching the internals of this code area
so deeply. Most of the code churns happen because we need to make the
[de]toast code aware of what to do depending on the vartags of the
external varlenas. It would be simple to hardcode a bunch of new
VARATT_IS_EXTERNAL_ONDISK() variants to plug in the new structures.
While it is efficient, this has a cost for out-of-core code and in
core because all the places that touch external TOAST pointers need to
be adjusted. Point is: it can be done. But if we introduce more
types of external TOAST pointers we need to always patch all these
areas, and there's a cost in that each time one or more new vartags
are added.
So, I have invented a new interface aimed at manipulating on-disk
external TOAST pointers, called toast_external, that is an on-memory
structure that services as an interface between on-disk external TOAST
pointers and what the backend wants to look at when retrieving chunks
of data from the TOAST relations. That's the main proposal of this
patch set, with a structure looking like that:
typedef struct toast_external_data
{
/* Original data size (includes header) */
int32 rawsize;
/* External saved size (without header) */
uint32 extsize;
/* compression method */
ToastCompressionId compression_method;
/* Relation OID of TOAST table containing the value */
Oid toastrelid;
/*
* Unique ID of value within TOAST table. This could be an OID or an
* int8 value. This field is large enough to be able to store any of
* them.
*/
uint64 value;
} toast_external_data;
This is a bit similar to what the code does for R/W and R/O vartags,
only applying to the on-disk external pointers. Then, the [de]toast
code and extension code is updated so as varlenas are changed into
this structure if we need to retrieve some of its data, and these
areas of the code do not need to know about the details of the
external TOAST pointers. When saving an external set of chunks, this
structure is filled with information depending on what
toast_save_datum() deals with, be it a short varlena, a non-compressed
external value, or a compressed external value, then builds a varlena
with the vartag we want.
External TOAST pointers have three properties that are hardcoded in
the tree, bringing some challenges of their own:
- The maximum size of a chunk, TOAST_MAX_CHUNK_SIZE, tweaked at close
to 2k to make 4 chunks fit on a page. This depends on the size of the
external pointer. This one was actually easy to refactor.
- The varlena header size, based on VARTAG_SIZE(), which is kind of
tricky to refactor out in the new toast_external.c, but that seems OK
even if this knowledge stays in varatt.h.
- The toast pointer size, aka TOAST_POINTER_SIZE. This one is
actually very interesting (tricky): we use it in one place,
toast_tuple_find_biggest_attribute(), as a lower bound to decide if an
attribute should be toastable or not. I've refactored the code to use
a "best" guess depending on the value type in the TOAST relation, but
that's not 100% waterproof. That needs more thoughts.
Anyway, the patch set is able to demonstrate how much needs to be done
in the tree to support multiple chunk_id types, and the CI is happy
with the attached. Some of the things done:
- Introduction of a user-settable GUC called default_toast_type, that
can be switched between two modes "oid" and "int8", to force the
creation of a TOAST relation using one type or the other.
- Dump, restore and upgrade support are integrated, relying on a GUC
makes the logic a breeze.
- 64b values are retrieved from a single counter in the control file,
named a "TOAST counter", which has the same reliability and properties
as an OID, with checkpoint support, WAL records, etc.
- Rewrites are soft, so I have kicked the can down the toast on this
point to not make the proposal more complicated than it should be: a
VACUUM FULL retains the same TOAST value type as the original. We
could extend rewrites so as the type of TOAST value is changed. It is
possible to setup a new cluster with default_toast_type = int8 set
after an upgrade, with the existing tables still using the OID mode.
This relates to the recent proposal with a concurrent VACUUM FULL
(REPACK discussion).
The patch set keeps the existing vartag_external with OID values for
backward-compatibility, and adds a second vartag_external that can
store 8-byte values. This model is the simplest one, and
optimizations are possible, where the Datum TOAST pointer could be
shorter depending on the ID type (OID or int8), the compression method
and the actual value to divide in chunks. For example, if you know
that a chunk of data to save has a value less than UINT32_MAX, we
could store 4 bytes worth of data instead of 8. This design has the
advantage to allow plugging in new TOAST external structures easily.
Now I've not spent extra time in this tuning, because there's no point
in spending more time without an actual agreement about three things,
and *that's what I'm looking for* as feedback for this upcoming CF:
- The structures of the external TOAST pointers. Variable-sized
pointers could be one possibility, across multiple vartags. Ideas are
welcome.
- How many vartag_external types we want.
- If people are actually OK with this translation layer or not, and I
don't disagree that there may be some paths hot enough where the
translation between the on-disk varlenas and this on-memory
toast_external_data hurts. Again, it is possible to hardcode more
types of vartags in the tree, or just bypass the translation in the
paths that are too hot. That's doable still brutal, but if that's the
final consensus reached I'm OK with that as well. (See for example
the changes in amcheck to see how simpler things get.)
The patch set has been divided into multiple pieces to ease its
review. Again, I'm not completely happy with everything in it, but
it's a start. Each patch has its own commit message, so feel free to
refer to them for more details:
- 0001 introduces the GUC default_toast_type. It is just defined, not
used in the tree at this stage.
- 0002 adds support for catcache lookups for int8 values, required to
allow TOAST values with int8 and its indexes. Potentially useful for
extensions.
- 0003 introduces the "TOAST counter", 8 bytes in the control file to
allocate values for the int8 chunk_id. That's cheap, reliable.
- 0004 is a mechanical change, that enlarges a couple of TOAST
interfaces to use values of uint64 instead of OID.
- 0005, again a mechanical change, reducing a bit the footprint of
TOAST_MAX_CHUNK_SIZE because OID and int8 values need different
values.
- 0006 tweaks pg_column_toast_chunk_id() to use int8 as return type.
Then comes the "main" patches:
- 0007 adds support for int8 chunk_id in TOAST tables. This is mostly
a mechanical change. If applying the patches up to this point,
external Datums are applied to both OID and int8 values. Note that
there is one tweak I'm unhappy with: the toast counter generation
would need to be smarter to avoid concurrent values because we don't
cross-check the TOAST index for existing values. (Sorry, got slightly
lazy here).
- 0008 adds tests for external compressed and uncompressed TOAST
values for int8 TOAST types.
- 0009 adds support for dump, restore, upgrades of the TOAST table
types.
- 0010 is the main dish: refactoring of the TOAST code to use
toast_external_data, with OID vartags as the only type defined.
- 0011 adds a second vartag_external: the one with int8 values stored
in the external TOAST pointer.
- 0012 is a bonus for amcheck: what needs to be done in its TAP tests
to allow the corruption cases to work when supporting a new vartag.
That was a long message. Thank you for reading if you have reached
this point.
Regards,
[1]: /messages/by-id/CAFAfj_HX84EK4hyRYw50AOHOcdVi-+FFwAAPo7JHx4aShCvunQ@mail.gmail.com -- Michael
--
Michael
Attachments:
Hi Michael
I'll take a look at the patch set.
While digging around in the TOAST code did you have any ideas on how
one could extract the TOAST APIs in a way that they can be added in
Table Access Method definition ?
Not all TAMs need TOAST, but the ones that do could also be the ones
that still like to do something different when materializing toasted
values.
And TOAST is actually a nice abstraction which could be used as basis
for both offloading more columns into separate forks and files as well
as implementing some kinds of vectored, columnar and compressed
storages.
----
Hannu
Show quoted text
On Thu, Jun 19, 2025 at 7:59 AM Michael Paquier <michael@paquier.xyz> wrote:
Hi all,
I have been looking at $subject and the many past reviews recently,
also related to some of the work related to the potential support for
zstandard compression in TOAST values, and found myself pondering
about the following message from Tom, to be reminded that nothing has
been done regarding the fact that the backend may finish in an
infinite loop once a TOAST table reaches 4 billion values:
/messages/by-id/764273.1669674269@sss.pgh.pa.usSpoiler: I have heard of users that are in this case, and the best
thing we can do currently except raising shoulders is to use
workarounds with data externalization AFAIK, which is not nice,
usually, and users notice the problem once they see some backends
stuck in the infinite loop. I have spent some time looking at the
problem, and looked at all the proposals in this area like these ones
(I hope so at least):
https://commitfest.postgresql.org/patch/4296/
/messages/by-id/224711f9-83b7-a307-b17f-4457ab73aa0a@sigaev.ruAnyway, it seems like nothing goes in a direction that I think would
be suited to fix the two following problems (some of the proposed
patches broke backward-compatibility, as well, and that's critical):
- The limit of TOAST values to 4 billions, because external TOAST
pointers want OIDs.
- New compression methods, see the recent proposal about zstandard at
[1]. ToastCompressionId is currently limited to 4 values because the
extinfo field of varatt_external has only this much data remaining.
Spoiler: I want to propose a new varatt_external dedicated to
zstandard-compressed external pointers, but that's not for this
thread.Please find attached a patch set I have finished with while poking at
the problem, to address points 1) and 2) in the first email mentioned
at the top of this message. It is not yet ready for prime day yet
(there are a couple of things that would need adjustments), but I have
reached the point where I am going to need a consensus about what
people would be OK to have in terms of design to be able to support
multiple types of varatt_external to address these issues. And I'm OK
to consume time on that for the v19 cycle.While hacking at (playing with?) the whole toasting and detoasting
code to understand the blast radius that this would involve, I have
quickly found that it is very annoying to have to touch at many places
of varatt.h to make variants of the existing varatt_external structure
(what we store on-disk as varlena Datum for external TOAST pointers).
Spoiler: it's my first time touching the internals of this code area
so deeply. Most of the code churns happen because we need to make the
[de]toast code aware of what to do depending on the vartags of the
external varlenas. It would be simple to hardcode a bunch of new
VARATT_IS_EXTERNAL_ONDISK() variants to plug in the new structures.
While it is efficient, this has a cost for out-of-core code and in
core because all the places that touch external TOAST pointers need to
be adjusted. Point is: it can be done. But if we introduce more
types of external TOAST pointers we need to always patch all these
areas, and there's a cost in that each time one or more new vartags
are added.So, I have invented a new interface aimed at manipulating on-disk
external TOAST pointers, called toast_external, that is an on-memory
structure that services as an interface between on-disk external TOAST
pointers and what the backend wants to look at when retrieving chunks
of data from the TOAST relations. That's the main proposal of this
patch set, with a structure looking like that:
typedef struct toast_external_data
{
/* Original data size (includes header) */
int32 rawsize;
/* External saved size (without header) */
uint32 extsize;
/* compression method */
ToastCompressionId compression_method;
/* Relation OID of TOAST table containing the value */
Oid toastrelid;
/*
* Unique ID of value within TOAST table. This could be an OID or an
* int8 value. This field is large enough to be able to store any of
* them.
*/
uint64 value;
} toast_external_data;This is a bit similar to what the code does for R/W and R/O vartags,
only applying to the on-disk external pointers. Then, the [de]toast
code and extension code is updated so as varlenas are changed into
this structure if we need to retrieve some of its data, and these
areas of the code do not need to know about the details of the
external TOAST pointers. When saving an external set of chunks, this
structure is filled with information depending on what
toast_save_datum() deals with, be it a short varlena, a non-compressed
external value, or a compressed external value, then builds a varlena
with the vartag we want.External TOAST pointers have three properties that are hardcoded in
the tree, bringing some challenges of their own:
- The maximum size of a chunk, TOAST_MAX_CHUNK_SIZE, tweaked at close
to 2k to make 4 chunks fit on a page. This depends on the size of the
external pointer. This one was actually easy to refactor.
- The varlena header size, based on VARTAG_SIZE(), which is kind of
tricky to refactor out in the new toast_external.c, but that seems OK
even if this knowledge stays in varatt.h.
- The toast pointer size, aka TOAST_POINTER_SIZE. This one is
actually very interesting (tricky): we use it in one place,
toast_tuple_find_biggest_attribute(), as a lower bound to decide if an
attribute should be toastable or not. I've refactored the code to use
a "best" guess depending on the value type in the TOAST relation, but
that's not 100% waterproof. That needs more thoughts.Anyway, the patch set is able to demonstrate how much needs to be done
in the tree to support multiple chunk_id types, and the CI is happy
with the attached. Some of the things done:
- Introduction of a user-settable GUC called default_toast_type, that
can be switched between two modes "oid" and "int8", to force the
creation of a TOAST relation using one type or the other.
- Dump, restore and upgrade support are integrated, relying on a GUC
makes the logic a breeze.
- 64b values are retrieved from a single counter in the control file,
named a "TOAST counter", which has the same reliability and properties
as an OID, with checkpoint support, WAL records, etc.
- Rewrites are soft, so I have kicked the can down the toast on this
point to not make the proposal more complicated than it should be: a
VACUUM FULL retains the same TOAST value type as the original. We
could extend rewrites so as the type of TOAST value is changed. It is
possible to setup a new cluster with default_toast_type = int8 set
after an upgrade, with the existing tables still using the OID mode.
This relates to the recent proposal with a concurrent VACUUM FULL
(REPACK discussion).The patch set keeps the existing vartag_external with OID values for
backward-compatibility, and adds a second vartag_external that can
store 8-byte values. This model is the simplest one, and
optimizations are possible, where the Datum TOAST pointer could be
shorter depending on the ID type (OID or int8), the compression method
and the actual value to divide in chunks. For example, if you know
that a chunk of data to save has a value less than UINT32_MAX, we
could store 4 bytes worth of data instead of 8. This design has the
advantage to allow plugging in new TOAST external structures easily.
Now I've not spent extra time in this tuning, because there's no point
in spending more time without an actual agreement about three things,
and *that's what I'm looking for* as feedback for this upcoming CF:
- The structures of the external TOAST pointers. Variable-sized
pointers could be one possibility, across multiple vartags. Ideas are
welcome.
- How many vartag_external types we want.
- If people are actually OK with this translation layer or not, and I
don't disagree that there may be some paths hot enough where the
translation between the on-disk varlenas and this on-memory
toast_external_data hurts. Again, it is possible to hardcode more
types of vartags in the tree, or just bypass the translation in the
paths that are too hot. That's doable still brutal, but if that's the
final consensus reached I'm OK with that as well. (See for example
the changes in amcheck to see how simpler things get.)The patch set has been divided into multiple pieces to ease its
review. Again, I'm not completely happy with everything in it, but
it's a start. Each patch has its own commit message, so feel free to
refer to them for more details:
- 0001 introduces the GUC default_toast_type. It is just defined, not
used in the tree at this stage.
- 0002 adds support for catcache lookups for int8 values, required to
allow TOAST values with int8 and its indexes. Potentially useful for
extensions.
- 0003 introduces the "TOAST counter", 8 bytes in the control file to
allocate values for the int8 chunk_id. That's cheap, reliable.
- 0004 is a mechanical change, that enlarges a couple of TOAST
interfaces to use values of uint64 instead of OID.
- 0005, again a mechanical change, reducing a bit the footprint of
TOAST_MAX_CHUNK_SIZE because OID and int8 values need different
values.
- 0006 tweaks pg_column_toast_chunk_id() to use int8 as return type.Then comes the "main" patches:
- 0007 adds support for int8 chunk_id in TOAST tables. This is mostly
a mechanical change. If applying the patches up to this point,
external Datums are applied to both OID and int8 values. Note that
there is one tweak I'm unhappy with: the toast counter generation
would need to be smarter to avoid concurrent values because we don't
cross-check the TOAST index for existing values. (Sorry, got slightly
lazy here).
- 0008 adds tests for external compressed and uncompressed TOAST
values for int8 TOAST types.
- 0009 adds support for dump, restore, upgrades of the TOAST table
types.
- 0010 is the main dish: refactoring of the TOAST code to use
toast_external_data, with OID vartags as the only type defined.
- 0011 adds a second vartag_external: the one with int8 values stored
in the external TOAST pointer.
- 0012 is a bonus for amcheck: what needs to be done in its TAP tests
to allow the corruption cases to work when supporting a new vartag.That was a long message. Thank you for reading if you have reached
this point.Regards,
[1]: /messages/by-id/CAFAfj_HX84EK4hyRYw50AOHOcdVi-+FFwAAPo7JHx4aShCvunQ@mail.gmail.com
--
Michael
Hi!
Hannu, we'd already made an attempt to extract the TOAST functionality as
API
and make it extensible and usable by other AMs in [1]Pluggable TOAST </messages/by-id/224711f9-83b7-a307-b17f-4457ab73aa0a@sigaev.ru>, the patch set was
met calmly
but we still have some hopes on it.
Michael, glad you continue this work! Took patch set for review.
[1]: Pluggable TOAST </messages/by-id/224711f9-83b7-a307-b17f-4457ab73aa0a@sigaev.ru>
</messages/by-id/224711f9-83b7-a307-b17f-4457ab73aa0a@sigaev.ru>
--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/
On Fri, Jul 4, 2025 at 2:03 PM Hannu Krosing <hannuk@google.com> wrote:
Show quoted text
Hi Michael
I'll take a look at the patch set.
While digging around in the TOAST code did you have any ideas on how
one could extract the TOAST APIs in a way that they can be added in
Table Access Method definition ?Not all TAMs need TOAST, but the ones that do could also be the ones
that still like to do something different when materializing toasted
values.And TOAST is actually a nice abstraction which could be used as basis
for both offloading more columns into separate forks and files as well
as implementing some kinds of vectored, columnar and compressed
storages.----
Hannu
On Fri, Jul 04, 2025 at 02:38:34PM +0300, Nikita Malakhov wrote:
Hannu, we'd already made an attempt to extract the TOAST functionality as
API and make it extensible and usable by other AMs in [1], the patch
set was met calmly but we still have some hopes on it.
Yeah, it's one of these I have studied, and just found that
overcomplicated, preventing us from moving on with a simpler proposal,
because I care about two things first:
- More compression methods, with more meta-data, but let's just add
more vartag_external for that once/if they're really required.
- Enlarge optionally to 8-byte values.
So I really want to stress about these two points, nothing else for
now, echoing from the feedback from 2022 and the fact that all
proposals done after that lacked a simple approach.
IMO, we would live fine enough, *if* being able to plug in a pluggable
TOAST engine makes sense, if we just limit ourselves with an external
interface. We could allow backends to load their own vartag_external
with their own set of callbacks like the ones I am proposing here, so
as we can translate from/to a Datum in heap (or a different table AM)
to an external source, with the backend able to understand what this
external source should be. The key is to define a structure good
enough for the backend (toast_external_data in the patch). So to
answer your and Hannu's question: I had the case of different table
AMs in mind with an interface able to plug into it, yes. And yes, I
have designed the patch set with this in scope. Now there's also a
data type component to that, so that's assuming that a table AM would
want to rely on a varlena to store this data externally, somewhere
else that may not be exactly TOAST, still we want an OID and a value
to be able to retrieve this external value, and we want to store this
external OID and this value (+extra like a compression method and
sizes) in a Datum of the main relation file.
FYI, the patch set posted on this thread is not the latest one. I
have a v2, posted on this branch, where I have reordered things:
https://github.com/michaelpq/postgres/tree/toast_64bit_v2
The refactoring to the new toast_external_data with its callbacks is
done first, and the new vartag_external with 8-byte value support is
added on top of that. There were still two things I wanted to do, and
could not get down to it because I've spent my last week or so
working on other's stuff so I lacked room:
- Evaluate the cost of the transfer layer to toast_external_data. The
worst case I was planning to work with is a non-compressed data stored
in TOAST, then check profiles with the the detoasting path by grabbing
slices of the data with pgbench and a read-only query. The
write/insert path is not going to matter, the detoast is. The
reordering is actually for this reason: I want to see the effect of
the new interface first, and this needs to happen before we even
consider the option of adding 8-byte values.
- Add a callback for the value ID assignment. I was hesitating to add
that when I first backed on the patch but I think that's the correct
design moving forward, with an extra logic to be able to check if an
8-byte value is already in use in a relation, as we do for OID
assignment, but applied to the Toast generator added to the patch.
The backend should decide if a new value is required, we should not
decide the rewrite cases in the callback.
There is a second branch that I use for development, force-pushing to
it periodically, as well:
https://github.com/michaelpq/postgres/tree/toast_64bit
That's much dirtier, always WIP, just one of my playgrounds.
--
Michael
Hi!
I'm reviewing code at toast_64bit_v2.
Michael, isn't there a typo?
static const toast_external_info
toast_external_infos[TOAST_EXTERNAL_INFO_SIZE] = {
[VARTAG_ONDISK_INT8] = {
.toast_pointer_size = TOAST_POINTER_INT8_SIZE,
.maximum_chunk_size = TOAST_MAX_CHUNK_SIZE_INT8,
.to_external_data = ondisk_int8_to_external_data,
.create_external_data = ondisk_int8_create_external_data,
},
[VARTAG_ONDISK_OID] = {
.toast_pointer_size = TOAST_POINTER_INT8_SIZE, <--- here
.maximum_chunk_size = TOAST_MAX_CHUNK_SIZE_OID,
.to_external_data = ondisk_oid_to_external_data,
.create_external_data = ondisk_oid_create_external_data,
},
};
Shouldn't TOAST_POINTER_INT8_SIZE be replaced with TOAST_POINTER_OID_SIZE?
Regards,
--
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/
On Mon, Jul 07, 2025 at 05:33:11PM +0300, Nikita Malakhov wrote:
[VARTAG_ONDISK_OID] = {
.toast_pointer_size = TOAST_POINTER_INT8_SIZE, <--- here
.maximum_chunk_size = TOAST_MAX_CHUNK_SIZE_OID,
.to_external_data = ondisk_oid_to_external_data,
.create_external_data = ondisk_oid_create_external_data,
},
};Shouldn't TOAST_POINTER_INT8_SIZE be replaced with TOAST_POINTER_OID_SIZE?
Yes, thanks for pointing this out. This one has lurked in one of the
rebases (not sure how) and it was impacting the threshold calculation
where we consider if an attribute should be compressed or not. I have
taken this occasion to work a bit more on the patch set. The patch
structure is mostly the same, with two tweaks because I was unhappy
with these in the initial patch set:
- The addition of a new callback able to retrieve a new TOAST value,
to ease the diffs in toast_save_datum().
- Reordering of the patch set, with the TOAST external refactoring
done much earlier in the series, now placed in 0003.
The most interesting piece of the patch is still 0003 "Refactor
external TOAST pointer code for better pluggability". On top of that
stands a 0004 patch named "Introduce new callback to get fresh TOAST
values" where I have added value conflict handling for int8. The
split makes reviews easier, hopefully.
Please note that I still need to look at perf profiles and some flame
graphs with the refactoring done in 0003 with the worst case I've
mentioned upthread with detoasting and values stored uncompressed in
the TOAST relation.
I have also pushed this v2 on this branch, so feel free to grab it if
that makes your life easier:
https://github.com/michaelpq/postgres/tree/toast_64bit_v2
--
Michael
Attachments:
On Tue, Jul 08, 2025 at 08:38:41AM +0900, Michael Paquier wrote:
Please note that I still need to look at perf profiles and some flame
graphs with the refactoring done in 0003 with the worst case I've
mentioned upthread with detoasting and values stored uncompressed in
the TOAST relation.
So, the worst case I could think of for the slice detoast path is
something like that:
create table toasttest_bytea (f1 bytea);
alter table toasttest_bytea alter column f1 set storage external;
insert into toasttest_bytea values(decode(repeat('1234567890',10000),'escape'));
And then use something like the following query that retrieves a small
substring many times, to force a maximum of detoast_attr_slice() to
happen, checking the effect of toast_external_info_get_data():
select length(string_agg(substr(f1, 2, 3), '')) from
toasttest_bytea, lateral generate_series(1,1000000) as a (id);
I have taken this query, kept running that with a \watch, and took
samples of 10s perf records, finishing with the attached graphs
(runtime does not show any difference):
- detoast_master.svg, for the graph on HEAD.
- detoast_patch.svg with the patch set up to 0003 and the external
TOAST pointer refactoring, where detoast_attr_slice() shows up.
- master_patch_diff.svg as the difference between both, with
difffolded.pl from [1]https://github.com/brendangregg/FlameGraph -- Michael.
I don't see a difference in the times spent in these stacks, as we are
spending most of the run retrieving the slices from the TOAST relation
in fetch_datum_slice(). Opinions and/or comments are welcome.
[1]: https://github.com/brendangregg/FlameGraph -- Michael
--
Michael
Attachments:
On Jul 7, 2025, at 7:38 PM, Michael Paquier <michael@paquier.xyz> wrote:
I have also pushed this v2 on this branch, so feel free to grab it if
that makes your life easier:
https://github.com/michaelpq/postgres/tree/toast_64bit_v2
--
Michael
Thank you for spending time digging into this and for the well structured patch set (and GitHub branch which I personally find helpful). This $subject is important on its own, but even more so in the broader context of the zstd/dict work [1]https://commitfest.postgresql.org/patch/5702/ and also allowing for innovation when it comes to how externalized Datum are stored. The current model for toasting has served the community well for years, but I think that Hannu [2]"Yes, the idea is to put the tid pointer directly in the varlena external header and have a tid array in the toast table as an extra column. If all of the TOAST fits in the single record, this will be empty, else it will have an array of tids for all the pages for this toasted field." - Hannu Krosing in an email to me after PGConf.dev/2025 and Nikita and others have promising ideas that should be allowable without forcing core changes. I've worked a bit in this area too, I re-based the Pluggble TOAST work by Nikita [3]/messages/by-id/224711f9-83b7-a307-b17f-4457ab73aa0a@sigaev.ru onto 18devel earlier this year as I was looking for a way to implement a toaster for a custom type.
All that aside, I think you're right to tackle this one step at a time and try not to boil too much of the ocean at once (the patch set is already large enough). With that in mind I've read once or twice over your changes and have a few basic comments/questions.
v2-0001 Refactor some TOAST value ID code to use uint64 instead of Oid
This set of changes make sense and as you say are mechanical in nature, no real comments other than I think that using uint64 rather than Oid is the right call and addresses #2 on Tom's list.
v2-0002 Minimize footprint of TOAST_MAX_CHUNK_SIZE in heap TOAST code
I like this as well, clarifies the code and reduces repetition.
v2-0003 Refactor external TOAST pointer code for better pluggability
+ * For now there are only two types, all defined in this file. For now this
+ * is the maximum value of vartag_external, which is a historical choice.
This provides a bridge for compatibility, but doesn't open the door to a truly pluggable API. I'm guessing the goal is incremental change rather than wholesale rewrite.
+ * The different kinds of on-disk external TOAST pointers. divided by
+ * vartag_external.
Extra '.' in "TOAST pointers. divided" I'm guessing.
v2-0004 Introduce new callback to get fresh TOAST values
v2-0005 Add catcache support for INT8OID
v2-0006 Add GUC default_toast_type
Easily understood, good progression of supporting changes.
v2-0007 Introduce global 64-bit TOAST ID counter in control file
Do you have any concern that this might become a bottleneck when there are many relations and many backends all contending for a new id? I'd imagine that this would show up in a flame graph, but I think your test focused on the read side detoast_attr_slice() rather than insert/update and contention on the shared counter. Would this be even worse on NUMA systems?
v2-0008 Switch pg_column_toast_chunk_id() return value from oid to bigint
v2-0009 Add support for bigint TOAST values
v2-0010 Add tests for TOAST relations with bigint as value type
v2-0011 Add support for TOAST table types in pg_dump and pg_restore
v2-0012 Add new vartag_external for 8-byte TOAST values
V2-0013 amcheck: Add test cases for 8-byte TOAST values
I read through each of these patches, I like the break down and the attention to detail. The inclusion of good documentation at each step is helpful. Thank you.
Thanks for the flame graphs examining a heavy detoast_attr_slice() workload. I agree that there is little or no difference between them which is nice.
I think the only call out Tom made [4]/messages/by-id/764273.1669674269@sss.pgh.pa.us that isn't addressed was the ask for localized ID selection. That may make sense at some point, especially if there is contention on GetNewToastId(). I think that case is worth a separate performance test, something with a large number of relations and backends all performing a lot of updates generating a lot of new IDs. What do you think?
As for adding even more flexibility, I see the potential to move in that direction over time with this as a good focused incremental set of changes that address a few important issues now.
Really excited by this work, thank you.
-greg
[1]: https://commitfest.postgresql.org/patch/5702/
[2]: "Yes, the idea is to put the tid pointer directly in the varlena external header and have a tid array in the toast table as an extra column. If all of the TOAST fits in the single record, this will be empty, else it will have an array of tids for all the pages for this toasted field." - Hannu Krosing in an email to me after PGConf.dev/2025
external header and have a tid array in the toast table as an extra
column. If all of the TOAST fits in the single record, this will be
empty, else it will have an array of tids for all the pages for this
toasted field." - Hannu Krosing in an email to me after PGConf.dev/2025
[3]: /messages/by-id/224711f9-83b7-a307-b17f-4457ab73aa0a@sigaev.ru
[4]: /messages/by-id/764273.1669674269@sss.pgh.pa.us
Hi!
Greg, thanks for the interest in our work!
Michael, one more thing forgot to mention yesterday -
#define TOAST_EXTERNAL_INFO_SIZE (VARTAG_ONDISK_OID + 1)
static const toast_external_info
toast_external_infos[TOAST_EXTERNAL_INFO_SIZE]
VARTAG_ONDISK_OID historically has a value of 18
and here we got an array of 19 members with only 2 valid ones.
What do you think about having an individual
TOAST value id counter per relation instead of using
a common one? I think this is a very promising approach,
but a decision must be made where it should be stored.
--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/
I still think we should go with direct toast tid pointers in varlena
and not some kind of oid.
It will remove the need for any oid management and also will be
many-many orders of magnitude faster for large tables (just 2x faster
for in-memory small tables)
I plan to go over Michael's patch set here and see how much change is
needed to add the "direct toast"
My goals are:
1. fast lookup from skipping index lookup
2. making the toast pointer in main heap as small as possible -
hopefully just the 6 bytes of tid pointer - so that scans that do not
need toasted values get more tuples from each page
3. adding all (optional) the extra data into toast chunk record as
there we are free to add whatever is needed
Currently I plan to introduces something like this for toast chunk record
Column | Type | Storage
-------------+---------+----------
chunk_id | oid | plain | 0 when not using toast index, 0xfffe -
non-deletable, for example when used as dictionary for multiple
toasted values.
chunk_seq | integer | plain | if not 0 when referenced from toast
pointer then the toasted data starts at toast_pages[0] (or below it in
that tree), which *must* have chunk_id = 0
chunk_data | bytea | plain
-- added fields
toast_pages | tid[] | plain | can be chained or make up a tree
offsets | int[] | plain | -- starting offsets of the toast_pages
(octets or type-specific units), upper bit is used to indicate that a
new compressed span starts at that offset, 2nd highest bit indicates
that the page is another tree page
comp_method | int | plain | -- compression methos used maybe should be enum ?
dict_pages | tid[] | plain | -- pages to use as compression
dictionary, up to N pages, one level
This seems to be flexible enough to allow for both compressin and
efficient partial updates
---
Hannu
Show quoted text
On Tue, Jul 8, 2025 at 8:31 PM Nikita Malakhov <hukutoc@gmail.com> wrote:
Hi!
Greg, thanks for the interest in our work!
Michael, one more thing forgot to mention yesterday -
#define TOAST_EXTERNAL_INFO_SIZE (VARTAG_ONDISK_OID + 1)
static const toast_external_info toast_external_infos[TOAST_EXTERNAL_INFO_SIZE]
VARTAG_ONDISK_OID historically has a value of 18
and here we got an array of 19 members with only 2 valid ones.What do you think about having an individual
TOAST value id counter per relation instead of using
a common one? I think this is a very promising approach,
but a decision must be made where it should be stored.--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/
On 2025-Jul-08, Hannu Krosing wrote:
I still think we should go with direct toast tid pointers in varlena
and not some kind of oid.
I think this can be made to work, as long as we stop seeing the toast
table just like a normal heap table containing normal tuples. A lot to
reimplement though -- vacuum in particular. Maybe it can be thought of
as a new table AM. Not an easy project, I reckon.
--
Ãlvaro Herrera Breisgau, Deutschland â https://www.EnterpriseDB.com/
Hi,
Hannu, we have some thoughts on direct tids storage,
it was some time ago and done by another developer,
so I have to look. I'll share it as soon as I find it, if you
are interested.
--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/
chunk_
On Tue, Jul 8, 2025 at 9:37 PM Álvaro Herrera <alvherre@kurilemu.de> wrote:
On 2025-Jul-08, Hannu Krosing wrote:
I still think we should go with direct toast tid pointers in varlena
and not some kind of oid.I think this can be made to work, as long as we stop seeing the toast
table just like a normal heap table containing normal tuples. A lot to
reimplement though -- vacuum in particular.
Non-FULL vacuum should already work. Only commands like VACUUM FULL
and CLUSTER which move tuples around should be disabled on TOAST
tables.
What other parts do you think need re-implementing in addition to
skipping the index lookup part and using the tid directly ?
The fact that per-page chunk_tid arrays allow also tree structures
should allow us much more flexibility in implementing
in-place-updatable structured storage in something otherways very
similar to toast, but this is not required for just moving from oid +
index ==> tid to using the tid directly.
I think that having a toast table as a normal table with full MVCC is
actually a good thing, as it can implement the "array element update"
as a real partial update of only the affected parts and not the
current 'copy everything' way of doing this. We already do collect the
array element update in the parse tree in a special way, now we just
need to have types that can do the partial update by changing a tid or
two in the chunk_tids array (and adjust the offsets array if needed)
This should make both
UPDATE t SET theintarray[3] = 5, theintarray[4] = 7 WHERE id = 1;
and even do partial up[dates for something like this
hannuk=# select * from jtab;
id | j
----+----------------------------
1 | {"a": 3, "b": 2}
2 | {"c": 1, "d": [10, 20, 3]}
(2 rows)
hannuk=# update jtab SET j['d'][3] = '7' WHERE id = 2;
UPDATE 1
hannuk=# select * from jtab;
id | j
----+-------------------------------
1 | {"a": 3, "b": 2}
2 | {"c": 1, "d": [10, 20, 3, 7]}
(2 rows)
when the JSON data is so large that changed part is in it's own chunk.
Maybe it can be thought of
as a new table AM. Not an easy project, I reckon.
I would prefer it to be an extension of current toast - just another
varatt_* type - as then you can upgrade to new storage CONCURRENTLY,
same way as you can currently switch compression methods.
Show quoted text
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
On Tue, Jul 08, 2025 at 08:54:33PM +0200, Hannu Krosing wrote:
I still think we should go with direct toast tid pointers in varlena
and not some kind of oid.It will remove the need for any oid management and also will be
many-many orders of magnitude faster for large tables (just 2x faster
for in-memory small tables)
There is also the point of backward-compatibility. We cannot just
replace things, and we have to provide a way for users to be able to
rely on the system so as upgrades are painless. So we need to think
about the correct application layer to use to maintain the legacy code
behavior while considering improvements.
I plan to go over Michael's patch set here and see how much change is
needed to add the "direct toast"
If you do not have a lot of time looking at the full patch set, I'd
recommend looking at 0003, files toast_external.h and
toast_external.c which include the key idea. Adding a new external
TOAST pointer is then a two-step process:
- Add a new vartag_external.
- Add some callbacks to let the backend understand what it should do
with this new vartag_external.
My goals are:
1. fast lookup from skipping index lookup
2. making the toast pointer in main heap as small as possible -
hopefully just the 6 bytes of tid pointer - so that scans that do not
need toasted values get more tuples from each page
3. adding all (optional) the extra data into toast chunk record as
there we are free to add whatever is needed
Currently I plan to introduces something like this for toast chunk record
Points 2. and 3. are things that the refactoring should allow. About
1., I have no idea how much you want to store in the TOAST external
points and how it affects the backend, but you could surely implement
an option that lets the backend know that it should still index
lookups based on what the external TOAST pointer says, if this stuff
has benefits.
This seems to be flexible enough to allow for both compressin and
efficient partial updates
I don't really disagree with all that. Now the history of the TOAST
threads point out that we've good at proposing complex things, but
these had a high footprint. What I'm proposing is lighter than that,
I think, tackling my core issue with the infra supporting backward
compatibility and the addition of more modes on top of it.
--
Michael
On Tue, Jul 08, 2025 at 09:31:29PM +0300, Nikita Malakhov wrote:
Michael, one more thing forgot to mention yesterday -
#define TOAST_EXTERNAL_INFO_SIZE (VARTAG_ONDISK_OID + 1)
static const toast_external_info
toast_external_infos[TOAST_EXTERNAL_INFO_SIZE]
VARTAG_ONDISK_OID historically has a value of 18
and here we got an array of 19 members with only 2 valid ones.
Yeah, I'm aware of that. The code is mostly to make it easier to read
while dealing with this historical behavior, even if it costs a bit
more in memory. I don't think that it's a big deal, and we could
always have one more level of redirection to reduce its size. Now
there's the extra complexity..
What do you think about having an individual
TOAST value id counter per relation instead of using
a common one? I think this is a very promising approach,
but a decision must be made where it should be stored.
I've thought about that, and decided to discard this idea for now to
keep the whole proposal simpler. This has benefits if you have many
relations with few OIDs consumed, but this has a cost in itself as you
need to maintain the data for each TOAST relation. When I looked at
the problems a couple of weeks ago, I came to the conclusion that all
the checkbox properties of "local" TOAST values are filled with a
sequence: WAL logging to ensure uniqueness, etc. So I was even
considering the addition of some code to create sequences on-the-fly,
but at the end that was just more complexity with how we define
sequences currently compared to a unique 8-byte counter in the
control file that's good enough for a veeery long time.
I've also noticed that this sort of links to a piece I've implemented
last year and is still sitting in the CF app without much interest
from others: sequence AMs. You could implement a "TOAST" sequence
method, for example, optimized for this purpose. As a whole, I
propose to limit the scope of the proposal to the pluggability of the
external TOAST pointers. The approach I've taken should allow such
improvements, these can be added later if really needed.
--
Michael
On Tue, Jul 08, 2025 at 12:58:26PM -0400, Burd, Greg wrote:
All that aside, I think you're right to tackle this one step at a
time and try not to boil too much of the ocean at once (the patch
set is already large enough). With that in mind I've read once or
twice over your changes and have a few basic comments/questions.v2-0001 Refactor some TOAST value ID code to use uint64 instead of Oid
This set of changes make sense and as you say are mechanical in
nature, no real comments other than I think that using uint64 rather
than Oid is the right call and addresses #2 on Tom's list.v2-0002 Minimize footprint of TOAST_MAX_CHUNK_SIZE in heap TOAST code
I like this as well, clarifies the code and reduces repetition.
Thanks. These are independent pieces if you want to link the code
less to TOAST, assuming that an area of 8 bytes would be good enough
for any TOAST "value" concept. TIDs were mentioned as well on a
different part of the thread, ItemPointerData is 6 bytes.
v2-0003 Refactor external TOAST pointer code for better pluggability
+ * For now there are only two types, all defined in this file. For now this + * is the maximum value of vartag_external, which is a historical choice.This provides a bridge for compatibility, but doesn't open the door
to a truly pluggable API. I'm guessing the goal is incremental
change rather than wholesale rewrite.
Nope, it does not introduce a pluggable thing, but it does untangle
the fact that one needs to change 15-ish code paths when they want to
add a new type of external TOAST pointer, without showing an actual
impact AFAIK when we insert a TOAST value or fetch it, as long as we
know that we're dealing with an on-disk thing that requires an
external lookup.
+ * The different kinds of on-disk external TOAST pointers. divided by + * vartag_external.Extra '.' in "TOAST pointers. divided" I'm guessing.
Indeed, thanks.
v2-0007 Introduce global 64-bit TOAST ID counter in control file
Do you have any concern that this might become a bottleneck when
there are many relations and many backends all contending for a new
id? I'd imagine that this would show up in a flame graph, but I
think your test focused on the read side detoast_attr_slice() rather
than insert/update and contention on the shared counter. Would this
be even worse on NUMA systems?
That may be possible, see below.
Thanks for the flame graphs examining a heavy detoast_attr_slice()
workload. I agree that there is little or no difference between
them which is nice.
Cool. Yes. I was wondering why detoast_attr_slice() does not show up
in the profile on HEAD, perhaps it just got optimized away (I was
under -O2 for these profiles).
I think the only call out Tom made [4] that isn't addressed was the
ask for localized ID selection. That may make sense at some point,
especially if there is contention on GetNewToastId(). I think that
case is worth a separate performance test, something with a large
number of relations and backends all performing a lot of updates
generating a lot of new IDs. What do you think?
Yeah, I need to do more benchmark for the int8 part, I was holding on
such evaluations because this part of the patch does not fly if we
don't do the refactoring pieces first. Anyway, I cannot get excited
about the extra workload that this would require in the catalogs,
because we would need one TOAST sequence tracked in there, linked to
the TOAST relation so it would not be free. Or we invent a new
facility just for this purpose, meaning that we get far away even more
from being able to resolve the original problem with the values and
compression IDs. We're talking about two instructions. Well, I guess
that we could optimize it more some atomics or even cache a range of
values to save in ToastIdGenLock acquisitions in a single backend. I
suspect that the bottleneck is going to be the insertion of the TOAST
entries in toast_save_datum() anyway with the check for conflicting
values, even if your relation is unlogged or running-on-scissors in
memory.
[2] "Yes, the idea is to put the tid pointer directly in the varlena
external header and have a tid array in the toast table as an extra
column. If all of the TOAST fits in the single record, this will be
empty, else it will have an array of tids for all the pages for this
toasted field." - Hannu Krosing in an email to me after
PGConf.dev/2025
Sure, you could do that as well, but I suspect that we'll need the
steps of at least up to 0003 to be able to handle more easily multiple
external TOAST pointer types, or the code will be messier than it
currently is. :D
--
Michael
Hi!
On this -
Non-FULL vacuum should already work. Only commands like VACUUM FULL
and CLUSTER which move tuples around should be disabled on TOAST
tables.
Cool, toast tables are subject to bloating in update-heavy scenarios
and it's a big problem in production systems, it seems there is a promising
way to solve it once and for all!
Have to mention though that we encountered issues in logical replication
when we made toast values updatable.
Also researching direct tids implementation.
Cheers!
--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/
On Mon, Jul 14, 2025 at 09:01:28AM +0300, Nikita Malakhov wrote:
Cool, toast tables are subject to bloating in update-heavy scenarios
and it's a big problem in production systems, it seems there is a promising
way to solve it once and for all!Have to mention though that we encountered issues in logical replication
when we made toast values updatable.Also researching direct tids implementation.
I would be curious to see if the refactoring done on this thread would
be useful in the scope of what you are trying to do. I'd suggest
dropping that on a different thread, though, if you finish with a
patch or something worth looking at for others.
--
Michael
Hi Michael,
I'm currently debugging POC direct tids TOAST patch (on top of your branch),
will mail it in a day or two.
On Tue, Jul 15, 2025 at 3:56 AM Michael Paquier <michael@paquier.xyz> wrote:
On Mon, Jul 14, 2025 at 09:01:28AM +0300, Nikita Malakhov wrote:
Cool, toast tables are subject to bloating in update-heavy scenarios
and it's a big problem in production systems, it seems there is apromising
way to solve it once and for all!
Have to mention though that we encountered issues in logical replication
when we made toast values updatable.Also researching direct tids implementation.
I would be curious to see if the refactoring done on this thread would
be useful in the scope of what you are trying to do. I'd suggest
dropping that on a different thread, though, if you finish with a
patch or something worth looking at for others.
--
Michael
--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/
On Fri, Jul 18, 2025 at 9:24 PM Nikita Malakhov <hukutoc@gmail.com> wrote:
Hi Michael,
I'm currently debugging POC direct tids TOAST patch (on top of your branch),
will mail it in a day or two.
Great!
I also just started looking at it, starting from 0003 as recommended by Michael.
Will be interesting to see how similar / different our approaches will be :)
On Mon, Jul 14, 2025 at 8:15 AM Nikita Malakhov <hukutoc@gmail.com> wrote:
...
Have to mention though that we encountered issues in logical replication
when we made toast values updatable.
This seems to indicate that Logical Decoding does not honour
visibility checks in TOAST.
This works fine if the TOAST visibility never changes but will break
if it can change independently of heap-side tuple
Also researching direct tids implementation.
Cool!
On Fri, Jul 18, 2025 at 10:24:12PM +0300, Nikita Malakhov wrote:
I'm currently debugging POC direct tids TOAST patch (on top of your branch),
will mail it in a day or two.
Interesting. Of course I may be wrong because I have no idea of how
you have shaped things, still I suspect that for the basics you should
just need 0003, 0004, the parts with the GUC to switch the TOAST table
type and the dump/restore/upgrade bits.
--
Michael
Hi Michael,
I'm reviewing your patches(toast_64bit_v2 branch) and have prepared
two incremental patches that add ZSTD compression support. While doing
this I made a small refactoring in heaptoast.c (around
ToastTupleContext) to make adding additional TOAST pointer formats
easier.
Added two new on-disk external vartags to support new compression
methods: one using an Oid value identifier and one using a 64‑bit
(int8) identifier for toast table types. This lets us support extended
compression methods for both existing Oid‑based TOAST tables and a
variant that needs a wider ID space.
typedef enum vartag_external
{
VARTAG_INDIRECT = 1,
VARTAG_EXPANDED_RO = 2,
VARTAG_EXPANDED_RW = 3,
VARTAG_ONDISK_INT8 = 4,
VARTAG_ONDISK_CE_OID = 5,
VARTAG_ONDISK_CE_INT8 = 6,
VARTAG_ONDISK_OID = 18
} vartag_external;
Two new ondisk TOAST pointer structs carrying an va_ecinfo field for
extended compression methods:
typedef struct varatt_external_ce_oid
{
int32 va_rawsize; /* Original data size (includes header) */
uint32 va_extinfo; /* External saved size (without header) and
* VARATT_CE_FLAG in top 2 bits */
uint32 va_ecinfo; /* Extended compression info */
Oid va_valueid; /* Unique ID of value within TOAST table */
Oid va_toastrelid; /* RelID of TOAST table containing it */
} varatt_external_ce_oid;
typedef struct varatt_external_ce_int8
{
int32 va_rawsize; /* Original data size (includes header) */
uint32 va_extinfo; /* External saved size (without header) and
* VARATT_CE_FLAG in top 2 bits */
uint32 va_ecinfo; /* Extended compression info */
/*
* Unique ID of value within TOAST table, as two uint32 for alignment and
* padding.
*/
uint32 va_valueid_lo;
uint32 va_valueid_hi;
Oid va_toastrelid; /* RelID of TOAST table containing it */
} varatt_external_ce_int8;
The inline (varattrib_4b) format extension (discussed in [1]/messages/by-id/CAFAfj_HX84EK4hyRYw50AOHOcdVi-+FFwAAPo7JHx4aShCvunQ@mail.gmail.com) is
included; I made one adjustment: the compression id field is now a
4‑byte va_ecinfo field (instead of 1 byte) for structural consistency
with the extended TOAST pointer formats.
typedef union
{
struct /* Normal varlena (4-byte length) */
{
uint32 va_header;
char va_data[FLEXIBLE_ARRAY_MEMBER];
} va_4byte;
struct /* Compressed-in-line format */
{
uint32 va_header;
uint32 va_tcinfo; /* Original data size (excludes header) and
* compression method; see va_extinfo */
char va_data[FLEXIBLE_ARRAY_MEMBER]; /* Compressed data */
} va_compressed;
struct
{
uint32 va_header;
uint32 va_tcinfo; /* Original data size (excludes header) and
* compression method or VARATT_CE_FLAG flag;
* see va_extinfo */
uint32 va_ecinfo; /* Extended compression info: 32-bit field
* where only the lower 8 bits are used for
* compression method. Upper 24 bits are
* reserved/unused. Lower 8 bits layout: Bits
* 7–1: encode (cmid − 2), so cmid is
* [2…129] Bit 0: flag for extra metadata
*/
char va_data[FLEXIBLE_ARRAY_MEMBER];
} va_compressed_ext;
} varattrib_4b;
[1]: /messages/by-id/CAFAfj_HX84EK4hyRYw50AOHOcdVi-+FFwAAPo7JHx4aShCvunQ@mail.gmail.com
Please review it and let me know if you have any questions or
feedback? Thank you!
v26-0014-Design-to-extend-the-varattrib_4b-and-toast-poin.patch:
Design proposal covering varattrib_4b, TOAST pointer layouts, and
related macro updates.
v26-0015-Implement-Zstd-compression-no-dictionary-support.patch: Plain
ZSTD (non dict) support and few basic tests.
--
Nikhil Veldanda
Attachments:
Hi,
v26-0014-Design-to-extend-the-varattrib_4b-and-toast-poin.patch:
Design proposal covering varattrib_4b, TOAST pointer layouts, and
related macro updates.
v26-0015-Implement-Zstd-compression-no-dictionary-support.patch: Plain
ZSTD (non dict) support and few basic tests.
Sending v27 patch with a small update over v26 patch.
v27-0014-Design-to-extend-the-varattrib_4b-and-toast-poin.patch:
Design proposal covering varattrib_4b, TOAST pointer layouts, and
related macro updates.
v27-0015-Implement-Zstd-compression-no-dictionary-support.patch: Plain
ZSTD (non dict) support and few basic tests.
--
Nikhil Veldanda
--
Nikhil Veldanda
Attachments:
I have been evolving details for Direct TOAST design in
https://wiki.postgresql.org/wiki/DirectTOAST
The top level goals are
* 8-byte TOAST pointer - just (header:1, tag:1 and TID:6)
* all other info moved from toast pointer to actual toast record(s),
so heap rows are smaller and faster.
* all extra fields are bytea with internal encoding (maybe will create
full new types for these, or maybe just introspection functions are
enough)
the reasons for this are
- PostgresSQL arrays add 20 byte overhead
- bytea gives other freedoms in encoding for minimal space usage
No solution yet for va_toastrelid , but hope is
- to use some kind of mapping and find one or two free bits somewhere
(tid has one free),
- or add a 12-byte toast pointer just for this.
- or to make sure that CLUSTER and VACUUM FULL can be done without
needing va_toastrelid. I assume it is there for clustering the TOAST
which will be not possible separately from the main heap with direct
toast tid pointers anyway.
Please take a look and poke holes in it !
On Sun, Jul 20, 2025 at 10:28 AM Nikhil Kumar Veldanda
<veldanda.nikhilkumar17@gmail.com> wrote:
Show quoted text
Hi,
v26-0014-Design-to-extend-the-varattrib_4b-and-toast-poin.patch:
Design proposal covering varattrib_4b, TOAST pointer layouts, and
related macro updates.
v26-0015-Implement-Zstd-compression-no-dictionary-support.patch: Plain
ZSTD (non dict) support and few basic tests.Sending v27 patch with a small update over v26 patch.
v27-0014-Design-to-extend-the-varattrib_4b-and-toast-poin.patch:
Design proposal covering varattrib_4b, TOAST pointer layouts, and
related macro updates.
v27-0015-Implement-Zstd-compression-no-dictionary-support.patch: Plain
ZSTD (non dict) support and few basic tests.--
Nikhil Veldanda--
Nikhil Veldanda
Hi!
Michael and Hannu, here's a POC patch with direct TIDs TOAST.
The simplest implementation where we store a chain of TIDs, each
chunk stores the next TID to be fetched. Patch applies on top of
commit 998b0b51d5ea763be081804434f177082ba6772b (origin/toast_64bit_v2)
Author: Michael Paquier <michael@paquier.xyz>
Date: Thu Jun 19 13:09:11 2025 +0900
While it is very fast on small data - I see several disadvantages:
- first of all, VACUUM should be revised to work with such tables;
- problematic batch insertion due to necessity to store TID chain.
It is just a POC implementation, so please don't blame me for
questionable decisions.
Any opinions and feedback welcome!
PS: Hannu, just seen your latest message, will check it out now.
On Mon, Jul 21, 2025 at 3:15 AM Hannu Krosing <hannuk@google.com> wrote:
I have been evolving details for Direct TOAST design in
https://wiki.postgresql.org/wiki/DirectTOASTThe top level goals are
* 8-byte TOAST pointer - just (header:1, tag:1 and TID:6)
* all other info moved from toast pointer to actual toast record(s),
so heap rows are smaller and faster.
* all extra fields are bytea with internal encoding (maybe will create
full new types for these, or maybe just introspection functions are
enough)
the reasons for this are
- PostgresSQL arrays add 20 byte overhead
- bytea gives other freedoms in encoding for minimal space usageNo solution yet for va_toastrelid , but hope is
- to use some kind of mapping and find one or two free bits somewhere
(tid has one free),
- or add a 12-byte toast pointer just for this.
- or to make sure that CLUSTER and VACUUM FULL can be done without
needing va_toastrelid. I assume it is there for clustering the TOAST
which will be not possible separately from the main heap with direct
toast tid pointers anyway.Please take a look and poke holes in it !
--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/
Attachments:
Hi!
I agree that storing reltoastrelid in each Toast pointer seems to be
a waste of disk space since the current Postgres state does not
allow multiple TOAST tables per relation.
But if we consider this as a viable option it could bring additional
advantages. I've successfully tried to use multiple TOAST tables,
with different variations - by type, by column and as-is just as
an extensible storage.
--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/
On Mon, Jul 21, 2025 at 02:20:31PM +0300, Nikita Malakhov wrote:
I agree that storing reltoastrelid in each Toast pointer seems to be
a waste of disk space since the current Postgres state does not
allow multiple TOAST tables per relation.
va_toastrelid is a central part of the current system when dealing
with a TOAST relation rewrite, because we need to know to which
relation an on-disk TOAST pointer is part of. Or do you mean that we
don't need that with what you are proposing with TIDs? Perhaps yes,
sure, I've not studied this question when associated with your patch
(which has a bunch of duplicated code that could be avoided AFAIK).
But if we consider this as a viable option it could bring additional
advantages. I've successfully tried to use multiple TOAST tables,
with different variations - by type, by column and as-is just as
an extensible storage.
I don't think we need to be that ambitious. There is no way to say
such things could have any benefits in the long-term and there's the
catalog representation part. Even if we do that, I suspect that users
would never really know which setup makes sense because we would want
to control how things happen at a relation level and not purely
automate a behavior.
--
Michael
On Mon, Jul 21, 2025 at 02:06:03PM +0300, Nikita Malakhov wrote:
While it is very fast on small data - I see several disadvantages:
- first of all, VACUUM should be revised to work with such tables;
- problematic batch insertion due to necessity to store TID chain.It is just a POC implementation, so please don't blame me for
questionable decisions.Any opinions and feedback welcome!
I think that this is going to be helpful in taking some decisions with
the refactoring pieces I am proposing for the external TOAST pointer
layer. You have some interesting points around
detoast_external_attr() and detoast_attr_slice(), as far as I can see.
One point of the on-disk TOAST refactoring is that we should be able
to entirely avoid this level of redirection. I get that this is a
POC, of course, but it provides pointers that what I've done may not
be sufficient in terms of extensibility so that seems worth digging
into.
The patch posted by Nikhil is also something that touches this area,
that I have on my tablets:
/messages/by-id/CAFAfj_E-QLiUq--+Kdyvb+-Gg79LLayZRcH8+mFPzVuDQOVaAw@mail.gmail.com
It touches a different point: we need to be smarter with
CompressionToastId and not use that as the value for what we store on
disk. Each vartag_external should be able to use the compression
methods values it wishes, with the toast_external layer being in
charge of translating that back-and-forth with the GUC value.
--
Michael
Hi Michael!
Yes, I know about relation rewrite and have already thought about how
we can avoid excessive storage of toastrelid and do not spoil rewrite,
still do not have a good enough solution.
You have some interesting points around
detoast_external_attr() and detoast_attr_slice(), as far as I can see.
One point of the on-disk TOAST refactoring is that we should be able
to entirely avoid this level of redirection. I get that this is a
POC, of course, but it provides pointers that what I've done may not
be sufficient in terms of extensibility so that seems worth digging
into.
I'm currently re-visiting our TOAST API patch set, there are some
good (in terms of simplicity and lightweightness) proposals, will mail
later.
Some more thoughts on TIDs:
TIDs could be stored as a list instead of a chain (as Hannu proposes
in his design). This allows batch operations and storage optimization
'cause TID lists are highly compressible, but significantly complicates
the code responsible for chunk processing.
Also, Toast pointer in current state must store raw size and external
size - these two are used by the executor, and we cannot get rid
of them so lightly.
Vacuuming such a table would be a pain in the ass, we have to
somehow prevent bloating tables with a high update rate.
Also, current toast mechanics is insert-only, it does not support
updates (just to remind - the whole toasted value is marked dead
and new one is inserted during update), this is a subject to change.
And logical replication, as I mentioned earlier, does not have any
means for replicating toast diffs.
--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/
On Tue, Jul 22, 2025 at 1:24 PM Nikita Malakhov <hukutoc@gmail.com> wrote:
Hi Michael!
Yes, I know about relation rewrite and have already thought about how
we can avoid excessive storage of toastrelid and do not spoil rewrite,
still do not have a good enough solution.
The high-level idea would be to any actual rewrite -- as opposed to
plain vacuum which frees empty space within the TOAST relation -- as
part of the vacuum of the main relation.
Another option would be to store a back-pointer to the heap tuple
inside the toast tuple and use that when rewriting, though it has its
own set of complexities.
You have some interesting points around
detoast_external_attr() and detoast_attr_slice(), as far as I can see.
One point of the on-disk TOAST refactoring is that we should be able
to entirely avoid this level of redirection. I get that this is a
POC, of course, but it provides pointers that what I've done may not
be sufficient in terms of extensibility so that seems worth digging
into.I'm currently re-visiting our TOAST API patch set, there are some
good (in terms of simplicity and lightweightness) proposals, will mail
later.
Sounds interesting.
Some more thoughts on TIDs:
TIDs could be stored as a list instead of a chain (as Hannu proposes
in his design). This allows batch operations and storage optimization
'cause TID lists are highly compressible, but significantly complicates
the code responsible for chunk processing.
I would not say it complicates the *code* very much, especially when
you keep offsets in the toast tuples so that you can copy them into
the final materialized datum in any order.
And it does allow many optimisations in terms of batching,
pre-fetching and even parallelism in case of huge toasted values.
Also, Toast pointer in current state must store raw size and external
size - these two are used by the executor, and we cannot get rid
of them so lightly.
Are these ever used without actually using the data ?
When the data _is_ also used then the cost of getting the length from
the toast record with direct toast should mostly amortize over the
full query.
Can you point to where in the code this is done ?
In long run we may want to store also the actual size in the toast
record (not toast pointer) as well for types where length() !=
octetsize bacuse currently a simple call like length(text) has to
materialize the whole thing before getting the length, whereas
pg_colum_size() and octertsize() are instantaneous.
Vacuuming such a table would be a pain in the ass, we have to
somehow prevent bloating tables with a high update rate.
Normal Vacuum should work fine. It is the rewrite that cis tricky.
Also, current toast mechanics is insert-only, it does not support
updates (just to remind - the whole toasted value is marked dead
and new one is inserted during update), this is a subject to change.
And logical replication, as I mentioned earlier, does not have any
means for replicating toast diffs.
Which points to the need to (optionally) store the diff in the toast
as well when there are defined replication slots.
Once we have a way to actually do JSON(B) updates at SQL or function level.
We may even want to store the JSON in some base JSON + JSON_PATCH
format where we materialize at retrieval.
But this goes way beyond the current patch's scope. Though my design
should accommodate it nicely.
---
Hannu
On Tue, Jul 22, 2025 at 02:56:23PM +0200, Hannu Krosing wrote:
The high-level idea would be to any actual rewrite -- as opposed to
plain vacuum which frees empty space within the TOAST relation -- as
part of the vacuum of the main relation.
Another option would be to store a back-pointer to the heap tuple
inside the toast tuple and use that when rewriting, though it has its
own set of complexities.
Well. I would suggest to begin with simpler to begin with, finding
building pieces that can be relied on, and I tend to think that I've
sorted out the first basic one of these because we want to be able to
give the backend the possibility to understand more formats of
external on-dist TOAST pointers. A simple whole rewrite of the TOAST
engine is not something that can just happen in one day: that's a very
risky move, and we need to worry about backward-compatibility while
maintaining the legacy code. FWIW, I think that we should still move
on with the 8-byte implementation anyway with specific vartag_external
that can lead to variable-sized pointers (shorter if value is less
than UINT32_MAX, still stored in a int8 TOAST table), extending the
code so as it's possible to think about more on top of that. That's
basically risk-free if done right while taking the problem at its
root. That's also what Tom and Andres meant back in 2022 before the
problem drifted away to different issues, and that should allow the
addition of more compression methods if done correctly (quoted at the
top of this thread).
You have some interesting points around
detoast_external_attr() and detoast_attr_slice(), as far as I can see.
One point of the on-disk TOAST refactoring is that we should be able
to entirely avoid this level of redirection. I get that this is a
POC, of course, but it provides pointers that what I've done may not
be sufficient in terms of extensibility so that seems worth digging
into.I'm currently re-visiting our TOAST API patch set, there are some
good (in terms of simplicity and lightweightness) proposals, will mail
later.Sounds interesting.
If you have refactoring proposals, that could be considered
separately, yes.
Now, the reason why nothing has been done about the original problem
is the same reason as what's happening now on this thread: I am seeing
a lot of designs and assumptions for new solutions, without knowing
all the problems we are trying to solve or even if they actually make
sense at this level of the baclend engine. And there is little to no
argument made about backward-compatibility, which is also something
I've tried to design (single GUC for dump/restore/upgrade with TOAST
type based on attribute data type of the TOAST table, which could be
also a tid later on).
But this goes way beyond the current patch's scope. Though my design
should accommodate it nicely.
If you see independent useful pieces that are worth their own, nobody
is going to object to that. The trick is to find incremental pieces
good enough to be able to build upon with individual evaluations and
reviews, at least that's how I want to tackle the problem because I
would be the one who would be responsible for its maintenance by
default. Finding out these "simple" independent relevant pieces is
the hard part.
--
Michael
On Wed, Jul 09, 2025 at 08:20:31AM +0900, Michael Paquier wrote:
Sure, you could do that as well, but I suspect that we'll need the
steps of at least up to 0003 to be able to handle more easily multiple
external TOAST pointer types, or the code will be messier than it
currently is. :D
Please find attached a v3, that I have spent some time polishing to
fix the value ID problem of this thread. v2 had some conflicts, and
the CI previously failed with warning job (CI is green here now).
There are two things that have changed in this patch series:
- Addition of separate patch to rename varatt_external to
varatt_external_oid and VARTAG_ONDISK to VARTAG_ONDISK_OID, in 0003.
- In the refactoring to introduce the translation layer for external
ondisk TOAST pointers, aka 0004 in this set labelled v3, I was not
happy about the way we grab the vartag_external that gets assigned to
the TOAST varlena, and it depends on two factors in this patch set:
-- The type of TOAST table.
-- The value assigned. If, for example, we have a value less than 4
billion, we can make the TOAST pointer shorter. The interface of 0004
has been rethought to take that into consideration. That's the
function called toast_external_assign_vartag(), reused on heaptoast.c
to get TOAST_POINTER_SIZE as the compressibility threshold when
inserting a tuple and if it should be compressed.
As things stand, I am getting pretty happy with the patch set up to
0005 and how things are getting in shape for the interface, and I am
planning to begin applying this stuff up to 0005 in the next couple of
weeks.
This stuff introduces all the callbacks and the concept of
toast_external to get the refactoring into the tree first, which is
something that other patches that want to extend the compression
methods or the TID layer are going to need anyway. The rest of the
patch set is more a group shot, where they actually matter only when
the new vartag_external is added for the TOAST tables able to hold
8-byte values. As of now, we would need to add more one more
vartag_external:
- For values lower than 4 billion, perhaps we could just reuse the OID
vartag_external here? The code is modular now with
toast_external_assign_vartag(), and it is only a matter of using
VARTAG_ONDISK_OID if we have a value less than 4 billion. So there is
only one place in the code to change, and that's a one-line change
with the v3 patch set attached. And that's less duplication.
- The second one when we have more than 4 billion values.
--
Michael
Attachments:
On Fri, Aug 1, 2025 at 4:03 AM Michael Paquier <michael@paquier.xyz> wrote:
- Addition of separate patch to rename varatt_external to
varatt_external_oid and VARTAG_ONDISK to VARTAG_ONDISK_OID, in 0003.
Since you're already renaming things... ISTM "ondisk" has the potential for
confusion, assuming that at some point we'll have the ability to store
large datums directly in the filesystem (instead of breaking into chunks to
live in a relation). VARTAG_DURABLE might be a better option.
On Mon, Aug 04, 2025 at 02:37:19PM -0500, Jim Nasby wrote:
On Fri, Aug 1, 2025 at 4:03 AM Michael Paquier <michael@paquier.xyz> wrote:
- Addition of separate patch to rename varatt_external to
varatt_external_oid and VARTAG_ONDISK to VARTAG_ONDISK_OID, in 0003.Since you're already renaming things... ISTM "ondisk" has the potential for
confusion, assuming that at some point we'll have the ability to store
large datums directly in the filesystem (instead of breaking into chunks to
live in a relation). VARTAG_DURABLE might be a better option.
Hmm. I don't know about this one. Durable is an ACID property that
does not apply to all relation kinds. For example, take an unlogged
table: its data is not durable but its TOAST pointers would be marked
with a VARTAG_DURABLE. With that in mind ONDISK still sounds kind of
OK for me to use as a name.
--
Michael
On Fri, Aug 01, 2025 at 06:03:11PM +0900, Michael Paquier wrote:
Please find attached a v3, that I have spent some time polishing to
fix the value ID problem of this thread. v2 had some conflicts, and
the CI previously failed with warning job (CI is green here now).
Attached is a v4, due to conflicts mainly caused by the recent changes
in varatt.h done by e035863c9a04. This had an interesting side
benefit when rebasing, where I have been able to isolate most of the
knowledge related to the struct varatt_external (well
varatt_external_oid in the patch set) into toast_external.c, at the
exception of VARTAG_SIZE. That's done in a separate patch, numbered
0006.
The rest of the patch set has a couple of adjustements to document
better the new API expectations for toast_external.{c,h}, comment
adjustments, some more beautification changes, some indentation
applied, etc.
As things stand, I am getting pretty happy with the patch set up to
0005 and how things are getting in shape for the interface, and I am
planning to begin applying this stuff up to 0005 in the next couple of
weeks.
As of this patch set, this means a new target of 0006, to get the
TOAST code refactored so as it is able to support more than 1 type of
external on-disk pointer with the 8-byte value problem in scope. Any
comments?
--
Michael
Attachments:
Hi michael,
I have a question regarding TOAST pointer handling.
As I understand, in the current design, each attribute in a HeapTuple
can have its own TOAST pointer, and TOAST pointers are possible only
for top-level attributes.
Would it make sense to maintain an array for ttc_toast_pointer_size in
ToastTupleContext, allowing us to estimate the size per attribute
based on compression or other criteria?
This approach could make the logic more generic in my opinion, but it
would require changes in toast_tuple_find_biggest_attribute and other
places.
I’d like to hear your thoughts on this.
On Fri, Aug 8, 2025 at 12:52 AM Michael Paquier <michael@paquier.xyz> wrote:
On Fri, Aug 01, 2025 at 06:03:11PM +0900, Michael Paquier wrote:
Please find attached a v3, that I have spent some time polishing to
fix the value ID problem of this thread. v2 had some conflicts, and
the CI previously failed with warning job (CI is green here now).Attached is a v4, due to conflicts mainly caused by the recent changes
in varatt.h done by e035863c9a04. This had an interesting side
benefit when rebasing, where I have been able to isolate most of the
knowledge related to the struct varatt_external (well
varatt_external_oid in the patch set) into toast_external.c, at the
exception of VARTAG_SIZE. That's done in a separate patch, numbered
0006.The rest of the patch set has a couple of adjustements to document
better the new API expectations for toast_external.{c,h}, comment
adjustments, some more beautification changes, some indentation
applied, etc.As things stand, I am getting pretty happy with the patch set up to
0005 and how things are getting in shape for the interface, and I am
planning to begin applying this stuff up to 0005 in the next couple of
weeks.As of this patch set, this means a new target of 0006, to get the
TOAST code refactored so as it is able to support more than 1 type of
external on-disk pointer with the 8-byte value problem in scope. Any
comments?
--
Michael
--
Nikhil Veldanda
Michael Paquier <michael@paquier.xyz> writes:
Attached is a v4, due to conflicts mainly caused by the recent changes
in varatt.h done by e035863c9a04.
I found some time to look at the v4 patchset, and have a bunch of
comments of different sizes.
0001:
I'm good with widening all these values to 64 bits, but I wonder
if it's a great idea to use unadorned "uint64" as the data type.
That's impossible to grep for, doesn't convey anything much about
what the variables are, etc. I'm tempted to propose instead
inventing a typedef "BigOid" or some such name (bikeshedding
welcome). The elog's could be handled with, say,
#define PRIBO PRIu64
This suggestion isn't made with the idea that we'd someday switch
to an even wider type, but just with the idea of making it clearer
what these values are being used for. When you see "Oid" you
know it's some sort of object identifier, and I'm sad to give
that up here.
This hunk is flat out buggy:
@@ -1766,6 +1774,7 @@ check_tuple_attribute(HeapCheckContext *ctx)
return true;
/* It is external, and we're looking at a page on disk */
+ toast_pointer_valueid = toast_pointer.va_valueid;
/*
* Must copy attr into toast_pointer for alignment considerations
toast_pointer isn't initialized at this point. I see you fixed that
in 0004, but it doesn't help to split the patch series if the
intermediate steps are broken.
0002:
OK, although some of the hunks in 0001 seem to belong here.
0003:
No objection to the struct renaming, but does this go far
enough? Aren't we going to need to rename TOAST_POINTER_SIZE
to TOAST_OID_POINTER_SIZE, etc, so that we can have similar
symbols for the wider version? I'm suspicious of not renaming
the functions that work on these, too. (Oh, it looks like you
did some of that in later parts.)
BTW, given that varatt.h has "typedef struct varatt_external {...}
varatt_external", I'm certain that the vast majority of uses of
"struct varatt_external" could be shortened to "varatt_external".
And I think we should do that, because using "struct foo" not "foo"
is not project style. This patch would be a fine time to do that.
0004:
Shouldn't VARATT_EXTERNAL_GET_POINTER go away entirely?
It looks to me like every use of that should be replaced by
toast_external_info_get_data().
I wonder if we shouldn't try to get rid of the phraseology "standard
TOAST pointer", and instead write something like "short TOAST pointer"
or "small TOAST pointer". These aren't really going to be more
"standard" than the wider ones, IMO.
I don't like replacing "va_valueid" with just "value". Dropping
the "id" is not an improvement, because now a reader might be
confused about whether this is somehow the actual value of the
toasted datum.
Functions in toast_external.c are under-documented. Some lack
any header comment whatever, and others have one that doesn't
actually say what they do.
I kind of wonder whether the run-time indirection this design causes
is going to be a performance problem. Perhaps not, given the expenses
involved in accessing a toasted value, but it has a faint feeling
of overdesign.
It looks like a lot of this code would just flat out dump core if
passed an invalid vartag value. Probably not good enough, given
that we might look at corrupted data.
0005:
Nice!
0006:
get_new_value is very confusingly documented. Is the indexid that of
the toastrel's index? What attribute is the attnum for? Why should
the caller need to provide either, rather than get_new_value knowing
that internally? Also, again you are using "value" for something
that is not the value of the to-be-toasted datum. I'd suggest
something more like "get_new_toast_identifier".
I kind of feel that treating this operation as a toast_external_info
method is the wrong thing, since where it looks like you are going
is to fetch an identifier and only then decide which vartag you
need to use. That ugliness comes out here:
+ /*
+ * Retrieve the external TOAST information, with the value still unknown.
+ * We need to do this again once we know the actual value assigned, to
+ * define the correct vartag_external for the new TOAST tuple.
+ */
0007:
Surely we do not need the cast in this:
+ return murmurhash64((int64) DatumGetInt64(datum));
I see you copied that from int4hashfast, but that's not good
style either.
More generally, though, why do we need catcache support for int8?
There aren't any caches on toast chunks, and I doubt we are going to
introduce any. Sure there might be a need for this down the road,
but I don't see that this patch series is the time to add it.
0008:
I totally hate the idea of introducing a GUC for this. This should be
user-transparent. Or if it isn't user-transparent, a GUC is still
the wrong thing; some kind of relation option would be more sensible.
0009:
I do not love this either. I think the right thing is just to widen
the existing nextOid counter to 64 bits, and increment it in-memory as
64 bits, and then return either all 64 bits when a 64-bit Oid is asked
for, or just the low-order 32 bits when a 32-bit OID is asked for
(with appropriate hacking for 32-bit wraparound). Having a separate
counter will make it too easy for the same value to be produced by
nearby calls to the 32-bit and 64-bit Oid generators, which is likely
a bad thing, if only because of potential confusion.
0010:
OK
0011:
Not reviewing this yet, because I disagree with the basic design.
I didn't look at the later patches either.
regards, tom lane
Hi,
On 2025-08-08 15:32:16 -0400, Tom Lane wrote:
Michael Paquier <michael@paquier.xyz> writes:
Attached is a v4, due to conflicts mainly caused by the recent changes
in varatt.h done by e035863c9a04.I found some time to look at the v4 patchset, and have a bunch of
comments of different sizes.0001:
I'm good with widening all these values to 64 bits, but I wonder
if it's a great idea to use unadorned "uint64" as the data type.
That's impossible to grep for, doesn't convey anything much about
what the variables are, etc. I'm tempted to propose instead
inventing a typedef "BigOid" or some such name (bikeshedding
welcome). The elog's could be handled with, say,
#define PRIBO PRIu64
This suggestion isn't made with the idea that we'd someday switch
to an even wider type, but just with the idea of making it clearer
what these values are being used for. When you see "Oid" you
know it's some sort of object identifier, and I'm sad to give
that up here.
I think we should consider introducing Oid64, instead of having a toast
specific type. I doubt this is the last place that we'll want to use 64 bit
wide types for cataloged entities.
0004:
Shouldn't VARATT_EXTERNAL_GET_POINTER go away entirely?
It looks to me like every use of that should be replaced by
toast_external_info_get_data().I wonder if we shouldn't try to get rid of the phraseology "standard
TOAST pointer", and instead write something like "short TOAST pointer"
or "small TOAST pointer". These aren't really going to be more
"standard" than the wider ones, IMO.I don't like replacing "va_valueid" with just "value". Dropping
the "id" is not an improvement, because now a reader might be
confused about whether this is somehow the actual value of the
toasted datum.Functions in toast_external.c are under-documented. Some lack
any header comment whatever, and others have one that doesn't
actually say what they do.I kind of wonder whether the run-time indirection this design causes
is going to be a performance problem. Perhaps not, given the expenses
involved in accessing a toasted value, but it has a faint feeling
of overdesign.
+1
0008:
I totally hate the idea of introducing a GUC for this. This should be
user-transparent. Or if it isn't user-transparent, a GUC is still
the wrong thing; some kind of relation option would be more sensible.
Agreed. I think we need backward compatibility for pg_upgrade purposes, but
that doesn't require a GUC, it just requires an option when creating tables in
binary upgrade mode.
0009:
I do not love this either. I think the right thing is just to widen
the existing nextOid counter to 64 bits, and increment it in-memory as
64 bits, and then return either all 64 bits when a 64-bit Oid is asked
for, or just the low-order 32 bits when a 32-bit OID is asked for
(with appropriate hacking for 32-bit wraparound). Having a separate
counter will make it too easy for the same value to be produced by
nearby calls to the 32-bit and 64-bit Oid generators, which is likely
a bad thing, if only because of potential confusion.
I'm not convinced that the global counter, be it a 32 or a 64 bit, approach
has merit in general, and I'm rather sure it's the wrong thing for toast
values. There's no straightforward path to move away from the global counter
for plain oids, but I would suggest simply not using the global oid counter
for toast IDs.
A large portion of the cases I've seen where toast ID assignments were a
problem were when the global OID counter wrapped around due to activity on
*other* tables (and/or temporary table creation). If you instead had a
per-toast-table sequence for assigning chunk IDs, that problem would largely
vanish.
With 64bit toast IDs we shouldn't need to search the index for a
non-conflicting toast IDs, there can't be wraparounds (we'd hit wraparound of
LSNs well before that and that's not practically reachable).
Greetings,
Andres Freund
On Fri, Aug 08, 2025 at 05:02:46PM -0400, Andres Freund wrote:
On 2025-08-08 15:32:16 -0400, Tom Lane wrote:
I found some time to look at the v4 patchset, and have a bunch of
comments of different sizes.
Thanks for the input. This patch set exists as a result of a
discussion between you and Andres back in 2022.
(Replying here regarding the points that Andres is quoting, I'll add a
second mail for some of the other things later.)
I'm good with widening all these values to 64 bits, but I wonder
if it's a great idea to use unadorned "uint64" as the data type.
That's impossible to grep for, doesn't convey anything much about
what the variables are, etc. I'm tempted to propose instead
inventing a typedef "BigOid" or some such name (bikeshedding
welcome). The elog's could be handled with, say,
#define PRIBO PRIu64
This suggestion isn't made with the idea that we'd someday switch
to an even wider type, but just with the idea of making it clearer
what these values are being used for. When you see "Oid" you
know it's some sort of object identifier, and I'm sad to give
that up here.I think we should consider introducing Oid64, instead of having a toast
specific type. I doubt this is the last place that we'll want to use 64 bit
wide types for cataloged entities.
Works for me for the grepping argument. Using "64" as a number of
bits in the type name sounds a bit strange to me, not in line with
what's done with bigint/int8 or float, where we use bytes. Naming a
dedicated type "bigoid" with a structure named BigOid behind that's
used for TOAST or in the future for other code paths is OK by me.
AFAIK, the itchy point with unsigned 64b would be the casting, but my
take would be to introduce no casts in the first implementation,
particularly for the bigint -> bigoid case.
0004:
Shouldn't VARATT_EXTERNAL_GET_POINTER go away entirely?
It looks to me like every use of that should be replaced by
toast_external_info_get_data().
Indirect toast pointers still wanted it. But perhaps this could just
be renamed VARATT_EXTERNAL_INDIRECT_GET_POINTER() or something like
that if jt's used only for TOAST pointers?
I wonder if we shouldn't try to get rid of the phraseology "standard
TOAST pointer", and instead write something like "short TOAST pointer"
or "small TOAST pointer". These aren't really going to be more
"standard" than the wider ones, IMO.I don't like replacing "va_valueid" with just "value". Dropping
the "id" is not an improvement, because now a reader might be
confused about whether this is somehow the actual value of the
toasted datum.
Okay, sure.
Functions in toast_external.c are under-documented. Some lack
any header comment whatever, and others have one that doesn't
actually say what they do.
Okay. Will work on that. I am not sure if it's worth doing yet, it
does not seem like there's a clear agreement about patch 0004 and the
toast_external business I am proposing.
I kind of wonder whether the run-time indirection this design causes
is going to be a performance problem. Perhaps not, given the expenses
involved in accessing a toasted value, but it has a faint feeling
of overdesign.
toast_external_info_get_data() is called in 8 places as of the patch
set:
1) Twice for amcheck.
2) Twice for reorderbuffer.
3) Three times in detoast.c
3-1) When fetching a value in full, toast_fetch_datum(), which goes
through the slice API.
3-2) toast_fetch_datum_slice(), to retrieve a slice of the toasted
data.
3-3) detoast_attr_slice(), falling back to the slice fetch.
4) Twice in toast_internals.c:
4-1) Saving a datum.
4-2) Deleting a datum.
And if you see here, upthread, I've defined the worst case as
retrieving a minimal toasted slice stored uncompressed, to measure the
cost of the conversion to this toast_external, without seeing any
effects, the btree conflict lookups doing most of the work:
/messages/by-id/aGzLiDUB_18-8aVQ@paquier.xyz
Thoughts and counter-arguments are welcome, of course.
+1
Well, one reason why this design exists is a point from 2023, made by
Robert H., around here:
/messages/by-id/CA+TgmoaVcjUkmtWdc_9QjBzvSShjDBYk-5XFNaOvYLgGROjJMA@mail.gmail.com
The argument is that it's hard to miss for the existing code and
extensions how a new vartag should be handled. Inventing a new layer
means that extensions and existing code need to do a switch once, then
they cannot really miss be missed when we add a new vartag because the
from/to indirection between the toast_external layer and the varlenas
is handled in its own place.
A second reason why I've used this design is the problem related to
compression IDs, where the idea would be to add a new vartag for
zstandard for example. This still needs more work due to the existing
limitations that we currently have with the CompressionToastId and its
limitation to four values. The idea is that the deserialization of
this data into this toast_external proposal makes the future additions
easier. It would be of course more brutal and more efficient to just
extend the code paths where the toast_external layer (I mean where
VARATT_IS_EXTERNAL_ONDISK() is used currently) with an extra branch
made of a VARATT_IS_EXTERNAL_ONDISK_BIGOID(), but I've decided to
digest the argument from Robert instead, where I'm aiming at isolating
most of the code related to on-disk external TOAST pointers into a
single file, named toast_external.c here.
I'm OK with the "brutal" method if the patch presented here is
thought as overengineered and overdesigned, just wanted to say that
there are a couple of reasons behind these choices. It's great that
both of you are able to take some time to comment on these choices.
If you disagree with this method and just say that we should go ahead
with a more direct VARATT-like method, that's also fine by me as it
would still solve the value limitation problem. And that's primarily
what I want to solve here as it's a pain for some in the field.
0008:
I totally hate the idea of introducing a GUC for this. This should be
user-transparent. Or if it isn't user-transparent, a GUC is still
the wrong thing; some kind of relation option would be more sensible.Agreed. I think we need backward compatibility for pg_upgrade purposes, but
that doesn't require a GUC, it just requires an option when creating tables in
binary upgrade mode.
So, you basically mean that we should just make the 8-byte case the
default for all the new tables created on a new cluster at upgrade,
and just carry across upgrades the TOAST tables with OID values? If
this stuff uses a reloption or an option at DDL level, that's going to
need some dump/restore parts. Not sure that I'm betting the full
picture of what the default behavior should be. I have read on the
2022 thread where both of you have discussed this issue a point about
a "legacy" mode, to give users a way to get the old behavior if they
wanted. A GUC fit nicely around that idea, because it is them
possible to choose how all tables should behave, or perhaps I just did
not design correctly what you meant through that.
0009:
I do not love this either. I think the right thing is just to widen
the existing nextOid counter to 64 bits, and increment it in-memory as
64 bits, and then return either all 64 bits when a 64-bit Oid is asked
for, or just the low-order 32 bits when a 32-bit OID is asked for
(with appropriate hacking for 32-bit wraparound). Having a separate
counter will make it too easy for the same value to be produced by
nearby calls to the 32-bit and 64-bit Oid generators, which is likely
a bad thing, if only because of potential confusion.
I was wondering about doing that as well. FWIW, this approach works
for me, eating only 4 more bytes in the control file.
I'm not convinced that the global counter, be it a 32 or a 64 bit, approach
has merit in general, and I'm rather sure it's the wrong thing for toast
values. There's no straightforward path to move away from the global counter
for plain oids, but I would suggest simply not using the global oid counter
for toast IDs.A large portion of the cases I've seen where toast ID assignments were a
problem were when the global OID counter wrapped around due to activity on
*other* tables (and/or temporary table creation). If you instead had a
per-toast-table sequence for assigning chunk IDs, that problem would largely
vanish.
I've thought about this piece already and the reason why I have not
taken this approach is mostly simplicity. It's mentioned upthread,
with potentially pieces referring to sequence AMs and "toast"
sequences that could be optimized for the purpose of TOAST relations:
/messages/by-id/aG2iY26tXj1_MHfH@paquier.xyz
Yep. All the requirement boxes are filled with a sequence assigned to
the TOAST relation. Using a single 8-byte counter is actually
cheaper, even if the overhead when generating the TOAST tuples are in
the btree lookups and the insertions themselves. It is also possible
to do that as a two-step process:
- First have TOAST relations with 8-byte values.
- Add local sequences later on.
What matters is having the code in place to check for index lookups
upon values conflicting when a new sequence is attached to an existing
TOAST relation. If we assign sequences from the start, that would not
matter, of course. Another reason on top of the simplicity piece
behind the control file is that it's dead cheap to acquire a new
value when saving a chunk externally.
--
Michael
On Fri, Aug 08, 2025 at 03:32:16PM -0400, Tom Lane wrote:
I found some time to look at the v4 patchset, and have a bunch of
comments of different sizes.
Thanks for the comments. I have replied to some of the items here:
/messages/by-id/aJbygEBqJgmLS0wF@paquier.xyz
Will try to answer the rest here.
0001:
I'm good with widening all these values to 64 bits, but I wonder
if it's a great idea to use unadorned "uint64" as the data type.
That's impossible to grep for, doesn't convey anything much about
what the variables are, etc. I'm tempted to propose instead
inventing a typedef "BigOid" or some such name (bikeshedding
welcome). The elog's could be handled with, say,
#define PRIBO PRIu64
This suggestion isn't made with the idea that we'd someday switch
to an even wider type, but just with the idea of making it clearer
what these values are being used for. When you see "Oid" you
know it's some sort of object identifier, and I'm sad to give
that up here.
Already mentioned on the other message, but I'm OK with a "bigoid"
type and a BigOid type. Honestly, I'm not sure about a replacement
for PRIu64, as we don't really do that for Oids with %u.
toast_pointer isn't initialized at this point. I see you fixed that
in 0004, but it doesn't help to split the patch series if the
intermediate steps are broken.
Oops. FWIW, I've rebased this patch set much more than 4 times. It
looks like I've messed up some of the diffs. Sorry about that.
0003:
No objection to the struct renaming, but does this go far
enough? Aren't we going to need to rename TOAST_POINTER_SIZE
to TOAST_OID_POINTER_SIZE, etc, so that we can have similar
symbols for the wider version? I'm suspicious of not renaming
the functions that work on these, too. (Oh, it looks like you
did some of that in later parts.)
Yeah. I've stuck that into the later parts where the int8 bits have
been added. Perhaps I've not been ambitious enough.
BTW, given that varatt.h has "typedef struct varatt_external {...}
varatt_external", I'm certain that the vast majority of uses of
"struct varatt_external" could be shortened to "varatt_external".
And I think we should do that, because using "struct foo" not "foo"
is not project style. This patch would be a fine time to do that.
Good point.
0004:
Shouldn't VARATT_EXTERNAL_GET_POINTER go away entirely?
It looks to me like every use of that should be replaced by
toast_external_info_get_data().
Point about indirect pointers mentioned on the other message, where we
could rename VARATT_EXTERNAL_GET_POINTER to
VARATT_EXTERNAL_INDIRECT_GET_POINTER or equivalent to limit the
confusion?
I wonder if we shouldn't try to get rid of the phraseology "standard
TOAST pointer", and instead write something like "short TOAST pointer"
or "small TOAST pointer". These aren't really going to be more
"standard" than the wider ones, IMO.
I'm seeing one place in arrayfuncs.c and one in detoast.h using this
term on HEAD. I would do simpler: no standard and no short, just with
a removal of the "standard" part if I were to change something.
I don't like replacing "va_valueid" with just "value". Dropping
the "id" is not an improvement, because now a reader might be
confused about whether this is somehow the actual value of the
toasted datum.
Okay, sure. One reason behind the field renaming was also to track
down all the areas in need to be updated.
Functions in toast_external.c are under-documented. Some lack
any header comment whatever, and others have one that doesn't
actually say what they do.
Okay.
I kind of wonder whether the run-time indirection this design causes
is going to be a performance problem. Perhaps not, given the expenses
involved in accessing a toasted value, but it has a faint feeling
of overdesign.
Mentioned on the other message, linked to this message:
/messages/by-id/aGzLiDUB_18-8aVQ@paquier.xyz
It looks like a lot of this code would just flat out dump core if
passed an invalid vartag value. Probably not good enough, given
that we might look at corrupted data.
Right. That's where we would need an error path in
toast_external_get_info() if nothing is found. That's a cheap
defense, I guess. Good thing is that the lookups are centralized in a
single code path, at least with this design.
I kind of feel that treating this operation as a toast_external_info
method is the wrong thing, since where it looks like you are going
is to fetch an identifier and only then decide which vartag you
need to use. That ugliness comes out here:+ /* + * Retrieve the external TOAST information, with the value still unknown. + * We need to do this again once we know the actual value assigned, to + * define the correct vartag_external for the new TOAST tuple. + */
Yeah, I was feeling a bit depressed when writing the concept of what a
"default" method should be before assigning the value, but that was
still feeling right.
0007:
Surely we do not need the cast in this:
+ return murmurhash64((int64) DatumGetInt64(datum));
I see you copied that from int4hashfast, but that's not good
style either.
Okay.
More generally, though, why do we need catcache support for int8?
There aren't any caches on toast chunks, and I doubt we are going to
introduce any. Sure there might be a need for this down the road,
but I don't see that this patch series is the time to add it.
I recall that this part was required for the value conflict lookups
and also the TOAST value retrievals, so we are going to need it.
0008:
I totally hate the idea of introducing a GUC for this. This should be
user-transparent. Or if it isn't user-transparent, a GUC is still
the wrong thing; some kind of relation option would be more sensible.
Per other email, I'm not sure what you entirely mean here: should 8B
values be the default with existing TOAST OID values kept as they are
across upgrades? Or something else?
0009:
I do not love this either. I think the right thing is just to widen
the existing nextOid counter to 64 bits, and increment it in-memory as
64 bits, and then return either all 64 bits when a 64-bit Oid is asked
for, or just the low-order 32 bits when a 32-bit OID is asked for
(with appropriate hacking for 32-bit wraparound). Having a separate
counter will make it too easy for the same value to be produced by
nearby calls to the 32-bit and 64-bit Oid generators, which is likely
a bad thing, if only because of potential confusion.
Acknowledged on other message, fine my be.
0011:
Not reviewing this yet, because I disagree with the basic design.
I didn't look at the later patches either.
Without an agreement about the design choices related to the first
patches up to 0006, I doubt there this is any need to review any of
the follow-up patches yet because the choices of the first patches
influence the next patches in the series. Thanks for the feedback!
--
Michael
On Fri, Aug 08, 2025 at 09:09:20AM -0700, Nikhil Kumar Veldanda wrote:
I have a question regarding TOAST pointer handling.
As I understand, in the current design, each attribute in a HeapTuple
can have its own TOAST pointer, and TOAST pointers are possible only
for top-level attributes.Would it make sense to maintain an array for ttc_toast_pointer_size in
ToastTupleContext, allowing us to estimate the size per attribute
based on compression or other criteria?This approach could make the logic more generic in my opinion, but it
would require changes in toast_tuple_find_biggest_attribute and other
places.I’d like to hear your thoughts on this.
Yes, that's some complexity that you would need if plugging in more
vartags if these are related to compression methods, and also
something that we may need if we use multiple vartags depending on a
value ID assigned (aka for the TOAST table with 8-byte values, short
Datum if we have a value less than 4 billion with one vartag, longer
Datum if value more than 4 billion with a second vartag).
For the 4-byte vs 8-byte value case, I was wondering if we should be
simpler and less optimistic and assume that we are only going to use
the wider one depending on the type of chunk_id in the TOAST table, as
a minimum threshold when checking if a tuple should be toasted or not.
Perhaps my vision of things is too simple, but I cannot think about a
good reason that would justify making this code more complicated than
it already is.
--
Michael
Hi!
Michael, I'm looking into your patch set for Sequence AMs.
I'm not very happy with single global OID counter for TOAST
tables, so I've decided to investigate your work on sequences
and try to adapt it to TOAST tables. Would report my progress.
Some time ago I've proposed individual TOAST counter
to get rid of table lookups for unused ID, but later decided
that neither reloptions nor pg_class are not a good place
to store it due to lots of related catalog updates including
locks, so sequences look much more useful for this,
as you wrote above.
Personally I'm not too happy with
toast_external_infos[TOAST_EXTERNAL_INFO_SIZE]
array and for me it seems that lookups using VARTAG
should be straightened out with more error-proof, currently
using wrong vartags would result in core dumps.
On Sat, Aug 9, 2025 at 10:40 AM Michael Paquier <michael@paquier.xyz> wrote:
For the 4-byte vs 8-byte value case, I was wondering if we should be
simpler and less optimistic and assume that we are only going to use
the wider one depending on the type of chunk_id in the TOAST table, as
a minimum threshold when checking if a tuple should be toasted or not.
Perhaps my vision of things is too simple, but I cannot think about a
good reason that would justify making this code more complicated than
it already is.
--
Michael
--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/
On Fri, Aug 8, 2025 at 4:03 PM Andres Freund <andres@anarazel.de> wrote:
I'm not convinced that the global counter, be it a 32 or a 64 bit, approach
has merit in general, and I'm rather sure it's the wrong thing for toast
values. There's no straightforward path to move away from the global
counter
for plain oids, but I would suggest simply not using the global oid counter
for toast IDs.A large portion of the cases I've seen where toast ID assignments were a
problem were when the global OID counter wrapped around due to activity on
*other* tables (and/or temporary table creation). If you instead had a
per-toast-table sequence for assigning chunk IDs, that problem would
largely
vanish.
That's been my experience as well. I was actually toying with the idea of
simply switching from OIDs to per-table counters when I came across this,
specifically to address the problem of OID wraparound induced performance
problems.
On Wed, Aug 13, 2025 at 03:06:16PM -0500, Jim Nasby wrote:
On Fri, Aug 8, 2025 at 4:03 PM Andres Freund <andres@anarazel.de> wrote:
I'm not convinced that the global counter, be it a 32 or a 64 bit, approach
has merit in general, and I'm rather sure it's the wrong thing for toast
values. There's no straightforward path to move away from the global
counter
for plain oids, but I would suggest simply not using the global oid counter
for toast IDs.A large portion of the cases I've seen where toast ID assignments were a
problem were when the global OID counter wrapped around due to activity on
*other* tables (and/or temporary table creation). If you instead had a
per-toast-table sequence for assigning chunk IDs, that problem would
largely
vanish.That's been my experience as well. I was actually toying with the idea of
simply switching from OIDs to per-table counters when I came across this,
specifically to address the problem of OID wraparound induced performance
problems.
Yep, that exists. My original use case is not this one, where I have
a class of users able to reach the OID limit on a single table even
within the 32TB limit, meaning that TOAST blobs are large enough to
reach the external threshold, still small enough to have 4 billions of
them within the single-table limit.
Implementation-wise, switching to 8B would solve both things, and
it's so much cheaper to grab a value from a single source. I don't
really object to using 4B local values, but that does not really solve
the original issue that's causing the efforts I've initiated on this
thread. But yes, I'm also trying to implement things so as such an
addition like this one would be slightly easier. It just does not
have to be me doing it ;)
--
Michael
On Sat, Aug 09, 2025 at 04:31:05PM +0900, Michael Paquier wrote:
Without an agreement about the design choices related to the first
patches up to 0006, I doubt there this is any need to review any of
the follow-up patches yet because the choices of the first patches
influence the next patches in the series. Thanks for the feedback!
So, please find attached a rebased version set that addresses all the
feedback that has been provided, hopefully:
- 0001 is the requested data type for 64b OIDs, which is called here
oid8 for the data type and Oid8 for the internal name. I'm not
completely wedded to these names, but anyway. This comes with a
package that make this data type useful for other purposes than this
patch set, basically everything where I've wanted toys for TOAST:
hash/btree operators, min/max, a couple of casts (oid -> oid8,
integers, etc). The rest of the patch set builds upon that.
- 0002 changes a couple of APIs related to TOAST to use Oid8 (in
previous iterations this was uint64). In the last version, the patch
was mixing parts related to max_chunk_size, that should be split
cleanly now.
- 0003 is a preparatory change, reducing the footprint of
TOAST_MAX_CHUNK_SIZE, because we'd want to have something that changes
depending on the vartag we're dealing with.
- 0004 renames things around vartag_external to vartag_external_oid.
Based on the previous feedback, this is more aggressive now with the
renames, renaming a few more things like MAX_CHUNK_SIZE,
TOAST_POINTER_SIZE, etc.
- 0005 is the refactoring piece, that introduces toast_external.c. I
have included the previous feedback, cleaning up the code, adding more
documentation, adding a failure fallback if the vartag given in input
is incorrect, to serve as a corruption defense. And a few more
things.
- 0006 is a follow-up cleanup of varatt.h. It would make more sense
to merge that with 0005, but I've decided to keep that separate as it
makes the review slightly cleaner.
- 0007 is related to the previous feedback, and a follow-up cleanup
that could be merged with one of the previous steps. It changes the
remaining pieces of VARATT_EXTERNAL_GET_POINTER to be related to
indirect TOAST pointers, as it's only used there.
- 0008 is a previous piece, switching pg_column_toast_chunk_id() to
use oid8 as result instead.
- 0009 adds the catcache bits for OID8OID, required for toast values
lookups and deletions, in the upcoming patches. Same as previous.
- 0010 is a new piece, based on the previous feedback. This is an
independent piece that adds an extra step in binary upgrades to be
able to pass down the attribute type of chunk_id. Without oid8
support for the values, this does not bring much of course, but it's
less code churn in the follow-up patches.
- 0011 adds a new piece, based on the previous feedback, where
the existing nextOid is enlarged to 8 bytes, keeping compatibility for
the existing 4-byte OIDs where we don't want values within the [0,
FirstNormalObjectId] range. The next patches rely on the new API to
get 8-byte values.
- 0012 is a new piece requested: reloption able to define the
attribute type of chunk_id when a TOAST relation is initially created
for a table, replacing the previous GUC approach which is now gone.
- 0013 adds support for oid8 in TOAST relations, extending the new
reloption and the binary upgrade paths previously introduced.
- 0014 adds some tests with toast OID8 case, leaving some relations
around for pg_upgrade, covering the binary upgrade path for oid8.
- 0015 is the last patch, adding a new vartag for OID8. It would make
most sense to merge that with 0013, perhaps, the split is here to ease
reviews.
I have dropped the amcheck test patch for now, which was fun but it's
not really necessary for the "basics". I have done also more tests,
playing for example with pg_resetwal, installcheck and pg_upgrade
scenarios. I am wondering if it would be worth doing a pg_resetwal in
the node doing an installcheck on the instance to be upgraded, bumping
its next OID to be much larger than 4 billion, actually..
--
Michael
Attachments:
On Thu, Aug 14, 2025 at 02:49:06PM +0900, Michael Paquier wrote:
I have dropped the amcheck test patch for now, which was fun but it's
not really necessary for the "basics". I have done also more tests,
playing for example with pg_resetwal, installcheck and pg_upgrade
scenarios. I am wondering if it would be worth doing a pg_resetwal in
the node doing an installcheck on the instance to be upgraded, bumping
its next OID to be much larger than 4 billion, actually..
Four patches had conflicts with 748caa9dcb68, so rebased as v6.
--
Michael
Attachments:
On Tue, Sep 16, 2025 at 02:14:43PM +0900, Michael Paquier wrote:
On Thu, Aug 14, 2025 at 02:49:06PM +0900, Michael Paquier wrote:
I have dropped the amcheck test patch for now, which was fun but it's
not really necessary for the "basics". I have done also more tests,
playing for example with pg_resetwal, installcheck and pg_upgrade
scenarios. I am wondering if it would be worth doing a pg_resetwal in
the node doing an installcheck on the instance to be upgraded, bumping
its next OID to be much larger than 4 billion, actually..Four patches had conflicts with 748caa9dcb68, so rebased as v6.
There were a few conflicts, so here is a rebased v7, moving the patch
to the next CF. I have been sitting on this patch for six weeks for
the moment.
Tom, you are registered as a reviewer of the patch. The point of
contention of the patch, where I see there is no consensus yet, is if
my approach of using a redirection for the external TOAST pointers
with a new layer to facilitate the addition of more vartags (aka the
64b value vartag proposed here, concept that could also apply to
compression methods later on) is acceptable. Moving to a different
approach, like the "brutal" one I am naming upthread where the
redirection layer is replaced by changes in all the code paths that
need to be touched, would be of course cheaper at runtime as there
would be no more redirection, but the maintenance would be a nightmare
the more vartags we add, and I have some plans for more of these.
Doing the switch would be a few hours work, so that would not be a big
deal, I guess. The important part is an agreement about the approach,
IMO.
Please note that the latest patch set also uses a new reloption to
control if 8 byte TOAST values are set (not a GUC), binary upgrades
are handled with binary_upgrade.h functions, and the 8-byte OID value
uses a dedicated data type, as reviewed upthread.
One thing that shows up in the last patch of the set is mentioned in
an XXX comment in toast_external_assign_vartag(), if it would be
better to use a 4-byte vartag even if we have a 8-byte value in the
TOAST table to make the TOAST pointers shorter when a value is less
than 4 billions. Not sure how much to do about this one, and there's
little point in doing this change without the earlier infrastructure
patches if the approach taken is thought as OK, as well.
--
Michael
Attachments:
On Tue, Sep 30, 2025 at 03:26:14PM +0900, Michael Paquier wrote:
There were a few conflicts, so here is a rebased v7, moving the patch
to the next CF. I have been sitting on this patch for six weeks for
the moment.
Attached is a rebased v8, fixing a couple of conflicts.
Tom, you are registered as a reviewer of the patch. The point of
contention of the patch, where I see there is no consensus yet, is if
my approach of using a redirection for the external TOAST pointers
with a new layer to facilitate the addition of more vartags (aka the
64b value vartag proposed here, concept that could also apply to
compression methods later on) is acceptable. Moving to a different
approach, like the "brutal" one I am naming upthread where the
redirection layer is replaced by changes in all the code paths that
need to be touched, would be of course cheaper at runtime as there
would be no more redirection, but the maintenance would be a nightmare
the more vartags we add, and I have some plans for more of these.
Doing the switch would be a few hours work, so that would not be a big
deal, I guess. The important part is an agreement about the approach,
IMO.
This point still got no reply. It would be nice to do something for
this release regarding this old issue, IMO..
--
Michael
Attachments:
Hi Michael,
On Tue, Nov 25, 2025 at 8:54 PM Michael Paquier <michael@paquier.xyz> wrote:
Tom, you are registered as a reviewer of the patch. The point of
contention of the patch, where I see there is no consensus yet, is if
my approach of using a redirection for the external TOAST pointers
with a new layer to facilitate the addition of more vartags (aka the
64b value vartag proposed here, concept that could also apply to
compression methods later on) is acceptable. Moving to a different
approach, like the "brutal" one I am naming upthread where the
redirection layer is replaced by changes in all the code paths that
need to be touched, would be of course cheaper at runtime as there
would be no more redirection, but the maintenance would be a nightmare
the more vartags we add, and I have some plans for more of these.
Doing the switch would be a few hours work, so that would not be a big
deal, I guess. The important part is an agreement about the approach,
IMO.This point still got no reply. It would be nice to do something for
this release regarding this old issue, IMO..
--
Thanks for the updated v8. I’m also looking forward to feedback on the
vartag approach. My work on adding ZSTD compression depends on this
design decision, so consensus here will help me proceed in the right
direction.
--
Nikhil Veldanda
On Tue, Nov 25, 2025 at 11:54 PM Michael Paquier <michael@paquier.xyz> wrote:
Tom, you are registered as a reviewer of the patch. The point of
contention of the patch, where I see there is no consensus yet, is if
my approach of using a redirection for the external TOAST pointers
with a new layer to facilitate the addition of more vartags (aka the
64b value vartag proposed here, concept that could also apply to
compression methods later on) is acceptable. Moving to a different
approach, like the "brutal" one I am naming upthread where the
redirection layer is replaced by changes in all the code paths that
need to be touched, would be of course cheaper at runtime as there
would be no more redirection, but the maintenance would be a nightmare
the more vartags we add, and I have some plans for more of these.
Doing the switch would be a few hours work, so that would not be a big
deal, I guess. The important part is an agreement about the approach,
IMO.This point still got no reply. It would be nice to do something for
this release regarding this old issue, IMO..
I took a brief look at this today, looking at parts of v8-0005 and
v8-0015. Although I don't dislike the idea of an abstraction layer in
concept, it's unclear to me how much this particular abstraction layer
is really buying you. It's basically abstracting away two things: the
difference between the current varatt_external and
varatt_external_oid8, and the unbundling of compression_method from
extsize. That's not a lot. This thread has a bunch of ideas already on
other things that people want to do to the TOAST system or think you
should be doing to the TOAST system, and while I'm -1 on letting those
discussions hijack the thread, it's worth paying attention to whether
the abstraction layer makes any of them easier. As far as I can see,
it doesn't. I suspect, for example, that direct TID access to toast
values is a much worse idea than people here seem to be supposing, but
whether that's true or false, toast_external_data doesn't help anyone
who is trying to implement it. I don't think it even really helps with
zstd compression, even though that's going to need a
ToastCompressionId, so I kind of find myself wondering what the point
is.
One alternative idea that I had is just provide a way to turn a 4-byte
TOAST pointe into an 8-byte TOAST pointer. If you inserted that
surgically at certain places, maybe you could make 4-byte TOAST
pointers invisible to certain parts of the system. But I'm also not
sure that's any better than what you call the "brutal" approach, which
I'm actually not sure is that brutal. I mean, how bad is it if we just
deal with one more possibility at each place where we currently deal
with VARTAG_ONDISK? To be clear, I am not trying to say "absolutely
don't do this abstraction layer". However, imagine a future where we
have 10 vartags that can appear on disk: the current VARTAG_ONDISK and
9 others. If, when that time comes, your abstraction layer handles 5
or 6 or more of those, I'd call that a win. If the only two it ever
handles are OID and OID8, I'd say that's a loss: it's not really an
abstraction layer at all at that point. In that situation you'd rather
just admit that what you're really trying to do is smooth over the
distinction between OID and OID8 and keep any other goals (including
unbundling the compression ID, IMHO) separate.
I am somewhat bothered by the idea of just bluntly changing everything
over to 8-byte TOAST OIDs. Widening the TOAST table column doesn't
concern me; it eats up 4 bytes, but the tuples are ~2k so it's not
that much of a difference. Widening the TOAST pointer seems like a
bigger concern. Every single toasted value is getting 4 bytes wider. A
single tuple could potentially have a significant number of such
columns, and as a result, get significantly wider. We could still use
the 4-byte toast pointer format if the value ID happens to be small,
but if we just assign the value ID using a 4-byte counter, after the
first 4B allocations, it will never be small again. After that,
relations that never would have needed anything like 4B toast pointers
will still pay the cost for those that do, which seems quite sad.
Whether this patch should be responsible for doing something about
that sadness is unclear to me: a sequence-per-relation to allow
separate counter allocation for each relation would be great, but
bottlenecking this patch set behind such a large change might well be
the wrong idea.
I am also uncertain how much loss we're really talking about: if for
example a wide table contains 5 long text blogs per row (which seems
fairly high) then that's "only" 20 bytes/row, and probably those rows
don't fit into 8kB blocks that well anyway, so maybe the loss of
efficiency isn't much. Maybe the biggest concern is that, IIUC, some
tuples that currently can be stored on disk wouldn't be storable at
all any more with the wider pointers, because the row couldn't be
squeezed into the block no matter how much pressure the TOAST code
applies. If that's correct, I think there's a pretty good chance that
this patch will make a few people who are already desperately unhappy
with the TOAST system even more unhappy. There's nothing you can
really do if you're up against that hard limit. I'm not sure what, if
anything, we can or should do about that, but it seems like something
we ought to at least discuss.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Dec 08, 2025 at 04:19:54PM -0500, Robert Haas wrote:
I took a brief look at this today, looking at parts of v8-0005 and
v8-0015.
Thanks for the input!
Although I don't dislike the idea of an abstraction layer in
concept, it's unclear to me how much this particular abstraction layer
is really buying you. It's basically abstracting away two things: the
difference between the current varatt_external and
varatt_external_oid8, and the unbundling of compression_method from
extsize. That's not a lot. This thread has a bunch of ideas already on
other things that people want to do to the TOAST system or think you
should be doing to the TOAST system, and while I'm -1 on letting those
discussions hijack the thread, it's worth paying attention to whether
the abstraction layer makes any of them easier. As far as I can see,
it doesn't. I suspect, for example, that direct TID access to toast
values is a much worse idea than people here seem to be supposing, but
whether that's true or false, toast_external_data doesn't help anyone
who is trying to implement it.
I don't think it even really helps with zstd compression, even
though that's going to need a ToastCompressionId, so I kind of
find myself wondering what the point is.
The idea for the support of zstd compression with this set of APIs
would be to allocate a new vartag_external, and perhaps just allow an
8-byte OID in this case to keep the code simple. It is true that I
have not attempted to rework the code behind the compression and the
decompression of the slices, but I could not directly see why we would
want to do that as this also relates with the in-memory representation
of compressed datums.
One alternative idea that I had is just provide a way to turn a 4-byte
TOAST pointe into an 8-byte TOAST pointer. If you inserted that
surgically at certain places, maybe you could make 4-byte TOAST
pointers invisible to certain parts of the system. But I'm also not
sure that's any better than what you call the "brutal" approach, which
I'm actually not sure is that brutal. I mean, how bad is it if we just
deal with one more possibility at each place where we currently deal
with VARTAG_ONDISK? To be clear, I am not trying to say "absolutely
don't do this abstraction layer".
To be honest, the brutal method is not that bad I think. For the
in-tree places, my patch touches all the areas that matter, so I have
identified what needs to be touched. Also please note that the
approach used in this patch is based on a remark that has been made in
the last few years (perhaps by you actually, I'd need to find the
reference?), where folks were worrying about the introduction of a new
vartag_external as something that could be a deadly trap if we don't
patch all the areas that need to handle with external on-disk datums.
The brutal method was actually the first thing that I have done, just
to notice that I was refactoring all these areas of the code the same
way, leading to the patch attached at the end as being a win if we add
more vartags in the future. Anyway, as far as things stand today on
this thread, I have not been able to get even a single +1 for this
design. I mean, that's fine, it's still what can be called a
consensus and that's how development happens.
However, imagine a future where we
have 10 vartags that can appear on disk: the current VARTAG_ONDISK and
9 others. If, when that time comes, your abstraction layer handles 5
or 6 or more of those, I'd call that a win. If the only two it ever
handles are OID and OID8, I'd say that's a loss: it's not really an
abstraction layer at all at that point. In that situation you'd rather
just admit that what you're really trying to do is smooth over the
distinction between OID and OID8 and keep any other goals (including
unbundling the compression ID, IMHO) separate.
I won't deny this argument. As far as things go for the backend core
engine, most of the bad things I hear back from users regarding TOAST
is the 4-byte OID limitation with TOAST values. The addition of zstd
comes second, but the 32 bit problem still primes because backends
just get suddenly stuck.
I am somewhat bothered by the idea of just bluntly changing everything
over to 8-byte TOAST OIDs.
I agree that forcing that everywhere is a bad idea and I am not
suggesting that, neither does the patch set enforce that. The first
design of the patch made an 8-byte enforcement possible by using a
GUC, with the default being 4 bytes. The first set of feedback I have
received in August was to use a reloption, making 8-byte OIDs an
option that one has to pass with CREATE TABLE, and to not use a GUC.
The latest versions of the patch use a reloption.
Widening the TOAST table column doesn't
concern me; it eats up 4 bytes, but the tuples are ~2k so it's not
that much of a difference. Widening the TOAST pointer seems like a
bigger concern.
Every single toasted value is getting 4 bytes wider. A
single tuple could potentially have a significant number of such
columns, and as a result, get significantly wider. We could still use
the 4-byte toast pointer format if the value ID happens to be small,
but if we just assign the value ID using a 4-byte counter, after the
first 4B allocations, it will never be small again. After that,
relations that never would have needed anything like 4B toast pointers
will still pay the cost for those that do, which seems quite sad.
Whether this patch should be responsible for doing something about
that sadness is unclear to me: a sequence-per-relation to allow
separate counter allocation for each relation would be great, but
bottlenecking this patch set behind such a large change might well be
the wrong idea.
As far as I am concerned about the user cases, the tables where the
4-byte limitation is reached usually concern a small subset of tables
in one or more schemas. The cost of a 8-byte OID also comes down to
how much the TOAST blobs are updated. If they are updated a lot,
we'll need a new value anyway. If the blobs are mostly static
content with other attributes updated a lot, the cost of 8-byte OIDs
gets much high. If I would be a user deadling with a set of tables
that have already 4 billions TOAST blobs of at least 2kB in more than
one relation, the extra 4 bytes when updating the other attributes are
not my main worry. :)
I am also uncertain how much loss we're really talking about: if for
example a wide table contains 5 long text blogs per row (which seems
fairly high) then that's "only" 20 bytes/row, and probably those rows
don't fit into 8kB blocks that well anyway, so maybe the loss of
efficiency isn't much. Maybe the biggest concern is that, IIUC, some
tuples that currently can be stored on disk wouldn't be storable at
all any more with the wider pointers, because the row couldn't be
squeezed into the block no matter how much pressure the TOAST code
applies. If that's correct, I think there's a pretty good chance that
this patch will make a few people who are already desperately unhappy
with the TOAST system even more unhappy. There's nothing you can
really do if you're up against that hard limit. I'm not sure what, if
anything, we can or should do about that, but it seems like something
we ought to at least discuss.
Yeah, perhaps. Again, for users that fight against the hard limit,
what I'm hearing is that for some users reworking a large table to be
rewritten as a partitioned one to bypass the limit is not acceptable
in some cases. With the recent planner changes that have improved the
planning time for many partitions, things are better, but it's
basically impossible to predict all the possible user behaviors that
there could be out there. I won't deny that it could be possible that
enlarging the toast values to 8-bytes for some relations leads to some
folks being unhappy because it makes the storage of the on-disk
external pointers less effective regarding to alignment.
At this point, I am mostly interested in a possible consensus about
what people would like to see, in the timeframe that remains for v19
and the timing is what it is because everybody is busy. What I am in
priority interested is giving a way for users to bypass the 4-byte
limit without reworking a schema. If this is opt-in, users have at
least a choice. A reloption has the disadvantage to not be something
that users are aware of by default. A rewrite of the TOAST table is
required anyway. If the consensus is this design is not good enough,
that's fine. If the consensus is to use the
"brutal-still-not-so-brutal" approach, that's fine. The only thing I
can do is commit my resources into making something happen for the
4-byte limitation, whatever the approach folks would be OK with.
--
Michael