Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

hannu@tm.ee

10 months ago

In reply to: Michael Paquier (#1)

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

Hi Michael

I'll take a look at the patch set.

While digging around in the TOAST code did you have any ideas on how
one could extract the TOAST APIs in a way that they can be added in
Table Access Method definition ?

Not all TAMs need TOAST, but the ones that do could also be the ones
that still like to do something different when materializing toasted
values.

And TOAST is actually a nice abstraction which could be used as basis
for both offloading more columns into separate forks and files as well
as implementing some kinds of vectored, columnar and compressed
storages.

----
Hannu

Show quoted text

On Thu, Jun 19, 2025 at 7:59 AM Michael Paquier <michael@paquier.xyz> wrote:

Hi all,

I have been looking at $subject and the many past reviews recently,
also related to some of the work related to the potential support for
zstandard compression in TOAST values, and found myself pondering
about the following message from Tom, to be reminded that nothing has
been done regarding the fact that the backend may finish in an
infinite loop once a TOAST table reaches 4 billion values:
/messages/by-id/764273.1669674269@sss.pgh.pa.us

Spoiler: I have heard of users that are in this case, and the best
thing we can do currently except raising shoulders is to use
workarounds with data externalization AFAIK, which is not nice,
usually, and users notice the problem once they see some backends
stuck in the infinite loop. I have spent some time looking at the
problem, and looked at all the proposals in this area like these ones
(I hope so at least):
https://commitfest.postgresql.org/patch/4296/
/messages/by-id/224711f9-83b7-a307-b17f-4457ab73aa0a@sigaev.ru

Anyway, it seems like nothing goes in a direction that I think would
be suited to fix the two following problems (some of the proposed
patches broke backward-compatibility, as well, and that's critical):
- The limit of TOAST values to 4 billions, because external TOAST
pointers want OIDs.
- New compression methods, see the recent proposal about zstandard at
[1]. ToastCompressionId is currently limited to 4 values because the
extinfo field of varatt_external has only this much data remaining.
Spoiler: I want to propose a new varatt_external dedicated to
zstandard-compressed external pointers, but that's not for this
thread.

Please find attached a patch set I have finished with while poking at
the problem, to address points 1) and 2) in the first email mentioned
at the top of this message. It is not yet ready for prime day yet
(there are a couple of things that would need adjustments), but I have
reached the point where I am going to need a consensus about what
people would be OK to have in terms of design to be able to support
multiple types of varatt_external to address these issues. And I'm OK
to consume time on that for the v19 cycle.

While hacking at (playing with?) the whole toasting and detoasting
code to understand the blast radius that this would involve, I have
quickly found that it is very annoying to have to touch at many places
of varatt.h to make variants of the existing varatt_external structure
(what we store on-disk as varlena Datum for external TOAST pointers).
Spoiler: it's my first time touching the internals of this code area
so deeply. Most of the code churns happen because we need to make the
[de]toast code aware of what to do depending on the vartags of the
external varlenas. It would be simple to hardcode a bunch of new
VARATT_IS_EXTERNAL_ONDISK() variants to plug in the new structures.
While it is efficient, this has a cost for out-of-core code and in
core because all the places that touch external TOAST pointers need to
be adjusted. Point is: it can be done. But if we introduce more
types of external TOAST pointers we need to always patch all these
areas, and there's a cost in that each time one or more new vartags
are added.

So, I have invented a new interface aimed at manipulating on-disk
external TOAST pointers, called toast_external, that is an on-memory
structure that services as an interface between on-disk external TOAST
pointers and what the backend wants to look at when retrieving chunks
of data from the TOAST relations. That's the main proposal of this
patch set, with a structure looking like that:
typedef struct toast_external_data
{
/* Original data size (includes header) */
int32 rawsize;
/* External saved size (without header) */
uint32 extsize;
/* compression method */
ToastCompressionId compression_method;
/* Relation OID of TOAST table containing the value */
Oid toastrelid;
/*
* Unique ID of value within TOAST table. This could be an OID or an
* int8 value. This field is large enough to be able to store any of
* them.
*/
uint64 value;
} toast_external_data;

This is a bit similar to what the code does for R/W and R/O vartags,
only applying to the on-disk external pointers. Then, the [de]toast
code and extension code is updated so as varlenas are changed into
this structure if we need to retrieve some of its data, and these
areas of the code do not need to know about the details of the
external TOAST pointers. When saving an external set of chunks, this
structure is filled with information depending on what
toast_save_datum() deals with, be it a short varlena, a non-compressed
external value, or a compressed external value, then builds a varlena
with the vartag we want.

External TOAST pointers have three properties that are hardcoded in
the tree, bringing some challenges of their own:
- The maximum size of a chunk, TOAST_MAX_CHUNK_SIZE, tweaked at close
to 2k to make 4 chunks fit on a page. This depends on the size of the
external pointer. This one was actually easy to refactor.
- The varlena header size, based on VARTAG_SIZE(), which is kind of
tricky to refactor out in the new toast_external.c, but that seems OK
even if this knowledge stays in varatt.h.
- The toast pointer size, aka TOAST_POINTER_SIZE. This one is
actually very interesting (tricky): we use it in one place,
toast_tuple_find_biggest_attribute(), as a lower bound to decide if an
attribute should be toastable or not. I've refactored the code to use
a "best" guess depending on the value type in the TOAST relation, but
that's not 100% waterproof. That needs more thoughts.

Anyway, the patch set is able to demonstrate how much needs to be done
in the tree to support multiple chunk_id types, and the CI is happy
with the attached. Some of the things done:
- Introduction of a user-settable GUC called default_toast_type, that
can be switched between two modes "oid" and "int8", to force the
creation of a TOAST relation using one type or the other.
- Dump, restore and upgrade support are integrated, relying on a GUC
makes the logic a breeze.
- 64b values are retrieved from a single counter in the control file,
named a "TOAST counter", which has the same reliability and properties
as an OID, with checkpoint support, WAL records, etc.
- Rewrites are soft, so I have kicked the can down the toast on this
point to not make the proposal more complicated than it should be: a
VACUUM FULL retains the same TOAST value type as the original. We
could extend rewrites so as the type of TOAST value is changed. It is
possible to setup a new cluster with default_toast_type = int8 set
after an upgrade, with the existing tables still using the OID mode.
This relates to the recent proposal with a concurrent VACUUM FULL
(REPACK discussion).

The patch set keeps the existing vartag_external with OID values for
backward-compatibility, and adds a second vartag_external that can
store 8-byte values. This model is the simplest one, and
optimizations are possible, where the Datum TOAST pointer could be
shorter depending on the ID type (OID or int8), the compression method
and the actual value to divide in chunks. For example, if you know
that a chunk of data to save has a value less than UINT32_MAX, we
could store 4 bytes worth of data instead of 8. This design has the
advantage to allow plugging in new TOAST external structures easily.
Now I've not spent extra time in this tuning, because there's no point
in spending more time without an actual agreement about three things,
and *that's what I'm looking for* as feedback for this upcoming CF:
- The structures of the external TOAST pointers. Variable-sized
pointers could be one possibility, across multiple vartags. Ideas are
welcome.
- How many vartag_external types we want.
- If people are actually OK with this translation layer or not, and I
don't disagree that there may be some paths hot enough where the
translation between the on-disk varlenas and this on-memory
toast_external_data hurts. Again, it is possible to hardcode more
types of vartags in the tree, or just bypass the translation in the
paths that are too hot. That's doable still brutal, but if that's the
final consensus reached I'm OK with that as well. (See for example
the changes in amcheck to see how simpler things get.)

The patch set has been divided into multiple pieces to ease its
review. Again, I'm not completely happy with everything in it, but
it's a start. Each patch has its own commit message, so feel free to
refer to them for more details:
- 0001 introduces the GUC default_toast_type. It is just defined, not
used in the tree at this stage.
- 0002 adds support for catcache lookups for int8 values, required to
allow TOAST values with int8 and its indexes. Potentially useful for
extensions.
- 0003 introduces the "TOAST counter", 8 bytes in the control file to
allocate values for the int8 chunk_id. That's cheap, reliable.
- 0004 is a mechanical change, that enlarges a couple of TOAST
interfaces to use values of uint64 instead of OID.
- 0005, again a mechanical change, reducing a bit the footprint of
TOAST_MAX_CHUNK_SIZE because OID and int8 values need different
values.
- 0006 tweaks pg_column_toast_chunk_id() to use int8 as return type.

Then comes the "main" patches:
- 0007 adds support for int8 chunk_id in TOAST tables. This is mostly
a mechanical change. If applying the patches up to this point,
external Datums are applied to both OID and int8 values. Note that
there is one tweak I'm unhappy with: the toast counter generation
would need to be smarter to avoid concurrent values because we don't
cross-check the TOAST index for existing values. (Sorry, got slightly
lazy here).
- 0008 adds tests for external compressed and uncompressed TOAST
values for int8 TOAST types.
- 0009 adds support for dump, restore, upgrades of the TOAST table
types.
- 0010 is the main dish: refactoring of the TOAST code to use
toast_external_data, with OID vartags as the only type defined.
- 0011 adds a second vartag_external: the one with int8 values stored
in the external TOAST pointer.
- 0012 is a bonus for amcheck: what needs to be done in its TAP tests
to allow the corruption cases to work when supporting a new vartag.

That was a long message. Thank you for reading if you have reached
this point.

Regards,

[1]: /messages/by-id/CAFAfj_HX84EK4hyRYw50AOHOcdVi-+FFwAAPo7JHx4aShCvunQ@mail.gmail.com
--
Michael

hukutoc@gmail.com

10 months ago

In reply to: Hannu Krosing (#2)

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

Hi!

Hannu, we'd already made an attempt to extract the TOAST functionality as
API
and make it extensible and usable by other AMs in [1]Pluggable TOAST </messages/by-id/224711f9-83b7-a307-b17f-4457ab73aa0a@sigaev.ru>, the patch set was
met calmly
but we still have some hopes on it.

Michael, glad you continue this work! Took patch set for review.

[1]: Pluggable TOAST </messages/by-id/224711f9-83b7-a307-b17f-4457ab73aa0a@sigaev.ru>
</messages/by-id/224711f9-83b7-a307-b17f-4457ab73aa0a@sigaev.ru>

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/

On Fri, Jul 4, 2025 at 2:03 PM Hannu Krosing <hannuk@google.com> wrote:

Show quoted text

Hi Michael

I'll take a look at the patch set.

While digging around in the TOAST code did you have any ideas on how
one could extract the TOAST APIs in a way that they can be added in
Table Access Method definition ?

Not all TAMs need TOAST, but the ones that do could also be the ones
that still like to do something different when materializing toasted
values.

And TOAST is actually a nice abstraction which could be used as basis
for both offloading more columns into separate forks and files as well
as implementing some kinds of vectored, columnar and compressed
storages.

----
Hannu

michael@paquier.xyz

10 months ago

In reply to: Nikita Malakhov (#3)

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

On Fri, Jul 04, 2025 at 02:38:34PM +0300, Nikita Malakhov wrote:

Hannu, we'd already made an attempt to extract the TOAST functionality as
API and make it extensible and usable by other AMs in [1], the patch
set was met calmly but we still have some hopes on it.

Yeah, it's one of these I have studied, and just found that
overcomplicated, preventing us from moving on with a simpler proposal,
because I care about two things first:
- More compression methods, with more meta-data, but let's just add
more vartag_external for that once/if they're really required.
- Enlarge optionally to 8-byte values.
So I really want to stress about these two points, nothing else for
now, echoing from the feedback from 2022 and the fact that all
proposals done after that lacked a simple approach.

IMO, we would live fine enough, *if* being able to plug in a pluggable
TOAST engine makes sense, if we just limit ourselves with an external
interface. We could allow backends to load their own vartag_external
with their own set of callbacks like the ones I am proposing here, so
as we can translate from/to a Datum in heap (or a different table AM)
to an external source, with the backend able to understand what this
external source should be. The key is to define a structure good
enough for the backend (toast_external_data in the patch). So to
answer your and Hannu's question: I had the case of different table
AMs in mind with an interface able to plug into it, yes. And yes, I
have designed the patch set with this in scope. Now there's also a
data type component to that, so that's assuming that a table AM would
want to rely on a varlena to store this data externally, somewhere
else that may not be exactly TOAST, still we want an OID and a value
to be able to retrieve this external value, and we want to store this
external OID and this value (+extra like a compression method and
sizes) in a Datum of the main relation file.

FYI, the patch set posted on this thread is not the latest one. I
have a v2, posted on this branch, where I have reordered things:
https://github.com/michaelpq/postgres/tree/toast_64bit_v2

The refactoring to the new toast_external_data with its callbacks is
done first, and the new vartag_external with 8-byte value support is
added on top of that. There were still two things I wanted to do, and
could not get down to it because I've spent my last week or so
working on other's stuff so I lacked room:
- Evaluate the cost of the transfer layer to toast_external_data. The
worst case I was planning to work with is a non-compressed data stored
in TOAST, then check profiles with the the detoasting path by grabbing
slices of the data with pgbench and a read-only query. The
write/insert path is not going to matter, the detoast is. The
reordering is actually for this reason: I want to see the effect of
the new interface first, and this needs to happen before we even
consider the option of adding 8-byte values.
- Add a callback for the value ID assignment. I was hesitating to add
that when I first backed on the patch but I think that's the correct
design moving forward, with an extra logic to be able to check if an
8-byte value is already in use in a relation, as we do for OID
assignment, but applied to the Toast generator added to the patch.
The backend should decide if a new value is required, we should not
decide the rewrite cases in the callback.

There is a second branch that I use for development, force-pushing to
it periodically, as well:
https://github.com/michaelpq/postgres/tree/toast_64bit
That's much dirtier, always WIP, just one of my playgrounds.
--
Michael

hukutoc@gmail.com

10 months ago

In reply to: Michael Paquier (#4)

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

Hi!

I'm reviewing code at toast_64bit_v2.
Michael, isn't there a typo?
static const toast_external_info
toast_external_infos[TOAST_EXTERNAL_INFO_SIZE] = {
[VARTAG_ONDISK_INT8] = {
.toast_pointer_size = TOAST_POINTER_INT8_SIZE,
.maximum_chunk_size = TOAST_MAX_CHUNK_SIZE_INT8,
.to_external_data = ondisk_int8_to_external_data,
.create_external_data = ondisk_int8_create_external_data,
},
[VARTAG_ONDISK_OID] = {
.toast_pointer_size = TOAST_POINTER_INT8_SIZE, <--- here
.maximum_chunk_size = TOAST_MAX_CHUNK_SIZE_OID,
.to_external_data = ondisk_oid_to_external_data,
.create_external_data = ondisk_oid_create_external_data,
},
};

Shouldn't TOAST_POINTER_INT8_SIZE be replaced with TOAST_POINTER_OID_SIZE?

Regards,

--
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/

michael@paquier.xyz

10 months ago

In reply to: Nikita Malakhov (#5)

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

On Mon, Jul 07, 2025 at 05:33:11PM +0300, Nikita Malakhov wrote:

[VARTAG_ONDISK_OID] = {
.toast_pointer_size = TOAST_POINTER_INT8_SIZE, <--- here
.maximum_chunk_size = TOAST_MAX_CHUNK_SIZE_OID,
.to_external_data = ondisk_oid_to_external_data,
.create_external_data = ondisk_oid_create_external_data,
},
};

Shouldn't TOAST_POINTER_INT8_SIZE be replaced with TOAST_POINTER_OID_SIZE?

Yes, thanks for pointing this out. This one has lurked in one of the
rebases (not sure how) and it was impacting the threshold calculation
where we consider if an attribute should be compressed or not. I have
taken this occasion to work a bit more on the patch set. The patch
structure is mostly the same, with two tweaks because I was unhappy
with these in the initial patch set:
- The addition of a new callback able to retrieve a new TOAST value,
to ease the diffs in toast_save_datum().
- Reordering of the patch set, with the TOAST external refactoring
done much earlier in the series, now placed in 0003.

The most interesting piece of the patch is still 0003 "Refactor
external TOAST pointer code for better pluggability". On top of that
stands a 0004 patch named "Introduce new callback to get fresh TOAST
values" where I have added value conflict handling for int8. The
split makes reviews easier, hopefully.

Please note that I still need to look at perf profiles and some flame
graphs with the refactoring done in 0003 with the worst case I've
mentioned upthread with detoasting and values stored uncompressed in
the TOAST relation.

I have also pushed this v2 on this branch, so feel free to grab it if
that makes your life easier:
https://github.com/michaelpq/postgres/tree/toast_64bit_v2
--
Michael

michael@paquier.xyz

10 months ago

In reply to: Michael Paquier (#6)

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

On Tue, Jul 08, 2025 at 08:38:41AM +0900, Michael Paquier wrote:

Please note that I still need to look at perf profiles and some flame
graphs with the refactoring done in 0003 with the worst case I've
mentioned upthread with detoasting and values stored uncompressed in
the TOAST relation.

So, the worst case I could think of for the slice detoast path is
something like that:
create table toasttest_bytea (f1 bytea);
alter table toasttest_bytea alter column f1 set storage external;
insert into toasttest_bytea values(decode(repeat('1234567890',10000),'escape'));

And then use something like the following query that retrieves a small
substring many times, to force a maximum of detoast_attr_slice() to
happen, checking the effect of toast_external_info_get_data():
select length(string_agg(substr(f1, 2, 3), '')) from
toasttest_bytea, lateral generate_series(1,1000000) as a (id);

I have taken this query, kept running that with a \watch, and took
samples of 10s perf records, finishing with the attached graphs
(runtime does not show any difference):
- detoast_master.svg, for the graph on HEAD.
- detoast_patch.svg with the patch set up to 0003 and the external
TOAST pointer refactoring, where detoast_attr_slice() shows up.
- master_patch_diff.svg as the difference between both, with
difffolded.pl from [1]https://github.com/brendangregg/FlameGraph -- Michael.

I don't see a difference in the times spent in these stacks, as we are
spending most of the run retrieving the slices from the TOAST relation
in fetch_datum_slice(). Opinions and/or comments are welcome.

[1]: https://github.com/brendangregg/FlameGraph -- Michael
--
Michael

Greg Burd

greg@burd.me

10 months ago

In reply to: Michael Paquier (#6)

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

On Jul 7, 2025, at 7:38 PM, Michael Paquier <michael@paquier.xyz> wrote:

I have also pushed this v2 on this branch, so feel free to grab it if
that makes your life easier:
https://github.com/michaelpq/postgres/tree/toast_64bit_v2
--
Michael

Thank you for spending time digging into this and for the well structured patch set (and GitHub branch which I personally find helpful). This $subject is important on its own, but even more so in the broader context of the zstd/dict work [1]https://commitfest.postgresql.org/patch/5702/ and also allowing for innovation when it comes to how externalized Datum are stored. The current model for toasting has served the community well for years, but I think that Hannu [2]"Yes, the idea is to put the tid pointer directly in the varlena external header and have a tid array in the toast table as an extra column. If all of the TOAST fits in the single record, this will be empty, else it will have an array of tids for all the pages for this toasted field." - Hannu Krosing in an email to me after PGConf.dev/2025 and Nikita and others have promising ideas that should be allowable without forcing core changes. I've worked a bit in this area too, I re-based the Pluggble TOAST work by Nikita [3]/messages/by-id/224711f9-83b7-a307-b17f-4457ab73aa0a@sigaev.ru onto 18devel earlier this year as I was looking for a way to implement a toaster for a custom type.

All that aside, I think you're right to tackle this one step at a time and try not to boil too much of the ocean at once (the patch set is already large enough). With that in mind I've read once or twice over your changes and have a few basic comments/questions.

v2-0001 Refactor some TOAST value ID code to use uint64 instead of Oid

This set of changes make sense and as you say are mechanical in nature, no real comments other than I think that using uint64 rather than Oid is the right call and addresses #2 on Tom's list.

v2-0002 Minimize footprint of TOAST_MAX_CHUNK_SIZE in heap TOAST code

I like this as well, clarifies the code and reduces repetition.

v2-0003 Refactor external TOAST pointer code for better pluggability

+ * For now there are only two types, all defined in this file. For now this
+ * is the maximum value of vartag_external, which is a historical choice.

This provides a bridge for compatibility, but doesn't open the door to a truly pluggable API. I'm guessing the goal is incremental change rather than wholesale rewrite.

+ * The different kinds of on-disk external TOAST pointers. divided by
+ * vartag_external.

Extra '.' in "TOAST pointers. divided" I'm guessing.

v2-0004 Introduce new callback to get fresh TOAST values
v2-0005 Add catcache support for INT8OID
v2-0006 Add GUC default_toast_type

Easily understood, good progression of supporting changes.

v2-0007 Introduce global 64-bit TOAST ID counter in control file

Do you have any concern that this might become a bottleneck when there are many relations and many backends all contending for a new id? I'd imagine that this would show up in a flame graph, but I think your test focused on the read side detoast_attr_slice() rather than insert/update and contention on the shared counter. Would this be even worse on NUMA systems?

v2-0008 Switch pg_column_toast_chunk_id() return value from oid to bigint
v2-0009 Add support for bigint TOAST values
v2-0010 Add tests for TOAST relations with bigint as value type
v2-0011 Add support for TOAST table types in pg_dump and pg_restore
v2-0012 Add new vartag_external for 8-byte TOAST values
V2-0013 amcheck: Add test cases for 8-byte TOAST values

I read through each of these patches, I like the break down and the attention to detail. The inclusion of good documentation at each step is helpful. Thank you.

Thanks for the flame graphs examining a heavy detoast_attr_slice() workload. I agree that there is little or no difference between them which is nice.

I think the only call out Tom made [4]/messages/by-id/764273.1669674269@sss.pgh.pa.us that isn't addressed was the ask for localized ID selection. That may make sense at some point, especially if there is contention on GetNewToastId(). I think that case is worth a separate performance test, something with a large number of relations and backends all performing a lot of updates generating a lot of new IDs. What do you think?

As for adding even more flexibility, I see the potential to move in that direction over time with this as a good focused incremental set of changes that address a few important issues now.

Really excited by this work, thank you.

-greg

[1]: https://commitfest.postgresql.org/patch/5702/
[2]: "Yes, the idea is to put the tid pointer directly in the varlena external header and have a tid array in the toast table as an extra column. If all of the TOAST fits in the single record, this will be empty, else it will have an array of tids for all the pages for this toasted field." - Hannu Krosing in an email to me after PGConf.dev/2025
external header and have a tid array in the toast table as an extra
column. If all of the TOAST fits in the single record, this will be
empty, else it will have an array of tids for all the pages for this
toasted field." - Hannu Krosing in an email to me after PGConf.dev/2025
[3]: /messages/by-id/224711f9-83b7-a307-b17f-4457ab73aa0a@sigaev.ru
[4]: /messages/by-id/764273.1669674269@sss.pgh.pa.us

hukutoc@gmail.com

10 months ago

In reply to: Greg Burd (#8)

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

Hi!

Greg, thanks for the interest in our work!

Michael, one more thing forgot to mention yesterday -
#define TOAST_EXTERNAL_INFO_SIZE (VARTAG_ONDISK_OID + 1)
static const toast_external_info
toast_external_infos[TOAST_EXTERNAL_INFO_SIZE]
VARTAG_ONDISK_OID historically has a value of 18
and here we got an array of 19 members with only 2 valid ones.

What do you think about having an individual
TOAST value id counter per relation instead of using
a common one? I think this is a very promising approach,
but a decision must be made where it should be stored.

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/

#10

hannu@tm.ee

10 months ago

In reply to: Nikita Malakhov (#9)

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

I still think we should go with direct toast tid pointers in varlena
and not some kind of oid.

It will remove the need for any oid management and also will be
many-many orders of magnitude faster for large tables (just 2x faster
for in-memory small tables)

I plan to go over Michael's patch set here and see how much change is
needed to add the "direct toast"

My goals are:
1. fast lookup from skipping index lookup
2. making the toast pointer in main heap as small as possible -
hopefully just the 6 bytes of tid pointer - so that scans that do not
need toasted values get more tuples from each page
3. adding all (optional) the extra data into toast chunk record as
there we are free to add whatever is needed
Currently I plan to introduces something like this for toast chunk record

-- added fields

This seems to be flexible enough to allow for both compressin and
efficient partial updates

---
Hannu

Show quoted text

On Tue, Jul 8, 2025 at 8:31 PM Nikita Malakhov <hukutoc@gmail.com> wrote:

Hi!

Greg, thanks for the interest in our work!

Michael, one more thing forgot to mention yesterday -
#define TOAST_EXTERNAL_INFO_SIZE (VARTAG_ONDISK_OID + 1)
static const toast_external_info toast_external_infos[TOAST_EXTERNAL_INFO_SIZE]
VARTAG_ONDISK_OID historically has a value of 18
and here we got an array of 19 members with only 2 valid ones.

What do you think about having an individual
TOAST value id counter per relation instead of using
a common one? I think this is a very promising approach,
but a decision must be made where it should be stored.

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/

#11

Alvaro Herrera

alvherre@2ndquadrant.com

10 months ago

In reply to: Hannu Krosing (#10)

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

On 2025-Jul-08, Hannu Krosing wrote:

I still think we should go with direct toast tid pointers in varlena
and not some kind of oid.

I think this can be made to work, as long as we stop seeing the toast
table just like a normal heap table containing normal tuples. A lot to
reimplement though -- vacuum in particular. Maybe it can be thought of
as a new table AM. Not an easy project, I reckon.

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/

#12

hukutoc@gmail.com

10 months ago

In reply to: Hannu Krosing (#10)

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

Hi,

Hannu, we have some thoughts on direct tids storage,
it was some time ago and done by another developer,
so I have to look. I'll share it as soon as I find it, if you
are interested.

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/

#13

hannu@tm.ee

10 months ago

In reply to: Alvaro Herrera (#11)

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

chunk_

On Tue, Jul 8, 2025 at 9:37 PM Álvaro Herrera <alvherre@kurilemu.de> wrote:

On 2025-Jul-08, Hannu Krosing wrote:

I still think we should go with direct toast tid pointers in varlena
and not some kind of oid.

I think this can be made to work, as long as we stop seeing the toast
table just like a normal heap table containing normal tuples. A lot to
reimplement though -- vacuum in particular.

Non-FULL vacuum should already work. Only commands like VACUUM FULL
and CLUSTER which move tuples around should be disabled on TOAST
tables.

What other parts do you think need re-implementing in addition to
skipping the index lookup part and using the tid directly ?

The fact that per-page chunk_tid arrays allow also tree structures
should allow us much more flexibility in implementing
in-place-updatable structured storage in something otherways very
similar to toast, but this is not required for just moving from oid +
index ==> tid to using the tid directly.

I think that having a toast table as a normal table with full MVCC is
actually a good thing, as it can implement the "array element update"
as a real partial update of only the affected parts and not the
current 'copy everything' way of doing this. We already do collect the
array element update in the parse tree in a special way, now we just
need to have types that can do the partial update by changing a tid or
two in the chunk_tids array (and adjust the offsets array if needed)

This should make both
UPDATE t SET theintarray[3] = 5, theintarray[4] = 7 WHERE id = 1;

and even do partial up[dates for something like this

hannuk=# select * from jtab;
id | j
----+----------------------------
1 | {"a": 3, "b": 2}
2 | {"c": 1, "d": [10, 20, 3]}
(2 rows)
hannuk=# update jtab SET j['d'][3] = '7' WHERE id = 2;
UPDATE 1
hannuk=# select * from jtab;
id | j
----+-------------------------------
1 | {"a": 3, "b": 2}
2 | {"c": 1, "d": [10, 20, 3, 7]}
(2 rows)

when the JSON data is so large that changed part is in it's own chunk.

Maybe it can be thought of
as a new table AM. Not an easy project, I reckon.

I would prefer it to be an extension of current toast - just another
varatt_* type - as then you can upgrade to new storage CONCURRENTLY,
same way as you can currently switch compression methods.

Show quoted text

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/

#14

michael@paquier.xyz

10 months ago

In reply to: Hannu Krosing (#10)

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

On Tue, Jul 08, 2025 at 08:54:33PM +0200, Hannu Krosing wrote:

I still think we should go with direct toast tid pointers in varlena
and not some kind of oid.

It will remove the need for any oid management and also will be
many-many orders of magnitude faster for large tables (just 2x faster
for in-memory small tables)

There is also the point of backward-compatibility. We cannot just
replace things, and we have to provide a way for users to be able to
rely on the system so as upgrades are painless. So we need to think
about the correct application layer to use to maintain the legacy code
behavior while considering improvements.

I plan to go over Michael's patch set here and see how much change is
needed to add the "direct toast"

If you do not have a lot of time looking at the full patch set, I'd
recommend looking at 0003, files toast_external.h and
toast_external.c which include the key idea. Adding a new external
TOAST pointer is then a two-step process:
- Add a new vartag_external.
- Add some callbacks to let the backend understand what it should do
with this new vartag_external.

My goals are:
1. fast lookup from skipping index lookup
2. making the toast pointer in main heap as small as possible -
hopefully just the 6 bytes of tid pointer - so that scans that do not
need toasted values get more tuples from each page
3. adding all (optional) the extra data into toast chunk record as
there we are free to add whatever is needed
Currently I plan to introduces something like this for toast chunk record

Points 2. and 3. are things that the refactoring should allow. About
1., I have no idea how much you want to store in the TOAST external
points and how it affects the backend, but you could surely implement
an option that lets the backend know that it should still index
lookups based on what the external TOAST pointer says, if this stuff
has benefits.

This seems to be flexible enough to allow for both compressin and
efficient partial updates

I don't really disagree with all that. Now the history of the TOAST
threads point out that we've good at proposing complex things, but
these had a high footprint. What I'm proposing is lighter than that,
I think, tackling my core issue with the infra supporting backward
compatibility and the addition of more modes on top of it.
--
Michael

#15

michael@paquier.xyz

10 months ago

In reply to: Nikita Malakhov (#9)

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

On Tue, Jul 08, 2025 at 09:31:29PM +0300, Nikita Malakhov wrote:

Michael, one more thing forgot to mention yesterday -
#define TOAST_EXTERNAL_INFO_SIZE (VARTAG_ONDISK_OID + 1)
static const toast_external_info
toast_external_infos[TOAST_EXTERNAL_INFO_SIZE]
VARTAG_ONDISK_OID historically has a value of 18
and here we got an array of 19 members with only 2 valid ones.

Yeah, I'm aware of that. The code is mostly to make it easier to read
while dealing with this historical behavior, even if it costs a bit
more in memory. I don't think that it's a big deal, and we could
always have one more level of redirection to reduce its size. Now
there's the extra complexity..

What do you think about having an individual
TOAST value id counter per relation instead of using
a common one? I think this is a very promising approach,
but a decision must be made where it should be stored.

I've thought about that, and decided to discard this idea for now to
keep the whole proposal simpler. This has benefits if you have many
relations with few OIDs consumed, but this has a cost in itself as you
need to maintain the data for each TOAST relation. When I looked at
the problems a couple of weeks ago, I came to the conclusion that all
the checkbox properties of "local" TOAST values are filled with a
sequence: WAL logging to ensure uniqueness, etc. So I was even
considering the addition of some code to create sequences on-the-fly,
but at the end that was just more complexity with how we define
sequences currently compared to a unique 8-byte counter in the
control file that's good enough for a veeery long time.

I've also noticed that this sort of links to a piece I've implemented
last year and is still sitting in the CF app without much interest
from others: sequence AMs. You could implement a "TOAST" sequence
method, for example, optimized for this purpose. As a whole, I
propose to limit the scope of the proposal to the pluggability of the
external TOAST pointers. The approach I've taken should allow such
improvements, these can be added later if really needed.
--
Michael

#16

michael@paquier.xyz

10 months ago

In reply to: Greg Burd (#8)

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

On Tue, Jul 08, 2025 at 12:58:26PM -0400, Burd, Greg wrote:

All that aside, I think you're right to tackle this one step at a
time and try not to boil too much of the ocean at once (the patch
set is already large enough). With that in mind I've read once or
twice over your changes and have a few basic comments/questions.

v2-0001 Refactor some TOAST value ID code to use uint64 instead of Oid

This set of changes make sense and as you say are mechanical in
nature, no real comments other than I think that using uint64 rather
than Oid is the right call and addresses #2 on Tom's list.

v2-0002 Minimize footprint of TOAST_MAX_CHUNK_SIZE in heap TOAST code

I like this as well, clarifies the code and reduces repetition.

Thanks. These are independent pieces if you want to link the code
less to TOAST, assuming that an area of 8 bytes would be good enough
for any TOAST "value" concept. TIDs were mentioned as well on a
different part of the thread, ItemPointerData is 6 bytes.

v2-0003 Refactor external TOAST pointer code for better pluggability
+ * For now there are only two types, all defined in this file. For now this
+ * is the maximum value of vartag_external, which is a historical choice.
This provides a bridge for compatibility, but doesn't open the door
to a truly pluggable API. I'm guessing the goal is incremental
change rather than wholesale rewrite.

Nope, it does not introduce a pluggable thing, but it does untangle
the fact that one needs to change 15-ish code paths when they want to
add a new type of external TOAST pointer, without showing an actual
impact AFAIK when we insert a TOAST value or fetch it, as long as we
know that we're dealing with an on-disk thing that requires an
external lookup.

+ * The different kinds of on-disk external TOAST pointers. divided by
+ * vartag_external.
Extra '.' in "TOAST pointers. divided" I'm guessing.

Indeed, thanks.

v2-0007 Introduce global 64-bit TOAST ID counter in control file

Do you have any concern that this might become a bottleneck when
there are many relations and many backends all contending for a new
id? I'd imagine that this would show up in a flame graph, but I
think your test focused on the read side detoast_attr_slice() rather
than insert/update and contention on the shared counter. Would this
be even worse on NUMA systems?

That may be possible, see below.

Thanks for the flame graphs examining a heavy detoast_attr_slice()
workload. I agree that there is little or no difference between
them which is nice.

Cool. Yes. I was wondering why detoast_attr_slice() does not show up
in the profile on HEAD, perhaps it just got optimized away (I was
under -O2 for these profiles).

I think the only call out Tom made [4] that isn't addressed was the
ask for localized ID selection. That may make sense at some point,
especially if there is contention on GetNewToastId(). I think that
case is worth a separate performance test, something with a large
number of relations and backends all performing a lot of updates
generating a lot of new IDs. What do you think?

Yeah, I need to do more benchmark for the int8 part, I was holding on
such evaluations because this part of the patch does not fly if we
don't do the refactoring pieces first. Anyway, I cannot get excited
about the extra workload that this would require in the catalogs,
because we would need one TOAST sequence tracked in there, linked to
the TOAST relation so it would not be free. Or we invent a new
facility just for this purpose, meaning that we get far away even more
from being able to resolve the original problem with the values and
compression IDs. We're talking about two instructions. Well, I guess
that we could optimize it more some atomics or even cache a range of
values to save in ToastIdGenLock acquisitions in a single backend. I
suspect that the bottleneck is going to be the insertion of the TOAST
entries in toast_save_datum() anyway with the check for conflicting
values, even if your relation is unlogged or running-on-scissors in
memory.

[2] "Yes, the idea is to put the tid pointer directly in the varlena
external header and have a tid array in the toast table as an extra
column. If all of the TOAST fits in the single record, this will be
empty, else it will have an array of tids for all the pages for this
toasted field." - Hannu Krosing in an email to me after
PGConf.dev/2025

Sure, you could do that as well, but I suspect that we'll need the
steps of at least up to 0003 to be able to handle more easily multiple
external TOAST pointer types, or the code will be messier than it
currently is. :D
--
Michael

#17

hukutoc@gmail.com

10 months ago

In reply to: Hannu Krosing (#13)

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

Hi!

On this -

Non-FULL vacuum should already work. Only commands like VACUUM FULL
and CLUSTER which move tuples around should be disabled on TOAST
tables.

Cool, toast tables are subject to bloating in update-heavy scenarios
and it's a big problem in production systems, it seems there is a promising
way to solve it once and for all!

Have to mention though that we encountered issues in logical replication
when we made toast values updatable.

Also researching direct tids implementation.

Cheers!

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/

#18

michael@paquier.xyz

10 months ago

In reply to: Nikita Malakhov (#17)

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

On Mon, Jul 14, 2025 at 09:01:28AM +0300, Nikita Malakhov wrote:

Cool, toast tables are subject to bloating in update-heavy scenarios
and it's a big problem in production systems, it seems there is a promising
way to solve it once and for all!

Have to mention though that we encountered issues in logical replication
when we made toast values updatable.

Also researching direct tids implementation.

I would be curious to see if the refactoring done on this thread would
be useful in the scope of what you are trying to do. I'd suggest
dropping that on a different thread, though, if you finish with a
patch or something worth looking at for others.
--
Michael

#19

hukutoc@gmail.com

9 months ago

In reply to: Michael Paquier (#18)

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

Hi Michael,

I'm currently debugging POC direct tids TOAST patch (on top of your branch),
will mail it in a day or two.

On Tue, Jul 15, 2025 at 3:56 AM Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Jul 14, 2025 at 09:01:28AM +0300, Nikita Malakhov wrote:

Cool, toast tables are subject to bloating in update-heavy scenarios
and it's a big problem in production systems, it seems there is a

promising

way to solve it once and for all!

Have to mention though that we encountered issues in logical replication
when we made toast values updatable.

Also researching direct tids implementation.

I would be curious to see if the refactoring done on this thread would
be useful in the scope of what you are trying to do. I'd suggest
dropping that on a different thread, though, if you finish with a
patch or something worth looking at for others.
--
Michael

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/

#20

hannu@tm.ee

9 months ago

In reply to: Nikita Malakhov (#19)

Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)

On Fri, Jul 18, 2025 at 9:24 PM Nikita Malakhov <hukutoc@gmail.com> wrote:

Hi Michael,

I'm currently debugging POC direct tids TOAST patch (on top of your branch),
will mail it in a day or two.

Great!

I also just started looking at it, starting from 0003 as recommended by Michael.

Will be interesting to see how similar / different our approaches will be :)

#21

hannu@tm.ee

9 months ago

In reply to: Nikita Malakhov (#17)

#22

michael@paquier.xyz

9 months ago

In reply to: Nikita Malakhov (#19)

#23

veldanda.nikhilkumar17@gmail.com

9 months ago

In reply to: Michael Paquier (#22)

#24

veldanda.nikhilkumar17@gmail.com

9 months ago

In reply to: Nikhil Kumar Veldanda (#23)

#25

hannu@tm.ee

9 months ago

In reply to: Nikhil Kumar Veldanda (#24)

#26

hukutoc@gmail.com

9 months ago

In reply to: Hannu Krosing (#25)

#27

hukutoc@gmail.com

9 months ago

In reply to: Nikita Malakhov (#26)

#28

michael@paquier.xyz

9 months ago

In reply to: Nikita Malakhov (#27)

#29

michael@paquier.xyz

9 months ago

In reply to: Nikita Malakhov (#26)

#30

hukutoc@gmail.com

9 months ago

In reply to: Michael Paquier (#29)

#31

hannu@tm.ee

9 months ago

In reply to: Nikita Malakhov (#30)

#32

michael@paquier.xyz

9 months ago

In reply to: Hannu Krosing (#31)

#33

michael@paquier.xyz

9 months ago

In reply to: Michael Paquier (#16)

#34

Jim Nasby

Jim.Nasby@BlueTreble.com

9 months ago

In reply to: Michael Paquier (#33)

#35

michael@paquier.xyz

9 months ago

In reply to: Jim Nasby (#34)

#36

michael@paquier.xyz

9 months ago

In reply to: Michael Paquier (#33)

#37

veldanda.nikhilkumar17@gmail.com

9 months ago

In reply to: Michael Paquier (#36)

#38

Tom Lane

tgl@sss.pgh.pa.us

9 months ago

In reply to: Michael Paquier (#36)

#39

Andres Freund

andres@anarazel.de

9 months ago

In reply to: Tom Lane (#38)

#40

michael@paquier.xyz

9 months ago

In reply to: Andres Freund (#39)

#41

michael@paquier.xyz

9 months ago

In reply to: Tom Lane (#38)

#42

michael@paquier.xyz

9 months ago

In reply to: Nikhil Kumar Veldanda (#37)

#43

hukutoc@gmail.com

9 months ago

In reply to: Michael Paquier (#42)

#44

Jim Nasby

Jim.Nasby@BlueTreble.com

9 months ago

In reply to: Andres Freund (#39)

#45

michael@paquier.xyz

9 months ago

In reply to: Jim Nasby (#44)

#46

michael@paquier.xyz

9 months ago

In reply to: Michael Paquier (#41)

#47

michael@paquier.xyz

7 months ago

In reply to: Michael Paquier (#46)

#48

michael@paquier.xyz

7 months ago

In reply to: Michael Paquier (#47)

#49

michael@paquier.xyz

5 months ago

In reply to: Michael Paquier (#48)

#50

veldanda.nikhilkumar17@gmail.com

5 months ago

In reply to: Michael Paquier (#49)

#51

Robert Haas

robertmhaas@gmail.com

5 months ago

In reply to: Michael Paquier (#49)

#52

michael@paquier.xyz

5 months ago

In reply to: Robert Haas (#51)

#53

michael@paquier.xyz

4 months ago

In reply to: Michael Paquier (#52)

#54

michael@paquier.xyz

4 months ago

In reply to: Michael Paquier (#53)

#55

veldanda.nikhilkumar17@gmail.com

4 months ago

In reply to: Michael Paquier (#54)

#56

michael@paquier.xyz

4 months ago

In reply to: Nikhil Kumar Veldanda (#55)

#57

zengman

zengman@halodbtech.com

4 months ago

In reply to: Michael Paquier (#56)

#58

michael@paquier.xyz

4 months ago

In reply to: Michael Paquier (#56)

#59

michael@paquier.xyz

4 months ago

In reply to: zengman (#57)

#60

michael@paquier.xyz

4 months ago

In reply to: Michael Paquier (#58)

#61