making tid and HOTness of UPDATE available to logical decoding plugins

Started by Hannu Krosing · 8 days ago · 11 messages
#1 Hannu Krosing <hannuk@google.com>
1 attachment(s)

Please find attached a patch that makes tuple ids and info about
whether it was a plain or HOT update available to logical decoding
callbacks.

Also modified test_decoding to show both tids:
- the old tid has the format -(pageno, slot)
- the new tid has the format +(pageno, slot)
If it is a HOT update, it is decoded with a 'HOT ' prefix.

Sample usage:

hannu=# SELECT pg_create_logical_replication_slot('test_slot', 'test_decoding');
pg_create_logical_replication_slot
------------------------------------
(test_slot,0/1BF1B38)
(1 row)
hannu=# CREATE TABLE nokey(data text);
CREATE TABLE
hannu=# insert into nokey (data) values('a');
INSERT 0 1
hannu=# update nokey set data = 'b';
UPDATE 1
hannu=# delete from nokey ;
DELETE 1
hannu=# SELECT lsn, xid, data FROM
pg_logical_slot_get_changes('test_slot', NULL, NULL);
lsn | xid | data
-----------+-----+------------------------------------------------------------
0/1C20538 | 767 | BEGIN 767
0/1C2B1E8 | 767 | COMMIT 767
0/1C2B220 | 768 | BEGIN 768
0/1C2B220 | 768 | table public.nokey: INSERT:+(0,1) data[text]:'a'
0/1C2B290 | 768 | COMMIT 768
0/1C2B300 | 769 | BEGIN 769
0/1C2B300 | 769 | table public.nokey: HOT UPDATE:-(0,1)+(0,2) data[text]:'b'
0/1C2B378 | 769 | COMMIT 769
0/1C2B3B0 | 770 | BEGIN 770
0/1C2B3B0 | 770 | table public.nokey: DELETE:-(0,2) (no-tuple-data)
0/1C2B418 | 770 | COMMIT 770
(11 rows)

My planned use case is for reliable logical replication of tables
without a primary key or other declared IDENTITY (as long as there are
no updates on the target, or at least no non-HOT updates).

Sending this part as an independent patch, as there may be other
interesting use cases as well.

--
Hannu

Attachments:

0001-Made-tuple-ids-and-info-about-HOT-updates-available-.patch (application/x-patch)
#2 Mihail Nikalayeu <mihailnikalayeu@gmail.com>
In reply to: Hannu Krosing (#1)
Re: making tid and HOTness of UPDATE available to logical decoding plugins

Hello!

Thanks for the patch.

Few comments:

1) tests are failing, expected output files need to be updated

2)

* Treat HOT update as normal updates. There is no useful
* information in the fact that we could make it a HOT update
* locally and the WAL layout is compatible.

I think it feels a little bit irrelevant now. Also, I'd prefer to
give XLOG_HEAP_HOT_UPDATE a dedicated switch case.

3) _format_tid - not sure the _ prefix is a good idea here.

4) new double newlines before and after _format_tid

5)

if (change->data.tp.newctid.ip_posid)

Should we change it to
if (ItemPointerIsValid(&change->data.tp.newctid))

Best regards,
Mikhail.

#3 Hannu Krosing <hannuk@google.com>
In reply to: Mihail Nikalayeu (#2)
Re: making tid and HOTness of UPDATE available to logical decoding plugins

On Thu, Dec 4, 2025 at 11:35 PM Mihail Nikalayeu
<mihailnikalayeu@gmail.com> wrote:

Hello!

Thanks for the patch.

Thanks for reviewing :)

Few comments:

1) tests are failing, expected output files need to be updated

Yes, I didn't want to start changing tests before other parts stabilized a bit.

2)

* Treat HOT update as normal updates. There is no useful
* information in the fact that we could make it a HOT update
* locally and the WAL layout is compatible.

I think it feels a little bit irrelevant now. Also, I'll prefer to
give XLOG_HEAP_HOT_UPDATE a dedicated case switch.

I added it after the comments about the need to track HOT in case we
would use it for collecting index entries.

As it is implemented now it should not affect any users who are not
interested in HOT.

3) _format_tid - not sure _ prefix is a good idea here, but not sure.

yeah, need to look around a little for established use in these parts
of the code.

4) new double newlines before and after _format_tid

ack

5)

if (change->data.tp.newctid.ip_posid)

Should we change it to
if (ItemPointerIsValid(&change->data.tp.newctid))

I'll have to check. It looks likely that the whole check is redundant
and the tids are always available


Best regards,
Mikhail.

#4 Euler Taveira <euler@eulerto.com>
In reply to: Hannu Krosing (#1)
Re: making tid and HOTness of UPDATE available to logical decoding plugins

On Thu, Dec 4, 2025, at 5:58 PM, Hannu Krosing wrote:

Please find attached a patch that makes tuple ids and info about
whether it was a plain or HOT update available to logical decoding
callbacks.

My first impression was: why do you want to expose internal information that
is mostly useless for a broader audience? The logical decoding infrastructure
is a general purpose solution for streaming modifications made to Postgres.
Could you elaborate how other consumers (DBMS, data store, ...) would use it?

My planned use case is for reliable logical replication of tables
without a primary key or other declared IDENTITY (as long as there are
no updates on the target, or at least no non-HOT updates).

Wait, we already have a mechanism to handle it: replica identity. What is the
advantage of this proposal in comparison with replica identity?

It seems a Postgres-centric solution, and you didn't provide strong arguments
in favor of it. How would logical replication take advantage of such a change? If
that's the case, share the pgoutput and logical replication changes.

--
Euler Taveira
EDB https://www.enterprisedb.com/

#5 Hannu Krosing <hannuk@google.com>
In reply to: Euler Taveira (#4)
Re: making tid and HOTness of UPDATE available to logical decoding plugins

On Fri, Dec 5, 2025 at 3:58 PM Euler Taveira <euler@eulerto.com> wrote:

On Thu, Dec 4, 2025, at 5:58 PM, Hannu Krosing wrote:

Please find attached a patch that makes tuple ids and info about
whether it was a plain or HOT update available to logical decoding
callbacks.

My first impression was: why do you want to expose an internal information that
is mostly useless for a broader audience? The logical decoding infrastructure
is a general purpose solution for streaming modifications made to Postgres.
Could you elaborate how other consumers (DBMS, data store, ...) would use it?

One "other consumer" that came up was possibility to use logical
decoding for collecting changes for CREATE INDEX CONCURRENTLY so there
would be no need for 2nd pass of CIC to scan the whole table again.

I understand that there already is an ongoing work to do this with a
specialized collector, but that involved some other ugliness like
having to use a specialized logging index acces methods.

And tracking changes for other CONCURRENTLY operations, like table
repack, could also benefit from having ctid and hotness info.

My planned use case is for reliable logical replication of tables
without a primary key or other declared IDENTITY (as long as there are
no updates on the target, or at least no non-HOT updates).

Wait, we already have a mechanism to handle it: replica identity. What is the
advantage of this proposal in comparison with replica identity?

Replica identity full can become quite a heavyweight operation if you
just want to set up logical replication on a table that has no primary
key but still has occasional updates.

If all you want to do is to be able to replicate UPDATEs and DELETEs,
then having to save the full tuple data in WAL seems excessive.
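
For comparison, the current workaround looks roughly like this (reusing
the nokey table from my first mail; only a sketch):

ALTER TABLE nokey REPLICA IDENTITY FULL;

-- from now on every UPDATE and DELETE writes the complete old tuple
-- into WAL, and a subscriber without a usable index has to match it
-- against all columns (worst case a sequential scan per change)
UPDATE nokey SET data = 'c';
DELETE FROM nokey;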

It seems a Postgres-centric solution, and you didn't provide strong arguments
in favor of it. How would logical replication take advantage of such a change? If
that's the case, share the pgoutput and logical replication changes.

Having thought about the issue for quite some time I suddenly
discovered that while ctid cannot be used as a permanent enough
unique id for foreign keys or anything external, it is unique at any
moment in time, making it very much sufficient for logical replication.

The high-level idea is to store the source (publisher) ctid value in
an extra source_ctid column in the target (subscriber) table. That
column will also have a unique index and is of course NOT NULL (as
by definition there can be no row without a ctid), so it will form a
kind of "replication primary key".

During the CDC replay phase each change is sent with a ctid (or two in
the case of an UPDATE), and replay works as it currently does, with the
addition of the source ctid being stored in the source_ctid column on
the target.

And because an UPDATE also updates the source_ctid column on the target,
the "replication primary key" stays nicely in sync.

Of course a manual update in the target database could break
replication, but this is no different from IDENTITY FULL, or for that
matter any other IDENTITY.

So the PoC I am working on will

- add a "materialised source ctid" column to the target table, defined as
"source_ctid tid NOT NULL UNIQUE"
- initial copy will copy over `SELECT *, ctid as source_ctid FROM ...`
- the replication decoding plugin will include the actual ctid(s) in
change records

For the above PoC the replay part needs no changes beyond knowing that
source_ctid is the identity column.
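
To make that concrete, the subscriber side would look roughly like this
(the replay statements are only an illustration of what the apply logic
has to do, using the decoded changes from my first mail):

-- subscriber-side table: same columns as on the publisher, plus the
-- "replication primary key" holding the publisher's ctid
CREATE TABLE nokey (
    data        text,
    source_ctid tid NOT NULL UNIQUE
);

-- initial copy takes the publisher's ctid along:
--   SELECT *, ctid AS source_ctid FROM nokey;

-- replaying the decoded "HOT UPDATE:-(0,1)+(0,2)":
UPDATE nokey SET data = 'b', source_ctid = '(0,2)'
 WHERE source_ctid = '(0,1)';

-- replaying the decoded "DELETE:-(0,2)":
DELETE FROM nokey WHERE source_ctid = '(0,2)';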

PoC phase 2 will be more complex and will introduce the "index-only
source_ctid column" to avoid bloating the table by storing source
ctids there if the sole purpose of the replication is migrating the
database. But more on this once I have the basic PoC working :)

--
Cheers
Hannu

P.S.: I am also mulling over the idea of adding a semi-virtual GENERATED
ALWAYS AS ROW IDENTITY where the row identity starts as a bigint cast of
the actual ctid and gets materialized only on a (non-HOT) update. This
does not need this logical decoding patch, but as it is closely related I
mention it here as well.
ROW IDENTITY has two big advantages over other identity types for
mostly static tables:
a) the identity column takes up no extra space, and
b) it allows super fast direct lookups without needing an index at all
for fully write-only tables, or a quick lookup in a tiny index to
check that the ROWID is not there followed by a direct lookup by ctid.
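
The direct lookup in b) is just the existing Tid Scan; e.g. with the
nokey table from above:

-- fetches the row straight from the heap page, no index involved
SELECT * FROM nokey WHERE ctid = '(0,2)';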

--
Hannu

#6 Matthias van de Meent <boekewurm+postgres@gmail.com>
In reply to: Hannu Krosing (#1)
Re: making tid and HOTness of UPDATE available to logical decoding plugins

On Thu, 4 Dec 2025 at 21:58, Hannu Krosing <hannuk@google.com> wrote:

Please find attached a patch that makes tuple ids and info about
whether it was a plain or HOT update available to logical decoding
callbacks.

Also modified test_decoding to show both tids:
- the old tid has the format -(pageno, slot)
- the new tid has the format +(pageno, slot)
If it is a HOT update, it is decoded with a 'HOT ' prefix.

I don't think this added information has meaning in a logical database
framework; I see only demerits in adding this. No query reaches for
tuples by TID unless the query concerns the physical layout of the
data, in which case it isn't a logical query anymore.

One more concern is that this may require significant additional
effort in CLUSTER/REPACK-related operations, as those operations
rewrite which TIDs are associated with any logical tuple.
Currently, this never requires a rewrite of the remote table, but by
effectively exposing the TID as an addressable column, that change must
be considered a possible update of the remote values, and thus requires
logical processing; increasing the cost of those operations in
logical-enabled databases by a huge margin.

Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)

#7 Matthias van de Meent <boekewurm+postgres@gmail.com>
In reply to: Hannu Krosing (#5)
Re: making tid and HOTness of UPDATE available to logical decoding plugins

On Fri, 5 Dec 2025 at 16:50, Hannu Krosing <hannuk@google.com> wrote:

On Fri, Dec 5, 2025 at 3:58 PM Euler Taveira <euler@eulerto.com> wrote:

On Thu, Dec 4, 2025, at 5:58 PM, Hannu Krosing wrote:

Please find attached a patch that makes tuple ids and info about
whether it was a plain or HOT update available to logical decoding
callbacks.

My first impression was: why do you want to expose internal information that
is mostly useless for a broader audience? The logical decoding infrastructure
is a general purpose solution for streaming modifications made to Postgres.
Could you elaborate how other consumers (DBMS, data store, ...) would use it?

One "other consumer" that came up was possibility to use logical
decoding for collecting changes for CREATE INDEX CONCURRENTLY so there
would be no need for 2nd pass of CIC to scan the whole table again.

I understand that there already is an ongoing work to do this with a
specialized collector, but that involved some other ugliness like
having to use a specialized logging index acces methods.

I don't see logical decoding as a viable alternative to any
indexing-related workload. Creating and maintaining indexes need
strict limits on their resource usage, and logical decoding is unable
to give those guarantees: It needs to read WAL, which can be produced
in approximately arbitrarily large amounts in any period. This is
further worsened by the index build itself, which will have to write
out WAL if the relation it's building on also needs to log WAL, which
then also needs to be parsed and decoded by LR. And lastly, we want
CIC/RIC to work on all indexes, not just those on logged relations. "A
table with WAL-logging enabled" cannot be a requirement for CIC.

It seems a Postgres-centric solution, and you didn't provide strong arguments
in favor of it. How would logical replication take advantage of such a change? If
that's the case, share the pgoutput and logical replication changes.

Having thought about the issue for quite some time I suddenly
discovered that while ctid cannot be used as a permanent enough
unique id for foreign keys or anything external, it is unique at any
moment in time, making it very much sufficient for logical replication.

The high-level idea is to store the source (publisher) ctid value in
an extra source_ctid column in the target (subscriber) table. That
column will also have a unique index and is of course NOT NULL (as
by definition there can be no row without a ctid), so it will form a
kind of "replication primary key".

So you're using it as a poor man's id column.

I understand how you got to this point, but the right solution here
still is to get the user to specify their own identity column that is
stable across operations, and not to use the volatile and
guaranteed-unstable ctid. As I also said in my other mail, adding
ctid to the logical replication system will expose too much internal
information and will turn current logical no-ops into logical
operations; possibly even bloating the subscriber by a good deal more
than what the publisher cleaned up.

Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)

#8 Hannu Krosing <hannuk@google.com>
In reply to: Matthias van de Meent (#7)
Re: making tid and HOTness of UPDATE available to logical decoding plugins

On Mon, Dec 8, 2025 at 2:46 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

On Fri, 5 Dec 2025 at 16:50, Hannu Krosing <hannuk@google.com> wrote:

On Fri, Dec 5, 2025 at 3:58 PM Euler Taveira <euler@eulerto.com> wrote:

On Thu, Dec 4, 2025, at 5:58 PM, Hannu Krosing wrote:

Please find attached a patch that makes tuple ids and info about
whether it was a plain or HOT update available to logical decoding
callbacks.

My first impression was: why do you want to expose internal information that
is mostly useless for a broader audience? The logical decoding infrastructure
is a general purpose solution for streaming modifications made to Postgres.
Could you elaborate how other consumers (DBMS, data store, ...) would use it?

One "other consumer" that came up was possibility to use logical
decoding for collecting changes for CREATE INDEX CONCURRENTLY so there
would be no need for 2nd pass of CIC to scan the whole table again.

I understand that there already is an ongoing work to do this with a
specialized collector, but that involved some other ugliness like
having to use a specialized logging index acces methods.

I don't see logical decoding as a viable alternative to any
indexing-related workload. Creating and maintaining indexes need
strict limits on their resource usage, and logical decoding is unable
to give those guarantees:

Are these 'strict limits on their resource usage' defined somewhere so
I could take a look?

It needs to read WAL, which can be produced
in approximately arbitrarily large amounts in any period.

If you read the WAL as it is generated, then you are essentially just
filtering an in-memory stream, most of the time just jumping to next
WAL record.

The upsides of logical-decoding-based collection are:
- you do not need to have any extra settings and conditions in
your index methods to do in-index-method collection
- you avoid the hassle of synchronizing collection starts and stops
between all active backends
- you are collecting in a single process, so no overhead from
synchronizing between all the backends that capture/log index
insertions
- you can choose between collecting immediately in a background worker
and collecting later by re-reading WAL.

The one upside (?) of in-index capture is that it will naturally
throttle your production workload if capture can not keep up for some
reason.

This is further worsened by the index build itself, which will have to write
out WAL if the relation it's building on also needs to log WAL, which
then also needs to be parsed and decoded by LR.

Not "parsed and decoded" - just read the (database, tablespace,
relation) triplet, decide "not for me" and jump to the next record.

I am working on this specifically because of huge databases with heavy
production workloads which by definition generate a huge amount of
WAL.
One of the top goals is to avoid REPLICA IDENTITY FULL, which can
hugely bloat the amount of WAL generated.
The fact that you can turn on REPLICA IDENTITY ROWID with no extra
cost on the write side is a bonus (you can even replace the small overhead
of writing the primary key record by turning on rowid).

And lastly, we want
CIC/RIC to work on all indexes, not just those on logged relations. "A
table with WAL-logging enabled" cannot be a requirement for CIC.

We can use the easy, straightforward collection method - logical
decoding - when available, and fall back to the complicated method
(collecting inside index access method) or the resource-intensive
method (re-scanning the whole table) if logical decoding is
unavailable.

It seems a Postgres-centric solution, and you didn't provide strong arguments
in favor of it. How would logical replication take advantage of such a change? If
that's the case, share the pgoutput and logical replication changes.

Having thought about the issue for quite some time I suddenly
discovered that while ctid cannot be used as a permanent enough
unique id for foreign keys or anything external, it is unique at any
moment in time, making it very much sufficient for logical replication.

The high-level idea is to store the source (publisher) ctid value in
an extra source_ctid column in the target (subscriber) table. That
column will also have a unique index and is of course NOT NULL (as
by definition there can be no row without a ctid), so it will form a
kind of "replication primary key".

So you're using it as a poor man's id column.

Or a smart man's id column for replication :)

I understand how you got to this point, but the right solution here
still is to get the user to specify their own identity column that is
stable across operations, and not to use the volatile and
guaranteed-unstable ctid.

Suggestion to "get the user specify their own identity column" sounds
good in purely theoretical sense, but can have unacceptable overheads
in practice, especially if a large table started out - often for a
good reason - without a PK or other suittable identiuty column

As I said before, the row id does not have to be stable across row
versions for logical replication, it just has to be able to track the
identity of "the row that is changed or deleted" which ctid already
does very well.

You were very worried about extra WAL usage above, but seem to be
oblivious of huge resource usage of REPLICA IDENTITY FULL (when used
as a poor man's row id and not because the old row data is needed for
some other reason)

When you at some point discover the need for logical replication of a
large table inside a 24/7 production database where you do have
occasional updates - or even frequent updates, just not based on
unique id - you have currently a few options.

1. add REPLICA IDENTITY FULL
- this will double the WAL traffic for updates and usually more
than double for DELETEs (could be 1x or 100x)
- it can also be REALLY REALLY SLOW to replicate, the worst case
requiring 1 sequential scan of the whole table for each UPDATE or
DELETE
2. add a primary key column - quite hard to do CONCURRENTLY, will have
severe disk space and CPU demands and once it has been added (which
could have taken up to a few weeks) it will slow down any inserts.
3. implement updates and deletes in a way similar to overlay file
systems, where updates and deletes are in a different table and any
reading of the main table needs to join with "the overlay table" for
current state.

To reiterate - "stable across operations" is not a requirement for
logical replication; the tuple id is "stable enough" for streaming
replication changes. Think of it as somebody changing the primary key
column at each update - it seems weird, but the updated PK still
uniquely identifies the tuple for the next operation.
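
A trivial SQL rendering of that analogy (table t is purely for
illustration):

CREATE TABLE t (pk int PRIMARY KEY, data text);
INSERT INTO t VALUES (1, 'a');
UPDATE t SET pk = 2, data = 'b' WHERE pk = 1;  -- row is now identified by 2
UPDATE t SET pk = 3, data = 'c' WHERE pk = 2;  -- ... and now by 3
DELETE FROM t WHERE pk = 3;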

As I also said in my other mail, adding
ctid to the logical replication system will expose too much internal
information and will turn current logical no-ops into logical
operations;

Can you provide an example of this?

possibly even bloating the subscriber by a good deal more
than what the publisher cleaned up.

The absolute biggest bloater is REPLICA IDENTITY FULL.

The beauty of using REPLICA IDENTITY ROWID is that absolutely *nothing
extra needs to be added to WAL*.

The tids are already there even for physical replication (and even for
wal_level=minimal) as they are required even for crash recovery. All
my patch to core does is expose them to the logical decoding mechanism.
You do not have to use them; nothing changes for decoding plugins not
using them.

---
Best Regards
Hannu

#9 Matthias van de Meent <boekewurm+postgres@gmail.com>
In reply to: Hannu Krosing (#8)
Re: making tid and HOTness of UPDATE available to logical decoding plugins

On Mon, 8 Dec 2025 at 16:25, Hannu Krosing <hannuk@google.com> wrote:

On Mon, Dec 8, 2025 at 2:46 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

I don't see logical decoding as a viable alternative to any
indexing-related workload. Creating and maintaining indexes need
strict limits on their resource usage, and logical decoding is unable
to give those guarantees:

Are these 'strict limits on their resource usage' defined somewhere so
I could take a look?

None formalized, but "don't read more than O(tablesize) of base data",
"try to keep temporary disk usage to a minimum", "alloc only what you
need", and "keep yourself to maintenance_work_mem" are some I'd
consider basic.
LR would read arbitrarily large amounts of data from WAL, allocate at
least 2 chunks of memory per WAL record, and tend to spill large
transactions to disk when it runs out of memory, at times doing so
at scales much larger than the subscribed-to tables.

I also have issues with the permission model here --
pg_create_subscription is a separate permission from pg_maintain for
good reasons, and I personally would like to keep it that way. Let's
not add another way for people to accidentally lose track of a logical
slot.

It needs to read WAL, which can be produced
in approximately arbitrarily large amounts in any period.

If you read the WAL as it is generated, then you are essentially just
filtering an in-memory stream, most of the time just jumping to next
WAL record.

I think the main point here is that you can't guarantee that you're
going to read from memory, so you can't assume that the performance is
going to be comparable to "reading from memory".

The upsides of logical-decoding-based collection are:
- you do not need to have any extra settings and conditions in
your index methods to do in-index-method collection

What do you mean by this? AFAIK, we don't have any such "settings or
conditions" inside index AMs for CIC/RIC. Most, if not all, of this
is handled outside the AM code, in either generic indexing code, or in
heapam's decision for which snapshot to use in each of the scans.
If LR at some point were to be used for indexing, it'd probably have at
least a similar (if not larger) footprint in the code.

- you avoid the hassle of synchronizing collection starts and stops
between all active backends

I don't see how you can prevent the synchronization step before the
first heap scan ("collection start"). We need to be certain all
backends see the new index, or we might corrupt the index with too
much/too little data through HOT updates in concurrent workloads.
AFAICT, this can not be prevented with LR.
I also can't think of a correct way to prevent the synchronization
step after the initial index is built ("collection end"). We must wait
for all backends to consider this index for insertions, or some
backend may still produce tuples that aren't inserted into the index.
LR can't help with this either - it can't see into the future and
determine which backends will still insert which tuples into which
tables.

So, which hassle would be avoided specifically?

- you are collecting in a single process, so no overhead from
synchronizing between all the backends that capture/log index
insertions

With LR, all backends still have to synchronize through WAL
insertions. Mihail's STIR index does not use WAL in operations, so
it's safe to say that there is also minimal additional overhead there.

- you can choose between collecting immediately in a background worker
and collecting later by re-reading WAL.

Which requires writing a new integration with this system, right?

The one upside (?) of in-index capture is that it will naturally
throttle your production workload if capture can not keep up for some
reason.

The insertion of TIDs into the STIR index is 99 times out of 100 going
to be cheaper than the insertion into the index that's being built. I
don't see why you consider STIR an issue, but not LR.

This is further worsened by the index build itself, which will have to write
out WAL if the relation it's building on also needs to log WAL, which
then also needs to be parsed and decoded by LR.

Not "parsed and decoded" - just read the (database, tablespace,
relation) triplet, decide "not for me" and jump to the next record.

AFAIK we have yet to move record filtering ahead of the point where we
allocate the XLogRecord (and DecodedXLogRecord) that we're retrieving
from raw WAL pages. We don't actually skip the data and variable
headers sections when the record doesn't have rm_decode; we don't
actually skip the data sections when the page headers indicate the WAL
record only modified pages in a different database.
So in my view, yes, we do parse and decode every WAL record in LR;
even if that doesn't always involve calling RMgrData->rm_decode.

I am working on this specifically because of huge databases with heavy
production workloads which by definition generate a huge amount of
WAL.
One of the top goals is to avoid REPLICA IDENTITY FULL, which can
hugely bloat the amount of WAL generated.
The fact that you can turn on REPLICA IDENTITY ROWID with no extra
cost on the write side is a bonus

There is an extra cost for maintenance operations; as you can see
below; and that ignores the increased overhead for users of pgoutput.

(you can even replace the small overhead
of writing the primary key record by turning on rowid).

You can't use rowid for Primary Keys, because you'd create referential
update loops for foreign keys, or (with ON UPDATE RESTRICT) you'd be
unable to update the rows at all.

And lastly, we want
CIC/RIC to work on all indexes, not just those on logged relations. "A
table with WAL-logging enabled" cannot be a requirement for CIC.

We can use the easy, straightforward collection method - logical
decoding - when available, and fall back to the complicated method
(collecting inside index access method) or the resource-intensive
method (re-scanning the whole table) if logical decoding is
unavailable.

I prefer a single method that always works, is testable, and that has
a guaranteed limit, over two that are only exercised in certain
paths, one of which has a tendency to keep WAL on disk.

I think it's much, much easier to reason about the performance profile
and impact of just one more index.

I understand how you got to this point, but the right solution here
still is to get the user to specify their own identity column that is
stable across operations, and not to use the volatile and
guaranteed-unstable ctid.

Suggestion to "get the user specify their own identity column" sounds
good in purely theoretical sense, but can have unacceptable overheads
in practice, especially if a large table started out - often for a
good reason - without a PK or other suittable identiuty column

I don't think that this user problem is something we should be making
our problem; at least not like this.

As I said before, the row id does not have to be stable across row
versions for logical replication, it just has to be able to track the
identity of "the row that is changed or deleted" which ctid already
does very well.

Yes, I'm aware. But a TID doesn't really mean anything if you don't
have indexes - its value may change arbitrarily between queries.
Anything may happen to a tuple's CTID, as long as 1.) within a statement,
the CTID doesn't change, and 2.) indexed TIDs won't have changes to
the attributes represented in indexes.

You were very worried about extra WAL usage above, but seem to be
oblivious of huge resource usage of REPLICA IDENTITY FULL (when used
as a poor man's row id and not because the old row data is needed for
some other reason)

I don't generally suggest people run their systems with
wal_level=logical and try to avoid those systems that have, exactly
because of issues like the significant additional overhead involved
with logging page-local update records.

When you at some point discover the need for logical replication of a
large table inside a 24/7 production database where you do have
occasional updates - or even frequent updates, just not based on
unique id - you have currently a few options.

1. add REPLICA IDENTITY FULL
- this will double the WAL traffic for updates and usually more
than double for DELETEs (could be 1x or 100x)
- it can also be REALLY REALLY SLOW to replicate, the worst case
requiring 1 sequential scan of the whole table for each UPDATE or
DELETE

Yep, that's about expected; if you want performant UPDATE with
OLTP-style databases you have to index your data.

2. add a primary key column - quite hard to do CONCURRENTLY, will have
severe disk space and CPU demands and once it has been added (which
could have taken up to a few weeks) it will slow down any inserts.

It's quite possible. Not trivial, but it is a road that many have
taken. And yes, indexing slows down inserts. That's how we make sure
the index remains correct.

To reiterate - "stable across operations" is not a requirement for
logical replication; the tuple id is "stable enough" for streaming
replication changes. Think of it as somebody changing the primary key
column at each update - it seems weird, but the updated PK still
uniquely identifies the tuple for the next operation.

Yeah, I'm aware of that part.

As I also said in my other mail, adding
ctid to the logical replication system will expose too much internal
information and will turn current logical no-ops into logical
operations;

Can you provide an example of this?

CLUSTER itself doesn't modify any logical columns, and would thus
normally be a no-op from an LR PoV, allowing us to ignore it in LR
of user tables.
By adding CTID as a new logical column, however, you effectively force
CLUSTER to issue a logical UPDATE record for all tuples that get a new
CTID in the table, or every UPDATE afterward would be replicated with
the wrong CTID from the subscriber's point of view.

Example:
A table with live CTIDs A: (0, 2), B: (0, 3).
This table gets CLUSTERed: the CTIDs on disk in the new table are A:
(0, 1), B: (0, 2).

Note how row B now has the CTID that row A had before clustering; A
has a previously unused CTID; and B's old CTID is now orphaned.
Assuming LR didn't get an update about this CLUSTER changing the CTIDs
of logical rows, the replica will be desynced from the primary.
An UPDATE on row B will still get recorded and replicated, but because
the CTID is now (0, 2) the remote thinks the update was for row A
(which it had recorded previously as the one that last had CTID (0,
2)). Updates for row A on the primary will get replicated and fail to
find a matching row, because its new CTID (0, 1) wasn't an ID that was
replicated yet, and there's now an additional row B that's been
orphaned on the replica.
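
The mechanics are easy to see with the ctid system column (table c is
purely illustrative; the exact ctids depend on the physical layout, but
a rewrite reassigning them without any logical change is the norm):

CREATE TABLE c (id int PRIMARY KEY, data text);
INSERT INTO c VALUES (1, 'A'), (2, 'B');
DELETE FROM c WHERE id = 1;
SELECT ctid, id FROM c;    -- e.g. (0,2) for row 2

CLUSTER c USING c_pkey;    -- rewrites the table, reassigning ctids
SELECT ctid, id FROM c;    -- e.g. (0,1) for the same logical row 2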

possibly even bloating the subscriber by a good deal more
than what the publisher cleaned up.

The absolute biggest bloater is REPLICA IDENTITY FULL.

The beauty of using REPLICA IDENTITY ROWID is that absolutely *nothing
extra needs to be added to WAL*.

Except for CLUSTER and related operations, which now need to log TID mappings.

The tids are already there even for physical replication (and even for
wal_level=minimal) as they are required even for crash recovery. All
my patch to core does is expose them to the logical decoding mechanism.
You do not have to use them; nothing changes for decoding plugins not
using them.

About the patch:

The ItemPointerDatas newly added to ReorderBufferChange.data.tp are
better stored in {old,new}tuple->t_self, if we're going to store them.
The size of .data is currently 32 bytes, increasing that by 16 bytes
is rather wasteful if we already have a more convenient and accurate
place to store this data.

I see even less reason why logical replication can care about
HOT-ness of updates (that's mostly an implementation detail of heap;
logically there is and should be no difference between a normal update
and a HOT update).
But, if the decoder really did care about those implementation
details, it would arguably be better to pull that data from the tuple's
infomasks, as then there wouldn't be a need for the additional boolean
ReorderBufferChange->data.tp.is_hot_update. Alternatively, move it
into an alignment gap, because the current placement (after removing
the TIDs from data.tp) is rather unfortunate and would add 8 bytes to
the size of this union.

Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)

#10 Hannu Krosing <hannuk@google.com>
In reply to: Matthias van de Meent (#9)
Re: making tid and HOTness of UPDATE available to logical decoding plugins

On Tue, Dec 9, 2025 at 7:41 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

On Mon, 8 Dec 2025 at 16:25, Hannu Krosing <hannuk@google.com> wrote:

...

Suggestion to "get the user specify their own identity column" sounds
good in purely theoretical sense, but can have unacceptable overheads
in practice, especially if a large table started out - often for a
good reason - without a PK or other suittable identiuty column

I don't think that this user problem is something we should be making
our problem; at least not like this.

I assume you don't have to deal with real-world users much :)

And we had similar objections for all CONCURRENTLY commands - why on
earth would anyone want a slower version of CREATE INDEX?

I see again and again real user problems taking ages to address. For
example, when I sent a patch in 2005 that made it possible to run more
than one VACUUM concurrently and actually clean up tables, it lingered
for about a year before I got Alvaro to put it in at the PostgreSQL
Anniversary Summit in July 2006.

As I said before, the row id does not have to be stable across row
versions for logical replication, it just has to be able to track the
identity of "the row that is changed or deleted" which ctid already
does very well.

Yes, I'm aware. But a TID doesn't really mean anything if you don't
have indexes - its value may change arbitrarily between queries.
Anything may happen to a tuple's CTID, as long as 1.) within a statement,
the CTID doesn't change, and 2.) indexed TIDs won't have changes to
the attributes represented in indexes.

Can't parse that :(

Can you perhaps elaborate ?

You were very worried about extra WAL usage above, but seem to be
oblivious of huge resource usage of REPLICA IDENTITY FULL (when used
as a poor man's row id and not because the old row data is needed for
some other reason)

I don't generally suggest people run their systems with
wal_level=logical and try to avoid those systems that have, exactly
because of issues like the significant additional overhead involved
with logging page-local update records.

But real-world PostgreSQL users do use logical replication and logical
decoding for CDC all the time.

And logical replication is currently the only way to do no-downtime
major version upgrades.

When you at some point discover the need for logical replication of a
large table inside a 24/7 production database where you do have
occasional updates - or even frequent updates, just not based on
unique id - you have currently a few options.

1. add REPLICA IDENTITY FULL
- this will double the WAL traffic for updates and usually more
than double for DELETEs (could be 1x or 100x)
- it can also be REALLY REALLY SLOW to replicate, the worst case
requiring 1 sequential scan of the whole table for each UPDATE or
DELETE

Yep, that's about expected; if you want performant UPDATE with
OLTP-style databases you have to index your data.

People don't always do that for all tables, like logs.

And they still may occasionally need to change them, for example
getting a right-to-be-forgotten request and running a delete of all
log records with a specific unindexed attribute. A single sequential
scan is cheap enough to be a good compromise against an all-around
slowdown caused by the index, but if that delete affects 100,000 lines
in a 10TB table you suddenly have 100k sequential scans on the
replica.

2. add a primary key column - quite hard to do CONCURRENTLY, will have
severe disk space and CPU demands and once it has been added (which
could have taken up to a few weeks) it will slow down any inserts.

It's quite possible. Not trivial, but it is a road that many have
taken.

Yes, but only because they have no better option than to have a senior
DBA spend a week or two on this.

And yes, indexing slows down inserts. That's how we make sure
the index remains correct.

I know *why* it slows down inserts :).
I'm just saying that there are many cases where you don't want this slowdown.

To reiterate - "stable across operations" is not a requirement for
logical replication; the tuple id is "stable enough" for streaming
replication changes. Think of it as somebody changing the primary key
column at each update - it seems weird, but the updated PK still
uniquely identifies the tuple for the next operation.

Yeah, I'm aware of that part.

As I also said in my other mail, adding
ctid to the logical replication system will expose too much internal
information and will turn current logical no-ops into logical
operations;

Can you provide an example of this?

I thought about whether I should mention CLUSTER and VACUUM FULL in the
question, but they seemed so obvious that I left them out.

When writing I was in the mindset of multi-terabyte 24/7 high-traffic
databases where these things are out of the question anyway.

...<an example of how CLUSTER works was here> ...

possibly even bloating the subscriber by a good deal more
than what the publisher cleaned up.

The absolute biggest bloater is REPLICA IDENTITY FULL.

The beauty of using REPLICA IDENTITY ROWID is that absolutely *nothing
extra needs to be added to WAL*.

Except for CLUSTER and related operations, which now need to log TID mappings.

I would rather make CLUSTER and VACUUM FULL refuse to run on
tables with REPLICA IDENTITY ROWID and any valid replication slots.

At least until my next proposal for GENERATED ALWAYS AS ROW IDENTITY
gets in (ROWID which starts out as a virtual column showing ctid cast
to bigint, materialized only in case of an update)

The tids are already there even for physical replication (and even for
wal_level=minimal) as they are required even for crash recovery. All
my patch to core does is expose them to the logical decoding mechanism.
You do not have to use them; nothing changes for decoding plugins not
using them.

About the patch:

The ItemPointerDatas newly added to ReorderBufferChange.data.tp are
better stored in {old,new}tuple->t_self, if we're going to store them.

I thought so too, but then found out that it doesn't work, because the
old ctid is present even when the oldtuple is not. It is there even for
wal_level=minimal.

also keeping them separate keeps the overhead minimal when the
decoding plugin does not need them.

The size of .data is currently 32 bytes, increasing that by 16 bytes
is rather wasteful if we already have a more convenient and accurate
place to store this data.

Since this is just an in-memory structure I would mainly worry about
going over 64 bytes (x64 cache line, likely also palloc's internal
step)

I see even less reason why logical replication can care about
HOT-ness of updates (that's mostly an implementation detail of heap;
logically there is and should be no difference between a normal update
and a HOT update).
But, if the decoder really did care about those implementation
details, it would arguably be better to pull that data from the tuple's
infomasks, as then there wouldn't be a need for the additional boolean
ReorderBufferChange->data.tp.is_hot_update.

It is there for efficiency and low overhead. It can be made even more
efficient by giving HOT update its own if() and skipping the second
comparison.

Alternatively, move it
into an alignment gap, because the current placement (after removing
the TIDs from data.tp) is rather unfortunate and would add 8 bytes to
the size of this union.

Have to check, maybe there is a free bit somewhere.

But does it push it over 64-byte boundary ?

My main gripe with the union is that there is one struct member after
the union of different-sized things, so changing the largest union
member moves that one. Not a big issue, but it would be nice if you
did not have to recompile all extensions that do not need the new
fields.

--
Best Regards
Hannu

#11 Matthias van de Meent <boekewurm+postgres@gmail.com>
In reply to: Hannu Krosing (#10)
Re: making tid and HOTness of UPDATE available to logical decoding plugins

On Tue, 9 Dec 2025 at 21:08, Hannu Krosing <hannuk@google.com> wrote:

On Tue, Dec 9, 2025 at 7:41 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

On Mon, 8 Dec 2025 at 16:25, Hannu Krosing <hannuk@google.com> wrote:

Suggestion to "get the user specify their own identity column" sounds
good in purely theoretical sense, but can have unacceptable overheads
in practice, especially if a large table started out - often for a
good reason - without a PK or other suittable identiuty column

I don't think that this user problem is something we should be making
our problem; at least not like this.

I assume you don't have to deal with real-world users much :)

At least none of the kind that don't pay a penny and expect the world
in return, no :)

And we had similar objections for all CONCURRENTLY commands - why on
earth would anyone want a slower version of CREATE INDEX?

With CONCURRENTLY, you're reducing the work done whilst holding heavy
and expensive locks, and that is a real concern in any database that
wants to do maintenance without significant downtime.

But in this case there are roads to get your maintenance (adding a PK)
done without significant downtime, and that without exposing
non-logical data. I don't see the point in bloating the logical
decoding system for this. We might be able to add facilities that add
a primary key CONCURRENTLY, but that's not what's being proposed.

I see again and again real user problems taking ages to address. For
example, when I sent a patch in 2005 that made it possible to run more
than one VACUUM concurrently and actually clean up tables, it lingered
for about a year before I got Alvaro to put it in at the PostgreSQL
Anniversary Summit in July 2006.

Yeah, you're not alone in those pain points. I suspect nearly everyone
who frequents this list has had similar experiences.

As I said before, the row id does not have to be stable across row
versions for logical replication, it just has to be able to track the
identity of "the row that is changed or deleted" which ctid already
does very well.

Yes, I'm aware. But a TID doesn't really mean anything if you don't
have indexes - its value may change arbitrarily between queries.
Anything may happen to a tuple's CTID, as long as 1.) within a statement,
the CTID doesn't change, and 2.) indexed TIDs won't have changes to
the attributes represented in indexes.

Can't parse that :(

Can you perhaps elaborate ?

There are practically two mechanisms in PostgreSQL that hold
references to TIDs which prevent housekeeping jobs in the AM from
reassigning TIDs at will:
1.) indexes, which reference specific TIDs (or the blknos of TIDs for
amsummarizing indexes) in the table that have a specific set of
unchanging attribute values for the indexed attributes (in HOT terms,
the HOT root), and
2.) active scans, which use the TIDs for updates, deletes, locking,
and other tuple-level operations.
2.a) catcache, whilst not exactly an active scan, also holds TIDs for
those same reasons. I grouped it under (2) because it's not _that_
different, and because it exclusively uses heapam it's otherwise
irrelevant to the points below about how other AMs would expose TIDs.

But outside any single query, catcache internals, and indexes, we
don't have any internal expectation of TID stability. If you had an AM
that could guarantee that there's no index that references the tuple,
and no scan that'll need to refer to that tuple by its TID, then the
TID of the tuple could well be reassigned at will.

<tangent>
I've actually tried to implement something similar at some point by
making scans refer to the HOT root instead of the live tuple's TID,
but that failed due to unforeseen complexities.
In principle, MVCC scans could refer to only the HOT root, and ignore
non-visible tuples of that HOT chain. Pruning would then be allowed to
move HOT tuples' ItemIds around in the page's line pointer array
(because scans only refer to the root, and can find the original
tuples again using visibility checks), enabling some more
defragmentation in the heap page.
Sadly that doesn't work for non-MVCC scans, as they (may) need to
access all of the tuples in the HOT chain and not just the current
visible tuple; thus requiring more work than just this. But
theoretically, that problem can also be solved.
</tangent>

You were very worried about extra WAL usage above, but seem to be
oblivious of huge resource usage of REPLICA IDENTITY FULL (when used
as a poor man's row id and not because the old row data is needed for
some other reason)

I don't generally suggest people run their systems with
wal_level=logical and try to avoid those systems that have, exactly
because of issues like the significant additional overhead involved
with logging page-local update records.

But real-world PostgreSQL users do use logical replication and logical
decoding for CDC all the time.

And logical replication is currently the only way to do no-downtime
major version upgrades.

Yep. Hence my point - add a PK column. Then you don't have the
overhead of REPLICA IDENTITY FULL, nor any new requirements during
table rewrites.

When you at some point discover the need for logical replication of a
large table inside a 24/7 production database where you do have
occasional updates - or even frequent updates, just not based on
unique id - you have currently a few options.

1. add REPLICA IDENTITY FULL
- this will double the WAL traffic for updates and usually more
than double for DELETEs (could be 1x or 100x)
- it can also be REALLY REALLY SLOW to replicate, the worst case
requiring 1 sequential scan of the whole table for each UPDATE or
DELETE

Yep, that's about expected; if you want performant UPDATE with
OLTP-style databases you have to index your data.

People don't always do that for all tables, like logs.

So, let me understand this.

This hypothetical table has no indexes, because errnoresources. And
you want to replicate this data, and want to use TID because it's slow
to use FULL identity.

But I still don't see how TID will make this meaningfully faster; you
still don't have an index to improve the O(tablesize) scans with. And
if you're putting an index on it on the remote side, why wouldn't the
origin be able to support this index?

And they still may occasionally need to change them, for example
getting a right-to-be-forgotten request and running a delete of all
log records with a specific unindexed attribute. A single sequential
scan is cheap enough to be a good compromise against an all-around
slowdown caused by the index, but if that delete affects 100,000 lines
in a 10TB table you suddenly have 100k sequential scans on the
replica.

Again, that's not solved by replicating TIDs when you refuse to add an
index. I don't think that the cost of a full row in WAL is that
expensive when the replica will do a full table scan regardless of the
row ID used.

2. add a primary key column - quite hard to do CONCURRENTLY, will have
severe disk space and CPU demands and once it has been added (which
could have taken up to a few weeks) it will slow down any inserts.

It's quite possible. Not trivial, but it is a road that many have
taken.

Yes, but only because they have no better option than to have a senior
DBA spend a week or two on this.

How about a normal DBA or database-aware application developer? Those
are also often employed, and can apply the same techniques which have
been developed over time. It isn't rocket science, and not even deep
database technology anymore.

And yes, indexing slows down inserts. That's how we make sure
the index remains correct.

I know *why* it slows down inserts :).
I'm just saying that there are many cases where you don't want this slowdown.

Well, yes. I don't expect anyone would say "yes" if you asked them "do
you like the slowdown caused by indexing tuples during insertions and
updates?", but they probably meant "improve the performance of index
insertion" rather than "remove my indexes".

As I also said in my other mail, adding
ctid to the logical replication system will expose too much internal
information and will turn current logical no-ops into logical
operations;

Can you provide an example of this?

I thought about whether I should mention CLUSTER and VACUUM FULL in the
question, but they seemed so obvious that I left them out.

When writing I was in the mindset of multi-terabyte 24/7 high-traffic
databases where these things are out of the question anyway.

Most PostgreSQL tables are much smaller than the multi-terabyte scale
that you think of, and most have much lower traffic. I don't think
it's that unreasonable to consider that use case as a valid issue with
this patch, and table rewrites are not that uncommon (even if most
which are done through ALTER TABLE are accidental nowadays).

possibly even bloating the subscriber by a good deal more
than what the publisher cleaned up.

The absolute biggest bloater is REPLICA IDENTITY FULL.

The beauty of using REPLICA IDENTITY ROWID is that absolutely *nothing
extra needs to be added to WAL*.

Except for CLUSTER and related operations, which now need to log TID mappings.

I would rather make CLUSTER and VACUUM FULL refuse to run on
tables with REPLICA IDENTITY ROWID and any valid replication slots.

So, that's effectively blocked for all tables with REPLICA IDENTITY
ROWID, because you can't know which invalid replication slots will be
turning valid during the operation, or which slots will get started. I
don't think that's a nice approach.

About the patch:

The ItemPointerDatas newly added to ReorderBufferChange.data.tp are
better stored in {old,new}tuple->t_self, if we're going to store them.

I thought so too, but then found out that it doesn't work, because the
old ctid is present even when the oldtuple is not. It is there even for
wal_level=minimal.

We can't (well, shouldn't) do logical decoding for non-logical wal_level's WAL.
But more importantly, how can we correctly process an UPDATE when
there's no old tuple that was updated? Presumably that means 'the
primary key wasn't updated', but that also means that oldtuple must be
populated when the TID-based surrogate primary key changes.

Alternatively, you could make {old,new}tuple a union {HeapTuple, TID};
with the specific type indicated by a bool that's stored in the 3
alignment bytes after .clear_toast_afterwards. It'd prevent the data
union from growing further.

also keeping them separate keeps the overhead minimal when the
decoding plugin does not need them.

What do you mean by "keeping them separate ... does not need them"?

The size of .data is currently 32 bytes, increasing that by 16 bytes
is rather wasteful if we already have a more convenient and accurate
place to store this data.

Since this is just an in-memory structure I would mainly worry about
going over 64 bytes (x64 cache line, likely also palloc's internal
step)

The ReorderBufferChanges are palloc-ed in a SlabContext, which is
optimized for (and only allows) allocations of a single size, saving
the additional overhead of bucketed sizing.
This means that for ReorderBufferChange, every byte (well, MAXALIGN()
of bytes) counts, there is no bucket alignment.

//aside, all this talk about sizing got me to realize that we're
wasting 10% of the struct's size on alignment gaps in the first 4
fields. That's not great either.

I see even less reason why logical replication can care about
HOT-ness of updates (that's mostly an implementation detail of heap;
logically there is and should be no difference between a normal update
and a HOT update).
But, if the decoder really did care about those implementation
details, it would arguably be better to pull that data from the tuple's
infomasks, as then there wouldn't be a need for the additional boolean
ReorderBufferChange->data.tp.is_hot_update.

It is there for efficiency and low overhead. It can be made even more
efficient by giving HOT update its own if() and skipping the second
comparison.

What efficiency is there to gain in logical decoding from adding HOT
to logical decoding? Logically speaking, there is no distinction
between a normal update and one that applied the HOT mechanism. Even
if you were to include TIDs as part of the tuples' attributes, HOT
doesn't change anything there. Indexes are not logically replicated,
and even if they were the replica would have to decide for itself to
apply HOT or not.

Alternatively, move it
into an alignment gap, because the current placement (after removing
the TIDs from data.tp) is rather unfortunate and would add 8 bytes to
the size of this union.

Have to check, maybe there is a free bit somewhere.

There are still 3 bytes of alignment left just after clear_toast_afterwards.

But does it push it over 64-byte boundary ?

No (HEAD is at 80B), but without the ItemPointerDatas, and at the
current location it would increase the size of ReorderBufferChange by
8B to 88B, which would increase ReorderBufferChange's memory usage by
10%. The ItemPointerDatas in your patch create a gap in which the bool
is stored without additional alignment losses, but as I mentioned
above those are also a change I'm not particularly happy about.

My main gripe with the union is that there is one struct member after
the union of different-sized things, so changing the largest union
member moves that one. Not a big issue, but it would be nice if you
did not have to recompile all extensions that do not need the new
fields.

I would be extremely hesitant to run extensions that were compiled for
one major PostgreSQL version against a different major PostgreSQL
version. There are too many changing internals across versions to
realistically expect everything to just work.

Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)