Re: Adding REPACK [concurrently]
On 2025-Jul-26, Robert Treat wrote:
For clarity, are you intending to commit this patch before having the
other parts ready? (If that sounds like an objection, it isn't) After
a first pass, I think there are some confusing bits in the new docs that
could use straightening out, but there are likely going to be overlapping
changes once CONCURRENTLY is brought in, so it might make sense to hold off on
those.
I'm aiming at getting 0001 committed during the September commitfest,
and the CONCURRENTLY flag addition later in the pg19 cycle. But I'd
rather have good-enough docs at every step of the way. They don't have
to be *perfect* if we want to get everything in pg19, but I'd rather not
leave anything openly confusing even transiently.
That said, I did not review the docs this time around, so they are included
unchanged from the previous post. But if you want to suggest
changes for the docs in 0001, please do. Just don't get too carried
away.
speaking of, for this bit in src/backend/commands/cluster.c
+ switch (cmd)
+ {
+ 	case REPACK_COMMAND_REPACK:
+ 		return "REPACK";
+ 	case REPACK_COMMAND_VACUUMFULL:
+ 		return "VACUUM";
+ 	case REPACK_COMMAND_CLUSTER:
+ 		return "VACUUM";
+ }
+ return "???";

The last one should return "CLUSTER", no?
Absolutely -- my blunder.
On 2025-Jul-27, Fujii Masao wrote:
The patch submitted here, largely by Antonin Houska with some
changes by me, is based on the pg_squeeze code which he
authored, and first introduces a new command called REPACK to absorb
both VACUUM FULL and CLUSTER, followed by addition of a CONCURRENTLY
flag to allow some forms of REPACK to operate online using logical
decoding.

Does this mean REPACK CONCURRENTLY requires wal_level = logical, while
plain REPACK (without CONCURRENTLY) works with any wal_level setting?
If we eventually deprecate VACUUM FULL and CLUSTER, I think plain
REPACK should still be allowed with wal_level = minimal or replica, so
users with those settings can perform equivalent processing.
Absolutely.
One of the later patches in the series, which I have not included yet,
intends to implement the idea of transiently enabling wal_level=logical
for the table being repacked concurrently, so that you can still use
the concurrent mode if you have a non-logical-wal_level instance.
+ if (!cluster_is_permitted_for_relation(tableOid, userid,
+                                        CLUSTER_COMMAND_CLUSTER))

As for the patch you attached, it seems to be an early WIP and
might not be ready for review yet? BTW, I got a compilation failure
around CLUSTER_COMMAND_CLUSTER, and probably the "userid" in the
above should be GetUserId().
This was a silly merge mistake, caused by my squashing Antonin's 0004
(trivial code restructuring) into 0001 at the last minute and failing to
"git add" the compile fixes before doing git-format-patch.
Here's v17. (I decided that calling my previous one "v1" after Antonin
had gone all the way to v15 was stupid on my part.) The important part
here is that I rebased Antonin's 0004, that is, the addition of the
CONCURRENTLY flag, plus the 0005 regression tests.
The only interesting change here is that I decided to not mess with the
grammar by allowing an unparenthesized CONCURRENTLY keyword; if you want
concurrent, you have to say "REPACK (CONCURRENTLY)". This is at odds
with the way we use the keyword in other commands, but ISTM we don't
_need_ to support that legacy syntax. Anyway, this is easy to put back
afterwards, if enough people find it not useless.
I've not reviewed 0003 in depth yet, just rebased it. But it works to
the point that CI is happy with it.
I've not yet included Antonin's 0006 and 0007.
TODO list for 0001:
- addition of src/bin/scripts/repackdb
- clean up the progress report infrastructure
- doc review
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
Thou shalt check the array bounds of all strings (indeed, all arrays), for
surely where thou typest "foo" someone someday shall type
"supercalifragilisticexpialidocious" (5th Commandment for C programmers)
On Fri, Aug 1, 2025 at 1:50 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
One of the later patches in the series, which I have not included yet,
intends to implement the idea of transiently enabling wal_level=logical
for the table being repacked concurrently, so that you can still use
the concurrent mode if you have a non-logical-wal_level instance.
Sounds good to me!
Here's v17.
I just tried REPACK command and observed a few things:
When I repeatedly ran REPACK on the regression database
while make installcheck was running, I got the following error:
ERROR: StartTransactionCommand: unexpected state STARTED
"REPACK (VERBOSE);" failed with the following error.
ERROR: syntax error at or near ";"
REPACK (CONCURRENTLY) USING INDEX failed with the following error,
while the same command without CONCURRENTLY completed successfully:
=# REPACK (CONCURRENTLY) parallel_vacuum_table using index
regular_sized_index ;
ERROR: cannot process relation "parallel_vacuum_table"
HINT: Relation "parallel_vacuum_table" has no identity index.
When I ran REPACK (CONCURRENTLY) on a table that's also a logical
replication target, I saw the following log messages. Is this expected?
=# REPACK (CONCURRENTLY) t;
LOG: logical decoding found consistent point at 1/00021F20
DETAIL: There are no running transactions.
STATEMENT: REPACK (CONCURRENTLY) t;
Regards,
--
Fujii Masao
Hello Álvaro,
Should we skip non-ready indexes in build_new_indexes?
Also, in the current implementation, concurrent mode is marked as
non-MVCC safe. From my point of view, this is a significant limitation
for practical use.
Should we consider an option that trades the non-MVCC issues for a short
exclusive lock on ProcArrayLock plus cancellation of some transactions
with an older xmin?
Best regards,
Mikhail
Hello!
One more thing - I think build_new_indexes and
index_concurrently_create_copy are very close in semantics, so it
might be a good idea to refactor them a bit.
I’m still concerned about MVCC-related issues. For multiple
applications, this is a dealbreaker, because in some cases correctness
is a higher priority than availability.
Possible options:
1) Terminate connections with old snapshots.
Add a flag to terminate all connections with snapshots during the
ExclusiveLock period for the swap. From the application’s perspective,
this is not a big deal - it's similar to a primary switch. We would
also need to prevent new snapshots from being taken during the swap
transaction, so a short exclusive lock on ProcArrayLock would also be
required.
2) MVCC-safe two-phase approach (inspired by CREATE INDEX).
- copy the data from T1 to the new table T2.
- apply the log.
- take a table-exclusive lock on T1
- apply the log again.
- instead of swapping, mark the T2 as a kind of shadow table - any
transaction applying changes to T1 must also apply them to T2, while
reads still use T1 as the source of truth.
- commit (and record the transaction ID as XID1).
- at this point, all changes are applied to both tables with the same
XIDs because of the "shadow table" mechanism.
- wait until older snapshots no longer treat XID1 as uncommitted.
- now the tables are identical from the MVCC perspective.
- take an exclusive lock on both T1 and T2.
- perform the swap and drop T1.
- commit.
This is more complex and would require implementing some sort of
"shadow table" mechanism, so it might not be worth the effort. Option
1 feels more appealing to me.
If others think this is a good idea, I might try implementing a proof
of concept.
Best regards,
Mikhail
On 2025-Aug-09, Mihail Nikalayeu wrote:
Hello!
One more thing - I think build_new_indexes and
index_concurrently_create_copy are very close in semantics, so it
might be a good idea to refactor them a bit.

I'm still concerned about MVCC-related issues. For multiple
applications, this is a dealbreaker, because in some cases correctness
is a higher priority than availability.
Please note that Antonin already implemented this. See his patches
here:
/messages/by-id/77690.1725610115@antos
I proposed to leave this part out initially, which is why it hasn't been
reposted. We can review and discuss after the initial patches are in.
Because having an MVCC-safe mode has drawbacks, IMO we should make it
optional.
But you're welcome to review that part specifically if you're so
inclined, and offer feedback on it. (I suggest to rewind back your
checked-out tree to branch master at the time that patch was posted, for
easy application. We can deal with a rebase later.)
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Officer Krupke, what are we to do?
Gee, officer Krupke, Krup you! (West Side Story, "Gee, Officer Krupke")
Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:
One more thing - I think build_new_indexes and
index_concurrently_create_copy are very close in semantics, so it
might be a good idea to refactor them a bit.
You're right. I think I even used the latter for reference when writing the
first.
0002 in the attached series tries to fix that. build_new_indexes() (in 0004)
is simpler now.
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Attachments:
v18-0001-Add-REPACK-command.patch (text/x-diff, +1406 -380)
v18-0002-Refactor-index_concurrently_create_copy-for-use-with.patch (text/x-diff, +31 -9)
v18-0003-Move-conversion-of-a-historic-to-MVCC-snapshot-to-a-.patch (text/x-diff, +45 -12)
v18-0004-Add-CONCURRENTLY-option-to-REPACK-command.patch (text/plain, +2820 -262)
Antonin Houska <ah@cybertec.at> wrote:
Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:
One more thing - I think build_new_indexes and
index_concurrently_create_copy are very close in semantics, so it
might be a good idea to refactor them a bit.

You're right. I think I even used the latter for reference when writing the
first.

0002 in the attached series tries to fix that. build_new_indexes() (in 0004)
is simpler now.
This is v18 again. Parts 0001 through 0004 are unchanged, however 0005 is
added. It implements a new client application pg_repackdb. (If I posted 0005
alone its regression tests would not work. I wonder if the cfbot handles the
repeated occurrence of the 'v18-' prefix correctly.)
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Attachments:
v18-0001-Add-REPACK-command.patch (text/x-diff, +1406 -380)
v18-0002-Refactor-index_concurrently_create_copy-for-use-with.patch (text/x-diff, +31 -9)
v18-0003-Move-conversion-of-a-historic-to-MVCC-snapshot-to-a-.patch (text/x-diff, +45 -12)
v18-0004-Add-CONCURRENTLY-option-to-REPACK-command.patch (text/plain, +2820 -262)
v18-0005-Introduce-repackdb-client-application.patch (text/x-diff, +1867 -1053)
On 2025-Aug-15, Antonin Houska wrote:
This is v18 again.
Thanks for this!
Parts 0001 through 0004 are unchanged, however 0005 is added. It
implements a new client application pg_repackdb. (If I posted 0005
alone its regression tests would not work. I wonder if the cfbot
handles the repeated occurrence of the 'v18-' prefix correctly.)
Yeah, the cfbot is just going to take the attachments from the latest
email in the thread that has any, and assume they make up the whole
patch series. It wouldn't work to post just v18-0005 and assume that
the bot is going to grab patches 0001 through 0004 from a previous
email, if that's what you're thinking. In short, what you did is
correct and necessary.
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Hello everyone!
Alvaro Herrera <alvherre@alvh.no-ip.org>:
Please note that Antonin already implemented this. See his patches
here:
/messages/by-id/77690.1725610115@antos
I proposed to leave this part out initially, which is why it hasn't been
reposted. We can review and discuss after the initial patches are in.
I think it is worth pushing it at least in the same release cycle.
But you're welcome to review that part specifically if you're so
inclined, and offer feedback on it. (I suggest to rewind back your
checked-out tree to branch master at the time that patch was posted, for
easy application. We can deal with a rebase later.)
I have rebased that on top of v18 (attached).
Also, I think I found an issue (or lost something during rebase): we
must preserve xmin/cmin during the initial copy
to make sure the data is going to be visible to the snapshots of
concurrent changes later:

static void
reform_and_rewrite_tuple(......)
.....
	/*
	 * It is also crucial to stamp the new record with the exact same xid and
	 * cid, because the tuple must be visible to the snapshot of the applied
	 * concurrent change later.
	 */
	CommandId cid = HeapTupleHeaderGetRawCommandId(tuple->t_data);
	TransactionId xid = HeapTupleHeaderGetXmin(tuple->t_data);

	heap_insert(NewHeap, copiedTuple, xid, cid, HEAP_INSERT_NO_LOGICAL, NULL);
I'll try to polish that part a little bit.
Because having an MVCC-safe mode has drawbacks, IMO we should make it
optional.
Do you mean some option for the command? Like REPACK (CONCURRENTLY, SAFE)?
Best regards,
Mikhail.
Attachments:
v18-0006-Preserve-visibility-information-of-the-concurren.patch (application/octet-stream, +636 -97)
Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:
Also, I think I found an issue (or lost something during rebase): we
must preserve xmin/cmin during the initial copy
to make sure the data is going to be visible to the snapshots of
concurrent changes later:

static void
reform_and_rewrite_tuple(......)
.....
	/*
	 * It is also crucial to stamp the new record with the exact same xid and
	 * cid, because the tuple must be visible to the snapshot of the applied
	 * concurrent change later.
	 */
	CommandId cid = HeapTupleHeaderGetRawCommandId(tuple->t_data);
	TransactionId xid = HeapTupleHeaderGetXmin(tuple->t_data);

	heap_insert(NewHeap, copiedTuple, xid, cid, HEAP_INSERT_NO_LOGICAL, NULL);
When posting version 12 of the patch [1] I raised a concern that the MVCC
safety is too expensive when it comes to logical decoding. Therefore, I
abandoned the concept for now, and v13 [2] uses plain heap_insert(). Once we
implement the MVCC safety, we simply rewrite the tuple like v12 did - that's
the simplest way to preserve fields like xmin, cmin, ...
[1]: /messages/by-id/178741.1743514291@localhost
[2]: /messages/by-id/97795.1744363522@localhost
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Hello, Antonin!
When posting version 12 of the patch [1] I raised a concern that the MVCC
safety is too expensive when it comes to logical decoding. Therefore, I
abandoned the concept for now, and v13 [2] uses plain heap_insert(). Once we
implement the MVCC safety, we simply rewrite the tuple like v12 did - that's
the simplest way to preserve fields like xmin, cmin, ...
Thanks for the explanation.
I was looking into catalog-related logical decoding features, and it
seems like they are clearly overkill for the repack case.
We don't need CID tracking or even a snapshot for each commit if we’re
okay with passing xmin/xmax as arguments.
What do you think about the following approach for replaying:
* use the extracted XID as the value for xmin/xmax.
* use SnapshotSelf to find the tuple for update/delete operations.
SnapshotSelf seems like a good fit here:
* it sees the last "existing" version.
* any XID set as xmin/xmax in the repacked version is already
committed - so each update/insert is effectively "committed" once
written.
* it works with multiple updates of the same tuple within a single
transaction - SnapshotSelf sees the last version.
* all updates are ordered and replayed sequentially - so the last
version is always the one we want.
If I'm not missing anything, this looks like something worth including
in the patch set.
If so, I can try implementing a test version.
Best regards,
Mikhail
Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:
I was looking into catalog-related logical decoding features, and it
seems like they are clearly overkill for the repack case.
We don't need CID tracking or even a snapshot for each commit if we’re
okay with passing xmin/xmax as arguments.
I assume you are concerned with the patch part 0005 of the v12 patch
("Preserve visibility information of the concurrent data changes."), aren't
you?
What do you think about the following approach for replaying:
* use the extracted XID as the value for xmin/xmax.
* use SnapshotSelf to find the tuple for update/delete operations.

SnapshotSelf seems like a good fit here:
* it sees the last "existing" version.
* any XID set as xmin/xmax in the repacked version is already
committed - so each update/insert is effectively "committed" once
written.
* it works with multiple updates of the same tuple within a single
transaction - SnapshotSelf sees the last version.
* all updates are ordered and replayed sequentially - so the last
version is always the one we want.

If I'm not missing anything, this looks like something worth including
in the patch set.
If so, I can try implementing a test version.
Not sure I understand in all details, but I don't think SnapshotSelf is the
correct snapshot. Note that HeapTupleSatisfiesSelf() does not use its
'snapshot' argument at all. Instead, it considers the set of running
transactions as it is at the time the function is called.
One particular problem I imagine is replaying an UPDATE to a row that some
later transaction will eventually delete, but the transaction that ran the
UPDATE obviously had to see it. When looking for the old version during the
replay, HeapTupleSatisfiesMVCC() will find the old version as long as we pass
the correct snapshot to it.
However, at the time we're replaying the UPDATE in the new table, the tuple
may have been already deleted from the old table, and the deleting transaction
may already have committed. In such a case, HeapTupleSatisfiesSelf() will
conclude that the old version is invisible and we'll fail to replay the UPDATE.
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Hi, Antonin!
I assume you are concerned with the patch part 0005 of the v12 patch
("Preserve visibility information of the concurrent data changes."), aren't
you?
Yes, of course. I got an idea while trying to find a way to optimize it.
Not sure I understand in all details, but I don't think SnapshotSelf is the
correct snapshot. Note that HeapTupleSatisfiesSelf() does not use its
'snapshot' argument at all. Instead, it considers the set of running
transactions as it is at the time the function is called.
Yes, and it is almost the same behavior when a typical MVCC snapshot
encounters a tuple created by its own transaction.
So, how it works in the non-MVCC-safe case (current patch behaviour):
1) we have a whole initial table snapshot with all the xmin = repack XID
2) the applying transaction sees ALL the self-alive (no xmax) tuples in it
because all tuples were created/deleted by the transaction itself
3) each update/delete during the replay selects the last existing
tuple version, updates its xmax and inserts a new one
4) so, there is no real MVCC involved - just find the latest
version and create a new version
5) and it works correctly because all ordering issues were resolved by
locking mechanisms on the original table or by the reorder buffer

How it maps to the MVCC-safe case (SnapshotSelf):
1) we have a whole initial table snapshot with all xmin copied from
the original table. All such xmin are committed.
2) the applying transaction sees ALL the self-alive (no xmax) tuples in it
because their xmin/xmax are committed and SnapshotSelf is happy with that
3) each update/delete during the replay selects the last existing
tuple version, sets its xmax = original xid and inserts a new one
with xmin = original xid
4) --//--
5) --//--
However, at the time we're replaying the UPDATE in the new table, the tuple
may have been already deleted from the old table, and the deleting transaction
may already have committed. In such a case, HeapTupleSatisfiesSelf() will
conclude the old version invisible and the we'll fail to replay the UPDATE.
No, it will see it - because its xmax will be empty in the repacked
version of the table.
From your question I feel you understood the concept - but feel free
to ask for an additional explanation/scheme.
Best regards,
Mikhail.
Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:
Not sure I understand in all details, but I don't think SnapshotSelf is the
correct snapshot. Note that HeapTupleSatisfiesSelf() does not use its
'snapshot' argument at all. Instead, it considers the set of running
transactions as it is at the time the function is called.

Yes, and it is almost the same behavior when a typical MVCC snapshot
encounters a tuple created by its own transaction.

So, how it works in the non-MVCC-safe case (current patch behaviour):
1) we have a whole initial table snapshot with all the xmin = repack XID
2) the applying transaction sees ALL the self-alive (no xmax) tuples in it
because all tuples were created/deleted by the transaction itself
3) each update/delete during the replay selects the last existing
tuple version, updates its xmax and inserts a new one
4) so, there is no real MVCC involved - just find the latest
version and create a new version
5) and it works correctly because all ordering issues were resolved by
locking mechanisms on the original table or by the reorder buffer
ok
How it maps to the MVCC-safe case (SnapshotSelf):
1) we have a whole initial table snapshot with all xmin copied from
the original table. All such xmin are committed.
2) the applying transaction sees ALL the self-alive (no xmax) tuples in it
because their xmin/xmax are committed and SnapshotSelf is happy with that
How does HeapTupleSatisfiesSelf() recognize the status of any XID w/o using a
snapshot? Do you mean by checking the commit log (TransactionIdDidCommit)?
3) each update/delete during the replay selects the last existing
tuple version, sets its xmax = original xid and inserts a new one
with xmin = original xid
4) --//--
5) --//--
However, at the time we're replaying the UPDATE in the new table, the tuple
may have been already deleted from the old table, and the deleting transaction
may already have committed. In such a case, HeapTupleSatisfiesSelf() will
conclude that the old version is invisible and we'll fail to replay the UPDATE.

No, it will see it - because its xmax will be empty in the repacked
version of the table.
You're right, it'll be empty in the new table.
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Hi, Antonin
How does HeapTupleSatisfiesSelf() recognize the status of any XID w/o using a
snapshot? Do you mean by checking the commit log (TransactionIdDidCommit)?
Yes, TransactionIdDidCommit. Another option is just to invent a new
snapshot type - SnapshotBelieveEverythingCommitted - for that
particular case it should work, because all xmin/xmax written into
the new table are committed by design.
On Mon, Aug 25, 2025 at 10:15 AM Mihail Nikalayeu
<mihailnikalayeu@gmail.com> wrote:
1) we have a whole initial table snapshot with all xmin copied from
the original table. All such xmin are committed.
2) the applying transaction sees ALL the self-alive (no xmax) tuples in it
because their xmin/xmax are committed and SnapshotSelf is happy with that
3) each update/delete during the replay selects the last existing
tuple version, sets its xmax = original xid and inserts a new one
with xmin = original xid
4) --//--
5) --//--
Advancing the table's minimum xid to at least the repack XID is a pretty big
feature, but the above scenario sounds like it would result in any
non-modified pre-existing tuples ending up with their original xmin
rather than the repack XID, which seems like it could lead to weird
side-effects. Maybe I am mis-thinking it though?
Robert Treat
https://xzilla.net
Robert Treat <rob@xzilla.net> wrote:
On Mon, Aug 25, 2025 at 10:15 AM Mihail Nikalayeu
<mihailnikalayeu@gmail.com> wrote:

1) we have a whole initial table snapshot with all xmin copied from
the original table. All such xmin are committed.
2) the applying transaction sees ALL the self-alive (no xmax) tuples in it
because their xmin/xmax are committed and SnapshotSelf is happy with that
3) each update/delete during the replay selects the last existing
tuple version, sets its xmax = original xid and inserts a new one
with xmin = original xid
4) --//--
5) --//--

Advancing the table's minimum xid to at least the repack XID is a pretty big
feature, but the above scenario sounds like it would result in any
non-modified pre-existing tuples ending up with their original xmin
rather than the repack XID, which seems like it could lead to weird
side-effects. Maybe I am mis-thinking it though?
What we discuss here is how to keep visibility information of tuples (xmin,
xmax, ...) unchanged. Both CLUSTER and VACUUM FULL already do that. However
it's not trivial to ensure that REPACK with the CONCURRENTLY option does as
well.
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:
Hi, Antonin
How does HeapTupleSatisfiesSelf() recognize the status of any XID w/o using a
snapshot? Do you mean by checking the commit log (TransactionIdDidCommit)?

Yes, TransactionIdDidCommit.
I think the problem is that HeapTupleSatisfiesSelf() uses
TransactionIdIsInProgress() instead of checking the snapshot:
...
else if (TransactionIdIsInProgress(HeapTupleHeaderGetRawXmin(tuple)))
return false;
else if (TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple)))
...
When decoding (and replaying) data changes, you deal with the database state
as it was (far) in the past. However TransactionIdIsInProgress() is not
suitable for this purpose.
And since CommitTransaction() updates the commit log before removing the
transaction from ProcArray, I can even imagine race conditions: if a
transaction is committed and decoded fast enough, TransactionIdIsInProgress()
might still return true. In such a case, HeapTupleSatisfiesSelf() returns
false instead of calling TransactionIdDidCommit().
Another option is just invent a new
snapshot type - SnapshotBelieveEverythingCommitted - for that
particular case it should work - because all xmin/xmax written into
the new table are committed by design.
I'd prefer optimization of the logical decoding for REPACK CONCURRENTLY, and
using the MVCC snapshots.
--
Antonin Houska
Web: https://www.cybertec-postgresql.com
Antonin Houska <ah@cybertec.at>:
I think the problem is that HeapTupleSatisfiesSelf() uses
TransactionIdIsInProgress() instead of checking the snapshot:
Yes, some issues might be possible for SnapshotSelf.
A possible solution is to override TransactionIdIsCurrentTransactionId
to return true (like you did with nParallelCurrentXids, but just returning
true). IIUC, in that case all checks are going to behave the same way as in
the v5 version.
I'd prefer optimization of the logical decoding for REPACK CONCURRENTLY, and
using the MVCC snapshots.
That is also possible, but it is much more complex and feels like overkill
to me. We just need a way to find the latest version of a row in a world of
all-committed transactions without any concurrent writers - I am
pretty sure it is possible to achieve that in a simpler and more effective
way.
Mihail Nikalayeu <mihailnikalayeu@gmail.com> wrote:
Antonin Houska <ah@cybertec.at>:
I think the problem is that HeapTupleSatisfiesSelf() uses
TransactionIdIsInProgress() instead of checking the snapshot:

Yes, some issues might be possible for SnapshotSelf.
A possible solution is to override TransactionIdIsCurrentTransactionId
to return true (like you did with nParallelCurrentXids, but just returning
true). IIUC, in that case all checks are going to behave the same way as in
the v5 version.
I assume you mean v12-0005. Yes, that modifies
TransactionIdIsCurrentTransactionId(), so that the transaction being
replayed recognizes whether it (or its subtransaction) performed a particular
change itself.
Although it could work, I think it'd be confusing to consider the transactions
being replayed as "current" from the point of view of the backend that
executes REPACK CONCURRENTLY.
But the primary issue is that in v12-0005,
TransactionIdIsCurrentTransactionId() gets the information on "current
transactions" from snapshots - see the calls of SetRepackCurrentXids() before
each scan. It's probably not what you want.
--
Antonin Houska
Web: https://www.cybertec-postgresql.com