Parallel Apply
Hi,
Background and Motivation
-------------------------------------
In high-throughput systems, where hundreds of sessions generate data
on the publisher, the subscriber's apply process often becomes a
bottleneck due to the single apply worker model. While users can
mitigate this by creating multiple publication-subscription pairs,
this approach has scalability and usability limitations.
Currently, PostgreSQL supports parallel apply only for large streaming
transactions (streaming=parallel). This proposal aims to extend
parallelism to non-streaming transactions, thereby improving
replication performance in workloads dominated by smaller, frequent
transactions.
Design Overview
------------------------
To safely parallelize non-streaming transactions, we must ensure that
transaction dependencies are respected to avoid failures and
deadlocks. Consider the following scenarios to understand it better:
(a) Transaction failures: Say, if we insert a row in the first
transaction and update it in the second transaction on the publisher,
then allowing the subscriber to apply both in parallel can lead to
failure in the update; (b) Deadlocks - allowing transactions that
update the same set of rows in a table in the opposite order in
parallel can lead to deadlocks.
The core idea is that the leader apply worker:
a. identifies dependencies between transactions;
b. coordinates parallel workers to apply independent transactions concurrently;
c. ensures correct ordering for dependent transactions.
Dependency Detection
--------------------------------
1. Basic Dependency Tracking: Maintain a hash table keyed by
(RelationId, ReplicaIdentity) with the value as the transaction XID.
Before dispatching a change to a parallel worker, the leader checks
for existing entries: (a) If no match: add the entry and proceed; (b)
If match: instruct the worker to wait until the dependent transaction
completes.
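To make the check concrete, here is a minimal sketch of the leader-side
lookup in PostgreSQL style. The names (ApplyDepKey, ApplyDepEntry,
record_row_dependency) and the key layout are hypothetical, not taken from
any patch; the table itself would be a regular dynahash (utils/hsearch.h):

/*
 * Hypothetical sketch: the key is the relation plus a hash of the replica
 * identity column values; the value is the last remote xid touching the
 * row. Hash collisions only create spurious dependencies, which is safe.
 */
typedef struct ApplyDepKey
{
    Oid         relid;      /* target relation */
    uint32      ri_hash;    /* hash of replica identity column values */
} ApplyDepKey;

typedef struct ApplyDepEntry
{
    ApplyDepKey key;        /* hash key -- must be first */
    TransactionId xid;      /* last remote xid that touched this row */
} ApplyDepEntry;

/* Returns the xid the current transaction must wait for, if any. */
static TransactionId
record_row_dependency(HTAB *dep_hash, Oid relid, uint32 ri_hash,
                      TransactionId cur_xid)
{
    ApplyDepKey key = {relid, ri_hash};
    ApplyDepEntry *entry;
    bool        found;
    TransactionId wait_for = InvalidTransactionId;

    entry = (ApplyDepEntry *) hash_search(dep_hash, &key, HASH_ENTER, &found);
    if (found && !TransactionIdEquals(entry->xid, cur_xid))
        wait_for = entry->xid;  /* (b) match: worker must wait for it */
    entry->xid = cur_xid;       /* (a)/(b): current xid now owns the row */
    return wait_for;
}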
2. Unique Keys
In addition to RI, track unique keys to detect conflicts. Example:
CREATE TABLE tab1(a INT PRIMARY KEY, b INT UNIQUE);
Transactions on publisher:
Txn1: INSERT (1,1)
Txn2: INSERT (2,2)
Txn3: DELETE (2,2)
Txn4: UPDATE (1,1) → (1,2)
If Txn4 is applied before Txn3, its update will fail with a unique
constraint violation on b=2 (and if it is applied even before Txn2,
Txn2's insert will fail instead). To prevent this, track both RI and
unique keys in the hash table and compare the keys of both the old and
new tuples against existing entries: the old tuple's RI, the new
tuple's unique keys, and the new tuple's RI (the new tuple's RI is
required to detect a prior insertion with the same key).
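Put differently, an UPDATE contributes three probes. A short sketch of the
key collection (hypothetical names; collect_key() stands in for hashing the
listed columns of the given tuple and probing/entering the table as in
record_row_dependency() above):

/*
 * Hypothetical sketch: keys an UPDATE must check against the hash table.
 */
static void
collect_update_keys(Oid relid, LogicalRepTupleData *oldtup,
                    LogicalRepTupleData *newtup, TransactionId xid)
{
    /* the row being replaced is identified by the old tuple's RI */
    collect_key(relid, oldtup, KEY_REPLICA_IDENTITY, xid);

    /* a prior insert/update of the same RI value must be detected too */
    collect_key(relid, newtup, KEY_REPLICA_IDENTITY, xid);

    /* unique keys of the new tuple catch cases like Txn4 vs. Txn2/Txn3 */
    collect_key(relid, newtup, KEY_UNIQUE_INDEXES, xid);
}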
3. Foreign Keys
Consider FK constraints between tables. Example:
TABLE owner(user_id INT PRIMARY KEY);
TABLE car(car_name TEXT, user_id INT REFERENCES owner);
Transactions:
Txn1: INSERT INTO owner VALUES(1)
Txn2: INSERT INTO car VALUES('bz', 1)
Applying Txn2 before Txn1 will fail. To avoid this, check if FK values
in new tuples match any RI or unique key in the hash table. If
matched, treat the transaction as dependent.
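The FK probe is a pure lookup against the same table, keyed by the
referenced relation; a sketch (hypothetical names, reusing the structures
from the sketch above):

/*
 * Hypothetical sketch: a hit means an in-flight transaction touched the
 * referenced parent row, so this transaction must be treated as dependent.
 * Unlike record_row_dependency(), this does not claim ownership of the row.
 */
static TransactionId
check_fk_dependency(HTAB *dep_hash, Oid referenced_relid, uint32 fk_hash)
{
    ApplyDepKey key = {referenced_relid, fk_hash};
    ApplyDepEntry *entry;

    entry = (ApplyDepEntry *) hash_search(dep_hash, &key, HASH_FIND, NULL);
    return entry ? entry->xid : InvalidTransactionId;
}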
4. Triggers and Constraints
For the initial version, exclude tables with user-defined triggers or
constraints from parallel apply due to complexity in dependency
detection. We may need some parallel-apply-safe marking to allow this.
Replication Progress Tracking
-----------------------------------------
Parallel apply introduces out-of-order commit application,
complicating replication progress tracking. To handle restarts and
ensure consistency:
Track Three Key Metrics:
lowest_remote_lsn: Starting point for applying transactions.
highest_remote_lsn: Highest LSN that has been applied.
list_remote_lsn: List of commit LSNs applied between the lowest and highest.
Mechanism:
Store these in ReplicationState: lowest_remote_lsn, highest_remote_lsn,
and list_remote_lsn. Flush them to disk during checkpoints, similar to
CheckPointReplicationOrigin.
After a restart, start from lowest_remote_lsn and, for each transaction,
if its commit LSN is in list_remote_lsn, skip it; otherwise, apply it.
Once the commit LSN exceeds highest_remote_lsn, apply without checking
the list.
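As a sketch, the restart-time decision could look as follows (hypothetical
names; list_remote_lsn is shown as a plain array for brevity, and the
commit LSN is the one received with the transaction's BEGIN message):

static bool
should_skip_after_restart(XLogRecPtr commit_lsn,
                          XLogRecPtr highest_remote_lsn,
                          XLogRecPtr *list_remote_lsn, int nentries)
{
    /* beyond the highest applied commit LSN: certainly not applied yet */
    if (commit_lsn > highest_remote_lsn)
        return false;

    /* between lowest and highest: skip only if it was already applied */
    for (int i = 0; i < nentries; i++)
    {
        if (list_remote_lsn[i] == commit_lsn)
            return true;
    }

    return false;
}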
During apply, the leader maintains list_in_progress_xacts in increasing
commit order. On commit, update highest_remote_lsn. If the commit LSN
matches the first in-progress xact in list_in_progress_xacts, update
lowest_remote_lsn; otherwise, add it to list_remote_lsn. After the
commit, also remove the xact from list_in_progress_xacts. While updating
lowest_remote_lsn, we also need to clean up the entries in
list_remote_lsn that fall below its new value.
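A rough sketch of this commit-time bookkeeping (hypothetical structure and
names; the real state would live in ReplicationState; both lists are shown
as plain arrays, with in-progress xacts identified here by their commit
LSNs, and no overflow handling):

#define MAX_TRACKED 1024    /* assumed fixed bound, for the sketch only */

typedef struct ProgressState
{
    XLogRecPtr  lowest_remote_lsn;          /* restart point */
    XLogRecPtr  highest_remote_lsn;         /* highest applied commit LSN */
    XLogRecPtr  applied[MAX_TRACKED];       /* list_remote_lsn */
    int         napplied;
    XLogRecPtr  in_progress[MAX_TRACKED];   /* list_in_progress_xacts */
    int         nin_progress;
} ProgressState;

static void
progress_on_commit(ProgressState *ps, XLogRecPtr commit_lsn)
{
    if (commit_lsn > ps->highest_remote_lsn)
        ps->highest_remote_lsn = commit_lsn;

    if (ps->nin_progress > 0 && ps->in_progress[0] == commit_lsn)
    {
        int         keep = 0;

        /* oldest in-progress xact committed: advance the low watermark */
        ps->lowest_remote_lsn = commit_lsn;

        /* clean up applied entries below the new low watermark */
        for (int i = 0; i < ps->napplied; i++)
        {
            if (ps->applied[i] >= ps->lowest_remote_lsn)
                ps->applied[keep++] = ps->applied[i];
        }
        ps->napplied = keep;
    }
    else
    {
        /* committed out of order: remember it so restart can skip it */
        ps->applied[ps->napplied++] = commit_lsn;
    }

    /* in either case, remove the xact from the in-progress list */
    for (int i = 0; i < ps->nin_progress; i++)
    {
        if (ps->in_progress[i] == commit_lsn)
        {
            memmove(&ps->in_progress[i], &ps->in_progress[i + 1],
                    (ps->nin_progress - i - 1) * sizeof(XLogRecPtr));
            ps->nin_progress--;
            break;
        }
    }
}

Running this against the example below reproduces the states shown in
Steps 1-3.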
To illustrate how this mechanism works, consider the following four
transactions:
Transaction ID Commit LSN
501 1000
502 1100
503 1200
504 1300
Assume:
Transactions 501 and 502 take longer to apply whereas transactions 503
and 504 finish earlier. Parallel apply workers are assigned as
follows:
pa-1 → 501
pa-2 → 502
pa-3 → 503
pa-4 → 504
Initial state: list_in_progress_xacts = [501, 502, 503, 504]
Step 1: Transaction 503 commits first and in RecordTransactionCommit,
it updates highest_remote_lsn to 1200. In apply_handle_commit, since
503 is not the first in list_in_progress_xacts, add 1200 to
list_remote_lsn. Remove 503 from list_in_progress_xacts.
Step 2: Transaction 504 commits; update highest_remote_lsn to 1300, add
1300 to list_remote_lsn, and remove 504 from list_in_progress_xacts.
ReplicationState now:
lowest_remote_lsn = 0
list_remote_lsn = [1200, 1300]
highest_remote_lsn = 1300
list_in_progress_xacts = [501, 502]
Step 3: Transaction 501 commits. Since 501 is now the first in
list_in_progress_xacts, update lowest_remote_lsn to 1000. Remove 501
from list_in_progress_xacts. Clean up list_remote_lsn to remove
entries < lowest_remote_lsn (none in this case).
ReplicationState now:
lowest_remote_lsn = 1000
list_remote_lsn = [1200, 1300]
highest_remote_lsn = 1300
list_in_progress_xacts = [502]
Step 4: System crash and restart
Upon restart, start replication from lowest_remote_lsn = 1000. The first
transaction encountered is 502 with commit LSN 1100; since that LSN is
not present in list_remote_lsn, apply it. As the commit LSNs of
transactions 503 and 504 [1200, 1300] are present in list_remote_lsn,
skip them. Note that it is each transaction's end_lsn/commit_lsn that is
compared, which the apply worker receives along with the transaction's
first BEGIN message. This ensures correctness and avoids duplicate
application of already committed transactions.
Now, it is possible that some users may want to parallelize transactions
but still maintain the commit order, e.g., because they don't explicitly
declare PK or FK constraints on columns and instead maintain integrity
via the application. In such cases we won't be able to detect transaction
dependencies, so it would be better to make out-of-order commits
optional.
Thoughts?
--
With Regards,
Amit Kapila.
Hi!
On Mon, 11 Aug 2025 at 09:46, Amit Kapila <amit.kapila16@gmail.com> wrote:
Hi,
Background and Motivation
-------------------------------------
In high-throughput systems, where hundreds of sessions generate data
on the publisher, the subscriber's apply process often becomes a
bottleneck due to the single apply worker model. While users can
mitigate this by creating multiple publication-subscription pairs,
this approach has scalability and usability limitations.
Currently, PostgreSQL supports parallel apply only for large streaming
transactions (streaming=parallel). This proposal aims to extend
parallelism to non-streaming transactions, thereby improving
replication performance in workloads dominated by smaller, frequent
transactions.
Sure.
Design Overview
------------------------
To safely parallelize non-streaming transactions, we must ensure that
transaction dependencies are respected to avoid failures and
deadlocks. Consider the following scenarios to understand it better:
(a) Transaction failures: Say, if we insert a row in the first
transaction and update it in the second transaction on the publisher,
then allowing the subscriber to apply both in parallel can lead to
failure in the update; (b) Deadlocks - allowing transactions that
update the same set of rows in a table in the opposite order in
parallel can lead to deadlocks.
A built-in subsystem for transaction dependency tracking would be highly
beneficial for physical replication speedup projects like [0].
Thoughts?
Surely we need to give it a try.
[0]: https://github.com/koichi-szk/postgres
--
Best regards,
Kirill Reshke
On Mon, Aug 11, 2025 at 1:39 PM Kirill Reshke <reshkekirill@gmail.com> wrote:
Design Overview
------------------------
To safely parallelize non-streaming transactions, we must ensure that
transaction dependencies are respected to avoid failures and
deadlocks. Consider the following scenarios to understand it better:
(a) Transaction failures: Say, if we insert a row in the first
transaction and update it in the second transaction on the publisher,
then allowing the subscriber to apply both in parallel can lead to
failure in the update; (b) Deadlocks - allowing transactions that
update the same set of rows in a table in the opposite order in
parallel can lead to deadlocks.
A built-in subsystem for transaction dependency tracking would be highly
beneficial for physical replication speedup projects like [0].
I am not sure if that is directly applicable because this work
proposes to track dependencies based on logical WAL contents. However,
if you can point me to README on the overall design of the work you
are pointing to then I can check it once.
--
With Regards,
Amit Kapila.
On Mon, 11 Aug 2025 at 13:45, Amit Kapila <amit.kapila16@gmail.com> wrote:
I am not sure if that is directly applicable because this work
proposes to track dependencies based on logical WAL contents. However,
if you can point me to README on the overall design of the work you
are pointing to then I can check it once.
The only doc on this that I am aware of is [0]. The project is,
however, more dead than alive, but I hope this is just a temporary
pause in development, not a permanent one.
[0]: https://wiki.postgresql.org/wiki/Parallel_Recovery
--
Best regards,
Kirill Reshke
On 11/8/2025 06:45, Amit Kapila wrote:
The core idea is that the leader apply worker ensures the following:
a. Identifies dependencies between transactions. b. Coordinates
parallel workers to apply independent transactions concurrently. c.
Ensures correct ordering for dependent transactions.
Dependency detection may be quite an expensive operation. What about a
'positive' approach - deadlock detection on the replica and restarting
the apply of a record that should be applied later? Have you thought
about this approach? What are the pros and cons here? Do you envision
common cases where such a deadlock would be frequent?
--
regards, Andrei Lepikhov
On Tue, Aug 12, 2025 at 12:04 PM Andrei Lepikhov <lepihov@gmail.com> wrote:
On 11/8/2025 06:45, Amit Kapila wrote:
The core idea is that the leader apply worker ensures the following:
a. Identifies dependencies between transactions. b. Coordinates
parallel workers to apply independent transactions concurrently. c.
Ensures correct ordering for dependent transactions.
Dependency detection may be quite an expensive operation. What about a
'positive' approach - deadlock detection on the replica and restarting
the apply of a record that should be applied later? Have you thought
about this approach? What are the pros and cons here? Do you envision
common cases where such a deadlock would be frequent?
It is not only deadlocks; we could also incorrectly apply some
transactions that should otherwise fail. For example, consider the
following case:
Pub: t1(c1 int unique key, c2 int)
Sub: t1(c1 int unique key, c2 int)
On Pub:
TXN-1
insert(1,11)
TXN-2
update (1,11) --> update (2,12)
On Sub:
table contains (1,11) before replication.
Now, if we allow these dependent transactions to go in parallel, then
instead of getting an ERROR for the insert, the update will succeed and
the subsequent insert will also succeed. This will create an
inconsistency on the subscriber side.
Similarly consider another set of transactions:
On Pub:
TXN-1
insert(1,11)
TXN-2
Delete (1,11)
On the subscriber, if we allow TXN-2 before TXN-1, then the subscriber
will apply both transactions successfully but will become inconsistent
w.r.t. the publisher.
My colleague has already built a POC based on this idea, and we checked
some initial numbers for non-dependent transactions; the apply speed
improved drastically. We will share the POC patch and numbers in the
next few days.
For a dependent-transactions workload, if we choose to go with the
deadlock detection approach, there will be a lot of retries, which may
not yield good apply improvements. Also, we may choose to make this
form of parallel apply optional for the reasons mentioned in my first
email, so if dependency tracking adds overhead, one can disable
parallel apply for those particular subscriptions.
--
With Regards,
Amit Kapila.
On Mon, Aug 11, 2025 at 3:00 PM Kirill Reshke <reshkekirill@gmail.com> wrote:
On Mon, 11 Aug 2025 at 13:45, Amit Kapila <amit.kapila16@gmail.com> wrote:
I am not sure if that is directly applicable because this work
proposes to track dependencies based on logical WAL contents. However,
if you can point me to README on the overall design of the work you
are pointing to then I can check it once.
The only doc on this that I am aware of is [0]. The project is,
however, more dead than alive, but I hope this is just a temporary
pause in development, not a permanent one.
Thanks for sharing the wiki page. After reading it, it seems we can't
use the exact same dependency tracking mechanism, as the two projects
have different dependency requirements. However, it could serve as an
example to refer to, and maybe some parts of the infrastructure could
be reused.
--
With Regards,
Amit Kapila.
On 11.08.2025 7:45 AM, Amit Kapila wrote:
[... full original proposal quoted; trimmed ...]
Hi,
This is similar to what I had in mind when starting my experiments with
LR apply speed improvements. I think that maintaining a full
(RelationId, ReplicaIdentity) hash may be too expensive - there can be
hundreds of active transactions updating millions of rows.
I thought about something like a Bloom filter, but frankly speaking I
didn't go far in thinking through all the implementation details. Your
proposal is much more concrete.
But I decided to first implement an approach with prefetch, which is
much simpler, similar to the prefetching currently used for physical
replication, and still provides quite a significant improvement:
/messages/by-id/84ed36b8-7d06-4945-9a6b-3826b3f999a6@garret.ru
There is one thing which I do not completely understand about your
proposal: do you assume that the LR walsender at the publisher will use
the reorder buffer to "serialize" transactions, or do you assume that
streaming mode will be used (it is now possible to force parallel apply
of short transactions using `debug_logical_replication_streaming`)?
It seems senseless to spend time and memory trying to serialize
transactions at the publisher if we in any case want to apply them in
parallel at the subscriber.
But then there is another problem: at the publisher there can be
hundreds of concurrent active transactions (limited only by
`max_connections`) whose records are intermixed in the WAL.
If we try to apply them concurrently at the subscriber, we need a
corresponding number of parallel apply workers. But usually the number
of such workers is less than 10 (and the default is 2).
So it looks like we need to serialize transactions at the subscriber
side.
Assume that there are 100 concurrent transactions T1..T100, i.e. before
the first COMMIT record there are intermixed records of 100
transactions.
And there are just two parallel apply workers W1 and W2. The main LR
apply worker will send a T1 record to W1, a T2 record to W2, and ...
there are no more vacant workers.
It either has to spawn additional ones, which is not always possible
because the total number of background workers is limited, or serialize
all other transactions in memory or on disk until it reaches the COMMIT
of T1 or T2.
I'm afraid that such serialization will eliminate any advantage of
parallel apply.
Certainly, if we reorder transactions at the publisher side, there is
no such problem: the subscriber receives all records for T1, then all
records for T2, ... If there are no vacant workers, it can just wait
until any of these transactions completes. But I am afraid that in this
case the reorder buffer at the publisher will be a bottleneck.
On Mon, Aug 11, 2025 at 10:15:41AM +0530, Amit Kapila wrote:
Hi,
Background and Motivation
-------------------------------------
In high-throughput systems, where hundreds of sessions generate data
on the publisher, the subscriber's apply process often becomes a
bottleneck due to the single apply worker model. While users can
mitigate this by creating multiple publication-subscription pairs,
this approach has scalability and usability limitations.
Currently, PostgreSQL supports parallel apply only for large streaming
transactions (streaming=parallel). This proposal aims to extend
parallelism to non-streaming transactions, thereby improving
replication performance in workloads dominated by smaller, frequent
transactions.
I thought the approach for improving WAL apply speed, for both binary
and logical, was pipelining:
https://en.wikipedia.org/wiki/Instruction_pipelining
rather than trying to do all the steps in parallel.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Do not let urgent matters crowd out time for investment in the future.
On Tue, Aug 12, 2025 at 10:40 PM Bruce Momjian <bruce@momjian.us> wrote:
On Mon, Aug 11, 2025 at 10:15:41AM +0530, Amit Kapila wrote:
[... background and motivation quoted; trimmed ...]
I thought the approach for improving WAL apply speed, for both binary
and logical, was pipelining:
https://en.wikipedia.org/wiki/Instruction_pipelining
rather than trying to do all the steps in parallel.
It is not clear to me how the apply speed for a mix of dependent and
independent transactions can be improved using the technique you
shared, as we still need to follow the commit order for dependent
transactions. Can you please elaborate more on the high-level idea of
how this technique can be used to improve speed for applying logical
WAL records?
--
With Regards,
Amit Kapila.
On Tue, Aug 12, 2025 at 9:22 PM Константин Книжник <knizhnik@garret.ru> wrote:
Hi,
This is something similar to what I have in mind when starting my experiments with LR apply speed improvements. I think that maintaining a full (RelationId, ReplicaIdentity) hash may be too expensive - there can be hundreds of active transactions updating millions of rows.
I thought about something like a bloom filter. But frankly speaking I didn't go far in thinking about all implementation details. Your proposal is much more concrete.
We can surely investigate a different hash_key if that works for all cases.
But I decided to implement first approach with prefetch, which is much more simple, similar with prefetching currently used for physical replication and still provide quite significant improvement:
/messages/by-id/84ed36b8-7d06-4945-9a6b-3826b3f999a6@garret.ru
There is one thing which I do not completely understand with your proposal: do you assume that LR walsender at publisher will use reorder buffer to "serialize" transactions
or you assume that streaming mode will be used (now it is possible to enforce parallel apply of short transactions using `debug_logical_replication_streaming`)?
The current proposal is based on reorderbuffer serializing
transactions as we are doing now.
It seems to be senseless to spend time and memory trying to serialize transactions at the publisher if we in any case want to apply them in parallel at subscriber.
But then there is another problem: at publisher there can be hundreds of concurrent active transactions (limited only by `max_connections`) which records are intermixed in WAL.
If we try to apply them concurrently at subscriber, we need a corresponding number of parallel apply workers. But usually the number of such workers is less than 10 (and default is 2).
So looks like we need to serialize transactions at subscriber side.
Assume that there are 100 concurrent transactions T1..T100, i.e. before first COMMIT record there are mixed records of 100 transactions.
And there are just two parallel apply workers W1 and W2. Main LR apply worker with send T1 record to W1, T2 record to W2 and ... there are not more vacant workers.
It has either to spawn additional ones, but it is not always possible because total number of background workers is limited.
Either serialize all other transactions in memory or on disk, until it reaches COMMIT of T1 or T2.
I afraid that such serialization will eliminate any advantages of parallel apply.
Right, I also think so, and we will probably end up doing something
like what we are doing now in the publisher.
Certainly if we do reordering of transactions at publisher side, then there is no such problem. Subscriber receives all records for T1, then all records for T2, ... If there are no more vacant workers, it can just wait until any of this transactions is completed. But I am afraid that in this case the reorder buffer at the publisher will be a bottleneck.
This is a point to investigate if we observe that. But so far, in our
internal testing, parallel apply gives a good improvement in
pgbench-style workloads.
--
With Regards,
Amit Kapila.
On Monday, August 11, 2025 12:46 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
[... full original proposal quoted; trimmed ...]
Here is the initial POC patch for this idea.
The basic implementation is outlined below. Please note that there are several
TODO items remaining, which we are actively working on; these are also detailed
further down.
The leader worker assigns each non-streaming transaction to a parallel apply
worker. Before dispatching changes to a parallel worker, the leader verifies
whether the current modification affects the same row (identified by the
replica identity key) as another ongoing transaction. If so, the leader sends
a list of dependent
transaction IDs to the parallel worker, indicating that the parallel apply
worker must wait for these transactions to commit before proceeding. Parallel
apply workers do not maintain commit order; transactions can be committed at any
time provided there are no dependencies.
Each parallel apply worker records the local end LSN of the transaction it
applies in shared memory. Subsequently, the leader gathers these local end LSNs
and logs them in the local 'lsn_mapping' for verifying whether they have been
flushed to disk (following the logic in get_flush_position()).
If no parallel apply worker is available, the leader will apply the transaction
independently.
For further details, please refer to the following:
The leader maintains a local hash table, using the remote change's replica
identity column values and relid as keys, with remote transaction IDs as values.
Before sending changes to the parallel apply worker, the leader computes a hash
using RI key values and the relid of the current change to search the hash
table. If an existing entry is found, the leader tells the parallel worker
to wait for the remote xid in the hash entry, after which the leader updates the
hash entry with the current xid.
If the remote relation lacks a replica identity (RI), it indicates that only
INSERT can be replicated for this table. In such cases, the leader skips
dependency checks, allowing the parallel apply worker to proceed with applying
changes without delay. This is because the only potential conflict would
involve a local unique key or foreign key, the handling of which is yet to be
implemented (see TODO - dependency on local unique key, foreign key.).
In cases of TRUNCATE or remote schema changes affecting the entire table, the
leader retrieves all remote xids touching the same table (via sequential scans
of the hash table) and tells the parallel worker to wait for those transactions
to commit.
Hash entries are cleaned up once the transaction corresponding to the remote xid
in the entry has been committed. Clean-up typically occurs when collecting the
flush position of each transaction, but is forced if the hash table exceeds a
set threshold.
If a transaction is relied upon by others, the leader adds its xid to a shared
hash table. The shared hash table entry is cleared by the parallel apply worker
upon completing the transaction. Workers needing to wait for a transaction check
the shared hash table entry; if present, they lock the transaction ID (using
pa_lock_transaction). If absent, it indicates the transaction has been
committed, negating the need to wait.
--
TODO - replication progress tracking for out of order commit.
TODO - dependency on local unique key, foreign key.
TODO - restrict user-defined triggers and constraints.
TODO - enable the parallel apply optionally
TODO - potential improvement to use shared hash table for tracking dependencies.
--
The above TODO items are also included in the initial email [1].
[1]: /messages/by-id/CAA4eK1+SEus_6vQay9TF_r4ow+E-Q7LYNLfsD78HaOsLSgppxQ@mail.gmail.com
Best Regards,
Hou zj
Attachments:
v1-0001-Parallel-apply-non-streaming-transactions.patch (application/octet-stream)
From 15d24048224f75cfa083a4874e1666da509d7f01 Mon Sep 17 00:00:00 2001
From: Zhijie Hou <houzj.fnst@fujitsu.com>
Date: Fri, 8 Aug 2025 11:35:59 +0800
Subject: [PATCH v1] Parallel apply non-streaming transactions
--
Basic design
--
The leader worker assigns each non-streaming transaction to a parallel apply
worker. Before dispatching changes to a parallel worker, the leader verifies if
the current modification affects the same row (identified by the replica identity
key) as another ongoing transaction. If so, the leader sends a list of dependent
transaction IDs to the parallel worker, indicating that the parallel apply
worker must wait for these transactions to commit before proceeding. Parallel
apply workers do not maintain commit order; transactions can be committed at any
time provided there are no dependencies.
Each parallel apply worker records the local end LSN of the transaction it
applies in shared memory. Subsequently, the leader gathers these local end LSNs
and logs them in the local 'lsn_mapping' for verifying whether they have been
flushed to disk (following the logic in get_flush_position()).
If no parallel apply worker is available, the leader will apply the transaction
independently.
For further details, please refer to the following:
--
dependency tracking
--
The leader maintains a local hash table, using the remote change's replica
identity column values and relid as keys, with remote transaction IDs as values.
Before sending changes to the parallel apply worker, the leader computes a hash
using RI key values and the relid of the current change to search the hash
table. If an existing entry is found, the leader tells the parallel worker
to wait for the remote xid in the hash entry, after which the leader updates the
hash entry with the current xid.
If the remote relation lacks a replica identity (RI), it indicates that only
INSERT can be replicated for this table. In such cases, the leader skips
dependency checks, allowing the parallel apply worker to proceed with applying
changes without delay. This is because the only potential conflict would
involve a local unique key or foreign key, the handling of which is yet to be
implemented (see TODO - dependency on local unique key, foreign key.).
In cases of TRUNCATE or remote schema changes affecting the entire table, the
leader retrieves all remote xids touching the same table (via sequential scans
of the hash table) and tells the parallel worker to wait for those transactions
to commit.
Hash entries are cleaned up once the transaction corresponding to the remote xid
in the entry has been committed. Clean-up typically occurs when collecting the
flush position of each transaction, but is forced if the hash table exceeds a
set threshold.
--
dependency waiting
--
If a transaction is relied upon by others, the leader adds its xid to a shared
hash table. The shared hash table entry is cleared by the parallel apply worker
upon completing the transaction. Workers needing to wait for a transaction check
the shared hash table entry; if present, they lock the transaction ID (using
pa_lock_transaction). If absent, it indicates the transaction has been
committed, negating the need to wait.
--
TODO - error handling
--
If preceding transactions fail, and independent later transactions are already
applied, a mechanism is needed to skip already applied transactions upon
restart. One solution is to PREPARE transactions whose preceding ones remain
uncommitted, then COMMIT PREPARED once all preceding transactions finish. This
allows the worker to skip applied transactions by scanning prepared ones.
--
TODO - dependency on local unique key, foreign key.
--
A transaction could conflict with another if modifying the same unique key.
While current patches don't address conflicts involving unique or foreign keys,
tracking these dependencies might be needed.
--
TODO - user defined trigger and constraints.
--
It would be a challenge to check the dependency if the table has user-defined
triggers or constraints. The most viable solution might be to disallow parallel
apply for relations whose triggers and constraints are not marked as
parallel-safe or immutable.
--
TODO - potential improvement to use a shared hash table for tracking dependencies.
--
Instead of a local hash table, a shared hash table could track replica identity
key dependencies, allowing parallel apply workers to clean up entries. However,
this might increase contention, so we need to research whether it's worth it.
---
.../replication/logical/applyparallelworker.c | 554 ++++++++++-
src/backend/replication/logical/proto.c | 42 +
src/backend/replication/logical/relation.c | 55 ++
src/backend/replication/logical/worker.c | 869 +++++++++++++++++-
.../utils/activity/wait_event_names.txt | 1 +
src/include/replication/logicalproto.h | 4 +
src/include/replication/logicalrelation.h | 5 +
src/include/replication/worker_internal.h | 26 +-
src/include/storage/lwlocklist.h | 1 +
src/test/subscription/t/010_truncate.pl | 2 +-
src/test/subscription/t/015_stream.pl | 8 +-
src/test/subscription/t/026_stats.pl | 1 +
src/test/subscription/t/027_nosuperuser.pl | 1 +
src/tools/pgindent/typedefs.list | 4 +
14 files changed, 1497 insertions(+), 76 deletions(-)
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index cd0e19176fd..f30d9b9bd8e 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -14,6 +14,9 @@
* ParallelApplyWorkerInfo which is required so the leader worker and parallel
* apply workers can communicate with each other.
*
+ * Streaming transactions
+ * ======================
+ *
* The parallel apply workers are assigned (if available) as soon as xact's
* first stream is received for subscriptions that have set their 'streaming'
* option as parallel. The leader apply worker will send changes to this new
@@ -146,6 +149,23 @@
* which will detect deadlock if any. See pa_send_data() and
* enum TransApplyAction.
*
+ *
+ * Non-streaming transactions
+ * ======================
+ * The handling is similar to streaming transactions, but including few
+ * differences:
+ *
+ * Transaction dependency
+ * -------------------------------
+ * Before dispatching changes to a parallel worker, the leader verifies if the
+ * current modification affects the same row (identified by replica identity
+ * key) as another ongoing transaction (see handle_dependency_on_change for
+ * details). If so, the leader sends a list of dependent transaction IDs to the
+ * parallel worker, indicating that the parallel apply worker must wait for
+ * these transactions to commit before proceeding. Parallel apply workers do not
+ * maintain commit order; transactions can be committed at any time provided
+ * there are no dependencies.
+ *
* Lock types
* ----------
* Both the stream lock and the transaction lock mentioned above are
@@ -216,14 +236,38 @@ typedef struct ParallelApplyWorkerEntry
{
TransactionId xid; /* Hash key -- must be first */
ParallelApplyWorkerInfo *winfo;
+ XLogRecPtr local_end;
} ParallelApplyWorkerEntry;
+/* an entry in the parallelized_txns shared hash table */
+typedef struct ParallelizedTxnEntry
+{
+ TransactionId xid; /* Hash key */
+} ParallelizedTxnEntry;
+
/*
* A hash table used to cache the state of streaming transactions being applied
* by the parallel apply workers.
*/
static HTAB *ParallelApplyTxnHash = NULL;
+/*
+ * A hash table used to track the parallelized transactions that could be
+ * depended on by other transactions.
+ */
+static dsa_area *parallel_apply_dsa_area = NULL;
+static dshash_table *parallelized_txns = NULL;
+
+/* parameters for the parallelized_txns shared hash table */
+static const dshash_parameters dsh_params = {
+ sizeof(TransactionId),
+ sizeof(ParallelizedTxnEntry),
+ dshash_memcmp,
+ dshash_memhash,
+ dshash_memcpy,
+ LWTRANCHE_PARALLEL_APPLY_DSA
+};
+
/*
* A list (pool) of active parallel apply workers. The information for
* the new worker is added to the list after successfully launching it. The
@@ -257,6 +301,9 @@ static List *subxactlist = NIL;
static void pa_free_worker_info(ParallelApplyWorkerInfo *winfo);
static ParallelTransState pa_get_xact_state(ParallelApplyWorkerShared *wshared);
static PartialFileSetState pa_get_fileset_state(void);
+static void pa_attach_parallelized_txn_hash(dsa_handle *pa_dsa_handle,
+ dshash_table_handle *pa_dshash_handle);
+static void write_internal_relation(StringInfo s, LogicalRepRelation *rel);
/*
* Returns true if it is OK to start a parallel apply worker, false otherwise.
@@ -334,6 +381,15 @@ pa_setup_dsm(ParallelApplyWorkerInfo *winfo)
shm_mq *mq;
Size queue_size = DSM_QUEUE_SIZE;
Size error_queue_size = DSM_ERROR_QUEUE_SIZE;
+ dsa_handle parallel_apply_dsa_handle;
+ dshash_table_handle parallelized_txns_handle;
+
+ pa_attach_parallelized_txn_hash(¶llel_apply_dsa_handle,
+ ¶llelized_txns_handle);
+
+ if (parallel_apply_dsa_handle == DSA_HANDLE_INVALID ||
+ parallelized_txns_handle == DSHASH_HANDLE_INVALID)
+ return false;
/*
* Estimate how much shared memory we need.
@@ -364,11 +420,14 @@ pa_setup_dsm(ParallelApplyWorkerInfo *winfo)
/* Set up the header region. */
shared = shm_toc_allocate(toc, sizeof(ParallelApplyWorkerShared));
SpinLockInit(&shared->mutex);
-
+ shared->xid = InvalidTransactionId;
shared->xact_state = PARALLEL_TRANS_UNKNOWN;
pg_atomic_init_u32(&(shared->pending_stream_count), 0);
shared->last_commit_end = InvalidXLogRecPtr;
shared->fileset_state = FS_EMPTY;
+ shared->parallel_apply_dsa_handle = parallel_apply_dsa_handle;
+ shared->parallelized_txns_handle = parallelized_txns_handle;
+ shared->has_dependent_txn = false;
shm_toc_insert(toc, PARALLEL_APPLY_KEY_SHARED, shared);
@@ -406,6 +465,8 @@ pa_launch_parallel_worker(void)
MemoryContext oldcontext;
bool launched;
ParallelApplyWorkerInfo *winfo;
+ dsa_handle pa_dsa_handle;
+ dshash_table_handle pa_dshash_handle;
ListCell *lc;
/* Try to get an available parallel apply worker from the worker pool. */
@@ -413,10 +474,33 @@ pa_launch_parallel_worker(void)
{
winfo = (ParallelApplyWorkerInfo *) lfirst(lc);
- if (!winfo->in_use)
+ if (!winfo->stream_txn &&
+ pa_get_xact_state(winfo->shared) == PARALLEL_TRANS_FINISHED)
+ {
+ /*
+ * Save the local commit LSN of the last transaction applied by this
+ * worker before reusing it for another transaction. This WAL
+ * position is crucial for determining the flush position in
+ * responses to the publisher (see get_flush_position()).
+ */
+ (void) pa_get_last_commit_end(winfo->shared->xid, false, NULL);
+ return winfo;
+ }
+
+ if (winfo->stream_txn && !winfo->in_use)
return winfo;
}
+ pa_attach_parallelized_txn_hash(&pa_dsa_handle, &pa_dshash_handle);
+
+ /*
+ * Return if the number of parallel apply workers has reached the maximum
+ * limit.
+ */
+ if (list_length(ParallelApplyWorkerPool) ==
+ max_parallel_apply_workers_per_subscription)
+ return NULL;
+
/*
* Start a new parallel apply worker.
*
@@ -444,18 +528,31 @@ pa_launch_parallel_worker(void)
dsm_segment_handle(winfo->dsm_seg),
false);
- if (launched)
- {
- ParallelApplyWorkerPool = lappend(ParallelApplyWorkerPool, winfo);
- }
- else
+ if (!launched)
{
+ MemoryContextSwitchTo(oldcontext);
pa_free_worker_info(winfo);
- winfo = NULL;
+ return NULL;
}
+ ParallelApplyWorkerPool = lappend(ParallelApplyWorkerPool, winfo);
+
MemoryContextSwitchTo(oldcontext);
+ /*
+ * Send all existing remote relation information to the parallel apply
+ * worker. This allows the parallel worker to initialize the
+ * LogicalRepRelMapEntry locally before applying remote changes.
+ */
+ if (logicalrep_get_num_rels())
+ {
+ StringInfoData out;
+ initStringInfo(&out);
+
+ write_internal_relation(&out, NULL);
+ pa_send_data(winfo, out.len, out.data);
+ }
+
return winfo;
}
@@ -468,7 +565,7 @@ pa_launch_parallel_worker(void)
* streaming changes.
*/
void
-pa_allocate_worker(TransactionId xid)
+pa_allocate_worker(TransactionId xid, bool stream_txn)
{
bool found;
ParallelApplyWorkerInfo *winfo = NULL;
@@ -505,11 +602,14 @@ pa_allocate_worker(TransactionId xid)
SpinLockAcquire(&winfo->shared->mutex);
winfo->shared->xact_state = PARALLEL_TRANS_UNKNOWN;
winfo->shared->xid = xid;
+ winfo->shared->has_dependent_txn = false;
SpinLockRelease(&winfo->shared->mutex);
winfo->in_use = true;
winfo->serialize_changes = false;
+ winfo->stream_txn = stream_txn;
entry->winfo = winfo;
+ entry->local_end = InvalidXLogRecPtr;
}
/*
@@ -558,7 +658,8 @@ pa_free_worker(ParallelApplyWorkerInfo *winfo)
{
Assert(!am_parallel_apply_worker());
Assert(winfo->in_use);
- Assert(pa_get_xact_state(winfo->shared) == PARALLEL_TRANS_FINISHED);
+ Assert(!winfo->stream_txn ||
+ pa_get_xact_state(winfo->shared) == PARALLEL_TRANS_FINISHED);
if (!hash_search(ParallelApplyTxnHash, &winfo->shared->xid, HASH_REMOVE, NULL))
elog(ERROR, "hash table corrupted");
@@ -574,9 +675,7 @@ pa_free_worker(ParallelApplyWorkerInfo *winfo)
* been serialized and then letting the parallel apply worker deal with
* the spurious message, we stop the worker.
*/
- if (winfo->serialize_changes ||
- list_length(ParallelApplyWorkerPool) >
- (max_parallel_apply_workers_per_subscription / 2))
+ if (winfo->serialize_changes)
{
logicalrep_pa_worker_stop(winfo);
pa_free_worker_info(winfo);
@@ -706,6 +805,105 @@ pa_process_spooled_messages_if_required(void)
return true;
}
+/*
+ * Get the local end LSN for a transaction applied by a parallel apply worker.
+ *
+ * Set delete_entry to true if you intend to remove the transaction from the
+ * ParallelApplyTxnHash after collecting its LSN.
+ *
+ * If the parallel apply worker did not write any changes during the transaction
+ * application due to situations like update/delete_missing or a before trigger,
+ * the *skipped_write will be set to true.
+ */
+XLogRecPtr
+pa_get_last_commit_end(TransactionId xid, bool delete_entry, bool *skipped_write)
+{
+ bool found;
+ ParallelApplyWorkerEntry *entry;
+ ParallelApplyWorkerInfo *winfo;
+
+ Assert(TransactionIdIsValid(xid));
+
+ if (skipped_write)
+ *skipped_write = false;
+
+ /* Find an entry for the requested transaction. */
+ entry = hash_search(ParallelApplyTxnHash, &xid, HASH_FIND, &found);
+
+ if (!found)
+ return InvalidXLogRecPtr;
+
+ /*
+ * If worker info is NULL, it indicates that the worker has been reused for
+ * handling other transactions. Consequently, the local end LSN has already
+ * been collected and saved in entry->local_end.
+ */
+ winfo = entry->winfo;
+ if (winfo == NULL)
+ {
+ if (delete_entry &&
+ !hash_search(ParallelApplyTxnHash, &xid, HASH_REMOVE, NULL))
+ elog(ERROR, "hash table corrupted");
+
+ if (skipped_write)
+ *skipped_write = XLogRecPtrIsInvalid(entry->local_end);
+
+ return entry->local_end;
+ }
+
+ /* Return InvalidXLogRecPtr if the transaction is still in progress */
+ if (pa_get_xact_state(winfo->shared) != PARALLEL_TRANS_FINISHED)
+ return InvalidXLogRecPtr;
+
+ /* Collect the local end LSN from the worker's shared memory area */
+ entry->local_end = winfo->shared->last_commit_end;
+ entry->winfo = NULL;
+
+ if (skipped_write)
+ *skipped_write = XLogRecPtrIsInvalid(entry->local_end);
+
+ elog(DEBUG1, "store local commit %X/%X end to txn entry: %u",
+ LSN_FORMAT_ARGS(entry->local_end), xid);
+
+ if (delete_entry &&
+ !hash_search(ParallelApplyTxnHash, &xid, HASH_REMOVE, NULL))
+ elog(ERROR, "hash table corrupted");
+
+ return entry->local_end;
+}
+
+/*
+ * Wait for the remote transaction associated with the specified remote xid to
+ * complete.
+ */
+static void
+pa_wait_for_transaction(TransactionId wait_for_xid)
+{
+ if (!am_leader_apply_worker())
+ return;
+
+ if (!TransactionIdIsValid(wait_for_xid))
+ return;
+
+ elog(DEBUG1, "plan to wait for remote_xid %u to finish",
+ wait_for_xid);
+
+ for (;;)
+ {
+ if (pa_transaction_committed(wait_for_xid))
+ break;
+
+ pa_lock_transaction(wait_for_xid, AccessShareLock);
+ pa_unlock_transaction(wait_for_xid, AccessShareLock);
+
+ /* An interrupt may have occurred while we were waiting. */
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ elog(DEBUG1, "finished wait for remote_xid %u to finish",
+ wait_for_xid);
+}
+
/*
* Interrupt handler for main loop of parallel apply worker.
*/
@@ -781,21 +979,35 @@ LogicalParallelApplyLoop(shm_mq_handle *mqh)
* parallel apply workers can only be PqReplMsg_WALData.
*/
c = pq_getmsgbyte(&s);
- if (c != PqReplMsg_WALData)
- elog(ERROR, "unexpected message \"%c\"", c);
- /*
- * Ignore statistics fields that have been updated by the leader
- * apply worker.
- *
- * XXX We can avoid sending the statistics fields from the leader
- * apply worker but for that, it needs to rebuild the entire
- * message by removing these fields which could be more work than
- * simply ignoring these fields in the parallel apply worker.
- */
- s.cursor += SIZE_STATS_MESSAGE;
+ if (c == PqReplMsg_WALData)
+ {
+ /*
+ * Ignore statistics fields that have been updated by the
+ * leader apply worker.
+ *
+ * XXX We can avoid sending the statistics fields from the
+ * leader apply worker but for that, it needs to rebuild the
+ * entire message by removing these fields which could be more
+ * work than simply ignoring these fields in the parallel
+ * apply worker.
+ */
+ s.cursor += SIZE_STATS_MESSAGE;
- apply_dispatch(&s);
+ apply_dispatch(&s);
+ }
+ else if (c == PARALLEL_APPLY_INTERNAL_MESSAGE)
+ {
+ apply_dispatch(&s);
+ }
+ else
+ {
+ /*
+ * The first byte of messages sent from leader apply worker to
+ * parallel apply workers can only be 'w' or 'i'.
+ */
+ elog(ERROR, "unexpected message \"%c\"", c);
+ }
}
else if (shmq_res == SHM_MQ_WOULD_BLOCK)
{
@@ -812,6 +1024,9 @@ LogicalParallelApplyLoop(shm_mq_handle *mqh)
if (rc & WL_LATCH_SET)
ResetLatch(MyLatch);
+
+ if (!IsTransactionState())
+ pgstat_report_stat(true);
}
}
else
@@ -849,6 +1064,9 @@ pa_shutdown(int code, Datum arg)
INVALID_PROC_NUMBER);
dsm_detach((dsm_segment *) DatumGetPointer(arg));
+
+ if (parallel_apply_dsa_area)
+ dsa_detach(parallel_apply_dsa_area);
}
/*
@@ -864,6 +1082,8 @@ ParallelApplyWorkerMain(Datum main_arg)
shm_mq *mq;
shm_mq_handle *mqh;
shm_mq_handle *error_mqh;
+ dsa_handle pa_dsa_handle;
+ dshash_table_handle pa_dshash_handle;
RepOriginId originid;
int worker_slot = DatumGetInt32(main_arg);
char originname[NAMEDATALEN];
@@ -944,6 +1164,8 @@ ParallelApplyWorkerMain(Datum main_arg)
InitializingApplyWorker = false;
+ pa_attach_parallelized_txn_hash(&pa_dsa_handle, &pa_dshash_handle);
+
/* Setup replication origin tracking. */
StartTransactionCommand();
ReplicationOriginNameForLogicalRep(MySubscription->oid, InvalidOid,
@@ -1150,7 +1372,6 @@ pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes, const void *data)
shm_mq_result result;
TimestampTz startTime = 0;
- Assert(!IsTransactionState());
Assert(!winfo->serialize_changes);
/*
@@ -1202,6 +1423,67 @@ pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes, const void *data)
}
}
+/*
+ * Distribute remote relation information to all active parallel apply workers
+ * that require it.
+ */
+void
+pa_distribute_schema_changes_to_workers(LogicalRepRelation *rel)
+{
+ List *workers_stopped = NIL;
+ StringInfoData out;
+
+ if (!ParallelApplyWorkerPool)
+ return;
+
+ initStringInfo(&out);
+
+ write_internal_relation(&out, rel);
+
+ foreach_ptr(ParallelApplyWorkerInfo, winfo, ParallelApplyWorkerPool)
+ {
+ /*
+ * Skip the worker responsible for the current transaction, as the
+ * relation information has already been sent to it.
+ */
+ if (winfo == stream_apply_worker)
+ continue;
+
+ /*
+ * Skip any worker that is in serialize mode, as it will soon stop
+ * once it finishes applying the transaction.
+ */
+ if (winfo->serialize_changes)
+ continue;
+
+ elog(DEBUG1, "distributing schema changes to pa workers");
+
+ if (pa_send_data(winfo, out.len, out.data))
+ continue;
+
+ elog(DEBUG1, "failed to distribute, will stop that worker instead");
+
+ /*
+ * Distribution to this worker failed due to a sending timeout. Wait for
+ * the worker to complete its transaction and then stop it. This is
+ * consistent with the handling of workers in serialize mode (see
+ * pa_free_worker() for details).
+ */
+ pa_wait_for_transaction(winfo->shared->xid);
+
+ pa_get_last_commit_end(winfo->shared->xid, false, NULL);
+
+ logicalrep_pa_worker_stop(winfo);
+
+ workers_stopped = lappend(workers_stopped, winfo);
+ }
+
+ pfree(out.data);
+
+ foreach_ptr(ParallelApplyWorkerInfo, winfo, workers_stopped)
+ pa_free_worker_info(winfo);
+}
+
/*
* Switch to PARTIAL_SERIALIZE mode for the current transaction -- this means
* that the current data and any subsequent data for this transaction will be
@@ -1284,8 +1566,8 @@ pa_wait_for_xact_finish(ParallelApplyWorkerInfo *winfo)
/*
* Wait for the transaction lock to be released. This is required to
- * detect deadlock among leader and parallel apply workers. Refer to the
- * comments atop this file.
+ * detect deadlock among the leader and parallel apply workers. Refer
+ * to the comments atop this file.
*/
pa_lock_transaction(winfo->shared->xid, AccessShareLock);
pa_unlock_transaction(winfo->shared->xid, AccessShareLock);
@@ -1299,6 +1581,7 @@ pa_wait_for_xact_finish(ParallelApplyWorkerInfo *winfo)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("lost connection to the logical replication parallel apply worker")));
+
}
/*
@@ -1362,6 +1645,9 @@ pa_savepoint_name(Oid suboid, TransactionId xid, char *spname, Size szsp)
void
pa_start_subtrans(TransactionId current_xid, TransactionId top_xid)
{
+ if (!TransactionIdIsValid(top_xid))
+ return;
+
if (current_xid != top_xid &&
!list_member_xid(subxactlist, current_xid))
{
@@ -1618,23 +1904,215 @@ pa_decr_and_wait_stream_block(void)
void
pa_xact_finish(ParallelApplyWorkerInfo *winfo, XLogRecPtr remote_lsn)
{
+ XLogRecPtr local_lsn = InvalidXLogRecPtr;
+ TransactionId pa_remote_xid = winfo->shared->xid;
+
Assert(am_leader_apply_worker());
/*
- * Unlock the shared object lock so that parallel apply worker can
- * continue to receive and apply changes.
+ * Unlock the shared object lock taken for streaming transactions so that
+ * parallel apply worker can continue to receive and apply changes.
*/
- pa_unlock_stream(winfo->shared->xid, AccessExclusiveLock);
+ if (winfo->stream_txn)
+ pa_unlock_stream(winfo->shared->xid, AccessExclusiveLock);
/*
- * Wait for that worker to finish. This is necessary to maintain commit
- * order which avoids failures due to transaction dependencies and
- * deadlocks.
+ * Wait for the worker applying a streaming transaction to finish. This is
+ * necessary to maintain commit order which avoids failures due to
+ * transaction dependencies and deadlocks.
+ *
+ * For a non-streaming transaction in partial serialize mode, wait for the
+ * worker to stop as well, since the worker cannot be reused anymore (see
+ * pa_free_worker() for details).
*/
- pa_wait_for_xact_finish(winfo);
+ if (winfo->serialize_changes || winfo->stream_txn)
+ {
+ pa_wait_for_xact_finish(winfo);
+
+ local_lsn = winfo->shared->last_commit_end;
+ pa_remote_xid = InvalidTransactionId;
+
+ pa_free_worker(winfo);
+ }
if (!XLogRecPtrIsInvalid(remote_lsn))
- store_flush_position(remote_lsn, winfo->shared->last_commit_end);
+ store_flush_position(remote_lsn, local_lsn, pa_remote_xid);
- pa_free_worker(winfo);
+ pa_set_stream_apply_worker(NULL);
+}
+
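+/*
+ * Check whether the remote transaction with the given xid has been committed
+ * by its parallel apply worker. Transactions no longer tracked in
+ * ParallelApplyTxnHash are treated as committed.
+ */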
+bool
+pa_transaction_committed(TransactionId xid)
+{
+ bool found;
+ ParallelApplyWorkerEntry *entry;
+
+ Assert(TransactionIdIsValid(xid));
+
+ /* Find an entry for the requested transaction */
+ entry = hash_search(ParallelApplyTxnHash, &xid, HASH_FIND, &found);
+
+ if (!found)
+ return true;
+
+ if (!entry->winfo)
+ return true;
+
+ return pa_get_xact_state(entry->winfo->shared) == PARALLEL_TRANS_FINISHED;
+}
+
+/*
+ * Attach to the shared hash table for parallelized transactions.
+ */
+static void
+pa_attach_parallelized_txn_hash(dsa_handle *pa_dsa_handle,
+ dshash_table_handle *pa_dshash_handle)
+{
+ MemoryContext oldctx;
+
+ if (parallelized_txns)
+ {
+ Assert(parallel_apply_dsa_area);
+ *pa_dsa_handle = dsa_get_handle(parallel_apply_dsa_area);
+ *pa_dshash_handle = dshash_get_hash_table_handle(parallelized_txns);
+ return;
+ }
+
+ /* Be sure any local memory allocated by DSA routines is persistent. */
+ oldctx = MemoryContextSwitchTo(ApplyContext);
+
+ if (am_leader_apply_worker())
+ {
+ /* Initialize the dynamic shared hash table for parallelized transactions. */
+ parallel_apply_dsa_area = dsa_create(LWTRANCHE_PARALLEL_APPLY_DSA);
+ dsa_pin(parallel_apply_dsa_area);
+ dsa_pin_mapping(parallel_apply_dsa_area);
+ parallelized_txns = dshash_create(parallel_apply_dsa_area, &dsh_params, NULL);
+
+ /* Store handles in shared memory for other backends to use. */
+ *pa_dsa_handle = dsa_get_handle(parallel_apply_dsa_area);
+ *pa_dshash_handle = dshash_get_hash_table_handle(parallelized_txns);
+ }
+ else if (am_parallel_apply_worker())
+ {
+ /* Attach to existing dynamic shared hash table. */
+ parallel_apply_dsa_area = dsa_attach(MyParallelShared->parallel_apply_dsa_handle);
+ dsa_pin_mapping(parallel_apply_dsa_area);
+ parallelized_txns = dshash_attach(parallel_apply_dsa_area, &dsh_params,
+ MyParallelShared->parallelized_txns_handle,
+ NULL);
+ }
+
+ MemoryContextSwitchTo(oldctx);
+}
+
+/*
+ * Record in-progress transactions from the given list that are being depended
+ * on into the shared hash table.
+ */
+void
+pa_record_dependency_on_transactions(List *depends_on_xids)
+{
+ foreach_xid(xid, depends_on_xids)
+ {
+ bool found;
+ ParallelApplyWorkerEntry *winfo_entry;
+ ParallelApplyWorkerInfo *winfo;
+ ParallelizedTxnEntry *txn_entry;
+
+ winfo_entry = hash_search(ParallelApplyTxnHash, &xid, HASH_FIND, &found);
+ winfo = winfo_entry->winfo;
+
+ if (winfo->shared->has_dependent_txn)
+ continue;
+
+ txn_entry = dshash_find_or_insert(parallelized_txns, &xid, &found);
+
+ if (found)
+ elog(ERROR, "hash table corrupted");
+
+ winfo->shared->has_dependent_txn = true;
+
+ /*
+ * If the transaction has already committed, remove the entry now;
+ * otherwise, the parallel apply worker will remove the entry once
+ * it commits the transaction.
+ */
+ if (pa_get_xact_state(winfo->shared) == PARALLEL_TRANS_FINISHED)
+ dshash_delete_entry(parallelized_txns, txn_entry);
+ else
+ dshash_release_lock(parallelized_txns, txn_entry);
+ }
+}
+
+/*
+ * Mark the transaction state as finished and remove the shared hash entry if
+ * there are dependent transactions waiting for this transaction to complete.
+ */
+void
+pa_commit_transaction(void)
+{
+ TransactionId xid = MyParallelShared->xid;
+ bool has_dependent_txn;
+
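+ /* Publish the final state and read the dependency flag atomically. */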
+ SpinLockAcquire(&MyParallelShared->mutex);
+ MyParallelShared->xact_state = PARALLEL_TRANS_FINISHED;
+ has_dependent_txn = MyParallelShared->has_dependent_txn;
+ SpinLockRelease(&MyParallelShared->mutex);
+
+ if (!has_dependent_txn)
+ return;
+
+ dshash_delete_key(parallelized_txns, &xid);
+ elog(DEBUG1, "depended xid %u committed", xid);
+}
+
+/*
+ * Wait for the given transaction to finish.
+ */
+void
+pa_wait_for_depended_transaction(TransactionId xid)
+{
+ elog(DEBUG1, "wait for depended xid %u", xid);
+
+ for (;;)
+ {
+ ParallelizedTxnEntry *txn_entry;
+
+ txn_entry = dshash_find(parallelized_txns, &xid, false);
+
+ /* The entry is removed only if the transaction is committed */
+ if (txn_entry == NULL)
+ break;
+
+ dshash_release_lock(parallelized_txns, txn_entry);
+
+ pa_lock_transaction(xid, AccessShareLock);
+ pa_unlock_transaction(xid, AccessShareLock);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ elog(DEBUG1, "finish waiting for depended xid %u", xid);
+}
+
+/*
+ * Write internal relation description to the output stream.
+ */
+static void
+write_internal_relation(StringInfo s, LogicalRepRelation *rel)
+{
+ pq_sendbyte(s, PARALLEL_APPLY_INTERNAL_MESSAGE);
+ pq_sendbyte(s, LOGICAL_REP_MSG_INTERNAL_RELATION);
+
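+ /* The body is a relation count followed by that many relation entries. */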
+ if (rel)
+ {
+ pq_sendint(s, 1, 4);
+ logicalrep_write_internal_rel(s, rel);
+ }
+ else
+ {
+ pq_sendint(s, logicalrep_get_num_rels(), 4);
+ logicalrep_write_all_rels(s);
+ }
}
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 1b3d9eb49dd..de6c1e930e4 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -691,6 +691,44 @@ logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel,
logicalrep_write_attrs(out, rel, columns, include_gencols_type);
}
+/*
+ * Write internal relation description to the output stream.
+ */
+void
+logicalrep_write_internal_rel(StringInfo out, LogicalRepRelation *rel)
+{
+ pq_sendint32(out, rel->remoteid);
+
+ /* Write relation name */
+ pq_sendstring(out, rel->nspname);
+ pq_sendstring(out, rel->relname);
+
+ /* Write the replica identity. */
+ pq_sendbyte(out, rel->replident);
+
+ /* Write attribute description */
+ pq_sendint16(out, rel->natts);
+
+ for (int i = 0; i < rel->natts; i++)
+ {
+ uint8 flags = 0;
+
+ if (bms_is_member(i, rel->attkeys))
+ flags |= LOGICALREP_IS_REPLICA_IDENTITY;
+
+ pq_sendbyte(out, flags);
+
+ /* attribute name */
+ pq_sendstring(out, rel->attnames[i]);
+
+ /* attribute type id */
+ pq_sendint32(out, rel->atttyps[i]);
+
+ /* ignore attribute mode for now */
+ pq_sendint32(out, 0);
+ }
+}
+
/*
* Read the relation info from stream and return as LogicalRepRelation.
*/
@@ -1250,6 +1288,10 @@ logicalrep_message_type(LogicalRepMsgType action)
return "STREAM ABORT";
case LOGICAL_REP_MSG_STREAM_PREPARE:
return "STREAM PREPARE";
+ case LOGICAL_REP_MSG_INTERNAL_DEPENDENCY:
+ return "INTERNAL DEPENDENCY";
+ case LOGICAL_REP_MSG_INTERNAL_RELATION:
+ return "INTERNAL RELATION";
}
/*
diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index f59046ad620..2e15f8e69b0 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -946,3 +946,58 @@ FindLogicalRepLocalIndex(Relation localrel, LogicalRepRelation *remoterel,
return InvalidOid;
}
+
+/*
+ * Get the number of entries in the LogicalRepRelMap.
+ */
+int
+logicalrep_get_num_rels(void)
+{
+ if (LogicalRepRelMap == NULL)
+ return 0;
+
+ return hash_get_num_entries(LogicalRepRelMap);
+}
+
+/*
+ * Write all the remote relation information from the LogicalRepRelMapEntry to
+ * the output stream.
+ */
+void
+logicalrep_write_all_rels(StringInfo out)
+{
+ LogicalRepRelMapEntry *entry;
+ HASH_SEQ_STATUS status;
+
+ if (LogicalRepRelMap == NULL)
+ return;
+
+ hash_seq_init(&status, LogicalRepRelMap);
+
+ while ((entry = (LogicalRepRelMapEntry *) hash_seq_search(&status)) != NULL)
+ logicalrep_write_internal_rel(out, &entry->remoterel);
+}
+
+/*
+ * Get the LogicalRepRelMapEntry corresponding to the given relid without
+ * opening the local relation.
+ */
+LogicalRepRelMapEntry *
+logicalrep_get_relentry(LogicalRepRelId remoteid)
+{
+ LogicalRepRelMapEntry *entry;
+ bool found;
+
+ if (LogicalRepRelMap == NULL)
+ logicalrep_relmap_init();
+
+ /* Search for existing entry. */
+ entry = hash_search(LogicalRepRelMap, (void *) &remoteid,
+ HASH_FIND, &found);
+
+ if (!found)
+ elog(DEBUG1, "no relation map entry for remote relation ID %u",
+ remoteid);
+
+ return entry;
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 0fdc5de57ba..11726b691fa 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -287,6 +287,7 @@ typedef struct FlushPosition
dlist_node node;
XLogRecPtr local_end;
XLogRecPtr remote_end;
+ TransactionId pa_remote_xid;
} FlushPosition;
static dlist_head lsn_mapping = DLIST_STATIC_INIT(lsn_mapping);
@@ -462,6 +463,7 @@ static List *on_commit_wakeup_workers_subids = NIL;
bool in_remote_transaction = false;
static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
+static TransactionId remote_xid = InvalidTransactionId;
/* fields valid only when processing streamed transaction */
static bool in_streamed_transaction = false;
@@ -523,6 +525,49 @@ typedef struct ApplySubXactData
static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};
+typedef struct ReplicaIdentityKey
+{
+ Oid relid;
+ LogicalRepTupleData *data;
+} ReplicaIdentityKey;
+
+typedef struct ReplicaIdentityEntry
+{
+ ReplicaIdentityKey *keydata;
+ TransactionId remote_xid;
+
+ /* needed for simplehash */
+ uint32 hash;
+ char status;
+} ReplicaIdentityEntry;
+
+#include "common/hashfn.h"
+
+static uint32 hash_replica_identity(ReplicaIdentityKey *key);
+static bool hash_replica_identity_compare(ReplicaIdentityKey *a,
+ ReplicaIdentityKey *b);
+
+/* Define parameters for replica identity hash table code generation. */
+#define SH_PREFIX replica_identity
+#define SH_ELEMENT_TYPE ReplicaIdentityEntry
+#define SH_KEY_TYPE ReplicaIdentityKey *
+#define SH_KEY keydata
+#define SH_HASH_KEY(tb, key) hash_replica_identity(key)
+#define SH_EQUAL(tb, a, b) hash_replica_identity_compare(a, b)
+#define SH_STORE_HASH
+#define SH_GET_HASH(tb, a) (a)->hash
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+#define REPLICA_IDENTITY_INITIAL_SIZE 128
+#define REPLICA_IDENTITY_CLEANUP_THRESHOLD 1024
+
+static replica_identity_hash *replica_identity_table = NULL;
+
+static void write_internal_dependencies(StringInfo s, List *depends_on_xids);
+
static inline void subxact_filename(char *path, Oid subid, TransactionId xid);
static inline void changes_filename(char *path, Oid subid, TransactionId xid);
@@ -537,11 +582,7 @@ static inline void cleanup_subxact_info(void);
/*
* Serialize and deserialize changes for a toplevel transaction.
*/
-static void stream_open_file(Oid subid, TransactionId xid,
- bool first_segment);
static void stream_write_change(char action, StringInfo s);
-static void stream_open_and_write_change(TransactionId xid, char action, StringInfo s);
-static void stream_close_file(void);
static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
@@ -601,6 +642,595 @@ static TransApplyAction get_transaction_apply_action(TransactionId xid,
static void replorigin_reset(int code, Datum arg);
+/*
+ * Compute the hash value for entries in the replica_identity_table.
+ */
+static uint32
+hash_replica_identity(ReplicaIdentityKey *key)
+{
+ int i;
+ uint32 hashkey = 0;
+
+ hashkey = hash_combine(hashkey, hash_uint32(key->relid));
+
+ for (i = 0; i < key->data->ncols; i++)
+ {
+ uint32 hkey;
+
+ if (key->data->colstatus[i] == LOGICALREP_COLUMN_NULL)
+ continue;
+
+ hkey = hash_any((const unsigned char *) key->data->colvalues[i].data,
+ key->data->colvalues[i].len);
+ hashkey = hash_combine(hashkey, hkey);
+ }
+
+ return hashkey;
+}
+
+/*
+ * Compare two entries in the replica_identity_table.
+ */
+static bool
+hash_replica_identity_compare(ReplicaIdentityKey *a, ReplicaIdentityKey *b)
+{
+ if (a->relid != b->relid ||
+ a->data->ncols != b->data->ncols)
+ return false;
+
+ for (int i = 0; i < a->data->ncols; i++)
+ {
+ if (a->data->colstatus[i] != b->data->colstatus[i])
+ return false;
+
+ if (a->data->colvalues[i].len != b->data->colvalues[i].len)
+ return false;
+
+ if (strcmp(a->data->colvalues[i].data, b->data->colvalues[i].data))
+ return false;
+
+ elog(DEBUG1, "conflicting key %s", a->data->colvalues[i].data);
+ }
+
+ return true;
+}
+
+/*
+ * Free resources associated with a replica identity key.
+ */
+static void
+free_replica_identity_key(ReplicaIdentityKey *key)
+{
+ Assert(key);
+
+ pfree(key->data->colvalues);
+ pfree(key->data->colstatus);
+ pfree(key->data);
+ pfree(key);
+}
+
+/*
+ * Clean up hash table entries associated with the given transaction IDs.
+ */
+static void
+cleanup_replica_identity_table(List *committed_xid)
+{
+ replica_identity_iterator i;
+ ReplicaIdentityEntry *rientry;
+
+ if (!committed_xid)
+ return;
+
+ replica_identity_start_iterate(replica_identity_table, &i);
+ while ((rientry = replica_identity_iterate(replica_identity_table, &i)) != NULL)
+ {
+ if (!list_member_xid(committed_xid, rientry->remote_xid))
+ continue;
+
+ /* Clean up the hash entry for committed transaction */
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+ }
+}
+
+/*
+ * Check committed transactions and clean up corresponding entries in the hash
+ * table.
+ */
+static void
+cleanup_committed_replica_identity_entries(void)
+{
+ dlist_mutable_iter iter;
+ List *committed_xids = NIL;
+
+ dlist_foreach_modify(iter, &lsn_mapping)
+ {
+ FlushPosition *pos =
+ dlist_container(FlushPosition, node, iter.cur);
+ bool skipped_write;
+
+ if (!TransactionIdIsValid(pos->pa_remote_xid) ||
+ !XLogRecPtrIsInvalid(pos->local_end))
+ continue;
+
+ pos->local_end = pa_get_last_commit_end(pos->pa_remote_xid, true,
+ &skipped_write);
+
+ elog(DEBUG1,
+ "got commit end from parallel apply worker, "
+ "txn: %u, remote_end %X/%X, local_end %X/%X",
+ pos->pa_remote_xid, LSN_FORMAT_ARGS(pos->remote_end),
+ LSN_FORMAT_ARGS(pos->local_end));
+
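+ /* Skip entries whose worker has not yet finished applying. */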
+ if (!skipped_write && XLogRecPtrIsInvalid(pos->local_end))
+ continue;
+
+ committed_xids = lappend_xid(committed_xids, pos->pa_remote_xid);
+ }
+
+ /* cleanup the entries for committed transactions */
+ cleanup_replica_identity_table(committed_xids);
+}
+
+/*
+ * Append a transaction dependency, excluding duplicates and committed
+ * transactions.
+ */
+static List *
+check_and_append_xid_dependency(List *depends_on_xids,
+ TransactionId *depends_on_xid,
+ TransactionId current_xid)
+{
+ Assert(depends_on_xid);
+
+ if (!TransactionIdIsValid(*depends_on_xid))
+ return depends_on_xids;
+
+ if (TransactionIdEquals(*depends_on_xid, current_xid))
+ return depends_on_xids;
+
+ if (list_member_xid(depends_on_xids, *depends_on_xid))
+ return depends_on_xids;
+
+ /*
+ * Return and reset the xid if the transaction has been committed.
+ */
+ if (pa_transaction_committed(*depends_on_xid))
+ {
+ *depends_on_xid = InvalidTransactionId;
+ return depends_on_xids;
+ }
+
+ return lappend_xid(depends_on_xids, *depends_on_xid);
+}
+
+/*
+ * Check for dependencies on preceding transactions that modify the same key.
+ * Returns the dependent transactions in 'depends_on_xids' and records the
+ * current change.
+ */
+static void
+check_dependency_on_replica_identity(Oid relid,
+ LogicalRepTupleData *original_data,
+ TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ LogicalRepRelMapEntry *relentry;
+ LogicalRepTupleData *ridata;
+ ReplicaIdentityKey *rikey;
+ ReplicaIdentityEntry *rientry;
+ MemoryContext oldctx;
+ int n_ri;
+ bool found = false;
+
+ Assert(depends_on_xids);
+
+ /* Search for existing entry */
+ relentry = logicalrep_get_relentry(relid);
+
+ Assert(relentry);
+
+ /*
+ * First search whether any previous transaction has affected the whole
+ * table e.g., truncate or schema change from publisher.
+ */
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &relentry->last_depended_xid,
+ new_depended_xid);
+
+ n_ri = bms_num_members(relentry->remoterel.attkeys);
+
+ /*
+ * Return if there are no replica identity columns, indicating that the
+ * remote relation has neither a replica identity key nor is marked as
+ * replica identity full.
+ */
+ if (!n_ri)
+ return;
+
+ /* Check if the RI key value of the tuple is invalid */
+ for (int i = 0; i < original_data->ncols; i++)
+ {
+ if (!bms_is_member(i, relentry->remoterel.attkeys))
+ continue;
+
+ /*
+ * Return if the RI key is NULL or is explicitly marked unchanged. The
+ * key value could be NULL in the new tuple of an update operation,
+ * which means the RI key is not updated.
+ */
+ if (original_data->colstatus[i] == LOGICALREP_COLUMN_NULL ||
+ original_data->colstatus[i] == LOGICALREP_COLUMN_UNCHANGED)
+ return;
+ }
+
+ oldctx = MemoryContextSwitchTo(ApplyContext);
+
+ /* Allocate space for replica identity values */
+ ridata = palloc0_object(LogicalRepTupleData);
+ ridata->colvalues = palloc0_array(StringInfoData, n_ri);
+ ridata->colstatus = palloc0_array(char, n_ri);
+ ridata->ncols = n_ri;
+
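+ /* Copy only the replica identity columns into the lookup key. */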
+ for (int i_original = 0, i_ri = 0; i_original < original_data->ncols; i_original++)
+ {
+ StringInfo original_colvalue = &original_data->colvalues[i_original];
+
+ if (!bms_is_member(i_original, relentry->remoterel.attkeys))
+ continue;
+
+ initStringInfoExt(&ridata->colvalues[i_ri], original_colvalue->len + 1);
+ appendStringInfoString(&ridata->colvalues[i_ri], original_colvalue->data);
+ ridata->colstatus[i_ri] = original_data->colstatus[i_original];
+ i_ri++;
+ }
+
+ rikey = palloc0_object(ReplicaIdentityKey);
+ rikey->relid = relid;
+ rikey->data = ridata;
+
+ if (TransactionIdIsValid(new_depended_xid))
+ {
+ rientry = replica_identity_insert(replica_identity_table, rikey,
+ &found);
+
+ /*
+ * Release the key built to search the entry, if the entry already
+ * exists. Otherwise, initialize the remote_xid.
+ */
+ if (found)
+ {
+ elog(DEBUG1, "found conflicting replica identity change from %u",
+ rientry->remote_xid);
+
+ free_replica_identity_key(rikey);
+ }
+ else
+ rientry->remote_xid = InvalidTransactionId;
+ }
+ else
+ {
+ rientry = replica_identity_lookup(replica_identity_table, rikey);
+ free_replica_identity_key(rikey);
+ }
+
+ MemoryContextSwitchTo(oldctx);
+
+ /* Return if no entry found */
+ if (!rientry)
+ return;
+
+ Assert(!found || TransactionIdIsValid(rientry->remote_xid));
+
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &rientry->remote_xid,
+ new_depended_xid);
+
+ /*
+ * Store the new depended xid in the entry if it is valid. The new xid
+ * could be invalid if the transaction will be applied by the leader
+ * itself, in which case all its changes are committed before the next
+ * transaction is processed, so nothing needs to depend on it.
+ */
+ if (TransactionIdIsValid(new_depended_xid))
+ rientry->remote_xid = new_depended_xid;
+
+ /*
+ * Remove the entry if the transaction has been committed and no new
+ * dependency needs to be added.
+ */
+ else if (!TransactionIdIsValid(rientry->remote_xid))
+ {
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+ }
+}
+
+/*
+ * Check for preceding transactions that involve insert, delete, or update
+ * operations on the specified table, and return them in 'depends_on_xids'.
+ */
+static void
+find_all_dependencies_on_rel(LogicalRepRelId relid, TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ replica_identity_iterator i;
+ ReplicaIdentityEntry *rientry;
+
+ Assert(depends_on_xids);
+
+ replica_identity_start_iterate(replica_identity_table, &i);
+ while ((rientry = replica_identity_iterate(replica_identity_table, &i)) != NULL)
+ {
+ Assert(TransactionIdIsValid(rientry->remote_xid));
+
+ if (rientry->keydata->relid != relid)
+ continue;
+
+ /* Clean up the hash entry for committed transaction while on it */
+ if (pa_transaction_committed(rientry->remote_xid))
+ {
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+
+ continue;
+ }
+
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &rientry->remote_xid,
+ new_depended_xid);
+ }
+}
+
+/*
+ * Check for any preceding transactions that affect the given table and return
+ * them in 'depends_on_xids'.
+ */
+static void
+check_dependency_on_rel(LogicalRepRelId relid, TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ LogicalRepRelMapEntry *relentry;
+
+ Assert(depends_on_xids);
+
+ find_all_dependencies_on_rel(relid, new_depended_xid, depends_on_xids);
+
+ /* Search for existing entry */
+ relentry = logicalrep_get_relentry(relid);
+
+ /*
+ * The relentry has not been initialized yet, indicating that no change
+ * has been applied to this relation yet.
+ */
+ if (!relentry)
+ return;
+
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &relentry->last_depended_xid,
+ new_depended_xid);
+
+ if (TransactionIdIsValid(new_depended_xid))
+ relentry->last_depended_xid = new_depended_xid;
+}
+
+/*
+ * Check dependencies related to the current change by determining if the
+ * modification impacts the same row or table as another ongoing transaction. If
+ * needed, instruct parallel apply workers to wait for these preceding
+ * transactions to complete.
+ *
+ * Simultaneously, track the dependency for the current change to ensure that
+ * subsequent transactions address this dependency.
+ */
+static void
+handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
+ TransactionId new_depended_xid,
+ ParallelApplyWorkerInfo *winfo)
+{
+ LogicalRepRelId relid;
+ LogicalRepTupleData oldtup;
+ LogicalRepTupleData newtup;
+ LogicalRepRelation *rel;
+ List *depends_on_xids = NIL;
+ List *remote_relids;
+ bool has_oldtup = false;
+ bool cascade = false;
+ bool restart_seqs = false;
+ StringInfoData dependencies;
+
+ /*
+ * Parse the change data using a local copy instead of consuming the
+ * given remote message directly, as the caller may also need to read
+ * the data from it.
+ */
+ StringInfoData change = *s;
+
+ /* Compute dependency only for non-streaming transaction */
+ if (in_streamed_transaction || (winfo && winfo->stream_txn))
+ return;
+
+ /* Only the leader checks dependencies and schedules the parallel apply */
+ if (!am_leader_apply_worker())
+ return;
+
+ if (!replica_identity_table)
+ replica_identity_table = replica_identity_create(ApplyContext,
+ REPLICA_IDENTITY_INITIAL_SIZE,
+ NULL);
+
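+ /*
+ * Opportunistically prune entries of already-committed transactions once
+ * the table grows past the threshold, to bound its memory usage.
+ */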
+ if (replica_identity_table->members >= REPLICA_IDENTITY_CLEANUP_THRESHOLD)
+ cleanup_committed_replica_identity_entries();
+
+ switch (action)
+ {
+ case LOGICAL_REP_MSG_INSERT:
+ relid = logicalrep_read_insert(&change, &newtup);
+ check_dependency_on_replica_identity(relid, &newtup,
+ new_depended_xid,
+ &depends_on_xids);
+ break;
+
+ case LOGICAL_REP_MSG_UPDATE:
+ relid = logicalrep_read_update(&change, &has_oldtup, &oldtup,
+ &newtup);
+
+ if (has_oldtup)
+ check_dependency_on_replica_identity(relid, &oldtup,
+ new_depended_xid,
+ &depends_on_xids);
+
+ check_dependency_on_replica_identity(relid, &newtup,
+ new_depended_xid,
+ &depends_on_xids);
+ break;
+
+ case LOGICAL_REP_MSG_DELETE:
+ relid = logicalrep_read_delete(&change, &oldtup);
+ check_dependency_on_replica_identity(relid, &oldtup,
+ new_depended_xid,
+ &depends_on_xids);
+ break;
+
+ case LOGICAL_REP_MSG_TRUNCATE:
+ remote_relids = logicalrep_read_truncate(&change, &cascade,
+ &restart_seqs);
+
+ /*
+ * Truncate affects all rows in a table, so the current
+ * transaction should wait for all preceding transactions that
+ * modified the same table.
+ */
+ foreach_int(truncated_relid, remote_relids)
+ check_dependency_on_rel(truncated_relid, new_depended_xid,
+ &depends_on_xids);
+
+ break;
+
+ case LOGICAL_REP_MSG_RELATION:
+ rel = logicalrep_read_rel(&change);
+
+ /*
+ * The replica identity key could be changed, making existing
+ * entries in the replica identity invalid. In this case, parallel
+ * apply is not allowed on this specific table until all running
+ * transactions that modified it have finished.
+ */
+ check_dependency_on_rel(rel->remoteid, new_depended_xid,
+ &depends_on_xids);
+ break;
+
+ case LOGICAL_REP_MSG_TYPE:
+ case LOGICAL_REP_MSG_MESSAGE:
+
+ /*
+ * Type updates accompany relation updates, so dependencies have
+ * already been checked during relation updates. Logical messages
+ * do not conflict with any changes, so they can be ignored.
+ */
+ break;
+
+ default:
+ Assert(false);
+ break;
+ }
+
+ if (!depends_on_xids)
+ return;
+
+ /*
+ * Record that the current transaction depends on these preceding
+ * in-progress transactions.
+ */
+ pa_record_dependency_on_transactions(depends_on_xids);
+
+ /*
+ * If the leader applies the transaction itself, immediately wait for
+ * the transactions that the current transaction depends on.
+ */
+ if (winfo == NULL)
+ {
+ foreach_xid(xid, depends_on_xids)
+ pa_wait_for_depended_transaction(xid);
+
+ return;
+ }
+
+ initStringInfo(&dependencies);
+
+ /* Build the dependency message used to send to parallel apply worker */
+ write_internal_dependencies(&dependencies, depends_on_xids);
+
+ if (!winfo->serialize_changes)
+ {
+ if (pa_send_data(winfo, dependencies.len, dependencies.data))
+ return;
+
+ pa_switch_to_partial_serialize(winfo, true);
+ }
+
+ /* Skip writing the first internal message flag */
+ dependencies.cursor++;
+ stream_write_change(LOGICAL_REP_MSG_INTERNAL_DEPENDENCY,
+ &dependencies);
+}
+
+/*
+ * Write internal dependency information to the output for the parallel apply
+ * worker.
+ */
+static void
+write_internal_dependencies(StringInfo s, List *depends_on_xids)
+{
+ pq_sendbyte(s, PARALLEL_APPLY_INTERNAL_MESSAGE);
+ pq_sendbyte(s, LOGICAL_REP_MSG_INTERNAL_DEPENDENCY);
+ pq_sendint32(s, list_length(depends_on_xids));
+
+ foreach_xid(xid, depends_on_xids)
+ pq_sendint32(s, xid);
+}
+
+/*
+ * Handle internal dependency information.
+ *
+ * Wait for all transactions listed in the message to commit.
+ */
+static void
+apply_handle_internal_dependency(StringInfo s)
+{
+ int nxids = pq_getmsgint(s, 4);
+
+ for (int i = 0; i < nxids; i++)
+ {
+ TransactionId xid = pq_getmsgint(s, 4);
+
+ pa_wait_for_depended_transaction(xid);
+ }
+}
+
+/*
+ * Handle internal relation information.
+ *
+ * Update all relation details in the relation map cache.
+ */
+static void
+apply_handle_internal_relation(StringInfo s)
+{
+ int num_rels;
+
+ num_rels = pq_getmsgint(s, 4);
+
+ for (int i = 0; i < num_rels; i++)
+ {
+ LogicalRepRelation *rel = logicalrep_read_rel(s);
+
+ logicalrep_relmap_update(rel);
+
+ elog(DEBUG1, "parallel apply worker worker init relmap for %s",
+ rel->relname);
+ }
+}
+
/*
* Form the origin name for the subscription.
*
@@ -748,13 +1378,18 @@ handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
TransApplyAction apply_action;
StringInfoData original_msg;
- apply_action = get_transaction_apply_action(stream_xid, &winfo);
+ Assert(!in_streamed_transaction || TransactionIdIsValid(stream_xid));
+
+ apply_action = get_transaction_apply_action(in_streamed_transaction
+ ? stream_xid : remote_xid,
+ &winfo);
/* not in streaming mode */
if (apply_action == TRANS_LEADER_APPLY)
+ {
+ handle_dependency_on_change(action, s, InvalidTransactionId, winfo);
return false;
-
- Assert(TransactionIdIsValid(stream_xid));
+ }
/*
* The parallel apply worker needs the xid in this message to decide
@@ -766,15 +1401,28 @@ handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
/*
* We should have received XID of the subxact as the first part of the
- * message, so extract it.
+ * message in streaming transactions, so extract it.
*/
- current_xid = pq_getmsgint(s, 4);
+ if (in_streamed_transaction)
+ current_xid = pq_getmsgint(s, 4);
+ else
+ current_xid = remote_xid;
if (!TransactionIdIsValid(current_xid))
ereport(ERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
errmsg_internal("invalid transaction ID in streamed replication transaction")));
+ handle_dependency_on_change(action, s, current_xid, winfo);
+
+ /*
+ * Re-fetch the latest apply action as it might have been changed during
+ * dependency check.
+ */
+ apply_action = get_transaction_apply_action(in_streamed_transaction
+ ? stream_xid : remote_xid,
+ &winfo);
+
switch (apply_action)
{
case TRANS_LEADER_SERIALIZE:
@@ -1177,17 +1825,49 @@ static void
apply_handle_begin(StringInfo s)
{
LogicalRepBeginData begin_data;
+ ParallelApplyWorkerInfo *winfo;
+ TransApplyAction apply_action;
/* There must not be an active streaming transaction. */
Assert(!TransactionIdIsValid(stream_xid));
logicalrep_read_begin(s, &begin_data);
- set_apply_error_context_xact(begin_data.xid, begin_data.final_lsn);
+
+ remote_xid = begin_data.xid;
+
+ set_apply_error_context_xact(remote_xid, begin_data.final_lsn);
remote_final_lsn = begin_data.final_lsn;
maybe_start_skipping_changes(begin_data.final_lsn);
+ pa_allocate_worker(remote_xid, false);
+
+ apply_action = get_transaction_apply_action(remote_xid, &winfo);
+
+ elog(DEBUG1, "new remote_xid %u", remote_xid);
+ switch (apply_action)
+ {
+ case TRANS_LEADER_APPLY:
+ break;
+
+ case TRANS_LEADER_SEND_TO_PARALLEL:
+ Assert(winfo);
+ pa_send_data(winfo, s->len, s->data);
+ pa_set_stream_apply_worker(winfo);
+ break;
+
+ case TRANS_PARALLEL_APPLY:
+ /* Hold the lock until the end of the transaction. */
+ pa_lock_transaction(MyParallelShared->xid, AccessExclusiveLock);
+ pa_set_xact_state(MyParallelShared, PARALLEL_TRANS_STARTED);
+ break;
+
+ default:
+ elog(ERROR, "unexpected apply action: %d", (int) apply_action);
+ break;
+ }
+
in_remote_transaction = true;
pgstat_report_activity(STATE_RUNNING, NULL);
@@ -1202,6 +1882,11 @@ static void
apply_handle_commit(StringInfo s)
{
LogicalRepCommitData commit_data;
+ ParallelApplyWorkerInfo *winfo;
+ TransApplyAction apply_action;
+
+ /* Save the message before it is consumed. */
+ StringInfoData original_msg = *s;
logicalrep_read_commit(s, &commit_data);
@@ -1212,7 +1897,70 @@ apply_handle_commit(StringInfo s)
LSN_FORMAT_ARGS(commit_data.commit_lsn),
LSN_FORMAT_ARGS(remote_final_lsn))));
- apply_handle_commit_internal(&commit_data);
+ apply_action = get_transaction_apply_action(remote_xid, &winfo);
+
+ switch (apply_action)
+ {
+ case TRANS_LEADER_APPLY:
+ apply_handle_commit_internal(&commit_data);
+ break;
+
+ case TRANS_LEADER_SEND_TO_PARALLEL:
+ Assert(winfo);
+
+ if (pa_send_data(winfo, s->len, s->data))
+ {
+ /* Finish processing the transaction. */
+ pa_xact_finish(winfo, commit_data.end_lsn);
+ break;
+ }
+
+ /*
+ * Switch to serialize mode when we are not able to send the
+ * change to parallel apply worker.
+ */
+ pa_switch_to_partial_serialize(winfo, true);
+
+ /* fall through */
+ case TRANS_LEADER_PARTIAL_SERIALIZE:
+ Assert(winfo);
+
+ stream_open_and_write_change(remote_xid, LOGICAL_REP_MSG_COMMIT,
+ &original_msg);
+
+ pa_set_fileset_state(winfo->shared, FS_SERIALIZE_DONE);
+
+ /* Finish processing the transaction. */
+ pa_xact_finish(winfo, commit_data.end_lsn);
+ break;
+
+ case TRANS_PARALLEL_APPLY:
+
+ /*
+ * If the parallel apply worker is applying spooled messages then
+ * close the file before committing.
+ */
+ if (stream_fd)
+ stream_close_file();
+
+ apply_handle_commit_internal(&commit_data);
+
+ MyParallelShared->last_commit_end = XactLastCommitEnd;
+
+ pa_commit_transaction();
+
+ pa_unlock_transaction(remote_xid, AccessExclusiveLock);
+ break;
+
+ default:
+ elog(ERROR, "unexpected apply action: %d", (int) apply_action);
+ break;
+ }
+
+ remote_xid = InvalidTransactionId;
+ in_remote_transaction = false;
+
+ elog(DEBUG1, "reset remote_xid %u", remote_xid);
/* Process any tables that are being synchronized in parallel. */
process_syncing_tables(commit_data.end_lsn);
@@ -1332,7 +2080,8 @@ apply_handle_prepare(StringInfo s)
* XactLastCommitEnd, and adding it for this purpose doesn't seems worth
* it.
*/
- store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr);
+ store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr,
+ InvalidTransactionId);
in_remote_transaction = false;
@@ -1377,6 +2126,8 @@ apply_handle_commit_prepared(StringInfo s)
/* There is no transaction when COMMIT PREPARED is called */
begin_replication_step();
+ /* TODO wait for xid to finish */
+
/*
* Update origin state so we can restart streaming from correct position
* in case of crash.
@@ -1389,7 +2140,8 @@ apply_handle_commit_prepared(StringInfo s)
CommitTransactionCommand();
pgstat_report_stat(false);
- store_flush_position(prepare_data.end_lsn, XactLastCommitEnd);
+ store_flush_position(prepare_data.end_lsn, XactLastCommitEnd,
+ InvalidTransactionId);
in_remote_transaction = false;
/* Process any tables that are being synchronized in parallel. */
@@ -1455,7 +2207,8 @@ apply_handle_rollback_prepared(StringInfo s)
* transaction because we always flush the WAL record for it. See
* apply_handle_prepare.
*/
- store_flush_position(rollback_data.rollback_end_lsn, InvalidXLogRecPtr);
+ store_flush_position(rollback_data.rollback_end_lsn, InvalidXLogRecPtr,
+ InvalidTransactionId);
in_remote_transaction = false;
/* Process any tables that are being synchronized in parallel. */
@@ -1514,7 +2267,8 @@ apply_handle_stream_prepare(StringInfo s)
* It is okay not to set the local_end LSN for the prepare because
* we always flush the prepare record. See apply_handle_prepare.
*/
- store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr);
+ store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr,
+ InvalidTransactionId);
in_remote_transaction = false;
@@ -1705,7 +2459,7 @@ apply_handle_stream_start(StringInfo s)
/* Try to allocate a worker for the streaming transaction. */
if (first_segment)
- pa_allocate_worker(stream_xid);
+ pa_allocate_worker(stream_xid, true);
apply_action = get_transaction_apply_action(stream_xid, &winfo);
@@ -1763,6 +2517,11 @@ apply_handle_stream_start(StringInfo s)
case TRANS_LEADER_PARTIAL_SERIALIZE:
Assert(winfo);
+ /*
+ * TODO, the pa worker could start to wait too soon when
+ * processing some old stream start
+ */
+
/*
* Open the spool file unless it was already opened when switching
* to serialize mode. The transaction started in
@@ -2486,7 +3245,8 @@ apply_handle_commit_internal(LogicalRepCommitData *commit_data)
pgstat_report_stat(false);
- store_flush_position(commit_data->end_lsn, XactLastCommitEnd);
+ store_flush_position(commit_data->end_lsn, XactLastCommitEnd,
+ InvalidTransactionId);
}
else
{
@@ -2519,6 +3279,9 @@ apply_handle_relation(StringInfo s)
/* Also reset all entries in the partition map that refer to remoterel. */
logicalrep_partmap_reset_relmap(rel);
+
+ if (am_leader_apply_worker())
+ pa_distribute_schema_changes_to_workers(rel);
}
/*
@@ -3270,6 +4033,8 @@ FindDeletedTupleInLocalRel(Relation localrel, Oid localidxoid,
/*
* This handles insert, update, delete on a partitioned table.
+ *
+ * TODO, support parallel apply.
*/
static void
apply_handle_tuple_routing(ApplyExecutionData *edata,
@@ -3579,6 +4344,8 @@ apply_handle_truncate(StringInfo s)
ListCell *lc;
LOCKMODE lockmode = AccessExclusiveLock;
+ elog(LOG, "truncate");
+
/*
* Quick return if we are skipping data modification changes or handling
* streamed transactions.
@@ -3790,6 +4557,14 @@ apply_dispatch(StringInfo s)
apply_handle_stream_prepare(s);
break;
+ case LOGICAL_REP_MSG_INTERNAL_RELATION:
+ apply_handle_internal_relation(s);
+ break;
+
+ case LOGICAL_REP_MSG_INTERNAL_DEPENDENCY:
+ apply_handle_internal_dependency(s);
+ break;
+
default:
ereport(ERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -3810,6 +4585,10 @@ apply_dispatch(StringInfo s)
* check which entries on it are already locally flushed. Those we can report
* as having been flushed.
*
+ * For non-streaming transactions managed by a parallel apply worker, we will
+ * get the local commit end from the shared parallel apply worker info once the
+ * transaction has been committed by the worker.
+ *
* The have_pending_txes is true if there are outstanding transactions that
* need to be flushed.
*/
@@ -3819,6 +4598,7 @@ get_flush_position(XLogRecPtr *write, XLogRecPtr *flush,
{
dlist_mutable_iter iter;
XLogRecPtr local_flush = GetFlushRecPtr(NULL);
+ List *committed_pa_xid = NIL;
*write = InvalidXLogRecPtr;
*flush = InvalidXLogRecPtr;
@@ -3828,6 +4608,36 @@ get_flush_position(XLogRecPtr *write, XLogRecPtr *flush,
FlushPosition *pos =
dlist_container(FlushPosition, node, iter.cur);
+ if (TransactionIdIsValid(pos->pa_remote_xid) &&
+ XLogRecPtrIsInvalid(pos->local_end))
+ {
+ bool skipped_write;
+
+ pos->local_end = pa_get_last_commit_end(pos->pa_remote_xid, true,
+ &skipped_write);
+
+ elog(DEBUG1,
+ "got commit end from parallel apply worker, "
+ "txn: %u, remote_end %X/%X, local_end %X/%X",
+ pos->pa_remote_xid, LSN_FORMAT_ARGS(pos->remote_end),
+ LSN_FORMAT_ARGS(pos->local_end));
+
+ /*
+ * Break the loop if the worker has not finished applying the
+ * transaction. There's no need to check subsequent transactions,
+ * as they must commit after the current transaction being
+ * examined and thus won't have their commit end available yet.
+ */
+ if (!skipped_write && XLogRecPtrIsInvalid(pos->local_end))
+ break;
+
+ committed_pa_xid = lappend_xid(committed_pa_xid, pos->pa_remote_xid);
+ }
+
+ /*
+ * Worker has finished applying or the transaction was applied in the
+ * leader apply worker.
+ */
*write = pos->remote_end;
if (pos->local_end <= local_flush)
@@ -3836,29 +4646,19 @@ get_flush_position(XLogRecPtr *write, XLogRecPtr *flush,
dlist_delete(iter.cur);
pfree(pos);
}
- else
- {
- /*
- * Don't want to uselessly iterate over the rest of the list which
- * could potentially be long. Instead get the last element and
- * grab the write position from there.
- */
- pos = dlist_tail_element(FlushPosition, node,
- &lsn_mapping);
- *write = pos->remote_end;
- *have_pending_txes = true;
- return;
- }
}
*have_pending_txes = !dlist_is_empty(&lsn_mapping);
+
+ cleanup_replica_identity_table(committed_pa_xid);
}
/*
* Store current remote/local lsn pair in the tracking list.
*/
void
-store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn)
+store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn,
+ TransactionId remote_xid)
{
FlushPosition *flushpos;
@@ -3876,6 +4676,7 @@ store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn)
flushpos = (FlushPosition *) palloc(sizeof(FlushPosition));
flushpos->local_end = local_lsn;
flushpos->remote_end = remote_lsn;
+ flushpos->pa_remote_xid = remote_xid;
dlist_push_tail(&lsn_mapping, &flushpos->node);
MemoryContextSwitchTo(ApplyMessageContext);
@@ -5057,7 +5858,7 @@ stream_cleanup_files(Oid subid, TransactionId xid)
* changes for this transaction, create the buffile, otherwise open the
* previously created file.
*/
-static void
+void
stream_open_file(Oid subid, TransactionId xid, bool first_segment)
{
char path[MAXPGPATH];
@@ -5102,7 +5903,7 @@ stream_open_file(Oid subid, TransactionId xid, bool first_segment)
* stream_close_file
* Close the currently open file with streamed changes.
*/
-static void
+void
stream_close_file(void)
{
Assert(stream_fd != NULL);
@@ -5150,7 +5951,7 @@ stream_write_change(char action, StringInfo s)
* target file if not already before writing the message and close the file at
* the end.
*/
-static void
+void
stream_open_and_write_change(TransactionId xid, char action, StringInfo s)
{
Assert(!in_streamed_transaction);
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 0be307d2ca0..fd66b2c0a41 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -406,6 +406,7 @@ SubtransSLRU "Waiting to access the sub-transaction SLRU cache."
XactSLRU "Waiting to access the transaction status SLRU cache."
ParallelVacuumDSA "Waiting for parallel vacuum dynamic shared memory allocation."
AioUringCompletion "Waiting for another process to complete IO via io_uring."
+ParallelApplyDSA "Waiting for parallel apply dynamic shared memory allocation."
# No "ABI_compatibility" region here as WaitEventLWLock has its own C code.
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index b261c60d3fa..7d2aaf2d389 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -75,6 +75,8 @@ typedef enum LogicalRepMsgType
LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
LOGICAL_REP_MSG_STREAM_ABORT = 'A',
LOGICAL_REP_MSG_STREAM_PREPARE = 'p',
+ LOGICAL_REP_MSG_INTERNAL_DEPENDENCY = 'd',
+ LOGICAL_REP_MSG_INTERNAL_RELATION = 'i',
} LogicalRepMsgType;
/*
@@ -251,6 +253,8 @@ extern void logicalrep_write_message(StringInfo out, TransactionId xid, XLogRecP
extern void logicalrep_write_rel(StringInfo out, TransactionId xid,
Relation rel, Bitmapset *columns,
PublishGencolsType include_gencols_type);
+extern void logicalrep_write_internal_rel(StringInfo out,
+ LogicalRepRelation *rel);
extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
extern void logicalrep_write_typ(StringInfo out, TransactionId xid,
Oid typoid);
diff --git a/src/include/replication/logicalrelation.h b/src/include/replication/logicalrelation.h
index 7a561a8e8d8..34a7069e9e5 100644
--- a/src/include/replication/logicalrelation.h
+++ b/src/include/replication/logicalrelation.h
@@ -37,6 +37,8 @@ typedef struct LogicalRepRelMapEntry
/* Sync state. */
char state;
XLogRecPtr statelsn;
+
+ TransactionId last_depended_xid;
} LogicalRepRelMapEntry;
extern void logicalrep_relmap_update(LogicalRepRelation *remoterel);
@@ -50,5 +52,8 @@ extern void logicalrep_rel_close(LogicalRepRelMapEntry *rel,
LOCKMODE lockmode);
extern bool IsIndexUsableForReplicaIdentityFull(Relation idxrel, AttrMap *attrmap);
extern Oid GetRelationIdentityOrPK(Relation rel);
+extern int logicalrep_get_num_rels(void);
+extern void logicalrep_write_all_rels(StringInfo out);
+extern LogicalRepRelMapEntry *logicalrep_get_relentry(LogicalRepRelId remoteid);
#endif /* LOGICALRELATION_H */
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 7c0204dd6f4..bf9a7f00f87 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -15,6 +15,7 @@
#include "access/xlogdefs.h"
#include "catalog/pg_subscription.h"
#include "datatype/timestamp.h"
+#include "lib/dshash.h"
#include "miscadmin.h"
#include "replication/logicalrelation.h"
#include "replication/walreceiver.h"
@@ -191,6 +192,11 @@ typedef struct ParallelApplyWorkerShared
*/
PartialFileSetState fileset_state;
FileSet fileset;
+
+ dsa_handle parallel_apply_dsa_handle;
+ dshash_table_handle parallelized_txns_handle;
+
+ bool has_dependent_txn;
} ParallelApplyWorkerShared;
/*
@@ -225,6 +231,8 @@ typedef struct ParallelApplyWorkerInfo
*/
bool in_use;
+ bool stream_txn;
+
ParallelApplyWorkerShared *shared;
} ParallelApplyWorkerInfo;
@@ -287,6 +295,10 @@ extern void apply_dispatch(StringInfo s);
extern void maybe_reread_subscription(void);
extern void stream_cleanup_files(Oid subid, TransactionId xid);
+extern void stream_open_file(Oid subid, TransactionId xid, bool first_segment);
+extern void stream_close_file(void);
+extern void stream_open_and_write_change(TransactionId xid, char action,
+ StringInfo s);
extern void set_stream_options(WalRcvStreamOptions *options,
char *slotname,
@@ -300,19 +312,23 @@ extern void SetupApplyOrSyncWorker(int worker_slot);
extern void DisableSubscriptionAndExit(void);
-extern void store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn);
+extern void store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn,
+ TransactionId remote_xid);
/* Function for apply error callback */
extern void apply_error_callback(void *arg);
extern void set_apply_error_context_origin(char *originname);
/* Parallel apply worker setup and interactions */
-extern void pa_allocate_worker(TransactionId xid);
+extern void pa_allocate_worker(TransactionId xid, bool stream_txn);
extern ParallelApplyWorkerInfo *pa_find_worker(TransactionId xid);
+extern XLogRecPtr pa_get_last_commit_end(TransactionId xid, bool delete_entry,
+ bool *skipped_write);
extern void pa_detach_all_error_mq(void);
extern bool pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes,
const void *data);
+extern void pa_distribute_schema_changes_to_workers(LogicalRepRelation *rel);
extern void pa_switch_to_partial_serialize(ParallelApplyWorkerInfo *winfo,
bool stream_locked);
@@ -337,12 +353,18 @@ extern void pa_decr_and_wait_stream_block(void);
extern void pa_xact_finish(ParallelApplyWorkerInfo *winfo,
XLogRecPtr remote_lsn);
+extern bool pa_transaction_committed(TransactionId xid);
+extern void pa_record_dependency_on_transactions(List *depends_on_xids);
+extern void pa_commit_transaction(void);
+extern void pa_wait_for_depended_transaction(TransactionId xid);
#define isParallelApplyWorker(worker) ((worker)->in_use && \
(worker)->type == WORKERTYPE_PARALLEL_APPLY)
#define isTablesyncWorker(worker) ((worker)->in_use && \
(worker)->type == WORKERTYPE_TABLESYNC)
+#define PARALLEL_APPLY_INTERNAL_MESSAGE 'i'
+
static inline bool
am_tablesync_worker(void)
{
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 208d2e3a8ed..cc995f3a252 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -135,3 +135,4 @@ PG_LWLOCKTRANCHE(SUBTRANS_SLRU, SubtransSLRU)
PG_LWLOCKTRANCHE(XACT_SLRU, XactSLRU)
PG_LWLOCKTRANCHE(PARALLEL_VACUUM_DSA, ParallelVacuumDSA)
PG_LWLOCKTRANCHE(AIO_URING_COMPLETION, AioUringCompletion)
+PG_LWLOCKTRANCHE(PARALLEL_APPLY_DSA, ParallelApplyDSA)
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index 3d16c2a800d..c2fba0b9a9c 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -17,7 +17,7 @@ $node_publisher->start;
my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_subscriber->init;
$node_subscriber->append_conf('postgresql.conf',
- qq(max_logical_replication_workers = 6));
+ qq(max_logical_replication_workers = 7));
$node_subscriber->start;
my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
diff --git a/src/test/subscription/t/015_stream.pl b/src/test/subscription/t/015_stream.pl
index 03135b1cd6e..e79ddd9a41c 100644
--- a/src/test/subscription/t/015_stream.pl
+++ b/src/test/subscription/t/015_stream.pl
@@ -232,6 +232,12 @@ $node_subscriber->wait_for_log(
$node_publisher->safe_psql('postgres', "INSERT INTO test_tab_2 values(1)");
+# FIXME: Currently, non-streaming transactions are applied in parallel by
+# default. So, the first transaction is handled by a parallel apply worker. To
+# trigger the deadlock, initiate one more transaction to be applied by the
+# leader.
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab_2 values(1)");
+
$h->query_safe('COMMIT');
$h->quit;
@@ -247,7 +253,7 @@ $node_publisher->wait_for_catchup($appname);
$result =
$node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab_2");
-is($result, qq(5001), 'data replicated to subscriber after dropping index');
+is($result, qq(5002), 'data replicated to subscriber after dropping index');
# Clean up test data from the environment.
$node_publisher->safe_psql('postgres', "TRUNCATE TABLE test_tab_2");
diff --git a/src/test/subscription/t/026_stats.pl b/src/test/subscription/t/026_stats.pl
index 00a1c2fcd48..6842476c8b0 100644
--- a/src/test/subscription/t/026_stats.pl
+++ b/src/test/subscription/t/026_stats.pl
@@ -16,6 +16,7 @@ $node_publisher->start;
# Create subscriber node.
my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf', "max_logical_replication_workers = 10");
$node_subscriber->start;
diff --git a/src/test/subscription/t/027_nosuperuser.pl b/src/test/subscription/t/027_nosuperuser.pl
index 36af1c16e7f..aec039d565b 100644
--- a/src/test/subscription/t/027_nosuperuser.pl
+++ b/src/test/subscription/t/027_nosuperuser.pl
@@ -87,6 +87,7 @@ $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_publisher->init(allows_streaming => 'logical');
$node_subscriber->init;
$node_publisher->start;
+$node_subscriber->append_conf('postgresql.conf', "max_logical_replication_workers = 10");
$node_subscriber->start;
$publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
my %remainder_a = (
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e6f2e93b2d6..fa4bfdcbd75 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2075,6 +2075,7 @@ ParallelTransState
ParallelVacuumState
ParallelWorkerContext
ParallelWorkerInfo
+ParallelizedTxnEntry
Param
ParamCompileHook
ParamExecData
@@ -2540,6 +2541,8 @@ ReparameterizeForeignPathByChild_function
ReplaceVarsFromTargetList_context
ReplaceVarsNoMatchOption
ReplaceWrapOption
+ReplicaIdentityEntry
+ReplicaIdentityKey
ReplicaIdentityStmt
ReplicationKind
ReplicationSlot
@@ -4027,6 +4030,7 @@ remoteDep
remove_nulling_relids_context
rendezvousHashEntry
rep
+replica_identity_hash
replace_rte_variables_callback
replace_rte_variables_context
report_error_fn
--
2.50.1.windows.1
On Wed, Aug 13, 2025 at 09:50:27AM +0530, Amit Kapila wrote:
On Tue, Aug 12, 2025 at 10:40 PM Bruce Momjian <bruce@momjian.us> wrote:
Currently, PostgreSQL supports parallel apply only for large streaming
transactions (streaming=parallel). This proposal aims to extend
parallelism to non-streaming transactions, thereby improving
replication performance in workloads dominated by smaller, frequent
transactions.
I thought the approach for improving WAL apply speed, for both binary
and logical, was pipelining: https://en.wikipedia.org/wiki/Instruction_pipelining
rather than trying to do all the steps in parallel.
It is not clear to me how the speed for a mix of dependent and
independent transactions can be improved using the technique you
shared as we still need to follow the commit order for dependent
transactions. Can you please elaborate more on the high-level idea of
how this technique can be used to improve speed for applying logical
WAL records?
This blog post from February I think has some good ideas for binary
replication pipelining:
https://www.cybertec-postgresql.com/en/end-of-the-road-for-postgresql-streaming-replication/
Surprisingly, what could be considered the actual replay work
seems to be a minority of the total workload. The largest parts
involve reading WAL and decoding page references from it, followed
by looking up those pages in the cache, and pinning them so they
are not evicted while in use. All of this work could be performed
concurrently with the replay loop. For example, a separate
read-ahead process could handle these tasks, ensuring that the
replay process receives a queue of transaction log records with
associated cache references already pinned, ready for application.
The beauty of the approach is that there is no need for dependency
tracking. I have CC'ed the author, Ants Aasma.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Do not let urgent matters crowd out time for investment in the future.
On Wed, Aug 13, 2025 at 8:57 PM Bruce Momjian <bruce@momjian.us> wrote:
This blog post from February I think has some good ideas for binary
replication pipelining:
https://www.cybertec-postgresql.com/en/end-of-the-road-for-postgresql-streaming-replication/
Surprisingly, what could be considered the actual replay work
seems to be a minority of the total workload.
This is the biggest difference between physical and logical WAL apply.
In the case of logical WAL, the actual replay is the majority of the
work. We don't need to read WAL, decode it, or find/pin the
appropriate pages to apply. Here, you can consider it almost
equivalent to how the primary receives inserts/updates/deletes from users.
Firstly, the idea shared in the blog is not applicable to logical
replication, and even if we try to somehow map it onto logical apply, I
don't see how or why it would be able to match the speed of applying
with multiple workers in case of logical replication. Also, note that
dependency calculation is not as tricky for logical replication, as we
can easily retrieve such information from logical WAL records in most
cases.
--
With Regards,
Amit Kapila.
On Wed, Aug 13, 2025 at 4:17 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
Here is the initial POC patch for this idea.
Thank you Hou-san for the patch.
I did some performance benchmarking for the patch and overall, the
results show substantial performance improvements.
Please find the details as follows:
Source code:
----------------
pgHead (572c0f1b0e) and v1-0001 patch
Setup:
---------
Pub --> Sub
- Two nodes created in pub-sub logical replication setup.
- Both nodes have the same set of pgbench tables created with scale=300.
- The sub node is subscribed to all the changes from the pub node's
pgbench tables.
Workload Run:
--------------------
- Disable the subscription on Sub node
- Run default pgbench(read-write) only on Pub node with #clients=40
and run duration=10 minutes
- Enable the subscription on Sub once pgbench completes and then
measure the time taken in replication (a sketch of these steps follows).
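A minimal sketch of these steps, assuming the subscription is named
'sub' (by default the replication slot shares this name):
```
ALTER SUBSCRIPTION sub DISABLE;   -- stop apply while pgbench runs on Pub
-- ... run pgbench on the publisher for 10 minutes ...
ALTER SUBSCRIPTION sub ENABLE;    -- start the timed apply phase
-- on the publisher, poll until the subscriber has caught up:
SELECT confirmed_flush_lsn = pg_current_wal_lsn() AS caught_up
FROM pg_replication_slots WHERE slot_name = 'sub';
```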
~~~
Test-01: Measure Replication lag
----------------------------------------
Observations:
---------------
- Replication time improved as the number of parallel workers
increased with the patch.
- On pgHead, replicating a 10-minute publisher workload took ~46 minutes.
- With just 2 parallel workers (default), replication time was cut
roughly in half, and with 8 workers it completed in ~13 minutes (3.5x faster).
- With 16 parallel workers, we achieved ~3.7x speedup over pgHead.
- With 32 workers, performance gains plateaued slightly, likely because
the extra workers add overhead on the machine while the amount of work
that can be done in parallel is not high enough to show further
improvement.
Detailed Result:
-----------------
Case                   Time_taken_in_replication(sec)  rep_time_in_minutes  faster_than_head
1. pgHead              2760.791                        46.01318333          -
2. patched_#worker=2   1463.853                        24.3975              1.88 times
3. patched_#worker=4   1031.376                        17.1896              2.68 times
4. patched_#worker=8   781.007                         13.0168              3.54 times
5. patched_#worker=16  741.108                         12.3518              3.73 times
6. patched_#worker=32  787.203                         13.1201              3.51 times
~~~~
Test-02: Measure number of transactions parallelized
-----------------------------------------------------
- Used a top-up patch to LOG the number of transactions applied by
parallel workers, applied by the leader, and that were dependent.
- The LOG output e.g. -
```
LOG: parallelized_nxact: 11497254 dependent_nxact: 0 leader_applied_nxact: 600
```
- parallelized_nxact: the number of parallelized transactions
- dependent_nxact: the number of dependent transactions
- leader_applied_nxact: the number of transactions applied by the leader worker
(the required top-up v1-0002 patch is attached.)
Observations:
----------------
- With 4 to 8 parallel workers, ~80%-98% of transactions are parallelized.
- As the number of workers increased, the parallelized percentage
increased and reached 99.99% with 32 workers.
Detailed Result:
-----------------
case1: #parallel_workers = 2(default)
#total_pgbench_txns = 24745648
parallelized_nxact = 14439480 (58.35%)
dependent_nxact = 16 (0.00006%)
leader_applied_nxact = 10306153 (41.64%)
case2: #parallel_workers = 4
#total_pgbench_txns = 24776108
parallelized_nxact = 19666593 (79.37%)
dependent_nxact = 212 (0.0008%)
leader_applied_nxact = 5109304 (20.62%)
case3: #parallel_workers = 8
#total_pgbench_txns = 24821333
parallelized_nxact = 24397431 (98.29%)
dependent_nxact = 282 (0.001%)
leader_applied_nxact = 423621 (1.71%)
case4: #parallel_workers = 16
#total_pgbench_txns = 24938255
parallelized_nxact = 24937754 (99.99%)
dependent_nxact = 142 (0.0005%)
leader_applied_nxact = 360 (0.0014%)
case5: #parallel_workers = 32
#total_pgbench_txns = 24769474
parallelized_nxact = 24769135 (99.99%)
dependent_nxact = 312 (0.0013%)
leader_applied_nxact = 28 (0.0001%)
~~~~~
The scripts used for above tests are attached.
Next, I plan to extend the testing to larger workloads by running
pgbench for 20–30 minutes.
We will also benchmark performance across different workload types to
evaluate the improvements once the patch has matured further.
--
Thanks,
Nisha
Attachments:
v1-0002-Add-some-simple-statistics.txt
From 00c05e510015fd72e9f1ede34868e0f691ded299 Mon Sep 17 00:00:00 2001
From: Zhijie Hou <houzj.fnst@fujitsu.com>
Date: Thu, 14 Aug 2025 18:38:08 +0800
Subject: [PATCH v1] Add some simple statistics
---
src/backend/replication/logical/worker.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 11726b691fa..ff550900c2e 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -507,6 +507,12 @@ static BufFile *stream_fd = NULL;
*/
static XLogRecPtr last_flushpos = InvalidXLogRecPtr;
+static uint64 parallelized_nxact = 0;
+static uint64 dependent_nxact = 0;
+static uint64 leader_applied_nxact = 0;
+
+static bool dependent_xact = false;
+
typedef struct SubXactInfo
{
TransactionId xid; /* XID of the subxact */
@@ -1138,6 +1144,8 @@ handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
if (!depends_on_xids)
return;
+ dependent_xact = true;
+
/*
* Notify the transactions that they are dependent on the current
* transaction.
@@ -1831,6 +1839,8 @@ apply_handle_begin(StringInfo s)
/* There must not be an active streaming transaction. */
Assert(!TransactionIdIsValid(stream_xid));
+ dependent_xact = false;
+
logicalrep_read_begin(s, &begin_data);
remote_xid = begin_data.xid;
@@ -1903,11 +1913,17 @@ apply_handle_commit(StringInfo s)
{
case TRANS_LEADER_APPLY:
apply_handle_commit_internal(&commit_data);
+ leader_applied_nxact++;
break;
case TRANS_LEADER_SEND_TO_PARALLEL:
Assert(winfo);
+ if (dependent_xact)
+ dependent_nxact++;
+ else
+ parallelized_nxact++;
+
if (pa_send_data(winfo, s->len, s->data))
{
/* Finish processing the transaction. */
@@ -1967,6 +1983,8 @@ apply_handle_commit(StringInfo s)
pgstat_report_activity(STATE_IDLE, NULL);
reset_apply_error_context_info();
+
+ dependent_xact = false;
}
/*
@@ -5058,6 +5076,9 @@ send_feedback(XLogRecPtr recvpos, bool force, bool requestReply)
return;
send_time = now;
+ elog(LOG, "parallelized_nxact: " UINT64_FORMAT " dependent_nxact: " UINT64_FORMAT " leader_applied_nxact: " UINT64_FORMAT,
+ parallelized_nxact, dependent_nxact, leader_applied_nxact);
+
if (!reply_message)
{
MemoryContext oldctx = MemoryContextSwitchTo(ApplyContext);
--
2.50.1.windows.1
On 18/08/2025 9:56 AM, Nisha Moond wrote:
I did some performance benchmarking for the patch and overall, the
results show substantial performance improvements.
I also did some benchmarking of the proposed parallel apply patch and
compared it with my prewarming approach.
And parallel apply is significantly more efficient than prefetch (as
expected).
So I had two tests (more details here):
/messages/by-id/84ed36b8-7d06-4945-9a6b-3826b3f999a6@garret.ru
One performs random updates and the other inserts rows with random keys.
I stop the subscriber, run the workload on the publisher for 100 seconds,
and then measure how long it takes the subscriber to catch up.
update test (with 8 parallel apply workers):
master: 8:30 min
prefetch: 2:05 min
parallel apply: 1:30 min
insert test (with 8 parallel apply workers):
master: 9:20 min
prefetch: 3:08 min
parallel apply: 1:54 min
On Mon, Aug 18, 2025 at 8:20 PM Konstantin Knizhnik <knizhnik@garret.ru> wrote:
I also did some benchmarking of the proposed parallel apply patch and
compared it with my prewarming approach.
And parallel apply is significantly more efficient than prefetch (as
expected).
Thanks to you and Nisha for doing some preliminary performance
testing, the results are really encouraging (more than 3 to 4 times
improvement in multiple workloads). I hope we keep making progress on
this patch and make it ready for the next release.
--
With Regards,
Amit Kapila.
Hi,
I ran tests to compare the performance of logical synchronous
replication with parallel-apply against physical synchronous
replication.
Highlights
===============
On pgHead (current behavior):
- With synchronous physical replication set to remote_apply, the
Primary’s TPS drops by ~60% (≈2.5x slower than asynchronous).
- With synchronous logical replication set to remote_apply, the
Publisher’s TPS drops drastically by ~94% (≈16x slower than
asynchronous).
With the proposed Parallel-Apply Patch (v1):
- Parallel apply significantly improves logical synchronous
replication performance by 5-6×.
- With 40 parallel workers on the subscriber, the Publisher achieves
30045.82 TPS, which is 5.5× faster than the no-patch case (5435.46
TPS).
- With the patch, the Publisher’s performance is only ~3x slower than
asynchronous, bringing it much closer to the physical replication
case.
Machine details
===============
Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz, 88 cores, 503 GiB RAM
Source code:
===============
- pgHead(e9a31c0cc60) and v1 patch
Test-01: Physical replication:
======================
- To measure the physical synchronous replication performance on pgHead.
Setup & Workload:
-----------------
Primary --> Standby
- Two nodes created in physical (primary-standby) replication setup.
- Default pgbench (read-write) was run on the Primary with scale=300,
#clients=40, run duration=20 minutes.
- The TPS is measured with synchronous_commit set to "off" vs
"remote_apply" on pgHead (a sketch of the settings follows).
Results:
---------
synchronous_commit   Primary_TPS   regression
OFF                  90466.57743   -
remote_apply(run1)   35848.6558    -60%
remote_apply(run2)   35306.25479   -61%
- On pgHead, when synchronous_commit is set to "remote_apply" during
physical replication, the Primary experiences a 60-61% reduction in
TPS, which is ~2.5 times slower.
~~~
Test-02: Logical replication:
=====================
- To measure the logical synchronous replication performance on
pgHead and with parallel-apply patch.
Setup & Workload:
-----------------
Publisher --> Subscriber
- Two nodes created in logical (publisher-subscriber) replication setup.
- Default pgbench (read-write) was run on the Pub with scale=300,
#clients=40, run duration=20 minutes.
- The TPS is measured on pgHead and with the parallel-apply v1 patch.
- The number of parallel workers was varied as 2, 4, 8, 16, 32, 40
(the synchronous setup is sketched below).
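For the logical case, the subscriber's apply worker reports the
subscription name as application_name on the publisher, so a
subscription (assumed here to be named 'sub') can be made synchronous
the same way:
```
-- on the publisher:
ALTER SYSTEM SET synchronous_standby_names = 'sub';
ALTER SYSTEM SET synchronous_commit = 'remote_apply';
SELECT pg_reload_conf();
```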
case-01: pgHead
-------------------
Results:
synchronous_commit     Primary_TPS   regression
pgHead(OFF)            89138.14626   -
pgHead(remote_apply)   5435.464525   -94%
- By default (pgHead), synchronous logical replication sees a 94%
drop in TPS, which is:
a) 16.4 times slower than the logical async case and
b) 6.6 times slower than the physical sync replication case.
case-02: patched
---------------------
- synchronous_commit = 'remote_apply'
- measured the performance by varying #parallel workers as 2, 4, 8, 16, 32, 40
Results:
#workers   Primary_TPS   Improvement_with_patch   faster_than_no-patch
2          9679.077736   78%                      1.78x
4          14329.64073   164%                     2.64x
8          21832.04285   302%                     4.02x
16         27676.47085   409%                     5.09x
32         29718.40090   447%                     5.47x
40         30045.82365   453%                     5.53x
- The TPS on the publisher improves significantly as the number of
parallel workers increases.
- At 40 workers, the TPS reaches 30045.82, which is about 5.5x higher
than the no-patch case.
- With 40 parallel workers, logical sync replication is only about
1.2x slower than physical sync replication.
~~~
The scripts used for the tests are attached. We'll do tests with
larger data sets later and share results.
--
Thanks,
Nisha
On Mon, Aug 11, 2025 at 10:16 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
Now, it is possible that some users may want to parallelize the
transactions but still want to maintain commit order because they don't
explicitly annotate FK, PK for columns but maintain the integrity via
application. So, in such cases as we won't be able to detect
transaction dependencies, it would be better to allow out-of-order
commits optionally.
Thoughts?
+1 for the idea. So I see we already have parallel apply workers
for large streaming transactions, so I am trying to think what
additional problem we need to solve here. IIUC, we currently apply in
parallel the transactions that were actually running in parallel on the
publisher, while commits are still applied in serial order. Whereas now
we are trying to parallel apply the small transactions, so we are not
controlling the commit apply order at the leader worker; hence we need
extra handling of dependencies, and we also need to track which
transactions we need to apply and which we need to skip after a
restart. Is that right?
I am reading the proposal and POC patch in more detail to get the
fundamentals of the design and will share my thoughts.
--
Regards,
Dilip Kumar
Google
On Fri, Sep 5, 2025 at 2:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Mon, Aug 11, 2025 at 10:16 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
... so we are not controlling the commit apply order at the leader
worker; hence we need extra handling of dependencies, and we also need
to track which transactions we need to apply and which we need to skip
after a restart. Is that right?
Right.
I am reading the proposal and POC patch in more detail to get the
fundamentals of the design and will share my thoughts.
Thanks.
--
With Regards,
Amit Kapila.
Hello, Amit!
Amit Kapila <amit.kapila16@gmail.com>:
So, in such cases as we won't be able to detect
transaction dependencies, it would be better to allow out-of-order
commits optionally.
I think it is better to enable preserve order by default - for safety reasons.
I also checked the patch for potential issues like [0] - seems like it
is unaffected, because parallel apply workers sync their concurrent
updates and wait for each other to commit.
[0]: /messages/by-id/CADzfLwWC49oanFSGPTf=6FJoTw-kAnpPZV8nVqAyR5KL68LrHQ@mail.gmail.com
Best regards,
Mikhail.
On Fri, Sep 5, 2025 at 5:15 PM Mihail Nikalayeu
<mihailnikalayeu@gmail.com> wrote:
Hello, Amit!
Amit Kapila <amit.kapila16@gmail.com>:
So, in such cases as we won't be able to detect
transaction dependencies, it would be better to allow out-of-order
commits optionally.
I think it is better to enable preserve order by default - for safety reasons.
+1.
--
With Regards,
Amit Kapila.
On Wed, Aug 13, 2025 at 4:17 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
Here is the initial POC patch for this idea.
The basic implementation is outlined below. Please note that there are several
TODO items remaining, which we are actively working on; these are also detailed
further down.
Thanks for the patch.
Each parallel apply worker records the local end LSN of the transaction it
applies in shared memory. Subsequently, the leader gathers these local end LSNs
and logs them in the local 'lsn_mapping' for verifying whether they have been
flushed to disk (following the logic in get_flush_position()).
If no parallel apply worker is available, the leader will apply the transaction
independently.
I suspect this might not be the most performant default strategy and
could frequently cause a performance dip. In general, we utilize
parallel apply workers, considering that the time taken to apply
changes is much costlier than reading and sending messages to workers.
The current strategy involves the leader picking one transaction for
itself after distributing transactions to all apply workers, assuming
the apply task will take some time to complete. When the leader takes
on an apply task, it becomes a bottleneck for complete parallelism.
This is because it needs to finish applying previous messages before
accepting any new ones. Consequently, even as workers slowly become
free, they won't receive new tasks because the leader is busy applying
its own transaction.
This type of strategy might be suitable in scenarios where users
cannot supply more workers due to resource limitations. However, on
high-end machines, it is more efficient to let the leader act solely
as a message transmitter and allow the apply workers to handle all
apply tasks. This could be a configurable parameter, determining
whether the leader also participates in applying changes. I believe
this should not be the default strategy; in fact, the default should
be for the leader to act purely as a transmitter.
--
Regards,
Dilip Kumar
Google
On Sat, Sep 6, 2025 at 10:33 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
This type of strategy might be suitable in scenarios where users
cannot supply more workers due to resource limitations. However, on
high-end machines, it is more efficient to let the leader act solely
as a message transmitter and allow the apply workers to handle all
apply tasks. ... I believe this should not be the default strategy;
in fact, the default should be for the leader to act purely as a
transmitter.
In case the leader encounters an error while applying a transaction,
it will have to be restarted. Would that restart all the parallel
apply workers? That will be another (minor) risk when letting the
leader apply transactions. The probability of hitting an error while
applying a transaction is more than when just transmitting messages.
--
Best Wishes,
Ashutosh Bapat
Hi Amit,
Really interesting proposal! I've been thinking through some of the
implementation challenges:
*On the memory side:* That hash table tracking RelationId and
ReplicaIdentity could get pretty hefty under load. Maybe bloom filters
could help with the initial screening? Also wondering
about size caps with some kind of LRU cleanup when things get tight.
*Worker bottleneck:* This is the tricky part - hundreds of active
transactions but only a handful of workers. Seems like we'll hit
serialization anyway when workers are maxed out. What
about spawning workers dynamically (within limits) or having some smart
queuing for when we're worker-starved?
*Alternative approach (if it can be considered):* Rather than full
parallelization, break transaction processing into overlapping stages:
• *Stage 1:* Parse WAL records
• *Stage 2:* Analyze dependencies
• *Stage 3:* Execute changes
• *Stage 4:* Commit and track progress
This creates a pipeline where Transaction A executes changes while
Transaction B analyzes dependencies and Transaction C parses data - all
happening simultaneously in different stages.
The out-of-order commit option you mentioned makes sense for apps handling
integrity themselves.
*Question:* What's the fallback behavior when dependency detection fails?
Thanks,
Abhishek Mehta
On Sat, Sep 6, 2025 at 10:33 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
This type of strategy might be suitable in scenarios where users
cannot supply more workers due to resource limitations. However, on
high-end machines, it is more efficient to let the leader act solely
as a message transmitter and allow the apply workers to handle all
apply tasks.
I see your point, but consider a scenario where we have two pa workers:
pa-1 is waiting for some backend on a unique_key insertion, and pa-2 is
waiting for pa-1 to complete its transaction as pa-2 has to perform
some change that is dependent on pa-1's transaction. So, the leader can
either simply wait for a third transaction to be distributed or just
apply it itself and process another change. If we follow the former, it
is quite possible that the sender fills the network queue while sending
data and simply times out.
--
With Regards,
Amit Kapila.
On Sat, Sep 13, 2025 at 9:49 PM Abhi Mehta <abhi15.mehta@gmail.com> wrote:
Hi Amit,
Really interesting proposal! I've been thinking through some of the implementation challenges:
On the memory side: That hash table tracking RelationId and ReplicaIdentity could get pretty hefty under load. Maybe bloom filters could help with the initial screening? Also wondering
about size caps with some kind of LRU cleanup when things get tight.
Yeah, this is an interesting thought, and we should test whether we really
hit this case and whether we could improve it with your suggestion.
Worker bottleneck: This is the tricky part - hundreds of active transactions but only a handful of workers. Seems like we'll hit serialization anyway when workers are maxed out. What
about spawning workers dynamically (within limits) or having some smart queuing for when we're worker-starved?
Yeah, we would have a GUC or subscription option for max parallel workers.
We can consider smart queuing or other advanced techniques for such
cases after the first version is committed, as making that work is in
itself a big undertaking.
Alternative approach(if it can be consider): Rather than full parallelization, break transaction processing into overlapping stages:
• Stage 1: Parse WAL records
Hmm, this is already performed by the publisher.
• Stage 2: Analyze dependencies
• Stage 3: Execute changes
• Stage 4: Commit and track progress
This creates a pipeline where Transaction A executes changes while Transaction B analyzes dependencies
I don't know how to make this work in the current framework of apply.
But feel free to propose this with some more details as to how it will
work.
--
With Regards,
Amit Kapila.
On Mon, Sep 8, 2025 at 3:10 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
In case the leader encounters an error while applying a transaction,
it will have to be restarted. Would that restart all the parallel
apply workers? That will be another (minor) risk when letting the
leader apply transactions. The probability of hitting an error while
applying a transaction is more than when just transmitting messages.
I think we have to let the leader restart in this case anyway
(irrespective of whether it applies changes by itself or not) because
otherwise, we may not get the failed transaction again. Also, if one of
the pa workers exits without completing its transaction, it is
important to let the other pa workers exit too, otherwise dependency
calculation can go wrong. There could be some cases where we could let
some pa worker complete its current ongoing transaction if it is
independent of other transactions and has received all its changes.
--
With Regards,
Amit Kapila.
On 11/08/2025 7:45 AM, Amit Kapila wrote:
Hi,
4. Triggers and Constraints
For the initial version, exclude tables with user-defined triggers or
constraints from parallel apply due to complexity in dependency
detection. We may need some parallel-apply-safe marking to allow this.
I think that the problem is wider than just triggers and constraints.
Even if the database has no triggers and constraints, there still can be
causality violations.
If transactions at the subscriber are executed in a different order than on
the publisher, then it is possible to observe some "invalid" database state
which is never possible at the publisher. Assume a very simple example: you
withdraw some money from one account at an ATM and then deposit it to
some other account. These are two different transactions, and there are
no dependencies between them (they update different records). But if the
second transaction is committed before the first, then we can view an
incorrect report where the total amount of money across all accounts
exceeds the real balance.
Another case is when you are persisting some stream of events (with
timestamps). It may be confusing if the monotony of events is violated at
the subscriber.
And there can be many other similar situations where there are no
"direct" data dependencies between transactions, but there are hidden
"indirect" dependencies. You have mentioned the most popular case:
foreign keys. Certainly, support for referential integrity constraints can
be added. But there can be such dependencies without corresponding
constraints in the database schema.
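A sketch of the ATM example above, with an assumed schema:
```
CREATE TABLE accounts(id int PRIMARY KEY, balance numeric);
-- Publisher, Txn A commits first:
--   UPDATE accounts SET balance = balance - 100 WHERE id = 1;
-- Publisher, Txn B commits after A:
--   UPDATE accounts SET balance = balance + 100 WHERE id = 2;
-- The two transactions touch disjoint rows, so row-level dependency
-- tracking treats them as independent. If the subscriber commits B
-- before A, a concurrent reader there can see the deposit without the
-- withdrawal, i.e. SELECT sum(balance) FROM accounts; can return a
-- total never visible on the publisher.
```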
You have also suggested adding an option which will force preserving commit
order. But my experiments with
`debug_logical_replication_streaming=immediate` show that in this case,
for short transactions, performance with parallel workers is even worse
than with a single apply worker.
Maybe it is possible to enforce some weaker commit order: do not try to
commit transactions in exactly the same order as at the publisher, but if
transaction T1 at the publisher is started after T2 is committed, then T1
can not be committed before T2 at the subscriber. Unfortunately, it is not
clear how to enforce such a "partial order" - `LogicalRepBeginData`
contains `finish_lsn`, but not `start_lsn`.
When I first read your proposal, and especially after seeing concrete
results of its implementation, I decided that the parallel apply approach
is definitely better than the prefetch approach. But now I am not so sure.
Yes, parallel apply is about 2x faster than parallel prefetch. But
parallel prefetch still allows a 2-3x increase in LR speed without
causing any problems with deadlocks, constraints, triggers, ...
On Wednesday, September 17, 2025 2:40 AM Konstantin Knizhnik <knizhnik@garret.ru> wrote:
And there can be many other similar situations where there are no "direct"
data dependencies between transactions, but there are hidden "indirect"
dependencies. The most popular case you have mentioned: foreign keys.
Certainly, support for referential integrity constraints can be added. But
there can be such dependencies without corresponding constraints in the
database schema.
Yes, I agree with these situations, which is why we suggest allowing
out-of-order commits optionally while preserving commit order by default.
However, I think not all use cases are affected by non-direct dependencies,
because we ensure eventual consistency with out-of-order commits anyway.
Additionally, databases like Oracle and MySQL support out-of-order parallel
apply, IIRC.
You have also suggested adding an option which will force preserving commit
order. But my experiments with `debug_logical_replication_streaming=immediate`
show that in this case, for short transactions, performance with parallel
workers is even worse than with a single apply worker.
I think debug_logical_replication_streaming=immediate differs from real parallel
apply. It wasn't designed to simulate genuine parallel application because it
restricts parallelism by requiring the leader to wait for each transaction to
complete on commit. To achieve in-order parallel apply, each parallel apply
worker should wait for the preceding transaction to finish, similar to the
dependency wait in the current POC patch. We plan to extend the patch to support
in-order parallel apply and will test its performance.
Best Regards,
Hou zj
On 17/09/2025 8:18 AM, Zhijie Hou (Fujitsu) wrote:
I think debug_logical_replication_streaming=immediate differs from real parallel
apply. It wasn't designed to simulate genuine parallel application because it
restricts parallelism by requiring the leader to wait for each transaction to
complete on commit. To achieve in-order parallel apply, each parallel apply
worker should wait for the preceding transaction to finish, similar to the
dependency wait in the current POC patch. We plan to extend the patch to support
in-order parallel apply and will test its performance.
It will be interesting to see such results.
Actually, I have tried to improve parallelism in the
`debug_logical_replication_streaming=immediate` mode but faced a deadlock
issue: assume that T1 and T2 are updating the same tuples and T1 is
committed before T2 at the publisher. If we let them execute in parallel,
then T2 can update the tuple first and T1 will wait for the end of T2.
But if we want to preserve commit order, we should not allow T2 to commit
before T1. And so we will get a deadlock.
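
As an illustrative timeline of that scenario:

pa-1 (applies T1)               pa-2 (applies T2)
-----------------               -----------------
                                UPDATE row X    -- takes the row lock
UPDATE row X    -- blocks,
                -- waits for
                -- T2's lock
                                waits for T1 to commit first
                                (to preserve commit order)

T1 waits for T2's row lock while T2 waits for T1's commit, so neither can
proceed.
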
Certainly, if we take into account dependencies between transactions (as in
your proposal), then we can avoid such situations. But I am not sure that
such a deadlock cannot happen even when there are no direct conflicts
between transactions. Let's assume that T1 and T2 are inserting some new
records into one table. Can an index update in T2 cause it to obtain some
locks which block T1? Then T2 is not able to complete its transaction and
release these locks because we want to commit T1 first.
On 17/09/2025 8:18 AM, Zhijie Hou (Fujitsu) wrote:
On Wednesday, September 17, 2025 2:40 AM Konstantin Knizhnik <knizhnik@garret.ru> wrote:
I think debug_logical_replication_streaming=immediate differs from real parallel
apply. It wasn't designed to simulate genuine parallel application because it
restricts parallelism by requiring the leader to wait for each transaction to
complete on commit. To achieve in-order parallel apply, each parallel apply
worker should wait for the preceding transaction to finish, similar to the
dependency wait in the current POC patch. We plan to extend the patch to support
in-order parallel apply and will test its performance.
You were right.
I tried to preserve commit order with your patch (using my random update
test) and was surprised that the performance penalty is quite small.
I ran pgbench performing random updates using 10 clients for 100 seconds
and then checked how long it takes the subscriber to catch up (seconds):
master: 488
parallel-apply no order: 74
parallel-apply preserve order: 88
So it looks like serialization of commits does not add much overhead, which
makes it possible to use it by default, avoiding all effects which may be
caused by changing commit order at the subscriber.
The patch is attached (it is based on your patch) and adds a
preserve_commit_order GUC.
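
With the patch applied, the behavior can then be toggled at runtime, e.g.
(the GUC is declared PGC_SIGHUP, so a configuration reload is sufficient):

ALTER SYSTEM SET preserve_commit_order = off;
SELECT pg_reload_conf();
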
Attachments:
preserve_commit_order.patch (text/plain)
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index 31a92d1a24a..13e5fc218d8 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -254,6 +298,9 @@ static ParallelApplyWorkerInfo *stream_apply_worker = NULL;
/* A list to maintain subtransactions, if any. */
static List *subxactlist = NIL;
+/* GUC */
+bool preserve_commit_order = true;
+
static void pa_free_worker_info(ParallelApplyWorkerInfo *winfo);
static ParallelTransState pa_get_xact_state(ParallelApplyWorkerShared *wshared);
static PartialFileSetState pa_get_fileset_state(void);
@@ -2117,2 +2120,76 @@ write_internal_relation(StringInfo s, LogicalRepRelation *rel)
}
}
+
+#include "postmaster/bgworker_internals.h"
+
+typedef struct
+{
+ ConditionVariable cv;
+ slock_t mutex;
+ size_t head;
+ size_t tail;
+ TransactionId ring[MAX_PARALLEL_WORKER_LIMIT];
+} ParallelApplyShmem;
+
+static ParallelApplyShmem* pa_shmem;
+
+void
+pa_commit(TransactionId xid)
+{
+ SpinLockAcquire(&pa_shmem->mutex);
+ pa_shmem->ring[pa_shmem->head++ % MAX_PARALLEL_WORKER_LIMIT] = xid;
+ SpinLockRelease(&pa_shmem->mutex);
+ ConditionVariableBroadcast(&pa_shmem->cv);
+}
+
+
+void
+pa_before_apply_commit(void)
+{
+ TransactionId xid = MyParallelShared->xid;
+
+ if (!preserve_commit_order)
+ return;
+
+ while (true)
+ {
+ SpinLockAcquire(&pa_shmem->mutex);
+ if (pa_shmem->head > pa_shmem->tail && pa_shmem->ring[pa_shmem->tail % MAX_PARALLEL_WORKER_LIMIT] == xid)
+ {
+ SpinLockRelease(&pa_shmem->mutex);
+ break;
+ }
+ SpinLockRelease(&pa_shmem->mutex);
+ ConditionVariableSleep(&pa_shmem->cv, WAIT_EVENT_LOGICAL_PARALLEL_APPLY_MAIN);
+ }
+ ConditionVariableCancelSleep();
+}
+
+void
+pa_after_apply_commit(void)
+{
+ SpinLockAcquire(&pa_shmem->mutex);
+ pa_shmem->tail += 1;
+ SpinLockRelease(&pa_shmem->mutex);
+ ConditionVariableBroadcast(&pa_shmem->cv);
+}
+
+Size
+ParallelApplyShmemSize(void)
+{
+ return sizeof(ParallelApplyShmem);
+}
+
+void
+ParallelApplyShmemInit(void)
+{
+ bool found;
+
+ pa_shmem = (ParallelApplyShmem*)ShmemInitStruct("Parallel worker shmem", sizeof(ParallelApplyShmem), &found);
+ if (!found)
+ {
+ pa_shmem->head = pa_shmem->tail = 0;
+ ConditionVariableInit(&pa_shmem->cv);
+ SpinLockInit(&pa_shmem->mutex);
+ }
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 22ad9051db3..bf8bfcbdd3b 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1911,40 +1911,43 @@ apply_handle_commit(StringInfo s)
if (pa_send_data(winfo, s->len, s->data))
{
+ pa_commit(winfo->shared->xid);
/* Finish processing the transaction. */
pa_xact_finish(winfo, commit_data.end_lsn);
break;
}
/*
* Switch to serialize mode when we are not able to send the
* change to parallel apply worker.
*/
pa_switch_to_partial_serialize(winfo, true);
/* fall through */
case TRANS_LEADER_PARTIAL_SERIALIZE:
Assert(winfo);
stream_open_and_write_change(remote_xid, LOGICAL_REP_MSG_COMMIT,
&original_msg);
pa_set_fileset_state(winfo->shared, FS_SERIALIZE_DONE);
/* Finish processing the transaction. */
pa_xact_finish(winfo, commit_data.end_lsn);
break;
case TRANS_PARALLEL_APPLY:
/*
* If the parallel apply worker is applying spooled messages then
* close the file before committing.
*/
if (stream_fd)
stream_close_file();
+ pa_before_apply_commit();
apply_handle_commit_internal(&commit_data);
+ pa_after_apply_commit();
MyParallelShared->last_commit_end = XactLastCommitEnd;
pa_commit_transaction();
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2fa045e6b0f..71b8abc4337 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -150,6 +150,7 @@ CalculateShmemSize(int *num_semaphores)
size = add_size(size, InjectionPointShmemSize());
size = add_size(size, SlotSyncShmemSize());
size = add_size(size, AioShmemSize());
+ size = add_size(size, ParallelApplyShmemSize());
/* include additional requested shmem from preload libraries */
size = add_size(size, total_addin_request);
@@ -332,6 +333,7 @@ CreateOrAttachShmemStructs(void)
PgArchShmemInit();
ApplyLauncherShmemInit();
SlotSyncShmemInit();
+ ParallelApplyShmemInit();
/*
* Set up other modules that need some shared memory space
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index f137129209f..5b60a4c6655 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -73,6 +73,7 @@
#include "postmaster/walsummarizer.h"
#include "postmaster/walwriter.h"
#include "replication/logicallauncher.h"
+#include "replication/logicalworker.h"
#include "replication/slot.h"
#include "replication/slotsync.h"
#include "replication/syncrep.h"
@@ -976,6 +977,17 @@ struct config_bool ConfigureNamesBool[] =
NULL, NULL, NULL
},
+ {
+ {"preserve_commit_order", PGC_SIGHUP, REPLICATION_SUBSCRIBERS,
+ gettext_noop("Commit LR transactions at subscriber in the same order as at publisher."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &preserve_commit_order,
+ true,
+ NULL, NULL, NULL
+ },
+
{
{"enable_parallel_hash", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of parallel hash plans."),
diff --git a/src/include/replication/logicallauncher.h b/src/include/replication/logicallauncher.h
index b29453e8e4f..2efeff720f2 100644
--- a/src/include/replication/logicallauncher.h
+++ b/src/include/replication/logicallauncher.h
@@ -34,4 +34,7 @@ extern bool IsLogicalLauncher(void);
extern pid_t GetLeaderApplyWorkerPid(pid_t pid);
+extern Size ParallelApplyShmemSize(void);
+extern void ParallelApplyShmemInit(void);
+
#endif /* LOGICALLAUNCHER_H */
diff --git a/src/include/replication/logicalworker.h b/src/include/replication/logicalworker.h
index 88912606e4d..cb030eea402 100644
--- a/src/include/replication/logicalworker.h
+++ b/src/include/replication/logicalworker.h
@@ -14,6 +14,8 @@
#include <signal.h>
+extern PGDLLIMPORT bool preserve_commit_order;
+
extern PGDLLIMPORT volatile sig_atomic_t ParallelApplyMessagePending;
extern void ApplyWorkerMain(Datum main_arg);
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 7c0204dd6f4..e48a2219131 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -363,4 +385,10 @@ am_parallel_apply_worker(void)
return isParallelApplyWorker(MyLogicalRepWorker);
}
+extern void pa_before_apply_commit(void);
+
+extern void pa_after_apply_commit(void);
+
+extern void pa_commit(TransactionId xid);
+
#endif /* WORKER_INTERNAL_H */
Dear hackers,
TODO - potential improvement to use shared hash table for tracking
dependencies.
I measured the performance data for the shared hash table approach. Based on
the results, the local hash table approach seems better.
Abstract
========
No performance improvement was observed with the shared hash; it showed a 1-2% regression.
The trend did not change with the number of parallel apply workers.
Machine details
===============
Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz, 88 cores, 503 GiB RAM
Used patch
==========
0001 is the same as Hou posted on -hackers [1], and 0002 is the patch for the
shared hash.
0002 introduces a shared hash table, dependency_dshash. Since the length of a
shared hash key must be a fixed value, the key is computed from the replica
identity of tuples. When the parallel apply worker receives changes, it
computes the hash key again and remembers it in a list. At commit time it
iterates the list and removes hash entries based on the keys. The local-hash
cleanup mechanism present in 0001 was removed.
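
For reference, deriving a fixed-size key from the variable-length replica
identity values could look roughly like the sketch below (assumed names,
not the actual 0002 code; hash_bytes/hash_combine are the existing helpers
from common/hashfn.h):

#include "postgres.h"
#include "common/hashfn.h"

/* Illustrative fixed-size key for a dshash-based dependency table. */
typedef struct DependencyKey
{
    Oid     relid;      /* remote relation id */
    uint32  ri_hash;    /* hash folded over all RI column values */
} DependencyKey;

static DependencyKey
compute_dependency_key(Oid relid, int ncols, char **colvalues, int *collens)
{
    DependencyKey key;

    key.relid = relid;
    key.ri_hash = 0;

    /* Fold each replica identity column value into one 32-bit hash. */
    for (int i = 0; i < ncols; i++)
        key.ri_hash = hash_combine(key.ri_hash,
                                   hash_bytes((const unsigned char *) colvalues[i],
                                              collens[i]));
    return key;
}

A hash collision between two different RI values would presumably only
cause a spurious dependency wait, not a correctness problem.
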
Workload
========
Setup:
---------
Pub --> Sub
- Two nodes created in pub-sub synchronous logical replication setup.
- Both nodes have same set of pgbench tables created with scale=100.
- The Sub node is subscribed to all the changes from the Pub's pgbench tables
Workload Run:
--------------------
- Run built-in pgbench (simple-update) [2] only on Pub with #clients=40 and run duration=5 minutes
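
(That corresponds to an invocation along the lines of
"pgbench -b simple-update -c 40 -T 300 <pub_db>"; the database name and any
additional options are assumptions, not stated in the run description.)
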
Results:
--------------------
The number of workers was set to 4, 8, or 16. In all cases 0001 had better performance.
#worker = 4:
------------
0001 0001+0002 diff
TPS 14499.33387 14097.74469 3%
14361.7166 14359.87781 0%
14467.91344 14153.53934 2%
14451.8596 14381.70987 0%
14646.90346 14239.4712 3%
14530.66788 14298.33845 2%
14733.35987 14189.41794 4%
14543.9252 14373.21266 1%
14945.57568 14249.46787 5%
14638.6342 14125.87626 4%
AVE 14581.988979 14246.865608 2%
MEDIAN 14537.296540 14244.469536 2%
#worker=8
---------
0001 0001+0002 diff
TPS 21531.08712 21443.68765 0%
22337.60439 21383.94778 4%
21806.70504 21097.42874 3%
22192.99695 21424.78921 4%
21721.95472 21470.8714 1%
21450.6779 21265.89539 1%
21397.51433 21606.51486 -1%
21551.09391 21306.97061 1%
21455.89699 21351.38868 0%
21849.52528 21304.42329 3%
AVE 21729.505662 21365.591761 2%
MEDIAN 21636.524316 21367.668229 1%
#worker=16
-----------
0001 0001+0002 diff
TPS 28034.64652 28129.85068 0%
27839.10942 27364.40725 2%
27693.94576 27871.80199 -1%
27717.83971 27129.96132 2%
28453.25381 27439.77526 4%
28083.73208 27201.0004 3%
27842.19262 27226.43813 2%
27729.44205 27459.01256 1%
28103.76727 27385.80016 3%
27688.52482 27485.67209 1%
AVE 27918.645405 27469.371982 2%
MEDIAN 27840.651020 27412.787708 2%
[1]: /messages/by-id/OS0PR01MB5716D43CB68DB8FFE73BF65D942AA@OS0PR01MB5716.jpnprd01.prod.outlook.com
[2]: https://www.postgresql.org/docs/current/pgbench.html#PGBENCH-OPTION-BUILTIN
Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachments:
v20251031-0001-Parallel-apply-non-streaming-transactions.patch (application/octet-stream)
From 5e9f8899ea587d9c1b0f74556c568e73b290a036 Mon Sep 17 00:00:00 2001
From: Zhijie Hou <houzj.fnst@fujitsu.com>
Date: Fri, 8 Aug 2025 11:35:59 +0800
Subject: [PATCH v20251031 1/2] Parallel apply non-streaming transactions
--
Basic design
--
The leader worker assigns each non-streaming transaction to a parallel apply
worker. Before dispatching changes to a parallel worker, the leader verifies if
the current modification affects the same row (identified by replica identity
key) as another ongoing transaction. If so, the leader sends a list of dependent
transaction IDs to the parallel worker, indicating that the parallel apply
worker must wait for these transactions to commit before proceeding. Parallel
apply workers do not maintain commit order; transactions can be committed at any
time provided there are no dependencies.
Each parallel apply worker records the local end LSN of the transaction it
applies in shared memory. Subsequently, the leader gathers these local end LSNs
and logs them in the local 'lsn_mapping' for verifying whether they have been
flushed to disk (following the logic in get_flush_position()).
If no parallel apply worker is available, the leader will apply the transaction
independently.
For further details, please refer to the following:
--
dependency tracking
--
The leader maintains a local hash table, using the remote change's replica
identity column values and relid as keys, with remote transaction IDs as values.
Before sending changes to the parallel apply worker, the leader computes a hash
using RI key values and the relid of the current change to search the hash
table. If an existing entry is found, the leader tells the parallel worker
to wait for the remote xid in the hash entry, after which the leader updates the
hash entry with the current xid.
If the remote relation lacks a replica identity (RI), it indicates that only
INSERT can be replicated for this table. In such cases, the leader skips
dependency checks, allowing the parallel apply worker to proceed with applying
changes without delay. This is because the only potential conflicts relate to
the local unique key or foreign key, handling for which is yet to be
implemented (see TODO - dependency on local unique key, foreign key.).
In cases of TRUNCATE or remote schema changes affecting the entire table, the
leader retrieves all remote xids touching the same table (via sequential scans
of the hash table) and tells the parallel worker to wait for those transactions
to commit.
Hash entries are cleaned up once the transaction corresponding to the remote xid
in the entry has been committed. Clean-up typically occurs when collecting the
flush position of each transaction, but is forced if the hash table exceeds a
set threshold.
--
dependency waiting
--
If a transaction is relied upon by others, the leader adds its xid to a shared
hash table. The shared hash table entry is cleared by the parallel apply worker
upon completing the transaction. Workers needing to wait for a transaction check
the shared hash table entry; if present, they lock the transaction ID (using
pa_lock_transaction). If absent, it indicates the transaction has been
committed, negating the need to wait.
--
TODO - error handling
--
If preceding transactions fail, and independent later transactions are already
applied, a mechanism is needed to skip already applied transactions upon
restart. One solution is to PREPARE transactions whose preceding ones remain
uncommitted, then COMMIT PREPARE once all preceding transactions finish. This
allows the worker to skip applied transactions by scanning prepared ones.
--
TODO - dependency on local unique key, foreign key.
--
A transaction could conflict with another if modifying the same unique key.
While current patches don't address conflicts involving unique or foreign keys,
tracking these dependencies might be needed.
--
TODO - user defined trigger and constraints.
--
It would be a challenge to check the dependency if the table has user-defined
triggers or constraints. The most viable solution might be to disallow parallel
apply for relations whose triggers and constraints are not marked as
parallel-safe or immutable.
--
TODO - potential improvement to use shared hash table for tracking dependencies.
--
Instead of a local hash table, a shared hash table could track replica identity
key dependencies, allowing parallel apply workers to clean up entries. However,
this might increase contention, so we need to research whether it's worth it.
---
.../replication/logical/applyparallelworker.c | 554 ++++++++++-
src/backend/replication/logical/proto.c | 42 +
src/backend/replication/logical/relation.c | 55 ++
src/backend/replication/logical/worker.c | 869 +++++++++++++++++-
.../utils/activity/wait_event_names.txt | 1 +
src/include/replication/logicalproto.h | 4 +
src/include/replication/logicalrelation.h | 5 +
src/include/replication/worker_internal.h | 26 +-
src/include/storage/lwlocklist.h | 1 +
src/test/subscription/t/010_truncate.pl | 2 +-
src/test/subscription/t/015_stream.pl | 8 +-
src/test/subscription/t/026_stats.pl | 1 +
src/test/subscription/t/027_nosuperuser.pl | 1 +
src/tools/pgindent/typedefs.list | 4 +
14 files changed, 1497 insertions(+), 76 deletions(-)
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index 14325581afc..dccd221ad01 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -14,6 +14,9 @@
* ParallelApplyWorkerInfo which is required so the leader worker and parallel
* apply workers can communicate with each other.
*
+ * Streaming transactions
+ * ======================
+ *
* The parallel apply workers are assigned (if available) as soon as xact's
* first stream is received for subscriptions that have set their 'streaming'
* option as parallel. The leader apply worker will send changes to this new
@@ -146,6 +149,23 @@
* which will detect deadlock if any. See pa_send_data() and
* enum TransApplyAction.
*
+ *
+ * Non-streaming transactions
+ * ======================
+ * The handling is similar to streaming transactions, but including few
+ * differences:
+ *
+ * Transaction dependency
+ * -------------------------------
+ * Before dispatching changes to a parallel worker, the leader verifies if the
+ * current modification affects the same row (identified by replica identity
+ * key) as another ongoing transaction (see handle_dependency_on_change for
+ * details). If so, the leader sends a list of dependent transaction IDs to the
+ * parallel worker, indicating that the parallel apply worker must wait for
+ * these transactions to commit before proceeding. Parallel apply workers do not
+ * maintain commit order; transactions can be committed at any time provided
+ * there are no dependencies.
+ *
* Lock types
* ----------
* Both the stream lock and the transaction lock mentioned above are
@@ -216,14 +236,38 @@ typedef struct ParallelApplyWorkerEntry
{
TransactionId xid; /* Hash key -- must be first */
ParallelApplyWorkerInfo *winfo;
+ XLogRecPtr local_end;
} ParallelApplyWorkerEntry;
+/* an entry in the parallelized_txns shared hash table */
+typedef struct ParallelizedTxnEntry
+{
+ TransactionId xid; /* Hash key */
+} ParallelizedTxnEntry;
+
/*
* A hash table used to cache the state of streaming transactions being applied
* by the parallel apply workers.
*/
static HTAB *ParallelApplyTxnHash = NULL;
+/*
+ * A hash table used to track the parallelized transactions that could be
+ * depended on by other transactions.
+ */
+static dsa_area *parallel_apply_dsa_area = NULL;
+static dshash_table *parallelized_txns = NULL;
+
+/* parameters for the parallelized_txns shared hash table */
+static const dshash_parameters dsh_params = {
+ sizeof(TransactionId),
+ sizeof(ParallelizedTxnEntry),
+ dshash_memcmp,
+ dshash_memhash,
+ dshash_memcpy,
+ LWTRANCHE_PARALLEL_APPLY_DSA
+};
+
/*
* A list (pool) of active parallel apply workers. The information for
* the new worker is added to the list after successfully launching it. The
@@ -257,6 +301,9 @@ static List *subxactlist = NIL;
static void pa_free_worker_info(ParallelApplyWorkerInfo *winfo);
static ParallelTransState pa_get_xact_state(ParallelApplyWorkerShared *wshared);
static PartialFileSetState pa_get_fileset_state(void);
+static void pa_attach_parallelized_txn_hash(dsa_handle *pa_dsa_handle,
+ dshash_table_handle *pa_dshash_handle);
+static void write_internal_relation(StringInfo s, LogicalRepRelation *rel);
/*
* Returns true if it is OK to start a parallel apply worker, false otherwise.
@@ -334,6 +381,15 @@ pa_setup_dsm(ParallelApplyWorkerInfo *winfo)
shm_mq *mq;
Size queue_size = DSM_QUEUE_SIZE;
Size error_queue_size = DSM_ERROR_QUEUE_SIZE;
+ dsa_handle parallel_apply_dsa_handle;
+ dshash_table_handle parallelized_txns_handle;
+
+ pa_attach_parallelized_txn_hash(¶llel_apply_dsa_handle,
+ ¶llelized_txns_handle);
+
+ if (parallel_apply_dsa_handle == DSA_HANDLE_INVALID ||
+ parallelized_txns_handle == DSHASH_HANDLE_INVALID)
+ return false;
/*
* Estimate how much shared memory we need.
@@ -364,11 +420,14 @@ pa_setup_dsm(ParallelApplyWorkerInfo *winfo)
/* Set up the header region. */
shared = shm_toc_allocate(toc, sizeof(ParallelApplyWorkerShared));
SpinLockInit(&shared->mutex);
-
+ shared->xid = InvalidTransactionId;
shared->xact_state = PARALLEL_TRANS_UNKNOWN;
pg_atomic_init_u32(&(shared->pending_stream_count), 0);
shared->last_commit_end = InvalidXLogRecPtr;
shared->fileset_state = FS_EMPTY;
+ shared->parallel_apply_dsa_handle = parallel_apply_dsa_handle;
+ shared->parallelized_txns_handle = parallelized_txns_handle;
+ shared->has_dependent_txn = false;
shm_toc_insert(toc, PARALLEL_APPLY_KEY_SHARED, shared);
@@ -406,6 +465,8 @@ pa_launch_parallel_worker(void)
MemoryContext oldcontext;
bool launched;
ParallelApplyWorkerInfo *winfo;
+ dsa_handle pa_dsa_handle;
+ dshash_table_handle pa_dshash_handle;
ListCell *lc;
/* Try to get an available parallel apply worker from the worker pool. */
@@ -413,10 +474,33 @@ pa_launch_parallel_worker(void)
{
winfo = (ParallelApplyWorkerInfo *) lfirst(lc);
- if (!winfo->in_use)
+ if (!winfo->stream_txn &&
+ pa_get_xact_state(winfo->shared) == PARALLEL_TRANS_FINISHED)
+ {
+ /*
+ * Save the local commit LSN of the last transaction applied by this
+ * worker before reusing it for another transaction. This WAL
+ * position is crucial for determining the flush position in
+ * responses to the publisher (see get_flush_position()).
+ */
+ (void) pa_get_last_commit_end(winfo->shared->xid, false, NULL);
+ return winfo;
+ }
+
+ if (winfo->stream_txn && !winfo->in_use)
return winfo;
}
+ pa_attach_parallelized_txn_hash(&pa_dsa_handle, &pa_dshash_handle);
+
+ /*
+ * Return if the number of parallel apply workers has reached the maximum
+ * limit.
+ */
+ if (list_length(ParallelApplyWorkerPool) ==
+ max_parallel_apply_workers_per_subscription)
+ return NULL;
+
/*
* Start a new parallel apply worker.
*
@@ -444,18 +528,31 @@ pa_launch_parallel_worker(void)
dsm_segment_handle(winfo->dsm_seg),
false);
- if (launched)
- {
- ParallelApplyWorkerPool = lappend(ParallelApplyWorkerPool, winfo);
- }
- else
+ if (!launched)
{
+ MemoryContextSwitchTo(oldcontext);
pa_free_worker_info(winfo);
- winfo = NULL;
+ return NULL;
}
+ ParallelApplyWorkerPool = lappend(ParallelApplyWorkerPool, winfo);
+
MemoryContextSwitchTo(oldcontext);
+ /*
+ * Send all existing remote relation information to the parallel apply
+ * worker. This allows the parallel worker to initialize the
+ * LogicalRepRelMapEntry locally before applying remote changes.
+ */
+ if (logicalrep_get_num_rels())
+ {
+ StringInfoData out;
+ initStringInfo(&out);
+
+ write_internal_relation(&out, NULL);
+ pa_send_data(winfo, out.len, out.data);
+ }
+
return winfo;
}
@@ -468,7 +565,7 @@ pa_launch_parallel_worker(void)
* streaming changes.
*/
void
-pa_allocate_worker(TransactionId xid)
+pa_allocate_worker(TransactionId xid, bool stream_txn)
{
bool found;
ParallelApplyWorkerInfo *winfo = NULL;
@@ -505,11 +602,14 @@ pa_allocate_worker(TransactionId xid)
SpinLockAcquire(&winfo->shared->mutex);
winfo->shared->xact_state = PARALLEL_TRANS_UNKNOWN;
winfo->shared->xid = xid;
+ winfo->shared->has_dependent_txn = false;
SpinLockRelease(&winfo->shared->mutex);
winfo->in_use = true;
winfo->serialize_changes = false;
+ winfo->stream_txn = stream_txn;
entry->winfo = winfo;
+ entry->local_end = InvalidXLogRecPtr;
}
/*
@@ -558,7 +658,8 @@ pa_free_worker(ParallelApplyWorkerInfo *winfo)
{
Assert(!am_parallel_apply_worker());
Assert(winfo->in_use);
- Assert(pa_get_xact_state(winfo->shared) == PARALLEL_TRANS_FINISHED);
+ Assert(!winfo->stream_txn ||
+ pa_get_xact_state(winfo->shared) == PARALLEL_TRANS_FINISHED);
if (!hash_search(ParallelApplyTxnHash, &winfo->shared->xid, HASH_REMOVE, NULL))
elog(ERROR, "hash table corrupted");
@@ -574,9 +675,7 @@ pa_free_worker(ParallelApplyWorkerInfo *winfo)
* been serialized and then letting the parallel apply worker deal with
* the spurious message, we stop the worker.
*/
- if (winfo->serialize_changes ||
- list_length(ParallelApplyWorkerPool) >
- (max_parallel_apply_workers_per_subscription / 2))
+ if (winfo->serialize_changes)
{
logicalrep_pa_worker_stop(winfo);
pa_free_worker_info(winfo);
@@ -706,6 +805,105 @@ pa_process_spooled_messages_if_required(void)
return true;
}
+/*
+ * Get the local end LSN for a transaction applied by a parallel apply worker.
+ *
+ * Set delete_entry to true if you intend to remove the transaction from the
+ * ParallelApplyTxnHash after collecting its LSN.
+ *
+ * If the parallel apply worker did not write any changes during the transaction
+ * application due to situations like update/delete_missing or a before trigger,
+ * the *skipped_write will be set to true.
+ */
+XLogRecPtr
+pa_get_last_commit_end(TransactionId xid, bool delete_entry, bool *skipped_write)
+{
+ bool found;
+ ParallelApplyWorkerEntry *entry;
+ ParallelApplyWorkerInfo *winfo;
+
+ Assert(TransactionIdIsValid(xid));
+
+ if (skipped_write)
+ *skipped_write = false;
+
+ /* Find an entry for the requested transaction. */
+ entry = hash_search(ParallelApplyTxnHash, &xid, HASH_FIND, &found);
+
+ if (!found)
+ return InvalidXLogRecPtr;
+
+ /*
+ * If worker info is NULL, it indicates that the worker has been reused for
+ * handling other transactions. Consequently, the local end LSN has already
+ * been collected and saved in entry->local_end.
+ */
+ winfo = entry->winfo;
+ if (winfo == NULL)
+ {
+ if (delete_entry &&
+ !hash_search(ParallelApplyTxnHash, &xid, HASH_REMOVE, NULL))
+ elog(ERROR, "hash table corrupted");
+
+ if (skipped_write)
+ *skipped_write = XLogRecPtrIsInvalid(entry->local_end);
+
+ return entry->local_end;
+ }
+
+ /* Return InvalidXLogRecPtr if the transaction is still in progress */
+ if (pa_get_xact_state(winfo->shared) != PARALLEL_TRANS_FINISHED)
+ return InvalidXLogRecPtr;
+
+ /* Collect the local end LSN from the worker's shared memory area */
+ entry->local_end = winfo->shared->last_commit_end;
+ entry->winfo = NULL;
+
+ if (skipped_write)
+ *skipped_write = XLogRecPtrIsInvalid(entry->local_end);
+
+ elog(DEBUG1, "store local commit %X/%X end to txn entry: %u",
+ LSN_FORMAT_ARGS(entry->local_end), xid);
+
+ if (delete_entry &&
+ !hash_search(ParallelApplyTxnHash, &xid, HASH_REMOVE, NULL))
+ elog(ERROR, "hash table corrupted");
+
+ return entry->local_end;
+}
+
+/*
+ * Wait for the remote transaction associated with the specified remote xid to
+ * complete.
+ */
+static void
+pa_wait_for_transaction(TransactionId wait_for_xid)
+{
+ if (!am_leader_apply_worker())
+ return;
+
+ if (!TransactionIdIsValid(wait_for_xid))
+ return;
+
+ elog(DEBUG1, "plan to wait for remote_xid %u to finish",
+ wait_for_xid);
+
+ for (;;)
+ {
+ if (pa_transaction_committed(wait_for_xid))
+ break;
+
+ pa_lock_transaction(wait_for_xid, AccessShareLock);
+ pa_unlock_transaction(wait_for_xid, AccessShareLock);
+
+ /* An interrupt may have occurred while we were waiting. */
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ elog(DEBUG1, "finished wait for remote_xid %u to finish",
+ wait_for_xid);
+}
+
/*
* Interrupt handler for main loop of parallel apply worker.
*/
@@ -781,21 +979,35 @@ LogicalParallelApplyLoop(shm_mq_handle *mqh)
* parallel apply workers can only be PqReplMsg_WALData.
*/
c = pq_getmsgbyte(&s);
- if (c != PqReplMsg_WALData)
- elog(ERROR, "unexpected message \"%c\"", c);
- /*
- * Ignore statistics fields that have been updated by the leader
- * apply worker.
- *
- * XXX We can avoid sending the statistics fields from the leader
- * apply worker but for that, it needs to rebuild the entire
- * message by removing these fields which could be more work than
- * simply ignoring these fields in the parallel apply worker.
- */
- s.cursor += SIZE_STATS_MESSAGE;
+ if (c == PqReplMsg_WALData)
+ {
+ /*
+ * Ignore statistics fields that have been updated by the
+ * leader apply worker.
+ *
+ * XXX We can avoid sending the statistics fields from the
+ * leader apply worker but for that, it needs to rebuild the
+ * entire message by removing these fields which could be more
+ * work than simply ignoring these fields in the parallel
+ * apply worker.
+ */
+ s.cursor += SIZE_STATS_MESSAGE;
- apply_dispatch(&s);
+ apply_dispatch(&s);
+ }
+ else if (c == PARALLEL_APPLY_INTERNAL_MESSAGE)
+ {
+ apply_dispatch(&s);
+ }
+ else
+ {
+ /*
+ * The first byte of messages sent from leader apply worker to
+ * parallel apply workers can only be 'w' or 'i'.
+ */
+ elog(ERROR, "unexpected message \"%c\"", c);
+ }
}
else if (shmq_res == SHM_MQ_WOULD_BLOCK)
{
@@ -812,6 +1024,9 @@ LogicalParallelApplyLoop(shm_mq_handle *mqh)
if (rc & WL_LATCH_SET)
ResetLatch(MyLatch);
+
+ if (!IsTransactionState())
+ pgstat_report_stat(true);
}
}
else
@@ -849,6 +1064,9 @@ pa_shutdown(int code, Datum arg)
INVALID_PROC_NUMBER);
dsm_detach((dsm_segment *) DatumGetPointer(arg));
+
+ if (parallel_apply_dsa_area)
+ dsa_detach(parallel_apply_dsa_area);
}
/*
@@ -864,6 +1082,8 @@ ParallelApplyWorkerMain(Datum main_arg)
shm_mq *mq;
shm_mq_handle *mqh;
shm_mq_handle *error_mqh;
+ dsa_handle pa_dsa_handle;
+ dshash_table_handle pa_dshash_handle;
RepOriginId originid;
int worker_slot = DatumGetInt32(main_arg);
char originname[NAMEDATALEN];
@@ -951,6 +1171,8 @@ ParallelApplyWorkerMain(Datum main_arg)
InitializingApplyWorker = false;
+ pa_attach_parallelized_txn_hash(&pa_dsa_handle, &pa_dshash_handle);
+
/* Setup replication origin tracking. */
StartTransactionCommand();
ReplicationOriginNameForLogicalRep(MySubscription->oid, InvalidOid,
@@ -1157,7 +1379,6 @@ pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes, const void *data)
shm_mq_result result;
TimestampTz startTime = 0;
- Assert(!IsTransactionState());
Assert(!winfo->serialize_changes);
/*
@@ -1209,6 +1430,67 @@ pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes, const void *data)
}
}
+/*
+ * Distribute remote relation information to all active parallel apply workers
+ * that require it.
+ */
+void
+pa_distribute_schema_changes_to_workers(LogicalRepRelation *rel)
+{
+ List *workers_stopped = NIL;
+ StringInfoData out;
+
+ if (!ParallelApplyWorkerPool)
+ return;
+
+ initStringInfo(&out);
+
+ write_internal_relation(&out, rel);
+
+ foreach_ptr(ParallelApplyWorkerInfo, winfo, ParallelApplyWorkerPool)
+ {
+ /*
+ * Skip the worker responsible for the current transaction, as the
+ * relation information has already been sent to it.
+ */
+ if (winfo == stream_apply_worker)
+ continue;
+
+ /*
+ * Skip the worker that is in serialize mode, as they will soon stop
+ * once they finish applying the transaction.
+ */
+ if (winfo->serialize_changes)
+ continue;
+
+ elog(DEBUG1, "distributing schema changes to pa workers");
+
+ if (pa_send_data(winfo, out.len, out.data))
+ continue;
+
+ elog(DEBUG1, "failed to distribute, will stop that worker instead");
+
+ /*
+ * Distribution to this worker failed due to a sending timeout. Wait for
+ * the worker to complete its transaction and then stop it. This is
+ * consistent with the handling of workers in serialize mode (see
+ * pa_free_worker() for details).
+ */
+ pa_wait_for_transaction(winfo->shared->xid);
+
+ pa_get_last_commit_end(winfo->shared->xid, false, NULL);
+
+ logicalrep_pa_worker_stop(winfo);
+
+ workers_stopped = lappend(workers_stopped, winfo);
+ }
+
+ pfree(out.data);
+
+ foreach_ptr(ParallelApplyWorkerInfo, winfo, workers_stopped)
+ pa_free_worker_info(winfo);
+}
+
/*
* Switch to PARTIAL_SERIALIZE mode for the current transaction -- this means
* that the current data and any subsequent data for this transaction will be
@@ -1291,8 +1573,8 @@ pa_wait_for_xact_finish(ParallelApplyWorkerInfo *winfo)
/*
* Wait for the transaction lock to be released. This is required to
- * detect deadlock among leader and parallel apply workers. Refer to the
- * comments atop this file.
* detect deadlock among leader and parallel apply workers. Refer
+ * to the comments atop this file.
*/
pa_lock_transaction(winfo->shared->xid, AccessShareLock);
pa_unlock_transaction(winfo->shared->xid, AccessShareLock);
@@ -1306,6 +1588,7 @@ pa_wait_for_xact_finish(ParallelApplyWorkerInfo *winfo)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("lost connection to the logical replication parallel apply worker")));
+
}
/*
@@ -1369,6 +1652,9 @@ pa_savepoint_name(Oid suboid, TransactionId xid, char *spname, Size szsp)
void
pa_start_subtrans(TransactionId current_xid, TransactionId top_xid)
{
+ if (!TransactionIdIsValid(top_xid))
+ return;
+
if (current_xid != top_xid &&
!list_member_xid(subxactlist, current_xid))
{
@@ -1625,23 +1911,215 @@ pa_decr_and_wait_stream_block(void)
void
pa_xact_finish(ParallelApplyWorkerInfo *winfo, XLogRecPtr remote_lsn)
{
+ XLogRecPtr local_lsn = InvalidXLogRecPtr;
+ TransactionId pa_remote_xid = winfo->shared->xid;
+
Assert(am_leader_apply_worker());
/*
- * Unlock the shared object lock so that parallel apply worker can
- * continue to receive and apply changes.
+ * Unlock the shared object lock taken for streaming transactions so that
+ * parallel apply worker can continue to receive and apply changes.
*/
- pa_unlock_stream(winfo->shared->xid, AccessExclusiveLock);
+ if (winfo->stream_txn)
+ pa_unlock_stream(winfo->shared->xid, AccessExclusiveLock);
/*
- * Wait for that worker to finish. This is necessary to maintain commit
- * order which avoids failures due to transaction dependencies and
- * deadlocks.
+ * Wait for the worker applying a streaming transaction to finish. This is
+ * necessary to maintain commit order which avoids failures due to
+ * transaction dependencies and deadlocks.
+ *
+ * For a non-streaming transaction in partial serialize mode, wait for the
+ * worker to stop as well, since the worker cannot be reused anymore (see
+ * pa_free_worker() for details).
*/
- pa_wait_for_xact_finish(winfo);
+ if (winfo->serialize_changes || winfo->stream_txn)
+ {
+ pa_wait_for_xact_finish(winfo);
+
+ local_lsn = winfo->shared->last_commit_end;
+ pa_remote_xid = InvalidTransactionId;
+
+ pa_free_worker(winfo);
+ }
if (!XLogRecPtrIsInvalid(remote_lsn))
- store_flush_position(remote_lsn, winfo->shared->last_commit_end);
+ store_flush_position(remote_lsn, local_lsn, pa_remote_xid);
- pa_free_worker(winfo);
+ pa_set_stream_apply_worker(NULL);
+}
+
+bool
+pa_transaction_committed(TransactionId xid)
+{
+ bool found;
+ ParallelApplyWorkerEntry *entry;
+
+ Assert(TransactionIdIsValid(xid));
+
+ /* Find an entry for the requested transaction */
+ entry = hash_search(ParallelApplyTxnHash, &xid, HASH_FIND, &found);
+
+ if (!found)
+ return true;
+
+ if (!entry->winfo)
+ return true;
+
+ return pa_get_xact_state(entry->winfo->shared) == PARALLEL_TRANS_FINISHED;
+}
+
+/*
+ * Attach to the shared hash table for parallelized transactions.
+ */
+static void
+pa_attach_parallelized_txn_hash(dsa_handle *pa_dsa_handle,
+ dshash_table_handle *pa_dshash_handle)
+{
+ MemoryContext oldctx;
+
+ if (parallelized_txns)
+ {
+ Assert(parallel_apply_dsa_area);
+ *pa_dsa_handle = dsa_get_handle(parallel_apply_dsa_area);
+ *pa_dshash_handle = dshash_get_hash_table_handle(parallelized_txns);
+ return;
+ }
+
+ /* Be sure any local memory allocated by DSA routines is persistent. */
+ oldctx = MemoryContextSwitchTo(ApplyContext);
+
+ if (am_leader_apply_worker())
+ {
+ /* Initialize dynamic shared hash table for last-start times. */
+ parallel_apply_dsa_area = dsa_create(LWTRANCHE_PARALLEL_APPLY_DSA);
+ dsa_pin(parallel_apply_dsa_area);
+ dsa_pin_mapping(parallel_apply_dsa_area);
+ parallelized_txns = dshash_create(parallel_apply_dsa_area, &dsh_params, NULL);
+
+ /* Store handles in shared memory for other backends to use. */
+ *pa_dsa_handle = dsa_get_handle(parallel_apply_dsa_area);
+ *pa_dshash_handle = dshash_get_hash_table_handle(parallelized_txns);
+ }
+ else if (am_parallel_apply_worker())
+ {
+ /* Attach to existing dynamic shared hash table. */
+ parallel_apply_dsa_area = dsa_attach(MyParallelShared->parallel_apply_dsa_handle);
+ dsa_pin_mapping(parallel_apply_dsa_area);
+ parallelized_txns = dshash_attach(parallel_apply_dsa_area, &dsh_params,
+ MyParallelShared->parallelized_txns_handle,
+ NULL);
+ }
+
+ MemoryContextSwitchTo(oldctx);
+}
+
+/*
+ * Record in-progress transactions from the given list that are being depended
+ * on into the shared hash table.
+ */
+void
+pa_record_dependency_on_transactions(List *depends_on_xids)
+{
+ foreach_xid(xid, depends_on_xids)
+ {
+ bool found;
+ ParallelApplyWorkerEntry *winfo_entry;
+ ParallelApplyWorkerInfo *winfo;
+ ParallelizedTxnEntry *txn_entry;
+
+ winfo_entry = hash_search(ParallelApplyTxnHash, &xid, HASH_FIND, &found);
+ winfo = winfo_entry->winfo;
+
+ if (winfo->shared->has_dependent_txn)
+ continue;
+
+ txn_entry = dshash_find_or_insert(parallelized_txns, &xid, &found);
+
+ if (found)
+ elog(ERROR, "hash table corrupted");
+
+ winfo->shared->has_dependent_txn = true;
+
+ /*
+ * If the transaction has been committed now, remove the entry,
+ * otherwise the parallel apply worker will remove the entry once
+ * committed the transaction.
+ */
+ if (pa_get_xact_state(winfo->shared) == PARALLEL_TRANS_FINISHED)
+ dshash_delete_entry(parallelized_txns, txn_entry);
+ else
+ dshash_release_lock(parallelized_txns, txn_entry);
+ }
+}
+
+/*
+ * Mark the transaction state as finished and remove the shared hash entry if
+ * there are dependent transactions waiting for this transaction to complete.
+ */
+void
+pa_commit_transaction(void)
+{
+ TransactionId xid = MyParallelShared->xid;
+ bool has_dependent_txn;
+
+ SpinLockAcquire(&MyParallelShared->mutex);
+ MyParallelShared->xact_state = PARALLEL_TRANS_FINISHED;
+ has_dependent_txn = MyParallelShared->has_dependent_txn;
+ SpinLockRelease(&MyParallelShared->mutex);
+
+ if (!has_dependent_txn)
+ return;
+
+ dshash_delete_key(parallelized_txns, &xid);
+ elog(DEBUG1, "depended xid %u committed", xid);
+}
+
+/*
+ * Wait for the given transaction to finish.
+ */
+void
+pa_wait_for_depended_transaction(TransactionId xid)
+{
+ elog(DEBUG1, "wait for depended xid %u", xid);
+
+ for (;;)
+ {
+ ParallelizedTxnEntry *txn_entry;
+
+ txn_entry = dshash_find(parallelized_txns, &xid, false);
+
+ /* The entry is removed only if the transaction is committed */
+ if (txn_entry == NULL)
+ break;
+
+ dshash_release_lock(parallelized_txns, txn_entry);
+
+ pa_lock_transaction(xid, AccessShareLock);
+ pa_unlock_transaction(xid, AccessShareLock);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ elog(DEBUG1, "finish waiting for depended xid %u", xid);
+}
+
+/*
+ * Write internal relation description to the output stream.
+ */
+static void
+write_internal_relation(StringInfo s, LogicalRepRelation *rel)
+{
+ pq_sendbyte(s, PARALLEL_APPLY_INTERNAL_MESSAGE);
+ pq_sendbyte(s, LOGICAL_REP_MSG_INTERNAL_RELATION);
+
+ if (rel)
+ {
+ pq_sendint(s, 1, 4);
+ logicalrep_write_internal_rel(s, rel);
+ }
+ else
+ {
+ pq_sendint(s, logicalrep_get_num_rels(), 4);
+ logicalrep_write_all_rels(s);
+ }
}
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index ed62888764c..111691e93eb 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -691,6 +691,44 @@ logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel,
logicalrep_write_attrs(out, rel, columns, include_gencols_type);
}
+/*
+ * Write internal relation description to the output stream.
+ */
+void
+logicalrep_write_internal_rel(StringInfo out, LogicalRepRelation *rel)
+{
+ pq_sendint32(out, rel->remoteid);
+
+ /* Write relation name */
+ pq_sendstring(out, rel->nspname);
+ pq_sendstring(out, rel->relname);
+
+ /* Write the replica identity. */
+ pq_sendbyte(out, rel->replident);
+
+ /* Write attribute description */
+ pq_sendint16(out, rel->natts);
+
+ for (int i = 0; i < rel->natts; i++)
+ {
+ uint8 flags = 0;
+
+ if (bms_is_member(i, rel->attkeys))
+ flags |= LOGICALREP_IS_REPLICA_IDENTITY;
+
+ pq_sendbyte(out, flags);
+
+ /* attribute name */
+ pq_sendstring(out, rel->attnames[i]);
+
+ /* attribute type id */
+ pq_sendint32(out, rel->atttyps[i]);
+
+ /* ignore attribute mode for now */
+ pq_sendint32(out, 0);
+ }
+}
+
/*
* Read the relation info from stream and return as LogicalRepRelation.
*/
@@ -1253,6 +1291,10 @@ logicalrep_message_type(LogicalRepMsgType action)
return "STREAM ABORT";
case LOGICAL_REP_MSG_STREAM_PREPARE:
return "STREAM PREPARE";
+ case LOGICAL_REP_MSG_INTERNAL_DEPENDENCY:
+ return "INTERNAL DEPENDENCY";
+ case LOGICAL_REP_MSG_INTERNAL_RELATION:
+ return "INTERNAL RELATION";
}
/*
diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index 745fd3bab64..34375df3a4b 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -958,3 +958,58 @@ FindLogicalRepLocalIndex(Relation localrel, LogicalRepRelation *remoterel,
return InvalidOid;
}
+
+/*
+ * Get the number of entries in the LogicalRepRelMap.
+ */
+int
+logicalrep_get_num_rels(void)
+{
+ if (LogicalRepRelMap == NULL)
+ return 0;
+
+ return hash_get_num_entries(LogicalRepRelMap);
+}
+
+/*
+ * Write all the remote relation information from the LogicalRepRelMapEntry to
+ * the output stream.
+ */
+void
+logicalrep_write_all_rels(StringInfo out)
+{
+ LogicalRepRelMapEntry *entry;
+ HASH_SEQ_STATUS status;
+
+ if (LogicalRepRelMap == NULL)
+ return;
+
+ hash_seq_init(&status, LogicalRepRelMap);
+
+ while ((entry = (LogicalRepRelMapEntry *) hash_seq_search(&status)) != NULL)
+ logicalrep_write_internal_rel(out, &entry->remoterel);
+}
+
+/*
+ * Get the LogicalRepRelMapEntry corresponding to the given relid without
+ * opening the local relation.
+ */
+LogicalRepRelMapEntry *
+logicalrep_get_relentry(LogicalRepRelId remoteid)
+{
+ LogicalRepRelMapEntry *entry;
+ bool found;
+
+ if (LogicalRepRelMap == NULL)
+ logicalrep_relmap_init();
+
+ /* Search for existing entry. */
+ entry = hash_search(LogicalRepRelMap, (void *) &remoteid,
+ HASH_FIND, &found);
+
+ if (!found)
+ elog(DEBUG1, "no relation map entry for remote relation ID %u",
+ remoteid);
+
+ return entry;
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 7edd1c9cf06..8dd2e28522b 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -303,6 +303,7 @@ typedef struct FlushPosition
dlist_node node;
XLogRecPtr local_end;
XLogRecPtr remote_end;
+ TransactionId pa_remote_xid;
} FlushPosition;
static dlist_head lsn_mapping = DLIST_STATIC_INIT(lsn_mapping);
@@ -483,6 +484,7 @@ static List *on_commit_wakeup_workers_subids = NIL;
bool in_remote_transaction = false;
static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
+static TransactionId remote_xid = InvalidTransactionId;
/* fields valid only when processing streamed transaction */
static bool in_streamed_transaction = false;
@@ -544,6 +546,49 @@ typedef struct ApplySubXactData
static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};
+typedef struct ReplicaIdentityKey
+{
+ Oid relid;
+ LogicalRepTupleData *data;
+} ReplicaIdentityKey;
+
+typedef struct ReplicaIdentityEntry
+{
+ ReplicaIdentityKey *keydata;
+ TransactionId remote_xid;
+
+ /* needed for simplehash */
+ uint32 hash;
+ char status;
+} ReplicaIdentityEntry;
+
+#include "common/hashfn.h"
+
+static uint32 hash_replica_identity(ReplicaIdentityKey *key);
+static bool hash_replica_identity_compare(ReplicaIdentityKey *a,
+ ReplicaIdentityKey *b);
+
+/* Define parameters for replica identity hash table code generation. */
+#define SH_PREFIX replica_identity
+#define SH_ELEMENT_TYPE ReplicaIdentityEntry
+#define SH_KEY_TYPE ReplicaIdentityKey *
+#define SH_KEY keydata
+#define SH_HASH_KEY(tb, key) hash_replica_identity(key)
+#define SH_EQUAL(tb, a, b) hash_replica_identity_compare(a, b)
+#define SH_STORE_HASH
+#define SH_GET_HASH(tb, a) (a)->hash
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+#define REPLICA_IDENTITY_INITIAL_SIZE 128
+#define REPLICA_IDENTITY_CLEANUP_THRESHOLD 1024
+
+static replica_identity_hash *replica_identity_table = NULL;
+
+static void write_internal_dependencies(StringInfo s, List *depends_on_xids);
+
static inline void subxact_filename(char *path, Oid subid, TransactionId xid);
static inline void changes_filename(char *path, Oid subid, TransactionId xid);
@@ -558,11 +603,7 @@ static inline void cleanup_subxact_info(void);
/*
* Serialize and deserialize changes for a toplevel transaction.
*/
-static void stream_open_file(Oid subid, TransactionId xid,
- bool first_segment);
static void stream_write_change(char action, StringInfo s);
-static void stream_open_and_write_change(TransactionId xid, char action, StringInfo s);
-static void stream_close_file(void);
static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
@@ -629,6 +670,595 @@ static TransApplyAction get_transaction_apply_action(TransactionId xid,
static void replorigin_reset(int code, Datum arg);
+/*
+ * Compute the hash value for entries in the replica_identity_table.
+ */
+static uint32
+hash_replica_identity(ReplicaIdentityKey *key)
+{
+ int i;
+ uint32 hashkey = 0;
+
+ hashkey = hash_combine(hashkey, hash_uint32(key->relid));
+
+ for (i = 0; i < key->data->ncols; i++)
+ {
+ uint32 hkey;
+
+ if (key->data->colstatus[i] == LOGICALREP_COLUMN_NULL)
+ continue;
+
+ hkey = hash_any((const unsigned char *) key->data->colvalues[i].data,
+ key->data->colvalues[i].len);
+ hashkey = hash_combine(hashkey, hkey);
+ }
+
+ return hashkey;
+}
+
+/*
+ * Compare two entries in the replica_identity_table.
+ */
+static bool
+hash_replica_identity_compare(ReplicaIdentityKey *a, ReplicaIdentityKey *b)
+{
+ if (a->relid != b->relid ||
+ a->data->ncols != b->data->ncols)
+ return false;
+
+ for (int i = 0; i < a->data->ncols; i++)
+ {
+ if (a->data->colstatus[i] != b->data->colstatus[i])
+ return false;
+
+ if (a->data->colvalues[i].len != b->data->colvalues[i].len)
+ return false;
+
+ if (strcmp(a->data->colvalues[i].data, b->data->colvalues[i].data))
+ return false;
+
+ elog(DEBUG1, "conflicting key %s", a->data->colvalues[i].data);
+ }
+
+ return true;
+}
+
+/*
+ * Free resources associated with a replica identity key.
+ */
+static void
+free_replica_identity_key(ReplicaIdentityKey *key)
+{
+ Assert(key);
+
+ pfree(key->data->colvalues);
+ pfree(key->data->colstatus);
+ pfree(key->data);
+ pfree(key);
+}
+
+/*
+ * Clean up hash table entries associated with the given transaction IDs.
+ */
+static void
+cleanup_replica_identity_table(List *committed_xid)
+{
+ replica_identity_iterator i;
+ ReplicaIdentityEntry *rientry;
+
+ if (!committed_xid)
+ return;
+
+ replica_identity_start_iterate(replica_identity_table, &i);
+ while ((rientry = replica_identity_iterate(replica_identity_table, &i)) != NULL)
+ {
+ if (!list_member_xid(committed_xid, rientry->remote_xid))
+ continue;
+
+ /* Clean up the hash entry for committed transaction */
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+ }
+}
+
+/*
+ * Check committed transactions and clean up corresponding entries in the hash
+ * table.
+ */
+static void
+cleanup_committed_replica_identity_entries(void)
+{
+ dlist_mutable_iter iter;
+ List *committed_xids = NIL;
+
+ dlist_foreach_modify(iter, &lsn_mapping)
+ {
+ FlushPosition *pos =
+ dlist_container(FlushPosition, node, iter.cur);
+ bool skipped_write;
+
+ if (!TransactionIdIsValid(pos->pa_remote_xid) ||
+ !XLogRecPtrIsInvalid(pos->local_end))
+ continue;
+
+ pos->local_end = pa_get_last_commit_end(pos->pa_remote_xid, true,
+ &skipped_write);
+
+ elog(DEBUG1,
+ "got commit end from parallel apply worker, "
+ "txn: %u, remote_end %X/%X, local_end %X/%X",
+ pos->pa_remote_xid, LSN_FORMAT_ARGS(pos->remote_end),
+ LSN_FORMAT_ARGS(pos->local_end));
+
+ if (!skipped_write && XLogRecPtrIsInvalid(pos->local_end))
+ continue;
+
+ committed_xids = lappend_xid(committed_xids, pos->pa_remote_xid);
+ }
+
+ /* cleanup the entries for committed transactions */
+ cleanup_replica_identity_table(committed_xids);
+}
+
+/*
+ * Append a transaction dependency, excluding duplicates and committed
+ * transactions.
+ */
+static List *
+check_and_append_xid_dependency(List *depends_on_xids,
+ TransactionId *depends_on_xid,
+ TransactionId current_xid)
+{
+ Assert(depends_on_xid);
+
+ if (!TransactionIdIsValid(*depends_on_xid))
+ return depends_on_xids;
+
+ if (TransactionIdEquals(*depends_on_xid, current_xid))
+ return depends_on_xids;
+
+ if (list_member_xid(depends_on_xids, *depends_on_xid))
+ return depends_on_xids;
+
+ /*
+ * Return and reset the xid if the transaction has been committed.
+ */
+ if (pa_transaction_committed(*depends_on_xid))
+ {
+ *depends_on_xid = InvalidTransactionId;
+ return depends_on_xids;
+ }
+
+ return lappend_xid(depends_on_xids, *depends_on_xid);
+}
+
+/*
+ * Check for dependencies on preceding transactions that modify the same key.
+ * Returns the dependent transactions in 'depends_on_xids' and records the
+ * current change.
+ */
+static void
+check_dependency_on_replica_identity(Oid relid,
+ LogicalRepTupleData *original_data,
+ TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ LogicalRepRelMapEntry *relentry;
+ LogicalRepTupleData *ridata;
+ ReplicaIdentityKey *rikey;
+ ReplicaIdentityEntry *rientry;
+ MemoryContext oldctx;
+ int n_ri;
+ bool found = false;
+
+ Assert(depends_on_xids);
+
+ /* Search for existing entry */
+ relentry = logicalrep_get_relentry(relid);
+
+ Assert(relentry);
+
+ /*
+ * First search whether any previous transaction has affected the whole
+ * table e.g., truncate or schema change from publisher.
+ */
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &relentry->last_depended_xid,
+ new_depended_xid);
+
+ n_ri = bms_num_members(relentry->remoterel.attkeys);
+
+ /*
+ * Return if there are no replica identity columns, indicating that the
+ * remote relation has neither a replica identity key nor is marked as
+ * replica identity full.
+ */
+ if (!n_ri)
+ return;
+
+ /* Check if the RI key value of the tuple is invalid */
+ for (int i = 0; i < original_data->ncols; i++)
+ {
+ if (!bms_is_member(i, relentry->remoterel.attkeys))
+ continue;
+
+ /*
+ * Return if RI key is NULL or is explicitly marked unchanged. The key
+ * value could be NULL in the new tuple of a update opertaion which
+ * means the RI key is not updated.
+ */
+ if (original_data->colstatus[i] == LOGICALREP_COLUMN_NULL ||
+ original_data->colstatus[i] == LOGICALREP_COLUMN_UNCHANGED)
+ return;
+ }
+
+ oldctx = MemoryContextSwitchTo(ApplyContext);
+
+ /* Allocate space for replica identity values */
+ ridata = palloc0_object(LogicalRepTupleData);
+ ridata->colvalues = palloc0_array(StringInfoData, n_ri);
+ ridata->colstatus = palloc0_array(char, n_ri);
+ ridata->ncols = n_ri;
+
+ for (int i_original = 0, i_ri = 0; i_original < original_data->ncols; i_original++)
+ {
+ StringInfo original_colvalue = &original_data->colvalues[i_original];
+
+ if (!bms_is_member(i_original, relentry->remoterel.attkeys))
+ continue;
+
+ initStringInfoExt(&ridata->colvalues[i_ri], original_colvalue->len + 1);
+ appendStringInfoString(&ridata->colvalues[i_ri], original_colvalue->data);
+ ridata->colstatus[i_ri] = original_data->colstatus[i_original];
+ i_ri++;
+ }
+
+ rikey = palloc0_object(ReplicaIdentityKey);
+ rikey->relid = relid;
+ rikey->data = ridata;
+
+ if (TransactionIdIsValid(new_depended_xid))
+ {
+ rientry = replica_identity_insert(replica_identity_table, rikey,
+ &found);
+
+ /*
+ * Release the key built to search the entry, if the entry already
+ * exists. Otherwise, initialize the remote_xid.
+ */
+ if (found)
+ {
+ elog(DEBUG1, "found conflicting replica identity change from %u",
+ rientry->remote_xid);
+
+ free_replica_identity_key(rikey);
+ }
+ else
+ rientry->remote_xid = InvalidTransactionId;
+ }
+ else
+ {
+ rientry = replica_identity_lookup(replica_identity_table, rikey);
+ free_replica_identity_key(rikey);
+ }
+
+ MemoryContextSwitchTo(oldctx);
+
+ /* Return if no entry found */
+ if (!rientry)
+ return;
+
+ Assert(!found || TransactionIdIsValid(rientry->remote_xid));
+
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &rientry->remote_xid,
+ new_depended_xid);
+
+ /*
+ * Update the entry with the new depended xid if it is valid. The new xid
+ * could be invalid if the transaction will be applied by the leader
+ * itself, which means all its changes will be committed before processing
+ * the next transaction, so nothing needs to depend on it.
+ */
+ if (TransactionIdIsValid(new_depended_xid))
+ rientry->remote_xid = new_depended_xid;
+
+ /*
+ * Remove the entry if the transaction has been committed and no new
+ * dependency needs to be added.
+ */
+ else if (!TransactionIdIsValid(rientry->remote_xid))
+ {
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+ }
+}
+
+/*
+ * Check for preceding transactions that involve insert, delete, or update
+ * operations on the specified table, and return them in 'depends_on_xids'.
+ */
+static void
+find_all_dependencies_on_rel(LogicalRepRelId relid, TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ replica_identity_iterator i;
+ ReplicaIdentityEntry *rientry;
+
+ Assert(depends_on_xids);
+
+ replica_identity_start_iterate(replica_identity_table, &i);
+ while ((rientry = replica_identity_iterate(replica_identity_table, &i)) != NULL)
+ {
+ Assert(TransactionIdIsValid(rientry->remote_xid));
+
+ if (rientry->keydata->relid != relid)
+ continue;
+
+ /* Clean up the hash entry for committed transaction while on it */
+ if (pa_transaction_committed(rientry->remote_xid))
+ {
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+
+ continue;
+ }
+
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &rientry->remote_xid,
+ new_depended_xid);
+ }
+}
+
+/*
+ * Check for any preceding transactions that affect the given table and returns
+ * them in 'depends_on_xids'.
+ */
+static void
+check_dependency_on_rel(LogicalRepRelId relid, TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ LogicalRepRelMapEntry *relentry;
+
+ Assert(depends_on_xids);
+
+ find_all_dependencies_on_rel(relid, new_depended_xid, depends_on_xids);
+
+ /* Search for existing entry */
+ relentry = logicalrep_get_relentry(relid);
+
+ /*
+ * The relentry has not been initialized yet, indicating no change has
+ * been applied yet.
+ */
+ if (!relentry)
+ return;
+
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &relentry->last_depended_xid,
+ new_depended_xid);
+
+ if (TransactionIdIsValid(new_depended_xid))
+ relentry->last_depended_xid = new_depended_xid;
+}
+
+/*
+ * Check dependencies related to the current change by determining if the
+ * modification impacts the same row or table as another ongoing transaction. If
+ * needed, instruct parallel apply workers to wait for these preceding
+ * transactions to complete.
+ *
+ * Simultaneously, track the dependency for the current change to ensure that
+ * subsequent transactions address this dependency.
+ */
+static void
+handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
+ TransactionId new_depended_xid,
+ ParallelApplyWorkerInfo *winfo)
+{
+ LogicalRepRelId relid;
+ LogicalRepTupleData oldtup;
+ LogicalRepTupleData newtup;
+ LogicalRepRelation *rel;
+ List *depends_on_xids = NIL;
+ List *remote_relids;
+ bool has_oldtup = false;
+ bool cascade = false;
+ bool restart_seqs = false;
+ StringInfoData dependencies;
+
+ /*
+ * Parse the change data using a local copy instead of directly consuming
+ * the given remote message, as the caller may also need to read data from
+ * it.
+ */
+ StringInfoData change = *s;
+
+ /* Compute dependency only for non-streaming transaction */
+ if (in_streamed_transaction || (winfo && winfo->stream_txn))
+ return;
+
+ /* Only the leader checks dependencies and schedules the parallel apply */
+ if (!am_leader_apply_worker())
+ return;
+
+ if (!replica_identity_table)
+ replica_identity_table = replica_identity_create(ApplyContext,
+ REPLICA_IDENTITY_INITIAL_SIZE,
+ NULL);
+
+ if (replica_identity_table->members >= REPLICA_IDENTITY_CLEANUP_THRESHOLD)
+ cleanup_committed_replica_identity_entries();
+
+ switch (action)
+ {
+ case LOGICAL_REP_MSG_INSERT:
+ relid = logicalrep_read_insert(&change, &newtup);
+ check_dependency_on_replica_identity(relid, &newtup,
+ new_depended_xid,
+ &depends_on_xids);
+ break;
+
+ case LOGICAL_REP_MSG_UPDATE:
+ relid = logicalrep_read_update(&change, &has_oldtup, &oldtup,
+ &newtup);
+
+ if (has_oldtup)
+ check_dependency_on_replica_identity(relid, &oldtup,
+ new_depended_xid,
+ &depends_on_xids);
+
+ check_dependency_on_replica_identity(relid, &newtup,
+ new_depended_xid,
+ &depends_on_xids);
+ break;
+
+ case LOGICAL_REP_MSG_DELETE:
+ relid = logicalrep_read_delete(&change, &oldtup);
+ check_dependency_on_replica_identity(relid, &oldtup,
+ new_depended_xid,
+ &depends_on_xids);
+ break;
+
+ case LOGICAL_REP_MSG_TRUNCATE:
+ remote_relids = logicalrep_read_truncate(&change, &cascade,
+ &restart_seqs);
+
+ /*
+ * Truncate affects all rows in a table, so the current
+ * transaction should wait for all preceding transactions that
+ * modified the same table.
+ */
+ foreach_int(truncated_relid, remote_relids)
+ check_dependency_on_rel(truncated_relid, new_depended_xid,
+ &depends_on_xids);
+
+ break;
+
+ case LOGICAL_REP_MSG_RELATION:
+ rel = logicalrep_read_rel(&change);
+
+ /*
+ * The replica identity key could be changed, making existing
+ * entries in the replica identity invalid. In this case, parallel
+ * apply is not allowed on this specific table until all running
+ * transactions that modified it have finished.
+ */
+ check_dependency_on_rel(rel->remoteid, new_depended_xid,
+ &depends_on_xids);
+ break;
+
+ case LOGICAL_REP_MSG_TYPE:
+ case LOGICAL_REP_MSG_MESSAGE:
+
+ /*
+ * Type updates accompany relation updates, so dependencies have
+ * already been checked during relation updates. Logical messages
+ * do not conflict with any changes, so they can be ignored.
+ */
+ break;
+
+ default:
+ Assert(false);
+ break;
+ }
+
+ if (!depends_on_xids)
+ return;
+
+ /*
+ * Notify the preceding transactions that the current transaction
+ * depends on them.
+ */
+ pa_record_dependency_on_transactions(depends_on_xids);
+
+ /*
+ * If the leader applies the transaction itself, immediately wait for
+ * the transactions that the current transaction depends on.
+ */
+ if (winfo == NULL)
+ {
+ foreach_xid(xid, depends_on_xids)
+ pa_wait_for_depended_transaction(xid);
+
+ return;
+ }
+
+ initStringInfo(&dependencies);
+
+ /* Build the dependency message used to send to parallel apply worker */
+ write_internal_dependencies(&dependencies, depends_on_xids);
+
+ if (!winfo->serialize_changes)
+ {
+ if (pa_send_data(winfo, dependencies.len, dependencies.data))
+ return;
+
+ pa_switch_to_partial_serialize(winfo, true);
+ }
+
+ /* Skip writing the first internal message flag */
+ dependencies.cursor++;
+ stream_write_change(LOGICAL_REP_MSG_INTERNAL_DEPENDENCY,
+ &dependencies);
+}
+
+/*
+ * Write internal dependency information to the output for the parallel apply
+ * worker.
+ */
+static void
+write_internal_dependencies(StringInfo s, List *depends_on_xids)
+{
+ pq_sendbyte(s, PARALLEL_APPLY_INTERNAL_MESSAGE);
+ pq_sendbyte(s, LOGICAL_REP_MSG_INTERNAL_DEPENDENCY);
+ pq_sendint32(s, list_length(depends_on_xids));
+
+ foreach_xid(xid, depends_on_xids)
+ pq_sendint32(s, xid);
+}
+
+/*
+ * Handle internal dependency information.
+ *
+ * Wait for all transactions listed in the message to commit.
+ */
+static void
+apply_handle_internal_dependency(StringInfo s)
+{
+ int nxids = pq_getmsgint(s, 4);
+
+ for (int i = 0; i < nxids; i++)
+ {
+ TransactionId xid = pq_getmsgint(s, 4);
+
+ pa_wait_for_depended_transaction(xid);
+ }
+}
+
+/*
+ * Handle internal relation information.
+ *
+ * Update all relation details in the relation map cache.
+ */
+static void
+apply_handle_internal_relation(StringInfo s)
+{
+ int num_rels;
+
+ num_rels = pq_getmsgint(s, 4);
+
+ for (int i = 0; i < num_rels; i++)
+ {
+ LogicalRepRelation *rel = logicalrep_read_rel(s);
+
+ logicalrep_relmap_update(rel);
+
+ elog(DEBUG1, "parallel apply worker worker init relmap for %s",
+ rel->relname);
+ }
+}
+
/*
* Form the origin name for the subscription.
*
@@ -776,13 +1406,18 @@ handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
TransApplyAction apply_action;
StringInfoData original_msg;
- apply_action = get_transaction_apply_action(stream_xid, &winfo);
+ Assert(!in_streamed_transaction || TransactionIdIsValid(stream_xid));
+
+ apply_action = get_transaction_apply_action(in_streamed_transaction
+ ? stream_xid : remote_xid,
+ &winfo);
/* not in streaming mode */
if (apply_action == TRANS_LEADER_APPLY)
+ {
+ handle_dependency_on_change(action, s, InvalidTransactionId, winfo);
return false;
-
- Assert(TransactionIdIsValid(stream_xid));
+ }
/*
* The parallel apply worker needs the xid in this message to decide
@@ -794,15 +1429,28 @@ handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
/*
* We should have received XID of the subxact as the first part of the
- * message, so extract it.
+ * message in streaming transactions, so extract it.
*/
- current_xid = pq_getmsgint(s, 4);
+ if (in_streamed_transaction)
+ current_xid = pq_getmsgint(s, 4);
+ else
+ current_xid = remote_xid;
if (!TransactionIdIsValid(current_xid))
ereport(ERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
errmsg_internal("invalid transaction ID in streamed replication transaction")));
+ handle_dependency_on_change(action, s, current_xid, winfo);
+
+ /*
+ * Re-fetch the latest apply action as it might have been changed during
+ * dependency check.
+ */
+ apply_action = get_transaction_apply_action(in_streamed_transaction
+ ? stream_xid : remote_xid,
+ &winfo);
+
switch (apply_action)
{
case TRANS_LEADER_SERIALIZE:
@@ -1206,17 +1854,49 @@ static void
apply_handle_begin(StringInfo s)
{
LogicalRepBeginData begin_data;
+ ParallelApplyWorkerInfo *winfo;
+ TransApplyAction apply_action;
/* There must not be an active streaming transaction. */
Assert(!TransactionIdIsValid(stream_xid));
logicalrep_read_begin(s, &begin_data);
- set_apply_error_context_xact(begin_data.xid, begin_data.final_lsn);
+
+ remote_xid = begin_data.xid;
+
+ set_apply_error_context_xact(remote_xid, begin_data.final_lsn);
remote_final_lsn = begin_data.final_lsn;
maybe_start_skipping_changes(begin_data.final_lsn);
+ pa_allocate_worker(remote_xid, false);
+
+ apply_action = get_transaction_apply_action(remote_xid, &winfo);
+
+ elog(DEBUG1, "new remote_xid %u", remote_xid);
+ switch (apply_action)
+ {
+ case TRANS_LEADER_APPLY:
+ break;
+
+ case TRANS_LEADER_SEND_TO_PARALLEL:
+ Assert(winfo);
+ pa_send_data(winfo, s->len, s->data);
+ pa_set_stream_apply_worker(winfo);
+ break;
+
+ case TRANS_PARALLEL_APPLY:
+ /* Hold the lock until the end of the transaction. */
+ pa_lock_transaction(MyParallelShared->xid, AccessExclusiveLock);
+ pa_set_xact_state(MyParallelShared, PARALLEL_TRANS_STARTED);
+ break;
+
+ default:
+ elog(ERROR, "unexpected apply action: %d", (int) apply_action);
+ break;
+ }
+
in_remote_transaction = true;
pgstat_report_activity(STATE_RUNNING, NULL);
@@ -1231,6 +1911,11 @@ static void
apply_handle_commit(StringInfo s)
{
LogicalRepCommitData commit_data;
+ ParallelApplyWorkerInfo *winfo;
+ TransApplyAction apply_action;
+
+ /* Save the message before it is consumed. */
+ StringInfoData original_msg = *s;
logicalrep_read_commit(s, &commit_data);
@@ -1241,7 +1926,70 @@ apply_handle_commit(StringInfo s)
LSN_FORMAT_ARGS(commit_data.commit_lsn),
LSN_FORMAT_ARGS(remote_final_lsn))));
- apply_handle_commit_internal(&commit_data);
+ apply_action = get_transaction_apply_action(remote_xid, &winfo);
+
+ switch (apply_action)
+ {
+ case TRANS_LEADER_APPLY:
+ apply_handle_commit_internal(&commit_data);
+ break;
+
+ case TRANS_LEADER_SEND_TO_PARALLEL:
+ Assert(winfo);
+
+ if (pa_send_data(winfo, s->len, s->data))
+ {
+ /* Finish processing the transaction. */
+ pa_xact_finish(winfo, commit_data.end_lsn);
+ break;
+ }
+
+ /*
+ * Switch to serialize mode when we are not able to send the
+ * change to parallel apply worker.
+ */
+ pa_switch_to_partial_serialize(winfo, true);
+
+ /* fall through */
+ case TRANS_LEADER_PARTIAL_SERIALIZE:
+ Assert(winfo);
+
+ stream_open_and_write_change(remote_xid, LOGICAL_REP_MSG_COMMIT,
+ &original_msg);
+
+ pa_set_fileset_state(winfo->shared, FS_SERIALIZE_DONE);
+
+ /* Finish processing the transaction. */
+ pa_xact_finish(winfo, commit_data.end_lsn);
+ break;
+
+ case TRANS_PARALLEL_APPLY:
+
+ /*
+ * If the parallel apply worker is applying spooled messages then
+ * close the file before committing.
+ */
+ if (stream_fd)
+ stream_close_file();
+
+ apply_handle_commit_internal(&commit_data);
+
+ MyParallelShared->last_commit_end = XactLastCommitEnd;
+
+ pa_commit_transaction();
+
+ pa_unlock_transaction(remote_xid, AccessExclusiveLock);
+ break;
+
+ default:
+ elog(ERROR, "unexpected apply action: %d", (int) apply_action);
+ break;
+ }
+
+ remote_xid = InvalidTransactionId;
+ in_remote_transaction = false;
+
+ elog(DEBUG1, "reset remote_xid %u", remote_xid);
/* Process any tables that are being synchronized in parallel. */
ProcessSyncingRelations(commit_data.end_lsn);
@@ -1361,7 +2109,8 @@ apply_handle_prepare(StringInfo s)
* XactLastCommitEnd, and adding it for this purpose doesn't seems worth
* it.
*/
- store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr);
+ store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr,
+ InvalidTransactionId);
in_remote_transaction = false;
@@ -1406,6 +2155,8 @@ apply_handle_commit_prepared(StringInfo s)
/* There is no transaction when COMMIT PREPARED is called */
begin_replication_step();
+ /* TODO wait for xid to finish */
+
/*
* Update origin state so we can restart streaming from correct position
* in case of crash.
@@ -1418,7 +2169,8 @@ apply_handle_commit_prepared(StringInfo s)
CommitTransactionCommand();
pgstat_report_stat(false);
- store_flush_position(prepare_data.end_lsn, XactLastCommitEnd);
+ store_flush_position(prepare_data.end_lsn, XactLastCommitEnd,
+ InvalidTransactionId);
in_remote_transaction = false;
/* Process any tables that are being synchronized in parallel. */
@@ -1484,7 +2236,8 @@ apply_handle_rollback_prepared(StringInfo s)
* transaction because we always flush the WAL record for it. See
* apply_handle_prepare.
*/
- store_flush_position(rollback_data.rollback_end_lsn, InvalidXLogRecPtr);
+ store_flush_position(rollback_data.rollback_end_lsn, InvalidXLogRecPtr,
+ InvalidTransactionId);
in_remote_transaction = false;
/* Process any tables that are being synchronized in parallel. */
@@ -1543,7 +2296,8 @@ apply_handle_stream_prepare(StringInfo s)
* It is okay not to set the local_end LSN for the prepare because
* we always flush the prepare record. See apply_handle_prepare.
*/
- store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr);
+ store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr,
+ InvalidTransactionId);
in_remote_transaction = false;
@@ -1734,7 +2488,7 @@ apply_handle_stream_start(StringInfo s)
/* Try to allocate a worker for the streaming transaction. */
if (first_segment)
- pa_allocate_worker(stream_xid);
+ pa_allocate_worker(stream_xid, true);
apply_action = get_transaction_apply_action(stream_xid, &winfo);
@@ -1792,6 +2546,11 @@ apply_handle_stream_start(StringInfo s)
case TRANS_LEADER_PARTIAL_SERIALIZE:
Assert(winfo);
+ /*
+ * TODO: the parallel apply worker could start waiting too soon when
+ * processing an old stream start message.
+ */
+
/*
* Open the spool file unless it was already opened when switching
* to serialize mode. The transaction started in
@@ -2516,7 +3275,8 @@ apply_handle_commit_internal(LogicalRepCommitData *commit_data)
pgstat_report_stat(false);
- store_flush_position(commit_data->end_lsn, XactLastCommitEnd);
+ store_flush_position(commit_data->end_lsn, XactLastCommitEnd,
+ InvalidTransactionId);
}
else
{
@@ -2549,6 +3309,9 @@ apply_handle_relation(StringInfo s)
/* Also reset all entries in the partition map that refer to remoterel. */
logicalrep_partmap_reset_relmap(rel);
+
+ if (am_leader_apply_worker())
+ pa_distribute_schema_changes_to_workers(rel);
}
/*
@@ -3323,6 +4086,8 @@ FindDeletedTupleInLocalRel(Relation localrel, Oid localidxoid,
/*
* This handles insert, update, delete on a partitioned table.
+ *
+ * TODO, support parallel apply.
*/
static void
apply_handle_tuple_routing(ApplyExecutionData *edata,
@@ -3634,6 +4399,8 @@ apply_handle_truncate(StringInfo s)
ListCell *lc;
LOCKMODE lockmode = AccessExclusiveLock;
+ elog(LOG, "truncate");
+
/*
* Quick return if we are skipping data modification changes or handling
* streamed transactions.
@@ -3845,6 +4612,14 @@ apply_dispatch(StringInfo s)
apply_handle_stream_prepare(s);
break;
+ case LOGICAL_REP_MSG_INTERNAL_RELATION:
+ apply_handle_internal_relation(s);
+ break;
+
+ case LOGICAL_REP_MSG_INTERNAL_DEPENDENCY:
+ apply_handle_internal_dependency(s);
+ break;
+
default:
ereport(ERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -3865,6 +4640,10 @@ apply_dispatch(StringInfo s)
* check which entries on it are already locally flushed. Those we can report
* as having been flushed.
*
+ * For non-streaming transactions managed by a parallel apply worker, we will
+ * get the local commit end from the shared parallel apply worker info once the
+ * transaction has been committed by the worker.
+ *
* The have_pending_txes is true if there are outstanding transactions that
* need to be flushed.
*/
@@ -3874,6 +4653,7 @@ get_flush_position(XLogRecPtr *write, XLogRecPtr *flush,
{
dlist_mutable_iter iter;
XLogRecPtr local_flush = GetFlushRecPtr(NULL);
+ List *committed_pa_xid = NIL;
*write = InvalidXLogRecPtr;
*flush = InvalidXLogRecPtr;
@@ -3883,6 +4663,36 @@ get_flush_position(XLogRecPtr *write, XLogRecPtr *flush,
FlushPosition *pos =
dlist_container(FlushPosition, node, iter.cur);
+ if (TransactionIdIsValid(pos->pa_remote_xid) &&
+ XLogRecPtrIsInvalid(pos->local_end))
+ {
+ bool skipped_write;
+
+ pos->local_end = pa_get_last_commit_end(pos->pa_remote_xid, true,
+ &skipped_write);
+
+ elog(DEBUG1,
+ "got commit end from parallel apply worker, "
+ "txn: %u, remote_end %X/%X, local_end %X/%X",
+ pos->pa_remote_xid, LSN_FORMAT_ARGS(pos->remote_end),
+ LSN_FORMAT_ARGS(pos->local_end));
+
+ /*
+ * Break the loop if the worker has not finished applying the
+ * transaction. There's no need to check subsequent transactions,
+ * as they must commit after the current transaction being
+ * examined and thus won't have their commit end available yet.
+ */
+ if (!skipped_write && XLogRecPtrIsInvalid(pos->local_end))
+ break;
+
+ committed_pa_xid = lappend_xid(committed_pa_xid, pos->pa_remote_xid);
+ }
+
+ /*
+ * Worker has finished applying or the transaction was applied in the
+ * leader apply worker
+ */
*write = pos->remote_end;
if (pos->local_end <= local_flush)
@@ -3891,29 +4701,19 @@ get_flush_position(XLogRecPtr *write, XLogRecPtr *flush,
dlist_delete(iter.cur);
pfree(pos);
}
- else
- {
- /*
- * Don't want to uselessly iterate over the rest of the list which
- * could potentially be long. Instead get the last element and
- * grab the write position from there.
- */
- pos = dlist_tail_element(FlushPosition, node,
- &lsn_mapping);
- *write = pos->remote_end;
- *have_pending_txes = true;
- return;
- }
}
*have_pending_txes = !dlist_is_empty(&lsn_mapping);
+
+ cleanup_replica_identity_table(committed_pa_xid);
}
/*
* Store current remote/local lsn pair in the tracking list.
*/
void
-store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn)
+store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn,
+ TransactionId remote_xid)
{
FlushPosition *flushpos;
@@ -3931,6 +4731,7 @@ store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn)
flushpos = (FlushPosition *) palloc(sizeof(FlushPosition));
flushpos->local_end = local_lsn;
flushpos->remote_end = remote_lsn;
+ flushpos->pa_remote_xid = remote_xid;
dlist_push_tail(&lsn_mapping, &flushpos->node);
MemoryContextSwitchTo(ApplyMessageContext);
@@ -5375,7 +6176,7 @@ stream_cleanup_files(Oid subid, TransactionId xid)
* changes for this transaction, create the buffile, otherwise open the
* previously created file.
*/
-static void
+void
stream_open_file(Oid subid, TransactionId xid, bool first_segment)
{
char path[MAXPGPATH];
@@ -5420,7 +6221,7 @@ stream_open_file(Oid subid, TransactionId xid, bool first_segment)
* stream_close_file
* Close the currently open file with streamed changes.
*/
-static void
+void
stream_close_file(void)
{
Assert(stream_fd != NULL);
@@ -5468,7 +6269,7 @@ stream_write_change(char action, StringInfo s)
* target file if not already before writing the message and close the file at
* the end.
*/
-static void
+void
stream_open_and_write_change(TransactionId xid, char action, StringInfo s)
{
Assert(!in_streamed_transaction);
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 7553f6eacef..9c3737693ba 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -401,6 +401,7 @@ SubtransSLRU "Waiting to access the sub-transaction SLRU cache."
XactSLRU "Waiting to access the transaction status SLRU cache."
ParallelVacuumDSA "Waiting for parallel vacuum dynamic shared memory allocation."
AioUringCompletion "Waiting for another process to complete IO via io_uring."
+ParallelApplyDSA "Waiting for parallel apply dynamic shared memory allocation."
# No "ABI_compatibility" region here as WaitEventLWLock has its own C code.
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index b261c60d3fa..7d2aaf2d389 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -75,6 +75,8 @@ typedef enum LogicalRepMsgType
LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
LOGICAL_REP_MSG_STREAM_ABORT = 'A',
LOGICAL_REP_MSG_STREAM_PREPARE = 'p',
+ LOGICAL_REP_MSG_INTERNAL_DEPENDENCY = 'd',
+ LOGICAL_REP_MSG_INTERNAL_RELATION = 'i',
} LogicalRepMsgType;
/*
@@ -251,6 +253,8 @@ extern void logicalrep_write_message(StringInfo out, TransactionId xid, XLogRecP
extern void logicalrep_write_rel(StringInfo out, TransactionId xid,
Relation rel, Bitmapset *columns,
PublishGencolsType include_gencols_type);
+extern void logicalrep_write_internal_rel(StringInfo out,
+ LogicalRepRelation *rel);
extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
extern void logicalrep_write_typ(StringInfo out, TransactionId xid,
Oid typoid);
diff --git a/src/include/replication/logicalrelation.h b/src/include/replication/logicalrelation.h
index 7a561a8e8d8..34a7069e9e5 100644
--- a/src/include/replication/logicalrelation.h
+++ b/src/include/replication/logicalrelation.h
@@ -37,6 +37,8 @@ typedef struct LogicalRepRelMapEntry
/* Sync state. */
char state;
XLogRecPtr statelsn;
+
+ TransactionId last_depended_xid;
} LogicalRepRelMapEntry;
extern void logicalrep_relmap_update(LogicalRepRelation *remoterel);
@@ -50,5 +52,8 @@ extern void logicalrep_rel_close(LogicalRepRelMapEntry *rel,
LOCKMODE lockmode);
extern bool IsIndexUsableForReplicaIdentityFull(Relation idxrel, AttrMap *attrmap);
extern Oid GetRelationIdentityOrPK(Relation rel);
+extern int logicalrep_get_num_rels(void);
+extern void logicalrep_write_all_rels(StringInfo out);
+extern LogicalRepRelMapEntry *logicalrep_get_relentry(LogicalRepRelId remoteid);
#endif /* LOGICALRELATION_H */
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index e23fa9a4514..c70fae9efda 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -15,6 +15,7 @@
#include "access/xlogdefs.h"
#include "catalog/pg_subscription.h"
#include "datatype/timestamp.h"
+#include "lib/dshash.h"
#include "miscadmin.h"
#include "replication/logicalrelation.h"
#include "replication/walreceiver.h"
@@ -194,6 +195,11 @@ typedef struct ParallelApplyWorkerShared
*/
PartialFileSetState fileset_state;
FileSet fileset;
+
+ dsa_handle parallel_apply_dsa_handle;
+ dshash_table_handle parallelized_txns_handle;
+
+ bool has_dependent_txn;
} ParallelApplyWorkerShared;
/*
@@ -228,6 +234,8 @@ typedef struct ParallelApplyWorkerInfo
*/
bool in_use;
+ bool stream_txn;
+
ParallelApplyWorkerShared *shared;
} ParallelApplyWorkerInfo;
@@ -300,6 +308,10 @@ extern void apply_dispatch(StringInfo s);
extern void maybe_reread_subscription(void);
extern void stream_cleanup_files(Oid subid, TransactionId xid);
+extern void stream_open_file(Oid subid, TransactionId xid, bool first_segment);
+extern void stream_close_file(void);
+extern void stream_open_and_write_change(TransactionId xid, char action,
+ StringInfo s);
extern void set_stream_options(WalRcvStreamOptions *options,
char *slotname,
@@ -313,19 +325,23 @@ extern void SetupApplyOrSyncWorker(int worker_slot);
extern void DisableSubscriptionAndExit(void);
-extern void store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn);
+extern void store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn,
+ TransactionId remote_xid);
/* Function for apply error callback */
extern void apply_error_callback(void *arg);
extern void set_apply_error_context_origin(char *originname);
/* Parallel apply worker setup and interactions */
-extern void pa_allocate_worker(TransactionId xid);
+extern void pa_allocate_worker(TransactionId xid, bool stream_txn);
extern ParallelApplyWorkerInfo *pa_find_worker(TransactionId xid);
+extern XLogRecPtr pa_get_last_commit_end(TransactionId xid, bool delete_entry,
+ bool *skipped_write);
extern void pa_detach_all_error_mq(void);
extern bool pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes,
const void *data);
+extern void pa_distribute_schema_changes_to_workers(LogicalRepRelation *rel);
extern void pa_switch_to_partial_serialize(ParallelApplyWorkerInfo *winfo,
bool stream_locked);
@@ -350,12 +366,18 @@ extern void pa_decr_and_wait_stream_block(void);
extern void pa_xact_finish(ParallelApplyWorkerInfo *winfo,
XLogRecPtr remote_lsn);
+extern bool pa_transaction_committed(TransactionId xid);
+extern void pa_record_dependency_on_transactions(List *depends_on_xids);
+extern void pa_commit_transaction(void);
+extern void pa_wait_for_depended_transaction(TransactionId xid);
#define isParallelApplyWorker(worker) ((worker)->in_use && \
(worker)->type == WORKERTYPE_PARALLEL_APPLY)
#define isTablesyncWorker(worker) ((worker)->in_use && \
(worker)->type == WORKERTYPE_TABLESYNC)
+#define PARALLEL_APPLY_INTERNAL_MESSAGE 'i'
+
static inline bool
am_tablesync_worker(void)
{
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 06a1ffd4b08..f461bd67827 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -135,3 +135,4 @@ PG_LWLOCKTRANCHE(SUBTRANS_SLRU, SubtransSLRU)
PG_LWLOCKTRANCHE(XACT_SLRU, XactSLRU)
PG_LWLOCKTRANCHE(PARALLEL_VACUUM_DSA, ParallelVacuumDSA)
PG_LWLOCKTRANCHE(AIO_URING_COMPLETION, AioUringCompletion)
+PG_LWLOCKTRANCHE(PARALLEL_APPLY_DSA, ParallelApplyDSA)
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index 3d16c2a800d..c2fba0b9a9c 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -17,7 +17,7 @@ $node_publisher->start;
my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_subscriber->init;
$node_subscriber->append_conf('postgresql.conf',
- qq(max_logical_replication_workers = 6));
+ qq(max_logical_replication_workers = 7));
$node_subscriber->start;
my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
diff --git a/src/test/subscription/t/015_stream.pl b/src/test/subscription/t/015_stream.pl
index 03135b1cd6e..e79ddd9a41c 100644
--- a/src/test/subscription/t/015_stream.pl
+++ b/src/test/subscription/t/015_stream.pl
@@ -232,6 +232,12 @@ $node_subscriber->wait_for_log(
$node_publisher->safe_psql('postgres', "INSERT INTO test_tab_2 values(1)");
+# FIXME: Currently, non-streaming transactions are applied in parallel by
+# default. So, the first transaction is handled by a parallel apply worker. To
+# trigger the deadlock, initiate one more transaction to be applied by the
+# leader.
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab_2 values(1)");
+
$h->query_safe('COMMIT');
$h->quit;
@@ -247,7 +253,7 @@ $node_publisher->wait_for_catchup($appname);
$result =
$node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab_2");
-is($result, qq(5001), 'data replicated to subscriber after dropping index');
+is($result, qq(5002), 'data replicated to subscriber after dropping index');
# Clean up test data from the environment.
$node_publisher->safe_psql('postgres', "TRUNCATE TABLE test_tab_2");
diff --git a/src/test/subscription/t/026_stats.pl b/src/test/subscription/t/026_stats.pl
index 00a1c2fcd48..6842476c8b0 100644
--- a/src/test/subscription/t/026_stats.pl
+++ b/src/test/subscription/t/026_stats.pl
@@ -16,6 +16,7 @@ $node_publisher->start;
# Create subscriber node.
my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf', "max_logical_replication_workers = 10");
$node_subscriber->start;
diff --git a/src/test/subscription/t/027_nosuperuser.pl b/src/test/subscription/t/027_nosuperuser.pl
index 691731743df..18b7542274e 100644
--- a/src/test/subscription/t/027_nosuperuser.pl
+++ b/src/test/subscription/t/027_nosuperuser.pl
@@ -87,6 +87,7 @@ $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_publisher->init(allows_streaming => 'logical');
$node_subscriber->init;
$node_publisher->start;
+$node_subscriber->append_conf('postgresql.conf', "max_logical_replication_workers = 10");
$node_subscriber->start;
$publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
my %remainder_a = (
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 018b5919cf6..9055738ca71 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2079,6 +2079,7 @@ ParallelTransState
ParallelVacuumState
ParallelWorkerContext
ParallelWorkerInfo
+ParallelizedTxnEntry
Param
ParamCompileHook
ParamExecData
@@ -2548,6 +2549,8 @@ ReparameterizeForeignPathByChild_function
ReplaceVarsFromTargetList_context
ReplaceVarsNoMatchOption
ReplaceWrapOption
+ReplicaIdentityEntry
+ReplicaIdentityKey
ReplicaIdentityStmt
ReplicationKind
ReplicationSlot
@@ -4033,6 +4036,7 @@ remoteDep
remove_nulling_relids_context
rendezvousHashEntry
rep
+replica_identity_hash
replace_rte_variables_callback
replace_rte_variables_context
report_error_fn
--
2.47.3
Attachment: v20251031-0002-WIP-convert-the-hash-table-into-shared-one.patch (application/octet-stream)
From 9cd724e329591e037e9563ae5d28b811b0be8f91 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <hayato@example.com>
Date: Thu, 11 Sep 2025 13:27:24 +0900
Subject: [PATCH v20251031 2/2] WIP: convert the hash table into shared one
This patch extends parallel apply to use a shared hash table for dependency
tracking.
Currently, entries are added only by the leader apply worker, and a parallel
apply worker can remove entries at commit time.
---
.../replication/logical/applyparallelworker.c | 35 ++
src/backend/replication/logical/worker.c | 444 ++++++++++--------
.../utils/activity/wait_event_names.txt | 1 +
src/include/replication/worker_internal.h | 18 +
src/include/storage/lwlocklist.h | 1 +
src/test/subscription/t/001_rep_changes.pl | 1 +
6 files changed, 310 insertions(+), 190 deletions(-)
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index dccd221ad01..bc57fa298a1 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -298,6 +298,9 @@ static ParallelApplyWorkerInfo *stream_apply_worker = NULL;
/* A list to maintain subtransactions, if any. */
static List *subxactlist = NIL;
+/* A list of known replica identity keys */
+List *replica_identity_keys = NIL;
+
static void pa_free_worker_info(ParallelApplyWorkerInfo *winfo);
static ParallelTransState pa_get_xact_state(ParallelApplyWorkerShared *wshared);
static PartialFileSetState pa_get_fileset_state(void);
@@ -428,6 +431,8 @@ pa_setup_dsm(ParallelApplyWorkerInfo *winfo)
shared->parallel_apply_dsa_handle = parallel_apply_dsa_handle;
shared->parallelized_txns_handle = parallelized_txns_handle;
shared->has_dependent_txn = false;
+ shared->dependency_dsa_handle = DSA_HANDLE_INVALID;
+ shared->dependency_dshash_handle = DSHASH_HANDLE_INVALID;
shm_toc_insert(toc, PARALLEL_APPLY_KEY_SHARED, shared);
@@ -452,6 +457,10 @@ pa_setup_dsm(ParallelApplyWorkerInfo *winfo)
winfo->dsm_seg = seg;
winfo->shared = shared;
+ /* Set the dependency hash table handles */
+ atach_dependency_hash(&winfo->shared->dependency_dsa_handle,
+ &winfo->shared->dependency_dshash_handle);
+
return true;
}
@@ -1728,6 +1737,11 @@ pa_stream_abort(LogicalRepStreamAbortData *abort_data)
pa_set_xact_state(MyParallelShared, PARALLEL_TRANS_FINISHED);
/*
+ * XXX: no need to remove dependency hash entries because it is not
+ * used for streamed transactions
+ */
+
+ /*
* Release the lock as we might be processing an empty streaming
* transaction in which case the lock won't be released during
* transaction rollback.
@@ -2062,6 +2076,9 @@ pa_commit_transaction(void)
TransactionId xid = MyParallelShared->xid;
bool has_dependent_txn;
+ /* Remove the transaction from the dependency hash table */
+ dependency_cleanup();
+
SpinLockAcquire(&MyParallelShared->mutex);
MyParallelShared->xact_state = PARALLEL_TRANS_FINISHED;
has_dependent_txn = MyParallelShared->has_dependent_txn;
@@ -2123,3 +2140,21 @@ write_internal_relation(StringInfo s, LogicalRepRelation *rel)
logicalrep_write_all_rels(s);
}
}
+
+/*
+ * Remember the given replica identity key.
+ */
+void
+remember_replica_identity_key(ReplicaIdentityKey *key)
+{
+ MemoryContext oldctx;
+
+ oldctx = MemoryContextSwitchTo(ApplyContext);
+
+ /*
+ * XXX Currently we do not take care of uniqueness. The same entry can be
+ * appended twice.
+ */
+ replica_identity_keys = lappend(replica_identity_keys, key);
+ MemoryContextSwitchTo(oldctx);
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 8dd2e28522b..d62b783fc48 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -256,11 +256,13 @@
#include "catalog/pg_inherits.h"
#include "catalog/pg_subscription.h"
#include "catalog/pg_subscription_rel.h"
+#include "common/hashfn.h"
#include "commands/subscriptioncmds.h"
#include "commands/tablecmds.h"
#include "commands/trigger.h"
#include "executor/executor.h"
#include "executor/execPartition.h"
+#include "lib/dshash.h"
#include "libpq/pqformat.h"
#include "miscadmin.h"
#include "optimizer/optimizer.h"
@@ -546,46 +548,45 @@ typedef struct ApplySubXactData
static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};
-typedef struct ReplicaIdentityKey
-{
- Oid relid;
- LogicalRepTupleData *data;
-} ReplicaIdentityKey;
-
+/*
+ * dshash entry; Holds last remote_xid that modified the tuple on the publisher.
+ */
typedef struct ReplicaIdentityEntry
{
- ReplicaIdentityKey *keydata;
+ ReplicaIdentityKey keydata;
TransactionId remote_xid;
-
- /* needed for simplehash */
- uint32 hash;
- char status;
} ReplicaIdentityEntry;
-#include "common/hashfn.h"
+/*
+ * Build a hash value from the oid and replica identity columns.
+ *
+ * XXX: do we have to extend hash value?
+ */
+static uint32
+build_hash(Oid relid, LogicalRepTupleData *data, LogicalRepRelMapEntry *relentry)
+{
+ int i;
+ uint32 hashkey = 0;
+
+ hashkey = hash_combine(hashkey, hash_uint32(relid));
+
+ for (i = 0; i < data->ncols; i++)
+ {
+ uint32 hkey;
+
+ if (data->colstatus[i] == LOGICALREP_COLUMN_NULL)
+ continue;
+
+ if (!bms_is_member(i, relentry->remoterel.attkeys))
+ continue;
-static uint32 hash_replica_identity(ReplicaIdentityKey *key);
-static bool hash_replica_identity_compare(ReplicaIdentityKey *a,
- ReplicaIdentityKey *b);
-
-/* Define parameters for replica identity hash table code generation. */
-#define SH_PREFIX replica_identity
-#define SH_ELEMENT_TYPE ReplicaIdentityEntry
-#define SH_KEY_TYPE ReplicaIdentityKey *
-#define SH_KEY keydata
-#define SH_HASH_KEY(tb, key) hash_replica_identity(key)
-#define SH_EQUAL(tb, a, b) hash_replica_identity_compare(a, b)
-#define SH_STORE_HASH
-#define SH_GET_HASH(tb, a) (a)->hash
-#define SH_SCOPE static inline
-#define SH_DECLARE
-#define SH_DEFINE
-#include "lib/simplehash.h"
-
-#define REPLICA_IDENTITY_INITIAL_SIZE 128
-#define REPLICA_IDENTITY_CLEANUP_THRESHOLD 1024
-
-static replica_identity_hash *replica_identity_table = NULL;
+ hkey = hash_any((const unsigned char *) data->colvalues[i].data,
+ data->colvalues[i].len);
+ hashkey = hash_combine(hashkey, hkey);
+ }
+
+ return hashkey;
+}
static void write_internal_dependencies(StringInfo s, List *depends_on_xids);
@@ -669,135 +670,148 @@ static TransApplyAction get_transaction_apply_action(TransactionId xid,
ParallelApplyWorkerInfo **winfo);
static void replorigin_reset(int code, Datum arg);
+static void dependency_dsa_detach(int code, Datum arg);
+static void ensure_dependency_dshash(void);
-/*
- * Compute the hash value for entries in the replica_identity_table.
- */
-static uint32
-hash_replica_identity(ReplicaIdentityKey *key)
+/* parameters for the RI dependency shared hash table */
+static const dshash_parameters dependency_dsh_params =
{
- int i;
- uint32 hashkey = 0;
-
- hashkey = hash_combine(hashkey, hash_uint32(key->relid));
-
- for (i = 0; i < key->data->ncols; i++)
- {
- uint32 hkey;
-
- if (key->data->colstatus[i] == LOGICALREP_COLUMN_NULL)
- continue;
-
- hkey = hash_any((const unsigned char *) key->data->colvalues[i].data,
- key->data->colvalues[i].len);
- hashkey = hash_combine(hashkey, hkey);
- }
+ sizeof(ReplicaIdentityKey),
+ sizeof(ReplicaIdentityEntry),
+ dshash_memcmp,
+ dshash_memhash,
+ dshash_memcpy,
+ LWTRANCHE_DEPENDENCY_APPLY_DSA
+};
- return hashkey;
-}
+static dsa_area *dependency_dsa_area = NULL;
+static dshash_table *dependency_dshash = NULL;
/*
- * Compare two entries in the replica_identity_table.
+ * Allocate the dependency hash table in shared memory, or attach to it.
+ *
+ * It is always called by a leader apply worker first, then called by parallel
+ * workers.
*/
-static bool
-hash_replica_identity_compare(ReplicaIdentityKey *a, ReplicaIdentityKey *b)
+static void
+ensure_dependency_dshash(void)
{
- if (a->relid != b->relid ||
- a->data->ncols != b->data->ncols)
- return false;
+ MemoryContext oldcontext;
- for (int i = 0; i < a->data->ncols; i++)
- {
- if (a->data->colstatus[i] != b->data->colstatus[i])
- return false;
+ /* Already initialized */
+ if (dependency_dshash)
+ return;
- if (a->data->colvalues[i].len != b->data->colvalues[i].len)
- return false;
+ /* XXX: is it OK to switch the memory context only here? */
+ oldcontext = MemoryContextSwitchTo(ApplyContext);
- if (strcmp(a->data->colvalues[i].data, b->data->colvalues[i].data))
- return false;
+ /* Parallel apply worker should attach to existing dsa and dshash */
+ if (am_parallel_apply_worker())
+ {
+ Assert(MyParallelShared->dependency_dsa_handle != DSA_HANDLE_INVALID &&
+ MyParallelShared->dependency_dshash_handle != DSHASH_HANDLE_INVALID);
- elog(DEBUG1, "conflicting key %s", a->data->colvalues[i].data);
+ dependency_dsa_area = dsa_attach(MyParallelShared->dependency_dsa_handle);
+ dsa_pin_mapping(dependency_dsa_area);
+ dependency_dshash = dshash_attach(dependency_dsa_area, &dependency_dsh_params,
+ MyParallelShared->dependency_dshash_handle, NULL);
+ }
+ else
+ {
+ /* Leader apply worker should create dsa and dshash */
+ dependency_dsa_area = dsa_create(LWTRANCHE_DEPENDENCY_APPLY_DSA);
+ dsa_pin(dependency_dsa_area);
+ dsa_pin_mapping(dependency_dsa_area);
+ dependency_dshash = dshash_create(dependency_dsa_area, &dependency_dsh_params, NULL);
}
- return true;
+ before_shmem_exit(dependency_dsa_detach, (Datum) 0);
+ MemoryContextSwitchTo(oldcontext);
}
/*
- * Free resources associated with a replica identity key.
+ * Export the handles of the shared dependency hash table so that parallel
+ * apply workers can attach to it.
*/
-static void
-free_replica_identity_key(ReplicaIdentityKey *key)
+void
+atach_dependency_hash(dsa_handle *out_dsa, dshash_table_handle *out_hash)
{
- Assert(key);
+ Assert(dependency_dsa_area && dependency_dshash);
- pfree(key->data->colvalues);
- pfree(key->data->colstatus);
- pfree(key->data);
- pfree(key);
+ *out_dsa = dsa_get_handle(dependency_dsa_area);
+ *out_hash = dshash_get_hash_table_handle(dependency_dshash);
}
/*
* Clean up hash table entries associated with the given transaction IDs.
+ *
+ * XXX: do we have to retain this, or is the cleanup done by parallel
+ * workers enough?
*/
static void
cleanup_replica_identity_table(List *committed_xid)
{
- replica_identity_iterator i;
+ dshash_seq_status hstat;
ReplicaIdentityEntry *rientry;
+ if (!dependency_dshash)
+ return;
+
if (!committed_xid)
return;
- replica_identity_start_iterate(replica_identity_table, &i);
- while ((rientry = replica_identity_iterate(replica_identity_table, &i)) != NULL)
+ dshash_seq_init(&hstat, dependency_dshash, true);
+ while ((rientry = dshash_seq_next(&hstat)) != NULL)
{
if (!list_member_xid(committed_xid, rientry->remote_xid))
continue;
/* Clean up the hash entry for committed transaction */
- free_replica_identity_key(rientry->keydata);
- replica_identity_delete_item(replica_identity_table, rientry);
+ dshash_delete_current(&hstat);
}
+ dshash_seq_term(&hstat);
}
/*
- * Check committed transactions and clean up corresponding entries in the hash
- * table.
+ * Remove all hash table entries handled by the worker.
+ *
+ * This is called when a transaction is committed by parallel apply workers.
*/
-static void
-cleanup_committed_replica_identity_entries(void)
+void
+dependency_cleanup(void)
{
- dlist_mutable_iter iter;
- List *committed_xids = NIL;
-
- dlist_foreach_modify(iter, &lsn_mapping)
- {
- FlushPosition *pos =
- dlist_container(FlushPosition, node, iter.cur);
- bool skipped_write;
+ Assert(am_parallel_apply_worker());
- if (!TransactionIdIsValid(pos->pa_remote_xid) ||
- !XLogRecPtrIsInvalid(pos->local_end))
- continue;
-
- pos->local_end = pa_get_last_commit_end(pos->pa_remote_xid, true,
- &skipped_write);
+ /*
+ * Quick exit if dependency hash is not attached yet.
+ *
+ * XXX is it possible?
+ */
+ if (!dependency_dshash)
+ return;
- elog(DEBUG1,
- "got commit end from parallel apply worker, "
- "txn: %u, remote_end %X/%X, local_end %X/%X",
- pos->pa_remote_xid, LSN_FORMAT_ARGS(pos->remote_end),
- LSN_FORMAT_ARGS(pos->local_end));
+ /*
+ * Quick exit if no replication identity key is remembered.
+ *
+ * XXX is it possible?
+ */
+ if (replica_identity_keys == NIL)
+ return;
- if (!skipped_write && XLogRecPtrIsInvalid(pos->local_end))
- continue;
+ /* Walk through the list to remove all related entries */
+ foreach_ptr(ReplicaIdentityKey, rikey, replica_identity_keys)
+ {
+ /*
+ * XXX We cannot ensure that a dshash entry is deleted here. The same
+ * key may have been appended to the list several times, in which case
+ * the second attempt finds no entry to remove.
+ */
+ dshash_delete_key(dependency_dshash, rikey);
- committed_xids = lappend_xid(committed_xids, pos->pa_remote_xid);
+ replica_identity_keys = foreach_delete_current(replica_identity_keys,
+ rikey);
}
- /* cleanup the entries for committed transactions */
- cleanup_replica_identity_table(committed_xids);
+ list_free(replica_identity_keys);
+ replica_identity_keys = NIL;
}
/*
@@ -833,39 +847,22 @@ check_and_append_xid_dependency(List *depends_on_xids,
}
/*
- * Check for dependencies on preceding transactions that modify the same key.
- * Returns the dependent transactions in 'depends_on_xids' and records the
- * current change.
+ * Compute a hash key for the dependency hash table.
+ *
+ * Returns NULL in the cases below, otherwise a palloc'd ReplicaIdentityKey:
+ * - There are no replica identity columns
+ * - RI key is NULL or is explicitly marked unchanged
*/
-static void
-check_dependency_on_replica_identity(Oid relid,
- LogicalRepTupleData *original_data,
- TransactionId new_depended_xid,
- List **depends_on_xids)
+static ReplicaIdentityKey *
+compute_replca_identity_key(LogicalRepRelMapEntry *relentry,
+ LogicalRepTupleData *original_data)
{
- LogicalRepRelMapEntry *relentry;
- LogicalRepTupleData *ridata;
- ReplicaIdentityKey *rikey;
- ReplicaIdentityEntry *rientry;
- MemoryContext oldctx;
int n_ri;
- bool found = false;
-
- Assert(depends_on_xids);
-
- /* Search for existing entry */
- relentry = logicalrep_get_relentry(relid);
+ MemoryContext oldctx;
+ ReplicaIdentityKey *rikey;
Assert(relentry);
- /*
- * First search whether any previous transaction has affected the whole
- * table e.g., truncate or schema change from publisher.
- */
- *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
- &relentry->last_depended_xid,
- new_depended_xid);
-
n_ri = bms_num_members(relentry->remoterel.attkeys);
/*
@@ -874,7 +871,7 @@ check_dependency_on_replica_identity(Oid relid,
* replica identity full.
*/
if (!n_ri)
- return;
+ return NULL;
/* Check if the RI key value of the tuple is invalid */
for (int i = 0; i < original_data->ncols; i++)
@@ -889,64 +886,85 @@ check_dependency_on_replica_identity(Oid relid,
*/
if (original_data->colstatus[i] == LOGICALREP_COLUMN_NULL ||
original_data->colstatus[i] == LOGICALREP_COLUMN_UNCHANGED)
- return;
+ return NULL;
}
oldctx = MemoryContextSwitchTo(ApplyContext);
- /* Allocate space for replica identity values */
- ridata = palloc0_object(LogicalRepTupleData);
- ridata->colvalues = palloc0_array(StringInfoData, n_ri);
- ridata->colstatus = palloc0_array(char, n_ri);
- ridata->ncols = n_ri;
+ rikey = palloc_object(ReplicaIdentityKey);
+ rikey->relid = relentry->remoterel.remoteid;
+ rikey->hash = build_hash(relentry->remoterel.remoteid, original_data,
+ relentry);
- for (int i_original = 0, i_ri = 0; i_original < original_data->ncols; i_original++)
- {
- StringInfo original_colvalue = &original_data->colvalues[i_original];
+ MemoryContextSwitchTo(oldctx);
- if (!bms_is_member(i_original, relentry->remoterel.attkeys))
- continue;
+ return rikey;
+}
- initStringInfoExt(&ridata->colvalues[i_ri], original_colvalue->len + 1);
- appendStringInfoString(&ridata->colvalues[i_ri], original_colvalue->data);
- ridata->colstatus[i_ri] = original_data->colstatus[i_original];
- i_ri++;
- }
+/*
+ * Check for dependencies on preceding transactions that modify the same key.
+ * Returns the dependent transactions in 'depends_on_xids' and records the
+ * current change.
+ */
+static void
+check_dependency_on_replica_identity(Oid relid,
+ LogicalRepTupleData *original_data,
+ TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ LogicalRepRelMapEntry *relentry;
+ ReplicaIdentityKey *rikey;
+ ReplicaIdentityEntry *rientry;
+ MemoryContext oldctx;
+ bool found = false;
+
+ Assert(depends_on_xids);
+
+ /* Search for existing entry */
+ relentry = logicalrep_get_relentry(relid);
- rikey = palloc0_object(ReplicaIdentityKey);
- rikey->relid = relid;
- rikey->data = ridata;
+ Assert(relentry);
+
+ /*
+ * First search whether any previous transaction has affected the whole table
+ * e.g., truncate or schema change from publisher.
+ */
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &relentry->last_depended_xid,
+ new_depended_xid);
+
+ /* Compute the hash key */
+ rikey = compute_replca_identity_key(relentry, original_data);
+
+ if (!rikey)
+ return;
+
+ oldctx = MemoryContextSwitchTo(ApplyContext);
if (TransactionIdIsValid(new_depended_xid))
{
- rientry = replica_identity_insert(replica_identity_table, rikey,
- &found);
+ rientry = dshash_find_or_insert(dependency_dshash, rikey, &found);
- /*
- * Release the key built to search the entry, if the entry already
- * exists. Otherwise, initialize the remote_xid.
- */
+ /* Reuse the existing entry if found */
if (found)
{
elog(DEBUG1, "found conflicting replica identity change from %u",
rientry->remote_xid);
-
- free_replica_identity_key(rikey);
}
else
rientry->remote_xid = InvalidTransactionId;
}
else
- {
- rientry = replica_identity_lookup(replica_identity_table, rikey);
- free_replica_identity_key(rikey);
- }
+ rientry = dshash_find(dependency_dshash, rikey, true);
MemoryContextSwitchTo(oldctx);
/* Return if no entry found */
if (!rientry)
+ {
+ pfree(rikey);
return;
+ }
Assert(!found || TransactionIdIsValid(rientry->remote_xid));
@@ -969,9 +987,14 @@ check_dependency_on_replica_identity(Oid relid,
*/
else if (!TransactionIdIsValid(rientry->remote_xid))
{
- free_replica_identity_key(rientry->keydata);
- replica_identity_delete_item(replica_identity_table, rientry);
+ dshash_delete_entry(dependency_dshash, rientry);
+ rientry = NULL;
}
+
+ pfree(rikey);
+
+ if (rientry)
+ dshash_release_lock(dependency_dshash, rientry);
}
/*
@@ -982,25 +1005,30 @@ static void
find_all_dependencies_on_rel(LogicalRepRelId relid, TransactionId new_depended_xid,
List **depends_on_xids)
{
- replica_identity_iterator i;
+ dshash_seq_status hstat;
ReplicaIdentityEntry *rientry;
Assert(depends_on_xids);
- replica_identity_start_iterate(replica_identity_table, &i);
- while ((rientry = replica_identity_iterate(replica_identity_table, &i)) != NULL)
+ /*
+ * XXX We ensure dependency hash table exists even here, because relation
+ * message seems to be able to reach before the BEGIN/STREAM_START
+ * message.
+ */
+ ensure_dependency_dshash();
+
+ dshash_seq_init(&hstat, dependency_dshash, true);
+ while ((rientry = dshash_seq_next(&hstat)) != NULL)
{
Assert(TransactionIdIsValid(rientry->remote_xid));
- if (rientry->keydata->relid != relid)
+ if (rientry->keydata.relid != relid)
continue;
/* Clean up the hash entry for committed transaction while on it */
if (pa_transaction_committed(rientry->remote_xid))
{
- free_replica_identity_key(rientry->keydata);
- replica_identity_delete_item(replica_identity_table, rientry);
-
+ dshash_delete_current(&hstat);
continue;
}
@@ -1008,6 +1036,7 @@ find_all_dependencies_on_rel(LogicalRepRelId relid, TransactionId new_depended_x
&rientry->remote_xid,
new_depended_xid);
}
+ dshash_seq_term(&hstat);
}
/*
@@ -1082,14 +1111,6 @@ handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
if (!am_leader_apply_worker())
return;
- if (!replica_identity_table)
- replica_identity_table = replica_identity_create(ApplyContext,
- REPLICA_IDENTITY_INITIAL_SIZE,
- NULL);
-
- if (replica_identity_table->members >= REPLICA_IDENTITY_CLEANUP_THRESHOLD)
- cleanup_committed_replica_identity_entries();
-
switch (action)
{
case LOGICAL_REP_MSG_INSERT:
@@ -1870,6 +1891,8 @@ apply_handle_begin(StringInfo s)
maybe_start_skipping_changes(begin_data.final_lsn);
+ ensure_dependency_dshash();
+
pa_allocate_worker(remote_xid, false);
apply_action = get_transaction_apply_action(remote_xid, &winfo);
@@ -2488,7 +2511,10 @@ apply_handle_stream_start(StringInfo s)
/* Try to allocate a worker for the streaming transaction. */
if (first_segment)
+ {
+ ensure_dependency_dshash();
pa_allocate_worker(stream_xid, true);
+ }
apply_action = get_transaction_apply_action(stream_xid, &winfo);
@@ -3381,6 +3407,7 @@ apply_handle_insert(StringInfo s)
TupleTableSlot *remoteslot;
MemoryContext oldctx;
bool run_as_owner;
+ ReplicaIdentityKey *rikey;
/*
* Quick return if we are skipping data modification changes or handling
@@ -3405,6 +3432,10 @@ apply_handle_insert(StringInfo s)
return;
}
+ rikey = compute_replca_identity_key(rel, &newtup);
+ if (rikey)
+ remember_replica_identity_key(rikey);
+
/*
* Make sure that any user-supplied code runs as the table owner, unless
* the user has opted out of that behavior.
@@ -3541,6 +3572,7 @@ apply_handle_update(StringInfo s)
RTEPermissionInfo *target_perminfo;
MemoryContext oldctx;
bool run_as_owner;
+ ReplicaIdentityKey *rikey;
/*
* Quick return if we are skipping data modification changes or handling
@@ -3566,6 +3598,17 @@ apply_handle_update(StringInfo s)
return;
}
+ if (has_oldtup)
+ {
+ rikey = compute_replca_identity_key(rel, &oldtup);
+ if (rikey)
+ remember_replica_identity_key(rikey);
+ }
+
+ rikey = compute_replca_identity_key(rel, &newtup);
+ if (rikey)
+ remember_replica_identity_key(rikey);
+
/* Set relation for error callback */
apply_error_callback_arg.rel = rel;
@@ -3760,6 +3803,7 @@ apply_handle_delete(StringInfo s)
TupleTableSlot *remoteslot;
MemoryContext oldctx;
bool run_as_owner;
+ ReplicaIdentityKey *rikey;
/*
* Quick return if we are skipping data modification changes or handling
@@ -3784,6 +3828,10 @@ apply_handle_delete(StringInfo s)
return;
}
+ rikey = compute_replca_identity_key(rel, &oldtup);
+ if (rikey)
+ remember_replica_identity_key(rikey);
+
/* Set relation for error callback */
apply_error_callback_arg.rel = rel;
@@ -6621,6 +6669,22 @@ InitializeLogRepWorker(void)
CommitTransactionCommand();
}
+/*
+ * Detach from dependency hash table
+ */
+static void
+dependency_dsa_detach(int code, Datum arg)
+{
+ if (dependency_dshash)
+ {
+ /* XXX: do we have to detach or destory? */
+ dshash_detach(dependency_dshash);
+ }
+
+ if (dependency_dsa_area)
+ dsa_detach(dependency_dsa_area);
+}
+
/*
* Reset the origin state.
*/
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 9c3737693ba..0755cac073c 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -402,6 +402,7 @@ XactSLRU "Waiting to access the transaction status SLRU cache."
ParallelVacuumDSA "Waiting for parallel vacuum dynamic shared memory allocation."
AioUringCompletion "Waiting for another process to complete IO via io_uring."
ParallelApplyDSA "Waiting for parallel apply dynamic shared memory allocation."
+DependencyApplyDSA "Waiting for worker dynamic shared memory allocation."
# No "ABI_compatibility" region here as WaitEventLWLock has its own C code.
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index c70fae9efda..616e0ba8bec 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -25,6 +25,7 @@
#include "storage/shm_mq.h"
#include "storage/shm_toc.h"
#include "storage/spin.h"
+#include "utils/dsa.h"
/* Different types of worker */
typedef enum LogicalRepWorkerType
@@ -200,6 +201,10 @@ typedef struct ParallelApplyWorkerShared
dshash_table_handle parallelized_txns_handle;
bool has_dependent_txn;
+
+ /* Dependency hash table handler */
+ dsa_handle dependency_dsa_handle;
+ dshash_table_handle dependency_dshash_handle;
} ParallelApplyWorkerShared;
/*
@@ -239,6 +244,13 @@ typedef struct ParallelApplyWorkerInfo
ParallelApplyWorkerShared *shared;
} ParallelApplyWorkerInfo;
+/* dshash key; hash is computed from relid and replica identity columns */
+typedef struct ReplicaIdentityKey
+{
+ Oid relid;
+ uint32 hash;
+} ReplicaIdentityKey;
+
/* Main memory context for apply worker. Permanent during worker lifetime. */
extern PGDLLIMPORT MemoryContext ApplyContext;
@@ -261,6 +273,8 @@ extern PGDLLIMPORT bool InitializingApplyWorker;
extern PGDLLIMPORT List *table_states_not_ready;
+extern PGDLLEXPORT List *replica_identity_keys;
+
extern void logicalrep_worker_attach(int slot);
extern LogicalRepWorker *logicalrep_worker_find(LogicalRepWorkerType wtype,
Oid subid, Oid relid,
@@ -371,6 +385,10 @@ extern void pa_record_dependency_on_transactions(List *depends_on_xids);
extern void pa_commit_transaction(void);
extern void pa_wait_for_depended_transaction(TransactionId xid);
+extern void atach_dependency_hash(dsa_handle *out_dsa, dshash_table_handle *out_hash);
+extern void dependency_cleanup(void);
+extern void remember_replica_identity_key(ReplicaIdentityKey *key);
+
#define isParallelApplyWorker(worker) ((worker)->in_use && \
(worker)->type == WORKERTYPE_PARALLEL_APPLY)
#define isTablesyncWorker(worker) ((worker)->in_use && \
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index f461bd67827..8362fbf0b9d 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -136,3 +136,4 @@ PG_LWLOCKTRANCHE(XACT_SLRU, XactSLRU)
PG_LWLOCKTRANCHE(PARALLEL_VACUUM_DSA, ParallelVacuumDSA)
PG_LWLOCKTRANCHE(AIO_URING_COMPLETION, AioUringCompletion)
PG_LWLOCKTRANCHE(PARALLEL_APPLY_DSA, ParallelApplyDSA)
+PG_LWLOCKTRANCHE(DEPENDENCY_APPLY_DSA, DependencyApplyDSA)
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 430c1246d14..09f670a6785 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -16,6 +16,7 @@ $node_publisher->start;
# Create subscriber node
my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf', 'max_logical_replication_workers = 10');
$node_subscriber->start;
# Create some preexisting content on publisher
--
2.47.3
Dear Hackers,
I measured performance data for the shared hash table approach. Based on the
results, the local hash table approach seems better.
I also analyzed the tests in a bit more detail. Let me share from the
beginning...
Background and current implementation
=====================================
Even when apply is parallelized, transactions that depend on other
transactions must wait until those transactions are committed.
In the first version of the PoC, the leader apply worker keeps a local hash
table keyed by the replica identity (relid plus key column values), whose entry
stores the xid of the last transaction that modified that key. When the leader
dispatches a replication message to one of the parallel apply workers, it checks
for an existing entry: (a) if no match, add the entry and proceed; (b) if match,
instruct the worker to wait until the depended-on transaction completes.
One possible downside of this approach is cleaning up the dependency-tracking
hash table. The first PoC does it when: a) the leader worker sends feedback to
the walsender, or b) the number of entries exceeds a limit (1024). The leader
cannot relay replication messages to the other workers while cleaning up
entries, so this might become a bottleneck.
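To make this concrete, below is a minimal sketch of the per-change check using
a plain dynahash table (utils/hsearch.h); ri_dep_table and ri_dep_check are
illustrative names for this sketch, not the functions in the actual patch:

/* Key: which row of which relation; value: its last uncommitted writer. */
typedef struct RIDepKey
{
    Oid         relid;      /* remote relation */
    uint32      ri_hash;    /* hash of the replica identity column values */
} RIDepKey;

typedef struct RIDepEntry
{
    RIDepKey    key;        /* hash key -- must be first */
    TransactionId remote_xid;   /* last uncommitted xact touching this key */
} RIDepEntry;

static HTAB *ri_dep_table;  /* created with hash_create() at worker start */

/*
 * Return the xid that the current change must wait for, or
 * InvalidTransactionId if it can be dispatched immediately.
 */
static TransactionId
ri_dep_check(Oid relid, uint32 ri_hash, TransactionId cur_xid)
{
    RIDepKey    key = {relid, ri_hash};
    bool        found;
    RIDepEntry *entry;
    TransactionId wait_for;

    entry = hash_search(ri_dep_table, &key, HASH_ENTER, &found);
    wait_for = found ? entry->remote_xid : InvalidTransactionId;

    /* Later changes to the same key must now wait for this transaction. */
    entry->remote_xid = cur_xid;

    return wait_for;
}

The cleanup mentioned above simply iterates this table and removes entries
whose remote_xid has already committed.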
Proposal
========
Based on the above, one possible idea to improve performance was to make the
dependency hash table a shared one. The leader worker and the parallel apply
workers it launches attach to the same shared hash table.
The leader uses the hash table in the same way when it dispatches replication
messages. One difference is that when a parallel apply worker commits a
transaction, it removes the used entry from the shared hash table. This
continuously reduces the number of entries, so the leader does not have to
maintain the hash itself.
The downside of this approach is the additional overhead of accessing the
shared hash.
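For reference, below is a condensed sketch of the attach flow, modeled on the
dsa/dshash pattern the patch already uses for the parallelized_txns table. The
dep_params contents and the key/entry types reuse the sketch above, and the
handle fields correspond to the dependency_dsa_handle/dependency_dshash_handle
members added to ParallelApplyWorkerShared:

static dsa_area *dep_dsa = NULL;
static dshash_table *dep_hash = NULL;

static const dshash_parameters dep_params = {
    sizeof(RIDepKey),           /* assumed key layout, as sketched above */
    sizeof(RIDepEntry),
    dshash_memcmp,
    dshash_memhash,
    dshash_memcpy,
    LWTRANCHE_DEPENDENCY_APPLY_DSA
};

static void
attach_dependency_hash(ParallelApplyWorkerShared *shared)
{
    if (am_leader_apply_worker())
    {
        /* The leader creates the area and table, then publishes handles. */
        dep_dsa = dsa_create(LWTRANCHE_DEPENDENCY_APPLY_DSA);
        dsa_pin(dep_dsa);
        dsa_pin_mapping(dep_dsa);
        dep_hash = dshash_create(dep_dsa, &dep_params, NULL);

        shared->dependency_dsa_handle = dsa_get_handle(dep_dsa);
        shared->dependency_dshash_handle =
            dshash_get_hash_table_handle(dep_hash);
    }
    else
    {
        /* Parallel apply workers attach via the published handles. */
        dep_dsa = dsa_attach(shared->dependency_dsa_handle);
        dsa_pin_mapping(dep_dsa);
        dep_hash = dshash_attach(dep_dsa, &dep_params,
                                 shared->dependency_dshash_handle, NULL);
    }
}

On commit, a parallel apply worker deletes its own entries with
dshash_delete_key(), so the leader never needs a separate cleanup pass.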
Results and considerations
==========================
As I shared on -hackers, there is no performance improvement from making the
hash shared. I found that the reason is that the cleanup task is not so
expensive.
I profiled the leader worker during the benchmark and found that the cleanup
function `cleanup_replica_identity_table` consumes only 0.84% of CPU time.
(I tried to attach the profile, but the file was too large.)
The attached histogram (simple_cleanup) shows the time spent in the cleanup
for each patch. The average elapsed time was 1.2 microseconds with the 0001
patch. The time needed per transaction is around 74 microseconds (derived from
the TPS), so the cleanup accounts for only ~1.6% of it and should not affect
overall performance.
Another experiment - 2000 changes per transaction
=================================================
The first experiment used the built-in simple-update workload; the trend might
differ if each transaction contains more changes, because each cleanup could
take more time.
Based on that, the second workload ran 1000 deletions and 1000 insertions per
transaction.
The table below shows the results (with #workers = 4). The TPS values are
mostly the same, following the same trend as the simple-update case. A
histogram for this case is also attached.
         0001           0001+0002      diff
TPS      10297.58551    10146.71342    1%
         10046.75987     9865.730785   2%
          9970.800272    9977.835592   0%
          9927.863416    9909.675726   0%
         10033.03796     9886.181373   1%
AVE      10055.209405    9957.227380   1%
MEDIAN   10033.037957    9909.675726   1%
Overall, I think the local hash approach is sufficient for now, unless we find
better approaches or problematic corner cases.
Best regards,
Hayato Kuroda
FUJITSU LIMITED
Dear hackers,
I think it is better to preserve the commit order by default, for safety
reasons. Per some discussions on -hackers, I implemented a patch that
preserves the commit ordering of the publisher. Let me explain from the
beginning.
Background
==========
The current patch, say v1, does not preserve the commit ordering of the
publisher node. After the leader worker sends a COMMIT message to a parallel
apply worker, the leader does not wait for the transaction to be applied and
continues reading messages from the publisher node. As a result, a parallel
apply worker assigned later may commit earlier, which breaks the commit
ordering of the publisher node.
Proposal
========
We decided to preserve the commit ordering by default so as not to break data
consistency between nodes [1]. The basic idea is that the leader apply worker
caches the remote_xid when it sends the commit record to a parallel apply
worker. The leader sends an INTERNAL_DEPENDENCY message with the cached xid to
the parallel apply worker before it sends the COMMIT message. The parallel
apply worker reads the DEPENDENCY message and waits until that transaction
finishes. The cached xid is updated after the leader sends the COMMIT.
This approach requires less code because the DEPENDENCY message was already
introduced in v1, but it increases the number of messages per transaction.
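In code terms, the leader does roughly the following before dispatching each
COMMIT; this is condensed from build_dependency_with_last_committed_txn() in
the attached patch, with error and partial-serialize handling omitted:

    if (TransactionIdIsValid(last_remote_xid))
    {
        StringInfoData msg;

        initStringInfo(&msg);
        pq_sendbyte(&msg, PARALLEL_APPLY_INTERNAL_MESSAGE);
        pq_sendbyte(&msg, LOGICAL_REP_MSG_INTERNAL_DEPENDENCY);
        pq_sendint32(&msg, 1);                  /* one dependency ... */
        pq_sendint32(&msg, last_remote_xid);    /* ... on the previous txn */

        send_internal_dependencies(winfo, &msg);
        pfree(msg.data);
    }

    /* ... dispatch the COMMIT message to the worker, then: */
    last_remote_xid = remote_xid;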
Performance testing
===================
I confirmed that even when the commit ordering is preserved, parallel apply
still gives a 2x+ improvement over HEAD. Details below.
Machine details
---------------
Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz, 88 cores, 503 GiB RAM
Used patch
----------
v1 is the same as what Hou posted on -hackers [1], and v2 additionally
implements the preserve-commit-order part. The attached patch is what I used
here.
Workload
-----
Setup:
Pub --> Sub
- Two nodes created in a pub-sub synchronous logical replication setup.
- Both nodes have the same set of pgbench tables, created with scale=100.
- The Sub node subscribes to all changes from the Pub's pgbench tables.
Workload Run:
- Run built-in pgbench (simple-update) [2] only on Pub with #clients=40 and
run duration=5 minutes.
This means that the same tuples are rarely modified by concurrent
transactions. I expect the v1 patch to run mostly without waits, while v2
would be slower because it always waits for the previous commit to finish.
Results:
The number of workers is fixed at 4. v2 was 2.1 times faster than HEAD, and
v1 was 2.6 times faster than HEAD. I think this is a very good improvement.
I can continue with other benchmarks using different workloads and parameters.
        HEAD      v1        v2
TPS     6134.7    16194.8   12944.4
        6030.5    16303.9   13043.0
        6181.9    16251.5   12815.7
        6108.1    16173.3   12771.8
        6035.6    16180.3   13054.5
AVE     6098.2    16220.8   12925.8
MEDIAN  6108.1    16194.8   12944.4
[1]: /messages/by-id/CADzfLwXnJ1H4HncFugGPdnm8t+aUAU4E-yfi1j3BbiP5VfXD8g@mail.gmail.com
[2]: https://www.postgresql.org/docs/current/pgbench.html#PGBENCH-OPTION-BUILTIN
Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachments:
v2-0001-Parallel-apply-non-streaming-transactions.patch
From da547799e37ba297ed2f87534f6fcc100e342fe2 Mon Sep 17 00:00:00 2001
From: Zhijie Hou <houzj.fnst@fujitsu.com>
Date: Fri, 8 Aug 2025 11:35:59 +0800
Subject: [PATCH v2] Parallel apply non-streaming transactions
--
Basic design
--
The leader worker assigns each non-streaming transaction to a parallel apply
worker. Before dispatching changes to a parallel worker, the leader verifies
whether the current modification affects the same row (identified by replica
identity
key) as another ongoing transaction. If so, the leader sends a list of dependent
transaction IDs to the parallel worker, indicating that the parallel apply
worker must wait for these transactions to commit before proceeding.
Each parallel apply worker records the local end LSN of the transaction it
applies in shared memory. Subsequently, the leader gathers these local end LSNs
and logs them in the local 'lsn_mapping' for verifying whether they have been
flushed to disk (following the logic in get_flush_position()).
If no parallel apply worker is available, the leader will apply the transaction
independently.
For further details, please refer to the following:
--
dependency tracking
--
The leader maintains a local hash table, using the remote change's replica
identity column values and relid as keys, with remote transaction IDs as values.
Before sending changes to the parallel apply worker, the leader computes a hash
using RI key values and the relid of the current change to search the hash
table. If an existing entry is found, the leader tells the parallel worker
to wait for the remote xid in the hash entry, after which the leader updates the
hash entry with the current xid.
If the remote relation lacks a replica identity (RI), it indicates that only
INSERT can be replicated for this table. In such cases, the leader skips
dependency checks, allowing the parallel apply worker to proceed with applying
changes without delay. This is because the only potential conflicts that could
happen are related to local unique keys or foreign keys, whose handling is yet
to be implemented (see TODO - dependency on local unique key, foreign key).
In cases of TRUNCATE or remote schema changes affecting the entire table, the
leader retrieves all remote xids touching the same table (via sequential scans
of the hash table) and tells the parallel worker to wait for those transactions
to commit.
Hash entries are cleaned up once the transaction corresponding to the remote xid
in the entry has been committed. Clean-up typically occurs when collecting the
flush position of each transaction, but is forced if the hash table exceeds a
set threshold.
--
dependency waiting
--
If a transaction is relied upon by others, the leader adds its xid to a shared
hash table. The shared hash table entry is cleared by the parallel apply worker
upon completing the transaction. Workers needing to wait for a transaction check
the shared hash table entry; if present, they lock the transaction ID (using
pa_lock_transaction). If absent, it indicates the transaction has been
committed, negating the need to wait.
--
commit order
--
There is a case where tables have no foreign or primary keys, and integrity is
maintained at the application layer. In this case, the above RI mechanism cannot
detect any dependencies. For safety reasons, parallel apply workers preserve the
commit ordering done on the publisher side. This is done by the leader worker
caching the last dispatched transaction ID and adding a dependency between it
and the currently dispatching one.
--
TODO - dependency on local unique key, foreign key.
--
A transaction could conflict with another if modifying the same unique key.
While current patches don't address conflicts involving unique or foreign keys,
tracking these dependencies might be needed.
--
TODO - user defined trigger and constraints.
--
It would be challenging to check dependencies if the table has user-defined
triggers or constraints. The most viable solution might be to disallow
parallel apply for relations whose triggers and constraints are not marked as
parallel-safe or immutable.
---
.../replication/logical/applyparallelworker.c | 570 ++++++++++-
src/backend/replication/logical/proto.c | 42 +
src/backend/replication/logical/relation.c | 55 +
src/backend/replication/logical/worker.c | 960 +++++++++++++++++-
.../utils/activity/wait_event_names.txt | 1 +
src/include/replication/logicalproto.h | 4 +
src/include/replication/logicalrelation.h | 5 +
src/include/replication/worker_internal.h | 26 +-
src/include/storage/lwlocklist.h | 1 +
src/test/subscription/t/001_rep_changes.pl | 2 +
src/test/subscription/t/010_truncate.pl | 2 +-
src/test/subscription/t/015_stream.pl | 8 +-
src/test/subscription/t/026_stats.pl | 1 +
src/test/subscription/t/027_nosuperuser.pl | 1 +
src/tools/pgindent/typedefs.list | 4 +
15 files changed, 1602 insertions(+), 80 deletions(-)
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index baa68c1ab6c..b42b0d60143 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -14,6 +14,9 @@
* ParallelApplyWorkerInfo which is required so the leader worker and parallel
* apply workers can communicate with each other.
*
+ * Streaming transactions
+ * ======================
+ *
* The parallel apply workers are assigned (if available) as soon as xact's
* first stream is received for subscriptions that have set their 'streaming'
* option as parallel. The leader apply worker will send changes to this new
@@ -146,6 +149,34 @@
* which will detect deadlock if any. See pa_send_data() and
* enum TransApplyAction.
*
+ *
+ * Non-streaming transactions
+ * ======================
+ * The handling is similar to streaming transactions, but with a few
+ * differences:
+ *
+ * Transaction dependency
+ * -------------------------------
+ * Before dispatching changes to a parallel worker, the leader verifies if the
+ * current modification affects the same row (identified by replica identity
+ * key) as another ongoing transaction (see handle_dependency_on_change for
+ * details). If so, the leader sends a list of dependent transaction IDs to the
+ * parallel worker, indicating that the parallel apply worker must wait for
+ * these transactions to commit before proceeding.
+ *
+ * Commit order
+ * ------------
+ * There is a case where tables have no foreign or primary keys, and integrity
+ * is maintained at the application layer. In this case, the above RI mechanism
+ * cannot detect any dependencies. For safety reasons, parallel apply workers
+ * preserve the commit ordering done on the publisher side. This is done by the
+ * leader worker caching the last dispatched transaction ID and adding a
+ * dependency between it and the currently dispatching one.
+ * We can extend the parallel apply worker to allow out-of-order commits in the
+ * future: At least we must use a new mechanism to track replication progress
+ * in out-of-order commits. Then we can stop caching the transaction ID and
+ * adding the dependency.
+ *
* Lock types
* ----------
* Both the stream lock and the transaction lock mentioned above are
@@ -216,14 +247,38 @@ typedef struct ParallelApplyWorkerEntry
{
TransactionId xid; /* Hash key -- must be first */
ParallelApplyWorkerInfo *winfo;
+ XLogRecPtr local_end;
} ParallelApplyWorkerEntry;
+/* an entry in the parallelized_txns shared hash table */
+typedef struct ParallelizedTxnEntry
+{
+ TransactionId xid; /* Hash key */
+} ParallelizedTxnEntry;
+
/*
* A hash table used to cache the state of streaming transactions being applied
* by the parallel apply workers.
*/
static HTAB *ParallelApplyTxnHash = NULL;
+/*
+ * A hash table used to track the parallelized transactions that could be
+ * depended on by other transactions.
+ */
+static dsa_area *parallel_apply_dsa_area = NULL;
+static dshash_table *parallelized_txns = NULL;
+
+/* parameters for the parallelized_txns shared hash table */
+static const dshash_parameters dsh_params = {
+ sizeof(TransactionId),
+ sizeof(ParallelizedTxnEntry),
+ dshash_memcmp,
+ dshash_memhash,
+ dshash_memcpy,
+ LWTRANCHE_PARALLEL_APPLY_DSA
+};
+
/*
* A list (pool) of active parallel apply workers. The information for
* the new worker is added to the list after successfully launching it. The
@@ -257,6 +312,9 @@ static List *subxactlist = NIL;
static void pa_free_worker_info(ParallelApplyWorkerInfo *winfo);
static ParallelTransState pa_get_xact_state(ParallelApplyWorkerShared *wshared);
static PartialFileSetState pa_get_fileset_state(void);
+static void pa_attach_parallelized_txn_hash(dsa_handle *pa_dsa_handle,
+ dshash_table_handle *pa_dshash_handle);
+static void write_internal_relation(StringInfo s, LogicalRepRelation *rel);
/*
* Returns true if it is OK to start a parallel apply worker, false otherwise.
@@ -334,6 +392,15 @@ pa_setup_dsm(ParallelApplyWorkerInfo *winfo)
shm_mq *mq;
Size queue_size = DSM_QUEUE_SIZE;
Size error_queue_size = DSM_ERROR_QUEUE_SIZE;
+ dsa_handle parallel_apply_dsa_handle;
+ dshash_table_handle parallelized_txns_handle;
+
+ pa_attach_parallelized_txn_hash(¶llel_apply_dsa_handle,
+ ¶llelized_txns_handle);
+
+ if (parallel_apply_dsa_handle == DSA_HANDLE_INVALID ||
+ parallelized_txns_handle == DSHASH_HANDLE_INVALID)
+ return false;
/*
* Estimate how much shared memory we need.
@@ -364,11 +431,13 @@ pa_setup_dsm(ParallelApplyWorkerInfo *winfo)
/* Set up the header region. */
shared = shm_toc_allocate(toc, sizeof(ParallelApplyWorkerShared));
SpinLockInit(&shared->mutex);
-
+ shared->xid = InvalidTransactionId;
shared->xact_state = PARALLEL_TRANS_UNKNOWN;
pg_atomic_init_u32(&(shared->pending_stream_count), 0);
shared->last_commit_end = InvalidXLogRecPtr;
shared->fileset_state = FS_EMPTY;
+ shared->parallel_apply_dsa_handle = parallel_apply_dsa_handle;
+ shared->parallelized_txns_handle = parallelized_txns_handle;
shm_toc_insert(toc, PARALLEL_APPLY_KEY_SHARED, shared);
@@ -406,6 +475,8 @@ pa_launch_parallel_worker(void)
MemoryContext oldcontext;
bool launched;
ParallelApplyWorkerInfo *winfo;
+ dsa_handle pa_dsa_handle;
+ dshash_table_handle pa_dshash_handle;
ListCell *lc;
/* Try to get an available parallel apply worker from the worker pool. */
@@ -413,10 +484,33 @@ pa_launch_parallel_worker(void)
{
winfo = (ParallelApplyWorkerInfo *) lfirst(lc);
- if (!winfo->in_use)
+ if (!winfo->stream_txn &&
+ pa_get_xact_state(winfo->shared) == PARALLEL_TRANS_FINISHED)
+ {
+ /*
+ * Save the local commit LSN of the last transaction applied by this
+ * worker before reusing it for another transaction. This WAL
+ * position is crucial for determining the flush position in
+ * responses to the publisher (see get_flush_position()).
+ */
+ (void) pa_get_last_commit_end(winfo->shared->xid, false, NULL);
+ return winfo;
+ }
+
+ if (winfo->stream_txn && !winfo->in_use)
return winfo;
}
+ pa_attach_parallelized_txn_hash(&pa_dsa_handle, &pa_dshash_handle);
+
+ /*
+ * Return if the number of parallel apply workers has reached the maximum
+ * limit.
+ */
+ if (list_length(ParallelApplyWorkerPool) ==
+ max_parallel_apply_workers_per_subscription)
+ return NULL;
+
/*
* Start a new parallel apply worker.
*
@@ -444,18 +538,31 @@ pa_launch_parallel_worker(void)
dsm_segment_handle(winfo->dsm_seg),
false);
- if (launched)
- {
- ParallelApplyWorkerPool = lappend(ParallelApplyWorkerPool, winfo);
- }
- else
+ if (!launched)
{
+ MemoryContextSwitchTo(oldcontext);
pa_free_worker_info(winfo);
- winfo = NULL;
+ return NULL;
}
+ ParallelApplyWorkerPool = lappend(ParallelApplyWorkerPool, winfo);
+
MemoryContextSwitchTo(oldcontext);
+ /*
+ * Send all existing remote relation information to the parallel apply
+ * worker. This allows the parallel worker to initialize the
+ * LogicalRepRelMapEntry locally before applying remote changes.
+ */
+ if (logicalrep_get_num_rels())
+ {
+ StringInfoData out;
+ initStringInfo(&out);
+
+ write_internal_relation(&out, NULL);
+ pa_send_data(winfo, out.len, out.data);
+ }
+
return winfo;
}
@@ -468,7 +575,7 @@ pa_launch_parallel_worker(void)
* streaming changes.
*/
void
-pa_allocate_worker(TransactionId xid)
+pa_allocate_worker(TransactionId xid, bool stream_txn)
{
bool found;
ParallelApplyWorkerInfo *winfo = NULL;
@@ -509,7 +616,9 @@ pa_allocate_worker(TransactionId xid)
winfo->in_use = true;
winfo->serialize_changes = false;
+ winfo->stream_txn = stream_txn;
entry->winfo = winfo;
+ entry->local_end = InvalidXLogRecPtr;
}
/*
@@ -558,7 +667,8 @@ pa_free_worker(ParallelApplyWorkerInfo *winfo)
{
Assert(!am_parallel_apply_worker());
Assert(winfo->in_use);
- Assert(pa_get_xact_state(winfo->shared) == PARALLEL_TRANS_FINISHED);
+ Assert(!winfo->stream_txn ||
+ pa_get_xact_state(winfo->shared) == PARALLEL_TRANS_FINISHED);
if (!hash_search(ParallelApplyTxnHash, &winfo->shared->xid, HASH_REMOVE, NULL))
elog(ERROR, "hash table corrupted");
@@ -574,9 +684,7 @@ pa_free_worker(ParallelApplyWorkerInfo *winfo)
* been serialized and then letting the parallel apply worker deal with
* the spurious message, we stop the worker.
*/
- if (winfo->serialize_changes ||
- list_length(ParallelApplyWorkerPool) >
- (max_parallel_apply_workers_per_subscription / 2))
+ if (winfo->serialize_changes)
{
logicalrep_pa_worker_stop(winfo);
pa_free_worker_info(winfo);
@@ -706,6 +814,105 @@ pa_process_spooled_messages_if_required(void)
return true;
}
+/*
+ * Get the local end LSN for a transaction applied by a parallel apply worker.
+ *
+ * Set delete_entry to true if you intend to remove the transaction from the
+ * ParallelApplyTxnHash after collecting its LSN.
+ *
+ * If the parallel apply worker did not write any changes during the transaction
+ * application due to situations like update/delete_missing or a before trigger,
+ * the *skipped_write will be set to true.
+ */
+XLogRecPtr
+pa_get_last_commit_end(TransactionId xid, bool delete_entry, bool *skipped_write)
+{
+ bool found;
+ ParallelApplyWorkerEntry *entry;
+ ParallelApplyWorkerInfo *winfo;
+
+ Assert(TransactionIdIsValid(xid));
+
+ if (skipped_write)
+ *skipped_write = false;
+
+ /* Find an entry for the requested transaction. */
+ entry = hash_search(ParallelApplyTxnHash, &xid, HASH_FIND, &found);
+
+ if (!found)
+ return InvalidXLogRecPtr;
+
+ /*
+ * If worker info is NULL, it indicates that the worker has been reused for
+ * handling other transactions. Consequently, the local end LSN has already
+ * been collected and saved in entry->local_end.
+ */
+ winfo = entry->winfo;
+ if (winfo == NULL)
+ {
+ if (delete_entry &&
+ !hash_search(ParallelApplyTxnHash, &xid, HASH_REMOVE, NULL))
+ elog(ERROR, "hash table corrupted");
+
+ if (skipped_write)
+ *skipped_write = XLogRecPtrIsInvalid(entry->local_end);
+
+ return entry->local_end;
+ }
+
+ /* Return InvalidXLogRecPtr if the transaction is still in progress */
+ if (pa_get_xact_state(winfo->shared) != PARALLEL_TRANS_FINISHED)
+ return InvalidXLogRecPtr;
+
+ /* Collect the local end LSN from the worker's shared memory area */
+ entry->local_end = winfo->shared->last_commit_end;
+ entry->winfo = NULL;
+
+ if (skipped_write)
+ *skipped_write = XLogRecPtrIsInvalid(entry->local_end);
+
+ elog(DEBUG1, "store local commit %X/%X end to txn entry: %u",
+ LSN_FORMAT_ARGS(entry->local_end), xid);
+
+ if (delete_entry &&
+ !hash_search(ParallelApplyTxnHash, &xid, HASH_REMOVE, NULL))
+ elog(ERROR, "hash table corrupted");
+
+ return entry->local_end;
+}
+
+/*
+ * Wait for the remote transaction associated with the specified remote xid to
+ * complete.
+ */
+static void
+pa_wait_for_transaction(TransactionId wait_for_xid)
+{
+ if (!am_leader_apply_worker())
+ return;
+
+ if (!TransactionIdIsValid(wait_for_xid))
+ return;
+
+ elog(DEBUG1, "plan to wait for remote_xid %u to finish",
+ wait_for_xid);
+
+ for (;;)
+ {
+ if (pa_transaction_committed(wait_for_xid))
+ break;
+
+ pa_lock_transaction(wait_for_xid, AccessShareLock);
+ pa_unlock_transaction(wait_for_xid, AccessShareLock);
+
+ /* An interrupt may have occurred while we were waiting. */
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ elog(DEBUG1, "finished wait for remote_xid %u to finish",
+ wait_for_xid);
+}
+
/*
* Interrupt handler for main loop of parallel apply worker.
*/
@@ -781,21 +988,35 @@ LogicalParallelApplyLoop(shm_mq_handle *mqh)
* parallel apply workers can only be PqReplMsg_WALData.
*/
c = pq_getmsgbyte(&s);
- if (c != PqReplMsg_WALData)
- elog(ERROR, "unexpected message \"%c\"", c);
- /*
- * Ignore statistics fields that have been updated by the leader
- * apply worker.
- *
- * XXX We can avoid sending the statistics fields from the leader
- * apply worker but for that, it needs to rebuild the entire
- * message by removing these fields which could be more work than
- * simply ignoring these fields in the parallel apply worker.
- */
- s.cursor += SIZE_STATS_MESSAGE;
+ if (c == PqReplMsg_WALData)
+ {
+ /*
+ * Ignore statistics fields that have been updated by the
+ * leader apply worker.
+ *
+ * XXX We can avoid sending the statistics fields from the
+ * leader apply worker but for that, it needs to rebuild the
+ * entire message by removing these fields which could be more
+ * work than simply ignoring these fields in the parallel
+ * apply worker.
+ */
+ s.cursor += SIZE_STATS_MESSAGE;
- apply_dispatch(&s);
+ apply_dispatch(&s);
+ }
+ else if (c == PARALLEL_APPLY_INTERNAL_MESSAGE)
+ {
+ apply_dispatch(&s);
+ }
+ else
+ {
+ /*
+ * The first byte of messages sent from leader apply worker to
+ * parallel apply workers can only be 'w' or 'i'.
+ */
+ elog(ERROR, "unexpected message \"%c\"", c);
+ }
}
else if (shmq_res == SHM_MQ_WOULD_BLOCK)
{
@@ -812,6 +1033,9 @@ LogicalParallelApplyLoop(shm_mq_handle *mqh)
if (rc & WL_LATCH_SET)
ResetLatch(MyLatch);
+
+ if (!IsTransactionState())
+ pgstat_report_stat(true);
}
}
else
@@ -849,6 +1073,9 @@ pa_shutdown(int code, Datum arg)
INVALID_PROC_NUMBER);
dsm_detach((dsm_segment *) DatumGetPointer(arg));
+
+ if (parallel_apply_dsa_area)
+ dsa_detach(parallel_apply_dsa_area);
}
/*
@@ -864,6 +1091,8 @@ ParallelApplyWorkerMain(Datum main_arg)
shm_mq *mq;
shm_mq_handle *mqh;
shm_mq_handle *error_mqh;
+ dsa_handle pa_dsa_handle;
+ dshash_table_handle pa_dshash_handle;
RepOriginId originid;
int worker_slot = DatumGetInt32(main_arg);
char originname[NAMEDATALEN];
@@ -951,6 +1180,8 @@ ParallelApplyWorkerMain(Datum main_arg)
InitializingApplyWorker = false;
+ pa_attach_parallelized_txn_hash(&pa_dsa_handle, &pa_dshash_handle);
+
/* Setup replication origin tracking. */
StartTransactionCommand();
ReplicationOriginNameForLogicalRep(MySubscription->oid, InvalidOid,
@@ -1157,7 +1388,6 @@ pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes, const void *data)
shm_mq_result result;
TimestampTz startTime = 0;
- Assert(!IsTransactionState());
Assert(!winfo->serialize_changes);
/*
@@ -1209,6 +1439,67 @@ pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes, const void *data)
}
}
+/*
+ * Distribute remote relation information to all active parallel apply workers
+ * that require it.
+ */
+void
+pa_distribute_schema_changes_to_workers(LogicalRepRelation *rel)
+{
+ List *workers_stopped = NIL;
+ StringInfoData out;
+
+ if (!ParallelApplyWorkerPool)
+ return;
+
+ initStringInfo(&out);
+
+ write_internal_relation(&out, rel);
+
+ foreach_ptr(ParallelApplyWorkerInfo, winfo, ParallelApplyWorkerPool)
+ {
+ /*
+ * Skip the worker responsible for the current transaction, as the
+ * relation information has already been sent to it.
+ */
+ if (winfo == stream_apply_worker)
+ continue;
+
+ /*
+ * Skip the worker that is in serialize mode, as they will soon stop
+ * once they finish applying the transaction.
+ */
+ if (winfo->serialize_changes)
+ continue;
+
+ elog(DEBUG1, "distributing schema changes to pa workers");
+
+ if (pa_send_data(winfo, out.len, out.data))
+ continue;
+
+ elog(DEBUG1, "failed to distribute, will stop that worker instead");
+
+ /*
+ * Distribution to this worker failed due to a sending timeout. Wait for
+ * the worker to complete its transaction and then stop it. This is
+ * consistent with the handling of workers in serialize mode (see
+ * pa_free_worker() for details).
+ */
+ pa_wait_for_transaction(winfo->shared->xid);
+
+ pa_get_last_commit_end(winfo->shared->xid, false, NULL);
+
+ logicalrep_pa_worker_stop(winfo);
+
+ workers_stopped = lappend(workers_stopped, winfo);
+ }
+
+ pfree(out.data);
+
+ foreach_ptr(ParallelApplyWorkerInfo, winfo, workers_stopped)
+ pa_free_worker_info(winfo);
+}
+
/*
* Switch to PARTIAL_SERIALIZE mode for the current transaction -- this means
* that the current data and any subsequent data for this transaction will be
@@ -1291,8 +1582,8 @@ pa_wait_for_xact_finish(ParallelApplyWorkerInfo *winfo)
/*
* Wait for the transaction lock to be released. This is required to
- * detect deadlock among leader and parallel apply workers. Refer to the
- * comments atop this file.
+ * detect deadlock among the leader and parallel apply workers. Refer
+ * to the comments atop this file.
*/
pa_lock_transaction(winfo->shared->xid, AccessShareLock);
pa_unlock_transaction(winfo->shared->xid, AccessShareLock);
@@ -1306,6 +1597,7 @@ pa_wait_for_xact_finish(ParallelApplyWorkerInfo *winfo)
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("lost connection to the logical replication parallel apply worker")));
+
}
/*
@@ -1369,6 +1661,9 @@ pa_savepoint_name(Oid suboid, TransactionId xid, char *spname, Size szsp)
void
pa_start_subtrans(TransactionId current_xid, TransactionId top_xid)
{
+ if (!TransactionIdIsValid(top_xid))
+ return;
+
if (current_xid != top_xid &&
!list_member_xid(subxactlist, current_xid))
{
@@ -1625,23 +1920,222 @@ pa_decr_and_wait_stream_block(void)
void
pa_xact_finish(ParallelApplyWorkerInfo *winfo, XLogRecPtr remote_lsn)
{
+ XLogRecPtr local_lsn = InvalidXLogRecPtr;
+ TransactionId pa_remote_xid = winfo->shared->xid;
+
Assert(am_leader_apply_worker());
/*
- * Unlock the shared object lock so that parallel apply worker can
- * continue to receive and apply changes.
+ * Unlock the shared object lock taken for streaming transactions so that
+ * parallel apply worker can continue to receive and apply changes.
*/
- pa_unlock_stream(winfo->shared->xid, AccessExclusiveLock);
+ if (winfo->stream_txn)
+ pa_unlock_stream(winfo->shared->xid, AccessExclusiveLock);
/*
- * Wait for that worker to finish. This is necessary to maintain commit
- * order which avoids failures due to transaction dependencies and
- * deadlocks.
+ * Wait for that worker for streaming transaction to finish. This is
+ * necessary to maintain commit order which avoids failures due to
+ * transaction dependencies and deadlocks.
+ *
+ * For a non-streaming transaction in partial serialize mode, wait for the
+ * worker to stop as well, since the worker cannot be reused anymore (see
+ * pa_free_worker() for details).
*/
- pa_wait_for_xact_finish(winfo);
+ if (winfo->serialize_changes || winfo->stream_txn)
+ {
+ pa_wait_for_xact_finish(winfo);
+
+ local_lsn = winfo->shared->last_commit_end;
+ pa_remote_xid = InvalidTransactionId;
+
+ pa_free_worker(winfo);
+ }
if (XLogRecPtrIsValid(remote_lsn))
- store_flush_position(remote_lsn, winfo->shared->last_commit_end);
+ store_flush_position(remote_lsn, local_lsn, pa_remote_xid);
+
+ pa_set_stream_apply_worker(NULL);
+}
+
+bool
+pa_transaction_committed(TransactionId xid)
+{
+ bool found;
+ ParallelApplyWorkerEntry *entry;
+
+ Assert(TransactionIdIsValid(xid));
+
+ /* Find an entry for the requested transaction */
+ entry = hash_search(ParallelApplyTxnHash, &xid, HASH_FIND, &found);
+
+ if (!found)
+ return true;
+
+ if (!entry->winfo)
+ return true;
+
+ return pa_get_xact_state(entry->winfo->shared) == PARALLEL_TRANS_FINISHED;
+}
+
+/*
+ * Attach to the shared hash table for parallelized transactions.
+ */
+static void
+pa_attach_parallelized_txn_hash(dsa_handle *pa_dsa_handle,
+ dshash_table_handle *pa_dshash_handle)
+{
+ MemoryContext oldctx;
+
+ if (parallelized_txns)
+ {
+ Assert(parallel_apply_dsa_area);
+ *pa_dsa_handle = dsa_get_handle(parallel_apply_dsa_area);
+ *pa_dshash_handle = dshash_get_hash_table_handle(parallelized_txns);
+ return;
+ }
+
+ /* Be sure any local memory allocated by DSA routines is persistent. */
+ oldctx = MemoryContextSwitchTo(ApplyContext);
+
+ if (am_leader_apply_worker())
+ {
+ /* Initialize dynamic shared hash table for parallelized transactions. */
+ parallel_apply_dsa_area = dsa_create(LWTRANCHE_PARALLEL_APPLY_DSA);
+ dsa_pin(parallel_apply_dsa_area);
+ dsa_pin_mapping(parallel_apply_dsa_area);
+ parallelized_txns = dshash_create(parallel_apply_dsa_area, &dsh_params, NULL);
+
+ /* Store handles in shared memory for other backends to use. */
+ *pa_dsa_handle = dsa_get_handle(parallel_apply_dsa_area);
+ *pa_dshash_handle = dshash_get_hash_table_handle(parallelized_txns);
+ }
+ else if (am_parallel_apply_worker())
+ {
+ /* Attach to existing dynamic shared hash table. */
+ parallel_apply_dsa_area = dsa_attach(MyParallelShared->parallel_apply_dsa_handle);
+ dsa_pin_mapping(parallel_apply_dsa_area);
+ parallelized_txns = dshash_attach(parallel_apply_dsa_area, &dsh_params,
+ MyParallelShared->parallelized_txns_handle,
+ NULL);
+ }
+
+ MemoryContextSwitchTo(oldctx);
+}
+
+/*
+ * Record in-progress transactions from the given list that are being depended
+ * on into the shared hash table.
+ */
+void
+pa_record_dependency_on_transactions(List *depends_on_xids)
+{
+ foreach_xid(xid, depends_on_xids)
+ {
+ bool found;
+ ParallelApplyWorkerEntry *winfo_entry;
+ ParallelApplyWorkerInfo *winfo;
+ ParallelizedTxnEntry *txn_entry;
+
+ winfo_entry = hash_search(ParallelApplyTxnHash, &xid, HASH_FIND, &found);
+ winfo = winfo_entry->winfo;
+
+ txn_entry = dshash_find_or_insert(parallelized_txns, &xid, &found);
+
+ /*
+ * If the transaction has been committed now, remove the entry,
+ * otherwise the parallel apply worker will remove the entry once
+ * committed the transaction.
+ */
+ if (pa_get_xact_state(winfo->shared) == PARALLEL_TRANS_FINISHED)
+ dshash_delete_entry(parallelized_txns, txn_entry);
+ else
+ dshash_release_lock(parallelized_txns, txn_entry);
+ }
+}
+
+/*
+ * Mark the transaction state as finished and remove the shared hash entry.
+ */
+void
+pa_commit_transaction(void)
+{
+ TransactionId xid = MyParallelShared->xid;
+
+ SpinLockAcquire(&MyParallelShared->mutex);
+ MyParallelShared->xact_state = PARALLEL_TRANS_FINISHED;
+ SpinLockRelease(&MyParallelShared->mutex);
+
+ dshash_delete_key(parallelized_txns, &xid);
+ elog(DEBUG1, "depended xid %u committed", xid);
+}
+
+/*
+ * Wait for the given transaction to finish.
+ */
+void
+pa_wait_for_depended_transaction(TransactionId xid)
+{
+ elog(DEBUG1, "wait for depended xid %u", xid);
+
+ for (;;)
+ {
+ ParallelizedTxnEntry *txn_entry;
+
+ txn_entry = dshash_find(parallelized_txns, &xid, false);
+
+ /* The entry is removed only if the transaction is committed */
+ if (txn_entry == NULL)
+ break;
+
+ dshash_release_lock(parallelized_txns, txn_entry);
+
+ pa_lock_transaction(xid, AccessShareLock);
+ pa_unlock_transaction(xid, AccessShareLock);
+
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ elog(DEBUG1, "finish waiting for depended xid %u", xid);
+}
+
+/*
+ * Write internal relation description to the output stream.
+ */
+static void
+write_internal_relation(StringInfo s, LogicalRepRelation *rel)
+{
+ pq_sendbyte(s, PARALLEL_APPLY_INTERNAL_MESSAGE);
+ pq_sendbyte(s, LOGICAL_REP_MSG_INTERNAL_RELATION);
+
+ if (rel)
+ {
+ pq_sendint(s, 1, 4);
+ logicalrep_write_internal_rel(s, rel);
+ }
+ else
+ {
+ pq_sendint(s, logicalrep_get_num_rels(), 4);
+ logicalrep_write_all_rels(s);
+ }
+ }
+
+/*
+ * Register a transaction to the shared hash table.
+ *
+ * This function is intended to be called during the commit phase of
+ * non-streamed transactions. Other parallel workers may wait on the added
+ * entry; it is removed when the transaction commits.
+ */
+void
+pa_add_parallelized_transaction(TransactionId xid)
+{
+ bool found;
+ ParallelizedTxnEntry *txn_entry;
+
+ Assert(parallelized_txns);
+ Assert(TransactionIdIsValid(xid));
+
+ txn_entry = dshash_find_or_insert(parallelized_txns, &xid, &found);
- pa_free_worker(winfo);
+ dshash_release_lock(parallelized_txns, txn_entry);
}
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index f0a913892b9..73a1bd36963 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -691,6 +691,44 @@ logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel,
logicalrep_write_attrs(out, rel, columns, include_gencols_type);
}
+/*
+ * Write internal relation description to the output stream.
+ */
+void
+logicalrep_write_internal_rel(StringInfo out, LogicalRepRelation *rel)
+{
+ pq_sendint32(out, rel->remoteid);
+
+ /* Write relation name */
+ pq_sendstring(out, rel->nspname);
+ pq_sendstring(out, rel->relname);
+
+ /* Write the replica identity. */
+ pq_sendbyte(out, rel->replident);
+
+ /* Write attribute description */
+ pq_sendint16(out, rel->natts);
+
+ for (int i = 0; i < rel->natts; i++)
+ {
+ uint8 flags = 0;
+
+ if (bms_is_member(i, rel->attkeys))
+ flags |= LOGICALREP_IS_REPLICA_IDENTITY;
+
+ pq_sendbyte(out, flags);
+
+ /* attribute name */
+ pq_sendstring(out, rel->attnames[i]);
+
+ /* attribute type id */
+ pq_sendint32(out, rel->atttyps[i]);
+
+ /* ignore attribute mode for now */
+ pq_sendint32(out, 0);
+ }
+}
+
/*
* Read the relation info from stream and return as LogicalRepRelation.
*/
@@ -1253,6 +1291,10 @@ logicalrep_message_type(LogicalRepMsgType action)
return "STREAM ABORT";
case LOGICAL_REP_MSG_STREAM_PREPARE:
return "STREAM PREPARE";
+ case LOGICAL_REP_MSG_INTERNAL_DEPENDENCY:
+ return "INTERNAL DEPENDENCY";
+ case LOGICAL_REP_MSG_INTERNAL_RELATION:
+ return "INTERNAL RELATION";
}
/*
diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index 745fd3bab64..34375df3a4b 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -958,3 +958,58 @@ FindLogicalRepLocalIndex(Relation localrel, LogicalRepRelation *remoterel,
return InvalidOid;
}
+
+/*
+ * Get the number of entries in the LogicalRepRelMap.
+ */
+int
+logicalrep_get_num_rels(void)
+{
+ if (LogicalRepRelMap == NULL)
+ return 0;
+
+ return hash_get_num_entries(LogicalRepRelMap);
+}
+
+/*
+ * Write all the remote relation information from the LogicalRepRelMapEntry to
+ * the output stream.
+ */
+void
+logicalrep_write_all_rels(StringInfo out)
+{
+ LogicalRepRelMapEntry *entry;
+ HASH_SEQ_STATUS status;
+
+ if (LogicalRepRelMap == NULL)
+ return;
+
+ hash_seq_init(&status, LogicalRepRelMap);
+
+ while ((entry = (LogicalRepRelMapEntry *) hash_seq_search(&status)) != NULL)
+ logicalrep_write_internal_rel(out, &entry->remoterel);
+}
+
+/*
+ * Get the LogicalRepRelMapEntry corresponding to the given relid without
+ * opening the local relation.
+ */
+LogicalRepRelMapEntry *
+logicalrep_get_relentry(LogicalRepRelId remoteid)
+{
+ LogicalRepRelMapEntry *entry;
+ bool found;
+
+ if (LogicalRepRelMap == NULL)
+ logicalrep_relmap_init();
+
+ /* Search for existing entry. */
+ entry = hash_search(LogicalRepRelMap, (void *) &remoteid,
+ HASH_FIND, &found);
+
+ if (!found)
+ elog(DEBUG1, "no relation map entry for remote relation ID %u",
+ remoteid);
+
+ return entry;
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 93970c6af29..52a8e8df486 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -303,6 +303,7 @@ typedef struct FlushPosition
dlist_node node;
XLogRecPtr local_end;
XLogRecPtr remote_end;
+ TransactionId pa_remote_xid;
} FlushPosition;
static dlist_head lsn_mapping = DLIST_STATIC_INIT(lsn_mapping);
@@ -483,6 +484,8 @@ static List *on_commit_wakeup_workers_subids = NIL;
bool in_remote_transaction = false;
static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
+static TransactionId remote_xid = InvalidTransactionId;
+static TransactionId last_remote_xid = InvalidTransactionId;
/* fields valid only when processing streamed transaction */
static bool in_streamed_transaction = false;
@@ -544,6 +547,49 @@ typedef struct ApplySubXactData
static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};
+typedef struct ReplicaIdentityKey
+{
+ Oid relid;
+ LogicalRepTupleData *data;
+} ReplicaIdentityKey;
+
+typedef struct ReplicaIdentityEntry
+{
+ ReplicaIdentityKey *keydata;
+ TransactionId remote_xid;
+
+ /* needed for simplehash */
+ uint32 hash;
+ char status;
+} ReplicaIdentityEntry;
+
+#include "common/hashfn.h"
+
+static uint32 hash_replica_identity(ReplicaIdentityKey *key);
+static bool hash_replica_identity_compare(ReplicaIdentityKey *a,
+ ReplicaIdentityKey *b);
+
+/* Define parameters for replica identity hash table code generation. */
+#define SH_PREFIX replica_identity
+#define SH_ELEMENT_TYPE ReplicaIdentityEntry
+#define SH_KEY_TYPE ReplicaIdentityKey *
+#define SH_KEY keydata
+#define SH_HASH_KEY(tb, key) hash_replica_identity(key)
+#define SH_EQUAL(tb, a, b) hash_replica_identity_compare(a, b)
+#define SH_STORE_HASH
+#define SH_GET_HASH(tb, a) (a)->hash
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+#define REPLICA_IDENTITY_INITIAL_SIZE 128
+#define REPLICA_IDENTITY_CLEANUP_THRESHOLD 1024
+
+static replica_identity_hash *replica_identity_table = NULL;
+
+static void write_internal_dependencies(StringInfo s, List *depends_on_xids);
+
static inline void subxact_filename(char *path, Oid subid, TransactionId xid);
static inline void changes_filename(char *path, Oid subid, TransactionId xid);
@@ -558,11 +604,7 @@ static inline void cleanup_subxact_info(void);
/*
* Serialize and deserialize changes for a toplevel transaction.
*/
-static void stream_open_file(Oid subid, TransactionId xid,
- bool first_segment);
static void stream_write_change(char action, StringInfo s);
-static void stream_open_and_write_change(TransactionId xid, char action, StringInfo s);
-static void stream_close_file(void);
static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
@@ -629,6 +671,589 @@ static TransApplyAction get_transaction_apply_action(TransactionId xid,
static void replorigin_reset(int code, Datum arg);
+static bool send_internal_dependencies(ParallelApplyWorkerInfo *winfo,
+ StringInfo s);
+
+static bool build_dependency_with_last_committed_txn(ParallelApplyWorkerInfo *winfo);
+
+/*
+ * Compute the hash value for entries in the replica_identity_table.
+ */
+static uint32
+hash_replica_identity(ReplicaIdentityKey *key)
+{
+ int i;
+ uint32 hashkey = 0;
+
+ hashkey = hash_combine(hashkey, hash_uint32(key->relid));
+
+ for (i = 0; i < key->data->ncols; i++)
+ {
+ uint32 hkey;
+
+ if (key->data->colstatus[i] == LOGICALREP_COLUMN_NULL)
+ continue;
+
+ hkey = hash_any((const unsigned char *) key->data->colvalues[i].data,
+ key->data->colvalues[i].len);
+ hashkey = hash_combine(hashkey, hkey);
+ }
+
+ return hashkey;
+}
+
+/*
+ * Compare two entries in the replica_identity_table.
+ */
+static bool
+hash_replica_identity_compare(ReplicaIdentityKey *a, ReplicaIdentityKey *b)
+{
+ if (a->relid != b->relid ||
+ a->data->ncols != b->data->ncols)
+ return false;
+
+ for (int i = 0; i < a->data->ncols; i++)
+ {
+ if (a->data->colstatus[i] != b->data->colstatus[i])
+ return false;
+
+ if (a->data->colvalues[i].len != b->data->colvalues[i].len)
+ return false;
+
+ if (strcmp(a->data->colvalues[i].data, b->data->colvalues[i].data))
+ return false;
+
+ elog(DEBUG1, "conflicting key %s", a->data->colvalues[i].data);
+ }
+
+ return true;
+}
+
+/*
+ * Free resources associated with a replica identity key.
+ */
+static void
+free_replica_identity_key(ReplicaIdentityKey *key)
+{
+ Assert(key);
+
+ pfree(key->data->colvalues);
+ pfree(key->data->colstatus);
+ pfree(key->data);
+ pfree(key);
+}
+
+/*
+ * Clean up hash table entries associated with the given transaction IDs.
+ */
+static void
+cleanup_replica_identity_table(List *committed_xid)
+{
+ replica_identity_iterator i;
+ ReplicaIdentityEntry *rientry;
+
+ if (!committed_xid)
+ return;
+
+ replica_identity_start_iterate(replica_identity_table, &i);
+ while ((rientry = replica_identity_iterate(replica_identity_table, &i)) != NULL)
+ {
+ if (!list_member_xid(committed_xid, rientry->remote_xid))
+ continue;
+
+ /* Clean up the hash entry for committed transaction */
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+ }
+}
+
+/*
+ * Check committed transactions and clean up corresponding entries in the hash
+ * table.
+ */
+static void
+cleanup_committed_replica_identity_entries(void)
+{
+ dlist_mutable_iter iter;
+ List *committed_xids = NIL;
+
+ dlist_foreach_modify(iter, &lsn_mapping)
+ {
+ FlushPosition *pos =
+ dlist_container(FlushPosition, node, iter.cur);
+ bool skipped_write;
+
+ if (!TransactionIdIsValid(pos->pa_remote_xid) ||
+ !XLogRecPtrIsInvalid(pos->local_end))
+ continue;
+
+ pos->local_end = pa_get_last_commit_end(pos->pa_remote_xid, true,
+ &skipped_write);
+
+ elog(DEBUG1,
+ "got commit end from parallel apply worker, "
+ "txn: %u, remote_end %X/%X, local_end %X/%X",
+ pos->pa_remote_xid, LSN_FORMAT_ARGS(pos->remote_end),
+ LSN_FORMAT_ARGS(pos->local_end));
+
+ if (!skipped_write && XLogRecPtrIsInvalid(pos->local_end))
+ continue;
+
+ committed_xids = lappend_xid(committed_xids, pos->pa_remote_xid);
+ }
+
+ /* cleanup the entries for committed transactions */
+ cleanup_replica_identity_table(committed_xids);
+}
+
+/*
+ * Append a transaction dependency, excluding duplicates and committed
+ * transactions.
+ */
+static List *
+check_and_append_xid_dependency(List *depends_on_xids,
+ TransactionId *depends_on_xid,
+ TransactionId current_xid)
+{
+ Assert(depends_on_xid);
+
+ if (!TransactionIdIsValid(*depends_on_xid))
+ return depends_on_xids;
+
+ if (TransactionIdEquals(*depends_on_xid, current_xid))
+ return depends_on_xids;
+
+ if (list_member_xid(depends_on_xids, *depends_on_xid))
+ return depends_on_xids;
+
+ /*
+ * Return and reset the xid if the transaction has been committed.
+ */
+ if (pa_transaction_committed(*depends_on_xid))
+ {
+ *depends_on_xid = InvalidTransactionId;
+ return depends_on_xids;
+ }
+
+ return lappend_xid(depends_on_xids, *depends_on_xid);
+}
+
+/*
+ * Check for dependencies on preceding transactions that modify the same key.
+ * Returns the dependent transactions in 'depends_on_xids' and records the
+ * current change.
+ */
+static void
+check_dependency_on_replica_identity(Oid relid,
+ LogicalRepTupleData *original_data,
+ TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ LogicalRepRelMapEntry *relentry;
+ LogicalRepTupleData *ridata;
+ ReplicaIdentityKey *rikey;
+ ReplicaIdentityEntry *rientry;
+ MemoryContext oldctx;
+ int n_ri;
+ bool found = false;
+
+ Assert(depends_on_xids);
+
+ /* Search for existing entry */
+ relentry = logicalrep_get_relentry(relid);
+
+ Assert(relentry);
+
+ /*
+ * First search whether any previous transaction has affected the whole
+ * table e.g., truncate or schema change from publisher.
+ */
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &relentry->last_depended_xid,
+ new_depended_xid);
+
+ n_ri = bms_num_members(relentry->remoterel.attkeys);
+
+ /*
+ * Return if there are no replica identity columns, indicating that the
+ * remote relation has neither a replica identity key nor is marked as
+ * replica identity full.
+ */
+ if (!n_ri)
+ return;
+
+ /* Check if the RI key value of the tuple is invalid */
+ for (int i = 0; i < original_data->ncols; i++)
+ {
+ if (!bms_is_member(i, relentry->remoterel.attkeys))
+ continue;
+
+ /*
+ * Return if RI key is NULL or is explicitly marked unchanged. The key
+ * value could be NULL in the new tuple of an update operation, which
+ * means the RI key is not updated.
+ */
+ if (original_data->colstatus[i] == LOGICALREP_COLUMN_NULL ||
+ original_data->colstatus[i] == LOGICALREP_COLUMN_UNCHANGED)
+ return;
+ }
+
+ oldctx = MemoryContextSwitchTo(ApplyContext);
+
+ /* Allocate space for replica identity values */
+ ridata = palloc0_object(LogicalRepTupleData);
+ ridata->colvalues = palloc0_array(StringInfoData, n_ri);
+ ridata->colstatus = palloc0_array(char, n_ri);
+ ridata->ncols = n_ri;
+
+ for (int i_original = 0, i_ri = 0; i_original < original_data->ncols; i_original++)
+ {
+ StringInfo original_colvalue = &original_data->colvalues[i_original];
+
+ if (!bms_is_member(i_original, relentry->remoterel.attkeys))
+ continue;
+
+ initStringInfoExt(&ridata->colvalues[i_ri], original_colvalue->len + 1);
+ appendStringInfoString(&ridata->colvalues[i_ri], original_colvalue->data);
+ ridata->colstatus[i_ri] = original_data->colstatus[i_original];
+ i_ri++;
+ }
+
+ rikey = palloc0_object(ReplicaIdentityKey);
+ rikey->relid = relid;
+ rikey->data = ridata;
+
+ if (TransactionIdIsValid(new_depended_xid))
+ {
+ rientry = replica_identity_insert(replica_identity_table, rikey,
+ &found);
+
+ /*
+ * Release the key built to search the entry, if the entry already
+ * exists. Otherwise, initialize the remote_xid.
+ */
+ if (found)
+ {
+ elog(DEBUG1, "found conflicting replica identity change from %u",
+ rientry->remote_xid);
+
+ free_replica_identity_key(rikey);
+ }
+ else
+ rientry->remote_xid = InvalidTransactionId;
+ }
+ else
+ {
+ rientry = replica_identity_lookup(replica_identity_table, rikey);
+ free_replica_identity_key(rikey);
+ }
+
+ MemoryContextSwitchTo(oldctx);
+
+ /* Return if no entry found */
+ if (!rientry)
+ return;
+
+ Assert(!found || TransactionIdIsValid(rientry->remote_xid));
+
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &rientry->remote_xid,
+ new_depended_xid);
+
+ /*
+ * Update the entry with the new depended-on xid if it is valid. The new xid
+ * can be invalid if the transaction will be applied by the leader itself,
+ * which means all the changes will be committed before processing the next
+ * transaction, so nothing needs to depend on it.
+ */
+ if (TransactionIdIsValid(new_depended_xid))
+ rientry->remote_xid = new_depended_xid;
+
+ /*
+ * Remove the entry if the transaction has been committed and no new
+ * dependency needs to be added.
+ */
+ else if (!TransactionIdIsValid(rientry->remote_xid))
+ {
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+ }
+}
+
+/*
+ * Check for preceding transactions that involve insert, delete, or update
+ * operations on the specified table, and return them in 'depends_on_xids'.
+ */
+static void
+find_all_dependencies_on_rel(LogicalRepRelId relid, TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ replica_identity_iterator i;
+ ReplicaIdentityEntry *rientry;
+
+ Assert(depends_on_xids);
+
+ replica_identity_start_iterate(replica_identity_table, &i);
+ while ((rientry = replica_identity_iterate(replica_identity_table, &i)) != NULL)
+ {
+ Assert(TransactionIdIsValid(rientry->remote_xid));
+
+ if (rientry->keydata->relid != relid)
+ continue;
+
+ /* Clean up the hash entry for committed transaction while on it */
+ if (pa_transaction_committed(rientry->remote_xid))
+ {
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+
+ continue;
+ }
+
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &rientry->remote_xid,
+ new_depended_xid);
+ }
+}
+
+/*
+ * Check for any preceding transactions that affect the given table and returns
+ * them in 'depends_on_xids'.
+ */
+static void
+check_dependency_on_rel(LogicalRepRelId relid, TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ LogicalRepRelMapEntry *relentry;
+
+ Assert(depends_on_xids);
+
+ find_all_dependencies_on_rel(relid, new_depended_xid, depends_on_xids);
+
+ /* Search for existing entry */
+ relentry = logicalrep_get_relentry(relid);
+
+ /*
+ * The relentry has not been initialized yet, indicating no change has
+ * been applied yet.
+ */
+ if (!relentry)
+ return;
+
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &relentry->last_depended_xid,
+ new_depended_xid);
+
+ if (TransactionIdIsValid(new_depended_xid))
+ relentry->last_depended_xid = new_depended_xid;
+}
+
+/*
+ * Check dependencies related to the current change by determining if the
+ * modification impacts the same row or table as another ongoing transaction. If
+ * needed, instruct parallel apply workers to wait for these preceding
+ * transactions to complete.
+ *
+ * Simultaneously, track the dependency for the current change to ensure that
+ * subsequent transactions address this dependency.
+ */
+static void
+handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
+ TransactionId new_depended_xid,
+ ParallelApplyWorkerInfo *winfo)
+{
+ LogicalRepRelId relid;
+ LogicalRepTupleData oldtup;
+ LogicalRepTupleData newtup;
+ LogicalRepRelation *rel;
+ List *depends_on_xids = NIL;
+ List *remote_relids;
+ bool has_oldtup = false;
+ bool cascade = false;
+ bool restart_seqs = false;
+ StringInfoData dependencies;
+
+ /*
+ * Parse the data using a local copy instead of directly consuming
+ * the given remote change, as the caller may also read the data from the
+ * remote message.
+ */
+ StringInfoData change = *s;
+
+ /* Compute dependency only for non-streaming transaction */
+ if (in_streamed_transaction || (winfo && winfo->stream_txn))
+ return;
+
+ /* Only the leader checks dependencies and schedules the parallel apply */
+ if (!am_leader_apply_worker())
+ return;
+
+ if (!replica_identity_table)
+ replica_identity_table = replica_identity_create(ApplyContext,
+ REPLICA_IDENTITY_INITIAL_SIZE,
+ NULL);
+
+ if (replica_identity_table->members >= REPLICA_IDENTITY_CLEANUP_THRESHOLD)
+ cleanup_committed_replica_identity_entries();
+
+ switch (action)
+ {
+ case LOGICAL_REP_MSG_INSERT:
+ relid = logicalrep_read_insert(&change, &newtup);
+ check_dependency_on_replica_identity(relid, &newtup,
+ new_depended_xid,
+ &depends_on_xids);
+ break;
+
+ case LOGICAL_REP_MSG_UPDATE:
+ relid = logicalrep_read_update(&change, &has_oldtup, &oldtup,
+ &newtup);
+
+ if (has_oldtup)
+ check_dependency_on_replica_identity(relid, &oldtup,
+ new_depended_xid,
+ &depends_on_xids);
+
+ check_dependency_on_replica_identity(relid, &newtup,
+ new_depended_xid,
+ &depends_on_xids);
+ break;
+
+ case LOGICAL_REP_MSG_DELETE:
+ relid = logicalrep_read_delete(&change, &oldtup);
+ check_dependency_on_replica_identity(relid, &oldtup,
+ new_depended_xid,
+ &depends_on_xids);
+ break;
+
+ case LOGICAL_REP_MSG_TRUNCATE:
+ remote_relids = logicalrep_read_truncate(&change, &cascade,
+ &restart_seqs);
+
+ /*
+ * Truncate affects all rows in a table, so the current
+ * transaction should wait for all preceding transactions that
+ * modified the same table.
+ */
+ foreach_int(truncated_relid, remote_relids)
+ check_dependency_on_rel(truncated_relid, new_depended_xid,
+ &depends_on_xids);
+
+ break;
+
+ case LOGICAL_REP_MSG_RELATION:
+ rel = logicalrep_read_rel(&change);
+
+ /*
+ * The replica identity key could be changed, making existing
+ * entries in the replica identity invalid. In this case, parallel
+ * apply is not allowed on this specific table until all running
+ * transactions that modified it have finished.
+ */
+ check_dependency_on_rel(rel->remoteid, new_depended_xid,
+ &depends_on_xids);
+ break;
+
+ case LOGICAL_REP_MSG_TYPE:
+ case LOGICAL_REP_MSG_MESSAGE:
+
+ /*
+ * Type updates accompany relation updates, so dependencies have
+ * already been checked during relation updates. Logical messages
+ * do not conflict with any changes, so they can be ignored.
+ */
+ break;
+
+ default:
+ Assert(false);
+ break;
+ }
+
+ if (!depends_on_xids)
+ return;
+
+ /*
+ * Notify the transactions that they are dependent on the current
+ * transaction.
+ */
+ pa_record_dependency_on_transactions(depends_on_xids);
+
+ /*
+ * If the leader applies the transaction itself, start waiting for
+ * transactions that depend on the current transaction immediately.
+ */
+ if (winfo == NULL)
+ {
+ foreach_xid(xid, depends_on_xids)
+ pa_wait_for_depended_transaction(xid);
+
+ return;
+ }
+
+ initStringInfo(&dependencies);
+
+ /* Build the dependency message used to send to parallel apply worker */
+ write_internal_dependencies(&dependencies, depends_on_xids);
+
+ (void) send_internal_dependencies(winfo, &dependencies);
+}
+
+/*
+ * Write internal dependency information to the output for the parallel apply
+ * worker.
+ */
+static void
+write_internal_dependencies(StringInfo s, List *depends_on_xids)
+{
+ pq_sendbyte(s, PARALLEL_APPLY_INTERNAL_MESSAGE);
+ pq_sendbyte(s, LOGICAL_REP_MSG_INTERNAL_DEPENDENCY);
+ pq_sendint32(s, list_length(depends_on_xids));
+
+ foreach_xid(xid, depends_on_xids)
+ pq_sendint32(s, xid);
+}
+
+/*
+ * Handle internal dependency information.
+ *
+ * Wait for all transactions listed in the message to commit.
+ */
+static void
+apply_handle_internal_dependency(StringInfo s)
+{
+ int nxids = pq_getmsgint(s, 4);
+
+ for (int i = 0; i < nxids; i++)
+ {
+ TransactionId xid = pq_getmsgint(s, 4);
+
+ pa_wait_for_depended_transaction(xid);
+ }
+}
+
+/*
+ * Handle internal relation information.
+ *
+ * Update all relation details in the relation map cache.
+ */
+static void
+apply_handle_internal_relation(StringInfo s)
+{
+ int num_rels;
+
+ num_rels = pq_getmsgint(s, 4);
+
+ for (int i = 0; i < num_rels; i++)
+ {
+ LogicalRepRelation *rel = logicalrep_read_rel(s);
+
+ logicalrep_relmap_update(rel);
+
+ elog(DEBUG1, "parallel apply worker worker init relmap for %s",
+ rel->relname);
+ }
+}
+
/*
* Form the origin name for the subscription.
*
@@ -781,13 +1406,18 @@ handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
TransApplyAction apply_action;
StringInfoData original_msg;
- apply_action = get_transaction_apply_action(stream_xid, &winfo);
+ Assert(!in_streamed_transaction || TransactionIdIsValid(stream_xid));
+
+ apply_action = get_transaction_apply_action(in_streamed_transaction
+ ? stream_xid : remote_xid,
+ &winfo);
/* not in streaming mode */
if (apply_action == TRANS_LEADER_APPLY)
+ {
+ handle_dependency_on_change(action, s, InvalidTransactionId, winfo);
return false;
-
- Assert(TransactionIdIsValid(stream_xid));
+ }
/*
* The parallel apply worker needs the xid in this message to decide
@@ -799,15 +1429,28 @@ handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
/*
* We should have received XID of the subxact as the first part of the
- * message, so extract it.
+ * message in streaming transactions, so extract it.
*/
- current_xid = pq_getmsgint(s, 4);
+ if (in_streamed_transaction)
+ current_xid = pq_getmsgint(s, 4);
+ else
+ current_xid = remote_xid;
if (!TransactionIdIsValid(current_xid))
ereport(ERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
errmsg_internal("invalid transaction ID in streamed replication transaction")));
+ handle_dependency_on_change(action, s, current_xid, winfo);
+
+ /*
+ * Re-fetch the latest apply action as it might have been changed during
+ * the dependency check.
+ */
+ apply_action = get_transaction_apply_action(in_streamed_transaction
+ ? stream_xid : remote_xid,
+ &winfo);
+
switch (apply_action)
{
case TRANS_LEADER_SERIALIZE:
@@ -1211,22 +1854,112 @@ static void
apply_handle_begin(StringInfo s)
{
LogicalRepBeginData begin_data;
+ ParallelApplyWorkerInfo *winfo;
+ TransApplyAction apply_action;
/* There must not be an active streaming transaction. */
Assert(!TransactionIdIsValid(stream_xid));
logicalrep_read_begin(s, &begin_data);
- set_apply_error_context_xact(begin_data.xid, begin_data.final_lsn);
+
+ remote_xid = begin_data.xid;
+
+ set_apply_error_context_xact(remote_xid, begin_data.final_lsn);
remote_final_lsn = begin_data.final_lsn;
maybe_start_skipping_changes(begin_data.final_lsn);
+ pa_allocate_worker(remote_xid, false);
+
+ apply_action = get_transaction_apply_action(remote_xid, &winfo);
+
+ elog(DEBUG1, "new remote_xid %u", remote_xid);
+ switch (apply_action)
+ {
+ case TRANS_LEADER_APPLY:
+ break;
+
+ case TRANS_LEADER_SEND_TO_PARALLEL:
+ Assert(winfo);
+ pa_send_data(winfo, s->len, s->data);
+ pa_set_stream_apply_worker(winfo);
+ break;
+
+ case TRANS_PARALLEL_APPLY:
+ /* Hold the lock until the end of the transaction. */
+ pa_lock_transaction(MyParallelShared->xid, AccessExclusiveLock);
+ pa_set_xact_state(MyParallelShared, PARALLEL_TRANS_STARTED);
+ break;
+
+ default:
+ elog(ERROR, "unexpected apply action: %d", (int) apply_action);
+ break;
+ }
+
in_remote_transaction = true;
pgstat_report_activity(STATE_RUNNING, NULL);
}
+/*
+ * Send an INTERNAL_DEPENDENCY message to a parallel apply worker.
+ *
+ * Returns false if we switched to the serialize mode to send the message,
+ * true otherwise.
+ */
+static bool
+send_internal_dependencies(ParallelApplyWorkerInfo *winfo, StringInfo s)
+{
+ Assert(s->data[0] == PARALLEL_APPLY_INTERNAL_MESSAGE);
+ Assert(s->data[1] == LOGICAL_REP_MSG_INTERNAL_DEPENDENCY);
+
+ if (!winfo->serialize_changes)
+ {
+ if (pa_send_data(winfo, s->len, s->data))
+ return true;
+
+ pa_switch_to_partial_serialize(winfo, true);
+ }
+
+ /* Skip writing the first internal message flag */
+ s->cursor++;
+ stream_write_change(LOGICAL_REP_MSG_INTERNAL_DEPENDENCY, s);
+
+ return false;
+}
+
+/*
+ * Create a dependency between this transaction and the most recently
+ * committed transaction.
+ *
+ * This function ensures that the commit ordering is preserved across
+ * parallel apply workers. Returns false if we switched to the serialize
+ * mode to send the message, true otherwise.
+ */
+static bool
+build_dependency_with_last_committed_txn(ParallelApplyWorkerInfo *winfo)
+{
+ StringInfoData dependency_msg;
+ bool ret;
+
+ /* Skip if no transaction has been applied yet */
+ if (!TransactionIdIsValid(last_remote_xid))
+ return true;
+
+ /* Build the dependency message to send to the parallel apply worker */
+ initStringInfo(&dependency_msg);
+
+ pq_sendbyte(&dependency_msg, PARALLEL_APPLY_INTERNAL_MESSAGE);
+ pq_sendbyte(&dependency_msg, LOGICAL_REP_MSG_INTERNAL_DEPENDENCY);
+ pq_sendint32(&dependency_msg, 1);
+ pq_sendint32(&dependency_msg, last_remote_xid);
+
+ ret = send_internal_dependencies(winfo, &dependency_msg);
+
+ pfree(dependency_msg.data);
+ return ret;
+}
+
/*
* Handle COMMIT message.
*
@@ -1236,6 +1969,11 @@ static void
apply_handle_commit(StringInfo s)
{
LogicalRepCommitData commit_data;
+ ParallelApplyWorkerInfo *winfo;
+ TransApplyAction apply_action;
+
+ /* Save the message before it is consumed. */
+ StringInfoData original_msg = *s;
logicalrep_read_commit(s, &commit_data);
@@ -1246,7 +1984,84 @@ apply_handle_commit(StringInfo s)
LSN_FORMAT_ARGS(commit_data.commit_lsn),
LSN_FORMAT_ARGS(remote_final_lsn))));
- apply_handle_commit_internal(&commit_data);
+ apply_action = get_transaction_apply_action(remote_xid, &winfo);
+
+ switch (apply_action)
+ {
+ case TRANS_LEADER_APPLY:
+ apply_handle_commit_internal(&commit_data);
+ break;
+
+ case TRANS_LEADER_SEND_TO_PARALLEL:
+ Assert(winfo);
+
+ /*
+ * Mark this transaction as parallelized. This ensures that
+ * upcoming transactions wait until this transaction is committed.
+ */
+ pa_add_parallelized_transaction(remote_xid);
+
+ /*
+ * Build a dependency between this transaction and the most recently
+ * committed transaction to preserve the commit order, then try to
+ * send the COMMIT message if that succeeds.
+ */
+ if (build_dependency_with_last_committed_txn(winfo) &&
+ pa_send_data(winfo, s->len, s->data))
+ {
+ /* Finish processing the transaction. */
+ pa_xact_finish(winfo, commit_data.end_lsn);
+ break;
+ }
+
+ /*
+ * Switch to serialize mode when we are not able to send the change
+ * to parallel apply worker.
+ */
+ pa_switch_to_partial_serialize(winfo, true);
+/* fall through */
+ case TRANS_LEADER_PARTIAL_SERIALIZE:
+ Assert(winfo);
+
+ stream_open_and_write_change(remote_xid, LOGICAL_REP_MSG_COMMIT,
+ &original_msg);
+
+ pa_set_fileset_state(winfo->shared, FS_SERIALIZE_DONE);
+
+ /* Finish processing the transaction. */
+ pa_xact_finish(winfo, commit_data.end_lsn);
+ break;
+
+ case TRANS_PARALLEL_APPLY:
+
+ /*
+ * If the parallel apply worker is applying spooled messages then
+ * close the file before committing.
+ */
+ if (stream_fd)
+ stream_close_file();
+
+ apply_handle_commit_internal(&commit_data);
+
+ MyParallelShared->last_commit_end = XactLastCommitEnd;
+
+ pa_commit_transaction();
+
+ pa_unlock_transaction(remote_xid, AccessExclusiveLock);
+ break;
+
+ default:
+ elog(ERROR, "unexpected apply action: %d", (int) apply_action);
+ break;
+ }
+
+ /* Cache the remote_xid */
+ last_remote_xid = remote_xid;
+
+ remote_xid = InvalidTransactionId;
+ in_remote_transaction = false;
+
+ elog(DEBUG1, "reset remote_xid %u", remote_xid);
/*
* Process any tables that are being synchronized in parallel, as well as
@@ -1369,7 +2184,8 @@ apply_handle_prepare(StringInfo s)
* XactLastCommitEnd, and adding it for this purpose doesn't seems worth
* it.
*/
- store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr);
+ store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr,
+ InvalidTransactionId);
in_remote_transaction = false;
@@ -1417,6 +2233,8 @@ apply_handle_commit_prepared(StringInfo s)
/* There is no transaction when COMMIT PREPARED is called */
begin_replication_step();
+ /* TODO wait for xid to finish */
+
/*
* Update origin state so we can restart streaming from correct position
* in case of crash.
@@ -1429,7 +2247,8 @@ apply_handle_commit_prepared(StringInfo s)
CommitTransactionCommand();
pgstat_report_stat(false);
- store_flush_position(prepare_data.end_lsn, XactLastCommitEnd);
+ store_flush_position(prepare_data.end_lsn, XactLastCommitEnd,
+ InvalidTransactionId);
in_remote_transaction = false;
/*
@@ -1498,7 +2317,8 @@ apply_handle_rollback_prepared(StringInfo s)
* transaction because we always flush the WAL record for it. See
* apply_handle_prepare.
*/
- store_flush_position(rollback_data.rollback_end_lsn, InvalidXLogRecPtr);
+ store_flush_position(rollback_data.rollback_end_lsn, InvalidXLogRecPtr,
+ InvalidTransactionId);
in_remote_transaction = false;
/*
@@ -1560,7 +2380,8 @@ apply_handle_stream_prepare(StringInfo s)
* It is okay not to set the local_end LSN for the prepare because
* we always flush the prepare record. See apply_handle_prepare.
*/
- store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr);
+ store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr,
+ InvalidTransactionId);
in_remote_transaction = false;
@@ -1754,7 +2575,7 @@ apply_handle_stream_start(StringInfo s)
/* Try to allocate a worker for the streaming transaction. */
if (first_segment)
- pa_allocate_worker(stream_xid);
+ pa_allocate_worker(stream_xid, true);
apply_action = get_transaction_apply_action(stream_xid, &winfo);
@@ -1812,6 +2633,11 @@ apply_handle_stream_start(StringInfo s)
case TRANS_LEADER_PARTIAL_SERIALIZE:
Assert(winfo);
+ /*
+ * TODO: the parallel apply worker could start to wait too soon when
+ * processing an old stream start.
+ */
+
/*
* Open the spool file unless it was already opened when switching
* to serialize mode. The transaction started in
@@ -2429,7 +3255,20 @@ apply_handle_stream_commit(StringInfo s)
case TRANS_LEADER_SEND_TO_PARALLEL:
Assert(winfo);
- if (pa_send_data(winfo, s->len, s->data))
+ /*
+ * Unlike the non-streaming case, there is no need to mark this
+ * transaction as parallelized, because the leader waits until the
+ * streamed transaction is committed, so commit ordering is always
+ * preserved.
+ */
+
+ /*
+ * Build a dependency between this transaction and the most recently
+ * committed transaction to preserve the commit order, then try to
+ * send the COMMIT message if that succeeds.
+ */
+ if (build_dependency_with_last_committed_txn(winfo) &&
+ pa_send_data(winfo, s->len, s->data))
{
/* Finish processing the streaming transaction. */
pa_xact_finish(winfo, commit_data.end_lsn);
@@ -2437,12 +3276,12 @@ apply_handle_stream_commit(StringInfo s)
}
/*
- * Switch to serialize mode when we are not able to send the
- * change to parallel apply worker.
+ * Switch to serialize mode when we are not able to send the change
+ * to parallel apply worker.
*/
pa_switch_to_partial_serialize(winfo, true);
- /* fall through */
+/* fall through */
case TRANS_LEADER_PARTIAL_SERIALIZE:
Assert(winfo);
@@ -2485,6 +3324,9 @@ apply_handle_stream_commit(StringInfo s)
break;
}
+ /* Cache the remote xid */
+ last_remote_xid = xid;
+
/*
* Process any tables that are being synchronized in parallel, as well as
* any newly added tables or sequences.
@@ -2539,7 +3381,8 @@ apply_handle_commit_internal(LogicalRepCommitData *commit_data)
pgstat_report_stat(false);
- store_flush_position(commit_data->end_lsn, XactLastCommitEnd);
+ store_flush_position(commit_data->end_lsn, XactLastCommitEnd,
+ InvalidTransactionId);
}
else
{
@@ -2572,6 +3415,9 @@ apply_handle_relation(StringInfo s)
/* Also reset all entries in the partition map that refer to remoterel. */
logicalrep_partmap_reset_relmap(rel);
+
+ if (am_leader_apply_worker())
+ pa_distribute_schema_changes_to_workers(rel);
}
/*
@@ -3346,6 +4192,8 @@ FindDeletedTupleInLocalRel(Relation localrel, Oid localidxoid,
/*
* This handles insert, update, delete on a partitioned table.
+ *
+ * TODO: support parallel apply.
*/
static void
apply_handle_tuple_routing(ApplyExecutionData *edata,
@@ -3657,6 +4505,8 @@ apply_handle_truncate(StringInfo s)
ListCell *lc;
LOCKMODE lockmode = AccessExclusiveLock;
+ elog(LOG, "truncate");
+
/*
* Quick return if we are skipping data modification changes or handling
* streamed transactions.
@@ -3868,6 +4718,14 @@ apply_dispatch(StringInfo s)
apply_handle_stream_prepare(s);
break;
+ case LOGICAL_REP_MSG_INTERNAL_RELATION:
+ apply_handle_internal_relation(s);
+ break;
+
+ case LOGICAL_REP_MSG_INTERNAL_DEPENDENCY:
+ apply_handle_internal_dependency(s);
+ break;
+
default:
ereport(ERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -3888,6 +4746,10 @@ apply_dispatch(StringInfo s)
* check which entries on it are already locally flushed. Those we can report
* as having been flushed.
*
+ * For non-streaming transactions managed by a parallel apply worker, we will
+ * get the local commit end from the shared parallel apply worker info once the
+ * transaction has been committed by the worker.
+ *
* The have_pending_txes is true if there are outstanding transactions that
* need to be flushed.
*/
@@ -3897,6 +4759,7 @@ get_flush_position(XLogRecPtr *write, XLogRecPtr *flush,
{
dlist_mutable_iter iter;
XLogRecPtr local_flush = GetFlushRecPtr(NULL);
+ List *committed_pa_xid = NIL;
*write = InvalidXLogRecPtr;
*flush = InvalidXLogRecPtr;
@@ -3906,6 +4769,36 @@ get_flush_position(XLogRecPtr *write, XLogRecPtr *flush,
FlushPosition *pos =
dlist_container(FlushPosition, node, iter.cur);
+ if (TransactionIdIsValid(pos->pa_remote_xid) &&
+ XLogRecPtrIsInvalid(pos->local_end))
+ {
+ bool skipped_write;
+
+ pos->local_end = pa_get_last_commit_end(pos->pa_remote_xid, true,
+ &skipped_write);
+
+ elog(DEBUG1,
+ "got commit end from parallel apply worker, "
+ "txn: %u, remote_end %X/%X, local_end %X/%X",
+ pos->pa_remote_xid, LSN_FORMAT_ARGS(pos->remote_end),
+ LSN_FORMAT_ARGS(pos->local_end));
+
+ /*
+ * Break the loop if the worker has not finished applying the
+ * transaction. There's no need to check subsequent transactions,
+ * as they must commit after the current transaction being
+ * examined and thus won't have their commit end available yet.
+ */
+ if (!skipped_write && XLogRecPtrIsInvalid(pos->local_end))
+ break;
+
+ committed_pa_xid = lappend_xid(committed_pa_xid, pos->pa_remote_xid);
+ }
+
+ /*
+ * The worker has finished applying, or the transaction was applied by
+ * the leader apply worker.
+ */
*write = pos->remote_end;
if (pos->local_end <= local_flush)
@@ -3914,29 +4807,19 @@ get_flush_position(XLogRecPtr *write, XLogRecPtr *flush,
dlist_delete(iter.cur);
pfree(pos);
}
- else
- {
- /*
- * Don't want to uselessly iterate over the rest of the list which
- * could potentially be long. Instead get the last element and
- * grab the write position from there.
- */
- pos = dlist_tail_element(FlushPosition, node,
- &lsn_mapping);
- *write = pos->remote_end;
- *have_pending_txes = true;
- return;
- }
}
*have_pending_txes = !dlist_is_empty(&lsn_mapping);
+
+ cleanup_replica_identity_table(committed_pa_xid);
}
/*
* Store current remote/local lsn pair in the tracking list.
*/
void
-store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn)
+store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn,
+ TransactionId remote_xid)
{
FlushPosition *flushpos;
@@ -3954,6 +4837,7 @@ store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn)
flushpos = (FlushPosition *) palloc(sizeof(FlushPosition));
flushpos->local_end = local_lsn;
flushpos->remote_end = remote_lsn;
+ flushpos->pa_remote_xid = remote_xid;
dlist_push_tail(&lsn_mapping, &flushpos->node);
MemoryContextSwitchTo(ApplyMessageContext);
@@ -5401,7 +6285,7 @@ stream_cleanup_files(Oid subid, TransactionId xid)
* changes for this transaction, create the buffile, otherwise open the
* previously created file.
*/
-static void
+void
stream_open_file(Oid subid, TransactionId xid, bool first_segment)
{
char path[MAXPGPATH];
@@ -5446,7 +6330,7 @@ stream_open_file(Oid subid, TransactionId xid, bool first_segment)
* stream_close_file
* Close the currently open file with streamed changes.
*/
-static void
+void
stream_close_file(void)
{
Assert(stream_fd != NULL);
@@ -5494,7 +6378,7 @@ stream_write_change(char action, StringInfo s)
* target file if not already before writing the message and close the file at
* the end.
*/
-static void
+void
stream_open_and_write_change(TransactionId xid, char action, StringInfo s)
{
Assert(!in_streamed_transaction);
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index c1ac71ff7f2..a561f8ff459 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -404,6 +404,7 @@ SubtransSLRU "Waiting to access the sub-transaction SLRU cache."
XactSLRU "Waiting to access the transaction status SLRU cache."
ParallelVacuumDSA "Waiting for parallel vacuum dynamic shared memory allocation."
AioUringCompletion "Waiting for another process to complete IO via io_uring."
+ParallelApplyDSA "Waiting for parallel apply dynamic shared memory allocation."
# No "ABI_compatibility" region here as WaitEventLWLock has its own C code.
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index b261c60d3fa..7d2aaf2d389 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -75,6 +75,8 @@ typedef enum LogicalRepMsgType
LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
LOGICAL_REP_MSG_STREAM_ABORT = 'A',
LOGICAL_REP_MSG_STREAM_PREPARE = 'p',
+ LOGICAL_REP_MSG_INTERNAL_DEPENDENCY = 'd',
+ LOGICAL_REP_MSG_INTERNAL_RELATION = 'i',
} LogicalRepMsgType;
/*
@@ -251,6 +253,8 @@ extern void logicalrep_write_message(StringInfo out, TransactionId xid, XLogRecP
extern void logicalrep_write_rel(StringInfo out, TransactionId xid,
Relation rel, Bitmapset *columns,
PublishGencolsType include_gencols_type);
+extern void logicalrep_write_internal_rel(StringInfo out,
+ LogicalRepRelation *rel);
extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
extern void logicalrep_write_typ(StringInfo out, TransactionId xid,
Oid typoid);
diff --git a/src/include/replication/logicalrelation.h b/src/include/replication/logicalrelation.h
index 7a561a8e8d8..34a7069e9e5 100644
--- a/src/include/replication/logicalrelation.h
+++ b/src/include/replication/logicalrelation.h
@@ -37,6 +37,8 @@ typedef struct LogicalRepRelMapEntry
/* Sync state. */
char state;
XLogRecPtr statelsn;
+
+ TransactionId last_depended_xid;
} LogicalRepRelMapEntry;
extern void logicalrep_relmap_update(LogicalRepRelation *remoterel);
@@ -50,5 +52,8 @@ extern void logicalrep_rel_close(LogicalRepRelMapEntry *rel,
LOCKMODE lockmode);
extern bool IsIndexUsableForReplicaIdentityFull(Relation idxrel, AttrMap *attrmap);
extern Oid GetRelationIdentityOrPK(Relation rel);
+extern int logicalrep_get_num_rels(void);
+extern void logicalrep_write_all_rels(StringInfo out);
+extern LogicalRepRelMapEntry *logicalrep_get_relentry(LogicalRepRelId remoteid);
#endif /* LOGICALRELATION_H */
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index f081619f151..b598a955a6a 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -15,6 +15,7 @@
#include "access/xlogdefs.h"
#include "catalog/pg_subscription.h"
#include "datatype/timestamp.h"
+#include "lib/dshash.h"
#include "miscadmin.h"
#include "replication/logicalrelation.h"
#include "replication/walreceiver.h"
@@ -197,6 +198,9 @@ typedef struct ParallelApplyWorkerShared
*/
PartialFileSetState fileset_state;
FileSet fileset;
+
+ dsa_handle parallel_apply_dsa_handle;
+ dshash_table_handle parallelized_txns_handle;
} ParallelApplyWorkerShared;
/*
@@ -231,6 +235,8 @@ typedef struct ParallelApplyWorkerInfo
*/
bool in_use;
+ bool stream_txn;
+
ParallelApplyWorkerShared *shared;
} ParallelApplyWorkerInfo;
@@ -308,6 +314,10 @@ extern void apply_dispatch(StringInfo s);
extern void maybe_reread_subscription(void);
extern void stream_cleanup_files(Oid subid, TransactionId xid);
+extern void stream_open_file(Oid subid, TransactionId xid, bool first_segment);
+extern void stream_close_file(void);
+extern void stream_open_and_write_change(TransactionId xid, char action,
+ StringInfo s);
extern void set_stream_options(WalRcvStreamOptions *options,
char *slotname,
@@ -321,19 +331,23 @@ extern void SetupApplyOrSyncWorker(int worker_slot);
extern void DisableSubscriptionAndExit(void);
-extern void store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn);
+extern void store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn,
+ TransactionId remote_xid);
/* Function for apply error callback */
extern void apply_error_callback(void *arg);
extern void set_apply_error_context_origin(char *originname);
/* Parallel apply worker setup and interactions */
-extern void pa_allocate_worker(TransactionId xid);
+extern void pa_allocate_worker(TransactionId xid, bool stream_txn);
extern ParallelApplyWorkerInfo *pa_find_worker(TransactionId xid);
+extern XLogRecPtr pa_get_last_commit_end(TransactionId xid, bool delete_entry,
+ bool *skipped_write);
extern void pa_detach_all_error_mq(void);
extern bool pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes,
const void *data);
+extern void pa_distribute_schema_changes_to_workers(LogicalRepRelation *rel);
extern void pa_switch_to_partial_serialize(ParallelApplyWorkerInfo *winfo,
bool stream_locked);
@@ -358,6 +372,12 @@ extern void pa_decr_and_wait_stream_block(void);
extern void pa_xact_finish(ParallelApplyWorkerInfo *winfo,
XLogRecPtr remote_lsn);
+extern bool pa_transaction_committed(TransactionId xid);
+extern void pa_record_dependency_on_transactions(List *depends_on_xids);
+extern void pa_commit_transaction(void);
+extern void pa_wait_for_depended_transaction(TransactionId xid);
+
+extern void pa_add_parallelized_transaction(TransactionId xid);
#define isParallelApplyWorker(worker) ((worker)->in_use && \
(worker)->type == WORKERTYPE_PARALLEL_APPLY)
@@ -366,6 +386,8 @@ extern void pa_xact_finish(ParallelApplyWorkerInfo *winfo,
#define isSequenceSyncWorker(worker) ((worker)->in_use && \
(worker)->type == WORKERTYPE_SEQUENCESYNC)
+#define PARALLEL_APPLY_INTERNAL_MESSAGE 'i'
+
static inline bool
am_tablesync_worker(void)
{
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 5b0ce383408..d68940b02bc 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -136,3 +136,4 @@ PG_LWLOCKTRANCHE(SUBTRANS_SLRU, SubtransSLRU)
PG_LWLOCKTRANCHE(XACT_SLRU, XactSLRU)
PG_LWLOCKTRANCHE(PARALLEL_VACUUM_DSA, ParallelVacuumDSA)
PG_LWLOCKTRANCHE(AIO_URING_COMPLETION, AioUringCompletion)
+PG_LWLOCKTRANCHE(PARALLEL_APPLY_DSA, ParallelApplyDSA)
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 430c1246d14..2caf798ee0a 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -16,6 +16,8 @@ $node_publisher->start;
# Create subscriber node
my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf',
+ "max_logical_replication_workers = 10");
$node_subscriber->start;
# Create some preexisting content on publisher
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index 3d16c2a800d..c2fba0b9a9c 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -17,7 +17,7 @@ $node_publisher->start;
my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_subscriber->init;
$node_subscriber->append_conf('postgresql.conf',
- qq(max_logical_replication_workers = 6));
+ qq(max_logical_replication_workers = 7));
$node_subscriber->start;
my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
diff --git a/src/test/subscription/t/015_stream.pl b/src/test/subscription/t/015_stream.pl
index 03135b1cd6e..e79ddd9a41c 100644
--- a/src/test/subscription/t/015_stream.pl
+++ b/src/test/subscription/t/015_stream.pl
@@ -232,6 +232,12 @@ $node_subscriber->wait_for_log(
$node_publisher->safe_psql('postgres', "INSERT INTO test_tab_2 values(1)");
+# FIXME: Currently, non-streaming transactions are applied in parallel by
+# default. So, the first transaction is handled by a parallel apply worker. To
+# trigger the deadlock, initiate one more transaction to be applied by the
+# leader.
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab_2 values(1)");
+
$h->query_safe('COMMIT');
$h->quit;
@@ -247,7 +253,7 @@ $node_publisher->wait_for_catchup($appname);
$result =
$node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab_2");
-is($result, qq(5001), 'data replicated to subscriber after dropping index');
+is($result, qq(5002), 'data replicated to subscriber after dropping index');
# Clean up test data from the environment.
$node_publisher->safe_psql('postgres', "TRUNCATE TABLE test_tab_2");
diff --git a/src/test/subscription/t/026_stats.pl b/src/test/subscription/t/026_stats.pl
index fc0bcee5187..42ea8584c05 100644
--- a/src/test/subscription/t/026_stats.pl
+++ b/src/test/subscription/t/026_stats.pl
@@ -16,6 +16,7 @@ $node_publisher->start;
# Create subscriber node.
my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf', "max_logical_replication_workers = 10");
$node_subscriber->start;
diff --git a/src/test/subscription/t/027_nosuperuser.pl b/src/test/subscription/t/027_nosuperuser.pl
index 691731743df..18b7542274e 100644
--- a/src/test/subscription/t/027_nosuperuser.pl
+++ b/src/test/subscription/t/027_nosuperuser.pl
@@ -87,6 +87,7 @@ $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_publisher->init(allows_streaming => 'logical');
$node_subscriber->init;
$node_publisher->start;
+$node_subscriber->append_conf('postgresql.conf', "max_logical_replication_workers = 10");
$node_subscriber->start;
$publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
my %remainder_a = (
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 23bce72ae64..ef0ec7b02a2 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2081,6 +2081,7 @@ ParallelTransState
ParallelVacuumState
ParallelWorkerContext
ParallelWorkerInfo
+ParallelizedTxnEntry
Param
ParamCompileHook
ParamExecData
@@ -2550,6 +2551,8 @@ ReparameterizeForeignPathByChild_function
ReplaceVarsFromTargetList_context
ReplaceVarsNoMatchOption
ReplaceWrapOption
+ReplicaIdentityEntry
+ReplicaIdentityKey
ReplicaIdentityStmt
ReplicationKind
ReplicationSlot
@@ -4041,6 +4044,7 @@ remoteDep
remove_nulling_relids_context
rendezvousHashEntry
rep
+replica_identity_hash
replace_rte_variables_callback
replace_rte_variables_context
report_error_fn
--
2.47.3
On Tue, Nov 18, 2025 at 1:46 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
Dear hackers,
I think it is better to preserve the commit order by default - for safety reasons.
Per some discussions on -hackers, I implemented a patch which preserves the
commit ordering of the publisher. Let me clarify from the beginning.

Background
==========
The current patch, say v1, does not preserve the commit ordering of the
publisher node. After the leader worker sends a COMMIT message to a parallel
apply worker, the leader does not wait for the transaction to be applied and
continues reading messages from the publisher node. This means that a parallel
apply worker assigned later may commit earlier, which breaks the commit
ordering of the publisher node.

Proposal
========
We decided to preserve the commit ordering by default so as not to break data
consistency between nodes [1]. The basic idea is that the leader apply worker
caches the remote_xid when it sends the commit record to a parallel apply
worker. Before sending the next commit message to a parallel apply worker, the
leader sends an INTERNAL_DEPENDENCY message carrying the cached xid. The
parallel apply worker reads the DEPENDENCY message and waits until that
transaction finishes. The cached xid is updated after the leader sends each
COMMIT.
This approach requires less code because the DEPENDENCY message was already
introduced by v1, but it increases the number of messages per transaction.
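For illustration, with made-up xids and the function/message names from the
attached patch, the flow for two back-to-back transactions would be:
1. The leader receives BEGIN/changes/COMMIT for remote xid 741, sends them to
   parallel apply worker pa-1, and caches last_remote_xid = 741.
2. The leader receives BEGIN/changes for remote xid 742 and sends them to pa-2.
3. Before sending COMMIT for 742, the leader calls
   build_dependency_with_last_committed_txn(), which sends
   INTERNAL_DEPENDENCY(nxids = 1, xids = {741}) to pa-2.
4. pa-2 handles the message in apply_handle_internal_dependency() and calls
   pa_wait_for_depended_transaction(741), so it commits 742 only after pa-1
   has committed 741.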
It seems you haven't sent the patch that preserves commit order, or the
commit message of the attached patch is wrong. I think the first patch
in the series should be the one that preserves commit order, and then we
can build a patch that tracks dependencies and allows parallelization
without preserving commit order. I feel it may be better to just
discuss the preserve-commit-order patch, which also contains some comments
on how to extend it further; once that is done, we can discuss
the other patch.
--
With Regards,
Amit Kapila.
Dear Amit,
It seems you haven't sent the patch that preserves commit order or the
commit message of the attached patch is wrong. I think the first patch
in series should be the one that preserves commit order and then we
can build a patch that tracks dependencies and allows parallelization
without preserving commit order.
I think I attached the correct file. Since we are trying to preserve the commit
order by default, everything was merged into one patch.
One point to clarify is that dependency tracking is essential even if we fully
preserve the commit ordering, so as not to violate constraints like a PK.
Assuming there is a table with a PK, txn1 inserts a tuple and txn2 updates it.
The UPDATE in txn2 must be applied only after txn1 commits.
I feel it may be better to just
discuss preserve commit order patch that also contains some comments
as to how to extend it further, once that is done, we can do further
discussion of the other patch.
I do agree, let me implement one by one.
Best regards,
Hayato Kuroda
FUJITSU LIMITED
Hello Kuroda-san,
On 11/18/25 12:00, Hayato Kuroda (Fujitsu) wrote:
Dear Amit,
It seems you haven't sent the patch that preserves commit order or the
commit message of the attached patch is wrong. I think the first patch
in series should be the one that preserves commit order and then we
can build a patch that tracks dependencies and allows parallelization
without preserving commit order.
I think I attached the correct file. Since we are trying to preserve
the commit order by default, everything was merged into one patch.
I agree the goal should be preserving the commit order, unless someone
can demonstrate (a) clear performance benefits and (b) correctness. It's
not clear to me how that would deal e.g. with crashes, where some of the
"future" replicated transactions have committed. Maybe it's fine, not sure.
But keeping the same commit order just makes it easier to think about
the consistency model, no?
So it seems natural to target the same commit order first, and then
maybe explore if relaxing that would be beneficial for some cases.
However, the patch seems fairly large (~80kB, although a fair bit of
that is comments). Would it be possible to split it into smaller chunks?
Is there some "minimal patch", which could be moved to 0001, and then
followed by improvements in 0002, 0003, ...? I sometimes do some
"infrastructure" first, and the actual patch in the last part (simply
using the earlier parts).
I'm not saying it has to be split (or how exactly), but I personally
find smaller patches easier to review ...
One point to clarify is that dependency tracking is essential even if we fully
preserve the commit ordering not to violate constrains like PK. Assuming there is
a table which has PK, txn1 inserts a tuple and txn2 updates it. UPDATE statement
in txn2 must be done after committing txn1.
Right. I don't see how we could do parallel apply correctly in the general
case without tracking these dependencies.
I feel it may be better to just
discuss preserve commit order patch that also contains some comments
as to how to extend it further, once that is done, we can do further
discussion of the other patch.
I do agree, let me implement one by one.
Some comments / questions after looking at the patch today:
1) The way the patch determines dependencies seems to be the "writeset"
approach from other replication systems (e.g. MySQL does that). Maybe we
should stick to the same naming?
2) If I understand correctly, the patch maintains a "replica_identity"
hash table, with replica identity keys for all changes for all
concurrent transactions. How expensive can this be, in terms of CPU and
memory? What if I have multiple large batch transactions, each updating
millions of rows?
3) Would it make sense to use some alternative data structure? A bloom
filter, for example. Just a random idea, not sure if that's a good fit.
4) I've seen the benchmarks posted a couple days ago, and I'm running
some tests myself. But it's hard to say if the result is good or bad
without knowing what fraction of transactions finds a dependency and has
to wait for an earlier one. Would it be possible to track this
somewhere? Is there a suitable pg_stats_ view?
5) It's not clear to me how you measured the TPS in your benchmark.
Did you measure how long it takes for the standby to catch up, or what
did you do?
6) Did you investigate why the speedup is just ~2.1 with 4 workers, i.e.
about half of the "ideal" speedup? Is it bottlenecked on WAL, leader
having to determine dependencies, or something else?
7) I'm a bit confused about the different types of dependencies, and at
which point they make the workers wait. There are the dependencies due
to modifying the same row, in which case the worker waits before
starting to apply the changes that hits the dependency. And then there's
a dependency to enforce commit order, in which case it waits before
commit. Right? Or did I get that wrong?
8) The commit message says:
It would be a challenge to check the dependency if the table has
user-defined triggers or constraints. The most viable solution might be to
disallow parallel apply for relations whose triggers and constraints
are not marked as parallel-safe or immutable.
Wouldn't this have similar issues with verifying these features on
partitioned tables as the patch that attempted to allow parallelism for
INSERT ... SELECT [1]? AFAICS it was too expensive to do with large
partitioning hierarchies.
9) I think it'd be good to make sure the "design" comments explain how
the new parts work in more detail. For example, the existing comment at
the beginning of applyparallelworker.c goes into a lot of detail, but
the patch adds only two fairly short paragraphs. Even the commit message
has more detail, which seems a bit strange.
10) For example it would be good to explain what "internal dependency"
and "internal relation" are for. I think I understand the internal
dependency, I'm still not quite sure why we need internal relation (or
rather why we didn't need it before).
11) I think it might be good to have TAP tests that stress this out in
various ways. Say, a test that randomly restarts the standby during
parallel apply, and checks it does not miss any records, etc. In the
online checksums patch this was quite useful. It wouldn't be part of
regular check-world, of course. Or maybe it'd be for development only?
regards
[1]: /messages/by-id/E1lJoQ6-0005BJ-DY@gemulon.postgresql.org
--
Tomas Vondra
On Thursday, November 20, 2025 5:31 AM Tomas Vondra <tomas@vondra.me> wrote:
Hello Kuroda-san,
On 11/18/25 12:00, Hayato Kuroda (Fujitsu) wrote:
Dear Amit,
It seems you haven't sent the patch that preserves commit order or the
commit message of the attached patch is wrong. I think the first patch
in series should be the one that preserves commit order and then we
can build a patch that tracks dependencies and allows parallelization
without preserving commit order.
I think I attached the correct file. Since we are trying to preserve
the commit order by default, everything was merged into one patch.
...
However, the patch seems fairly large (~80kB, although a fair bit of
that is comments). Would it be possible to split it into smaller chunks?
Is there some "minimal patch", which could be moved to 0001, and then
followed by improvements in 0002, 0003, ...? I sometimes do some
"infrastructure" first, and the actual patch in the last part (simply
using the earlier parts).
I'm not saying it has to be split (or how exactly), but I personally
find smaller patches easier to review ...
Agreed and thanks for the suggestion, we will try to split the patches into
smaller ones.
One point to clarify is that dependency tracking is essential even if we fully
preserve the commit ordering, so as not to violate constraints like a PK.
Assuming there is a table with a PK, txn1 inserts a tuple and txn2 updates it.
The UPDATE in txn2 must be applied only after txn1 commits.
Right. I don't see how we could do parallel apply correctly in the general
case without tracking these dependencies.
I feel it may be better to just
discuss preserve commit order patch that also contains some comments
as to how to extend it further, once that is done, we can do further
discussion of the other patch.
I do agree, let me implement one by one.
Some comments / questions after looking at the patch today:
Thanks for the comments!
1) The way the patch determines dependencies seems to be the "writeset"
approach from other replication systems (e.g. MySQL does that). Maybe we
should stick to the same naming?
OK, I did not research the design in MySQL in detail but will try to analyze it.
2) If I understand correctly, the patch maintains a "replica_identity" hash
table, with replica identity keys for all changes for all concurrent
transactions. How expensive can this be, in terms of CPU and memory? What if I
have multiple large batch transactions, each updating millions of rows?
In the TPC-B or simple-update cases the cost of dependency tracking seems
trivial (e.g., the profile of a previous simple-update test shows
--1.39%--check_dependency_on_replica_identity), but we will try to analyze
large-transaction cases in more depth as suggested.
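For reference, a minimal sketch of what such a writeset entry could look like,
inferred only from the typedef names added to typedefs.list
(ReplicaIdentityKey, ReplicaIdentityEntry, replica_identity_hash); the actual
fields may differ:

typedef struct ReplicaIdentityKey
{
    LogicalRepRelId relid;      /* remote relation the key belongs to */
    uint32          keyhash;    /* hash of the replica identity columns */
} ReplicaIdentityKey;

typedef struct ReplicaIdentityEntry
{
    ReplicaIdentityKey key;
    TransactionId   xid;        /* last remote xact that modified this key */
} ReplicaIdentityEntry;

Memory then grows with the number of distinct keys touched by in-flight
transactions, and entries are only reclaimed once the owning transaction is
confirmed committed (cf. cleanup_replica_identity_table() in the patch), so
large batch transactions keep their entries alive until they commit.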
3) Would it make sense to use some alternative data structure? A bloom filter,
for example. Just a random idea, not sure if that's a good fit.
It's worth analyzing. We will do some more tests and if we find some bottlenecks
due to the current dependency tracking, then we will research more on
alternative approaches like bloom filter.
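For illustration, the kind of structure being suggested could be one small
filter per in-flight transaction; a minimal sketch (hypothetical, not part of
the patch) is below. A false positive only causes an unnecessary wait, never
a missed dependency, so correctness is unaffected:

#define BLOOM_BITS      8192
#define BLOOM_HASHES    3

typedef struct TxnBloomFilter
{
    uint64      bits[BLOOM_BITS / 64];
} TxnBloomFilter;

static void
bloom_add(TxnBloomFilter *bf, uint32 keyhash)
{
    for (int i = 0; i < BLOOM_HASHES; i++)
    {
        uint32      h = (keyhash + i * 0x9e3779b9) % BLOOM_BITS;

        /* Set one bit per hash function. */
        bf->bits[h / 64] |= UINT64CONST(1) << (h % 64);
    }
}

static bool
bloom_might_contain(const TxnBloomFilter *bf, uint32 keyhash)
{
    for (int i = 0; i < BLOOM_HASHES; i++)
    {
        uint32      h = (keyhash + i * 0x9e3779b9) % BLOOM_BITS;

        /* Any clear bit proves the key was never added. */
        if ((bf->bits[h / 64] & (UINT64CONST(1) << (h % 64))) == 0)
            return false;
    }
    return true;
}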
4) I've seen the benchmarks posted a couple days ago, and I'm running some
tests myself. But it's hard to say if the result is good or bad without
knowing what fraction of transactions finds a dependency and has to wait for
an earlier one. Would it be possible to track this somewhere? Is there a
suitable pg_stats_ view?
Right, we will consider this idea and will try to implement this.
5) It's not clear to me how did you measure the TPS in your benchmark. Did you
measure how long it takes for the standby to catch up, or what did you do?
The test we shared enabled synchronous logical replication and then used
pgbench (simple-update) to write on the publisher, counting the TPS reported
by pgbench.
6) Did you investigate why the speedup is just ~2.1 with 4 workers, i.e. about
half of the "ideal" speedup? Is it bottlenecked on WAL, leader having to
determine dependencies, or something else?
7) I'm a bit confused about the different types of dependencies, and at which
point they make the workers wait. There are the dependencies due to modifying
the same row, in which case the worker waits before starting to apply the
changes that hits the dependency. And then there's a dependency to enforce
commit order, in which case it waits before commit. Right? Or did I get that
wrong?
Right, your understanding is correct; there are only two kinds of dependencies
for now (same-row modification and commit ordering).
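To spell out where each wait happens in the current patch: for a same-row
dependency, the leader detects the conflict in handle_dependency_on_change()
and the assigned worker then waits in pa_wait_for_depended_transaction()
before applying the conflicting change; for commit ordering, the leader builds
the dependency in build_dependency_with_last_committed_txn() and the worker
waits on the previously committed xid just before applying its COMMIT.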
8) The commit message says:
It would be a challenge to check the dependency if the table has user-defined
triggers or constraints. The most viable solution might be to disallow
parallel apply for relations whose triggers and constraints are not marked
as parallel-safe or immutable.
Wouldn't this have similar issues with verifying these features on partitioned
tables as the patch that attempted to allow parallelism for INSERT ... SELECT
[1]? AFAICS it was too expensive to do with large partitioning hierarchies.
By default, since publish_via_partition_root is set to false in the publication,
we normally replicate changes to the leaf partition directly. So, for
non-partitioned tables, we can directly assess their parallel safety and cache
the results.
Partitioned tables require additional handling. But unlike INSERT ... SELECT,
logical replication provides remote data changes upfront, allowing us to
identify the target leaf partition for each change and assess safety for that
table. So, we can avoid examining all partition hierarchies for a change.
To check the safety of a change on a partitioned table, the leader worker could
initially perform tuple routing for the remote change and evaluate the
user-defined triggers or functions in the target partition before determining
whether to parallelize the transaction. Although this approach may introduce
some overhead for the leader, we plan to test its impact. If the overhead is
unacceptable, we might also consider disallowing parallelism for changes on
partitioned tables.
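A rough sketch of what that leader-side check could look like (hypothetical,
not in the patch; a real version would also need to check constraints,
distinguish internal triggers such as FK triggers, and cache the per-leaf
result):

static bool
change_is_parallel_safe(ApplyExecutionData *edata, TupleTableSlot *remoteslot)
{
    ResultRelInfo *leaf;

    /* Route the remote tuple to its leaf partition first. */
    leaf = ExecFindPartition(edata->mtstate, edata->targetRelInfo,
                             edata->proute, remoteslot, edata->estate);

    /* Conservatively refuse parallel apply if the leaf has any triggers. */
    if (leaf->ri_RelationDesc->trigdesc != NULL)
        return false;

    return true;
}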
9) I think it'd be good to make sure the "design" comments explain how the new
parts work in more detail. For example, the existing comment at the beginning
of applyparallelworker.c goes into a lot of detail, but the patch adds only
two fairly short paragraphs. Even the commit message has more detail, which
seems a bit strange.
Agreed, we will add more comments.
10) For example it would be good to explain what "internal dependency" and
"internal relation" are for. I think I understand the internal dependency, I'm
still not quite sure why we need internal relation (or rather why we didn't
need it before).
The internal relation is used to share relation information (such as the
publisher's table name, schema name, relkind, column names, etc) with parallel
apply workers. This information is needed for verifying whether the publisher's
relation data aligns with the subscriber's data when applying changes.
Previously, sharing this information wasn't necessary because parallel apply
workers were only tasked with applying streamed replication. In those cases, the
relation information for modified relations was always sent within streamed
transactions (see maybe_send_schema() for details), eliminating the need for
additional sharing. However, in non-streaming transactions, relation information
might not be included in every transaction. Therefore, we request the leader to
distribute the received relation information to parallel apply workers before
assigning them a transaction.
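Concretely, the flow in the attached patch is roughly:

  leader: apply_handle_relation()
            -> logicalrep_relmap_update()
            -> pa_distribute_schema_changes_to_workers(rel)
  worker: apply_dispatch() receives LOGICAL_REP_MSG_INTERNAL_RELATION
            -> apply_handle_internal_relation()
            -> logicalrep_relmap_update() for each relation in the message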
11) I think it might be good to have TAP tests that stress this out in various
ways. Say, a test that randomly restarts the standby during parallel apply,
and checks it does not miss any records, etc. In the online checksums patch
this was quite useful. It wouldn't be part of regular check-world, of course.
Or maybe it'd be for development only?
We will think more on this.
Best Regards,
Hou zj
On Thu, Nov 20, 2025 at 3:00 AM Tomas Vondra <tomas@vondra.me> wrote:
Hello Kuroda-san,
On 11/18/25 12:00, Hayato Kuroda (Fujitsu) wrote:
Dear Amit,
It seems you haven't sent the patch that preserves commit order or the
commit message of the attached patch is wrong. I think the first patch
in series should be the one that preserves commit order and then we
can build a patch that tracks dependencies and allows parallelization
without preserving commit order.I think I attached the correct file. Since we are trying to preserve
the commit order by default, everything was merged into one patch.I agree the goal should be preserving the commit order, unless someone
can demonstrate (a) clear performance benefits and (b) correctness. It's
not clear to me how that would deal e.g. with crashes, where some of the
"future" replicated transactions have committed.
Yeah, the key challenge in not preserving commit order is that future
transactions can be committed while some of the previous transactions
are still in the apply phase when a crash happens. With the current
replication progress tracking scheme, we won't be able to apply the
transactions that were still in-progress when the crash happened.
However, I came up with a scheme to change the replication progress
tracking mechanism to allow out-of-order commits during apply. See [1]
(Replication Progress Tracking). Anyway, as discussed in this thread,
it is better to keep that as optional non-default behavior, so we want
to focus first on the preserving-commit-order part.
Thanks for paying attention, your comments/suggestions are helpful.
[1]: /messages/by-id/CAA4eK1+SEus_6vQay9TF_r4ow+E-Q7LYNLfsD78HaOsLSgppxQ@mail.gmail.com
--
With Regards,
Amit Kapila
Hi
1) The way the patch determines dependencies seems to be the "writeset"
approach from other replication systems (e.g. MySQL does that). Maybe we
should stick to the same naming?
OK, I did not research the design in MySQL in detail but will try to
analyze it.
I have some documents on MySQL's parallel apply of binlog events. But after
MySQL 8.4, only the writeset mode is available. In scenarios with a primary
key or unique key, the replica replay is not ordered, but the data is
eventually consistent.
https://dev.mysql.com/worklog/task/?id=9556
https://dev.mysql.com/blog-archive/improving-the-parallel-applier-with-writeset-based-dependency-tracking/
https://medium.com/airtable-eng/optimizing-mysql-replication-lag-with-parallel-replication-and-writeset-based-dependency-tracking-1fc405cf023c
Thanks
On 11/20/25 14:10, wenhui qiu wrote:
Hi
1) The way the patch determines dependencies seems to be the "writeset"
approach from other replication systems (e.g. MySQL does that). Maybe we
should stick to the same naming?
OK, I did not research the design in MySQL in detail but will try to
analyze it.
I have some documents on MySQL's parallel apply of binlog events. But after
MySQL 8.4, only the writeset mode is available. In scenarios with a
primary key or unique key, the replica replay is not ordered, but the
data is eventually consistent.
https://dev.mysql.com/worklog/task/?id=9556
https://dev.mysql.com/blog-archive/improving-the-parallel-applier-with-writeset-based-dependency-tracking/
https://medium.com/airtable-eng/optimizing-mysql-replication-lag-with-parallel-replication-and-writeset-based-dependency-tracking-1fc405cf023c
FWIW there was a talk about MySQL replication at pgconf.dev 2024
https://www.youtube.com/watch?v=eOfUqh5PltM
discussing some of this stuff. I'm not saying we should copy all of
this, but it seems like a good source of inspiration what (not) to do.
regards
--
Tomas Vondra
Hi Tomas
discussing some of this stuff. I'm not saying we should copy all of
this, but it seems like a good source of inspiration what (not) to do.
I'm not saying we should copy MySQL's implementation. MySQL’s parallel
replication is based on group commit, and PostgreSQL can’t directly adopt
that approach. However, MySQL hashes transactions within the same commit
group by primary and unique keys, assuming that transactions with different
hashes do not conflict (since MySQL's row locks are index-based). This
allows transactions to be safely replayed in parallel on replicas, and
their execution order within the group doesn’t matter.
Thanks
On Thursday, November 20, 2025 10:50 PM Tomas Vondra <tomas@vondra.me> wrote:
On 11/20/25 14:10, wenhui qiu wrote:
Hi
1) The way the patch determines dependencies seems to be the "writeset"
approach from other replication systems (e.g. MySQL does that). Maybe
we should stick to the same naming?
OK, I did not research the design in MySQL in detail but will try to
analyze it.
I have some documents on MySQL's parallel apply of binlog events. But after
MySQL 8.4, only the writeset mode is available. In scenarios with a
primary key or unique key, the replica replay is not ordered, but the
data is eventually consistent.
https://dev.mysql.com/worklog/task/?id=9556
https://dev.mysql.com/blog-archive/improving-the-parallel-applier-with-writeset-based-dependency-tracking/
https://medium.com/airtable-eng/optimizing-mysql-replication-lag-with-parallel-replication-and-writeset-based-dependency-tracking-1fc405cf023c
FWIW there was a talk about MySQL replication at pgconf.dev 2024
https://www.youtube.com/watch?v=eOfUqh5PltM
discussing some of this stuff. I'm not saying we should copy all of this, but it
seems like a good source of inspiration what (not) to do.
Thank you both for the information. We'll look into these further.
Best Regards,
Hou zj
On Tue, Sep 16, 2025 at 3:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sat, Sep 6, 2025 at 10:33 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
I suspect this might not be the most performant default strategy and
could frequently cause a performance dip. In general, we utilize
parallel apply workers, considering that the time taken to apply
changes is much costlier than reading and sending messages to workers.
The current strategy involves the leader picking one transaction for
itself after distributing transactions to all apply workers, assuming
the apply task will take some time to complete. When the leader takes
on an apply task, it becomes a bottleneck for complete parallelism.
This is because it needs to finish applying previous messages before
accepting any new ones. Consequently, even as workers slowly become
free, they won't receive new tasks because the leader is busy applying
its own transaction.
This type of strategy might be suitable in scenarios where users
cannot supply more workers due to resource limitations. However, on
high-end machines, it is more efficient to let the leader act solely
as a message transmitter and allow the apply workers to handle all
apply tasks. This could be a configurable parameter, determining
whether the leader also participates in applying changes. I believe
this should not be the default strategy; in fact, the default should
be for the leader to act purely as a transmitter.
I see your point but consider a scenario where we have two pa workers.
pa-1 is waiting for some backend on a unique-key insertion, and pa-2 is
waiting for pa-1 to complete its transaction, as pa-2 has to apply a
change that depends on pa-1's transaction. So, the leader can either
simply wait for a third transaction to be distributed or just apply it
itself and process another change. If we do the former, it is quite
possible that the sender fills the network queue and simply times out.
Sorry I took a while to come back to this. I understand your point and
agree that it's a valid concern. However, I question whether limiting
this to a single choice is the optimal solution. The core issue
involves two distinct roles: work distribution and applying changes.
Work distribution is exclusively handled by the leader, while any
worker can apply the changes. This is essentially a single-producer,
multiple-consumer problem.
While it might seem efficient for the producer (leader) to assist
consumers (workers) when there's a limited number of consumers, I
believe this isn't the best design. In such scenarios, it's generally
better to allow the producer to focus solely on its primary task,
unless there's a severe shortage of processing power.
If computing resources are constrained, allowing producers to join
consumers in applying changes is acceptable. However, if sufficient
processing power is available, the producer should ideally be left to
its own duties. The question then becomes: how do we make this
decision?
My suggestion is to make this a configurable parameter. Users could
then decide whether the leader participates in applying changes. This
would provide flexibility: if there are enough workers, the user can have
the leader focus on its distribution task only. OTOH, if processing
power is limited and only a few apply workers (e.g., two, as in your
example) can be set up, users would have the option to configure the
leader to also act as an apply worker when needed.
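FWIW, parallel query already has a similar knob
(parallel_leader_participation), so a subscriber-side analog could be a simple
boolean GUC. A minimal sketch, with a hypothetical name and a guc_tables.c
style entry:

bool parallel_apply_leader_participation = true;

{
    {"parallel_apply_leader_participation", PGC_SIGHUP,
     REPLICATION_SUBSCRIBERS,
     gettext_noop("Allows the leader apply worker to apply transactions "
                  "itself when no parallel apply worker is free."),
     NULL},
    &parallel_apply_leader_participation,
    true,
    NULL, NULL, NULL
},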
--
Regards,
Dilip Kumar
Google
On Mon, Nov 24, 2025 at 9:56 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Tue, Sep 16, 2025 at 3:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sat, Sep 6, 2025 at 10:33 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
I suspect this might not be the most performant default strategy and
could frequently cause a performance dip. In general, we utilize
parallel apply workers, considering that the time taken to apply
changes is much costlier than reading and sending messages to workers.
The current strategy involves the leader picking one transaction for
itself after distributing transactions to all apply workers, assuming
the apply task will take some time to complete. When the leader takes
on an apply task, it becomes a bottleneck for complete parallelism.
This is because it needs to finish applying previous messages before
accepting any new ones. Consequently, even as workers slowly become
free, they won't receive new tasks because the leader is busy applying
its own transaction.
This type of strategy might be suitable in scenarios where users
cannot supply more workers due to resource limitations. However, on
high-end machines, it is more efficient to let the leader act solely
as a message transmitter and allow the apply workers to handle all
apply tasks. This could be a configurable parameter, determining
whether the leader also participates in applying changes. I believe
this should not be the default strategy; in fact, the default should
be for the leader to act purely as a transmitter.
I see your point but consider a scenario where we have two pa workers.
pa-1 is waiting for some backend on a unique-key insertion, and pa-2 is
waiting for pa-1 to complete its transaction, as pa-2 has to apply a
change that depends on pa-1's transaction. So, the leader can either
simply wait for a third transaction to be distributed or just apply it
itself and process another change. If we do the former, it is quite
possible that the sender fills the network queue and simply times out.
Sorry I took a while to come back to this. I understand your point and
agree that it's a valid concern. However, I question whether limiting
this to a single choice is the optimal solution. The core issue
involves two distinct roles: work distribution and applying changes.
Work distribution is exclusively handled by the leader, while any
worker can apply the changes. This is essentially a single-producer,
multiple-consumer problem.

While it might seem efficient for the producer (leader) to assist
consumers (workers) when there's a limited number of consumers, I
believe this isn't the best design. In such scenarios, it's generally
better to allow the producer to focus solely on its primary task,
unless there's a severe shortage of processing power.

If computing resources are constrained, allowing producers to join
consumers in applying changes is acceptable. However, if sufficient
processing power is available, the producer should ideally be left to
its own duties. The question then becomes: how do we make this
decision?

My suggestion is to make this a configurable parameter. Users could
then decide whether the leader participates in applying changes.

We could do this but another possibility is that the leader does
distribute some threshold of pending transactions (say 5 or 10) to
each of the workers and if none of the workers is still available then
it can perform the task by itself. I think this will avoid the system
performing poorly when the existing workers are waiting on each other
and/or a backend to finish the current transaction. Having said that, I
think this can be done as a separate optimization patch as well.
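A rough sketch of that fallback, to illustrate (the threshold constant
and all helper names below are invented for this example and are not
part of the posted patches):

#define PA_PENDING_TXN_THRESHOLD 5      /* e.g. 5 or 10 per worker */

/* Hypothetical helpers, assumed to exist for this sketch. */
extern ParallelApplyWorkerInfo *pa_least_loaded_worker(void);
extern int  pa_pending_txn_count(ParallelApplyWorkerInfo *winfo);
extern void pa_send_transaction(ParallelApplyWorkerInfo *winfo,
                                TransactionId xid);
extern void apply_transaction_locally(TransactionId xid);

static void
leader_dispatch_with_threshold(TransactionId xid)
{
    ParallelApplyWorkerInfo *winfo = pa_least_loaded_worker();

    /* Distribute while some worker still has room in its backlog... */
    if (winfo && pa_pending_txn_count(winfo) < PA_PENDING_TXN_THRESHOLD)
        pa_send_transaction(winfo, xid);
    else
        /* ...otherwise the leader applies the transaction itself. */
        apply_transaction_locally(xid);
}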
--
With Regards,
Amit Kapila.
On Mon, Nov 24, 2025 at 5:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
While it might seem efficient for the producer (leader) to assist
consumers (workers) when there's a limited number of consumers, I
believe this isn't the best design. In such scenarios, it's generally
better to allow the producer to focus solely on its primary task,
unless there's a severe shortage of processing power.

If computing resources are constrained, allowing producers to join
consumers in applying changes is acceptable. However, if sufficient
processing power is available, the producer should ideally be left to
its own duties. The question then becomes: how do we make this
decision?

My suggestion is to make this a configurable parameter. Users could
then decide whether the leader participates in applying changes.

We could do this but another possibility is that the leader does
distribute some threshold of pending transactions (say 5 or 10) to
each of the workers and if none of the workers is still available then
it can perform the task by itself.
IMHO making the producer (the leader) join as a consumer (an apply
worker) is not the best default behavior for a single-producer,
multi-consumer design. This design choice is generally not scalable
because the producer is a unique resource: no other process can handle
its job, while multiple parallel workers can act as consumers. By
keeping the roles separate, a user always has the option to set up a
sufficiently high number of dedicated consumer workers. However, in
resource-constrained environments where maximum resource utilization
is prioritized over the most scalable solution, a configuration
parameter could be introduced. This parameter would allow the producer
to act as a consumer worker whenever it is free and other consumers
are busy. This offers a trade-off between resource efficiency and
overall scalability.
I think this will avoid the system
performing poorly when the existing workers are waiting on each other
and/or a backend to finish the current transaction.
The core issue is that integrating the producer (sender) as an extra
consumer (apply worker) just adds one extra worker's capacity (N+1), but
doesn't fundamentally solve the problem of all workers eventually
becoming busy or blocked (waiting on transactions), or am I missing
something?
The possibility remains that all N+1 workers could become busy
applying or, more commonly, waiting for transactions to commit or
resources to free up. Adding one extra worker doesn't resolve the
underlying problem if the workload exceeds the total available
processing power or if transactions are frequently waiting. Users
already have the ability to address this by configuring N+1 or more
dedicated consumer workers based on their resource availability and
performance needs.
Therefore, relying on the producer as an occasional consumer offers
only a minor, temporary capacity gain and doesn't resolve the overall
scalability limit or the likelihood of full worker saturation.
Having said that, I
think this can be done as a separate optimization patch as well.
Yeah, we could.
--
Regards,
Dilip Kumar
Google
Dear Tomas,
Thanks for looking at the thread, and sorry for the late response.
I was attending a PostgreSQL conference in Japan.
However, the patch seems fairly large (~80kB, although a fair bit of
that is comments). Would it be possible to split it into smaller chunks?
Is there some "minimal patch", which could be moved to 0001, and then
followed by improvements in 0002, 0003, ...? I sometimes do some
"infrastructure" first, and the actual patch in the last part (simply
using the earlier parts).

I'm not saying it has to be split (or how exactly), but I personally
find smaller patches easier to review ...
Yes, smaller patches are always better than a huge monolith. I split the patch
into four patches - three of them introduce a mechanism to track dependencies
and wait until other transactions finish, and the fourth patch launches parallel
workers with them. Each patch can be built and passes tests individually.
Two of them may still be large (~800 lines), but I hope this is helpful for
reviewers.
Some comments / questions after looking at the patch today:
We will answer them after further analysis.
Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachments:
v3-0001-Introduce-new-type-of-logical-replication-message.patch (application/octet-stream)
From 926b08dbbfedc199e101d407b27a5a57fd76b9c4 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Mon, 1 Dec 2025 10:37:27 +0900
Subject: [PATCH v3 1/4] Introduce new type of logical replication messages to
track dependencies
This patch introduces two logical replication messages,
LOGICAL_REP_MSG_INTERNAL_DEPENDENCY and LOGICAL_REP_MSG_INTERNAL_RELATION.
Unlike other messages, they are not sent by walsenders; the leader worker
sends them to parallel workers as needed.
LOGICAL_REP_MSG_INTERNAL_DEPENDENCY ensures that dependent transactions are
committed in the correct order. It has a list of transaction IDs that parallel
workers must wait for. This message type is generated when the leader
detects a dependency between the current and other transactions, or just before
the COMMIT message. The latter is used to preserve the commit ordering
between the publisher and the subscriber.
LOGICAL_REP_MSG_INTERNAL_RELATION is used to synchronize the relation
information between the leader and parallel workers. It has a list of relations
that the leader already knows about, and parallel workers update their relmap in
response to the message. This type of message is generated when the leader
allocates a new parallel worker to the transaction, or when the publisher sends
additional RELATION messages.
Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
---
.../replication/logical/applyparallelworker.c | 16 ++++++
src/backend/replication/logical/proto.c | 4 ++
src/backend/replication/logical/worker.c | 49 +++++++++++++++++++
src/include/replication/logicalproto.h | 2 +
src/include/replication/worker_internal.h | 4 ++
5 files changed, 75 insertions(+)
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index baa68c1ab6c..735a3e9acad 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -1645,3 +1645,19 @@ pa_xact_finish(ParallelApplyWorkerInfo *winfo, XLogRecPtr remote_lsn)
pa_free_worker(winfo);
}
+
+/*
+ * Wait for the given transaction to finish.
+ */
+void
+pa_wait_for_depended_transaction(TransactionId xid)
+{
+ elog(DEBUG1, "wait for depended xid %u", xid);
+
+ for (;;)
+ {
+ /* XXX wait until given transaction is finished */
+ }
+
+ elog(DEBUG1, "finish waiting for depended xid %u", xid);
+}
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index f0a913892b9..72dedee3a43 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -1253,6 +1253,10 @@ logicalrep_message_type(LogicalRepMsgType action)
return "STREAM ABORT";
case LOGICAL_REP_MSG_STREAM_PREPARE:
return "STREAM PREPARE";
+ case LOGICAL_REP_MSG_INTERNAL_DEPENDENCY:
+ return "INTERNAL DEPENDENCY";
+ case LOGICAL_REP_MSG_INTERNAL_RELATION:
+ return "INTERNAL RELATION";
}
/*
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 93970c6af29..ebf8cd62552 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -629,6 +629,47 @@ static TransApplyAction get_transaction_apply_action(TransactionId xid,
static void replorigin_reset(int code, Datum arg);
+/*
+ * Handle internal dependency information.
+ *
+ * Wait for all transactions listed in the message to commit.
+ */
+static void
+apply_handle_internal_dependency(StringInfo s)
+{
+ int nxids = pq_getmsgint(s, 4);
+
+ for (int i = 0; i < nxids; i++)
+ {
+ TransactionId xid = pq_getmsgint(s, 4);
+
+ pa_wait_for_depended_transaction(xid);
+ }
+}
+
+/*
+ * Handle internal relation information.
+ *
+ * Update all relation details in the relation map cache.
+ */
+static void
+apply_handle_internal_relation(StringInfo s)
+{
+ int num_rels;
+
+ num_rels = pq_getmsgint(s, 4);
+
+ for (int i = 0; i < num_rels; i++)
+ {
+ LogicalRepRelation *rel = logicalrep_read_rel(s);
+
+ logicalrep_relmap_update(rel);
+
+ elog(DEBUG1, "parallel apply worker worker init relmap for %s",
+ rel->relname);
+ }
+}
+
/*
* Form the origin name for the subscription.
*
@@ -3868,6 +3909,14 @@ apply_dispatch(StringInfo s)
apply_handle_stream_prepare(s);
break;
+ case LOGICAL_REP_MSG_INTERNAL_RELATION:
+ apply_handle_internal_relation(s);
+ break;
+
+ case LOGICAL_REP_MSG_INTERNAL_DEPENDENCY:
+ apply_handle_internal_dependency(s);
+ break;
+
default:
ereport(ERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index b261c60d3fa..5d91e2a4287 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -75,6 +75,8 @@ typedef enum LogicalRepMsgType
LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
LOGICAL_REP_MSG_STREAM_ABORT = 'A',
LOGICAL_REP_MSG_STREAM_PREPARE = 'p',
+ LOGICAL_REP_MSG_INTERNAL_DEPENDENCY = 'd',
+ LOGICAL_REP_MSG_INTERNAL_RELATION = 'i',
} LogicalRepMsgType;
/*
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index f081619f151..a3526eae578 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -359,6 +359,8 @@ extern void pa_decr_and_wait_stream_block(void);
extern void pa_xact_finish(ParallelApplyWorkerInfo *winfo,
XLogRecPtr remote_lsn);
+extern void pa_wait_for_depended_transaction(TransactionId xid);
+
#define isParallelApplyWorker(worker) ((worker)->in_use && \
(worker)->type == WORKERTYPE_PARALLEL_APPLY)
#define isTableSyncWorker(worker) ((worker)->in_use && \
@@ -366,6 +368,8 @@ extern void pa_xact_finish(ParallelApplyWorkerInfo *winfo,
#define isSequenceSyncWorker(worker) ((worker)->in_use && \
(worker)->type == WORKERTYPE_SEQUENCESYNC)
+#define PARALLEL_APPLY_INTERNAL_MESSAGE 'i'
+
static inline bool
am_tablesync_worker(void)
{
--
2.47.3
v3-0002-Introduce-a-shared-hash-table-to-store-paralleliz.patch (application/octet-stream)
From 88ce1eede98cc6fa14de5cfb674b77a4961847f1 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Mon, 1 Dec 2025 16:28:38 +0900
Subject: [PATCH v3 2/4] Introduce a shared hash table to store parallelized
transactions
This hash table is used for ensuring that parallel workers wait until dependent
transactions are committed.
The shared hash table contains transaction IDs that the leader allocated to
parallel workers. The hash entries are inserted with the remote XID when the
leader dispatches remote transactions to parallel apply workers. Entries are
deleted once the parallel workers have committed the corresponding transactions.
When a parallel worker needs to wait for other transactions, it checks the
hash table for their remote XIDs. The worker can proceed only once the entries
have been removed from the hash.
Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
---
.../replication/logical/applyparallelworker.c | 100 +++++++++++++++++-
.../utils/activity/wait_event_names.txt | 1 +
src/include/replication/worker_internal.h | 4 +
src/include/storage/lwlocklist.h | 1 +
4 files changed, 105 insertions(+), 1 deletion(-)
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index 735a3e9acad..bc8a0480778 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -218,12 +218,35 @@ typedef struct ParallelApplyWorkerEntry
ParallelApplyWorkerInfo *winfo;
} ParallelApplyWorkerEntry;
+/* an entry in the parallelized_txns shared hash table */
+typedef struct ParallelizedTxnEntry
+{
+ TransactionId xid; /* Hash key */
+} ParallelizedTxnEntry;
+
/*
* A hash table used to cache the state of streaming transactions being applied
* by the parallel apply workers.
*/
static HTAB *ParallelApplyTxnHash = NULL;
+/*
+ * A hash table used to track the parallelized transactions that could be
+ * depended on by other transactions.
+ */
+static dsa_area *parallel_apply_dsa_area = NULL;
+static dshash_table *parallelized_txns = NULL;
+
+/* parameters for the parallelized_txns shared hash table */
+static const dshash_parameters dsh_params = {
+ sizeof(TransactionId),
+ sizeof(ParallelizedTxnEntry),
+ dshash_memcmp,
+ dshash_memhash,
+ dshash_memcpy,
+ LWTRANCHE_PARALLEL_APPLY_DSA
+};
+
/*
* A list (pool) of active parallel apply workers. The information for
* the new worker is added to the list after successfully launching it. The
@@ -257,6 +280,8 @@ static List *subxactlist = NIL;
static void pa_free_worker_info(ParallelApplyWorkerInfo *winfo);
static ParallelTransState pa_get_xact_state(ParallelApplyWorkerShared *wshared);
static PartialFileSetState pa_get_fileset_state(void);
+static void pa_attach_parallelized_txn_hash(dsa_handle *pa_dsa_handle,
+ dshash_table_handle *pa_dshash_handle);
/*
* Returns true if it is OK to start a parallel apply worker, false otherwise.
@@ -334,6 +359,15 @@ pa_setup_dsm(ParallelApplyWorkerInfo *winfo)
shm_mq *mq;
Size queue_size = DSM_QUEUE_SIZE;
Size error_queue_size = DSM_ERROR_QUEUE_SIZE;
+ dsa_handle parallel_apply_dsa_handle;
+ dshash_table_handle parallelized_txns_handle;
+
+ pa_attach_parallelized_txn_hash(¶llel_apply_dsa_handle,
+ ¶llelized_txns_handle);
+
+ if (parallel_apply_dsa_handle == DSA_HANDLE_INVALID ||
+ parallelized_txns_handle == DSHASH_HANDLE_INVALID)
+ return false;
/*
* Estimate how much shared memory we need.
@@ -369,6 +403,8 @@ pa_setup_dsm(ParallelApplyWorkerInfo *winfo)
pg_atomic_init_u32(&(shared->pending_stream_count), 0);
shared->last_commit_end = InvalidXLogRecPtr;
shared->fileset_state = FS_EMPTY;
+ shared->parallel_apply_dsa_handle = parallel_apply_dsa_handle;
+ shared->parallelized_txns_handle = parallelized_txns_handle;
shm_toc_insert(toc, PARALLEL_APPLY_KEY_SHARED, shared);
@@ -864,6 +900,8 @@ ParallelApplyWorkerMain(Datum main_arg)
shm_mq *mq;
shm_mq_handle *mqh;
shm_mq_handle *error_mqh;
+ dsa_handle pa_dsa_handle;
+ dshash_table_handle pa_dshash_handle;
RepOriginId originid;
int worker_slot = DatumGetInt32(main_arg);
char originname[NAMEDATALEN];
@@ -951,6 +989,8 @@ ParallelApplyWorkerMain(Datum main_arg)
InitializingApplyWorker = false;
+ pa_attach_parallelized_txn_hash(&pa_dsa_handle, &pa_dshash_handle);
+
/* Setup replication origin tracking. */
StartTransactionCommand();
ReplicationOriginNameForLogicalRep(MySubscription->oid, InvalidOid,
@@ -1646,6 +1686,51 @@ pa_xact_finish(ParallelApplyWorkerInfo *winfo, XLogRecPtr remote_lsn)
pa_free_worker(winfo);
}
+/*
+ * Attach to the shared hash table for parallelized transactions.
+ */
+static void
+pa_attach_parallelized_txn_hash(dsa_handle *pa_dsa_handle,
+ dshash_table_handle *pa_dshash_handle)
+{
+ MemoryContext oldctx;
+
+ if (parallelized_txns)
+ {
+ Assert(parallel_apply_dsa_area);
+ *pa_dsa_handle = dsa_get_handle(parallel_apply_dsa_area);
+ *pa_dshash_handle = dshash_get_hash_table_handle(parallelized_txns);
+ return;
+ }
+
+ /* Be sure any local memory allocated by DSA routines is persistent. */
+ oldctx = MemoryContextSwitchTo(ApplyContext);
+
+ if (am_leader_apply_worker())
+ {
+ /* Initialize the dynamic shared hash table for parallelized transactions. */
+ parallel_apply_dsa_area = dsa_create(LWTRANCHE_PARALLEL_APPLY_DSA);
+ dsa_pin(parallel_apply_dsa_area);
+ dsa_pin_mapping(parallel_apply_dsa_area);
+ parallelized_txns = dshash_create(parallel_apply_dsa_area, &dsh_params, NULL);
+
+ /* Store handles in shared memory for other backends to use. */
+ *pa_dsa_handle = dsa_get_handle(parallel_apply_dsa_area);
+ *pa_dshash_handle = dshash_get_hash_table_handle(parallelized_txns);
+ }
+ else if (am_parallel_apply_worker())
+ {
+ /* Attach to existing dynamic shared hash table. */
+ parallel_apply_dsa_area = dsa_attach(MyParallelShared->parallel_apply_dsa_handle);
+ dsa_pin_mapping(parallel_apply_dsa_area);
+ parallelized_txns = dshash_attach(parallel_apply_dsa_area, &dsh_params,
+ MyParallelShared->parallelized_txns_handle,
+ NULL);
+ }
+
+ MemoryContextSwitchTo(oldctx);
+}
+
/*
* Wait for the given transaction to finish.
*/
@@ -1656,7 +1741,20 @@ pa_wait_for_depended_transaction(TransactionId xid)
for (;;)
{
- /* XXX wait until given transaction is finished */
+ ParallelizedTxnEntry *txn_entry;
+
+ txn_entry = dshash_find(parallelized_txns, &xid, false);
+
+ /* The entry is removed only if the transaction is committed */
+ if (txn_entry == NULL)
+ break;
+
+ dshash_release_lock(parallelized_txns, txn_entry);
+
+ pa_lock_transaction(xid, AccessShareLock);
+ pa_unlock_transaction(xid, AccessShareLock);
+
+ CHECK_FOR_INTERRUPTS();
}
elog(DEBUG1, "finish waiting for depended xid %u", xid);
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index c1ac71ff7f2..a561f8ff459 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -404,6 +404,7 @@ SubtransSLRU "Waiting to access the sub-transaction SLRU cache."
XactSLRU "Waiting to access the transaction status SLRU cache."
ParallelVacuumDSA "Waiting for parallel vacuum dynamic shared memory allocation."
AioUringCompletion "Waiting for another process to complete IO via io_uring."
+ParallelApplyDSA "Waiting for parallel apply dynamic shared memory allocation."
# No "ABI_compatibility" region here as WaitEventLWLock has its own C code.
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index a3526eae578..ddcdcc05053 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -15,6 +15,7 @@
#include "access/xlogdefs.h"
#include "catalog/pg_subscription.h"
#include "datatype/timestamp.h"
+#include "lib/dshash.h"
#include "miscadmin.h"
#include "replication/logicalrelation.h"
#include "replication/walreceiver.h"
@@ -197,6 +198,9 @@ typedef struct ParallelApplyWorkerShared
*/
PartialFileSetState fileset_state;
FileSet fileset;
+
+ dsa_handle parallel_apply_dsa_handle;
+ dshash_table_handle parallelized_txns_handle;
} ParallelApplyWorkerShared;
/*
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 5b0ce383408..d68940b02bc 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -136,3 +136,4 @@ PG_LWLOCKTRANCHE(SUBTRANS_SLRU, SubtransSLRU)
PG_LWLOCKTRANCHE(XACT_SLRU, XactSLRU)
PG_LWLOCKTRANCHE(PARALLEL_VACUUM_DSA, ParallelVacuumDSA)
PG_LWLOCKTRANCHE(AIO_URING_COMPLETION, AioUringCompletion)
+PG_LWLOCKTRANCHE(PARALLEL_APPLY_DSA, ParallelApplyDSA)
--
2.47.3
v3-0003-Introduce-a-local-hash-table-to-store-replica-ide.patch (application/octet-stream)
From 5339af3055461f7a9229eb8acb28fd2e70944481 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Mon, 1 Dec 2025 16:39:02 +0900
Subject: [PATCH v3 3/4] Introduce a local hash table to store replica
identities
This local hash table on the leader is used for detecting dependencies between
transactions.
The hash contains the Replica Identity (RI) as a key and the remote XID that
modified the corresponding tuple. The hash entries are inserted when the leader
finds an RI from a replication message. Entries are deleted when transactions
committed by parallel workers are gathered, or the number of entries exceeds the
limit.
When the leader sends replication changes to parallel workers, it checks whether
other transactions have already used the RI associated with the change. If
something is found, the leader treats it as a dependent transaction and notifies
parallel workers to wait until it finishes via LOGICAL_REP_MSG_INTERNAL_DEPENDENCY.
Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
---
.../replication/logical/applyparallelworker.c | 123 +++-
src/backend/replication/logical/relation.c | 24 +
src/backend/replication/logical/worker.c | 616 +++++++++++++++++-
src/include/replication/logicalrelation.h | 3 +
src/include/replication/worker_internal.h | 8 +-
5 files changed, 771 insertions(+), 3 deletions(-)
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index bc8a0480778..40d57daf179 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -216,6 +216,7 @@ typedef struct ParallelApplyWorkerEntry
{
TransactionId xid; /* Hash key -- must be first */
ParallelApplyWorkerInfo *winfo;
+ XLogRecPtr local_end;
} ParallelApplyWorkerEntry;
/* an entry in the parallelized_txns shared hash table */
@@ -504,7 +505,7 @@ pa_launch_parallel_worker(void)
* streaming changes.
*/
void
-pa_allocate_worker(TransactionId xid)
+pa_allocate_worker(TransactionId xid, bool stream_txn)
{
bool found;
ParallelApplyWorkerInfo *winfo = NULL;
@@ -545,7 +546,9 @@ pa_allocate_worker(TransactionId xid)
winfo->in_use = true;
winfo->serialize_changes = false;
+ winfo->stream_txn = stream_txn;
entry->winfo = winfo;
+ entry->local_end = InvalidXLogRecPtr;
}
/*
@@ -742,6 +745,73 @@ pa_process_spooled_messages_if_required(void)
return true;
}
+/*
+ * Get the local end LSN for a transaction applied by a parallel apply worker.
+ *
+ * Set delete_entry to true if you intend to remove the transaction from the
+ * ParallelApplyTxnHash after collecting its LSN.
+ *
+ * If the parallel apply worker did not write any changes during the transaction
+ * application due to situations like update/delete_missing or a before trigger,
+ * the *skipped_write will be set to true.
+ */
+XLogRecPtr
+pa_get_last_commit_end(TransactionId xid, bool delete_entry, bool *skipped_write)
+{
+ bool found;
+ ParallelApplyWorkerEntry *entry;
+ ParallelApplyWorkerInfo *winfo;
+
+ Assert(TransactionIdIsValid(xid));
+
+ if (skipped_write)
+ *skipped_write = false;
+
+ /* Find an entry for the requested transaction. */
+ entry = hash_search(ParallelApplyTxnHash, &xid, HASH_FIND, &found);
+
+ if (!found)
+ return InvalidXLogRecPtr;
+
+ /*
+ * If worker info is NULL, it indicates that the worker has been reused
+ * for handling other transactions. Consequently, the local end LSN has
+ * already been collected and saved in entry->local_end.
+ */
+ winfo = entry->winfo;
+ if (winfo == NULL)
+ {
+ if (delete_entry &&
+ !hash_search(ParallelApplyTxnHash, &xid, HASH_REMOVE, NULL))
+ elog(ERROR, "hash table corrupted");
+
+ if (skipped_write)
+ *skipped_write = XLogRecPtrIsInvalid(entry->local_end);
+
+ return entry->local_end;
+ }
+
+ /* Return InvalidXLogRecPtr if the transaction is still in progress */
+ if (pa_get_xact_state(winfo->shared) != PARALLEL_TRANS_FINISHED)
+ return InvalidXLogRecPtr;
+
+ /* Collect the local end LSN from the worker's shared memory area */
+ entry->local_end = winfo->shared->last_commit_end;
+ entry->winfo = NULL;
+
+ if (skipped_write)
+ *skipped_write = XLogRecPtrIsInvalid(entry->local_end);
+
+ elog(DEBUG1, "store local commit %X/%X end to txn entry: %u",
+ LSN_FORMAT_ARGS(entry->local_end), xid);
+
+ if (delete_entry &&
+ !hash_search(ParallelApplyTxnHash, &xid, HASH_REMOVE, NULL))
+ elog(ERROR, "hash table corrupted");
+
+ return entry->local_end;
+}
+
/*
* Interrupt handler for main loop of parallel apply worker.
*/
@@ -1686,6 +1756,26 @@ pa_xact_finish(ParallelApplyWorkerInfo *winfo, XLogRecPtr remote_lsn)
pa_free_worker(winfo);
}
+bool
+pa_transaction_committed(TransactionId xid)
+{
+ bool found;
+ ParallelApplyWorkerEntry *entry;
+
+ Assert(TransactionIdIsValid(xid));
+
+ /* Find an entry for the requested transaction */
+ entry = hash_search(ParallelApplyTxnHash, &xid, HASH_FIND, &found);
+
+ if (!found)
+ return true;
+
+ if (!entry->winfo)
+ return true;
+
+ return pa_get_xact_state(entry->winfo->shared) == PARALLEL_TRANS_FINISHED;
+}
+
/*
* Attach to the shared hash table for parallelized transactions.
*/
@@ -1731,6 +1821,37 @@ pa_attach_parallelized_txn_hash(dsa_handle *pa_dsa_handle,
MemoryContextSwitchTo(oldctx);
}
+/*
+ * Record in-progress transactions from the given list that are being depended
+ * on into the shared hash table.
+ */
+void
+pa_record_dependency_on_transactions(List *depends_on_xids)
+{
+ foreach_xid(xid, depends_on_xids)
+ {
+ bool found;
+ ParallelApplyWorkerEntry *winfo_entry;
+ ParallelApplyWorkerInfo *winfo;
+ ParallelizedTxnEntry *txn_entry;
+
+ winfo_entry = hash_search(ParallelApplyTxnHash, &xid, HASH_FIND, &found);
+ winfo = winfo_entry->winfo;
+
+ txn_entry = dshash_find_or_insert(parallelized_txns, &xid, &found);
+
+ /*
+ * If the transaction has been committed now, remove the entry,
+ * otherwise the parallel apply worker will remove the entry once
+ * committed the transaction.
+ */
+ if (pa_get_xact_state(winfo->shared) == PARALLEL_TRANS_FINISHED)
+ dshash_delete_entry(parallelized_txns, txn_entry);
+ else
+ dshash_release_lock(parallelized_txns, txn_entry);
+ }
+}
+
/*
* Wait for the given transaction to finish.
*/
diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index 10b3d0d9b82..66c73ce34a1 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -959,3 +959,27 @@ FindLogicalRepLocalIndex(Relation localrel, LogicalRepRelation *remoterel,
return InvalidOid;
}
+
+/*
+ * Get the LogicalRepRelMapEntry corresponding to the given relid without
+ * opening the local relation.
+ */
+LogicalRepRelMapEntry *
+logicalrep_get_relentry(LogicalRepRelId remoteid)
+{
+ LogicalRepRelMapEntry *entry;
+ bool found;
+
+ if (LogicalRepRelMap == NULL)
+ logicalrep_relmap_init();
+
+ /* Search for existing entry. */
+ entry = hash_search(LogicalRepRelMap, (void *) &remoteid,
+ HASH_FIND, &found);
+
+ if (!found)
+ elog(DEBUG1, "no relation map entry for remote relation ID %u",
+ remoteid);
+
+ return entry;
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index ebf8cd62552..269a3ac5804 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -303,6 +303,7 @@ typedef struct FlushPosition
dlist_node node;
XLogRecPtr local_end;
XLogRecPtr remote_end;
+ TransactionId pa_remote_xid;
} FlushPosition;
static dlist_head lsn_mapping = DLIST_STATIC_INIT(lsn_mapping);
@@ -544,6 +545,49 @@ typedef struct ApplySubXactData
static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};
+typedef struct ReplicaIdentityKey
+{
+ Oid relid;
+ LogicalRepTupleData *data;
+} ReplicaIdentityKey;
+
+typedef struct ReplicaIdentityEntry
+{
+ ReplicaIdentityKey *keydata;
+ TransactionId remote_xid;
+
+ /* needed for simplehash */
+ uint32 hash;
+ char status;
+} ReplicaIdentityEntry;
+
+#include "common/hashfn.h"
+
+static uint32 hash_replica_identity(ReplicaIdentityKey *key);
+static bool hash_replica_identity_compare(ReplicaIdentityKey *a,
+ ReplicaIdentityKey *b);
+
+/* Define parameters for replica identity hash table code generation. */
+#define SH_PREFIX replica_identity
+#define SH_ELEMENT_TYPE ReplicaIdentityEntry
+#define SH_KEY_TYPE ReplicaIdentityKey *
+#define SH_KEY keydata
+#define SH_HASH_KEY(tb, key) hash_replica_identity(key)
+#define SH_EQUAL(tb, a, b) hash_replica_identity_compare(a, b)
+#define SH_STORE_HASH
+#define SH_GET_HASH(tb, a) (a)->hash
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+#define REPLICA_IDENTITY_INITIAL_SIZE 128
+#define REPLICA_IDENTITY_CLEANUP_THRESHOLD 1024
+
+static replica_identity_hash *replica_identity_table = NULL;
+
+static void write_internal_dependencies(StringInfo s, List *depends_on_xids);
+
static inline void subxact_filename(char *path, Oid subid, TransactionId xid);
static inline void changes_filename(char *path, Oid subid, TransactionId xid);
@@ -629,6 +673,546 @@ static TransApplyAction get_transaction_apply_action(TransactionId xid,
static void replorigin_reset(int code, Datum arg);
+static bool send_internal_dependencies(ParallelApplyWorkerInfo *winfo,
+ StringInfo s);
+
+/*
+ * Compute the hash value for entries in the replica_identity_table.
+ */
+static uint32
+hash_replica_identity(ReplicaIdentityKey *key)
+{
+ int i;
+ uint32 hashkey = 0;
+
+ hashkey = hash_combine(hashkey, hash_uint32(key->relid));
+
+ for (i = 0; i < key->data->ncols; i++)
+ {
+ uint32 hkey;
+
+ if (key->data->colstatus[i] == LOGICALREP_COLUMN_NULL)
+ continue;
+
+ hkey = hash_any((const unsigned char *) key->data->colvalues[i].data,
+ key->data->colvalues[i].len);
+ hashkey = hash_combine(hashkey, hkey);
+ }
+
+ return hashkey;
+}
+
+/*
+ * Compare two entries in the replica_identity_table.
+ */
+static bool
+hash_replica_identity_compare(ReplicaIdentityKey *a, ReplicaIdentityKey *b)
+{
+ if (a->relid != b->relid ||
+ a->data->ncols != b->data->ncols)
+ return false;
+
+ for (int i = 0; i < a->data->ncols; i++)
+ {
+ if (a->data->colstatus[i] != b->data->colstatus[i])
+ return false;
+
+ if (a->data->colvalues[i].len != b->data->colvalues[i].len)
+ return false;
+
+ if (strcmp(a->data->colvalues[i].data, b->data->colvalues[i].data))
+ return false;
+
+ elog(DEBUG1, "conflicting key %s", a->data->colvalues[i].data);
+ }
+
+ return true;
+}
+
+/*
+ * Free resources associated with a replica identity key.
+ */
+static void
+free_replica_identity_key(ReplicaIdentityKey *key)
+{
+ Assert(key);
+
+ pfree(key->data->colvalues);
+ pfree(key->data->colstatus);
+ pfree(key->data);
+ pfree(key);
+}
+
+/*
+ * Clean up hash table entries associated with the given transaction IDs.
+ */
+static void
+cleanup_replica_identity_table(List *committed_xid)
+{
+ replica_identity_iterator i;
+ ReplicaIdentityEntry *rientry;
+
+ if (!committed_xid)
+ return;
+
+ replica_identity_start_iterate(replica_identity_table, &i);
+ while ((rientry = replica_identity_iterate(replica_identity_table, &i)) != NULL)
+ {
+ if (!list_member_xid(committed_xid, rientry->remote_xid))
+ continue;
+
+ /* Clean up the hash entry for committed transaction */
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+ }
+}
+
+/*
+ * Check committed transactions and clean up corresponding entries in the hash
+ * table.
+ */
+static void
+cleanup_committed_replica_identity_entries(void)
+{
+ dlist_mutable_iter iter;
+ List *committed_xids = NIL;
+
+ dlist_foreach_modify(iter, &lsn_mapping)
+ {
+ FlushPosition *pos =
+ dlist_container(FlushPosition, node, iter.cur);
+ bool skipped_write;
+
+ if (!TransactionIdIsValid(pos->pa_remote_xid) ||
+ !XLogRecPtrIsInvalid(pos->local_end))
+ continue;
+
+ pos->local_end = pa_get_last_commit_end(pos->pa_remote_xid, true,
+ &skipped_write);
+
+ elog(DEBUG1,
+ "got commit end from parallel apply worker, "
+ "txn: %u, remote_end %X/%X, local_end %X/%X",
+ pos->pa_remote_xid, LSN_FORMAT_ARGS(pos->remote_end),
+ LSN_FORMAT_ARGS(pos->local_end));
+
+ if (!skipped_write && XLogRecPtrIsInvalid(pos->local_end))
+ continue;
+
+ committed_xids = lappend_xid(committed_xids, pos->pa_remote_xid);
+ }
+
+ /* cleanup the entries for committed transactions */
+ cleanup_replica_identity_table(committed_xids);
+}
+
+/*
+ * Append a transaction dependency, excluding duplicates and committed
+ * transactions.
+ */
+static List *
+check_and_append_xid_dependency(List *depends_on_xids,
+ TransactionId *depends_on_xid,
+ TransactionId current_xid)
+{
+ Assert(depends_on_xid);
+
+ if (!TransactionIdIsValid(*depends_on_xid))
+ return depends_on_xids;
+
+ if (TransactionIdEquals(*depends_on_xid, current_xid))
+ return depends_on_xids;
+
+ if (list_member_xid(depends_on_xids, *depends_on_xid))
+ return depends_on_xids;
+
+ /*
+ * Return and reset the xid if the transaction has been committed.
+ */
+ if (pa_transaction_committed(*depends_on_xid))
+ {
+ *depends_on_xid = InvalidTransactionId;
+ return depends_on_xids;
+ }
+
+ return lappend_xid(depends_on_xids, *depends_on_xid);
+}
+
+/*
+ * Check for dependencies on preceding transactions that modify the same key.
+ * Returns the dependent transactions in 'depends_on_xids' and records the
+ * current change.
+ */
+static void
+check_dependency_on_replica_identity(Oid relid,
+ LogicalRepTupleData *original_data,
+ TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ LogicalRepRelMapEntry *relentry;
+ LogicalRepTupleData *ridata;
+ ReplicaIdentityKey *rikey;
+ ReplicaIdentityEntry *rientry;
+ MemoryContext oldctx;
+ int n_ri;
+ bool found = false;
+
+ Assert(depends_on_xids);
+
+ /* Search for existing entry */
+ relentry = logicalrep_get_relentry(relid);
+
+ Assert(relentry);
+
+ /*
+ * First search whether any previous transaction has affected the whole
+ * table e.g., truncate or schema change from publisher.
+ */
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &relentry->last_depended_xid,
+ new_depended_xid);
+
+ n_ri = bms_num_members(relentry->remoterel.attkeys);
+
+ /*
+ * Return if there are no replica identity columns, indicating that the
+ * remote relation has neither a replica identity key nor is marked as
+ * replica identity full.
+ */
+ if (!n_ri)
+ return;
+
+ /* Check if the RI key value of the tuple is invalid */
+ for (int i = 0; i < original_data->ncols; i++)
+ {
+ if (!bms_is_member(i, relentry->remoterel.attkeys))
+ continue;
+
+ /*
+ * Return if RI key is NULL or is explicitly marked unchanged. The key
+ * value could be NULL in the new tuple of an update operation, which
+ * means the RI key is not updated.
+ */
+ if (original_data->colstatus[i] == LOGICALREP_COLUMN_NULL ||
+ original_data->colstatus[i] == LOGICALREP_COLUMN_UNCHANGED)
+ return;
+ }
+
+ oldctx = MemoryContextSwitchTo(ApplyContext);
+
+ /* Allocate space for replica identity values */
+ ridata = palloc0_object(LogicalRepTupleData);
+ ridata->colvalues = palloc0_array(StringInfoData, n_ri);
+ ridata->colstatus = palloc0_array(char, n_ri);
+ ridata->ncols = n_ri;
+
+ for (int i_original = 0, i_ri = 0; i_original < original_data->ncols; i_original++)
+ {
+ StringInfo original_colvalue = &original_data->colvalues[i_original];
+
+ if (!bms_is_member(i_original, relentry->remoterel.attkeys))
+ continue;
+
+ initStringInfoExt(&ridata->colvalues[i_ri], original_colvalue->len + 1);
+ appendStringInfoString(&ridata->colvalues[i_ri], original_colvalue->data);
+ ridata->colstatus[i_ri] = original_data->colstatus[i_original];
+ i_ri++;
+ }
+
+ rikey = palloc0_object(ReplicaIdentityKey);
+ rikey->relid = relid;
+ rikey->data = ridata;
+
+ if (TransactionIdIsValid(new_depended_xid))
+ {
+ rientry = replica_identity_insert(replica_identity_table, rikey,
+ &found);
+
+ /*
+ * Release the key built to search the entry, if the entry already
+ * exists. Otherwise, initialize the remote_xid.
+ */
+ if (found)
+ {
+ elog(DEBUG1, "found conflicting replica identity change from %u",
+ rientry->remote_xid);
+
+ free_replica_identity_key(rikey);
+ }
+ else
+ rientry->remote_xid = InvalidTransactionId;
+ }
+ else
+ {
+ rientry = replica_identity_lookup(replica_identity_table, rikey);
+ free_replica_identity_key(rikey);
+ }
+
+ MemoryContextSwitchTo(oldctx);
+
+ /* Return if no entry found */
+ if (!rientry)
+ return;
+
+ Assert(!found || TransactionIdIsValid(rientry->remote_xid));
+
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &rientry->remote_xid,
+ new_depended_xid);
+
+ /*
+ * Update the entry with the new depended-on xid if it is valid. The
+ * new xid could be invalid if the transaction will be applied by the
+ * leader itself, which means all the changes will be committed before
+ * processing the next transaction, so there is no need to depend on it.
+ */
+ if (TransactionIdIsValid(new_depended_xid))
+ rientry->remote_xid = new_depended_xid;
+
+ /*
+ * Remove the entry if the transaction has been committed and no new
+ * dependency needs to be added.
+ */
+ else if (!TransactionIdIsValid(rientry->remote_xid))
+ {
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+ }
+}
+
+/*
+ * Check for preceding transactions that involve insert, delete, or update
+ * operations on the specified table, and return them in 'depends_on_xids'.
+ */
+static void
+find_all_dependencies_on_rel(LogicalRepRelId relid, TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ replica_identity_iterator i;
+ ReplicaIdentityEntry *rientry;
+
+ Assert(depends_on_xids);
+
+ replica_identity_start_iterate(replica_identity_table, &i);
+ while ((rientry = replica_identity_iterate(replica_identity_table, &i)) != NULL)
+ {
+ Assert(TransactionIdIsValid(rientry->remote_xid));
+
+ if (rientry->keydata->relid != relid)
+ continue;
+
+ /* Clean up the hash entry for committed transaction while on it */
+ if (pa_transaction_committed(rientry->remote_xid))
+ {
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+
+ continue;
+ }
+
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &rientry->remote_xid,
+ new_depended_xid);
+ }
+}
+
+/*
+ * Check for any preceding transactions that affect the given table and returns
+ * them in 'depends_on_xids'.
+ */
+static void
+check_dependency_on_rel(LogicalRepRelId relid, TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ LogicalRepRelMapEntry *relentry;
+
+ Assert(depends_on_xids);
+
+ find_all_dependencies_on_rel(relid, new_depended_xid, depends_on_xids);
+
+ /* Search for existing entry */
+ relentry = logicalrep_get_relentry(relid);
+
+ /*
+ * The relentry has not been initialized yet, indicating that no change
+ * has been applied yet.
+ */
+ if (!relentry)
+ return;
+
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &relentry->last_depended_xid,
+ new_depended_xid);
+
+ if (TransactionIdIsValid(new_depended_xid))
+ relentry->last_depended_xid = new_depended_xid;
+}
+
+/*
+ * Check dependencies related to the current change by determining if the
+ * modification impacts the same row or table as another ongoing transaction. If
+ * needed, instruct parallel apply workers to wait for these preceding
+ * transactions to complete.
+ *
+ * Simultaneously, track the dependency for the current change to ensure that
+ * subsequent transactions address this dependency.
+ */
+static void
+handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
+ TransactionId new_depended_xid,
+ ParallelApplyWorkerInfo *winfo)
+{
+ LogicalRepRelId relid;
+ LogicalRepTupleData oldtup;
+ LogicalRepTupleData newtup;
+ LogicalRepRelation *rel;
+ List *depends_on_xids = NIL;
+ List *remote_relids;
+ bool has_oldtup = false;
+ bool cascade = false;
+ bool restart_seqs = false;
+ StringInfoData dependencies;
+
+ /*
+ * Parse the change data using a local copy instead of consuming the
+ * given remote change directly, as the caller may also read the data
+ * from the remote message.
+ */
+ StringInfoData change = *s;
+
+ /* Compute dependency only for non-streaming transaction */
+ if (in_streamed_transaction || (winfo && winfo->stream_txn))
+ return;
+
+ /* Only the leader checks dependencies and schedules the parallel apply */
+ if (!am_leader_apply_worker())
+ return;
+
+ if (!replica_identity_table)
+ replica_identity_table = replica_identity_create(ApplyContext,
+ REPLICA_IDENTITY_INITIAL_SIZE,
+ NULL);
+
+ if (replica_identity_table->members >= REPLICA_IDENTITY_CLEANUP_THRESHOLD)
+ cleanup_committed_replica_identity_entries();
+
+ switch (action)
+ {
+ case LOGICAL_REP_MSG_INSERT:
+ relid = logicalrep_read_insert(&change, &newtup);
+ check_dependency_on_replica_identity(relid, &newtup,
+ new_depended_xid,
+ &depends_on_xids);
+ break;
+
+ case LOGICAL_REP_MSG_UPDATE:
+ relid = logicalrep_read_update(&change, &has_oldtup, &oldtup,
+ &newtup);
+
+ if (has_oldtup)
+ check_dependency_on_replica_identity(relid, &oldtup,
+ new_depended_xid,
+ &depends_on_xids);
+
+ check_dependency_on_replica_identity(relid, &newtup,
+ new_depended_xid,
+ &depends_on_xids);
+ break;
+
+ case LOGICAL_REP_MSG_DELETE:
+ relid = logicalrep_read_delete(&change, &oldtup);
+ check_dependency_on_replica_identity(relid, &oldtup,
+ new_depended_xid,
+ &depends_on_xids);
+ break;
+
+ case LOGICAL_REP_MSG_TRUNCATE:
+ remote_relids = logicalrep_read_truncate(&change, &cascade,
+ &restart_seqs);
+
+ /*
+ * Truncate affects all rows in a table, so the current
+ * transaction should wait for all preceding transactions that
+ * modified the same table.
+ */
+ foreach_int(truncated_relid, remote_relids)
+ check_dependency_on_rel(truncated_relid, new_depended_xid,
+ &depends_on_xids);
+
+ break;
+
+ case LOGICAL_REP_MSG_RELATION:
+ rel = logicalrep_read_rel(&change);
+
+ /*
+ * The replica identity key could be changed, making existing
+ * entries in the replica identity invalid. In this case, parallel
+ * apply is not allowed on this specific table until all running
+ * transactions that modified it have finished.
+ */
+ check_dependency_on_rel(rel->remoteid, new_depended_xid,
+ &depends_on_xids);
+ break;
+
+ case LOGICAL_REP_MSG_TYPE:
+ case LOGICAL_REP_MSG_MESSAGE:
+
+ /*
+ * Type updates accompany relation updates, so dependencies have
+ * already been checked during relation updates. Logical messages
+ * do not conflict with any changes, so they can be ignored.
+ */
+ break;
+
+ default:
+ Assert(false);
+ break;
+ }
+
+ if (!depends_on_xids)
+ return;
+
+ /*
+ * Notify the transactions that they are dependent on the current
+ * transaction.
+ */
+ pa_record_dependency_on_transactions(depends_on_xids);
+
+ /*
+ * If the leader applies the transaction itself, start waiting for
+ * transactions that depend on the current transaction immediately.
+ */
+ if (winfo == NULL)
+ {
+ foreach_xid(xid, depends_on_xids)
+ pa_wait_for_depended_transaction(xid);
+
+ return;
+ }
+
+ initStringInfo(&dependencies);
+
+ /* Build the dependency message used to send to parallel apply worker */
+ write_internal_dependencies(&dependencies, depends_on_xids);
+
+ (void) send_internal_dependencies(winfo, &dependencies);
+}
+
+/*
+ * Write internal dependency information to the output for the parallel apply
+ * worker.
+ */
+static void
+write_internal_dependencies(StringInfo s, List *depends_on_xids)
+{
+ pq_sendbyte(s, PARALLEL_APPLY_INTERNAL_MESSAGE);
+ pq_sendbyte(s, LOGICAL_REP_MSG_INTERNAL_DEPENDENCY);
+ pq_sendint32(s, list_length(depends_on_xids));
+
+ foreach_xid(xid, depends_on_xids)
+ pq_sendint32(s, xid);
+}
+
/*
* Handle internal dependency information.
*
@@ -826,7 +1410,10 @@ handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
/* not in streaming mode */
if (apply_action == TRANS_LEADER_APPLY)
+ {
+ handle_dependency_on_change(action, s, InvalidTransactionId, winfo);
return false;
+ }
Assert(TransactionIdIsValid(stream_xid));
@@ -1268,6 +1855,33 @@ apply_handle_begin(StringInfo s)
pgstat_report_activity(STATE_RUNNING, NULL);
}
+/*
+ * Send an INTERNAL_DEPENDENCY message to a parallel apply worker.
+ *
+ * Returns false if we switched to the serialize mode to send the message,
+ * true otherwise.
+ */
+static bool
+send_internal_dependencies(ParallelApplyWorkerInfo *winfo, StringInfo s)
+{
+ Assert(s->data[0] == PARALLEL_APPLY_INTERNAL_MESSAGE);
+ Assert(s->data[1] == LOGICAL_REP_MSG_INTERNAL_DEPENDENCY);
+
+ if (!winfo->serialize_changes)
+ {
+ if (pa_send_data(winfo, s->len, s->data))
+ return true;
+
+ pa_switch_to_partial_serialize(winfo, true);
+ }
+
+ /* Skip writing the first internal message flag */
+ s->cursor++;
+ stream_write_change(LOGICAL_REP_MSG_INTERNAL_DEPENDENCY, s);
+
+ return false;
+}
+
/*
* Handle COMMIT message.
*
@@ -1795,7 +2409,7 @@ apply_handle_stream_start(StringInfo s)
/* Try to allocate a worker for the streaming transaction. */
if (first_segment)
- pa_allocate_worker(stream_xid);
+ pa_allocate_worker(stream_xid, true);
apply_action = get_transaction_apply_action(stream_xid, &winfo);
diff --git a/src/include/replication/logicalrelation.h b/src/include/replication/logicalrelation.h
index 7a561a8e8d8..4b321bd2ad2 100644
--- a/src/include/replication/logicalrelation.h
+++ b/src/include/replication/logicalrelation.h
@@ -37,6 +37,8 @@ typedef struct LogicalRepRelMapEntry
/* Sync state. */
char state;
XLogRecPtr statelsn;
+
+ TransactionId last_depended_xid;
} LogicalRepRelMapEntry;
extern void logicalrep_relmap_update(LogicalRepRelation *remoterel);
@@ -50,5 +52,6 @@ extern void logicalrep_rel_close(LogicalRepRelMapEntry *rel,
LOCKMODE lockmode);
extern bool IsIndexUsableForReplicaIdentityFull(Relation idxrel, AttrMap *attrmap);
extern Oid GetRelationIdentityOrPK(Relation rel);
+extern LogicalRepRelMapEntry *logicalrep_get_relentry(LogicalRepRelId remoteid);
#endif /* LOGICALRELATION_H */
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index ddcdcc05053..78b5667cebe 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -235,6 +235,8 @@ typedef struct ParallelApplyWorkerInfo
*/
bool in_use;
+ bool stream_txn;
+
ParallelApplyWorkerShared *shared;
} ParallelApplyWorkerInfo;
@@ -332,8 +334,10 @@ extern void apply_error_callback(void *arg);
extern void set_apply_error_context_origin(char *originname);
/* Parallel apply worker setup and interactions */
-extern void pa_allocate_worker(TransactionId xid);
+extern void pa_allocate_worker(TransactionId xid, bool stream_txn);
extern ParallelApplyWorkerInfo *pa_find_worker(TransactionId xid);
+extern XLogRecPtr pa_get_last_commit_end(TransactionId xid, bool delete_entry,
+ bool *skipped_write);
extern void pa_detach_all_error_mq(void);
extern bool pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes,
@@ -362,6 +366,8 @@ extern void pa_decr_and_wait_stream_block(void);
extern void pa_xact_finish(ParallelApplyWorkerInfo *winfo,
XLogRecPtr remote_lsn);
+extern bool pa_transaction_committed(TransactionId xid);
+extern void pa_record_dependency_on_transactions(List *depends_on_xids);
extern void pa_wait_for_depended_transaction(TransactionId xid);
--
2.47.3
v3-0004-Parallel-apply-non-streaming-transactions.patch (application/octet-stream)
From c613be4949c211bac01c9fdc02490c9c8143fa56 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Mon, 1 Dec 2025 12:28:29 +0900
Subject: [PATCH v3 4/4] Parallel apply non-streaming transactions
--
Basic design
--
The leader worker assigns each non-streaming transaction to a parallel apply
worker. Before dispatching changes to a parallel worker, the leader verifies if
the current modification affects the same row (identified by replica identity
key) as another ongoing transaction. If so, the leader sends a list of dependent
transaction IDs to the parallel worker, indicating that the parallel apply
worker must wait for these transactions to commit before proceeding.
Each parallel apply worker records the local end LSN of the transaction it
applies in shared memory. Subsequently, the leader gathers these local end LSNs
and logs them in the local 'lsn_mapping' for verifying whether they have been
flushed to disk (following the logic in get_flush_position()).
If no parallel apply worker is available, the leader will apply the transaction
independently.
For further details, please refer to the following:
--
dependency tracking
--
The leader maintains a local hash table, using the remote change's replica
identity column values and relid as keys, with remote transaction IDs as values.
Before sending changes to the parallel apply worker, the leader computes a hash
using RI key values and the relid of the current change to search the hash
table. If an existing entry is found, the leader tells the parallel worker
to wait for the remote xid in the hash entry, after which the leader updates the
hash entry with the current xid.
If the remote relation lacks a replica identity (RI), it indicates that only
INSERT can be replicated for this table. In such cases, the leader skips
dependency checks, allowing the parallel apply worker to proceed with applying
changes without delay. This is because the only potential conflict that
could happen is related to a local unique key or foreign key, handling of
which is yet to be implemented (see TODO - dependency on local unique key,
foreign key).
In cases of TRUNCATE or remote schema changes affecting the entire table, the
leader retrieves all remote xids touching the same table (via sequential scans
of the hash table) and tells the parallel worker to wait for those transactions
to commit.
Hash entries are cleaned up once the transaction corresponding to the remote xid
in the entry has been committed. Clean-up typically occurs when collecting the
flush position of each transaction, but is forced if the hash table exceeds a
set threshold.
--
dependency waiting
--
If a transaction is relied upon by others, the leader adds its xid to a shared
hash table. The shared hash table entry is cleared by the parallel apply worker
upon completing the transaction. Workers needing to wait for a transaction check
the shared hash table entry; if present, they lock the transaction ID (using
pa_lock_transaction). If absent, it indicates the transaction has been
committed, negating the need to wait.
--
commit order
--
There is a case where columns have no foreign or primary keys, and integrity is
maintained at the application layer. In this case, the above RI mechanism cannot
detect any dependencies. For safety reasons, parallel apply workers preserve the
commit ordering done on the publisher side. This is done by the leader worker
caching the lastly dispatched transaction ID and adding a dependency between it
and the currently dispatching one.
--
TODO - dependency on local unique key, foreign key.
--
A transaction could conflict with another if modifying the same unique key.
While current patches don't address conflicts involving unique or foreign keys,
tracking these dependencies might be needed.
--
TODO - user defined trigger and constraints.
--
It would be a challenge to check dependencies if the table has user-defined
triggers or constraints. The most viable solution might be to disallow parallel
apply for relations whose triggers and constraints are not marked as
parallel-safe or immutable.
---
.../replication/logical/applyparallelworker.c | 332 ++++++++++++++++--
src/backend/replication/logical/proto.c | 38 ++
src/backend/replication/logical/relation.c | 31 ++
src/backend/replication/logical/worker.c | 309 ++++++++++++++--
src/include/replication/logicalproto.h | 2 +
src/include/replication/logicalrelation.h | 2 +
src/include/replication/worker_internal.h | 11 +-
src/test/subscription/t/001_rep_changes.pl | 2 +
src/test/subscription/t/010_truncate.pl | 2 +-
src/test/subscription/t/015_stream.pl | 8 +-
src/test/subscription/t/026_stats.pl | 1 +
src/test/subscription/t/027_nosuperuser.pl | 1 +
src/tools/pgindent/typedefs.list | 4 +
13 files changed, 668 insertions(+), 75 deletions(-)
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index 40d57daf179..47b5bc3b48a 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -14,6 +14,9 @@
* ParallelApplyWorkerInfo which is required so the leader worker and parallel
* apply workers can communicate with each other.
*
+ * Streaming transactions
+ * ======================
+ *
* The parallel apply workers are assigned (if available) as soon as xact's
* first stream is received for subscriptions that have set their 'streaming'
* option as parallel. The leader apply worker will send changes to this new
@@ -146,6 +149,33 @@
* which will detect deadlock if any. See pa_send_data() and
* enum TransApplyAction.
*
+ * Non-streaming transactions
+ * ==========================
+ * The handling is similar to streaming transactions, but includes a few
+ * differences:
+ *
+ * Transaction dependency
+ * -------------------------------
+ * Before dispatching changes to a parallel worker, the leader verifies if the
+ * current modification affects the same row (identitied by replica identity
+ * key) as another ongoing transaction (see handle_dependency_on_change for
+ * details). If so, the leader sends a list of dependent transaction IDs to the
+ * parallel worker, indicating that the parallel apply worker must wait for
+ * these transactions to commit before proceeding.
+ *
+ * Commit order
+ * ------------
+ * There is a case where columns have no foreign or primary keys, and integrity
+ * is maintained at the application layer. In this case, the above RI mechanism
+ * cannot detect any dependencies. For safety reasons, parallel apply workers
+ * preserve the commit ordering done on the publisher side. This is done by the
+ * leader worker caching the last dispatched transaction ID and adding a
+ * dependency between it and the currently dispatching one.
+ * We can extend the parallel apply worker to allow out-of-order commits in the
+ * future: At least we must use a new mechanism to track replication progress
+ * in out-of-order commits. Then we can stop caching the transaction ID and
+ * adding the dependency.
+ *
* Lock types
* ----------
* Both the stream lock and the transaction lock mentioned above are
@@ -283,6 +313,7 @@ static ParallelTransState pa_get_xact_state(ParallelApplyWorkerShared *wshared);
static PartialFileSetState pa_get_fileset_state(void);
static void pa_attach_parallelized_txn_hash(dsa_handle *pa_dsa_handle,
dshash_table_handle *pa_dshash_handle);
+static void write_internal_relation(StringInfo s, LogicalRepRelation *rel);
/*
* Returns true if it is OK to start a parallel apply worker, false otherwise.
@@ -400,6 +431,7 @@ pa_setup_dsm(ParallelApplyWorkerInfo *winfo)
shared = shm_toc_allocate(toc, sizeof(ParallelApplyWorkerShared));
SpinLockInit(&shared->mutex);
+ shared->xid = InvalidTransactionId;
shared->xact_state = PARALLEL_TRANS_UNKNOWN;
pg_atomic_init_u32(&(shared->pending_stream_count), 0);
shared->last_commit_end = InvalidXLogRecPtr;
@@ -443,6 +475,8 @@ pa_launch_parallel_worker(void)
MemoryContext oldcontext;
bool launched;
ParallelApplyWorkerInfo *winfo;
+ dsa_handle pa_dsa_handle;
+ dshash_table_handle pa_dshash_handle;
ListCell *lc;
/* Try to get an available parallel apply worker from the worker pool. */
@@ -450,10 +484,33 @@ pa_launch_parallel_worker(void)
{
winfo = (ParallelApplyWorkerInfo *) lfirst(lc);
- if (!winfo->in_use)
+ if (!winfo->stream_txn &&
+ pa_get_xact_state(winfo->shared) == PARALLEL_TRANS_FINISHED)
+ {
+ /*
+ * Save the local commit LSN of the last transaction applied by
+ * this worker before reusing it for another transaction. This WAL
+ * position is crucial for determining the flush position in
+ * responses to the publisher (see get_flush_position()).
+ */
+ (void) pa_get_last_commit_end(winfo->shared->xid, false, NULL);
+ return winfo;
+ }
+
+ if (winfo->stream_txn && !winfo->in_use)
return winfo;
}
+ pa_attach_parallelized_txn_hash(&pa_dsa_handle, &pa_dshash_handle);
+
+ /*
+ * Return if the number of parallel apply workers has reached the maximum
+ * limit.
+ */
+ if (list_length(ParallelApplyWorkerPool) ==
+ max_parallel_apply_workers_per_subscription)
+ return NULL;
+
/*
* Start a new parallel apply worker.
*
@@ -481,18 +538,32 @@ pa_launch_parallel_worker(void)
dsm_segment_handle(winfo->dsm_seg),
false);
- if (launched)
- {
- ParallelApplyWorkerPool = lappend(ParallelApplyWorkerPool, winfo);
- }
- else
+ if (!launched)
{
+ MemoryContextSwitchTo(oldcontext);
pa_free_worker_info(winfo);
- winfo = NULL;
+ return NULL;
}
+ ParallelApplyWorkerPool = lappend(ParallelApplyWorkerPool, winfo);
+
MemoryContextSwitchTo(oldcontext);
+ /*
+ * Send all existing remote relation information to the parallel apply
+ * worker. This allows the parallel worker to initialize the
+ * LogicalRepRelMapEntry locally before applying remote changes.
+ */
+ if (logicalrep_get_num_rels())
+ {
+ StringInfoData out;
+
+ initStringInfo(&out);
+
+ write_internal_relation(&out, NULL);
+ pa_send_data(winfo, out.len, out.data);
+ }
+
return winfo;
}
@@ -597,7 +668,8 @@ pa_free_worker(ParallelApplyWorkerInfo *winfo)
{
Assert(!am_parallel_apply_worker());
Assert(winfo->in_use);
- Assert(pa_get_xact_state(winfo->shared) == PARALLEL_TRANS_FINISHED);
+ Assert(!winfo->stream_txn ||
+ pa_get_xact_state(winfo->shared) == PARALLEL_TRANS_FINISHED);
if (!hash_search(ParallelApplyTxnHash, &winfo->shared->xid, HASH_REMOVE, NULL))
elog(ERROR, "hash table corrupted");
@@ -613,9 +685,7 @@ pa_free_worker(ParallelApplyWorkerInfo *winfo)
* been serialized and then letting the parallel apply worker deal with
* the spurious message, we stop the worker.
*/
- if (winfo->serialize_changes ||
- list_length(ParallelApplyWorkerPool) >
- (max_parallel_apply_workers_per_subscription / 2))
+ if (winfo->serialize_changes)
{
logicalrep_pa_worker_stop(winfo);
pa_free_worker_info(winfo);
@@ -812,6 +882,38 @@ pa_get_last_commit_end(TransactionId xid, bool delete_entry, bool *skipped_write
return entry->local_end;
}
+/*
+ * Wait for the remote transaction associated with the specified remote xid to
+ * complete.
+ */
+static void
+pa_wait_for_transaction(TransactionId wait_for_xid)
+{
+ if (!am_leader_apply_worker())
+ return;
+
+ if (!TransactionIdIsValid(wait_for_xid))
+ return;
+
+ elog(DEBUG1, "plan to wait for remote_xid %u to finish",
+ wait_for_xid);
+
+ for (;;)
+ {
+ if (pa_transaction_committed(wait_for_xid))
+ break;
+
+ pa_lock_transaction(wait_for_xid, AccessShareLock);
+ pa_unlock_transaction(wait_for_xid, AccessShareLock);
+
+ /* An interrupt may have occurred while we were waiting. */
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ elog(DEBUG1, "finished wait for remote_xid %u to finish",
+ wait_for_xid);
+}
+
/*
* Interrupt handler for main loop of parallel apply worker.
*/
@@ -887,21 +989,34 @@ LogicalParallelApplyLoop(shm_mq_handle *mqh)
* parallel apply workers can only be PqReplMsg_WALData.
*/
c = pq_getmsgbyte(&s);
- if (c != PqReplMsg_WALData)
- elog(ERROR, "unexpected message \"%c\"", c);
-
- /*
- * Ignore statistics fields that have been updated by the leader
- * apply worker.
- *
- * XXX We can avoid sending the statistics fields from the leader
- * apply worker but for that, it needs to rebuild the entire
- * message by removing these fields which could be more work than
- * simply ignoring these fields in the parallel apply worker.
- */
- s.cursor += SIZE_STATS_MESSAGE;
+ if (c == PqReplMsg_WALData)
+ {
+ /*
+ * Ignore statistics fields that have been updated by the
+ * leader apply worker.
+ *
+ * XXX We can avoid sending the statistics fields from the
+ * leader apply worker but for that, it needs to rebuild the
+ * entire message by removing these fields which could be more
+ * work than simply ignoring these fields in the parallel
+ * apply worker.
+ */
+ s.cursor += SIZE_STATS_MESSAGE;
- apply_dispatch(&s);
+ apply_dispatch(&s);
+ }
+ else if (c == PARALLEL_APPLY_INTERNAL_MESSAGE)
+ {
+ apply_dispatch(&s);
+ }
+ else
+ {
+ /*
+ * The first byte of messages sent from leader apply worker to
+ * parallel apply workers can only be 'w' or 'i'.
+ */
+ elog(ERROR, "unexpected message \"%c\"", c);
+ }
}
else if (shmq_res == SHM_MQ_WOULD_BLOCK)
{
@@ -918,6 +1033,9 @@ LogicalParallelApplyLoop(shm_mq_handle *mqh)
if (rc & WL_LATCH_SET)
ResetLatch(MyLatch);
+
+ if (!IsTransactionState())
+ pgstat_report_stat(true);
}
}
else
@@ -955,6 +1073,9 @@ pa_shutdown(int code, Datum arg)
INVALID_PROC_NUMBER);
dsm_detach((dsm_segment *) DatumGetPointer(arg));
+
+ if (parallel_apply_dsa_area)
+ dsa_detach(parallel_apply_dsa_area);
}
/*
@@ -1267,7 +1388,6 @@ pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes, const void *data)
shm_mq_result result;
TimestampTz startTime = 0;
- Assert(!IsTransactionState());
Assert(!winfo->serialize_changes);
/*
@@ -1319,6 +1439,67 @@ pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes, const void *data)
}
}
+/*
+ * Distribute remote relation information to all active parallel apply workers
+ * that require it.
+ */
+void
+pa_distribute_schema_changes_to_workers(LogicalRepRelation *rel)
+{
+ List *workers_stopped = NIL;
+ StringInfoData out;
+
+ if (!ParallelApplyWorkerPool)
+ return;
+
+ initStringInfo(&out);
+
+ write_internal_relation(&out, rel);
+
+ foreach_ptr(ParallelApplyWorkerInfo, winfo, ParallelApplyWorkerPool)
+ {
+ /*
+ * Skip the worker responsible for the current transaction, as the
+ * relation information has already been sent to it.
+ */
+ if (winfo == stream_apply_worker)
+ continue;
+
+ /*
+ * Skip any worker that is in serialize mode, as it will soon be
+ * stopped once it finishes applying the transaction.
+ */
+ if (winfo->serialize_changes)
+ continue;
+
+ elog(DEBUG1, "distributing schema changes to pa workers");
+
+ if (pa_send_data(winfo, out.len, out.data))
+ continue;
+
+ elog(DEBUG1, "failed to distribute, will stop that worker instead");
+
+ /*
+ * Distribution to this worker failed due to a sending timeout. Wait
+ * for the worker to complete its transaction and then stop it. This
+ * is consistent with the handling of workers in serialize mode (see
+ * pa_free_worker() for details).
+ */
+ pa_wait_for_transaction(winfo->shared->xid);
+
+ pa_get_last_commit_end(winfo->shared->xid, false, NULL);
+
+ logicalrep_pa_worker_stop(winfo);
+
+ workers_stopped = lappend(workers_stopped, winfo);
+ }
+
+ pfree(out.data);
+
+ foreach_ptr(ParallelApplyWorkerInfo, winfo, workers_stopped)
+ pa_free_worker_info(winfo);
+}
+
/*
* Switch to PARTIAL_SERIALIZE mode for the current transaction -- this means
* that the current data and any subsequent data for this transaction will be
@@ -1401,8 +1582,8 @@ pa_wait_for_xact_finish(ParallelApplyWorkerInfo *winfo)
/*
* Wait for the transaction lock to be released. This is required to
- * detect deadlock among leader and parallel apply workers. Refer to the
- * comments atop this file.
+ * detect deadlock among leader and parallel apply workers. Refer
+ * to the comments atop this file.
*/
pa_lock_transaction(winfo->shared->xid, AccessShareLock);
pa_unlock_transaction(winfo->shared->xid, AccessShareLock);
@@ -1479,6 +1660,9 @@ pa_savepoint_name(Oid suboid, TransactionId xid, char *spname, Size szsp)
void
pa_start_subtrans(TransactionId current_xid, TransactionId top_xid)
{
+ if (!TransactionIdIsValid(top_xid))
+ return;
+
if (current_xid != top_xid &&
!list_member_xid(subxactlist, current_xid))
{
@@ -1735,25 +1919,41 @@ pa_decr_and_wait_stream_block(void)
void
pa_xact_finish(ParallelApplyWorkerInfo *winfo, XLogRecPtr remote_lsn)
{
+ XLogRecPtr local_lsn = InvalidXLogRecPtr;
+ TransactionId pa_remote_xid = winfo->shared->xid;
+
Assert(am_leader_apply_worker());
/*
- * Unlock the shared object lock so that parallel apply worker can
- * continue to receive and apply changes.
+ * Unlock the shared object lock taken for streaming transactions so that
+ * the parallel apply worker can continue to receive and apply changes.
*/
- pa_unlock_stream(winfo->shared->xid, AccessExclusiveLock);
+ if (winfo->stream_txn)
+ pa_unlock_stream(winfo->shared->xid, AccessExclusiveLock);
/*
- * Wait for that worker to finish. This is necessary to maintain commit
- * order which avoids failures due to transaction dependencies and
- * deadlocks.
+ * For a streaming transaction, wait for that worker to finish. This
+ * is necessary to maintain commit order, which avoids failures due to
+ * transaction dependencies and deadlocks.
+ *
+ * For a non-streaming transaction in partial serialize mode, wait for
+ * the worker to stop as well, since it cannot be reused anymore (see
+ * pa_free_worker() for details).
*/
- pa_wait_for_xact_finish(winfo);
+ if (winfo->serialize_changes || winfo->stream_txn)
+ {
+ pa_wait_for_xact_finish(winfo);
+
+ local_lsn = winfo->shared->last_commit_end;
+ pa_remote_xid = InvalidTransactionId;
+
+ pa_free_worker(winfo);
+ }
if (XLogRecPtrIsValid(remote_lsn))
- store_flush_position(remote_lsn, winfo->shared->last_commit_end);
+ store_flush_position(remote_lsn, local_lsn, pa_remote_xid);
- pa_free_worker(winfo);
+ pa_set_stream_apply_worker(NULL);
}
bool
@@ -1852,6 +2052,22 @@ pa_record_dependency_on_transactions(List *depends_on_xids)
}
}
+/*
+ * Mark the transaction state as finished and remove the shared hash entry.
+ */
+void
+pa_commit_transaction(void)
+{
+ TransactionId xid = MyParallelShared->xid;
+
+ SpinLockAcquire(&MyParallelShared->mutex);
+ MyParallelShared->xact_state = PARALLEL_TRANS_FINISHED;
+ SpinLockRelease(&MyParallelShared->mutex);
+
+ dshash_delete_key(parallelized_txns, &xid);
+ elog(DEBUG1, "depended xid %u committed", xid);
+}
+
/*
* Wait for the given transaction to finish.
*/
@@ -1880,3 +2096,45 @@ pa_wait_for_depended_transaction(TransactionId xid)
elog(DEBUG1, "finish waiting for depended xid %u", xid);
}
+
+/*
+ * Write internal relation description to the output stream.
+ */
+static void
+write_internal_relation(StringInfo s, LogicalRepRelation *rel)
+{
+ pq_sendbyte(s, PARALLEL_APPLY_INTERNAL_MESSAGE);
+ pq_sendbyte(s, LOGICAL_REP_MSG_INTERNAL_RELATION);
+
+ if (rel)
+ {
+ pq_sendint(s, 1, 4);
+ logicalrep_write_internal_rel(s, rel);
+ }
+ else
+ {
+ pq_sendint(s, logicalrep_get_num_rels(), 4);
+ logicalrep_write_all_rels(s);
+ }
+}
+
+/*
+ * Register a transaction to the shared hash table.
+ *
+ * This function is intended to be called during the commit phase of
+ * non-streamed transactions. Other parallel workers that depend on
+ * this transaction will wait until the added entry is removed.
+ */
+void
+pa_add_parallelized_transaction(TransactionId xid)
+{
+ bool found;
+ ParallelizedTxnEntry *txn_entry;
+
+ Assert(parallelized_txns);
+ Assert(TransactionIdIsValid(xid));
+
+ txn_entry = dshash_find_or_insert(parallelized_txns, &xid, &found);
+
+ dshash_release_lock(parallelized_txns, txn_entry);
+}
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 72dedee3a43..73a1bd36963 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -691,6 +691,44 @@ logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel,
logicalrep_write_attrs(out, rel, columns, include_gencols_type);
}
+/*
+ * Write internal relation description to the output stream.
+ */
+void
+logicalrep_write_internal_rel(StringInfo out, LogicalRepRelation *rel)
+{
+ pq_sendint32(out, rel->remoteid);
+
+ /* Write relation name */
+ pq_sendstring(out, rel->nspname);
+ pq_sendstring(out, rel->relname);
+
+ /* Write the replica identity. */
+ pq_sendbyte(out, rel->replident);
+
+ /* Write attribute description */
+ pq_sendint16(out, rel->natts);
+
+ for (int i = 0; i < rel->natts; i++)
+ {
+ uint8 flags = 0;
+
+ if (bms_is_member(i, rel->attkeys))
+ flags |= LOGICALREP_IS_REPLICA_IDENTITY;
+
+ pq_sendbyte(out, flags);
+
+ /* attribute name */
+ pq_sendstring(out, rel->attnames[i]);
+
+ /* attribute type id */
+ pq_sendint32(out, rel->atttyps[i]);
+
+ /* ignore attribute mode for now */
+ pq_sendint32(out, 0);
+ }
+}
+
/*
* Read the relation info from stream and return as LogicalRepRelation.
*/
diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index 66c73ce34a1..001cf6a143f 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -960,6 +960,37 @@ FindLogicalRepLocalIndex(Relation localrel, LogicalRepRelation *remoterel,
return InvalidOid;
}
+/*
+ * Get the number of entries in the LogicalRepRelMap.
+ */
+int
+logicalrep_get_num_rels(void)
+{
+ if (LogicalRepRelMap == NULL)
+ return 0;
+
+ return hash_get_num_entries(LogicalRepRelMap);
+}
+
+/*
+ * Write all the remote relation information from the LogicalRepRelMapEntry to
+ * the output stream.
+ */
+void
+logicalrep_write_all_rels(StringInfo out)
+{
+ LogicalRepRelMapEntry *entry;
+ HASH_SEQ_STATUS status;
+
+ if (LogicalRepRelMap == NULL)
+ return;
+
+ hash_seq_init(&status, LogicalRepRelMap);
+
+ while ((entry = (LogicalRepRelMapEntry *) hash_seq_search(&status)) != NULL)
+ logicalrep_write_internal_rel(out, &entry->remoterel);
+}
+
/*
* Get the LogicalRepRelMapEntry corresponding to the given relid without
* opening the local relation.
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 269a3ac5804..8c871b205fc 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -484,6 +484,8 @@ static List *on_commit_wakeup_workers_subids = NIL;
bool in_remote_transaction = false;
static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
+static TransactionId remote_xid = InvalidTransactionId;
+static TransactionId last_remote_xid = InvalidTransactionId;
/* fields valid only when processing streamed transaction */
static bool in_streamed_transaction = false;
@@ -602,11 +604,7 @@ static inline void cleanup_subxact_info(void);
/*
* Serialize and deserialize changes for a toplevel transaction.
*/
-static void stream_open_file(Oid subid, TransactionId xid,
- bool first_segment);
static void stream_write_change(char action, StringInfo s);
-static void stream_open_and_write_change(TransactionId xid, char action, StringInfo s);
-static void stream_close_file(void);
static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
@@ -676,6 +674,8 @@ static void replorigin_reset(int code, Datum arg);
static bool send_internal_dependencies(ParallelApplyWorkerInfo *winfo,
StringInfo s);
+static bool build_dependency_with_last_committed_txn(ParallelApplyWorkerInfo *winfo);
+
/*
* Compute the hash value for entries in the replica_identity_table.
*/
@@ -1406,7 +1406,11 @@ handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
TransApplyAction apply_action;
StringInfoData original_msg;
- apply_action = get_transaction_apply_action(stream_xid, &winfo);
+ Assert(!in_streamed_transaction || TransactionIdIsValid(stream_xid));
+
+ apply_action = get_transaction_apply_action(in_streamed_transaction
+ ? stream_xid : remote_xid,
+ &winfo);
/* not in streaming mode */
if (apply_action == TRANS_LEADER_APPLY)
@@ -1415,8 +1419,6 @@ handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
return false;
}
- Assert(TransactionIdIsValid(stream_xid));
-
/*
* The parallel apply worker needs the xid in this message to decide
* whether to define a savepoint, so save the original message that has
@@ -1427,15 +1429,28 @@ handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
/*
* We should have received XID of the subxact as the first part of the
- * message, so extract it.
+ * message in streaming transactions, so extract it.
*/
- current_xid = pq_getmsgint(s, 4);
+ if (in_streamed_transaction)
+ current_xid = pq_getmsgint(s, 4);
+ else
+ current_xid = remote_xid;
if (!TransactionIdIsValid(current_xid))
ereport(ERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
errmsg_internal("invalid transaction ID in streamed replication transaction")));
+ handle_dependency_on_change(action, s, current_xid, winfo);
+
+ /*
+ * Re-fetch the latest apply action as it might have been changed during
+ * dependency check.
+ */
+ apply_action = get_transaction_apply_action(in_streamed_transaction
+ ? stream_xid : remote_xid,
+ &winfo);
+
switch (apply_action)
{
case TRANS_LEADER_SERIALIZE:
@@ -1839,17 +1854,71 @@ static void
apply_handle_begin(StringInfo s)
{
LogicalRepBeginData begin_data;
+ ParallelApplyWorkerInfo *winfo;
+ TransApplyAction apply_action;
+
+ /* Save the message before it is consumed. */
+ StringInfoData original_msg = *s;
/* There must not be an active streaming transaction. */
Assert(!TransactionIdIsValid(stream_xid));
logicalrep_read_begin(s, &begin_data);
- set_apply_error_context_xact(begin_data.xid, begin_data.final_lsn);
+
+ remote_xid = begin_data.xid;
+
+ set_apply_error_context_xact(remote_xid, begin_data.final_lsn);
remote_final_lsn = begin_data.final_lsn;
maybe_start_skipping_changes(begin_data.final_lsn);
+ pa_allocate_worker(remote_xid, false);
+
+ apply_action = get_transaction_apply_action(remote_xid, &winfo);
+
+ elog(DEBUG1, "new remote_xid %u", remote_xid);
+ switch (apply_action)
+ {
+ case TRANS_LEADER_APPLY:
+ break;
+
+ case TRANS_LEADER_SEND_TO_PARALLEL:
+ Assert(winfo);
+
+ if (pa_send_data(winfo, s->len, s->data))
+ {
+ pa_set_stream_apply_worker(winfo);
+ break;
+ }
+
+ /*
+ * Switch to serialize mode when we are not able to send the
+ * change to parallel apply worker.
+ */
+ pa_switch_to_partial_serialize(winfo, true);
+
+/* fall through */
+ case TRANS_LEADER_PARTIAL_SERIALIZE:
+ Assert(winfo);
+
+ stream_write_change(LOGICAL_REP_MSG_BEGIN, &original_msg);
+
+ /* Cache the parallel apply worker for this transaction. */
+ pa_set_stream_apply_worker(winfo);
+ break;
+
+ case TRANS_PARALLEL_APPLY:
+ /* Hold the lock until the end of the transaction. */
+ pa_lock_transaction(MyParallelShared->xid, AccessExclusiveLock);
+ pa_set_xact_state(MyParallelShared, PARALLEL_TRANS_STARTED);
+ break;
+
+ default:
+ elog(ERROR, "unexpected apply action: %d", (int) apply_action);
+ break;
+ }
+
in_remote_transaction = true;
pgstat_report_activity(STATE_RUNNING, NULL);
@@ -1882,6 +1951,37 @@ send_internal_dependencies(ParallelApplyWorkerInfo *winfo, StringInfo s)
return false;
}
+/*
+ * Make a dependency between this and the last committed transaction.
+ *
+ * This function ensures that the commit ordering handled by parallel apply
+ * workers is preserved. Returns false if we switched to serialize mode to
+ * send the message, true otherwise.
+ */
+static bool
+build_dependency_with_last_committed_txn(ParallelApplyWorkerInfo *winfo)
+{
+ StringInfoData dependency_msg;
+ bool ret;
+
+ /* Skip if no transaction has been applied yet */
+ if (!TransactionIdIsValid(last_remote_xid))
+ return true;
+
+ /* Build the dependency message to send to the parallel apply worker */
+ initStringInfo(&dependency_msg);
+
+ pq_sendbyte(&dependency_msg, PARALLEL_APPLY_INTERNAL_MESSAGE);
+ pq_sendbyte(&dependency_msg, LOGICAL_REP_MSG_INTERNAL_DEPENDENCY);
+ pq_sendint32(&dependency_msg, 1);
+ pq_sendint32(&dependency_msg, last_remote_xid);
+
+ ret = send_internal_dependencies(winfo, &dependency_msg);
+
+ pfree(dependency_msg.data);
+ return ret;
+}
+
/*
* Handle COMMIT message.
*
@@ -1891,6 +1991,11 @@ static void
apply_handle_commit(StringInfo s)
{
LogicalRepCommitData commit_data;
+ ParallelApplyWorkerInfo *winfo;
+ TransApplyAction apply_action;
+
+ /* Save the message before it is consumed. */
+ StringInfoData original_msg = *s;
logicalrep_read_commit(s, &commit_data);
@@ -1901,7 +2006,84 @@ apply_handle_commit(StringInfo s)
LSN_FORMAT_ARGS(commit_data.commit_lsn),
LSN_FORMAT_ARGS(remote_final_lsn))));
- apply_handle_commit_internal(&commit_data);
+ apply_action = get_transaction_apply_action(remote_xid, &winfo);
+
+ switch (apply_action)
+ {
+ case TRANS_LEADER_APPLY:
+ apply_handle_commit_internal(&commit_data);
+ break;
+
+ case TRANS_LEADER_SEND_TO_PARALLEL:
+ Assert(winfo);
+
+ /*
+ * Mark this transaction as parallelized. This ensures that
+ * upcoming transactions wait until this transaction is committed.
+ */
+ pa_add_parallelized_transaction(remote_xid);
+
+ /*
+ * Build a dependency between this transaction and the last
+ * committed transaction to preserve the commit order, then try to
+ * send the COMMIT message if that succeeds.
+ */
+ if (build_dependency_with_last_committed_txn(winfo) &&
+ pa_send_data(winfo, s->len, s->data))
+ {
+ /* Finish processing the transaction. */
+ pa_xact_finish(winfo, commit_data.end_lsn);
+ break;
+ }
+
+ /*
+ * Switch to serialize mode when we are not able to send the
+ * change to parallel apply worker.
+ */
+ pa_switch_to_partial_serialize(winfo, true);
+/* fall through */
+ case TRANS_LEADER_PARTIAL_SERIALIZE:
+ Assert(winfo);
+
+ stream_open_and_write_change(remote_xid, LOGICAL_REP_MSG_COMMIT,
+ &original_msg);
+
+ pa_set_fileset_state(winfo->shared, FS_SERIALIZE_DONE);
+
+ /* Finish processing the transaction. */
+ pa_xact_finish(winfo, commit_data.end_lsn);
+ break;
+
+ case TRANS_PARALLEL_APPLY:
+
+ /*
+ * If the parallel apply worker is applying spooled messages then
+ * close the file before committing.
+ */
+ if (stream_fd)
+ stream_close_file();
+
+ apply_handle_commit_internal(&commit_data);
+
+ MyParallelShared->last_commit_end = XactLastCommitEnd;
+
+ pa_commit_transaction();
+
+ pa_unlock_transaction(remote_xid, AccessExclusiveLock);
+ break;
+
+ default:
+ elog(ERROR, "unexpected apply action: %d", (int) apply_action);
+ break;
+ }
+
+ /* Cache the remote_xid */
+ last_remote_xid = remote_xid;
+
+ remote_xid = InvalidTransactionId;
+ in_remote_transaction = false;
+
+ elog(DEBUG1, "reset remote_xid %u", remote_xid);
/*
* Process any tables that are being synchronized in parallel, as well as
@@ -2024,7 +2206,8 @@ apply_handle_prepare(StringInfo s)
* XactLastCommitEnd, and adding it for this purpose doesn't seems worth
* it.
*/
- store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr);
+ store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr,
+ InvalidTransactionId);
in_remote_transaction = false;
@@ -2072,6 +2255,8 @@ apply_handle_commit_prepared(StringInfo s)
/* There is no transaction when COMMIT PREPARED is called */
begin_replication_step();
+ /* TODO wait for xid to finish */
+
/*
* Update origin state so we can restart streaming from correct position
* in case of crash.
@@ -2084,7 +2269,8 @@ apply_handle_commit_prepared(StringInfo s)
CommitTransactionCommand();
pgstat_report_stat(false);
- store_flush_position(prepare_data.end_lsn, XactLastCommitEnd);
+ store_flush_position(prepare_data.end_lsn, XactLastCommitEnd,
+ InvalidTransactionId);
in_remote_transaction = false;
/*
@@ -2153,7 +2339,8 @@ apply_handle_rollback_prepared(StringInfo s)
* transaction because we always flush the WAL record for it. See
* apply_handle_prepare.
*/
- store_flush_position(rollback_data.rollback_end_lsn, InvalidXLogRecPtr);
+ store_flush_position(rollback_data.rollback_end_lsn, InvalidXLogRecPtr,
+ InvalidTransactionId);
in_remote_transaction = false;
/*
@@ -2215,7 +2402,8 @@ apply_handle_stream_prepare(StringInfo s)
* It is okay not to set the local_end LSN for the prepare because
* we always flush the prepare record. See apply_handle_prepare.
*/
- store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr);
+ store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr,
+ InvalidTransactionId);
in_remote_transaction = false;
@@ -2467,6 +2655,11 @@ apply_handle_stream_start(StringInfo s)
case TRANS_LEADER_PARTIAL_SERIALIZE:
Assert(winfo);
+ /*
+ * TODO: the parallel apply worker could start to wait too soon
+ * when processing an old stream start.
+ */
+
/*
* Open the spool file unless it was already opened when switching
* to serialize mode. The transaction started in
@@ -3084,7 +3277,20 @@ apply_handle_stream_commit(StringInfo s)
case TRANS_LEADER_SEND_TO_PARALLEL:
Assert(winfo);
- if (pa_send_data(winfo, s->len, s->data))
+ /*
+ * Unlike the non-streaming case, there is no need to mark this
+ * transaction as parallelized, because the leader waits until the
+ * streamed transaction is committed; commit ordering is therefore
+ * always preserved.
+ */
+
+ /*
+ * Build a dependency between this transaction and the last
+ * committed transaction to preserve the commit order, then try to
+ * send the COMMIT message if that succeeds.
+ */
+ if (build_dependency_with_last_committed_txn(winfo) &&
+ pa_send_data(winfo, s->len, s->data))
{
/* Finish processing the streaming transaction. */
pa_xact_finish(winfo, commit_data.end_lsn);
@@ -3140,6 +3346,9 @@ apply_handle_stream_commit(StringInfo s)
break;
}
+ /* Cache the remote xid */
+ last_remote_xid = xid;
+
/*
* Process any tables that are being synchronized in parallel, as well as
* any newly added tables or sequences.
@@ -3194,7 +3403,8 @@ apply_handle_commit_internal(LogicalRepCommitData *commit_data)
pgstat_report_stat(false);
- store_flush_position(commit_data->end_lsn, XactLastCommitEnd);
+ store_flush_position(commit_data->end_lsn, XactLastCommitEnd,
+ InvalidTransactionId);
}
else
{
@@ -3227,6 +3437,9 @@ apply_handle_relation(StringInfo s)
/* Also reset all entries in the partition map that refer to remoterel. */
logicalrep_partmap_reset_relmap(rel);
+
+ if (am_leader_apply_worker())
+ pa_distribute_schema_changes_to_workers(rel);
}
/*
@@ -4001,6 +4214,8 @@ FindDeletedTupleInLocalRel(Relation localrel, Oid localidxoid,
/*
* This handles insert, update, delete on a partitioned table.
+ *
+ * TODO, support parallel apply.
*/
static void
apply_handle_tuple_routing(ApplyExecutionData *edata,
@@ -4551,6 +4766,10 @@ apply_dispatch(StringInfo s)
* check which entries on it are already locally flushed. Those we can report
* as having been flushed.
*
+ * For non-streaming transactions managed by a parallel apply worker, we will
+ * get the local commit end from the shared parallel apply worker info once the
+ * transaction has been committed by the worker.
+ *
* The have_pending_txes is true if there are outstanding transactions that
* need to be flushed.
*/
@@ -4560,6 +4779,7 @@ get_flush_position(XLogRecPtr *write, XLogRecPtr *flush,
{
dlist_mutable_iter iter;
XLogRecPtr local_flush = GetFlushRecPtr(NULL);
+ List *committed_pa_xid = NIL;
*write = InvalidXLogRecPtr;
*flush = InvalidXLogRecPtr;
@@ -4569,6 +4789,36 @@ get_flush_position(XLogRecPtr *write, XLogRecPtr *flush,
FlushPosition *pos =
dlist_container(FlushPosition, node, iter.cur);
+ if (TransactionIdIsValid(pos->pa_remote_xid) &&
+ XLogRecPtrIsInvalid(pos->local_end))
+ {
+ bool skipped_write;
+
+ pos->local_end = pa_get_last_commit_end(pos->pa_remote_xid, true,
+ &skipped_write);
+
+ elog(DEBUG1,
+ "got commit end from parallel apply worker, "
+ "txn: %u, remote_end %X/%X, local_end %X/%X",
+ pos->pa_remote_xid, LSN_FORMAT_ARGS(pos->remote_end),
+ LSN_FORMAT_ARGS(pos->local_end));
+
+ /*
+ * Break the loop if the worker has not finished applying the
+ * transaction. There's no need to check subsequent transactions,
+ * as they must commit after the current transaction being
+ * examined and thus won't have their commit end available yet.
+ */
+ if (!skipped_write && XLogRecPtrIsInvalid(pos->local_end))
+ break;
+
+ committed_pa_xid = lappend_xid(committed_pa_xid, pos->pa_remote_xid);
+ }
+
+ /*
+ * The worker has finished applying, or the transaction was applied
+ * by the leader apply worker.
+ */
*write = pos->remote_end;
if (pos->local_end <= local_flush)
@@ -4577,29 +4827,19 @@ get_flush_position(XLogRecPtr *write, XLogRecPtr *flush,
dlist_delete(iter.cur);
pfree(pos);
}
- else
- {
- /*
- * Don't want to uselessly iterate over the rest of the list which
- * could potentially be long. Instead get the last element and
- * grab the write position from there.
- */
- pos = dlist_tail_element(FlushPosition, node,
- &lsn_mapping);
- *write = pos->remote_end;
- *have_pending_txes = true;
- return;
- }
}
*have_pending_txes = !dlist_is_empty(&lsn_mapping);
+
+ cleanup_replica_identity_table(committed_pa_xid);
}
/*
* Store current remote/local lsn pair in the tracking list.
*/
void
-store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn)
+store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn,
+ TransactionId remote_xid)
{
FlushPosition *flushpos;
@@ -4617,6 +4857,7 @@ store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn)
flushpos = (FlushPosition *) palloc(sizeof(FlushPosition));
flushpos->local_end = local_lsn;
flushpos->remote_end = remote_lsn;
+ flushpos->pa_remote_xid = remote_xid;
dlist_push_tail(&lsn_mapping, &flushpos->node);
MemoryContextSwitchTo(ApplyMessageContext);
@@ -6064,7 +6305,7 @@ stream_cleanup_files(Oid subid, TransactionId xid)
* changes for this transaction, create the buffile, otherwise open the
* previously created file.
*/
-static void
+void
stream_open_file(Oid subid, TransactionId xid, bool first_segment)
{
char path[MAXPGPATH];
@@ -6109,7 +6350,7 @@ stream_open_file(Oid subid, TransactionId xid, bool first_segment)
* stream_close_file
* Close the currently open file with streamed changes.
*/
-static void
+void
stream_close_file(void)
{
Assert(stream_fd != NULL);
@@ -6157,7 +6398,7 @@ stream_write_change(char action, StringInfo s)
* target file if not already before writing the message and close the file at
* the end.
*/
-static void
+void
stream_open_and_write_change(TransactionId xid, char action, StringInfo s)
{
Assert(!in_streamed_transaction);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 5d91e2a4287..7d2aaf2d389 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -253,6 +253,8 @@ extern void logicalrep_write_message(StringInfo out, TransactionId xid, XLogRecP
extern void logicalrep_write_rel(StringInfo out, TransactionId xid,
Relation rel, Bitmapset *columns,
PublishGencolsType include_gencols_type);
+extern void logicalrep_write_internal_rel(StringInfo out,
+ LogicalRepRelation *rel);
extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
extern void logicalrep_write_typ(StringInfo out, TransactionId xid,
Oid typoid);
diff --git a/src/include/replication/logicalrelation.h b/src/include/replication/logicalrelation.h
index 4b321bd2ad2..34a7069e9e5 100644
--- a/src/include/replication/logicalrelation.h
+++ b/src/include/replication/logicalrelation.h
@@ -52,6 +52,8 @@ extern void logicalrep_rel_close(LogicalRepRelMapEntry *rel,
LOCKMODE lockmode);
extern bool IsIndexUsableForReplicaIdentityFull(Relation idxrel, AttrMap *attrmap);
extern Oid GetRelationIdentityOrPK(Relation rel);
+extern int logicalrep_get_num_rels(void);
+extern void logicalrep_write_all_rels(StringInfo out);
extern LogicalRepRelMapEntry *logicalrep_get_relentry(LogicalRepRelId remoteid);
#endif /* LOGICALRELATION_H */
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 78b5667cebe..5371ee767f1 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -314,6 +314,10 @@ extern void apply_dispatch(StringInfo s);
extern void maybe_reread_subscription(void);
extern void stream_cleanup_files(Oid subid, TransactionId xid);
+extern void stream_open_file(Oid subid, TransactionId xid, bool first_segment);
+extern void stream_close_file(void);
+extern void stream_open_and_write_change(TransactionId xid, char action,
+ StringInfo s);
extern void set_stream_options(WalRcvStreamOptions *options,
char *slotname,
@@ -327,7 +331,8 @@ extern void SetupApplyOrSyncWorker(int worker_slot);
extern void DisableSubscriptionAndExit(void);
-extern void store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn);
+extern void store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn,
+ TransactionId remote_xid);
/* Function for apply error callback */
extern void apply_error_callback(void *arg);
@@ -342,6 +347,7 @@ extern void pa_detach_all_error_mq(void);
extern bool pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes,
const void *data);
+extern void pa_distribute_schema_changes_to_workers(LogicalRepRelation *rel);
extern void pa_switch_to_partial_serialize(ParallelApplyWorkerInfo *winfo,
bool stream_locked);
@@ -368,8 +374,9 @@ extern void pa_xact_finish(ParallelApplyWorkerInfo *winfo,
XLogRecPtr remote_lsn);
extern bool pa_transaction_committed(TransactionId xid);
extern void pa_record_dependency_on_transactions(List *depends_on_xids);
-
+extern void pa_commit_transaction(void);
extern void pa_wait_for_depended_transaction(TransactionId xid);
+extern void pa_add_parallelized_transaction(TransactionId xid);
#define isParallelApplyWorker(worker) ((worker)->in_use && \
(worker)->type == WORKERTYPE_PARALLEL_APPLY)
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 430c1246d14..2caf798ee0a 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -16,6 +16,8 @@ $node_publisher->start;
# Create subscriber node
my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf',
+ "max_logical_replication_workers = 10");
$node_subscriber->start;
# Create some preexisting content on publisher
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index 3d16c2a800d..c2fba0b9a9c 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -17,7 +17,7 @@ $node_publisher->start;
my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_subscriber->init;
$node_subscriber->append_conf('postgresql.conf',
- qq(max_logical_replication_workers = 6));
+ qq(max_logical_replication_workers = 7));
$node_subscriber->start;
my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
diff --git a/src/test/subscription/t/015_stream.pl b/src/test/subscription/t/015_stream.pl
index 03135b1cd6e..e79ddd9a41c 100644
--- a/src/test/subscription/t/015_stream.pl
+++ b/src/test/subscription/t/015_stream.pl
@@ -232,6 +232,12 @@ $node_subscriber->wait_for_log(
$node_publisher->safe_psql('postgres', "INSERT INTO test_tab_2 values(1)");
+# FIXME: Currently, non-streaming transactions are applied in parallel by
+# default. So, the first transaction is handled by a parallel apply worker. To
+# trigger the deadlock, initiate one more transaction to be applied by the
+# leader.
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab_2 values(1)");
+
$h->query_safe('COMMIT');
$h->quit;
@@ -247,7 +253,7 @@ $node_publisher->wait_for_catchup($appname);
$result =
$node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab_2");
-is($result, qq(5001), 'data replicated to subscriber after dropping index');
+is($result, qq(5002), 'data replicated to subscriber after dropping index');
# Clean up test data from the environment.
$node_publisher->safe_psql('postgres', "TRUNCATE TABLE test_tab_2");
diff --git a/src/test/subscription/t/026_stats.pl b/src/test/subscription/t/026_stats.pl
index a430ab4feec..58e34839ab4 100644
--- a/src/test/subscription/t/026_stats.pl
+++ b/src/test/subscription/t/026_stats.pl
@@ -16,6 +16,7 @@ $node_publisher->start;
# Create subscriber node.
my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf', "max_logical_replication_workers = 10");
$node_subscriber->start;
diff --git a/src/test/subscription/t/027_nosuperuser.pl b/src/test/subscription/t/027_nosuperuser.pl
index 691731743df..e0c1d213800 100644
--- a/src/test/subscription/t/027_nosuperuser.pl
+++ b/src/test/subscription/t/027_nosuperuser.pl
@@ -86,6 +86,7 @@ $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
$node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_publisher->init(allows_streaming => 'logical');
$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf', "max_logical_replication_workers = 10");
$node_publisher->start;
$node_subscriber->start;
$publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index cf3f6a7dafd..c1bdd918df5 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2088,6 +2088,7 @@ ParallelTransState
ParallelVacuumState
ParallelWorkerContext
ParallelWorkerInfo
+ParallelizedTxnEntry
Param
ParamCompileHook
ParamExecData
@@ -2558,6 +2559,8 @@ ReparameterizeForeignPathByChild_function
ReplaceVarsFromTargetList_context
ReplaceVarsNoMatchOption
ReplaceWrapOption
+ReplicaIdentityEntry
+ReplicaIdentityKey
ReplicaIdentityStmt
ReplicationKind
ReplicationSlot
@@ -4054,6 +4057,7 @@ remoteDep
remove_nulling_relids_context
rendezvousHashEntry
rep
+replica_identity_hash
replace_rte_variables_callback
replace_rte_variables_context
report_error_fn
--
2.47.3
On Mon, Dec 1, 2025 at 4:16 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
Dear Tomas,
Thanks for looking at the thread, and sorry for the late response.
I was at a PostgreSQL conference in Japan.

However, the patch seems fairly large (~80kB, although a fair bit of
that is comments). Would it be possible to split it into smaller chunks?
Is there some "minimal patch", which could be moved to 0001, and then
followed by improvements in 0002, 0003, ...? I sometimes do some
"infrastructure" first, and the actual patch in the last part (simply
using the earlier parts). I'm not saying it has to be split (or how exactly),
but I personally find smaller patches easier to review ...

Yes, smaller patches are always better than a huge monolith. I split the
patch into four patches - three of them introduce a mechanism to track
dependencies and wait until other transactions finish, and the fourth patch
launches parallel workers with them. Each patch can be built and passes its
tests individually. Two of them may still be large (~800 lines), but I hope
this is helpful for reviewers.

Some comments / questions after looking at the patch today:

We will answer them after more analysis.
I was just going through the commit messages of all the patches, and I
could not understand the last line of the paragraph below in v3-0004. What
do you mean by the last line, which says "after which the leader
updates the
hash entry with the current xid"?
"The leader maintains a local hash table, using the remote change's replica
identity column values and relid as keys, with remote transaction IDs as values.
Before sending changes to the parallel apply worker, the leader computes a hash
using RI key values and the relid of the current change to search the hash
table. If an existing entry is found, the leader tells the parallel worker
to wait for the remote xid in the hash entry, after which the leader updates the
hash entry with the current xid."
--
Regards,
Dilip Kumar
Google
Dear Dilip,
I was just going through the commit messages of all the patches, and I
could not understand the last line of the paragraph below in v3-0004. What
do you mean by the last line, which says "after which the leader
updates the
hash entry with the current xid"?"The leader maintains a local hash table, using the remote change's replica
identity column values and relid as keys, with remote transaction IDs as values.
Before sending changes to the parallel apply worker, the leader computes a hash
using RI key values and the relid of the current change to search the hash
table. If an existing entry is found, the leader tells the parallel worker
to wait for the remote xid in the hash entry, after which the leader updates the
hash entry with the current xid."
This meant that if two transactions have changes for the same RI, the XID of
the transaction committed most recently is stored there. In other words, each
local hash entry always holds the latest XID that modified a given key (RI).
Assume there are three transactions T1->T2->T3 that modify the same tuple.
When the subscriber applies T3, it should wait until T2 is committed, not T1.
The entry's XID must be updated to implement this.
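To make the rule concrete, here is a standalone sketch (this is not code from
the patch; the fixed-size table and string key are simplified stand-ins for
the real local hash keyed by relid and RI column values):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uint32_t TransactionId;

#define TABLE_SIZE 16

typedef struct RIEntry
{
	char		key[64];		/* stands in for (relid, RI key values) */
	TransactionId last_xid;		/* latest xid that modified this key */
	bool		used;
} RIEntry;

static RIEntry ri_table[TABLE_SIZE];

/*
 * Return the xid that the current transaction must wait for (0 if none),
 * and always leave the entry pointing at the current xid, so that the
 * entry records the latest writer of the key.
 */
static TransactionId
check_and_update(const char *key, TransactionId cur_xid)
{
	int			free_slot = -1;

	for (int i = 0; i < TABLE_SIZE; i++)
	{
		if (ri_table[i].used && strcmp(ri_table[i].key, key) == 0)
		{
			TransactionId dep = ri_table[i].last_xid;

			ri_table[i].last_xid = cur_xid;
			return dep;
		}
		if (!ri_table[i].used && free_slot < 0)
			free_slot = i;
	}

	if (free_slot >= 0)
	{
		ri_table[free_slot].used = true;
		snprintf(ri_table[free_slot].key, sizeof(ri_table[free_slot].key),
				 "%s", key);
		ri_table[free_slot].last_xid = cur_xid;
	}
	return 0;
}

int
main(void)
{
	/* T1 (xid 101), T2 (xid 102), T3 (xid 103) modify the same RI key. */
	printf("T1 waits for %u\n", check_and_update("tab1/pk=1", 101));	/* 0 */
	printf("T2 waits for %u\n", check_and_update("tab1/pk=1", 102));	/* 101 */
	printf("T3 waits for %u\n", check_and_update("tab1/pk=1", 103));	/* 102 */
	return 0;
}

T3 ends up waiting for 102 (T2), not 101 (T1), precisely because each lookup
also refreshes the entry to the current xid.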
I tried to rephrase that line a bit, how do you feel? All patches are attached
to keep CI happy.
Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachments:
v4-0002-Introduce-a-shared-hash-table-to-store-paralleliz.patch
From 88ce1eede98cc6fa14de5cfb674b77a4961847f1 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Mon, 1 Dec 2025 16:28:38 +0900
Subject: [PATCH v4 2/4] Introduce a shared hash table to store parallelized
transactions
This hash table is used to ensure that parallel workers wait until dependent
transactions are committed.
The shared hash table contains transaction IDs that the leader allocated to
parallel workers. A hash entry is inserted with the remote XID when the
leader dispatches a remote transaction to a parallel apply worker. Entries
are deleted when parallel workers commit the corresponding transactions.
When a parallel worker needs to wait for other transactions, it checks the
hash table for their remote XIDs. The worker can proceed only once the
entries have been removed from the hash.
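The waiting itself reuses the transaction-lock trick already used by parallel
apply for streaming transactions: the worker applying a transaction holds an
AccessExclusiveLock on its remote xid from BEGIN until COMMIT, so a dependent
worker can sleep (rather than busy-wait) by briefly acquiring and releasing a
share lock on that xid. Condensed from pa_wait_for_depended_transaction()
below:

	for (;;)
	{
		ParallelizedTxnEntry *entry = dshash_find(parallelized_txns, &xid, false);

		if (entry == NULL)
			break;				/* entry gone => xid has committed */
		dshash_release_lock(parallelized_txns, entry);

		/* Blocks while the worker applying xid still holds its lock. */
		pa_lock_transaction(xid, AccessShareLock);
		pa_unlock_transaction(xid, AccessShareLock);

		CHECK_FOR_INTERRUPTS();
	}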
Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
---
.../replication/logical/applyparallelworker.c | 100 +++++++++++++++++-
.../utils/activity/wait_event_names.txt | 1 +
src/include/replication/worker_internal.h | 4 +
src/include/storage/lwlocklist.h | 1 +
4 files changed, 105 insertions(+), 1 deletion(-)
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index 735a3e9acad..bc8a0480778 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -218,12 +218,35 @@ typedef struct ParallelApplyWorkerEntry
ParallelApplyWorkerInfo *winfo;
} ParallelApplyWorkerEntry;
+/* an entry in the parallelized_txns shared hash table */
+typedef struct ParallelizedTxnEntry
+{
+ TransactionId xid; /* Hash key */
+} ParallelizedTxnEntry;
+
/*
* A hash table used to cache the state of streaming transactions being applied
* by the parallel apply workers.
*/
static HTAB *ParallelApplyTxnHash = NULL;
+/*
+ * A hash table used to track the parallelized transactions that could be
+ * depended on by other transactions.
+ */
+static dsa_area *parallel_apply_dsa_area = NULL;
+static dshash_table *parallelized_txns = NULL;
+
+/* parameters for the parallelized_txns shared hash table */
+static const dshash_parameters dsh_params = {
+ sizeof(TransactionId),
+ sizeof(ParallelizedTxnEntry),
+ dshash_memcmp,
+ dshash_memhash,
+ dshash_memcpy,
+ LWTRANCHE_PARALLEL_APPLY_DSA
+};
+
/*
* A list (pool) of active parallel apply workers. The information for
* the new worker is added to the list after successfully launching it. The
@@ -257,6 +280,8 @@ static List *subxactlist = NIL;
static void pa_free_worker_info(ParallelApplyWorkerInfo *winfo);
static ParallelTransState pa_get_xact_state(ParallelApplyWorkerShared *wshared);
static PartialFileSetState pa_get_fileset_state(void);
+static void pa_attach_parallelized_txn_hash(dsa_handle *pa_dsa_handle,
+ dshash_table_handle *pa_dshash_handle);
/*
* Returns true if it is OK to start a parallel apply worker, false otherwise.
@@ -334,6 +359,15 @@ pa_setup_dsm(ParallelApplyWorkerInfo *winfo)
shm_mq *mq;
Size queue_size = DSM_QUEUE_SIZE;
Size error_queue_size = DSM_ERROR_QUEUE_SIZE;
+ dsa_handle parallel_apply_dsa_handle;
+ dshash_table_handle parallelized_txns_handle;
+
+ pa_attach_parallelized_txn_hash(¶llel_apply_dsa_handle,
+ ¶llelized_txns_handle);
+
+ if (parallel_apply_dsa_handle == DSA_HANDLE_INVALID ||
+ parallelized_txns_handle == DSHASH_HANDLE_INVALID)
+ return false;
/*
* Estimate how much shared memory we need.
@@ -369,6 +403,8 @@ pa_setup_dsm(ParallelApplyWorkerInfo *winfo)
pg_atomic_init_u32(&(shared->pending_stream_count), 0);
shared->last_commit_end = InvalidXLogRecPtr;
shared->fileset_state = FS_EMPTY;
+ shared->parallel_apply_dsa_handle = parallel_apply_dsa_handle;
+ shared->parallelized_txns_handle = parallelized_txns_handle;
shm_toc_insert(toc, PARALLEL_APPLY_KEY_SHARED, shared);
@@ -864,6 +900,8 @@ ParallelApplyWorkerMain(Datum main_arg)
shm_mq *mq;
shm_mq_handle *mqh;
shm_mq_handle *error_mqh;
+ dsa_handle pa_dsa_handle;
+ dshash_table_handle pa_dshash_handle;
RepOriginId originid;
int worker_slot = DatumGetInt32(main_arg);
char originname[NAMEDATALEN];
@@ -951,6 +989,8 @@ ParallelApplyWorkerMain(Datum main_arg)
InitializingApplyWorker = false;
+ pa_attach_parallelized_txn_hash(&pa_dsa_handle, &pa_dshash_handle);
+
/* Setup replication origin tracking. */
StartTransactionCommand();
ReplicationOriginNameForLogicalRep(MySubscription->oid, InvalidOid,
@@ -1646,6 +1686,51 @@ pa_xact_finish(ParallelApplyWorkerInfo *winfo, XLogRecPtr remote_lsn)
pa_free_worker(winfo);
}
+/*
+ * Attach to the shared hash table for parallelized transactions.
+ */
+static void
+pa_attach_parallelized_txn_hash(dsa_handle *pa_dsa_handle,
+ dshash_table_handle *pa_dshash_handle)
+{
+ MemoryContext oldctx;
+
+ if (parallelized_txns)
+ {
+ Assert(parallel_apply_dsa_area);
+ *pa_dsa_handle = dsa_get_handle(parallel_apply_dsa_area);
+ *pa_dshash_handle = dshash_get_hash_table_handle(parallelized_txns);
+ return;
+ }
+
+ /* Be sure any local memory allocated by DSA routines is persistent. */
+ oldctx = MemoryContextSwitchTo(ApplyContext);
+
+ if (am_leader_apply_worker())
+ {
+ /* Initialize the dynamic shared hash table for parallelized transactions. */
+ parallel_apply_dsa_area = dsa_create(LWTRANCHE_PARALLEL_APPLY_DSA);
+ dsa_pin(parallel_apply_dsa_area);
+ dsa_pin_mapping(parallel_apply_dsa_area);
+ parallelized_txns = dshash_create(parallel_apply_dsa_area, &dsh_params, NULL);
+
+ /* Store handles in shared memory for other backends to use. */
+ *pa_dsa_handle = dsa_get_handle(parallel_apply_dsa_area);
+ *pa_dshash_handle = dshash_get_hash_table_handle(parallelized_txns);
+ }
+ else if (am_parallel_apply_worker())
+ {
+ /* Attach to existing dynamic shared hash table. */
+ parallel_apply_dsa_area = dsa_attach(MyParallelShared->parallel_apply_dsa_handle);
+ dsa_pin_mapping(parallel_apply_dsa_area);
+ parallelized_txns = dshash_attach(parallel_apply_dsa_area, &dsh_params,
+ MyParallelShared->parallelized_txns_handle,
+ NULL);
+ }
+
+ MemoryContextSwitchTo(oldctx);
+}
+
/*
* Wait for the given transaction to finish.
*/
@@ -1656,7 +1741,20 @@ pa_wait_for_depended_transaction(TransactionId xid)
for (;;)
{
- /* XXX wait until given transaction is finished */
+ ParallelizedTxnEntry *txn_entry;
+
+ txn_entry = dshash_find(parallelized_txns, &xid, false);
+
+ /* The entry is removed only if the transaction is committed */
+ if (txn_entry == NULL)
+ break;
+
+ dshash_release_lock(parallelized_txns, txn_entry);
+
+ pa_lock_transaction(xid, AccessShareLock);
+ pa_unlock_transaction(xid, AccessShareLock);
+
+ CHECK_FOR_INTERRUPTS();
}
elog(DEBUG1, "finish waiting for depended xid %u", xid);
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index c1ac71ff7f2..a561f8ff459 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -404,6 +404,7 @@ SubtransSLRU "Waiting to access the sub-transaction SLRU cache."
XactSLRU "Waiting to access the transaction status SLRU cache."
ParallelVacuumDSA "Waiting for parallel vacuum dynamic shared memory allocation."
AioUringCompletion "Waiting for another process to complete IO via io_uring."
+ParallelApplyDSA "Waiting for parallel apply dynamic shared memory allocation."
# No "ABI_compatibility" region here as WaitEventLWLock has its own C code.
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index a3526eae578..ddcdcc05053 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -15,6 +15,7 @@
#include "access/xlogdefs.h"
#include "catalog/pg_subscription.h"
#include "datatype/timestamp.h"
+#include "lib/dshash.h"
#include "miscadmin.h"
#include "replication/logicalrelation.h"
#include "replication/walreceiver.h"
@@ -197,6 +198,9 @@ typedef struct ParallelApplyWorkerShared
*/
PartialFileSetState fileset_state;
FileSet fileset;
+
+ dsa_handle parallel_apply_dsa_handle;
+ dshash_table_handle parallelized_txns_handle;
} ParallelApplyWorkerShared;
/*
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 5b0ce383408..d68940b02bc 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -136,3 +136,4 @@ PG_LWLOCKTRANCHE(SUBTRANS_SLRU, SubtransSLRU)
PG_LWLOCKTRANCHE(XACT_SLRU, XactSLRU)
PG_LWLOCKTRANCHE(PARALLEL_VACUUM_DSA, ParallelVacuumDSA)
PG_LWLOCKTRANCHE(AIO_URING_COMPLETION, AioUringCompletion)
+PG_LWLOCKTRANCHE(PARALLEL_APPLY_DSA, ParallelApplyDSA)
--
2.47.3
v4-0001-Introduce-new-type-of-logical-replication-message.patch
From 926b08dbbfedc199e101d407b27a5a57fd76b9c4 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Mon, 1 Dec 2025 10:37:27 +0900
Subject: [PATCH v4 1/4] Introduce new type of logical replication messages to
track dependencies
This patch introduces two logical replication messages,
LOGICAL_REP_MSG_INTERNAL_DEPENDENCY and LOGICAL_REP_MSG_INTERNAL_RELATION.
Unlike other messages, they are not sent by walsenders; the leader worker
sends them to parallel workers as needed.
LOGICAL_REP_MSG_INTERNAL_DEPENDENCY ensures that dependent transactions are
committed in the correct order. It has a list of transaction IDs that parallel
workers must wait for. This message is generated when the leader
detects a dependency between the current and other transactions, or just
before the COMMIT message. The latter case preserves the commit ordering
between the publisher and the subscriber.
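For reference, the on-queue layout of a dependency message, as assembled by
the leader and consumed by apply_handle_internal_dependency() (derived from
the patches in this series), is:

	byte	'i'		PARALLEL_APPLY_INTERNAL_MESSAGE marker
	byte	'd'		LOGICAL_REP_MSG_INTERNAL_DEPENDENCY
	int32	nxids	number of transaction IDs to wait for
	int32	xid		repeated nxids times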
LOGICAL_REP_MSG_INTERNAL_RELATION is used to synchronize the relation
information between the leader and parallel workers. It has a list of relations
that the leader already knows about, and parallel workers update their
relmap in response to the message. This message is generated when the leader
allocates a new parallel worker to a transaction, or when the publisher sends
additional RELATION messages.
Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
---
.../replication/logical/applyparallelworker.c | 16 ++++++
src/backend/replication/logical/proto.c | 4 ++
src/backend/replication/logical/worker.c | 49 +++++++++++++++++++
src/include/replication/logicalproto.h | 2 +
src/include/replication/worker_internal.h | 4 ++
5 files changed, 75 insertions(+)
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index baa68c1ab6c..735a3e9acad 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -1645,3 +1645,19 @@ pa_xact_finish(ParallelApplyWorkerInfo *winfo, XLogRecPtr remote_lsn)
pa_free_worker(winfo);
}
+
+/*
+ * Wait for the given transaction to finish.
+ */
+void
+pa_wait_for_depended_transaction(TransactionId xid)
+{
+ elog(DEBUG1, "wait for depended xid %u", xid);
+
+ for (;;)
+ {
+ /* XXX wait until given transaction is finished */
+ }
+
+ elog(DEBUG1, "finish waiting for depended xid %u", xid);
+}
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index f0a913892b9..72dedee3a43 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -1253,6 +1253,10 @@ logicalrep_message_type(LogicalRepMsgType action)
return "STREAM ABORT";
case LOGICAL_REP_MSG_STREAM_PREPARE:
return "STREAM PREPARE";
+ case LOGICAL_REP_MSG_INTERNAL_DEPENDENCY:
+ return "INTERNAL DEPENDENCY";
+ case LOGICAL_REP_MSG_INTERNAL_RELATION:
+ return "INTERNAL RELATION";
}
/*
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 93970c6af29..ebf8cd62552 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -629,6 +629,47 @@ static TransApplyAction get_transaction_apply_action(TransactionId xid,
static void replorigin_reset(int code, Datum arg);
+/*
+ * Handle internal dependency information.
+ *
+ * Wait for all transactions listed in the message to commit.
+ */
+static void
+apply_handle_internal_dependency(StringInfo s)
+{
+ int nxids = pq_getmsgint(s, 4);
+
+ for (int i = 0; i < nxids; i++)
+ {
+ TransactionId xid = pq_getmsgint(s, 4);
+
+ pa_wait_for_depended_transaction(xid);
+ }
+}
+
+/*
+ * Handle internal relation information.
+ *
+ * Update all relation details in the relation map cache.
+ */
+static void
+apply_handle_internal_relation(StringInfo s)
+{
+ int num_rels;
+
+ num_rels = pq_getmsgint(s, 4);
+
+ for (int i = 0; i < num_rels; i++)
+ {
+ LogicalRepRelation *rel = logicalrep_read_rel(s);
+
+ logicalrep_relmap_update(rel);
+
+ elog(DEBUG1, "parallel apply worker worker init relmap for %s",
+ rel->relname);
+ }
+}
+
/*
* Form the origin name for the subscription.
*
@@ -3868,6 +3909,14 @@ apply_dispatch(StringInfo s)
apply_handle_stream_prepare(s);
break;
+ case LOGICAL_REP_MSG_INTERNAL_RELATION:
+ apply_handle_internal_relation(s);
+ break;
+
+ case LOGICAL_REP_MSG_INTERNAL_DEPENDENCY:
+ apply_handle_internal_dependency(s);
+ break;
+
default:
ereport(ERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index b261c60d3fa..5d91e2a4287 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -75,6 +75,8 @@ typedef enum LogicalRepMsgType
LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
LOGICAL_REP_MSG_STREAM_ABORT = 'A',
LOGICAL_REP_MSG_STREAM_PREPARE = 'p',
+ LOGICAL_REP_MSG_INTERNAL_DEPENDENCY = 'd',
+ LOGICAL_REP_MSG_INTERNAL_RELATION = 'i',
} LogicalRepMsgType;
/*
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index f081619f151..a3526eae578 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -359,6 +359,8 @@ extern void pa_decr_and_wait_stream_block(void);
extern void pa_xact_finish(ParallelApplyWorkerInfo *winfo,
XLogRecPtr remote_lsn);
+extern void pa_wait_for_depended_transaction(TransactionId xid);
+
#define isParallelApplyWorker(worker) ((worker)->in_use && \
(worker)->type == WORKERTYPE_PARALLEL_APPLY)
#define isTableSyncWorker(worker) ((worker)->in_use && \
@@ -366,6 +368,8 @@ extern void pa_xact_finish(ParallelApplyWorkerInfo *winfo,
#define isSequenceSyncWorker(worker) ((worker)->in_use && \
(worker)->type == WORKERTYPE_SEQUENCESYNC)
+#define PARALLEL_APPLY_INTERNAL_MESSAGE 'i'
+
static inline bool
am_tablesync_worker(void)
{
--
2.47.3
v4-0003-Introduce-a-local-hash-table-to-store-replica-ide.patch
From 5339af3055461f7a9229eb8acb28fd2e70944481 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Mon, 1 Dec 2025 16:39:02 +0900
Subject: [PATCH v4 3/4] Introduce a local hash table to store replica
identities
This local hash table on the leader is used for detecting dependencies between
transactions.
The hash uses the Replica Identity (RI) as the key and, as the value, the
remote XID that last modified the corresponding tuple. Hash entries are
inserted when the leader extracts an RI from a replication message. Entries
are deleted when the commits of transactions applied by parallel workers are
gathered, or when the number of entries exceeds the limit.
When the leader sends replication changes to parallel workers, it checks
whether another transaction has already used the RI associated with the
change. If a match is found, the leader treats the change as dependent and
notifies the parallel worker to wait until that transaction finishes via
LOGICAL_REP_MSG_INTERNAL_DEPENDENCY.
Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
---
.../replication/logical/applyparallelworker.c | 123 +++-
src/backend/replication/logical/relation.c | 24 +
src/backend/replication/logical/worker.c | 616 +++++++++++++++++-
src/include/replication/logicalrelation.h | 3 +
src/include/replication/worker_internal.h | 8 +-
5 files changed, 771 insertions(+), 3 deletions(-)
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index bc8a0480778..40d57daf179 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -216,6 +216,7 @@ typedef struct ParallelApplyWorkerEntry
{
TransactionId xid; /* Hash key -- must be first */
ParallelApplyWorkerInfo *winfo;
+ XLogRecPtr local_end;
} ParallelApplyWorkerEntry;
/* an entry in the parallelized_txns shared hash table */
@@ -504,7 +505,7 @@ pa_launch_parallel_worker(void)
* streaming changes.
*/
void
-pa_allocate_worker(TransactionId xid)
+pa_allocate_worker(TransactionId xid, bool stream_txn)
{
bool found;
ParallelApplyWorkerInfo *winfo = NULL;
@@ -545,7 +546,9 @@ pa_allocate_worker(TransactionId xid)
winfo->in_use = true;
winfo->serialize_changes = false;
+ winfo->stream_txn = stream_txn;
entry->winfo = winfo;
+ entry->local_end = InvalidXLogRecPtr;
}
/*
@@ -742,6 +745,73 @@ pa_process_spooled_messages_if_required(void)
return true;
}
+/*
+ * Get the local end LSN for a transaction applied by a parallel apply worker.
+ *
+ * Set delete_entry to true if you intend to remove the transaction from the
+ * ParallelApplyTxnHash after collecting its LSN.
+ *
+ * If the parallel apply worker did not write any changes while applying the
+ * transaction (e.g., due to update/delete_missing or a BEFORE trigger),
+ * *skipped_write is set to true.
+ */
+XLogRecPtr
+pa_get_last_commit_end(TransactionId xid, bool delete_entry, bool *skipped_write)
+{
+ bool found;
+ ParallelApplyWorkerEntry *entry;
+ ParallelApplyWorkerInfo *winfo;
+
+ Assert(TransactionIdIsValid(xid));
+
+ if (skipped_write)
+ *skipped_write = false;
+
+ /* Find an entry for the requested transaction. */
+ entry = hash_search(ParallelApplyTxnHash, &xid, HASH_FIND, &found);
+
+ if (!found)
+ return InvalidXLogRecPtr;
+
+ /*
+ * If worker info is NULL, it indicates that the worker has been reused
+ * for handling other transactions. Consequently, the local end LSN has
+ * already been collected and saved in entry->local_end.
+ */
+ winfo = entry->winfo;
+ if (winfo == NULL)
+ {
+ if (delete_entry &&
+ !hash_search(ParallelApplyTxnHash, &xid, HASH_REMOVE, NULL))
+ elog(ERROR, "hash table corrupted");
+
+ if (skipped_write)
+ *skipped_write = XLogRecPtrIsInvalid(entry->local_end);
+
+ return entry->local_end;
+ }
+
+ /* Return InvalidXLogRecPtr if the transaction is still in progress */
+ if (pa_get_xact_state(winfo->shared) != PARALLEL_TRANS_FINISHED)
+ return InvalidXLogRecPtr;
+
+ /* Collect the local end LSN from the worker's shared memory area */
+ entry->local_end = winfo->shared->last_commit_end;
+ entry->winfo = NULL;
+
+ if (skipped_write)
+ *skipped_write = XLogRecPtrIsInvalid(entry->local_end);
+
+ elog(DEBUG1, "store local commit %X/%X end to txn entry: %u",
+ LSN_FORMAT_ARGS(entry->local_end), xid);
+
+ if (delete_entry &&
+ !hash_search(ParallelApplyTxnHash, &xid, HASH_REMOVE, NULL))
+ elog(ERROR, "hash table corrupted");
+
+ return entry->local_end;
+}
+
/*
* Interrupt handler for main loop of parallel apply worker.
*/
@@ -1686,6 +1756,26 @@ pa_xact_finish(ParallelApplyWorkerInfo *winfo, XLogRecPtr remote_lsn)
pa_free_worker(winfo);
}
+bool
+pa_transaction_committed(TransactionId xid)
+{
+ bool found;
+ ParallelApplyWorkerEntry *entry;
+
+ Assert(TransactionIdIsValid(xid));
+
+ /* Find an entry for the requested transaction */
+ entry = hash_search(ParallelApplyTxnHash, &xid, HASH_FIND, &found);
+
+ if (!found)
+ return true;
+
+ if (!entry->winfo)
+ return true;
+
+ return pa_get_xact_state(entry->winfo->shared) == PARALLEL_TRANS_FINISHED;
+}
+
/*
* Attach to the shared hash table for parallelized transactions.
*/
@@ -1731,6 +1821,37 @@ pa_attach_parallelized_txn_hash(dsa_handle *pa_dsa_handle,
MemoryContextSwitchTo(oldctx);
}
+/*
+ * Record in-progress transactions from the given list that are being depended
+ * on into the shared hash table.
+ */
+void
+pa_record_dependency_on_transactions(List *depends_on_xids)
+{
+ foreach_xid(xid, depends_on_xids)
+ {
+ bool found;
+ ParallelApplyWorkerEntry *winfo_entry;
+ ParallelApplyWorkerInfo *winfo;
+ ParallelizedTxnEntry *txn_entry;
+
+ winfo_entry = hash_search(ParallelApplyTxnHash, &xid, HASH_FIND, &found);
+ winfo = winfo_entry->winfo;
+
+ txn_entry = dshash_find_or_insert(parallelized_txns, &xid, &found);
+
+ /*
+ * If the transaction has already committed, remove the entry;
+ * otherwise, the parallel apply worker will remove it once the
+ * transaction commits.
+ */
+ if (pa_get_xact_state(winfo->shared) == PARALLEL_TRANS_FINISHED)
+ dshash_delete_entry(parallelized_txns, txn_entry);
+ else
+ dshash_release_lock(parallelized_txns, txn_entry);
+ }
+}
+
/*
* Wait for the given transaction to finish.
*/
diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index 10b3d0d9b82..66c73ce34a1 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -959,3 +959,27 @@ FindLogicalRepLocalIndex(Relation localrel, LogicalRepRelation *remoterel,
return InvalidOid;
}
+
+/*
+ * Get the LogicalRepRelMapEntry corresponding to the given relid without
+ * opening the local relation.
+ */
+LogicalRepRelMapEntry *
+logicalrep_get_relentry(LogicalRepRelId remoteid)
+{
+ LogicalRepRelMapEntry *entry;
+ bool found;
+
+ if (LogicalRepRelMap == NULL)
+ logicalrep_relmap_init();
+
+ /* Search for existing entry. */
+ entry = hash_search(LogicalRepRelMap, (void *) &remoteid,
+ HASH_FIND, &found);
+
+ if (!found)
+ elog(DEBUG1, "no relation map entry for remote relation ID %u",
+ remoteid);
+
+ return entry;
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index ebf8cd62552..269a3ac5804 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -303,6 +303,7 @@ typedef struct FlushPosition
dlist_node node;
XLogRecPtr local_end;
XLogRecPtr remote_end;
+ TransactionId pa_remote_xid;
} FlushPosition;
static dlist_head lsn_mapping = DLIST_STATIC_INIT(lsn_mapping);
@@ -544,6 +545,49 @@ typedef struct ApplySubXactData
static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};
+typedef struct ReplicaIdentityKey
+{
+ Oid relid;
+ LogicalRepTupleData *data;
+} ReplicaIdentityKey;
+
+typedef struct ReplicaIdentityEntry
+{
+ ReplicaIdentityKey *keydata;
+ TransactionId remote_xid;
+
+ /* needed for simplehash */
+ uint32 hash;
+ char status;
+} ReplicaIdentityEntry;
+
+#include "common/hashfn.h"
+
+static uint32 hash_replica_identity(ReplicaIdentityKey *key);
+static bool hash_replica_identity_compare(ReplicaIdentityKey *a,
+ ReplicaIdentityKey *b);
+
+/* Define parameters for replica identity hash table code generation. */
+#define SH_PREFIX replica_identity
+#define SH_ELEMENT_TYPE ReplicaIdentityEntry
+#define SH_KEY_TYPE ReplicaIdentityKey *
+#define SH_KEY keydata
+#define SH_HASH_KEY(tb, key) hash_replica_identity(key)
+#define SH_EQUAL(tb, a, b) hash_replica_identity_compare(a, b)
+#define SH_STORE_HASH
+#define SH_GET_HASH(tb, a) (a)->hash
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+#define REPLICA_IDENTITY_INITIAL_SIZE 128
+#define REPLICA_IDENTITY_CLEANUP_THRESHOLD 1024
+
+static replica_identity_hash *replica_identity_table = NULL;
+
+static void write_internal_dependencies(StringInfo s, List *depends_on_xids);
+
static inline void subxact_filename(char *path, Oid subid, TransactionId xid);
static inline void changes_filename(char *path, Oid subid, TransactionId xid);
@@ -629,6 +673,546 @@ static TransApplyAction get_transaction_apply_action(TransactionId xid,
static void replorigin_reset(int code, Datum arg);
+static bool send_internal_dependencies(ParallelApplyWorkerInfo *winfo,
+ StringInfo s);
+
+/*
+ * Compute the hash value for entries in the replica_identity_table.
+ */
+static uint32
+hash_replica_identity(ReplicaIdentityKey *key)
+{
+ int i;
+ uint32 hashkey = 0;
+
+ hashkey = hash_combine(hashkey, hash_uint32(key->relid));
+
+ for (i = 0; i < key->data->ncols; i++)
+ {
+ uint32 hkey;
+
+ if (key->data->colstatus[i] == LOGICALREP_COLUMN_NULL)
+ continue;
+
+ hkey = hash_any((const unsigned char *) key->data->colvalues[i].data,
+ key->data->colvalues[i].len);
+ hashkey = hash_combine(hashkey, hkey);
+ }
+
+ return hashkey;
+}
+
+/*
+ * Compare two entries in the replica_identity_table.
+ */
+static bool
+hash_replica_identity_compare(ReplicaIdentityKey *a, ReplicaIdentityKey *b)
+{
+ if (a->relid != b->relid ||
+ a->data->ncols != b->data->ncols)
+ return false;
+
+ for (int i = 0; i < a->data->ncols; i++)
+ {
+ if (a->data->colstatus[i] != b->data->colstatus[i])
+ return false;
+
+ if (a->data->colvalues[i].len != b->data->colvalues[i].len)
+ return false;
+
+ if (strcmp(a->data->colvalues[i].data, b->data->colvalues[i].data))
+ return false;
+
+ elog(DEBUG1, "conflicting key %s", a->data->colvalues[i].data);
+ }
+
+ return true;
+}
+
+/*
+ * Free resources associated with a replica identity key.
+ */
+static void
+free_replica_identity_key(ReplicaIdentityKey *key)
+{
+ Assert(key);
+
+ pfree(key->data->colvalues);
+ pfree(key->data->colstatus);
+ pfree(key->data);
+ pfree(key);
+}
+
+/*
+ * Clean up hash table entries associated with the given transaction IDs.
+ */
+static void
+cleanup_replica_identity_table(List *committed_xid)
+{
+ replica_identity_iterator i;
+ ReplicaIdentityEntry *rientry;
+
+ if (!committed_xid)
+ return;
+
+ replica_identity_start_iterate(replica_identity_table, &i);
+ while ((rientry = replica_identity_iterate(replica_identity_table, &i)) != NULL)
+ {
+ if (!list_member_xid(committed_xid, rientry->remote_xid))
+ continue;
+
+ /* Clean up the hash entry for committed transaction */
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+ }
+}
+
+/*
+ * Check committed transactions and clean up corresponding entries in the hash
+ * table.
+ */
+static void
+cleanup_committed_replica_identity_entries(void)
+{
+ dlist_mutable_iter iter;
+ List *committed_xids = NIL;
+
+ dlist_foreach_modify(iter, &lsn_mapping)
+ {
+ FlushPosition *pos =
+ dlist_container(FlushPosition, node, iter.cur);
+ bool skipped_write;
+
+ if (!TransactionIdIsValid(pos->pa_remote_xid) ||
+ !XLogRecPtrIsInvalid(pos->local_end))
+ continue;
+
+ pos->local_end = pa_get_last_commit_end(pos->pa_remote_xid, true,
+ &skipped_write);
+
+ elog(DEBUG1,
+ "got commit end from parallel apply worker, "
+ "txn: %u, remote_end %X/%X, local_end %X/%X",
+ pos->pa_remote_xid, LSN_FORMAT_ARGS(pos->remote_end),
+ LSN_FORMAT_ARGS(pos->local_end));
+
+ if (!skipped_write && XLogRecPtrIsInvalid(pos->local_end))
+ continue;
+
+ committed_xids = lappend_xid(committed_xids, pos->pa_remote_xid);
+ }
+
+ /* cleanup the entries for committed transactions */
+ cleanup_replica_identity_table(committed_xids);
+}
+
+/*
+ * Append a transaction dependency, excluding duplicates and committed
+ * transactions.
+ */
+static List *
+check_and_append_xid_dependency(List *depends_on_xids,
+ TransactionId *depends_on_xid,
+ TransactionId current_xid)
+{
+ Assert(depends_on_xid);
+
+ if (!TransactionIdIsValid(*depends_on_xid))
+ return depends_on_xids;
+
+ if (TransactionIdEquals(*depends_on_xid, current_xid))
+ return depends_on_xids;
+
+ if (list_member_xid(depends_on_xids, *depends_on_xid))
+ return depends_on_xids;
+
+ /*
+ * Return and reset the xid if the transaction has been committed.
+ */
+ if (pa_transaction_committed(*depends_on_xid))
+ {
+ *depends_on_xid = InvalidTransactionId;
+ return depends_on_xids;
+ }
+
+ return lappend_xid(depends_on_xids, *depends_on_xid);
+}
+
+/*
+ * Check for dependencies on preceding transactions that modify the same key.
+ * Returns the dependent transactions in 'depends_on_xids' and records the
+ * current change.
+ */
+static void
+check_dependency_on_replica_identity(Oid relid,
+ LogicalRepTupleData *original_data,
+ TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ LogicalRepRelMapEntry *relentry;
+ LogicalRepTupleData *ridata;
+ ReplicaIdentityKey *rikey;
+ ReplicaIdentityEntry *rientry;
+ MemoryContext oldctx;
+ int n_ri;
+ bool found = false;
+
+ Assert(depends_on_xids);
+
+ /* Search for existing entry */
+ relentry = logicalrep_get_relentry(relid);
+
+ Assert(relentry);
+
+ /*
+ * First search whether any previous transaction has affected the whole
+ * table, e.g., a truncate or schema change from the publisher.
+ */
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &relentry->last_depended_xid,
+ new_depended_xid);
+
+ n_ri = bms_num_members(relentry->remoterel.attkeys);
+
+ /*
+ * Return if there are no replica identity columns, indicating that the
+ * remote relation has neither a replica identity key nor is marked as
+ * replica identity full.
+ */
+ if (!n_ri)
+ return;
+
+ /* Check if the RI key value of the tuple is invalid */
+ for (int i = 0; i < original_data->ncols; i++)
+ {
+ if (!bms_is_member(i, relentry->remoterel.attkeys))
+ continue;
+
+ /*
+ * Return if the RI key is NULL or explicitly marked unchanged. The key
+ * value could be NULL in the new tuple of an update operation, which
+ * means the RI key is not updated.
+ */
+ if (original_data->colstatus[i] == LOGICALREP_COLUMN_NULL ||
+ original_data->colstatus[i] == LOGICALREP_COLUMN_UNCHANGED)
+ return;
+ }
+
+ oldctx = MemoryContextSwitchTo(ApplyContext);
+
+ /* Allocate space for replica identity values */
+ ridata = palloc0_object(LogicalRepTupleData);
+ ridata->colvalues = palloc0_array(StringInfoData, n_ri);
+ ridata->colstatus = palloc0_array(char, n_ri);
+ ridata->ncols = n_ri;
+
+ for (int i_original = 0, i_ri = 0; i_original < original_data->ncols; i_original++)
+ {
+ StringInfo original_colvalue = &original_data->colvalues[i_original];
+
+ if (!bms_is_member(i_original, relentry->remoterel.attkeys))
+ continue;
+
+ initStringInfoExt(&ridata->colvalues[i_ri], original_colvalue->len + 1);
+ appendStringInfoString(&ridata->colvalues[i_ri], original_colvalue->data);
+ ridata->colstatus[i_ri] = original_data->colstatus[i_original];
+ i_ri++;
+ }
+
+ rikey = palloc0_object(ReplicaIdentityKey);
+ rikey->relid = relid;
+ rikey->data = ridata;
+
+ if (TransactionIdIsValid(new_depended_xid))
+ {
+ rientry = replica_identity_insert(replica_identity_table, rikey,
+ &found);
+
+ /*
+ * Release the key built to search the entry, if the entry already
+ * exists. Otherwise, initialize the remote_xid.
+ */
+ if (found)
+ {
+ elog(DEBUG1, "found conflicting replica identity change from %u",
+ rientry->remote_xid);
+
+ free_replica_identity_key(rikey);
+ }
+ else
+ rientry->remote_xid = InvalidTransactionId;
+ }
+ else
+ {
+ rientry = replica_identity_lookup(replica_identity_table, rikey);
+ free_replica_identity_key(rikey);
+ }
+
+ MemoryContextSwitchTo(oldctx);
+
+ /* Return if no entry found */
+ if (!rientry)
+ return;
+
+ Assert(!found || TransactionIdIsValid(rientry->remote_xid));
+
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &rientry->remote_xid,
+ new_depended_xid);
+
+ /*
+ * Update the entry with the new depended xid if it is valid. The new xid
+ * can be invalid when the transaction will be applied by the leader
+ * itself, in which case all its changes are committed before the next
+ * transaction is processed, so no dependency needs to be recorded.
+ */
+ if (TransactionIdIsValid(new_depended_xid))
+ rientry->remote_xid = new_depended_xid;
+
+ /*
+ * Remove the entry if the transaction has been committed and no new
+ * dependency needs to be added.
+ */
+ else if (!TransactionIdIsValid(rientry->remote_xid))
+ {
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+ }
+}
+
+/*
+ * Check for preceding transactions that involve insert, delete, or update
+ * operations on the specified table, and return them in 'depends_on_xids'.
+ */
+static void
+find_all_dependencies_on_rel(LogicalRepRelId relid, TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ replica_identity_iterator i;
+ ReplicaIdentityEntry *rientry;
+
+ Assert(depends_on_xids);
+
+ replica_identity_start_iterate(replica_identity_table, &i);
+ while ((rientry = replica_identity_iterate(replica_identity_table, &i)) != NULL)
+ {
+ Assert(TransactionIdIsValid(rientry->remote_xid));
+
+ if (rientry->keydata->relid != relid)
+ continue;
+
+ /* Clean up the hash entry for committed transaction while on it */
+ if (pa_transaction_committed(rientry->remote_xid))
+ {
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+
+ continue;
+ }
+
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &rientry->remote_xid,
+ new_depended_xid);
+ }
+}
+
+/*
+ * Check for any preceding transactions that affect the given table and returns
+ * them in 'depends_on_xids'.
+ */
+static void
+check_dependency_on_rel(LogicalRepRelId relid, TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ LogicalRepRelMapEntry *relentry;
+
+ Assert(depends_on_xids);
+
+ find_all_dependencies_on_rel(relid, new_depended_xid, depends_on_xids);
+
+ /* Search for existing entry */
+ relentry = logicalrep_get_relentry(relid);
+
+ /*
+ * The relentry has not been initialized yet, indicating that no change
+ * has been applied to this relation yet.
+ */
+ if (!relentry)
+ return;
+
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &relentry->last_depended_xid,
+ new_depended_xid);
+
+ if (TransactionIdIsValid(new_depended_xid))
+ relentry->last_depended_xid = new_depended_xid;
+}
+
+/*
+ * Check dependencies related to the current change by determining if the
+ * modification impacts the same row or table as another ongoing transaction. If
+ * needed, instruct parallel apply workers to wait for these preceding
+ * transactions to complete.
+ *
+ * Simultaneously, track the dependency for the current change to ensure that
+ * subsequent transactions address this dependency.
+ */
+static void
+handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
+ TransactionId new_depended_xid,
+ ParallelApplyWorkerInfo *winfo)
+{
+ LogicalRepRelId relid;
+ LogicalRepTupleData oldtup;
+ LogicalRepTupleData newtup;
+ LogicalRepRelation *rel;
+ List *depends_on_xids = NIL;
+ List *remote_relids;
+ bool has_oldtup = false;
+ bool cascade = false;
+ bool restart_seqs = false;
+ StringInfoData dependencies;
+
+ /*
+ * Parse the change data using a local copy instead of consuming the
+ * given remote message directly, as the caller may also need to read
+ * from it.
+ */
+ StringInfoData change = *s;
+
+ /* Compute dependencies only for non-streaming transactions */
+ if (in_streamed_transaction || (winfo && winfo->stream_txn))
+ return;
+
+ /* Only the leader checks dependencies and schedules the parallel apply */
+ if (!am_leader_apply_worker())
+ return;
+
+ if (!replica_identity_table)
+ replica_identity_table = replica_identity_create(ApplyContext,
+ REPLICA_IDENTITY_INITIAL_SIZE,
+ NULL);
+
+ if (replica_identity_table->members >= REPLICA_IDENTITY_CLEANUP_THRESHOLD)
+ cleanup_committed_replica_identity_entries();
+
+ switch (action)
+ {
+ case LOGICAL_REP_MSG_INSERT:
+ relid = logicalrep_read_insert(&change, &newtup);
+ check_dependency_on_replica_identity(relid, &newtup,
+ new_depended_xid,
+ &depends_on_xids);
+ break;
+
+ case LOGICAL_REP_MSG_UPDATE:
+ relid = logicalrep_read_update(&change, &has_oldtup, &oldtup,
+ &newtup);
+
+ if (has_oldtup)
+ check_dependency_on_replica_identity(relid, &oldtup,
+ new_depended_xid,
+ &depends_on_xids);
+
+ check_dependency_on_replica_identity(relid, &newtup,
+ new_depended_xid,
+ &depends_on_xids);
+ break;
+
+ case LOGICAL_REP_MSG_DELETE:
+ relid = logicalrep_read_delete(&change, &oldtup);
+ check_dependency_on_replica_identity(relid, &oldtup,
+ new_depended_xid,
+ &depends_on_xids);
+ break;
+
+ case LOGICAL_REP_MSG_TRUNCATE:
+ remote_relids = logicalrep_read_truncate(&change, &cascade,
+ &restart_seqs);
+
+ /*
+ * Truncate affects all rows in a table, so the current
+ * transaction should wait for all preceding transactions that
+ * modified the same table.
+ */
+ foreach_int(truncated_relid, remote_relids)
+ check_dependency_on_rel(truncated_relid, new_depended_xid,
+ &depends_on_xids);
+
+ break;
+
+ case LOGICAL_REP_MSG_RELATION:
+ rel = logicalrep_read_rel(&change);
+
+ /*
+ * The replica identity key could be changed, making existing
+ * entries in the replica identity table invalid. In this case, parallel
+ * apply is not allowed on this specific table until all running
+ * transactions that modified it have finished.
+ */
+ check_dependency_on_rel(rel->remoteid, new_depended_xid,
+ &depends_on_xids);
+ break;
+
+ case LOGICAL_REP_MSG_TYPE:
+ case LOGICAL_REP_MSG_MESSAGE:
+
+ /*
+ * Type updates accompany relation updates, so dependencies have
+ * already been checked during relation updates. Logical messages
+ * do not conflict with any changes, so they can be ignored.
+ */
+ break;
+
+ default:
+ Assert(false);
+ break;
+ }
+
+ if (!depends_on_xids)
+ return;
+
+ /*
+ * Record the in-progress transactions that the current transaction
+ * depends on, so that their completion can be waited for.
+ */
+ pa_record_dependency_on_transactions(depends_on_xids);
+
+ /*
+ * If the leader applies the transaction itself, immediately wait for
+ * the transactions that the current transaction depends on.
+ */
+ if (winfo == NULL)
+ {
+ foreach_xid(xid, depends_on_xids)
+ pa_wait_for_depended_transaction(xid);
+
+ return;
+ }
+
+ initStringInfo(&dependencies);
+
+ /* Build the dependency message used to send to parallel apply worker */
+ write_internal_dependencies(&dependencies, depends_on_xids);
+
+ (void) send_internal_dependencies(winfo, &dependencies);
+}
+
+/*
+ * Write internal dependency information to the output for the parallel apply
+ * worker.
+ */
+static void
+write_internal_dependencies(StringInfo s, List *depends_on_xids)
+{
+ pq_sendbyte(s, PARALLEL_APPLY_INTERNAL_MESSAGE);
+ pq_sendbyte(s, LOGICAL_REP_MSG_INTERNAL_DEPENDENCY);
+ pq_sendint32(s, list_length(depends_on_xids));
+
+ foreach_xid(xid, depends_on_xids)
+ pq_sendint32(s, xid);
+}
+
/*
* Handle internal dependency information.
*
@@ -826,7 +1410,10 @@ handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
/* not in streaming mode */
if (apply_action == TRANS_LEADER_APPLY)
+ {
+ handle_dependency_on_change(action, s, InvalidTransactionId, winfo);
return false;
+ }
Assert(TransactionIdIsValid(stream_xid));
@@ -1268,6 +1855,33 @@ apply_handle_begin(StringInfo s)
pgstat_report_activity(STATE_RUNNING, NULL);
}
+/*
+ * Send an INTERNAL_DEPENDENCY message to a parallel apply worker.
+ *
+ * Returns false if we switched to the serialize mode to send the message,
+ * true otherwise.
+ */
+static bool
+send_internal_dependencies(ParallelApplyWorkerInfo *winfo, StringInfo s)
+{
+ Assert(s->data[0] == PARALLEL_APPLY_INTERNAL_MESSAGE);
+ Assert(s->data[1] == LOGICAL_REP_MSG_INTERNAL_DEPENDENCY);
+
+ if (!winfo->serialize_changes)
+ {
+ if (pa_send_data(winfo, s->len, s->data))
+ return true;
+
+ pa_switch_to_partial_serialize(winfo, true);
+ }
+
+ /* Skip writing the first internal message flag */
+ s->cursor++;
+ stream_write_change(LOGICAL_REP_MSG_INTERNAL_DEPENDENCY, s);
+
+ return false;
+}
+
/*
* Handle COMMIT message.
*
@@ -1795,7 +2409,7 @@ apply_handle_stream_start(StringInfo s)
/* Try to allocate a worker for the streaming transaction. */
if (first_segment)
- pa_allocate_worker(stream_xid);
+ pa_allocate_worker(stream_xid, true);
apply_action = get_transaction_apply_action(stream_xid, &winfo);
diff --git a/src/include/replication/logicalrelation.h b/src/include/replication/logicalrelation.h
index 7a561a8e8d8..4b321bd2ad2 100644
--- a/src/include/replication/logicalrelation.h
+++ b/src/include/replication/logicalrelation.h
@@ -37,6 +37,8 @@ typedef struct LogicalRepRelMapEntry
/* Sync state. */
char state;
XLogRecPtr statelsn;
+
+ TransactionId last_depended_xid;
} LogicalRepRelMapEntry;
extern void logicalrep_relmap_update(LogicalRepRelation *remoterel);
@@ -50,5 +52,6 @@ extern void logicalrep_rel_close(LogicalRepRelMapEntry *rel,
LOCKMODE lockmode);
extern bool IsIndexUsableForReplicaIdentityFull(Relation idxrel, AttrMap *attrmap);
extern Oid GetRelationIdentityOrPK(Relation rel);
+extern LogicalRepRelMapEntry *logicalrep_get_relentry(LogicalRepRelId remoteid);
#endif /* LOGICALRELATION_H */
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index ddcdcc05053..78b5667cebe 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -235,6 +235,8 @@ typedef struct ParallelApplyWorkerInfo
*/
bool in_use;
+ bool stream_txn;
+
ParallelApplyWorkerShared *shared;
} ParallelApplyWorkerInfo;
@@ -332,8 +334,10 @@ extern void apply_error_callback(void *arg);
extern void set_apply_error_context_origin(char *originname);
/* Parallel apply worker setup and interactions */
-extern void pa_allocate_worker(TransactionId xid);
+extern void pa_allocate_worker(TransactionId xid, bool stream_txn);
extern ParallelApplyWorkerInfo *pa_find_worker(TransactionId xid);
+extern XLogRecPtr pa_get_last_commit_end(TransactionId xid, bool delete_entry,
+ bool *skipped_write);
extern void pa_detach_all_error_mq(void);
extern bool pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes,
@@ -362,6 +366,8 @@ extern void pa_decr_and_wait_stream_block(void);
extern void pa_xact_finish(ParallelApplyWorkerInfo *winfo,
XLogRecPtr remote_lsn);
+extern bool pa_transaction_committed(TransactionId xid);
+extern void pa_record_dependency_on_transactions(List *depends_on_xids);
extern void pa_wait_for_depended_transaction(TransactionId xid);
--
2.47.3
Attachment: v4-0004-Parallel-apply-non-streaming-transactions.patch
From c66469e01c56bffb722fb5366c600c61f440f0b4 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Mon, 1 Dec 2025 12:28:29 +0900
Subject: [PATCH v4 4/4] Parallel apply non-streaming transactions
--
Basic design
--
The leader worker assigns each non-streaming transaction to a parallel apply
worker. Before dispatching changes to a parallel worker, the leader verifies if
the current modification affects the same row (identified by the replica identity
key) as another ongoing transaction. If so, the leader sends a list of dependent
transaction IDs to the parallel worker, indicating that the parallel apply
worker must wait for these transactions to commit before proceeding.
Each parallel apply worker records the local end LSN of the transaction it
applies in shared memory. Subsequently, the leader gathers these local end LSNs
and logs them in the local 'lsn_mapping' for verifying whether they have been
flushed to disk (following the logic in get_flush_position()).
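In sketch form, that deferred collection looks roughly like this (standalone
C; the names and demo values are invented, while the real code keeps
FlushPosition entries in lsn_mapping and reads the worker's shared
last_commit_end via pa_get_last_commit_end()):

#include <stdio.h>

typedef struct FlushPos
{
    unsigned long remote_end;
    unsigned long local_end;   /* 0 until the worker reports its commit end */
    unsigned      pa_xid;      /* remote xid applied by a parallel worker */
} FlushPos;

/* Stand-in for reading the worker's shared last_commit_end. */
static unsigned long
get_last_commit_end(unsigned xid)
{
    return 0x1000 + xid;       /* demo value */
}

int main(void)
{
    FlushPos pos = {0x2000, 0, 101};

    /* Resolved lazily, when the leader computes the flush position. */
    if (pos.local_end == 0 && pos.pa_xid != 0)
        pos.local_end = get_last_commit_end(pos.pa_xid);

    printf("remote %#lx can be confirmed once local %#lx is flushed\n",
           pos.remote_end, pos.local_end);
    return 0;
}

In the patch, store_flush_position() gains a pa_remote_xid argument for
exactly this deferred resolution.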
If no parallel apply worker is available, the leader will apply the transaction
independently.
For further details, please refer to the following:
--
dependency tracking
--
The leader maintains a local hash table, using the remote change's replica
identity column values and relid as keys, with remote transaction IDs as values.
Before sending changes to the parallel apply worker, the leader computes a hash
using RI key values and the relid of the current change to search the hash
table. If an existing entry is found, the leader first updates the hash entry
with the incoming remote xid and then tells the parallel worker to wait for the
previous one.
If the remote relation lacks a replica identity (RI), it indicates that only
INSERT can be replicated for this table. In such cases, the leader skips
dependency checks, allowing the parallel apply worker to proceed with applying
changes without delay. This is because the only potential conflicts involve a
local unique key or foreign key, whose handling is yet to be implemented (see
TODO - dependency on local unique key, foreign key).
In cases of TRUNCATE or remote schema changes affecting the entire table, the
leader retrieves all remote xids touching the same table (via sequential scans
of the hash table) and tells the parallel worker to wait for those transactions
to commit.
Hash entries are cleaned up once the transaction corresponding to the remote xid
in the entry has been committed. Clean-up typically occurs when collecting the
flush position of each transaction, but is forced if the hash table exceeds a
set threshold.
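As a rough standalone sketch of that forced clean-up (invented names; the
real code walks lsn_mapping and the simplehash table), once the table grows
past a threshold every entry whose transaction has committed is dropped:

#include <stdbool.h>
#include <stdio.h>

#define CLEANUP_THRESHOLD 4          /* tiny value for the demo */

static unsigned entries[8];          /* xids recorded in the table */
static int      nentries;

/* Stand-in for "has this remote xid committed locally?" */
static bool transaction_committed(unsigned xid) { return xid <= 101; }

static void record(unsigned xid)
{
    if (nentries == CLEANUP_THRESHOLD)
    {
        int kept = 0;

        for (int i = 0; i < nentries; i++)
            if (!transaction_committed(entries[i]))
                entries[kept++] = entries[i];   /* evict committed xids */
        nentries = kept;
    }
    entries[nentries++] = xid;
}

int main(void)
{
    for (unsigned xid = 100; xid < 106; xid++)
        record(xid);
    printf("%d entries survive the threshold sweep\n", nentries);
    return 0;
}

In the patch the threshold is REPLICA_IDENTITY_CLEANUP_THRESHOLD and the
sweep is cleanup_committed_replica_identity_entries().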
--
dependency waiting
--
If a transaction is relied upon by others, the leader adds its xid to a shared
hash table. The shared hash table entry is cleared by the parallel apply worker
upon completing the transaction. Workers needing to wait for a transaction check
the shared hash table entry; if present, they lock the transaction ID (using
pa_lock_transaction). If absent, it indicates the transaction has been
committed, negating the need to wait.
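A condensed standalone sketch of that waiting protocol follows (the helper
functions are stand-ins for the patch's dshash lookup and heavyweight xid
locks):

#include <stdbool.h>
#include <stdio.h>

/* Stand-ins: the real code uses a dshash table and xid locks. */
static bool shared_hash_contains(unsigned xid) { (void) xid; return false; }
static void lock_transaction(unsigned xid)     { (void) xid; /* blocks */ }
static void unlock_transaction(unsigned xid)   { (void) xid; }

static void wait_for_depended_transaction(unsigned xid)
{
    /* No entry means the transaction already committed: nothing to do. */
    if (!shared_hash_contains(xid))
        return;

    /* Otherwise block on the xid lock held by the worker applying it. */
    lock_transaction(xid);
    unlock_transaction(xid);
}

int main(void)
{
    wait_for_depended_transaction(101);
    puts("xid 101 already finished; no wait needed");
    return 0;
}

In the patch this corresponds to pa_wait_for_depended_transaction(), with the
shared entry removed by pa_commit_transaction() at commit time.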
--
commit order
--
There is a case where columns have no foreign or primary keys, and integrity is
maintained at the application layer. In this case, the above RI mechanism cannot
detect any dependencies. For safety reasons, parallel apply workers preserve the
commit ordering done on the publisher side. This is done by the leader worker
caching the most recently dispatched transaction ID and adding a dependency
between it and the one currently being dispatched.
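In sketch form (standalone C, invented names), the chaining amounts to each
dispatched transaction depending on its predecessor, so commits happen in
publisher order even when no RI dependency exists:

#include <stdio.h>

static unsigned last_remote_xid = 0;

static void dispatch(unsigned xid)
{
    if (last_remote_xid != 0)
        printf("xid %u: wait for xid %u before committing\n",
               xid, last_remote_xid);
    last_remote_xid = xid;
}

int main(void)
{
    dispatch(101);
    dispatch(102);
    dispatch(103);
    return 0;
}

pa_add_parallelized_transaction() and build_dependency_with_last_committed_txn()
in the patch play these two roles.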
--
TODO - dependency on local unique key, foreign key.
--
A transaction could conflict with another if modifying the same unique key.
While current patches don't address conflicts involving unique or foreign keys,
tracking these dependencies might be needed.
--
TODO - user defined trigger and constraints.
--
It would be a challenge to check dependencies if the table has user-defined
triggers or constraints. The most viable solution might be to disallow parallel
apply for relations whose triggers and constraints are not marked as
parallel-safe or immutable.
---
.../replication/logical/applyparallelworker.c | 332 ++++++++++++++++--
src/backend/replication/logical/proto.c | 38 ++
src/backend/replication/logical/relation.c | 31 ++
src/backend/replication/logical/worker.c | 309 ++++++++++++++--
src/include/replication/logicalproto.h | 2 +
src/include/replication/logicalrelation.h | 2 +
src/include/replication/worker_internal.h | 11 +-
src/test/subscription/t/001_rep_changes.pl | 2 +
src/test/subscription/t/010_truncate.pl | 2 +-
src/test/subscription/t/015_stream.pl | 8 +-
src/test/subscription/t/026_stats.pl | 1 +
src/test/subscription/t/027_nosuperuser.pl | 1 +
src/tools/pgindent/typedefs.list | 4 +
13 files changed, 668 insertions(+), 75 deletions(-)
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index 40d57daf179..47b5bc3b48a 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -14,6 +14,9 @@
* ParallelApplyWorkerInfo which is required so the leader worker and parallel
* apply workers can communicate with each other.
*
+ * Streaming transactions
+ * ======================
+ *
* The parallel apply workers are assigned (if available) as soon as xact's
* first stream is received for subscriptions that have set their 'streaming'
* option as parallel. The leader apply worker will send changes to this new
@@ -146,6 +149,33 @@
* which will detect deadlock if any. See pa_send_data() and
* enum TransApplyAction.
*
+ * Non-streaming transactions
+ * ======================
+ * The handling is similar to streaming transactions, but with a few
+ * differences:
+ *
+ * Transaction dependency
+ * -------------------------------
+ * Before dispatching changes to a parallel worker, the leader verifies if the
+ * current modification affects the same row (identified by the replica identity
+ * key) as another ongoing transaction (see handle_dependency_on_change for
+ * details). If so, the leader sends a list of dependent transaction IDs to the
+ * parallel worker, indicating that the parallel apply worker must wait for
+ * these transactions to commit before proceeding.
+ *
+ * Commit order
+ * ------------
+ * There is a case where columns have no foreign or primary keys, and integrity
+ * is maintained at the application layer. In this case, the above RI mechanism
+ * cannot detect any dependencies. For safety reasons, parallel apply workers
+ * preserve the commit ordering done on the publisher side. This is done by the
+ * leader worker caching the most recently dispatched transaction ID and adding
+ * a dependency between it and the one currently being dispatched.
+ * We can extend the parallel apply worker to allow out-of-order commits in the
+ * future: At least we must use a new mechanism to track replication progress
+ * in out-of-order commits. Then we can stop caching the transaction ID and
+ * adding the dependency.
+ *
* Lock types
* ----------
* Both the stream lock and the transaction lock mentioned above are
@@ -283,6 +313,7 @@ static ParallelTransState pa_get_xact_state(ParallelApplyWorkerShared *wshared);
static PartialFileSetState pa_get_fileset_state(void);
static void pa_attach_parallelized_txn_hash(dsa_handle *pa_dsa_handle,
dshash_table_handle *pa_dshash_handle);
+static void write_internal_relation(StringInfo s, LogicalRepRelation *rel);
/*
* Returns true if it is OK to start a parallel apply worker, false otherwise.
@@ -400,6 +431,7 @@ pa_setup_dsm(ParallelApplyWorkerInfo *winfo)
shared = shm_toc_allocate(toc, sizeof(ParallelApplyWorkerShared));
SpinLockInit(&shared->mutex);
+ shared->xid = InvalidTransactionId;
shared->xact_state = PARALLEL_TRANS_UNKNOWN;
pg_atomic_init_u32(&(shared->pending_stream_count), 0);
shared->last_commit_end = InvalidXLogRecPtr;
@@ -443,6 +475,8 @@ pa_launch_parallel_worker(void)
MemoryContext oldcontext;
bool launched;
ParallelApplyWorkerInfo *winfo;
+ dsa_handle pa_dsa_handle;
+ dshash_table_handle pa_dshash_handle;
ListCell *lc;
/* Try to get an available parallel apply worker from the worker pool. */
@@ -450,10 +484,33 @@ pa_launch_parallel_worker(void)
{
winfo = (ParallelApplyWorkerInfo *) lfirst(lc);
- if (!winfo->in_use)
+ if (!winfo->stream_txn &&
+ pa_get_xact_state(winfo->shared) == PARALLEL_TRANS_FINISHED)
+ {
+ /*
+ * Save the local commit LSN of the last transaction applied by
+ * this worker before reusing it for another transaction. This WAL
+ * position is crucial for determining the flush position in
+ * responses to the publisher (see get_flush_position()).
+ */
+ (void) pa_get_last_commit_end(winfo->shared->xid, false, NULL);
+ return winfo;
+ }
+
+ if (winfo->stream_txn && !winfo->in_use)
return winfo;
}
+ pa_attach_parallelized_txn_hash(&pa_dsa_handle, &pa_dshash_handle);
+
+ /*
+ * Return NULL if the number of parallel apply workers has reached the
+ * maximum limit.
+ */
+ if (list_length(ParallelApplyWorkerPool) ==
+ max_parallel_apply_workers_per_subscription)
+ return NULL;
+
/*
* Start a new parallel apply worker.
*
@@ -481,18 +538,32 @@ pa_launch_parallel_worker(void)
dsm_segment_handle(winfo->dsm_seg),
false);
- if (launched)
- {
- ParallelApplyWorkerPool = lappend(ParallelApplyWorkerPool, winfo);
- }
- else
+ if (!launched)
{
+ MemoryContextSwitchTo(oldcontext);
pa_free_worker_info(winfo);
- winfo = NULL;
+ return NULL;
}
+ ParallelApplyWorkerPool = lappend(ParallelApplyWorkerPool, winfo);
+
MemoryContextSwitchTo(oldcontext);
+ /*
+ * Send all existing remote relation information to the parallel apply
+ * worker. This allows the parallel worker to initialize the
+ * LogicalRepRelMapEntry locally before applying remote changes.
+ */
+ if (logicalrep_get_num_rels())
+ {
+ StringInfoData out;
+
+ initStringInfo(&out);
+
+ write_internal_relation(&out, NULL);
+ pa_send_data(winfo, out.len, out.data);
+ }
+
return winfo;
}
@@ -597,7 +668,8 @@ pa_free_worker(ParallelApplyWorkerInfo *winfo)
{
Assert(!am_parallel_apply_worker());
Assert(winfo->in_use);
- Assert(pa_get_xact_state(winfo->shared) == PARALLEL_TRANS_FINISHED);
+ Assert(!winfo->stream_txn ||
+ pa_get_xact_state(winfo->shared) == PARALLEL_TRANS_FINISHED);
if (!hash_search(ParallelApplyTxnHash, &winfo->shared->xid, HASH_REMOVE, NULL))
elog(ERROR, "hash table corrupted");
@@ -613,9 +685,7 @@ pa_free_worker(ParallelApplyWorkerInfo *winfo)
* been serialized and then letting the parallel apply worker deal with
* the spurious message, we stop the worker.
*/
- if (winfo->serialize_changes ||
- list_length(ParallelApplyWorkerPool) >
- (max_parallel_apply_workers_per_subscription / 2))
+ if (winfo->serialize_changes)
{
logicalrep_pa_worker_stop(winfo);
pa_free_worker_info(winfo);
@@ -812,6 +882,38 @@ pa_get_last_commit_end(TransactionId xid, bool delete_entry, bool *skipped_write
return entry->local_end;
}
+/*
+ * Wait for the remote transaction associated with the specified remote xid to
+ * complete.
+ */
+static void
+pa_wait_for_transaction(TransactionId wait_for_xid)
+{
+ if (!am_leader_apply_worker())
+ return;
+
+ if (!TransactionIdIsValid(wait_for_xid))
+ return;
+
+ elog(DEBUG1, "plan to wait for remote_xid %u to finish",
+ wait_for_xid);
+
+ for (;;)
+ {
+ if (pa_transaction_committed(wait_for_xid))
+ break;
+
+ pa_lock_transaction(wait_for_xid, AccessShareLock);
+ pa_unlock_transaction(wait_for_xid, AccessShareLock);
+
+ /* An interrupt may have occurred while we were waiting. */
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ elog(DEBUG1, "finished wait for remote_xid %u to finish",
+ wait_for_xid);
+}
+
/*
* Interrupt handler for main loop of parallel apply worker.
*/
@@ -887,21 +989,34 @@ LogicalParallelApplyLoop(shm_mq_handle *mqh)
* parallel apply workers can only be PqReplMsg_WALData.
*/
c = pq_getmsgbyte(&s);
- if (c != PqReplMsg_WALData)
- elog(ERROR, "unexpected message \"%c\"", c);
-
- /*
- * Ignore statistics fields that have been updated by the leader
- * apply worker.
- *
- * XXX We can avoid sending the statistics fields from the leader
- * apply worker but for that, it needs to rebuild the entire
- * message by removing these fields which could be more work than
- * simply ignoring these fields in the parallel apply worker.
- */
- s.cursor += SIZE_STATS_MESSAGE;
+ if (c == PqReplMsg_WALData)
+ {
+ /*
+ * Ignore statistics fields that have been updated by the
+ * leader apply worker.
+ *
+ * XXX We can avoid sending the statistics fields from the
+ * leader apply worker but for that, it needs to rebuild the
+ * entire message by removing these fields which could be more
+ * work than simply ignoring these fields in the parallel
+ * apply worker.
+ */
+ s.cursor += SIZE_STATS_MESSAGE;
- apply_dispatch(&s);
+ apply_dispatch(&s);
+ }
+ else if (c == PARALLEL_APPLY_INTERNAL_MESSAGE)
+ {
+ apply_dispatch(&s);
+ }
+ else
+ {
+ /*
+ * The first byte of messages sent from leader apply worker to
+ * parallel apply workers can only be 'w' or 'i'.
+ */
+ elog(ERROR, "unexpected message \"%c\"", c);
+ }
}
else if (shmq_res == SHM_MQ_WOULD_BLOCK)
{
@@ -918,6 +1033,9 @@ LogicalParallelApplyLoop(shm_mq_handle *mqh)
if (rc & WL_LATCH_SET)
ResetLatch(MyLatch);
+
+ if (!IsTransactionState())
+ pgstat_report_stat(true);
}
}
else
@@ -955,6 +1073,9 @@ pa_shutdown(int code, Datum arg)
INVALID_PROC_NUMBER);
dsm_detach((dsm_segment *) DatumGetPointer(arg));
+
+ if (parallel_apply_dsa_area)
+ dsa_detach(parallel_apply_dsa_area);
}
/*
@@ -1267,7 +1388,6 @@ pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes, const void *data)
shm_mq_result result;
TimestampTz startTime = 0;
- Assert(!IsTransactionState());
Assert(!winfo->serialize_changes);
/*
@@ -1319,6 +1439,67 @@ pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes, const void *data)
}
}
+/*
+ * Distribute remote relation information to all active parallel apply workers
+ * that require it.
+ */
+void
+pa_distribute_schema_changes_to_workers(LogicalRepRelation *rel)
+{
+ List *workers_stopped = NIL;
+ StringInfoData out;
+
+ if (!ParallelApplyWorkerPool)
+ return;
+
+ initStringInfo(&out);
+
+ write_internal_relation(&out, rel);
+
+ foreach_ptr(ParallelApplyWorkerInfo, winfo, ParallelApplyWorkerPool)
+ {
+ /*
+ * Skip the worker responsible for the current transaction, as the
+ * relation information has already been sent to it.
+ */
+ if (winfo == stream_apply_worker)
+ continue;
+
+ /*
+ * Skip the worker that is in serialize mode, as they will soon stop
+ * once they finish applying the transaction.
+ */
+ if (winfo->serialize_changes)
+ continue;
+
+ elog(DEBUG1, "distributing schema changes to pa workers");
+
+ if (pa_send_data(winfo, out.len, out.data))
+ continue;
+
+ elog(DEBUG1, "failed to distribute, will stop that worker instead");
+
+ /*
+ * Distribution to this worker failed due to a sending timeout. Wait
+ * for the worker to complete its transaction and then stop it. This
+ * is consistent with the handling of workers in serialize mode (see
+ * pa_free_worker() for details).
+ */
+ pa_wait_for_transaction(winfo->shared->xid);
+
+ pa_get_last_commit_end(winfo->shared->xid, false, NULL);
+
+ logicalrep_pa_worker_stop(winfo);
+
+ workers_stopped = lappend(workers_stopped, winfo);
+ }
+
+ pfree(out.data);
+
+ foreach_ptr(ParallelApplyWorkerInfo, winfo, workers_stopped)
+ pa_free_worker_info(winfo);
+}
+
/*
* Switch to PARTIAL_SERIALIZE mode for the current transaction -- this means
* that the current data and any subsequent data for this transaction will be
@@ -1401,8 +1582,8 @@ pa_wait_for_xact_finish(ParallelApplyWorkerInfo *winfo)
/*
* Wait for the transaction lock to be released. This is required to
- * detect deadlock among leader and parallel apply workers. Refer to the
- * comments atop this file.
+ * detect deadlock among leader and parallel apply workers. Refer
+ * to the comments atop this file.
*/
pa_lock_transaction(winfo->shared->xid, AccessShareLock);
pa_unlock_transaction(winfo->shared->xid, AccessShareLock);
@@ -1479,6 +1660,9 @@ pa_savepoint_name(Oid suboid, TransactionId xid, char *spname, Size szsp)
void
pa_start_subtrans(TransactionId current_xid, TransactionId top_xid)
{
+ if (!TransactionIdIsValid(top_xid))
+ return;
+
if (current_xid != top_xid &&
!list_member_xid(subxactlist, current_xid))
{
@@ -1735,25 +1919,41 @@ pa_decr_and_wait_stream_block(void)
void
pa_xact_finish(ParallelApplyWorkerInfo *winfo, XLogRecPtr remote_lsn)
{
+ XLogRecPtr local_lsn = InvalidXLogRecPtr;
+ TransactionId pa_remote_xid = winfo->shared->xid;
+
Assert(am_leader_apply_worker());
/*
- * Unlock the shared object lock so that parallel apply worker can
- * continue to receive and apply changes.
+ * Unlock the shared object lock taken for streaming transactions so that
+ * parallel apply worker can continue to receive and apply changes.
*/
- pa_unlock_stream(winfo->shared->xid, AccessExclusiveLock);
+ if (winfo->stream_txn)
+ pa_unlock_stream(winfo->shared->xid, AccessExclusiveLock);
/*
- * Wait for that worker to finish. This is necessary to maintain commit
- * order which avoids failures due to transaction dependencies and
- * deadlocks.
+ * Wait for the worker applying a streaming transaction to finish. This is
+ * necessary to maintain commit order which avoids failures due to
+ * transaction dependencies and deadlocks.
+ *
+ * For a non-streaming transaction in partial serialize mode, wait for
+ * the worker as well, since it cannot be reused anymore (see
+ * pa_free_worker() for details).
*/
- pa_wait_for_xact_finish(winfo);
+ if (winfo->serialize_changes || winfo->stream_txn)
+ {
+ pa_wait_for_xact_finish(winfo);
+
+ local_lsn = winfo->shared->last_commit_end;
+ pa_remote_xid = InvalidTransactionId;
+
+ pa_free_worker(winfo);
+ }
if (XLogRecPtrIsValid(remote_lsn))
- store_flush_position(remote_lsn, winfo->shared->last_commit_end);
+ store_flush_position(remote_lsn, local_lsn, pa_remote_xid);
- pa_free_worker(winfo);
+ pa_set_stream_apply_worker(NULL);
}
bool
@@ -1852,6 +2052,22 @@ pa_record_dependency_on_transactions(List *depends_on_xids)
}
}
+/*
+ * Mark the transaction state as finished and remove the shared hash entry.
+ */
+void
+pa_commit_transaction(void)
+{
+ TransactionId xid = MyParallelShared->xid;
+
+ SpinLockAcquire(&MyParallelShared->mutex);
+ MyParallelShared->xact_state = PARALLEL_TRANS_FINISHED;
+ SpinLockRelease(&MyParallelShared->mutex);
+
+ dshash_delete_key(parallelized_txns, &xid);
+ elog(DEBUG1, "depended xid %u committed", xid);
+}
+
/*
* Wait for the given transaction to finish.
*/
@@ -1880,3 +2096,45 @@ pa_wait_for_depended_transaction(TransactionId xid)
elog(DEBUG1, "finish waiting for depended xid %u", xid);
}
+
+/*
+ * Write internal relation description to the output stream.
+ */
+static void
+write_internal_relation(StringInfo s, LogicalRepRelation *rel)
+{
+ pq_sendbyte(s, PARALLEL_APPLY_INTERNAL_MESSAGE);
+ pq_sendbyte(s, LOGICAL_REP_MSG_INTERNAL_RELATION);
+
+ if (rel)
+ {
+ pq_sendint(s, 1, 4);
+ logicalrep_write_internal_rel(s, rel);
+ }
+ else
+ {
+ pq_sendint(s, logicalrep_get_num_rels(), 4);
+ logicalrep_write_all_rels(s);
+ }
+}
+
+/*
+ * Register a transaction to the shared hash table.
+ *
+ * This function is intended to be called during the commit phase of
+ * non-streamed transactions. Other parallel workers may wait on the added
+ * entry, which is removed once the transaction commits.
+ */
+void
+pa_add_parallelized_transaction(TransactionId xid)
+{
+ bool found;
+ ParallelizedTxnEntry *txn_entry;
+
+ Assert(parallelized_txns);
+ Assert(TransactionIdIsValid(xid));
+
+ txn_entry = dshash_find_or_insert(parallelized_txns, &xid, &found);
+
+ dshash_release_lock(parallelized_txns, txn_entry);
+}
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 72dedee3a43..73a1bd36963 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -691,6 +691,44 @@ logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel,
logicalrep_write_attrs(out, rel, columns, include_gencols_type);
}
+/*
+ * Write internal relation description to the output stream.
+ */
+void
+logicalrep_write_internal_rel(StringInfo out, LogicalRepRelation *rel)
+{
+ pq_sendint32(out, rel->remoteid);
+
+ /* Write relation name */
+ pq_sendstring(out, rel->nspname);
+ pq_sendstring(out, rel->relname);
+
+ /* Write the replica identity. */
+ pq_sendbyte(out, rel->replident);
+
+ /* Write attribute description */
+ pq_sendint16(out, rel->natts);
+
+ for (int i = 0; i < rel->natts; i++)
+ {
+ uint8 flags = 0;
+
+ if (bms_is_member(i, rel->attkeys))
+ flags |= LOGICALREP_IS_REPLICA_IDENTITY;
+
+ pq_sendbyte(out, flags);
+
+ /* attribute name */
+ pq_sendstring(out, rel->attnames[i]);
+
+ /* attribute type id */
+ pq_sendint32(out, rel->atttyps[i]);
+
+ /* ignore attribute mode for now */
+ pq_sendint32(out, 0);
+ }
+}
+
/*
* Read the relation info from stream and return as LogicalRepRelation.
*/
diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index 66c73ce34a1..001cf6a143f 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -960,6 +960,37 @@ FindLogicalRepLocalIndex(Relation localrel, LogicalRepRelation *remoterel,
return InvalidOid;
}
+/*
+ * Get the number of entries in the LogicalRepRelMap.
+ */
+int
+logicalrep_get_num_rels(void)
+{
+ if (LogicalRepRelMap == NULL)
+ return 0;
+
+ return hash_get_num_entries(LogicalRepRelMap);
+}
+
+/*
+ * Write all the remote relation information from the LogicalRepRelMapEntry to
+ * the output stream.
+ */
+void
+logicalrep_write_all_rels(StringInfo out)
+{
+ LogicalRepRelMapEntry *entry;
+ HASH_SEQ_STATUS status;
+
+ if (LogicalRepRelMap == NULL)
+ return;
+
+ hash_seq_init(&status, LogicalRepRelMap);
+
+ while ((entry = (LogicalRepRelMapEntry *) hash_seq_search(&status)) != NULL)
+ logicalrep_write_internal_rel(out, &entry->remoterel);
+}
+
/*
* Get the LogicalRepRelMapEntry corresponding to the given relid without
* opening the local relation.
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 269a3ac5804..8c871b205fc 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -484,6 +484,8 @@ static List *on_commit_wakeup_workers_subids = NIL;
bool in_remote_transaction = false;
static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
+static TransactionId remote_xid = InvalidTransactionId;
+static TransactionId last_remote_xid = InvalidTransactionId;
/* fields valid only when processing streamed transaction */
static bool in_streamed_transaction = false;
@@ -602,11 +604,7 @@ static inline void cleanup_subxact_info(void);
/*
* Serialize and deserialize changes for a toplevel transaction.
*/
-static void stream_open_file(Oid subid, TransactionId xid,
- bool first_segment);
static void stream_write_change(char action, StringInfo s);
-static void stream_open_and_write_change(TransactionId xid, char action, StringInfo s);
-static void stream_close_file(void);
static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
@@ -676,6 +674,8 @@ static void replorigin_reset(int code, Datum arg);
static bool send_internal_dependencies(ParallelApplyWorkerInfo *winfo,
StringInfo s);
+static bool build_dependency_with_last_committed_txn(ParallelApplyWorkerInfo *winfo);
+
/*
* Compute the hash value for entries in the replica_identity_table.
*/
@@ -1406,7 +1406,11 @@ handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
TransApplyAction apply_action;
StringInfoData original_msg;
- apply_action = get_transaction_apply_action(stream_xid, &winfo);
+ Assert(!in_streamed_transaction || TransactionIdIsValid(stream_xid));
+
+ apply_action = get_transaction_apply_action(in_streamed_transaction
+ ? stream_xid : remote_xid,
+ &winfo);
/* not in streaming mode */
if (apply_action == TRANS_LEADER_APPLY)
@@ -1415,8 +1419,6 @@ handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
return false;
}
- Assert(TransactionIdIsValid(stream_xid));
-
/*
* The parallel apply worker needs the xid in this message to decide
* whether to define a savepoint, so save the original message that has
@@ -1427,15 +1429,28 @@ handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
/*
* We should have received XID of the subxact as the first part of the
- * message, so extract it.
+ * message in streaming transactions, so extract it.
*/
- current_xid = pq_getmsgint(s, 4);
+ if (in_streamed_transaction)
+ current_xid = pq_getmsgint(s, 4);
+ else
+ current_xid = remote_xid;
if (!TransactionIdIsValid(current_xid))
ereport(ERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
errmsg_internal("invalid transaction ID in streamed replication transaction")));
+ handle_dependency_on_change(action, s, current_xid, winfo);
+
+ /*
+ * Re-fetch the latest apply action as it might have been changed during
+ * dependency check.
+ */
+ apply_action = get_transaction_apply_action(in_streamed_transaction
+ ? stream_xid : remote_xid,
+ &winfo);
+
switch (apply_action)
{
case TRANS_LEADER_SERIALIZE:
@@ -1839,17 +1854,71 @@ static void
apply_handle_begin(StringInfo s)
{
LogicalRepBeginData begin_data;
+ ParallelApplyWorkerInfo *winfo;
+ TransApplyAction apply_action;
+
+ /* Save the message before it is consumed. */
+ StringInfoData original_msg = *s;
/* There must not be an active streaming transaction. */
Assert(!TransactionIdIsValid(stream_xid));
logicalrep_read_begin(s, &begin_data);
- set_apply_error_context_xact(begin_data.xid, begin_data.final_lsn);
+
+ remote_xid = begin_data.xid;
+
+ set_apply_error_context_xact(remote_xid, begin_data.final_lsn);
remote_final_lsn = begin_data.final_lsn;
maybe_start_skipping_changes(begin_data.final_lsn);
+ pa_allocate_worker(remote_xid, false);
+
+ apply_action = get_transaction_apply_action(remote_xid, &winfo);
+
+ elog(DEBUG1, "new remote_xid %u", remote_xid);
+ switch (apply_action)
+ {
+ case TRANS_LEADER_APPLY:
+ break;
+
+ case TRANS_LEADER_SEND_TO_PARALLEL:
+ Assert(winfo);
+
+ if (pa_send_data(winfo, s->len, s->data))
+ {
+ pa_set_stream_apply_worker(winfo);
+ break;
+ }
+
+ /*
+ * Switch to serialize mode when we are not able to send the
+ * change to parallel apply worker.
+ */
+ pa_switch_to_partial_serialize(winfo, true);
+
+/* fall through */
+ case TRANS_LEADER_PARTIAL_SERIALIZE:
+ Assert(winfo);
+
+ stream_write_change(LOGICAL_REP_MSG_BEGIN, &original_msg);
+
+ /* Cache the parallel apply worker for this transaction. */
+ pa_set_stream_apply_worker(winfo);
+ break;
+
+ case TRANS_PARALLEL_APPLY:
+ /* Hold the lock until the end of the transaction. */
+ pa_lock_transaction(MyParallelShared->xid, AccessExclusiveLock);
+ pa_set_xact_state(MyParallelShared, PARALLEL_TRANS_STARTED);
+ break;
+
+ default:
+ elog(ERROR, "unexpected apply action: %d", (int) apply_action);
+ break;
+ }
+
in_remote_transaction = true;
pgstat_report_activity(STATE_RUNNING, NULL);
@@ -1882,6 +1951,37 @@ send_internal_dependencies(ParallelApplyWorkerInfo *winfo, StringInfo s)
return false;
}
+/*
+ * Make a dependency between this transaction and the last committed one.
+ *
+ * This function ensures that the commit ordering handled by parallel apply
+ * workers is preserved. Returns false if we switched to the serialize mode to
+ * send the message, true otherwise.
+ */
+static bool
+build_dependency_with_last_committed_txn(ParallelApplyWorkerInfo *winfo)
+{
+ StringInfoData dependency_msg;
+ bool ret;
+
+ /* Skip if transactions have not been applied yet */
+ if (!TransactionIdIsValid(last_remote_xid))
+ return true;
+
+ /* Build the dependency message used to send to parallel apply worker */
+ initStringInfo(&dependency_msg);
+
+ pq_sendbyte(&dependency_msg, PARALLEL_APPLY_INTERNAL_MESSAGE);
+ pq_sendbyte(&dependency_msg, LOGICAL_REP_MSG_INTERNAL_DEPENDENCY);
+ pq_sendint32(&dependency_msg, 1);
+ pq_sendint32(&dependency_msg, last_remote_xid);
+
+ ret = send_internal_dependencies(winfo, &dependency_msg);
+
+ pfree(dependency_msg.data);
+ return ret;
+}
+
/*
* Handle COMMIT message.
*
@@ -1891,6 +1991,11 @@ static void
apply_handle_commit(StringInfo s)
{
LogicalRepCommitData commit_data;
+ ParallelApplyWorkerInfo *winfo;
+ TransApplyAction apply_action;
+
+ /* Save the message before it is consumed. */
+ StringInfoData original_msg = *s;
logicalrep_read_commit(s, &commit_data);
@@ -1901,7 +2006,84 @@ apply_handle_commit(StringInfo s)
LSN_FORMAT_ARGS(commit_data.commit_lsn),
LSN_FORMAT_ARGS(remote_final_lsn))));
- apply_handle_commit_internal(&commit_data);
+ apply_action = get_transaction_apply_action(remote_xid, &winfo);
+
+ switch (apply_action)
+ {
+ case TRANS_LEADER_APPLY:
+ apply_handle_commit_internal(&commit_data);
+ break;
+
+ case TRANS_LEADER_SEND_TO_PARALLEL:
+ Assert(winfo);
+
+ /*
+ * Mark this transaction as parallelized. This ensures that
+ * upcoming transactions wait until this transaction is committed.
+ */
+ pa_add_parallelized_transaction(remote_xid);
+
+ /*
+ * Build a dependency between this transaction and the last
+ * committed transaction to preserve the commit order, then try to
+ * send the COMMIT message if that succeeded.
+ */
+ if (build_dependency_with_last_committed_txn(winfo) &&
+ pa_send_data(winfo, s->len, s->data))
+ {
+ /* Finish processing the transaction. */
+ pa_xact_finish(winfo, commit_data.end_lsn);
+ break;
+ }
+
+ /*
+ * Switch to serialize mode when we are not able to send the
+ * change to parallel apply worker.
+ */
+ pa_switch_to_partial_serialize(winfo, true);
+/* fall through */
+ case TRANS_LEADER_PARTIAL_SERIALIZE:
+ Assert(winfo);
+
+ stream_open_and_write_change(remote_xid, LOGICAL_REP_MSG_COMMIT,
+ &original_msg);
+
+ pa_set_fileset_state(winfo->shared, FS_SERIALIZE_DONE);
+
+ /* Finish processing the transaction. */
+ pa_xact_finish(winfo, commit_data.end_lsn);
+ break;
+
+ case TRANS_PARALLEL_APPLY:
+
+ /*
+ * If the parallel apply worker is applying spooled messages then
+ * close the file before committing.
+ */
+ if (stream_fd)
+ stream_close_file();
+
+ apply_handle_commit_internal(&commit_data);
+
+ MyParallelShared->last_commit_end = XactLastCommitEnd;
+
+ pa_commit_transaction();
+
+ pa_unlock_transaction(remote_xid, AccessExclusiveLock);
+ break;
+
+ default:
+ elog(ERROR, "unexpected apply action: %d", (int) apply_action);
+ break;
+ }
+
+ /* Cache the remote_xid */
+ last_remote_xid = remote_xid;
+
+ remote_xid = InvalidTransactionId;
+ in_remote_transaction = false;
+
+ elog(DEBUG1, "reset remote_xid %u", remote_xid);
/*
* Process any tables that are being synchronized in parallel, as well as
@@ -2024,7 +2206,8 @@ apply_handle_prepare(StringInfo s)
* XactLastCommitEnd, and adding it for this purpose doesn't seems worth
* it.
*/
- store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr);
+ store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr,
+ InvalidTransactionId);
in_remote_transaction = false;
@@ -2072,6 +2255,8 @@ apply_handle_commit_prepared(StringInfo s)
/* There is no transaction when COMMIT PREPARED is called */
begin_replication_step();
+ /* TODO wait for xid to finish */
+
/*
* Update origin state so we can restart streaming from correct position
* in case of crash.
@@ -2084,7 +2269,8 @@ apply_handle_commit_prepared(StringInfo s)
CommitTransactionCommand();
pgstat_report_stat(false);
- store_flush_position(prepare_data.end_lsn, XactLastCommitEnd);
+ store_flush_position(prepare_data.end_lsn, XactLastCommitEnd,
+ InvalidTransactionId);
in_remote_transaction = false;
/*
@@ -2153,7 +2339,8 @@ apply_handle_rollback_prepared(StringInfo s)
* transaction because we always flush the WAL record for it. See
* apply_handle_prepare.
*/
- store_flush_position(rollback_data.rollback_end_lsn, InvalidXLogRecPtr);
+ store_flush_position(rollback_data.rollback_end_lsn, InvalidXLogRecPtr,
+ InvalidTransactionId);
in_remote_transaction = false;
/*
@@ -2215,7 +2402,8 @@ apply_handle_stream_prepare(StringInfo s)
* It is okay not to set the local_end LSN for the prepare because
* we always flush the prepare record. See apply_handle_prepare.
*/
- store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr);
+ store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr,
+ InvalidTransactionId);
in_remote_transaction = false;
@@ -2467,6 +2655,11 @@ apply_handle_stream_start(StringInfo s)
case TRANS_LEADER_PARTIAL_SERIALIZE:
Assert(winfo);
+ /*
+ * TODO, the pa worker could start to wait too soon when
+ * processing some old stream start
+ */
+
/*
* Open the spool file unless it was already opened when switching
* to serialize mode. The transaction started in
@@ -3084,7 +3277,20 @@ apply_handle_stream_commit(StringInfo s)
case TRANS_LEADER_SEND_TO_PARALLEL:
Assert(winfo);
- if (pa_send_data(winfo, s->len, s->data))
+ /*
+ * Unlike the non-streaming case, there is no need to mark this
+ * transaction as parallelized, because the leader waits until the
+ * streamed transaction is committed, so commit ordering is always
+ * preserved.
+ */
+
+ /*
+ * Build a dependency between this transaction and the last committed
+ * transaction to preserve the commit order. If that succeeds, try to
+ * send the commit message.
+ */
+ if (build_dependency_with_last_committed_txn(winfo) &&
+ pa_send_data(winfo, s->len, s->data))
{
/* Finish processing the streaming transaction. */
pa_xact_finish(winfo, commit_data.end_lsn);
@@ -3140,6 +3346,9 @@ apply_handle_stream_commit(StringInfo s)
break;
}
+ /* Cache the remote xid */
+ last_remote_xid = xid;
+
/*
* Process any tables that are being synchronized in parallel, as well as
* any newly added tables or sequences.
@@ -3194,7 +3403,8 @@ apply_handle_commit_internal(LogicalRepCommitData *commit_data)
pgstat_report_stat(false);
- store_flush_position(commit_data->end_lsn, XactLastCommitEnd);
+ store_flush_position(commit_data->end_lsn, XactLastCommitEnd,
+ InvalidTransactionId);
}
else
{
@@ -3227,6 +3437,9 @@ apply_handle_relation(StringInfo s)
/* Also reset all entries in the partition map that refer to remoterel. */
logicalrep_partmap_reset_relmap(rel);
+
+ if (am_leader_apply_worker())
+ pa_distribute_schema_changes_to_workers(rel);
}
/*
@@ -4001,6 +4214,8 @@ FindDeletedTupleInLocalRel(Relation localrel, Oid localidxoid,
/*
* This handles insert, update, delete on a partitioned table.
+ *
+ * TODO, support parallel apply.
*/
static void
apply_handle_tuple_routing(ApplyExecutionData *edata,
@@ -4551,6 +4766,10 @@ apply_dispatch(StringInfo s)
* check which entries on it are already locally flushed. Those we can report
* as having been flushed.
*
+ * For non-streaming transactions managed by a parallel apply worker, we will
+ * get the local commit end from the shared parallel apply worker info once the
+ * transaction has been committed by the worker.
+ *
* The have_pending_txes is true if there are outstanding transactions that
* need to be flushed.
*/
@@ -4560,6 +4779,7 @@ get_flush_position(XLogRecPtr *write, XLogRecPtr *flush,
{
dlist_mutable_iter iter;
XLogRecPtr local_flush = GetFlushRecPtr(NULL);
+ List *committed_pa_xid = NIL;
*write = InvalidXLogRecPtr;
*flush = InvalidXLogRecPtr;
@@ -4569,6 +4789,36 @@ get_flush_position(XLogRecPtr *write, XLogRecPtr *flush,
FlushPosition *pos =
dlist_container(FlushPosition, node, iter.cur);
+ if (TransactionIdIsValid(pos->pa_remote_xid) &&
+ XLogRecPtrIsInvalid(pos->local_end))
+ {
+ bool skipped_write;
+
+ pos->local_end = pa_get_last_commit_end(pos->pa_remote_xid, true,
+ &skipped_write);
+
+ elog(DEBUG1,
+ "got commit end from parallel apply worker, "
+ "txn: %u, remote_end %X/%X, local_end %X/%X",
+ pos->pa_remote_xid, LSN_FORMAT_ARGS(pos->remote_end),
+ LSN_FORMAT_ARGS(pos->local_end));
+
+ /*
+ * Break the loop if the worker has not finished applying the
+ * transaction. There's no need to check subsequent transactions,
+ * as they must commit after the current transaction being
+ * examined and thus won't have their commit end available yet.
+ */
+ if (!skipped_write && XLogRecPtrIsInvalid(pos->local_end))
+ break;
+
+ committed_pa_xid = lappend_xid(committed_pa_xid, pos->pa_remote_xid);
+ }
+
+ /*
+ * Worker has finished applying or the transaction was applied in the
+ * leader apply worker
+ */
*write = pos->remote_end;
if (pos->local_end <= local_flush)
@@ -4577,29 +4827,19 @@ get_flush_position(XLogRecPtr *write, XLogRecPtr *flush,
dlist_delete(iter.cur);
pfree(pos);
}
- else
- {
- /*
- * Don't want to uselessly iterate over the rest of the list which
- * could potentially be long. Instead get the last element and
- * grab the write position from there.
- */
- pos = dlist_tail_element(FlushPosition, node,
- &lsn_mapping);
- *write = pos->remote_end;
- *have_pending_txes = true;
- return;
- }
}
*have_pending_txes = !dlist_is_empty(&lsn_mapping);
+
+ cleanup_replica_identity_table(committed_pa_xid);
}
/*
* Store current remote/local lsn pair in the tracking list.
*/
void
-store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn)
+store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn,
+ TransactionId remote_xid)
{
FlushPosition *flushpos;
@@ -4617,6 +4857,7 @@ store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn)
flushpos = (FlushPosition *) palloc(sizeof(FlushPosition));
flushpos->local_end = local_lsn;
flushpos->remote_end = remote_lsn;
+ flushpos->pa_remote_xid = remote_xid;
dlist_push_tail(&lsn_mapping, &flushpos->node);
MemoryContextSwitchTo(ApplyMessageContext);
@@ -6064,7 +6305,7 @@ stream_cleanup_files(Oid subid, TransactionId xid)
* changes for this transaction, create the buffile, otherwise open the
* previously created file.
*/
-static void
+void
stream_open_file(Oid subid, TransactionId xid, bool first_segment)
{
char path[MAXPGPATH];
@@ -6109,7 +6350,7 @@ stream_open_file(Oid subid, TransactionId xid, bool first_segment)
* stream_close_file
* Close the currently open file with streamed changes.
*/
-static void
+void
stream_close_file(void)
{
Assert(stream_fd != NULL);
@@ -6157,7 +6398,7 @@ stream_write_change(char action, StringInfo s)
* target file if not already before writing the message and close the file at
* the end.
*/
-static void
+void
stream_open_and_write_change(TransactionId xid, char action, StringInfo s)
{
Assert(!in_streamed_transaction);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 5d91e2a4287..7d2aaf2d389 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -253,6 +253,8 @@ extern void logicalrep_write_message(StringInfo out, TransactionId xid, XLogRecP
extern void logicalrep_write_rel(StringInfo out, TransactionId xid,
Relation rel, Bitmapset *columns,
PublishGencolsType include_gencols_type);
+extern void logicalrep_write_internal_rel(StringInfo out,
+ LogicalRepRelation *rel);
extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
extern void logicalrep_write_typ(StringInfo out, TransactionId xid,
Oid typoid);
diff --git a/src/include/replication/logicalrelation.h b/src/include/replication/logicalrelation.h
index 4b321bd2ad2..34a7069e9e5 100644
--- a/src/include/replication/logicalrelation.h
+++ b/src/include/replication/logicalrelation.h
@@ -52,6 +52,8 @@ extern void logicalrep_rel_close(LogicalRepRelMapEntry *rel,
LOCKMODE lockmode);
extern bool IsIndexUsableForReplicaIdentityFull(Relation idxrel, AttrMap *attrmap);
extern Oid GetRelationIdentityOrPK(Relation rel);
+extern int logicalrep_get_num_rels(void);
+extern void logicalrep_write_all_rels(StringInfo out);
extern LogicalRepRelMapEntry *logicalrep_get_relentry(LogicalRepRelId remoteid);
#endif /* LOGICALRELATION_H */
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 78b5667cebe..5371ee767f1 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -314,6 +314,10 @@ extern void apply_dispatch(StringInfo s);
extern void maybe_reread_subscription(void);
extern void stream_cleanup_files(Oid subid, TransactionId xid);
+extern void stream_open_file(Oid subid, TransactionId xid, bool first_segment);
+extern void stream_close_file(void);
+extern void stream_open_and_write_change(TransactionId xid, char action,
+ StringInfo s);
extern void set_stream_options(WalRcvStreamOptions *options,
char *slotname,
@@ -327,7 +331,8 @@ extern void SetupApplyOrSyncWorker(int worker_slot);
extern void DisableSubscriptionAndExit(void);
-extern void store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn);
+extern void store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn,
+ TransactionId remote_xid);
/* Function for apply error callback */
extern void apply_error_callback(void *arg);
@@ -342,6 +347,7 @@ extern void pa_detach_all_error_mq(void);
extern bool pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes,
const void *data);
+extern void pa_distribute_schema_changes_to_workers(LogicalRepRelation *rel);
extern void pa_switch_to_partial_serialize(ParallelApplyWorkerInfo *winfo,
bool stream_locked);
@@ -368,8 +374,9 @@ extern void pa_xact_finish(ParallelApplyWorkerInfo *winfo,
XLogRecPtr remote_lsn);
extern bool pa_transaction_committed(TransactionId xid);
extern void pa_record_dependency_on_transactions(List *depends_on_xids);
-
+extern void pa_commit_transaction(void);
extern void pa_wait_for_depended_transaction(TransactionId xid);
+extern void pa_add_parallelized_transaction(TransactionId xid);
#define isParallelApplyWorker(worker) ((worker)->in_use && \
(worker)->type == WORKERTYPE_PARALLEL_APPLY)
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 430c1246d14..2caf798ee0a 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -16,6 +16,8 @@ $node_publisher->start;
# Create subscriber node
my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf',
+ "max_logical_replication_workers = 10");
$node_subscriber->start;
# Create some preexisting content on publisher
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index 3d16c2a800d..c2fba0b9a9c 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -17,7 +17,7 @@ $node_publisher->start;
my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_subscriber->init;
$node_subscriber->append_conf('postgresql.conf',
- qq(max_logical_replication_workers = 6));
+ qq(max_logical_replication_workers = 7));
$node_subscriber->start;
my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
diff --git a/src/test/subscription/t/015_stream.pl b/src/test/subscription/t/015_stream.pl
index 03135b1cd6e..e79ddd9a41c 100644
--- a/src/test/subscription/t/015_stream.pl
+++ b/src/test/subscription/t/015_stream.pl
@@ -232,6 +232,12 @@ $node_subscriber->wait_for_log(
$node_publisher->safe_psql('postgres', "INSERT INTO test_tab_2 values(1)");
+# FIXME: Currently, non-streaming transactions are applied in parallel by
+# default. So, the first transaction is handled by a parallel apply worker. To
+# trigger the deadlock, initiate one more transaction to be applied by the
+# leader.
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab_2 values(1)");
+
$h->query_safe('COMMIT');
$h->quit;
@@ -247,7 +253,7 @@ $node_publisher->wait_for_catchup($appname);
$result =
$node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab_2");
-is($result, qq(5001), 'data replicated to subscriber after dropping index');
+is($result, qq(5002), 'data replicated to subscriber after dropping index');
# Clean up test data from the environment.
$node_publisher->safe_psql('postgres', "TRUNCATE TABLE test_tab_2");
diff --git a/src/test/subscription/t/026_stats.pl b/src/test/subscription/t/026_stats.pl
index a430ab4feec..58e34839ab4 100644
--- a/src/test/subscription/t/026_stats.pl
+++ b/src/test/subscription/t/026_stats.pl
@@ -16,6 +16,7 @@ $node_publisher->start;
# Create subscriber node.
my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf', "max_logical_replication_workers = 10");
$node_subscriber->start;
diff --git a/src/test/subscription/t/027_nosuperuser.pl b/src/test/subscription/t/027_nosuperuser.pl
index 691731743df..e0c1d213800 100644
--- a/src/test/subscription/t/027_nosuperuser.pl
+++ b/src/test/subscription/t/027_nosuperuser.pl
@@ -86,6 +86,7 @@ $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
$node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_publisher->init(allows_streaming => 'logical');
$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf', "max_logical_replication_workers = 10");
$node_publisher->start;
$node_subscriber->start;
$publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index cf3f6a7dafd..c1bdd918df5 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2088,6 +2088,7 @@ ParallelTransState
ParallelVacuumState
ParallelWorkerContext
ParallelWorkerInfo
+ParallelizedTxnEntry
Param
ParamCompileHook
ParamExecData
@@ -2558,6 +2559,8 @@ ReparameterizeForeignPathByChild_function
ReplaceVarsFromTargetList_context
ReplaceVarsNoMatchOption
ReplaceWrapOption
+ReplicaIdentityEntry
+ReplicaIdentityKey
ReplicaIdentityStmt
ReplicationKind
ReplicationSlot
@@ -4054,6 +4057,7 @@ remoteDep
remove_nulling_relids_context
rendezvousHashEntry
rep
+replica_identity_hash
replace_rte_variables_callback
replace_rte_variables_context
report_error_fn
--
2.47.3
Dear hackers,
I have been spending time benchmarking the patch set. Here is an updated
report. First, I want to reply to a few points raised by Tomas.
5) It's not clear to me how did you measure the TPS in your benchmark.
Did you measure how long it takes for the standby to catch up, or what
did you do?
Since that approach was not straightforward, we changed the metric: replication
latency was measured instead. See the "Workload" section for more details.
2) If I understand correctly, the patch maintains a "replica_identity"
hash table, with replica identity keys for all changes for all
concurrent transactions. How expensive can this be, in terms of CPU and
memory? What if I have multiple large batch transactions, each updating
millions of rows?
I have profiled the large-transaction cases and confirmed that the cleanup is
not CPU-costly. E.g., the attached .dat file shows the profile for the leader
worker with the 1M-update workload and 16 parallel workers. We can see that the
leader worker spends most of its time reading data from the stream, while the
cleanup function accounts for only around 5%. Also, I temporarily removed the
dependency-tracking part and re-ran the tests, but performance did not change.
Based on that, the CPU cost of dependency tracking is negligible.
I have not attached profiles for the other cases; let me know if they are needed.
We are still analyzing the memory consumption and will share results later.
6) Did you investigate why the speedup is just ~2.1 with 4 workers, i.e.
about half of the "ideal" speedup? Is it bottlenecked on WAL, leader
having to determine dependencies, or something else?
Even in the 1M insert/update workload with a replica identity, parallelism did
not help further. My theory is that the parallel workers were fast enough that
four of them could finish applying all the transactions.
Thus, I ran a further experiment that removed the replica identity and used
REPLICA IDENTITY FULL for applying UPDATEs. This increased the apply time, and
performance then improved up to w=16. See the "Result" part.
Below contains details of benchmarks.
Abstract
----------
I ran benchmarks with two workloads: 1) 1 million tuples inserted in total, and
2) 1 million tuples updated in total. Overall, we can say that parallel apply
can improve performance, especially when transactions are long and take time to
apply.
Regarding the INSERT workload, the patch applies changes about 10% faster than
HEAD, but the results stay constant regardless of parallelism. IIUC, because
applying the transactions was relatively fast, fewer parallel workers could be
launched. Another point is that performance worsens when the number of workers
is set to 0; we may be able to skip the additional dependency checks in that
case.
Regarding the UPDATE workload, performance improved up to
max_parallel_apply_workers_per_subscription = 4, but was flat for the {8, 16}
cases. This is because four workers are enough to apply all the changes: when
the leader tries to assign a new transaction, the first parallel worker has
already finished its task.
Additionally, I ran the UPDATE workload with REPLICA IDENTITY FULL, which
allowed performance to improve up to the w=16 case. This also shows that each
parallel worker spent more time applying, so the leader assigned more workers
from the pool.
Machine details
----------------
Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz, 88 cores, 503 GiB RAM
Source code:
----------------
pgHead (19b966243c) and v4 patch set
Setup:
---------
Pub --> Sub
- Two nodes created in pub-sub logical replication setup.
- Both instances had a table "foo (id int PRIMARY KEY, value double precision)",
which was included in the publication.
Workload:
----------------
Two workloads were run:
Case 1) INSERT 1 million tuples
1. Disabled the subscription on the Sub node.
2. Ran 1000 transactions. Each transaction inserted 1000 tuples,
i.e., there were 1 million tuples on the publisher in total.
3. Enabled the subscription on Sub and measured the time taken for replication.
Case 2) UPDATE 1 million tuples
1. Inserted one million tuples on the Pub node.
2. Waited until the tuples were replicated.
3. Disabled the subscription on the Sub node.
4. Ran 1000 transactions. Each transaction updated 1000 tuples.
Note that each transaction modified different tuples.
5. Enabled the subscription on Sub and measured the time taken for replication.
Furthermore, I ran one additional case that performed a 1M update without a PK.
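For reference, here is a minimal SQL sketch of the UPDATE workload (my own
reconstruction, assuming a subscription named "sub"; the exact scripts used are
attached):

-- On Sub: pause apply so that changes accumulate on the publisher.
ALTER SUBSCRIPTION sub DISABLE;

-- On Pub: 1000 transactions, each updating a disjoint slice of 1000 rows.
-- COMMIT is allowed in a DO block when it is run outside an explicit
-- transaction block, so each loop iteration is its own transaction.
DO $$
BEGIN
    FOR i IN 0..999 LOOP
        UPDATE foo SET value = value + 1
            WHERE id BETWEEN i * 1000 + 1 AND (i + 1) * 1000;
        COMMIT;
    END LOOP;
END $$;

-- On Sub: resume apply and measure the time until it catches up.
ALTER SUBSCRIPTION sub ENABLE;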
Result:
---------------------
I measured with varying the parallelism of the apply, max_parallel_apply_workers_per_subscription.
Case 1) 1 M insert
Each cell is the median of 5 runs. Also, inserting 1 million tuples takes
*8.28 seconds* on the publisher side.
(w means the max_parallel_apply_workers_per_subscription)
Used source elapsed time [s]
------------------------
HEAD 6.750675
patched, w=0 7.215072
patched, w=1 5.674886
patched, w=2 5.566869
patched, w=4 5.491499
patched, w=8 5.541768
patched, w=16 5.556885
We can see a regression when the number of workers is set to zero, because the
leader worker performs the dependency checks even in that case. We may want to
discuss optimizing this part; one idea is to skip the checks when parallelism
is disabled.
The w=1 case performs better because the leader can concentrate on receiving
the changes while the parallel worker applies them. This resembles what
streaming replication does.
For w=2 and larger, performance did not change. I found after the benchmark
that only one parallel apply worker had been launched. The reason is that the
launched parallel worker can finish applying a transaction before the leader
worker receives further changes: when the leader tries to assign a new
transaction, it finds that the parallel worker has already finished its task
and re-uses it.
This scenario means that the parallelism works effectively when transactions
have dependencies or when applying transactions takes more time than the leader
needs to receive new ones. Also, I think it is OK that the performance cannot
be improved linearly, because such a workload can be applied very quickly; in
this experiment, applying on the subscriber is about as fast as (or faster
than) the publisher.
Case 2) 1 M update
Used source elapsed time [s]
------------------------
HEAD 17.180169
patched, w=0 18.284964
patched, w=1 13.390546
patched, w=2 11.978078
patched, w=4 8.906887
patched, w=8 9.004753
patched, w=16 8.974946
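(For reference, the w=4 speedup over HEAD in this table is 17.18 / 8.91 = ~1.93x.)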
As in the INSERT case, w=0 performs worse than HEAD, and w=1 performs better.
For updates, performance improved up to the w=4 case. Per my analysis, up to
four parallel apply workers could be launched in this workload: before the
leader received the 5th transaction, the first parallel apply worker had
finished its task and could start applying the next one.
Additionally, I ran the same workload as case 2) without a PK on either node.
REPLICA IDENTITY was set to FULL on the publisher node so that UPDATE commands
could be replicated. Since HEAD and w=0 would each need more than 2 hours, I
did not run those cases.
Used source elapsed time [s]
------------------------
patched, w=1 7571.225952
patched, w=2 2688.792047
patched, w=4 1681.862011
patched, w=8 995.177401
patched, w=16 718.488441
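(Here the scaling continues much further: from w=1 to w=16 the elapsed time
drops by 7571.2 / 718.5 = ~10.5x.)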
Unlike the cases above, performance improved for every
max_parallel_apply_workers_per_subscription setting, which means the leader
fully used the worker pool in all cases. I checked the perf report at that time
and found that the leader spent most of its time in RelationFindReplTupleSeq;
that is, when all parallel workers were busy, the leader could not assign
transactions to them and applied the changes by itself.
Best regards,
Hayato Kuroda
FUJITSU LIMITED
On 16/12/25 12:35, Hayato Kuroda (Fujitsu) wrote:
Dear hackers,
I have been spending time for benchmarking the patch set. Here is an updated
report.
I apologise if my question is incorrect. But what about asynchronous
replication? Does this method help to reduce lag?
My case is a replica located far from the main instance, so an inevitable lag
exists. Do your benchmarks provide any insights into the lag reduction? Or is
the walsender process, which decodes WAL records from a hundred actively
committing backends, the bottleneck here?
--
regards, Andrei Lepikhov,
pgEdge
Dear Andrei,
I have been spending time benchmarking the patch set. Here is an updated
report.
I apologise if my question is incorrect. But what about asynchronous
replication? Does this method help to reduce lag? My case is a replica located
far from the main instance, so an inevitable lag exists. Do your benchmarks
provide any insights into the lag reduction?
Yes, ideally parallel apply can reduce the lag, but note that it takes effect
only after the changes have reached the subscriber. It may not be so effective
if the lag is caused by the network. If your transactions are large and you did
not enable the streaming option, changing it to 'on' or 'parallel' can improve
the lag, since it allows changes to be replicated before huge transactions are
committed.
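For example (assuming a subscription named "mysub"):

ALTER SUBSCRIPTION mysub SET (streaming = parallel);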
Or is the walsender process, which decodes WAL records from a hundred actively
committing backends, the bottleneck here?
Can you clarify your use case a bit more? E.g., how many instances subscribe to
changes from the same publisher? The cheat sheet [1] may be helpful to identify
the bottleneck.
[1]: https://wiki.postgresql.org/wiki/Operations_cheat_sheet
Best regards,
Hayato Kuroda
FUJITSU LIMITED
On 18/12/25 07:44, Hayato Kuroda (Fujitsu) wrote:
Dear Andrei,
I have been spending time benchmarking the patch set. Here is an updated
report.
I apologise if my question is incorrect. But what about asynchronous
replication? Does this method help to reduce lag? My case is a replica located
far from the main instance, so an inevitable lag exists. Do your benchmarks
provide any insights into the lag reduction?
Yes, ideally parallel apply can reduce the lag, but note that it takes effect
only after the changes have reached the subscriber. It may not be so effective
if the lag is caused by the network. If your transactions are large and you did
not enable the streaming option, changing it to 'on' or 'parallel' can improve
the lag, since it allows changes to be replicated before huge transactions are
committed.
Sorry if I was inaccurate. I want to understand the scope of this feature: what
benefit does the code provide over the current master in the case of async LR?
Of course, it is a prerequisite to enable streaming and parallel apply - without
these settings, your code does not work, does it?
Put aside transaction sizes - it's usually hard to predict. We may think
about a mix, but it would be enough to benchmark two corner cases - very
short (single row) and long (let’s say 10% of a table) transactions to
be sure we have no degradation.
I just wonder if the main use case for this approach is synchronous
commit and a good-enough network. Is it correct?
Or is the walsender process, which decodes WAL records from a hundred actively
committing backends, the bottleneck here?
Can you clarify your use case a bit more? E.g., how many instances subscribe to
changes from the same publisher? The cheat sheet [1] may be helpful to identify
the bottleneck.
I keep in mind two cases (for simplicity, let's imagine we have only one
publisher and one subscriber):
1. We have a low-latency network. If we add more and more load to the
main instance, which process will be the first bottleneck: walsender or
subscriber?
2. We have a stable load, and the walsender copes with the WAL decoding and
fills the output socket with transactions. In case latency degrades (a
geographically distributed configuration), may we profit from these new
changes in the parallel apply feature if the network bandwidth is wide enough?
--
regards, Andrei Lepikhov,
pgEdge
On Thu, Dec 18, 2025 at 2:14 PM Andrei Lepikhov <lepihov@gmail.com> wrote:
On 18/12/25 07:44, Hayato Kuroda (Fujitsu) wrote:
Dear Andrei,
I have been spending time benchmarking the patch set. Here is an updated
report.
I apologise if my question is incorrect. But what about asynchronous
replication? Does this method help to reduce lag? My case is a replica located
far from the main instance, so an inevitable lag exists. Do your benchmarks
provide any insights into the lag reduction?
Yes, ideally parallel apply can reduce the lag, but note that it takes effect
only after the changes have reached the subscriber. It may not be so effective
if the lag is caused by the network. If your transactions are large and you did
not enable the streaming option, changing it to 'on' or 'parallel' can improve
the lag, since it allows changes to be replicated before huge transactions are
committed.
Sorry if I was inaccurate. I want to understand the scope of this feature: what
benefit does the code provide over the current master in the case of async LR?
Of course, it is a prerequisite to enable streaming and parallel apply - without
these settings, your code does not work, does it?
Put aside transaction sizes - it's usually hard to predict. We may think about
a mix, but it would be enough to benchmark two corner cases - very short
(single row) and long (let's say 10% of a table) transactions to be sure we
have no degradation.
I just wonder if the main use case for this approach is synchronous commit and
a good-enough network. Is it correct?
It should help async workloads as well; the key criterion is that the apply
worker is not able to keep up with the load from the publisher.
Or is the walsender process, which decodes WAL records from a hundred actively
committing backends, the bottleneck here?
Can you clarify your use case a bit more? E.g., how many instances subscribe to
changes from the same publisher? The cheat sheet [1] may be helpful to identify
the bottleneck.
I keep in mind two cases (for simplicity, let's imagine we have only one
publisher and one subscriber):
1. We have a low-latency network. If we add more and more load to the main
instance, which process will be the first bottleneck: walsender or subscriber?
Ideally, it should be the subscriber, because it has to do more work to apply
the changes. So, the proposed feature should help in these cases.
2. We have a stable load, and the walsender copes with the WAL decoding and
fills the output socket with transactions. In case latency degrades (a
geographically distributed configuration), may we profit from these new
changes in the parallel apply feature if the network bandwidth is wide enough?
I think so. However, it would be helpful if you could measure performance in
such cases, either now or once the patch is in a bit more stabilized shape
after some cycles of review.
--
With Regards,
Amit Kapila.
Dear Andrei,
Yes, ideally parallel apply can reduce the lag, but note that it takes effect
only after the changes have reached the subscriber. It may not be so effective
if the lag is caused by the network. If your transactions are large and you did
not enable the streaming option, changing it to 'on' or 'parallel' can improve
the lag, since it allows changes to be replicated before huge transactions are
committed.
Sorry if I was inaccurate. I want to understand the scope of this feature: what
benefit does the code provide over the current master in the case of async LR?
This feature, applying non-streaming transactions in parallel, can improve
performance when a large number of transactions are committed on the publisher
side and the apply worker is the bottleneck.
Please see the attached primitive diagram. Assume that receiving a change takes
one time unit and applying it takes another. If the leader does all the work
alone, it needs eight time units; but if there are parallel workers that apply
the changes in parallel, the leader can concentrate on receiving items and
reduce the total time. I think this holds regardless of whether the LR is
synchronous or not.
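A rough text version of the same idea (my own sketch; the attached diagram
shows it properly). For four transactions, with R = receive, A = apply, and one
column per time unit:

leader alone:  R1 A1 R2 A2 R3 A3 R4 A8   -> 8 time units

leader:        R1 R2 R3 R4               -> all received after 4 units
worker 1:         A1    A3
worker 2:            A2    A4            -> all applied after 5 units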
Of course, it is a prerequisite to enable streaming
and parallel apply - without these settings, your code is not working,
is it?
Let me clarify. The subscription option 'streaming' affects how we handle large
transactions. 'on' means that a large transaction can be streamed before commit
and is spooled on the subscriber side; 'parallel' means it can additionally be
applied by a parallel worker while being streamed.
Actually, these options are not related to this proposal. This patch focuses on
the relatively small transactions, which are not streamed before committing.
I just wonder if the main use case for this approach is synchronous
commit and a good-enough network. Is it correct?
Both sync and async replication can work well. But it might not be so effective
if transporting the data takes 90% of the time; the leader would spend mostly
the same time in both HEAD and the patched case.
Best regards,
Hayato Kuroda
FUJITSU LIMITED
Dear Hackers,
I have been spending time implementing the patch set, and I think it's time to
share it on -hackers.
Patches 0001-0004 are largely unchanged; some refactoring was done. 0004 now
has a basic test for dependency tracking.
The remaining patches enhance the parallel apply feature; 0006, 0007 and 0008
contain tests.
0005 was copied from [1]. The patch is needed for applying the prepared
transactions correctly. Please post comments at [1] if you have any comments
on it.
0006 contains changes to support two-phase transactions in parallel.
Parallel workers can be assigned when the BEGIN_PREPARE message arrives, and
released after the PREPARE message. As with normal non-streamed transactions,
prepared transactions are marked as parallelized when the leader dispatches a
PREPARE message to a parallel worker, and the mark is removed when the parallel
worker finishes preparing. This prevents upcoming transactions from committing
until the parallel worker finishes the preparation.
As with streaming transactions, COMMIT/ROLLBACK PREPARED messages are handled
by the leader worker. At that time, the leader waits for the last launched
transaction to finish.
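For context, two-phase decoding (and hence BEGIN_PREPARE/PREPARE messages) only
happens for subscriptions created with the two_phase option, e.g. (connection
details below are placeholders):

CREATE SUBSCRIPTION mysub
    CONNECTION 'host=pub dbname=postgres'
    PUBLICATION mypub
    WITH (two_phase = on);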
0007 contains changes to track dependencies for streamed transactions.
In streaming=on mode, dependency tracking and waiting are performed while the
changes are applied; the leader does nothing while serializing changes.
In streaming=parallel mode, we must track and wait based on dependencies.
Basically, non-streamed transactions do not have to wait for streamed
transactions, because the leader worker always waits for those to be applied.
In contrast, streamed transactions must wait for the most recently dispatched
non-streamed transactions. Based on that, streamed transactions won't be marked
as parallelized, and the XID of a streamed transaction won't be set in the
replica identity hash entry; this means no parallel worker would wait for a
streamed transaction. Other than that, dependency tracking is done the same way
as in the non-streaming case.
0008 contains changes to track dependencies based on subscriber-local indexes.
This extends the RI hash table to allow values to be stored based on local
indexes. The information about which indexes are defined on a table is gathered
by the leader when dependency checking for the table is first done in a
transaction. The detection mechanism is mostly the same as in the RI case.
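As an illustration (my own example, not taken from the patch set), 0008 targets
cases like a unique index that exists only on the subscriber:

-- On the subscriber only:
CREATE UNIQUE INDEX foo_value_uq ON foo (value);

Two publisher transactions that touch different replica identities but collide
on "value" would conflict only on the subscriber, so the leader must treat them
as dependent.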
How do you feel?
[1]: /messages/by-id/TY4PR01MB169078771FB31B395AB496A6B94B4A@TY4PR01MB16907.jpnprd01.prod.outlook.com
[2]: /messages/by-id/OS0PR01MB5716D43CB68DB8FFE73BF65D942AA@OS0PR01MB5716.jpnprd01.prod.outlook.com
Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachments:
v5-0001-Introduce-new-type-of-logical-replication-message.patch
From b601a49c4789a14aa7ab4765d6e97e20bfb7a29a Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Mon, 1 Dec 2025 10:37:27 +0900
Subject: [PATCH v5 1/8] Introduce new type of logical replication messages to
track dependencies
This patch introduces two logical replication messages,
LOGICAL_REP_MSG_INTERNAL_DEPENDENCY and LOGICAL_REP_MSG_INTERNAL_RELATION.
Unlike other messages, they are not sent by walsenders; the leader worker sends
them to parallel workers as needed.
LOGICAL_REP_MSG_INTERNAL_DEPENDENCY ensures that dependent transactions are
committed in the correct order. It has a list of transaction IDs that parallel
workers must wait for. The message type would be generated when the leader
detects a dependency between the current and other transactions, or just before
the COMMIT message. The latter one is used to preserve the commit ordering
between the publisher and the subscriber.
LOGICAL_REP_MSG_INTERNAL_RELATION is used to synchronize the relation
information between the leader and parallel workers. It has a list of relations
that the leader already knows, and parallel workers also update the relmap in
response to the message. This type of message is generated when the leader
allocates a new parallel worker to the transaction, or when the publisher sends
additional RELATION messages.
Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
---
.../replication/logical/applyparallelworker.c | 16 ++++++
src/backend/replication/logical/proto.c | 4 ++
src/backend/replication/logical/worker.c | 49 +++++++++++++++++++
src/include/replication/logicalproto.h | 2 +
src/include/replication/worker_internal.h | 4 ++
5 files changed, 75 insertions(+)
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index a4aafcf5b6e..055feea0bc5 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -1645,3 +1645,19 @@ pa_xact_finish(ParallelApplyWorkerInfo *winfo, XLogRecPtr remote_lsn)
pa_free_worker(winfo);
}
+
+/*
+ * Wait for the given transaction to finish.
+ */
+void
+pa_wait_for_depended_transaction(TransactionId xid)
+{
+ elog(DEBUG1, "wait for depended xid %u", xid);
+
+ for (;;)
+ {
+ /* XXX wait until given transaction is finished */
+ }
+
+ elog(DEBUG1, "finish waiting for depended xid %u", xid);
+}
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 27ad74fd759..ded46c49a83 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -1253,6 +1253,10 @@ logicalrep_message_type(LogicalRepMsgType action)
return "STREAM ABORT";
case LOGICAL_REP_MSG_STREAM_PREPARE:
return "STREAM PREPARE";
+ case LOGICAL_REP_MSG_INTERNAL_DEPENDENCY:
+ return "INTERNAL DEPENDENCY";
+ case LOGICAL_REP_MSG_INTERNAL_RELATION:
+ return "INTERNAL RELATION";
}
/*
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index fc64476a9ef..55c264b9d39 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -629,6 +629,47 @@ static TransApplyAction get_transaction_apply_action(TransactionId xid,
static void replorigin_reset(int code, Datum arg);
+/*
+ * Handle internal dependency information.
+ *
+ * Wait for all transactions listed in the message to commit.
+ */
+static void
+apply_handle_internal_dependency(StringInfo s)
+{
+ int nxids = pq_getmsgint(s, 4);
+
+ for (int i = 0; i < nxids; i++)
+ {
+ TransactionId xid = pq_getmsgint(s, 4);
+
+ pa_wait_for_depended_transaction(xid);
+ }
+}
+
+/*
+ * Handle internal relation information.
+ *
+ * Update all relation details in the relation map cache.
+ */
+static void
+apply_handle_internal_relation(StringInfo s)
+{
+ int num_rels;
+
+ num_rels = pq_getmsgint(s, 4);
+
+ for (int i = 0; i < num_rels; i++)
+ {
+ LogicalRepRelation *rel = logicalrep_read_rel(s);
+
+ logicalrep_relmap_update(rel);
+
+ elog(DEBUG1, "parallel apply worker worker init relmap for %s",
+ rel->relname);
+ }
+}
+
/*
* Form the origin name for the subscription.
*
@@ -3868,6 +3909,14 @@ apply_dispatch(StringInfo s)
apply_handle_stream_prepare(s);
break;
+ case LOGICAL_REP_MSG_INTERNAL_RELATION:
+ apply_handle_internal_relation(s);
+ break;
+
+ case LOGICAL_REP_MSG_INTERNAL_DEPENDENCY:
+ apply_handle_internal_dependency(s);
+ break;
+
default:
ereport(ERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index b261c60d3fa..5d91e2a4287 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -75,6 +75,8 @@ typedef enum LogicalRepMsgType
LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
LOGICAL_REP_MSG_STREAM_ABORT = 'A',
LOGICAL_REP_MSG_STREAM_PREPARE = 'p',
+ LOGICAL_REP_MSG_INTERNAL_DEPENDENCY = 'd',
+ LOGICAL_REP_MSG_INTERNAL_RELATION = 'i',
} LogicalRepMsgType;
/*
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index f081619f151..a3526eae578 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -359,6 +359,8 @@ extern void pa_decr_and_wait_stream_block(void);
extern void pa_xact_finish(ParallelApplyWorkerInfo *winfo,
XLogRecPtr remote_lsn);
+extern void pa_wait_for_depended_transaction(TransactionId xid);
+
#define isParallelApplyWorker(worker) ((worker)->in_use && \
(worker)->type == WORKERTYPE_PARALLEL_APPLY)
#define isTableSyncWorker(worker) ((worker)->in_use && \
@@ -366,6 +368,8 @@ extern void pa_xact_finish(ParallelApplyWorkerInfo *winfo,
#define isSequenceSyncWorker(worker) ((worker)->in_use && \
(worker)->type == WORKERTYPE_SEQUENCESYNC)
+#define PARALLEL_APPLY_INTERNAL_MESSAGE 'i'
+
static inline bool
am_tablesync_worker(void)
{
--
2.47.3
v5-0002-Introduce-a-shared-hash-table-to-store-paralleliz.patch
From cef9aeca4259071aa857bac8f7867cd6806781ea Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Mon, 1 Dec 2025 16:28:38 +0900
Subject: [PATCH v5 2/8] Introduce a shared hash table to store parallelized
transactions
This hash table is used for ensuring that parallel workers wait until dependent
transactions are committed.
The shared hash table contains transaction IDs that the leader allocated to
parallel workers. The hash entries are inserted with the remote XID when the
leader dispatches remote transactions to parallel apply workers. Entries are
deleted once the parallel workers have committed the corresponding transactions.
When the parallel worker tries to wait for other transactions, it checks the
hash table for the remote XIDs. The process can go ahead only when entries are
removed from the hash.
Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
---
.../replication/logical/applyparallelworker.c | 100 +++++++++++++++++-
.../utils/activity/wait_event_names.txt | 1 +
src/include/replication/worker_internal.h | 4 +
src/include/storage/lwlocklist.h | 1 +
4 files changed, 105 insertions(+), 1 deletion(-)
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index 055feea0bc5..6ca5f778a3b 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -218,12 +218,35 @@ typedef struct ParallelApplyWorkerEntry
ParallelApplyWorkerInfo *winfo;
} ParallelApplyWorkerEntry;
+/* an entry in the parallelized_txns shared hash table */
+typedef struct ParallelizedTxnEntry
+{
+ TransactionId xid; /* Hash key */
+} ParallelizedTxnEntry;
+
/*
* A hash table used to cache the state of streaming transactions being applied
* by the parallel apply workers.
*/
static HTAB *ParallelApplyTxnHash = NULL;
+/*
+ * A hash table used to track the parallelized transactions that could be
+ * depended on by other transactions.
+ */
+static dsa_area *parallel_apply_dsa_area = NULL;
+static dshash_table *parallelized_txns = NULL;
+
+/* parameters for the parallelized_txns shared hash table */
+static const dshash_parameters dsh_params = {
+ sizeof(TransactionId),
+ sizeof(ParallelizedTxnEntry),
+ dshash_memcmp,
+ dshash_memhash,
+ dshash_memcpy,
+ LWTRANCHE_PARALLEL_APPLY_DSA
+};
+
/*
* A list (pool) of active parallel apply workers. The information for
* the new worker is added to the list after successfully launching it. The
@@ -257,6 +280,8 @@ static List *subxactlist = NIL;
static void pa_free_worker_info(ParallelApplyWorkerInfo *winfo);
static ParallelTransState pa_get_xact_state(ParallelApplyWorkerShared *wshared);
static PartialFileSetState pa_get_fileset_state(void);
+static void pa_attach_parallelized_txn_hash(dsa_handle *pa_dsa_handle,
+ dshash_table_handle *pa_dshash_handle);
/*
* Returns true if it is OK to start a parallel apply worker, false otherwise.
@@ -334,6 +359,15 @@ pa_setup_dsm(ParallelApplyWorkerInfo *winfo)
shm_mq *mq;
Size queue_size = DSM_QUEUE_SIZE;
Size error_queue_size = DSM_ERROR_QUEUE_SIZE;
+ dsa_handle parallel_apply_dsa_handle;
+ dshash_table_handle parallelized_txns_handle;
+
+ pa_attach_parallelized_txn_hash(¶llel_apply_dsa_handle,
+ ¶llelized_txns_handle);
+
+ if (parallel_apply_dsa_handle == DSA_HANDLE_INVALID ||
+ parallelized_txns_handle == DSHASH_HANDLE_INVALID)
+ return false;
/*
* Estimate how much shared memory we need.
@@ -369,6 +403,8 @@ pa_setup_dsm(ParallelApplyWorkerInfo *winfo)
pg_atomic_init_u32(&(shared->pending_stream_count), 0);
shared->last_commit_end = InvalidXLogRecPtr;
shared->fileset_state = FS_EMPTY;
+ shared->parallel_apply_dsa_handle = parallel_apply_dsa_handle;
+ shared->parallelized_txns_handle = parallelized_txns_handle;
shm_toc_insert(toc, PARALLEL_APPLY_KEY_SHARED, shared);
@@ -864,6 +900,8 @@ ParallelApplyWorkerMain(Datum main_arg)
shm_mq *mq;
shm_mq_handle *mqh;
shm_mq_handle *error_mqh;
+ dsa_handle pa_dsa_handle;
+ dshash_table_handle pa_dshash_handle;
RepOriginId originid;
int worker_slot = DatumGetInt32(main_arg);
char originname[NAMEDATALEN];
@@ -951,6 +989,8 @@ ParallelApplyWorkerMain(Datum main_arg)
InitializingApplyWorker = false;
+ pa_attach_parallelized_txn_hash(&pa_dsa_handle, &pa_dshash_handle);
+
/* Setup replication origin tracking. */
StartTransactionCommand();
ReplicationOriginNameForLogicalRep(MySubscription->oid, InvalidOid,
@@ -1646,6 +1686,51 @@ pa_xact_finish(ParallelApplyWorkerInfo *winfo, XLogRecPtr remote_lsn)
pa_free_worker(winfo);
}
+/*
+ * Attach to the shared hash table for parallelized transactions.
+ */
+static void
+pa_attach_parallelized_txn_hash(dsa_handle *pa_dsa_handle,
+ dshash_table_handle *pa_dshash_handle)
+{
+ MemoryContext oldctx;
+
+ if (parallelized_txns)
+ {
+ Assert(parallel_apply_dsa_area);
+ *pa_dsa_handle = dsa_get_handle(parallel_apply_dsa_area);
+ *pa_dshash_handle = dshash_get_hash_table_handle(parallelized_txns);
+ return;
+ }
+
+ /* Be sure any local memory allocated by DSA routines is persistent. */
+ oldctx = MemoryContextSwitchTo(ApplyContext);
+
+ if (am_leader_apply_worker())
+ {
+ /* Initialize the dynamic shared hash table for parallelized transactions. */
+ parallel_apply_dsa_area = dsa_create(LWTRANCHE_PARALLEL_APPLY_DSA);
+ dsa_pin(parallel_apply_dsa_area);
+ dsa_pin_mapping(parallel_apply_dsa_area);
+ parallelized_txns = dshash_create(parallel_apply_dsa_area, &dsh_params, NULL);
+
+ /* Store handles in shared memory for other backends to use. */
+ *pa_dsa_handle = dsa_get_handle(parallel_apply_dsa_area);
+ *pa_dshash_handle = dshash_get_hash_table_handle(parallelized_txns);
+ }
+ else if (am_parallel_apply_worker())
+ {
+ /* Attach to existing dynamic shared hash table. */
+ parallel_apply_dsa_area = dsa_attach(MyParallelShared->parallel_apply_dsa_handle);
+ dsa_pin_mapping(parallel_apply_dsa_area);
+ parallelized_txns = dshash_attach(parallel_apply_dsa_area, &dsh_params,
+ MyParallelShared->parallelized_txns_handle,
+ NULL);
+ }
+
+ MemoryContextSwitchTo(oldctx);
+}
+
/*
* Wait for the given transaction to finish.
*/
@@ -1656,7 +1741,20 @@ pa_wait_for_depended_transaction(TransactionId xid)
for (;;)
{
- /* XXX wait until given transaction is finished */
+ ParallelizedTxnEntry *txn_entry;
+
+ txn_entry = dshash_find(parallelized_txns, &xid, false);
+
+ /* The entry is removed only if the transaction is committed */
+ if (txn_entry == NULL)
+ break;
+
+ dshash_release_lock(parallelized_txns, txn_entry);
+
+ pa_lock_transaction(xid, AccessShareLock);
+ pa_unlock_transaction(xid, AccessShareLock);
+
+ CHECK_FOR_INTERRUPTS();
}
elog(DEBUG1, "finish waiting for depended xid %u", xid);
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index c0632bf901a..d7aaef70fb1 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -405,6 +405,7 @@ SubtransSLRU "Waiting to access the sub-transaction SLRU cache."
XactSLRU "Waiting to access the transaction status SLRU cache."
ParallelVacuumDSA "Waiting for parallel vacuum dynamic shared memory allocation."
AioUringCompletion "Waiting for another process to complete IO via io_uring."
+ParallelApplyDSA "Waiting for parallel apply dynamic shared memory allocation."
# No "ABI_compatibility" region here as WaitEventLWLock has its own C code.
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index a3526eae578..ddcdcc05053 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -15,6 +15,7 @@
#include "access/xlogdefs.h"
#include "catalog/pg_subscription.h"
#include "datatype/timestamp.h"
+#include "lib/dshash.h"
#include "miscadmin.h"
#include "replication/logicalrelation.h"
#include "replication/walreceiver.h"
@@ -197,6 +198,9 @@ typedef struct ParallelApplyWorkerShared
*/
PartialFileSetState fileset_state;
FileSet fileset;
+
+ dsa_handle parallel_apply_dsa_handle;
+ dshash_table_handle parallelized_txns_handle;
} ParallelApplyWorkerShared;
/*
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 5b0ce383408..d68940b02bc 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -136,3 +136,4 @@ PG_LWLOCKTRANCHE(SUBTRANS_SLRU, SubtransSLRU)
PG_LWLOCKTRANCHE(XACT_SLRU, XactSLRU)
PG_LWLOCKTRANCHE(PARALLEL_VACUUM_DSA, ParallelVacuumDSA)
PG_LWLOCKTRANCHE(AIO_URING_COMPLETION, AioUringCompletion)
+PG_LWLOCKTRANCHE(PARALLEL_APPLY_DSA, ParallelApplyDSA)
--
2.47.3
v5-0003-Introduce-a-local-hash-table-to-store-replica-ide.patch
From 66460baa47da05a0b5086688a4b68e3609808b43 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Mon, 1 Dec 2025 16:39:02 +0900
Subject: [PATCH v5 3/8] Introduce a local hash table to store replica
identities
This local hash table on the leader is used for detecting dependencies between
transactions.
The hash contains the Replica Identity (RI) as a key and the remote XID that
modified the corresponding tuple. The hash entries are inserted when the leader
finds an RI from a replication message. Entries are deleted when transactions
committed by parallel workers are gathered, or the number of entries exceeds the
limit.
When the leader sends replication changes to parallel workers, it checks whether
other transactions have already used the RI associated with the change. If
something is found, the leader treats it as a dependent transaction and notifies
parallel workers to wait until it finishes via LOGICAL_REP_MSG_INTERNAL_DEPENDENCY.
Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
---
.../replication/logical/applyparallelworker.c | 123 +++-
src/backend/replication/logical/relation.c | 24 +
src/backend/replication/logical/worker.c | 616 +++++++++++++++++-
src/include/replication/logicalrelation.h | 3 +
src/include/replication/worker_internal.h | 8 +-
5 files changed, 771 insertions(+), 3 deletions(-)
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index 6ca5f778a3b..cf08206d9fd 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -216,6 +216,7 @@ typedef struct ParallelApplyWorkerEntry
{
TransactionId xid; /* Hash key -- must be first */
ParallelApplyWorkerInfo *winfo;
+ XLogRecPtr local_end;
} ParallelApplyWorkerEntry;
/* an entry in the parallelized_txns shared hash table */
@@ -504,7 +505,7 @@ pa_launch_parallel_worker(void)
* streaming changes.
*/
void
-pa_allocate_worker(TransactionId xid)
+pa_allocate_worker(TransactionId xid, bool stream_txn)
{
bool found;
ParallelApplyWorkerInfo *winfo = NULL;
@@ -545,7 +546,9 @@ pa_allocate_worker(TransactionId xid)
winfo->in_use = true;
winfo->serialize_changes = false;
+ winfo->stream_txn = stream_txn;
entry->winfo = winfo;
+ entry->local_end = InvalidXLogRecPtr;
}
/*
@@ -742,6 +745,73 @@ pa_process_spooled_messages_if_required(void)
return true;
}
+/*
+ * Get the local end LSN for a transaction applied by a parallel apply worker.
+ *
+ * Set delete_entry to true if you intend to remove the transaction from the
+ * ParallelApplyTxnHash after collecting its LSN.
+ *
+ * If the parallel apply worker did not write any changes during the transaction
+ * application due to situations like update/delete_missing or a before trigger,
+ * the *skipped_write will be set to true.
+ */
+XLogRecPtr
+pa_get_last_commit_end(TransactionId xid, bool delete_entry, bool *skipped_write)
+{
+ bool found;
+ ParallelApplyWorkerEntry *entry;
+ ParallelApplyWorkerInfo *winfo;
+
+ Assert(TransactionIdIsValid(xid));
+
+ if (skipped_write)
+ *skipped_write = false;
+
+ /* Find an entry for the requested transaction. */
+ entry = hash_search(ParallelApplyTxnHash, &xid, HASH_FIND, &found);
+
+ if (!found)
+ return InvalidXLogRecPtr;
+
+ /*
+ * If worker info is NULL, it indicates that the worker has been reused
+ * for handling other transactions. Consequently, the local end LSN has
+ * already been collected and saved in entry->local_end.
+ */
+ winfo = entry->winfo;
+ if (winfo == NULL)
+ {
+ if (delete_entry &&
+ !hash_search(ParallelApplyTxnHash, &xid, HASH_REMOVE, NULL))
+ elog(ERROR, "hash table corrupted");
+
+ if (skipped_write)
+ *skipped_write = XLogRecPtrIsInvalid(entry->local_end);
+
+ return entry->local_end;
+ }
+
+ /* Return InvalidXLogRecPtr if the transaction is still in progress */
+ if (pa_get_xact_state(winfo->shared) != PARALLEL_TRANS_FINISHED)
+ return InvalidXLogRecPtr;
+
+ /* Collect the local end LSN from the worker's shared memory area */
+ entry->local_end = winfo->shared->last_commit_end;
+ entry->winfo = NULL;
+
+ if (skipped_write)
+ *skipped_write = XLogRecPtrIsInvalid(entry->local_end);
+
+ elog(DEBUG1, "store local commit %X/%X end to txn entry: %u",
+ LSN_FORMAT_ARGS(entry->local_end), xid);
+
+ if (delete_entry &&
+ !hash_search(ParallelApplyTxnHash, &xid, HASH_REMOVE, NULL))
+ elog(ERROR, "hash table corrupted");
+
+ return entry->local_end;
+}
+
/*
* Interrupt handler for main loop of parallel apply worker.
*/
@@ -1686,6 +1756,26 @@ pa_xact_finish(ParallelApplyWorkerInfo *winfo, XLogRecPtr remote_lsn)
pa_free_worker(winfo);
}
+bool
+pa_transaction_committed(TransactionId xid)
+{
+ bool found;
+ ParallelApplyWorkerEntry *entry;
+
+ Assert(TransactionIdIsValid(xid));
+
+ /* Find an entry for the requested transaction */
+ entry = hash_search(ParallelApplyTxnHash, &xid, HASH_FIND, &found);
+
+ if (!found)
+ return true;
+
+ if (!entry->winfo)
+ return true;
+
+ return pa_get_xact_state(entry->winfo->shared) == PARALLEL_TRANS_FINISHED;
+}
+
/*
* Attach to the shared hash table for parallelized transactions.
*/
@@ -1731,6 +1821,37 @@ pa_attach_parallelized_txn_hash(dsa_handle *pa_dsa_handle,
MemoryContextSwitchTo(oldctx);
}
+/*
+ * Record in-progress transactions from the given list that are being depended
+ * on into the shared hash table.
+ */
+void
+pa_record_dependency_on_transactions(List *depends_on_xids)
+{
+ foreach_xid(xid, depends_on_xids)
+ {
+ bool found;
+ ParallelApplyWorkerEntry *winfo_entry;
+ ParallelApplyWorkerInfo *winfo;
+ ParallelizedTxnEntry *txn_entry;
+
+ winfo_entry = hash_search(ParallelApplyTxnHash, &xid, HASH_FIND, &found);
+ winfo = winfo_entry->winfo;
+
+ txn_entry = dshash_find_or_insert(parallelized_txns, &xid, &found);
+
+ /*
+ * If the transaction has been committed now, remove the entry,
+ * otherwise the parallel apply worker will remove the entry once
+ * committed the transaction.
+ */
+ if (pa_get_xact_state(winfo->shared) == PARALLEL_TRANS_FINISHED)
+ dshash_delete_entry(parallelized_txns, txn_entry);
+ else
+ dshash_release_lock(parallelized_txns, txn_entry);
+ }
+}
+
/*
* Wait for the given transaction to finish.
*/
diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index 2c8485b881f..13f8cb74e9f 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -959,3 +959,27 @@ FindLogicalRepLocalIndex(Relation localrel, LogicalRepRelation *remoterel,
return InvalidOid;
}
+
+/*
+ * Get the LogicalRepRelMapEntry corresponding to the given relid without
+ * opening the local relation.
+ */
+LogicalRepRelMapEntry *
+logicalrep_get_relentry(LogicalRepRelId remoteid)
+{
+ LogicalRepRelMapEntry *entry;
+ bool found;
+
+ if (LogicalRepRelMap == NULL)
+ logicalrep_relmap_init();
+
+ /* Search for existing entry. */
+ entry = hash_search(LogicalRepRelMap, (void *) &remoteid,
+ HASH_FIND, &found);
+
+ if (!found)
+ elog(DEBUG1, "no relation map entry for remote relation ID %u",
+ remoteid);
+
+ return entry;
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 55c264b9d39..4c154363277 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -303,6 +303,7 @@ typedef struct FlushPosition
dlist_node node;
XLogRecPtr local_end;
XLogRecPtr remote_end;
+ TransactionId pa_remote_xid;
} FlushPosition;
static dlist_head lsn_mapping = DLIST_STATIC_INIT(lsn_mapping);
@@ -544,6 +545,49 @@ typedef struct ApplySubXactData
static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};
+typedef struct ReplicaIdentityKey
+{
+ Oid relid;
+ LogicalRepTupleData *data;
+} ReplicaIdentityKey;
+
+typedef struct ReplicaIdentityEntry
+{
+ ReplicaIdentityKey *keydata;
+ TransactionId remote_xid;
+
+ /* needed for simplehash */
+ uint32 hash;
+ char status;
+} ReplicaIdentityEntry;
+
+#include "common/hashfn.h"
+
+static uint32 hash_replica_identity(ReplicaIdentityKey *key);
+static bool hash_replica_identity_compare(ReplicaIdentityKey *a,
+ ReplicaIdentityKey *b);
+
+/* Define parameters for replica identity hash table code generation. */
+#define SH_PREFIX replica_identity
+#define SH_ELEMENT_TYPE ReplicaIdentityEntry
+#define SH_KEY_TYPE ReplicaIdentityKey *
+#define SH_KEY keydata
+#define SH_HASH_KEY(tb, key) hash_replica_identity(key)
+#define SH_EQUAL(tb, a, b) hash_replica_identity_compare(a, b)
+#define SH_STORE_HASH
+#define SH_GET_HASH(tb, a) (a)->hash
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+#define REPLICA_IDENTITY_INITIAL_SIZE 128
+#define REPLICA_IDENTITY_CLEANUP_THRESHOLD 1024
+
+static replica_identity_hash *replica_identity_table = NULL;
+
+static void write_internal_dependencies(StringInfo s, List *depends_on_xids);
+
static inline void subxact_filename(char *path, Oid subid, TransactionId xid);
static inline void changes_filename(char *path, Oid subid, TransactionId xid);
@@ -629,6 +673,546 @@ static TransApplyAction get_transaction_apply_action(TransactionId xid,
static void replorigin_reset(int code, Datum arg);
+static bool send_internal_dependencies(ParallelApplyWorkerInfo *winfo,
+ StringInfo s);
+
+/*
+ * Compute the hash value for entries in the replica_identity_table.
+ */
+static uint32
+hash_replica_identity(ReplicaIdentityKey *key)
+{
+ int i;
+ uint32 hashkey = 0;
+
+ hashkey = hash_combine(hashkey, hash_uint32(key->relid));
+
+ for (i = 0; i < key->data->ncols; i++)
+ {
+ uint32 hkey;
+
+ if (key->data->colstatus[i] == LOGICALREP_COLUMN_NULL)
+ continue;
+
+ hkey = hash_any((const unsigned char *) key->data->colvalues[i].data,
+ key->data->colvalues[i].len);
+ hashkey = hash_combine(hashkey, hkey);
+ }
+
+ return hashkey;
+}
+
+/*
+ * Compare two entries in the replica_identity_table.
+ */
+static bool
+hash_replica_identity_compare(ReplicaIdentityKey *a, ReplicaIdentityKey *b)
+{
+ if (a->relid != b->relid ||
+ a->data->ncols != b->data->ncols)
+ return false;
+
+ for (int i = 0; i < a->data->ncols; i++)
+ {
+ if (a->data->colstatus[i] != b->data->colstatus[i])
+ return false;
+
+ if (a->data->colvalues[i].len != b->data->colvalues[i].len)
+ return false;
+
+ if (strcmp(a->data->colvalues[i].data, b->data->colvalues[i].data))
+ return false;
+ }
+
+ elog(DEBUG1, "found conflicting replica identity key");
+
+ return true;
+}
+
+/*
+ * Free resources associated with a replica identity key.
+ */
+static void
+free_replica_identity_key(ReplicaIdentityKey *key)
+{
+ Assert(key);
+
+ pfree(key->data->colvalues);
+ pfree(key->data->colstatus);
+ pfree(key->data);
+ pfree(key);
+}
+
+/*
+ * Clean up hash table entries associated with the given transaction IDs.
+ */
+static void
+cleanup_replica_identity_table(List *committed_xids)
+{
+ replica_identity_iterator i;
+ ReplicaIdentityEntry *rientry;
+
+ if (!committed_xids)
+ return;
+
+ replica_identity_start_iterate(replica_identity_table, &i);
+ while ((rientry = replica_identity_iterate(replica_identity_table, &i)) != NULL)
+ {
+ if (!list_member_xid(committed_xids, rientry->remote_xid))
+ continue;
+
+ /* Clean up the hash entry for committed transaction */
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+ }
+}
+
+/*
+ * Check committed transactions and clean up corresponding entries in the hash
+ * table.
+ */
+static void
+cleanup_committed_replica_identity_entries(void)
+{
+ dlist_mutable_iter iter;
+ List *committed_xids = NIL;
+
+ dlist_foreach_modify(iter, &lsn_mapping)
+ {
+ FlushPosition *pos =
+ dlist_container(FlushPosition, node, iter.cur);
+ bool skipped_write;
+
+ if (!TransactionIdIsValid(pos->pa_remote_xid) ||
+ !XLogRecPtrIsInvalid(pos->local_end))
+ continue;
+
+ pos->local_end = pa_get_last_commit_end(pos->pa_remote_xid, true,
+ &skipped_write);
+
+ elog(DEBUG1,
+ "got commit end from parallel apply worker, "
+ "txn: %u, remote_end %X/%X, local_end %X/%X",
+ pos->pa_remote_xid, LSN_FORMAT_ARGS(pos->remote_end),
+ LSN_FORMAT_ARGS(pos->local_end));
+
+ if (!skipped_write && XLogRecPtrIsInvalid(pos->local_end))
+ continue;
+
+ committed_xids = lappend_xid(committed_xids, pos->pa_remote_xid);
+ }
+
+ /* Clean up the entries for committed transactions */
+ cleanup_replica_identity_table(committed_xids);
+}
+
+/*
+ * Append a transaction dependency, excluding duplicates and committed
+ * transactions.
+ */
+static List *
+check_and_append_xid_dependency(List *depends_on_xids,
+ TransactionId *depends_on_xid,
+ TransactionId current_xid)
+{
+ Assert(depends_on_xid);
+
+ if (!TransactionIdIsValid(*depends_on_xid))
+ return depends_on_xids;
+
+ if (TransactionIdEquals(*depends_on_xid, current_xid))
+ return depends_on_xids;
+
+ if (list_member_xid(depends_on_xids, *depends_on_xid))
+ return depends_on_xids;
+
+ /*
+ * Reset the xid and skip adding it if the transaction has committed.
+ */
+ if (pa_transaction_committed(*depends_on_xid))
+ {
+ *depends_on_xid = InvalidTransactionId;
+ return depends_on_xids;
+ }
+
+ return lappend_xid(depends_on_xids, *depends_on_xid);
+}
+
+/*
+ * Check for dependencies on preceding transactions that modify the same key.
+ * Returns the dependent transactions in 'depends_on_xids' and records the
+ * current change.
+ */
+static void
+check_dependency_on_replica_identity(Oid relid,
+ LogicalRepTupleData *original_data,
+ TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ LogicalRepRelMapEntry *relentry;
+ LogicalRepTupleData *ridata;
+ ReplicaIdentityKey *rikey;
+ ReplicaIdentityEntry *rientry;
+ MemoryContext oldctx;
+ int n_ri;
+ bool found = false;
+
+ Assert(depends_on_xids);
+
+ /* Search for existing entry */
+ relentry = logicalrep_get_relentry(relid);
+
+ Assert(relentry);
+
+ /*
+ * First check whether any previous transaction has affected the whole
+ * table, e.g., a truncate or schema change on the publisher.
+ */
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &relentry->last_depended_xid,
+ new_depended_xid);
+
+ n_ri = bms_num_members(relentry->remoterel.attkeys);
+
+ /*
+ * Return if there are no replica identity columns, indicating that the
+ * remote relation has neither a replica identity key nor is marked as
+ * replica identity full.
+ */
+ if (!n_ri)
+ return;
+
+ /* Check if the RI key value of the tuple is invalid */
+ for (int i = 0; i < original_data->ncols; i++)
+ {
+ if (!bms_is_member(i, relentry->remoterel.attkeys))
+ continue;
+
+ /*
+ * Return if the RI key is NULL or explicitly marked unchanged. The key
+ * value can be NULL in the new tuple of an update operation, which
+ * means the RI key was not updated.
+ */
+ if (original_data->colstatus[i] == LOGICALREP_COLUMN_NULL ||
+ original_data->colstatus[i] == LOGICALREP_COLUMN_UNCHANGED)
+ return;
+ }
+
+ oldctx = MemoryContextSwitchTo(ApplyContext);
+
+ /* Allocate space for replica identity values */
+ ridata = palloc0_object(LogicalRepTupleData);
+ ridata->colvalues = palloc0_array(StringInfoData, n_ri);
+ ridata->colstatus = palloc0_array(char, n_ri);
+ ridata->ncols = n_ri;
+
+ for (int i_original = 0, i_ri = 0; i_original < original_data->ncols; i_original++)
+ {
+ StringInfo original_colvalue = &original_data->colvalues[i_original];
+
+ if (!bms_is_member(i_original, relentry->remoterel.attkeys))
+ continue;
+
+ initStringInfoExt(&ridata->colvalues[i_ri], original_colvalue->len + 1);
+ appendStringInfoString(&ridata->colvalues[i_ri], original_colvalue->data);
+ ridata->colstatus[i_ri] = original_data->colstatus[i_original];
+ i_ri++;
+ }
+
+ rikey = palloc0_object(ReplicaIdentityKey);
+ rikey->relid = relid;
+ rikey->data = ridata;
+
+ if (TransactionIdIsValid(new_depended_xid))
+ {
+ rientry = replica_identity_insert(replica_identity_table, rikey,
+ &found);
+
+ /*
+ * Release the key built to search the entry, if the entry already
+ * exists. Otherwise, initialize the remote_xid.
+ */
+ if (found)
+ {
+ elog(DEBUG1, "found conflicting replica identity change from %u",
+ rientry->remote_xid);
+
+ free_replica_identity_key(rikey);
+ }
+ else
+ rientry->remote_xid = InvalidTransactionId;
+ }
+ else
+ {
+ rientry = replica_identity_lookup(replica_identity_table, rikey);
+ free_replica_identity_key(rikey);
+ }
+
+ MemoryContextSwitchTo(oldctx);
+
+ /* Return if no entry found */
+ if (!rientry)
+ return;
+
+ Assert(!found || TransactionIdIsValid(rientry->remote_xid));
+
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &rientry->remote_xid,
+ new_depended_xid);
+
+ /*
+ * Store the new depended-on xid in the entry if it is valid. The xid can
+ * be invalid when the transaction is applied by the leader itself, in
+ * which case all its changes are committed before the next transaction is
+ * processed, so no dependency on it needs to be recorded.
+ */
+ if (TransactionIdIsValid(new_depended_xid))
+ rientry->remote_xid = new_depended_xid;
+
+ /*
+ * Remove the entry if the transaction has been committed and no new
+ * dependency needs to be added.
+ */
+ else if (!TransactionIdIsValid(rientry->remote_xid))
+ {
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+ }
+}
+
+/*
+ * Check for preceding transactions that involve insert, delete, or update
+ * operations on the specified table, and return them in 'depends_on_xids'.
+ */
+static void
+find_all_dependencies_on_rel(LogicalRepRelId relid, TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ replica_identity_iterator i;
+ ReplicaIdentityEntry *rientry;
+
+ Assert(depends_on_xids);
+
+ replica_identity_start_iterate(replica_identity_table, &i);
+ while ((rientry = replica_identity_iterate(replica_identity_table, &i)) != NULL)
+ {
+ Assert(TransactionIdIsValid(rientry->remote_xid));
+
+ if (rientry->keydata->relid != relid)
+ continue;
+
+ /* While at it, clean up hash entries for committed transactions */
+ if (pa_transaction_committed(rientry->remote_xid))
+ {
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+
+ continue;
+ }
+
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &rientry->remote_xid,
+ new_depended_xid);
+ }
+}
+
+/*
+ * Check for any preceding transactions that affect the given table and return
+ * them in 'depends_on_xids'.
+ */
+static void
+check_dependency_on_rel(LogicalRepRelId relid, TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ LogicalRepRelMapEntry *relentry;
+
+ Assert(depends_on_xids);
+
+ find_all_dependencies_on_rel(relid, new_depended_xid, depends_on_xids);
+
+ /* Search for existing entry */
+ relentry = logicalrep_get_relentry(relid);
+
+ /*
+ * The relentry has not been initialized yet, indicating that no change
+ * for this relation has been applied.
+ */
+ if (!relentry)
+ return;
+
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &relentry->last_depended_xid,
+ new_depended_xid);
+
+ if (TransactionIdIsValid(new_depended_xid))
+ relentry->last_depended_xid = new_depended_xid;
+}
+
+/*
+ * Check dependencies related to the current change by determining if the
+ * modification impacts the same row or table as another ongoing transaction. If
+ * needed, instruct parallel apply workers to wait for these preceding
+ * transactions to complete.
+ *
+ * Simultaneously, track the dependency for the current change to ensure that
+ * subsequent transactions address this dependency.
+ */
+static void
+handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
+ TransactionId new_depended_xid,
+ ParallelApplyWorkerInfo *winfo)
+{
+ LogicalRepRelId relid;
+ LogicalRepTupleData oldtup;
+ LogicalRepTupleData newtup;
+ LogicalRepRelation *rel;
+ List *depends_on_xids = NIL;
+ List *remote_relids;
+ bool has_oldtup = false;
+ bool cascade = false;
+ bool restart_seqs = false;
+ StringInfoData dependencies;
+
+ /*
+ * Parse the change data from a local copy instead of consuming the given
+ * remote message directly, as the caller may still need to read data from
+ * it.
+ */
+ StringInfoData change = *s;
+
+ /* Compute dependencies only for non-streaming transactions */
+ if (in_streamed_transaction || (winfo && winfo->stream_txn))
+ return;
+
+ /* Only the leader checks dependencies and schedules the parallel apply */
+ if (!am_leader_apply_worker())
+ return;
+
+ if (!replica_identity_table)
+ replica_identity_table = replica_identity_create(ApplyContext,
+ REPLICA_IDENTITY_INITIAL_SIZE,
+ NULL);
+
+ if (replica_identity_table->members >= REPLICA_IDENTITY_CLEANUP_THRESHOLD)
+ cleanup_committed_replica_identity_entries();
+
+ switch (action)
+ {
+ case LOGICAL_REP_MSG_INSERT:
+ relid = logicalrep_read_insert(&change, &newtup);
+ check_dependency_on_replica_identity(relid, &newtup,
+ new_depended_xid,
+ &depends_on_xids);
+ break;
+
+ case LOGICAL_REP_MSG_UPDATE:
+ relid = logicalrep_read_update(&change, &has_oldtup, &oldtup,
+ &newtup);
+
+ if (has_oldtup)
+ check_dependency_on_replica_identity(relid, &oldtup,
+ new_depended_xid,
+ &depends_on_xids);
+
+ check_dependency_on_replica_identity(relid, &newtup,
+ new_depended_xid,
+ &depends_on_xids);
+ break;
+
+ case LOGICAL_REP_MSG_DELETE:
+ relid = logicalrep_read_delete(&change, &oldtup);
+ check_dependency_on_replica_identity(relid, &oldtup,
+ new_depended_xid,
+ &depends_on_xids);
+ break;
+
+ case LOGICAL_REP_MSG_TRUNCATE:
+ remote_relids = logicalrep_read_truncate(&change, &cascade,
+ &restart_seqs);
+
+ /*
+ * Truncate affects all rows in a table, so the current
+ * transaction should wait for all preceding transactions that
+ * modified the same table.
+ */
+ foreach_int(truncated_relid, remote_relids)
+ check_dependency_on_rel(truncated_relid, new_depended_xid,
+ &depends_on_xids);
+
+ break;
+
+ case LOGICAL_REP_MSG_RELATION:
+ rel = logicalrep_read_rel(&change);
+
+ /*
+ * The replica identity key could be changed, making existing
+ * entries in the replica identity table invalid. In this case, parallel
+ * apply is not allowed on this specific table until all running
+ * transactions that modified it have finished.
+ */
+ check_dependency_on_rel(rel->remoteid, new_depended_xid,
+ &depends_on_xids);
+ break;
+
+ case LOGICAL_REP_MSG_TYPE:
+ case LOGICAL_REP_MSG_MESSAGE:
+
+ /*
+ * Type updates accompany relation updates, so dependencies have
+ * already been checked during relation updates. Logical messages
+ * do not conflict with any changes, so they can be ignored.
+ */
+ break;
+
+ default:
+ Assert(false);
+ break;
+ }
+
+ if (!depends_on_xids)
+ return;
+
+ /*
+ * Record the transactions that the current transaction depends on in
+ * the shared hash table.
+ */
+ pa_record_dependency_on_transactions(depends_on_xids);
+
+ /*
+ * If the leader applies the transaction itself, immediately wait for
+ * the transactions that the current transaction depends on.
+ */
+ if (winfo == NULL)
+ {
+ foreach_xid(xid, depends_on_xids)
+ pa_wait_for_depended_transaction(xid);
+
+ return;
+ }
+
+ initStringInfo(&dependencies);
+
+ /* Build the dependency message to send to the parallel apply worker */
+ write_internal_dependencies(&dependencies, depends_on_xids);
+
+ (void) send_internal_dependencies(winfo, &dependencies);
+}
+
+/*
+ * Write internal dependency information to the output for the parallel apply
+ * worker.
+ */
+static void
+write_internal_dependencies(StringInfo s, List *depends_on_xids)
+{
+ pq_sendbyte(s, PARALLEL_APPLY_INTERNAL_MESSAGE);
+ pq_sendbyte(s, LOGICAL_REP_MSG_INTERNAL_DEPENDENCY);
+ pq_sendint32(s, list_length(depends_on_xids));
+
+ foreach_xid(xid, depends_on_xids)
+ pq_sendint32(s, xid);
+}
+
/*
* Handle internal dependency information.
*
@@ -826,7 +1410,10 @@ handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
/* not in streaming mode */
if (apply_action == TRANS_LEADER_APPLY)
+ {
+ handle_dependency_on_change(action, s, InvalidTransactionId, winfo);
return false;
+ }
Assert(TransactionIdIsValid(stream_xid));
@@ -1268,6 +1855,33 @@ apply_handle_begin(StringInfo s)
pgstat_report_activity(STATE_RUNNING, NULL);
}
+/*
+ * Send an INTERNAL_DEPENDENCY message to a parallel apply worker.
+ *
+ * Returns false if we switched to the serialize mode to send the message,
+ * true otherwise.
+ */
+static bool
+send_internal_dependencies(ParallelApplyWorkerInfo *winfo, StringInfo s)
+{
+ Assert(s->data[0] == PARALLEL_APPLY_INTERNAL_MESSAGE);
+ Assert(s->data[1] == LOGICAL_REP_MSG_INTERNAL_DEPENDENCY);
+
+ if (!winfo->serialize_changes)
+ {
+ if (pa_send_data(winfo, s->len, s->data))
+ return true;
+
+ pa_switch_to_partial_serialize(winfo, true);
+ }
+
+ /* Skip writing the first internal message flag */
+ s->cursor++;
+ stream_write_change(LOGICAL_REP_MSG_INTERNAL_DEPENDENCY, s);
+
+ return false;
+}
+
/*
* Handle COMMIT message.
*
@@ -1795,7 +2409,7 @@ apply_handle_stream_start(StringInfo s)
/* Try to allocate a worker for the streaming transaction. */
if (first_segment)
- pa_allocate_worker(stream_xid);
+ pa_allocate_worker(stream_xid, true);
apply_action = get_transaction_apply_action(stream_xid, &winfo);
diff --git a/src/include/replication/logicalrelation.h b/src/include/replication/logicalrelation.h
index 7a561a8e8d8..4b321bd2ad2 100644
--- a/src/include/replication/logicalrelation.h
+++ b/src/include/replication/logicalrelation.h
@@ -37,6 +37,8 @@ typedef struct LogicalRepRelMapEntry
/* Sync state. */
char state;
XLogRecPtr statelsn;
+
+ TransactionId last_depended_xid;
} LogicalRepRelMapEntry;
extern void logicalrep_relmap_update(LogicalRepRelation *remoterel);
@@ -50,5 +52,6 @@ extern void logicalrep_rel_close(LogicalRepRelMapEntry *rel,
LOCKMODE lockmode);
extern bool IsIndexUsableForReplicaIdentityFull(Relation idxrel, AttrMap *attrmap);
extern Oid GetRelationIdentityOrPK(Relation rel);
+extern LogicalRepRelMapEntry *logicalrep_get_relentry(LogicalRepRelId remoteid);
#endif /* LOGICALRELATION_H */
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index ddcdcc05053..78b5667cebe 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -235,6 +235,8 @@ typedef struct ParallelApplyWorkerInfo
*/
bool in_use;
+ bool stream_txn;
+
ParallelApplyWorkerShared *shared;
} ParallelApplyWorkerInfo;
@@ -332,8 +334,10 @@ extern void apply_error_callback(void *arg);
extern void set_apply_error_context_origin(char *originname);
/* Parallel apply worker setup and interactions */
-extern void pa_allocate_worker(TransactionId xid);
+extern void pa_allocate_worker(TransactionId xid, bool stream_txn);
extern ParallelApplyWorkerInfo *pa_find_worker(TransactionId xid);
+extern XLogRecPtr pa_get_last_commit_end(TransactionId xid, bool delete_entry,
+ bool *skipped_write);
extern void pa_detach_all_error_mq(void);
extern bool pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes,
@@ -362,6 +366,8 @@ extern void pa_decr_and_wait_stream_block(void);
extern void pa_xact_finish(ParallelApplyWorkerInfo *winfo,
XLogRecPtr remote_lsn);
+extern bool pa_transaction_committed(TransactionId xid);
+extern void pa_record_dependency_on_transactions(List *depends_on_xids);
extern void pa_wait_for_depended_transaction(TransactionId xid);
--
2.47.3
Attachment: v5-0004-Parallel-apply-non-streaming-transactions.patch
From c980bbec07c700a0dea1faabe7ed8e7373178947 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Mon, 1 Dec 2025 12:28:29 +0900
Subject: [PATCH v5 4/8] Parallel apply non-streaming transactions
--
Basic design
--
The leader worker assigns each non-streaming transaction to a parallel apply
worker. Before dispatching changes to a parallel worker, the leader verifies if
the current modification affects the same row (identified by replica identity
key) as another ongoing transaction. If so, the leader sends a list of dependent
transaction IDs to the parallel worker, indicating that the parallel apply
worker must wait for these transactions to commit before proceeding.
Each parallel apply worker records the local end LSN of the transaction it
applies in shared memory. Subsequently, the leader gathers these local end LSNs
and logs them in the local 'lsn_mapping' for verifying whether they have been
flushed to disk (following the logic in get_flush_position()).
If no parallel apply worker is available, the leader will apply the transaction
independently.
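As an illustration with hypothetical LSNs: if Txn10 (remote end 0/1500) is
assigned to worker A and Txn11 (remote end 0/1600) to worker B, the leader
pushes both remote ends onto 'lsn_mapping' with an invalid local end, later
fills in each local end from the worker's shared last_commit_end, and reports
0/1600 as flushed to the publisher only after both local ends have themselves
been flushed to disk.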
For further details, please refer to the following:
--
dependency tracking
--
The leader maintains a local hash table, using the remote change's replica
identity column values and relid as keys, with remote transaction IDs as values.
Before sending changes to the parallel apply worker, the leader computes a hash
using RI key values and the relid of the current change to search the hash
table. If an existing entry is found, the leader tells the parallel worker to
wait for the xid recorded in that entry, and then updates the entry with the
incoming remote xid.
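As a concrete illustration (hypothetical table and xids), assume table t has
PRIMARY KEY pk as its replica identity:
Txn10: UPDATE t SET val = 2 WHERE pk = 1  -- leader records (t, pk=1) -> Txn10
Txn11: DELETE FROM t WHERE pk = 1         -- lookup finds Txn10, so the worker
                                             applying Txn11 waits for Txn10
The entry is then updated to (t, pk=1) -> Txn11, so a later change to the same
row waits for Txn11 instead.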
If the remote relation lacks a replica identity (RI), it indicates that only
INSERT can be replicated for this table. In such cases, the leader skips
dependency checks, allowing the parallel apply worker to proceed with applying
changes without delay. This is because the only potential conflicts would
involve a local unique key or foreign key, whose handling is yet to be
implemented (see TODO - dependency on local unique key, foreign key).
In cases of TRUNCATE or remote schema changes affecting the entire table, the
leader retrieves all remote xids touching the same table (via sequential scans
of the hash table) and tells the parallel worker to wait for those transactions
to commit.
Hash entries are cleaned up once the transaction corresponding to the remote xid
in the entry has been committed. Clean-up typically occurs when collecting the
flush position of each transaction, but is forced if the hash table exceeds a
set threshold.
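(In the current patch, the forced clean-up kicks in once the local hash table
holds REPLICA_IDENTITY_CLEANUP_THRESHOLD (1024) entries; see
cleanup_committed_replica_identity_entries().)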
--
dependency waiting
--
If a transaction is relied upon by others, the leader adds its xid to a shared
hash table. The shared hash table entry is cleared by the parallel apply worker
upon completing the transaction. Workers needing to wait for a transaction check
the shared hash table entry; if present, they lock the transaction ID (using
pa_lock_transaction). If absent, it indicates the transaction has been
committed, negating the need to wait.
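In code terms, the wait reduces to a lock dance on the transaction ID. Below
is a simplified sketch of pa_wait_for_transaction() added by this patch;
parallel workers do the equivalent against the shared hash table in
pa_wait_for_depended_transaction():
  for (;;)
  {
      if (pa_transaction_committed(wait_for_xid))
          break;
      /* Blocks until the applying worker releases the lock at commit */
      pa_lock_transaction(wait_for_xid, AccessShareLock);
      pa_unlock_transaction(wait_for_xid, AccessShareLock);
      CHECK_FOR_INTERRUPTS();
  }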
--
commit order
--
There is a case where tables have no foreign or primary keys, and integrity is
maintained at the application layer. In this case, the above RI mechanism
cannot detect any dependencies. For safety reasons, parallel apply workers
preserve the commit ordering done on the publisher side. This is done by the
leader worker caching the most recently dispatched transaction ID and adding a
dependency between it and the one currently being dispatched.
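For instance, if the leader dispatches Txn10, Txn11, and Txn12 in publisher
commit order, it caches each xid in last_remote_xid after dispatching its
COMMIT, so Txn11 is made to wait on Txn10 and Txn12 on Txn11; the workers thus
commit in the original order even when no row-level dependency exists.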
--
TODO - dependency on local unique key, foreign key.
--
A transaction could conflict with another if modifying the same unique key.
While current patches don't address conflicts involving unique or foreign keys,
tracking these dependencies might be needed.
--
TODO - user defined trigger and constraints.
--
It would be challenging to check dependencies if the table has user-defined
triggers or constraints. The most viable solution might be to disallow parallel
apply for relations whose triggers and constraints are not marked as
parallel-safe or immutable.
---
.../replication/logical/applyparallelworker.c | 339 ++++++++++++++++--
src/backend/replication/logical/proto.c | 38 ++
src/backend/replication/logical/relation.c | 31 ++
src/backend/replication/logical/worker.c | 303 ++++++++++++++--
src/include/replication/logicalproto.h | 2 +
src/include/replication/logicalrelation.h | 2 +
src/include/replication/worker_internal.h | 11 +-
src/test/subscription/meson.build | 1 +
src/test/subscription/t/001_rep_changes.pl | 2 +
src/test/subscription/t/010_truncate.pl | 2 +-
src/test/subscription/t/015_stream.pl | 8 +-
src/test/subscription/t/026_stats.pl | 1 +
src/test/subscription/t/027_nosuperuser.pl | 1 +
src/test/subscription/t/050_parallel_apply.pl | 130 +++++++
src/tools/pgindent/typedefs.list | 4 +
15 files changed, 801 insertions(+), 74 deletions(-)
create mode 100644 src/test/subscription/t/050_parallel_apply.pl
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index cf08206d9fd..5b6267c6047 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -14,6 +14,9 @@
* ParallelApplyWorkerInfo which is required so the leader worker and parallel
* apply workers can communicate with each other.
*
+ * Streaming transactions
+ * ======================
+ *
* The parallel apply workers are assigned (if available) as soon as xact's
* first stream is received for subscriptions that have set their 'streaming'
* option as parallel. The leader apply worker will send changes to this new
@@ -152,6 +155,33 @@
* session-level locks because both locks could be acquired outside the
* transaction, and the stream lock in the leader needs to persist across
* transaction boundaries i.e. until the end of the streaming transaction.
+ *
+ * Non-streaming transactions
+ * ==========================
+ * The handling is similar to that of streaming transactions, with a few
+ * differences:
+ *
+ * Transaction dependency
+ * ----------------------
+ * Before dispatching changes to a parallel worker, the leader verifies if the
+ * current modification affects the same row (identified by replica identity
+ * key) as another ongoing transaction (see handle_dependency_on_change for
+ * details). If so, the leader sends a list of dependent transaction IDs to the
+ * parallel worker, indicating that the parallel apply worker must wait for
+ * these transactions to commit before proceeding.
+ *
+ * Commit order
+ * ------------
+ * There is a case where columns have no foreign or primary keys, and integrity
+ * is maintained at the application layer. In this case, the above RI mechanism
+ * cannot detect any dependencies. For safety reasons, parallel apply workers
+ * preserve the commit ordering done on the publisher side. This is done by the
+ * leader worker caching the most recently dispatched transaction ID and
+ * adding a dependency between it and the one currently being dispatched.
+ * In the future, parallel apply could be extended to allow out-of-order
+ * commits: at a minimum, a new mechanism to track replication progress under
+ * out-of-order commits would be required, after which we could stop caching
+ * the transaction ID and adding this dependency.
*-------------------------------------------------------------------------
*/
@@ -283,6 +313,7 @@ static ParallelTransState pa_get_xact_state(ParallelApplyWorkerShared *wshared);
static PartialFileSetState pa_get_fileset_state(void);
static void pa_attach_parallelized_txn_hash(dsa_handle *pa_dsa_handle,
dshash_table_handle *pa_dshash_handle);
+static void write_internal_relation(StringInfo s, LogicalRepRelation *rel);
/*
* Returns true if it is OK to start a parallel apply worker, false otherwise.
@@ -400,6 +431,7 @@ pa_setup_dsm(ParallelApplyWorkerInfo *winfo)
shared = shm_toc_allocate(toc, sizeof(ParallelApplyWorkerShared));
SpinLockInit(&shared->mutex);
+ shared->xid = InvalidTransactionId;
shared->xact_state = PARALLEL_TRANS_UNKNOWN;
pg_atomic_init_u32(&(shared->pending_stream_count), 0);
shared->last_commit_end = InvalidXLogRecPtr;
@@ -443,6 +475,8 @@ pa_launch_parallel_worker(void)
MemoryContext oldcontext;
bool launched;
ParallelApplyWorkerInfo *winfo;
+ dsa_handle pa_dsa_handle;
+ dshash_table_handle pa_dshash_handle;
ListCell *lc;
/* Try to get an available parallel apply worker from the worker pool. */
@@ -450,10 +484,33 @@ pa_launch_parallel_worker(void)
{
winfo = (ParallelApplyWorkerInfo *) lfirst(lc);
- if (!winfo->in_use)
+ if (!winfo->stream_txn &&
+ pa_get_xact_state(winfo->shared) == PARALLEL_TRANS_FINISHED)
+ {
+ /*
+ * Save the local commit LSN of the last transaction applied by
+ * this worker before reusing it for another transaction. This WAL
+ * position is crucial for determining the flush position in
+ * responses to the publisher (see get_flush_position()).
+ */
+ (void) pa_get_last_commit_end(winfo->shared->xid, false, NULL);
+ return winfo;
+ }
+
+ if (winfo->stream_txn && !winfo->in_use)
return winfo;
}
+ pa_attach_parallelized_txn_hash(&pa_dsa_handle, &pa_dshash_handle);
+
+ /*
+ * Return NULL if the number of parallel apply workers has reached the
+ * maximum limit.
+ */
+ if (list_length(ParallelApplyWorkerPool) ==
+ max_parallel_apply_workers_per_subscription)
+ return NULL;
+
/*
* Start a new parallel apply worker.
*
@@ -481,18 +538,32 @@ pa_launch_parallel_worker(void)
dsm_segment_handle(winfo->dsm_seg),
false);
- if (launched)
- {
- ParallelApplyWorkerPool = lappend(ParallelApplyWorkerPool, winfo);
- }
- else
+ if (!launched)
{
+ MemoryContextSwitchTo(oldcontext);
pa_free_worker_info(winfo);
- winfo = NULL;
+ return NULL;
}
+ ParallelApplyWorkerPool = lappend(ParallelApplyWorkerPool, winfo);
+
MemoryContextSwitchTo(oldcontext);
+ /*
+ * Send all existing remote relation information to the parallel apply
+ * worker. This allows the parallel worker to initialize the
+ * LogicalRepRelMapEntry locally before applying remote changes.
+ */
+ if (logicalrep_get_num_rels())
+ {
+ StringInfoData out;
+
+ initStringInfo(&out);
+
+ write_internal_relation(&out, NULL);
+ pa_send_data(winfo, out.len, out.data);
+ }
+
return winfo;
}
@@ -597,7 +668,8 @@ pa_free_worker(ParallelApplyWorkerInfo *winfo)
{
Assert(!am_parallel_apply_worker());
Assert(winfo->in_use);
- Assert(pa_get_xact_state(winfo->shared) == PARALLEL_TRANS_FINISHED);
+ Assert(!winfo->stream_txn ||
+ pa_get_xact_state(winfo->shared) == PARALLEL_TRANS_FINISHED);
if (!hash_search(ParallelApplyTxnHash, &winfo->shared->xid, HASH_REMOVE, NULL))
elog(ERROR, "hash table corrupted");
@@ -613,9 +685,7 @@ pa_free_worker(ParallelApplyWorkerInfo *winfo)
* been serialized and then letting the parallel apply worker deal with
* the spurious message, we stop the worker.
*/
- if (winfo->serialize_changes ||
- list_length(ParallelApplyWorkerPool) >
- (max_parallel_apply_workers_per_subscription / 2))
+ if (winfo->serialize_changes)
{
logicalrep_pa_worker_stop(winfo);
pa_free_worker_info(winfo);
@@ -812,6 +882,38 @@ pa_get_last_commit_end(TransactionId xid, bool delete_entry, bool *skipped_write
return entry->local_end;
}
+/*
+ * Wait for the remote transaction associated with the specified remote xid to
+ * complete.
+ */
+static void
+pa_wait_for_transaction(TransactionId wait_for_xid)
+{
+ if (!am_leader_apply_worker())
+ return;
+
+ if (!TransactionIdIsValid(wait_for_xid))
+ return;
+
+ elog(DEBUG1, "plan to wait for remote_xid %u to finish",
+ wait_for_xid);
+
+ for (;;)
+ {
+ if (pa_transaction_committed(wait_for_xid))
+ break;
+
+ pa_lock_transaction(wait_for_xid, AccessShareLock);
+ pa_unlock_transaction(wait_for_xid, AccessShareLock);
+
+ /* An interrupt may have occurred while we were waiting. */
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ elog(DEBUG1, "finished wait for remote_xid %u to finish",
+ wait_for_xid);
+}
+
/*
* Interrupt handler for main loop of parallel apply worker.
*/
@@ -887,21 +989,34 @@ LogicalParallelApplyLoop(shm_mq_handle *mqh)
* parallel apply workers can only be PqReplMsg_WALData.
*/
c = pq_getmsgbyte(&s);
- if (c != PqReplMsg_WALData)
- elog(ERROR, "unexpected message \"%c\"", c);
-
- /*
- * Ignore statistics fields that have been updated by the leader
- * apply worker.
- *
- * XXX We can avoid sending the statistics fields from the leader
- * apply worker but for that, it needs to rebuild the entire
- * message by removing these fields which could be more work than
- * simply ignoring these fields in the parallel apply worker.
- */
- s.cursor += SIZE_STATS_MESSAGE;
+ if (c == PqReplMsg_WALData)
+ {
+ /*
+ * Ignore statistics fields that have been updated by the
+ * leader apply worker.
+ *
+ * XXX We can avoid sending the statistics fields from the
+ * leader apply worker but for that, it needs to rebuild the
+ * entire message by removing these fields which could be more
+ * work than simply ignoring these fields in the parallel
+ * apply worker.
+ */
+ s.cursor += SIZE_STATS_MESSAGE;
- apply_dispatch(&s);
+ apply_dispatch(&s);
+ }
+ else if (c == PARALLEL_APPLY_INTERNAL_MESSAGE)
+ {
+ apply_dispatch(&s);
+ }
+ else
+ {
+ /*
+ * The first byte of messages sent from leader apply worker to
+ * parallel apply workers can only be 'w' or 'i'.
+ */
+ elog(ERROR, "unexpected message \"%c\"", c);
+ }
}
else if (shmq_res == SHM_MQ_WOULD_BLOCK)
{
@@ -918,6 +1033,9 @@ LogicalParallelApplyLoop(shm_mq_handle *mqh)
if (rc & WL_LATCH_SET)
ResetLatch(MyLatch);
+
+ if (!IsTransactionState())
+ pgstat_report_stat(true);
}
}
else
@@ -955,6 +1073,9 @@ pa_shutdown(int code, Datum arg)
INVALID_PROC_NUMBER);
dsm_detach((dsm_segment *) DatumGetPointer(arg));
+
+ if (parallel_apply_dsa_area)
+ dsa_detach(parallel_apply_dsa_area);
}
/*
@@ -1267,7 +1388,6 @@ pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes, const void *data)
shm_mq_result result;
TimestampTz startTime = 0;
- Assert(!IsTransactionState());
Assert(!winfo->serialize_changes);
/*
@@ -1319,6 +1439,67 @@ pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes, const void *data)
}
}
+/*
+ * Distribute remote relation information to all active parallel apply workers
+ * that require it.
+ */
+void
+pa_distribute_schema_changes_to_workers(LogicalRepRelation *rel)
+{
+ List *workers_stopped = NIL;
+ StringInfoData out;
+
+ if (!ParallelApplyWorkerPool)
+ return;
+
+ initStringInfo(&out);
+
+ write_internal_relation(&out, rel);
+
+ foreach_ptr(ParallelApplyWorkerInfo, winfo, ParallelApplyWorkerPool)
+ {
+ /*
+ * Skip the worker responsible for the current transaction, as the
+ * relation information has already been sent to it.
+ */
+ if (winfo == stream_apply_worker)
+ continue;
+
+ /*
+ * Skip workers that are in serialize mode, as they will be stopped
+ * soon after they finish applying their transaction.
+ */
+ if (winfo->serialize_changes)
+ continue;
+
+ elog(DEBUG1, "distributing schema changes to pa workers");
+
+ if (pa_send_data(winfo, out.len, out.data))
+ continue;
+
+ elog(DEBUG1, "failed to distribute, will stop that worker instead");
+
+ /*
+ * Distribution to this worker failed due to a sending timeout. Wait
+ * for the worker to complete its transaction and then stop it. This
+ * is consistent with the handling of workers in serialize mode (see
+ * pa_free_worker() for details).
+ */
+ pa_wait_for_transaction(winfo->shared->xid);
+
+ pa_get_last_commit_end(winfo->shared->xid, false, NULL);
+
+ logicalrep_pa_worker_stop(winfo);
+
+ workers_stopped = lappend(workers_stopped, winfo);
+ }
+
+ pfree(out.data);
+
+ foreach_ptr(ParallelApplyWorkerInfo, winfo, workers_stopped)
+ pa_free_worker_info(winfo);
+}
+
/*
* Switch to PARTIAL_SERIALIZE mode for the current transaction -- this means
* that the current data and any subsequent data for this transaction will be
@@ -1401,8 +1582,8 @@ pa_wait_for_xact_finish(ParallelApplyWorkerInfo *winfo)
/*
* Wait for the transaction lock to be released. This is required to
- * detect deadlock among leader and parallel apply workers. Refer to the
- * comments atop this file.
+ * detect deadlock among leader and parallel apply workers. Refer
+ * to the comments atop this file.
*/
pa_lock_transaction(winfo->shared->xid, AccessShareLock);
pa_unlock_transaction(winfo->shared->xid, AccessShareLock);
@@ -1479,6 +1660,9 @@ pa_savepoint_name(Oid suboid, TransactionId xid, char *spname, Size szsp)
void
pa_start_subtrans(TransactionId current_xid, TransactionId top_xid)
{
+ if (!TransactionIdIsValid(top_xid))
+ return;
+
if (current_xid != top_xid &&
!list_member_xid(subxactlist, current_xid))
{
@@ -1735,25 +1919,41 @@ pa_decr_and_wait_stream_block(void)
void
pa_xact_finish(ParallelApplyWorkerInfo *winfo, XLogRecPtr remote_lsn)
{
+ XLogRecPtr local_lsn = InvalidXLogRecPtr;
+ TransactionId pa_remote_xid = winfo->shared->xid;
+
Assert(am_leader_apply_worker());
/*
- * Unlock the shared object lock so that parallel apply worker can
- * continue to receive and apply changes.
+ * Unlock the shared object lock taken for streaming transactions so that
+ * parallel apply worker can continue to receive and apply changes.
*/
- pa_unlock_stream(winfo->shared->xid, AccessExclusiveLock);
+ if (winfo->stream_txn)
+ pa_unlock_stream(winfo->shared->xid, AccessExclusiveLock);
/*
- * Wait for that worker to finish. This is necessary to maintain commit
- * order which avoids failures due to transaction dependencies and
- * deadlocks.
+ * For a streaming transaction, wait for that worker to finish. This is
+ * necessary to maintain commit order, which avoids failures due to
+ * transaction dependencies and deadlocks.
+ *
+ * For a non-streaming transaction in partial serialize mode, wait for
+ * the worker to stop as well, since it cannot be reused anymore (see
+ * pa_free_worker() for details).
*/
- pa_wait_for_xact_finish(winfo);
+ if (winfo->serialize_changes || winfo->stream_txn)
+ {
+ pa_wait_for_xact_finish(winfo);
+
+ local_lsn = winfo->shared->last_commit_end;
+ pa_remote_xid = InvalidTransactionId;
+
+ pa_free_worker(winfo);
+ }
if (XLogRecPtrIsValid(remote_lsn))
- store_flush_position(remote_lsn, winfo->shared->last_commit_end);
+ store_flush_position(remote_lsn, local_lsn, pa_remote_xid);
- pa_free_worker(winfo);
+ pa_set_stream_apply_worker(NULL);
}
bool
@@ -1852,6 +2052,22 @@ pa_record_dependency_on_transactions(List *depends_on_xids)
}
}
+/*
+ * Mark the transaction state as finished and remove the shared hash entry.
+ */
+void
+pa_commit_transaction(void)
+{
+ TransactionId xid = MyParallelShared->xid;
+
+ SpinLockAcquire(&MyParallelShared->mutex);
+ MyParallelShared->xact_state = PARALLEL_TRANS_FINISHED;
+ SpinLockRelease(&MyParallelShared->mutex);
+
+ dshash_delete_key(parallelized_txns, &xid);
+ elog(DEBUG1, "depended xid %u committed", xid);
+}
+
/*
* Wait for the given transaction to finish.
*/
@@ -1860,6 +2076,13 @@ pa_wait_for_depended_transaction(TransactionId xid)
{
elog(DEBUG1, "wait for depended xid %u", xid);
+ /*
+ * Quick exit if parallelized_txns has not been initialized yet. This can
+ * happen when this function is called by the leader worker.
+ */
+ if (!parallelized_txns)
+ return;
+
for (;;)
{
ParallelizedTxnEntry *txn_entry;
@@ -1880,3 +2103,45 @@ pa_wait_for_depended_transaction(TransactionId xid)
elog(DEBUG1, "finish waiting for depended xid %u", xid);
}
+
+/*
+ * Write internal relation description to the output stream.
+ */
+static void
+write_internal_relation(StringInfo s, LogicalRepRelation *rel)
+{
+ pq_sendbyte(s, PARALLEL_APPLY_INTERNAL_MESSAGE);
+ pq_sendbyte(s, LOGICAL_REP_MSG_INTERNAL_RELATION);
+
+ if (rel)
+ {
+ pq_sendint(s, 1, 4);
+ logicalrep_write_internal_rel(s, rel);
+ }
+ else
+ {
+ pq_sendint(s, logicalrep_get_num_rels(), 4);
+ logicalrep_write_all_rels(s);
+ }
+}
+
+/*
+ * Register a transaction to the shared hash table.
+ *
+ * This function is intended to be called during the commit phase of
+ * non-streamed transactions. Workers that depend on this transaction wait
+ * until the entry is removed at commit time (see pa_commit_transaction()).
+ */
+void
+pa_add_parallelized_transaction(TransactionId xid)
+{
+ bool found;
+ ParallelizedTxnEntry *txn_entry;
+
+ Assert(parallelized_txns);
+ Assert(TransactionIdIsValid(xid));
+
+ txn_entry = dshash_find_or_insert(parallelized_txns, &xid, &found);
+
+ dshash_release_lock(parallelized_txns, txn_entry);
+}
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index ded46c49a83..96b6a74055e 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -691,6 +691,44 @@ logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel,
logicalrep_write_attrs(out, rel, columns, include_gencols_type);
}
+/*
+ * Write internal relation description to the output stream.
+ */
+void
+logicalrep_write_internal_rel(StringInfo out, LogicalRepRelation *rel)
+{
+ pq_sendint32(out, rel->remoteid);
+
+ /* Write relation name */
+ pq_sendstring(out, rel->nspname);
+ pq_sendstring(out, rel->relname);
+
+ /* Write the replica identity. */
+ pq_sendbyte(out, rel->replident);
+
+ /* Write attribute description */
+ pq_sendint16(out, rel->natts);
+
+ for (int i = 0; i < rel->natts; i++)
+ {
+ uint8 flags = 0;
+
+ if (bms_is_member(i, rel->attkeys))
+ flags |= LOGICALREP_IS_REPLICA_IDENTITY;
+
+ pq_sendbyte(out, flags);
+
+ /* attribute name */
+ pq_sendstring(out, rel->attnames[i]);
+
+ /* attribute type id */
+ pq_sendint32(out, rel->atttyps[i]);
+
+ /* ignore attribute mode for now */
+ pq_sendint32(out, 0);
+ }
+}
+
/*
* Read the relation info from stream and return as LogicalRepRelation.
*/
diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index 13f8cb74e9f..9991bfe76cc 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -960,6 +960,37 @@ FindLogicalRepLocalIndex(Relation localrel, LogicalRepRelation *remoterel,
return InvalidOid;
}
+/*
+ * Get the number of entries in the LogicalRepRelMap.
+ */
+int
+logicalrep_get_num_rels(void)
+{
+ if (LogicalRepRelMap == NULL)
+ return 0;
+
+ return hash_get_num_entries(LogicalRepRelMap);
+}
+
+/*
+ * Write all the remote relation information from the LogicalRepRelMapEntry to
+ * the output stream.
+ */
+void
+logicalrep_write_all_rels(StringInfo out)
+{
+ LogicalRepRelMapEntry *entry;
+ HASH_SEQ_STATUS status;
+
+ if (LogicalRepRelMap == NULL)
+ return;
+
+ hash_seq_init(&status, LogicalRepRelMap);
+
+ while ((entry = (LogicalRepRelMapEntry *) hash_seq_search(&status)) != NULL)
+ logicalrep_write_internal_rel(out, &entry->remoterel);
+}
+
/*
* Get the LogicalRepRelMapEntry corresponding to the given relid without
* opening the local relation.
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 4c154363277..7790c2d8457 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -286,6 +286,7 @@
#include "tcop/tcopprot.h"
#include "utils/acl.h"
#include "utils/guc.h"
+#include "utils/injection_point.h"
#include "utils/inval.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -484,6 +485,8 @@ static List *on_commit_wakeup_workers_subids = NIL;
bool in_remote_transaction = false;
static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
+static TransactionId remote_xid = InvalidTransactionId;
+static TransactionId last_remote_xid = InvalidTransactionId;
/* fields valid only when processing streamed transaction */
static bool in_streamed_transaction = false;
@@ -602,11 +605,7 @@ static inline void cleanup_subxact_info(void);
/*
* Serialize and deserialize changes for a toplevel transaction.
*/
-static void stream_open_file(Oid subid, TransactionId xid,
- bool first_segment);
static void stream_write_change(char action, StringInfo s);
-static void stream_open_and_write_change(TransactionId xid, char action, StringInfo s);
-static void stream_close_file(void);
static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
@@ -676,6 +675,8 @@ static void replorigin_reset(int code, Datum arg);
static bool send_internal_dependencies(ParallelApplyWorkerInfo *winfo,
StringInfo s);
+static bool build_dependency_with_last_committed_txn(ParallelApplyWorkerInfo *winfo);
+
/*
* Compute the hash value for entries in the replica_identity_table.
*/
@@ -1406,7 +1407,11 @@ handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
TransApplyAction apply_action;
StringInfoData original_msg;
- apply_action = get_transaction_apply_action(stream_xid, &winfo);
+ Assert(!in_streamed_transaction || TransactionIdIsValid(stream_xid));
+
+ apply_action = get_transaction_apply_action(in_streamed_transaction
+ ? stream_xid : remote_xid,
+ &winfo);
/* not in streaming mode */
if (apply_action == TRANS_LEADER_APPLY)
@@ -1415,8 +1420,6 @@ handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
return false;
}
- Assert(TransactionIdIsValid(stream_xid));
-
/*
* The parallel apply worker needs the xid in this message to decide
* whether to define a savepoint, so save the original message that has
@@ -1427,15 +1430,28 @@ handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
/*
* We should have received XID of the subxact as the first part of the
- * message, so extract it.
+ * message in streaming transactions, so extract it.
*/
- current_xid = pq_getmsgint(s, 4);
+ if (in_streamed_transaction)
+ current_xid = pq_getmsgint(s, 4);
+ else
+ current_xid = remote_xid;
if (!TransactionIdIsValid(current_xid))
ereport(ERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
errmsg_internal("invalid transaction ID in streamed replication transaction")));
+ handle_dependency_on_change(action, s, current_xid, winfo);
+
+ /*
+ * Re-fetch the latest apply action as it might have been changed during
+ * dependency check.
+ */
+ apply_action = get_transaction_apply_action(in_streamed_transaction
+ ? stream_xid : remote_xid,
+ &winfo);
+
switch (apply_action)
{
case TRANS_LEADER_SERIALIZE:
@@ -1839,17 +1855,71 @@ static void
apply_handle_begin(StringInfo s)
{
LogicalRepBeginData begin_data;
+ ParallelApplyWorkerInfo *winfo;
+ TransApplyAction apply_action;
+
+ /* Save the message before it is consumed. */
+ StringInfoData original_msg = *s;
/* There must not be an active streaming transaction. */
Assert(!TransactionIdIsValid(stream_xid));
logicalrep_read_begin(s, &begin_data);
- set_apply_error_context_xact(begin_data.xid, begin_data.final_lsn);
+
+ remote_xid = begin_data.xid;
+
+ set_apply_error_context_xact(remote_xid, begin_data.final_lsn);
remote_final_lsn = begin_data.final_lsn;
maybe_start_skipping_changes(begin_data.final_lsn);
+ pa_allocate_worker(remote_xid, false);
+
+ apply_action = get_transaction_apply_action(remote_xid, &winfo);
+
+ elog(DEBUG1, "new remote_xid %u", remote_xid);
+ switch (apply_action)
+ {
+ case TRANS_LEADER_APPLY:
+ break;
+
+ case TRANS_LEADER_SEND_TO_PARALLEL:
+ Assert(winfo);
+
+ if (pa_send_data(winfo, s->len, s->data))
+ {
+ pa_set_stream_apply_worker(winfo);
+ break;
+ }
+
+ /*
+ * Switch to serialize mode when we are not able to send the
+ * change to parallel apply worker.
+ */
+ pa_switch_to_partial_serialize(winfo, true);
+
+ /* fall through */
+ case TRANS_LEADER_PARTIAL_SERIALIZE:
+ Assert(winfo);
+
+ stream_write_change(LOGICAL_REP_MSG_BEGIN, &original_msg);
+
+ /* Cache the parallel apply worker for this transaction. */
+ pa_set_stream_apply_worker(winfo);
+ break;
+
+ case TRANS_PARALLEL_APPLY:
+ /* Hold the lock until the end of the transaction. */
+ pa_lock_transaction(MyParallelShared->xid, AccessExclusiveLock);
+ pa_set_xact_state(MyParallelShared, PARALLEL_TRANS_STARTED);
+ break;
+
+ default:
+ elog(ERROR, "unexpected apply action: %d", (int) apply_action);
+ break;
+ }
+
in_remote_transaction = true;
pgstat_report_activity(STATE_RUNNING, NULL);
@@ -1882,6 +1952,37 @@ send_internal_dependencies(ParallelApplyWorkerInfo *winfo, StringInfo s)
return false;
}
+/*
+ * Make a dependency between this and the last committed remote transaction.
+ *
+ * This function ensures that the commit ordering handled by parallel apply
+ * workers is preserved. Returns false if we switched to serialize mode to
+ * send the message, true otherwise.
+ */
+static bool
+build_dependency_with_last_committed_txn(ParallelApplyWorkerInfo *winfo)
+{
+ StringInfoData dependency_msg;
+ bool ret;
+
+ /* Skip if no transaction has been applied yet */
+ if (!TransactionIdIsValid(last_remote_xid))
+ return true;
+
+ /* Build the dependency message to send to the parallel apply worker */
+ initStringInfo(&dependency_msg);
+
+ pq_sendbyte(&dependency_msg, PARALLEL_APPLY_INTERNAL_MESSAGE);
+ pq_sendbyte(&dependency_msg, LOGICAL_REP_MSG_INTERNAL_DEPENDENCY);
+ pq_sendint32(&dependency_msg, 1);
+ pq_sendint32(&dependency_msg, last_remote_xid);
+
+ ret = send_internal_dependencies(winfo, &dependency_msg);
+
+ pfree(dependency_msg.data);
+ return ret;
+}
+
/*
* Handle COMMIT message.
*
@@ -1891,6 +1992,11 @@ static void
apply_handle_commit(StringInfo s)
{
LogicalRepCommitData commit_data;
+ ParallelApplyWorkerInfo *winfo;
+ TransApplyAction apply_action;
+
+ /* Save the message before it is consumed. */
+ StringInfoData original_msg = *s;
logicalrep_read_commit(s, &commit_data);
@@ -1901,7 +2007,97 @@ apply_handle_commit(StringInfo s)
LSN_FORMAT_ARGS(commit_data.commit_lsn),
LSN_FORMAT_ARGS(remote_final_lsn))));
- apply_handle_commit_internal(&commit_data);
+ apply_action = get_transaction_apply_action(remote_xid, &winfo);
+
+ switch (apply_action)
+ {
+ case TRANS_LEADER_APPLY:
+ /*
+ * Unlike parallelized transactions, we do not have to register this
+ * transaction in parallelized_txns; the leader applies changes
+ * sequentially, so the commit ordering is always preserved.
+ */
+
+ /* Wait until the last transaction finishes */
+ if (TransactionIdIsValid(last_remote_xid))
+ pa_wait_for_depended_transaction(last_remote_xid);
+
+ apply_handle_commit_internal(&commit_data);
+
+ break;
+
+ case TRANS_LEADER_SEND_TO_PARALLEL:
+ Assert(winfo);
+
+ /*
+ * Mark this transaction as parallelized. This ensures that
+ * upcoming transactions wait until this transaction is committed.
+ */
+ pa_add_parallelized_transaction(remote_xid);
+
+ /*
+ * Build a dependency between this transaction and the last committed
+ * transaction to preserve the commit order. Then try to send the
+ * COMMIT message if that succeeded.
+ */
+ if (build_dependency_with_last_committed_txn(winfo) &&
+ pa_send_data(winfo, s->len, s->data))
+ {
+ /* Finish processing the transaction. */
+ pa_xact_finish(winfo, commit_data.end_lsn);
+ break;
+ }
+
+ /*
+ * Switch to serialize mode when we are not able to send the
+ * change to parallel apply worker.
+ */
+ pa_switch_to_partial_serialize(winfo, true);
+ /* fall through */
+ case TRANS_LEADER_PARTIAL_SERIALIZE:
+ Assert(winfo);
+
+ stream_open_and_write_change(remote_xid, LOGICAL_REP_MSG_COMMIT,
+ &original_msg);
+
+ pa_set_fileset_state(winfo->shared, FS_SERIALIZE_DONE);
+
+ /* Finish processing the transaction. */
+ pa_xact_finish(winfo, commit_data.end_lsn);
+ break;
+
+ case TRANS_PARALLEL_APPLY:
+
+ /*
+ * If the parallel apply worker is applying spooled messages then
+ * close the file before committing.
+ */
+ if (stream_fd)
+ stream_close_file();
+
+ INJECTION_POINT("parallel-worker-before-commit", NULL);
+
+ apply_handle_commit_internal(&commit_data);
+
+ MyParallelShared->last_commit_end = XactLastCommitEnd;
+
+ pa_commit_transaction();
+
+ pa_unlock_transaction(remote_xid, AccessExclusiveLock);
+ break;
+
+ default:
+ elog(ERROR, "unexpected apply action: %d", (int) apply_action);
+ break;
+ }
+
+ /* Cache the remote_xid */
+ last_remote_xid = remote_xid;
+
+ remote_xid = InvalidTransactionId;
+ in_remote_transaction = false;
+
+ elog(DEBUG1, "reset remote_xid %u", remote_xid);
/*
* Process any tables that are being synchronized in parallel, as well as
@@ -2024,7 +2220,8 @@ apply_handle_prepare(StringInfo s)
* XactLastCommitEnd, and adding it for this purpose doesn't seems worth
* it.
*/
- store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr);
+ store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr,
+ InvalidTransactionId);
in_remote_transaction = false;
@@ -2084,7 +2281,8 @@ apply_handle_commit_prepared(StringInfo s)
CommitTransactionCommand();
pgstat_report_stat(false);
- store_flush_position(prepare_data.end_lsn, XactLastCommitEnd);
+ store_flush_position(prepare_data.end_lsn, XactLastCommitEnd,
+ InvalidTransactionId);
in_remote_transaction = false;
/*
@@ -2153,7 +2351,8 @@ apply_handle_rollback_prepared(StringInfo s)
* transaction because we always flush the WAL record for it. See
* apply_handle_prepare.
*/
- store_flush_position(rollback_data.rollback_end_lsn, InvalidXLogRecPtr);
+ store_flush_position(rollback_data.rollback_end_lsn, InvalidXLogRecPtr,
+ InvalidTransactionId);
in_remote_transaction = false;
/*
@@ -2215,7 +2414,8 @@ apply_handle_stream_prepare(StringInfo s)
* It is okay not to set the local_end LSN for the prepare because
* we always flush the prepare record. See apply_handle_prepare.
*/
- store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr);
+ store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr,
+ InvalidTransactionId);
in_remote_transaction = false;
@@ -2467,6 +2667,11 @@ apply_handle_stream_start(StringInfo s)
case TRANS_LEADER_PARTIAL_SERIALIZE:
Assert(winfo);
+ /*
+ * TODO: the parallel apply worker could start to wait too soon
+ * when processing an old stream start.
+ */
+
/*
* Open the spool file unless it was already opened when switching
* to serialize mode. The transaction started in
@@ -3194,7 +3399,8 @@ apply_handle_commit_internal(LogicalRepCommitData *commit_data)
pgstat_report_stat(false);
- store_flush_position(commit_data->end_lsn, XactLastCommitEnd);
+ store_flush_position(commit_data->end_lsn, XactLastCommitEnd,
+ InvalidTransactionId);
}
else
{
@@ -3227,6 +3433,9 @@ apply_handle_relation(StringInfo s)
/* Also reset all entries in the partition map that refer to remoterel. */
logicalrep_partmap_reset_relmap(rel);
+
+ if (am_leader_apply_worker())
+ pa_distribute_schema_changes_to_workers(rel);
}
/*
@@ -4001,6 +4210,8 @@ FindDeletedTupleInLocalRel(Relation localrel, Oid localidxoid,
/*
* This handles insert, update, delete on a partitioned table.
+ *
+ * TODO: support parallel apply.
*/
static void
apply_handle_tuple_routing(ApplyExecutionData *edata,
@@ -4551,6 +4762,10 @@ apply_dispatch(StringInfo s)
* check which entries on it are already locally flushed. Those we can report
* as having been flushed.
*
+ * For non-streaming transactions managed by a parallel apply worker, we will
+ * get the local commit end from the shared parallel apply worker info once the
+ * transaction has been committed by the worker.
+ *
* The have_pending_txes is true if there are outstanding transactions that
* need to be flushed.
*/
@@ -4560,6 +4775,7 @@ get_flush_position(XLogRecPtr *write, XLogRecPtr *flush,
{
dlist_mutable_iter iter;
XLogRecPtr local_flush = GetFlushRecPtr(NULL);
+ List *committed_pa_xids = NIL;
*write = InvalidXLogRecPtr;
*flush = InvalidXLogRecPtr;
@@ -4569,6 +4785,36 @@ get_flush_position(XLogRecPtr *write, XLogRecPtr *flush,
FlushPosition *pos =
dlist_container(FlushPosition, node, iter.cur);
+ if (TransactionIdIsValid(pos->pa_remote_xid) &&
+ XLogRecPtrIsInvalid(pos->local_end))
+ {
+ bool skipped_write;
+
+ pos->local_end = pa_get_last_commit_end(pos->pa_remote_xid, true,
+ &skipped_write);
+
+ elog(DEBUG1,
+ "got commit end from parallel apply worker, "
+ "txn: %u, remote_end %X/%X, local_end %X/%X",
+ pos->pa_remote_xid, LSN_FORMAT_ARGS(pos->remote_end),
+ LSN_FORMAT_ARGS(pos->local_end));
+
+ /*
+ * Break the loop if the worker has not finished applying the
+ * transaction. There's no need to check subsequent transactions,
+ * as they must commit after the current transaction being
+ * examined and thus won't have their commit end available yet.
+ */
+ if (!skipped_write && XLogRecPtrIsInvalid(pos->local_end))
+ break;
+
+ committed_pa_xids = lappend_xid(committed_pa_xids, pos->pa_remote_xid);
+ }
+
+ /*
+ * The worker has finished applying, or the transaction was applied by
+ * the leader apply worker itself.
+ */
*write = pos->remote_end;
if (pos->local_end <= local_flush)
@@ -4577,29 +4823,19 @@ get_flush_position(XLogRecPtr *write, XLogRecPtr *flush,
dlist_delete(iter.cur);
pfree(pos);
}
- else
- {
- /*
- * Don't want to uselessly iterate over the rest of the list which
- * could potentially be long. Instead get the last element and
- * grab the write position from there.
- */
- pos = dlist_tail_element(FlushPosition, node,
- &lsn_mapping);
- *write = pos->remote_end;
- *have_pending_txes = true;
- return;
- }
}
*have_pending_txes = !dlist_is_empty(&lsn_mapping);
+
+ cleanup_replica_identity_table(committed_pa_xids);
}
/*
* Store current remote/local lsn pair in the tracking list.
*/
void
-store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn)
+store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn,
+ TransactionId remote_xid)
{
FlushPosition *flushpos;
@@ -4617,6 +4853,7 @@ store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn)
flushpos = palloc_object(FlushPosition);
flushpos->local_end = local_lsn;
flushpos->remote_end = remote_lsn;
+ flushpos->pa_remote_xid = remote_xid;
dlist_push_tail(&lsn_mapping, &flushpos->node);
MemoryContextSwitchTo(ApplyMessageContext);
@@ -6064,7 +6301,7 @@ stream_cleanup_files(Oid subid, TransactionId xid)
* changes for this transaction, create the buffile, otherwise open the
* previously created file.
*/
-static void
+void
stream_open_file(Oid subid, TransactionId xid, bool first_segment)
{
char path[MAXPGPATH];
@@ -6109,7 +6346,7 @@ stream_open_file(Oid subid, TransactionId xid, bool first_segment)
* stream_close_file
* Close the currently open file with streamed changes.
*/
-static void
+void
stream_close_file(void)
{
Assert(stream_fd != NULL);
@@ -6157,7 +6394,7 @@ stream_write_change(char action, StringInfo s)
* target file if not already before writing the message and close the file at
* the end.
*/
-static void
+void
stream_open_and_write_change(TransactionId xid, char action, StringInfo s)
{
Assert(!in_streamed_transaction);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 5d91e2a4287..7d2aaf2d389 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -253,6 +253,8 @@ extern void logicalrep_write_message(StringInfo out, TransactionId xid, XLogRecP
extern void logicalrep_write_rel(StringInfo out, TransactionId xid,
Relation rel, Bitmapset *columns,
PublishGencolsType include_gencols_type);
+extern void logicalrep_write_internal_rel(StringInfo out,
+ LogicalRepRelation *rel);
extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
extern void logicalrep_write_typ(StringInfo out, TransactionId xid,
Oid typoid);
diff --git a/src/include/replication/logicalrelation.h b/src/include/replication/logicalrelation.h
index 4b321bd2ad2..34a7069e9e5 100644
--- a/src/include/replication/logicalrelation.h
+++ b/src/include/replication/logicalrelation.h
@@ -52,6 +52,8 @@ extern void logicalrep_rel_close(LogicalRepRelMapEntry *rel,
LOCKMODE lockmode);
extern bool IsIndexUsableForReplicaIdentityFull(Relation idxrel, AttrMap *attrmap);
extern Oid GetRelationIdentityOrPK(Relation rel);
+extern int logicalrep_get_num_rels(void);
+extern void logicalrep_write_all_rels(StringInfo out);
extern LogicalRepRelMapEntry *logicalrep_get_relentry(LogicalRepRelId remoteid);
#endif /* LOGICALRELATION_H */
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 78b5667cebe..5371ee767f1 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -314,6 +314,10 @@ extern void apply_dispatch(StringInfo s);
extern void maybe_reread_subscription(void);
extern void stream_cleanup_files(Oid subid, TransactionId xid);
+extern void stream_open_file(Oid subid, TransactionId xid, bool first_segment);
+extern void stream_close_file(void);
+extern void stream_open_and_write_change(TransactionId xid, char action,
+ StringInfo s);
extern void set_stream_options(WalRcvStreamOptions *options,
char *slotname,
@@ -327,7 +331,8 @@ extern void SetupApplyOrSyncWorker(int worker_slot);
extern void DisableSubscriptionAndExit(void);
-extern void store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn);
+extern void store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn,
+ TransactionId remote_xid);
/* Function for apply error callback */
extern void apply_error_callback(void *arg);
@@ -342,6 +347,7 @@ extern void pa_detach_all_error_mq(void);
extern bool pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes,
const void *data);
+extern void pa_distribute_schema_changes_to_workers(LogicalRepRelation *rel);
extern void pa_switch_to_partial_serialize(ParallelApplyWorkerInfo *winfo,
bool stream_locked);
@@ -368,8 +374,9 @@ extern void pa_xact_finish(ParallelApplyWorkerInfo *winfo,
XLogRecPtr remote_lsn);
extern bool pa_transaction_committed(TransactionId xid);
extern void pa_record_dependency_on_transactions(List *depends_on_xids);
-
+extern void pa_commit_transaction(void);
extern void pa_wait_for_depended_transaction(TransactionId xid);
+extern void pa_add_parallelized_transaction(TransactionId xid);
#define isParallelApplyWorker(worker) ((worker)->in_use && \
(worker)->type == WORKERTYPE_PARALLEL_APPLY)
diff --git a/src/test/subscription/meson.build b/src/test/subscription/meson.build
index 85d10a89994..e877ca09c30 100644
--- a/src/test/subscription/meson.build
+++ b/src/test/subscription/meson.build
@@ -46,6 +46,7 @@ tests += {
't/034_temporal.pl',
't/035_conflicts.pl',
't/036_sequences.pl',
+ 't/050_parallel_apply.pl',
't/100_bugs.pl',
],
},
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 430c1246d14..2caf798ee0a 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -16,6 +16,8 @@ $node_publisher->start;
# Create subscriber node
my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf',
+ "max_logical_replication_workers = 10");
$node_subscriber->start;
# Create some preexisting content on publisher
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index 3d16c2a800d..c2fba0b9a9c 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -17,7 +17,7 @@ $node_publisher->start;
my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_subscriber->init;
$node_subscriber->append_conf('postgresql.conf',
- qq(max_logical_replication_workers = 6));
+ qq(max_logical_replication_workers = 7));
$node_subscriber->start;
my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
diff --git a/src/test/subscription/t/015_stream.pl b/src/test/subscription/t/015_stream.pl
index 03135b1cd6e..e79ddd9a41c 100644
--- a/src/test/subscription/t/015_stream.pl
+++ b/src/test/subscription/t/015_stream.pl
@@ -232,6 +232,12 @@ $node_subscriber->wait_for_log(
$node_publisher->safe_psql('postgres', "INSERT INTO test_tab_2 values(1)");
+# FIXME: Currently, non-streaming transactions are applied in parallel by
+# default, so the first transaction is handled by a parallel apply worker. To
+# trigger the deadlock, initiate one more transaction to be applied by the
+# leader.
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab_2 values(1)");
+
$h->query_safe('COMMIT');
$h->quit;
@@ -247,7 +253,7 @@ $node_publisher->wait_for_catchup($appname);
$result =
$node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab_2");
-is($result, qq(5001), 'data replicated to subscriber after dropping index');
+is($result, qq(5002), 'data replicated to subscriber after dropping index');
# Clean up test data from the environment.
$node_publisher->safe_psql('postgres', "TRUNCATE TABLE test_tab_2");
diff --git a/src/test/subscription/t/026_stats.pl b/src/test/subscription/t/026_stats.pl
index a430ab4feec..58e34839ab4 100644
--- a/src/test/subscription/t/026_stats.pl
+++ b/src/test/subscription/t/026_stats.pl
@@ -16,6 +16,7 @@ $node_publisher->start;
# Create subscriber node.
my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf', "max_logical_replication_workers = 10");
$node_subscriber->start;
diff --git a/src/test/subscription/t/027_nosuperuser.pl b/src/test/subscription/t/027_nosuperuser.pl
index 691731743df..e0c1d213800 100644
--- a/src/test/subscription/t/027_nosuperuser.pl
+++ b/src/test/subscription/t/027_nosuperuser.pl
@@ -86,6 +86,7 @@ $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
$node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_publisher->init(allows_streaming => 'logical');
$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf', "max_logical_replication_workers = 10");
$node_publisher->start;
$node_subscriber->start;
$publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
diff --git a/src/test/subscription/t/050_parallel_apply.pl b/src/test/subscription/t/050_parallel_apply.pl
new file mode 100644
index 00000000000..69cf48cb7ac
--- /dev/null
+++ b/src/test/subscription/t/050_parallel_apply.pl
@@ -0,0 +1,130 @@
+
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# This tests that dependency tracking between transactions works correctly
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->start;
+
+# Insert initial data
+$node_publisher->safe_psql('postgres',
+ "CREATE TABLE regress_tab (id int PRIMARY KEY, value text);");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (generate_series(1, 10), 'test');");
+
+# Create a publication
+$node_publisher->safe_psql('postgres',
+ "CREATE PUBLICATION regress_pub FOR ALL TABLES;");
+
+# Initialize subscriber node
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
+$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf', "log_min_messages = debug1");
+$node_subscriber->append_conf('postgresql.conf',
+ "max_logical_replication_workers = 10");
+$node_subscriber->start;
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node_subscriber->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+$node_subscriber->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+
+# Create a subscription
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+
+$node_subscriber->safe_psql('postgres',
+ "CREATE TABLE regress_tab (id int PRIMARY KEY, value text);");
+$node_subscriber->safe_psql('postgres',
+ "CREATE SUBSCRIPTION regress_sub CONNECTION '$publisher_connstr' PUBLICATION regress_pub;");
+
+# Wait for initial table sync to finish
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'regress_sub');
+
+# Insert tuples on publisher
+#
+# XXX This may not be enough to launch a parallel apply worker, because
+# table_states_not_ready is not discarded yet.
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (generate_series(11, 20), 'test');");
+$node_publisher->wait_for_catchup('regress_sub');
+
+# Insert tuples again
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (generate_series(21, 30), 'test');");
+$node_publisher->wait_for_catchup('regress_sub');
+
+# Verify the parallel apply worker is launched
+my $result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(1) FROM pg_stat_activity WHERE backend_type = 'logical replication parallel worker'");
+is($result, '1', "parallel apply worker is laucnhed by a non-streamed transaction");
+
+# Attach an injection_point. Parallel workers would wait before the commit
+$node_subscriber->safe_psql('postgres',
+ "SELECT injection_points_attach('parallel-worker-before-commit','wait');"
+);
+
+# Insert tuples on publisher
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (generate_series(31, 40), 'test');");
+
+# Wait until the parallel worker enters the injection point.
+$node_subscriber->wait_for_event('logical replication parallel worker',
+ 'parallel-worker-before-commit');
+
+my $offset = -s $node_subscriber->logfile;
+
+# Insert tuples on publisher again. This transaction is independent of the
+# previous one, but its parallel worker still waits until the previous one finishes.
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (generate_series(41, 50), 'test');");
+
+# Verify the parallel worker waits for the transaction
+my $str = $node_subscriber->wait_for_log(qr/wait for depended xid ([1-9][0-9]+)/, $offset);
+my ($xid) = $str =~ /wait for depended xid ([1-9][0-9]+)/;
+
+# Update tuples which have not been applied yet on the subscriber because the
+# parallel worker stops at the injection point. The newly assigned worker also
+# waits for the same transaction as above.
+$node_publisher->safe_psql('postgres',
+ "UPDATE regress_tab SET value = 'updated' WHERE id BETWEEN 31 AND 35;");
+
+# Verify the parallel worker waits for the same transaction
+$node_subscriber->wait_for_log(qr/wait for depended xid $xid/, $offset);
+
+# Wake up the parallel worker. We detach first so as not to stop other parallel workers
+$node_subscriber->safe_psql('postgres', qq[
+ SELECT injection_points_detach('parallel-worker-before-commit');
+ SELECT injection_points_wakeup('parallel-worker-before-commit');
+]);
+
+# Verify the parallel worker wakes up
+$node_subscriber->wait_for_log(qr/finish waiting for depended xid $xid/, $offset);
+
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(1) FROM regress_tab");
+is($result, 50, 'inserts are replicated to subscriber');
+
+$result =
+ $node_subscriber->safe_psql('postgres',
+ "SELECT count(1) FROM regress_tab WHERE value = 'updated'");
+is($result, 5, 'updates are also replicated to subscriber');
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 04845d5e680..d9ec32de3ea 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2089,6 +2089,7 @@ ParallelHashGrowth
ParallelHashJoinBatch
ParallelHashJoinBatchAccessor
ParallelHashJoinState
+ParallelizedTxnEntry
ParallelIndexScanDesc
ParallelSlot
ParallelSlotArray
@@ -2573,6 +2574,8 @@ ReparameterizeForeignPathByChild_function
ReplaceVarsFromTargetList_context
ReplaceVarsNoMatchOption
ReplaceWrapOption
+ReplicaIdentityEntry
+ReplicaIdentityKey
ReplicaIdentityStmt
ReplicationKind
ReplicationSlot
@@ -4082,6 +4085,7 @@ rendezvousHashEntry
rep
replace_rte_variables_callback
replace_rte_variables_context
+replica_identity_hash
report_error_fn
ret_type
rewind_source
--
2.47.3
Attachment: v5-0005-Fix-unexpected-origin-advancement-during-parallel.patch
From 9c97e1d28c715eaf80c9ca1df3b6f1767feb6d86 Mon Sep 17 00:00:00 2001
From: Zhijie Hou <houzj.fnst@fujitsu.com>
Date: Mon, 22 Dec 2025 14:01:08 +0800
Subject: [PATCH v5 5/8] Fix unexpected origin advancement during parallel
apply failure
The logical replication parallel apply worker may erroneously advance the origin
progress during an error or unsuccessful apply. This can lead to transaction
loss, as these transactions will not be resent by the server.
Commit 3f28b2fc addressed a similar issue in both the apply worker and the
table sync worker by registering a before_shmem_exit callback to reset the
origin information, preventing the worker from advancing it during transaction
abort on shutdown. This commit registers the same callback for the parallel
apply worker, ensuring consistent behavior across all workers.
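For illustration (not part of the patch), the failure mode matches the test
added below: with max_prepared_transactions = 0 on the subscriber, applying

BEGIN;
INSERT INTO test_tab_2 VALUES (2);
PREPARE TRANSACTION 'xact';
COMMIT PREPARED 'xact';

fails at the prepare step. Without the reset callback, the exiting parallel
apply worker could still flush an advanced origin position, so the publisher
would never resend 'xact' and the row would be silently lost.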
---
src/backend/replication/logical/worker.c | 30 +++++++++-------
.../subscription/t/023_twophase_stream.pl | 34 +++++++++++++++++++
2 files changed, 51 insertions(+), 13 deletions(-)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 7790c2d8457..5808cd11c15 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -6749,6 +6749,23 @@ InitializeLogRepWorker(void)
MySubscription->name));
CommitTransactionCommand();
+
+ /*
+ * Register a callback to reset the origin state before aborting any
+ * pending transaction during shutdown (see ShutdownPostgres()). This will
+ * avoid origin advancement for an incomplete transaction which could
+ * otherwise lead to its loss as such a transaction won't be sent by the
+ * server again.
+ *
+ * Note that even a LOG or DEBUG statement placed after setting the origin
+ * state may process a shutdown signal before committing the current apply
+ * operation. So, it is important to register such a callback here.
+ *
+ * Register this callback here to ensure that all types of logical
+ * replication workers that set up origins and apply remote transactions
+ * are protected.
+ */
+ before_shmem_exit(replorigin_reset, (Datum) 0);
}
/*
@@ -6792,19 +6809,6 @@ SetupApplyOrSyncWorker(int worker_slot)
InitializeLogRepWorker();
- /*
- * Register a callback to reset the origin state before aborting any
- * pending transaction during shutdown (see ShutdownPostgres()). This will
- * avoid origin advancement for an in-complete transaction which could
- * otherwise lead to its loss as such a transaction won't be sent by the
- * server again.
- *
- * Note that even a LOG or DEBUG statement placed after setting the origin
- * state may process a shutdown signal before committing the current apply
- * operation. So, it is important to register such a callback here.
- */
- before_shmem_exit(replorigin_reset, (Datum) 0);
-
/* Connect to the origin and start the replication. */
elog(DEBUG1, "connecting to publisher using connection string \"%s\"",
MySubscription->conninfo);
diff --git a/src/test/subscription/t/023_twophase_stream.pl b/src/test/subscription/t/023_twophase_stream.pl
index e01347ca699..9b9f189308e 100644
--- a/src/test/subscription/t/023_twophase_stream.pl
+++ b/src/test/subscription/t/023_twophase_stream.pl
@@ -429,6 +429,40 @@ $result =
$node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab_2");
is($result, qq(1), 'transaction is committed on subscriber');
+# Test the ability to re-apply a transaction when a parallel apply worker fails
+# to prepare the transaction due to insufficient max_prepared_transactions
+# setting.
+$node_subscriber->append_conf('postgresql.conf',
+ qq(max_prepared_transactions = 0));
+$node_subscriber->restart;
+
+$node_publisher->safe_psql(
+ 'postgres', q{
+ BEGIN;
+ INSERT INTO test_tab_2 values(2);
+ PREPARE TRANSACTION 'xact';
+ COMMIT PREPARED 'xact';
+ });
+
+$offset = -s $node_subscriber->logfile;
+
+# Confirm the ERROR is reported because max_prepared_transactions is zero
+$node_subscriber->wait_for_log(
+ qr/ERROR: ( [A-Z0-9]+:)? prepared transactions are disabled/,
+ $offset);
+
+# Set max_prepared_transactions to correct value to resume the replication
+$node_subscriber->append_conf('postgresql.conf',
+ qq(max_prepared_transactions = 10));
+$node_subscriber->restart;
+
+$node_publisher->wait_for_catchup($appname);
+
+# Check that transaction is committed on subscriber
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab_2");
+is($result, qq(2), 'transaction is committed on subscriber after retrying');
+
###############################
# check all the cleanup
###############################
--
2.47.3
Attachment: v5-0006-support-2PC.patch
From bb374ecb991cecf1376af084e9ed95d7481ab7c0 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Tue, 2 Dec 2025 13:01:26 +0900
Subject: [PATCH v5 6/8] support 2PC
This patch allows prepared transactions to be applied in parallel. A parallel
apply worker is assigned to a transaction when BEGIN_PREPARE is received; this
part and the dependency-waiting mechanism are the same as for a normal
transaction.
A parallel worker can be freed once it has handled the PREPARE message. The
prepared transaction is deleted from parallelized_txns at that point, and
upcoming transactions wait until then.
The leader apply worker resolves COMMIT PREPARED/ROLLBACK PREPARED itself.
Since this serializes them with subsequent transactions automatically, such a
transaction is not added to parallelized_txns.
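For reference (not part of the patch; the table and GID names are illustrative,
and max_prepared_transactions must be non-zero on both nodes), the
publisher-side flow this enables is:

BEGIN;
INSERT INTO tab VALUES (1);
PREPARE TRANSACTION 'gid1'; -- applied by a parallel apply worker up to here
COMMIT PREPARED 'gid1';     -- resolved by the leader apply worker

The worker is released once the PREPARE is handled, so a dependent transaction
waits only until that point rather than until COMMIT PREPARED.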
---
src/backend/replication/logical/worker.c | 230 +++++++++++++++---
src/test/subscription/t/050_parallel_apply.pl | 57 +++++
2 files changed, 259 insertions(+), 28 deletions(-)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 5808cd11c15..b6d3d43e8c0 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2116,6 +2116,11 @@ static void
apply_handle_begin_prepare(StringInfo s)
{
LogicalRepPreparedTxnData begin_data;
+ ParallelApplyWorkerInfo *winfo;
+ TransApplyAction apply_action;
+
+ /* Save the message before it is consumed. */
+ StringInfoData original_msg = *s;
/* Tablesync should never receive prepare. */
if (am_tablesync_worker())
@@ -2127,12 +2132,61 @@ apply_handle_begin_prepare(StringInfo s)
Assert(!TransactionIdIsValid(stream_xid));
logicalrep_read_begin_prepare(s, &begin_data);
- set_apply_error_context_xact(begin_data.xid, begin_data.prepare_lsn);
+
+ remote_xid = begin_data.xid;
+
+ set_apply_error_context_xact(remote_xid, begin_data.prepare_lsn);
remote_final_lsn = begin_data.prepare_lsn;
maybe_start_skipping_changes(begin_data.prepare_lsn);
+ pa_allocate_worker(remote_xid, false);
+
+ apply_action = get_transaction_apply_action(remote_xid, &winfo);
+
+ elog(DEBUG1, "new remote_xid %u", remote_xid);
+ switch (apply_action)
+ {
+ case TRANS_LEADER_APPLY:
+ break;
+
+ case TRANS_LEADER_SEND_TO_PARALLEL:
+ Assert(winfo);
+
+ if (pa_send_data(winfo, s->len, s->data))
+ {
+ pa_set_stream_apply_worker(winfo);
+ break;
+ }
+
+ /*
+ * Switch to serialize mode when we are not able to send the
+ * change to parallel apply worker.
+ */
+ pa_switch_to_partial_serialize(winfo, true);
+
+/* fall through */
+ case TRANS_LEADER_PARTIAL_SERIALIZE:
+ Assert(winfo);
+
+ stream_write_change(LOGICAL_REP_MSG_BEGIN_PREPARE, &original_msg);
+
+ /* Cache the parallel apply worker for this transaction. */
+ pa_set_stream_apply_worker(winfo);
+ break;
+
+ case TRANS_PARALLEL_APPLY:
+ /* Hold the lock until the end of the transaction. */
+ pa_lock_transaction(MyParallelShared->xid, AccessExclusiveLock);
+ pa_set_xact_state(MyParallelShared, PARALLEL_TRANS_STARTED);
+ break;
+
+ default:
+ elog(ERROR, "unexpected apply action: %d", (int) apply_action);
+ break;
+ }
+
in_remote_transaction = true;
pgstat_report_activity(STATE_RUNNING, NULL);
@@ -2182,6 +2236,11 @@ static void
apply_handle_prepare(StringInfo s)
{
LogicalRepPreparedTxnData prepare_data;
+ ParallelApplyWorkerInfo *winfo;
+ TransApplyAction apply_action;
+
+ /* Save the message before it is consumed. */
+ StringInfoData original_msg = *s;
logicalrep_read_prepare(s, &prepare_data);
@@ -2192,36 +2251,136 @@ apply_handle_prepare(StringInfo s)
LSN_FORMAT_ARGS(prepare_data.prepare_lsn),
LSN_FORMAT_ARGS(remote_final_lsn))));
- /*
- * Unlike commit, here, we always prepare the transaction even though no
- * change has happened in this transaction or all changes are skipped. It
- * is done this way because at commit prepared time, we won't know whether
- * we have skipped preparing a transaction because of those reasons.
- *
- * XXX, We can optimize such that at commit prepared time, we first check
- * whether we have prepared the transaction or not but that doesn't seem
- * worthwhile because such cases shouldn't be common.
- */
- begin_replication_step();
+ apply_action = get_transaction_apply_action(remote_xid, &winfo);
- apply_handle_prepare_internal(&prepare_data);
+ switch (apply_action)
+ {
+ case TRANS_LEADER_APPLY:
+ /*
+ * Unlike commit, here, we always prepare the transaction even
+ * though no change has happened in this transaction or all changes
+ * are skipped. It is done this way because at commit prepared
+ * time, we won't know whether we have skipped preparing a
+ * transaction because of those reasons.
+ *
+ * XXX, We can optimize such that at commit prepared time, we first
+ * check whether we have prepared the transaction or not but that
+ * doesn't seem worthwhile because such cases shouldn't be common.
+ */
+ begin_replication_step();
- end_replication_step();
- CommitTransactionCommand();
- pgstat_report_stat(false);
+ /* Wait until the last transaction finishes */
+ if (TransactionIdIsValid(last_remote_xid))
+ pa_wait_for_depended_transaction(last_remote_xid);
- /*
- * It is okay not to set the local_end LSN for the prepare because we
- * always flush the prepare record. So, we can send the acknowledgment of
- * the remote_end LSN as soon as prepare is finished.
- *
- * XXX For the sake of consistency with commit, we could have set it with
- * the LSN of prepare but as of now we don't track that value similar to
- * XactLastCommitEnd, and adding it for this purpose doesn't seems worth
- * it.
- */
- store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr,
- InvalidTransactionId);
+ apply_handle_prepare_internal(&prepare_data);
+
+ end_replication_step();
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ /*
+ * It is okay not to set the local_end LSN for the prepare because
+ * we always flush the prepare record. So, we can send the
+ * acknowledgment of the remote_end LSN as soon as prepare is
+ * finished.
+ *
+ * XXX For the sake of consistency with commit, we could have set
+ * it with the LSN of prepare but as of now we don't track that
+ * value similar to XactLastCommitEnd, and adding it for this
+ * purpose doesn't seems worth
+ * it.
+ */
+ store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr,
+ InvalidTransactionId);
+
+ break;
+
+ case TRANS_LEADER_SEND_TO_PARALLEL:
+ Assert(winfo);
+
+ /*
+ * Mark this transaction as parallelized. This ensures that
+ * upcoming transactions wait until this transaction is committed.
+ */
+ pa_add_parallelized_transaction(remote_xid);
+
+ /*
+ * Build a dependency between this transaction and the last
+ * committed transaction to preserve the commit order, then try to
+ * send the PREPARE message if that succeeds.
+ */
+ if (build_dependency_with_last_committed_txn(winfo) &&
+ pa_send_data(winfo, s->len, s->data))
+ {
+ /* Finish processing the transaction. */
+ pa_xact_finish(winfo, prepare_data.end_lsn);
+ break;
+ }
+
+ /*
+ * Switch to serialize mode when we are not able to send the
+ * change to parallel apply worker.
+ */
+ pa_switch_to_partial_serialize(winfo, true);
+/* fall through */
+ case TRANS_LEADER_PARTIAL_SERIALIZE:
+ Assert(winfo);
+
+ stream_open_and_write_change(remote_xid, LOGICAL_REP_MSG_PREPARE,
+ &original_msg);
+
+ pa_set_fileset_state(winfo->shared, FS_SERIALIZE_DONE);
+
+ /* Finish processing the transaction. */
+ pa_xact_finish(winfo, prepare_data.end_lsn);
+ break;
+
+ case TRANS_PARALLEL_APPLY:
+
+ /*
+ * If the parallel apply worker is applying spooled messages then
+ * close the file before committing.
+ */
+ if (stream_fd)
+ stream_close_file();
+
+ begin_replication_step();
+
+ INJECTION_POINT("parallel-worker-before-prepare", NULL);
+
+ /* Mark the transaction as prepared. */
+ apply_handle_prepare_internal(&prepare_data);
+
+ end_replication_step();
+
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr,
+ InvalidTransactionId);
+
+ /*
+ * It is okay not to set the local_end LSN for the prepare because
+ * we always flush the prepare record. See apply_handle_prepare.
+ */
+ MyParallelShared->last_commit_end = InvalidXLogRecPtr;
+ pa_commit_transaction();
+
+ pa_unlock_transaction(MyParallelShared->xid, AccessExclusiveLock);
+
+ pa_reset_subtrans();
+ break;
+
+ default:
+ elog(ERROR, "unexpected apply action: %d", (int) apply_action);
+ break;
+ }
+
+ /* Cache the remote_xid */
+ last_remote_xid = remote_xid;
+
+ remote_xid = InvalidTransactionId;
in_remote_transaction = false;
@@ -2269,6 +2428,9 @@ apply_handle_commit_prepared(StringInfo s)
/* There is no transaction when COMMIT PREPARED is called */
begin_replication_step();
+ if (TransactionIdIsValid(last_remote_xid))
+ pa_wait_for_depended_transaction(last_remote_xid);
+
/*
* Update origin state so we can restart streaming from correct position
* in case of crash.
@@ -2281,6 +2443,14 @@ apply_handle_commit_prepared(StringInfo s)
CommitTransactionCommand();
pgstat_report_stat(false);
+ /*
+ * No need to update last_remote_xid because the leader worker applied
+ * the message, so upcoming transactions preserve the order automatically.
+ * Let's set the xid to an invalid value to skip sending the
+ * INTERNAL_DEPENDENCY message.
+ */
+ last_remote_xid = InvalidTransactionId;
+
store_flush_position(prepare_data.end_lsn, XactLastCommitEnd,
InvalidTransactionId);
in_remote_transaction = false;
@@ -2337,6 +2507,10 @@ apply_handle_rollback_prepared(StringInfo s)
/* There is no transaction when ABORT/ROLLBACK PREPARED is called */
begin_replication_step();
+
+ if (TransactionIdIsValid(last_remote_xid))
+ pa_wait_for_depended_transaction(last_remote_xid);
+
FinishPreparedTransaction(gid, false);
end_replication_step();
CommitTransactionCommand();
diff --git a/src/test/subscription/t/050_parallel_apply.pl b/src/test/subscription/t/050_parallel_apply.pl
index 69cf48cb7ac..57bcfde513e 100644
--- a/src/test/subscription/t/050_parallel_apply.pl
+++ b/src/test/subscription/t/050_parallel_apply.pl
@@ -17,6 +17,8 @@ if ($ENV{enable_injection_points} ne 'yes')
# Initialize publisher node
my $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+ "max_prepared_transactions = 10");
$node_publisher->start;
# Insert initial data
@@ -35,6 +37,8 @@ $node_subscriber->init;
$node_subscriber->append_conf('postgresql.conf', "log_min_messages = debug1");
$node_subscriber->append_conf('postgresql.conf',
"max_logical_replication_workers = 10");
+$node_subscriber->append_conf('postgresql.conf',
+ "max_prepared_transactions = 10");
$node_subscriber->start;
# Check if the extension injection_points is available, as it may be
@@ -127,4 +131,57 @@ $result =
"SELECT count(1) FROM regress_tab WHERE value = 'updated'");
is ($result, 5, 'updates are also replicated to subscriber');
+# Ensure prepared transactions are also applied in parallel
+
+$node_subscriber->safe_psql('postgres',
+ "ALTER SUBSCRIPTION regress_sub DISABLE;");
+$node_subscriber->safe_psql('postgres',
+ "ALTER SUBSCRIPTION regress_sub SET (two_phase = on);");
+$node_subscriber->safe_psql('postgres',
+ "ALTER SUBSCRIPTION regress_sub ENABLE;");
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(1) FROM pg_stat_activity WHERE backend_type = 'logical replication parallel worker'");
+is($result, '0', "no parallel apply workers exist after restart");
+
+# Attach an injection_point. Parallel workers would wait before the prepare
+$node_subscriber->safe_psql('postgres',
+ "SELECT injection_points_attach('parallel-worker-before-prepare','wait');"
+);
+
+# PREPARE a transaction on publisher. It would be handled by a parallel apply
+# worker.
+$node_publisher->safe_psql('postgres', qq[
+ BEGIN;
+ INSERT INTO regress_tab VALUES (generate_series(51, 60), 'prepare');
+ PREPARE TRANSACTION 'regress_prepare';
+]);
+
+# Wait until the parallel worker enters the injection point.
+$node_subscriber->wait_for_event('logical replication parallel worker',
+ 'parallel-worker-before-prepare');
+
+$offset = -s $node_subscriber->logfile;
+
+# Insert tuples on publisher again. This transaction waits for the prepared
+# transaction
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (generate_series(61, 70), 'test');");
+
+# Verify the parallel worker waits for the transaction
+$str = $node_subscriber->wait_for_log(qr/wait for depended xid ([1-9][0-9]+)/, $offset);
+($xid) = $str =~ /wait for depended xid ([1-9][0-9]+)/;
+
+# Wakeup the parallel worker
+$node_subscriber->safe_psql('postgres', qq[
+ SELECT injection_points_detach('parallel-worker-before-prepare');
+ SELECT injection_points_wakeup('parallel-worker-before-prepare');
+]);
+
+$node_subscriber->wait_for_log(qr/finish waiting for depended xid $xid/, $offset);
+
+# COMMIT the prepared transaction. It is always handled by the leader
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'regress_prepare';");
+$node_publisher->wait_for_catchup('regress_sub');
+
done_testing();
--
2.47.3
Attachment: v5-0007-Track-dependencies-for-streamed-transactions.patch
From 8e96cbb4d47c79612a35f2d257837552befed889 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Thu, 4 Dec 2025 20:55:26 +0900
Subject: [PATCH v5 7/8] Track dependencies for streamed transactions
This commit allows tracking dependencies of streamed transactions.
In the streaming=on case, dependency tracking is enabled while applying
spooled changes from files.
In the streaming=parallel case, dependency tracking is performed when the leader
sends changes to parallel workers. Unlike non-streamed transactions, the leader
waits for parallel workers until the assigned transactions are finished at
COMMIT/PREPARE/ABORT; thus, the XID of a streamed transaction is not cached as
the last handled one. Also, streamed transactions are not recorded as
parallelized transactions because upcoming workers do not have to wait for them.
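A hedged sketch of the case this covers (names are illustrative; assume
tab(id int PRIMARY KEY, val text) on both nodes and
logical_decoding_work_mem = 64kB on the publisher, as in the test below):

INSERT INTO tab VALUES (1, 'x'); -- small transaction, applied in parallel

BEGIN;
UPDATE tab SET val = 'y' WHERE id = 1; -- same replica identity key as above
INSERT INTO tab SELECT g, 'bulk' FROM generate_series(2, 5000) g; -- exceeds the limit, so the transaction is streamed
COMMIT;

The streamed transaction is detected as dependent on the first one via the
replica identity key, so its apply on the subscriber waits for that commit.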
---
.../replication/logical/applyparallelworker.c | 19 +++++-
src/backend/replication/logical/worker.c | 66 +++++++++++++++++--
src/include/replication/worker_internal.h | 2 +-
src/test/subscription/t/050_parallel_apply.pl | 47 +++++++++++++
4 files changed, 126 insertions(+), 8 deletions(-)
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index 5b6267c6047..bb66d64582c 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -168,7 +168,14 @@
* key) as another ongoing transaction (see handle_dependency_on_change for
* details). If so, the leader sends a list of dependent transaction IDs to the
* parallel worker, indicating that the parallel apply worker must wait for
- * these transactions to commit before proceeding.
+ * these transactions to commit before proceeding. If a transaction is streamed
+ * but the leader decides not to assign a parallel apply worker, dependencies
+ * are verified when the transaction is committed.
+ *
+ * Non-streaming transactions
+ * ======================
+ * The handling is similar to streaming transactions, but with a few
+ * differences:
*
* Commit order
* ------------
@@ -1635,6 +1642,12 @@ pa_set_stream_apply_worker(ParallelApplyWorkerInfo *winfo)
stream_apply_worker = winfo;
}
+bool
+pa_stream_apply_worker_is_null(void)
+{
+ return stream_apply_worker == NULL;
+}
+
/*
* Form a unique savepoint name for the streaming transaction.
*
@@ -1720,6 +1733,10 @@ pa_stream_abort(LogicalRepStreamAbortData *abort_data)
TransactionId xid = abort_data->xid;
TransactionId subxid = abort_data->subxid;
+ /* Streamed transactions won't be registered */
+ Assert(!dshash_find(parallelized_txns, &xid, false) &&
+ !dshash_find(parallelized_txns, &subxid, false));
+
/*
* Update origin state so we can restart streaming from correct position
* in case of crash.
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index b6d3d43e8c0..9776edd2310 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -961,13 +961,26 @@ check_dependency_on_replica_identity(Oid relid,
&rientry->remote_xid,
new_depended_xid);
+ /*
+ * Remove the entry if it is registered for a streamed transaction. We
+ * do not have to register an entry for them; the leader worker always
+ * waits until the parallel worker finishes handling streamed transactions,
+ * thus there is no need to consider the possibility that upcoming parallel
+ * workers would go ahead.
+ */
+ if (TransactionIdIsValid(stream_xid) && !found)
+ {
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+ }
+
/*
* Update the new depended xid into the entry if valid, the new xid could
* be invalid if the transaction will be applied by the leader itself
* which means all the changes will be committed before processing next
* transaction, so no need to be depended on.
*/
- if (TransactionIdIsValid(new_depended_xid))
+ else if (TransactionIdIsValid(new_depended_xid))
rientry->remote_xid = new_depended_xid;
/*
@@ -1081,8 +1094,11 @@ handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
*/
StringInfoData change = *s;
- /* Compute dependency only for non-streaming transaction */
- if (in_streamed_transaction || (winfo && winfo->stream_txn))
+ /*
+ * Skip if we are handling a streamed transaction whose changes are not
+ * being applied yet.
+ */
+ if (pa_stream_apply_worker_is_null() && in_streamed_transaction)
return;
/* Only the leader checks dependencies and schedules the parallel apply */
@@ -1442,7 +1458,18 @@ handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
(errcode(ERRCODE_PROTOCOL_VIOLATION),
errmsg_internal("invalid transaction ID in streamed replication transaction")));
- handle_dependency_on_change(action, s, current_xid, winfo);
+ /*
+ * Check dependencies related to the received change. The XID of the top
+ * transaction is always used to avoid detecting false-positive
+ * dependencies between top and sub transactions. Sub-transactions can be
+ * replicated for streamed transactions, and they won't be marked as
+ * parallelized so that parallel workers won't wait for rolled-back
+ * sub-transactions.
+ */
+ handle_dependency_on_change(action, s,
+ in_streamed_transaction
+ ? stream_xid : remote_xid,
+ winfo);
/*
* Re-fetch the latest apply action as it might have been changed during
@@ -2579,6 +2606,10 @@ apply_handle_stream_prepare(StringInfo s)
apply_spooled_messages(MyLogicalRepWorker->stream_fileset,
prepare_data.xid, prepare_data.prepare_lsn);
+ /* Wait until the last transaction finishes */
+ if (TransactionIdIsValid(last_remote_xid))
+ pa_wait_for_depended_transaction(last_remote_xid);
+
/* Mark the transaction as prepared. */
apply_handle_prepare_internal(&prepare_data);
@@ -2602,7 +2633,8 @@ apply_handle_stream_prepare(StringInfo s)
case TRANS_LEADER_SEND_TO_PARALLEL:
Assert(winfo);
- if (pa_send_data(winfo, s->len, s->data))
+ if (build_dependency_with_last_committed_txn(winfo) &&
+ pa_send_data(winfo, s->len, s->data))
{
/* Finish processing the streaming transaction. */
pa_xact_finish(winfo, prepare_data.end_lsn);
@@ -2668,6 +2700,11 @@ apply_handle_stream_prepare(StringInfo s)
pgstat_report_stat(false);
+ /*
+ * No need to update the last_remote_xid here because the leader worker
+ * always waits until streamed transactions finish.
+ */
+
/*
* Process any tables that are being synchronized in parallel, as well as
* any newly added tables or sequences.
@@ -3452,6 +3489,10 @@ apply_handle_stream_commit(StringInfo s)
apply_spooled_messages(MyLogicalRepWorker->stream_fileset, xid,
commit_data.commit_lsn);
+ /* Wait until the last transaction finishes */
+ if (TransactionIdIsValid(last_remote_xid))
+ pa_wait_for_depended_transaction(last_remote_xid);
+
apply_handle_commit_internal(&commit_data);
/* Unlink the files with serialized changes and subxact info. */
@@ -3463,7 +3504,20 @@ apply_handle_stream_commit(StringInfo s)
case TRANS_LEADER_SEND_TO_PARALLEL:
Assert(winfo);
- if (pa_send_data(winfo, s->len, s->data))
+ /*
+ * Unlike the non-streaming case, there is no need to mark this
+ * transaction as parallelized, because the leader waits until the
+ * streamed transaction is committed, so commit ordering is always
+ * preserved.
+ */
+
+ /*
+ * Build a dependency between this transaction and the last
+ * committed transaction to preserve the commit order, then try to
+ * send the COMMIT message if that succeeds.
+ */
+ if (build_dependency_with_last_committed_txn(winfo) &&
+ pa_send_data(winfo, s->len, s->data))
{
/* Finish processing the streaming transaction. */
pa_xact_finish(winfo, commit_data.end_lsn);
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 5371ee767f1..69ecd51a359 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -354,7 +354,7 @@ extern void pa_switch_to_partial_serialize(ParallelApplyWorkerInfo *winfo,
extern void pa_set_xact_state(ParallelApplyWorkerShared *wshared,
ParallelTransState xact_state);
extern void pa_set_stream_apply_worker(ParallelApplyWorkerInfo *winfo);
-
+extern bool pa_stream_apply_worker_is_null(void);
extern void pa_start_subtrans(TransactionId current_xid,
TransactionId top_xid);
extern void pa_reset_subtrans(void);
diff --git a/src/test/subscription/t/050_parallel_apply.pl b/src/test/subscription/t/050_parallel_apply.pl
index 57bcfde513e..20e8a7b91a7 100644
--- a/src/test/subscription/t/050_parallel_apply.pl
+++ b/src/test/subscription/t/050_parallel_apply.pl
@@ -184,4 +184,51 @@ $node_subscriber->wait_for_log(qr/finish waiting for depended xid $xid/, $offset
$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'regress_prepare';");
$node_publisher->wait_for_catchup('regress_sub');
+# Ensure streamed transactions wait for the previous transaction
+
+$node_publisher->append_conf('postgresql.conf',
+ "logical_decoding_work_mem = 64kB");
+$node_publisher->reload;
+# Run a query to make sure that the reload has taken effect.
+$node_publisher->safe_psql('postgres', "SELECT 1");
+
+# Attach the injection_point again
+$node_subscriber->safe_psql('postgres',
+ "SELECT injection_points_attach('parallel-worker-before-commit','wait');"
+);
+
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (generate_series(71, 80), 'test');");
+
+# Wait until the parallel worker enters the injection point.
+$node_subscriber->wait_for_event('logical replication parallel worker',
+ 'parallel-worker-before-commit');
+
+# Run a transaction which would be streamed
+my $h = $node_publisher->background_psql('postgres', on_error_stop => 0);
+
+$offset = -s $node_subscriber->logfile;
+
+$h->query_safe(
+ q{
+BEGIN;
+UPDATE regress_tab SET value = 'streamed-updated' WHERE id BETWEEN 71 AND 80;
+INSERT INTO regress_tab VALUES (generate_series(100, 5100), 'streamed');
+});
+
+# Verify the parallel worker waits for the transaction
+$str = $node_subscriber->wait_for_log(qr/wait for depended xid ([1-9][0-9]+)/, $offset);
+($xid) = $str =~ /wait for depended xid ([1-9][0-9]+)/;
+
+# Wakeup the parallel worker
+$node_subscriber->safe_psql('postgres', qq[
+ SELECT injection_points_detach('parallel-worker-before-commit');
+ SELECT injection_points_wakeup('parallel-worker-before-commit');
+]);
+
+# Verify the streamed transaction can be applied
+$node_subscriber->wait_for_log(qr/finish waiting for depended xid $xid/, $offset);
+
+$h->query_safe("COMMIT;");
+
done_testing();
--
2.47.3
Attachment: v5-0008-Support-dependency-tracking-via-local-unique-inde.patch
From d2dec08b523e5f1e4dd1d3407bfca143056cdaa4 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <hayato@example.com>
Date: Thu, 11 Dec 2025 22:21:47 +0900
Subject: [PATCH v5 8/8] Support dependency tracking via local unique indexes
Currently, logical replication's parallel apply mechanism tracks dependencies
primarily based on the REPLICA IDENTITY defined on the publisher table.
However, local subscriber tables might have additional unique indexes that
could effectively serve as dependency keys, even if they don't correspond to
the publisher's REPLICA IDENTITY. Failing to track these additional unique
keys can lead to incorrect data and/or deadlocks during parallel application.
This patch extends the parallel apply's dependency tracking to consider
local unique indexes on the subscriber table. This is achieved by extending
the existing Replica Identity hash table to also store dependency information
based on these local unique indexes.
The LogicalRepRelMapEntry structure is extended to store details about these
local unique indexes. This information is collected and cached when
dependency checking is first performed for a remote transaction on a given
relation. The collection must run inside a transaction because it reads
system catalog information.
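A hedged example of the hazard, in the style of the earlier scenarios (table
and index names are illustrative):

Publisher:  CREATE TABLE tab(a INT PRIMARY KEY, b INT);
Subscriber: CREATE TABLE tab(a INT PRIMARY KEY, b INT);
            CREATE UNIQUE INDEX tab_b_key ON tab(b);

With an existing row (1, 1):
Txn1: DELETE FROM tab WHERE a = 1;   -- frees b = 1 under the local index
Txn2: INSERT INTO tab VALUES (2, 1); -- reuses b = 1

By replica identity alone the two transactions look independent (different
values of a), but applying Txn2 before Txn1 fails with a unique violation on
tab_b_key. Tracking the local unique index makes Txn2 depend on Txn1.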
---
src/backend/replication/logical/relation.c | 132 +++++++++
src/backend/replication/logical/worker.c | 275 ++++++++++++++----
src/backend/storage/lmgr/deadlock.c | 1 -
src/include/replication/logicalrelation.h | 10 +
src/test/subscription/t/050_parallel_apply.pl | 43 +++
src/tools/pgindent/typedefs.list | 2 +
6 files changed, 406 insertions(+), 57 deletions(-)
diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index 9991bfe76cc..5601696d338 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -125,6 +125,21 @@ logicalrep_relmap_init(void)
(Datum) 0);
}
+/*
+ * Release local index list
+ */
+static void
+free_local_unique_indexes(LogicalRepRelMapEntry *entry)
+{
+ Assert(am_leader_apply_worker());
+
+ foreach_ptr(LogicalRepSubscriberIdx, idxinfo, entry->local_unique_indexes)
+ bms_free(idxinfo->indexkeys);
+
+ list_free(entry->local_unique_indexes);
+ entry->local_unique_indexes = NIL;
+}
+
/*
* Free the entry of a relation map cache.
*/
@@ -152,6 +167,9 @@ logicalrep_relmap_free_entry(LogicalRepRelMapEntry *entry)
if (entry->attrmap)
free_attrmap(entry->attrmap);
+
+ if (entry->local_unique_indexes != NIL)
+ free_local_unique_indexes(entry);
}
/*
@@ -352,6 +370,107 @@ logicalrep_rel_mark_updatable(LogicalRepRelMapEntry *entry)
}
}
+/*
+ * Collect all local unique indexes that can be used for dependency tracking.
+ */
+static void
+collect_local_indexes(LogicalRepRelMapEntry *entry)
+{
+ List *idxlist;
+
+ if (entry->local_unique_indexes != NIL)
+ free_local_unique_indexes(entry);
+
+ entry->local_unique_indexes_collected = true;
+
+ idxlist = RelationGetIndexList(entry->localrel);
+
+ /* Quick exit if there are no indexes */
+ if (idxlist == NIL)
+ return;
+
+ /* Iterate indexes to list all usable indexes */
+ foreach_oid(idxoid, idxlist)
+ {
+ Relation idxrel;
+ int indnkeys;
+ AttrMap *attrmap;
+ Bitmapset *indexkeys = NULL;
+ bool suitable = true;
+
+ idxrel = index_open(idxoid, AccessShareLock);
+
+ /*
+ * Check whether the index can be used for dependency tracking.
+ *
+ * For simplicity, require the same conditions as for REPLICA IDENTITY
+ * FULL, plus that the index is unique.
+ */
+ if (!(idxrel->rd_index->indisunique &&
+ IsIndexUsableForReplicaIdentityFull(idxrel, entry->attrmap)))
+ {
+ index_close(idxrel, AccessShareLock);
+ continue;
+ }
+
+ indnkeys = idxrel->rd_index->indnkeyatts;
+ attrmap = entry->attrmap;
+
+ Assert(indnkeys);
+
+ /* Scan each attribute and add it to the bitmap */
+ for (int i = 0; i < indnkeys; i++)
+ {
+ AttrNumber localcol = idxrel->rd_index->indkey.values[i];
+ AttrNumber remotecol;
+
+ /* Skip expression columns */
+ if (!AttributeNumberIsValid(localcol))
+ continue;
+
+ remotecol = attrmap->attnums[AttrNumberGetAttrOffset(localcol)];
+
+ /*
+ * Skip if the column does not exist on the publisher node. In this
+ * case the replicated tuples always have a NULL or default value.
+ */
+ if (remotecol < 0)
+ {
+ suitable = false;
+ break;
+ }
+
+ /* Checks are passed, remember the attribute */
+ indexkeys = bms_add_member(indexkeys, remotecol);
+ }
+
+ index_close(idxrel, AccessShareLock);
+
+ /*
+ * One of the columns does not exist on the publisher side; skip this index.
+ */
+ if (!suitable)
+ continue;
+
+ /* This index is usable; store it in memory */
+ if (indexkeys)
+ {
+ MemoryContext oldctx;
+ LogicalRepSubscriberIdx *idxinfo;
+
+ oldctx = MemoryContextSwitchTo(LogicalRepRelMapContext);
+ idxinfo = palloc(sizeof(LogicalRepSubscriberIdx));
+ idxinfo->indexoid = idxoid;
+ idxinfo->indexkeys = bms_copy(indexkeys);
+ entry->local_unique_indexes =
+ lappend(entry->local_unique_indexes, idxinfo);
+ MemoryContextSwitchTo(oldctx);
+ }
+ }
+
+ list_free(idxlist);
+}
+
/*
* Open the local relation associated with the remote one.
*
@@ -499,6 +618,13 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
entry->localindexoid = FindLogicalRepLocalIndex(entry->localrel, remoterel,
entry->attrmap);
+ /*
+ * Leader must also collect all local unique indexes for dependency
+ * tracking.
+ */
+ if (am_leader_apply_worker())
+ collect_local_indexes(entry);
+
entry->localrelvalid = true;
}
@@ -771,6 +897,12 @@ logicalrep_partition_open(LogicalRepRelMapEntry *root,
entry->localindexoid = FindLogicalRepLocalIndex(partrel, remoterel,
entry->attrmap);
+ /*
+ * TODO: Parallel apply is not supported for partitioned tables yet.
+ * Just mark the local indexes as collected.
+ */
+ entry->local_unique_indexes_collected = true;
+
entry->localrelvalid = true;
return entry;
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 9776edd2310..87ec0fdbd0c 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -548,9 +548,19 @@ typedef struct ApplySubXactData
static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};
+/*
+ * Type of key used for dependency tracking.
+ */
+typedef enum LogicalRepKeyKind
+{
+ LOGICALREP_KEY_REPLICA_IDENTITY,
+ LOGICALREP_KEY_LOCAL_UNIQUE
+} LogicalRepKeyKind;
+
typedef struct ReplicaIdentityKey
{
Oid relid;
+ LogicalRepKeyKind kind;
LogicalRepTupleData *data;
} ReplicaIdentityKey;
@@ -710,7 +720,8 @@ static bool
hash_replica_identity_compare(ReplicaIdentityKey *a, ReplicaIdentityKey *b)
{
if (a->relid != b->relid ||
- a->data->ncols != b->data->ncols)
+ a->data->ncols != b->data->ncols ||
+ a->kind != b->kind)
return false;
for (int i = 0; i < a->data->ncols; i++)
@@ -718,6 +729,9 @@ hash_replica_identity_compare(ReplicaIdentityKey *a, ReplicaIdentityKey *b)
if (a->data->colstatus[i] != b->data->colstatus[i])
return false;
+ if (a->data->colstatus[i] == LOGICALREP_COLUMN_NULL)
+ continue;
+
if (a->data->colvalues[i].len != b->data->colvalues[i].len)
return false;
@@ -839,6 +853,93 @@ check_and_append_xid_dependency(List *depends_on_xids,
return lappend_xid(depends_on_xids, *depends_on_xid);
}
+/*
+ * Common function for registering dependency on a key. Used by both
+ * check_dependency_on_replica_identity and check_dependency_on_local_key.
+ */
+static void
+register_dependency_with_key(ReplicaIdentityKey *key, TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ ReplicaIdentityEntry *rientry;
+ bool found = false;
+ MemoryContext oldctx;
+
+ oldctx = MemoryContextSwitchTo(ApplyContext);
+
+ if (TransactionIdIsValid(new_depended_xid))
+ {
+ rientry = replica_identity_insert(replica_identity_table, key,
+ &found);
+
+ /*
+ * Release the key built to search the entry, if the entry already
+ * exists. Otherwise, initialize the remote_xid.
+ */
+ if (found)
+ {
+ elog(DEBUG1,
+ key->kind == LOGICALREP_KEY_REPLICA_IDENTITY ?
+ "found conflicting replica identity change from %u" :
+ "found conflicting local unique change from %u",
+ rientry->remote_xid);
+
+ free_replica_identity_key(key);
+ }
+ else
+ rientry->remote_xid = InvalidTransactionId;
+ }
+ else
+ {
+ rientry = replica_identity_lookup(replica_identity_table, key);
+ free_replica_identity_key(key);
+ }
+
+ MemoryContextSwitchTo(oldctx);
+
+ /* Return if no entry found */
+ if (!rientry)
+ return;
+
+ Assert(!found || TransactionIdIsValid(rientry->remote_xid));
+
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &rientry->remote_xid,
+ new_depended_xid);
+
+ /*
+ * Remove the entry if it is registered for a streamed transaction. We
+ * do not have to register an entry for them; the leader worker always
+ * waits until the parallel worker finishes handling streamed transactions,
+ * thus there is no need to consider the possibility that upcoming parallel
+ * workers would go ahead.
+ */
+ if (TransactionIdIsValid(stream_xid) && !found)
+ {
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+ }
+
+ /*
+ * Update the new depended xid into the entry if valid, the new xid could
+ * be invalid if the transaction will be applied by the leader itself
+ * which means all the changes will be committed before processing next
+ * transaction, so no need to be depended on.
+ */
+ else if (TransactionIdIsValid(new_depended_xid))
+ rientry->remote_xid = new_depended_xid;
+
+ /*
+ * Remove the entry if the transaction has been committed and no new
+ * dependency needs to be added.
+ */
+ else if (!TransactionIdIsValid(rientry->remote_xid))
+ {
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+ }
+}
+
/*
* Check for dependencies on preceding transactions that modify the same key.
* Returns the dependent transactions in 'depends_on_xids' and records the
@@ -853,10 +954,8 @@ check_dependency_on_replica_identity(Oid relid,
LogicalRepRelMapEntry *relentry;
LogicalRepTupleData *ridata;
ReplicaIdentityKey *rikey;
- ReplicaIdentityEntry *rientry;
MemoryContext oldctx;
int n_ri;
- bool found = false;
Assert(depends_on_xids);
@@ -922,75 +1021,125 @@ check_dependency_on_replica_identity(Oid relid,
rikey = palloc0_object(ReplicaIdentityKey);
rikey->relid = relid;
+ rikey->kind = LOGICALREP_KEY_REPLICA_IDENTITY;
rikey->data = ridata;
- if (TransactionIdIsValid(new_depended_xid))
+ MemoryContextSwitchTo(oldctx);
+
+ register_dependency_with_key(rikey, new_depended_xid,
+ depends_on_xids);
+}
+
+/*
+ * Mostly same as check_dependency_on_replica_identity() but for local unique
+ * indexes.
+ */
+static void
+check_dependency_on_local_key(Oid relid,
+ LogicalRepTupleData *original_data,
+ TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ LogicalRepRelMapEntry *relentry;
+ LogicalRepTupleData *ridata;
+ ReplicaIdentityKey *rikey;
+ MemoryContext oldctx;
+
+ Assert(depends_on_xids);
+
+ /* Search for existing entry */
+ relentry = logicalrep_get_relentry(relid);
+
+ Assert(relentry);
+
+ /*
+ * Gather information about local indexes if not done yet. We must be in a
+ * transaction state because system catalogs are read.
+ */
+ if (!relentry->local_unique_indexes_collected)
{
- rientry = replica_identity_insert(replica_identity_table, rikey,
- &found);
+ bool needs_start = !IsTransactionOrTransactionBlock();
+
+ if (needs_start)
+ StartTransactionCommand();
+
+ logicalrep_rel_close(logicalrep_rel_open(relid, AccessShareLock),
+ AccessShareLock);
/*
- * Release the key built to search the entry, if the entry already
- * exists. Otherwise, initialize the remote_xid.
+ * Close the transaction if we started it here. We must not abort because it
+ * would release all session-level locks, such as the stream lock, and
+ * break the deadlock detection mechanism between LA and PA. The
+ * outcome is the same regardless of the end status, since the
+ * transaction did not modify any tuples.
*/
- if (found)
- {
- elog(DEBUG1, "found conflicting replica identity change from %u",
- rientry->remote_xid);
+ if (needs_start)
+ CommitTransactionCommand();
- free_replica_identity_key(rikey);
- }
- else
- rientry->remote_xid = InvalidTransactionId;
+ Assert(relentry->local_unique_indexes_collected);
}
- else
+
+ foreach_ptr(LogicalRepSubscriberIdx, idxinfo, relentry->local_unique_indexes)
{
- rientry = replica_identity_lookup(replica_identity_table, rikey);
- free_replica_identity_key(rikey);
- }
+ int columns = bms_num_members(idxinfo->indexkeys);
+ bool suitable = true;
- MemoryContextSwitchTo(oldctx);
+ Assert(columns);
- /* Return if no entry found */
- if (!rientry)
- return;
+ for (int i = 0; i < original_data->ncols; i++)
+ {
+ if (!bms_is_member(i, idxinfo->indexkeys))
+ continue;
- Assert(!found || TransactionIdIsValid(rientry->remote_xid));
+ /*
+ * Skip if the column is not changed.
+ *
+ * XXX: NULL is allowed.
+ */
+ if (original_data->colstatus[i] == LOGICALREP_COLUMN_UNCHANGED)
+ {
+ suitable = false;
+ break;
+ }
+ }
- *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
- &rientry->remote_xid,
- new_depended_xid);
+ if (!suitable)
+ continue;
- /*
- * Remove the entry if it is registered for a streamed transaction. We
- * do not have to register an entry for them; the leader worker always
- * waits until the parallel worker finishes handling streamed transactions,
- * thus there is no need to consider the possibility that upcoming parallel
- * workers would go ahead.
- */
- if (TransactionIdIsValid(stream_xid) && !found)
- {
- free_replica_identity_key(rientry->keydata);
- replica_identity_delete_item(replica_identity_table, rientry);
- }
+ oldctx = MemoryContextSwitchTo(ApplyContext);
- /*
- * Update the new depended xid into the entry if valid, the new xid could
- * be invalid if the transaction will be applied by the leader itself
- * which means all the changes will be committed before processing next
- * transaction, so no need to be depended on.
- */
- else if (TransactionIdIsValid(new_depended_xid))
- rientry->remote_xid = new_depended_xid;
+ /* Allocate space for replica identity values */
+ ridata = palloc0_object(LogicalRepTupleData);
+ ridata->colvalues = palloc0_array(StringInfoData, columns);
+ ridata->colstatus = palloc0_array(char, columns);
+ ridata->ncols = columns;
- /*
- * Remove the entry if the transaction has been committed and no new
- * dependency needs to be added.
- */
- else if (!TransactionIdIsValid(rientry->remote_xid))
- {
- free_replica_identity_key(rientry->keydata);
- replica_identity_delete_item(replica_identity_table, rientry);
+ for (int i_original = 0, i_key = 0; i_original < original_data->ncols; i_original++)
+ {
+ if (!bms_is_member(i_original, idxinfo->indexkeys))
+ continue;
+
+ if (original_data->colstatus[i_original] != LOGICALREP_COLUMN_NULL)
+ {
+ StringInfo original_colvalue = &original_data->colvalues[i_original];
+
+ initStringInfoExt(&ridata->colvalues[i_key], original_colvalue->len + 1);
+ appendStringInfoString(&ridata->colvalues[i_key], original_colvalue->data);
+ }
+
+ ridata->colstatus[i_key] = original_data->colstatus[i_original];
+ i_key++;
+ }
+
+ rikey = palloc0_object(ReplicaIdentityKey);
+ rikey->relid = relid;
+ rikey->kind = LOGICALREP_KEY_LOCAL_UNIQUE;
+ rikey->data = ridata;
+
+ MemoryContextSwitchTo(oldctx);
+
+ register_dependency_with_key(rikey, new_depended_xid,
+ depends_on_xids);
}
}
@@ -1120,6 +1269,9 @@ handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
check_dependency_on_replica_identity(relid, &newtup,
new_depended_xid,
&depends_on_xids);
+ check_dependency_on_local_key(relid, &newtup,
+ new_depended_xid,
+ &depends_on_xids);
break;
case LOGICAL_REP_MSG_UPDATE:
@@ -1127,13 +1279,21 @@ handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
&newtup);
if (has_oldtup)
+ {
check_dependency_on_replica_identity(relid, &oldtup,
new_depended_xid,
&depends_on_xids);
+ check_dependency_on_local_key(relid, &oldtup,
+ new_depended_xid,
+ &depends_on_xids);
+ }
check_dependency_on_replica_identity(relid, &newtup,
new_depended_xid,
&depends_on_xids);
+ check_dependency_on_local_key(relid, &newtup,
+ new_depended_xid,
+ &depends_on_xids);
break;
case LOGICAL_REP_MSG_DELETE:
@@ -1141,6 +1301,9 @@ handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
check_dependency_on_replica_identity(relid, &oldtup,
new_depended_xid,
&depends_on_xids);
+ check_dependency_on_local_key(relid, &oldtup,
+ new_depended_xid,
+ &depends_on_xids);
break;
case LOGICAL_REP_MSG_TRUNCATE:
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index c4bfaaa67ac..ca7dee52b32 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -33,7 +33,6 @@
#include "storage/procnumber.h"
#include "utils/memutils.h"
-
/*
* One edge in the waits-for graph.
*
diff --git a/src/include/replication/logicalrelation.h b/src/include/replication/logicalrelation.h
index 34a7069e9e5..32152ef3833 100644
--- a/src/include/replication/logicalrelation.h
+++ b/src/include/replication/logicalrelation.h
@@ -16,6 +16,12 @@
#include "catalog/index.h"
#include "replication/logicalproto.h"
+typedef struct LogicalRepSubscriberIdx
+{
+ Oid indexoid; /* OID of the local key */
+ Bitmapset *indexkeys; /* Bitmap of key columns *on remote* */
+} LogicalRepSubscriberIdx;
+
typedef struct LogicalRepRelMapEntry
{
LogicalRepRelation remoterel; /* key is remoterel.remoteid */
@@ -39,6 +45,10 @@ typedef struct LogicalRepRelMapEntry
XLogRecPtr statelsn;
TransactionId last_depended_xid;
+
+ /* Local unique indexes. Used for dependency tracking */
+ List *local_unique_indexes;
+ bool local_unique_indexes_collected;
} LogicalRepRelMapEntry;
extern void logicalrep_relmap_update(LogicalRepRelation *remoterel);
diff --git a/src/test/subscription/t/050_parallel_apply.pl b/src/test/subscription/t/050_parallel_apply.pl
index 20e8a7b91a7..e489a4bdc1e 100644
--- a/src/test/subscription/t/050_parallel_apply.pl
+++ b/src/test/subscription/t/050_parallel_apply.pl
@@ -231,4 +231,47 @@ $node_subscriber->wait_for_log(qr/finish waiting for depended xid $xid/, $offset
$h->query_safe("COMMIT;");
+# Ensure subscriber-local indexes are also used for dependency tracking
+
+# Truncate the data for upcoming tests
+$node_publisher->safe_psql('postgres', "TRUNCATE TABLE regress_tab;");
+$node_publisher->wait_for_catchup('regress_sub');
+
+# Define a unique index on the subscriber
+$node_subscriber->safe_psql('postgres',
+ "CREATE INDEX ON regress_tab (value);");
+
+# Attach an injection point. Parallel workers will wait before the commit
+$node_subscriber->safe_psql('postgres',
+ "SELECT injection_points_attach('parallel-worker-before-commit','wait');"
+);
+
+# Insert a tuple on the publisher. The parallel worker will wait at the
+# injection point
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (1, 'would conflict');");
+
+# Wait until the parallel worker enters the injection point.
+$node_subscriber->wait_for_event('logical replication parallel worker',
+ 'parallel-worker-before-commit');
+
+$offset = -s $node_subscriber->logfile;
+
+# Insert a tuple on the publisher again. This transaction would wait because
+# all parallel workers wait until the previously launched worker commits.
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (2, 'would not conflict');");
+
+# Verify the parallel worker waits for the transaction
+$str = $node_subscriber->wait_for_log(qr/wait for depended xid ([1-9][0-9]+)/, $offset);
+($xid) = $str =~ /wait for depended xid ([1-9][0-9]+)/;
+
+# Insert a conflicting tuple on the publisher. The leader would detect the
+# conflict and tell the applying worker to wait for the earlier transaction
+# to commit.
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (3, 'would conflict');");
+
+# Verify the parallel worker waits for the same transaction
+$node_subscriber->wait_for_log(qr/wait for depended xid $xid/, $offset);
+
done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d9ec32de3ea..ecd4845f389 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1636,6 +1636,7 @@ LogicalRepBeginData
LogicalRepCommitData
LogicalRepCommitPreparedTxnData
LogicalRepCtxStruct
+LogicalRepKeyKind
LogicalRepMsgType
LogicalRepPartMapEntry
LogicalRepPreparedTxnData
@@ -1645,6 +1646,7 @@ LogicalRepRelation
LogicalRepRollbackPreparedTxnData
LogicalRepSequenceInfo
LogicalRepStreamAbortData
+LogicalRepSubscriberIdx
LogicalRepTupleData
LogicalRepTyp
LogicalRepWorker
--
2.47.3
Dear Hackers,
Here is a rebased version.
Since the parallel worker's bug has been fixed, that patch is no longer attached.
0006 contains changes to handle the case where user-defined triggers are not
immutable. Some triggers may change their behavior based on the number of tuples
or other internal state. To keep the result consistent with the non-parallel
case, parallel workers wait to apply changes until the previous transaction is
committed if the target relation has such triggers (a sketch follows below).
Note that we assume CHECK constraints are immutable, so they are not checked.
I think this is a reasonable assumption because it is already described in
the docs [1].
(This does not contain tests yet)
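To illustrate the kind of trigger 0006 guards against, here is a minimal
sketch (all object names are invented for illustration); the trigger's effect
depends on how many rows exist when it fires, so a parallel apply could record
different audit values than a serial apply:

CREATE TABLE regress_t (a int PRIMARY KEY);
CREATE TABLE regress_audit (n bigint);

-- Records the current row count of regress_t at firing time, so its
-- effect depends on the order and interleaving of applied transactions.
CREATE FUNCTION regress_count_rows() RETURNS trigger AS $$
BEGIN
    INSERT INTO regress_audit SELECT count(*) FROM regress_t;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER regress_trg AFTER INSERT ON regress_t
    FOR EACH ROW EXECUTE FUNCTION regress_count_rows();

With such a relation, the parallel worker falls back to waiting for the
previous transaction to commit instead of applying concurrently.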
0007 contains changes to track dependencies via local indexes. It is mostly the
same as v5-0008. Since I cannot find a reasonable way to compute a hash for
expression indexes, these indexes are no longer used for tracking. Instead, the
parallel worker waits to apply changes until the previous transaction is
committed if the target relation has such an index (see the example below).
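To make the expression-index limitation concrete, here is a small example
(table and column names are invented): a plain unique index can be tracked
because its key bytes appear verbatim in the replicated tuple, while an
expression index's key only exists after evaluating the expression, which the
leader cannot reasonably do for every change:

CREATE TABLE regress_tab (id int PRIMARY KEY, value text);

-- Trackable: the key bytes are present in the replicated tuple.
CREATE UNIQUE INDEX ON regress_tab (value);

-- Not trackable: the key is lower(value), so computing a hash would
-- require evaluating the expression on the leader for each change.
CREATE UNIQUE INDEX ON regress_tab (lower(value));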
[1]: https://www.postgresql.org/docs/current/ddl-constraints.html
Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachments:
v6-0001-Introduce-a-shared-hash-table-to-store-paralleliz.patch
From 4e6589e8848b7c2de6e6b5f12766eb4674302fec Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Mon, 1 Dec 2025 16:28:38 +0900
Subject: [PATCH v6 1/7] Introduce a shared hash table to store parallelized
transactions
This hash table is used for ensuring that parallel workers wait until dependent
transactions are committed.
The shared hash table contains the transaction IDs that the leader has
allocated to parallel workers. Entries are inserted, keyed by the remote XID,
when the leader dispatches remote transactions to parallel apply workers, and
deleted when the parallel workers commit the corresponding transactions.
When a parallel worker needs to wait for another transaction, it checks the
hash table for that remote XID; it can proceed only once the entry has been
removed from the hash.
Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
---
.../replication/logical/applyparallelworker.c | 100 +++++++++++++++++-
.../utils/activity/wait_event_names.txt | 1 +
src/include/replication/worker_internal.h | 4 +
src/include/storage/lwlocklist.h | 1 +
4 files changed, 105 insertions(+), 1 deletion(-)
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index 055feea0bc5..6ca5f778a3b 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -218,12 +218,35 @@ typedef struct ParallelApplyWorkerEntry
ParallelApplyWorkerInfo *winfo;
} ParallelApplyWorkerEntry;
+/* an entry in the parallelized_txns shared hash table */
+typedef struct ParallelizedTxnEntry
+{
+ TransactionId xid; /* Hash key */
+} ParallelizedTxnEntry;
+
/*
* A hash table used to cache the state of streaming transactions being applied
* by the parallel apply workers.
*/
static HTAB *ParallelApplyTxnHash = NULL;
+/*
+ * A hash table used to track the parallelized transactions that could be
+ * depended on by other transactions.
+ */
+static dsa_area *parallel_apply_dsa_area = NULL;
+static dshash_table *parallelized_txns = NULL;
+
+/* parameters for the parallelized_txns shared hash table */
+static const dshash_parameters dsh_params = {
+ sizeof(TransactionId),
+ sizeof(ParallelizedTxnEntry),
+ dshash_memcmp,
+ dshash_memhash,
+ dshash_memcpy,
+ LWTRANCHE_PARALLEL_APPLY_DSA
+};
+
/*
* A list (pool) of active parallel apply workers. The information for
* the new worker is added to the list after successfully launching it. The
@@ -257,6 +280,8 @@ static List *subxactlist = NIL;
static void pa_free_worker_info(ParallelApplyWorkerInfo *winfo);
static ParallelTransState pa_get_xact_state(ParallelApplyWorkerShared *wshared);
static PartialFileSetState pa_get_fileset_state(void);
+static void pa_attach_parallelized_txn_hash(dsa_handle *pa_dsa_handle,
+ dshash_table_handle *pa_dshash_handle);
/*
* Returns true if it is OK to start a parallel apply worker, false otherwise.
@@ -334,6 +359,15 @@ pa_setup_dsm(ParallelApplyWorkerInfo *winfo)
shm_mq *mq;
Size queue_size = DSM_QUEUE_SIZE;
Size error_queue_size = DSM_ERROR_QUEUE_SIZE;
+ dsa_handle parallel_apply_dsa_handle;
+ dshash_table_handle parallelized_txns_handle;
+
+ pa_attach_parallelized_txn_hash(¶llel_apply_dsa_handle,
+ ¶llelized_txns_handle);
+
+ if (parallel_apply_dsa_handle == DSA_HANDLE_INVALID ||
+ parallelized_txns_handle == DSHASH_HANDLE_INVALID)
+ return false;
/*
* Estimate how much shared memory we need.
@@ -369,6 +403,8 @@ pa_setup_dsm(ParallelApplyWorkerInfo *winfo)
pg_atomic_init_u32(&(shared->pending_stream_count), 0);
shared->last_commit_end = InvalidXLogRecPtr;
shared->fileset_state = FS_EMPTY;
+ shared->parallel_apply_dsa_handle = parallel_apply_dsa_handle;
+ shared->parallelized_txns_handle = parallelized_txns_handle;
shm_toc_insert(toc, PARALLEL_APPLY_KEY_SHARED, shared);
@@ -864,6 +900,8 @@ ParallelApplyWorkerMain(Datum main_arg)
shm_mq *mq;
shm_mq_handle *mqh;
shm_mq_handle *error_mqh;
+ dsa_handle pa_dsa_handle;
+ dshash_table_handle pa_dshash_handle;
RepOriginId originid;
int worker_slot = DatumGetInt32(main_arg);
char originname[NAMEDATALEN];
@@ -951,6 +989,8 @@ ParallelApplyWorkerMain(Datum main_arg)
InitializingApplyWorker = false;
+ pa_attach_parallelized_txn_hash(&pa_dsa_handle, &pa_dshash_handle);
+
/* Setup replication origin tracking. */
StartTransactionCommand();
ReplicationOriginNameForLogicalRep(MySubscription->oid, InvalidOid,
@@ -1646,6 +1686,51 @@ pa_xact_finish(ParallelApplyWorkerInfo *winfo, XLogRecPtr remote_lsn)
pa_free_worker(winfo);
}
+/*
+ * Attach to the shared hash table for parallelized transactions.
+ */
+static void
+pa_attach_parallelized_txn_hash(dsa_handle *pa_dsa_handle,
+ dshash_table_handle *pa_dshash_handle)
+{
+ MemoryContext oldctx;
+
+ if (parallelized_txns)
+ {
+ Assert(parallel_apply_dsa_area);
+ *pa_dsa_handle = dsa_get_handle(parallel_apply_dsa_area);
+ *pa_dshash_handle = dshash_get_hash_table_handle(parallelized_txns);
+ return;
+ }
+
+ /* Be sure any local memory allocated by DSA routines is persistent. */
+ oldctx = MemoryContextSwitchTo(ApplyContext);
+
+ if (am_leader_apply_worker())
+ {
+ /* Initialize the dynamic shared hash table for parallelized transactions. */
+ parallel_apply_dsa_area = dsa_create(LWTRANCHE_PARALLEL_APPLY_DSA);
+ dsa_pin(parallel_apply_dsa_area);
+ dsa_pin_mapping(parallel_apply_dsa_area);
+ parallelized_txns = dshash_create(parallel_apply_dsa_area, &dsh_params, NULL);
+
+ /* Store handles in shared memory for other backends to use. */
+ *pa_dsa_handle = dsa_get_handle(parallel_apply_dsa_area);
+ *pa_dshash_handle = dshash_get_hash_table_handle(parallelized_txns);
+ }
+ else if (am_parallel_apply_worker())
+ {
+ /* Attach to existing dynamic shared hash table. */
+ parallel_apply_dsa_area = dsa_attach(MyParallelShared->parallel_apply_dsa_handle);
+ dsa_pin_mapping(parallel_apply_dsa_area);
+ parallelized_txns = dshash_attach(parallel_apply_dsa_area, &dsh_params,
+ MyParallelShared->parallelized_txns_handle,
+ NULL);
+ }
+
+ MemoryContextSwitchTo(oldctx);
+}
+
/*
* Wait for the given transaction to finish.
*/
@@ -1656,7 +1741,20 @@ pa_wait_for_depended_transaction(TransactionId xid)
for (;;)
{
- /* XXX wait until given transaction is finished */
+ ParallelizedTxnEntry *txn_entry;
+
+ txn_entry = dshash_find(parallelized_txns, &xid, false);
+
+ /* The entry is removed only if the transaction is committed */
+ if (txn_entry == NULL)
+ break;
+
+ dshash_release_lock(parallelized_txns, txn_entry);
+
+ pa_lock_transaction(xid, AccessShareLock);
+ pa_unlock_transaction(xid, AccessShareLock);
+
+ CHECK_FOR_INTERRUPTS();
}
elog(DEBUG1, "finish waiting for depended xid %u", xid);
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index dcfadbd5aae..53b87a2df10 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -406,6 +406,7 @@ SubtransSLRU "Waiting to access the sub-transaction SLRU cache."
XactSLRU "Waiting to access the transaction status SLRU cache."
ParallelVacuumDSA "Waiting for parallel vacuum dynamic shared memory allocation."
AioUringCompletion "Waiting for another process to complete IO via io_uring."
+ParallelApplyDSA "Waiting for parallel apply dynamic shared memory allocation."
# No "ABI_compatibility" region here as WaitEventLWLock has its own C code.
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index a3526eae578..ddcdcc05053 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -15,6 +15,7 @@
#include "access/xlogdefs.h"
#include "catalog/pg_subscription.h"
#include "datatype/timestamp.h"
+#include "lib/dshash.h"
#include "miscadmin.h"
#include "replication/logicalrelation.h"
#include "replication/walreceiver.h"
@@ -197,6 +198,9 @@ typedef struct ParallelApplyWorkerShared
*/
PartialFileSetState fileset_state;
FileSet fileset;
+
+ dsa_handle parallel_apply_dsa_handle;
+ dshash_table_handle parallelized_txns_handle;
} ParallelApplyWorkerShared;
/*
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 533344509e9..e16295e5a3b 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -137,3 +137,4 @@ PG_LWLOCKTRANCHE(SUBTRANS_SLRU, SubtransSLRU)
PG_LWLOCKTRANCHE(XACT_SLRU, XactSLRU)
PG_LWLOCKTRANCHE(PARALLEL_VACUUM_DSA, ParallelVacuumDSA)
PG_LWLOCKTRANCHE(AIO_URING_COMPLETION, AioUringCompletion)
+PG_LWLOCKTRANCHE(PARALLEL_APPLY_DSA, ParallelApplyDSA)
--
2.47.3
v6-0002-Introduce-a-local-hash-table-to-store-replica-ide.patch
From deba281b9fc6022f884d59ecbd3877598fba5ceb Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Mon, 1 Dec 2025 16:39:02 +0900
Subject: [PATCH v6 2/7] Introduce a local hash table to store replica
identities
This local hash table on the leader is used for detecting dependencies between
transactions.
The hash is keyed by the Replica Identity (RI), with the remote XID that
modified the corresponding tuple as the value. Entries are inserted when the
leader extracts an RI from a replication message. Entries are deleted when the
transactions committed by parallel workers are gathered, or when the number of
entries exceeds a limit.
When the leader sends replication changes to parallel workers, it checks whether
other transactions have already used the RI associated with the change. If
a match is found, the leader treats the current transaction as dependent and
notifies the parallel worker, via LOGICAL_REP_MSG_INTERNAL_DEPENDENCY, to wait
until the earlier transaction finishes.
Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
---
.../replication/logical/applyparallelworker.c | 123 +++-
src/backend/replication/logical/relation.c | 24 +
src/backend/replication/logical/worker.c | 616 +++++++++++++++++-
src/include/replication/logicalrelation.h | 3 +
src/include/replication/worker_internal.h | 8 +-
5 files changed, 771 insertions(+), 3 deletions(-)
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index 6ca5f778a3b..cf08206d9fd 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -216,6 +216,7 @@ typedef struct ParallelApplyWorkerEntry
{
TransactionId xid; /* Hash key -- must be first */
ParallelApplyWorkerInfo *winfo;
+ XLogRecPtr local_end;
} ParallelApplyWorkerEntry;
/* an entry in the parallelized_txns shared hash table */
@@ -504,7 +505,7 @@ pa_launch_parallel_worker(void)
* streaming changes.
*/
void
-pa_allocate_worker(TransactionId xid)
+pa_allocate_worker(TransactionId xid, bool stream_txn)
{
bool found;
ParallelApplyWorkerInfo *winfo = NULL;
@@ -545,7 +546,9 @@ pa_allocate_worker(TransactionId xid)
winfo->in_use = true;
winfo->serialize_changes = false;
+ winfo->stream_txn = stream_txn;
entry->winfo = winfo;
+ entry->local_end = InvalidXLogRecPtr;
}
/*
@@ -742,6 +745,73 @@ pa_process_spooled_messages_if_required(void)
return true;
}
+/*
+ * Get the local end LSN for a transaction applied by a parallel apply worker.
+ *
+ * Set delete_entry to true if you intend to remove the transaction from the
+ * ParallelApplyTxnHash after collecting its LSN.
+ *
+ * If the parallel apply worker did not write any changes during the transaction
+ * application due to situations like update/delete_missing or a before trigger,
+ * the *skipped_write will be set to true.
+ */
+XLogRecPtr
+pa_get_last_commit_end(TransactionId xid, bool delete_entry, bool *skipped_write)
+{
+ bool found;
+ ParallelApplyWorkerEntry *entry;
+ ParallelApplyWorkerInfo *winfo;
+
+ Assert(TransactionIdIsValid(xid));
+
+ if (skipped_write)
+ *skipped_write = false;
+
+ /* Find an entry for the requested transaction. */
+ entry = hash_search(ParallelApplyTxnHash, &xid, HASH_FIND, &found);
+
+ if (!found)
+ return InvalidXLogRecPtr;
+
+ /*
+ * If worker info is NULL, it indicates that the worker has been reused
+ * for handling other transactions. Consequently, the local end LSN has
+ * already been collected and saved in entry->local_end.
+ */
+ winfo = entry->winfo;
+ if (winfo == NULL)
+ {
+ if (delete_entry &&
+ !hash_search(ParallelApplyTxnHash, &xid, HASH_REMOVE, NULL))
+ elog(ERROR, "hash table corrupted");
+
+ if (skipped_write)
+ *skipped_write = XLogRecPtrIsInvalid(entry->local_end);
+
+ return entry->local_end;
+ }
+
+ /* Return InvalidXLogRecPtr if the transaction is still in progress */
+ if (pa_get_xact_state(winfo->shared) != PARALLEL_TRANS_FINISHED)
+ return InvalidXLogRecPtr;
+
+ /* Collect the local end LSN from the worker's shared memory area */
+ entry->local_end = winfo->shared->last_commit_end;
+ entry->winfo = NULL;
+
+ if (skipped_write)
+ *skipped_write = XLogRecPtrIsInvalid(entry->local_end);
+
+ elog(DEBUG1, "store local commit %X/%X end to txn entry: %u",
+ LSN_FORMAT_ARGS(entry->local_end), xid);
+
+ if (delete_entry &&
+ !hash_search(ParallelApplyTxnHash, &xid, HASH_REMOVE, NULL))
+ elog(ERROR, "hash table corrupted");
+
+ return entry->local_end;
+}
+
/*
* Interrupt handler for main loop of parallel apply worker.
*/
@@ -1686,6 +1756,26 @@ pa_xact_finish(ParallelApplyWorkerInfo *winfo, XLogRecPtr remote_lsn)
pa_free_worker(winfo);
}
+bool
+pa_transaction_committed(TransactionId xid)
+{
+ bool found;
+ ParallelApplyWorkerEntry *entry;
+
+ Assert(TransactionIdIsValid(xid));
+
+ /* Find an entry for the requested transaction */
+ entry = hash_search(ParallelApplyTxnHash, &xid, HASH_FIND, &found);
+
+ if (!found)
+ return true;
+
+ if (!entry->winfo)
+ return true;
+
+ return pa_get_xact_state(entry->winfo->shared) == PARALLEL_TRANS_FINISHED;
+}
+
/*
* Attach to the shared hash table for parallelized transactions.
*/
@@ -1731,6 +1821,37 @@ pa_attach_parallelized_txn_hash(dsa_handle *pa_dsa_handle,
MemoryContextSwitchTo(oldctx);
}
+/*
+ * Record in-progress transactions from the given list that are being depended
+ * on into the shared hash table.
+ */
+void
+pa_record_dependency_on_transactions(List *depends_on_xids)
+{
+ foreach_xid(xid, depends_on_xids)
+ {
+ bool found;
+ ParallelApplyWorkerEntry *winfo_entry;
+ ParallelApplyWorkerInfo *winfo;
+ ParallelizedTxnEntry *txn_entry;
+
+ winfo_entry = hash_search(ParallelApplyTxnHash, &xid, HASH_FIND, &found);
+ winfo = winfo_entry->winfo;
+
+ txn_entry = dshash_find_or_insert(parallelized_txns, &xid, &found);
+
+ /*
+ * If the transaction has already committed, remove the entry now;
+ * otherwise, the parallel apply worker will remove it once it has
+ * committed the transaction.
+ */
+ if (pa_get_xact_state(winfo->shared) == PARALLEL_TRANS_FINISHED)
+ dshash_delete_entry(parallelized_txns, txn_entry);
+ else
+ dshash_release_lock(parallelized_txns, txn_entry);
+ }
+}
+
/*
* Wait for the given transaction to finish.
*/
diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index 2c8485b881f..13f8cb74e9f 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -959,3 +959,27 @@ FindLogicalRepLocalIndex(Relation localrel, LogicalRepRelation *remoterel,
return InvalidOid;
}
+
+/*
+ * Get the LogicalRepRelMapEntry corresponding to the given relid without
+ * opening the local relation.
+ */
+LogicalRepRelMapEntry *
+logicalrep_get_relentry(LogicalRepRelId remoteid)
+{
+ LogicalRepRelMapEntry *entry;
+ bool found;
+
+ if (LogicalRepRelMap == NULL)
+ logicalrep_relmap_init();
+
+ /* Search for existing entry. */
+ entry = hash_search(LogicalRepRelMap, (void *) &remoteid,
+ HASH_FIND, &found);
+
+ if (!found)
+ elog(DEBUG1, "no relation map entry for remote relation ID %u",
+ remoteid);
+
+ return entry;
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 73d38644c4a..0b1eeefe9c9 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -303,6 +303,7 @@ typedef struct FlushPosition
dlist_node node;
XLogRecPtr local_end;
XLogRecPtr remote_end;
+ TransactionId pa_remote_xid;
} FlushPosition;
static dlist_head lsn_mapping = DLIST_STATIC_INIT(lsn_mapping);
@@ -544,6 +545,49 @@ typedef struct ApplySubXactData
static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};
+typedef struct ReplicaIdentityKey
+{
+ Oid relid;
+ LogicalRepTupleData *data;
+} ReplicaIdentityKey;
+
+typedef struct ReplicaIdentityEntry
+{
+ ReplicaIdentityKey *keydata;
+ TransactionId remote_xid;
+
+ /* needed for simplehash */
+ uint32 hash;
+ char status;
+} ReplicaIdentityEntry;
+
+#include "common/hashfn.h"
+
+static uint32 hash_replica_identity(ReplicaIdentityKey *key);
+static bool hash_replica_identity_compare(ReplicaIdentityKey *a,
+ ReplicaIdentityKey *b);
+
+/* Define parameters for replica identity hash table code generation. */
+#define SH_PREFIX replica_identity
+#define SH_ELEMENT_TYPE ReplicaIdentityEntry
+#define SH_KEY_TYPE ReplicaIdentityKey *
+#define SH_KEY keydata
+#define SH_HASH_KEY(tb, key) hash_replica_identity(key)
+#define SH_EQUAL(tb, a, b) hash_replica_identity_compare(a, b)
+#define SH_STORE_HASH
+#define SH_GET_HASH(tb, a) (a)->hash
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+#define REPLICA_IDENTITY_INITIAL_SIZE 128
+#define REPLICA_IDENTITY_CLEANUP_THRESHOLD 1024
+
+static replica_identity_hash *replica_identity_table = NULL;
+
+static void write_internal_dependencies(StringInfo s, List *depends_on_xids);
+
static inline void subxact_filename(char *path, Oid subid, TransactionId xid);
static inline void changes_filename(char *path, Oid subid, TransactionId xid);
@@ -629,6 +673,546 @@ static TransApplyAction get_transaction_apply_action(TransactionId xid,
static void replorigin_reset(int code, Datum arg);
+static bool send_internal_dependencies(ParallelApplyWorkerInfo *winfo,
+ StringInfo s);
+
+/*
+ * Compute the hash value for entries in the replica_identity_table.
+ */
+static uint32
+hash_replica_identity(ReplicaIdentityKey *key)
+{
+ int i;
+ uint32 hashkey = 0;
+
+ hashkey = hash_combine(hashkey, hash_uint32(key->relid));
+
+ for (i = 0; i < key->data->ncols; i++)
+ {
+ uint32 hkey;
+
+ if (key->data->colstatus[i] == LOGICALREP_COLUMN_NULL)
+ continue;
+
+ hkey = hash_any((const unsigned char *) key->data->colvalues[i].data,
+ key->data->colvalues[i].len);
+ hashkey = hash_combine(hashkey, hkey);
+ }
+
+ return hashkey;
+}
+
+/*
+ * Compare two entries in the replica_identity_table.
+ */
+static bool
+hash_replica_identity_compare(ReplicaIdentityKey *a, ReplicaIdentityKey *b)
+{
+ if (a->relid != b->relid ||
+ a->data->ncols != b->data->ncols)
+ return false;
+
+ for (int i = 0; i < a->data->ncols; i++)
+ {
+ if (a->data->colstatus[i] != b->data->colstatus[i])
+ return false;
+
+ if (a->data->colvalues[i].len != b->data->colvalues[i].len)
+ return false;
+
+ if (strcmp(a->data->colvalues[i].data, b->data->colvalues[i].data))
+ return false;
+
+ elog(DEBUG1, "conflicting key %s", a->data->colvalues[i].data);
+ }
+
+ return true;
+}
+
+/*
+ * Free resources associated with a replica identity key.
+ */
+static void
+free_replica_identity_key(ReplicaIdentityKey *key)
+{
+ Assert(key);
+
+ pfree(key->data->colvalues);
+ pfree(key->data->colstatus);
+ pfree(key->data);
+ pfree(key);
+}
+
+/*
+ * Clean up hash table entries associated with the given transaction IDs.
+ */
+static void
+cleanup_replica_identity_table(List *committed_xid)
+{
+ replica_identity_iterator i;
+ ReplicaIdentityEntry *rientry;
+
+ if (!committed_xid)
+ return;
+
+ replica_identity_start_iterate(replica_identity_table, &i);
+ while ((rientry = replica_identity_iterate(replica_identity_table, &i)) != NULL)
+ {
+ if (!list_member_xid(committed_xid, rientry->remote_xid))
+ continue;
+
+ /* Clean up the hash entry for committed transaction */
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+ }
+}
+
+/*
+ * Check committed transactions and clean up corresponding entries in the hash
+ * table.
+ */
+static void
+cleanup_committed_replica_identity_entries(void)
+{
+ dlist_mutable_iter iter;
+ List *committed_xids = NIL;
+
+ dlist_foreach_modify(iter, &lsn_mapping)
+ {
+ FlushPosition *pos =
+ dlist_container(FlushPosition, node, iter.cur);
+ bool skipped_write;
+
+ if (!TransactionIdIsValid(pos->pa_remote_xid) ||
+ !XLogRecPtrIsInvalid(pos->local_end))
+ continue;
+
+ pos->local_end = pa_get_last_commit_end(pos->pa_remote_xid, true,
+ &skipped_write);
+
+ elog(DEBUG1,
+ "got commit end from parallel apply worker, "
+ "txn: %u, remote_end %X/%X, local_end %X/%X",
+ pos->pa_remote_xid, LSN_FORMAT_ARGS(pos->remote_end),
+ LSN_FORMAT_ARGS(pos->local_end));
+
+ if (!skipped_write && XLogRecPtrIsInvalid(pos->local_end))
+ continue;
+
+ committed_xids = lappend_xid(committed_xids, pos->pa_remote_xid);
+ }
+
+ /* cleanup the entries for committed transactions */
+ cleanup_replica_identity_table(committed_xids);
+}
+
+/*
+ * Append a transaction dependency, excluding duplicates and committed
+ * transactions.
+ */
+static List *
+check_and_append_xid_dependency(List *depends_on_xids,
+ TransactionId *depends_on_xid,
+ TransactionId current_xid)
+{
+ Assert(depends_on_xid);
+
+ if (!TransactionIdIsValid(*depends_on_xid))
+ return depends_on_xids;
+
+ if (TransactionIdEquals(*depends_on_xid, current_xid))
+ return depends_on_xids;
+
+ if (list_member_xid(depends_on_xids, *depends_on_xid))
+ return depends_on_xids;
+
+ /*
+ * Return and reset the xid if the transaction has been committed.
+ */
+ if (pa_transaction_committed(*depends_on_xid))
+ {
+ *depends_on_xid = InvalidTransactionId;
+ return depends_on_xids;
+ }
+
+ return lappend_xid(depends_on_xids, *depends_on_xid);
+}
+
+/*
+ * Check for dependencies on preceding transactions that modify the same key.
+ * Returns the dependent transactions in 'depends_on_xids' and records the
+ * current change.
+ */
+static void
+check_dependency_on_replica_identity(Oid relid,
+ LogicalRepTupleData *original_data,
+ TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ LogicalRepRelMapEntry *relentry;
+ LogicalRepTupleData *ridata;
+ ReplicaIdentityKey *rikey;
+ ReplicaIdentityEntry *rientry;
+ MemoryContext oldctx;
+ int n_ri;
+ bool found = false;
+
+ Assert(depends_on_xids);
+
+ /* Search for existing entry */
+ relentry = logicalrep_get_relentry(relid);
+
+ Assert(relentry);
+
+ /*
+ * First check whether any previous transaction has affected the whole
+ * table, e.g., a truncate or a schema change from the publisher.
+ */
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &relentry->last_depended_xid,
+ new_depended_xid);
+
+ n_ri = bms_num_members(relentry->remoterel.attkeys);
+
+ /*
+ * Return if there are no replica identity columns, indicating that the
+ * remote relation has neither a replica identity key nor is marked as
+ * replica identity full.
+ */
+ if (!n_ri)
+ return;
+
+ /* Check if the RI key value of the tuple is invalid */
+ for (int i = 0; i < original_data->ncols; i++)
+ {
+ if (!bms_is_member(i, relentry->remoterel.attkeys))
+ continue;
+
+ /*
+ * Return if RI key is NULL or is explicitly marked unchanged. The key
+ * value could be NULL in the new tuple of an update operation, which
+ * means the RI key is not updated.
+ */
+ if (original_data->colstatus[i] == LOGICALREP_COLUMN_NULL ||
+ original_data->colstatus[i] == LOGICALREP_COLUMN_UNCHANGED)
+ return;
+ }
+
+ oldctx = MemoryContextSwitchTo(ApplyContext);
+
+ /* Allocate space for replica identity values */
+ ridata = palloc0_object(LogicalRepTupleData);
+ ridata->colvalues = palloc0_array(StringInfoData, n_ri);
+ ridata->colstatus = palloc0_array(char, n_ri);
+ ridata->ncols = n_ri;
+
+ for (int i_original = 0, i_ri = 0; i_original < original_data->ncols; i_original++)
+ {
+ StringInfo original_colvalue = &original_data->colvalues[i_original];
+
+ if (!bms_is_member(i_original, relentry->remoterel.attkeys))
+ continue;
+
+ initStringInfoExt(&ridata->colvalues[i_ri], original_colvalue->len + 1);
+ appendStringInfoString(&ridata->colvalues[i_ri], original_colvalue->data);
+ ridata->colstatus[i_ri] = original_data->colstatus[i_original];
+ i_ri++;
+ }
+
+ rikey = palloc0_object(ReplicaIdentityKey);
+ rikey->relid = relid;
+ rikey->data = ridata;
+
+ if (TransactionIdIsValid(new_depended_xid))
+ {
+ rientry = replica_identity_insert(replica_identity_table, rikey,
+ &found);
+
+ /*
+ * Release the key built to search the entry, if the entry already
+ * exists. Otherwise, initialize the remote_xid.
+ */
+ if (found)
+ {
+ elog(DEBUG1, "found conflicting replica identity change from %u",
+ rientry->remote_xid);
+
+ free_replica_identity_key(rikey);
+ }
+ else
+ rientry->remote_xid = InvalidTransactionId;
+ }
+ else
+ {
+ rientry = replica_identity_lookup(replica_identity_table, rikey);
+ free_replica_identity_key(rikey);
+ }
+
+ MemoryContextSwitchTo(oldctx);
+
+ /* Return if no entry found */
+ if (!rientry)
+ return;
+
+ Assert(!found || TransactionIdIsValid(rientry->remote_xid));
+
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &rientry->remote_xid,
+ new_depended_xid);
+
+ /*
+ * Update the entry with the new depended xid if it is valid. The new xid
+ * can be invalid if the transaction will be applied by the leader itself,
+ * in which case all changes are committed before the next transaction is
+ * processed, so nothing needs to depend on it.
+ */
+ if (TransactionIdIsValid(new_depended_xid))
+ rientry->remote_xid = new_depended_xid;
+
+ /*
+ * Remove the entry if the transaction has been committed and no new
+ * dependency needs to be added.
+ */
+ else if (!TransactionIdIsValid(rientry->remote_xid))
+ {
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+ }
+}
+
+/*
+ * Check for preceding transactions that involve insert, delete, or update
+ * operations on the specified table, and return them in 'depends_on_xids'.
+ */
+static void
+find_all_dependencies_on_rel(LogicalRepRelId relid, TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ replica_identity_iterator i;
+ ReplicaIdentityEntry *rientry;
+
+ Assert(depends_on_xids);
+
+ replica_identity_start_iterate(replica_identity_table, &i);
+ while ((rientry = replica_identity_iterate(replica_identity_table, &i)) != NULL)
+ {
+ Assert(TransactionIdIsValid(rientry->remote_xid));
+
+ if (rientry->keydata->relid != relid)
+ continue;
+
+ /* Clean up the hash entry for committed transaction while on it */
+ if (pa_transaction_committed(rientry->remote_xid))
+ {
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+
+ continue;
+ }
+
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &rientry->remote_xid,
+ new_depended_xid);
+ }
+}
+
+/*
+ * Check for any preceding transactions that affect the given table and returns
+ * them in 'depends_on_xids'.
+ */
+static void
+check_dependency_on_rel(LogicalRepRelId relid, TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ LogicalRepRelMapEntry *relentry;
+
+ Assert(depends_on_xids);
+
+ find_all_dependencies_on_rel(relid, new_depended_xid, depends_on_xids);
+
+ /* Search for existing entry */
+ relentry = logicalrep_get_relentry(relid);
+
+ /*
+ * The relentry has not been initialized yet, indicating that no change
+ * has been applied yet.
+ */
+ if (!relentry)
+ return;
+
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &relentry->last_depended_xid,
+ new_depended_xid);
+
+ if (TransactionIdIsValid(new_depended_xid))
+ relentry->last_depended_xid = new_depended_xid;
+}
+
+/*
+ * Check dependencies related to the current change by determining if the
+ * modification impacts the same row or table as another ongoing transaction. If
+ * needed, instruct parallel apply workers to wait for these preceding
+ * transactions to complete.
+ *
+ * Simultaneously, track the dependency for the current change to ensure that
+ * subsequent transactions address this dependency.
+ */
+static void
+handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
+ TransactionId new_depended_xid,
+ ParallelApplyWorkerInfo *winfo)
+{
+ LogicalRepRelId relid;
+ LogicalRepTupleData oldtup;
+ LogicalRepTupleData newtup;
+ LogicalRepRelation *rel;
+ List *depends_on_xids = NIL;
+ List *remote_relids;
+ bool has_oldtup = false;
+ bool cascade = false;
+ bool restart_seqs = false;
+ StringInfoData dependencies;
+
+ /*
+ * Parse the change data using a local copy instead of consuming the given
+ * remote change directly, as the caller may also read data from the remote
+ * message.
+ */
+ StringInfoData change = *s;
+
+ /* Compute dependency only for non-streaming transaction */
+ if (in_streamed_transaction || (winfo && winfo->stream_txn))
+ return;
+
+ /* Only the leader checks dependencies and schedules the parallel apply */
+ if (!am_leader_apply_worker())
+ return;
+
+ if (!replica_identity_table)
+ replica_identity_table = replica_identity_create(ApplyContext,
+ REPLICA_IDENTITY_INITIAL_SIZE,
+ NULL);
+
+ if (replica_identity_table->members >= REPLICA_IDENTITY_CLEANUP_THRESHOLD)
+ cleanup_committed_replica_identity_entries();
+
+ switch (action)
+ {
+ case LOGICAL_REP_MSG_INSERT:
+ relid = logicalrep_read_insert(&change, &newtup);
+ check_dependency_on_replica_identity(relid, &newtup,
+ new_depended_xid,
+ &depends_on_xids);
+ break;
+
+ case LOGICAL_REP_MSG_UPDATE:
+ relid = logicalrep_read_update(&change, &has_oldtup, &oldtup,
+ &newtup);
+
+ if (has_oldtup)
+ check_dependency_on_replica_identity(relid, &oldtup,
+ new_depended_xid,
+ &depends_on_xids);
+
+ check_dependency_on_replica_identity(relid, &newtup,
+ new_depended_xid,
+ &depends_on_xids);
+ break;
+
+ case LOGICAL_REP_MSG_DELETE:
+ relid = logicalrep_read_delete(&change, &oldtup);
+ check_dependency_on_replica_identity(relid, &oldtup,
+ new_depended_xid,
+ &depends_on_xids);
+ break;
+
+ case LOGICAL_REP_MSG_TRUNCATE:
+ remote_relids = logicalrep_read_truncate(&change, &cascade,
+ &restart_seqs);
+
+ /*
+ * Truncate affects all rows in a table, so the current
+ * transaction should wait for all preceding transactions that
+ * modified the same table.
+ */
+ foreach_int(truncated_relid, remote_relids)
+ check_dependency_on_rel(truncated_relid, new_depended_xid,
+ &depends_on_xids);
+
+ break;
+
+ case LOGICAL_REP_MSG_RELATION:
+ rel = logicalrep_read_rel(&change);
+
+ /*
+ * The replica identity key could be changed, making existing
+ * entries in the replica identity invalid. In this case, parallel
+ * apply is not allowed on this specific table until all running
+ * transactions that modified it have finished.
+ */
+ check_dependency_on_rel(rel->remoteid, new_depended_xid,
+ &depends_on_xids);
+ break;
+
+ case LOGICAL_REP_MSG_TYPE:
+ case LOGICAL_REP_MSG_MESSAGE:
+
+ /*
+ * Type updates accompany relation updates, so dependencies have
+ * already been checked during relation updates. Logical messages
+ * do not conflict with any changes, so they can be ignored.
+ */
+ break;
+
+ default:
+ Assert(false);
+ break;
+ }
+
+ if (!depends_on_xids)
+ return;
+
+ /*
+ * Record the in-progress transactions that the current transaction
+ * depends on, so that it can wait for them.
+ */
+ pa_record_dependency_on_transactions(depends_on_xids);
+
+ /*
+ * If the leader applies the transaction itself, immediately wait for
+ * the transactions that the current transaction depends on.
+ */
+ if (winfo == NULL)
+ {
+ foreach_xid(xid, depends_on_xids)
+ pa_wait_for_depended_transaction(xid);
+
+ return;
+ }
+
+ initStringInfo(&dependencies);
+
+ /* Build the dependency message used to send to parallel apply worker */
+ write_internal_dependencies(&dependencies, depends_on_xids);
+
+ (void) send_internal_dependencies(winfo, &dependencies);
+}
+
+/*
+ * Write internal dependency information to the output for the parallel apply
+ * worker.
+ */
+static void
+write_internal_dependencies(StringInfo s, List *depends_on_xids)
+{
+ pq_sendbyte(s, PARALLEL_APPLY_INTERNAL_MESSAGE);
+ pq_sendbyte(s, LOGICAL_REP_MSG_INTERNAL_DEPENDENCY);
+ pq_sendint32(s, list_length(depends_on_xids));
+
+ foreach_xid(xid, depends_on_xids)
+ pq_sendint32(s, xid);
+}
+
/*
* Handle internal dependency information.
*
@@ -826,7 +1410,10 @@ handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
/* not in streaming mode */
if (apply_action == TRANS_LEADER_APPLY)
+ {
+ handle_dependency_on_change(action, s, InvalidTransactionId, winfo);
return false;
+ }
Assert(TransactionIdIsValid(stream_xid));
@@ -1268,6 +1855,33 @@ apply_handle_begin(StringInfo s)
pgstat_report_activity(STATE_RUNNING, NULL);
}
+/*
+ * Send an INTERNAL_DEPENDENCY message to a parallel apply worker.
+ *
+ * Returns false if we switched to the serialize mode to send the message,
+ * true otherwise.
+ */
+static bool
+send_internal_dependencies(ParallelApplyWorkerInfo *winfo, StringInfo s)
+{
+ Assert(s->data[0] == PARALLEL_APPLY_INTERNAL_MESSAGE);
+ Assert(s->data[1] == LOGICAL_REP_MSG_INTERNAL_DEPENDENCY);
+
+ if (!winfo->serialize_changes)
+ {
+ if (pa_send_data(winfo, s->len, s->data))
+ return true;
+
+ pa_switch_to_partial_serialize(winfo, true);
+ }
+
+ /* Skip writing the first internal message flag */
+ s->cursor++;
+ stream_write_change(LOGICAL_REP_MSG_INTERNAL_DEPENDENCY, s);
+
+ return false;
+}
+
/*
* Handle COMMIT message.
*
@@ -1795,7 +2409,7 @@ apply_handle_stream_start(StringInfo s)
/* Try to allocate a worker for the streaming transaction. */
if (first_segment)
- pa_allocate_worker(stream_xid);
+ pa_allocate_worker(stream_xid, true);
apply_action = get_transaction_apply_action(stream_xid, &winfo);
diff --git a/src/include/replication/logicalrelation.h b/src/include/replication/logicalrelation.h
index 7a561a8e8d8..4b321bd2ad2 100644
--- a/src/include/replication/logicalrelation.h
+++ b/src/include/replication/logicalrelation.h
@@ -37,6 +37,8 @@ typedef struct LogicalRepRelMapEntry
/* Sync state. */
char state;
XLogRecPtr statelsn;
+
+ TransactionId last_depended_xid;
} LogicalRepRelMapEntry;
extern void logicalrep_relmap_update(LogicalRepRelation *remoterel);
@@ -50,5 +52,6 @@ extern void logicalrep_rel_close(LogicalRepRelMapEntry *rel,
LOCKMODE lockmode);
extern bool IsIndexUsableForReplicaIdentityFull(Relation idxrel, AttrMap *attrmap);
extern Oid GetRelationIdentityOrPK(Relation rel);
+extern LogicalRepRelMapEntry *logicalrep_get_relentry(LogicalRepRelId remoteid);
#endif /* LOGICALRELATION_H */
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index ddcdcc05053..78b5667cebe 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -235,6 +235,8 @@ typedef struct ParallelApplyWorkerInfo
*/
bool in_use;
+ bool stream_txn;
+
ParallelApplyWorkerShared *shared;
} ParallelApplyWorkerInfo;
@@ -332,8 +334,10 @@ extern void apply_error_callback(void *arg);
extern void set_apply_error_context_origin(char *originname);
/* Parallel apply worker setup and interactions */
-extern void pa_allocate_worker(TransactionId xid);
+extern void pa_allocate_worker(TransactionId xid, bool stream_txn);
extern ParallelApplyWorkerInfo *pa_find_worker(TransactionId xid);
+extern XLogRecPtr pa_get_last_commit_end(TransactionId xid, bool delete_entry,
+ bool *skipped_write);
extern void pa_detach_all_error_mq(void);
extern bool pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes,
@@ -362,6 +366,8 @@ extern void pa_decr_and_wait_stream_block(void);
extern void pa_xact_finish(ParallelApplyWorkerInfo *winfo,
XLogRecPtr remote_lsn);
+extern bool pa_transaction_committed(TransactionId xid);
+extern void pa_record_dependency_on_transactions(List *depends_on_xids);
extern void pa_wait_for_depended_transaction(TransactionId xid);
--
2.47.3
v6-0003-Parallel-apply-non-streaming-transactions.patch
From 834ad15997798aedc77fec24ed034ebf03044b1d Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Mon, 1 Dec 2025 12:28:29 +0900
Subject: [PATCH v6 3/7] Parallel apply non-streaming transactions
--
Basic design
--
The leader worker assigns each non-streaming transaction to a parallel apply
worker. Before dispatching changes to a parallel worker, the leader verifies if
the current modification affects the same row (identitied by replica identity
key) as another ongoing transaction. If so, the leader sends a list of dependent
transaction IDs to the parallel worker, indicating that the parallel apply
worker must wait for these transactions to commit before proceeding.
Each parallel apply worker records the local end LSN of the transaction it
applies in shared memory. Subsequently, the leader gathers these local end LSNs
and logs them in the local 'lsn_mapping' for verifying whether they have been
flushed to disk (following the logic in get_flush_position()).
If no parallel apply worker is available, the leader will apply the transaction
independently.
For further details, please refer to the following:
--
dependency tracking
--
The leader maintains a local hash table, using the remote change's replica
identity column values and relid as keys, with remote transaction IDs as values.
Before sending changes to the parallel apply worker, the leader computes a hash
using RI key values and the relid of the current change to search the hash
table. If an existing entry is found, the leader first updates the hash entry
with the incoming remote xid and then tells the parallel worker to wait for it.
If the remote relation lacks a replica identity (RI), it indicates that only
INSERT can be replicated for this table. In such cases, the leader skips
dependency checks, allowing the parallel apply worker to proceed with applying
changes without delay. This is because the only potential conflicts are those
related to a local unique key or foreign key, whose handling is yet to be
implemented (see TODO - dependency on local unique key, foreign key).
In cases of TRUNCATE or remote schema changes affecting the entire table, the
leader retrieves all remote xids touching the same table (via sequential scans
of the hash table) and tells the parallel worker to wait for those transactions
to commit.
Hash entries are cleaned up once the transaction corresponding to the remote xid
in the entry has been committed. Clean-up typically occurs when collecting the
flush position of each transaction, but is forced if the hash table exceeds a
set threshold.
--
dependency waiting
--
If a transaction is relied upon by others, the leader adds its xid to a shared
hash table. The shared hash table entry is cleared by the parallel apply worker
upon completing the transaction. Workers needing to wait for a transaction check
the shared hash table entry; if present, they lock the transaction ID (using
pa_lock_transaction). If absent, it indicates the transaction has been
committed, negating the need to wait.
--
commit order
--
There is a case where columns have no foreign or primary keys, and integrity is
maintained at the application layer. In this case, the above RI mechanism cannot
detect any dependencies. For safety reasons, parallel apply workers preserve the
commit ordering done on the publisher side. This is done by the leader worker
caching the most recently dispatched transaction ID and adding a dependency
between it and the one currently being dispatched (see the illustration below).
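As a hypothetical illustration (the schema is invented for this sketch), no
key relates the rows below, so the RI mechanism sees no dependency, yet the
application expects to observe them in commit order:

CREATE TABLE regress_events (seq int, note text);

-- Publisher, two separate transactions:
INSERT INTO regress_events VALUES (1, 'order created'); -- txn 1
INSERT INTO regress_events VALUES (2, 'order shipped'); -- txn 2

If txn 2 committed first on the subscriber, a concurrent reader there could
observe 'order shipped' without 'order created'; preserving the publisher's
commit order avoids that anomaly.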
--
TODO - dependency on foreign key.
--
A transaction could conflict with another if modifying the same key.
While current patches don't address conflicts involving foreign keys, tracking
these dependencies might be needed.
---
.../replication/logical/applyparallelworker.c | 339 ++++++++++++++++--
src/backend/replication/logical/proto.c | 38 ++
src/backend/replication/logical/relation.c | 31 ++
src/backend/replication/logical/worker.c | 303 ++++++++++++++--
src/include/replication/logicalproto.h | 2 +
src/include/replication/logicalrelation.h | 2 +
src/include/replication/worker_internal.h | 11 +-
src/test/subscription/meson.build | 1 +
src/test/subscription/t/001_rep_changes.pl | 2 +
src/test/subscription/t/010_truncate.pl | 2 +-
src/test/subscription/t/015_stream.pl | 8 +-
src/test/subscription/t/026_stats.pl | 1 +
src/test/subscription/t/027_nosuperuser.pl | 1 +
src/test/subscription/t/050_parallel_apply.pl | 130 +++++++
src/tools/pgindent/typedefs.list | 4 +
15 files changed, 801 insertions(+), 74 deletions(-)
create mode 100644 src/test/subscription/t/050_parallel_apply.pl
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index cf08206d9fd..5b6267c6047 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -14,6 +14,9 @@
* ParallelApplyWorkerInfo which is required so the leader worker and parallel
* apply workers can communicate with each other.
*
+ * Streaming transactions
+ * ======================
+ *
* The parallel apply workers are assigned (if available) as soon as xact's
* first stream is received for subscriptions that have set their 'streaming'
* option as parallel. The leader apply worker will send changes to this new
@@ -152,6 +155,33 @@
* session-level locks because both locks could be acquired outside the
* transaction, and the stream lock in the leader needs to persist across
* transaction boundaries i.e. until the end of the streaming transaction.
+ *
+ * Non-streaming transactions
+ * ======================
+ * The handling is similar to streaming transactions, with a few
+ * differences:
+ *
+ * Transaction dependency
+ * ----------------------
+ * Before dispatching changes to a parallel worker, the leader verifies if the
+ * current modification affects the same row (identified by replica identity
+ * key) as another ongoing transaction (see handle_dependency_on_change for
+ * details). If so, the leader sends a list of dependent transaction IDs to the
+ * parallel worker, indicating that the parallel apply worker must wait for
+ * these transactions to commit before proceeding.
+ *
+ * Commit order
+ * ------------
+ * There is a case where columns have no foreign or primary keys, and integrity
+ * is maintained at the application layer. In this case, the above RI mechanism
+ * cannot detect any dependencies. For safety reasons, parallel apply workers
+ * preserve the commit ordering done on the publisher side. This is done by the
+ * leader worker caching the most recently dispatched transaction ID and
+ * adding a dependency between it and the one currently being dispatched.
+ * We can extend the parallel apply worker to allow out-of-order commits in the
+ * future: At least we must use a new mechanism to track replication progress
+ * in out-of-order commits. Then we can stop caching the transaction ID and
+ * adding the dependency.
*-------------------------------------------------------------------------
*/
@@ -283,6 +313,7 @@ static ParallelTransState pa_get_xact_state(ParallelApplyWorkerShared *wshared);
static PartialFileSetState pa_get_fileset_state(void);
static void pa_attach_parallelized_txn_hash(dsa_handle *pa_dsa_handle,
dshash_table_handle *pa_dshash_handle);
+static void write_internal_relation(StringInfo s, LogicalRepRelation *rel);
/*
* Returns true if it is OK to start a parallel apply worker, false otherwise.
@@ -400,6 +431,7 @@ pa_setup_dsm(ParallelApplyWorkerInfo *winfo)
shared = shm_toc_allocate(toc, sizeof(ParallelApplyWorkerShared));
SpinLockInit(&shared->mutex);
+ shared->xid = InvalidTransactionId;
shared->xact_state = PARALLEL_TRANS_UNKNOWN;
pg_atomic_init_u32(&(shared->pending_stream_count), 0);
shared->last_commit_end = InvalidXLogRecPtr;
@@ -443,6 +475,8 @@ pa_launch_parallel_worker(void)
MemoryContext oldcontext;
bool launched;
ParallelApplyWorkerInfo *winfo;
+ dsa_handle pa_dsa_handle;
+ dshash_table_handle pa_dshash_handle;
ListCell *lc;
/* Try to get an available parallel apply worker from the worker pool. */
@@ -450,10 +484,33 @@ pa_launch_parallel_worker(void)
{
winfo = (ParallelApplyWorkerInfo *) lfirst(lc);
- if (!winfo->in_use)
+ if (!winfo->stream_txn &&
+ pa_get_xact_state(winfo->shared) == PARALLEL_TRANS_FINISHED)
+ {
+ /*
+ * Save the local commit LSN of the last transaction applied by
+ * this worker before reusing it for another transaction. This WAL
+ * position is crucial for determining the flush position in
+ * responses to the publisher (see get_flush_position()).
+ */
+ (void) pa_get_last_commit_end(winfo->shared->xid, false, NULL);
+ return winfo;
+ }
+
+ if (winfo->stream_txn && !winfo->in_use)
return winfo;
}
+ pa_attach_parallelized_txn_hash(&pa_dsa_handle, &pa_dshash_handle);
+
+ /*
+ * Return if the number of parallel apply workers has reached the maximum
+ * limit.
+ */
+ if (list_length(ParallelApplyWorkerPool) ==
+ max_parallel_apply_workers_per_subscription)
+ return NULL;
+
/*
* Start a new parallel apply worker.
*
@@ -481,18 +538,32 @@ pa_launch_parallel_worker(void)
dsm_segment_handle(winfo->dsm_seg),
false);
- if (launched)
- {
- ParallelApplyWorkerPool = lappend(ParallelApplyWorkerPool, winfo);
- }
- else
+ if (!launched)
{
+ MemoryContextSwitchTo(oldcontext);
pa_free_worker_info(winfo);
- winfo = NULL;
+ return NULL;
}
+ ParallelApplyWorkerPool = lappend(ParallelApplyWorkerPool, winfo);
+
MemoryContextSwitchTo(oldcontext);
+ /*
+ * Send all existing remote relation information to the parallel apply
+ * worker. This allows the parallel worker to initialize the
+ * LogicalRepRelMapEntry locally before applying remote changes.
+ */
+ if (logicalrep_get_num_rels())
+ {
+ StringInfoData out;
+
+ initStringInfo(&out);
+
+ write_internal_relation(&out, NULL);
+ pa_send_data(winfo, out.len, out.data);
+ }
+
return winfo;
}
@@ -597,7 +668,8 @@ pa_free_worker(ParallelApplyWorkerInfo *winfo)
{
Assert(!am_parallel_apply_worker());
Assert(winfo->in_use);
- Assert(pa_get_xact_state(winfo->shared) == PARALLEL_TRANS_FINISHED);
+ Assert(!winfo->stream_txn ||
+ pa_get_xact_state(winfo->shared) == PARALLEL_TRANS_FINISHED);
if (!hash_search(ParallelApplyTxnHash, &winfo->shared->xid, HASH_REMOVE, NULL))
elog(ERROR, "hash table corrupted");
@@ -613,9 +685,7 @@ pa_free_worker(ParallelApplyWorkerInfo *winfo)
* been serialized and then letting the parallel apply worker deal with
* the spurious message, we stop the worker.
*/
- if (winfo->serialize_changes ||
- list_length(ParallelApplyWorkerPool) >
- (max_parallel_apply_workers_per_subscription / 2))
+ if (winfo->serialize_changes)
{
logicalrep_pa_worker_stop(winfo);
pa_free_worker_info(winfo);
@@ -812,6 +882,38 @@ pa_get_last_commit_end(TransactionId xid, bool delete_entry, bool *skipped_write
return entry->local_end;
}
+/*
+ * Wait for the remote transaction associated with the specified remote xid to
+ * complete.
+ */
+static void
+pa_wait_for_transaction(TransactionId wait_for_xid)
+{
+ if (!am_leader_apply_worker())
+ return;
+
+ if (!TransactionIdIsValid(wait_for_xid))
+ return;
+
+ elog(DEBUG1, "plan to wait for remote_xid %u to finish",
+ wait_for_xid);
+
+ for (;;)
+ {
+ if (pa_transaction_committed(wait_for_xid))
+ break;
+
+ pa_lock_transaction(wait_for_xid, AccessShareLock);
+ pa_unlock_transaction(wait_for_xid, AccessShareLock);
+
+ /* An interrupt may have occurred while we were waiting. */
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ elog(DEBUG1, "finished wait for remote_xid %u to finish",
+ wait_for_xid);
+}
+
/*
* Interrupt handler for main loop of parallel apply worker.
*/
@@ -887,21 +989,34 @@ LogicalParallelApplyLoop(shm_mq_handle *mqh)
* parallel apply workers can only be PqReplMsg_WALData.
*/
c = pq_getmsgbyte(&s);
- if (c != PqReplMsg_WALData)
- elog(ERROR, "unexpected message \"%c\"", c);
-
- /*
- * Ignore statistics fields that have been updated by the leader
- * apply worker.
- *
- * XXX We can avoid sending the statistics fields from the leader
- * apply worker but for that, it needs to rebuild the entire
- * message by removing these fields which could be more work than
- * simply ignoring these fields in the parallel apply worker.
- */
- s.cursor += SIZE_STATS_MESSAGE;
+ if (c == PqReplMsg_WALData)
+ {
+ /*
+ * Ignore statistics fields that have been updated by the
+ * leader apply worker.
+ *
+ * XXX We can avoid sending the statistics fields from the
+ * leader apply worker but for that, it needs to rebuild the
+ * entire message by removing these fields which could be more
+ * work than simply ignoring these fields in the parallel
+ * apply worker.
+ */
+ s.cursor += SIZE_STATS_MESSAGE;
- apply_dispatch(&s);
+ apply_dispatch(&s);
+ }
+ else if (c == PARALLEL_APPLY_INTERNAL_MESSAGE)
+ {
+ apply_dispatch(&s);
+ }
+ else
+ {
+ /*
+ * The first byte of messages sent from leader apply worker to
+ * parallel apply workers can only be 'w' or 'i'.
+ */
+ elog(ERROR, "unexpected message \"%c\"", c);
+ }
}
else if (shmq_res == SHM_MQ_WOULD_BLOCK)
{
@@ -918,6 +1033,9 @@ LogicalParallelApplyLoop(shm_mq_handle *mqh)
if (rc & WL_LATCH_SET)
ResetLatch(MyLatch);
+
+ if (!IsTransactionState())
+ pgstat_report_stat(true);
}
}
else
@@ -955,6 +1073,9 @@ pa_shutdown(int code, Datum arg)
INVALID_PROC_NUMBER);
dsm_detach((dsm_segment *) DatumGetPointer(arg));
+
+ if (parallel_apply_dsa_area)
+ dsa_detach(parallel_apply_dsa_area);
}
/*
@@ -1267,7 +1388,6 @@ pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes, const void *data)
shm_mq_result result;
TimestampTz startTime = 0;
- Assert(!IsTransactionState());
Assert(!winfo->serialize_changes);
/*
@@ -1319,6 +1439,67 @@ pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes, const void *data)
}
}
+/*
+ * Distribute remote relation information to all active parallel apply workers
+ * that require it.
+ */
+void
+pa_distribute_schema_changes_to_workers(LogicalRepRelation *rel)
+{
+ List *workers_stopped = NIL;
+ StringInfoData out;
+
+ if (!ParallelApplyWorkerPool)
+ return;
+
+ initStringInfo(&out);
+
+ write_internal_relation(&out, rel);
+
+ foreach_ptr(ParallelApplyWorkerInfo, winfo, ParallelApplyWorkerPool)
+ {
+ /*
+ * Skip the worker responsible for the current transaction, as the
+ * relation information has already been sent to it.
+ */
+ if (winfo == stream_apply_worker)
+ continue;
+
+ /*
+ * Skip any worker that is in serialize mode, as it will soon stop
+ * once it finishes applying the transaction.
+ */
+ if (winfo->serialize_changes)
+ continue;
+
+ elog(DEBUG1, "distributing schema changes to pa workers");
+
+ if (pa_send_data(winfo, out.len, out.data))
+ continue;
+
+ elog(DEBUG1, "failed to distribute, will stop that worker instead");
+
+ /*
+ * Distribution to this worker failed due to a sending timeout. Wait
+ * for the worker to complete its transaction and then stop it. This
+ * is consistent with the handling of workers in serialize mode (see
+ * pa_free_worker() for details).
+ */
+ pa_wait_for_transaction(winfo->shared->xid);
+
+ pa_get_last_commit_end(winfo->shared->xid, false, NULL);
+
+ logicalrep_pa_worker_stop(winfo);
+
+ workers_stopped = lappend(workers_stopped, winfo);
+ }
+
+ pfree(out.data);
+
+ foreach_ptr(ParallelApplyWorkerInfo, winfo, workers_stopped)
+ pa_free_worker_info(winfo);
+}
+
/*
* Switch to PARTIAL_SERIALIZE mode for the current transaction -- this means
* that the current data and any subsequent data for this transaction will be
@@ -1401,8 +1582,8 @@ pa_wait_for_xact_finish(ParallelApplyWorkerInfo *winfo)
/*
* Wait for the transaction lock to be released. This is required to
- * detect deadlock among leader and parallel apply workers. Refer to the
- * comments atop this file.
+ * detect deadlock among leader and parallel apply workers. Refer
+ * to the comments atop this file.
*/
pa_lock_transaction(winfo->shared->xid, AccessShareLock);
pa_unlock_transaction(winfo->shared->xid, AccessShareLock);
@@ -1479,6 +1660,9 @@ pa_savepoint_name(Oid suboid, TransactionId xid, char *spname, Size szsp)
void
pa_start_subtrans(TransactionId current_xid, TransactionId top_xid)
{
+ if (!TransactionIdIsValid(top_xid))
+ return;
+
if (current_xid != top_xid &&
!list_member_xid(subxactlist, current_xid))
{
@@ -1735,25 +1919,41 @@ pa_decr_and_wait_stream_block(void)
void
pa_xact_finish(ParallelApplyWorkerInfo *winfo, XLogRecPtr remote_lsn)
{
+ XLogRecPtr local_lsn = InvalidXLogRecPtr;
+ TransactionId pa_remote_xid = winfo->shared->xid;
+
Assert(am_leader_apply_worker());
/*
- * Unlock the shared object lock so that parallel apply worker can
- * continue to receive and apply changes.
+ * Unlock the shared object lock taken for streaming transactions so
+ * that the parallel apply worker can continue to receive and apply
+ * changes.
*/
- pa_unlock_stream(winfo->shared->xid, AccessExclusiveLock);
+ if (winfo->stream_txn)
+ pa_unlock_stream(winfo->shared->xid, AccessExclusiveLock);
/*
- * Wait for that worker to finish. This is necessary to maintain commit
- * order which avoids failures due to transaction dependencies and
- * deadlocks.
+ * For a streaming transaction, wait for that worker to finish. This
+ * is necessary to maintain commit order, which avoids failures due to
+ * transaction dependencies and deadlocks.
+ *
+ * For a non-streaming transaction in partial serialize mode, wait for
+ * the worker to stop as well, since it cannot be reused anymore (see
+ * pa_free_worker() for details).
*/
- pa_wait_for_xact_finish(winfo);
+ if (winfo->serialize_changes || winfo->stream_txn)
+ {
+ pa_wait_for_xact_finish(winfo);
+
+ local_lsn = winfo->shared->last_commit_end;
+ pa_remote_xid = InvalidTransactionId;
+
+ pa_free_worker(winfo);
+ }
if (XLogRecPtrIsValid(remote_lsn))
- store_flush_position(remote_lsn, winfo->shared->last_commit_end);
+ store_flush_position(remote_lsn, local_lsn, pa_remote_xid);
- pa_free_worker(winfo);
+ pa_set_stream_apply_worker(NULL);
}
bool
@@ -1852,6 +2052,22 @@ pa_record_dependency_on_transactions(List *depends_on_xids)
}
}
+/*
+ * Mark the transaction state as finished and remove the shared hash entry.
+ */
+void
+pa_commit_transaction(void)
+{
+ TransactionId xid = MyParallelShared->xid;
+
+ SpinLockAcquire(&MyParallelShared->mutex);
+ MyParallelShared->xact_state = PARALLEL_TRANS_FINISHED;
+ SpinLockRelease(&MyParallelShared->mutex);
+
+ dshash_delete_key(parallelized_txns, &xid);
+ elog(DEBUG1, "depended xid %u committed", xid);
+}
+
/*
* Wait for the given transaction to finish.
*/
@@ -1860,6 +2076,13 @@ pa_wait_for_depended_transaction(TransactionId xid)
{
elog(DEBUG1, "wait for depended xid %u", xid);
+ /*
+ * Quick exit if parallelized_txns has not been initialized yet. This can
+ * happen when this function is called by the leader worker.
+ */
+ if (!parallelized_txns)
+ return;
+
for (;;)
{
ParallelizedTxnEntry *txn_entry;
@@ -1880,3 +2103,45 @@ pa_wait_for_depended_transaction(TransactionId xid)
elog(DEBUG1, "finish waiting for depended xid %u", xid);
}
+
+/*
+ * Write internal relation description to the output stream.
+ */
+static void
+write_internal_relation(StringInfo s, LogicalRepRelation *rel)
+{
+ pq_sendbyte(s, PARALLEL_APPLY_INTERNAL_MESSAGE);
+ pq_sendbyte(s, LOGICAL_REP_MSG_INTERNAL_RELATION);
+
+ if (rel)
+ {
+ pq_sendint(s, 1, 4);
+ logicalrep_write_internal_rel(s, rel);
+ }
+ else
+ {
+ pq_sendint(s, logicalrep_get_num_rels(), 4);
+ logicalrep_write_all_rels(s);
+ }
+}
+
+/*
+ * Register a transaction in the shared hash table.
+ *
+ * This function is intended to be called during the commit phase of
+ * non-streamed transactions. Other parallel workers that depend on this
+ * transaction will wait until the added entry is removed.
+ */
+void
+pa_add_parallelized_transaction(TransactionId xid)
+{
+ bool found;
+ ParallelizedTxnEntry *txn_entry;
+
+ Assert(parallelized_txns);
+ Assert(TransactionIdIsValid(xid));
+
+ txn_entry = dshash_find_or_insert(parallelized_txns, &xid, &found);
+
+ dshash_release_lock(parallelized_txns, txn_entry);
+}
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index ded46c49a83..96b6a74055e 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -691,6 +691,44 @@ logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel,
logicalrep_write_attrs(out, rel, columns, include_gencols_type);
}
+/*
+ * Write internal relation description to the output stream.
+ */
+void
+logicalrep_write_internal_rel(StringInfo out, LogicalRepRelation *rel)
+{
+ pq_sendint32(out, rel->remoteid);
+
+ /* Write relation name */
+ pq_sendstring(out, rel->nspname);
+ pq_sendstring(out, rel->relname);
+
+ /* Write the replica identity. */
+ pq_sendbyte(out, rel->replident);
+
+ /* Write attribute description */
+ pq_sendint16(out, rel->natts);
+
+ for (int i = 0; i < rel->natts; i++)
+ {
+ uint8 flags = 0;
+
+ if (bms_is_member(i, rel->attkeys))
+ flags |= LOGICALREP_IS_REPLICA_IDENTITY;
+
+ pq_sendbyte(out, flags);
+
+ /* attribute name */
+ pq_sendstring(out, rel->attnames[i]);
+
+ /* attribute type id */
+ pq_sendint32(out, rel->atttyps[i]);
+
+ /* ignore attribute mode for now */
+ pq_sendint32(out, 0);
+ }
+}
+
/*
* Read the relation info from stream and return as LogicalRepRelation.
*/
diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index 13f8cb74e9f..9991bfe76cc 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -960,6 +960,37 @@ FindLogicalRepLocalIndex(Relation localrel, LogicalRepRelation *remoterel,
return InvalidOid;
}
+/*
+ * Get the number of entries in the LogicalRepRelMap.
+ */
+int
+logicalrep_get_num_rels(void)
+{
+ if (LogicalRepRelMap == NULL)
+ return 0;
+
+ return hash_get_num_entries(LogicalRepRelMap);
+}
+
+/*
+ * Write all the remote relation information from the LogicalRepRelMap to
+ * the output stream.
+ */
+void
+logicalrep_write_all_rels(StringInfo out)
+{
+ LogicalRepRelMapEntry *entry;
+ HASH_SEQ_STATUS status;
+
+ if (LogicalRepRelMap == NULL)
+ return;
+
+ hash_seq_init(&status, LogicalRepRelMap);
+
+ while ((entry = (LogicalRepRelMapEntry *) hash_seq_search(&status)) != NULL)
+ logicalrep_write_internal_rel(out, &entry->remoterel);
+}
+
/*
* Get the LogicalRepRelMapEntry corresponding to the given relid without
* opening the local relation.
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 0b1eeefe9c9..3832481647e 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -286,6 +286,7 @@
#include "tcop/tcopprot.h"
#include "utils/acl.h"
#include "utils/guc.h"
+#include "utils/injection_point.h"
#include "utils/inval.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -484,6 +485,8 @@ static List *on_commit_wakeup_workers_subids = NIL;
bool in_remote_transaction = false;
static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
+static TransactionId remote_xid = InvalidTransactionId;
+static TransactionId last_remote_xid = InvalidTransactionId;
/* fields valid only when processing streamed transaction */
static bool in_streamed_transaction = false;
@@ -602,11 +605,7 @@ static inline void cleanup_subxact_info(void);
/*
* Serialize and deserialize changes for a toplevel transaction.
*/
-static void stream_open_file(Oid subid, TransactionId xid,
- bool first_segment);
static void stream_write_change(char action, StringInfo s);
-static void stream_open_and_write_change(TransactionId xid, char action, StringInfo s);
-static void stream_close_file(void);
static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
@@ -676,6 +675,8 @@ static void replorigin_reset(int code, Datum arg);
static bool send_internal_dependencies(ParallelApplyWorkerInfo *winfo,
StringInfo s);
+static bool build_dependency_with_last_committed_txn(ParallelApplyWorkerInfo *winfo);
+
/*
* Compute the hash value for entries in the replica_identity_table.
*/
@@ -1406,7 +1407,11 @@ handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
TransApplyAction apply_action;
StringInfoData original_msg;
- apply_action = get_transaction_apply_action(stream_xid, &winfo);
+ Assert(!in_streamed_transaction || TransactionIdIsValid(stream_xid));
+
+ apply_action = get_transaction_apply_action(in_streamed_transaction
+ ? stream_xid : remote_xid,
+ &winfo);
/* not in streaming mode */
if (apply_action == TRANS_LEADER_APPLY)
@@ -1415,8 +1420,6 @@ handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
return false;
}
- Assert(TransactionIdIsValid(stream_xid));
-
/*
* The parallel apply worker needs the xid in this message to decide
* whether to define a savepoint, so save the original message that has
@@ -1427,15 +1430,28 @@ handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
/*
* We should have received XID of the subxact as the first part of the
- * message, so extract it.
+ * message in streaming transactions, so extract it.
*/
- current_xid = pq_getmsgint(s, 4);
+ if (in_streamed_transaction)
+ current_xid = pq_getmsgint(s, 4);
+ else
+ current_xid = remote_xid;
if (!TransactionIdIsValid(current_xid))
ereport(ERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
errmsg_internal("invalid transaction ID in streamed replication transaction")));
+ handle_dependency_on_change(action, s, current_xid, winfo);
+
+ /*
+ * Re-fetch the latest apply action as it might have been changed during
+ * the dependency check.
+ */
+ apply_action = get_transaction_apply_action(in_streamed_transaction
+ ? stream_xid : remote_xid,
+ &winfo);
+
switch (apply_action)
{
case TRANS_LEADER_SERIALIZE:
@@ -1839,17 +1855,71 @@ static void
apply_handle_begin(StringInfo s)
{
LogicalRepBeginData begin_data;
+ ParallelApplyWorkerInfo *winfo;
+ TransApplyAction apply_action;
+
+ /* Save the message before it is consumed. */
+ StringInfoData original_msg = *s;
/* There must not be an active streaming transaction. */
Assert(!TransactionIdIsValid(stream_xid));
logicalrep_read_begin(s, &begin_data);
- set_apply_error_context_xact(begin_data.xid, begin_data.final_lsn);
+
+ remote_xid = begin_data.xid;
+
+ set_apply_error_context_xact(remote_xid, begin_data.final_lsn);
remote_final_lsn = begin_data.final_lsn;
maybe_start_skipping_changes(begin_data.final_lsn);
+ pa_allocate_worker(remote_xid, false);
+
+ apply_action = get_transaction_apply_action(remote_xid, &winfo);
+
+ elog(DEBUG1, "new remote_xid %u", remote_xid);
+ switch (apply_action)
+ {
+ case TRANS_LEADER_APPLY:
+ break;
+
+ case TRANS_LEADER_SEND_TO_PARALLEL:
+ Assert(winfo);
+
+ if (pa_send_data(winfo, s->len, s->data))
+ {
+ pa_set_stream_apply_worker(winfo);
+ break;
+ }
+
+ /*
+ * Switch to serialize mode when we are not able to send the
+ * change to parallel apply worker.
+ */
+ pa_switch_to_partial_serialize(winfo, true);
+
+/* fall through */
+ case TRANS_LEADER_PARTIAL_SERIALIZE:
+ Assert(winfo);
+
+ stream_write_change(LOGICAL_REP_MSG_BEGIN, &original_msg);
+
+ /* Cache the parallel apply worker for this transaction. */
+ pa_set_stream_apply_worker(winfo);
+ break;
+
+ case TRANS_PARALLEL_APPLY:
+ /* Hold the lock until the end of the transaction. */
+ pa_lock_transaction(MyParallelShared->xid, AccessExclusiveLock);
+ pa_set_xact_state(MyParallelShared, PARALLEL_TRANS_STARTED);
+ break;
+
+ default:
+ elog(ERROR, "unexpected apply action: %d", (int) apply_action);
+ break;
+ }
+
in_remote_transaction = true;
pgstat_report_activity(STATE_RUNNING, NULL);
@@ -1882,6 +1952,37 @@ send_internal_dependencies(ParallelApplyWorkerInfo *winfo, StringInfo s)
return false;
}
+/*
+ * Make a dependency between this and the last committed transaction.
+ *
+ * This function ensures that the commit ordering handled by parallel apply
+ * workers is preserved. Returns false if we switched to serialize mode to
+ * send the message, true otherwise.
+ */
+static bool
+build_dependency_with_last_committed_txn(ParallelApplyWorkerInfo *winfo)
+{
+ StringInfoData dependency_msg;
+ bool ret;
+
+ /* Skip if no transaction has been applied yet */
+ if (!TransactionIdIsValid(last_remote_xid))
+ return true;
+
+ /* Build the dependency message to send to the parallel apply worker */
+ initStringInfo(&dependency_msg);
+
+ pq_sendbyte(&dependency_msg, PARALLEL_APPLY_INTERNAL_MESSAGE);
+ pq_sendbyte(&dependency_msg, LOGICAL_REP_MSG_INTERNAL_DEPENDENCY);
+ pq_sendint32(&dependency_msg, 1);
+ pq_sendint32(&dependency_msg, last_remote_xid);
+
+ ret = send_internal_dependencies(winfo, &dependency_msg);
+
+ pfree(dependency_msg.data);
+ return ret;
+}
+
/*
* Handle COMMIT message.
*
@@ -1891,6 +1992,11 @@ static void
apply_handle_commit(StringInfo s)
{
LogicalRepCommitData commit_data;
+ ParallelApplyWorkerInfo *winfo;
+ TransApplyAction apply_action;
+
+ /* Save the message before it is consumed. */
+ StringInfoData original_msg = *s;
logicalrep_read_commit(s, &commit_data);
@@ -1901,7 +2007,97 @@ apply_handle_commit(StringInfo s)
LSN_FORMAT_ARGS(commit_data.commit_lsn),
LSN_FORMAT_ARGS(remote_final_lsn))));
- apply_handle_commit_internal(&commit_data);
+ apply_action = get_transaction_apply_action(remote_xid, &winfo);
+
+ switch (apply_action)
+ {
+ case TRANS_LEADER_APPLY:
+ /*
+ * Unlike parallelized transactions, we do not have to register
+ * this transaction in parallelized_txns: the commit ordering is
+ * always preserved.
+ */
+
+ /* Wait until the last transaction finishes */
+ if (TransactionIdIsValid(last_remote_xid))
+ pa_wait_for_depended_transaction(last_remote_xid);
+
+ apply_handle_commit_internal(&commit_data);
+
+ break;
+
+ case TRANS_LEADER_SEND_TO_PARALLEL:
+ Assert(winfo);
+
+ /*
+ * Mark this transaction as parallelized. This ensures that
+ * upcoming transactions wait until this transaction is committed.
+ */
+ pa_add_parallelized_transaction(remote_xid);
+
+ /*
+ * Build a dependency between this transaction and the last
+ * committed transaction to preserve the commit order. If that
+ * succeeds, try to send the COMMIT message.
+ */
+ if (build_dependency_with_last_committed_txn(winfo) &&
+ pa_send_data(winfo, s->len, s->data))
+ {
+ /* Finish processing the transaction. */
+ pa_xact_finish(winfo, commit_data.end_lsn);
+ break;
+ }
+
+ /*
+ * Switch to serialize mode when we are not able to send the
+ * change to parallel apply worker.
+ */
+ pa_switch_to_partial_serialize(winfo, true);
+/* fall through */
+ case TRANS_LEADER_PARTIAL_SERIALIZE:
+ Assert(winfo);
+
+ stream_open_and_write_change(remote_xid, LOGICAL_REP_MSG_COMMIT,
+ &original_msg);
+
+ pa_set_fileset_state(winfo->shared, FS_SERIALIZE_DONE);
+
+ /* Finish processing the transaction. */
+ pa_xact_finish(winfo, commit_data.end_lsn);
+ break;
+
+ case TRANS_PARALLEL_APPLY:
+
+ /*
+ * If the parallel apply worker is applying spooled messages then
+ * close the file before committing.
+ */
+ if (stream_fd)
+ stream_close_file();
+
+ INJECTION_POINT("parallel-worker-before-commit", NULL);
+
+ apply_handle_commit_internal(&commit_data);
+
+ MyParallelShared->last_commit_end = XactLastCommitEnd;
+
+ pa_commit_transaction();
+
+ pa_unlock_transaction(remote_xid, AccessExclusiveLock);
+ break;
+
+ default:
+ elog(ERROR, "unexpected apply action: %d", (int) apply_action);
+ break;
+ }
+
+ /* Cache the remote_xid */
+ last_remote_xid = remote_xid;
+
+ remote_xid = InvalidTransactionId;
+ in_remote_transaction = false;
+
+ elog(DEBUG1, "reset remote_xid %u", remote_xid);
/*
* Process any tables that are being synchronized in parallel, as well as
@@ -2024,7 +2220,8 @@ apply_handle_prepare(StringInfo s)
* XactLastCommitEnd, and adding it for this purpose doesn't seems worth
* it.
*/
- store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr);
+ store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr,
+ InvalidTransactionId);
in_remote_transaction = false;
@@ -2084,7 +2281,8 @@ apply_handle_commit_prepared(StringInfo s)
CommitTransactionCommand();
pgstat_report_stat(false);
- store_flush_position(prepare_data.end_lsn, XactLastCommitEnd);
+ store_flush_position(prepare_data.end_lsn, XactLastCommitEnd,
+ InvalidTransactionId);
in_remote_transaction = false;
/*
@@ -2153,7 +2351,8 @@ apply_handle_rollback_prepared(StringInfo s)
* transaction because we always flush the WAL record for it. See
* apply_handle_prepare.
*/
- store_flush_position(rollback_data.rollback_end_lsn, InvalidXLogRecPtr);
+ store_flush_position(rollback_data.rollback_end_lsn, InvalidXLogRecPtr,
+ InvalidTransactionId);
in_remote_transaction = false;
/*
@@ -2215,7 +2414,8 @@ apply_handle_stream_prepare(StringInfo s)
* It is okay not to set the local_end LSN for the prepare because
* we always flush the prepare record. See apply_handle_prepare.
*/
- store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr);
+ store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr,
+ InvalidTransactionId);
in_remote_transaction = false;
@@ -2467,6 +2667,11 @@ apply_handle_stream_start(StringInfo s)
case TRANS_LEADER_PARTIAL_SERIALIZE:
Assert(winfo);
+ /*
+ * TODO: the parallel apply worker could start to wait too soon
+ * when processing an old stream start.
+ */
+
/*
* Open the spool file unless it was already opened when switching
* to serialize mode. The transaction started in
@@ -3194,7 +3399,8 @@ apply_handle_commit_internal(LogicalRepCommitData *commit_data)
pgstat_report_stat(false);
- store_flush_position(commit_data->end_lsn, XactLastCommitEnd);
+ store_flush_position(commit_data->end_lsn, XactLastCommitEnd,
+ InvalidTransactionId);
}
else
{
@@ -3227,6 +3433,9 @@ apply_handle_relation(StringInfo s)
/* Also reset all entries in the partition map that refer to remoterel. */
logicalrep_partmap_reset_relmap(rel);
+
+ if (am_leader_apply_worker())
+ pa_distribute_schema_changes_to_workers(rel);
}
/*
@@ -4001,6 +4210,8 @@ FindDeletedTupleInLocalRel(Relation localrel, Oid localidxoid,
/*
* This handles insert, update, delete on a partitioned table.
+ *
+ * TODO: support parallel apply.
*/
static void
apply_handle_tuple_routing(ApplyExecutionData *edata,
@@ -4551,6 +4762,10 @@ apply_dispatch(StringInfo s)
* check which entries on it are already locally flushed. Those we can report
* as having been flushed.
*
+ * For non-streaming transactions managed by a parallel apply worker, we will
+ * get the local commit end from the shared parallel apply worker info once the
+ * transaction has been committed by the worker.
+ *
* The have_pending_txes is true if there are outstanding transactions that
* need to be flushed.
*/
@@ -4560,6 +4775,7 @@ get_flush_position(XLogRecPtr *write, XLogRecPtr *flush,
{
dlist_mutable_iter iter;
XLogRecPtr local_flush = GetFlushRecPtr(NULL);
+ List *committed_pa_xid = NIL;
*write = InvalidXLogRecPtr;
*flush = InvalidXLogRecPtr;
@@ -4569,6 +4785,36 @@ get_flush_position(XLogRecPtr *write, XLogRecPtr *flush,
FlushPosition *pos =
dlist_container(FlushPosition, node, iter.cur);
+ if (TransactionIdIsValid(pos->pa_remote_xid) &&
+ XLogRecPtrIsInvalid(pos->local_end))
+ {
+ bool skipped_write;
+
+ pos->local_end = pa_get_last_commit_end(pos->pa_remote_xid, true,
+ &skipped_write);
+
+ elog(DEBUG1,
+ "got commit end from parallel apply worker, "
+ "txn: %u, remote_end %X/%X, local_end %X/%X",
+ pos->pa_remote_xid, LSN_FORMAT_ARGS(pos->remote_end),
+ LSN_FORMAT_ARGS(pos->local_end));
+
+ /*
+ * Break the loop if the worker has not finished applying the
+ * transaction. There's no need to check subsequent transactions,
+ * as they must commit after the current transaction being
+ * examined and thus won't have their commit end available yet.
+ */
+ if (!skipped_write && XLogRecPtrIsInvalid(pos->local_end))
+ break;
+
+ committed_pa_xid = lappend_xid(committed_pa_xid, pos->pa_remote_xid);
+ }
+
+ /*
+ * The worker has finished applying, or the transaction was applied by
+ * the leader apply worker.
+ */
*write = pos->remote_end;
if (pos->local_end <= local_flush)
@@ -4577,29 +4823,19 @@ get_flush_position(XLogRecPtr *write, XLogRecPtr *flush,
dlist_delete(iter.cur);
pfree(pos);
}
- else
- {
- /*
- * Don't want to uselessly iterate over the rest of the list which
- * could potentially be long. Instead get the last element and
- * grab the write position from there.
- */
- pos = dlist_tail_element(FlushPosition, node,
- &lsn_mapping);
- *write = pos->remote_end;
- *have_pending_txes = true;
- return;
- }
}
*have_pending_txes = !dlist_is_empty(&lsn_mapping);
+
+ cleanup_replica_identity_table(committed_pa_xid);
}
/*
* Store current remote/local lsn pair in the tracking list.
*/
void
-store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn)
+store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn,
+ TransactionId remote_xid)
{
FlushPosition *flushpos;
@@ -4617,6 +4853,7 @@ store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn)
flushpos = palloc_object(FlushPosition);
flushpos->local_end = local_lsn;
flushpos->remote_end = remote_lsn;
+ flushpos->pa_remote_xid = remote_xid;
dlist_push_tail(&lsn_mapping, &flushpos->node);
MemoryContextSwitchTo(ApplyMessageContext);
@@ -6064,7 +6301,7 @@ stream_cleanup_files(Oid subid, TransactionId xid)
* changes for this transaction, create the buffile, otherwise open the
* previously created file.
*/
-static void
+void
stream_open_file(Oid subid, TransactionId xid, bool first_segment)
{
char path[MAXPGPATH];
@@ -6109,7 +6346,7 @@ stream_open_file(Oid subid, TransactionId xid, bool first_segment)
* stream_close_file
* Close the currently open file with streamed changes.
*/
-static void
+void
stream_close_file(void)
{
Assert(stream_fd != NULL);
@@ -6157,7 +6394,7 @@ stream_write_change(char action, StringInfo s)
* target file if not already before writing the message and close the file at
* the end.
*/
-static void
+void
stream_open_and_write_change(TransactionId xid, char action, StringInfo s)
{
Assert(!in_streamed_transaction);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 5d91e2a4287..7d2aaf2d389 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -253,6 +253,8 @@ extern void logicalrep_write_message(StringInfo out, TransactionId xid, XLogRecP
extern void logicalrep_write_rel(StringInfo out, TransactionId xid,
Relation rel, Bitmapset *columns,
PublishGencolsType include_gencols_type);
+extern void logicalrep_write_internal_rel(StringInfo out,
+ LogicalRepRelation *rel);
extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
extern void logicalrep_write_typ(StringInfo out, TransactionId xid,
Oid typoid);
diff --git a/src/include/replication/logicalrelation.h b/src/include/replication/logicalrelation.h
index 4b321bd2ad2..34a7069e9e5 100644
--- a/src/include/replication/logicalrelation.h
+++ b/src/include/replication/logicalrelation.h
@@ -52,6 +52,8 @@ extern void logicalrep_rel_close(LogicalRepRelMapEntry *rel,
LOCKMODE lockmode);
extern bool IsIndexUsableForReplicaIdentityFull(Relation idxrel, AttrMap *attrmap);
extern Oid GetRelationIdentityOrPK(Relation rel);
+extern int logicalrep_get_num_rels(void);
+extern void logicalrep_write_all_rels(StringInfo out);
extern LogicalRepRelMapEntry *logicalrep_get_relentry(LogicalRepRelId remoteid);
#endif /* LOGICALRELATION_H */
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 78b5667cebe..5371ee767f1 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -314,6 +314,10 @@ extern void apply_dispatch(StringInfo s);
extern void maybe_reread_subscription(void);
extern void stream_cleanup_files(Oid subid, TransactionId xid);
+extern void stream_open_file(Oid subid, TransactionId xid, bool first_segment);
+extern void stream_close_file(void);
+extern void stream_open_and_write_change(TransactionId xid, char action,
+ StringInfo s);
extern void set_stream_options(WalRcvStreamOptions *options,
char *slotname,
@@ -327,7 +331,8 @@ extern void SetupApplyOrSyncWorker(int worker_slot);
extern void DisableSubscriptionAndExit(void);
-extern void store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn);
+extern void store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn,
+ TransactionId remote_xid);
/* Function for apply error callback */
extern void apply_error_callback(void *arg);
@@ -342,6 +347,7 @@ extern void pa_detach_all_error_mq(void);
extern bool pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes,
const void *data);
+extern void pa_distribute_schema_changes_to_workers(LogicalRepRelation *rel);
extern void pa_switch_to_partial_serialize(ParallelApplyWorkerInfo *winfo,
bool stream_locked);
@@ -368,8 +374,9 @@ extern void pa_xact_finish(ParallelApplyWorkerInfo *winfo,
XLogRecPtr remote_lsn);
extern bool pa_transaction_committed(TransactionId xid);
extern void pa_record_dependency_on_transactions(List *depends_on_xids);
-
+extern void pa_commit_transaction(void);
extern void pa_wait_for_depended_transaction(TransactionId xid);
+extern void pa_add_parallelized_transaction(TransactionId xid);
#define isParallelApplyWorker(worker) ((worker)->in_use && \
(worker)->type == WORKERTYPE_PARALLEL_APPLY)
diff --git a/src/test/subscription/meson.build b/src/test/subscription/meson.build
index 85d10a89994..e877ca09c30 100644
--- a/src/test/subscription/meson.build
+++ b/src/test/subscription/meson.build
@@ -46,6 +46,7 @@ tests += {
't/034_temporal.pl',
't/035_conflicts.pl',
't/036_sequences.pl',
+ 't/050_parallel_apply.pl',
't/100_bugs.pl',
],
},
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index ecb79e79474..0ccec516a18 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -16,6 +16,8 @@ $node_publisher->start;
# Create subscriber node
my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf',
+ "max_logical_replication_workers = 10");
$node_subscriber->start;
# Create some preexisting content on publisher
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index 3d16c2a800d..c2fba0b9a9c 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -17,7 +17,7 @@ $node_publisher->start;
my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_subscriber->init;
$node_subscriber->append_conf('postgresql.conf',
- qq(max_logical_replication_workers = 6));
+ qq(max_logical_replication_workers = 7));
$node_subscriber->start;
my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
diff --git a/src/test/subscription/t/015_stream.pl b/src/test/subscription/t/015_stream.pl
index 03135b1cd6e..e79ddd9a41c 100644
--- a/src/test/subscription/t/015_stream.pl
+++ b/src/test/subscription/t/015_stream.pl
@@ -232,6 +232,12 @@ $node_subscriber->wait_for_log(
$node_publisher->safe_psql('postgres', "INSERT INTO test_tab_2 values(1)");
+# FIXME: Currently, non-streaming transactions are applied in parallel by
+# default. So, the first transaction is handled by a parallel apply worker. To
+# trigger the deadlock, initiate one more transaction to be applied by the
+# leader.
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab_2 values(1)");
+
$h->query_safe('COMMIT');
$h->quit;
@@ -247,7 +253,7 @@ $node_publisher->wait_for_catchup($appname);
$result =
$node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab_2");
-is($result, qq(5001), 'data replicated to subscriber after dropping index');
+is($result, qq(5002), 'data replicated to subscriber after dropping index');
# Clean up test data from the environment.
$node_publisher->safe_psql('postgres', "TRUNCATE TABLE test_tab_2");
diff --git a/src/test/subscription/t/026_stats.pl b/src/test/subscription/t/026_stats.pl
index a430ab4feec..58e34839ab4 100644
--- a/src/test/subscription/t/026_stats.pl
+++ b/src/test/subscription/t/026_stats.pl
@@ -16,6 +16,7 @@ $node_publisher->start;
# Create subscriber node.
my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf', "max_logical_replication_workers = 10");
$node_subscriber->start;
diff --git a/src/test/subscription/t/027_nosuperuser.pl b/src/test/subscription/t/027_nosuperuser.pl
index 691731743df..e0c1d213800 100644
--- a/src/test/subscription/t/027_nosuperuser.pl
+++ b/src/test/subscription/t/027_nosuperuser.pl
@@ -86,6 +86,7 @@ $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
$node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_publisher->init(allows_streaming => 'logical');
$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf', "max_logical_replication_workers = 10");
$node_publisher->start;
$node_subscriber->start;
$publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
diff --git a/src/test/subscription/t/050_parallel_apply.pl b/src/test/subscription/t/050_parallel_apply.pl
new file mode 100644
index 00000000000..69cf48cb7ac
--- /dev/null
+++ b/src/test/subscription/t/050_parallel_apply.pl
@@ -0,0 +1,130 @@
+
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# This tests that dependency tracking between transactions works correctly
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->start;
+
+# Insert initial data
+$node_publisher->safe_psql('postgres',
+ "CREATE TABLE regress_tab (id int PRIMARY KEY, value text);");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (generate_series(1, 10), 'test');");
+
+# Create a publication
+$node_publisher->safe_psql('postgres',
+ "CREATE PUBLICATION regress_pub FOR ALL TABLES;");
+
+# Initialize subscriber node
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
+$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf', "log_min_messages = debug1");
+$node_subscriber->append_conf('postgresql.conf',
+ "max_logical_replication_workers = 10");
+$node_subscriber->start;
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node_subscriber->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+$node_subscriber->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+
+# Create a subscription
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+
+$node_subscriber->safe_psql('postgres',
+ "CREATE TABLE regress_tab (id int PRIMARY KEY, value text);");
+$node_subscriber->safe_psql('postgres',
+ "CREATE SUBSCRIPTION regress_sub CONNECTION '$publisher_connstr' PUBLICATION regress_pub;");
+
+# Wait for initial table sync to finish
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'regress_sub');
+
+# Insert tuples on publisher
+#
+# XXX This may not be enough to launch a parallel apply worker, because
+# table_states_not_ready is not discarded yet.
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (generate_series(11, 20), 'test');");
+$node_publisher->wait_for_catchup('regress_sub');
+
+# Insert tuples again
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (generate_series(21, 30), 'test');");
+$node_publisher->wait_for_catchup('regress_sub');
+
+# Verify the parallel apply worker is launched
+my $result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(1) FROM pg_stat_activity WHERE backend_type = 'logical replication parallel worker'");
+is($result, '1', "parallel apply worker is laucnhed by a non-streamed transaction");
+
+# Attach an injection_point. Parallel workers would wait before the commit
+$node_subscriber->safe_psql('postgres',
+ "SELECT injection_points_attach('parallel-worker-before-commit','wait');"
+);
+
+# Insert tuples on publisher
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (generate_series(31, 40), 'test');");
+
+# Wait until the parallel worker enters the injection point.
+$node_subscriber->wait_for_event('logical replication parallel worker',
+ 'parallel-worker-before-commit');
+
+my $offset = -s $node_subscriber->logfile;
+
+# Insert tuples on publisher again. This transaction is independent of the
+# previous one, but the parallel worker still waits until that one finishes
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (generate_series(41, 50), 'test');");
+
+# Verify the parallel worker waits for the transaction
+# wait_for_log() returns a log offset rather than the matched text, so
+# slurp the log from the saved offset to capture the xid.
+$node_subscriber->wait_for_log(qr/wait for depended xid [1-9][0-9]+/, $offset);
+my ($xid) = slurp_file($node_subscriber->logfile, $offset) =~
+  /wait for depended xid ([1-9][0-9]+)/;
+
+# Update tuples which have not been applied yet on the subscriber because the
+# parallel worker stops at the injection point. The newly assigned worker also
+# waits for the same transaction as above.
+$node_publisher->safe_psql('postgres',
+ "UPDATE regress_tab SET value = 'updated' WHERE id BETWEEN 31 AND 35;");
+
+# Verify the parallel worker waits for the same transaction
+$node_subscriber->wait_for_log(qr/wait for depended xid $xid/, $offset);
+
+# Wake up the parallel worker. We detach first so as not to stop other parallel workers
+$node_subscriber->safe_psql('postgres', qq[
+ SELECT injection_points_detach('parallel-worker-before-commit');
+ SELECT injection_points_wakeup('parallel-worker-before-commit');
+]);
+
+# Verify the parallel worker wakes up
+$node_subscriber->wait_for_log(qr/finish waiting for depended xid $xid/, $offset);
+
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(1) FROM regress_tab");
+is ($result, 50, 'inserts are replicated to subscriber');
+
+$result =
+ $node_subscriber->safe_psql('postgres',
+ "SELECT count(1) FROM regress_tab WHERE value = 'updated'");
+is ($result, 5, 'updates are also replicated to subscriber');
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5c88fa92f4e..1517828a2d7 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2090,6 +2090,7 @@ ParallelHashGrowth
ParallelHashJoinBatch
ParallelHashJoinBatchAccessor
ParallelHashJoinState
+ParallelizedTxnEntry
ParallelIndexScanDesc
ParallelSlot
ParallelSlotArray
@@ -2574,6 +2575,8 @@ ReparameterizeForeignPathByChild_function
ReplaceVarsFromTargetList_context
ReplaceVarsNoMatchOption
ReplaceWrapOption
+ReplicaIdentityEntry
+ReplicaIdentityKey
ReplicaIdentityStmt
ReplicationKind
ReplicationSlot
@@ -4083,6 +4086,7 @@ rendezvousHashEntry
rep
replace_rte_variables_callback
replace_rte_variables_context
+replica_identity_hash
report_error_fn
ret_type
rewind_source
--
2.47.3
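
For reference, the internal dependency message that
build_dependency_with_last_committed_txn() frames above is simply: one byte
PARALLEL_APPLY_INTERNAL_MESSAGE ('i'), one byte for the logical message type,
an int32 xid count, then the xids, in network byte order as produced by
pq_sendint32(). Below is a tiny standalone sketch of that framing (not part
of the patch; the two constants are invented stand-in values, and the real
code of course goes through StringInfo/pq_send* rather than a raw buffer):

#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PA_INTERNAL_MSG 'i'          /* stand-in: PARALLEL_APPLY_INTERNAL_MESSAGE */
#define MSG_INTERNAL_DEPENDENCY 'd'  /* stand-in: LOGICAL_REP_MSG_INTERNAL_DEPENDENCY */

/* Frame a dependency message: 'i', type byte, xid count, then the xids. */
static size_t
encode_dependency(uint8_t *buf, const uint32_t *xids, uint32_t n)
{
    size_t      off = 0;
    uint32_t    be;

    buf[off++] = PA_INTERNAL_MSG;
    buf[off++] = MSG_INTERNAL_DEPENDENCY;
    be = htonl(n);
    memcpy(buf + off, &be, 4);
    off += 4;
    for (uint32_t i = 0; i < n; i++)
    {
        be = htonl(xids[i]);
        memcpy(buf + off, &be, 4);
        off += 4;
    }
    return off;
}

int
main(void)
{
    uint8_t     buf[64];
    uint32_t    xids[] = {742};
    size_t      len = encode_dependency(buf, xids, 1);
    uint32_t    n, xid;

    /* Decode side: what the parallel apply worker would read back. */
    memcpy(&n, buf + 2, 4);
    memcpy(&xid, buf + 6, 4);
    printf("len=%zu type=%c/%c deps=%u first_xid=%u\n",
           len, buf[0], buf[1], ntohl(n), ntohl(xid));
    return 0;
}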
Attachment: v6-0004-support-2PC.patch
From 7db41cfc2b7690a00ece9d0baa3244a6772b2866 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Tue, 2 Dec 2025 13:01:26 +0900
Subject: [PATCH v6 4/7] support 2PC
This patch allows prepared transactions to be applied in parallel. Parallel
apply workers are assigned to a transaction when BEGIN_PREPARE is received;
this part and the dependency-waiting mechanism are the same as for a normal
transaction.
A parallel worker can be freed after it handles a PREPARE message. The prepared
transaction can be deleted from parallelized_txns at that time; upcoming
transactions will wait until then.
The leader apply worker resolves COMMIT PREPARED/ROLLBACK PREPARED. Since these
are applied serially by the leader, ordering is preserved automatically and the
transaction is not added to parallelized_txns.
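
To make that division of labor concrete, here is a toy decision table for the
flow just described (only a sketch; the real dispatch is the TRANS_* switch in
worker.c, and the enum values below are invented stand-ins):

#include <stdio.h>

typedef enum
{
    MSG_BEGIN_PREPARE,          /* worker assigned here, like BEGIN */
    MSG_PREPARE,                /* worker prepares, then is freed */
    MSG_COMMIT_PREPARED,        /* resolved by the leader */
    MSG_ROLLBACK_PREPARED       /* resolved by the leader */
} TwoPhaseMsg;

static const char *
who_applies(TwoPhaseMsg m)
{
    switch (m)
    {
        case MSG_BEGIN_PREPARE:
        case MSG_PREPARE:
            return "parallel apply worker (xid tracked in parallelized_txns)";
        case MSG_COMMIT_PREPARED:
        case MSG_ROLLBACK_PREPARED:
            return "leader apply worker (waits for dependencies, no entry added)";
    }
    return "unreachable";
}

int
main(void)
{
    for (int m = MSG_BEGIN_PREPARE; m <= MSG_ROLLBACK_PREPARED; m++)
        printf("%d -> %s\n", m, who_applies((TwoPhaseMsg) m));
    return 0;
}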
---
src/backend/replication/logical/worker.c | 230 +++++++++++++++---
src/test/subscription/t/050_parallel_apply.pl | 57 +++++
2 files changed, 259 insertions(+), 28 deletions(-)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 3832481647e..ab757e3fac9 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2116,6 +2116,11 @@ static void
apply_handle_begin_prepare(StringInfo s)
{
LogicalRepPreparedTxnData begin_data;
+ ParallelApplyWorkerInfo *winfo;
+ TransApplyAction apply_action;
+
+ /* Save the message before it is consumed. */
+ StringInfoData original_msg = *s;
/* Tablesync should never receive prepare. */
if (am_tablesync_worker())
@@ -2127,12 +2132,61 @@ apply_handle_begin_prepare(StringInfo s)
Assert(!TransactionIdIsValid(stream_xid));
logicalrep_read_begin_prepare(s, &begin_data);
- set_apply_error_context_xact(begin_data.xid, begin_data.prepare_lsn);
+
+ remote_xid = begin_data.xid;
+
+ set_apply_error_context_xact(remote_xid, begin_data.prepare_lsn);
remote_final_lsn = begin_data.prepare_lsn;
maybe_start_skipping_changes(begin_data.prepare_lsn);
+ pa_allocate_worker(remote_xid, false);
+
+ apply_action = get_transaction_apply_action(remote_xid, &winfo);
+
+ elog(DEBUG1, "new remote_xid %u", remote_xid);
+ switch (apply_action)
+ {
+ case TRANS_LEADER_APPLY:
+ break;
+
+ case TRANS_LEADER_SEND_TO_PARALLEL:
+ Assert(winfo);
+
+ if (pa_send_data(winfo, s->len, s->data))
+ {
+ pa_set_stream_apply_worker(winfo);
+ break;
+ }
+
+ /*
+ * Switch to serialize mode when we are not able to send the
+ * change to parallel apply worker.
+ */
+ pa_switch_to_partial_serialize(winfo, true);
+
+/* fall through */
+ case TRANS_LEADER_PARTIAL_SERIALIZE:
+ Assert(winfo);
+
+ stream_write_change(LOGICAL_REP_MSG_BEGIN_PREPARE, &original_msg);
+
+ /* Cache the parallel apply worker for this transaction. */
+ pa_set_stream_apply_worker(winfo);
+ break;
+
+ case TRANS_PARALLEL_APPLY:
+ /* Hold the lock until the end of the transaction. */
+ pa_lock_transaction(MyParallelShared->xid, AccessExclusiveLock);
+ pa_set_xact_state(MyParallelShared, PARALLEL_TRANS_STARTED);
+ break;
+
+ default:
+ elog(ERROR, "unexpected apply action: %d", (int) apply_action);
+ break;
+ }
+
in_remote_transaction = true;
pgstat_report_activity(STATE_RUNNING, NULL);
@@ -2182,6 +2236,11 @@ static void
apply_handle_prepare(StringInfo s)
{
LogicalRepPreparedTxnData prepare_data;
+ ParallelApplyWorkerInfo *winfo;
+ TransApplyAction apply_action;
+
+ /* Save the message before it is consumed. */
+ StringInfoData original_msg = *s;
logicalrep_read_prepare(s, &prepare_data);
@@ -2192,36 +2251,136 @@ apply_handle_prepare(StringInfo s)
LSN_FORMAT_ARGS(prepare_data.prepare_lsn),
LSN_FORMAT_ARGS(remote_final_lsn))));
- /*
- * Unlike commit, here, we always prepare the transaction even though no
- * change has happened in this transaction or all changes are skipped. It
- * is done this way because at commit prepared time, we won't know whether
- * we have skipped preparing a transaction because of those reasons.
- *
- * XXX, We can optimize such that at commit prepared time, we first check
- * whether we have prepared the transaction or not but that doesn't seem
- * worthwhile because such cases shouldn't be common.
- */
- begin_replication_step();
+ apply_action = get_transaction_apply_action(remote_xid, &winfo);
- apply_handle_prepare_internal(&prepare_data);
+ switch (apply_action)
+ {
+ case TRANS_LEADER_APPLY:
+ /*
+ * Unlike commit, here, we always prepare the transaction even
+ * though no change has happened in this transaction or all changes
+ * are skipped. It is done this way because at commit prepared
+ * time, we won't know whether we have skipped preparing a
+ * transaction because of those reasons.
+ *
+ * XXX, We can optimize such that at commit prepared time, we first
+ * check whether we have prepared the transaction or not but that
+ * doesn't seem worthwhile because such cases shouldn't be common.
+ */
+ begin_replication_step();
- end_replication_step();
- CommitTransactionCommand();
- pgstat_report_stat(false);
+ /* Wait until the last transaction finishes */
+ if (TransactionIdIsValid(last_remote_xid))
+ pa_wait_for_depended_transaction(last_remote_xid);
- /*
- * It is okay not to set the local_end LSN for the prepare because we
- * always flush the prepare record. So, we can send the acknowledgment of
- * the remote_end LSN as soon as prepare is finished.
- *
- * XXX For the sake of consistency with commit, we could have set it with
- * the LSN of prepare but as of now we don't track that value similar to
- * XactLastCommitEnd, and adding it for this purpose doesn't seems worth
- * it.
- */
- store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr,
- InvalidTransactionId);
+ apply_handle_prepare_internal(&prepare_data);
+
+ end_replication_step();
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ /*
+ * It is okay not to set the local_end LSN for the prepare because
+ * we always flush the prepare record. So, we can send the
+ * acknowledgment of the remote_end LSN as soon as prepare is
+ * finished.
+ *
+ * XXX For the sake of consistency with commit, we could have set
+ * it with the LSN of prepare but as of now we don't track that
+ * value similar to XactLastCommitEnd, and adding it for this
+ * purpose doesn't seem worth it.
+ */
+ store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr,
+ InvalidTransactionId);
+
+ break;
+
+ case TRANS_LEADER_SEND_TO_PARALLEL:
+ Assert(winfo);
+
+ /*
+ * Mark this transaction as parallelized. This ensures that
+ * upcoming transactions wait until this transaction is prepared locally.
+ */
+ pa_add_parallelized_transaction(remote_xid);
+
+ /*
+ * Build a dependency between this transaction and the last
+ * committed transaction to preserve the commit order. If that
+ * succeeds, try to send the PREPARE message.
+ */
+ if (build_dependency_with_last_committed_txn(winfo) &&
+ pa_send_data(winfo, s->len, s->data))
+ {
+ /* Finish processing the transaction. */
+ pa_xact_finish(winfo, prepare_data.end_lsn);
+ break;
+ }
+
+ /*
+ * Switch to serialize mode when we are not able to send the
+ * change to parallel apply worker.
+ */
+ pa_switch_to_partial_serialize(winfo, true);
+/* fall through */
+ case TRANS_LEADER_PARTIAL_SERIALIZE:
+ Assert(winfo);
+
+ stream_open_and_write_change(remote_xid, LOGICAL_REP_MSG_PREPARE,
+ &original_msg);
+
+ pa_set_fileset_state(winfo->shared, FS_SERIALIZE_DONE);
+
+ /* Finish processing the transaction. */
+ pa_xact_finish(winfo, prepare_data.end_lsn);
+ break;
+
+ case TRANS_PARALLEL_APPLY:
+
+ /*
+ * If the parallel apply worker is applying spooled messages then
+ * close the file before committing.
+ */
+ if (stream_fd)
+ stream_close_file();
+
+ begin_replication_step();
+
+ INJECTION_POINT("parallel-worker-before-prepare", NULL);
+
+ /* Mark the transaction as prepared. */
+ apply_handle_prepare_internal(&prepare_data);
+
+ end_replication_step();
+
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr,
+ InvalidTransactionId);
+
+ /*
+ * It is okay not to set the local_end LSN for the prepare because
+ * we always flush the prepare record. See apply_handle_prepare.
+ */
+ MyParallelShared->last_commit_end = InvalidXLogRecPtr;
+ pa_commit_transaction();
+
+ pa_unlock_transaction(MyParallelShared->xid, AccessExclusiveLock);
+
+ pa_reset_subtrans();
+ break;
+
+ default:
+ elog(ERROR, "unexpected apply action: %d", (int) apply_action);
+ break;
+ }
+
+ /* Cache the remote_xid */
+ last_remote_xid = remote_xid;
+
+ remote_xid = InvalidTransactionId;
in_remote_transaction = false;
@@ -2269,6 +2428,9 @@ apply_handle_commit_prepared(StringInfo s)
/* There is no transaction when COMMIT PREPARED is called */
begin_replication_step();
+ if (TransactionIdIsValid(last_remote_xid))
+ pa_wait_for_depended_transaction(last_remote_xid);
+
/*
* Update origin state so we can restart streaming from correct position
* in case of crash.
@@ -2281,6 +2443,14 @@ apply_handle_commit_prepared(StringInfo s)
CommitTransactionCommand();
pgstat_report_stat(false);
+ /*
+ * No need to update last_remote_xid because the leader worker applied
+ * the message, so upcoming transactions preserve the order
+ * automatically. Set the xid to an invalid value to skip sending the
+ * INTERNAL_DEPENDENCY message.
+ */
+ last_remote_xid = InvalidTransactionId;
+
store_flush_position(prepare_data.end_lsn, XactLastCommitEnd,
InvalidTransactionId);
in_remote_transaction = false;
@@ -2337,6 +2507,10 @@ apply_handle_rollback_prepared(StringInfo s)
/* There is no transaction when ABORT/ROLLBACK PREPARED is called */
begin_replication_step();
+
+ if (TransactionIdIsValid(last_remote_xid))
+ pa_wait_for_depended_transaction(last_remote_xid);
+
FinishPreparedTransaction(gid, false);
end_replication_step();
CommitTransactionCommand();
diff --git a/src/test/subscription/t/050_parallel_apply.pl b/src/test/subscription/t/050_parallel_apply.pl
index 69cf48cb7ac..57bcfde513e 100644
--- a/src/test/subscription/t/050_parallel_apply.pl
+++ b/src/test/subscription/t/050_parallel_apply.pl
@@ -17,6 +17,8 @@ if ($ENV{enable_injection_points} ne 'yes')
# Initialize publisher node
my $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+ "max_prepared_transactions = 10");
$node_publisher->start;
# Insert initial data
@@ -35,6 +37,8 @@ $node_subscriber->init;
$node_subscriber->append_conf('postgresql.conf', "log_min_messages = debug1");
$node_subscriber->append_conf('postgresql.conf',
"max_logical_replication_workers = 10");
+$node_subscriber->append_conf('postgresql.conf',
+ "max_prepared_transactions = 10");
$node_subscriber->start;
# Check if the extension injection_points is available, as it may be
@@ -127,4 +131,57 @@ $result =
"SELECT count(1) FROM regress_tab WHERE value = 'updated'");
is ($result, 5, 'updates are also replicated to subscriber');
+# Ensure PREPAREd transaction also affects the parallel apply
+
+$node_subscriber->safe_psql('postgres',
+ "ALTER SUBSCRIPTION regress_sub DISABLE;");
+$node_subscriber->safe_psql('postgres',
+ "ALTER SUBSCRIPTION regress_sub SET (two_phase = on);");
+$node_subscriber->safe_psql('postgres',
+ "ALTER SUBSCRIPTION regress_sub ENABLE;");
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(1) FROM pg_stat_activity WHERE backend_type = 'logical replication parallel worker'");
+is($result, '0', "no parallel apply workers exist after restart");
+
+# Attach an injection_point. Parallel workers would wait before the prepare
+$node_subscriber->safe_psql('postgres',
+ "SELECT injection_points_attach('parallel-worker-before-prepare','wait');"
+);
+
+# PREPARE a transaction on publisher. It would be handled by a parallel apply
+# worker.
+$node_publisher->safe_psql('postgres', qq[
+ BEGIN;
+ INSERT INTO regress_tab VALUES (generate_series(51, 60), 'prepare');
+ PREPARE TRANSACTION 'regress_prepare';
+]);
+
+# Wait until the parallel worker enters the injection point.
+$node_subscriber->wait_for_event('logical replication parallel worker',
+ 'parallel-worker-before-prepare');
+
+$offset = -s $node_subscriber->logfile;
+
+# Insert tuples on publisher again. This transaction waits for the prepared
+# transaction to finish.
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (generate_series(61, 70), 'test');");
+
+# Verify the parallel worker waits for the transaction
+$node_subscriber->wait_for_log(qr/wait for depended xid [1-9][0-9]+/, $offset);
+($xid) = slurp_file($node_subscriber->logfile, $offset) =~
+  /wait for depended xid ([1-9][0-9]+)/;
+
+# Wakeup the parallel worker
+$node_subscriber->safe_psql('postgres', qq[
+ SELECT injection_points_detach('parallel-worker-before-prepare');
+ SELECT injection_points_wakeup('parallel-worker-before-prepare');
+]);
+
+$node_subscriber->wait_for_log(qr/finish waiting for depended xid $xid/, $offset);
+
+# COMMIT the prepared transaction. It is always handled by the leader
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'regress_prepare';");
+$node_publisher->wait_for_catchup('regress_sub');
+
done_testing();
--
2.47.3
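
Taken together, the parallelized_txns handshake used in these patches boils
down to three operations: the leader registers an xid when it hands a
commit/prepare to a worker, the worker deletes the entry when it finishes, and
anyone depending on that xid blocks until the entry is gone. Below is a
minimal standalone model of that protocol (a mutex/condvar stands in for
dshash plus the transaction-lock wait; all names here are invented):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define MAX_TXNS 64

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static unsigned live_xids[MAX_TXNS];
static int n_live = 0;

/* Leader: analogous to pa_add_parallelized_transaction(). */
static void
register_txn(unsigned xid)
{
    pthread_mutex_lock(&lock);
    live_xids[n_live++] = xid;
    pthread_mutex_unlock(&lock);
}

/* Worker: analogous to pa_commit_transaction(); wakes all waiters. */
static void
commit_txn(unsigned xid)
{
    pthread_mutex_lock(&lock);
    for (int i = 0; i < n_live; i++)
        if (live_xids[i] == xid)
            live_xids[i] = live_xids[--n_live];
    pthread_cond_broadcast(&cv);
    pthread_mutex_unlock(&lock);
}

/* Dependent: analogous to pa_wait_for_depended_transaction(). */
static void
wait_for_txn(unsigned xid)
{
    pthread_mutex_lock(&lock);
restart:
    for (int i = 0; i < n_live; i++)
        if (live_xids[i] == xid)
        {
            pthread_cond_wait(&cv, &lock);
            goto restart;
        }
    pthread_mutex_unlock(&lock);
}

static void *
worker(void *arg)
{
    usleep(10000);              /* pretend to apply the changes */
    commit_txn((unsigned) (unsigned long) arg);
    return NULL;
}

int
main(void)
{
    pthread_t   t;

    register_txn(742);          /* leader dispatches txn 742 to a worker */
    pthread_create(&t, NULL, worker, (void *) 742UL);
    wait_for_txn(742);          /* a dependent txn blocks until 742 commits */
    pthread_join(t, NULL);
    puts("dependency released after commit");
    return 0;
}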
Attachment: v6-0005-Track-dependencies-for-streamed-transactions.patch
From d0ed14991f075640876c1de34450f47efa965162 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Thu, 4 Dec 2025 20:55:26 +0900
Subject: [PATCH v6 5/7] Track dependencies for streamed transactions
This commit allows tracking dependencies of streamed transactions.
In the streaming=on case, dependency tracking is enabled while applying
spooled changes from files.
In the streaming=parallel case, dependency tracking is performed when the
leader sends changes to parallel workers. Unlike non-streamed transactions,
the leader waits for parallel workers until the assigned transactions finish
at COMMIT/PREPARE/ABORT; thus, the XID of a streamed transaction is not cached
as the last handled one. Also, streamed transactions are not recorded as
parallelized transactions because upcoming workers do not have to wait for them.
---
.../replication/logical/applyparallelworker.c | 19 +++++-
src/backend/replication/logical/worker.c | 66 +++++++++++++++++--
src/include/replication/worker_internal.h | 2 +-
src/test/subscription/t/050_parallel_apply.pl | 47 +++++++++++++
4 files changed, 126 insertions(+), 8 deletions(-)
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index 5b6267c6047..bb66d64582c 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -168,7 +168,14 @@
* key) as another ongoing transaction (see handle_dependency_on_change for
* details). If so, the leader sends a list of dependent transaction IDs to the
* parallel worker, indicating that the parallel apply worker must wait for
- * these transactions to commit before proceeding.
+ * these transactions to commit before proceeding. If transactions are streamed
+ * but the leader decides not to assign parallel apply workers, dependencies are
+ * verified when the transaction is committed.
+ *
+ * Non-streaming transactions
+ * ==========================
+ * The handling is similar to streaming transactions, with a few
+ * differences:
*
* Commit order
* ------------
@@ -1635,6 +1642,12 @@ pa_set_stream_apply_worker(ParallelApplyWorkerInfo *winfo)
stream_apply_worker = winfo;
}
+bool
+pa_stream_apply_worker_is_null(void)
+{
+ return stream_apply_worker == NULL;
+}
+
/*
* Form a unique savepoint name for the streaming transaction.
*
@@ -1720,6 +1733,10 @@ pa_stream_abort(LogicalRepStreamAbortData *abort_data)
TransactionId xid = abort_data->xid;
TransactionId subxid = abort_data->subxid;
+ /* Streamed transactions won't be registered */
+ Assert(!dshash_find(parallelized_txns, &xid, false) &&
+ !dshash_find(parallelized_txns, &subxid, false));
+
/*
* Update origin state so we can restart streaming from correct position
* in case of crash.
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index ab757e3fac9..3057e6a3aab 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -961,13 +961,26 @@ check_dependency_on_replica_identity(Oid relid,
&rientry->remote_xid,
new_depended_xid);
+ /*
+ * Remove the entry if it is registered for a streamed transaction. We
+ * do not have to register an entry for them; the leader worker always
+ * waits until the parallel worker finishes handling streamed transactions,
+ * so there is no need to consider the possibility that upcoming parallel
+ * workers would go ahead.
+ */
+ if (TransactionIdIsValid(stream_xid) && !found)
+ {
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+ }
+
/*
* Update the new depended xid into the entry if valid, the new xid could
* be invalid if the transaction will be applied by the leader itself
* which means all the changes will be committed before processing next
* transaction, so no need to be depended on.
*/
- if (TransactionIdIsValid(new_depended_xid))
+ else if (TransactionIdIsValid(new_depended_xid))
rientry->remote_xid = new_depended_xid;
/*
@@ -1081,8 +1094,11 @@ handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
*/
StringInfoData change = *s;
- /* Compute dependency only for non-streaming transaction */
- if (in_streamed_transaction || (winfo && winfo->stream_txn))
+ /*
+ * Skip if we are handling a streaming transaction whose changes are
+ * not being applied yet.
+ */
+ if (pa_stream_apply_worker_is_null() && in_streamed_transaction)
return;
/* Only the leader checks dependencies and schedules the parallel apply */
@@ -1442,7 +1458,18 @@ handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
(errcode(ERRCODE_PROTOCOL_VIOLATION),
errmsg_internal("invalid transaction ID in streamed replication transaction")));
- handle_dependency_on_change(action, s, current_xid, winfo);
+ /*
+ * Check dependencies related to the received change. The XID of the top
+ * transaction is always used to avoid detecting false-positive
+ * dependencies between top and sub transactions. Sub-transactions can be
+ * replicated for streamed transactions, and they won't be marked as
+ * parallelized so that parallel workers won't wait for rolled-back
+ * sub-transactions.
+ */
+ handle_dependency_on_change(action, s,
+ in_streamed_transaction
+ ? stream_xid : remote_xid,
+ winfo);
/*
* Re-fetch the latest apply action as it might have been changed during
@@ -2579,6 +2606,10 @@ apply_handle_stream_prepare(StringInfo s)
apply_spooled_messages(MyLogicalRepWorker->stream_fileset,
prepare_data.xid, prepare_data.prepare_lsn);
+ /* Wait until the last transaction finishes */
+ if (TransactionIdIsValid(last_remote_xid))
+ pa_wait_for_depended_transaction(last_remote_xid);
+
/* Mark the transaction as prepared. */
apply_handle_prepare_internal(&prepare_data);
@@ -2602,7 +2633,8 @@ apply_handle_stream_prepare(StringInfo s)
case TRANS_LEADER_SEND_TO_PARALLEL:
Assert(winfo);
- if (pa_send_data(winfo, s->len, s->data))
+ if (build_dependency_with_last_committed_txn(winfo) &&
+ pa_send_data(winfo, s->len, s->data))
{
/* Finish processing the streaming transaction. */
pa_xact_finish(winfo, prepare_data.end_lsn);
@@ -2668,6 +2700,11 @@ apply_handle_stream_prepare(StringInfo s)
pgstat_report_stat(false);
+ /*
+ * No need to update last_remote_xid here because the leader worker
+ * always waits until streamed transactions finish.
+ */
+
/*
* Process any tables that are being synchronized in parallel, as well as
* any newly added tables or sequences.
@@ -3452,6 +3489,10 @@ apply_handle_stream_commit(StringInfo s)
apply_spooled_messages(MyLogicalRepWorker->stream_fileset, xid,
commit_data.commit_lsn);
+ /* Wait until the last transaction finishes */
+ if (TransactionIdIsValid(last_remote_xid))
+ pa_wait_for_depended_transaction(last_remote_xid);
+
apply_handle_commit_internal(&commit_data);
/* Unlink the files with serialized changes and subxact info. */
@@ -3463,7 +3504,20 @@ apply_handle_stream_commit(StringInfo s)
case TRANS_LEADER_SEND_TO_PARALLEL:
Assert(winfo);
- if (pa_send_data(winfo, s->len, s->data))
+ /*
+ * Unlike the non-streaming case, there is no need to mark this
+ * transaction as parallelized, because the leader waits until the
+ * streamed transaction is committed, so commit ordering is always
+ * preserved.
+ */
+
+ /*
+ * Build a dependency between this transaction and the last
+ * committed transaction to preserve the commit order, then try to
+ * send the COMMIT message if that succeeded.
+ */
+ if (build_dependency_with_last_committed_txn(winfo) &&
+ pa_send_data(winfo, s->len, s->data))
{
/* Finish processing the streaming transaction. */
pa_xact_finish(winfo, commit_data.end_lsn);
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 5371ee767f1..69ecd51a359 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -354,7 +354,7 @@ extern void pa_switch_to_partial_serialize(ParallelApplyWorkerInfo *winfo,
extern void pa_set_xact_state(ParallelApplyWorkerShared *wshared,
ParallelTransState xact_state);
extern void pa_set_stream_apply_worker(ParallelApplyWorkerInfo *winfo);
-
+extern bool pa_stream_apply_worker_is_null(void);
extern void pa_start_subtrans(TransactionId current_xid,
TransactionId top_xid);
extern void pa_reset_subtrans(void);
diff --git a/src/test/subscription/t/050_parallel_apply.pl b/src/test/subscription/t/050_parallel_apply.pl
index 57bcfde513e..20e8a7b91a7 100644
--- a/src/test/subscription/t/050_parallel_apply.pl
+++ b/src/test/subscription/t/050_parallel_apply.pl
@@ -184,4 +184,51 @@ $node_subscriber->wait_for_log(qr/finish waiting for depended xid $xid/, $offset
$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'regress_prepare';");
$node_publisher->wait_for_catchup('regress_sub');
+# Ensure streamed transactions wait for the previous transaction
+
+$node_publisher->append_conf('postgresql.conf',
+ "logical_decoding_work_mem = 64kB");
+$node_publisher->reload;
+# Run a query to make sure that the reload has taken effect.
+$node_publisher->safe_psql('postgres', "SELECT 1");
+
+# Attach the injection_point again
+$node_subscriber->safe_psql('postgres',
+ "SELECT injection_points_attach('parallel-worker-before-commit','wait');"
+);
+
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (generate_series(71, 80), 'test');");
+
+# Wait until the parallel worker enters the injection point.
+$node_subscriber->wait_for_event('logical replication parallel worker',
+ 'parallel-worker-before-commit');
+
+# Run a transaction which would be streamed
+my $h = $node_publisher->background_psql('postgres', on_error_stop => 0);
+
+$offset = -s $node_subscriber->logfile;
+
+$h->query_safe(
+ q{
+BEGIN;
+UPDATE regress_tab SET value = 'streamed-updated' WHERE id BETWEEN 71 AND 80;
+INSERT INTO regress_tab VALUES (generate_series(100, 5100), 'streamed');
+});
+
+# Verify the parallel worker waits for the transaction
+$str = $node_subscriber->wait_for_log(qr/wait for depended xid ([1-9][0-9]+)/, $offset);
+($xid) = $str =~ /wait for depended xid ([1-9][0-9]+)/;
+
+# Wakeup the parallel worker
+$node_subscriber->safe_psql('postgres', qq[
+ SELECT injection_points_detach('parallel-worker-before-commit');
+ SELECT injection_points_wakeup('parallel-worker-before-commit');
+]);
+
+# Verify the streamed transaction can be applied
+$node_subscriber->wait_for_log(qr/finish waiting for depended xid $xid/, $offset);
+
+$h->query_safe("COMMIT;");
+
done_testing();
--
2.47.3
Attachment: v6-0006-Wait-applying-transaction-if-one-of-user-defined-.patch
From 2aa2a27e3b8961c7aa4f1ca04da360e82c6cfe1b Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Tue, 23 Dec 2025 17:58:15 +0900
Subject: [PATCH v6 6/7] Wait applying transaction if one of user-defined
 triggers is not immutable
Since many parallel workers apply transactions, triggers on relations can also
fire in parallel, which may produce unexpected results. To make this safe,
parallel apply workers wait for the previously dispatched transaction before
applying changes to a relation that has mutable triggers.
---
src/backend/replication/logical/relation.c | 123 ++++++++++++++++++---
src/backend/replication/logical/worker.c | 68 ++++++++++++
src/include/replication/logicalrelation.h | 20 ++++
3 files changed, 197 insertions(+), 14 deletions(-)
diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index 9991bfe76cc..14f3ebf725e 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -21,7 +21,9 @@
#include "access/genam.h"
#include "access/table.h"
#include "catalog/namespace.h"
+#include "catalog/pg_proc.h"
#include "catalog/pg_subscription_rel.h"
+#include "commands/trigger.h"
#include "executor/executor.h"
#include "nodes/makefuncs.h"
#include "replication/logicalrelation.h"
@@ -159,6 +161,10 @@ logicalrep_relmap_free_entry(LogicalRepRelMapEntry *entry)
*
* Called when new relation mapping is sent by the publisher to update
* our expected view of incoming data from said publisher.
+ *
+ * Note that we do not check user-defined constraints here. PostgreSQL
+ * already assumes that CHECK constraint conditions are immutable, and we
+ * follow that rule here.
*/
void
logicalrep_relmap_update(LogicalRepRelation *remoterel)
@@ -208,6 +214,8 @@ logicalrep_relmap_update(LogicalRepRelation *remoterel)
(remoterel->relkind == 0) ? RELKIND_RELATION : remoterel->relkind;
entry->remoterel.attkeys = bms_copy(remoterel->attkeys);
+
+ entry->parallel_safe = LOGICALREP_PARALLEL_UNKNOWN;
MemoryContextSwitchTo(oldctx);
}
@@ -353,27 +361,79 @@ logicalrep_rel_mark_updatable(LogicalRepRelMapEntry *entry)
}
/*
- * Open the local relation associated with the remote one.
+ * Check all local triggers for the relation to see the parallelizability.
*
- * Rebuilds the Relcache mapping if it was invalidated by local DDL.
+ * We regard a relation as safe to apply in parallel if all of its triggers
+ * are immutable. The result is stored in LogicalRepRelMapEntry::parallel_safe.
*/
-LogicalRepRelMapEntry *
-logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
+static void
+check_defined_triggers(LogicalRepRelMapEntry *entry)
+{
+ TriggerDesc *trigdesc = entry->localrel->trigdesc;
+
+ /* Quick exit if no triggers are defined */
+ if (trigdesc == NULL)
+ {
+ entry->parallel_safe = LOGICALREP_PARALLEL_SAFE;
+ return;
+ }
+
+ /* Examine the triggers one by one to check their volatility */
+ for (int i = 0; i < trigdesc->numtriggers; i++)
+ {
+ Trigger *trigger = &trigdesc->triggers[i];
+
+ Assert(OidIsValid(trigger->tgfoid));
+
+ /* Skip if the trigger is not enabled for logical replication */
+ if (trigger->tgenabled == TRIGGER_DISABLED ||
+ trigger->tgenabled == TRIGGER_FIRES_ON_ORIGIN)
+ continue;
+
+ /* Check the volatility of the trigger. Exit if it is not immutable */
+ if (func_volatile(trigger->tgfoid) != PROVOLATILE_IMMUTABLE)
+ {
+ entry->parallel_safe = LOGICALREP_PARALLEL_RESTRICTED;
+ return;
+ }
+ }
+
+ /* All triggers are immutable, set as parallel safe */
+ entry->parallel_safe = LOGICALREP_PARALLEL_SAFE;
+}
+
+/*
+ * Actual workhorse for logicalrep_rel_open().
+ *
+ * The caller must specify *either* entry or remoteid. If the entry is
+ * specified, its attributes are filled in and the local relation is kept
+ * open. If remoteid is given, the corresponding entry is first looked up in
+ * the hash table and processed as above; at the end, the relation is
+ * closed.
+ */
+void
+logicalrep_rel_load(LogicalRepRelMapEntry *entry, LogicalRepRelId remoteid,
+ LOCKMODE lockmode)
{
- LogicalRepRelMapEntry *entry;
- bool found;
LogicalRepRelation *remoterel;
- if (LogicalRepRelMap == NULL)
- logicalrep_relmap_init();
+ Assert((entry && !remoteid) || (!entry && remoteid));
- /* Search for existing entry. */
- entry = hash_search(LogicalRepRelMap, &remoteid,
- HASH_FIND, &found);
+ if (!entry)
+ {
+ bool found;
- if (!found)
- elog(ERROR, "no relation map entry for remote relation ID %u",
- remoteid);
+ if (LogicalRepRelMap == NULL)
+ logicalrep_relmap_init();
+
+ /* Search for existing entry. */
+ entry = hash_search(LogicalRepRelMap, &remoteid,
+ HASH_FIND, &found);
+
+ if (!found)
+ elog(ERROR, "no relation map entry for remote relation ID %u",
+ remoteid);
+ }
remoterel = &entry->remoterel;
@@ -499,6 +559,13 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
entry->localindexoid = FindLogicalRepLocalIndex(entry->localrel, remoterel,
entry->attrmap);
+ /*
+ * The leader must also gather relation-level information for dependency
+ * tracking.
+ */
+ if (am_leader_apply_worker())
+ check_defined_triggers(entry);
+
entry->localrelvalid = true;
}
@@ -507,6 +574,34 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
entry->localreloid,
&entry->statelsn);
+ if (remoteid)
+ logicalrep_rel_close(entry, lockmode);
+}
+
+/*
+ * Open the local relation associated with the remote one.
+ *
+ * Rebuilds the Relcache mapping if it was invalidated by local DDL.
+ */
+LogicalRepRelMapEntry *
+logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
+{
+ LogicalRepRelMapEntry *entry;
+ bool found;
+
+ if (LogicalRepRelMap == NULL)
+ logicalrep_relmap_init();
+
+ /* Search for existing entry. */
+ entry = hash_search(LogicalRepRelMap, &remoteid,
+ HASH_FIND, &found);
+
+ if (!found)
+ elog(ERROR, "no relation map entry for remote relation ID %u",
+ remoteid);
+
+ logicalrep_rel_load(entry, 0, lockmode);
+
return entry;
}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 3057e6a3aab..72383ab78b8 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1062,6 +1062,59 @@ check_dependency_on_rel(LogicalRepRelId relid, TransactionId new_depended_xid,
relentry->last_depended_xid = new_depended_xid;
}
+/*
+ * Check the parallelizability of applying changes for the relation.
+ * Append the lastly dispatched transaction in in 'depends_on_xids' if the
+ * relation is parallel unsafe.
+ */
+static void
+check_dependency_for_parallel_safety(LogicalRepRelId relid,
+ TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ LogicalRepRelMapEntry *relentry;
+
+ /* Quick exit if no transactions have been dispatched */
+ if (!TransactionIdIsValid(last_remote_xid))
+ return;
+
+ relentry = logicalrep_get_relentry(relid);
+
+ /*
+ * Gather information about the local triggers if not done yet. We must
+ * be in a transaction state because system catalogs are read.
+ */
+ if (relentry->parallel_safe == LOGICALREP_PARALLEL_UNKNOWN)
+ {
+ bool needs_start = !IsTransactionOrTransactionBlock();
+
+ if (needs_start)
+ StartTransactionCommand();
+
+ logicalrep_rel_load(NULL, relid, AccessShareLock);
+
+ /*
+ * Close the transaction if we started it here. We must not abort because it
+ * would release all session-level locks, such as the stream lock, and
+ * break the deadlock detection mechanism between LA and PA. The
+ * outcome is the same regardless of the end status, since the
+ * transaction did not modify any tuples.
+ */
+ if (needs_start)
+ CommitTransactionCommand();
+
+ Assert(relentry->parallel_safe != LOGICALREP_PARALLEL_UNKNOWN);
+ }
+
+ /* Do nothing for parallel safe relations */
+ if (relentry->parallel_safe == LOGICALREP_PARALLEL_SAFE)
+ return;
+
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &last_remote_xid,
+ new_depended_xid);
+}
+
/*
* Check dependencies related to the current change by determining if the
* modification impacts the same row or table as another ongoing transaction. If
@@ -1120,6 +1173,8 @@ handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
check_dependency_on_replica_identity(relid, &newtup,
new_depended_xid,
&depends_on_xids);
+ check_dependency_for_parallel_safety(relid, new_depended_xid,
+ &depends_on_xids);
break;
case LOGICAL_REP_MSG_UPDATE:
@@ -1127,13 +1182,19 @@ handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
&newtup);
if (has_oldtup)
+ {
check_dependency_on_replica_identity(relid, &oldtup,
new_depended_xid,
&depends_on_xids);
+ check_dependency_for_parallel_safety(relid, new_depended_xid,
+ &depends_on_xids);
+ }
check_dependency_on_replica_identity(relid, &newtup,
new_depended_xid,
&depends_on_xids);
+ check_dependency_for_parallel_safety(relid, new_depended_xid,
+ &depends_on_xids);
break;
case LOGICAL_REP_MSG_DELETE:
@@ -1141,6 +1202,8 @@ handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
check_dependency_on_replica_identity(relid, &oldtup,
new_depended_xid,
&depends_on_xids);
+ check_dependency_for_parallel_safety(relid, new_depended_xid,
+ &depends_on_xids);
break;
case LOGICAL_REP_MSG_TRUNCATE:
@@ -1153,8 +1216,13 @@ handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
* modified the same table.
*/
foreach_int(truncated_relid, remote_relids)
+ {
check_dependency_on_rel(truncated_relid, new_depended_xid,
&depends_on_xids);
+ check_dependency_for_parallel_safety(truncated_relid,
+ new_depended_xid,
+ &depends_on_xids);
+ }
break;
diff --git a/src/include/replication/logicalrelation.h b/src/include/replication/logicalrelation.h
index 34a7069e9e5..e3d0df58620 100644
--- a/src/include/replication/logicalrelation.h
+++ b/src/include/replication/logicalrelation.h
@@ -39,6 +39,20 @@ typedef struct LogicalRepRelMapEntry
XLogRecPtr statelsn;
TransactionId last_depended_xid;
+
+ /*
+ * Whether the relation can be applied in parallel or not. It is
+ * distinglish whether defined triggers are the immutable or not.
+ *
+ * Theoretically, we can determine the parallelizability for each type of
+ * replication messages, INSERT/UPDATE/DELETE/TRUNCATE. But it is not done
+ * yet to reduce the number of attributes.
+ *
+ * Note that we do not check the user-defined constraints here. PostgreSQL
+ * has already assumed that CHECK constraints' conditions are immutable and
+ * here follows the rule.
+ */
+ char parallel_safe;
} LogicalRepRelMapEntry;
extern void logicalrep_relmap_update(LogicalRepRelation *remoterel);
@@ -46,6 +60,8 @@ extern void logicalrep_partmap_reset_relmap(LogicalRepRelation *remoterel);
extern LogicalRepRelMapEntry *logicalrep_rel_open(LogicalRepRelId remoteid,
LOCKMODE lockmode);
+extern void logicalrep_rel_load(LogicalRepRelMapEntry *entry,
+ LogicalRepRelId remoteid, LOCKMODE lockmode);
extern LogicalRepRelMapEntry *logicalrep_partition_open(LogicalRepRelMapEntry *root,
Relation partrel, AttrMap *map);
extern void logicalrep_rel_close(LogicalRepRelMapEntry *rel,
@@ -56,4 +72,8 @@ extern int logicalrep_get_num_rels(void);
extern void logicalrep_write_all_rels(StringInfo out);
extern LogicalRepRelMapEntry *logicalrep_get_relentry(LogicalRepRelId remoteid);
+#define LOGICALREP_PARALLEL_SAFE 's'
+#define LOGICALREP_PARALLEL_RESTRICTED 'r'
+#define LOGICALREP_PARALLEL_UNKNOWN 'u'
+
#endif /* LOGICALRELATION_H */
--
2.47.3
Attachment: v6-0007-Support-dependency-tracking-via-local-unique-inde.patch
From c19d8e3a6850426468d78eb50318d7b181e62b12 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <hayato@example.com>
Date: Thu, 11 Dec 2025 22:21:47 +0900
Subject: [PATCH v6 7/7] Support dependency tracking via local unique indexes
Currently, logical replication's parallel apply mechanism tracks dependencies
primarily based on the REPLICA IDENTITY defined on the publisher table.
However, local subscriber tables might have additional unique indexes that
could effectively serve as dependency keys, even if they don't correspond to
the publisher's REPLICA IDENTITY. Failing to track these additional unique
keys can lead to incorrect data and/or deadlocks during parallel application.
This patch extends the parallel apply's dependency tracking to consider
local unique indexes on the subscriber table. This is achieved by extending
the existing Replica Identity hash table to also store dependency information
based on these local unique indexes.
The LogicalRepRelMapEntry structure is extended to store details about these
local unique indexes. This information is collected and cached when
dependency checking is first performed for a remote transaction on a given
relation. This collection process requires being in a transaction to access
system catalog information.
---
src/backend/replication/logical/relation.c | 151 +++++++++-
src/backend/replication/logical/worker.c | 272 ++++++++++++++----
src/backend/storage/lmgr/deadlock.c | 1 -
src/include/replication/logicalrelation.h | 14 +
src/test/subscription/t/050_parallel_apply.pl | 43 +++
src/tools/pgindent/typedefs.list | 2 +
6 files changed, 424 insertions(+), 59 deletions(-)
diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index 14f3ebf725e..9d744f4c8cb 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -127,6 +127,21 @@ logicalrep_relmap_init(void)
(Datum) 0);
}
+/*
+ * Release local index list
+ */
+static void
+free_local_unique_indexes(LogicalRepRelMapEntry *entry)
+{
+ Assert(am_leader_apply_worker());
+
+ foreach_ptr(LogicalRepSubscriberIdx, idxinfo, entry->local_unique_indexes)
+ bms_free(idxinfo->indexkeys);
+
+ list_free(entry->local_unique_indexes);
+ entry->local_unique_indexes = NIL;
+}
+
/*
* Free the entry of a relation map cache.
*/
@@ -154,6 +169,9 @@ logicalrep_relmap_free_entry(LogicalRepRelMapEntry *entry)
if (entry->attrmap)
free_attrmap(entry->attrmap);
+
+ if (entry->local_unique_indexes != NIL)
+ free_local_unique_indexes(entry);
}
/*
@@ -360,6 +378,116 @@ logicalrep_rel_mark_updatable(LogicalRepRelMapEntry *entry)
}
}
+/*
+ * Collect all local unique indexes that can be used for dependency tracking.
+ */
+static void
+collect_local_indexes(LogicalRepRelMapEntry *entry)
+{
+ List *idxlist;
+
+ if (entry->local_unique_indexes != NIL)
+ free_local_unique_indexes(entry);
+
+ entry->local_unique_indexes_collected = true;
+
+ idxlist = RelationGetIndexList(entry->localrel);
+
+ /* Quick exit if there are no indexes */
+ if (idxlist == NIL)
+ return;
+
+ /* Iterate indexes to list all usable indexes */
+ foreach_oid(idxoid, idxlist)
+ {
+ Relation idxrel;
+ int indnkeys;
+ AttrMap *attrmap;
+ Bitmapset *indexkeys = NULL;
+ bool suitable = true;
+
+ idxrel = index_open(idxoid, AccessShareLock);
+
+ /*
+ * Check whether the index can be used for the dependency tracking.
+ *
+ * For simplicity, we require the same condition as for REPLICA
+ * IDENTITY FULL, plus that the index be unique.
+ */
+ if (!(idxrel->rd_index->indisunique &&
+ IsIndexUsableForReplicaIdentityFull(idxrel, entry->attrmap)))
+ {
+ index_close(idxrel, AccessShareLock);
+ continue;
+ }
+
+ indnkeys = idxrel->rd_index->indnkeyatts;
+ attrmap = entry->attrmap;
+
+ Assert(indnkeys);
+
+ /* Examine each attribute and add it to the bitmap */
+ for (int i = 0; i < indnkeys; i++)
+ {
+ AttrNumber localcol = idxrel->rd_index->indkey.values[i];
+ AttrNumber remotecol;
+
+ /*
+ * XXX: Mark a relation as parallel-restricted if it has expression
+ * indexes, because we cannot compute the hash value for dependency
+ * tracking. For safety, transactions that modify such tables wait
+ * until the most recently dispatched transaction is committed.
+ */
+ if (!AttributeNumberIsValid(localcol))
+ {
+ entry->parallel_safe = LOGICALREP_PARALLEL_RESTRICTED;
+ break;
+ }
+
+ remotecol = attrmap->attnums[AttrNumberGetAttrOffset(localcol)];
+
+ /*
+ * Skip if the column does not exist on the publisher node. In this
+ * case the replicated tuples always have a NULL or default value.
+ */
+ if (remotecol < 0)
+ {
+ suitable = false;
+ break;
+ }
+
+ /* Checks are passed, remember the attribute */
+ indexkeys = bms_add_member(indexkeys, remotecol);
+ }
+
+ index_close(idxrel, AccessShareLock);
+
+ /*
+ * At least one column does not exist on the publisher side; skip this index.
+ */
+ if (!suitable)
+ continue;
+
+ /* This index is usable; store it in memory */
+ if (indexkeys)
+ {
+ MemoryContext oldctx;
+ LogicalRepSubscriberIdx *idxinfo;
+
+ oldctx = MemoryContextSwitchTo(LogicalRepRelMapContext);
+ idxinfo = palloc(sizeof(LogicalRepSubscriberIdx));
+ idxinfo->indexoid = idxoid;
+ idxinfo->indexkeys = bms_copy(indexkeys);
+ entry->local_unique_indexes =
+ lappend(entry->local_unique_indexes, idxinfo);
+ MemoryContextSwitchTo(oldctx);
+ }
+ }
+
+ list_free(idxlist);
+}
+
/*
* Check all local triggers for the relation to see the parallelizability.
*
@@ -369,7 +497,16 @@ logicalrep_rel_mark_updatable(LogicalRepRelMapEntry *entry)
static void
check_defined_triggers(LogicalRepRelMapEntry *entry)
{
- TriggerDesc *trigdesc = entry->localrel->trigdesc;
+ TriggerDesc *trigdesc;
+
+ /*
+ * Skip if the parallelizability has already been determined. This is
+ * possible if the relation has expression indexes.
+ */
+ if (entry->parallel_safe != LOGICALREP_PARALLEL_UNKNOWN)
+ return;
+
+ trigdesc = entry->localrel->trigdesc;
/* Quick exit if triffer is not defined */
if (trigdesc == NULL)
@@ -410,7 +547,7 @@ check_defined_triggers(LogicalRepRelMapEntry *entry)
* open. If remoteid is given, the corresponding entry is first looked up in
* the hash table and processed as above; at the end, the relation is
* closed.
- */
+ */
void
logicalrep_rel_load(LogicalRepRelMapEntry *entry, LogicalRepRelId remoteid,
LOCKMODE lockmode)
@@ -564,7 +701,11 @@ logicalrep_rel_load(LogicalRepRelMapEntry *entry, LogicalRepRelId remoteid,
* tracking.
*/
if (am_leader_apply_worker())
+ {
+ entry->parallel_safe = LOGICALREP_PARALLEL_UNKNOWN;
+ collect_local_indexes(entry);
check_defined_triggers(entry);
+ }
entry->localrelvalid = true;
}
@@ -866,6 +1007,12 @@ logicalrep_partition_open(LogicalRepRelMapEntry *root,
entry->localindexoid = FindLogicalRepLocalIndex(partrel, remoterel,
entry->attrmap);
+ /*
+ * TODO: Parallel apply does not support partitioned tables for now.
+ * Just mark the local indexes as collected.
+ */
+ entry->local_unique_indexes_collected = true;
+
entry->localrelvalid = true;
return entry;
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 72383ab78b8..dae9a98da13 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -548,9 +548,19 @@ typedef struct ApplySubXactData
static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};
+/*
+ * Type of key used for dependency tracking.
+ */
+typedef enum LogicalRepKeyKind
+{
+ LOGICALREP_KEY_REPLICA_IDENTITY,
+ LOGICALREP_KEY_LOCAL_UNIQUE
+} LogicalRepKeyKind;
+
typedef struct ReplicaIdentityKey
{
Oid relid;
+ LogicalRepKeyKind kind;
LogicalRepTupleData *data;
} ReplicaIdentityKey;
@@ -710,7 +720,8 @@ static bool
hash_replica_identity_compare(ReplicaIdentityKey *a, ReplicaIdentityKey *b)
{
if (a->relid != b->relid ||
- a->data->ncols != b->data->ncols)
+ a->data->ncols != b->data->ncols ||
+ a->kind != b->kind)
return false;
for (int i = 0; i < a->data->ncols; i++)
@@ -718,6 +729,9 @@ hash_replica_identity_compare(ReplicaIdentityKey *a, ReplicaIdentityKey *b)
if (a->data->colstatus[i] != b->data->colstatus[i])
return false;
+ if (a->data->colstatus[i] == LOGICALREP_COLUMN_NULL)
+ continue;
+
if (a->data->colvalues[i].len != b->data->colvalues[i].len)
return false;
@@ -839,6 +853,93 @@ check_and_append_xid_dependency(List *depends_on_xids,
return lappend_xid(depends_on_xids, *depends_on_xid);
}
+/*
+ * Common function for registering dependency on a key. Used by both
+ * check_dependency_on_replica_identity and check_dependency_on_local_key.
+ */
+static void
+register_dependency_with_key(ReplicaIdentityKey *key, TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ ReplicaIdentityEntry *rientry;
+ bool found = false;
+ MemoryContext oldctx;
+
+ oldctx = MemoryContextSwitchTo(ApplyContext);
+
+ if (TransactionIdIsValid(new_depended_xid))
+ {
+ rientry = replica_identity_insert(replica_identity_table, key,
+ &found);
+
+ /*
+ * Release the key built to search the entry, if the entry already
+ * exists. Otherwise, initialize the remote_xid.
+ */
+ if (found)
+ {
+ elog(DEBUG1,
+ key->kind == LOGICALREP_KEY_REPLICA_IDENTITY ?
+ "found conflicting replica identity change from %u" :
+ "found conflicting local unique change from %u",
+ rientry->remote_xid);
+
+ free_replica_identity_key(key);
+ }
+ else
+ rientry->remote_xid = InvalidTransactionId;
+ }
+ else
+ {
+ rientry = replica_identity_lookup(replica_identity_table, key);
+ free_replica_identity_key(key);
+ }
+
+ MemoryContextSwitchTo(oldctx);
+
+ /* Return if no entry found */
+ if (!rientry)
+ return;
+
+ Assert(!found || TransactionIdIsValid(rientry->remote_xid));
+
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &rientry->remote_xid,
+ new_depended_xid);
+
+ /*
+ * Remove the entry if it is registered for a streamed transaction. We do
+ * not have to register an entry for them; the leader worker always waits
+ * until the parallel worker finishes handling streamed transactions, so
+ * there is no need to consider the possibility that upcoming parallel
+ * workers would go ahead.
+ */
+ if (TransactionIdIsValid(stream_xid) && !found)
+ {
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+ }
+
+ /*
+ * Update the new depended xid into the entry if valid, the new xid could
+ * be invalid if the transaction will be applied by the leader itself
+ * which means all the changes will be committed before processing next
+ * transaction, so no need to be depended on.
+ */
+ else if (TransactionIdIsValid(new_depended_xid))
+ rientry->remote_xid = new_depended_xid;
+
+ /*
+ * Remove the entry if the transaction has been committed and no new
+ * dependency needs to be added.
+ */
+ else if (!TransactionIdIsValid(rientry->remote_xid))
+ {
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+ }
+}
+
/*
* Check for dependencies on preceding transactions that modify the same key.
* Returns the dependent transactions in 'depends_on_xids' and records the
@@ -853,10 +954,8 @@ check_dependency_on_replica_identity(Oid relid,
LogicalRepRelMapEntry *relentry;
LogicalRepTupleData *ridata;
ReplicaIdentityKey *rikey;
- ReplicaIdentityEntry *rientry;
MemoryContext oldctx;
int n_ri;
- bool found = false;
Assert(depends_on_xids);
@@ -922,75 +1021,124 @@ check_dependency_on_replica_identity(Oid relid,
rikey = palloc0_object(ReplicaIdentityKey);
rikey->relid = relid;
+ rikey->kind = LOGICALREP_KEY_REPLICA_IDENTITY;
rikey->data = ridata;
- if (TransactionIdIsValid(new_depended_xid))
+ MemoryContextSwitchTo(oldctx);
+
+ register_dependency_with_key(rikey, new_depended_xid,
+ depends_on_xids);
+}
+
+/*
+ * Mostly same as check_dependency_on_replica_identity() but for local unique
+ * indexes.
+ */
+static void
+check_dependency_on_local_key(Oid relid,
+ LogicalRepTupleData *original_data,
+ TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ LogicalRepRelMapEntry *relentry;
+ LogicalRepTupleData *ridata;
+ ReplicaIdentityKey *rikey;
+ MemoryContext oldctx;
+
+ Assert(depends_on_xids);
+
+ /* Search for existing entry */
+ relentry = logicalrep_get_relentry(relid);
+
+ Assert(relentry);
+
+ /*
+ * Gather information about the local indexes if not done yet. We must
+ * be in a transaction state because system catalogs are read.
+ */
+ if (!relentry->local_unique_indexes_collected)
{
- rientry = replica_identity_insert(replica_identity_table, rikey,
- &found);
+ bool needs_start = !IsTransactionOrTransactionBlock();
+
+ if (needs_start)
+ StartTransactionCommand();
+
+ logicalrep_rel_load(NULL, relid, AccessShareLock);
/*
- * Release the key built to search the entry, if the entry already
- * exists. Otherwise, initialize the remote_xid.
+ * Close the transaction if we started it here. We must not abort because it
+ * would release all session-level locks, such as the stream lock, and
+ * break the deadlock detection mechanism between LA and PA. The
+ * outcome is the same regardless of the end status, since the
+ * transaction did not modify any tuples.
*/
- if (found)
- {
- elog(DEBUG1, "found conflicting replica identity change from %u",
- rientry->remote_xid);
+ if (needs_start)
+ CommitTransactionCommand();
- free_replica_identity_key(rikey);
- }
- else
- rientry->remote_xid = InvalidTransactionId;
+ Assert(relentry->local_unique_indexes_collected);
}
- else
+
+ foreach_ptr(LogicalRepSubscriberIdx, idxinfo, relentry->local_unique_indexes)
{
- rientry = replica_identity_lookup(replica_identity_table, rikey);
- free_replica_identity_key(rikey);
- }
+ int columns = bms_num_members(idxinfo->indexkeys);
+ bool suitable = true;
- MemoryContextSwitchTo(oldctx);
+ Assert(columns);
- /* Return if no entry found */
- if (!rientry)
- return;
+ for (int i = 0; i < original_data->ncols; i++)
+ {
+ if (!bms_is_member(i, idxinfo->indexkeys))
+ continue;
- Assert(!found || TransactionIdIsValid(rientry->remote_xid));
+ /*
+ * Skip if the column is not changed.
+ *
+ * XXX: NULL is allowed.
+ */
+ if (original_data->colstatus[i] == LOGICALREP_COLUMN_UNCHANGED)
+ {
+ suitable = false;
+ break;
+ }
+ }
- *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
- &rientry->remote_xid,
- new_depended_xid);
+ if (!suitable)
+ continue;
- /*
- * Remove the entry if it is registered for a streamed transaction. We do
- * not have to register an entry for them; the leader worker always waits
- * until the parallel worker finishes handling streamed transactions, so
- * there is no need to consider the possibility that upcoming parallel
- * workers would go ahead.
- */
- if (TransactionIdIsValid(stream_xid) && !found)
- {
- free_replica_identity_key(rientry->keydata);
- replica_identity_delete_item(replica_identity_table, rientry);
- }
+ oldctx = MemoryContextSwitchTo(ApplyContext);
- /*
- * Update the new depended xid into the entry if valid, the new xid could
- * be invalid if the transaction will be applied by the leader itself
- * which means all the changes will be committed before processing next
- * transaction, so no need to be depended on.
- */
- else if (TransactionIdIsValid(new_depended_xid))
- rientry->remote_xid = new_depended_xid;
+ /* Allocate space for replica identity values */
+ ridata = palloc0_object(LogicalRepTupleData);
+ ridata->colvalues = palloc0_array(StringInfoData, columns);
+ ridata->colstatus = palloc0_array(char, columns);
+ ridata->ncols = columns;
- /*
- * Remove the entry if the transaction has been committed and no new
- * dependency needs to be added.
- */
- else if (!TransactionIdIsValid(rientry->remote_xid))
- {
- free_replica_identity_key(rientry->keydata);
- replica_identity_delete_item(replica_identity_table, rientry);
+ for (int i_original = 0, i_key = 0; i_original < original_data->ncols; i_original++)
+ {
+ if (!bms_is_member(i_original, idxinfo->indexkeys))
+ continue;
+
+ if (original_data->colstatus[i_original] != LOGICALREP_COLUMN_NULL)
+ {
+ StringInfo original_colvalue = &original_data->colvalues[i_original];
+
+ initStringInfoExt(&ridata->colvalues[i_key], original_colvalue->len + 1);
+ appendStringInfoString(&ridata->colvalues[i_key], original_colvalue->data);
+ }
+
+ ridata->colstatus[i_key] = original_data->colstatus[i_original];
+ i_key++;
+ }
+
+ rikey = palloc0_object(ReplicaIdentityKey);
+ rikey->relid = relid;
+ rikey->kind = LOGICALREP_KEY_LOCAL_UNIQUE;
+ rikey->data = ridata;
+
+ MemoryContextSwitchTo(oldctx);
+
+ register_dependency_with_key(rikey, new_depended_xid,
+ depends_on_xids);
}
}
@@ -1173,6 +1321,9 @@ handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
check_dependency_on_replica_identity(relid, &newtup,
new_depended_xid,
&depends_on_xids);
+ check_dependency_on_local_key(relid, &newtup,
+ new_depended_xid,
+ &depends_on_xids);
check_dependency_for_parallel_safety(relid, new_depended_xid,
&depends_on_xids);
break;
@@ -1186,6 +1337,9 @@ handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
check_dependency_on_replica_identity(relid, &oldtup,
new_depended_xid,
&depends_on_xids);
+ check_dependency_on_local_key(relid, &oldtup,
+ new_depended_xid,
+ &depends_on_xids);
check_dependency_for_parallel_safety(relid, new_depended_xid,
&depends_on_xids);
}
@@ -1193,6 +1347,9 @@ handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
check_dependency_on_replica_identity(relid, &newtup,
new_depended_xid,
&depends_on_xids);
+ check_dependency_on_local_key(relid, &newtup,
+ new_depended_xid,
+ &depends_on_xids);
check_dependency_for_parallel_safety(relid, new_depended_xid,
&depends_on_xids);
break;
@@ -1202,6 +1359,9 @@ handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
check_dependency_on_replica_identity(relid, &oldtup,
new_depended_xid,
&depends_on_xids);
+ check_dependency_on_local_key(relid, &oldtup,
+ new_depended_xid,
+ &depends_on_xids);
check_dependency_for_parallel_safety(relid, new_depended_xid,
&depends_on_xids);
break;
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index c4bfaaa67ac..ca7dee52b32 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -33,7 +33,6 @@
#include "storage/procnumber.h"
#include "utils/memutils.h"
-
/*
* One edge in the waits-for graph.
*
diff --git a/src/include/replication/logicalrelation.h b/src/include/replication/logicalrelation.h
index e3d0df58620..9ac97fc4b38 100644
--- a/src/include/replication/logicalrelation.h
+++ b/src/include/replication/logicalrelation.h
@@ -16,6 +16,12 @@
#include "catalog/index.h"
#include "replication/logicalproto.h"
+typedef struct LogicalRepSubscriberIdx
+{
Oid indexoid; /* OID of the local index */
Bitmapset *indexkeys; /* Bitmap of key columns *on the remote side* */
+} LogicalRepSubscriberIdx;
+
typedef struct LogicalRepRelMapEntry
{
LogicalRepRelation remoterel; /* key is remoterel.remoteid */
@@ -40,6 +46,10 @@ typedef struct LogicalRepRelMapEntry
TransactionId last_depended_xid;
+ /* Local unique indexes. Used for dependency tracking */
+ List *local_unique_indexes;
+ bool local_unique_indexes_collected;
+
/*
* Whether changes to the relation can be applied in parallel. This
* reflects whether all of the defined triggers are immutable.
@@ -51,6 +61,10 @@ typedef struct LogicalRepRelMapEntry
* Note that we do not check user-defined constraints here. PostgreSQL
* already assumes that CHECK constraint conditions are immutable, and we
* follow that rule here.
+ *
+ * XXX: Additionally, this can be marked restricted if the relation has
+ * expression indexes, because we cannot compute the hash value for
+ * dependency tracking.
*/
char parallel_safe;
} LogicalRepRelMapEntry;
diff --git a/src/test/subscription/t/050_parallel_apply.pl b/src/test/subscription/t/050_parallel_apply.pl
index 20e8a7b91a7..e489a4bdc1e 100644
--- a/src/test/subscription/t/050_parallel_apply.pl
+++ b/src/test/subscription/t/050_parallel_apply.pl
@@ -231,4 +231,47 @@ $node_subscriber->wait_for_log(qr/finish waiting for depended xid $xid/, $offset
$h->query_safe("COMMIT;");
+# Ensure subscriber-local indexes are also used for the dependency tracking
+
+# Truncate the data for upcoming tests
+$node_publisher->safe_psql('postgres', "TRUNCATE TABLE regress_tab;");
+$node_publisher->wait_for_catchup('regress_sub');
+
+# Define a unique index on subscriber
+$node_subscriber->safe_psql('postgres',
+ "CREATE UNIQUE INDEX ON regress_tab (value);");
+
+# Attach an injection_point. Parallel workers would wait before the commit
+$node_subscriber->safe_psql('postgres',
+ "SELECT injection_points_attach('parallel-worker-before-commit','wait');"
+);
+
+# Insert a tuple on publisher. Parallel worker would wait at the injection
+# point
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (1, 'would conflict');");
+
+# Wait until the parallel worker enters the injection point.
+$node_subscriber->wait_for_event('logical replication parallel worker',
+ 'parallel-worker-before-commit');
+
+$offset = -s $node_subscriber->logfile;
+
+# Insert tuples on publisher again. This transaction would wait because
+# parallel workers wait until the previously dispatched transaction commits.
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (2, 'would not conflict');");
+
+# Verify the parallel worker waits for the transaction
+$str = $node_subscriber->wait_for_log(qr/wait for depended xid ([1-9][0-9]+)/, $offset);
+($xid) = $str =~ /wait for depended xid ([1-9][0-9]+)/;
+
+# Insert a conflicting tuple on publisher. The leader worker would detect the
+# conflict and make the parallel worker wait for the dependent transaction.
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (3, 'would conflict');");
+
+# Verify the parallel worker waits for the same transaction
+$node_subscriber->wait_for_log(qr/wait for depended xid $xid/, $offset);
+
done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 1517828a2d7..998749eaaf0 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1637,6 +1637,7 @@ LogicalRepBeginData
LogicalRepCommitData
LogicalRepCommitPreparedTxnData
LogicalRepCtxStruct
+LogicalRepKeyKind
LogicalRepMsgType
LogicalRepPartMapEntry
LogicalRepPreparedTxnData
@@ -1646,6 +1647,7 @@ LogicalRepRelation
LogicalRepRollbackPreparedTxnData
LogicalRepSequenceInfo
LogicalRepStreamAbortData
+LogicalRepSubscriberIdx
LogicalRepTupleData
LogicalRepTyp
LogicalRepWorker
--
2.47.3
Here is a rebased version.
Oh, I made a mistake running the git format-patch command. Here is a correct set;
the sequence numbers are incremented.
0006 contains changes to handle the case that user-defined triggers are not...
It should be 0007.
0007 contains changes to track dependencies by local indexes. It was mostly the...
It should be 0008.
Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachments:
v6_2-0001-Introduce-new-type-of-logical-replication-messa.patch
From aacddd4275a24eaf823777bcf134930ead8f8799 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Mon, 1 Dec 2025 10:37:27 +0900
Subject: [PATCH v6_2 1/8] Introduce new type of logical replication messages
to track dependencies
This patch introduces two logical replication messages,
LOGICAL_REP_MSG_INTERNAL_DEPENDENCY and LOGICAL_REP_MSG_INTERNAL_RELATION.
Apart from other messages, they are not sent by walsnders; the leader worker
sends to parallel workers based on the needs.
LOGICAL_REP_MSG_INTERNAL_DEPENDENCY ensures that dependent transactions are
committed in the correct order. It carries a list of transaction IDs that
parallel workers must wait for. The message is generated when the leader
detects a dependency between the current and other transactions, or just
before the COMMIT message; the latter is used to preserve the commit ordering
between the publisher and the subscriber.
LOGICAL_REP_MSG_INTERNAL_RELATION is used to synchronize the relation
information between the leader and parallel workers. It carries a list of
relations that the leader already knows about, and parallel workers update
their relmap in response to the message. This type of message is generated
when the leader allocates a new parallel worker to a transaction, or when the
publisher sends additional RELATION messages.
Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
---
.../replication/logical/applyparallelworker.c | 16 ++++++
src/backend/replication/logical/proto.c | 4 ++
src/backend/replication/logical/worker.c | 49 +++++++++++++++++++
src/include/replication/logicalproto.h | 2 +
src/include/replication/worker_internal.h | 4 ++
5 files changed, 75 insertions(+)
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index a4aafcf5b6e..055feea0bc5 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -1645,3 +1645,19 @@ pa_xact_finish(ParallelApplyWorkerInfo *winfo, XLogRecPtr remote_lsn)
pa_free_worker(winfo);
}
+
+/*
+ * Wait for the given transaction to finish.
+ */
+void
+pa_wait_for_depended_transaction(TransactionId xid)
+{
+ elog(DEBUG1, "wait for depended xid %u", xid);
+
+ for (;;)
+ {
+ /* XXX wait until given transaction is finished */
+ }
+
+ elog(DEBUG1, "finish waiting for depended xid %u", xid);
+}
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 27ad74fd759..ded46c49a83 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -1253,6 +1253,10 @@ logicalrep_message_type(LogicalRepMsgType action)
return "STREAM ABORT";
case LOGICAL_REP_MSG_STREAM_PREPARE:
return "STREAM PREPARE";
+ case LOGICAL_REP_MSG_INTERNAL_DEPENDENCY:
+ return "INTERNAL DEPENDENCY";
+ case LOGICAL_REP_MSG_INTERNAL_RELATION:
+ return "INTERNAL RELATION";
}
/*
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 718408bb599..73d38644c4a 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -629,6 +629,47 @@ static TransApplyAction get_transaction_apply_action(TransactionId xid,
static void replorigin_reset(int code, Datum arg);
+/*
+ * Handle internal dependency information.
+ *
+ * Wait for all transactions listed in the message to commit.
+ */
+static void
+apply_handle_internal_dependency(StringInfo s)
+{
+ int nxids = pq_getmsgint(s, 4);
+
+ for (int i = 0; i < nxids; i++)
+ {
+ TransactionId xid = pq_getmsgint(s, 4);
+
+ pa_wait_for_depended_transaction(xid);
+ }
+}
+
+/*
+ * Handle internal relation information.
+ *
+ * Update all relation details in the relation map cache.
+ */
+static void
+apply_handle_internal_relation(StringInfo s)
+{
+ int num_rels;
+
+ num_rels = pq_getmsgint(s, 4);
+
+ for (int i = 0; i < num_rels; i++)
+ {
+ LogicalRepRelation *rel = logicalrep_read_rel(s);
+
+ logicalrep_relmap_update(rel);
+
+ elog(DEBUG1, "parallel apply worker worker init relmap for %s",
+ rel->relname);
+ }
+}
+
/*
* Form the origin name for the subscription.
*
@@ -3868,6 +3909,14 @@ apply_dispatch(StringInfo s)
apply_handle_stream_prepare(s);
break;
+ case LOGICAL_REP_MSG_INTERNAL_RELATION:
+ apply_handle_internal_relation(s);
+ break;
+
+ case LOGICAL_REP_MSG_INTERNAL_DEPENDENCY:
+ apply_handle_internal_dependency(s);
+ break;
+
default:
ereport(ERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index b261c60d3fa..5d91e2a4287 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -75,6 +75,8 @@ typedef enum LogicalRepMsgType
LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
LOGICAL_REP_MSG_STREAM_ABORT = 'A',
LOGICAL_REP_MSG_STREAM_PREPARE = 'p',
+ LOGICAL_REP_MSG_INTERNAL_DEPENDENCY = 'd',
+ LOGICAL_REP_MSG_INTERNAL_RELATION = 'i',
} LogicalRepMsgType;
/*
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index f081619f151..a3526eae578 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -359,6 +359,8 @@ extern void pa_decr_and_wait_stream_block(void);
extern void pa_xact_finish(ParallelApplyWorkerInfo *winfo,
XLogRecPtr remote_lsn);
+extern void pa_wait_for_depended_transaction(TransactionId xid);
+
#define isParallelApplyWorker(worker) ((worker)->in_use && \
(worker)->type == WORKERTYPE_PARALLEL_APPLY)
#define isTableSyncWorker(worker) ((worker)->in_use && \
@@ -366,6 +368,8 @@ extern void pa_xact_finish(ParallelApplyWorkerInfo *winfo,
#define isSequenceSyncWorker(worker) ((worker)->in_use && \
(worker)->type == WORKERTYPE_SEQUENCESYNC)
+#define PARALLEL_APPLY_INTERNAL_MESSAGE 'i'
+
static inline bool
am_tablesync_worker(void)
{
--
2.47.3
v6_2-0002-Introduce-a-shared-hash-table-to-store-parallel.patch
From 4e6589e8848b7c2de6e6b5f12766eb4674302fec Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Mon, 1 Dec 2025 16:28:38 +0900
Subject: [PATCH v6_2 2/8] Introduce a shared hash table to store parallelized
transactions
This hash table is used for ensuring that parallel workers wait until dependent
transactions are committed.
The shared hash table contains transaction IDs that the leader allocated to
parallel workers. The hash entries are inserted with a remote XID when the
leader bypasses remote transactions to parallel apply workers. Entries are
deleted when parallel workers are committed to corresponding transactions.
When the parallel worker tries to wait for other transactions, it checks the
hash table for the remote XIDs. The process can go ahead only when entries are
removed from the hash.
Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
---
.../replication/logical/applyparallelworker.c | 100 +++++++++++++++++-
.../utils/activity/wait_event_names.txt | 1 +
src/include/replication/worker_internal.h | 4 +
src/include/storage/lwlocklist.h | 1 +
4 files changed, 105 insertions(+), 1 deletion(-)
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index 055feea0bc5..6ca5f778a3b 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -218,12 +218,35 @@ typedef struct ParallelApplyWorkerEntry
ParallelApplyWorkerInfo *winfo;
} ParallelApplyWorkerEntry;
+/* an entry in the parallelized_txns shared hash table */
+typedef struct ParallelizedTxnEntry
+{
+ TransactionId xid; /* Hash key */
+} ParallelizedTxnEntry;
+
/*
* A hash table used to cache the state of streaming transactions being applied
* by the parallel apply workers.
*/
static HTAB *ParallelApplyTxnHash = NULL;
+/*
+ * A hash table used to track the parallelized transactions that could be
+ * depended on by other transactions.
+ */
+static dsa_area *parallel_apply_dsa_area = NULL;
+static dshash_table *parallelized_txns = NULL;
+
+/* parameters for the parallelized_txns shared hash table */
+static const dshash_parameters dsh_params = {
+ sizeof(TransactionId),
+ sizeof(ParallelizedTxnEntry),
+ dshash_memcmp,
+ dshash_memhash,
+ dshash_memcpy,
+ LWTRANCHE_PARALLEL_APPLY_DSA
+};
+
/*
* A list (pool) of active parallel apply workers. The information for
* the new worker is added to the list after successfully launching it. The
@@ -257,6 +280,8 @@ static List *subxactlist = NIL;
static void pa_free_worker_info(ParallelApplyWorkerInfo *winfo);
static ParallelTransState pa_get_xact_state(ParallelApplyWorkerShared *wshared);
static PartialFileSetState pa_get_fileset_state(void);
+static void pa_attach_parallelized_txn_hash(dsa_handle *pa_dsa_handle,
+ dshash_table_handle *pa_dshash_handle);
/*
* Returns true if it is OK to start a parallel apply worker, false otherwise.
@@ -334,6 +359,15 @@ pa_setup_dsm(ParallelApplyWorkerInfo *winfo)
shm_mq *mq;
Size queue_size = DSM_QUEUE_SIZE;
Size error_queue_size = DSM_ERROR_QUEUE_SIZE;
+ dsa_handle parallel_apply_dsa_handle;
+ dshash_table_handle parallelized_txns_handle;
+
+ pa_attach_parallelized_txn_hash(¶llel_apply_dsa_handle,
+ ¶llelized_txns_handle);
+
+ if (parallel_apply_dsa_handle == DSA_HANDLE_INVALID ||
+ parallelized_txns_handle == DSHASH_HANDLE_INVALID)
+ return false;
/*
* Estimate how much shared memory we need.
@@ -369,6 +403,8 @@ pa_setup_dsm(ParallelApplyWorkerInfo *winfo)
pg_atomic_init_u32(&(shared->pending_stream_count), 0);
shared->last_commit_end = InvalidXLogRecPtr;
shared->fileset_state = FS_EMPTY;
+ shared->parallel_apply_dsa_handle = parallel_apply_dsa_handle;
+ shared->parallelized_txns_handle = parallelized_txns_handle;
shm_toc_insert(toc, PARALLEL_APPLY_KEY_SHARED, shared);
@@ -864,6 +900,8 @@ ParallelApplyWorkerMain(Datum main_arg)
shm_mq *mq;
shm_mq_handle *mqh;
shm_mq_handle *error_mqh;
+ dsa_handle pa_dsa_handle;
+ dshash_table_handle pa_dshash_handle;
RepOriginId originid;
int worker_slot = DatumGetInt32(main_arg);
char originname[NAMEDATALEN];
@@ -951,6 +989,8 @@ ParallelApplyWorkerMain(Datum main_arg)
InitializingApplyWorker = false;
+ pa_attach_parallelized_txn_hash(&pa_dsa_handle, &pa_dshash_handle);
+
/* Setup replication origin tracking. */
StartTransactionCommand();
ReplicationOriginNameForLogicalRep(MySubscription->oid, InvalidOid,
@@ -1646,6 +1686,51 @@ pa_xact_finish(ParallelApplyWorkerInfo *winfo, XLogRecPtr remote_lsn)
pa_free_worker(winfo);
}
+/*
+ * Attach to the shared hash table for parallelized transactions.
+ */
+static void
+pa_attach_parallelized_txn_hash(dsa_handle *pa_dsa_handle,
+ dshash_table_handle *pa_dshash_handle)
+{
+ MemoryContext oldctx;
+
+ if (parallelized_txns)
+ {
+ Assert(parallel_apply_dsa_area);
+ *pa_dsa_handle = dsa_get_handle(parallel_apply_dsa_area);
+ *pa_dshash_handle = dshash_get_hash_table_handle(parallelized_txns);
+ return;
+ }
+
+ /* Be sure any local memory allocated by DSA routines is persistent. */
+ oldctx = MemoryContextSwitchTo(ApplyContext);
+
+ if (am_leader_apply_worker())
+ {
+ /* Initialize the dynamic shared hash table for parallelized transactions. */
+ parallel_apply_dsa_area = dsa_create(LWTRANCHE_PARALLEL_APPLY_DSA);
+ dsa_pin(parallel_apply_dsa_area);
+ dsa_pin_mapping(parallel_apply_dsa_area);
+ parallelized_txns = dshash_create(parallel_apply_dsa_area, &dsh_params, NULL);
+
+ /* Store handles in shared memory for other backends to use. */
+ *pa_dsa_handle = dsa_get_handle(parallel_apply_dsa_area);
+ *pa_dshash_handle = dshash_get_hash_table_handle(parallelized_txns);
+ }
+ else if (am_parallel_apply_worker())
+ {
+ /* Attach to existing dynamic shared hash table. */
+ parallel_apply_dsa_area = dsa_attach(MyParallelShared->parallel_apply_dsa_handle);
+ dsa_pin_mapping(parallel_apply_dsa_area);
+ parallelized_txns = dshash_attach(parallel_apply_dsa_area, &dsh_params,
+ MyParallelShared->parallelized_txns_handle,
+ NULL);
+ }
+
+ MemoryContextSwitchTo(oldctx);
+}
+
/*
* Wait for the given transaction to finish.
*/
@@ -1656,7 +1741,20 @@ pa_wait_for_depended_transaction(TransactionId xid)
for (;;)
{
- /* XXX wait until given transaction is finished */
+ ParallelizedTxnEntry *txn_entry;
+
+ txn_entry = dshash_find(parallelized_txns, &xid, false);
+
+ /* The entry is removed only if the transaction is committed */
+ if (txn_entry == NULL)
+ break;
+
+ dshash_release_lock(parallelized_txns, txn_entry);
+
+ pa_lock_transaction(xid, AccessShareLock);
+ pa_unlock_transaction(xid, AccessShareLock);
+
+ CHECK_FOR_INTERRUPTS();
}
elog(DEBUG1, "finish waiting for depended xid %u", xid);
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index dcfadbd5aae..53b87a2df10 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -406,6 +406,7 @@ SubtransSLRU "Waiting to access the sub-transaction SLRU cache."
XactSLRU "Waiting to access the transaction status SLRU cache."
ParallelVacuumDSA "Waiting for parallel vacuum dynamic shared memory allocation."
AioUringCompletion "Waiting for another process to complete IO via io_uring."
+ParallelApplyDSA "Waiting for parallel apply dynamic shared memory allocation."
# No "ABI_compatibility" region here as WaitEventLWLock has its own C code.
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index a3526eae578..ddcdcc05053 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -15,6 +15,7 @@
#include "access/xlogdefs.h"
#include "catalog/pg_subscription.h"
#include "datatype/timestamp.h"
+#include "lib/dshash.h"
#include "miscadmin.h"
#include "replication/logicalrelation.h"
#include "replication/walreceiver.h"
@@ -197,6 +198,9 @@ typedef struct ParallelApplyWorkerShared
*/
PartialFileSetState fileset_state;
FileSet fileset;
+
+ dsa_handle parallel_apply_dsa_handle;
+ dshash_table_handle parallelized_txns_handle;
} ParallelApplyWorkerShared;
/*
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 533344509e9..e16295e5a3b 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -137,3 +137,4 @@ PG_LWLOCKTRANCHE(SUBTRANS_SLRU, SubtransSLRU)
PG_LWLOCKTRANCHE(XACT_SLRU, XactSLRU)
PG_LWLOCKTRANCHE(PARALLEL_VACUUM_DSA, ParallelVacuumDSA)
PG_LWLOCKTRANCHE(AIO_URING_COMPLETION, AioUringCompletion)
+PG_LWLOCKTRANCHE(PARALLEL_APPLY_DSA, ParallelApplyDSA)
--
2.47.3
v6_2-0003-Introduce-a-local-hash-table-to-store-replica-i.patch
From deba281b9fc6022f884d59ecbd3877598fba5ceb Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Mon, 1 Dec 2025 16:39:02 +0900
Subject: [PATCH v6_2 3/8] Introduce a local hash table to store replica
identities
This local hash table on the leader is used for detecting dependencies between
transactions.
The hash contains the Replica Identity (RI) as a key and the remote XID that
modified the corresponding tuple. The hash entries are inserted when the leader
finds an RI in a replication message. Entries are deleted when the commits of
transactions applied by parallel workers have been gathered, or when the
number of entries exceeds the limit.
When the leader sends replication changes to parallel workers, it checks whether
other transactions have already used the RI associated with the change. If
something is found, the leader treats it as a dependent transaction and notifies
parallel workers to wait until it finishes via LOGICAL_REP_MSG_INTERNAL_DEPENDENCY.
Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
---
.../replication/logical/applyparallelworker.c | 123 +++-
src/backend/replication/logical/relation.c | 24 +
src/backend/replication/logical/worker.c | 616 +++++++++++++++++-
src/include/replication/logicalrelation.h | 3 +
src/include/replication/worker_internal.h | 8 +-
5 files changed, 771 insertions(+), 3 deletions(-)
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index 6ca5f778a3b..cf08206d9fd 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -216,6 +216,7 @@ typedef struct ParallelApplyWorkerEntry
{
TransactionId xid; /* Hash key -- must be first */
ParallelApplyWorkerInfo *winfo;
+ XLogRecPtr local_end;
} ParallelApplyWorkerEntry;
/* an entry in the parallelized_txns shared hash table */
@@ -504,7 +505,7 @@ pa_launch_parallel_worker(void)
* streaming changes.
*/
void
-pa_allocate_worker(TransactionId xid)
+pa_allocate_worker(TransactionId xid, bool stream_txn)
{
bool found;
ParallelApplyWorkerInfo *winfo = NULL;
@@ -545,7 +546,9 @@ pa_allocate_worker(TransactionId xid)
winfo->in_use = true;
winfo->serialize_changes = false;
+ winfo->stream_txn = stream_txn;
entry->winfo = winfo;
+ entry->local_end = InvalidXLogRecPtr;
}
/*
@@ -742,6 +745,73 @@ pa_process_spooled_messages_if_required(void)
return true;
}
+/*
+ * Get the local end LSN for a transaction applied by a parallel apply worker.
+ *
+ * Set delete_entry to true if you intend to remove the transaction from the
+ * ParallelApplyTxnHash after collecting its LSN.
+ *
+ * If the parallel apply worker did not write any changes during the transaction
+ * application due to situations like update/delete_missing or a before trigger,
+ * *skipped_write will be set to true.
+ */
+XLogRecPtr
+pa_get_last_commit_end(TransactionId xid, bool delete_entry, bool *skipped_write)
+{
+ bool found;
+ ParallelApplyWorkerEntry *entry;
+ ParallelApplyWorkerInfo *winfo;
+
+ Assert(TransactionIdIsValid(xid));
+
+ if (skipped_write)
+ *skipped_write = false;
+
+ /* Find an entry for the requested transaction. */
+ entry = hash_search(ParallelApplyTxnHash, &xid, HASH_FIND, &found);
+
+ if (!found)
+ return InvalidXLogRecPtr;
+
+ /*
+ * If worker info is NULL, it indicates that the worker has been reused
+ * for handling other transactions. Consequently, the local end LSN has
+ * already been collected and saved in entry->local_end.
+ */
+ winfo = entry->winfo;
+ if (winfo == NULL)
+ {
+ if (delete_entry &&
+ !hash_search(ParallelApplyTxnHash, &xid, HASH_REMOVE, NULL))
+ elog(ERROR, "hash table corrupted");
+
+ if (skipped_write)
+ *skipped_write = XLogRecPtrIsInvalid(entry->local_end);
+
+ return entry->local_end;
+ }
+
+ /* Return InvalidXLogRecPtr if the transaction is still in progress */
+ if (pa_get_xact_state(winfo->shared) != PARALLEL_TRANS_FINISHED)
+ return InvalidXLogRecPtr;
+
+ /* Collect the local end LSN from the worker's shared memory area */
+ entry->local_end = winfo->shared->last_commit_end;
+ entry->winfo = NULL;
+
+ if (skipped_write)
+ *skipped_write = XLogRecPtrIsInvalid(entry->local_end);
+
+ elog(DEBUG1, "stored local commit end %X/%X in txn entry %u",
+ LSN_FORMAT_ARGS(entry->local_end), xid);
+
+ if (delete_entry &&
+ !hash_search(ParallelApplyTxnHash, &xid, HASH_REMOVE, NULL))
+ elog(ERROR, "hash table corrupted");
+
+ return entry->local_end;
+}
+
/*
* Interrupt handler for main loop of parallel apply worker.
*/
@@ -1686,6 +1756,26 @@ pa_xact_finish(ParallelApplyWorkerInfo *winfo, XLogRecPtr remote_lsn)
pa_free_worker(winfo);
}
+bool
+pa_transaction_committed(TransactionId xid)
+{
+ bool found;
+ ParallelApplyWorkerEntry *entry;
+
+ Assert(TransactionIdIsValid(xid));
+
+ /* Find an entry for the requested transaction */
+ entry = hash_search(ParallelApplyTxnHash, &xid, HASH_FIND, &found);
+
+ if (!found)
+ return true;
+
+ if (!entry->winfo)
+ return true;
+
+ return pa_get_xact_state(entry->winfo->shared) == PARALLEL_TRANS_FINISHED;
+}
+
/*
* Attach to the shared hash table for parallelized transactions.
*/
@@ -1731,6 +1821,37 @@ pa_attach_parallelized_txn_hash(dsa_handle *pa_dsa_handle,
MemoryContextSwitchTo(oldctx);
}
+/*
+ * Record the in-progress transactions from the given list that others
+ * depend on in the shared hash table.
+ */
+void
+pa_record_dependency_on_transactions(List *depends_on_xids)
+{
+ foreach_xid(xid, depends_on_xids)
+ {
+ bool found;
+ ParallelApplyWorkerEntry *winfo_entry;
+ ParallelApplyWorkerInfo *winfo;
+ ParallelizedTxnEntry *txn_entry;
+
+ winfo_entry = hash_search(ParallelApplyTxnHash, &xid, HASH_FIND, &found);
+ winfo = winfo_entry->winfo;
+
+ txn_entry = dshash_find_or_insert(parallelized_txns, &xid, &found);
+
+ /*
+ * If the transaction has already committed, remove the entry now;
+ * otherwise the parallel apply worker will remove the entry once it
+ * has committed the transaction.
+ */
+ if (pa_get_xact_state(winfo->shared) == PARALLEL_TRANS_FINISHED)
+ dshash_delete_entry(parallelized_txns, txn_entry);
+ else
+ dshash_release_lock(parallelized_txns, txn_entry);
+ }
+}
+
/*
* Wait for the given transaction to finish.
*/
diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index 2c8485b881f..13f8cb74e9f 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -959,3 +959,27 @@ FindLogicalRepLocalIndex(Relation localrel, LogicalRepRelation *remoterel,
return InvalidOid;
}
+
+/*
+ * Get the LogicalRepRelMapEntry corresponding to the given relid without
+ * opening the local relation.
+ */
+LogicalRepRelMapEntry *
+logicalrep_get_relentry(LogicalRepRelId remoteid)
+{
+ LogicalRepRelMapEntry *entry;
+ bool found;
+
+ if (LogicalRepRelMap == NULL)
+ logicalrep_relmap_init();
+
+ /* Search for existing entry. */
+ entry = hash_search(LogicalRepRelMap, (void *) &remoteid,
+ HASH_FIND, &found);
+
+ if (!found)
+ elog(DEBUG1, "no relation map entry for remote relation ID %u",
+ remoteid);
+
+ return entry;
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 73d38644c4a..0b1eeefe9c9 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -303,6 +303,7 @@ typedef struct FlushPosition
dlist_node node;
XLogRecPtr local_end;
XLogRecPtr remote_end;
+ TransactionId pa_remote_xid;
} FlushPosition;
static dlist_head lsn_mapping = DLIST_STATIC_INIT(lsn_mapping);
@@ -544,6 +545,49 @@ typedef struct ApplySubXactData
static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};
+typedef struct ReplicaIdentityKey
+{
+ Oid relid;
+ LogicalRepTupleData *data;
+} ReplicaIdentityKey;
+
+typedef struct ReplicaIdentityEntry
+{
+ ReplicaIdentityKey *keydata;
+ TransactionId remote_xid;
+
+ /* needed for simplehash */
+ uint32 hash;
+ char status;
+} ReplicaIdentityEntry;
+
+#include "common/hashfn.h"
+
+static uint32 hash_replica_identity(ReplicaIdentityKey *key);
+static bool hash_replica_identity_compare(ReplicaIdentityKey *a,
+ ReplicaIdentityKey *b);
+
+/* Define parameters for replica identity hash table code generation. */
+#define SH_PREFIX replica_identity
+#define SH_ELEMENT_TYPE ReplicaIdentityEntry
+#define SH_KEY_TYPE ReplicaIdentityKey *
+#define SH_KEY keydata
+#define SH_HASH_KEY(tb, key) hash_replica_identity(key)
+#define SH_EQUAL(tb, a, b) hash_replica_identity_compare(a, b)
+#define SH_STORE_HASH
+#define SH_GET_HASH(tb, a) (a)->hash
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+#define REPLICA_IDENTITY_INITIAL_SIZE 128
+#define REPLICA_IDENTITY_CLEANUP_THRESHOLD 1024
+
+static replica_identity_hash *replica_identity_table = NULL;
+
+static void write_internal_dependencies(StringInfo s, List *depends_on_xids);
+
static inline void subxact_filename(char *path, Oid subid, TransactionId xid);
static inline void changes_filename(char *path, Oid subid, TransactionId xid);
@@ -629,6 +673,546 @@ static TransApplyAction get_transaction_apply_action(TransactionId xid,
static void replorigin_reset(int code, Datum arg);
+static bool send_internal_dependencies(ParallelApplyWorkerInfo *winfo,
+ StringInfo s);
+
+/*
+ * Compute the hash value for entries in the replica_identity_table.
+ */
+static uint32
+hash_replica_identity(ReplicaIdentityKey *key)
+{
+ int i;
+ uint32 hashkey = 0;
+
+ hashkey = hash_combine(hashkey, hash_uint32(key->relid));
+
+ for (i = 0; i < key->data->ncols; i++)
+ {
+ uint32 hkey;
+
+ if (key->data->colstatus[i] == LOGICALREP_COLUMN_NULL)
+ continue;
+
+ hkey = hash_any((const unsigned char *) key->data->colvalues[i].data,
+ key->data->colvalues[i].len);
+ hashkey = hash_combine(hashkey, hkey);
+ }
+
+ return hashkey;
+}
+
+/*
+ * Compare two entries in the replica_identity_table.
+ */
+static bool
+hash_replica_identity_compare(ReplicaIdentityKey *a, ReplicaIdentityKey *b)
+{
+ if (a->relid != b->relid ||
+ a->data->ncols != b->data->ncols)
+ return false;
+
+ for (int i = 0; i < a->data->ncols; i++)
+ {
+ if (a->data->colstatus[i] != b->data->colstatus[i])
+ return false;
+
+ if (a->data->colvalues[i].len != b->data->colvalues[i].len)
+ return false;
+
+ if (strcmp(a->data->colvalues[i].data, b->data->colvalues[i].data))
+ return false;
+
+ elog(DEBUG1, "conflicting key %s", a->data->colvalues[i].data);
+ }
+
+ return true;
+}
+
+/*
+ * Free resources associated with a replica identity key.
+ */
+static void
+free_replica_identity_key(ReplicaIdentityKey *key)
+{
+ Assert(key);
+
+ pfree(key->data->colvalues);
+ pfree(key->data->colstatus);
+ pfree(key->data);
+ pfree(key);
+}
+
+/*
+ * Clean up hash table entries associated with the given transaction IDs.
+ */
+static void
+cleanup_replica_identity_table(List *committed_xid)
+{
+ replica_identity_iterator i;
+ ReplicaIdentityEntry *rientry;
+
+ if (!committed_xid)
+ return;
+
+ replica_identity_start_iterate(replica_identity_table, &i);
+ while ((rientry = replica_identity_iterate(replica_identity_table, &i)) != NULL)
+ {
+ if (!list_member_xid(committed_xid, rientry->remote_xid))
+ continue;
+
+ /* Clean up the hash entry for committed transaction */
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+ }
+}
+
+/*
+ * Check committed transactions and clean up corresponding entries in the hash
+ * table.
+ */
+static void
+cleanup_committed_replica_identity_entries(void)
+{
+ dlist_mutable_iter iter;
+ List *committed_xids = NIL;
+
+ dlist_foreach_modify(iter, &lsn_mapping)
+ {
+ FlushPosition *pos =
+ dlist_container(FlushPosition, node, iter.cur);
+ bool skipped_write;
+
+ if (!TransactionIdIsValid(pos->pa_remote_xid) ||
+ !XLogRecPtrIsInvalid(pos->local_end))
+ continue;
+
+ pos->local_end = pa_get_last_commit_end(pos->pa_remote_xid, true,
+ &skipped_write);
+
+ elog(DEBUG1,
+ "got commit end from parallel apply worker, "
+ "txn: %u, remote_end %X/%X, local_end %X/%X",
+ pos->pa_remote_xid, LSN_FORMAT_ARGS(pos->remote_end),
+ LSN_FORMAT_ARGS(pos->local_end));
+
+ if (!skipped_write && XLogRecPtrIsInvalid(pos->local_end))
+ continue;
+
+ committed_xids = lappend_xid(committed_xids, pos->pa_remote_xid);
+ }
+
+ /* cleanup the entries for committed transactions */
+ cleanup_replica_identity_table(committed_xids);
+}
+
+/*
+ * Append a transaction dependency, excluding duplicates and committed
+ * transactions.
+ */
+static List *
+check_and_append_xid_dependency(List *depends_on_xids,
+ TransactionId *depends_on_xid,
+ TransactionId current_xid)
+{
+ Assert(depends_on_xid);
+
+ if (!TransactionIdIsValid(*depends_on_xid))
+ return depends_on_xids;
+
+ if (TransactionIdEquals(*depends_on_xid, current_xid))
+ return depends_on_xids;
+
+ if (list_member_xid(depends_on_xids, *depends_on_xid))
+ return depends_on_xids;
+
+ /*
+ * Reset the xid and return without appending it if it has already committed.
+ */
+ if (pa_transaction_committed(*depends_on_xid))
+ {
+ *depends_on_xid = InvalidTransactionId;
+ return depends_on_xids;
+ }
+
+ return lappend_xid(depends_on_xids, *depends_on_xid);
+}
+
+/*
+ * Check for dependencies on preceding transactions that modify the same key.
+ * Returns the dependent transactions in 'depends_on_xids' and records the
+ * current change.
+ */
+static void
+check_dependency_on_replica_identity(Oid relid,
+ LogicalRepTupleData *original_data,
+ TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ LogicalRepRelMapEntry *relentry;
+ LogicalRepTupleData *ridata;
+ ReplicaIdentityKey *rikey;
+ ReplicaIdentityEntry *rientry;
+ MemoryContext oldctx;
+ int n_ri;
+ bool found = false;
+
+ Assert(depends_on_xids);
+
+ /* Search for existing entry */
+ relentry = logicalrep_get_relentry(relid);
+
+ Assert(relentry);
+
+ /*
+ * First search whether any previous transaction has affected the whole
+ * table e.g., truncate or schema change from publisher.
+ */
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &relentry->last_depended_xid,
+ new_depended_xid);
+
+ n_ri = bms_num_members(relentry->remoterel.attkeys);
+
+ /*
+ * Return if there are no replica identity columns, indicating that the
+ * remote relation neither has a replica identity key nor is marked as
+ * replica identity full.
+ */
+ if (!n_ri)
+ return;
+
+ /* Check if the RI key value of the tuple is invalid */
+ for (int i = 0; i < original_data->ncols; i++)
+ {
+ if (!bms_is_member(i, relentry->remoterel.attkeys))
+ continue;
+
+ /*
+ * Return if the RI key is NULL or is explicitly marked unchanged. The
+ * key value can be NULL in the new tuple of an update operation, which
+ * means the RI key was not updated.
+ */
+ if (original_data->colstatus[i] == LOGICALREP_COLUMN_NULL ||
+ original_data->colstatus[i] == LOGICALREP_COLUMN_UNCHANGED)
+ return;
+ }
+
+ oldctx = MemoryContextSwitchTo(ApplyContext);
+
+ /* Allocate space for replica identity values */
+ ridata = palloc0_object(LogicalRepTupleData);
+ ridata->colvalues = palloc0_array(StringInfoData, n_ri);
+ ridata->colstatus = palloc0_array(char, n_ri);
+ ridata->ncols = n_ri;
+
+ for (int i_original = 0, i_ri = 0; i_original < original_data->ncols; i_original++)
+ {
+ StringInfo original_colvalue = &original_data->colvalues[i_original];
+
+ if (!bms_is_member(i_original, relentry->remoterel.attkeys))
+ continue;
+
+ initStringInfoExt(&ridata->colvalues[i_ri], original_colvalue->len + 1);
+ appendStringInfoString(&ridata->colvalues[i_ri], original_colvalue->data);
+ ridata->colstatus[i_ri] = original_data->colstatus[i_original];
+ i_ri++;
+ }
+
+ rikey = palloc0_object(ReplicaIdentityKey);
+ rikey->relid = relid;
+ rikey->data = ridata;
+
+ if (TransactionIdIsValid(new_depended_xid))
+ {
+ rientry = replica_identity_insert(replica_identity_table, rikey,
+ &found);
+
+ /*
+ * Release the key built to search the entry, if the entry already
+ * exists. Otherwise, initialize the remote_xid.
+ */
+ if (found)
+ {
+ elog(DEBUG1, "found conflicting replica identity change from %u",
+ rientry->remote_xid);
+
+ free_replica_identity_key(rikey);
+ }
+ else
+ rientry->remote_xid = InvalidTransactionId;
+ }
+ else
+ {
+ rientry = replica_identity_lookup(replica_identity_table, rikey);
+ free_replica_identity_key(rikey);
+ }
+
+ MemoryContextSwitchTo(oldctx);
+
+ /* Return if no entry found */
+ if (!rientry)
+ return;
+
+ Assert(!found || TransactionIdIsValid(rientry->remote_xid));
+
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &rientry->remote_xid,
+ new_depended_xid);
+
+ /*
+ * Store the new depended-on xid in the entry if it is valid. The new
+ * xid can be invalid when the transaction will be applied by the leader
+ * itself, in which case all of its changes are committed before the
+ * next transaction is processed, so no dependency needs to be recorded.
+ */
+ if (TransactionIdIsValid(new_depended_xid))
+ rientry->remote_xid = new_depended_xid;
+
+ /*
+ * Remove the entry if the transaction has been committed and no new
+ * dependency needs to be added.
+ */
+ else if (!TransactionIdIsValid(rientry->remote_xid))
+ {
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+ }
+}
+
+/*
+ * Check for preceding transactions that involve insert, delete, or update
+ * operations on the specified table, and return them in 'depends_on_xids'.
+ */
+static void
+find_all_dependencies_on_rel(LogicalRepRelId relid, TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ replica_identity_iterator i;
+ ReplicaIdentityEntry *rientry;
+
+ Assert(depends_on_xids);
+
+ replica_identity_start_iterate(replica_identity_table, &i);
+ while ((rientry = replica_identity_iterate(replica_identity_table, &i)) != NULL)
+ {
+ Assert(TransactionIdIsValid(rientry->remote_xid));
+
+ if (rientry->keydata->relid != relid)
+ continue;
+
+ /* Clean up the hash entry for committed transaction while on it */
+ if (pa_transaction_committed(rientry->remote_xid))
+ {
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+
+ continue;
+ }
+
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &rientry->remote_xid,
+ new_depended_xid);
+ }
+}
+
+/*
+ * Check for any preceding transactions that affect the given table and return
+ * them in 'depends_on_xids'.
+ */
+static void
+check_dependency_on_rel(LogicalRepRelId relid, TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ LogicalRepRelMapEntry *relentry;
+
+ Assert(depends_on_xids);
+
+ find_all_dependencies_on_rel(relid, new_depended_xid, depends_on_xids);
+
+ /* Search for existing entry */
+ relentry = logicalrep_get_relentry(relid);
+
+ /*
+ * The relentry has not been initialized yet, indicating that no change
+ * has been applied yet.
+ */
+ if (!relentry)
+ return;
+
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &relentry->last_depended_xid,
+ new_depended_xid);
+
+ if (TransactionIdIsValid(new_depended_xid))
+ relentry->last_depended_xid = new_depended_xid;
+}
+
+/*
+ * Check dependencies related to the current change by determining if the
+ * modification impacts the same row or table as another ongoing transaction. If
+ * needed, instruct parallel apply workers to wait for these preceding
+ * transactions to complete.
+ *
+ * Simultaneously, track the dependency for the current change to ensure that
+ * subsequent transactions address this dependency.
+ */
+static void
+handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
+ TransactionId new_depended_xid,
+ ParallelApplyWorkerInfo *winfo)
+{
+ LogicalRepRelId relid;
+ LogicalRepTupleData oldtup;
+ LogicalRepTupleData newtup;
+ LogicalRepRelation *rel;
+ List *depends_on_xids = NIL;
+ List *remote_relids;
+ bool has_oldtup = false;
+ bool cascade = false;
+ bool restart_seqs = false;
+ StringInfoData dependencies;
+
+ /*
+ * Parse the change using a local copy instead of consuming the given
+ * remote message directly, as the caller may also need to read data
+ * from it.
+ */
+ StringInfoData change = *s;
+
+ /* Compute dependency only for non-streaming transaction */
+ if (in_streamed_transaction || (winfo && winfo->stream_txn))
+ return;
+
+ /* Only the leader checks dependencies and schedules the parallel apply */
+ if (!am_leader_apply_worker())
+ return;
+
+ if (!replica_identity_table)
+ replica_identity_table = replica_identity_create(ApplyContext,
+ REPLICA_IDENTITY_INITIAL_SIZE,
+ NULL);
+
+ if (replica_identity_table->members >= REPLICA_IDENTITY_CLEANUP_THRESHOLD)
+ cleanup_committed_replica_identity_entries();
+
+ switch (action)
+ {
+ case LOGICAL_REP_MSG_INSERT:
+ relid = logicalrep_read_insert(&change, &newtup);
+ check_dependency_on_replica_identity(relid, &newtup,
+ new_depended_xid,
+ &depends_on_xids);
+ break;
+
+ case LOGICAL_REP_MSG_UPDATE:
+ relid = logicalrep_read_update(&change, &has_oldtup, &oldtup,
+ &newtup);
+
+ if (has_oldtup)
+ check_dependency_on_replica_identity(relid, &oldtup,
+ new_depended_xid,
+ &depends_on_xids);
+
+ check_dependency_on_replica_identity(relid, &newtup,
+ new_depended_xid,
+ &depends_on_xids);
+ break;
+
+ case LOGICAL_REP_MSG_DELETE:
+ relid = logicalrep_read_delete(&change, &oldtup);
+ check_dependency_on_replica_identity(relid, &oldtup,
+ new_depended_xid,
+ &depends_on_xids);
+ break;
+
+ case LOGICAL_REP_MSG_TRUNCATE:
+ remote_relids = logicalrep_read_truncate(&change, &cascade,
+ &restart_seqs);
+
+ /*
+ * Truncate affects all rows in a table, so the current
+ * transaction should wait for all preceding transactions that
+ * modified the same table.
+ */
+ foreach_int(truncated_relid, remote_relids)
+ check_dependency_on_rel(truncated_relid, new_depended_xid,
+ &depends_on_xids);
+
+ break;
+
+ case LOGICAL_REP_MSG_RELATION:
+ rel = logicalrep_read_rel(&change);
+
+ /*
+ * The replica identity key could be changed, making existing
+ * entries in the replica identity table invalid. In this case, parallel
+ * apply is not allowed on this specific table until all running
+ * transactions that modified it have finished.
+ */
+ check_dependency_on_rel(rel->remoteid, new_depended_xid,
+ &depends_on_xids);
+ break;
+
+ case LOGICAL_REP_MSG_TYPE:
+ case LOGICAL_REP_MSG_MESSAGE:
+
+ /*
+ * Type updates accompany relation updates, so dependencies have
+ * already been checked during relation updates. Logical messages
+ * do not conflict with any changes, so they can be ignored.
+ */
+ break;
+
+ default:
+ Assert(false);
+ break;
+ }
+
+ if (!depends_on_xids)
+ return;
+
+ /*
+ * Record the in-progress transactions that the current transaction
+ * depends on.
+ */
+ pa_record_dependency_on_transactions(depends_on_xids);
+
+ /*
+ * If the leader applies the transaction itself, start waiting for
+ * the transactions that the current transaction depends on immediately.
+ */
+ if (winfo == NULL)
+ {
+ foreach_xid(xid, depends_on_xids)
+ pa_wait_for_depended_transaction(xid);
+
+ return;
+ }
+
+ initStringInfo(&dependencies);
+
+ /* Build the dependency message used to send to parallel apply worker */
+ write_internal_dependencies(&dependencies, depends_on_xids);
+
+ (void) send_internal_dependencies(winfo, &dependencies);
+}
+
+/*
+ * Write internal dependency information to the output for the parallel apply
+ * worker.
+ */
+static void
+write_internal_dependencies(StringInfo s, List *depends_on_xids)
+{
+ pq_sendbyte(s, PARALLEL_APPLY_INTERNAL_MESSAGE);
+ pq_sendbyte(s, LOGICAL_REP_MSG_INTERNAL_DEPENDENCY);
+ pq_sendint32(s, list_length(depends_on_xids));
+
+ foreach_xid(xid, depends_on_xids)
+ pq_sendint32(s, xid);
+}
+
/*
* Handle internal dependency information.
*
@@ -826,7 +1410,10 @@ handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
/* not in streaming mode */
if (apply_action == TRANS_LEADER_APPLY)
+ {
+ handle_dependency_on_change(action, s, InvalidTransactionId, winfo);
return false;
+ }
Assert(TransactionIdIsValid(stream_xid));
@@ -1268,6 +1855,33 @@ apply_handle_begin(StringInfo s)
pgstat_report_activity(STATE_RUNNING, NULL);
}
+/*
+ * Send an INTERNAL_DEPENDENCY message to a parallel apply worker.
+ *
+ * Returns false if we switched to the serialize mode to send the message,
+ * true otherwise.
+ */
+static bool
+send_internal_dependencies(ParallelApplyWorkerInfo *winfo, StringInfo s)
+{
+ Assert(s->data[0] == PARALLEL_APPLY_INTERNAL_MESSAGE);
+ Assert(s->data[1] == LOGICAL_REP_MSG_INTERNAL_DEPENDENCY);
+
+ if (!winfo->serialize_changes)
+ {
+ if (pa_send_data(winfo, s->len, s->data))
+ return true;
+
+ pa_switch_to_partial_serialize(winfo, true);
+ }
+
+ /* Skip writing the first internal message flag */
+ s->cursor++;
+ stream_write_change(LOGICAL_REP_MSG_INTERNAL_DEPENDENCY, s);
+
+ return false;
+}
+
/*
* Handle COMMIT message.
*
@@ -1795,7 +2409,7 @@ apply_handle_stream_start(StringInfo s)
/* Try to allocate a worker for the streaming transaction. */
if (first_segment)
- pa_allocate_worker(stream_xid);
+ pa_allocate_worker(stream_xid, true);
apply_action = get_transaction_apply_action(stream_xid, &winfo);
diff --git a/src/include/replication/logicalrelation.h b/src/include/replication/logicalrelation.h
index 7a561a8e8d8..4b321bd2ad2 100644
--- a/src/include/replication/logicalrelation.h
+++ b/src/include/replication/logicalrelation.h
@@ -37,6 +37,8 @@ typedef struct LogicalRepRelMapEntry
/* Sync state. */
char state;
XLogRecPtr statelsn;
+
+ TransactionId last_depended_xid;
} LogicalRepRelMapEntry;
extern void logicalrep_relmap_update(LogicalRepRelation *remoterel);
@@ -50,5 +52,6 @@ extern void logicalrep_rel_close(LogicalRepRelMapEntry *rel,
LOCKMODE lockmode);
extern bool IsIndexUsableForReplicaIdentityFull(Relation idxrel, AttrMap *attrmap);
extern Oid GetRelationIdentityOrPK(Relation rel);
+extern LogicalRepRelMapEntry *logicalrep_get_relentry(LogicalRepRelId remoteid);
#endif /* LOGICALRELATION_H */
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index ddcdcc05053..78b5667cebe 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -235,6 +235,8 @@ typedef struct ParallelApplyWorkerInfo
*/
bool in_use;
+ bool stream_txn;
+
ParallelApplyWorkerShared *shared;
} ParallelApplyWorkerInfo;
@@ -332,8 +334,10 @@ extern void apply_error_callback(void *arg);
extern void set_apply_error_context_origin(char *originname);
/* Parallel apply worker setup and interactions */
-extern void pa_allocate_worker(TransactionId xid);
+extern void pa_allocate_worker(TransactionId xid, bool stream_txn);
extern ParallelApplyWorkerInfo *pa_find_worker(TransactionId xid);
+extern XLogRecPtr pa_get_last_commit_end(TransactionId xid, bool delete_entry,
+ bool *skipped_write);
extern void pa_detach_all_error_mq(void);
extern bool pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes,
@@ -362,6 +366,8 @@ extern void pa_decr_and_wait_stream_block(void);
extern void pa_xact_finish(ParallelApplyWorkerInfo *winfo,
XLogRecPtr remote_lsn);
+extern bool pa_transaction_committed(TransactionId xid);
+extern void pa_record_dependency_on_transactions(List *depends_on_xids);
extern void pa_wait_for_depended_transaction(TransactionId xid);
--
2.47.3
Attachment: v6_2-0004-Parallel-apply-non-streaming-transactions.patch
From 834ad15997798aedc77fec24ed034ebf03044b1d Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Mon, 1 Dec 2025 12:28:29 +0900
Subject: [PATCH v6_2 4/8] Parallel apply non-streaming transactions
--
Basic design
--
The leader worker assigns each non-streaming transaction to a parallel apply
worker. Before dispatching changes to a parallel worker, the leader verifies
whether the current modification affects the same row (identified by the replica
identity key) as another ongoing transaction. If so, the leader sends a list of dependent
transaction IDs to the parallel worker, indicating that the parallel apply
worker must wait for these transactions to commit before proceeding.
Each parallel apply worker records the local end LSN of the transaction it
applies in shared memory. Subsequently, the leader gathers these local end LSNs
and records them in the local 'lsn_mapping' to verify whether they have been
flushed to disk (following the logic in get_flush_position()).
If no parallel apply worker is available, the leader will apply the transaction
independently.
For further details, please refer to the following:
--
dependency tracking
--
The leader maintains a local hash table, using the remote change's replica
identity column values and relid as keys, with remote transaction IDs as values.
Before sending changes to the parallel apply worker, the leader computes a hash
using RI key values and the relid of the current change to search the hash
table. If an existing entry is found, the leader first updates the hash entry
with the received remote xid and then tells the parallel worker to wait for it.
If the remote relation lacks a replica identity (RI), it indicates that only
INSERT can be replicated for this table. In such cases, the leader skips
dependency checks, allowing the parallel apply worker to proceed with applying
changes without delay. This is because the only potential conflicts would
involve a local unique key or foreign key, the handling of which is yet to be
implemented (see TODO - dependency on foreign key).
In cases of TRUNCATE or remote schema changes affecting the entire table, the
leader retrieves all remote xids touching the same table (via sequential scans
of the hash table) and tells the parallel worker to wait for those transactions
to commit.
Hash entries are cleaned up once the transaction corresponding to the remote xid
in the entry has been committed. Clean-up typically occurs when collecting the
flush position of each transaction, but is forced if the hash table exceeds a
set threshold.
--
dependency waiting
--
If a transaction is relied upon by others, the leader adds its xid to a shared
hash table. The shared hash table entry is cleared by the parallel apply worker
upon completing the transaction. Workers needing to wait for a transaction check
the shared hash table entry; if present, they lock the transaction ID (using
pa_lock_transaction). If absent, it indicates the transaction has been
committed, negating the need to wait.
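As a rough illustration of this protocol, the following standalone sketch
replaces the shared hash table and pa_lock_transaction() with a mutex and
condition variable; the xids and function names are invented for the example
(compile with -lpthread):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t done = PTHREAD_COND_INITIALIZER;
static unsigned in_progress_xid = 702;	/* entry present => still running */

static void *
dependent_worker(void *arg)
{
	(void) arg;

	pthread_mutex_lock(&lock);
	/* "Entry" still present in the shared state => must wait for commit. */
	while (in_progress_xid == 702)
		pthread_cond_wait(&done, &lock);
	pthread_mutex_unlock(&lock);

	printf("dependent worker: xid 702 committed, applying now\n");
	return NULL;
}

int
main(void)
{
	pthread_t	t;

	pthread_create(&t, NULL, dependent_worker, NULL);
	sleep(1);					/* xid 702 "applies" its changes */

	pthread_mutex_lock(&lock);
	in_progress_xid = 0;		/* commit: remove the shared hash entry */
	pthread_cond_broadcast(&done);
	pthread_mutex_unlock(&lock);

	pthread_join(t, NULL);
	return 0;
}

The real patch waits on a lock tagged with the transaction ID rather than a
condition variable, which also makes the wait visible to the deadlock
detector.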
--
commit order
--
There is a case where columns have no foreign or primary keys, and integrity is
maintained at the application layer. In this case, the above RI mechanism cannot
detect any dependencies. For safety reasons, parallel apply workers preserve the
commit ordering done on the publisher side. This is done by the leader worker
caching the most recently dispatched transaction ID and adding a dependency
between it and the transaction currently being dispatched (see the sketch below).
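A minimal sketch of this chaining follows; last_remote_xid mirrors the
patch's variable, but dispatch_commit() is invented for the example:

#include <stdio.h>

static unsigned last_remote_xid = 0;	/* xid of the last dispatched commit */

/* Chain each commit after the previous one to preserve publisher order. */
static void
dispatch_commit(unsigned xid)
{
	if (last_remote_xid != 0)
		printf("xid %u: wait for xid %u before committing\n",
			   xid, last_remote_xid);
	last_remote_xid = xid;
}

int
main(void)
{
	dispatch_commit(701);		/* no predecessor, commits immediately */
	dispatch_commit(702);		/* must wait for 701 */
	dispatch_commit(703);		/* must wait for 702 */
	return 0;
}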
--
TODO - dependency on foreign key.
--
A transaction could conflict with another when both modify the same key. The
current patches do not address conflicts involving foreign keys, but tracking
these dependencies might be needed.
---
.../replication/logical/applyparallelworker.c | 339 ++++++++++++++++--
src/backend/replication/logical/proto.c | 38 ++
src/backend/replication/logical/relation.c | 31 ++
src/backend/replication/logical/worker.c | 303 ++++++++++++++--
src/include/replication/logicalproto.h | 2 +
src/include/replication/logicalrelation.h | 2 +
src/include/replication/worker_internal.h | 11 +-
src/test/subscription/meson.build | 1 +
src/test/subscription/t/001_rep_changes.pl | 2 +
src/test/subscription/t/010_truncate.pl | 2 +-
src/test/subscription/t/015_stream.pl | 8 +-
src/test/subscription/t/026_stats.pl | 1 +
src/test/subscription/t/027_nosuperuser.pl | 1 +
src/test/subscription/t/050_parallel_apply.pl | 130 +++++++
src/tools/pgindent/typedefs.list | 4 +
15 files changed, 801 insertions(+), 74 deletions(-)
create mode 100644 src/test/subscription/t/050_parallel_apply.pl
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index cf08206d9fd..5b6267c6047 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -14,6 +14,9 @@
* ParallelApplyWorkerInfo which is required so the leader worker and parallel
* apply workers can communicate with each other.
*
+ * Streaming transactions
+ * ======================
+ *
* The parallel apply workers are assigned (if available) as soon as xact's
* first stream is received for subscriptions that have set their 'streaming'
* option as parallel. The leader apply worker will send changes to this new
@@ -152,6 +155,33 @@
* session-level locks because both locks could be acquired outside the
* transaction, and the stream lock in the leader needs to persist across
* transaction boundaries i.e. until the end of the streaming transaction.
+ *
+ * Non-streaming transactions
+ * ======================
+ * The handling is similar to that of streaming transactions, but with a
+ * few differences:
+ *
+ * Transaction dependency
+ * ----------------------
+ * Before dispatching changes to a parallel worker, the leader verifies whether
+ * the current modification affects the same row (identified by the replica
+ * identity key) as another ongoing transaction (see handle_dependency_on_change
+ * for details). If so, the leader sends a list of dependent transaction IDs to the
+ * parallel worker, indicating that the parallel apply worker must wait for
+ * these transactions to commit before proceeding.
+ *
+ * Commit order
+ * ------------
+ * There is a case where columns have no foreign or primary keys, and integrity
+ * is maintained at the application layer. In this case, the above RI mechanism
+ * cannot detect any dependencies. For safety reasons, parallel apply workers
+ * preserve the commit ordering done on the publisher side. This is done by the
+ * leader worker caching the most recently dispatched transaction ID and adding
+ * a dependency between it and the transaction currently being dispatched.
+ * We can extend the parallel apply worker to allow out-of-order commits in the
+ * future; at a minimum, that requires a new mechanism to track replication
+ * progress under out-of-order commits. We could then stop caching the
+ * transaction ID and adding the dependency.
*-------------------------------------------------------------------------
*/
@@ -283,6 +313,7 @@ static ParallelTransState pa_get_xact_state(ParallelApplyWorkerShared *wshared);
static PartialFileSetState pa_get_fileset_state(void);
static void pa_attach_parallelized_txn_hash(dsa_handle *pa_dsa_handle,
dshash_table_handle *pa_dshash_handle);
+static void write_internal_relation(StringInfo s, LogicalRepRelation *rel);
/*
* Returns true if it is OK to start a parallel apply worker, false otherwise.
@@ -400,6 +431,7 @@ pa_setup_dsm(ParallelApplyWorkerInfo *winfo)
shared = shm_toc_allocate(toc, sizeof(ParallelApplyWorkerShared));
SpinLockInit(&shared->mutex);
+ shared->xid = InvalidTransactionId;
shared->xact_state = PARALLEL_TRANS_UNKNOWN;
pg_atomic_init_u32(&(shared->pending_stream_count), 0);
shared->last_commit_end = InvalidXLogRecPtr;
@@ -443,6 +475,8 @@ pa_launch_parallel_worker(void)
MemoryContext oldcontext;
bool launched;
ParallelApplyWorkerInfo *winfo;
+ dsa_handle pa_dsa_handle;
+ dshash_table_handle pa_dshash_handle;
ListCell *lc;
/* Try to get an available parallel apply worker from the worker pool. */
@@ -450,10 +484,33 @@ pa_launch_parallel_worker(void)
{
winfo = (ParallelApplyWorkerInfo *) lfirst(lc);
- if (!winfo->in_use)
+ if (!winfo->stream_txn &&
+ pa_get_xact_state(winfo->shared) == PARALLEL_TRANS_FINISHED)
+ {
+ /*
+ * Save the local commit LSN of the last transaction applied by
+ * this worker before reusing it for another transaction. This WAL
+ * position is crucial for determining the flush position in
+ * responses to the publisher (see get_flush_position()).
+ */
+ (void) pa_get_last_commit_end(winfo->shared->xid, false, NULL);
+ return winfo;
+ }
+
+ if (winfo->stream_txn && !winfo->in_use)
return winfo;
}
+ pa_attach_parallelized_txn_hash(&pa_dsa_handle, &pa_dshash_handle);
+
+ /*
+ * Return if the number of parallel apply workers has reached the maximum
+ * limit.
+ */
+ if (list_length(ParallelApplyWorkerPool) ==
+ max_parallel_apply_workers_per_subscription)
+ return NULL;
+
/*
* Start a new parallel apply worker.
*
@@ -481,18 +538,32 @@ pa_launch_parallel_worker(void)
dsm_segment_handle(winfo->dsm_seg),
false);
- if (launched)
- {
- ParallelApplyWorkerPool = lappend(ParallelApplyWorkerPool, winfo);
- }
- else
+ if (!launched)
{
+ MemoryContextSwitchTo(oldcontext);
pa_free_worker_info(winfo);
- winfo = NULL;
+ return NULL;
}
+ ParallelApplyWorkerPool = lappend(ParallelApplyWorkerPool, winfo);
+
MemoryContextSwitchTo(oldcontext);
+ /*
+ * Send all existing remote relation information to the parallel apply
+ * worker. This allows the parallel worker to initialize the
+ * LogicalRepRelMapEntry locally before applying remote changes.
+ */
+ if (logicalrep_get_num_rels())
+ {
+ StringInfoData out;
+
+ initStringInfo(&out);
+
+ write_internal_relation(&out, NULL);
+ pa_send_data(winfo, out.len, out.data);
+ }
+
return winfo;
}
@@ -597,7 +668,8 @@ pa_free_worker(ParallelApplyWorkerInfo *winfo)
{
Assert(!am_parallel_apply_worker());
Assert(winfo->in_use);
- Assert(pa_get_xact_state(winfo->shared) == PARALLEL_TRANS_FINISHED);
+ Assert(!winfo->stream_txn ||
+ pa_get_xact_state(winfo->shared) == PARALLEL_TRANS_FINISHED);
if (!hash_search(ParallelApplyTxnHash, &winfo->shared->xid, HASH_REMOVE, NULL))
elog(ERROR, "hash table corrupted");
@@ -613,9 +685,7 @@ pa_free_worker(ParallelApplyWorkerInfo *winfo)
* been serialized and then letting the parallel apply worker deal with
* the spurious message, we stop the worker.
*/
- if (winfo->serialize_changes ||
- list_length(ParallelApplyWorkerPool) >
- (max_parallel_apply_workers_per_subscription / 2))
+ if (winfo->serialize_changes)
{
logicalrep_pa_worker_stop(winfo);
pa_free_worker_info(winfo);
@@ -812,6 +882,38 @@ pa_get_last_commit_end(TransactionId xid, bool delete_entry, bool *skipped_write
return entry->local_end;
}
+/*
+ * Wait for the remote transaction associated with the specified remote xid to
+ * complete.
+ */
+static void
+pa_wait_for_transaction(TransactionId wait_for_xid)
+{
+ if (!am_leader_apply_worker())
+ return;
+
+ if (!TransactionIdIsValid(wait_for_xid))
+ return;
+
+ elog(DEBUG1, "plan to wait for remote_xid %u to finish",
+ wait_for_xid);
+
+ for (;;)
+ {
+ if (pa_transaction_committed(wait_for_xid))
+ break;
+
+ pa_lock_transaction(wait_for_xid, AccessShareLock);
+ pa_unlock_transaction(wait_for_xid, AccessShareLock);
+
+ /* An interrupt may have occurred while we were waiting. */
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ elog(DEBUG1, "finished wait for remote_xid %u to finish",
+ wait_for_xid);
+}
+
/*
* Interrupt handler for main loop of parallel apply worker.
*/
@@ -887,21 +989,34 @@ LogicalParallelApplyLoop(shm_mq_handle *mqh)
* parallel apply workers can only be PqReplMsg_WALData.
*/
c = pq_getmsgbyte(&s);
- if (c != PqReplMsg_WALData)
- elog(ERROR, "unexpected message \"%c\"", c);
-
- /*
- * Ignore statistics fields that have been updated by the leader
- * apply worker.
- *
- * XXX We can avoid sending the statistics fields from the leader
- * apply worker but for that, it needs to rebuild the entire
- * message by removing these fields which could be more work than
- * simply ignoring these fields in the parallel apply worker.
- */
- s.cursor += SIZE_STATS_MESSAGE;
+ if (c == PqReplMsg_WALData)
+ {
+ /*
+ * Ignore statistics fields that have been updated by the
+ * leader apply worker.
+ *
+ * XXX We can avoid sending the statistics fields from the
+ * leader apply worker but for that, it needs to rebuild the
+ * entire message by removing these fields which could be more
+ * work than simply ignoring these fields in the parallel
+ * apply worker.
+ */
+ s.cursor += SIZE_STATS_MESSAGE;
- apply_dispatch(&s);
+ apply_dispatch(&s);
+ }
+ else if (c == PARALLEL_APPLY_INTERNAL_MESSAGE)
+ {
+ apply_dispatch(&s);
+ }
+ else
+ {
+ /*
+ * The first byte of messages sent from leader apply worker to
+ * parallel apply workers can only be 'w' or 'i'.
+ */
+ elog(ERROR, "unexpected message \"%c\"", c);
+ }
}
else if (shmq_res == SHM_MQ_WOULD_BLOCK)
{
@@ -918,6 +1033,9 @@ LogicalParallelApplyLoop(shm_mq_handle *mqh)
if (rc & WL_LATCH_SET)
ResetLatch(MyLatch);
+
+ if (!IsTransactionState())
+ pgstat_report_stat(true);
}
}
else
@@ -955,6 +1073,9 @@ pa_shutdown(int code, Datum arg)
INVALID_PROC_NUMBER);
dsm_detach((dsm_segment *) DatumGetPointer(arg));
+
+ if (parallel_apply_dsa_area)
+ dsa_detach(parallel_apply_dsa_area);
}
/*
@@ -1267,7 +1388,6 @@ pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes, const void *data)
shm_mq_result result;
TimestampTz startTime = 0;
- Assert(!IsTransactionState());
Assert(!winfo->serialize_changes);
/*
@@ -1319,6 +1439,67 @@ pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes, const void *data)
}
}
+/*
+ * Distribute remote relation information to all active parallel apply workers
+ * that require it.
+ */
+void
+pa_distribute_schema_changes_to_workers(LogicalRepRelation *rel)
+{
+ List *workers_stopped = NIL;
+ StringInfoData out;
+
+ if (!ParallelApplyWorkerPool)
+ return;
+
+ initStringInfo(&out);
+
+ write_internal_relation(&out, rel);
+
+ foreach_ptr(ParallelApplyWorkerInfo, winfo, ParallelApplyWorkerPool)
+ {
+ /*
+ * Skip the worker responsible for the current transaction, as the
+ * relation information has already been sent to it.
+ */
+ if (winfo == stream_apply_worker)
+ continue;
+
+ /*
+ * Skip workers that are in serialize mode, as they will soon stop
+ * once they finish applying their transaction.
+ */
+ if (winfo->serialize_changes)
+ continue;
+
+ elog(DEBUG1, "distributing schema changes to pa workers");
+
+ if (pa_send_data(winfo, out.len, out.data))
+ continue;
+
+ elog(DEBUG1, "failed to distribute, will stop that worker instead");
+
+ /*
+ * Distribution to this worker failed due to a sending timeout. Wait
+ * for the worker to complete its transaction and then stop it. This
+ * is consistent with the handling of workers in serialize mode (see
+ * pa_free_worker() for details).
+ */
+ pa_wait_for_transaction(winfo->shared->xid);
+
+ pa_get_last_commit_end(winfo->shared->xid, false, NULL);
+
+ logicalrep_pa_worker_stop(winfo);
+
+ workers_stopped = lappend(workers_stopped, winfo);
+ }
+
+ pfree(out.data);
+
+ foreach_ptr(ParallelApplyWorkerInfo, winfo, workers_stopped)
+ pa_free_worker_info(winfo);
+}
+
/*
* Switch to PARTIAL_SERIALIZE mode for the current transaction -- this means
* that the current data and any subsequent data for this transaction will be
@@ -1401,8 +1582,8 @@ pa_wait_for_xact_finish(ParallelApplyWorkerInfo *winfo)
/*
* Wait for the transaction lock to be released. This is required to
- * detect deadlock among leader and parallel apply workers. Refer to the
- * comments atop this file.
+ * detect deadlock among leader and parallel apply workers. Refer
+ * to the comments atop this file.
*/
pa_lock_transaction(winfo->shared->xid, AccessShareLock);
pa_unlock_transaction(winfo->shared->xid, AccessShareLock);
@@ -1479,6 +1660,9 @@ pa_savepoint_name(Oid suboid, TransactionId xid, char *spname, Size szsp)
void
pa_start_subtrans(TransactionId current_xid, TransactionId top_xid)
{
+ if (!TransactionIdIsValid(top_xid))
+ return;
+
if (current_xid != top_xid &&
!list_member_xid(subxactlist, current_xid))
{
@@ -1735,25 +1919,41 @@ pa_decr_and_wait_stream_block(void)
void
pa_xact_finish(ParallelApplyWorkerInfo *winfo, XLogRecPtr remote_lsn)
{
+ XLogRecPtr local_lsn = InvalidXLogRecPtr;
+ TransactionId pa_remote_xid = winfo->shared->xid;
+
Assert(am_leader_apply_worker());
/*
- * Unlock the shared object lock so that parallel apply worker can
- * continue to receive and apply changes.
+ * Unlock the shared object lock taken for streaming transactions so that
+ * parallel apply worker can continue to receive and apply changes.
*/
- pa_unlock_stream(winfo->shared->xid, AccessExclusiveLock);
+ if (winfo->stream_txn)
+ pa_unlock_stream(winfo->shared->xid, AccessExclusiveLock);
/*
- * Wait for that worker to finish. This is necessary to maintain commit
- * order which avoids failures due to transaction dependencies and
- * deadlocks.
+ * Wait for the worker applying a streaming transaction to finish. This is
+ * necessary to maintain commit order which avoids failures due to
+ * transaction dependencies and deadlocks.
+ *
+ * For a non-streaming transaction in partial serialize mode, also wait
+ * for the worker to stop, since such a worker cannot be reused anyway
+ * (see pa_free_worker() for details).
*/
- pa_wait_for_xact_finish(winfo);
+ if (winfo->serialize_changes || winfo->stream_txn)
+ {
+ pa_wait_for_xact_finish(winfo);
+
+ local_lsn = winfo->shared->last_commit_end;
+ pa_remote_xid = InvalidTransactionId;
+
+ pa_free_worker(winfo);
+ }
if (XLogRecPtrIsValid(remote_lsn))
- store_flush_position(remote_lsn, winfo->shared->last_commit_end);
+ store_flush_position(remote_lsn, local_lsn, pa_remote_xid);
- pa_free_worker(winfo);
+ pa_set_stream_apply_worker(NULL);
}
bool
@@ -1852,6 +2052,22 @@ pa_record_dependency_on_transactions(List *depends_on_xids)
}
}
+/*
+ * Mark the transaction state as finished and remove the shared hash entry.
+ */
+void
+pa_commit_transaction(void)
+{
+ TransactionId xid = MyParallelShared->xid;
+
+ SpinLockAcquire(&MyParallelShared->mutex);
+ MyParallelShared->xact_state = PARALLEL_TRANS_FINISHED;
+ SpinLockRelease(&MyParallelShared->mutex);
+
+ dshash_delete_key(parallelized_txns, &xid);
+ elog(DEBUG1, "depended xid %u committed", xid);
+}
+
/*
* Wait for the given transaction to finish.
*/
@@ -1860,6 +2076,13 @@ pa_wait_for_depended_transaction(TransactionId xid)
{
elog(DEBUG1, "wait for depended xid %u", xid);
+ /*
+ * Quick exit if parallelized_txns has not been initialized yet. This can
+ * happen when this function is called by the leader worker.
+ */
+ if (!parallelized_txns)
+ return;
+
for (;;)
{
ParallelizedTxnEntry *txn_entry;
@@ -1880,3 +2103,45 @@ pa_wait_for_depended_transaction(TransactionId xid)
elog(DEBUG1, "finish waiting for depended xid %u", xid);
}
+
+/*
+ * Write internal relation description to the output stream.
+ */
+static void
+write_internal_relation(StringInfo s, LogicalRepRelation *rel)
+{
+ pq_sendbyte(s, PARALLEL_APPLY_INTERNAL_MESSAGE);
+ pq_sendbyte(s, LOGICAL_REP_MSG_INTERNAL_RELATION);
+
+ if (rel)
+ {
+ pq_sendint(s, 1, 4);
+ logicalrep_write_internal_rel(s, rel);
+ }
+ else
+ {
+ pq_sendint(s, logicalrep_get_num_rels(), 4);
+ logicalrep_write_all_rels(s);
+ }
+}
+
+/*
+ * Register a transaction to the shared hash table.
+ *
+ * This function is intended to be called during the commit phase of
+ * non-streamed transactions. Dependent parallel workers wait on the added
+ * entry, which is removed when the transaction commits.
+ */
+void
+pa_add_parallelized_transaction(TransactionId xid)
+{
+ bool found;
+ ParallelizedTxnEntry *txn_entry;
+
+ Assert(parallelized_txns);
+ Assert(TransactionIdIsValid(xid));
+
+ txn_entry = dshash_find_or_insert(parallelized_txns, &xid, &found);
+
+ dshash_release_lock(parallelized_txns, txn_entry);
+}
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index ded46c49a83..96b6a74055e 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -691,6 +691,44 @@ logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel,
logicalrep_write_attrs(out, rel, columns, include_gencols_type);
}
+/*
+ * Write internal relation description to the output stream.
+ */
+void
+logicalrep_write_internal_rel(StringInfo out, LogicalRepRelation *rel)
+{
+ pq_sendint32(out, rel->remoteid);
+
+ /* Write relation name */
+ pq_sendstring(out, rel->nspname);
+ pq_sendstring(out, rel->relname);
+
+ /* Write the replica identity. */
+ pq_sendbyte(out, rel->replident);
+
+ /* Write attribute description */
+ pq_sendint16(out, rel->natts);
+
+ for (int i = 0; i < rel->natts; i++)
+ {
+ uint8 flags = 0;
+
+ if (bms_is_member(i, rel->attkeys))
+ flags |= LOGICALREP_IS_REPLICA_IDENTITY;
+
+ pq_sendbyte(out, flags);
+
+ /* attribute name */
+ pq_sendstring(out, rel->attnames[i]);
+
+ /* attribute type id */
+ pq_sendint32(out, rel->atttyps[i]);
+
+ /* ignore attribute mode for now */
+ pq_sendint32(out, 0);
+ }
+}
+
/*
* Read the relation info from stream and return as LogicalRepRelation.
*/
diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index 13f8cb74e9f..9991bfe76cc 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -960,6 +960,37 @@ FindLogicalRepLocalIndex(Relation localrel, LogicalRepRelation *remoterel,
return InvalidOid;
}
+/*
+ * Get the number of entries in the LogicalRepRelMap.
+ */
+int
+logicalrep_get_num_rels(void)
+{
+ if (LogicalRepRelMap == NULL)
+ return 0;
+
+ return hash_get_num_entries(LogicalRepRelMap);
+}
+
+/*
+ * Write all the remote relation information from the LogicalRepRelMapEntry to
+ * the output stream.
+ */
+void
+logicalrep_write_all_rels(StringInfo out)
+{
+ LogicalRepRelMapEntry *entry;
+ HASH_SEQ_STATUS status;
+
+ if (LogicalRepRelMap == NULL)
+ return;
+
+ hash_seq_init(&status, LogicalRepRelMap);
+
+ while ((entry = (LogicalRepRelMapEntry *) hash_seq_search(&status)) != NULL)
+ logicalrep_write_internal_rel(out, &entry->remoterel);
+}
+
/*
* Get the LogicalRepRelMapEntry corresponding to the given relid without
* opening the local relation.
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 0b1eeefe9c9..3832481647e 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -286,6 +286,7 @@
#include "tcop/tcopprot.h"
#include "utils/acl.h"
#include "utils/guc.h"
+#include "utils/injection_point.h"
#include "utils/inval.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -484,6 +485,8 @@ static List *on_commit_wakeup_workers_subids = NIL;
bool in_remote_transaction = false;
static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
+static TransactionId remote_xid = InvalidTransactionId;
+static TransactionId last_remote_xid = InvalidTransactionId;
/* fields valid only when processing streamed transaction */
static bool in_streamed_transaction = false;
@@ -602,11 +605,7 @@ static inline void cleanup_subxact_info(void);
/*
* Serialize and deserialize changes for a toplevel transaction.
*/
-static void stream_open_file(Oid subid, TransactionId xid,
- bool first_segment);
static void stream_write_change(char action, StringInfo s);
-static void stream_open_and_write_change(TransactionId xid, char action, StringInfo s);
-static void stream_close_file(void);
static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
@@ -676,6 +675,8 @@ static void replorigin_reset(int code, Datum arg);
static bool send_internal_dependencies(ParallelApplyWorkerInfo *winfo,
StringInfo s);
+static bool build_dependency_with_last_committed_txn(ParallelApplyWorkerInfo *winfo);
+
/*
* Compute the hash value for entries in the replica_identity_table.
*/
@@ -1406,7 +1407,11 @@ handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
TransApplyAction apply_action;
StringInfoData original_msg;
- apply_action = get_transaction_apply_action(stream_xid, &winfo);
+ Assert(!in_streamed_transaction || TransactionIdIsValid(stream_xid));
+
+ apply_action = get_transaction_apply_action(in_streamed_transaction
+ ? stream_xid : remote_xid,
+ &winfo);
/* not in streaming mode */
if (apply_action == TRANS_LEADER_APPLY)
@@ -1415,8 +1420,6 @@ handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
return false;
}
- Assert(TransactionIdIsValid(stream_xid));
-
/*
* The parallel apply worker needs the xid in this message to decide
* whether to define a savepoint, so save the original message that has
@@ -1427,15 +1430,28 @@ handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
/*
* We should have received XID of the subxact as the first part of the
- * message, so extract it.
+ * message in streaming transactions, so extract it.
*/
- current_xid = pq_getmsgint(s, 4);
+ if (in_streamed_transaction)
+ current_xid = pq_getmsgint(s, 4);
+ else
+ current_xid = remote_xid;
if (!TransactionIdIsValid(current_xid))
ereport(ERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
errmsg_internal("invalid transaction ID in streamed replication transaction")));
+ handle_dependency_on_change(action, s, current_xid, winfo);
+
+ /*
+ * Re-fetch the latest apply action as it might have been changed during
+ * dependency check.
+ */
+ apply_action = get_transaction_apply_action(in_streamed_transaction
+ ? stream_xid : remote_xid,
+ &winfo);
+
switch (apply_action)
{
case TRANS_LEADER_SERIALIZE:
@@ -1839,17 +1855,71 @@ static void
apply_handle_begin(StringInfo s)
{
LogicalRepBeginData begin_data;
+ ParallelApplyWorkerInfo *winfo;
+ TransApplyAction apply_action;
+
+ /* Save the message before it is consumed. */
+ StringInfoData original_msg = *s;
/* There must not be an active streaming transaction. */
Assert(!TransactionIdIsValid(stream_xid));
logicalrep_read_begin(s, &begin_data);
- set_apply_error_context_xact(begin_data.xid, begin_data.final_lsn);
+
+ remote_xid = begin_data.xid;
+
+ set_apply_error_context_xact(remote_xid, begin_data.final_lsn);
remote_final_lsn = begin_data.final_lsn;
maybe_start_skipping_changes(begin_data.final_lsn);
+ pa_allocate_worker(remote_xid, false);
+
+ apply_action = get_transaction_apply_action(remote_xid, &winfo);
+
+ elog(DEBUG1, "new remote_xid %u", remote_xid);
+ switch (apply_action)
+ {
+ case TRANS_LEADER_APPLY:
+ break;
+
+ case TRANS_LEADER_SEND_TO_PARALLEL:
+ Assert(winfo);
+
+ if (pa_send_data(winfo, s->len, s->data))
+ {
+ pa_set_stream_apply_worker(winfo);
+ break;
+ }
+
+ /*
+ * Switch to serialize mode when we are not able to send the
+ * change to parallel apply worker.
+ */
+ pa_switch_to_partial_serialize(winfo, true);
+
+/* fall through */
+ case TRANS_LEADER_PARTIAL_SERIALIZE:
+ Assert(winfo);
+
+ stream_write_change(LOGICAL_REP_MSG_BEGIN, &original_msg);
+
+ /* Cache the parallel apply worker for this transaction. */
+ pa_set_stream_apply_worker(winfo);
+ break;
+
+ case TRANS_PARALLEL_APPLY:
+ /* Hold the lock until the end of the transaction. */
+ pa_lock_transaction(MyParallelShared->xid, AccessExclusiveLock);
+ pa_set_xact_state(MyParallelShared, PARALLEL_TRANS_STARTED);
+ break;
+
+ default:
+ elog(ERROR, "unexpected apply action: %d", (int) apply_action);
+ break;
+ }
+
in_remote_transaction = true;
pgstat_report_activity(STATE_RUNNING, NULL);
@@ -1882,6 +1952,37 @@ send_internal_dependencies(ParallelApplyWorkerInfo *winfo, StringInfo s)
return false;
}
+/*
+ * Make a dependency between this transaction and the most recently committed one.
+ *
+ * This function ensures that the commit ordering handled by parallel apply
+ * workers is preserved. Returns false if we switched to the serialize mode to
+ * send the message, true otherwise.
+ */
+static bool
+build_dependency_with_last_committed_txn(ParallelApplyWorkerInfo *winfo)
+{
+ StringInfoData dependency_msg;
+ bool ret;
+
+ /* Skip if transactions have not been applied yet */
+ if (!TransactionIdIsValid(last_remote_xid))
+ return true;
+
+ /* Build the dependency message used to send to parallel apply worker */
+ initStringInfo(&dependency_msg);
+
+ pq_sendbyte(&dependency_msg, PARALLEL_APPLY_INTERNAL_MESSAGE);
+ pq_sendbyte(&dependency_msg, LOGICAL_REP_MSG_INTERNAL_DEPENDENCY);
+ pq_sendint32(&dependency_msg, 1);
+ pq_sendint32(&dependency_msg, last_remote_xid);
+
+ ret = send_internal_dependencies(winfo, &dependency_msg);
+
+ pfree(dependency_msg.data);
+ return ret;
+}
+
/*
* Handle COMMIT message.
*
@@ -1891,6 +1992,11 @@ static void
apply_handle_commit(StringInfo s)
{
LogicalRepCommitData commit_data;
+ ParallelApplyWorkerInfo *winfo;
+ TransApplyAction apply_action;
+
+ /* Save the message before it is consumed. */
+ StringInfoData original_msg = *s;
logicalrep_read_commit(s, &commit_data);
@@ -1901,7 +2007,97 @@ apply_handle_commit(StringInfo s)
LSN_FORMAT_ARGS(commit_data.commit_lsn),
LSN_FORMAT_ARGS(remote_final_lsn))));
- apply_handle_commit_internal(&commit_data);
+ apply_action = get_transaction_apply_action(remote_xid, &winfo);
+
+ switch (apply_action)
+ {
+ case TRANS_LEADER_APPLY:
+ /*
+ * Unlike parallelized transactions, this transaction does not have to
+ * be registered in parallelized_txns: the leader applies it
+ * synchronously, so the commit ordering is always preserved.
+ */
+
+ /* Wait until the last transaction finishes */
+ if (TransactionIdIsValid(last_remote_xid))
+ pa_wait_for_depended_transaction(last_remote_xid);
+
+ apply_handle_commit_internal(&commit_data);
+
+ break;
+
+ case TRANS_LEADER_SEND_TO_PARALLEL:
+ Assert(winfo);
+
+ /*
+ * Mark this transaction as parallelized. This ensures that
+ * upcoming transactions wait until this transaction is committed.
+ */
+ pa_add_parallelized_transaction(remote_xid);
+
+ /*
+ * Build a dependency between this transaction and the most recently
+ * committed transaction to preserve the commit order; if that
+ * succeeds, try to send the COMMIT message.
+ */
+ if (build_dependency_with_last_committed_txn(winfo) &&
+ pa_send_data(winfo, s->len, s->data))
+ {
+ /* Finish processing the transaction. */
+ pa_xact_finish(winfo, commit_data.end_lsn);
+ break;
+ }
+
+ /*
+ * Switch to serialize mode when we are not able to send the
+ * change to parallel apply worker.
+ */
+ pa_switch_to_partial_serialize(winfo, true);
+/* fall through */
+ case TRANS_LEADER_PARTIAL_SERIALIZE:
+ Assert(winfo);
+
+ stream_open_and_write_change(remote_xid, LOGICAL_REP_MSG_COMMIT,
+ &original_msg);
+
+ pa_set_fileset_state(winfo->shared, FS_SERIALIZE_DONE);
+
+ /* Finish processing the transaction. */
+ pa_xact_finish(winfo, commit_data.end_lsn);
+ break;
+
+ case TRANS_PARALLEL_APPLY:
+
+ /*
+ * If the parallel apply worker is applying spooled messages then
+ * close the file before committing.
+ */
+ if (stream_fd)
+ stream_close_file();
+
+ INJECTION_POINT("parallel-worker-before-commit", NULL);
+
+ apply_handle_commit_internal(&commit_data);
+
+ MyParallelShared->last_commit_end = XactLastCommitEnd;
+
+ pa_commit_transaction();
+
+ pa_unlock_transaction(remote_xid, AccessExclusiveLock);
+ break;
+
+ default:
+ elog(ERROR, "unexpected apply action: %d", (int) apply_action);
+ break;
+ }
+
+ /* Cache the remote_xid */
+ last_remote_xid = remote_xid;
+
+ remote_xid = InvalidTransactionId;
+ in_remote_transaction = false;
+
+ elog(DEBUG1, "reset remote_xid %u", last_remote_xid);
/*
* Process any tables that are being synchronized in parallel, as well as
@@ -2024,7 +2220,8 @@ apply_handle_prepare(StringInfo s)
* XactLastCommitEnd, and adding it for this purpose doesn't seems worth
* it.
*/
- store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr);
+ store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr,
+ InvalidTransactionId);
in_remote_transaction = false;
@@ -2084,7 +2281,8 @@ apply_handle_commit_prepared(StringInfo s)
CommitTransactionCommand();
pgstat_report_stat(false);
- store_flush_position(prepare_data.end_lsn, XactLastCommitEnd);
+ store_flush_position(prepare_data.end_lsn, XactLastCommitEnd,
+ InvalidTransactionId);
in_remote_transaction = false;
/*
@@ -2153,7 +2351,8 @@ apply_handle_rollback_prepared(StringInfo s)
* transaction because we always flush the WAL record for it. See
* apply_handle_prepare.
*/
- store_flush_position(rollback_data.rollback_end_lsn, InvalidXLogRecPtr);
+ store_flush_position(rollback_data.rollback_end_lsn, InvalidXLogRecPtr,
+ InvalidTransactionId);
in_remote_transaction = false;
/*
@@ -2215,7 +2414,8 @@ apply_handle_stream_prepare(StringInfo s)
* It is okay not to set the local_end LSN for the prepare because
* we always flush the prepare record. See apply_handle_prepare.
*/
- store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr);
+ store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr,
+ InvalidTransactionId);
in_remote_transaction = false;
@@ -2467,6 +2667,11 @@ apply_handle_stream_start(StringInfo s)
case TRANS_LEADER_PARTIAL_SERIALIZE:
Assert(winfo);
+ /*
+ * TODO: the parallel apply worker could start to wait too soon
+ * when processing an old stream start.
+ */
+
/*
* Open the spool file unless it was already opened when switching
* to serialize mode. The transaction started in
@@ -3194,7 +3399,8 @@ apply_handle_commit_internal(LogicalRepCommitData *commit_data)
pgstat_report_stat(false);
- store_flush_position(commit_data->end_lsn, XactLastCommitEnd);
+ store_flush_position(commit_data->end_lsn, XactLastCommitEnd,
+ InvalidTransactionId);
}
else
{
@@ -3227,6 +3433,9 @@ apply_handle_relation(StringInfo s)
/* Also reset all entries in the partition map that refer to remoterel. */
logicalrep_partmap_reset_relmap(rel);
+
+ if (am_leader_apply_worker())
+ pa_distribute_schema_changes_to_workers(rel);
}
/*
@@ -4001,6 +4210,8 @@ FindDeletedTupleInLocalRel(Relation localrel, Oid localidxoid,
/*
* This handles insert, update, delete on a partitioned table.
+ *
+ * TODO: support parallel apply.
*/
static void
apply_handle_tuple_routing(ApplyExecutionData *edata,
@@ -4551,6 +4762,10 @@ apply_dispatch(StringInfo s)
* check which entries on it are already locally flushed. Those we can report
* as having been flushed.
*
+ * For non-streaming transactions managed by a parallel apply worker, we will
+ * get the local commit end from the shared parallel apply worker info once the
+ * transaction has been committed by the worker.
+ *
* The have_pending_txes is true if there are outstanding transactions that
* need to be flushed.
*/
@@ -4560,6 +4775,7 @@ get_flush_position(XLogRecPtr *write, XLogRecPtr *flush,
{
dlist_mutable_iter iter;
XLogRecPtr local_flush = GetFlushRecPtr(NULL);
+ List *committed_pa_xid = NIL;
*write = InvalidXLogRecPtr;
*flush = InvalidXLogRecPtr;
@@ -4569,6 +4785,36 @@ get_flush_position(XLogRecPtr *write, XLogRecPtr *flush,
FlushPosition *pos =
dlist_container(FlushPosition, node, iter.cur);
+ if (TransactionIdIsValid(pos->pa_remote_xid) &&
+ XLogRecPtrIsInvalid(pos->local_end))
+ {
+ bool skipped_write;
+
+ pos->local_end = pa_get_last_commit_end(pos->pa_remote_xid, true,
+ &skipped_write);
+
+ elog(DEBUG1,
+ "got commit end from parallel apply worker, "
+ "txn: %u, remote_end %X/%X, local_end %X/%X",
+ pos->pa_remote_xid, LSN_FORMAT_ARGS(pos->remote_end),
+ LSN_FORMAT_ARGS(pos->local_end));
+
+ /*
+ * Break the loop if the worker has not finished applying the
+ * transaction. There's no need to check subsequent transactions,
+ * as they must commit after the current transaction being
+ * examined and thus won't have their commit end available yet.
+ */
+ if (!skipped_write && XLogRecPtrIsInvalid(pos->local_end))
+ break;
+
+ committed_pa_xid = lappend_xid(committed_pa_xid, pos->pa_remote_xid);
+ }
+
+ /*
+ * The worker has finished applying, or the transaction was applied by
+ * the leader apply worker.
+ */
*write = pos->remote_end;
if (pos->local_end <= local_flush)
@@ -4577,29 +4823,19 @@ get_flush_position(XLogRecPtr *write, XLogRecPtr *flush,
dlist_delete(iter.cur);
pfree(pos);
}
- else
- {
- /*
- * Don't want to uselessly iterate over the rest of the list which
- * could potentially be long. Instead get the last element and
- * grab the write position from there.
- */
- pos = dlist_tail_element(FlushPosition, node,
- &lsn_mapping);
- *write = pos->remote_end;
- *have_pending_txes = true;
- return;
- }
}
*have_pending_txes = !dlist_is_empty(&lsn_mapping);
+
+ cleanup_replica_identity_table(committed_pa_xid);
}
/*
* Store current remote/local lsn pair in the tracking list.
*/
void
-store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn)
+store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn,
+ TransactionId remote_xid)
{
FlushPosition *flushpos;
@@ -4617,6 +4853,7 @@ store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn)
flushpos = palloc_object(FlushPosition);
flushpos->local_end = local_lsn;
flushpos->remote_end = remote_lsn;
+ flushpos->pa_remote_xid = remote_xid;
dlist_push_tail(&lsn_mapping, &flushpos->node);
MemoryContextSwitchTo(ApplyMessageContext);
@@ -6064,7 +6301,7 @@ stream_cleanup_files(Oid subid, TransactionId xid)
* changes for this transaction, create the buffile, otherwise open the
* previously created file.
*/
-static void
+void
stream_open_file(Oid subid, TransactionId xid, bool first_segment)
{
char path[MAXPGPATH];
@@ -6109,7 +6346,7 @@ stream_open_file(Oid subid, TransactionId xid, bool first_segment)
* stream_close_file
* Close the currently open file with streamed changes.
*/
-static void
+void
stream_close_file(void)
{
Assert(stream_fd != NULL);
@@ -6157,7 +6394,7 @@ stream_write_change(char action, StringInfo s)
* target file if not already before writing the message and close the file at
* the end.
*/
-static void
+void
stream_open_and_write_change(TransactionId xid, char action, StringInfo s)
{
Assert(!in_streamed_transaction);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 5d91e2a4287..7d2aaf2d389 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -253,6 +253,8 @@ extern void logicalrep_write_message(StringInfo out, TransactionId xid, XLogRecP
extern void logicalrep_write_rel(StringInfo out, TransactionId xid,
Relation rel, Bitmapset *columns,
PublishGencolsType include_gencols_type);
+extern void logicalrep_write_internal_rel(StringInfo out,
+ LogicalRepRelation *rel);
extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
extern void logicalrep_write_typ(StringInfo out, TransactionId xid,
Oid typoid);
diff --git a/src/include/replication/logicalrelation.h b/src/include/replication/logicalrelation.h
index 4b321bd2ad2..34a7069e9e5 100644
--- a/src/include/replication/logicalrelation.h
+++ b/src/include/replication/logicalrelation.h
@@ -52,6 +52,8 @@ extern void logicalrep_rel_close(LogicalRepRelMapEntry *rel,
LOCKMODE lockmode);
extern bool IsIndexUsableForReplicaIdentityFull(Relation idxrel, AttrMap *attrmap);
extern Oid GetRelationIdentityOrPK(Relation rel);
+extern int logicalrep_get_num_rels(void);
+extern void logicalrep_write_all_rels(StringInfo out);
extern LogicalRepRelMapEntry *logicalrep_get_relentry(LogicalRepRelId remoteid);
#endif /* LOGICALRELATION_H */
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 78b5667cebe..5371ee767f1 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -314,6 +314,10 @@ extern void apply_dispatch(StringInfo s);
extern void maybe_reread_subscription(void);
extern void stream_cleanup_files(Oid subid, TransactionId xid);
+extern void stream_open_file(Oid subid, TransactionId xid, bool first_segment);
+extern void stream_close_file(void);
+extern void stream_open_and_write_change(TransactionId xid, char action,
+ StringInfo s);
extern void set_stream_options(WalRcvStreamOptions *options,
char *slotname,
@@ -327,7 +331,8 @@ extern void SetupApplyOrSyncWorker(int worker_slot);
extern void DisableSubscriptionAndExit(void);
-extern void store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn);
+extern void store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn,
+ TransactionId remote_xid);
/* Function for apply error callback */
extern void apply_error_callback(void *arg);
@@ -342,6 +347,7 @@ extern void pa_detach_all_error_mq(void);
extern bool pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes,
const void *data);
+extern void pa_distribute_schema_changes_to_workers(LogicalRepRelation *rel);
extern void pa_switch_to_partial_serialize(ParallelApplyWorkerInfo *winfo,
bool stream_locked);
@@ -368,8 +374,9 @@ extern void pa_xact_finish(ParallelApplyWorkerInfo *winfo,
XLogRecPtr remote_lsn);
extern bool pa_transaction_committed(TransactionId xid);
extern void pa_record_dependency_on_transactions(List *depends_on_xids);
-
+extern void pa_commit_transaction(void);
extern void pa_wait_for_depended_transaction(TransactionId xid);
+extern void pa_add_parallelized_transaction(TransactionId xid);
#define isParallelApplyWorker(worker) ((worker)->in_use && \
(worker)->type == WORKERTYPE_PARALLEL_APPLY)
diff --git a/src/test/subscription/meson.build b/src/test/subscription/meson.build
index 85d10a89994..e877ca09c30 100644
--- a/src/test/subscription/meson.build
+++ b/src/test/subscription/meson.build
@@ -46,6 +46,7 @@ tests += {
't/034_temporal.pl',
't/035_conflicts.pl',
't/036_sequences.pl',
+ 't/050_parallel_apply.pl',
't/100_bugs.pl',
],
},
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index ecb79e79474..0ccec516a18 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -16,6 +16,8 @@ $node_publisher->start;
# Create subscriber node
my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf',
+ "max_logical_replication_workers = 10");
$node_subscriber->start;
# Create some preexisting content on publisher
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index 3d16c2a800d..c2fba0b9a9c 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -17,7 +17,7 @@ $node_publisher->start;
my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_subscriber->init;
$node_subscriber->append_conf('postgresql.conf',
- qq(max_logical_replication_workers = 6));
+ qq(max_logical_replication_workers = 7));
$node_subscriber->start;
my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
diff --git a/src/test/subscription/t/015_stream.pl b/src/test/subscription/t/015_stream.pl
index 03135b1cd6e..e79ddd9a41c 100644
--- a/src/test/subscription/t/015_stream.pl
+++ b/src/test/subscription/t/015_stream.pl
@@ -232,6 +232,12 @@ $node_subscriber->wait_for_log(
$node_publisher->safe_psql('postgres', "INSERT INTO test_tab_2 values(1)");
+# FIXME: Currently, non-streaming transactions are applied in parallel by
+# default. So, the first transaction is handled by a parallel apply worker. To
+# trigger the deadlock, initiate one more transaction to be applied by the
+# leader.
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab_2 values(1)");
+
$h->query_safe('COMMIT');
$h->quit;
@@ -247,7 +253,7 @@ $node_publisher->wait_for_catchup($appname);
$result =
$node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab_2");
-is($result, qq(5001), 'data replicated to subscriber after dropping index');
+is($result, qq(5002), 'data replicated to subscriber after dropping index');
# Clean up test data from the environment.
$node_publisher->safe_psql('postgres', "TRUNCATE TABLE test_tab_2");
diff --git a/src/test/subscription/t/026_stats.pl b/src/test/subscription/t/026_stats.pl
index a430ab4feec..58e34839ab4 100644
--- a/src/test/subscription/t/026_stats.pl
+++ b/src/test/subscription/t/026_stats.pl
@@ -16,6 +16,7 @@ $node_publisher->start;
# Create subscriber node.
my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf', "max_logical_replication_workers = 10");
$node_subscriber->start;
diff --git a/src/test/subscription/t/027_nosuperuser.pl b/src/test/subscription/t/027_nosuperuser.pl
index 691731743df..e0c1d213800 100644
--- a/src/test/subscription/t/027_nosuperuser.pl
+++ b/src/test/subscription/t/027_nosuperuser.pl
@@ -86,6 +86,7 @@ $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
$node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_publisher->init(allows_streaming => 'logical');
$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf', "max_logical_replication_workers = 10");
$node_publisher->start;
$node_subscriber->start;
$publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
diff --git a/src/test/subscription/t/050_parallel_apply.pl b/src/test/subscription/t/050_parallel_apply.pl
new file mode 100644
index 00000000000..69cf48cb7ac
--- /dev/null
+++ b/src/test/subscription/t/050_parallel_apply.pl
@@ -0,0 +1,130 @@
+
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Tests that dependency tracking between transactions works correctly
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->start;
+
+# Insert initial data
+$node_publisher->safe_psql('postgres',
+ "CREATE TABLE regress_tab (id int PRIMARY KEY, value text);");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (generate_series(1, 10), 'test');");
+
+# Create a publication
+$node_publisher->safe_psql('postgres',
+ "CREATE PUBLICATION regress_pub FOR ALL TABLES;");
+
+# Initialize subscriber node
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
+$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf', "log_min_messages = debug1");
+$node_subscriber->append_conf('postgresql.conf',
+ "max_logical_replication_workers = 10");
+$node_subscriber->start;
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node_subscriber->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+$node_subscriber->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+
+# Create a subscription
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+
+$node_subscriber->safe_psql('postgres',
+ "CREATE TABLE regress_tab (id int PRIMARY KEY, value text);");
+$node_subscriber->safe_psql('postgres',
+ "CREATE SUBSCRIPTION regress_sub CONNECTION '$publisher_connstr' PUBLICATION regress_pub;");
+
+# Wait for initial table sync to finish
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'regress_sub');
+
+# Insert tuples on publisher
+#
+# XXX This may not be enough to launch a parallel apply worker, because
+# table_states_not_ready is not discarded yet.
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (generate_series(11, 20), 'test');");
+$node_publisher->wait_for_catchup('regress_sub');
+
+# Insert tuples again
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (generate_series(21, 30), 'test');");
+$node_publisher->wait_for_catchup('regress_sub');
+
+# Verify the parallel apply worker is launched
+my $result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(1) FROM pg_stat_activity WHERE backend_type = 'logical replication parallel worker'");
+is($result, '1', "parallel apply worker is launched by a non-streamed transaction");
+
+# Attach an injection_point. Parallel workers would wait before the commit
+$node_subscriber->safe_psql('postgres',
+ "SELECT injection_points_attach('parallel-worker-before-commit','wait');"
+);
+
+# Insert tuples on publisher
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (generate_series(31, 40), 'test');");
+
+# Wait until the parallel worker enters the injection point.
+$node_subscriber->wait_for_event('logical replication parallel worker',
+ 'parallel-worker-before-commit');
+
+my $offset = -s $node_subscriber->logfile;
+
+# Insert tuples on publisher again. This transaction is independent of the
+# previous one, but the parallel worker will wait until the previous one finishes
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (generate_series(41, 50), 'test');");
+
+# Verify the parallel worker waits for the transaction
+my $str = $node_subscriber->wait_for_log(qr/wait for depended xid ([1-9][0-9]+)/, $offset);
+my ($xid) = $str =~ /wait for depended xid ([1-9][0-9]+)/;
+
+# Update tuples which have not been applied yet on subscriber because the
+# parallel worker stops at the injection point. The newly assigned worker also
+# waits for the same transaction as above.
+$node_publisher->safe_psql('postgres',
+ "UPDATE regress_tab SET value = 'updated' WHERE id BETWEEN 31 AND 35;");
+
+# Verify the parallel worker waits for the same transaction
+$node_subscriber->wait_for_log(qr/wait for depended xid $xid/, $offset);
+
+# Wake up the parallel worker. We detach first so as not to block other
+# parallel workers
+$node_subscriber->safe_psql('postgres', qq[
+ SELECT injection_points_detach('parallel-worker-before-commit');
+ SELECT injection_points_wakeup('parallel-worker-before-commit');
+]);
+
+# Verify the parallel worker wakes up
+$node_subscriber->wait_for_log(qr/finish waiting for depended xid $xid/, $offset);
+
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(1) FROM regress_tab");
+is ($result, 50, 'inserts are replicated to subscriber');
+
+$result =
+ $node_subscriber->safe_psql('postgres',
+ "SELECT count(1) FROM regress_tab WHERE value = 'updated'");
+is ($result, 5, 'updates are also replicated to subscriber');
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5c88fa92f4e..1517828a2d7 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2090,6 +2090,7 @@ ParallelHashGrowth
ParallelHashJoinBatch
ParallelHashJoinBatchAccessor
ParallelHashJoinState
+ParallelizedTxnEntry
ParallelIndexScanDesc
ParallelSlot
ParallelSlotArray
@@ -2574,6 +2575,8 @@ ReparameterizeForeignPathByChild_function
ReplaceVarsFromTargetList_context
ReplaceVarsNoMatchOption
ReplaceWrapOption
+ReplicaIdentityEntry
+ReplicaIdentityKey
ReplicaIdentityStmt
ReplicationKind
ReplicationSlot
@@ -4083,6 +4086,7 @@ rendezvousHashEntry
rep
replace_rte_variables_callback
replace_rte_variables_context
+replica_identity_hash
report_error_fn
ret_type
rewind_source
--
2.47.3
Attachment: v6_2-0005-support-2PC.patch
From 7db41cfc2b7690a00ece9d0baa3244a6772b2866 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Tue, 2 Dec 2025 13:01:26 +0900
Subject: [PATCH v6_2 5/8] support 2PC
This patch allows prepared transactions to be applied in parallel. Parallel
apply workers are assigned to a transaction when BEGIN_PREPARE is received;
this part and the dependency-waiting mechanism are the same as for a normal
transaction.
A parallel worker can be freed after it handles a PREPARE message. The prepared
transaction can be deleted from parallelized_txns at that time; upcoming
transactions will wait until then.
The leader apply worker resolves COMMIT PREPARED/ROLLBACK PREPARED. Since these
are applied serially by the leader, the transaction is not added to
parallelized_txns.
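
As a minimal publisher-side sketch of the flow this parallelizes (the table
name and GID are hypothetical, and max_prepared_transactions must be non-zero
on both nodes, as the added test configures), the PREPARE below is applied by
a parallel apply worker, while the later COMMIT PREPARED is resolved by the
leader:

BEGIN;
INSERT INTO tab VALUES (1);
PREPARE TRANSACTION 'gid1';
-- later, possibly after other transactions were dispatched in between
COMMIT PREPARED 'gid1';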
---
src/backend/replication/logical/worker.c | 230 +++++++++++++++---
src/test/subscription/t/050_parallel_apply.pl | 57 +++++
2 files changed, 259 insertions(+), 28 deletions(-)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 3832481647e..ab757e3fac9 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2116,6 +2116,11 @@ static void
apply_handle_begin_prepare(StringInfo s)
{
LogicalRepPreparedTxnData begin_data;
+ ParallelApplyWorkerInfo *winfo;
+ TransApplyAction apply_action;
+
+ /* Save the message before it is consumed. */
+ StringInfoData original_msg = *s;
/* Tablesync should never receive prepare. */
if (am_tablesync_worker())
@@ -2127,12 +2132,61 @@ apply_handle_begin_prepare(StringInfo s)
Assert(!TransactionIdIsValid(stream_xid));
logicalrep_read_begin_prepare(s, &begin_data);
- set_apply_error_context_xact(begin_data.xid, begin_data.prepare_lsn);
+
+ remote_xid = begin_data.xid;
+
+ set_apply_error_context_xact(remote_xid, begin_data.prepare_lsn);
remote_final_lsn = begin_data.prepare_lsn;
maybe_start_skipping_changes(begin_data.prepare_lsn);
+ pa_allocate_worker(remote_xid, false);
+
+ apply_action = get_transaction_apply_action(remote_xid, &winfo);
+
+ elog(DEBUG1, "new remote_xid %u", remote_xid);
+ switch (apply_action)
+ {
+ case TRANS_LEADER_APPLY:
+ break;
+
+ case TRANS_LEADER_SEND_TO_PARALLEL:
+ Assert(winfo);
+
+ if (pa_send_data(winfo, s->len, s->data))
+ {
+ pa_set_stream_apply_worker(winfo);
+ break;
+ }
+
+ /*
+ * Switch to serialize mode when we are not able to send the
+ * change to parallel apply worker.
+ */
+ pa_switch_to_partial_serialize(winfo, true);
+
+/* fall through */
+ case TRANS_LEADER_PARTIAL_SERIALIZE:
+ Assert(winfo);
+
+ stream_write_change(LOGICAL_REP_MSG_BEGIN_PREPARE, &original_msg);
+
+ /* Cache the parallel apply worker for this transaction. */
+ pa_set_stream_apply_worker(winfo);
+ break;
+
+ case TRANS_PARALLEL_APPLY:
+ /* Hold the lock until the end of the transaction. */
+ pa_lock_transaction(MyParallelShared->xid, AccessExclusiveLock);
+ pa_set_xact_state(MyParallelShared, PARALLEL_TRANS_STARTED);
+ break;
+
+ default:
+ elog(ERROR, "unexpected apply action: %d", (int) apply_action);
+ break;
+ }
+
in_remote_transaction = true;
pgstat_report_activity(STATE_RUNNING, NULL);
@@ -2182,6 +2236,11 @@ static void
apply_handle_prepare(StringInfo s)
{
LogicalRepPreparedTxnData prepare_data;
+ ParallelApplyWorkerInfo *winfo;
+ TransApplyAction apply_action;
+
+ /* Save the message before it is consumed. */
+ StringInfoData original_msg = *s;
logicalrep_read_prepare(s, &prepare_data);
@@ -2192,36 +2251,136 @@ apply_handle_prepare(StringInfo s)
LSN_FORMAT_ARGS(prepare_data.prepare_lsn),
LSN_FORMAT_ARGS(remote_final_lsn))));
- /*
- * Unlike commit, here, we always prepare the transaction even though no
- * change has happened in this transaction or all changes are skipped. It
- * is done this way because at commit prepared time, we won't know whether
- * we have skipped preparing a transaction because of those reasons.
- *
- * XXX, We can optimize such that at commit prepared time, we first check
- * whether we have prepared the transaction or not but that doesn't seem
- * worthwhile because such cases shouldn't be common.
- */
- begin_replication_step();
+ apply_action = get_transaction_apply_action(remote_xid, &winfo);
- apply_handle_prepare_internal(&prepare_data);
+ switch (apply_action)
+ {
+ case TRANS_LEADER_APPLY:
+ /*
+ * Unlike commit, here, we always prepare the transaction even
+ * though no change has happened in this transaction or all changes
+ * are skipped. It is done this way because at commit prepared
+ * time, we won't know whether we have skipped preparing a
+ * transaction because of those reasons.
+ *
+ * XXX, We can optimize such that at commit prepared time, we first
+ * check whether we have prepared the transaction or not but that
+ * doesn't seem worthwhile because such cases shouldn't be common.
+ */
+ begin_replication_step();
- end_replication_step();
- CommitTransactionCommand();
- pgstat_report_stat(false);
+ /* Wait until the last transaction finishes */
+ if (TransactionIdIsValid(last_remote_xid))
+ pa_wait_for_depended_transaction(last_remote_xid);
- /*
- * It is okay not to set the local_end LSN for the prepare because we
- * always flush the prepare record. So, we can send the acknowledgment of
- * the remote_end LSN as soon as prepare is finished.
- *
- * XXX For the sake of consistency with commit, we could have set it with
- * the LSN of prepare but as of now we don't track that value similar to
- * XactLastCommitEnd, and adding it for this purpose doesn't seems worth
- * it.
- */
- store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr,
- InvalidTransactionId);
+ apply_handle_prepare_internal(&prepare_data);
+
+ end_replication_step();
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ /*
+ * It is okay not to set the local_end LSN for the prepare because
+ * we always flush the prepare record. So, we can send the
+ * acknowledgment of the remote_end LSN as soon as prepare is
+ * finished.
+ *
+ * XXX For the sake of consistency with commit, we could have set
+ * it with the LSN of prepare but as of now we don't track that
+ * value similar to XactLastCommitEnd, and adding it for this
+ * purpose doesn't seem worth it.
+ */
+ store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr,
+ InvalidTransactionId);
+
+ break;
+
+ case TRANS_LEADER_SEND_TO_PARALLEL:
+ Assert(winfo);
+
+ /*
+ * Mark this transaction as parallelized. This ensures that
+ * upcoming transactions wait until this transaction is committed.
+ */
+ pa_add_parallelized_transaction(remote_xid);
+
+ /*
+ * Build a dependency between this transaction and the last
+ * committed one to preserve the commit order, then try to send
+ * the PREPARE message if that succeeded.
+ */
+ if (build_dependency_with_last_committed_txn(winfo) &&
+ pa_send_data(winfo, s->len, s->data))
+ {
+ /* Finish processing the transaction. */
+ pa_xact_finish(winfo, prepare_data.end_lsn);
+ break;
+ }
+
+ /*
+ * Switch to serialize mode when we are not able to send the
+ * change to parallel apply worker.
+ */
+ pa_switch_to_partial_serialize(winfo, true);
+/* fall through */
+ case TRANS_LEADER_PARTIAL_SERIALIZE:
+ Assert(winfo);
+
+ stream_open_and_write_change(remote_xid, LOGICAL_REP_MSG_PREPARE,
+ &original_msg);
+
+ pa_set_fileset_state(winfo->shared, FS_SERIALIZE_DONE);
+
+ /* Finish processing the transaction. */
+ pa_xact_finish(winfo, prepare_data.end_lsn);
+ break;
+
+ case TRANS_PARALLEL_APPLY:
+
+ /*
+ * If the parallel apply worker is applying spooled messages then
+ * close the file before committing.
+ */
+ if (stream_fd)
+ stream_close_file();
+
+ begin_replication_step();
+
+ INJECTION_POINT("parallel-worker-before-prepare", NULL);
+
+ /* Mark the transaction as prepared. */
+ apply_handle_prepare_internal(&prepare_data);
+
+ end_replication_step();
+
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr,
+ InvalidTransactionId);
+
+ /*
+ * It is okay not to set the local_end LSN for the prepare because
+ * we always flush the prepare record. See apply_handle_prepare.
+ */
+ MyParallelShared->last_commit_end = InvalidXLogRecPtr;
+ pa_commit_transaction();
+
+ pa_unlock_transaction(MyParallelShared->xid, AccessExclusiveLock);
+
+ pa_reset_subtrans();
+ break;
+
+ default:
+ elog(ERROR, "unexpected apply action: %d", (int) apply_action);
+ break;
+ }
+
+ /* Cache the remote_xid */
+ last_remote_xid = remote_xid;
+
+ remote_xid = InvalidTransactionId;
in_remote_transaction = false;
@@ -2269,6 +2428,9 @@ apply_handle_commit_prepared(StringInfo s)
/* There is no transaction when COMMIT PREPARED is called */
begin_replication_step();
+ if (TransactionIdIsValid(last_remote_xid))
+ pa_wait_for_depended_transaction(last_remote_xid);
+
/*
* Update origin state so we can restart streaming from correct position
* in case of crash.
@@ -2281,6 +2443,14 @@ apply_handle_commit_prepared(StringInfo s)
CommitTransactionCommand();
pgstat_report_stat(false);
+ /*
+ * No need to update last_remote_xid because the leader worker applied
+ * the message, so upcoming transactions preserve the order
+ * automatically. Set the xid to an invalid value to skip sending the
+ * INTERNAL_DEPENDENCY message.
+ */
+ last_remote_xid = InvalidTransactionId;
+
store_flush_position(prepare_data.end_lsn, XactLastCommitEnd,
InvalidTransactionId);
in_remote_transaction = false;
@@ -2337,6 +2507,10 @@ apply_handle_rollback_prepared(StringInfo s)
/* There is no transaction when ABORT/ROLLBACK PREPARED is called */
begin_replication_step();
+
+ if (TransactionIdIsValid(last_remote_xid))
+ pa_wait_for_depended_transaction(last_remote_xid);
+
FinishPreparedTransaction(gid, false);
end_replication_step();
CommitTransactionCommand();
diff --git a/src/test/subscription/t/050_parallel_apply.pl b/src/test/subscription/t/050_parallel_apply.pl
index 69cf48cb7ac..57bcfde513e 100644
--- a/src/test/subscription/t/050_parallel_apply.pl
+++ b/src/test/subscription/t/050_parallel_apply.pl
@@ -17,6 +17,8 @@ if ($ENV{enable_injection_points} ne 'yes')
# Initialize publisher node
my $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+ "max_prepared_transactions = 10");
$node_publisher->start;
# Insert initial data
@@ -35,6 +37,8 @@ $node_subscriber->init;
$node_subscriber->append_conf('postgresql.conf', "log_min_messages = debug1");
$node_subscriber->append_conf('postgresql.conf',
"max_logical_replication_workers = 10");
+$node_subscriber->append_conf('postgresql.conf',
+ "max_prepared_transactions = 10");
$node_subscriber->start;
# Check if the extension injection_points is available, as it may be
@@ -127,4 +131,57 @@ $result =
"SELECT count(1) FROM regress_tab WHERE value = 'updated'");
is ($result, 5, 'updates are also replicated to subscriber');
+# Ensure a prepared transaction also participates in parallel apply
+
+$node_subscriber->safe_psql('postgres',
+ "ALTER SUBSCRIPTION regress_sub DISABLE;");
+$node_subscriber->safe_psql('postgres',
+ "ALTER SUBSCRIPTION regress_sub SET (two_phase = on);");
+$node_subscriber->safe_psql('postgres',
+ "ALTER SUBSCRIPTION regress_sub ENABLE;");
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(1) FROM pg_stat_activity WHERE backend_type = 'logical replication parallel worker'");
+is($result, '0', "no parallel apply workers exist after restart");
+
+# Attach an injection_point. Parallel workers would wait before the prepare
+$node_subscriber->safe_psql('postgres',
+ "SELECT injection_points_attach('parallel-worker-before-prepare','wait');"
+);
+
+# PREPARE a transaction on publisher. It would be handled by a parallel apply
+# worker.
+$node_publisher->safe_psql('postgres', qq[
+ BEGIN;
+ INSERT INTO regress_tab VALUES (generate_series(51, 60), 'prepare');
+ PREPARE TRANSACTION 'regress_prepare';
+]);
+
+# Wait until the parallel worker enters the injection point.
+$node_subscriber->wait_for_event('logical replication parallel worker',
+ 'parallel-worker-before-prepare');
+
+$offset = -s $node_subscriber->logfile;
+
+# Insert tuples on publisher again. This transaction waits for the prepared
+# transaction
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (generate_series(61, 70), 'test');");
+
+# Verify the parallel worker waits for the transaction
+$str = $node_subscriber->wait_for_log(qr/wait for depended xid ([1-9][0-9]+)/, $offset);
+($xid) = $str =~ /wait for depended xid ([1-9][0-9]+)/;
+
+# Wakeup the parallel worker
+$node_subscriber->safe_psql('postgres', qq[
+ SELECT injection_points_detach('parallel-worker-before-prepare');
+ SELECT injection_points_wakeup('parallel-worker-before-prepare');
+]);
+
+$node_subscriber->wait_for_log(qr/finish waiting for depended xid $xid/, $offset);
+
+# COMMIT the prepared transaction. It is always handled by the leader
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'regress_prepare';");
+$node_publisher->wait_for_catchup('regress_sub');
+
done_testing();
--
2.47.3
Attachment: v6_2-0006-Track-dependencies-for-streamed-transactions.patch
From d0ed14991f075640876c1de34450f47efa965162 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Thu, 4 Dec 2025 20:55:26 +0900
Subject: [PATCH v6_2 6/8] Track dependencies for streamed transactions
This commit allows tracking dependencies of streamed transactions.
Regarding the streaming=on case, dependency tracking is enabled while applying
spooled changes from files.
In the streaming=parallel case, dependency tracking is performed when the leader
sends changes to parallel workers. Unlike non-streamed transactions, the
leader waits for parallel workers until the assigned transactions are finished at
COMMIT/PREPARE/ABORT; thus, the XID of a streamed transaction is not cached as
the last handled one. Also, streamed transactions are not recorded as
parallelized transactions because upcoming workers do not have to wait for them.
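
As a sketch of the scenario the added test exercises: with
logical_decoding_work_mem = 64kB on the publisher, a transaction whose decoded
changes exceed that limit is streamed, and its changes may depend on rows
written by an earlier transaction that a parallel worker is still applying:

BEGIN;
UPDATE regress_tab SET value = 'streamed-updated' WHERE id BETWEEN 71 AND 80;
INSERT INTO regress_tab VALUES (generate_series(100, 5100), 'streamed');
COMMIT;

Here the UPDATE touches rows inserted by the previous transaction, so the
streamed transaction must wait for it before being applied.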
---
.../replication/logical/applyparallelworker.c | 19 +++++-
src/backend/replication/logical/worker.c | 66 +++++++++++++++++--
src/include/replication/worker_internal.h | 2 +-
src/test/subscription/t/050_parallel_apply.pl | 47 +++++++++++++
4 files changed, 126 insertions(+), 8 deletions(-)
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index 5b6267c6047..bb66d64582c 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -168,7 +168,14 @@
* key) as another ongoing transaction (see handle_dependency_on_change for
* details). If so, the leader sends a list of dependent transaction IDs to the
* parallel worker, indicating that the parallel apply worker must wait for
- * these transactions to commit before proceeding.
+ * these transactions to commit before proceeding. If transactions are streamed
+ * but the leader decides not to assign parallel apply workers, dependencies are
+ * verified when the transaction is committed.
+ *
+ * Non-streaming transactions
+ * ======================
+ * The handling is similar to streaming transactions, but with a few
+ * differences:
*
* Commit order
* ------------
@@ -1635,6 +1642,12 @@ pa_set_stream_apply_worker(ParallelApplyWorkerInfo *winfo)
stream_apply_worker = winfo;
}
+bool
+pa_stream_apply_worker_is_null(void)
+{
+ return stream_apply_worker == NULL;
+}
+
/*
* Form a unique savepoint name for the streaming transaction.
*
@@ -1720,6 +1733,10 @@ pa_stream_abort(LogicalRepStreamAbortData *abort_data)
TransactionId xid = abort_data->xid;
TransactionId subxid = abort_data->subxid;
+ /* Streamed transactions won't be registered */
+ Assert(!dshash_find(parallelized_txns, &xid, false) &&
+ !dshash_find(parallelized_txns, &subxid, false));
+
/*
* Update origin state so we can restart streaming from correct position
* in case of crash.
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index ab757e3fac9..3057e6a3aab 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -961,13 +961,26 @@ check_dependency_on_replica_identity(Oid relid,
&rientry->remote_xid,
new_depended_xid);
+ /*
+ * Remove the entry if it is registered for a streamed transaction. We
+ * do not have to register an entry for them; the leader worker always
+ * waits until the parallel worker finishes handling streamed transactions,
+ * so there is no need to consider the possibility that upcoming parallel
+ * workers would go ahead.
+ */
+ if (TransactionIdIsValid(stream_xid) && !found)
+ {
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+ }
+
/*
* Update the new depended xid into the entry if valid, the new xid could
* be invalid if the transaction will be applied by the leader itself
* which means all the changes will be committed before processing next
* transaction, so no need to be depended on.
*/
- if (TransactionIdIsValid(new_depended_xid))
+ else if (TransactionIdIsValid(new_depended_xid))
rientry->remote_xid = new_depended_xid;
/*
@@ -1081,8 +1094,11 @@ handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
*/
StringInfoData change = *s;
- /* Compute dependency only for non-streaming transaction */
- if (in_streamed_transaction || (winfo && winfo->stream_txn))
+ /*
+ * Skip if we are handling a streamed transaction whose changes are not
+ * being applied yet.
+ */
+ if (pa_stream_apply_worker_is_null() && in_streamed_transaction)
return;
/* Only the leader checks dependencies and schedules the parallel apply */
@@ -1442,7 +1458,18 @@ handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
(errcode(ERRCODE_PROTOCOL_VIOLATION),
errmsg_internal("invalid transaction ID in streamed replication transaction")));
- handle_dependency_on_change(action, s, current_xid, winfo);
+ /*
+ * Check dependencies related to the received change. The XID of the top
+ * transaction is always used, to avoid detecting false-positive
+ * dependencies between a top-level transaction and its subtransactions.
+ * Subtransactions can be replicated for streamed transactions, and they
+ * are not marked as parallelized so that parallel workers will not wait
+ * for rolled-back subtransactions.
+ */
+ handle_dependency_on_change(action, s,
+ in_streamed_transaction
+ ? stream_xid : remote_xid,
+ winfo);
/*
* Re-fetch the latest apply action as it might have been changed during
@@ -2579,6 +2606,10 @@ apply_handle_stream_prepare(StringInfo s)
apply_spooled_messages(MyLogicalRepWorker->stream_fileset,
prepare_data.xid, prepare_data.prepare_lsn);
+ /* Wait until the last transaction finishes */
+ if (TransactionIdIsValid(last_remote_xid))
+ pa_wait_for_depended_transaction(last_remote_xid);
+
/* Mark the transaction as prepared. */
apply_handle_prepare_internal(&prepare_data);
@@ -2602,7 +2633,8 @@ apply_handle_stream_prepare(StringInfo s)
case TRANS_LEADER_SEND_TO_PARALLEL:
Assert(winfo);
- if (pa_send_data(winfo, s->len, s->data))
+ if (build_dependency_with_last_committed_txn(winfo) &&
+ pa_send_data(winfo, s->len, s->data))
{
/* Finish processing the streaming transaction. */
pa_xact_finish(winfo, prepare_data.end_lsn);
@@ -2668,6 +2700,11 @@ apply_handle_stream_prepare(StringInfo s)
pgstat_report_stat(false);
+ /*
+ * No need to update last_remote_xid here because the leader worker
+ * always waits until streamed transactions finish.
+ */
+
/*
* Process any tables that are being synchronized in parallel, as well as
* any newly added tables or sequences.
@@ -3452,6 +3489,10 @@ apply_handle_stream_commit(StringInfo s)
apply_spooled_messages(MyLogicalRepWorker->stream_fileset, xid,
commit_data.commit_lsn);
+ /* Wait until the last transaction finishes */
+ if (TransactionIdIsValid(last_remote_xid))
+ pa_wait_for_depended_transaction(last_remote_xid);
+
apply_handle_commit_internal(&commit_data);
/* Unlink the files with serialized changes and subxact info. */
@@ -3463,7 +3504,20 @@ apply_handle_stream_commit(StringInfo s)
case TRANS_LEADER_SEND_TO_PARALLEL:
Assert(winfo);
- if (pa_send_data(winfo, s->len, s->data))
+ /*
+ * Unlike the non-streaming case, there is no need to mark this
+ * transaction as parallelized: the leader waits until the streamed
+ * transaction is committed, so commit ordering is always
+ * preserved.
+ */
+
+ /*
+ * Build a dependency between this transaction and the last
+ * committed one to preserve the commit order, then try to send
+ * the commit message if that succeeded.
+ */
+ if (build_dependency_with_last_committed_txn(winfo) &&
+ pa_send_data(winfo, s->len, s->data))
{
/* Finish processing the streaming transaction. */
pa_xact_finish(winfo, commit_data.end_lsn);
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 5371ee767f1..69ecd51a359 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -354,7 +354,7 @@ extern void pa_switch_to_partial_serialize(ParallelApplyWorkerInfo *winfo,
extern void pa_set_xact_state(ParallelApplyWorkerShared *wshared,
ParallelTransState xact_state);
extern void pa_set_stream_apply_worker(ParallelApplyWorkerInfo *winfo);
-
+extern bool pa_stream_apply_worker_is_null(void);
extern void pa_start_subtrans(TransactionId current_xid,
TransactionId top_xid);
extern void pa_reset_subtrans(void);
diff --git a/src/test/subscription/t/050_parallel_apply.pl b/src/test/subscription/t/050_parallel_apply.pl
index 57bcfde513e..20e8a7b91a7 100644
--- a/src/test/subscription/t/050_parallel_apply.pl
+++ b/src/test/subscription/t/050_parallel_apply.pl
@@ -184,4 +184,51 @@ $node_subscriber->wait_for_log(qr/finish waiting for depended xid $xid/, $offset
$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'regress_prepare';");
$node_publisher->wait_for_catchup('regress_sub');
+# Ensure a streamed transaction waits for the previous transaction
+
+$node_publisher->append_conf('postgresql.conf',
+ "logical_decoding_work_mem = 64kB");
+$node_publisher->reload;
+# Run a query to make sure that the reload has taken effect.
+$node_publisher->safe_psql('postgres', "SELECT 1");
+
+# Attach the injection_point again
+$node_subscriber->safe_psql('postgres',
+ "SELECT injection_points_attach('parallel-worker-before-commit','wait');"
+);
+
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (generate_series(71, 80), 'test');");
+
+# Wait until the parallel worker enters the injection point.
+$node_subscriber->wait_for_event('logical replication parallel worker',
+ 'parallel-worker-before-commit');
+
+# Run a transaction that will be streamed
+my $h = $node_publisher->background_psql('postgres', on_error_stop => 0);
+
+$offset = -s $node_subscriber->logfile;
+
+$h->query_safe(
+ q{
+BEGIN;
+UPDATE regress_tab SET value = 'streamed-updated' WHERE id BETWEEN 71 AND 80;
+INSERT INTO regress_tab VALUES (generate_series(100, 5100), 'streamed');
+});
+
+# Verify the parallel worker waits for the transaction
+$str = $node_subscriber->wait_for_log(qr/wait for depended xid ([1-9][0-9]+)/, $offset);
+($xid) = $str =~ /wait for depended xid ([1-9][0-9]+)/;
+
+# Wakeup the parallel worker
+$node_subscriber->safe_psql('postgres', qq[
+ SELECT injection_points_detach('parallel-worker-before-commit');
+ SELECT injection_points_wakeup('parallel-worker-before-commit');
+]);
+
+# Verify the streamed transaction can be applied
+$node_subscriber->wait_for_log(qr/finish waiting for depended xid $xid/, $offset);
+
+$h->query_safe("COMMIT;");
+
done_testing();
--
2.47.3
Attachment: v6_2-0007-Wait-applying-transaction-if-one-of-user-define.patch
From 2aa2a27e3b8961c7aa4f1ca04da360e82c6cfe1b Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Tue, 23 Dec 2025 17:58:15 +0900
Subject: [PATCH v6_2 7/8] Wait before applying a transaction if any
 user-defined trigger is not immutable
Since many parallel workers apply transactions, triggers on relations can also
fire in parallel, which may produce unexpected results. To make this safe,
parallel apply workers wait for the previously dispatched transaction before
applying changes to a relation that has mutable triggers.
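
A sketch of the distinction (all names are hypothetical; the volatility is
checked on the trigger function): a relation whose enabled triggers all use
IMMUTABLE functions stays eligible for parallel apply, while a volatile one
makes changes to the relation wait:

-- VOLATILE by default, so tab becomes parallel-restricted
CREATE FUNCTION tab_trg_fn() RETURNS trigger LANGUAGE plpgsql AS
$$ BEGIN RETURN NEW; END $$;
CREATE TRIGGER tab_trg BEFORE INSERT ON tab
FOR EACH ROW EXECUTE FUNCTION tab_trg_fn();
ALTER TABLE tab ENABLE ALWAYS TRIGGER tab_trg;  -- fires during apply

-- declaring the function IMMUTABLE keeps tab parallel-safe
ALTER FUNCTION tab_trg_fn() IMMUTABLE;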
---
src/backend/replication/logical/relation.c | 123 ++++++++++++++++++---
src/backend/replication/logical/worker.c | 68 ++++++++++++
src/include/replication/logicalrelation.h | 20 ++++
3 files changed, 197 insertions(+), 14 deletions(-)
diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index 9991bfe76cc..14f3ebf725e 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -21,7 +21,9 @@
#include "access/genam.h"
#include "access/table.h"
#include "catalog/namespace.h"
+#include "catalog/pg_proc.h"
#include "catalog/pg_subscription_rel.h"
+#include "commands/trigger.h"
#include "executor/executor.h"
#include "nodes/makefuncs.h"
#include "replication/logicalrelation.h"
@@ -159,6 +161,10 @@ logicalrep_relmap_free_entry(LogicalRepRelMapEntry *entry)
*
* Called when new relation mapping is sent by the publisher to update
* our expected view of incoming data from said publisher.
+ *
+ * Note that we do not check user-defined constraints here. PostgreSQL
+ * already assumes that CHECK constraint conditions are immutable, and we
+ * follow that rule.
*/
void
logicalrep_relmap_update(LogicalRepRelation *remoterel)
@@ -208,6 +214,8 @@ logicalrep_relmap_update(LogicalRepRelation *remoterel)
(remoterel->relkind == 0) ? RELKIND_RELATION : remoterel->relkind;
entry->remoterel.attkeys = bms_copy(remoterel->attkeys);
+
+ entry->parallel_safe = LOGICALREP_PARALLEL_UNKNOWN;
MemoryContextSwitchTo(oldctx);
}
@@ -353,27 +361,79 @@ logicalrep_rel_mark_updatable(LogicalRepRelMapEntry *entry)
}
/*
- * Open the local relation associated with the remote one.
+ * Check all local triggers on the relation to determine its parallelizability.
*
- * Rebuilds the Relcache mapping if it was invalidated by local DDL.
+ * We regard a relation as safe to apply in parallel if all of its triggers
+ * are immutable. The result is stored directly in
+ * LogicalRepRelMapEntry::parallel_safe.
*/
-LogicalRepRelMapEntry *
-logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
+static void
+check_defined_triggers(LogicalRepRelMapEntry *entry)
+{
+ TriggerDesc *trigdesc = entry->localrel->trigdesc;
+
+ /* Quick exit if no triggers are defined */
+ if (trigdesc == NULL)
+ {
+ entry->parallel_safe = LOGICALREP_PARALLEL_SAFE;
+ return;
+ }
+
+ /* Examine each trigger to check its volatility */
+ for (int i = 0; i < trigdesc->numtriggers; i++)
+ {
+ Trigger *trigger = &trigdesc->triggers[i];
+
+ Assert(OidIsValid(trigger->tgfoid));
+
+ /* Skip if the trigger is not enabled for logical replication */
+ if (trigger->tgenabled == TRIGGER_DISABLED ||
+ trigger->tgenabled == TRIGGER_FIRES_ON_ORIGIN)
+ continue;
+
+ /* Check the volatility of the trigger. Exit if it is not immutable */
+ if (func_volatile(trigger->tgfoid) != PROVOLATILE_IMMUTABLE)
+ {
+ entry->parallel_safe = LOGICALREP_PARALLEL_RESTRICTED;
+ return;
+ }
+ }
+
+ /* All triggers are immutable, set as parallel safe */
+ entry->parallel_safe = LOGICALREP_PARALLEL_SAFE;
+}
+
+/*
+ * Actual workhorse for logicalrep_rel_open().
+ *
+ * Caller must specify *either* entry or remoteid. If the entry is specified,
+ * its attributes are filled in and the local relation is kept open. If
+ * remoteid is given, the corresponding entry is first looked up in the hash
+ * table and processed as in the above case, and the relation is closed at
+ * the end.
+ */
+void
+logicalrep_rel_load(LogicalRepRelMapEntry *entry, LogicalRepRelId remoteid,
+ LOCKMODE lockmode)
{
- LogicalRepRelMapEntry *entry;
- bool found;
LogicalRepRelation *remoterel;
- if (LogicalRepRelMap == NULL)
- logicalrep_relmap_init();
+ Assert((entry && !remoteid) || (!entry && remoteid));
- /* Search for existing entry. */
- entry = hash_search(LogicalRepRelMap, &remoteid,
- HASH_FIND, &found);
+ if (!entry)
+ {
+ bool found;
- if (!found)
- elog(ERROR, "no relation map entry for remote relation ID %u",
- remoteid);
+ if (LogicalRepRelMap == NULL)
+ logicalrep_relmap_init();
+
+ /* Search for existing entry. */
+ entry = hash_search(LogicalRepRelMap, &remoteid,
+ HASH_FIND, &found);
+
+ if (!found)
+ elog(ERROR, "no relation map entry for remote relation ID %u",
+ remoteid);
+ }
remoterel = &entry->remoterel;
@@ -499,6 +559,13 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
entry->localindexoid = FindLogicalRepLocalIndex(entry->localrel, remoterel,
entry->attrmap);
+ /*
+ * The leader must also check the defined triggers to determine the
+ * parallel safety of the relation.
+ */
+ if (am_leader_apply_worker())
+ check_defined_triggers(entry);
+
entry->localrelvalid = true;
}
@@ -507,6 +574,34 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
entry->localreloid,
&entry->statelsn);
+ if (remoteid)
+ logicalrep_rel_close(entry, lockmode);
+}
+
+/*
+ * Open the local relation associated with the remote one.
+ *
+ * Rebuilds the Relcache mapping if it was invalidated by local DDL.
+ */
+LogicalRepRelMapEntry *
+logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
+{
+ LogicalRepRelMapEntry *entry;
+ bool found;
+
+ if (LogicalRepRelMap == NULL)
+ logicalrep_relmap_init();
+
+ /* Search for existing entry. */
+ entry = hash_search(LogicalRepRelMap, &remoteid,
+ HASH_FIND, &found);
+
+ if (!found)
+ elog(ERROR, "no relation map entry for remote relation ID %u",
+ remoteid);
+
+ logicalrep_rel_load(entry, 0, lockmode);
+
return entry;
}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 3057e6a3aab..72383ab78b8 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1062,6 +1062,59 @@ check_dependency_on_rel(LogicalRepRelId relid, TransactionId new_depended_xid,
relentry->last_depended_xid = new_depended_xid;
}
+/*
+ * Check the parallelizability of applying changes to the relation.
+ * Append the last dispatched transaction to 'depends_on_xids' if the
+ * relation is not parallel safe.
+ */
+static void
+check_dependency_for_parallel_safety(LogicalRepRelId relid,
+ TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ LogicalRepRelMapEntry *relentry;
+
+ /* Quick exit if no transactions have been dispatched */
+ if (!TransactionIdIsValid(last_remote_xid))
+ return;
+
+ relentry = logicalrep_get_relentry(relid);
+
+ /*
+ * Gather information about local triggers if not done yet. We must be in
+ * a transaction because system catalogs are read.
+ */
+ if (relentry->parallel_safe == LOGICALREP_PARALLEL_UNKNOWN)
+ {
+ bool needs_start = !IsTransactionOrTransactionBlock();
+
+ if (needs_start)
+ StartTransactionCommand();
+
+ logicalrep_rel_load(NULL, relid, AccessShareLock);
+
+ /*
+ * Close the transaction if we started it here. We must not abort, as it
+ * would release all session-level locks, such as the stream lock, and
+ * break the deadlock detection mechanism between LA and PA. The
+ * outcome is the same regardless of the end status, since the
+ * transaction did not modify any tuples.
+ */
+ if (needs_start)
+ CommitTransactionCommand();
+
+ Assert(relentry->parallel_safe != LOGICALREP_PARALLEL_UNKNOWN);
+ }
+
+ /* Do nothing for parallel safe relations */
+ if (relentry->parallel_safe == LOGICALREP_PARALLEL_SAFE)
+ return;
+
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &last_remote_xid,
+ new_depended_xid);
+}
+
/*
* Check dependencies related to the current change by determining if the
* modification impacts the same row or table as another ongoing transaction. If
@@ -1120,6 +1173,8 @@ handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
check_dependency_on_replica_identity(relid, &newtup,
new_depended_xid,
&depends_on_xids);
+ check_dependency_for_parallel_safety(relid, new_depended_xid,
+ &depends_on_xids);
break;
case LOGICAL_REP_MSG_UPDATE:
@@ -1127,13 +1182,19 @@ handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
&newtup);
if (has_oldtup)
+ {
check_dependency_on_replica_identity(relid, &oldtup,
new_depended_xid,
&depends_on_xids);
+ check_dependency_for_parallel_safety(relid, new_depended_xid,
+ &depends_on_xids);
+ }
check_dependency_on_replica_identity(relid, &newtup,
new_depended_xid,
&depends_on_xids);
+ check_dependency_for_parallel_safety(relid, new_depended_xid,
+ &depends_on_xids);
break;
case LOGICAL_REP_MSG_DELETE:
@@ -1141,6 +1202,8 @@ handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
check_dependency_on_replica_identity(relid, &oldtup,
new_depended_xid,
&depends_on_xids);
+ check_dependency_for_parallel_safety(relid, new_depended_xid,
+ &depends_on_xids);
break;
case LOGICAL_REP_MSG_TRUNCATE:
@@ -1153,8 +1216,13 @@ handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
* modified the same table.
*/
foreach_int(truncated_relid, remote_relids)
+ {
check_dependency_on_rel(truncated_relid, new_depended_xid,
&depends_on_xids);
+ check_dependency_for_parallel_safety(truncated_relid,
+ new_depended_xid,
+ &depends_on_xids);
+ }
break;
diff --git a/src/include/replication/logicalrelation.h b/src/include/replication/logicalrelation.h
index 34a7069e9e5..e3d0df58620 100644
--- a/src/include/replication/logicalrelation.h
+++ b/src/include/replication/logicalrelation.h
@@ -39,6 +39,20 @@ typedef struct LogicalRepRelMapEntry
XLogRecPtr statelsn;
TransactionId last_depended_xid;
+
+ /*
+ * Whether changes to the relation can be applied in parallel. This
+ * records whether the defined triggers are immutable or not.
+ *
+ * Theoretically, we could determine the parallelizability for each type
+ * of replication message (INSERT/UPDATE/DELETE/TRUNCATE), but this is not
+ * done yet, to keep the number of attributes small.
+ *
+ * Note that we do not check user-defined constraints here. PostgreSQL
+ * already assumes that CHECK constraint conditions are immutable, and we
+ * follow that rule.
+ */
+ char parallel_safe;
} LogicalRepRelMapEntry;
extern void logicalrep_relmap_update(LogicalRepRelation *remoterel);
@@ -46,6 +60,8 @@ extern void logicalrep_partmap_reset_relmap(LogicalRepRelation *remoterel);
extern LogicalRepRelMapEntry *logicalrep_rel_open(LogicalRepRelId remoteid,
LOCKMODE lockmode);
+extern void logicalrep_rel_load(LogicalRepRelMapEntry *entry,
+ LogicalRepRelId remoteid, LOCKMODE lockmode);
extern LogicalRepRelMapEntry *logicalrep_partition_open(LogicalRepRelMapEntry *root,
Relation partrel, AttrMap *map);
extern void logicalrep_rel_close(LogicalRepRelMapEntry *rel,
@@ -56,4 +72,8 @@ extern int logicalrep_get_num_rels(void);
extern void logicalrep_write_all_rels(StringInfo out);
extern LogicalRepRelMapEntry *logicalrep_get_relentry(LogicalRepRelId remoteid);
+#define LOGICALREP_PARALLEL_SAFE 's'
+#define LOGICALREP_PARALLEL_RESTRICTED 'r'
+#define LOGICALREP_PARALLEL_UNKNOWN 'u'
+
#endif /* LOGICALRELATION_H */
--
2.47.3
Attachment: v6_2-0008-Support-dependency-tracking-via-local-unique-in.patch
From c19d8e3a6850426468d78eb50318d7b181e62b12 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <hayato@example.com>
Date: Thu, 11 Dec 2025 22:21:47 +0900
Subject: [PATCH v6_2 8/8] Support dependency tracking via local unique indexes
Currently, logical replication's parallel apply mechanism tracks dependencies
primarily based on the REPLICA IDENTITY defined on the publisher table.
However, local subscriber tables might have additional unique indexes that
could effectively serve as dependency keys, even if they don't correspond to
the publisher's REPLICA IDENTITY. Failing to track these additional unique
keys can lead to incorrect data and/or deadlocks during parallel application.
This patch extends the parallel apply's dependency tracking to consider
local unique indexes on the subscriber table. This is achieved by extending
the existing Replica Identity hash table to also store dependency information
based on these local unique indexes.
The LogicalRepRelMapEntry structure is extended to store details about these
local unique indexes. This information is collected and cached when
dependency checking is first performed for a remote transaction on a given
relation. This collection process requires being in a transaction to access
system catalog information.
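
A sketch of the scenario this addresses (hypothetical names): the publisher
replicates via the primary key, but the subscriber has an extra unique index
whose key must also be treated as a dependency key:

-- publisher
CREATE TABLE tab (a int PRIMARY KEY, b int);

-- subscriber: b is unique only here
CREATE TABLE tab (a int PRIMARY KEY, b int);
CREATE UNIQUE INDEX tab_b_idx ON tab (b);

Two publisher transactions that touch different primary keys but the same
value of b must not be applied out of order on the subscriber.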
---
src/backend/replication/logical/relation.c | 151 +++++++++-
src/backend/replication/logical/worker.c | 272 ++++++++++++++----
src/backend/storage/lmgr/deadlock.c | 1 -
src/include/replication/logicalrelation.h | 14 +
src/test/subscription/t/050_parallel_apply.pl | 43 +++
src/tools/pgindent/typedefs.list | 2 +
6 files changed, 424 insertions(+), 59 deletions(-)
diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index 14f3ebf725e..9d744f4c8cb 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -127,6 +127,21 @@ logicalrep_relmap_init(void)
(Datum) 0);
}
+/*
+ * Release local index list
+ */
+static void
+free_local_unique_indexes(LogicalRepRelMapEntry *entry)
+{
+ Assert(am_leader_apply_worker());
+
+ foreach_ptr(LogicalRepSubscriberIdx, idxinfo, entry->local_unique_indexes)
+ bms_free(idxinfo->indexkeys);
+
+ list_free(entry->local_unique_indexes);
+ entry->local_unique_indexes = NIL;
+}
+
/*
* Free the entry of a relation map cache.
*/
@@ -154,6 +169,9 @@ logicalrep_relmap_free_entry(LogicalRepRelMapEntry *entry)
if (entry->attrmap)
free_attrmap(entry->attrmap);
+
+ if (entry->local_unique_indexes != NIL)
+ free_local_unique_indexes(entry);
}
/*
@@ -360,6 +378,116 @@ logicalrep_rel_mark_updatable(LogicalRepRelMapEntry *entry)
}
}
+/*
+ * Collect all local unique indexes that can be used for dependency tracking.
+ */
+static void
+collect_local_indexes(LogicalRepRelMapEntry *entry)
+{
+ List *idxlist;
+
+ if (entry->local_unique_indexes != NIL)
+ free_local_unique_indexes(entry);
+
+ entry->local_unique_indexes_collected = true;
+
+ idxlist = RelationGetIndexList(entry->localrel);
+
+ /* Quick exit if there are no indexes */
+ if (idxlist == NIL)
+ return;
+
+ /* Iterate over the indexes to collect all usable ones */
+ foreach_oid(idxoid, idxlist)
+ {
+ Relation idxrel;
+ int indnkeys;
+ AttrMap *attrmap;
+ Bitmapset *indexkeys = NULL;
+ bool suitable = true;
+
+ idxrel = index_open(idxoid, AccessShareLock);
+
+ /*
+ * Check whether the index can be used for the dependency tracking.
+ *
+ * For simplicity, we use the same condition as REPLICA IDENTITY FULL,
+ * plus the index must be unique.
+ */
+ if (!(idxrel->rd_index->indisunique &&
+ IsIndexUsableForReplicaIdentityFull(idxrel, entry->attrmap)))
+ {
+ index_close(idxrel, AccessShareLock);
+ continue;
+ }
+
+ indnkeys = idxrel->rd_index->indnkeyatts;
+ attrmap = entry->attrmap;
+
+ Assert(indnkeys);
+
+ /* Examine each attribute and add it to the bitmap */
+ for (int i = 0; i < indnkeys; i++)
+ {
+ AttrNumber localcol = idxrel->rd_index->indkey.values[i];
+ AttrNumber remotecol;
+
+ /*
+ * XXX: Mark a relation as parallel-restricted if it has expression
+ * indexes, because we cannot compute the hash value for dependency
+ * tracking. For safety, transactions that modify such tables wait
+ * to be applied until the most recently dispatched transaction has
+ * committed.
+ */
+ if (!AttributeNumberIsValid(localcol))
+ {
+ entry->parallel_safe = LOGICALREP_PARALLEL_RESTRICTED;
+ break;
+ }
+
+ remotecol = attrmap->attnums[AttrNumberGetAttrOffset(localcol)];
+
+ /*
+ * Skip if the column does not exist on the publisher node. In this
+ * case the replicated tuples always have a NULL or default value.
+ */
+ if (remotecol < 0)
+ {
+ suitable = false;
+ break;
+ }
+
+ /* All checks passed; remember the attribute */
+ indexkeys = bms_add_member(indexkeys, remotecol);
+ }
+
+ index_close(idxrel, AccessShareLock);
+
+ /*
+ * At least one column does not exist on the publisher side; skip this index.
+ */
+ if (!suitable)
+ continue;
+
+ /* This index is usable; store it in memory */
+ if (indexkeys)
+ {
+ MemoryContext oldctx;
+ LogicalRepSubscriberIdx *idxinfo;
+
+ oldctx = MemoryContextSwitchTo(LogicalRepRelMapContext);
+ idxinfo = palloc(sizeof(LogicalRepSubscriberIdx));
+ idxinfo->indexoid = idxoid;
+ idxinfo->indexkeys = bms_copy(indexkeys);
+ entry->local_unique_indexes =
+ lappend(entry->local_unique_indexes, idxinfo);
+ MemoryContextSwitchTo(oldctx);
+ }
+ }
+
+ list_free(idxlist);
+}
+
/*
* Check all local triggers for the relation to see the parallelizability.
*
@@ -369,7 +497,16 @@ logicalrep_rel_mark_updatable(LogicalRepRelMapEntry *entry)
static void
check_defined_triggers(LogicalRepRelMapEntry *entry)
{
- TriggerDesc *trigdesc = entry->localrel->trigdesc;
+ TriggerDesc *trigdesc;
+
+ /*
+ * Skip if the parallelizability has already been determined. This is
+ * possible if the relation has expression indexes.
+ */
+ if (entry->parallel_safe != LOGICALREP_PARALLEL_UNKNOWN)
+ return;
+
+ trigdesc = entry->localrel->trigdesc;
/* Quick exit if no trigger is defined */
if (trigdesc == NULL)
@@ -410,7 +547,7 @@ check_defined_triggers(LogicalRepRelMapEntry *entry)
* If the key is given, the corresponding entry is first searched in the hash
* table and processed as in the above case. At the end, logical replication is
* closed.
- */
+ */
void
logicalrep_rel_load(LogicalRepRelMapEntry *entry, LogicalRepRelId remoteid,
LOCKMODE lockmode)
@@ -564,7 +701,11 @@ logicalrep_rel_load(LogicalRepRelMapEntry *entry, LogicalRepRelId remoteid,
* tracking.
*/
if (am_leader_apply_worker())
+ {
+ entry->parallel_safe = LOGICALREP_PARALLEL_UNKNOWN;
+ collect_local_indexes(entry);
check_defined_triggers(entry);
+ }
entry->localrelvalid = true;
}
@@ -866,6 +1007,12 @@ logicalrep_partition_open(LogicalRepRelMapEntry *root,
entry->localindexoid = FindLogicalRepLocalIndex(partrel, remoterel,
entry->attrmap);
+ /*
+ * TODO: Parallel apply does not support partitioned tables for now.
+ * Just mark the local indexes as collected.
+ */
+ entry->local_unique_indexes_collected = true;
+
entry->localrelvalid = true;
return entry;
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 72383ab78b8..dae9a98da13 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -548,9 +548,19 @@ typedef struct ApplySubXactData
static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};
+/*
+ * Type of key used for dependency tracking.
+ */
+typedef enum LogicalRepKeyKind
+{
+ LOGICALREP_KEY_REPLICA_IDENTITY,
+ LOGICALREP_KEY_LOCAL_UNIQUE
+} LogicalRepKeyKind;
+
typedef struct ReplicaIdentityKey
{
Oid relid;
+ LogicalRepKeyKind kind;
LogicalRepTupleData *data;
} ReplicaIdentityKey;
@@ -710,7 +720,8 @@ static bool
hash_replica_identity_compare(ReplicaIdentityKey *a, ReplicaIdentityKey *b)
{
if (a->relid != b->relid ||
- a->data->ncols != b->data->ncols)
+ a->data->ncols != b->data->ncols ||
+ a->kind != b->kind)
return false;
for (int i = 0; i < a->data->ncols; i++)
@@ -718,6 +729,9 @@ hash_replica_identity_compare(ReplicaIdentityKey *a, ReplicaIdentityKey *b)
if (a->data->colstatus[i] != b->data->colstatus[i])
return false;
+ if (a->data->colstatus[i] == LOGICALREP_COLUMN_NULL)
+ continue;
+
if (a->data->colvalues[i].len != b->data->colvalues[i].len)
return false;
@@ -839,6 +853,93 @@ check_and_append_xid_dependency(List *depends_on_xids,
return lappend_xid(depends_on_xids, *depends_on_xid);
}
+/*
+ * Common function for registering dependency on a key. Used by both
+ * check_dependency_on_replica_identity and check_dependency_on_local_key.
+ */
+static void
+register_dependency_with_key(ReplicaIdentityKey *key, TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ ReplicaIdentityEntry *rientry;
+ bool found = false;
+ MemoryContext oldctx;
+
+ oldctx = MemoryContextSwitchTo(ApplyContext);
+
+ if (TransactionIdIsValid(new_depended_xid))
+ {
+ rientry = replica_identity_insert(replica_identity_table, key,
+ &found);
+
+ /*
+ * Release the key built to search the entry, if the entry already
+ * exists. Otherwise, initialize the remote_xid.
+ */
+ if (found)
+ {
+ elog(DEBUG1,
+ key->kind == LOGICALREP_KEY_REPLICA_IDENTITY ?
+ "found conflicting replica identity change from %u" :
+ "found conflicting local unique change from %u",
+ rientry->remote_xid);
+
+ free_replica_identity_key(key);
+ }
+ else
+ rientry->remote_xid = InvalidTransactionId;
+ }
+ else
+ {
+ rientry = replica_identity_lookup(replica_identity_table, key);
+ free_replica_identity_key(key);
+ }
+
+ MemoryContextSwitchTo(oldctx);
+
+ /* Return if no entry found */
+ if (!rientry)
+ return;
+
+ Assert(!found || TransactionIdIsValid(rientry->remote_xid));
+
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &rientry->remote_xid,
+ new_depended_xid);
+
+ /*
+ * Remove the entry if it was registered for a streamed transaction. We
+ * do not have to register an entry for them; the leader worker always
+ * waits until the parallel worker finishes handling streamed transactions,
+ * thus there is no need to consider the possibility that upcoming parallel
+ * workers would go ahead.
+ */
+ if (TransactionIdIsValid(stream_xid) && !found)
+ {
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+ }
+
+ /*
+ * Update the entry with the new depended xid if it is valid. The new xid
+ * can be invalid if the transaction will be applied by the leader itself,
+ * which means all the changes will be committed before the next
+ * transaction is processed, so nothing needs to depend on it.
+ */
+ else if (TransactionIdIsValid(new_depended_xid))
+ rientry->remote_xid = new_depended_xid;
+
+ /*
+ * Remove the entry if the transaction has been committed and no new
+ * dependency needs to be added.
+ */
+ else if (!TransactionIdIsValid(rientry->remote_xid))
+ {
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+ }
+}
+
/*
* Check for dependencies on preceding transactions that modify the same key.
* Returns the dependent transactions in 'depends_on_xids' and records the
@@ -853,10 +954,8 @@ check_dependency_on_replica_identity(Oid relid,
LogicalRepRelMapEntry *relentry;
LogicalRepTupleData *ridata;
ReplicaIdentityKey *rikey;
- ReplicaIdentityEntry *rientry;
MemoryContext oldctx;
int n_ri;
- bool found = false;
Assert(depends_on_xids);
@@ -922,75 +1021,124 @@ check_dependency_on_replica_identity(Oid relid,
rikey = palloc0_object(ReplicaIdentityKey);
rikey->relid = relid;
+ rikey->kind = LOGICALREP_KEY_REPLICA_IDENTITY;
rikey->data = ridata;
- if (TransactionIdIsValid(new_depended_xid))
+ MemoryContextSwitchTo(oldctx);
+
+ register_dependency_with_key(rikey, new_depended_xid,
+ depends_on_xids);
+}
+
+/*
+ * Mostly the same as check_dependency_on_replica_identity(), but for
+ * local unique indexes.
+ */
+static void
+check_dependency_on_local_key(Oid relid,
+ LogicalRepTupleData *original_data,
+ TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ LogicalRepRelMapEntry *relentry;
+ LogicalRepTupleData *ridata;
+ ReplicaIdentityKey *rikey;
+ MemoryContext oldctx;
+
+ Assert(depends_on_xids);
+
+ /* Search for existing entry */
+ relentry = logicalrep_get_relentry(relid);
+
+ Assert(relentry);
+
+ /*
+ * Gather information for local indexes if not done yet. We must be in a
+ * transaction state because system catalogs are read.
+ */
+ if (!relentry->local_unique_indexes_collected)
{
- rientry = replica_identity_insert(replica_identity_table, rikey,
- &found);
+ bool needs_start = !IsTransactionOrTransactionBlock();
+
+ if (needs_start)
+ StartTransactionCommand();
+
+ logicalrep_rel_load(NULL, relid, AccessShareLock);
/*
- * Release the key built to search the entry, if the entry already
- * exists. Otherwise, initialize the remote_xid.
+ * Close the transaction if we started it here. We must not abort because it
+ * would release all session-level locks, such as the stream lock, and
+ * break the deadlock detection mechanism between LA and PA. The
+ * outcome is the same regardless of the end status, since the
+ * transaction did not modify any tuples.
*/
- if (found)
- {
- elog(DEBUG1, "found conflicting replica identity change from %u",
- rientry->remote_xid);
+ if (needs_start)
+ CommitTransactionCommand();
- free_replica_identity_key(rikey);
- }
- else
- rientry->remote_xid = InvalidTransactionId;
+ Assert(relentry->local_unique_indexes_collected);
}
- else
+
+ foreach_ptr(LogicalRepSubscriberIdx, idxinfo, relentry->local_unique_indexes)
{
- rientry = replica_identity_lookup(replica_identity_table, rikey);
- free_replica_identity_key(rikey);
- }
+ int columns = bms_num_members(idxinfo->indexkeys);
+ bool suitable = true;
- MemoryContextSwitchTo(oldctx);
+ Assert(columns);
- /* Return if no entry found */
- if (!rientry)
- return;
+ for (int i = 0; i < original_data->ncols; i++)
+ {
+ if (!bms_is_member(i, idxinfo->indexkeys))
+ continue;
- Assert(!found || TransactionIdIsValid(rientry->remote_xid));
+ /*
+ * Skip if the column is not changed.
+ *
+ * XXX: NULL is allowed.
+ */
+ if (original_data->colstatus[i] == LOGICALREP_COLUMN_UNCHANGED)
+ {
+ suitable = false;
+ break;
+ }
+ }
- *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
- &rientry->remote_xid,
- new_depended_xid);
+ if (!suitable)
+ continue;
- /*
- * Remove the entry if it is registered for the streamed transactions. We
- * do not have to register an entry for them; The leader worker always
- * waits until the parallel worker finishes handling streamed transactions,
- * thus no need to consider the possiblity that upcoming parallel workers
- * would go ahead.
- */
- if (TransactionIdIsValid(stream_xid) && !found)
- {
- free_replica_identity_key(rientry->keydata);
- replica_identity_delete_item(replica_identity_table, rientry);
- }
+ oldctx = MemoryContextSwitchTo(ApplyContext);
- /*
- * Update the new depended xid into the entry if valid, the new xid could
- * be invalid if the transaction will be applied by the leader itself
- * which means all the changes will be committed before processing next
- * transaction, so no need to be depended on.
- */
- else if (TransactionIdIsValid(new_depended_xid))
- rientry->remote_xid = new_depended_xid;
+ /* Allocate space for replica identity values */
+ ridata = palloc0_object(LogicalRepTupleData);
+ ridata->colvalues = palloc0_array(StringInfoData, columns);
+ ridata->colstatus = palloc0_array(char, columns);
+ ridata->ncols = columns;
- /*
- * Remove the entry if the transaction has been committed and no new
- * dependency needs to be added.
- */
- else if (!TransactionIdIsValid(rientry->remote_xid))
- {
- free_replica_identity_key(rientry->keydata);
- replica_identity_delete_item(replica_identity_table, rientry);
+ for (int i_original = 0, i_key = 0; i_original < original_data->ncols; i_original++)
+ {
+ if (!bms_is_member(i_original, idxinfo->indexkeys))
+ continue;
+
+ if (original_data->colstatus[i_original] != LOGICALREP_COLUMN_NULL)
+ {
+ StringInfo original_colvalue = &original_data->colvalues[i_original];
+
+ initStringInfoExt(&ridata->colvalues[i_key], original_colvalue->len + 1);
+ appendStringInfoString(&ridata->colvalues[i_key], original_colvalue->data);
+ }
+
+ ridata->colstatus[i_key] = original_data->colstatus[i_original];
+ i_key++;
+ }
+
+ rikey = palloc0_object(ReplicaIdentityKey);
+ rikey->relid = relid;
+ rikey->kind = LOGICALREP_KEY_LOCAL_UNIQUE;
+ rikey->data = ridata;
+
+ MemoryContextSwitchTo(oldctx);
+
+ register_dependency_with_key(rikey, new_depended_xid,
+ depends_on_xids);
}
}
@@ -1173,6 +1321,9 @@ handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
check_dependency_on_replica_identity(relid, &newtup,
new_depended_xid,
&depends_on_xids);
+ check_dependency_on_local_key(relid, &newtup,
+ new_depended_xid,
+ &depends_on_xids);
check_dependency_for_parallel_safety(relid, new_depended_xid,
&depends_on_xids);
break;
@@ -1186,6 +1337,9 @@ handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
check_dependency_on_replica_identity(relid, &oldtup,
new_depended_xid,
&depends_on_xids);
+ check_dependency_on_local_key(relid, &oldtup,
+ new_depended_xid,
+ &depends_on_xids);
check_dependency_for_parallel_safety(relid, new_depended_xid,
&depends_on_xids);
}
@@ -1193,6 +1347,9 @@ handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
check_dependency_on_replica_identity(relid, &newtup,
new_depended_xid,
&depends_on_xids);
+ check_dependency_on_local_key(relid, &newtup,
+ new_depended_xid,
+ &depends_on_xids);
check_dependency_for_parallel_safety(relid, new_depended_xid,
&depends_on_xids);
break;
@@ -1202,6 +1359,9 @@ handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
check_dependency_on_replica_identity(relid, &oldtup,
new_depended_xid,
&depends_on_xids);
+ check_dependency_on_local_key(relid, &oldtup,
+ new_depended_xid,
+ &depends_on_xids);
check_dependency_for_parallel_safety(relid, new_depended_xid,
&depends_on_xids);
break;
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index c4bfaaa67ac..ca7dee52b32 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -33,7 +33,6 @@
#include "storage/procnumber.h"
#include "utils/memutils.h"
-
/*
* One edge in the waits-for graph.
*
diff --git a/src/include/replication/logicalrelation.h b/src/include/replication/logicalrelation.h
index e3d0df58620..9ac97fc4b38 100644
--- a/src/include/replication/logicalrelation.h
+++ b/src/include/replication/logicalrelation.h
@@ -16,6 +16,12 @@
#include "catalog/index.h"
#include "replication/logicalproto.h"
+typedef struct LogicalRepSubscriberIdx
+{
+ Oid indexoid; /* OID of the local unique index */
+ Bitmapset *indexkeys; /* Bitmap of key columns *on remote* */
+} LogicalRepSubscriberIdx;
+
typedef struct LogicalRepRelMapEntry
{
LogicalRepRelation remoterel; /* key is remoterel.remoteid */
@@ -40,6 +46,10 @@ typedef struct LogicalRepRelMapEntry
TransactionId last_depended_xid;
+ /* Local unique indexes. Used for dependency tracking */
+ List *local_unique_indexes;
+ bool local_unique_indexes_collected;
+
/*
* Whether the relation can be applied in parallel or not. This
* distinguishes whether the defined triggers are immutable.
@@ -51,6 +61,10 @@ typedef struct LogicalRepRelMapEntry
* Note that we do not check user-defined constraints here. PostgreSQL
* already assumes that CHECK constraint conditions are immutable, and
* we follow that rule here.
+ *
+ * XXX: Additionally, this can be marked restricted if the relation has
+ * expression indexes, because we cannot compute the hash value for
+ * dependency tracking.
*/
char parallel_safe;
} LogicalRepRelMapEntry;
diff --git a/src/test/subscription/t/050_parallel_apply.pl b/src/test/subscription/t/050_parallel_apply.pl
index 20e8a7b91a7..e489a4bdc1e 100644
--- a/src/test/subscription/t/050_parallel_apply.pl
+++ b/src/test/subscription/t/050_parallel_apply.pl
@@ -231,4 +231,47 @@ $node_subscriber->wait_for_log(qr/finish waiting for depended xid $xid/, $offset
$h->query_safe("COMMIT;");
+# Ensure subscriber-local indexes are also used for the dependency tracking
+
+# Truncate the data for upcoming tests
+$node_publisher->safe_psql('postgres', "TRUNCATE TABLE regress_tab;");
+$node_publisher->wait_for_catchup('regress_sub');
+
+# Define a unique index on the subscriber
+$node_subscriber->safe_psql('postgres',
+ "CREATE INDEX ON regress_tab (value);");
+
+# Attach an injection point. Parallel workers will wait just before committing
+$node_subscriber->safe_psql('postgres',
+ "SELECT injection_points_attach('parallel-worker-before-commit','wait');"
+);
+
+# Insert a tuple on the publisher. The parallel worker will wait at the
+# injection point
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (1, 'would conflict');");
+
+# Wait until the parallel worker enters the injection point.
+$node_subscriber->wait_for_event('logical replication parallel worker',
+ 'parallel-worker-before-commit');
+
+$offset = -s $node_subscriber->logfile;
+
+# Insert a tuple on the publisher again. This transaction would wait because
+# all parallel workers wait until the previously launched worker commits.
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (2, 'would not conflict');");
+
+# Verify the parallel worker waits for the transaction
+$node_subscriber->wait_for_log(qr/wait for depended xid [1-9][0-9]+/, $offset);
+($xid) = slurp_file($node_subscriber->logfile, $offset) =~ /wait for depended xid ([1-9][0-9]+)/;
+
+# Insert a conflicting tuple on the publisher. The leader worker will detect
+# the conflict and instruct the worker to wait for the transaction to commit.
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (3, 'would conflict');");
+
+# Verify the parallel worker waits for the same transaction
+$node_subscriber->wait_for_log(qr/wait for depended xid $xid/, $offset);
+
done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 1517828a2d7..998749eaaf0 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1637,6 +1637,7 @@ LogicalRepBeginData
LogicalRepCommitData
LogicalRepCommitPreparedTxnData
LogicalRepCtxStruct
+LogicalRepKeyKind
LogicalRepMsgType
LogicalRepPartMapEntry
LogicalRepPreparedTxnData
@@ -1646,6 +1647,7 @@ LogicalRepRelation
LogicalRepRollbackPreparedTxnData
LogicalRepSequenceInfo
LogicalRepStreamAbortData
+LogicalRepSubscriberIdx
LogicalRepTupleData
LogicalRepTyp
LogicalRepWorker
--
2.47.3
Happy new year hackers,
I found that the CFbot sometimes failed tests. Per my analysis, there were two
issues in the 0005 patch. The following describes the two corresponding changes.
1)
Took care of the case where an empty prepared transaction is replicated.
The leader worker would gather even such transactions in get_flush_position()
and try to clean up the replica identity hash. If the empty transaction is the
first one replicated after the worker is launched, however, the replica
identity hash is not yet initialized, which caused a segmentation fault. To
address the issue, a guard was added to the cleanup function (see the sketch
after 2) below).
As far as I know, an empty prepared transaction can happen if
a) the prepared transaction has already been rolled back while decoding, or
b) all of its changes are skipped.
The added test sometimes hits case a) due to a timing issue.
2)
Fixed a timing issue in 050_parallel_apply.pl. The test sets the two_phase
option to true, but it sometimes fails if the apply workers have not yet exited
after the subscription is disabled. Now the test waits until there are no apply
workers left.
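For reference, the guard in 1) amounts to roughly the following (a minimal
sketch only; the function and variable names are taken from the patch set,
and the actual hunk may differ):
/*
 * The replica identity hash is created lazily by the leader, so bail out
 * before touching it when nothing has been dispatched yet, e.g. when the
 * first transaction replicated after the worker started is an empty
 * prepared transaction.
 */
static void
cleanup_committed_replica_identity_entries(void)
{
	if (replica_identity_table == NULL)
		return;
	/* ... walk lsn_mapping and drop entries of committed transactions ... */
}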
Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachments:
v7-0001-Introduce-new-type-of-logical-replication-message.patch
From 5ee009136a31e7eb5d35c8c42d05b7c2f8f3f5c5 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Mon, 1 Dec 2025 10:37:27 +0900
Subject: [PATCH v7 1/8] Introduce new type of logical replication messages to
track dependencies
This patch introduces two logical replication messages,
LOGICAL_REP_MSG_INTERNAL_DEPENDENCY and LOGICAL_REP_MSG_INTERNAL_RELATION.
Unlike other messages, they are not sent by walsenders; the leader worker
sends them to parallel workers as needed.
LOGICAL_REP_MSG_INTERNAL_DEPENDENCY ensures that dependent transactions are
committed in the correct order. It carries a list of transaction IDs that the
parallel worker must wait for. This message type is generated when the leader
detects a dependency between the current transaction and others, or just
before the COMMIT message. The latter case is used to preserve the commit
ordering between the publisher and the subscriber.
LOGICAL_REP_MSG_INTERNAL_RELATION is used to synchronize relation information
between the leader and parallel workers. It carries a list of relations that
the leader already knows about, and parallel workers update their relmap in
response to the message. This type of message is generated when the leader
allocates a new parallel worker to a transaction, or when the publisher sends
additional RELATION messages.
Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
---
.../replication/logical/applyparallelworker.c | 16 ++++++
src/backend/replication/logical/proto.c | 4 ++
src/backend/replication/logical/worker.c | 49 +++++++++++++++++++
src/include/replication/logicalproto.h | 2 +
src/include/replication/worker_internal.h | 4 ++
5 files changed, 75 insertions(+)
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index a4aafcf5b6e..055feea0bc5 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -1645,3 +1645,19 @@ pa_xact_finish(ParallelApplyWorkerInfo *winfo, XLogRecPtr remote_lsn)
pa_free_worker(winfo);
}
+
+/*
+ * Wait for the given transaction to finish.
+ */
+void
+pa_wait_for_depended_transaction(TransactionId xid)
+{
+ elog(DEBUG1, "wait for depended xid %u", xid);
+
+ for (;;)
+ {
+ /* XXX wait until given transaction is finished */
+ }
+
+ elog(DEBUG1, "finish waiting for depended xid %u", xid);
+}
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 27ad74fd759..ded46c49a83 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -1253,6 +1253,10 @@ logicalrep_message_type(LogicalRepMsgType action)
return "STREAM ABORT";
case LOGICAL_REP_MSG_STREAM_PREPARE:
return "STREAM PREPARE";
+ case LOGICAL_REP_MSG_INTERNAL_DEPENDENCY:
+ return "INTERNAL DEPENDENCY";
+ case LOGICAL_REP_MSG_INTERNAL_RELATION:
+ return "INTERNAL RELATION";
}
/*
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 718408bb599..73d38644c4a 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -629,6 +629,47 @@ static TransApplyAction get_transaction_apply_action(TransactionId xid,
static void replorigin_reset(int code, Datum arg);
+/*
+ * Handle internal dependency information.
+ *
+ * Wait for all transactions listed in the message to commit.
+ */
+static void
+apply_handle_internal_dependency(StringInfo s)
+{
+ int nxids = pq_getmsgint(s, 4);
+
+ for (int i = 0; i < nxids; i++)
+ {
+ TransactionId xid = pq_getmsgint(s, 4);
+
+ pa_wait_for_depended_transaction(xid);
+ }
+}
+
+/*
+ * Handle internal relation information.
+ *
+ * Update all relation details in the relation map cache.
+ */
+static void
+apply_handle_internal_relation(StringInfo s)
+{
+ int num_rels;
+
+ num_rels = pq_getmsgint(s, 4);
+
+ for (int i = 0; i < num_rels; i++)
+ {
+ LogicalRepRelation *rel = logicalrep_read_rel(s);
+
+ logicalrep_relmap_update(rel);
+
+ elog(DEBUG1, "parallel apply worker worker init relmap for %s",
+ rel->relname);
+ }
+}
+
/*
* Form the origin name for the subscription.
*
@@ -3868,6 +3909,14 @@ apply_dispatch(StringInfo s)
apply_handle_stream_prepare(s);
break;
+ case LOGICAL_REP_MSG_INTERNAL_RELATION:
+ apply_handle_internal_relation(s);
+ break;
+
+ case LOGICAL_REP_MSG_INTERNAL_DEPENDENCY:
+ apply_handle_internal_dependency(s);
+ break;
+
default:
ereport(ERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index b261c60d3fa..5d91e2a4287 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -75,6 +75,8 @@ typedef enum LogicalRepMsgType
LOGICAL_REP_MSG_STREAM_COMMIT = 'c',
LOGICAL_REP_MSG_STREAM_ABORT = 'A',
LOGICAL_REP_MSG_STREAM_PREPARE = 'p',
+ LOGICAL_REP_MSG_INTERNAL_DEPENDENCY = 'd',
+ LOGICAL_REP_MSG_INTERNAL_RELATION = 'i',
} LogicalRepMsgType;
/*
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index f081619f151..a3526eae578 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -359,6 +359,8 @@ extern void pa_decr_and_wait_stream_block(void);
extern void pa_xact_finish(ParallelApplyWorkerInfo *winfo,
XLogRecPtr remote_lsn);
+extern void pa_wait_for_depended_transaction(TransactionId xid);
+
#define isParallelApplyWorker(worker) ((worker)->in_use && \
(worker)->type == WORKERTYPE_PARALLEL_APPLY)
#define isTableSyncWorker(worker) ((worker)->in_use && \
@@ -366,6 +368,8 @@ extern void pa_xact_finish(ParallelApplyWorkerInfo *winfo,
#define isSequenceSyncWorker(worker) ((worker)->in_use && \
(worker)->type == WORKERTYPE_SEQUENCESYNC)
+#define PARALLEL_APPLY_INTERNAL_MESSAGE 'i'
+
static inline bool
am_tablesync_worker(void)
{
--
2.47.3
v7-0002-Introduce-a-shared-hash-table-to-store-paralleliz.patch
From 7bf35810bfcaa6b398919ea3064c6d5b9b6596f7 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Mon, 1 Dec 2025 16:28:38 +0900
Subject: [PATCH v7 2/8] Introduce a shared hash table to store parallelized
transactions
This hash table is used to ensure that parallel workers wait until dependent
transactions are committed.
The shared hash table contains transaction IDs that the leader has allocated to
parallel workers. Entries are inserted with the remote XID when the leader
dispatches remote transactions to parallel apply workers. Entries are deleted
when parallel workers commit the corresponding transactions.
When a parallel worker has to wait for other transactions, it checks the hash
table for their remote XIDs. The worker can proceed only once the entries have
been removed from the hash.
Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
---
.../replication/logical/applyparallelworker.c | 100 +++++++++++++++++-
.../utils/activity/wait_event_names.txt | 1 +
src/include/replication/worker_internal.h | 4 +
src/include/storage/lwlocklist.h | 1 +
4 files changed, 105 insertions(+), 1 deletion(-)
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index 055feea0bc5..6ca5f778a3b 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -218,12 +218,35 @@ typedef struct ParallelApplyWorkerEntry
ParallelApplyWorkerInfo *winfo;
} ParallelApplyWorkerEntry;
+/* an entry in the parallelized_txns shared hash table */
+typedef struct ParallelizedTxnEntry
+{
+ TransactionId xid; /* Hash key */
+} ParallelizedTxnEntry;
+
/*
* A hash table used to cache the state of streaming transactions being applied
* by the parallel apply workers.
*/
static HTAB *ParallelApplyTxnHash = NULL;
+/*
+ * A hash table used to track the parallelized transactions that could be
+ * depended on by other transactions.
+ */
+static dsa_area *parallel_apply_dsa_area = NULL;
+static dshash_table *parallelized_txns = NULL;
+
+/* parameters for the parallelized_txns shared hash table */
+static const dshash_parameters dsh_params = {
+ sizeof(TransactionId),
+ sizeof(ParallelizedTxnEntry),
+ dshash_memcmp,
+ dshash_memhash,
+ dshash_memcpy,
+ LWTRANCHE_PARALLEL_APPLY_DSA
+};
+
/*
* A list (pool) of active parallel apply workers. The information for
* the new worker is added to the list after successfully launching it. The
@@ -257,6 +280,8 @@ static List *subxactlist = NIL;
static void pa_free_worker_info(ParallelApplyWorkerInfo *winfo);
static ParallelTransState pa_get_xact_state(ParallelApplyWorkerShared *wshared);
static PartialFileSetState pa_get_fileset_state(void);
+static void pa_attach_parallelized_txn_hash(dsa_handle *pa_dsa_handle,
+ dshash_table_handle *pa_dshash_handle);
/*
* Returns true if it is OK to start a parallel apply worker, false otherwise.
@@ -334,6 +359,15 @@ pa_setup_dsm(ParallelApplyWorkerInfo *winfo)
shm_mq *mq;
Size queue_size = DSM_QUEUE_SIZE;
Size error_queue_size = DSM_ERROR_QUEUE_SIZE;
+ dsa_handle parallel_apply_dsa_handle;
+ dshash_table_handle parallelized_txns_handle;
+
+ pa_attach_parallelized_txn_hash(¶llel_apply_dsa_handle,
+ ¶llelized_txns_handle);
+
+ if (parallel_apply_dsa_handle == DSA_HANDLE_INVALID ||
+ parallelized_txns_handle == DSHASH_HANDLE_INVALID)
+ return false;
/*
* Estimate how much shared memory we need.
@@ -369,6 +403,8 @@ pa_setup_dsm(ParallelApplyWorkerInfo *winfo)
pg_atomic_init_u32(&(shared->pending_stream_count), 0);
shared->last_commit_end = InvalidXLogRecPtr;
shared->fileset_state = FS_EMPTY;
+ shared->parallel_apply_dsa_handle = parallel_apply_dsa_handle;
+ shared->parallelized_txns_handle = parallelized_txns_handle;
shm_toc_insert(toc, PARALLEL_APPLY_KEY_SHARED, shared);
@@ -864,6 +900,8 @@ ParallelApplyWorkerMain(Datum main_arg)
shm_mq *mq;
shm_mq_handle *mqh;
shm_mq_handle *error_mqh;
+ dsa_handle pa_dsa_handle;
+ dshash_table_handle pa_dshash_handle;
RepOriginId originid;
int worker_slot = DatumGetInt32(main_arg);
char originname[NAMEDATALEN];
@@ -951,6 +989,8 @@ ParallelApplyWorkerMain(Datum main_arg)
InitializingApplyWorker = false;
+ pa_attach_parallelized_txn_hash(&pa_dsa_handle, &pa_dshash_handle);
+
/* Setup replication origin tracking. */
StartTransactionCommand();
ReplicationOriginNameForLogicalRep(MySubscription->oid, InvalidOid,
@@ -1646,6 +1686,51 @@ pa_xact_finish(ParallelApplyWorkerInfo *winfo, XLogRecPtr remote_lsn)
pa_free_worker(winfo);
}
+/*
+ * Attach to the shared hash table for parallelized transactions.
+ */
+static void
+pa_attach_parallelized_txn_hash(dsa_handle *pa_dsa_handle,
+ dshash_table_handle *pa_dshash_handle)
+{
+ MemoryContext oldctx;
+
+ if (parallelized_txns)
+ {
+ Assert(parallel_apply_dsa_area);
+ *pa_dsa_handle = dsa_get_handle(parallel_apply_dsa_area);
+ *pa_dshash_handle = dshash_get_hash_table_handle(parallelized_txns);
+ return;
+ }
+
+ /* Be sure any local memory allocated by DSA routines is persistent. */
+ oldctx = MemoryContextSwitchTo(ApplyContext);
+
+ if (am_leader_apply_worker())
+ {
+ /* Initialize the dynamic shared hash table for parallelized transactions. */
+ parallel_apply_dsa_area = dsa_create(LWTRANCHE_PARALLEL_APPLY_DSA);
+ dsa_pin(parallel_apply_dsa_area);
+ dsa_pin_mapping(parallel_apply_dsa_area);
+ parallelized_txns = dshash_create(parallel_apply_dsa_area, &dsh_params, NULL);
+
+ /* Store handles in shared memory for other backends to use. */
+ *pa_dsa_handle = dsa_get_handle(parallel_apply_dsa_area);
+ *pa_dshash_handle = dshash_get_hash_table_handle(parallelized_txns);
+ }
+ else if (am_parallel_apply_worker())
+ {
+ /* Attach to existing dynamic shared hash table. */
+ parallel_apply_dsa_area = dsa_attach(MyParallelShared->parallel_apply_dsa_handle);
+ dsa_pin_mapping(parallel_apply_dsa_area);
+ parallelized_txns = dshash_attach(parallel_apply_dsa_area, &dsh_params,
+ MyParallelShared->parallelized_txns_handle,
+ NULL);
+ }
+
+ MemoryContextSwitchTo(oldctx);
+}
+
/*
* Wait for the given transaction to finish.
*/
@@ -1656,7 +1741,20 @@ pa_wait_for_depended_transaction(TransactionId xid)
for (;;)
{
- /* XXX wait until given transaction is finished */
+ ParallelizedTxnEntry *txn_entry;
+
+ txn_entry = dshash_find(parallelized_txns, &xid, false);
+
+ /* The entry is removed only if the transaction is committed */
+ if (txn_entry == NULL)
+ break;
+
+ dshash_release_lock(parallelized_txns, txn_entry);
+
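+ /*
+ * Briefly acquire the per-transaction lock in share mode: the worker
+ * applying 'xid' holds it exclusively until that transaction finishes,
+ * so this blocks until then, after which the hash entry is re-checked.
+ */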
+ pa_lock_transaction(xid, AccessShareLock);
+ pa_unlock_transaction(xid, AccessShareLock);
+
+ CHECK_FOR_INTERRUPTS();
}
elog(DEBUG1, "finish waiting for depended xid %u", xid);
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index dcfadbd5aae..53b87a2df10 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -406,6 +406,7 @@ SubtransSLRU "Waiting to access the sub-transaction SLRU cache."
XactSLRU "Waiting to access the transaction status SLRU cache."
ParallelVacuumDSA "Waiting for parallel vacuum dynamic shared memory allocation."
AioUringCompletion "Waiting for another process to complete IO via io_uring."
+ParallelApplyDSA "Waiting for parallel apply dynamic shared memory allocation."
# No "ABI_compatibility" region here as WaitEventLWLock has its own C code.
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index a3526eae578..ddcdcc05053 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -15,6 +15,7 @@
#include "access/xlogdefs.h"
#include "catalog/pg_subscription.h"
#include "datatype/timestamp.h"
+#include "lib/dshash.h"
#include "miscadmin.h"
#include "replication/logicalrelation.h"
#include "replication/walreceiver.h"
@@ -197,6 +198,9 @@ typedef struct ParallelApplyWorkerShared
*/
PartialFileSetState fileset_state;
FileSet fileset;
+
+ dsa_handle parallel_apply_dsa_handle;
+ dshash_table_handle parallelized_txns_handle;
} ParallelApplyWorkerShared;
/*
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index 533344509e9..e16295e5a3b 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -137,3 +137,4 @@ PG_LWLOCKTRANCHE(SUBTRANS_SLRU, SubtransSLRU)
PG_LWLOCKTRANCHE(XACT_SLRU, XactSLRU)
PG_LWLOCKTRANCHE(PARALLEL_VACUUM_DSA, ParallelVacuumDSA)
PG_LWLOCKTRANCHE(AIO_URING_COMPLETION, AioUringCompletion)
+PG_LWLOCKTRANCHE(PARALLEL_APPLY_DSA, ParallelApplyDSA)
--
2.47.3
v7-0003-Introduce-a-local-hash-table-to-store-replica-ide.patch
From 56c965288a81f8c3bae36d651f7ca6467772b930 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Mon, 1 Dec 2025 16:39:02 +0900
Subject: [PATCH v7 3/8] Introduce a local hash table to store replica
identities
This local hash table on the leader is used for detecting dependencies between
transactions.
The hash contains the Replica Identity (RI) as a key and the remote XID that
modified the corresponding tuple. The hash entries are inserted when the leader
finds an RI in a replication message. Entries are deleted when the transactions
committed by parallel workers are gathered, or when the number of entries
exceeds a limit.
When the leader sends replication changes to parallel workers, it checks whether
other transactions have already used the RI associated with the change. If a
match is found, the leader treats the current transaction as dependent and
notifies the parallel worker to wait, via LOGICAL_REP_MSG_INTERNAL_DEPENDENCY,
until the matched transaction finishes.
Author: Hou Zhijie <houzj.fnst@fujitsu.com>
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
---
.../replication/logical/applyparallelworker.c | 123 +++-
src/backend/replication/logical/relation.c | 24 +
src/backend/replication/logical/worker.c | 616 +++++++++++++++++-
src/include/replication/logicalrelation.h | 3 +
src/include/replication/worker_internal.h | 8 +-
5 files changed, 771 insertions(+), 3 deletions(-)
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index 6ca5f778a3b..cf08206d9fd 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -216,6 +216,7 @@ typedef struct ParallelApplyWorkerEntry
{
TransactionId xid; /* Hash key -- must be first */
ParallelApplyWorkerInfo *winfo;
+ XLogRecPtr local_end;
} ParallelApplyWorkerEntry;
/* an entry in the parallelized_txns shared hash table */
@@ -504,7 +505,7 @@ pa_launch_parallel_worker(void)
* streaming changes.
*/
void
-pa_allocate_worker(TransactionId xid)
+pa_allocate_worker(TransactionId xid, bool stream_txn)
{
bool found;
ParallelApplyWorkerInfo *winfo = NULL;
@@ -545,7 +546,9 @@ pa_allocate_worker(TransactionId xid)
winfo->in_use = true;
winfo->serialize_changes = false;
+ winfo->stream_txn = stream_txn;
entry->winfo = winfo;
+ entry->local_end = InvalidXLogRecPtr;
}
/*
@@ -742,6 +745,73 @@ pa_process_spooled_messages_if_required(void)
return true;
}
+/*
+ * Get the local end LSN for a transaction applied by a parallel apply worker.
+ *
+ * Set delete_entry to true if you intend to remove the transaction from the
+ * ParallelApplyTxnHash after collecting its LSN.
+ *
+ * If the parallel apply worker did not write any changes during the transaction
+ * application due to situations like update/delete_missing or a before trigger,
+ * the *skipped_write will be set to true.
+ */
+XLogRecPtr
+pa_get_last_commit_end(TransactionId xid, bool delete_entry, bool *skipped_write)
+{
+ bool found;
+ ParallelApplyWorkerEntry *entry;
+ ParallelApplyWorkerInfo *winfo;
+
+ Assert(TransactionIdIsValid(xid));
+
+ if (skipped_write)
+ *skipped_write = false;
+
+ /* Find an entry for the requested transaction. */
+ entry = hash_search(ParallelApplyTxnHash, &xid, HASH_FIND, &found);
+
+ if (!found)
+ return InvalidXLogRecPtr;
+
+ /*
+ * If worker info is NULL, it indicates that the worker has been reused
+ * for handling other transactions. Consequently, the local end LSN has
+ * already been collected and saved in entry->local_end.
+ */
+ winfo = entry->winfo;
+ if (winfo == NULL)
+ {
+ if (delete_entry &&
+ !hash_search(ParallelApplyTxnHash, &xid, HASH_REMOVE, NULL))
+ elog(ERROR, "hash table corrupted");
+
+ if (skipped_write)
+ *skipped_write = XLogRecPtrIsInvalid(entry->local_end);
+
+ return entry->local_end;
+ }
+
+ /* Return InvalidXLogRecPtr if the transaction is still in progress */
+ if (pa_get_xact_state(winfo->shared) != PARALLEL_TRANS_FINISHED)
+ return InvalidXLogRecPtr;
+
+ /* Collect the local end LSN from the worker's shared memory area */
+ entry->local_end = winfo->shared->last_commit_end;
+ entry->winfo = NULL;
+
+ if (skipped_write)
+ *skipped_write = XLogRecPtrIsInvalid(entry->local_end);
+
+ elog(DEBUG1, "store local commit %X/%X end to txn entry: %u",
+ LSN_FORMAT_ARGS(entry->local_end), xid);
+
+ if (delete_entry &&
+ !hash_search(ParallelApplyTxnHash, &xid, HASH_REMOVE, NULL))
+ elog(ERROR, "hash table corrupted");
+
+ return entry->local_end;
+}
+
/*
* Interrupt handler for main loop of parallel apply worker.
*/
@@ -1686,6 +1756,26 @@ pa_xact_finish(ParallelApplyWorkerInfo *winfo, XLogRecPtr remote_lsn)
pa_free_worker(winfo);
}
+bool
+pa_transaction_committed(TransactionId xid)
+{
+ bool found;
+ ParallelApplyWorkerEntry *entry;
+
+ Assert(TransactionIdIsValid(xid));
+
+ /* Find an entry for the requested transaction */
+ entry = hash_search(ParallelApplyTxnHash, &xid, HASH_FIND, &found);
+
+ if (!found)
+ return true;
+
+ if (!entry->winfo)
+ return true;
+
+ return pa_get_xact_state(entry->winfo->shared) == PARALLEL_TRANS_FINISHED;
+}
+
/*
* Attach to the shared hash table for parallelized transactions.
*/
@@ -1731,6 +1821,37 @@ pa_attach_parallelized_txn_hash(dsa_handle *pa_dsa_handle,
MemoryContextSwitchTo(oldctx);
}
+/*
+ * Record, into the shared hash table, the in-progress transactions from
+ * the given list that are being depended on.
+ */
+void
+pa_record_dependency_on_transactions(List *depends_on_xids)
+{
+ foreach_xid(xid, depends_on_xids)
+ {
+ bool found;
+ ParallelApplyWorkerEntry *winfo_entry;
+ ParallelApplyWorkerInfo *winfo;
+ ParallelizedTxnEntry *txn_entry;
+
+ winfo_entry = hash_search(ParallelApplyTxnHash, &xid, HASH_FIND, &found);
+ winfo = winfo_entry->winfo;
+
+ txn_entry = dshash_find_or_insert(parallelized_txns, &xid, &found);
+
+ /*
+ * If the transaction has already been committed, remove the entry;
+ * otherwise the parallel apply worker will remove the entry once it
+ * commits the transaction.
+ */
+ if (pa_get_xact_state(winfo->shared) == PARALLEL_TRANS_FINISHED)
+ dshash_delete_entry(parallelized_txns, txn_entry);
+ else
+ dshash_release_lock(parallelized_txns, txn_entry);
+ }
+}
+
/*
* Wait for the given transaction to finish.
*/
diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index 2c8485b881f..13f8cb74e9f 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -959,3 +959,27 @@ FindLogicalRepLocalIndex(Relation localrel, LogicalRepRelation *remoterel,
return InvalidOid;
}
+
+/*
+ * Get the LogicalRepRelMapEntry corresponding to the given relid without
+ * opening the local relation.
+ */
+LogicalRepRelMapEntry *
+logicalrep_get_relentry(LogicalRepRelId remoteid)
+{
+ LogicalRepRelMapEntry *entry;
+ bool found;
+
+ if (LogicalRepRelMap == NULL)
+ logicalrep_relmap_init();
+
+ /* Search for existing entry. */
+ entry = hash_search(LogicalRepRelMap, (void *) &remoteid,
+ HASH_FIND, &found);
+
+ if (!found)
+ elog(DEBUG1, "no relation map entry for remote relation ID %u",
+ remoteid);
+
+ return entry;
+}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 73d38644c4a..0b1eeefe9c9 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -303,6 +303,7 @@ typedef struct FlushPosition
dlist_node node;
XLogRecPtr local_end;
XLogRecPtr remote_end;
+ TransactionId pa_remote_xid;
} FlushPosition;
static dlist_head lsn_mapping = DLIST_STATIC_INIT(lsn_mapping);
@@ -544,6 +545,49 @@ typedef struct ApplySubXactData
static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};
+typedef struct ReplicaIdentityKey
+{
+ Oid relid;
+ LogicalRepTupleData *data;
+} ReplicaIdentityKey;
+
+typedef struct ReplicaIdentityEntry
+{
+ ReplicaIdentityKey *keydata;
+ TransactionId remote_xid;
+
+ /* needed for simplehash */
+ uint32 hash;
+ char status;
+} ReplicaIdentityEntry;
+
+#include "common/hashfn.h"
+
+static uint32 hash_replica_identity(ReplicaIdentityKey *key);
+static bool hash_replica_identity_compare(ReplicaIdentityKey *a,
+ ReplicaIdentityKey *b);
+
+/* Define parameters for replica identity hash table code generation. */
+#define SH_PREFIX replica_identity
+#define SH_ELEMENT_TYPE ReplicaIdentityEntry
+#define SH_KEY_TYPE ReplicaIdentityKey *
+#define SH_KEY keydata
+#define SH_HASH_KEY(tb, key) hash_replica_identity(key)
+#define SH_EQUAL(tb, a, b) hash_replica_identity_compare(a, b)
+#define SH_STORE_HASH
+#define SH_GET_HASH(tb, a) (a)->hash
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+#define REPLICA_IDENTITY_INITIAL_SIZE 128
+#define REPLICA_IDENTITY_CLEANUP_THRESHOLD 1024
+
+static replica_identity_hash *replica_identity_table = NULL;
+
+static void write_internal_dependencies(StringInfo s, List *depends_on_xids);
+
static inline void subxact_filename(char *path, Oid subid, TransactionId xid);
static inline void changes_filename(char *path, Oid subid, TransactionId xid);
@@ -629,6 +673,546 @@ static TransApplyAction get_transaction_apply_action(TransactionId xid,
static void replorigin_reset(int code, Datum arg);
+static bool send_internal_dependencies(ParallelApplyWorkerInfo *winfo,
+ StringInfo s);
+
+/*
+ * Compute the hash value for entries in the replica_identity_table.
+ */
+static uint32
+hash_replica_identity(ReplicaIdentityKey *key)
+{
+ int i;
+ uint32 hashkey = 0;
+
+ hashkey = hash_combine(hashkey, hash_uint32(key->relid));
+
+ for (i = 0; i < key->data->ncols; i++)
+ {
+ uint32 hkey;
+
+ if (key->data->colstatus[i] == LOGICALREP_COLUMN_NULL)
+ continue;
+
+ hkey = hash_any((const unsigned char *) key->data->colvalues[i].data,
+ key->data->colvalues[i].len);
+ hashkey = hash_combine(hashkey, hkey);
+ }
+
+ return hashkey;
+}
+
+/*
+ * Compare two entries in the replica_identity_table.
+ */
+static bool
+hash_replica_identity_compare(ReplicaIdentityKey *a, ReplicaIdentityKey *b)
+{
+ if (a->relid != b->relid ||
+ a->data->ncols != b->data->ncols)
+ return false;
+
+ for (int i = 0; i < a->data->ncols; i++)
+ {
+ if (a->data->colstatus[i] != b->data->colstatus[i])
+ return false;
+
+ if (a->data->colvalues[i].len != b->data->colvalues[i].len)
+ return false;
+
+ if (strcmp(a->data->colvalues[i].data, b->data->colvalues[i].data))
+ return false;
+
+ elog(DEBUG1, "conflicting key %s", a->data->colvalues[i].data);
+ }
+
+ return true;
+}
+
+/*
+ * Free resources associated with a replica identity key.
+ */
+static void
+free_replica_identity_key(ReplicaIdentityKey *key)
+{
+ Assert(key);
+
+ pfree(key->data->colvalues);
+ pfree(key->data->colstatus);
+ pfree(key->data);
+ pfree(key);
+}
+
+/*
+ * Clean up hash table entries associated with the given transaction IDs.
+ */
+static void
+cleanup_replica_identity_table(List *committed_xid)
+{
+ replica_identity_iterator i;
+ ReplicaIdentityEntry *rientry;
+
+ if (!committed_xid)
+ return;
+
+ replica_identity_start_iterate(replica_identity_table, &i);
+ while ((rientry = replica_identity_iterate(replica_identity_table, &i)) != NULL)
+ {
+ if (!list_member_xid(committed_xid, rientry->remote_xid))
+ continue;
+
+ /* Clean up the hash entry for committed transaction */
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+ }
+}
+
+/*
+ * Check committed transactions and clean up corresponding entries in the hash
+ * table.
+ */
+static void
+cleanup_committed_replica_identity_entries(void)
+{
+ dlist_mutable_iter iter;
+ List *committed_xids = NIL;
+
+ dlist_foreach_modify(iter, &lsn_mapping)
+ {
+ FlushPosition *pos =
+ dlist_container(FlushPosition, node, iter.cur);
+ bool skipped_write;
+
+ if (!TransactionIdIsValid(pos->pa_remote_xid) ||
+ !XLogRecPtrIsInvalid(pos->local_end))
+ continue;
+
+ pos->local_end = pa_get_last_commit_end(pos->pa_remote_xid, true,
+ &skipped_write);
+
+ elog(DEBUG1,
+ "got commit end from parallel apply worker, "
+ "txn: %u, remote_end %X/%X, local_end %X/%X",
+ pos->pa_remote_xid, LSN_FORMAT_ARGS(pos->remote_end),
+ LSN_FORMAT_ARGS(pos->local_end));
+
+ if (!skipped_write && XLogRecPtrIsInvalid(pos->local_end))
+ continue;
+
+ committed_xids = lappend_xid(committed_xids, pos->pa_remote_xid);
+ }
+
+ /* cleanup the entries for committed transactions */
+ cleanup_replica_identity_table(committed_xids);
+}
+
+/*
+ * Append a transaction dependency, excluding duplicates and committed
+ * transactions.
+ */
+static List *
+check_and_append_xid_dependency(List *depends_on_xids,
+ TransactionId *depends_on_xid,
+ TransactionId current_xid)
+{
+ Assert(depends_on_xid);
+
+ if (!TransactionIdIsValid(*depends_on_xid))
+ return depends_on_xids;
+
+ if (TransactionIdEquals(*depends_on_xid, current_xid))
+ return depends_on_xids;
+
+ if (list_member_xid(depends_on_xids, *depends_on_xid))
+ return depends_on_xids;
+
+ /*
+ * Return and reset the xid if the transaction has been committed.
+ */
+ if (pa_transaction_committed(*depends_on_xid))
+ {
+ *depends_on_xid = InvalidTransactionId;
+ return depends_on_xids;
+ }
+
+ return lappend_xid(depends_on_xids, *depends_on_xid);
+}
+
+/*
+ * Check for dependencies on preceding transactions that modify the same key.
+ * Returns the dependent transactions in 'depends_on_xids' and records the
+ * current change.
+ */
+static void
+check_dependency_on_replica_identity(Oid relid,
+ LogicalRepTupleData *original_data,
+ TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ LogicalRepRelMapEntry *relentry;
+ LogicalRepTupleData *ridata;
+ ReplicaIdentityKey *rikey;
+ ReplicaIdentityEntry *rientry;
+ MemoryContext oldctx;
+ int n_ri;
+ bool found = false;
+
+ Assert(depends_on_xids);
+
+ /* Search for existing entry */
+ relentry = logicalrep_get_relentry(relid);
+
+ Assert(relentry);
+
+ /*
+ * First search whether any previous transaction has affected the whole
+ * table, e.g. a truncate or schema change from the publisher.
+ */
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &relentry->last_depended_xid,
+ new_depended_xid);
+
+ n_ri = bms_num_members(relentry->remoterel.attkeys);
+
+ /*
+ * Return if there are no replica identity columns, indicating that the
+ * remote relation neither has a replica identity key nor is marked as
+ * replica identity full.
+ */
+ if (!n_ri)
+ return;
+
+ /* Check if the RI key value of the tuple is invalid */
+ for (int i = 0; i < original_data->ncols; i++)
+ {
+ if (!bms_is_member(i, relentry->remoterel.attkeys))
+ continue;
+
+ /*
+ * Return if the RI key is NULL or is explicitly marked unchanged. The
+ * key value can be NULL in the new tuple of an update operation, which
+ * means the RI key is not updated.
+ */
+ if (original_data->colstatus[i] == LOGICALREP_COLUMN_NULL ||
+ original_data->colstatus[i] == LOGICALREP_COLUMN_UNCHANGED)
+ return;
+ }
+
+ oldctx = MemoryContextSwitchTo(ApplyContext);
+
+ /* Allocate space for replica identity values */
+ ridata = palloc0_object(LogicalRepTupleData);
+ ridata->colvalues = palloc0_array(StringInfoData, n_ri);
+ ridata->colstatus = palloc0_array(char, n_ri);
+ ridata->ncols = n_ri;
+
+ for (int i_original = 0, i_ri = 0; i_original < original_data->ncols; i_original++)
+ {
+ StringInfo original_colvalue = &original_data->colvalues[i_original];
+
+ if (!bms_is_member(i_original, relentry->remoterel.attkeys))
+ continue;
+
+ initStringInfoExt(&ridata->colvalues[i_ri], original_colvalue->len + 1);
+ appendStringInfoString(&ridata->colvalues[i_ri], original_colvalue->data);
+ ridata->colstatus[i_ri] = original_data->colstatus[i_original];
+ i_ri++;
+ }
+
+ rikey = palloc0_object(ReplicaIdentityKey);
+ rikey->relid = relid;
+ rikey->data = ridata;
+
+ if (TransactionIdIsValid(new_depended_xid))
+ {
+ rientry = replica_identity_insert(replica_identity_table, rikey,
+ &found);
+
+ /*
+ * Release the key built to search the entry, if the entry already
+ * exists. Otherwise, initialize the remote_xid.
+ */
+ if (found)
+ {
+ elog(DEBUG1, "found conflicting replica identity change from %u",
+ rientry->remote_xid);
+
+ free_replica_identity_key(rikey);
+ }
+ else
+ rientry->remote_xid = InvalidTransactionId;
+ }
+ else
+ {
+ rientry = replica_identity_lookup(replica_identity_table, rikey);
+ free_replica_identity_key(rikey);
+ }
+
+ MemoryContextSwitchTo(oldctx);
+
+ /* Return if no entry found */
+ if (!rientry)
+ return;
+
+ Assert(!found || TransactionIdIsValid(rientry->remote_xid));
+
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &rientry->remote_xid,
+ new_depended_xid);
+
+ /*
+ * Update the entry with the new depended xid if it is valid. The new xid
+ * can be invalid if the transaction will be applied by the leader itself,
+ * which means all the changes will be committed before the next
+ * transaction is processed, so nothing needs to depend on it.
+ */
+ if (TransactionIdIsValid(new_depended_xid))
+ rientry->remote_xid = new_depended_xid;
+
+ /*
+ * Remove the entry if the transaction has been committed and no new
+ * dependency needs to be added.
+ */
+ else if (!TransactionIdIsValid(rientry->remote_xid))
+ {
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+ }
+}
+
+/*
+ * Check for preceding transactions that involve insert, delete, or update
+ * operations on the specified table, and return them in 'depends_on_xids'.
+ */
+static void
+find_all_dependencies_on_rel(LogicalRepRelId relid, TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ replica_identity_iterator i;
+ ReplicaIdentityEntry *rientry;
+
+ Assert(depends_on_xids);
+
+ replica_identity_start_iterate(replica_identity_table, &i);
+ while ((rientry = replica_identity_iterate(replica_identity_table, &i)) != NULL)
+ {
+ Assert(TransactionIdIsValid(rientry->remote_xid));
+
+ if (rientry->keydata->relid != relid)
+ continue;
+
+ /* Clean up the hash entry for committed transaction while on it */
+ if (pa_transaction_committed(rientry->remote_xid))
+ {
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+
+ continue;
+ }
+
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &rientry->remote_xid,
+ new_depended_xid);
+ }
+}
+
+/*
+ * Check for any preceding transactions that affect the given table and returns
+ * them in 'depends_on_xids'.
+ */
+static void
+check_dependency_on_rel(LogicalRepRelId relid, TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ LogicalRepRelMapEntry *relentry;
+
+ Assert(depends_on_xids);
+
+ find_all_dependencies_on_rel(relid, new_depended_xid, depends_on_xids);
+
+ /* Search for existing entry */
+ relentry = logicalrep_get_relentry(relid);
+
+ /*
+ * The relentry has not been initialized yet, indicating that no change
+ * has been applied yet.
+ */
+ if (!relentry)
+ return;
+
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &relentry->last_depended_xid,
+ new_depended_xid);
+
+ if (TransactionIdIsValid(new_depended_xid))
+ relentry->last_depended_xid = new_depended_xid;
+}
+
+/*
+ * Check dependencies related to the current change by determining if the
+ * modification impacts the same row or table as another ongoing transaction. If
+ * needed, instruct parallel apply workers to wait for these preceding
+ * transactions to complete.
+ *
+ * Simultaneously, track the dependency for the current change to ensure that
+ * subsequent transactions address this dependency.
+ */
+static void
+handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
+ TransactionId new_depended_xid,
+ ParallelApplyWorkerInfo *winfo)
+{
+ LogicalRepRelId relid;
+ LogicalRepTupleData oldtup;
+ LogicalRepTupleData newtup;
+ LogicalRepRelation *rel;
+ List *depends_on_xids = NIL;
+ List *remote_relids;
+ bool has_oldtup = false;
+ bool cascade = false;
+ bool restart_seqs = false;
+ StringInfoData dependencies;
+
+ /*
+	 * Parse the change data using a local copy instead of consuming the given
+	 * remote change directly, as the caller may also need to read the data
+	 * from the remote message.
+ */
+ StringInfoData change = *s;
+
+	/* Compute dependencies only for non-streaming transactions */
+ if (in_streamed_transaction || (winfo && winfo->stream_txn))
+ return;
+
+ /* Only the leader checks dependencies and schedules the parallel apply */
+ if (!am_leader_apply_worker())
+ return;
+
+ if (!replica_identity_table)
+ replica_identity_table = replica_identity_create(ApplyContext,
+ REPLICA_IDENTITY_INITIAL_SIZE,
+ NULL);
+
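+	/*
+	 * Trim entries for already-committed transactions once the hash table
+	 * grows past the threshold, to keep its size bounded.
+	 */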
+ if (replica_identity_table->members >= REPLICA_IDENTITY_CLEANUP_THRESHOLD)
+ cleanup_committed_replica_identity_entries();
+
+ switch (action)
+ {
+ case LOGICAL_REP_MSG_INSERT:
+ relid = logicalrep_read_insert(&change, &newtup);
+ check_dependency_on_replica_identity(relid, &newtup,
+ new_depended_xid,
+ &depends_on_xids);
+ break;
+
+ case LOGICAL_REP_MSG_UPDATE:
+ relid = logicalrep_read_update(&change, &has_oldtup, &oldtup,
+ &newtup);
+
+ if (has_oldtup)
+ check_dependency_on_replica_identity(relid, &oldtup,
+ new_depended_xid,
+ &depends_on_xids);
+
+ check_dependency_on_replica_identity(relid, &newtup,
+ new_depended_xid,
+ &depends_on_xids);
+ break;
+
+ case LOGICAL_REP_MSG_DELETE:
+ relid = logicalrep_read_delete(&change, &oldtup);
+ check_dependency_on_replica_identity(relid, &oldtup,
+ new_depended_xid,
+ &depends_on_xids);
+ break;
+
+ case LOGICAL_REP_MSG_TRUNCATE:
+ remote_relids = logicalrep_read_truncate(&change, &cascade,
+ &restart_seqs);
+
+ /*
+ * Truncate affects all rows in a table, so the current
+ * transaction should wait for all preceding transactions that
+ * modified the same table.
+ */
+ foreach_int(truncated_relid, remote_relids)
+ check_dependency_on_rel(truncated_relid, new_depended_xid,
+ &depends_on_xids);
+
+ break;
+
+ case LOGICAL_REP_MSG_RELATION:
+ rel = logicalrep_read_rel(&change);
+
+ /*
+			 * The replica identity key could be changed, making existing
+			 * entries in the replica identity table invalid. In this case, parallel
+ * apply is not allowed on this specific table until all running
+ * transactions that modified it have finished.
+ */
+ check_dependency_on_rel(rel->remoteid, new_depended_xid,
+ &depends_on_xids);
+ break;
+
+ case LOGICAL_REP_MSG_TYPE:
+ case LOGICAL_REP_MSG_MESSAGE:
+
+ /*
+ * Type updates accompany relation updates, so dependencies have
+ * already been checked during relation updates. Logical messages
+ * do not conflict with any changes, so they can be ignored.
+ */
+ break;
+
+ default:
+ Assert(false);
+ break;
+ }
+
+ if (!depends_on_xids)
+ return;
+
+ /*
+	 * Record that these transactions are depended upon by the current
+	 * transaction.
+ */
+ pa_record_dependency_on_transactions(depends_on_xids);
+
+ /*
+	 * If the leader applies the transaction itself, immediately wait for the
+	 * transactions that the current transaction depends on.
+ */
+ if (winfo == NULL)
+ {
+ foreach_xid(xid, depends_on_xids)
+ pa_wait_for_depended_transaction(xid);
+
+ return;
+ }
+
+ initStringInfo(&dependencies);
+
+ /* Build the dependency message used to send to parallel apply worker */
+ write_internal_dependencies(&dependencies, depends_on_xids);
+
+ (void) send_internal_dependencies(winfo, &dependencies);
+}
+
+/*
+ * Write internal dependency information to the output for the parallel apply
+ * worker.
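+ *
+ * The message consists of an internal-message marker byte, the
+ * LOGICAL_REP_MSG_INTERNAL_DEPENDENCY type byte, an int32 count, and then
+ * that many transaction IDs, each sent as an int32.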
+ */
+static void
+write_internal_dependencies(StringInfo s, List *depends_on_xids)
+{
+ pq_sendbyte(s, PARALLEL_APPLY_INTERNAL_MESSAGE);
+ pq_sendbyte(s, LOGICAL_REP_MSG_INTERNAL_DEPENDENCY);
+ pq_sendint32(s, list_length(depends_on_xids));
+
+ foreach_xid(xid, depends_on_xids)
+ pq_sendint32(s, xid);
+}
+
/*
* Handle internal dependency information.
*
@@ -826,7 +1410,10 @@ handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
/* not in streaming mode */
if (apply_action == TRANS_LEADER_APPLY)
+ {
+ handle_dependency_on_change(action, s, InvalidTransactionId, winfo);
return false;
+ }
Assert(TransactionIdIsValid(stream_xid));
@@ -1268,6 +1855,33 @@ apply_handle_begin(StringInfo s)
pgstat_report_activity(STATE_RUNNING, NULL);
}
+/*
+ * Send an INTERNAL_DEPENDENCY message to a parallel apply worker.
+ *
+ * Returns false if we switched to the serialize mode to send the message,
+ * true otherwise.
+ */
+static bool
+send_internal_dependencies(ParallelApplyWorkerInfo *winfo, StringInfo s)
+{
+ Assert(s->data[0] == PARALLEL_APPLY_INTERNAL_MESSAGE);
+ Assert(s->data[1] == LOGICAL_REP_MSG_INTERNAL_DEPENDENCY);
+
+ if (!winfo->serialize_changes)
+ {
+ if (pa_send_data(winfo, s->len, s->data))
+ return true;
+
+ pa_switch_to_partial_serialize(winfo, true);
+ }
+
+ /* Skip writing the first internal message flag */
+ s->cursor++;
+ stream_write_change(LOGICAL_REP_MSG_INTERNAL_DEPENDENCY, s);
+
+ return false;
+}
+
/*
* Handle COMMIT message.
*
@@ -1795,7 +2409,7 @@ apply_handle_stream_start(StringInfo s)
/* Try to allocate a worker for the streaming transaction. */
if (first_segment)
- pa_allocate_worker(stream_xid);
+ pa_allocate_worker(stream_xid, true);
apply_action = get_transaction_apply_action(stream_xid, &winfo);
diff --git a/src/include/replication/logicalrelation.h b/src/include/replication/logicalrelation.h
index 7a561a8e8d8..4b321bd2ad2 100644
--- a/src/include/replication/logicalrelation.h
+++ b/src/include/replication/logicalrelation.h
@@ -37,6 +37,8 @@ typedef struct LogicalRepRelMapEntry
/* Sync state. */
char state;
XLogRecPtr statelsn;
+
+ TransactionId last_depended_xid;
} LogicalRepRelMapEntry;
extern void logicalrep_relmap_update(LogicalRepRelation *remoterel);
@@ -50,5 +52,6 @@ extern void logicalrep_rel_close(LogicalRepRelMapEntry *rel,
LOCKMODE lockmode);
extern bool IsIndexUsableForReplicaIdentityFull(Relation idxrel, AttrMap *attrmap);
extern Oid GetRelationIdentityOrPK(Relation rel);
+extern LogicalRepRelMapEntry *logicalrep_get_relentry(LogicalRepRelId remoteid);
#endif /* LOGICALRELATION_H */
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index ddcdcc05053..78b5667cebe 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -235,6 +235,8 @@ typedef struct ParallelApplyWorkerInfo
*/
bool in_use;
+ bool stream_txn;
+
ParallelApplyWorkerShared *shared;
} ParallelApplyWorkerInfo;
@@ -332,8 +334,10 @@ extern void apply_error_callback(void *arg);
extern void set_apply_error_context_origin(char *originname);
/* Parallel apply worker setup and interactions */
-extern void pa_allocate_worker(TransactionId xid);
+extern void pa_allocate_worker(TransactionId xid, bool stream_txn);
extern ParallelApplyWorkerInfo *pa_find_worker(TransactionId xid);
+extern XLogRecPtr pa_get_last_commit_end(TransactionId xid, bool delete_entry,
+ bool *skipped_write);
extern void pa_detach_all_error_mq(void);
extern bool pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes,
@@ -362,6 +366,8 @@ extern void pa_decr_and_wait_stream_block(void);
extern void pa_xact_finish(ParallelApplyWorkerInfo *winfo,
XLogRecPtr remote_lsn);
+extern bool pa_transaction_committed(TransactionId xid);
+extern void pa_record_dependency_on_transactions(List *depends_on_xids);
extern void pa_wait_for_depended_transaction(TransactionId xid);
--
2.47.3
Attachment: v7-0004-Parallel-apply-non-streaming-transactions.patch
From 776c076c23cb632ec97a4314cb414748e6f9f3d0 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Mon, 1 Dec 2025 12:28:29 +0900
Subject: [PATCH v7 4/8] Parallel apply non-streaming transactions
--
Basic design
--
The leader worker assigns each non-streaming transaction to a parallel apply
worker. Before dispatching changes to a parallel worker, the leader verifies if
the current modification affects the same row (identified by the replica identity
key) as another ongoing transaction. If so, the leader sends a list of dependent
transaction IDs to the parallel worker, indicating that the parallel apply
worker must wait for these transactions to commit before proceeding.
Each parallel apply worker records the local end LSN of the transaction it
applies in shared memory. Subsequently, the leader gathers these local end LSNs
and logs them in the local 'lsn_mapping' for verifying whether they have been
flushed to disk (following the logic in get_flush_position()).
If no parallel apply worker is available, the leader will apply the transaction
independently.
For further details, please refer to the following:
--
dependency tracking
--
The leader maintains a local hash table, using the remote change's replica
identity column values and relid as keys, with remote transaction IDs as values.
Before sending changes to the parallel apply worker, the leader computes a hash
using RI key values and the relid of the current change to search the hash
table. If an existing entry is found, the leader first updates the hash entry
with the incoming remote xid and then tells the parallel worker to wait for it.
If the remote relation lacks a replica identity (RI), it indicates that only
INSERT can be replicated for this table. In such cases, the leader skips
dependency checks, allowing the parallel apply worker to proceed with applying
changes without delay. This is because the only potential conflicts relate to
local unique keys or foreign keys, whose tracking is yet to be implemented
(see TODO - dependency on local unique key, foreign key).
In cases of TRUNCATE or remote schema changes affecting the entire table, the
leader retrieves all remote xids touching the same table (via sequential scans
of the hash table) and tells the parallel worker to wait for those transactions
to commit.
Hash entries are cleaned up once the transaction corresponding to the remote xid
in the entry has been committed. Clean-up typically occurs when collecting the
flush position of each transaction, but is forced if the hash table exceeds a
set threshold.
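
For illustration, the hash table entries look roughly like this (field names
follow their usage in the patch; the exact key layout is an assumption):

    typedef struct ReplicaIdentityEntry
    {
        ReplicaIdentityKey *keydata;    /* relid + RI column values */
        TransactionId remote_xid;       /* latest remote xact touching the row */
    } ReplicaIdentityEntry;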
--
dependency waiting
--
If a transaction is relied upon by others, the leader adds its xid to a shared
hash table. The shared hash table entry is cleared by the parallel apply worker
upon completing the transaction. Workers needing to wait for a transaction check
the shared hash table entry; if present, they lock the transaction ID (using
pa_lock_transaction). If absent, the transaction has already committed and
there is no need to wait.
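
Roughly, the waiting protocol looks like this (a simplified sketch of
pa_wait_for_transaction() from the patch below):

    while (!pa_transaction_committed(xid))
    {
        /* The worker holds an exclusive lock on its xid until it finishes, */
        /* so this blocks until the transaction completes. */
        pa_lock_transaction(xid, AccessShareLock);
        pa_unlock_transaction(xid, AccessShareLock);
        CHECK_FOR_INTERRUPTS();
    }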
--
commit order
--
There is a case where columns have no foreign or primary keys, and integrity is
maintained at the application layer. In this case, the above RI mechanism cannot
detect any dependencies. For safety reasons, parallel apply workers preserve the
commit ordering done on the publisher side. This is done by the leader worker
caching the most recently dispatched transaction ID and adding a dependency
between it and the one currently being dispatched, as sketched below.
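
A minimal sketch of that bookkeeping (simplified; the real logic lives in
build_dependency_with_last_committed_txn() in the patch):

    /* When dispatching the COMMIT of remote_xid to a parallel worker: */
    if (TransactionIdIsValid(last_remote_xid))
        build_dependency_with_last_committed_txn(winfo);    /* worker waits */
    last_remote_xid = remote_xid;    /* remember for the next transaction */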
--
TODO - dependency on foreign key.
--
A transaction could conflict with another if both modify the same key.
While current patches don't address conflicts involving foreign keys, tracking
these dependencies might be needed.
---
.../replication/logical/applyparallelworker.c | 339 ++++++++++++++++--
src/backend/replication/logical/proto.c | 38 ++
src/backend/replication/logical/relation.c | 31 ++
src/backend/replication/logical/worker.c | 303 ++++++++++++++--
src/include/replication/logicalproto.h | 2 +
src/include/replication/logicalrelation.h | 2 +
src/include/replication/worker_internal.h | 11 +-
src/test/subscription/meson.build | 1 +
src/test/subscription/t/001_rep_changes.pl | 2 +
src/test/subscription/t/010_truncate.pl | 2 +-
src/test/subscription/t/015_stream.pl | 8 +-
src/test/subscription/t/026_stats.pl | 1 +
src/test/subscription/t/027_nosuperuser.pl | 1 +
src/test/subscription/t/050_parallel_apply.pl | 130 +++++++
src/tools/pgindent/typedefs.list | 4 +
15 files changed, 801 insertions(+), 74 deletions(-)
create mode 100644 src/test/subscription/t/050_parallel_apply.pl
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index cf08206d9fd..5b6267c6047 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -14,6 +14,9 @@
* ParallelApplyWorkerInfo which is required so the leader worker and parallel
* apply workers can communicate with each other.
*
+ * Streaming transactions
+ * ======================
+ *
* The parallel apply workers are assigned (if available) as soon as xact's
* first stream is received for subscriptions that have set their 'streaming'
* option as parallel. The leader apply worker will send changes to this new
@@ -152,6 +155,33 @@
* session-level locks because both locks could be acquired outside the
* transaction, and the stream lock in the leader needs to persist across
* transaction boundaries i.e. until the end of the streaming transaction.
+ *
+ * Non-streaming transactions
+ * ======================
+ * The handling is similar to streaming transactions, but with a few
+ * differences:
+ *
+ * Transaction dependency
+ * ----------------------
+ * Before dispatching changes to a parallel worker, the leader verifies if the
+ * current modification affects the same row (identitied by replica identity
+ * key) as another ongoing transaction (see handle_dependency_on_change for
+ * details). If so, the leader sends a list of dependent transaction IDs to the
+ * parallel worker, indicating that the parallel apply worker must wait for
+ * these transactions to commit before proceeding.
+ *
+ * Commit order
+ * ------------
+ * There is a case where columns have no foreign or primary keys, and integrity
+ * is maintained at the application layer. In this case, the above RI mechanism
+ * cannot detect any dependencies. For safety reasons, parallel apply workers
+ * preserve the commit ordering done on the publisher side. This is done by the
+ * leader worker caching the most recently dispatched transaction ID and adding
+ * a dependency between it and the one currently being dispatched.
+ * We can extend the parallel apply worker to allow out-of-order commits in the
+ * future: at a minimum, a new mechanism would be needed to track replication
+ * progress under out-of-order commits. Then we could stop caching the
+ * transaction ID and adding the dependency.
*-------------------------------------------------------------------------
*/
@@ -283,6 +313,7 @@ static ParallelTransState pa_get_xact_state(ParallelApplyWorkerShared *wshared);
static PartialFileSetState pa_get_fileset_state(void);
static void pa_attach_parallelized_txn_hash(dsa_handle *pa_dsa_handle,
dshash_table_handle *pa_dshash_handle);
+static void write_internal_relation(StringInfo s, LogicalRepRelation *rel);
/*
* Returns true if it is OK to start a parallel apply worker, false otherwise.
@@ -400,6 +431,7 @@ pa_setup_dsm(ParallelApplyWorkerInfo *winfo)
shared = shm_toc_allocate(toc, sizeof(ParallelApplyWorkerShared));
SpinLockInit(&shared->mutex);
+ shared->xid = InvalidTransactionId;
shared->xact_state = PARALLEL_TRANS_UNKNOWN;
pg_atomic_init_u32(&(shared->pending_stream_count), 0);
shared->last_commit_end = InvalidXLogRecPtr;
@@ -443,6 +475,8 @@ pa_launch_parallel_worker(void)
MemoryContext oldcontext;
bool launched;
ParallelApplyWorkerInfo *winfo;
+ dsa_handle pa_dsa_handle;
+ dshash_table_handle pa_dshash_handle;
ListCell *lc;
/* Try to get an available parallel apply worker from the worker pool. */
@@ -450,10 +484,33 @@ pa_launch_parallel_worker(void)
{
winfo = (ParallelApplyWorkerInfo *) lfirst(lc);
- if (!winfo->in_use)
+ if (!winfo->stream_txn &&
+ pa_get_xact_state(winfo->shared) == PARALLEL_TRANS_FINISHED)
+ {
+ /*
+ * Save the local commit LSN of the last transaction applied by
+ * this worker before reusing it for another transaction. This WAL
+ * position is crucial for determining the flush position in
+ * responses to the publisher (see get_flush_position()).
+ */
+ (void) pa_get_last_commit_end(winfo->shared->xid, false, NULL);
+ return winfo;
+ }
+
+ if (winfo->stream_txn && !winfo->in_use)
return winfo;
}
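+
+	/*
+	 * Make sure the shared hash table used to track parallelized
+	 * transactions is attached before proceeding.
+	 */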
+ pa_attach_parallelized_txn_hash(&pa_dsa_handle, &pa_dshash_handle);
+
+ /*
+ * Return if the number of parallel apply workers has reached the maximum
+ * limit.
+ */
+ if (list_length(ParallelApplyWorkerPool) ==
+ max_parallel_apply_workers_per_subscription)
+ return NULL;
+
/*
* Start a new parallel apply worker.
*
@@ -481,18 +538,32 @@ pa_launch_parallel_worker(void)
dsm_segment_handle(winfo->dsm_seg),
false);
- if (launched)
- {
- ParallelApplyWorkerPool = lappend(ParallelApplyWorkerPool, winfo);
- }
- else
+ if (!launched)
{
+ MemoryContextSwitchTo(oldcontext);
pa_free_worker_info(winfo);
- winfo = NULL;
+ return NULL;
}
+ ParallelApplyWorkerPool = lappend(ParallelApplyWorkerPool, winfo);
+
MemoryContextSwitchTo(oldcontext);
+ /*
+ * Send all existing remote relation information to the parallel apply
+ * worker. This allows the parallel worker to initialize the
+ * LogicalRepRelMapEntry locally before applying remote changes.
+ */
+ if (logicalrep_get_num_rels())
+ {
+ StringInfoData out;
+
+ initStringInfo(&out);
+
+ write_internal_relation(&out, NULL);
+ pa_send_data(winfo, out.len, out.data);
+ }
+
return winfo;
}
@@ -597,7 +668,8 @@ pa_free_worker(ParallelApplyWorkerInfo *winfo)
{
Assert(!am_parallel_apply_worker());
Assert(winfo->in_use);
- Assert(pa_get_xact_state(winfo->shared) == PARALLEL_TRANS_FINISHED);
+ Assert(!winfo->stream_txn ||
+ pa_get_xact_state(winfo->shared) == PARALLEL_TRANS_FINISHED);
if (!hash_search(ParallelApplyTxnHash, &winfo->shared->xid, HASH_REMOVE, NULL))
elog(ERROR, "hash table corrupted");
@@ -613,9 +685,7 @@ pa_free_worker(ParallelApplyWorkerInfo *winfo)
* been serialized and then letting the parallel apply worker deal with
* the spurious message, we stop the worker.
*/
- if (winfo->serialize_changes ||
- list_length(ParallelApplyWorkerPool) >
- (max_parallel_apply_workers_per_subscription / 2))
+ if (winfo->serialize_changes)
{
logicalrep_pa_worker_stop(winfo);
pa_free_worker_info(winfo);
@@ -812,6 +882,38 @@ pa_get_last_commit_end(TransactionId xid, bool delete_entry, bool *skipped_write
return entry->local_end;
}
+/*
+ * Wait for the remote transaction associated with the specified remote xid to
+ * complete.
+ */
+static void
+pa_wait_for_transaction(TransactionId wait_for_xid)
+{
+ if (!am_leader_apply_worker())
+ return;
+
+ if (!TransactionIdIsValid(wait_for_xid))
+ return;
+
+ elog(DEBUG1, "plan to wait for remote_xid %u to finish",
+ wait_for_xid);
+
+ for (;;)
+ {
+ if (pa_transaction_committed(wait_for_xid))
+ break;
+
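+		/*
+		 * The parallel apply worker holds an exclusive lock on its
+		 * transaction ID until it finishes, so acquiring and releasing the
+		 * share lock here blocks until the transaction completes.
+		 */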
+ pa_lock_transaction(wait_for_xid, AccessShareLock);
+ pa_unlock_transaction(wait_for_xid, AccessShareLock);
+
+ /* An interrupt may have occurred while we were waiting. */
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ elog(DEBUG1, "finished wait for remote_xid %u to finish",
+ wait_for_xid);
+}
+
/*
* Interrupt handler for main loop of parallel apply worker.
*/
@@ -887,21 +989,34 @@ LogicalParallelApplyLoop(shm_mq_handle *mqh)
* parallel apply workers can only be PqReplMsg_WALData.
*/
c = pq_getmsgbyte(&s);
- if (c != PqReplMsg_WALData)
- elog(ERROR, "unexpected message \"%c\"", c);
-
- /*
- * Ignore statistics fields that have been updated by the leader
- * apply worker.
- *
- * XXX We can avoid sending the statistics fields from the leader
- * apply worker but for that, it needs to rebuild the entire
- * message by removing these fields which could be more work than
- * simply ignoring these fields in the parallel apply worker.
- */
- s.cursor += SIZE_STATS_MESSAGE;
+ if (c == PqReplMsg_WALData)
+ {
+ /*
+ * Ignore statistics fields that have been updated by the
+ * leader apply worker.
+ *
+ * XXX We can avoid sending the statistics fields from the
+ * leader apply worker but for that, it needs to rebuild the
+ * entire message by removing these fields which could be more
+ * work than simply ignoring these fields in the parallel
+ * apply worker.
+ */
+ s.cursor += SIZE_STATS_MESSAGE;
- apply_dispatch(&s);
+ apply_dispatch(&s);
+ }
+ else if (c == PARALLEL_APPLY_INTERNAL_MESSAGE)
+ {
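+			/*
+			 * Internal messages are generated by the leader apply worker
+			 * and carry no statistics fields, so dispatch them directly.
+			 */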
+ apply_dispatch(&s);
+ }
+ else
+ {
+ /*
+ * The first byte of messages sent from leader apply worker to
+ * parallel apply workers can only be 'w' or 'i'.
+ */
+ elog(ERROR, "unexpected message \"%c\"", c);
+ }
}
else if (shmq_res == SHM_MQ_WOULD_BLOCK)
{
@@ -918,6 +1033,9 @@ LogicalParallelApplyLoop(shm_mq_handle *mqh)
if (rc & WL_LATCH_SET)
ResetLatch(MyLatch);
+
+ if (!IsTransactionState())
+ pgstat_report_stat(true);
}
}
else
@@ -955,6 +1073,9 @@ pa_shutdown(int code, Datum arg)
INVALID_PROC_NUMBER);
dsm_detach((dsm_segment *) DatumGetPointer(arg));
+
+ if (parallel_apply_dsa_area)
+ dsa_detach(parallel_apply_dsa_area);
}
/*
@@ -1267,7 +1388,6 @@ pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes, const void *data)
shm_mq_result result;
TimestampTz startTime = 0;
- Assert(!IsTransactionState());
Assert(!winfo->serialize_changes);
/*
@@ -1319,6 +1439,67 @@ pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes, const void *data)
}
}
+/*
+ * Distribute remote relation information to all active parallel apply workers
+ * that require it.
+ */
+void
+pa_distribute_schema_changes_to_workers(LogicalRepRelation *rel)
+{
+ List *workers_stopped = NIL;
+ StringInfoData out;
+
+ if (!ParallelApplyWorkerPool)
+ return;
+
+ initStringInfo(&out);
+
+ write_internal_relation(&out, rel);
+
+ foreach_ptr(ParallelApplyWorkerInfo, winfo, ParallelApplyWorkerPool)
+ {
+ /*
+ * Skip the worker responsible for the current transaction, as the
+ * relation information has already been sent to it.
+ */
+ if (winfo == stream_apply_worker)
+ continue;
+
+ /*
+		 * Skip workers that are in serialize mode, as they will soon stop
+		 * once they finish applying the transaction.
+ */
+ if (winfo->serialize_changes)
+ continue;
+
+ elog(DEBUG1, "distributing schema changes to pa workers");
+
+ if (pa_send_data(winfo, out.len, out.data))
+ continue;
+
+ elog(DEBUG1, "failed to distribute, will stop that worker instead");
+
+ /*
+ * Distribution to this worker failed due to a sending timeout. Wait
+ * for the worker to complete its transaction and then stop it. This
+ * is consistent with the handling of workers in serialize mode (see
+ * pa_free_worker() for details).
+ */
+ pa_wait_for_transaction(winfo->shared->xid);
+
+ pa_get_last_commit_end(winfo->shared->xid, false, NULL);
+
+ logicalrep_pa_worker_stop(winfo);
+
+ workers_stopped = lappend(workers_stopped, winfo);
+ }
+
+ pfree(out.data);
+
+ foreach_ptr(ParallelApplyWorkerInfo, winfo, workers_stopped)
+ pa_free_worker_info(winfo);
+}
+
/*
* Switch to PARTIAL_SERIALIZE mode for the current transaction -- this means
* that the current data and any subsequent data for this transaction will be
@@ -1401,8 +1582,8 @@ pa_wait_for_xact_finish(ParallelApplyWorkerInfo *winfo)
/*
* Wait for the transaction lock to be released. This is required to
- * detect deadlock among leader and parallel apply workers. Refer to the
- * comments atop this file.
+	 * detect deadlock among leader and parallel apply workers. Refer
+ * to the comments atop this file.
*/
pa_lock_transaction(winfo->shared->xid, AccessShareLock);
pa_unlock_transaction(winfo->shared->xid, AccessShareLock);
@@ -1479,6 +1660,9 @@ pa_savepoint_name(Oid suboid, TransactionId xid, char *spname, Size szsp)
void
pa_start_subtrans(TransactionId current_xid, TransactionId top_xid)
{
+ if (!TransactionIdIsValid(top_xid))
+ return;
+
if (current_xid != top_xid &&
!list_member_xid(subxactlist, current_xid))
{
@@ -1735,25 +1919,41 @@ pa_decr_and_wait_stream_block(void)
void
pa_xact_finish(ParallelApplyWorkerInfo *winfo, XLogRecPtr remote_lsn)
{
+ XLogRecPtr local_lsn = InvalidXLogRecPtr;
+ TransactionId pa_remote_xid = winfo->shared->xid;
+
Assert(am_leader_apply_worker());
/*
- * Unlock the shared object lock so that parallel apply worker can
- * continue to receive and apply changes.
+ * Unlock the shared object lock taken for streaming transactions so that
+ * parallel apply worker can continue to receive and apply changes.
*/
- pa_unlock_stream(winfo->shared->xid, AccessExclusiveLock);
+ if (winfo->stream_txn)
+ pa_unlock_stream(winfo->shared->xid, AccessExclusiveLock);
/*
- * Wait for that worker to finish. This is necessary to maintain commit
- * order which avoids failures due to transaction dependencies and
- * deadlocks.
+	 * Wait for the worker handling a streaming transaction to finish. This is
+	 * necessary to maintain commit order, which avoids failures due to
+	 * transaction dependencies and deadlocks.
+	 *
+	 * For a non-streaming transaction in partial serialize mode, wait as
+	 * well, since the worker cannot be reused anymore (see
+	 * pa_free_worker() for details).
*/
- pa_wait_for_xact_finish(winfo);
+ if (winfo->serialize_changes || winfo->stream_txn)
+ {
+ pa_wait_for_xact_finish(winfo);
+
+ local_lsn = winfo->shared->last_commit_end;
+ pa_remote_xid = InvalidTransactionId;
+
+ pa_free_worker(winfo);
+ }
if (XLogRecPtrIsValid(remote_lsn))
- store_flush_position(remote_lsn, winfo->shared->last_commit_end);
+ store_flush_position(remote_lsn, local_lsn, pa_remote_xid);
- pa_free_worker(winfo);
+ pa_set_stream_apply_worker(NULL);
}
bool
@@ -1852,6 +2052,22 @@ pa_record_dependency_on_transactions(List *depends_on_xids)
}
}
+/*
+ * Mark the transaction state as finished and remove the shared hash entry.
+ */
+void
+pa_commit_transaction(void)
+{
+ TransactionId xid = MyParallelShared->xid;
+
+ SpinLockAcquire(&MyParallelShared->mutex);
+ MyParallelShared->xact_state = PARALLEL_TRANS_FINISHED;
+ SpinLockRelease(&MyParallelShared->mutex);
+
+ dshash_delete_key(parallelized_txns, &xid);
+ elog(DEBUG1, "depended xid %u committed", xid);
+}
+
/*
* Wait for the given transaction to finish.
*/
@@ -1860,6 +2076,13 @@ pa_wait_for_depended_transaction(TransactionId xid)
{
elog(DEBUG1, "wait for depended xid %u", xid);
+ /*
+ * Quick exit if parallelized_txns has not been initialized yet. This can
+ * happen when this function is called by the leader worker.
+ */
+ if (!parallelized_txns)
+ return;
+
for (;;)
{
ParallelizedTxnEntry *txn_entry;
@@ -1880,3 +2103,45 @@ pa_wait_for_depended_transaction(TransactionId xid)
elog(DEBUG1, "finish waiting for depended xid %u", xid);
}
+
+/*
+ * Write internal relation description to the output stream.
+ */
+static void
+write_internal_relation(StringInfo s, LogicalRepRelation *rel)
+{
+ pq_sendbyte(s, PARALLEL_APPLY_INTERNAL_MESSAGE);
+ pq_sendbyte(s, LOGICAL_REP_MSG_INTERNAL_RELATION);
+
+ if (rel)
+ {
+ pq_sendint(s, 1, 4);
+ logicalrep_write_internal_rel(s, rel);
+ }
+ else
+ {
+ pq_sendint(s, logicalrep_get_num_rels(), 4);
+ logicalrep_write_all_rels(s);
+ }
+}
+
+/*
+ * Register a transaction to the shared hash table.
+ *
+ * This function is intended to be called during the commit phase of
+ * non-streamed transactions. Other parallel workers that depend on this
+ * transaction wait until the entry is removed at commit.
+ */
+void
+pa_add_parallelized_transaction(TransactionId xid)
+{
+ bool found;
+ ParallelizedTxnEntry *txn_entry;
+
+ Assert(parallelized_txns);
+ Assert(TransactionIdIsValid(xid));
+
+ txn_entry = dshash_find_or_insert(parallelized_txns, &xid, &found);
+
+ dshash_release_lock(parallelized_txns, txn_entry);
+}
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index ded46c49a83..96b6a74055e 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -691,6 +691,44 @@ logicalrep_write_rel(StringInfo out, TransactionId xid, Relation rel,
logicalrep_write_attrs(out, rel, columns, include_gencols_type);
}
+/*
+ * Write internal relation description to the output stream.
+ */
+void
+logicalrep_write_internal_rel(StringInfo out, LogicalRepRelation *rel)
+{
+ pq_sendint32(out, rel->remoteid);
+
+ /* Write relation name */
+ pq_sendstring(out, rel->nspname);
+ pq_sendstring(out, rel->relname);
+
+ /* Write the replica identity. */
+ pq_sendbyte(out, rel->replident);
+
+ /* Write attribute description */
+ pq_sendint16(out, rel->natts);
+
+ for (int i = 0; i < rel->natts; i++)
+ {
+ uint8 flags = 0;
+
+ if (bms_is_member(i, rel->attkeys))
+ flags |= LOGICALREP_IS_REPLICA_IDENTITY;
+
+ pq_sendbyte(out, flags);
+
+ /* attribute name */
+ pq_sendstring(out, rel->attnames[i]);
+
+ /* attribute type id */
+ pq_sendint32(out, rel->atttyps[i]);
+
+ /* ignore attribute mode for now */
+ pq_sendint32(out, 0);
+ }
+}
+
/*
* Read the relation info from stream and return as LogicalRepRelation.
*/
diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index 13f8cb74e9f..9991bfe76cc 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -960,6 +960,37 @@ FindLogicalRepLocalIndex(Relation localrel, LogicalRepRelation *remoterel,
return InvalidOid;
}
+/*
+ * Get the number of entries in the LogicalRepRelMap.
+ */
+int
+logicalrep_get_num_rels(void)
+{
+ if (LogicalRepRelMap == NULL)
+ return 0;
+
+ return hash_get_num_entries(LogicalRepRelMap);
+}
+
+/*
+ * Write all the remote relation information from the LogicalRepRelMapEntry to
+ * the output stream.
+ */
+void
+logicalrep_write_all_rels(StringInfo out)
+{
+ LogicalRepRelMapEntry *entry;
+ HASH_SEQ_STATUS status;
+
+ if (LogicalRepRelMap == NULL)
+ return;
+
+ hash_seq_init(&status, LogicalRepRelMap);
+
+ while ((entry = (LogicalRepRelMapEntry *) hash_seq_search(&status)) != NULL)
+ logicalrep_write_internal_rel(out, &entry->remoterel);
+}
+
/*
* Get the LogicalRepRelMapEntry corresponding to the given relid without
* opening the local relation.
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 0b1eeefe9c9..3832481647e 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -286,6 +286,7 @@
#include "tcop/tcopprot.h"
#include "utils/acl.h"
#include "utils/guc.h"
+#include "utils/injection_point.h"
#include "utils/inval.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
@@ -484,6 +485,8 @@ static List *on_commit_wakeup_workers_subids = NIL;
bool in_remote_transaction = false;
static XLogRecPtr remote_final_lsn = InvalidXLogRecPtr;
+static TransactionId remote_xid = InvalidTransactionId;
+static TransactionId last_remote_xid = InvalidTransactionId;
/* fields valid only when processing streamed transaction */
static bool in_streamed_transaction = false;
@@ -602,11 +605,7 @@ static inline void cleanup_subxact_info(void);
/*
* Serialize and deserialize changes for a toplevel transaction.
*/
-static void stream_open_file(Oid subid, TransactionId xid,
- bool first_segment);
static void stream_write_change(char action, StringInfo s);
-static void stream_open_and_write_change(TransactionId xid, char action, StringInfo s);
-static void stream_close_file(void);
static void send_feedback(XLogRecPtr recvpos, bool force, bool requestReply);
@@ -676,6 +675,8 @@ static void replorigin_reset(int code, Datum arg);
static bool send_internal_dependencies(ParallelApplyWorkerInfo *winfo,
StringInfo s);
+static bool build_dependency_with_last_committed_txn(ParallelApplyWorkerInfo *winfo);
+
/*
* Compute the hash value for entries in the replica_identity_table.
*/
@@ -1406,7 +1407,11 @@ handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
TransApplyAction apply_action;
StringInfoData original_msg;
- apply_action = get_transaction_apply_action(stream_xid, &winfo);
+ Assert(!in_streamed_transaction || TransactionIdIsValid(stream_xid));
+
+ apply_action = get_transaction_apply_action(in_streamed_transaction
+ ? stream_xid : remote_xid,
+ &winfo);
/* not in streaming mode */
if (apply_action == TRANS_LEADER_APPLY)
@@ -1415,8 +1420,6 @@ handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
return false;
}
- Assert(TransactionIdIsValid(stream_xid));
-
/*
* The parallel apply worker needs the xid in this message to decide
* whether to define a savepoint, so save the original message that has
@@ -1427,15 +1430,28 @@ handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
/*
* We should have received XID of the subxact as the first part of the
- * message, so extract it.
+ * message in streaming transactions, so extract it.
*/
- current_xid = pq_getmsgint(s, 4);
+ if (in_streamed_transaction)
+ current_xid = pq_getmsgint(s, 4);
+ else
+ current_xid = remote_xid;
if (!TransactionIdIsValid(current_xid))
ereport(ERROR,
(errcode(ERRCODE_PROTOCOL_VIOLATION),
errmsg_internal("invalid transaction ID in streamed replication transaction")));
+ handle_dependency_on_change(action, s, current_xid, winfo);
+
+ /*
+ * Re-fetch the latest apply action as it might have been changed during
+	 * the dependency check.
+ */
+ apply_action = get_transaction_apply_action(in_streamed_transaction
+ ? stream_xid : remote_xid,
+ &winfo);
+
switch (apply_action)
{
case TRANS_LEADER_SERIALIZE:
@@ -1839,17 +1855,71 @@ static void
apply_handle_begin(StringInfo s)
{
LogicalRepBeginData begin_data;
+ ParallelApplyWorkerInfo *winfo;
+ TransApplyAction apply_action;
+
+ /* Save the message before it is consumed. */
+ StringInfoData original_msg = *s;
/* There must not be an active streaming transaction. */
Assert(!TransactionIdIsValid(stream_xid));
logicalrep_read_begin(s, &begin_data);
- set_apply_error_context_xact(begin_data.xid, begin_data.final_lsn);
+
+ remote_xid = begin_data.xid;
+
+ set_apply_error_context_xact(remote_xid, begin_data.final_lsn);
remote_final_lsn = begin_data.final_lsn;
maybe_start_skipping_changes(begin_data.final_lsn);
+ pa_allocate_worker(remote_xid, false);
+
+ apply_action = get_transaction_apply_action(remote_xid, &winfo);
+
+ elog(DEBUG1, "new remote_xid %u", remote_xid);
+ switch (apply_action)
+ {
+ case TRANS_LEADER_APPLY:
+ break;
+
+ case TRANS_LEADER_SEND_TO_PARALLEL:
+ Assert(winfo);
+
+ if (pa_send_data(winfo, s->len, s->data))
+ {
+ pa_set_stream_apply_worker(winfo);
+ break;
+ }
+
+ /*
+ * Switch to serialize mode when we are not able to send the
+ * change to parallel apply worker.
+ */
+ pa_switch_to_partial_serialize(winfo, true);
+
+			/* fall through */
+ case TRANS_LEADER_PARTIAL_SERIALIZE:
+ Assert(winfo);
+
+ stream_write_change(LOGICAL_REP_MSG_BEGIN, &original_msg);
+
+ /* Cache the parallel apply worker for this transaction. */
+ pa_set_stream_apply_worker(winfo);
+ break;
+
+ case TRANS_PARALLEL_APPLY:
+ /* Hold the lock until the end of the transaction. */
+ pa_lock_transaction(MyParallelShared->xid, AccessExclusiveLock);
+ pa_set_xact_state(MyParallelShared, PARALLEL_TRANS_STARTED);
+ break;
+
+ default:
+ elog(ERROR, "unexpected apply action: %d", (int) apply_action);
+ break;
+ }
+
in_remote_transaction = true;
pgstat_report_activity(STATE_RUNNING, NULL);
@@ -1882,6 +1952,37 @@ send_internal_dependencies(ParallelApplyWorkerInfo *winfo, StringInfo s)
return false;
}
+/*
+ * Create a dependency between this transaction and the last committed one.
+ *
+ * This function ensures that the commit ordering handled by parallel apply
+ * workers is preserved. Returns false if we switched to the serialize mode to
+ * send the message, true otherwise.
+ */
+static bool
+build_dependency_with_last_committed_txn(ParallelApplyWorkerInfo *winfo)
+{
+ StringInfoData dependency_msg;
+ bool ret;
+
+	/* Skip if no transaction has been applied yet */
+ if (!TransactionIdIsValid(last_remote_xid))
+ return true;
+
+ /* Build the dependency message used to send to parallel apply worker */
+ initStringInfo(&dependency_msg);
+
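+	/* Same layout as write_internal_dependencies(), but with a single xid. */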
+ pq_sendbyte(&dependency_msg, PARALLEL_APPLY_INTERNAL_MESSAGE);
+ pq_sendbyte(&dependency_msg, LOGICAL_REP_MSG_INTERNAL_DEPENDENCY);
+ pq_sendint32(&dependency_msg, 1);
+ pq_sendint32(&dependency_msg, last_remote_xid);
+
+ ret = send_internal_dependencies(winfo, &dependency_msg);
+
+ pfree(dependency_msg.data);
+ return ret;
+}
+
/*
* Handle COMMIT message.
*
@@ -1891,6 +1992,11 @@ static void
apply_handle_commit(StringInfo s)
{
LogicalRepCommitData commit_data;
+ ParallelApplyWorkerInfo *winfo;
+ TransApplyAction apply_action;
+
+ /* Save the message before it is consumed. */
+ StringInfoData original_msg = *s;
logicalrep_read_commit(s, &commit_data);
@@ -1901,7 +2007,97 @@ apply_handle_commit(StringInfo s)
LSN_FORMAT_ARGS(commit_data.commit_lsn),
LSN_FORMAT_ARGS(remote_final_lsn))));
- apply_handle_commit_internal(&commit_data);
+ apply_action = get_transaction_apply_action(remote_xid, &winfo);
+
+ switch (apply_action)
+ {
+ case TRANS_LEADER_APPLY:
+ /*
+			 * Unlike parallelized transactions, we do not have to register
+			 * this transaction in parallelized_txns, because commit ordering
+			 * is always preserved for transactions applied by the leader.
+ */
+
+ /* Wait until the last transaction finishes */
+ if (TransactionIdIsValid(last_remote_xid))
+ pa_wait_for_depended_transaction(last_remote_xid);
+
+ apply_handle_commit_internal(&commit_data);
+
+ break;
+
+ case TRANS_LEADER_SEND_TO_PARALLEL:
+ Assert(winfo);
+
+ /*
+ * Mark this transaction as parallelized. This ensures that
+ * upcoming transactions wait until this transaction is committed.
+ */
+ pa_add_parallelized_transaction(remote_xid);
+
+ /*
+			 * Build a dependency between this transaction and the previously
+			 * committed transaction to preserve the commit order, then try
+			 * to send the COMMIT message if that succeeds.
+ */
+ if (build_dependency_with_last_committed_txn(winfo) &&
+ pa_send_data(winfo, s->len, s->data))
+ {
+ /* Finish processing the transaction. */
+ pa_xact_finish(winfo, commit_data.end_lsn);
+ break;
+ }
+
+ /*
+ * Switch to serialize mode when we are not able to send the
+ * change to parallel apply worker.
+ */
+ pa_switch_to_partial_serialize(winfo, true);
+			/* fall through */
+ case TRANS_LEADER_PARTIAL_SERIALIZE:
+ Assert(winfo);
+
+ stream_open_and_write_change(remote_xid, LOGICAL_REP_MSG_COMMIT,
+ &original_msg);
+
+ pa_set_fileset_state(winfo->shared, FS_SERIALIZE_DONE);
+
+ /* Finish processing the transaction. */
+ pa_xact_finish(winfo, commit_data.end_lsn);
+ break;
+
+ case TRANS_PARALLEL_APPLY:
+
+ /*
+ * If the parallel apply worker is applying spooled messages then
+ * close the file before committing.
+ */
+ if (stream_fd)
+ stream_close_file();
+
+ INJECTION_POINT("parallel-worker-before-commit", NULL);
+
+ apply_handle_commit_internal(&commit_data);
+
+ MyParallelShared->last_commit_end = XactLastCommitEnd;
+
+ pa_commit_transaction();
+
+ pa_unlock_transaction(remote_xid, AccessExclusiveLock);
+ break;
+
+ default:
+ elog(ERROR, "unexpected apply action: %d", (int) apply_action);
+ break;
+ }
+
+ /* Cache the remote_xid */
+ last_remote_xid = remote_xid;
+
+	elog(DEBUG1, "reset remote_xid %u", remote_xid);
+
+	remote_xid = InvalidTransactionId;
+	in_remote_transaction = false;
/*
* Process any tables that are being synchronized in parallel, as well as
@@ -2024,7 +2220,8 @@ apply_handle_prepare(StringInfo s)
* XactLastCommitEnd, and adding it for this purpose doesn't seems worth
* it.
*/
- store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr);
+ store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr,
+ InvalidTransactionId);
in_remote_transaction = false;
@@ -2084,7 +2281,8 @@ apply_handle_commit_prepared(StringInfo s)
CommitTransactionCommand();
pgstat_report_stat(false);
- store_flush_position(prepare_data.end_lsn, XactLastCommitEnd);
+ store_flush_position(prepare_data.end_lsn, XactLastCommitEnd,
+ InvalidTransactionId);
in_remote_transaction = false;
/*
@@ -2153,7 +2351,8 @@ apply_handle_rollback_prepared(StringInfo s)
* transaction because we always flush the WAL record for it. See
* apply_handle_prepare.
*/
- store_flush_position(rollback_data.rollback_end_lsn, InvalidXLogRecPtr);
+ store_flush_position(rollback_data.rollback_end_lsn, InvalidXLogRecPtr,
+ InvalidTransactionId);
in_remote_transaction = false;
/*
@@ -2215,7 +2414,8 @@ apply_handle_stream_prepare(StringInfo s)
* It is okay not to set the local_end LSN for the prepare because
* we always flush the prepare record. See apply_handle_prepare.
*/
- store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr);
+ store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr,
+ InvalidTransactionId);
in_remote_transaction = false;
@@ -2467,6 +2667,11 @@ apply_handle_stream_start(StringInfo s)
case TRANS_LEADER_PARTIAL_SERIALIZE:
Assert(winfo);
+ /*
+			 * TODO: the parallel apply worker could start to wait too soon
+			 * when processing an old stream start.
+ */
+
/*
* Open the spool file unless it was already opened when switching
* to serialize mode. The transaction started in
@@ -3194,7 +3399,8 @@ apply_handle_commit_internal(LogicalRepCommitData *commit_data)
pgstat_report_stat(false);
- store_flush_position(commit_data->end_lsn, XactLastCommitEnd);
+ store_flush_position(commit_data->end_lsn, XactLastCommitEnd,
+ InvalidTransactionId);
}
else
{
@@ -3227,6 +3433,9 @@ apply_handle_relation(StringInfo s)
/* Also reset all entries in the partition map that refer to remoterel. */
logicalrep_partmap_reset_relmap(rel);
+
+ if (am_leader_apply_worker())
+ pa_distribute_schema_changes_to_workers(rel);
}
/*
@@ -4001,6 +4210,8 @@ FindDeletedTupleInLocalRel(Relation localrel, Oid localidxoid,
/*
* This handles insert, update, delete on a partitioned table.
+ *
+ * TODO: support parallel apply.
*/
static void
apply_handle_tuple_routing(ApplyExecutionData *edata,
@@ -4551,6 +4762,10 @@ apply_dispatch(StringInfo s)
* check which entries on it are already locally flushed. Those we can report
* as having been flushed.
*
+ * For non-streaming transactions managed by a parallel apply worker, we will
+ * get the local commit end from the shared parallel apply worker info once the
+ * transaction has been committed by the worker.
+ *
* The have_pending_txes is true if there are outstanding transactions that
* need to be flushed.
*/
@@ -4560,6 +4775,7 @@ get_flush_position(XLogRecPtr *write, XLogRecPtr *flush,
{
dlist_mutable_iter iter;
XLogRecPtr local_flush = GetFlushRecPtr(NULL);
+ List *committed_pa_xid = NIL;
*write = InvalidXLogRecPtr;
*flush = InvalidXLogRecPtr;
@@ -4569,6 +4785,36 @@ get_flush_position(XLogRecPtr *write, XLogRecPtr *flush,
FlushPosition *pos =
dlist_container(FlushPosition, node, iter.cur);
+ if (TransactionIdIsValid(pos->pa_remote_xid) &&
+ XLogRecPtrIsInvalid(pos->local_end))
+ {
+ bool skipped_write;
+
+ pos->local_end = pa_get_last_commit_end(pos->pa_remote_xid, true,
+ &skipped_write);
+
+ elog(DEBUG1,
+ "got commit end from parallel apply worker, "
+ "txn: %u, remote_end %X/%X, local_end %X/%X",
+ pos->pa_remote_xid, LSN_FORMAT_ARGS(pos->remote_end),
+ LSN_FORMAT_ARGS(pos->local_end));
+
+ /*
+ * Break the loop if the worker has not finished applying the
+ * transaction. There's no need to check subsequent transactions,
+ * as they must commit after the current transaction being
+ * examined and thus won't have their commit end available yet.
+ */
+ if (!skipped_write && XLogRecPtrIsInvalid(pos->local_end))
+ break;
+
+ committed_pa_xid = lappend_xid(committed_pa_xid, pos->pa_remote_xid);
+ }
+
+ /*
+		 * The worker has finished applying, or the transaction was applied
+		 * by the leader apply worker itself.
+ */
*write = pos->remote_end;
if (pos->local_end <= local_flush)
@@ -4577,29 +4823,19 @@ get_flush_position(XLogRecPtr *write, XLogRecPtr *flush,
dlist_delete(iter.cur);
pfree(pos);
}
- else
- {
- /*
- * Don't want to uselessly iterate over the rest of the list which
- * could potentially be long. Instead get the last element and
- * grab the write position from there.
- */
- pos = dlist_tail_element(FlushPosition, node,
- &lsn_mapping);
- *write = pos->remote_end;
- *have_pending_txes = true;
- return;
- }
}
*have_pending_txes = !dlist_is_empty(&lsn_mapping);
+
+ cleanup_replica_identity_table(committed_pa_xid);
}
/*
* Store current remote/local lsn pair in the tracking list.
*/
void
-store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn)
+store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn,
+ TransactionId remote_xid)
{
FlushPosition *flushpos;
@@ -4617,6 +4853,7 @@ store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn)
flushpos = palloc_object(FlushPosition);
flushpos->local_end = local_lsn;
flushpos->remote_end = remote_lsn;
+ flushpos->pa_remote_xid = remote_xid;
dlist_push_tail(&lsn_mapping, &flushpos->node);
MemoryContextSwitchTo(ApplyMessageContext);
@@ -6064,7 +6301,7 @@ stream_cleanup_files(Oid subid, TransactionId xid)
* changes for this transaction, create the buffile, otherwise open the
* previously created file.
*/
-static void
+void
stream_open_file(Oid subid, TransactionId xid, bool first_segment)
{
char path[MAXPGPATH];
@@ -6109,7 +6346,7 @@ stream_open_file(Oid subid, TransactionId xid, bool first_segment)
* stream_close_file
* Close the currently open file with streamed changes.
*/
-static void
+void
stream_close_file(void)
{
Assert(stream_fd != NULL);
@@ -6157,7 +6394,7 @@ stream_write_change(char action, StringInfo s)
* target file if not already before writing the message and close the file at
* the end.
*/
-static void
+void
stream_open_and_write_change(TransactionId xid, char action, StringInfo s)
{
Assert(!in_streamed_transaction);
diff --git a/src/include/replication/logicalproto.h b/src/include/replication/logicalproto.h
index 5d91e2a4287..7d2aaf2d389 100644
--- a/src/include/replication/logicalproto.h
+++ b/src/include/replication/logicalproto.h
@@ -253,6 +253,8 @@ extern void logicalrep_write_message(StringInfo out, TransactionId xid, XLogRecP
extern void logicalrep_write_rel(StringInfo out, TransactionId xid,
Relation rel, Bitmapset *columns,
PublishGencolsType include_gencols_type);
+extern void logicalrep_write_internal_rel(StringInfo out,
+ LogicalRepRelation *rel);
extern LogicalRepRelation *logicalrep_read_rel(StringInfo in);
extern void logicalrep_write_typ(StringInfo out, TransactionId xid,
Oid typoid);
diff --git a/src/include/replication/logicalrelation.h b/src/include/replication/logicalrelation.h
index 4b321bd2ad2..34a7069e9e5 100644
--- a/src/include/replication/logicalrelation.h
+++ b/src/include/replication/logicalrelation.h
@@ -52,6 +52,8 @@ extern void logicalrep_rel_close(LogicalRepRelMapEntry *rel,
LOCKMODE lockmode);
extern bool IsIndexUsableForReplicaIdentityFull(Relation idxrel, AttrMap *attrmap);
extern Oid GetRelationIdentityOrPK(Relation rel);
+extern int logicalrep_get_num_rels(void);
+extern void logicalrep_write_all_rels(StringInfo out);
extern LogicalRepRelMapEntry *logicalrep_get_relentry(LogicalRepRelId remoteid);
#endif /* LOGICALRELATION_H */
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 78b5667cebe..5371ee767f1 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -314,6 +314,10 @@ extern void apply_dispatch(StringInfo s);
extern void maybe_reread_subscription(void);
extern void stream_cleanup_files(Oid subid, TransactionId xid);
+extern void stream_open_file(Oid subid, TransactionId xid, bool first_segment);
+extern void stream_close_file(void);
+extern void stream_open_and_write_change(TransactionId xid, char action,
+ StringInfo s);
extern void set_stream_options(WalRcvStreamOptions *options,
char *slotname,
@@ -327,7 +331,8 @@ extern void SetupApplyOrSyncWorker(int worker_slot);
extern void DisableSubscriptionAndExit(void);
-extern void store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn);
+extern void store_flush_position(XLogRecPtr remote_lsn, XLogRecPtr local_lsn,
+ TransactionId remote_xid);
/* Function for apply error callback */
extern void apply_error_callback(void *arg);
@@ -342,6 +347,7 @@ extern void pa_detach_all_error_mq(void);
extern bool pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes,
const void *data);
+extern void pa_distribute_schema_changes_to_workers(LogicalRepRelation *rel);
extern void pa_switch_to_partial_serialize(ParallelApplyWorkerInfo *winfo,
bool stream_locked);
@@ -368,8 +374,9 @@ extern void pa_xact_finish(ParallelApplyWorkerInfo *winfo,
XLogRecPtr remote_lsn);
extern bool pa_transaction_committed(TransactionId xid);
extern void pa_record_dependency_on_transactions(List *depends_on_xids);
-
+extern void pa_commit_transaction(void);
extern void pa_wait_for_depended_transaction(TransactionId xid);
+extern void pa_add_parallelized_transaction(TransactionId xid);
#define isParallelApplyWorker(worker) ((worker)->in_use && \
(worker)->type == WORKERTYPE_PARALLEL_APPLY)
diff --git a/src/test/subscription/meson.build b/src/test/subscription/meson.build
index 85d10a89994..e877ca09c30 100644
--- a/src/test/subscription/meson.build
+++ b/src/test/subscription/meson.build
@@ -46,6 +46,7 @@ tests += {
't/034_temporal.pl',
't/035_conflicts.pl',
't/036_sequences.pl',
+ 't/050_parallel_apply.pl',
't/100_bugs.pl',
],
},
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index ecb79e79474..0ccec516a18 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -16,6 +16,8 @@ $node_publisher->start;
# Create subscriber node
my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf',
+ "max_logical_replication_workers = 10");
$node_subscriber->start;
# Create some preexisting content on publisher
diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl
index 3d16c2a800d..c2fba0b9a9c 100644
--- a/src/test/subscription/t/010_truncate.pl
+++ b/src/test/subscription/t/010_truncate.pl
@@ -17,7 +17,7 @@ $node_publisher->start;
my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_subscriber->init;
$node_subscriber->append_conf('postgresql.conf',
- qq(max_logical_replication_workers = 6));
+ qq(max_logical_replication_workers = 7));
$node_subscriber->start;
my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
diff --git a/src/test/subscription/t/015_stream.pl b/src/test/subscription/t/015_stream.pl
index 03135b1cd6e..e79ddd9a41c 100644
--- a/src/test/subscription/t/015_stream.pl
+++ b/src/test/subscription/t/015_stream.pl
@@ -232,6 +232,12 @@ $node_subscriber->wait_for_log(
$node_publisher->safe_psql('postgres', "INSERT INTO test_tab_2 values(1)");
+# FIXME: Currently, non-streaming transactions are applied in parallel by
+# default. So, the first transaction is handled by a parallel apply worker. To
+# trigger the deadlock, initiate one more transaction to be applied by the
+# leader.
+$node_publisher->safe_psql('postgres', "INSERT INTO test_tab_2 values(1)");
+
$h->query_safe('COMMIT');
$h->quit;
@@ -247,7 +253,7 @@ $node_publisher->wait_for_catchup($appname);
$result =
$node_subscriber->safe_psql('postgres', "SELECT count(*) FROM test_tab_2");
-is($result, qq(5001), 'data replicated to subscriber after dropping index');
+is($result, qq(5002), 'data replicated to subscriber after dropping index');
# Clean up test data from the environment.
$node_publisher->safe_psql('postgres', "TRUNCATE TABLE test_tab_2");
diff --git a/src/test/subscription/t/026_stats.pl b/src/test/subscription/t/026_stats.pl
index a430ab4feec..58e34839ab4 100644
--- a/src/test/subscription/t/026_stats.pl
+++ b/src/test/subscription/t/026_stats.pl
@@ -16,6 +16,7 @@ $node_publisher->start;
# Create subscriber node.
my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf', "max_logical_replication_workers = 10");
$node_subscriber->start;
diff --git a/src/test/subscription/t/027_nosuperuser.pl b/src/test/subscription/t/027_nosuperuser.pl
index 691731743df..e0c1d213800 100644
--- a/src/test/subscription/t/027_nosuperuser.pl
+++ b/src/test/subscription/t/027_nosuperuser.pl
@@ -86,6 +86,7 @@ $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
$node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
$node_publisher->init(allows_streaming => 'logical');
$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf', "max_logical_replication_workers = 10");
$node_publisher->start;
$node_subscriber->start;
$publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
diff --git a/src/test/subscription/t/050_parallel_apply.pl b/src/test/subscription/t/050_parallel_apply.pl
new file mode 100644
index 00000000000..69cf48cb7ac
--- /dev/null
+++ b/src/test/subscription/t/050_parallel_apply.pl
@@ -0,0 +1,130 @@
+
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# This tests that dependency tracking between transactions works correctly
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+# Initialize publisher node
+my $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->start;
+
+# Insert initial data
+$node_publisher->safe_psql('postgres',
+ "CREATE TABLE regress_tab (id int PRIMARY KEY, value text);");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (generate_series(1, 10), 'test');");
+
+# Create a publication
+$node_publisher->safe_psql('postgres',
+ "CREATE PUBLICATION regress_pub FOR ALL TABLES;");
+
+# Initialize subscriber node
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
+$node_subscriber->init;
+$node_subscriber->append_conf('postgresql.conf', "log_min_messages = debug1");
+$node_subscriber->append_conf('postgresql.conf',
+ "max_logical_replication_workers = 10");
+$node_subscriber->start;
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node_subscriber->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+$node_subscriber->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+
+# Create a subscription
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+
+$node_subscriber->safe_psql('postgres',
+ "CREATE TABLE regress_tab (id int PRIMARY KEY, value text);");
+$node_subscriber->safe_psql('postgres',
+ "CREATE SUBSCRIPTION regress_sub CONNECTION '$publisher_connstr' PUBLICATION regress_pub;");
+
+# Wait for initial table sync to finish
+$node_subscriber->wait_for_subscription_sync($node_publisher, 'regress_sub');
+
+# Insert tuples on publisher
+#
+# XXX This may not be enough to launch a parallel apply worker, because
+# table_states_not_ready is not discarded yet.
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (generate_series(11, 20), 'test');");
+$node_publisher->wait_for_catchup('regress_sub');
+
+# Insert tuples again
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (generate_series(21, 30), 'test');");
+$node_publisher->wait_for_catchup('regress_sub');
+
+# Verify the parallel apply worker is launched
+my $result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(1) FROM pg_stat_activity WHERE backend_type = 'logical replication parallel worker'");
+is($result, '1', "parallel apply worker is launched by a non-streamed transaction");
+
+# Attach an injection point. Parallel workers will wait just before committing.
+$node_subscriber->safe_psql('postgres',
+ "SELECT injection_points_attach('parallel-worker-before-commit','wait');"
+);
+
+# Insert tuples on publisher
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (generate_series(31, 40), 'test');");
+
+# Wait until the parallel worker enters the injection point.
+$node_subscriber->wait_for_event('logical replication parallel worker',
+ 'parallel-worker-before-commit');
+
+my $offset = -s $node_subscriber->logfile;
+
+# Insert tuples on publisher again. This transaction is independent of the
+# previous one, but the parallel worker will wait until it finishes.
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (generate_series(41, 50), 'test');");
+
+# Verify the parallel worker waits for the transaction
+$node_subscriber->wait_for_log(qr/wait for depended xid [1-9][0-9]+/, $offset);
+
+# wait_for_log() returns a log offset, so slurp the log from $offset to
+# capture the xid of the depended-on transaction.
+my $log = PostgreSQL::Test::Utils::slurp_file($node_subscriber->logfile, $offset);
+my ($xid) = $log =~ /wait for depended xid ([1-9][0-9]+)/;
+
+# Update tuples that have not been applied yet on the subscriber because the
+# parallel worker stops at the injection point. The newly assigned worker also
+# waits for the same transaction as above.
+$node_publisher->safe_psql('postgres',
+ "UPDATE regress_tab SET value = 'updated' WHERE id BETWEEN 31 AND 35;");
+
+# Verify the parallel worker waits for the same transaction
+$node_subscriber->wait_for_log(qr/wait for depended xid $xid/, $offset);
+
+# Wake up the parallel worker. We detach first so as not to stop other parallel workers
+$node_subscriber->safe_psql('postgres', qq[
+ SELECT injection_points_detach('parallel-worker-before-commit');
+ SELECT injection_points_wakeup('parallel-worker-before-commit');
+]);
+
+# Verify the parallel worker wakes up
+$node_subscriber->wait_for_log(qr/finish waiting for depended xid $xid/, $offset);
+
+$result =
+ $node_subscriber->safe_psql('postgres', "SELECT count(1) FROM regress_tab");
+is ($result, 50, 'inserts are replicated to subscriber');
+
+$result =
+ $node_subscriber->safe_psql('postgres',
+ "SELECT count(1) FROM regress_tab WHERE value = 'updated'");
+is ($result, 5, 'updates are also replicated to subscriber');
+
+done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index ceb3fc5d980..80810793746 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2089,6 +2089,7 @@ ParallelHashGrowth
ParallelHashJoinBatch
ParallelHashJoinBatchAccessor
ParallelHashJoinState
+ParallelizedTxnEntry
ParallelIndexScanDesc
ParallelSlot
ParallelSlotArray
@@ -2573,6 +2574,8 @@ ReparameterizeForeignPathByChild_function
ReplaceVarsFromTargetList_context
ReplaceVarsNoMatchOption
ReplaceWrapOption
+ReplicaIdentityEntry
+ReplicaIdentityKey
ReplicaIdentityStmt
ReplicationKind
ReplicationSlot
@@ -4082,6 +4085,7 @@ rendezvousHashEntry
rep
replace_rte_variables_callback
replace_rte_variables_context
+replica_identity_hash
report_error_fn
ret_type
rewind_source
--
2.47.3
Attachment: v7-0005-support-2PC.patch (application/octet-stream)
From 99fe54601c3c28eef019776fd21bf765100446b6 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Tue, 2 Dec 2025 13:01:26 +0900
Subject: [PATCH v7 5/8] support 2PC
This patch allows prepared transactions to be applied in parallel. A parallel
apply worker is assigned to a transaction when BEGIN_PREPARE is received; this
part and the dependency-waiting mechanism are the same as for a normal
transaction.
A parallel worker can be freed once it has handled the PREPARE message, and the
prepared transaction can be removed from parallelized_txns at that point;
upcoming transactions wait until then.
COMMIT PREPARED/ROLLBACK PREPARED are resolved by the leader apply worker.
Since they are serialized automatically, such transactions are not added to
parallelized_txns.
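For reference, a minimal publisher-side flow that this patch parallelizes,
mirroring the TAP test added below (the table and GID names are illustrative,
and the subscription is assumed to have two_phase = on):

    BEGIN;
    INSERT INTO tab VALUES (1);
    PREPARE TRANSACTION 'gid1';   -- applied by a parallel apply worker
    ...
    COMMIT PREPARED 'gid1';       -- resolved by the leader apply worker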
---
src/backend/replication/logical/worker.c | 238 +++++++++++++++---
src/test/subscription/t/050_parallel_apply.pl | 60 +++++
2 files changed, 270 insertions(+), 28 deletions(-)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 3832481647e..99a0aeb1757 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -756,6 +756,14 @@ cleanup_replica_identity_table(List *committed_xid)
if (!committed_xid)
return;
+ /*
+ * Skip if the replica_identity_table is not initialized yet. This can
+ * happen if an empty transaction was replicated and a parallel apply
+ * worker was launched. See comments in apply_handle_prepare().
+ */
+ if (!replica_identity_table)
+ return;
+
replica_identity_start_iterate(replica_identity_table, &i);
while ((rientry = replica_identity_iterate(replica_identity_table, &i)) != NULL)
{
@@ -2116,6 +2124,11 @@ static void
apply_handle_begin_prepare(StringInfo s)
{
LogicalRepPreparedTxnData begin_data;
+ ParallelApplyWorkerInfo *winfo;
+ TransApplyAction apply_action;
+
+ /* Save the message before it is consumed. */
+ StringInfoData original_msg = *s;
/* Tablesync should never receive prepare. */
if (am_tablesync_worker())
@@ -2127,12 +2140,61 @@ apply_handle_begin_prepare(StringInfo s)
Assert(!TransactionIdIsValid(stream_xid));
logicalrep_read_begin_prepare(s, &begin_data);
- set_apply_error_context_xact(begin_data.xid, begin_data.prepare_lsn);
+
+ remote_xid = begin_data.xid;
+
+ set_apply_error_context_xact(remote_xid, begin_data.prepare_lsn);
remote_final_lsn = begin_data.prepare_lsn;
maybe_start_skipping_changes(begin_data.prepare_lsn);
+ pa_allocate_worker(remote_xid, false);
+
+ apply_action = get_transaction_apply_action(remote_xid, &winfo);
+
+ elog(DEBUG1, "new remote_xid %u", remote_xid);
+ switch (apply_action)
+ {
+ case TRANS_LEADER_APPLY:
+ break;
+
+ case TRANS_LEADER_SEND_TO_PARALLEL:
+ Assert(winfo);
+
+ if (pa_send_data(winfo, s->len, s->data))
+ {
+ pa_set_stream_apply_worker(winfo);
+ break;
+ }
+
+ /*
+ * Switch to serialize mode when we are not able to send the
+ * change to parallel apply worker.
+ */
+ pa_switch_to_partial_serialize(winfo, true);
+
+/* fall through */
+ case TRANS_LEADER_PARTIAL_SERIALIZE:
+ Assert(winfo);
+
+ stream_write_change(LOGICAL_REP_MSG_BEGIN_PREPARE, &original_msg);
+
+ /* Cache the parallel apply worker for this transaction. */
+ pa_set_stream_apply_worker(winfo);
+ break;
+
+ case TRANS_PARALLEL_APPLY:
+ /* Hold the lock until the end of the transaction. */
+ pa_lock_transaction(MyParallelShared->xid, AccessExclusiveLock);
+ pa_set_xact_state(MyParallelShared, PARALLEL_TRANS_STARTED);
+ break;
+
+ default:
+ elog(ERROR, "unexpected apply action: %d", (int) apply_action);
+ break;
+ }
+
in_remote_transaction = true;
pgstat_report_activity(STATE_RUNNING, NULL);
@@ -2182,6 +2244,11 @@ static void
apply_handle_prepare(StringInfo s)
{
LogicalRepPreparedTxnData prepare_data;
+ ParallelApplyWorkerInfo *winfo;
+ TransApplyAction apply_action;
+
+ /* Save the message before it is consumed. */
+ StringInfoData original_msg = *s;
logicalrep_read_prepare(s, &prepare_data);
@@ -2192,36 +2259,136 @@ apply_handle_prepare(StringInfo s)
LSN_FORMAT_ARGS(prepare_data.prepare_lsn),
LSN_FORMAT_ARGS(remote_final_lsn))));
- /*
- * Unlike commit, here, we always prepare the transaction even though no
- * change has happened in this transaction or all changes are skipped. It
- * is done this way because at commit prepared time, we won't know whether
- * we have skipped preparing a transaction because of those reasons.
- *
- * XXX, We can optimize such that at commit prepared time, we first check
- * whether we have prepared the transaction or not but that doesn't seem
- * worthwhile because such cases shouldn't be common.
- */
- begin_replication_step();
+ apply_action = get_transaction_apply_action(remote_xid, &winfo);
- apply_handle_prepare_internal(&prepare_data);
+ switch (apply_action)
+ {
+ case TRANS_LEADER_APPLY:
+ /*
+ * Unlike commit, here, we always prepare the transaction even
+ * though no change has happened in this transaction or all changes
+ * are skipped. It is done this way because at commit prepared
+ * time, we won't know whether we have skipped preparing a
+ * transaction because of those reasons.
+ *
+ * XXX, We can optimize such that at commit prepared time, we first
+ * check whether we have prepared the transaction or not but that
+ * doesn't seem worthwhile because such cases shouldn't be common.
+ */
+ begin_replication_step();
- end_replication_step();
- CommitTransactionCommand();
- pgstat_report_stat(false);
+ /* Wait until the last transaction finishes */
+ if (TransactionIdIsValid(last_remote_xid))
+ pa_wait_for_depended_transaction(last_remote_xid);
- /*
- * It is okay not to set the local_end LSN for the prepare because we
- * always flush the prepare record. So, we can send the acknowledgment of
- * the remote_end LSN as soon as prepare is finished.
- *
- * XXX For the sake of consistency with commit, we could have set it with
- * the LSN of prepare but as of now we don't track that value similar to
- * XactLastCommitEnd, and adding it for this purpose doesn't seems worth
- * it.
- */
- store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr,
- InvalidTransactionId);
+ apply_handle_prepare_internal(&prepare_data);
+
+ end_replication_step();
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ /*
+ * It is okay not to set the local_end LSN for the prepare because
+ * we always flush the prepare record. So, we can send the
+ * acknowledgment of the remote_end LSN as soon as prepare is
+ * finished.
+ *
+ * XXX For the sake of consistency with commit, we could have set
+ * it with the LSN of prepare but as of now we don't track that
+ * value similar to XactLastCommitEnd, and adding it for this
+ * purpose doesn't seem worth it.
+ */
+ store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr,
+ InvalidTransactionId);
+
+ break;
+
+ case TRANS_LEADER_SEND_TO_PARALLEL:
+ Assert(winfo);
+
+ /*
+ * Mark this transaction as parallelized. This ensures that
+ * upcoming transactions wait until this transaction is committed.
+ */
+ pa_add_parallelized_transaction(remote_xid);
+
+ /*
+ * Build a dependency between this transaction and the last committed
+ * transaction to preserve the commit order. Then, if that succeeds, try
+ * to send the PREPARE message.
+ */
+ if (build_dependency_with_last_committed_txn(winfo) &&
+ pa_send_data(winfo, s->len, s->data))
+ {
+ /* Finish processing the transaction. */
+ pa_xact_finish(winfo, prepare_data.end_lsn);
+ break;
+ }
+
+ /*
+ * Switch to serialize mode when we are not able to send the
+ * change to parallel apply worker.
+ */
+ pa_switch_to_partial_serialize(winfo, true);
+/* fall through */
+ case TRANS_LEADER_PARTIAL_SERIALIZE:
+ Assert(winfo);
+
+ stream_open_and_write_change(remote_xid, LOGICAL_REP_MSG_PREPARE,
+ &original_msg);
+
+ pa_set_fileset_state(winfo->shared, FS_SERIALIZE_DONE);
+
+ /* Finish processing the transaction. */
+ pa_xact_finish(winfo, prepare_data.end_lsn);
+ break;
+
+ case TRANS_PARALLEL_APPLY:
+
+ /*
+ * If the parallel apply worker is applying spooled messages then
+ * close the file before committing.
+ */
+ if (stream_fd)
+ stream_close_file();
+
+ begin_replication_step();
+
+ INJECTION_POINT("parallel-worker-before-prepare", NULL);
+
+ /* Mark the transaction as prepared. */
+ apply_handle_prepare_internal(&prepare_data);
+
+ end_replication_step();
+
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+
+ store_flush_position(prepare_data.end_lsn, InvalidXLogRecPtr,
+ InvalidTransactionId);
+
+ /*
+ * It is okay not to set the local_end LSN for the prepare because
+ * we always flush the prepare record. See apply_handle_prepare.
+ */
+ MyParallelShared->last_commit_end = InvalidXLogRecPtr;
+ pa_commit_transaction();
+
+ pa_unlock_transaction(MyParallelShared->xid, AccessExclusiveLock);
+
+ pa_reset_subtrans();
+ break;
+
+ default:
+ elog(ERROR, "unexpected apply action: %d", (int) apply_action);
+ break;
+ }
+
+ /* Cache the remote_xid */
+ last_remote_xid = remote_xid;
+
+ remote_xid = InvalidTransactionId;
in_remote_transaction = false;
@@ -2269,6 +2436,9 @@ apply_handle_commit_prepared(StringInfo s)
/* There is no transaction when COMMIT PREPARED is called */
begin_replication_step();
+ if (TransactionIdIsValid(last_remote_xid))
+ pa_wait_for_depended_transaction(last_remote_xid);
+
/*
* Update origin state so we can restart streaming from correct position
* in case of crash.
@@ -2281,6 +2451,14 @@ apply_handle_commit_prepared(StringInfo s)
CommitTransactionCommand();
pgstat_report_stat(false);
+ /*
+ * No need to update last_remote_xid because the leader worker applied
+ * the message, so upcoming transactions preserve the order automatically.
+ * Let's set the xid to an invalid value to skip sending the
+ * INTERNAL_DEPENDENCY message.
+ */
+ last_remote_xid = InvalidTransactionId;
+
store_flush_position(prepare_data.end_lsn, XactLastCommitEnd,
InvalidTransactionId);
in_remote_transaction = false;
@@ -2337,6 +2515,10 @@ apply_handle_rollback_prepared(StringInfo s)
/* There is no transaction when ABORT/ROLLBACK PREPARED is called */
begin_replication_step();
+
+ if (TransactionIdIsValid(last_remote_xid))
+ pa_wait_for_depended_transaction(last_remote_xid);
+
FinishPreparedTransaction(gid, false);
end_replication_step();
CommitTransactionCommand();
diff --git a/src/test/subscription/t/050_parallel_apply.pl b/src/test/subscription/t/050_parallel_apply.pl
index 69cf48cb7ac..15973f7d0e0 100644
--- a/src/test/subscription/t/050_parallel_apply.pl
+++ b/src/test/subscription/t/050_parallel_apply.pl
@@ -17,6 +17,8 @@ if ($ENV{enable_injection_points} ne 'yes')
# Initialize publisher node
my $node_publisher = PostgreSQL::Test::Cluster->new('publisher');
$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->append_conf('postgresql.conf',
+ "max_prepared_transactions = 10");
$node_publisher->start;
# Insert initial data
@@ -35,6 +37,8 @@ $node_subscriber->init;
$node_subscriber->append_conf('postgresql.conf', "log_min_messages = debug1");
$node_subscriber->append_conf('postgresql.conf',
"max_logical_replication_workers = 10");
+$node_subscriber->append_conf('postgresql.conf',
+ "max_prepared_transactions = 10");
$node_subscriber->start;
# Check if the extension injection_points is available, as it may be
@@ -127,4 +131,60 @@ $result =
"SELECT count(1) FROM regress_tab WHERE value = 'updated'");
is ($result, 5, 'updates are also replicated to subscriber');
+# Ensure prepared transactions are also applied in parallel
+
+$node_subscriber->safe_psql('postgres',
+ "ALTER SUBSCRIPTION regress_sub DISABLE;");
+$node_subscriber->poll_query_until('postgres',
+ "SELECT count(*) = 0 FROM pg_stat_activity WHERE backend_type = 'logical replication apply worker'"
+);
+$node_subscriber->safe_psql(
+ 'postgres', "
+ ALTER SUBSCRIPTION regress_sub SET (two_phase = on);
+ ALTER SUBSCRIPTION regress_sub ENABLE;");
+
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(1) FROM pg_stat_activity WHERE backend_type = 'logical replication parallel worker'");
+is($result, '0', "no parallel apply workers exist after restart");
+
+# Attach an injection_point. Parallel workers would wait before the prepare
+$node_subscriber->safe_psql('postgres',
+ "SELECT injection_points_attach('parallel-worker-before-prepare','wait');"
+);
+
+# PREPARE a transaction on publisher. It would be handled by a parallel apply
+# worker.
+$node_publisher->safe_psql('postgres', qq[
+ BEGIN;
+ INSERT INTO regress_tab VALUES (generate_series(51, 60), 'prepare');
+ PREPARE TRANSACTION 'regress_prepare';
+]);
+
+# Wait until the parallel worker enters the injection point.
+$node_subscriber->wait_for_event('logical replication parallel worker',
+ 'parallel-worker-before-prepare');
+
+$offset = -s $node_subscriber->logfile;
+
+# Insert tuples on publisher again. This transaction waits for the prepared
+# transaction
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (generate_series(61, 70), 'test');");
+
+# Verify the parallel worker waits for the transaction
+$str = $node_subscriber->wait_for_log(qr/wait for depended xid ([1-9][0-9]+)/, $offset);
+($xid) = $str =~ /wait for depended xid ([1-9][0-9]+)/;
+
+# Wakeup the parallel worker
+$node_subscriber->safe_psql('postgres', qq[
+ SELECT injection_points_detach('parallel-worker-before-prepare');
+ SELECT injection_points_wakeup('parallel-worker-before-prepare');
+]);
+
+$node_subscriber->wait_for_log(qr/finish waiting for depended xid $xid/, $offset);
+
+# COMMIT the prepared transaction. It is always handled by the leader
+$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'regress_prepare';");
+$node_publisher->wait_for_catchup('regress_sub');
+
done_testing();
--
2.47.3
Attachment: v7-0006-Track-dependencies-for-streamed-transactions.patch (application/octet-stream)
From db1f78b4fe74e7419f2c7f0a4959207376494c9f Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Thu, 4 Dec 2025 20:55:26 +0900
Subject: [PATCH v7 6/8] Track dependencies for streamed transactions
This commit allows tracking dependencies of streamed transactions.
In the streaming=on case, dependency tracking is enabled while applying
spooled changes from files.
In the streaming=parallel case, dependency tracking is performed when the
leader sends changes to parallel workers. Unlike non-streamed transactions, the
leader waits for parallel workers until the assigned transactions are finished
at COMMIT/PREPARE/ABORT; thus, the XID of a streamed transaction is not cached
as the last handled one. Also, streamed transactions are not recorded as
parallelized transactions because upcoming workers do not have to wait for them.
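As an illustrative sketch (not taken from the patch), a transaction goes
through this path only once it exceeds logical_decoding_work_mem on the
publisher, e.g. with the 64kB setting used in the TAP test below:

    BEGIN;
    UPDATE tab SET value = 'x' WHERE id < 10;  -- may depend on an in-flight transaction
    -- large enough to exceed the limit, so the transaction is streamed:
    INSERT INTO tab VALUES (generate_series(100, 5100), 'streamed');
    COMMIT;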
---
.../replication/logical/applyparallelworker.c | 19 +++++-
src/backend/replication/logical/worker.c | 66 +++++++++++++++++--
src/include/replication/worker_internal.h | 2 +-
src/test/subscription/t/050_parallel_apply.pl | 47 +++++++++++++
4 files changed, 126 insertions(+), 8 deletions(-)
diff --git a/src/backend/replication/logical/applyparallelworker.c b/src/backend/replication/logical/applyparallelworker.c
index 5b6267c6047..bb66d64582c 100644
--- a/src/backend/replication/logical/applyparallelworker.c
+++ b/src/backend/replication/logical/applyparallelworker.c
@@ -168,7 +168,14 @@
* key) as another ongoing transaction (see handle_dependency_on_change for
* details). If so, the leader sends a list of dependent transaction IDs to the
* parallel worker, indicating that the parallel apply worker must wait for
- * these transactions to commit before proceeding.
+ * these transactions to commit before proceeding. If a transaction is streamed
+ * but the leader decides not to assign a parallel apply worker, dependencies
+ * are verified when the transaction is committed.
+ *
+ * Non-streaming transactions
+ * ======================
+ * The handling is similar to that of streaming transactions, with a few
+ * differences:
*
* Commit order
* ------------
@@ -1635,6 +1642,12 @@ pa_set_stream_apply_worker(ParallelApplyWorkerInfo *winfo)
stream_apply_worker = winfo;
}
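+/*
+ * Return true if no stream apply worker is currently set.
+ */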
+bool
+pa_stream_apply_worker_is_null(void)
+{
+ return stream_apply_worker == NULL;
+}
+
/*
* Form a unique savepoint name for the streaming transaction.
*
@@ -1720,6 +1733,10 @@ pa_stream_abort(LogicalRepStreamAbortData *abort_data)
TransactionId xid = abort_data->xid;
TransactionId subxid = abort_data->subxid;
+ /* Streamed transactions won't be registered */
+ Assert(!dshash_find(parallelized_txns, &xid, false) &&
+ !dshash_find(parallelized_txns, &subxid, false));
+
/*
* Update origin state so we can restart streaming from correct position
* in case of crash.
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 99a0aeb1757..83bcf216fd3 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -969,13 +969,26 @@ check_dependency_on_replica_identity(Oid relid,
&rientry->remote_xid,
new_depended_xid);
+ /*
+ * Remove the entry if it is registered for a streamed transaction. We
+ * do not have to register an entry for them; the leader worker always
+ * waits until the parallel worker finishes handling streamed transactions,
+ * so there is no need to consider the possibility that upcoming parallel
+ * workers would go ahead.
+ */
+ if (TransactionIdIsValid(stream_xid) && !found)
+ {
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+ }
+
/*
* Update the new depended xid into the entry if valid, the new xid could
* be invalid if the transaction will be applied by the leader itself
* which means all the changes will be committed before processing next
* transaction, so no need to be depended on.
*/
- if (TransactionIdIsValid(new_depended_xid))
+ else if (TransactionIdIsValid(new_depended_xid))
rientry->remote_xid = new_depended_xid;
/*
@@ -1089,8 +1102,11 @@ handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
*/
StringInfoData change = *s;
- /* Compute dependency only for non-streaming transaction */
- if (in_streamed_transaction || (winfo && winfo->stream_txn))
+ /*
+ * Skip if we are handling streaming transactions but changes are not
+ * applied yet.
+ */
+ if (pa_stream_apply_worker_is_null() && in_streamed_transaction)
return;
/* Only the leader checks dependencies and schedules the parallel apply */
@@ -1450,7 +1466,18 @@ handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)
(errcode(ERRCODE_PROTOCOL_VIOLATION),
errmsg_internal("invalid transaction ID in streamed replication transaction")));
- handle_dependency_on_change(action, s, current_xid, winfo);
+ /*
+ * Check dependencies related to the received change. The XID of the top
+ * transaction is always used to avoid detecting false-positive
+ * dependencies between top and sub transactions. Sub-transactions can be
+ * replicated for streamed transactions, and they won't be marked as
+ * parallelized so that parallel workers won't wait for rolled-back
+ * sub-transactions.
+ */
+ handle_dependency_on_change(action, s,
+ in_streamed_transaction
+ ? stream_xid : remote_xid,
+ winfo);
/*
* Re-fetch the latest apply action as it might have been changed during
@@ -2587,6 +2614,10 @@ apply_handle_stream_prepare(StringInfo s)
apply_spooled_messages(MyLogicalRepWorker->stream_fileset,
prepare_data.xid, prepare_data.prepare_lsn);
+ /* Wait until the last transaction finishes */
+ if (TransactionIdIsValid(last_remote_xid))
+ pa_wait_for_depended_transaction(last_remote_xid);
+
/* Mark the transaction as prepared. */
apply_handle_prepare_internal(&prepare_data);
@@ -2610,7 +2641,8 @@ apply_handle_stream_prepare(StringInfo s)
case TRANS_LEADER_SEND_TO_PARALLEL:
Assert(winfo);
- if (pa_send_data(winfo, s->len, s->data))
+ if (build_dependency_with_last_committed_txn(winfo) &&
+ pa_send_data(winfo, s->len, s->data))
{
/* Finish processing the streaming transaction. */
pa_xact_finish(winfo, prepare_data.end_lsn);
@@ -2676,6 +2708,11 @@ apply_handle_stream_prepare(StringInfo s)
pgstat_report_stat(false);
+ /*
+ * No need to update the last_remote_xid here because the leader worker
+ * always waits until streamed transactions finish.
+ */
+
/*
* Process any tables that are being synchronized in parallel, as well as
* any newly added tables or sequences.
@@ -3460,6 +3497,10 @@ apply_handle_stream_commit(StringInfo s)
apply_spooled_messages(MyLogicalRepWorker->stream_fileset, xid,
commit_data.commit_lsn);
+ /* Wait until the last transaction finishes */
+ if (TransactionIdIsValid(last_remote_xid))
+ pa_wait_for_depended_transaction(last_remote_xid);
+
apply_handle_commit_internal(&commit_data);
/* Unlink the files with serialized changes and subxact info. */
@@ -3471,7 +3512,20 @@ apply_handle_stream_commit(StringInfo s)
case TRANS_LEADER_SEND_TO_PARALLEL:
Assert(winfo);
- if (pa_send_data(winfo, s->len, s->data))
+ /*
+ * Unlike the non-streaming case, there is no need to mark this
+ * transaction as parallelized, because the leader waits until the
+ * streamed transaction is committed, so commit ordering is always
+ * preserved.
+ */
+
+ /*
+ * Build a dependency between this transaction and the last committed
+ * transaction to preserve the commit order. Then, if that succeeds, try
+ * to send the COMMIT message.
+ */
+ if (build_dependency_with_last_committed_txn(winfo) &&
+ pa_send_data(winfo, s->len, s->data))
{
/* Finish processing the streaming transaction. */
pa_xact_finish(winfo, commit_data.end_lsn);
diff --git a/src/include/replication/worker_internal.h b/src/include/replication/worker_internal.h
index 5371ee767f1..69ecd51a359 100644
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -354,7 +354,7 @@ extern void pa_switch_to_partial_serialize(ParallelApplyWorkerInfo *winfo,
extern void pa_set_xact_state(ParallelApplyWorkerShared *wshared,
ParallelTransState xact_state);
extern void pa_set_stream_apply_worker(ParallelApplyWorkerInfo *winfo);
-
+extern bool pa_stream_apply_worker_is_null(void);
extern void pa_start_subtrans(TransactionId current_xid,
TransactionId top_xid);
extern void pa_reset_subtrans(void);
diff --git a/src/test/subscription/t/050_parallel_apply.pl b/src/test/subscription/t/050_parallel_apply.pl
index 15973f7d0e0..9254b85d350 100644
--- a/src/test/subscription/t/050_parallel_apply.pl
+++ b/src/test/subscription/t/050_parallel_apply.pl
@@ -187,4 +187,51 @@ $node_subscriber->wait_for_log(qr/finish waiting for depended xid $xid/, $offset
$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'regress_prepare';");
$node_publisher->wait_for_catchup('regress_sub');
+# Ensure streamed transactions wait for the previous transaction
+
+$node_publisher->append_conf('postgresql.conf',
+ "logical_decoding_work_mem = 64kB");
+$node_publisher->reload;
+# Run a query to make sure that the reload has taken effect.
+$node_publisher->safe_psql('postgres', "SELECT 1");
+
+# Attach the injection_point again
+$node_subscriber->safe_psql('postgres',
+ "SELECT injection_points_attach('parallel-worker-before-commit','wait');"
+);
+
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (generate_series(71, 80), 'test');");
+
+# Wait until the parallel worker enters the injection point.
+$node_subscriber->wait_for_event('logical replication parallel worker',
+ 'parallel-worker-before-commit');
+
+# Run a transaction which would be streamed
+my $h = $node_publisher->background_psql('postgres', on_error_stop => 0);
+
+$offset = -s $node_subscriber->logfile;
+
+$h->query_safe(
+ q{
+BEGIN;
+UPDATE regress_tab SET value = 'streamed-updated' WHERE id BETWEEN 71 AND 80;
+INSERT INTO regress_tab VALUES (generate_series(100, 5100), 'streamed');
+});
+
+# Verify the parallel worker waits for the transaction
+$str = $node_subscriber->wait_for_log(qr/wait for depended xid ([1-9][0-9]+)/, $offset);
+($xid) = $str =~ /wait for depended xid ([1-9][0-9]+)/;
+
+# Wakeup the parallel worker
+$node_subscriber->safe_psql('postgres', qq[
+ SELECT injection_points_detach('parallel-worker-before-commit');
+ SELECT injection_points_wakeup('parallel-worker-before-commit');
+]);
+
+# Verify the streamed transaction can be applied
+$node_subscriber->wait_for_log(qr/finish waiting for depended xid $xid/, $offset);
+
+$h->query_safe("COMMIT;");
+
done_testing();
--
2.47.3
Attachment: v7-0007-Wait-applying-transaction-if-one-of-user-defined-.patch (application/octet-stream)
From 45c2c40caa13219e7860cb81d86fa9d742cc1798 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Date: Tue, 23 Dec 2025 17:58:15 +0900
Subject: [PATCH v7 7/8] Wait before applying a transaction if one of the
 user-defined triggers is mutable
Since many parallel workers apply transactions, triggers on relations can also
be fired in parallel, which may produce unexpected results. To make this safe,
parallel apply workers wait for the previously dispatched transaction before
applying changes to a relation that has mutable triggers.
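As an illustrative sketch (the names are hypothetical, not from the patch), a
subscriber-side trigger only restricts parallel apply if it can actually fire
during apply and its function is not marked IMMUTABLE:

    CREATE FUNCTION touch_value() RETURNS trigger LANGUAGE plpgsql AS
    $$ BEGIN NEW.value := NEW.value; RETURN NEW; END; $$;  -- defaults to VOLATILE

    CREATE TRIGGER trg BEFORE INSERT ON regress_tab
        FOR EACH ROW EXECUTE FUNCTION touch_value();
    -- fires during apply, so the relation becomes parallel-restricted:
    ALTER TABLE regress_tab ENABLE ALWAYS TRIGGER trg;

    -- marking the function IMMUTABLE would lift the restriction:
    -- ALTER FUNCTION touch_value() IMMUTABLE;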
---
src/backend/replication/logical/relation.c | 123 ++++++++++++++++++---
src/backend/replication/logical/worker.c | 68 ++++++++++++
src/include/replication/logicalrelation.h | 20 ++++
3 files changed, 197 insertions(+), 14 deletions(-)
diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index 9991bfe76cc..14f3ebf725e 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -21,7 +21,9 @@
#include "access/genam.h"
#include "access/table.h"
#include "catalog/namespace.h"
+#include "catalog/pg_proc.h"
#include "catalog/pg_subscription_rel.h"
+#include "commands/trigger.h"
#include "executor/executor.h"
#include "nodes/makefuncs.h"
#include "replication/logicalrelation.h"
@@ -159,6 +161,10 @@ logicalrep_relmap_free_entry(LogicalRepRelMapEntry *entry)
*
* Called when new relation mapping is sent by the publisher to update
* our expected view of incoming data from said publisher.
+ *
+ * Note that we do not check user-defined constraints here. PostgreSQL
+ * already assumes that CHECK constraint conditions are immutable, and we
+ * follow that rule here.
*/
void
logicalrep_relmap_update(LogicalRepRelation *remoterel)
@@ -208,6 +214,8 @@ logicalrep_relmap_update(LogicalRepRelation *remoterel)
(remoterel->relkind == 0) ? RELKIND_RELATION : remoterel->relkind;
entry->remoterel.attkeys = bms_copy(remoterel->attkeys);
+
+ entry->parallel_safe = LOGICALREP_PARALLEL_UNKNOWN;
MemoryContextSwitchTo(oldctx);
}
@@ -353,27 +361,79 @@ logicalrep_rel_mark_updatable(LogicalRepRelMapEntry *entry)
}
/*
- * Open the local relation associated with the remote one.
+ * Check all local triggers for the relation to determine parallelizability.
*
- * Rebuilds the Relcache mapping if it was invalidated by local DDL.
+ * We regard a relation as applicable in parallel if all of its triggers are
+ * immutable. The result is set directly in LogicalRepRelMapEntry::parallel_safe.
*/
-LogicalRepRelMapEntry *
-logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
+static void
+check_defined_triggers(LogicalRepRelMapEntry *entry)
+{
+ TriggerDesc *trigdesc = entry->localrel->trigdesc;
+
+ /* Quick exit if no trigger is defined */
+ if (trigdesc == NULL)
+ {
+ entry->parallel_safe = LOGICALREP_PARALLEL_SAFE;
+ return;
+ }
+
+ /* Scan the triggers one by one to check their volatility */
+ for (int i = 0; i < trigdesc->numtriggers; i++)
+ {
+ Trigger *trigger = &trigdesc->triggers[i];
+
+ Assert(OidIsValid(trigger->tgfoid));
+
+ /* Skip if the trigger is not enabled for logical replication */
+ if (trigger->tgenabled == TRIGGER_DISABLED ||
+ trigger->tgenabled == TRIGGER_FIRES_ON_ORIGIN)
+ continue;
+
+ /* Check the volatility of the trigger. Exit if it is not immutable */
+ if (func_volatile(trigger->tgfoid) != PROVOLATILE_IMMUTABLE)
+ {
+ entry->parallel_safe = LOGICALREP_PARALLEL_RESTRICTED;
+ return;
+ }
+ }
+
+ /* All triggers are immutable, set as parallel safe */
+ entry->parallel_safe = LOGICALREP_PARALLEL_SAFE;
+}
+
+/*
+ * Actual workhorse for logicalrep_rel_open().
+ *
+ * Caller must specify *either* entry or remoteid. If the entry is specified,
+ * its attributes are filled and the local relation is kept open.
+ * If remoteid is given, the corresponding entry is first searched in the hash
+ * table and processed as in the above case. At the end, the local relation is
+ * closed.
+ */
+void
+logicalrep_rel_load(LogicalRepRelMapEntry *entry, LogicalRepRelId remoteid,
+ LOCKMODE lockmode)
{
- LogicalRepRelMapEntry *entry;
- bool found;
LogicalRepRelation *remoterel;
- if (LogicalRepRelMap == NULL)
- logicalrep_relmap_init();
+ Assert((entry && !remoteid) || (!entry && remoteid));
- /* Search for existing entry. */
- entry = hash_search(LogicalRepRelMap, &remoteid,
- HASH_FIND, &found);
+ if (!entry)
+ {
+ bool found;
- if (!found)
- elog(ERROR, "no relation map entry for remote relation ID %u",
- remoteid);
+ if (LogicalRepRelMap == NULL)
+ logicalrep_relmap_init();
+
+ /* Search for existing entry. */
+ entry = hash_search(LogicalRepRelMap, &remoteid,
+ HASH_FIND, &found);
+
+ if (!found)
+ elog(ERROR, "no relation map entry for remote relation ID %u",
+ remoteid);
+ }
remoterel = &entry->remoterel;
@@ -499,6 +559,13 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
entry->localindexoid = FindLogicalRepLocalIndex(entry->localrel, remoterel,
entry->attrmap);
+ /*
+ * The leader must also gather parallel-safety information for dependency
+ * tracking.
+ */
+ if (am_leader_apply_worker())
+ check_defined_triggers(entry);
+
entry->localrelvalid = true;
}
@@ -507,6 +574,34 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
entry->localreloid,
&entry->statelsn);
+ if (remoteid)
+ logicalrep_rel_close(entry, lockmode);
+}
+
+/*
+ * Open the local relation associated with the remote one.
+ *
+ * Rebuilds the Relcache mapping if it was invalidated by local DDL.
+ */
+LogicalRepRelMapEntry *
+logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
+{
+ LogicalRepRelMapEntry *entry;
+ bool found;
+
+ if (LogicalRepRelMap == NULL)
+ logicalrep_relmap_init();
+
+ /* Search for existing entry. */
+ entry = hash_search(LogicalRepRelMap, &remoteid,
+ HASH_FIND, &found);
+
+ if (!found)
+ elog(ERROR, "no relation map entry for remote relation ID %u",
+ remoteid);
+
+ logicalrep_rel_load(entry, 0, lockmode);
+
return entry;
}
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 83bcf216fd3..c1cf301c97b 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1070,6 +1070,59 @@ check_dependency_on_rel(LogicalRepRelId relid, TransactionId new_depended_xid,
relentry->last_depended_xid = new_depended_xid;
}
+/*
+ * Check the parallelizability of applying changes for the relation.
+ * Append the last dispatched transaction to 'depends_on_xids' if the
+ * relation is not parallel safe.
+ */
+static void
+check_dependency_for_parallel_safety(LogicalRepRelId relid,
+ TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ LogicalRepRelMapEntry *relentry;
+
+ /* Quick exit if no transactions have been dispatched */
+ if (!TransactionIdIsValid(last_remote_xid))
+ return;
+
+ relentry = logicalrep_get_relentry(relid);
+
+ /*
+ * Gather information about local triggers if not done yet. We must be in
+ * a transaction state because system catalogs are read.
+ */
+ if (relentry->parallel_safe == LOGICALREP_PARALLEL_UNKNOWN)
+ {
+ bool needs_start = !IsTransactionOrTransactionBlock();
+
+ if (needs_start)
+ StartTransactionCommand();
+
+ logicalrep_rel_load(NULL, relid, AccessShareLock);
+
+ /*
+ * Close the transaction if we started one here. We must not abort because it
+ * would release all session-level locks, such as the stream lock, and
+ * break the deadlock detection mechanism between LA and PA. The
+ * outcome is the same regardless of the end status, since the
+ * transaction did not modify any tuples.
+ */
+ if (needs_start)
+ CommitTransactionCommand();
+
+ Assert(relentry->parallel_safe != LOGICALREP_PARALLEL_UNKNOWN);
+ }
+
+ /* Do nothing for parallel safe relations */
+ if (relentry->parallel_safe == LOGICALREP_PARALLEL_SAFE)
+ return;
+
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &last_remote_xid,
+ new_depended_xid);
+}
+
/*
* Check dependencies related to the current change by determining if the
* modification impacts the same row or table as another ongoing transaction. If
@@ -1128,6 +1181,8 @@ handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
check_dependency_on_replica_identity(relid, &newtup,
new_depended_xid,
&depends_on_xids);
+ check_dependency_for_parallel_safety(relid, new_depended_xid,
+ &depends_on_xids);
break;
case LOGICAL_REP_MSG_UPDATE:
@@ -1135,13 +1190,19 @@ handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
&newtup);
if (has_oldtup)
+ {
check_dependency_on_replica_identity(relid, &oldtup,
new_depended_xid,
&depends_on_xids);
+ check_dependency_for_parallel_safety(relid, new_depended_xid,
+ &depends_on_xids);
+ }
check_dependency_on_replica_identity(relid, &newtup,
new_depended_xid,
&depends_on_xids);
+ check_dependency_for_parallel_safety(relid, new_depended_xid,
+ &depends_on_xids);
break;
case LOGICAL_REP_MSG_DELETE:
@@ -1149,6 +1210,8 @@ handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
check_dependency_on_replica_identity(relid, &oldtup,
new_depended_xid,
&depends_on_xids);
+ check_dependency_for_parallel_safety(relid, new_depended_xid,
+ &depends_on_xids);
break;
case LOGICAL_REP_MSG_TRUNCATE:
@@ -1161,8 +1224,13 @@ handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
* modified the same table.
*/
foreach_int(truncated_relid, remote_relids)
+ {
check_dependency_on_rel(truncated_relid, new_depended_xid,
&depends_on_xids);
+ check_dependency_for_parallel_safety(truncated_relid,
+ new_depended_xid,
+ &depends_on_xids);
+ }
break;
diff --git a/src/include/replication/logicalrelation.h b/src/include/replication/logicalrelation.h
index 34a7069e9e5..e3d0df58620 100644
--- a/src/include/replication/logicalrelation.h
+++ b/src/include/replication/logicalrelation.h
@@ -39,6 +39,20 @@ typedef struct LogicalRepRelMapEntry
XLogRecPtr statelsn;
TransactionId last_depended_xid;
+
+ /*
+ * Whether the relation can be applied in parallel or not. This
+ * distinguishes whether the defined triggers are immutable or not.
+ *
+ * Theoretically, we could determine the parallelizability for each type
+ * of replication message, INSERT/UPDATE/DELETE/TRUNCATE, but this is not
+ * done yet, to keep the number of attributes small.
+ *
+ * Note that we do not check user-defined constraints here. PostgreSQL
+ * already assumes that CHECK constraint conditions are immutable, and we
+ * follow that rule here.
+ */
+ char parallel_safe;
} LogicalRepRelMapEntry;
extern void logicalrep_relmap_update(LogicalRepRelation *remoterel);
@@ -46,6 +60,8 @@ extern void logicalrep_partmap_reset_relmap(LogicalRepRelation *remoterel);
extern LogicalRepRelMapEntry *logicalrep_rel_open(LogicalRepRelId remoteid,
LOCKMODE lockmode);
+extern void logicalrep_rel_load(LogicalRepRelMapEntry *entry,
+ LogicalRepRelId remoteid, LOCKMODE lockmode);
extern LogicalRepRelMapEntry *logicalrep_partition_open(LogicalRepRelMapEntry *root,
Relation partrel, AttrMap *map);
extern void logicalrep_rel_close(LogicalRepRelMapEntry *rel,
@@ -56,4 +72,8 @@ extern int logicalrep_get_num_rels(void);
extern void logicalrep_write_all_rels(StringInfo out);
extern LogicalRepRelMapEntry *logicalrep_get_relentry(LogicalRepRelId remoteid);
+#define LOGICALREP_PARALLEL_SAFE 's'
+#define LOGICALREP_PARALLEL_RESTRICTED 'r'
+#define LOGICALREP_PARALLEL_UNKNOWN 'u'
+
#endif /* LOGICALRELATION_H */
--
2.47.3
Attachment: v7-0008-Support-dependency-tracking-via-local-unique-inde.patch (application/octet-stream)
From 0ff666fdd43e02abdcc8a151410c18fb52e9e293 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda <hayato@example.com>
Date: Thu, 11 Dec 2025 22:21:47 +0900
Subject: [PATCH v7 8/8] Support dependency tracking via local unique indexes
Currently, logical replication's parallel apply mechanism tracks dependencies
primarily based on the REPLICA IDENTITY defined on the publisher table.
However, local subscriber tables might have additional unique indexes that
could effectively serve as dependency keys, even if they don't correspond to
the publisher's REPLICA IDENTITY. Failing to track these additional unique
keys can lead to incorrect data and/or deadlocks during parallel application.
This patch extends the parallel apply's dependency tracking to consider
local unique indexes on the subscriber table. This is achieved by extending
the existing Replica Identity hash table to also store dependency information
based on these local unique indexes.
The LogicalRepRelMapEntry structure is extended to store details about these
local unique indexes. This information is collected and cached when
dependency checking is first performed for a remote transaction on a given
relation. This collection process requires being inside a transaction to
access system catalog information.
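As a minimal sketch of the failure this guards against (the table and column
names are illustrative), the replica identity alone cannot see the conflict
below because only the subscriber has the unique index:

    -- publisher
    CREATE TABLE tab (id int PRIMARY KEY, value text);
    INSERT INTO tab VALUES (1, 'x');
    -- subscriber additionally has:
    CREATE UNIQUE INDEX ON tab (value);

    -- publisher, two successive transactions:
    -- Txn1: DELETE FROM tab WHERE id = 1;
    -- Txn2: INSERT INTO tab VALUES (2, 'x');
    -- The replica identity keys (id = 1 vs. id = 2) do not overlap, but
    -- applying Txn2 before Txn1 fails with a unique violation on the
    -- subscriber-only index, so Txn2 must be treated as dependent.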
---
src/backend/replication/logical/relation.c | 161 ++++++++++-
src/backend/replication/logical/worker.c | 270 ++++++++++++++----
src/backend/storage/lmgr/deadlock.c | 1 -
src/include/replication/logicalrelation.h | 14 +
src/test/subscription/t/050_parallel_apply.pl | 43 +++
src/tools/pgindent/typedefs.list | 2 +
6 files changed, 432 insertions(+), 59 deletions(-)
diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index 14f3ebf725e..781fc3c73b9 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -127,6 +127,21 @@ logicalrep_relmap_init(void)
(Datum) 0);
}
+/*
+ * Release local index list
+ */
+static void
+free_local_unique_indexes(LogicalRepRelMapEntry *entry)
+{
+ Assert(am_leader_apply_worker());
+
+ foreach_ptr(LogicalRepSubscriberIdx, idxinfo, entry->local_unique_indexes)
+ bms_free(idxinfo->indexkeys);
+
+ list_free_deep(entry->local_unique_indexes);
+ entry->local_unique_indexes = NIL;
+}
+
/*
* Free the entry of a relation map cache.
*/
@@ -154,6 +169,9 @@ logicalrep_relmap_free_entry(LogicalRepRelMapEntry *entry)
if (entry->attrmap)
free_attrmap(entry->attrmap);
+
+ if (entry->local_unique_indexes != NIL)
+ free_local_unique_indexes(entry);
}
/*
@@ -360,6 +378,126 @@ logicalrep_rel_mark_updatable(LogicalRepRelMapEntry *entry)
}
}
+/*
+ * Collect all local unique indexes that can be used for dependency tracking.
+ */
+static void
+collect_local_indexes(LogicalRepRelMapEntry *entry)
+{
+ List *idxlist;
+
+ if (entry->local_unique_indexes != NIL)
+ free_local_unique_indexes(entry);
+
+ entry->local_unique_indexes_collected = true;
+
+ idxlist = RelationGetIndexList(entry->localrel);
+
+ /* Quick exit if there are no indexes */
+ if (idxlist == NIL)
+ return;
+
+ /* Iterate indexes to list all usable indexes */
+ foreach_oid(idxoid, idxlist)
+ {
+ Relation idxrel;
+ int indnkeys;
+ AttrMap *attrmap;
+ Bitmapset *indexkeys = NULL;
+ bool suitable = true;
+
+ idxrel = index_open(idxoid, AccessShareLock);
+
+ /*
+ * Check whether the index can be used for the dependency tracking.
+ *
+ * For simplicity, we use the same condition as REPLICA IDENTITY FULL,
+ * plus the index must be unique.
+ */
+ if (!(idxrel->rd_index->indisunique &&
+ IsIndexUsableForReplicaIdentityFull(idxrel, entry->attrmap)))
+ {
+ index_close(idxrel, AccessShareLock);
+ continue;
+ }
+
+ indnkeys = idxrel->rd_index->indnkeyatts;
+ attrmap = entry->attrmap;
+
+ Assert(indnkeys);
+
+ /* Walk each attribute and add it to a bitmap */
+ for (int i = 0; i < indnkeys; i++)
+ {
+ AttrNumber localcol = idxrel->rd_index->indkey.values[i];
+ AttrNumber remotecol;
+
+ /*
+ * XXX: Mark a relation as parallel-unsafe if it has expression
+ * indexes because we cannot compute the hash value for the
+ * dependency tracking. For safety, transactions that modify such
+ * tables wait until the last dispatched transaction is committed
+ * before being applied.
+ */
+ if (!AttributeNumberIsValid(localcol))
+ {
+ entry->parallel_safe = LOGICALREP_PARALLEL_RESTRICTED;
+ suitable = false;
+ break;
+ }
+
+ remotecol = attrmap->attnums[AttrNumberGetAttrOffset(localcol)];
+
+ /*
+ * Skip if the column does not exist on the publisher node. In this
+ * case the replicated tuples always have a NULL or default value.
+ */
+ if (remotecol < 0)
+ {
+ suitable = false;
+ break;
+ }
+
+ /* Checks are passed, remember the attribute */
+ indexkeys = bms_add_member(indexkeys, remotecol);
+ }
+
+ index_close(idxrel, AccessShareLock);
+
+ /*
+ * Skip using the index if it is not suitable. This can happen if
+ * 1) one of the columns does not exist on the publisher side, or
+ * 2) there is an expression column.
+ */
+ if (!suitable)
+ {
+ if (indexkeys)
+ bms_free(indexkeys);
+
+ continue;
+ }
+
+ /* This index is usable, store it in memory */
+ if (indexkeys)
+ {
+ MemoryContext oldctx;
+ LogicalRepSubscriberIdx *idxinfo;
+
+ oldctx = MemoryContextSwitchTo(LogicalRepRelMapContext);
+ idxinfo = palloc(sizeof(LogicalRepSubscriberIdx));
+ idxinfo->indexoid = idxoid;
+ idxinfo->indexkeys = bms_copy(indexkeys);
+ entry->local_unique_indexes =
+ lappend(entry->local_unique_indexes, idxinfo);
+
+ pfree(indexkeys);
+ MemoryContextSwitchTo(oldctx);
+ }
+ }
+
+ list_free(idxlist);
+}
+
/*
 * Check all local triggers for the relation to determine parallelizability.
*
@@ -369,7 +507,16 @@ logicalrep_rel_mark_updatable(LogicalRepRelMapEntry *entry)
static void
check_defined_triggers(LogicalRepRelMapEntry *entry)
{
- TriggerDesc *trigdesc = entry->localrel->trigdesc;
+ TriggerDesc *trigdesc;
+
+ /*
+ * Skip if the parallelizability has already been checked. This is
+ * possible if the relation has expression indexes.
+ */
+ if (entry->parallel_safe != LOGICALREP_PARALLEL_UNKNOWN)
+ return;
+
+ trigdesc = entry->localrel->trigdesc;
 /* Quick exit if no trigger is defined */
if (trigdesc == NULL)
@@ -410,7 +557,7 @@ check_defined_triggers(LogicalRepRelMapEntry *entry)
 * If remoteid is given, the corresponding entry is first searched in the hash
 * table and processed as in the above case. At the end, the local relation is
 * closed.
- */
+ */
void
logicalrep_rel_load(LogicalRepRelMapEntry *entry, LogicalRepRelId remoteid,
LOCKMODE lockmode)
@@ -564,7 +711,11 @@ logicalrep_rel_load(LogicalRepRelMapEntry *entry, LogicalRepRelId remoteid,
* tracking.
*/
if (am_leader_apply_worker())
+ {
+ entry->parallel_safe = LOGICALREP_PARALLEL_UNKNOWN;
+ collect_local_indexes(entry);
check_defined_triggers(entry);
+ }
entry->localrelvalid = true;
}
@@ -866,6 +1017,12 @@ logicalrep_partition_open(LogicalRepRelMapEntry *root,
entry->localindexoid = FindLogicalRepLocalIndex(partrel, remoterel,
entry->attrmap);
+ /*
+ * TODO: Parallel apply does not support partitioned tables for now.
+ * Just mark local indexes as collected.
+ */
+ entry->local_unique_indexes_collected = true;
+
entry->localrelvalid = true;
return entry;
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index c1cf301c97b..39fe95746c7 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -548,9 +548,19 @@ typedef struct ApplySubXactData
static ApplySubXactData subxact_data = {0, 0, InvalidTransactionId, NULL};
+/*
+ * Type of key used for dependency tracking.
+ */
+typedef enum LogicalRepKeyKind
+{
+ LOGICALREP_KEY_REPLICA_IDENTITY,
+ LOGICALREP_KEY_LOCAL_UNIQUE
+} LogicalRepKeyKind;
+
typedef struct ReplicaIdentityKey
{
Oid relid;
+ LogicalRepKeyKind kind;
LogicalRepTupleData *data;
} ReplicaIdentityKey;
@@ -710,7 +720,8 @@ static bool
hash_replica_identity_compare(ReplicaIdentityKey *a, ReplicaIdentityKey *b)
{
if (a->relid != b->relid ||
- a->data->ncols != b->data->ncols)
+ a->data->ncols != b->data->ncols ||
+ a->kind != b->kind)
return false;
for (int i = 0; i < a->data->ncols; i++)
@@ -718,6 +729,9 @@ hash_replica_identity_compare(ReplicaIdentityKey *a, ReplicaIdentityKey *b)
if (a->data->colstatus[i] != b->data->colstatus[i])
return false;
+ if (a->data->colstatus[i] == LOGICALREP_COLUMN_NULL)
+ continue;
+
if (a->data->colvalues[i].len != b->data->colvalues[i].len)
return false;
@@ -847,6 +861,93 @@ check_and_append_xid_dependency(List *depends_on_xids,
return lappend_xid(depends_on_xids, *depends_on_xid);
}
+/*
+ * Common function for checking dependency by using the key. Used by both
+ * check_dependency_on_replica_identity and check_dependency_on_local_key.
+ */
+static void
+check_dependency_by_key(ReplicaIdentityKey *key, TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ ReplicaIdentityEntry *rientry;
+ bool found = false;
+ MemoryContext oldctx;
+
+ oldctx = MemoryContextSwitchTo(ApplyContext);
+
+ if (TransactionIdIsValid(new_depended_xid))
+ {
+ rientry = replica_identity_insert(replica_identity_table, key,
+ &found);
+
+ /*
+ * Release the key built to search the entry, if the entry already
+ * exists. Otherwise, initialize the remote_xid.
+ */
+ if (found)
+ {
+ elog(DEBUG1,
+ key->kind == LOGICALREP_KEY_REPLICA_IDENTITY ?
+ "found conflicting replica identity change from %u" :
+ "found conflicting local unique change from %u",
+ rientry->remote_xid);
+
+ free_replica_identity_key(key);
+ }
+ else
+ rientry->remote_xid = InvalidTransactionId;
+ }
+ else
+ {
+ rientry = replica_identity_lookup(replica_identity_table, key);
+ free_replica_identity_key(key);
+ }
+
+ MemoryContextSwitchTo(oldctx);
+
+ /* Return if no entry found */
+ if (!rientry)
+ return;
+
+ Assert(!found || TransactionIdIsValid(rientry->remote_xid));
+
+ *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
+ &rientry->remote_xid,
+ new_depended_xid);
+
+ /*
+ * Remove the entry if it is registered for a streamed transaction. We
+ * do not have to register an entry for them; the leader worker always
+ * waits until the parallel worker finishes handling streamed transactions,
+ * so there is no need to consider the possibility that upcoming parallel
+ * workers would go ahead.
+ */
+ if (TransactionIdIsValid(stream_xid) && !found)
+ {
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+ }
+
+ /*
+ * Update the new depended xid into the entry if valid, the new xid could
+ * be invalid if the transaction will be applied by the leader itself
+ * which means all the changes will be committed before processing next
+ * transaction, so no need to be depended on.
+ */
+ else if (TransactionIdIsValid(new_depended_xid))
+ rientry->remote_xid = new_depended_xid;
+
+ /*
+ * Remove the entry if the transaction has been committed and no new
+ * dependency needs to be added.
+ */
+ else if (!TransactionIdIsValid(rientry->remote_xid))
+ {
+ free_replica_identity_key(rientry->keydata);
+ replica_identity_delete_item(replica_identity_table, rientry);
+ }
+}
+
/*
* Check for dependencies on preceding transactions that modify the same key.
* Returns the dependent transactions in 'depends_on_xids' and records the
@@ -861,10 +962,8 @@ check_dependency_on_replica_identity(Oid relid,
LogicalRepRelMapEntry *relentry;
LogicalRepTupleData *ridata;
ReplicaIdentityKey *rikey;
- ReplicaIdentityEntry *rientry;
MemoryContext oldctx;
int n_ri;
- bool found = false;
Assert(depends_on_xids);
@@ -930,75 +1029,122 @@ check_dependency_on_replica_identity(Oid relid,
rikey = palloc0_object(ReplicaIdentityKey);
rikey->relid = relid;
+ rikey->kind = LOGICALREP_KEY_REPLICA_IDENTITY;
rikey->data = ridata;
- if (TransactionIdIsValid(new_depended_xid))
+ MemoryContextSwitchTo(oldctx);
+
+ check_dependency_by_key(rikey, new_depended_xid, depends_on_xids);
+}
+
+/*
+ * Mostly same as check_dependency_on_replica_identity() but for local unique
+ * indexes.
+ */
+static void
+check_dependency_on_local_key(Oid relid,
+ LogicalRepTupleData *original_data,
+ TransactionId new_depended_xid,
+ List **depends_on_xids)
+{
+ LogicalRepRelMapEntry *relentry;
+ LogicalRepTupleData *ridata;
+ ReplicaIdentityKey *rikey;
+ MemoryContext oldctx;
+
+ Assert(depends_on_xids);
+
+ /* Search for existing entry */
+ relentry = logicalrep_get_relentry(relid);
+
+ Assert(relentry);
+
+ /*
+ * Gather information about local indexes if not done yet. We must be in
+ * a transaction state because system catalogs are read.
+ */
+ if (!relentry->local_unique_indexes_collected)
{
- rientry = replica_identity_insert(replica_identity_table, rikey,
- &found);
+ bool needs_start = !IsTransactionOrTransactionBlock();
+
+ if (needs_start)
+ StartTransactionCommand();
+
+ logicalrep_rel_load(NULL, relid, AccessShareLock);
/*
- * Release the key built to search the entry, if the entry already
- * exists. Otherwise, initialize the remote_xid.
+ * Close the transaction if we started one here. We must not abort because it
+ * would release all session-level locks, such as the stream lock, and
+ * break the deadlock detection mechanism between LA and PA. The
+ * outcome is the same regardless of the end status, since the
+ * transaction did not modify any tuples.
*/
- if (found)
- {
- elog(DEBUG1, "found conflicting replica identity change from %u",
- rientry->remote_xid);
+ if (needs_start)
+ CommitTransactionCommand();
- free_replica_identity_key(rikey);
- }
- else
- rientry->remote_xid = InvalidTransactionId;
+ Assert(relentry->local_unique_indexes_collected);
}
- else
+
+ foreach_ptr(LogicalRepSubscriberIdx, idxinfo, relentry->local_unique_indexes)
{
- rientry = replica_identity_lookup(replica_identity_table, rikey);
- free_replica_identity_key(rikey);
- }
+ int columns = bms_num_members(idxinfo->indexkeys);
+ bool suitable = true;
- MemoryContextSwitchTo(oldctx);
+ Assert(columns);
- /* Return if no entry found */
- if (!rientry)
- return;
+ for (int i = 0; i < original_data->ncols; i++)
+ {
+ if (!bms_is_member(i, idxinfo->indexkeys))
+ continue;
- Assert(!found || TransactionIdIsValid(rientry->remote_xid));
+ /*
+ * Skip if the column is not changed.
+ *
+ * XXX: NULL is allowed.
+ */
+ if (original_data->colstatus[i] == LOGICALREP_COLUMN_UNCHANGED)
+ {
+ suitable = false;
+ break;
+ }
+ }
- *depends_on_xids = check_and_append_xid_dependency(*depends_on_xids,
- &rientry->remote_xid,
- new_depended_xid);
+ if (!suitable)
+ continue;
- /*
- * Remove the entry if it is registered for a streamed transaction. We
- * do not have to register an entry for them; the leader worker always
- * waits until the parallel worker finishes handling streamed transactions,
- * so there is no need to consider the possibility that upcoming parallel
- * workers would go ahead.
- */
- if (TransactionIdIsValid(stream_xid) && !found)
- {
- free_replica_identity_key(rientry->keydata);
- replica_identity_delete_item(replica_identity_table, rientry);
- }
+ oldctx = MemoryContextSwitchTo(ApplyContext);
- /*
- * Update the new depended xid into the entry if valid, the new xid could
- * be invalid if the transaction will be applied by the leader itself
- * which means all the changes will be committed before processing next
- * transaction, so no need to be depended on.
- */
- else if (TransactionIdIsValid(new_depended_xid))
- rientry->remote_xid = new_depended_xid;
+ /* Allocate space for replica identity values */
+ ridata = palloc0_object(LogicalRepTupleData);
+ ridata->colvalues = palloc0_array(StringInfoData, columns);
+ ridata->colstatus = palloc0_array(char, columns);
+ ridata->ncols = columns;
- /*
- * Remove the entry if the transaction has been committed and no new
- * dependency needs to be added.
- */
- else if (!TransactionIdIsValid(rientry->remote_xid))
- {
- free_replica_identity_key(rientry->keydata);
- replica_identity_delete_item(replica_identity_table, rientry);
+ for (int i_original = 0, i_key = 0; i_original < original_data->ncols; i_original++)
+ {
+ if (!bms_is_member(i_original, idxinfo->indexkeys))
+ continue;
+
+ if (original_data->colstatus[i_original] != LOGICALREP_COLUMN_NULL)
+ {
+ StringInfo original_colvalue = &original_data->colvalues[i_original];
+
+ initStringInfoExt(&ridata->colvalues[i_key], original_colvalue->len + 1);
+ appendStringInfoString(&ridata->colvalues[i_key], original_colvalue->data);
+ }
+
+ ridata->colstatus[i_key] = original_data->colstatus[i_original];
+ i_key++;
+ }
+
+ rikey = palloc0_object(ReplicaIdentityKey);
+ rikey->relid = relid;
+ rikey->kind = LOGICALREP_KEY_LOCAL_UNIQUE;
+ rikey->data = ridata;
+
+ MemoryContextSwitchTo(oldctx);
+
+ check_dependency_by_key(rikey, new_depended_xid, depends_on_xids);
}
}
@@ -1181,6 +1327,9 @@ handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
check_dependency_on_replica_identity(relid, &newtup,
new_depended_xid,
&depends_on_xids);
+ check_dependency_on_local_key(relid, &newtup,
+ new_depended_xid,
+ &depends_on_xids);
check_dependency_for_parallel_safety(relid, new_depended_xid,
&depends_on_xids);
break;
@@ -1194,6 +1343,9 @@ handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
check_dependency_on_replica_identity(relid, &oldtup,
new_depended_xid,
&depends_on_xids);
+ check_dependency_on_local_key(relid, &oldtup,
+ new_depended_xid,
+ &depends_on_xids);
check_dependency_for_parallel_safety(relid, new_depended_xid,
&depends_on_xids);
}
@@ -1201,6 +1353,9 @@ handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
check_dependency_on_replica_identity(relid, &newtup,
new_depended_xid,
&depends_on_xids);
+ check_dependency_on_local_key(relid, &newtup,
+ new_depended_xid,
+ &depends_on_xids);
check_dependency_for_parallel_safety(relid, new_depended_xid,
&depends_on_xids);
break;
@@ -1210,6 +1365,9 @@ handle_dependency_on_change(LogicalRepMsgType action, StringInfo s,
check_dependency_on_replica_identity(relid, &oldtup,
new_depended_xid,
&depends_on_xids);
+ check_dependency_on_local_key(relid, &oldtup,
+ new_depended_xid,
+ &depends_on_xids);
check_dependency_for_parallel_safety(relid, new_depended_xid,
&depends_on_xids);
break;
diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index c4bfaaa67ac..ca7dee52b32 100644
--- a/src/backend/storage/lmgr/deadlock.c
+++ b/src/backend/storage/lmgr/deadlock.c
@@ -33,7 +33,6 @@
#include "storage/procnumber.h"
#include "utils/memutils.h"
-
/*
* One edge in the waits-for graph.
*
diff --git a/src/include/replication/logicalrelation.h b/src/include/replication/logicalrelation.h
index e3d0df58620..9ac97fc4b38 100644
--- a/src/include/replication/logicalrelation.h
+++ b/src/include/replication/logicalrelation.h
@@ -16,6 +16,12 @@
#include "catalog/index.h"
#include "replication/logicalproto.h"
+typedef struct LogicalRepSubscriberIdx
+{
+ Oid indexoid; /* OID of the local unique index */
+ Bitmapset *indexkeys; /* Bitmap of key columns, numbered as on the remote side */
+} LogicalRepSubscriberIdx;
+
typedef struct LogicalRepRelMapEntry
{
LogicalRepRelation remoterel; /* key is remoterel.remoteid */
@@ -40,6 +46,10 @@ typedef struct LogicalRepRelMapEntry
TransactionId last_depended_xid;
+ /*
+ * Subscriber-side unique indexes (a list of LogicalRepSubscriberIdx), used
+ * for dependency tracking. They are collected lazily; the flag records
+ * whether the collection has been done.
+ */
+ List *local_unique_indexes;
+ bool local_unique_indexes_collected;
+
/*
* Whether the relation can be applied in parallel or not. This depends
* on whether all defined triggers are immutable.
@@ -51,6 +61,10 @@ typedef struct LogicalRepRelMapEntry
* Note that we do not check the user-defined constraints here. PostgreSQL
* has already assumed that CHECK constraints' conditions are immutable and
* here follows the rule.
+ *
+ * XXX: Additionally, this can be false if the relation has expression
+ * indexes, because we cannot compute the hash value needed for dependency
+ * tracking.
*/
char parallel_safe;
} LogicalRepRelMapEntry;
diff --git a/src/test/subscription/t/050_parallel_apply.pl b/src/test/subscription/t/050_parallel_apply.pl
index 9254b85d350..337a598a38c 100644
--- a/src/test/subscription/t/050_parallel_apply.pl
+++ b/src/test/subscription/t/050_parallel_apply.pl
@@ -234,4 +234,47 @@ $node_subscriber->wait_for_log(qr/finish waiting for depended xid $xid/, $offset
$h->query_safe("COMMIT;");
+# Ensure subscriber-local unique indexes are also used for dependency tracking
+
+# Truncate the data for upcoming tests
+$node_publisher->safe_psql('postgres', "TRUNCATE TABLE regress_tab;");
+$node_publisher->wait_for_catchup('regress_sub');
+
+# Define a unique index on the subscriber
+$node_subscriber->safe_psql('postgres',
+ "CREATE UNIQUE INDEX ON regress_tab (value);");
+
+# Attach an injection point. Parallel workers will wait right before
+# committing
+$node_subscriber->safe_psql('postgres',
+ "SELECT injection_points_attach('parallel-worker-before-commit','wait');"
+);
+
+# Insert a tuple on the publisher. The parallel worker applying it will wait
+# at the injection point
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (1, 'would conflict');");
+
+# Wait until the parallel worker enters the injection point.
+$node_subscriber->wait_for_event('logical replication parallel worker',
+ 'parallel-worker-before-commit');
+
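+# Remember the current log size so that only lines emitted after this point
+# are searched below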
+$offset = -s $node_subscriber->logfile;
+
+# Insert a non-conflicting tuple on the publisher. This transaction will also
+# wait, because all parallel workers wait until the previously launched worker
+# commits.
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (2, 'would not conflict');");
+
+# Verify that the parallel worker waits for the in-progress transaction, and
+# capture the depended xid from the log
+$node_subscriber->wait_for_log(qr/wait for depended xid [1-9][0-9]*/, $offset);
+$str = slurp_file($node_subscriber->logfile, $offset);
+($xid) = $str =~ /wait for depended xid ([1-9][0-9]*)/;
+
+# Insert a conflicting tuple on the publisher. The leader will detect the
+# dependency and make this transaction wait for the depended one to commit.
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO regress_tab VALUES (3, 'would conflict');");
+
+# Verify the parallel worker waits for the same transaction
+$node_subscriber->wait_for_log(qr/wait for depended xid $xid/, $offset);
+
+# Detach the injection point and wake up the waiting workers so that the
+# pending transactions can finish
+$node_subscriber->safe_psql('postgres',
+ "SELECT injection_points_detach('parallel-worker-before-commit');");
+$node_subscriber->safe_psql('postgres',
+ "SELECT injection_points_wakeup('parallel-worker-before-commit');");
+
done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 80810793746..12479b64958 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1637,6 +1637,7 @@ LogicalRepBeginData
LogicalRepCommitData
LogicalRepCommitPreparedTxnData
LogicalRepCtxStruct
+LogicalRepKeyKind
LogicalRepMsgType
LogicalRepPartMapEntry
LogicalRepPreparedTxnData
@@ -1646,6 +1647,7 @@ LogicalRepRelation
LogicalRepRollbackPreparedTxnData
LogicalRepSequenceInfo
LogicalRepStreamAbortData
+LogicalRepSubscriberIdx
LogicalRepTupleData
LogicalRepTyp
LogicalRepWorker
--
2.47.3