Slow catchup of 2PC (twophase) transactions on replica in LR
Dear All,
I'd like to discuss a problem where 2PC transactions are applied quite slowly on a replica during logical replication. There are a master and a replica with logical replication established from the master to the replica with twophase = true. Under some load on the master, the replica starts to lag behind, and the lag keeps increasing. We have to decrease the load on the master significantly to allow the replica to complete the catchup. Such a problem may create significant difficulties in production. The problem appears at least on the REL_16_STABLE branch.
To reproduce the problem:
* Setup logical replication from master to replica with the subscription parameter twophase = true.
* Create some intermediate load on the master (use pgbench with a custom SQL script with prepare+commit).
* Optionally switch off the replica for some time (keep the load on the master).
* Switch on the replica and wait until it catches up with the master.
The replica will never catch up with the master, even under a fairly low load on the master. If the load is removed, the replica does catch up, but it takes much longer than expected. I tried the same with regular transactions, and the problem doesn't appear even with a decent load.
I think the main cause of the bad 2PC catchup performance is the lack of asynchronous commit support for 2PC. For regular transactions, asynchronous commit is used on the replica by default (subscription synchronous_commit = off). It allows the replication worker process on the replica to avoid fsync (XLogFlush) and to utilize 100% CPU (the background WAL writer or checkpointer will do the fsync). I agree that 2PC is mostly used in multimaster configurations with two or more nodes operating synchronously, but when a node is in catchup (the node is not online in the multimaster cluster), asynchronous commit should be used to speed up the catchup.
There is another thing that contributes to the imbalance between master and replica performance. When the master executes requests from multiple clients, an fsync optimization takes place in XLogFlush: it decreases the number of fsync calls when a number of parallel backends write to the WAL simultaneously. The replica applies the received transactions sequentially in a single process, so this optimization does not apply.
I see some possible solutions:
* Implement asynchronous commit for 2PC transactions.
* Do some hacking with enableFsync when it is possible.
I think asynchronous commit support for 2PC transactions should significantly increase replica performance and help to solve this problem. I tried to implement it (as for regular transactions) but found another problem: the 2PC state is stored in the WAL on prepare, and on commit we have to read the 2PC state back from the WAL, but the read is delayed until the WAL is flushed by the background WAL writer (the read LSN must be less than the flush LSN). Storing the 2PC state in shared memory (as was proposed earlier) may help.
I used the following query to monitor the catchup progress on the master:

SELECT sent_lsn, pg_current_wal_lsn() FROM pg_stat_replication;
I used the following script for pgbench against the master:

SELECT md5(random()::text) as mygid \gset
BEGIN;
DELETE FROM test WHERE v = pg_backend_pid();
INSERT INTO test(v) SELECT pg_backend_pid();
PREPARE TRANSACTION $$:mygid$$;
COMMIT PREPARED $$:mygid$$;
What do you think?
With best regards,
Vitaly Davydov
On Thu, Feb 22, 2024 at 6:59 PM Давыдов Виталий
<v.davydov@postgrespro.ru> wrote:
I'd like to discuss a problem where 2PC transactions are applied quite slowly on a replica during logical replication. There are a master and a replica with logical replication established from the master to the replica with twophase = true. Under some load on the master, the replica starts to lag behind, and the lag keeps increasing. We have to decrease the load on the master significantly to allow the replica to complete the catchup. Such a problem may create significant difficulties in production. The problem appears at least on the REL_16_STABLE branch.
To reproduce the problem:
Setup logical replication from master to replica with subscription parameter twophase = true.
Create some intermediate load on the master (use pgbench with custom sql with prepare+commit)
Optionally switch off the replica for some time (keep load on master).
Switch on the replica and wait until it reaches the master.
The replica will never catch up with the master, even under a fairly low load on the master. If the load is removed, the replica does catch up, but it takes much longer than expected. I tried the same with regular transactions, and the problem doesn't appear even with a decent load.
I think the main cause of the bad 2PC catchup performance is the lack of asynchronous commit support for 2PC. For regular transactions, asynchronous commit is used on the replica by default (subscription synchronous_commit = off). It allows the replication worker process on the replica to avoid fsync (XLogFlush) and to utilize 100% CPU (the background WAL writer or checkpointer will do the fsync). I agree that 2PC is mostly used in multimaster configurations with two or more nodes operating synchronously, but when a node is in catchup (the node is not online in the multimaster cluster), asynchronous commit should be used to speed up the catchup.
I don't see that we do anything specific for 2PC transactions to make them
behave differently from regular transactions with respect to the
synchronous_commit setting. What makes you think so? Can you pinpoint
the code you are referring to?
There is another thing that contributes to the imbalance between master and replica performance. When the master executes requests from multiple clients, an fsync optimization takes place in XLogFlush: it decreases the number of fsync calls when a number of parallel backends write to the WAL simultaneously. The replica applies the received transactions sequentially in a single process, so this optimization does not apply.
Right, I think for this we need to implement parallel apply.
I see some possible solutions:
Implement asynchronous commit for 2PC transactions.
Do some hacking with enableFsync when it is possible.
Can you be a bit more specific about what exactly you have in mind to
achieve the above solutions?
--
With Regards,
Amit Kapila.
On Fri, Feb 23, 2024 at 12:29 AM Давыдов Виталий <v.davydov@postgrespro.ru>
wrote:
Dear All,
I'd like to discuss a problem where 2PC transactions are applied quite
slowly on a replica during logical replication. There are a master and a
replica with logical replication established from the master to the replica
with twophase = true. Under some load on the master, the replica starts to
lag behind, and the lag keeps increasing. We have to decrease the load on
the master significantly to allow the replica to complete the catchup. Such
a problem may create significant difficulties in production. The problem
appears at least on the REL_16_STABLE branch.
To reproduce the problem:
- Setup logical replication from master to replica with subscription
parameter twophase = true.
- Create some intermediate load on the master (use pgbench with custom
sql with prepare+commit)
- Optionally switch off the replica for some time (keep load on
master).
- Switch on the replica and wait until it reaches the master.
The replica will never catch up with the master, even under a fairly low
load on the master. If the load is removed, the replica does catch up, but
it takes much longer than expected. I tried the same with regular
transactions, and the problem doesn't appear even with a decent load.
I tried this setup and I do see that the logical subscriber does reach the
master in a short time. I'm not sure what I'm missing. I stopped the
logical subscriber in between while pgbench was running and then started it
again and ran the following:
postgres=# SELECT sent_lsn, pg_current_wal_lsn() FROM pg_stat_replication;
sent_lsn | pg_current_wal_lsn
-----------+--------------------
0/6793FA0 | 0/6793FA0 <=== caught up
(1 row)
My pgbench command:
pgbench postgres -p 6972 -c 2 -j 3 -f /home/ajin/test.sql -T 200 -P 5
my custom sql file:
cat test.sql
SELECT md5(random()::text) as mygid \gset
BEGIN;
DELETE FROM test WHERE v = pg_backend_pid();
INSERT INTO test(v) SELECT pg_backend_pid();
PREPARE TRANSACTION $$:mygid$$;
COMMIT PREPARED $$:mygid$$;
regards,
Ajin Cherian
Fujitsu Australia
Hi Ajin,
Thank you for your feedback. Could you please try to increase the number of clients (the -c pgbench option) to 20 or more? It seems I forgot to specify it.
With best regards,
Vitaly Davydov
On Fri, Feb 23, 2024 at 12:29 AM Давыдов Виталий <v.davydov@postgrespro.ru> wrote:
Dear All,
I'd like to discuss a problem where 2PC transactions are applied quite slowly on a replica during logical replication. There are a master and a replica with logical replication established from the master to the replica with twophase = true. Under some load on the master, the replica starts to lag behind, and the lag keeps increasing. We have to decrease the load on the master significantly to allow the replica to complete the catchup. Such a problem may create significant difficulties in production. The problem appears at least on the REL_16_STABLE branch.
To reproduce the problem:
* Setup logical replication from master to replica with the subscription parameter twophase = true.
* Create some intermediate load on the master (use pgbench with a custom SQL script with prepare+commit).
* Optionally switch off the replica for some time (keep the load on the master).
* Switch on the replica and wait until it catches up with the master.
The replica will never catch up with the master, even under a fairly low load on the master. If the load is removed, the replica does catch up, but it takes much longer than expected. I tried the same with regular transactions, and the problem doesn't appear even with a decent load.
I tried this setup and I do see that the logical subscriber does reach the master in a short time. I'm not sure what I'm missing. I stopped the logical subscriber in between while pgbench was running and then started it again and ran the following:

postgres=# SELECT sent_lsn, pg_current_wal_lsn() FROM pg_stat_replication;
sent_lsn | pg_current_wal_lsn
-----------+--------------------
0/6793FA0 | 0/6793FA0 <=== caught up
(1 row)
My pgbench command:

pgbench postgres -p 6972 -c 2 -j 3 -f /home/ajin/test.sql -T 200 -P 5

my custom sql file:

cat test.sql
SELECT md5(random()::text) as mygid \gset
BEGIN;
DELETE FROM test WHERE v = pg_backend_pid();
INSERT INTO test(v) SELECT pg_backend_pid();
PREPARE TRANSACTION $$:mygid$$;
COMMIT PREPARED $$:mygid$$;
regards,
Ajin Cherian
Fujitsu Australia
Hi Amit,
Amit Kapila <amit.kapila16@gmail.com> wrote:
I don't see that we do anything specific for 2PC transactions to make them behave differently from regular transactions with respect to the synchronous_commit setting. What makes you think so? Can you pinpoint the code you are referring to?

Yes, sure. The function RecordTransactionCommitPrepared is called on prepared transaction commit (twophase.c). It calls XLogFlush unconditionally. The function RecordTransactionCommit (for regular transactions, xact.c) calls XLogFlush only if synchronous_commit > off; otherwise it calls XLogSetAsyncXactLSN.
There is a comment in RecordTransactionCommitPrepared (by Bruce Momjian) that shows that async commit is not supported yet:
/*
* We don't currently try to sleep before flush here ... nor is there any
* support for async commit of a prepared xact (the very idea is probably
* a contradiction)
*/
/* Flush XLOG to disk */
XLogFlush(recptr);
Right, I think for this we need to implement parallel apply.

Yes, parallel apply is a good point. But, I believe, it will not work if asynchronous commit is not supported. You have only one receiver process which has to dispatch incoming messages to parallel workers. I guess you will never reach the same rate of parallel execution on the replica as on the master with its multiple backends.
Can you be a bit more specific about what exactly you have in mind to achieve the above solutions?

My proposal is to implement async commit for 2PC transactions as it is done for regular transactions. It should significantly speed up the catchup process. Then we can think about how to apply in parallel, which is much more difficult to do. The current problem is getting the 2PC state from the WAL on commit prepared. At that moment the WAL may not be flushed yet, and the commit function waits until the WAL with the 2PC state is flushed. I tried to do it in my sandbox and ran into exactly this problem. The inability to read the 2PC state from unflushed WAL is what stops me right now. I am thinking about possible solutions.
The idea with enableFsync is not a suitable solution in general, I think; I only mentioned it as an alternative. You set enableFsync = false before prepare or commit prepared and set enableFsync = true after these functions. In this case, 2PC records will not be fsync-ed, but FlushPtr will still advance, so the 2PC state can be read from the WAL on commit prepared without waiting. To make it work correctly, I guess we would have to do some additional work to keep more WAL on the master and to filter out duplicate transactions on the replica if the replica restarts during catchup.
With best regards,
Vitaly Davydov
On Fri, Feb 23, 2024 at 10:41 PM Давыдов Виталий
<v.davydov@postgrespro.ru> wrote:
Amit Kapila <amit.kapila16@gmail.com> wrote:
I don't see we do anything specific for 2PC transactions to make them behave differently than regular transactions with respect to synchronous_commit setting. What makes you think so? Can you pin point the code you are referring to?
Yes, sure. The function RecordTransactionCommitPrepared is called on prepared transaction commit (twophase.c). It calls XLogFlush unconditionally. The function RecordTransactionCommit (for regular transactions, xact.c) calls XLogFlush if synchronous_commit > OFF, otherwise it calls XLogSetAsyncXactLSN.
There is some comment in RecordTransactionCommitPrepared (by Bruce Momjian) that shows that async commit is not supported yet:
/*
* We don't currently try to sleep before flush here ... nor is there any
* support for async commit of a prepared xact (the very idea is probably
* a contradiction)
*/
/* Flush XLOG to disk */
XLogFlush(recptr);
It seems this comment was added in commit 4a78cdeb where we added
async commit support. I think the reason is probably that when the WAL
record for the prepare is already flushed, what would be the point of
async commit here?
Right, I think for this we need to implement parallel apply.
Yes, parallel apply is a good point. But, I believe, it will not work if asynchronous commit is not supported. You have only one receiver process which should dispatch incoming messages to parallel workers. I guess, you will never reach such rate of parallel execution on replica as on the master with multiple backends.
Can you be a bit more specific about what exactly you have in mind to achieve the above solutions?
My proposal is to implement async commit for 2PC transactions as it is done for regular transactions. It should significantly speed up the catchup process. Then we can think about how to apply in parallel, which is much more difficult to do. The current problem is getting the 2PC state from the WAL on commit prepared. At that moment the WAL may not be flushed yet, and the commit function waits until the WAL with the 2PC state is flushed. I tried to do it in my sandbox and ran into exactly this problem. The inability to read the 2PC state from unflushed WAL is what stops me right now. I am thinking about possible solutions.
At commit prepared, it seems we read the prepare's WAL record, right? If
so, it is not clear to me whether you see a problem with the flush of
commit_prepared, with reading the WAL for the prepare, or with both.
--
With Regards,
Amit Kapila.
Hi Amit,
Thank you for your interest in the discussion!
On Monday, February 26, 2024 16:24 MSK, Amit Kapila <amit.kapila16@gmail.com> wrote:
I think the reason is probably that when the WAL record for prepared is already flushed then what will be the idea of async commit here?

I think the idea of async commit should be applied to both transactions: PREPARE and COMMIT PREPARED, which are actually two separate local transactions. For both of these transactions we may call XLogSetAsyncXactLSN on commit instead of XLogFlush when async commit is enabled. When I say async commit, I mean applying async commit to the local transactions, not to the twophase (prepared) transaction itself.
At commit prepared, it seems we read the prepare's WAL record, right? If so, it is not clear to me whether you see a problem with the flush of commit_prepared, with reading the WAL for the prepare, or with both.

The problem with reading the WAL is due to the async commit of PREPARE TRANSACTION, which saves the 2PC state in the WAL. At the moment of COMMIT PREPARED, the WAL with the PREPARE TRANSACTION 2PC state may not have been XLogFlush-ed yet, so COMMIT PREPARED has to wait until the 2PC state of the PREPARE TRANSACTION is flushed.
I did some experiments with saving the 2PC state in the local memory of the logical replication worker, and, I think, it worked and demonstrated much better performance: the logical replication worker utilized up to 100% CPU. I'm just concerned about possible problems with async commit for twophase transactions.
To be more specific, I've attached a patch to support async commit for twophase transactions. It is not a final patch; it is presented for discussion purposes only. There were some attempts to save the 2PC state in memory in the past, but they were rejected. Now there might be a second round of that discussion.
With best regards,
Vitaly
Attachments:
0001-Add-asynchronous-commit-support-for-2PC.patch (text/x-patch)
On Tue, Feb 27, 2024 at 4:49 PM Давыдов Виталий
<v.davydov@postgrespro.ru> wrote:
Thank you for your interest in the discussion!
On Monday, February 26, 2024 16:24 MSK, Amit Kapila <amit.kapila16@gmail.com> wrote:
I think the reason is probably that when the WAL record for prepared is already flushed then what will be the idea of async commit here?
I think, the idea of async commit should be applied for both transactions: PREPARE and COMMIT PREPARED, which are actually two separate local transactions. For both these transactions we may call XLogSetAsyncXactLSN on commit instead of XLogFlush when async commit is enabled. When I use async commit, I mean to apply async commit to local transactions, not to a twophase (prepared) transaction itself.
At commit prepared, it seems we read prepare's WAL record, right? If so, it is not clear to me do you see a problem with a flush of commit_prepared or reading WAL for prepared or both of these.
The problem with reading WAL is due to async commit of PREPARE TRANSACTION which saves 2PC in the WAL. At the moment of COMMIT PREPARED the WAL with PREPARE TRANSACTION 2PC state may not be XLogFlush-ed yet.
As we do XLogFlush() at the time of prepare, why is it not
available? Or are you talking about this state after your idea/patch
where you are trying to make both the prepare and commit_prepared records
async?
So, PREPARE TRANSACTION should wait until its 2PC state is flushed.
I did some experiments with saving 2PC state in the local memory of logical replication worker and, I think, it worked and demonstrated much better performance. Logical replication worker utilized up to 100% CPU. I'm just concerned about possible problems with async commit for twophase transactions.
To be more specific, I've attached a patch to support async commit for twophase. It is not the final patch but it is presented only for discussion purposes. There were some attempts to save 2PC in memory in past but it was rejected.
It would be good if you could link those threads.
--
With Regards,
Amit Kapila.
Hi Amit,
On Tuesday, February 27, 2024 16:00 MSK, Amit Kapila <amit.kapila16@gmail.com> wrote:
As we do XLogFlush() at the time of prepare, why is it not available? Or are you talking about this state after your idea/patch where you are trying to make both the prepare and commit_prepared records async?

Right, I'm talking about my patch where async commit is implemented. There is no such problem with reading the 2PC state from unflushed WAL in vanilla, because XLogFlush is called unconditionally, as you've described. But an attempt to add async behavior leads to the problem of reading unflushed WAL. That is why I store the 2PC state in local memory in my patch.
It would be good if you could link those threads.

Sure, I will find and add some links to those past discussions.
Thank you!
With best regards,
Vitaly
Dear All,
Please consider my patch for async commit for twophase transactions. It can be applicable when catchup performance is not enough with the publication parameter twophase = on.
The key changes are:
* Use XLogSetAsyncXactLSN instead of XLogFlush, as it is done for regular transactions.
* In the async commit case only, save the 2PC state in the pg_twophase file (but do not fsync it), in addition to saving it in the WAL. The file is used as an alternative to storing the 2PC state in memory.
* On recovery, reject pg_twophase files with future xids.
Probably, 2PC async commit should be enabled by a GUC (not implemented in the patch).
With best regards,
Vitaly
Attachments:
0001-Async-commit-support-for-twophase-transactions.patch (text/x-patch)
On 29/02/2024 19:34, Давыдов Виталий wrote:
Dear All,
Please consider my patch for async commit for twophase transactions.
It can be applicable when catchup performance is not enough with the
publication parameter twophase = on.
The key changes are:
* Use XLogSetAsyncXactLSN instead of XLogFlush as it is for usual
transactions.
* In case of async commit only, save 2PC state in the pg_twophase file
(but not fsync it) in addition to saving in the WAL. The file is
used as an alternative to storing 2pc state in the memory.
* On recovery, reject pg_twophase files with future xids.
Probably, 2PC async commit should be enabled by a GUC (not implemented
in the patch).
In a nutshell, this changes PREPARE TRANSACTION so that if
synchronous_commit is 'off', the PREPARE TRANSACTION is not fsync'd to
disk. So if you crash after the PREPARE TRANSACTION has returned, the
transaction might be lost. I think that's completely unacceptable.
If you're ok to lose the prepared state of twophase transactions on
crash, why don't you create the subscription with 'two_phase=off' to
begin with?
--
Heikki Linnakangas
Neon (https://neon.tech)
Hi Heikki,
Thank you for the reply.
On Tuesday, March 05, 2024 12:05 MSK, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
In a nutshell, this changes PREPARE TRANSACTION so that if
synchronous_commit is 'off', the PREPARE TRANSACTION is not fsync'd to
disk. So if you crash after the PREPARE TRANSACTION has returned, the
transaction might be lost. I think that's completely unacceptable.
You are right, the prepared transaction might be lost after a crash. The same may happen with regular transactions that are not fsync-ed on the replica in logical replication by default: the subscription parameter synchronous_commit is off by default. I'm not sure whether there is some automatic recovery for regular transactions. I think the main difference between these two cases is how to recover manually when some PREPARE TRANSACTION or COMMIT PREPARED is lost. For regular transactions, some updates or deletes in tables on the replica may be enough to fix the problem. For twophase transactions, it may be harder to fix by hand, but it is possible, I believe. If you build a custom solution on top of twophase transactions (like multimaster), such recovery may happen automatically. Another option is to ignore errors on commit prepared if the corresponding prepared transaction is missing. I don't know of other risks that may come with async commit of twophase transactions.
If you're ok to lose the prepared state of twophase transactions on
crash, why don't you create the subscription with 'two_phase=off' to
begin with?

In normal operation, the subscription has two_phase = on. I would have to change this option at the catchup stage only, but this parameter cannot be altered. There was a patch proposal in the past to implement altering of the two_phase option, but it was rejected. I think recreating the subscription with two_phase = off will not work.
I believe async commit for twophase transactions during catchup will significantly improve the catchup performance. It is worth thinking about such a feature.
P.S. We might introduce a GUC option to allow async commit for twophase transactions. By default, sync commit would be applied to twophase transactions, as it is now.
With best regards,
Vitaly Davydov
On Tue, Mar 5, 2024 at 7:59 PM Давыдов Виталий <v.davydov@postgrespro.ru> wrote:
Thank you for the reply.
On Tuesday, March 05, 2024 12:05 MSK, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
In a nutshell, this changes PREPARE TRANSACTION so that if
synchronous_commit is 'off', the PREPARE TRANSACTION is not fsync'd to
disk. So if you crash after the PREPARE TRANSACTION has returned, the
transaction might be lost. I think that's completely unacceptable.

You are right, the prepared transaction might be lost after crash. The same may happen with regular transactions that are not fsync-ed on replica in logical replication by default. The subscription parameter synchronous_commit is OFF by default. I'm not sure, is there some auto recovery for regular transactions?
Unless the commit WAL is flushed, we wouldn't have updated the
replication origin's LSN, and neither would the walsender increase the
confirmed_flush_lsn for the corresponding slot till the commit is
flushed on the subscriber. So, if the subscriber crashed before flushing
the commit record, the server should send the same transaction again. The
same should be true for the prepared transaction stuff as well.
--
With Regards,
Amit Kapila.
On Wed, Mar 6, 2024 at 1:29 AM Давыдов Виталий <v.davydov@postgrespro.ru>
wrote:
In usual work, the subscription has two_phase = on. I have to change this
option at catchup stage only, but this parameter can not be altered. There
was a patch proposal in past to implement altering of two_phase option, but
it was rejected. I think, the recreation of the subscription with two_phase
= off will not work.
The altering of two_phase was restricted because if there was a previously
prepared transaction on the subscriber when the two_phase was on, and then
it was turned off, the apply worker on the subscriber would re-apply the
transaction a second time and this might result in an inconsistent replica.
Here's a patch that allows toggling two_phase option provided that there
are no pending uncommitted prepared transactions on the subscriber for that
subscription.
Thanks to Kuroda-san for working on the patch.
regards,
Ajin Cherian
Fujitsu Australia
Attachments:
v1-0001-Allow-altering-of-two_phase-option-in-subscribers.patch (application/octet-stream)
On Thu, Apr 4, 2024 at 10:53 AM Ajin Cherian <itsajin@gmail.com> wrote:
On Wed, Mar 6, 2024 at 1:29 AM Давыдов Виталий <v.davydov@postgrespro.ru> wrote:
In usual work, the subscription has two_phase = on. I have to change this option at catchup stage only, but this parameter can not be altered. There was a patch proposal in past to implement altering of two_phase option, but it was rejected. I think, the recreation of the subscription with two_phase = off will not work.
The altering of two_phase was restricted because if there was a previously prepared transaction on the subscriber when the two_phase was on, and then it was turned off, the apply worker on the subscriber would re-apply the transaction a second time and this might result in an inconsistent replica.
Here's a patch that allows toggling two_phase option provided that there are no pending uncommitted prepared transactions on the subscriber for that subscription.
I think this would probably be better than the current situation but
can we think of a solution to allow toggling the value of two_phase
even when prepared transactions are present? Can you please summarize
the reason for the problems in doing that and the solutions, if any?
--
With Regards,
Amit Kapila.
On Thu, Apr 4, 2024 at 4:38 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
I think this would probably be better than the current situation but
can we think of a solution to allow toggling the value of two_phase
even when prepared transactions are present? Can you please summarize
the reason for the problems in doing that and the solutions, if any?
--
With Regards,
Amit Kapila.
Updated the patch, as it wasn't addressing updating of two-phase in the
remote slot.
Currently the main issue that needs to be handled is the handling of
pending prepared transactions while the two_phase is altered. I see 3
issues with the current approach.
1. Uncommitted prepared transactions when toggling two_phase from true to
false
When two_phase was true, prepared transactions were decoded at PREPARE time
and sent to the subscriber, where they were then prepared with a
new gid. Once two_phase is toggled to false, the COMMIT PREPARED
on the publisher is converted to a commit and the entire transaction is
decoded and sent to the subscriber. This leaves the previously
prepared transaction pending.
2. Uncommitted prepared transactions when toggling two_phase from false to
true
When two_phase was false, prepared transactions were ignored and not
decoded at PREPARE time on the publisher. Once two_phase is toggled to
true, the apply worker and the walsender are restarted and replication is
restarted from a new "start_decoding_at" LSN. Now, this new
"start_decoding_at" could be past the LSN of the PREPARE record, and if so,
the PREPARE record is skipped and not sent to the subscriber. Look at the
comments in DecodeTXNNeedSkip() for details. Later, when the user issues
COMMIT PREPARED, this is decoded and sent to the subscriber, but there is
no prepared transaction on the subscriber, and this fails because the
corresponding gid of the transaction couldn't be found.
3. While altering the two_phase of the subscription, it is required to also
alter the two_phase field of the slot on the primary. The subscription
cannot remotely alter the two_phase option of the slot when the
subscription is enabled, as the slot is owned by the walsender on the
publisher side.
Possible solutions for the 3 problems:
1. While toggling two_phase from true to false, we could probably get a
list of prepared transactions for this subscriber id and rollback/abort the
prepared transactions. This will allow the transactions to be re-applied
like a normal transaction when the commit comes. Alternatively, if doing
this isn't appropriate in the ALTER SUBSCRIPTION context, we could store
the xids of all prepared transactions of this subscription in a list and,
when the corresponding xid is being committed by the apply worker, prior to
commit, make sure the previously prepared transaction is rolled back. But
this would add the overhead of checking this list every time a transaction
is committed by the apply worker.
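A rough sketch of the first alternative, assuming the apply worker's gid
format 'pg_gid_<subid>_<xid>' (the subscription name and example gid below
are hypothetical):

```sql
-- On the subscriber: list prepared transactions belonging to one
-- subscription by matching the gid prefix the apply worker generates.
SELECT gid
FROM pg_prepared_xacts
WHERE gid LIKE 'pg_gid_' ||
      (SELECT oid FROM pg_subscription WHERE subname = 'mysub') || '\_%';

-- Each returned gid would then be rolled back, e.g.:
-- ROLLBACK PREPARED 'pg_gid_16394_1234';  -- illustrative gid
```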
2. No solution yet.
3. We could mandate that altering the two_phase state only be done after
disabling the subscription, just as it is handled for the failover option.
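Under that mandate, the toggle would look something like this (the
subscription name is illustrative, and the SET (two_phase = ...) syntax is
the one proposed by the patch, not yet in any release):

```sql
ALTER SUBSCRIPTION mysub DISABLE;
ALTER SUBSCRIPTION mysub SET (two_phase = off);  -- proposed syntax
ALTER SUBSCRIPTION mysub ENABLE;
```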
Let me know your thoughts.
regards,
Ajin Cherian
Fujitsu Australia
Attachments:
v2-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPT.patch
On Fri, Apr 5, 2024 at 4:59 PM Ajin Cherian <itsajin@gmail.com> wrote:
On Thu, Apr 4, 2024 at 4:38 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
I think this would probably be better than the current situation but
can we think of a solution to allow toggling the value of two_phase
even when prepared transactions are present? Can you please summarize
the reason for the problems in doing that and the solutions, if any?
Updated the patch, as it wasn't addressing updating of two-phase in the remote slot.
Vitaly, does the minimal solution provided by the proposed patch
(Allow to alter two_phase option of a subscriber provided no
uncommitted
prepared transactions are pending on that subscription.) address your use case?
Currently the main issue that needs to be handled is the handling of pending prepared transactions while the two_phase is altered. I see 3 issues with the current approach.
1. Uncommitted prepared transactions when toggling two_phase from true to false
When two_phase was true, prepared transactions were decoded at PREPARE time and sent to the subscriber, where they were prepared with a new gid. Once two_phase is toggled to false, the COMMIT PREPARED on the publisher is converted to a commit and the entire transaction is decoded and sent to the subscriber. This leaves the previously prepared transaction pending.
2. Uncommitted prepared transactions when toggling two_phase from false to true
When two_phase was false, prepared transactions were ignored and not decoded at PREPARE time on the publisher. Once two_phase is toggled to true, the apply worker and the walsender are restarted and replication is restarted from a new "start_decoding_at" LSN. Now, this new "start_decoding_at" could be past the LSN of the PREPARE record and if so, the PREPARE record is skipped and not sent to the subscriber. Look at the comments in DecodeTXNNeedSkip() for details. Later, when the user issues COMMIT PREPARED, this is decoded and sent to the subscriber, but there is no prepared transaction on the subscriber, and this fails because the corresponding gid of the transaction couldn't be found.
3. While altering the two_phase of the subscription, it is required to also alter the two_phase field of the slot on the primary. The subscription cannot remotely alter the two_phase option of the slot when the subscription is enabled, as the slot is owned by the walsender on the publisher side.
Thanks for summarizing the reasons for not allowing altering the
two_pc property for a subscription.
Possible solutions for the 3 problems:
1. While toggling two_phase from true to false, we could probably get a list of prepared transactions for this subscriber id and rollback/abort the prepared transactions. This will allow the transactions to be re-applied like a normal transaction when the commit comes. Alternatively, if doing this isn't appropriate in the ALTER SUBSCRIPTION context, we could store the xids of all prepared transactions of this subscription in a list and, when the corresponding xid is being committed by the apply worker, prior to commit, make sure the previously prepared transaction is rolled back. But this would add the overhead of checking this list every time a transaction is committed by the apply worker.
In the second solution, if you check at the time of commit whether
there exists a prior prepared transaction then won't we end up
applying the changes twice? I think we can first try to achieve it at
the time of Alter Subscription because the other solution can add
overhead at each commit?
2. No solution yet.
One naive idea is that on the publisher we can remember whether the
prepare has been sent and if so then only send commit_prepared,
otherwise send the entire transaction. On the subscriber-side, we
somehow, need to ensure before applying the first change whether the
corresponding transaction is already prepared and if so then skip the
changes and just perform the commit prepared. One drawback of this
approach is that after restart, the prepare flag wouldn't be saved in
the memory and we end up sending the entire transaction again. One way
to avoid this overhead is that the publisher before sending the entire
transaction checks with subscriber whether it has a prepared
transaction corresponding to the current commit. I understand that
this is not a good idea even if it works but I don't have any better
ideas. What do you think?
3. We could mandate that altering the two_phase state only be done after disabling the subscription, just as it is handled for the failover option.
makes sense.
--
With Regards,
Amit Kapila.
Hi Amit, Ajin, All
Thank you for the patch and the responses. I apologize for my delayed answer due to some circumstances.
On Wednesday, April 10, 2024 14:18 MSK, Amit Kapila <amit.kapila16@gmail.com> wrote:
Vitaly, does the minimal solution provided by the proposed patch (Allow to alter two_phase option of a subscriber provided no uncommitted prepared transactions are pending on that subscription.) address your use case?
In general, the idea behind the patch seems to be suitable for my case. Furthermore, the case of a two_phase switch from false to true with uncommitted pending prepared transactions probably never happens in my case. The switch from false to true means that the replica completes the catchup from the master and switches to the normal mode, in which it participates in the multi-node configuration. There should be no uncommitted pending prepared transactions at the moment of the switch to the normal mode.
I'm going to try this patch. Please give me some time to investigate it; I will come back with feedback a little later.
Thank you for your help!
With best regards,
Vitaly Davydov
Dear Amit,
One naive idea is that on the publisher we can remember whether the
prepare has been sent and if so then only send commit_prepared,
otherwise send the entire transaction. On the subscriber-side, we
somehow, need to ensure before applying the first change whether the
corresponding transaction is already prepared and if so then skip the
changes and just perform the commit prepared. One drawback of this
approach is that after restart, the prepare flag wouldn't be saved in
the memory and we end up sending the entire transaction again. One way
to avoid this overhead is that the publisher before sending the entire
transaction checks with subscriber whether it has a prepared
transaction corresponding to the current commit. I understand that
this is not a good idea even if it works but I don't have any better
ideas. What do you think?
An alternative idea is that the worker passes a list of prepared
transactions as a new option in START_REPLICATION. This can reduce the
number of inter-node communications, but sometimes the list may be huge.
Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
Dear Amit,
Vitaly, does the minimal solution provided by the proposed patch
(Allow to alter two_phase option of a subscriber provided no
uncommitted
prepared transactions are pending on that subscription.) address your use case?
I think we do not have to handle cases in which there are prepared
transactions on the publisher/subscriber, as a first step. That leads to
additional complexity, and we do not have smarter solutions, especially for
problem 2. IIUC this meets Vitaly's condition, right?
1. While toggling two_phase from true to false, we could probably get a list of
prepared transactions for this subscriber id and rollback/abort the prepared
transactions. This will allow the transactions to be re-applied like a normal
transaction when the commit comes. Alternatively, if this isn't appropriate doing it
in the ALTER SUBSCRIPTION context, we could store the xids of all prepared
transactions of this subscription in a list and when the corresponding xid is being
committed by the apply worker, prior to commit, we make sure the previously
prepared transaction is rolled back. But this would add the overhead of checking
this list every time a transaction is committed by the apply worker.
In the second solution, if you check at the time of commit whether
there exists a prior prepared transaction then won't we end up
applying the changes twice? I think we can first try to achieve it at
the time of Alter Subscription because the other solution can add
overhead at each commit?
Yeah, at least the second solution might be problematic. I prototyped the
first one and it worked well. However, to make the feature more consistent,
the patch prohibits prepared transactions from existing on the subscriber
for now. We can ease this restriction based on requirements.
2. No solution yet.
One naive idea is that on the publisher we can remember whether the
prepare has been sent and if so then only send commit_prepared,
otherwise send the entire transaction. On the subscriber-side, we
somehow, need to ensure before applying the first change whether the
corresponding transaction is already prepared and if so then skip the
changes and just perform the commit prepared. One drawback of this
approach is that after restart, the prepare flag wouldn't be saved in
the memory and we end up sending the entire transaction again. One way
to avoid this overhead is that the publisher before sending the entire
transaction checks with subscriber whether it has a prepared
transaction corresponding to the current commit. I understand that
this is not a good idea even if it works but I don't have any better
ideas. What do you think?
I considered it, but I am not sure it is good to add such a mechanism. Your
idea requires an additional wait loop, which might lead to bugs and
unexpected behavior. And it may degrade performance depending on the
network environment.
As for the other solution (the worker sends a list of prepared
transactions), it is also not so good because the list of prepared
transactions may be huge.
Based on the above, I think we can reject this case for now.
FYI - We also considered an idea in which the walsender waits until all
prepared transactions are resolved before decoding and sending changes, but
it did not work well - the restarted walsender sent only the COMMIT
PREPARED record for transactions which had been prepared before disabling
the subscription. This happened because
1) if the two_phase option of the slot is false, confirmed_flush can be
ahead of the PREPARE record, and
2) after the altering and restarting, start_decoding_at becomes the same as
confirmed_flush, and records behind it won't be decoded.
3. We could mandate that the altering of two_phase state only be done after
disabling the subscription, just like how it is handled for failover option.
makes sense.
OK, this spec was added.
Based on the above, I updated the patch with Ajin.
0001 - extends the ALTER SUBSCRIPTION statement. Tab-completion was added.
0002 - mandates that the subscription be disabled. Since there is no need
to change AtEOXact_ApplyLauncher(), that change was reverted.
If there are no objections, this can be included in 0001.
0003 - checks whether there are transactions prepared by the worker. If
found, the ALTER SUBSCRIPTION command is rejected.
0004 - checks whether there are transactions prepared on the publisher. The
backend connects to the publisher and confirms it. If found, the ALTER
SUBSCRIPTION command is rejected.
0005 - adds a TAP test for it.
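To illustrate the checks added by 0003/0004, a session might look like this
(the subscription name and the exact error wording are illustrative, not
taken from the patch):

```sql
-- With a prepared transaction still pending for the subscription,
-- the new checks would reject the command:
ALTER SUBSCRIPTION mysub SET (two_phase = off);
-- ERROR (illustrative): cannot alter two_phase when prepared
-- transactions exist for this subscription
```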
Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/