An improvement of ProcessTwoPhaseBuffer logic
Dear Hackers,
I would like to discuss ProcessTwoPhaseBuffer function. It reads two-phase transaction states from disk or the WAL. It takes xid as well as some other input parameters and executes the following steps:
Step #1: Check if xid is committed or aborted in clog (TransactionIdDidCommit, TransactionIdDidAbort)
Step #2: Check if xid is not equal or greater than ShmemVariableCache->nextXid
Step #3: Read two-phase state for the specified xid from memory or the corresponding file and returns it
In some, very rare scenarios, the postgres instance will newer recover because of such logic. Imagine, that the two_phase directory contains some files with two-phase states of transactions of distant future. I assume, it can happen if some WAL segments are broken and ignored (as well as clog data) but two_phase directory was not broken. In recovery, postgresql reads all the files in two_phase and tries to recover two-phase states.
The problem appears in the functions TransactionIdDidCommit or TransactionIdDidAbort. These functions may fail with the FATAL message like below when no clog state on disk is available for the xid:
FATAL: could not access status of transaction 286331153
DETAIL: Could not open file "pg_xact/0111": No such file or directory.
Such error do not allow the postgresql instance to be started.
My guess, if to swap Step #1 with Step #2 such error will disappear because transactions will be filtered when comparing xid with ShmemVariableCache->nextXid before accessing clog. The function will be more robust. In general, it works but I'm not sure that such logic will not break some rare boundary cases. Another solution is to catch and ignore such error, but the original solution is the simpler one. I appreciate any thoughts concerning this topic. May be, you know some cases when such change in logic is not relevant?
Thank you in advance!
With best regards,
Vitaly
On Tue, Dec 24, 2024 at 04:26:32PM +0300, Vitaly Davydov wrote:
In some, very rare scenarios, the postgres instance will newer
recover because of such logic. Imagine, that the two_phase directory
contains some files with two-phase states of transactions of distant
future. I assume, it can happen if some WAL segments are broken and
ignored (as well as clog data) but two_phase directory was not
broken. In recovery, postgresql reads all the files in two_phase and
tries to recover two-phase states.
Well, one recent issue in this area is a bug fixed by cf4401fe6cf5:
/messages/by-id/tencent_A7F059B5136A359625C7B2E4A386B3C3F007@qq.com
If you see other bug patterns like this one, please let me know.
There are good changes that I could be involved in stuff that touched
this area of the code, so I'm mostly behind its maintenance these
days.
My guess, if to swap Step #1 with Step #2 such error will disappear
because transactions will be filtered when comparing xid with
ShmemVariableCache->nextXid before accessing clog. The function will
be more robust. In general, it works but I'm not sure that such
logic will not break some rare boundary cases. Another solution is
to catch and ignore such error, but the original solution is the
simpler one. I appreciate any thoughts concerning this topic. May
be, you know some cases when such change in logic is not relevant?
Hmm. Historically, up to 9.6, StandbyRecoverPreparedTransactions()
and RecoverPreparedTransactions() have been in charge of doing the
commit/abort checks with potentially clog lookups when it came to past
files. PrescanPreparedTransactions() has been in charge of doing the
check about future files.
If you look closely, PrescanPreparedTransactions() is called in two
code paths before StandbyRecoverPreparedTransactions(), meaning that
we did the checks for the future files first, then the past checks
with potential CLOG lookups.
In v10~, things have changed to use ProcessTwoPhaseBuffer(), and as
you say, the checks are what we have now: they're done in a reverse
order. So, yes, I agree that you have a point here: the past code was
safer because it would have been able to avoid clog lookups if we had
a 2PC file in the future for a reason or another so it was safer than
what we have now.
I'm wondering if the attached patch is something worth backpatching,
actually, as these failing CLOG lookups can lead to spurious failures
at startup. HEAD-only would be OK based on the odds so that's a
no-brainer. Vitaly, have you seen that in the wild as an effect of
future 2PC files?
Any opinions from others?
--
Michael
Attachments:
twophase-process-checks.patchtext/x-diff; charset=us-asciiDownload+20-20
Hi Michael,
Thank you for the explanation and the patch! I'm happy, that I seem to be on the right way.
On Wednesday, December 25, 2024 08:04 MSK, Michael Paquier <michael@paquier.xyz> wrote:
Vitaly, have you seen that in the wild as an effect of future 2PC files?
I haven't heard about this problem in production, but I've encountered it when I did some development in two-phase functionality. I fixed it by swapping the if blocks and it solved my problem.
It is pretty easy to reproduce it on the master branch:
1. Start an instance with enabled prepared transactions
2. Create a simple prepared transaction
3. Do checkpoint to write the transaction two-phase state into a file in pg_twophase subdirectory
4. Copy the created file, change its name to reflect a future xid (in my case: cp 00000000000002E8 00000000000FF2E8)
5. Commit the prepared transaction
6. Stop the instance with -m immediate
7. Start the instance
After starting, you can get an error like "could not start server". In the log file you can find a message like:
LOG: database system was not properly shut down; automatic recovery in progress
FATAL: could not access status of transaction 1045224
DETAIL: Could not read from file "pg_xact/0000" at offset 253952: read too few bytes.
2024-12-25 18:38:30.606 MSK [795557] LOG: startup process (PID 795560) exited with exit code 1
I tried your patch and it seems the server is started successfully. But I've found another problem in my synthetic test - it can not remove the file with the following message:
LOG: database system was not properly shut down; automatic recovery in progress
WARNING: removing future two-phase state file for transaction 1045224
WARNING: could not remove file "pg_twophase/FFFFFFFF000FF2E8": No such file or directory
LOG: redo starts at 0/1762978
The fill will never be removed automatically.
I guess, it is because we incorrectly calculate the two-phase file name using TwoPhaseFilePath in RemoveTwoPhaseFile in this scenario. It can be fixed if to pass file path directly from RecoverPreparedTransactions or StandbyRecoverPreparedTransaction into ProcessTwoPhaseBuffer -> RemoveTwoPhaseFile. I did it in the proposed patch /messages/by-id/cedbe-65e0c000-1-6db17700@133269862 (it is incomplete now).
With best regards,
Vitaly
Import Notes
Resolved by subject fallback
On Wed, Dec 25, 2024 at 07:18:18PM +0300, Vitaly Davydov wrote:
I tried your patch and it seems the server is started
successfully. But I've found another problem in my synthetic test -
it can not remove the file with the following message:LOG: database system was not properly shut down; automatic recovery in progress
WARNING: removing future two-phase state file for transaction 1045224
WARNING: could not remove file "pg_twophase/FFFFFFFF000FF2E8": No such file or directory
LOG: redo starts at 0/1762978
That works properly up to v16~. I have slept over this problem and I
think that we'd better backpatch the fix for the first problem. Now,
I also want to add a test to provide coverage. As you say, that's
simple:
- Create an empty file with a future XID name
- Restart the server.
- Scan the logs for the WARNING where the future file is removed.
The fill will never be removed automatically.
I guess, it is because we incorrectly calculate the two-phase file
name using TwoPhaseFilePath in RemoveTwoPhaseFile in this
scenario. It can be fixed if to pass file path directly from
RecoverPreparedTransactions or StandbyRecoverPreparedTransaction
into ProcessTwoPhaseBuffer -> RemoveTwoPhaseFile.
Yeah, I've noticed that as well. We are dealing with a second bug
that affects 17~ caused by the switch to FullTransactionIds for 2PC
file names, which had been introduced in 5a1dfde8334b (pretty sure
that it is the culprit, did not bisect). Attempting to remove a
future file triggers an assertion failure:
#3 0x00007fed046324f0 in __GI_abort () at ./stdlib/abort.c:79
#4 0x0000563e68a9bd07 in ExceptionalCondition
(conditionName=0x563e68c66fc3 "epoch > 0", fileName=0x563e68c66b0b
"twophase.c", lineNumber=953) at assert.c:66
#5 0x0000563e67683060 in AdjustToFullTransactionId (xid=4095) at
twophase.c:953
#6 0x0000563e67683092 in TwoPhaseFilePath (path=0x7ffcb515b4a0
"h\307!\227>V", xid=4095) at twophase.c:963
#7 0x0000563e67686603 in RemoveTwoPhaseFile (xid=4095,
giveWarning=true) at twophase.c:1728
#8 0x0000563e67688989 in ProcessTwoPhaseBuffer (xid=4095,
prepare_start_lsn=0, fromdisk=true, setParent=false, setNextXid=false)
at twophase.c:2219
#9 0x0000563e676872a6 in restoreTwoPhaseData () at twophase.c:1924
#10 0x0000563e676b5c8b in StartupXLOG () at xlog.c:5642
So this means that we are dealing with two different bugs, and that we
need to fix the assertion failure of the second problem first down to
17, then fix the first problem on all stable branches with a test to
cover both the first and second problems.
I did it in the proposed patch
/messages/by-id/cedbe-65e0c000-1-6db17700@133269862
(it is incomplete now).
I suspect that this can be still dangerous as-is while complicating
the code with more possible paths for the removal of the 2PC files,
because we may still build a file name from an XID, and that's what's
causing this issue..
--
Michael
On Thu, Dec 26, 2024 at 07:42:34AM +0900, Michael Paquier wrote:
I suspect that this can be still dangerous as-is while complicating
the code with more possible paths for the removal of the 2PC files,
because we may still build a file name from an XID, and that's what's
causing this issue..
Here are two patches to address both issues:
- 0001 for the epoch calculation, down to 17, which would take care of
the underflow problem when having a 2PC file that has an XID in the
future at epoch 0. It is true that it could be smarter when dealing
with files from other epochs, and that this is a problem in v17~. I
think that this should integrate better with FullTransactionIds moving
forward rather than pass the file names. That would be something
quite invasive, only for HEAD. At least that's my impression.
- 0002, to take care of the future file issue, down to 13. This
includes a TAP test to demonstrate the problem. The test needs a
tweak for the 2PC file name in v17~.
--
Michael
Attachments:
v1-0001-Fix-failure-with-incorrect-epoch-for-future-2PC-f.patchtext/x-diff; charset=us-asciiDownload+1-3
v1-0002-Fix-handling-of-orphaned-2PC-files-in-the-future-.patchtext/x-diff; charset=us-asciiDownload+38-54
v1-0002-Fix-handling-of-orphaned-2PC-files-in-the-fut-v16.patchtext/x-diff; charset=us-asciiDownload+43-21
Hi Michael,
Here are two patches to address both issues:
Thank you for the preparing the patches and test simplification. My bad, I overcompilcated it.
I concerned about twophase file name generation. The function TwoPhaseFilePath() is pretty straitforward and unambiguous in 16 and earlier versions. The logic of this function in 17+ seems to be more complex. I do not understand it clearly. But, I guess, it will work incorrectly after turning to a newer epoch, because the epoch is calculated from TransamVariables->nextXid, but not from problem xid. The same problem may happen if we are in epoch 1 or greater. It will produce a wrong file name, because the epoch will be obtained from the last xid, not from file name xid. In another words, the function AdjustToFullTransactionId assumes that if xid > TransamVariables->nextXid, then the xid from the previous epoch. I may be not the case in our scenario.
I suspect that this can be still dangerous as-is while complicating the code with more possible paths for the removal of the 2PC files
Agree, but we may pass file name into TwoPhaseFilePath if any, instead of creation of two functions as in the patch. The cost - two new if conditions. Using file names is pretty safe. Once we read the file and extract xid from its name, just pass this file name to TwoPhaseFilePath(). If not, try to generate it. Anyway, I do not insist on it, just try to discuss.
With best regards,
Vitaly
On Thu, Dec 26, 2024 at 06:11:25PM +0300, Давыдов Виталий wrote:
I concerned about twophase file name generation. The function
TwoPhaseFilePath() is pretty straitforward and unambiguous in 16 and
earlier versions. The logic of this function in 17+ seems to be more
complex. I do not understand it clearly. But, I guess, it will work
incorrectly after turning to a newer epoch, because the epoch is
calculated from TransamVariables->nextXid, but not from problem
xid. The same problem may happen if we are in epoch 1 or greater. It
will produce a wrong file name, because the epoch will be obtained
from the last xid, not from file name xid. In another words, the
function AdjustToFullTransactionId assumes that if xid >
TransamVariables->nextXid, then the xid from the previous epoch. I
may be not the case in our scenario.
Yeah. At this stage, we can pretty much say that the whole idea of
relying AdjustToFullTransactionId() is broken, because we would build
2PC file names based on wrong assumptions, while orphaned files could
be in the far past or far future depending on the epoch.
TBH, we'll live better if we remove AdjustToFullTransactionId() and
sprinkle a bit more the knowledge of FullTransactionIds to build
correctly the 2PC file path in v17~. I've been playing with this code
for a couple of hours and finished with the attached patch. I have
wondered if ReadTwoPhaseFile() should gain the same knowledge as
TwoPhaseFilePath(), but decided to limit the invasiness because we
always call ReadTwoPhaseFile() once we don't have any orphaned files,
for all the phases of recovery. So while this requires a bit more
logic that depends on FullTransactionIdFromEpochAndXid() and
ReadNextFullTransactionId() to build a FullTransactionId and get the
current epoch, that's still acceptable as we only store an XID in the
2PC file.
So please see the attached. You will note that RemoveTwoPhaseFile(),
ProcessTwoPhaseBuffer() and TwoPhaseFilePath() now require a
FullTransactionId, hence callers need to think about the epoch to use.
That should limit future errors compared to passing the file name as
optional argument.
What do you think?
--
Michael
Attachments:
v2-0001-Fix-failure-with-incorrect-epoch-handling-for-2PC.patchtext/x-diff; charset=us-asciiDownload+184-79
On Friday, December 27, 2024 10:37 MSK, Michael Paquier <michael@paquier.xyz> wrote:
So please see the attached. You will note that RemoveTwoPhaseFile(),
ProcessTwoPhaseBuffer() and TwoPhaseFilePath() now require a
FullTransactionId, hence callers need to think about the epoch to use.
That should limit future errors compared to passing the file name as
optional argument.
In general, I like your solution to use FullTransactionId. I haven't found any evident problems. I just would like to propose to create a couple of new functions like RemoveTwoPhaseFileInCurrentEpoch, TwoPhaseFilePathInCurrentEpoch which accept TransactionId instead FullTransactionId. It may make the code clearer and result into less boilerplate code. I tried to do some hand testing. It seems, the problem is gone with the patch. Thank you!
As an idea, I would like to propose to store FullTransactionId in global transaction state instead of TransactionId. I'm not sure, it will consume significant additional memory, but it make the code more clear and probably result into less number of locks.
With best regards,
Vitaly
Attachments:
0001-Some-cosmetic-fixes.patchtext/x-patchDownload+37-56
On Fri, Dec 27, 2024 at 06:16:24PM +0300, Vitaly Davydov wrote:
In general, I like your solution to use FullTransactionId. I haven't
found any evident problems. I just would like to propose to create a
couple of new functions like RemoveTwoPhaseFileInCurrentEpoch,
TwoPhaseFilePathInCurrentEpoch which accept TransactionId instead
FullTransactionId. It may make the code clearer and result into less
boilerplate code. I tried to do some hand testing. It seems, the
problem is gone with the patch.
Thanks for the review. Agreed that it would be a good thing to limit
the number of paths calling ReadNextFullTransactionId(), but I did not
like much the suggestion TwoPhaseFilePathInCurrentEpoch(), feeling
that it was more important to keep a single code path in charge of
building the file names. Instead, I have gone with a new
FullTransactionIdFromCurrentEpoch() that replaces
AdjustToFullTransactionId(). It cleans up most of the calls to
ReadNextFullTransactionId() compared to the previous patch. It is
true that these couple with calls to ProcessTwoPhaseBuffer(), but the
result felt OK this way in the scope of this fix.
As an idea, I would like to propose to store FullTransactionId in
global transaction state instead of TransactionId. I'm not sure, it
will consume significant additional memory, but it make the code
more clear and probably result into less number of locks.
FWIW, I was wondering about doing the same thing. However, I have
concluded that this some refactoring work out of the scope of fixing
the primary issue we have here, as we are hit by the way the file
names are built when we attempt to remove them.
Note that the memory footprint of storing a FullTransactionId in
twophase.c's GlobalTransactionData does not worry me much. It is more
important to not increase the size of the two-phase state data,
impacting files on disk and their WAL records. This size argument
has been mentioned on the thread that has added epochs to the 2PC file
names, as far as I recall.
At the end, I have applied two patches, the first one down to 13 that
took care of the "future" issue, with tests added in the v14~v16
range. v13 was lacking a perl routine, and it's not a big deal to not
have coverage for this code path anyway. The second patch has been
applied down to v17, to fix the epoch issue, with the more advanced
tests.
If you have a suggestion of patch to plug in a FullTransactionId into
GlobalTransactionData rather than a TransactionId, feel free to
propose something! Help is always welcome, and this would be
HEAD-only work, making it less urgent to deal with.
--
Michael
On Monday, December 30, 2024 04:08 MSK, Michael Paquier <michael@paquier.xyz> wrote:
Instead, I have gone with a new
FullTransactionIdFromCurrentEpoch() that replaces
AdjustToFullTransactionId(). It cleans up most of the calls to
ReadNextFullTransactionId() compared to the previous patch. It is
true that these couple with calls to ProcessTwoPhaseBuffer(), but the
result felt OK this way in the scope of this fix.
Could you please send the latest version of the patch, if any? I haven't found any new patches in the latest email.
Happy New Year!
With best regards,
Vitaly
On Thu, Jan 09, 2025 at 11:21:38AM +0300, Давыдов Виталий wrote:
Could you please send the latest version of the patch, if any? I haven't found any new patches in the latest email.
I've applied fixes for this stuff as of e3584258154f (down to 13) and
7e125b20eed6 (down to 17) with tests for all the supported branches,
meaning that we should be done here.
--
Michael
On Mon, Dec 30, 2024 at 10:08:31AM +0900, Michael Paquier wrote:
On Fri, Dec 27, 2024 at 06:16:24PM +0300, Vitaly Davydov wrote:
As an idea, I would like to propose to store FullTransactionId in
global transaction state instead of TransactionId. I'm not sure, it
will consume significant additional memory, but it make the code
more clear and probably result into less number of locks.FWIW, I was wondering about doing the same thing. However, I have
concluded that this some refactoring work out of the scope of fixing
the primary issue we have here, as we are hit by the way the file
names are built when we attempt to remove them.Note that the memory footprint of storing a FullTransactionId in
twophase.c's GlobalTransactionData does not worry me much. It is more
important to not increase the size of the two-phase state data,
impacting files on disk and their WAL records. This size argument
has been mentioned on the thread that has added epochs to the 2PC file
names, as far as I recall.
I agree with accepting +4 bytes in GlobalTransactionData.
At the end, I have applied two patches, the first one down to 13 that
took care of the "future" issue, with tests added in the v14~v16
range. v13 was lacking a perl routine, and it's not a big deal to not
have coverage for this code path anyway. The second patch has been
applied down to v17, to fix the epoch issue, with the more advanced
tests.If you have a suggestion of patch to plug in a FullTransactionId into
GlobalTransactionData rather than a TransactionId, feel free to
propose something! Help is always welcome, and this would be
HEAD-only work, making it less urgent to deal with.
I suspect we should do that and back-patch to v17 before the next releases,
because commit 7e125b2 wrote this function:
/*
* Compute FullTransactionId for the given TransactionId, using the current
* epoch.
*/
static inline FullTransactionId
FullTransactionIdFromCurrentEpoch(TransactionId xid)
{
FullTransactionId fxid;
FullTransactionId nextFullXid;
uint32 epoch;
nextFullXid = ReadNextFullTransactionId();
epoch = EpochFromFullTransactionId(nextFullXid);
fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
return fxid;
}
I think "using the current epoch" is wrong for half of the nextFullXid values
having epoch > 0. For example, nextFullId==2^32 is in epoch 1, but all the
allowable XIDs are in epoch 0. (I mean "allowable" in the sense of
AssertTransactionIdInAllowableRange().) From then until we assign another
2^31 XIDs, epochs 0 and 1 are both expected in XID values. After 2^31 XID
assignments, every allowable XID be in epoch 1. Hence, twophase.c would need
two-epoch logic like we have in widen_snapshot_xid() and XLogRecGetFullXid().
Is that right? (I wrote this in a hurry, so this email may have more than the
standard level of errors.) Before commit 7e125b2, twophase also had that
logic. I didn't work out the user-visible consequences of that logic's new
absence here, but I bet on twophase recovery breakage. Similar problem here
(up to two epochs are acceptable, not just one):
+ /* Discard files from past epochs */
+ if (EpochFromFullTransactionId(fxid) < EpochFromFullTransactionId(nextXid))
I wrote the attached half-baked patch to fix those, but I tend to think it's
better to use FullTransactionId as many places as possible in twophase.c.
(We'd still need to convert XIDs that we read from xl_xact_prepare records,
along the lines of XLogRecGetFullXid().) How do you see it?
Attachments:
FullTransactionIdFromCurrentEpoch-v0.patchtext/plain; charset=us-asciiDownload+58-107
On Wed, Jan 15, 2025 at 05:00:51PM -0800, Noah Misch wrote:
I agree with accepting +4 bytes in GlobalTransactionData.
Let's just bite the bullet and do that on HEAD and v17, then,
integrating deeper FullTransactionIds into the internals of
twophase.c.
I think "using the current epoch" is wrong for half of the nextFullXid values
having epoch > 0. For example, nextFullId==2^32 is in epoch 1, but all the
allowable XIDs are in epoch 0. (I mean "allowable" in the sense of
AssertTransactionIdInAllowableRange().) From then until we assign another
2^31 XIDs, epochs 0 and 1 are both expected in XID values. After 2^31 XID
assignments, every allowable XID be in epoch 1. Hence, twophase.c would need
two-epoch logic like we have in widen_snapshot_xid() and XLogRecGetFullXid().
Is that right? (I wrote this in a hurry, so this email may have more than the
standard level of errors.) Before commit 7e125b2, twophase also had that
logic. I didn't work out the user-visible consequences of that logic's new
absence here, but I bet on twophase recovery breakage. Similar problem here
(up to two epochs are acceptable, not just one):+ /* Discard files from past epochs */ + if (EpochFromFullTransactionId(fxid) < EpochFromFullTransactionId(nextXid))
Oops, you're right. Your suggestion to unify all that in a single
routine is an excellent idea. Missed the bits in xid8funcs.c.
I wrote the attached half-baked patch to fix those, but I tend to think it's
better to use FullTransactionId as many places as possible in twophase.c.
(We'd still need to convert XIDs that we read from xl_xact_prepare records,
along the lines of XLogRecGetFullXid().) How do you see it?
I'm all for integrating more FullTransactionIds now that these reflect
in the file names, and do a deeper cut.
As far as I understand, the most important point of the logic is to
detect and discard the future files first in restoreTwoPhaseData() ->
ProcessTwoPhaseBuffer() when scanning the contents of pg_twophase at
the beginning of recovery. Once this filtering is done, it should be
safe to use your FullTransactionIdFromAllowableAt() when doing
the fxid <-> xid transitions between the records and the files on disk
flushed by a restartpoint which store an XID, and the shmem state of
GlobalTransactionData with a fxid.
With the additions attached, FullTransactionIdFromAllowableAt() gets
down from 8 to 6 calls in twophase.c. The change related to
MarkAsPreparingGuts() seems optional, though. I am trying to figure
out how to write a regression test to trigger this error, lacking a
bit of time today. That's going to require more trickery with
pg_resetwal to make that cheap, or something like that.. Attached are
some suggestions, as of a 0002 that applies on top of your 0001.
XLogRecGetFullXid() is used nowhere. This could be removed, perhaps,
or not?
--
Michael
On Thu, Jan 16, 2025 at 04:50:09PM +0900, Michael Paquier wrote:
On Wed, Jan 15, 2025 at 05:00:51PM -0800, Noah Misch wrote:
I think "using the current epoch" is wrong for half of the nextFullXid values
having epoch > 0. For example, nextFullId==2^32 is in epoch 1, but all the
allowable XIDs are in epoch 0. (I mean "allowable" in the sense of
AssertTransactionIdInAllowableRange().) From then until we assign another
2^31 XIDs, epochs 0 and 1 are both expected in XID values. After 2^31 XID
assignments, every allowable XID be in epoch 1. Hence, twophase.c would need
two-epoch logic like we have in widen_snapshot_xid() and XLogRecGetFullXid().
Is that right? (I wrote this in a hurry, so this email may have more than the
standard level of errors.) Before commit 7e125b2, twophase also had that
logic. I didn't work out the user-visible consequences of that logic's new
absence here, but I bet on twophase recovery breakage. Similar problem here
(up to two epochs are acceptable, not just one):+ /* Discard files from past epochs */ + if (EpochFromFullTransactionId(fxid) < EpochFromFullTransactionId(nextXid))Oops, you're right. Your suggestion to unify all that in a single
routine is an excellent idea. Missed the bits in xid8funcs.c.
Added.
I wrote the attached half-baked patch to fix those, but I tend to think it's
better to use FullTransactionId as many places as possible in twophase.c.
(We'd still need to convert XIDs that we read from xl_xact_prepare records,
along the lines of XLogRecGetFullXid().) How do you see it?I'm all for integrating more FullTransactionIds now that these reflect
in the file names, and do a deeper cut.As far as I understand, the most important point of the logic is to
detect and discard the future files first in restoreTwoPhaseData() ->
ProcessTwoPhaseBuffer() when scanning the contents of pg_twophase at
the beginning of recovery. Once this filtering is done, it should be
safe to use your FullTransactionIdFromAllowableAt() when doing
the fxid <-> xid transitions between the records and the files on disk
flushed by a restartpoint which store an XID, and the shmem state of
GlobalTransactionData with a fxid.
(I did not expect that a function called restoreTwoPhaseData() would run before
a function called PrescanPreparedTransactions(), but so it is.)
How is it that restoreTwoPhaseData() -> ProcessTwoPhaseBuffer() can safely
call TransactionIdDidAbort() when we've not replayed WAL to make CLOG
consistent? What can it assume about the value of TransamVariables->nextXid
at that early time?
With the additions attached, FullTransactionIdFromAllowableAt() gets
down from 8 to 6 calls in twophase.c. The change related to
MarkAsPreparingGuts() seems optional, though.
Thanks. It's probably not worth doing at that level of reduction. I tried
spreading fxid further and got it down to three conversions, corresponding to
redo of each of XLOG_XACT_PREPARE, XLOG_XACT_COMMIT_PREPARED, and
XLOG_XACT_ABORT_PREPARED. I'm attaching the WIP of that. It's not as
satisfying as I expected, so FullTransactionIdFromCurrentEpoch-v0.patch may
yet be the better direction.
The ProcessTwoPhaseBuffer() code to remove "past two-phase state" seems
best-effort, independent of this change. Just because an XID is in a
potentially-acceptable epoch doesn't mean TransactionIdDidCommit() will find
clog for it. I've not studied whether that matters.
Incidentally, this comment about when a function is called:
* PrescanPreparedTransactions
*
* Scan the shared memory entries of TwoPhaseState and determine the range
* of valid XIDs present. This is run during database startup, after we
* have completed reading WAL. TransamVariables->nextXid has been set to
* one more than the highest XID for which evidence exists in WAL.
doesn't match the timing of the actual earliest call:
/* REDO */
if (InRecovery)
{
...
if (ArchiveRecoveryRequested && EnableHotStandby)
{
...
if (wasShutdown)
oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
else
oldestActiveXID = checkPoint.oldestActiveXid;
...
PerformWalRecovery();
performedWalRecovery = true;
I am trying to figure
out how to write a regression test to trigger this error, lacking a
bit of time today. That's going to require more trickery with
pg_resetwal to make that cheap, or something like that.. Attached are
some suggestions, as of a 0002 that applies on top of your 0001.
Thanks. I'd value having your regression test, but manual-ish testing could
suffice if it's too hard.
XLogRecGetFullXid() is used nowhere. This could be removed, perhaps,
or not?
Maybe. Looks like it was born unused in 67b9b3c (2019-07), so removing may as
well be a separate discussion.
Attachments:
FullTransactionIdFromCurrentEpoch-v0.1.patchtext/plain; charset=us-asciiDownload+242-271
On Thu, Jan 16, 2025 at 12:52:54PM -0800, Noah Misch wrote:
On Thu, Jan 16, 2025 at 04:50:09PM +0900, Michael Paquier wrote:
As far as I understand, the most important point of the logic is to
detect and discard the future files first in restoreTwoPhaseData() ->
ProcessTwoPhaseBuffer() when scanning the contents of pg_twophase at
the beginning of recovery. Once this filtering is done, it should be
safe to use your FullTransactionIdFromAllowableAt() when doing
the fxid <-> xid transitions between the records and the files on disk
flushed by a restartpoint which store an XID, and the shmem state of
GlobalTransactionData with a fxid.(I did not expect that a function called restoreTwoPhaseData() would run before
a function called PrescanPreparedTransactions(), but so it is.)How is it that restoreTwoPhaseData() -> ProcessTwoPhaseBuffer() can safely
call TransactionIdDidAbort() when we've not replayed WAL to make CLOG
consistent? What can it assume about the value of TransamVariables->nextXid
at that early time?
I think it can't use CLOG safely. I regret answering my own question now as
opposed to refraining from asking it, but hopefully this will save you time.
Since EndPrepare() flushes the XLOG_XACT_PREPARE record before anything writes
to pg_twophase, presence of a pg_twophase file tells us its record once
existed in the flushed WAL stream. Even before we've started the server on a
base backup, each pg_twophase file pertains to an XLOG_XACT_PREPARE LSN no
later than the end-of-backup checkpoint. If a later file exists, the backup
copied a file after the end-of-backup checkpoint started, which is a violation
of the base backup spec.
Elsewhere in recovery, we have the principle that data directory content means
little until we reach a consistent state by replaying the end-of-backup
checkpoint or by crash recovery reaching the end of WAL. For twophase, that
calls for ignoring pg_twophase content. If a restartpoint has a pg_twophase
file to write, we should allow that to overwrite an old file iff we've not
reached consistency. (We must not read the old pg_twophase file, which may
contain an unfinished write.)
By the time we reach consistency, every file in pg_twophase will be applicable
(not committed or aborted). Each file either predates the start-of-backup
checkpoint (or crash recovery equivalent), or recovery wrote that file. We
need readdir(pg_twophase) only at end of recovery, when we need
_twophase_recover callbacks to restore enough state to accept writes. (During
hot standby, state from XLOG_RUNNING_XACTS suffices[1]Incidentally, the comment lock_twophase_standby_recover incorrectly claims it runs "when starting up into hot standby mode." for the actions
possible in that mode.)
In other words, the root problem that led to commits e358425 and 7e125b2 was
recovery interpreting pg_twophase file content before reaching consistency.
We can't make the ProcessTwoPhaseBuffer() checks safe before reaching
consistency. Is that right? Once ProcessTwoPhaseBuffer() stops running
before consistency, it won't strictly need the range checks. We may as well
have something like those range checks as a best-effort way to detect base
backup spec violations.
Thanks,
nm
[1]: Incidentally, the comment lock_twophase_standby_recover incorrectly claims it runs "when starting up into hot standby mode."
it runs "when starting up into hot standby mode."
On Thu, Jan 16, 2025 at 04:52:21PM -0800, Noah Misch wrote:
In other words, the root problem that led to commits e358425 and 7e125b2 was
recovery interpreting pg_twophase file content before reaching consistency.
We can't make the ProcessTwoPhaseBuffer() checks safe before reaching
consistency. Is that right? Once ProcessTwoPhaseBuffer() stops running
before consistency, it won't strictly need the range checks. We may as well
have something like those range checks as a best-effort way to detect base
backup spec violations.
Oh, yeah, it seems like you have a very good point here.
ProcessTwoPhaseBuffer() is an effort to use a unique path for the
handling of this 2PC data, and we cannot do that safely at this stage.
The problem is restoreTwoPhaseData(), which is called very early in
the recovery process, while all the other code paths calling
ProcessTwoPhaseBuffer() are done when we're doing reading WAL after
we've reached consistency as far as I can see. This is a wrong
concept since 728bd991c3c4. 5a1dfde8334b has made that much harder to
think about, while not integrating fully things. It also looks like
we may be able to just get rid of the CLOG checks entirely if we
overwrite existing files as long as we are not in a consistent state,
as you say. Hmm.
I agree that e358425 and 7e125b2 are weird attempts that just try to
work around the real issue, and that they're making some cases wrong
while on it with potential file removals.
+typedef void (*TwoPhaseCallback) (FullTransactionId fxid, uint16 info,
void *recdata, uint32 len);
Based on your latest patch, I doubt that we'll be able to do any of
that in the back-branches. That's much nicer in the long term to show
what this code relies on.
I've been thinking about the strategy to use here, and here are some
rough steps:
- Let's first revert e358425 and 7e125b2. This indeed needs to be
reworked, and there is a release coming by.
- Your refactoring around xid8funcs.c is a good idea on its own. This
can be an independent patch.
- Let's figure out how much surgery is required with a focus on HEAD
for now (I'm feeling that more non-backpatchable pieces are going to
be needed here), then extract the relevant pieces that could be
backpatchable. Hard to say if this is going to be doable at this
stage, or even worth doing, but it will be possible to dig into the
details once we're happy with the state of HEAD. My first intuition
is that the risk of touching the back-branches may not be worth the
potential reward based on the lack of field complaints (not seen
customer cases, tbh): high risk, low reward.
Another point to consider is if we'd better switch 2PC files to store
a fxid rather than a xid.. Perhaps that's not necessary, but if we're
using FullTransactionIds across the full stack of twophase.c that
could make the whole better with even less xid <-> fxid translations
in all these paths. There is always the counter-argument of the extra
4 bytes in the 2PC files and the records, though.
--
Michael
On Fri, Jan 17, 2025 at 11:04:03AM +0900, Michael Paquier wrote:
On Thu, Jan 16, 2025 at 04:52:21PM -0800, Noah Misch wrote:
In other words, the root problem that led to commits e358425 and 7e125b2 was
recovery interpreting pg_twophase file content before reaching consistency.
We can't make the ProcessTwoPhaseBuffer() checks safe before reaching
consistency. Is that right? Once ProcessTwoPhaseBuffer() stops running
before consistency, it won't strictly need the range checks. We may as well
have something like those range checks as a best-effort way to detect base
backup spec violations.Oh, yeah, it seems like you have a very good point here.
ProcessTwoPhaseBuffer() is an effort to use a unique path for the
handling of this 2PC data, and we cannot do that safely at this stage.
The problem is restoreTwoPhaseData(), which is called very early in
Agreed.
the recovery process, while all the other code paths calling
ProcessTwoPhaseBuffer() are done when we're doing reading WAL after
we've reached consistency as far as I can see. This is a wrong
concept since 728bd991c3c4. 5a1dfde8334b has made that much harder to
think about, while not integrating fully things. It also looks like
we may be able to just get rid of the CLOG checks entirely if we
overwrite existing files as long as we are not in a consistent state,
as you say. Hmm.I agree that e358425 and 7e125b2 are weird attempts that just try to
work around the real issue, and that they're making some cases wrong
while on it with potential file removals.+typedef void (*TwoPhaseCallback) (FullTransactionId fxid, uint16 info,
void *recdata, uint32 len);Based on your latest patch, I doubt that we'll be able to do any of
that in the back-branches. That's much nicer in the long term to show
what this code relies on.
The twophase.c bits of that patch turned out to be orthogonal to the key
issue, so that's fine. I suspect the signature changes would be okay to
back-patch to v17, subject to PGXN confirmation of no extension using those
APIs. (The TwoPhaseCallback list is hard-coded; extensions can't add
callbacks.) That said, the rationale for back-patching fxid stuff is gone.
I've been thinking about the strategy to use here, and here are some
rough steps:
- Let's first revert e358425 and 7e125b2. This indeed needs to be
reworked, and there is a release coming by.
+1. e358425 is borderline, but it's hard to rule out ways of it making
recovery unlink pg_twophase files improperly. Released code may also be doing
that, but that code's seven years of history suggests it does so infrequently
if at all. We don't have as much evidence about what the frequency would be
under e358425.
- Your refactoring around xid8funcs.c is a good idea on its own. This
can be an independent patch.
I got here on a yak shave for postgr.es/m/20250111214454.9a.nmisch@google.com.
I expect that project will still want FullTransactionIdFromAllowableAt(). If
so, I'll include it in that thread's patch series.
- Let's figure out how much surgery is required with a focus on HEAD
for now (I'm feeling that more non-backpatchable pieces are going to
be needed here), then extract the relevant pieces that could be
backpatchable. Hard to say if this is going to be doable at this
stage, or even worth doing, but it will be possible to dig into the
details once we're happy with the state of HEAD. My first intuition
is that the risk of touching the back-branches may not be worth the
potential reward based on the lack of field complaints (not seen
customer cases, tbh): high risk, low reward.
Let's see how it turns out. My first intuition would be to target a
back-patch, because a corner-case PITR failure is the kind of thing that can
go unreported and yet be hard to forgive.
I'm not confident I could fix both that other thread's data loss bug and
$SUBJECT in time for the 2024-02 releases. (By $SUBJECT, I mean the
seven-year-old bug of interpreting pg_twophase before recovery consistency,
which has a higher chance of causing spurious recovery failure starting in
v17.) Would your or someone else like to fix $SUBJECT, before or after
2024-02 releases? I'd make time to review a patch.
Another point to consider is if we'd better switch 2PC files to store
a fxid rather than a xid.. Perhaps that's not necessary, but if we're
using FullTransactionIds across the full stack of twophase.c that
could make the whole better with even less xid <-> fxid translations
in all these paths. There is always the counter-argument of the extra
4 bytes in the 2PC files and the records, though.
Yes, perhaps so. It could also be an option to store it only in the
pg_twophase file, not in the WAL record. Files are a lot rarer.
Similarly, I wondered if pg_twophase files should contain the LSN of the
XLOG_XACT_PREPARE record, which would make the file's oldness unambiguous in a
way fxid doesn't achieve. I didn't come up with a problem whose solution
requires it, though.
Thanks,
nm
On Thu, Jan 16, 2025 at 06:44:16PM -0800, Noah Misch wrote:
On Fri, Jan 17, 2025 at 11:04:03AM +0900, Michael Paquier wrote:
+typedef void (*TwoPhaseCallback) (FullTransactionId fxid, uint16 info,
void *recdata, uint32 len);Based on your latest patch, I doubt that we'll be able to do any of
that in the back-branches. That's much nicer in the long term to show
what this code relies on.The twophase.c bits of that patch turned out to be orthogonal to the key
issue, so that's fine. I suspect the signature changes would be okay to
back-patch to v17, subject to PGXN confirmation of no extension using those
APIs. (The TwoPhaseCallback list is hard-coded; extensions can't add
callbacks.) That said, the rationale for back-patching fxid stuff is gone.
I would not bet on this part.
- Let's first revert e358425 and 7e125b2. This indeed needs to be
reworked, and there is a release coming by.+1. e358425 is borderline, but it's hard to rule out ways of it making
recovery unlink pg_twophase files improperly. Released code may also be doing
that, but that code's seven years of history suggests it does so infrequently
if at all. We don't have as much evidence about what the frequency would be
under e358425.
Thanks. I'll got do that now as I can look at the buildfarm today,
then look at the rest.
- Your refactoring around xid8funcs.c is a good idea on its own. This
can be an independent patch.I got here on a yak shave for postgr.es/m/20250111214454.9a.nmisch@google.com.
I expect that project will still want FullTransactionIdFromAllowableAt(). If
so, I'll include it in that thread's patch series.
Okay.
I'm not confident I could fix both that other thread's data loss bug and
$SUBJECT in time for the 2024-02 releases. (By $SUBJECT, I mean the
seven-year-old bug of interpreting pg_twophase before recovery consistency,
which has a higher chance of causing spurious recovery failure starting in
v17.) Would your or someone else like to fix $SUBJECT, before or after
2024-02 releases? I'd make time to review a patch.
Yes, I can work on these two and dig into the pieces, once the state
of the branch a bit cleaner. I'm not sure that a fully-ready patch
will be able to emerge within two weeks as this may need some
interations, but let's see. Backpatching something to remove the clog
lookups does not seem that complicated, actually, now that I'm looking
at it. Simpler in ~16 of course, but it does not seem that bad for
17~ as well if taken the simplest way.
Another point to consider is if we'd better switch 2PC files to store
a fxid rather than a xid.. Perhaps that's not necessary, but if we're
using FullTransactionIds across the full stack of twophase.c that
could make the whole better with even less xid <-> fxid translations
in all these paths. There is always the counter-argument of the extra
4 bytes in the 2PC files and the records, though.Yes, perhaps so. It could also be an option to store it only in the
pg_twophase file, not in the WAL record. Files are a lot rarer.
It would be slightly simpler to do both, I guess. Less translations
between fxids and xids involved this way.
Similarly, I wondered if pg_twophase files should contain the LSN of the
XLOG_XACT_PREPARE record, which would make the file's oldness unambiguous in a
way fxid doesn't achieve. I didn't come up with a problem whose solution
requires it, though.
Not sure about this one.
--
Michael
Dear Hackers,
I hope you may find this patch worth to consider. You may consider this patch as one more step to a complete migration to FullTransactionId in twophase.c for 17+ versions.
The patch fixes TwoPhaseFilePath function. The current version gets TransactionId as an argument and makes a path to the file with twophase transaction state. The problem - this logic is ambiguous, because xid doesn't contain the epoch. By replacing the argument with FullTransactionId, the logic of this function becomes straightforward. All the responsibility to properly form a full xid were delegated to the caller functions.
The change in TwoPhaseFilePath implies to migrate to FullTransactionId in some other functions. There is a function AdjustToFullTransactionId (in the current master) which converts xid to full xid. The problem with this function is that it works properly only with those xids that are inside the current xid range which is defined by the xid horizon (wraparound), because it applies some assumptions concerning epoch definition (two epoches may be in action).
I assume that if a xid is coming from in-memory states, it has to be in the current xid range. Based on this assumption, I would conclude that if the xid is coming via the interface (external) functions which are defined in twophase.h, AdjustToFullTransactionId can be applied to such xid.
There is another story when we define xid from two phase file names, when reading pg_twophase directory. In 17+ twophase file names was changed to contain tx epoch as well. Once we work with twophase files, we have to use full xids. Function AdjustToFullTransactionId is not applicable in this case, because pg_twophase directory may contain any garbage files with future or past full xids which are outside of the current range.
Based on these assumptions (AdjustToFullTransactionId or use full xids) some other functions were modified as shown in the patch.
With best regards,
Vitaly
Attachments:
0001-Use-FullTransactionId-in-TwoPhaseFilePath-and-in-som.patchtext/x-patchDownload+51-34
On Thu, Jan 16, 2025 at 06:44:16PM -0800, Noah Misch wrote:
I got here on a yak shave for postgr.es/m/20250111214454.9a.nmisch@google.com.
I expect that project will still want FullTransactionIdFromAllowableAt(). If
so, I'll include it in that thread's patch series.
Note that this one has been committed as 81772a495ec9 by Noah. I am
going to spend some time looking at the rest.
--
Michael