pgsql: Rework subtransaction commit protocol for hot standby.
Log Message:
-----------
Rework subtransaction commit protocol for hot standby.
This patch eliminates the marking of subtransactions as SUBCOMMITTED in pg_clog
during their commit; instead they remain in-progress until main transaction
commit. At main transaction commit, the commit protocol is atomic-by-page
instead of one transaction at a time. To avoid a race condition with some
subtransactions appearing committed before others in the case where they span
more than one pg_clog page, we conserve the logic that marks them subcommitted
before marking the parent committed.
Simon Riggs with minor help from me
Modified Files:
--------------
pgsql/src/backend/access/transam:
README (r1.11 -> r1.12)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/access/transam/README?r1=1.11&r2=1.12)
clog.c (r1.47 -> r1.48)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/access/transam/clog.c?r1=1.47&r2=1.48)
transam.c (r1.76 -> r1.77)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/access/transam/transam.c?r1=1.76&r2=1.77)
twophase.c (r1.45 -> r1.46)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/access/transam/twophase.c?r1=1.45&r2=1.46)
xact.c (r1.265 -> r1.266)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/access/transam/xact.c?r1=1.265&r2=1.266)
pgsql/src/include/access:
clog.h (r1.21 -> r1.22)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/include/access/clog.h?r1=1.21&r2=1.22)
transam.h (r1.65 -> r1.66)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/include/access/transam.h?r1=1.65&r2=1.66)
alvherre@postgresql.org (Alvaro Herrera) writes:
Rework subtransaction commit protocol for hot standby.
I think this patch broke something. In CVS HEAD, replay of a
transaction commit record with a subtransaction dies with an assert
failure:
#0 0xc0141220 in ?? () from /usr/lib/libc.1
#1 0xc00aa7ec in ?? () from /usr/lib/libc.1
#2 0xc008c2b8 in ?? () from /usr/lib/libc.1
#3 0xc0086d9c in ?? () from /usr/lib/libc.1
#4 0x41288c in ExceptionalCondition (
conditionName=0x6deb8 "!(((*byteptr >> bshift) & ((1 << 2) - 1)) == 0 || ((*byteptr >> bshift) & ((1 << 2) - 1)) == 0x03 || ((*byteptr >> bshift) & ((1 << 2) - 1)) == status)", errorType=0x6dde4 "FailedAssertion",
fileName=0x6ddf4 "clog.c", lineNumber=330) at assert.c:57
#5 0x171680 in TransactionIdSetStatusBit (xid=2063810256, status=2063810248,
lsn={xlogid = 0, xrecoff = 0}, slotno=0) at clog.c:330
#6 0x1714e8 in TransactionIdSetPageStatus (xid=84, nsubxids=1,
subxids=0x4008dd78, status=2063840993, lsn={xlogid = 0, xrecoff = 0},
pageno=0) at clog.c:290
#7 0x1712dc in TransactionIdSetTreeStatus (xid=19394, nsubxids=1,
subxids=0x4008dd78, status=1, lsn={xlogid = 0, xrecoff = 0}) at clog.c:204
#8 0x1722f8 in TransactionIdCommitTree (xid=2063670312, nxids=2063839972,
xids=0x7b011cf0) at transam.c:266
#9 0x1785b8 in xact_redo_commit (xlrec=0x4008dd48, xid=19394) at xact.c:4222
#10 0x1788ec in xact_redo (lsn={xlogid = 0, xrecoff = 211142048},
record=0x4008dd28) at xact.c:4327
#11 0x182a28 in StartupXLOG () at xlog.c:5146
I can't reproduce this 100% reliably, but the case where I saw it came
from mistakenly inserting
Assert(!IsA(node, AppendRelInfo));
into flatten_join_alias_vars_mutator and then running the serial
regression tests. This case *can* happen, but it doesn't occur
until late in the regression tests. The recovery ensuing from the
assert failure crashes maybe one time in two --- likely has something
to do with when the last checkpoint happened.
regards, tom lane
On Wed, 2008-10-22 at 13:25 -0400, Tom Lane wrote:
alvherre@postgresql.org (Alvaro Herrera) writes:
Rework subtransaction commit protocol for hot standby.
I think this patch broke something. In CVS HEAD, replay of a
transaction commit record with a subtransaction dies with an assert
failure:#0 0xc0141220 in ?? () from /usr/lib/libc.1
#1 0xc00aa7ec in ?? () from /usr/lib/libc.1
#2 0xc008c2b8 in ?? () from /usr/lib/libc.1
#3 0xc0086d9c in ?? () from /usr/lib/libc.1
#4 0x41288c in ExceptionalCondition (
conditionName=0x6deb8 "!(((*byteptr >> bshift) & ((1 << 2) - 1)) == 0 || ((*byteptr >> bshift) & ((1 << 2) - 1)) == 0x03 || ((*byteptr >> bshift) & ((1 << 2) - 1)) == status)", errorType=0x6dde4 "FailedAssertion",
fileName=0x6ddf4 "clog.c", lineNumber=330) at assert.c:57
#5 0x171680 in TransactionIdSetStatusBit (xid=2063810256, status=2063810248,
lsn={xlogid = 0, xrecoff = 0}, slotno=0) at clog.c:330
#6 0x1714e8 in TransactionIdSetPageStatus (xid=84, nsubxids=1,
subxids=0x4008dd78, status=2063840993, lsn={xlogid = 0, xrecoff = 0},
pageno=0) at clog.c:290
#7 0x1712dc in TransactionIdSetTreeStatus (xid=19394, nsubxids=1,
subxids=0x4008dd78, status=1, lsn={xlogid = 0, xrecoff = 0}) at clog.c:204
#8 0x1722f8 in TransactionIdCommitTree (xid=2063670312, nxids=2063839972,
xids=0x7b011cf0) at transam.c:266
#9 0x1785b8 in xact_redo_commit (xlrec=0x4008dd48, xid=19394) at xact.c:4222
#10 0x1788ec in xact_redo (lsn={xlogid = 0, xrecoff = 211142048},
record=0x4008dd28) at xact.c:4327
#11 0x182a28 in StartupXLOG () at xlog.c:5146I can't reproduce this 100% reliably, but the case where I saw it came
from mistakenly inserting
Assert(!IsA(node, AppendRelInfo));
into flatten_join_alias_vars_mutator and then running the serial
regression tests. This case *can* happen, but it doesn't occur
until late in the regression tests. The recovery ensuing from the
assert failure crashes maybe one time in two --- likely has something
to do with when the last checkpoint happened.
OK, looking now.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Wed, 2008-10-22 at 13:25 -0400, Tom Lane wrote:
conditionName=0x6deb8 "!(((*byteptr >> bshift) & ((1 << 2) - 1)) == 0
|| ((*byteptr >> bshift) & ((1 << 2) - 1)) == 0x03 || ((*byteptr >>
bshift) & ((1 << 2) - 1)) == status)", errorType=0x6dde4
"FailedAssertion",
fileName=0x6ddf4 "clog.c", lineNumber=330) at assert.c:57
This implies we are setting commit on an already aborted xid.
Either we are incorrectly setting some commits as aborts, or more likely
that we are updating the wrong page with an xid.
I'm continuing to look.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Wed, 2008-10-22 at 13:25 -0400, Tom Lane wrote:
#5 0x171680 in TransactionIdSetStatusBit (xid=2063810256,
status=2063810248,
lsn={xlogid = 0, xrecoff = 0}, slotno=0) at clog.c:330
#6 0x1714e8 in TransactionIdSetPageStatus (xid=84, nsubxids=1,
subxids=0x4008dd78, status=2063840993, lsn={xlogid = 0, xrecoff =
0},
pageno=0) at clog.c:290
#7 0x1712dc in TransactionIdSetTreeStatus (xid=19394, nsubxids=1,
subxids=0x4008dd78, status=1, lsn={xlogid = 0, xrecoff = 0}) at
clog.c:204
#8 0x1722f8 in TransactionIdCommitTree (xid=2063670312,
nxids=2063839972,
xids=0x7b011cf0) at transam.c:266
#9 0x1785b8 in xact_redo_commit (xlrec=0x4008dd48, xid=19394) at
xact.c:4222
These traces look weird. Look at the way the xid changes value as we
move from call to call. It looks like something is screwy there. If
those values are correct we should have failed an earlier assertion.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes:
These traces look weird. Look at the way the xid changes value as we
move from call to call. It looks like something is screwy there. If
those values are correct we should have failed an earlier assertion.
No, that's normal behavior on this platform + optimization setting.
Some of those registers have gotten re-used for other values. If
I were desperate to figure out how it got from point A to point B
I'd recompile with -O0, but this particular call stack doesn't seem
to hold any surprises: as you say, it seems to be trying to commit
an aborted xact. I looked far enough to see that the subxact ID
was a couple counts higher than the main, so I doubt that bad data
in the WAL record is the issue.
Are you able to reproduce the crash?
regards, tom lane
On Wed, 2008-10-22 at 15:18 -0400, Tom Lane wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
These traces look weird. Look at the way the xid changes value as we
move from call to call. It looks like something is screwy there. If
those values are correct we should have failed an earlier assertion.No, that's normal behavior on this platform + optimization setting.
Some of those registers have gotten re-used for other values. If
I were desperate to figure out how it got from point A to point B
I'd recompile with -O0, but this particular call stack doesn't seem
to hold any surprises: as you say, it seems to be trying to commit
an aborted xact. I looked far enough to see that the subxact ID
was a couple counts higher than the main, so I doubt that bad data
in the WAL record is the issue.Are you able to reproduce the crash?
Took a while, but yes, I can reproduce this now. Analysing...
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Wed, 2008-10-22 at 20:52 +0100, Simon Riggs wrote:
Took a while, but yes, I can reproduce this now. Analysing...
OK, I think I see what it's doing and why it fails the assert.
It's nothing to do with confusing commit/abort.
The new way of doing things on commit is to subcommit then commit. This
sequence is repeated during WAL replay. If we crash, it will try to
repeat the sequence, so in some cases it will try to set status to
subcommitted on a transaction already marked as committed.
So it fails the Assertion, but does the right thing.
A few ways to fix this, but patch to make that case a no-op is attached.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Attachments:
clog_bug.v1.patchtext/x-patch; charset=utf-8; name=clog_bug.v1.patchDownload
Index: src/backend/access/transam/clog.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/clog.c,v
retrieving revision 1.48
diff -c -r1.48 clog.c
*** src/backend/access/transam/clog.c 20 Oct 2008 19:18:18 -0000 1.48
--- src/backend/access/transam/clog.c 22 Oct 2008 20:24:29 -0000
***************
*** 324,329 ****
--- 324,333 ----
byteptr = ClogCtl->shared->page_buffer[slotno] + byteno;
+ if (((*byteptr >> bshift) & CLOG_XACT_BITMASK) == TRANSACTION_STATUS_COMMITTED &&
+ status == TRANSACTION_STATUS_SUB_COMMITTED)
+ return;
+
/* Current state should be 0, subcommitted or target state */
Assert(((*byteptr >> bshift) & CLOG_XACT_BITMASK) == 0 ||
((*byteptr >> bshift) & CLOG_XACT_BITMASK) == TRANSACTION_STATUS_SUB_COMMITTED ||
Simon Riggs <simon@2ndQuadrant.com> writes:
It's nothing to do with confusing commit/abort.
The new way of doing things on commit is to subcommit then commit. This
sequence is repeated during WAL replay. If we crash, it will try to
repeat the sequence, so in some cases it will try to set status to
subcommitted on a transaction already marked as committed.
Hmm, but then why did we not see the same thing before?
regards, tom lane
On Wed, 2008-10-22 at 16:41 -0400, Tom Lane wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
It's nothing to do with confusing commit/abort.
The new way of doing things on commit is to subcommit then commit. This
sequence is repeated during WAL replay. If we crash, it will try to
repeat the sequence, so in some cases it will try to set status to
subcommitted on a transaction already marked as committed.Hmm, but then why did we not see the same thing before?
The failure definitely came from trying to set SUBCOMMITTED on a
transaction already committed. I drew the wrong initial conclusion from
the Assert, which applied only to normal running, not recovery.
As you suggested, something to do with sequencing of checkpoints?
Previously the error showed itself consistently on the second run on
make installcheck. The attached patch succeeds consistently at the same
place. And keeps succeeding on third, fourth, fifth times.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes:
On Wed, 2008-10-22 at 16:41 -0400, Tom Lane wrote:
Hmm, but then why did we not see the same thing before?
The failure definitely came from trying to set SUBCOMMITTED on a
transaction already committed.
Ah, I see: prior versions did not bother to make a WAL entry for a
subcommit, so there was no case where a replay would try to reverse
the later state change to committed.
I see from a quick look in xact.c that CommitSubTransaction no longer
marks the subxact as subcommitted at all, which makes me wonder what is
the point of even having the state. If you intend that we are going to
rely 100% on in-memory state to detect our own subcommitted
transactions, then why isn't it sufficient to mark the parent committed
and then mark the subtransactions committed? An onlooker would see a
subtransaction go directly from IN_PROGRESS to COMMITTED, but if the
onlooker is too slow to catch the now-very-transient SUBCOMMITTED
state, that's what he'd see anyway.
regards, tom lane
On Wed, 2008-10-22 at 17:16 -0400, Tom Lane wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
On Wed, 2008-10-22 at 16:41 -0400, Tom Lane wrote:
Hmm, but then why did we not see the same thing before?
The failure definitely came from trying to set SUBCOMMITTED on a
transaction already committed.Ah, I see: prior versions did not bother to make a WAL entry for a
subcommit, so there was no case where a replay would try to reverse
the later state change to committed.I see from a quick look in xact.c that CommitSubTransaction no longer
marks the subxact as subcommitted at all, which makes me wonder what is
the point of even having the state. If you intend that we are going to
rely 100% on in-memory state to detect our own subcommitted
transactions, then why isn't it sufficient to mark the parent committed
and then mark the subtransactions committed? An onlooker would see a
subtransaction go directly from IN_PROGRESS to COMMITTED, but if the
onlooker is too slow to catch the now-very-transient SUBCOMMITTED
state, that's what he'd see anyway.
My interest was really in maintaining ultra-correctness, while removing
the need to WAL log subcommits for Hot Standby. I think I achieved that,
almost, but if you see further optimizations that is good too.
My understanding is that if we just mark the top-level as committed and
then mark subtransactions as committed that it would be possible to have
two observers conclude different things, which is therefore not atomic:
* backend1 look at top-level xid and see it as committed
* backend2 look at subtransaction and see it as in-progress
especially if the two xids were on separate pages.
Which sounds pretty bad to me.
The problem is what happens across clog pages. We only mark subcommitted
across subtransactions when we access more than one page. It's a very
transient state, but it prevents the situation where multiple observers
see different results.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes:
My interest was really in maintaining ultra-correctness, while removing
the need to WAL log subcommits for Hot Standby. I think I achieved that,
almost, but if you see further optimizations that is good too.
I'm not focusing on performance here --- I'm focusing on whether I trust
the patch at all. I dislike blowing a hole that size in the sanity
checks in TransactionIdSetStatus. I note that the hole still isn't
big enough, since presumably you'd have to allow ABORTED=>SUB_COMMITTED
too. That means that out of the four state transitions that are
disallowed by the original coding of that Assert, you are now having to
consider two as legal. I don't like that, and I like even less that
it's not even trying to determine whether this is a replay-driven
change.
regards, tom lane
On Wed, 2008-10-22 at 13:25 -0400, Tom Lane wrote:
I can't reproduce this 100% reliably, but the case where I saw it came
from mistakenly inserting
Assert(!IsA(node, AppendRelInfo));
into flatten_join_alias_vars_mutator and then running the serial
regression tests. This case *can* happen, but it doesn't occur
until late in the regression tests. The recovery ensuing from the
assert failure crashes maybe one time in two --- likely has something
to do with when the last checkpoint happened.
This seems like a useful type of test.
Can we change regression tests so that we avoid checkpoints, then crash
at the end of the test and go through recovery process? That would be a
neat way of including a recovery check in with each make check.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Wed, 2008-10-22 at 18:53 -0400, Tom Lane wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
My interest was really in maintaining ultra-correctness, while removing
the need to WAL log subcommits for Hot Standby. I think I achieved that,
almost, but if you see further optimizations that is good too.I'm not focusing on performance here --- I'm focusing on whether I trust
the patch at all.
I dislike blowing a hole that size in the sanity
checks in TransactionIdSetStatus. I note that the hole still isn't
big enough, since presumably you'd have to allow ABORTED=>SUB_COMMITTED
too.
No, that is never required. We only set subcommitted state for a commit.
That means that out of the four state transitions that are
disallowed by the original coding of that Assert, you are now having to
consider two as legal. I don't like that, and I like even less that
it's not even trying to determine whether this is a replay-driven
change.
Presumably you would like to see an additional parameter to allow that
test to be more strictly determined?
Bug fix v2 patch enclosed, mostly API changes.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Attachments:
clog_bug.v2.patchtext/x-patch; charset=utf-8; name=clog_bug.v2.patchDownload
Index: src/backend/access/transam/clog.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/clog.c,v
retrieving revision 1.48
diff -c -r1.48 clog.c
*** src/backend/access/transam/clog.c 20 Oct 2008 19:18:18 -0000 1.48
--- src/backend/access/transam/clog.c 23 Oct 2008 03:10:54 -0000
***************
*** 82,92 ****
static void WriteTruncateXlogRec(int pageno);
static void TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
TransactionId *subxids, XidStatus status,
! XLogRecPtr lsn, int pageno);
static void TransactionIdSetStatusBit(TransactionId xid, XidStatus status,
! XLogRecPtr lsn, int slotno);
static void set_status_by_pages(int nsubxids, TransactionId *subxids,
! XidStatus status, XLogRecPtr lsn);
/*
--- 82,92 ----
static void WriteTruncateXlogRec(int pageno);
static void TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
TransactionId *subxids, XidStatus status,
! XLogRecPtr lsn, int pageno, bool isRedo);
static void TransactionIdSetStatusBit(TransactionId xid, XidStatus status,
! XLogRecPtr lsn, int slotno, bool isRedo);
static void set_status_by_pages(int nsubxids, TransactionId *subxids,
! XidStatus status, XLogRecPtr lsn, bool isRedo);
/*
***************
*** 142,148 ****
*/
void
TransactionIdSetTreeStatus(TransactionId xid, int nsubxids,
! TransactionId *subxids, XidStatus status, XLogRecPtr lsn)
{
int pageno = TransactionIdToPage(xid); /* get page of parent */
int i;
--- 142,148 ----
*/
void
TransactionIdSetTreeStatus(TransactionId xid, int nsubxids,
! TransactionId *subxids, XidStatus status, XLogRecPtr lsn, bool isRedo)
{
int pageno = TransactionIdToPage(xid); /* get page of parent */
int i;
***************
*** 168,174 ****
* Set the parent and all subtransactions in a single call
*/
TransactionIdSetPageStatus(xid, nsubxids, subxids, status, lsn,
! pageno);
}
else
{
--- 168,174 ----
* Set the parent and all subtransactions in a single call
*/
TransactionIdSetPageStatus(xid, nsubxids, subxids, status, lsn,
! pageno, isRedo);
}
else
{
***************
*** 187,193 ****
if (status == TRANSACTION_STATUS_COMMITTED)
set_status_by_pages(nsubxids - nsubxids_on_first_page,
subxids + nsubxids_on_first_page,
! TRANSACTION_STATUS_SUB_COMMITTED, lsn);
/*
* Now set the parent and subtransactions on same page as the parent,
--- 187,193 ----
if (status == TRANSACTION_STATUS_COMMITTED)
set_status_by_pages(nsubxids - nsubxids_on_first_page,
subxids + nsubxids_on_first_page,
! TRANSACTION_STATUS_SUB_COMMITTED, lsn, isRedo);
/*
* Now set the parent and subtransactions on same page as the parent,
***************
*** 195,201 ****
*/
pageno = TransactionIdToPage(xid);
TransactionIdSetPageStatus(xid, nsubxids_on_first_page, subxids, status,
! lsn, pageno);
/*
* Now work through the rest of the subxids one clog page at a time,
--- 195,201 ----
*/
pageno = TransactionIdToPage(xid);
TransactionIdSetPageStatus(xid, nsubxids_on_first_page, subxids, status,
! lsn, pageno, isRedo);
/*
* Now work through the rest of the subxids one clog page at a time,
***************
*** 203,209 ****
*/
set_status_by_pages(nsubxids - nsubxids_on_first_page,
subxids + nsubxids_on_first_page,
! status, lsn);
}
}
--- 203,209 ----
*/
set_status_by_pages(nsubxids - nsubxids_on_first_page,
subxids + nsubxids_on_first_page,
! status, lsn, isRedo);
}
}
***************
*** 215,221 ****
*/
static void
set_status_by_pages(int nsubxids, TransactionId *subxids,
! XidStatus status, XLogRecPtr lsn)
{
int pageno = TransactionIdToPage(subxids[0]);
int offset = 0;
--- 215,221 ----
*/
static void
set_status_by_pages(int nsubxids, TransactionId *subxids,
! XidStatus status, XLogRecPtr lsn, bool isRedo)
{
int pageno = TransactionIdToPage(subxids[0]);
int offset = 0;
***************
*** 233,239 ****
TransactionIdSetPageStatus(InvalidTransactionId,
num_on_page, subxids + offset,
! status, lsn, pageno);
offset = i;
pageno = TransactionIdToPage(subxids[offset]);
}
--- 233,239 ----
TransactionIdSetPageStatus(InvalidTransactionId,
num_on_page, subxids + offset,
! status, lsn, pageno, isRedo);
offset = i;
pageno = TransactionIdToPage(subxids[offset]);
}
***************
*** 248,254 ****
static void
TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
TransactionId *subxids, XidStatus status,
! XLogRecPtr lsn, int pageno)
{
int slotno;
int i;
--- 248,254 ----
static void
TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
TransactionId *subxids, XidStatus status,
! XLogRecPtr lsn, int pageno, bool isRedo)
{
int slotno;
int i;
***************
*** 289,307 ****
Assert(ClogCtl->shared->page_number[slotno] == TransactionIdToPage(subxids[i]));
TransactionIdSetStatusBit(subxids[i],
TRANSACTION_STATUS_SUB_COMMITTED,
! lsn, slotno);
}
}
/* ... then the main transaction */
! TransactionIdSetStatusBit(xid, status, lsn, slotno);
}
/* Set the subtransactions */
for (i = 0; i < nsubxids; i++)
{
Assert(ClogCtl->shared->page_number[slotno] == TransactionIdToPage(subxids[i]));
! TransactionIdSetStatusBit(subxids[i], status, lsn, slotno);
}
ClogCtl->shared->page_dirty[slotno] = true;
--- 289,307 ----
Assert(ClogCtl->shared->page_number[slotno] == TransactionIdToPage(subxids[i]));
TransactionIdSetStatusBit(subxids[i],
TRANSACTION_STATUS_SUB_COMMITTED,
! lsn, slotno, isRedo);
}
}
/* ... then the main transaction */
! TransactionIdSetStatusBit(xid, status, lsn, slotno, isRedo);
}
/* Set the subtransactions */
for (i = 0; i < nsubxids; i++)
{
Assert(ClogCtl->shared->page_number[slotno] == TransactionIdToPage(subxids[i]));
! TransactionIdSetStatusBit(subxids[i], status, lsn, slotno, isRedo);
}
ClogCtl->shared->page_dirty[slotno] = true;
***************
*** 313,321 ****
* Sets the commit status of a single transaction.
*
* Must be called with CLogControlLock held
*/
static void
! TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn, int slotno)
{
int byteno = TransactionIdToByte(xid);
int bshift = TransactionIdToBIndex(xid) * CLOG_BITS_PER_XACT;
--- 313,326 ----
* Sets the commit status of a single transaction.
*
* Must be called with CLogControlLock held
+ *
+ * We pass the isRedo flag all the way down to here to ensure that the
+ * setting of transaction status is valid for the original call and for
+ * recovery.
*/
static void
! TransactionIdSetStatusBit(TransactionId xid, XidStatus status,
! XLogRecPtr lsn, int slotno, bool isRedo)
{
int byteno = TransactionIdToByte(xid);
int bshift = TransactionIdToBIndex(xid) * CLOG_BITS_PER_XACT;
***************
*** 327,333 ****
/* Current state should be 0, subcommitted or target state */
Assert(((*byteptr >> bshift) & CLOG_XACT_BITMASK) == 0 ||
((*byteptr >> bshift) & CLOG_XACT_BITMASK) == TRANSACTION_STATUS_SUB_COMMITTED ||
! ((*byteptr >> bshift) & CLOG_XACT_BITMASK) == status);
/* note this assumes exclusive access to the clog page */
byteval = *byteptr;
--- 332,340 ----
/* Current state should be 0, subcommitted or target state */
Assert(((*byteptr >> bshift) & CLOG_XACT_BITMASK) == 0 ||
((*byteptr >> bshift) & CLOG_XACT_BITMASK) == TRANSACTION_STATUS_SUB_COMMITTED ||
! ((*byteptr >> bshift) & CLOG_XACT_BITMASK) == status ||
! (((*byteptr >> bshift) & CLOG_XACT_BITMASK) == TRANSACTION_STATUS_COMMITTED &&
! status == TRANSACTION_STATUS_SUB_COMMITTED && isRedo));
/* note this assumes exclusive access to the clog page */
byteval = *byteptr;
Index: src/backend/access/transam/transam.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/transam.c,v
retrieving revision 1.78
diff -c -r1.78 transam.c
*** src/backend/access/transam/transam.c 20 Oct 2008 20:38:24 -0000 1.78
--- src/backend/access/transam/transam.c 23 Oct 2008 03:11:24 -0000
***************
*** 261,283 ****
* are correctly marked subcommit first.
*/
void
! TransactionIdCommitTree(TransactionId xid, int nxids, TransactionId *xids)
{
TransactionIdSetTreeStatus(xid, nxids, xids,
TRANSACTION_STATUS_COMMITTED,
! InvalidXLogRecPtr);
}
/*
* TransactionIdAsyncCommitTree
* Same as above, but for async commits. The commit record LSN is needed.
*/
void
TransactionIdAsyncCommitTree(TransactionId xid, int nxids, TransactionId *xids,
XLogRecPtr lsn)
{
TransactionIdSetTreeStatus(xid, nxids, xids,
! TRANSACTION_STATUS_COMMITTED, lsn);
}
/*
--- 261,286 ----
* are correctly marked subcommit first.
*/
void
! TransactionIdCommitTree(TransactionId xid, int nxids, TransactionId *xids, bool isRedo)
{
TransactionIdSetTreeStatus(xid, nxids, xids,
TRANSACTION_STATUS_COMMITTED,
! InvalidXLogRecPtr, isRedo);
}
/*
* TransactionIdAsyncCommitTree
* Same as above, but for async commits. The commit record LSN is needed.
+ *
+ * Never called during recovery, so doesn't have an isRedo parameter.
*/
void
TransactionIdAsyncCommitTree(TransactionId xid, int nxids, TransactionId *xids,
XLogRecPtr lsn)
{
TransactionIdSetTreeStatus(xid, nxids, xids,
! TRANSACTION_STATUS_COMMITTED,
! lsn, false);
}
/*
***************
*** 291,300 ****
* will consider all the xacts as not-yet-committed anyway.
*/
void
! TransactionIdAbortTree(TransactionId xid, int nxids, TransactionId *xids)
{
TransactionIdSetTreeStatus(xid, nxids, xids,
! TRANSACTION_STATUS_ABORTED, InvalidXLogRecPtr);
}
/*
--- 294,304 ----
* will consider all the xacts as not-yet-committed anyway.
*/
void
! TransactionIdAbortTree(TransactionId xid, int nxids, TransactionId *xids, bool isRedo)
{
TransactionIdSetTreeStatus(xid, nxids, xids,
! TRANSACTION_STATUS_ABORTED,
! InvalidXLogRecPtr, isRedo);
}
/*
Index: src/backend/access/transam/twophase.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/twophase.c,v
retrieving revision 1.46
diff -c -r1.46 twophase.c
*** src/backend/access/transam/twophase.c 20 Oct 2008 19:18:18 -0000 1.46
--- src/backend/access/transam/twophase.c 23 Oct 2008 03:11:49 -0000
***************
*** 1745,1751 ****
XLogFlush(recptr);
/* Mark the transaction committed in pg_clog */
! TransactionIdCommitTree(xid, nchildren, children);
/* Checkpoint can proceed now */
MyProc->inCommit = false;
--- 1745,1751 ----
XLogFlush(recptr);
/* Mark the transaction committed in pg_clog */
! TransactionIdCommitTree(xid, nchildren, children, false);
/* Checkpoint can proceed now */
MyProc->inCommit = false;
***************
*** 1820,1826 ****
* Mark the transaction aborted in clog. This is not absolutely necessary
* but we may as well do it while we are here.
*/
! TransactionIdAbortTree(xid, nchildren, children);
END_CRIT_SECTION();
}
--- 1820,1826 ----
* Mark the transaction aborted in clog. This is not absolutely necessary
* but we may as well do it while we are here.
*/
! TransactionIdAbortTree(xid, nchildren, children, false);
END_CRIT_SECTION();
}
Index: src/backend/access/transam/xact.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/xact.c,v
retrieving revision 1.266
diff -c -r1.266 xact.c
*** src/backend/access/transam/xact.c 20 Oct 2008 19:18:18 -0000 1.266
--- src/backend/access/transam/xact.c 23 Oct 2008 03:02:04 -0000
***************
*** 951,957 ****
* Now we may update the CLOG, if we wrote a COMMIT record above
*/
if (markXidCommitted)
! TransactionIdCommitTree(xid, nchildren, children);
}
else
{
--- 951,957 ----
* Now we may update the CLOG, if we wrote a COMMIT record above
*/
if (markXidCommitted)
! TransactionIdCommitTree(xid, nchildren, children, false);
}
else
{
***************
*** 1250,1256 ****
* having flushed the ABORT record to disk, because in event of a crash
* we'd be assumed to have aborted anyway.
*/
! TransactionIdAbortTree(xid, nchildren, children);
END_CRIT_SECTION();
--- 1250,1256 ----
* having flushed the ABORT record to disk, because in event of a crash
* we'd be assumed to have aborted anyway.
*/
! TransactionIdAbortTree(xid, nchildren, children, false);
END_CRIT_SECTION();
***************
*** 4219,4225 ****
/* Mark the transaction committed in pg_clog */
sub_xids = (TransactionId *) &(xlrec->xnodes[xlrec->nrels]);
! TransactionIdCommitTree(xid, xlrec->nsubxacts, sub_xids);
/* Make sure nextXid is beyond any XID mentioned in the record */
max_xid = xid;
--- 4219,4225 ----
/* Mark the transaction committed in pg_clog */
sub_xids = (TransactionId *) &(xlrec->xnodes[xlrec->nrels]);
! TransactionIdCommitTree(xid, xlrec->nsubxacts, sub_xids, true);
/* Make sure nextXid is beyond any XID mentioned in the record */
max_xid = xid;
***************
*** 4257,4263 ****
/* Mark the transaction aborted in pg_clog */
sub_xids = (TransactionId *) &(xlrec->xnodes[xlrec->nrels]);
! TransactionIdAbortTree(xid, xlrec->nsubxacts, sub_xids);
/* Make sure nextXid is beyond any XID mentioned in the record */
max_xid = xid;
--- 4257,4263 ----
/* Mark the transaction aborted in pg_clog */
sub_xids = (TransactionId *) &(xlrec->xnodes[xlrec->nrels]);
! TransactionIdAbortTree(xid, xlrec->nsubxacts, sub_xids, true);
/* Make sure nextXid is beyond any XID mentioned in the record */
max_xid = xid;
Index: src/include/access/clog.h
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/access/clog.h,v
retrieving revision 1.22
diff -c -r1.22 clog.h
*** src/include/access/clog.h 20 Oct 2008 19:18:18 -0000 1.22
--- src/include/access/clog.h 23 Oct 2008 03:06:48 -0000
***************
*** 33,39 ****
extern void TransactionIdSetTreeStatus(TransactionId xid, int nsubxids,
! TransactionId *subxids, XidStatus status, XLogRecPtr lsn);
extern XidStatus TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn);
extern Size CLOGShmemSize(void);
--- 33,39 ----
extern void TransactionIdSetTreeStatus(TransactionId xid, int nsubxids,
! TransactionId *subxids, XidStatus status, XLogRecPtr lsn, bool isRedo);
extern XidStatus TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn);
extern Size CLOGShmemSize(void);
Index: src/include/access/transam.h
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/access/transam.h,v
retrieving revision 1.66
diff -c -r1.66 transam.h
*** src/include/access/transam.h 20 Oct 2008 19:18:18 -0000 1.66
--- src/include/access/transam.h 23 Oct 2008 03:06:27 -0000
***************
*** 140,148 ****
extern bool TransactionIdDidAbort(TransactionId transactionId);
extern bool TransactionIdIsKnownCompleted(TransactionId transactionId);
extern void TransactionIdAbort(TransactionId transactionId);
! extern void TransactionIdCommitTree(TransactionId xid, int nxids, TransactionId *xids);
extern void TransactionIdAsyncCommitTree(TransactionId xid, int nxids, TransactionId *xids, XLogRecPtr lsn);
! extern void TransactionIdAbortTree(TransactionId xid, int nxids, TransactionId *xids);
extern bool TransactionIdPrecedes(TransactionId id1, TransactionId id2);
extern bool TransactionIdPrecedesOrEquals(TransactionId id1, TransactionId id2);
extern bool TransactionIdFollows(TransactionId id1, TransactionId id2);
--- 140,148 ----
extern bool TransactionIdDidAbort(TransactionId transactionId);
extern bool TransactionIdIsKnownCompleted(TransactionId transactionId);
extern void TransactionIdAbort(TransactionId transactionId);
! extern void TransactionIdCommitTree(TransactionId xid, int nxids, TransactionId *xids, bool isRedo);
extern void TransactionIdAsyncCommitTree(TransactionId xid, int nxids, TransactionId *xids, XLogRecPtr lsn);
! extern void TransactionIdAbortTree(TransactionId xid, int nxids, TransactionId *xids, bool isRedo);
extern bool TransactionIdPrecedes(TransactionId id1, TransactionId id2);
extern bool TransactionIdPrecedesOrEquals(TransactionId id1, TransactionId id2);
extern bool TransactionIdFollows(TransactionId id1, TransactionId id2);
On Thu, 2008-10-23 at 04:38 +0100, Simon Riggs wrote:
That means that out of the four state transitions that are
disallowed by the original coding of that Assert, you are now having to
consider two as legal. I don't like that, and I like even less that
it's not even trying to determine whether this is a replay-driven
change.
Possible state changes
TRANSACTION_STATUS_IN_PROGRESS to
TRANSACTION_STATUS_IN_PROGRESS is allowed
TRANSACTION_STATUS_COMMITTED is allowed
TRANSACTION_STATUS_ABORTED is allowed
TRANSACTION_STATUS_SUB_COMMITTED is allowed
TRANSACTION_STATUS_SUB_COMMITTED to
TRANSACTION_STATUS_IN_PROGRESS is allowed (but should not be)
TRANSACTION_STATUS_COMMITTED is allowed
TRANSACTION_STATUS_ABORTED is allowed
TRANSACTION_STATUS_SUB_COMMITTED is allowed
TRANSACTION_STATUS_COMMITTED to
TRANSACTION_STATUS_IN_PROGRESS is disallowed
TRANSACTION_STATUS_COMMITTED is allowed
TRANSACTION_STATUS_ABORTED is disallowed
TRANSACTION_STATUS_SUB_COMMITTED is ignored in redo only
TRANSACTION_STATUS_ABORTED to
TRANSACTION_STATUS_IN_PROGRESS is disallowed
TRANSACTION_STATUS_COMMITTED is disallowed
TRANSACTION_STATUS_ABORTED is allowed
TRANSACTION_STATUS_SUB_COMMITTED is disallowed
So out of 16 possible state change requests 10 were previously allowed,
one of which was allowed but should not have been.
This patch allows 1 additional legal state change request, now in redo
only.
There are still 5 disallowed state changes, plus another one disallowed
in normal running. That seems fine.
Presumably you would like to see an additional parameter to allow that
test to be more strictly determined?Bug fix v2 patch enclosed, mostly API changes.
I suggest a third version with these changes:
* Write the SUBCOMMITTED to COMMIT transition as a no-op during redo
rather than as an Assert. This prevents a transition from COMMIT to
SUBCOMMIT to ABORT. By making it a no-op the attempt to set COMMIT to
SUBCOMMIT never causes a failure, but it doesn't take place either.
* Disallow SUBCOMMITTED to IN_PROGRESS transition via an Assert.
What do you think?
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Thu, 2008-10-23 at 19:37 +0100, Simon Riggs wrote:
I suggest a third version with these changes:
* Write the SUBCOMMITTED to COMMIT transition as a no-op during redo
rather than as an Assert. This prevents a transition from COMMIT to
SUBCOMMIT to ABORT. By making it a no-op the attempt to set COMMIT to
SUBCOMMIT never causes a failure, but it doesn't take place either.* Disallow SUBCOMMITTED to IN_PROGRESS transition via an Assert.
Even better idea: just use the InRecovery flag. Patch enclosed.
Comments please.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Attachments:
clog_bug.v3.patchtext/x-patch; charset=UTF-8; name=clog_bug.v3.patchDownload
Index: src/backend/access/transam/clog.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/clog.c,v
retrieving revision 1.48
diff -c -r1.48 clog.c
*** src/backend/access/transam/clog.c 20 Oct 2008 19:18:18 -0000 1.48
--- src/backend/access/transam/clog.c 27 Oct 2008 21:15:00 -0000
***************
*** 324,332 ****
byteptr = ClogCtl->shared->page_buffer[slotno] + byteno;
! /* Current state should be 0, subcommitted or target state */
Assert(((*byteptr >> bshift) & CLOG_XACT_BITMASK) == 0 ||
! ((*byteptr >> bshift) & CLOG_XACT_BITMASK) == TRANSACTION_STATUS_SUB_COMMITTED ||
((*byteptr >> bshift) & CLOG_XACT_BITMASK) == status);
/* note this assumes exclusive access to the clog page */
--- 324,346 ----
byteptr = ClogCtl->shared->page_buffer[slotno] + byteno;
! /*
! * When replaying transactions during recovery we still need to perform
! * the two phases of subcommit and then commit. However, some transactions
! * are already correctly marked, so we just treat those as a no-op which
! * allows us to keep the following Assert as restrictive as possible.
! */
! if (InRecovery && status == TRANSACTION_STATUS_SUB_COMMITTED &&
! ((*byteptr >> bshift) & CLOG_XACT_BITMASK) == TRANSACTION_STATUS_COMMITTED)
! return;
!
! /*
! * Current state change should be from 0 or subcommitted to target state
! * or we should already be there when replaying changes during recovery.
! */
Assert(((*byteptr >> bshift) & CLOG_XACT_BITMASK) == 0 ||
! (((*byteptr >> bshift) & CLOG_XACT_BITMASK) == TRANSACTION_STATUS_SUB_COMMITTED &&
! status != TRANSACTION_STATUS_IN_PROGRESS) ||
((*byteptr >> bshift) & CLOG_XACT_BITMASK) == status);
/* note this assumes exclusive access to the clog page */
On Tue, 2008-10-28 at 00:35 +0000, Simon Riggs wrote:
Comments please.
I'm a little unclear as to what is happening with this.
We have a bug in CVS HEAD, so clearly doing nothing shouldn't really be
an option. It was my patch with the bug, so clearly my responsibility.
After some prototypes, I'm happy with the submitted v3 patch.
Please could somebody commit this? Or at least say what they think needs
to happen, or what we're waiting for?
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes:
I'm a little unclear as to what is happening with this.
We have a bug in CVS HEAD, so clearly doing nothing shouldn't really be
an option.
Doing nothing until release is certainly not an option ;-). But AFAICS
this is not a showstopper for development so I was intending to wait
until commitfest starts before doing anything with it. Is it blocking
some other progress for you?
regards, tom lane
On Tue, 2008-10-28 at 14:43 -0400, Tom Lane wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
I'm a little unclear as to what is happening with this.
We have a bug in CVS HEAD, so clearly doing nothing shouldn't really be
an option.Doing nothing until release is certainly not an option ;-). But AFAICS
this is not a showstopper for development so I was intending to wait
until commitfest starts before doing anything with it. Is it blocking
some other progress for you?
No, since I know how to fix it.
Just uncomfortable leaving known bugs in the tree, that's all.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes:
On Tue, 2008-10-28 at 14:43 -0400, Tom Lane wrote:
Doing nothing until release is certainly not an option ;-). But AFAICS
this is not a showstopper for development so I was intending to wait
until commitfest starts before doing anything with it.
Just uncomfortable leaving known bugs in the tree, that's all.
Feel free to stick something about it on the open items list
http://wiki.postgresql.org/wiki/PostgreSQL_8.4_Open_Items
if you want to make doubly sure it's not forgotten about.
regards, tom lane
On Tue, 2008-10-28 at 15:52 -0400, Tom Lane wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
On Tue, 2008-10-28 at 14:43 -0400, Tom Lane wrote:
Doing nothing until release is certainly not an option ;-). But AFAICS
this is not a showstopper for development so I was intending to wait
until commitfest starts before doing anything with it.Just uncomfortable leaving known bugs in the tree, that's all.
Feel free to stick something about it on the open items list
http://wiki.postgresql.org/wiki/PostgreSQL_8.4_Open_Items
if you want to make doubly sure it's not forgotten about.
You're pulling my chain.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
On Tue, 2008-10-28 at 15:52 -0400, Tom Lane wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
On Tue, 2008-10-28 at 14:43 -0400, Tom Lane wrote:
Doing nothing until release is certainly not an option ;-). But AFAICS
this is not a showstopper for development so I was intending to wait
until commitfest starts before doing anything with it.Just uncomfortable leaving known bugs in the tree, that's all.
Feel free to stick something about it on the open items list
http://wiki.postgresql.org/wiki/PostgreSQL_8.4_Open_Items
if you want to make doubly sure it's not forgotten about.You're pulling my chain.
:-)
The new entry seems too vague to be useful ...
--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Tue, 2008-10-28 at 17:36 -0300, Alvaro Herrera wrote:
Simon Riggs wrote:
On Tue, 2008-10-28 at 15:52 -0400, Tom Lane wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
On Tue, 2008-10-28 at 14:43 -0400, Tom Lane wrote:
Doing nothing until release is certainly not an option ;-). But AFAICS
this is not a showstopper for development so I was intending to wait
until commitfest starts before doing anything with it.Just uncomfortable leaving known bugs in the tree, that's all.
Feel free to stick something about it on the open items list
http://wiki.postgresql.org/wiki/PostgreSQL_8.4_Open_Items
if you want to make doubly sure it's not forgotten about.You're pulling my chain.
:-)
The new entry seems too vague to be useful ...
I'll correct it before release. ;-)
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
Even better idea: just use the InRecovery flag. Patch enclosed.
Applied, thanks.
--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support