Multixact slru doesn't force WAL flushes in SlruPhysicalWritePage()

Started by Andres Freund · about 10 years ago · 4 messages
#1Andres Freund
andres@anarazel.de

Hi,

By happenstance I just had slru.h open after lunch and noticed the
following comment:
/*
* Optional array of WAL flush LSNs associated with entries in the SLRU
* pages. If not zero/NULL, we must flush WAL before writing pages (true
* for pg_clog, false for multixact, pg_subtrans, pg_notify). group_lsn[]
* has lsn_groups_per_page entries per buffer slot, each containing the
* highest LSN known for a contiguous group of SLRU entries on that slot's
* page.
*/
XLogRecPtr *group_lsn;
int lsn_groups_per_page;
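
For reference, this is roughly what the write path does with those LSNs; a
condensed sketch of the group_lsn handling in SlruPhysicalWritePage(), not the
verbatim code in slru.c:

if (shared->group_lsn != NULL)
{
    /* Find the highest LSN recorded for any entry group on this page. */
    XLogRecPtr  max_lsn;
    int         lsnindex = slotno * shared->lsn_groups_per_page;
    int         lsnoff;

    max_lsn = shared->group_lsn[lsnindex++];
    for (lsnoff = 1; lsnoff < shared->lsn_groups_per_page; lsnoff++)
    {
        XLogRecPtr  this_lsn = shared->group_lsn[lsnindex++];

        if (max_lsn < this_lsn)
            max_lsn = this_lsn;
    }

    /* Flush WAL up to that LSN before the page itself hits disk. */
    if (!XLogRecPtrIsInvalid(max_lsn))
        XLogFlush(max_lsn);
}

For multixact, pg_subtrans and pg_notify, group_lsn is NULL, so none of that
runs.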

Uhm. Multixacts historically didn't need to follow the
write-WAL-before-data rule because they were zapped at restart. But
they're now persistent.

There are no comments about this choice anywhere in multixact.c, leading
me to believe that this was not an intentional decision.

Right now I think the "missing" flushes are fairly unlikely to cause
problems in practice, mostly because the multixact SLRUs are essentially
append-only.

But I'd like some more brains to think about potential danger. If we
decide it's ok, let's update the comments.

Greetings,

Andres Freund


#2Noah Misch
noah@leadboat.com
In reply to: Andres Freund (#1)
Re: Multixact slru doesn't force WAL flushes in SlruPhysicalWritePage()

On Mon, Nov 09, 2015 at 10:40:07PM +0100, Andres Freund wrote:

> /*
> * Optional array of WAL flush LSNs associated with entries in the SLRU
> * pages. If not zero/NULL, we must flush WAL before writing pages (true
> * for pg_clog, false for multixact, pg_subtrans, pg_notify). group_lsn[]
> * has lsn_groups_per_page entries per buffer slot, each containing the
> * highest LSN known for a contiguous group of SLRU entries on that slot's
> * page.
> */
> XLogRecPtr *group_lsn;
> int lsn_groups_per_page;
>
> Uhm. Multixacts historically didn't need to follow the
> write-WAL-before-data rule because they were zapped at restart. But
> they're now persistent.
>
> There are no comments about this choice anywhere in multixact.c, leading
> me to believe that this was not an intentional decision.

Here's the multixact.c comment justifying it:

* XLOG interactions: this module generates an XLOG record whenever a new
* OFFSETs or MEMBERs page is initialized to zeroes, as well as an XLOG record
* whenever a new MultiXactId is defined. This allows us to completely
* rebuild the data entered since the last checkpoint during XLOG replay.
* Because this is possible, we need not follow the normal rule of
* "write WAL before data"; the only correctness guarantee needed is that
* we flush and sync all dirty OFFSETs and MEMBERs pages to disk before a
* checkpoint is considered complete. If a page does make it to disk ahead
* of corresponding WAL records, it will be forcibly zeroed before use anyway.
* Therefore, we don't need to mark our pages with LSN information; we have
* enough synchronization already.
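
For reference, the "forcibly zeroed before use" part refers to page
initialization; roughly, for the offsets side (the members side is analogous;
lightly condensed, see multixact.c for the real code):

/* Sketch of multixact.c's offsets page-initialization path. */
static int
ZeroMultiXactOffsetPage(int pageno, bool writeXlog)
{
    int     slotno;

    /* Hand back a zeroed, dirty SLRU buffer for this page. */
    slotno = SimpleLruZeroPage(MultiXactOffsetCtl, pageno);

    /* Log the zeroing so replay can recreate the page from scratch. */
    if (writeXlog)
        WriteMZeroPageXlogRec(pageno, XLOG_MULTIXACT_ZERO_OFF_PAGE);

    return slotno;
}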

The comment's justification is incomplete, though. What of pages filled over
the course of multiple checkpoint cycles?


#3Noah Misch
noah@leadboat.com
In reply to: Noah Misch (#2)
1 attachment(s)
Re: Multixact slru doesn't force WAL flushes in SlruPhysicalWritePage()

On Tue, Nov 10, 2015 at 11:22:47PM -0500, Noah Misch wrote:

> On Mon, Nov 09, 2015 at 10:40:07PM +0100, Andres Freund wrote:
>
> > /*
> > * Optional array of WAL flush LSNs associated with entries in the SLRU
> > * pages. If not zero/NULL, we must flush WAL before writing pages (true
> > * for pg_clog, false for multixact, pg_subtrans, pg_notify). group_lsn[]
> > * has lsn_groups_per_page entries per buffer slot, each containing the
> > * highest LSN known for a contiguous group of SLRU entries on that slot's
> > * page.
> > */
> > XLogRecPtr *group_lsn;
> > int lsn_groups_per_page;
>
> Here's the multixact.c comment justifying it:
>
> * XLOG interactions: this module generates an XLOG record whenever a new
> * OFFSETs or MEMBERs page is initialized to zeroes, as well as an XLOG record
> * whenever a new MultiXactId is defined. This allows us to completely
> * rebuild the data entered since the last checkpoint during XLOG replay.
> * Because this is possible, we need not follow the normal rule of
> * "write WAL before data"; the only correctness guarantee needed is that
> * we flush and sync all dirty OFFSETs and MEMBERs pages to disk before a
> * checkpoint is considered complete. If a page does make it to disk ahead
> * of corresponding WAL records, it will be forcibly zeroed before use anyway.
> * Therefore, we don't need to mark our pages with LSN information; we have
> * enough synchronization already.
>
> The comment's justification is incomplete, though. What of pages filled over
> the course of multiple checkpoint cycles?

Having studied this further, I think it is safe for a reason given elsewhere:

* Note: we need not flush this XLOG entry to disk before proceeding. The
* only way for the MXID to be referenced from any data page is for
* heap_lock_tuple() to have put it there, and heap_lock_tuple() generates
* an XLOG record that must follow ours. The normal LSN interlock between
* the data page and that XLOG record will ensure that our XLOG record
* reaches disk first. If the SLRU members/offsets data reaches disk
* sooner than the XLOG record, we do not care because we'll overwrite it
* with zeroes unless the XLOG record is there too; see notes at top of
* this file.
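
(The interlock mentioned there is, roughly, the usual page-LSN rule; a
condensed sketch, not the verbatim heapam.c/bufmgr.c code:)

/* In heap_lock_tuple(), after inserting its own WAL record: */
recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_LOCK);
PageSetLSN(page, recptr);       /* stamp the heap page with that record's LSN */

/* Later, when the buffer manager writes the heap page (FlushBuffer): */
XLogFlush(BufferGetLSN(buf));   /* WAL up to the page LSN reaches disk first */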

I find no flaw in the first three sentences. In most cases, one of
multixact_redo(), TrimMultiXact() or ExtendMultiXactOffset() will indeed
overwrite the widowed mxid data with zeroes before the system again writes
data in that vicinity. We can fail to do that if a backend stalls just after
calling GetNewMultiXactId(), then recovers and updates SLRU pages long after
other backends filled the balance of those pages. That's still okay; if we
never flush the XLOG_MULTIXACT_CREATE_ID record, no xmax referring to the mxid
will survive recovery. I drafted the attached comment update to consolidate
and slightly correct this justification. (If we were designing the multixact
subsystem afresh today, I'd vote for following the write-WAL-before-data rule
despite believing it is not needed. The simplicity would be worth it.)

While contemplating the "stalls ... just after calling GetNewMultiXactId()"
scenario, I noticed a race condition not involving any write-WAL-before-data
violation. MultiXactIdCreateFromMembers() and callees have this skeleton:

GetNewMultiXactId
    LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE)
    ExtendMultiXactOffset()
    ExtendMultiXactMember()
    START_CRIT_SECTION()
    (MultiXactState->nextMXact)++
    MultiXactState->nextOffset += nmembers
    LWLockRelease(MultiXactGenLock)
XLogInsert(RM_MULTIXACT_ID, XLOG_MULTIXACT_CREATE_ID)
RecordNewMultiXact
    LWLockAcquire(MultiXactOffsetControlLock, LW_EXCLUSIVE)
    *offptr = offset; /* update pg_multixact/offsets SLRU page */
    LWLockRelease(MultiXactOffsetControlLock)
    LWLockAcquire(MultiXactMemberControlLock, LW_EXCLUSIVE)
    *memberptr = members[i].xid; /* update pg_multixact/members SLRU page */
    *flagsptr = flagsval; /* update pg_multixact/members SLRU page */
    LWLockRelease(MultiXactMemberControlLock)
END_CRIT_SECTION

Between GetNewMultiXactId() and XLogInsert(), other backends will be free to
create higher-numbered mxids. They will skip over SLRU space the current
backend reserved. Future GetMultiXactIdMembers() calls rely critically on the
current backend eventually finishing:

* 2. The next multixact may still be in process of being filled in: that
* is, another process may have done GetNewMultiXactId but not yet written
* the offset entry for that ID. In that scenario, it is guaranteed that
* the offset entry for that multixact exists (because GetNewMultiXactId
* won't release MultiXactGenLock until it does) but contains zero
* (because we are careful to pre-zero offset pages). Because
* GetNewMultiXactId will never return zero as the starting offset for a
* multixact, when we read zero as the next multixact's offset, we know we
* have this case. We sleep for a bit and try again.
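
In code, that "sleep for a bit and try again" is roughly the following
(condensed from GetMultiXactIdMembers(); the surrounding reads are elided):

retry:
    LWLockAcquire(MultiXactOffsetControlLock, LW_EXCLUSIVE);
    /* ... read this multixact's offset and the next multixact's offset ... */
    if (nextOffset == 0)
    {
        /* Assume the creator is still mid-flight; wait for it to finish. */
        LWLockRelease(MultiXactOffsetControlLock);
        pg_usleep(1000L);
        goto retry;
    }

If the creating backend never finishes, nothing ever breaks this loop.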

A crash while paused this way makes permanent the zero offset. After
recovery, though no xmax carries the offset-zero mxid, GetMultiXactIdMembers()
on the immediate previous mxid hangs indefinitely. Also,
pg_get_multixact_members(zero_offset_mxid) gets an assertion failure. I have
not thoroughly considered how best to fix these. Test procedure:

-- backend 1
checkpoint;
create table victim0 (c) as select 1;
create table stall1 (c) as select 1;
create table last2 (c) as select 1;
begin;
select c from victim0 for key share;
select c from stall1 for key share;
select c from last2 for key share;

-- backend 2
begin; update victim0 set c = c + 1; rollback; -- burn one mxid
-- In a shell, attach GDB to backend 2. GDB will stop the next SQL command at
-- XLogInsert() in MultiXactIdCreateFromMembers():
-- gdb --pid $BACKEND_PID -ex 'b XLogInsert' -ex c
select c from stall1 for key share; -- stall in gdb while making an mxid

-- backend 3
select c from last2 for key share; -- use one more mxid; flush WAL

-- in GDB session, issue command: kill

-- backend 1, after recovery
select c from victim0; -- hangs, uncancelable

-- backend 2, after recovery: assertion failure
select pg_get_multixact_members((xmax::text::bigint - 1)::text::xid) from last2;

Attachments:

doc-mxact-wal-before-data-v1.patch (text/plain; charset=us-ascii)
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b66a2b6..c027f68 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -24,17 +24,21 @@
  * since it would get completely confused if someone inquired about a bogus
  * MultiXactId that pointed to an intermediate slot containing an XID.)
  *
- * XLOG interactions: this module generates an XLOG record whenever a new
- * OFFSETs or MEMBERs page is initialized to zeroes, as well as an XLOG record
- * whenever a new MultiXactId is defined.  This allows us to completely
- * rebuild the data entered since the last checkpoint during XLOG replay.
- * Because this is possible, we need not follow the normal rule of
- * "write WAL before data"; the only correctness guarantee needed is that
- * we flush and sync all dirty OFFSETs and MEMBERs pages to disk before a
- * checkpoint is considered complete.  If a page does make it to disk ahead
- * of corresponding WAL records, it will be forcibly zeroed before use anyway.
- * Therefore, we don't need to mark our pages with LSN information; we have
- * enough synchronization already.
+ * XLOG interactions: this module generates a record whenever a new OFFSETs or
+ * MEMBERs page is initialized to zeroes, as well as an
+ * XLOG_MULTIXACT_CREATE_ID record whenever a new MultiXactId is defined.
+ * This module ignores the WAL rule "write xlog before data," because it
+ * suffices that actions recording a MultiXactId in a heap xmax do follow that
+ * rule.  The only way for the MXID to be referenced from any data page is for
+ * heap_lock_tuple() or heap_update() to have put it there, and each generates
+ * an XLOG record that must follow ours.  The normal LSN interlock between the
+ * data page and that XLOG record will ensure that our XLOG record reaches
+ * disk first.  If the SLRU members/offsets data reaches disk sooner than the
+ * XLOG records, we do not care; after recovery, no xmax will refer to it.  On
+ * the flip side, to ensure that all referenced entries _do_ reach disk, this
+ * module's XLOG records completely rebuild the data entered since the last
+ * checkpoint.  We flush and sync all dirty OFFSETs and MEMBERs pages to disk
+ * before each checkpoint is considered complete.
  *
  * Like clog.c, and unlike subtrans.c, we have to preserve state across
  * crashes and ensure that MXID and offset numbering increases monotonically
@@ -795,19 +799,7 @@ MultiXactIdCreateFromMembers(int nmembers, MultiXactMember *members)
 	 */
 	multi = GetNewMultiXactId(nmembers, &offset);
 
-	/*
-	 * Make an XLOG entry describing the new MXID.
-	 *
-	 * Note: we need not flush this XLOG entry to disk before proceeding. The
-	 * only way for the MXID to be referenced from any data page is for
-	 * heap_lock_tuple() to have put it there, and heap_lock_tuple() generates
-	 * an XLOG record that must follow ours.  The normal LSN interlock between
-	 * the data page and that XLOG record will ensure that our XLOG record
-	 * reaches disk first.  If the SLRU members/offsets data reaches disk
-	 * sooner than the XLOG record, we do not care because we'll overwrite it
-	 * with zeroes unless the XLOG record is there too; see notes at top of
-	 * this file.
-	 */
+	/* Make an XLOG entry describing the new MXID. */
 	xlrec.mid = multi;
 	xlrec.moff = offset;
 	xlrec.nmembers = nmembers;
@@ -2037,7 +2029,11 @@ TrimMultiXact(void)
 
 	/*
 	 * Zero out the remainder of the current offsets page.  See notes in
-	 * TrimCLOG() for motivation.
+	 * TrimCLOG() for background.  Unlike CLOG, some WAL record covers every
+	 * pg_multixact SLRU mutation.  Since, also unlike CLOG, we ignore the WAL
+	 * rule "write xlog before data," nextMXact successors may carry obsolete,
+	 * nonzero offset values.  Zero those so case 2 of GetMultiXactIdMembers()
+	 * operates normally.
 	 */
 	entryno = MultiXactIdToOffsetEntry(nextMXact);
 	if (entryno != 0)
#4Robert Haas
robertmhaas@gmail.com
In reply to: Noah Misch (#3)
Re: Re: Multixact slru doesn't force WAL flushes in SlruPhysicalWritePage()

For tracking purposes, I updated
https://wiki.postgresql.org/wiki/MultiXact_Bugs and added this. We're
probably past due to spend some time cleaning up the older issues that
Andres's patch didn't fix (although it fixed a lot).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
