Bug in StartupSUBTRANS

Started by Jeff Janesalmost 10 years ago7 messages
#1Jeff Janes
jeff.janes@gmail.com
1 attachment(s)

While testing the crash resilience of the recent 2-part-commit
improvements, I've run into a problem where sometimes after a crash
the recovery process creates zeroed files in pg_subtrans until it
exhausts all disk space.

Looking at the code, it looks like it does not anticipate that the xid
might wrap around, meaning startPage/endPage might also wrap around.
But obviously should not do so at int_max but rather at some much
smaller other value.

Here is the state near the time of disaster:

(gdb) print startPage
$1 = 2813758
(gdb) print endPage
$2 = 179
(gdb) p oldestActiveXID
$3 = 4293679649
(gdb) p ShmemVariableCache->nextXid
$4 = 367568

Attached is my attempt at a fix. I've tested it for the ability to
start up the crashed server again, but have not tested the full stack
from initdb to crash with this in place.

Assuming I'm right, I am curious how this problem has been around so
long without being discovered previously. So maybe I'm not right. I
found this with some code to accelerate the consumption of xids, but I
don't see how that would lead to a false positive here.

I think I found it testing 2-part-commit because that inherently means
leaving an active XID hanging around for a few checkpoint cycles,
which is something I've never intentionally tested before.

Cheers,

Jeff

Attachments:

StartupSUB.patchapplication/octet-stream; name=StartupSUB.patchDownload
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
new file mode 100644
index 8170ba3..a9bfcb5
*** a/src/backend/access/transam/subtrans.c
--- b/src/backend/access/transam/subtrans.c
*************** StartupSUBTRANS(TransactionId oldestActi
*** 238,243 ****
--- 238,244 ----
  {
  	int			startPage;
  	int			endPage;
+ 	int			wrapPage = TransactionIdToPage(0xFFFFFFFF);
  
  	/*
  	 * Since we don't expect pg_subtrans to be valid across crashes, we
*************** StartupSUBTRANS(TransactionId oldestActi
*** 254,259 ****
--- 255,262 ----
  	{
  		(void) ZeroSUBTRANSPage(startPage);
  		startPage++;
+ 		if (startPage > wrapPage)
+ 			startPage=0;
  	}
  	(void) ZeroSUBTRANSPage(startPage);
  
#2Simon Riggs
simon@2ndQuadrant.com
In reply to: Jeff Janes (#1)
Re: Bug in StartupSUBTRANS

On 9 February 2016 at 18:42, Jeff Janes <jeff.janes@gmail.com> wrote:

While testing the crash resilience of the recent 2-part-commit
improvements, I've run into a problem where sometimes after a crash
the recovery process creates zeroed files in pg_subtrans until it
exhausts all disk space.

Not sure which patch you're talking about there (2-part-commit).

Looking at the code, it looks like it does not anticipate that the xid
might wrap around, meaning startPage/endPage might also wrap around.
But obviously should not do so at int_max but rather at some much
smaller other value.

Hmm, looks like the != part attempted to wrap, but just didn't get it right.

Your patch looks right to me, so I will commit, barring objections... with
backpatch. Likely to 9.0, AFAICS.

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/&gt;
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#2)
Re: Bug in StartupSUBTRANS

Simon Riggs <simon@2ndQuadrant.com> writes:

Your patch looks right to me, so I will commit, barring objections... with
backpatch. Likely to 9.0, AFAICS.

9.0 is out of support and should not be patched anymore.

I agree that the patch is basically correct, though I'd personally
write it without bothering with the extra variable:

+	/* must account for wraparound */
+	if (startPage > TransactionIdToPage(0xFFFFFFFF))
+		startPage = 0;

Also, the comment at line 45 is now wrong and needs an addition.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Michael Paquier
michael.paquier@gmail.com
In reply to: Tom Lane (#3)
Re: Bug in StartupSUBTRANS

On Wed, Feb 10, 2016 at 3:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

Your patch looks right to me, so I will commit, barring objections... with
backpatch. Likely to 9.0, AFAICS.

9.0 is out of support and should not be patched anymore.

I agree that the patch is basically correct, though I'd personally
write it without bothering with the extra variable:

+       /* must account for wraparound */
+       if (startPage > TransactionIdToPage(0xFFFFFFFF))
+               startPage = 0;

Also, the comment at line 45 is now wrong and needs an addition.

Instead of using a hardcoded value, wouldn't it be better to use
something based on MaxTransactionId?
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Jeff Janes
jeff.janes@gmail.com
In reply to: Simon Riggs (#2)
1 attachment(s)
Re: Bug in StartupSUBTRANS

On Tue, Feb 9, 2016 at 10:33 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 9 February 2016 at 18:42, Jeff Janes <jeff.janes@gmail.com> wrote:

While testing the crash resilience of the recent 2-part-commit
improvements, I've run into a problem where sometimes after a crash
the recovery process creates zeroed files in pg_subtrans until it
exhausts all disk space.

Not sure which patch you're talking about there (2-part-commit).

The one here:

commit 978b2f65aa1262eb4ecbf8b3785cb1b9cf4db78e
Author: Simon Riggs <simon@2ndQuadrant.com>
Date: Wed Jan 20 18:40:44 2016 -0800

Speedup 2PC by skipping two phase state files in normal path

I thought there was a second commit in the series, but I guess that is
still just proposed. I've haven't tested that one, just the committed
one.

I've repeated the crash/recovery/wrap-around testing with the proposed
fix in place, and this pre-existing problem seems to be fixed now, and
I have found no other problems.

I've attached a new version, incorporating comments from Tom and Michael.

Cheers,

Jeff

Attachments:

StartupSUB_v2.patchapplication/octet-stream; name=StartupSUB_v2.patchDownload
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
new file mode 100644
index 8170ba3..c02046c
*** a/src/backend/access/transam/subtrans.c
--- b/src/backend/access/transam/subtrans.c
***************
*** 44,50 ****
   * 0xFFFFFFFF/SUBTRANS_XACTS_PER_PAGE, and segment numbering at
   * 0xFFFFFFFF/SUBTRANS_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need take no
   * explicit notice of that fact in this module, except when comparing segment
!  * and page numbers in TruncateSUBTRANS (see SubTransPagePrecedes).
   */
  
  /* We need four bytes per xact */
--- 44,51 ----
   * 0xFFFFFFFF/SUBTRANS_XACTS_PER_PAGE, and segment numbering at
   * 0xFFFFFFFF/SUBTRANS_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need take no
   * explicit notice of that fact in this module, except when comparing segment
!  * and page numbers in TruncateSUBTRANS (see SubTransPagePrecedes) and zeroing
!  * them in StartupSUBTRANS.
   */
  
  /* We need four bytes per xact */
*************** StartupSUBTRANS(TransactionId oldestActi
*** 254,259 ****
--- 255,263 ----
  	{
  		(void) ZeroSUBTRANSPage(startPage);
  		startPage++;
+ 		/* must account for wraparound */
+ 		if (startPage > TransactionIdToPage(MaxTransactionId))
+ 			startPage=0;
  	}
  	(void) ZeroSUBTRANSPage(startPage);
  
#6Michael Paquier
michael.paquier@gmail.com
In reply to: Jeff Janes (#5)
Re: Bug in StartupSUBTRANS

On Sun, Feb 14, 2016 at 9:03 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

I've repeated the crash/recovery/wrap-around testing with the proposed
fix in place, and this pre-existing problem seems to be fixed now, and
I have found no other problems.

I've attached a new version, incorporating comments from Tom and Michael.

Thanks, this patch looks good to me this way.

I have added an entry in the CF app to not lose track of this bug:
https://commitfest.postgresql.org/9/526/
That's the most I can do.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Simon Riggs
simon@2ndQuadrant.com
In reply to: Jeff Janes (#5)
Re: Bug in StartupSUBTRANS

On 14 February 2016 at 00:03, Jeff Janes <jeff.janes@gmail.com> wrote:

I've attached a new version, incorporating comments from Tom and Michael.

Applied, thanks.

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/&gt;
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services