Bug in StartupSUBTRANS

Started by Jeff Janesabout 10 years ago7 messageshackers
Jump to latest
#1Jeff Janes
jeff.janes@gmail.com

While testing the crash resilience of the recent 2-part-commit
improvements, I've run into a problem where sometimes after a crash
the recovery process creates zeroed files in pg_subtrans until it
exhausts all disk space.

Looking at the code, it looks like it does not anticipate that the xid
might wrap around, meaning startPage/endPage might also wrap around.
But obviously should not do so at int_max but rather at some much
smaller other value.

Here is the state near the time of disaster:

(gdb) print startPage
$1 = 2813758
(gdb) print endPage
$2 = 179
(gdb) p oldestActiveXID
$3 = 4293679649
(gdb) p ShmemVariableCache->nextXid
$4 = 367568

Attached is my attempt at a fix. I've tested it for the ability to
start up the crashed server again, but have not tested the full stack
from initdb to crash with this in place.

Assuming I'm right, I am curious how this problem has been around so
long without being discovered previously. So maybe I'm not right. I
found this with some code to accelerate the consumption of xids, but I
don't see how that would lead to a false positive here.

I think I found it testing 2-part-commit because that inherently means
leaving an active XID hanging around for a few checkpoint cycles,
which is something I've never intentionally tested before.

Cheers,

Jeff

Attachments:

StartupSUB.patchapplication/octet-stream; name=StartupSUB.patchDownload+3-0
#2Simon Riggs
simon@2ndQuadrant.com
In reply to: Jeff Janes (#1)
Re: Bug in StartupSUBTRANS

On 9 February 2016 at 18:42, Jeff Janes <jeff.janes@gmail.com> wrote:

While testing the crash resilience of the recent 2-part-commit
improvements, I've run into a problem where sometimes after a crash
the recovery process creates zeroed files in pg_subtrans until it
exhausts all disk space.

Not sure which patch you're talking about there (2-part-commit).

Looking at the code, it looks like it does not anticipate that the xid
might wrap around, meaning startPage/endPage might also wrap around.
But obviously should not do so at int_max but rather at some much
smaller other value.

Hmm, looks like the != part attempted to wrap, but just didn't get it right.

Your patch looks right to me, so I will commit, barring objections... with
backpatch. Likely to 9.0, AFAICS.

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/&gt;
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#2)
Re: Bug in StartupSUBTRANS

Simon Riggs <simon@2ndQuadrant.com> writes:

Your patch looks right to me, so I will commit, barring objections... with
backpatch. Likely to 9.0, AFAICS.

9.0 is out of support and should not be patched anymore.

I agree that the patch is basically correct, though I'd personally
write it without bothering with the extra variable:

+	/* must account for wraparound */
+	if (startPage > TransactionIdToPage(0xFFFFFFFF))
+		startPage = 0;

Also, the comment at line 45 is now wrong and needs an addition.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#3)
Re: Bug in StartupSUBTRANS

On Wed, Feb 10, 2016 at 3:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

Your patch looks right to me, so I will commit, barring objections... with
backpatch. Likely to 9.0, AFAICS.

9.0 is out of support and should not be patched anymore.

I agree that the patch is basically correct, though I'd personally
write it without bothering with the extra variable:

+       /* must account for wraparound */
+       if (startPage > TransactionIdToPage(0xFFFFFFFF))
+               startPage = 0;

Also, the comment at line 45 is now wrong and needs an addition.

Instead of using a hardcoded value, wouldn't it be better to use
something based on MaxTransactionId?
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Jeff Janes
jeff.janes@gmail.com
In reply to: Simon Riggs (#2)
Re: Bug in StartupSUBTRANS

On Tue, Feb 9, 2016 at 10:33 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 9 February 2016 at 18:42, Jeff Janes <jeff.janes@gmail.com> wrote:

While testing the crash resilience of the recent 2-part-commit
improvements, I've run into a problem where sometimes after a crash
the recovery process creates zeroed files in pg_subtrans until it
exhausts all disk space.

Not sure which patch you're talking about there (2-part-commit).

The one here:

commit 978b2f65aa1262eb4ecbf8b3785cb1b9cf4db78e
Author: Simon Riggs <simon@2ndQuadrant.com>
Date: Wed Jan 20 18:40:44 2016 -0800

Speedup 2PC by skipping two phase state files in normal path

I thought there was a second commit in the series, but I guess that is
still just proposed. I've haven't tested that one, just the committed
one.

I've repeated the crash/recovery/wrap-around testing with the proposed
fix in place, and this pre-existing problem seems to be fixed now, and
I have found no other problems.

I've attached a new version, incorporating comments from Tom and Michael.

Cheers,

Jeff

Attachments:

StartupSUB_v2.patchapplication/octet-stream; name=StartupSUB_v2.patchDownload+6-3
#6Michael Paquier
michael@paquier.xyz
In reply to: Jeff Janes (#5)
Re: Bug in StartupSUBTRANS

On Sun, Feb 14, 2016 at 9:03 AM, Jeff Janes <jeff.janes@gmail.com> wrote:

I've repeated the crash/recovery/wrap-around testing with the proposed
fix in place, and this pre-existing problem seems to be fixed now, and
I have found no other problems.

I've attached a new version, incorporating comments from Tom and Michael.

Thanks, this patch looks good to me this way.

I have added an entry in the CF app to not lose track of this bug:
https://commitfest.postgresql.org/9/526/
That's the most I can do.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Simon Riggs
simon@2ndQuadrant.com
In reply to: Jeff Janes (#5)
Re: Bug in StartupSUBTRANS

On 14 February 2016 at 00:03, Jeff Janes <jeff.janes@gmail.com> wrote:

I've attached a new version, incorporating comments from Tom and Michael.

Applied, thanks.

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/&gt;
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services