Crash on promotion when recovery.conf is renamed

Started by Magnus Haganderabout 9 years ago33 messages

magnus@hagander.net

about 9 years ago

I had a system where the recovery.conf file was renamed "out of the way" at
some point, and then the system was promoted. This is obviously operator
error, but it seems like something we should handle.

What happens now is that the non-existance of recovery.conf is a FATAL
error. I wonder if it should just be a WARNING, at least in the case of
ENOENT?

What happens is this.

Log output:
2016-12-15 09:36:46.265 CET [25437] LOG: received promote request
2016-12-15 09:36:46.265 CET [25438] FATAL: terminating walreceiver process
due to administrator command
mha@mha-laptop:~/postgresql/inst/head$ 2016-12-15 09:36:46.265 CET [25437]
LOG: invalid record length at 0/5015168: wanted 24, got 0
2016-12-15 09:36:46.265 CET [25437] LOG: redo done at 0/5015130
2016-12-15 09:36:46.265 CET [25437] LOG: last completed transaction was at
log time 2016-12-15 09:36:19.27125+01
2016-12-15 09:36:46.276 CET [25437] LOG: selected new timeline ID: 2
2016-12-15 09:36:46.429 CET [25437] FATAL: could not open file
"recovery.conf": No such file or directory
2016-12-15 09:36:46.429 CET [25436] LOG: startup process (PID 25437)
exited with exit code 1
2016-12-15 09:36:46.429 CET [25436] LOG: terminating any other active
server processes
2016-12-15 09:36:46.429 CET [25456] WARNING: terminating connection
because of crash of another server process
2016-12-15 09:36:46.429 CET [25456] DETAIL: The postmaster has commanded
this server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2016-12-15 09:36:46.429 CET [25456] HINT: In a moment you should be able
to reconnect to the database and repeat your command.
2016-12-15 09:36:46.431 CET [25436] LOG: database system is shut down

So we can see it switches to timeline 2. Looking in pg_wal (or pg_xlog --
customer system was on 9.5, but this is reproducible in HEAD):

-rw------- 1 mha mha 16777216 Dec 15 09:36 000000010000000000000004
-rw------- 1 mha mha 16777216 Dec 15 09:36 000000010000000000000005
-rw------- 1 mha mha 16777216 Dec 15 09:36 000000020000000000000005
-rw------- 1 mha mha 41 Dec 15 09:36 00000002.history

However, according to pg_controldata, we are still on timeline 1:
Latest checkpoint location: 0/4000060
Prior checkpoint location: 0/4000060
Latest checkpoint's REDO location: 0/4000028
Latest checkpoint's REDO WAL file: 000000010000000000000004
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
..
Minimum recovery ending location: 0/5015168
Min recovery ending loc's timeline: 1

But since we have a history file for timeline 2 in the data directory (and
neatly archived), this data directory isn't consistent with that. Meaning
that for example any other standbys that you try to connect to this cluster
will simply fail, because they try to join up on timeline 2 which doesn't
actually exist.

I wonder if there might be more corner cases like this, but in this
particular one it seems easy enough to just say that failing to rename
recovery.conf because it didn't exist is safe.

But in the case of failing to rename recovery.conf for example because of
permissions errors, we don't want to ignore it. But we also really don't
want to end up with this kind of inconsistent data directory IMO. I don't
know that code well enough to suggest how to fix it though -- hoping for
input for someone who knows it closer?

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

Heikki Linnakangas

hlinnaka@iki.fi

about 9 years ago

In reply to: Magnus Hagander (#1)

1 attachment(s)

Re: Crash on promotion when recovery.conf is renamed

On 12/15/2016 10:44 AM, Magnus Hagander wrote:

I wonder if there might be more corner cases like this, but in this
particular one it seems easy enough to just say that failing to rename
recovery.conf because it didn't exist is safe.

Yeah. It's unexpected though, so I think erroring out is quite
reasonable. If the recovery.conf file went missing, who knows what else
is wrong.

But in the case of failing to rename recovery.conf for example because of
permissions errors, we don't want to ignore it. But we also really don't
want to end up with this kind of inconsistent data directory IMO. I don't
know that code well enough to suggest how to fix it though -- hoping for
input for someone who knows it closer?

Hmm. Perhaps we should write the timeline history file only after
renaming recovery.conf. In general, we should keep the window between
writing the timeline history file and writing the end-of-recovery record
as small as possible. I don't think we can eliminate it completely, but
it makes sense to minimize it. Something like the attached (completely
untested).

- Heikki

Attachments:

reorder-end-of-archive-recovery-actions-1.patchinvalid/octet-stream; name=reorder-end-of-archive-recovery-actions-1.patchDownload

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 084401d..05fe9ab 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7155,6 +7155,13 @@ StartupXLOG(void)
 	}
 
 	/*
+	 * Pre-scan prepared transactions to find out the range of XIDs present.
+	 * We don't need this information quite yet, but if it fails for some
+	 * reason, better to fail before we make any on-disk changes.
+	 */
+	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+
+	/*
 	 * Consider whether we need to assign a new timeline ID.
 	 *
 	 * If we are doing an archive recovery, we always assign a new ID.  This
@@ -7208,6 +7215,24 @@ StartupXLOG(void)
 		else
 			snprintf(reason, sizeof(reason), "no recovery target specified");
 
+		/*
+		 * We are now done reading the old WAL.  Turn off archive fetching if it
+		 * was active, and make a writable copy of the last WAL segment. (Note
+		 * that we also have a copy of the last block of the old WAL in readBuf;
+		 * we will use that below.)
+		 */
+		exitArchiveRecovery(EndOfLogTLI, EndOfLog);
+
+		/*
+		 * Write the timeline history file, and have it archived. After this
+		 * point (or rather, as soon as the file is archived), the timeline
+		 * will appear as "taken" in the WAL archive and to any standby servers.
+		 * If we crash before actually switching to the new timeline, standby
+		 * servers will nevertheless think that we switched to the new timeline,
+		 * and will try to connect to the new timeline. To minimize the window
+		 * for that, try to do as little as possible between here and writing the
+		 * end-of-recovery record.
+		 */
 		writeTimeLineHistory(ThisTimeLineID, recoveryTargetTLI,
 							 EndRecPtr, reason);
 	}
@@ -7217,15 +7242,6 @@ StartupXLOG(void)
 	XLogCtl->PrevTimeLineID = PrevTimeLineID;
 
 	/*
-	 * We are now done reading the old WAL.  Turn off archive fetching if it
-	 * was active, and make a writable copy of the last WAL segment. (Note
-	 * that we also have a copy of the last block of the old WAL in readBuf;
-	 * we will use that below.)
-	 */
-	if (ArchiveRecoveryRequested)
-		exitArchiveRecovery(EndOfLogTLI, EndOfLog);
-
-	/*
 	 * Prepare to write WAL starting at EndOfLog position, and init xlog
 	 * buffer cache using the block containing the last record from the
 	 * previous incarnation.
@@ -7277,9 +7293,6 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
-	/* Pre-scan prepared transactions to find out the range of XIDs present */
-	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
-
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint

Michael Paquier

michael.paquier@gmail.com

about 9 years ago

In reply to: Heikki Linnakangas (#2)

Re: Crash on promotion when recovery.conf is renamed

On Thu, Dec 15, 2016 at 11:25:10AM +0200, Heikki Linnakangas wrote:

On 12/15/2016 10:44 AM, Magnus Hagander wrote:

I wonder if there might be more corner cases like this, but in this
particular one it seems easy enough to just say that failing to rename
recovery.conf because it didn't exist is safe.

Yeah. It's unexpected though, so I think erroring out is quite reasonable.
If the recovery.conf file went missing, who knows what else is wrong.

i'd rather not mess up with the exiting behavior and just complain to the
pilot to not do that. This enters in the category of not removing the
backup_label file after taking a backup...

But in the case of failing to rename recovery.conf for example because of
permissions errors, we don't want to ignore it. But we also really don't
want to end up with this kind of inconsistent data directory IMO. I don't
know that code well enough to suggest how to fix it though -- hoping for
input for someone who knows it closer?

Hmm. Perhaps we should write the timeline history file only after renaming
recovery.conf. In general, we should keep the window between writing the
timeline history file and writing the end-of-recovery record as small as
possible. I don't think we can eliminate it completely, but it makes sense
to minimize it. Something like the attached (completely untested).

I did some tests. And after a lookup it looks like a good thing to book
the timeline number at an earlier step and let other nodes know about
it. So +1 on it.

Looking at PrescanPreparedTransactions(), I am thinking as well that it would
be better to get a hard failure when bumping on a corrupted 2PC file. Future
files are one thing, but corrupted files should be treated more carefully.
--
Michael

Magnus Hagander

magnus@hagander.net

about 9 years ago

In reply to: Michael Paquier (#3)

Re: Crash on promotion when recovery.conf is renamed

On Fri, Dec 16, 2016 at 7:08 AM, Michael Paquier <michael.paquier@gmail.com>
wrote:

On Thu, Dec 15, 2016 at 11:25:10AM +0200, Heikki Linnakangas wrote:

On 12/15/2016 10:44 AM, Magnus Hagander wrote:

I wonder if there might be more corner cases like this, but in this
particular one it seems easy enough to just say that failing to rename
recovery.conf because it didn't exist is safe.

Yeah. It's unexpected though, so I think erroring out is quite

reasonable.

If the recovery.conf file went missing, who knows what else is wrong.

i'd rather not mess up with the exiting behavior and just complain to the
pilot to not do that. This enters in the category of not removing the
backup_label file after taking a backup...

I'm happy to say that people shouldn't do that. However, we have a system
that is unnecessarily fragile, by making something like that so easy to
break. This could equally happen in other ways. For example, someone could
accidentally provision a recovery.conf with the wrong permissions.

Reducing the fragility of such an important part of the system is a big
improvement, IMO. Of course, we have to be extra careful when touching
those parts.

Bottom line, I'm perfectly fine with failing in such a scenario. I'm not
fine with leaving the user with a corrupt cluster if we can avoid it.

As for your comparison, we fixed the backup_label part by adding support
for taking backups without using the fragile system. So it's not like we
didn't recognize the problem.

But in the case of failing to rename recovery.conf for example because

of

permissions errors, we don't want to ignore it. But we also really

don't

want to end up with this kind of inconsistent data directory IMO. I

don't

know that code well enough to suggest how to fix it though -- hoping

for

input for someone who knows it closer?

Hmm. Perhaps we should write the timeline history file only after

renaming

recovery.conf. In general, we should keep the window between writing the
timeline history file and writing the end-of-recovery record as small as
possible. I don't think we can eliminate it completely, but it makes

sense

to minimize it. Something like the attached (completely untested).

I haven't looked at the code either, but the reason is definitely solid -
let's keep the time as short as possible. In particular, keeping it
recoverable for things that are caused by humans, because they will keep
making mistakes. And editing recovery.conf is a normal thing for a DBA to
do (sure, not removing it -- but it's one of the few files in the data
directory that the DBA is actually supposed to handle manually, so that
increases the risk).

And also by doing things in an order that will at least make it less likely
to end up corrupt if people manage to do it anyway.

I did some tests. And after a lookup it looks like a good thing to book

the timeline number at an earlier step and let other nodes know about
it. So +1 on it.

Looking at PrescanPreparedTransactions(), I am thinking as well that it
would
be better to get a hard failure when bumping on a corrupted 2PC file.
Future
files are one thing, but corrupted files should be treated more carefully.

Again without looking at it, I agree (so much easier that way :P). Ignoring
corruption is generally a bad idea. Failing hard makes the user notice the
error, and makes it possible to initiate recovery from a standby or from
backups or something, or to *intentionally* remove/clear/ignore it.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

Michael Paquier

michael.paquier@gmail.com

about 9 years ago

In reply to: Magnus Hagander (#4)

2 attachment(s)

Re: Crash on promotion when recovery.conf is renamed

On Sat, Dec 17, 2016 at 9:23 PM, Magnus Hagander <magnus@hagander.net> wrote:

On Fri, Dec 16, 2016 at 7:08 AM, Michael Paquier <michael.paquier@gmail.com>
wrote:

Looking at PrescanPreparedTransactions(), I am thinking as well that it
would
be better to get a hard failure when bumping on a corrupted 2PC file.
Future
files are one thing, but corrupted files should be treated more carefully.

Again without looking at it, I agree (so much easier that way :P). Ignoring
corruption is generally a bad idea. Failing hard makes the user notice the
error, and makes it possible to initiate recovery from a standby or from
backups or something, or to *intentionally* remove/clear/ignore it.

And I am finishing with the two patches attached:
- 0001 changes the 2PC checks so as corrupted entries are FATAL.
PreScanPreparedTransaction is used when a hot standby is initialized.
In this case a failure protects the range of XIDs generated,
potentially saving from corruption of data. At the end of recovery,
this is done before any on-disk actions are taken.
- 0002 is the thing that Heikki has sent previously to minimize the
window between end-of-recovery record write and timeline history file
archiving.

I am attaching that to next CF.
--
Michael

Attachments:

0002-Minimize-window-between-history-file-and-end-of-reco.patchtext/x-diff; charset=US-ASCII; name=0002-Minimize-window-between-history-file-and-end-of-reco.patchDownload

From 0e816454c57b851f4aa2743ea776b1e604a27d77 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 20 Dec 2016 16:47:35 +0900
Subject: [PATCH 2/2] Minimize window between history file and end-of-recovery
 record

Once a standby node is promoted, this makes the assignment of the new
timeline number booked earlier as the history file gets archived immediately.
This way the other nodes are aware that this new timeline number is taken
and should not be assigned to other nodes.

The window between which the history file is archived and the end-of-recovery
record is written cannot be zeroed, but this way it is minimized as much as
possible. The new order of actions prevents as well a corrupted data directory
on failure.
---
 src/backend/access/transam/xlog.c | 27 ++++++++++++++++++---------
 1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ea2c523e6e..f23feb47e8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7210,6 +7210,24 @@ StartupXLOG(void)
 		else
 			snprintf(reason, sizeof(reason), "no recovery target specified");
 
+		/*
+		 * We are now done reading the old WAL.  Turn off archive fetching if it
+		 * was active, and make a writable copy of the last WAL segment. (Note
+		 * that we also have a copy of the last block of the old WAL in readBuf;
+		 * we will use that below.)
+		 */
+		exitArchiveRecovery(EndOfLogTLI, EndOfLog);
+
+		/*
+		 * Write the timeline history file, and have it archived. After this
+		 * point (or rather, as soon as the file is archived), the timeline
+		 * will appear as "taken" in the WAL archive and to any standby servers.
+		 * If we crash before actually switching to the new timeline, standby
+		 * servers will nevertheless think that we switched to the new timeline,
+		 * and will try to connect to the new timeline. To minimize the window
+		 * for that, try to do as little as possible between here and writing the
+		 * end-of-recovery record.
+		 */
 		writeTimeLineHistory(ThisTimeLineID, recoveryTargetTLI,
 							 EndRecPtr, reason);
 	}
@@ -7219,15 +7237,6 @@ StartupXLOG(void)
 	XLogCtl->PrevTimeLineID = PrevTimeLineID;
 
 	/*
-	 * We are now done reading the old WAL.  Turn off archive fetching if it
-	 * was active, and make a writable copy of the last WAL segment. (Note
-	 * that we also have a copy of the last block of the old WAL in readBuf;
-	 * we will use that below.)
-	 */
-	if (ArchiveRecoveryRequested)
-		exitArchiveRecovery(EndOfLogTLI, EndOfLog);
-
-	/*
 	 * Prepare to write WAL starting at EndOfLog position, and init xlog
 	 * buffer cache using the block containing the last record from the
 	 * previous incarnation.
-- 
2.11.0

0001-Change-detection-of-corrupted-2PC-files-as-FATAL.patchtext/x-diff; charset=US-ASCII; name=0001-Change-detection-of-corrupted-2PC-files-as-FATAL.patchDownload

From 4037db778a4353fc79b15eb683a5b9e3f648db24 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 20 Dec 2016 16:41:44 +0900
Subject: [PATCH 1/2] Change detection of corrupted 2PC files as FATAL

When scanning 2PC files, be it for initializing a hot standby node or
finding the XID range at the end of recovery, bumping on a corrupted
2PC file is reported as a WARNING and are blindly corrupted. If it
happened that the corrupted file was actually legit, there is actually
a risk of corruption, so switch that to a hard failure.
---
 src/backend/access/transam/twophase.c | 23 ++++++++---------------
 src/backend/access/transam/xlog.c     | 10 +++++++---
 2 files changed, 15 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 5415604993..9cbcb83062 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1672,6 +1672,10 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
  * write a WAL entry, and so there might be no evidence in WAL of those
  * subxact XIDs.
  *
+ * On corrupted two-phase files, fail immediately. Keeping around broken
+ * entries and let replay continue causes harm on the system, and likely
+ * a new backup should be rolled in.
+ *
  * Our other responsibility is to determine and return the oldest valid XID
  * among the prepared xacts (if none, return ShmemVariableCache->nextXid).
  * This is needed to synchronize pg_subtrans startup properly.
@@ -1723,25 +1727,14 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 			/* Read and validate file */
 			buf = ReadTwoPhaseFile(xid, true);
 			if (buf == NULL)
-			{
-				ereport(WARNING,
-					  (errmsg("removing corrupt two-phase state file \"%s\"",
-							  clde->d_name)));
-				RemoveTwoPhaseFile(xid, true);
-				continue;
-			}
+				ereport(FATAL, (errmsg("corrupted two-phase file \"%s\"",
+									   clde->d_name)));
 
 			/* Deconstruct header */
 			hdr = (TwoPhaseFileHeader *) buf;
 			if (!TransactionIdEquals(hdr->xid, xid))
-			{
-				ereport(WARNING,
-					  (errmsg("removing corrupt two-phase state file \"%s\"",
-							  clde->d_name)));
-				RemoveTwoPhaseFile(xid, true);
-				pfree(buf);
-				continue;
-			}
+				ereport(FATAL, (errmsg("corrupted two-phase file \"%s\"",
+									   clde->d_name)));
 
 			/*
 			 * OK, we think this file is valid.  Incorporate xid into the
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index aa9ee5a0dd..ea2c523e6e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7150,6 +7150,13 @@ StartupXLOG(void)
 	}
 
 	/*
+	 * Pre-scan prepared transactions to find out the range of XIDs present.
+	 * We don't need this information quite yet, but if it fails for some
+	 * reason, better to fail before we make any on-disk changes.
+	 */
+	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+
+	/*
 	 * Consider whether we need to assign a new timeline ID.
 	 *
 	 * If we are doing an archive recovery, we always assign a new ID.  This
@@ -7272,9 +7279,6 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
-	/* Pre-scan prepared transactions to find out the range of XIDs present */
-	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
-
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
-- 
2.11.0

Michael Paquier

michael.paquier@gmail.com

almost 9 years ago

In reply to: Michael Paquier (#5)

Re: Crash on promotion when recovery.conf is renamed

On Tue, Dec 20, 2016 at 4:54 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

I am attaching that to next CF.

Moved to CF 2017-03. Both patches still apply.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

David Steele

david@pgmasters.net

almost 9 years ago

In reply to: Michael Paquier (#5)

Re: Crash on promotion when recovery.conf is renamed

On 12/20/16 2:54 AM, Michael Paquier wrote:

On Sat, Dec 17, 2016 at 9:23 PM, Magnus Hagander <magnus@hagander.net> wrote:

On Fri, Dec 16, 2016 at 7:08 AM, Michael Paquier <michael.paquier@gmail.com>
wrote:

Looking at PrescanPreparedTransactions(), I am thinking as well that it
would
be better to get a hard failure when bumping on a corrupted 2PC file.
Future
files are one thing, but corrupted files should be treated more carefully.

Again without looking at it, I agree (so much easier that way :P). Ignoring
corruption is generally a bad idea. Failing hard makes the user notice the
error, and makes it possible to initiate recovery from a standby or from
backups or something, or to *intentionally* remove/clear/ignore it.

And I am finishing with the two patches attached:
- 0001 changes the 2PC checks so as corrupted entries are FATAL.
PreScanPreparedTransaction is used when a hot standby is initialized.
In this case a failure protects the range of XIDs generated,
potentially saving from corruption of data. At the end of recovery,
this is done before any on-disk actions are taken.
- 0002 is the thing that Heikki has sent previously to minimize the
window between end-of-recovery record write and timeline history file
archiving.

I am attaching that to next CF.

This patch still applies cleanly and compiles at cccbdde.

Any idea when you'll have a chance to review?

--
-David
david@pgmasters.net

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Tsunakawa, Takayuki

tsunakawa.takay@jp.fujitsu.com

almost 9 years ago

In reply to: David Steele (#7)

Re: Crash on promotion when recovery.conf is renamed

From: David Steele [mailto:david@pgmasters.net]

Any idea when you'll have a chance to review?

I'll do it by early next week.

Regards
Takayuki Tsunakawa

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Tsunakawa, Takayuki

tsunakawa.takay@jp.fujitsu.com

almost 9 years ago

In reply to: Michael Paquier (#6)

Re: Crash on promotion when recovery.conf is renamed

From: pgsql-hackers-owner@postgresql.org

[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Michael Paquier
Moved to CF 2017-03. Both patches still apply.

Sorry to be late for reviewing this, but done now. The patch applied, make check passed, and the code looks almost good. I could successfully perform a simple archive recovery. Finally, I broke the 2pc state file while the server is down, and could confirm that the server failed to start as expected, emitting a FATAL message. Worked nicely.

Just two cosmetic points:

(1)
Other places use "two-phase state file", not "two-phase file".

(2)
All other places in twophase.c and most places in other files put ereport() and errmsg() on separate lines. I think it would be better to align with surrounding code.

+ ereport(FATAL, (errmsg("corrupted two-phase file \"%s\"",

Regards
Takayuki Tsunakawa

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10

Tom Lane

tgl@sss.pgh.pa.us

almost 9 years ago

In reply to: Tsunakawa, Takayuki (#9)

Re: Crash on promotion when recovery.conf is renamed

"Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> writes:

All other places in twophase.c and most places in other files put ereport() and errmsg() on separate lines. I think it would be better to align with surrounding code.

+ ereport(FATAL, (errmsg("corrupted two-phase file \"%s\"",

Actually, the *real* problem with that coding is it lacks a SQLSTATE
(errcode call). The only places where it's acceptable to leave that
out are for internal "can't happen" cases, which this surely isn't.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11

Alexander Korotkov

a.korotkov@postgrespro.ru

almost 9 years ago

In reply to: Tom Lane (#10)

Re: Crash on promotion when recovery.conf is renamed

On Mon, Mar 27, 2017 at 4:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> writes:

All other places in twophase.c and most places in other files put

ereport() and errmsg() on separate lines. I think it would be better to
align with surrounding code.

+ ereport(FATAL, (errmsg("corrupted

two-phase file \"%s\"",

Actually, the *real* problem with that coding is it lacks a SQLSTATE
(errcode call). The only places where it's acceptable to leave that
out are for internal "can't happen" cases, which this surely isn't.

Surrounding code also has ereports lacking SQLSTATE. And that isn't "can't
happen" case as well.

if (ControlFile->backupEndRequired)

ereport(FATAL,
(errmsg("WAL ends before end of online backup"),
errhint("All WAL generated while online backup was taken must be
available at recovery.")));
else if (!XLogRecPtrIsInvalid(ControlFile->backupStartPoint))
ereport(FATAL,
(errmsg("WAL ends before end of online backup"),
errhint("Online backup started with pg_start_backup() must be ended with
pg_stop_backup(), and all WAL up to that point must be available at
recovery.")));
else
ereport(FATAL,
(errmsg("WAL ends before consistent recovery point")));

Should we consider fixing them?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#12

Tom Lane

tgl@sss.pgh.pa.us

almost 9 years ago

In reply to: Alexander Korotkov (#11)

Re: Crash on promotion when recovery.conf is renamed

Alexander Korotkov <a.korotkov@postgrespro.ru> writes:

On Mon, Mar 27, 2017 at 4:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Actually, the *real* problem with that coding is it lacks a SQLSTATE
(errcode call). The only places where it's acceptable to leave that
out are for internal "can't happen" cases, which this surely isn't.

Surrounding code also has ereports lacking SQLSTATE. And that isn't "can't
happen" case as well.
Should we consider fixing them?

Yup. Just remember that the default is
XX000 E ERRCODE_INTERNAL_ERROR internal_error

If that's not how you want the error case reported, you need an errcode()
call.

We might need more ERRCODEs than are there now, if none of the existing
ones seem to fit these cases. There's already ERRCODE_DATA_CORRUPTED
and ERRCODE_INDEX_CORRUPTED; maybe we need ERRCODE_WAL_CORRUPTED, for
example?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13

Alexander Korotkov

a.korotkov@postgrespro.ru

almost 9 years ago

In reply to: Tsunakawa, Takayuki (#9)

Re: Crash on promotion when recovery.conf is renamed

On Mon, Mar 27, 2017 at 11:26 AM, Tsunakawa, Takayuki <
tsunakawa.takay@jp.fujitsu.com> wrote:

From: pgsql-hackers-owner@postgresql.org

[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Michael Paquier
Moved to CF 2017-03. Both patches still apply.

Sorry to be late for reviewing this, but done now. The patch applied,
make check passed, and the code looks almost good. I could successfully
perform a simple archive recovery. Finally, I broke the 2pc state file
while the server is down, and could confirm that the server failed to start
as expected, emitting a FATAL message. Worked nicely.

I've tested moving recovery.conf away case which was originally reported by
Magnus.

When I'm trying to promote standby when recovery.conf is moved away, I get
FATAL error.

waiting for server to promote....2017-03-27 17:12:38.225 MSK [30307] LOG:
received promote request
2017-03-27 17:12:38.225 MSK [30311] FATAL: terminating walreceiver process
due to administrator command
2017-03-27 17:12:38.225 MSK [30307] LOG: redo done at 0/3000028
2017-03-27 17:12:38.226 MSK [30307] LOG: selected new timeline ID: 2
2017-03-27 17:12:38.253 MSK [30307] FATAL: could not open file
"recovery.conf": No such file or directory
2017-03-27 17:12:38.253 MSK [30306] LOG: startup process (PID 30307)
exited with exit code 1
2017-03-27 17:12:38.253 MSK [30306] LOG: terminating any other active
server processes
2017-03-27 17:12:38.256 MSK [30306] LOG: database system is shut down
........................................................... stopped waiting
server is still promoting

If I try to start it after crash, it becomes a master on timeline 1. Just
like in case we deleted recovery.conf while server was shut down.

waiting for server to start....2017-03-27 17:16:54.186 MSK [30413] LOG:
listening on IPv6 address "::1", port 5430
2017-03-27 17:16:54.186 MSK [30413] LOG: listening on IPv6 address
"fe80::1%lo0", port 5430
2017-03-27 17:16:54.186 MSK [30413] LOG: listening on IPv4 address
"127.0.0.1", port 5430
2017-03-27 17:16:54.187 MSK [30413] LOG: listening on Unix socket
"/tmp/.s.PGSQL.5430"
2017-03-27 17:16:54.195 MSK [30414] LOG: database system was interrupted
while in recovery at log time 2017-03-27 17:10:23 MSK
2017-03-27 17:16:54.195 MSK [30414] HINT: If this has occurred more than
once some data might be corrupted and you might need to choose an earlier
recovery target.
2017-03-27 17:16:54.218 MSK [30414] LOG: database system was not properly
shut down; automatic recovery in progress
2017-03-27 17:16:54.219 MSK [30414] LOG: redo starts at 0/2000060
2017-03-27 17:16:54.219 MSK [30414] LOG: consistent recovery state reached
at 0/3000060
2017-03-27 17:16:54.219 MSK [30414] LOG: invalid record length at
0/3000060: wanted 24, got 0
2017-03-27 17:16:54.219 MSK [30414] LOG: redo done at 0/3000028
2017-03-27 17:16:54.224 MSK [30413] LOG: database system is ready to
accept connections
done
server started

# select pg_is_in_recovery();
pg_is_in_recovery
───────────────────
f
(1 row)

# select pg_walfile_name(pg_current_wal_location());
pg_walfile_name
──────────────────────────
000000010000000000000003
(1 row)

If instead I put recovery back and start server, then streaming replication
continues normally.

waiting for server to start....2017-03-27 17:32:16.887 MSK [30783] LOG:
listening on IPv6 address "::1", port 5430
2017-03-27 17:32:16.887 MSK [30783] LOG: listening on IPv6 address
"fe80::1%lo0", port 5430
2017-03-27 17:32:16.887 MSK [30783] LOG: listening on IPv4 address
"127.0.0.1", port 5430
2017-03-27 17:32:16.888 MSK [30783] LOG: listening on Unix socket
"/tmp/.s.PGSQL.5430"
2017-03-27 17:32:16.897 MSK [30784] LOG: database system was interrupted
while in recovery at log time 2017-03-27 17:28:05 MSK
2017-03-27 17:32:16.897 MSK [30784] HINT: If this has occurred more than
once some data might be corrupted and you might need to choose an earlier
recovery target.
2017-03-27 17:32:16.914 MSK [30784] LOG: entering standby mode
2017-03-27 17:32:16.916 MSK [30784] LOG: redo starts at 0/8000028
2017-03-27 17:32:16.916 MSK [30784] LOG: consistent recovery state reached
at 0/9000060
2017-03-27 17:32:16.916 MSK [30784] LOG: invalid record length at
0/9000060: wanted 24, got 0
2017-03-27 17:32:16.916 MSK [30783] LOG: database system is ready to
accept read only connections
2017-03-27 17:32:16.920 MSK [30788] LOG: started streaming WAL from
primary at 0/9000000 on timeline 1
done
server started

# select pg_is_in_recovery();
pg_is_in_recovery
───────────────────
t
(1 row)

Thus, I think patch is working as expected in this case.

Also, I'd like to notice that it passes check-world without warnings on my
laptop.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#14

Tsunakawa, Takayuki

tsunakawa.takay@jp.fujitsu.com

almost 9 years ago

In reply to: Tom Lane (#12)

Re: Crash on promotion when recovery.conf is renamed

From: pgsql-hackers-owner@postgresql.org

[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tom Lane
"Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> writes:

All other places in twophase.c and most places in other files put ereport()

and errmsg() on separate lines. I think it would be better to align with
surrounding code.

+ ereport(FATAL, (errmsg("corrupted

two-phase file \"%s\"",

Actually, the *real* problem with that coding is it lacks a SQLSTATE (errcode
call). The only places where it's acceptable to leave that out are for
internal "can't happen" cases, which this surely isn't.

Oh, I overlooked it. But...

Yup. Just remember that the default is
XX000 E ERRCODE_INTERNAL_ERROR internal_error

If that's not how you want the error case reported, you need an errcode()
call.

We might need more ERRCODEs than are there now, if none of the existing
ones seem to fit these cases. There's already ERRCODE_DATA_CORRUPTED and
ERRCODE_INDEX_CORRUPTED; maybe we need ERRCODE_WAL_CORRUPTED, for
example?

I'd be always happy if the error code is more specific, but maybe that would be a separate patch. WAL corruption message so far doesn't accompany a specific error code like this in xlog.c:

/*
* We only end up here without a message when XLogPageRead()
* failed - in that case we already logged something. In
* StandbyMode that only happens if we have been triggered, so we
* shouldn't loop anymore in that case.
*/
if (errormsg)
ereport(emode_for_corrupt_record(emode,
RecPtr ? RecPtr : EndRecPtr),
(errmsg_internal("%s", errormsg) /* already translated */ ));

Regards
Takayuki Tsunakawa

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15

Michael Paquier

michael.paquier@gmail.com

almost 9 years ago

In reply to: Tom Lane (#12)

2 attachment(s)

Re: Crash on promotion when recovery.conf is renamed

On Mon, Mar 27, 2017 at 11:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alexander Korotkov <a.korotkov@postgrespro.ru> writes:

On Mon, Mar 27, 2017 at 4:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Actually, the *real* problem with that coding is it lacks a SQLSTATE
(errcode call). The only places where it's acceptable to leave that
out are for internal "can't happen" cases, which this surely isn't.

Surrounding code also has ereports lacking SQLSTATE. And that isn't "can't
happen" case as well.
Should we consider fixing them?

Yup. Just remember that the default is
XX000 E ERRCODE_INTERNAL_ERROR internal_error

If that's not how you want the error case reported, you need an errcode()
call.

We might need more ERRCODEs than are there now, if none of the existing
ones seem to fit these cases. There's already ERRCODE_DATA_CORRUPTED
and ERRCODE_INDEX_CORRUPTED; maybe we need ERRCODE_WAL_CORRUPTED, for
example?

While I agree with that in general, we are taking about 2PC files that
are on disk in patch 0001, so I would think that in this case
ERRCODE_DATA_CORRUPTED is the most adapted (or
ERRCODE_TWOPHASE_CORRUPTED?).

The other WARNING messages refer to stale files of already committed
transactions, which are not actually corrupted. What about in this
case having a ERRCODE_TWOPHASE_INVALID?

Updated patches are attached, I did not change the WARNING portion
though as I am not sure what's the consensus on the matter.
--
Michael

Attachments:

0001-Change-detection-of-corrupted-2PC-files-as-FATAL.patchapplication/octet-stream; name=0001-Change-detection-of-corrupted-2PC-files-as-FATAL.patchDownload

From ea949d40f0af7d18ffb81b55041853f9cb2d9697 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 28 Mar 2017 11:36:09 +0900
Subject: [PATCH 1/2] Change detection of corrupted 2PC files as FATAL

When scanning 2PC files, be it for initializing a hot standby node or
finding the XID range at the end of recovery, bumping on a corrupted
2PC file is reported as a WARNING and are blindly corrupted. If it
happened that the corrupted file was actually legit, there is actually
a risk of corruption, so switch that to a hard failure.
---
 src/backend/access/transam/twophase.c | 27 ++++++++++++---------------
 src/backend/access/transam/xlog.c     | 10 +++++++---
 2 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 83169cccc3..88fc94dcad 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1689,6 +1689,10 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
  * write a WAL entry, and so there might be no evidence in WAL of those
  * subxact XIDs.
  *
+ * On corrupted two-phase files, fail immediately. Keeping around broken
+ * entries and let replay continue causes harm on the system, and likely
+ * a new backup should be rolled in.
+ *
  * Our other responsibility is to determine and return the oldest valid XID
  * among the prepared xacts (if none, return ShmemVariableCache->nextXid).
  * This is needed to synchronize pg_subtrans startup properly.
@@ -1740,25 +1744,18 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 			/* Read and validate file */
 			buf = ReadTwoPhaseFile(xid, true);
 			if (buf == NULL)
-			{
-				ereport(WARNING,
-					  (errmsg("removing corrupt two-phase state file \"%s\"",
-							  clde->d_name)));
-				RemoveTwoPhaseFile(xid, true);
-				continue;
-			}
+				ereport(FATAL,
+						(errcode(ERRCODE_DATA_CORRUPTED),
+						 errmsg("corrupted two-phase file \"%s\"",
+								clde->d_name)));
 
 			/* Deconstruct header */
 			hdr = (TwoPhaseFileHeader *) buf;
 			if (!TransactionIdEquals(hdr->xid, xid))
-			{
-				ereport(WARNING,
-					  (errmsg("removing corrupt two-phase state file \"%s\"",
-							  clde->d_name)));
-				RemoveTwoPhaseFile(xid, true);
-				pfree(buf);
-				continue;
-			}
+				ereport(FATAL,
+						(errcode(ERRCODE_DATA_CORRUPTED),
+						 errmsg("corrupted two-phase file \"%s\"",
+								clde->d_name)));
 
 			/*
 			 * OK, we think this file is valid.  Incorporate xid into the
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 61ca81d1d2..af40b0497e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7371,6 +7371,13 @@ StartupXLOG(void)
 	}
 
 	/*
+	 * Pre-scan prepared transactions to find out the range of XIDs present.
+	 * We don't need this information quite yet, but if it fails for some
+	 * reason, better to fail before we make any on-disk changes.
+	 */
+	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+
+	/*
 	 * Consider whether we need to assign a new timeline ID.
 	 *
 	 * If we are doing an archive recovery, we always assign a new ID.  This
@@ -7493,9 +7500,6 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
-	/* Pre-scan prepared transactions to find out the range of XIDs present */
-	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
-
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
-- 
2.12.2

0002-Minimize-window-between-history-file-and-end-of-reco.patchapplication/octet-stream; name=0002-Minimize-window-between-history-file-and-end-of-reco.patchDownload

From b8c55152fb1ce39f72da47a1281a910d247161bb Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 20 Dec 2016 16:47:35 +0900
Subject: [PATCH 2/2] Minimize window between history file and end-of-recovery
 record

Once a standby node is promoted, this makes the assignment of the new
timeline number booked earlier as the history file gets archived immediately.
This way the other nodes are aware that this new timeline number is taken
and should not be assigned to other nodes.

The window between which the history file is archived and the end-of-recovery
record is written cannot be zeroed, but this way it is minimized as much as
possible. The new order of actions prevents as well a corrupted data directory
on failure.
---
 src/backend/access/transam/xlog.c | 27 ++++++++++++++++++---------
 1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index af40b0497e..c9968adb42 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7431,6 +7431,24 @@ StartupXLOG(void)
 		else
 			snprintf(reason, sizeof(reason), "no recovery target specified");
 
+		/*
+		 * We are now done reading the old WAL.  Turn off archive fetching if it
+		 * was active, and make a writable copy of the last WAL segment. (Note
+		 * that we also have a copy of the last block of the old WAL in readBuf;
+		 * we will use that below.)
+		 */
+		exitArchiveRecovery(EndOfLogTLI, EndOfLog);
+
+		/*
+		 * Write the timeline history file, and have it archived. After this
+		 * point (or rather, as soon as the file is archived), the timeline
+		 * will appear as "taken" in the WAL archive and to any standby servers.
+		 * If we crash before actually switching to the new timeline, standby
+		 * servers will nevertheless think that we switched to the new timeline,
+		 * and will try to connect to the new timeline. To minimize the window
+		 * for that, try to do as little as possible between here and writing the
+		 * end-of-recovery record.
+		 */
 		writeTimeLineHistory(ThisTimeLineID, recoveryTargetTLI,
 							 EndRecPtr, reason);
 	}
@@ -7440,15 +7458,6 @@ StartupXLOG(void)
 	XLogCtl->PrevTimeLineID = PrevTimeLineID;
 
 	/*
-	 * We are now done reading the old WAL.  Turn off archive fetching if it
-	 * was active, and make a writable copy of the last WAL segment. (Note
-	 * that we also have a copy of the last block of the old WAL in readBuf;
-	 * we will use that below.)
-	 */
-	if (ArchiveRecoveryRequested)
-		exitArchiveRecovery(EndOfLogTLI, EndOfLog);
-
-	/*
 	 * Prepare to write WAL starting at EndOfLog position, and init xlog
 	 * buffer cache using the block containing the last record from the
 	 * previous incarnation.
-- 
2.12.2

#16

Tsunakawa, Takayuki

tsunakawa.takay@jp.fujitsu.com

almost 9 years ago

In reply to: Michael Paquier (#15)

Re: Crash on promotion when recovery.conf is renamed

From: pgsql-hackers-owner@postgresql.org

[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Michael Paquier
While I agree with that in general, we are taking about 2PC files that are
on disk in patch 0001, so I would think that in this case
ERRCODE_DATA_CORRUPTED is the most adapted (or
ERRCODE_TWOPHASE_CORRUPTED?).

The other WARNING messages refer to stale files of already committed
transactions, which are not actually corrupted. What about in this case
having a ERRCODE_TWOPHASE_INVALID?

Updated patches are attached, I did not change the WARNING portion though
as I am not sure what's the consensus on the matter.

I get the impression that DATA_CORRUPTED means the table data is corrupted, because there's an error code named INDEX_CORRUPTED. Anyway, I don't think this patch needs to attach an error code because:

* Currently, other interlal files (not tables or indexes) seem to use INTERNAL_ERROR (XX000). For example, see ReadControlFile() in xlog.c and pgstat_read_statsfiles() in pgstat.c.

* It doesn't seem that the user needs distinction. I don't object to providing a specific error code for this case, but if the patch needs a specific error code to be committed, I'd like to know how that's useful (e.g. how it affects the recovery action the user takes.)

So, I'd like to mark the patch as ready for committer when ereport() and errmsg() are on separate lines and the message changes to "two-phase state file."

Regards
Takayuki Tsunakawa

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17

Michael Paquier

michael.paquier@gmail.com

almost 9 years ago

In reply to: Tsunakawa, Takayuki (#16)

2 attachment(s)

Re: Crash on promotion when recovery.conf is renamed

On Tue, Mar 28, 2017 at 1:33 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:

I get the impression that DATA_CORRUPTED means the table data is corrupted, because there's an error code named INDEX_CORRUPTED.

I have interpreted that as the other way around, aka DATA_CORRUPTED
could be used as well to 2PC files :)
But grepping around it seems that you are grabbing the meaning better
than I do, ERRCODE_DATA_CORRUPTED is only used now for relation pages
or large pages.

Anyway, I don't think this patch needs to attach an error code because:
* Currently, other interlal files (not tables or indexes) seem to use INTERNAL_ERROR (XX000). For example, see ReadControlFile() in xlog.c and pgstat_read_statsfiles() in pgstat.c.
* It doesn't seem that the user needs distinction. I don't object to providing a specific error code for this case, but if the patch needs a specific error code to be committed, I'd like to know how that's useful (e.g. how it affects the recovery action the user takes.)
So, I'd like to mark the patch as ready for committer when ereport() and errmsg() are on separate lines and the message changes to "two-phase state file."

Okay. I got the message, and I agree with what you say here. You are
right by the way, the error messages just use "two-phase file" and not
"two-phase STATE file", missed that previously.
--
Michael

Attachments:

0001-Change-detection-of-corrupted-2PC-files-as-FATAL.patchapplication/octet-stream; name=0001-Change-detection-of-corrupted-2PC-files-as-FATAL.patchDownload

From cf39fc78537af29931b9e8051f4a680f0b671dcf Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 28 Mar 2017 14:09:13 +0900
Subject: [PATCH 1/2] Change detection of corrupted 2PC files as FATAL

When scanning 2PC files, be it for initializing a hot standby node or
finding the XID range at the end of recovery, bumping on a corrupted
2PC file is reported as a WARNING and are blindly corrupted. If it
happened that the corrupted file was actually legit, there is actually
a risk of corruption, so switch that to a hard failure.
---
 src/backend/access/transam/twophase.c | 25 ++++++++++---------------
 src/backend/access/transam/xlog.c     | 10 +++++++---
 2 files changed, 17 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 83169cccc3..e1e4052253 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1689,6 +1689,10 @@ CheckPointTwoPhase(XLogRecPtr redo_horizon)
  * write a WAL entry, and so there might be no evidence in WAL of those
  * subxact XIDs.
  *
+ * On corrupted two-phase files, fail immediately. Keeping around broken
+ * entries and let replay continue causes harm on the system, and likely
+ * a new backup should be rolled in.
+ *
  * Our other responsibility is to determine and return the oldest valid XID
  * among the prepared xacts (if none, return ShmemVariableCache->nextXid).
  * This is needed to synchronize pg_subtrans startup properly.
@@ -1740,25 +1744,16 @@ PrescanPreparedTransactions(TransactionId **xids_p, int *nxids_p)
 			/* Read and validate file */
 			buf = ReadTwoPhaseFile(xid, true);
 			if (buf == NULL)
-			{
-				ereport(WARNING,
-					  (errmsg("removing corrupt two-phase state file \"%s\"",
-							  clde->d_name)));
-				RemoveTwoPhaseFile(xid, true);
-				continue;
-			}
+				ereport(FATAL,
+						(errmsg("corrupted two-phase state file \"%s\"",
+								clde->d_name)));
 
 			/* Deconstruct header */
 			hdr = (TwoPhaseFileHeader *) buf;
 			if (!TransactionIdEquals(hdr->xid, xid))
-			{
-				ereport(WARNING,
-					  (errmsg("removing corrupt two-phase state file \"%s\"",
-							  clde->d_name)));
-				RemoveTwoPhaseFile(xid, true);
-				pfree(buf);
-				continue;
-			}
+				ereport(FATAL,
+						(errmsg("corrupted two-phase state file \"%s\"",
+								clde->d_name)));
 
 			/*
 			 * OK, we think this file is valid.  Incorporate xid into the
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 61ca81d1d2..af40b0497e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7371,6 +7371,13 @@ StartupXLOG(void)
 	}
 
 	/*
+	 * Pre-scan prepared transactions to find out the range of XIDs present.
+	 * We don't need this information quite yet, but if it fails for some
+	 * reason, better to fail before we make any on-disk changes.
+	 */
+	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+
+	/*
 	 * Consider whether we need to assign a new timeline ID.
 	 *
 	 * If we are doing an archive recovery, we always assign a new ID.  This
@@ -7493,9 +7500,6 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
-	/* Pre-scan prepared transactions to find out the range of XIDs present */
-	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
-
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
-- 
2.12.2

0002-Minimize-window-between-history-file-and-end-of-reco.patchapplication/octet-stream; name=0002-Minimize-window-between-history-file-and-end-of-reco.patchDownload

From ffc5c9a496fda71fc29158b1aa5d32791587bde6 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 20 Dec 2016 16:47:35 +0900
Subject: [PATCH 2/2] Minimize window between history file and end-of-recovery
 record

Once a standby node is promoted, this makes the assignment of the new
timeline number booked earlier as the history file gets archived immediately.
This way the other nodes are aware that this new timeline number is taken
and should not be assigned to other nodes.

The window between which the history file is archived and the end-of-recovery
record is written cannot be zeroed, but this way it is minimized as much as
possible. The new order of actions prevents as well a corrupted data directory
on failure.
---
 src/backend/access/transam/xlog.c | 27 ++++++++++++++++++---------
 1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index af40b0497e..c9968adb42 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7431,6 +7431,24 @@ StartupXLOG(void)
 		else
 			snprintf(reason, sizeof(reason), "no recovery target specified");
 
+		/*
+		 * We are now done reading the old WAL.  Turn off archive fetching if it
+		 * was active, and make a writable copy of the last WAL segment. (Note
+		 * that we also have a copy of the last block of the old WAL in readBuf;
+		 * we will use that below.)
+		 */
+		exitArchiveRecovery(EndOfLogTLI, EndOfLog);
+
+		/*
+		 * Write the timeline history file, and have it archived. After this
+		 * point (or rather, as soon as the file is archived), the timeline
+		 * will appear as "taken" in the WAL archive and to any standby servers.
+		 * If we crash before actually switching to the new timeline, standby
+		 * servers will nevertheless think that we switched to the new timeline,
+		 * and will try to connect to the new timeline. To minimize the window
+		 * for that, try to do as little as possible between here and writing the
+		 * end-of-recovery record.
+		 */
 		writeTimeLineHistory(ThisTimeLineID, recoveryTargetTLI,
 							 EndRecPtr, reason);
 	}
@@ -7440,15 +7458,6 @@ StartupXLOG(void)
 	XLogCtl->PrevTimeLineID = PrevTimeLineID;
 
 	/*
-	 * We are now done reading the old WAL.  Turn off archive fetching if it
-	 * was active, and make a writable copy of the last WAL segment. (Note
-	 * that we also have a copy of the last block of the old WAL in readBuf;
-	 * we will use that below.)
-	 */
-	if (ArchiveRecoveryRequested)
-		exitArchiveRecovery(EndOfLogTLI, EndOfLog);
-
-	/*
 	 * Prepare to write WAL starting at EndOfLog position, and init xlog
 	 * buffer cache using the block containing the last record from the
 	 * previous incarnation.
-- 
2.12.2

#18

Tsunakawa, Takayuki

tsunakawa.takay@jp.fujitsu.com

almost 9 years ago

In reply to: Michael Paquier (#17)

Re: Crash on promotion when recovery.conf is renamed

From: Michael Paquier [mailto:michael.paquier@gmail.com]

Okay. I got the message, and I agree with what you say here. You are right
by the way, the error messages just use "two-phase file" and not "two-phase
STATE file", missed that previously.
--

Thanks, marked as ready for committer, as the code is good and Alexander reported the test success.

Regards
Takayuki Tsunakawa

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19

David Steele

david@pgmasters.net

almost 9 years ago

In reply to: Tsunakawa, Takayuki (#18)

Re: Crash on promotion when recovery.conf is renamed

On 3/28/17 1:21 AM, Tsunakawa, Takayuki wrote:

From: Michael Paquier [mailto:michael.paquier@gmail.com]

Okay. I got the message, and I agree with what you say here. You are right
by the way, the error messages just use "two-phase file" and not "two-phase
STATE file", missed that previously.
--

Thanks, marked as ready for committer, as the code is good and Alexander reported the test success.

This bug has been moved to CF 2017-07.

--
-David
david@pgmasters.net

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20

Robert Haas

robertmhaas@gmail.com

over 8 years ago

In reply to: David Steele (#19)

Re: Crash on promotion when recovery.conf is renamed

On Sat, Apr 8, 2017 at 10:05 AM, David Steele <david@pgmasters.net> wrote:

On 3/28/17 1:21 AM, Tsunakawa, Takayuki wrote:

From: Michael Paquier [mailto:michael.paquier@gmail.com]

Okay. I got the message, and I agree with what you say here. You are right
by the way, the error messages just use "two-phase file" and not "two-phase
STATE file", missed that previously.
--

Thanks, marked as ready for committer, as the code is good and Alexander reported the test success.

This bug has been moved to CF 2017-07.

This bug fix has been pending in "Ready for Committer" state for about
4.5 months. Three committers (Magnus, Heikki, Tom) have contributed
to the thread to date. Maybe one of them would like to commit this?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21

Thomas Munro

thomas.munro@enterprisedb.com

over 8 years ago

In reply to: Robert Haas (#20)

Re: Crash on promotion when recovery.conf is renamed

On Sat, Aug 26, 2017 at 7:32 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, Apr 8, 2017 at 10:05 AM, David Steele <david@pgmasters.net> wrote:

On 3/28/17 1:21 AM, Tsunakawa, Takayuki wrote:

Thanks, marked as ready for committer, as the code is good and Alexander reported the test success.

This bug has been moved to CF 2017-07.

This bug fix has been pending in "Ready for Committer" state for about
4.5 months. Three committers (Magnus, Heikki, Tom) have contributed
to the thread to date. Maybe one of them would like to commit this?

In the meantime its bits have begun to rot. Michael, could you please rebase?

Thanks!

--
Thomas Munro
http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#22

Michael Paquier

michael.paquier@gmail.com

over 8 years ago

In reply to: Thomas Munro (#21)

2 attachment(s)

Re: Crash on promotion when recovery.conf is renamed

On Fri, Sep 1, 2017 at 7:17 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Sat, Aug 26, 2017 at 7:32 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, Apr 8, 2017 at 10:05 AM, David Steele <david@pgmasters.net> wrote:

On 3/28/17 1:21 AM, Tsunakawa, Takayuki wrote:

Thanks, marked as ready for committer, as the code is good and Alexander reported the test success.

This bug has been moved to CF 2017-07.

This bug fix has been pending in "Ready for Committer" state for about
4.5 months. Three committers (Magnus, Heikki, Tom) have contributed
to the thread to date. Maybe one of them would like to commit this?

In the meantime its bits have begun to rot. Michael, could you please rebase?

Thanks for the reminder, Thomas. The 2PC code triggered during
recovery has been largely reworked in PG10, explaining the conflicts.
As this has been some time since I touched this patch, I had again a
look at its logic and could not find any problems around it. So
attached are a rebased versions for both 0001 and 0002, I have updated
the messages as well to be more in-line with what is in HEAD for
corrupted entries.
--
Michael

Attachments:

0002-Minimize-window-between-history-file-and-end-of-reco.patchtext/x-patch; charset=US-ASCII; name=0002-Minimize-window-between-history-file-and-end-of-reco.patchDownload

From 2f3a16c0c0cb12f8bfef2d58656352c46a681393 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Sat, 2 Sep 2017 21:15:04 +0900
Subject: [PATCH 2/2] Minimize window between history file and end-of-recovery
 record

Once a standby node is promoted, this makes the assignment of the new
timeline number booked earlier as the history file gets archived immediately.
This way the other nodes are aware that this new timeline number is taken
and should not be assigned to other nodes.

The window between which the history file is archived and the end-of-recovery
record is written cannot be zeroed, but this way it is minimized as much as
possible. The new order of actions prevents as well a corrupted data directory
on failure.
---
 src/backend/access/transam/xlog.c | 27 ++++++++++++++++++---------
 1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 80fe2c60e6..abc0cec3d2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7449,6 +7449,24 @@ StartupXLOG(void)
 		else
 			snprintf(reason, sizeof(reason), "no recovery target specified");
 
+		/*
+		 * We are now done reading the old WAL.  Turn off archive fetching
+		 * if it was active, and make a writable copy of the last WAL segment.
+		 * (Note that we also have a copy of the last block of the old WAL
+		 * in readBuf; we will use that below.)
+		 */
+		exitArchiveRecovery(EndOfLogTLI, EndOfLog);
+
+		/*
+		 * Write the timeline history file, and have it archived. After this
+		 * point (or rather, as soon as the file is archived), the timeline
+		 * will appear as "taken" in the WAL archive and to any standby servers.
+		 * If we crash before actually switching to the new timeline, standby
+		 * servers will nevertheless think that we switched to the new timeline,
+		 * and will try to connect to the new timeline. To minimize the window
+		 * for that, try to do as little as possible between here and writing the
+		 * end-of-recovery record.
+		 */
 		writeTimeLineHistory(ThisTimeLineID, recoveryTargetTLI,
 							 EndRecPtr, reason);
 	}
@@ -7457,15 +7475,6 @@ StartupXLOG(void)
 	XLogCtl->ThisTimeLineID = ThisTimeLineID;
 	XLogCtl->PrevTimeLineID = PrevTimeLineID;
 
-	/*
-	 * We are now done reading the old WAL.  Turn off archive fetching if it
-	 * was active, and make a writable copy of the last WAL segment. (Note
-	 * that we also have a copy of the last block of the old WAL in readBuf;
-	 * we will use that below.)
-	 */
-	if (ArchiveRecoveryRequested)
-		exitArchiveRecovery(EndOfLogTLI, EndOfLog);
-
 	/*
 	 * Prepare to write WAL starting at EndOfLog location, and init xlog
 	 * buffer cache using the block containing the last record from the
-- 
2.14.1

0001-Change-detection-of-corrupted-2PC-files-as-FATAL.patchtext/x-patch; charset=US-ASCII; name=0001-Change-detection-of-corrupted-2PC-files-as-FATAL.patchDownload

From 12ab1765f7ea032db217f2081f9561c00d1d96df Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Sat, 2 Sep 2017 21:07:21 +0900
Subject: [PATCH 1/2] Change detection of corrupted 2PC files as FATAL

When scanning 2PC files, be it for initializing a hot standby node or
finding the XID range at the end of recovery, bumping on a corrupted
2PC file is reported as a WARNING and are blindly corrupted. If it
happened that the corrupted file was actually legit, there is actually
a risk of corruption, so switch that to a hard failure.
---
 src/backend/access/transam/twophase.c | 27 +++++++++++----------------
 src/backend/access/transam/xlog.c     | 10 +++++++---
 2 files changed, 18 insertions(+), 19 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index ba03d9687e..4b4f2449f8 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1778,6 +1778,10 @@ restoreTwoPhaseData(void)
  * write a WAL entry, and so there might be no evidence in WAL of those
  * subxact XIDs.
  *
+ * On corrupted two-phase files, fail immediately. Keeping around broken
+ * entries and let replay continue causes harm on the system, and likely
+ * a new backup should be rolled in.
+ *
  * Our other responsibility is to determine and return the oldest valid XID
  * among the prepared xacts (if none, return ShmemVariableCache->nextXid).
  * This is needed to synchronize pg_subtrans startup properly.
@@ -2070,13 +2074,9 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 		/* Read and validate file */
 		buf = ReadTwoPhaseFile(xid, true);
 		if (buf == NULL)
-		{
-			ereport(WARNING,
-					(errmsg("removing corrupt two-phase state file for \"%u\"",
+			ereport(FATAL,
+					(errmsg("corrupted two-phase state file for \"%u\"",
 							xid)));
-			RemoveTwoPhaseFile(xid, true);
-			return NULL;
-		}
 	}
 	else
 	{
@@ -2088,20 +2088,15 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 	hdr = (TwoPhaseFileHeader *) buf;
 	if (!TransactionIdEquals(hdr->xid, xid))
 	{
+
 		if (fromdisk)
-		{
-			ereport(WARNING,
-					(errmsg("removing corrupt two-phase state file for \"%u\"",
+			ereport(FATAL,
+					(errmsg("corrupted two-phase state file for \"%u\"",
 							xid)));
-			RemoveTwoPhaseFile(xid, true);
-		}
 		else
-		{
-			ereport(WARNING,
-					(errmsg("removing corrupt two-phase state from memory for \"%u\"",
+			ereport(FATAL,
+					(errmsg("corrupted two-phase state in memory for \"%u\"",
 							xid)));
-			PrepareRedoRemove(xid, true);
-		}
 		pfree(buf);
 		return NULL;
 	}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index df4843f409..80fe2c60e6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7388,6 +7388,13 @@ StartupXLOG(void)
 		}
 	}
 
+	/*
+	 * Pre-scan prepared transactions to find out the range of XIDs present.
+	 * We don't need this information quite yet, but if it fails for some
+	 * reason, better to fail before we make any on-disk changes.
+	 */
+	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+
 	/*
 	 * Consider whether we need to assign a new timeline ID.
 	 *
@@ -7511,9 +7518,6 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
-	/* Pre-scan prepared transactions to find out the range of XIDs present */
-	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
-
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
-- 
2.14.1

#23

Daniel Gustafsson

daniel@yesql.se

over 8 years ago

In reply to: Michael Paquier (#22)

Re: Crash on promotion when recovery.conf is renamed

On 02 Sep 2017, at 14:17, Michael Paquier <michael.paquier@gmail.com> wrote:

On Fri, Sep 1, 2017 at 7:17 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Sat, Aug 26, 2017 at 7:32 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, Apr 8, 2017 at 10:05 AM, David Steele <david@pgmasters.net> wrote:

On 3/28/17 1:21 AM, Tsunakawa, Takayuki wrote:

Thanks, marked as ready for committer, as the code is good and Alexander reported the test success.

This bug has been moved to CF 2017-07.

This bug fix has been pending in "Ready for Committer" state for about
4.5 months. Three committers (Magnus, Heikki, Tom) have contributed
to the thread to date. Maybe one of them would like to commit this?

In the meantime its bits have begun to rot. Michael, could you please rebase?

Thanks for the reminder, Thomas. The 2PC code triggered during
recovery has been largely reworked in PG10, explaining the conflicts.
As this has been some time since I touched this patch, I had again a
look at its logic and could not find any problems around it. So
attached are a rebased versions for both 0001 and 0002, I have updated
the messages as well to be more in-line with what is in HEAD for
corrupted entries.

I’ve moved this to the next CF, but since this no longer applies cleanly I’ve
reset it to Waiting for author.

cheers ./daniel

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#24

Michael Paquier

michael.paquier@gmail.com

over 8 years ago

In reply to: Daniel Gustafsson (#23)

2 attachment(s)

Re: Crash on promotion when recovery.conf is renamed

On Mon, Oct 2, 2017 at 8:13 AM, Daniel Gustafsson <daniel@yesql.se> wrote:

I’ve moved this to the next CF, but since this no longer applies cleanly I’ve
reset it to Waiting for author.

Thanks Daniel for the reminder. Attached are rebased patches. This
thing rots easily...
--
Michael

Attachments:

0001-Change-detection-of-corrupted-2PC-files-as-FATAL.patchapplication/octet-stream; name=0001-Change-detection-of-corrupted-2PC-files-as-FATAL.patchDownload

From 7d1b7609f80d073d2ca0b13b432ead8133850c0a Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Sat, 2 Sep 2017 21:07:21 +0900
Subject: [PATCH 1/2] Change detection of corrupted 2PC files as FATAL

When scanning 2PC files, be it for initializing a hot standby node or
finding the XID range at the end of recovery, bumping on a corrupted
2PC file is reported as a WARNING and are blindly corrupted. If it
happened that the corrupted file was actually legit, there is actually
a risk of corruption, so switch that to a hard failure.
---
 src/backend/access/transam/twophase.c | 28 ++++++++++------------------
 src/backend/access/transam/xlog.c     | 10 +++++++---
 2 files changed, 17 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index cfaf8da781..e09b85428d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1778,6 +1778,10 @@ restoreTwoPhaseData(void)
  * write a WAL entry, and so there might be no evidence in WAL of those
  * subxact XIDs.
  *
+ * On corrupted two-phase files, fail immediately. Keeping around broken
+ * entries and let replay continue causes harm on the system, and likely
+ * a new backup should be rolled in.
+ *
  * Our other responsibility is to determine and return the oldest valid XID
  * among the prepared xacts (if none, return ShmemVariableCache->nextXid).
  * This is needed to synchronize pg_subtrans startup properly.
@@ -2070,13 +2074,9 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 		/* Read and validate file */
 		buf = ReadTwoPhaseFile(xid, true);
 		if (buf == NULL)
-		{
-			ereport(WARNING,
-					(errmsg("removing corrupt two-phase state file for transaction %u",
+			ereport(FATAL,
+					(errmsg("corrupted two-phase state file for \"%u\"",
 							xid)));
-			RemoveTwoPhaseFile(xid, true);
-			return NULL;
-		}
 	}
 	else
 	{
@@ -2089,21 +2089,13 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 	if (!TransactionIdEquals(hdr->xid, xid))
 	{
 		if (fromdisk)
-		{
-			ereport(WARNING,
-					(errmsg("removing corrupt two-phase state file for transaction %u",
+			ereport(FATAL,
+					(errmsg("corrupted two-phase state file for \"%u\"",
 							xid)));
-			RemoveTwoPhaseFile(xid, true);
-		}
 		else
-		{
-			ereport(WARNING,
-					(errmsg("removing corrupt two-phase state from memory for transaction %u",
+			ereport(FATAL,
+					(errmsg("corrupted two-phase state in memory for \"%u\"",
 							xid)));
-			PrepareRedoRemove(xid, true);
-		}
-		pfree(buf);
-		return NULL;
 	}
 
 	/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dd028a12a4..64a7d24e85 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7463,6 +7463,13 @@ StartupXLOG(void)
 		}
 	}
 
+	/*
+	 * Pre-scan prepared transactions to find out the range of XIDs present.
+	 * We don't need this information quite yet, but if it fails for some
+	 * reason, better to fail before we make any on-disk changes.
+	 */
+	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+
 	/*
 	 * Consider whether we need to assign a new timeline ID.
 	 *
@@ -7586,9 +7593,6 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
-	/* Pre-scan prepared transactions to find out the range of XIDs present */
-	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
-
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
-- 
2.14.2

0002-Minimize-window-between-history-file-and-end-of-reco.patchapplication/octet-stream; name=0002-Minimize-window-between-history-file-and-end-of-reco.patchDownload

From d42462a77bfb47a114b1e92f974ed557d25913cc Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Sat, 2 Sep 2017 21:15:04 +0900
Subject: [PATCH 2/2] Minimize window between history file and end-of-recovery
 record

Once a standby node is promoted, this makes the assignment of the new
timeline number booked earlier as the history file gets archived immediately.
This way the other nodes are aware that this new timeline number is taken
and should not be assigned to other nodes.

The window between which the history file is archived and the end-of-recovery
record is written cannot be zeroed, but this way it is minimized as much as
possible. The new order of actions prevents as well a corrupted data directory
on failure.
---
 src/backend/access/transam/xlog.c | 27 ++++++++++++++++++---------
 1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 64a7d24e85..b767c6a968 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7524,6 +7524,24 @@ StartupXLOG(void)
 		else
 			snprintf(reason, sizeof(reason), "no recovery target specified");
 
+		/*
+		 * We are now done reading the old WAL.  Turn off archive fetching
+		 * if it was active, and make a writable copy of the last WAL segment.
+		 * (Note that we also have a copy of the last block of the old WAL
+		 * in readBuf; we will use that below.)
+		 */
+		exitArchiveRecovery(EndOfLogTLI, EndOfLog);
+
+		/*
+		 * Write the timeline history file, and have it archived. After this
+		 * point (or rather, as soon as the file is archived), the timeline
+		 * will appear as "taken" in the WAL archive and to any standby servers.
+		 * If we crash before actually switching to the new timeline, standby
+		 * servers will nevertheless think that we switched to the new timeline,
+		 * and will try to connect to the new timeline. To minimize the window
+		 * for that, try to do as little as possible between here and writing the
+		 * end-of-recovery record.
+		 */
 		writeTimeLineHistory(ThisTimeLineID, recoveryTargetTLI,
 							 EndRecPtr, reason);
 	}
@@ -7532,15 +7550,6 @@ StartupXLOG(void)
 	XLogCtl->ThisTimeLineID = ThisTimeLineID;
 	XLogCtl->PrevTimeLineID = PrevTimeLineID;
 
-	/*
-	 * We are now done reading the old WAL.  Turn off archive fetching if it
-	 * was active, and make a writable copy of the last WAL segment. (Note
-	 * that we also have a copy of the last block of the old WAL in readBuf;
-	 * we will use that below.)
-	 */
-	if (ArchiveRecoveryRequested)
-		exitArchiveRecovery(EndOfLogTLI, EndOfLog);
-
 	/*
 	 * Prepare to write WAL starting at EndOfLog location, and init xlog
 	 * buffer cache using the block containing the last record from the
-- 
2.14.2

#25

Michael Paquier

michael.paquier@gmail.com

about 8 years ago

In reply to: Michael Paquier (#24)

Re: [HACKERS] Crash on promotion when recovery.conf is renamed

On Mon, Oct 2, 2017 at 3:29 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Mon, Oct 2, 2017 at 8:13 AM, Daniel Gustafsson <daniel@yesql.se> wrote:

I’ve moved this to the next CF, but since this no longer applies cleanly I’ve
reset it to Waiting for author.

Thanks Daniel for the reminder. Attached are rebased patches. This
thing rots easily...

This set of patches still applies, and is marked as ready for
committer. Are any of the committers cited on this thread, aka Magnus,
Heikki, Tom interested in this patch set? Or not? We are close to the
end of CF 2017-11, so I am bumping it to the next one.
--
Michael

#26

Stephen Frost

sfrost@snowman.net

about 8 years ago

In reply to: Michael Paquier (#25)

Re: [HACKERS] Crash on promotion when recovery.conf is renamed

Magnus,

* Michael Paquier (michael.paquier@gmail.com) wrote:

On Mon, Oct 2, 2017 at 3:29 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Mon, Oct 2, 2017 at 8:13 AM, Daniel Gustafsson <daniel@yesql.se> wrote:

I’ve moved this to the next CF, but since this no longer applies cleanly I’ve
reset it to Waiting for author.

Thanks Daniel for the reminder. Attached are rebased patches. This
thing rots easily...

This set of patches still applies, and is marked as ready for
committer. Are any of the committers cited on this thread, aka Magnus,
Heikki, Tom interested in this patch set? Or not? We are close to the
end of CF 2017-11, so I am bumping it to the next one.

Magnus, this was your thread to begin with, though I know others have
been involved, any chance you'll be able to review this for commit
during this CF? I agree that this is certainly a good thing to have
too, though I've not looked at the patch itself in depth. Is there
anything we can do to help move it along?

Appears to compile and pass make check-world still (not sure why Thomas'
bot currently thinks the build fails, since it worked here..).

Thanks!

Stephen

#27

Thomas Munro

thomas.munro@enterprisedb.com

almost 8 years ago

In reply to: Stephen Frost (#26)

Re: [HACKERS] Crash on promotion when recovery.conf is renamed

On Fri, Jan 12, 2018 at 4:35 PM, Stephen Frost <sfrost@snowman.net> wrote:

Appears to compile and pass make check-world still (not sure why Thomas'
bot currently thinks the build fails, since it worked here..).

It looks good now. There was a brief window when all Travis CI builds
said this while updating various packages at startup:

W: GPG error: http://repo.mongodb.org/apt/ubuntu
trusty/mongodb-org/3.4 Release: The following signatures were invalid:
KEYEXPIRED 1515625755
W: The repository 'http://repo.mongodb.org/apt/ubuntu
trusty/mongodb-org/3.4 Release' is not signed.

That's because the apt.sources happens to have that repository in it.
I didn't look into it because it fixed itself at the next rebuild. I
speculate that MongoDB's package repository is eventually consistent.

--
Thomas Munro
http://www.enterprisedb.com

#28

Michael Paquier

michael@paquier.xyz

over 7 years ago

In reply to: Stephen Frost (#26)

2 attachment(s)

Re: [HACKERS] Crash on promotion when recovery.conf is renamed

On Thu, Jan 11, 2018 at 10:35:22PM -0500, Stephen Frost wrote:

Magnus, this was your thread to begin with, though I know others have
been involved, any chance you'll be able to review this for commit
during this CF? I agree that this is certainly a good thing to have
too, though I've not looked at the patch itself in depth. Is there
anything we can do to help move it along?

As an effort to move on with bug items in the commit fest, attached are
two patches with a proposed commit message as well as polished comments
Those are proposed for a back-patched. The 2PC issue is particularly
bad in my opinion because having any 2PC file on-disk and corrupted
means that a transaction is lost. I have been playing a bit with
hexedit and changed a couple of bytes in one of them... If trying to
use a base backup which includes one, then the standby reading it would
be similarly confused.
--
Michael

Attachments:

0001-Fail-hard-when-facing-corrupted-two-phase-state-file.patchtext/x-diff; charset=us-asciiDownload

From d12621c1e03abf8876d944ea3e831213111f2909 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Thu, 28 Jun 2018 14:38:24 +0900
Subject: [PATCH 1/2] Fail hard when facing corrupted two-phase state files

When a corrupted file is found by WAL replay, be it for crash recovery
or archive recovery, then the file is simply skipped and a WARNING is
logged to the user.  Facing an on-disk WAL file which is corrupted is
more likely to happen than its pair recorded in dedicated WAL records,
but if that happens then the instance faces data loss as the transaction
is not around anymore as it is not possible to commit it.

Reported-by: Michael Paquier
Author: Michael Paquier
Discussion: https://postgr.es/m/20161216060832.GB17838@paquier.xyz
---
 src/backend/access/transam/twophase.c | 28 ++++++++++-----------------
 src/backend/access/transam/xlog.c     | 10 +++++++---
 2 files changed, 17 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index a9ef1b3d73..2da3d93d87 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1873,6 +1873,10 @@ restoreTwoPhaseData(void)
  * write a WAL entry, and so there might be no evidence in WAL of those
  * subxact XIDs.
  *
+ * On corrupted two-phase files, fail immediately. Keeping around broken
+ * entries and let replay continue causes harm on the system, and a new
+ * backup should be rolled in.
+ *
  * Our other responsibility is to determine and return the oldest valid XID
  * among the prepared xacts (if none, return ShmemVariableCache->nextXid).
  * This is needed to synchronize pg_subtrans startup properly.
@@ -2165,13 +2169,9 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 		/* Read and validate file */
 		buf = ReadTwoPhaseFile(xid, true);
 		if (buf == NULL)
-		{
-			ereport(WARNING,
-					(errmsg("removing corrupt two-phase state file for transaction %u",
+			ereport(FATAL,
+					(errmsg("corrupted two-phase state file for \"%u\"",
 							xid)));
-			RemoveTwoPhaseFile(xid, true);
-			return NULL;
-		}
 	}
 	else
 	{
@@ -2184,21 +2184,13 @@ ProcessTwoPhaseBuffer(TransactionId xid,
 	if (!TransactionIdEquals(hdr->xid, xid))
 	{
 		if (fromdisk)
-		{
-			ereport(WARNING,
-					(errmsg("removing corrupt two-phase state file for transaction %u",
+			ereport(FATAL,
+					(errmsg("corrupted two-phase state file for \"%u\"",
 							xid)));
-			RemoveTwoPhaseFile(xid, true);
-		}
 		else
-		{
-			ereport(WARNING,
-					(errmsg("removing corrupt two-phase state from memory for transaction %u",
+			ereport(FATAL,
+					(errmsg("corrupted two-phase state in memory for \"%u\"",
 							xid)));
-			PrepareRedoRemove(xid, true);
-		}
-		pfree(buf);
-		return NULL;
 	}
 
 	/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1a419aa49b..3695258e6f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7462,6 +7462,13 @@ StartupXLOG(void)
 		}
 	}
 
+	/*
+	 * Pre-scan prepared transactions to find out the range of XIDs present.
+	 * This information is not quite needed yet, but it is positioned here so
+	 * as potential problems are detected before any on-disk change is done.
+	 */
+	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+
 	/*
 	 * Consider whether we need to assign a new timeline ID.
 	 *
@@ -7585,9 +7592,6 @@ StartupXLOG(void)
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
-	/* Pre-scan prepared transactions to find out the range of XIDs present */
-	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
-
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
-- 
2.18.0

0002-Minimize-window-between-history-file-and-end-of-reco.patchtext/x-diff; charset=us-asciiDownload

From b2862294f52b7262d7ec0cb2379688b2176169ea Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Thu, 28 Jun 2018 14:38:50 +0900
Subject: [PATCH 2/2] Minimize window between history file and end-of-recovery
 record

Once a standby node is promoted, this makes the assignment of the new
timeline number booked earlier as the history file gets archived
immediately.  This way the other nodes are aware that this new timeline
number is taken and should not be assigned to other nodes.

The window between which the history file is archived and the
end-of-recovery record is written cannot be zeroed, but this way it is
minimized as much as possible. The new order of actions prevents as well
a corrupted data directory on failure.

Reported-by: Magnus Hagander
Author: Heikki Linnakangas
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/CABUevEz09XY2EevA2dLjPCY-C5UO4Hq=XxmXLmF6ipNFecbShQ@mail.gmail.com
---
 src/backend/access/transam/xlog.c | 27 ++++++++++++++++++---------
 1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3695258e6f..f36d8049bb 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7523,6 +7523,24 @@ StartupXLOG(void)
 		else
 			snprintf(reason, sizeof(reason), "no recovery target specified");
 
+		/*
+		 * We are now done reading the old WAL.  Turn off archive fetching if
+		 * it was active, and make a writable copy of the last WAL segment.
+		 * (Note that we also have a copy of the last block of the old WAL in
+		 * readBuf; we will use that below.)
+		 */
+		exitArchiveRecovery(EndOfLogTLI, EndOfLog);
+
+		/*
+		 * Write the timeline history file, and have it archived. After this
+		 * point (or rather, as soon as the file is archived), the timeline
+		 * will appear as "taken" in the WAL archive and to any standby
+		 * servers. If we crash before actually switching to the new timeline,
+		 * standby servers will nevertheless think that we switched to the new
+		 * timeline, and will try to connect to the new timeline. To minimize
+		 * the window for that, try to do as little as possible between here
+		 * and writing the end-of-recovery record.
+		 */
 		writeTimeLineHistory(ThisTimeLineID, recoveryTargetTLI,
 							 EndRecPtr, reason);
 	}
@@ -7531,15 +7549,6 @@ StartupXLOG(void)
 	XLogCtl->ThisTimeLineID = ThisTimeLineID;
 	XLogCtl->PrevTimeLineID = PrevTimeLineID;
 
-	/*
-	 * We are now done reading the old WAL.  Turn off archive fetching if it
-	 * was active, and make a writable copy of the last WAL segment. (Note
-	 * that we also have a copy of the last block of the old WAL in readBuf;
-	 * we will use that below.)
-	 */
-	if (ArchiveRecoveryRequested)
-		exitArchiveRecovery(EndOfLogTLI, EndOfLog);
-
 	/*
 	 * Prepare to write WAL starting at EndOfLog location, and init xlog
 	 * buffer cache using the block containing the last record from the
-- 
2.18.0

#29

Andrew Dunstan

andrew.dunstan@2ndquadrant.com

over 7 years ago

In reply to: Michael Paquier (#28)

Re: [HACKERS] Crash on promotion when recovery.conf is renamed

On Thu, Jun 28, 2018 at 1:39 AM, Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Jan 11, 2018 at 10:35:22PM -0500, Stephen Frost wrote:

Magnus, this was your thread to begin with, though I know others have
been involved, any chance you'll be able to review this for commit
during this CF? I agree that this is certainly a good thing to have
too, though I've not looked at the patch itself in depth. Is there
anything we can do to help move it along?

As an effort to move on with bug items in the commit fest, attached are
two patches with a proposed commit message as well as polished comments
Those are proposed for a back-patched. The 2PC issue is particularly
bad in my opinion because having any 2PC file on-disk and corrupted
means that a transaction is lost. I have been playing a bit with
hexedit and changed a couple of bytes in one of them... If trying to
use a base backup which includes one, then the standby reading it would
be similarly confused.

Thanks to Michael for progressing this.

Back in August, nearly a year ago, Robert Haas said upthread:

This bug fix has been pending in "Ready for Committer" state for about
4.5 months. Three committers (Magnus, Heikki, Tom) have contributed
to the thread to date. Maybe one of them would like to commit this?

Although the state is now back to "Needs Review", I echo those
sentiments. This issue has now been hanging around for about 18
months.

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#30

Michael Paquier

michael@paquier.xyz

over 7 years ago

In reply to: Andrew Dunstan (#29)

Re: [HACKERS] Crash on promotion when recovery.conf is renamed

Robert Haas wrote:

Although the state is now back to "Needs Review", I echo those
sentiments. This issue has now been hanging around for about 18
months.

For what it's worth, I volunteer to finish the work :)

The 2PC patch is really simple, and fixes a data loss issue. The second
patch has been looked up by Heikki, Magnus and me at least once by each,
and there is visibly an agreement on having it. Having reviews after
a new patch version is sent, by somebody else than the one who sent the
patches is of course always nice..
--
Michael

#31

Andrew Dunstan

andrew.dunstan@2ndquadrant.com

over 7 years ago

In reply to: Michael Paquier (#30)

Re: [HACKERS] Crash on promotion when recovery.conf is renamed

On 07/06/2018 07:18 AM, Michael Paquier wrote:

Robert Haas wrote:

Although the state is now back to "Needs Review", I echo those
sentiments. This issue has now been hanging around for about 18
months.

No, those are my words, not Robert's :-)

For what it's worth, I volunteer to finish the work :)

The 2PC patch is really simple, and fixes a data loss issue. The second
patch has been looked up by Heikki, Magnus and me at least once by each,
and there is visibly an agreement on having it. Having reviews after
a new patch version is sent, by somebody else than the one who sent the
patches is of course always nice..

If you're comfortable committing it then go for it. It will be good to
have the CF item resolved.

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#32

Michael Paquier

michael@paquier.xyz

over 7 years ago

In reply to: Andrew Dunstan (#31)

Re: [HACKERS] Crash on promotion when recovery.conf is renamed

On Fri, Jul 06, 2018 at 08:05:50AM -0400, Andrew Dunstan wrote:

On 07/06/2018 07:18 AM, Michael Paquier wrote:

For what it's worth, I volunteer to finish the work :)

The 2PC patch is really simple, and fixes a data loss issue. The second
patch has been looked up by Heikki, Magnus and me at least once by each,
and there is visibly an agreement on having it. Having reviews after
a new patch version is sent, by somebody else than the one who sent the
patches is of course always nice..

If you're comfortable committing it then go for it. It will be good to have
the CF item resolved.

Sure. I think I can finish the set on Monday JST then.
--
Michael

#33

Michael Paquier

michael@paquier.xyz

over 7 years ago

In reply to: Michael Paquier (#32)

Re: [HACKERS] Crash on promotion when recovery.conf is renamed

On Fri, Jul 06, 2018 at 09:09:27PM +0900, Michael Paquier wrote:

Sure. I think I can finish the set on Monday JST then.

So, I have been able to back-patch things down to 9.5, but further down
I am not really convinced that it is worth meddling with that,
particularly in light of 7cbee7c which has reworked the way partial
segments on older timelines are handled at the end of promotion. The
portability issues I have found is related to the timeline number
exitArchiveRecovery uses which comes for the WAL reader hence this gets
set to the timeline from the last page read by the startup process.
This can actually cause quite a bit of trouble at the end of recovery
as we would get a failure when trying to copy the last segment from the
old timeline to the new timeline. Well, it could be possible to fix
things properly by roughly back-porting 7cbee7c down to REL9_4_STABLE
but that's not really worth the risk, and moving exitArchiveRecovery()
and PrescanPreparedTransactions() around is a straight-forward move with
9.5~ thanks to this commit.

I have also done nothing yet for the detection of corrupted 2PC files
which get ignored at recovery. While looking again at the patch I sent
upthread, the thing was actually missing some more error handling in
ReadTwoPhaseFile(). In particular, with the proposed patch we would
still not report an error if a 2PC file cannot be opened because of for
example EPERM. In FinishPreparedTransaction, one would example get a
simply a *crash* with no hints about what happened. I have a patch for
that which still needs some polishing, and I will start a new thread on
the matter.
--
Michael