PATCH: track last known XLOG segment in control file

Started by Tomas Vondra about 10 years ago; 8 messages
#1 Tomas Vondra
tomas.vondra@2ndquadrant.com
1 attachment

Hi,

this is the second improvement proposed in the thread [1] about the ext4
data loss issue. It adds another field to the control file, tracking the
last known WAL segment. This does not eliminate the data loss, just the
silent part of it when the last segment gets lost (due to a forgotten
rename, deleting it by mistake, or whatever). The patch makes sure the
cluster refuses to start if that happens.

[1]: /messages/by-id/56583BDD.9060302@2ndquadrant.com

It's a fairly simple patch, but obviously it touches a very complex part
of the code. I'll add it to the 2016-01 CF.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

xlog-last-known-segment-v1.patch (text/x-diff)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 71fc8ff..50f10a5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2222,6 +2222,16 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			use_existent = true;
 			openLogFile = XLogFileInit(openLogSegNo, &use_existent, true);
 			openLogOff = 0;
+
+			/* update the last known segment in the control file */
+			LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+			if (ControlFile->lastKnownSegment != openLogSegNo)
+			{
+				elog(WARNING, "updating segment number = %lu", openLogSegNo);
+				ControlFile->lastKnownSegment = openLogSegNo;
+				UpdateControlFile();
+			}
+			LWLockRelease(ControlFileLock);
 		}
 
 		/* Make sure we have the current logfile open */
@@ -5904,6 +5914,7 @@ StartupXLOG(void)
 	XLogPageReadPrivate private;
 	bool		fast_promoted = false;
 	struct stat st;
+	XLogSegNo	lastLogSegNo = 0;
 
 	/*
 	 * Read control file and check XLOG status looks valid.
@@ -6865,6 +6876,9 @@ StartupXLOG(void)
 				/* Remember this record as the last-applied one */
 				LastRec = ReadRecPtr;
 
+				/* Also remember the segment number */
+				XLByteToSeg(ReadRecPtr, lastLogSegNo);
+
 				/* Allow read-only connections if we're consistent now */
 				CheckRecoveryConsistency();
 
@@ -6942,6 +6956,18 @@ StartupXLOG(void)
 					RmgrTable[rmid].rm_cleanup();
 			}
 
+			/*
+			 * Check that we've actually seen all the XLOG segments, i.e. that
+			 * we've reached ControlFile->lastKnownSegment (this may fail for
+			 * example when someone deletes the last XLOG segment, or in case
+			 * of a filesystem issue).
+			 */
+			if (ControlFile->lastKnownSegment != lastLogSegNo)
+				ereport(FATAL,
+						(errmsg("not reached the last known segment (expected %lX/%lX seen %lX/%lX)",
+								(ControlFile->lastKnownSegment >> 8), (ControlFile->lastKnownSegment & 0xFF),
+								(lastLogSegNo >> 8), (lastLogSegNo & 0xFF))));
+
 			ereport(LOG,
 					(errmsg("redo done at %X/%X",
 						 (uint32) (ReadRecPtr >> 32), (uint32) ReadRecPtr)));
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 32e1d81..44dde42 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -293,6 +293,9 @@ main(int argc, char *argv[])
 		   (uint32) ControlFile.backupEndPoint);
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile.backupEndRequired ? _("yes") : _("no"));
+	printf(_("Last known segment:                   %lX/%X\n"),
+		   (uint64) (ControlFile.lastKnownSegment >> 8),
+		   (uint32) (ControlFile.lastKnownSegment & 0xFF));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile.wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index ad1eb4b..f0ba450 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -164,12 +164,18 @@ typedef struct ControlFileData
 	 * start up. If it's false, but backupStartPoint is set, a backup_label
 	 * file was found at startup but it may have been a leftover from a stray
 	 * pg_start_backup() call, not accompanied by pg_stop_backup().
+	 *
+	 * lastKnownSegment is the segment sequence number of the last known XLOG
+	 * segment. This is useful to check that the recovery actually processed
+	 * all segments allocated before the crash (serves as a protection against
+	 * accidentally deleted segments etc.)
 	 */
 	XLogRecPtr	minRecoveryPoint;
 	TimeLineID	minRecoveryPointTLI;
 	XLogRecPtr	backupStartPoint;
 	XLogRecPtr	backupEndPoint;
 	bool		backupEndRequired;
+	XLogSegNo	lastKnownSegment;
 
 	/*
 	 * Parameter settings that determine if the WAL can be used for archival
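
A note on the arithmetic in the patch above (a standalone illustration, not
part of the patch): with the default 16 MB segment size, XLByteToSeg simply
divides the LSN by the segment size, and the log/seg display in the
ereport/printf calls splits the resulting segment number into the two
components familiar from WAL file names - 256 segments per 4 GB "xlogid",
hence the ">> 8" and "& 0xFF". A minimal sketch of that mapping:

#include <stdio.h>
#include <stdint.h>

#define XLOG_SEG_SIZE (16 * 1024 * 1024)	/* default build-time value */

int main(void)
{
	uint64_t lsn = UINT64_C(0x16B3748000);	/* example LSN 16/B3748000 */
	uint64_t segno = lsn / XLOG_SEG_SIZE;	/* what XLByteToSeg computes */

	/* High part = segno / 256 (the "log" id), low part = segno % 256
	 * (the "seg" within it), matching the %lX/%lX output in the patch. */
	printf("segno = %llu -> %llX/%llX\n",
		   (unsigned long long) segno,
		   (unsigned long long) (segno >> 8),
		   (unsigned long long) (segno & 0xFF));
	return 0;
}
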
#2 Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#1)
Re: PATCH: track last known XLOG segment in control file

Hi,

On 2015-12-12 22:14:13 +0100, Tomas Vondra wrote:
> this is the second improvement proposed in the thread [1] about the ext4
> data loss issue. It adds another field to the control file, tracking the
> last known WAL segment. This does not eliminate the data loss, just the
> silent part of it when the last segment gets lost (due to a forgotten
> rename, deleting it by mistake, or whatever). The patch makes sure the
> cluster refuses to start if that happens.

Uh, that's fairly expensive. In many cases it'll significantly increase
the number of fsyncs. I've a bit of a hard time believing this'll be
worthwhile. Additionally this doesn't seem to take WAL replay into
account?

Andres


#3 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#2)
Re: PATCH: track last known XLOG segment in control file

On 12/12/2015 11:20 PM, Andres Freund wrote:

> Hi,
>
> On 2015-12-12 22:14:13 +0100, Tomas Vondra wrote:
>> this is the second improvement proposed in the thread [1] about the ext4
>> data loss issue. It adds another field to the control file, tracking the
>> last known WAL segment. This does not eliminate the data loss, just the
>> silent part of it when the last segment gets lost (due to a forgotten
>> rename, deleting it by mistake, or whatever). The patch makes sure the
>> cluster refuses to start if that happens.
>
> Uh, that's fairly expensive. In many cases it'll significantly
> increase the number of fsyncs.

It should do exactly 1 additional fsync per WAL segment. Or do you think
otherwise?

> I've a bit of a hard time believing this'll be worthwhile.

The trouble is protections like this only seem worthwhile after the
fact, when something happens. I think it's reasonable protection against
issues similar to the one I reported ~2 weeks ago. YMMV.

> Additionally this doesn't seem to take WAL replay into account?

I think the comparison in StartupXLOG needs to be less strict, to allow
cases when we actually replay more WAL segments. Is that what you mean?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#4 Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#3)
Re: PATCH: track last known XLOG segment in control file

On 2015-12-12 23:28:33 +0100, Tomas Vondra wrote:

> On 12/12/2015 11:20 PM, Andres Freund wrote:
>> On 2015-12-12 22:14:13 +0100, Tomas Vondra wrote:
>>> this is the second improvement proposed in the thread [1] about the ext4
>>> data loss issue. [...]
>>
>> Uh, that's fairly expensive. In many cases it'll significantly
>> increase the number of fsyncs.
>
> It should do exactly 1 additional fsync per WAL segment. Or do you think
> otherwise?

Which is nearly doubling the number of fsyncs, for a good number of
workloads. And it does so to a separate file, i.e. it's not like these
writes and the flushes can be combined. In workloads where pg_xlog is on
a separate partition it'll add the only source of fsyncs besides
checkpoint to the main data directory.
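
To put rough numbers on that (an illustrative calculation with assumed,
not measured, flush counts): an OLTP workload that flushes WAL on every
commit issues thousands of fsyncs per 16 MB segment, so one extra
control-file fsync per segment is noise there; but a bulk load with
synchronous_commit=off may flush only about once per segment (e.g., when
wal_buffers holds close to a full segment), and there the extra fsync
nearly doubles the count:

#include <stdio.h>

int main(void)
{
	/* Assumed WAL flushes per 16 MB segment; illustrative values only. */
	struct { const char *workload; double flushes_per_seg; } s[] = {
		{ "commit-heavy OLTP (flush per commit)", 2000.0 },
		{ "bulk load, synchronous_commit=off",       1.0 },
	};

	for (int i = 0; i < 2; i++)
		printf("%-42s +%.2f%% fsyncs\n",
			   s[i].workload, 100.0 / s[i].flushes_per_seg);
	return 0;
}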

>> I've a bit of a hard time believing this'll be worthwhile.
>
> The trouble is protections like this only seem worthwhile after the fact,
> when something happens. I think it's reasonable protection against issues
> similar to the one I reported ~2 weeks ago. YMMV.

Meh. That argument can be used to justify just about anything.

Obviously we should be more careful about fsyncing files, including the
directories. I do plan to come back to your recent patch.

>> Additionally this doesn't seem to take WAL replay into account?
>
> I think the comparison in StartupXLOG needs to be less strict, to allow
> cases when we actually replay more WAL segments. Is that what you mean?

What I mean is that the value isn't updated during recovery, afaics. You
could argue that minRecoveryPoint is that, in a way.

Greetings,

Andres Freund


#5 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#4)
Re: PATCH: track last known XLOG segment in control file

On 12/12/2015 11:39 PM, Andres Freund wrote:

> On 2015-12-12 23:28:33 +0100, Tomas Vondra wrote:
>> On 12/12/2015 11:20 PM, Andres Freund wrote:
>>> On 2015-12-12 22:14:13 +0100, Tomas Vondra wrote:
>>>> this is the second improvement proposed in the thread [1] about the
>>>> ext4 data loss issue. [...]
>>>
>>> Uh, that's fairly expensive. In many cases it'll significantly
>>> increase the number of fsyncs.
>>
>> It should do exactly 1 additional fsync per WAL segment. Or do you think
>> otherwise?
>
> Which is nearly doubling the number of fsyncs, for a good number of
> workloads. And it does so to a separate file, i.e. it's not like
> these writes and the flushes can be combined. In workloads where
> pg_xlog is on a separate partition it'll add the only source of
> fsyncs besides checkpoint to the main data directory.

I doubt it will make any difference in practice, at least on reasonable
hardware (which you should have, if fsync performance matters to you).

But some performance testing will be necessary; I don't expect this to
go in without that. It'd be helpful if you could describe the workload.

>>> I've a bit of a hard time believing this'll be worthwhile.
>>
>> The trouble is protections like this only seem worthwhile after the fact,
>> when something happens. I think it's reasonable protection against issues
>> similar to the one I reported ~2 weeks ago. YMMV.
>
> Meh. That argument can be used to justify just about anything.
>
> Obviously we should be more careful about fsyncing files, including
> the directories. I do plan to come back to your recent patch.

My argument is that this is a reasonable protection against failures in
that area - both our own mistakes (in understanding the durability
guarantees of a particular file system) and bugs on the file system side.

Maybe it's not, because the chance of running into exactly the same
issue in this part of the code is negligible.

>>> Additionally this doesn't seem to take WAL replay into account?
>>
>> I think the comparison in StartupXLOG needs to be less strict, to allow
>> cases when we actually replay more WAL segments. Is that what you mean?
>
> What I mean is that the value isn't updated during recovery, afaics.
> You could argue that minRecoveryPoint is that, in a way.

Oh, right. Will fix if we conclude that the general idea makes sense.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#6 Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#5)
Re: PATCH: track last known XLOG segment in control file

On Sun, Dec 13, 2015 at 4:24 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> On 12/12/2015 11:39 PM, Andres Freund wrote:
>> On 2015-12-12 23:28:33 +0100, Tomas Vondra wrote:
>>> On 12/12/2015 11:20 PM, Andres Freund wrote:
>>>> On 2015-12-12 22:14:13 +0100, Tomas Vondra wrote:
>>>>> this is the second improvement proposed in the thread [1] about the
>>>>> ext4 data loss issue. [...]
>>>>
>>>> Uh, that's fairly expensive. In many cases it'll significantly
>>>> increase the number of fsyncs.
>>>
>>> It should do exactly 1 additional fsync per WAL segment. Or do you think
>>> otherwise?
>>
>> Which is nearly doubling the number of fsyncs, for a good number of
>> workloads. And it does so to a separate file, i.e. it's not like
>> these writes and the flushes can be combined. In workloads where
>> pg_xlog is on a separate partition it'll add the only source of
>> fsyncs besides checkpoint to the main data directory.

I also think so.

> I doubt it will make any difference in practice, at least on reasonable
> hardware (which you should have, if fsync performance matters to you).
>
> But some performance testing will be necessary; I don't expect this to
> go in without that. It'd be helpful if you could describe the workload.

I think to start with you can try a pgbench read-write workload where
the data fits in shared_buffers.
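
For example (an illustrative invocation only; the scale and client
counts are arbitrary): initialize with "pgbench -i -s 100" (roughly
1.5 GB, so it fits in a suitably sized shared_buffers), then run
something like "pgbench -c 32 -j 32 -M prepared -T 300" against patched
and unpatched builds and compare throughput, ideally also counting
fsyncs issued against the data directory.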

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#7 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amit Kapila (#6)
Re: PATCH: track last known XLOG segment in control file

This hasn't been updated in a long time; I should have closed it
earlier. Chatting with Tomas about it, he seemed inclined to just have
the patch rejected because it's not terribly useful anyway.

I'm marking it as rejected. Unless someone has a compelling use case
for this feature that hasn't been put forward, I think we're done here.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#8 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alvaro Herrera (#7)
Re: PATCH: track last known XLOG segment in control file

Hi,

On 01/31/2016 01:25 PM, Alvaro Herrera wrote:

> This hasn't been updated in a long time; I should have closed it
> earlier. Chatting with Tomas about it, he seemed inclined to just have
> the patch rejected because it's not terribly useful anyway.

we've discussed that some time ago and my memory is a bit flaky, but I
don't think that's quite what I said. The main reason why I haven't
posted an updated version of the patch is that it seems a bit silly to
spend time on this while the underlying data loss issue is still not
fixed (it got to "ready for committer" since then).

> I'm marking it as rejected. Unless someone has a compelling use case
> for this feature that hasn't been put forward, I think we're done
> here.

I'm a bit torn - I think the protection might be useful, but there are a
few issues with this approach:

1) Getting the "can't start, WAL segments lost" message after a crash
   is a bit late. It doesn't prevent the data loss, it only makes it
   "not silent", so nothing is really "fixed". But maybe it's worth it.

2) It protects against a fairly narrow class of failures, where we lose
   the last WAL segment for some reason. Hopefully once we add the
   additional fsyncs that particular failure scenario will get fixed,
   but the question is whether we're in danger of reintroducing it
   (or a similar issue) later.

I guess we could mark it as "rejected" but that's likely to eliminate
any further feedback / discussion about the protection in general.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
