WAL replay should fdatasync() segments?

Started by Andres Freund, about 12 years ago, 4 messages, hackers
#1 Andres Freund
andres@anarazel.de

Hi,

Currently, XLogInsert(), XLogFlush() or XLogBackgroundFlush() will
write() data before fdatasync()ing it (duh, kinda obvious). But I
think, given the current recovery code, that leaves a window where we can
get into strange inconsistencies.

Consider what happens if postgres (not the OS!) crashes after writing
WAL data to the OS, but before fdatasync()ing it. Replay will happily
read that record from disk and replay it, which is fine. At the end of
recovery we will then start inserting new records, and those will be
properly fsynced to disk.

But if the *OS* crashes at that moment, we might get into the strange
situation where older records are lost because they were never
fsync()ed, while newer records and the control file persist.
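
To make the sequence of events concrete, here is a toy Python model of the
scenario. It is not PostgreSQL code: the ToyStorage class, its method names
and the file names "seg1"/"seg2" are all hypothetical, and it assumes the
unsynced older record lives in a different file than the records written
after recovery (fsync works per file, so records sharing a file would be
flushed together).

```python
class ToyStorage:
    """Toy model (hypothetical): an OS page cache in front of durable storage."""

    def __init__(self):
        self.cache = {}    # file name -> records visible to readers
        self.durable = {}  # file name -> records that survive an OS crash

    def write(self, name, record):
        # write(): the data reaches the OS cache only.
        self.cache.setdefault(name, list(self.durable.get(name, []))).append(record)

    def fsync(self, name):
        # fsync()/fdatasync(): everything cached for this file becomes durable.
        self.durable[name] = list(self.cache.get(name, []))

    def read(self, name):
        # Reads are served from the cache, so unsynced data is visible.
        return list(self.cache.get(name, self.durable.get(name, [])))

    def process_crash(self):
        # The server process dies; the OS cache is untouched.
        pass

    def os_crash(self):
        # The OS dies; only durable data remains.
        self.cache = {name: list(recs) for name, recs in self.durable.items()}


disk = ToyStorage()
disk.write("seg1", "A")        # older record: written, never synced
disk.process_crash()           # postgres crashes, the OS keeps running
replayed = disk.read("seg1")   # recovery still sees and replays "A"
disk.write("seg2", "B")        # new record inserted after recovery
disk.fsync("seg2")             # ...and properly synced
disk.os_crash()                # now the OS crashes
after = {name: disk.read(name) for name in ("seg1", "seg2")}
```

In this model the older record "A" is gone after the OS crash while the
newer record "B" survives, which is the inconsistency described above.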

I think for a primary that window is relatively small, but I think it's
a good bit bigger for a standby, especially if it's promoted.

I think the correct way to handle this would be to fsync() segments we
read from pg_xlog/ during recovery.
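
As a rough illustration of that idea (a sketch only, not how PostgreSQL's
recovery code is written), the reader can flush a segment file itself before
trusting its contents. The function name read_segment_synced and the demo
file name are assumptions made for this example, and os.fsync stands in for
fdatasync(). The file is opened read-write because some platforms refuse to
fsync a descriptor that is not open for writing.

```python
import os
import tempfile


def read_segment_synced(path):
    """Flush a segment file to stable storage, then return its contents."""
    fd = os.open(path, os.O_RDWR)
    try:
        # Force any data still sitting in the OS cache for this file to disk
        # before the caller acts on it.
        os.fsync(fd)
        chunks = []
        while True:
            chunk = os.read(fd, 8192)
            if not chunk:
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        os.close(fd)


# Demo: a "segment" that some earlier writer wrote but never synced.
with tempfile.TemporaryDirectory() as tmpdir:
    segment_path = os.path.join(tmpdir, "000000010000000000000001")
    with open(segment_path, "wb") as f:
        f.write(b"wal record bytes")  # write only, no fsync by the writer
    demo_contents = read_segment_synced(segment_path)
```

Once read_segment_synced() returns, the bytes that replay is about to act on
have also been requested to be durable, so a later OS crash should not lose
them while newer records persist.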

Am I missing something?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2 Fujii Masao
masao.fujii@gmail.com
In reply to: Andres Freund (#1)
Re: WAL replay should fdatasync() segments?

On Thu, Jan 23, 2014 at 1:21 AM, Andres Freund <andres@2ndquadrant.com> wrote:

> I think for a primary that window is relatively small, but I think it's
> a good bit bigger for a standby, especially if it's promoted.

In the normal streaming replication case, ISTM that window is not bigger for
the standby, because the standby basically replays only WAL data that
walreceiver has already fsync'd to disk. But if it replays a WAL file that
was fetched from the archive, that file might not have been flushed to disk
yet. In that case, the window might become bigger...

Regards,

--
Fujii Masao


#3 Andres Freund
andres@anarazel.de
In reply to: Fujii Masao (#2)
Re: WAL replay should fdatasync() segments?

On 2014-01-23 02:05:48 +0900, Fujii Masao wrote:

> In the normal streaming replication case, ISTM that window is not bigger for
> the standby, because the standby basically replays only WAL data that
> walreceiver has already fsync'd to disk. But if it replays a WAL file that
> was fetched from the archive, that file might not have been flushed to disk
> yet. In that case, the window might become bigger...

Yeah, but if the walreceiver receives data and crashes/disconnects before
fsync(), we'll read it from pg_xlog, right? And if we promote, we'll
start inserting new records before establishing a new checkpoint.

Greetings,

Andres Freund



#4 Fujii Masao
masao.fujii@gmail.com
In reply to: Andres Freund (#3)
Re: WAL replay should fdatasync() segments?

On Thu, Jan 23, 2014 at 2:08 AM, Andres Freund <andres@2ndquadrant.com> wrote:

> Yeah, but if the walreceiver receives data and crashes/disconnects before
> fsync(), we'll read it from pg_xlog, right? And if we promote, we'll
> start inserting new records before establishing a new checkpoint.

Yeah, true. Such an unflushed WAL file can be read by a subsequent recovery...

Regards,

--
Fujii Masao
