race condition when writing pg_control

Started by Nathan Bossart · 31 messages · pgsql-hackers
#1 Nathan Bossart
nathandbossart@gmail.com

Hi hackers,

I believe I've discovered a race condition between the startup and
checkpointer processes that can cause a CRC mismatch in the pg_control
file. If a cluster crashes at the right time, the following error
appears when you attempt to restart it:

FATAL: incorrect checksum in control file

This appears to be caused by some code paths in xlog_redo() that
update ControlFile without taking the ControlFileLock. The attached
patch seems to be sufficient to prevent the CRC mismatch in the
control file, but perhaps this is a symptom of a bigger problem with
concurrent modifications of ControlFile->checkPointCopy.nextFullXid.
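The window can be modeled in miniature: if one process mutates a checksummed struct between another process's CRC computation and its write, the image that reaches disk fails verification on restart. The sketch below is a toy model under that assumption, with a made-up checksum and struct, not PostgreSQL's actual pg_crc32c code or ControlFileData layout:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for pg_control: one checksummed field plus the CRC itself. */
typedef struct ToyControlFile
{
    uint64_t    next_full_xid;  /* the field xlog_redo() updates */
    uint32_t    crc;            /* covers everything before this member */
} ToyControlFile;

/* Toy mixing function (NOT pg_crc32c); any checksum exhibits the race. */
static uint32_t
toy_crc(const ToyControlFile *cf)
{
    const uint8_t *p = (const uint8_t *) cf;
    uint32_t    crc = 0;

    for (size_t i = 0; i < offsetof(ToyControlFile, crc); i++)
        crc = (crc << 1) ^ p[i] ^ (crc >> 31);
    return crc;
}

/*
 * Simulate the checkpointer: stamp the CRC, then copy the struct out as the
 * on-disk image.  If "startup" mutates the shared copy between the CRC stamp
 * and the copy (interleaved_update), the written image is inconsistent.
 */
static ToyControlFile
write_control_file(ToyControlFile *shared, int interleaved_update)
{
    ToyControlFile on_disk;

    shared->crc = toy_crc(shared);              /* checkpointer computes CRC */
    if (interleaved_update)
        shared->next_full_xid++;                /* startup updates, no lock! */
    memcpy(&on_disk, shared, sizeof(on_disk));  /* checkpointer writes */
    return on_disk;
}

/* What the server does at restart: recompute and compare. */
static int
control_file_valid(const ToyControlFile *cf)
{
    return toy_crc(cf) == cf->crc;
}
```

With no interleaving the image verifies; with the unlocked update in the window, verification fails, which is the toy analogue of "incorrect checksum in control file".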

Nathan

Attachments:

v1-0001-Prevent-race-condition-when-writing-pg_control.patch (application/octet-stream, +4 −1)
#2 Thomas Munro
thomas.munro@gmail.com
In reply to: Nathan Bossart (#1)
Re: race condition when writing pg_control

On Tue, May 5, 2020 at 5:53 AM Bossart, Nathan <bossartn@amazon.com> wrote:

I believe I've discovered a race condition between the startup and
checkpointer processes that can cause a CRC mismatch in the pg_control
file. If a cluster crashes at the right time, the following error
appears when you attempt to restart it:

FATAL: incorrect checksum in control file

This appears to be caused by some code paths in xlog_redo() that
update ControlFile without taking the ControlFileLock. The attached
patch seems to be sufficient to prevent the CRC mismatch in the
control file, but perhaps this is a symptom of a bigger problem with
concurrent modifications of ControlFile->checkPointCopy.nextFullXid.

This does indeed look pretty dodgy. CreateRestartPoint() running in
the checkpointer does UpdateControlFile() to compute a checksum and
write it out, but xlog_redo() processing
XLOG_CHECKPOINT_{ONLINE,SHUTDOWN} modifies that data without
interlocking. It looks like the ancestors of that line were there
since 35af5422f64 (2006), but back then RecoveryRestartPoint() ran
UpdateControlFile() directly in the startup process (immediately after
that update), so no interlocking problem. Then in cdd46c76548 (2009),
RecoveryRestartPoint() was split up so that CreateRestartPoint() ran
in another process.

#3 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#2)
Re: race condition when writing pg_control

On Tue, May 5, 2020 at 9:51 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Tue, May 5, 2020 at 5:53 AM Bossart, Nathan <bossartn@amazon.com> wrote:

I believe I've discovered a race condition between the startup and
checkpointer processes that can cause a CRC mismatch in the pg_control
file. If a cluster crashes at the right time, the following error
appears when you attempt to restart it:

FATAL: incorrect checksum in control file

This appears to be caused by some code paths in xlog_redo() that
update ControlFile without taking the ControlFileLock. The attached
patch seems to be sufficient to prevent the CRC mismatch in the
control file, but perhaps this is a symptom of a bigger problem with
concurrent modifications of ControlFile->checkPointCopy.nextFullXid.

This does indeed look pretty dodgy. CreateRestartPoint() running in
the checkpointer does UpdateControlFile() to compute a checksum and
write it out, but xlog_redo() processing
XLOG_CHECKPOINT_{ONLINE,SHUTDOWN} modifies that data without
interlocking. It looks like the ancestors of that line were there
since 35af5422f64 (2006), but back then RecoveryRestartPoint() ran
UpdateControlFile() directly in the startup process (immediately after
that update), so no interlocking problem. Then in cdd46c76548 (2009),
RecoveryRestartPoint() was split up so that CreateRestartPoint() ran
in another process.

Here's a version with a commit message added. I'll push this to all
releases in a day or two if there are no objections.

Attachments:

0001-Fix-race-condition-that-could-corrupt-pg_control.patch (text/x-patch, +4 −1)
#4 Fujii Masao
masao.fujii@gmail.com
In reply to: Thomas Munro (#3)
Re: race condition when writing pg_control

On 2020/05/22 13:51, Thomas Munro wrote:

On Tue, May 5, 2020 at 9:51 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Tue, May 5, 2020 at 5:53 AM Bossart, Nathan <bossartn@amazon.com> wrote:

I believe I've discovered a race condition between the startup and
checkpointer processes that can cause a CRC mismatch in the pg_control
file. If a cluster crashes at the right time, the following error
appears when you attempt to restart it:

FATAL: incorrect checksum in control file

This appears to be caused by some code paths in xlog_redo() that
update ControlFile without taking the ControlFileLock. The attached
patch seems to be sufficient to prevent the CRC mismatch in the
control file, but perhaps this is a symptom of a bigger problem with
concurrent modifications of ControlFile->checkPointCopy.nextFullXid.

This does indeed look pretty dodgy. CreateRestartPoint() running in
the checkpointer does UpdateControlFile() to compute a checksum and
write it out, but xlog_redo() processing
XLOG_CHECKPOINT_{ONLINE,SHUTDOWN} modifies that data without
interlocking. It looks like the ancestors of that line were there
since 35af5422f64 (2006), but back then RecoveryRestartPoint() ran
UpdateControlFile() directly in the startup process (immediately after
that update), so no interlocking problem. Then in cdd46c76548 (2009),
RecoveryRestartPoint() was split up so that CreateRestartPoint() ran
in another process.

Here's a version with a commit message added. I'll push this to all
releases in a day or two if there are no objections.

+1 to push the patch.

Per my quick check, XLogReportParameters() seems to have the similar issue,
i.e., it updates the control file without taking ControlFileLock.
Maybe we should fix this at the same time?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#5 Michael Paquier
michael@paquier.xyz
In reply to: Fujii Masao (#4)
Re: race condition when writing pg_control

On Sat, May 23, 2020 at 01:00:17AM +0900, Fujii Masao wrote:

Per my quick check, XLogReportParameters() seems to have the similar issue,
i.e., it updates the control file without taking ControlFileLock.
Maybe we should fix this at the same time?

Yeah. It also checks the control file values, implying that we should
have LW_SHARED taken at least at the beginning, but this lock cannot
be upgraded, so we need LW_EXCLUSIVE the whole time. I am wondering if we
should check with an assert if ControlFileLock is taken when going
through UpdateControlFile(). We have one code path at the beginning
of redo where we don't need a lock close to the backup_label file
checks, but we could just pass down a boolean flag to the routine to
handle that case. Another good thing in having an assert is that any
new caller of UpdateControlFile() would need to think about the need
of a lock.
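The assertion being proposed can be modeled with a toy lock; in PostgreSQL itself the check could presumably be expressed with something like LWLockHeldByMeInMode(ControlFileLock, LW_EXCLUSIVE). All names below are hypothetical stand-ins, not PostgreSQL code:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the proposed guard. */
typedef struct ToyLWLock
{
    bool        held_exclusive;
} ToyLWLock;

static ToyLWLock control_file_lock;
static int  control_file_writes;

static void
toy_acquire_exclusive(ToyLWLock *lock)
{
    lock->held_exclusive = true;
}

static void
toy_release(ToyLWLock *lock)
{
    lock->held_exclusive = false;
}

/*
 * Models UpdateControlFile() with the proposed assertion: any caller that
 * has not taken the lock exclusively trips the assert in a debug build,
 * so every new call site is forced to think about locking.
 */
static void
toy_update_control_file(void)
{
    assert(control_file_lock.held_exclusive);
    control_file_writes++;
}
```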
--
Michael

#6 Nathan Bossart
nathandbossart@gmail.com
In reply to: Thomas Munro (#3)
Re: race condition when writing pg_control

On 5/21/20, 9:52 PM, "Thomas Munro" <thomas.munro@gmail.com> wrote:

Here's a version with a commit message added. I'll push this to all
releases in a day or two if there are no objections.

Looks good to me. Thanks!

Nathan

#7 Nathan Bossart
nathandbossart@gmail.com
In reply to: Michael Paquier (#5)
Re: race condition when writing pg_control

On 5/22/20, 10:40 PM, "Michael Paquier" <michael@paquier.xyz> wrote:

On Sat, May 23, 2020 at 01:00:17AM +0900, Fujii Masao wrote:

Per my quick check, XLogReportParameters() seems to have the similar issue,
i.e., it updates the control file without taking ControlFileLock.
Maybe we should fix this at the same time?

Yeah. It also checks the control file values, implying that we should
have LW_SHARED taken at least at the beginning, but this lock cannot
be upgraded, so we need LW_EXCLUSIVE the whole time. I am wondering if we
should check with an assert if ControlFileLock is taken when going
through UpdateControlFile(). We have one code path at the beginning
of redo where we don't need a lock close to the backup_label file
checks, but we could just pass down a boolean flag to the routine to
handle that case. Another good thing in having an assert is that any
new caller of UpdateControlFile() would need to think about the need
of a lock.

While an assertion in UpdateControlFile() would not have helped us
catch the problem I initially reported, it does seem worthwhile to add
it. I have attached a patch that adds this assertion and also
attempts to fix XLogReportParameters(). Since there is only one place
where we feel it is safe to call UpdateControlFile() without a lock, I
just changed it to take the lock. I don't think this adds any sort of
significant contention risk, and IMO it is a bit cleaner than the
boolean flag.

For the XLogReportParameters() fix, I simply added an exclusive lock
acquisition for the portion that updates the values in shared memory
and calls UpdateControlFile(). IIUC the first part of this function
that accesses several ControlFile values should be safe, as none of
them can be updated after server start.
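The shape of that fix can be sketched with stubbed-out primitives. LWLockAcquire/LWLockRelease, ControlFileLock, LW_EXCLUSIVE, and UpdateControlFile are PostgreSQL's real names; the stub types and the particular fields shown are simplified stand-ins for illustration, not the actual xlog.c code:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for PostgreSQL's lock machinery and shared state. */
typedef enum { LW_SHARED, LW_EXCLUSIVE } LWLockMode;
typedef struct { bool held; } LWLock;

static LWLock ControlFileLockData;
static LWLock *ControlFileLock = &ControlFileLockData;

static struct
{
    int         MaxConnections;
    int         max_worker_processes;
    bool        wal_log_hints;
} ControlFileData_, *ControlFile = &ControlFileData_;

static int  control_file_updates;

static void
LWLockAcquire(LWLock *lock, LWLockMode mode)
{
    (void) mode;
    lock->held = true;
}

static void
LWLockRelease(LWLock *lock)
{
    lock->held = false;
}

static void
UpdateControlFile(void)
{
    assert(ControlFileLock->held);  /* the assertion 0003 proposes */
    control_file_updates++;
}

/*
 * Sketch of the fixed tail of XLogReportParameters(): the shared-memory
 * updates and the UpdateControlFile() call now happen under an exclusive
 * ControlFileLock, closing the window the checkpointer could race through.
 */
static void
XLogReportParameters_sketch(int max_conns, int max_workers, bool log_hints)
{
    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
    ControlFile->MaxConnections = max_conns;
    ControlFile->max_worker_processes = max_workers;
    ControlFile->wal_log_hints = log_hints;
    UpdateControlFile();
    LWLockRelease(ControlFileLock);
}
```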

Nathan

Attachments:

v1-0001-Assert-that-ControlFileLock-is-held-exclusively-i.patch (application/octet-stream, +12 −2)
#8 Michael Paquier
michael@paquier.xyz
In reply to: Nathan Bossart (#7)
Re: race condition when writing pg_control

On Tue, May 26, 2020 at 07:30:54PM +0000, Bossart, Nathan wrote:

While an assertion in UpdateControlFile() would not have helped us
catch the problem I initially reported, it does seem worthwhile to add
it. I have attached a patch that adds this assertion and also
attempts to fix XLogReportParameters(). Since there is only one place
where we feel it is safe to call UpdateControlFile() without a lock, I
just changed it to take the lock. I don't think this adds any sort of
significant contention risk, and IMO it is a bit cleaner than the
boolean flag.

Let's see what Fujii-san and Thomas think about that. I'd rather
avoid taking a lock here because we don't need it and because it makes
things IMO confusing with the beginning of StartupXLOG() where a lot
of the fields are read, even if we go without this extra assertion.

For the XLogReportParameters() fix, I simply added an exclusive lock
acquisition for the portion that updates the values in shared memory
and calls UpdateControlFile(). IIUC the first part of this function
that accesses several ControlFile values should be safe, as none of
them can be updated after server start.

They can get updated when replaying a XLOG_PARAMETER_CHANGE record.
But you are right as all of this happens in the startup process, so
your patch looks right to me here.
--
Michael

#9 Fujii Masao
masao.fujii@gmail.com
In reply to: Michael Paquier (#8)
Re: race condition when writing pg_control

On 2020/05/27 16:10, Michael Paquier wrote:

On Tue, May 26, 2020 at 07:30:54PM +0000, Bossart, Nathan wrote:

While an assertion in UpdateControlFile() would not have helped us
catch the problem I initially reported, it does seem worthwhile to add
it. I have attached a patch that adds this assertion and also
attempts to fix XLogReportParameters(). Since there is only one place
where we feel it is safe to call UpdateControlFile() without a lock, I
just changed it to take the lock. I don't think this adds any sort of
significant contention risk, and IMO it is a bit cleaner than the
boolean flag.

Let's see what Fujii-san and Thomas think about that. I'd rather
avoid taking a lock here because we don't need it and because it makes
things IMO confusing with the beginning of StartupXLOG() where a lot
of the fields are read, even if we go without this extra assertion.

I have no strong opinion about this, but I tend to agree with Michael here.

For the XLogReportParameters() fix, I simply added an exclusive lock
acquisition for the portion that updates the values in shared memory
and calls UpdateControlFile(). IIUC the first part of this function
that accesses several ControlFile values should be safe, as none of
them can be updated after server start.

They can get updated when replaying a XLOG_PARAMETER_CHANGE record.
But you are right as all of this happens in the startup process, so
your patch looks right to me here.

LGTM.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#10 Nathan Bossart
nathandbossart@gmail.com
In reply to: Fujii Masao (#9)
Re: race condition when writing pg_control

On 5/29/20, 12:24 AM, "Fujii Masao" <masao.fujii@oss.nttdata.com> wrote:

On 2020/05/27 16:10, Michael Paquier wrote:

On Tue, May 26, 2020 at 07:30:54PM +0000, Bossart, Nathan wrote:

While an assertion in UpdateControlFile() would not have helped us
catch the problem I initially reported, it does seem worthwhile to add
it. I have attached a patch that adds this assertion and also
attempts to fix XLogReportParameters(). Since there is only one place
where we feel it is safe to call UpdateControlFile() without a lock, I
just changed it to take the lock. I don't think this adds any sort of
significant contention risk, and IMO it is a bit cleaner than the
boolean flag.

Let's see what Fujii-san and Thomas think about that. I'd rather
avoid taking a lock here because we don't need it and because it makes
things IMO confusing with the beginning of StartupXLOG() where a lot
of the fields are read, even if we go without this extra assertion.

I have no strong opinion about this, but I tend to agree with Michael here.

For the XLogReportParameters() fix, I simply added an exclusive lock
acquisition for the portion that updates the values in shared memory
and calls UpdateControlFile(). IIUC the first part of this function
that accesses several ControlFile values should be safe, as none of
them can be updated after server start.

They can get updated when replaying a XLOG_PARAMETER_CHANGE record.
But you are right as all of this happens in the startup process, so
your patch looks right to me here.

LGTM.

Thanks for the feedback. I've attached a new set of patches.

Nathan

Attachments:

0003-Assert-that-ControlFileLock-is-held-within-UpdateCon.patch (application/octet-stream, +21 −16)
0002-Acquire-ControlFileLock-within-XLogReportParameters.patch (application/octet-stream, +4 −1)
0001-Fix-race-condition-that-could-corrupt-pg_control.patch (application/octet-stream, +4 −1)
#11 Michael Paquier
michael@paquier.xyz
In reply to: Nathan Bossart (#10)
Re: race condition when writing pg_control

On Sun, May 31, 2020 at 09:11:35PM +0000, Bossart, Nathan wrote:

Thanks for the feedback. I've attached a new set of patches.

Thanks for splitting the set. 0001 and 0002 are the minimum set for
back-patching, and it would be better to merge them together. 0003 is
debatable and not an actual bug fix, so I would refrain from doing a
backpatch. It does not seem that there is a strong consensus in favor
of 0003 either.

Thomas, are you planning to look at this patch set?
--
Michael

#12 Thomas Munro
thomas.munro@gmail.com
In reply to: Michael Paquier (#11)
Re: race condition when writing pg_control

On Tue, Jun 2, 2020 at 5:24 PM Michael Paquier <michael@paquier.xyz> wrote:

On Sun, May 31, 2020 at 09:11:35PM +0000, Bossart, Nathan wrote:

Thanks for the feedback. I've attached a new set of patches.

Thanks for splitting the set. 0001 and 0002 are the minimum set for
back-patching, and it would be better to merge them together. 0003 is
debatable and not an actual bug fix, so I would refrain from doing a
backpatch. It does not seem that there is a strong consensus in favor
of 0003 either.

Thomas, are you planning to look at this patch set?

Sorry for my radio silence, I got tangled up with a couple of
conferences. I'm planning to look at 0001 and 0002 shortly.

#13 Michael Paquier
michael@paquier.xyz
In reply to: Thomas Munro (#12)
Re: race condition when writing pg_control

On Wed, Jun 03, 2020 at 10:56:13AM +1200, Thomas Munro wrote:

Sorry for my radio silence, I got tangled up with a couple of
conferences. I'm planning to look at 0001 and 0002 shortly.

Thanks!
--
Michael

#14 Thomas Munro
thomas.munro@gmail.com
In reply to: Michael Paquier (#13)
Re: race condition when writing pg_control

On Wed, Jun 3, 2020 at 2:03 PM Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Jun 03, 2020 at 10:56:13AM +1200, Thomas Munro wrote:

Sorry for my radio silence, I got tangled up with a couple of
conferences. I'm planning to look at 0001 and 0002 shortly.

Thanks!

I pushed 0001 and 0002, squashed into one commit. I'm not sure about
0003. If we're going to do that, wouldn't it be better to just
acquire the lock in that one extra place in StartupXLOG(), rather than
introducing the extra parameter?

#15 Nathan Bossart
nathandbossart@gmail.com
In reply to: Thomas Munro (#14)
Re: race condition when writing pg_control

On 6/7/20, 7:50 PM, "Thomas Munro" <thomas.munro@gmail.com> wrote:

I pushed 0001 and 0002, squashed into one commit. I'm not sure about
0003. If we're going to do that, wouldn't it be better to just
acquire the lock in that one extra place in StartupXLOG(), rather than
introducing the extra parameter?

Thanks! The approach for 0003 was discussed a bit upthread [0]. I do
not have a strong opinion, but I lean towards just acquiring the lock.

Nathan

[0]: /messages/by-id/20200527071053.GD103662@paquier.xyz

#16 Michael Paquier
michael@paquier.xyz
In reply to: Nathan Bossart (#15)
Re: race condition when writing pg_control

On Mon, Jun 08, 2020 at 03:25:31AM +0000, Bossart, Nathan wrote:

On 6/7/20, 7:50 PM, "Thomas Munro" <thomas.munro@gmail.com> wrote:

I pushed 0001 and 0002, squashed into one commit. I'm not sure about
0003. If we're going to do that, wouldn't it be better to just
acquire the lock in that one extra place in StartupXLOG(), rather than
introducing the extra parameter?

Thanks! The approach for 0003 was discussed a bit upthread [0]. I do
not have a strong opinion, but I lean towards just acquiring the lock.

Fujii-san has provided an answer upthread that can maybe be translated
as a +0.3~0.4:
/messages/by-id/fc796148-7d63-47bb-e91d-e09b62a502e9@oss.nttdata.com

FWIW, I'd rather not take the lock as that's not necessary and just
add the parameter if I were to do it. Now I would be fine as well to
just take the lock if you decide that's more simple, as long as we add
this new assertion as a safety net for future changes.
--
Michael

#17 Amul Sul
sulamul@gmail.com
In reply to: Fujii Masao (#9)
Re: race condition when writing pg_control

On Fri, May 29, 2020 at 12:54 PM Fujii Masao <masao.fujii@oss.nttdata.com>
wrote:

On 2020/05/27 16:10, Michael Paquier wrote:

On Tue, May 26, 2020 at 07:30:54PM +0000, Bossart, Nathan wrote:

While an assertion in UpdateControlFile() would not have helped us
catch the problem I initially reported, it does seem worthwhile to add
it. I have attached a patch that adds this assertion and also
attempts to fix XLogReportParameters(). Since there is only one place
where we feel it is safe to call UpdateControlFile() without a lock, I
just changed it to take the lock. I don't think this adds any sort of
significant contention risk, and IMO it is a bit cleaner than the
boolean flag.

Let's see what Fujii-san and Thomas think about that. I'd rather
avoid taking a lock here because we don't need it and because it makes
things IMO confusing with the beginning of StartupXLOG() where a lot
of the fields are read, even if we go without this extra assertion.

I have no strong opinion about this, but I tend to agree with Michael here.

I don't have a strong opinion about this either, but I like Nathan's
approach more: just take the lock in the startup process as well, for
simplicity, if that doesn't hurt much. Every process apart from the
startup process already has to take the lock to update the control file,
so having separate treatment for the startup process looks confusing to
me.

Regards,
Amul

#18 Melanie Plageman
melanieplageman@gmail.com
In reply to: Thomas Munro (#14)
Re: race condition when writing pg_control

On Sun, Jun 7, 2020 at 10:49 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Wed, Jun 3, 2020 at 2:03 PM Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Jun 03, 2020 at 10:56:13AM +1200, Thomas Munro wrote:

Sorry for my radio silence, I got tangled up with a couple of
conferences. I'm planning to look at 0001 and 0002 shortly.

Thanks!

I pushed 0001 and 0002, squashed into one commit. I'm not sure about
0003. If we're going to do that, wouldn't it be better to just
acquire the lock in that one extra place in StartupXLOG(), rather than
introducing the extra parameter?

Today, after committing a3e6c6f, I saw recovery/018_wal_optimize.pl
fail and saw this message in the replica log [2].

2024-05-16 15:12:22.821 GMT [5440][not initialized] FATAL: incorrect
checksum in control file

I'm pretty sure it's not related to my commit. So, I was looking for
existing reports of this error message.

It's a long shot, since 0001 and 0002 were already pushed, but this is
the only recent report I could find of "FATAL: incorrect checksum in
control file" in pgsql-hackers or bugs archives.

I do see this thread from 2016 [3] which might be relevant because the
reported bug was also on Windows.

- Melanie

[1]: https://cirrus-ci.com/task/4626725689098240
[2]: https://api.cirrus-ci.com/v1/artifact/task/4626725689098240/testrun/build/testrun/recovery/018_wal_optimize/log/018_wal_optimize_node_replica.log
[3]: /messages/by-id/CAEepm=0hh_Dvd2Q+fcjYpkVzSoNX2+f167cYu5nwu=qh5HZhJw@mail.gmail.com

#19 Nathan Bossart
nathandbossart@gmail.com
In reply to: Melanie Plageman (#18)
Re: race condition when writing pg_control

On Thu, May 16, 2024 at 12:19:22PM -0400, Melanie Plageman wrote:

Today, after committing a3e6c6f, I saw recovery/018_wal_optimize.pl
fail and saw this message in the replica log [2].

2024-05-16 15:12:22.821 GMT [5440][not initialized] FATAL: incorrect
checksum in control file

I'm pretty sure it's not related to my commit. So, I was looking for
existing reports of this error message.

Yeah, I don't see how it could be related.

It's a long shot, since 0001 and 0002 were already pushed, but this is
the only recent report I could find of "FATAL: incorrect checksum in
control file" in pgsql-hackers or bugs archives.

I do see this thread from 2016 [3] which might be relevant because the
reported bug was also on Windows.

I suspect it will be difficult to investigate this one much further
unless we can track down a copy of the control file with the bad checksum.
Other than searching for any new code that isn't doing the appropriate
locking, maybe we could search the buildfarm for any other occurrences. I
also see some threads concerning whether the way we are reading/writing
the control file is atomic.
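On the atomicity question: the usual assumption is that a write of one disk sector (512 bytes) is all-or-nothing, so keeping the checksummed control-file payload within a single sector is what rules out a torn write in the normal case. PostgreSQL later grew a compile-time guard in this spirit (PG_CONTROL_MAX_SAFE_SIZE); the struct below is a simplified stand-in, not the real ControlFileData:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Conventional disks promise atomic writes at sector granularity, so the
 * checksummed payload must fit in one 512-byte sector for a single write
 * to be all-or-nothing.
 */
#define SECTOR_SIZE 512

/* Simplified stand-in for ControlFileData; the real struct has far more
 * fields but is still kept under this limit. */
typedef struct ToyControlFileData
{
    uint64_t    system_identifier;
    uint32_t    pg_control_version;
    uint32_t    catalog_version_no;
    uint32_t    crc;
} ToyControlFileData;

/* Compile-time guard, analogous in spirit to PostgreSQL's static assert
 * that ControlFileData fits within the safe sector size. */
_Static_assert(sizeof(ToyControlFileData) <= SECTOR_SIZE,
               "control file payload must fit in a single sector");
```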

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#20 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#19)
Re: race condition when writing pg_control

Nathan Bossart <nathandbossart@gmail.com> writes:

I suspect it will be difficult to investigate this one too much further
unless we can track down a copy of the control file with the bad checksum.
Other than searching for any new code that isn't doing the appropriate
locking, maybe we could search the buildfarm for any other occurrences. I
also seem some threads concerning whether the way we are reading/writing
the control file is atomic.

The intention was certainly always that it be atomic. If it isn't
we have got *big* trouble.

regards, tom lane

#21 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#20)
#22 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#21)
#23 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#22)
#24 Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#23)
#25 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#24)
#26 Noah Misch
noah@leadboat.com
In reply to: Thomas Munro (#25)
#27 Thomas Munro
thomas.munro@gmail.com
In reply to: Noah Misch (#26)
#28 Alexander Lakhin
exclusion@gmail.com
In reply to: Thomas Munro (#27)
#29 Noah Misch
noah@leadboat.com
In reply to: Thomas Munro (#27)
#30 vignesh C
vignesh21@gmail.com
In reply to: Noah Misch (#29)
#31 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Thomas Munro (#25)