"could not open file "pg_wal/…": No such file or directory" potential crashing bug due to race condition between restartpoint and recovery

Started by Thomas Crayfordover 7 years ago6 messagesbugs

tcrayford@salesforce.com

over 7 years ago

Hi there,

We've found (what seems like) a race condition in a replica server which is
receiving WAL from the archive. We are not 100% sure if it happens on
streaming replicas as well.
Specifically, we occasionally see an error like this:

May 20 09:56:14 redacted[141]: [191806-1] sql_error_code = 58P01 ERROR:
could not open file "pg_wal/00000002000072B50000003A": No such file or
directory

Upon looking at logs around this, we can see that the recovery process had
just restored that file:

May 20 09:56:14 redacted[9]: [2468859-1] sql_error_code = 00000 LOG:
restored log file "00000002000072B50000003A" from archive
May 20 09:56:14 ip-10-0-92-26 redacted[141]: [191806-1] sql_error_code =
58P01 ERROR: could not open file "pg_wal/00000002000072B50000003A": No such
file or directory

(on this particular server, pid 9 is the recovery process, and pid 141 is
the checkpointer)

We've seen this across a range of server versions, including (in the last
30 days):

9.5.12
9.5.14
9.6.10
9.6.6
9.6.7
9.6.8
9.6.9
10.2
10.3
10.4
10.5

Upon digging into the code, we found (we think) the reason behind this
race. The recovery process can call unlink() on the file, whilst the
restartpoint attempts to recycle the segment. To quote from the docs:

In archive recovery or standby mode, the server periodically performs
*restartpoints*, which are similar to checkpoints in normal operation: the
server forces all its state to disk, updates the pg_control file to
indicate that the already-processed WAL data need not be scanned again, and
then recycles any old log segment files in the pg_wal directory.

Looking at the code, I think that the two racing functions are
RestoreArchivedFile, and CreateRestartPoint.

The former calls unlink on the wal segment, CreateRestartPoint does attempt
to do recycling on segments.

From our perspective this seems like a race condition - the recovery
process unlinks the file as the checkpointer tries to recycle the wal
segment. At this point the checkpointer panics and shuts down, causing a
hit to availability.

As far as we have seen, starting the process up after the crash allows the
server to continue making progress.

This seems like an extremely rare race - our data show it occurs once every
few hundred thousand wal restores from archive that Heroku Postgres does.

Thanks

Tom Crayford
Heroku Postgres

Michael Paquier

michael@paquier.xyz

over 7 years ago

In reply to: Thomas Crayford (#1)

Re: "could not open file "pg_wal/…": No such file or directory" potential crashing bug due to race condition between restartpoint and recovery

On Mon, Sep 24, 2018 at 12:58:59PM +0100, Thomas Crayford wrote:

May 20 09:56:14 redacted[9]: [2468859-1] sql_error_code = 00000 LOG:
restored log file "00000002000072B50000003A" from archive
May 20 09:56:14 ip-10-0-92-26 redacted[141]: [191806-1] sql_error_code =
58P01 ERROR: could not open file "pg_wal/00000002000072B50000003A": No such
file or directory

What kind of restore_command is used here?

Looking at the code, I think that the two racing functions are
RestoreArchivedFile, and CreateRestartPoint.

The former calls unlink on the wal segment, CreateRestartPoint does attempt
to do recycling on segments.

Don't you mean KeepFileRestoredFromArchive()? RestoreArchivedFile would
call unlink() on pg_wal/RECOVERYXLOG so that does not match.
--
Michael

Thomas Crayford

tcrayford@salesforce.com

over 7 years ago

In reply to: Michael Paquier (#2)

Re: "could not open file "pg_wal/…": No such file or directory" potential crashing bug due to race condition between restartpoint and recovery

Hi there,

Ok, thanks for the pointer. It seems like the race condition I talked about
is still accurate, does that seem right?

Thanks

Tom

On Mon, Sep 24, 2018 at 4:37 PM Michael Paquier <michael@paquier.xyz> wrote:

Show quoted text

On Mon, Sep 24, 2018 at 12:58:59PM +0100, Thomas Crayford wrote:

May 20 09:56:14 redacted[9]: [2468859-1] sql_error_code = 00000 LOG:
restored log file "00000002000072B50000003A" from archive
May 20 09:56:14 ip-10-0-92-26 redacted[141]: [191806-1] sql_error_code =
58P01 ERROR: could not open file "pg_wal/00000002000072B50000003A": No

such

file or directory

What kind of restore_command is used here?

Looking at the code, I think that the two racing functions are
RestoreArchivedFile, and CreateRestartPoint.

The former calls unlink on the wal segment, CreateRestartPoint does

attempt

to do recycling on segments.

Don't you mean KeepFileRestoredFromArchive()? RestoreArchivedFile would
call unlink() on pg_wal/RECOVERYXLOG so that does not match.
--
Michael

Michael Paquier

michael@paquier.xyz

over 7 years ago

In reply to: Thomas Crayford (#3)

Re: "could not open file "pg_wal/…": No such file or directory" potential crashing bug due to race condition between restartpoint and recovery

On Fri, Sep 28, 2018 at 01:02:42PM +0100, Thomas Crayford wrote:

Ok, thanks for the pointer. It seems like the race condition I talked about
is still accurate, does that seem right?

KeepFileRestoredFromArchive() looks like a good candidate on the matter
as it removes a WAL segment before replacing it by another with the same
name. I have a hard time understanding why the checkpointer would try
to recycle a segment just recovered though as the startup process would
immediately try to use it. I have not spent more than one hour looking
at potential spots though, which is not much for this kind of race
conditions.

It is also why I am curious about what kind of restore_command you are
using.
--
Michael

Thomas Crayford

tcrayford@salesforce.com

over 7 years ago

In reply to: Michael Paquier (#4)

Re: "could not open file "pg_wal/…": No such file or directory" potential crashing bug due to race condition between restartpoint and recovery

Hi there Michael,

Sorry for the slow response on this - I was oncall last week and it was
quite distracting and busy.

With respect to the restore_command, we use wal-e:
https://github.com/wal-e/wal-e, specifically:

envdir DIRECTORY wal-e wal-fetch "%f" "%p"

Thanks

Tom

On Fri, Sep 28, 2018 at 11:59 PM Michael Paquier <michael@paquier.xyz>
wrote:

Show quoted text

On Fri, Sep 28, 2018 at 01:02:42PM +0100, Thomas Crayford wrote:

Ok, thanks for the pointer. It seems like the race condition I talked

about

is still accurate, does that seem right?

KeepFileRestoredFromArchive() looks like a good candidate on the matter
as it removes a WAL segment before replacing it by another with the same
name. I have a hard time understanding why the checkpointer would try
to recycle a segment just recovered though as the startup process would
immediately try to use it. I have not spent more than one hour looking
at potential spots though, which is not much for this kind of race
conditions.

It is also why I am curious about what kind of restore_command you are
using.
--
Michael

Michael Paquier

michael@paquier.xyz

over 7 years ago

In reply to: Thomas Crayford (#5)

Re: "could not open file "pg_wal/…": No such file or directory" potential crashing bug due to race condition between restartpoint and recovery

On Mon, Oct 01, 2018 at 12:43:02PM +0100, Thomas Crayford wrote:

Sorry for the slow response on this - I was oncall last week and it was
quite distracting and busy.

With respect to the restore_command, we use wal-e:
https://github.com/wal-e/wal-e, specifically:

envdir DIRECTORY wal-e wal-fetch "%f" "%p"

Okay, so that should be fine. When it comes to such race conditions,
putting directly into the suspicious code paths calls to pg_usleep would
most likely help in enforcing the bug to show up.
--
Michael