Concurrency issue in pg_rewind

Started by Alexander Kukushkin, 14 messages
#1Alexander Kukushkin
cyberdemn@gmail.com

Hello,

Today I bumped into an issue with pg_rewind, and it is not 100% clear
to me where it would best be fixed.

The first call of pg_rewind failed with the following message:
servers diverged at WAL location A76/39E55338 on timeline 132
could not open file
"/home/postgres/pgdata/pgroot/data/pg_wal/0000008400000A760000001E":
No such file or directory

could not find previous WAL record at A76/1EFFE620
Failure, exiting

In order to avoid rebuilding the replica from scratch, we restored the
missing file by calling restore_command (wal-g) and repeated the call
of pg_rewind.

The second time pg_rewind also failed, but the error looked differently:
servers diverged at WAL location A76/39E55338 on timeline 132
rewinding from last common checkpoint at A76/1EF254B8 on timeline 132

could not remove file
"/home/postgres/pgdata/pgroot/data/pg_wal/.wal-g/prefetch/running/0000008400000A7600000024":
No such file or directory
Failure, exiting

The second call left PGDATA in an inconsistent state (empty pg_control).

A few words about where pg_wal/.wal-g/prefetch/running/ comes from:
by default, when fetching a WAL file, wal-g also tries to prefetch a
few of the next WAL files. For that it forks, and the child process
does the prefetch while the parent process exits. In order to avoid
multiple parallel prefetches of the same file, wal-g keeps its state
in the pg_wal/.wal-g directory. It also keeps prefetched files there.

What in fact happened: pg_rewind builds a list of files in the target
directory that don't match the source directory and therefore must be
changed (copied/removed/truncated/etc.). When the list was built, the
wal-g prefetch was still running; when pg_rewind later tried to remove
files that should not be there because they don't exist in the source
directory, it failed with a fatal error.

The issue occurred on 10.14, but I believe something very similar
could happen with postgres 13 when pg_rewind is called with the
--restore-target-wal option.

One might argue that the issue should be fixed in wal-g (it should not
mess with pg_wal), and I personally 99% agree with that, but so far
this behavior was safe, because postgres itself never looks inside
unexpected directories in pg_wal. Also, from the usability perspective
it is very convenient to keep prefetched files in pg_wal/.wal-g,
because it guarantees that they will be located on the same filesystem
as pg_wal, and therefore the next time restore_command is called it is
enough to just rename the file.
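The rename trick can be sketched as follows (a minimal standalone illustration; the function name and paths are hypothetical, not wal-g's actual code):

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical sketch of why the prefetch cache lives under pg_wal:
 * since it is guaranteed to be on the same filesystem as pg_wal,
 * handing a prefetched segment to the server is one atomic rename(2),
 * not a cross-filesystem copy.
 */
static int
deliver_prefetched(const char *prefetched, const char *target)
{
    if (rename(prefetched, target) == 0)
        return 0;   /* cache hit: renamed in place, no data copied */
    if (errno == ENOENT)
        return 1;   /* not in the cache: caller falls back to a real fetch */
    fprintf(stderr, "rename failed: %s\n", strerror(errno));
    return -1;
}
```

Had the cache lived on another filesystem, rename(2) would fail with EXDEV and the file would have to be copied instead.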

That made me think about how this could be improved in pg_rewind.
The thing is, we want a specific file to be removed, and it is
already not there. Should that be a fatal error?
traverse_datadir()/recurse_dir() already ignore all failed lstat()
calls with errno == ENOENT.

Basically I have two options:
1. Modify the function to remove_target_file(const char *path, bool
missing_ok), ignoring ENOENT when missing_ok is set; that would be
consistent with recurse_dir().
2. Change the logic of remove_target_file() so it doesn't exit with a
fatal error if the file is missing, but only shows a warning.

In addition to the aforementioned options, remove_target_dir() should
also be improved, i.e. it should check errno and behave similarly to
remove_target_file() when errno == ENOENT.
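Option 2 could look roughly like this (a standalone sketch; the function name follows pg_rewind's file_ops.c, but the body is illustrative, not the actual source):

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Sketch only: tolerate a file that vanished between building the
 * file map and executing it, instead of dying with a fatal error.
 */
static bool
remove_target_file(const char *path, bool missing_ok)
{
    if (unlink(path) == 0)
        return true;
    if (errno == ENOENT && missing_ok)
    {
        /* Already gone: that is the state we wanted, so only warn. */
        fprintf(stderr, "WARNING: file \"%s\" has already disappeared\n",
                path);
        return true;
    }
    fprintf(stderr, "could not remove file \"%s\": %s\n",
            path, strerror(errno));
    return false;
}
```

The caller would decide per file class whether missing_ok is acceptable, rather than downgrading every ENOENT unconditionally.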

What do you think?

Regards,
--
Alexander Kukushkin

#2Oleksandr Shulgin
oleksandr.shulgin@zalando.de
In reply to: Alexander Kukushkin (#1)
Re: Concurrency issue in pg_rewind

On Wed, Sep 16, 2020 at 2:55 PM Alexander Kukushkin <cyberdemn@gmail.com>
wrote:

The second time pg_rewind also failed, but the error looked differently:
servers diverged at WAL location A76/39E55338 on timeline 132
rewinding from last common checkpoint at A76/1EF254B8 on timeline 132

could not remove file
"/home/postgres/pgdata/pgroot/data/pg_wal/.wal-g/prefetch/running/0000008400000A7600000024":

Ouch. I think pg_rewind shouldn't try to remove any random files in pg_wal
that it doesn't know about.
What if the administrator made a backup of some WAL segments there?

--
Alex

#3Alexey Kondratov
a.kondratov@postgrespro.ru
In reply to: Alexander Kukushkin (#1)
Re: Concurrency issue in pg_rewind

On 2020-09-16 15:55, Alexander Kukushkin wrote:

Hello,

Today I bumped into an issue with pg_rewind, and it is not 100% clear
to me where it would best be fixed.

The first call of pg_rewind failed with the following message:
servers diverged at WAL location A76/39E55338 on timeline 132
could not open file
"/home/postgres/pgdata/pgroot/data/pg_wal/0000008400000A760000001E":
No such file or directory

could not find previous WAL record at A76/1EFFE620
Failure, exiting

In order to avoid rebuilding the replica from scratch, we restored the
missing file by calling restore_command (wal-g) and repeated the call
of pg_rewind.

The second time pg_rewind also failed, but the error looked
differently:
servers diverged at WAL location A76/39E55338 on timeline 132
rewinding from last common checkpoint at A76/1EF254B8 on timeline 132

could not remove file
"/home/postgres/pgdata/pgroot/data/pg_wal/.wal-g/prefetch/running/0000008400000A7600000024":
No such file or directory
Failure, exiting

The second call left PGDATA in an inconsistent state (empty
pg_control).

A few words about where pg_wal/.wal-g/prefetch/running/ comes from:
by default, when fetching a WAL file, wal-g also tries to prefetch a
few of the next WAL files. For that it forks, and the child process
does the prefetch while the parent process exits. In order to avoid
multiple parallel prefetches of the same file, wal-g keeps its state
in the pg_wal/.wal-g directory. It also keeps prefetched files there.

Hm, I cannot understand why wal-g (or any other tool) would run
pg_rewind while WAL copying (and prefetch) is still in progress. Why
not just wait until it is finished?

It is also not clear to me why it does not put prefetched WAL files
directly into pg_wal.

The issue occurred on 10.14, but I believe something very similar
could happen with postgres 13 when pg_rewind is called with the
--restore-target-wal option.

With --restore-target-wal pg_rewind is trying to call restore_command on
its own and it can happen at two stages:

1) When pg_rewind is trying to find the last checkpoint preceding the
divergence point. In that case the file map is not yet initialized;
thus, all WAL segments fetched at this stage will be present in the
file map created later.

2) When it creates a data pages map. It should traverse WAL from the
last common checkpoint till the final shutdown point in order to find
all modified pages on the target. At this stage pg_rewind only updates
info about data segments in the file map. That said, I see a minor
problem: WAL segments fetched at this stage would not be deleted,
since they are absent from the file map.

Anyway, pg_rewind deletes neither WAL segments nor any other files in
the middle of file map creation, so I cannot imagine how it could get
into the same trouble on its own.

That made me think about how this could be improved in pg_rewind.
The thing is, we want a specific file to be removed, and it is
already not there. Should that be a fatal error?
traverse_datadir()/recurse_dir() already ignore all failed lstat()
calls with errno == ENOENT.

recurse_dir() is also called on the source, which should be running
when --source-server is used, so files can disappear under it. There
is actually a TODO there:

/*
* File doesn't exist anymore. This is ok, if the new master
* is running and the file was just removed. If it was a data
* file, there should be a WAL record of the removal. If it
* was something else, it couldn't have been anyway.
*
* TODO: But complain if we're processing the target dir!
*/

Basically I have two options:
1. Modify the function to remove_target_file(const char *path, bool
missing_ok), ignoring ENOENT when missing_ok is set; that would be
consistent with recurse_dir().
2. Change the logic of remove_target_file() so it doesn't exit with a
fatal error if the file is missing, but only shows a warning.

In addition to the aforementioned options, remove_target_dir() should
also be improved, i.e. it should check errno and behave similarly to
remove_target_file() when errno == ENOENT.

What do you think?

Although keeping arbitrary files inside PGDATA does not look like a
good idea to me, I do not see anything criminal in skipping a
non-existing file when pg_rewind executes the file map.

Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

#4Alexander Kukushkin
cyberdemn@gmail.com
In reply to: Alexey Kondratov (#3)
Re: Concurrency issue in pg_rewind

On Thu, 17 Sep 2020 at 14:04, Alexey Kondratov
<a.kondratov@postgrespro.ru> wrote:

Hm, I cannot understand why wal-g (or any other tool) is trying to run
pg_rewind, while WAL copying (and prefetch) is still in progress? Why do
not just wait until it is finished?

wal-g doesn't try to call pg_rewind.
First, we called wal-g, it fetched the file we requested and exited.
But, before exiting, wal-g forks, and the child process does prefetch
of a few next WALs.
We don't really know when the child process exits and can't wait for it.

It is also not clear for me why it does not put prefetched WAL files
directly into the pg_wal?

Because this is how postgres works: it doesn't matter whether the
specific WAL segment is already there, postgres will call the
restore_command anyway.
The restore_command also doesn't know whether the file in pg_wal is
OK, therefore keeping the prefetched file somewhere else and moving
it into place seems to be a good approach.

With --restore-target-wal pg_rewind is trying to call restore_command on
its own and it can happen at two stages:

1) When pg_rewind is trying to find the last checkpoint preceding a
divergence point. In that case file map is not even yet initialized.
Thus, all fetched WAL segments at this stage will be present in the file
map created later.

Nope, it will fetch the files you requested, and in addition it will
leave a child process running in the background doing the prefetch
(manipulating pg_wal/.wal-g/...).

2) When it creates a data pages map. It should traverse WAL from the
last common checkpoint till the final shutdown point in order to find
all modified pages on the target. At this stage pg_rewind only updates
info about data segments in the file map. That way, I see a minor
problem that WAL segments fetched at this stage would not be deleted,
since they are absent in the file map.

Anyway, pg_rewind does not delete neither WAL segments, not any other
files in the middle of the file map creation, so I cannot imagine, how
it can get into the same trouble on its own.

When pg_rewind was creating the map, some temporary files were there,
because the forked wal-g child process was still running.
When the wal-g child process exits, it removes some of these files.
Specifically, it was trying to prefetch 0000008400000A7600000024 into
pg_wal/.wal-g/prefetch/running/0000008400000A7600000024, but
apparently the file wasn't available on S3 and the prefetch failed,
therefore the empty file was removed.

Although keeping arbitrary files inside PGDATA does not look like a good
idea for me, I do not see anything criminal in skipping non-existing
file, when executing a file map by pg_rewind.

Good, I will prepare a patch then.

Regards,
--
Alexander Kukushkin

#5Alexey Kondratov
a.kondratov@postgrespro.ru
In reply to: Alexander Kukushkin (#4)
Re: Concurrency issue in pg_rewind

On 2020-09-17 15:27, Alexander Kukushkin wrote:

On Thu, 17 Sep 2020 at 14:04, Alexey Kondratov
<a.kondratov@postgrespro.ru> wrote:

With --restore-target-wal pg_rewind is trying to call restore_command
on
its own and it can happen at two stages:

1) When pg_rewind is trying to find the last checkpoint preceding a
divergence point. In that case file map is not even yet initialized.
Thus, all fetched WAL segments at this stage will be present in the
file
map created later.

Nope, it will fetch files you requested, and in addition to that it
will leave a child process running in the background which is doing
the prefetch (manipulating with pg_wal/.wal-g/...)

2) When it creates a data pages map. It should traverse WAL from the
last common checkpoint till the final shutdown point in order to find
all modified pages on the target. At this stage pg_rewind only updates
info about data segments in the file map. That way, I see a minor
problem that WAL segments fetched at this stage would not be deleted,
since they are absent in the file map.

Anyway, pg_rewind deletes neither WAL segments nor any other files in
the middle of file map creation, so I cannot imagine how it could get
into the same trouble on its own.

When pg_rewind was creating the map, some temporary files were there,
because the forked child process of wal-g was still running.
When the wal-g child process exits, it removes some of these files.
Specifically, it was trying to prefetch 0000008400000A7600000024 into
the pg_wal/.wal-g/prefetch/running/0000008400000A7600000024, but
apparently the file wasn't available on S3 and prefetch failed,
therefore the empty file was removed.

I do understand how you got into this problem with wal-g. That part
of my answer was about bare postgres and pg_rewind, and my point was
that from my perspective pg_rewind with --restore-target-wal cannot
get into the same trouble on its own, without the 'help' of side
tools like wal-g.

Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

#6Michael Paquier
michael@paquier.xyz
In reply to: Oleksandr Shulgin (#2)
Re: Concurrency issue in pg_rewind

On Thu, Sep 17, 2020 at 10:20:16AM +0200, Oleksandr Shulgin wrote:

Ouch. I think pg_rewind shouldn't try to remove any random files in pg_wal
that it doesn't know about.
What if the administrator made a backup of some WAL segments there?

IMO, this would be a rather bad strategy anyway, so just don't do
that: it could also mean that this data is on the same partition as
pg_wal/, which would crash the server if the partition happens to
fill up even with max_wal_size set correctly. If you think about it,
this is rather similar to putting tablespaces in your root data
folder: it may look fancy, but you just make the backend's task more
complicated. Note that this practice is also unwise when Postgres
applies actions that loop across the entire tree, to name two of
them: the full fsync of PGDATA at the beginning of crash recovery,
or a base backup.
--
Michael

#7Andrey M. Borodin
x4mmm@yandex-team.ru
In reply to: Michael Paquier (#6)
Re: Concurrency issue in pg_rewind

On 18 Sep 2020, at 11:10, Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Sep 17, 2020 at 10:20:16AM +0200, Oleksandr Shulgin wrote:

Ouch. I think pg_rewind shouldn't try to remove any random files in pg_wal
that it doesn't know about.
What if the administrator made a backup of some WAL segments there?

IMO, this would be a rather bad strategy anyway, so just don't do
that, because that could also mean that this is on the same partition
as pg_wal/

This is the whole point of having prefetch: restore_command just links the file from the same partition.
In WAL-G you strictly control the number of cached WALs, so if you configured max_wal_size, you can configure WALG_DOWNLOAD_CONCURRENCY too.

Best regards, Andrey Borodin.

#8Michael Paquier
michael@paquier.xyz
In reply to: Andrey M. Borodin (#7)
Re: Concurrency issue in pg_rewind

On Fri, Sep 18, 2020 at 11:31:26AM +0500, Andrey M. Borodin wrote:

This is whole point of having prefetch. restore_command just links
file from the same partition.

If this stuff is willing to do so, you may have your reasons, but even
if you wish to locate both pg_wal/ and the prefetch path in the same
partition, I don't get why it is necessary to have the prefetch path
included directly in pg_wal? You could just use different paths for
both. Say, with a base partition at /my/path/, you can just have
/my/path/pg_wal/ that the Postgres backend links to, and
/my/path/wal-g/prefetch/ for the secondary path.
--
Michael

#9Alexander Kukushkin
cyberdemn@gmail.com
In reply to: Michael Paquier (#8)
Re: Concurrency issue in pg_rewind

Hi,

On Fri, 18 Sep 2020 at 08:59, Michael Paquier <michael@paquier.xyz> wrote:

If this stuff is willing to do so, you may have your reasons, but even
if you wish to locate both pg_wal/ and the prefetch path in the same
partition, I don't get why it is necessary to have the prefetch path
included directly in pg_wal? You could just use different paths for
both. Say, with a base partition at /my/path/, you can just have
/my/path/pg_wal/ that the Postgres backend links to, and
/my/path/wal-g/prefetch/ for the secondary path.

Well, I agree and disagree at the same time.
Yes, it would be nice not to write anything unexpected into PGDATA
and pg_wal, but this is also a usability issue.
Creating a separate directory and configuring wal-e/wal-g to use it
(right now it is not even configurable) requires some effort from the
administrator, while using something inside pg_wal just works.

At the same time, pg_rewind, due to such a "fatal" error, leaves
PGDATA in an inconsistent state with an empty pg_control file; this
is totally bad and easily fixable. We want the specific file to be
absent and it is already absent, so why should that be a fatal error
and not a warning?

Regards,
--
Alexander Kukushkin

#10Oleksandr Shulgin
oleksandr.shulgin@zalando.de
In reply to: Michael Paquier (#6)
Re: Concurrency issue in pg_rewind

On Fri, Sep 18, 2020 at 8:10 AM Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Sep 17, 2020 at 10:20:16AM +0200, Oleksandr Shulgin wrote:

Ouch. I think pg_rewind shouldn't try to remove any random files in pg_wal
that it doesn't know about.
What if the administrator made a backup of some WAL segments there?

IMO, this would be a rather bad strategy anyway, so just don't do
that, because that could also mean that this is on the same partition
as pg_wal/ which would crash the server if the partition has the idea
to get full even if max_wal_size is set correctly.

To clarify my point, I don't mean backing up WAL segments in the
background while the server is running, but precisely when the server
is down and you need to intervene, such as running pg_rewind. You
might want to "stash" some of the latest segments in case you need to
start over (name it pg_wal/0000008400000A760000001E.backup, or
pg_wal/backup/0000008400000A760000001E). It is surprising that
pg_rewind might decide to remove those.

--
Alex

#11Andrey M. Borodin
x4mmm@yandex-team.ru
In reply to: Michael Paquier (#8)
Re: Concurrency issue in pg_rewind

On 18 Sep 2020, at 11:59, Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Sep 18, 2020 at 11:31:26AM +0500, Andrey M. Borodin wrote:

This is whole point of having prefetch. restore_command just links
file from the same partition.

If this stuff is willing to do so, you may have your reasons, but even
if you wish to locate both pg_wal/ and the prefetch path in the same
partition, I don't get why it is necessary to have the prefetch path
included directly in pg_wal? You could just use different paths for
both. Say, with a base partition at /my/path/, you can just have
/my/path/pg_wal/ that the Postgres backend links to, and
/my/path/wal-g/prefetch/ for the secondary path.

This complexity doesn't seem necessary to me. What do we gain? Prefetched WAL is WAL per se; it makes sense to keep it in the pg_wal tree by default.

I will implement the possibility to move the cache out of pg_wal (similar functionality is implemented in pgBackRest). But it seems useless to me: the user can configure WAL prefetch to be less performant, without any benefit.

Best regards, Andrey Borodin.

#12Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Alexander Kukushkin (#9)
Re: Concurrency issue in pg_rewind

On 18/09/2020 10:17, Alexander Kukushkin wrote:

At the same time, pg_rewind due to such "fatal" error leaves PGDATA in
an inconsistent state with empty pg_control file, this is totally bad
and easily fixable. We want the specific file to be absent and it is
already absent, why should it be a fatal error and not warning?

Whenever pg_rewind runs into something unexpected, it fails loudly, so
that the administrator can re-initialize from a base backup. That's the
general rule. If a file goes missing while pg_rewind is running, that is
unexpected. It could be a sign that the server was started concurrently,
or another pg_rewind was started against it, for example.

I feel that we could make an exception of some sort here, but I'm not
sure what exactly. I don't feel comfortable just downgrading the
unexpected ENOENT on unlink() to warning in all cases. Besides, scary
warnings that you routinely ignore is not good either.

I have a hard time coming up with a general rule and justification
that's not just "do X because WAL-G does Y". pg_rewind failing because
WAL-G removed a file unexpectedly is one problem, but another is that
the restore_command might get confused if a pg_rewind removes a file
that restore_command needs. This is hard when restore_command does
things in the background, and there's no communication between the
background process and pg_rewind.

The general principle is that pg_rewind is equivalent to overwriting the
target with the source, only faster. Perhaps pg_wal should be an
exception, and pg_rewind should leave alone any files under pg_wal that
it doesn't recognize.
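A minimal sketch of such an exception (illustrative only; PostgreSQL's IsXLogFileName macro is the real building block for recognizing segment names):

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/*
 * Sketch, not pg_rewind source: only names that look like
 * 24-hex-digit WAL segments would be managed under pg_wal; anything
 * else (.wal-g/, *.backup stashes, and so on) would be left alone
 * rather than removed.
 */
static bool
looks_like_wal_segment(const char *name)
{
    if (strlen(name) != 24)
        return false;
    for (int i = 0; i < 24; i++)
        if (!isxdigit((unsigned char) name[i]))
            return false;
    return true;
}
```

A real implementation would also need to decide about timeline history files and partial segments, which is part of what makes the general rule hard to state.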

- Heikki

#13Stephen Frost
sfrost@snowman.net
In reply to: Heikki Linnakangas (#12)
Re: Concurrency issue in pg_rewind

Greetings,

* Heikki Linnakangas (hlinnaka@iki.fi) wrote:

On 18/09/2020 10:17, Alexander Kukushkin wrote:

At the same time, pg_rewind due to such "fatal" error leaves PGDATA in
an inconsistent state with empty pg_control file, this is totally bad
and easily fixable. We want the specific file to be absent and it is
already absent, why should it be a fatal error and not warning?

Whenever pg_rewind runs into something unexpected, it fails loudly, so that
the administrator can re-initialize from a base backup. That's the general
rule. If a file goes missing while pg_rewind is running, that is unexpected.
It could be a sign that the server was started concurrently, or another
pg_rewind was started against it, for example.

Agreed.

I feel that we could make an exception of some sort here, but I'm not sure
what exactly. I don't feel comfortable just downgrading the unexpected
ENOENT on unlink() to warning in all cases. Besides, scary warnings that you
routinely ignore is not good either.

I also dislike the idea of downgrading this.

I have a hard time coming up with a general rule and justification that's
not just "do X because WAL-G does Y". pg_rewind failing because WAL-G
removed a file unexpectedly is one problem, but another is that the
restore_command might get confused if a pg_rewind removes a file that
restore_command needs. This is hard when restore_command does things in the
background, and there's no communication between the background process and
pg_rewind.

I would also point out that wal-g isn't the only backup/restore tool
that does pre-fetching: so does pgbackrest, but we pre-fetch into an
independent spool directory, because these tools really should *not* be
modifying the PGDATA directory during restore_command.

I'm really disinclined to make concessions for external tools to start
writing into directories that they shouldn't be- and this goes for
removing .ready files too, imv. Yes, you can do such things and maybe
things will work, but if you run into issues with that, that's on you
for making changes to the PGDATA directory, not on PG to try and guess
at what you, or any other external tool, did and magically work around
it or keep it working.

Thanks,

Stephen

#14Stephen Frost
sfrost@snowman.net
In reply to: Andrey M. Borodin (#11)
Re: Concurrency issue in pg_rewind

Greetings,

* Andrey M. Borodin (x4mmm@yandex-team.ru) wrote:

On 18 Sep 2020, at 11:59, Michael Paquier <michael@paquier.xyz> wrote:
On Fri, Sep 18, 2020 at 11:31:26AM +0500, Andrey M. Borodin wrote:

This is whole point of having prefetch. restore_command just links
file from the same partition.

If this stuff is willing to do so, you may have your reasons, but even
if you wish to locate both pg_wal/ and the prefetch path in the same
partition, I don't get why it is necessary to have the prefetch path
included directly in pg_wal? You could just use different paths for
both. Say, with a base partition at /my/path/, you can just have
/my/path/pg_wal/ that the Postgres backend links to, and
/my/path/wal-g/prefetch/ for the secondary path.

This complexity doesn't seem necessary to me. What do we gain? Prefetched WAL is WAL per se; it makes sense to keep it in the pg_wal tree by default.

I will implement the possibility to move the cache out of pg_wal (similar functionality is implemented in pgBackRest). But it seems useless to me: the user can configure WAL prefetch to be less performant, without any benefit.

In this case there's certainly one very clear benefit: pg_rewind will be
more robust at detecting serious issues and complaining loudly,
hopefully avoiding having users end up with corrupted clusters. That's
certainly not nothing, from my perspective.

Thanks,

Stephen