Run pg_rewind on an uncleanly shut down cluster
Hello,
I'm trying to find out how to rewind a cluster that was not shut down
cleanly, in order to implement pg_rewind support in patroni (an
automated failover system, https://github.com/zalando/patroni).
At the moment, pg_rewind limits itself to only cleanly shut down
clusters. This works nicely in the case of a split brain caused by a
network partition. However, it doesn't cover the important case of a
suddenly crashed master: the crashed cluster cannot be rewound to the
new master.
One idea to overcome this limitation is to start the former master for a
short time, just to let automatic recovery do its job, and stop it
cleanly afterwards. There are some indications on the list that it
works:
/messages/by-id/79F6CEB4-F519-40FA-9C72-167DEF1EB3B5@simply.name
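Roughly, the sequence the failover tool would perform is something like
this (just a sketch, not the actual patroni code; the paths and options
mirror the test setup shown below):

# start the crashed former master once, so crash recovery runs to completion
pg_ctl -D data/postgresql0 -w start
# shut it down cleanly, so pg_control ends up in the DB_SHUTDOWNED state
pg_ctl -D data/postgresql0 -w -m fast stop
# now pg_rewind should accept it as a target
pg_rewind -D data/postgresql0 --source-server="host=127.0.0.1 port=5433 dbname=postgres"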
However, in our testing we had an issue with a missing WAL segment on a
former master, which prevented pg_rewind from bringing it up to date
with the current master:
Suppose the current XLOG segment right before we crash the master is:
postgres=# select * from pg_xlogfile_name(pg_current_xlog_location());
pg_xlogfile_name
--------------------------
000000010000000000000003
(1 row)
(the master is configured to archive all segments into an external
directory).
The latest checkpoint location right before the crash is:
Latest checkpoint's REDO location: 0/2000028
Latest checkpoint's REDO WAL file: 000000010000000000000002
and pg_xlog contains the following files:
$ ls -R postgresql0/pg_xlog/
000000010000000000000001
000000010000000000000002
000000010000000000000002.00000028.backup
000000010000000000000003
archive_status

postgresql0/pg_xlog//archive_status:
000000010000000000000001.done
000000010000000000000002.done
000000010000000000000002.00000028.backup.done
Now, if we crash the master by sending it SIGKILL, and then start it
again with:
$ postgres -D data/postgresql0 -c "max_replication_slots=5" -c
"wal_level=hot_standby" -c "wal_log_hints=on"
LOG: database system was interrupted; last known up at 2015-10-05
17:28:04 CEST
LOG: database system was not properly shut down; automatic recovery in
progress
LOG: redo starts at 0/2000028
LOG: invalid record length at 0/3000060
LOG: redo done at 0/3000028
we'll get the following contents of postgresql0/pg_xlog:
$ ls -R postgresql0/pg_xlog/
000000010000000000000002.00000028.backup
000000010000000000000003
000000010000000000000004
000000010000000000000005
archive_status

postgresql0/pg_xlog//archive_status:
000000010000000000000002.00000028.backup.done
Note that at some point the master removed the segment
000000010000000000000002 from its pg_xlog.
In the pg_controldata output, I get:
Latest checkpoint's REDO location: 0/3000108
Latest checkpoint's REDO WAL file: 000000010000000000000003
When I try to run pg_rewind, I'm getting:
pg_rewind -D postgresql0 --source-server="host=127.0.0.1 port=5433
dbname=postgres"
The servers diverged at WAL position 0/3000060 on timeline 1.
could not open file "data/postgresql0/pg_xlog/000000010000000000000002":
No such file or directory
Note that this problem does not happen 100% of the time during the
tests; sometimes pg_rewind can actually rewind the former master.
I know I can copy the segment back from the archive, but I'd like to
avoid putting this logic into the failover tool if possible. Is there
anything we can do to avoid the problem described above, or is there a
better way to bring up the former master after the crash with pg_rewind?
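That is, I would like to avoid having to teach the tool to do something
along these lines before every rewind (the archive path here is made up):

cp /wal_archive/000000010000000000000002 postgresql0/pg_xlog/
pg_rewind -D postgresql0 --source-server="host=127.0.0.1 port=5433 dbname=postgres"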
Kind regards,
--
Oleksii Kliukin
On Mon, Oct 5, 2015 at 05:41:07PM +0200, Oleksii Kliukin wrote:
Hello,
I'm trying to find out how to rewind a cluster that was not shut down
cleanly, in order to implement pg_rewind support in patroni (an
automated failover system, https://github.com/zalando/patroni).
At the moment, pg_rewind limits itself to only cleanly shut down
clusters. This works nicely in the case of a split brain caused by the
network partitioning. However, it doesn't cover the important case of a
suddenly crashed master: the crashed cluster cannot be rewound to the
new master.
Did you read this thread covering the same topic from a few weeks ago?
/messages/by-id/55FA2537.4070600@gmx.net
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription +
On 05 Oct 2015, at 18:04, Bruce Momjian <bruce@momjian.us> wrote:
On Mon, Oct 5, 2015 at 05:41:07PM +0200, Oleksii Kliukin wrote:
Hello,
I'm trying to find out how to rewind a cluster that was not shut down
cleanly, in order to implement pg_rewind support in patroni (an
automated failover system, https://github.com/zalando/patroni).
At the moment, pg_rewind limits itself to only cleanly shut down
clusters. This works nicely in the case of a split brain caused by the
network partitioning. However, it doesn't cover the important case of a
suddenly crashed master: the crashed cluster cannot be rewound to the
new master.
Did you read this thread covering the same topic from a few weeks ago?
/messages/by-id/55FA2537.4070600@gmx.net
Thanks, I saw it. The problem being discussed there is different from
mine: I need to rewind a crashed master, not a replica being shut down
in recovery. And I'm looking for something that can be achieved with
9.3 or 9.4, and there is evidence (at least suggestions in the thread
linked from my previous post) that it should work.
Kind regards,
--
Oleksii
On Tue, Oct 6, 2015 at 12:41 AM, Oleksii Kliukin <alexk@hintbits.com> wrote:
pg_rewind -D postgresql0 --source-server="host=127.0.0.1 port=5433
dbname=postgres"
The servers diverged at WAL position 0/3000060 on timeline 1.
could not open file "data/postgresql0/pg_xlog/000000010000000000000002":
No such file or directory
Note that this problem happens not 100% of time during the tests,
sometimes pg_rewind can actually rewind the former master.
I don't think that there is any actual reason preventing us from
rewinding a node that has its state in pg_control set to something
other than DB_SHUTDOWNED; the important point here is to be sure that
the target node is *not* running while pg_rewind is running (perhaps
pg_rewind should perform an action in the target node to not have it
run, let's say that it creates a fake postgresql.conf with invalid
data and renames the existing one). Checking pg_control makes things
easier though, there is no need to rely on external binaries like
"pg_ctl status" or some parsing of postmaster.pid with kill(pid, 0)
for example.
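For instance, a failover tool can already do a rough liveness check
before invoking pg_rewind; a minimal sketch, assuming the usual
postmaster.pid layout where the first line holds the postmaster PID:

pid=$(head -n 1 data/postgresql0/postmaster.pid 2>/dev/null)
if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
    echo "target cluster appears to be running, refusing to rewind" >&2
    exit 1
fi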
I know I can copy the segment back from the archive, but I'd like to
avoid putting this logic into the failover tool if possible. Is there
anything we can do to avoid the problem described above, or is there a
better way to bring up the former master after the crash with pg_rewind?
Well, for 9.5 (and actually the same applies to the 9.3 and 9.4
versions on github, because I am keeping the previous versions
consistent with what is in 9.5), I guess no.
This missing segment is going to be needed in any case because the
list of blocks modified needs to be found, hence the question is "how
can pg_rewind guess where a WAL segment missing from the target's
pg_xlog is located?". And there are multiple answers:
- An archive path, then let's add an option to pg_rewind to add a
path, though this needs the archive path to be mounted locally, and
usually that's not the case.
- An existing node of the cluster, perhaps the segment is still
present on another standby node that already replayed it, though this
would need an extra node.
- The source node itself, if we are lucky the missing segment created
before WAL forked is still there. It may not be there though if it has
already been recycled.
In the end it seems to me that this is going to need some extra
operation by the failover tool or the system administrator either way,
and that any additional logic to check where this segment is located
is never going to completely satisfy the failover use cases. Hence I
would just keep pg_rewind out of that.
Regards,
--
Michael
On 06 Oct 2015, at 08:58, Michael Paquier <michael.paquier@gmail.com> wrote:
On Tue, Oct 6, 2015 at 12:41 AM, Oleksii Kliukin <alexk@hintbits.com> wrote:
pg_rewind -D postgresql0 --source-server="host=127.0.0.1 port=5433
dbname=postgres"
The servers diverged at WAL position 0/3000060 on timeline 1.
could not open file "data/postgresql0/pg_xlog/000000010000000000000002":
No such file or directory
Note that this problem happens not 100% of time during the tests,
sometimes pg_rewind can actually rewind the former master.
I don't think that there is any actual reason preventing us from
rewinding a node that has its state in pg_control set as something
else than DB_SHUTDOWNED, the important point here is to be sure that
the target node is *not* running while pg_rewind is running (perhaps
pg_rewind should perform an action in the target node to not have it
run, let's say that it creates a fake postgresql.conf with invalid
data and renames the existing one). Checking pg_control makes things
easier though, there is no need to rely on external binaries like
"pg_ctl status" or some parsing of postmaster.pid with kill(pid, 0)
for example.
Does pg_rewind actually rely on the cluster being rewound to finish
recovery? If not, then it would be a good idea to add a --force flag to
make pg_rewind ignore the state check, as you suggested in this
thread:
/messages/by-id/CAF8Q-Gw1HBKzpSEVtotLg=DR+Ee-6q59qQfhY5tor3FYAenyrA@mail.gmail.com
I know I can copy the segment back from the archive, but I'd like to
avoid putting this logic into the failover tool if possible. Is there
anything we can do to avoid the problem described above, or is there
a better way to bring up the former master after the crash with
pg_rewind?
Well, for 9.5 (and actually the same applies to the 9.3 and 9.4
version on github because I am keeping the previous versions
consistent with what is in 9.5), I guess no.
This missing segment is going to be needed in any case because the
list of blocks modified needs to be found, hence the question is "how
can pg_rewind guess where a WAL segment missing from the target's
pg_xlog is located?". And there are multiple answers:
- An archive path, then let's add an option to pg_rewind to add a
path, though this needs the archive path to be mounted locally, and
usually that's not the case.
- An existing node of the cluster, perhaps the segment is still
present on another standby node that already replayed it, though
this would need an extra node.
- The source node itself, if we are lucky the missing segment created
before WAL forked is still there. It may not be there though if it
has already been recycled.
At the end it seems to me that this is
going to need some extra operation by the failover tool or the
system administrator either way, and that any additional logic to
check where this segment is located is never going to satisfy
completely the failover use cases. Hence I would keep just pg_rewind
out of that.
Well, checking the source node looks like an option that does not
require the DBA to provide any additional information, as the connection
string or the path to the data dir is already there. It would be nice if
pg_rewind could fetch WAL from the given restore_command though, or even
use the command already there in recovery.conf (if the node being
recovered is a replica, which I guess is a pretty common case).
Anyway, thank you for describing the issue. In my case, it seems I
solved it by removing the files from the archive_status directory of the
former master (the node being rewound). This makes PostgreSQL forget
that it has to remove an already archived (but still required for
pg_rewind) segment (I guess it does this at shutdown, when the checkpoint
is issued). Afterwards, I start postgres in single-user mode with
archive_command=false and archive_mode=on, to make sure no segments are
archived or removed, and stop it right afterwards with:
postgres --single -D . -c "max_replication_slots=5" -c
"wal_level=hot_standby" -c "wal_log_hints=on" -c "archive_mode=on" -c
"archive_command=false" postgres </dev/null
(the other options are taken from postmaster.opts).
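For reference, the archive_status cleanup mentioned above is done while
postgres is stopped, with something along the lines of:

rm postgresql0/pg_xlog/archive_status/*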
Afterwards, pg_rewind runs on the cluster without any noticeable issues.
Since the node is not going to continue as a master and the contents of
pg_xlog/archive_status are replaced by pg_rewind anyway, I don't think
any data is lost by the initial removal of the archive_status files.
Kind regards,
--
Oleksii
On Tue, Oct 6, 2015 at 6:04 PM, Oleksii Kliukin <alexk@hintbits.com> wrote:
Does pg_rewind actually rely on the cluster being rewound to finish
recovery?
That's not mandatory AFAIK. I think that Heikki has just implemented
it in the safest way possible for a first shot. That's something we
could relax in the future.
If not, then it would be a good idea to add a --force flag to force
pg_rewind to ignore the state check, as you suggested in this thread:
/messages/by-id/CAF8Q-Gw1HBKzpSEVtotLg=DR+Ee-6q59qQfhY5tor3FYAenyrA@mail.gmail.com
Another option would be to replace this check of pg_control with something
closer to what pg_ctl status does with postmaster.pid, for example. And
perhaps to add a safeguard to prevent a concurrent user from starting the
target node when the pg_rewind run begins.
Well, checking the source node looks like an option that does not require
providing any additional information by DBA, as the connection string or the
path to the data dir is already there. It would be nice if pg_rewind could
fetch WAL from the given restore_command though, or even use the command
already there in recovery.conf (if the node being recovered is a replica,
which I guess is a pretty common case).
Kind of. Except that we would want a user to be able to pass a custom
restore_command for more flexibility that would be used by pg_rewind
itself.
Anyway, thank you for describing the issue. In my case, it seems I solved it
by removing the files from the archive_status directory of the former master
(the node being rewound). This makes PostgreSQL forget that it has to remove
an already archived (but still required for pg_rewind) segment (I guess it
does it during stop when the checkpoint is issued). Afterwards, postgres
starts it in a single user mode with archive_command=false and
archive_mode=on, to make sure no segments are archived/removed, and stopped
right afterwards with:
Interesting. That's one way to go.
Afterwards, pg_rewind runs on the cluster without any noticeable issues.
Since the node is not going to continue as a master and the contents of
pg_xlog/archive_status is changed after pg_rewind anyway, I don’t think any
data is lost after initial removal of archive_status files.
Yep. Its content is replaced by everything from the source node.
--
Michael
On Tue, Oct 6, 2015 at 03:58:44PM +0900, Michael Paquier wrote:
On Tue, Oct 6, 2015 at 12:41 AM, Oleksii Kliukin <alexk@hintbits.com> wrote:
pg_rewind -D postgresql0 --source-server="host=127.0.0.1 port=5433
dbname=postgres"
The servers diverged at WAL position 0/3000060 on timeline 1.
could not open file "data/postgresql0/pg_xlog/000000010000000000000002":
No such file or directory
Note that this problem happens not 100% of time during the tests,
sometimes pg_rewind can actually rewind the former master.
I don't think that there is any actual reason preventing us from
rewinding a node that has its state in pg_control set as something
else than DB_SHUTDOWNED, the important point here is to be sure that
the target node is *not* running while pg_rewind is running (perhaps
pg_rewind should perform an action in the target node to not have it
run, let's say that it creates a fake postgresql.conf with invalid
data and renames the existing one). Checking pg_control makes things
easier though, there is no need to rely on external binaries like
"pg_ctl status" or some parsing of postmaster.pid with kill(pid, 0)
for example.
To disable the old cluster, pg_upgrade renames pg_control to
pg_control.old in disable_old_cluster(). You should do that, or
pg_upgrade should use whatever new method you choose.
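In shell terms that is roughly the following (pg_control lives under
global/ in the data directory):

mv data/postgresql0/global/pg_control data/postgresql0/global/pg_control.old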
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription +