Streaming replication and WAL archive interactions

Started by Heikki Linnakangas about 11 years ago, 35 messages
#1Heikki Linnakangas
heikki.linnakangas@enterprisedb.com

There have been a few threads on the behavior of WAL archiving, after a
standby server is promoted [1] [2]. In short, it doesn't work as you
might expect. The standby will start archiving after it's promoted, but
it will not archive files that were replicated from the old master via
streaming replication. If those files were not already archived in the
master before the promotion, they are not archived at all. That's not
good if you wanted to restore from a base backup + the WAL archive later.

The basic setup is a master server, a standby, a WAL archive that's
shared by both, and streaming replication between the master and
standby. This should be a very common setup in the field, so how are
people doing it in practice? Just live with the risk that you might miss
some files in the archive if you promote? Don't even realize there's a
problem? Something else?

And how would we like it to work?

There was some discussion in August on enabling WAL archiving in the
standby, always [3]. That's a related idea, but it assumes that you have
a separate archive in the master and the standby. The problem at
promotion happens when you have a shared archive between the master and
standby.

[1]: /messages/by-id/CAHGQGwHVYqbX=A+zo+AvFbVHLGoypO9G_QDKbabeXgXBVGd05g@mail.gmail.com

[2]: /messages/by-id/20140904175036.310c6466@erg

[3]: /messages/by-id/CAHGQGwHNMs-syU=MEVSESTHna+Exd9pfO_OHHFPJCwOVaYRZKw@mail.gmail.com.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Vladimir Borodin
root@simply.name
In reply to: Heikki Linnakangas (#1)
Re: Streaming replication and WAL archive interactions

On 12 Dec 2014, at 16:46, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

There have been a few threads on the behavior of WAL archiving, after a standby server is promoted [1] [2]. In short, it doesn't work as you might expect. The standby will start archiving after it's promoted, but it will not archive files that were replicated from the old master via streaming replication. If those files were not already archived in the master before the promotion, they are not archived at all. That's not good if you wanted to restore from a base backup + the WAL archive later.

The basic setup is a master server, a standby, a WAL archive that's shared by both, and streaming replication between the master and standby. This should be a very common setup in the field, so how are people doing it in practice? Just live with the risk that you might miss some files in the archive if you promote? Don't even realize there's a problem? Something else?

Yes, I do live like that (with streaming replication and shared archive between master and replicas) and don’t even realize there’s a problem :( And I think I’m not the only one. Maybe at least a note should be added to the documentation?

And how would we like it to work?

There was some discussion in August on enabling WAL archiving in the standby, always [3]. That's a related idea, but it assumes that you have a separate archive in the master and the standby. The problem at promotion happens when you have a shared archive between the master and standby.

AFAIK most people use the scheme with shared archive.

[1] /messages/by-id/CAHGQGwHVYqbX=A+zo+AvFbVHLGoypO9G_QDKbabeXgXBVGd05g@mail.gmail.com

[2] /messages/by-id/20140904175036.310c6466@erg

[3] /messages/by-id/CAHGQGwHNMs-syU=MEVSESTHna+Exd9pfO_OHHFPJCwOVaYRZKw@mail.gmail.com.

- Heikki

--
Vladimir

#3Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Vladimir Borodin (#2)
Re: Streaming replication and WAL archive interactions

On 12/16/2014 10:24 AM, Borodin Vladimir wrote:

On 12 Dec 2014, at 16:46, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

There have been a few threads on the behavior of WAL archiving,
after a standby server is promoted [1] [2]. In short, it doesn't
work as you might expect. The standby will start archiving after
it's promoted, but it will not archive files that were replicated
from the old master via streaming replication. If those files were
not already archived in the master before the promotion, they are
not archived at all. That's not good if you wanted to restore from
a base backup + the WAL archive later.

The basic setup is a master server, a standby, a WAL archive that's
shared by both, and streaming replication between the master and
standby. This should be a very common setup in the field, so how
are people doing it in practice? Just live with the risk that you
might miss some files in the archive if you promote? Don't even
realize there's a problem? Something else?

Yes, I do live like that (with streaming replication and shared
archive between master and replicas) and don't even realize there's a
problem :( And I think I'm not the only one. Maybe at least a note
should be added to the documentation?

Let's try to figure out a way to fix this in master, but yeah, a note in
the documentation is in order.

And how would we like it to work?

Here's a plan:

Have a mechanism in the standby, to track how far the master has
archived its WAL, and don't throw away WAL in the standby that hasn't
been archived in the master yet. This is similar to the physical
replication slots, which prevent the master from recycling WAL that a
standby hasn't received yet, but in reverse. I think we can use the
.done and .ready files for this. Whenever a file is streamed
(completely) from the master, create a .ready file for it. When we get
an acknowledgement from the master that it has archived it, create a
.done file for it. To get the information from the master, add the "last
archived WAL segment" e.g. in the streaming replication keep-alive
message, or invent a new message type for it.
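
The bookkeeping in this paragraph can be sketched in C roughly as follows. This is only an illustration, not code from any patch: the function name archive_status_suffix is hypothetical, and it assumes both names are 24-character upper-case hex WAL file names on the same timeline, so that strcmp() orders them like their WAL positions.

```c
#include <string.h>

/*
 * Hypothetical helper (not from the patch): decide which archive_status
 * marker the standby should keep for a completely streamed segment, given
 * the last segment the primary reported as archived.
 *
 * Assumption: both names are 24-character upper-case hex WAL file names on
 * the same timeline, so strcmp() sorts them in WAL order.
 */
const char *
archive_status_suffix(const char *segment, const char *primary_last_archived)
{
    /* No report received yet: nothing is known to be archived. */
    if (primary_last_archived == NULL || primary_last_archived[0] == '\0')
        return ".ready";

    /* The primary has archived this segment: the standby may recycle it. */
    if (strcmp(segment, primary_last_archived) <= 0)
        return ".done";

    /* Not archived yet: keep it so it can be archived after promotion. */
    return ".ready";
}
```

With such a rule, a segment stays in .ready state until a later report from the primary covers it, which is what prevents the standby from recycling it too early.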

At promotion, archive all the WAL from the old timeline that the master
hadn't already archived. While doing this, the archive_command can be
called for files that have in fact already been archived in the master,
so the command needs to return success if it's asked to archive a file
and an identical file already exists in the archive. That's a bit
difficult to write into a one-liner, but hopefully we can still provide
an example of this. Or have another command, e.g.
"promotion_archive_command", which can just assume that everything is OK
if the file already exists.
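
As a rough illustration of the "return success if an identical file already exists" rule, here is a minimal C sketch of what an archiving helper could do. The names files_identical and archive_segment are made up for this example; a real archive_command would also want to write to a temporary name, fsync, and rename, none of which is shown here.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if both open files have identical contents, 0 otherwise. */
int
files_identical(FILE *a, FILE *b)
{
    char        bufa[8192];
    char        bufb[8192];
    size_t      na;
    size_t      nb;

    do
    {
        na = fread(bufa, 1, sizeof(bufa), a);
        nb = fread(bufb, 1, sizeof(bufb), b);
        if (na != nb || memcmp(bufa, bufb, na) != 0)
            return 0;
    } while (na > 0);

    return !ferror(a) && !ferror(b);
}

/*
 * Hypothetical archiving helper: copy src to dst and return 0 on success.
 * If dst already exists, succeed only when its contents match src, and
 * never overwrite it.
 */
int
archive_segment(const char *src, const char *dst)
{
    FILE       *in = fopen(src, "rb");
    FILE       *out;
    char        buf[8192];
    size_t      n;
    int         rc = 0;

    if (in == NULL)
        return 1;

    /* Already archived? Then the only acceptable case is an identical copy. */
    out = fopen(dst, "rb");
    if (out != NULL)
    {
        rc = files_identical(in, out) ? 0 : 1;
        fclose(out);
        fclose(in);
        return rc;
    }

    out = fopen(dst, "wb");
    if (out == NULL)
    {
        fclose(in);
        return 1;
    }

    while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
    {
        if (fwrite(buf, 1, n, out) != n)
        {
            rc = 1;
            break;
        }
    }
    if (ferror(in))
        rc = 1;
    if (fclose(out) != 0)
        rc = 1;
    fclose(in);

    /* Do not leave a truncated file behind in the archive. */
    if (rc != 0)
        remove(dst);
    return rc;
}
```

Calling archive_segment() twice for the same unchanged segment succeeds both times, while a second call with different contents fails and leaves the archived copy untouched.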

To enable this new mode, let's add a third option to archive_mode,
besides on/off. Or just make this the default; I'm not sure if anyone
would want the old behavior.

There was some discussion in August on enabling WAL archiving in
the standby, always [3]. That's a related idea, but it assumes that
you have a separate archive in the master and the standby. The
problem at promotion happens when you have a shared archive between
the master and standby.

AFAIK most people use the scheme with shared archive.

Yeah. Anyway, we can support both scenarios.

- Heikki

#4Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#3)
Re: Streaming replication and WAL archive interactions

On Wed, Dec 17, 2014 at 4:11 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 12/16/2014 10:24 AM, Borodin Vladimir wrote:

On 12 Dec 2014, at 16:46, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

There have been a few threads on the behavior of WAL archiving,
after a standby server is promoted [1] [2]. In short, it doesn't
work as you might expect. The standby will start archiving after
it's promoted, but it will not archive files that were replicated
from the old master via streaming replication. If those files were
not already archived in the master before the promotion, they are
not archived at all. That's not good if you wanted to restore from
a base backup + the WAL archive later.

The basic setup is a master server, a standby, a WAL archive that's
shared by both, and streaming replication between the master and
standby. This should be a very common setup in the field, so how
are people doing it in practice? Just live with the risk that you
might miss some files in the archive if you promote? Don't even
realize there's a problem? Something else?

Yes, I do live like that (with streaming replication and shared
archive between master and replicas) and don’t even realize there’s a
problem :( And I think I’m not the only one. Maybe at least a note
should be added to the documentation?

Let's try to figure out a way to fix this in master, but yeah, a note in the
documentation is in order.

+1

And how would we like it to work?

Here's a plan:

Have a mechanism in the standby, to track how far the master has archived
its WAL, and don't throw away WAL in the standby that hasn't been archived
in the master yet. This is similar to the physical replication slots, which
prevent the master from recycling WAL that a standby hasn't received yet,
but in reverse. I think we can use the .done and .ready files for this.
Whenever a file is streamed (completely) from the master, create a .ready
file for it. When we get an acknowledgement from the master that it has
archived it, create a .done file for it. To get the information from the
master, add the "last archived WAL segment" e.g. in the streaming
replication keep-alive message, or invent a new message type for it.

Sounds OK to me.

How does this work in the cascading replication case? Does the cascading
walsender just relay the archive location to the downstream standby?

What happens when WAL streaming is terminated and the startup process starts to
read WAL files from the archive? After reading a WAL file from the archive,
we would probably need to change the .ready files of all older WAL files to .done.

Regards,

--
Fujii Masao

#5Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#4)
Re: Streaming replication and WAL archive interactions

On 12/18/2014 12:32 PM, Fujii Masao wrote:

On Wed, Dec 17, 2014 at 4:11 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 12/16/2014 10:24 AM, Borodin Vladimir wrote:

On 12 Dec 2014, at 16:46, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

There have been a few threads on the behavior of WAL archiving,
after a standby server is promoted [1] [2]. In short, it doesn't
work as you might expect. The standby will start archiving after
it's promoted, but it will not archive files that were replicated
from the old master via streaming replication. If those files were
not already archived in the master before the promotion, they are
not archived at all. That's not good if you wanted to restore from
a base backup + the WAL archive later.

The basic setup is a master server, a standby, a WAL archive that's
shared by both, and streaming replication between the master and
standby. This should be a very common setup in the field, so how
are people doing it in practice? Just live with the risk that you
might miss some files in the archive if you promote? Don't even
realize there's a problem? Something else?

Yes, I do live like that (with streaming replication and shared
archive between master and replicas) and don’t even realize there’s a
problem :( And I think I’m not the only one. Maybe at least a note
should be added to the documentation?

Let's try to figure out a way to fix this in master, but yeah, a note in the
documentation is in order.

+1

And how would we like it to work?

Here's a plan:

Have a mechanism in the standby, to track how far the master has archived
its WAL, and don't throw away WAL in the standby that hasn't been archived
in the master yet. This is similar to the physical replication slots, which
prevent the master from recycling WAL that a standby hasn't received yet,
but in reverse. I think we can use the .done and .ready files for this.
Whenever a file is streamed (completely) from the master, create a .ready
file for it. When we get an acknowledgement from the master that it has
archived it, create a .done file for it. To get the information from the
master, add the "last archived WAL segment" e.g. in the streaming
replication keep-alive message, or invent a new message type for it.

Sounds OK to me.

How does this work in the cascading replication case? Does the cascading
walsender just relay the archive location to the downstream standby?

Hmm. Yeah, I guess so.

What happens when WAL streaming is terminated and the startup process starts to
read WAL files from the archive? After reading a WAL file from the archive,
we would probably need to change the .ready files of all older WAL files to .done.

I suppose. Although there's no big harm in leaving them in .ready state.
As soon as you reconnect, the primary will tell if they were archived.
If the server is promoted before reconnecting, it will try to archive
the files and archive_command will see that they are already in the
archive. It has to be prepared for that situation anyway, so that's OK too.

Here's a first cut at this. It includes the changes from your
standby_wal_archiving_v1.patch, so you get that behaviour if you set
archive_mode='always', and the new behaviour I wanted with
archive_mode='shared'. I wrote it on top of the other patch I posted
recently to not archive bogus recycled WAL segments after promotion
(/messages/by-id/549489FA.4010304@vmware.com), but
it seems to apply without it too.

I suggest reading the documentation changes first, it hopefully explains
pretty well how to use this. The code should work too, and comments on
that are welcome too, but I haven't tested it much. I'll do more testing
next week.

- Heikki

Attachments:

0001-Make-WAL-archival-behave-more-sensibly-in-standby-mo.patch (text/x-diff, +351 -63)

#6Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#5)
Re: Streaming replication and WAL archive interactions

Hi,

On 2014-12-19 22:56:40 +0200, Heikki Linnakangas wrote:

This adds two new archive_modes, 'shared' and 'always', to indicate whether
the WAL archive is shared between the primary and standby, or not. In
shared mode, the standby tracks which files have been archived by the
primary. The standby refrains from recycling files that the primary has
not yet archived, and at failover, the standby archives all those files too
from the old timeline. In 'always' mode, the standby's WAL archive is
taken to be separate from the primary's, and the standby independently
archives all files it receives from the primary.

I don't really like this approach. Sharing an archive is rather dangerous
in my experience - if your old master comes up again (and writes in the
last wal file) or similar, you can get into really bad situations.

What I was thinking about was instead trying to detect the point up to
which files were safely archived by running restore command to check for
the presence of archived files. Then archive anything that has valid
content and isn't yet archived. That doesn't sound particularly
complicated to me.

Greetings,

Andres Freund

In reply to: Andres Freund (#6)
Re: [HACKERS] Streaming replication and WAL archive interactions

This should be a very common setup in the field, so how are people doing it in practice?

One possible workaround for combining archiving with streaming was to run pg_receivexlog against the standby to copy WAL into the archive, although pg_receivexlog also had an issue with fsync.

[ master ] -- streaming --> [ standby ] -- pg_receivexlog --> [ /archive ]

In that case the archive always tracks the state of the standby, which could be better than having a broken archive on promotion.
--
Misha

#8Venkata B Nagothi
nag1010@gmail.com
In reply to: Heikki Linnakangas (#5)
Re: Streaming replication and WAL archive interactions

Here's a first cut at this. It includes the changes from your
standby_wal_archiving_v1.patch, so you get that behaviour if you set
archive_mode='always', and the new behaviour I wanted with
archive_mode='shared'. I wrote it on top of the other patch I posted
recently to not archive bogus recycled WAL segments after promotion (
/messages/by-id/549489FA.4010304@vmware.com), but it
seems to apply without it too.

I suggest reading the documentation changes first, it hopefully explains
pretty well how to use this. The code should work too, and comments on that
are welcome too, but I haven't tested it much. I'll do more testing next
week.

The patch did not apply cleanly to the latest master. Can you please
rebase?

Regards,
Venkata Balaji N

#9Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Venkata B Nagothi (#8)
Re: Streaming replication and WAL archive interactions

On 03/01/2015 12:36 AM, Venkata Balaji N wrote:

The patch did not apply cleanly to the latest master. Can you please
rebase?

Here you go.

On 01/31/2015 03:07 PM, Andres Freund wrote:

On 2014-12-19 22:56:40 +0200, Heikki Linnakangas wrote:

This adds two new archive_modes, 'shared' and 'always', to indicate whether
the WAL archive is shared between the primary and standby, or not. In
shared mode, the standby tracks which files have been archived by the
primary. The standby refrains from recycling files that the primary has
not yet archived, and at failover, the standby archives all those files too
from the old timeline. In 'always' mode, the standby's WAL archive is
taken to be separate from the primary's, and the standby independently
archives all files it receives from the primary.

I don't really like this approach. Sharing an archive is rather dangerous
in my experience - if your old master comes up again (and writes in the
last wal file) or similar, you can get into really bad situations.

It doesn't have to actually be shared. The master and standby could
archive to different locations, but the responsibility of archiving is
shared, so that on promotion, the standby ensures that every WAL file
gets archived. If the master didn't do it, then the standby will.

Yes, if the master comes up again, it might try to archive a file that
the standby already archived. But that's not so bad. Both copies of the
file will be identical. You could put logic in archive_command to check,
if the file already exists in the archive, whether the contents are
identical, and return success without doing anything if they are.

Oh, hang on, that's not necessarily true. On promotion, the standby
archives the last, partial WAL segment from the old timeline. That's
just wrong
(/messages/by-id/52FCD37C.3070806@vmware.com), and
in fact I somehow thought I changed that already, but apparently not. So
let's stop doing that.

What I was thinking about was instead trying to detect the point up to
which files were safely archived by running restore command to check for
the presence of archived files. Then archive anything that has valid
content and isn't yet archived. That doesn't sound particularly
complicated to me.

Hmm. That assumes that the standby has a valid restore_command, and can
access the WAL archive. Not a too unreasonable requirement I guess, but
with the scheme I proposed, it's not necessary. Seems a bit silly to
copy a whole segment from the archive just to check if it exists, though.

- Heikki

Attachments:

v2-0001-Make-WAL-archival-behave-more-sensibly-in-standby.patch (application/x-patch, +351 -63)

#10Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#9)
Re: Streaming replication and WAL archive interactions

On Thu, Apr 16, 2015 at 8:57 PM, Heikki Linnakangas wrote:

Oh, hang on, that's not necessarily true. On promotion, the standby archives
the last, partial WAL segment from the old timeline. That's just wrong
(/messages/by-id/52FCD37C.3070806@vmware.com), and in
fact I somehow thought I changed that already, but apparently not. So let's
stop doing that.

Er. Are you planning to prevent the standby from archiving the last partial
segment from the old timeline at promotion? I thought from previous
discussions that we should do it as master (be it crashed, burned, buried
or dead) may not have the occasion to do it. By preventing its archiving
you close the door to the case where master did not have the occasion to
archive it.

+/* */
+static char primary_last_archived[MAX_XFN_CHARS + 1];
This is visibly missing a comment.

As primary_last_archived is used only by ProcessArchivalReport(), wouldn't
it be better to pass it as argument to this function?

+       /* Check that the filename the primary reported looks valid */
+       if (strlen(primary_last_archived) < 24 ||
+               strspn(primary_last_archived, "0123456789ABCDEF") != 24)
+               return;
Not related to this patch, but we had better have a macro doing this job I
think... It keeps spreading around.
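
One possible shape for such a shared check, sketched in C. The names WAL_FNAME_LEN and looks_like_wal_segment_name are hypothetical and only for illustration. Note that this version is slightly stricter than the quoted test: it also rejects names that have extra characters after the 24 hex digits.

```c
#include <string.h>

/* 8 hex digits each for timeline, log and segment number. */
#define WAL_FNAME_LEN 24

/*
 * Hypothetical helper (not from the patch): return 1 if fname looks like a
 * WAL segment file name, i.e. exactly 24 upper-case hexadecimal characters,
 * and 0 otherwise.
 */
int
looks_like_wal_segment_name(const char *fname)
{
    return strlen(fname) == WAL_FNAME_LEN &&
        strspn(fname, "0123456789ABCDEF") == WAL_FNAME_LEN;
}
```

A single helper like this would let every caller validate a reported file name the same way instead of repeating the strlen/strspn pair.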

People may be surprised that a base backup taken from a node that has
archive_mode = on set (that's the case in a very large number of cases)
will not be able to work as-is as node startup will fail as follows:
FATAL: archive_mode='on' cannot be used in archive recovery
HINT: Use 'shared' or 'always' mode instead.
One idea would be to simply ignore the fact that archive_mode = on on nodes
in recovery instead of dropping an error. Note that I like the fact that it
drops an error as that's clear, I just point the fact that people may be
surprised that base backups are not working anymore now in this case.

Are both WalSndArchivalReport() and WalSndArchivalReportIfNecessary()
really necessary? I think that for simplicity you could merge them and use
last_archival_report as a local variable.

Creating a dependency between the pgstat machinery and the WAL sender looks
weak to me. For example with this patch a master cannot stop, as it waits
indefinitely:
LOG: using stale statistics instead of current ones because stats
collector is not responding
LOG: sending archival report:
You could scan archive_status/ but that would be costly if there are many
entries to scan and I think that walsender should be highly responsive. Or
you could directly store the name of the lastly archived WAL segment marked
as .done in let's say archive_status/last_archived. An entry for that in
the control file does not seem the right place as a node may not have
archive_mode enabled that's why I am not mentioning it.

Regards,
--
Michael

#11Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Michael Paquier (#10)
Re: Streaming replication and WAL archive interactions

On 04/21/2015 09:53 AM, Michael Paquier wrote:

On Thu, Apr 16, 2015 at 8:57 PM, Heikki Linnakangas wrote:

Oh, hang on, that's not necessarily true. On promotion, the standby archives
the last, partial WAL segment from the old timeline. That's just wrong
(/messages/by-id/52FCD37C.3070806@vmware.com), and in
fact I somehow thought I changed that already, but apparently not. So let's
stop doing that.

Er. Are you planning to prevent the standby from archiving the last partial
segment from the old timeline at promotion?

Yes.

I thought from previous discussions that we should do it as master
(be it crashed, burned, buried or dead) may not have the occasion to
do it. By preventing its archiving you close the door to the case
where master did not have the occasion to archive it.

The current situation is a mess:

1. Even though we archive the last segment in the standby, there is no
guarantee that the master had archived all the previous segments already.

2. If the master is not totally dead, it might try to archive the same
file with more WAL in it, at the same time or just afterwards, or even
just before the standby has completed promotion. Which copy do you keep
in the archive? Having to deal with that makes the archive_command more
complicated.

Note that even though we don't archive the partial last segment on the
previous timeline, the same WAL is copied to the first segment on the
new timeline. So the WAL isn't lost.

People may be surprised that a base backup taken from a node that has
archive_mode = on set (that's the case in a very large number of cases)
will not be able to work as-is as node startup will fail as follows:
FATAL: archive_mode='on' cannot be used in archive recovery
HINT: Use 'shared' or 'always' mode instead.

Hmm, good point.

One idea would be to simply ignore the fact that archive_mode = on on nodes
in recovery instead of dropping an error. Note that I like the fact that it
drops an error as that's clear, I just point the fact that people may be
surprised that base backups are not working anymore now in this case.

By "ignore", what behaviour do you mean? Would "on" be equivalent to
"shared", "always", or something else?

Or we could keep the current behaviour with archive_mode=on (except for
the last segment thing, which is just wrong), where the standby only
archives the new timeline, and nothing from the previous timelines. Are
there use cases where you'd want that, rather than the new "shared" mode?
I wanted to keep the 'on' mode for backwards-compatibility, but if that
causes more problems, it might be better to just remove it and force the
admin to choose what kind of a setup he has, with "shared" or "always".

Creating a dependency between the pgstat machinery and the WAL sender looks
weak to me. For example with this patch a master cannot stop, as it waits
indefinitely:
LOG: using stale statistics instead of current ones because stats
collector is not responding
LOG: sending archival report:

Hmm, yeah, having walsender to wait for the stats file to appear is not
good.

You could scan archive_status/ but that would be costly if there are many
entries to scan and I think that walsender should be highly responsive. Or
you could directly store the name of the lastly archived WAL segment marked
as .done in let's say archive_status/last_archived. An entry for that in
the control file does not seem the right place as a node may not have
archive_mode enabled that's why I am not mentioning it.

The ways that the archiver process can communicate with the rest of the
system are limited, for the sake of robustness. Writing to the control
file is definitely not OK. I think using the stats collector is OK for
this, but we'll have to arrange it so that the walsender doesn't block
on it, and should probably not force new stat file so often. A 5-10
seconds old stats file would be perfectly fine for this purpose.

- Heikki

#12Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#11)
Re: Streaming replication and WAL archive interactions

On Tue, Apr 21, 2015 at 4:38 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 04/21/2015 09:53 AM, Michael Paquier wrote:

On Thu, Apr 16, 2015 at 8:57 PM, Heikki Linnakangas wrote:

Oh, hang on, that's not necessarily true. On promotion, the standby archives
the last, partial WAL segment from the old timeline. That's just wrong
(/messages/by-id/52FCD37C.3070806@vmware.com), and in
fact I somehow thought I changed that already, but apparently not. So let's
stop doing that.

Er. Are you planning to prevent the standby from archiving the last partial
segment from the old timeline at promotion?

Yes.

I thought from previous discussions that we should do it as master
(be it crashed, burned, buried or dead) may not have the occasion to
do it. By preventing its archiving you close the door to the case
where master did not have the occasion to archive it.

The current situation is a mess:

1. Even though we archive the last segment in the standby, there is no
guarantee that the master had archived all the previous segments already.

2. If the master is not totally dead, it might try to archive the same file
with more WAL in it, at the same time or just afterwards, or even just
before the standby has completed promotion. Which copy do you keep in the
archive? Having to deal with that makes the archive_command more
complicated.

Note that even though we don't archive the partial last segment on the
previous timeline, the same WAL is copied to the first segment on the new
timeline. So the WAL isn't lost.

But if the failed master has archived those segments safely, we may need
them, no? I am not sure we can ignore a user who would want to do a PITR
with recovery_target_timeline pointing to the one of the failed master.

People may be surprised that a base backup taken from a node that has
archive_mode = on set (that's the case in a very large number of cases)
will not be able to work as-is as node startup will fail as follows:
FATAL: archive_mode='on' cannot be used in archive recovery
HINT: Use 'shared' or 'always' mode instead.

Hmm, good point.

One idea would be to simply ignore the fact that archive_mode = on on nodes
in recovery instead of dropping an error. Note that I like the fact that it
drops an error as that's clear, I just point the fact that people may be
surprised that base backups are not working anymore now in this case.

By "ignore", what behaviour do you mean? Would "on" be equivalent to
"shared", "always", or something else?

I meant something backward-compatible, with files marked as .done when they
are finished replaying... But now my words *are* weird as on != off ;)

Or we could keep the current behaviour with archive_mode=on (except for the
last segment thing, which is just wrong), where the standby only archives
the new timeline, and nothing from the previous timelines.

I guess this would solve the issue here then, which is not a bad thing in
itself:
/messages/by-id/20140918180734.361021e1@erg
We would need to check if the situation improves with the 'always' mode btw.

Are there use cases where you'd want that, rather than the new "shared"
mode? I wanted to keep the 'on' mode for backwards-compatibility, but if
that causes more problems, it might be better to just remove it and force
the admin to choose what kind of a setup he has, with "shared" or "always".

The 'on' mode is still useful IMO to get behavior as close as possible to what
previous releases did.
Regards,
--
Michael

#13Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Michael Paquier (#12)
Re: Streaming replication and WAL archive interactions

On 04/21/2015 12:04 PM, Michael Paquier wrote:

On Tue, Apr 21, 2015 at 4:38 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Note that even though we don't archive the partial last segment on the
previous timeline, the same WAL is copied to the first segment on the new
timeline. So the WAL isn't lost.

But if the failed master has archived those segments safely, we may need
them, no? I am not sure we can ignore a user who would want to do a PITR
with recovery_target_timeline pointing to the one of the failed master.

I think it would be acceptable. If you want to maintain an
up-to-the-second archive, you can use pg_receivexlog. Mind you, if the
standby wasn't promoted, the partial segment would not be present in the
archive anyway. And you can copy the WAL segment manually from
0000000200000000000000XX to pg_xlog/0000000100000000000000XX before
starting PITR.

Another thought is that we could archive the partial file, but with a
different name to avoid confusing it with the full segment. For example,
we could archive a partial 000000010000000000000012 segment as
"000000020000000000000012.00000128.partial", where 00000128 indicates
how far that file is valid (this naming is similar to how the backup
history files are named). Recovery wouldn't automatically pick up those
files, but the DBA could easily copy the partial file into pg_xlog with
the full segment's name, if he wants to do PITR to that piece of WAL.
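
To make the proposed naming concrete, here is a minimal sketch that splits such a file name into its parts. The function and type names are hypothetical, and reading the middle field as a hexadecimal offset is an assumption based on the comparison with backup history files; nothing here is taken from the patch itself.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical pattern for the naming proposed above:
#   <24 hex digits of segment name>.<8 hex digits>.partial
PARTIAL_RE = re.compile(
    r"^([0-9A-F]{8})([0-9A-F]{8})([0-9A-F]{8})\.([0-9A-F]{8})\.partial$"
)


class PartialSegment(NamedTuple):
    timeline: int      # first 8 hex digits of the segment name
    log: int           # middle 8 hex digits
    seg: int           # last 8 hex digits
    valid_upto: int    # assumed: hex offset up to which the file is valid
    segment_name: str  # the 24-character name a full segment would have


def parse_partial_name(name: str) -> Optional[PartialSegment]:
    """Return the parsed parts, or None if name does not match the pattern."""
    m = PARTIAL_RE.match(name)
    if m is None:
        return None
    tli, log, seg, upto = (int(g, 16) for g in m.groups())
    return PartialSegment(tli, log, seg, upto, name[:24])
```

A DBA-side script could use segment_name as the target file name when copying the partial file into pg_xlog by hand.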

Are there use cases where you'd want that, rather than the new "shared"
mode? I wanted to keep the 'on' mode for backwards-compatibility, but if
that causes more problems, it might be better to just remove it and force
the admin to choose what kind of a setup he has, with "shared" or "always".

The 'on' mode is still useful IMO to get behavior as close as possible to
what previous releases did.

But would you ever want the old behaviour, rather than the new shared or
always behaviour?

- Heikki

#14Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#13)
Re: Streaming replication and WAL archive interactions

On Tue, Apr 21, 2015 at 6:55 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 04/21/2015 12:04 PM, Michael Paquier wrote:

On Tue, Apr 21, 2015 at 4:38 PM, Heikki Linnakangas <hlinnaka@iki.fi>
wrote:

Note that even though we don't archive the partial last segment on the
previous timeline, the same WAL is copied to the first segment on the new
timeline. So the WAL isn't lost.

But if the failed master has archived those segments safely, we may need
them, no? I am not sure we can ignore a user who would want to do a PITR
with recovery_target_timeline pointing to the one of the failed master.

I think it would be acceptable. If you want to maintain an up-to-the-second
archive, you can use pg_receivexlog. Mind you, if the standby wasn't
promoted, the partial segment would not be present in the archive anyway.
And you can copy the WAL segment manually from 0000000200000000000000XX to
pg_xlog/0000000100000000000000XX before starting PITR.

Another thought is that we could archive the partial file, but with a
different name to avoid confusing it with the full segment. For example, we
could archive a partial 000000010000000000000012 segment as
"000000020000000000000012.00000128.partial", where 00000128 indicates how
far that file is valid (this naming is similar to how the backup history
files are named). Recovery wouldn't automatically pick up those files, but
the DBA could easily copy the partial file into pg_xlog with the full
segment's name, if he wants to do PITR to that piece of WAL.

So, suppose you have A replicating to B (via an archive) replicating to C
(via a separate archive); A dies, B is promoted. It sounds to me like
today this will work and with your proposed change it will require
manual intervention. I don't think that's OK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#14)
Re: Streaming replication and WAL archive interactions

On Wed, Apr 22, 2015 at 6:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:

So, suppose you have A replicating to B (via an archive) replicating to C
(via a separate archive); A dies, B is promoted. It sounds to me like
today this will work and with your proposed change it will require
manual intervention. I don't think that's OK.

This is going to change a behavior that people have been used to for a
couple of releases. I would not mind having this patch do
"archive_mode = on during recovery" => archive only segments generated
by this node + the last partial segment on the old timeline at
promotion, without renaming, to preserve backward-compatible behavior.
If master and standby point to separate archive locations, the operator
can sort out later which one he wants to use. If they point to the same
location, archive_command scripts already do such renaming internally,
or at least that's what I suspect an experienced user would do. Admittedly,
not many people are experienced in this area; I see mistakes regarding
such things on a weekly basis... This .partial segment renaming is
something that we should let the archive_command manage with its
internal logic.
Regards,
--
Michael

#16Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#14)
Re: Streaming replication and WAL archive interactions

On 04/22/2015 12:42 AM, Robert Haas wrote:

So, suppose you have A replicating to B (via an archive) replicating to C
(via a separate archive); A dies, B is promoted. It sounds to me like
today this will work and with your proposed change it will require
manual intervention.

No. If there is no streaming replication involved, no partial files will
be archived, with or without this patch. There is no change to that
scenario.

Note that it's a bit complicated to set up that scenario today.
Archiving is never enabled in recovery mode, so you'll need to use a
custom cron job or something to maintain the archive that C uses. The
files will not automatically flow from B to the second archive. With the
patch we're discussing, however, it would be easy: just set
archive_mode='always' in B.

- Heikki

#17Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Michael Paquier (#15)
Re: Streaming replication and WAL archive interactions

On 04/22/2015 03:30 AM, Michael Paquier wrote:

This is going to change a behavior that people have been used to for a
couple of releases. I would not mind having this patch do
"archive_mode = on during recovery" => archive only segments generated
by this node + the last partial segment on the old timeline at
promotion, without renaming, to preserve backward-compatible behavior.
If master and standby point to separate archive locations, the operator
can sort out later which one he wants to use. If they point to the same
location, archive_command scripts already do such renaming internally,
or at least that's what I suspect an experienced user would do. Admittedly,
not many people are experienced in this area; I see mistakes regarding
such things on a weekly basis... This .partial segment renaming is
something that we should let the archive_command manage with its
internal logic.

Currently, the archive command doesn't know if the segment it's
archiving is partial or not, so you can't put any logic there to manage
it. But if we archive it with the .partial suffix, then you can put
logic in the restore_command to check for .partial files, if you really
want to.

I feel that the best approach is to archive the last, partial segment,
but with the .partial suffix. I don't see any plausible real-world setup
where the current behaviour would be better. I don't really see much
need to archive the partial segment at all, but there's also no harm in
doing it, as long as it's clearly marked with the .partial suffix.

BTW, pg_receivexlog also uses a ".partial" file, while it's streaming
WAL from the server. The .partial suffix is removed when the segment is
complete. So there's some precedent for this. pg_receivexlog adds just
".partial" to the filename, it doesn't add any information of what
portion of the file is valid like I suggested here, though. Perhaps we
should follow pg_receivexlog's example at promotion too, for consistency.
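
As an illustration of what checking for .partial files on the restore side could look like, here is a rough sketch of a helper that a restore_command wrapper script might call. The function name, the allow_partial switch and the plain directory-based archive are all assumptions made for the example; it uses the simple ".partial" suffix in the pg_receivexlog style, not the longer name with an offset.

```python
import os
import shutil


def restore_segment(archive_dir: str, wal_name: str, dest_path: str,
                    allow_partial: bool = False) -> bool:
    """Copy wal_name from archive_dir to dest_path; return True on success.

    The complete segment is always preferred. Only if it is missing and the
    caller opted in do we fall back to "<wal_name>.partial".
    """
    candidates = [wal_name]
    if allow_partial:
        candidates.append(wal_name + ".partial")
    for candidate in candidates:
        src = os.path.join(archive_dir, candidate)
        if os.path.isfile(src):
            shutil.copyfile(src, dest_path)
            return True
    # Nothing usable found; a wrapper script would exit non-zero here.
    return False
```

Keeping the fallback opt-in matches the idea that recovery should not pick up partial files automatically, and that using one is a deliberate decision by the DBA.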

- Heikki

#18Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#17)
Re: Streaming replication and WAL archive interactions

On Wed, Apr 22, 2015 at 3:38 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Currently, the archive command doesn't know if the segment it's archiving is
partial or not, so you can't put any logic there to manage it. But if we
archive it with the .partial suffix, then you can put logic in the
restore_command to check for .partial files, if you really want to.

Well, now you can check as well if there is a file with the same name
already archived and append a suffix to the new file copied, keep the
two files, and then let restore_command sort things out as it wants
with the two segment files it finds.

I feel that the best approach is to archive the last, partial segment, but
with the .partial suffix. I don't see any plausible real-world setup where
the current behavior would be better. I don't really see much need to
archive the partial segment at all, but there's also no harm in doing it, as
long as it's clearly marked with the .partial suffix.

Well, as long as it is clearly archived at promotion, even with a
suffix, I guess that I am fine... This will need some tweaking on
restore_command for existing applications, but as long as it is
clearly documented I am fine. Shouldn't this be a different patch
though?

BTW, pg_receivexlog also uses a ".partial" file, while it's streaming WAL
from the server. The .partial suffix is removed when the segment is
complete. So there's some precedent for this. pg_receivexlog adds just
".partial" to the filename, it doesn't add any information of what portion
of the file is valid like I suggested here, though. Perhaps we should follow
pg_receivexlog's example at promotion too, for consistency.

Consistency here sounds good to me.
--
Michael

#19Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#16)
Re: Streaming replication and WAL archive interactions

On Wed, Apr 22, 2015 at 2:17 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Note that it's a bit complicated to set up that scenario today. Archiving is
never enabled in recovery mode, so you'll need to use a custom cron job or
something to maintain the archive that C uses. The files will not
automatically flow from B to the second archive. With the patch we're
discussing, however, it would be easy: just set archive_mode='always' in B.

Hmm, I see. But if C never replays the last, partial segment from the
old timeline, how does it follow the timeline switch?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#20Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#15)
Re: Streaming replication and WAL archive interactions

On Tue, Apr 21, 2015 at 8:30 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

This .partial segment renaming is something that we
should let the archive_command manage with its internal logic.

This strikes me as equivalent to saying "we don't know how to make
this work right, but maybe our users will know". That never works
out. As things stand, we have a situation where the archive_command
examples in our documentation are known to be flawed. They don't
fsync the file, and they'll write a partial file and then, when rerun,
fail to copy the full file because there's already something there.
Efforts have been made to fix these problems (see the pg_copy thread),
but they haven't been completed yet, nor have we even documented the
issues with the commands recommended by the documentation. Let's
please not throw anything else on the pile of things we're expecting
users to somehow "get right".
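
For illustration only, the two flaws named above (no fsync, and a leftover partial copy that blocks the rerun) could be avoided along these lines. This is a sketch under stated assumptions, not the pg_copy tool mentioned above: the function name, the ".tmp" suffix and the local-directory archive are hypothetical, and the directory fsync assumes a POSIX system.

```python
import filecmp
import os
import shutil


def archive_segment(src_path: str, archive_dir: str) -> bool:
    """Durably copy src_path into archive_dir.

    Returns True if the file was archived, or was already present with
    identical contents; False if a different file with that name exists.
    """
    name = os.path.basename(src_path)
    final_path = os.path.join(archive_dir, name)
    if os.path.exists(final_path):
        # Never overwrite; succeed only if it is byte-for-byte the same file.
        return filecmp.cmp(src_path, final_path, shallow=False)

    # Write under a temporary name so that a crash mid-copy never leaves a
    # truncated file under the final name. A stale .tmp is simply overwritten.
    tmp_path = final_path + ".tmp"
    with open(src_path, "rb") as src, open(tmp_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
        dst.flush()
        os.fsync(dst.fileno())      # force file contents to stable storage
    os.rename(tmp_path, final_path)  # atomic within one POSIX filesystem

    # fsync the directory so the rename itself survives a crash (POSIX).
    dir_fd = os.open(archive_dir, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return True
```

A rerun after an interrupted copy then succeeds, because the final name only ever appears once the full contents have been written and synced.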

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#21Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#19)
#22Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#21)
#23Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#22)
#24Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#23)
#25Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#24)
#26Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Michael Paquier (#18)
#27Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#26)
#28Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#27)
#29Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#28)
#30Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#29)
#31Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#30)
#32Andrey Borodin
amborodin@acm.org
In reply to: Heikki Linnakangas (#27)
#33Harinath Kanchu
hkanchu@apple.com
In reply to: Andrey Borodin (#32)
#34Jaroslav Novikov
njrslv@yandex-team.ru
In reply to: Andrey Borodin (#32)
#35Jaroslav Novikov
njrslv@yandex-team.ru
In reply to: Andrey Borodin (#32)