Streaming replication and WAL archive interactions

Started by Heikki Linnakangas about 11 years ago, 31 messages
#1 Heikki Linnakangas
hlinnakangas@vmware.com

There have been a few threads on the behavior of WAL archiving after a
standby server is promoted [1] [2]. In short, it doesn't work as you
might expect. The standby will start archiving after it's promoted, but
it will not archive files that were replicated from the old master via
streaming replication. If those files were not already archived in the
master before the promotion, they are not archived at all. That's not
good if you wanted to restore from a base backup + the WAL archive later.

The basic setup is a master server, a standby, a WAL archive that's
shared by both, and streaming replication between the master and
standby. This should be a very common setup in the field, so how are
people doing it in practice? Just live with the risk that you might miss
some files in the archive if you promote? Don't even realize there's a
problem? Something else?

And how would we like it to work?

There was some discussion in August on enabling WAL archiving in the
standby, always [3]. That's a related idea, but it assumes that you have
a separate archive in the master and the standby. The problem at
promotion happens when you have a shared archive between the master and
standby.

[1]: /messages/by-id/CAHGQGwHVYqbX=A+zo+AvFbVHLGoypO9G_QDKbabeXgXBVGd05g@mail.gmail.com

[2]: /messages/by-id/20140904175036.310c6466@erg

[3]: /messages/by-id/CAHGQGwHNMs-syU=MEVSESTHna+Exd9pfO_OHHFPJCwOVaYRZKw@mail.gmail.com.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2 Borodin Vladimir
root@simply.name
In reply to: Heikki Linnakangas (#1)
Re: Streaming replication and WAL archive interactions

On 12 Dec 2014, at 16:46, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

There have been a few threads on the behavior of WAL archiving, after a standby server is promoted [1] [2]. In short, it doesn't work as you might expect. The standby will start archiving after it's promoted, but it will not archive files that were replicated from the old master via streaming replication. If those files were not already archived in the master before the promotion, they are not archived at all. That's not good if you wanted to restore from a base backup + the WAL archive later.

The basic setup is a master server, a standby, a WAL archive that's shared by both, and streaming replication between the master and standby. This should be a very common setup in the field, so how are people doing it in practice? Just live with the risk that you might miss some files in the archive if you promote? Don't even realize there's a problem? Something else?

Yes, I do live like that (with streaming replication and shared archive between master and replicas) and don’t even realize there’s a problem :( And I think I’m not the only one. Maybe at least a note should be added to the documentation?

And how would we like it to work?

There was some discussion in August on enabling WAL archiving in the standby, always [3]. That's a related idea, but it assumes that you have a separate archive in the master and the standby. The problem at promotion happens when you have a shared archive between the master and standby.

AFAIK most people use the scheme with shared archive.

[1] /messages/by-id/CAHGQGwHVYqbX=A+zo+AvFbVHLGoypO9G_QDKbabeXgXBVGd05g@mail.gmail.com

[2] /messages/by-id/20140904175036.310c6466@erg

[3] /messages/by-id/CAHGQGwHNMs-syU=MEVSESTHna+Exd9pfO_OHHFPJCwOVaYRZKw@mail.gmail.com.

- Heikki

--
Vladimir

#3 Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Borodin Vladimir (#2)
Re: Streaming replication and WAL archive interactions

On 12/16/2014 10:24 AM, Borodin Vladimir wrote:

On 12 Dec 2014, at 16:46, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

There have been a few threads on the behavior of WAL archiving,
after a standby server is promoted [1] [2]. In short, it doesn't
work as you might expect. The standby will start archiving after
it's promoted, but it will not archive files that were replicated
from the old master via streaming replication. If those files were
not already archived in the master before the promotion, they are
not archived at all. That's not good if you wanted to restore from
a base backup + the WAL archive later.

The basic setup is a master server, a standby, a WAL archive that's
shared by both, and streaming replication between the master and
standby. This should be a very common setup in the field, so how
are people doing it in practice? Just live with the risk that you
might miss some files in the archive if you promote? Don't even
realize there's a problem? Something else?

Yes, I do live like that (with streaming replication and shared
archive between master and replicas) and don’t even realize there’s a
problem :( And I think I’m not the only one. Maybe at least a note
should be added to the documentation?

Let's try to figure out a way to fix this in master, but yeah, a note in
the documentation is in order.

And how would we like it to work?

Here's a plan:

Have a mechanism in the standby, to track how far the master has
archived its WAL, and don't throw away WAL in the standby that hasn't
been archived in the master yet. This is similar to the physical
replication slots, which prevent the master from recycling WAL that a
standby hasn't received yet, but in reverse. I think we can use the
.done and .ready files for this. Whenever a file is streamed
(completely) from the master, create a .ready file for it. When we get
an acknowledgement from the master that it has archived it, create a
.done file for it. To get the information from the master, add the "last
archived WAL segment" e.g. in the streaming replication keep-alive
message, or invent a new message type for it.
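
As a rough illustration of the bookkeeping described above (not code from any patch), the standby-side decision could be sketched in C like this. The names TrackedSegment, segment_streamed and apply_archival_report are hypothetical. The sketch keeps the status in memory instead of creating .ready and .done files, and it assumes 24-character hexadecimal segment names whose first 8 characters are the timeline ID, compared with the timeline part ignored.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SEG_NAME_LEN 24		/* 8 hex digits of timeline + 16 of position */

/* In-memory stand-ins for the .ready and .done status files. */
typedef enum { SEG_READY, SEG_DONE } SegStatus;

typedef struct
{
	char		name[SEG_NAME_LEN + 1];
	SegStatus	status;
} TrackedSegment;

/* True if name looks like a WAL segment file name: exactly 24 hex digits. */
bool
is_segment_name(const char *name)
{
	return strlen(name) == SEG_NAME_LEN &&
		strspn(name, "0123456789ABCDEF") == SEG_NAME_LEN;
}

/* A segment has been streamed completely: remember it as "ready". */
bool
segment_streamed(TrackedSegment *seg, const char *name)
{
	if (!is_segment_name(name))
		return false;
	memcpy(seg->name, name, SEG_NAME_LEN + 1);
	seg->status = SEG_READY;
	return true;
}

/*
 * The master reported its last archived segment: every tracked segment at
 * or before it (ignoring the timeline part) becomes "done". Returns the
 * number of segments whose status changed.
 */
int
apply_archival_report(TrackedSegment *segs, size_t nsegs,
					  const char *last_archived)
{
	int			changed = 0;

	if (!is_segment_name(last_archived))
		return 0;
	for (size_t i = 0; i < nsegs; i++)
	{
		if (segs[i].status == SEG_READY &&
			strcmp(segs[i].name + 8, last_archived + 8) <= 0)
		{
			segs[i].status = SEG_DONE;
			changed++;
		}
	}
	return changed;
}
```

Anything still marked "ready" when the standby is promoted is what it would have to archive itself. Relying on string comparison works only because the names are fixed-width uppercase hex.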

At promotion, archive all the WAL from the old timeline that the master
hadn't already archived. While doing this, the archive_command can be
called for files that have in fact already been archived in the master,
so the command needs to return success if it's asked to archive a file
and an identical file already exists in the archive. That's a bit
difficult to write into a one-liner, but hopefully we can still provide
an example of this. Or have another command, e.g.
"promotion_archive_command", which can just assume that everything is OK
if the file already exists.
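
To make the "succeed if an identical file already exists" requirement concrete, here is a minimal C sketch of the logic such an archive helper would need. It is only an illustration under stated assumptions: files_identical and archive_segment are hypothetical names, not part of PostgreSQL, and a real helper would also write to a temporary name, fsync, and rename into place.

```c
#include <stdbool.h>
#include <stdio.h>

/* Byte-wise comparison; true only if both files are readable and identical. */
bool
files_identical(const char *path_a, const char *path_b)
{
	FILE	   *a = fopen(path_a, "rb");
	FILE	   *b = fopen(path_b, "rb");
	bool		same = (a != NULL && b != NULL);

	while (same)
	{
		int			ca = fgetc(a);
		int			cb = fgetc(b);

		if (ca != cb)
			same = false;
		else if (ca == EOF)
			break;				/* both files ended at the same point */
	}
	if (same && (ferror(a) || ferror(b)))
		same = false;
	if (a != NULL)
		fclose(a);
	if (b != NULL)
		fclose(b);
	return same;
}

/*
 * Archive src as dst. Returns 0 on success, including the case where an
 * identical file is already in the archive, and nonzero on failure.
 */
int
archive_segment(const char *src, const char *dst)
{
	FILE	   *in;
	FILE	   *out;
	char		buf[8192];
	size_t		n;
	bool		ok = true;

	out = fopen(dst, "rb");
	if (out != NULL)
	{
		/* Already present: succeed only if the contents match. */
		fclose(out);
		return files_identical(src, dst) ? 0 : 1;
	}

	in = fopen(src, "rb");
	if (in == NULL)
		return 1;
	out = fopen(dst, "wb");
	if (out == NULL)
	{
		fclose(in);
		return 1;
	}
	while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
	{
		if (fwrite(buf, 1, n, out) != n)
		{
			ok = false;
			break;
		}
	}
	if (ferror(in))
		ok = false;
	fclose(in);
	if (fclose(out) != 0)
		ok = false;
	if (!ok)
		remove(dst);			/* do not leave a partial file behind */
	return ok ? 0 : 1;
}
```

The important property is that a second attempt with the same segment reports success, while an attempt to replace an archived segment with different contents fails and leaves the archive untouched.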

To enable this new mode, let's add a third option to archive_mode,
besides on/off. Or just make this the default; I'm not sure if anyone
would want the old behavior.

There was some discussion in August on enabling WAL archiving in
the standby, always [3]. That's a related idea, but it assumes that
you have a separate archive in the master and the standby. The
problem at promotion happens when you have a shared archive between
the master and standby.

AFAIK most people use the scheme with shared archive.

Yeah. Anyway, we can support both scenarios.

- Heikki

#4 Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#3)
Re: Streaming replication and WAL archive interactions

On Wed, Dec 17, 2014 at 4:11 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 12/16/2014 10:24 AM, Borodin Vladimir wrote:

On 12 Dec 2014, at 16:46, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

There have been a few threads on the behavior of WAL archiving,
after a standby server is promoted [1] [2]. In short, it doesn't
work as you might expect. The standby will start archiving after
it's promoted, but it will not archive files that were replicated
from the old master via streaming replication. If those files were
not already archived in the master before the promotion, they are
not archived at all. That's not good if you wanted to restore from
a base backup + the WAL archive later.

The basic setup is a master server, a standby, a WAL archive that's
shared by both, and streaming replication between the master and
standby. This should be a very common setup in the field, so how
are people doing it in practice? Just live with the risk that you
might miss some files in the archive if you promote? Don't even
realize there's a problem? Something else?

Yes, I do live like that (with streaming replication and shared
archive between master and replicas) and don’t even realize there’s a
problem :( And I think I’m not the only one. Maybe at least a note
should be added to the documentation?

Let's try to figure out a way to fix this in master, but yeah, a note in the
documentation is in order.

+1

And how would we like it to work?

Here's a plan:

Have a mechanism in the standby, to track how far the master has archived
its WAL, and don't throw away WAL in the standby that hasn't been archived
in the master yet. This is similar to the physical replication slots, which
prevent the master from recycling WAL that a standby hasn't received yet,
but in reverse. I think we can use the .done and .ready files for this.
Whenever a file is streamed (completely) from the master, create a .ready
file for it. When we get an acknowledgement from the master that it has
archived it, create a .done file for it. To get the information from the
master, add the "last archived WAL segment" e.g. in the streaming
replication keep-alive message, or invent a new message type for it.

Sounds OK to me.

How does this work in cascade replication case? The cascading walsender
just relays the archive location to the downstream standby?

What happens when WAL streaming is terminated and the startup process starts to
read the WAL file from the archive? After reading the WAL file from the archive,
probably we would need to change the .ready file of every older WAL file to .done.

Regards,

--
Fujii Masao

#5 Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Fujii Masao (#4)
1 attachment(s)
Re: Streaming replication and WAL archive interactions

On 12/18/2014 12:32 PM, Fujii Masao wrote:

On Wed, Dec 17, 2014 at 4:11 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 12/16/2014 10:24 AM, Borodin Vladimir wrote:

On 12 Dec 2014, at 16:46, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

There have been a few threads on the behavior of WAL archiving,
after a standby server is promoted [1] [2]. In short, it doesn't
work as you might expect. The standby will start archiving after
it's promoted, but it will not archive files that were replicated
from the old master via streaming replication. If those files were
not already archived in the master before the promotion, they are
not archived at all. That's not good if you wanted to restore from
a base backup + the WAL archive later.

The basic setup is a master server, a standby, a WAL archive that's
shared by both, and streaming replication between the master and
standby. This should be a very common setup in the field, so how
are people doing it in practice? Just live with the risk that you
might miss some files in the archive if you promote? Don't even
realize there's a problem? Something else?

Yes, I do live like that (with streaming replication and shared
archive between master and replicas) and don’t even realize there’s a
problem :( And I think I’m not the only one. Maybe at least a note
should be added to the documentation?

Let's try to figure out a way to fix this in master, but yeah, a note in the
documentation is in order.

+1

And how would we like it to work?

Here's a plan:

Have a mechanism in the standby, to track how far the master has archived
its WAL, and don't throw away WAL in the standby that hasn't been archived
in the master yet. This is similar to the physical replication slots, which
prevent the master from recycling WAL that a standby hasn't received yet,
but in reverse. I think we can use the .done and .ready files for this.
Whenever a file is streamed (completely) from the master, create a .ready
file for it. When we get an acknowledgement from the master that it has
archived it, create a .done file for it. To get the information from the
master, add the "last archived WAL segment" e.g. in the streaming
replication keep-alive message, or invent a new message type for it.

Sounds OK to me.

How does this work in cascade replication case? The cascading walsender
just relays the archive location to the downstream standby?

Hmm. Yeah, I guess so.

What happens when WAL streaming is terminated and the startup process starts to
read the WAL file from the archive? After reading the WAL file from the archive,
probably we would need to change the .ready file of every older WAL file to .done.

I suppose. Although there's no big harm in leaving them in .ready state.
As soon as you reconnect, the primary will tell if they were archived.
If the server is promoted before reconnecting, it will try to archive
the files and archive_command will see that they are already in the
archive. It has to be prepared for that situation anyway, so that's OK too.

Here's a first cut at this. It includes the changes from your
standby_wal_archiving_v1.patch, so you get that behaviour if you set
archive_mode='always', and the new behaviour I wanted with
archive_mode='shared'. I wrote it on top of the other patch I posted
recently to not archive bogus recycled WAL segments after promotion
(/messages/by-id/549489FA.4010304@vmware.com), but
it seems to apply without it too.
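
For readers who want the difference between the modes at a glance, the following C sketch summarizes what happens to a WAL segment once it is complete, as described in the patch's documentation changes. It is a simplification with hypothetical names (SketchArchiveMode, action_for_completed_segment); in the patch itself, archive_mode='on' in archive recovery is rejected once at startup, not per segment.

```c
#include <stdbool.h>

/* Hypothetical mirror of the four archive_mode values discussed here. */
typedef enum
{
	SKETCH_MODE_OFF,
	SKETCH_MODE_ON,
	SKETCH_MODE_SHARED,
	SKETCH_MODE_ALWAYS
} SketchArchiveMode;

/* What a server does with a WAL segment that has just been completed. */
typedef enum
{
	ACTION_NONE,				/* archiving is disabled */
	ACTION_REJECT_CONFIG,		/* 'on' is refused in archive recovery */
	ACTION_ARCHIVE_NOW,			/* hand the segment to the archiver */
	ACTION_HOLD_FOR_PROMOTION	/* keep it until the primary confirms
								 * archival; archive it if promoted first */
} SketchAction;

SketchAction
action_for_completed_segment(SketchArchiveMode mode, bool in_recovery)
{
	if (mode == SKETCH_MODE_OFF)
		return ACTION_NONE;

	/* Outside recovery, on, shared and always behave alike. */
	if (!in_recovery)
		return ACTION_ARCHIVE_NOW;

	switch (mode)
	{
		case SKETCH_MODE_ALWAYS:
			return ACTION_ARCHIVE_NOW;
		case SKETCH_MODE_SHARED:
			return ACTION_HOLD_FOR_PROMOTION;
		default:
			return ACTION_REJECT_CONFIG;
	}
}
```

So a standby that shares the archive with its primary would use 'shared', and a standby with its own archive would use 'always'.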

I suggest reading the documentation changes first, it hopefully explains
pretty well how to use this. The code should work too, and comments on
that are welcome too, but I haven't tested it much. I'll do more testing
next week.

- Heikki

Attachments:

0001-Make-WAL-archival-behave-more-sensibly-in-standby-mo.patch (text/x-diff)
From 03dced40178c0a0b7c28ff630a15cf664995525d Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 16 Dec 2014 23:09:03 +0200
Subject: [PATCH 1/1] Make WAL archival behave more sensibly in standby mode.

This adds two new archive_modes, 'shared' and 'always', to indicate whether
the WAL archive is shared between the primary and standby, or not. In
shared mode, the standby tracks which files have been archived by the
primary. The standby refrains from recycling files that the primary has
not yet archived, and at failover, the standby archives all those files too
from the old timeline. In 'always' mode, the standby's WAL archive is
taken to be separate from the primary's, and the standby independently
archives all files it receives from the primary.

Fujii Masao and me.
---
 doc/src/sgml/config.sgml                      |  12 +-
 doc/src/sgml/high-availability.sgml           |  48 +++++++
 doc/src/sgml/protocol.sgml                    |  31 +++++
 src/backend/access/transam/xlog.c             |  29 ++++-
 src/backend/postmaster/postmaster.c           |  37 ++++--
 src/backend/replication/walreceiver.c         | 172 ++++++++++++++++++++------
 src/backend/replication/walsender.c           |  47 +++++++
 src/backend/utils/misc/guc.c                  |  21 ++--
 src/backend/utils/misc/postgresql.conf.sample |   2 +-
 src/include/access/xlog.h                     |  14 ++-
 10 files changed, 351 insertions(+), 62 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 48ae3e4..986d6eb 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2475,7 +2475,7 @@ include_dir 'conf.d'
 
     <variablelist>
      <varlistentry id="guc-archive-mode" xreflabel="archive_mode">
-      <term><varname>archive_mode</varname> (<type>boolean</type>)
+      <term><varname>archive_mode</varname> (<type>enum</type>)
       <indexterm>
        <primary><varname>archive_mode</> configuration parameter</primary>
       </indexterm>
@@ -2484,7 +2484,15 @@ include_dir 'conf.d'
        <para>
         When <varname>archive_mode</> is enabled, completed WAL segments
         are sent to archive storage by setting
-        <xref linkend="guc-archive-command">.
+        <xref linkend="guc-archive-command">. In addition to <literal>off</>,
+        to disable, there are three modes: <literal>on</>, <literal>shared</>,
+        and <literal>always</>. During normal operation, there is no
+        difference between the three modes, but in archive recovery or
+        standby mode, it indicates whether the WAL archive is shared between
+        the primary and the standby server or not. See
+        <xref linkend="continuous-archiving-in-standby"> for details.
+       </para>  
+       <para>
         <varname>archive_mode</> and <varname>archive_command</> are
         separate variables so that <varname>archive_command</> can be
         changed without leaving archiving mode.
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index d249959..c22b15a 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1220,6 +1220,54 @@ primary_slot_name = 'node_a_slot'
 
    </sect3>
   </sect2>
+
+  <sect2 id="continuous-archiving-in-standby">
+   <title>Continuous archiving in standby</title>
+
+   <indexterm>
+     <primary>continuous archiving</primary>
+     <secondary>in standby</secondary>
+   </indexterm>
+
+   <para>
+     When continuous WAL archiving is used in a standby, there are two
+     different scenarios: the WAL archive can be shared between the primary
+     and the standby, or the standby can have its own WAL archive. In the
+     shared archive scenario, <varname>archive_mode</varname> must be set to
+     <literal>shared</literal>, and in the separate archive scenario, to
+     <literal>always</literal>. Setting it to <literal>on</literal> in a
+     standby server, or when performing point-in-time recovery, is not
+     allowed and an error will be raised. When a server is not in recovery
+     mode, there is no difference between <literal>on</literal>,
+     <literal>shared</literal>, and <literal>always</literal> modes.
+   </para>
+
+   <para>
+     In <literal>shared</literal> archive mode, the standby server tries to
+     ensure that the archive is complete, even if the primary crashes and
+     failover happens. The standby server will not archive any WAL segments
+     as long as it is in standby mode; it is the primary server's
+     responsibility to do so. It will, however, keep track of which files
+     have already been archived by the primary, and if failover happens, it
+     takes over and attempts to archive any files that the primary had not
+     yet archived.
+   </para>
+
+   <para>
+     In <literal>always</literal> archive mode, the standby server will
+     archive all WAL it receives, whether it's through streaming replication
+     or by restoring from the primary's archive using
+     <varname>restore_command</varname>.
+   </para>
+
+   <para>
+     In cascading replication, the first standby server and the cascaded
+     standby servers can use <varname>archive_mode</varname> settings. In
+     each standby, it should be set to <literal>shared</literal> or
+     <literal>always</literal>, depending on whether that standby shares the
+     archive with the primary or standby it is connected to.
+   </para>
+  </sect2>
   </sect1>
 
   <sect1 id="warm-standby-failover">
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index efe75ea..60235fe 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1646,6 +1646,37 @@ The commands accepted in walsender mode are:
       </para>
       </listitem>
       </varlistentry>
+      <varlistentry>
+      <term>
+          WAL archival report message (B)
+      </term>
+      <listitem>
+      <para>
+      <variablelist>
+      <varlistentry>
+      <term>
+          Byte1('a')
+      </term>
+      <listitem>
+      <para>
+          Tells the receiver the last archived WAL segment.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          Byte<replaceable>n</replaceable>
+      </term>
+      <listitem>
+      <para>
+          Filename of the latest archived file.
+      </para>
+      </listitem>
+      </varlistentry>
+      </variablelist>
+      </para>
+      </listitem>
+      </varlistentry>
       </variablelist>
      </para>
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e267ca1..9d0c672 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -83,7 +83,7 @@ int			CheckPointSegments = 3;
 int			wal_keep_segments = 0;
 int			XLOGbuffers = -1;
 int			XLogArchiveTimeout = 0;
-bool		XLogArchiveMode = false;
+int			XLogArchiveMode = ARCHIVE_MODE_OFF;
 char	   *XLogArchiveCommand = NULL;
 bool		EnableHotStandby = false;
 bool		fullPageWrites = true;
@@ -139,6 +139,25 @@ const struct config_enum_entry sync_method_options[] = {
 	{NULL, 0, false}
 };
 
+
+/*
+ * Although only "on", "off", "shared", and "always" are documented,
+ * we accept all the likely variants of "on" and "off".
+ */
+const struct config_enum_entry archive_mode_options[] = {
+	{"shared", ARCHIVE_MODE_SHARED, false},
+	{"always", ARCHIVE_MODE_ALWAYS, false},
+	{"on", ARCHIVE_MODE_ON, false},
+	{"off", ARCHIVE_MODE_OFF, false},
+	{"true", ARCHIVE_MODE_ON, true},
+	{"false", ARCHIVE_MODE_OFF, true},
+	{"yes", ARCHIVE_MODE_ON, true},
+	{"no", ARCHIVE_MODE_OFF, true},
+	{"1", ARCHIVE_MODE_ON, true},
+	{"0", ARCHIVE_MODE_OFF, true},
+	{NULL, 0, false}
+};
+
 /*
  * Statistics for current checkpoint are collected in this global struct.
  * Because only the checkpointer or a stand-alone backend can perform
@@ -757,7 +776,7 @@ static MemoryContext walDebugCxt = NULL;
 #endif
 
 static void readRecoveryCommandFile(void);
-static void exitArchiveRecovery(TimeLineID endTLI, XLogSegNo endLogSegNo);
+static void exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog);
 static bool recoveryStopsBefore(XLogReaderState *record);
 static bool recoveryStopsAfter(XLogReaderState *record);
 static void recoveryPausesHere(void);
@@ -5825,6 +5844,12 @@ StartupXLOG(void)
 
 	if (ArchiveRecoveryRequested)
 	{
+		/* archive_mode=on is not allowed during archive recovery. */
+		if (XLogArchiveMode == ARCHIVE_MODE_ON)
+			ereport(ERROR,
+					(errmsg("archive_mode='on' cannot be used in archive recovery"),
+					 (errhint("Use 'shared' or 'always' mode instead."))));
+
 		if (StandbyModeRequested)
 			ereport(LOG,
 					(errmsg("entering standby mode")));
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 5106f52..b41e34d 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -824,9 +824,9 @@ PostmasterMain(int argc, char *argv[])
 		write_stderr("%s: max_wal_senders must be less than max_connections\n", progname);
 		ExitPostmaster(1);
 	}
-	if (XLogArchiveMode && wal_level == WAL_LEVEL_MINIMAL)
+	if (XLogArchiveMode > ARCHIVE_MODE_OFF && wal_level == WAL_LEVEL_MINIMAL)
 		ereport(ERROR,
-				(errmsg("WAL archival (archive_mode=on) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
+				(errmsg("WAL archival (archive_mode=on/always/shared) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
 	if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
 		ereport(ERROR,
 				(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
@@ -1624,13 +1624,21 @@ ServerLoop(void)
 				start_autovac_launcher = false; /* signal processed */
 		}
 
-		/* If we have lost the archiver, try to start a new one */
-		if (XLogArchivingActive() && PgArchPID == 0 && pmState == PM_RUN)
-			PgArchPID = pgarch_start();
-
-		/* If we have lost the stats collector, try to start a new one */
-		if (PgStatPID == 0 && pmState == PM_RUN)
-			PgStatPID = pgstat_start();
+		/*
+		 * If we have lost the archiver, try to start a new one.
+		 *
+		 * If WAL archiving is enabled always, we try to start a new archiver
+		 * even during recovery.
+		 */
+		if (PgArchPID == 0 && wal_level >= WAL_LEVEL_ARCHIVE)
+		{
+			if ((pmState == PM_RUN && XLogArchiveMode > ARCHIVE_MODE_OFF) ||
+				((pmState == PM_RECOVERY || pmState == PM_HOT_STANDBY) &&
+				 XLogArchiveMode == ARCHIVE_MODE_ALWAYS))
+			{
+				PgArchPID = pgarch_start();
+			}
+		}
 
 		/* If we need to signal the autovacuum launcher, do so now */
 		if (avlauncher_needs_signal)
@@ -4796,6 +4804,17 @@ sigusr1_handler(SIGNAL_ARGS)
 		Assert(BgWriterPID == 0);
 		BgWriterPID = StartBackgroundWriter();
 
+		/*
+		 * Start the archiver if we're responsible for (re-)archiving received
+		 * files.
+		 */
+		Assert(PgArchPID == 0);
+		if (wal_level >= WAL_LEVEL_ARCHIVE &&
+			XLogArchiveMode == ARCHIVE_MODE_ALWAYS)
+		{
+			PgArchPID = pgarch_start();
+		}
+
 		pmState = PM_RECOVERY;
 	}
 	if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index c2d4ed3..b178d4f 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -52,8 +52,11 @@
 #include "libpq/pqformat.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
+#include "pgstat.h"
+#include "postmaster/pgarch.h"
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/pmsignal.h"
 #include "storage/procarray.h"
@@ -107,6 +110,9 @@ static struct
 	XLogRecPtr	Flush;			/* last byte + 1 flushed in the standby */
 }	LogstreamResult;
 
+/* Name of the last WAL file the primary has reported as archived */
+static char primary_last_archived[MAX_XFN_CHARS + 1];
+
 static StringInfoData reply_message;
 static StringInfoData incoming_message;
 
@@ -141,6 +147,7 @@ static void XLogWalRcvFlush(bool dying);
 static void XLogWalRcvSendReply(bool force, bool requestReply);
 static void XLogWalRcvSendHSFeedback(bool immed);
 static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
+static void ProcessArchivalReport(void);
 
 /* Signal handlers */
 static void WalRcvSigHupHandler(SIGNAL_ARGS);
@@ -537,21 +544,12 @@ WalReceiverMain(void)
 		 */
 		if (recvFile >= 0)
 		{
-			char		xlogfname[MAXFNAMELEN];
-
 			XLogWalRcvFlush(false);
 			if (close(recvFile) != 0)
 				ereport(PANIC,
 						(errcode_for_file_access(),
 						 errmsg("could not close log segment %s: %m",
 								XLogFileNameP(recvFileTLI, recvSegNo))));
-
-			/*
-			 * Create .done file forcibly to prevent the streamed segment from
-			 * being archived later.
-			 */
-			XLogFileName(xlogfname, recvFileTLI, recvSegNo);
-			XLogArchiveForceDone(xlogfname);
 		}
 		recvFile = -1;
 
@@ -857,6 +855,26 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 					XLogWalRcvSendReply(true, false);
 				break;
 			}
+		case 'a':				/* Archival report */
+			{
+				/* the content of the message is a filename */
+				if (len >= sizeof(primary_last_archived))
+					ereport(ERROR,
+							(errcode(ERRCODE_PROTOCOL_VIOLATION),
+							 errmsg_internal("invalid archival report message with length %d",
+											 (int) len)));
+				memcpy(primary_last_archived, buf, len);
+				primary_last_archived[len] = '\0';
+				if (strspn(buf, VALID_XFN_CHARS) != len)
+				{
+					primary_last_archived[0] = '\0';
+					ereport(ERROR,
+							(errcode(ERRCODE_PROTOCOL_VIOLATION),
+							 errmsg_internal("unexpected character in primary's last archived filename")));
+				}
+				ProcessArchivalReport();
+				break;
+			}
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -878,39 +896,18 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	{
 		int			segbytes;
 
-		if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo))
+		if (!XLByteInSeg(recptr, recvSegNo))
 		{
 			bool		use_existent;
 
 			/*
-			 * fsync() and close current file before we switch to next one. We
-			 * would otherwise have to reopen this file to fsync it later
+			 * We take care to always close the current file, after writing
+			 * the last byte to it. So this shouldn't happen.
 			 */
 			if (recvFile >= 0)
-			{
-				char		xlogfname[MAXFNAMELEN];
-
-				XLogWalRcvFlush(false);
-
-				/*
-				 * XLOG segment files will be re-read by recovery in startup
-				 * process soon, so we don't advise the OS to release cache
-				 * pages associated with the file like XLogFileClose() does.
-				 */
-				if (close(recvFile) != 0)
-					ereport(PANIC,
-							(errcode_for_file_access(),
-							 errmsg("could not close log segment %s: %m",
-									XLogFileNameP(recvFileTLI, recvSegNo))));
-
-				/*
-				 * Create .done file forcibly to prevent the streamed segment
-				 * from being archived later.
-				 */
-				XLogFileName(xlogfname, recvFileTLI, recvSegNo);
-				XLogArchiveForceDone(xlogfname);
-			}
-			recvFile = -1;
+				ereport(ERROR,
+						(errmsg("unexpected WAL receive location %s",
+								XLogFileNameP(recvFileTLI, recvSegNo))));
 
 			/* Create/use new log file */
 			XLByteToSeg(recptr, recvSegNo);
@@ -965,6 +962,51 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 		buf += byteswritten;
 
 		LogstreamResult.Write = recptr;
+
+		/*
+		 * If we just wrote the last byte to this segment, fsync() and close
+		 * current file before we switch to next one. We would otherwise have
+		 * to reopen this file to fsync it later.
+		 */
+		if (recvOff == XLOG_SEG_SIZE)
+		{
+			char		xlogfname[MAXFNAMELEN];
+
+			XLogWalRcvFlush(false);
+
+			/*
+			 * XLOG segment files will be re-read by recovery in startup
+			 * process soon, so we don't advise the OS to release cache
+			 * pages associated with the file like XLogFileClose() does.
+			 */
+			if (close(recvFile) != 0)
+				ereport(PANIC,
+						(errcode_for_file_access(),
+						 errmsg("could not close log segment %s: %m",
+								XLogFileNameP(recvFileTLI, recvSegNo))));
+			recvFile = -1;
+
+			/*
+			 * Now that this segment is complete, do we need to archive it?
+			 *
+			 * In 'always' mode, we clearly need to archive this.
+			 *
+			 * In 'shared' mode, we might need to, if we get promoted before
+			 * the master has archived this file, so create a .ready file. It
+			 * will be replaced with .done later, if we get acknowledgement
+			 * from the primary that this has already been archived.
+			 *
+			 * In 'on' mode, we're only responsible for WAL we've generated
+			 * ourselves.
+			 */
+			if (XLogArchiveMode == ARCHIVE_MODE_ALWAYS ||
+				XLogArchiveMode == ARCHIVE_MODE_SHARED)
+			{
+				XLogFileName(xlogfname, recvFileTLI, recvSegNo);
+
+				XLogArchiveCheckDone(xlogfname);
+			}
+		}
 	}
 }
 
@@ -1215,3 +1257,61 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
 		pfree(receipttime);
 	}
 }
+
+/*
+ * Create .done and .ready files, based on the master's last archival report.
+ */
+static void
+ProcessArchivalReport(void)
+{
+	DIR		   *xldir;
+	struct dirent *xlde;
+
+	elog(DEBUG2, "received archival report from master: %s",
+		 primary_last_archived);
+
+	if (XLogArchiveMode != ARCHIVE_MODE_SHARED)
+		return;
+
+	/* Check that the filename the primary reported looks valid */
+	if (strlen(primary_last_archived) < 24 ||
+		strspn(primary_last_archived, "0123456789ABCDEF") != 24)
+		return;
+
+	xldir = AllocateDir(XLOGDIR);
+	if (xldir == NULL)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open transaction log directory \"%s\": %m",
+						XLOGDIR)));
+
+	while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+	{
+		/*
+		 * We ignore the timeline part of the XLOG segment identifiers in
+		 * deciding whether a segment has already been archived.  Only the
+		 * log and segment parts of the filename (everything after the first
+		 * 8 characters) are compared, so segments on any timeline at or
+		 * before the reported position are marked as done. Being more
+		 * selective about timelines would be a whole lot more complicated.
+		 *
+		 * We use the alphanumeric sorting property of the filenames to decide
+		 * which ones are at or before the primary's last archived segment.
+		 */
+		if (strlen(xlde->d_name) == 24 &&
+			strspn(xlde->d_name, "0123456789ABCDEF") == 24 &&
+			strcmp(xlde->d_name + 8, primary_last_archived + 8) <= 0)
+		{
+			XLogArchiveForceDone(xlde->d_name);
+		}
+	}
+
+	FreeDir(xldir);
+
+	/*
+	 * Remember this location in pgstat as well. This makes it visible in
+	 * pg_stat_archiver, and allows the location to be relayed to cascaded
+	 * standbys.
+	 */
+	pgstat_send_archiver(primary_last_archived, false);
+}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 019ae6a..b51bd80 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -55,6 +55,7 @@
 #include "libpq/pqformat.h"
 #include "miscadmin.h"
 #include "nodes/replnodes.h"
+#include "pgstat.h"
 #include "replication/basebackup.h"
 #include "replication/decode.h"
 #include "replication/logical.h"
@@ -153,6 +154,7 @@ static StringInfoData tmpbuf;
  * wal_sender_timeout doesn't need to be active.
  */
 static TimestampTz last_reply_timestamp = 0;
+static char last_archival_report[MAX_XFN_CHARS + 1] = "";
 
 /* Have we sent a heartbeat message asking for reply, since last reply? */
 static bool waiting_for_ping_response = false;
@@ -210,6 +212,8 @@ static void ProcessStandbyHSFeedbackMessage(void);
 static void ProcessRepliesIfAny(void);
 static void WalSndKeepalive(bool requestReply);
 static void WalSndKeepaliveIfNecessary(TimestampTz now);
+static void WalSndArchivalReport(void);
+static void WalSndArchivalReportIfNecessary(void);
 static void WalSndCheckTimeOut(TimestampTz now);
 static long WalSndComputeSleeptime(TimestampTz now);
 static void WalSndPrepareWrite(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
@@ -2889,6 +2893,11 @@ WalSndKeepaliveIfNecessary(TimestampTz now)
 	TimestampTz ping_time;
 
 	/*
+	 * Send an archival status message, if necessary.
+	 */
+	WalSndArchivalReportIfNecessary();
+
+	/*
 	 * Don't send keepalive messages if timeouts are globally disabled or
 	 * we're doing something not partaking in timeouts.
 	 */
@@ -2917,6 +2926,44 @@ WalSndKeepaliveIfNecessary(TimestampTz now)
 }
 
 /*
+ * This function is used to send archival report message to standby.
+ */
+static void
+WalSndArchivalReport(void)
+{
+	elog(LOG, "sending archival report: %s", last_archival_report);
+
+	/* construct the message... */
+	resetStringInfo(&output_message);
+	pq_sendbyte(&output_message, 'a');
+	pq_sendbytes(&output_message, last_archival_report, strlen(last_archival_report));
+
+	/* ... and send it wrapped in CopyData */
+	pq_putmessage_noblock('d', output_message.data, output_message.len);
+}
+
+static void
+WalSndArchivalReportIfNecessary(void)
+{
+	PgStat_ArchiverStats *archiver_stats;
+
+	archiver_stats = pgstat_fetch_stat_archiver();
+
+	if (strcmp(last_archival_report, archiver_stats->last_archived_wal) == 0)
+	{
+		pgstat_clear_snapshot();
+		return;
+	}
+
+	strlcpy(last_archival_report, archiver_stats->last_archived_wal,
+			sizeof(last_archival_report));
+
+	pgstat_clear_snapshot();
+
+	WalSndArchivalReport();
+}
+
+/*
  * This isn't currently used for anything. Monitoring tools might be
  * interested in the future, and we'll need something like this in the
  * future for synchronous replication.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index b1bff7f..6f5b284 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -424,6 +424,7 @@ static const struct config_enum_entry row_security_options[] = {
  * Options for enum values stored in other modules
  */
 extern const struct config_enum_entry wal_level_options[];
+extern const struct config_enum_entry archive_mode_options[];
 extern const struct config_enum_entry sync_method_options[];
 extern const struct config_enum_entry dynamic_shared_memory_options[];
 
@@ -1458,16 +1459,6 @@ static struct config_bool ConfigureNamesBool[] =
 	},
 
 	{
-		{"archive_mode", PGC_POSTMASTER, WAL_ARCHIVING,
-			gettext_noop("Allows archiving of WAL files using archive_command."),
-			NULL
-		},
-		&XLogArchiveMode,
-		false,
-		NULL, NULL, NULL
-	},
-
-	{
 		{"hot_standby", PGC_POSTMASTER, REPLICATION_STANDBY,
 			gettext_noop("Allows connections and queries during recovery."),
 			NULL
@@ -3446,6 +3437,16 @@ static struct config_enum ConfigureNamesEnum[] =
 	},
 
 	{
+		{"archive_mode", PGC_POSTMASTER, WAL_ARCHIVING,
+			gettext_noop("Allows archiving of WAL files using archive_command."),
+			NULL
+		},
+		&XLogArchiveMode,
+		ARCHIVE_MODE_OFF, archive_mode_options,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"trace_recovery_messages", PGC_SIGHUP, DEVELOPER_OPTIONS,
 			gettext_noop("Enables logging of recovery-related debugging information."),
 			gettext_noop("Each level includes all the levels that follow it. The later"
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index b053659..fff5de0 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -204,7 +204,7 @@
 
 # - Archiving -
 
-#archive_mode = off		# allows archiving to be done
+#archive_mode = off		# allows archiving to be done; off, on, shared, or always
 				# (change requires restart)
 #archive_command = ''		# command to use to archive a logfile segment
 				# placeholders: %p = path of file to archive
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index d06fbc0..c9448c7 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -93,13 +93,22 @@ extern int	CheckPointSegments;
 extern int	wal_keep_segments;
 extern int	XLOGbuffers;
 extern int	XLogArchiveTimeout;
-extern bool XLogArchiveMode;
 extern char *XLogArchiveCommand;
 extern bool EnableHotStandby;
 extern bool fullPageWrites;
 extern bool wal_log_hints;
 extern bool log_checkpoints;
 
+/* Archive modes */
+typedef enum ArchiveMode
+{
+	ARCHIVE_MODE_OFF = 0,	/* disabled */
+	ARCHIVE_MODE_ON,		/* enabled while server is running normally */
+	ARCHIVE_MODE_SHARED,	/* archive is shared with master */
+	ARCHIVE_MODE_ALWAYS		/* enabled always (even during recovery) */
+} ArchiveMode;
+extern int XLogArchiveMode;
+
 /* WAL levels */
 typedef enum WalLevel
 {
@@ -110,7 +119,8 @@ typedef enum WalLevel
 } WalLevel;
 extern int	wal_level;
 
-#define XLogArchivingActive()	(XLogArchiveMode && wal_level >= WAL_LEVEL_ARCHIVE)
+#define XLogArchivingActive() \
+	(XLogArchiveMode > ARCHIVE_MODE_OFF && wal_level >= WAL_LEVEL_ARCHIVE)
 #define XLogArchiveCommandSet() (XLogArchiveCommand[0] != '\0')
 
 /*
-- 
2.1.3

#6Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#5)
Re: Streaming replication and WAL archive interactions

Hi,

On 2014-12-19 22:56:40 +0200, Heikki Linnakangas wrote:

This adds two new archive_modes, 'shared' and 'always', to indicate whether
the WAL archive is shared between the primary and standby, or not. In
shared mode, the standby tracks which files have been archived by the
primary. The standby refrains from recycling files that the primary has
not yet archived, and at failover, the standby archives all those files too
from the old timeline. In 'always' mode, the standby's WAL archive is
taken to be separate from the primary's, and the standby independently
archives all files it receives from the primary.

I don't really like this approach. Sharing an archive is rather dangerous
in my experience - if your old master comes up again (and writes in the
last wal file) or similar, you can get into really bad situations.

What I was thinking about was instead trying to detect the point up to
which files were safely archived by running restore command to check for
the presence of archived files. Then archive anything that has valid
content and isn't yet archived. That doesn't sound particularly
complicated to me.

Greetings,

Andres Freund


In reply to: Andres Freund (#6)
Re: [HACKERS] Streaming replication and WAL archive interactions

  This should be a very common setup in the field, so how are people doing it in practice?

One possible workaround for combining an archive with streaming replication was to run pg_receivexlog against the standby to copy WAL into the archive, though pg_receivexlog also had an issue with fsync.

[ master ] -- streaming --> [ standby ] -- pg_receivexlog --> [ /archive ]

In that case the archive is always in the pre-standby state, which could be better than having the archive broken on promotion.
--
Misha

#8Venkata Balaji N
nag1010@gmail.com
In reply to: Heikki Linnakangas (#5)
Re: Streaming replication and WAL archive interactions

Here's a first cut at this. It includes the changes from your
standby_wal_archiving_v1.patch, so you get that behaviour if you set
archive_mode='always', and the new behaviour I wanted with
archive_mode='shared'. I wrote it on top of the other patch I posted
recently to not archive bogus recycled WAL segments after promotion (
/messages/by-id/549489FA.4010304@vmware.com), but it
seems to apply without it too.

I suggest reading the documentation changes first, it hopefully explains
pretty well how to use this. The code should work too, and comments on that
are welcome too, but I haven't tested it much. I'll do more testing next
week.

Patch did not get applied successfully to the latest master. Can you please
rebase?

Regards,
Venkata Balaji N

#9Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Venkata Balaji N (#8)
1 attachment(s)
Re: Streaming replication and WAL archive interactions

On 03/01/2015 12:36 AM, Venkata Balaji N wrote:

Patch did not get applied successfully to the latest master. Can you please
rebase?

Here you go.

On 01/31/2015 03:07 PM, Andres Freund wrote:

On 2014-12-19 22:56:40 +0200, Heikki Linnakangas wrote:

This adds two new archive_modes, 'shared' and 'always', to indicate whether
the WAL archive is shared between the primary and standby, or not. In
shared mode, the standby tracks which files have been archived by the
primary. The standby refrains from recycling files that the primary has
not yet archived, and at failover, the standby archives all those files too
from the old timeline. In 'always' mode, the standby's WAL archive is
taken to be separate from the primary's, and the standby independently
archives all files it receives from the primary.

I don't really like this approach. Sharing an archive is rather dangerous
in my experience - if your old master comes up again (and writes in the
last wal file) or similar, you can get into really bad situations.

It doesn't have to actually be shared. The master and standby could
archive to different locations, but the responsibility of archiving is
shared, so that on promotion, the standby ensures that every WAL file
gets archived. If the master didn't do it, then the standby will.
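
To make the tracking concrete, here is a minimal sketch of the filename test the standby can apply to each local segment when the primary reports its last archived file. It mirrors the comparison in ProcessArchivalReport() in the attached patch, but the helper names (looks_like_wal_segment, covered_by_archival_report) are hypothetical and not part of the patch, and unlike the patch it requires the reported name to be exactly 24 characters.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* A WAL segment file name is 24 hex digits: timeline, log, segment (8 each). */
#define WAL_FNAME_LEN 24

/* True if 'name' consists of exactly 24 upper-case hex digits. */
bool
looks_like_wal_segment(const char *name)
{
    assert(name != NULL);
    return strlen(name) == WAL_FNAME_LEN &&
        strspn(name, "0123456789ABCDEF") == WAL_FNAME_LEN;
}

/*
 * Hypothetical helper: true if the local segment 'local' is at or before
 * the segment the primary last reported as archived.  The first 8
 * characters (the timeline) are skipped, so only the log and segment
 * parts are compared; fixed-width hex names sort in WAL order.
 */
bool
covered_by_archival_report(const char *local, const char *primary_last_archived)
{
    if (!looks_like_wal_segment(local) ||
        !looks_like_wal_segment(primary_last_archived))
        return false;
    return strcmp(local + 8, primary_last_archived + 8) <= 0;
}
```

Segments for which this returns true can get a .done file; everything else keeps its .ready file, so it is archived by the standby after promotion.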

Yes, if the master comes up again, it might try to archive a file that
the standby already archived. But that's not so bad. Both copies of the
file will be identical. You could put logic in archive_command to check,
if the file already exists in the archive, whether the contents are
identical, and return success without doing anything if they are.
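
As a rough illustration of that archive_command logic, the sketch below copies a segment only if it is not already in the archive, and treats an identical existing copy as success. Everything here is an assumption for illustration: the names archive_segment and same_contents and the result codes are made up, the archive is assumed to be a plain directory, and a real command would also need to fsync the copy and write it atomically (for example via a temporary name and rename).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Result codes for the hypothetical archive_segment() helper below. */
enum archive_result
{
    ARCHIVE_COPIED = 0,         /* file was not in the archive; copied it */
    ARCHIVE_ALREADY_THERE = 1,  /* identical copy was already archived */
    ARCHIVE_CONFLICT = 2,       /* a different file with that name exists */
    ARCHIVE_IO_ERROR = 3        /* could not read or write a file */
};

/* Compare two open files byte by byte; true if the contents are identical. */
bool
same_contents(FILE *a, FILE *b)
{
    int ca, cb;

    assert(a != NULL && b != NULL);
    do
    {
        ca = fgetc(a);
        cb = fgetc(b);
        if (ca != cb)
            return false;
    } while (ca != EOF);
    return true;
}

/* Archive src_path as dst_path, tolerating an identical existing copy. */
enum archive_result
archive_segment(const char *src_path, const char *dst_path)
{
    FILE *src = fopen(src_path, "rb");
    FILE *dst;
    enum archive_result result = ARCHIVE_COPIED;
    int c;

    if (src == NULL)
        return ARCHIVE_IO_ERROR;

    dst = fopen(dst_path, "rb");
    if (dst != NULL)
    {
        /* Already archived: succeed only if the contents match. */
        result = same_contents(src, dst) ? ARCHIVE_ALREADY_THERE
                                         : ARCHIVE_CONFLICT;
        fclose(dst);
        fclose(src);
        return result;
    }

    dst = fopen(dst_path, "wb");
    if (dst == NULL)
    {
        fclose(src);
        return ARCHIVE_IO_ERROR;
    }
    while ((c = fgetc(src)) != EOF)
    {
        if (fputc(c, dst) == EOF)
        {
            result = ARCHIVE_IO_ERROR;
            break;
        }
    }
    if (ferror(src))
        result = ARCHIVE_IO_ERROR;
    if (fclose(dst) != 0)
        result = ARCHIVE_IO_ERROR;
    fclose(src);
    return result;
}
```

A wrapper used as archive_command would exit with status 0 for ARCHIVE_COPIED and ARCHIVE_ALREADY_THERE, and non-zero otherwise, so that the server retries or reports the conflict.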

Oh, hang on, that's not necessarily true. On promotion, the standby
archives the last, partial WAL segment from the old timeline. That's
just wrong
(/messages/by-id/52FCD37C.3070806@vmware.com), and
in fact I somehow thought I changed that already, but apparently not. So
let's stop doing that.

What I was thinking about was instead trying to detect the point up to
which files were safely archived by running restore command to check for
the presence of archived files. Then archive anything that has valid
content and isn't yet archived. That doesn't sound particularly
complicated to me.

Hmm. That assumes that the standby has a valid restore_command, and can
access the WAL archive. Not too unreasonable a requirement I guess, but
with the scheme I proposed, it's not necessary. Seems a bit silly to
copy a whole segment from the archive just to check if it exists, though.
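
For comparison, a cheaper probe is possible if the archive happens to be a directory the standby can see directly. That is an assumption, and it is not what restore_command gives you in general. The sketch below uses the POSIX access() call to find the first segment that is missing from such an archive; the function name first_unarchived and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: given segment file names in WAL order, return the
 * index of the first one that is not present in archive_dir, or nsegments
 * if all of them are there.  Only existence is checked, not contents.
 */
int
first_unarchived(const char *archive_dir, const char *const *segments,
                 int nsegments)
{
    char path[1024];
    int i;

    assert(archive_dir != NULL && segments != NULL);
    for (i = 0; i < nsegments; i++)
    {
        int n = snprintf(path, sizeof(path), "%s/%s", archive_dir, segments[i]);

        /* A path that does not fit is treated as "not archived". */
        if (n < 0 || (size_t) n >= sizeof(path))
            return i;
        if (access(path, F_OK) != 0)
            return i;
    }
    return nsegments;
}
```

Everything from the returned index onwards would still need to be archived by the standby. An existence check alone says nothing about whether the archived file has valid content, which is part of what the restore-based check was meant to cover.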

- Heikki

Attachments:

v2-0001-Make-WAL-archival-behave-more-sensibly-in-standby.patch (application/x-patch)
From db5c4311baf4e3a2ae3308c4d0d9975ee3692a18 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Thu, 16 Apr 2015 14:40:24 +0300
Subject: [PATCH v2 1/1] Make WAL archival behave more sensibly in standby
 mode.

This adds two new archive_modes, 'shared' and 'always', to indicate whether
the WAL archive is shared between the primary and standby, or not. In
shared mode, the standby tracks which files have been archived by the
primary. The standby refrains from recycling files that the primary has
not yet archived, and at failover, the standby archives all those files too
from the old timeline. In 'always' mode, the standby's WAL archive is
taken to be separate from the primary's, and the standby independently
archives all files it receives from the primary.

Fujii Masao and me.
---
 doc/src/sgml/config.sgml                      |  12 +-
 doc/src/sgml/high-availability.sgml           |  48 +++++++
 doc/src/sgml/protocol.sgml                    |  31 +++++
 src/backend/access/transam/xlog.c             |  29 ++++-
 src/backend/postmaster/postmaster.c           |  37 ++++--
 src/backend/replication/walreceiver.c         | 172 ++++++++++++++++++++------
 src/backend/replication/walsender.c           |  47 +++++++
 src/backend/utils/misc/guc.c                  |  21 ++--
 src/backend/utils/misc/postgresql.conf.sample |   2 +-
 src/include/access/xlog.h                     |  14 ++-
 10 files changed, 351 insertions(+), 62 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b30c68d..e352b8e 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2521,7 +2521,7 @@ include_dir 'conf.d'
 
     <variablelist>
      <varlistentry id="guc-archive-mode" xreflabel="archive_mode">
-      <term><varname>archive_mode</varname> (<type>boolean</type>)
+      <term><varname>archive_mode</varname> (<type>enum</type>)
       <indexterm>
        <primary><varname>archive_mode</> configuration parameter</primary>
       </indexterm>
@@ -2530,7 +2530,15 @@ include_dir 'conf.d'
        <para>
         When <varname>archive_mode</> is enabled, completed WAL segments
         are sent to archive storage by setting
-        <xref linkend="guc-archive-command">.
+        <xref linkend="guc-archive-command">. In addition to <literal>off</>,
+        to disable, there are three modes: <literal>on</>, <literal>shared</>,
+        and <literal>always</>. During normal operation, there is no
+        difference between the three modes, but in archive recovery or
+        standby mode, it indicates whether the WAL archive is shared between
+        the primary and the standby server or not. See
+        <xref linkend="continuous-archiving-in-standby"> for details.
+       </para>  
+       <para>
         <varname>archive_mode</> and <varname>archive_command</> are
         separate variables so that <varname>archive_command</> can be
         changed without leaving archiving mode.
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index a17f555..62f7c75 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1220,6 +1220,54 @@ primary_slot_name = 'node_a_slot'
 
    </sect3>
   </sect2>
+
+  <sect2 id="continuous-archiving-in-standby">
+   <title>Continuous archiving in standby</title>
+
+   <indexterm>
+     <primary>continuous archiving</primary>
+     <secondary>in standby</secondary>
+   </indexterm>
+
+   <para>
+     When continuous WAL archiving is used in a standby, there are two
+     different scenarios: the WAL archive can be shared between the primary
+     and the standby, or the standby can have its own WAL archive. In the
+     shared archive scenario, <varname>archive_mode</varname> must be set to
+     <literal>shared</literal>, and in the separate archive scenario, to
+     <literal>always</literal>. Setting it to <literal>on</literal> in a
+     standby server, or when performing point-in-time recovery, is not
+     allowed and an error will be raised. When a server is not in recovery
+     mode, there is no difference between <literal>on</literal>,
+     <literal>shared</literal>, and <literal>always</literal> modes.
+   </para>
+
+   <para>
+     In <literal>shared</literal> archive mode, the standby server tries to
+     ensure that the archive is complete, even if the primary crashes and
+     failover happens. The standby server will not archive any WAL segments
+     as long as it is in standby mode; it is the primary server's
+     responsibility to do so. It will, however, keep track of which files
+     have already been archived by the primary, and if failover happens, it
+     takes over and attempts to archive any files that the primary had not
+     yet archived.
+   </para>
+
+   <para>
+     In <literal>always</literal> archive mode, the standby server will
+     archive all WAL it receives, whether it's through streaming replication
+     or by restoring from the primary's archive using
+     <varname>restore_command</varname>.
+   </para>
+
+   <para>
+     In cascading replication, the first standby server and the cascaded
+     standby servers can use <varname>archive_mode</varname> settings. In
+     each standby, it should be set to <literal>shared</literal> or
+     <literal>always</literal>, depending on whether that standby shares the
+     archive with the primary or standby it is connected to.
+   </para>
+  </sect2>
   </sect1>
 
   <sect1 id="warm-standby-failover">
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 3a753a0..a42344e 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1646,6 +1646,37 @@ The commands accepted in walsender mode are:
       </para>
       </listitem>
       </varlistentry>
+      <varlistentry>
+      <term>
+          WAL archival report message (B)
+      </term>
+      <listitem>
+      <para>
+      <variablelist>
+      <varlistentry>
+      <term>
+          Byte1('a')
+      </term>
+      <listitem>
+      <para>
+          Tells the receiver the last archived WAL segment.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          Byte<replaceable>n</replaceable>
+      </term>
+      <listitem>
+      <para>
+          Filename of the latest archived file.
+      </para>
+      </listitem>
+      </varlistentry>
+      </variablelist>
+      </para>
+      </listitem>
+      </varlistentry>
       </variablelist>
      </para>
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2580996..22b5dda 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -84,7 +84,7 @@ int			min_wal_size = 5;		/* 80 MB */
 int			wal_keep_segments = 0;
 int			XLOGbuffers = -1;
 int			XLogArchiveTimeout = 0;
-bool		XLogArchiveMode = false;
+int			XLogArchiveMode = ARCHIVE_MODE_OFF;
 char	   *XLogArchiveCommand = NULL;
 bool		EnableHotStandby = false;
 bool		fullPageWrites = true;
@@ -138,6 +138,25 @@ const struct config_enum_entry sync_method_options[] = {
 	{NULL, 0, false}
 };
 
+
+/*
+ * Although only "on", "off", "shared", and "always" are documented,
+ * we accept all the likely variants of "on" and "off".
+ */
+const struct config_enum_entry archive_mode_options[] = {
+	{"shared", ARCHIVE_MODE_SHARED, false},
+	{"always", ARCHIVE_MODE_ALWAYS, false},
+	{"on", ARCHIVE_MODE_ON, false},
+	{"off", ARCHIVE_MODE_OFF, false},
+	{"true", ARCHIVE_MODE_ON, true},
+	{"false", ARCHIVE_MODE_OFF, true},
+	{"yes", ARCHIVE_MODE_ON, true},
+	{"no", ARCHIVE_MODE_OFF, true},
+	{"1", ARCHIVE_MODE_ON, true},
+	{"0", ARCHIVE_MODE_OFF, true},
+	{NULL, 0, false}
+};
+
 /*
  * Statistics for current checkpoint are collected in this global struct.
  * Because only the checkpointer or a stand-alone backend can perform
@@ -756,7 +775,7 @@ static MemoryContext walDebugCxt = NULL;
 #endif
 
 static void readRecoveryCommandFile(void);
-static void exitArchiveRecovery(TimeLineID endTLI, XLogSegNo endLogSegNo);
+static void exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog);
 static bool recoveryStopsBefore(XLogReaderState *record);
 static bool recoveryStopsAfter(XLogReaderState *record);
 static void recoveryPausesHere(void);
@@ -5949,6 +5968,12 @@ StartupXLOG(void)
 
 	if (ArchiveRecoveryRequested)
 	{
+		/* archive_mode=on is not allowed during archive recovery. */
+		if (XLogArchiveMode == ARCHIVE_MODE_ON)
+			ereport(ERROR,
+					(errmsg("archive_mode='on' cannot be used in archive recovery"),
+					 (errhint("Use 'shared' or 'always' mode instead."))));
+
 		if (StandbyModeRequested)
 			ereport(LOG,
 					(errmsg("entering standby mode")));
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a9f20ac..72fe4fd 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -828,9 +828,9 @@ PostmasterMain(int argc, char *argv[])
 		write_stderr("%s: max_wal_senders must be less than max_connections\n", progname);
 		ExitPostmaster(1);
 	}
-	if (XLogArchiveMode && wal_level == WAL_LEVEL_MINIMAL)
+	if (XLogArchiveMode > ARCHIVE_MODE_OFF && wal_level == WAL_LEVEL_MINIMAL)
 		ereport(ERROR,
-				(errmsg("WAL archival (archive_mode=on) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
+				(errmsg("WAL archival (archive_mode=on/always/shared) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
 	if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
 		ereport(ERROR,
 				(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
@@ -1645,13 +1645,21 @@ ServerLoop(void)
 				start_autovac_launcher = false; /* signal processed */
 		}
 
-		/* If we have lost the archiver, try to start a new one */
-		if (XLogArchivingActive() && PgArchPID == 0 && pmState == PM_RUN)
-			PgArchPID = pgarch_start();
-
-		/* If we have lost the stats collector, try to start a new one */
-		if (PgStatPID == 0 && pmState == PM_RUN)
-			PgStatPID = pgstat_start();
+		/*
+		 * If we have lost the archiver, try to start a new one.
+		 *
+		 * If WAL archiving is enabled always, we try to start a new archiver
+		 * even during recovery.
+		 */
+		if (PgArchPID == 0 && wal_level >= WAL_LEVEL_ARCHIVE)
+		{
+			if ((pmState == PM_RUN && XLogArchiveMode > ARCHIVE_MODE_OFF) ||
+				((pmState == PM_RECOVERY || pmState == PM_HOT_STANDBY) &&
+				 XLogArchiveMode == ARCHIVE_MODE_ALWAYS))
+			{
+				PgArchPID = pgarch_start();
+			}
+		}
 
 		/* If we need to signal the autovacuum launcher, do so now */
 		if (avlauncher_needs_signal)
@@ -4807,6 +4815,17 @@ sigusr1_handler(SIGNAL_ARGS)
 		Assert(BgWriterPID == 0);
 		BgWriterPID = StartBackgroundWriter();
 
+		/*
+		 * Start the archiver if we're responsible for (re-)archiving received
+		 * files.
+		 */
+		Assert(PgArchPID == 0);
+		if (wal_level >= WAL_LEVEL_ARCHIVE &&
+			XLogArchiveMode == ARCHIVE_MODE_ALWAYS)
+		{
+			PgArchPID = pgarch_start();
+		}
+
 		pmState = PM_RECOVERY;
 	}
 	if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 9c7710f..e53ffeb 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -52,8 +52,11 @@
 #include "libpq/pqformat.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
+#include "pgstat.h"
+#include "postmaster/pgarch.h"
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/pmsignal.h"
 #include "storage/procarray.h"
@@ -107,6 +110,9 @@ static struct
 	XLogRecPtr	Flush;			/* last byte + 1 flushed in the standby */
 }	LogstreamResult;
 
+/* Name of the last WAL file the primary reported as archived */
+static char primary_last_archived[MAX_XFN_CHARS + 1];
+
 static StringInfoData reply_message;
 static StringInfoData incoming_message;
 
@@ -141,6 +147,7 @@ static void XLogWalRcvFlush(bool dying);
 static void XLogWalRcvSendReply(bool force, bool requestReply);
 static void XLogWalRcvSendHSFeedback(bool immed);
 static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
+static void ProcessArchivalReport(void);
 
 /* Signal handlers */
 static void WalRcvSigHupHandler(SIGNAL_ARGS);
@@ -526,21 +533,12 @@ WalReceiverMain(void)
 		 */
 		if (recvFile >= 0)
 		{
-			char		xlogfname[MAXFNAMELEN];
-
 			XLogWalRcvFlush(false);
 			if (close(recvFile) != 0)
 				ereport(PANIC,
 						(errcode_for_file_access(),
 						 errmsg("could not close log segment %s: %m",
 								XLogFileNameP(recvFileTLI, recvSegNo))));
-
-			/*
-			 * Create .done file forcibly to prevent the streamed segment from
-			 * being archived later.
-			 */
-			XLogFileName(xlogfname, recvFileTLI, recvSegNo);
-			XLogArchiveForceDone(xlogfname);
 		}
 		recvFile = -1;
 
@@ -846,6 +844,26 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 					XLogWalRcvSendReply(true, false);
 				break;
 			}
+		case 'a':				/* Archival report */
+			{
+				/* the content of the message is a filename */
+				if (len >= sizeof(primary_last_archived))
+					ereport(ERROR,
+							(errcode(ERRCODE_PROTOCOL_VIOLATION),
+							 errmsg_internal("invalid archival report message with length %d",
+											 (int) len)));
+				memcpy(primary_last_archived, buf, len);
+				primary_last_archived[len] = '\0';
+				if (strspn(buf, VALID_XFN_CHARS) != len)
+				{
+					primary_last_archived[0] = '\0';
+					ereport(ERROR,
+							(errcode(ERRCODE_PROTOCOL_VIOLATION),
+							 errmsg_internal("unexpected character in primary's last archived filename")));
+				}
+				ProcessArchivalReport();
+				break;
+			}
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -867,39 +885,18 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	{
 		int			segbytes;
 
-		if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo))
+		if (!XLByteInSeg(recptr, recvSegNo))
 		{
 			bool		use_existent;
 
 			/*
-			 * fsync() and close current file before we switch to next one. We
-			 * would otherwise have to reopen this file to fsync it later
+			 * We take care to always close the current file, after writing
+			 * the last byte to it. So this shouldn't happen.
 			 */
 			if (recvFile >= 0)
-			{
-				char		xlogfname[MAXFNAMELEN];
-
-				XLogWalRcvFlush(false);
-
-				/*
-				 * XLOG segment files will be re-read by recovery in startup
-				 * process soon, so we don't advise the OS to release cache
-				 * pages associated with the file like XLogFileClose() does.
-				 */
-				if (close(recvFile) != 0)
-					ereport(PANIC,
-							(errcode_for_file_access(),
-							 errmsg("could not close log segment %s: %m",
-									XLogFileNameP(recvFileTLI, recvSegNo))));
-
-				/*
-				 * Create .done file forcibly to prevent the streamed segment
-				 * from being archived later.
-				 */
-				XLogFileName(xlogfname, recvFileTLI, recvSegNo);
-				XLogArchiveForceDone(xlogfname);
-			}
-			recvFile = -1;
+				ereport(ERROR,
+						(errmsg("unexpected WAL receive location %s",
+								XLogFileNameP(recvFileTLI, recvSegNo))));
 
 			/* Create/use new log file */
 			XLByteToSeg(recptr, recvSegNo);
@@ -954,6 +951,51 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 		buf += byteswritten;
 
 		LogstreamResult.Write = recptr;
+
+		/*
+		 * If we just wrote the last byte to this segment, fsync() and close
+		 * current file before we switch to next one. We would otherwise have
+		 * to reopen this file to fsync it later.
+		 */
+		if (recvOff == XLOG_SEG_SIZE)
+		{
+			char		xlogfname[MAXFNAMELEN];
+
+			XLogWalRcvFlush(false);
+
+			/*
+			 * XLOG segment files will be re-read by recovery in startup
+			 * process soon, so we don't advise the OS to release cache
+			 * pages associated with the file like XLogFileClose() does.
+			 */
+			if (close(recvFile) != 0)
+				ereport(PANIC,
+						(errcode_for_file_access(),
+						 errmsg("could not close log segment %s: %m",
+								XLogFileNameP(recvFileTLI, recvSegNo))));
+			recvFile = -1;
+
+			/*
+			 * Now that this segment is complete, do we need to archive it?
+			 *
+			 * In 'always' mode, we clearly need to archive this.
+			 *
+			 * In 'shared' mode, we might need to, if we get promoted before
+			 * the master has archived this file, so create a .ready file. It
+			 * will be replaced with .done later, if we get acknowledgement
+			 * from the primary that this has already been archived.
+			 *
+			 * In 'on' mode, we're only responsible for WAL we've generated
+			 * ourselves.
+			 */
+			if (XLogArchiveMode == ARCHIVE_MODE_ALWAYS ||
+				XLogArchiveMode == ARCHIVE_MODE_SHARED)
+			{
+				XLogFileName(xlogfname, recvFileTLI, recvSegNo);
+
+				XLogArchiveCheckDone(xlogfname);
+			}
+		}
 	}
 }
 
@@ -1215,3 +1257,61 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
 		pfree(receipttime);
 	}
 }
+
+/*
+ * Create .done and .ready files, based on the master's last archival report.
+ */
+static void
+ProcessArchivalReport(void)
+{
+	DIR		   *xldir;
+	struct dirent *xlde;
+
+	elog(DEBUG2, "received archival report from master: %s",
+		 primary_last_archived);
+
+	if (XLogArchiveMode != ARCHIVE_MODE_SHARED)
+		return;
+
+	/* Check that the filename the primary reported looks valid */
+	if (strlen(primary_last_archived) < 24 ||
+		strspn(primary_last_archived, "0123456789ABCDEF") != 24)
+		return;
+
+	xldir = AllocateDir(XLOGDIR);
+	if (xldir == NULL)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open transaction log directory \"%s\": %m",
+						XLOGDIR)));
+
+	while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+	{
+		/*
+		 * We ignore the timeline part of the XLOG segment identifiers in
+		 * deciding whether a segment has already been archived.  Only the
+		 * log and segment parts of the filename (everything after the first
+		 * 8 characters) are compared, so segments on any timeline at or
+		 * before the reported position are marked as done. Being more
+		 * selective about timelines would be a whole lot more complicated.
+		 *
+		 * We use the alphanumeric sorting property of the filenames to decide
+		 * which ones are at or before the primary's last archived segment.
+		 */
+		if (strlen(xlde->d_name) == 24 &&
+			strspn(xlde->d_name, "0123456789ABCDEF") == 24 &&
+			strcmp(xlde->d_name + 8, primary_last_archived + 8) <= 0)
+		{
+			XLogArchiveForceDone(xlde->d_name);
+		}
+	}
+
+	FreeDir(xldir);
+
+	/*
+	 * Remember this location in pgstat as well. This makes it visible in
+	 * pg_stat_archiver, and allows the location to be relayed to cascaded
+	 * standbys.
+	 */
+	pgstat_send_archiver(primary_last_archived, false);
+}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 4a20569..74bdeff 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -55,6 +55,7 @@
 #include "libpq/pqformat.h"
 #include "miscadmin.h"
 #include "nodes/replnodes.h"
+#include "pgstat.h"
 #include "replication/basebackup.h"
 #include "replication/decode.h"
 #include "replication/logical.h"
@@ -152,6 +153,7 @@ static StringInfoData tmpbuf;
  * wal_sender_timeout doesn't need to be active.
  */
 static TimestampTz last_reply_timestamp = 0;
+static char last_archival_report[MAX_XFN_CHARS + 1] = "";
 
 /* Have we sent a heartbeat message asking for reply, since last reply? */
 static bool waiting_for_ping_response = false;
@@ -209,6 +211,8 @@ static void ProcessStandbyHSFeedbackMessage(void);
 static void ProcessRepliesIfAny(void);
 static void WalSndKeepalive(bool requestReply);
 static void WalSndKeepaliveIfNecessary(TimestampTz now);
+static void WalSndArchivalReport(void);
+static void WalSndArchivalReportIfNecessary(void);
 static void WalSndCheckTimeOut(TimestampTz now);
 static long WalSndComputeSleeptime(TimestampTz now);
 static void WalSndPrepareWrite(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
@@ -2879,6 +2883,11 @@ WalSndKeepaliveIfNecessary(TimestampTz now)
 	TimestampTz ping_time;
 
 	/*
+	 * Send an archival status message, if necessary.
+	 */
+	WalSndArchivalReportIfNecessary();
+
+	/*
 	 * Don't send keepalive messages if timeouts are globally disabled or
 	 * we're doing something not partaking in timeouts.
 	 */
@@ -2907,6 +2916,44 @@ WalSndKeepaliveIfNecessary(TimestampTz now)
 }
 
 /*
+ * This function is used to send archival report message to standby.
+ */
+static void
+WalSndArchivalReport(void)
+{
+	elog(LOG, "sending archival report: %s", last_archival_report);
+
+	/* construct the message... */
+	resetStringInfo(&output_message);
+	pq_sendbyte(&output_message, 'a');
+	pq_sendbytes(&output_message, last_archival_report, strlen(last_archival_report));
+
+	/* ... and send it wrapped in CopyData */
+	pq_putmessage_noblock('d', output_message.data, output_message.len);
+}
+
+static void
+WalSndArchivalReportIfNecessary(void)
+{
+	PgStat_ArchiverStats *archiver_stats;
+
+	archiver_stats = pgstat_fetch_stat_archiver();
+
+	if (strcmp(last_archival_report, archiver_stats->last_archived_wal) == 0)
+	{
+		pgstat_clear_snapshot();
+		return;
+	}
+
+	strlcpy(last_archival_report, archiver_stats->last_archived_wal,
+			sizeof(last_archival_report));
+
+	pgstat_clear_snapshot();
+
+	WalSndArchivalReport();
+}
+
+/*
  * This isn't currently used for anything. Monitoring tools might be
  * interested in the future, and we'll need something like this in the
  * future for synchronous replication.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index f43aff2..7115bcc 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -396,6 +396,7 @@ static const struct config_enum_entry row_security_options[] = {
  * Options for enum values stored in other modules
  */
 extern const struct config_enum_entry wal_level_options[];
+extern const struct config_enum_entry archive_mode_options[];
 extern const struct config_enum_entry sync_method_options[];
 extern const struct config_enum_entry dynamic_shared_memory_options[];
 
@@ -1530,16 +1531,6 @@ static struct config_bool ConfigureNamesBool[] =
 	},
 
 	{
-		{"archive_mode", PGC_POSTMASTER, WAL_ARCHIVING,
-			gettext_noop("Allows archiving of WAL files using archive_command."),
-			NULL
-		},
-		&XLogArchiveMode,
-		false,
-		NULL, NULL, NULL
-	},
-
-	{
 		{"hot_standby", PGC_POSTMASTER, REPLICATION_STANDBY,
 			gettext_noop("Allows connections and queries during recovery."),
 			NULL
@@ -3552,6 +3543,16 @@ static struct config_enum ConfigureNamesEnum[] =
 	},
 
 	{
+		{"archive_mode", PGC_POSTMASTER, WAL_ARCHIVING,
+			gettext_noop("Allows archiving of WAL files using archive_command."),
+			NULL
+		},
+		&XLogArchiveMode,
+		ARCHIVE_MODE_OFF, archive_mode_options,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"trace_recovery_messages", PGC_SIGHUP, DEVELOPER_OPTIONS,
 			gettext_noop("Enables logging of recovery-related debugging information."),
 			gettext_noop("Each level includes all the levels that follow it. The later"
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 110983f..90371d7 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -206,7 +206,7 @@
 
 # - Archiving -
 
-#archive_mode = off		# allows archiving to be done
+#archive_mode = off		# allows archiving to be done; off, on, shared, or always
 				# (change requires restart)
 #archive_command = ''		# command to use to archive a logfile segment
 				# placeholders: %p = path of file to archive
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 2b1f423..3a49702 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -95,7 +95,6 @@ extern int	wal_keep_segments;
 extern int	XLOGbuffers;
 extern int	XLogArchiveTimeout;
 extern int	wal_retrieve_retry_interval;
-extern bool XLogArchiveMode;
 extern char *XLogArchiveCommand;
 extern bool EnableHotStandby;
 extern bool fullPageWrites;
@@ -105,6 +104,16 @@ extern bool log_checkpoints;
 
 extern int	CheckPointSegments;
 
+/* Archive modes */
+typedef enum ArchiveMode
+{
+	ARCHIVE_MODE_OFF = 0,	/* disabled */
+	ARCHIVE_MODE_ON,		/* enabled while server is running normally */
+	ARCHIVE_MODE_SHARED,	/* archive is shared with master */
+	ARCHIVE_MODE_ALWAYS		/* enabled always (even during recovery) */
+} ArchiveMode;
+extern int	XLogArchiveMode;
+
 /* WAL levels */
 typedef enum WalLevel
 {
@@ -115,7 +124,8 @@ typedef enum WalLevel
 } WalLevel;
 extern int	wal_level;
 
-#define XLogArchivingActive()	(XLogArchiveMode && wal_level >= WAL_LEVEL_ARCHIVE)
+#define XLogArchivingActive() \
+	(XLogArchiveMode > ARCHIVE_MODE_OFF && wal_level >= WAL_LEVEL_ARCHIVE)
 #define XLogArchiveCommandSet() (XLogArchiveCommand[0] != '\0')
 
 /*
-- 
2.1.4

#10Michael Paquier
michael.paquier@gmail.com
In reply to: Heikki Linnakangas (#9)
Re: Streaming replication and WAL archive interactions

On Thu, Apr 16, 2015 at 8:57 PM, Heikki Linnakangas wrote:

Oh, hang on, that's not necessarily true. On promotion, the standby archives
the last, partial WAL segment from the old timeline. That's just wrong
(/messages/by-id/52FCD37C.3070806@vmware.com), and in
fact I somehow thought I changed that already, but apparently not. So let's
stop doing that.

Er. Are you planning to prevent the standby from archiving the last partial
segment from the old timeline at promotion? I thought from previous
discussions that we should do it, as the master (be it crashed, burned, buried
or dead) may not have the occasion to do it. By preventing its archiving
you close the door to the case where the master did not have the occasion to
archive it.

+/* */
+static char primary_last_archived[MAX_XFN_CHARS + 1];
This is visibly missing a comment.

As primary_last_archived is used only by ProcessArchivalReport(), wouldn't
it be better to pass it as argument to this function?

+       /* Check that the filename the primary reported looks valid */
+       if (strlen(primary_last_archived) < 24 ||
+               strspn(primary_last_archived, "0123456789ABCDEF") != 24)
+               return;
Not related to this patch, but we had better have a macro doing this job I
think... It keeps spreading around.
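
A minimal sketch of what such a helper could look like, assuming the usual 24-character upper-case hexadecimal segment name. The name looks_like_wal_segment_name and the WAL_FNAME_LEN constant are hypothetical and not part of the patch. Unlike the check quoted above, which accepts longer strings as long as the first 24 characters are hex digits, this sketch requires exactly 24 characters:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* 8 hex digits each for timeline, "xlogid" and segment (assumed layout) */
#define WAL_FNAME_LEN 24

/*
 * Hypothetical helper: returns true if "fname" consists of exactly
 * 24 upper-case hexadecimal digits and nothing else.
 */
static bool
looks_like_wal_segment_name(const char *fname)
{
	assert(fname != NULL);
	return strlen(fname) == WAL_FNAME_LEN &&
		strspn(fname, "0123456789ABCDEF") == WAL_FNAME_LEN;
}
```

Whether a stricter or a looser length check is wanted at each call site would have to be decided case by case.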

People may be surprised that a base backup taken from a node that has
archive_mode = on set (that's the case in a very large number of cases)
will not be able to work as-is as node startup will fail as follows:
FATAL: archive_mode='on' cannot be used in archive recovery
HINT: Use 'shared' or 'always' mode instead.
One idea would be to simply ignore the fact that archive_mode = on on nodes
in recovery instead of raising an error. Note that I like the fact that it
raises an error as that's clear; I am just pointing out that people may be
surprised that base backups no longer work in this case.

Are both WalSndArchivalReport() and WalSndArchivalReportIfNecessary()
really necessary? I think that for simplicity you could merge them and use
last_archival_report as a local variable.

Creating a dependency between the pgstat machinery and the WAL sender looks
weak to me. For example with this patch a master cannot stop, as it waits
indefinitely:
LOG: using stale statistics instead of current ones because stats
collector is not responding
LOG: sending archival report:
You could scan archive_status/, but that would be costly if there are many
entries to scan, and I think that walsender should be highly responsive. Or
you could directly store the name of the last archived WAL segment marked
as .done in, let's say, archive_status/last_archived. The control file does
not seem the right place for such an entry, as a node may not have
archive_mode enabled; that's why I am not suggesting it.

Regards,
--
Michael

#11Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Michael Paquier (#10)
Re: Streaming replication and WAL archive interactions

On 04/21/2015 09:53 AM, Michael Paquier wrote:

On Thu, Apr 16, 2015 at 8:57 PM, Heikki Linnakangas wrote:

Oh, hang on, that's not necessarily true. On promotion, the standby archives
the last, partial WAL segment from the old timeline. That's just wrong
(/messages/by-id/52FCD37C.3070806@vmware.com), and in
fact I somehow thought I changed that already, but apparently not. So let's
stop doing that.

Er. Are you planning to prevent the standby from archiving the last partial
segment from the old timeline at promotion?

Yes.

I thought from previous discussions that we should do it, as the master
(be it crashed, burned, buried or dead) may not have the occasion to
do it. By preventing its archiving you close the door to the case
where the master did not have the occasion to archive it.

The current situation is a mess:

1. Even though we archive the last segment in the standby, there is no
guarantee that the master had archived all the previous segments already.

2. If the master is not totally dead, it might try to archive the same
file with more WAL in it, at the same time or just afterwards, or even
just before the standby has completed promotion. Which copy do you keep
in the archive? Having to deal with that makes the archive_command more
complicated.

Note that even though we don't archive the partial last segment on the
previous timeline, the same WAL is copied to the first segment on the
new timeline. So the WAL isn't lost.
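
To illustrate how the segment names on the two timelines relate, here is a rough sketch of the file-name layout. It assumes the default 16 MB segment size (256 segments per 4 GB "xlogid"); wal_file_name and same_wal_position are illustrative names, not functions from the tree:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumption: 16 MB segments, so 4 GB / 16 MB = 256 segments per xlogid */
#define SEGMENTS_PER_XLOGID 256

/* Build the 24-character segment file name for a timeline and segment number. */
static void
wal_file_name(char *buf, size_t buflen, uint32_t tli, uint64_t segno)
{
	assert(buf != NULL && buflen >= 25);
	snprintf(buf, buflen, "%08X%08X%08X",
			 (unsigned int) tli,
			 (unsigned int) (segno / SEGMENTS_PER_XLOGID),
			 (unsigned int) (segno % SEGMENTS_PER_XLOGID));
}

/* True if two segment names cover the same WAL range, ignoring the timeline. */
static bool
same_wal_position(const char *a, const char *b)
{
	assert(strlen(a) >= 24 && strlen(b) >= 24);
	return strncmp(a + 8, b + 8, 16) == 0;
}
```

With these assumptions, segment 0x12 is named 000000010000000000000012 on timeline 1 and 000000020000000000000012 on timeline 2; only the first eight characters differ.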

People may be surprised that a base backup taken from a node that has
archive_mode = on set (that's the case in a very large number of cases)
will not be able to work as-is as node startup will fail as follows:
FATAL: archive_mode='on' cannot be used in archive recovery
HINT: Use 'shared' or 'always' mode instead.

Hmm, good point.

One idea would be to simply ignore the fact that archive_mode = on on nodes
in recovery instead of raising an error. Note that I like the fact that it
raises an error as that's clear; I am just pointing out that people may be
surprised that base backups no longer work in this case.

By "ignore", what behaviour do you mean? Would "on" be equivalent to
"shared", "always", or something else?

Or we could keep the current behaviour with archive_mode=on (except for
the last segment thing, which is just wrong), where the standby only
archives the new timeline, and nothing from the previous timelines. Are
there use cases where you'd want that, rather than the new "shared" mode?
I wanted to keep the 'on' mode for backwards-compatibility, but if that
causes more problems, it might be better to just remove it and force the
admin to choose what kind of a setup he has, with "shared" or "always".

Creating a dependency between the pgstat machinery and the WAL sender looks
weak to me. For example with this patch a master cannot stop, as it waits
indefinitely:
LOG: using stale statistics instead of current ones because stats
collector is not responding
LOG: sending archival report:

Hmm, yeah, having walsender wait for the stats file to appear is not
good.

You could scan archive_status/, but that would be costly if there are many
entries to scan, and I think that walsender should be highly responsive. Or
you could directly store the name of the last archived WAL segment marked
as .done in, let's say, archive_status/last_archived. The control file does
not seem the right place for such an entry, as a node may not have
archive_mode enabled; that's why I am not suggesting it.

The ways that the archiver process can communicate with the rest of the
system are limited, for the sake of robustness. Writing to the control
file is definitely not OK. I think using the stats collector is OK for
this, but we'll have to arrange it so that the walsender doesn't block
on it, and should probably not force a new stats file so often. A 5-10
seconds old stats file would be perfectly fine for this purpose.
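
One way to read this is as a simple throttle in front of the report. The sketch below is only an illustration under assumptions: the struct, the function name archival_report_due and the 5-second interval are all hypothetical, and "current" stands for whatever possibly stale name the caller obtained without blocking:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <time.h>

#define REPORT_MIN_INTERVAL_SECS 5	/* assumed interval, per the 5-10 s remark */
#define MAX_WAL_NAME 64

typedef struct ArchivalReportState
{
	char		last_reported[MAX_WAL_NAME];	/* last name sent to the standby */
	time_t		last_check;						/* when we last looked */
} ArchivalReportState;

/*
 * Returns true if a new report should be sent now. Looks at "current" at
 * most once per interval, and never reports an empty or unchanged name.
 */
static bool
archival_report_due(ArchivalReportState *st, const char *current, time_t now)
{
	assert(st != NULL && current != NULL);

	if (now - st->last_check < REPORT_MIN_INTERVAL_SECS)
		return false;			/* checked recently, stay responsive */
	st->last_check = now;

	if (current[0] == '\0' || strcmp(st->last_reported, current) == 0)
		return false;			/* nothing new to tell the standby */

	strncpy(st->last_reported, current, sizeof(st->last_reported) - 1);
	st->last_reported[sizeof(st->last_reported) - 1] = '\0';
	return true;
}
```

Skipping empty names would also avoid the empty "sending archival report:" log line quoted above, though whether that is the right behaviour is a separate question.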

- Heikki

#12Michael Paquier
michael.paquier@gmail.com
In reply to: Heikki Linnakangas (#11)
Re: Streaming replication and WAL archive interactions

On Tue, Apr 21, 2015 at 4:38 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 04/21/2015 09:53 AM, Michael Paquier wrote:

On Thu, Apr 16, 2015 at 8:57 PM, Heikki Linnakangas wrote:

Oh, hang on, that's not necessarily true. On promotion, the standby archives
the last, partial WAL segment from the old timeline. That's just wrong
(/messages/by-id/52FCD37C.3070806@vmware.com), and in
fact I somehow thought I changed that already, but apparently not. So let's
stop doing that.

Er. Are you planning to prevent the standby from archiving the last partial
segment from the old timeline at promotion?

Yes.

I thought from previous discussions that we should do it, as the master
(be it crashed, burned, buried or dead) may not have the occasion to
do it. By preventing its archiving you close the door to the case
where the master did not have the occasion to archive it.

The current situation is a mess:

1. Even though we archive the last segment in the standby, there is no
guarantee that the master had archived all the previous segments already.

2. If the master is not totally dead, it might try to archive the same file
with more WAL in it, at the same time or just afterwards, or even just
before the standby has completed promotion. Which copy do you keep in the
archive? Having to deal with that makes the archive_command more
complicated.

Note that even though we don't archive the partial last segment on the
previous timeline, the same WAL is copied to the first segment on the new
timeline. So the WAL isn't lost.

But if the failed master has archived those segments safely, we may need
them, no? I am not sure we can ignore a user who would want to do a PITR
with recovery_target_timeline pointing to the one of the failed master.

People may be surprised that a base backup taken from a node that has
archive_mode = on set (that's the case in a very large number of cases)
will not be able to work as-is as node startup will fail as follows:
FATAL: archive_mode='on' cannot be used in archive recovery
HINT: Use 'shared' or 'always' mode instead.

Hmm, good point.

One idea would be to simply ignore the fact that archive_mode = on on nodes
in recovery instead of raising an error. Note that I like the fact that it
raises an error as that's clear; I am just pointing out that people may be
surprised that base backups no longer work in this case.

By "ignore", what behaviour do you mean? Would "on" be equivalent to
"shared", "always", or something else?

I meant something backward-compatible, with files marked as .done when they
are finished replaying... But now my words *are* weird as on != off ;)

Or we could keep the current behaviour with archive_mode=on (except for the
last segment thing, which is just wrong), where the standby only archives
the new timeline, and nothing from the previous timelines.

I guess this would solve the issue here then, which is not a bad thing in
itself:
/messages/by-id/20140918180734.361021e1@erg
We would need to check if the situation improves with the 'always' mode btw.

Are there use cases where you'd want that, rather than the new "shared"
mode? I wanted to keep the 'on' mode for backwards-compatibility, but if
that causes more problems, it might be better to just remove it and force
the admin to choose what kind of a setup he has, with "shared" or "always".

The 'on' mode is still useful IMO to get behavior as close as possible to
what previous releases did.
Regards,
--
Michael

#13Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Michael Paquier (#12)
Re: Streaming replication and WAL archive interactions

On 04/21/2015 12:04 PM, Michael Paquier wrote:

On Tue, Apr 21, 2015 at 4:38 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Note that even though we don't archive the partial last segment on the
previous timeline, the same WAL is copied to the first segment on the new
timeline. So the WAL isn't lost.

But if the failed master has archived those segments safely, we may need
them, no? I am not sure we can ignore a user who would want to do a PITR
with recovery_target_timeline pointing to the one of the failed master.

I think it would be acceptable. If you want to maintain an
up-to-the-second archive, you can use pg_receivexlog. Mind you, if the
standby wasn't promoted, the partial segment would not be present in the
archive anyway. And you can copy the WAL segment manually from
0000000200000000000000XX to pg_xlog/0000000100000000000000XX before
starting PITR.

Another thought is that we could archive the partial file, but with a
different name to avoid confusing it with the full segment. For example,
we could archive a partial 000000010000000000000012 segment as
"000000020000000000000012.00000128.partial", where 00000128 indicates
how far that file is valid (this naming is similar to how the backup
history files are named). Recovery wouldn't automatically pick up those
files, but the DBA could easily copy the partial file into pg_xlog with
the full segment's name, if he wants to do PITR to that piece of WAL.
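
A rough sketch of how such a name could be composed from a segment name and the offset up to which the file is valid. The function partial_archive_name is hypothetical, and which timeline ID belongs in the base name is left to the caller:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: writes "<segname>.<valid_upto as 8 hex digits>.partial"
 * into buf. Returns 0 on success, -1 if the result would not fit.
 */
static int
partial_archive_name(char *buf, size_t buflen,
					 const char *segname, uint32_t valid_upto)
{
	int			n;

	assert(buf != NULL && segname != NULL);
	assert(strlen(segname) == 24);	/* expects a plain segment name */

	n = snprintf(buf, buflen, "%s.%08X.partial",
				 segname, (unsigned int) valid_upto);
	return (n >= 0 && (size_t) n < buflen) ? 0 : -1;
}
```

For example, a segment name of 000000010000000000000012 and an offset of 0x128 would give 000000010000000000000012.00000128.partial.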

Are there use cases where you'd want that, rather than the new "shared"
mode? I wanted to keep the 'on' mode for backwards-compatibility, but if
that causes more problems, it might be better to just remove it and force
the admin to choose what kind of a setup he has, with "shared" or "always".

The 'on' mode is still useful IMO to get behavior as close as possible to
what previous releases did.

But would you ever want the old behaviour, rather than the new shared or
always behaviour?

- Heikki

#14Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#13)
Re: Streaming replication and WAL archive interactions

On Tue, Apr 21, 2015 at 6:55 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 04/21/2015 12:04 PM, Michael Paquier wrote:

On Tue, Apr 21, 2015 at 4:38 PM, Heikki Linnakangas <hlinnaka@iki.fi>
wrote:

Note that even though we don't archive the partial last segment on the
previous timeline, the same WAL is copied to the first segment on the new
timeline. So the WAL isn't lost.

But if the failed master has archived those segments safely, we may need
them, no? I am not sure we can ignore a user who would want to do a PITR
with recovery_target_timeline pointing to the one of the failed master.

I think it would be acceptable. If you want to maintain an up-to-the-second
archive, you can use pg_receivexlog. Mind you, if the standby wasn't
promoted, the partial segment would not be present in the archive anyway.
And you can copy the WAL segment manually from 0000000200000000000000XX to
pg_xlog/0000000100000000000000XX before starting PITR.

Another thought is that we could archive the partial file, but with a
different name to avoid confusing it with the full segment. For example, we
could archive a partial 000000010000000000000012 segment as
"000000020000000000000012.00000128.partial", where 00000128 indicates how
far that file is valid (this naming is similar to how the backup history
files are named). Recovery wouldn't automatically pick up those files, but
the DBA could easily copy the partial file into pg_xlog with the full
segment's name, if he wants to do PITR to that piece of WAL.

So, suppose you have A replicating to B (via an archive) replicating to C
(via a separate archive); A dies, B is promoted. It sounds to me like
today this will work and with your proposed change it will require
manual intervention. I don't think that's OK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15Michael Paquier
michael.paquier@gmail.com
In reply to: Robert Haas (#14)
Re: Streaming replication and WAL archive interactions

On Wed, Apr 22, 2015 at 6:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Apr 21, 2015 at 6:55 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 04/21/2015 12:04 PM, Michael Paquier wrote:

On Tue, Apr 21, 2015 at 4:38 PM, Heikki Linnakangas <hlinnaka@iki.fi>
wrote:

Note that even though we don't archive the partial last segment on the
previous timeline, the same WAL is copied to the first segment on the new
timeline. So the WAL isn't lost.

But if the failed master has archived those segments safely, we may need
them, no? I am not sure we can ignore a user who would want to do a PITR
with recovery_target_timeline pointing to the one of the failed master.

I think it would be acceptable. If you want to maintain an up-to-the-second
archive, you can use pg_receivexlog. Mind you, if the standby wasn't
promoted, the partial segment would not be present in the archive anyway.
And you can copy the WAL segment manually from 0000000200000000000000XX to
pg_xlog/0000000100000000000000XX before starting PITR.

Another thought is that we could archive the partial file, but with a
different name to avoid confusing it with the full segment. For example, we
could archive a partial 000000010000000000000012 segment as
"000000020000000000000012.00000128.partial", where 00000128 indicates how
far that file is valid (this naming is similar to how the backup history
files are named). Recovery wouldn't automatically pick up those files, but
the DBA could easily copy the partial file into pg_xlog with the full
segment's name, if he wants to do PITR to that piece of WAL.

So, suppose you have A replicating to B (via an archive) replicating to C
(via a separate archive); A dies, B is promoted. It sounds to me like
today this will work and with your proposed change it will require
manual intervention. I don't think that's OK.

This is going to change a behavior that people have been used to for a
couple of releases. I would not mind having this patch do
"archive_mode = on during recovery" => archive only segments generated
by this node + the last partial segment on the old timeline at
promotion, without renaming, to preserve backward-compatible behavior.
If master and standby point to separate archive locations, this way
the operator can sort out later which one he wants to use. If they
point to the same location, archive_command scripts already do
such renaming internally, at least that's what I suspect an
experienced user would do. Now, it is true that not many people are
experienced in this area; I see mistakes regarding such things on a
weekly basis... This .partial segment renaming is something that we
should let the archive_command manage with its internal logic.
Regards,
--
Michael

#16Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Robert Haas (#14)
Re: Streaming replication and WAL archive interactions

On 04/22/2015 12:42 AM, Robert Haas wrote:

On Tue, Apr 21, 2015 at 6:55 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 04/21/2015 12:04 PM, Michael Paquier wrote:

On Tue, Apr 21, 2015 at 4:38 PM, Heikki Linnakangas <hlinnaka@iki.fi>
wrote:

Note that even though we don't archive the partial last segment on the
previous timeline, the same WAL is copied to the first segment on the new
timeline. So the WAL isn't lost.

But if the failed master has archived those segments safely, we may need
them, no? I am not sure we can ignore a user who would want to do a PITR
with recovery_target_timeline pointing to the one of the failed master.

I think it would be acceptable. If you want to maintain an up-to-the-second
archive, you can use pg_receivexlog. Mind you, if the standby wasn't
promoted, the partial segment would not be present in the archive anyway.
And you can copy the WAL segment manually from 0000000200000000000000XX to
pg_xlog/0000000100000000000000XX before starting PITR.

Another thought is that we could archive the partial file, but with a
different name to avoid confusing it with the full segment. For example, we
could archive a partial 000000010000000000000012 segment as
"000000020000000000000012.00000128.partial", where 00000128 indicates how
far that file is valid (this naming is similar to how the backup history
files are named). Recovery wouldn't automatically pick up those files, but
the DBA could easily copy the partial file into pg_xlog with the full
segment's name, if he wants to do PITR to that piece of WAL.

So, suppose you have A replicating to B (via an archive) replicating to C
(via a separate archive); A dies, B is promoted. It sounds to me like
today this will work and with your proposed change it will require
manual intervention.

No. If there is no streaming replication involved, no partial files will
be archived, with or without this patch. There is no change to that
scenario.

Note that it's a bit complicated to set up that scenario today.
Archiving is never enabled in recovery mode, so you'll need to use a
custom cron job or something to maintain the archive that C uses. The
files will not automatically flow from B to the second archive. With the
patch we're discussing, however, it would be easy: just set
archive_mode='always' in B.

- Heikki

#17Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Michael Paquier (#15)
Re: Streaming replication and WAL archive interactions

On 04/22/2015 03:30 AM, Michael Paquier wrote:

This is going to change a behavior that people have been used to for a
couple of releases. I would not mind having this patch do
"archive_mode = on during recovery" => archive only segments generated
by this node + the last partial segment on the old timeline at
promotion, without renaming, to preserve backward-compatible behavior.
If master and standby point to separate archive locations, this way
the operator can sort out later which one he wants to use. If they
point to the same location, archive_command scripts already do
such renaming internally, at least that's what I suspect an
experienced user would do. Now, it is true that not many people are
experienced in this area; I see mistakes regarding such things on a
weekly basis... This .partial segment renaming is something that we
should let the archive_command manage with its internal logic.

Currently, the archive command doesn't know if the segment it's
archiving is partial or not, so you can't put any logic there to manage
it. But if we archive it with the .partial suffix, then you can put
logic in the restore_command to check for .partial files, if you really
want to.

I feel that the best approach is to archive the last, partial segment,
but with the .partial suffix. I don't see any plausible real-world setup
where the current behaviour would be better. I don't really see much
need to archive the partial segment at all, but there's also no harm in
doing it, as long as it's clearly marked with the .partial suffix.

BTW, pg_receivexlog also uses a ".partial" file, while it's streaming
WAL from the server. The .partial suffix is removed when the segment is
complete. So there's some precedent for this. pg_receivexlog adds just
".partial" to the filename, it doesn't add any information of what
portion of the file is valid like I suggested here, though. Perhaps we
should follow pg_receivexlog's example at promotion too, for consistency.
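
As an illustration of the kind of logic a restore script could apply to such files, here is a small sketch that recognizes a plain ".partial" suffix and recovers the full segment's name. The function strip_partial_suffix is hypothetical, and the sketch assumes the simple pg_receivexlog-style suffix with no offset in the name:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: if "archived" ends in ".partial", copy the name
 * without the suffix into "segname" and return true; otherwise return false.
 */
static bool
strip_partial_suffix(const char *archived, char *segname, size_t segname_len)
{
	static const char suffix[] = ".partial";
	size_t		suffix_len = sizeof(suffix) - 1;
	size_t		len;

	assert(archived != NULL && segname != NULL);

	len = strlen(archived);
	if (len <= suffix_len ||
		strcmp(archived + len - suffix_len, suffix) != 0)
		return false;			/* not a .partial file */
	if (len - suffix_len >= segname_len)
		return false;			/* output buffer too small */

	memcpy(segname, archived, len - suffix_len);
	segname[len - suffix_len] = '\0';
	return true;
}
```

A DBA-side tool could use the recovered name as the target when copying the partial file into pg_xlog, as described earlier in the thread.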

- Heikki

#18Michael Paquier
michael.paquier@gmail.com
In reply to: Heikki Linnakangas (#17)
Re: Streaming replication and WAL archive interactions

On Wed, Apr 22, 2015 at 3:38 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 04/22/2015 03:30 AM, Michael Paquier wrote:

This is going to change a behavior that people have been used to for a
couple of releases. I would not mind having this patch do
"archive_mode = on during recovery" => archive only segments generated
by this node + the last partial segment on the old timeline at
promotion, without renaming, to preserve backward-compatible behavior.
If master and standby point to separate archive locations, this way
the operator can sort out later which one he wants to use. If they
point to the same location, archive_command scripts already do
such renaming internally, at least that's what I suspect an
experienced user would do. Now, it is true that not many people are
experienced in this area; I see mistakes regarding such things on a
weekly basis... This .partial segment renaming is something that we
should let the archive_command manage with its internal logic.

Currently, the archive command doesn't know if the segment it's archiving is
partial or not, so you can't put any logic there to manage it. But if we
archive it with the .partial suffix, then you can put logic in the
restore_command to check for .partial files, if you really want to.

Well, now you can check as well if there is a file with the same name
already archived and append a suffix to the new file copied, keep the
two files, and then let restore_command sort things out as it wants
with the two segment files it finds.

I feel that the best approach is to archive the last, partial segment, but
with the .partial suffix. I don't see any plausible real-world setup where
the current behavior would be better. I don't really see much need to
archive the partial segment at all, but there's also no harm in doing it, as
long as it's clearly marked with the .partial suffix.

Well, as long as it is clearly archived at promotion, even with a
suffix, I guess that I am fine... This will need some tweaking on
restore_command for existing applications, but as long as it is
clearly documented I am fine. Shouldn't this be a different patch
though?

BTW, pg_receivexlog also uses a ".partial" file, while it's streaming WAL
from the server. The .partial suffix is removed when the segment is
complete. So there's some precedent for this. pg_receivexlog adds just
".partial" to the filename, it doesn't add any information of what portion
of the file is valid like I suggested here, though. Perhaps we should follow
pg_receivexlog's example at promotion too, for consistency.

Consistency here sounds good to me.
--
Michael

#19Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#16)
Re: Streaming replication and WAL archive interactions

On Wed, Apr 22, 2015 at 2:17 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Note that it's a bit complicated to set up that scenario today. Archiving is
never enabled in recovery mode, so you'll need to use a custom cron job or
something to maintain the archive that C uses. The files will not
automatically flow from B to the second archive. With the patch we're
discussing, however, it would be easy: just set archive_mode='always' in B.

Hmm, I see. But if C never replays the last, partial segment from the
old timeline, how does it follow the timeline switch?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#20Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#15)
Re: Streaming replication and WAL archive interactions

On Tue, Apr 21, 2015 at 8:30 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

This .partial segment renaming is something that we
should let the archive_command manage with its internal logic.

This strikes me as equivalent to saying "we don't know how to make
this work right, but maybe our users will know". That never works
out. As things stand, we have a situation where the archive_command
examples in our documentation are known to be flawed. They don't
fsync the file, and they'll write a partial file and then, when rerun,
fail to copy the full file because there's already something there.
Efforts have been made to fix these problems (see the pg_copy thread),
but they haven't been completed yet, nor have we even documented the
issues with the commands recommended by the documentation. Let's
please not throw anything else on the pile of things we're expecting
users to somehow "get right".
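
To make the two flaws concrete (no fsync, and a leftover partial file that
blocks the rerun), here is a minimal sketch in C of a copy step that avoids
both. It is an illustration only: it is not the pg_copy proposal and not
anything shipped with PostgreSQL. The function name archive_copy and the
".tmp" suffix are made up for this example, and it assumes a single archiver
writing to a local POSIX filesystem.

```c
#define _POSIX_C_SOURCE 200809L     /* for fsync() under strict -std=c99/c11 */

#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy src to dst so that dst either does not exist or
 * is complete and flushed.  Returns 0 on success, -1 on any failure,
 * including "dst is already in the archive".
 */
static int
archive_copy(const char *src, const char *dst)
{
    char        tmp[1024];
    char        buf[8192];
    ssize_t     nread;
    int         in = -1;
    int         out = -1;

    assert(src != NULL && dst != NULL);

    /* Never overwrite a file that was archived earlier. */
    if (access(dst, F_OK) == 0)
        return -1;
    if (snprintf(tmp, sizeof(tmp), "%s.tmp", dst) >= (int) sizeof(tmp))
        return -1;
    if ((in = open(src, O_RDONLY)) < 0)
        return -1;

    /* A temp file left over from an interrupted run is simply truncated. */
    if ((out = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600)) < 0)
        goto fail;

    while ((nread = read(in, buf, sizeof(buf))) > 0)
    {
        ssize_t     done = 0;

        while (done < nread)
        {
            ssize_t     nwritten = write(out, buf + done, nread - done);

            if (nwritten < 0)
                goto fail;
            done += nwritten;
        }
    }
    if (nread < 0)
        goto fail;

    /* Flush the data before the file becomes visible under its final name. */
    if (fsync(out) != 0)
        goto fail;
    close(out);
    close(in);

    if (rename(tmp, dst) != 0)
    {
        unlink(tmp);
        return -1;
    }
    return 0;

fail:
    if (out >= 0)
    {
        close(out);
        unlink(tmp);
    }
    close(in);
    return -1;
}
```

Because the data only ever appears under the final name after a successful
fsync and rename, a crash mid-copy leaves at most a ".tmp" file, and the
rerun is not blocked by it. A fuller version would also fsync the containing
directory after the rename, and would need something stronger than the
access() check if several archivers could write the same name concurrently.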

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#21Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Robert Haas (#19)
Re: Streaming replication and WAL archive interactions

On 04/22/2015 09:30 PM, Robert Haas wrote:

On Wed, Apr 22, 2015 at 2:17 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Note that it's a bit complicated to set up that scenario today. Archiving is
never enabled in recovery mode, so you'll need to use a custom cron job or
something to maintain the archive that C uses. The files will not
automatically flow from B to the second archive. With the patch we're
discussing, however, it would be easy: just set archive_mode='always' in B.

Hmm, I see. But if C never replays the last, partial segment from the
old timeline, how does it follow the timeline switch?

At timeline switch, we copy the old segment to the new timeline, and
start writing where we left off. So the WAL from the old timeline is
found in the segment nominally belonging to the new timeline.

For example, imagine that you perform point-in-time recovery to WAL position
0/1237E568, on timeline 1. That falls within segment
000000010000000000000012. Then we end recovery, and switch to timeline
2. After the switch, and some more WAL-logged actions, we'll have these
files in pg_xlog:

000000010000000000000011
000000010000000000000012
000000020000000000000012
000000020000000000000013
000000020000000000000014

Note that there are two segments ending in "12". They both have the same
contents up to offset 0x37E568, corresponding to the switch point
0/1237E568. After that, the contents diverge: the segment on the new
timeline contains a checkpoint/end-of-recovery record at that point,
followed by new WAL belonging to the new timeline.

Recovery knows about that, so that if you set recovery target to
timeline 2, and it needs the WAL at the beginning of segment 12 (still
belonging to timeline 1), it will try to restore both
"000000010000000000000012" and "000000020000000000000012".
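
The arithmetic behind that example can be sketched in C as follows. This is
a standalone illustration, not PostgreSQL code: the function name wal_locate
is made up, and it assumes the default 16 MB segment size. The WAL position
0/1237E568 is written here as the 64-bit value 0x1237E568.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define WAL_SEG_SIZE    UINT64_C(0x1000000)     /* assumed default: 16 MB */

/*
 * Build the 24-character name of the segment that holds "lsn" on timeline
 * "tli", and return the byte offset of "lsn" within that segment.
 */
static uint32_t
wal_locate(uint32_t tli, uint64_t lsn, char *fname, size_t fnamelen)
{
    uint64_t    segno = lsn / WAL_SEG_SIZE;
    uint64_t    segs_per_id = UINT64_C(0x100000000) / WAL_SEG_SIZE;

    assert(fname != NULL && fnamelen >= 25);

    /* timeline, then the high and low parts of the segment number */
    snprintf(fname, fnamelen, "%08X%08X%08X",
             (unsigned int) tli,
             (unsigned int) (segno / segs_per_id),
             (unsigned int) (segno % segs_per_id));

    return (uint32_t) (lsn % WAL_SEG_SIZE);
}
```

Calling wal_locate(1, 0x1237E568, ...) and wal_locate(2, 0x1237E568, ...)
gives the two file names ending in "12" from the listing, and both calls
return the same offset, 0x37E568: only the timeline part of the name differs.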

- Heikki

#22Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#21)
Re: Streaming replication and WAL archive interactions

On Wed, Apr 22, 2015 at 3:01 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 04/22/2015 09:30 PM, Robert Haas wrote:

On Wed, Apr 22, 2015 at 2:17 AM, Heikki Linnakangas <hlinnaka@iki.fi>
wrote:

Note that it's a bit complicated to set up that scenario today. Archiving
is
never enabled in recovery mode, so you'll need to use a custom cron job
or
something to maintain the archive that C uses. The files will not
automatically flow from B to the second archive. With the patch we're
discussing, however, it would be easy: just set archive_mode='always' in
B.

Hmm, I see. But if C never replays the last, partial segment from the
old timeline, how does it follow the timeline switch?

At timeline switch, we copy the old segment to the new timeline, and start
writing where we left off. So the WAL from the old timeline is found in the
segment nominally belonging to the new timeline.

Check.

For example, imagine that you perform point-in-time recovery to WAL position
0/1237E568, on timeline 1. That falls within segment
000000010000000000000012. Then we end recovery, and switch to timeline 2.
After the switch, and some more WAL-logged actions, we'll have these files
in pg_xlog:

000000010000000000000011
000000010000000000000012
000000020000000000000012
000000020000000000000013
000000020000000000000014

Is the 000000010000000000000012 file a "partial" segment of the sort
you're proposing to no longer archive?

Note that there are two segments ending in "12". They both have the same
contents up to offset 0x37E568, corresponding to the switch point 0/1237E568.
After that, the contents diverge: the segment on the new timeline contains a
checkpoint/end-of-recovery record at that point, followed by new WAL
belonging to the new timeline.

Check.

Recovery knows about that, so that if you set recovery target to timeline 2,
and it needs the WAL at the beginning of segment 12 (still belonging to
timeline 1), it will try to restore both "000000010000000000000012" and
"000000020000000000000012".

What if you set the recovery target to timeline 3?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#23Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Robert Haas (#22)
Re: Streaming replication and WAL archive interactions

On 04/22/2015 10:21 PM, Robert Haas wrote:

On Wed, Apr 22, 2015 at 3:01 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

For example, imagine that you perform point-in-time recovery to WAL position
0/1237E568, on timeline 1. That falls within segment
000000010000000000000012. Then we end recovery, and switch to timeline 2.
After the switch, and some more WAL-logged actions, we'll have these files
in pg_xlog:

000000010000000000000011
000000010000000000000012
000000020000000000000012
000000020000000000000013
000000020000000000000014

Is the 000000010000000000000012 file a "partial" segment of the sort
you're proposing to no longer archive?

If you did pure archive recovery, with no streaming replication
involved, then no. If it was created by streaming replication, and the
replication had not filled the whole segment yet, then yes, it would be
a partial segment.

Note that there are two segments ending in "12". They both have the same
contents up to offset 0x37E568, corresponding to the switch point 0/1237E568.
After that, the contents diverge: the segment on the new timeline contains a
checkpoint/end-of-recovery record at that point, followed by new WAL
belonging to the new timeline.

Check.

Recovery knows about that, so that if you set recovery target to timeline 2,
and it needs the WAL at the beginning of segment 12 (still belonging to
timeline 1), it will try to restore both "000000010000000000000012" and
"000000020000000000000012".

What if you set the recovery target to timeline 3?

It depends how timeline 3 was created. If timeline 3 was forked off from
timeline 2, then recovery would find it. If it was forked off directly
from timeline 1, then no.
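
A small sketch of why the ancestry matters, in C. This is a simplified model
and not the server's timeline-history code: the HistoryEntry struct and both
function names are invented for the example, and the second switch point
used in the usage note below is an arbitrary value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One ancestor of the target timeline: WAL before "end" belongs to "tli". */
typedef struct
{
    uint32_t    tli;
    uint64_t    end;        /* switch point where the next timeline forked */
} HistoryEntry;

/*
 * Which timeline owns "lsn" when recovering to "target_tli"?  "hist" lists
 * the target's ancestors, oldest first.
 */
static uint32_t
timeline_of_lsn(const HistoryEntry *hist, size_t n,
                uint32_t target_tli, uint64_t lsn)
{
    size_t      i;

    assert(hist != NULL || n == 0);
    for (i = 0; i < n; i++)
    {
        if (lsn < hist[i].end)
            return hist[i].tli;
    }
    return target_tli;
}

/* Is "tli" the target itself or one of its ancestors? */
static int
timeline_in_history(const HistoryEntry *hist, size_t n,
                    uint32_t target_tli, uint32_t tli)
{
    size_t      i;

    assert(hist != NULL || n == 0);
    if (tli == target_tli)
        return 1;
    for (i = 0; i < n; i++)
    {
        if (hist[i].tli == tli)
            return 1;
    }
    return 0;
}
```

With a history of {1 until 0/1237E568, 2 until 0/14000000} for timeline 3,
timeline 2 is an ancestor and its segments are candidates during recovery.
With a history of just {1 until 0/1237E568}, timeline 2 is not in the
ancestry at all, so files named 00000002... are never considered.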

- Heikki

#24Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#23)
Re: Streaming replication and WAL archive interactions

On Wed, Apr 22, 2015 at 3:34 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 04/22/2015 10:21 PM, Robert Haas wrote:

On Wed, Apr 22, 2015 at 3:01 PM, Heikki Linnakangas <hlinnaka@iki.fi>
wrote:

For example, imagine that you perform point-in-time recovery to WAL position
0/1237E568, on timeline 1. That falls within segment
000000010000000000000012. Then we end recovery, and switch to timeline 2.
After the switch, and some more WAL-logged actions, we'll have these
files
in pg_xlog:

000000010000000000000011
000000010000000000000012
000000020000000000000012
000000020000000000000013
000000020000000000000014

Is the 000000010000000000000012 file a "partial" segment of the sort
you're proposing to no longer archive?

If you did pure archive recovery, with no streaming replication involved,
then no. If it was created by streaming replication, and the replication had
not filled the whole segment yet, then yes, it would be a partial segment.

Why the difference?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#25Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Robert Haas (#24)
Re: Streaming replication and WAL archive interactions

On 04/22/2015 11:58 PM, Robert Haas wrote:

On Wed, Apr 22, 2015 at 3:34 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 04/22/2015 10:21 PM, Robert Haas wrote:

On Wed, Apr 22, 2015 at 3:01 PM, Heikki Linnakangas <hlinnaka@iki.fi>
wrote:

For example, imagine that you perform point-in-time recovery to WAL position
0/1237E568, on timeline 1. That falls within segment
000000010000000000000012. Then we end recovery, and switch to timeline 2.
After the switch, and some more WAL-logged actions, we'll have these
files
in pg_xlog:

000000010000000000000011
000000010000000000000012
000000020000000000000012
000000020000000000000013
000000020000000000000014

Is the 000000010000000000000012 file a "partial" segment of the sort
you're proposing to no longer archive?

If you did pure archive recovery, with no streaming replication involved,
then no. If it was created by streaming replication, and the replication had
not filled the whole segment yet, then yes, it would be a partial segment.

Why the difference?

Because we don't archive partial segments, except for the last one at a
timeline switch, and there was no timeline switch to timeline 1 within
that segment.

It doesn't really matter, though. The behaviour at the switch from
timeline 1 to 2 works the same, whether the 000000010000000000000012
segment is complete or not.

- Heikki

#26Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Michael Paquier (#18)
2 attachment(s)
Re: Streaming replication and WAL archive interactions

On 04/22/2015 10:07 AM, Michael Paquier wrote:

On Wed, Apr 22, 2015 at 3:38 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I feel that the best approach is to archive the last, partial segment, but
with the .partial suffix. I don't see any plausible real-world setup where
the current behavior would be better. I don't really see much need to
archive the partial segment at all, but there's also no harm in doing it, as
long as it's clearly marked with the .partial suffix.

Well, as long as it is clearly archived at promotion, even with a
suffix, I guess that I am fine... This will need some tweaking on
restore_command for existing applications, but as long as it is
clearly documented I am fine. Shouldn't this be a different patch
though?

Ok, I came up with the attached, which adds the .partial suffix to the
partial WAL segment that's archived after promotion. I couldn't find any
natural place to talk about it in the docs, though. I think after the
docs changes from the main patch are applied, it would be natural to
mention this in the "Continuous archiving in standby", so I think I'll
add that later.

Barring objections, I'll push this later tonight.

Now that we got this last-partial-segment problem out of the way, I'm
going to try fixing the problem you (Michael) pointed out about relying
on pgstat file. Meanwhile, I'd love to get more feedback on the rest of
the patch, and the documentation.

- Heikki

Attachments:

0001-Add-macros-to-check-if-a-filename-is-a-WAL-segment-o.patch (application/x-patch)
From 15c123141d1eef0d6b05a384d1c5c202ffa04a84 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Fri, 8 May 2015 12:04:46 +0300
Subject: [PATCH 1/2] Add macros to check if a filename is a WAL segment or
 other such file.

We had many instances of the strlen + strspn combination to check for that.
This makes the code a bit easier to read.
---
 src/backend/access/transam/xlog.c      | 11 +++--------
 src/backend/replication/basebackup.c   |  7 ++-----
 src/bin/pg_basebackup/pg_receivexlog.c | 16 ++--------------
 src/bin/pg_resetxlog/pg_resetxlog.c    |  8 ++++++--
 src/include/access/xlog_internal.h     | 18 ++++++++++++++++++
 5 files changed, 31 insertions(+), 29 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 92822a1..5097173 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3577,8 +3577,7 @@ RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
 	while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
 	{
 		/* Ignore files that are not XLOG segments */
-		if (strlen(xlde->d_name) != 24 ||
-			strspn(xlde->d_name, "0123456789ABCDEF") != 24)
+		if (!IsXLogFileName(xlde->d_name))
 			continue;
 
 		/*
@@ -3650,8 +3649,7 @@ RemoveNonParentXlogFiles(XLogRecPtr switchpoint, TimeLineID newTLI)
 	while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
 	{
 		/* Ignore files that are not XLOG segments */
-		if (strlen(xlde->d_name) != 24 ||
-			strspn(xlde->d_name, "0123456789ABCDEF") != 24)
+		if (!IsXLogFileName(xlde->d_name))
 			continue;
 
 		/*
@@ -3839,10 +3837,7 @@ CleanupBackupHistory(void)
 
 	while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
 	{
-		if (strlen(xlde->d_name) > 24 &&
-			strspn(xlde->d_name, "0123456789ABCDEF") == 24 &&
-			strcmp(xlde->d_name + strlen(xlde->d_name) - strlen(".backup"),
-				   ".backup") == 0)
+		if (IsBackupHistoryFileName(xlde->d_name))
 		{
 			if (XLogArchiveCheckDone(xlde->d_name))
 			{
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 3563fd9..de103c6 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -350,17 +350,14 @@ perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
 		while ((de = ReadDir(dir, "pg_xlog")) != NULL)
 		{
 			/* Does it look like a WAL segment, and is it in the range? */
-			if (strlen(de->d_name) == 24 &&
-				strspn(de->d_name, "0123456789ABCDEF") == 24 &&
+			if (IsXLogFileName(de->d_name) &&
 				strcmp(de->d_name + 8, firstoff + 8) >= 0 &&
 				strcmp(de->d_name + 8, lastoff + 8) <= 0)
 			{
 				walFileList = lappend(walFileList, pstrdup(de->d_name));
 			}
 			/* Does it look like a timeline history file? */
-			else if (strlen(de->d_name) == 8 + strlen(".history") &&
-					 strspn(de->d_name, "0123456789ABCDEF") == 8 &&
-					 strcmp(de->d_name + 8, ".history") == 0)
+			else if (IsTLHistoryFileName(de->d_name))
 			{
 				historyFileList = lappend(historyFileList, pstrdup(de->d_name));
 			}
diff --git a/src/bin/pg_basebackup/pg_receivexlog.c b/src/bin/pg_basebackup/pg_receivexlog.c
index e77d2b6..53802af 100644
--- a/src/bin/pg_basebackup/pg_receivexlog.c
+++ b/src/bin/pg_basebackup/pg_receivexlog.c
@@ -188,23 +188,11 @@ FindStreamingStart(uint32 *tli)
 
 		/*
 		 * Check if the filename looks like an xlog file, or a .partial file.
-		 * Xlog files are always 24 characters, and .partial files are 32
-		 * characters.
 		 */
-		if (strlen(dirent->d_name) == 24)
-		{
-			if (strspn(dirent->d_name, "0123456789ABCDEF") != 24)
-				continue;
+		if (IsXLogFileName(dirent->d_name))
 			ispartial = false;
-		}
-		else if (strlen(dirent->d_name) == 32)
-		{
-			if (strspn(dirent->d_name, "0123456789ABCDEF") != 24)
-				continue;
-			if (strcmp(&dirent->d_name[24], ".partial") != 0)
-				continue;
+		else if (IsPartialXLogFileName(dirent->d_name))
 			ispartial = true;
-		}
 		else
 			continue;
 
diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c
index 4a22575..393d580 100644
--- a/src/bin/pg_resetxlog/pg_resetxlog.c
+++ b/src/bin/pg_resetxlog/pg_resetxlog.c
@@ -906,14 +906,18 @@ FindEndOfXLOG(void)
 
 	while (errno = 0, (xlde = readdir(xldir)) != NULL)
 	{
-		if (strlen(xlde->d_name) == 24 &&
-			strspn(xlde->d_name, "0123456789ABCDEF") == 24)
+		if (IsXLogFileName(xlde->d_name))
 		{
 			unsigned int tli,
 						log,
 						seg;
 			XLogSegNo	segno;
 
+			/*
+			 * Note: We don't use XLogFromFileName here, because we want
+			 * to use the segment size from the control file, not the size
+			 * the pg_resetxlog binary was compiled with
+			 */
 			sscanf(xlde->d_name, "%08X%08X%08X", &tli, &log, &seg);
 			segno = ((uint64) log) * segs_per_xlogid + seg;
 
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 75cf435..714850c 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -142,6 +142,14 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
 			 (uint32) ((logSegNo) / XLogSegmentsPerXLogId), \
 			 (uint32) ((logSegNo) % XLogSegmentsPerXLogId))
 
+#define IsXLogFileName(fname) \
+	(strlen(fname) == 24 && strspn(fname, "0123456789ABCDEF") == 24)
+
+#define IsPartialXLogFileName(fname)	\
+	(strlen(fname) == 24 + strlen(".partial") &&	\
+	 strspn(fname, "0123456789ABCDEF") == 24 &&		\
+	 strcmp((fname) + 24, ".partial") == 0)
+
 #define XLogFromFileName(fname, tli, logSegNo)	\
 	do {												\
 		uint32 log;										\
@@ -158,6 +166,11 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
 #define TLHistoryFileName(fname, tli)	\
 	snprintf(fname, MAXFNAMELEN, "%08X.history", tli)
 
+#define IsTLHistoryFileName(fname)	\
+	(strlen(fname) == 8 + strlen(".history") &&		\
+	 strspn(fname, "0123456789ABCDEF") == 8 &&		\
+	 strcmp((fname) + 8, ".history") == 0)
+
 #define TLHistoryFilePath(path, tli)	\
 	snprintf(path, MAXPGPATH, XLOGDIR "/%08X.history", tli)
 
@@ -169,6 +182,11 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
 			 (uint32) ((logSegNo) / XLogSegmentsPerXLogId),		  \
 			 (uint32) ((logSegNo) % XLogSegmentsPerXLogId), offset)
 
+#define IsBackupHistoryFileName(fname) \
+	(strlen(fname) > 24 && \
+	 strspn(fname, "0123456789ABCDEF") == 24 && \
+	 strcmp((fname) + strlen(fname) - strlen(".backup"), ".backup") == 0)
+
 #define BackupHistoryFilePath(path, tli, logSegNo, offset)	\
 	snprintf(path, MAXPGPATH, XLOGDIR "/%08X%08X%08X.%08X.backup", tli, \
 			 (uint32) ((logSegNo) / XLogSegmentsPerXLogId), \
-- 
2.1.4
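
For readers who want to try the new macros outside the tree, here is a
standalone sketch in C. The four macros are copied from the patch above; the
WalFileKind enum and the classify_wal_file function are additions made up
for this example and are not part of the patch.

```c
#include <assert.h>
#include <string.h>

/* Copied from the patch to xlog_internal.h above. */
#define IsXLogFileName(fname) \
    (strlen(fname) == 24 && strspn(fname, "0123456789ABCDEF") == 24)

#define IsPartialXLogFileName(fname) \
    (strlen(fname) == 24 + strlen(".partial") && \
     strspn(fname, "0123456789ABCDEF") == 24 && \
     strcmp((fname) + 24, ".partial") == 0)

#define IsTLHistoryFileName(fname) \
    (strlen(fname) == 8 + strlen(".history") && \
     strspn(fname, "0123456789ABCDEF") == 8 && \
     strcmp((fname) + 8, ".history") == 0)

#define IsBackupHistoryFileName(fname) \
    (strlen(fname) > 24 && \
     strspn(fname, "0123456789ABCDEF") == 24 && \
     strcmp((fname) + strlen(fname) - strlen(".backup"), ".backup") == 0)

/* Hypothetical classification, for illustration only. */
typedef enum
{
    WAL_OTHER,
    WAL_SEGMENT,
    WAL_PARTIAL,
    WAL_TL_HISTORY,
    WAL_BACKUP_HISTORY
} WalFileKind;

static WalFileKind
classify_wal_file(const char *fname)
{
    assert(fname != NULL);

    if (IsXLogFileName(fname))
        return WAL_SEGMENT;
    if (IsPartialXLogFileName(fname))
        return WAL_PARTIAL;
    if (IsTLHistoryFileName(fname))
        return WAL_TL_HISTORY;
    if (IsBackupHistoryFileName(fname))
        return WAL_BACKUP_HISTORY;
    return WAL_OTHER;
}
```

Note that the macros accept only upper-case hexadecimal digits, so a name
with lower-case hex letters is classified as WAL_OTHER.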

0002-At-promotion-archive-last-segment-from-old-timeline-.patch (application/x-patch)
From 1a78bffda7034cb6d37348527cc1269a09fffe1a Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Fri, 8 May 2015 16:08:26 +0300
Subject: [PATCH 2/2] At promotion, archive last segment from old timeline with
 .partial suffix.

Previously, we would archive the possibly-incomplete WAL segment with its
normal filename, but that causes trouble if the server owning that timeline
is still running, and tries to archive the same segment later. It's not nice
for the standby to trip up the master's archival like that. And it's pretty
confusing, anyway, to have an incomplete segment in the archive that's
indistinguishable from a normal, complete segment.

To avoid such confusion, add a .partial suffix to the file. Or to be more
precise, make a copy of the old segment under the .partial suffix, and
archive that instead of the original file. pg_receivexlog also uses the
.partial suffix for the same purpose, to tell apart incompletely streamed
files from complete ones.

There is no automatic mechanism to use the .partial files at recovery, so
they will go unused, unless the administrator manually copies them to
the pg_xlog directory (and removes the .partial suffix). Recovery won't
normally need the WAL - when recovering to the new timeline, it will find
the same WAL on the first segment on the new timeline instead - but it
nevertheless feels better to archive the file with the .partial suffix, for
debugging purposes if nothing else.
---
 src/backend/access/transam/xlog.c  | 133 ++++++++++++++++++++++++++++---------
 src/include/access/xlog_internal.h |   5 ++
 src/include/postmaster/pgarch.h    |   2 +-
 3 files changed, 107 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5097173..6f7e3bd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3020,24 +3020,22 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 }
 
 /*
- * Create a new XLOG file segment by copying a pre-existing one.
+ * Copy a WAL segment file in pg_xlog directory.
  *
- * destsegno: identify segment to be created.
+ * dstfname		destination filename
+ * srcfname		source filename
+ * upto			how much of the source file to copy? (the rest is filled with
+ *				zeros)
  *
- * srcTLI, srclog, srcseg: identify segment to be copied (could be from
- *		a different timeline)
+ * If dstfname is not given, the file is created with a temporary filename,
+ * which is returned.  Both filenames are relative to the pg_xlog directory.
  *
- * upto: how much of the source file to copy? (the rest is filled with zeros)
- *
- * Currently this is only used during recovery, and so there are no locking
- * considerations.  But we should be just as tense as XLogFileInit to avoid
- * emplacing a bogus file.
+ * NB: Any existing file with the same name will be overwritten!
  */
-static void
-XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
-			 int upto)
+static char *
+XLogFileCopy(char *dstfname, char *srcfname, int upto)
 {
-	char		path[MAXPGPATH];
+	char		srcpath[MAXPGPATH];
 	char		tmppath[MAXPGPATH];
 	char		buffer[XLOG_BLCKSZ];
 	int			srcfd;
@@ -3047,12 +3045,12 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 	/*
 	 * Open the source file
 	 */
-	XLogFilePath(path, srcTLI, srcsegno);
-	srcfd = OpenTransientFile(path, O_RDONLY | PG_BINARY, 0);
+	snprintf(srcpath, MAXPGPATH, XLOGDIR "/%s", srcfname);
+	srcfd = OpenTransientFile(srcpath, O_RDONLY | PG_BINARY, 0);
 	if (srcfd < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m", path)));
+				 errmsg("could not open file \"%s\": %m", srcpath)));
 
 	/*
 	 * Copy into a temp file name.
@@ -3094,10 +3092,12 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 				if (errno != 0)
 					ereport(ERROR,
 							(errcode_for_file_access(),
-							 errmsg("could not read file \"%s\": %m", path)));
+							 errmsg("could not read file \"%s\": %m",
+									srcpath)));
 				else
 					ereport(ERROR,
-							(errmsg("not enough data in file \"%s\"", path)));
+							(errmsg("not enough data in file \"%s\"",
+									srcpath)));
 			}
 		}
 		errno = 0;
@@ -3131,10 +3131,24 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 	CloseTransientFile(srcfd);
 
 	/*
-	 * Now move the segment into place with its final name.
+	 * Now move the segment into place with its final name.  (Or just return
+	 * the path to the file we created, if the caller wants to handle the
+	 * rest on its own.)
 	 */
-	if (!InstallXLogFileSegment(&destsegno, tmppath, false, 0, false))
-		elog(ERROR, "InstallXLogFileSegment should not have failed");
+	if (dstfname)
+	{
+		char		dstpath[MAXPGPATH];
+
+		snprintf(dstpath, MAXPGPATH, XLOGDIR "/%s", dstfname);
+		if (rename(tmppath, dstpath) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not rename file \"%s\" to \"%s\": %m",
+							tmppath, dstpath)));
+		return NULL;
+	}
+	else
+		return pstrdup(tmppath);
 }
 
 /*
@@ -3577,7 +3591,8 @@ RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
 	while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
 	{
 		/* Ignore files that are not XLOG segments */
-		if (!IsXLogFileName(xlde->d_name))
+		if (!IsXLogFileName(xlde->d_name) &&
+			!IsPartialXLogFileName(xlde->d_name))
 			continue;
 
 		/*
@@ -5189,25 +5204,79 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 	 * of the old timeline up to the switch point, to the starting WAL segment
 	 * on the new timeline.
 	 *
-	 * Notify the archiver that the last WAL segment of the old timeline is
-	 * ready to copy to archival storage if its .done file doesn't exist
-	 * (e.g., if it's the restored WAL file, it's expected to have .done file).
-	 * Otherwise, it is not archived for a while.
+	 * What to do with the partial segment on the old timeline? If we don't
+	 * archive it, and the server that created the WAL never archives it
+	 * either (e.g. because it was hit by a meteor), it will never make it to
+	 * the archive. That's OK from our point of view, because the new segment
+	 * that we created with the new TLI contains all the WAL from the old
+	 * timeline up to the switch point. But if you later try to do PITR to the
+	 * "missing" WAL on the old timeline, recovery won't find it in the
+	 * archive. It's physically present in the new file with new TLI, but
+	 * recovery won't look there when it's recovering to the older timeline.
+	 * On the other hand, if we archive the partial segment, and the original
+	 * server on that timeline is still running and archives the completed
+	 * version of the same segment later, it will fail. (We used to do that in
+	 * 9.4 and below, and it caused such problems).
+	 *
+	 * As a compromise, we archive the last segment with the .partial suffix.
+	 * Archive recovery will never try to read .partial segments, so they will
+	 * normally go unused. But in the odd PITR case, the administrator can
+	 * copy them manually to the pg_xlog directory (removing the suffix). They
+	 * can be useful in debugging, too.
+	 *
+	 * If a .done file already exists for the old timeline, however, there is
+	 * already a complete copy of the file in the archive, and there is no
+	 * need to archive the partial one. (In particular, if it was restored
+	 * from the archive to begin with, it's expected to have .done file).
 	 */
 	if (endLogSegNo == startLogSegNo)
 	{
-		XLogFileCopy(startLogSegNo, endTLI, endLogSegNo,
-					 endOfLog % XLOG_SEG_SIZE);
+		char	   *tmpfname;
+
+		XLogFileName(xlogfname, endTLI, endLogSegNo);
+
+		/*
+		 * Make a copy of the file on the new timeline.
+		 *
+		 * Writing WAL isn't allowed yet, so there are no locking
+		 * considerations. But we should be just as tense as XLogFileInit to
+		 * avoid emplacing a bogus file.
+		 */
+		tmpfname = XLogFileCopy(NULL, xlogfname, endOfLog % XLOG_SEG_SIZE);
+		if (!InstallXLogFileSegment(&endLogSegNo, tmpfname, false, 0, false))
+			elog(ERROR, "InstallXLogFileSegment should not have failed");
 
-		/* Create .ready file only when neither .ready nor .done files exist */
-		if (XLogArchivingActive())
+		/*
+		 * Make a .partial copy for the archive (unless the original file was
+		 * already archived)
+		 */
+		if (XLogArchivingActive() && XLogArchiveIsBusy(xlogfname))
 		{
-			XLogFileName(xlogfname, endTLI, endLogSegNo);
-			XLogArchiveCheckDone(xlogfname);
+			char		partialfname[MAXFNAMELEN];
+
+			snprintf(partialfname, MAXFNAMELEN, "%s.partial", xlogfname);
+
+			/* Make sure there's no .done or .ready file for it. */
+			XLogArchiveCleanup(partialfname);
+
+			/*
+			 * We copy the whole segment, not just upto the switch point.
+			 * The portion after the switch point might be garbage, but it
+			 * might also be valid WAL, if we stopped recovery at user's
+			 * request before reaching the end. Better to preserve the
+			 * file as it is, garbage and all, than lose the evidence if
+			 * something goes wrong.
+			 */
+			(void) XLogFileCopy(partialfname, xlogfname, XLOG_SEG_SIZE);
+			XLogArchiveNotify(partialfname);
 		}
 	}
 	else
 	{
+		/*
+		 * The switch happened at a segment boundary, so just create the next
+		 * segment on the new timeline.
+		 */
 		bool		use_existent = true;
 		int			fd;
 
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 714850c..e50d0f3 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -145,6 +145,11 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
 #define IsXLogFileName(fname) \
 	(strlen(fname) == 24 && strspn(fname, "0123456789ABCDEF") == 24)
 
+/*
+ * XLOG segment with .partial suffix.  Used by pg_receivexlog and at end of
+ * archive recovery, when we want to archive a WAL segment but it might not
+ * be complete yet.
+ */
 #define IsPartialXLogFileName(fname)	\
 	(strlen(fname) == 24 + strlen(".partial") &&	\
 	 strspn(fname, "0123456789ABCDEF") == 24 &&		\
diff --git a/src/include/postmaster/pgarch.h b/src/include/postmaster/pgarch.h
index 9f692eb..425e2ab 100644
--- a/src/include/postmaster/pgarch.h
+++ b/src/include/postmaster/pgarch.h
@@ -24,7 +24,7 @@
  */
 #define MIN_XFN_CHARS	16
 #define MAX_XFN_CHARS	40
-#define VALID_XFN_CHARS "0123456789ABCDEF.history.backup"
+#define VALID_XFN_CHARS "0123456789ABCDEF.history.backup.partial"
 
 /* ----------
  * Functions called from postmaster
-- 
2.1.4
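
The two XLogFileCopy calls in the patch above differ only in the "upto"
argument: the copy for the new timeline keeps the old segment up to the
switch point and zero-fills the rest, while the .partial copy keeps the
whole segment. A minimal in-memory sketch of that behaviour, in C, follows.
It stands in for the file I/O in the patch; the function name
copy_segment_prefix is made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Fill "dst" (seg_size bytes) with the first "upto" bytes of "src",
 * followed by zeros up to the end of the segment.
 */
static void
copy_segment_prefix(unsigned char *dst, const unsigned char *src,
                    size_t seg_size, size_t upto)
{
    assert(dst != NULL && src != NULL && upto <= seg_size);

    memcpy(dst, src, upto);
    memset(dst + upto, 0, seg_size - upto);
}
```

With upto set to the offset of the switch point, everything after it in the
destination is zero. With upto equal to seg_size, the destination is a
byte-for-byte copy, which is what the .partial file is meant to preserve.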

#27Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#26)
1 attachment(s)
Re: Streaming replication and WAL archive interactions

On 05/08/2015 04:21 PM, Heikki Linnakangas wrote:

On 04/22/2015 10:07 AM, Michael Paquier wrote:

On Wed, Apr 22, 2015 at 3:38 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I feel that the best approach is to archive the last, partial segment, but
with the .partial suffix. I don't see any plausible real-world setup where
the current behavior would be better. I don't really see much need to
archive the partial segment at all, but there's also no harm in doing it, as
long as it's clearly marked with the .partial suffix.

Well, as long as it is clearly archived at promotion, even with a
suffix, I guess that I am fine... This will need some tweaking on
restore_command for existing applications, but as long as it is
clearly documented I am fine. Shouldn't this be a different patch
though?

Ok, I came up with the attached, which adds the .partial suffix to the
partial WAL segment that's archived after promotion. I couldn't find any
natural place to talk about it in the docs, though. I think after the
docs changes from the main patch are applied, it would be natural to
mention this in the "Continuous archiving in standby", so I think I'll
add that later.
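
For readers following along, here is a minimal, self-contained sketch of how such a file name could be recognized. It mirrors the IsPartialXLogFileName macro visible in the patch; the final suffix comparison and the names is_partial_xlog_file_name, XLOG_FNAME_LEN and PARTIAL_SUFFIX are assumptions made for illustration, not the committed code.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* A WAL segment file name is 24 upper-case hex digits. */
#define XLOG_FNAME_LEN	24
#define PARTIAL_SUFFIX	".partial"

/*
 * Return true if fname looks like "<24 hex digits>.partial".
 * Hypothetical helper; the suffix check is an assumption.
 */
static bool
is_partial_xlog_file_name(const char *fname)
{
	assert(fname != NULL);

	return strlen(fname) == XLOG_FNAME_LEN + strlen(PARTIAL_SUFFIX) &&
		strspn(fname, "0123456789ABCDEF") == XLOG_FNAME_LEN &&
		strcmp(fname + XLOG_FNAME_LEN, PARTIAL_SUFFIX) == 0;
}
```

A restore script that wants to use the partial segment would presumably have to copy it to the name without the suffix itself, which is the kind of restore_command tweaking mentioned above.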

Barring objections, I'll push this later tonight.

Applied that part.

Now that we got this last-partial-segment problem out of the way, I'm
going to try fixing the problem you (Michael) pointed out about relying
on pgstat file. Meanwhile, I'd love to get more feedback on the rest of
the patch, and the documentation.

And here is a new version of the patch. I kept the approach of using
pgstat, but it now only polls pgstat every 10 seconds, and doesn't block
to wait for updated stats.

- Heikki

Attachments:

v3-0001-Make-WAL-archival-behave-more-sensibly-in-standby.patch (application/x-patch)
From 08ca3cc7b9824503b793e149247ea9c6d3a7f323 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Thu, 16 Apr 2015 14:40:24 +0300
Subject: [PATCH v3 1/1] Make WAL archival behave more sensibly in standby
 mode.

This adds two new archive_modes, 'shared' and 'always', to indicate whether
the WAL archive is shared between the primary and standby, or not. In
shared mode, the standby tracks which files have been archived by the
primary. The standby refrains from recycling files that the primary has not
yet archived, and at failover, the standby archives all those files too
from the old timeline. In 'always' mode, the standby's WAL archive is taken
to be separate from the primary's, and the standby independently archives
all files it receives from the primary.

This adds a new "archival status" message to the protocol. WAL sender sends
one automatically, when the last archived WAL file, as reported in pgstat,
changes. (Or rather, some time after it changes. We're not in a hurry; the
standby doesn't need an up-to-the-second status.)

Fujii Masao and me.
---
 doc/src/sgml/config.sgml                      |  12 +-
 doc/src/sgml/high-availability.sgml           |  48 +++++++
 doc/src/sgml/protocol.sgml                    |  31 +++++
 src/backend/access/transam/xlog.c             |  29 +++-
 src/backend/postmaster/pgstat.c               |  44 ++++++
 src/backend/postmaster/postmaster.c           |  37 +++--
 src/backend/replication/walreceiver.c         | 172 +++++++++++++++++++-----
 src/backend/replication/walsender.c           | 186 ++++++++++++++++++++++----
 src/backend/utils/misc/guc.c                  |  21 +--
 src/backend/utils/misc/postgresql.conf.sample |   2 +-
 src/include/access/xlog.h                     |  14 +-
 src/include/pgstat.h                          |   2 +
 12 files changed, 513 insertions(+), 85 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0d8624a..ac845e0 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2521,7 +2521,7 @@ include_dir 'conf.d'
 
     <variablelist>
      <varlistentry id="guc-archive-mode" xreflabel="archive_mode">
-      <term><varname>archive_mode</varname> (<type>boolean</type>)
+      <term><varname>archive_mode</varname> (<type>enum</type>)
       <indexterm>
        <primary><varname>archive_mode</> configuration parameter</primary>
       </indexterm>
@@ -2530,7 +2530,15 @@ include_dir 'conf.d'
        <para>
         When <varname>archive_mode</> is enabled, completed WAL segments
         are sent to archive storage by setting
-        <xref linkend="guc-archive-command">.
+        <xref linkend="guc-archive-command">. In addition to <literal>off</>,
+        to disable, there are three modes: <literal>on</>, <literal>shared</>,
+        and <literal>always</>. During normal operation, there is no
+        difference between the three modes, but in archive recovery or
+        standby mode, it indicates whether the WAL archive is shared between
+        the primary and the standby server or not. See
+        <xref linkend="continuous-archiving-in-standby"> for details.
+       </para>  
+       <para>
         <varname>archive_mode</> and <varname>archive_command</> are
         separate variables so that <varname>archive_command</> can be
         changed without leaving archiving mode.
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index a17f555..62f7c75 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1220,6 +1220,54 @@ primary_slot_name = 'node_a_slot'
 
    </sect3>
   </sect2>
+
+  <sect2 id="continuous-archiving-in-standby">
+   <title>Continuous archiving in standby</title>
+
+   <indexterm>
+     <primary>continuous archiving</primary>
+     <secondary>in standby</secondary>
+   </indexterm>
+
+   <para>
+     When continuous WAL archiving is used in a standby, there are two
+     different scenarios: the WAL archive can be shared between the primary
+     and the standby, or the standby can have its own WAL archive. In the
+     shared archive scenario, <varname>archive_mode</varname> must be set to
+     <literal>shared</literal>, and in the separate archive scenario, to
+     <literal>always</literal>. Setting it to <literal>on</literal> in a
+     standby server, or when performing point-in-time recovery, is not
+     allowed and an error will be raised. When a server is not in recovery
+     mode, there is no difference between <literal>on</literal>,
+     <literal>shared</literal>, and <literal>always</literal> modes.
+   </para>
+
+   <para>
+     In <literal>shared</literal> archive mode, the standby server tries to
+     ensure that the archive is complete, even if the primary crashes and
+     failover happens. The standby server will not archive any WAL segments
+     as long as it is in standby mode; it is the primary server's
+     responsibility to do so. It will, however, keep track of which files
+     have already been archived by the primary, and if failover happens, it
+     takes over and attempts to archive any files that the primary had not
+     yet archived.
+   </para>
+
+   <para>
+     In <literal>always</literal> archive mode, the standby server will
+     archive all WAL it receives, whether it's through streaming replication
+     or by restoring from the primary's archive using
+     <varname>restore_command</varname>.
+   </para>
+
+   <para>
+     In cascading replication, the first standby server and the cascaded
+     standby servers can use <varname>archive_mode</varname> settings. In
+     each standby, it should be set to <literal>shared</literal> or
+     <literal>always</literal>, depending on whether that standby shares the
+     archive with the primary or standby it is connected to.
+   </para>
+  </sect2>
   </sect1>
 
   <sect1 id="warm-standby-failover">
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index ac13d32..bd2dd3f 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1646,6 +1646,37 @@ The commands accepted in walsender mode are:
       </para>
       </listitem>
       </varlistentry>
+      <varlistentry>
+      <term>
+          WAL archival report message (B)
+      </term>
+      <listitem>
+      <para>
+      <variablelist>
+      <varlistentry>
+      <term>
+          Byte1('a')
+      </term>
+      <listitem>
+      <para>
+          Tells the receiver the last archived WAL segment.
+      </para>
+      </listitem>
+      </varlistentry>
+      <varlistentry>
+      <term>
+          Byte<replaceable>n</replaceable>
+      </term>
+      <listitem>
+      <para>
+          Filename of the latest archived file.
+      </para>
+      </listitem>
+      </varlistentry>
+      </variablelist>
+      </para>
+      </listitem>
+      </varlistentry>
       </variablelist>
      </para>
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6f7e3bd9..ee5a4a1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -85,7 +85,7 @@ int			min_wal_size = 5;		/* 80 MB */
 int			wal_keep_segments = 0;
 int			XLOGbuffers = -1;
 int			XLogArchiveTimeout = 0;
-bool		XLogArchiveMode = false;
+int			XLogArchiveMode = ARCHIVE_MODE_OFF;
 char	   *XLogArchiveCommand = NULL;
 bool		EnableHotStandby = false;
 bool		fullPageWrites = true;
@@ -139,6 +139,25 @@ const struct config_enum_entry sync_method_options[] = {
 	{NULL, 0, false}
 };
 
+
+/*
+ * Although only "on", "off", "shared", and "always" are documented,
+ * we accept all the likely variants of "on" and "off".
+ */
+const struct config_enum_entry archive_mode_options[] = {
+	{"shared", ARCHIVE_MODE_SHARED, false},
+	{"always", ARCHIVE_MODE_ALWAYS, false},
+	{"on", ARCHIVE_MODE_ON, false},
+	{"off", ARCHIVE_MODE_OFF, false},
+	{"true", ARCHIVE_MODE_ON, true},
+	{"false", ARCHIVE_MODE_OFF, true},
+	{"yes", ARCHIVE_MODE_ON, true},
+	{"no", ARCHIVE_MODE_OFF, true},
+	{"1", ARCHIVE_MODE_ON, true},
+	{"0", ARCHIVE_MODE_OFF, true},
+	{NULL, 0, false}
+};
+
 /*
  * Statistics for current checkpoint are collected in this global struct.
  * Because only the checkpointer or a stand-alone backend can perform
@@ -766,7 +785,7 @@ static MemoryContext walDebugCxt = NULL;
 #endif
 
 static void readRecoveryCommandFile(void);
-static void exitArchiveRecovery(TimeLineID endTLI, XLogSegNo endLogSegNo);
+static void exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog);
 static bool recoveryStopsBefore(XLogReaderState *record);
 static bool recoveryStopsAfter(XLogReaderState *record);
 static void recoveryPausesHere(void);
@@ -6037,6 +6056,12 @@ StartupXLOG(void)
 
 	if (ArchiveRecoveryRequested)
 	{
+		/* archive_mode=on is not allowed during archive recovery. */
+		if (XLogArchiveMode == ARCHIVE_MODE_ON)
+			ereport(ERROR,
+					(errmsg("archive_mode='on' cannot be used in archive recovery"),
+					 (errhint("Use 'shared' or 'always' mode instead."))));
+
 		if (StandbyModeRequested)
 			ereport(LOG,
 					(errmsg("entering standby mode")));
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 1e6073a..13bee0c 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2471,6 +2471,50 @@ pgstat_fetch_global(void)
 }
 
 
+/*
+ * ---------
+ * pgstat_use_stale_snapshot() -
+ *
+ *	Take a "snapshot" of the current stats into backend-private memory.
+ *	pgstat_fetch_*() functions can then be used to interrogate the stats.
+ *
+ *	The first call to pgstat_fetch_*() in a transaction will take a snapshot
+ *	implicitly, so this is normally not required. But this can be used if
+ *	you don't want to wait for fresh stats, as the pgstat_fetch_*()
+ *	functions would.
+ * ---------
+ */
+void
+pgstat_use_stale_snapshot(void)
+{
+	pgstat_clear_snapshot();
+
+	/*
+	 * For all the current callers, shallow stats are enough.
+	 *
+	 * XXX: There is no way to request global stats only; we'll get stats
+	 * for all databases.
+	 */
+	pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
+}
+
+/*
+ * ---------
+ * pgstat_request_update() -
+ *
+ *	Ask the stats collector to refresh the stats file. Normally,
+ *	pgstat_fetch_*() will do this automatically, but this can be used together
+ *	with pgstat_use_stale_snapshot() to poll for updated stats
+ *	asynchronously.
+ * ---------
+ */
+void
+pgstat_request_update(TimestampTz cur_ts, TimestampTz min_ts)
+{
+	pgstat_send_inquiry(cur_ts, min_ts, InvalidOid);
+}
+
+
 /* ------------------------------------------------------------
  * Functions for management of the shared-memory PgBackendStatus array
  * ------------------------------------------------------------
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a9f20ac..72fe4fd 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -828,9 +828,9 @@ PostmasterMain(int argc, char *argv[])
 		write_stderr("%s: max_wal_senders must be less than max_connections\n", progname);
 		ExitPostmaster(1);
 	}
-	if (XLogArchiveMode && wal_level == WAL_LEVEL_MINIMAL)
+	if (XLogArchiveMode > ARCHIVE_MODE_OFF && wal_level == WAL_LEVEL_MINIMAL)
 		ereport(ERROR,
-				(errmsg("WAL archival (archive_mode=on) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
+				(errmsg("WAL archival (archive_mode=on/always/shared) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
 	if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
 		ereport(ERROR,
 				(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
@@ -1645,13 +1645,21 @@ ServerLoop(void)
 				start_autovac_launcher = false; /* signal processed */
 		}
 
-		/* If we have lost the archiver, try to start a new one */
-		if (XLogArchivingActive() && PgArchPID == 0 && pmState == PM_RUN)
-			PgArchPID = pgarch_start();
-
-		/* If we have lost the stats collector, try to start a new one */
-		if (PgStatPID == 0 && pmState == PM_RUN)
-			PgStatPID = pgstat_start();
+		/*
+		 * If we have lost the archiver, try to start a new one.
+		 *
+		 * If WAL archiving is enabled always, we try to start a new archiver
+		 * even during recovery.
+		 */
+		if (PgArchPID == 0 && wal_level >= WAL_LEVEL_ARCHIVE)
+		{
+			if ((pmState == PM_RUN && XLogArchiveMode > ARCHIVE_MODE_OFF) ||
+				((pmState == PM_RECOVERY || pmState == PM_HOT_STANDBY) &&
+				 XLogArchiveMode == ARCHIVE_MODE_ALWAYS))
+			{
+				PgArchPID = pgarch_start();
+			}
+		}
 
 		/* If we need to signal the autovacuum launcher, do so now */
 		if (avlauncher_needs_signal)
@@ -4807,6 +4815,17 @@ sigusr1_handler(SIGNAL_ARGS)
 		Assert(BgWriterPID == 0);
 		BgWriterPID = StartBackgroundWriter();
 
+		/*
+		 * Start the archiver if we're responsible for (re-)archiving received
+		 * files.
+		 */
+		Assert(PgArchPID == 0);
+		if (wal_level >= WAL_LEVEL_ARCHIVE &&
+			XLogArchiveMode == ARCHIVE_MODE_ALWAYS)
+		{
+			PgArchPID = pgarch_start();
+		}
+
 		pmState = PM_RECOVERY;
 	}
 	if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 9c7710f..e53ffeb 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -52,8 +52,11 @@
 #include "libpq/pqformat.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
+#include "pgstat.h"
+#include "postmaster/pgarch.h"
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
+#include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/pmsignal.h"
 #include "storage/procarray.h"
@@ -107,6 +110,9 @@ static struct
 	XLogRecPtr	Flush;			/* last byte + 1 flushed in the standby */
 }	LogstreamResult;
 
+/* Name of the last WAL file the primary reported as archived */
+static char primary_last_archived[MAX_XFN_CHARS + 1];
+
 static StringInfoData reply_message;
 static StringInfoData incoming_message;
 
@@ -141,6 +147,7 @@ static void XLogWalRcvFlush(bool dying);
 static void XLogWalRcvSendReply(bool force, bool requestReply);
 static void XLogWalRcvSendHSFeedback(bool immed);
 static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
+static void ProcessArchivalReport(void);
 
 /* Signal handlers */
 static void WalRcvSigHupHandler(SIGNAL_ARGS);
@@ -526,21 +533,12 @@ WalReceiverMain(void)
 		 */
 		if (recvFile >= 0)
 		{
-			char		xlogfname[MAXFNAMELEN];
-
 			XLogWalRcvFlush(false);
 			if (close(recvFile) != 0)
 				ereport(PANIC,
 						(errcode_for_file_access(),
 						 errmsg("could not close log segment %s: %m",
 								XLogFileNameP(recvFileTLI, recvSegNo))));
-
-			/*
-			 * Create .done file forcibly to prevent the streamed segment from
-			 * being archived later.
-			 */
-			XLogFileName(xlogfname, recvFileTLI, recvSegNo);
-			XLogArchiveForceDone(xlogfname);
 		}
 		recvFile = -1;
 
@@ -846,6 +844,26 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 					XLogWalRcvSendReply(true, false);
 				break;
 			}
+		case 'a':				/* Archival report */
+			{
+				/* the content of the message is a filename */
+				if (len >= sizeof(primary_last_archived))
+					ereport(ERROR,
+							(errcode(ERRCODE_PROTOCOL_VIOLATION),
+							 errmsg_internal("invalid archival report message with length %d",
+											 (int) len)));
+				memcpy(primary_last_archived, buf, len);
+				primary_last_archived[len] = '\0';
+				if (strspn(buf, VALID_XFN_CHARS) != len)
+				{
+					primary_last_archived[0] = '\0';
+					ereport(ERROR,
+							(errcode(ERRCODE_PROTOCOL_VIOLATION),
+							 errmsg_internal("unexpected character in primary's last archived filename")));
+				}
+				ProcessArchivalReport();
+				break;
+			}
 		default:
 			ereport(ERROR,
 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
@@ -867,39 +885,18 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	{
 		int			segbytes;
 
-		if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo))
+		if (!XLByteInSeg(recptr, recvSegNo))
 		{
 			bool		use_existent;
 
 			/*
-			 * fsync() and close current file before we switch to next one. We
-			 * would otherwise have to reopen this file to fsync it later
+			 * We take care to always close the current file, after writing
+			 * the last byte to it. So this shouldn't happen.
 			 */
 			if (recvFile >= 0)
-			{
-				char		xlogfname[MAXFNAMELEN];
-
-				XLogWalRcvFlush(false);
-
-				/*
-				 * XLOG segment files will be re-read by recovery in startup
-				 * process soon, so we don't advise the OS to release cache
-				 * pages associated with the file like XLogFileClose() does.
-				 */
-				if (close(recvFile) != 0)
-					ereport(PANIC,
-							(errcode_for_file_access(),
-							 errmsg("could not close log segment %s: %m",
-									XLogFileNameP(recvFileTLI, recvSegNo))));
-
-				/*
-				 * Create .done file forcibly to prevent the streamed segment
-				 * from being archived later.
-				 */
-				XLogFileName(xlogfname, recvFileTLI, recvSegNo);
-				XLogArchiveForceDone(xlogfname);
-			}
-			recvFile = -1;
+				ereport(ERROR,
+						(errmsg("unexpected WAL receive location %s",
+								XLogFileNameP(recvFileTLI, recvSegNo))));
 
 			/* Create/use new log file */
 			XLByteToSeg(recptr, recvSegNo);
@@ -954,6 +951,51 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 		buf += byteswritten;
 
 		LogstreamResult.Write = recptr;
+
+		/*
+		 * If we just wrote the last byte to this segment, fsync() and close
+		 * current file before we switch to next one. We would otherwise have
+		 * to reopen this file to fsync it later.
+		 */
+		if (recvOff == XLOG_SEG_SIZE)
+		{
+			char		xlogfname[MAXFNAMELEN];
+
+			XLogWalRcvFlush(false);
+
+			/*
+			 * XLOG segment files will be re-read by recovery in startup
+			 * process soon, so we don't advise the OS to release cache
+			 * pages associated with the file like XLogFileClose() does.
+			 */
+			if (close(recvFile) != 0)
+				ereport(PANIC,
+						(errcode_for_file_access(),
+						 errmsg("could not close log segment %s: %m",
+								XLogFileNameP(recvFileTLI, recvSegNo))));
+			recvFile = -1;
+
+			/*
+			 * Now that this segment is complete, do we need to archive it?
+			 *
+			 * In 'always' mode, we clearly need to archive this.
+			 *
+			 * In 'shared' mode, we might need to, if we get promoted before
+			 * the master has archived this file, so create a .ready file. It
+			 * will be replaced with .done later, if we get acknowledgement
+			 * from the primary that this has already been archived.
+			 *
+			 * In 'on' mode, we're only responsible for WAL we've generated
+			 * ourselves.
+			 */
+			if (XLogArchiveMode == ARCHIVE_MODE_ALWAYS ||
+				XLogArchiveMode == ARCHIVE_MODE_SHARED)
+			{
+				XLogFileName(xlogfname, recvFileTLI, recvSegNo);
+
+				XLogArchiveCheckDone(xlogfname);
+			}
+		}
 	}
 }
 
@@ -1215,3 +1257,61 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
 		pfree(receipttime);
 	}
 }
+
+/*
+ * Create .done and .ready files, based on the master's last archival report.
+ */
+static void
+ProcessArchivalReport(void)
+{
+	DIR		   *xldir;
+	struct dirent *xlde;
+
+	elog(DEBUG2, "received archival report from master: %s",
+		 primary_last_archived);
+
+	if (XLogArchiveMode != ARCHIVE_MODE_SHARED)
+		return;
+
+	/* Check that the filename the primary reported looks valid */
+	if (strlen(primary_last_archived) < 24 ||
+		strspn(primary_last_archived, "0123456789ABCDEF") != 24)
+		return;
+
+	xldir = AllocateDir(XLOGDIR);
+	if (xldir == NULL)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open transaction log directory \"%s\": %m",
+						XLOGDIR)));
+
+	while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+	{
+		/*
+		 * We ignore the timeline part of the XLOG segment identifiers in
+		 * deciding whether a segment is still needed.  This ensures that we
+		 * won't prematurely remove a segment from a parent timeline. We could
+		 * probably be a little more proactive about removing segments of
+		 * non-parent timelines, but that would be a whole lot more
+		 * complicated.
+		 *
+		 * We use the alphanumeric sorting property of the filenames to decide
+		 * which ones are earlier than the lastoff segment.
+		 */
+		if (strlen(xlde->d_name) == 24 &&
+			strspn(xlde->d_name, "0123456789ABCDEF") == 24 &&
+			strcmp(xlde->d_name + 8, primary_last_archived + 8) <= 0)
+		{
+			XLogArchiveForceDone(xlde->d_name);
+		}
+	}
+
+	FreeDir(xldir);
+
+	/*
+	 * Remember this location in pgstat as well. This makes it visible in
+	 * pg_stat_archiver, and allows the location to be relayed to cascaded
+	 * standbys.
+	 */
+	pgstat_send_archiver(primary_last_archived, false);
+}
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 4a20569..b4d4a90 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -55,6 +55,7 @@
 #include "libpq/pqformat.h"
 #include "miscadmin.h"
 #include "nodes/replnodes.h"
+#include "pgstat.h"
 #include "replication/basebackup.h"
 #include "replication/decode.h"
 #include "replication/logical.h"
@@ -91,6 +92,21 @@
  */
 #define MAX_SEND_SIZE (XLOG_BLCKSZ * 16)
 
+/*
+ * How often to report the last archived WAL file to the client?
+ */
+#define 	ARCHIVAL_REPORT_INTERVAL	10000
+/*
+ * After requesting the stats collector for fresh stats, how often to poll
+ * for the result?
+ *
+ * This is similar to PGSTAT_RETRY_DELAY and PGSTAT_INQ_INTERVAL, but we're
+ * much more relaxed in WAL sender, as we're not in any rush to get the latest
+ * status to the client. We also just use a single value, and send a new
+ * request after each poll.
+ */
+#define 	ARCHIVAL_REQUEST_INTERVAL	1000
+
 /* Array of WalSnds in shared memory */
 WalSndCtlData *WalSndCtl = NULL;
 
@@ -153,6 +169,19 @@ static StringInfoData tmpbuf;
  */
 static TimestampTz last_reply_timestamp = 0;
 
+/*
+ * Last file archived. This is updated from pgstats, last update was at
+ * last_archival_report_timestamp.
+ */
+static char last_archived_file[MAX_XFN_CHARS + 1] = "";
+static TimestampTz last_archival_report_timestamp = 0;
+
+/*
+ * Have we requested fresh stats from the stats collector? And when?
+ */
+static bool	archival_status_requested = false;
+static TimestampTz last_archival_request_timestamp = 0;
+
 /* Have we sent a heartbeat message asking for reply, since last reply? */
 static bool waiting_for_ping_response = false;
 
@@ -209,6 +238,8 @@ static void ProcessStandbyHSFeedbackMessage(void);
 static void ProcessRepliesIfAny(void);
 static void WalSndKeepalive(bool requestReply);
 static void WalSndKeepaliveIfNecessary(TimestampTz now);
+static void WalSndArchivalReport(void);
+static void WalSndArchivalReportIfNecessary(TimestampTz now);
 static void WalSndCheckTimeOut(TimestampTz now);
 static long WalSndComputeSleeptime(TimestampTz now);
 static void WalSndPrepareWrite(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
@@ -1693,46 +1724,72 @@ ProcessStandbyHSFeedbackMessage(void)
 
 /*
  * Compute how long send/receive loops should sleep.
- *
- * If wal_sender_timeout is enabled we want to wake up in time to send
- * keepalives and to abort the connection if wal_sender_timeout has been
- * reached.
  */
 static long
 WalSndComputeSleeptime(TimestampTz now)
 {
-	long		sleeptime = 10000;		/* 10 s */
+	TimestampTz wakeup_time;
+	long		sleeptime;
+	long		sec_to_timeout;
+	int			microsec_to_timeout;
+	TimestampTz w;
 
+	/*
+	 * If we have no other reason to wake up, wake up every 10 seconds,
+	 * just in case we miss something.
+	 */
+	wakeup_time = TimestampTzPlusMilliseconds(now, 10000);
+
+	/*
+	 * If wal_sender_timeout is enabled we want to wake up in time to send
+	 * keepalives and to abort the connection if wal_sender_timeout has been
+	 * reached.
+	 */
 	if (wal_sender_timeout > 0 && last_reply_timestamp > 0)
 	{
-		TimestampTz wakeup_time;
-		long		sec_to_timeout;
-		int			microsec_to_timeout;
-
 		/*
 		 * At the latest stop sleeping once wal_sender_timeout has been
 		 * reached.
-		 */
-		wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
-												  wal_sender_timeout);
-
-		/*
+		 *
 		 * If no ping has been sent yet, wakeup when it's time to do so.
 		 * WalSndKeepaliveIfNecessary() wants to send a keepalive once half of
 		 * the timeout passed without a response.
 		 */
-		if (!waiting_for_ping_response)
-			wakeup_time = TimestampTzPlusMilliseconds(last_reply_timestamp,
-													  wal_sender_timeout / 2);
-
-		/* Compute relative time until wakeup. */
-		TimestampDifference(now, wakeup_time,
-							&sec_to_timeout, &microsec_to_timeout);
+		if (waiting_for_ping_response)
+			w = TimestampTzPlusMilliseconds(last_reply_timestamp,
+											wal_sender_timeout);
+		else
+			w = TimestampTzPlusMilliseconds(last_reply_timestamp,
+											wal_sender_timeout / 2);
+		if (w < wakeup_time)
+			wakeup_time = w;
+	}
 
-		sleeptime = sec_to_timeout * 1000 +
-			microsec_to_timeout / 1000;
+	/* If archiving is enabled, send a status report to the client */
+	if (XLogArchivingActive())
+	{
+		/*
+		 * If we requested an update from pgstat, poll every
+		 * ARCHIVAL_REQUEST_INTERVAL for the result. Otherwise wait until it's
+		 * time to send a new report.
+		 */
+		if (archival_status_requested)
+			w = TimestampTzPlusMilliseconds(last_archival_request_timestamp,
+											ARCHIVAL_REQUEST_INTERVAL);
+		else
+			w = TimestampTzPlusMilliseconds(last_archival_report_timestamp,
+											ARCHIVAL_REPORT_INTERVAL);
+		if (w < wakeup_time)
+			wakeup_time = w;
 	}
 
+	/* Compute relative time until wakeup. */
+	TimestampDifference(now, wakeup_time,
+						&sec_to_timeout, &microsec_to_timeout);
+
+	sleeptime = sec_to_timeout * 1000 +
+		microsec_to_timeout / 1000;
+
 	return sleeptime;
 }
 
@@ -2879,6 +2936,11 @@ WalSndKeepaliveIfNecessary(TimestampTz now)
 	TimestampTz ping_time;
 
 	/*
+	 * Send an archival status message, if necessary.
+	 */
+	WalSndArchivalReportIfNecessary(now);
+
+	/*
 	 * Don't send keepalive messages if timeouts are globally disabled or
 	 * we're doing something not partaking in timeouts.
 	 */
@@ -2907,6 +2969,84 @@ WalSndKeepaliveIfNecessary(TimestampTz now)
 }
 
 /*
+ * Send an archival report message to the standby.
+ */
+static void
+WalSndArchivalReport(void)
+{
+	elog(LOG, "sending archival report: %s", last_archived_file);
+
+	/* construct the message... */
+	resetStringInfo(&output_message);
+	pq_sendbyte(&output_message, 'a');
+	pq_sendbytes(&output_message, last_archived_file, strlen(last_archived_file));
+
+	/* ... and send it wrapped in CopyData */
+	pq_putmessage_noblock('d', output_message.data, output_message.len);
+}
+
+static void
+WalSndArchivalReportIfNecessary(TimestampTz now)
+{
+	TimestampTz report_time;
+
+	/*
+	 * If we had already asked pgstat for an update, wait until it's had
+	 * some time to update the stats file before we retry.
+	 */
+	if (archival_status_requested)
+	{
+		TimestampTz next_retry;
+
+		next_retry =
+			TimestampTzPlusMilliseconds(last_archival_request_timestamp,
+										ARCHIVAL_REQUEST_INTERVAL);
+		if (now < next_retry)
+			return;
+	}
+
+	/*
+	 * If more than ARCHIVAL_REPORT_INTERVAL has elapsed since we got the
+	 * archival status from pgstat, poll.
+	 */
+	report_time = TimestampTzPlusMilliseconds(last_archival_report_timestamp,
+											  ARCHIVAL_REPORT_INTERVAL);
+	if (now >= report_time)
+	{
+		PgStat_ArchiverStats *archiver_stats;
+		PgStat_GlobalStats *global_stats;
+		TimestampTz min_ts;
+
+		pgstat_use_stale_snapshot();
+		archiver_stats = pgstat_fetch_stat_archiver();
+		global_stats = pgstat_fetch_global();
+
+		last_archival_report_timestamp = global_stats->stats_timestamp;
+
+		if (strcmp(last_archived_file, archiver_stats->last_archived_wal) != 0)
+		{
+			strlcpy(last_archived_file, archiver_stats->last_archived_wal,
+					sizeof(last_archived_file));
+			WalSndArchivalReport();
+		}
+
+		/* If this wasn't fresh enough, request an update */
+		min_ts = TimestampTzPlusMilliseconds(now, -ARCHIVAL_REPORT_INTERVAL);
+		if (last_archival_report_timestamp > min_ts)
+			archival_status_requested = false;
+		else
+		{
+			/* Not fresh enough. Request an update */
+			pgstat_request_update(now, min_ts);
+			last_archival_request_timestamp = now;
+			archival_status_requested = true;
+		}
+
+		pgstat_clear_snapshot();
+	}
+}
+
+/*
  * This isn't currently used for anything. Monitoring tools might be
  * interested in the future, and we'll need something like this in the
  * future for synchronous replication.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 5f71ded..97aca46 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -396,6 +396,7 @@ static const struct config_enum_entry row_security_options[] = {
  * Options for enum values stored in other modules
  */
 extern const struct config_enum_entry wal_level_options[];
+extern const struct config_enum_entry archive_mode_options[];
 extern const struct config_enum_entry sync_method_options[];
 extern const struct config_enum_entry dynamic_shared_memory_options[];
 
@@ -1530,16 +1531,6 @@ static struct config_bool ConfigureNamesBool[] =
 	},
 
 	{
-		{"archive_mode", PGC_POSTMASTER, WAL_ARCHIVING,
-			gettext_noop("Allows archiving of WAL files using archive_command."),
-			NULL
-		},
-		&XLogArchiveMode,
-		false,
-		NULL, NULL, NULL
-	},
-
-	{
 		{"hot_standby", PGC_POSTMASTER, REPLICATION_STANDBY,
 			gettext_noop("Allows connections and queries during recovery."),
 			NULL
@@ -3552,6 +3543,16 @@ static struct config_enum ConfigureNamesEnum[] =
 	},
 
 	{
+		{"archive_mode", PGC_POSTMASTER, WAL_ARCHIVING,
+			gettext_noop("Allows archiving of WAL files using archive_command."),
+			NULL
+		},
+		&XLogArchiveMode,
+		ARCHIVE_MODE_OFF, archive_mode_options,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"trace_recovery_messages", PGC_SIGHUP, DEVELOPER_OPTIONS,
 			gettext_noop("Enables logging of recovery-related debugging information."),
 			gettext_noop("Each level includes all the levels that follow it. The later"
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 110983f..90371d7 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -206,7 +206,7 @@
 
 # - Archiving -
 
-#archive_mode = off		# allows archiving to be done
+#archive_mode = off		# allows archiving to be done; off, on, shared, or always
 				# (change requires restart)
 #archive_command = ''		# command to use to archive a logfile segment
 				# placeholders: %p = path of file to archive
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f08b676..8556bb8 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -96,7 +96,6 @@ extern int	wal_keep_segments;
 extern int	XLOGbuffers;
 extern int	XLogArchiveTimeout;
 extern int	wal_retrieve_retry_interval;
-extern bool XLogArchiveMode;
 extern char *XLogArchiveCommand;
 extern bool EnableHotStandby;
 extern bool fullPageWrites;
@@ -106,6 +105,16 @@ extern bool log_checkpoints;
 
 extern int	CheckPointSegments;
 
+/* Archive modes */
+typedef enum ArchiveMode
+{
+	ARCHIVE_MODE_OFF = 0,	/* disabled */
+	ARCHIVE_MODE_ON,		/* enabled while server is running normally */
+	ARCHIVE_MODE_SHARED,	/* archive is shared with master */
+	ARCHIVE_MODE_ALWAYS		/* enabled always (even during recovery) */
+} ArchiveMode;
+extern int	XLogArchiveMode;
+
 /* WAL levels */
 typedef enum WalLevel
 {
@@ -116,7 +125,8 @@ typedef enum WalLevel
 } WalLevel;
 extern int	wal_level;
 
-#define XLogArchivingActive()	(XLogArchiveMode && wal_level >= WAL_LEVEL_ARCHIVE)
+#define XLogArchivingActive() \
+	(XLogArchiveMode > ARCHIVE_MODE_OFF && wal_level >= WAL_LEVEL_ARCHIVE)
 #define XLogArchiveCommandSet() (XLogArchiveCommand[0] != '\0')
 
 /*
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index e3fe06e..b95a701 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -910,6 +910,8 @@ extern void pgstat_report_stat(bool force);
 extern void pgstat_vacuum_stat(void);
 extern void pgstat_drop_database(Oid databaseid);
 
+extern void pgstat_request_update(TimestampTz cur_ts, TimestampTz min_ts);
+extern void pgstat_use_stale_snapshot(void);
 extern void pgstat_clear_snapshot(void);
 extern void pgstat_reset_counters(void);
 extern void pgstat_reset_shared_counters(const char *);
-- 
2.1.4

#28Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#27)
Re: Streaming replication and WAL archive interactions

On Mon, May 11, 2015 at 12:00 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

> And here is a new version of the patch. I kept the approach of using pgstat,
> but it now only polls pgstat every 10 seconds, and doesn't block to wait for
> updated stats.

It's not entirely a new problem, but this error message has gotten pretty crazy:

+ (errmsg("WAL archival (archive_mode=on/always/shared) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));

Maybe: WAL archival cannot be enabled when wal_level is "minimal"

I think the documentation should be explicit about what happens if the
primary archives a file and dies before the standby gets notified that
the archiving happened. The standby, running in shared mode, is then
promoted. My first guess would be that the standby will end up with
files that it thinks it needs to archive but, being unable to do so
because they're already there, they'll live forever in pg_xlog. I
hope that's not the case.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#29Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Robert Haas (#28)
Re: Streaming replication and WAL archive interactions

On 05/13/2015 03:36 PM, Robert Haas wrote:

> On Mon, May 11, 2015 at 12:00 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
>> And here is a new version of the patch. I kept the approach of using pgstat,
>> but it now only polls pgstat every 10 seconds, and doesn't block to wait for
>> updated stats.
>
> It's not entirely a new problem, but this error message has gotten pretty crazy:
>
> + (errmsg("WAL archival (archive_mode=on/always/shared) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
>
> Maybe: WAL archival cannot be enabled when wal_level is "minimal"
>
> I think the documentation should be explicit about what happens if the
> primary archives a file and dies before the standby gets notified that
> the archiving happened.

Yes, good point.

> The standby, running in shared mode, is then
> promoted. My first guess would be that the standby will end up with
> files that it thinks it needs to archive but, being unable to do so
> because they're already there, they'll live forever in pg_xlog. I
> hope that's not the case.

Hmm. That is exactly what happens. The standby will attempt to archive
them, which will fail, so the archiver will get stuck retrying.

That's not actually a new problem though. Even with a single server
doing archiving, it's possible that you crash just after archive_command
has archived a file, but before it has created the .done file. After
restart, the server will try to archive the file again, which will fail.
But yeah, with this patch, that's much more likely to happen after a
promotion.

Our manual says that archive_command should refuse to overwrite an
existing file. But to work-around the double-archival problem, where the
same file is archived twice, it would be even better if it would simply
return success if the file exists, *and has identical contents*. I don't
know how to code that logic in a simple one-liner though.

- Heikki
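
For illustration, the behaviour described above (refuse to overwrite, but report success when the archived copy is byte-for-byte identical) could be sketched in C roughly as follows. This is only a sketch under assumptions, not the pg_copy tool mentioned in this thread: the function names archive_segment and files_identical are hypothetical, the archive is assumed to be a directory on a POSIX filesystem where link() fails atomically if the target exists, and a production tool would also need to fsync the containing directory.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if the two files have identical contents, 0 if they differ,
 * and -1 if either file cannot be read. */
static int
files_identical(const char *a, const char *b)
{
	FILE	   *fa = fopen(a, "rb");
	FILE	   *fb = fopen(b, "rb");
	int			result = (fa == NULL || fb == NULL) ? -1 : 1;

	while (result == 1)
	{
		char		bufa[8192], bufb[8192];
		size_t		na = fread(bufa, 1, sizeof(bufa), fa);
		size_t		nb = fread(bufb, 1, sizeof(bufb), fb);

		if (ferror(fa) || ferror(fb))
			result = -1;
		else if (na != nb || memcmp(bufa, bufb, na) != 0)
			result = 0;
		else if (na == 0)
			break;				/* both files hit EOF together */
	}
	if (fa)
		fclose(fa);
	if (fb)
		fclose(fb);
	return result;
}

/* Hypothetical archive step: returns 0 if dst now holds the contents of src
 * (newly copied, or already present and identical), 1 otherwise. */
int
archive_segment(const char *src, const char *dst)
{
	char		tmp[4096], buf[8192];
	FILE	   *in;
	int			fd, len, ok = 1;
	size_t		n;

	assert(src != NULL && dst != NULL);

	/* Copy into a private temp file first, so dst is never seen half-written. */
	len = snprintf(tmp, sizeof(tmp), "%s.tmp.%ld", dst, (long) getpid());
	if (len < 0 || len >= (int) sizeof(tmp))
		return 1;
	if ((in = fopen(src, "rb")) == NULL)
		return 1;
	if ((fd = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0600)) < 0)
	{
		fclose(in);
		return 1;
	}
	while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
	{
		if (write(fd, buf, n) != (ssize_t) n)
		{
			ok = 0;				/* treat short writes as failure */
			break;
		}
	}
	if (ferror(in) || fsync(fd) != 0)
		ok = 0;
	fclose(in);
	if (close(fd) != 0)
		ok = 0;

	/* link() never overwrites: it fails with EEXIST if dst already exists.
	 * In that case, succeed only if the existing file matches src. */
	if (ok && link(tmp, dst) != 0)
		ok = (errno == EEXIST && files_identical(src, dst) == 1);

	unlink(tmp);
	return ok ? 0 : 1;
}
```

A small main() that passes its two arguments to archive_segment() and exits with the return value could then serve as the archive_command. Because the file is published with link() only after it is completely written, a second server that loses the race compares against a complete file rather than a partial one.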


#30Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#29)
Re: Streaming replication and WAL archive interactions

On Wed, May 13, 2015 at 8:53 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

> Our manual says that archive_command should refuse to overwrite an existing
> file. But to work-around the double-archival problem, where the same file is
> archived twice, it would be even better if it would simply return success if
> the file exists, *and has identical contents*. I don't know how to code that
> logic in a simple one-liner though.

This is why we really, really need that pg_copy command that was
proposed a while back.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#31Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Robert Haas (#30)
1 attachment(s)
Re: Streaming replication and WAL archive interactions

On 05/13/2015 04:29 PM, Robert Haas wrote:

> On Wed, May 13, 2015 at 8:53 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
>> Our manual says that archive_command should refuse to overwrite an existing
>> file. But to work-around the double-archival problem, where the same file is
>> archived twice, it would be even better if it would simply return success if
>> the file exists, *and has identical contents*. I don't know how to code that
>> logic in a simple one-liner though.
>
> This is why we really, really need that pg_copy command that was
> proposed a while back.

Yeah..

I took a step back and looked at the big picture again:

If we just implement the "always" mode, and you have a pg_copy command
or similar that handles duplicates correctly, you don't necessarily need
the "shared" mode at all. You can just set archive_mode='always', and
have the master and standby archive to the same location. As long as the
archive_command works correctly and is race-free, that should work.

I cut back the patch to implement just the "always" mode. The "shared"
mode might still make sense as a future patch, as I think it's easier to
understand and has less strict requirements for the archive_command, but
let's take one step at a time.

So attached is a patch that just adds the "always" mode. This is pretty
close to what Fujii submitted long ago.

- Heikki
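
As a reading aid, the rules that the attached patch applies can be condensed into two predicates. This is a simplified model and not code from the patch: ServerState stands in for the postmaster's pmState, the function names are made up for illustration, and the WalLevel values beyond minimal and archive are assumed from the wal_level names quoted earlier in the thread.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
	ARCHIVE_MODE_OFF = 0,		/* disabled */
	ARCHIVE_MODE_ON,			/* enabled while server is running normally */
	ARCHIVE_MODE_ALWAYS			/* enabled always (even during recovery) */
} ArchiveMode;

/* Assumed ordering, mirroring the wal_level names in the error message. */
typedef enum
{
	WAL_LEVEL_MINIMAL = 0,
	WAL_LEVEL_ARCHIVE,
	WAL_LEVEL_HOT_STANDBY,
	WAL_LEVEL_LOGICAL
} WalLevel;

/* Hypothetical stand-in for the postmaster states the patch checks. */
typedef enum
{
	STATE_RECOVERY,
	STATE_HOT_STANDBY,
	STATE_RUN
} ServerState;

/* Should the archiver process be running in this configuration and state? */
static bool
archiver_should_run(ArchiveMode mode, WalLevel level, ServerState state)
{
	assert(mode >= ARCHIVE_MODE_OFF && mode <= ARCHIVE_MODE_ALWAYS);

	if (level < WAL_LEVEL_ARCHIVE)
		return false;			/* archiving needs more than "minimal" */
	if (state == STATE_RUN)
		return mode > ARCHIVE_MODE_OFF; /* "on" and "always" behave alike */
	return mode == ARCHIVE_MODE_ALWAYS; /* recovery or hot standby */
}

/*
 * Should a segment that was restored from the archive or received by
 * streaming be queued for archiving (marked ready) rather than being
 * marked as already done?
 */
static bool
received_segment_needs_archiving(ArchiveMode mode)
{
	return mode == ARCHIVE_MODE_ALWAYS;
}
```

With archive_mode=on on a standby, both predicates are false during recovery, which is why segments streamed before a promotion never reach the archive unless the old master archived them.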

Attachments:

0001-Add-archive_mode-always-option.patch (application/x-patch)
From 71332900247a8c68a61fcf60782cb35cf662b756 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Thu, 16 Apr 2015 14:40:24 +0300
Subject: [PATCH 1/1] Add archive_mode='always' option.

In 'always' mode, the standby's WAL archive is taken to be separate from the
primary's, and the standby independently archives all files it receives from
the primary.

Fujii Masao and me.
---
 doc/src/sgml/config.sgml                      | 13 +++++++--
 doc/src/sgml/high-availability.sgml           | 39 +++++++++++++++++++++++++++
 src/backend/access/transam/xlog.c             | 22 +++++++++++++--
 src/backend/access/transam/xlogarchive.c      |  5 +++-
 src/backend/postmaster/postmaster.c           | 37 ++++++++++++++++++-------
 src/backend/replication/walreceiver.c         | 10 +++++--
 src/backend/utils/misc/guc.c                  | 21 ++++++++-------
 src/backend/utils/misc/postgresql.conf.sample |  2 +-
 src/include/access/xlog.h                     | 13 +++++++--
 9 files changed, 133 insertions(+), 29 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0d8624a..5549b7d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2521,7 +2521,7 @@ include_dir 'conf.d'
 
     <variablelist>
      <varlistentry id="guc-archive-mode" xreflabel="archive_mode">
-      <term><varname>archive_mode</varname> (<type>boolean</type>)
+      <term><varname>archive_mode</varname> (<type>enum</type>)
       <indexterm>
        <primary><varname>archive_mode</> configuration parameter</primary>
       </indexterm>
@@ -2530,7 +2530,16 @@ include_dir 'conf.d'
        <para>
         When <varname>archive_mode</> is enabled, completed WAL segments
         are sent to archive storage by setting
-        <xref linkend="guc-archive-command">.
+        <xref linkend="guc-archive-command">. In addition to <literal>off</>,
+        to disable, there are two modes: <literal>on</>, and
+        <literal>always</>. During normal operation, there is no
+        difference between the two modes, but when set to <literal>always</>
+        the WAL archiver is enabled also during archive recovery or standby
+        mode. In <literal>always</> mode, all files restored from the archive
+        or streamed with streaming replication will be archived (again). See
+        <xref linkend="continuous-archiving-in-standby"> for details.
+       </para>
+       <para>
         <varname>archive_mode</> and <varname>archive_command</> are
         separate variables so that <varname>archive_command</> can be
         changed without leaving archiving mode.
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index a17f555..e93b711 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1220,6 +1220,45 @@ primary_slot_name = 'node_a_slot'
 
    </sect3>
   </sect2>
+
+  <sect2 id="continuous-archiving-in-standby">
+   <title>Continuous archiving in standby</title>
+
+   <indexterm>
+     <primary>continuous archiving</primary>
+     <secondary>in standby</secondary>
+   </indexterm>
+
+   <para>
+     When continuous WAL archiving is used in a standby, there are two
+     different scenarios: the WAL archive can be shared between the primary
+     and the standby, or the standby can have its own WAL archive. When
+     the standby has its own WAL archive, set <varname>archive_mode</varname>
+     to <literal>always</literal>, and the standby will call the archive
+     command for every WAL segment it receives, whether it's by restoring
+     from the archive or by streaming replication. The shared archive can
+     be handled similarly, but the archive_command should test if the file
+     being archived exists already, and if the existing file has identical
+     contents. This requires more care in the archive_command, as it must
+     be careful to not overwrite an existing file with different contents,
+     but return success if exactly the same file is archived twice. And
+     all that must be done free of race conditions, if two servers attempt
+     to archive the same file at the same time.
+   </para>
+
+   <para>
+     If <varname>archive_mode</varname> is set to <literal>on</>, the
+     archiver is not enabled during recovery or standby mode. If the standby
+     server is promoted, it will start archiving after the promotion, but
+     will not archive any WAL it did not generate itself. To get a complete
+     series of WAL files in the archive, you must ensure that all WAL is
+     archived, before it reaches the standby. This is inherently true with
+     file-based log shipping, as the standby can only restore files that
+     are found in the archive, but not if streaming replication is enabled.
+     When a server is not in recovery mode, there is no difference between
+     <literal>on</literal> and <literal>always</literal> modes.
+   </para>
+  </sect2>
   </sect1>
 
   <sect1 id="warm-standby-failover">
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5f0551a..0485bb5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -86,7 +86,7 @@ int			min_wal_size = 5;		/* 80 MB */
 int			wal_keep_segments = 0;
 int			XLOGbuffers = -1;
 int			XLogArchiveTimeout = 0;
-bool		XLogArchiveMode = false;
+int			XLogArchiveMode = ARCHIVE_MODE_OFF;
 char	   *XLogArchiveCommand = NULL;
 bool		EnableHotStandby = false;
 bool		fullPageWrites = true;
@@ -140,6 +140,24 @@ const struct config_enum_entry sync_method_options[] = {
 	{NULL, 0, false}
 };
 
+
+/*
+ * Although only "on", "off", and "always" are documented,
+ * we accept all the likely variants of "on" and "off".
+ */
+const struct config_enum_entry archive_mode_options[] = {
+	{"always", ARCHIVE_MODE_ALWAYS, false},
+	{"on", ARCHIVE_MODE_ON, false},
+	{"off", ARCHIVE_MODE_OFF, false},
+	{"true", ARCHIVE_MODE_ON, true},
+	{"false", ARCHIVE_MODE_OFF, true},
+	{"yes", ARCHIVE_MODE_ON, true},
+	{"no", ARCHIVE_MODE_OFF, true},
+	{"1", ARCHIVE_MODE_ON, true},
+	{"0", ARCHIVE_MODE_OFF, true},
+	{NULL, 0, false}
+};
+
 /*
  * Statistics for current checkpoint are collected in this global struct.
  * Because only the checkpointer or a stand-alone backend can perform
@@ -767,7 +785,7 @@ static MemoryContext walDebugCxt = NULL;
 #endif
 
 static void readRecoveryCommandFile(void);
-static void exitArchiveRecovery(TimeLineID endTLI, XLogSegNo endLogSegNo);
+static void exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog);
 static bool recoveryStopsBefore(XLogReaderState *record);
 static bool recoveryStopsAfter(XLogReaderState *record);
 static void recoveryPausesHere(void);
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index f435f65..4c69b73 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -480,7 +480,10 @@ KeepFileRestoredFromArchive(char *path, char *xlogfname)
 	 * Create .done file forcibly to prevent the restored segment from being
 	 * archived again later.
 	 */
-	XLogArchiveForceDone(xlogfname);
+	if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
+		XLogArchiveForceDone(xlogfname);
+	else
+		XLogArchiveNotify(xlogfname);
 
 	/*
 	 * If the existing file was replaced, since walsenders might have it open,
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a9f20ac..36440cb 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -828,9 +828,9 @@ PostmasterMain(int argc, char *argv[])
 		write_stderr("%s: max_wal_senders must be less than max_connections\n", progname);
 		ExitPostmaster(1);
 	}
-	if (XLogArchiveMode && wal_level == WAL_LEVEL_MINIMAL)
+	if (XLogArchiveMode > ARCHIVE_MODE_OFF && wal_level == WAL_LEVEL_MINIMAL)
 		ereport(ERROR,
-				(errmsg("WAL archival (archive_mode=on) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
+				(errmsg("WAL archival cannot be enabled when wal_level is \"minimal\"")));
 	if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
 		ereport(ERROR,
 				(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\", \"hot_standby\", or \"logical\"")));
@@ -1645,13 +1645,21 @@ ServerLoop(void)
 				start_autovac_launcher = false; /* signal processed */
 		}
 
-		/* If we have lost the archiver, try to start a new one */
-		if (XLogArchivingActive() && PgArchPID == 0 && pmState == PM_RUN)
-			PgArchPID = pgarch_start();
-
-		/* If we have lost the stats collector, try to start a new one */
-		if (PgStatPID == 0 && pmState == PM_RUN)
-			PgStatPID = pgstat_start();
+		/*
+		 * If we have lost the archiver, try to start a new one.
+		 *
+		 * If WAL archiving is enabled always, we try to start a new archiver
+		 * even during recovery.
+		 */
+		if (PgArchPID == 0 && wal_level >= WAL_LEVEL_ARCHIVE)
+		{
+			if ((pmState == PM_RUN && XLogArchiveMode > ARCHIVE_MODE_OFF) ||
+				((pmState == PM_RECOVERY || pmState == PM_HOT_STANDBY) &&
+				 XLogArchiveMode == ARCHIVE_MODE_ALWAYS))
+			{
+				PgArchPID = pgarch_start();
+			}
+		}
 
 		/* If we need to signal the autovacuum launcher, do so now */
 		if (avlauncher_needs_signal)
@@ -4807,6 +4815,17 @@ sigusr1_handler(SIGNAL_ARGS)
 		Assert(BgWriterPID == 0);
 		BgWriterPID = StartBackgroundWriter();
 
+		/*
+		 * Start the archiver if we're responsible for (re-)archiving received
+		 * files.
+		 */
+		Assert(PgArchPID == 0);
+		if (wal_level >= WAL_LEVEL_ARCHIVE &&
+			XLogArchiveMode == ARCHIVE_MODE_ALWAYS)
+		{
+			PgArchPID = pgarch_start();
+		}
+
 		pmState = PM_RECOVERY;
 	}
 	if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 9c7710f..41e57f2 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -540,7 +540,10 @@ WalReceiverMain(void)
 			 * being archived later.
 			 */
 			XLogFileName(xlogfname, recvFileTLI, recvSegNo);
-			XLogArchiveForceDone(xlogfname);
+			if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
+				XLogArchiveForceDone(xlogfname);
+			else
+				XLogArchiveNotify(xlogfname);
 		}
 		recvFile = -1;
 
@@ -897,7 +900,10 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 				 * from being archived later.
 				 */
 				XLogFileName(xlogfname, recvFileTLI, recvSegNo);
-				XLogArchiveForceDone(xlogfname);
+				if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
+					XLogArchiveForceDone(xlogfname);
+				else
+					XLogArchiveNotify(xlogfname);
 			}
 			recvFile = -1;
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 5f71ded..97aca46 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -396,6 +396,7 @@ static const struct config_enum_entry row_security_options[] = {
  * Options for enum values stored in other modules
  */
 extern const struct config_enum_entry wal_level_options[];
+extern const struct config_enum_entry archive_mode_options[];
 extern const struct config_enum_entry sync_method_options[];
 extern const struct config_enum_entry dynamic_shared_memory_options[];
 
@@ -1530,16 +1531,6 @@ static struct config_bool ConfigureNamesBool[] =
 	},
 
 	{
-		{"archive_mode", PGC_POSTMASTER, WAL_ARCHIVING,
-			gettext_noop("Allows archiving of WAL files using archive_command."),
-			NULL
-		},
-		&XLogArchiveMode,
-		false,
-		NULL, NULL, NULL
-	},
-
-	{
 		{"hot_standby", PGC_POSTMASTER, REPLICATION_STANDBY,
 			gettext_noop("Allows connections and queries during recovery."),
 			NULL
@@ -3552,6 +3543,16 @@ static struct config_enum ConfigureNamesEnum[] =
 	},
 
 	{
+		{"archive_mode", PGC_POSTMASTER, WAL_ARCHIVING,
+			gettext_noop("Allows archiving of WAL files using archive_command."),
+			NULL
+		},
+		&XLogArchiveMode,
+		ARCHIVE_MODE_OFF, archive_mode_options,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"trace_recovery_messages", PGC_SIGHUP, DEVELOPER_OPTIONS,
 			gettext_noop("Enables logging of recovery-related debugging information."),
 			gettext_noop("Each level includes all the levels that follow it. The later"
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 110983f..7bea68a 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -206,7 +206,7 @@
 
 # - Archiving -
 
-#archive_mode = off		# allows archiving to be done
+#archive_mode = off		# allows archiving to be done; off, on, or always
 				# (change requires restart)
 #archive_command = ''		# command to use to archive a logfile segment
 				# placeholders: %p = path of file to archive
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 961e050..9567379 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -98,7 +98,6 @@ extern int	wal_keep_segments;
 extern int	XLOGbuffers;
 extern int	XLogArchiveTimeout;
 extern int	wal_retrieve_retry_interval;
-extern bool XLogArchiveMode;
 extern char *XLogArchiveCommand;
 extern bool EnableHotStandby;
 extern bool fullPageWrites;
@@ -108,6 +107,15 @@ extern bool log_checkpoints;
 
 extern int	CheckPointSegments;
 
+/* Archive modes */
+typedef enum ArchiveMode
+{
+	ARCHIVE_MODE_OFF = 0,	/* disabled */
+	ARCHIVE_MODE_ON,		/* enabled while server is running normally */
+	ARCHIVE_MODE_ALWAYS		/* enabled always (even during recovery) */
+} ArchiveMode;
+extern int	XLogArchiveMode;
+
 /* WAL levels */
 typedef enum WalLevel
 {
@@ -118,7 +126,8 @@ typedef enum WalLevel
 } WalLevel;
 extern int	wal_level;
 
-#define XLogArchivingActive()	(XLogArchiveMode && wal_level >= WAL_LEVEL_ARCHIVE)
+#define XLogArchivingActive() \
+	(XLogArchiveMode > ARCHIVE_MODE_OFF && wal_level >= WAL_LEVEL_ARCHIVE)
 #define XLogArchiveCommandSet() (XLogArchiveCommand[0] != '\0')
 
 /*
-- 
2.1.4