.ready files appearing on slaves

Started by Jehan-Guillaume de Rorthais · over 11 years ago · 42 messages

Hi hackers,

For a few months now, we have occasionally seen .ready files appearing on some
slave instances, in various contexts. The two cases I have in mind are under 9.2.x.

I tried to investigate a bit. These .ready files are created when a WAL file
in pg_xlog has no corresponding status file in pg_xlog/archive_status. I could
easily reproduce this by deleting such a status file: it is created again at the
next restartpoint or checkpoint received from the master.
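
To double-check which segments are in this state, here is a rough standalone
sketch (plain C, not PostgreSQL code) that lists the WAL segment files of a
pg_xlog directory that have no corresponding .done file under archive_status,
i.e. the candidates for which the next restartpoint would create a .ready
file. The directory layout is the only assumption; the pg_xlog path is passed
as an argument.

/*
 * Rough standalone sketch (not PostgreSQL code): list the WAL segment files
 * in a pg_xlog directory that have no corresponding .done file under
 * archive_status, i.e. the candidates for which a restartpoint would create
 * a .ready file.  Pass the pg_xlog path as argv[1].
 */
#include <stdio.h>
#include <string.h>
#include <dirent.h>
#include <sys/types.h>
#include <sys/stat.h>

static int
is_wal_name(const char *name)
{
	/* 24 hex characters: timeline + log + segment */
	return strlen(name) == 24 && strspn(name, "0123456789ABCDEF") == 24;
}

int
main(int argc, char **argv)
{
	const char *pg_xlog = (argc > 1) ? argv[1] : "pg_xlog";
	char		path[1024];
	DIR		   *dir = opendir(pg_xlog);
	struct dirent *de;
	struct stat st;

	if (dir == NULL)
	{
		perror(pg_xlog);
		return 1;
	}

	while ((de = readdir(dir)) != NULL)
	{
		if (!is_wal_name(de->d_name))
			continue;

		snprintf(path, sizeof(path), "%s/archive_status/%s.done",
				 pg_xlog, de->d_name);
		if (stat(path, &st) != 0)
			printf("no .done for %s\n", de->d_name);	/* would get a .ready */
	}

	closedir(dir);
	return 0;
}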

Looking at the WAL files in the pg_xlog folder corresponding to these .ready
files, they are all much older than the current WAL "cycle", both by mtime and
by name sequence. For instance, on one of these boxes we currently have 6 of
those "ghost" WALs:

0000000200001E53000000FF
0000000200001F18000000FF
0000000200002047000000FF
00000002000020BF000000FF
0000000200002140000000FF
0000000200002370000000FF
000000020000255D000000A8
000000020000255D000000A9
[...normal WAL sequence...]
000000020000255E0000009D

And on another box:

000000010000040E000000FF
0000000100000414000000DA
000000010000046E000000FF
0000000100000470000000FF
00000001000004850000000F
000000010000048500000010
[...normal WAL sequence...]
000000010000048500000052

So it seems that, for some reason, these old WALs were "forgotten" by the
restartpoint mechanism when they should have been recycled/deleted.

For one of these servers, I could correlate this with some abrupt disconnections
of the streaming replication appearing in its logs. But there was no known SR
disconnection on the second one.

Any idea about this weird behaviour? What can we do to help you investigate
further?

Regards,
--
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com


In reply to: Jehan-Guillaume de Rorthais (#1)
Re: .ready files appearing on slaves

Hi hackers,

An issue that seems related to this has been posted on pgsql-admin. See:

/messages/by-id/CAAS3tyLnXaYDZ0+zhXLPdVtOvHQOvR+jSPhp30o8kvWqQs0Tqw@mail.gmail.com

How can we help on this issue?

Cheers,

On Thu, 4 Sep 2014 17:50:36 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

Hi hackers,

For a few months now, we have occasionally seen .ready files appearing on some
slave instances, in various contexts. The two cases I have in mind are under 9.2.x.

I tried to investigate a bit. These .ready files are created when a WAL file
in pg_xlog has no corresponding status file in pg_xlog/archive_status. I could
easily reproduce this by deleting such a status file: it is created again at
the next restartpoint or checkpoint received from the master.

Looking at the WAL files in the pg_xlog folder corresponding to these .ready
files, they are all much older than the current WAL "cycle", both by mtime and
by name sequence. For instance, on one of these boxes we currently have 6 of
those "ghost" WALs:

0000000200001E53000000FF
0000000200001F18000000FF
0000000200002047000000FF
00000002000020BF000000FF
0000000200002140000000FF
0000000200002370000000FF
000000020000255D000000A8
000000020000255D000000A9
[...normal WAL sequence...]
000000020000255E0000009D

And on another box:

000000010000040E000000FF
0000000100000414000000DA
000000010000046E000000FF
0000000100000470000000FF
00000001000004850000000F
000000010000048500000010
[...normal WAL sequence...]
000000010000048500000052

So it seems that, for some reason, these old WALs were "forgotten" by the
restartpoint mechanism when they should have been recycled/deleted.

For one of these servers, I could correlate this with some abrupt
disconnections of the streaming replication appearing in its logs. But there
was no known SR disconnection on the second one.

Any idea about this weird behaviour? What can we do to help you investigate
further?

Regards,

--
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com


In reply to: Jehan-Guillaume de Rorthais (#2)
BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

Hi hackers,

We spent some time with Guillaume Lelarge studying this issue.

CreateRestartPoint() calls RemoveOldXlogFiles() to drop/recycle old WALs. The
latter calls XLogArchiveCheckDone() to check whether a given WAL can be dropped.
As our slave has "archive_mode" and "archive_command" set, XLogArchivingActive()
returns true, forcing XLogArchiveCheckDone() to look for a ".done" file. If the
corresponding ".done" file doesn't exist, XLogArchiveCheckDone() calls
XLogArchiveNotify() to actually create the ".ready" file.

Now, on a standby, we are supposed to create the ".done" files by calling
XLogArchiveForceDone() either after a successful restore_command or when
streaming replication finishes and flushes a WAL file. This looks like the root
cause of the trouble: perhaps a race condition where XLogArchiveForceDone() is
not called for some WAL file. As the ".done" file is never created, we end up
calling XLogArchiveNotify() to create a ".ready", and this WAL will never be
archived by the slave, sitting forever in the pg_xlog dir.
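
To illustrate the logic described above, here is a simplified standalone model
(not the actual xlogarchive.c code; the relative paths and the
archiving_active flag are assumptions for the example): a segment is removable
only once its .done status file exists, and when neither .done nor .ready
exists a .ready file gets created, which is exactly how these "ghost" .ready
files appear on a standby that never got its .done file.

/*
 * Simplified standalone model of the decision described above (not the
 * actual xlogarchive.c code; the relative paths are assumptions).  A segment
 * is removable only once its .done status file exists; when neither .done
 * nor .ready exists, a .ready file is created and the segment stays around.
 */
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

static int
file_exists(const char *path)
{
	struct stat st;

	return stat(path, &st) == 0;
}

/* returns 1 if the segment can be removed/recycled */
static int
archive_check_done(const char *xlog, int archiving_active)
{
	char		done[1024];
	char		ready[1024];
	FILE	   *fd;

	if (!archiving_active)
		return 1;				/* nothing to track, always removable */

	snprintf(done, sizeof(done), "pg_xlog/archive_status/%s.done", xlog);
	snprintf(ready, sizeof(ready), "pg_xlog/archive_status/%s.ready", xlog);

	if (file_exists(done))
		return 1;				/* archived, or forced to .done */
	if (file_exists(ready))
		return 0;				/* still waiting for the archiver */

	/* neither status file exists: notify the archiver by creating .ready */
	fd = fopen(ready, "w");
	if (fd != NULL)
		fclose(fd);
	printf("created %s\n", ready);
	return 0;
}

int
main(void)
{
	/* a WAL file streamed from the master but never marked as .done */
	return archive_check_done("000000020000255D000000A8", 1) ? 0 : 1;
}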

Worse, we experienced this on two 9.2.8 clusters and realized that this version
is not supposed to have 256 segments per WAL "cycle": a cycle should finish at
the *FE segment. I'm quoting back the history of my previous mail, as we clearly
have some *FF files on the slave side!

000000010000040E000000FF
0000000100000414000000DA
000000010000046E000000FF
0000000100000470000000FF
00000001000004850000000F
000000010000048500000010
[...normal WAL sequence...]
000000010000048500000052
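
For reference, the reason a *FF segment should not exist before 9.3 can be
shown with a couple of lines of arithmetic; the sketch below assumes the
default 16MB segment size.

/*
 * Why *FF segments should not exist before 9.3: with a 32-bit xrecoff and
 * the default 16MB segment size (an assumption here), each xlogid holds
 * 0xFFFFFFFF / 16MB = 255 segments, numbered 00 through FE; the counter
 * then rolls over to the next xlogid, so FF is skipped.  9.3 and later use
 * a flat 64-bit segment number, so FF segments do exist there.
 */
#include <stdio.h>

int
main(void)
{
	unsigned long seg_size = 16UL * 1024 * 1024;	/* XLOG_SEG_SIZE */
	unsigned long segs_per_xlogid = 0xFFFFFFFFUL / seg_size;

	printf("segments per xlogid: %lu (last segment number: %02lX)\n",
		   segs_per_xlogid, segs_per_xlogid - 1);	/* 255, FE */
	return 0;
}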

Note that this seems to happen only during streaming replication, not in pure
log-shipping replication. We added a debug message on the master side to make
sure it never archives a *FF file.

We kept the WAL files and log files for further analysis. How can we help
regarding this issue?

Regards,

On Mon, 15 Sep 2014 17:37:24 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

Hi hackers,

An issue that seems related to this has been posted on pgsql-admin. See:

/messages/by-id/CAAS3tyLnXaYDZ0+zhXLPdVtOvHQOvR+jSPhp30o8kvWqQs0Tqw@mail.gmail.com

How can we help on this issue?

Cheers,

On Thu, 4 Sep 2014 17:50:36 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

Hi hackers,

For a few months now, we have occasionally seen .ready files appearing on some
slave instances, in various contexts. The two cases I have in mind are under 9.2.x.

I tried to investigate a bit. These .ready files are created when a WAL file
in pg_xlog has no corresponding status file in pg_xlog/archive_status. I could
easily reproduce this by deleting such a status file: it is created again at
the next restartpoint or checkpoint received from the master.

Looking at the WAL files in the pg_xlog folder corresponding to these .ready
files, they are all much older than the current WAL "cycle", both by mtime and
by name sequence. For instance, on one of these boxes we currently have 6 of
those "ghost" WALs:

0000000200001E53000000FF
0000000200001F18000000FF
0000000200002047000000FF
00000002000020BF000000FF
0000000200002140000000FF
0000000200002370000000FF
000000020000255D000000A8
000000020000255D000000A9
[...normal WAL sequence...]
000000020000255E0000009D

And on another box:

000000010000040E000000FF
0000000100000414000000DA
000000010000046E000000FF
0000000100000470000000FF
00000001000004850000000F
000000010000048500000010
[...normal WAL sequence...]
000000010000048500000052

So it seems that, for some reason, these old WALs were "forgotten" by the
restartpoint mechanism when they should have been recycled/deleted.

For one of these servers, I could correlate this with some abrupt
disconnections of the streaming replication appearing in its logs. But there
was no known SR disconnection on the second one.

Any idea about this weird behaviour? What can we do to help you investigate
further?

Regards,

--
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com


#4Michael Paquier
michael.paquier@gmail.com
In reply to: Jehan-Guillaume de Rorthais (#3)
1 attachment(s)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Fri, Sep 19, 2014 at 1:07 AM, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

We kept the WAL files and log files for further analysis. How can we help
regarding this issue?

Commit c2f79ba added the assumption that the WAL receiver should always
enforce the creation of .done files when WAL files are done being streamed
(XLogWalRcvWrite and WalReceiverMain) or restored from archive
(KeepFileRestoredFromArchive). Then, using this assumption, 1bd42cd changed
RemoveOldXlogFiles a bit, removing a check on whether the node is in
recovery. Now, based on the information given here, yes, it happens that
there are still cases where .done file creation is not done correctly,
leading to those extra files. Even by looking at the code, I am not
directly seeing any code path where an extra call to XLogArchiveForceDone
would be needed on the WAL receiver side, but... Something like the attached
patch (which is clearly a band-aid) may help though, as it would make
files removable even if they are not marked as .done on a node in
recovery. And this is consistent with the pre-1bd42cd behavior.

Comments welcome.
--
Michael

Attachments:

20141008_xlog_file_enforce_deletion.patch (text/x-diff; charset=US-ASCII)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5a4dbb9..39701a3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3771,7 +3771,7 @@ RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr endptr)
 			strspn(xlde->d_name, "0123456789ABCDEF") == 24 &&
 			strcmp(xlde->d_name + 8, lastoff + 8) <= 0)
 		{
-			if (XLogArchiveCheckDone(xlde->d_name))
+			if (RecoveryInProgress() || XLogArchiveCheckDone(xlde->d_name))
 			{
 				snprintf(path, MAXPGPATH, XLOGDIR "/%s", xlde->d_name);
 
#5Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Michael Paquier (#4)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On 10/08/2014 10:44 AM, Michael Paquier wrote:

On Fri, Sep 19, 2014 at 1:07 AM, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

We kept the WAL files and log files for further analysis. How can we help
regarding this issue?

Commit c2f79ba added the assumption that the WAL receiver should always
enforce the creation of .done files when WAL files are done being streamed
(XLogWalRcvWrite and WalReceiverMain) or restored from archive
(KeepFileRestoredFromArchive). Then, using this assumption, 1bd42cd changed
RemoveOldXlogFiles a bit, removing a check on whether the node is in
recovery. Now, based on the information given here, yes, it happens that
there are still cases where .done file creation is not done correctly,
leading to those extra files. Even by looking at the code, I am not
directly seeing any code path where an extra call to XLogArchiveForceDone
would be needed on the WAL receiver side, but... Something like the attached
patch (which is clearly a band-aid) may help though, as it would make
files removable even if they are not marked as .done on a node in
recovery. And this is consistent with the pre-1bd42cd behavior.

There are two mysteries here:

1. Where do the FF files come from? In 9.2, FF segments are not supposed
to be created, ever.

Since this only happens with streaming replication, the FF segments are
probably being created by walreceiver. XLogWalRcvWrite is the function
that opens the file. I don't see anything obviously wrong there.
XLogWalRcvWrite opens the file corresponding to the start position in the
message received from the master. There is no check that the start
position is valid, though; if the master sends a start position in the
FF segment, walreceiver will merrily write it. So the problem could be
on the walsender side. However, I don't see anything wrong there either.

I think we should add a check in walreceiver, to throw an error if the
master sends an invalid WAL pointer, pointing to an FF segment.

2. Why are the .done files sometimes not being created?

I may have an explanation for that. Walreceiver creates a .done file
when it closes an old segment and opens a new one. However, it does this
only when it's about to start writing to the new segment, and still has
the old segment open. If you stream the FE segment fully, but the
replication connection drops at exactly that point, the .done file is not
created. That might sound unlikely, but it's actually pretty easy to
trigger. Just do "select pg_switch_xlog()" on the master, followed by
"pg_ctl stop -m i" and a restart.

The creation of the .done files seems quite unreliable anyway. If only a
portion of a segment is streamed, we don't write a .done file for it, so
we still have the original problem that we will try to archive the
segment after failover, even though the master might already have
archived it.

I looked again at the thread where this was discussed:
/messages/by-id/CAHGQGwHVYqbX=A+zo+AvFbVHLGoypO9G_QDKbabeXgXBVGd05g@mail.gmail.com.
I believe the idea was that the server that generates a WAL segment is
always responsible for archiving it. A standby should never attempt to
archive a WAL segment that was restored from the archive, or streamed
from the master.

In that thread, it was not discussed what should happen to WAL files
that an admin manually copies into pg_xlog of the standby. Should the
standby archive them? I don't think so - the admin should copy them
manually to the archive too, if he wants them archived. It's a good and
simple rule that the server that generates the WAL, archives the WAL.

Instead of creating any .done files during recovery, we could scan
pg_xlog at promotion, and create a .done file for every WAL segment
that's present at that point. That would be more robust. And then apply
your patch, to recycle old segments during archive recovery, ignoring
.done files.

- Heikki


#6Michael Paquier
michael.paquier@gmail.com
In reply to: Heikki Linnakangas (#5)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Wed, Oct 8, 2014 at 6:54 PM, Heikki Linnakangas <hlinnakangas@vmware.com>
wrote:

1. Where do the FF files come from? In 9.2, FF segments are not supposed
to be created, ever.

Since this only happens with streaming replication, the FF segments are
probably being created by walreceiver. XLogWalRcvWrite is the function that
opens the file. I don't see anything obviously wrong there. XLogWalRcvWrite
opens the file corresponding to the start position in the message received
from the master. There is no check that the start position is valid,
though; if the master sends a start position in the FF segment, walreceiver
will merrily write it. So the problem could be on the walsender side.
However, I don't see anything wrong there either.

Good to hear that. By looking at the WAL receiver and sender code paths, I
found nothing really weird either.

I think we should add a check in walreceiver, to throw an error if the
master sends an invalid WAL pointer, pointing to an FF segment.

Then we're good with a check on walEnd in ProcessWalSndrMessage, I guess.
Seems like a straightforward patch.

2. Why are the .done files sometimes not being created?

I may have an explanation for that. Walreceiver creates a .done file when
it closes an old segment and opens a new one. However, it does this only
when it's about to start writing to the new segment, and still has the old
segment open. If you stream the FE segment fully, but the replication
connection drops at exactly that point, the .done file is not created. That
might sound unlikely, but it's actually pretty easy to trigger. Just do
"select pg_switch_xlog()" on the master, followed by "pg_ctl stop -m i" and
a restart.

That's exactly the test I had been running a couple of times to trigger this
behavior before sending my previous email, but without success: all the WAL
files on the standby were marked as .done. Now, I have just retried it, with
far more attempts, on REL9_3_STABLE and on HEAD, and I have been able to
actually trigger the problem a couple of times. Simply run a long transaction
generating a lot of WAL, like this:
=# create table aa as select generate_series(1,1000000000);
And then run this:
$ psql -c 'select pg_switch_xlog()'; pg_ctl stop -m immediate; pg_ctl start
And with enough "luck", .ready files may appear. It may take a dozen tries
before seeing at least one. And I noticed that generally multiple .ready
files appear at the same time.

The creation of the .done files seems quite unreliable anyway. If only a
portion of a segment is streamed, we don't write a .done file for it, so we
still have the original problem that we will try to archive the segment
after failover, even though the master might already have archived it.

Yep. Agreed.

I looked again at the thread where this was discussed:
/messages/by-id/CAHGQGwHVYqbX=A+zo+AvFbVHLGoypO9G_QDKbabeXgXBVGd05g@mail.gmail.com.
I believe the idea was that the server that generates a WAL segment is
always responsible for archiving it. A standby should never attempt to
archive a WAL segment that was restored from the archive, or streamed from
the master.

In that thread, it was not discussed what should happen to WAL files that
an admin manually copies into pg_xlog of the standby. Should the standby
archive them? I don't think so - the admin should copy them manually to the
archive too, if he wants them archived. It's a good and simple rule that
the server that generates the WAL, archives the WAL.

Question time: why was the check based on the recovery state of the node
removed in 1bd42cd? Just guessing, but did you have in mind that relying on
XLogArchiveForceDone and XLogArchiveCheckDone was enough and more robust at
this point?

Instead of creating any .done files during recovery, we could scan pg_xlog
at promotion, and create a .done file for every WAL segment that's present
at that point. That would be more robust. And then apply your patch, to
recycle old segments during archive recovery, ignoring .done files.

The additional processing at promotion sounds like a good idea; I'll try to
get a patch done tomorrow. This would also result in removing the
XLogArchiveForceDone stuff. Either way, now that I have been able to
reproduce the problem manually, things can clearly be solved.
Regards,
--
Michael

#7Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#5)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Wed, Oct 8, 2014 at 6:54 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 10/08/2014 10:44 AM, Michael Paquier wrote:

On Fri, Sep 19, 2014 at 1:07 AM, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

We kept the WAL files and log files for further analysis. How can we help
regarding this issue?

Commit c2f79ba added the assumption that the WAL receiver should always
enforce the creation of .done files when WAL files are done being streamed
(XLogWalRcvWrite and WalReceiverMain) or restored from archive
(KeepFileRestoredFromArchive). Then, using this assumption, 1bd42cd changed
RemoveOldXlogFiles a bit, removing a check on whether the node is in
recovery. Now, based on the information given here, yes, it happens that
there are still cases where .done file creation is not done correctly,
leading to those extra files. Even by looking at the code, I am not
directly seeing any code path where an extra call to XLogArchiveForceDone
would be needed on the WAL receiver side, but... Something like the attached
patch (which is clearly a band-aid) may help though, as it would make
files removable even if they are not marked as .done on a node in
recovery. And this is consistent with the pre-1bd42cd behavior.

There are two mysteries here:

1. Where do the FF files come from? In 9.2, FF segments are not supposed to
be created, ever.

Since this only happens with streaming replication, the FF segments are
probably being created by walreceiver. XLogWalRcvWrite is the function that
opens the file. I don't see anything obviously wrong there. XLogWalRcvWrite
opens the file corresponding to the start position in the message received from
the master. There is no check that the start position is valid, though; if
the master sends a start position in the FF segment, walreceiver will
merrily write it. So the problem could be on the walsender side. However, I
don't see anything wrong there either.

I think we should add a check in walreceiver, to throw an error if the
master sends an invalid WAL pointer, pointing to an FF segment.

2. Why are the .done files sometimes not being created?

I may have an explanation for that. Walreceiver creates a .done file when it
closes an old segment and opens a new one. However, it does this only when
it's about to start writing to the new segment, and still has the old
segment open. If you stream the FE segment fully, but the replication
connection drops at exactly that point, the .done file is not created. That
might sound unlikely, but it's actually pretty easy to trigger. Just do
"select pg_switch_xlog()" on the master, followed by "pg_ctl stop -m i" and a
restart.

The creation of the .done files seems quite unreliable anyway. If only a
portion of a segment is streamed, we don't write a .done file for it, so we
still have the original problem that we will try to archive the segment
after failover, even though the master might already have archived it.

I looked again at the thread where this was discussed:
/messages/by-id/CAHGQGwHVYqbX=A+zo+AvFbVHLGoypO9G_QDKbabeXgXBVGd05g@mail.gmail.com.
I believe the idea was that the server that generates a WAL segment is
always responsible for archiving it. A standby should never attempt to
archive a WAL segment that was restored from the archive, or streamed from
the master.

In that thread, it was not discussed what should happen to WAL files that an
admin manually copies into pg_xlog of the standby. Should the standby
archive them? I don't think so - the admin should copy them manually to the
archive too, if he wants them archived. It's a good and simple rule that the
server that generates the WAL, archives the WAL.

Instead of creating any .done files during recovery, we could scan pg_xlog
at promotion, and create a .done file for every WAL segment that's present
at that point. That would be more robust. And then apply your patch, to
recycle old segments during archive recovery, ignoring .done files.

What happens if a user shuts down the standby, removes recovery.conf and
starts the server as the master? In this case, no WAL files have .done status
files, so the server will create .ready files and archive all of them. Probably
this is problematic. So even if we adopt your idea, ISTM that it's better to
create a .done file whenever a WAL file is fully streamed or restored.

Regards,

--
Fujii Masao


#8Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Fujii Masao (#7)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On 10/08/2014 04:59 PM, Fujii Masao wrote:

On Wed, Oct 8, 2014 at 6:54 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

Instead of creating any .done files during recovery, we could scan pg_xlog
at promotion, and create a .done file for every WAL segment that's present
at that point. That would be more robust. And then apply your patch, to
recycle old segments during archive recovery, ignoring .done files.

What happens if a user shuts down the standby, removes recovery.conf and
starts the server as the master?

Um, that's not a safe thing to do anyway, is it?

- Heikki


#9Michael Paquier
michael.paquier@gmail.com
In reply to: Heikki Linnakangas (#8)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Wed, Oct 8, 2014 at 11:38 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

On 10/08/2014 04:59 PM, Fujii Masao wrote:

On Wed, Oct 8, 2014 at 6:54 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

Instead of creating any .done files during recovery, we could scan pg_xlog
at promotion, and create a .done file for every WAL segment that's present
at that point. That would be more robust. And then apply your patch, to
recycle old segments during archive recovery, ignoring .done files.

What happens if a user shuts down the standby, removes recovery.conf and
starts the server as the master?

Um, that's not a safe thing to do anyway, is it?

That's not safe, as it bypasses all the consistency checks of promotion.
Still, it is also something that repmgr, for example, does as far as I recall
to perform a node "promotion". What if we simply document the problem properly
then? The appearance of those phantom WAL files is scarier than a user
or a utility that does a promotion with a server restart. Not to mention
that users are free to add files to pg_xlog themselves.
--
Michael

#10Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#6)
2 attachment(s)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Wed, Oct 8, 2014 at 10:00 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:

The additional processing at promotion sounds like a good idea; I'll try to
get a patch done tomorrow. This would also result in removing the
XLogArchiveForceDone stuff. Either way, now that I have been able to
reproduce the problem manually, things can clearly be solved.

Please find attached two patches aimed at fixing this issue and improving the
situation:
- 0001 prevents the appearance of those phantom WAL segment files by
ensuring that when a node is in recovery it will remove them whatever their
status in archive_status. This patch is the real fix, and should be applied
down to 9.2.
- 0002 is a patch implementing Heikki's idea of enforcing all the segment
files present in pg_xlog to have their status set to .done, marking them for
removal. When looking at the code, I finally concluded that Fujii-san's
point, about marking as .done in all cases segment files that have been
fully streamed, actually makes more sense and should be kept. This patch
would not actually be mandatory for back-patching, but it makes the process
more robust IMO.

I imagine that it would be nice to get those things fixed before the next
minor release.
Regards,
--
Michael

Attachments:

0001-Fix-apparition-of-archive-status-files-of-.ready.patch (text/x-diff; charset=US-ASCII)
From 18c2d47b1d5dd3c0439f990ee4da6b305d477ca4 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Thu, 9 Oct 2014 15:04:26 +0900
Subject: [PATCH 1/2] Fix apparition of archive status files of .ready on
 streaming standbys

Commit 1bd42cd has removed a check based on the recovery status of a node
when removing old WAL segment files in pg_xlog, causing the apparition of
.ready files that prevented the removal of some WAL segment files that
remained stuck in the archive folder. Note that this does not prevent
the abscence of some .done files as it may still be possible that some
segments cannot be marked correctly as complete after their stream is done
in the case for example of an abrupt disconnection between a standby and
its root node, but it ensures that when a node is in recovery old WAL
segment files are removed whatever their status in the folder
archive_status.

Per report from Jehan-Guillaume de Rorthais.
---
 src/backend/access/transam/xlog.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5a4dbb9..39701a3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3771,7 +3771,7 @@ RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr endptr)
 			strspn(xlde->d_name, "0123456789ABCDEF") == 24 &&
 			strcmp(xlde->d_name + 8, lastoff + 8) <= 0)
 		{
-			if (XLogArchiveCheckDone(xlde->d_name))
+			if (RecoveryInProgress() || XLogArchiveCheckDone(xlde->d_name))
 			{
 				snprintf(path, MAXPGPATH, XLOGDIR "/%s", xlde->d_name);
 
-- 
2.1.2

0002-Enforce-all-WAL-segment-files-to-be-marked-as-.done-.patch (text/x-diff; charset=US-ASCII)
From 493a9d05e9386403fc1e6d78df19dd8a59c53cae Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Thu, 9 Oct 2014 15:09:44 +0900
Subject: [PATCH 2/2] Enforce all WAL segment files to be marked as .done at
 node promotion

This is a safety mechanism to ensure that there are no files that are not
considered as .done as some segments may have been missed particularly in
the case of partially written files or files being written when a disconnection
occurred between a streaming standby and its root node. This makes the node
reaching promotion having a state consistent with what is expected using
the assumption that all the WAL segment files that are done being streamed
should be always considered as archived by the node.
---
 src/backend/access/transam/xlog.c        | 10 ++++++++
 src/backend/access/transam/xlogarchive.c | 39 +++++++++++++++++++++++++++++++-
 src/include/access/xlog_internal.h       |  1 +
 3 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 39701a3..af5f548 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6862,6 +6862,16 @@ StartupXLOG(void)
 	}
 
 	/*
+	 * Create a .done entry for each WAL file present in pg_xlog that has
+	 * not been yet marked as such. In some cases where for example a streaming
+	 * replica has had a connection to a remote node cut abruptly, such WAL
+	 * files may have been only partially written or even not flagged correctly
+	 * with .done.
+	 */
+	if (InRecovery)
+		XLogArchiveForceDoneAll();
+
+	/*
 	 * Kill WAL receiver, if it's still running, before we continue to write
 	 * the startup checkpoint record. It will trump over the checkpoint and
 	 * subsequent records if it's still alive when we start writing WAL.
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 047efa2..931106f 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -554,6 +554,40 @@ XLogArchiveNotifySeg(XLogSegNo segno)
 }
 
 /*
+ * XLogArchiveForceDoneAll
+ *
+ * Wrapper of XLogArchiveForceDone scanning all the XLOG segments files in
+ * XLOGDIR, switching them forcibly to <XLOG>.done.
+ */
+void
+XLogArchiveForceDoneAll(void)
+{
+	DIR			   *xldir;
+	struct dirent  *xlde;
+
+
+	xldir = AllocateDir(XLOGDIR);
+	if (xldir == NULL)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open transaction log directory \"%s\": %m",
+						XLOGDIR)));
+
+	/*
+	 * Scan all the WAL segments present in the archives and switch them to
+	 * .done.
+	 */
+	while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+	{
+		if (strlen(xlde->d_name) == 24 &&
+			strspn(xlde->d_name, "0123456789ABCDEF") == 24)
+			XLogArchiveForceDone(xlde->d_name);
+	}
+
+	FreeDir(xldir);
+}
+
+/*
  * XLogArchiveForceDone
  *
  * Emit notification forcibly that an XLOG segment file has been successfully
@@ -582,7 +616,6 @@ XLogArchiveForceDone(const char *xlog)
 					(errcode_for_file_access(),
 					 errmsg("could not rename file \"%s\" to \"%s\": %m",
 							archiveReady, archiveDone)));
-
 		return;
 	}
 
@@ -604,6 +637,10 @@ XLogArchiveForceDone(const char *xlog)
 						archiveDone)));
 		return;
 	}
+
+	ereport(DEBUG2,
+			(errmsg("archive status file of \"%s\" has been forcibly switched from \"%s\" to \"%s\"",
+					xlog, archiveReady, archiveDone)));
 }
 
 /*
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 27b9899..04a0de3 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -286,6 +286,7 @@ extern void ExecuteRecoveryCommand(char *command, char *commandName,
 extern void KeepFileRestoredFromArchive(char *path, char *xlogfname);
 extern void XLogArchiveNotify(const char *xlog);
 extern void XLogArchiveNotifySeg(XLogSegNo segno);
+extern void XLogArchiveForceDoneAll(void);
 extern void XLogArchiveForceDone(const char *xlog);
 extern bool XLogArchiveCheckDone(const char *xlog);
 extern bool XLogArchiveIsBusy(const char *xlog);
-- 
2.1.2

#11Michael Paquier
michael.paquier@gmail.com
In reply to: Heikki Linnakangas (#5)
1 attachment(s)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Wed, Oct 8, 2014 at 6:54 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

1. Where do the FF files come from? In 9.2, FF segments are not supposed to
be created, ever.

I think we should add a check in walreceiver, to throw an error if the
master sends an invalid WAL pointer, pointing to an FF segment.

Attached is a patch for that. This would be needed for PG < 9.3 if applied.
Regards,
--
Michael

Attachments:

20141009_walreceiver_check_invalid.patch (text/x-patch; charset=US-ASCII)
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 753316e..4f4d019 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -789,6 +789,24 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
 	walrcv->lastMsgReceiptTime = lastMsgReceiptTime;
 	SpinLockRelease(&walrcv->mutex);
 
+	/*
+	 * Check that the WAL position received from sender is valid and does
+	 * refer to a segment file whose name finishes by FF.
+	 */
+	if ((walEnd.xlogid & 0xff) == 0xff)
+	{
+		char		xlogfilename[MAXFNAMELEN];
+		uint32		xlogseg;
+		uint32		xlogid;
+
+		XLByteToPrevSeg(walEnd, xlogid, xlogseg);
+		XLogFileName(xlogfilename, ThisTimeLineID, xlogid, xlogseg);
+		ereport(ERROR,
+				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+				 errmsg_internal("invalid WAL position %X/%X referring to segment %s",
+								 walEnd.xlogid, walEnd.xrecoff, xlogfilename)));
+	}
+
 	if (log_min_messages <= DEBUG2)
 	{
 		char	   *sendtime;
#12Fujii Masao
masao.fujii@gmail.com
In reply to: Michael Paquier (#10)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Thu, Oct 9, 2014 at 3:26 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Oct 8, 2014 at 10:00 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:

The additional processing at promotion sounds like a good idea; I'll try to
get a patch done tomorrow. This would also result in removing the
XLogArchiveForceDone stuff. Either way, now that I have been able to
reproduce the problem manually, things can clearly be solved.

Please find attached two patches aimed at fixing this issue and improving the
situation:
- 0001 prevents the appearance of those phantom WAL segment files by ensuring
that when a node is in recovery it will remove them whatever their status in
archive_status. This patch is the real fix, and should be applied down to
9.2.
- 0002 is a patch implementing Heikki's idea of enforcing all the segment
files present in pg_xlog to have their status set to .done, marking them for
removal. When looking at the code, I finally concluded that Fujii-san's
point, about marking as .done in all cases segment files that have been
fully streamed, actually makes more sense and should be kept. This patch
would not actually be mandatory for back-patching, but it makes the process
more robust IMO.

Thanks for the patches!

I found one problem in the 0002 patch. The patch changes the recovery so that
it creates .done files for every WAL file which exists in the pg_xlog directory
at the end of recovery. But even WAL files which will have to be archived later
can exist in pg_xlog at that moment. For example, the latest, recycled and
fully-written-but-not-archived-yet (i.e., maybe having .ready files) WAL files.
The patch wrongly prevents them from being archived at all.

ISTM that the 0001 patch has a similar problem. Please imagine the following
scenario.

1. There are many unarchived WAL files in pg_xlog because of a continuous
failure of archive_command, for example, and then the server unfortunately
crashes because of corruption of the database itself.

2. The DBA restores the backup onto the server and copies all the WAL files
from the old pg_xlog to the new one. Then he or she prepares for archive recovery.

3. The DBA starts the server and the archive recovery starts.

4. After all the archived WAL files are replayed, the server tries to replay
the WAL files in pg_xlog. Since there are many WAL files in pg_xlog,
more than one restartpoint happens while they are being replayed.

In this case, the patch seems to make the restartpoint recycle even WAL files
which have .ready files and will have to be archived later. Thoughts?

Regards,

--
Fujii Masao


#13Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#12)
1 attachment(s)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Fri, Oct 17, 2014 at 9:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Thu, Oct 9, 2014 at 3:26 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Oct 8, 2014 at 10:00 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:

The additional processing at promotion sounds like a good idea; I'll try to
get a patch done tomorrow. This would also result in removing the
XLogArchiveForceDone stuff. Either way, now that I have been able to
reproduce the problem manually, things can clearly be solved.

Please find attached two patches aimed at fixing this issue and improving the
situation:
- 0001 prevents the appearance of those phantom WAL segment files by ensuring
that when a node is in recovery it will remove them whatever their status in
archive_status. This patch is the real fix, and should be applied down to
9.2.
- 0002 is a patch implementing Heikki's idea of enforcing all the segment
files present in pg_xlog to have their status set to .done, marking them for
removal. When looking at the code, I finally concluded that Fujii-san's
point, about marking as .done in all cases segment files that have been
fully streamed, actually makes more sense and should be kept. This patch
would not actually be mandatory for back-patching, but it makes the process
more robust IMO.

Thanks for the patches!

While reviewing the patch, I found another bug related to WAL files in recovery
mode. The problem is that exitArchiveRecovery() always creates a .ready file for
the last WAL file of the old timeline, even when it was restored from the archive
and has a .done file. So this causes the already-archived WAL file to be archived
again... The attached patch fixes this bug.

Regards,

--
Fujii Masao

Attachments:

bugfix-last-restored-wal.patch (text/x-patch; charset=US-ASCII)
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 5351,5368 **** exitArchiveRecovery(TimeLineID endTLI, XLogSegNo endLogSegNo)
  	 * for the new timeline.
  	 *
  	 * Notify the archiver that the last WAL segment of the old timeline is
! 	 * ready to copy to archival storage. Otherwise, it is not archived for a
! 	 * while.
  	 */
  	if (endTLI != ThisTimeLineID)
  	{
  		XLogFileCopy(endLogSegNo, endTLI, endLogSegNo);
  
! 		if (XLogArchivingActive())
! 		{
! 			XLogFileName(xlogpath, endTLI, endLogSegNo);
! 			XLogArchiveNotify(xlogpath);
! 		}
  	}
  
  	/*
--- 5351,5367 ----
  	 * for the new timeline.
  	 *
  	 * Notify the archiver that the last WAL segment of the old timeline is
! 	 * ready to copy to archival storage if its .done file doesn't exist
! 	 * (e.g., if it's the restored WAL file, it's expected to have .done file).
! 	 * Otherwise, it is not archived for a while.
  	 */
  	if (endTLI != ThisTimeLineID)
  	{
  		XLogFileCopy(endLogSegNo, endTLI, endLogSegNo);
  
! 		/* Create .ready file only when neither .ready nor .done files exist */
! 		XLogFileName(xlogpath, endTLI, endLogSegNo);
! 		XLogArchiveCheckDone(xlogpath);
  	}
  
  	/*
*** a/src/backend/access/transam/xlogarchive.c
--- b/src/backend/access/transam/xlogarchive.c
***************
*** 516,521 **** XLogArchiveNotify(const char *xlog)
--- 516,527 ----
  	char		archiveStatusPath[MAXPGPATH];
  	FILE	   *fd;
  
+ 	/*
+ 	 * We should not create .ready file for already archived WAL file to
+ 	 * prevent it from being archived again.
+ 	 */
+ 	Assert(XLogArchiveIsBusy(xlog));
+ 
  	/* insert an otherwise empty file called <XLOG>.ready */
  	StatusFilePath(archiveStatusPath, xlog, ".ready");
  	fd = AllocateFile(archiveStatusPath, "w");
#14Michael Paquier
michael.paquier@gmail.com
In reply to: Fujii Masao (#12)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Fri, Oct 17, 2014 at 9:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

In this case, the patch seems to make the restartpoint recycle even WAL files
which have .ready files and will have to be archived later. Thoughts?

The real problem currently is that it is possible to have a segment file
not marked as .done during recovery when the stream connection is abruptly cut
while this segment is being switched, marking it as .ready in archive_status and
simply leaving this segment in pg_xlog, because it will neither be recycled
nor removed. I have not been able to look much at this code these days, so
I am not sure how invasive it would be in back-branches, but perhaps we
should try to improve the code such that, when a segment file is switched and
the connection is cut, we guarantee that this file is completed and
marked as .done.
--
Michael

#15Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#14)
1 attachment(s)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Fri, Oct 17, 2014 at 10:37 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Oct 17, 2014 at 9:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

In this case, the patch seems to make the restartpoint recycle even WAL files
which have .ready files and will have to be archived later. Thoughts?

The real problem currently is that it is possible to have a segment file not marked as .done during recovery when the stream connection is abruptly cut while this segment is being switched, marking it as .ready in archive_status and simply leaving this segment in pg_xlog, because it will neither be recycled nor removed. I have not been able to look much at this code these days, so I am not sure how invasive it would be in back-branches, but perhaps we should try to improve the code such that, when a segment file is switched and the connection is cut, we guarantee that this file is completed and marked as .done.

I have spent more time on that, with a bit more groundwork...
First, the problem can be reproduced most of the time by running this
simple command:
psql -c 'select pg_switch_xlog()'; pg_ctl restart -m immediate

This will enforce a segment file switch and restart the master in
crash recovery. This has the effect of immediately cutting the WAL
stream on the slave, shown by a FATAL in libpqrcv_receive where
rawlen == 0. For example, let's imagine that the stream fails when
switching from 000000010000000000000003 to the next segment; then the
last XLogRecPtr in XLogWalRcvProcessMsg for dataStart is, for example,
0/3100000, and walrcv->latestWalEnd is 0/4000000. When the stream
restarts, it will begin once again from 0/4000000, ignoring that
000000010000000000000003 should be marked as .done, and ultimately
marking it as .ready when old segment files are recycled or removed.
There is nothing that can really be done to enforce the creation of a
.done file before the FATAL of libpqrcv_receive, because we cannot
predict the stream failure.

Now, we can do better than what we have by looking at the WAL start
position used when starting streaming in the WAL receiver, and forcing
the previous segment to .done if the start position sits exactly at a
segment boundary. Hence, in the case of the start position 0/4000000
found previously, the file that will be forced to .done is
000000010000000000000003. I have written the attached patch that
implements this idea and fixes the problem. Now let's see if you guys
see any flaws in this simple logic, which uses a sniper rifle instead
of a bazooka as in the previous patches sent.

Regards,
--
Michael

Attachments:

20141020_wal_phantom_fix.patch (text/x-patch; charset=US-ASCII)
From ce7e1eec97dbe7b1648b4f56a1f9825eeba7ebed Mon Sep 17 00:00:00 2001
From: Michael Paquier <mpaquier@vmware.com>
Date: Mon, 20 Oct 2014 14:40:37 +0900
Subject: [PATCH] Mark segment file .done for last WAL position at stream start

When stream connection between a standby and its root node is unstable,
WAL stream errors make this standby node fail with FATAL errors because
of the WAL receiver, forcing it to restart in crash-recovery mode. Now,
when the crash occurs exactly when a switch to a new segment file is
done, it is possible that the WAL receiver restarts from a position
located on the next segment file, while the previous file is let as is,
marked ultimately in .ready state once old WAL files are recycled or
removed. Note that this file should have been marked as .done by the
WAL receiver itself.

With this patch, WAL receiver checks for the presence of the previous
segment file of start position if this WAL position is the last one of
the previous segment file and enforces it to .done, preventing .ready
files from being accumulated on standbys.

This failure can be easily reproducible with the following command:
psql -c 'select pg_switch_xlog()'; pg_ctl restart -m immediate
---
 src/backend/access/transam/xlogarchive.c | 26 ++++++++++++++++++++++++++
 src/backend/replication/walreceiver.c    | 12 ++++++++++++
 src/include/access/xlog_internal.h       |  1 +
 3 files changed, 39 insertions(+)

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 047efa2..25a153f 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -553,6 +553,32 @@ XLogArchiveNotifySeg(XLogSegNo segno)
 	XLogArchiveNotify(xlog);
 }
 
+
+/*
+ * XLogArchiveForceDoneIfPresent
+ *
+ * Wrapper of XLogArchiveForceDone, used conditionally based on the presence
+ * of given XLOG segment file.
+ */
+void
+XLogArchiveForceDoneIfPresent(TimeLineID tli, XLogSegNo segno)
+{
+	struct stat	stat_buf;
+	char		xlogfname[MAXFNAMELEN];
+	char		xlogfpath[MAXPGPATH];
+
+	XLogFilePath(xlogfpath, tli, segno);
+
+	/* File is missing, nothing to do */
+	if (stat(xlogfpath, &stat_buf) != 0)
+		return;
+
+	/* All is fine, process... */
+	XLogFileName(xlogfname, tli, segno);
+	XLogArchiveForceDone(xlogfname);
+}
+
+
 /*
  * XLogArchiveForceDone
  *
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index c2d4ed3..6857a05 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -361,6 +361,7 @@ WalReceiverMain(void)
 								  slotname[0] != '\0' ? slotname : NULL))
 		{
 			bool		endofwal = false;
+			XLogSegNo	startPointSegNo;
 
 			if (first_stream)
 				ereport(LOG,
@@ -383,6 +384,17 @@ WalReceiverMain(void)
 			last_recv_timestamp = GetCurrentTimestamp();
 			ping_sent = false;
 
+			/*
+			 * Check if current start point is located exactly at the end of
+			 * a segment file and mark this file as already archived. It is
+			 * possible that if WAL stream connection has been abruptly cut
+			 * exactly during a segment file switch that this file is still
+			 * present but not marked as .done even if it should be.
+			 */
+			XLByteToPrevSeg(startpoint, startPointSegNo);
+			if (!XLByteInSeg(startpoint, startPointSegNo))
+				XLogArchiveForceDoneIfPresent(startpointTLI, startPointSegNo);
+
 			/* Loop until end-of-streaming or error */
 			while (!endofwal)
 			{
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 27b9899..3b1a99e 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -286,6 +286,7 @@ extern void ExecuteRecoveryCommand(char *command, char *commandName,
 extern void KeepFileRestoredFromArchive(char *path, char *xlogfname);
 extern void XLogArchiveNotify(const char *xlog);
 extern void XLogArchiveNotifySeg(XLogSegNo segno);
+extern void XLogArchiveForceDoneIfPresent(TimeLineID tli, XLogSegNo segno);
 extern void XLogArchiveForceDone(const char *xlog);
 extern bool XLogArchiveCheckDone(const char *xlog);
 extern bool XLogArchiveIsBusy(const char *xlog);
-- 
2.1.2

#16Michael Paquier
michael.paquier@gmail.com
In reply to: Fujii Masao (#13)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Fri, Oct 17, 2014 at 10:12 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Fri, Oct 17, 2014 at 9:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Thu, Oct 9, 2014 at 3:26 PM, Michael Paquier <michael.paquier@gmail.com> wrote:

On Wed, Oct 8, 2014 at 10:00 PM, Michael Paquier <michael.paquier@gmail.com> wrote:

The additional processing at promotion sounds like a good idea; I'll try to
get a patch done tomorrow. This would also result in removing the
XLogArchiveForceDone stuff. Either way, now that I have been able to
reproduce the problem manually, things can clearly be solved.

Please find attached two patches aimed at fixing this issue and improving the
situation:
- 0001 prevents the appearance of those phantom WAL segment files by ensuring
that when a node is in recovery it will remove them whatever their status in
archive_status. This patch is the real fix, and should be applied down to
9.2.
- 0002 is a patch implementing Heikki's idea of enforcing all the segment
files present in pg_xlog to have their status set to .done, marking them for
removal. When looking at the code, I finally concluded that Fujii-san's
point, about marking as .done in all cases segment files that have been
fully streamed, actually makes more sense and should be kept. This patch
would not actually be mandatory for back-patching, but it makes the process
more robust IMO.

Thanks for the patches!

While reviewing the patch, I found another bug related to WAL files in recovery
mode. The problem is that exitArchiveRecovery() always creates a .ready file for
the last WAL file of the old timeline, even when it was restored from the archive
and has a .done file. So this causes the already-archived WAL file to be archived
again... The attached patch fixes this bug.

That's a good catch! The patch looks good. I think that you should change
xlogpath to use MAXFNAMELEN instead of MAXPGPATH in exitArchiveRecovery.
This is only for correctness, so it's a master-only remark, because this
variable is used only to compute a segment file name, not a path.
Renaming the variable from xlogpath to xlogname would make sense as well.
Regards,
--
Michael

#17Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Michael Paquier (#15)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On 10/20/2014 09:26 AM, Michael Paquier wrote:

On Fri, Oct 17, 2014 at 10:37 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Oct 17, 2014 at 9:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

In this case, the patch seems to make the restartpoint recycle even WAL files
which have .ready files and will have to be archived later. Thoughts?

The real problem currently is that it is possible to have a segment file not marked as .done during recovery when the stream connection is abruptly cut while this segment is being switched, marking it as .ready in archive_status and simply leaving this segment in pg_xlog, because it will neither be recycled nor removed. I have not been able to look much at this code these days, so I am not sure how invasive it would be in back-branches, but perhaps we should try to improve the code such that, when a segment file is switched and the connection is cut, we guarantee that this file is completed and marked as .done.

I have spent more time on that, with a bit more groundwork...
First, the problem can be reproduced most of the time by running this
simple command:
psql -c 'select pg_switch_xlog()'; pg_ctl restart -m immediate

This will enforce a segment file switch and restart the master in
crash recovery. This has the effect of immediately cutting the WAL
stream on the slave, shown by a FATAL in libpqrcv_receive where
rawlen == 0. For example, let's imagine that the stream fails when
switching from 000000010000000000000003 to the next segment; then the
last XLogRecPtr in XLogWalRcvProcessMsg for dataStart is, for example,
0/3100000, and walrcv->latestWalEnd is 0/4000000. When the stream
restarts, it will begin once again from 0/4000000, ignoring that
000000010000000000000003 should be marked as .done, and ultimately
marking it as .ready when old segment files are recycled or removed.
There is nothing that can really be done to enforce the creation of a
.done file before the FATAL of libpqrcv_receive, because we cannot
predict the stream failure.

Now, we can do better than what we have by looking at the WAL start
position used when starting streaming in the WAL receiver, and forcing
the previous segment to .done if the start position sits exactly at a
segment boundary. Hence, in the case of the start position 0/4000000
found previously, the file that will be forced to .done is
000000010000000000000003. I have written the attached patch that
implements this idea and fixes the problem. Now let's see if you guys
see any flaws in this simple logic, which uses a sniper rifle instead
of a bazooka as in the previous patches sent.

Hmm. This will still miss the .done file if you don't re-establish the
streaming replication connection after the restart. For example, if you
shut down the master, and promote the standby server.

I think we should take a more wholesale approach to this. We should
enforce the rule that the server only ever archives WAL files belonging
to the same timeline that the server generates. IOW, the server only
archives the WAL that it has generated.

- Heikki


#18Michael Paquier
michael.paquier@gmail.com
In reply to: Fujii Masao (#12)
1 attachment(s)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Fri, Oct 17, 2014 at 2:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

I found one problem in the 0002 patch. The patch changes the recovery so that
it creates .done files for every WAL file which exists in the pg_xlog directory
at the end of recovery. But even WAL files which will have to be archived later
can exist in pg_xlog at that moment. For example, the latest, recycled and
fully-written-but-not-archived-yet (i.e., maybe having .ready files) WAL files.
The patch wrongly prevents them from being archived at all.

Re-looking at patch 2, yes, you are right. Even though it was mentioned that we
should do that only for a node where promotion has been triggered, it was not
done this way, as a check on CheckForStandbyTrigger() is actually missing.
Attached is an updated patch.

ISTM that the 0001 patch has a similar problem. Please imagine the following
scenario.

In this case, the patch seems to make the restartpoint recycle even WAL files
which have .ready files and will have to be archived later. Thoughts?

Right, that would really be going backwards. This was the approach taken before
c9cc7e0, and that commit actually prevents the removal of unarchived WAL files
during recovery.
--
Michael

Attachments:

20141022_enforce_done_at_promotion.patch (text/x-diff; charset=US-ASCII)
From fdf1f4a46a2f870ee2aeda11f909c465ed12450a Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@otacoo.com>
Date: Wed, 22 Oct 2014 15:18:33 +0200
Subject: [PATCH] Enforce all WAL segment files to be marked as .done at  node
 promotion

This is a safety mechanism to ensure that there are no files that are not
considered as .done as some segments may have been missed particularly in
the case of partially written files or files being written when a disconnection
occurred between a streaming standby and its root node. This makes the node
reaching promotion having a state consistent with what is expected using
the assumption that all the WAL segment files that are done being streamed
should be always considered as archived by the node.
---
 src/backend/access/transam/xlog.c        | 10 ++++++++
 src/backend/access/transam/xlogarchive.c | 39 +++++++++++++++++++++++++++++++-
 src/include/access/xlog_internal.h       |  1 +
 3 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 235b442..7179003 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6862,6 +6862,16 @@ StartupXLOG(void)
 	}
 
 	/*
+	 * Create a .done entry for each WAL file present in pg_xlog that has
+	 * not been yet marked as such for a node that has been promoted. In some
+	 * cases where for example a streaming replica has had a connection to a
+	 * remote node cut abruptly, such WAL files may have been only partially
+	 * written or even not flagged correctly with .done.
+	 */
+	if (InRecovery && CheckForStandbyTrigger())
+		XLogArchiveForceDoneAll();
+
+	/*
 	 * Kill WAL receiver, if it's still running, before we continue to write
 	 * the startup checkpoint record. It will trump over the checkpoint and
 	 * subsequent records if it's still alive when we start writing WAL.
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 047efa2..931106f 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -554,6 +554,40 @@ XLogArchiveNotifySeg(XLogSegNo segno)
 }
 
 /*
+ * XLogArchiveForceDoneAll
+ *
+ * Wrapper of XLogArchiveForceDone scanning all the XLOG segments files in
+ * XLOGDIR, switching them forcibly to <XLOG>.done.
+ */
+void
+XLogArchiveForceDoneAll(void)
+{
+	DIR			   *xldir;
+	struct dirent  *xlde;
+
+
+	xldir = AllocateDir(XLOGDIR);
+	if (xldir == NULL)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open transaction log directory \"%s\": %m",
+						XLOGDIR)));
+
+	/*
+	 * Scan all the WAL segments present in the archives and switch them to
+	 * .done.
+	 */
+	while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
+	{
+		if (strlen(xlde->d_name) == 24 &&
+			strspn(xlde->d_name, "0123456789ABCDEF") == 24)
+			XLogArchiveForceDone(xlde->d_name);
+	}
+
+	FreeDir(xldir);
+}
+
+/*
  * XLogArchiveForceDone
  *
  * Emit notification forcibly that an XLOG segment file has been successfully
@@ -582,7 +616,6 @@ XLogArchiveForceDone(const char *xlog)
 					(errcode_for_file_access(),
 					 errmsg("could not rename file \"%s\" to \"%s\": %m",
 							archiveReady, archiveDone)));
-
 		return;
 	}
 
@@ -604,6 +637,10 @@ XLogArchiveForceDone(const char *xlog)
 						archiveDone)));
 		return;
 	}
+
+	ereport(DEBUG2,
+			(errmsg("archive status file of \"%s\" has been forcibly switched from \"%s\" to \"%s\"",
+					xlog, archiveReady, archiveDone)));
 }
 
 /*
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 27b9899..04a0de3 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -286,6 +286,7 @@ extern void ExecuteRecoveryCommand(char *command, char *commandName,
 extern void KeepFileRestoredFromArchive(char *path, char *xlogfname);
 extern void XLogArchiveNotify(const char *xlog);
 extern void XLogArchiveNotifySeg(XLogSegNo segno);
+extern void XLogArchiveForceDoneAll(void);
 extern void XLogArchiveForceDone(const char *xlog);
 extern bool XLogArchiveCheckDone(const char *xlog);
 extern bool XLogArchiveIsBusy(const char *xlog);
-- 
2.1.2

#19Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Jehan-Guillaume de Rorthais (#1)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On 10/22/2014 04:24 PM, Michael Paquier wrote:

On Wed, Oct 22, 2014 at 1:53 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

I think we should take a more wholesale approach to this. We should
enforce the rule that the server only ever archives WAL files belonging to
the same timeline that the server generates. IOW, the server only archives
the WAL that it has generated.

Hm?! Would that really be back-patchable? There may be tools or users
in the wild that rely on the fact that a node archives segment files
from all timelines.

Hmm, so it would be a tool or user that manually copies a file to the
pg_xlog directory of a standby server, and expects the standby to
archive the file after promotion. That seems a bit far-fetched, although
I've seen people do strange things. I think it would be acceptable as
long as we document the change in behavior in the release notes. I don't
have much hope that we'll ever be able to nail down the correct behavior
with the current approach.

- Heikki


#20Fujii Masao
masao.fujii@gmail.com
In reply to: Michael Paquier (#15)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Mon, Oct 20, 2014 at 3:26 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Oct 17, 2014 at 10:37 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Oct 17, 2014 at 9:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

In this case, the patch seems to make the restartpoint recycle even WAL files
which have .ready files and will have to be archived later. Thought?

The real problem currently is that it is possible to have a segment
file not marked as .done during recovery when the stream connection is
abruptly cut while this segment is switched, marking it as .ready in
archive_status and simply leaving this segment in pg_xlog because it
will neither be recycled nor removed. I have not been able to look much
at this code these days, so I am not sure how invasive it would be in
back-branches, but perhaps we should try to improve the code so that
when a segment file is switched and the connection is cut, we guarantee
that this file is completed and marked as .done.

I have spent more time on that, with a bit more of underground work...
First, the problem can be reproduced most of the time by running this
simple command:
psql -c 'select pg_switch_xlog()'; pg_ctl restart -m immediate

What about fixing this problem directly? That is, we can make walreceiver
check whether the end of the last received WAL data is the end of the
current WAL file, and then close the WAL file and create the .done file if
the test is true.

This is not a perfect solution. If the standby crashes during that very
short interval (i.e., after closing the WAL file and before creating the
.done file), the problem would happen again. But that can only very rarely
happen, so I don't think it's worth fixing the corner case, at least in
back-branches. Of course, we can look for the "perfect" solution for
master, though.
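
For what it's worth, a rough sketch of that check at the end of
XLogWalRcvWrite() could look like this, reusing the existing
close-and-mark logic of the walreceiver (a sketch only, not a polished
patch):

    /* After the write loop: did we just fill the current segment? */
    if (recvFile >= 0 && recvOff == XLOG_SEG_SIZE)
    {
        char        xlogfname[MAXFNAMELEN];

        /* Segment is complete: flush and close it now... */
        XLogWalRcvFlush(false);
        if (close(recvFile) != 0)
            ereport(PANIC,
                    (errcode_for_file_access(),
                     errmsg("could not close log segment %s: %m",
                            XLogFileNameP(recvFileTLI, recvSegNo))));
        recvFile = -1;

        /* ...and mark it .done so it is never queued for archiving here */
        XLogFileName(xlogfname, recvFileTLI, recvSegNo);
        XLogArchiveForceDone(xlogfname);
    }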

Regards,

--
Fujii Masao


#21Fujii Masao
masao.fujii@gmail.com
In reply to: Michael Paquier (#16)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Mon, Oct 20, 2014 at 5:15 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Oct 17, 2014 at 10:12 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Fri, Oct 17, 2014 at 9:23 PM, Fujii Masao <masao.fujii@gmail.com>
wrote:

On Thu, Oct 9, 2014 at 3:26 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Oct 8, 2014 at 10:00 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

The additional process at promotion sounds like a good idea, I'll try to
get a patch done tomorrow. This would result as well in removing the
XLogArchiveForceDone stuff. Either way, now that I have been able to
reproduce the problem manually, things can be clearly solved.

Please find attached two patches aimed to fix this issue and to improve
the situation:
- 0001 prevents the appearance of those phantom WAL segment files by
ensuring that when a node is in recovery it will remove them whatever
their status in archive_status. This patch is the real fix, and should
be applied down to 9.2.
- 0002 is a patch implementing Heikki's idea of enforcing all the
segment files present in pg_xlog to have their status set to .done,
marking them for removal. When looking at the code, I finally concluded
that Fujii-san's point, about marking in all cases as .done segment
files that have been fully streamed, actually makes more sense to not
be backward. This patch would actually not be mandatory for
back-patching, but it makes the process more robust IMO.

Thanks for the patches!

While reviewing the patch, I found another bug related to WAL files in
recovery mode. The problem is that exitArchiveRecovery() always creates a
.ready file for the last WAL file of the old timeline even when it's
restored from the archive and has a .done file. So this causes the
already-archived WAL file to be archived again.... Attached patch fixes
this bug.

That's a good catch! Patch looks good. I think that you should change
xlogpath to use MAXFNAMELEN instead of MAXPGPATH in exitArchiveRecovery.
This is only for correctness, so that's a master-only remark, because this
variable is used only to calculate a segment file name, and not a path.
Renaming the variable from xlogpath to xlogname would make sense as well.

Thanks for the review! Applied.

Regards,

--
Fujii Masao


#22Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Fujii Masao (#20)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On 10/23/2014 08:59 AM, Fujii Masao wrote:

On Mon, Oct 20, 2014 at 3:26 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Oct 17, 2014 at 10:37 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Oct 17, 2014 at 9:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

In this case, the patch seems to make the restartpoint recycle even WAL files
which have .ready files and will have to be archived later. Thought?

The real problem currently is that it is possible to have a segment
file not marked as .done during recovery when the stream connection is
abruptly cut while this segment is switched, marking it as .ready in
archive_status and simply leaving this segment in pg_xlog because it
will neither be recycled nor removed. I have not been able to look much
at this code these days, so I am not sure how invasive it would be in
back-branches, but perhaps we should try to improve the code so that
when a segment file is switched and the connection is cut, we guarantee
that this file is completed and marked as .done.

I have spent more time on that, with a bit more of underground work...
First, the problem can be reproduced most of the time by running this
simple command:
psql -c 'select pg_switch_xlog()'; pg_ctl restart -m immediate

What about fixing this problem directly? That is, we can make walreceiver
check whether the end of the last received WAL data is the end of the
current WAL file, and then close the WAL file and create the .done file if
the test is true.

This is not a perfect solution. If the standby crashes during that very
short interval (i.e., after closing the WAL file and before creating the
.done file), the problem would happen again. But that can only very rarely
happen, so I don't think it's worth fixing the corner case, at least in
back-branches. Of course, we can look for the "perfect" solution for
master, though.

Sounds reasonable, for back-branches. Although I'm still worried we
might miss some corner-case unless we go with a more wholesale solution.

At least for master, we should consider changing the way the archiving
works so that we only archive WAL that was generated in the same server.
I.e. we should never try to archive WAL files belonging to another timeline.

I just remembered that we discussed a different problem related to this
some time ago, at
/messages/by-id/20131212.110002.204892575.horiguchi.kyotaro@lab.ntt.co.jp.
The conclusion of that was that at promotion, we should not archive the
last, partial, segment from the old timeline. Also, we could make the
archive recovery more precise in what WAL files it tries to restore. For
example, if we're doing recovery with target timeline 4, but the
timeline history says that segment AB came from timeline 3 (i.e. the
switch to timeline 4 happened within a later segment), we should not try
to restore segment AB from timeline 4. It shouldn't exist, so trying to
restore it is a waste of cycles, but more importantly, if there's some
sort of confusion with the timelines, and the file exists anyway, we
should not restore it. Also, if the timeline history says e.g. that we
switched from TLI 3 to 4 at WAL position 0/AD123456, and we restore
segment AD from timeline 3, we should stop recovering that segment at
position 0/AD123456 and try to find the AD segment from timeline 4. We
currently only do that when streaming, but we should also do that when
restoring from archive.
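
To make the idea concrete, here is a sketch of that decision; the struct
and helper below are hypothetical stand-ins, not the timeline.c API. The
point is that archive recovery should only ask for a segment from a
timeline whose WAL range actually overlaps that segment, and stop
applying a restored segment at the recorded switch point.

    /* Hypothetical flattened history entry, for illustration only */
    typedef struct TliHistoryEntry
    {
        TimeLineID  tli;    /* timeline ID */
        XLogRecPtr  begin;  /* first WAL position generated on this TLI */
        XLogRecPtr  end;    /* switch point to the next TLI, or
                             * InvalidXLogRecPtr for the last entry */
    } TliHistoryEntry;

    /*
     * Sketch: should archive recovery bother asking restore_command for
     * segment 'segno' with timeline 'tli'?  Only if the segment's byte
     * range overlaps the WAL that was generated on that timeline.
     */
    static bool
    SegmentExistsOnTimeline(XLogSegNo segno, TimeLineID tli,
                            TliHistoryEntry *history, int nentries)
    {
        XLogRecPtr  segstart = segno * XLOG_SEG_SIZE;
        XLogRecPtr  segend = segstart + XLOG_SEG_SIZE;
        int         i;

        for (i = 0; i < nentries; i++)
        {
            if (history[i].tli != tli)
                continue;

            /*
             * Overlap, not containment: the first segment of a timeline
             * may begin mid-segment, and replay of a restored segment
             * would stop at history[i].end anyway.
             */
            return segend > history[i].begin &&
                (history[i].end == InvalidXLogRecPtr ||
                 segstart < history[i].end);
        }

        return false;           /* timeline not in the history at all */
    }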

In summary, let's do something small for back-branches, like what you
suggested. But for master, let's do bigger changes to the timeline
handling. Would you like to write a patch for the back-branches, if I
start working on master?

- Heikki


#23Michael Paquier
michael.paquier@gmail.com
In reply to: Heikki Linnakangas (#22)
1 attachment(s)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Thu, Oct 23, 2014 at 10:09 AM, Heikki Linnakangas <
hlinnakangas@vmware.com> wrote:

On 10/23/2014 08:59 AM, Fujii Masao wrote:
Sounds reasonable, for back-branches. Although I'm still worried we might
miss some corner-case unless we go with a more wholesale solution.

Don't really want to be the intruder here, but isn't that the simple patch
attached? There is still a small window between XLogWalRcvFlush and
XLogArchiveForceDone in XLogWalRcvWrite if the standby crashes exactly
between them.
--
Michael

Attachments:

20141023_switch_walend.patch (text/x-diff)
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index c2d4ed3..b367cb7 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -136,7 +136,8 @@ static void WalRcvFetchTimeLineHistoryFiles(TimeLineID first, TimeLineID last);
 static void WalRcvWaitForStartPosition(XLogRecPtr *startpoint, TimeLineID *startpointTLI);
 static void WalRcvDie(int code, Datum arg);
 static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
-static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
+static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr,
+							XLogRecPtr walEnd);
 static void XLogWalRcvFlush(bool dying);
 static void XLogWalRcvSendReply(bool force, bool requestReply);
 static void XLogWalRcvSendHSFeedback(bool immed);
@@ -831,7 +832,7 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 
 				buf += hdrlen;
 				len -= hdrlen;
-				XLogWalRcvWrite(buf, len, dataStart);
+				XLogWalRcvWrite(buf, len, dataStart, walEnd);
 				break;
 			}
 		case 'k':				/* Keepalive */
@@ -869,7 +870,7 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
  * Write XLOG data to disk.
  */
 static void
-XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
+XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr, XLogRecPtr walEnd)
 {
 	int			startoff;
 	int			byteswritten;
@@ -878,7 +879,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	{
 		int			segbytes;
 
-		if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo))
+		if (recvFile < 0 || !XLByteInSeg(walEnd, recvSegNo))
 		{
 			bool		use_existent;
 
#24Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Michael Paquier (#23)
1 attachment(s)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On 10/23/2014 01:25 PM, Michael Paquier wrote:

On Thu, Oct 23, 2014 at 10:09 AM, Heikki Linnakangas <
hlinnakangas@vmware.com> wrote:

On 10/23/2014 08:59 AM, Fujii Masao wrote:
Sounds reasonable, for back-branches. Although I'm still worried we might
miss some corner-case unless we go with a more wholesale solution.

Don't really want to be the intruder here, but isn't that the simple patch
attached?

That's not right. Should check *after* the write if the segment was
completed, and close it if so. Like the attached.

There is still a small window between XLogWalRcvFlush and
XLogArchiveForceDone in XLogWalRcvWrite if the standby crashes exactly
between them.

Yeah. I think we can live with that.

- Heikki

Attachments:

switch-walend-2.patch (text/x-diff)
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index c2d4ed3..ca24acc 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -138,6 +138,7 @@ static void WalRcvDie(int code, Datum arg);
 static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
 static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
 static void XLogWalRcvFlush(bool dying);
+static void XLogWalRcvFileClose(void);
 static void XLogWalRcvSendReply(bool force, bool requestReply);
 static void XLogWalRcvSendHSFeedback(bool immed);
 static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
@@ -887,30 +888,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 			 * would otherwise have to reopen this file to fsync it later
 			 */
 			if (recvFile >= 0)
-			{
-				char		xlogfname[MAXFNAMELEN];
-
-				XLogWalRcvFlush(false);
-
-				/*
-				 * XLOG segment files will be re-read by recovery in startup
-				 * process soon, so we don't advise the OS to release cache
-				 * pages associated with the file like XLogFileClose() does.
-				 */
-				if (close(recvFile) != 0)
-					ereport(PANIC,
-							(errcode_for_file_access(),
-							 errmsg("could not close log segment %s: %m",
-									XLogFileNameP(recvFileTLI, recvSegNo))));
-
-				/*
-				 * Create .done file forcibly to prevent the streamed segment
-				 * from being archived later.
-				 */
-				XLogFileName(xlogfname, recvFileTLI, recvSegNo);
-				XLogArchiveForceDone(xlogfname);
-			}
-			recvFile = -1;
+				XLogWalRcvFileClose();
 
 			/* Create/use new log file */
 			XLByteToSeg(recptr, recvSegNo);
@@ -966,6 +944,10 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 
 		LogstreamResult.Write = recptr;
 	}
+
+	/* If we just finished writing to the current file, fsync and close it */
+	if (recvFile >= 0 && recvOff == XLOG_SEG_SIZE)
+		XLogWalRcvFileClose();
 }
 
 /*
@@ -1022,6 +1004,38 @@ XLogWalRcvFlush(bool dying)
 }
 
 /*
+ * Close the currently open WAL file.
+ */
+static void
+XLogWalRcvFileClose(void)
+{
+	char		xlogfname[MAXFNAMELEN];
+
+	Assert(recvFile >= 0);
+
+	XLogWalRcvFlush(false);
+
+	/*
+	 * XLOG segment files will be re-read by recovery in startup process soon,
+	 * so we don't advise the OS to release cache pages associated with the
+	 * file like XLogFileClose() does.
+	 */
+	if (close(recvFile) != 0)
+		ereport(PANIC,
+				(errcode_for_file_access(),
+				 errmsg("could not close log segment %s: %m",
+						XLogFileNameP(recvFileTLI, recvSegNo))));
+	recvFile = -1;
+
+	/*
+	 * Create .done file forcibly to prevent the streamed segment from being
+	 * archived later.
+	 */
+	XLogFileName(xlogfname, recvFileTLI, recvSegNo);
+	XLogArchiveForceDone(xlogfname);
+}
+
+/*
  * Send reply message to primary, indicating our current XLOG positions, oldest
  * xmin and the current time.
  *
#25Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#22)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Thu, Oct 23, 2014 at 5:09 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 10/23/2014 08:59 AM, Fujii Masao wrote:

On Mon, Oct 20, 2014 at 3:26 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Oct 17, 2014 at 10:37 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Oct 17, 2014 at 9:23 PM, Fujii Masao <masao.fujii@gmail.com>
wrote:

In this case, the patch seems to make the restartpoint recycle even WAL
files which have .ready files and will have to be archived later. Thought?

The real problem currently is that it is possible to have a segment file
not marked as .done during recovery when the stream connection is abruptly
cut while this segment is switched, marking it as .ready in archive_status
and simply leaving this segment in pg_xlog because it will neither be
recycled nor removed. I have not been able to look much at this code these
days, so I am not sure how invasive it would be in back-branches, but
perhaps we should try to improve the code so that when a segment file is
switched and the connection is cut, we guarantee that this file is
completed and marked as .done.

I have spent more time on that, with a bit more of underground work...
First, the problem can be reproduced most of the time by running this
simple command:
psql -c 'select pg_switch_xlog()'; pg_ctl restart -m immediate

What about fixing this problem directly? That is, we can make walreceiver
check whether the end of the last received WAL data is the end of the
current WAL file, and then close the WAL file and create the .done file if
the test is true.

This is not a perfect solution. If the standby crashes during that very
short interval (i.e., after closing the WAL file and before creating the
.done file), the problem would happen again. But that can only very rarely
happen, so I don't think it's worth fixing the corner case, at least in
back-branches. Of course, we can look for the "perfect" solution for
master, though.

Sounds reasonable, for back-branches. Although I'm still worried we might
miss some corner-case unless we go with a more wholesale solution.

+1

At least for master, we should consider changing the way the archiving works
so that we only archive WAL that was generated in the same server. I.e. we
should never try to archive WAL files belonging to another timeline.

I just remembered that we discussed a different problem related to this some
time ago, at
/messages/by-id/20131212.110002.204892575.horiguchi.kyotaro@lab.ntt.co.jp.
The conclusion of that was that at promotion, we should not archive the
last, partial, segment from the old timeline.

So, the last, partial, segment of the old timeline is never archived?
If yes, I'm afraid that a PITR to the old timeline cannot replay that
last segment. No? Or are you thinking of changing the code so that
the segment of the new timeline is replayed in that case?

In summary, let's do something small for back-branches, like what you
suggested. But for master, let's do bigger changes to the timeline handling.

Yep.

Regards,

--
Fujii Masao


#26Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#24)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Thu, Oct 23, 2014 at 8:45 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 10/23/2014 01:25 PM, Michael Paquier wrote:

On Thu, Oct 23, 2014 at 10:09 AM, Heikki Linnakangas <
hlinnakangas@vmware.com> wrote:

On 10/23/2014 08:59 AM, Fujii Masao wrote:
Sounds reasonable, for back-branches. Although I'm still worried we might
miss some corner-case unless we go with a more wholesale solution.

Don't really want to be the intruder here, but isn't that the simple patch
attached?

That's not right. Should check *after* the write if the segment was
completed, and close it if so. Like the attached.

Looks good to me. WalReceiverMain has almost the same code as
what XLogWalRcvFileClose does. So we can refactor that.

There is still a small window between XLogWalRcvFlush and
XLogArchiveForceDone in XLogWalRcvWrite if the standby crashes exactly
between them.

Yeah. I think we can live with that.

Yes.

Regards,

--
Fujii Masao


#27Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#26)
1 attachment(s)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Thu, Oct 23, 2014 at 9:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Thu, Oct 23, 2014 at 8:45 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 10/23/2014 01:25 PM, Michael Paquier wrote:

On Thu, Oct 23, 2014 at 10:09 AM, Heikki Linnakangas <
hlinnakangas@vmware.com> wrote:

On 10/23/2014 08:59 AM, Fujii Masao wrote:
Sounds reasonable, for back-branches. Although I'm still worried we might
miss some corner-case unless we go with a more wholesale solution.

Don't really want to be the intruder here, but isn't that the simple patch
attached?

That's not right. Should check *after* the write if the segment was
completed, and close it if so. Like the attached.

Looks good to me. WalReceiverMain has almost the same code as
what XLogWalRcvFileClose does. So we can refactor that.

While looking at the code of WAL archiving and recovery, I found
another small issue. The standby always creates a .ready file for
the timeline history file even when WAL archiving is not enabled.
Since WAL archiving is off, that .ready file will remain forever.
Probably this is almost harmless but confusing, so I'd like to fix that.
Patch attached. Thought?

Regards,

--
Fujii Masao

Attachments:

timeline-archive.patch (text/x-patch)
*** a/src/backend/access/transam/timeline.c
--- b/src/backend/access/transam/timeline.c
***************
*** 36,41 ****
--- 36,42 ----
  #include <unistd.h>
  
  #include "access/timeline.h"
+ #include "access/xlog.h"
  #include "access/xlog_internal.h"
  #include "access/xlogdefs.h"
  #include "storage/fd.h"
***************
*** 437,444 **** writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
  #endif
  
  	/* The history file can be archived immediately. */
! 	TLHistoryFileName(histfname, newTLI);
! 	XLogArchiveNotify(histfname);
  }
  
  /*
--- 438,448 ----
  #endif
  
  	/* The history file can be archived immediately. */
! 	if (XLogArchivingActive())
! 	{
! 		TLHistoryFileName(histfname, newTLI);
! 		XLogArchiveNotify(histfname);
! 	}
  }
  
  /*
#28Michael Paquier
michael.paquier@gmail.com
In reply to: Fujii Masao (#27)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Thu, Oct 23, 2014 at 2:34 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Thu, Oct 23, 2014 at 9:23 PM, Fujii Masao <masao.fujii@gmail.com>
wrote:

On Thu, Oct 23, 2014 at 8:45 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 10/23/2014 01:25 PM, Michael Paquier wrote:

On Thu, Oct 23, 2014 at 10:09 AM, Heikki Linnakangas <
hlinnakangas@vmware.com> wrote:

On 10/23/2014 08:59 AM, Fujii Masao wrote:
Sounds reasonable, for back-branches. Although I'm still worried we might
miss some corner-case unless we go with a more wholesale solution.

Don't really want to be the intruder here, but isn't that the simple patch
attached?

That's not right. Should check *after* the write if the segment was
completed, and close it if so. Like the attached.

Looks good to me. WalReceiverMain has almost the same code as
what XLogWalRcvFileClose does. So we can refactor that.

While looking at the code of WAL archiving and recovery, I found
another small issue. The standby always creates .ready file for
the timeline history file even when WAL archiving is not enabled.
Since WAL archiving is off, that .ready file will remain infinitely.
Probably this is almost harmless but confusing, so I'd like to fix that.
Patch attached. Thought?

Good catch once again. We could as well put the check of
XLogArchivingActive directly in XLogArchiveNotify...
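
For reference, a sketch of that variant; the body below is modelled on
the status-file code quoted earlier in this thread rather than copied
from xlogarchive.c, and only the early return is the new part:

    void
    XLogArchiveNotify(const char *xlog)
    {
        char        archiveStatusPath[MAXPGPATH];
        FILE       *fd;

        /* If archiving is disabled, never leave a stray .ready file behind */
        if (!XLogArchivingActive())
            return;

        /* insert an otherwise empty file called <XLOG>.ready */
        StatusFilePath(archiveStatusPath, xlog, ".ready");
        fd = AllocateFile(archiveStatusPath, "w");
        if (fd == NULL)
        {
            ereport(LOG,
                    (errcode_for_file_access(),
                     errmsg("could not create archive status file \"%s\": %m",
                            archiveStatusPath)));
            return;
        }
        if (FreeFile(fd))
            ereport(LOG,
                    (errcode_for_file_access(),
                     errmsg("could not write archive status file \"%s\": %m",
                            archiveStatusPath)));
    }
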
--
Michael

#29Michael Paquier
michael.paquier@gmail.com
In reply to: Heikki Linnakangas (#24)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Thu, Oct 23, 2014 at 1:45 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 10/23/2014 01:25 PM, Michael Paquier wrote:

On Thu, Oct 23, 2014 at 10:09 AM, Heikki Linnakangas <
hlinnakangas@vmware.com> wrote:

On 10/23/2014 08:59 AM, Fujii Masao wrote:

Sounds reasonable, for back-branches. Although I'm still worried we might
miss some corner-case unless we go with a more wholesale solution.

Don't really want to be the intruder here, but isn't that the simple patch
attached?

That's not right. Should check *after* the write if the segment was
completed, and close it if so. Like the attached.

There is still a small window between XLogWalRcvFlush and

XLogArchiveForceDone in XLogWalRcvWrite if the standby crashes exactly
between them.

Yeah. I think we can live with that.

Just tested this patch with the combo pg_switch_xlog() + stop/immediate,
and the appearance of the .ready files is not fixed.
Regards,
--
Michael

#30Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#29)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Thu, Oct 23, 2014 at 10:16 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Thu, Oct 23, 2014 at 1:45 PM, Heikki Linnakangas <
hlinnakangas@vmware.com> wrote:
That's not right. Should check *after* the write if the segment was
completed, and close it if so. Like the attached.

Just tested this patch with the combo pg_switch_xlog() + stop/immediate
and the apparition of the .ready files is not fixed.

Btw, I think that we should pass walEnd to XLogWalRcvWrite and add an
additional check based on that after the write() loop calls to enforce the
segment to be switched to .done.
--
Michael

#31Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Heikki Linnakangas (#22)
1 attachment(s)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On 10/23/2014 11:09 AM, Heikki Linnakangas wrote:

At least for master, we should consider changing the way the archiving
works so that we only archive WAL that was generated in the same server.
I.e. we should never try to archive WAL files belonging to another timeline.

I just remembered that we discussed a different problem related to this
some time ago, at
/messages/by-id/20131212.110002.204892575.horiguchi.kyotaro@lab.ntt.co.jp.
The conclusion of that was that at promotion, we should not archive the
last, partial, segment from the old timeline.

So, this is what I came up with for master. Does anyone see a problem
with it?

- Heikki

Attachments:

0001-Don-t-try-to-archive-WAL-from-different-timelines.patch (text/x-diff)
From 9115276335d37d7d2dab55ff8b0642041c16bfc3 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Fri, 24 Oct 2014 15:39:02 +0300
Subject: [PATCH] Don't try to archive WAL from different timelines.

Previously, we would always create a .done file for files restored from the
archive, or streamed from a master server. That served two purposes: it
allowed the files to be recycled at restartpoints, and ensured that we don't
try to re-archive the files after promotion. However, there were some gaps
in that approach: if streaming replication was stopped at a segment boundary,
no .done file was created. The standby server might also crash just before
it has created the .done file.

This is a more wholesale solution to the problem. WAL files belonging to
a different timeline are never archived, regardless of .done or .ready
files. The rule is that the server that generates a WAL file is solely
responsible for archiving it.

This also changes the long-standing behavior at end of recovery, where we
archived the last, partial, WAL segment from the old timeline. We no longer
do that; if you copy any files manually to pg_xlog, and want them to be
archived, you should copy them manually to the archive, too. (The first
segment on the new timeline, which is copied from the last partial one, will
be archived as usual.)

This is the same idea as in commit c9cc7e05c6d82a9781883a016c70d95aa4923122,
which was reverted in favor of the .done file approach in commit
1bd42cd70abdbc946ad64c3c8eaefed4bb8b1145. Turns out the .done file approach
was not such a good idea after all. The behavior at end-of-recovery was
discussed on pgsql-hackers back in February 2014
(20131212.110002.204892575.horiguchi.kyotaro@lab.ntt.co.jp), but the plan
was never followed through (my bad, I forgot all about).

This should fix the issue reported by Jehan-Guillaume de Rorthais, where
a standby server sometimes creates .ready files for WAL files streamed from
the master. But this is too risky to backpatch - the changes in behavior
are quite signicant - so we'll have to fix it with some other approach in
back-branches.

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e77af22..2370467 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -208,7 +208,8 @@ static int	LocalXLogInsertAllowed = -1;
  * When ArchiveRecoveryRequested is set, archive recovery was requested,
  * ie. recovery.conf file was present. When InArchiveRecovery is set, we are
  * currently recovering using offline XLOG archives. These variables are only
- * valid in the startup process.
+ * valid in the startup process, but there is a copy of InArchiveRecovery
+ * in shared memory.
  *
  * When ArchiveRecoveryRequested is true, but InArchiveRecovery is false, we're
  * currently performing crash recovery using only XLOG files in pg_xlog, but
@@ -546,9 +547,11 @@ typedef struct XLogCtlData
 
 	/*
 	 * SharedRecoveryInProgress indicates if we're still in crash or archive
-	 * recovery.  Protected by info_lck.
+	 * recovery.  SharedInArchiveRecovery indicates if we're currently in
+	 * archive recovery.  Protected by info_lck.
 	 */
 	bool		SharedRecoveryInProgress;
+	bool		SharedInArchiveRecovery;
 
 	/*
 	 * SharedHotStandbyActive indicates if we're still in crash or archive
@@ -748,6 +751,7 @@ static MemoryContext walDebugCxt = NULL;
 #endif
 
 static void readRecoveryCommandFile(void);
+static void enterArchiveRecovery(void);
 static void exitArchiveRecovery(TimeLineID endTLI, XLogSegNo endLogSegNo);
 static bool recoveryStopsBefore(XLogRecord *record);
 static bool recoveryStopsAfter(XLogRecord *record);
@@ -4168,9 +4172,7 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 			{
 				ereport(DEBUG1,
 						(errmsg_internal("reached end of WAL in pg_xlog, entering archive recovery")));
-				InArchiveRecovery = true;
-				if (StandbyModeRequested)
-					StandbyMode = true;
+				enterArchiveRecovery();
 
 				/* initialize minRecoveryPoint to this record */
 				LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
@@ -4866,6 +4868,7 @@ XLOGShmemInit(void)
 	 */
 	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
 	XLogCtl->SharedRecoveryInProgress = true;
+	XLogCtl->SharedInArchiveRecovery = false;
 	XLogCtl->SharedHotStandbyActive = false;
 	XLogCtl->WalWriterSleeping = false;
 
@@ -5316,6 +5319,18 @@ readRecoveryCommandFile(void)
 	FreeConfigVariables(head);
 }
 
+static void
+enterArchiveRecovery(void)
+{
+	InArchiveRecovery = true;
+	SpinLockAcquire(&XLogCtl->info_lck);
+	XLogCtl->SharedInArchiveRecovery = true;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	if (StandbyModeRequested)
+		StandbyMode = true;
+}
+
 /*
  * Exit archive-recovery state
  */
@@ -5326,11 +5341,6 @@ exitArchiveRecovery(TimeLineID endTLI, XLogSegNo endLogSegNo)
 	char		xlogpath[MAXPGPATH];
 
 	/*
-	 * We are no longer in archive recovery state.
-	 */
-	InArchiveRecovery = false;
-
-	/*
 	 * Update min recovery point one last time.
 	 */
 	UpdateMinRecoveryPoint(InvalidXLogRecPtr, true);
@@ -5349,22 +5359,10 @@ exitArchiveRecovery(TimeLineID endTLI, XLogSegNo endLogSegNo)
 	 * If we are establishing a new timeline, we have to copy data from the
 	 * last WAL segment of the old timeline to create a starting WAL segment
 	 * for the new timeline.
-	 *
-	 * Notify the archiver that the last WAL segment of the old timeline is
-	 * ready to copy to archival storage. Otherwise, it is not archived for a
-	 * while.
 	 */
 	if (endTLI != ThisTimeLineID)
-	{
 		XLogFileCopy(endLogSegNo, endTLI, endLogSegNo);
 
-		if (XLogArchivingActive())
-		{
-			XLogFileName(xlogpath, endTLI, endLogSegNo);
-			XLogArchiveNotify(xlogpath);
-		}
-	}
-
 	/*
 	 * Let's just make real sure there are not .ready or .done flags posted
 	 * for the new segment.
@@ -5394,6 +5392,12 @@ exitArchiveRecovery(TimeLineID endTLI, XLogSegNo endLogSegNo)
 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
 						RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE)));
 
+	/* Ok, we're out of archive recovery state. */
+	InArchiveRecovery = false;
+	SpinLockAcquire(&XLogCtl->info_lck);
+	XLogCtl->SharedInArchiveRecovery = false;
+	SpinLockRelease(&XLogCtl->info_lck);
+
 	ereport(LOG,
 			(errmsg("archive recovery complete")));
 }
@@ -6108,9 +6112,7 @@ StartupXLOG(void)
 		 * file, we know how far we need to replay to reach consistency. Enter
 		 * archive recovery directly.
 		 */
-		InArchiveRecovery = true;
-		if (StandbyModeRequested)
-			StandbyMode = true;
+		enterArchiveRecovery();
 
 		/*
 		 * When a backup_label file is present, we want to roll forward from
@@ -6174,9 +6176,7 @@ StartupXLOG(void)
 			 ControlFile->backupEndPoint != InvalidXLogRecPtr ||
 			 ControlFile->state == DB_SHUTDOWNED))
 		{
-			InArchiveRecovery = true;
-			if (StandbyModeRequested)
-				StandbyMode = true;
+			enterArchiveRecovery();
 		}
 
 		/*
@@ -7382,6 +7382,23 @@ RecoveryInProgress(void)
 }
 
 /*
+ * Are we currently in archive recovery? In the startup process, you can just
+ * check InArchiveRecovery variable instead.
+ */
+bool
+ArchiveRecoveryInProgress()
+{
+	bool		result;
+
+	/* spinlock is essential on machines with weak memory ordering! */
+	SpinLockAcquire(&XLogCtl->info_lck);
+	result = XLogCtl->SharedInArchiveRecovery;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	return result;
+}
+
+/*
  * Is HotStandby active yet? This is only important in special backends
  * since normal backends won't ever be able to connect until this returns
  * true. Postmaster knows this by way of signal, not via shared memory.
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 047efa2..7f200f9 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -477,12 +477,6 @@ KeepFileRestoredFromArchive(char *path, char *xlogfname)
 						path, xlogfpath)));
 
 	/*
-	 * Create .done file forcibly to prevent the restored segment from being
-	 * archived again later.
-	 */
-	XLogArchiveForceDone(xlogfname);
-
-	/*
 	 * If the existing file was replaced, since walsenders might have it open,
 	 * request them to reload a currently-open segment. This is only required
 	 * for WAL segments, walsenders don't hold other files open, but there's
@@ -554,59 +548,6 @@ XLogArchiveNotifySeg(XLogSegNo segno)
 }
 
 /*
- * XLogArchiveForceDone
- *
- * Emit notification forcibly that an XLOG segment file has been successfully
- * archived, by creating <XLOG>.done regardless of whether <XLOG>.ready
- * exists or not.
- */
-void
-XLogArchiveForceDone(const char *xlog)
-{
-	char		archiveReady[MAXPGPATH];
-	char		archiveDone[MAXPGPATH];
-	struct stat stat_buf;
-	FILE	   *fd;
-
-	/* Exit if already known done */
-	StatusFilePath(archiveDone, xlog, ".done");
-	if (stat(archiveDone, &stat_buf) == 0)
-		return;
-
-	/* If .ready exists, rename it to .done */
-	StatusFilePath(archiveReady, xlog, ".ready");
-	if (stat(archiveReady, &stat_buf) == 0)
-	{
-		if (rename(archiveReady, archiveDone) < 0)
-			ereport(WARNING,
-					(errcode_for_file_access(),
-					 errmsg("could not rename file \"%s\" to \"%s\": %m",
-							archiveReady, archiveDone)));
-
-		return;
-	}
-
-	/* insert an otherwise empty file called <XLOG>.done */
-	fd = AllocateFile(archiveDone, "w");
-	if (fd == NULL)
-	{
-		ereport(LOG,
-				(errcode_for_file_access(),
-				 errmsg("could not create archive status file \"%s\": %m",
-						archiveDone)));
-		return;
-	}
-	if (FreeFile(fd))
-	{
-		ereport(LOG,
-				(errcode_for_file_access(),
-				 errmsg("could not write archive status file \"%s\": %m",
-						archiveDone)));
-		return;
-	}
-}
-
-/*
  * XLogArchiveCheckDone
  *
  * This is called when we are ready to delete or recycle an old XLOG segment
@@ -630,6 +571,30 @@ XLogArchiveCheckDone(const char *xlog)
 	if (!XLogArchivingActive())
 		return true;
 
+	/*
+	 * While we are still in archive recovery, we have not generated any WAL
+	 * ourselves yet. We're not responsible for archiving the segments we're
+	 * replaying; the server that generated them originally is.
+	 *
+	 * During normal operation, too, we're not responsible for archiving
+	 * segments from timelines other than the one we're generating. Such
+	 * segments might be left over in pg_xlog, after a point-in-time
+	 * recovery, or promotion of a standby server.
+	 */
+	if (ArchiveRecoveryInProgress())
+		return true;
+
+	/* ThisTimeLineID should be set, but err on the side of caution */
+	if (ThisTimeLineID != 0)
+	{
+		TimeLineID	fileTLI = 0;
+		uint64		logSegNo;
+
+		XLogFromFileName(xlog, &fileTLI, &logSegNo);
+		if (fileTLI != 0 && fileTLI != ThisTimeLineID)
+			return true;
+	}
+
 	/* First check for .done --- this means archiver is done with it */
 	StatusFilePath(archiveStatusPath, xlog, ".done");
 	if (stat(archiveStatusPath, &stat_buf) == 0)
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index c2d4ed3..284de9c 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -138,6 +138,7 @@ static void WalRcvDie(int code, Datum arg);
 static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
 static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
 static void XLogWalRcvFlush(bool dying);
+static void XLogWalRcvFileClose(void);
 static void XLogWalRcvSendReply(bool force, bool requestReply);
 static void XLogWalRcvSendHSFeedback(bool immed);
 static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
@@ -536,24 +537,7 @@ WalReceiverMain(void)
 		 * segment, and await for new orders from the startup process.
 		 */
 		if (recvFile >= 0)
-		{
-			char		xlogfname[MAXFNAMELEN];
-
-			XLogWalRcvFlush(false);
-			if (close(recvFile) != 0)
-				ereport(PANIC,
-						(errcode_for_file_access(),
-						 errmsg("could not close log segment %s: %m",
-								XLogFileNameP(recvFileTLI, recvSegNo))));
-
-			/*
-			 * Create .done file forcibly to prevent the streamed segment from
-			 * being archived later.
-			 */
-			XLogFileName(xlogfname, recvFileTLI, recvSegNo);
-			XLogArchiveForceDone(xlogfname);
-		}
-		recvFile = -1;
+			XLogWalRcvFileClose();
 
 		elog(DEBUG1, "walreceiver ended streaming and awaits new instructions");
 		WalRcvWaitForStartPosition(&startpoint, &startpointTLI);
@@ -887,30 +871,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 			 * would otherwise have to reopen this file to fsync it later
 			 */
 			if (recvFile >= 0)
-			{
-				char		xlogfname[MAXFNAMELEN];
-
-				XLogWalRcvFlush(false);
-
-				/*
-				 * XLOG segment files will be re-read by recovery in startup
-				 * process soon, so we don't advise the OS to release cache
-				 * pages associated with the file like XLogFileClose() does.
-				 */
-				if (close(recvFile) != 0)
-					ereport(PANIC,
-							(errcode_for_file_access(),
-							 errmsg("could not close log segment %s: %m",
-									XLogFileNameP(recvFileTLI, recvSegNo))));
-
-				/*
-				 * Create .done file forcibly to prevent the streamed segment
-				 * from being archived later.
-				 */
-				XLogFileName(xlogfname, recvFileTLI, recvSegNo);
-				XLogArchiveForceDone(xlogfname);
-			}
-			recvFile = -1;
+				XLogWalRcvFileClose();
 
 			/* Create/use new log file */
 			XLByteToSeg(recptr, recvSegNo);
@@ -1022,6 +983,29 @@ XLogWalRcvFlush(bool dying)
 }
 
 /*
+ * Close the currently open WAL file.
+ */
+static void
+XLogWalRcvFileClose(void)
+{
+	Assert(recvFile >= 0);
+
+	XLogWalRcvFlush(false);
+
+	/*
+	 * XLOG segment files will be re-read by recovery in startup process soon,
+	 * so we don't advise the OS to release cache pages associated with the
+	 * file like XLogFileClose() does.
+	 */
+	if (close(recvFile) != 0)
+		ereport(PANIC,
+				(errcode_for_file_access(),
+				 errmsg("could not close log segment %s: %m",
+						XLogFileNameP(recvFileTLI, recvSegNo))));
+	recvFile = -1;
+}
+
+/*
  * Send reply message to primary, indicating our current XLOG positions, oldest
  * xmin and the current time.
  *
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 0ae110f..e6c6737 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -310,6 +310,7 @@ extern const char *xlog_identify(uint8 info);
 extern void issue_xlog_fsync(int fd, XLogSegNo segno);
 
 extern bool RecoveryInProgress(void);
+extern bool ArchiveRecoveryInProgress(void);
 extern bool HotStandbyActive(void);
 extern bool HotStandbyActiveInReplay(void);
 extern bool XLogInsertAllowed(void);
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 27b9899..3bb891d 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -166,6 +166,10 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
 			 (uint32) ((logSegNo) / XLogSegmentsPerXLogId), \
 			 (uint32) ((logSegNo) % XLogSegmentsPerXLogId))
 
+/*
+ * Extract TLI and segment number from WAL filename. Also works with backup
+ * history files.
+ */
 #define XLogFromFileName(fname, tli, logSegNo)	\
 	do {												\
 		uint32 log;										\
@@ -286,7 +290,6 @@ extern void ExecuteRecoveryCommand(char *command, char *commandName,
 extern void KeepFileRestoredFromArchive(char *path, char *xlogfname);
 extern void XLogArchiveNotify(const char *xlog);
 extern void XLogArchiveNotifySeg(XLogSegNo segno);
-extern void XLogArchiveForceDone(const char *xlog);
 extern bool XLogArchiveCheckDone(const char *xlog);
 extern bool XLogArchiveIsBusy(const char *xlog);
 extern void XLogArchiveCleanup(const char *xlog);
-- 
2.1.1

#32Michael Paquier
michael.paquier@gmail.com
In reply to: Heikki Linnakangas (#31)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Fri, Oct 24, 2014 at 3:05 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 10/23/2014 11:09 AM, Heikki Linnakangas wrote:

At least for master, we should consider changing the way the archiving
works so that we only archive WAL that was generated in the same server.
I.e. we should never try to archive WAL files belonging to another
timeline.

I just remembered that we discussed a different problem related to this
some time ago, at

/messages/by-id/20131212.110002.204892575.horiguchi.kyotaro@lab.ntt.co.jp.
The conclusion of that was that at promotion, we should not archive the
last, partial, segment from the old timeline.

So, this is what I came up with for master. Does anyone see a problem with
it?

Thinking long-term, this is a solid approach, so +1 for it. I just
tested the patch and the extra segment files do not show up anymore.
Patch looks good as well.
--
Michael


#33Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#31)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Fri, Oct 24, 2014 at 9:05 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

So, this is what I came up with for master. Does anyone see a problem with
it?

In the proposed commit message, you mis-spelled "significant" as "signicant".

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#34Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#31)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Fri, Oct 24, 2014 at 10:05 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 10/23/2014 11:09 AM, Heikki Linnakangas wrote:

At least for master, we should consider changing the way the archiving
works so that we only archive WAL that was generated in the same server.
I.e. we should never try to archive WAL files belonging to another
timeline.

I just remembered that we discussed a different problem related to this
some time ago, at

/messages/by-id/20131212.110002.204892575.horiguchi.kyotaro@lab.ntt.co.jp.
The conclusion of that was that at promotion, we should not archive the
last, partial, segment from the old timeline.

So, this is what I came up with for master. Does anyone see a problem with
it?

What about the problem that I raised upthread? That is, the patch
prevents the last, partial, WAL file of the old timeline from being archived.
So we can never PITR the database to a point contained in that last, partial
WAL file. Isn't this a problem? For example, please imagine the
following scenario.

1. Important data was deleted but no one noticed. This deletion was
logged in the last, partial WAL file.
2. The server crashed and the DBA started an archive recovery from an old
backup.
3. After recovery, all WAL files of the old timeline were recycled.
4. Finally the DBA noticed the loss of the important data and tried to do
PITR to the point where the data was deleted.

HOWEVER, the WAL file containing that deletion operation no longer exists.
So the DBA will never be able to recover that important data ....

Regards,

--
Fujii Masao


#35Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Fujii Masao (#34)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On 10/27/2014 02:12 PM, Fujii Masao wrote:

On Fri, Oct 24, 2014 at 10:05 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 10/23/2014 11:09 AM, Heikki Linnakangas wrote:

At least for master, we should consider changing the way the archiving
works so that we only archive WAL that was generated in the same server.
I.e. we should never try to archive WAL files belonging to another
timeline.

I just remembered that we discussed a different problem related to this
some time ago, at

/messages/by-id/20131212.110002.204892575.horiguchi.kyotaro@lab.ntt.co.jp.
The conclusion of that was that at promotion, we should not archive the
last, partial, segment from the old timeline.

So, this is what I came up with for master. Does anyone see a problem with
it?

What about the problem that I raised upthread? This is, the patch
prevents the last, partial, WAL file of the old timeline from being archived.
So we can never PITR the database to the point that the last, partial WAL
file has.

A partial WAL file is never archived in the master server to begin with,
so if it's ever used in archive recovery, the administrator must have
performed some manual action to copy the partial WAL file from the
original server. When he does that, he can also copy it manually to the
archive, or whatever he wants to do with it.

Note that the same applies to any complete, but not-yet archived WAL
files. But we've never had any mechanism in place to archive those in
the new instance, after PITR.

Isn't this problem? For example, please imagine the
following scenario.

1. The important data was deleted but no one noticed that. This deletion was
logged in last, partial WAL file.
2. The server crashed and DBA started an archive recovery from old backup.
3. After recovery, all WAL files of the old timeline were recycled.
4. Finally DBA noticed the loss of important data and tried to do PITR
to the point
where the data was deleted.

HOWEVER, the WAL file containing that deletion operation no longer exists.
So DBA will never be able to recover that important data ....

I think you're missing a step above:

1.5: The administrator copies the last, partial WAL file (and any
complete but not yet-archived files) to the new server's pg_xlog directory.

Without that, it won't be available for PITR anyway, and the new server
won't see it or try to archive it, no matter what.

- Heikki


#36Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#35)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Tue, Oct 28, 2014 at 1:12 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 10/27/2014 02:12 PM, Fujii Masao wrote:

On Fri, Oct 24, 2014 at 10:05 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 10/23/2014 11:09 AM, Heikki Linnakangas wrote:

At least for master, we should consider changing the way the archiving
works so that we only archive WAL that was generated in the same server.
I.e. we should never try to archive WAL files belonging to another
timeline.

I just remembered that we discussed a different problem related to this
some time ago, at

/messages/by-id/20131212.110002.204892575.horiguchi.kyotaro@lab.ntt.co.jp.
The conclusion of that was that at promotion, we should not archive the
last, partial, segment from the old timeline.

So, this is what I came up with for master. Does anyone see a problem
with it?

What about the problem that I raised upthread? This is, the patch
prevents the last, partial, WAL file of the old timeline from being
archived.
So we can never PITR the database to the point that the last, partial WAL
file has.

A partial WAL file is never archived in the master server to begin with, so
if it's ever used in archive recovery, the administrator must have performed
some manual action to copy the partial WAL file from the original server.

Yes.

When he does that, he can also copy it manually to the archive, or whatever
he wants to do with it.

Yes, but currently he doesn't need to do that manually. At the end of
archive recovery, a .ready file is created and the partial, last WAL file of
the old timeline will be archived later. It's never recycled until it's
archived. So when a subsequent archive recovery is necessary, we can expect
that it exists in either pg_xlog or archival storage. OTOH, if the patch is
applied, it's never archived and may have already been recycled, and then
a subsequent archive recovery would fail.

Therefore, the problem that the patch can cause is that the partial WAL file
may be recycled even when it has not been archived yet. The partial WAL file
may not be able to be archived for some reason, but that's not directly
a problem as long as it exists in pg_xlog.

Regards,

--
Fujii Masao


#37Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Heikki Linnakangas (#35)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On 10/27/2014 06:12 PM, Heikki Linnakangas wrote:

On 10/27/2014 02:12 PM, Fujii Masao wrote:

On Fri, Oct 24, 2014 at 10:05 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 10/23/2014 11:09 AM, Heikki Linnakangas wrote:

At least for master, we should consider changing the way the archiving
works so that we only archive WAL that was generated in the same server.
I.e. we should never try to archive WAL files belonging to another
timeline.

I just remembered that we discussed a different problem related to this
some time ago, at

/messages/by-id/20131212.110002.204892575.horiguchi.kyotaro@lab.ntt.co.jp.
The conclusion of that was that at promotion, we should not archive the
last, partial, segment from the old timeline.

So, this is what I came up with for master. Does anyone see a problem with
it?

What about the problem that I raised upthread? This is, the patch
prevents the last, partial, WAL file of the old timeline from being archived.
So we can never PITR the database to the point that the last, partial WAL
file has.

A partial WAL file is never archived in the master server to begin with,
so if it's ever used in archive recovery, the administrator must have
performed some manual action to copy the partial WAL file from the
original server. When he does that, he can also copy it manually to the
archive, or whatever he wants to do with it.

Note that the same applies to any complete, but not-yet archived WAL
files. But we've never had any mechanism in place to archive those in
the new instance, after PITR.

Actually, I'll take back what I said above. I had misunderstood the
current behavior. Currently, a server *does* archive any files that you
copy manually to pg_xlog, after PITR has finished. Eventually. We don't
create a .ready file for them until they're old enough to be recycled.
We do create a .ready file for the last, partial segment, but it's
pretty weird to do it just for that and not for any other complete
segments that might've been copied to pg_xlog. So what happens is that
the last partial segment gets archived immediately after promotion, but
any older segments will linger unarchived until much later.
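
This is easy to see by watching the archive status directory on a
promoted standby -- just a quick illustration, not tied to any
particular setup:

  ls -ltr $PGDATA/pg_xlog/archive_status/
  # *.ready = waiting to be handed to archive_command
  # *.done  = already archived (or marked as such)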

The special treatment of the last partial segment still makes no sense.
If we want the segments from the old timeline to be archived after PITR,
we should archive them all immediately after end of recovery, not just
the partial one. The exception for just the last partial segment is silly.

Now, the bigger question is whether we want the server after PITR to be
responsible for archiving the segments from the old timeline at all. If
we do, then we should remove the special treatment of the last, partial
segment, and create the .ready files for all the complete segments too.
And actually, I think we should *not* archive the partial segment. We
don't normally archive partial segments, and all the WAL required to
restore the server to new timeline is copied to the file with the new
TLI. If the old timeline is still live, i.e. there's a server somewhere
still writing new WAL on the old timeline, the partial segment will
clash with a complete segment that the other server will archive later.

Yet another consideration is that we currently don't archive files
streamed from the master. If we think that the standby server is
responsible for archiving old segments after recovery, why is it not
responsible for archiving the streamed segments? It's because in most
cases, the master will archive the file, and we don't want two servers
to archive the same file, but there is actually no guarantee of that. It
might well be that the archiver runs a little behind on the master, and
after a crash the archive will be missing some of the required segments.
That's not good either.

I'm not sure what to do here. The current behavior is inconsistent, and
there are some nasty gotchas that would be nice to fix. I think
someone needs to sit down and write a high-level design of how this all
should work.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#38Dennis Kögel
dk@neveragain.de
In reply to: Heikki Linnakangas (#37)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

Hi,

Am 04.09.2014 um 17:50 schrieb Jehan-Guillaume de Rorthais <jgdr@dalibo.com>:

Since few months, we occasionally see .ready files appearing on some slave
instances from various context. The two I have in mind are under 9.2.x. […]
So it seems for some reasons, these old WALs were "forgotten" by the
restartpoint mechanism when they should have been recylced/deleted.

Am 08.10.2014 um 11:54 schrieb Heikki Linnakangas <hlinnakangas@vmware.com>:

1. Where do the FF files come from? In 9.2, FF-segments are not supposed to created, ever. […]
2. Why are the .done files sometimes not being created?

We’ve encountered behaviour which seems to match what has been described here: On Streaming Replication slaves, there is an odd piling up of old WALs and .ready files in pg_xlog, going back several months.

The fine people on IRC have pointed me to this thread, and have encouraged me to revive it with our observations, so here we go:

Environment:

Master, 9.2.9
|- Slave S1, 9.2.9, on the same network as the master
'- Slave S2, 9.2.9, some 100 km away (occasional network hiccups; *not* cascading replication)

wal_keep_segments M=100 S1=100 S2=30
checkpoint_segments M=100 S1=30 S2=30
wal_level hot_standby (all)
archive_mode on (all)
archive_command on both slaves: /bin/true
archive_timeout 600s (all)

- On both slaves, we have „ghost“ WALs and corresponding .ready files (currently >600 of each on S2, slowly becoming a disk space problem)

- There are always gaps in the ghost WAL names, often roughly 0x20, but not always

- The slave with the „bad“ network link has significantly more of these files, which suggests that disturbances of the Streaming Replication increase the chances of triggering this bug; OTOH, the presence of a pattern in the name gaps suggests the opposite

- We observe files named *FF as well

As you can see in the directory listings below, this setup is *very* low traffic, which may explain the pattern in WAL name gaps (?).

I’ve listed the entries by time, expecting to easily match WALs to their .ready files.
There is sometimes an interesting delay between the WAL’s mtime and its .ready file, especially for *FF, where there are several days between the WAL and the .ready file.
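
In case it helps to reproduce the comparison, this is roughly the kind of listing I mean (GNU stat, run from inside pg_xlog; just a sketch):

  cd $PGDATA/pg_xlog
  for f in archive_status/*.ready; do
      seg=$(basename "$f" .ready)
      [ -f "$seg" ] && echo "$seg ready=$(stat -c %y "$f") wal=$(stat -c %y "$seg")"
  done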

- Master: http://pgsql.privatepaste.com/52ad612dfb
- Slave S1: http://pgsql.privatepaste.com/58b4f3bb10
- Slave S2: http://pgsql.privatepaste.com/a693a8d7f4

I’ve only skimmed through the thread; my understanding is that there were several patches floating around, but nothing was committed.
If there’s any way I can help, please let me know.

- D.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#39Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Dennis Kögel (#38)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On 12/10/2014 04:32 PM, Dennis Kögel wrote:

Hi,

Am 04.09.2014 um 17:50 schrieb Jehan-Guillaume de Rorthais <jgdr@dalibo.com>:

Since few months, we occasionally see .ready files appearing on some slave
instances from various context. The two I have in mind are under 9.2.x. […]
So it seems for some reasons, these old WALs were "forgotten" by the
restartpoint mechanism when they should have been recylced/deleted.

Am 08.10.2014 um 11:54 schrieb Heikki Linnakangas <hlinnakangas@vmware.com>:

1. Where do the FF files come from? In 9.2, FF-segments are not supposed to created, ever. […]
2. Why are the .done files sometimes not being created?

We’ve encountered behaviour which seems to match what has been described here: On Streaming Replication slaves, there is an odd piling up of old WALs and .ready files in pg_xlog, going back several months.

The fine people on IRC have pointed me to this thread, and have encouraged me to revive it with our observations, so here we go:

Environment:

Master, 9.2.9
|- Slave S1, 9.2.9, on the same network as the master
'- Slave S2, 9.2.9, some 100 km away (occassional network hickups; *not* a cascading replication)

wal_keep_segments M=100 S1=100 S2=30
checkpoint_segments M=100 S1=30 S2=30
wal_level hot_standby (all)
archive_mode on (all)
archive_command on both slaves: /bin/true
archive_timeout 600s (all)

- On both slaves, we have „ghost“ WALs and corresponding .ready files (currently >600 of each on S2, slowly becoming a disk space problem)

- There’s always gaps in the ghost WAL names, often roughly 0x20, but not always

- The slave with the „bad“ network link has significantly more of these files, which suggests that disturbances of the Streaming Replication increase chances of triggering this bug; OTOH, the presence of a name gap pattern suggests the opposite

- We observe files named *FF as well

As you can see in the directory listings below, this setup is *very* low traffic, which may explain the pattern in WAL name gaps (?).

I’ve listed the entries by time, expecting to easily match WALs to their .ready files.
There sometimes is an interesting delay between the WAL’s mtime and the .ready file — especially for *FF, where there’s several days between the WAL and the .ready file.

- Master: http://pgsql.privatepaste.com/52ad612dfb
- Slave S1: http://pgsql.privatepaste.com/58b4f3bb10
- Slave S2: http://pgsql.privatepaste.com/a693a8d7f4

I’ve only skimmed through the thread; my understanding is that there were several patches floating around, but nothing was committed.
If there’s any way I can help, please let me know.

Yeah. It wasn't totally clear how all this should work, so I got
distracted with other stuff and dropped the ball; sorry.

I'm thinking that we should change the behaviour on master so that the
standby never archives any files from older timelines, only the new one
that it generates itself. That will solve the immediate problem of old
WAL files accumulating, and bogus .ready files appearing in the standby.
However, it will not solve the bigger problem of how you ensure that
all WAL files are archived when you promote a standby server. There is
no guarantee on that today anyway, but this will make it even less
reliable, because it will increase the chances that you miss a file on
the old timeline in the archive, after promoting. I'd argue that that's
a good thing; it makes the issue more obvious, so you are more likely to
encounter it in testing, and you won't be surprised in an emergency. But
I've started a new thread on that bigger issue; hopefully we'll come up
with a solution there
(/messages/by-id/548AF1CB.80702@vmware.com).

Now, what do we do with the back-branches? I'm not sure. Changing the
behaviour in back-branches could cause nasty surprises. Perhaps it's
best to just leave it as it is, even though it's buggy.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#40Guillaume Lelarge
guillaume@lelarge.info
In reply to: Heikki Linnakangas (#39)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

2014-12-12 14:58 GMT+01:00 Heikki Linnakangas <hlinnakangas@vmware.com>:

On 12/10/2014 04:32 PM, Dennis Kögel wrote:

Hi,

Am 04.09.2014 um 17:50 schrieb Jehan-Guillaume de Rorthais <
jgdr@dalibo.com>:

Since few months, we occasionally see .ready files appearing on some
slave
instances from various context. The two I have in mind are under 9.2.x.
[…]
So it seems for some reasons, these old WALs were "forgotten" by the
restartpoint mechanism when they should have been recylced/deleted.

Am 08.10.2014 um 11:54 schrieb Heikki Linnakangas <
hlinnakangas@vmware.com>:

1. Where do the FF files come from? In 9.2, FF-segments are not supposed
to created, ever. […]
2. Why are the .done files sometimes not being created?

We’ve encountered behaviour which seems to match what has been described
here: On Streaming Replication slaves, there is an odd piling up of old
WALs and .ready files in pg_xlog, going back several months.

The fine people on IRC have pointed me to this thread, and have
encouraged me to revive it with our observations, so here we go:

Environment:

Master, 9.2.9
|- Slave S1, 9.2.9, on the same network as the master
'- Slave S2, 9.2.9, some 100 km away (occassional network hickups; *not*
a cascading replication)

wal_keep_segments M=100 S1=100 S2=30
checkpoint_segments M=100 S1=30 S2=30
wal_level hot_standby (all)
archive_mode on (all)
archive_command on both slaves: /bin/true
archive_timeout 600s (all)

- On both slaves, we have „ghost“ WALs and corresponding .ready files
(currently >600 of each on S2, slowly becoming a disk space problem)

- There’s always gaps in the ghost WAL names, often roughly 0x20, but not
always

- The slave with the „bad“ network link has significantly more of these
files, which suggests that disturbances of the Streaming Replication
increase chances of triggering this bug; OTOH, the presence of a name gap
pattern suggests the opposite

- We observe files named *FF as well

As you can see in the directory listings below, this setup is *very* low
traffic, which may explain the pattern in WAL name gaps (?).

I’ve listed the entries by time, expecting to easily match WALs to their
.ready files.
There sometimes is an interesting delay between the WAL’s mtime and the
.ready file — especially for *FF, where there’s several days between the
WAL and the .ready file.

- Master: http://pgsql.privatepaste.com/52ad612dfb
- Slave S1: http://pgsql.privatepaste.com/58b4f3bb10
- Slave S2: http://pgsql.privatepaste.com/a693a8d7f4

I’ve only skimmed through the thread; my understanding is that there were
several patches floating around, but nothing was committed.
If there’s any way I can help, please let me know.

Yeah. It wasn't totally clear how all this should work, so I got
distracted with other stuff an dropped the ball; sorry.

I'm thinking that we should change the behaviour on master so that the
standby never archives any files from older timelines, only the new one
that it generates itself. That will solve the immediate problem of old WAL
files accumulating, and bogus .ready files appearing in the standby.
However, it will not solve the bigger problem of how do you ensure that all
WAL files are archived, when you promote a standby server. There is no
guarantee on that today anyway, but this will make it even less reliable,
because it will increase the chances that you miss a file on the old
timeline in the archive, after promoting. I'd argue that that's a good
thing; it makes the issue more obvious, so you are more likely to encounter
it in testing, and you won't be surprised in an emergency. But I've started
a new thread on that bigger issue, hopefully we'll come up with a solution (
/messages/by-id/548AF1CB.80702@vmware.com).

Now, what do we do with the back-branches? I'm not sure. Changing the
behaviour in back-branches could cause nasty surprises. Perhaps it's best
to just leave it as it is, even though it's buggy.

As long as master is fixed, I don't actually care. But I agree with Dennis
that it's hard to see what has been committed with all the different issues
found, and if any commits were made, in which branch. I'd like to be able
to tell my customers: update to this minor release to see if it's fixed,
but I can't even do that.

--
Guillaume.
http://blog.guillaume.lelarge.info
http://www.dalibo.com

#41Michael Paquier
michael.paquier@gmail.com
In reply to: Guillaume Lelarge (#40)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Wed, Dec 31, 2014 at 4:44 PM, Guillaume Lelarge
<guillaume@lelarge.info> wrote:

2014-12-12 14:58 GMT+01:00 Heikki Linnakangas <hlinnakangas@vmware.com>:

Now, what do we do with the back-branches? I'm not sure. Changing the
behaviour in back-branches could cause nasty surprises. Perhaps it's best to
just leave it as it is, even though it's buggy.

As long as master is fixed, I don't actually care. But I agree with Dennis
that it's hard to see what's been commited with all the different issues
found, and if any commits were done, in which branch. I'd like to be able to
tell my customers: update to this minor release to see if it's fixed, but I
can't even do that.

This bug does not endanger data consistency at all, as only old WAL
files remain on the standby, so I'm fine as well with just a clean fix
on master, and nothing done on back-branches.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#42Dennis Kögel
dk@neveragain.de
In reply to: Michael Paquier (#41)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

Am 31.12.2014 um 11:40 schrieb Michael Paquier <michael.paquier@gmail.com>:

As long as master is fixed, I don't actually care. But I agree with Dennis
that it's hard to see what's been commited with all the different issues
found, and if any commits were done, in which branch. I'd like to be able to
tell my customers: update to this minor release to see if it's fixed, but I
can't even do that.

This bug does not endanger at all data consistency as only old WAL
files remain on the standby, so I'm fine as well with just a clean fix
on master, and nothing done on back-branches.

I’d like to point out that this issue is causing severe disk space bloat over time, as described in my previous posting. I’d assume that, as time goes by, this issue will pop up more and more, as long as back-branches remain unfixed.

And as the general recommendation is to never ever fiddle with pg_xlog/ contents, I’m not even sure what to do about it.

- D.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers