.ready files appearing on slaves

Started by Jehan-Guillaume de Rorthaisover 11 years ago42 messages
Jump to latest

Hi hackers,

Since few months, we occasionally see .ready files appearing on some slave
instances from various context. The two I have in mind are under 9.2.x.

I tried to investigate a bit. These .ready files are created when a WAL file
from pg_xlog has no corresponding file in pg_xlog/archive_status. I could
easily experience this by deleting such a file: it is created again at the next
restartpoint or checkpoint received from the master.

Looking at the WAL in pg_xlog folder corresponding to these .ready files, they
are all much older than the current WAL "cycle" in both mtime and name logic
sequence. As instance on one of these box we have currently 6 of those "ghost"
WALs:

0000000200001E53000000FF
0000000200001F18000000FF
0000000200002047000000FF
00000002000020BF000000FF
0000000200002140000000FF
0000000200002370000000FF
000000020000255D000000A8
000000020000255D000000A9
[...normal WAL sequence...]
000000020000255E0000009D

And on another box:

000000010000040E000000FF
0000000100000414000000DA
000000010000046E000000FF
0000000100000470000000FF
00000001000004850000000F
000000010000048500000010
[...normal WAL sequence...]
000000010000048500000052

So it seems for some reasons, these old WALs were "forgotten" by the
restartpoint mechanism when they should have been recylced/deleted.

For one of these servers, I could correlate this with some brutal disconnection
of the streaming replication appearing in its logs. But there was no known SR
disconnection on the second one.

Any idea about this weird behaviour? What can we do to help you investigate
further?

Regards,
--
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

In reply to: Jehan-Guillaume de Rorthais (#1)
Re: .ready files appearing on slaves

Hi hackers,

An issue that seems related to this has been posted on pgsql-admin. See:

/messages/by-id/CAAS3tyLnXaYDZ0+zhXLPdVtOvHQOvR+jSPhp30o8kvWqQs0Tqw@mail.gmail.com

How can we help on this issue?

Cheers,

On Thu, 4 Sep 2014 17:50:36 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

Hi hackers,

Since few months, we occasionally see .ready files appearing on some slave
instances from various context. The two I have in mind are under 9.2.x.

I tried to investigate a bit. These .ready files are created when a WAL file
from pg_xlog has no corresponding file in pg_xlog/archive_status. I could
easily experience this by deleting such a file: it is created again at the
next restartpoint or checkpoint received from the master.

Looking at the WAL in pg_xlog folder corresponding to these .ready files, they
are all much older than the current WAL "cycle" in both mtime and name logic
sequence. As instance on one of these box we have currently 6 of those "ghost"
WALs:

0000000200001E53000000FF
0000000200001F18000000FF
0000000200002047000000FF
00000002000020BF000000FF
0000000200002140000000FF
0000000200002370000000FF
000000020000255D000000A8
000000020000255D000000A9
[...normal WAL sequence...]
000000020000255E0000009D

And on another box:

000000010000040E000000FF
0000000100000414000000DA
000000010000046E000000FF
0000000100000470000000FF
00000001000004850000000F
000000010000048500000010
[...normal WAL sequence...]
000000010000048500000052

So it seems for some reasons, these old WALs were "forgotten" by the
restartpoint mechanism when they should have been recylced/deleted.

For one of these servers, I could correlate this with some brutal
disconnection of the streaming replication appearing in its logs. But there
was no known SR disconnection on the second one.

Any idea about this weird behaviour? What can we do to help you investigate
further?

Regards,

--
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

In reply to: Jehan-Guillaume de Rorthais (#2)
BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

Hi hackers,

We spend some time with Guillaume Lelarge studying this issue.

CreateRestartPoint() calls RemoveOldXlogFiles() to drop/recycle old WALs. This
one is calling XLogArchiveCheckDone() to check if the given WAL can be dropped.
As our slave has "archive_mode" & "archive_command" set, XLogArchivingActive()
returns true, forcing XLogArchiveCheckDone() to look for a ".done" file. If the
corresponding ".done" file doesn't exist, XLogArchiveCheckDone() calls
XLogArchiveNotify() to actually create the ".ready" file.

Now, on a standby, we are supposed to create the ".done" files by calling
XLogArchiveForceDone() either after a successful restore_command or when the
streaming replication finishes and flushes a WAL. It seems like the root cause
of this trouble: maybe a race condition where XLogArchiveForceDone() would not
be called for a WAL. As the ".done" file is never created, we end up calling
XLogArchiveNotify() to create a ".ready" and this WAL will never be archived by
the slave, sitting in the pg_xlog dir.

But worst, we experience this under two 9.2.8 clusters and realized
this version is not supposed to have 256 segments per WAL, finishing a WAL to
the *FE one. I'm quoting back the history of my previous mail as we clearly
have some *FF files on the slave side!

000000010000040E000000FF
0000000100000414000000DA
000000010000046E000000FF
0000000100000470000000FF
00000001000004850000000F
000000010000048500000010
[...normal WAL sequence...]
000000010000048500000052

Note that it seems to happen only during streaming replication, not in pure Log
Shipping replication. We added the debug message on the master side to make
sure it never archive *FF file

We kept the WAL files and log files for further analysis. How can we help
regarding this issue?

Regards,

On Mon, 15 Sep 2014 17:37:24 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

Hi hackers,

An issue that seems related to this has been posted on pgsql-admin. See:

/messages/by-id/CAAS3tyLnXaYDZ0+zhXLPdVtOvHQOvR+jSPhp30o8kvWqQs0Tqw@mail.gmail.com

How can we help on this issue?

Cheers,

On Thu, 4 Sep 2014 17:50:36 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

Hi hackers,

Since few months, we occasionally see .ready files appearing on some slave
instances from various context. The two I have in mind are under 9.2.x.

I tried to investigate a bit. These .ready files are created when a WAL file
from pg_xlog has no corresponding file in pg_xlog/archive_status. I could
easily experience this by deleting such a file: it is created again at the
next restartpoint or checkpoint received from the master.

Looking at the WAL in pg_xlog folder corresponding to these .ready files,
they are all much older than the current WAL "cycle" in both mtime and name
logic sequence. As instance on one of these box we have currently 6 of
those "ghost" WALs:

0000000200001E53000000FF
0000000200001F18000000FF
0000000200002047000000FF
00000002000020BF000000FF
0000000200002140000000FF
0000000200002370000000FF
000000020000255D000000A8
000000020000255D000000A9
[...normal WAL sequence...]
000000020000255E0000009D

And on another box:

000000010000040E000000FF
0000000100000414000000DA
000000010000046E000000FF
0000000100000470000000FF
00000001000004850000000F
000000010000048500000010
[...normal WAL sequence...]
000000010000048500000052

So it seems for some reasons, these old WALs were "forgotten" by the
restartpoint mechanism when they should have been recylced/deleted.

For one of these servers, I could correlate this with some brutal
disconnection of the streaming replication appearing in its logs. But there
was no known SR disconnection on the second one.

Any idea about this weird behaviour? What can we do to help you investigate
further?

Regards,

--
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Michael Paquier
michael@paquier.xyz
In reply to: Jehan-Guillaume de Rorthais (#3)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Fri, Sep 19, 2014 at 1:07 AM, Jehan-Guillaume de Rorthais <
jgdr@dalibo.com> wrote:

We kept the WAL files and log files for further analysis. How can we help
regarding this issue?

Commit c2f79ba has added as assumption that the WAL receiver should always
enforce the create of .done files when WAL files are done being streamed
(XLogWalRcvWrite and WalReceiverMain) or archived
(KeepFileRestoredFromArchive). Then using this assumption 1bd42cd has
changed a bit RemoveOldXlogFiles, removing a check looking if the node is
in recovery. Now, based on the information given here yes it happens that
there are still cases where .done file creation is not correctly done,
leading to those extra files. Even by looking at the code, I am not
directly seeing any code paths where an extra call to XLogArchiveForceDone
would be needed on the WAL receiver side but... Something like the patch
attached (which is clearly a band-aid) may help though as it would make
files to be removed even if they are not marked as .done for a node in
recovery. And this is consistent with the pre-1bd42cd.

Comments welcome.
--
Michael

Attachments:

20141008_xlog_file_enforce_deletion.patchtext/x-diff; charset=US-ASCII; name=20141008_xlog_file_enforce_deletion.patchDownload+1-1
#5Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Michael Paquier (#4)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On 10/08/2014 10:44 AM, Michael Paquier wrote:

On Fri, Sep 19, 2014 at 1:07 AM, Jehan-Guillaume de Rorthais <
jgdr@dalibo.com> wrote:

We kept the WAL files and log files for further analysis. How can we help
regarding this issue?

Commit c2f79ba has added as assumption that the WAL receiver should always
enforce the create of .done files when WAL files are done being streamed
(XLogWalRcvWrite and WalReceiverMain) or archived
(KeepFileRestoredFromArchive). Then using this assumption 1bd42cd has
changed a bit RemoveOldXlogFiles, removing a check looking if the node is
in recovery. Now, based on the information given here yes it happens that
there are still cases where .done file creation is not correctly done,
leading to those extra files. Even by looking at the code, I am not
directly seeing any code paths where an extra call to XLogArchiveForceDone
would be needed on the WAL receiver side but... Something like the patch
attached (which is clearly a band-aid) may help though as it would make
files to be removed even if they are not marked as .done for a node in
recovery. And this is consistent with the pre-1bd42cd.

There are two mysteries here:

1. Where do the FF files come from? In 9.2, FF-segments are not supposed
to created, ever.

Since this only happens with streaming replication, the FF segments are
probably being created by walreceiver. XLogWalRcvWrite is the function
that opens the file. I don't see anything obviously wrong there.
XLogWalRcvWrite opens the file corresponding the start position in the
message received from the master. There is no check that the start
position is valid, though; if the master sends a start position in the
FF segment, walreceiver will merrily write it. So the problem could be
in the walsender side. However, I don't see anything wrong there either.

I think we should add a check in walreceiver, to throw an error if the
master sends an invalid WAL pointer, pointing to an FF segment.

2. Why are the .done files sometimes not being created?

I may have an explanation for that. Walreceiver creates a .done file
when it closes an old segment and opens a new one. However, it does this
only when it's about to start writing to the new segment, and still has
the old segment open. If you stream the FE segment fully, but drop
replication connection at exactly that point, the .done file is not
created. That might sound unlikely, but it's actually pretty easy to
trigger. Just do "select pg_switch_xlog()" in the master, followed by
"pg_ctl stop -m i" and a restart.

The creation of the .done files seems quite unreliable anyway. If only a
portion of a segment is streamed, we don't write a .done file for it, so
we still have the original problem that we will try to archive the
segment after failover, even though the master might already have
archived it.

I looked again at the thread where this was discussed:
/messages/by-id/CAHGQGwHVYqbX=A+zo+AvFbVHLGoypO9G_QDKbabeXgXBVGd05g@mail.gmail.com.
I believe the idea was that the server that generates a WAL segment is
always responsible for archiving it. A standby should never attempt to
archive a WAL segment that was restored from the archive, or streamed
from the master.

In that thread, it was not discussed what should happen to WAL files
that an admin manually copies into pg_xlog of the standby. Should the
standby archive them? I don't think so - the admin should copy them
manually to the archive too, if he wants them archived. It's a good and
simple rule that the server that generates the WAL, archives the WAL.

Instead of creating any .done files during recovery, we could scan
pg_xlog at promotion, and create a .done file for every WAL segment
that's present at that point. That would be more robust. And then apply
your patch, to recycle old segments during archive recovery, ignoring
.done files.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#5)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Wed, Oct 8, 2014 at 6:54 PM, Heikki Linnakangas <hlinnakangas@vmware.com>
wrote:

1. Where do the FF files come from? In 9.2, FF-segments are not supposed
to created, ever.

Since this only happens with streaming replication, the FF segments are

probably being created by walreceiver. XLogWalRcvWrite is the function that
opens the file. I don't see anything obviously wrong there. XLogWalRcvWrite
opens the file corresponding the start position in the message received
from the master. There is no check that the start position is valid,
though; if the master sends a start position in the FF segment, walreceiver
will merrily write it. So the problem could be in the walsender side.
However, I don't see anything wrong there either.

Good to hear that. By looking at the wal receiver and sender code paths, I
found nothing really weird.

I think we should add a check in walreceiver, to throw an error if the

master sends an invalid WAL pointer, pointing to an FF segment.

Then we're good for a check in ProcessWalSndrMessage for walEnd I guess.
Seems like a straight-forward patch.

2. Why are the .done files sometimes not being created?

I may have an explanation for that. Walreceiver creates a .done file when
it closes an old segment and opens a new one. However, it does this only
when it's about to start writing to the new segment, and still has the old
segment open. If you stream the FE segment fully, but drop replication
connection at exactly that point, the .done file is not created. That might
sound unlikely, but it's actually pretty easy to trigger. Just do "select
pg_switch_xlog()" in the master, followed by "pg_ctl stop -m i" and a
restart.

That's exactly the test I have been doing a couple of times to trigger this
behavior before sending my previous email, but without success on the
standby with master: all the WAL files were marked as .done. Now, I have
just retried it, with far more tries on REL9_3_STABLE and on HEAD and I
have been able to actually trigger the problem a couple of times. Simply
run a long transaction generating a lot of WAL like that:
=# create table aa as select generate_series(1,1000000000);
And then run that:
$ psql -c 'select pg_switch_xlog()'; pg_ctl stop -m immediate; pg_ctl start
And with enough "luck", .ready files may appear. It may take a dozen of
tries before seeing at least ones. And I noticed that generally multiple
.ready files appear at the same time.

The creation of the .done files seems quite unreliable anyway. If only a
portion of a segment is streamed, we don't write a .done file for it, so we
still have the original problem that we will try to archive the segment
after failover, even though the master might already have archived it.

Yep. Agreed.

I looked again at the thread where this was discussed:
/messages/by-id/CAHGQGwHVYqbX=A+zo+AvFbVHLGoypO9G_QDKbabeXgXBVGd05g@mail.gmail.com.
I believe the idea was that the server that generates a WAL segment is
always responsible for archiving it. A standby should never attempt to
archive a WAL segment that was restored from the archive, or streamed from
the master.

In that thread, it was not discussed what should happen to WAL files that

an admin manually copies into pg_xlog of the standby. Should the standby
archive them? I don't think so - the admin should copy them manually to the
archive too, if he wants them archived. It's a good and simple rule that
the server that generates the WAL, archives the WAL.

Question time: why has the check based on recovery state of the node been
removed in 1bd42cd? Just assuming, but did you have in mind that relying on
XLogArchiveForceDone and XLogArchiveCheckDone was enough and more robust at
this point?

Instead of creating any .done files during recovery, we could scan pg_xlog

at promotion, and create a .done file for every WAL segment that's present
at that point. That would be more robust. And then apply your patch, to
recycle old segments during archive recovery, ignoring .done files.

The additional process at promotion sounds like a good idea, I'll try to
get a patch done tomorrow. This would result as well in removing the
XLogArchiveForceDone stuff. Either way, not that I have been able to
reproduce the problem manually, things can be clearly solved.
Regards,
--
Michael

#7Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#5)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Wed, Oct 8, 2014 at 6:54 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 10/08/2014 10:44 AM, Michael Paquier wrote:

On Fri, Sep 19, 2014 at 1:07 AM, Jehan-Guillaume de Rorthais <
jgdr@dalibo.com> wrote:

We kept the WAL files and log files for further analysis. How can we help
regarding this issue?

Commit c2f79ba has added as assumption that the WAL receiver should always
enforce the create of .done files when WAL files are done being streamed
(XLogWalRcvWrite and WalReceiverMain) or archived
(KeepFileRestoredFromArchive). Then using this assumption 1bd42cd has
changed a bit RemoveOldXlogFiles, removing a check looking if the node is
in recovery. Now, based on the information given here yes it happens that
there are still cases where .done file creation is not correctly done,
leading to those extra files. Even by looking at the code, I am not
directly seeing any code paths where an extra call to XLogArchiveForceDone
would be needed on the WAL receiver side but... Something like the patch
attached (which is clearly a band-aid) may help though as it would make
files to be removed even if they are not marked as .done for a node in
recovery. And this is consistent with the pre-1bd42cd.

There are two mysteries here:

1. Where do the FF files come from? In 9.2, FF-segments are not supposed to
created, ever.

Since this only happens with streaming replication, the FF segments are
probably being created by walreceiver. XLogWalRcvWrite is the function that
opens the file. I don't see anything obviously wrong there. XLogWalRcvWrite
opens the file corresponding the start position in the message received from
the master. There is no check that the start position is valid, though; if
the master sends a start position in the FF segment, walreceiver will
merrily write it. So the problem could be in the walsender side. However, I
don't see anything wrong there either.

I think we should add a check in walreceiver, to throw an error if the
master sends an invalid WAL pointer, pointing to an FF segment.

2. Why are the .done files sometimes not being created?

I may have an explanation for that. Walreceiver creates a .done file when it
closes an old segment and opens a new one. However, it does this only when
it's about to start writing to the new segment, and still has the old
segment open. If you stream the FE segment fully, but drop replication
connection at exactly that point, the .done file is not created. That might
sound unlikely, but it's actually pretty easy to trigger. Just do "select
pg_switch_xlog()" in the master, followed by "pg_ctl stop -m i" and a
restart.

The creation of the .done files seems quite unreliable anyway. If only a
portion of a segment is streamed, we don't write a .done file for it, so we
still have the original problem that we will try to archive the segment
after failover, even though the master might already have archived it.

I looked again at the thread where this was discussed:
/messages/by-id/CAHGQGwHVYqbX=A+zo+AvFbVHLGoypO9G_QDKbabeXgXBVGd05g@mail.gmail.com.
I believe the idea was that the server that generates a WAL segment is
always responsible for archiving it. A standby should never attempt to
archive a WAL segment that was restored from the archive, or streamed from
the master.

In that thread, it was not discussed what should happen to WAL files that an
admin manually copies into pg_xlog of the standby. Should the standby
archive them? I don't think so - the admin should copy them manually to the
archive too, if he wants them archived. It's a good and simple rule that the
server that generates the WAL, archives the WAL.

Instead of creating any .done files during recovery, we could scan pg_xlog
at promotion, and create a .done file for every WAL segment that's present
at that point. That would be more robust. And then apply your patch, to
recycle old segments during archive recovery, ignoring .done files.

What happens if a user shutdowns the standby, removes recovery.conf and
starts the server as the master? In this case, no WAL files have .done status
files, so the server will create .ready and archive all of them. Probably this
is problematic. So even if we adopt your idea, ISTM that it's better to create
.done file whenever WAL file is fullly streamed and restored.

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#7)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On 10/08/2014 04:59 PM, Fujii Masao wrote:

On Wed, Oct 8, 2014 at 6:54 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

Instead of creating any .done files during recovery, we could scan pg_xlog
at promotion, and create a .done file for every WAL segment that's present
at that point. That would be more robust. And then apply your patch, to
recycle old segments during archive recovery, ignoring .done files.

What happens if a user shutdowns the standby, removes recovery.conf and
starts the server as the master?

Um, that's not a safe thing to do anyway, is it?

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#8)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Wed, Oct 8, 2014 at 11:38 PM, Heikki Linnakangas <hlinnakangas@vmware.com

wrote:

On 10/08/2014 04:59 PM, Fujii Masao wrote:

On Wed, Oct 8, 2014 at 6:54 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

Instead of creating any .done files during recovery, we could scan
pg_xlog
at promotion, and create a .done file for every WAL segment that's
present
at that point. That would be more robust. And then apply your patch, to
recycle old segments during archive recovery, ignoring .done files.

What happens if a user shutdowns the standby, removes recovery.conf and
starts the server as the master?

Um, that's not a safe thing to do anyway, is it?

That's not safe as it bypasses all the consistency checks of promotion.
Now, it is also something that repmgr for example does as far as I recall
to do a node "promotion". What if we simply document the problem properly
then? The apparition of those phantom WAL files is more scary than a user
or a utility that does a promotion with a server restart. Not to mention as
well that users as free to add themselves files to pg_xlog.
--
Michael

#10Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#6)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Wed, Oct 8, 2014 at 10:00 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:

The additional process at promotion sounds like a good idea, I'll try to
get a patch done tomorrow. This would result as well in removing the
XLogArchiveForceDone stuff. Either way, not that I have been able to
reproduce the problem manually, things can be clearly solved.

Please find attached two patches aimed to fix this issue and to improve the
situation:
- 0001 prevents the apparition of those phantom WAL segment file by
ensuring that when a node is in recovery it will remove it whatever its
status in archive_status. This patch is the real fix, and should be applied
down to 9.2.
- 0002 is a patch implementing Heikki's idea of enforcing all the segment
files present in pg_xlog to have their status to .done, marking them for
removal. When looking at the code, I finally concluded that Fujii-san's
point, about marking in all cases as .done segment files that have been
fully streamed, actually makes more sense to not be backward. This patch
would actually not be mandatory for back-patching, but it makes the process
more robust IMO.

I imagine that it would be nice to get those things fixed before the next
minor release.
Regards,
--
Michael

Attachments:

0001-Fix-apparition-of-archive-status-files-of-.ready.patchtext/x-diff; charset=US-ASCII; name=0001-Fix-apparition-of-archive-status-files-of-.ready.patchDownload+1-2
0002-Enforce-all-WAL-segment-files-to-be-marked-as-.done-.patchtext/x-diff; charset=US-ASCII; name=0002-Enforce-all-WAL-segment-files-to-be-marked-as-.done-.patchDownload+49-2
#11Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#5)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Wed, Oct 8, 2014 at 6:54 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

1. Where do the FF files come from? In 9.2, FF-segments are not supposed to
created, ever.

I think we should add a check in walreceiver, to throw an error if the
master sends an invalid WAL pointer, pointing to an FF segment.

Attached is a patch for that. This would be needed for PG < 9.3 if applied.
Regards,
--
Michael

Attachments:

20141009_walreceiver_check_invalid.patchtext/x-patch; charset=US-ASCII; name=20141009_walreceiver_check_invalid.patchDownload+18-0
#12Fujii Masao
masao.fujii@gmail.com
In reply to: Michael Paquier (#10)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Thu, Oct 9, 2014 at 3:26 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Oct 8, 2014 at 10:00 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:

The additional process at promotion sounds like a good idea, I'll try to
get a patch done tomorrow. This would result as well in removing the
XLogArchiveForceDone stuff. Either way, not that I have been able to
reproduce the problem manually, things can be clearly solved.

Please find attached two patches aimed to fix this issue and to improve the
situation:
- 0001 prevents the apparition of those phantom WAL segment file by ensuring
that when a node is in recovery it will remove it whatever its status in
archive_status. This patch is the real fix, and should be applied down to
9.2.
- 0002 is a patch implementing Heikki's idea of enforcing all the segment
files present in pg_xlog to have their status to .done, marking them for
removal. When looking at the code, I finally concluded that Fujii-san's
point, about marking in all cases as .done segment files that have been
fully streamed, actually makes more sense to not be backward. This patch
would actually not be mandatory for back-patching, but it makes the process
more robust IMO.

Thanks for the patches!

I found one problem in the 0002 patch. The patch changes the recovery so that
it creates .done files for every WAL files which exist in pg_xlog directory at
the end of recovery. But even WAL files which will have to be archived later
can exist in pg_xlog at that moment. For example, the latest, recycled and
fully-written-but-not-archived-yet (i.e., maybe having .ready files) WAL files.
The patch wrongly prevents them from being archvied at all.

ISTM that the 0001 patch has the similar problem. Please imagine the following
scenario.

1. There are many unarchived WAL files in pg_xlog because of the continuous
failure of archive_command, for example, and then the server unfortunately
crashes because of the corruption of database itself.

2. DBA restores the backup onto the server and copies all the WAL files
from old pg_xlog to new one. Then he or she prepares for archive recovery.

3. DBA starts the server and the archive recovery starts.

4. After all the archived WAL files are replayed, the server tries to replay
the WAL files in pg_xlog. Since there are many WAL files in pg_xlog,
more than one restartpoints happen while they are being replayed.

In this case, the patch seems to make the restartpoint recycle even WAL files
which have .ready files and will have to be archived later. Thought?

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#12)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Fri, Oct 17, 2014 at 9:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Thu, Oct 9, 2014 at 3:26 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Oct 8, 2014 at 10:00 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:

The additional process at promotion sounds like a good idea, I'll try to
get a patch done tomorrow. This would result as well in removing the
XLogArchiveForceDone stuff. Either way, not that I have been able to
reproduce the problem manually, things can be clearly solved.

Please find attached two patches aimed to fix this issue and to improve the
situation:
- 0001 prevents the apparition of those phantom WAL segment file by ensuring
that when a node is in recovery it will remove it whatever its status in
archive_status. This patch is the real fix, and should be applied down to
9.2.
- 0002 is a patch implementing Heikki's idea of enforcing all the segment
files present in pg_xlog to have their status to .done, marking them for
removal. When looking at the code, I finally concluded that Fujii-san's
point, about marking in all cases as .done segment files that have been
fully streamed, actually makes more sense to not be backward. This patch
would actually not be mandatory for back-patching, but it makes the process
more robust IMO.

Thanks for the patches!

While reviewing the patch, I found another bug related to WAL file in recovery
mode. The problem is that exitArchiveRecovery() always creates .ready file for
the last WAL file of the old timeline even when it's restored from the archive
and has .done file. So this causes the already-archived WAL file to be archived
again.... Attached patch fixes this bug.

Regards,

--
Fujii Masao

Attachments:

bugfix-last-restored-wal.patchtext/x-patch; charset=US-ASCII; name=bugfix-last-restored-wal.patchDownload+19-13
#14Michael Paquier
michael@paquier.xyz
In reply to: Fujii Masao (#12)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Fri, Oct 17, 2014 at 9:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

In this case, the patch seems to make the restartpoint recycle even WAL
files
which have .ready files and will have to be archived later. Thought?

The real problem currently is that it is possible to have a segment file
not marked as .done during recovery when stream connection is abruptly cut
when this segment is switched, marking it as .ready in archive_status and
simply letting this segment in pg_xlog because it will neither be recycled
nor removed. I have not been able to look much at this code these days, so
I am not sure how invasive it would be in back-branches, but perhaps we
should try to improve code such as when a segment file is switched and
connection to the is cut, we guarantee that this file is completed and
marked as .done.
--
Michael

#15Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#14)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Fri, Oct 17, 2014 at 10:37 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Oct 17, 2014 at 9:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

In this case, the patch seems to make the restartpoint recycle even WAL files
which have .ready files and will have to be archived later. Thought?

The real problem currently is that it is possible to have a segment file not marked as .done during recovery when stream connection is abruptly cut when this segment is switched, marking it as .ready in archive_status and simply letting this segment in pg_xlog because it will neither be recycled nor removed. I have not been able to look much at this code these days, so I am not sure how invasive it would be in back-branches, but perhaps we should try to improve code such as when a segment file is switched and connection to the is cut, we guarantee that this file is completed and marked as .done.

I have spent more time on that, with a bit more of underground work...
First, the problem can be reproduced most of the time by running this
simple command:
psql -c 'select pg_switch_xlog()'; pg_ctl restart -m immediate

This will enforce a segment file switch and restart the master in
crash recovery. This has as effect to immediately cut the WAL stream
on slave, symbolized by a FATAL in libpqrcv_receive where rawlen == 0.
For example, let's imagine that stream fails when switching from
000000010000000000000003 to the next segment, then the
the last XLogRecPtr in XLogWalRcvProcessMsg for dataStart is for
example 0/3100000, and walrcv->latestWalEnd is 0/4000000. When stream
restarts it will begin once again from 0/4000000, ignoring that
000000010000000000000003 should be marked as .done, ultimately marking
it in .ready state when old segment files are recycled or removed.
There is nothing that can really be done to enforce the creation of a
.done file before the FATAL of libpqrcv_receive because we cannot
predict the stream failure..

Now, we can do better than what we have now by looking at WAL start
position used when starting streaming in WAL receiver and enforce
.done if the start position is the last one of previous segment.
Hence, in the case of start position 0/4000000 that was found
previously, the file that will be enforced to .done is
000000010000000000000003. I have written the patch attached that
implements this idea and fixes the problem. Now let's see if you guys
see any flaws in this simple logic which uses a sniper gun instead of
a bazooka as in the previous patches sent.

Regards,
--
Michael

Attachments:

20141020_wal_phantom_fix.patchtext/x-patch; charset=US-ASCII; name=20141020_wal_phantom_fix.patchDownload+39-1
#16Michael Paquier
michael@paquier.xyz
In reply to: Fujii Masao (#13)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Fri, Oct 17, 2014 at 10:12 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Fri, Oct 17, 2014 at 9:23 PM, Fujii Masao <masao.fujii@gmail.com>
wrote:

On Thu, Oct 9, 2014 at 3:26 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Oct 8, 2014 at 10:00 PM, Michael Paquier <

michael.paquier@gmail.com>

wrote:

The additional process at promotion sounds like a good idea, I'll try

to

get a patch done tomorrow. This would result as well in removing the
XLogArchiveForceDone stuff. Either way, not that I have been able to
reproduce the problem manually, things can be clearly solved.

Please find attached two patches aimed to fix this issue and to improve

the

situation:
- 0001 prevents the apparition of those phantom WAL segment file by

ensuring

that when a node is in recovery it will remove it whatever its status in
archive_status. This patch is the real fix, and should be applied down

to

9.2.
- 0002 is a patch implementing Heikki's idea of enforcing all the

segment

files present in pg_xlog to have their status to .done, marking them for
removal. When looking at the code, I finally concluded that Fujii-san's
point, about marking in all cases as .done segment files that have been
fully streamed, actually makes more sense to not be backward. This patch
would actually not be mandatory for back-patching, but it makes the

process

more robust IMO.

Thanks for the patches!

While reviewing the patch, I found another bug related to WAL file in
recovery
mode. The problem is that exitArchiveRecovery() always creates .ready file
for
the last WAL file of the old timeline even when it's restored from the
archive
and has .done file. So this causes the already-archived WAL file to be
archived
again.... Attached patch fixes this bug.

That's a good catch! Patch looks good. I think that you should change
xlogpath to use MAXFNAMELEN instead of MAXPGPATH in exitArchiveRecovery.
This is only for correctness, so that's a master-only remark, because this
variable is used only to calculate a segment file name, and not a path.
Renaming the variable from xlogpath to xlogname would make sense as well.
Regards,
--
Michael

#17Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Michael Paquier (#15)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On 10/20/2014 09:26 AM, Michael Paquier wrote:

On Fri, Oct 17, 2014 at 10:37 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Oct 17, 2014 at 9:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

In this case, the patch seems to make the restartpoint recycle even WAL files
which have .ready files and will have to be archived later. Thought?

The real problem currently is that it is possible to have a segment file not marked as .done during recovery when stream connection is abruptly cut when this segment is switched, marking it as .ready in archive_status and simply letting this segment in pg_xlog because it will neither be recycled nor removed. I have not been able to look much at this code these days, so I am not sure how invasive it would be in back-branches, but perhaps we should try to improve code such as when a segment file is switched and connection to the is cut, we guarantee that this file is completed and marked as .done.

I have spent more time on that, with a bit more of underground work...
First, the problem can be reproduced most of the time by running this
simple command:
psql -c 'select pg_switch_xlog()'; pg_ctl restart -m immediate

This will enforce a segment file switch and restart the master in
crash recovery. This has as effect to immediately cut the WAL stream
on slave, symbolized by a FATAL in libpqrcv_receive where rawlen == 0.
For example, let's imagine that stream fails when switching from
000000010000000000000003 to the next segment, then the
the last XLogRecPtr in XLogWalRcvProcessMsg for dataStart is for
example 0/3100000, and walrcv->latestWalEnd is 0/4000000. When stream
restarts it will begin once again from 0/4000000, ignoring that
000000010000000000000003 should be marked as .done, ultimately marking
it in .ready state when old segment files are recycled or removed.
There is nothing that can really be done to enforce the creation of a
.done file before the FATAL of libpqrcv_receive because we cannot
predict the stream failure..

Now, we can do better than what we have now by looking at WAL start
position used when starting streaming in WAL receiver and enforce
.done if the start position is the last one of previous segment.
Hence, in the case of start position 0/4000000 that was found
previously, the file that will be enforced to .done is
000000010000000000000003. I have written the patch attached that
implements this idea and fixes the problem. Now let's see if you guys
see any flaws in this simple logic which uses a sniper gun instead of
a bazooka as in the previous patches sent.

Hmm. This will still miss the .done file if you don't re-establish the
streaming replication connection after the restart. For example, if you
shut down the master, and promote the standby server.

I think we should take a more wholesale approach to this. We should
enforce the rule that the server only ever archives WAL files belonging
to the same timeline that the server generates. IOW, the server only
archives the WAL that it has generated.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18Michael Paquier
michael@paquier.xyz
In reply to: Fujii Masao (#12)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Fri, Oct 17, 2014 at 2:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

I found one problem in the 0002 patch. The patch changes the recovery so
that
it creates .done files for every WAL files which exist in pg_xlog
directory at
the end of recovery. But even WAL files which will have to be archived
later
can exist in pg_xlog at that moment. For example, the latest, recycled and
fully-written-but-not-archived-yet (i.e., maybe having .ready files) WAL
files.
The patch wrongly prevents them from being archived at all.

Re-looking at patch 2, yes you are right. Even if it was mentioned that we
should do that for a node that had promotion triggered it was not done this
way as a check on CheckForStandbyTrigger() is actually missing. Attached is
an updated patch.

ISTM that the 0001 patch has the similar problem. Please imagine the
following
scenario.
In this case, the patch seems to make the restartpoint recycle even WAL
files
which have .ready files and will have to be archived later. Thought?

Right, that's really backward. This was the approach taken before c9cc7e0,
and this commit actually prevents removal of unarchived WAL files during
recovery.
--
Michael

Attachments:

20141022_enforce_done_at_promotion.patchtext/x-diff; charset=US-ASCII; name=20141022_enforce_done_at_promotion.patchDownload+49-2
#19Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Jehan-Guillaume de Rorthais (#1)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On 10/22/2014 04:24 PM, Michael Paquier wrote:

On Wed, Oct 22, 2014 at 1:53 PM, Heikki Linnakangas <hlinnakangas@vmware.com

wrote:

I think we should take a more wholesale approach to this. We should
enforce the rule that the server only ever archives WAL files belonging to
the same timeline that the server generates. IOW, the server only archives
the WAL that it has generated.

Hm?! Would that be really back-patchable? There may be in the wild tools or
users that rely on the fact a node archives segment files from all
timelines.

Hmm, so it would be a tool or user that manually copies a file to the
pg_xlog directory of a standby server, and expects the standby to
archive the file after promotion. That seems a bit far-fetched, although
I've seen people do strange things. I think it would be acceptable as
long as we document the change in behavior in the release notes. I don't
have much hope that we'll ever be able to nail down the correct behavior
with the current approach.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20Fujii Masao
masao.fujii@gmail.com
In reply to: Michael Paquier (#15)
Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

On Mon, Oct 20, 2014 at 3:26 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Oct 17, 2014 at 10:37 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Oct 17, 2014 at 9:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

In this case, the patch seems to make the restartpoint recycle even WAL files
which have .ready files and will have to be archived later. Thought?

The real problem currently is that it is possible to have a segment file not marked as .done during recovery when stream connection is abruptly cut when this segment is switched, marking it as .ready in archive_status and simply letting this segment in pg_xlog because it will neither be recycled nor removed. I have not been able to look much at this code these days, so I am not sure how invasive it would be in back-branches, but perhaps we should try to improve code such as when a segment file is switched and connection to the is cut, we guarantee that this file is completed and marked as .done.

I have spent more time on that, with a bit more of underground work...
First, the problem can be reproduced most of the time by running this
simple command:
psql -c 'select pg_switch_xlog()'; pg_ctl restart -m immediate

What about fixing this problem directly? That is, we can make walreceiver
check whether the end of last received WAL data is the end of current WAL file
or not, and then close the WAL file and create .done file if the test is true.

This is not a perfect solution. If the standby crashes during very
short interval
(i.e., after closing the WAL file and before creating .done file), the problem
would happen again. But it can really rarely happen, so I don't think that it's
worth fixing the corner case at least in back-branches. Of course, we can
find out the "perfect" solution for the master, though.

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21Fujii Masao
masao.fujii@gmail.com
In reply to: Michael Paquier (#16)
#22Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#20)
#23Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#22)
#24Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Michael Paquier (#23)
#25Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#22)
#26Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#24)
#27Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#26)
#28Michael Paquier
michael@paquier.xyz
In reply to: Fujii Masao (#27)
#29Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#24)
#30Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#29)
#31Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#22)
#32Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#31)
#33Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#31)
#34Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#31)
#35Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#34)
#36Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#35)
#37Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#35)
#38Dennis Kögel
dk@neveragain.de
In reply to: Heikki Linnakangas (#37)
#39Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Dennis Kögel (#38)
#40Guillaume Lelarge
guillaume@lelarge.info
In reply to: Heikki Linnakangas (#39)
#41Michael Paquier
michael@paquier.xyz
In reply to: Guillaume Lelarge (#40)
#42Dennis Kögel
dk@neveragain.de
In reply to: Michael Paquier (#41)