[BUG] Archive recovery failure on 9.3+.

Started by Kyotaro Horiguchiover 12 years ago29 messageshackers
Jump to latest
#1Kyotaro Horiguchi
horikyota.ntt@gmail.com

Hello, we happened to see server crash on archive recovery under
some condition.

After TLI was incremented, there should be the case that the WAL
file for older timeline is archived but not for that of the same
segment id but for newer timeline. Archive recovery should fail
for the case with PANIC error like follows,

| PANIC: record with zero length at 0/1820D40

Replay script is attached. This issue occured for 9.4dev, 9.3.2,
and not for 9.2.6 and 9.1.11. The latter search pg_xlog for the
TLI before trying archive for older TLIs.

This occurrs during fetching checkpoint redo record in archive
recovery.

if (checkPoint.redo < RecPtr)
{
/* back up to find the record */
record = ReadRecord(xlogreader, checkPoint.redo, PANIC, false);

And this is caused by that the segment file for older timeline in
archive directory is preferred to that for newer timeline in
pg_xlog.

Looking into pg_xlog before trying the older TLIs in archive like
9.2- fixes this issue. The attached patch is one possible
solution for 9.4dev.

Attached files are,

- recvtest.sh: Replay script. Step 1 and 2 makes the condition
and step 3 causes the issue.

- archrecvfix_20131212.patch: The patch fixes the issue. Archive
recovery reads pg_xlog before trying older TLI in archive
similarly to 9.1- by this patch.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

recvtest.shtext/plain; charset=us-asciiDownload
archrecvfix_20131212.patchtext/x-patch; charset=us-asciiDownload+6-3
#2Tomonari Katsumata
t.katsumata1122@gmail.com
In reply to: Kyotaro Horiguchi (#1)
Re: [BUG] Archive recovery failure on 9.3+.

Hi,

Somebody is reading this thread?

This problem seems still remaining on REL9_3_STABLE.
Many users would face this problem, so we should
resolve this in next release.

I think his patch is reasonable to fix this problem.

Please check this again.

regards,
--------------------------
Tomonari Katsumata

2013/12/12 Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>

Show quoted text

Hello, we happened to see server crash on archive recovery under
some condition.

After TLI was incremented, there should be the case that the WAL
file for older timeline is archived but not for that of the same
segment id but for newer timeline. Archive recovery should fail
for the case with PANIC error like follows,

| PANIC: record with zero length at 0/1820D40

Replay script is attached. This issue occured for 9.4dev, 9.3.2,
and not for 9.2.6 and 9.1.11. The latter search pg_xlog for the
TLI before trying archive for older TLIs.

This occurrs during fetching checkpoint redo record in archive
recovery.

if (checkPoint.redo < RecPtr)
{
/* back up to find the record */
record = ReadRecord(xlogreader, checkPoint.redo, PANIC, false);

And this is caused by that the segment file for older timeline in
archive directory is preferred to that for newer timeline in
pg_xlog.

Looking into pg_xlog before trying the older TLIs in archive like
9.2- fixes this issue. The attached patch is one possible
solution for 9.4dev.

Attached files are,

- recvtest.sh: Replay script. Step 1 and 2 makes the condition
and step 3 causes the issue.

- archrecvfix_20131212.patch: The patch fixes the issue. Archive
recovery reads pg_xlog before trying older TLI in archive
similarly to 9.1- by this patch.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Kyotaro Horiguchi (#1)
Re: [BUG] Archive recovery failure on 9.3+.

On Thu, Dec 12, 2013 at 11:00 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Hello, we happened to see server crash on archive recovery under
some condition.

After TLI was incremented, there should be the case that the WAL
file for older timeline is archived but not for that of the same
segment id but for newer timeline. Archive recovery should fail
for the case with PANIC error like follows,

| PANIC: record with zero length at 0/1820D40

Replay script is attached. This issue occured for 9.4dev, 9.3.2,
and not for 9.2.6 and 9.1.11. The latter search pg_xlog for the
TLI before trying archive for older TLIs.

This occurrs during fetching checkpoint redo record in archive
recovery.

if (checkPoint.redo < RecPtr)
{
/* back up to find the record */
record = ReadRecord(xlogreader, checkPoint.redo, PANIC, false);

And this is caused by that the segment file for older timeline in
archive directory is preferred to that for newer timeline in
pg_xlog.

Looking into pg_xlog before trying the older TLIs in archive like
9.2- fixes this issue. The attached patch is one possible
solution for 9.4dev.

Attached files are,

- recvtest.sh: Replay script. Step 1 and 2 makes the condition
and step 3 causes the issue.

- archrecvfix_20131212.patch: The patch fixes the issue. Archive
recovery reads pg_xlog before trying older TLI in archive
similarly to 9.1- by this patch.

regards,

Horiguchi-san,

Wonder if the following commit in 9.2 branch was to address a
similar/same problem?
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=4aed94f1660fb55bc825bf7f3135379dab28eb55

--
Amit Langote

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Kyotaro Horiguchi (#1)
Re: [BUG] Archive recovery failure on 9.3+.

On 12/12/2013 04:00 AM, Kyotaro HORIGUCHI wrote:

Hello, we happened to see server crash on archive recovery under
some condition.

After TLI was incremented, there should be the case that the WAL
file for older timeline is archived but not for that of the same
segment id but for newer timeline. Archive recovery should fail
for the case with PANIC error like follows,

| PANIC: record with zero length at 0/1820D40

Replay script is attached. This issue occured for 9.4dev, 9.3.2,
and not for 9.2.6 and 9.1.11. The latter search pg_xlog for the
TLI before trying archive for older TLIs.

This occurrs during fetching checkpoint redo record in archive
recovery.

if (checkPoint.redo < RecPtr)
{
/* back up to find the record */
record = ReadRecord(xlogreader, checkPoint.redo, PANIC, false);

Hmm. After running the script, pg_controldata says:

Latest checkpoint location: 0/2000090
Prior checkpoint location: 0/2000090
Latest checkpoint's REDO location: 0/2000090

The PANIC is actually coming from here:

/*
* Re-fetch the last valid or last applied record, so we can identify the
* exact endpoint of what we consider the valid portion of WAL.
*/
record = ReadRecord(xlogreader, LastRec, PANIC, false);

If recovery started from an online checkpoint, then I think you'd get a
similar PANIC from the code you quoted.

And this is caused by that the segment file for older timeline in
archive directory is preferred to that for newer timeline in
pg_xlog.

Yep.

Looking into pg_xlog before trying the older TLIs in archive like
9.2- fixes this issue. The attached patch is one possible
solution for 9.4dev.

Attached files are,

- recvtest.sh: Replay script. Step 1 and 2 makes the condition
and step 3 causes the issue.

- archrecvfix_20131212.patch: The patch fixes the issue. Archive
recovery reads pg_xlog before trying older TLI in archive
similarly to 9.1- by this patch.

Hmm, that seems reasonable at a quick glance. I think it also needs a
change to the state transition code earlier in the loop, to not move
from XLOG_FROM_ARCHIVE to XLOG_FROM_PG_XLOG, if we've already tried
reading pg_xlog.

Another way to look at this is that it's wrong that we immediately PANIC
if we successfully restore a file from archive, but then fail to read
the record we're looking for. Instead, we should advance to next state
in the state machine, ie. try reading the same record from pg_xlog, and
only give up if the read fails from all sources.

Yet another way to look at this is that we shouldn't even try to read
the file with TLI 1 from the archive, when we know the checkpoint record
is on timeline 2. It can't possibly work.

I've been thinking for some time that we should stop doing the scan of
all possible TLIs, and only try to read the exact file that contains the
record we're reading. We have that information now, in the timeline
history file, and we already do that during streaming replication. I was
afraid of changing the behavior of archive recovery in 9.4, but maybe
that's the way to go in the long term.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#4)
Re: [BUG] Archive recovery failure on 9.3+.

On 01/09/2014 08:18 PM, Heikki Linnakangas wrote:

On 12/12/2013 04:00 AM, Kyotaro HORIGUCHI wrote:

Hello, we happened to see server crash on archive recovery under
some condition.

After TLI was incremented, there should be the case that the WAL
file for older timeline is archived but not for that of the same
segment id but for newer timeline. Archive recovery should fail
for the case with PANIC error like follows,

| PANIC: record with zero length at 0/1820D40

Replay script is attached. This issue occured for 9.4dev, 9.3.2,
and not for 9.2.6 and 9.1.11. The latter search pg_xlog for the
TLI before trying archive for older TLIs.

This occurrs during fetching checkpoint redo record in archive
recovery.

if (checkPoint.redo < RecPtr)
{
/* back up to find the record */
record = ReadRecord(xlogreader, checkPoint.redo, PANIC, false);

Hmm. After running the script, pg_controldata says:

Latest checkpoint location: 0/2000090
Prior checkpoint location: 0/2000090
Latest checkpoint's REDO location: 0/2000090

The PANIC is actually coming from here:

/*
* Re-fetch the last valid or last applied record, so we can
identify the
* exact endpoint of what we consider the valid portion of WAL.
*/
record = ReadRecord(xlogreader, LastRec, PANIC, false);

If recovery started from an online checkpoint, then I think you'd get a
similar PANIC from the code you quoted.

And this is caused by that the segment file for older timeline in
archive directory is preferred to that for newer timeline in
pg_xlog.

Yep.

Actually, why is the partially-filled 000000010000000000000002 file
archived in the first place? Looking at the code, it's been like that
forever, but it seems like a bad idea. If the original server is still
up and running, and writing more data to that file, what will happen is
that when the original server later tries to archive it, it will fail
because the partial version of the file is already in the archive. Or
worse, the partial version overwrites a previously archived more
complete version.

This is the code that does that:

/*
* If we are establishing a new timeline, we have to copy data from the
* last WAL segment of the old timeline to create a starting WAL segment
* for the new timeline.
*
* Notify the archiver that the last WAL segment of the old timeline is
* ready to copy to archival storage. Otherwise, it is not archived for a
* while.
*/
if (endTLI != ThisTimeLineID)
{
XLogFileCopy(endLogSegNo, endTLI, endLogSegNo);

if (XLogArchivingActive())
{
XLogFileName(xlogpath, endTLI, endLogSegNo);
XLogArchiveNotify(xlogpath);
}
}

So, the rationale is that otherwise it would take a long time until that
segment is archived. To be precise, I don't think the segment with the
old TLI would ever be archived without the above, but the same segment
on the new timeline would, after it fills up.

Wouldn't it be better to not archive the old segment, and instead switch
to a new segment after writing the end-of-recovery checkpoint, so that
the segment on the new timeline is archived sooner?

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#5)
Re: [BUG] Archive recovery failure on 9.3+.

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

Actually, why is the partially-filled 000000010000000000000002 file
archived in the first place? ...

So, the rationale is that otherwise it would take a long time until that
segment is archived. To be precise, I don't think the segment with the
old TLI would ever be archived without the above, but the same segment
on the new timeline would, after it fills up.

Wouldn't it be better to not archive the old segment, and instead switch
to a new segment after writing the end-of-recovery checkpoint, so that
the segment on the new timeline is archived sooner?

Don't we want to archive both? If you want to recover to the end of the
old timeline, you're going to need that file too, no?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#6)
Re: [BUG] Archive recovery failure on 9.3+.

On 01/09/2014 10:16 PM, Tom Lane wrote:

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

Actually, why is the partially-filled 000000010000000000000002 file
archived in the first place? ...

So, the rationale is that otherwise it would take a long time until that
segment is archived. To be precise, I don't think the segment with the
old TLI would ever be archived without the above, but the same segment
on the new timeline would, after it fills up.

Wouldn't it be better to not archive the old segment, and instead switch
to a new segment after writing the end-of-recovery checkpoint, so that
the segment on the new timeline is archived sooner?

Don't we want to archive both? If you want to recover to the end of the
old timeline, you're going to need that file too, no?

Hmm. It should be the responsibility of the original server to archive
the segment on the old timeline. Mind you, partial segments are never
archived, except for this one case, so how did the old segment find its
way to the new server? A few possibilities come to mind: the DBA
manually copied it from the old server to pg_xlog, it was streamed by
streaming replication, or it was included in a base backup. The OP's
test script resembles the base backup case.

In all of those cases, I don't think it's the new server's
responsibility to archive it. If it was copied to pg_xlog manually, the
administrator may also copy it to the archive if he feels like it. If it
was streamed from a live server, the original server should take care of
it. If it was included in a backup, well, it's included in the backup so
it doesn't necessarily need to be archived.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#7)
Re: [BUG] Archive recovery failure on 9.3+.

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

On 01/09/2014 10:16 PM, Tom Lane wrote:

Don't we want to archive both? If you want to recover to the end of the
old timeline, you're going to need that file too, no?

Hmm. It should be the responsibility of the original server to archive
the segment on the old timeline.

Oh ... I was thinking about the case of a timeline switch within one
server, say as a result of PITR recovery. If you're sure this same case
doesn't come up that way, then nevermind.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#8)
Re: [BUG] Archive recovery failure on 9.3+.

On 01/09/2014 10:36 PM, Tom Lane wrote:

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

On 01/09/2014 10:16 PM, Tom Lane wrote:

Don't we want to archive both? If you want to recover to the end of the
old timeline, you're going to need that file too, no?

Hmm. It should be the responsibility of the original server to archive
the segment on the old timeline.

Oh ... I was thinking about the case of a timeline switch within one
server, say as a result of PITR recovery. If you're sure this same case
doesn't come up that way, then nevermind.

I'm not sure what you mean, but it does apply to same-server PITR
recovery too. In that case, again, if you copy the partial segment to
pg_xlog when doing the PITR recovery, clearly you already somehow have
access to the partial segment, and if you want to do another PITR to the
same point, you should just copy it again.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Josh Berkus
josh@agliodbs.com
In reply to: Kyotaro Horiguchi (#1)
Re: [BUG] Archive recovery failure on 9.3+.

On 01/09/2014 12:05 PM, Heikki Linnakangas wrote:

Actually, why is the partially-filled 000000010000000000000002 file
archived in the first place? Looking at the code, it's been like that
forever, but it seems like a bad idea. If the original server is still
up and running, and writing more data to that file, what will happen is
that when the original server later tries to archive it, it will fail
because the partial version of the file is already in the archive. Or
worse, the partial version overwrites a previously archived more
complete version.

Oh! This explains some transient errors I've seen.

Wouldn't it be better to not archive the old segment, and instead switch
to a new segment after writing the end-of-recovery checkpoint, so that
the segment on the new timeline is archived sooner?

It would be better to zero-fill and switch segments, yes. We should
NEVER be in a position of archiving two different versions of the same
segment.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Josh Berkus (#10)
Re: [BUG] Archive recovery failure on 9.3+.

On 01/09/2014 10:55 PM, Josh Berkus wrote:

On 01/09/2014 12:05 PM, Heikki Linnakangas wrote:

Actually, why is the partially-filled 000000010000000000000002 file
archived in the first place? Looking at the code, it's been like that
forever, but it seems like a bad idea. If the original server is still
up and running, and writing more data to that file, what will happen is
that when the original server later tries to archive it, it will fail
because the partial version of the file is already in the archive. Or
worse, the partial version overwrites a previously archived more
complete version.

Oh! This explains some transient errors I've seen.

Wouldn't it be better to not archive the old segment, and instead switch
to a new segment after writing the end-of-recovery checkpoint, so that
the segment on the new timeline is archived sooner?

It would be better to zero-fill and switch segments, yes. We should
NEVER be in a position of archiving two different versions of the same
segment.

Ok, I think we're in agreement that that's the way to go for master.

Now, what to do about back-branches? On one hand, I'd like to apply the
same fix to all stable branches, as the current behavior is silly and
always has been. On the other hand, we haven't heard any complaints
about it, so we probably shouldn't fix what ain't broken. Perhaps we
should apply it to 9.3, as that's where we have the acute problem the OP
reported. Thoughts?

In summary, I propose that we change master and REL9_3_STABLE to not
archive the partial segment from previous timeline. Older branches will
keep the current behavior.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Tomonari Katsumata
katsumata.tomonari@po.ntts.co.jp
In reply to: Heikki Linnakangas (#11)
Re: [BUG] Archive recovery failure on 9.3+.

Hi Heikki,

I need PostgreSQL9.3 which fixed this problem.

It didn't happen in PostgreSQL9.2, so I agree
with your proposal which changes are done
against 93_STABLE and master.

Can you fix this in next release(9.3.3)?

------------
Tomonari Katsumata

(2014/01/13 20:16), Heikki Linnakangas wrote:

On 01/09/2014 10:55 PM, Josh Berkus wrote:

On 01/09/2014 12:05 PM, Heikki Linnakangas wrote:

Actually, why is the partially-filled 000000010000000000000002 file
archived in the first place? Looking at the code, it's been like that
forever, but it seems like a bad idea. If the original server is still
up and running, and writing more data to that file, what will happen is
that when the original server later tries to archive it, it will fail
because the partial version of the file is already in the archive. Or
worse, the partial version overwrites a previously archived more
complete version.

Oh! This explains some transient errors I've seen.

Wouldn't it be better to not archive the old segment, and instead

switch

to a new segment after writing the end-of-recovery checkpoint, so that
the segment on the new timeline is archived sooner?

It would be better to zero-fill and switch segments, yes. We should
NEVER be in a position of archiving two different versions of the same
segment.

Ok, I think we're in agreement that that's the way to go for master.

Now, what to do about back-branches? On one hand, I'd like to apply

the same fix to all stable branches, as the current behavior is silly
and always has been. On the other hand, we haven't heard any complaints
about it, so we probably shouldn't fix what ain't broken. Perhaps we
should apply it to 9.3, as that's where we have the acute problem the OP
reported. Thoughts?

In summary, I propose that we change master and REL9_3_STABLE to not

archive the partial segment from previous timeline. Older branches will
keep the current behavior.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13Christoph Berg
myon@debian.org
In reply to: Heikki Linnakangas (#11)
Re: [BUG] Archive recovery failure on 9.3+.

Re: Heikki Linnakangas 2014-01-13 <52D3CAFF.3010807@vmware.com>

Actually, why is the partially-filled 000000010000000000000002 file
archived in the first place? Looking at the code, it's been like that
forever, but it seems like a bad idea. If the original server is still
up and running, and writing more data to that file, what will happen is
that when the original server later tries to archive it, it will fail
because the partial version of the file is already in the archive. Or
worse, the partial version overwrites a previously archived more
complete version.

Oh! This explains some transient errors I've seen.

Wouldn't it be better to not archive the old segment, and instead switch
to a new segment after writing the end-of-recovery checkpoint, so that
the segment on the new timeline is archived sooner?

It would be better to zero-fill and switch segments, yes. We should
NEVER be in a position of archiving two different versions of the same
segment.

Ok, I think we're in agreement that that's the way to go for master.

Now, what to do about back-branches? On one hand, I'd like to apply
the same fix to all stable branches, as the current behavior is
silly and always has been. On the other hand, we haven't heard any
complaints about it, so we probably shouldn't fix what ain't broken.
Perhaps we should apply it to 9.3, as that's where we have the acute
problem the OP reported. Thoughts?

In summary, I propose that we change master and REL9_3_STABLE to not
archive the partial segment from previous timeline. Older branches
will keep the current behavior.

I've seen the "can't archive file from the old timeline" problem on
9.2 and 9.3 slaves after promotion. The problem is in conjunction with
the proposed archive_command in the default postgresql.conf comments:

# e.g. 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'

With 9.1, it works, but 9.2 and 9.3 don't archive anything until I
remove the "test ! -f" part. (An alternative fix would be to declare
the behavior ok and adjust that example in the config.)

I've always blamed 9.2+'s cascading replication for this, but haven't
investigated deeper.

Christoph
--
cb@df7cb.de | http://www.df7cb.de/

#14Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Tomonari Katsumata (#12)
Re: [BUG] Archive recovery failure on 9.3+.

Hello,

I need PostgreSQL9.3 which fixed this problem.

It didn't happen in PostgreSQL9.2, so I agree
with your proposal which changes are done
against 93_STABLE and master.

Can you fix this in next release(9.3.3)?

I was going to push to move this a bit, but...

Wouldn't it be better to not archive the old segment, and instead
switch
to a new segment after writing the end-of-recovery checkpoint, so that
the segment on the new timeline is archived sooner?

It would be better to zero-fill and switch segments, yes. We should
NEVER be in a position of archiving two different versions of the same
segment.

Ok, I think we're in agreement that that's the way to go for master.

I've almost inclined to that but on some thoughts on the idea,
comming to think of recovery upto target timeline, the old
segment found to be necessary for the case. Without the old
segment, we would be obliged to seek to the first segment of the
*next* timeline (Is there any (simple) means to predict where is
it?) to complete the task. Is it the right way we kick the older
one out of archive?

Now, what to do about back-branches? On one hand, I'd like to apply
the same fix to all stable branches, as the current behavior is silly
and always has been. On the other hand, we haven't heard any
complaints about it, so we probably shouldn't fix what ain't
broken. Perhaps we should apply it to 9.3, as that's where we have the
acute problem the OP reported. Thoughts?

In summary, I propose that we change master and REL9_3_STABLE to not
archive the partial segment from previous timeline. Older branches
will keep the current behavior.

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Kyotaro Horiguchi (#14)
Re: [BUG] Archive recovery failure on 9.3+.

On 02/13/2014 08:44 AM, Kyotaro HORIGUCHI wrote:

Wouldn't it be better to not archive the old segment, and instead
switch
to a new segment after writing the end-of-recovery checkpoint, so that
the segment on the new timeline is archived sooner?

It would be better to zero-fill and switch segments, yes. We should
NEVER be in a position of archiving two different versions of the same
segment.

Ok, I think we're in agreement that that's the way to go for master.

I've almost inclined to that but on some thoughts on the idea,
comming to think of recovery upto target timeline, the old
segment found to be necessary for the case. Without the old
segment, we would be obliged to seek to the first segment of the
*next* timeline (Is there any (simple) means to predict where is
it?) to complete the task.

How did the server that created the new timeline get the old, partial,
segment? Was it already archived? Or did the DBA copy it into pg_xlog
manually? Or was it streamed by streaming replication? Whatever the
mechanism, the same mechanism ought to make sure the old segment is
available for PITR, too.

Hmm. If you have set up streaming replication and a WAL archive, and
your master dies and you fail over to a standby, what you describe does
happen. The partial old segment is not in the archive, so you cannot
PITR to a point in the old timeline that falls within the partial
segment (ie. just before the failover). However, it's not guaranteed
that all the preceding WAL segments on the old timeline were already
archived, anyway, so even if the partial segment is archived, it's not
guaranteed to work.

The old master is responsible for archiving the WAL on the old timeline,
and the new master is responsible for archiving all the WAL on the new
timeline. That's a straightforward, easy-to-understand rule. It might be
useful to have a mode where the standby also archives all the received
WAL, but that would need to be a separate option.

Is it the right way we kick the older one out of archive?

If it's already in the archive, it's not going to be removed from the
archive.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Christoph Berg (#13)
Re: [BUG] Archive recovery failure on 9.3+.

On 02/12/2014 01:24 PM, Christoph Berg wrote:

Re: Heikki Linnakangas 2014-01-13 <52D3CAFF.3010807@vmware.com>

Actually, why is the partially-filled 000000010000000000000002 file
archived in the first place? Looking at the code, it's been like that
forever, but it seems like a bad idea. If the original server is still
up and running, and writing more data to that file, what will happen is
that when the original server later tries to archive it, it will fail
because the partial version of the file is already in the archive. Or
worse, the partial version overwrites a previously archived more
complete version.

Oh! This explains some transient errors I've seen.

Wouldn't it be better to not archive the old segment, and instead switch
to a new segment after writing the end-of-recovery checkpoint, so that
the segment on the new timeline is archived sooner?

It would be better to zero-fill and switch segments, yes. We should
NEVER be in a position of archiving two different versions of the same
segment.

Ok, I think we're in agreement that that's the way to go for master.

Now, what to do about back-branches? On one hand, I'd like to apply
the same fix to all stable branches, as the current behavior is
silly and always has been. On the other hand, we haven't heard any
complaints about it, so we probably shouldn't fix what ain't broken.
Perhaps we should apply it to 9.3, as that's where we have the acute
problem the OP reported. Thoughts?

In summary, I propose that we change master and REL9_3_STABLE to not
archive the partial segment from previous timeline. Older branches
will keep the current behavior.

I've seen the "can't archive file from the old timeline" problem on
9.2 and 9.3 slaves after promotion. The problem is in conjunction with
the proposed archive_command in the default postgresql.conf comments:

# e.g. 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'

With 9.1, it works, but 9.2 and 9.3 don't archive anything until I
remove the "test ! -f" part. (An alternative fix would be to declare
the behavior ok and adjust that example in the config.)

Hmm, the behavior is the same in 9.1 and 9.2. Did you use a different
archive_command in 9.1, without the "test"?

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Heikki Linnakangas (#15)
Re: [BUG] Archive recovery failure on 9.3+.

Hello, I might have been misunderstood your words.

At Thu, 13 Feb 2014 10:11:22 +0200, Heikki Linnakangas <hlinnakangas@vmware.com> wrote in <52FC7E2A.9060703@vmware.com>

On 02/13/2014 08:44 AM, Kyotaro HORIGUCHI wrote:

Wouldn't it be better to not archive the old segment, and instead
switch
to a new segment after writing the end-of-recovery checkpoint, so that
the segment on the new timeline is archived sooner?

It would be better to zero-fill and switch segments, yes. We should
NEVER be in a position of archiving two different versions of the same
segment.

Ok, I think we're in agreement that that's the way to go for master.

Does this mean that no mechanical solution for this situation
will be given and an operator should remove the older segment for
the same segment id before starting recovoery process?

I've almost inclined to that but on some thoughts on the idea,
comming to think of recovery upto target timeline, the old
segment found to be necessary for the case. Without the old
segment, we would be obliged to seek to the first segment of the
*next* timeline (Is there any (simple) means to predict where is
it?) to complete the task.

How did the server that created the new timeline get the old, partial,
segment? Was it already archived? Or did the DBA copy it into pg_xlog
manually? Or was it streamed by streaming replication? Whatever the
mechanism, the same mechanism ought to make sure the old segment is
available for PITR, too.

Sure.

Hmm. If you have set up streaming replication and a WAL archive, and
your master dies and you fail over to a standby, what you describe
does happen. The partial old segment is not in the archive, so you
cannot PITR to a point in the old timeline that falls within the
partial segment (ie. just before the failover). However, it's not
guaranteed that all the preceding WAL segments on the old timeline
were already archived, anyway, so even if the partial segment is
archived, it's not guaranteed to work.

Yes, and putting aside the insane or vanished segments in
archive, I understand that a pair of master and standby (standby
and cascaded standby and so on, too) can share one WAL archive,
or archived WAL segments and all the WAL segments not archived
and left in pg_xlog of the old master should be merged into WAL
archive of new master (promoted old slave) to keep the
availability of the online backup taken from the old master. Even
with the shared WAL archive, missing segments in archive should
be filled up using pg_xlog though. Nevertheless, the process can
be implemented in automatic way.

The test script at first of this thread is for the case of shared
archive and I have unconsciously put that as the context.

The old master is responsible for archiving the WAL on the old
timeline, and the new master is responsible for archiving all the WAL
on the new timeline. That's a straightforward, easy-to-understand
rule.

Yes, I was somewhat confused because of my assumption of shared
archive, but it actually can be converged into single archive,
and the older version of PostgreSQL could cope with that
situation.

It might be useful to have a mode where the standby also
archives all the received WAL, but that would need to be a separate
option.

Perhaps such a mechanism is not demanded :)

Is it the right way we kick the older one out of archive?

If it's already in the archive, it's not going to be removed from the
archive.

I have understood the conclusion so far is not archiving the
older segment when promotion but it seems a bit odd story as you
suggested. If the conclusion here is no aid as my new
understanding, would you let me hear the reason why recovery have
changed to prefer archive to pg_xlog?

The commit abf5c5c9a4 seems to change the behavior but I don't
find the reason for the change.

ragards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Kyotaro Horiguchi (#17)
Re: [BUG] Archive recovery failure on 9.3+.

On 02/13/2014 01:37 PM, Kyotaro HORIGUCHI wrote:

At Thu, 13 Feb 2014 10:11:22 +0200, Heikki Linnakangas <hlinnakangas@vmware.com> wrote in <52FC7E2A.9060703@vmware.com>

On 02/13/2014 08:44 AM, Kyotaro HORIGUCHI wrote:

Wouldn't it be better to not archive the old segment, and instead
switch
to a new segment after writing the end-of-recovery checkpoint, so that
the segment on the new timeline is archived sooner?

It would be better to zero-fill and switch segments, yes. We should
NEVER be in a position of archiving two different versions of the same
segment.

Ok, I think we're in agreement that that's the way to go for master.

Does this mean that no mechanical solution for this situation
will be given and an operator should remove the older segment for
the same segment id before starting recovoery process?

I don't understand. With the proposed change to never archive a partial
segment, there is no segment with the same filename in the archive.
There is nothing to remove. Or did you mean that you have to do that in
9.0-9.2 back-branches, if we only back-patch this change to 9.3?

If it's already in the archive, it's not going to be removed from the
archive.

I have understood the conclusion so far is not archiving the
older segment when promotion but it seems a bit odd story as you
suggested. If the conclusion here is no aid as my new
understanding, would you let me hear the reason why recovery have
changed to prefer archive to pg_xlog?

The commit abf5c5c9a4 seems to change the behavior but I don't
find the reason for the change.

Well, it didn't work too well before abf5c5c9a4 either. With your test
script, and commit abf5c5c9a4 reverted [1]just applying commit abf5c5c9a4 in reverse doesn't apply cleanly due to further changes to the code, but I did this to simulate the old behavior:, you got this:

### STAGE 3/3: 2/2 START SERVER IN ARCHIVE RECOVERY MODE 2ND RUN ###
waiting for server to start....LOG: database system was interrupted;
last known up at 2014-02-13 14:17:53 EET
LOG: starting archive recovery
LOG: restored log file "00000002.history" from archive
/bin/cp: cannot stat
�/home/heikki/pgsql.93stable/test/arc/000000020000000000000002�: No such
file or directory
LOG: restored log file "000000010000000000000002" from archive
LOG: record with zero length at 0/2000090
LOG: invalid primary checkpoint record
/bin/cp: cannot stat
�/home/heikki/pgsql.93stable/test/arc/000000020000000000000002�: No such
file or directory
LOG: restored log file "000000010000000000000002" from archive
LOG: using previous checkpoint record at 0/2000028
LOG: consistent recovery state reached at 0/2000028
LOG: record with zero length at 0/2000090
LOG: redo is not required
/bin/cp: cannot stat
�/home/heikki/pgsql.93stable/test/arc/000000020000000000000002�: No such
file or directory
LOG: restored log file "000000010000000000000002" from archive
/bin/cp: cannot stat
�/home/heikki/pgsql.93stable/test/arc/00000003.history�: No such file or
directory
LOG: selected new timeline ID: 3
LOG: restored log file "00000002.history" from archive
LOG: archive recovery complete
LOG: database system is ready to accept connections
LOG: autovacuum launcher started

Note the "invalid primary checkpoint record" message. It restored the
partial 000000010000000000000002 file from the archive, and tried to
recover using that instead of the 000000020000000000000002 file from
pg_xlog. The primary checkpoint record is not present in
000000010000000000000002, so it fell back to recover from the previous
checkpoint.

It was the exact same issue, even though it appeared to work if you
don't look carefully. The second recovery never replayed the WAL
generated on timeline 2, it branched off directly from timeline 1, at
the same point where timeline 2 did. You can see that by looking at the
history files. Also, if you create a table and insert a row to it after
the first recovery, just before killing the server, you'll see that the
row is not there after the second recovery. That's wrong; the data files
might already contain some changes from timeline 2, and you end up with
a corrupt database. Thanks to commit abf5c5c9a4, the situation is at
least detected, and the system refuses to start up.

The behavior where we prefer a segment from archive with lower TLI over
a file with higher TLI in pg_xlog actually changed in commit
a068c391ab0. Arguably changing it wasn't a good idea, but the problem
your test script demonstrates can be fixed by not archiving the partial
segment, with no change to the preference of archive/pg_xlog. As
discussed, archiving a partial segment seems like a bad idea anyway, so
let's just stop doing that.

[1]: just applying commit abf5c5c9a4 in reverse doesn't apply cleanly due to further changes to the code, but I did this to simulate the old behavior:
to further changes to the code, but I did this to simulate the old behavior:

diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index 6a4a504..352d3f7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4366,6 +4366,7 @@ readRecoveryCommandFile(void)

/* Enable fetching from archive recovery area */
ArchiveRecoveryRequested = true;
+ InArchiveRecovery = true;

/*
* If user specified recovery_target_timeline, validate it or compute the

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19Christoph Berg
myon@debian.org
In reply to: Heikki Linnakangas (#16)
Re: [BUG] Archive recovery failure on 9.3+.

Re: Heikki Linnakangas 2014-02-13 <52FC9468.4050602@vmware.com>

With 9.1, it works, but 9.2 and 9.3 don't archive anything until I
remove the "test ! -f" part. (An alternative fix would be to declare
the behavior ok and adjust that example in the config.)

Hmm, the behavior is the same in 9.1 and 9.2. Did you use a
different archive_command in 9.1, without the "test"?

The behavior is different: 9.1 doesn't try to re-archive the last
segment from the old timeline. (I've seen it always happen in training
workshops, so that's always mostly idle machines. Possibly 9.1 will
re-archive the last segment under load.)

I've just reproduced it here with these settings (in Debian's
/etc/postgresql-common/createcluster.conf syntax):
archive_mode = on
archive_command = 'test ! -f /tmp/pg-%v/%%f && cp %%p /tmp/pg-%v/%%f'
wal_level = hot_standby
max_wal_senders = 3

... pg_basebackup ...

recovery.conf:
restore_command = 'cp /tmp/pg-9.3/%f %p'

9.1 slave log (in German, but it should be easy to figure out):

LOG: konsistenter Wiederherstellungszustand erreicht bei 0/20000D8
cp: der Aufruf von stat für „/tmp/pg-9.1/000000010000000000000003“ ist nicht möglich: Datei oder Verzeichnis nicht gefunden
LOG: konnte Datei »pg_xlog/000000010000000000000003« nicht öffnen (Logdatei 0, Segment 3): Datei oder Verzeichnis nicht gefunden
LOG: Redo fertig bei 0/20000D8
LOG: Logdatei »000000010000000000000002« aus Archiv wiederhergestellt
cp: der Aufruf von stat für „/tmp/pg-9.1/00000002.history“ ist nicht möglich: Datei oder Verzeichnis nicht gefunden
LOG: gewählte neue Timeline-ID: 2
cp: der Aufruf von stat für „/tmp/pg-9.1/00000001.history“ ist nicht möglich: Datei oder Verzeichnis nicht gefunden
LOG: Wiederherstellung aus Archiv abgeschlossen
LOG: unvollständiges Startpaket
LOG: Autovacuum-Launcher startet
LOG: Datenbanksystem ist bereit, um Verbindungen anzunehmen
(eof)

ls /tmp/pg-9.1
000000010000000000000001 000000010000000000000002.00000020.backup 000000020000000000000003
000000010000000000000002 000000020000000000000002 00000002.history

9.2 slave log:

LOG: konsistenter Wiederherstellungszustand erreicht bei 0/20000E0
cp: der Aufruf von stat für „/tmp/pg-9.2/000000010000000000000003“ ist nicht möglich: Datei oder Verzeichnis nicht gefunden
LOG: konnte Datei »pg_xlog/000000010000000000000003« nicht öffnen (Logdatei 0, Segment 3): Datei oder Verzeichnis nicht gefunden
LOG: Redo fertig bei 0/20000E0
LOG: Logdatei »000000010000000000000002« aus Archiv wiederhergestellt
cp: der Aufruf von stat für „/tmp/pg-9.2/00000002.history“ ist nicht möglich: Datei oder Verzeichnis nicht gefunden
LOG: gewählte neue Timeline-ID: 2
cp: der Aufruf von stat für „/tmp/pg-9.2/00000001.history“ ist nicht möglich: Datei oder Verzeichnis nicht gefunden
LOG: Wiederherstellung aus Archiv abgeschlossen
LOG: unvollständiges Startpaket
LOG: Autovacuum-Launcher startet
LOG: Datenbanksystem ist bereit, um Verbindungen anzunehmen
LOG: Archivbefehl ist fehlgeschlagen mit Statuscode 1
DETAIL: Der fehlgeschlagene Archivbefehl war: test ! -f /tmp/pg-9.2/000000010000000000000002 && cp pg_xlog/000000010000000000000002 /tmp/pg-9.2/000000010000000000000002
LOG: Archivbefehl ist fehlgeschlagen mit Statuscode 1
...

ls /tmp/pg-9.2
000000010000000000000001 000000010000000000000002 000000010000000000000002.00000020.backup

The 9.3 log/ls looks the same.

Mit freundlichen Grüßen,
Christoph Berg
--
Senior Berater, Tel.: +49 (0)21 61 / 46 43-187
credativ GmbH, HRB Mönchengladbach 12080, USt-ID-Nummer: DE204566209
Hohenzollernstr. 133, 41061 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
pgp fingerprint: 5C48 FE61 57F4 9179 5970 87C6 4C5A 6BAB 12D2 A7AE

#20Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Christoph Berg (#19)
Re: [BUG] Archive recovery failure on 9.3+.

On 02/13/2014 02:42 PM, Christoph Berg wrote:

Re: Heikki Linnakangas 2014-02-13 <52FC9468.4050602@vmware.com>

With 9.1, it works, but 9.2 and 9.3 don't archive anything until I
remove the "test ! -f" part. (An alternative fix would be to declare
the behavior ok and adjust that example in the config.)

Hmm, the behavior is the same in 9.1 and 9.2. Did you use a
different archive_command in 9.1, without the "test"?

The behavior is different: 9.1 doesn't try to re-archive the last
segment from the old timeline. (I've seen it always happen in training
workshops, so that's always mostly idle machines. Possibly 9.1 will
re-archive the last segment under load.)

I've just reproduced it here with these settings (in Debian's
/etc/postgresql-common/createcluster.conf syntax):

Ah, I think I see what's going on. 9.2 changed the archive recovery
behavior so that the restored WAL segments are copied into pg_xlog, so
that you can recover the standby back to the same point after restart,
even if the archive later stops working (also needed for cascading
standbys). In 9.1, we only attempted to archive the last segment from
previous timeline if it wasn't restored from archive. In 9.2, it's
always archived.

I was testing this with streaming replication; 9.1 and 9.2 behave the
same in that scenario. But they differ when doing archive recovery.

Is this an argument for back-patching the "don't archive last segment
from old timeline" patch to 9.2 and 9.3, but leaving 9.1 alone? You have
the same problem with 9.1 and streaming replication, but no-one's
complained..

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21Christoph Berg
myon@debian.org
In reply to: Heikki Linnakangas (#20)
#22Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Christoph Berg (#21)
#23Christoph Berg
myon@debian.org
In reply to: Heikki Linnakangas (#22)
#24Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Christoph Berg (#23)
#25Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#18)
#26Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#25)
#27Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Heikki Linnakangas (#26)
#28Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Kyotaro Horiguchi (#27)
#29Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Heikki Linnakangas (#28)