9.2.3 crashes during archive recovery

Started by Kyotaro HORIGUCHI, almost 13 years ago. 47 messages.
#1 Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp

Hello, 9.2.3 crashes during archive recovery.

If my memory is correct, this was once fixed on origin/master together
with another problem, by the commit discussed in the thread below. But
both current HEAD and 9.2.3 still crash during archive recovery (not
on a standby) because of the 'marking deleted page visible' problem.

/messages/by-id/50C7612E.5060008@vmware.com

The attached script reproduces the situation.

The sequence of events is as follows:

1. At the start of the archive recovery process, StartupXLOG()
effectively initializes minRecoveryPoint to
ControlFile->CheckPoint.redo.

2. The consistency state becomes true just before redo starts.

3. Replaying a certain XLOG_HEAP2_VISIBLE record causes a crash,
because the page for the visibility map has already been removed
by smgr_truncate(), whose XLOG_SMGR_TRUNCATE record comes later
in the WAL:

PANIC: WAL contains references to invalid pages

Consequently, the consistency point should be postponed until the
XLOG_SMGR_TRUNCATE record has been replayed.

In this case, the FINAL consistency point is at the
XLOG_SMGR_TRUNCATE record, but the current implementation does not
record the consistency point (checkpoint, commit, or smgr_truncate)
itself, so we cannot predict the final consistency point when
recovery starts.
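To make the failure mode concrete: during replay, references to pages
that are missing on disk are remembered in an "invalid pages" table and
forgiven when a later truncate record legitimately drops them; declaring
consistency while entries remain is what produces the PANIC above. A
minimal standalone sketch of that bookkeeping (invented names and a
plain array instead of the real hash table in xlogutils.c):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of PostgreSQL's invalid-page tracking during WAL replay.
 * A fixed array of block numbers stands in for the real hash table
 * keyed by (relation, fork, block). */
#define MAX_INVALID 16
static int invalid_blocks[MAX_INVALID];
static int n_invalid = 0;

/* Replaying a record that touches a missing page: remember it. */
static void log_invalid_page(int blkno)
{
    assert(n_invalid < MAX_INVALID);
    invalid_blocks[n_invalid++] = blkno;
}

/* Replaying XLOG_SMGR_TRUNCATE: blocks >= nblocks were legitimately
 * removed, so forget any invalid-page entries for them. */
static void forget_invalid_pages(int nblocks)
{
    int i, kept = 0;
    for (i = 0; i < n_invalid; i++)
        if (invalid_blocks[i] < nblocks)
            invalid_blocks[kept++] = invalid_blocks[i];
    n_invalid = kept;
}

/* At the supposed consistency point: any remaining entry means the
 * "WAL contains references to invalid pages" PANIC. */
static bool invalid_pages_clear(void)
{
    return n_invalid == 0;
}
```

Kyotaro's crash corresponds to checking for invalid pages after the
XLOG_HEAP2_VISIBLE record but before the XLOG_SMGR_TRUNCATE record has
been replayed.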

Recovery completed successfully with the small and rough patch below.
It allows multiple consistency points, but it also gives up reaching
consistency quickly. (And of course no check has been done against
replication/promotion so far X-)

--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6029,7 +6029,9 @@ CheckRecoveryConsistency(void)
 		 */
 		XLogCheckInvalidPages();
 
-		reachedConsistency = true;
+		//reachedConsistency = true;
+		minRecoveryPoint = InvalidXLogRecPtr;
+
 		ereport(LOG,
 				(errmsg("consistent recovery state reached at %X/%X",
 						(uint32) (XLogCtl->lastReplayedEndRecPtr >> 32),

On the other hand, updating the control file on every commit or
smgr_truncate would slow down transactions...

Regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

====== Replay script.
#! /bin/sh
PGDATA=$HOME/pgdata
PGARC=$HOME/pgarc
rm -rf $PGDATA/* $PGARC/*
mkdir -p $PGDATA $PGARC  # ensure the archive target exists for archive_command
initdb
cat >> $PGDATA/postgresql.conf <<EOF
wal_level = hot_standby
checkpoint_segments = 300
checkpoint_timeout = 1h
archive_mode = on
archive_command = 'cp %p $PGARC/%f'
EOF
pg_ctl -w start
createdb
psql <<EOF
select version();
create table t (a int);
insert into t values (5);
checkpoint;
vacuum;
delete from t;
vacuum;
EOF
pg_ctl -w stop -m i
cat >> $PGDATA/recovery.conf <<EOF
restore_command='if [ -f $PGARC/%f ]; then cp $PGARC/%f %p; fi'
EOF
postgres
=====

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2 Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Kyotaro HORIGUCHI (#1)
Re: 9.2.3 crashes during archive recovery

On 13.02.2013 09:46, Kyotaro HORIGUCHI wrote:

In this case, the FINAL consistency point is at the
XLOG_SMGR_TRUNCATE record, but current implemet does not record
the consistency point (checkpoint, or commit or smgr_truncate)
itself, so we cannot predict the final consistency point on
starting of recovery.

Hmm, what you did was basically:

1. Run server normally.
2. Kill it with "pg_ctl stop -m immediate".
3. Create a recovery.conf file, turning the server into a hot standby.

Without step 3, the server would perform crash recovery, and it would
work. But because of the recovery.conf file, the server goes into
archive recovery, and because minRecoveryPoint is not set, it assumes
that the system is consistent from the start.

Aside from the immediate issue with truncation, the system really isn't
consistent until the WAL has been replayed far enough, so it shouldn't
open for hot standby queries. There might be other, later, changes
already flushed to data files. The system has no way of knowing how far
it needs to replay the WAL to become consistent.

At least in back-branches, I'd call this a pilot error. You can't turn a
master into a standby just by creating a recovery.conf file. At least
not if the master was not shut down cleanly first.

If there's a use case for doing that, maybe we can do something better
in HEAD. If the control file says that the system was running
(DB_IN_PRODUCTION), but there is a recovery.conf file, we could do crash
recovery first, until we reach the end of WAL, and go into archive
recovery mode after that. We'd recover all the WAL files in pg_xlog as
far as we can, same as in crash recovery, and only start restoring files
from the archive once we reach the end of WAL in pg_xlog. At that point,
we'd also consider the system as consistent, and start up for hot standby.

I'm not sure that's worth the trouble, though. Perhaps it would be
better to just throw an error if the control file state is
DB_IN_PRODUCTION and a recovery.conf file exists. The admin can always
start the server normally first, shut it down cleanly, and then create
the recovery.conf file.
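The simpler check proposed here can be sketched as a few lines of
standalone C (illustrative only, not the real xlog.c code; the enum
values mirror PostgreSQL's DBState names):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Sketch of the proposed startup check: refuse to enter archive
 * recovery when pg_control says the server was running (i.e. it
 * crashed or was killed) at the time the data directory was copied. */
typedef enum
{
    DB_SHUTDOWNED,          /* clean shutdown */
    DB_IN_PRODUCTION,       /* server was running, or crashed */
    DB_IN_ARCHIVE_RECOVERY  /* archive recovery already in progress */
} DBState;

/* Returns true if startup may proceed; false models the proposed error. */
static bool recovery_start_allowed(DBState state, bool recovery_conf_exists)
{
    if (recovery_conf_exists && state == DB_IN_PRODUCTION)
    {
        fprintf(stderr, "FATAL: cannot begin archive recovery: "
                        "the server was not shut down cleanly\n");
        return false;
    }
    return true;
}
```

Under this rule the admin's escape hatch is exactly the one described
above: start normally, shut down cleanly (state becomes DB_SHUTDOWNED),
then create recovery.conf.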

On the other hand, updating control file on every commits or
smgr_truncate's should slow the transactions..

To be precise, we'd need to update the control file on every
XLogFlush(), like we do during archive recovery. That would indeed be
unacceptable from a performance point of view. Updating the control file
that often would also be bad for robustness.

- Heikki

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#2)
Re: 9.2.3 crashes during archive recovery

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

At least in back-branches, I'd call this a pilot error. You can't turn a
master into a standby just by creating a recovery.conf file. At least
not if the master was not shut down cleanly first.
...
I'm not sure that's worth the trouble, though. Perhaps it would be
better to just throw an error if the control file state is
DB_IN_PRODUCTION and a recovery.conf file exists.

+1 for that approach, at least until it's clear there's a market for
doing this sort of thing. I think the error check could be
back-patched, too.

regards, tom lane

#4 Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#2)
Re: 9.2.3 crashes during archive recovery

On 13 February 2013 09:04, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

To be precise, we'd need to update the control file on every XLogFlush(),
like we do during archive recovery. That would indeed be unacceptable from a
performance point of view. Updating the control file that often would also
be bad for robustness.

If those arguments make sense, then why don't they apply to recovery as well?

It sounds like we need to look at something better for use during
archive recovery.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#4)
Re: 9.2.3 crashes during archive recovery

Simon Riggs <simon@2ndQuadrant.com> writes:

On 13 February 2013 09:04, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

To be precise, we'd need to update the control file on every XLogFlush(),
like we do during archive recovery. That would indeed be unacceptable from a
performance point of view. Updating the control file that often would also
be bad for robustness.

If those arguments make sense, then why don't they apply to recovery as well?

In plain old crash recovery, don't the checks on whether to apply WAL
records based on LSN take care of this?

regards, tom lane

#6 Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Simon Riggs (#4)
Re: 9.2.3 crashes during archive recovery

On 13.02.2013 20:25, Simon Riggs wrote:

On 13 February 2013 09:04, Heikki Linnakangas<hlinnakangas@vmware.com> wrote:

To be precise, we'd need to update the control file on every XLogFlush(),
like we do during archive recovery. That would indeed be unacceptable from a
performance point of view. Updating the control file that often would also
be bad for robustness.

If those arguments make sense, then why don't they apply to recovery as well?

To some degree, they do. The big difference is that during normal
operation, every commit is XLogFlushed(). During recovery, XLogFlush()
happens much less frequently - certainly not after replaying each commit
record.

It sounds like we need to look at something better for use during
archive recovery.

Well, no-one's complained about the performance. From a robustness point
of view, it might be good to keep the minRecoveryPoint value in a
separate file, for example, to avoid rewriting the control file that
often. Then again, why fix it when it's not broken.

- Heikki

#7 Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Tom Lane (#5)
Re: 9.2.3 crashes during archive recovery

On 13.02.2013 21:03, Tom Lane wrote:

Simon Riggs<simon@2ndQuadrant.com> writes:

On 13 February 2013 09:04, Heikki Linnakangas<hlinnakangas@vmware.com> wrote:

To be precise, we'd need to update the control file on every XLogFlush(),
like we do during archive recovery. That would indeed be unacceptable from a
performance point of view. Updating the control file that often would also
be bad for robustness.

If those arguments make sense, then why don't they apply to recovery as well?

In plain old crash recovery, don't the checks on whether to apply WAL
records based on LSN take care of this?

The problem we're trying to solve is determining how much WAL needs to
be replayed until the database is consistent again. In crash recovery,
the answer is "all of it". That's why the CRC in the WAL is essential;
it's required to determine where the WAL ends. But if we had some other
mechanism, like if we updated minRecoveryPoint after every XLogFlush()
like Simon suggested, we wouldn't necessarily need the CRC to detect end
of WAL (not that I'd suggest removing it anyway), and we could throw an
error if there is a corrupt bit somewhere in the WAL before the true end
of WAL.

In archive recovery, we can't just say "replay all the WAL", because the
whole idea of PITR is to not recover all the WAL. So we use
minRecoveryPoint to keep track of how far the WAL needs to be replayed
at a minimum, for the database to be consistent.
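The role minRecoveryPoint plays in that decision can be modeled in a few
lines (a sketch with invented names; the real logic lives in
CheckRecoveryConsistency() in xlog.c):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* Toy version of the consistency test: we are consistent once replay
 * has passed minRecoveryPoint. An unset minRecoveryPoint is treated as
 * "consistent from the start" -- exactly the assumption that bites in
 * Kyotaro's scenario, where the control file of a crashed master
 * carries no minRecoveryPoint at all. */
static bool reached_consistency(XLogRecPtr min_recovery_point,
                                XLogRecPtr last_replayed_end)
{
    if (min_recovery_point == InvalidXLogRecPtr)
        return true;
    return last_replayed_end >= min_recovery_point;
}
```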

- Heikki

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#6)
Re: 9.2.3 crashes during archive recovery

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

Well, no-one's complained about the performance. From a robustness point
of view, it might be good to keep the minRecoveryPoint value in a
separate file, for example, to avoid rewriting the control file that
often. Then again, why fix it when it's not broken.

It would only be broken if someone interrupted a crash recovery
mid-flight and tried to establish a recovery stop point before the end
of WAL, no? Why don't we just forbid that case? This would either be
the same as, or a small extension of, the pg_control state vs existence
of recovery.conf error check that was just discussed.

regards, tom lane

#9 Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Tom Lane (#8)
Re: 9.2.3 crashes during archive recovery

On 13.02.2013 21:21, Tom Lane wrote:

Heikki Linnakangas<hlinnakangas@vmware.com> writes:

Well, no-one's complained about the performance. From a robustness point
of view, it might be good to keep the minRecoveryPoint value in a
separate file, for example, to avoid rewriting the control file that
often. Then again, why fix it when it's not broken.

It would only be broken if someone interrupted a crash recovery
mid-flight and tried to establish a recovery stop point before the end
of WAL, no? Why don't we just forbid that case? This would either be
the same as, or a small extension of, the pg_control state vs existence
of recovery.conf error check that was just discussed.

The problem is when you interrupt archive recovery (kill -9), and
restart. After restart, the system needs to know how far the WAL was
replayed before the crash, because it must not open for hot standby
queries, or allow the database to be started up in master-mode, until
it's replayed the WAL up to that same point again.

- Heikki

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#9)
Re: 9.2.3 crashes during archive recovery

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

On 13.02.2013 21:21, Tom Lane wrote:

It would only be broken if someone interrupted a crash recovery
mid-flight and tried to establish a recovery stop point before the end
of WAL, no? Why don't we just forbid that case? This would either be
the same as, or a small extension of, the pg_control state vs existence
of recovery.conf error check that was just discussed.

The problem is when you interrupt archive recovery (kill -9), and
restart. After restart, the system needs to know how far the WAL was
replayed before the crash, because it must not open for hot standby
queries, or allow the database to be started up in master-mode, until
it's replayed the WAL up to that same point again.

Well, archive recovery is a different scenario --- Simon was questioning
whether we need a minRecoveryPoint mechanism in crash recovery, or at
least that's what I thought he asked.

regards, tom lane

#11 Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Tom Lane (#10)
Re: 9.2.3 crashes during archive recovery

On 13.02.2013 21:30, Tom Lane wrote:

Heikki Linnakangas<hlinnakangas@vmware.com> writes:

On 13.02.2013 21:21, Tom Lane wrote:

It would only be broken if someone interrupted a crash recovery
mid-flight and tried to establish a recovery stop point before the end
of WAL, no? Why don't we just forbid that case? This would either be
the same as, or a small extension of, the pg_control state vs existence
of recovery.conf error check that was just discussed.

The problem is when you interrupt archive recovery (kill -9), and
restart. After restart, the system needs to know how far the WAL was
replayed before the crash, because it must not open for hot standby
queries, or allow the database to be started up in master-mode, until
it's replayed the WAL up to that same point again.

Well, archive recovery is a different scenario --- Simon was questioning
whether we need a minRecoveryPoint mechanism in crash recovery, or at
least that's what I thought he asked.

Ah, ok. The short answer to that is "no", because in crash recovery, we
just replay the WAL all the way to the end. I thought he was questioning
updating the control file at every XLogFlush() during archive recovery.
The answer to that is that it's not so bad, because XLogFlush() is
called so infrequently during recovery.

- Heikki

#12 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#7)
Re: 9.2.3 crashes during archive recovery

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

The problem we're trying to solve is determining how much WAL needs to
be replayed until the database is consistent again. In crash recovery,
the answer is "all of it". That's why the CRC in the WAL is essential;
it's required to determine where the WAL ends. But if we had some other
mechanism, like if we updated minRecoveryPoint after every XLogFlush()
like Simon suggested, we wouldn't necessarily need the CRC to detect end
of WAL (not that I'd suggest removing it anyway), and we could throw an
error if there is corrupt bit somewhere in the WAL before the true end
of WAL.

Meh. I think that would be disastrous from both performance and
reliability standpoints. (Performance because the whole point of WAL is
to commit with only one disk write in one place, and reliability because
of greatly increasing the number of writes to the utterly-critical
pg_control file.)

regards, tom lane

#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#11)
Re: 9.2.3 crashes during archive recovery

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

On 13.02.2013 21:30, Tom Lane wrote:

Well, archive recovery is a different scenario --- Simon was questioning
whether we need a minRecoveryPoint mechanism in crash recovery, or at
least that's what I thought he asked.

Ah, ok. The short answer to that is "no", because in crash recovery, we
just replay the WAL all the way to the end. I thought he was questioning
updating the control file at every XLogFlush() during archive recovery.
The answer to that is that it's not so bad, because XLogFlush() is
called so infrequently during recovery.

Right, and it's not so evil from a reliability standpoint either, partly
because of that and partly because, by definition, this isn't your only
copy of the data.

regards, tom lane

#14 Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Tom Lane (#3)
Re: 9.2.3 crashes during archive recovery

On 13.02.2013 17:02, Tom Lane wrote:

Heikki Linnakangas<hlinnakangas@vmware.com> writes:

At least in back-branches, I'd call this a pilot error. You can't turn a
master into a standby just by creating a recovery.conf file. At least
not if the master was not shut down cleanly first.
...
I'm not sure that's worth the trouble, though. Perhaps it would be
better to just throw an error if the control file state is
DB_IN_PRODUCTION and a recovery.conf file exists.

+1 for that approach, at least until it's clear there's a market for
doing this sort of thing. I think the error check could be
back-patched, too.

Hmm, I just realized a little problem with that approach. If you take a
base backup using an atomic filesystem backup from a running server, and
start archive recovery from that, that's essentially the same thing as
Kyotaro's test case.

- Heikki

#15 Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#2)
Re: 9.2.3 crashes during archive recovery

On 13 February 2013 09:04, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

Without step 3, the server would perform crash recovery, and it would work.
But because of the recovery.conf file, the server goes into archive
recovery, and because minRecoveryPoint is not set, it assumes that the
system is consistent from the start.

Aside from the immediate issue with truncation, the system really isn't
consistent until the WAL has been replayed far enough, so it shouldn't open
for hot standby queries. There might be other, later, changes already
flushed to data files. The system has no way of knowing how far it needs to
replay the WAL to become consistent.

At least in back-branches, I'd call this a pilot error. You can't turn a
master into a standby just by creating a recovery.conf file. At least not if
the master was not shut down cleanly first.

If there's a use case for doing that, maybe we can do something better in
HEAD. If the control file says that the system was running
(DB_IN_PRODUCTION), but there is a recovery.conf file, we could do crash
recovery first, until we reach the end of WAL, and go into archive recovery
mode after that. We'd recover all the WAL files in pg_xlog as far as we can,
same as in crash recovery, and only start restoring files from the archive
once we reach the end of WAL in pg_xlog. At that point, we'd also consider
the system as consistent, and start up for hot standby.

I'm not sure that's worth the trouble, though. Perhaps it would be better to
just throw an error if the control file state is DB_IN_PRODUCTION and a
recovery.conf file exists. The admin can always start the server normally
first, shut it down cleanly, and then create the recovery.conf file.

Now I've read the whole thing...

The problem is that we start up Hot Standby before we hit the min
recovery point, because that point isn't recorded. For me, the thing
to do is to make the min recovery point == end of WAL when the state
is DB_IN_PRODUCTION. That way we don't need to do any new writes, and
we don't need to risk people seeing inconsistent results if they do this.

But I think that still gives you a timeline problem when putting a
master back into a standby.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#16 Ants Aasma
ants.aasma@eesti.ee
In reply to: Heikki Linnakangas (#14)
Re: 9.2.3 crashes during archive recovery

On Feb 13, 2013 10:29 PM, "Heikki Linnakangas" <hlinnakangas@vmware.com>
wrote:

Hmm, I just realized a little problem with that approach. If you take a
base backup using an atomic filesystem backup from a running server, and
start archive recovery from that, that's essentially the same thing as
Kyotaro's test case.

Coincidentally I was wondering about the same thing when I was reviewing
our slave provisioning procedures. There didn't seem to be any
communication channel from pg_stop_backup for the slave to know when it was
safe to allow connections.

Maybe there should be some mechanism akin to backup label to communicate
the minimum recovery point? When the min recovery point file exists the
value inside it is used, when the recovery point is reached the file is
removed. pg_basebackup would just append the file to the archive. Custom
backup procedures could also use it to communicate the necessary WAL
location.
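The mechanism sketched here might look roughly like the following
standalone C (the file name, format, and function names are entirely
hypothetical; nothing like this exists in PostgreSQL at the time of
this thread):

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical "minrecovery.label" file holding the WAL position the
 * backup procedure knows must be replayed before the copy is
 * consistent, written in the usual %X/%X LSN notation. */
static const char *LABEL = "minrecovery.label";

/* The backup procedure (or pg_basebackup) appends this to the backup. */
static void write_label(unsigned hi, unsigned lo)
{
    FILE *f = fopen(LABEL, "w");
    if (f == NULL)
        return;
    fprintf(f, "%X/%X\n", hi, lo);
    fclose(f);
}

/* At startup: returns 1 and fills *hi/*lo if a label exists, else 0.
 * When present, its value would override the usual minRecoveryPoint. */
static int read_label(unsigned *hi, unsigned *lo)
{
    FILE *f = fopen(LABEL, "r");
    int ok;
    if (f == NULL)
        return 0;
    ok = fscanf(f, "%X/%X", hi, lo) == 2;
    fclose(f);
    return ok;
}

/* Once replay reaches the labeled position, the file is removed and
 * the ordinary consistency rules apply again. */
static void reached_label(void)
{
    remove(LABEL);
}
```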

--
Ants Aasma

#17 Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#14)
Re: 9.2.3 crashes during archive recovery

On Thu, Feb 14, 2013 at 5:15 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 13.02.2013 17:02, Tom Lane wrote:

Heikki Linnakangas<hlinnakangas@vmware.com> writes:

At least in back-branches, I'd call this a pilot error. You can't turn a
master into a standby just by creating a recovery.conf file. At least
not if the master was not shut down cleanly first.
...
I'm not sure that's worth the trouble, though. Perhaps it would be
better to just throw an error if the control file state is
DB_IN_PRODUCTION and a recovery.conf file exists.

+1 for that approach, at least until it's clear there's a market for
doing this sort of thing. I think the error check could be
back-patched, too.

Hmm, I just realized a little problem with that approach. If you take a base
backup using an atomic filesystem backup from a running server, and start
archive recovery from that, that's essentially the same thing as Kyotaro's
test case.

Yes. And the resource agent for streaming replication in Pacemaker (the
OSS clusterware) is a user of that archive recovery scenario, too. When
it starts up the server, it always creates the recovery.conf and starts
the server as a standby. It cannot start the master directly; IOW, the
server is always promoted to master from standby. So when it starts up
the server after a crash, it obviously executes the same recovery
scenario (i.e., forced archive recovery instead of crash recovery) that
Kyotaro described.

The reason the resource agent cannot start up the master directly is
that it manages three server states, called Master, Slave and Down. It
can move the server state from Down to Slave, and in the reverse
direction. It can also move the state from Slave to Master, and in the
reverse direction. But there is no way to move the state between Down
and Master directly. This kind of state transition model is not an
isolated case among clusterware, I think.

Regards,

--
Fujii Masao

#18 Fujii Masao
masao.fujii@gmail.com
In reply to: Simon Riggs (#15)
Re: 9.2.3 crashes during archive recovery

On Thu, Feb 14, 2013 at 5:52 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 13 February 2013 09:04, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

Without step 3, the server would perform crash recovery, and it would work.
But because of the recovery.conf file, the server goes into archive
recovery, and because minRecoveryPoint is not set, it assumes that the
system is consistent from the start.

Aside from the immediate issue with truncation, the system really isn't
consistent until the WAL has been replayed far enough, so it shouldn't open
for hot standby queries. There might be other, later, changes already
flushed to data files. The system has no way of knowing how far it needs to
replay the WAL to become consistent.

At least in back-branches, I'd call this a pilot error. You can't turn a
master into a standby just by creating a recovery.conf file. At least not if
the master was not shut down cleanly first.

If there's a use case for doing that, maybe we can do something better in
HEAD. If the control file says that the system was running
(DB_IN_PRODUCTION), but there is a recovery.conf file, we could do crash
recovery first, until we reach the end of WAL, and go into archive recovery
mode after that. We'd recover all the WAL files in pg_xlog as far as we can,
same as in crash recovery, and only start restoring files from the archive
once we reach the end of WAL in pg_xlog. At that point, we'd also consider
the system as consistent, and start up for hot standby.

I'm not sure that's worth the trouble, though. Perhaps it would be better to
just throw an error if the control file state is DB_IN_PRODUCTION and a
recovery.conf file exists. The admin can always start the server normally
first, shut it down cleanly, and then create the recovery.conf file.

Now I've read the whole thing...

The problem is that we startup Hot Standby before we hit the min
recovery point because that isn't recorded. For me, the thing to do is
to make the min recovery point == end of WAL when state is
DB_IN_PRODUCTION. That way we don't need to do any new writes and we
don't need to risk people seeing inconsistent results if they do this.

+1

And in the standby case, the min recovery point can be set to the end
of the WAL files located on the standby. IOW, we can regard the
database as consistent when we have replayed all the local WAL files
and try to connect to the master.

Regards,

--
Fujii Masao

#19 Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Heikki Linnakangas (#14)
Re: 9.2.3 crashes during archive recovery

Sorry, I forgot to explain how we found this issue.

In an HA DB cluster consisting of Pacemaker and PostgreSQL, PostgreSQL
is stopped with 'pg_ctl stop -m i' regardless of the situation.

On the other hand, the PostgreSQL RA (Resource Agent) is obliged to
start the master node via the hot standby state because of the
restrictions on Pacemaker's state transitions.

So simply stopping and then restarting the master node can fall into
this situation.

Hmm, I just realized a little problem with that approach. If you take
a base backup using an atomic filesystem backup from a running server,
and start archive recovery from that, that's essentially the same
thing as Kyotaro's test case.

Regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

#20 Ants Aasma
ants@cybertec.at
In reply to: Simon Riggs (#15)
Re: 9.2.3 crashes during archive recovery

On Wed, Feb 13, 2013 at 10:52 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

The problem is that we startup Hot Standby before we hit the min
recovery point because that isn't recorded. For me, the thing to do is
to make the min recovery point == end of WAL when state is
DB_IN_PRODUCTION. That way we don't need to do any new writes and we
don't need to risk people seeing inconsistent results if they do this.

While this solution would help solve my issue, it assumes that the
correct set of WAL files is actually there. Currently the docs for
setting up a standby refer to "24.3.4. Recovering Using a Continuous
Archive Backup", and that step recommends emptying the contents of
pg_xlog. If this is chosen as the solution, the docs should be adjusted
to recommend using pg_basebackup -x for setting up the standby. As a
related point, pointing standby setup to that section has confused at
least one of my clients. That chapter is rather scarily complicated
compared to what's usually necessary.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

#21 Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Ants Aasma (#20)
1 attachment(s)
Re: 9.2.3 crashes during archive recovery

On 15.02.2013 13:05, Ants Aasma wrote:

On Wed, Feb 13, 2013 at 10:52 PM, Simon Riggs<simon@2ndquadrant.com> wrote:

The problem is that we startup Hot Standby before we hit the min
recovery point because that isn't recorded. For me, the thing to do is
to make the min recovery point == end of WAL when state is
DB_IN_PRODUCTION. That way we don't need to do any new writes and we
don't need to risk people seeing inconsistent results if they do this.

While this solution would help solve my issue, it assumes that the
correct amount of WAL files are actually there. Currently the docs for
setting up a standby refer to "24.3.4. Recovering Using a Continuous
Archive Backup", and that step recommends emptying the contents of
pg_xlog. If this is chosen as the solution the docs should be adjusted
to recommend using pg_basebackup -x for setting up the standby.

When the backup is taken using pg_start_backup or pg_basebackup,
minRecoveryPoint is set correctly anyway, and it's OK to clear out
pg_xlog. It's only if you take the backup using an atomic filesystem
snapshot, or just kill -9 the server and take a backup while it's not
running, that we have a problem. In those scenarios, you should not
clear pg_xlog.

Attached is a patch for git master. The basic idea is to split
InArchiveRecovery into two variables, InArchiveRecovery and
ArchiveRecoveryRequested. ArchiveRecoveryRequested is set when
recovery.conf exists. But if we don't know how far we need to recover,
we first perform crash recovery with InArchiveRecovery=false. When we
reach the end of WAL in pg_xlog, InArchiveRecovery is set, and we
continue with normal archive recovery.
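The two-flag scheme described here can be sketched as a toy decision function (illustrative Python pseudocode; the flag names mirror the patch, but the control flow is simplified and is not xlog.c's actual logic):

```python
# Toy model of the startup decision in the patch. The two flag names
# mirror the patch; everything else is illustrative.

def choose_recovery_mode(recovery_conf_present, knows_min_recovery_point):
    """Return (ArchiveRecoveryRequested, InArchiveRecovery) at startup."""
    archive_recovery_requested = recovery_conf_present
    if archive_recovery_requested and knows_min_recovery_point:
        # A backup_label file, minRecoveryPoint/backupEndPoint in
        # pg_control, or a clean shutdown tells us how far we must
        # replay: enter archive recovery directly.
        in_archive_recovery = True
    else:
        # Otherwise replay local pg_xlog first, as plain crash recovery;
        # InArchiveRecovery flips to true at end of the local WAL.
        in_archive_recovery = False
    return archive_recovery_requested, in_archive_recovery
```

In the patch itself, the "do we know how far to replay" test is the presence of a backup_label file, a valid minRecoveryPoint or backupEndPoint in pg_control, backupEndRequired, or a DB_SHUTDOWNED state.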

As a
related point, pointing standby setup to that section has confused at
least one of my clients. That chapter is rather scarily complicated
compared to what's usually necessary.

Yeah, it probably could use some editing, as the underlying code has
evolved a lot since it was written. The suggestion to clear out pg_xlog
seems like an unnecessary complication. It's safe to do so, if you
restore with an archive, but unnecessary.

The "File System Level Backup" chapter
(http://www.postgresql.org/docs/devel/static/backup-file.html) probably
should mention "pg_basebackup -x", too.

Docs patches are welcome..

- Heikki

Attachments:

crash-recover-before-archive-recovery.patchtext/x-diff; name=crash-recover-before-archive-recovery.patchDownload
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 479c14d..b4e7830 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -189,7 +189,18 @@ static bool LocalHotStandbyActive = false;
  */
 static int	LocalXLogInsertAllowed = -1;
 
-/* Are we recovering using offline XLOG archives? (only valid in the startup process) */
+/*
+ * When ArchiveRecoveryRequested is set, archive recovery was
+ * requested (recovery.conf file was present). When InArchiveRecovery is set,
+ * we are currently recovering using offline XLOG archives. (these variables
+ * are only valid in the startup process).
+ *
+ * When ArchiveRecoveryRequested is true, but InArchiveRecovery is false, we're
+ * currently performing crash recovery using only XLOG files in pg_xlog, but
+ * will switch to using offline XLOG archives as soon as we reach the end of
+ * WAL in pg_xlog.
+*/
+static bool ArchiveRecoveryRequested = false;
 bool InArchiveRecovery = false;
 
 /* Was the last xlog file restored from archive, or local? */
@@ -207,10 +218,12 @@ static TimestampTz recoveryTargetTime;
 static char *recoveryTargetName;
 
 /* options taken from recovery.conf for XLOG streaming */
-bool StandbyMode = false;
+static bool StandbyModeRequested = false;
 static char *PrimaryConnInfo = NULL;
 static char *TriggerFile = NULL;
 
+bool StandbyMode = false;
+
 /* whether request for fast promotion has been made yet */
 static bool fast_promote = false;
 
@@ -3217,10 +3230,10 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 	private->emode = emode;
 	private->randAccess = (RecPtr != InvalidXLogRecPtr);
 
-	/* This is the first try to read this page. */
+	/* This is the first attempt to read this page. */
 	lastSourceFailed = false;
 
-	do
+	for (;;)
 	{
 		char   *errormsg;
 
@@ -3229,8 +3242,6 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 		EndRecPtr = xlogreader->EndRecPtr;
 		if (record == NULL)
 		{
-			lastSourceFailed = true;
-
 			if (readFile >= 0)
 			{
 				close(readFile);
@@ -3247,22 +3258,16 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 				ereport(emode_for_corrupt_record(emode,
 												 RecPtr ? RecPtr : EndRecPtr),
 						(errmsg_internal("%s", errormsg) /* already translated */));
-
-			/* Give up, or retry if we're in standby mode. */
-			continue;
 		}
-
 		/*
 		 * Check page TLI is one of the expected values.
 		 */
-		if (!tliInHistory(xlogreader->latestPageTLI, expectedTLEs))
+		else if (!tliInHistory(xlogreader->latestPageTLI, expectedTLEs))
 		{
 			char		fname[MAXFNAMELEN];
 			XLogSegNo segno;
 			int32 offset;
 
-			lastSourceFailed = true;
-
 			XLByteToSeg(xlogreader->latestPagePtr, segno);
 			offset = xlogreader->latestPagePtr % XLogSegSize;
 			XLogFileName(fname, xlogreader->readPageTLI, segno);
@@ -3273,11 +3278,61 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 							fname,
 							offset)));
 			record = NULL;
-			continue;
 		}
-	} while (StandbyMode && record == NULL && !CheckForStandbyTrigger());
 
-	return record;
+		if (record)
+		{
+			/* Great, got a record */
+			return record;
+		}
+		else
+		{
+			/* No valid record available from this source */
+			lastSourceFailed = true;
+
+			/*
+			 * If archive recovery was requested, but we were still doing crash
+			 * recovery, switch to archive recovery and retry using the offline
+			 * archive. We have now replayed all the valid WAL in pg_xlog, so
+			 * we are presumably now consistent.
+			 *
+			 * We require that there's at least some valid WAL present in
+			 * pg_xlog, however (!fetching_ckpt). We could recover using the WAL
+			 * from the archive, even if pg_xlog is completely empty, but we'd
+			 * have no idea how far we'd have to replay to reach consistency.
+			 * So err on the safe side and give up.
+			 */
+			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
+				!fetching_ckpt)
+			{
+				ereport(DEBUG2,
+						(errmsg_internal("reached end of WAL in pg_xlog, entering archive recovery")));
+				InArchiveRecovery = true;
+				if (StandbyModeRequested)
+					StandbyMode = true;
+
+				/* initialize minRecoveryPoint to this record */
+				LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+				if (ControlFile->minRecoveryPoint < EndRecPtr)
+				{
+					ControlFile->minRecoveryPoint = EndRecPtr;
+					ControlFile->minRecoveryPointTLI = ThisTimeLineID;
+					UpdateControlFile();
+				}
+				LWLockRelease(ControlFileLock);
+
+				CheckRecoveryConsistency();
+
+				continue;
+			}
+
+			/* In standby mode, loop back to retry. Otherwise, give up. */
+			if (StandbyMode && !CheckForStandbyTrigger())
+				continue;
+			else
+				return NULL;
+		}
+	}
 }
 
 /*
@@ -4213,7 +4268,7 @@ readRecoveryCommandFile(void)
 		}
 		else if (strcmp(item->name, "standby_mode") == 0)
 		{
-			if (!parse_bool(item->value, &StandbyMode))
+			if (!parse_bool(item->value, &StandbyModeRequested))
 				ereport(ERROR,
 						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 						 errmsg("parameter \"%s\" requires a Boolean value",
@@ -4244,7 +4299,7 @@ readRecoveryCommandFile(void)
 	/*
 	 * Check for compulsory parameters
 	 */
-	if (StandbyMode)
+	if (StandbyModeRequested)
 	{
 		if (PrimaryConnInfo == NULL && recoveryRestoreCommand == NULL)
 			ereport(WARNING,
@@ -4261,7 +4316,7 @@ readRecoveryCommandFile(void)
 	}
 
 	/* Enable fetching from archive recovery area */
-	InArchiveRecovery = true;
+	ArchiveRecoveryRequested = true;
 
 	/*
 	 * If user specified recovery_target_timeline, validate it or compute the
@@ -4271,6 +4326,11 @@ readRecoveryCommandFile(void)
 	 */
 	if (rtliGiven)
 	{
+		/*
+		 * Temporarily set InArchiveRecovery, so that existsTimeLineHistory
+		 * or findNewestTimeLine below will check the archive.
+		 */
+		InArchiveRecovery = true;
 		if (rtli)
 		{
 			/* Timeline 1 does not have a history file, all else should */
@@ -4287,6 +4347,7 @@ readRecoveryCommandFile(void)
 			recoveryTargetTLI = findNewestTimeLine(recoveryTargetTLI);
 			recoveryTargetIsLatest = true;
 		}
+		InArchiveRecovery = false;
 	}
 
 	FreeConfigVariables(head);
@@ -4850,9 +4911,9 @@ StartupXLOG(void)
 			archiveCleanupCommand ? archiveCleanupCommand : "",
 			sizeof(XLogCtl->archiveCleanupCommand));
 
-	if (InArchiveRecovery)
+	if (ArchiveRecoveryRequested)
 	{
-		if (StandbyMode)
+		if (StandbyModeRequested)
 			ereport(LOG,
 					(errmsg("entering standby mode")));
 		else if (recoveryTarget == RECOVERY_TARGET_XID)
@@ -4892,7 +4953,7 @@ StartupXLOG(void)
 	 * Take ownership of the wakeup latch if we're going to sleep during
 	 * recovery.
 	 */
-	if (StandbyMode)
+	if (StandbyModeRequested)
 		OwnLatch(&XLogCtl->recoveryWakeupLatch);
 
 	/* Set up XLOG reader facility */
@@ -4909,6 +4970,15 @@ StartupXLOG(void)
 						  &backupFromStandby))
 	{
 		/*
+		 * Archive recovery was requested, and thanks to the backup label file,
+		 * we know how far we need to replay to reach consistency. Enter
+		 * archive recovery directly.
+		 */
+		InArchiveRecovery = true;
+		if (StandbyModeRequested)
+			StandbyMode = true;
+
+		/*
 		 * When a backup_label file is present, we want to roll forward from
 		 * the checkpoint it identifies, rather than using pg_control.
 		 */
@@ -4949,6 +5019,33 @@ StartupXLOG(void)
 	else
 	{
 		/*
+		 * It's possible that archive recovery was requested, but we don't
+		 * know how far we need to replay the WAL before we reach consistency.
+		 * This can happen for example if a base backup is taken from a running
+		 * server using an atomic filesystem snapshot, without calling
+		 * pg_start/stop_backup. Or if you just kill a running master server
+		 * and put it into archive recovery by creating a recovery.conf file.
+		 *
+		 * Our strategy in that case is to perform crash recovery first,
+		 * replaying all the WAL present in pg_xlog, and only enter archive
+		 * recovery after that.
+		 *
+		 * But usually we already know how far we need to replay the WAL (up to
+		 * minRecoveryPoint, up to backupEndPoint, or until we see an
+		 * end-of-backup record), and we can enter archive recovery directly.
+		 */
+		if (ArchiveRecoveryRequested &&
+			(ControlFile->minRecoveryPoint != InvalidXLogRecPtr ||
+			 ControlFile->backupEndRequired ||
+			 ControlFile->backupEndPoint != InvalidXLogRecPtr ||
+			 ControlFile->state == DB_SHUTDOWNED))
+		{
+			InArchiveRecovery = true;
+			if (StandbyModeRequested)
+				StandbyMode = true;
+		}
+
+		/*
 		 * Get the last valid checkpoint record.  If the latest one according
 		 * to pg_control is broken, try the next-to-last one.
 		 */
@@ -5116,7 +5213,7 @@ StartupXLOG(void)
 	}
 	else if (ControlFile->state != DB_SHUTDOWNED)
 		InRecovery = true;
-	else if (InArchiveRecovery)
+	else if (ArchiveRecoveryRequested)
 	{
 		/* force recovery due to presence of recovery.conf */
 		InRecovery = true;
@@ -5155,15 +5252,6 @@ StartupXLOG(void)
 		ControlFile->prevCheckPoint = ControlFile->checkPoint;
 		ControlFile->checkPoint = checkPointLoc;
 		ControlFile->checkPointCopy = checkPoint;
-		if (InArchiveRecovery)
-		{
-			/* initialize minRecoveryPoint if not set yet */
-			if (ControlFile->minRecoveryPoint < checkPoint.redo)
-			{
-				ControlFile->minRecoveryPoint = checkPoint.redo;
-				ControlFile->minRecoveryPointTLI = checkPoint.ThisTimeLineID;
-			}
-		}
 
 		/*
 		 * Set backupStartPoint if we're starting recovery from a base backup.
@@ -5243,7 +5331,7 @@ StartupXLOG(void)
 		 * control file and we've established a recovery snapshot from a
 		 * running-xacts WAL record.
 		 */
-		if (InArchiveRecovery && EnableHotStandby)
+		if (ArchiveRecoveryRequested && EnableHotStandby)
 		{
 			TransactionId *xids;
 			int			nxids;
@@ -5344,7 +5432,7 @@ StartupXLOG(void)
 		 * process in addition to postmaster!  Also, fsync requests are
 		 * subsequently to be handled by the checkpointer, not locally.
 		 */
-		if (InArchiveRecovery && IsUnderPostmaster)
+		if (ArchiveRecoveryRequested && IsUnderPostmaster)
 		{
 			PublishStartupProcessInformation();
 			SetForwardFsyncRequests();
@@ -5601,7 +5689,7 @@ StartupXLOG(void)
 	 * We don't need the latch anymore. It's not strictly necessary to disown
 	 * it, but let's do it for the sake of tidiness.
 	 */
-	if (StandbyMode)
+	if (StandbyModeRequested)
 		DisownLatch(&XLogCtl->recoveryWakeupLatch);
 
 	/*
@@ -5646,7 +5734,7 @@ StartupXLOG(void)
 		 * crashes while an online backup is in progress. We must not treat
 		 * that as an error, or the database will refuse to start up.
 		 */
-		if (InArchiveRecovery || ControlFile->backupEndRequired)
+		if (ArchiveRecoveryRequested || ControlFile->backupEndRequired)
 		{
 			if (ControlFile->backupEndRequired)
 				ereport(FATAL,
@@ -5677,10 +5765,12 @@ StartupXLOG(void)
 	 * In a normal crash recovery, we can just extend the timeline we were in.
 	 */
 	PrevTimeLineID = ThisTimeLineID;
-	if (InArchiveRecovery)
+	if (ArchiveRecoveryRequested)
 	{
 		char	reason[200];
 
+		Assert(InArchiveRecovery);
+
 		ThisTimeLineID = findNewestTimeLine(recoveryTargetTLI) + 1;
 		ereport(LOG,
 				(errmsg("selected new timeline ID: %u", ThisTimeLineID)));
@@ -5720,7 +5810,7 @@ StartupXLOG(void)
 	 * that we also have a copy of the last block of the old WAL in readBuf;
 	 * we will use that below.)
 	 */
-	if (InArchiveRecovery)
+	if (ArchiveRecoveryRequested)
 		exitArchiveRecovery(xlogreader->readPageTLI, endLogSegNo);
 
 	/*
@@ -7706,7 +7796,7 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *record)
 		 * record, the backup was canceled and the end-of-backup record will
 		 * never arrive.
 		 */
-		if (InArchiveRecovery &&
+		if (ArchiveRecoveryRequested &&
 			!XLogRecPtrIsInvalid(ControlFile->backupStartPoint) &&
 			XLogRecPtrIsInvalid(ControlFile->backupEndPoint))
 			ereport(PANIC,
@@ -9118,7 +9208,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 		 * Request a restartpoint if we've replayed too much xlog since the
 		 * last one.
 		 */
-		if (StandbyMode && bgwriterLaunched)
+		if (StandbyModeRequested && bgwriterLaunched)
 		{
 			if (XLogCheckpointNeeded(readSegNo))
 			{
@@ -9141,27 +9231,18 @@ retry:
 		(readSource == XLOG_FROM_STREAM &&
 		 receivedUpto < targetPagePtr + reqLen))
 	{
-		if (StandbyMode)
+		if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
+										 private->randAccess,
+										 private->fetching_ckpt,
+										 targetRecPtr))
 		{
-			if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
-											 private->randAccess,
-											 private->fetching_ckpt,
-											 targetRecPtr))
-				goto triggered;
-		}
-		/* In archive or crash recovery. */
-		else if (readFile < 0)
-		{
-			int source;
-
-			if (InArchiveRecovery)
-				source = XLOG_FROM_ANY;
-			else
-				source = XLOG_FROM_PG_XLOG;
+			if (readFile >= 0)
+				close(readFile);
+			readFile = -1;
+			readLen = 0;
+			readSource = 0;
 
-			readFile = XLogFileReadAnyTLI(readSegNo, emode, source);
-			if (readFile < 0)
-				return -1;
+			return -1;
 		}
 	}
 
@@ -9234,22 +9315,17 @@ next_record_is_invalid:
 		goto retry;
 	else
 		return -1;
-
-triggered:
-	if (readFile >= 0)
-		close(readFile);
-	readFile = -1;
-	readLen = 0;
-	readSource = 0;
-
-	return -1;
 }
 
 /*
- * In standby mode, wait for WAL at position 'RecPtr' to become available, either
- * via restore_command succeeding to restore the segment, or via walreceiver
- * having streamed the record (or via someone copying the segment directly to
- * pg_xlog, but that is not documented or recommended).
+ * Open the WAL segment containing WAL position 'RecPtr'.
+ *
+ * The segment can be fetched via restore_command succeeding to restore
+ * the segment, or via walreceiver having streamed the record, or it can
+ * already be present in pg_xlog. Checking pg_xlog is mainly for crash
+ * recovery, but it will be polled in standby mode too, in case someone copies
+ * a new segment directly to pg_xlog, although that is not documented or
+ * recommended.
  *
  * If 'fetching_ckpt' is true, we're fetching a checkpoint record, and should
  * prepare to read WAL starting from RedoStartLSN after this.
@@ -9259,6 +9335,10 @@ triggered:
  * 'tliRecPtr' is the position of the WAL record we're interested in. It is
  * used to decide which timeline to stream the requested WAL from.
  *
+ * If the record is not immediately available, the function returns false
+ * if we're not in standby mode. In standby mode, waits for it to become
+ * available.
+ *
  * When the requested record becomes available, the function opens the file
  * containing it (if not open already), and returns true. When end of standby
  * mode is triggered by the user, and there is no more WAL available, returns
@@ -9292,7 +9372,9 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 	 * part of advancing to the next state.
 	 *-------
 	 */
-	if (currentSource == 0)
+	if (!InArchiveRecovery)
+		currentSource = XLOG_FROM_PG_XLOG;
+	else if (currentSource == 0)
 		currentSource = XLOG_FROM_ARCHIVE;
 
 	for (;;)
@@ -9307,7 +9389,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		 */
 		if (lastSourceFailed)
 		{
-
 			switch (currentSource)
 			{
 				case XLOG_FROM_ARCHIVE:
@@ -9321,13 +9402,20 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * file, we still finish replaying as much as we can from
 					 * archive and pg_xlog before failover.
 					 */
-					if (CheckForStandbyTrigger())
+					if (StandbyMode && CheckForStandbyTrigger())
 					{
 						ShutdownWalRcv();
 						return false;
 					}
 
 					/*
+					 * Not in standby mode, and we've now tried the archive and
+					 * pg_xlog.
+					 */
+					if (!StandbyMode)
+						return false;
+
+					/*
 					 * If primary_conninfo is set, launch walreceiver to try to
 					 * stream the missing WAL.
 					 *
@@ -9431,7 +9519,8 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 			 * in the archive over ones in pg_xlog, so try the next file
 			 * again from the archive first.
 			 */
-			currentSource = XLOG_FROM_ARCHIVE;
+			if (InArchiveRecovery)
+				currentSource = XLOG_FROM_ARCHIVE;
 		}
 
 		if (currentSource != oldSource)
@@ -9584,9 +9673,9 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		 * process.
 		 */
 		HandleStartupProcInterrupts();
-	}
+	} while (StandbyMode);
 
-	return false; 	/* not reached */
+	return false;
 }
 
 /*
#22Ants Aasma
ants@cybertec.at
In reply to: Heikki Linnakangas (#21)
Re: 9.2.3 crashes during archive recovery

On Fri, Feb 15, 2013 at 3:49 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

While this solution would help solve my issue, it assumes that the
correct amount of WAL files are actually there. Currently the docs for
setting up a standby refer to "24.3.4. Recovering Using a Continuous
Archive Backup", and that step recommends emptying the contents of
pg_xlog. If this is chosen as the solution the docs should be adjusted
to recommend using pg_basebackup -x for setting up the standby.

When the backup is taken using pg_start_backup or pg_basebackup,
minRecoveryPoint is set correctly anyway, and it's OK to clear out pg_xlog.

How is minRecoveryPoint supposed to get set? I just tried taking a
pg_basebackup on master running pgbench. The resulting data dir
controlfile had this:

Min recovery ending location: 0/0

The end location was not in the backup_label either.

Looking at basebackup.c, the process seems to be:
1. pg_start_backup
2. copy out backup_label
3. copy out rest of the datadir
4. copy out control file
5. pg_stop_backup
6. copy out WAL
7. send backup stop location

And pg_basebackup.c only uses the stop location to decide how much WAL to fetch.

I don't see anything here that could correctly communicate min
recovery point. Maybe I'm missing something.

Yeah, it probably could use some editing, as the underlying code has evolved
a lot since it was written. The suggestion to clear out pg_xlog seems like
an unnecessary complication. It's safe to do so, if you restore with an
archive, but unnecessary.

The "File System Level Backup" chapter
(http://www.postgresql.org/docs/devel/static/backup-file.html) probably
should mention "pg_basebackup -x", too.

Docs patches are welcome..

I will give it a shot.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

#23Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Ants Aasma (#22)
Re: 9.2.3 crashes during archive recovery

On 16.02.2013 10:40, Ants Aasma wrote:

On Fri, Feb 15, 2013 at 3:49 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

While this solution would help solve my issue, it assumes that the
correct amount of WAL files are actually there. Currently the docs for
setting up a standby refer to "24.3.4. Recovering Using a Continuous
Archive Backup", and that step recommends emptying the contents of
pg_xlog. If this is chosen as the solution the docs should be adjusted
to recommend using pg_basebackup -x for setting up the standby.

When the backup is taken using pg_start_backup or pg_basebackup,
minRecoveryPoint is set correctly anyway, and it's OK to clear out pg_xlog.

How is minRecoveryPoint supposed to get set?

I was slightly imprecise. minRecoveryPoint is not set at backup, but
backupStartPoint is. minRecoveryPoint is set later, once the
end-of-backup record is seen. A valid backupStartPoint has the same
effect as minRecoveryPoint: the system is considered inconsistent until
the end-of-backup record is seen.

I just tried taking a
pg_basebackup on master running pgbench. The resulting data dir
controlfile had this:

Min recovery ending location: 0/0

The end location was not in the backup_label either.

Looking at basebackup.c the process seems to be:
1. pg_start_backup
2. copy out backup_label
3. copy out rest of the datadir
4. copy out control file
5. pg_stop_backup
6. copy out WAL
7. send backup stop location

And pg_basebackup.c only uses the stop location to decide how much WAL to fetch.

I don't see anything here that could correctly communicate min
recovery point. Maybe I'm missing something.

backupStartPoint is set, which signals recovery to wait for an
end-of-backup record, until the system is considered consistent. If the
backup is taken from a hot standby, backupEndPoint is set, instead of
inserting an end-of-backup record.
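Putting the thread's pieces together, the consistency rule can be modeled roughly as follows (illustrative Python with LSNs reduced to integers; the real checks live in CheckRecoveryConsistency and the end-of-backup redo code):

```python
def is_consistent(replay_lsn, min_recovery_point,
                  backup_start_point, saw_end_of_backup_record,
                  backup_end_point):
    """Rough model: recovery is consistent once we have replayed past
    minRecoveryPoint AND, if we started from a base backup, past the
    point where the backup ended."""
    if backup_start_point is not None:
        # Backup taken from a primary: we must see the end-of-backup
        # WAL record before the cluster can be considered consistent.
        if not saw_end_of_backup_record:
            return False
    if backup_end_point is not None:
        # Backup taken from a hot standby: no end-of-backup record is
        # inserted; backupEndPoint in pg_control marks the end instead.
        if replay_lsn < backup_end_point:
            return False
    return replay_lsn >= min_recovery_point
```

This is why Ants saw "Min recovery ending location: 0/0" in the control file: for a primary backup, consistency is signalled by backupStartPoint plus the end-of-backup record, and minRecoveryPoint only gets filled in during replay.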

- Heikki

#24Ants Aasma
ants@cybertec.at
In reply to: Heikki Linnakangas (#23)
Re: 9.2.3 crashes during archive recovery

On Mon, Feb 18, 2013 at 8:27 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

backupStartPoint is set, which signals recovery to wait for an end-of-backup
record, until the system is considered consistent. If the backup is taken
from a hot standby, backupEndPoint is set, instead of inserting an
end-of-backup record.

Thank you for explaining this; I can see how it works now. I'll see if
I can document this better so others won't have to go through as much
effort.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

#25Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Ants Aasma (#24)
Re: 9.2.3 crashes during archive recovery

Hello, I looked at this from another point of view.

I consider the current discussion to be based on how to predict
the last consistency point. But there is another aspect of this
issue.

I tried to postpone smgrtruncate after the next checkpoint. This
is similar to what hot standby feedback does to vacuum. It seems
to be working fine, but I worry that it might also bloat the
table. I haven't found a way to postpone only the targeted
smgrtruncate.

The patch below is a quick verification patch for this
solution.

- CreateCheckPoint records the oldest xmin at that point. Let's
call it 'checkpoint xmin'.

- vacuum skips the modification by the transactions at the same
time or after the checkpoint xmin.

What do you think of this?

--
Kyotaro Horiguchi
NTT Open Source Software Center

=========
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 479c14d..a274393 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -493,6 +493,7 @@ typedef struct XLogCtlData
 	XLogRecPtr	lastFpwDisableRecPtr;

slock_t info_lck; /* locks shared variables shown above */
+ TransactionId chkptxactid;
} XLogCtlData;

static XLogCtlData *XLogCtl = NULL;
@@ -676,6 +677,11 @@ static bool read_backup_label(XLogRecPtr *checkPointLoc,
static void rm_redo_error_callback(void *arg);
static int get_sync_bit(int method);

+TransactionId
+XLogGetChkptTrId()
+{
+	return XLogCtl->chkptxactid;
+}
 /*
  * Insert an XLOG record having the specified RMID and info bytes,
@@ -3875,7 +3881,7 @@ XLOGShmemInit(void)
 	SpinLockInit(&XLogCtl->info_lck);
 	SpinLockInit(&XLogCtl->ulsn_lck);
 	InitSharedLatch(&XLogCtl->recoveryWakeupLatch);
-
+	XLogCtl->chkptxactid = InvalidTransactionId;
 	/*
 	 * If we are not in bootstrap mode, pg_control should already exist. Read
 	 * and validate it immediately (see comments in ReadControlFile() for the
@@ -6700,6 +6706,8 @@ CreateCheckPoint(int flags)
 	else
 		checkPoint.oldestActiveXid = InvalidTransactionId;
+	XLogCtl->chkptxactid = GetOldestXmin(true, true);
+
 	/*
 	 * We must hold WALInsertLock while examining insert state to determine
 	 * the checkpoint REDO pointer.
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 4800b43..79205a0 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -374,6 +374,7 @@ get_rel_oids(Oid relid, const RangeVar *vacrel)
 /*
  * vacuum_set_xid_limits() -- compute oldest-Xmin and freeze cutoff points
  */
+extern TransactionId XLogGetChkptTrId(void);
 void
 vacuum_set_xid_limits(int freeze_min_age,
 					  int freeze_table_age,
@@ -397,6 +398,11 @@ vacuum_set_xid_limits(int freeze_min_age,
 	 * always an independent transaction.
 	 */
 	*oldestXmin = GetOldestXmin(sharedRel, true);
+	{
+		TransactionId chkpttrid = XLogGetChkptTrId();
+		if (chkpttrid < *oldestXmin)
+			*oldestXmin = chkpttrid;
+	}

Assert(TransactionIdIsNormal(*oldestXmin));
==========

#26Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#25)
Re: 9.2.3 crashes during archive recovery

Sorry, let me correct myself a bit.

I tried to postpone smgrtruncate after the next checkpoint. This

I tried to postpone smgrtruncate TO the next checkpoint.

is similar to what hot standby feedback does to vacuum. It seems
to be working fine, but I worry that it might also bloat the
table. I haven't found a way to postpone only the targeted
smgrtruncate.

The patch below is a quick verification patch for this
solution.

- CreateCheckPoint records the oldest xmin at that point. Let's
call it 'checkpoint xmin'.

- vacuum skips the modification by the transactions at the same
time or after the checkpoint xmin.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#27Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Kyotaro HORIGUCHI (#26)
Re: 9.2.3 crashes during archive recovery

On 20.02.2013 10:01, Kyotaro HORIGUCHI wrote:

Sorry, Let me correct a bit.

I tried to postpone smgrtruncate after the next checkpoint. This

I tried to postpone smgrtruncate TO the next checkpoint.

Umm, why? I don't understand this patch at all.

- Heikki

#28Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Heikki Linnakangas (#21)
1 attachment(s)
Re: 9.2.3 crashes during archive recovery

On 15.02.2013 15:49, Heikki Linnakangas wrote:

Attached is a patch for git master. The basic idea is to split
InArchiveRecovery into two variables, InArchiveRecovery and
ArchiveRecoveryRequested. ArchiveRecoveryRequested is set when
recovery.conf exists. But if we don't know how far we need to recover,
we first perform crash recovery with InArchiveRecovery=false. When we
reach the end of WAL in pg_xlog, InArchiveRecovery is set, and we
continue with normal archive recovery.

New version of this attached, with a few bugs fixed.

I'm thinking that this should be back-patched to 9.2, but not to earlier
branches. Before 9.2, we don't PANIC at a reference to a non-existent
page until end of recovery, even if we've already reached consistency.
The same basic issue still exists in earlier versions, though: if you
have hot_standby=on, the system will open for read-only queries too
early, before the database is consistent. But this patch is invasive
enough that I'm wary of back-patching it further, when the worst that
can happen is that there's a small window right after startup when you
can see an inconsistent database in hot standby mode. Maybe after we get
some more testing of this in 9.2 and master. Opinions on that?

- Heikki

Attachments:

crash-recover-before-archive-recovery-2.patchtext/x-diff; name=crash-recover-before-archive-recovery-2.patchDownload
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index abc525a..29d1f96 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -189,7 +189,18 @@ static bool LocalHotStandbyActive = false;
  */
 static int	LocalXLogInsertAllowed = -1;
 
-/* Are we recovering using offline XLOG archives? (only valid in the startup process) */
+/*
+ * When ArchiveRecoveryRequested is set, archive recovery was requested,
+ * ie. recovery.conf file was present. When InArchiveRecovery is set, we are
+ * currently recovering using offline XLOG archives. These variables are only
+ * valid in the startup process.
+ *
+ * When ArchiveRecoveryRequested is true, but InArchiveRecovery is false, we're
+ * currently performing crash recovery using only XLOG files in pg_xlog, but
+ * will switch to using offline XLOG archives as soon as we reach the end of
+ * WAL in pg_xlog.
+*/
+static bool ArchiveRecoveryRequested = false;
 bool InArchiveRecovery = false;
 
 /* Was the last xlog file restored from archive, or local? */
@@ -207,10 +218,13 @@ static TimestampTz recoveryTargetTime;
 static char *recoveryTargetName;
 
 /* options taken from recovery.conf for XLOG streaming */
-bool StandbyMode = false;
+static bool StandbyModeRequested = false;
 static char *PrimaryConnInfo = NULL;
 static char *TriggerFile = NULL;
 
+/* are we currently in standby mode? */
+bool StandbyMode = false;
+
 /* whether request for fast promotion has been made yet */
 static bool fast_promote = false;
 
@@ -3217,10 +3231,10 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 	private->emode = emode;
 	private->randAccess = (RecPtr != InvalidXLogRecPtr);
 
-	/* This is the first try to read this page. */
+	/* This is the first attempt to read this page. */
 	lastSourceFailed = false;
 
-	do
+	for (;;)
 	{
 		char   *errormsg;
 
@@ -3229,8 +3243,6 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 		EndRecPtr = xlogreader->EndRecPtr;
 		if (record == NULL)
 		{
-			lastSourceFailed = true;
-
 			if (readFile >= 0)
 			{
 				close(readFile);
@@ -3247,22 +3259,16 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 				ereport(emode_for_corrupt_record(emode,
 												 RecPtr ? RecPtr : EndRecPtr),
 						(errmsg_internal("%s", errormsg) /* already translated */));
-
-			/* Give up, or retry if we're in standby mode. */
-			continue;
 		}
-
 		/*
 		 * Check page TLI is one of the expected values.
 		 */
-		if (!tliInHistory(xlogreader->latestPageTLI, expectedTLEs))
+		else if (!tliInHistory(xlogreader->latestPageTLI, expectedTLEs))
 		{
 			char		fname[MAXFNAMELEN];
 			XLogSegNo segno;
 			int32 offset;
 
-			lastSourceFailed = true;
-
 			XLByteToSeg(xlogreader->latestPagePtr, segno);
 			offset = xlogreader->latestPagePtr % XLogSegSize;
 			XLogFileName(fname, xlogreader->readPageTLI, segno);
@@ -3273,11 +3279,73 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 							fname,
 							offset)));
 			record = NULL;
-			continue;
 		}
-	} while (StandbyMode && record == NULL && !CheckForStandbyTrigger());
 
-	return record;
+		if (record)
+		{
+			/* Great, got a record */
+			return record;
+		}
+		else
+		{
+			/* No valid record available from this source */
+			lastSourceFailed = true;
+
+			/*
+			 * If archive recovery was requested, but we were still doing crash
+			 * recovery, switch to archive recovery and retry using the offline
+			 * archive. We have now replayed all the valid WAL in pg_xlog, so
+			 * we are presumably now consistent.
+			 *
+			 * We require that there's at least some valid WAL present in
+			 * pg_xlog, however (!fetching_ckpt). We could recover using the WAL
+			 * from the archive, even if pg_xlog is completely empty, but we'd
+			 * have no idea how far we'd have to replay to reach consistency.
+			 * So err on the safe side and give up.
+			 */
+			if (!InArchiveRecovery && ArchiveRecoveryRequested &&
+				!fetching_ckpt)
+			{
+				ereport(DEBUG1,
+						(errmsg_internal("reached end of WAL in pg_xlog, entering archive recovery")));
+				InArchiveRecovery = true;
+				if (StandbyModeRequested)
+					StandbyMode = true;
+
+				/* initialize minRecoveryPoint to this record */
+				LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+				ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
+				if (ControlFile->minRecoveryPoint < EndRecPtr)
+				{
+					ControlFile->minRecoveryPoint = EndRecPtr;
+					ControlFile->minRecoveryPointTLI = ThisTimeLineID;
+				}
+				/* update local copy */
+				minRecoveryPoint = ControlFile->minRecoveryPoint;
+				minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
+
+				UpdateControlFile();
+				LWLockRelease(ControlFileLock);
+
+				CheckRecoveryConsistency();
+
+				/*
+				 * Before we retry, reset lastSourceFailed and currentSource
+				 * so that we will check the archive next.
+				 */
+				lastSourceFailed = false;
+				currentSource = 0;
+
+				continue;
+			}
+
+			/* In standby mode, loop back to retry. Otherwise, give up. */
+			if (StandbyMode && !CheckForStandbyTrigger())
+				continue;
+			else
+				return NULL;
+		}
+	}
 }
 
 /*
@@ -4213,7 +4281,7 @@ readRecoveryCommandFile(void)
 		}
 		else if (strcmp(item->name, "standby_mode") == 0)
 		{
-			if (!parse_bool(item->value, &StandbyMode))
+			if (!parse_bool(item->value, &StandbyModeRequested))
 				ereport(ERROR,
 						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 						 errmsg("parameter \"%s\" requires a Boolean value",
@@ -4244,7 +4312,7 @@ readRecoveryCommandFile(void)
 	/*
 	 * Check for compulsory parameters
 	 */
-	if (StandbyMode)
+	if (StandbyModeRequested)
 	{
 		if (PrimaryConnInfo == NULL && recoveryRestoreCommand == NULL)
 			ereport(WARNING,
@@ -4261,7 +4329,7 @@ readRecoveryCommandFile(void)
 	}
 
 	/* Enable fetching from archive recovery area */
-	InArchiveRecovery = true;
+	ArchiveRecoveryRequested = true;
 
 	/*
 	 * If user specified recovery_target_timeline, validate it or compute the
@@ -4271,6 +4339,11 @@ readRecoveryCommandFile(void)
 	 */
 	if (rtliGiven)
 	{
+		/*
+		 * Temporarily set InArchiveRecovery, so that existsTimeLineHistory
+		 * or findNewestTimeLine below will check the archive.
+		 */
+		InArchiveRecovery = true;
 		if (rtli)
 		{
 			/* Timeline 1 does not have a history file, all else should */
@@ -4287,6 +4360,7 @@ readRecoveryCommandFile(void)
 			recoveryTargetTLI = findNewestTimeLine(recoveryTargetTLI);
 			recoveryTargetIsLatest = true;
 		}
+		InArchiveRecovery = false;
 	}
 
 	FreeConfigVariables(head);
@@ -4850,9 +4924,9 @@ StartupXLOG(void)
 			archiveCleanupCommand ? archiveCleanupCommand : "",
 			sizeof(XLogCtl->archiveCleanupCommand));
 
-	if (InArchiveRecovery)
+	if (ArchiveRecoveryRequested)
 	{
-		if (StandbyMode)
+		if (StandbyModeRequested)
 			ereport(LOG,
 					(errmsg("entering standby mode")));
 		else if (recoveryTarget == RECOVERY_TARGET_XID)
@@ -4892,7 +4966,7 @@ StartupXLOG(void)
 	 * Take ownership of the wakeup latch if we're going to sleep during
 	 * recovery.
 	 */
-	if (StandbyMode)
+	if (StandbyModeRequested)
 		OwnLatch(&XLogCtl->recoveryWakeupLatch);
 
 	/* Set up XLOG reader facility */
@@ -4909,6 +4983,15 @@ StartupXLOG(void)
 						  &backupFromStandby))
 	{
 		/*
+		 * Archive recovery was requested, and thanks to the backup label file,
+		 * we know how far we need to replay to reach consistency. Enter
+		 * archive recovery directly.
+		 */
+		InArchiveRecovery = true;
+		if (StandbyModeRequested)
+			StandbyMode = true;
+
+		/*
 		 * When a backup_label file is present, we want to roll forward from
 		 * the checkpoint it identifies, rather than using pg_control.
 		 */
@@ -4949,6 +5032,33 @@ StartupXLOG(void)
 	else
 	{
 		/*
+		 * It's possible that archive recovery was requested, but we don't
+		 * know how far we need to replay the WAL before we reach consistency.
+		 * This can happen for example if a base backup is taken from a running
+		 * server using an atomic filesystem snapshot, without calling
+		 * pg_start/stop_backup. Or if you just kill a running master server
+		 * and put it into archive recovery by creating a recovery.conf file.
+		 *
+		 * Our strategy in that case is to perform crash recovery first,
+		 * replaying all the WAL present in pg_xlog, and only enter archive
+		 * recovery after that.
+		 *
+		 * But usually we already know how far we need to replay the WAL (up to
+		 * minRecoveryPoint, up to backupEndPoint, or until we see an
+		 * end-of-backup record), and we can enter archive recovery directly.
+		 */
+		if (ArchiveRecoveryRequested &&
+			(ControlFile->minRecoveryPoint != InvalidXLogRecPtr ||
+			 ControlFile->backupEndRequired ||
+			 ControlFile->backupEndPoint != InvalidXLogRecPtr ||
+			 ControlFile->state == DB_SHUTDOWNED))
+		{
+			InArchiveRecovery = true;
+			if (StandbyModeRequested)
+				StandbyMode = true;
+		}
+
+		/*
 		 * Get the last valid checkpoint record.  If the latest one according
 		 * to pg_control is broken, try the next-to-last one.
 		 */
@@ -5116,7 +5226,7 @@ StartupXLOG(void)
 	}
 	else if (ControlFile->state != DB_SHUTDOWNED)
 		InRecovery = true;
-	else if (InArchiveRecovery)
+	else if (ArchiveRecoveryRequested)
 	{
 		/* force recovery due to presence of recovery.conf */
 		InRecovery = true;
@@ -5155,15 +5265,6 @@ StartupXLOG(void)
 		ControlFile->prevCheckPoint = ControlFile->checkPoint;
 		ControlFile->checkPoint = checkPointLoc;
 		ControlFile->checkPointCopy = checkPoint;
-		if (InArchiveRecovery)
-		{
-			/* initialize minRecoveryPoint if not set yet */
-			if (ControlFile->minRecoveryPoint < checkPoint.redo)
-			{
-				ControlFile->minRecoveryPoint = checkPoint.redo;
-				ControlFile->minRecoveryPointTLI = checkPoint.ThisTimeLineID;
-			}
-		}
 
 		/*
 		 * Set backupStartPoint if we're starting recovery from a base backup.
@@ -5243,7 +5344,7 @@ StartupXLOG(void)
 		 * control file and we've established a recovery snapshot from a
 		 * running-xacts WAL record.
 		 */
-		if (InArchiveRecovery && EnableHotStandby)
+		if (ArchiveRecoveryRequested && EnableHotStandby)
 		{
 			TransactionId *xids;
 			int			nxids;
@@ -5344,7 +5445,7 @@ StartupXLOG(void)
 		 * process in addition to postmaster!  Also, fsync requests are
 		 * subsequently to be handled by the checkpointer, not locally.
 		 */
-		if (InArchiveRecovery && IsUnderPostmaster)
+		if (ArchiveRecoveryRequested && IsUnderPostmaster)
 		{
 			PublishStartupProcessInformation();
 			SetForwardFsyncRequests();
@@ -5601,7 +5702,7 @@ StartupXLOG(void)
 	 * We don't need the latch anymore. It's not strictly necessary to disown
 	 * it, but let's do it for the sake of tidiness.
 	 */
-	if (StandbyMode)
+	if (StandbyModeRequested)
 		DisownLatch(&XLogCtl->recoveryWakeupLatch);
 
 	/*
@@ -5646,7 +5747,7 @@ StartupXLOG(void)
 		 * crashes while an online backup is in progress. We must not treat
 		 * that as an error, or the database will refuse to start up.
 		 */
-		if (InArchiveRecovery || ControlFile->backupEndRequired)
+		if (ArchiveRecoveryRequested || ControlFile->backupEndRequired)
 		{
 			if (ControlFile->backupEndRequired)
 				ereport(FATAL,
@@ -5677,10 +5778,12 @@ StartupXLOG(void)
 	 * In a normal crash recovery, we can just extend the timeline we were in.
 	 */
 	PrevTimeLineID = ThisTimeLineID;
-	if (InArchiveRecovery)
+	if (ArchiveRecoveryRequested)
 	{
 		char	reason[200];
 
+		Assert(InArchiveRecovery);
+
 		ThisTimeLineID = findNewestTimeLine(recoveryTargetTLI) + 1;
 		ereport(LOG,
 				(errmsg("selected new timeline ID: %u", ThisTimeLineID)));
@@ -5720,7 +5823,7 @@ StartupXLOG(void)
 	 * that we also have a copy of the last block of the old WAL in readBuf;
 	 * we will use that below.)
 	 */
-	if (InArchiveRecovery)
+	if (ArchiveRecoveryRequested)
 		exitArchiveRecovery(xlogreader->readPageTLI, endLogSegNo);
 
 	/*
@@ -7706,7 +7809,7 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *record)
 		 * record, the backup was canceled and the end-of-backup record will
 		 * never arrive.
 		 */
-		if (InArchiveRecovery &&
+		if (ArchiveRecoveryRequested &&
 			!XLogRecPtrIsInvalid(ControlFile->backupStartPoint) &&
 			XLogRecPtrIsInvalid(ControlFile->backupEndPoint))
 			ereport(PANIC,
@@ -9118,7 +9221,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 		 * Request a restartpoint if we've replayed too much xlog since the
 		 * last one.
 		 */
-		if (StandbyMode && bgwriterLaunched)
+		if (StandbyModeRequested && bgwriterLaunched)
 		{
 			if (XLogCheckpointNeeded(readSegNo))
 			{
@@ -9141,27 +9244,18 @@ retry:
 		(readSource == XLOG_FROM_STREAM &&
 		 receivedUpto < targetPagePtr + reqLen))
 	{
-		if (StandbyMode)
-		{
-			if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
-											 private->randAccess,
-											 private->fetching_ckpt,
-											 targetRecPtr))
-				goto triggered;
-		}
-		/* In archive or crash recovery. */
-		else if (readFile < 0)
+		if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
+										 private->randAccess,
+										 private->fetching_ckpt,
+										 targetRecPtr))
 		{
-			int source;
-
-			if (InArchiveRecovery)
-				source = XLOG_FROM_ANY;
-			else
-				source = XLOG_FROM_PG_XLOG;
+			if (readFile >= 0)
+				close(readFile);
+			readFile = -1;
+			readLen = 0;
+			readSource = 0;
 
-			readFile = XLogFileReadAnyTLI(readSegNo, emode, source);
-			if (readFile < 0)
-				return -1;
+			return -1;
 		}
 	}
 
@@ -9234,22 +9328,16 @@ next_record_is_invalid:
 		goto retry;
 	else
 		return -1;
-
-triggered:
-	if (readFile >= 0)
-		close(readFile);
-	readFile = -1;
-	readLen = 0;
-	readSource = 0;
-
-	return -1;
 }
 
 /*
- * In standby mode, wait for WAL at position 'RecPtr' to become available, either
- * via restore_command succeeding to restore the segment, or via walreceiver
- * having streamed the record (or via someone copying the segment directly to
- * pg_xlog, but that is not documented or recommended).
+ * Open the WAL segment containing WAL position 'RecPtr'.
+ *
+ * The segment can be fetched via restore_command, or via walreceiver having
+ * streamed the record, or it can already be present in pg_xlog. Checking
+ * pg_xlog is mainly for crash recovery, but it will be polled in standby mode
+ * too, in case someone copies a new segment directly to pg_xlog. That is not
+ * documented or recommended, though.
  *
  * If 'fetching_ckpt' is true, we're fetching a checkpoint record, and should
  * prepare to read WAL starting from RedoStartLSN after this.
@@ -9259,6 +9347,10 @@ triggered:
  * 'tliRecPtr' is the position of the WAL record we're interested in. It is
  * used to decide which timeline to stream the requested WAL from.
  *
+ * If the record is not immediately available, the function returns false
+ * if we're not in standby mode. In standby mode, it waits for the record
+ * to become available.
+ *
  * When the requested record becomes available, the function opens the file
  * containing it (if not open already), and returns true. When end of standby
  * mode is triggered by the user, and there is no more WAL available, returns
@@ -9292,7 +9384,9 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 	 * part of advancing to the next state.
 	 *-------
 	 */
-	if (currentSource == 0)
+	if (!InArchiveRecovery)
+		currentSource = XLOG_FROM_PG_XLOG;
+	else if (currentSource == 0)
 		currentSource = XLOG_FROM_ARCHIVE;
 
 	for (;;)
@@ -9307,7 +9401,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		 */
 		if (lastSourceFailed)
 		{
-
 			switch (currentSource)
 			{
 				case XLOG_FROM_ARCHIVE:
@@ -9321,13 +9414,20 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * file, we still finish replaying as much as we can from
 					 * archive and pg_xlog before failover.
 					 */
-					if (CheckForStandbyTrigger())
+					if (StandbyMode && CheckForStandbyTrigger())
 					{
 						ShutdownWalRcv();
 						return false;
 					}
 
 					/*
+					 * Not in standby mode, and we've now tried the archive and
+					 * pg_xlog.
+					 */
+					if (!StandbyMode)
+						return false;
+
+					/*
 					 * If primary_conninfo is set, launch walreceiver to try to
 					 * stream the missing WAL.
 					 *
@@ -9431,7 +9531,8 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 			 * in the archive over ones in pg_xlog, so try the next file
 			 * again from the archive first.
 			 */
-			currentSource = XLOG_FROM_ARCHIVE;
+			if (InArchiveRecovery)
+				currentSource = XLOG_FROM_ARCHIVE;
 		}
 
 		if (currentSource != oldSource)
@@ -9584,9 +9685,9 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		 * process.
 		 */
 		HandleStartupProcInterrupts();
-	}
+	} while (StandbyMode);
 
-	return false; 	/* not reached */
+	return false;
 }
 
 /*
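[Editorial note: the escalation step the patch adds to ReadRecord() can be condensed into a small standalone sketch. The names are borrowed from the patch for readability, but this is only an illustrative model under stated assumptions, not the real xlog.c machinery.]

```c
#include <stdbool.h>

/*
 * Simplified recovery state, modeled on the patch above. The *Requested
 * flags reflect what recovery.conf asked for; the other two reflect what
 * recovery is actually doing right now.
 */
typedef struct RecoveryState
{
	bool		ArchiveRecoveryRequested;	/* recovery.conf exists */
	bool		StandbyModeRequested;		/* standby_mode = 'on' */
	bool		InArchiveRecovery;
	bool		StandbyMode;
} RecoveryState;

/*
 * Called when no more valid WAL can be read from pg_xlog. Mirrors the new
 * branch in ReadRecord(): if archive recovery was requested but we were
 * still doing crash recovery, switch to archive recovery (and to standby
 * mode, if that was requested) and tell the caller to retry from the
 * archive. In plain crash recovery there is nothing left to try.
 */
static bool
escalate_after_local_wal_ends(RecoveryState *st)
{
	if (!st->InArchiveRecovery && st->ArchiveRecoveryRequested)
	{
		st->InArchiveRecovery = true;
		if (st->StandbyModeRequested)
			st->StandbyMode = true;
		return true;			/* retry, now reading from the archive */
	}
	return false;				/* end of crash recovery: give up */
}
```

In the real code the switch also advances minRecoveryPoint in the control file and resets currentSource so the archive is checked next; the sketch omits both.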
#29 Michael Paquier
michael.paquier@gmail.com
In reply to: Heikki Linnakangas (#28)
Re: 9.2.3 crashes during archive recovery

On Thu, Feb 21, 2013 at 11:09 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

On 15.02.2013 15:49, Heikki Linnakangas wrote:

Attached is a patch for git master. The basic idea is to split
InArchiveRecovery into two variables, InArchiveRecovery and
ArchiveRecoveryRequested. ArchiveRecoveryRequested is set when
recovery.conf exists. But if we don't know how far we need to recover,
we first perform crash recovery with InArchiveRecovery=false. When we
reach the end of WAL in pg_xlog, InArchiveRecovery is set, and we
continue with normal archive recovery.

New version of this attached, with a few bugs fixed.

I'm thinking that this should be back-patched to 9.2, but not to earlier
branches. Before 9.2, we don't PANIC at a reference to a non-existent page
until end of recovery, even if we've already reached consistency. The same
basic issue still exists in earlier versions, though: if you have
hot_standby=on, the system will open for read-only queries too early,
before the database is consistent. But this patch is invasive enough that
I'm wary of back-patching it further, when the worst that can happen is
that there's a small window right after startup when you can see an
inconsistent database in hot standby mode. Maybe after we get some more
testing of this in 9.2 and master. Opinions on that?

People have not yet complained about this problem with versions prior to
9.1. Is it worth backpatching in this case?
--
Michael

#30 Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Fujii Masao (#17)
Re: 9.2.3 crashes during archive recovery

On 14.02.2013 19:18, Fujii Masao wrote:

Yes. And the resource agent for streaming replication in Pacemaker (it's the
OSS clusterware) is the user of that archive recovery scenario, too. When it
starts up the server, it always creates the recovery.conf and starts the server
as the standby. It cannot start the master directly, IOW the server is always
promoted to the master from the standby. So when it starts up the server
after the server crashes, obviously it executes the same recovery scenario
(i.e., force archive recovery instead of crash one) as Kyotaro described.

The reason why that resource agent cannot start up the master directly is
that it manages three server states, called Master, Slave and Down. It can
move the server state from Down to Slave, and the reverse direction.
Also it can move the state from Slave to Master, and the reverse direction.
But there is no way to move the state between Down and Master directly.
This kind of the state transition model is isolated case in
clusterware, I think.

I don't have much sympathy for that to be honest. Seems like something
that should be fixed in Pacemaker or the scripts used to glue it with
PostgreSQL. However, this patch should make that work, so I guess
everyone is happy.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#31 Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Kyotaro HORIGUCHI (#19)
Re: 9.2.3 crashes during archive recovery

On 15.02.2013 10:33, Kyotaro HORIGUCHI wrote:

Sorry, I omitted to show how we found this issue.

In an HA DB cluster consisting of Pacemaker and PostgreSQL, PostgreSQL
is stopped by 'pg_ctl stop -m i' regardless of the situation.

That seems like a bad idea. If nothing else, crash recovery can take a
long time. I don't know much about Pacemaker, but wouldn't it make more
sense to at least try fast shutdown first, falling back to immediate
shutdown after a timeout?

- Heikki


#32 Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Michael Paquier (#29)
Re: 9.2.3 crashes during archive recovery

On 22.02.2013 02:13, Michael Paquier wrote:

On Thu, Feb 21, 2013 at 11:09 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

On 15.02.2013 15:49, Heikki Linnakangas wrote:

Attached is a patch for git master. The basic idea is to split
InArchiveRecovery into two variables, InArchiveRecovery and
ArchiveRecoveryRequested. ArchiveRecoveryRequested is set when
recovery.conf exists. But if we don't know how far we need to recover,
we first perform crash recovery with InArchiveRecovery=false. When we
reach the end of WAL in pg_xlog, InArchiveRecovery is set, and we
continue with normal archive recovery.

New version of this attached, with a few bugs fixed.

I'm thinking that this should be back-patched to 9.2, but not to earlier
branches. Before 9.2, we don't PANIC at a reference to a non-existent page
until end of recovery, even if we've already reached consistency. The same
basic issue still exists in earlier versions, though: if you have
hot_standby=on, the system will open for read-only queries too early,
before the database is consistent. But this patch is invasive enough that
I'm wary of back-patching it further, when the worst that can happen is
that there's a small window right after startup when you can see an
inconsistent database in hot standby mode. Maybe after we get some more
testing of this in 9.2 and master. Opinions on that?

People have not yet complained about this problem with versions prior to
9.1. Is it worth backpatching in this case?

Possibly not..

Anyway, I've committed this to master and 9.2 now.

- Heikki


#33 Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Heikki Linnakangas (#32)
Re: 9.2.3 crashes during archive recovery

Hello,

Anyway, I've committed this to master and 9.2 now.

This seems to fix the issue. We'll examine this further.

Thank you.

--
Kyotaro Horiguchi
NTT Open Source Software Center


#34 Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Heikki Linnakangas (#31)
Re: 9.2.3 crashes during archive recovery

At Fri, 22 Feb 2013 11:42:39 +0200, Heikki Linnakangas <hlinnakangas@vmware.com> wrote in <51273D8F.7060609@vmware.com>

On 15.02.2013 10:33, Kyotaro HORIGUCHI wrote:

In an HA DB cluster consisting of Pacemaker and PostgreSQL, PostgreSQL
is stopped by 'pg_ctl stop -m i' regardless of the situation.

That seems like a bad idea. If nothing else, crash recovery can take a
long time. I don't know much about Pacemaker, but wouldn't it make
more sense to at least try fast shutdown first, falling back to
immediate shutdown after a timeout?

I completely agree with you. But I hear that they want to shut
down as fast as possible on failure and also don't want more
states in the RA. On the other hand, the result of this kind of
failure is catastrophic, so it should be fixed regardless of
what Pacemaker does.

regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center


#35 Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Heikki Linnakangas (#27)
Re: 9.2.3 crashes during archive recovery

Although this has become useless, I want to explain how this
works.

I tried to postpone smgrtruncate to the next checkpoint.

Umm, why? I don't understand this patch at all.

This inhibits truncating files after (quite vaguely in the patch :-)
the previous checkpoint, by keeping VACUUM from removing the deleted
tuples and thereby keeping the vm pages from being truncated. They are
truncated only if no tuple has been deleted on the pages corresponding
to them since the last checkpoint. As a result, no nonexistent page in
the visibility map is touched after the consistency point - which is
after the last checkpoint - during archive recovery.

# Umm.. sorry for confused explanation...

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#36 Josh Berkus
josh@agliodbs.com
In reply to: Kyotaro HORIGUCHI (#1)
Re: 9.2.3 crashes during archive recovery

Folks,

Is there any way this particular issue could cause data corruption
without causing a crash? I don't see a way for it to do so, but I
wanted to verify.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


#37 Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#33)
Re: 9.2.3 crashes during archive recovery

This is an interim report for this patch.

We found that PostgreSQL with this patch unexpectedly becomes
the primary when starting up as a standby. We'll investigate
the behavior further.

Anyway, I've committed this to master and 9.2 now.

This seems to fix the issue. We'll examine this further.

--
Kyotaro Horiguchi
NTT Open Source Software Center


#38 Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#37)
2 attachment(s)
Re: 9.2.3 crashes during archive recovery

Hello, I could reproduce the behavior and I think I understand the cause.

The head of origin/REL9_2_STABLE shows the behavior I mentioned in
the last message when using the attached shell script. 9.3dev
runs as expected.

In XLogPageRead, when RecPtr goes beyond the last page, the
current xlog file is released and a new page is requested.

The variables were as follows at that point.

StandbyModeRequested == true
StandbyMode == false
ArchiveRecoveryRequested == true
InArchiveRecovery == false

In this case, XLogPageRead immediately returns failure before trying
to get WAL via streaming or from the archive. So ReadRecord
returns NULL, which unexpectedly exits the 'main redo apply loop' and
increments the timeline ID as if the server had been promoted.

This seems to be fixed by letting it try all the requested
sources. The attached patch does this, and the test script runs
as expected.

We found that PostgreSQL with this patch unexpectedly becomes
the primary when starting up as a standby. We'll investigate
the behavior further.
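[Editorial note: the failure mode described above can be condensed into a tiny model. The names and the helper below are hypothetical, not the 9.2 source; the point is that after pg_xlog fails, the reader must consult the *requested* flags rather than only the flags saying what it is currently doing.]

```c
#include <stdbool.h>

/* The WAL sources a 9.2-era reader can try, in escalation order. */
typedef enum WalSource
{
	SRC_PG_XLOG,				/* local WAL, used during crash recovery */
	SRC_ARCHIVE,				/* restore_command */
	SRC_STREAM					/* walreceiver */
} WalSource;

/*
 * After failing to read from 'failed', decide whether another requested
 * source remains to be tried. The backported bug was effectively to test
 * only InArchiveRecovery/StandbyMode (both still false) and give up,
 * ignoring that archive recovery and standby mode had been requested -
 * so recovery ended and the timeline ID was bumped as if promoted.
 */
static bool
try_next_source(WalSource failed,
				bool archiveRequested, bool *inArchiveRecovery,
				bool standbyRequested, bool *standbyMode)
{
	if (failed == SRC_PG_XLOG && archiveRequested && !*inArchiveRecovery)
	{
		*inArchiveRecovery = true;	/* escalate to the archive */
		return true;
	}
	if (failed == SRC_ARCHIVE && standbyRequested && !*standbyMode)
	{
		*standbyMode = true;		/* start streaming / waiting */
		return true;
	}
	return false;					/* nothing left to try */
}
```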

Anyway, I've committed this to master and 9.2 now.

This seems to fix the issue. We'll examine this further.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

run.sh (text/plain; charset=us-ascii)
923crash-20130305.patch (text/x-patch; charset=us-ascii)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 92adc4e..00b5bc5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10604,7 +10604,19 @@ retry:
 											  sources);
 				switched_segment = true;
 				if (readFile < 0)
+				{
+					if (!InArchiveRecovery && ArchiveRecoveryRequested)
+					{
+						InArchiveRecovery = true;
+						goto retry;
+					}
+					else if (!StandbyMode && StandbyModeRequested)
+					{
+						StandbyMode = true;
+						goto retry;
+					}
 					return false;
+				}
 			}
 		}
 	}
#39 Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#38)
1 attachment(s)
Re: 9.2.3 crashes during archive recovery

Sorry, I sent the wrong script.

The head of origin/REL9_2_STABLE shows the behavior I mentioned in
the last message when using the attached shell script. 9.3dev
runs as expected.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

run.sh (text/plain; charset=us-ascii)
#40 KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#38)
1 attachment(s)
Re: 9.2.3 crashes during archive recovery

Hi,

Horiguchi's patch does not seem to record minRecoveryPoint in ReadRecord();
the attached patch records minRecoveryPoint:
[crash recovery -> record minRecoveryPoint in control file -> archive recovery]
I think that this is the original intention of Heikki's patch.

I also found a bug in the latest 9.2_stable. It does not get the latest
timeline and the timeline history file during archive recovery when the
master and standby timelines differ.
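[Editorial note: the bracketed sequence can be sketched as follows. The types and names are hypothetical, loosely following pg_control's fields; the real server would follow this with UpdateControlFile() under ControlFileLock.]

```c
#include <stdint.h>

typedef uint64_t XLogRecPtr;	/* stand-in for the real WAL pointer type */

/* The two control-file fields relevant to this discussion. */
typedef struct MiniControlFile
{
	XLogRecPtr	minRecoveryPoint;
	uint32_t	minRecoveryPointTLI;
} MiniControlFile;

/*
 * When crash recovery ends and archive recovery begins, advance
 * minRecoveryPoint to the end of the last record replayed so far, so
 * that a later restart knows how far it must replay before claiming
 * consistency. The value only ever moves forward.
 */
static void
advance_min_recovery_point(MiniControlFile *cf,
						   XLogRecPtr endRecPtr, uint32_t tli)
{
	if (cf->minRecoveryPoint < endRecPtr)
	{
		cf->minRecoveryPoint = endRecPtr;
		cf->minRecoveryPointTLI = tli;
	}
}
```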

Best regards,

(2013/03/05 18:22), Kyotaro HORIGUCHI wrote:

Hello, I could reproduce the behavior and I think I understand the cause.

The head of origin/REL9_2_STABLE shows the behavior I mentioned in
the last message when using the attached shell script. 9.3dev
runs as expected.

In XLogPageRead, when RecPtr goes beyond the last page, the
current xlog file is released and a new page is requested.

The variables were as follows at that point.

StandbyModeRequested == true
StandbyMode == false
ArchiveRecoveryRequested == true
InArchiveRecovery == false

In this case, XLogPageRead immediately returns failure before trying
to get WAL via streaming or from the archive. So ReadRecord
returns NULL, which unexpectedly exits the 'main redo apply loop' and
increments the timeline ID as if the server had been promoted.

This seems to be fixed by letting it try all the requested
sources. The attached patch does this, and the test script runs
as expected.

We found that PostgreSQL with this patch unexpectedly becomes
the primary when starting up as a standby. We'll investigate
the behavior further.

Anyway, I've committed this to master and 9.2 now.

This seems to fix the issue. We'll examine this further.

regards,

--
Mitsumasa KONDO
NTT OSS Center

Attachments:

crash-archive-recovery9.2_stable.patch (text/x-diff)
--- a/src/backend/access/transam/xlog.c	2013-03-04 15:13:49.000000000 -0500
+++ b/src/backend/access/transam/xlog.c	2013-03-05 06:43:49.435093827 -0500
@@ -4446,7 +4446,7 @@
 	if (targetTLI == 1)
 		return list_make1_int((int) targetTLI);
 
-	if (InArchiveRecovery)
+	if (ArchiveRecoveryRequested)
 	{
 		TLHistoryFileName(histfname, targetTLI);
 		fromArchive =
@@ -10603,8 +10603,11 @@
 				readFile = XLogFileReadAnyTLI(readId, readSeg, emode,
 											  sources);
 				switched_segment = true;
-				if (readFile < 0)
+				if (readFile < 0){
+					if (StandbyModeRequested)
+						return true;
 					return false;
+				}
 			}
 		}
 	}
#41 Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: KONDO Mitsumasa (#40)
Re: 9.2.3 crashes during archive recovery

Hmm..

Horiguchi's patch does not seem to record minRecoveryPoint in
ReadRecord();
the attached patch records minRecoveryPoint:
[crash recovery -> record minRecoveryPoint in control file -> archive
recovery]
I think that this is the original intention of Heikki's patch.

It could be. Before that, my patch did not wake up hot standby
because I didn't take care of minRecoveryPoint in it :-p

On the other hand, your patch fixes that point, but ReadRecord
then runs on bogus page data. Even though the wrong record on the
bogus page can be identified as broken, it is still
undesirable behavior.

If we want to present the final solution ourselves, we need to
make another patch that settles all of these points. Let me give
it further thought.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#42 Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#41)
1 attachment(s)
Re: 9.2.3 crashes during archive recovery

Hi, I suppose the attached patch is close to the solution.

I think that this is an original intention of Heikki's patch.

I noticed that, thanks to your patch, archive recovery is turned on
at the next_record_is_invalid label.

On the other hand, your patch fixes that point, but ReadRecord
then runs on bogus page data. Even though the wrong record on the
bogus page can be identified as broken, it is still
undesirable behavior.

All we should do to update minRecoveryPoint as needed when
XLogPageRead fails in ReadRecord is to simply jump to the
next_record_is_invalid label if archive recovery is requested but
we are still doing crash recovery.

Your modification in readTimeLineHistory and my modifications in
XLogPageRead seem unnecessary for the objective of this thread,
so I removed them.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

923crash-20130306.patch (text/x-patch; charset=us-ascii)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 92adc4e..28d6f2e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4010,7 +4010,15 @@ ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt)
 retry:
 	/* Read the page containing the record */
 	if (!XLogPageRead(RecPtr, emode, fetching_ckpt, randAccess))
+	{
+		/*
+		 * If archive recovery was requested when crash recovery failed, go to
+		 * the label next_record_is_invalid to switch to archive recovery.
+		 */
+		if (!InArchiveRecovery && ArchiveRecoveryRequested)
+			goto next_record_is_invalid;
 		return NULL;
+	}
 
 	pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) readBuf);
 	targetRecOff = RecPtr->xrecoff % XLOG_BLCKSZ;
#43 Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: KONDO Mitsumasa (#40)
1 attachment(s)
Re: 9.2.3 crashes during archive recovery

On 05.03.2013 14:09, KONDO Mitsumasa wrote:

Hi,

Horiguchi's patch does not seem to record minRecoveryPoint in ReadRecord();
the attached patch records minRecoveryPoint:
[crash recovery -> record minRecoveryPoint in control file -> archive
recovery]
I think that this is the original intention of Heikki's patch.

Yeah. That fix isn't right, though; XLogPageRead() is supposed to return
true on success, and false on error, and the patch makes it return
'true' on error, if archive recovery was requested but we're still in
crash recovery. The real issue here is that I missed the two "return
NULL;"s in ReadRecord(), so the code that I put in the
next_record_is_invalid codepath isn't run if XLogPageRead() doesn't find
the file at all. Attached patch is the proper fix for this.

I also found a bug in the latest 9.2_stable. It does not get the latest
timeline and recovery history file in archive recovery when the master
and standby timelines are different.

Works for me.. Can you create a test script for that? Remember to set
"recovery_target_timeline='latest'".

- Heikki

Attachments:

further-archive-after-crash-recovery-fix.patch (text/x-diff)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 92adc4e..74a54f6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4010,7 +4010,7 @@ ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt)
 retry:
 	/* Read the page containing the record */
 	if (!XLogPageRead(RecPtr, emode, fetching_ckpt, randAccess))
-		return NULL;
+		goto next_record_is_invalid;
 
 	pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) readBuf);
 	targetRecOff = RecPtr->xrecoff % XLOG_BLCKSZ;
@@ -4168,7 +4168,7 @@ retry:
 			}
 			/* Wait for the next page to become available */
 			if (!XLogPageRead(&pagelsn, emode, false, false))
-				return NULL;
+				goto next_record_is_invalid;
 
 			/* Check that the continuation record looks valid */
 			if (!(((XLogPageHeader) readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD))
#44 KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: Heikki Linnakangas (#43)
3 attachment(s)
Re: 9.2.3 crashes during archive recovery

(2013/03/06 16:50), Heikki Linnakangas wrote:

Hi,

Horiguchi's patch does not seem to record minRecoveryPoint in ReadRecord();
the attached patch records minRecoveryPoint.
[crash recovery -> record minRecoveryPoint in control file -> archive
recovery]
I think that this is the original intention of Heikki's patch.

Yeah. That fix isn't right, though; XLogPageRead() is supposed to return true on success, and false on error, and the patch makes it return 'true' on error, if archive recovery was requested but we're still in crash recovery. The real issue here is that I missed the two "return NULL;"s in ReadRecord(), so the code that I put in the next_record_is_invalid codepath isn't run if XLogPageRead() doesn't find the file at all. Attached patch is the proper fix for this.

Thanks for creating the patch! I tested your patch on 9.2_STABLE, but it does not work with the promote command...
When XLogPageRead() returns false, it means the end of the standby loop, the crash recovery loop, and the archive recovery loop.
Your patch is not good for promoting a standby to master: it never leaves the standby loop.

So I made a new patch which is based on Heikki's and Horiguchi's patches.
I also attach a test script, a modified version of Horiguchi's script. This script does not depend on the shell environment; only PGPATH needs to be adjusted.
Please run this test script.

I also found a bug in the latest 9.2_stable. It does not get the latest
timeline and recovery history file in archive recovery when the master
and standby timelines are different.

Works for me.. Can you create a test script for that? Remember to set "recovery_target_timeline='latest'".

I set recovery_target_timeline=latest. hmm...

Here is my recovery.conf.

[mitsu-ko@localhost postgresql]$ cat Standby/recovery.conf
standby_mode = 'yes'
recovery_target_timeline='latest'
primary_conninfo='host=localhost port=65432'
restore_command='cp ../arc/%f %p'

And my system's log message is here.

waiting for server to start....[Standby] LOG: database system was shut down in recovery at 2013-03-07 02:56:05 EST
[Standby] LOG: restored log file "00000002.history" from archive
cp: cannot stat `../arc/00000003.history': No such file or directory
[Standby] FATAL: requested timeline 2 is not a child of database system timeline 1
[Standby] LOG: startup process (PID 20941) exited with exit code 1
[Standby] LOG: aborting startup due to startup process failure

It can be reproduced with my test script, too.
The last master start command in my test script might not look like a common scenario,
but it is common in PostgreSQL systems managed by Pacemaker.

Best regards,
--
Mitsumasa KONDO
NTT OSS Center

Attachments:

crash-archive-recovery9.2_stable_v2.patch (text/x-diff)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 92adc4e..2486683 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4010,7 +4010,15 @@ ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt)
 retry:
 	/* Read the page containing the record */
 	if (!XLogPageRead(RecPtr, emode, fetching_ckpt, randAccess))
+	{
+		/*
+		 * If archive recovery was requested when crash recovery failed, go to
+		 * the label next_record_is_invalid to switch to archive recovery.
+		 */
+		if (!InArchiveRecovery && ArchiveRecoveryRequested)
+			goto next_record_is_invalid;
 		return NULL;
+	}
 
 	pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) readBuf);
 	targetRecOff = RecPtr->xrecoff % XLOG_BLCKSZ;
@@ -4168,7 +4176,15 @@ retry:
 			}
 			/* Wait for the next page to become available */
 			if (!XLogPageRead(&pagelsn, emode, false, false))
+			{
+				/*
+				 * If archive recovery was requested when crash recovery failed, go to
+				 * the label next_record_is_invalid to switch to archive recovery.
+				 */
+				if (!InArchiveRecovery && ArchiveRecoveryRequested)
+					goto next_record_is_invalid;
 				return NULL;
+			}
 
 			/* Check that the continuation record looks valid */
 			if (!(((XLogPageHeader) readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD))
fix-not-recovery-target-timeline-bug.patch (text/x-diff)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 92adc4e..591e8c0 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4446,7 +4462,7 @@ readTimeLineHistory(TimeLineID targetTLI)
 	if (targetTLI == 1)
 		return list_make1_int((int) targetTLI);
 
-	if (InArchiveRecovery)
+	if (ArchiveRecoveryRequested)
 	{
 		TLHistoryFileName(histfname, targetTLI);
 		fromArchive =
run_v2.sh (text/plain; charset=Shift_JIS)
#45 Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: KONDO Mitsumasa (#44)
Re: 9.2.3 crashes during archive recovery

On 07.03.2013 10:05, KONDO Mitsumasa wrote:

(2013/03/06 16:50), Heikki Linnakangas wrote:

Yeah. That fix isn't right, though; XLogPageRead() is supposed to
return true on success, and false on error, and the patch makes it
return 'true' on error, if archive recovery was requested but we're
still in crash recovery. The real issue here is that I missed the two
"return NULL;"s in ReadRecord(), so the code that I put in the
next_record_is_invalid codepath isn't run if XLogPageRead() doesn't
find the file at all. Attached patch is the proper fix for this.

Thanks for creating the patch! I tested your patch on 9.2_STABLE, but it
does not work with the promote command...
When XLogPageRead() returns false, it means the end of the standby loop,
the crash recovery loop, and the archive recovery loop.
Your patch is not good for promoting a standby to master: it never
leaves the standby loop.

So I made a new patch which is based on Heikki's and Horiguchi's patches.

Ah, I see. I committed a slightly modified version of this.

I also found a bug in the latest 9.2_stable. It does not get the latest
timeline and recovery history file in archive recovery when the master
and standby timelines are different.

Works for me.. Can you create a test script for that? Remember to set
"recovery_target_timeline='latest'".

...
It can be reproduced in my test script, too.

I see the problem now, with that script. So what happens is that the
startup process first scans the timeline history files to choose the
recovery target timeline. For that scan, I temporarily set
InArchiveRecovery=true, in readRecoveryCommandFile. However, after
readRecoveryCommandFile returns, we then try to read the timeline
history file corresponding to the chosen recovery target timeline, but
InArchiveRecovery is no longer set, so we don't fetch the file from
archive, and return a "dummy" history, with just the target timeline in
it. That doesn't contain the older timeline, so you get an error at
recovery.

Fixed per your patch to check for ArchiveRecoveryRequested instead of
InArchiveRecovery, when reading timeline history files. This also makes
it unnecessary to temporarily set InArchiveRecovery=true, so removed that.

Committed both fixes. Please confirm that this fixed the problem in your
test environment. Many thanks for the testing and the patches!

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#46 KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: Heikki Linnakangas (#45)
Re: 9.2.3 crashes during archive recovery

(2013/03/07 19:41), Heikki Linnakangas wrote:

On 07.03.2013 10:05, KONDO Mitsumasa wrote:

(2013/03/06 16:50), Heikki Linnakangas wrote:

Yeah. That fix isn't right, though; XLogPageRead() is supposed to
return true on success, and false on error, and the patch makes it
return 'true' on error, if archive recovery was requested but we're
still in crash recovery. The real issue here is that I missed the two
"return NULL;"s in ReadRecord(), so the code that I put in the
next_record_is_invalid codepath isn't run if XLogPageRead() doesn't
find the file at all. Attached patch is the proper fix for this.

Thanks for creating the patch! I tested your patch on 9.2_STABLE, but it
does not work with the promote command...
When XLogPageRead() returns false, it means the end of the standby loop,
the crash recovery loop, and the archive recovery loop.
Your patch is not good for promoting a standby to master: it never
leaves the standby loop.

So I made a new patch which is based on Heikki's and Horiguchi's patches.

Ah, I see. I committed a slightly modified version of this.

I find your modification very readable. Thanks for the modification and for committing the patch!

I also found a bug in the latest 9.2_stable. It does not get the latest
timeline and recovery history file in archive recovery when the master
and standby timelines are different.

Works for me.. Can you create a test script for that? Remember to set
"recovery_target_timeline='latest'".

...
It can be reproduced in my test script, too.

I see the problem now, with that script. So what happens is that the startup process first scans the timeline history files to choose the recovery target timeline. For that scan, I temporarily set InArchiveRecovery=true, in readRecoveryCommandFile. However, after readRecoveryCommandFile returns, we then try to read the timeline history file corresponding to the chosen recovery target timeline, but InArchiveRecovery is no longer set, so we don't fetch the file from the archive, and return a "dummy" history, with just the target timeline in it. That doesn't contain the older timeline, so you get an error at recovery.

Fixed per your patch to check for ArchiveRecoveryRequested instead of InArchiveRecovery, when reading timeline history files. This also makes it unnecessary to temporarily set InArchiveRecovery=true, so removed that.

Committed both fixes. Please confirm that this fixed the problem in your test environment. Many thanks for the testing and the patches!

I understand this problem now. Thank you for the additional modification and the detailed explanation! I tested the latest REL9_2_STABLE on my system and confirmed that it runs fine without problems. If I find another problem, I will report it and send you a patch and test script!

Best regards,
--
Mitsumasa KONDO
NTT OSS Center


#47 Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: KONDO Mitsumasa (#46)
Re: 9.2.3 crashes during archive recovery

Everything seems to have been settled above my head while I was sleeping..

Sorry for the clumsy test script, and thank you for refining it, Mitsumasa.

And thank you for fixing the bug and the detailed explanation, Heikki.

I confirmed that the problem is fixed also for me at origin/REL9_2_STABLE.

I understand this problem now. Thank you for the additional modification
and the detailed explanation! I tested the latest REL9_2_STABLE on my
system and confirmed that it runs fine without problems. If I find
another problem, I will report it and send you a patch and test script!

regards,

--
Kyotaro Horiguchi

Nippon Telegraph and Telephone Corporation, NTT Open Source Software Center
Phone: 03-5860-5115 / Fax: 03-5463-5490
