Fast promotion failure

Started by Heikki Linnakangas over 12 years ago, 23 messages
#1 Heikki Linnakangas
hlinnakangas@vmware.com
1 attachment(s)

While testing the bug from the "Assertion failure at standby promotion",
I bumped into a different bug in fast promotion. When the first
checkpoint after fast promotion is performed, there is no guarantee that
the checkpointer process is running with the correct, new,
ThisTimeLineID. In CreateCheckPoint(), we have this:

    /*
     * An end-of-recovery checkpoint is created before anyone is allowed to
     * write WAL. To allow us to write the checkpoint record, temporarily
     * enable XLogInsertAllowed. (This also ensures ThisTimeLineID is
     * initialized, which we need here and in AdvanceXLInsertBuffer.)
     */
    if (flags & CHECKPOINT_END_OF_RECOVERY)
        LocalSetXLogInsertAllowed();

That ensures that ThisTimeLineID is updated when performing an
end-of-recovery checkpoint, but it doesn't get executed with fast
promotion. The consequence is that the checkpoint is created with the
old timeline, and subsequent recovery from it will fail.
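For context on how the flag goes missing: with fast promotion the startup
process writes an end-of-recovery record itself and only asks for an
ordinary, deferred checkpoint afterwards. A rough sketch of the relevant
branch around the end of StartupXLOG() (paraphrased from the 9.3-era code,
not the verbatim source; variable names and flag combinations may differ,
and in the real code the two requests are issued at different points - the
end-of-recovery one before recovery is marked finished, the fast-promotion
one only after):

	/* Sketch only - paraphrased, not the verbatim source */
	if (fast_promoted)
	{
		/*
		 * The end-of-recovery record was already written by the startup
		 * process; ask the checkpointer for a normal checkpoint later.
		 * This request does not carry CHECKPOINT_END_OF_RECOVERY, so the
		 * LocalSetXLogInsertAllowed() branch above is never taken.
		 */
		RequestCheckpoint(CHECKPOINT_FORCE);
	}
	else
	{
		/* Classic promotion: an end-of-recovery checkpoint. */
		RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
						  CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT);
	}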

I ran into this with the attached script. It sets up a master (M), a
standby (B), and a cascading standby (C). I'm not sure why, but when I
tried to simplify the script by removing the cascading standby, it
started to work. The bug occurs in standby B, so I'm not sure why the
presence of the cascading standby makes any difference. Maybe it just
affects the timing.

- Heikki

Attachments:

fast-promotion-bug.sh (application/x-shellscript)
#2 Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#1)
Re: Fast promotion failure

On Tue, May 7, 2013 at 6:57 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

While testing the bug from the "Assertion failure at standby promotion", I
bumped into a different bug in fast promotion. When the first checkpoint
after fast promotion is performed, there is no guarantee that the
checkpointer process is running with the correct, new, ThisTimeLineID. In
CreateCheckPoint(), we have this:

    /*
     * An end-of-recovery checkpoint is created before anyone is allowed to
     * write WAL. To allow us to write the checkpoint record, temporarily
     * enable XLogInsertAllowed. (This also ensures ThisTimeLineID is
     * initialized, which we need here and in AdvanceXLInsertBuffer.)
     */
    if (flags & CHECKPOINT_END_OF_RECOVERY)
        LocalSetXLogInsertAllowed();

That ensures that ThisTimeLineID is updated when performing an
end-of-recovery checkpoint, but it doesn't get executed with fast promotion.
The consequence is that the checkpoint is created with the old timeline, and
subsequent recovery from it will fail.

I ran into this with the attached script. It sets up a master (M), a standby
(B), and a cascading standby (C). I'm not sure why, but when I tried to
simplify the script by removing the cascading standby, it started to work.
The bug occurs in standby B, so I'm not sure why the presence of the
cascading standby makes any difference. Maybe it just affects the timing.

Can this really happen? ISTM that the checkpointer should detect that
the recovery mode ends and call RecoveryInProgress()->InitXLOGAccess()
before calling CreateCheckPoint().
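The path being referred to, condensed (a paraphrase of the 9.3-era
RecoveryInProgress() in xlog.c; the real function goes through a volatile
pointer to XLogCtl and carries more commentary):

	bool
	RecoveryInProgress(void)
	{
		/* Once we have seen recovery end, believe it forever. */
		if (!LocalRecoveryInProgress)
			return false;

		/* Otherwise re-check the shared flag under the spinlock. */
		SpinLockAcquire(&XLogCtl->info_lck);
		LocalRecoveryInProgress = XLogCtl->SharedRecoveryInProgress;
		SpinLockRelease(&XLogCtl->info_lck);

		/*
		 * On the first call that sees recovery as finished, initialize
		 * ThisTimeLineID and RedoRecPtr for this backend.
		 */
		if (!LocalRecoveryInProgress)
			InitXLOGAccess();

		return LocalRecoveryInProgress;
	}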

Regards,

--
Fujii Masao


#3 Amit Kapila
amit.kapila@huawei.com
In reply to: Fujii Masao (#2)
Re: Fast promotion failure

On Thursday, May 09, 2013 6:29 AM Fujii Masao wrote:

On Tue, May 7, 2013 at 6:57 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

While testing the bug from the "Assertion failure at standby promotion", I
bumped into a different bug in fast promotion. When the first checkpoint
after fast promotion is performed, there is no guarantee that the
checkpointer process is running with the correct, new, ThisTimeLineID. In
CreateCheckPoint(), we have this:

    /*
     * An end-of-recovery checkpoint is created before anyone is allowed to
     * write WAL. To allow us to write the checkpoint record, temporarily
     * enable XLogInsertAllowed. (This also ensures ThisTimeLineID is
     * initialized, which we need here and in AdvanceXLInsertBuffer.)
     */
    if (flags & CHECKPOINT_END_OF_RECOVERY)
        LocalSetXLogInsertAllowed();

That ensures that ThisTimeLineID is updated when performing an
end-of-recovery checkpoint, but it doesn't get executed with fast promotion.
The consequence is that the checkpoint is created with the old timeline, and
subsequent recovery from it will fail.

I ran into this with the attached script. It sets up a master (M), a standby
(B), and a cascading standby (C). I'm not sure why, but when I tried to
simplify the script by removing the cascading standby, it started to work.
The bug occurs in standby B, so I'm not sure why the presence of the
cascading standby makes any difference. Maybe it just affects the timing.

Can this really happen? ISTM that the checkpointer should detect that
the recovery mode ends and call RecoveryInProgress()->InitXLOGAccess()
before calling CreateCheckPoint().

Without fast promotion, the end-of-recovery checkpoint is requested and
performed while still in recovery (before xlogctl->SharedRecoveryInProgress
is cleared), so before any new operation can start we can be sure that a
checkpoint on the new timeline exists.

With fast promotion, however, the checkpoint is requested after recovery has
ended, so some operations can happen before the first checkpoint on the new
timeline. It can thus happen that the last checkpoint is on the old timeline
while there are already operations on the new timeline, which might have
caused the problem Heikki has seen.

With Regards,
Amit Kapila.


#4 Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Amit Kapila (#3)
Re: Fast promotion failure

Hello,

It can thus happen that the last checkpoint is on the old timeline while
there are already operations on the new timeline, which might have caused
the problem Heikki has seen.

I believe I have seen exactly that.

After adding an SQL command to modify the DB on standby-B after the
promotion has propagated to standby-C,

< pg_ctl -D data-standbyB promote
---
> pg_ctl -D data-standbyB -w promote
> sleep 20
> psql -p 5433 postgres -c 'create table t (a int); checkpoint;'

I saw the following complaint from standby-C.

B 2013-05-09 17:00:06.019 JST 30268 LOG: checkpoint starting: immediate force wait
C 2013-05-09 17:00:06.211 JST 30277 LOG: invalid magic number 0000 in log segment 000000010000000000000005, offset 4169728
C 2013-05-09 17:00:06.211 JST 30295 FATAL: terminating walreceiver process due to administrator command
B 2013-05-09 17:00:06.219 JST 30268 LOG: checkpoint complete: wrote 18 buffers (0.1%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.157 s, total=0.199 s; sync files=13, longest=0.041 s, average=0.012 s
CHECKPOINT
C 2013-05-09 16:41:33.974 JST 29624 LOG: record with zero length at 0/53F8B90
C 2013-05-09 16:41:33.974 JST 29624 LOG: record with zero length at 0/53F8B90
C 2013-05-09 16:41:38.980 JST 29624 LOG: record with zero length at 0/53F8B90

.. repeats forever ..

At first glance this seems to be caused by some kind of timeline
inconsistency. An explicit checkpoint before the modification on standby-B
does not help:

psql -p 5433 postgres -c 'checkpoint;create table t (a int); checkpoint;'

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#5 Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#4)
Re: Fast promotion failure

After adding some additional logging, the situation becomes clearer.

It seems that standby-B fails to promote to TLI = 2; nevertheless the
history file for TLI = 2 is somehow sent to standby-C. So standby-B
remains on TLI = 1 while standby-C alone switches to TLI = 2.

# Come to think of it, I suspect the additional logs are not that useful :(

B 2013-05-09 17:29:53.380 JST 32258 ERROR: server switched off timeline 1 at 0/53F8B60, but walsender already streamed up to 0/53FA000
C 2013-05-09 17:29:53.380 JST 32257 FATAL: could not receive data from WAL stream: ERROR: server switched off timeline 1 at 0/53F8B60, but walsender already streamed up to 0/53FA000

B 2013-05-09 17:29:53.380 JST 32244 LOG: database system is ready to accept connections

..

C 2013-05-09 17:30:08.395 JST 32256 LOG: Reading page on Timeline ID = 1
C 2013-05-09 17:30:08.398 JST 32274 LOG: fetching timeline history file for timeline 2 from primary server
C 2013-05-09 17:30:08.448 JST 32274 LOG: started streaming WAL from primary at 0/5000000 on timeline 1
C 2013-05-09 17:30:08.452 JST 32274 LOG: replication terminated by primary server
C 2013-05-09 17:30:08.452 JST 32274 DETAIL: End of WAL reached on timeline 1 at 0/53F8B60
C 2013-05-09 17:30:08.452 JST 32256 LOG: new target timeline is 2
C 2013-05-09 17:30:08.452 JST 32256 LOG: Reading page on Timeline ID = 1
C 2013-05-09 17:30:08.452 JST 32256 LOG: Reading page on Timeline ID = 1
C 2013-05-09 17:30:08.453 JST 32274 LOG: restarted WAL streaming at 0/5000000 on timeline 2
B 2013-05-09 17:30:10.913 JST 32248 LOG: This checkpoint record is on TimelineID = 1, loc is about 0/53F8C30
B 2013-05-09 17:30:10.953 JST 32248 LOG: checkpoint complete: wrote 637 buffers (3.9%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=13.502 s, sync=0.105 s, total=13.733 s; sync files=2, longest=0.089 s, average=0.052 s
B 2013-05-09 17:30:10.953 JST 32248 LOG: checkpoint starting: immediate force wait
B 2013-05-09 17:30:10.963 JST 32248 LOG: This checkpoint record is on TimelineID = 1, loc is about 0/53F8CD0
B 2013-05-09 17:30:11.003 JST 32248 LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.000 s, total=0.049 s; sync files=0, longest=0.000 s, average=0.000 s
B 2013-05-09 17:30:11.096 JST 32248 LOG: checkpoint starting: immediate force wait
B 2013-05-09 17:30:11.909 JST 32248 LOG: This checkpoint record is on TimelineID = 1, loc is about 0/540BEF8
C 2013-05-09 17:30:11.929 JST 32256 LOG: invalid magic number 0000 in log segment 000000010000000000000005, offset 4169728
C 2013-05-09 17:30:11.929 JST 32274 FATAL: terminating walreceiver process due to administrator command
B 2013-05-09 17:30:11.951 JST 32248 LOG: checkpoint complete: wrote 18 buffers (0.1%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.017 s, sync=0.785 s, total=0.855 s; sync files=13, longest=0.235 s, average=0.060 s
CHECKPOINT
C 2013-05-09 17:30:13.931 JST 32256 LOG: Reading page on Timeline ID = 2
C 2013-05-09 17:30:13.931 JST 32256 LOG: record with zero length at 0/53F8B90
C 2013-05-09 17:30:13.931 JST 32256 LOG: Reading page on Timeline ID = 2
C 2013-05-09 17:30:13.931 JST 32256 LOG: record with zero length at 0/53F8B90
C 2013-05-09 17:30:18.936 JST 32256 LOG: Reading page on Timeline ID = 2
C 2013-05-09 17:30:18.936 JST 32256 LOG: Reading page on Timeline ID = 2
C 2013-05-09 17:30:18.936 JST 32256 LOG: record with zero length at 0/53F8B90

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#6 Amit Kapila
amit.kapila@huawei.com
In reply to: Kyotaro HORIGUCHI (#5)
Re: Fast promotion failure

On Thursday, May 09, 2013 2:14 PM Kyotaro HORIGUCHI wrote:

After adding some additional logging, the situation becomes clearer.

It seems that standby-B fails to promote to TLI = 2; nevertheless the
history file for TLI = 2 is somehow sent to standby-C. So standby-B
remains on TLI = 1 while standby-C alone switches to TLI = 2.

# Come to think of it, I suspect the additional logs are not that useful :(

It seems to me that this is the same problem as discussed and fixed in the
thread below:
/messages/by-id/51894942.4080500@vmware.com

Could you try with the fixes given by Heikki?

With Regards,
Amit Kapila.


#7 Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Amit Kapila (#6)
Re: Fast promotion failure

Thank you for pointing that out.

It seems to me that this is the same problem as discussed and fixed in the
thread below:
/messages/by-id/51894942.4080500@vmware.com

Could you try with the fixes given by Heikki?

The first patch settles the timeline transition problem for now. Beyond
that, I have not found any further recovery failure, except the
unrepeatable one shown below.

C 2013-05-10 15:30:29.589 JST 9242 LOG: restarted WAL streaming at 0/5000000 on timeline 2
B 2013-05-10 15:30:33.789 JST 9233 LOG: restartpoint complete: (snip.)
B 2013-05-10 15:30:33.789 JST 9233 LOG: recovery restart point at 0/3000028
B 2013-05-10 15:30:33.789 JST 9233 LOG: checkpoint starting:
C 2013-05-10 15:32:32.170 JST 9242 FATAL: could not receive data from WAL stream:

I'm a bit confused: the patch seems to me to ensure that the "first
checkpoint after fast promotion" is performed with the "correct, new,
ThisTimeLineID".

At Thu, 9 May 2013 11:10:23 +0530, Amit Kapila <amit.kapila@huawei.com> wrote in <005501ce4c77$b45f0b70$1d1d2250$@kapila@huawei.com>

Without fast promotion, the end-of-recovery checkpoint is requested and
performed while still in recovery (before xlogctl->SharedRecoveryInProgress
is cleared), so before any new operation can start we can be sure that a
checkpoint on the new timeline exists.

With fast promotion, however, the checkpoint is requested after recovery has
ended, so some operations can happen before the first checkpoint on the new
timeline. It can thus happen that the last checkpoint is on the old timeline
while there are already operations on the new timeline, which might have
caused the problem Heikki has seen.

I don't see any problem (in my view :) with something writing WAL before
the first checkpoint starts - as described in the comment just above the
checkpoint request.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#8 Amit Kapila
amit.kapila@huawei.com
In reply to: Kyotaro HORIGUCHI (#7)
Re: Fast promotion failure

On Friday, May 10, 2013 2:07 PM Kyotaro HORIGUCHI wrote:

Thank you for pointing that out.

It seems to me that this is the same problem as discussed and fixed in the
thread below:
/messages/by-id/51894942.4080500@vmware.com

Could you try with the fixes given by Heikki?

The first patch settles the timeline transition problem for now. Beyond
that, I have not found any further recovery failure, except the
unrepeatable one shown below.

C 2013-05-10 15:30:29.589 JST 9242 LOG: restarted WAL streaming at 0/5000000 on timeline 2
B 2013-05-10 15:30:33.789 JST 9233 LOG: restartpoint complete: (snip.)
B 2013-05-10 15:30:33.789 JST 9233 LOG: recovery restart point at 0/3000028
B 2013-05-10 15:30:33.789 JST 9233 LOG: checkpoint starting:
C 2013-05-10 15:32:32.170 JST 9242 FATAL: could not receive data from WAL stream:

Is there any chance that a network glitch caused this one-time error?

I'm a bit confused: the patch seems to me to ensure that the "first
checkpoint after fast promotion" is performed with the "correct, new,
ThisTimeLineID".

What is your confusion?

With Regards,
Amit Kapila.


#9 Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Amit Kapila (#8)
Re: Fast promotion failure

2013/05/10 20:01 "Amit Kapila" <amit.kapila@huawei.com>:

C 2013-05-10 15:32:32.170 JST 9242 FATAL: could not receive data from WAL stream:

Is there any chance that a network glitch caused this one-time error?

Unix domain sockets are hardly likely to have such troubles. This test ran
within a single host.

I'm a bit confused: the patch seems to me to ensure that the "first
checkpoint after fast promotion" is performed with the "correct, new,
ThisTimeLineID".

What is your confusion?

Heikki said in the first message of this thread that he suspected the cause
of the failure he had seen was the wrong TLI on which the checkpointer runs.
Nevertheless, the patch you suggested to me appears to fix it. Moreover,
(one of?) the failures from the same cause looks fixed by the patch.

Is the point of this discussion that the patch may still leave some glitch
in the timing of the timeline-related changes, and that Heikki saw a symptom
of that?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#10 Amit Kapila
amit.kapila@huawei.com
In reply to: Kyotaro HORIGUCHI (#9)
Re: Fast promotion failure

On Monday, May 13, 2013 5:54 AM Kyotaro HORIGUCHI wrote:

2013/05/10 20:01 "Amit Kapila" <amit.kapila@huawei.com>:

C 2013-05-10 15:32:32.170 JST 9242 FATAL: could not receive data from WAL stream:

Is there any chance that a network glitch caused this one-time error?

Unix domain sockets are hardly likely to have such troubles. This test ran
within a single host.

I'm a bit confused: the patch seems to me to ensure that the "first
checkpoint after fast promotion" is performed with the "correct, new,
ThisTimeLineID".

What is your confusion?

Heikki said in the first message of this thread that he suspected the cause
of the failure he had seen was the wrong TLI on which the checkpointer runs.
Nevertheless, the patch you suggested to me appears to fix it. Moreover,
(one of?) the failures from the same cause looks fixed by the patch.

There were two problems:
1. There was an issue in the walsender logic due to which, after promotion,
it sometimes hits an assertion failure or error.
2. During fast promotion, the checkpoint gets created with the wrong TLI.

He has provided two different patches,
fix-standby-promotion-assert-fail-2.patch and fast-promotion-quick-fix.patch.
Of the two, he has already committed fix-standby-promotion-assert-fail-2.patch
(http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=2ffa66f4975c99e52984f7ee81b47d137b5b4751)

Is the point of this discussion that the patch may still leave some glitch
in the timing of the timeline-related changes, and that Heikki saw a symptom
of that?

AFAIU, the committed patch still leaves a gap in the overall scenario, which
is the fast promotion issue.

With Regards,
Amit Kapila.


#11 Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Amit Kapila (#10)
Re: Fast promotion failure

On 13.05.2013 06:07, Amit Kapila wrote:

On Monday, May 13, 2013 5:54 AM Kyotaro HORIGUCHI wrote:

Heikki said in the first message of this thread that he suspected the cause
of the failure he had seen was the wrong TLI on which the checkpointer runs.
Nevertheless, the patch you suggested to me appears to fix it. Moreover,
(one of?) the failures from the same cause looks fixed by the patch.

There were two problems:
1. There was an issue in the walsender logic due to which, after promotion,
it sometimes hits an assertion failure or error.
2. During fast promotion, the checkpoint gets created with the wrong TLI.

He has provided two different patches,
fix-standby-promotion-assert-fail-2.patch and fast-promotion-quick-fix.patch.
Of the two, he has already committed fix-standby-promotion-assert-fail-2.patch
(http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=2ffa66f4975c99e52984f7ee81b47d137b5b4751)

That's correct.

Is the point of this discussion that the patch may still leave some glitch
in the timing of the timeline-related changes, and that Heikki saw a symptom
of that?

AFAIU, the committed patch still leaves a gap in the overall scenario, which
is the fast promotion issue.

Right, the fast promotion issue is still there.

Just to get us all on the same page again: Does anyone see a problem
with a fresh git checkout, with the fast-promotion-quick-fix.patch
applied?
(/messages/by-id/51894942.4080500@vmware.com). If
you do, please speak up. As far as I know, the already-committed patch,
together with fast-promotion-quick-fix.patch, should fix all known
issues (*).

I haven't committed a fix for the issue I reported in this thread,
because I'm not 100% sure what the right fix for it would be.
fast-promotion-quick-fix.patch seems to do the trick, but at least the
comments need to be updated, and I'm not sure if there are some related
corner cases that it doesn't handle. Simon?

(*) Well, almost. This one is still pending:
/messages/by-id/CAB7nPqRhuCuuD012GCB_tAAFrixx2WioN_zfXQcvLuRab8DN2g@mail.gmail.com

- Heikki


#12 Amit Kapila
amit.kapila@huawei.com
In reply to: Heikki Linnakangas (#11)
Re: Fast promotion failure

On Monday, May 13, 2013 1:13 PM Heikki Linnakangas wrote:

On 13.05.2013 06:07, Amit Kapila wrote:

On Monday, May 13, 2013 5:54 AM Kyotaro HORIGUCHI wrote:

Heikki said in the first message of this thread that he suspected the cause
of the failure he had seen was the wrong TLI on which the checkpointer runs.
Nevertheless, the patch you suggested to me appears to fix it. Moreover,
(one of?) the failures from the same cause looks fixed by the patch.

There were two problems:
1. There was an issue in the walsender logic due to which, after promotion,
it sometimes hits an assertion failure or error.
2. During fast promotion, the checkpoint gets created with the wrong TLI.

He has provided two different patches,
fix-standby-promotion-assert-fail-2.patch and fast-promotion-quick-fix.patch.
Of the two, he has already committed fix-standby-promotion-assert-fail-2.patch
(http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=2ffa66f4975c99e52984f7ee81b47d137b5b4751)

That's correct.

Is the point of this discussion that the patch may still leave some glitch
in the timing of the timeline-related changes, and that Heikki saw a symptom
of that?

AFAIU, the committed patch still leaves a gap in the overall scenario, which
is the fast promotion issue.

Right, the fast promotion issue is still there.

Just to get us all on the same page again: Does anyone see a problem
with a fresh git checkout, with the fast-promotion-quick-fix.patch
applied?
(/messages/by-id/51894942.4080500@vmware.com). If
you do, please speak up. As far as I know, the already-committed patch,
together with fast-promotion-quick-fix.patch, should fix all known
issues (*).

I haven't committed a fix for the issue I reported in this thread,
because I'm not 100% sure what the right fix for it would be.
fast-promotion-quick-fix.patch seems to do the trick, but at least the
comments need to be updated, and I'm not sure if there are some related
corner cases that it doesn't handle. Simon?

The patch provided will unnecessarily call InitXLOGAccess() twice for an
end-of-recovery checkpoint. It doesn't matter for performance, but the
purpose of calling LocalSetXLogInsertAllowed() and InitXLOGAccess() is
almost the same - or am I missing something?

One more thing: I think that after fast promotion it should either set the
timeline, or raise an error, in CreateCheckPoint() before it reaches the
check you mentioned in your initial mail:

    if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
        elog(ERROR, "can't create a checkpoint during recovery");

Shouldn't it set the timeline in that check (RecoveryInProgress()), or when
RecoveryInProgress() is called before CreateCheckPoint()?
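One possible, purely illustrative reading of that suggestion (this is not
code from any posted patch):

	/*
	 * Hypothetical sketch: when the checkpointer finds that recovery has
	 * ended, re-sync its local timeline from shared memory before writing
	 * the checkpoint record, instead of silently keeping a stale value.
	 */
	if (RecoveryInProgress())
	{
		if ((flags & CHECKPOINT_END_OF_RECOVERY) == 0)
			elog(ERROR, "can't create a checkpoint during recovery");
	}
	else
		ThisTimeLineID = XLogCtl->ThisTimeLineID;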

With Regards,
Amit Kapila.


#13 Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Amit Kapila (#12)
Re: Fast promotion failure

Hello,

Is the point of this discussion that the patch may still leave some glitch
in the timing of the timeline-related changes, and that Heikki saw a symptom
of that?

AFAIU, the committed patch still leaves a gap in the overall scenario, which
is the fast promotion issue.

Right, the fast promotion issue is still there.

Just to get us all on the same page again: Does anyone see a problem
with a fresh git checkout, with the fast-promotion-quick-fix.patch
applied?
(/messages/by-id/51894942.4080500@vmware.com). If
you do, please speak up. As far as I know, the already-committed patch,
together with fast-promotion-quick-fix.patch, should fix all known
issues (*).

Shared XLogCtl->ThisTimeLineID is written and read without being fenced by
the spinlock, unlike some other XLogCtl members. Can this break the
coherency of its memory between different processors? If that is the cause,
it is quite reasonable that I cannot reproduce the trouble. I didn't see the
issue even without fast-promotion-quick-fix.patch.
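The unfenced read in question is the one in InitXLOGAccess(); roughly, from
the 9.3-era source:

	void
	InitXLOGAccess(void)
	{
		/* ThisTimeLineID doesn't change so we need no lock to copy it */
		ThisTimeLineID = XLogCtl->ThisTimeLineID;
		Assert(ThisTimeLineID != 0 || IsBootstrapProcessingMode());

		/* Use GetRedoRecPtr to copy the RedoRecPtr safely */
		(void) GetRedoRecPtr();
	}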

The patch provided will unnecessarily call InitXLOGAccess() twice for an
end-of-recovery checkpoint. It doesn't matter for performance, but the
purpose of calling LocalSetXLogInsertAllowed() and InitXLOGAccess() is
almost the same - or am I missing something?

One more thing: I think that after fast promotion it should either set the
timeline, or raise an error, in CreateCheckPoint() before it reaches the
check you mentioned in your initial mail:

    if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
        elog(ERROR, "can't create a checkpoint during recovery");

Shouldn't it set the timeline in that check (RecoveryInProgress()), or when
RecoveryInProgress() is called before CreateCheckPoint()?

Thinking about the checkpointer, it calls RecoveryInProgress() far earlier
than that, in the waiting loop in CheckpointerMain() where it decides
whether to perform a checkpoint or a restartpoint. So InitXLOGAccess() has
already been done by the time a checkpoint is chosen there for the first
time. And before that, ThisTimeLineID in the startup process gets
incremented and is copied into XLogCtl->ThisTimeLineID before
xlogctl->SharedRecoveryInProgress becomes false in StartupXLOG(). I don't
think it is possible for the checkpointer to run on an older timeline ID, on
the condition that all processes see exactly the same memory image.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#14 Simon Riggs
simon@2ndQuadrant.com
In reply to: Kyotaro HORIGUCHI (#13)
Re: Fast promotion failure

On 16 May 2013 07:02, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

fast promotion issue.

Excuse me for not joining the thread earlier. I'm not available today,
but will join in later this evening.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#15 Amit Kapila
amit.kapila@huawei.com
In reply to: Kyotaro HORIGUCHI (#13)
Re: Fast promotion failure

On Thursday, May 16, 2013 11:33 AM Kyotaro HORIGUCHI wrote:

Hello,

Is the point of this discussion that the patch may still leave some glitch
in the timing of the timeline-related changes, and that Heikki saw a symptom
of that?

AFAIU, the committed patch still leaves a gap in the overall scenario, which
is the fast promotion issue.

Right, the fast promotion issue is still there.

Just to get us all on the same page again: Does anyone see a problem
with a fresh git checkout, with the fast-promotion-quick-fix.patch
applied?
(/messages/by-id/51894942.4080500@vmware.com). If
you do, please speak up. As far as I know, the already-committed patch,
together with fast-promotion-quick-fix.patch, should fix all known
issues (*).

Shared XLogCtl->ThisTimeLineID is written and read without being fenced by
the spinlock, unlike some other XLogCtl members. Can this break the
coherency of its memory between different processors? If that is the cause,
it is quite reasonable that I cannot reproduce the trouble. I didn't see the
issue even without fast-promotion-quick-fix.patch.

The patch provided will unnecessarily call InitXLOGAccess() twice for an
end-of-recovery checkpoint. It doesn't matter for performance, but the
purpose of calling LocalSetXLogInsertAllowed() and InitXLOGAccess() is
almost the same - or am I missing something?

One more thing: I think that after fast promotion it should either set the
timeline, or raise an error, in CreateCheckPoint() before it reaches the
check you mentioned in your initial mail:

    if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
        elog(ERROR, "can't create a checkpoint during recovery");

Shouldn't it set the timeline in that check (RecoveryInProgress()), or when
RecoveryInProgress() is called before CreateCheckPoint()?

Thinking about the checkpointer, it calls RecoveryInProgress() far earlier
than that, in the waiting loop in CheckpointerMain() where it decides
whether to perform a checkpoint or a restartpoint. So InitXLOGAccess() has
already been done by the time a checkpoint is chosen there for the first
time. And before that, ThisTimeLineID in the startup process gets
incremented and is copied into XLogCtl->ThisTimeLineID before
xlogctl->SharedRecoveryInProgress becomes false in StartupXLOG(). I don't
think it is possible for the checkpointer to run on an older timeline ID, on
the condition that all processes see exactly the same memory image.

I also think the same, but one difference with fast promotion is that the
checkpoint is requested after recovery has ended, so some operations can
happen before the first checkpoint on the new timeline.

With Regards,
Amit Kapila.


#16 Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#1)
Re: Fast promotion failure

On 7 May 2013 10:57, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

While testing the bug from the "Assertion failure at standby promotion", I
bumped into a different bug in fast promotion. When the first checkpoint
after fast promotion is performed, there is no guarantee that the
checkpointer process is running with the correct, new, ThisTimeLineID. In
CreateCheckPoint(), we have this:

    /*
     * An end-of-recovery checkpoint is created before anyone is allowed to
     * write WAL. To allow us to write the checkpoint record, temporarily
     * enable XLogInsertAllowed. (This also ensures ThisTimeLineID is
     * initialized, which we need here and in AdvanceXLInsertBuffer.)
     */
    if (flags & CHECKPOINT_END_OF_RECOVERY)
        LocalSetXLogInsertAllowed();

That ensures that ThisTimeLineID is updated when performing an
end-of-recovery checkpoint, but it doesn't get executed with fast promotion.
The consequence is that the checkpoint is created with the old timeline, and
subsequent recovery from it will fail.

LocalSetXLogInsertAllowed() is called by CreateEndOfRecoveryRecord(),
but in fast promotion this is called by the startup process, not the
checkpointer process. So there is no longer an explicit call to
InitXLOGAccess() in the checkpointer process via a checkpoint with the
CHECKPOINT_END_OF_RECOVERY flag.

However, there is a call to RecoveryInProgress() at the top of the
main loop of the checkpointer, which does explicitly state that it
"initializes TimeLineID if it's not set yet." The checkpointer makes
the decision about whether to run a restartpoint or a checkpoint
directly from the answer to that, modified only in the case of a
CHECKPOINT_END_OF_RECOVERY checkpoint.
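Condensed from the 9.3-era checkpointer.c, that decision looks roughly like
this (a sketch, with unrelated handling elided):

	/* Inside CheckpointerMain(), once a checkpoint request is pending: */
	bool		ckpt_performed = false;
	bool		do_restartpoint;

	/*
	 * Check if we should perform a checkpoint or a restartpoint. As a
	 * side-effect, RecoveryInProgress() initializes TimeLineID if it's
	 * not set yet.
	 */
	do_restartpoint = RecoveryInProgress();

	if (do_restartpoint)
		ckpt_performed = CreateRestartPoint(flags);
	else
	{
		CreateCheckPoint(flags);
		ckpt_performed = true;
	}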

So the appropriate call is made, and I don't agree with the statement
that this "doesn't get executed with fast promotion", or with the title of
the thread.

So while I believe that the checkpointer might have an incorrect TLI
and that you've seen a bug, what isn't clear is that the checkpointer
is the only process that would see an incorrect TLI, or why such
processes see an incorrect TLI. It seems equally likely at this point
that the TLI may be set incorrectly somehow and that is why it is
being read incorrectly.

I see that the comment in InitXLOGAccess() is incorrect
"ThisTimeLineID doesn't change so we need no lock to copy it", which
seems worth correcting since there's no need to save time in a once
only function.

Continuing to investigate further.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#17 Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Simon Riggs (#16)
Re: Fast promotion failure

On 19.05.2013 17:25, Simon Riggs wrote:

However, there is a call to RecoveryInProgress() at the top of the
main loop of the checkpointer, which does explicitly state that it
"initializes TimeLineID if it's not set yet." The checkpointer makes
the decision about whether to run a restartpoint or a checkpoint
directly from the answer to that, modified only in the case of a
CHECKPOINT_END_OF_RECOVERY checkpoint.

So the appropriate call is made, and I don't agree with the statement
that this "doesn't get executed with fast promotion", or with the title of
the thread.

Hmm, I see.

So while I believe that the checkpointer might have an incorrect TLI
and that you've seen a bug, what isn't clear is that the checkpointer
is the only process that would see an incorrect TLI, or why such
processes see an incorrect TLI. It seems equally likely at this point
that the TLI may be set incorrectly somehow and that is why it is
being read incorrectly.

Yeah. I spent some more time debugging this, and found the culprit. In
CreateRestartPoint, we do this:

        /*
         * Update ThisTimeLineID to the timeline we're currently replaying,
         * so that we install any recycled segments on that timeline.
         *
         * There is no guarantee that the WAL segments will be useful on the
         * current timeline; if recovery proceeds to a new timeline right
         * after this, the pre-allocated WAL segments on this timeline will
         * not be used, and will go wasted until recycled on the next
         * restartpoint. We'll live with that.
         */
        (void) GetXLogReplayRecPtr(&ThisTimeLineID);

Here's what happens:

0. System is in recovery.
1. Checkpointer process starts creating a restartpoint.
2. Startup process ends recovery, creates the end-of-recovery record, sets
the shared ThisTimeLineID value to 2, and exits.
3. Checkpointer process calls RecoveryInProgress() while performing the
restartpoint (there are RecoveryInProgress() calls in many places, e.g. in
XLogNeedsFlush()). It sets LocalRecoveryInProgress = false, and initializes
ThisTimeLineID to 2.
4. At the end of the restartpoint, the checkpointer process does the above
GetXLogReplayRecPtr(&ThisTimeLineID). That overwrites ThisTimeLineID back
to 1.
5. Checkpointer calls RecoveryInProgress(), which returns false immediately,
as LocalRecoveryInProgress is already clear, so ThisTimeLineID is not
re-initialized.
6. Checkpointer performs the first checkpoint, with ThisTimeLineID set
incorrectly to 1.

Not sure what the best fix would be. Perhaps change the code in
CreateRestartPoint() to do something like this instead:

    GetXLogReplayRecPtr(&replayTLI);
    if (RecoveryInProgress())
        ThisTimeLineID = replayTLI;
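Spelled out with the declaration it would need and the reasoning behind the
guard (a sketch of the same idea; the local variable name is only
illustrative):

	TimeLineID	replayTLI;

	(void) GetXLogReplayRecPtr(&replayTLI);

	/*
	 * Only adopt the replay TLI while we are still in recovery.  If
	 * recovery ended concurrently (steps 2-4 above), RecoveryInProgress()
	 * returns false and we keep the ThisTimeLineID that InitXLOGAccess()
	 * already set to the new timeline, instead of clobbering it with the
	 * last-replayed TLI of the old one.
	 */
	if (RecoveryInProgress())
		ThisTimeLineID = replayTLI;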

I see that the comment in InitXLOGAccess() is incorrect
"ThisTimeLineID doesn't change so we need no lock to copy it", which
seems worth correcting since there's no need to save time in a once
only function.

Hmm, the code seems right to me; XLogCtl->ThisTimeLineID indeed doesn't
change once it's set. And InitXLOGAccess() isn't called until it is
set. The comment could explain that better, though.

- Heikki


#18 Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#17)
1 attachment(s)
Re: Fast promotion failure

On 20 May 2013 18:47, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

On 19.05.2013 17:25, Simon Riggs wrote:

So while I believe that the checkpointer might have an incorrect TLI
and that you've seen a bug, what isn't clear is that the checkpointer
is the only process that would see an incorrect TLI, or why such
processes see an incorrect TLI. It seems equally likely at this point
that the TLI may be set incorrectly somehow and that is why it is
being read incorrectly.

Yeah. I spent some more time debugging this, and found the culprit. In
CreateRestartPoint, we do this:

        /*
         * Update ThisTimeLineID to the timeline we're currently replaying,
         * so that we install any recycled segments on that timeline.
         *
         * There is no guarantee that the WAL segments will be useful on the
         * current timeline; if recovery proceeds to a new timeline right
         * after this, the pre-allocated WAL segments on this timeline will
         * not be used, and will go wasted until recycled on the next
         * restartpoint. We'll live with that.
         */
        (void) GetXLogReplayRecPtr(&ThisTimeLineID);

Here's what happens:

0. System is in recovery.
1. Checkpointer process starts creating a restartpoint.
2. Startup process ends recovery, creates the end-of-recovery record, sets
the shared ThisTimeLineID value to 2, and exits.
3. Checkpointer process calls RecoveryInProgress() while performing the
restartpoint (there are RecoveryInProgress() calls in many places, e.g. in
XLogNeedsFlush()). It sets LocalRecoveryInProgress = false, and initializes
ThisTimeLineID to 2.
4. At the end of the restartpoint, the checkpointer process does the above
GetXLogReplayRecPtr(&ThisTimeLineID). That overwrites ThisTimeLineID back
to 1.
5. Checkpointer calls RecoveryInProgress(), which returns false immediately,
as LocalRecoveryInProgress is already clear, so ThisTimeLineID is not
re-initialized.
6. Checkpointer performs the first checkpoint, with ThisTimeLineID set
incorrectly to 1.

That description looks correct to me from the code.

Not sure what the best fix would be. Perhaps change the code in
CreateRestartPoint() to do something like this instead:

GetXLogReplayRecPtr(&replayTLI);
if (RecoveryInProgress())
ThisTimeLineID = replayTLI;

Installing a few extra WAL files doesn't seem to be a good reason to
mess with such an important variable as ThisTimeLineID, especially
since it's such a rare event and hardly worth optimising for.

I would prefer it if we didn't set ThisTimeLineID at all in that way,
or anywhere else. If we do, it needs to be done safely, otherwise any
caller could make the same mistake.

Suggested patch attached.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

avoid_setting_tli.v1.patch (application/octet-stream)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7d34218..d106a13 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7456,9 +7456,11 @@ CreateRestartPoint(int flags)
 		KeepLogSeg(endptr, &_logSegNo);
 		_logSegNo--;
 
+		RemoveOldXlogFiles(_logSegNo, endptr);
+
 		/*
-		 * Update ThisTimeLineID to the timeline we're currently replaying,
-		 * so that we install any recycled segments on that timeline.
+		 * Make more log segments if needed.  (Do this after recycling old log
+		 * segments, since that may supply some of the needed files.)
 		 *
 		 * There is no guarantee that the WAL segments will be useful on the
 		 * current timeline; if recovery proceeds to a new timeline right
@@ -7466,14 +7468,6 @@ CreateRestartPoint(int flags)
 		 * not be used, and will go wasted until recycled on the next
 		 * restartpoint. We'll live with that.
 		 */
-		(void) GetXLogReplayRecPtr(&ThisTimeLineID);
-
-		RemoveOldXlogFiles(_logSegNo, endptr);
-
-		/*
-		 * Make more log segments if needed.  (Do this after recycling old log
-		 * segments, since that may supply some of the needed files.)
-		 */
 		PreallocXlogFiles(endptr);
 	}
 
@@ -9098,10 +9092,10 @@ GetXLogReplayRecPtr(TimeLineID *replayTLI)
 	SpinLockAcquire(&xlogctl->info_lck);
 	recptr = xlogctl->lastReplayedEndRecPtr;
 	tli = xlogctl->lastReplayedTLI;
+	if (replayTLI && xlogctl->SharedRecoveryInProgress)
+		*replayTLI = tli;
 	SpinLockRelease(&xlogctl->info_lck);
 
-	if (replayTLI)
-		*replayTLI = tli;
 	return recptr;
 }
 
#19 Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Simon Riggs (#18)
Re: Fast promotion failure

On 20.05.2013 22:18, Simon Riggs wrote:

On 20 May 2013 18:47, Heikki Linnakangas<hlinnakangas@vmware.com> wrote:

Not sure what the best fix would be. Perhaps change the code in
CreateRestartPoint() to do something like this instead:

GetXLogReplayRecPtr(&replayTLI);
if (RecoveryInProgress())
ThisTimeLineID = replayTLI;

Installing a few extra WAL files doesn't seem to be a good reason to
mess with such an important variable as ThisTimeLineID, especially
since its such a rare event and hardly worth optimising for.

I would prefer it if we didn't set ThisTimeLineID at all in that way,
or anywhere else. If we do, it needs to be done safely, otherwise any
caller could make the same mistake.

Suggested patch attached.

@@ -7466,14 +7468,6 @@ CreateRestartPoint(int flags)
* not be used, and will go wasted until recycled on the next
* restartpoint. We'll live with that.
*/
-		(void) GetXLogReplayRecPtr(&ThisTimeLineID);
-
-		RemoveOldXlogFiles(_logSegNo, endptr);
-
-		/*
-		 * Make more log segments if needed.  (Do this after recycling old log
-		 * segments, since that may supply some of the needed files.)
-		 */
PreallocXlogFiles(endptr);
}

That's essentially reverting commit 343ee00b7, resurrecting the bug that
that fixed.

@@ -9098,10 +9092,10 @@ GetXLogReplayRecPtr(TimeLineID *replayTLI)
SpinLockAcquire(&xlogctl->info_lck);
recptr = xlogctl->lastReplayedEndRecPtr;
tli = xlogctl->lastReplayedTLI;
+	if (replayTLI && xlogctl->SharedRecoveryInProgress)
+		*replayTLI = tli;
SpinLockRelease(&xlogctl->info_lck);

- if (replayTLI)
- *replayTLI = tli;
return recptr;
}

That breaks the only remaining caller that passes a replayTLI pointer,
GetStandbyFlushRecPtr(); *replayTLI points to a local variable in that
function, which is left uninitialized if !xlogctl->SharedRecoveryInProgress.
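For reference, the caller looks roughly like this (condensed from memory of
the 9.3-era walsender.c, with the receive-position comparison elided):

	static XLogRecPtr
	GetStandbyFlushRecPtr(void)
	{
		XLogRecPtr	replayPtr;
		TimeLineID	replayTLI;		/* local, never initialized explicitly */

		replayPtr = GetXLogReplayRecPtr(&replayTLI);

		/* reads garbage if the callee skipped the assignment above */
		ThisTimeLineID = replayTLI;

		/* ... also considers the walreceiver's write position ... */
		return replayPtr;
	}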

- Heikki


#20 Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#19)
1 attachment(s)
Re: Fast promotion failure

On 20 May 2013 20:40, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

On 20.05.2013 22:18, Simon Riggs wrote:

On 20 May 2013 18:47, Heikki Linnakangas<hlinnakangas@vmware.com> wrote:

Not sure what the best fix would be. Perhaps change the code in
CreateRestartPoint() to do something like this instead:

    GetXLogReplayRecPtr(&replayTLI);
    if (RecoveryInProgress())
        ThisTimeLineID = replayTLI;

Installing a few extra WAL files doesn't seem to be a good reason to
mess with such an important variable as ThisTimeLineID, especially
since it's such a rare event and hardly worth optimising for.

I would prefer it if we didn't set ThisTimeLineID at all in that way,
or anywhere else. If we do, it needs to be done safely, otherwise any
caller could make the same mistake.

Suggested patch attached.

@@ -7466,14 +7468,6 @@ CreateRestartPoint(int flags)

 		 * not be used, and will go wasted until recycled on the next
 		 * restartpoint. We'll live with that.
 		 */
-		(void) GetXLogReplayRecPtr(&ThisTimeLineID);
-
-		RemoveOldXlogFiles(_logSegNo, endptr);
-
-		/*
-		 * Make more log segments if needed.  (Do this after recycling old log
-		 * segments, since that may supply some of the needed files.)
-		 */
 		PreallocXlogFiles(endptr);
 	}

That's essentially reverting commit 343ee00b7, resurrecting the bug that
that fixed.

Hmm, OK.

Then we should be using the correct value, not the one that patch set. It
certainly worked at that time, but not in a principled way.

When we set the new timeline we should be updating all values that
might be used elsewhere. If we do that, then no matter when or how we
run GetXLogReplayRecPtr, it can't ever get it wrong in any backend.

Patch attached.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

set_lastReplayedTLI.v1.patch (application/octet-stream)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7d34218..3fd4b4e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5838,8 +5838,16 @@ StartupXLOG(void)
 	}
 
 	/* Save the selected TimeLineID in shared memory, too */
-	XLogCtl->ThisTimeLineID = ThisTimeLineID;
-	XLogCtl->PrevTimeLineID = PrevTimeLineID;
+	{
+		/* use volatile pointer to prevent code rearrangement */
+		volatile XLogCtlData *xlogctl = XLogCtl;
+
+		SpinLockAcquire(&xlogctl->info_lck);
+		XLogCtl->ThisTimeLineID = ThisTimeLineID;
+		XLogCtl->lastReplayedTLI = ThisTimeLineID;
+		XLogCtl->PrevTimeLineID = PrevTimeLineID;
+		SpinLockRelease(&xlogctl->info_lck);
+	}
 
 	/*
 	 * We are now done reading the old WAL.  Turn off archive fetching if it
#21 Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Simon Riggs (#20)
Re: Fast promotion failure

On 21.05.2013 00:00, Simon Riggs wrote:

When we set the new timeline we should be updating all values that
might be used elsewhere. If we do that, then no matter when or how we
run GetXLogReplayRecPtr, it can't ever get it wrong in any backend.

--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5838,8 +5838,16 @@ StartupXLOG(void)
}
/* Save the selected TimeLineID in shared memory, too */
-	XLogCtl->ThisTimeLineID = ThisTimeLineID;
-	XLogCtl->PrevTimeLineID = PrevTimeLineID;
+	{
+		/* use volatile pointer to prevent code rearrangement */
+		volatile XLogCtlData *xlogctl = XLogCtl;
+
+		SpinLockAcquire(&xlogctl->info_lck);
+		XLogCtl->ThisTimeLineID = ThisTimeLineID;
+		XLogCtl->lastReplayedTLI = ThisTimeLineID;
+		XLogCtl->PrevTimeLineID = PrevTimeLineID;
+		SpinLockRelease(&xlogctl->info_lck);
+	}

Hmm. lastReplayedTLI is supposed to be the timeline of the last record
that was replayed, i.e. the timeline corresponding to lastReplayedEndRecPtr.
I think it would work, but it feels like it could be an easy source of
bugs in the future.
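For reference, the invariant being described: during replay the two fields
advance together under the same spinlock, roughly like this in the redo loop
of StartupXLOG() (paraphrased):

	/* After applying one WAL record: */
	SpinLockAcquire(&xlogctl->info_lck);
	xlogctl->lastReplayedEndRecPtr = EndRecPtr;
	xlogctl->lastReplayedTLI = ThisTimeLineID;
	SpinLockRelease(&xlogctl->info_lck);

The patch would set lastReplayedTLI to the newly selected timeline even
though lastReplayedEndRecPtr still points at a record replayed on the old
one, which is the mismatch being worried about.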

It might be a good idea to change walsender to not store that in
ThisTimeLineID, though. It used to make sense for ThisTimeLineID to
track the current recovery timeline in 9.2 and below, but now that
walsender can be sending any older timeline, using ThisTimeLineID to
store the latest one seems confusing.

- Heikki


#22 Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#21)
1 attachment(s)
Re: Fast promotion failure

On 21 May 2013 07:46, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

On 21.05.2013 00:00, Simon Riggs wrote:

When we set the new timeline we should be updating all values that
might be used elsewhere. If we do that, then no matter when or how we
run GetXLogReplayRecPtr, it can't ever get it wrong in any backend.

--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5838,8 +5838,16 @@ StartupXLOG(void)
}
/* Save the selected TimeLineID in shared memory, too */
-       XLogCtl->ThisTimeLineID = ThisTimeLineID;
-       XLogCtl->PrevTimeLineID = PrevTimeLineID;
+       {
+               /* use volatile pointer to prevent code rearrangement */
+               volatile XLogCtlData *xlogctl = XLogCtl;
+
+               SpinLockAcquire(&xlogctl->info_lck);
+               XLogCtl->ThisTimeLineID = ThisTimeLineID;
+               XLogCtl->lastReplayedTLI = ThisTimeLineID;
+               XLogCtl->PrevTimeLineID = PrevTimeLineID;
+               SpinLockRelease(&xlogctl->info_lck);
+       }

Hmm. lastReplayedTLI is supposed to be the timeline of the last record that
was replayed, i.e. the timeline corresponding to lastReplayedEndRecPtr. I
think it would work, but it feels like it could be an easy source of bugs in
the future.

I'm OK with that principle, as long as we don't touch ThisTimeLineID,
which has been the source of multiple bugs.

So we should set the TLI explicitly before installing, as in the attached
patch.

Otherwise we'd need multiple permanent TLIs, which would be overkill.

I feel there are problems because we set the newly selected TLI from the
startup process into shared memory, and then some time later we set
SharedRecoveryInProgress = false. That timing window isn't good, but I don't
see a different way.
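The window being described is, condensed, this ordering near the end of
StartupXLOG() (a sketch; the point is the ordering, not the exact code):

	/* 1. Publish the newly selected timeline. */
	XLogCtl->ThisTimeLineID = ThisTimeLineID;
	XLogCtl->PrevTimeLineID = PrevTimeLineID;

	/* ... end-of-recovery record / checkpoint request, other cleanup ... */

	/* 2. Only now is recovery marked as finished. */
	SpinLockAcquire(&xlogctl->info_lck);
	xlogctl->SharedRecoveryInProgress = false;
	SpinLockRelease(&xlogctl->info_lck);

Any process that samples SharedRecoveryInProgress between 1 and 2 still
believes it is in recovery even though the new TLI is already published;
doing it in the opposite order would be worse.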

It might be a good idea to change walsender to not store that in
ThisTimeLineID, though. It used to make sense for ThisTimeLineID to track
the current recovery timeline in 9.2 and below, but now that walsender can
be sending any older timeline, using ThisTimeLineID to store the latest one
seems confusing.

Agreed, but it looks like too much code to touch lightly.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

install_xlog_right.v1.patch (application/octet-stream)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7d34218..370a927 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2522,8 +2522,22 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 {
 	char		path[MAXPGPATH];
 	struct stat stat_buf;
+	TimeLineID	InstallTimeLineID = ThisTimeLineID;
 
-	XLogFilePath(path, ThisTimeLineID, *segno);
+	/*
+	 * Update InstallTimeLineID to the timeline we're currently replaying,
+	 * so that we install any recycled segments on that timeline.
+	 *
+	 * There is no guarantee that the WAL segments will be useful on the
+	 * current timeline; if recovery proceeds to a new timeline right
+	 * after this, the pre-allocated WAL segments on this timeline will
+	 * not be used, and will go wasted until recycled on the next
+	 * restartpoint. We'll live with that.
+	 */
+	if (RecoveryInProgress())
+	(void) GetXLogReplayRecPtr(&InstallTimeLineID);
+
+	XLogFilePath(path, InstallTimeLineID, *segno);
 
 	/*
 	 * We want to be sure that only one process does this at a time.
@@ -7456,18 +7470,6 @@ CreateRestartPoint(int flags)
 		KeepLogSeg(endptr, &_logSegNo);
 		_logSegNo--;
 
-		/*
-		 * Update ThisTimeLineID to the timeline we're currently replaying,
-		 * so that we install any recycled segments on that timeline.
-		 *
-		 * There is no guarantee that the WAL segments will be useful on the
-		 * current timeline; if recovery proceeds to a new timeline right
-		 * after this, the pre-allocated WAL segments on this timeline will
-		 * not be used, and will go wasted until recycled on the next
-		 * restartpoint. We'll live with that.
-		 */
-		(void) GetXLogReplayRecPtr(&ThisTimeLineID);
-
 		RemoveOldXlogFiles(_logSegNo, endptr);
 
 		/*
#23 Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#22)
Re: Fast promotion failure

On 21 May 2013 09:26, Simon Riggs <simon@2ndquadrant.com> wrote:

I'm OK with that principle...

Well, after fighting some more with that, I've gone with the, er,
principle of slightly less ugliness.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
