Re: Hot Backup with rsync fails at pg_clog if under load

Started by Chris Redekop · over 14 years ago · 52 messages · pgsql-hackers
#1 Chris Redekop
chris@replicon.com

>> Linas, could you capture the output of pg_controldata *and* increase the
>> log level to DEBUG1 on the standby? We should then see nextXid value of
>> the checkpoint the recovery is starting from.
>
> I'll try to do that whenever I'm in that territory again... Incidentally,
> recently there was a lot of unrelated-to-this-post work to polish things up
> for a talk being given at PGWest 2011 today :)
>
> I also checked what rsync does when a file vanishes after rsync computed the
> file list, but before it is sent. rsync 3.0.7 on OSX, at least, complains
> loudly, and doesn't sync the file. It BTW also exits non-zero, with a special
> exit code for precisely that failure case.
>
> To be precise, my script has logic to accept the exit code 24, just as
> stated in the PG manual:
>
> Docs> For example, some versions of rsync return a separate exit code for
> Docs> "vanished source files", and you can write a driver script to accept
> Docs> this exit code as a non-error case.
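The exit-code handling the manual suggests might look like the following sketch. This is illustrative only (Linas's actual script isn't shown in the thread), and the rsync flags and paths are assumptions:

```python
# A sketch of the driver-script logic the PG manual suggests: treat rsync's
# exit code 24 ("partial transfer due to vanished source files") as success,
# since files may legitimately vanish while the master keeps writing.
import subprocess

ACCEPTABLE = {0, 24}  # 0 = clean run, 24 = vanished source files

def backup_ok(returncode: int) -> bool:
    """True if the rsync exit code should be treated as a successful copy."""
    return returncode in ACCEPTABLE

def run_rsync(src: str, dest: str) -> bool:
    # Hypothetical invocation; flags and paths depend on your setup.
    result = subprocess.run(["rsync", "-a", "--delete", src, dest])
    return backup_ok(result.returncode)
```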

I am also running into this issue and can reproduce it very reliably. For
me, however, it happens even when doing the "fast backup", like so:
pg_start_backup('whatever', true). My traffic is more write-heavy than
Linas's, though, so that might have something to do with it. Yesterday it
reliably errored out on pg_clog every time, but today it is failing
sporadically on pg_subtrans (which seems to be past where the pg_clog error
was). The only thing that has changed is that I've changed the log level to
debug1; I wouldn't think that could be related, though. I've linked the
requested pg_controldata and debug1 logs for both errors. Both links contain
the output from pg_start_backup, rsync, pg_stop_backup, pg_controldata, and
then the postgres debug1 log produced from a subsequent startup attempt.

pg_clog: http://pastebin.com/mTfdcjwH
pg_subtrans: http://pastebin.com/qAXEHAQt

Any workarounds would be very much appreciated. Would copying clog+subtrans
before or after the rest of the data directory (or something like that) make
any difference?

Thanks!

#2 Chris Redekop
chris@replicon.com
In reply to: Chris Redekop (#1)

I can confirm that both the pg_clog and pg_subtrans errors do occur when
using pg_basebackup instead of rsync. The data itself seems to be fine
because using the exact same data I can start up a warm standby no problem,
it is just the hot standby that will not start up.


#3 Chris Redekop
chris@replicon.com
In reply to: Chris Redekop (#2)

Well, on the other hand maybe there is something wrong with the data.
Here's the test/steps I just did -
1. I do the pg_basebackup when the master is under load, hot slave now will
not start up but warm slave will.
2. I start a warm slave and let it catch up to current
3. On the slave I change 'hot_standby=on' and do a 'service postgresql
restart'
4. Postgres fails to restart with the same error.
5. I turn hot_standby back off and postgres starts back up fine as a warm
slave
6. I then turn off the load, the slave is all caught up, master and slave
are both sitting idle
7. I, again, change 'hot_standby=on' and do a service restart
8. Again it fails, with the same error, even though there is no longer any
load.
9. I repeat this warmstart/hotstart cycle a couple more times until, to my
surprise, instead of failing, it successfully starts up as a hot standby
(this is after maybe 5 minutes or so of sitting idle)

So, given that it continued to fail even after the load had been turned off,
I believe the data which was copied over was invalid in some way, and when a
checkpoint/log rotation/something else occurred when not under load, it
cleared itself up... I'm shooting in the dark here.

Anyone have any suggestions/ideas/things to try?


#4 Daniel Farina
daniel@heroku.com
In reply to: Chris Redekop (#3)

On Mon, Oct 17, 2011 at 11:30 PM, Chris Redekop <chris@replicon.com> wrote:

> Anyone have any suggestions/ideas/things to try?

Having dug at this a little -- but not too much -- the problem
seems to be that postgres is reading the commit logs way, way too
early, that is to say, before it has played enough WAL to be
'consistent' (the WAL between pg_start and pg_stop backup). I have
not been able to reproduce this problem (I think) after the message
from postgres suggesting it has reached a consistent state; at that
time I am able to go into hot-standby mode.

The message is like: "consistent recovery state reached at %X/%X".
(this is the errmsg)

It doesn't seem meaningful for StartupCLOG (or, indeed, any of the
hot-standby path functionality) to be called before that code is
executed, but right now it is anyway. I'm not sure if this is simply
an oversight, or indicative of a misplaced assumption somewhere.
Basically, my thoughts for a fix are to suppress hot_standby = on
(in spirit) before the consistent recovery state is reached.

--
fdr

#5 Simon Riggs
simon@2ndQuadrant.com
In reply to: Daniel Farina (#4)

On Sun, Oct 23, 2011 at 9:48 PM, Daniel Farina <daniel@heroku.com> wrote:

> Basically, my thoughts for a fix are to suppress hot_standby = on (in
> spirit) before the consistent recovery state is reached.

Not sure about that, but I'll look at where this comes from.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#6 Simon Riggs
simon@2ndQuadrant.com
In reply to: Chris Redekop (#1)

On Sun, Oct 16, 2011 at 2:33 AM, Chris Redekop <chris@replicon.com> wrote:

> pg_subtrans: http://pastebin.com/qAXEHAQt

I confirm this as a HS issue and will investigate from here.

FATAL: could not access status of transaction 21110784
which, in pg_subtrans, is the first xid on a new subtrans page. So we
have missed zeroing a page.

pg_control shows ... Latest checkpoint's oldestActiveXID: 21116666
which shows quite clearly that the pg_control file is later than it should be.

Chris, can you rearrange the backup so you copy the pg_control file as
the first act after the pg_start_backup?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#7 Simon Riggs
simon@2ndQuadrant.com
In reply to: Daniel Farina (#4)

On Sun, Oct 23, 2011 at 9:48 PM, Daniel Farina <daniel@heroku.com> wrote:

> Having dug at this a little -- but not too much -- the problem
> seems to be that postgres is reading the commit logs way, way too
> early, that is to say, before it has played enough WAL to be
> 'consistent' (the WAL between pg_start and pg_stop backup). I have
> not been able to reproduce this problem (I think) after the message
> from postgres suggesting it has reached a consistent state; at that
> time I am able to go into hot-standby mode.

The WAL appears too early because the other control info is later than
it should be.

So this is approx backwards and nothing related to consistent state,
but thanks for drawing my attention to this.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#8 Florian Pflug
fgp@phlo.org
In reply to: Simon Riggs (#6)

On Oct24, 2011, at 01:27 , Simon Riggs wrote:

> FATAL: could not access status of transaction 21110784
> which, in pg_subtrans, is the first xid on a new subtrans page. So we
> have missed zeroing a page.
>
> pg_control shows ... Latest checkpoint's oldestActiveXID: 21116666
> which shows quite clearly that the pg_control file is later than it should be.

But shouldn't pg_control be largely irrelevant in a hot backup scenario? Most
(all?) of the information contained therein should be overwritten with the
contents of the checkpoint referenced by the backup label, shouldn't it?

best regards,
Florian Pflug

#9 Florian Pflug
fgp@phlo.org
In reply to: Daniel Farina (#4)

On Oct23, 2011, at 22:48 , Daniel Farina wrote:

> It doesn't seem meaningful for StartupCLOG (or, indeed, any of the
> hot-standby path functionality) to be called before that code is
> executed, but it is anyway right now.

I think the idea is to check that the CLOG part which recovery *won't*
overwrite is consistent (or rather, given the simplicity of the check,
at least accessible)

Heikki said the following somewhere else in this thread when I suggested
something similar to your proposal:

> There are pretty clear rules on what state clog can be in. When you
> launch postmaster in a standby:
>
> * Any clog preceding the nextXid from the checkpoint record we start
>   recovery from must either be valid, or the clog file must be missing
>   altogether (which can happen when it was vacuumed away while the
>   backup was in progress - if the clog is still needed at the end of
>   the backup it must not be missing, of course).
> * Any clog following nextXid can be garbled or missing.
>
> Recovery will overwrite any clog after nextXid from the WAL, but not
> the clog before it.
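That rule can be sketched as a predicate. This is an illustrative sketch, not PostgreSQL's actual code; the layout constants mirror pg_clog's format of four 2-bit status entries per byte in 8 kB pages:

```python
# Heikki's CLOG validity rule as a predicate (illustrative sketch).
BLCKSZ = 8192
CLOG_XACTS_PER_BYTE = 4  # four 2-bit commit-status entries per byte
CLOG_XACTS_PER_PAGE = BLCKSZ * CLOG_XACTS_PER_BYTE  # 32768 xids per page

def clog_page_for(xid: int) -> int:
    return xid // CLOG_XACTS_PER_PAGE

def clog_page_acceptable(page: int, next_xid: int,
                         file_exists: bool, readable: bool) -> bool:
    """Pages before nextXid's page must be readable if present (the file may
    be missing entirely if it was vacuumed away); pages at or after nextXid
    may be garbled or missing, since WAL replay rebuilds them."""
    if page < clog_page_for(next_xid):
        return readable if file_exists else True
    return True
```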

I think Simon's theory that we're starting recovery from the wrong place,
i.e. should start with an earlier WAL location, is probably correct. The
question is, why?

best regards,
Florian Pflug

#10 Simon Riggs
simon@2ndQuadrant.com
In reply to: Florian Pflug (#9)

On Mon, Oct 24, 2011 at 7:13 AM, Florian Pflug <fgp@phlo.org> wrote:

> I think Simon's theory that we're starting recovery from the wrong place,
> i.e. should start with an earlier WAL location, is probably correct. The
> question is, why?

Err, that's not what I said and I don't mean that. Having said that,
what I said about pg_control being invalid would imply that, so is
wrong also.

We are starting recovery at the right place, but we are initialising
the clog and subtrans incorrectly. Precisely, the oldestActiveXid is
being derived later than it should be, which can cause problems if
this then means that whole pages are uninitialised in subtrans. The bug
only shows up if you do enough transactions (2048 is always enough) to
move to the next subtrans page between the redo pointer and the
checkpoint record, while at the same time we do not have a long-running
transaction that spans those two points. That's just enough to happen
reasonably frequently on busy systems, and yet just enough to have
slipped through testing.

We must either:

1. During CreateCheckpoint(), derive oldestActiveXid before we derive
the redo location

2. Change the way subtrans pages are initialized during recovery so we
don't rely on oldestActiveXid

I need to think some more before a decision on this in my own mind,
but I lean towards doing (1) as a longer term fix and doing (2) as a
short term fix for existing releases. I expect to have a fix later
today.
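
The "2048 is always enough" figure comes straight from pg_subtrans's page layout: one 4-byte parent XID is stored per transaction, so with the default 8 kB block size a page holds exactly 2048 XIDs. A quick worked check:

```python
# Why "2048 is always enough": pg_subtrans stores one 4-byte parent XID per
# transaction in standard 8 kB pages, so consuming 2048 XIDs between the
# redo pointer and the checkpoint record is guaranteed to cross onto a new
# (and possibly un-zeroed) subtrans page.
BLCKSZ = 8192  # default PostgreSQL block size
XID_SIZE = 4   # sizeof(TransactionId)

SUBTRANS_XACTS_PER_PAGE = BLCKSZ // XID_SIZE
print(SUBTRANS_XACTS_PER_PAGE)  # 2048
```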

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#11 Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#10)

On Tue, Oct 25, 2011 at 8:03 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

> We must either
>
> 1. During CreateCheckpoint() we should derive oldestActiveXid before
> we derive the redo location
>
> 2. Change the way subtrans pages are initialized during recovery so we
> don't rely on oldestActiveXid

(1) looks the best way forwards in all cases.

Patch attached. Will be backpatched to 9.0

I think it is possible to avoid taking XidGenLock during
GetRunningTransactions() now, but I haven't included that change in
this patch.

Any other comments before commit?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

oldestActiveXid_fixed.v1.patch (application/octet-stream, +71 -6)
#12 Florian Pflug
fgp@phlo.org
In reply to: Simon Riggs (#11)

On Oct25, 2011, at 11:13 , Simon Riggs wrote:

> On Tue, Oct 25, 2011 at 8:03 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
>> We must either
>>
>> 1. During CreateCheckpoint() we should derive oldestActiveXid before
>> we derive the redo location
>
> (1) looks the best way forwards in all cases.

Let me see if I understand this

The problem seems to be that we currently derive oldestActiveXid at the end of
the checkpoint, just before writing the checkpoint record. Since we use
oldestActiveXid to initialize SUBTRANS, this is wrong. Records written before
that checkpoint record (but after the REDO location, of course) may very well
contain XIDs earlier than that wrongly derived oldestActiveXID, and if we
attempt to touch those XIDs' SUBTRANS state, we error out.

Your patch seems sensible, because the checkpoint "logically" occurs at the
REDO location, not the checkpoint's location, so we ought to log an
oldestActiveXID corresponding to that location.

What I don't understand is how this affects the CLOG. How does oldestActiveXID
factor into CLOG initialization?

best regards,
Florian Pflug

#13 Simon Riggs
simon@2ndQuadrant.com
In reply to: Florian Pflug (#12)

On Tue, Oct 25, 2011 at 12:39 PM, Florian Pflug <fgp@phlo.org> wrote:

> What I don't understand is how this affects the CLOG. How does oldestActiveXID
> factor into CLOG initialization?

It is an entirely different error.

Chris' clog error was caused by a file read error. The file was
opened, we did a seek within the file and then the call to read()
failed to return a complete page from the file.

The xid shown is 22811359, which is the nextxid in the control file.

So StartupClog() must have failed trying to read the clog page from disk.

That isn't a Hot Standby problem or a recovery problem, nor is it
certain it's a PostgreSQL problem.

OTOH SlruPhysicalReadPage() does cope gracefully with missing clog
files during recovery, so maybe we can think of a way to make recovery
cope with a SLRU_READ_FAILED error gracefully also. Any ideas?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#14 Chris Redekop
chris@replicon.com
In reply to: Simon Riggs (#13)

> Chris, can you rearrange the backup so you copy the pg_control file as
> the first act after the pg_start_backup?

I tried this and it doesn't seem to make any difference. I also tried the
patch, and I can no longer reproduce the subtrans error; however, instead it
now starts up but never gets to the point where it'll accept connections. It
starts up, but if I try to do anything I always get "FATAL: the database
system is starting up"... even if the load is removed from the primary, the
standby still never finishes "starting up". Attached below is a log of one of
these startup attempts. In my testing with the patch applied, approx 3 in 10
attempts start up successfully and 7 in 10 attempts go into the "db is
starting up" state. The pg_clog error is still there, but seems much harder
to reproduce now; I've seen it only once since applying the patch (out of
probably 50 or 60 under-load startup attempts). It does seem to be "moody"
like that though... it will be very difficult to reproduce for a while, and
then it will happen damn-near every time for a while... weirdness.

On a bit of a side note, I've been thinking of changing my scripts so that
they perform an initial rsync prior to doing the
startbackup-rsync-stopbackup, just so that the second rsync will be faster
and the backup is in progress for a shorter period of time (as while it is
running it will stop other standbys from starting up). This shouldn't cause
any issues, eh?
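That two-pass idea might be sketched like this. Paths, the psql invocation, and the rsync flags are illustrative assumptions, not the actual scripts from the thread:

```python
# Sketch of a two-pass base backup: a throwaway first rsync outside backup
# mode to pre-populate the destination, then the real pass inside
# pg_start_backup()/pg_stop_backup() so the backup window stays short.
import subprocess

SRC = "/var/lib/postgresql/9.1/main/"        # hypothetical data directory
DEST = "standby:/var/lib/postgresql/9.1/main/"  # hypothetical standby host
ACCEPTABLE = {0, 24}  # 24 = "vanished source files", expected under load

def psql(sql: str) -> None:
    subprocess.run(["psql", "-c", sql], check=True)

def rsync() -> int:
    return subprocess.run(["rsync", "-a", "--delete", SRC, DEST]).returncode

def take_base_backup() -> bool:
    rsync()  # pass 1: not in backup mode, just warms the destination
    psql("SELECT pg_start_backup('base', true)")
    try:
        # pass 2 is short since most data was already copied in pass 1
        ok = rsync() in ACCEPTABLE
    finally:
        psql("SELECT pg_stop_backup()")
    return ok
```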

2011-10-25 13:43:24.035 MDT [15072]: [1-1] LOG: database system was
interrupted; last known up at 2011-10-25 13:43:11 MDT
2011-10-25 13:43:24.035 MDT [15072]: [2-1] LOG: creating missing WAL
directory "pg_xlog/archive_status"
2011-10-25 13:43:24.037 MDT [15072]: [3-1] LOG: entering standby mode
DEBUG: received replication command: IDENTIFY_SYSTEM
DEBUG: received replication command: START_REPLICATION 2/CF000000
2011-10-25 13:43:24.041 MDT [15073]: [1-1] LOG: streaming replication
successfully connected to primary
2011-10-25 13:43:24.177 MDT [15092]: [1-1] FATAL: the database system is
starting up
2011-10-25 13:43:24.781 MDT [15072]: [4-1] DEBUG: checkpoint record is at
2/CF81A478
2011-10-25 13:43:24.781 MDT [15072]: [5-1] DEBUG: redo record is at
2/CF000020; shutdown FALSE
2011-10-25 13:43:24.781 MDT [15072]: [6-1] DEBUG: next transaction ID:
0/4634700; next OID: 1188228
2011-10-25 13:43:24.781 MDT [15072]: [7-1] DEBUG: next MultiXactId: 839;
next MultiXactOffset: 1686
2011-10-25 13:43:24.781 MDT [15072]: [8-1] DEBUG: oldest unfrozen
transaction ID: 1669, in database 1
2011-10-25 13:43:24.781 MDT [15072]: [9-1] DEBUG: transaction ID wrap limit
is 2147485316, limited by database with OID 1
2011-10-25 13:43:24.783 MDT [15072]: [10-1] DEBUG: resetting unlogged
relations: cleanup 1 init 0
2011-10-25 13:43:24.791 MDT [15072]: [11-1] DEBUG: initializing for hot
standby
2011-10-25 13:43:24.791 MDT [15072]: [12-1] LOG: consistent recovery state
reached at 2/CF81A4D0
2011-10-25 13:43:24.791 MDT [15072]: [13-1] LOG: redo starts at 2/CF000020
2011-10-25 13:43:25.019 MDT [15072]: [14-1] LOG: consistent state delayed
because recovery snapshot incomplete
2011-10-25 13:43:25.019 MDT [15072]: [15-1] CONTEXT: xlog redo running
xacts:
nextXid 4634700 latestCompletedXid 4634698 oldestRunningXid 4634336; 130
xacts:
4634336 4634337 4634338 4634339 4634340 4634341 4634342 4634343 4634344
4634345
4634346 4634347 4634348 4634349 4634350 4634351 4634352 4634353 4634354
4634355
4634356 4634357 4634358 4634359 4634360 4634361 4634362 4634363 4634364
4634365
4634366 4634367 4634368 4634369 4634370 4634371 4634515 4634516 4634517
4634518
4634519 4634520 4634521 4634522 4634523 4634524 4634525 4634526 4634527
4634528
4634529 4634530 4634531 4634532 4634533 4634534 4634535 4634536 4634537
4634538
4634539 4634540 4634541 4634542 4634543 4634385 4634386 4634387 4634388
4634389
4634390 4634391 4634392 4634393 4634394 4634395 4634396 4634397 4634398
4634399
4634400 4634401 4634402 4634403 4634404 4634405 4634406 4634407 4634408
4634409
4634410 4634411 4634412 4634413 4634414 4634415 4634416 4634417 4634418
4634419
4634420 4634579 4634580 4634581 4634582 4634583 4634584 4634585 4634586
4634587
4634588 4634589 4634590 4634591 4634592 4634593 4634594 4634595 4634596
4634597
4634598 4634599 4634600 4634601 4634602 4634603 4634604 4634605 4634606
4634607;
subxid ovf
2011-10-25 13:43:25.240 MDT [15130]: [1-1] FATAL: the database system is
starting up
DEBUG: standby "sync_rep_test" has now caught up with primary
2011-10-25 13:43:26.304 MDT [15167]: [1-1] FATAL: the database system is
starting up
2011-10-25 13:43:27.366 MDT [15204]: [1-1] FATAL: the database system is
starting up
2011-10-25 13:43:28.426 MDT [15241]: [1-1] FATAL: the database system is
starting up
2011-10-25 13:43:29.461 MDT [15275]: [1-1] FATAL: the database system is
starting up
and so on...


#15 Chris Redekop
chris@replicon.com
In reply to: Simon Riggs (#13)

> That isn't a Hot Standby problem, a recovery problem nor is it certain
> its a PostgreSQL problem.

Do you have any theories on this that I could help investigate? It happens
even when using pg_basebackup, and it persists until another sync is
performed, so the files must be in some state that it can't recover from.
Without understanding the internals, just viewing from an outside
perspective, I don't really see how this could not be a PostgreSQL
problem...

#16 Simon Riggs
simon@2ndQuadrant.com
In reply to: Chris Redekop (#14)

On Tue, Oct 25, 2011 at 10:06 PM, Chris Redekop <chris@replicon.com> wrote:

>> Chris, can you rearrange the backup so you copy the pg_control file as
>> the first act after the pg_start_backup?
>
> I tried this and it doesn't seem to make any difference.

It won't, that was a poor initial diagnosis on my part.

> I also tried the patch and I can no longer reproduce the subtrans error

Good

> , however instead it now it starts up, but never gets to the point where
> it'll accept connections. It starts up but if I try to do anything I
> always get "FATAL: the database system is starting up"... even if the
> load is removed from the primary, the standby still never finishes
> "starting up".

...

> 2011-10-25 13:43:25.019 MDT [15072]: [14-1] LOG: consistent state delayed
> because recovery snapshot incomplete

...

This is a different problem and has already been reported by one of
your colleagues in a separate thread, and answered in detail by me
there. There is no bug related to this error message.

From here, it looks like the published fix resolves the originally reported problem.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#17 Florian Pflug
fgp@phlo.org
In reply to: Simon Riggs (#13)

On Oct25, 2011, at 14:51 , Simon Riggs wrote:

> On Tue, Oct 25, 2011 at 12:39 PM, Florian Pflug <fgp@phlo.org> wrote:
>
>> What I don't understand is how this affects the CLOG. How does oldestActiveXID
>> factor into CLOG initialization?
>
> It is an entirely different error.

Ah, OK. I assumed you believed the wrong oldestActiveXID computation was
behind both the SUBTRANS-related *and* the CLOG-related errors, since you
said "We are starting recovery at the right place but we are initialising
the clog and subtrans incorrectly" at the start of the mail.

> Chris' clog error was caused by a file read error. The file was
> opened, we did a seek within the file and then the call to read()
> failed to return a complete page from the file.
>
> The xid shown is 22811359, which is the nextxid in the control file.
>
> So StartupClog() must have failed trying to read the clog page from disk.

Yep.

> That isn't a Hot Standby problem, a recovery problem nor is it certain
> its a PostgreSQL problem.

It's very likely that it's a PostgreSQL problem, though. It's probably
not a pilot error since it happens even for backups taken with pg_basebackup(),
so the only explanation other than a PostgreSQL bug is broken hardware or
a pretty serious kernel/filesystem bug.

OTOH SlruPhysicalReadPage() does cope gracefully with missing clog
files during recovery, so maybe we can think of a way to make recovery
cope with a SLRU_READ_FAILED error gracefully also. Any ideas?

As long as we don't understand how the CLOG-related errors happen in
the first place, I think it's a bad idea to silence them.

best regards,
Florian Pflug

#18Florian Pflug
fgp@phlo.org
In reply to: Florian Pflug (#12)

On Oct25, 2011, at 13:39 , Florian Pflug wrote:

On Oct25, 2011, at 11:13 , Simon Riggs wrote:

On Tue, Oct 25, 2011 at 8:03 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

We are starting recovery at the right place but we are initialising
the clog and subtrans incorrectly. Precisely, the oldestActiveXid is
being derived later than it should be, which can cause problems if
this then means that whole pages are uninitialised in subtrans. The bug
only shows up if you do enough transactions (2048 is always enough) to
move to the next subtrans page between the redo pointer and the
checkpoint record while at the same time we do not have a long running
transaction that spans those two points. That's just enough to happen
reasonably frequently on busy systems and yet just enough to have
slipped through testing.

We must either

1. During CreateCheckpoint() we should derive oldestActiveXid before
we derive the redo location

(1) looks the best way forwards in all cases.

Let me see if I understand this

The problem seems to be that we currently derive oldestActiveXid at the end of
the checkpoint, just before writing the checkpoint record. Since we use
oldestActiveXid to initialize SUBTRANS, this is wrong. Records written before
that checkpoint record (but after the REDO location, of course) may very well
contain XIDs earlier than that wrongly derived oldestActiveXID, and if we attempt
to touch these XIDs' SUBTRANS state, we error out.

Your patch seems sensible, because the checkpoint "logically" occurs at the
REDO location not the checkpoint's location, so we ought to log an oldestActiveXID
corresponding to that location.

Thinking about this some more (and tracing through the code), I realized that
things are a bit more complicated.

What we actually need to ensure, I think, is that the XID we pass to StartupSUBTRANS()
is earlier than any top-level XID in XLOG_XACT_ASSIGNMENT records. Which, at first
glance, implies that we ought to use the nextId at the *beginning* of the checkpoint
for SUBTRANS initialization. At second glance, however, that'd be wrong, because
backends emit XLOG_XACT_ASSIGNMENT only every PGPROC_MAX_CACHED_SUBXIDS sub-xid
assignments. Thus, an XLOG_XACT_ASSIGNMENT written *after* the checkpoint has started
may contain sub-XIDs which were assigned *before* the checkpoint has started.

Using oldestActiveXID works around that because we guarantee that sub-XIDs are always
larger than their parent XIDs and because only active transactions can produce
XLOG_XACT_ASSIGNMENT records.

So your patch is fine, but I think the reasoning about why oldestActiveXID is
the correct value for StartupSUBTRANS deserves an explanation somewhere.

best regards,
Florian Pflug

#19Simon Riggs
simon@2ndQuadrant.com
In reply to: Florian Pflug (#17)

On Wed, Oct 26, 2011 at 12:16 PM, Florian Pflug <fgp@phlo.org> wrote:

Chris' clog error was caused by a file read error. The file was
opened, we did a seek within the file and then the call to read()
failed to return a complete page from the file.

The xid shown is 22811359, which is the nextxid in the control file.

So StartupClog() must have failed trying to read the clog page from disk.

Yep.

That isn't a Hot Standby problem, a recovery problem nor is it certain
its a PostgreSQL problem.

It's very likely that it's a PostgreSQL problem, though. It's probably
not a pilot error since it happens even for backups taken with pg_basebackup(),
so the only explanation other than a PostgreSQL bug is broken hardware or
a pretty serious kernel/filesystem bug.

The way forwards here is for someone to show the clog file that causes
the error and find out why the call to read() fails.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#20Aidan Van Dyk
aidan@highrise.ca
In reply to: Simon Riggs (#19)

On Wed, Oct 26, 2011 at 7:43 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

It's very likely that it's a PostgreSQL problem, though. It's probably
not a pilot error since it happens even for backups taken with pg_basebackup(),
so the only explanation other than a PostgreSQL bug is broken hardware or
a pretty serious kernel/filesystem bug.

The way forwards here is for someone to show the clog file that causes
the error and find out why the call to read() fails.

Sorry, I thought the problem was obvious. Either that, or I've
completely missed something in these threads... I'll admit to not
following this one very closely anymore...

When the backup started, the clog was small. So on the "recovering
instance", the clog is small. PostgreSQL is supposed to be able to
deal with any file as it was when the backup starts.

When the backup is stopped, clog is big. But that file was copied
after the backup was started, not after the backup finished. So its
size is only guaranteed to be as big as it was when the backup
started. Recovery is responsible for extending it as it was extended
during the backup period on the master.

The read fails because there is no data at the location it's trying to
read from, because clog hasn't been extended yet by recovery.

a.
--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

#21Simon Riggs
simon@2ndQuadrant.com
In reply to: Aidan Van Dyk (#20)
#22Florian Pflug
fgp@phlo.org
In reply to: Simon Riggs (#21)
#23Aidan Van Dyk
aidan@highrise.ca
In reply to: Florian Pflug (#22)
#24Florian Pflug
fgp@phlo.org
In reply to: Florian Pflug (#22)
#25Chris Redekop
chris@replicon.com
In reply to: Simon Riggs (#19)
#26Chris Redekop
chris@replicon.com
In reply to: Chris Redekop (#25)
#27Florian Pflug
fgp@phlo.org
In reply to: Chris Redekop (#26)
#28Simon Riggs
simon@2ndQuadrant.com
In reply to: Florian Pflug (#24)
#29Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#28)
#30Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#29)
#31Chris Redekop
chris@replicon.com
In reply to: Simon Riggs (#30)
#32Florian Pflug
fgp@phlo.org
In reply to: Simon Riggs (#28)
#33Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#29)
#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#33)
#35Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Florian Pflug (#32)
#36Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#35)
#37Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#34)
#38Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#37)
#39Florian Pflug
fgp@phlo.org
In reply to: Heikki Linnakangas (#35)
#40Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#38)
#41Simon Riggs
simon@2ndQuadrant.com
In reply to: Florian Pflug (#32)
#42Florian Pflug
fgp@phlo.org
In reply to: Simon Riggs (#41)
#43Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#40)
#44Simon Riggs
simon@2ndQuadrant.com
In reply to: Florian Pflug (#42)
#45Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#43)
#46Florian Pflug
fgp@phlo.org
In reply to: Simon Riggs (#44)
#47Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#45)
#48Chris Redekop
chris@replicon.com
In reply to: Simon Riggs (#47)
#49Simon Riggs
simon@2ndQuadrant.com
In reply to: Chris Redekop (#48)
#50Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#49)
#51Chris Redekop
chris@replicon.com
In reply to: Simon Riggs (#49)
#52Greg Smith
gsmith@gregsmith.com
In reply to: Chris Redekop (#51)