warm standby server stops doing checkpoints after a while
Hello list,
I have a problem regarding running a warm standby server as described in
the PostgreSQL 8.2 documentation.
I set up two servers. Both running PostgreSQL 8.2.3-1PGDG on Fedora Core
6 (x86_64). (Master driven by AMD Opteron / Slave running on Intel Xeon)
The master server copies its WAL logs to the slave using rsync over SSH;
they are then replayed by the slave using a little Perl script which
does not really do much more than the pseudo code in the documentation does.
To this point everything works just fine.
To reduce space consumption of the log archive, my restore script checks
pg_controldata when it enters a wait cycle and deletes all log files
which are older than the last checkpoint segment. (Checkpointing happens
about every 5 to 10 minutes.)
The problem is that the slave server stops checkpointing after some
hours of working (about 24 to 48 hours of continued log replay).
After this point in time the slave does not do checkpoints anymore but
still replays every log that it gets from the master. The master works
on as expected, doing checkpoints every 5 to 10 minutes.
I raised the log level to debug5 but there's nothing interesting in the
logs. Just log replay, and absolutely no sign of anything related to
checkpoints.
A restart does not help. Only beginning with a new base backup brings the
slave back to working as expected.
Both servers are configured identically except archiving/recovery settings.
Does somebody have a clue what could cause this behavior of the slave server?
Could upgrading to 8.2.4 help? I didn't find anything related in the
release notes.
Could this be a bug?
Greetings,
Frank Wittig
Frank Wittig <fw@weisshuhn.de> writes:
The problem is that the slave server stops checkpointing after some
hours of working (about 24 to 48 hours of continued log replay).
Hm ... look at RecoveryRestartPoint() in xlog.c. Could there be
something wrong with this logic?
/*
* Do nothing if the elapsed time since the last restartpoint is less than
* half of checkpoint_timeout. (We use a value less than
* checkpoint_timeout so that variations in the timing of checkpoints on
* the master, or speed of transmission of WAL segments to a slave, won't
* make the slave skip a restartpoint once it's synced with the master.)
* Checking true elapsed time keeps us from doing restartpoints too often
* while rapidly scanning large amounts of WAL.
*/
elapsed_secs = time(NULL) - ControlFile->time;
if (elapsed_secs < CheckPointTimeout / 2)
return;
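For reference, the guard above reduces to this standalone check (a simplified sketch that mirrors the xlog.c variable names, not the actual server code):

```c
#include <time.h>

/* Skip the restartpoint if less than half of checkpoint_timeout has
 * elapsed since the last one.  Using half the timeout (rather than the
 * full value) keeps timing jitter between master and slave from making
 * the slave skip restartpoints once it is in sync. */
int restartpoint_is_too_soon(time_t last_restartpoint, time_t now,
                             int checkpoint_timeout_secs)
{
    time_t elapsed_secs = now - last_restartpoint;
    return elapsed_secs < checkpoint_timeout_secs / 2;
}
```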
The idea is that the slave (once in sync with the master) ought to
checkpoint every time it sees a checkpoint record in the master's
output. I'm not seeing a flaw but maybe there is one here, or somewhere
nearby. Are you sure the master is checkpointing?
regards, tom lane
Tom Lane schrieb:
Are you sure the master is checkpointing?
Yes. I double checked using pg_controldata on the master.
On both servers checkpoint_segments is set to 16 and checkpoint_timeout
is 300 seconds default.
There were two messages in the master's logs saying that checkpointing
happened too often during bulk imports last night.
It's possible that the problem occurs during periods of high database
activity where checkpointing on the master happens very often.
It would be OK if the slave skipped checkpoints while falling behind,
but I'd expect it to checkpoint again once it's back in sync.
I watched the servers for a while after it happened today and found that
the slave had caught up and was waiting for new logs. But it didn't do any
checkpointing.
The last checkpoint was over 2 hours and about 500 log segments ago.
I upgraded the slave to 8.2.4-1PGDG today and will upgrade the master
too as soon as I can get me a maintenance window for this.
Greetings,
Frank Wittig
On Thu, 2007-05-31 at 10:23 -0400, Tom Lane wrote:
Frank Wittig <fw@weisshuhn.de> writes:
The problem is that the slave server stops checkpointing after some
hours of working (about 24 to 48 hours of continued log replay).
Hm ... look at RecoveryRestartPoint() in xlog.c. Could there be
something wrong with this logic? [code quoted above]
The idea is that the slave (once in sync with the master) ought to
checkpoint every time it sees a checkpoint record in the master's
output. I'm not seeing a flaw but maybe there is one here, or somewhere
nearby. Are you sure the master is checkpointing?
Hmmm. This can happen if a backend crashes while half-way through any
set of changes that causes safe_restartpoint() to return false. Or it might
be that one of the index AMs doesn't correctly clear the multi-WAL actions
in some corner cases.
Or it could be that the mdsync looping problem has been worse than we
thought and checkpoints have been avoided completely for some time.
Frank,
This is repeatable, yes?
Has anything crashed on your server?
Are you using GIN or GIST indexes?
I'll look at putting some debug information in there that logs whether
multi-WAL actions remain unresolved for any length of time.
Continuing to think about this one....
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
Simon Riggs schrieb:
This is repeatable, yes?
Yes, it occurs every time I begin with a new base backup. And it seems
to happen during recreation of tsearch2 vectors for large amounts of
data sets.
Has anything crashed on your server?
No. No crashes occurred during those times.
Are you using GIN or GIST indexes?
I'm using GIN indexes on tsearch2 vectors of a very large amount of data
sets. (About 3.8 million data sets, of which about 30-50 thousand are
recreated and indexed when the described behavior occurs.)
I'll look at putting some debug information in there that logs whether
multi-WAL actions remain unresolved for any length of time.
Extra debug info would be great.
I tested myself by adding some debug output into the function Tom Lane
mentioned and found that after the server stopped checkpointing, every
time the function is called it exits at this point:
/*
* Is it safe to checkpoint? We must ask each of the resource managers
* whether they have any partial state information that might prevent a
* correct restart from this point. If so, we skip this opportunity, but
* return at the next checkpoint record for another try.
*/
for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
{
if (RmgrTable[rmid].rm_safe_restartpoint != NULL)
if (!(RmgrTable[rmid].rm_safe_restartpoint()))
return;
}
It exits every time with the same value for rmid.
Logs look like this (the quoted lines repeat):
<2007-06-01 13:10:28.936 CEST:%> DEBUG: 00000: executing restore
command "/var/lib/pgsql/restore.pl
/mnt/wal_archive/00000001000000C9000000C2 pg_xlog/RECOVERYXLOG"
<2007-06-01 13:10:28.936 CEST:%> LOCATION: RestoreArchivedFile, xlog.c:2474
<2007-06-01 13:11:29.055 CEST:%> LOG: 00000: restored log file
"00000001000000C9000000C2" from archive
<2007-06-01 13:11:29.055 CEST:%> LOCATION: RestoreArchivedFile, xlog.c:2504
<2007-06-01 13:11:29.364 CEST:%> DEBUG: 00000: found Checkpoint in XLOG
<2007-06-01 13:11:29.364 CEST:%> CONTEXT: xlog redo checkpoint: redo
C9/C20DE050; undo 0/0; tli 1; xid 0/36130541; oid 241990328; multi 8;
offset 15; online
<2007-06-01 13:11:29.364 CEST:%> LOCATION: RecoveryRestartPoint,
xlog.c:5739
<2007-06-01 13:11:29.365 CEST:%> DEBUG: 00000: Ressource manager (13)
has partial state information
<2007-06-01 13:11:29.365 CEST:%> CONTEXT: xlog redo checkpoint: redo
C9/C20DE050; undo 0/0; tli 1; xid 0/36130541; oid 241990328; multi 8;
offset 15; online
<2007-06-01 13:11:29.365 CEST:%> LOCATION: RecoveryRestartPoint,
xlog.c:5769
best regards,
Frank Wittig
On Fri, 2007-06-01 at 13:33 +0200, Frank Wittig wrote:
[...]
It exits every time with the same value for rmid.
[...]
<2007-06-01 13:11:29.365 CEST:%> DEBUG: 00000: Ressource manager (13)
has partial state information
[rest of quoted message snipped]
To me, this points clearly to there being an improperly completed action
in resource manager 13 (GIN). In summary, it appears that there may be
an issue with the GIN code for WAL recovery, and this is affecting the
Warm Standby.
I'll reply again soon with a patch to allow data of incomplete splits to
be logged so we can see what's going on. After that we should continue
to discuss this on -hackers.
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
<2007-06-01 13:11:29.365 CEST:%> DEBUG: 00000: Ressource manager (13)
has partial state information
To me, this points clearly to there being an improperly completed action
in resource manager 13 (GIN). In summary, it appears that there may be
an issue with the GIN code for WAL recovery, and this is affecting the
Warm Standby.
Hmm. I found that gin_xlog_cleanup doesn't reset the incomplete_splits
list. Is it a possible reason for the bug?
Attached patch fixes it.
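To illustrate why a missing reset would leave the standby stuck, here is a minimal, self-contained model of the situation (all names are illustrative; this is not the actual GIN code or the attached patch):

```c
#include <stddef.h>

/* Model: GIN replay tracks page splits whose completion record has not
 * yet been seen.  While any entry is pending, the resource manager's
 * safe-restartpoint check must answer "not safe". */
#define MAX_SPLITS 64
static int incomplete_splits[MAX_SPLITS];
static size_t n_incomplete;

static void rememberSplit(int root_blkno)
{
    incomplete_splits[n_incomplete++] = root_blkno;
}

static int gin_safe_restartpoint(void)
{
    return n_incomplete == 0;   /* safe only when nothing is pending */
}

static void gin_xlog_cleanup(void)
{
    n_incomplete = 0;           /* the missing reset: drop stale entries */
}
```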
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Attachment: patch.txt (text/plain)
Teodor Sigaev schrieb:
Hmm. I found that gin_xlog_cleanup doesn't reset the incomplete_splits
list. Is it a possible reason for the bug?
Sounds reasonable to me. It would explain why the system never recovered
from that state.
I'll test your patch and report results on this list.
Greetings,
Frank Wittig
On Fri, 2007-06-01 at 16:58 +0400, Teodor Sigaev wrote:
<2007-06-01 13:11:29.365 CEST:%> DEBUG: 00000: Ressource manager (13)
has partial state information
[...]
Hmm. I found that gin_xlog_cleanup doesn't reset incomplete_splits list.
Is it possible reason of bug?
Hi Teodor,
Hmm, well, the list should be empty by that point anyway. That code is
only executed at the end of xlog replay, not half-way through as we are
seeing.
There are two possibilities:
1. There are some incomplete splits, pointing to a likely bug in GIN
2. There are so many index splits that we aren't able to make a
successful restartpoint using the current mechanism. Not a bug, but
would be an issue with how restartpoints interact with GIN (possibly
other index types also).
When we wrote this I thought (2) would be a problem, but it hasn't shown
up to be so for btrees (yet, I guess). I have some ideas if it's (2).
The attached patch should show which of these it is. I'll dress it up a
little better so we have a debug option on this. Please note I've not
tested this patch myself, so Frank if you don't mind me splatting
something at you we'll see what we see.
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
Attachment: ginwalsplit.v1.patch (text/x-patch)
Hi Simon,
The attached patch should show which of these it is. I'll dress it up a
little better so we have a debug option on this. Please note I've not
tested this patch myself, so Frank if you don't mind me splatting
something at you we'll see what we see.
I'll test that. I have an idea on how to trigger that problem on purpose,
so I don't have to wait until it occurs on its own.
Please feel free to run any test you want through me. I really
need that WAL standby, so I will make every effort I can to help get
it stable.
Greetings,
Frank Wittig
Hi Simon,
The attached patch should show which of these it is.
As I thought, I can trigger the issue by letting the database recreate
and reindex some ten thousand data sets.
Here is what happened in the log (trimmed to the lines your patch
provides):
<2007-06-01 16:28:51.708 CEST:%> LOG: GIN incomplete splits=1
<2007-06-01 16:28:51.708 CEST:%> LOG: GIN incomplete split root:11118
l:45303 r:111740 at redo CA/C8243C28
<2007-06-01 16:29:13.912 CEST:%> LOG: GIN incomplete splits=1
<2007-06-01 16:29:13.912 CEST:%> LOG: GIN incomplete split root:11118
l:45303 r:111740 at redo CA/C8243C28
<2007-06-01 16:29:51.594 CEST:%> LOG: GIN incomplete splits=1
<2007-06-01 16:29:51.594 CEST:%> LOG: GIN incomplete split root:11118
l:45303 r:111740 at redo CA/C8243C28
<2007-06-01 16:30:19.351 CEST:%> LOG: GIN incomplete splits=2
<2007-06-01 16:30:19.351 CEST:%> LOG: GIN incomplete split root:11118
l:45303 r:111740 at redo CA/C8243C28
<2007-06-01 16:30:19.351 CEST:%> LOG: GIN incomplete split root:4269
l:102123 r:111741 at redo CA/EEAD8B80
<2007-06-01 16:31:41.991 CEST:%> LOG: GIN incomplete splits=2
<2007-06-01 16:31:41.991 CEST:%> LOG: GIN incomplete split root:11118
l:45303 r:111740 at redo CA/C8243C28
<2007-06-01 16:31:41.991 CEST:%> LOG: GIN incomplete split root:4269
l:102123 r:111741 at redo CA/EEAD8B80
<2007-06-01 16:34:09.217 CEST:%> LOG: GIN incomplete splits=2
<2007-06-01 16:34:09.218 CEST:%> LOG: GIN incomplete split root:11118
l:45303 r:111740 at redo CA/C8243C28
<2007-06-01 16:34:09.218 CEST:%> LOG: GIN incomplete split root:4269
l:102123 r:111741 at redo CA/EEAD8B80
<2007-06-01 16:35:31.933 CEST:%> LOG: GIN incomplete splits=2
<2007-06-01 16:35:31.933 CEST:%> LOG: GIN incomplete split root:11118
l:45303 r:111740 at redo CA/C8243C28
<2007-06-01 16:35:31.933 CEST:%> LOG: GIN incomplete split root:4269
l:102123 r:111741 at redo CA/EEAD8B80
<2007-06-01 16:36:56.391 CEST:%> LOG: GIN incomplete splits=2
<2007-06-01 16:36:56.391 CEST:%> LOG: GIN incomplete split root:11118
l:45303 r:111740 at redo CA/C8243C28
<2007-06-01 16:36:56.392 CEST:%> LOG: GIN incomplete split root:4269
l:102123 r:111741 at redo CA/EEAD8B80
<2007-06-01 16:38:23.133 CEST:%> LOG: GIN incomplete splits=2
<2007-06-01 16:38:23.133 CEST:%> LOG: GIN incomplete split root:11118
l:45303 r:111740 at redo CA/C8243C28
<2007-06-01 16:38:23.133 CEST:%> LOG: GIN incomplete split root:4269
l:102123 r:111741 at redo CA/EEAD8B80
It didn't do us the favor of producing more incomplete splits. But these
two are enough to keep the standby server from doing checkpoints.
Greetings,
Frank Wittig
<2007-06-01 16:28:51.708 CEST:%> LOG: GIN incomplete split root:11118
l:45303 r:111740 at redo CA/C8243C28
...
<2007-06-01 16:38:23.133 CEST:%> LOG: GIN incomplete split root:11118
l:45303 r:111740 at redo CA/C8243C28
Looks like a bug in GIN. I'll play with it. Can you provide more details about
your test?
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
I am following this thread with interest, and hence a request: if the
discussion is kept in this thread, it is easier for me to follow.
Just recently I have put the WAL standby in production using 8.2.3 and
would like to know the circumstances where I will need to upgrade.
Hence the interest.
Regards
Dhaval
On Fri, 2007-06-01 at 19:11 +0400, Teodor Sigaev wrote:
<2007-06-01 16:28:51.708 CEST:%> LOG: GIN incomplete split root:11118
l:45303 r:111740 at redo CA/C8243C28
...
<2007-06-01 16:38:23.133 CEST:%> LOG: GIN incomplete split root:11118
l:45303 r:111740 at redo CA/C8243C28
Looks like a bug in GIN. I'll play with it. Can you provide more details about
your test?
Agreed, sorry about that.
I note that forgetIncompleteSplit doesn't report anything if it fails to
find an incomplete split when it looks for it.
ISTM that either we have
1) sometimes a starting split, but no ending split
2) a starting split, but we try to remove the wrong split later, for
some reason. i.e. we make the change correctly, just misjudge the xlog
somehow.
(1) would show up on data retrieval tests, but (2) would show up only on
recovery. So I would think it is (2).
I'd suggest we throw an error, as shown in the enclosed patch. Frank,
can you give that a whirl to provide Teodor with something more to work
with? Thanks.
Neither GiST nor B-tree seems to throw an error in the corresponding
locations either, so the potential for not being able to track this is
high. I'd want to throw errors in those locations also.
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
Attachment: ginwalsplit.v2.patch (text/x-patch)
I'd suggest we throw an error, as shown in the enclosed patch. Frank,
can you give that a whirl to provide Teodor with something more to work
with? Thanks.
I have already made a test suite which reproduces the problem - it leaves
incomplete splits. But I discovered one more problem: a deadlock on a
buffer's lock. So right now I am investigating that.
Neither GIST nor B-tree seems to throw an error in corresponding
locations also, so the potential for not being able to track this is
high. I'd want to throw errors in those locations also.
Agreed, I'll add more checks.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Fri, 2007-06-01 at 09:22 -0700, Dhaval Shah wrote:
I am following this thread with interest and hence as a request if the
discussion is kept in this thread, is easier for me to follow it.
Just recently I have put the WAL standby in production using 8.2.3 and
would like to know the circumstances where I will need to upgrade.
Hence the interest.
OK, in summary:
We have a rare bug in WAL recovery for GIN indexes that has been noticed
during Warm Standby recovery. Currently, this is affecting the ability
to perform restartable recovery, which would cause some administrative
difficulty if the *standby* server fails. We're working on a fix.
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
On Fri, 2007-06-01 at 20:49 +0400, Teodor Sigaev wrote:
I'd suggest we throw an error, as shown in the enclosed patch. Frank,
can you give that a whirl to provide Teodor with something more to work
with? Thanks.
I already makes a test suite which reproduce the problem - it leaves
incompleted splits. But I discover one more problem: deadlock on buffer's
lock. So, right now I investigate the problem.
OK, I'll leave the code to you from here....
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
Found the reason: if the parent page is fully backed up after the child's
split, then forgetIncompleteSplit() isn't called at all.
Hopefully the attached patch fixes that. Please test it.
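A toy model of that root cause (all names and structures are illustrative, not the actual ginxlog.c code): during redo of a split-completion record, if the parent page carries a full-page backup, the buggy path restores the page image and returns early, so the pending entry is never forgotten and blocks every later restartpoint.

```c
#include <stddef.h>

#define MAX_SPLITS 64
static int pending[MAX_SPLITS];
static size_t n_pending;

static void rememberIncompleteSplit(int root) { pending[n_pending++] = root; }

static void forgetIncompleteSplit(int root)
{
    for (size_t i = 0; i < n_pending; i++)
        if (pending[i] == root) { pending[i] = pending[--n_pending]; return; }
}

/* true when no split is awaiting its completion record */
static int gin_safe_restartpoint(void) { return n_pending == 0; }

/* buggy redo: skips the bookkeeping when a backup block restores the page */
static void redo_split_buggy(int root, int parent_has_backup_block)
{
    if (parent_has_backup_block)
        return;                     /* page image restored; forget skipped! */
    forgetIncompleteSplit(root);
}

/* fixed redo: the bookkeeping must run regardless of backup blocks */
static void redo_split_fixed(int root, int parent_has_backup_block)
{
    (void) parent_has_backup_block; /* page restore elided in this model */
    forgetIncompleteSplit(root);
}
```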
PS I'm going away for weekend, so I'll not be online until Monday.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Attachments:
On Fri, 2007-06-01 at 22:42 +0400, Teodor Sigaev wrote:
Found the reason: if the parent page is fully backed up after the child's
split, then forgetIncompleteSplit() isn't called at all.
i.e. full_page_writes = on
Hope, attached patch fix that. Pls, test it.
PS I'm going away for weekend, so I'll not be online until Monday.
OK, I'll check the other AMs to see if we have similar problems.
Have a good weekend.
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com