Reliability of WAL replication
We had some corrupted data files in the past (missing clog, see
http://archives.postgresql.org/pgsql-bugs/2007-07/msg00124.php) and are
thinking about setting up a warm standby system using WAL replication.
Would an error like the one we had appear in WAL and would it be
replicated too? Or is there some kind of consistency check that
prevents broken WAL from being restored?
I posted this question to the bugs list two weeks ago, but
haven't received an answer so far. Maybe it was the wrong list for that
kind of question, so we're giving it another try here.
Our customer demands a final statement from us, so we would appreciate a
prompt reply (I know, it's always urgent, isn't it? ;) ).
Regards,
Marc Schablewski
click:ware Informationstechnik GmbH
Marc,
On Tue, 2007-10-23 at 13:58 +0200, Marc Schablewski wrote:
We had some corrupted data files in the past (missing clog, see
http://archives.postgresql.org/pgsql-bugs/2007-07/msg00124.php) and are
thinking about setting up a warm standby system using WAL replication.
Would an error like the one we had appear in WAL and would it be
replicated too? Or is there some kind of consistency check that
prevents broken WAL from being restored?
We had WAL-based replication in place here some time ago, and the results
were somewhat mixed: in one case the corruption was replicated, other
times it was not... I guess it depends on where the corruption
occurred. I have a feeling the first case (corruption replicated)
was some postgres corner case reacting badly to kill -9 and the like, and
the second case (corruption not replicated) was file system corruption. I
haven't run WAL-based replication for a while, so I don't know what has
changed in it lately...
Cheers,
Csaba.
On Tue, 2007-10-23 at 13:58 +0200, Marc Schablewski wrote:
We had some corrupted data files in the past (missing clog, see
http://archives.postgresql.org/pgsql-bugs/2007-07/msg00124.php) and are
thinking about setting up a warm standby system using WAL replication.
Would an error like the one we had appear in WAL and would it be
replicated too? Or is there some kind of consistency check that
prevents broken WAL from being restored?
Each WAL record is CRC-checked, so it is quite unlikely that a record
corrupted on its own would be accepted and replayed.
The contents of the WAL record may cause the system to do something
wrong on the second server, but if this occurs it usually causes some
form of error and we can see that this has happened, report the bug and
then restart replication. If that kind of error occurs it is because of
a problem in the PostgreSQL software, not a fault of the replication
technique. That means these incidents are very rare and we have quickly
fixed such bugs when they do occur. I think this has happened twice in
12-18 months.
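To illustrate the distinction between the two cases, here is a minimal sketch in Python. It is a toy record format, not PostgreSQL's actual WAL layout: the framing (length, checksum, payload), the use of zlib's CRC-32, and the names encode_record, replay and flip_byte are all assumptions made up for this example. It only shows the general idea: a record damaged after it was written fails its checksum and replay stops there, while a record whose contents were already wrong when written carries a valid checksum and is replayed faithfully.

```python
import struct
import zlib

# Toy framing, NOT PostgreSQL's on-disk WAL layout (assumption for this sketch):
# 4-byte payload length, 4-byte CRC-32 of the payload, then the payload itself.
HEADER = struct.Struct("<II")


def encode_record(payload: bytes) -> bytes:
    """Frame a payload with its length and CRC-32."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def replay(log: bytes) -> list[bytes]:
    """Return the payloads of all records up to the first invalid one."""
    applied = []
    offset = 0
    while offset + HEADER.size <= len(log):
        length, crc = HEADER.unpack_from(log, offset)
        start = offset + HEADER.size
        payload = log[start:start + length]
        # A short read or a CRC mismatch means the record was damaged
        # after it was written, so replay stops at this point.
        if len(payload) != length or zlib.crc32(payload) != crc:
            break
        applied.append(payload)
        offset = start + length
    return applied


def flip_byte(data: bytes, index: int) -> bytes:
    """Simulate damage in transit or on disk by inverting one byte."""
    damaged = bytearray(data)
    damaged[index] ^= 0xFF
    return bytes(damaged)


if __name__ == "__main__":
    records = [b"insert row 1", b"update row 1", b"commit"]
    log = b"".join(encode_record(r) for r in records)

    # Damage the first payload byte of the second record.
    target = len(encode_record(records[0])) + HEADER.size
    print(replay(log))
    print(replay(flip_byte(log, target)))
```

The checksum only protects a record against changes made after it was generated. If the primary writes a record that is internally wrong, the standby has no way to tell from the CRC alone, which matches the mixed experience described earlier in the thread.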
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com