Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Three times in the last two weeks, we have experienced data corruption on secondary servers using streaming replication on client systems. The versions involved are 9.0.14, 9.2.5, and 9.3.1. Each incident was separate; in two cases they were for the same client (9.0.14 and 9.3.1), and in one for a different client (9.2.5).
The details of each incident are similar, but not identical. They are:
INCIDENT #1: 9.0.14 -- A new secondary (S1) was initialized using rsync off of an existing, correct primary (P1) for the base backup, and using WAL-E for WAL segment shipping. Both the primary and secondary were running 9.0.14. S1 properly connected to the primary once it was caught up on WAL segments, and S1 was then promoted as a primary using the trigger file.
No errors in the log files on either system.
After promotion, it was discovered that there was significant data loss on S1. Rows that were present on P1 were missing on S1, and some rows were duplicated (including duplicates that violated primary key and other unique constraints). The indexes were corrupt, in that they seemed to think that the duplicates were not duplicated, and that the missing rows were still present.
Because the client's schema included a "last_updated" field, we were able to determine that all of the rows that were either missing or duplicated had been updated on P1 shortly (3-5 minutes) before S1 was promoted. It's possible, but not confirmed, that there were active queries (including updates) running on P1 at the moment of S1's promotion.
As a note, P1 was created from another system (let's call it P0) using just WAL shipping (no streaming replication), and no data corruption was observed.
P1 and S1 were both AWS instances running Ubuntu 12.04, using EBS (with xfs as the file system) as the data volume.
P1 and S1 have been destroyed at this point.
INCIDENT #2: 9.3.1 -- In order to repair the database, a pg_dump was taken of S1, after having dropped the primary key and unique constraints, and restored into a new 9.3.1 server, P2. Duplicate rows were purged, and missing rows were added again. The database, a new primary, was then put back into production, and ran without incident.
A new secondary, S2 was created off of the primary. This secondary was created using pg_basebackup using --xlog-method=stream, although the WAL-E archiving was still present.
S2 attached to P2 without incident and no errors in the logs, but nearly-identical corruption was discovered, although this time with no duplicated rows, just missing ones. At this point, it's not clear if there was some clustering in the "last_updated" timestamp for the rows that are missing from S2.
P2 and S2 are both AWS instances running Ubuntu 12.04, using EBS (with xfs as the file system) as the data volume.
No errors in the log files on either system.
P2 and S2 are still operational.
INCIDENT #3: 9.2.5 -- A client was migrating a large database from a 9.2.2 system (P3) to a new 9.2.5 system (S3) using streaming replication. As I personally didn't do the steps on this one, I don't have quite as much information, but the basics are close to incident #2: When S3 was promoted using the trigger file, no errors were observed and the database came up normally, but rows were missing from S3 that were present on P3.
P3 is running Centos 6.3 with ext4 as the file system.
S3 is running Centos 6.4 with ext3 as the file system.
Log shipping in this case was done via rsync.
P3 and S3 are still operational.
No errors in the log files on either system.
--
Obviously, we're very concerned that a bug was introduced in the latest minor release. We're happy to gather data as required to assist in diagnosing this.
--
-- Christophe Pettus
xof@thebuild.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 11/18/2013 10:58 AM, Christophe Pettus wrote:
Three times in the last two weeks, we have experienced data corruption on secondary servers using streaming replication on client systems. The versions involved are 9.0.14, 9.2.5, and 9.3.1. Each incident was separate; in two cases they were for the same client (9.0.14 and 9.3.1), and in one for a different client (9.2.5).
To emphasize a salient point: we had not seen data corruption with these exact symptoms prior to the recent patch release.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Nov 18, 2013, at 10:58 AM, Christophe Pettus <xof@thebuild.com> wrote:
As a note, P1 was created from another system (let's call it P0) using just WAL shipping (no streaming replication), and no data corruption was observed.
As another data point, P0 was running 9.0.13, rather than 9.0.14.
--
-- Christophe Pettus
xof@thebuild.com
Hi,
On 2013-11-18 10:58:26 -0800, Christophe Pettus wrote:
INCIDENT #1: 9.0.14 -- A new secondary (S1) was initialized using
rsync off of an existing, correct primary (P1) for the base backup,
and using WAL-E for WAL segment shipping. Both the primary and
secondary were running 9.0.14. S1 properly connected to the primary
once it was caught up on WAL segments, and S1 was then promoted as
a primary using the trigger file.
Could you detail how exactly the base backup was created? Including the
*exact* logic for copying?
No errors in the log files on either system.
Do you have the log entries for the startup after the base backup?
Because the client's schema included a "last_updated" field, we were
able to determine that all of the rows that were either missing or
duplicated had been updated on P1 shortly (3-5 minutes) before S1 was
promoted. It's possible, but not confirmed, that there were active
queries (including updates) running on P1 at the moment of S1's
promotion.
Any chance you have log_checkpoints enabled? If so, could you check
whether there was a checkpoint around the time of the base backup?
This server is gone, right? If not, could you do:
SELECT ctid, xmin, xmax, * FROM whatever WHERE duplicate_row?
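A fleshed-out version of that query, against a hypothetical table orders with unique key id, might look like the following; all table and column names here are placeholders, not the client's actual schema:

```sql
-- ctid, xmin, and xmax are PostgreSQL system columns: the physical
-- tuple location and the inserting/deleting transaction ids.
SELECT ctid, xmin, xmax, id, last_updated
FROM orders
WHERE id IN (
    SELECT id
    FROM orders
    GROUP BY id
    HAVING count(*) > 1
)
ORDER BY id, ctid;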
INCIDENT #2: 9.3.1 -- In order to repair the database, a pg_dump was taken of S1y, after having dropped the primary and unique constraints, and restored into a new 9.3.1 server, P2. Duplicate rows were purged, and missing rows were added again. The database, a new primary, was then put back into production, and ran without incident.
A new secondary, S2 was created off of the primary. This secondary was created using pg_basebackup using --xlog-method=stream, although the WAL-E archiving was still present.
S2 attached to P2 without incident and no errors in the logs, but nearly-identical corruption was discovered (although this time without the duplicated rows, just missing rows). At this point, it's not clear if there was some clustering in the "last_updated" timestamp for the rows that are missing from S2. No duplicated rows were observed.
P2 and S2 are both AWS instances running Ubuntu 12.04, using EBS (with xfs as the file system) as the data volume.
No errors in the log files on either system.
Could you list the *exact* steps you did to startup the cluster?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Nov 18, 2013, at 11:28 AM, Andres Freund <andres@2ndquadrant.com> wrote:
Could you detail how exactly the base backup was created? Including the
*exact* logic for copying?
0. Before any of this began, P1 was archiving WAL segments to AWS-S3.
1. pg_start_backup('', true) on P1.
2. Using rsync -av on P1, the entire $PGDATA directory was pushed from P1 to S1.
3. Once the rsync was complete, pg_stop_backup() on P1.
4. Create appropriate recovery.conf on S1.
5. Bring up PostgreSQL on S1.
6. PostgreSQL recovers normally (pulling WAL segments from WAL-E), and eventually connects to P1.
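A minimal recovery.conf for step 4 could look roughly like this; the host, the trigger path, and the WAL-E wal-fetch form of restore_command are assumptions based on the setup described above, not the actual file used:

```
standby_mode = 'on'
primary_conninfo = 'host=10.0.0.1 user=repluser'
restore_command = 'wal-e wal-fetch "%f" "%p"'
trigger_file = '/tmp/postgresql.trigger'
```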
Do you have the log entries for the startup after the base backup?
Sadly, not anymore.
This server is gone, right?
Correct.
Could you list the *exact* steps you did to startup the cluster?
0. Before any of this began, P2 was archiving WAL segments to AWS-S3.
1. Initial (empty) data directory deleted on S2.
2. New data directory created with:
/usr/lib/postgresql/9.3/bin/pg_basebackup --verbose --progress --xlog-method=stream --host=<ip> --user=repluser --pgdata=/data/9.3/main
3. Once the pg_basebackup completed, create appropriate recovery.conf on S2.
4. Bring up PostgreSQL on S2.
5. PostgreSQL recovers normally (pulling a small number of WAL segments from WAL-E), and eventually connects to P2.
--
-- Christophe Pettus
xof@thebuild.com
On 2013-11-18 11:38:43 -0800, Christophe Pettus wrote:
On Nov 18, 2013, at 11:28 AM, Andres Freund <andres@2ndquadrant.com> wrote:
Could you detail how exactly the base backup was created? Including the
*exact* logic for copying?
0. Before any of this began, P1 was archiving WAL segments to AWS-S3.
1. pg_start_backup('', true) on P1.
2. Using rsync -av on P1, the entire $PGDATA directory was pushed from P1 to S1.
Without deleting any data, including pg_xlog/, backup.label, anything?
Did you have hot_standby enabled on all of those machines? Even on the
9.0.13 cluster?
Could you list the *exact* steps you did to startup the cluster?
0. Before any of this began, P2 was archiving WAL segments to AWS-S3.
1. Initial (empty) data directory deleted on S2.
2. New data directory created with:
/usr/lib/postgresql/9.3/bin/pg_basebackup --verbose --progress --xlog-method=stream --host=<ip> --user=repluser --pgdata=/data/9.3/main
3. Once the pg_basebackup completed, create appropriate recovery.conf on S2.
That was just recovery command and primary conninfo?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Nov 18, 2013, at 11:47 AM, Andres Freund <andres@2ndquadrant.com> wrote:
Without deleting any data, including pg_xlog/, backup.label, anything?
Correct.
Did you have hot_standby enabled on all of those machines? Even on the
9.0.13 cluster?
Yes.
That was just recovery command and primary conninfo?
Correct.
--
-- Christophe Pettus
xof@thebuild.com
On Nov 18, 2013, at 11:47 AM, Andres Freund <andres@2ndquadrant.com> wrote:
Did you have hot_standby enabled on all of those machines? Even on the
9.0.13 cluster?
Actually, it's a bit more complex than this:
1. We don't know about P0, the 9.0.13 machine. I assume it was, but it was managed elsewhere.
2. P1 never had hot_standby = 'on', as it was never intended to be a hot standby.
3. S1 did have hot_standby = 'on'.
--
-- Christophe Pettus
xof@thebuild.com
On Nov 18, 2013, at 11:47 AM, Andres Freund <andres@2ndquadrant.com> wrote:
Without deleting any data, including pg_xlog/, backup.label, anything?
One more correction: After rsync finished and the pg_base_backup() was issued, the contents of pg_xlog/ on S1 were deleted.
--
-- Christophe Pettus
xof@thebuild.com
On Nov 18, 2013, at 12:00 PM, Christophe Pettus <xof@thebuild.com> wrote:
One more correction: After rsync finished and the pg_base_backup() was issued, the contents of pg_xlog/ on S1 were deleted.
pg_stop_backup(), sorry.
--
-- Christophe Pettus
xof@thebuild.com
On 2013-11-18 10:58:26 -0800, Christophe Pettus wrote:
After promotion, it was discovered that there was significant data
loss on S1. Rows that were present on P1 were missing on S1, and some
rows were duplicated (including duplicates that violated primary key
and other unique constraints). The indexes were corrupt, in that they
seemed to think that the duplicates were not duplicated, and that the
missing rows were still present.
Were there any kind of patterns in the lost data? What kind of workload
are they running? I have an idea what the issue might be...
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Nov 18, 2013, at 12:57 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Were there any kind of patterns in the lost data? What kind of workload
are they running? I have an idea what the issue might be...
On the P1 > S1 case, the corrupted data had been modified in the last few minutes before the switchover. I don't want to over-analyze, but it was within the checkpoint_timeout value for that server.
On the P2 > S2 case, it's less obvious what the pattern is, since there was no cutover.
Insufficient information on the P3 > S3 case.
Each of them is a reasonably high-volume OLTP-style workload. The P1/P2 client has a very high level of writes; the P3 client is more read-heavy, but still has a fair number of writes.
--
-- Christophe Pettus
xof@thebuild.com
Hi,
Afaics it's likely a combination/interaction of bugs and fixes between:
* the initial HS code
* 5a031a5556ff83b8a9646892715d7fef415b83c3
* f44eedc3f0f347a856eea8590730769125964597
But that'd mean nobody noticed it during 9.3's beta...
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Great! If there's any further data I can supply to help, let me know.
On Nov 18, 2013, at 2:15 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Hi,
Afaics it's likely a combination/interaction of bugs and fixes between:
* the initial HS code
* 5a031a5556ff83b8a9646892715d7fef415b83c3
* f44eedc3f0f347a856eea8590730769125964597
But that'd mean nobody noticed it during 9.3's beta...
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
-- Christophe Pettus
xof@thebuild.com
On 2013-11-18 14:25:58 -0800, Christophe Pettus wrote:
Great! If there's any further data I can supply to help, let me know.
Trying to reproduce the issue with and without hot_standby=on would be
very helpful, but I guess that's time consuming?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Nov 18, 2013, at 2:26 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Trying to reproduce the issue with and without hot_standby=on would be
very helpful, but I guess that's time consuming?
I've been working on it, but I haven't gotten it to fail yet. I'll keep at it.
--
-- Christophe Pettus
xof@thebuild.com
Hi,
On 2013-11-18 23:15:59 +0100, Andres Freund wrote:
Afaics it's likely a combination/interaction of bugs and fixes between:
* the initial HS code
* 5a031a5556ff83b8a9646892715d7fef415b83c3
* f44eedc3f0f347a856eea8590730769125964597
Yes, the combination of those is guilty.
Man, this is (to a good part) my bad.
But that'd mean nobody noticed it during 9.3's beta...
It's fairly hard to reproduce artificially since a) there have to be
enough transactions starting and committing from the start of the
checkpoint the standby is starting from to the point it does
LogStandbySnapshot() to cross a 32768 boundary b) hint bits often save
the game by not accessing clog at all anymore and thus not noticing the
corruption.
I've reproduced the issue by having an INSERT ONLY table that's never
read from. It's helpful to disable autovacuum.
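A rough sketch of that reproduction workload, with hypothetical table and file names; only the SQL file generation runs locally, while the server-side steps need a live primary/standby pair and are left as comments:

```shell
# Generate an INSERT-only workload large enough to cross a 32768-xid
# clog page boundary; under psql's autocommit each statement runs in
# its own transaction, so every row consumes an xid.
cat > /tmp/repro.sql <<'EOF'
CREATE TABLE insert_only (id bigserial PRIMARY KEY, payload text);
EOF
for i in $(seq 1 40000); do
    echo "INSERT INTO insert_only (payload) VALUES ('row $i');"
done >> /tmp/repro.sql

# Against a primary with autovacuum = off (placeholder conninfo):
#   psql -h primary -U repluser -f /tmp/repro.sql
# Take a base backup mid-run, start the standby with hot_standby=on,
# then compare SELECT count(*) FROM insert_only on both sides.
wc -l < /tmp/repro.sql
```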
Imo something like the attached patch should be done. The description I came
up with is:
Fix Hot-Standby initialization of clog and subtrans.
These bugs can cause data loss on standbys started with hot_standby=on
at the moment they start to accept read only queries by marking
committed transactions as uncommitted. The likelihood of such
corruptions is small unless the primary has a high transaction rate.
5a031a5556ff83b8a9646892715d7fef415b83c3 fixed bugs in HS's startup
logic by maintaining less state until at least
STANDBY_SNAPSHOT_PENDING state was reached, missing the fact that both
clog and subtrans are written to before that. This only failed to fail
in common cases because the usage of ExtendCLOG in procarray.c was
superfluous since clog extensions are actually WAL logged.
In f44eedc3f0f347a856eea8590730769125964597 I then tried to fix the
missing extensions of pg_subtrans due to the former commit's changes -
which are not WAL logged - by performing the extensions when switching
to a state > STANDBY_INITIALIZED and not performing xid assignments
before that - again missing the fact that ExtendCLOG is unnecessary -
but screwed up twice: Once because latestObservedXid wasn't updated
anymore in that state due to the earlier commit and once by having an
off-by-one error in the loop performing extensions.
This means that whenever a CLOG_XACTS_PER_PAGE (32768 with default
settings) boundary was crossed between the start of the checkpoint
recovery started from and the first xl_running_xact record old
transactions commit bits in pg_clog could be overwritten if they
started and committed in that window.
Fix this mess by not performing ExtendCLOG() in HS at all anymore
since it's unneeded and evidently dangerous and by performing subtrans
extensions even before reaching STANDBY_SNAPSHOT_PENDING.
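To illustrate the boundary arithmetic above, a quick sketch; the real mapping is the TransactionIdToPage macro in src/backend/access/transam/clog.c, and this just mirrors its integer division:

```shell
# With default CLOG_XACTS_PER_PAGE = 32768, the clog page holding an
# xid's commit status is simply xid / 32768.
CLOG_XACTS_PER_PAGE=32768
clog_page() {
    echo $(( $1 / CLOG_XACTS_PER_PAGE ))
}

# A page boundary is crossed wherever two xids map to different pages;
# transactions committing across such a boundary in the vulnerable
# window could have their commit bits overwritten.
clog_page 32767   # page 0
clog_page 32768   # page 1
```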
Imo this warrants an expedited point release :(
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-11-19 15:20:01 +0100, Andres Freund wrote:
Imo something like the attached patch should be done. The description I came
up with is: Fix Hot-Standby initialization of clog and subtrans.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
0001-Fix-Hot-Standby-initialization-of-clog-and-subtrans.patch (text/x-patch; +39/-28)
On 11/19/2013 09:20 AM, Andres Freund wrote:
Imo this warrants and expedited point release :(
I presume anyone who is vulnerable to it would need to recreate their
secondary servers to get rid of potential problems?
cheers
andrew
On 2013-11-19 09:33:34 -0500, Andrew Dunstan wrote:
On 11/19/2013 09:20 AM, Andres Freund wrote:
Imo this warrants and expedited point release :(
I presume anyone who is vulnerable to it would need to recreate their
secondary servers to get rid of potential problems?
Yes. There are less expensive ways to do it, but those seem too complicated to suggest.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services