Database corruption event: unlockable rows, possibly bogus virtual xids / invalid backend xids?
(I originally posted this to pgsql-admin and was pointed to here instead.)
Folks-
I'm doing a postmortem on a corruption event we had. I have an idea of
what happened, but I'm not sure. I figured I'd share what happened and see
if I'm close to right here.
Event: Running 9.1.6 with hot-standby, archiving 4 months of WAL files,
and even a nightly pg_dumpall. 50G database. Trying to update or delete a
row in a small (21-row, but heavily used) table would lock up completely
and never finish. Removed all clients, restarted the db instance, no joy.
Checked pg_stat_activity, and nothing was there that wasn't idle... ran
the delete, locked up.
Ran (SELECT * FROM pg_class JOIN pg_locks ON pg_locks.relation =
pg_class.oid;) with no clients touching this row, after a fresh restart of
the db, and saw virtualtransaction values against this same table of the
form -1/nnnnn, where nnnnn was a huge number. Turns out we had about 159
entries from different tables in the database. Checked the hot-standby
and, of course, no locks or anything. Switched to the hot-standby.
The hot-standby instantly gained these locks. Also noticed that 2 prepared
transactions migrated to the hot-standby. Binary upgraded to 9.1.8; the
locks still existed. Ended up executing the one thing we knew would work:
take the site down and pg_dumpall to a fresh instance. Everything is fine.
A little more background: We were running 9.1.4 back when 9.1.6 came out.
We saw there were possible corruption issues, so we did a binary upgrade
and reindexed. Everything seemed to be fine, but I wonder if we really had
problems back then. We rebuilt the hot-standby after the binary upgrade via
a normal restore and WAL-file replay. I should also note that the row with
the lock that would not go away was created by an app server that was
killed (via -9) because it was non-responsive, and the row's 'create date'
(in db and logs) is the exact time the app server was killed.
I was wondering a) whether these virtualtransactions that start with '-1/'
indicate a problem, and b) whether this could have resulted from pre-9.1.6
corruption that was fixed in 9.1.6. Or could this have occurred when we
killed that app server? Or... am I looking in the wrong place?
I do still have the old data directories so I can start them up and check
out the dataset. Any advice?
Currently running environment: CentOS 5.x
Used the http://yum.postgresql.org/9.1 repo...
Currently using PostgreSQL 9.1.8 on x86_64-unknown-linux-gnu, compiled by
gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-52), 64-bit
but was on 9.1.6 at failure time.
The hardware on both primary and secondary servers is in good condition:
RAID 5 via a hardware card with no issues, each server with 2 hyperthreaded
Xeon E5620 2.4GHz CPUs (4 cores each, 16 threads total) and 32G RAM.
Data from pg_dumpall checked out fine; as far as we could tell, no data
loss occurred. Just this transaction lock and what seemed like invalid
backend ids listed in the virtualtransaction column of the pg_locks view
(-1/verylargenumber).
--
Virtually, Ned Wolpert
"Settle thy studies, Faustus, and begin..." --Marlowe
Ned Wolpert <ned.wolpert@imemories.com> wrote:
I'm doing a postmortem on a corruption event we had. I have an
idea on what happened, but not sure. I figure I'd share what
happened and see if I'm close to right here.
Running 9.1.6 with hot-standby, archiving 4 months of wal files,
and even a nightly pg_dump all.
Those WAL files aren't going to be of much use without an
occasional base backup to apply them to.
50G database. Trying to update or delete a row in a small (21
row, but heavily used table) would lock up completely. Never
finish.
How long is "never" in hours or days?
Removed all clients, restarted the db instance, no joy.
Check pg_stat_activity, and nothing that wasn't idle.... run the
delete, locked up.
Did you have any rows in pg_prepared_xacts that had lingered for
longer than you were waiting for the delete?
Ran (SELECT * FROM pg_class JOIN pg_locks ON pg_locks.relation =
pg_class.oid;) with no clients touching this row
Of course you need to be connected to the right database when you
run this, and you need to look at relation locks -- row locks don't
show up in that view.
fresh restart of the db, and saw virtualtransactions against this
same table where the values would be -1/nnnnn, where nnnnn was a
huge number. Turns out we had about 159 entries from different
tables in the database.
Sounds like prepared transactions. Again, what was in
pg_prepared_xacts?
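For anyone following along, that check can be as simple as something
like this (run as a superuser, or as a user who can see the view):

```sql
-- List lingering prepared transactions, oldest first
SELECT gid, prepared, owner, database
FROM pg_prepared_xacts
ORDER BY prepared;
```

Anything that has been sitting there longer than a few seconds deserves
a close look.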
Checked hot-standby and, of course, no locks or anything.
Switched to hot-standby.
Meaning you promoted it to master?
Hot-standby instantly gained these locks. Also noticed that 2
prepared transactions migrated to the hot-standby. Binary
upgraded to 9.1.8, locks still existed. Ended up executing the
one thing we knew would work. Take the site down, pg_dumpall to
fresh instance. Everything is fine.
That all sounds consistent with a flaky transaction manager.
A little more background: We were running 9.1.4 back when 9.1.6
came out. We saw there was possible corruption issues and did a
binary upgrade and reindexing. Everything seemed to be fine, but
I wonder if we really had problems back then. We rebuilt the
hot-standby after the binary upgrade via normal restore and
wal-file replays. I should also note that this row that had the
lock on it that would not go away, was created by an app server
that was killed (via -9) since it was non-responsive, and the row
'create date' (in db and logs) is the exact time the app server
was killed.
It sounds as though you abruptly terminated the process running
your transaction manager, which left it unaware of one or more
dangling prepared transactions. Further, it sounds like your
transaction manager doesn't go looking for such detritus. If it's
not going to watch for such problems, you had better do so. Any
prepared transaction which is sitting in pg_prepared_xacts for more
than a few seconds, I would consider suspect. After a few minutes to a
few hours, I would consider them to be a problem. After a day I would
consider the transaction manager to have fallen on its face, and
would go clean things up by either committing or rolling back the
prepared transaction(s).
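As a sketch of that cleanup (the gid shown is hypothetical; substitute
the actual gid values reported by pg_prepared_xacts, and pick commit or
rollback based on whether the transaction's work should stand):

```sql
-- Find prepared transactions that have lingered too long
SELECT gid, prepared
FROM pg_prepared_xacts
WHERE prepared < now() - interval '1 hour';

-- Resolve each one by its gid:
ROLLBACK PREPARED 'some-gid';   -- discard its changes, release its locks
-- or: COMMIT PREPARED 'some-gid';
```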
I was wondering if a) these virtualtransactions that start with
'-1/' indicate a problem, b) if this could have happened from pre
9.1.6 corruption that was fixed in 9.1.6. Or, could this have
occurred when we killed that app server? Or.... am I looking in
the wrong place.
-1 as the backend ID doesn't indicate a problem per se. It indicates
the transaction has been "prepared" and is no longer associated
with a backend process or connection.
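One way to correlate those -1/nnnnn lock entries with the prepared
transactions holding them is a join like the following — a sketch, which
assumes the nnnnn part of the virtualtransaction text is the local xid
recorded in pg_prepared_xacts:

```sql
-- Match -1/xid lock entries to the prepared transaction that holds them
SELECT l.locktype, l.relation::regclass AS relation, l.mode,
       p.gid, p.prepared
FROM pg_locks l
JOIN pg_prepared_xacts p
  ON l.virtualtransaction = '-1/' || p.transaction;
```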
Something which was assuming the role of a transaction manager told
a transaction (or many of them) to prepare for commit as part of a
distributed transaction. A transaction which says it successfully
completed such a PREPARE statement must hold all its locks and keep
all changes pending until it is told to commit or roll back, even
across database restarts. It sounds like things were left in this
state for a very long time, which can lead to all kinds of
problems, notably bloat and blocking.
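For illustration, the two-phase sequence looks roughly like this (the
table name and gid are made up, and max_prepared_transactions must be set
above zero for PREPARE TRANSACTION to work at all):

```sql
BEGIN;
UPDATE widgets SET status = 'sold' WHERE id = 1;  -- hypothetical table
PREPARE TRANSACTION 'tm-12345';  -- locks and changes now persist, even across restarts
-- The transaction manager must eventually issue one of:
COMMIT PREPARED 'tm-12345';
-- or: ROLLBACK PREPARED 'tm-12345';
```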
I do still have the old data directories so I can start them up
and check out the dataset. Any advice?
I would start it up and see what's in pg_prepared_xacts.
--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Ned Wolpert <ned.wolpert@imemories.com> writes:
Event: Running 9.1.6 with hot-standby, archiving 4 months of wal files,
and even a nightly pg_dump all. 50G database. Trying to update or delete a
row in a small (21 row, but heavily used table) would lock up completely.
Never finish. Removed all clients, restarted the db instance, no joy. Check
pg_stat_activity, and nothing that wasn't idle.... run the delete, locked
up.
I'm betting one of those prepared transactions had updated or deleted
this row, and thus held a row lock on it. (Come to think of it, a
SELECT FOR UPDATE/SHARE might have been enough.) Did you try committing
or rolling back those xacts?
regards, tom lane
Tom and Kevin-
There were two entries in pg_prepared_xacts. In the test bed, executing
'ROLLBACK PREPARED' on both allowed the system to continue processing.
All locks I saw in pg_locks whose virtualtransaction started with '-1/'
were also gone. That was indeed the issue. More importantly to me, there
was likely no issue left over from the 9.1.4->9.1.6 upgrade we did; just
a 'flaky transaction manager', as you suspected.
Thanks to both of you for help in tracking this down.
P.S. Kevin, We also do tar archives of the data directories nightly to
accompany the wal files we store, using pg_start_backup/pg_stop_backup. :-)
Full restores are tested monthly.