spurious wrap-around shutdown

Started by Jeff Janesover 12 years ago3 messages
#1Jeff Janes
jeff.janes@gmail.com

In 9.3 HEAD I am getting what seems to be spurious wrap-around shutdowns.

postgres=# SELECT datname, datfrozenxid, age(datfrozenxid) FROM pg_database;

datname | datfrozenxid | age
-----------+--------------+-----------
template1 | 2621759843 | 0
template0 | 2621759843 | 0
postgres | 2571759843 | 50000000
jjanes | 2437230921 | 184528922

postgres=# select txid_current();
ERROR: database is not accepting commands to avoid wraparound data loss in
database "jjanes"
HINT: Stop the postmaster and use a standalone backend to vacuum that
database.
You might also need to commit or roll back old prepared transactions.

184,528,922 is well short of 2 billion, so what is going on?

I thought maybe the ShmemVariableCache were not getting updated when vacuum
finished, but if I restart the server (forcing shared memory to get rebuilt
from disk) the condition continues.

I tried setting a breakpoint on SetTransactionIdLimit, but that seems to
get executed on startup before the -W flag takes effect, so I can't find it.

Any tips on how to debug this? I figure the next step is running git
bisect, but that is sure to be tedious.

I'm using a variant of the below to reach wraparound quicker, perhaps that
is introducing a bug?

/messages/by-id/20130207203216.GE5172@alvh.no-ip.org

Cheers,

Jeff

#2Andres Freund
andres@2ndquadrant.com
In reply to: Jeff Janes (#1)
Re: spurious wrap-around shutdown

On 2013-06-16 11:54:24 -0700, Jeff Janes wrote:

In 9.3 HEAD I am getting what seems to be spurious wrap-around shutdowns.

postgres=# SELECT datname, datfrozenxid, age(datfrozenxid) FROM pg_database;

datname | datfrozenxid | age
-----------+--------------+-----------
template1 | 2621759843 | 0
template0 | 2621759843 | 0
postgres | 2571759843 | 50000000
jjanes | 2437230921 | 184528922

postgres=# select txid_current();
ERROR: database is not accepting commands to avoid wraparound data loss in
database "jjanes"
HINT: Stop the postmaster and use a standalone backend to vacuum that
database.
You might also need to commit or roll back old prepared transactions.

184,528,922 is well short of 2 billion, so what is going on?

I guess you're sure you don't have any old prepared xacts running
around?

I thought maybe the ShmemVariableCache were not getting updated when vacuum
finished, but if I restart the server (forcing shared memory to get rebuilt
from disk) the condition continues.

I tried setting a breakpoint on SetTransactionIdLimit, but that seems to
get executed on startup before the -W flag takes effect, so I can't find it.

Any tips on how to debug this? I figure the next step is running git
bisect, but that is sure to be tedious.

I'd first add the actual xids limits that are assumed to be dangerous to
the error messages in GetNewTransactionId(). That already might give a
hint.

I'm using a variant of the below to reach wraparound quicker, perhaps that
is introducing a bug?

/messages/by-id/20130207203216.GE5172@alvh.no-ip.org

I don't really trust that patch because it skips loads of checks since
it only repeats part of the work that GetNewTransactionId does. I don't
immediately see what the problem that could cause though.
IIRC I had postes a patch in that thread that looped around
GetNewTransactionId() in that thread. It might be worthwile to test
whether that also reproduces the issue.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Jeff Janes
jeff.janes@gmail.com
In reply to: Andres Freund (#2)

On Sun, Jun 16, 2013 at 11:54 AM, Jeff Janes
<jeff.janes@gmail.com<javascript:_e({}, 'cvml',
'jeff.janes@gmail.com');>

wrote:

In 9.3 HEAD I am getting what seems to be spurious wrap-around shutdowns.

postgres=# SELECT datname, datfrozenxid, age(datfrozenxid) FROM
pg_database;

datname | datfrozenxid | age
-----------+--------------+-----------
template1 | 2621759843 | 0
template0 | 2621759843 | 0
postgres | 2571759843 | 50000000
jjanes | 2437230921 | 184528922

While the behavior is weird, it is not a regression (also present in 9.2
with suitable changes in timing) and the shutdown is not spurious.

If I execute the above query immediately after the shutdown, I see what I
would expect, jjanes has an age of about 2^31.

The one table that is holding everything back is already getting autovac
for wraparound at that point, and eventually that vacuum finishes. When
done, pg_class and pg_database are updated (I don't know how they get
updated without trying to assign another transaction), and then I get the
above query results.

I would think the database would re-allow new transactions at this point,
but it does not. I don't know why.

Since this isn't a regression in 9.3, I probably won't pursue it any more
at this time, unless encouraged to.

Cheers,

Jeff