getting status transaction error
around 6:30 this morning, I started getting the following messages in my log:
Feb 6 06:33:34 mojo postgres[1117]: [2-1] :: ERROR: could not access
status of transaction 544441911
Feb 6 06:34:35 mojo postgres[1128]: [2-1] :: ERROR: could not access
status of transaction 544441911
[...]
repeated roughly once a minute. I've never seen this before. this is
on postgres 8.1.1 running on fedora core 4 smp. I don't have any
contextual information yet but I'm getting ready to turn statement
logging on.
Anybody know what this is?
merlin
On 2/6/07, Merlin Moncure <mmoncure@gmail.com> wrote:
around 6:30 this morning, I started getting the following messages in my log:
Feb 6 06:33:34 mojo postgres[1117]: [2-1] :: ERROR: could not access
status of transaction 544441911
Feb 6 06:34:35 mojo postgres[1128]: [2-1] :: ERROR: could not access
status of transaction 544441911
[...]repeated roughly once a minute. I've never seen this before. this is
on postgres 8.1.1 running on fedora core 4 smp. I don't have any
contextual information yet but I'm getting ready to turn statement
logging on.Anybody know what this is?
[x-posting to -hackers]
actually, here is some more relevant bits from the log.
Feb 6 06:31:33 mojo postgres[1088]: [1-1] :: LOG: autovacuum:
processing database "template0"
Feb 6 06:31:33 mojo postgres[1088]: [2-1] :: ERROR: could not access
status of transaction 544441911
Feb 6 06:31:33 mojo postgres[1088]: [2-2] :: DETAIL: could not open
file "pg_clog/0207": No such file or directory
repeated ad-naseum
merlin
Merlin Moncure wrote:
On 2/6/07, Merlin Moncure <mmoncure@gmail.com> wrote:
around 6:30 this morning, I started getting the following messages in
my log:Feb 6 06:33:34 mojo postgres[1117]: [2-1] :: ERROR: could not access
status of transaction 544441911
Feb 6 06:34:35 mojo postgres[1128]: [2-1] :: ERROR: could not access
status of transaction 544441911
[...]repeated roughly once a minute. I've never seen this before. this is
on postgres 8.1.1 running on fedora core 4 smp. I don't have any
contextual information yet but I'm getting ready to turn statement
logging on.Anybody know what this is?
[x-posting to -hackers]
actually, here is some more relevant bits from the log.
Feb 6 06:31:33 mojo postgres[1088]: [1-1] :: LOG: autovacuum:
processing database "template0"
Feb 6 06:31:33 mojo postgres[1088]: [2-1] :: ERROR: could not access
status of transaction 544441911
Feb 6 06:31:33 mojo postgres[1088]: [2-2] :: DETAIL: could not open
file "pg_clog/0207": No such file or directory
hmm I first thought it could have been
http://archives.postgresql.org/pgsql-committers/2006-01/msg00288.php
which affects 8.1.1 but that's not the very same error as the one
created by the above bug.
Stefan
Merlin Moncure wrote:
On 2/6/07, Merlin Moncure <mmoncure@gmail.com> wrote:
around 6:30 this morning, I started getting the following messages in my
log:Feb 6 06:33:34 mojo postgres[1117]: [2-1] :: ERROR: could not access
status of transaction 544441911
Feb 6 06:34:35 mojo postgres[1128]: [2-1] :: ERROR: could not access
status of transaction 544441911
[...]repeated roughly once a minute. I've never seen this before. this is
on postgres 8.1.1 running on fedora core 4 smp. I don't have any
contextual information yet but I'm getting ready to turn statement
logging on.Anybody know what this is?
[x-posting to -hackers]
actually, here is some more relevant bits from the log.
Feb 6 06:31:33 mojo postgres[1088]: [1-1] :: LOG: autovacuum:
processing database "template0"
Feb 6 06:31:33 mojo postgres[1088]: [2-1] :: ERROR: could not access
status of transaction 544441911
Feb 6 06:31:33 mojo postgres[1088]: [2-2] :: DETAIL: could not open
file "pg_clog/0207": No such file or directory
I guess the problem here is that autovacuum believes that template0
needs a database-wide vacuum due to Xid wraparound getting closer. And
that database seems to have Xid 544441911 somewhere, the clog bit for
which was in the 0207 file which was deleted some time ago.
--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
On 2/6/07, Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote:
Merlin Moncure wrote:
On 2/6/07, Merlin Moncure <mmoncure@gmail.com> wrote:
around 6:30 this morning, I started getting the following messages in
my log:Feb 6 06:33:34 mojo postgres[1117]: [2-1] :: ERROR: could not access
status of transaction 544441911
Feb 6 06:34:35 mojo postgres[1128]: [2-1] :: ERROR: could not access
status of transaction 544441911
[...]repeated roughly once a minute. I've never seen this before. this is
on postgres 8.1.1 running on fedora core 4 smp. I don't have any
contextual information yet but I'm getting ready to turn statement
logging on.Anybody know what this is?
[x-posting to -hackers]
actually, here is some more relevant bits from the log.
Feb 6 06:31:33 mojo postgres[1088]: [1-1] :: LOG: autovacuum:
processing database "template0"
Feb 6 06:31:33 mojo postgres[1088]: [2-1] :: ERROR: could not access
status of transaction 544441911
Feb 6 06:31:33 mojo postgres[1088]: [2-2] :: DETAIL: could not open
file "pg_clog/0207": No such file or directoryhmm I first thought it could have been
http://archives.postgresql.org/pgsql-committers/2006-01/msg00288.php
which affects 8.1.1 but that's not the very same error as the one
created by the above bug.
ya, it doesn't seem to match, as this seems to be repeating quite
regularly. interesting that my 'clog' files start at 06B6 and count
up. 0207 is way off the charts.
a lot of applications are hitting this database, and so far everything
seems to be running ok (i found this log msg by accident), but I am
now officially very nervous.
merlin
On 2/6/07, Alvaro Herrera <alvherre@commandprompt.com> wrote:
actually, here is some more relevant bits from the log.
Feb 6 06:31:33 mojo postgres[1088]: [1-1] :: LOG: autovacuum:
processing database "template0"
Feb 6 06:31:33 mojo postgres[1088]: [2-1] :: ERROR: could not access
status of transaction 544441911
Feb 6 06:31:33 mojo postgres[1088]: [2-2] :: DETAIL: could not open
file "pg_clog/0207": No such file or directoryI guess the problem here is that autovacuum believes that template0
needs a database-wide vacuum due to Xid wraparound getting closer. And
that database seems to have Xid 544441911 somewhere, the clog bit for
which was in the 0207 file which was deleted some time ago.
Latest checkpoint's NextXID: 2162841139
2^31: 2147483648
is this related?
merlin
Merlin Moncure wrote:
ya, it doesn't seem to match, as this seems to be repeating quite
regularly. interesting that my 'clog' files start at 06B6 and count
up. 0207 is way off the charts.a lot of applications are hitting this database, and so far everything
seems to be running ok (i found this log msg by accident), but I am
now officially very nervous.
I don't think there's much cause for concern here. If my theory is
correct, this is an autovacuum bug which was fixed in 8.1.7.
What I'd do is create a 0207 clog file, fill it with 0x55 (which is
"transactions committed" for all transactions in that interval), and do
a VACUUM FREEZE on that database. You'll need to set
pg_database.datallowconn=true beforehand.
Of course, I'd copy the files somewhere else and experiment on a scratch
postmaster, running on a different port, just to be sure ...
--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
n 2/6/07, Alvaro Herrera <alvherre@commandprompt.com> wrote:
Merlin Moncure wrote:
ya, it doesn't seem to match, as this seems to be repeating quite
regularly. interesting that my 'clog' files start at 06B6 and count
up. 0207 is way off the charts.a lot of applications are hitting this database, and so far everything
seems to be running ok (i found this log msg by accident), but I am
now officially very nervous.I don't think there's much cause for concern here. If my theory is
correct, this is an autovacuum bug which was fixed in 8.1.7.What I'd do is create a 0207 clog file, fill it with 0x55 (which is
"transactions committed" for all transactions in that interval), and do
a VACUUM FREEZE on that database. You'll need to set
pg_database.datallowconn=true beforehand.Of course, I'd copy the files somewhere else and experiment on a scratch
postmaster, running on a different port, just to be sure ...
thats a big help, database is actually fairly huge, so I may have to
just go ahead and do it. I'm off to a meeting, but I'll check back
when I'm done and assuming nobody else says 'don't do that', I'll try
the fix and post back with the result.
thanks all,
merlin
Merlin Moncure wrote:
On 2/6/07, Alvaro Herrera <alvherre@commandprompt.com> wrote:
actually, here is some more relevant bits from the log.
Feb 6 06:31:33 mojo postgres[1088]: [1-1] :: LOG: autovacuum:
processing database "template0"
Feb 6 06:31:33 mojo postgres[1088]: [2-1] :: ERROR: could not access
status of transaction 544441911
Feb 6 06:31:33 mojo postgres[1088]: [2-2] :: DETAIL: could not open
file "pg_clog/0207": No such file or directoryLatest checkpoint's NextXID: 2162841139
2^31: 2147483648
I think the relevant arithmetic here is
echo "2162841139 544441911 - p" | dc
1618399228
That's a billion and a half transactions. Autovacuum uses the formula
this_whole_db = (tmp->age >
(int32) ((MaxTransactionId >> 3) * 3 - 100000));
to determine whether it needs database-wide vacuum.
(MaxTransactionId >> 3) is 536870911, so the calculation is
536870911 * 3 - 100000
echo "536870911 3 * 100000 - p" | dc
1610512733
which looks awfully close to the number above. About 7 million
transactions must have passed since the first time the error showed up
-- does that sound likely?
Well, scratch that -- what's the _current_ Xid? (not lastest
checkpoint's)
--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Merlin Moncure wrote:
n 2/6/07, Alvaro Herrera <alvherre@commandprompt.com> wrote:
Merlin Moncure wrote:
ya, it doesn't seem to match, as this seems to be repeating quite
regularly. interesting that my 'clog' files start at 06B6 and count
up. 0207 is way off the charts.a lot of applications are hitting this database, and so far everything
seems to be running ok (i found this log msg by accident), but I am
now officially very nervous.I don't think there's much cause for concern here. If my theory is
correct, this is an autovacuum bug which was fixed in 8.1.7.What I'd do is create a 0207 clog file, fill it with 0x55 (which is
"transactions committed" for all transactions in that interval), and do
a VACUUM FREEZE on that database. You'll need to set
pg_database.datallowconn=true beforehand.Of course, I'd copy the files somewhere else and experiment on a scratch
postmaster, running on a different port, just to be sure ...thats a big help, database is actually fairly huge, so I may have to
just go ahead and do it. I'm off to a meeting, but I'll check back
when I'm done and assuming nobody else says 'don't do that', I'll try
the fix and post back with the result.
Well, you don't need to copy all databases for the test area, just the
base/<oid> dir for template0 (along with all pg_xlog and pg_clog files,
etc, but these shouldn't be as big as all the other stuff in base/).
--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
On 2/6/07, Alvaro Herrera <alvherre@commandprompt.com> wrote:
Merlin Moncure wrote:
On 2/6/07, Alvaro Herrera <alvherre@commandprompt.com> wrote:
actually, here is some more relevant bits from the log.
Feb 6 06:31:33 mojo postgres[1088]: [1-1] :: LOG: autovacuum:
processing database "template0"
Feb 6 06:31:33 mojo postgres[1088]: [2-1] :: ERROR: could not access
status of transaction 544441911
Feb 6 06:31:33 mojo postgres[1088]: [2-2] :: DETAIL: could not open
file "pg_clog/0207": No such file or directoryLatest checkpoint's NextXID: 2162841139
2^31: 2147483648I think the relevant arithmetic here is
echo "2162841139 544441911 - p" | dc
1618399228That's a billion and a half transactions. Autovacuum uses the formula
this_whole_db = (tmp->age >
(int32) ((MaxTransactionId >> 3) * 3 - 100000));to determine whether it needs database-wide vacuum.
(MaxTransactionId >> 3) is 536870911, so the calculation is
536870911 * 3 - 100000echo "536870911 3 * 100000 - p" | dc
1610512733which looks awfully close to the number above. About 7 million
transactions must have passed since the first time the error showed up
-- does that sound likely?Well, scratch that -- what's the _current_ Xid? (not lastest
checkpoint's)
I don't know any better way to get that than this:
postgres=# insert into foo default values;
INSERT 0 1
postgres=# select xmin,xmax from foo;
xmin | xmax
------------+------
2163877346 | 0
(1 row)
merlin