getting status transaction error

Started by Merlin Moncurealmost 19 years ago11 messages
#1Merlin Moncure
mmoncure@gmail.com

around 6:30 this morning, I started getting the following messages in my log:

Feb 6 06:33:34 mojo postgres[1117]: [2-1] :: ERROR: could not access
status of transaction 544441911
Feb 6 06:34:35 mojo postgres[1128]: [2-1] :: ERROR: could not access
status of transaction 544441911
[...]

repeated roughly once a minute. I've never seen this before. this is
on postgres 8.1.1 running on fedora core 4 smp. I don't have any
contextual information yet but I'm getting ready to turn statement
logging on.

Anybody know what this is?

merlin

#2Merlin Moncure
mmoncure@gmail.com
In reply to: Merlin Moncure (#1)
Re: getting status transaction error

On 2/6/07, Merlin Moncure <mmoncure@gmail.com> wrote:

around 6:30 this morning, I started getting the following messages in my log:

Feb 6 06:33:34 mojo postgres[1117]: [2-1] :: ERROR: could not access
status of transaction 544441911
Feb 6 06:34:35 mojo postgres[1128]: [2-1] :: ERROR: could not access
status of transaction 544441911
[...]

repeated roughly once a minute. I've never seen this before. this is
on postgres 8.1.1 running on fedora core 4 smp. I don't have any
contextual information yet but I'm getting ready to turn statement
logging on.

Anybody know what this is?

[x-posting to -hackers]

actually, here is some more relevant bits from the log.
Feb 6 06:31:33 mojo postgres[1088]: [1-1] :: LOG: autovacuum:
processing database "template0"
Feb 6 06:31:33 mojo postgres[1088]: [2-1] :: ERROR: could not access
status of transaction 544441911
Feb 6 06:31:33 mojo postgres[1088]: [2-2] :: DETAIL: could not open
file "pg_clog/0207": No such file or directory

repeated ad-naseum

merlin

#3Stefan Kaltenbrunner
stefan@kaltenbrunner.cc
In reply to: Merlin Moncure (#2)
Re: [HACKERS] getting status transaction error

Merlin Moncure wrote:

On 2/6/07, Merlin Moncure <mmoncure@gmail.com> wrote:

around 6:30 this morning, I started getting the following messages in
my log:

Feb 6 06:33:34 mojo postgres[1117]: [2-1] :: ERROR: could not access
status of transaction 544441911
Feb 6 06:34:35 mojo postgres[1128]: [2-1] :: ERROR: could not access
status of transaction 544441911
[...]

repeated roughly once a minute. I've never seen this before. this is
on postgres 8.1.1 running on fedora core 4 smp. I don't have any
contextual information yet but I'm getting ready to turn statement
logging on.

Anybody know what this is?

[x-posting to -hackers]

actually, here is some more relevant bits from the log.
Feb 6 06:31:33 mojo postgres[1088]: [1-1] :: LOG: autovacuum:
processing database "template0"
Feb 6 06:31:33 mojo postgres[1088]: [2-1] :: ERROR: could not access
status of transaction 544441911
Feb 6 06:31:33 mojo postgres[1088]: [2-2] :: DETAIL: could not open
file "pg_clog/0207": No such file or directory

hmm I first thought it could have been
http://archives.postgresql.org/pgsql-committers/2006-01/msg00288.php
which affects 8.1.1 but that's not the very same error as the one
created by the above bug.

Stefan

#4Alvaro Herrera
alvherre@commandprompt.com
In reply to: Merlin Moncure (#2)
Re: getting status transaction error

Merlin Moncure wrote:

On 2/6/07, Merlin Moncure <mmoncure@gmail.com> wrote:

around 6:30 this morning, I started getting the following messages in my
log:

Feb 6 06:33:34 mojo postgres[1117]: [2-1] :: ERROR: could not access
status of transaction 544441911
Feb 6 06:34:35 mojo postgres[1128]: [2-1] :: ERROR: could not access
status of transaction 544441911
[...]

repeated roughly once a minute. I've never seen this before. this is
on postgres 8.1.1 running on fedora core 4 smp. I don't have any
contextual information yet but I'm getting ready to turn statement
logging on.

Anybody know what this is?

[x-posting to -hackers]

actually, here is some more relevant bits from the log.
Feb 6 06:31:33 mojo postgres[1088]: [1-1] :: LOG: autovacuum:
processing database "template0"
Feb 6 06:31:33 mojo postgres[1088]: [2-1] :: ERROR: could not access
status of transaction 544441911
Feb 6 06:31:33 mojo postgres[1088]: [2-2] :: DETAIL: could not open
file "pg_clog/0207": No such file or directory

I guess the problem here is that autovacuum believes that template0
needs a database-wide vacuum due to Xid wraparound getting closer. And
that database seems to have Xid 544441911 somewhere, the clog bit for
which was in the 0207 file which was deleted some time ago.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

#5Merlin Moncure
mmoncure@gmail.com
In reply to: Stefan Kaltenbrunner (#3)
Re: [HACKERS] getting status transaction error

On 2/6/07, Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote:

Merlin Moncure wrote:

On 2/6/07, Merlin Moncure <mmoncure@gmail.com> wrote:

around 6:30 this morning, I started getting the following messages in
my log:

Feb 6 06:33:34 mojo postgres[1117]: [2-1] :: ERROR: could not access
status of transaction 544441911
Feb 6 06:34:35 mojo postgres[1128]: [2-1] :: ERROR: could not access
status of transaction 544441911
[...]

repeated roughly once a minute. I've never seen this before. this is
on postgres 8.1.1 running on fedora core 4 smp. I don't have any
contextual information yet but I'm getting ready to turn statement
logging on.

Anybody know what this is?

[x-posting to -hackers]

actually, here is some more relevant bits from the log.
Feb 6 06:31:33 mojo postgres[1088]: [1-1] :: LOG: autovacuum:
processing database "template0"
Feb 6 06:31:33 mojo postgres[1088]: [2-1] :: ERROR: could not access
status of transaction 544441911
Feb 6 06:31:33 mojo postgres[1088]: [2-2] :: DETAIL: could not open
file "pg_clog/0207": No such file or directory

hmm I first thought it could have been
http://archives.postgresql.org/pgsql-committers/2006-01/msg00288.php
which affects 8.1.1 but that's not the very same error as the one
created by the above bug.

ya, it doesn't seem to match, as this seems to be repeating quite
regularly. interesting that my 'clog' files start at 06B6 and count
up. 0207 is way off the charts.

a lot of applications are hitting this database, and so far everything
seems to be running ok (i found this log msg by accident), but I am
now officially very nervous.

merlin

#6Merlin Moncure
mmoncure@gmail.com
In reply to: Alvaro Herrera (#4)
Re: getting status transaction error

On 2/6/07, Alvaro Herrera <alvherre@commandprompt.com> wrote:

actually, here is some more relevant bits from the log.
Feb 6 06:31:33 mojo postgres[1088]: [1-1] :: LOG: autovacuum:
processing database "template0"
Feb 6 06:31:33 mojo postgres[1088]: [2-1] :: ERROR: could not access
status of transaction 544441911
Feb 6 06:31:33 mojo postgres[1088]: [2-2] :: DETAIL: could not open
file "pg_clog/0207": No such file or directory

I guess the problem here is that autovacuum believes that template0
needs a database-wide vacuum due to Xid wraparound getting closer. And
that database seems to have Xid 544441911 somewhere, the clog bit for
which was in the 0207 file which was deleted some time ago.

Latest checkpoint's NextXID: 2162841139
2^31: 2147483648

is this related?
merlin

#7Alvaro Herrera
alvherre@commandprompt.com
In reply to: Merlin Moncure (#5)
Re: [HACKERS] getting status transaction error

Merlin Moncure wrote:

ya, it doesn't seem to match, as this seems to be repeating quite
regularly. interesting that my 'clog' files start at 06B6 and count
up. 0207 is way off the charts.

a lot of applications are hitting this database, and so far everything
seems to be running ok (i found this log msg by accident), but I am
now officially very nervous.

I don't think there's much cause for concern here. If my theory is
correct, this is an autovacuum bug which was fixed in 8.1.7.

What I'd do is create a 0207 clog file, fill it with 0x55 (which is
"transactions committed" for all transactions in that interval), and do
a VACUUM FREEZE on that database. You'll need to set
pg_database.datallowconn=true beforehand.

Of course, I'd copy the files somewhere else and experiment on a scratch
postmaster, running on a different port, just to be sure ...

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

#8Merlin Moncure
mmoncure@gmail.com
In reply to: Alvaro Herrera (#7)
Re: [HACKERS] getting status transaction error

n 2/6/07, Alvaro Herrera <alvherre@commandprompt.com> wrote:

Merlin Moncure wrote:

ya, it doesn't seem to match, as this seems to be repeating quite
regularly. interesting that my 'clog' files start at 06B6 and count
up. 0207 is way off the charts.

a lot of applications are hitting this database, and so far everything
seems to be running ok (i found this log msg by accident), but I am
now officially very nervous.

I don't think there's much cause for concern here. If my theory is
correct, this is an autovacuum bug which was fixed in 8.1.7.

What I'd do is create a 0207 clog file, fill it with 0x55 (which is
"transactions committed" for all transactions in that interval), and do
a VACUUM FREEZE on that database. You'll need to set
pg_database.datallowconn=true beforehand.

Of course, I'd copy the files somewhere else and experiment on a scratch
postmaster, running on a different port, just to be sure ...

thats a big help, database is actually fairly huge, so I may have to
just go ahead and do it. I'm off to a meeting, but I'll check back
when I'm done and assuming nobody else says 'don't do that', I'll try
the fix and post back with the result.

thanks all,
merlin

#9Alvaro Herrera
alvherre@commandprompt.com
In reply to: Merlin Moncure (#6)
Re: [HACKERS] getting status transaction error

Merlin Moncure wrote:

On 2/6/07, Alvaro Herrera <alvherre@commandprompt.com> wrote:

actually, here is some more relevant bits from the log.
Feb 6 06:31:33 mojo postgres[1088]: [1-1] :: LOG: autovacuum:
processing database "template0"
Feb 6 06:31:33 mojo postgres[1088]: [2-1] :: ERROR: could not access
status of transaction 544441911
Feb 6 06:31:33 mojo postgres[1088]: [2-2] :: DETAIL: could not open
file "pg_clog/0207": No such file or directory

Latest checkpoint's NextXID: 2162841139
2^31: 2147483648

I think the relevant arithmetic here is

echo "2162841139 544441911 - p" | dc
1618399228

That's a billion and a half transactions. Autovacuum uses the formula

this_whole_db = (tmp->age >
(int32) ((MaxTransactionId >> 3) * 3 - 100000));

to determine whether it needs database-wide vacuum.
(MaxTransactionId >> 3) is 536870911, so the calculation is
536870911 * 3 - 100000

echo "536870911 3 * 100000 - p" | dc
1610512733

which looks awfully close to the number above. About 7 million
transactions must have passed since the first time the error showed up
-- does that sound likely?

Well, scratch that -- what's the _current_ Xid? (not lastest
checkpoint's)

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

#10Alvaro Herrera
alvherre@commandprompt.com
In reply to: Merlin Moncure (#8)
Re: [HACKERS] getting status transaction error

Merlin Moncure wrote:

n 2/6/07, Alvaro Herrera <alvherre@commandprompt.com> wrote:

Merlin Moncure wrote:

ya, it doesn't seem to match, as this seems to be repeating quite
regularly. interesting that my 'clog' files start at 06B6 and count
up. 0207 is way off the charts.

a lot of applications are hitting this database, and so far everything
seems to be running ok (i found this log msg by accident), but I am
now officially very nervous.

I don't think there's much cause for concern here. If my theory is
correct, this is an autovacuum bug which was fixed in 8.1.7.

What I'd do is create a 0207 clog file, fill it with 0x55 (which is
"transactions committed" for all transactions in that interval), and do
a VACUUM FREEZE on that database. You'll need to set
pg_database.datallowconn=true beforehand.

Of course, I'd copy the files somewhere else and experiment on a scratch
postmaster, running on a different port, just to be sure ...

thats a big help, database is actually fairly huge, so I may have to
just go ahead and do it. I'm off to a meeting, but I'll check back
when I'm done and assuming nobody else says 'don't do that', I'll try
the fix and post back with the result.

Well, you don't need to copy all databases for the test area, just the
base/<oid> dir for template0 (along with all pg_xlog and pg_clog files,
etc, but these shouldn't be as big as all the other stuff in base/).

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

#11Merlin Moncure
mmoncure@gmail.com
In reply to: Alvaro Herrera (#9)
Re: [HACKERS] getting status transaction error

On 2/6/07, Alvaro Herrera <alvherre@commandprompt.com> wrote:

Merlin Moncure wrote:

On 2/6/07, Alvaro Herrera <alvherre@commandprompt.com> wrote:

actually, here is some more relevant bits from the log.
Feb 6 06:31:33 mojo postgres[1088]: [1-1] :: LOG: autovacuum:
processing database "template0"
Feb 6 06:31:33 mojo postgres[1088]: [2-1] :: ERROR: could not access
status of transaction 544441911
Feb 6 06:31:33 mojo postgres[1088]: [2-2] :: DETAIL: could not open
file "pg_clog/0207": No such file or directory

Latest checkpoint's NextXID: 2162841139
2^31: 2147483648

I think the relevant arithmetic here is

echo "2162841139 544441911 - p" | dc
1618399228

That's a billion and a half transactions. Autovacuum uses the formula

this_whole_db = (tmp->age >
(int32) ((MaxTransactionId >> 3) * 3 - 100000));

to determine whether it needs database-wide vacuum.
(MaxTransactionId >> 3) is 536870911, so the calculation is
536870911 * 3 - 100000

echo "536870911 3 * 100000 - p" | dc
1610512733

which looks awfully close to the number above. About 7 million
transactions must have passed since the first time the error showed up
-- does that sound likely?

Well, scratch that -- what's the _current_ Xid? (not lastest
checkpoint's)

I don't know any better way to get that than this:
postgres=# insert into foo default values;
INSERT 0 1
postgres=# select xmin,xmax from foo;
xmin | xmax
------------+------
2163877346 | 0
(1 row)

merlin