FlushRelationBuffers error

Started by Gaetano Mendolaover 21 years ago5 messages
#1Gaetano Mendola
mendola@bigfoot.com

Hi all,
I'm running postgres 7.4.5 on a linux box, this morning I got this error on my logs:

WARNING: FlushRelationBuffers("exp_provider", 1836): block 1460 is referenced (private 0, global 1)
ERROR: FlushRelationBuffers returned -2
DEBUG: AbortCurrentTransaction
PANIC: cannot abort transaction 354676201, it was already committed

after the recovery:

ERROR: could not access status of transaction 352975274
DEBUG: AbortCurrentTransaction

this messages for 5 hours

I had my verbosity equal to terse ( I run the server with debug2 level ) so I didn't see the
exactly reason for this, after putting verbosity to "verbose" I got the entire message:

ERROR: 58P01: could not access status of transaction 352975274
DETAIL: could not open file "/var/lib/pgsql/data/pg_clog/0150": No such file or directory
LOCATION: SlruReportIOError, slru.c:609
DEBUG: 00000: AbortCurrentTransaction
LOCATION: PostgresMain, postgres.c:2721

In the pg_clog directory I had only the file 0152 !

I had to create a 8k file with zeroes and I discover the offset:

ERROR: XX000: could not access status of transaction 352975274
DETAIL: could not read from file "/var/lib/pgsql/data/pg_clog/0150" at offset 155648: Success
LOCATION: SlruReportIOError, slru.c:630
DEBUG: 00000: AbortCurrentTransaction
LOCATION: PostgresMain, postgres.c:2721

After creating that file till to cover that offset the problem seems be fixed.

Info for hackers: exp_provider is an index and during that message a reindex was in place.

Some questions:
What about the 0151 file?
Don't you think that even with verbosity terse the message about the file missing shall appear ?
Why emit the offset only if the file was found ?

I have to thank Neil Conway that was helping me on IRC about this error.

If you need further infos, please let me know.

Regards
Gaetano Mendola

#2Jan Wieck
JanWieck@Yahoo.com
In reply to: Gaetano Mendola (#1)
Re: FlushRelationBuffers error

Any chance for bad memory?

Jan

On 9/30/2004 6:16 AM, Gaetano Mendola wrote:

Hi all,
I'm running postgres 7.4.5 on a linux box, this morning I got this error on my logs:

WARNING: FlushRelationBuffers("exp_provider", 1836): block 1460 is referenced (private 0, global 1)
ERROR: FlushRelationBuffers returned -2
DEBUG: AbortCurrentTransaction
PANIC: cannot abort transaction 354676201, it was already committed

after the recovery:

ERROR: could not access status of transaction 352975274
DEBUG: AbortCurrentTransaction

this messages for 5 hours

I had my verbosity equal to terse ( I run the server with debug2 level ) so I didn't see the
exactly reason for this, after putting verbosity to "verbose" I got the entire message:

ERROR: 58P01: could not access status of transaction 352975274
DETAIL: could not open file "/var/lib/pgsql/data/pg_clog/0150": No such file or directory
LOCATION: SlruReportIOError, slru.c:609
DEBUG: 00000: AbortCurrentTransaction
LOCATION: PostgresMain, postgres.c:2721

In the pg_clog directory I had only the file 0152 !

I had to create a 8k file with zeroes and I discover the offset:

ERROR: XX000: could not access status of transaction 352975274
DETAIL: could not read from file "/var/lib/pgsql/data/pg_clog/0150" at offset 155648: Success
LOCATION: SlruReportIOError, slru.c:630
DEBUG: 00000: AbortCurrentTransaction
LOCATION: PostgresMain, postgres.c:2721

After creating that file till to cover that offset the problem seems be fixed.

Info for hackers: exp_provider is an index and during that message a reindex was in place.

Some questions:
What about the 0151 file?
Don't you think that even with verbosity terse the message about the file missing shall appear ?
Why emit the offset only if the file was found ?

I have to thank Neil Conway that was helping me on IRC about this error.

If you need further infos, please let me know.

Regards
Gaetano Mendola

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

#3Gaetano Mendola
mendola@bigfoot.com
In reply to: Jan Wieck (#2)
Re: FlushRelationBuffers error

Jan Wieck wrote:

Any chance for bad memory?

I'll say near 0. However who never knows ? Now the server is again up and
running without glitches.

I suspect a race condition somewhere for the reindex operation.

I had with the engine 7.3 ( see in the archives ) a duplicate error during
reindexes at least one each month, for instance the server was another one,
and at that time I solved it not reindexing the DB daily ( so I decreased the
chances ).

With the 7.4 is the first time, since November 2003, that I see this error
( and for coincidence during a reindex too ) so I suspect that the race condition
is still there but with less chance to pops up.

Is it so dangerous teach the postmaster to solve this kind of problems without
a direct user intervention ?

Regards
Gaetano Mendola

Show quoted text

On 9/30/2004 6:16 AM, Gaetano Mendola wrote:

Hi all,
I'm running postgres 7.4.5 on a linux box, this morning I got this
error on my logs:

WARNING: FlushRelationBuffers("exp_provider", 1836): block 1460 is
referenced (private 0, global 1)
ERROR: FlushRelationBuffers returned -2
DEBUG: AbortCurrentTransaction
PANIC: cannot abort transaction 354676201, it was already committed

after the recovery:

ERROR: could not access status of transaction 352975274
DEBUG: AbortCurrentTransaction

this messages for 5 hours

I had my verbosity equal to terse ( I run the server with debug2 level
) so I didn't see the
exactly reason for this, after putting verbosity to "verbose" I got
the entire message:

ERROR: 58P01: could not access status of transaction 352975274
DETAIL: could not open file "/var/lib/pgsql/data/pg_clog/0150": No
such file or directory
LOCATION: SlruReportIOError, slru.c:609
DEBUG: 00000: AbortCurrentTransaction
LOCATION: PostgresMain, postgres.c:2721

In the pg_clog directory I had only the file 0152 !

I had to create a 8k file with zeroes and I discover the offset:

ERROR: XX000: could not access status of transaction 352975274
DETAIL: could not read from file "/var/lib/pgsql/data/pg_clog/0150"
at offset 155648: Success
LOCATION: SlruReportIOError, slru.c:630
DEBUG: 00000: AbortCurrentTransaction
LOCATION: PostgresMain, postgres.c:2721

After creating that file till to cover that offset the problem seems
be fixed.

Info for hackers: exp_provider is an index and during that message a
reindex was in place.

Some questions:
What about the 0151 file?
Don't you think that even with verbosity terse the message about the
file missing shall appear ?
Why emit the offset only if the file was found ?

I have to thank Neil Conway that was helping me on IRC about this error.

If you need further infos, please let me know.

Regards
Gaetano Mendola

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gaetano Mendola (#1)
Re: FlushRelationBuffers error

Gaetano Mendola <mendola@bigfoot.com> writes:

I'm running postgres 7.4.5 on a linux box, this morning I got this error on my logs:

WARNING: FlushRelationBuffers("exp_provider", 1836): block 1460 is referenced (private 0, global 1)
ERROR: FlushRelationBuffers returned -2

Info for hackers: exp_provider is an index and during that message a reindex was in place.

AFAICS, the only place that FlushRelationBuffers is called on an index
with a nonzero second argument is in btvacuumcleanup, ie, shortening an
index at the conclusion of VACUUM FULL. REINDEX wouldn't have anything
to do with it.

The message appears to indicate that some other backend was holding a
pin on a buffer containing one of the blocks of the index. Which should
definitely not happen, because VACUUM has an exclusive lock at the table
level, and so no other backend should be touching the index at all.

One thing I am wondering about is that VACUUM FULL relies on an
exclusive lock on a table to prevent outside access to the table's
indexes as well as the table itself. It might be better to spend a few
more cycles and obtain locks on the indexes too. I don't think there
are any situations where the main backend could access an index without
having any lock on the parent table, but there are some contrib
functions like pgstattuple that I'm not so sure about. Are you using
any contrib or locally written C functions in your database?

regards, tom lane

#5Gaetano Mendola
mendola@bigfoot.com
In reply to: Tom Lane (#4)
Re: FlushRelationBuffers error

Tom Lane wrote:

Gaetano Mendola <mendola@bigfoot.com> writes:

I'm running postgres 7.4.5 on a linux box, this morning I got this error on my logs:

WARNING: FlushRelationBuffers("exp_provider", 1836): block 1460 is referenced (private 0, global 1)
ERROR: FlushRelationBuffers returned -2

Info for hackers: exp_provider is an index and during that message a reindex was in place.

AFAICS, the only place that FlushRelationBuffers is called on an index
with a nonzero second argument is in btvacuumcleanup, ie, shortening an
index at the conclusion of VACUUM FULL. REINDEX wouldn't have anything
to do with it.

The message appears to indicate that some other backend was holding a
pin on a buffer containing one of the blocks of the index. Which should
definitely not happen, because VACUUM has an exclusive lock at the table
level, and so no other backend should be touching the index at all.

One thing I am wondering about is that VACUUM FULL relies on an
exclusive lock on a table to prevent outside access to the table's
indexes as well as the table itself. It might be better to spend a few
more cycles and obtain locks on the indexes too. I don't think there
are any situations where the main backend could access an index without
having any lock on the parent table, but there are some contrib
functions like pgstattuple that I'm not so sure about. Are you using
any contrib or locally written C functions in your database?

Unfortunately no contrib and even no C functions at all.

Looking better at what was happening this morning indeed was a vacuum
full, I have in the morning a vacuum followed by some reindex, and the
failure was during the vacuum:

INFO: vacuuming "public.ua_user_data_exp"
INFO: "ua_user_data_exp": found 15232 removable, 33609 nonremovable row
versions in 6073 pages

[SNIP]

CPU 0.07s/0.05u sec elapsed 0.24 sec.
INFO: index "exp_mac_address_normal" now contains 33609 row versions in
644 pages
DETAIL: 3275 index row versions were removed.
1 index pages have been deleted, 1 are currently reusable.
CPU 0.00s/0.04u sec elapsed 0.25 sec.
WARNING: FlushRelationBuffers("exp_provider", 1836): block 1460 is
referenced (private 0, global 1)
vacuumdb: vacuuming of database "empdb" failed: ERROR:
FlushRelationBuffers returned -2
PANIC: cannot abort transaction 354676201, it was already committed
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

Regards
Gaetano Mendola