checkpoint write errors

Started by CS DBA · over 9 years ago · 9 messages · general
#1CS DBA
cs_dba@consistentstate.com

Hi all;

we're seeing the below errors over and over in the logs of one of our
postgres databases. Version 8.4.22

Anyone have any thoughts on correcting/debugging it?

Maybe I need to run a REINDEX on whatever table equates to
"base/1029860192/1029863651"? If so how do I determine the db and table
for "base/1029860192/1029863651"?

LOG: checkpoint starting: time
ERROR: xlog flush request 2571/9C141530 is not satisfied --- flushed
only to 2570/DE61C290
CONTEXT: writing block 4874 of relation base/1029860192/1029863651
WARNING: could not write block 4874 of base/1029860192/1029863651
DETAIL: Multiple failures --- write error might be permanent.

Thanks in advance

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: CS DBA (#1)
Re: checkpoint write errors

CS DBA <cs_dba@consistentstate.com> writes:

> we're seeing the below errors over and over in the logs of one of our
> postgres databases. Version 8.4.22

[ you really oughta get off 8.4, but you knew that right? ]

> Anyone have any thoughts on correcting/debugging it?

> ERROR: xlog flush request 2571/9C141530 is not satisfied --- flushed
> only to 2570/DE61C290
> CONTEXT: writing block 4874 of relation base/1029860192/1029863651
> WARNING: could not write block 4874 of base/1029860192/1029863651
> DETAIL: Multiple failures --- write error might be permanent.

Evidently the LSN in this block is wrong. If it's an index, your idea of
REINDEX is probably the best solution. If it's a heap block, you could
probably make the problem go away by performing an update that changes any
tuple in this block. It doesn't even need to be a committed update; that
is, you could update or delete any row in that block, then roll back the
transaction, and it'd still be fixed.
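A sketch of such a throwaway update, targeting the damaged block by ctid (the table and column names here are assumptions; substitute whatever relation base/1029860192/1029863651 maps to, and pick a tuple offset that actually exists in block 4874):

```sql
BEGIN;
-- ctid '(4874,1)' addresses the first tuple in block 4874
UPDATE mytable SET mycol = mycol WHERE ctid = '(4874,1)';
ROLLBACK;  -- aborting is fine; the page is dirtied either way and gets a fresh LSN
```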

Try to avoid shutting down the DB until you've fixed the problem,
else you're looking at replay from whenever the last successful
checkpoint was :-(

> Maybe I need to run a REINDEX on whatever table equates to
> "base/1029860192/1029863651"? If so how do I determine the db and table
> for "base/1029860192/1029863651"?

1029860192 is the OID of the database's pg_database row.
1029863651 is the relfilenode in the relation's pg_class row.
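Concretely, a query pair along these lines identifies both (note the relation is matched on relfilenode, which usually but not always equals the pg_class OID):

```sql
-- From any database, find the database name:
SELECT datname FROM pg_database WHERE oid = 1029860192;
-- Then, connected to that database, find the relation:
SELECT relname, relkind FROM pg_class WHERE relfilenode = 1029863651;
```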

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#3CS DBA
cs_dba@consistentstate.com
In reply to: Tom Lane (#2)
Re: checkpoint write errors

Thanks, the REINDEX fixed it. It's a client of ours and we're pushing to
get them to move to 9.5

On 10/21/2016 06:33 PM, Tom Lane wrote:

> CS DBA <cs_dba@consistentstate.com> writes:
>
>> we're seeing the below errors over and over in the logs of one of our
>> postgres databases. Version 8.4.22
>
> [ you really oughta get off 8.4, but you knew that right? ]
>
>> Anyone have any thoughts on correcting/debugging it?
>> ERROR: xlog flush request 2571/9C141530 is not satisfied --- flushed
>> only to 2570/DE61C290
>> CONTEXT: writing block 4874 of relation base/1029860192/1029863651
>> WARNING: could not write block 4874 of base/1029860192/1029863651
>> DETAIL: Multiple failures --- write error might be permanent.
>
> Evidently the LSN in this block is wrong. If it's an index, your idea of
> REINDEX is probably the best solution. If it's a heap block, you could
> probably make the problem go away by performing an update that changes any
> tuple in this block. It doesn't even need to be a committed update; that
> is, you could update or delete any row in that block, then roll back the
> transaction, and it'd still be fixed.
>
> Try to avoid shutting down the DB until you've fixed the problem,
> else you're looking at replay from whenever the last successful
> checkpoint was :-(
>
>> Maybe I need to run a REINDEX on whatever table equates to
>> "base/1029860192/1029863651"? If so how do I determine the db and table
>> for "base/1029860192/1029863651"?
>
> 1029860192 is the OID of the database's pg_database row.
> 1029863651 is the relfilenode in the relation's pg_class row.
>
> regards, tom lane


#4CS DBA
cs_dba@consistentstate.com
In reply to: CS DBA (#3)
Re: checkpoint write errors ( getting worse )

So I ran REINDEX on all the db's and the errors went away for a bit. Now
I'm seeing this:

Log entries like this: FATAL: could not read block 0 of relation
base/1311892067/2687: read only 0 of 8192 bytes

So I checked which db it is:

$ psql -h localhost
psql (8.4.20)
Type "help" for help.

postgres=# select datname from pg_database where oid = 1311892067;
datname
---------
access_one
(1 row)

But when I attempt to connect to the db so I can query for the table in
pg_class I get this:

postgres=# \c access_one
FATAL: could not read block 0 of relation base/1311892067/2687: read
only 0 of 8192 bytes

Thoughts?


On 10/22/2016 07:52 AM, CS DBA wrote:

> Thanks, the REINDEX fixed it. It's a client of ours and we're pushing
> to get them to move to 9.5
>
> On 10/21/2016 06:33 PM, Tom Lane wrote:
>
>> CS DBA <cs_dba@consistentstate.com> writes:
>>
>>> we're seeing the below errors over and over in the logs of one of our
>>> postgres databases. Version 8.4.22
>>
>> [ you really oughta get off 8.4, but you knew that right? ]
>>
>>> Anyone have any thoughts on correcting/debugging it?
>>> ERROR: xlog flush request 2571/9C141530 is not satisfied --- flushed
>>> only to 2570/DE61C290
>>> CONTEXT: writing block 4874 of relation base/1029860192/1029863651
>>> WARNING: could not write block 4874 of base/1029860192/1029863651
>>> DETAIL: Multiple failures --- write error might be permanent.
>>
>> Evidently the LSN in this block is wrong. If it's an index, your idea of
>> REINDEX is probably the best solution. If it's a heap block, you could
>> probably make the problem go away by performing an update that changes
>> any tuple in this block. It doesn't even need to be a committed update;
>> that is, you could update or delete any row in that block, then roll
>> back the transaction, and it'd still be fixed.
>>
>> Try to avoid shutting down the DB until you've fixed the problem,
>> else you're looking at replay from whenever the last successful
>> checkpoint was :-(
>>
>>> Maybe I need to run a REINDEX on whatever table equates to
>>> "base/1029860192/1029863651"? If so how do I determine the db and
>>> table for "base/1029860192/1029863651"?
>>
>> 1029860192 is the OID of the database's pg_database row.
>> 1029863651 is the relfilenode in the relation's pg_class row.
>>
>> regards, tom lane

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: CS DBA (#4)
Re: checkpoint write errors ( getting worse )

CS DBA <cs_dba@consistentstate.com> writes:

> So I ran REINDEX on all the db's and the errors went away for a bit. Now
> I'm seeing this:
>
> Log entries like this: FATAL: could not read block 0 of relation
> base/1311892067/2687: read only 0 of 8192 bytes

You have a problem there, because:

regression=# select 2687::regclass;
regclass
----------------------
pg_opclass_oid_index
(1 row)

which is a pretty critical index.

You might be able to fix this by starting a single-user backend with -P
(--ignore-system-indexes) and using it to REINDEX that index.
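A rough sketch of that procedure (the data directory path here is an assumption; stop the regular postmaster first, then exit the stand-alone backend with Ctrl-D when done):

```
$ postgres --single -P -D /var/lib/pgsql/data access_one

backend> REINDEX INDEX pg_opclass_oid_index;
```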

On the whole, though, it's starting to sound like that system has
got major problems. You'd be well advised to focus all your efforts
on getting a valid dump, not bringing it back into production.

regards, tom lane


#6CS DBA
cs_dba@consistentstate.com
In reply to: Tom Lane (#5)
Re: checkpoint write errors ( getting worse )

would a dump/restore correct these issues?

On 10/22/2016 05:59 PM, Tom Lane wrote:

> CS DBA <cs_dba@consistentstate.com> writes:
>
>> So I ran REINDEX on all the db's and the errors went away for a bit. Now
>> I'm seeing this:
>> Log entries like this: FATAL: could not read block 0 of relation
>> base/1311892067/2687: read only 0 of 8192 bytes
>
> You have a problem there, because:
>
> regression=# select 2687::regclass;
>        regclass
> ----------------------
>  pg_opclass_oid_index
> (1 row)
>
> which is a pretty critical index.
>
> You might be able to fix this by starting a single-user backend with -P
> (--ignore-system-indexes) and using it to REINDEX that index.
>
> On the whole, though, it's starting to sound like that system has
> got major problems. You'd be well advised to focus all your efforts
> on getting a valid dump, not bringing it back into production.
>
> regards, tom lane


#7CS DBA
cs_dba@consistentstate.com
In reply to: Tom Lane (#5)
Re: checkpoint write errors ( getting worse )

also, any thoughts on what could be causing these issues?

On 10/22/2016 05:59 PM, Tom Lane wrote:

> CS DBA <cs_dba@consistentstate.com> writes:
>
>> So I ran REINDEX on all the db's and the errors went away for a bit. Now
>> I'm seeing this:
>> Log entries like this: FATAL: could not read block 0 of relation
>> base/1311892067/2687: read only 0 of 8192 bytes
>
> You have a problem there, because:
>
> regression=# select 2687::regclass;
>        regclass
> ----------------------
>  pg_opclass_oid_index
> (1 row)
>
> which is a pretty critical index.
>
> You might be able to fix this by starting a single-user backend with -P
> (--ignore-system-indexes) and using it to REINDEX that index.
>
> On the whole, though, it's starting to sound like that system has
> got major problems. You'd be well advised to focus all your efforts
> on getting a valid dump, not bringing it back into production.
>
> regards, tom lane


#8Michael Paquier
michael@paquier.xyz
In reply to: CS DBA (#6)
Re: checkpoint write errors ( getting worse )

On Sun, Oct 23, 2016 at 12:45 PM, CS DBA <cs_dba@consistentstate.com> wrote:

> would a dump/restore correct these issues?

Not directly, but it would give you a logical representation of your
data, or a good starting image that you could deploy on a server with
fewer problems. You seem to be facing serious hardware issues here.
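Something along these lines would capture such a logical copy (database name taken from the thread; the dump file path and the assumption that you restore onto a separate, healthy server are mine):

```
$ pg_dump -Fc -f access_one.dump access_one    # on the ailing server
$ createdb access_one                          # on the healthy server
$ pg_restore -d access_one access_one.dump
```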
--
Michael


#9CS DBA
cs_dba@consistentstate.com
In reply to: Michael Paquier (#8)
Re: checkpoint write errors ( getting worse )

Understood, thanks. This is a new server fired up for our client by
Rackspace.

Not real impressed so far; for the first several days we had major
performance issues even though the new HW had more memory, more/faster
CPUs, and faster IO - turned out Rackspace had turned on CPU throttling,
limiting the server to no more than 2 CPUs.

On 10/23/2016 10:53 PM, Michael Paquier wrote:

> On Sun, Oct 23, 2016 at 12:45 PM, CS DBA <cs_dba@consistentstate.com> wrote:
>
>> would a dump/restore correct these issues?
>
> Not directly, but it would give you a logical representation of your
> data, or a good starting image that you could deploy on a server with
> fewer problems. You seem to be facing serious hardware issues here.
