BUG #7648: Momentary index corruption while in hot standby
The following bug has been logged on the website:
Bug reference: 7648
Logged by: Daniel Farina
Email address: daniel@heroku.com
PostgreSQL version: 9.0.9
Operating system: Ubuntu 10.04
Description:
At more or less one point in time (within the same second, at minimum), a
couple of errors that would nominally suggest corruption were raised on a
hot standby while performing query access:
PGError: ERROR: could not read block 556642 in file "base/16385/2904143.4":
read only 0 of 8192
Immediately thereafter, no occurrences of the error resurfaced and any
attempts to reproduce the issue with an identical query were met with
failure.
On investigation, this relfile is a fairly common beast: an integer btree
index being used for row identification (e.g. serial).
This is reading from a 9.0.8 Postgres.
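For context on how the path and block number in that error fit together, here is a minimal sketch (the helper name and return shape are illustrative, not from the report) assuming PostgreSQL's default 8 KB block size and 1 GB segment files:

```python
# Decode a PostgreSQL relation file path such as "base/16385/2904143.4"
# and locate a block within it. Assumes the default build settings:
# an 8 KB block size (BLCKSZ) and 1 GB segment files (RELSEG_SIZE).

BLOCK_SIZE = 8192
BLOCKS_PER_SEGMENT = (1 << 30) // BLOCK_SIZE  # 131072 blocks per 1 GB segment

def locate_block(path, block):
    """Return (database_oid, relfilenode, segment, byte_offset_in_segment)."""
    _, db_oid, filename = path.split("/")
    relfilenode, _, seg = filename.partition(".")
    segment = int(seg) if seg else 0  # the first segment has no suffix
    # Block numbers are relation-wide; each 1 GB segment file holds
    # BLOCKS_PER_SEGMENT of them, so find where this block should live.
    expected_segment = block // BLOCKS_PER_SEGMENT
    byte_offset = (block % BLOCKS_PER_SEGMENT) * BLOCK_SIZE
    assert expected_segment == segment, "block is not in this segment file"
    return int(db_oid), int(relfilenode), segment, byte_offset

print(locate_block("base/16385/2904143.4", 556642))
# -> (16385, 2904143, 4, 265043968): block 556642 is block 32354 of
# segment file 2904143.4, i.e. the read was roughly 4.25 GB into the index.
```

Under those defaults the numbers in the error message are self-consistent; on a live system the relfilenode can be mapped back to a relation name by querying `pg_class` on its `relfilenode` column.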
On Fri, 2012-11-09 at 22:49 +0000, daniel@heroku.com wrote:
> PGError: ERROR: could not read block 556642 in file "base/16385/2904143.4":
> read only 0 of 8192
Does that seem wildly off to you, or a little off? Do you think that
block may have existed in the past, but was truncated by a VACUUM or
something?
Just to clarify, that file belongs to the index in question, right? It's
not the heap pointed to by the index?
> Immediately thereafter, no occurrences of the error resurfaced and any
> attempts to reproduce the issue with an identical query were met with
> failure.
Or a lack of failure, I presume.
> This is reading from a 9.0.8 Postgres.
Any indication whether it's present on other versions or does it appear
to be isolated to 9.0.X?
Regards,
Jeff Davis
On Fri, Nov 9, 2012 at 3:35 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Fri, 2012-11-09 at 22:49 +0000, daniel@heroku.com wrote:
>> PGError: ERROR: could not read block 556642 in file "base/16385/2904143.4":
>> read only 0 of 8192
>
> Does that seem wildly off to you, or a little off? Do you think that
> block may have existed in the past, but was truncated by a VACUUM or
> something?
A little off. I think that Tom Lane posted a pretty good mechanism in
-hackers: http://archives.postgresql.org/pgsql-hackers/2012-11/msg00409.php
> Just to clarify, that file belongs to the index in question, right? It's
> not the heap pointed to by the index?
Yes.
>> Immediately thereafter, no occurrences of the error resurfaced and any
>> attempts to reproduce the issue with an identical query were met with
>> failure.
>
> Or a lack of failure, I presume.
Yup.
>> This is reading from a 9.0.8 Postgres.
>
> Any indication whether it's present on other versions or does it appear
> to be isolated to 9.0.X?
This is the only known occurrence to me, ever, but given its
incredibly ephemeral nature it is probably glossed over most of the
time, so I can't say "it's the only time it's ever happened".
--
fdr
On 2012-11-09 15:57:06 -0800, Daniel Farina wrote:
> This is the only known occurrence to me, ever, but given its
> incredibly ephemeral nature it is probably glossed over most of the
> time, so I can't say "it's the only time it's ever happened".
I wish we had some way to easily discern such "something is wrong" errors
from other ERRORs that are expected during normal operation.
Greetings,
Andres Freund
On Fri, Nov 9, 2012 at 4:16 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2012-11-09 15:57:06 -0800, Daniel Farina wrote:
>> This is the only known occurrence to me, ever, but given its
>> incredibly ephemeral nature it is probably glossed over most of the
>> time, so I can't say "it's the only time it's ever happened".
>
> I wish we had some way to easily discern such "something is wrong" errors
> from other ERRORs that are expected during normal operation.
Me too. Database clients finding these unambiguously platform-level
problems and being relied upon to report them to receive treatment is
a long-standing embarrassment to me. However, I've been way too
swamped to even start thinking of how one would disentangle error
reporting suitable for physical issues from logical issues. Clearly
it is *possible* (most error sites fall fairly clearly into one
category or the other), but the carefully considered upheaval to the
logging system is beyond my mind's reach for now.
It's a weakness we've kicked around the office here a few times with
much regret that nobody has the necessary remaining attention to
attack it.
--
fdr
On 10 November 2012 00:29, Daniel Farina <daniel@heroku.com> wrote:
> Me too. Database clients finding these unambiguously platform-level
> problems and being relied upon to report them to receive treatment is
> a long-standing embarrassment to me. However, I've been way too
> swamped to even start thinking of how one would disentangle error
> reporting suitable for physical issues from logical issues.
I complained about this a few months ago (and a few months before
that), and the upshot was that we kicked around a few ideas and were
able to outline a useful API [1]. The idea here was to derive what I
called magnitude from SQLSTATE. In other words, we'd represent how
routine or non-routine a particular error message was (the "wake me up
in the middle of the night" factor). Severity levels don't and cannot
capture this, since, for example, a FATAL error occurs in the event of
failed authentication, whereas ERRORs (technically a lesser severity)
may occur in far more serious situations that a Postgres DBA can
reasonably hope never to see, such as problems that indicate data
corruption.

[1] http://archives.postgresql.org/message-id/CAEYLb_XdtyJE6WtUy4TGdjUQ6eUtjJp0cTfLaDP9qwp8gOtTdg@mail.gmail.com
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
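To make the magnitude idea above concrete, here is a hypothetical sketch of deriving it from SQLSTATE. The classification, constants, and `magnitude` function are illustrative assumptions, not a proposed patch; the SQLSTATE codes themselves (XX001 data_corrupted, XX002 index_corrupted, 28P01 invalid_password, and so on) are real PostgreSQL error codes.

```python
# Hypothetical sketch: derive an alerting "magnitude" from SQLSTATE,
# independently of the ERROR/FATAL severity level. The SQLSTATE codes
# below are real PostgreSQL codes; the mapping itself is illustrative.

ROUTINE = 0      # expected during normal operation (bad password, etc.)
NON_ROUTINE = 1  # unusual; worth a look
WAKE_ME_UP = 2   # internal failure or corruption; page someone

# Class XX is "Internal Error": XX000 internal_error,
# XX001 data_corrupted, XX002 index_corrupted.
WAKE_ME_UP_CLASSES = {"XX"}
# Examples of routine classes: 22 data exception, 23 integrity
# constraint violation, 28 invalid authorization specification,
# 40 transaction rollback (serialization failures, deadlocks).
ROUTINE_CLASSES = {"22", "23", "28", "40"}

def magnitude(sqlstate):
    """Map a five-character SQLSTATE to a wake-me-up magnitude."""
    klass = sqlstate[:2]  # the first two characters are the error class
    if klass in WAKE_ME_UP_CLASSES:
        return WAKE_ME_UP
    if klass in ROUTINE_CLASSES:
        return ROUTINE
    return NON_ROUTINE

# A FATAL failed login (28P01) is routine, while a mere ERROR that
# reports index corruption (XX002) is exactly the 3 a.m. case.
print(magnitude("28P01"), magnitude("XX002"))  # -> 0 2
```

A real version would presumably classify at the errcode level rather than the class level, but even this coarse split separates the "could not read block" case (class XX) from ordinary query failures.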