PD_ALL_VISIBLE flag was incorrectly set happend during repeatable vacuum

Started by Maxim Bogukover 15 years ago32 messageshackers
Jump to latest
#1Maxim Boguk
maxim.boguk@gmail.com

Hi there,

While I trying create reproducible test case for BUG #5798 I
encountered very strange effect on two of my servers (both servers
have same hardware platform/OS (freebsd 7.2) and PostgreSQL 8.4.4).

Very simple test table created as:
CREATE TABLE test (id integer);
INSERT INTO test select generate_series(0,10000);

And I trying repeateble vacuum of that table with script:
perl -e "foreach (1..100000) {system \"psql -d test -h -c 'vacuum test'\";}"

And once per like an minute (really random intervals can be 5 minutes
without problems can be 3 vacuum in row show same error) I getting
next errors:
WARNING: PD_ALL_VISIBLE flag was incorrectly set in relation "test" page 1
...
WARNING: PD_ALL_VISIBLE flag was incorrectly set in relation "test"
page 30 for all pages of the relation.

After that errors script continue vacuums and after some time (0-1000
vacuum loops) I getting the same errors again and so on.
During test run no one touched the test table except my script.

That happened on both of production servers (slony master and slony
slave). However, I can not reproduce this error on other databases
under my support. Complete server restart doesn't make problem
dissapear.

Any ideas what happens there?

--
Maxim Boguk
Senior Postgresql DBA.

Skype: maxim.boguk
Jabber: maxim.boguk@gmail.com

LinkedIn profile: http://nz.linkedin.com/in/maximboguk
МойКруг: http://mboguk.moikrug.ru/

Сила солому ломит, но не все в нашей жизни - солома, да и сила далеко не все.

#2Fujii Masao
masao.fujii@gmail.com
In reply to: Maxim Boguk (#1)
Re: PD_ALL_VISIBLE flag was incorrectly set happend during repeatable vacuum

On Sat, Dec 25, 2010 at 2:09 PM, Maxim Boguk <maxim.boguk@gmail.com> wrote:

While I trying create reproducible test case for BUG #5798 I
encountered very strange effect on two of my servers (both servers
have same hardware platform/OS (freebsd 7.2) and PostgreSQL 8.4.4).

Very simple test table created as:
CREATE TABLE test (id integer);
INSERT INTO test select generate_series(0,10000);

And I trying repeateble vacuum of that table with script:
 perl -e "foreach (1..100000) {system \"psql -d test -h -c 'vacuum test'\";}"

And once per like an minute (really random intervals can be 5 minutes
without problems can be 3 vacuum in row show same error)  I getting
next errors:
WARNING:  PD_ALL_VISIBLE flag was incorrectly set in relation "test" page 1
...
WARNING:  PD_ALL_VISIBLE flag was incorrectly set in relation "test"
page 30 for all pages of the relation.

The same problem happened on my customer's PostgreSQL 8.4.2
environment. Here are the error messages:

Dec 13 13:11:59 test postgres[28249]: [2-1] WARNING: 01000:
PD_ALL_VISIBLE flag was incorrectly set in relation "pg_statistic"
page 38
Dec 13 13:11:59 test postgres[28249]: [2-2] LOCATION: lazy_scan_heap,
vacuumlazy.c:676
Dec 13 13:11:59 test postgres[28249]: [3-1] WARNING: 01000:
PD_ALL_VISIBLE flag was incorrectly set in relation "pg_statistic"
page 61
Dec 13 13:11:59 test postgres[28249]: [3-2] LOCATION: lazy_scan_heap,
vacuumlazy.c:676
Dec 13 13:11:59 test postgres[28249]: [4-1] WARNING: 01000:
PD_ALL_VISIBLE flag was incorrectly set in relation "pg_statistic"
page 88
Dec 13 13:11:59 test postgres[28249]: [4-2] LOCATION: lazy_scan_heap,
vacuumlazy.c:676
Dec 13 13:11:59 test postgres[28249]: [5-1] WARNING: 01000:
PD_ALL_VISIBLE flag was incorrectly set in relation "pg_statistic"
page 96
Dec 13 13:11:59 test postgres[28249]: [5-2] LOCATION: lazy_scan_heap,
vacuumlazy.c:676
Dec 13 13:11:59 test postgres[28249]: [6-1] WARNING: 01000:
PD_ALL_VISIBLE flag was incorrectly set in relation "pg_statistic"
page 98
Dec 13 13:11:59 test postgres[28249]: [6-2] LOCATION: lazy_scan_heap,
vacuumlazy.c:676
Dec 13 13:11:59 test postgres[28249]: [7-1] WARNING: 01000:
PD_ALL_VISIBLE flag was incorrectly set in relation "pg_statistic"
page 107
Dec 13 13:11:59 test postgres[28249]: [7-2] LOCATION: lazy_scan_heap,
vacuumlazy.c:676
...
Dec 13 13:15:59 test postgres[9640]: [2-1] WARNING: 01000:
PD_ALL_VISIBLE flag was incorrectly set in relation "pg_statistic"
page 38
Dec 13 13:15:59 test postgres[9640]: [2-2] LOCATION: lazy_scan_heap,
vacuumlazy.c:676
Dec 13 13:15:59 test postgres[9640]: [3-1] WARNING: 01000:
PD_ALL_VISIBLE flag was incorrectly set in relation "pg_statistic"
page 61
Dec 13 13:15:59 test postgres[9640]: [3-2] LOCATION: lazy_scan_heap,
vacuumlazy.c:676
Dec 13 13:15:59 test postgres[9640]: [4-1] WARNING: 01000:
PD_ALL_VISIBLE flag was incorrectly set in relation "pg_statistic"
page 88
Dec 13 13:15:59 test postgres[9640]: [4-2] LOCATION: lazy_scan_heap,
vacuumlazy.c:676
Dec 13 13:15:59 test postgres[9640]: [5-1] WARNING: 01000:
PD_ALL_VISIBLE flag was incorrectly set in relation "pg_statistic"
page 93
Dec 13 13:15:59 test postgres[9640]: [5-2] LOCATION: lazy_scan_heap,
vacuumlazy.c:676
Dec 13 13:15:59 test postgres[9640]: [6-1] WARNING: 01000:
PD_ALL_VISIBLE flag was incorrectly set in relation "pg_statistic"
page 96
Dec 13 13:15:59 test postgres[9640]: [6-2] LOCATION: lazy_scan_heap,
vacuumlazy.c:676
Dec 13 13:15:59 test postgres[9640]: [7-1] WARNING: 01000:
PD_ALL_VISIBLE flag was incorrectly set in relation "pg_statistic"
page 107
Dec 13 13:15:59 test postgres[9640]: [7-2] LOCATION: lazy_scan_heap,
vacuumlazy.c:676
...

This problem was reported some times, but has not been resolved yet.
http://archives.postgresql.org/message-id/4C23A3CF.4080506%40frolix.muddywaters.org
http://archives.postgresql.org/message-id/g2o4b46b5f01004010610ib8625426uae6ee90ac1435ba1%40mail.gmail.com

Though I investigated the source code around PD_ALL_VISIBLE flag,
I could not identify the cause.

Does anyone have any ideas what the cause is?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#3Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#2)
Re: [HACKERS] Re: PD_ALL_VISIBLE flag was incorrectly set happend during repeatable vacuum

On 12.01.2011 06:21, Fujii Masao wrote:

On Sat, Dec 25, 2010 at 2:09 PM, Maxim Boguk<maxim.boguk@gmail.com> wrote:

While I trying create reproducible test case for BUG #5798 I
encountered very strange effect on two of my servers (both servers
have same hardware platform/OS (freebsd 7.2) and PostgreSQL 8.4.4).

Very simple test table created as:
CREATE TABLE test (id integer);
INSERT INTO test select generate_series(0,10000);

And I trying repeateble vacuum of that table with script:
perl -e "foreach (1..100000) {system \"psql -d test -h -c 'vacuum test'\";}"

And once per like an minute (really random intervals can be 5 minutes
without problems can be 3 vacuum in row show same error) I getting
next errors:
WARNING: PD_ALL_VISIBLE flag was incorrectly set in relation "test" page 1
...
WARNING: PD_ALL_VISIBLE flag was incorrectly set in relation "test"
page 30 for all pages of the relation.

Oh, interesting. This is the first time anyone can reliably reproducible
that. I can't reproduce that on my laptop with that script, though, so
I'm going to need your help to debug this.

Can you compile PostgreSQL with the attached patch, and rerun the test?
It will dump the pages with incorrectly set flags to files in /tmp/, and
adds a bit more detail in the WARNING. Please run the test until you
get those warnings, and tar up the the created "/tmp/pageimage*" files,
and post them along with the warning generated.

We'll likely need to go back and forth a few times with various
debugging patches until we get to the heart of this..

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

debug-pd_all_visible-1.patchtext/x-diff; name=debug-pd_all_visible-1.patchDownload+16-2
#4daveg
daveg@sonic.net
In reply to: Heikki Linnakangas (#3)
Re: [HACKERS] Re: PD_ALL_VISIBLE flag was incorrectly set happend during repeatable vacuum

On Wed, Jan 12, 2011 at 10:46:14AM +0200, Heikki Linnakangas wrote:

On 12.01.2011 06:21, Fujii Masao wrote:

On Sat, Dec 25, 2010 at 2:09 PM, Maxim Boguk<maxim.boguk@gmail.com> wrote:

While I trying create reproducible test case for BUG #5798 I
encountered very strange effect on two of my servers (both servers
have same hardware platform/OS (freebsd 7.2) and PostgreSQL 8.4.4).

Very simple test table created as:
CREATE TABLE test (id integer);
INSERT INTO test select generate_series(0,10000);

And I trying repeateble vacuum of that table with script:
perl -e "foreach (1..100000) {system \"psql -d test -h -c 'vacuum
test'\";}"

And once per like an minute (really random intervals can be 5 minutes
without problems can be 3 vacuum in row show same error) I getting
next errors:
WARNING: PD_ALL_VISIBLE flag was incorrectly set in relation "test" page
1
...
WARNING: PD_ALL_VISIBLE flag was incorrectly set in relation "test"
page 30 for all pages of the relation.

Oh, interesting. This is the first time anyone can reliably reproducible
that. I can't reproduce that on my laptop with that script, though, so
I'm going to need your help to debug this.

Can you compile PostgreSQL with the attached patch, and rerun the test?
It will dump the pages with incorrectly set flags to files in /tmp/, and
adds a bit more detail in the WARNING. Please run the test until you
get those warnings, and tar up the the created "/tmp/pageimage*" files,
and post them along with the warning generated.

We'll likely need to go back and forth a few times with various
debugging patches until we get to the heart of this..

Anything new on this? I'm seeing at on one of my clients production boxes.
Also, what is the significance, ie what is the risk or damage potential if
this flag is set incorrectly?

Thanks

-dg

--
David Gould daveg@sonic.net
If simplicity worked, the world would be overrun with insects.

#5Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: daveg (#4)
Re: [HACKERS] Re: PD_ALL_VISIBLE flag was incorrectly set happend during repeatable vacuum

On 28.02.2011 23:28, daveg wrote:

On Wed, Jan 12, 2011 at 10:46:14AM +0200, Heikki Linnakangas wrote:

We'll likely need to go back and forth a few times with various
debugging patches until we get to the heart of this..

Anything new on this? I'm seeing at on one of my clients production boxes.

I haven't heard anything from the OP since.

Also, what is the significance, ie what is the risk or damage potential if
this flag is set incorrectly?

Sequential scans will honor the flag, so you might see some dead rows
incorrectly returned by a sequential scan. That's the only "damage", but
an incorrectly set flag could be a sign of something more sinister, like
corrupt tuple headers. The flag should never be set incorrectly, so if
you see that message you have hit a bug in PostgreSQL, or you have bad
hardware.

This flag is quite new, so a bug in PostgreSQL is quite possible. If you
still have a backup that contains those incorrectly set flags, I'd like
to see what the page looks like.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#6David Christensen
david@endpoint.com
In reply to: daveg (#4)
Re: [HACKERS] Re: PD_ALL_VISIBLE flag was incorrectly set happend during repeatable vacuum

On Feb 28, 2011, at 3:28 PM, daveg wrote:

On Wed, Jan 12, 2011 at 10:46:14AM +0200, Heikki Linnakangas wrote:

On 12.01.2011 06:21, Fujii Masao wrote:

On Sat, Dec 25, 2010 at 2:09 PM, Maxim Boguk<maxim.boguk@gmail.com> wrote:

While I trying create reproducible test case for BUG #5798 I
encountered very strange effect on two of my servers (both servers
have same hardware platform/OS (freebsd 7.2) and PostgreSQL 8.4.4).

Very simple test table created as:
CREATE TABLE test (id integer);
INSERT INTO test select generate_series(0,10000);

And I trying repeateble vacuum of that table with script:
perl -e "foreach (1..100000) {system \"psql -d test -h -c 'vacuum
test'\";}"

And once per like an minute (really random intervals can be 5 minutes
without problems can be 3 vacuum in row show same error) I getting
next errors:
WARNING: PD_ALL_VISIBLE flag was incorrectly set in relation "test" page
1
...
WARNING: PD_ALL_VISIBLE flag was incorrectly set in relation "test"
page 30 for all pages of the relation.

Oh, interesting. This is the first time anyone can reliably reproducible
that. I can't reproduce that on my laptop with that script, though, so
I'm going to need your help to debug this.

Can you compile PostgreSQL with the attached patch, and rerun the test?
It will dump the pages with incorrectly set flags to files in /tmp/, and
adds a bit more detail in the WARNING. Please run the test until you
get those warnings, and tar up the the created "/tmp/pageimage*" files,
and post them along with the warning generated.

We'll likely need to go back and forth a few times with various
debugging patches until we get to the heart of this..

Anything new on this? I'm seeing at on one of my clients production boxes.
Also, what is the significance, ie what is the risk or damage potential if
this flag is set incorrectly?

Was this cluster upgraded to 8.4.4 from 8.4.0? It sounds to me like a known bug in 8.4.0 which was fixed by this commit:

commit 7fc7a7c4d082bfbd579f49e92b046dd51f1faf5f
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Mon Aug 24 02:18:32 2009 +0000

Fix a violation of WAL coding rules in the recent patch to include an
"all tuples visible" flag in heap page headers. The flag update *must*
be applied before calling XLogInsert, but heap_update and the tuple
moving routines in VACUUM FULL were ignoring this rule. A crash and
replay could therefore leave the flag incorrectly set, causing rows
to appear visible in seqscans when they should not be. This might explain
recent reports of data corruption from Jeff Ross and others.

In passing, do a bit of editorialization on comments in visibilitymap.c.

oy:postgresql machack$ git describe --tag 7fc7a7c4d082bfbd579f49e92b046dd51f1faf5f
REL8_4_0-190-g7fc7a7c

If the flag got twiddled while running as 8.4.0, the incorrect PD_ALL_VISIBLE flag would (obviously) not be fixed by the upgrade to 8.4.4. (Is this a separate issue?)

Regards,

David
--
David Christensen
End Point Corporation
david@endpoint.com

#7Fujii Masao
masao.fujii@gmail.com
In reply to: David Christensen (#6)
Re: [HACKERS] Re: PD_ALL_VISIBLE flag was incorrectly set happend during repeatable vacuum

On Tue, Mar 1, 2011 at 10:43 AM, David Christensen <david@endpoint.com> wrote:

Was this cluster upgraded to 8.4.4 from 8.4.0?  It sounds to me like a known bug in 8.4.0 which was fixed by this commit:

commit 7fc7a7c4d082bfbd579f49e92b046dd51f1faf5f
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Mon Aug 24 02:18:32 2009 +0000

   Fix a violation of WAL coding rules in the recent patch to include an
   "all tuples visible" flag in heap page headers.  The flag update *must*
   be applied before calling XLogInsert, but heap_update and the tuple
   moving routines in VACUUM FULL were ignoring this rule.  A crash and
   replay could therefore leave the flag incorrectly set, causing rows
   to appear visible in seqscans when they should not be.  This might explain
   recent reports of data corruption from Jeff Ross and others.

   In passing, do a bit of editorialization on comments in visibilitymap.c.

oy:postgresql machack$ git describe --tag 7fc7a7c4d082bfbd579f49e92b046dd51f1faf5f
REL8_4_0-190-g7fc7a7c

If the flag got twiddled while running as 8.4.0, the incorrect PD_ALL_VISIBLE flag would (obviously) not be fixed by the upgrade to 8.4.4.  (Is this a separate issue?)

Yes, it's a different issue. I observed it on 8.4.2.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#8Bruce Momjian
bruce@momjian.us
In reply to: David Christensen (#6)
Re: [HACKERS] Re: PD_ALL_VISIBLE flag was incorrectly set happend during repeatable vacuum

On Tue, Mar 1, 2011 at 1:43 AM, David Christensen <david@endpoint.com> wrote:

Was this cluster upgraded to 8.4.4 from 8.4.0?  It sounds to me like a known bug in 8.4.0 which was fixed by this commit:

The reproduction script described was running vacuum repeatedly. A
single vacuum run out to be sufficient to clean up the problem if it
was left-over.

I wonder if it would help to write a regression test that runs 100 or
so vacuums and see if the bulid farm turns up any examples of this
behaviour.

--
greg

#9Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#8)
Re: [HACKERS] Re: PD_ALL_VISIBLE flag was incorrectly set happend during repeatable vacuum

On Mon, Feb 28, 2011 at 10:32 PM, Greg Stark <gsstark@mit.edu> wrote:

On Tue, Mar 1, 2011 at 1:43 AM, David Christensen <david@endpoint.com> wrote:

Was this cluster upgraded to 8.4.4 from 8.4.0?  It sounds to me like a known bug in 8.4.0 which was fixed by this commit:

The reproduction script described was running vacuum repeatedly. A
single vacuum run out to be sufficient to clean up the problem if it
was left-over.

I wonder if it would help to write a regression test that runs 100 or
so vacuums and see if the bulid farm turns up any examples of this
behaviour.

One other thing to keep in mind here is that the warning message we've
chosen can be a bit misleading. The warning is:

WARNING: PD_ALL_VISIBLE flag was incorrectly set in relation "test" page 1

...which implies that the state of the tuples is correct, and that the
page-level bit is wrong in comparison. But I recently saw a case
where the infomask got clobbered, resulting in this warning. The page
level bit was correct, at least relative to the intended page
contents; it was the a tuple on the page that was screwed up. It
might have been better to pick a more neutral phrasing, like "page is
marked all-visible but some tuples are not visible".

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10daveg
daveg@sonic.net
In reply to: David Christensen (#6)
Re: [HACKERS] Re: PD_ALL_VISIBLE flag was incorrectly set happend during repeatable vacuum

On Mon, Feb 28, 2011 at 07:43:39PM -0600, David Christensen wrote:

On Feb 28, 2011, at 3:28 PM, daveg wrote:

Anything new on this? I'm seeing at on one of my clients production boxes.
Also, what is the significance, ie what is the risk or damage potential if
this flag is set incorrectly?

Was this cluster upgraded to 8.4.4 from 8.4.0? It sounds to me like a known bug in 8.4.0 which was fixed by this commit:

commit 7fc7a7c4d082bfbd579f49e92b046dd51f1faf5f
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Mon Aug 24 02:18:32 2009 +0000

Fix a violation of WAL coding rules in the recent patch to include an
"all tuples visible" flag in heap page headers. The flag update *must*
be applied before calling XLogInsert, but heap_update and the tuple
moving routines in VACUUM FULL were ignoring this rule. A crash and
replay could therefore leave the flag incorrectly set, causing rows
to appear visible in seqscans when they should not be. This might explain
recent reports of data corruption from Jeff Ross and others.

In passing, do a bit of editorialization on comments in visibilitymap.c.

oy:postgresql machack$ git describe --tag 7fc7a7c4d082bfbd579f49e92b046dd51f1faf5f
REL8_4_0-190-g7fc7a7c

If the flag got twiddled while running as 8.4.0, the incorrect PD_ALL_VISIBLE flag would (obviously) not be fixed by the upgrade to 8.4.4. (Is this a separate issue?)

This cluster was installed with 8.4.4. So it is still an existing problem.
Also, to my recollection, this cluster has never crashed.

-dg

--
David Gould daveg@sonic.net 510 536 1443 510 282 0869
If simplicity worked, the world would be overrun with insects.

#11daveg
daveg@sonic.net
In reply to: Heikki Linnakangas (#5)
Re: [HACKERS] Re: PD_ALL_VISIBLE flag was incorrectly set happend during repeatable vacuum

On Tue, Mar 01, 2011 at 12:00:54AM +0200, Heikki Linnakangas wrote:

On 28.02.2011 23:28, daveg wrote:

On Wed, Jan 12, 2011 at 10:46:14AM +0200, Heikki Linnakangas wrote:

We'll likely need to go back and forth a few times with various
debugging patches until we get to the heart of this..

Anything new on this? I'm seeing at on one of my clients production boxes.

I haven't heard anything from the OP since.

Also, what is the significance, ie what is the risk or damage potential if
this flag is set incorrectly?

Sequential scans will honor the flag, so you might see some dead rows
incorrectly returned by a sequential scan. That's the only "damage", but
an incorrectly set flag could be a sign of something more sinister, like
corrupt tuple headers. The flag should never be set incorrectly, so if
you see that message you have hit a bug in PostgreSQL, or you have bad
hardware.

This flag is quite new, so a bug in PostgreSQL is quite possible. If you
still have a backup that contains those incorrectly set flags, I'd like
to see what the page looks like.

I ran vacuums on all the affected tables last night. I plan to take a downtime
to clear the buffer cache and then to run vacuums on all the dbs in the
cluster.

Most but not all the tables involved are catalogs.

However, I could probably pick up your old patch sometime next week if it
recurrs and send you page images.

-dg

--
David Gould daveg@sonic.net 510 536 1443 510 282 0869
If simplicity worked, the world would be overrun with insects.

#12Maxim Boguk
maxim.boguk@gmail.com
In reply to: Heikki Linnakangas (#5)
Re: [HACKERS] Re: PD_ALL_VISIBLE flag was incorrectly set happend during repeatable vacuum

Hi

On Tue, Mar 1, 2011 at 11:00 AM, Heikki Linnakangas <
heikki.linnakangas@enterprisedb.com> wrote:

On 28.02.2011 23:28, daveg wrote:

On Wed, Jan 12, 2011 at 10:46:14AM +0200, Heikki Linnakangas wrote:

We'll likely need to go back and forth a few times with various
debugging patches until we get to the heart of this..

Anything new on this? I'm seeing at on one of my clients production boxes.

I haven't heard anything from the OP since.

Also, what is the significance, ie what is the risk or damage potential if

this flag is set incorrectly?

Sequential scans will honor the flag, so you might see some dead rows
incorrectly returned by a sequential scan. That's the only "damage", but an
incorrectly set flag could be a sign of something more sinister, like
corrupt tuple headers. The flag should never be set incorrectly, so if you
see that message you have hit a bug in PostgreSQL, or you have bad hardware.

This flag is quite new, so a bug in PostgreSQL is quite possible. If you
still have a backup that contains those incorrectly set flags, I'd like to
see what the page looks like.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Sorry, I was a bit busy lately.
This effect on my production servers seems have floating nature (some weeks
it happen each day many time and some weeks - not at all... phase of the
moon may be). Last 2 weeks I don't see that error in database logs so I
can't add anything more.

What make situation even harder to debug it is I can't compile and install
patched version of database engine (because it is production server and I
physically don't have access to changing software there). So I trying to
reproduce error on single connect (so I can attach GDB and look inside) but
without luck during last week.

--
Maxim Boguk
Senior Postgresql DBA.

Skype: maxim.boguk
Jabber: maxim.boguk@gmail.com

LinkedIn profile: http://nz.linkedin.com/in/maximboguk
If they can send one man to the moon... why can't they send them all?

МойКруг: http://mboguk.moikrug.ru/
Сила солому ломит, но не все в нашей жизни - солома, да и сила далеко не
все.

#13daveg
daveg@sonic.net
In reply to: daveg (#11)
Re: [HACKERS] Re: PD_ALL_VISIBLE flag was incorrectly set happend during repeatable vacuum

On Tue, Mar 01, 2011 at 01:20:43PM -0800, daveg wrote:

On Tue, Mar 01, 2011 at 12:00:54AM +0200, Heikki Linnakangas wrote:

On 28.02.2011 23:28, daveg wrote:

On Wed, Jan 12, 2011 at 10:46:14AM +0200, Heikki Linnakangas wrote:

We'll likely need to go back and forth a few times with various
debugging patches until we get to the heart of this..

Anything new on this? I'm seeing at on one of my clients production boxes.

I haven't heard anything from the OP since.

Also, what is the significance, ie what is the risk or damage potential if
this flag is set incorrectly?

Sequential scans will honor the flag, so you might see some dead rows
incorrectly returned by a sequential scan. That's the only "damage", but
an incorrectly set flag could be a sign of something more sinister, like
corrupt tuple headers. The flag should never be set incorrectly, so if
you see that message you have hit a bug in PostgreSQL, or you have bad
hardware.

This flag is quite new, so a bug in PostgreSQL is quite possible. If you
still have a backup that contains those incorrectly set flags, I'd like
to see what the page looks like.

I ran vacuums on all the affected tables last night. I plan to take a downtime
to clear the buffer cache and then to run vacuums on all the dbs in the
cluster.

Most but not all the tables involved are catalogs.

However, I could probably pick up your old patch sometime next week if it
recurrs and send you page images.

After a restart and vacuum of all dbs with no other activity things were
quiet for a couple hours and then we started seeing these PD_ALL_VISIBLE
messages again.

Going back through the logs we have been getting these since at least before
mid January. Oddly, this only happens on four systems which are all new Dell
32 core Nehalem 512GB machines using iscsi partitions served off a Netapp.
Our older 8 core 64GB hosts have never logged any of these errors. I'm not
saying it is related to the hw, as these hosts are doing a lot more work than
the old hosts so it may be a concurrency problem that just never came up at
lower levels before.

Postgresql version is 8.4.4.

I'll pick up Heikkis page logging patch and run it for a bit to get some
damaged page images. What else could I be doing to track this down?

-dg

--
David Gould daveg@sonic.net 510 536 1443 510 282 0869
If simplicity worked, the world would be overrun with insects.

#14Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: daveg (#13)
Re: [HACKERS] Re: PD_ALL_VISIBLE flag was incorrectly set happend during repeatable vacuum

Excerpts from daveg's message of mié mar 02 18:30:34 -0300 2011:

After a restart and vacuum of all dbs with no other activity things were
quiet for a couple hours and then we started seeing these PD_ALL_VISIBLE
messages again.

Going back through the logs we have been getting these since at least before
mid January. Oddly, this only happens on four systems which are all new Dell
32 core Nehalem 512GB machines using iscsi partitions served off a Netapp.
Our older 8 core 64GB hosts have never logged any of these errors. I'm not
saying it is related to the hw, as these hosts are doing a lot more work than
the old hosts so it may be a concurrency problem that just never came up at
lower levels before.

Postgresql version is 8.4.4.

I don't see how this could be related, but since you're running on NFS,
maybe it is, somehow:
http://archives.postgresql.org/message-id/4D40DDB7.1010000@credativ.com
(for example what if the visibility map fork's last page is overwritten?)

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#15daveg
daveg@sonic.net
In reply to: Alvaro Herrera (#14)
Re: [HACKERS] Re: PD_ALL_VISIBLE flag was incorrectly set happend during repeatable vacuum

On Wed, Mar 02, 2011 at 06:45:13PM -0300, Alvaro Herrera wrote:

Excerpts from daveg's message of mi� mar 02 18:30:34 -0300 2011:

After a restart and vacuum of all dbs with no other activity things were
quiet for a couple hours and then we started seeing these PD_ALL_VISIBLE
messages again.

Going back through the logs we have been getting these since at least before
mid January. Oddly, this only happens on four systems which are all new Dell
32 core Nehalem 512GB machines using iscsi partitions served off a Netapp.
Our older 8 core 64GB hosts have never logged any of these errors. I'm not
saying it is related to the hw, as these hosts are doing a lot more work than
the old hosts so it may be a concurrency problem that just never came up at
lower levels before.

Postgresql version is 8.4.4.

I don't see how this could be related, but since you're running on NFS,
maybe it is, somehow:
http://archives.postgresql.org/message-id/4D40DDB7.1010000@credativ.com
(for example what if the visibility map fork's last page is overwritten?)

Running on ISCSI, not nfs. But it is still a Netapp, so who knows. I'll look.
Also, we are not seeing any of the "unexpected data beyond EOF" errors,
just thousands per day of the PD_ALL_VISIBLE error.

-dg

--
David Gould daveg@sonic.net 510 536 1443 510 282 0869
If simplicity worked, the world would be overrun with insects.

#16bricklen
bricklen@gmail.com
In reply to: daveg (#15)
Re: [HACKERS] Re: PD_ALL_VISIBLE flag was incorrectly set happend during repeatable vacuum

On Wed, Mar 2, 2011 at 3:53 PM, daveg <daveg@sonic.net> wrote:

Postgresql version is 8.4.4.

I don't see how this could be related, but since you're running on NFS,
maybe it is, somehow:
http://archives.postgresql.org/message-id/4D40DDB7.1010000@credativ.com
(for example what if the visibility map fork's last page is overwritten?)

Running on ISCSI, not nfs. But it is still a Netapp, so who knows. I'll look.
Also, we are not seeing any of the "unexpected data beyond EOF" errors,
just thousands per day of the PD_ALL_VISIBLE error.

-dg

FWIW, we had a couple occurrences of that message about a month ago on 9.0.2

http://archives.postgresql.org/pgsql-general/2011-01/msg00887.php

Haven't seen it since we ran a cluster-wide vacuum.

#17daveg
daveg@sonic.net
In reply to: bricklen (#16)
Re: [HACKERS] Re: PD_ALL_VISIBLE flag was incorrectly set happend during repeatable vacuum

On Wed, Mar 02, 2011 at 04:20:24PM -0800, bricklen wrote:

On Wed, Mar 2, 2011 at 3:53 PM, daveg <daveg@sonic.net> wrote:

Postgresql version is 8.4.4.

I don't see how this could be related, but since you're running on NFS,
maybe it is, somehow:
http://archives.postgresql.org/message-id/4D40DDB7.1010000@credativ.com
(for example what if the visibility map fork's last page is overwritten?)

Running on ISCSI, not nfs. But it is still a Netapp, so who knows. I'll look.
Also, we are not seeing any of the "unexpected data beyond EOF" errors,
just thousands per day of the PD_ALL_VISIBLE error.

-dg

FWIW, we had a couple occurrences of that message about a month ago on 9.0.2

http://archives.postgresql.org/pgsql-general/2011-01/msg00887.php

Haven't seen it since we ran a cluster-wide vacuum.

We did a shutdown and restart to clear the buffer cache (but did not reboot
the host) and a vacuum on all dbs in the cluster last night. That cleared it
up for a couple hours, but we are still getting lots of these messages.

Most of them are pg_statistic and we create and drop hundreds of thousands of
temp tables daily, so there is a good chance there is a concurrancy issue.

-dg

--
David Gould daveg@sonic.net 510 536 1443 510 282 0869
If simplicity worked, the world would be overrun with insects.

#18daveg
daveg@sonic.net
In reply to: Robert Haas (#9)
Re: [HACKERS] Re: PD_ALL_VISIBLE flag was incorrectly set happend during repeatable vacuum

On Tue, Mar 01, 2011 at 08:40:37AM -0500, Robert Haas wrote:

On Mon, Feb 28, 2011 at 10:32 PM, Greg Stark <gsstark@mit.edu> wrote:

On Tue, Mar 1, 2011 at 1:43 AM, David Christensen <david@endpoint.com> wrote:

Was this cluster upgraded to 8.4.4 from 8.4.0? �It sounds to me like a known bug in 8.4.0 which was fixed by this commit:

The reproduction script described was running vacuum repeatedly. A
single vacuum run out to be sufficient to clean up the problem if it
was left-over.

I wonder if it would help to write a regression test that runs 100 or
so vacuums and see if the bulid farm turns up any examples of this
behaviour.

One other thing to keep in mind here is that the warning message we've
chosen can be a bit misleading. The warning is:

WARNING: PD_ALL_VISIBLE flag was incorrectly set in relation "test" page 1

...which implies that the state of the tuples is correct, and that the
page-level bit is wrong in comparison. But I recently saw a case
where the infomask got clobbered, resulting in this warning. The page
level bit was correct, at least relative to the intended page
contents; it was the a tuple on the page that was screwed up. It
might have been better to pick a more neutral phrasing, like "page is
marked all-visible but some tuples are not visible".

Yeesh. Yikes. I hope that this is not the case as we are seeing thousands of
these daily on each of 4 large production hosts. Mostly on catalogs,
especially pg_statistic. However it does occur on some high delete/insert
traffic user tables too.

Question: what would be the consequence of simply patching out the setting
of this flag? Assuming that the incorrect PD_ALL_VISIBLE flag is the only
problem (big assumption perhaps) then simply never setting it would at least
avoid the possibility of returning wrong answers, presumably at some
performance cost. We possibly could live with that until we get a handle
on the real cause and fix.

I had a look and don't really see anything except vacuum_lazy that sets it,
so it seems simple to disable.

Or have I understood this incorrectly?

Anything else I can be doing to try to track this down?

-dg

--
David Gould daveg@sonic.net 510 536 1443 510 282 0869
If simplicity worked, the world would be overrun with insects.

#19Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: daveg (#18)
Re: [HACKERS] Re: PD_ALL_VISIBLE flag was incorrectly set happend during repeatable vacuum

On 03.03.2011 09:12, daveg wrote:

Question: what would be the consequence of simply patching out the setting
of this flag? Assuming that the incorrect PD_ALL_VISIBLE flag is the only
problem (big assumption perhaps) then simply never setting it would at least
avoid the possibility of returning wrong answers, presumably at some
performance cost. We possibly could live with that until we get a handle
on the real cause and fix.

Yes. With that assumption.

If you really want to do that, I would suggest the attached patch
instead. This just disables the optimization in seqscans to trust it, so
an incorrectly set flag won't affect correctness of query results, but
the flag is still set as usual and you still get the warnings so that we
can continue to debug the issue.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

dont-trust-all-visible-flag.patchtext/x-diff; name=dont-trust-all-visible-flag.patchDownload+5-0
#20daveg
daveg@sonic.net
In reply to: Heikki Linnakangas (#19)
Re: [HACKERS] Re: PD_ALL_VISIBLE flag was incorrectly set happend during repeatable vacuum

On Thu, Mar 03, 2011 at 10:16:29AM +0200, Heikki Linnakangas wrote:

On 03.03.2011 09:12, daveg wrote:

Question: what would be the consequence of simply patching out the setting
of this flag? Assuming that the incorrect PD_ALL_VISIBLE flag is the only
problem (big assumption perhaps) then simply never setting it would at
least
avoid the possibility of returning wrong answers, presumably at some
performance cost. We possibly could live with that until we get a handle
on the real cause and fix.

Yes. With that assumption.

If you really want to do that, I would suggest the attached patch
instead. This just disables the optimization in seqscans to trust it, so
an incorrectly set flag won't affect correctness of query results, but
the flag is still set as usual and you still get the warnings so that we
can continue to debug the issue.

Thanks. I'll be applying this tomorrow and will send you some page images
to look at assuming it still does it.

I had a look at how this gets set and cleared and did not see anything obvious
so I'm pretty mystified. Also, we are seeing thousands of these daily for at
least a month on 4 large hosts and no-one has noticed any other issues,
which suprises me. Very strange.

-dg

--
David Gould daveg@sonic.net 510 536 1443 510 282 0869
If simplicity worked, the world would be overrun with insects.

#21Merlin Moncure
mmoncure@gmail.com
In reply to: Heikki Linnakangas (#19)
#22daveg
daveg@sonic.net
In reply to: Merlin Moncure (#21)
#23Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Maxim Boguk (#1)
#24daveg
daveg@sonic.net
In reply to: Heikki Linnakangas (#23)
#25Bruce Momjian
bruce@momjian.us
In reply to: daveg (#24)
#26daveg
daveg@sonic.net
In reply to: Bruce Momjian (#25)
#27Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Bruce Momjian (#25)
#28Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#27)
#29daveg
daveg@sonic.net
In reply to: Heikki Linnakangas (#27)
#30daveg
daveg@sonic.net
In reply to: Heikki Linnakangas (#28)
#31Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: daveg (#29)
#32Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: daveg (#30)