BUG #17245: Index corruption involving deduplicated entries

Started by PG Bug reporting formover 4 years ago103 messagesbugs
Jump to latest
#1PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 17245
Logged by: Andrew Gierth
Email address: andrew@tao11.riddles.org.uk
PostgreSQL version: 14.0
Operating system: FreeBSD 13.0-RELEASE
Description:

From a report from IRC, from a user running a mediawiki instance on PG:

select ctid, page_title from "page" WHERE page_title = 'Isokaze' and
page_namespace = 0;

returned two rows rather than the expected 1:

(181,53) | Ying_Swei
(722,3) | Isokaze

Disabling index scans restored the expected output of 1 row.

Inspecting the index revealed this:

000114a0: 2f47 616c 6c65 7279 0000 1000 0320 2860 /Gallery..... (`
000114b0: 1149 736f 6b61 7a65 0000 7600 1800 0000 .Isokaze..v.....
000114c0: b500 0900 0000 d202 3800 0000 0000 0000 ........8.......
000114d0: 0000 c501 2d00 1840 1749 7365 2f51 756f ....-..@.Ise/Quo

which appears to me to be a deduplicated index entry pointing to heap rows
(181,9) and (722,56). Inspecting the heap showed that both of those ctids
are just pointer entries, to (181,53) and (722,3) respectively, which
explains the 2 row result. What I don't know is how this could have happened
in the first place.

The definition of the affected index is:

CREATE INDEX page_main_title ON mediawiki.page USING btree (page_title
text_pattern_ops) WHERE (page_namespace = 0);

The real entries for 'Ying_Swei' in the index are not in the same page or
even any nearby page, and it seems unlikely that the page_title was
updated.

The original reporter (who I will CC on a followup message) still has a
snapshot of the corrupt data. (A REINDEX naturally fixed the data).

#2Andrew Gierth
andrew@tao11.riddles.org.uk
In reply to: PG Bug reporting form (#1)
Re: BUG #17245: Index corruption involving deduplicated entries

The following bug has been logged on the website:
Bug reference: 17245

CC'ing to original reporter.

See report at
/messages/by-id/17245-ddf06aaf85735f36@postgresql.org

--
Andrew (irc:RhodiumToad)

In reply to: Andrew Gierth (#2)
Re: BUG #17245: Index corruption involving deduplicated entries

On Sun, Oct 24, 2021 at 4:57 AM Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:

CC'ing to original reporter.

I suggest that pg_amcheck be run against the database, to try to
establish the overall extent of the corruption -- including all
indexes and heap relations from the database. This should start with
the original known-corrupt backup, from before the REINDEX. Then
report back the output of pg_amcheck to the list.

See: https://www.postgresql.org/docs/14/app-pgamcheck.html

Please specify --heapallindexed and --parent-check options when
running pg_amcheck, so that it will perform the most thorough possible
checks against indexes. --heapallindexed is particularly important
here, because the issue that Andrew highlighted is probably not
detectable just by looking at the index itself. It's purely an
inconsistency between the index and the table. This class of problem
has many possible causes.

Thanks
--
Peter Geoghegan

#4Kamigishi Rei
iijima.yun@koumakan.jp
In reply to: Andrew Gierth (#2)
Re: BUG #17245: Index corruption involving deduplicated entries

During investigation, `SELECT
bt_index_parent_check('mediawiki.page_main_title', true, true);`
displayed no errors.

I have installed the same build of 14.0 in a VM and started it with the
data directory before reindex. pg_amcheck does report errors, but these
are not related to the index in question (page_main_title):

$ ./pg_amcheck -U postgres --heapallindexed --parent-check --rootdescend
azurlane_wiki
btree index "azurlane_wiki.mediawiki.page_redirect_namespace_len":
    ERROR:  posting list contains misplaced TID in index
"page_redirect_namespace_len"
    DETAIL:  Index tid=(14,9) posting list offset=110 page lsn=2/2C4F7CD8.
btree index "azurlane_wiki.mediawiki.page_len":
    ERROR:  posting list contains misplaced TID in index "page_len"
    DETAIL:  Index tid=(1,2) posting list offset=34 page lsn=2/2DDA2378.
btree index "azurlane_wiki.mediawiki.transcode_key_idx":
    ERROR:  posting list contains misplaced TID in index
"transcode_key_idx"
    DETAIL:  Index tid=(1,9) posting list offset=5 page lsn=2/2B53AED8.

This instance displays the same behaviour with page_main_title:

azurlane_wiki=# select ctid, page_title from mediawiki.page WHERE
page_title = 'Isokaze' and page_namespace = 0;
   ctid   | page_title
----------+------------
 (181,53) | Ying_Swei
 (722,3)  | Isokaze
(2 rows)

azurlane_wiki=# select ctid, page_title from mediawiki.page WHERE
page_title = 'Kaga' and page_namespace = 0;
   ctid   | page_title
----------+------------
 (446,32) | Belfast
 (720,53) | Kaga

(2 rows)

After REINDEX'ing the three indices listed by pg_amcheck, there is no
apparent change to the state of the page_main_title index:

azurlane_wiki=# reindex index mediawiki.page_redirect_namespace_len;
REINDEX
azurlane_wiki=# reindex index mediawiki.page_len;
REINDEX
azurlane_wiki=# reindex index mediawiki.transcode_key_idx;
REINDEX

[xiatian@freebsd pg_amcheck]$ ./pg_amcheck -U postgres --heapallindexed
--parent-check --rootdescend azurlane_wiki
[xiatian@freebsd pg_amcheck]$

azurlane_wiki=# select ctid, page_title from mediawiki.page WHERE
page_title = 'Kaga' and page_namespace = 0;
   ctid   | page_title
----------+------------
 (446,32) | Belfast
 (720,53) | Kaga
(2 rows)

P. S. My previous copy of this e-mail was sent from an invalid address;
I apologise for the possible noise.

--
K. R.

In reply to: PG Bug reporting form (#1)
Re: BUG #17245: Index corruption involving deduplicated entries

On Sun, Oct 24, 2021 at 4:40 AM PG Bug reporting form
<noreply@postgresql.org> wrote:

PostgreSQL version: 14.0
Operating system: FreeBSD 13.0-RELEASE

PostgreSQL 14 is the first version where the default wal_sync_method
is fdatasync() on FreeBSD -- though only with FreeBSD 13. See commit
f900a79e. Perhaps that has something to do with the problem seen here.

I have no concrete reason to suspect fdatasync() here -- I'm just
pattern matching. But the inconsistency that Andrew reported looks
like any other scenario where the system's wal_sync_method doesn't
provide the expected guarantees. This looks like generic corruption to
me. Andrew said that the "real entries for 'Ying_Swei' in the index
are not in the same page or even any nearby page". It is reasonable to
suppose that the TID in the index pointing to the wrong tuple in the
heap/table is from an index tuple that has no business being anywhere
near the leaf page that we find it in -- I surmise that it's too far
away in the keyspace for that. (I think that this was Andrew's exact
point, in fact, but I just want to make sure.)

I suspect that there was TID recycling by VACUUM involved here.
Originally, the leaf page image that Andrew examined might well have
actually pointed to the right thing in the heap. Perhaps VACUUM ran,
and recycled the relevant heap TID, while somehow leaving an older
pre-VACUUM leaf page behind (any kind of storage misbehavior could do
this).

There doesn't have to have been a VACUUM, actually. A simple hard
crash (with a misbehaving wal_sync_method or whatever) could easily
lead to the same symptoms. You just need to have heapam "allocate new
heap TIDs" and then forget about having done so. Without the index
structure also forgetting.

--
Peter Geoghegan

In reply to: Kamigishi Rei (#4)
Re: BUG #17245: Index corruption involving deduplicated entries

On Sun, Oct 24, 2021 at 2:35 PM K. R. <iijima.yun@koumakan.jp> wrote:

$ ./pg_amcheck -U postgres --heapallindexed --parent-check --rootdescend
azurlane_wiki
btree index "azurlane_wiki.mediawiki.page_redirect_namespace_len":
ERROR: posting list contains misplaced TID in index
"page_redirect_namespace_len"
DETAIL: Index tid=(14,9) posting list offset=110 page lsn=2/2C4F7CD8.
btree index "azurlane_wiki.mediawiki.page_len":
ERROR: posting list contains misplaced TID in index "page_len"
DETAIL: Index tid=(1,2) posting list offset=34 page lsn=2/2DDA2378.
btree index "azurlane_wiki.mediawiki.transcode_key_idx":
ERROR: posting list contains misplaced TID in index
"transcode_key_idx"
DETAIL: Index tid=(1,9) posting list offset=5 page lsn=2/2B53AED8.

Thanks for getting back to me with that so quickly.

I suspect that these posting list tuples have duplicate TIDs, which is
never supposed to happen -- nbtree expects that heapam (and the
broader system) will never allow a duplicate TID to be present in the
whole index.

It's possible that I'm wrong, and the corrupt posting list TIDs are
actually in the wrong order (i.e. they're all unique, but are somehow
not in perfect TID-wise order). But I doubt it.

After REINDEX'ing the three indices listed by pg_amcheck, there is no
apparent change to the state of the page_main_title index:

azurlane_wiki=# select ctid, page_title from mediawiki.page WHERE
page_title = 'Kaga' and page_namespace = 0;
ctid | page_title
----------+------------
(446,32) | Belfast
(720,53) | Kaga
(2 rows)

I'm a little confused.

Do you mean that even a REINDEX isn't enough to stop the index from
giving this particular wrong answer, where it thinks that 'Belfast' is
the same as 'Kaga'? Even a REINDEX won't make it stop doing that? Or,
are you concerned that pg_amcheck doesn't detect a remaining problem
with one index?

Thanks
--
Peter Geoghegan

In reply to: Peter Geoghegan (#6)
Re: BUG #17245: Index corruption involving deduplicated entries

On Sun, Oct 24, 2021 at 3:11 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Sun, Oct 24, 2021 at 2:35 PM K. R. <iijima.yun@koumakan.jp> wrote:

$ ./pg_amcheck -U postgres --heapallindexed --parent-check --rootdescend
azurlane_wiki
btree index "azurlane_wiki.mediawiki.page_redirect_namespace_len":
ERROR: posting list contains misplaced TID in index
"page_redirect_namespace_len"
DETAIL: Index tid=(14,9) posting list offset=110 page lsn=2/2C4F7CD8.
btree index "azurlane_wiki.mediawiki.page_len":
ERROR: posting list contains misplaced TID in index "page_len"
DETAIL: Index tid=(1,2) posting list offset=34 page lsn=2/2DDA2378.
btree index "azurlane_wiki.mediawiki.transcode_key_idx":
ERROR: posting list contains misplaced TID in index
"transcode_key_idx"
DETAIL: Index tid=(1,9) posting list offset=5 page lsn=2/2B53AED8.

Does it make any difference if you run "./pg_amcheck -U postgres
--heapallindexed azurlane_wiki" instead?

Typically I would expect this to be less interesting, because you're
asking pg_amcheck to do less checking. But this is not a typical case.

--
Peter Geoghegan

#8Andrew Gierth
andrew@tao11.riddles.org.uk
In reply to: Peter Geoghegan (#5)
Re: BUG #17245: Index corruption involving deduplicated entries

"Peter" == Peter Geoghegan <pg@bowt.ie> writes:

Peter> PostgreSQL 14 is the first version where the default
Peter> wal_sync_method is fdatasync() on FreeBSD -- though only with
Peter> FreeBSD 13. See commit f900a79e. Perhaps that has something to
Peter> do with the problem seen here.

Nope. The database in question was restored from a pg_dumpall backup
three weeks ago and I'm told there have been no crashes or even unclean
restarts since then.

Peter> I suspect that there was TID recycling by VACUUM involved here.

That also doesn't seem likely from the heap page images I saw, which
suggested a fairly low update rate, HOT updates, and page cleanup rather
than much vacuum activity (in particular there were several LP_REDIRECT
pointers and few free pointers).

--
Andrew (irc:RhodiumToad)

#9Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andrew Gierth (#8)
Re: BUG #17245: Index corruption involving deduplicated entries

Do you happen to have WAL traffic for this table and index?

--
Álvaro Herrera

In reply to: Andrew Gierth (#8)
Re: BUG #17245: Index corruption involving deduplicated entries

On Sun, Oct 24, 2021 at 5:56 PM Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:

Nope. The database in question was restored from a pg_dumpall backup
three weeks ago and I'm told there have been no crashes or even unclean
restarts since then.

Anything is possible, but this doesn't look like logical index
corruption. If we assume that it was something like a faulty
deduplication pass, then we have to explain how the spurious TID ended
up in that particular leaf page to begin with. We also have to explain
why it is a valid TID though totally distinct TID, which is unlikely
to occur by chance.

I find it slightly suspicious that the "correct" heap blocks (those
blocks that the correct heap tuple is located) are close together in
the heap -- 720 (in the case of 'Kaga'), 722 (in the case of
'Isokaze'). While the "incorrect" heap blocks (those blocks that the
non-matching heap tuple is located, that are nevertheless returned by
the index scan) are earlier in the table -- 181 and 446.

That also doesn't seem likely from the heap page images I saw, which
suggested a fairly low update rate, HOT updates, and page cleanup rather
than much vacuum activity (in particular there were several LP_REDIRECT
pointers and few free pointers).

FWIW Postgres 14 can truncate a heap page's line pointer array during
VACUUM's second pass -- see commit 3c3b8a4b. So a lack of observable
free line pointers is not quite the signal it once was.

What does pg_stat_user_tables say about autovacuum?

--
Peter Geoghegan

#11Kamigishi Rei
iijima.yun@koumakan.jp
In reply to: Peter Geoghegan (#7)
Re: BUG #17245: Index corruption involving deduplicated entries

On 25.10.2021 1:28, Peter Geoghegan wrote:

Does it make any difference if you run "./pg_amcheck -U postgres
--heapallindexed azurlane_wiki" instead?

Typically I would expect this to be less interesting, because you're
asking pg_amcheck to do less checking. But this is not a typical case.

No, there seem to be no differences between using different sets of
arguments:

[xiatian@freebsd pg_amcheck]$ ./pg_amcheck -U postgres --heapallindexed
azurlane_wiki
[xiatian@freebsd pg_amcheck]$

Do you mean that even a REINDEX isn't enough to stop the index from
giving this particular wrong answer, where it thinks that 'Belfast' is
the same as 'Kaga'? Even a REINDEX won't make it stop doing that? Or,
are you concerned that pg_amcheck doesn't detect a remaining problem
with one index?

The latter. REINDEX fixes the problem with page_main_title. I apologise
for not being clear enough; I have never looked too deep into
PostgreSQL's inner workings before and have been using it purely as a
regular application developer.

What does pg_stat_user_tables say about autovacuum?

For "mediawiki.page" specifically:

last_vacuum: 2021-10-24 12:21:22.256194+09
last_autovacuum: 2021-10-05 18:57:43.850106+09
last_analyze: 2021-10-24 18:48:04.706226+09
last_autoanalyze: 2021-10-09 19:38:36.441795+09

vacuum_count is 20 (the periodic daily script), autovacuum_count is 1.

Considering the constraints for the index in question, I can tell that
changes affecting it were made on the 22nd of October and before, and
then yesterday on the 24th. The first user report regarding the problem
was on the 23rd, unless, of course, it went unnoticed before.

--
K. R.

#12Andrew Gierth
andrew@tao11.riddles.org.uk
In reply to: Alvaro Herrera (#9)
Re: BUG #17245: Index corruption involving deduplicated entries

"Alvaro" == Alvaro Herrera <alvherre@alvh.no-ip.org> writes:

Alvaro> Do you happen to have WAL traffic for this table and index?

Unfortunately there's no WAL archiving, and no available WAL files from
before the state was known to be corrupt. We'll look into enabling
archiving in case the problem recurs or can be reproduced.

--
Andrew (irc:RhodiumToad)

#13Kamigishi Rei
iijima.yun@koumakan.jp
In reply to: Peter Geoghegan (#5)
Re: BUG #17245: Index corruption involving deduplicated entries

On 25.10.2021 0:52, Peter Geoghegan wrote:

On Sun, Oct 24, 2021 at 4:40 AM PG Bug reporting form
<noreply@postgresql.org> wrote:

PostgreSQL version: 14.0
Operating system: FreeBSD 13.0-RELEASE

PostgreSQL 14 is the first version where the default wal_sync_method
is fdatasync() on FreeBSD -- though only with FreeBSD 13. See commit
f900a79e. Perhaps that has something to do with the problem seen here.

To clarify: I had been running PostgreSQL 13.3 since the 18th of May
(upgraded from 13.2) until the upgrade to 14.0.

The REL_13_3 tag seems to include the change from f900a79e (matching the
commit message: ‘Like commit 576477e73c4, which did the same for Linux,
back-patch to all supported releases.’) and both the 13.3 package from
my package manager's cache and the 13.4 package available from FreeBSD
repositories at the moment uses fdatasync() as well:

[root@lincle-backup ~]# postgres --version
postgres (PostgreSQL) 13.3
[root@lincle-backup ~]# postgres --describe-config| grep wal_sync_method
wal_sync_method sighup  Write-Ahead Log / Settings      ENUM
fdatasync                       Selects the method used for forcing WAL
updates to disk.

[root@lincle-backup ~]# postgres --version
postgres (PostgreSQL) 13.4
[root@lincle-backup ~]# postgres --describe-config| grep wal_sync_method
wal_sync_method sighup  Write-Ahead Log / Settings      ENUM
fdatasync                       Selects the method used for forcing WAL
updates to disk.

According to the logs I have (daily reports), there have been no crashes
or unclean shutdowns of PostgreSQL on the server since the database jail
re-creation time (August of 2014 or so).

Vacuum has been performed daily on all databases since at least around
2017 via the daily periodic script.

--
K. R.

#14Thomas Munro
thomas.munro@gmail.com
In reply to: Kamigishi Rei (#13)
Re: BUG #17245: Index corruption involving deduplicated entries

On Mon, Oct 25, 2021 at 10:50 PM K. R. <iijima.yun@koumakan.jp> wrote:

On 25.10.2021 0:52, Peter Geoghegan wrote:

On Sun, Oct 24, 2021 at 4:40 AM PG Bug reporting form
<noreply@postgresql.org> wrote:

PostgreSQL version: 14.0
Operating system: FreeBSD 13.0-RELEASE

PostgreSQL 14 is the first version where the default wal_sync_method
is fdatasync() on FreeBSD -- though only with FreeBSD 13. See commit
f900a79e. Perhaps that has something to do with the problem seen here.

To clarify: I had been running PostgreSQL 13.3 since the 18th of May
(upgraded from 13.2) until the upgrade to 14.0.

The REL_13_3 tag seems to include the change from f900a79e (matching the
commit message: ‘Like commit 576477e73c4, which did the same for Linux,
back-patch to all supported releases.’) and both the 13.3 package from
my package manager's cache and the 13.4 package available from FreeBSD
repositories at the moment uses fdatasync() as well:

For the record, commit f900a79e was about preserving pre-existing
behaviour dating back a few years (ie avoiding a change). The history
is like this:

On ancient FreeBSD releases, PostgreSQL would default to wal_sync_method=fsync.

If built on FreeBSD 11.1+ (July 2017), it would select
wal_sync_method=fdatasync as the default, because the fdatasync()
system call was added and would be detected by the configure script.

If built on FreeBSD 13.0+ (April 2021), it would traditionally have
selected wal_sync_method=open_datasync, because the O_DSYNC flag was
added to <fcntl.h>. But commit f900a79e changed PostgreSQL 13.3+
(April 2021) and sibling releases to prefer wal_sync_method=fdatasync
still, despite the presence of O_DSYNC.

Given that this system upgraded PostgreSQL 13.2->13.3 on the 18th of
May, and assuming it had installed FreeBSD 13.0 and PostgreSQL 13.2
around the time of FreeBSD 13.0's release on the 13th of April, it
would have been running with wal_sync_method=open_datasync (assuming
default used) in that 5 week window. I could talk about what exact
effects that would have if I knew which file system we're talking
about here, but it doesn't seems to be relevant to this case, given
that "The database in question was restored from a pg_dumpall backup
_three weeks ago_ and I'm told there have been no crashes or even
unclean restarts since then" (my emphasis).

#15Kamigishi Rei
iijima.yun@koumakan.jp
In reply to: Thomas Munro (#14)
Re: BUG #17245: Index corruption involving deduplicated entries

On 25.10.2021 23:49, Thomas Munro wrote:

For the record, commit f900a79e was about preserving pre-existing
behaviour dating back a few years (ie avoiding a change). The history
is like this:

On ancient FreeBSD releases, PostgreSQL would default to wal_sync_method=fsync.

If built on FreeBSD 11.1+ (July 2017), it would select
wal_sync_method=fdatasync as the default, because the fdatasync()
system call was added and would be detected by the configure script.

If built on FreeBSD 13.0+ (April 2021), it would traditionally have
selected wal_sync_method=open_datasync, because the O_DSYNC flag was
added to <fcntl.h>. But commit f900a79e changed PostgreSQL 13.3+
(April 2021) and sibling releases to prefer wal_sync_method=fdatasync
still, despite the presence of O_DSYNC.

Thank you, this confirms the assumptions I had after reading through the
sources of different releases and testing the --describe-config output
from old packages in my package manager's cache.

Given that this system upgraded PostgreSQL 13.2->13.3 on the 18th of
May, and assuming it had installed FreeBSD 13.0 and PostgreSQL 13.2
around the time of FreeBSD 13.0's release on the 13th of April, it
would have been running with wal_sync_method=open_datasync (assuming
default used) in that 5 week window. I could talk about what exact
effects that would have if I knew which file system we're talking
about here, but it doesn't seems to be relevant to this case, given
that "The database in question was restored from a pg_dumpall backup
_three weeks ago_ and I'm told there have been no crashes or even
unclean restarts since then" (my emphasis).

In case this might be important later: everything is on ZFS (upgraded
from FreeBSD's old zfs port to openzfs, as imported in FreeBSD 13.0).

Log entries related to the upgrade:
Oct  5 09:55:27 db pkg[7101]: postgresql13-server-13.3_1 deinstalled
Oct  5 09:55:41 db pkg[7144]: postgresql14-server-14.0 installed

This was preceded by pg_dumpall and followed by initdb with the exact
same parameters (using unchanged rc.conf
postgresql_initdb_flags="--encoding=utf-8 --lc-collate=en_GB.UTF-8") and
the import of the dump using psql. I do not use pg_upgrade since
different versions of PostgreSQL cannot be installed at the same time in
FreeBSD using the ports/packages system, and I do not find dump/restore
much a chore (plus it solves issues with possible locale changes).

There have been no crashes since; there was one reload (pg_hba.conf
edits) and several restarts (to snapshot the file structure with the
corrupted index, plus another enabling WAL archiving today in the morning).

I have postgresql-rum installed, if this matters; it is used by a
Pleroma instance in a separate database.

--
K. R.

In reply to: Kamigishi Rei (#15)
Re: BUG #17245: Index corruption involving deduplicated entries

On Mon, Oct 25, 2021 at 2:08 PM K. R. <iijima.yun@koumakan.jp> wrote:

There have been no crashes since; there was one reload (pg_hba.conf
edits) and several restarts (to snapshot the file structure with the
corrupted index, plus another enabling WAL archiving today in the morning).

Thank you for your help.

I wonder if you can show me a page that amcheck reports as having an
incorrect posting list? I am interested in posting list tuples that
are not just pointing to the wrong thing, but actually look wrong
without even looking at the heap.

You must have done this for Andrew already, but note that the
procedure is outlined here:

https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#contrib.2Fpageinspect_page_dump

For example, the amcheck error with "DETAIL: Index tid=(14,9)
posting list offset=110 page lsn=2/2C4F7CD8 btree index
"azurlane_wiki.mediawiki.page_len" has an "index TID" of 14, meaning
that the block number 14 from the index
"azurlane_wiki.mediawiki.page_len" is interesting to me. It could help
me with debugging.

I can treat this page image as confidential. You could send it to me privately.

Thanks again
--
Peter Geoghegan

#17Thomas Munro
thomas.munro@gmail.com
In reply to: Kamigishi Rei (#15)
Re: BUG #17245: Index corruption involving deduplicated entries

On Tue, Oct 26, 2021 at 10:08 AM K. R. <iijima.yun@koumakan.jp> wrote:

[...] I do not use pg_upgrade since
different versions of PostgreSQL cannot be installed at the same time in
FreeBSD using the ports/packages system, and I do not find dump/restore
much a chore (plus it solves issues with possible locale changes).

It really would be nice to address that. (FWIW A few years ago I had a
mostly-working experiment-grade patch for the ports tree that could
install and run multiple versions in parallel, and the bones of a port
of the Debian wrapper scripts and cluster control tools so you can
list, start, stop, clone clusters etc, but I ran out of steam while
contemplating the squillions of other packages that interact with
PostgreSQL... happy to share the starter bits and pieces I have
off-list if anyone is interested...). But in this particular case it
significantly reduces the search area for the problem, so that's a
useful deficiency!

In reply to: Peter Geoghegan (#10)
Re: BUG #17245: Index corruption involving deduplicated entries

On Sun, Oct 24, 2021 at 6:33 PM Peter Geoghegan <pg@bowt.ie> wrote:

Anything is possible, but this doesn't look like logical index
corruption. If we assume that it was something like a faulty
deduplication pass, then we have to explain how the spurious TID ended
up in that particular leaf page to begin with. We also have to explain
why it is a valid TID though totally distinct TID, which is unlikely
to occur by chance.

I discussed the robustness of some of the posting list split code
privately with Heikki earlier today. Heikki suggested that I more
consistently invalidate the insert state that is used across
nbtinsert.c. Although I don't have any reason to believe that that
makes a difference, I still agree with his conclusions.

Attached patch does this. It also makes the special posting split
related call to _bt_binsrch_insert() at the end of _bt_findinsertloc()
more robust. There is a similar defensive elog() within
_bt_swap_posting(), but that only kicks in when the posting offset is
totally out of bounds for the tuple. _bt_binsrch_insert() could still
give answers for some other reason, at least in theory, and so it
seems like a good idea to be extra careful, and throw an error at the
earliest opportunity.

--
Peter Geoghegan

Attachments:

v1-0001-Harden-nbtinsert.c-posting-split-code.patchapplication/octet-stream; name=v1-0001-Harden-nbtinsert.c-posting-split-code.patchDownload+13-4
In reply to: Peter Geoghegan (#16)
Re: BUG #17245: Index corruption involving deduplicated entries

On Mon, Oct 25, 2021 at 2:29 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Mon, Oct 25, 2021 at 2:08 PM K. R. <iijima.yun@koumakan.jp> wrote:

There have been no crashes since; there was one reload (pg_hba.conf
edits) and several restarts (to snapshot the file structure with the
corrupted index, plus another enabling WAL archiving today in the morning).

Thank you for your help.

Thank you for sharing page images with me privately -- that was very
helpful. I can see the same basic issue in all 3 of the pages that you
said amcheck reports as corrupt (the 3 pages that you shared). As I
suspected, there are posting lists that contain duplicate TIDs, but
look normal in every other way. This is very subtle.

A few more questions for you, if you don't mind:

1. Can you tell me which tables the 4 known-corrupt indexes
(page_redirect_namespace_len, page_len, transcode_key_idx, and
page_main_title) are defined against?
2. Can you show me the definition of the table or tables, using \d from psql?
3. Do you notice any consistent pattern here? For example, are foreign
keys involved?
4. How do the table or tables tend to get updated?

Thanks again

Executive summary on my progress:

This now looks like it might be related to code on the heapam side,
perhaps in heapam_index_build_range_scan(), which has a history of
getting confused about duplicate TIDs with HOT chains (e.g., see
Alvaro's commit a811ea5bde from late 2020).

Here is why I now suspect heapam_index_build_range_scan():

It disturbed me that even "heapallindexed" amcheck verification did
not detect the original reported problem, where we see index scans
that give wrong answers (a scan of the page_main_title index shows an
extra non-matching row, per Andrew's original message). We saw that
amcheck does notice a few corrupt posting list tuples in other nearby
indexes, but not this bigger problem. That eventually made me think
about table_index_build_scan()/heapam_index_build_range_scan(), which
is used by amcheck for heapallindexed verification -- most of the
important work is outsourced to that core code. amcheck more or less
performs heapallindexed verification by comparing the tuples that a
REINDEX would see against the actual tuples it fingerprinting from the
index.

I wonder if heapallindexed verification doesn't detect corruption in
the page_main_title index because "the table seems to agree with the
index", even though the actual details are clearly wrong. In other
words, perhaps even the table_index_build_scan() call inside amcheck's
verify_ntree.c file says that there should be duplicate TIDs in the
index, even though that's always wrong, by definition. What if
table_index_build_scan() is actually the problem?

If my hypothesis is true, then we might expect problems to not even go
away following a REINDEX. Or maybe they would go away sometimes, but
not other times.

This idea is supported by remarks from Herman verschooten on Slack
[1]: https://postgresteam.slack.com/archives/C0FS3UTAP/p1635161173202100?thread_ts=1635154585.197300&amp;cid=C0FS3UTAP
quote him directly:

"""
tranman_production=# CREATE INDEX index_freights_on_cmr_received ON
public.freights USING btree (cmr_received) with (deduplicate_items =
off);
CREATE INDEX
tranman_production=# \set VERBOSITY verbose
tranman_production=# update freights set cmr_received=false where id=49632;
ERROR: XX000: posting list tuple with 20 items cannot be split at offset 168
LOCATION: _bt_swap_posting, nbtdedup.c:1037
tranman_production=# drop index index_freights_on_cmr_received ;
DROP INDEX
tranman_production=# update freights set cmr_received=false where id=49632;
UPDATE 1
"""

Notice that the problem seems to be confined to one index, which has
deduplication explicitly disabled. We see an error message about a
posting list tuple all the same. I think that that's just a downstream
consequence of duplicate TIDs appearing in the index. It looks like
the corruption exists and is independent of whether the index is
present or not. When the index is present the problem becomes obvious,
but it's still there either way.

The supposed posting list tuple with 20 items that the error message
goes on about is actually just a plain tuple. The number 20 comes from
its IndexTupleData.t_tid offset number. In other words, it's an offset
number from the index tuple's pointed-to heap TID, which has been
misinterpreted as something else by BTreeTupleGetNPosting().
BTreeTupleGetNPosting() expects to never be called against such a
tuple, but it can happen once we assume duplicate TIDs are possible.
(You'd see an assertion failure if asserts were enabled, but of course
they're not enabled here.)

Note that this independently reported "freights" case from Herman has
the same amcheck issue we see with the mediawiki indexes on this
thread: amcheck won't actually complain about the clearly corrupt
index_freights_on_cmr_received index, but will complain about other
indexes on the same table. This time around amcheck complains about
out-of-order index tuples, but I strongly suspect that that's
practically the same condition as the "posting list contains misplaced
TID" error messages we see here. In other words, I think that both
cases just have duplicate heap TIDs, and that the variation in error
messages is totally unrelated to the true underlying problem.

[1]: https://postgresteam.slack.com/archives/C0FS3UTAP/p1635161173202100?thread_ts=1635154585.197300&amp;cid=C0FS3UTAP
[2]: /messages/by-id/8CDB73C1-E3AF-40A6-BA81-8AFE174C6402@verschooten.net -- Peter Geoghegan
--
Peter Geoghegan

In reply to: Peter Geoghegan (#19)
Re: BUG #17245: Index corruption involving deduplicated entries

On Tue, Oct 26, 2021 at 1:36 PM Peter Geoghegan <pg@bowt.ie> wrote:

A few more questions for you, if you don't mind:

1. Can you tell me which tables the 4 known-corrupt indexes
(page_redirect_namespace_len, page_len, transcode_key_idx, and
page_main_title) are defined against?
2. Can you show me the definition of the table or tables, using \d from psql?
3. Do you notice any consistent pattern here? For example, are foreign
keys involved?
4. How do the table or tables tend to get updated?

Also:

5. Do you use CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY much?

--
Peter Geoghegan

#21Kamigishi Rei
iijima.yun@koumakan.jp
In reply to: Peter Geoghegan (#19)
In reply to: Kamigishi Rei (#21)
In reply to: Peter Geoghegan (#22)
#24Kamigishi Rei
iijima.yun@koumakan.jp
In reply to: Peter Geoghegan (#22)
#25Kamigishi Rei
iijima.yun@koumakan.jp
In reply to: Peter Geoghegan (#23)
#26David Rowley
dgrowleyml@gmail.com
In reply to: Peter Geoghegan (#22)
In reply to: Kamigishi Rei (#24)
In reply to: Kamigishi Rei (#25)
In reply to: David Rowley (#26)
In reply to: Peter Geoghegan (#28)
#31Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#28)
#32Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#30)
#33Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#31)
#34Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Andres Freund (#31)
#35Andres Freund
andres@anarazel.de
In reply to: Andrew Gierth (#12)
#36Kamigishi Rei
iijima.yun@koumakan.jp
In reply to: Andres Freund (#35)
#37Andres Freund
andres@anarazel.de
In reply to: Kamigishi Rei (#36)
In reply to: Thomas Munro (#33)
In reply to: Peter Geoghegan (#38)
In reply to: Andres Freund (#32)
In reply to: Andres Freund (#31)
#42Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#38)
In reply to: Andres Freund (#42)
In reply to: Andres Freund (#32)
In reply to: Pavel Borisov (#34)
#46Thomas Munro
thomas.munro@gmail.com
In reply to: Peter Geoghegan (#45)
#47Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#46)
#48Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#43)
In reply to: Andres Freund (#48)
In reply to: Kamigishi Rei (#36)
In reply to: Peter Geoghegan (#49)
#52Kamigishi Rei
iijima.yun@koumakan.jp
In reply to: Andres Freund (#37)
#53Alexander Kukushkin
cyberdemn@gmail.com
In reply to: Peter Geoghegan (#51)
#54Kamigishi Rei
iijima.yun@koumakan.jp
In reply to: Peter Geoghegan (#50)
#55Kamigishi Rei
iijima.yun@koumakan.jp
In reply to: Andres Freund (#42)
#56Kamigishi Rei
iijima.yun@koumakan.jp
In reply to: Kamigishi Rei (#52)
In reply to: Kamigishi Rei (#56)
#58Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Kamigishi Rei (#56)
#59Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Peter Geoghegan (#57)
In reply to: Alvaro Herrera (#59)
In reply to: Kamigishi Rei (#54)
In reply to: Kamigishi Rei (#55)
#63Kamigishi Rei
iijima.yun@koumakan.jp
In reply to: Peter Geoghegan (#57)
In reply to: Peter Geoghegan (#57)
#65Kamigishi Rei
iijima.yun@koumakan.jp
In reply to: Peter Geoghegan (#64)
In reply to: Kamigishi Rei (#65)
#67Kamigishi Rei
iijima.yun@koumakan.jp
In reply to: Kamigishi Rei (#56)
In reply to: Kamigishi Rei (#67)
In reply to: Peter Geoghegan (#68)
#70Andres Freund
andres@anarazel.de
In reply to: Kamigishi Rei (#67)
#71Andres Freund
andres@anarazel.de
In reply to: Kamigishi Rei (#67)
#72Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#71)
#73Kamigishi Rei
iijima.yun@koumakan.jp
In reply to: Andres Freund (#70)
#74Andres Freund
andres@anarazel.de
In reply to: Kamigishi Rei (#73)
In reply to: Andres Freund (#72)
#76Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#74)
In reply to: Andres Freund (#71)
#78Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#77)
#79Kamigishi Rei
iijima.yun@koumakan.jp
In reply to: Andres Freund (#76)
#80Andres Freund
andres@anarazel.de
In reply to: Kamigishi Rei (#79)
#81Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#80)
#82Kamigishi Rei
iijima.yun@koumakan.jp
In reply to: Andres Freund (#80)
#83Kamigishi Rei
iijima.yun@koumakan.jp
In reply to: Andres Freund (#81)
In reply to: Andres Freund (#81)
#85Kamigishi Rei
iijima.yun@koumakan.jp
In reply to: Peter Geoghegan (#84)
#86Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#84)
In reply to: Andres Freund (#86)
#88Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#87)
In reply to: Andres Freund (#88)
In reply to: Peter Geoghegan (#89)
#91Pavel Borisov
pashkin.elfe@gmail.com
In reply to: Peter Geoghegan (#90)
In reply to: Peter Geoghegan (#90)
#93Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Geoghegan (#92)
In reply to: Masahiko Sawada (#93)
In reply to: Peter Geoghegan (#89)
#96Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#95)
In reply to: Andres Freund (#96)
#98Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#97)
In reply to: Andres Freund (#98)
#100Justin Pryzby
pryzby@telsasoft.com
In reply to: Peter Geoghegan (#99)
In reply to: Justin Pryzby (#100)
In reply to: Alexander Kukushkin (#53)
#103Alexander Kukushkin
cyberdemn@gmail.com
In reply to: Peter Geoghegan (#102)