BUG #17485: Records missing from Primary Key index when doing REINDEX INDEX CONCURRENTLY
The following bug has been logged on the website:
Bug reference: 17485
Logged by: Peter Slavov
Email address: pet.slavov@gmail.com
PostgreSQL version: 14.3
Operating system: Ubuntu 20.04
Description:
I a noticed a problem in our production database when rebuilding the primary
key index. The database has a lot of INSERTS and UPDATES when this is
happening. I was able to reproduce this on PostgreSQL 14.1/2/3 locally on
docker instance and on AWS EC2.
Here is how you can reproduce:
- Load table `test2` from this file containing the table structure and small
testing dataset:
https://drive.google.com/file/d/1vCqKOda3IIrHmWDzNNJaDQfCb5eQA_D2/view?usp=sharing
- Start inserting and updating in this table using pgbench:
SHELL: pgbench -r -T 1000 -f test.sql -c 50 -j 50 receiptbank
test.sql file:
https://drive.google.com/file/d/1mdduGqu1XcDg01iwSysF7szrIlCKUJQT/view?usp=sharing
- Reindex the PRIMARY KEY (it is possible to have to repeat this 2-3 times):
SQL: REINDEX INDEX CONCURRENTLY test2_pkey
- Check index with `amcheck` extension for errors:
SQL: select bt_index_check(index => c.oid, heapallindexed => true) from
pg_class c where oid = 'test2_pkey'::regclass;
SQL result for me:
ERROR: heap tuple (13134,18) from table "test2" lacks matching index
tuple within index "test2_pkey"
- Check for missing records in the index:
SQL: WITH missing_check AS (
SELECT t1.*, t1.ctid, EXISTS(SELECT 1 FROM test2 t2 WHERE
t2.id = t1.id) AS flag
FROM test2 t1)
SELECT id, ctid
FROM missing_check
WHERE flag = false;
Result for me was several records found ONLY with the sequential
scan...
This problem is possibly related to BUG #17478
Thanks for the help,
Peter Slavov
18 мая 2022 г., в 15:42, PG Bug reporting form <noreply@postgresql.org> написал(а):
I was able to reproduce this on PostgreSQL 14.1/2/3 locally on
docker instance and on AWS EC2.
Uhm, that's very...interesting. I'll look closely next week. Though I didn't have a chance to reproduce yet.
We have fixed similar problem in 14.1. And now we have very similar TAP test to you reproduction [0]https://github.com/postgres/postgres/blob/REL_14_STABLE/contrib/amcheck/t/002_cic.pl. How do you think, what's the key difference between TAP test and your repro?
Thanks! Best regards, Andrey Borodin.
[0]: https://github.com/postgres/postgres/blob/REL_14_STABLE/contrib/amcheck/t/002_cic.pl
On Thu, May 19, 2022 at 09:22:44AM +0500, Andrey Borodin wrote:
Uhm, that's very...interesting. I'll look closely next week. Though I didn't have a chance to reproduce yet.
We have fixed similar problem in 14.1. And now we have very similar
TAP test to you reproduction [0]. How do you think, what's the key
difference between TAP test and your repro?
Interesting, indeed. Another question I have: is this limited to v14
or are you able to see it in older versions? REINDEX CONCURRENTLY has
been introduced in v12.
--
Michael
На чт, 19.05.2022 г. в 7:32 ч. Michael Paquier <michael@paquier.xyz> написа:
On Thu, May 19, 2022 at 09:22:44AM +0500, Andrey Borodin wrote:
Uhm, that's very...interesting. I'll look closely next week. Though I
didn't have a chance to reproduce yet.
We have fixed similar problem in 14.1. And now we have very similar
TAP test to you reproduction [0]. How do you think, what's the key
difference between TAP test and your repro?Interesting, indeed. Another question I have: is this limited to v14
or are you able to see it in older versions? REINDEX CONCURRENTLY has
been introduced in v12.
--
Michael
Hi Michael,
Yes I have made the same test on PostgreSQL 13.7, but the reindex works as
expected there (no issues).
I haven't tested on older versions.
Peter
Hi Andrey,
This test looks similar to me but I cannot say what is the difference. My
tests are done under heavier load maybe.
Also I couldn't reproduce this with a table with low num number of columns
(integrer and text). I am not sure if this is relevant...
Peter
На чт, 19.05.2022 г. в 7:22 ч. Andrey Borodin <x4mmm@yandex-team.ru> написа:
Show quoted text
18 мая 2022 г., в 15:42, PG Bug reporting form <noreply@postgresql.org>
написал(а):
I was able to reproduce this on PostgreSQL 14.1/2/3 locally on
docker instance and on AWS EC2.Uhm, that's very...interesting. I'll look closely next week. Though I
didn't have a chance to reproduce yet.
We have fixed similar problem in 14.1. And now we have very similar TAP
test to you reproduction [0]. How do you think, what's the key difference
between TAP test and your repro?Thanks! Best regards, Andrey Borodin.
[0]
https://github.com/postgres/postgres/blob/REL_14_STABLE/contrib/amcheck/t/002_cic.pl
On Thu, May 19, 2022 at 08:24:27AM +0300, Петър Славов wrote:
Yes I have made the same test on PostgreSQL 13.7, but the reindex works as
expected there (no issues).
I haven't tested on older versions.
Okay, thanks. Something that has changed in this area is the
addition of c98763b, where spurious waits are avoided in some of the
phaese of REINDEX CONCURRENTLY. I am wondering if this is related.
--
Michael
On 2022-May-19, Michael Paquier wrote:
Okay, thanks. Something that has changed in this area is the
addition of c98763b, where spurious waits are avoided in some of the
phaese of REINDEX CONCURRENTLY. I am wondering if this is related.
Hmm, yes, it's definitely possible that it is related.
I'll have a look soon.
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Once again, thank you and all of the developers for your hard work on
PostgreSQL. This is by far the most pleasant management experience of
any database I've worked on." (Dan Harris)
http://archives.postgresql.org/pgsql-performance/2006-04/msg00247.php
On Thu, May 19, 2022 at 02:03:56PM +0200, Alvaro Herrera wrote:
Hmm, yes, it's definitely possible that it is related.
I'll have a look soon.
It took me some time to write a script to bisect that, but I have been
able to establish a correlation with d9d0762 that causes VACUUM to
ignore transactions doing some concurrent reindex operations. I would
not be surprised to see that this is also related to some of the
reports we have seen lately with reindex operations. There was one
with logical replication and missing records from a primary key, I
recall.
For the stable branches of 14 and 15, I would tend to play it safe and
revert d9d0762. I have to admit that f9900df and c98763b stress me a
bit, and that we have not have anticipated all the ramifications of
this set of changes. Thoughts?
--
Michael
Attachments:
0001-Revert-VACUUM-ignore-indexing-operations-with-CONCUR.patchtext/x-diff; charset=us-asciiDownload+9-39
On 2022-May-23, Michael Paquier wrote:
It took me some time to write a script to bisect that, but I have been
able to establish a correlation with d9d0762 that causes VACUUM to
ignore transactions doing some concurrent reindex operations. I would
not be surprised to see that this is also related to some of the
reports we have seen lately with reindex operations. There was one
with logical replication and missing records from a primary key, I
recall.For the stable branches of 14 and 15, I would tend to play it safe and
revert d9d0762. I have to admit that f9900df and c98763b stress me a
bit, and that we have not have anticipated all the ramifications of
this set of changes. Thoughts?
Wow, thanks for researching that over the weekend.
I think if this is a big enough deal (and I think it may be) then IMO we
should revert as you suggest, make an out-of-schedule release, and then
I can take some time to investigate in more depth and see if the feature
can be salvaged.
OTOH if we think an out-of-schedule release is not warranted, then
reverting right now is not useful; we can make a decision about that
closer to the next minor release, once we've had time to see if the bug
can be fixed in some other way that doesn't break the whole feature.
Opinions?
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Saca el libro que tu religión considere como el indicado para encontrar la
oración que traiga paz a tu alma. Luego rebootea el computador
y ve si funciona" (Carlos Duclós)
23 мая 2022 г., в 13:07, Alvaro Herrera <alvherre@alvh.no-ip.org> написал(а):
Opinions?
I think revert+release is not really a good idea until we understand how this commit breaks things.
Chances are that it only affects frequency of the reproduction.
When we will understand what is root cause of the bug - it won't take much time to fix things.
Best regards, Andrey Borodin.
23 мая 2022 г., в 10:40, Michael Paquier <michael@paquier.xyz> написал(а):
It took me some time to write a script to bisect that, but I have been
able to establish a correlation with d9d0762 that causes VACUUM to
ignore transactions doing some concurrent reindex operations.
I've transformed Peter's test into TAP test that runs ~20 seconds and reliably reproduces problem on my laptop.
And I observe that commenting out condition in following code fixes the test.
//if (!(statusFlags & PROC_IN_SAFE_IC))
h->data_oldest_nonremovable =
TransactionIdOlder(h->data_oldest_nonremovable, xmin);
Best regards, Andrey Borodin.
Attachments:
test.diffapplication/octet-stream; name=test.diff; x-unix-mode=0644Download+67-0
23 мая 2022 г., в 15:49, Andrey Borodin <x4mmm@yandex-team.ru> написал(а):
I've transformed Peter's test into TAP test that runs ~20 seconds and reliably reproduces problem on my laptop.
I found out one interesting thing: unindexed tuple (that comes from amcheck scan) does not exist in heap page at the moment of check fail.
I've added ReadBuffer() in case of bloom_lacks_element() and ItemIdHasStorage() is false.
I understand that this description is a too vague, so I attached a patch for amcheck relaxing bt_index_check() so the test would pass.
Best regards, Andrey Borodin.
Attachments:
skip_vacuumed_tuple.diffapplication/octet-stream; name=skip_vacuumed_tuple.diff; x-unix-mode=0644Download+22-2
On Mon, May 23, 2022 at 2:02 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
I think revert+release is not really a good idea until we understand how this commit breaks things.
Chances are that it only affects frequency of the reproduction.
+1 -- it's been in a stable release for months now, and we will
probably know the exact nature of the problem in just a few more days.
There is no reason to decide that the feature needs to be reverted
before anything else. Or if there is I would like to hear it.
--
Peter Geoghegan
On Mon, May 23, 2022 at 6:06 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
I found out one interesting thing: unindexed tuple (that comes from amcheck scan) does not exist in heap page at the moment of check fail.
That could just be a "downstream problem" from HOT chain corruption.
Maybe you'd get a clearer/earlier failure if you also applied this
patch, on an assertion-enabled build:
/messages/by-id/CAH2-Wzk2LeWPwz1wcKNz7Fux4Ogn+PC81H+q7Q7no-5XT0dx3w@mail.gmail.com
--
Peter Geoghegan
On Mon, May 23, 2022 at 10:07:44AM +0200, Alvaro Herrera wrote:
I think if this is a big enough deal (and I think it may be) then IMO we
should revert as you suggest, make an out-of-schedule release, and then
I can take some time to investigate in more depth and see if the feature
can be salvaged.OTOH if we think an out-of-schedule release is not warranted, then
reverting right now is not useful; we can make a decision about that
closer to the next minor release, once we've had time to see if the bug
can be fixed in some other way that doesn't break the whole feature.
The annoying part is that this can cause silent corruptions for
indexes created with REINDEX and CIC, so most users won't know about
the failure until they see that their application is broken. And we
are just talking about a btree index here, other index AMs may be
similarly impacted. So that's rather bad IMHO :/
It seems to me that the problem is around the wait phase after the
validation, where the computation of limitXmin coming from the
snapshot used for the validation ignores now the impact of VACUUM,
hence impacting the timing when the index can be safely used. It also
looks like it is possible to build an isolation test, where we use a
transaction with a snapshot older than the REINDEX to force it to
wait in the first WaitForOlderSnapshots() call.
--
Michael
On Mon, May 23, 2022 at 03:49:02PM +0500, Andrey Borodin wrote:
I've transformed Peter's test into TAP test that runs ~20 seconds
and reliably reproduces problem on my laptop.
Thanks for the TAP test. That's nice. It actually passes here,
reliably.
And I observe that commenting out condition in following code fixes the test.
//if (!(statusFlags & PROC_IN_SAFE_IC))
h->data_oldest_nonremovable =
TransactionIdOlder(h->data_oldest_nonremovable, xmin);
Well, by doing so, I think that you are just making the CIC/REINDEX
wait again until the index is safe to use, but we want to skip this
wait as of the optimization done in d9d0762.
--
Michael
On Mon, May 23, 2022 at 6:20 PM Michael Paquier <michael@paquier.xyz> wrote:
And I observe that commenting out condition in following code fixes the test.
//if (!(statusFlags & PROC_IN_SAFE_IC))
h->data_oldest_nonremovable =
TransactionIdOlder(h->data_oldest_nonremovable, xmin);Well, by doing so, I think that you are just making the CIC/REINDEX
wait again until the index is safe to use, but we want to skip this
wait as of the optimization done in d9d0762.
Uh...isn't that exactly the point that Andrey made himself, in posting
the snippet?
You seem to be addressing this PROC_IN_SAFE_IC snippet as if Andrey
formally proposed it as a bugfix, which seems like an odd
interpretation to me. It seems pretty clear to me that Andrey was just
making an observation, in case it helped with debugging.
--
Peter Geoghegan
On 2022-May-23, Peter Geoghegan wrote:
You seem to be addressing this PROC_IN_SAFE_IC snippet as if Andrey
formally proposed it as a bugfix, which seems like an odd
interpretation to me. It seems pretty clear to me that Andrey was just
making an observation, in case it helped with debugging.
Right.
I approached it yesterday by running the test case with each
set_indexsafe_procflags() callsite commented out, see which one breaks
things. Didn't reach any conclusion because I ran into thermal problems
with my laptop, which got me angry and couldn't make any further
progress.
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"I'm always right, but sometimes I'm more right than other times."
(Linus Torvalds)
On 24 May 2022, at 14:02, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
I approached it yesterday by running the test case with each
set_indexsafe_procflags() callsite commented out, see which one breaks
things
On my machine commenting out set_indexsafe_procflags() before "Phase 3 of concurrent index build" fixes the tests.
On 23 May 2022, at 23:18, Peter Geoghegan <pg@bowt.ie> wrote:
Maybe you'd get a clearer/earlier failure if you also applied this
patch, on an assertion-enabled build
I've tried this approach, but nothing actually seem to change... BTW I used a three-way-merge rebase, there is a slight conflict in comments. But, luckily, comments don't run.
On 24 May 2022, at 06:20, Michael Paquier <michael@paquier.xyz> wrote:
Thanks for the TAP test. That's nice. It actually passes here,
reliably.
IDK. Maybe if you increase --time of pgbench you will observe the problem...
On 24 May 2022, at 07:19, Peter Geoghegan <pg@bowt.ie> wrote:
Andrey was just
making an observation, in case it helped with debugging.
Yes, I'm not proposing to commit anything so far. All my tests, snippets, diffs here are only debug stuff.
Thank you!
Best regards, Andrey Borodin.
Hi,
On 2022-05-24 11:02:12 +0200, Alvaro Herrera wrote:
On 2022-May-23, Peter Geoghegan wrote:
You seem to be addressing this PROC_IN_SAFE_IC snippet as if Andrey
formally proposed it as a bugfix, which seems like an odd
interpretation to me. It seems pretty clear to me that Andrey was just
making an observation, in case it helped with debugging.Right.
I approached it yesterday by running the test case with each
set_indexsafe_procflags() callsite commented out, see which one breaks
things. Didn't reach any conclusion because I ran into thermal problems
with my laptop, which got me angry and couldn't make any further
progress.
Do we have any idea what really causes the corruption?
One thing that'd be worth excluding is the use of parallel index builds.
Greetings,
Andres Freund