BUG #17485: Records missing from Primary Key index when doing REINDEX INDEX CONCURRENTLY

[0]: https://github.com/postgres/postgres/blob/REL_14_STABLE/contrib/amcheck/t/002_cic.pl

amborodin@acm.org

about 4 years ago

In reply to: PG Bug reporting form (#1)

Re: BUG #17485: Records missing from Primary Key index when doing REINDEX INDEX CONCURRENTLY

On 18 May 2022, at 15:42, PG Bug reporting form <noreply@postgresql.org> wrote:
I was able to reproduce this on PostgreSQL 14.1/2/3 locally on
docker instance and on AWS EC2.

Uhm, that's very...interesting. I'll look closely next week, though I haven't had a chance to reproduce it yet.
We fixed a similar problem in 14.1, and we now have a TAP test very similar to your reproduction [0]. What do you think is the key difference between the TAP test and your repro?

Thanks! Best regards, Andrey Borodin.

michael@paquier.xyz

about 4 years ago

In reply to: Andrey Borodin (#2)

Re: BUG #17485: Records missing from Primary Key index when doing REINDEX INDEX CONCURRENTLY

On Thu, May 19, 2022 at 09:22:44AM +0500, Andrey Borodin wrote:

Uhm, that's very...interesting. I'll look closely next week, though I haven't had a chance to reproduce it yet.
We fixed a similar problem in 14.1, and we now have a TAP test very
similar to your reproduction [0]. What do you think is the key
difference between the TAP test and your repro?

Interesting, indeed. Another question I have: is this limited to v14
or are you able to see it in older versions? REINDEX CONCURRENTLY was
introduced in v12.
--
Michael

Петър Славов

pet.slavov@gmail.com

about 4 years ago

In reply to: Michael Paquier (#3)

Re: BUG #17485: Records missing from Primary Key index when doing REINDEX INDEX CONCURRENTLY

On Thu, 19.05.2022 at 7:32, Michael Paquier <michael@paquier.xyz> wrote:

On Thu, May 19, 2022 at 09:22:44AM +0500, Andrey Borodin wrote:

Uhm, that's very...interesting. I'll look closely next week, though I
haven't had a chance to reproduce it yet.
We fixed a similar problem in 14.1, and we now have a TAP test very
similar to your reproduction [0]. What do you think is the key
difference between the TAP test and your repro?

Interesting, indeed. Another question I have: is this limited to v14
or are you able to see it in older versions? REINDEX CONCURRENTLY was
introduced in v12.
--
Michael

Hi Michael,
Yes I have made the same test on PostgreSQL 13.7, but the reindex works as
expected there (no issues).
I haven't tested on older versions.

Peter

Петър Славов

pet.slavov@gmail.com

about 4 years ago

In reply to: Andrey Borodin (#2)

Re: BUG #17485: Records missing from Primary Key index when doing REINDEX INDEX CONCURRENTLY

Hi Andrey,
This test looks similar to me, but I cannot say what the difference is. Maybe my
tests are done under heavier load.
Also, I couldn't reproduce this with a table that has a small number of columns
(integer and text). I am not sure if this is relevant...

Peter

michael@paquier.xyz

about 4 years ago

In reply to: Петър Славов (#4)

Re: BUG #17485: Records missing from Primary Key index when doing REINDEX INDEX CONCURRENTLY

On Thu, May 19, 2022 at 08:24:27AM +0300, Петър Славов wrote:

Yes I have made the same test on PostgreSQL 13.7, but the reindex works as
expected there (no issues).
I haven't tested on older versions.

Okay, thanks. Something that has changed in this area is the
addition of c98763b, where spurious waits are avoided in some of the
phases of REINDEX CONCURRENTLY. I am wondering if this is related.
--
Michael

alvherre@2ndquadrant.com

about 4 years ago

In reply to: Michael Paquier (#6)

Re: BUG #17485: Records missing from Primary Key index when doing REINDEX INDEX CONCURRENTLY

On 2022-May-19, Michael Paquier wrote:

Okay, thanks. Something that has changed in this area is the
addition of c98763b, where spurious waits are avoided in some of the
phases of REINDEX CONCURRENTLY. I am wondering if this is related.

Hmm, yes, it's definitely possible that it is related.

I'll have a look soon.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Once again, thank you and all of the developers for your hard work on
PostgreSQL. This is by far the most pleasant management experience of
any database I've worked on." (Dan Harris)
http://archives.postgresql.org/pgsql-performance/2006-04/msg00247.php

michael@paquier.xyz

about 4 years ago

In reply to: Alvaro Herrera (#7)

Re: BUG #17485: Records missing from Primary Key index when doing REINDEX INDEX CONCURRENTLY

On Thu, May 19, 2022 at 02:03:56PM +0200, Alvaro Herrera wrote:

Hmm, yes, it's definitely possible that it is related.

I'll have a look soon.

It took me some time to write a script to bisect that, but I have been
able to establish a correlation with d9d0762 that causes VACUUM to
ignore transactions doing some concurrent reindex operations. I would
not be surprised to see that this is also related to some of the
reports we have seen lately with reindex operations. There was one
with logical replication and missing records from a primary key, I
recall.

For the stable branches of 14 and 15, I would tend to play it safe and
revert d9d0762. I have to admit that f9900df and c98763b stress me a
bit, and that we may not have anticipated all the ramifications of
this set of changes. Thoughts?
--
Michael

alvherre@2ndquadrant.com

about 4 years ago

In reply to: Michael Paquier (#8)

Re: BUG #17485: Records missing from Primary Key index when doing REINDEX INDEX CONCURRENTLY

On 2022-May-23, Michael Paquier wrote:

It took me some time to write a script to bisect that, but I have been
able to establish a correlation with d9d0762 that causes VACUUM to
ignore transactions doing some concurrent reindex operations. I would
not be surprised to see that this is also related to some of the
reports we have seen lately with reindex operations. There was one
with logical replication and missing records from a primary key, I
recall.

For the stable branches of 14 and 15, I would tend to play it safe and
revert d9d0762. I have to admit that f9900df and c98763b stress me a
bit, and that we may not have anticipated all the ramifications of
this set of changes. Thoughts?

Wow, thanks for researching that over the weekend.

I think if this is a big enough deal (and I think it may be) then IMO we
should revert as you suggest, make an out-of-schedule release, and then
I can take some time to investigate in more depth and see if the feature
can be salvaged.

OTOH if we think an out-of-schedule release is not warranted, then
reverting right now is not useful; we can make a decision about that
closer to the next minor release, once we've had time to see if the bug
can be fixed in some other way that doesn't break the whole feature.

Opinions?

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Take out the book that your religion considers the right one for finding the
prayer that brings peace to your soul. Then reboot the computer
and see if it works" (Carlos Duclós)

#10

amborodin@acm.org

about 4 years ago

In reply to: Alvaro Herrera (#9)

Re: BUG #17485: Records missing from Primary Key index when doing REINDEX INDEX CONCURRENTLY

On 23 May 2022, at 13:07, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

Opinions?

I think revert+release is not really a good idea until we understand how this commit breaks things.
Chances are that it only affects frequency of the reproduction.

Once we understand the root cause of the bug, it won't take much time to fix things.

Best regards, Andrey Borodin.

#11

amborodin@acm.org

about 4 years ago

In reply to: Michael Paquier (#8)

Re: BUG #17485: Records missing from Primary Key index when doing REINDEX INDEX CONCURRENTLY

On 23 May 2022, at 10:40, Michael Paquier <michael@paquier.xyz> wrote:

It took me some time to write a script to bisect that, but I have been
able to establish a correlation with d9d0762 that causes VACUUM to
ignore transactions doing some concurrent reindex operations.

I've transformed Peter's test into a TAP test that runs ~20 seconds and reliably reproduces the problem on my laptop.
And I observe that commenting out the condition in the following code fixes the test.
    //if (!(statusFlags & PROC_IN_SAFE_IC))
        h->data_oldest_nonremovable =
            TransactionIdOlder(h->data_oldest_nonremovable, xmin);

Best regards, Andrey Borodin.
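
For readers following the snippet above, the effect of that condition can be illustrated with a toy model in standalone C. This is only a sketch, not PostgreSQL code: ToyBackend, TOY_PROC_IN_SAFE_IC and toy_oldest_nonremovable() are hypothetical names, XIDs are treated as plain integers with no wraparound, and every other rule of the real horizon computation is left out. It shows how skipping a backend flagged as a safe concurrent index build lets the removal horizon move past that backend's xmin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit standing in for PROC_IN_SAFE_IC; not the real value. */
#define TOY_PROC_IN_SAFE_IC 0x01u

/* Toy stand-in for one backend's entry in the proc array. */
typedef struct ToyBackend
{
    uint32_t xmin;          /* oldest XID this backend's snapshot may need */
    unsigned status_flags;  /* only TOY_PROC_IN_SAFE_IC is modelled here */
} ToyBackend;

/*
 * Return the oldest xmin across the backends, starting from latest_xid.
 * When skip_safe_ic is nonzero, backends flagged TOY_PROC_IN_SAFE_IC do
 * not hold the horizon back (the behaviour with the condition in place).
 * When it is zero, every backend counts (the condition commented out).
 */
static uint32_t
toy_oldest_nonremovable(const ToyBackend *backends, size_t n,
                        uint32_t latest_xid, int skip_safe_ic)
{
    uint32_t horizon = latest_xid;

    assert(backends != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
    {
        if (skip_safe_ic && (backends[i].status_flags & TOY_PROC_IN_SAFE_IC))
            continue;   /* flagged backend is ignored */

        if (backends[i].xmin < horizon)
            horizon = backends[i].xmin;
    }
    return horizon;
}
```

With one flagged backend at xmin 100 and one ordinary backend at xmin 150, this model gives a horizon of 150 when flagged backends are skipped and 100 when every backend counts, which is the direction of the observation above.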

#12

amborodin@acm.org

about 4 years ago

In reply to: Andrey Borodin (#11)

Re: BUG #17485: Records missing from Primary Key index when doing REINDEX INDEX CONCURRENTLY

On 23 May 2022, at 15:49, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

I've transformed Peter's test into a TAP test that runs ~20 seconds and reliably reproduces the problem on my laptop.

I found out one interesting thing: the unindexed tuple (the one reported by the amcheck scan) does not exist in the heap page at the moment the check fails.
I've added a ReadBuffer() call in the bloom_lacks_element() case, and there ItemIdHasStorage() is false.
I understand that this description is too vague, so I attached a patch for amcheck relaxing bt_index_check() so the test would pass.

Best regards, Andrey Borodin.

#13

pg@bowt.ie

about 4 years ago

In reply to: Andrey Borodin (#10)

Re: BUG #17485: Records missing from Primary Key index when doing REINDEX INDEX CONCURRENTLY

On Mon, May 23, 2022 at 2:02 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

I think revert+release is not really a good idea until we understand how this commit breaks things.
Chances are that it only affects frequency of the reproduction.

+1 -- it's been in a stable release for months now, and we will
probably know the exact nature of the problem in just a few more days.
There is no reason to decide that the feature needs to be reverted
before anything else. Or if there is I would like to hear it.

--
Peter Geoghegan

#14

/messages/by-id/CAH2-Wzk2LeWPwz1wcKNz7Fux4Ogn+PC81H+q7Q7no-5XT0dx3w@mail.gmail.com

pg@bowt.ie

about 4 years ago

In reply to: Andrey Borodin (#12)

Re: BUG #17485: Records missing from Primary Key index when doing REINDEX INDEX CONCURRENTLY

On Mon, May 23, 2022 at 6:06 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

I found out one interesting thing: the unindexed tuple (the one reported by the amcheck scan) does not exist in the heap page at the moment the check fails.

That could just be a "downstream problem" from HOT chain corruption.

Maybe you'd get a clearer/earlier failure if you also applied this
patch, on an assertion-enabled build:

--
Peter Geoghegan

#15

michael@paquier.xyz

about 4 years ago

In reply to: Alvaro Herrera (#9)

Re: BUG #17485: Records missing from Primary Key index when doing REINDEX INDEX CONCURRENTLY

On Mon, May 23, 2022 at 10:07:44AM +0200, Alvaro Herrera wrote:

I think if this is a big enough deal (and I think it may be) then IMO we
should revert as you suggest, make an out-of-schedule release, and then
I can take some time to investigate in more depth and see if the feature
can be salvaged.

OTOH if we think an out-of-schedule release is not warranted, then
reverting right now is not useful; we can make a decision about that
closer to the next minor release, once we've had time to see if the bug
can be fixed in some other way that doesn't break the whole feature.

The annoying part is that this can cause silent corruptions for
indexes created with REINDEX and CIC, so most users won't know about
the failure until they see that their application is broken. And we
are just talking about a btree index here, other index AMs may be
similarly impacted. So that's rather bad IMHO :/

It seems to me that the problem is around the wait phase after the
validation, where the computation of limitXmin coming from the
snapshot used for the validation now ignores the impact of VACUUM,
hence impacting the timing when the index can be safely used. It also
looks like it is possible to build an isolation test, where we use a
transaction with a snapshot older than the REINDEX to force it to
wait in the first WaitForOlderSnapshots() call.
--
Michael
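
The waiting rule discussed above can be made concrete with a toy model in standalone C. It is not PostgreSQL code: SketchSnapshotHolder, SKETCH_IN_SAFE_IC and sketch_count_blockers() are hypothetical names, XIDs are plain integers with no wraparound handling, and the model assumes that the wait considers every backend whose snapshot xmin is at or before limitXmin unless that backend carries an excluded flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit standing in for PROC_IN_SAFE_IC; not the real value. */
#define SKETCH_IN_SAFE_IC 0x01u

/* Toy stand-in for a backend that holds a snapshot. */
typedef struct SketchSnapshotHolder
{
    uint32_t xmin;   /* oldest XID still visible to this backend's snapshot */
    unsigned flags;  /* only SKETCH_IN_SAFE_IC is modelled here */
} SketchSnapshotHolder;

/*
 * Count the backends a concurrent index build would have to wait for in
 * this model: those whose xmin is at or before limit_xmin, minus any that
 * carry a flag listed in exclude_flags.
 */
static size_t
sketch_count_blockers(const SketchSnapshotHolder *holders, size_t n,
                      uint32_t limit_xmin, unsigned exclude_flags)
{
    size_t blockers = 0;

    assert(holders != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
    {
        if (holders[i].flags & exclude_flags)
            continue;   /* excluded backend is never waited for */

        if (holders[i].xmin <= limit_xmin)
            blockers++;
    }
    return blockers;
}
```

Passing 0 as exclude_flags models a wait that considers every backend, while passing SKETCH_IN_SAFE_IC models the optimized wait that skips other concurrent index builds.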

#16

michael@paquier.xyz

about 4 years ago

In reply to: Andrey Borodin (#11)

Re: BUG #17485: Records missing from Primary Key index when doing REINDEX INDEX CONCURRENTLY

On Mon, May 23, 2022 at 03:49:02PM +0500, Andrey Borodin wrote:

I've transformed Peter's test into a TAP test that runs ~20 seconds
and reliably reproduces the problem on my laptop.

Thanks for the TAP test. That's nice. It actually passes here,
reliably.

And I observe that commenting out the condition in the following code fixes the test.
    //if (!(statusFlags & PROC_IN_SAFE_IC))
        h->data_oldest_nonremovable =
            TransactionIdOlder(h->data_oldest_nonremovable, xmin);

Well, by doing so, I think that you are just making the CIC/REINDEX
wait again until the index is safe to use, but we want to skip this
wait as of the optimization done in d9d0762.
--
Michael

#17

pg@bowt.ie

about 4 years ago

In reply to: Michael Paquier (#16)

Re: BUG #17485: Records missing from Primary Key index when doing REINDEX INDEX CONCURRENTLY

On Mon, May 23, 2022 at 6:20 PM Michael Paquier <michael@paquier.xyz> wrote:

And I observe that commenting out the condition in the following code fixes the test.
    //if (!(statusFlags & PROC_IN_SAFE_IC))
        h->data_oldest_nonremovable =
            TransactionIdOlder(h->data_oldest_nonremovable, xmin);

Well, by doing so, I think that you are just making the CIC/REINDEX
wait again until the index is safe to use, but we want to skip this
wait as of the optimization done in d9d0762.

Uh...isn't that exactly the point that Andrey made himself, in posting
the snippet?

You seem to be addressing this PROC_IN_SAFE_IC snippet as if Andrey
formally proposed it as a bugfix, which seems like an odd
interpretation to me. It seems pretty clear to me that Andrey was just
making an observation, in case it helped with debugging.

--
Peter Geoghegan

#18

alvherre@2ndquadrant.com

about 4 years ago

In reply to: Peter Geoghegan (#17)

Re: BUG #17485: Records missing from Primary Key index when doing REINDEX INDEX CONCURRENTLY

On 2022-May-23, Peter Geoghegan wrote:

You seem to be addressing this PROC_IN_SAFE_IC snippet as if Andrey
formally proposed it as a bugfix, which seems like an odd
interpretation to me. It seems pretty clear to me that Andrey was just
making an observation, in case it helped with debugging.

Right.

I approached it yesterday by running the test case with each
set_indexsafe_procflags() callsite commented out, to see which one breaks
things. I didn't reach any conclusion because I ran into thermal problems
with my laptop, which got me angry, and I couldn't make any further
progress.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"I'm always right, but sometimes I'm more right than other times."
(Linus Torvalds)

#19

amborodin@acm.org

about 4 years ago

In reply to: Alvaro Herrera (#18)

Re: BUG #17485: Records missing from Primary Key index when doing REINDEX INDEX CONCURRENTLY

On 24 May 2022, at 14:02, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

I approached it yesterday by running the test case with each
set_indexsafe_procflags() callsite commented out, to see which one breaks
things

On my machine commenting out set_indexsafe_procflags() before "Phase 3 of concurrent index build" fixes the tests.

On 23 May 2022, at 23:18, Peter Geoghegan <pg@bowt.ie> wrote:

Maybe you'd get a clearer/earlier failure if you also applied this
patch, on an assertion-enabled build

I've tried this approach, but nothing actually seems to change... BTW I used a three-way-merge rebase, there is a slight conflict in comments. But, luckily, comments don't run.

On 24 May 2022, at 06:20, Michael Paquier <michael@paquier.xyz> wrote:

Thanks for the TAP test. That's nice. It actually passes here,
reliably.

IDK. Maybe if you increase --time of pgbench you will observe the problem...

On 24 May 2022, at 07:19, Peter Geoghegan <pg@bowt.ie> wrote:

Andrey was just
making an observation, in case it helped with debugging.

Yes, I'm not proposing to commit anything so far. All my tests, snippets, diffs here are only debug stuff.

Thank you!

Best regards, Andrey Borodin.

#20

andres@anarazel.de

about 4 years ago

In reply to: Alvaro Herrera (#18)

Re: BUG #17485: Records missing from Primary Key index when doing REINDEX INDEX CONCURRENTLY

Hi,

On 2022-05-24 11:02:12 +0200, Alvaro Herrera wrote:

On 2022-May-23, Peter Geoghegan wrote:

You seem to be addressing this PROC_IN_SAFE_IC snippet as if Andrey
formally proposed it as a bugfix, which seems like an odd
interpretation to me. It seems pretty clear to me that Andrey was just
making an observation, in case it helped with debugging.

Right.

I approached it yesterday by running the test case with each
set_indexsafe_procflags() callsite commented out, to see which one breaks
things. I didn't reach any conclusion because I ran into thermal problems
with my laptop, which got me angry, and I couldn't make any further
progress.

Do we have any idea what really causes the corruption?

One thing that'd be worth excluding is the use of parallel index builds.

Greetings,

Andres Freund

#21

pg@bowt.ie

about 4 years ago

In reply to: Andres Freund (#20)

#22

andres@anarazel.de

about 4 years ago

In reply to: Andres Freund (#20)

#23

andres@anarazel.de

about 4 years ago

In reply to: Andres Freund (#22)

#24

amborodin@acm.org

about 4 years ago

In reply to: Andres Freund (#22)

#25

andres@anarazel.de

about 4 years ago

In reply to: Peter Geoghegan (#21)

#26

andres@anarazel.de

about 4 years ago

In reply to: Andrey Borodin (#24)

#27

bruce@momjian.us

about 4 years ago

In reply to: Andres Freund (#26)

#28

andres@anarazel.de

about 4 years ago

In reply to: Bruce Momjian (#27)

#29

michael@paquier.xyz

about 4 years ago

In reply to: Andres Freund (#28)

#30

amborodin@acm.org

about 4 years ago

In reply to: Andres Freund (#26)

#31

andres@anarazel.de

about 4 years ago

In reply to: Andrey Borodin (#30)

#32

alvherre@2ndquadrant.com

about 4 years ago

In reply to: Andres Freund (#26)

#33

Robert Haas

robertmhaas@gmail.com

about 4 years ago

In reply to: Alvaro Herrera (#32)

#34

alvherre@2ndquadrant.com

about 4 years ago

In reply to: Robert Haas (#33)

#35

andres@anarazel.de

about 4 years ago

In reply to: Alvaro Herrera (#34)

#36

Robert Haas

robertmhaas@gmail.com

about 4 years ago

In reply to: Alvaro Herrera (#34)

#37

pg@bowt.ie

about 4 years ago

In reply to: Robert Haas (#36)

#38

bruce@momjian.us

about 4 years ago

In reply to: Peter Geoghegan (#37)

#39

michael@paquier.xyz

about 4 years ago

In reply to: Bruce Momjian (#38)

#40

amborodin@acm.org

about 4 years ago

In reply to: Michael Paquier (#39)

#41

Петър Славов

pet.slavov@gmail.com

about 4 years ago

In reply to: Bruce Momjian (#38)

#42

bruce@momjian.us

about 4 years ago

In reply to: Michael Paquier (#39)

#43

Christophe Pettus

xof@thebuild.com

about 4 years ago

In reply to: Bruce Momjian (#42)

#44

bruce@momjian.us

about 4 years ago

In reply to: Christophe Pettus (#43)

#45

michael@paquier.xyz

about 4 years ago

In reply to: Bruce Momjian (#44)

#46

andres@anarazel.de

about 4 years ago

In reply to: Alvaro Herrera (#34)

#47

amborodin@acm.org

about 4 years ago

In reply to: Andres Freund (#46)

#48

andres@anarazel.de

about 4 years ago

In reply to: Andrey Borodin (#47)

#49

amborodin@acm.org

about 4 years ago

In reply to: Andres Freund (#48)

#50

andres@anarazel.de

about 4 years ago

In reply to: Andrey Borodin (#49)

#51

amborodin@acm.org

about 4 years ago

In reply to: Andres Freund (#50)

#52

michael@paquier.xyz

about 4 years ago

In reply to: Andres Freund (#50)

#53

alvherre@2ndquadrant.com

about 4 years ago

In reply to: Michael Paquier (#52)

#54

andres@anarazel.de

about 4 years ago

In reply to: Andrey Borodin (#51)

#55

andres@anarazel.de

about 4 years ago

In reply to: Michael Paquier (#52)

#56

pg@bowt.ie

about 4 years ago

In reply to: Andres Freund (#55)

#57

alvherre@2ndquadrant.com

about 4 years ago

In reply to: Andres Freund (#55)

#58

alvherre@2ndquadrant.com

about 4 years ago

In reply to: Alvaro Herrera (#57)

#59

andres@anarazel.de

about 4 years ago

In reply to: Alvaro Herrera (#57)

#60

michael@paquier.xyz

almost 4 years ago

In reply to: Andres Freund (#55)

#61

michael@paquier.xyz

almost 4 years ago

In reply to: Alvaro Herrera (#58)

#62

amborodin@acm.org

almost 4 years ago

In reply to: Andres Freund (#54)

#63

alvherre@2ndquadrant.com

almost 4 years ago

In reply to: Michael Paquier (#61)

#64

alvherre@2ndquadrant.com

almost 4 years ago

In reply to: Michael Paquier (#60)

#65

michael@paquier.xyz

almost 4 years ago

In reply to: Alvaro Herrera (#64)

#66

alvherre@2ndquadrant.com

almost 4 years ago

In reply to: Christophe Pettus (#43)

#67

alvherre@2ndquadrant.com

almost 4 years ago

In reply to: Alvaro Herrera (#66)

#68

pg@bowt.ie

almost 4 years ago

In reply to: Alvaro Herrera (#66)

#69

alvherre@2ndquadrant.com

almost 4 years ago

In reply to: Andrey Borodin (#49)

#70

amborodin@acm.org

almost 4 years ago

In reply to: Alvaro Herrera (#69)

#71

alvherre@2ndquadrant.com

almost 4 years ago

In reply to: Andres Freund (#59)

#72

alvherre@2ndquadrant.com

almost 4 years ago

In reply to: Andrey Borodin (#70)

#73

pg@bowt.ie

almost 4 years ago

In reply to: Alvaro Herrera (#72)

#74

alvherre@2ndquadrant.com

almost 4 years ago

In reply to: Peter Geoghegan (#73)

#75

pg@bowt.ie

almost 4 years ago

In reply to: Alvaro Herrera (#74)

#76

andres@anarazel.de

almost 4 years ago

In reply to: Peter Geoghegan (#75)

#77

alvherre@2ndquadrant.com

almost 4 years ago

In reply to: Peter Geoghegan (#75)

#78

pg@bowt.ie

almost 4 years ago

In reply to: Andres Freund (#76)

#79

alvherre@2ndquadrant.com

almost 4 years ago

In reply to: Alvaro Herrera (#72)

#80