BUG #17386: btree index corruption after REINDEX CONCURRENTLY on a write-heavy table
The following bug has been logged on the website:
Bug reference: 17386
Logged by: Maxim Boguk
Email address: maxim.boguk@gmail.com
PostgreSQL version: 14.1
Operating system: Ubuntu Linux
Description:
Hi,
I found multiple cases of btree index corruption after running REINDEX INDEX
someindex CONCURRENTLY.
What I found so far:
1) the corruption exists on the master, on all replicas, and on a test server
recovered from base backup + WAL archive (so it isn't some local hardware error)
2) it doesn't happen every time, but at least two cases of corrupted indexes
were found after database verification with amcheck.
3) it happens on a huge and write-heavy table
Now the details:
bt_index_parent_check results:
set client_min_messages=DEBUG1;
SET
select bt_index_parent_check('jobs_pkey', true, true);
psql:check.sql:2: DEBUG: verifying consistency of tree structure for index "jobs_pkey" with cross-level checks
psql:check.sql:2: DEBUG: verifying level 3 (true root level)
psql:check.sql:2: DEBUG: verifying level 2
psql:check.sql:2: DEBUG: verifying level 1
psql:check.sql:2: DEBUG: verifying level 0 (leaf level)
psql:check.sql:2: DEBUG: verifying that tuples from index "jobs_pkey" are present in "jobs"
psql:check.sql:2: ERROR: heap tuple (33634183,14) from table "jobs" lacks matching index tuple within index "jobs_pkey"
Now let's see what we have inside the page:
SELECT * FROM heap_page_item_attrs(get_raw_page('jobs', 33634183), 'jobs'::regclass) where lp=14;
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid | t_attrs
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------+---------
 14 |     26 |        2 |      0 |        |        |          |        |             |            |        |        |       |
(lp_flags = 2 is LP_REDIRECT: line pointer 14 redirects to the line pointer stored in lp_off, here 26, so the tuple data actually lives at (33634183,26).)
select ctid, id, updated_at, created_at from jobs where job_reference = 'some_ref';
     ctid      |     id      |         updated_at         |         created_at
---------------+-------------+----------------------------+----------------------------
 (33634183,26) | 26228724405 | 2022-01-25 00:50:32.544507 | 2021-11-18 09:39:30.594114
select id from jobs where id = 26228724405;
id
----
(0 rows)
What I do know is that the REINDEX CONCURRENTLY was running at this tuple's
updated_at time:
2022-01-25 01:17:34.262 UTC 28073 postgres@**** from [local] [vxid:84/0
txid:0] [REINDEX] LOG: duration: 4649025.692 ms statement: REINDEX INDEX
CONCURRENTLY public.jobs_pkey
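A quick cross-check of that timeline, using only the figures above (the logged end timestamp and duration, and the tuple's updated_at), confirms the update landed inside the REINDEX window:

```python
from datetime import datetime, timedelta

# From the log line above: REINDEX finished at 2022-01-25 01:17:34.262 UTC
# after running for 4,649,025.692 ms (~77.5 minutes).
reindex_end = datetime(2022, 1, 25, 1, 17, 34, 262000)
reindex_start = reindex_end - timedelta(milliseconds=4649025.692)

# updated_at of the problem tuple, from the query above.
updated_at = datetime(2022, 1, 25, 0, 50, 32, 544507)

print(f"REINDEX window: {reindex_start} .. {reindex_end}")
# The REINDEX started around 00:00:05, so the 00:50:32 update falls inside it.
print("tuple updated during REINDEX:", reindex_start <= updated_at <= reindex_end)
```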
Problem tuple data:
SELECT * FROM heap_page_item_attrs(get_raw_page('jobs', 33634183),
'jobs'::regclass) where lp=26;
-[ RECORD 1 ]---
lp | 26
lp_off | 1840
lp_flags | 1
lp_len | 522
t_xmin | 1829097938
t_xmax | 0
t_field3 | 38
t_ctid | (33634183,26)
t_infomask2 | 32846
t_infomask | 11011
t_hoff | 40
t_bits | 01011100000000011110001000000000001001000000000011010100000111111111101110011000
t_oid |
t_attrs |
{NULL,"\\xab31353135315f37383730322d4a4f425245512d3135393732392d656e2d75732d4f54484c4f432d434125354642656e696369612d3239663963653035633163646564323435313532376230633232363830663465",NULL,"\\x114f616b6c616e64","\\x074341","\\x0d3934363631",NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,"\\x2f3b0000","\\x42293ab40b740200","\\xfb74b33d5c790200","\\x1c260300",NULL,NULL,NULL,"\\x01000000",NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,"\\x2b000000",NULL,NULL,"\\x00",NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,"\\xc0020000040000800400008040000000470000004900000052000000540000005e000000880000006c6f676f68747470733a2f2f7777772e687964726f6368656d7073632e636f6d2f77702d636f6e74656e742f75706c6f6164732f616d702d6c6f676f2e706e67636f756e747279555373706f6e736f7265646e6f736f757263656e616d65436c65616e20486172626f727320456e7669726f6e6d656e74616c2053657276696365732c20496e632e","\\x5d37383730322d4a4f425245512d3135393732392d656e2d75732d4f54484c4f432d434125354642656e69636961",NULL,"\\x873d0100",NULL,"\\x804b3e77e0780200",NULL,NULL,NULL,NULL,NULL,"\\x1300832500911aa41f","\\x1300a37900fe234826","\\xce2b5b00","\\xb5925a1b06000000","\\x4b905a1b06000000","\\x00000000","\\xe6120000","\\x0f020000","\\x20502e5508750200","\\x0400",NULL,"\\x67c30000","\\x4e6c1f13","\\x652105fbe2f3fab45335f04bfccc900c",NULL,NULL,"\\x04001c00","\\x0000",NULL}
SELECT t_ctid, raw_flags, combined_flags FROM
heap_page_item_attrs(get_raw_page('jobs', 33634183), 'jobs'::regclass),
LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) where lp=26;
-[ RECORD 1 ]---
t_ctid | (33634183,26)
raw_flags | {HEAP_HASNULL,HEAP_HASVARWIDTH,HEAP_XMIN_COMMITTED,HEAP_XMIN_INVALID,HEAP_XMAX_INVALID,HEAP_UPDATED,HEAP_ONLY_TUPLE}
combined_flags | {HEAP_XMIN_FROZEN}
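Those raw_flags/combined_flags can be reproduced by hand from the t_infomask/t_infomask2 values shown earlier (11011 and 32846). A minimal Python sketch, with bit values copied from PostgreSQL's src/include/access/htup_details.h (simplified: the real heap_tuple_infomask_flags() handles more flag combinations):

```python
# Bit values from src/include/access/htup_details.h (t_infomask).
INFOMASK_FLAGS = {
    0x0001: "HEAP_HASNULL",
    0x0002: "HEAP_HASVARWIDTH",
    0x0004: "HEAP_HASEXTERNAL",
    0x0010: "HEAP_XMAX_KEYSHR_LOCK",
    0x0020: "HEAP_COMBOCID",
    0x0040: "HEAP_XMAX_EXCL_LOCK",
    0x0080: "HEAP_XMAX_LOCK_ONLY",
    0x0100: "HEAP_XMIN_COMMITTED",
    0x0200: "HEAP_XMIN_INVALID",
    0x0400: "HEAP_XMAX_COMMITTED",
    0x0800: "HEAP_XMAX_INVALID",
    0x1000: "HEAP_XMAX_IS_MULTI",
    0x2000: "HEAP_UPDATED",
}
# Bit values for t_infomask2.
INFOMASK2_FLAGS = {
    0x4000: "HEAP_HOT_UPDATED",
    0x8000: "HEAP_ONLY_TUPLE",
}
HEAP_NATTS_MASK = 0x07FF                  # low bits of infomask2 = column count
HEAP_XMIN_FROZEN = 0x0100 | 0x0200       # XMIN_COMMITTED | XMIN_INVALID combined

def decode(infomask, infomask2):
    """Decode raw flag names plus the combined FROZEN flag and attribute count."""
    flags = [name for bit, name in INFOMASK_FLAGS.items() if infomask & bit]
    flags += [name for bit, name in INFOMASK2_FLAGS.items() if infomask2 & bit]
    if infomask & HEAP_XMIN_FROZEN == HEAP_XMIN_FROZEN:
        flags.append("HEAP_XMIN_FROZEN")  # both xmin bits set => frozen
    return flags, infomask2 & HEAP_NATTS_MASK

# Values of the problem tuple at (33634183,26) shown above.
flags, natts = decode(11011, 32846)
print(flags)   # matches the raw_flags/combined_flags output: frozen HOT tuple
print(natts)   # 78 attributes
```

Running the same decoding on the second corrupted tuple's masks would show the HEAP_XMAX_KEYSHR_LOCK + HEAP_XMAX_LOCK_ONLY combination reported further down, i.e. a key-share-locked heap-only tuple.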
This is all I have so far, but I do have a base backup taken before the problem
arose, the WAL archive, and a standalone server for experiments (the database is
10+ TB in size under huge load, so it isn't fast, but it is definitely possible
to extract any requested information).
Kind Regards,
Maxim
On Fri, Jan 28, 2022 at 3:42 PM PG Bug reporting form <
noreply@postgresql.org> wrote:
An attempt to fix the situation with REINDEX INDEX jobs_pkey CONCURRENTLY;
led to a corrupted index again, in a different tuple:
ERROR: heap tuple (69306318,15) from table "jobs" lacks matching index tuple within index "jobs_pkey"
SELECT * FROM heap_page_item_attrs(get_raw_page('jobs', 69306318),
'jobs'::regclass) where lp=15;
-[ RECORD 1 ]---
lp | 15
lp_off | 24
lp_flags | 2
lp_len | 0
select ctid,id,updated_at,created_at from jobs where
ctid='(69306318,24)'::tid;
-[ RECORD 1 ]--------------------------
ctid | (69306318,24)
id | 26192320674
updated_at | 2022-01-28 13:08:27.224275
created_at | 2021-11-17 14:35:59.441979
And again, the updated_at falls during the REINDEX run.
select ctid,id,updated_at,created_at from jobs where id=26192320674;
(0 rows)
SELECT t_ctid, raw_flags, combined_flags FROM
heap_page_item_attrs(get_raw_page('jobs', 69306318), 'jobs'::regclass),
LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2) where lp=24;
-[ RECORD 1 ]---
t_ctid | (69306318,24)
raw_flags | {HEAP_HASNULL,HEAP_HASVARWIDTH,HEAP_XMAX_KEYSHR_LOCK,HEAP_XMAX_LOCK_ONLY,HEAP_XMIN_COMMITTED,HEAP_XMAX_INVALID,HEAP_UPDATED,HEAP_ONLY_TUPLE}
combined_flags | {}
It seems something is broken in the combination of HOT updates and REINDEX
CONCURRENTLY.
--
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/
Phone RU: +7 985 433 0000
Phone UA: +380 99 143 0000
Phone AU: +61 45 218 5678
LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk
"Doctor, you advised me not to do that, but why does it still hurt when I do
it again?"
On Fri, Jan 28, 2022 at 04:17:28PM +0200, Maxim Boguk wrote:
On Fri, Jan 28, 2022 at 3:42 PM PG Bug reporting form <noreply@postgresql.org> wrote:
PostgreSQL version: 14.1
Seems something broken in combination of HOT update and REINDEX
CONCURRENTLY.
Agreed. Can you create a self-contained test case, perhaps by adapting
contrib/amcheck/t/002_cic.pl? If this bug is like the ones fixed between 14.0
and 14.1, the base backup and WAL won't help us, unfortunately. The transient
states are what matter. I did try modifying the test to have elements of your
scenario, including HOT updates, KEY SHARE locks, and frozen tuples. That
didn't reproduce the bug. I'm attaching what I tried.
Attachment: repro-reindex-hot-v0.patch (text/plain; charset=us-ascii)
On Fri, Jan 28, 2022 at 6:34 PM Noah Misch <noah@leadboat.com> wrote:
Agreed. Can you create a self-contained test case, perhaps by adapting
contrib/amcheck/t/002_cic.pl? If this bug is like the ones fixed between 14.0
and 14.1, the base backup and WAL won't help us, unfortunately. The transient
states are what matter. I did try modifying the test to have elements of your
scenario, including HOT updates, KEY SHARE locks, and frozen tuples. That
didn't reproduce the bug. I'm attaching what I tried.
If I had to guess, then I'd guess that this has something to do with
orphaned HOT chains, like those we saw in the bug report that led to
bugfix commit 18b87b20 (which is in 14.2 but not 14.1). I could easily
be wrong about that, so take it with a grain of salt. I find it a
little suspicious that we're hearing about a REINDEX CONCURRENTLY
problem in Postgres 14, which is much less mature than Postgres 12
(where REINDEX CONCURRENTLY first appeared).
--
Peter Geoghegan
On Fri, Jan 28, 2022 at 07:00:31PM -0800, Peter Geoghegan wrote:
If I had to guess, then I'd guess that this has something to do with
orphaned HOT chains, like those we saw in the bug report that led to
bugfix commit 18b87b20 (which is in 14.2 but not 14.1). I could easily
be wrong about that, so take it with a grain of salt. I find it a
little suspicious that we're hearing about a REINDEX CONCURRENTLY
problem in Postgres 14, which is much less mature than Postgres 12
(where REINDEX CONCURRENTLY first appeared).
Possible, but hard to say without an actual proof. Maxim, if the
problem is reproducible easily on your end, could you give a try to v12
and v13 and see if it happens as well there?
--
Michael
On Sat, Jan 29, 2022 at 7:43 AM Michael Paquier <michael@paquier.xyz> wrote:
Possible, but hard to say without an actual proof. Maxim, if the
problem is reproducible easily on your end, could you give a try to v12
and v13 and see if it happens as well there?
I don't remember such problems during the last year on v13 with the same
workload and the same periodic reindex.
It isn't easily reproduced: the table in question is 800 GB (almost 2 TB with
indexes) and sees around 1,000 rows/s updated (5k at peak). Under such load,
the chance of hitting the error seems to be around 60% (i.e., more than half
of reindex attempts end with a broken index).
I have a suitably powerful server for tests, but there is no good way to
simulate the production workload (especially reproducibly).
I'll see what I can do next week.
--
Maxim Boguk
On Sat, Jan 29, 2022 at 05:23:49PM +0200, Maxim Boguk wrote:
I don't remember such problems during the last year on v13 with the same
workload and the same periodic reindex.
Got it. A low-effort strategy would be to wait for v14.2 to release on
2022-02-10. If the problem started with a v13.X -> v14.1 upgrade and ceases
in v14.2, then commit 18b87b20 likely was the fix.
On Sun, Jan 30, 2022 at 5:43 AM Noah Misch <noah@leadboat.com> wrote:
Got it. A low-effort strategy would be to wait for v14.2 to release on
2022-02-10. If the problem started with a v13.X -> v14.1 upgrade and
ceases in v14.2, then commit 18b87b20 likely was the fix.
Ok, I'll wait for the 14.2 release and provide an update about this issue.
--
Maxim Boguk
Hi,
On 2022-01-28 18:34:18 -0800, Noah Misch wrote:
If this bug is like the ones fixed between 14.0 and 14.1, the base backup
and WAL won't help us, unfortunately.
FWIW: I've found that once one has identified a specific page / tuple that is
damaged, searching the WAL for the record causing the problem is really
helpful. Once that record is identified, the records from the transaction
logging the problematic WAL record(s) can often identify the concrete activity
at the time. And looking at the WAL records of concurrently running
transactions can often identify the concurrency problem.
It might not be worth it in this case (given we have a suspicion it's
18b87b20), but when you have a base backup from before the problem, and all
incremental WAL, just replaying forward incrementally and testing for the
problem can help narrow down problems a lot.
Greetings,
Andres Freund
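The WAL search Andres describes can be done with pg_waldump. Since the v14 pg_waldump has no built-in per-block filter (newer releases added options for that), one can filter its text output instead. A sketch; the block-reference line format is what pg_waldump prints, while the saved-dump filename in the usage comment is illustrative:

```python
import re

# pg_waldump prints block references like
#   "blkref #0: rel 1663/16384/16385 blk 33634183"
# so filtering its output by block number finds every WAL record that
# touched the damaged page, e.g. block 33634183 from the report above.

def records_for_block(lines, blkno):
    """Return pg_waldump output lines whose block references mention blkno."""
    pat = re.compile(rf"blkref #\d+: rel \d+/\d+/\d+(?: fork \w+)? blk {blkno}\b")
    return [line for line in lines if pat.search(line)]

# Hypothetical usage against a saved dump:
#   pg_waldump <start-segment> <end-segment> > records.txt
# then:
#   hits = records_for_block(open("records.txt"), 33634183)
```

Once the matching records are found, their transaction IDs point at the concurrent activity Andres mentions.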