BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
The following bug has been logged on the website:
Bug reference: 17255
Logged by: Alexander Lakhin
Email address: exclusion@gmail.com
PostgreSQL version: 14.0
Operating system: Ubuntu 20.04
Description:
The following scenario:
###
createdb regression
export PGDATABASE=regression
echo "
-- excerpt from inherit.sql
CREATE TABLE errtst_parent (
partid int not null,
shdata int not null,
data int NOT NULL DEFAULT 0,
CONSTRAINT shdata_small CHECK(shdata < 3)
) PARTITION BY RANGE (partid);
CREATE TABLE errtst_child_plaindef (
partid int not null,
shdata int not null,
data int NOT NULL DEFAULT 0,
CONSTRAINT shdata_small CHECK(shdata < 3),
CHECK(data < 10)
);
CREATE TABLE errtst_child_reorder (
data int NOT NULL DEFAULT 0,
shdata int not null,
partid int not null,
CONSTRAINT shdata_small CHECK(shdata < 3),
CHECK(data < 10)
);
ALTER TABLE errtst_parent ATTACH PARTITION errtst_child_plaindef FOR VALUES
FROM (10) TO (20);
ALTER TABLE errtst_parent ATTACH PARTITION errtst_child_reorder FOR VALUES
FROM (20) TO (30);
DROP TABLE errtst_parent;
" >/tmp/mini-inherit.sql
echo "
-- excerpt from vacuum.sql
CREATE TEMPORARY TABLE tmp (a int PRIMARY KEY);
SELECT pg_sleep(random()/500);
CREATE INDEX tmp_idx1 ON tmp (a);
" >/tmp/mini-vacuum.sql
echo "
VACUUM (skip_locked, index_cleanup off) pg_catalog.pg_class;
SELECT pg_sleep(random()/500);
" >/tmp/pseudo-autovacuum.sql
pgbench -n -f /tmp/mini-vacuum.sql -f /tmp/pseudo-autovacuum.sql -C -c 40 -T 600 >/dev/null 2>&1 &
pgbench -n -f /tmp/mini-inherit.sql -C -c 1 -T 600 >/dev/null 2>&1 &
wait
###
with the settings:
autovacuum=off
fsync=off
in postgresql.conf
causes a server crash:
TIME PID UID GID SIG COREFILE EXE
Fri 2021-10-29 09:37:09 MSK 383805 1000 1000 6 present
.../usr/local/pgsql/bin/postgres
real 3m20,335s
user 0m7,245s
sys 0m8,306s
with the following stack:
Core was generated by `postgres: law regression [local] CREATE INDEX '.
Program terminated with signal SIGABRT, Aborted.
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007f8a7f97a859 in __GI_abort () at abort.c:79
#2 0x0000562dabb49700 in index_delete_sort_cmp (deltid2=<synthetic
pointer>, deltid1=<optimized out>) at heapam.c:7582
#3 index_delete_sort (delstate=0x7fff6f609f10, delstate=0x7fff6f609f10) at
heapam.c:7623
#4 heap_index_delete_tuples (rel=0x7f8a76523e08, delstate=0x7fff6f609f10)
at heapam.c:7296
#5 0x0000562dabc5519a in table_index_delete_tuples
(delstate=0x7fff6f609f10, rel=0x562dac23d6c2)
at ../../../../src/include/access/tableam.h:1327
#6 _bt_delitems_delete_check (rel=rel@entry=0x7f8a7652cc80,
buf=buf@entry=191, heapRel=heapRel@entry=0x7f8a76523e08,
delstate=delstate@entry=0x7fff6f609f10) at nbtpage.c:1541
#7 0x0000562dabc4dbe1 in _bt_simpledel_pass (maxoff=<optimized out>,
minoff=<optimized out>, newitem=<optimized out>,
ndeletable=55, deletable=0x7fff6f609f30, heapRel=0x7f8a76523e08,
buffer=191, rel=0x7f8a7652cc80)
at nbtinsert.c:2899
#8 _bt_delete_or_dedup_one_page (rel=0x7f8a7652cc80,
heapRel=0x7f8a76523e08, insertstate=0x7fff6f60a340,
simpleonly=<optimized out>, checkingunique=<optimized out>,
uniquedup=<optimized out>, indexUnchanged=false)
at nbtinsert.c:2712
#9 0x0000562dabc523f3 in _bt_findinsertloc (heapRel=0x7f8a76523e08,
stack=0x562dad8c1320, indexUnchanged=false,
checkingunique=true, insertstate=0x7fff6f60a340, rel=0x7f8a7652cc80) at
nbtinsert.c:904
#10 _bt_doinsert (rel=rel@entry=0x7f8a7652cc80,
itup=itup@entry=0x562dad8b9f20,
checkUnique=checkUnique@entry=UNIQUE_CHECK_YES,
indexUnchanged=indexUnchanged@entry=false,
heapRel=heapRel@entry=0x7f8a76523e08) at nbtinsert.c:255
#11 0x0000562dabc58451 in btinsert (rel=0x7f8a7652cc80, values=<optimized
out>, isnull=<optimized out>,
ht_ctid=0x562dad81f7d4, heapRel=0x7f8a76523e08,
checkUnique=UNIQUE_CHECK_YES, indexUnchanged=false,
indexInfo=0x562dad8b9ad8) at nbtree.c:199
#12 0x0000562dabcc0f14 in CatalogIndexInsert
(indstate=indstate@entry=0x562dad8c0fb0,
heapTuple=heapTuple@entry=0x562dad81f7d0) at indexing.c:158
#13 0x0000562dabcc119f in CatalogTupleInsert (heapRel=0x7f8a76523e08,
tup=0x562dad81f7d0) at indexing.c:231
#14 0x0000562dabcb92a6 in InsertPgClassTuple
(pg_class_desc=pg_class_desc@entry=0x7f8a76523e08,
new_rel_desc=new_rel_desc@entry=0x7f8a7654cd28, new_rel_oid=<optimized
out>, relacl=relacl@entry=0,
reloptions=reloptions@entry=0) at heap.c:986
#15 0x0000562dabcbec8d in index_create
(heapRelation=heapRelation@entry=0x7f8a7654c6a0,
indexRelationName=indexRelationName@entry=0x562dad7f7ff8 "tmp_idx1",
indexRelationId=571743,
indexRelationId@entry=0, parentIndexRelid=parentIndexRelid@entry=0,
parentConstraintId=parentConstraintId@entry=0,
relFileNode=<optimized out>, indexInfo=<optimized out>,
indexColNames=<optimized out>,
accessMethodObjectId=<optimized out>, tableSpaceId=<optimized out>,
collationObjectId=<optimized out>,
classObjectId=<optimized out>, coloptions=<optimized out>,
reloptions=<optimized out>, flags=<optimized out>,
constr_flags=<optimized out>, allow_system_table_mods=<optimized out>,
is_internal=<optimized out>,
constraintId=<optimized out>) at index.c:968
#16 0x0000562dabd533f8 in DefineIndex (relationId=relationId@entry=571674,
stmt=stmt@entry=0x562dad7f8168,
indexRelationId=indexRelationId@entry=0,
parentIndexId=parentIndexId@entry=0,
parentConstraintId=parentConstraintId@entry=0,
is_alter_table=is_alter_table@entry=false, check_rights=true,
check_not_in_use=true, skip_build=false, quiet=false) at
indexcmds.c:1137
#17 0x0000562dabf41217 in ProcessUtilitySlow (pstate=0x562dad8c0c80,
pstmt=0x562dad7f8518,
queryString=0x562dad7f75e0 "CREATE INDEX tmp_idx1 ON tmp (a);",
context=PROCESS_UTILITY_TOPLEVEL, params=0x0,
queryEnv=0x0, qc=0x7fff6f60b2a0, dest=<optimized out>) at
utility.c:1534
#18 0x0000562dabf3fd23 in standard_ProcessUtility (pstmt=0x562dad7f8518,
queryString=0x562dad7f75e0 "CREATE INDEX tmp_idx1 ON tmp (a);",
readOnlyTree=<optimized out>,
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
dest=0x562dad7f8608, qc=0x7fff6f60b2a0)
at utility.c:1066
#19 0x0000562dabf3e3f1 in PortalRunUtility
(portal=portal@entry=0x562dad85b040, pstmt=pstmt@entry=0x562dad7f8518,
isTopLevel=isTopLevel@entry=true,
setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x562dad7f8608,
qc=qc@entry=0x7fff6f60b2a0) at pquery.c:1155
#20 0x0000562dabf3e52d in PortalRunMulti
(portal=portal@entry=0x562dad85b040, isTopLevel=isTopLevel@entry=true,
setHoldSnapshot=setHoldSnapshot@entry=false,
dest=dest@entry=0x562dad7f8608, altdest=altdest@entry=0x562dad7f8608,
qc=qc@entry=0x7fff6f60b2a0) at pquery.c:1312
#21 0x0000562dabf3ebc9 in PortalRun (portal=portal@entry=0x562dad85b040,
count=count@entry=9223372036854775807,
isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true,
dest=dest@entry=0x562dad7f8608,
altdest=altdest@entry=0x562dad7f8608, qc=0x7fff6f60b2a0) at
pquery.c:788
#22 0x0000562dabf3a93b in exec_simple_query (query_string=0x562dad7f75e0
"CREATE INDEX tmp_idx1 ON tmp (a);")
at postgres.c:1214
#23 0x0000562dabf3c541 in PostgresMain (argc=argc@entry=1,
argv=argv@entry=0x7fff6f60b710, dbname=<optimized out>,
username=<optimized out>) at postgres.c:4486
#24 0x0000562dabea84bd in BackendRun (port=0x562dad81be40,
port=0x562dad81be40) at postmaster.c:4506
#25 BackendStartup (port=0x562dad81be40) at postmaster.c:4228
#26 ServerLoop () at postmaster.c:1745
#27 0x0000562dabea9461 in PostmasterMain (argc=<optimized out>,
argv=<optimized out>) at postmaster.c:1417
#28 0x0000562dabbd72e2 in main (argc=3, argv=0x562dad7f18a0) at main.c:209
Discovered while hunting for another bug related to autovacuum (unfortunately
I still can't produce a reliable reproducing script for that).
Best regards,
Alexander
On Fri, Oct 29, 2021 at 07:00:01AM +0000, PG Bug reporting form wrote:
> The following bug has been logged on the website:
>
> Bug reference: 17255
> Logged by: Alexander Lakhin
> Email address: exclusion@gmail.com
> PostgreSQL version: 14.0
> Operating system: Ubuntu 20.04
>
> Description:
> ...
> with the following stack:
>
> Core was generated by `postgres: law regression [local] CREATE INDEX '.
> Program terminated with signal SIGABRT, Aborted.
> ...
> #2 0x0000562dabb49700 in index_delete_sort_cmp (deltid2=<synthetic pointer>, deltid1=<optimized out>) at heapam.c:7582
> #3 index_delete_sort (delstate=0x7fff6f609f10, delstate=0x7fff6f609f10) at heapam.c:7623
> #4 heap_index_delete_tuples (rel=0x7f8a76523e08, delstate=0x7fff6f609f10) at heapam.c:7296
> ...
> Discovered while hunting for another bug related to autovacuum (unfortunately
> I still can't produce a reliable reproducing script for that).
Thanks for reporting (in fact I'm impressed by how many issues you've
discovered; hopefully there are at least some "I've found X bugs in
PostgreSQL" t-shirts available as a reward) and for putting effort into the
reproducing steps. I believe I've managed to reproduce at least a
similar crash with the same trace.
In my case it crashed on pg_unreachable (which is an abort when asserts
are enabled) inside index_delete_sort_cmp. It seems like the two item
pointers being compared have the same block and offset number. In view of
the recent discussions I was thinking it could somehow be related to the
issues with duplicated TIDs, but delstate->deltids doesn't in fact have
any duplicated entries -- so not sure about that, still investigating
the core dump.
Hi,
On 2021-10-29 16:55:32 +0200, Dmitry Dolgov wrote:
> On Fri, Oct 29, 2021 at 07:00:01AM +0000, PG Bug reporting form wrote:
> > The following bug has been logged on the website:
> >
> > Bug reference: 17255
> > ...
> > with the following stack:
> >
> > Core was generated by `postgres: law regression [local] CREATE INDEX '.
> > ...
> > #2 0x0000562dabb49700 in index_delete_sort_cmp (deltid2=<synthetic pointer>, deltid1=<optimized out>) at heapam.c:7582
> > ...
> > Discovered while hunting for another bug related to autovacuum (unfortunately
> > I still can't produce a reliable reproducing script for that).
>
> Thanks for reporting (in fact I'm impressed by how many issues you've
> discovered; hopefully there are at least some "I've found X bugs in
> PostgreSQL" t-shirts available as a reward) and for putting effort into the
> reproducing steps. I believe I've managed to reproduce at least a
> similar crash with the same trace.
>
> In my case it crashed on pg_unreachable (which is an abort when asserts
> are enabled) inside index_delete_sort_cmp. It seems like the two item
> pointers being compared have the same block and offset number. In view of
> the recent discussions I was thinking it could somehow be related to the
> issues with duplicated TIDs, but delstate->deltids doesn't in fact have
> any duplicated entries -- so not sure about that, still investigating
> the core dump.
I suspect this is the same bug as #17245. Could you check if it's fixed by
/messages/by-id/CAH2-WzkN5aESSLfK7-yrYgsXxYUi__VzG4XpZFwXm98LUtoWuQ@mail.gmail.com
The crash is somewhere in pg_class, which is also manually VACUUMed by the
test, which could trigger the issue we found in the other thread. The likely
reason the loop in the repro is needed is that it pushes one of the indexes
on pg_class over the 512kB/min_parallel_index_scan_size boundary, to start
using parallel vacuum.
Greetings,
Andres Freund
On Sat, Oct 30, 2021 at 2:39 PM Andres Freund <andres@anarazel.de> wrote:
> The crash is somewhere in pg_class, which is also manually VACUUMed by the
> test, which could trigger the issue we found in the other thread. The likely
> reason the loop in the repro is needed is that it pushes one of the indexes
> on pg_class over the 512kB/min_parallel_index_scan_size boundary, to start
> using parallel vacuum.
Quite possible, but I can't rule out the possibility that it's
actually this other bug, since as you say this is pg_class, a system
catalog index:
/messages/by-id/CAH2-WzkjjCoq5Y4LeeHJcjYJVxGm3M3SAWZ0=6J8K1FPSC9K0w@mail.gmail.com
I am in too much of a hurry right now to spend the 5 minutes it would
take to confirm either way. (Even if it's not this other bug, I needed
to remind you of its existence anyway.)
--
Peter Geoghegan
On Sat, Oct 30, 2021 at 02:39:48PM -0700, Andres Freund wrote:
> > In my case it crashed on pg_unreachable (which is an abort when asserts
> > are enabled) inside index_delete_sort_cmp. It seems like the two item
> > pointers being compared have the same block and offset number. ...
> > still investigating the core dump.
>
> I suspect this is the same bug as #17245. Could you check if it's fixed by
> /messages/by-id/CAH2-WzkN5aESSLfK7-yrYgsXxYUi__VzG4XpZFwXm98LUtoWuQ@mail.gmail.com
>
> The crash is somewhere in pg_class, which is also manually VACUUMed by the
> test, which could trigger the issue we found in the other thread. The likely
> reason the loop in the repro is needed is that it pushes one of the indexes
> on pg_class over the 512kB/min_parallel_index_scan_size boundary, to start
> using parallel vacuum.
I've applied both patches from Peter, the fix itself and the
index-points-to-LP_UNUSED-item assertions. Now it doesn't crash on
pg_unreachable, but hits the extra assertions in the second patch:
#0 0x00007f251875f2fb in raise () from /lib64/libc.so.6
#1 0x00007f2518748ef6 in abort () from /lib64/libc.so.6
#2 0x000056387b62a4c7 in ExceptionalCondition (conditionName=0x56387b6be622 "ItemIdIsUsed(iid)", errorType=0x56387b6bc849 "FailedAssertion", fileName=0x56387b6bc928 "heapam.c", lineNumber=7467) at assert.c:69
#3 0x000056387afb4ba9 in heap_index_delete_tuples (rel=0x7f25195f8e20, delstate=0x7ffe817bdf00) at heapam.c:7467
#4 0x000056387afe4a38 in table_index_delete_tuples (rel=0x7f25195f8e20, delstate=0x7ffe817bdf00) at ../../../../src/include/access/tableam.h:1327
#5 0x000056387afe83b7 in _bt_delitems_delete_check (rel=0x7f2519601880, buf=182, heapRel=0x7f25195f8e20, delstate=0x7ffe817bdf00) at nbtpage.c:1541
#6 0x000056387afe4452 in _bt_simpledel_pass (rel=0x7f2519601880, buffer=182, heapRel=0x7f25195f8e20, deletable=0x7ffe817bdfb0, ndeletable=55, newitem=0x56387c05cfe0, minoff=1, maxoff=271) at nbtinsert.c:2896
#7 0x000056387afe3cb1 in _bt_delete_or_dedup_one_page (rel=0x7f2519601880, heapRel=0x7f25195f8e20, insertstate=0x7ffe817be3b0, simpleonly=false, checkingunique=true, uniquedup=false, indexUnchanged=false) at nbtinsert.c:2709
#8 0x000056387afdea4b in _bt_findinsertloc (rel=0x7f2519601880, insertstate=0x7ffe817be3b0, checkingunique=true, indexUnchanged=false, stack=0x56387c05d008, heapRel=0x7f25195f8e20) at nbtinsert.c:901
#9 0x000056387afdd3c7 in _bt_doinsert (rel=0x7f2519601880, itup=0x56387c05cfe0, checkUnique=UNIQUE_CHECK_YES, indexUnchanged=false, heapRel=0x7f25195f8e20) at nbtinsert.c:255
#10 0x000056387afecfee in btinsert (rel=0x7f2519601880, values=0x7ffe817be510, isnull=0x7ffe817be4f0, ht_ctid=0x56387c05b994, heapRel=0x7f25195f8e20, checkUnique=UNIQUE_CHECK_YES, indexUnchanged=false, indexInfo=0x56387c05cec8) at nbtree.c:199
#11 0x000056387afd7f05 in index_insert (indexRelation=0x7f2519601880, values=0x7ffe817be510, isnull=0x7ffe817be4f0, heap_t_ctid=0x56387c05b994, heapRelation=0x7f25195f8e20, checkUnique=UNIQUE_CHECK_YES, indexUnchanged=false, indexInfo=0x56387c05cec8) at indexam.c:193
#12 0x000056387b08c396 in CatalogIndexInsert (indstate=0x56387bfc0388, heapTuple=0x56387c05b990) at indexing.c:158
#13 0x000056387b08c51a in CatalogTupleInsert (heapRel=0x7f25195f8e20, tup=0x56387c05b990) at indexing.c:231
#14 0x000056387b07ed40 in InsertPgClassTuple (pg_class_desc=0x7f25195f8e20, new_rel_desc=0x7f251960fa18, new_rel_oid=957915, relacl=0, reloptions=0) at heap.c:984
#15 0x000056387b07eec6 in AddNewRelationTuple (pg_class_desc=0x7f25195f8e20, new_rel_desc=0x7f251960fa18, new_rel_oid=957915, new_type_oid=957917, reloftype=0, relowner=10, relkind=114 'r', relfrozenxid=412531, relminmxid=1, relacl=0, reloptions=0) at heap.c:1056
#16 0x000056387b07f60d in heap_create_with_catalog (relname=0x7ffe817bec60 "tmp", relnamespace=16686, reltablespace=0, relid=957915, reltypeid=0, reloftypeid=0, ownerid=10, accessmtd=2, tupdesc=0x56387bfbb6b0, cooked_constraints=0x0, relkind=114 'r', relpersistence=116 't', shared_relation=false, mapped_relation=false, oncommit=ONCOMMIT_NOOP, reloptions=0, use_user_acl=true, allow_system_table_mods=false, is_internal=false, relrewrite=0, typaddress=0x0) at heap.c:1409
#17 0x000056387b1a9bb7 in DefineRelation (stmt=0x56387bf98620, relkind=114 'r', ownerId=10, typaddress=0x0, queryString=0x56387bf97810 "CREATE TEMPORARY TABLE tmp (a int PRIMARY KEY);") at tablecmds.c:933
#18 0x000056387b47fde1 in ProcessUtilitySlow (pstate=0x56387bfb9890, pstmt=0x56387bf989d0, queryString=0x56387bf97810 "CREATE TEMPORARY TABLE tmp (a int PRIMARY KEY);", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x56387bf98ac0, qc=0x7ffe817bf4a0) at utility.c:1163
#19 0x000056387b47fb3d in standard_ProcessUtility (pstmt=0x56387bf989d0, queryString=0x56387bf97810 "CREATE TEMPORARY TABLE tmp (a int PRIMARY KEY);", readOnlyTree=false, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x56387bf98ac0, qc=0x7ffe817bf4a0) at utility.c:1066
#20 0x000056387b47eb01 in ProcessUtility (pstmt=0x56387bf989d0, queryString=0x56387bf97810 "CREATE TEMPORARY TABLE tmp (a int PRIMARY KEY);", readOnlyTree=false, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x56387bf98ac0, qc=0x7ffe817bf4a0) at utility.c:527
#21 0x000056387b47d5e8 in PortalRunUtility (portal=0x56387bffb860, pstmt=0x56387bf989d0, isTopLevel=true, setHoldSnapshot=false, dest=0x56387bf98ac0, qc=0x7ffe817bf4a0) at pquery.c:1155
#22 0x000056387b47d858 in PortalRunMulti (portal=0x56387bffb860, isTopLevel=true, setHoldSnapshot=false, dest=0x56387bf98ac0, altdest=0x56387bf98ac0, qc=0x7ffe817bf4a0) at pquery.c:1312
#23 0x000056387b47ccdb in PortalRun (portal=0x56387bffb860, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x56387bf98ac0, altdest=0x56387bf98ac0, qc=0x7ffe817bf4a0) at pquery.c:788
#24 0x000056387b475fad in exec_simple_query (query_string=0x56387bf97810 "CREATE TEMPORARY TABLE tmp (a int PRIMARY KEY);") at postgres.c:1214
#25 0x000056387b47ab2b in PostgresMain (dbname=0x56387bfc3748 "regression", username=0x56387bfc3728 "erthalion") at postgres.c:4497
#26 0x000056387b3a636f in BackendRun (port=0x56387bfbb460) at postmaster.c:4560
#27 0x000056387b3a5c60 in BackendStartup (port=0x56387bfbb460) at postmaster.c:4288
#28 0x000056387b3a1e81 in ServerLoop () at postmaster.c:1801
#29 0x000056387b3a1630 in PostmasterMain (argc=3, argv=0x56387bf91ca0) at postmaster.c:1473
#30 0x000056387b297f62 in main (argc=3, argv=0x56387bf91ca0) at main.c:198
The ItemId in question:
p *iid
$2 = {
lp_off = 0,
lp_flags = 0,
lp_len = 0
}
31.10.2021 22:20, Dmitry Dolgov wrote:
> > I suspect this is the same bug as #17245. Could you check if it's fixed by
> > /messages/by-id/CAH2-WzkN5aESSLfK7-yrYgsXxYUi__VzG4XpZFwXm98LUtoWuQ@mail.gmail.com
> >
> > The crash is somewhere in pg_class, which is also manually VACUUMed by the
> > test, which could trigger the issue we found in the other thread. ...
>
> I've applied both patches from Peter, the fix itself and the
> index-points-to-LP_UNUSED-item assertions. Now it doesn't crash on
> pg_unreachable, but hits the extra assertions in the second patch:
Yes, the committed fix for the bug #17245 doesn't help here.
I've also noticed that the server crash is not the only possible
outcome. You can also get unexpected errors like:
ERROR: relation "errtst_parent" already exists
ERROR: relation "tmp_idx1" already exists
ERROR: relation "errtst_child_plaindef" already exists
or
ERROR: could not open relation with OID 1033921
STATEMENT: DROP TABLE errtst_parent;
in the server.log (and no crash).
These strange errors and the crash inside index_delete_sort_cmp() can be
seen starting from commit dc7420c2.
On the previous commit (b8443eae) the reproducing script completes
without a crash or errors (triple-checked).
Probably bug #17257 has the same root cause, but the patch [1]
applied to REL_14_STABLE (b0f6bd48) doesn't prevent the crash.
Initially I've thought that the infinite loop in vacuum is a problem
itself, so I decided to separate that one, but maybe both bugs are too
related to be discussed apart.
Best regards,
Alexander
[1]: /messages/by-id/CAEze2Wj7O5tnM_U151Baxr5ObTJafwH=71_JEmgJV+6eBgjL7g@mail.gmail.com
On Sun, Nov 07, 2021 at 09:00:00PM +0300, Alexander Lakhin wrote:
> 31.10.2021 22:20, Dmitry Dolgov wrote:
> > ...
> > I've applied both patches from Peter, the fix itself and the
> > index-points-to-LP_UNUSED-item assertions. Now it doesn't crash on
> > pg_unreachable, but hits the extra assertions in the second patch:
>
> Yes, the committed fix for the bug #17245 doesn't help here.
>
> I've also noticed that the server crash is not the only possible
> outcome. You can also get unexpected errors like:
>
> ERROR:  relation "errtst_parent" already exists
> ERROR:  relation "tmp_idx1" already exists
> ERROR:  relation "errtst_child_plaindef" already exists
>
> or
>
> ERROR:  could not open relation with OID 1033921
> STATEMENT:  DROP TABLE errtst_parent;
>
> in the server.log (and no crash).
Interesting, I don't think I've observed those errors. In fact, after the
recent changes around the assertion logic and index_delete_check_htid
(I've compiled here from 39a31056), I'm now getting another type of crash
using your scripts. This time heap_page_prune_execute() stumbles upon a
non-heap-only tuple while trying to update unused line pointers:
#0 0x00007f0dfce072fb in raise () from /lib64/libc.so.6
#1 0x00007f0dfcdf0ef6 in abort () from /lib64/libc.so.6
#2 0x000055a66a87db05 in ExceptionalCondition (conditionName=0x55a66a914610 "HeapTupleHeaderIsHeapOnly(htup)", errorType=0x55a66a91419c "FailedAssertion", fileName=0x55a66a914190 "pruneheap.c", lineNumber=961) at assert.c:69
#3 0x000055a66a21b83a in heap_page_prune_execute (buffer=138, redirected=0x7fffac638994, nredirected=0, nowdead=0x7fffac638e20, ndead=12, nowunused=0x7fffac639066, nunused=1) at pruneheap.c:961
#4 0x000055a66a219d82 in heap_page_prune (relation=0x7f0dfdca0e20, buffer=138, vistest=0x55a66addf140 <GlobalVisCatalogRels>, old_snap_xmin=0, old_snap_ts=0, report_stats=true, off_loc=0x0) at pruneheap.c:295
The relation in question is pg_class, and htup apparently has
e.g. HEAP_KEYS_UPDATED, but no HEAP_ONLY_TUPLE flags set:
p *htup
$1 = {
t_choice = {
t_heap = {
t_xmin = 661929,
t_xmax = 662015,
t_field3 = {
t_cid = 0,
t_xvac = 0
}
},
t_datum = {
datum_len_ = 661929,
datum_typmod = 662015,
datum_typeid = 0
}
},
t_ctid = {
ip_blkid = {
bi_hi = 0,
bi_lo = 2004
},
ip_posid = 128
},
t_infomask2 = 8225,
t_infomask = 1281,
t_hoff = 32 ' ',
t_bits = 0x7f0de7080ee7 "\377\377\377?"
}
On Mon, Nov 8, 2021 at 9:28 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Interesting, I don't think I've observed those errors. In fact, after the
> recent changes around the assertion logic and index_delete_check_htid
> (I've compiled here from 39a31056), I'm now getting another type of crash
> using your scripts. This time heap_page_prune_execute() stumbles upon a
> non-heap-only tuple while trying to update unused line pointers:
It looks like the new heap_page_prune_execute() assertions catch the
same problem earlier. It's hard to not suspect the code in pruneheap.c
itself. Whatever else may have happened, the code in pruneheap.c ought
to not even try to set a non-heap-only tuple item to LP_UNUSED. ISTM
that it should explicitly look out for and avoid doing that.
Offhand, I wonder if we just need to have an additional check in
heap_prune_chain(), which is like the existing "If we find a redirect
somewhere else, stop --- it must not be same chain" handling used for
LP_REDIRECT items that aren't at the start of a HOT chain:
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index c7331d810..3c72cdf67 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -646,6 +646,9 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
 			!TransactionIdEquals(HeapTupleHeaderGetXmin(htup), priorXmax))
 			break;
 
+		if (nchain > 0 && !HeapTupleHeaderIsHeapOnly(htup))
+			break;				/* not at start of chain */
+
 		/*
 		 * OK, this tuple is indeed a member of the chain.
 		 */
I would expect the "Check the tuple XMIN against prior XMAX" test to
do what we need in the vast vast majority of cases, but why should it
be 100% effective?
--
Peter Geoghegan
On Mon, Nov 08, 2021 at 10:32:39AM -0800, Peter Geoghegan wrote:
> On Mon, Nov 8, 2021 at 9:28 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > Interesting, I don't think I've observed those errors. In fact, after the
> > recent changes around the assertion logic and index_delete_check_htid
> > (I've compiled here from 39a31056), I'm now getting another type of crash
> > using your scripts. This time heap_page_prune_execute() stumbles upon a
> > non-heap-only tuple while trying to update unused line pointers:
>
> It looks like the new heap_page_prune_execute() assertions catch the
> same problem earlier. It's hard to not suspect the code in pruneheap.c
> itself. Whatever else may have happened, the code in pruneheap.c ought
> to not even try to set a non-heap-only tuple item to LP_UNUSED. ISTM
> that it should explicitly look out for and avoid doing that.
>
> Offhand, I wonder if we just need to have an additional check in
> heap_prune_chain(), which is like the existing "If we find a redirect
> somewhere else, stop --- it must not be same chain" handling used for
> LP_REDIRECT items that aren't at the start of a HOT chain:
Yes, adding such a condition works in this case: no non-heap-only tuples
were recorded as unused in heap_prune_chain, and nothing else popped up
afterwards. But now, after a couple of runs, I could also reproduce (at
least partially) what Alexander was talking about:
ERROR: could not open relation with OID 1056321
Not sure yet where it is coming from.
On Tue, Nov 9, 2021 at 7:01 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Yes, adding such a condition works in this case: no non-heap-only tuples
> were recorded as unused in heap_prune_chain, and nothing else popped up
> afterwards. But now, after a couple of runs, I could also reproduce (at
> least partially) what Alexander was talking about:
>
> ERROR: could not open relation with OID 1056321
I've seen that too.
> Not sure yet where it is coming from.
I think that the additional check that I sketched (in
heap_prune_chain()) is protective in that it prevents a bad situation
from becoming even worse. At the same time it doesn't actually fix
anything.
I've discussed this privately with Andres -- expect more from him
soon. I came up with more sophisticated instrumentation (better
assertions, really) that shows that the problem begins in VACUUM, not
opportunistic pruning (certainly with the test case we have).
The real problem (identified by Andres) seems to be that pruning feels
entitled to corrupt HOT chains by making an existing LP_REDIRECT
continue to point to a DEAD item that it marks LP_UNUSED. That's never
supposed to happen. This seems to occur in the aborted heap-only tuple
"If the tuple is DEAD and doesn't chain to anything else, mark it
unused immediately" code path at the top of heap_prune_chain(). That
code path seems wonky, despite not having changed much in many years.
This wonky heap_prune_chain() code cares about whether the
to-be-set-LP_UNUSED heap-only tuple item chains to other items -- it's
only really needed for aborted tuples, but doesn't discriminate
between aborted tuples and other kinds of DEAD tuples. It doesn't seem
to get all the details right. In particular, it doesn't account for
the fact that it's not okay to break a HOT chain between the root
LP_REDIRECT item and the first heap-only tuple. It's only okay to do
that between two heap-only tuples.
HOT chain traversal code (in places like heap_hot_search_buffer())
knows how to deal with broken HOT chains when the breakage occurs
between two heap-only tuples, but not in this other LP_REDIRECT case.
It's just not possible for HOT chain traversal code to deal with that,
because there is not enough information in the LP_REDIRECT (just a
link to another item, no xmin or xmax) to validate anything on the
fly. It's pretty clear that we need to specifically make sure that
LP_REDIRECT items always point to something sensible. Anything less
risks causing confusion about which HOT chain lives at which TID.
Obviously we were quite right to suspect that there wasn't enough
rigor around the HOT chain invariants. Wasn't specifically expecting
it to help with this bug.
--
Peter Geoghegan
On Tue, Nov 9, 2021 at 9:51 AM Peter Geoghegan <pg@bowt.ie> wrote:
> I've discussed this privately with Andres -- expect more from him
> soon. I came up with more sophisticated instrumentation (better
> assertions, really) that shows that the problem begins in VACUUM, not
> opportunistic pruning (certainly with the test case we have).
Attached is a WIP fix for the bug. The idea here is to follow all HOT
chains in an initial pass over the page, while even following LIVE
heap-only tuples. Any heap-only tuples that we don't determine are
part of some valid HOT chain (following an initial pass over the whole
heap page) will now be processed in a second pass over the page. We
expect (and assert) that these "disconnected" heap-only tuples will
all be either DEAD or RECENTLY_DEAD. We treat them as DEAD either way,
on the grounds that they must be from an aborted xact in any case.
Note that we sometimes do something very similar already -- we can
sometimes consider some tuples from a HOT chain DEAD, even though
they're RECENTLY_DEAD (provided a later tuple from the chain really is
DEAD).
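The two-pass idea can be sketched as follows, in a deliberately
simplified Python model (the data structures and the `prune` function
are invented for illustration; the real heap_page_prune() works on page
items, not dicts). Pass 1 follows every chain from its root, even
through LIVE tuples; pass 2 frees whatever heap-only tuples were never
reached:

```python
# Toy sketch of the two-pass pruning idea -- not real PostgreSQL code.

DEAD, RECENTLY_DEAD, LIVE = "dead", "recently_dead", "live"

def prune(page):
    """page: {offset: item}; an item is a dict with an optional 'next'
    (HOT chain link), a 'heap_only' flag and a 'status'."""
    reachable = set()          # plays the role of the "fromvalidchain" array
    # Pass 1: follow chains from root items, even through LIVE tuples.
    for off, item in page.items():
        if item.get("heap_only"):
            continue           # only start traversal at chain roots
        cur = off
        while cur is not None:
            reachable.add(cur)
            cur = page[cur].get("next")
    # Pass 2: any heap-only tuple left over is "disconnected".
    freed = []
    for off, item in page.items():
        if item.get("heap_only") and off not in reachable:
            # these are expected to be DEAD or RECENTLY_DEAD, since
            # they can only come from an aborted xact
            assert item["status"] in (DEAD, RECENTLY_DEAD)
            freed.append(off)
    return freed

page = {
    1: {"next": 2, "status": LIVE},           # root of a valid chain
    2: {"heap_only": True, "status": LIVE},   # reached via offset 1
    3: {"heap_only": True, "status": DEAD},   # orphaned by an aborted update
}
assert prune(page) == [3]
```

The point of the structure is that a heap-only tuple can only be freed
after the whole-page pass has proven that no chain reaches it, rather
than by a local heuristic like the old "aborted HOT updates" precheck.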
The patch also has more detailed assertions inside heap_page_prune().
These should catch any HOT chain invariant violations at just about
the earliest opportunity, at least when assertions are enabled.
Especially because we're now following every HOT chain from beginning
to end, even when we already know that there are no more
DEAD/RECENTLY_DEAD tuples in the chain to be found.
I'm not sure why this seems to have become more of a problem following
the snapshot scalability work from Andres -- Alexander mentioned that
commit dc7420c2 looked like it was the source of the problem here, but
I can't see any reason why that might be true (even though I accept
that it might well *appear* to be true). I believe Andres has some
theory on that, but I don't know the details myself. AFAICT, this is a
live bug on all supported versions. We simply weren't being careful
enough about breaking the invariant that an LP_REDIRECT can only point
to a valid heap-only tuple. The really surprising thing here is that
it took this long for it to visibly break.
--
Peter Geoghegan
Attachments:
v1-0001-Fix-aborted-HOT-update-bug-in-heap-pruning.patch
On Tue, Nov 9, 2021 at 3:31 PM Peter Geoghegan <pg@bowt.ie> wrote:
Attached is a WIP fix for the bug. The idea here is to follow all HOT
chains in an initial pass over the page, while even following LIVE
heap-only tuples. Any heap-only tuples that we don't determine are
part of some valid HOT chain (following an initial pass over the whole
heap page) will now be processed in a second pass over the page.
I realized that I could easily go further than in v1, and totally get
rid of the "marked" array (which tracks whether we have decided to
mark an item as LP_DEAD/LP_UNUSED/a new LP_REDIRECT/newly pointed to
by another LP_REDIRECT). In my v1 from earlier today we already had an
array that records whether or not each item is part of any known valid
chain, which is strictly better than knowing whether or not they were
"marked" earlier. So why bother with the "marked" array at all, even
for assertions? It is less robust (not to mention less efficient) than
just using the new "fromvalidchain" array.
Attached is v2, which gets rid of the "marked" array as described. It
also has better worked out comments and assertions. The patch has
stood up to a fair amount of stress-testing. I repeated Alexander's
original test case for over an hour with this. Getting the test case
to cause an assertion failure would usually take about 5 minutes
without any fix.
I have yet to do any work on validating the performance of this patch,
though that definitely needs to happen.
Anybody have any thoughts on how far this should be backpatched? We'll
probably need to do that for Postgres 14. Less sure about other
branches, which haven't been directly demonstrated to be affected by
the bug so far. Haven't tried to break earlier branches with
Alexander's test case, though I will note again that Alexander
couldn't do that when he tried.
--
Peter Geoghegan
Attachments:
v2-0001-Fix-aborted-HOT-update-bug-in-heap-pruning.patch
Hi,
On 2021-11-09 15:31:37 -0800, Peter Geoghegan wrote:
I'm not sure why this seems to have become more of a problem following
the snapshot scalability work from Andres -- Alexander mentioned that
commit dc7420c2 looked like it was the source of the problem here, but
I can't see any reason why that might be true (even though I accept
that it might well *appear* to be true). I believe Andres has some
theory on that, but I don't know the details myself. AFAICT, this is a
live bug on all supported versions. We simply weren't being careful
enough about breaking the invariant that an LP_REDIRECT can only point
to a valid heap-only tuple. The really surprising thing here is that
it took this long for it to visibly break.
The way this definitely breaks - I have been able to reproduce this in
isolation - is when one tuple is processed twice by heap_prune_chain(), and
the result of HeapTupleSatisfiesVacuum() changes from
HEAPTUPLE_DELETE_IN_PROGRESS to DEAD.
Consider a page like this:
lp 1: redirect to lp2
lp 2: deleted by xid x, not yet committed
and a sequence of events like this:
1) heap_prune_chain(rootlp = 1)
2) commit x
3) heap_prune_chain(rootlp = 2)
1) heap_prune_chain(rootlp = 1) will go to lp2, and see a
HEAPTUPLE_DELETE_IN_PROGRESS and thus not do anything.
3) then could, with the snapshot scalability changes, get DEAD back from
HTSV. Due to the "fuzzy" nature of the post-snapshot-scalability xid horizons,
that is possible, because we can end up rechecking the boundary condition and
seeing that now the horizon allows us to prune x / lp2.
At that point we have a redirect tuple pointing into an unused slot. Which is
"illegal", because something independent can be inserted into that slot.
What made this hard to understand (and likely hard to hit) is that we don't
recompute the xid horizons more than once per hot pruning [1]. At first I
concluded that a change from RECENTLY_DEAD to DEAD could thus not happen - and
it doesn't: We go from HEAPTUPLE_DELETE_IN_PROGRESS to DEAD, which is possible
because there was no horizon test for HEAPTUPLE_DELETE_IN_PROGRESS.
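The sequence can be replayed in a toy Python simulation (the structures
and the `prune_chain`/`htsv` names here are invented stand-ins, not the
real PostgreSQL code). The only essential ingredient is that the
visibility check's answer for the same tuple changes from
DELETE_IN_PROGRESS to DEAD between two calls within one prune of the
page:

```python
# Toy replay of the race described above -- not real PostgreSQL code.

DELETE_IN_PROGRESS, DEAD = "delete_in_progress", "dead"

class Page:
    def __init__(self):
        self.items = {1: ("redirect", 2), 2: ("normal", "tuple")}
    def mark_unused(self, off):
        self.items[off] = ("unused", None)

def prune_chain(page, rootlp, htsv):
    kind, target = page.items[rootlp]
    if kind == "redirect":
        if htsv(target) == DEAD:
            # a correct implementation would also deal with the
            # redirect; this buggy path only frees the tuple
            page.mark_unused(target)
    elif kind == "normal":
        if htsv(rootlp) == DEAD:
            page.mark_unused(rootlp)

page = Page()
# HTSV's answer for lp 2 changes mid-prune, after the deleter commits
# and the fuzzy horizon is recomputed:
answers = iter([DELETE_IN_PROGRESS, DEAD])
htsv = lambda off: next(answers)

prune_chain(page, 1, htsv)   # sees DELETE_IN_PROGRESS: does nothing
prune_chain(page, 2, htsv)   # now sees DEAD: frees lp 2

# lp 1 still redirects to lp 2, which is now unused -- the "illegal"
# state, since an unrelated tuple can be inserted into that slot.
assert page.items[1] == ("redirect", 2)
assert page.items[2] == ("unused", None)
```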
Note that there are several paths < 14 that cause HTSV()'s answer to change
for the same xid. E.g. when the transaction inserting a tuple version aborts,
we go from HEAPTUPLE_INSERT_IN_PROGRESS to DEAD. But I haven't quite found a
path to trigger problems with that, because there won't be redirects to a
tuple version that is HEAPTUPLE_INSERT_IN_PROGRESS (but there can be redirects
to a HEAPTUPLE_DELETE_IN_PROGRESS or RECENTLY_DEAD).
I hit a crash once in 13 with a slightly evolved version of the test (many
connections creating / dropping the partitions as in the original scenario,
using :client_id to target different tables). It's possible that my
instrumentation was the cause of that. Unfortunately it took quite a few hours
to hit the problem in 13...
Greetings,
Andres Freund
[1]: it's a bit more complicated than that, we only recompute the horizon when a) we've not done it before in the current xact, b) RecentXmin changed during a snapshot computation. Recomputing the horizon is expensive-ish, so we don't want to do it constantly.
On Wed, Nov 10, 2021 at 11:20 AM Andres Freund <andres@anarazel.de> wrote:
The way this definitely breaks - I have been able to reproduce this in
isolation - is when one tuple is processed twice by heap_prune_chain(), and
the result of HeapTupleSatisfiesVacuum() changes from
HEAPTUPLE_DELETE_IN_PROGRESS to DEAD.
I had no idea that that was now possible. I really think that this
ought to be documented centrally.
As you know, I don't like the way that vacuumlazy.c doesn't explain
anything about the relationship between OldestXmin (which still
exists, but isn't used for pruning), and the similar GlobalVisState
state (used only during pruning). Surely this deserves to be
addressed, because we expect these two things to agree in certain
specific ways. But not necessarily in others.
Note that there are several paths < 14, that cause HTSV()'s answer to change
for the same xid. E.g. when the transaction inserting a tuple version aborts,
we go from HEAPTUPLE_INSERT_IN_PROGRESS to DEAD.
Right -- that one I certainly knew about. After all, the
tupgone-ectomy work from my commit 8523492d specifically targeted this
case.
But I haven't quite found a
path to trigger problems with that, because there won't be redirects to a
tuple version that is HEAPTUPLE_INSERT_IN_PROGRESS (but there can be redirects
to a HEAPTUPLE_DELETE_IN_PROGRESS or RECENTLY_DEAD).
That explains why the snapshot scalability either made these problems
possible for the first time, or at the very least made them far far
more likely in practice.
The relevant code in pruneheap.c was always incredibly fragile -- no
question. Even still, there is really no good reason to believe that
that was actually a problem before commit dc7420c2. Even if we assume
that there's a problem before 14, the surface area is vastly smaller
than on 14 -- the relevant pruneheap.c code hasn't really ever changed
since HOT went in. And so I think that the most sensible course of
action here is this: commit a fix to Postgres 14 + HEAD only -- no
backpatch to earlier versions.
We could go back further than that, but ISTM that the risk of causing
new problems far outweighs the benefits. Whereas I feel pretty
confident that we need to do something on 14.
--
Peter Geoghegan
On Wed, Nov 10, 2021 at 1:04 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Nov 10, 2021 at 11:20 AM Andres Freund <andres@anarazel.de> wrote:
The way this definitely breaks - I have been able to reproduce this in
isolation - is when one tuple is processed twice by heap_prune_chain(), and
the result of HeapTupleSatisfiesVacuum() changes from
HEAPTUPLE_DELETE_IN_PROGRESS to DEAD.

I had no idea that that was now possible. I really think that this
ought to be documented centrally.
BTW, is it just a coincidence that we have only seen these problems
with pg_class? Or does it have something to do with the specifics of
VISHORIZON_CATALOG/horizons.catalog_oldest_nonremovable, or something
else along those lines?
--
Peter Geoghegan
On Wed, Nov 10, 2021 at 11:20 AM Andres Freund <andres@anarazel.de> wrote:
I hit a crash once in 13 with a slightly evolved version of the test (many
connections creating / dropping the partitions as in the original scenario,
using :client_id to target different tables). It's possible that my
instrumentation was the cause of that. Unfortunately it took quite a few hours
to hit the problem in 13...
Have you thought about the case where a transaction does a HOT update
of the same row twice, and then aborts?
I'm asking because I notice that the fragile "We need this primarily
to handle aborted HOT updates" precheck for
HeapTupleHeaderIsHeapOnly() doesn't just check if the heap-only tuple
is DEAD before deciding to mark it LP_UNUSED. It also checks
HeapTupleHeaderIsHotUpdated() against the target tuple -- that's
another condition of the tuple being marked unused. Of course, whether
or not a given tuple is considered HeapTupleHeaderIsHotUpdated() can
change from true to false when an updater concurrently aborts. Could
that have race conditions?
In other words: what if the aforementioned "aborted HOT updates"
precheck code doesn't deal with a DEAD tuple, imagining that it's not
a relevant tuple, while at the same time the later HOT-chain-chasing
code *also* doesn't get to the tuple? What if they each assume that
the other will/has taken care of it, due to a race?
So far we've been worried about cases where these two code paths
clobber each other -- that's what we've actually seen. We should also
at least consider the possibility that we have the opposite problem,
which is what this really is.
--
Peter Geoghegan
Hi,
On 2021-11-10 13:04:43 -0800, Peter Geoghegan wrote:
On Wed, Nov 10, 2021 at 11:20 AM Andres Freund <andres@anarazel.de> wrote:
The way this definitely breaks - I have been able to reproduce this in
isolation - is when one tuple is processed twice by heap_prune_chain(), and
the result of HeapTupleSatisfiesVacuum() changes from
HEAPTUPLE_DELETE_IN_PROGRESS to DEAD.

I had no idea that that was now possible. I really think that this
ought to be documented centrally.
Where would you suggest?
The relevant code in pruneheap.c was always incredibly fragile -- no
question. Even still, there is really no good reason to believe that
that was actually a problem before commit dc7420c2. Even if we assume
that there's a problem before 14, the surface area is vastly smaller
than on 14 -- the relevant pruneheap.c code hasn't really ever changed
since HOT went in. And so I think that the most sensible course of
action here is this: commit a fix to Postgres 14 + HEAD only -- no
backpatch to earlier versions.
Yea. The fact that I also saw *one* error in 13 worries me a bit, but perhaps
that was something else. Even if we eventually need to backpatch something
further, having it in 14/master first is good.
The fact that 13 didn't trigger the problem reliably doesn't necessarily mean
much - it's a pretty limited workload. There are, e.g., no aborts.
I think we might be able to do something a bit more limited than what you
propose. But I'm not sure it's worth going for that.
Greetings,
Andres Freund
Hi,
On 2021-11-10 14:18:01 -0800, Peter Geoghegan wrote:
On Wed, Nov 10, 2021 at 11:20 AM Andres Freund <andres@anarazel.de> wrote:
I hit a crash once in 13 with a slightly evolved version of the test (many
connections creating / dropping the partitions as in the original scenario,
using :client_id to target different tables). It's possible that my
instrumentation was the cause of that. Unfortunately it took quite a few hours
to hit the problem in 13...

Have you thought about the case where a transaction does a HOT update
of the same row twice, and then aborts?
Yes. I don't think it's problematic right now, because the redirect would, I
think, in all cases have to point to the chain element before those tuples,
because the preceding value would just have to be DELETE_IN_PROGRESS, which
we don't follow in heap_prune_chain().
I'm asking because I notice that the fragile "We need this primarily
to handle aborted HOT updates" precheck for
HeapTupleHeaderIsHeapOnly() doesn't just check if the heap-only tuple
is DEAD before deciding to mark it LP_UNUSED. It also checks
HeapTupleHeaderIsHotUpdated() against the target tuple -- that's
another condition of the tuple being marked unused. Of course, whether
or not a given tuple is considered HeapTupleHeaderIsHotUpdated() can
change from true to false when an updater concurrently aborts. Could
that have race conditions?
I wondered about that too, but I couldn't *quite* come up with a problematic
scenario, because I don't think any of the cases that can change
HeapTupleHeaderIsHotUpdated() would have allowed the redirect to be set to a
subsequent chain element.
In other words: what if the aforementioned "aborted HOT updates"
precheck code doesn't deal with a DEAD tuple, imagining that it's not
a relevant tuple, while at the same time the later HOT-chain-chasing
code *also* doesn't get to the tuple? What if they each assume that
the other will/has taken care of it, due to a race?
Then we'd just end up not pruning the tuple, I think. Which should be fine, as
it could only happen for fairly new tuples.
Greetings,
Andres Freund
Hi,
On 2021-11-10 13:40:25 -0800, Peter Geoghegan wrote:
On Wed, Nov 10, 2021 at 1:04 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Nov 10, 2021 at 11:20 AM Andres Freund <andres@anarazel.de> wrote:
The way this definitely breaks - I have been able to reproduce this in
isolation - is when one tuple is processed twice by heap_prune_chain(), and
the result of HeapTupleSatisfiesVacuum() changes from
HEAPTUPLE_DELETE_IN_PROGRESS to DEAD.
I had no idea that that was now possible. I really think that this
ought to be documented centrally.

BTW, is it just a coincidence that we have only seen these problems
with pg_class? Or does it have something to do with the specifics of
VISHORIZON_CATALOG/horizons.catalog_oldest_nonremovable, or something
else along those lines?
It's the testcase. The test instructions specify autovacuum=off and only
pg_class is vacuumed. There are other HOT updates than to pg_class, but
there's much less HOT pruning on them because there's no loop of VACUUMs.
Greetings,
Andres Freund
On Wed, Nov 10, 2021 at 4:57 PM Andres Freund <andres@anarazel.de> wrote:
Yes. I don't think it's problematic right now, because the redirect would, I
think, in all cases have to point to the chain element before those tuples,
because the preceding value would just have to be DELETE_IN_PROGRESS, which
we don't follow in heap_prune_chain().
Actually, I was more worried about versions before Postgres 14 here --
versions that don't have that new DELETE_IN_PROGRESS pruning behavior
you mentioned.
I'm asking because I notice that the fragile "We need this primarily
to handle aborted HOT updates" precheck for
HeapTupleHeaderIsHeapOnly() doesn't just check if the heap-only tuple
is DEAD before deciding to mark it LP_UNUSED. It also checks
HeapTupleHeaderIsHotUpdated() against the target tuple -- that's
another condition of the tuple being marked unused. Of course, whether
or not a given tuple is considered HeapTupleHeaderIsHotUpdated() can
change from true to false when an updater concurrently aborts. Could
that have race conditions?

I wondered about that too, but I couldn't *quite* come up with a problematic
scenario, because I don't think any of the cases that can change
HeapTupleHeaderIsHotUpdated() would have allowed the redirect to be set to a
subsequent chain element.
It's pretty complicated.
In other words: what if the aforementioned "aborted HOT updates"
precheck code doesn't deal with a DEAD tuple, imagining that it's not
a relevant tuple, while at the same time the later HOT-chain-chasing
code *also* doesn't get to the tuple? What if they each assume that
the other will/has taken care of it, due to a race?

Then we'd just end up not pruning the tuple, I think. Which should be fine, as
it could only happen for fairly new tuples.
It's pretty far from ideal if the situation cannot correct itself in a
timely manner. Offhand I wonder if this might have something to do
with remaining suspected cases where we restart pruning during VACUUM,
based on the belief that an inserter concurrently aborted. If the DEAD
tuple is unreachable by pruning *forever*, then we're in trouble.
The good news is that the fix for missing DEAD tuples (a hypothetical
problem) is the same as the fix for the known, concrete problem:
process whole HOT chains first, and only later (in a "second pass"
over the page) process remaining heap-only tuples -- those heap-only
tuples that cannot be located any other way (because we tried and
failed).
You probably noticed that my patch does *not* have the same
"!HeapTupleHeaderIsHotUpdated()" check for these
unreachable-via-HOT-chain heap-only tuples. The mere fact that they're
unreachable is a good enough reason to consider them DEAD (and so mark
them LP_UNUSED). We do check heap_prune_satisfies_vacuum() for these
tuples too, but that is theoretically unnecessary. This feels pretty
watertight to me -- contradictory interpretations of what a heap-only
tuple is and what HOT chain (if any) it belongs to now seem impossible.
I benchmarked the patch, and it looks like there is a consistent
reduction in latency -- which you probably won't find too surprising.
That isn't the goal, but it is nice to see that new performance
regressions are unlikely to be a problem for the bug fix.
--
Peter Geoghegan