BUG #17257: (auto)vacuum hangs within lazy_scan_prune()
The following bug has been logged on the website:
Bug reference: 17257
Logged by: Alexander Lakhin
Email address: exclusion@gmail.com
PostgreSQL version: 14.0
Operating system: Ubuntu 20.04
Description:
When running concurrent installchecks (x100) I've observed the autovacuum
process hanging:
tester 328713 328706 0 12:19 ? 00:02:03 postgres: logical
replication launcher
tester 3301788 328706 0 16:33 ? 00:00:02 postgres: postgres
regress030 127.0.0.1(44534) CREATE TABLE
tester 3302076 328706 99 16:33 ? 05:17:10 postgres: autovacuum
worker regress030
tester 3311424 328706 0 16:34 ? 00:03:09 postgres: postgres
regress090 127.0.0.1(48744) ALTER TABLE
tester 3311453 328706 1 16:34 ? 00:03:12 postgres: postgres
regress048 127.0.0.1(48768) ALTER TABLE
tester 3311461 328706 1 16:34 ? 00:03:11 postgres: postgres
regress017 127.0.0.1(48774) DROP TABLE
tester 3311490 328706 0 16:34 ? 00:03:09 postgres: postgres
regress006 127.0.0.1(48804) ALTER TABLE
tester 3311493 328706 0 16:34 ? 00:03:08 postgres: postgres
regress008 127.0.0.1(48806) ALTER TABLE
gdb showed the following stacktrace:
0x0000564d6c93cd1f in GetPrivateRefCountEntry (buffer=<optimized out>,
do_move=<optimized out>) at bufmgr.c:312
312 Assert(BufferIsValid(buffer));
(gdb) bt
#0 0x0000564d6c93cd1f in GetPrivateRefCountEntry (buffer=<optimized out>,
do_move=<optimized out>) at bufmgr.c:312
#1 0x0000564d6c93fb79 in GetPrivateRefCount (buffer=111422) at
bufmgr.c:398
#2 BufferGetBlockNumber (buffer=buffer@entry=111422) at bufmgr.c:2771
#3 0x0000564d6b83519d in heap_prune_chain (prstate=0x7fff07f03e10,
rootoffnum=7, buffer=111422) at pruneheap.c:625
#4 heap_page_prune (relation=relation@entry=0x7fa60a9cff80,
buffer=buffer@entry=111422,
vistest=vistest@entry=0x564d710816a0 <GlobalVisCatalogRels>,
old_snap_xmin=old_snap_xmin@entry=0,
old_snap_ts=old_snap_ts@entry=0, report_stats=report_stats@entry=false,
off_loc=<optimized out>) at pruneheap.c:278
#5 0x0000564d6b84a7d4 in lazy_scan_prune (vacrel=0x625000044190,
buf=<optimized out>, blkno=<optimized out>,
page=0x7fa642bdd580 <incomplete sequence \334>, vistest=<optimized out>,
prunestate=<optimized out>)
at vacuumlazy.c:1741
#6 0x0000564d6b8597cd in lazy_scan_heap (aggressive=<optimized out>,
params=<optimized out>, vacrel=<optimized out>)
at vacuumlazy.c:1384
#7 heap_vacuum_rel (rel=<optimized out>, params=<optimized out>,
bstrategy=<optimized out>) at vacuumlazy.c:638
#8 0x0000564d6bf9b6b6 in table_relation_vacuum (bstrategy=<optimized out>,
params=0x6250000272c4, rel=0x7fa60a9cff80)
at ../../../src/include/access/tableam.h:1678
#9 vacuum_rel (relid=<optimized out>, relation=<optimized out>,
params=0x6250000272c4) at vacuum.c:2034
#10 0x0000564d6bfa11f0 in vacuum (relations=0x625000041278,
relations@entry=0x6250000552b0,
params=params@entry=0x6250000272c4, bstrategy=<optimized out>,
bstrategy@entry=0x625000053048,
isTopLevel=isTopLevel@entry=true) at vacuum.c:475
#11 0x0000564d6c6e3a37 in autovacuum_do_vac_analyze
(bstrategy=0x625000053048, tab=0x6250000272c0) at autovacuum.c:3242
#12 do_autovacuum () at autovacuum.c:2490
#13 0x0000564d6c6e6681 in AutoVacWorkerMain (argv=0x0, argc=0) at
autovacuum.c:1714
#14 0x0000564d6c6e6a95 in StartAutoVacWorker () at autovacuum.c:1499
#15 0x0000564d6c7272d4 in StartAutovacuumWorker () at postmaster.c:5561
#16 sigusr1_handler (postgres_signal_arg=<optimized out>) at
postmaster.c:5266
#17 <signal handler called>
#18 __GI___sigprocmask (how=2, set=<optimized out>, oset=0x0) at
../sysdeps/unix/sysv/linux/sigprocmask.c:39
#19 0x0000564d6b52efc4 in __interceptor_sigprocmask.part.0 ()
#20 0x0000564d6c727d9e in ServerLoop () at postmaster.c:1707
#21 0x0000564d6c72c309 in PostmasterMain (argc=argc@entry=3,
argv=argv@entry=0x603000000250) at postmaster.c:1417
#22 0x0000564d6b4f6995 in main (argc=3, argv=0x603000000250) at main.c:209
(gdb) b 1820
Breakpoint 1 at 0x564d6b84b378: file vacuumlazy.c, line 1820.
(gdb) cont
Continuing.
Breakpoint 1, lazy_scan_prune (vacrel=0x625000044190, buf=<optimized out>,
blkno=<optimized out>,
page=0x7fa642bdd580 <incomplete sequence \334>, vistest=<optimized out>,
prunestate=<optimized out>)
at vacuumlazy.c:1820
1820 goto retry;
(gdb) print *vacrel
$15 = {rel = 0x7fa60a9cff80, indrels = 0x625000043b70, nindexes = 5,
failsafe_active = false,
consider_bypass_optimization = true, do_index_vacuuming = true,
do_index_cleanup = true, do_rel_truncate = true,
bstrategy = 0x625000053048, lps = 0x0, old_rel_pages = 6, old_live_tuples
= 133, relfrozenxid = 726, relminmxid = 1,
OldestXmin = 48155004, FreezeLimit = 4293122300, MultiXactCutoff =
4289973311,
relnamespace = 0x625000044140 "pg_catalog", relname = 0x625000044168
"pg_constraint", indname = 0x0, blkno = 11,
offnum = 32, phase = VACUUM_ERRCB_PHASE_SCAN_HEAP, dead_tuples =
0x62a0000a8240, rel_pages = 13, scanned_pages = 12,
pinskipped_pages = 0, frozenskipped_pages = 0, tupcount_pages = 12,
pages_removed = 0, lpdead_item_pages = 9,
nonempty_pages = 11, new_rel_tuples = 0, new_live_tuples = 0, indstats =
0x6250000444d8, num_index_scans = 0,
tuples_deleted = 9, lpdead_items = 226, new_dead_tuples = 0, num_tuples =
237, live_tuples = 237}
I've also caught a similar hang while (auto)vacuuming pg_attrdef,
pg_shdepend, and pg_class.
It seems that this eternal retry is caused by the HEAPTUPLE_RECENTLY_DEAD
state of a tuple processed in the lazy_scan_prune loop.
29.10.2021 16:00, PG Bug reporting form wrote:
The following bug has been logged on the website:
Bug reference: 17257
Logged by: Alexander Lakhin
Email address: exclusion@gmail.com
PostgreSQL version: 14.0
Operating system: Ubuntu 20.04
Description:
When running concurrent installchecks (x100) I've observed the autovacuum
process hanging:
I can propose a debugging patch to reproduce the issue: it replaces
the hang with an assert and modifies a pair of crash-causing test
scripts to simplify reproduction. (Sorry, I have no time now to prune
down the scripts further, as I have to leave for a week.)
The reproducing script is:
###
export PGDATABASE=regression
createdb regression
echo "
vacuum (verbose, skip_locked, index_cleanup off) pg_catalog.pg_class;
select pg_sleep(random()/50);
" >/tmp/pseudo-autovacuum.sql
( timeout 10m bash -c "while true; do psql -f
src/test/regress/sql/inherit.sql >>psql1.log 2>&1; coredumpctl
--no-pager >/dev/null 2>&1 && break; done" ) &
( timeout 10m bash -c "while true; do psql -f
src/test/regress/sql/vacuum.sql >>psql2.log 2>&1; coredumpctl --no-pager
>/dev/null 2>&1 && break; done" ) &
pgbench -n -f /tmp/pseudo-autovacuum.sql -C -c 40 -T 600 >pgbench.log 2>&1 &
wait
###
(I've set:
autovacuum=off
fsync=off
in postgresql.conf)
It leads to:
TIME PID UID GID SIG COREFILE EXE
Fri 2021-10-29 16:15:59 MSK 2337121 1000 1000 6 present
.../usr/local/pgsql/bin/postgres
real 2m19,425s
user 4m26,078s
sys 0m31,658s
Core was generated by `postgres: law regression [local]
VACUUM '.
Program terminated with signal SIGABRT, Aborted.
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007f3d17e88859 in __GI_abort () at abort.c:79
#2 0x0000557358fbecb8 in ExceptionalCondition
(conditionName=conditionName@entry=0x557359027544 "numretries < 100",
errorType=errorType@entry=0x55735901800b "FailedAssertion",
fileName=0x7ffef5bbfa70 "\231\354\373XsU",
fileName@entry=0x557359027520 "vacuumlazy.c",
lineNumber=lineNumber@entry=1726) at assert.c:69
#3 0x0000557358ba0bdb in lazy_scan_prune (vacrel=0x55735a508d60,
buf=16184, blkno=270, page=0x7f3d16ef6f80 "",
vistest=0x5573592dd540 <GlobalVisCatalogRels>,
prunestate=0x7ffef5bc0fb0) at vacuumlazy.c:1823
#4 0x0000557358ba38c5 in lazy_scan_heap (aggressive=false,
params=0x7ffef5bc13a0, vacrel=<optimized out>)
at vacuumlazy.c:1384
#5 heap_vacuum_rel (rel=0x7f3d183ebec8, params=0x7ffef5bc13a0,
bstrategy=<optimized out>) at vacuumlazy.c:638
#6 0x0000557358cec785 in table_relation_vacuum (bstrategy=<optimized
out>, params=0x7ffef5bc13a0, rel=0x7f3d183ebec8)
at ../../../src/include/access/tableam.h:1678
#7 vacuum_rel (relid=1259, relation=<optimized out>,
params=params@entry=0x7ffef5bc13a0) at vacuum.c:2034
#8 0x0000557358cedf9e in vacuum (relations=0x55735a57bea8,
params=0x7ffef5bc13a0, bstrategy=<optimized out>,
isTopLevel=<optimized out>) at vacuum.c:475
#9 0x0000557358cee4f5 in ExecVacuum
(pstate=pstate@entry=0x55735a4ab730, vacstmt=vacstmt@entry=0x55735a48a408,
isTopLevel=isTopLevel@entry=true) at vacuum.c:268
#10 0x0000557358e9de4e in standard_ProcessUtility (pstmt=0x55735a48a758,
queryString=0x55735a4895e0 "vacuum (verbose, skip_locked,
index_cleanup off) pg_catalog.pg_class;",
readOnlyTree=<optimized out>, context=PROCESS_UTILITY_TOPLEVEL,
params=0x0, queryEnv=0x0, dest=0x55735a48a848,
qc=0x7ffef5bc1700) at utility.c:858
#11 0x0000557358e9c411 in PortalRunUtility
(portal=portal@entry=0x55735a4ed040, pstmt=pstmt@entry=0x55735a48a758,
isTopLevel=isTopLevel@entry=true,
setHoldSnapshot=setHoldSnapshot@entry=false,
dest=dest@entry=0x55735a48a848,
qc=qc@entry=0x7ffef5bc1700) at pquery.c:1155
#12 0x0000557358e9c54d in PortalRunMulti
(portal=portal@entry=0x55735a4ed040, isTopLevel=isTopLevel@entry=true,
setHoldSnapshot=setHoldSnapshot@entry=false,
dest=dest@entry=0x55735a48a848, altdest=altdest@entry=0x55735a48a848,
qc=qc@entry=0x7ffef5bc1700) at pquery.c:1312
#13 0x0000557358e9cbe9 in PortalRun (portal=portal@entry=0x55735a4ed040,
count=count@entry=9223372036854775807,
isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true,
dest=dest@entry=0x55735a48a848,
altdest=altdest@entry=0x55735a48a848, qc=0x7ffef5bc1700) at pquery.c:788
#14 0x0000557358e9895b in exec_simple_query (
query_string=0x55735a4895e0 "vacuum (verbose, skip_locked,
index_cleanup off) pg_catalog.pg_class;")
at postgres.c:1214
#15 0x0000557358e9a561 in PostgresMain (argc=argc@entry=1,
argv=argv@entry=0x7ffef5bc1b70, dbname=<optimized out>,
username=<optimized out>) at postgres.c:4486
#16 0x0000557358e064dd in BackendRun (port=0x55735a4aeb00,
port=0x55735a4aeb00) at postmaster.c:4506
#17 BackendStartup (port=0x55735a4aeb00) at postmaster.c:4228
#18 ServerLoop () at postmaster.c:1745
#19 0x0000557358e07481 in PostmasterMain (argc=<optimized out>,
argv=<optimized out>) at postmaster.c:1417
#20 0x0000557358b352e2 in main (argc=3, argv=0x55735a4838a0) at main.c:209
Sometimes the server crashes in another way, as described in bug #17255.
And I've also seen the following crash (maybe it's yet another bug, but
I can't explore it now):
Core was generated by `postgres: law regression [local]
EXPLAIN '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 RelationBuildPartitionDesc (omit_detached=true, rel=0x7f09b87a27e8)
at partdesc.c:230
230 datum = heap_getattr(tuple,
Anum_pg_class_relpartbound,
(gdb) bt
#0 RelationBuildPartitionDesc (omit_detached=true, rel=0x7f09b87a27e8)
at partdesc.c:230
#1 RelationGetPartitionDesc (rel=0x7f09b87a27e8,
omit_detached=<optimized out>) at partdesc.c:110
#2 0x000055ec12169032 in PartitionDirectoryLookup (pdir=0x55ec12ee7660,
rel=rel@entry=0x7f09b87a27e8)
at partdesc.c:425
#3 0x000055ec12158b35 in set_relation_partition_info
(relation=0x7f09b87a27e8, rel=0x55ec12f337c0,
root=0x55ec12effae0) at plancat.c:2197
#4 get_relation_info (root=<optimized out>, relationObjectId=<optimized
out>, inhparent=<optimized out>,
rel=<optimized out>) at plancat.c:472
#5 0x000055ec1215d26f in build_simple_rel
(root=root@entry=0x55ec12effae0, relid=5,
parent=parent@entry=0x55ec12ee7330) at relnode.c:307
#6 0x000055ec1214f495 in expand_partitioned_rtentry
(root=0x55ec12effae0, relinfo=0x55ec12ee7330,
parentrte=0x55ec12e006d0, parentRTindex=1, parentrel=0x7f09b87a0630,
top_parentrc=0x0, lockmode=1) at inherit.c:398
#7 0x000055ec1214f67e in expand_inherited_rtentry
(root=root@entry=0x55ec12effae0, rel=0x55ec12ee7330,
rte=0x55ec12e006d0, rti=rti@entry=1) at inherit.c:143
#8 0x000055ec1212eb8d in add_other_rels_to_query
(root=root@entry=0x55ec12effae0) at initsplan.c:163
#9 0x000055ec12131e3f in query_planner (root=root@entry=0x55ec12effae0,
qp_callback=qp_callback@entry=0x55ec12132b20 <standard_qp_callback>,
qp_extra=qp_extra@entry=0x7ffcffe4e890)
at planmain.c:264
#10 0x000055ec121371b7 in grouping_planner (root=<optimized out>,
tuple_fraction=<optimized out>) at planner.c:1442
#11 0x000055ec12139d0e in subquery_planner
(glob=glob@entry=0x55ec12ef3eb8, parse=parse@entry=0x55ec12e005b8,
parent_root=parent_root@entry=0x0,
hasRecursion=hasRecursion@entry=false,
tuple_fraction=tuple_fraction@entry=0)
at planner.c:1019
#12 0x000055ec1213a383 in standard_planner (parse=0x55ec12e005b8,
query_string=<optimized out>, cursorOptions=2048,
boundParams=<optimized out>) at planner.c:400
#13 0x000055ec1220d7f8 in pg_plan_query (querytree=0x55ec12e005b8,
querytree@entry=0x7ffcffe4ead0,
query_string=query_string@entry=0x55ec12dff520 "explain (costs off)
select * from range_list_parted;",
cursorOptions=cursorOptions@entry=2048,
boundParams=boundParams@entry=0x0) at postgres.c:847
#14 0x000055ec12025fef in ExplainOneQuery (query=0x7ffcffe4ead0,
cursorOptions=2048, into=0x0, es=0x55ec12ef3bd0,
queryString=0x55ec12dff520 "explain (costs off) select * from
range_list_parted;", params=0x0, queryEnv=0x0)
at explain.c:397
#15 0x000055ec1202679f in ExplainQuery (pstate=0x55ec12ef3a70,
stmt=0x55ec12e003d8, params=0x0, dest=0x55ec12ef39d8)
at ../../../src/include/nodes/nodes.h:604
#16 0x000055ec122132c9 in standard_ProcessUtility (pstmt=0x55ec12e01058,
queryString=0x55ec12dff520 "explain (costs off) select * from
range_list_parted;", readOnlyTree=<optimized out>,
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
dest=0x55ec12ef39d8, qc=0x7ffcffe4ed60)
at utility.c:862
#17 0x000055ec122116cf in PortalRunUtility
(portal=portal@entry=0x55ec12e63060, pstmt=0x55ec12e01058,
isTopLevel=<optimized out>,
setHoldSnapshot=setHoldSnapshot@entry=true, dest=dest@entry=0x55ec12ef39d8,
qc=qc@entry=0x7ffcffe4ed60) at pquery.c:1155
#18 0x000055ec12211bb0 in FillPortalStore (portal=0x55ec12e63060,
isTopLevel=<optimized out>)
at ../../../src/include/nodes/nodes.h:604
#19 0x000055ec12211ebd in PortalRun (portal=portal@entry=0x55ec12e63060,
count=count@entry=9223372036854775807,
isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true,
dest=dest@entry=0x55ec12e01148,
altdest=altdest@entry=0x55ec12e01148, qc=0x7ffcffe4ef60) at pquery.c:760
#20 0x000055ec1220dcf5 in exec_simple_query (
query_string=0x55ec12dff520 "explain (costs off) select * from
range_list_parted;") at postgres.c:1214
#21 0x000055ec1220f886 in PostgresMain (argc=argc@entry=1,
argv=argv@entry=0x7ffcffe4f4b0, dbname=<optimized out>,
username=<optimized out>) at postgres.c:4486
#22 0x000055ec1218071a in BackendRun (port=<optimized out>,
port=<optimized out>) at postmaster.c:4506
#23 BackendStartup (port=<optimized out>) at postmaster.c:4228
#24 ServerLoop () at postmaster.c:1745
#25 0x000055ec12181682 in PostmasterMain (argc=argc@entry=3,
argv=argv@entry=0x55ec12df9ce0) at postmaster.c:1417
#26 0x000055ec11eca3a0 in main (argc=3, argv=0x55ec12df9ce0) at main.c:209
Best regards,
Alexander
Attachments:
vacuum-hang-demo.patch (text/x-patch)
On Fri, Oct 29, 2021 at 6:30 AM Alexander Lakhin <exclusion@gmail.com> wrote:
I can propose the debugging patch to reproduce the issue that replaces
the hang with the assert and modifies a pair of crash-causing test
scripts to simplify the reproducing. (Sorry, I have no time now to prune
down the scripts further as I have to leave for a week.)
This bug is similar to the one fixed in commit d9d8aa9b. And so I
wonder if code like GlobalVisTestFor() is missing something that it
needs for partitioned tables.
--
Peter Geoghegan
On Fri, 29 Oct 2021 at 20:17, Peter Geoghegan <pg@bowt.ie> wrote:
On Fri, Oct 29, 2021 at 6:30 AM Alexander Lakhin <exclusion@gmail.com> wrote:
I can propose the debugging patch to reproduce the issue that replaces
the hang with the assert and modifies a pair of crash-causing test
scripts to simplify the reproducing. (Sorry, I have no time now to prune
down the scripts further as I have to leave for a week.)
This bug is similar to the one fixed in commit d9d8aa9b. And so I
wonder if code like GlobalVisTestFor() is missing something that it
needs for partitioned tables.
Without `autovacuum = off; fsync = off` I could not replicate the
issue in the configured 10m time window; with those options I did get
the reported trace in minutes.
I think that I also have found the culprit, which is something we
talked about in [0]: GlobalVisState->maybe_needed was not guaranteed
to never move backwards when recalculated, and because vacuum can
update its snapshot bounds (heap_prune_satisfies_vacuum ->
GlobalVisTestIsRemovableFullXid -> GlobalVisUpdate) this maybe_needed
could move backwards, resulting in the observed behaviour.
It was my understanding based on the mail conversation that Andres
would fix this observed issue too while fixing [0] (whose fix was
included with beta 2), but apparently I was wrong; I can't find the
code for 'maybe_needed'-won't-move-backwards-in-a-backend.
I (again) propose the attached patch, which ensures that this
maybe_needed field will not move backwards for a backend. It is
based on 14, but should be applied on head as well, because it's
lacking there as well.
Another alternative would be to replace the use of vacrel->OldestXmin
with `vacrel->vistest->maybe_needed` in lazy_scan_prune, but I believe
that is not legal given how vacuum works (we cannot unilaterally decide
that we want to retain tuples < OldestXmin).
Note: After fixing the issue with retreating maybe_needed I also hit
your segfault, and I'm still trying to find out what the source of
that issue might be. I do think it is an issue separate from the stuck
vacuum, though.
Kind regards,
Matthias van de Meent
[0]: /messages/by-id/20210609184506.rqm5rikoikm47csf@alap3.anarazel.de
Attachments:
0001-Fix-stuck-vacuum-due-to-retreating-GlobalVisState-ma.patch (application/octet-stream)
On Mon, 1 Nov 2021 at 16:15, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
On Fri, 29 Oct 2021 at 20:17, Peter Geoghegan <pg@bowt.ie> wrote:
On Fri, Oct 29, 2021 at 6:30 AM Alexander Lakhin <exclusion@gmail.com> wrote:
I can propose the debugging patch to reproduce the issue that replaces
the hang with the assert and modifies a pair of crash-causing test
scripts to simplify the reproducing. (Sorry, I have no time now to prune
down the scripts further as I have to leave for a week.)
This bug is similar to the one fixed in commit d9d8aa9b. And so I
wonder if code like GlobalVisTestFor() is missing something that it
needs for partitioned tables.
Without `autovacuum = off; fsync = off` I could not replicate the
issue in the configured 10m time window; with those options I did get
the reported trace in minutes.
I think that I also have found the culprit, which is something we
talked about in [0]: GlobalVisState->maybe_needed was not guaranteed
to never move backwards when recalculated, and because vacuum can
update its snapshot bounds (heap_prune_satisfies_vacuum ->
GlobalVisTestIsRemovableFullXid -> GlobalVisUpdate) this maybe_needed
could move backwards, resulting in the observed behaviour.
It was my understanding based on the mail conversation that Andres
would fix this observed issue too while fixing [0] (whose fix was
included with beta 2), but apparently I was wrong; I can't find the
code for 'maybe_needed'-won't-move-backwards-in-a-backend.
I (again) propose the attached patch, which ensures that this
maybe_needed field will not move backwards for a backend. It is
based on 14, but should be applied on head as well, because it's
lacking there as well.
Another alternative would be to replace the use of vacrel->OldestXmin
with `vacrel->vistest->maybe_needed` in lazy_scan_prune, but I believe
that is not legal in how vacuum works (we cannot unilaterally decide
that we want to retain tuples < OldestXmin).
Note: After fixing the issue with retreating maybe_needed I also hit
your segfault, and I'm still trying to find out what the source of
that issue might be. I do think it is an issue seperate from stuck
vacuum, though.
After further debugging, I think both of these might be caused by the
same issue, namely xmin horizon confusion resulting from restored
snapshots:
I seem to repeatedly get backends whose xmin is set from
InvalidTransactionId to some value < min(ProcGlobal->xids), which then
results in shared_oldest_nonremovable (and others) being less than the
value of their previous iteration. This leads to the infinite loop in
lazy_scan_prune (it stores and uses one value of
*_oldest_nonremovable, whereas heap_page_prune uses a more up-to-date
variant). Ergo, this issue is not really solved by my previous patch,
because apparently at this point we have snapshots with an xmin that is
only registered in the backend's procarray entry once that xmin is
already out of scope, which makes it generally impossible to determine
which tuples may or may not yet be vacuumed.
I noticed that when this happens, generally a parallel vacuum worker
is involved. I also think that this is intimately related to [0], and
how snapshots are restored in parallel workers: A vacuum worker is
generally ignored, but if its snapshot has the oldest xmin available,
then a parallel worker launched from that vacuum worker will move the
visible xmin backwards. Same for concurrent index creation jobs.
Kind regards,
Matthias van de Meent
[0]: /messages/by-id/202110191807.5svc3kmm32tl@alvherre.pgsql
On Wed, Nov 3, 2021 at 8:46 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
I seem to repeatedly get backends of which the xmin is set from
InvalidTransactionId to some value < min(ProcGlobal->xids), which then
result in shared_oldest_nonremovable (and others) being less than the
value of their previous iteration. This leads to the infinite loop in
lazy_scan_prune (it stores and uses one value of
*_oldest_nonremovable, whereas heap_page_prune uses a more up-to-date
variant).
I noticed that when this happens, generally a parallel vacuum worker
is involved.
Hmm. That is plausible. The way that VACUUM (and concurrent index
builds) avoid being seen via the PROC_IN_VACUUM thing is pretty
delicate. Wouldn't surprise me if the parallel VACUUM issue subtly
broke lazy_scan_prune in the way that we see here.
What about testing? Can we find a simple way of reducing this
complicated repro to a less complicated repro with a failing
assertion? Maybe an assertion that we get to keep after the bug is
fixed?
--
Peter Geoghegan
On Wed, 3 Nov 2021 at 17:21, Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Nov 3, 2021 at 8:46 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
I seem to repeatedly get backends of which the xmin is set from
InvalidTransactionId to some value < min(ProcGlobal->xids), which then
result in shared_oldest_nonremovable (and others) being less than the
value of their previous iteration. This leads to the infinite loop in
lazy_scan_prune (it stores and uses one value of
*_oldest_nonremovable, whereas heap_page_prune uses a more up-to-date
variant).
I noticed that when this happens, generally a parallel vacuum worker
is involved.
Hmm. That is plausible. The way that VACUUM (and concurrent index
builds) avoid being seen via the PROC_IN_VACUUM thing is pretty
delicate. Wouldn't surprise me if the parallel VACUUM issue subtly
broke lazy_scan_prune in the way that we see here.
What about testing? Can we find a simple way of reducing this
complicated repro to a less complicated repro with a failing
assertion? Maybe an assertion that we get to keep after the bug is
fixed?
I added the attached instrumentation for checking xmin validity, which
asserts what I believe are correct claims about the proc
infrastructure:
- It is always safe to set ->xmin to InvalidTransactionId: This
removes any claim that we have a snapshot anyone should worry about.
- If we have a valid ->xmin set, it is always safe to increase its value.
- Otherwise, the xmin must not lower the overall xmin of the database
it is connected to, plus some potential conditions for status flags.
It also may not be set without first taking the ProcArrayLock:
without synchronised access to the proc array, you cannot guarantee
you can set your xmin to a globally correct value.
It worked well with the bgworker flags patch [0], until I added this
instrumentation to SnapshotResetXmin and ran the regression tests: I
stumbled upon the following issue with aborting transactions, and I
don't know what the correct solution is to solve it:
AbortTransaction (see xact.c) calls ProcArrayEndTransaction, which can
reset MyProc->xmin to InvalidTransactionId (both directly and through
ProcArrayEndTransactionInternal). So far, this is safe.
However, later in AbortTransaction we call ResourceOwnerRelease(...,
RESOURCE_RELEASE_AFTER_LOCKS...), which will clean up the snapshots
stored in its owner->snapshotarr array using UnregisterSnapshot.
Then, if UnregisterSnapshot determines that a snapshot is now not
referenced anymore, and that snapshot has no active count, then it
will call SnapshotResetXmin().
Finally, when SnapshotResetXmin() is called, the oldest still
registered snapshot in RegisteredSnapshots will be pulled and
MyProc->xmin will be set to that snapshot's xmin.
Similarly, in AbortTransaction we call AtEOXact_Inval, which calls
ProcessInvalidationMessages -> LocalExecuteInvalidationMessage ->
InvalidateCatalogSnapshot -> SnapshotResetXmin, also setting
MyProc->xmin back to a non-InvalidXid value.
Note that from a third-party observer's standpoint we've just moved
our horizons backwards, and the regression tests (correctly) fail when
assertions are enabled.
I don't know what the expected behaviour is, but I do know that this
violates the expected invariant that xmin never goes backwards
(for any of the cluster, database, or data levels).
Kind regards,
Matthias van de Meent
[0]: /messages/by-id/CAD21AoDkERUJkGEuQRiyGKmVRt2duU378UgnwBpqXQjA+EY3Lg@mail.gmail.com
Attachments:
v2-0001-Add-instrumentation-for-xmin-horizon-validation.patch (application/x-patch)
On Fri, Nov 5, 2021 at 4:43 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
I added the attached instrumentation for checking xmin validity, which
asserts what I believe are correct claims about the proc
infrastructure:
This test case involves partitioning, but also pruning, which is very
particular about heap tuple headers being a certain way following
updates. I wonder if we're missing a
HeapTupleHeaderIndicatesMovedPartitions() test somewhere. Could be in
heapam/VACUUM/pruning code, or could be somewhere else.
Take a look at commit f16241bef7 to get some idea of what I mean.
--
Peter Geoghegan
On Fri, 5 Nov 2021 at 22:25, Peter Geoghegan <pg@bowt.ie> wrote:
On Fri, Nov 5, 2021 at 4:43 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
I added the attached instrumentation for checking xmin validity, which
asserts what I believe are correct claims about the proc
infrastructure:
This test case involves partitioning, but also pruning, which is very
particular about heap tuple headers being a certain way following
updates. I wonder if we're missing a
HeapTupleHeaderIndicatesMovedPartitions() test somewhere. Could be in
heapam/VACUUM/pruning code, or could be somewhere else.
If you watch closely, the second backtrace in [0] (the segfault)
originates from the code that builds the partition bounds based on
relcaches / catalog tables, which are never partitioned. Although it
is indeed in the partition infrastructure, if we'd have a tuple with
HeapTupleHeaderIndicatesMovedPartitions() at that point, then that'd
be a bug (we do not partition catalogs).
But I hit this same segfault earlier while testing, and I deduced that
the problem I hit at that point was that an index
entry could not resolve to a heap tuple (or the scan at partdesc.c:227
otherwise returned NULL where one result was expected); so that tuple
is NULL at partdesc.c:230, and heap_getattr subsequently segfaults
when it dereferences the null tuple pointer to access its fields.
Due to the blatant visibility horizon confusion, the failing scan
being on the pg_class table, and the test case including aggressive
manual vacuuming of the pg_class table, I assume that the error was
caused by vacuum having removed tuples from pg_class, while other
backends still required / expected access to these tuples.
Kind regards,
Matthias
[0]: /messages/by-id/d5d5af5d-ba46-aff3-9f91-776c70246cc3@gmail.com
On 2021-11-05 12:43:00 +0100, Matthias van de Meent wrote:
I added the attached instrumentation for checking xmin validity, which
asserts what I believe are correct claims about the proc
infrastructure:
- It is always safe to set ->xmin to InvalidTransactionId: This
removes any claim that we have a snapshot anyone should worry about.
- If we have a valid ->xmin set, it is always safe to increase its value.
I think I know what you mean, but of course you cannot just increase xmin if
there are existing snapshots requiring that xmin.
- Otherwise, the xmin must not lower the overall xmin of the database
it is connected to, plus some potential conditions for status flags.
walsenders can end up doing this IIRC.
It also may not be set without first taking the ProcArrayLock:
without synchronised access to the proc array, you cannot guarantee
you can set your xmin to a globally correct value.
There's possibly one exception around this, which is snapshot import. But
that's rare enough that an unnecessary acquisition is fine.
It worked well with the bgworker flags patch [0], until I added this
instrumentation to SnapshotResetXmin and ran the regression tests: I
stumbled upon the following issue with aborting transactions, and I
don't know what the correct solution is to solve it:
AbortTransaction (see xact.c) calls ProcArrayEndTransaction, which can
reset MyProc->xmin to InvalidTransactionId (both directly and through
ProcArrayEndTransactionInternal). So far, this is safe.
However, later in AbortTransaction we call ResourceOwnerRelease(...,
RESOURCE_RELEASE_AFTER_LOCKS...), which will clean up the snapshots
stored in its owner->snapshotarr array using UnregisterSnapshot.
Then, if UnregisterSnapshot determines that a snapshot is now not
referenced anymore, and that snapshot has no active count, then it
will call SnapshotResetXmin().
Finally, when SnapshotResetXmin() is called, the oldest still
registered snapshot in RegisteredSnapshots will be pulled and
MyProc->xmin will be set to that snapshot's xmin.
Yea, that's not great. This is a pretty old behaviour, IIRC?
We have an unwritten rule that a backend's xmin may not go back, but
this is not enforced.
I don't think we can make any of this into hard assertions. There's e.g. the
following comment:
* Note: despite the above, it's possible for the calculated values to move
* backwards on repeated calls. The calculated values are conservative, so
* that anything older is definitely not considered as running by anyone
* anymore, but the exact values calculated depend on a number of things. For
* example, if there are no transactions running in the current database, the
* horizon for normal tables will be latestCompletedXid. If a transaction
* begins after that, its xmin will include in-progress transactions in other
* databases that started earlier, so another call will return a lower value.
* Nonetheless it is safe to vacuum a table in the current database with the
* first result. There are also replication-related effects: a walsender
* process can set its xmin based on transactions that are no longer running
* on the primary but are still being replayed on the standby, thus possibly
* making the values go backwards. In this case there is a possibility that
* we lose data that the standby would like to have, but unless the standby
* uses a replication slot to make its xmin persistent there is little we can
* do about that --- data is only protected if the walsender runs continuously
* while queries are executed on the standby. (The Hot Standby code deals
* with such cases by failing standby queries that needed to access
* already-removed data, so there's no integrity bug.) The computed values
* are also adjusted with vacuum_defer_cleanup_age, so increasing that setting
* on the fly is another easy way to make horizons move backwards, with no
* consequences for data integrity.
Greetings,
Andres Freund
On Fri, Oct 29, 2021 at 6:30 AM Alexander Lakhin <exclusion@gmail.com> wrote:
I can propose the debugging patch to reproduce the issue that replaces
the hang with the assert and modifies a pair of crash-causing test
scripts to simplify the reproducing. (Sorry, I have no time now to prune
down the scripts further as I have to leave for a week.)
The reproducing script is:
I cannot reproduce this bug by following your steps, even when the
assertion is made to fail after only 5 retries (5 is still ludicrously
excessive, 100 might be overkill). And even when I don't use a debug
build (and make the assertion into an equivalent PANIC). I wonder why
that is. I didn't have much trouble following your similar repro for
bug #17255.
My immediate goal in trying to follow your reproducer was to determine
what effect (if any) the pending bugfix for #17255 [1] has on this
bug. It seems more than possible that this bug is in fact a different
manifestation of the same underlying problem we see in #17255. And so
that should be the next thing we check here.
[1]: /messages/by-id/CAH2-WzkpG9KLQF5sYHaOO_dSVdOjM+dv=nTEn85oNfMUTk836Q@mail.gmail.com -- Peter Geoghegan
--
Peter Geoghegan
On Mon, Nov 01, 2021 at 04:15:27PM +0100, Matthias van de Meent wrote:
Another alternative would be to replace the use of vacrel->OldestXmin
with `vacrel->vistest->maybe_needed` in lazy_scan_prune, but I believe
v17 commit 1ccc1e05ae essentially did that.
that is not legal in how vacuum works (we cannot unilaterally decide
that we want to retain tuples < OldestXmin).
Do you think commit 1ccc1e05ae creates problems in that respect? It does have
the effect of retaining tuples for which GlobalVisState rules "retain" but
HeapTupleSatisfiesVacuum() would have ruled "delete". If that doesn't create
problems, then back-patching commit 1ccc1e05ae could be a fix for remaining
infinite-retries scenarios, if any.
On Wed, Nov 10, 2021 at 12:28:43PM -0800, Peter Geoghegan wrote:
On Fri, Oct 29, 2021 at 6:30 AM Alexander Lakhin <exclusion@gmail.com> wrote:
I can propose the debugging patch to reproduce the issue that replaces
the hang with the assert and modifies a pair of crash-causing test
scripts to simplify the reproducing. (Sorry, I have no time now to prune
down the scripts further as I have to leave for a week.)
The reproducing script is:
I cannot reproduce this bug by following your steps, even when the
assertion is made to fail after only 5 retries (5 is still ludicrously
excessive, 100 might be overkill). And even when I don't use a debug
build (and make the assertion into an equivalent PANIC). I wonder why
that is. I didn't have much trouble following your similar repro for
bug #17255.
For what it's worth, I needed "-X" on the script's psql command lines to keep
my ~/.psqlrc from harming things. I also wondered if the regression database
needed to be populated with a "make installcheck" run. The script had a
"createdb regression" without a "make installcheck", so I assumed an empty
regression database was intended.
My immediate goal in trying to follow your reproducer was to determine
what effect (if any) the pending bugfix for #17255 [1] has on this
bug. It seems more than possible that this bug is in fact a different
manifestation of the same underlying problem we see in #17255. And so
that should be the next thing we check here.
[1] /messages/by-id/CAH2-WzkpG9KLQF5sYHaOO_dSVdOjM+dv=nTEn85oNfMUTk836Q@mail.gmail.com
Using the /messages/by-id/d5d5af5d-ba46-aff3-9f91-776c70246cc3@gmail.com
procedure, I see these results:
- A commit from the day of that email, 2021-10-29, (5ccceb2946) reproduced the
"numretries" assertion failure in each of five 10m runs.
- At the 2022-01-13 commit (18b87b201f^) just before the fix for #17255, the
script instead gets: FailedAssertion("HeapTupleHeaderIsHeapOnly(htup)",
File: "pruneheap.c", Line: 964. That happened once in two 10m runs, so it
was harder to reach than the numretries failure.
- At 18b87b201f, a 1440m script run got no failures.
I've seen symptoms that suggest the infinite-numretries bug remains reachable,
but I don't know how to reproduce them. (Given the upthread notes about xmin
going backward during end-of-xact, I'd first try some pauses there.) If it
does remain reachable, likely some other code change between 2021-10 and
2022-01 made this particular test script no longer reach it.
Thanks,
nm
On Sun, Dec 24, 2023 at 6:44 PM Noah Misch <noah@leadboat.com> wrote:
On Mon, Nov 01, 2021 at 04:15:27PM +0100, Matthias van de Meent wrote:
Another alternative would be to replace the use of vacrel->OldestXmin
with `vacrel->vistest->maybe_needed` in lazy_scan_prune, but I believe
v17 commit 1ccc1e05ae essentially did that.
Obviously, 1ccc1e05ae would fix the immediate problem of infinite
retries, since it just rips out the loop.
that is not legal in how vacuum works (we cannot unilaterally decide
that we want to retain tuples < OldestXmin).
Do you think commit 1ccc1e05ae creates problems in that respect? It does have
the effect of retaining tuples for which GlobalVisState rules "retain" but
HeapTupleSatisfiesVacuum() would have ruled "delete". If that doesn't create
problems, then back-patching commit 1ccc1e05ae could be a fix for remaining
infinite-retries scenarios, if any.
My guess is that there is a decent chance that backpatching 1ccc1e05ae
would be okay, but that isn't much use. I really don't know either way
right now. And I wouldn't like to speculate too much further before
gaining a proper understanding of what's going on here. Seems to be
specific to partitioning with cross-partition updates.
Using the /messages/by-id/d5d5af5d-ba46-aff3-9f91-776c70246cc3@gmail.com
procedure, I see these results:
- A commit from the day of that email, 2021-10-29, (5ccceb2946) reproduced the
"numretries" assertion failure in each of five 10m runs.- At the 2022-01-13 commit (18b87b201f^) just before the fix for #17255, the
script instead gets: FailedAssertion("HeapTupleHeaderIsHeapOnly(htup)",
File: "pruneheap.c", Line: 964. That happened once in two 10m runs, so it
was harder to reach than the numretries failure.
- At 18b87b201f, a 1440m script run got no failures.
I've seen symptoms that suggest the infinite-numretries bug remains reachable,
but I don't know how to reproduce them. (Given the upthread notes about xmin
going backward during end-of-xact, I'd first try some pauses there.) If it
does remain reachable, likely some other code change between 2021-10 and
2022-01 made this particular test script no longer reach it.
I am aware of a production database that appears to run into the same
problem. Inserts and concurrent cross-partition updates are used
heavily on this instance (the table in question uses partitioning).
Perhaps you happened upon a similar problematic production database,
and found this thread when researching the issue? Maybe we've both
seen the same problem in the wild?
I have seen VACUUM get stuck like this on multiple versions, all
associated with the same application code/partitioning
scheme/workload. This includes a 15.4 instance, and various 14.* point
release instances. It seems likely that there is a bug here, and that
it affects Postgres 14, 15, and 16.
--
Peter Geoghegan
On Sun, Dec 31, 2023 at 03:53:34PM -0800, Peter Geoghegan wrote:
On Sun, Dec 24, 2023 at 6:44 PM Noah Misch <noah@leadboat.com> wrote:
On Mon, Nov 01, 2021 at 04:15:27PM +0100, Matthias van de Meent wrote:
Another alternative would be to replace the use of vacrel->OldestXmin
with `vacrel->vistest->maybe_needed` in lazy_scan_prune, but I believe
v17 commit 1ccc1e05ae essentially did that.
Obviously, 1ccc1e05ae would fix the immediate problem of infinite
retries, since it just rips out the loop.
Yep.
that is not legal in how vacuum works (we cannot unilaterally decide
that we want to retain tuples < OldestXmin).
Do you think commit 1ccc1e05ae creates problems in that respect? It does have
the effect of retaining tuples for which GlobalVisState rules "retain" but
HeapTupleSatisfiesVacuum() would have ruled "delete". If that doesn't create
problems, then back-patching commit 1ccc1e05ae could be a fix for remaining
infinite-retries scenarios, if any.
My guess is that there is a decent chance that backpatching 1ccc1e05ae
would be okay, but that isn't much use. I really don't know either way
right now. And I wouldn't like to speculate too much further before
gaining a proper understanding of what's going on here.
Fair enough. While I agree there's a decent chance back-patching would be
okay, I think there's also a decent chance that 1ccc1e05ae creates the problem
Matthias theorized. Something like: we update relfrozenxid based on
OldestXmin, even though GlobalVisState caused us to retain a tuple older than
OldestXmin. Then relfrozenxid disagrees with table contents.
Seems to be
specific to partitioning with cross-partition updates.
Using the /messages/by-id/d5d5af5d-ba46-aff3-9f91-776c70246cc3@gmail.com
procedure, I see these results:
- A commit from the day of that email, 2021-10-29, (5ccceb2946) reproduced the
"numretries" assertion failure in each of five 10m runs.- At the 2022-01-13 commit (18b87b201f^) just before the fix for #17255, the
script instead gets: FailedAssertion("HeapTupleHeaderIsHeapOnly(htup)",
File: "pruneheap.c", Line: 964. That happened once in two 10m runs, so it
was harder to reach than the numretries failure.
- At 18b87b201f, a 1440m script run got no failures.
I've seen symptoms that suggest the infinite-numretries bug remains reachable,
but I don't know how to reproduce them. (Given the upthread notes about xmin
going backward during end-of-xact, I'd first try some pauses there.) If it
does remain reachable, likely some other code change between 2021-10 and
2022-01 made this particular test script no longer reach it.
I am aware of a production database that appears to run into the same
problem. Inserts and concurrent cross-partition updates are used
heavily on this instance (the table in question uses partitioning).
Perhaps you happened upon a similar problematic production database,
and found this thread when researching the issue? Maybe we've both
seen the same problem in the wild?
I did find this thread while researching the symptoms I was seeing. No
partitioning where I saw them.
I have seen VACUUM get stuck like this on multiple versions, all
associated with the same application code/partitioning
scheme/workload. This includes a 15.4 instance, and various 14.* point
release instances. It seems likely that there is a bug here, and that
it affects Postgres 14, 15, and 16.
Agreed.
On Sat, Jan 6, 2024 at 12:24 PM Noah Misch <noah@leadboat.com> wrote:
On Sun, Dec 31, 2023 at 03:53:34PM -0800, Peter Geoghegan wrote:
My guess is that there is a decent chance that backpatching 1ccc1e05ae
would be okay, but that isn't much use. I really don't know either way
right now. And I wouldn't like to speculate too much further before
gaining a proper understanding of what's going on here.
Fair enough. While I agree there's a decent chance back-patching would be
okay, I think there's also a decent chance that 1ccc1e05ae creates the problem
Matthias theorized. Something like: we update relfrozenxid based on
OldestXmin, even though GlobalVisState caused us to retain a tuple older than
OldestXmin. Then relfrozenxid disagrees with table contents.
Either every relevant code path has the same OldestXmin to work off
of, or the whole NewRelfrozenXid/relfrozenxid-tracking thing can't be
expected to work as designed. I find it a bit odd that
pruneheap.c/GlobalVisState has no direct understanding of this
dependency (none that I can discern, at least). Wouldn't it at least
be more natural if pruneheap.c could access OldestXmin when run inside
VACUUM? (Could just be used by defensive hardening code.)
We're also relying on vacuumlazy.c's call to vacuum_get_cutoffs()
(which itself calls GetOldestNonRemovableTransactionId) taking place
before vacuumlazy.c goes on to call GlobalVisTestFor() a few lines
further down (I think). It seems like even the code in procarray.c
might have something to say about the vacuumlazy.c dependency, too.
But offhand it doesn't look like it does, either. Why shouldn't we
expect random implementation details in code like ComputeXidHorizons()
to break the assumption/dependency within vacuumlazy.c?
I also worry about the possibility that GlobalVisTestShouldUpdate()
masks problems in this area (as opposed to causing the problems). It
seems very hard to test.
I did find this thread while researching the symptoms I was seeing. No
partitioning where I saw them.
If this was an isolated incident then it could perhaps have been a
symptom of corruption. Though corruption seems highly unlikely to be
involved with the cases that I've seen, since they appear
intermittently, across a variety of different contexts/versions.
--
Peter Geoghegan
On Sat, Jan 6, 2024 at 1:30 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Sat, Jan 6, 2024 at 12:24 PM Noah Misch <noah@leadboat.com> wrote:
Fair enough. While I agree there's a decent chance back-patching would be
okay, I think there's also a decent chance that 1ccc1e05ae creates the problem
Matthias theorized. Something like: we update relfrozenxid based on
OldestXmin, even though GlobalVisState caused us to retain a tuple older than
OldestXmin. Then relfrozenxid disagrees with table contents.
Either every relevant code path has the same OldestXmin to work off
of, or the whole NewRelfrozenXid/relfrozenxid-tracking thing can't be
expected to work as designed. I find it a bit odd that
pruneheap.c/GlobalVisState has no direct understanding of this
dependency (none that I can discern, at least).
What do you think of the idea of adding a defensive "can't happen"
error to lazy_scan_prune() that will catch DEAD or RECENTLY_DEAD
tuples with storage that still contain an xmax < OldestXmin? This
probably won't catch every possible problem, but I suspect it'll work
well enough.
--
Peter Geoghegan
On Sat, Jan 06, 2024 at 01:30:40PM -0800, Peter Geoghegan wrote:
On Sat, Jan 6, 2024 at 12:24 PM Noah Misch <noah@leadboat.com> wrote:
On Sun, Dec 31, 2023 at 03:53:34PM -0800, Peter Geoghegan wrote:
My guess is that there is a decent chance that backpatching 1ccc1e05ae
would be okay, but that isn't much use. I really don't know either way
right now. And I wouldn't like to speculate too much further before
gaining a proper understanding of what's going on here.
Fair enough. While I agree there's a decent chance back-patching would be
okay, I think there's also a decent chance that 1ccc1e05ae creates the problem
Matthias theorized. Something like: we update relfrozenxid based on
OldestXmin, even though GlobalVisState caused us to retain a tuple older than
OldestXmin. Then relfrozenxid disagrees with table contents.
Either every relevant code path has the same OldestXmin to work off
of, or the whole NewRelfrozenXid/relfrozenxid-tracking thing can't be
expected to work as designed. I find it a bit odd that
pruneheap.c/GlobalVisState has no direct understanding of this
dependency (none that I can discern, at least). Wouldn't it at least
be more natural if pruneheap.c could access OldestXmin when run inside
VACUUM? (Could just be used by defensive hardening code.)
Tied to that decision is the choice of semantics when the xmin horizon moves
backward during one VACUUM, e.g. when a new walsender xmin does so. Options:
1. Continue to remove tuples based on the OldestXmin from VACUUM's start. We
could have already removed some of those tuples, so the walsender xmin
won't achieve a guarantee anyway. (VACUUM would want ratchet-like behavior
in GlobalVisState, possibly by sharing OldestXmin with pruneheap like you
say.)
2. Move OldestXmin backward, to reflect the latest xmin horizon. (Perhaps
VACUUM would just pass GlobalVisState to a function that returns the
compatible OldestXmin.)
Which way is better?
We're also relying on vacuumlazy.c's call to vacuum_get_cutoffs()
(which itself calls GetOldestNonRemovableTransactionId) taking place
before vacuumlazy.c goes on to call GlobalVisTestFor() a few lines
further down (I think). It seems like even the code in procarray.c
might have something to say about the vacuumlazy.c dependency, too.
But offhand it doesn't look like it does, either. Why shouldn't we
expect random implementation details in code like ComputeXidHorizons()
to break the assumption/dependency within vacuumlazy.c?
Makes sense.
On Sat, Jan 06, 2024 at 01:41:23PM -0800, Peter Geoghegan wrote:
What do you think of the idea of adding a defensive "can't happen"
error to lazy_scan_prune() that will catch DEAD or RECENTLY_DEAD
tuples with storage that still contain an xmax < OldestXmin? This
probably won't catch every possible problem, but I suspect it'll work
well enough.
So before the "goto retry", ERROR if the tuple header suggests this mismatch
is happening? That, or limiting the retry count, could help.
On Sat, Jan 6, 2024 at 5:44 PM Noah Misch <noah@leadboat.com> wrote:
Tied to that decision is the choice of semantics when the xmin horizon moves
backward during one VACUUM, e.g. when a new walsender xmin does so. Options:
1. Continue to remove tuples based on the OldestXmin from VACUUM's start. We
could have already removed some of those tuples, so the walsender xmin
won't achieve a guarantee anyway. (VACUUM would want ratchet-like behavior
in GlobalVisState, possibly by sharing OldestXmin with pruneheap like you
say.)
2. Move OldestXmin backward, to reflect the latest xmin horizon. (Perhaps
VACUUM would just pass GlobalVisState to a function that returns the
compatible OldestXmin.)
Which way is better?
I suppose that a hybrid of these two approaches makes the most sense.
A design that's a lot closer to your #1 than to your #2.
Under this scheme, pruneheap.c would be explicitly aware of
OldestXmin, and would promise to respect the exact invariant that we
need to avoid getting stuck in lazy_scan_prune's loop (or avoid
confused NewRelfrozenXid tracking on HEAD, which no longer has this
loop). But it'd be limited to that exact invariant; we'd still avoid
unduly imposing any requirements on pruning-away deleted tuples whose
xmax was >= OldestXmin. lazy_scan_prune/vacuumlazy.c shouldn't care if
we prune away any "extra" heap tuples, just because we can (or just
because it's convenient to the implementation). Andres has in the past
placed a lot of emphasis on being able to update the
GlobalVisState-wise bounds on the fly. Not sure that it's really that
important that VACUUM does that, but there is no reason to not allow
it. So we can keep that property (as well as the aforementioned
high-level OldestXmin immutability property).
More importantly (at least to me), this scheme allows vacuumlazy.c to
continue to treat OldestXmin as an immutable cutoff for both pruning
and freezing -- the high level design doesn't need any revisions. We
already "freeze away" multixact member XIDs >= OldestXmin in certain
rare cases (i.e. we remove lockers that are determined to no longer be
running in FreezeMultiXactId's "second pass" slow path), so there is
nothing fundamentally novel about the idea of removing some extra XIDs
>= OldestXmin in passing, just because it happens to be convenient to
some low-level piece of code that's external to vacuumlazy.c.
What do you think of that general approach? I see no reason why it
matters if OldestXmin goes backwards across two VACUUM operations, so
I haven't tried to avoid that.
On Sat, Jan 06, 2024 at 01:41:23PM -0800, Peter Geoghegan wrote:
What do you think of the idea of adding a defensive "can't happen"
error to lazy_scan_prune() that will catch DEAD or RECENTLY_DEAD
tuples with storage that still contain an xmax < OldestXmin? This
probably won't catch every possible problem, but I suspect it'll work
well enough.
So before the "goto retry", ERROR if the tuple header suggests this mismatch
is happening? That, or limiting the retry count, could help.
When I wrote this code, my understanding was that the sole reason for
needing to loop back was a concurrently-aborted xact. In principle we
ought to be able to test the tuple to detect if it's that exact case
(the only truly valid case), and then throw an error if we somehow got
it wrong. That kind of hardening would at least be correct according
to my original understanding of things.
There is an obvious practical concern with adding such hardening now:
what if the current loop is accidentally protective, in whatever way?
That seems quite possible. I seem to recall that Andres supposed at
some point that the loop's purpose wasn't limited to the
concurrently-aborted-inserter case that I believed was the only
relevant case back when I worked on what became commit 8523492d4e
("Remove tupgone special case from vacuumlazy.c"). I don't have a
reference for that, but I'm pretty sure it was said at some point
around the release of 14.
--
Peter Geoghegan
On Mon, Jan 08, 2024 at 12:02:01PM -0500, Peter Geoghegan wrote:
On Sat, Jan 6, 2024 at 5:44 PM Noah Misch <noah@leadboat.com> wrote:
Tied to that decision is the choice of semantics when the xmin horizon moves
backward during one VACUUM, e.g. when a new walsender xmin does so. Options:
1. Continue to remove tuples based on the OldestXmin from VACUUM's start. We
could have already removed some of those tuples, so the walsender xmin
won't achieve a guarantee anyway. (VACUUM would want ratchet-like behavior
in GlobalVisState, possibly by sharing OldestXmin with pruneheap like you
say.)
2. Move OldestXmin backward, to reflect the latest xmin horizon. (Perhaps
VACUUM would just pass GlobalVisState to a function that returns the
compatible OldestXmin.)
Which way is better?
I suppose that a hybrid of these two approaches makes the most sense.
A design that's a lot closer to your #1 than to your #2.
Under this scheme, pruneheap.c would be explicitly aware of
OldestXmin, and would promise to respect the exact invariant that we
need to avoid getting stuck in lazy_scan_prune's loop (or avoid
confused NewRelfrozenXid tracking on HEAD, which no longer has this
loop). But it'd be limited to that exact invariant; we'd still avoid
unduly imposing any requirements on pruning-away deleted tuples whose
xmax was >= OldestXmin. lazy_scan_prune/vacuumlazy.c shouldn't care if
we prune away any "extra" heap tuples, just because we can (or just
because it's convenient to the implementation). Andres has in the past
placed a lot of emphasis on being able to update the
GlobalVisState-wise bounds on the fly. Not sure that it's really that
important that VACUUM does that, but there is no reason to not allow
it. So we can keep that property (as well as the aforementioned
high-level OldestXmin immutability property).
More importantly (at least to me), this scheme allows vacuumlazy.c to
continue to treat OldestXmin as an immutable cutoff for both pruning
and freezing -- the high level design doesn't need any revisions. We
already "freeze away" multixact member XIDs >= OldestXmin in certain
rare cases (i.e. we remove lockers that are determined to no longer be
running in FreezeMultiXactId's "second pass" slow path), so there is
nothing fundamentally novel about the idea of removing some extra XIDs
>= OldestXmin in passing, just because it happens to be convenient to
some low-level piece of code that's external to vacuumlazy.c.
What do you think of that general approach?
That all sounds good to me.
I see no reason why it
matters if OldestXmin goes backwards across two VACUUM operations, so
I haven't tried to avoid that.
That may be fully okay, or we may want to clamp OldestXmin to be no older than
relfrozenxid. I don't feel great about the system moving relfrozenxid
backward unless it observed an older XID, and observing an older XID would be
a corruption signal. I don't have a specific way non-monotonic relfrozenxid
breaks things, though.
On Mon, Jan 8, 2024 at 1:21 PM Noah Misch <noah@leadboat.com> wrote:
I see no reason why it
matters if OldestXmin goes backwards across two VACUUM operations, so
I haven't tried to avoid that.
That may be fully okay, or we may want to clamp OldestXmin to be no older than
relfrozenxid. I don't feel great about the system moving relfrozenxid
backward unless it observed an older XID, and observing an older XID would be
a corruption signal. I don't have a specific way non-monotonic relfrozenxid
breaks things, though.
We're already prepared for this -- relfrozenxid simply cannot go
backwards, regardless of what vacuumlazy.c thinks. That is,
vac_update_relstats() won't accept a new relfrozenxid that is < its
existing value (unless it's a value "from the future", which is a way
of recovering after historical pg_upgrade-related corruption bugs).
If memory serves it doesn't take much effort to exercise the relevant
code within vac_update_relstats(). I'm pretty sure that the regression
tests will fail if you run them after removing its defensive
no-older-relfrozenxid test (though I haven't checked recently).
--
Peter Geoghegan