Use heap scan routines directly in vac_update_datfrozenxid()

Started by Soumyadeep Chakraborty · over 1 year ago · 4 messages · pgsql-hackers
#1 Soumyadeep Chakraborty
soumyadeep2007@gmail.com

Hi hackers,

Attached is a simple patch to directly use heap scan routines in
vac_update_datfrozenxid(), avoiding the multilayer overhead from the
sysscan infrastructure. The speedup can be noticeable in databases
containing a large number of relations (perhaps due to heavy partition
table usage). This was proposed in [1].
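For context, the shape of the change is roughly the following (a sketch, not the patch itself; the surrounding variable names and the elided loop body are illustrative):

```c
/* Before: go through the sysscan layer, which round-trips each tuple
 * through a TupleTableSlot inside systable_getnext(). */
scan = systable_beginscan(relation, InvalidOid, false, NULL, 0, NULL);
while ((classTup = systable_getnext(scan)) != NULL)
{
    /* ... examine relfrozenxid/relminmxid ... */
}
systable_endscan(scan);

/* After (sketch): scan the heap directly, skipping the slot. */
tscan = table_beginscan_catalog(relation, 0, NULL);
while ((classTup = heap_getnext(tscan, ForwardScanDirection)) != NULL)
{
    /* ... examine relfrozenxid/relminmxid ... */
}
table_endscan(tscan);
```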

Experiment setup:

* Use an -O3 optimized build without asserts, with fsync and autovacuum off,
on my laptop. Other GUCs are all at defaults.

* Create tables using pgbench to inflate pg_class to a decent size.

$ cat << EOF > bench.sql

select txid_current() AS txid \gset
CREATE TABLE t:txid(a int);
EOF

$ pgbench -f ./bench.sql -t 200000 -c 100 -n bench

select pg_size_pretty(pg_relation_size('pg_class'));
pg_size_pretty
----------------
3508 MB
(1 row)

* Use instr_time to record the scan time. See attached instr_vac.diff.

* Run vacuum on any of the created empty tables in the database bench:

Results:

* main as of 68dfecbef2:

bench=# vacuum t1624;
NOTICE: scan took 796.862142 ms
bench=# vacuum t1624;
NOTICE: scan took 793.730688 ms
bench=# vacuum t1624;
NOTICE: scan took 793.963655 ms

* patch:

bench=# vacuum t1624;
NOTICE: scan took 682.283366 ms
bench=# vacuum t1624;
NOTICE: scan took 670.816975 ms
bench=# vacuum t1624;
NOTICE: scan took 683.821717 ms

Regards,
Soumyadeep (Broadcom)

[1]: /messages/by-id/20221229030329.fbpiitatmowzza6c@awork3.anarazel.de

Attachments:

* v1-0001-Use-heap_getnext-in-vac_update_datfrozenxid.patch (text/x-patch; +4/-6)
* instr_vac.diff (text/x-patch; +7/-1)
#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Soumyadeep Chakraborty (#1)
Re: Use heap scan routines directly in vac_update_datfrozenxid()

Soumyadeep Chakraborty <soumyadeep2007@gmail.com> writes:

> Attached is a simple patch to directly use heap scan routines in
> vac_update_datfrozenxid(), avoiding the multilayer overhead from the
> sysscan infrastructure.

I would think the overhead of that is minuscule. If it isn't,
we should try to make it so, not randomly hack up some of the callers
to avoid it. The intention certainly was that it wouldn't cost
anything compared to what happens within the actual table access.

regards, tom lane

#3 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#2)
Re: Use heap scan routines directly in vac_update_datfrozenxid()

On 2024-Oct-06, Tom Lane wrote:

> Soumyadeep Chakraborty <soumyadeep2007@gmail.com> writes:
>
> > Attached is a simple patch to directly use heap scan routines in
> > vac_update_datfrozenxid(), avoiding the multilayer overhead from the
> > sysscan infrastructure.

Though if there's anybody with a Postgres fork using catalog tables that
aren't heapam, then they aren't going to be happy with this change. (I
remember Tom commenting that Salesforce used to do that, I wonder if
they still do.)

> I would think the overhead of that is minuscule. If it isn't,
> we should try to make it so, not randomly hack up some of the callers
> to avoid it. The intention certainly was that it wouldn't cost
> anything compared to what happens within the actual table access.

I suspect the problem is not the systable layer per se, but the fact
that it has to go through the table AM interface. So by replacing
systable with the heap routines, it's actually _two_ layers of
indirection that are being skipped over. systable seems indeed quite
lean, or at least it was up to postgres 11 ... but it's not clear to me
that it continues to be.

The table AM API is heavily centered on slots, and I think having to
build heap tuples from slots might be slow. I wonder if it would be
better to add "systable_getnext_slot" which returns a slot and omit the
conversion to heaptuple. Callers for which it's significant could skip
that conversion by dealing with a slot instead. Converting just one or
two critical spots (such as vac_update_datfrozenxid, maybe
pg_publication.c) should be easy enough.
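A sketch of what that hypothetical API might look like (`systable_getnext_slot` does not exist today; this is only an illustration of the suggestion, using the fields that SysScanDesc currently has for sequential scans):

```c
/* Hypothetical: hand the caller the scan's slot instead of a HeapTuple,
 * so callers that can work with a slot skip the conversion entirely. */
bool
systable_getnext_slot(SysScanDesc sysscan, TupleTableSlot **slot)
{
    Assert(sysscan->irel == NULL);   /* sequential scans only, for the sketch */
    *slot = sysscan->slot;
    return table_scan_getnextslot(sysscan->scan, ForwardScanDirection,
                                  sysscan->slot);
}
```

A converted caller such as vac_update_datfrozenxid() would then read the columns it needs from the slot rather than from a HeapTuple.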

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"They all went about naked as their mothers bore them, and the women too,
although I saw only one, who was quite young; and all those I saw were young
men, for I saw none of an age of more than XXX years" (Cristóbal Colón)

#4 Soumyadeep Chakraborty
soumyadeep2007@gmail.com
In reply to: Alvaro Herrera (#3)
Re: Use heap scan routines directly in vac_update_datfrozenxid()

On Mon, Oct 7, 2024 at 1:58 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> On 2024-Oct-06, Tom Lane wrote:
>
> > Soumyadeep Chakraborty <soumyadeep2007@gmail.com> writes:
> >
> > > Attached is a simple patch to directly use heap scan routines in
> > > vac_update_datfrozenxid(), avoiding the multilayer overhead from the
> > > sysscan infrastructure.
>
> Though if there's anybody with a Postgres fork using catalog tables that
> aren't heapam, then they aren't going to be happy with this change. (I
> remember Tom commenting that Salesforce used to do that, I wonder if
> they still do.)

I see. However, I do see many uses of table_beginscan_catalog() followed
by heap_getnext(). So we are already assuming catalog = heap at those
call sites. I guess they aren't ideal and are meant to be changed
eventually?

> > I would think the overhead of that is minuscule. If it isn't,
> > we should try to make it so, not randomly hack up some of the callers
> > to avoid it. The intention certainly was that it wouldn't cost
> > anything compared to what happens within the actual table access.

> I suspect the problem is not the systable layer per se, but the fact
> that it has to go through the table AM interface. So by replacing
> systable with the heap routines, it's actually _two_ layers of
> indirection that are being skipped over. systable seems indeed quite
> lean, or at least it was up to postgres 11 ... but it's not clear to me
> that it continues to be.
>
> The table AM API is heavily centered on slots, and I think having to
> build heap tuples from slots might be slow. I wonder if it would be
> better to add "systable_getnext_slot" which returns a slot and omit the
> conversion to heaptuple. Callers for which it's significant could skip
> that conversion by dealing with a slot instead. Converting just one or
> two critical spots (such as vac_update_datfrozenxid, maybe
> pg_publication.c) should be easy enough.

We aren't building heap tuples AFAICS. Attaching a debugger, I saw
that we are calling:

tts_buffer_heap_store_tuple execTuples.c:981
ExecStoreBufferHeapTuple execTuples.c:1493
heap_getnextslot heapam.c:1316
table_scan_getnextslot tableam.h:1071
systable_getnext genam.c:538
vac_update_datfrozenxid vacuum.c:1644

where we store the HeapTuple obtained from heapgettup() into the slot:
bslot->base.tuple = tuple;

And then subsequently, we obtain the tuple directly via:

tts_buffer_heap_get_heap_tuple execTuples.c:909
ExecFetchSlotHeapTuple genam.c:542
systable_getnext genam.c:542
vac_update_datfrozenxid vacuum.c:1644

We don't go through a slot->tts_ops->copy_heap_tuple() call, so we don't
copy the heap tuple. Here we just return: bslot->base.tuple.
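Condensing the two stacks above, the per-tuple round trip on main looks roughly like this (simplified pseudocode of the cited functions, not their exact bodies):

```c
/* inside systable_getnext(), main branch: */
table_scan_getnextslot(sysscan->scan, ForwardScanDirection, sysscan->slot);
    /* -> heap_getnextslot(): fetch the tuple, then
     *    ExecStoreBufferHeapTuple(tuple, slot, buffer);
     *      -> bslot->base.tuple = tuple;   plus buffer pin bookkeeping */
return ExecFetchSlotHeapTuple(sysscan->slot, false, NULL);
    /* -> tts_buffer_heap_get_heap_tuple():
     *    return bslot->base.tuple;         no copy, but extra calls */
```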

At any rate, there is some (albeit small) overhead due to slots being in the
picture. This is an excerpt from perf on the main branch with my workload:

main (with CFLAGS='-O0 -fno-omit-frame-pointer -ggdb', without asserts):

  Children      Self  Command   Symbol
...
-   81.59%     2.65%  postgres  [.] systable_getnext
   - 78.93% systable_getnext
         - 76.61% table_scan_getnextslot
         - 73.16% heap_getnextslot
            + 57.95% heapgettup_pagemode
            - 9.51% ExecStoreBufferHeapTuple
               - 7.65% tts_buffer_heap_store_tuple
                  + 2.47% IncrBufferRefCount
                  + 2.04% ReleaseBuffer
      - 1.83% ExecFetchSlotHeapTuple
           0.83% tts_buffer_heap_get_heap_tuple
   + 2.48% _start

We can see some overhead in ExecStoreBufferHeapTuple() and
ExecFetchSlotHeapTuple(), both of which are avoided in my branch.

I am attaching a perf diff (with --dsos=postgres) between the -O3 branch and
-O3 main (with the branch as the baseline), as well as the individual profiles
(--report with the --stdio flag and --dsos=postgres; profiles captured with
--call-graph=dwarf). I ran the same workload as in my previous email.

To your point about having a slot-based API, where the caller manipulates
the slot directly: that unfortunately won't help us avoid the overhead
seen in storing the tuple in the slot and then extracting it back out.

To Tom's point about fixing systable_getnext() to remove the overhead,
there is the option of substituting its internal call to
table_scan_getnextslot() with heap_getnext(). However, that wouldn't be
fair to non-heap catalogs, I suppose.

Regards,
Soumyadeep (Broadcom)

Attachments:

* branchO3_vs_mainO3.diff (text/x-patch)
* perf.main.O3.data.out (application/octet-stream)
* perf.branch.O3.data.out (application/octet-stream)
* perf.main.O0.data.out (application/octet-stream)