Use heap scan routines directly in vac_update_datfrozenxid()

Started by Soumyadeep Chakraborty · over 1 year ago · 4 messages · pgsql-hackers
#1 Soumyadeep Chakraborty
soumyadeep2007@gmail.com

Hi hackers,

Attached is a simple patch to directly use heap scan routines in
vac_update_datfrozenxid(), avoiding the multilayer overhead from the
sysscan infrastructure. The speedup can be noticeable in databases
containing a large number of relations (perhaps due to heavy partition
table usage). This was proposed in [1].
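For context, the shape of the change is roughly the following (a sketch, not the patch itself; the surrounding variable names and the elided loop body are illustrative):

```c
/* Before: go through the sysscan layer, which round-trips each tuple
 * through a TupleTableSlot inside systable_getnext(). */
scan = systable_beginscan(relation, InvalidOid, false, NULL, 0, NULL);
while ((classTup = systable_getnext(scan)) != NULL)
{
    /* ... examine relfrozenxid/relminmxid ... */
}
systable_endscan(scan);

/* After (sketch): scan the heap directly, skipping the slot. */
tscan = table_beginscan_catalog(relation, 0, NULL);
while ((classTup = heap_getnext(tscan, ForwardScanDirection)) != NULL)
{
    /* ... examine relfrozenxid/relminmxid ... */
}
table_endscan(tscan);
```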

Experiment setup:

* Use an -O3 optimized build without asserts, with fsync and autovacuum off,
on my laptop. Other GUCs are all at defaults.

* Create tables using pgbench to inflate pg_class to a decent size.

$ cat << EOF > bench.sql

select txid_current() AS txid \gset
CREATE TABLE t:txid(a int);
EOF

$ pgbench -f ./bench.sql -t 200000 -c 100 -n bench

select pg_size_pretty(pg_relation_size('pg_class'));
pg_size_pretty
----------------
3508 MB
(1 row)

* Use instr_time to record the scan time. See attached instr_vac.diff.

* Run vacuum on any of the created empty tables in the database bench:

Results:

* main as of 68dfecbef2:

bench=# vacuum t1624;
NOTICE: scan took 796.862142 ms
bench=# vacuum t1624;
NOTICE: scan took 793.730688 ms
bench=# vacuum t1624;
NOTICE: scan took 793.963655 ms

* patch:

bench=# vacuum t1624;
NOTICE: scan took 682.283366 ms
bench=# vacuum t1624;
NOTICE: scan took 670.816975 ms
bench=# vacuum t1624;
NOTICE: scan took 683.821717 ms

Regards,
Soumyadeep (Broadcom)

[1]: /messages/by-id/20221229030329.fbpiitatmowzza6c@awork3.anarazel.de

Attachments:

* v1-0001-Use-heap_getnext-in-vac_update_datfrozenxid.patch (text/x-patch; +4/-6)
* instr_vac.diff (text/x-patch; +7/-1)
#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Soumyadeep Chakraborty (#1)
Re: Use heap scan routines directly in vac_update_datfrozenxid()

Soumyadeep Chakraborty <soumyadeep2007@gmail.com> writes:

> Attached is a simple patch to directly use heap scan routines in
> vac_update_datfrozenxid(), avoiding the multilayer overhead from the
> sysscan infrastructure.

I would think the overhead of that is minuscule. If it isn't,
we should try to make it so, not randomly hack up some of the callers
to avoid it. The intention certainly was that it wouldn't cost
anything compared to what happens within the actual table access.

regards, tom lane

#3 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#2)
Re: Use heap scan routines directly in vac_update_datfrozenxid()

On 2024-Oct-06, Tom Lane wrote:

> Soumyadeep Chakraborty <soumyadeep2007@gmail.com> writes:
>
> > Attached is a simple patch to directly use heap scan routines in
> > vac_update_datfrozenxid(), avoiding the multilayer overhead from the
> > sysscan infrastructure.

Though if there's anybody with a Postgres fork using catalog tables that
aren't heapam, then they aren't going to be happy with this change. (I
remember Tom commenting that Salesforce used to do that, I wonder if
they still do.)

> I would think the overhead of that is minuscule. If it isn't,
> we should try to make it so, not randomly hack up some of the callers
> to avoid it. The intention certainly was that it wouldn't cost
> anything compared to what happens within the actual table access.

I suspect the problem is not the systable layer per se, but the fact
that it has to go through the table AM interface. So by replacing
systable with the heap routines, it's actually _two_ layers of
indirection that are being skipped over. systable seems indeed quite
lean, or at least it was up to postgres 11 ... but it's not clear to me
that it continues to be.

The table AM API is heavily centered on slots, and I think having to
build heap tuples from slots might be slow. I wonder if it would be
better to add "systable_getnext_slot" which returns a slot and omit the
conversion to heaptuple. Callers for which it's significant could skip
that conversion by dealing with a slot instead. Converting just one or
two critical spots (such as vac_update_datfrozenxid, maybe
pg_publication.c) should be easy enough.
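A sketch of what that hypothetical API might look like (`systable_getnext_slot` does not exist today; this is only an illustration of the suggestion, using the fields that SysScanDesc currently has for sequential scans):

```c
/* Hypothetical: hand the caller the scan's slot instead of a HeapTuple,
 * so callers that can work with a slot skip the conversion entirely. */
bool
systable_getnext_slot(SysScanDesc sysscan, TupleTableSlot **slot)
{
    Assert(sysscan->irel == NULL);   /* sequential scans only, for the sketch */
    *slot = sysscan->slot;
    return table_scan_getnextslot(sysscan->scan, ForwardScanDirection,
                                  sysscan->slot);
}
```

A converted caller such as vac_update_datfrozenxid() would then read the columns it needs from the slot rather than from a HeapTuple.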

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"They all went about naked as their mothers bore them, and the women too,
although I saw only one, who was quite young; and all those I saw were young
men, for I saw none of an age of more than XXX years" (Cristóbal Colón)

#4 Soumyadeep Chakraborty
soumyadeep2007@gmail.com
In reply to: Alvaro Herrera (#3)
Re: Use heap scan routines directly in vac_update_datfrozenxid()

On Mon, Oct 7, 2024 at 1:58 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> On 2024-Oct-06, Tom Lane wrote:
>
> > Soumyadeep Chakraborty <soumyadeep2007@gmail.com> writes:
> >
> > > Attached is a simple patch to directly use heap scan routines in
> > > vac_update_datfrozenxid(), avoiding the multilayer overhead from the
> > > sysscan infrastructure.
>
> Though if there's anybody with a Postgres fork using catalog tables that
> aren't heapam, then they aren't going to be happy with this change. (I
> remember Tom commenting that Salesforce used to do that, I wonder if
> they still do.)

I see. However, I do see many uses of table_beginscan_catalog() followed
by heap_getnext(). So we are already assuming catalog = heap at those
call sites. I guess they aren't ideal and are meant to be changed
eventually?

> > I would think the overhead of that is minuscule. If it isn't,
> > we should try to make it so, not randomly hack up some of the callers
> > to avoid it. The intention certainly was that it wouldn't cost
> > anything compared to what happens within the actual table access.

> I suspect the problem is not the systable layer per se, but the fact
> that it has to go through the table AM interface. So by replacing
> systable with the heap routines, it's actually _two_ layers of
> indirection that are being skipped over. systable seems indeed quite
> lean, or at least it was up to postgres 11 ... but it's not clear to me
> that it continues to be.
>
> The table AM API is heavily centered on slots, and I think having to
> build heap tuples from slots might be slow. I wonder if it would be
> better to add "systable_getnext_slot" which returns a slot and omit the
> conversion to heaptuple. Callers for which it's significant could skip
> that conversion by dealing with a slot instead. Converting just one or
> two critical spots (such as vac_update_datfrozenxid, maybe
> pg_publication.c) should be easy enough.

We aren't building heap tuples AFAICS. Attaching a debugger, I saw
that we are calling:

tts_buffer_heap_store_tuple execTuples.c:981
ExecStoreBufferHeapTuple execTuples.c:1493
heap_getnextslot heapam.c:1316
table_scan_getnextslot tableam.h:1071
systable_getnext genam.c:538
vac_update_datfrozenxid vacuum.c:1644

where we store the HeapTuple obtained from heapgettup() into the slot:
bslot->base.tuple = tuple;

And then subsequently, we obtain the tuple directly via:

tts_buffer_heap_get_heap_tuple execTuples.c:909
ExecFetchSlotHeapTuple genam.c:542
systable_getnext genam.c:542
vac_update_datfrozenxid vacuum.c:1644

We don't go through a slot->tts_ops->copy_heap_tuple() call, so we don't
copy the heap tuple. Here we just return: bslot->base.tuple.
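Condensing the two stacks above, the per-tuple round trip on main looks roughly like this (simplified pseudocode of the cited functions, not their exact bodies):

```c
/* inside systable_getnext(), main branch: */
table_scan_getnextslot(sysscan->scan, ForwardScanDirection, sysscan->slot);
    /* -> heap_getnextslot(): fetch the tuple, then
     *    ExecStoreBufferHeapTuple(tuple, slot, buffer);
     *      -> bslot->base.tuple = tuple;   plus buffer pin bookkeeping */
return ExecFetchSlotHeapTuple(sysscan->slot, false, NULL);
    /* -> tts_buffer_heap_get_heap_tuple():
     *    return bslot->base.tuple;         no copy, but extra calls */
```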

At any rate, there is some (albeit small) overhead due to slots being in the
picture. This is an excerpt from perf on the main branch with my workload:

main (with CFLAGS='-O0 -fno-omit-frame-pointer -ggdb', without asserts):

  Children      Self  Command   Symbol
...
-   81.59%     2.65%  postgres  [.] systable_getnext
   - 78.93% systable_getnext
         - 76.61% table_scan_getnextslot
         - 73.16% heap_getnextslot
            + 57.95% heapgettup_pagemode
            - 9.51% ExecStoreBufferHeapTuple
               - 7.65% tts_buffer_heap_store_tuple
                  + 2.47% IncrBufferRefCount
                  + 2.04% ReleaseBuffer
      - 1.83% ExecFetchSlotHeapTuple
           0.83% tts_buffer_heap_get_heap_tuple
   + 2.48% _start

We can see some overhead in ExecStoreBufferHeapTuple() and
ExecFetchSlotHeapTuple(), both of which are avoided in my branch.

I am attaching a perf diff (with --dsos=postgres) between the -O3 branch and
-O3 main (with the branch as the baseline), as well as the individual profiles
(--report with the --stdio flag and --dsos=postgres; profiles captured with
--call-graph=dwarf). I ran the same workload as in my previous email.

To your point about having a slot-based API, where the caller manipulates
the slot directly: that unfortunately won't help us avoid the overhead
seen in storing the tuple in the slot and then extracting it back out.

To Tom's point about fixing systable_getnext() to remove the overhead,
there is the option of substituting its internal call to
table_scan_getnextslot() with heap_getnext(). However, that wouldn't be
fair to non-heap catalogs, I suppose.

Regards,
Soumyadeep (Broadcom)

Attachments:

* branchO3_vs_mainO3.diff (text/x-patch)
* perf.main.O3.data.out (application/octet-stream)
* perf.branch.O3.data.out (application/octet-stream)
* perf.main.O0.data.out (application/octet-stream)