subtransaction performance regression [kind of] due to snapshot caching
Hi,
In a recent thread [1] I found a performance regression of the
following statement:
DO $do$
BEGIN FOR i IN 1 .. 10000 LOOP
BEGIN
EXECUTE $cf$CREATE OR REPLACE FUNCTION foo() RETURNS VOID LANGUAGE plpgsql AS $f$BEGIN frakbar; END;$f$;$cf$;
EXCEPTION WHEN others THEN
END;
END LOOP;
END;$do$;
13: 1617.798
14-dev: 34088.505
The time in 14 is spent mostly below:
- 94.58% 0.01% postgres postgres [.] CreateFunction
- 94.57% CreateFunction
- 94.49% ProcedureCreate
- 90.95% record_object_address_dependencies
- 90.93% recordMultipleDependencies
- 89.65% isObjectPinned
- 89.12% systable_getnext
- 89.06% index_getnext_slot
- 56.13% index_fetch_heap
- 54.82% table_index_fetch_tuple
+ 53.79% heapam_index_fetch_tuple
0.07% heap_hot_search_buffer
0.01% ReleaseAndReadBuffer
0.01% LockBuffer
0.08% heapam_index_fetch_tuple
After a bit of debugging I figured out that the direct failure lies with
623a9ba79b. The problem is that subtransaction abort does not increment
ShmemVariableCache->xactCompletionCount. That's trivial to remedy (we
already lock ProcArrayLock during XidCacheRemoveRunningXids).
What happens is that heap_hot_search_buffer() correctly recognizes the
aborted subtransaction's rows as dead, but they are not recognized as
"surely dead". Which then leads to O(iterations^2) index->heap lookups,
because the rows from previous iterations are never recognized as dead.
I initially was a bit worried that this could be a correctness issue as
well. The more I thought about it the more confused I got. A
transaction's subtransaction abort should not actually change the
contents of a snapshot, right?
Snapshot
GetSnapshotData(Snapshot snapshot)
...
/*
* We don't include our own XIDs (if any) in the snapshot. It
* needs to be included in the xmin computation, but we did so
* outside the loop.
*/
if (pgxactoff == mypgxactoff)
continue;
The sole reason for the behavioural difference is that the cached
snapshot's xmax is *lower* than a new snapshot's would be, because
GetSnapshotData() initializes xmax as
ShmemVariableCache->latestCompletedXid - which
XidCacheRemoveRunningXids() incremented, without incrementing
ShmemVariableCache->xactCompletionCount.
Which then causes HeapTupleSatisfiesMVCC to go down
if (!HeapTupleHeaderXminCommitted(tuple))
...
else if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))
return false;
else if (TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple)))
SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
HeapTupleHeaderGetRawXmin(tuple));
else
{
/* it must have aborted or crashed */
SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
InvalidTransactionId);
return false;
}
the "return false" for XidInMVCCSnapshot() rather than the
SetHintBits(HEAP_XMIN_INVALID) path. Which then in turn causes
HeapTupleIsSurelyDead() to not recognize the rows as surely dead.
bool
XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
{
uint32 i;
/*
* Make a quick range check to eliminate most XIDs without looking at the
* xip arrays. Note that this is OK even if we convert a subxact XID to
* its parent below, because a subxact with XID < xmin has surely also got
* a parent with XID < xmin, while one with XID >= xmax must belong to a
* parent that was not yet committed at the time of this snapshot.
*/
/* Any xid < xmin is not in-progress */
if (TransactionIdPrecedes(xid, snapshot->xmin))
return false;
/* Any xid >= xmax is in-progress */
if (TransactionIdFollowsOrEquals(xid, snapshot->xmax))
return true;
I *think* this issue doesn't lead to actually wrong query results. For
HeapTupleSatisfiesMVCC purposes there's not much of a difference between
an aborted transaction and one that's "in progress" according to the
snapshot (that's required - we don't check for aborts for xids in the
snapshot).
It is a bit disappointing that, as far as I could find, there are no
tests for kill_prior_tuple actually working. I guess that lack, and the
fact that query results don't differ, explains why nobody noticed the
issue in the last ~9 months.
See the attached fix. I did include a test that verifies that the
kill_prior_tuples optimization actually prevents the index from growing,
when subtransactions are involved. I think it should be stable, even
with concurrent activity. But I'd welcome a look.
I don't think that's why the issue exists, but I very much hate the
XidCache* name. Makes it sound much less important than it is...
Greetings,
Andres Freund
[1]: /messages/by-id/20210317055718.v6qs3ltzrformqoa@alap3.anarazel.de
Attachments:
0001-snapshot-caching-vs-subtransactions.patch (text/x-diff, +96/-2)
Andres Freund <andres@anarazel.de> writes:
The time in 14 is spent mostly below:
- 94.58% 0.01% postgres postgres [.] CreateFunction
- 94.57% CreateFunction
- 94.49% ProcedureCreate
- 90.95% record_object_address_dependencies
- 90.93% recordMultipleDependencies
- 89.65% isObjectPinned
- 89.12% systable_getnext
- 89.06% index_getnext_slot
- 56.13% index_fetch_heap
- 54.82% table_index_fetch_tuple
+ 53.79% heapam_index_fetch_tuple
0.07% heap_hot_search_buffer
0.01% ReleaseAndReadBuffer
0.01% LockBuffer
0.08% heapam_index_fetch_tuple
Not wanting to distract from your point about xactCompletionCount,
but ... I wonder if we could get away with defining "isObjectPinned"
as "is the OID <= 9999" (and, in consequence, dropping explicit pin
entries from pg_depend). I had not previously seen a case where the
cost of looking into pg_depend for this info was this much of the
total query runtime.
regards, tom lane
Hi,
On 2021-04-06 00:47:13 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
The time in 14 is spent mostly below:
- 94.58% 0.01% postgres postgres [.] CreateFunction
- 94.57% CreateFunction
- 94.49% ProcedureCreate
- 90.95% record_object_address_dependencies
- 90.93% recordMultipleDependencies
- 89.65% isObjectPinned
- 89.12% systable_getnext
- 89.06% index_getnext_slot
- 56.13% index_fetch_heap
- 54.82% table_index_fetch_tuple
+ 53.79% heapam_index_fetch_tuple
0.07% heap_hot_search_buffer
0.01% ReleaseAndReadBuffer
0.01% LockBuffer
0.08% heapam_index_fetch_tuple

Not wanting to distract from your point about xactCompletionCount,
but ... I wonder if we could get away with defining "isObjectPinned"
as "is the OID <= 9999" (and, in consequence, dropping explicit pin
entries from pg_depend). I had not previously seen a case where the
cost of looking into pg_depend for this info was this much of the
total query runtime.
I had the same thought, and yes, I do think we should do that. I've seen
it show up in non-buggy workloads too (not to that degree though).
The <= 9999 pg_depend entries are also a substantial proportion of
the size of an empty database, making it worthwhile to remove them:
freshly initdb'd empty cluster:
VACUUM FULL pg_depend;
dropme[926131][1]=# SELECT oid::regclass, pg_relation_size(oid) FROM pg_class WHERE relfilenode <> 0 ORDER BY 2 DESC LIMIT 10;
┌─────────────────────────────────┬──────────────────┐
│ oid │ pg_relation_size │
├─────────────────────────────────┼──────────────────┤
│ pg_depend │ 532480 │
│ pg_toast.pg_toast_2618 │ 516096 │
│ pg_collation │ 360448 │
│ pg_description │ 352256 │
│ pg_depend_depender_index │ 294912 │
│ pg_depend_reference_index │ 294912 │
│ pg_description_o_c_o_index │ 221184 │
│ pg_statistic │ 155648 │
│ pg_operator │ 114688 │
│ pg_collation_name_enc_nsp_index │ 106496 │
└─────────────────────────────────┴──────────────────┘
(10 rows)
DELETE FROM pg_depend WHERE deptype = 'p' AND refobjid <> 0 AND refobjid < 10000;
VACUUM FULL pg_depend;
dropme[926131][1]=# SELECT oid::regclass, pg_relation_size(oid) FROM pg_class WHERE relfilenode <> 0 ORDER BY 2 DESC LIMIT 10;
┌─────────────────────────────────┬──────────────────┐
│ oid │ pg_relation_size │
├─────────────────────────────────┼──────────────────┤
│ pg_toast.pg_toast_2618 │ 516096 │
│ pg_collation │ 360448 │
│ pg_description │ 352256 │
│ pg_depend │ 253952 │
│ pg_description_o_c_o_index │ 221184 │
│ pg_statistic │ 155648 │
│ pg_depend_depender_index │ 147456 │
│ pg_depend_reference_index │ 147456 │
│ pg_operator │ 114688 │
│ pg_collation_name_enc_nsp_index │ 106496 │
└─────────────────────────────────┴──────────────────┘
(10 rows)
A reduction from 8407kB to 7863kB of the size of the "dropme" database
just by deleting the "implicitly pinned" entries seems like a good deal.
To save people the time of looking it up: pg_toast.pg_toast_2618 belongs
to pg_description...
Couldn't we also treat FirstGenbkiObjectId to FirstBootstrapObjectId as
pinned? That'd be another 400kB of database size...
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2021-04-06 00:47:13 -0400, Tom Lane wrote:
Not wanting to distract from your point about xactCompletionCount,
but ... I wonder if we could get away with defining "isObjectPinned"
as "is the OID <= 9999" (and, in consequence, dropping explicit pin
entries from pg_depend). I had not previously seen a case where the
cost of looking into pg_depend for this info was this much of the
total query runtime.
Couldn't we also treat FirstGenbkiObjectId to FirstBootstrapObjectId as
pinned? That'd be another 400kB of database size...
Yeah, it'd require some close study of exactly what we want to pin
or not pin. Certainly everything with hand-assigned OIDs should
be pinned, but I think there's a lot of critical stuff like index
opclasses that don't get hand-assigned OIDs. On the other hand,
it's intentional that nothing in information_schema is pinned.
We might have to rejigger initdb so that there's a clearer
distinction between the OID ranges we want to pin or not.
Maybe we'd even get initdb to record the cutoff OID in
pg_control or someplace.
Anyway, just idle late-night speculation for now ...
regards, tom lane
On 2021-04-06 01:34:02 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2021-04-06 00:47:13 -0400, Tom Lane wrote:
Not wanting to distract from your point about xactCompletionCount,
but ... I wonder if we could get away with defining "isObjectPinned"
as "is the OID <= 9999" (and, in consequence, dropping explicit pin
entries from pg_depend). I had not previously seen a case where the
cost of looking into pg_depend for this info was this much of the
total query runtime.

Couldn't we also treat FirstGenbkiObjectId to FirstBootstrapObjectId as
pinned? That'd be another 400kB of database size...

Yeah, it'd require some close study of exactly what we want to pin
or not pin.
One interesting bit is:
postgres[947554][1]=# SELECT classid::regclass, objid, refclassid::regclass, refobjid, deptype, refobjversion FROM pg_depend WHERE refobjid < 13000 AND deptype <> 'p';
┌───────────────┬───────┬──────────────┬──────────┬─────────┬───────────────┐
│ classid │ objid │ refclassid │ refobjid │ deptype │ refobjversion │
├───────────────┼───────┼──────────────┼──────────┼─────────┼───────────────┤
│ pg_constraint │ 15062 │ pg_collation │ 100 │ n │ 2.31 │
└───────────────┴───────┴──────────────┴──────────┴─────────┴───────────────┘
(1 row)
Certainly everything with hand-assigned OIDs should
be pinned, but I think there's a lot of critical stuff like index
opclasses that don't get hand-assigned OIDs. On the other hand,
it's intentional that nothing in information_schema is pinned.
Isn't that pretty much the difference between FirstGenbkiObjectId and
FirstBootstrapObjectId? Genbki will have assigned things like opclasses,
but not things like information_schema?
We might have to rejigger initdb so that there's a clearer
distinction between the OID ranges we want to pin or not.
Maybe we'd even get initdb to record the cutoff OID in
pg_control or someplace.
The only non-pinned pg_depend entry below FirstBootstrapObjectId is the
collation versioning one above. The only pinned entries above
FirstBootstrapObjectId are the ones created via
system_constraints.sql. So it seems we "just" would need to resolve the
constraint versioning stuff? And that could probably just be handled as
a hardcoded special case for now...
Greetings,
Andres Freund
Hi,
On 2021-04-05 21:35:21 -0700, Andres Freund wrote:
See the attached fix. I did include a test that verifies that the
kill_prior_tuples optimization actually prevents the index from growing,
when subtransactions are involved. I think it should be stable, even
with concurrent activity. But I'd welcome a look.
Pushed that now, after trying and failing to make the test spuriously
fail due to concurrent activity.
Greetings,
Andres Freund