subtransaction performance regression [kind of] due to snapshot caching

Started by Andres Freund almost 5 years ago; 6 messages
#1 Andres Freund
andres@anarazel.de
1 attachment(s)

Hi,

In a recent thread ([1]) I found a performance regression for the
following statement:
DO $do$
BEGIN
    FOR i IN 1 .. 10000 LOOP
        BEGIN
            EXECUTE $cf$CREATE OR REPLACE FUNCTION foo() RETURNS VOID LANGUAGE plpgsql AS $f$BEGIN frakbar; END;$f$;$cf$;
        EXCEPTION WHEN others THEN
        END;
    END LOOP;
END;$do$;

13: 1617.798
14-dev: 34088.505

The time in 14 is spent mostly below:
-   94.58%     0.01%  postgres  postgres            [.] CreateFunction
   - 94.57% CreateFunction
      - 94.49% ProcedureCreate
         - 90.95% record_object_address_dependencies
            - 90.93% recordMultipleDependencies
               - 89.65% isObjectPinned
                  - 89.12% systable_getnext
                     - 89.06% index_getnext_slot
                        - 56.13% index_fetch_heap
                           - 54.82% table_index_fetch_tuple
                              + 53.79% heapam_index_fetch_tuple
                                0.07% heap_hot_search_buffer
                                0.01% ReleaseAndReadBuffer
                                0.01% LockBuffer
                             0.08% heapam_index_fetch_tuple

After a bit of debugging I figured out that the direct culprit is commit
623a9ba79b. The problem is that subtransaction abort does not increment
ShmemVariableCache->xactCompletionCount. That's trivial to remedy (we
already take ProcArrayLock in XidCacheRemoveRunningXids()).

What happens is that heap_hot_search_buffer() correctly recognizes the
aborted subtransaction's rows as dead, but they are not recognized as
"surely dead". Which then leads to O(iterations^2) index->heap lookups,
because the rows from previous iterations are never recognized as surely
dead, and thus their index entries are never killed.
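
For anyone wanting to watch this from the outside, the effect also shows up
in the cumulative stats while the DO block above runs. Something along
these lines works (just an illustration; it assumes default track_counts
and that isObjectPinned()'s scans go through pg_depend_reference_index, as
the profile suggests):

SELECT indexrelname, idx_scan, idx_tup_read
FROM pg_stat_sys_indexes
WHERE indexrelname = 'pg_depend_reference_index';

On the regressed version idx_tup_read should balloon, because every
iteration re-reads the never-killed index entries left behind by all the
previous ones.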

I initially was a bit worried that this could be a correctness issue as
well. The more I thought about it the more confused I got. A
transaction's subtransaction abort should not actually change the
contents of a snapshot, right?

Snapshot
GetSnapshotData(Snapshot snapshot)
...
        /*
         * We don't include our own XIDs (if any) in the snapshot. It
         * needs to be included in the xmin computation, but we did so
         * outside the loop.
         */
        if (pgxactoff == mypgxactoff)
            continue;

The sole reason for the behavioural difference is that the cached
snapshot's xmax is *lower* than a freshly taken snapshot's would be:
GetSnapshotData() initializes xmax from
ShmemVariableCache->latestCompletedXid, which
XidCacheRemoveRunningXids() incremented without also incrementing
ShmemVariableCache->xactCompletionCount.

Which then causes HeapTupleSatisfiesMVCC to go down
    if (!HeapTupleHeaderXminCommitted(tuple))
    ...
        else if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))
            return false;
        else if (TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple)))
            SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
                        HeapTupleHeaderGetRawXmin(tuple));
        else
        {
            /* it must have aborted or crashed */
            SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
                        InvalidTransactionId);
            return false;
        }

the "return false" for XidInMVCCSnapshot() rather than the
SetHintBits(HEAP_XMIN_INVALID) path. Which then in turn causes
HeapTupleIsSurelyDead() to not recognize the rows as surely dead.

bool
XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
{
    uint32      i;

    /*
     * Make a quick range check to eliminate most XIDs without looking at the
     * xip arrays. Note that this is OK even if we convert a subxact XID to
     * its parent below, because a subxact with XID < xmin has surely also got
     * a parent with XID < xmin, while one with XID >= xmax must belong to a
     * parent that was not yet committed at the time of this snapshot.
     */

    /* Any xid < xmin is not in-progress */
    if (TransactionIdPrecedes(xid, snapshot->xmin))
        return false;
    /* Any xid >= xmax is in-progress */
    if (TransactionIdFollowsOrEquals(xid, snapshot->xmax))
        return true;

I *think* this issue doesn't actually lead to wrong query results. For
HeapTupleSatisfiesMVCC purposes there's not much of a difference between
an aborted transaction and one that's "in progress" according to the
snapshot (that's required - we don't check for aborts for xids in the
snapshot).
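
To make that concrete, here is a trivial illustration (table name made up,
nothing specific to this bug): rows written by an aborted subtransaction
stay invisible either way, whether the xmin is filtered out via the
snapshot or via the aborted-transaction path.

CREATE TEMP TABLE subxact_demo(i int);
BEGIN;
SAVEPOINT s;
INSERT INTO subxact_demo VALUES (1);
ROLLBACK TO SAVEPOINT s;
SELECT count(*) FROM subxact_demo;  -- 0 either way
COMMIT;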

It is a bit disappointing that there are - as far as I could find - no
tests for kill_prior_tuple actually working. I guess that lack, and the
fact that there's no difference in query results, explain why nobody
noticed the issue in the last ~9 months.

See the attached fix. I did include a test that verifies that the
kill_prior_tuples optimization actually prevents the index from growing,
when subtransactions are involved. I think it should be stable, even
with concurrent activity. But I'd welcome a look.

I don't think that's why the issue exists, but I very much hate the
XidCache* name. Makes it sound much less important than it is...

Greetings,

Andres Freund

[1]: /messages/by-id/20210317055718.v6qs3ltzrformqoa@alap3.anarazel.de

Attachments:

0001-snapshot-caching-vs-subtransactions.patch (text/x-diff; charset=us-ascii)
From 1e1259df48373934be43049d8aa4dad328e96b96 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 5 Apr 2021 21:21:14 -0700
Subject: [PATCH] snapshot caching vs subtransactions.

---
 src/backend/storage/ipc/procarray.c |  8 ++++++
 src/test/regress/expected/mvcc.out  | 42 +++++++++++++++++++++++++++
 src/test/regress/parallel_schedule  |  2 +-
 src/test/regress/serial_schedule    |  1 +
 src/test/regress/sql/mvcc.sql       | 44 +++++++++++++++++++++++++++++
 5 files changed, 96 insertions(+), 1 deletion(-)
 create mode 100644 src/test/regress/expected/mvcc.out
 create mode 100644 src/test/regress/sql/mvcc.sql

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index e113a85aed4..bf776286de0 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1210,6 +1210,11 @@ ProcArrayApplyRecoveryInfo(RunningTransactions running)
 	 */
 	MaintainLatestCompletedXidRecovery(running->latestCompletedXid);
 
+	/*
+	 * NB: No need to increment ShmemVariableCache->xactCompletionCount here,
+	 * nobody can see it yet.
+	 */
+
 	LWLockRelease(ProcArrayLock);
 
 	/* ShmemVariableCache->nextXid must be beyond any observed xid. */
@@ -3915,6 +3920,9 @@ XidCacheRemoveRunningXids(TransactionId xid,
 	/* Also advance global latestCompletedXid while holding the lock */
 	MaintainLatestCompletedXid(latestXid);
 
+	/* ... and xactCompletionCount */
+	ShmemVariableCache->xactCompletionCount++;
+
 	LWLockRelease(ProcArrayLock);
 }
 
diff --git a/src/test/regress/expected/mvcc.out b/src/test/regress/expected/mvcc.out
new file mode 100644
index 00000000000..16ed4ddf2d8
--- /dev/null
+++ b/src/test/regress/expected/mvcc.out
@@ -0,0 +1,42 @@
+--
+-- Verify that index scans encountering dead rows produced by an
+-- aborted subtransaction of the current transaction can utilize the
+-- kill_prior_tuple optimization
+--
+-- NB: The table size is currently *not* expected to stay the same, we
+-- don't have logic to trigger opportunistic pruning in cases like
+-- this.
+BEGIN;
+SET LOCAL enable_seqscan = false;
+SET LOCAL enable_indexonlyscan = false;
+SET LOCAL enable_bitmapscan = false;
+-- Can't easily use a unique index, since dead tuples can be found
+-- independent of the kill_prior_tuples optimization.
+CREATE TABLE clean_aborted_self(key int, data text);
+CREATE INDEX clean_aborted_self_key ON clean_aborted_self(key);
+INSERT INTO clean_aborted_self (key, data) VALUES (-1, 'just to allocate metapage');
+-- save index size from before the changes, for comparison
+SELECT pg_relation_size('clean_aborted_self_key') AS clean_aborted_self_key_before \gset
+DO $$
+BEGIN
+    -- iterate often enough to see index growth even on larger-than-default page sizes
+    FOR i IN 1..100 LOOP
+        BEGIN
+	    -- perform index scan over all the inserted keys to get them to be seen as dead
+            IF EXISTS(SELECT * FROM clean_aborted_self WHERE key > 0 AND key < 100) THEN
+	        RAISE data_corrupted USING MESSAGE = 'these rows should not exist';
+            END IF;
+            INSERT INTO clean_aborted_self SELECT g.i, 'rolling back in a sec' FROM generate_series(1, 100) g(i);
+	    -- just some error that's not normally thrown
+	    RAISE reading_sql_data_not_permitted USING MESSAGE = 'round and round again';
+	EXCEPTION WHEN reading_sql_data_not_permitted THEN END;
+    END LOOP;
+END;$$;
+-- show sizes only if they differ
+SELECT :clean_aborted_self_key_before AS size_before, pg_relation_size('clean_aborted_self_key') size_after
+WHERE :clean_aborted_self_key_before != pg_relation_size('clean_aborted_self_key');
+ size_before | size_after 
+-------------+------------
+(0 rows)
+
+ROLLBACK;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 2e898390892..a0913008577 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -29,7 +29,7 @@ test: strings numerology point lseg line box path polygon circle date time timet
 # geometry depends on point, lseg, box, path, polygon and circle
 # horology depends on interval, timetz, timestamp, timestamptz
 # ----------
-test: geometry horology regex type_sanity opr_sanity misc_sanity comments expressions unicode xid
+test: geometry horology regex type_sanity opr_sanity misc_sanity comments expressions unicode xid mvcc
 
 # ----------
 # These four each depend on the previous one
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index a46f3d01789..56448476015 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -11,6 +11,7 @@ test: int4
 test: int8
 test: oid
 test: xid
+test: mvcc
 test: float4
 test: float8
 test: bit
diff --git a/src/test/regress/sql/mvcc.sql b/src/test/regress/sql/mvcc.sql
new file mode 100644
index 00000000000..b22a86dc5e5
--- /dev/null
+++ b/src/test/regress/sql/mvcc.sql
@@ -0,0 +1,44 @@
+--
+-- Verify that index scans encountering dead rows produced by an
+-- aborted subtransaction of the current transaction can utilize the
+-- kill_prior_tuple optimization
+--
+-- NB: The table size is currently *not* expected to stay the same, we
+-- don't have logic to trigger opportunistic pruning in cases like
+-- this.
+BEGIN;
+
+SET LOCAL enable_seqscan = false;
+SET LOCAL enable_indexonlyscan = false;
+SET LOCAL enable_bitmapscan = false;
+
+-- Can't easily use a unique index, since dead tuples can be found
+-- independent of the kill_prior_tuples optimization.
+CREATE TABLE clean_aborted_self(key int, data text);
+CREATE INDEX clean_aborted_self_key ON clean_aborted_self(key);
+INSERT INTO clean_aborted_self (key, data) VALUES (-1, 'just to allocate metapage');
+
+-- save index size from before the changes, for comparison
+SELECT pg_relation_size('clean_aborted_self_key') AS clean_aborted_self_key_before \gset
+
+DO $$
+BEGIN
+    -- iterate often enough to see index growth even on larger-than-default page sizes
+    FOR i IN 1..100 LOOP
+        BEGIN
+	    -- perform index scan over all the inserted keys to get them to be seen as dead
+            IF EXISTS(SELECT * FROM clean_aborted_self WHERE key > 0 AND key < 100) THEN
+	        RAISE data_corrupted USING MESSAGE = 'these rows should not exist';
+            END IF;
+            INSERT INTO clean_aborted_self SELECT g.i, 'rolling back in a sec' FROM generate_series(1, 100) g(i);
+	    -- just some error that's not normally thrown
+	    RAISE reading_sql_data_not_permitted USING MESSAGE = 'round and round again';
+	EXCEPTION WHEN reading_sql_data_not_permitted THEN END;
+    END LOOP;
+END;$$;
+
+-- show sizes only if they differ
+SELECT :clean_aborted_self_key_before AS size_before, pg_relation_size('clean_aborted_self_key') size_after
+WHERE :clean_aborted_self_key_before != pg_relation_size('clean_aborted_self_key');
+
+ROLLBACK;
-- 
2.31.0.121.g9198c13e34

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#1)
Re: subtransaction performance regression [kind of] due to snapshot caching

Andres Freund <andres@anarazel.de> writes:

The time in 14 is spent mostly below:
-   94.58%     0.01%  postgres  postgres            [.] CreateFunction
   - 94.57% CreateFunction
      - 94.49% ProcedureCreate
         - 90.95% record_object_address_dependencies
            - 90.93% recordMultipleDependencies
               - 89.65% isObjectPinned
                  - 89.12% systable_getnext
                     - 89.06% index_getnext_slot
                        - 56.13% index_fetch_heap
                           - 54.82% table_index_fetch_tuple
                              + 53.79% heapam_index_fetch_tuple
                                0.07% heap_hot_search_buffer
                                0.01% ReleaseAndReadBuffer
                                0.01% LockBuffer
                             0.08% heapam_index_fetch_tuple

Not wanting to distract from your point about xactCompletionCount,
but ... I wonder if we could get away with defining "isObjectPinned"
as "is the OID <= 9999" (and, in consequence, dropping explicit pin
entries from pg_depend). I had not previously seen a case where the
cost of looking into pg_depend for this info was this much of the
total query runtime.
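
To sanity-check that idea against a fresh database, something like the
following would do (purely an illustration; the 10000 cutoff just mirrors
the "OID <= 9999" rule above):

SELECT count(*) FILTER (WHERE refobjid < 10000) AS below_10000,
       count(*) FILTER (WHERE refobjid >= 10000) AS at_or_above_10000
FROM pg_depend
WHERE deptype = 'p';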

regards, tom lane

#3 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#2)
Re: subtransaction performance regression [kind of] due to snapshot caching

Hi,

On 2021-04-06 00:47:13 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

The time in 14 is spent mostly below:
-   94.58%     0.01%  postgres  postgres            [.] CreateFunction
   - 94.57% CreateFunction
      - 94.49% ProcedureCreate
         - 90.95% record_object_address_dependencies
            - 90.93% recordMultipleDependencies
               - 89.65% isObjectPinned
                  - 89.12% systable_getnext
                     - 89.06% index_getnext_slot
                        - 56.13% index_fetch_heap
                           - 54.82% table_index_fetch_tuple
                              + 53.79% heapam_index_fetch_tuple
                                0.07% heap_hot_search_buffer
                                0.01% ReleaseAndReadBuffer
                                0.01% LockBuffer
                             0.08% heapam_index_fetch_tuple

Not wanting to distract from your point about xactCompletionCount,
but ... I wonder if we could get away with defining "isObjectPinned"
as "is the OID <= 9999" (and, in consequence, dropping explicit pin
entries from pg_depend). I had not previously seen a case where the
cost of looking into pg_depend for this info was this much of the
total query runtime.

I had the same thought, and yes, I do think we should do that. I've seen
it show up in non-buggy workloads too (not to that degree though).

The <= 9999 pg_depend entries are also a substantial proportion of
the size of an empty database, making it worthwhile to remove them:

freshly initdb'd empty cluster:

VACUUM FULL pg_depend;
dropme[926131][1]=# SELECT oid::regclass, pg_relation_size(oid) FROM pg_class WHERE relfilenode <> 0 ORDER BY 2 DESC LIMIT 10;
┌─────────────────────────────────┬──────────────────┐
│               oid               │ pg_relation_size │
├─────────────────────────────────┼──────────────────┤
│ pg_depend                       │           532480 │
│ pg_toast.pg_toast_2618          │           516096 │
│ pg_collation                    │           360448 │
│ pg_description                  │           352256 │
│ pg_depend_depender_index        │           294912 │
│ pg_depend_reference_index       │           294912 │
│ pg_description_o_c_o_index      │           221184 │
│ pg_statistic                    │           155648 │
│ pg_operator                     │           114688 │
│ pg_collation_name_enc_nsp_index │           106496 │
└─────────────────────────────────┴──────────────────┘
(10 rows)

DELETE FROM pg_depend WHERE deptype = 'p' AND refobjid <> 0 AND refobjid < 10000;
VACUUM FULL pg_depend;

dropme[926131][1]=# SELECT oid::regclass, pg_relation_size(oid) FROM pg_class WHERE relfilenode <> 0 ORDER BY 2 DESC LIMIT 10;
┌─────────────────────────────────┬──────────────────┐
│               oid               │ pg_relation_size │
├─────────────────────────────────┼──────────────────┤
│ pg_toast.pg_toast_2618          │           516096 │
│ pg_collation                    │           360448 │
│ pg_description                  │           352256 │
│ pg_depend                       │           253952 │
│ pg_description_o_c_o_index      │           221184 │
│ pg_statistic                    │           155648 │
│ pg_depend_depender_index        │           147456 │
│ pg_depend_reference_index       │           147456 │
│ pg_operator                     │           114688 │
│ pg_collation_name_enc_nsp_index │           106496 │
└─────────────────────────────────┴──────────────────┘
(10 rows)

A reduction from 8407kB to 7863kB of the size of the "dropme" database
just by deleting the "implicitly pinned" entries seems like a good deal.
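
Database size numbers like these can be checked with pg_database_size(),
e.g.:

SELECT pg_size_pretty(pg_database_size(current_database()));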

To save people the time of looking it up: pg_toast.pg_toast_2618 is
pg_rewrite's toast table...

Couldn't we also treat FirstGenbkiObjectId to FirstBootstrapObjectId as
pinned? That'd be another 400kB of database size...
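
Something like this gives a rough count of what that extra range would
cover (illustration only; 10000 and 12000 are assumed here to be the values
of FirstGenbkiObjectId and FirstBootstrapObjectId):

SELECT count(*)
FROM pg_depend
WHERE deptype = 'p' AND refobjid >= 10000 AND refobjid < 12000;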

Greetings,

Andres Freund

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#3)
Re: subtransaction performance regression [kind of] due to snapshot caching

Andres Freund <andres@anarazel.de> writes:

On 2021-04-06 00:47:13 -0400, Tom Lane wrote:

Not wanting to distract from your point about xactCompletionCount,
but ... I wonder if we could get away with defining "isObjectPinned"
as "is the OID <= 9999" (and, in consequence, dropping explicit pin
entries from pg_depend). I had not previously seen a case where the
cost of looking into pg_depend for this info was this much of the
total query runtime.

Couldn't we also treat FirstGenbkiObjectId to FirstBootstrapObjectId as
pinned? That'd be another 400kB of database size...

Yeah, it'd require some close study of exactly what we want to pin
or not pin. Certainly everything with hand-assigned OIDs should
be pinned, but I think there's a lot of critical stuff like index
opclasses that don't get hand-assigned OIDs. On the other hand,
it's intentional that nothing in information_schema is pinned.

We might have to rejigger initdb so that there's a clearer
distinction between the OID ranges we want to pin or not.
Maybe we'd even get initdb to record the cutoff OID in
pg_control or someplace.

Anyway, just idle late-night speculation for now ...

regards, tom lane

#5 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#4)
Re: subtransaction performance regression [kind of] due to snapshot caching

On 2021-04-06 01:34:02 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On 2021-04-06 00:47:13 -0400, Tom Lane wrote:

Not wanting to distract from your point about xactCompletionCount,
but ... I wonder if we could get away with defining "isObjectPinned"
as "is the OID <= 9999" (and, in consequence, dropping explicit pin
entries from pg_depend). I had not previously seen a case where the
cost of looking into pg_depend for this info was this much of the
total query runtime.

Couldn't we also treat FirstGenbkiObjectId to FirstBootstrapObjectId as
pinned? That'd be another 400kB of database size...

Yeah, it'd require some close study of exactly what we want to pin
or not pin.

One interesting bit is:

postgres[947554][1]=# SELECT classid::regclass, objid, refclassid::regclass, refobjid, deptype, refobjversion FROM pg_depend WHERE refobjid < 13000 AND deptype <> 'p';
┌───────────────┬───────┬──────────────┬──────────┬─────────┬───────────────┐
│    classid    │ objid │  refclassid  │ refobjid │ deptype │ refobjversion │
├───────────────┼───────┼──────────────┼──────────┼─────────┼───────────────┤
│ pg_constraint │ 15062 │ pg_collation │      100 │ n       │ 2.31          │
└───────────────┴───────┴──────────────┴──────────┴─────────┴───────────────┘
(1 row)

Certainly everything with hand-assigned OIDs should
be pinned, but I think there's a lot of critical stuff like index
opclasses that don't get hand-assigned OIDs. On the other hand,
it's intentional that nothing in information_schema is pinned.

Isn't that pretty much the difference between FirstGenbkiObjectId and
FirstBootstrapObjectId? Genbki will have assigned things like opclasses,
but not things like information_schema?

We might have to rejigger initdb so that there's a clearer
distinction between the OID ranges we want to pin or not.
Maybe we'd even get initdb to record the cutoff OID in
pg_control or someplace.

The only non-pinned pg_depend entry below FirstBootstrapObjectId is the
collation versioning one above. The only pinned entries above
FirstBootstrapObjectId are the ones created via
system_constraints.sql. So it seems we "just" would need to resolve the
constraint versioning stuff? And that could probably just be handled as
a hardcoded special case for now...

Greetings,

Andres Freund

#6 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
Re: subtransaction performance regression [kind of] due to snapshot caching

Hi,

On 2021-04-05 21:35:21 -0700, Andres Freund wrote:

See the attached fix. I did include a test that verifies that the
kill_prior_tuples optimization actually prevents the index from growing,
when subtransactions are involved. I think it should be stable, even
with concurrent activity. But I'd welcome a look.

Pushed that now, after trying and failing to make the test spuriously
fail due to concurrent activity.

Greetings,

Andres Freund