[PATCH] Speed up vac_update_datfrozenxid.
Hi hackers,
While testing the behavior of 'vacuum' under high load,
we noticed that the exclusive lock with the 'LOCKTAG_DATABASE_FROZEN_ID'
tag could be held for a prolonged time, causing contention with
other sessions.
This lock is taken by the 'vac_update_datfrozenxid' function at the
end of the vacuum process to update the value of 'datfrozenxid' in the
'pg_database' relation.
This function currently performs two operations under the
mentioned lock: it first scans 'pg_class' to find the minimal
'relfrozenxid' and 'relminmxid', and then updates the corresponding
values in the 'pg_database' relation. The update of 'pg_database'
includes locking the required tuple and only then deciding whether the
row needs to be updated. While looking at
'vac_update_datfrozenxid', we noticed the following comment:
/*
* Fetch a copy of the tuple to scribble on. We could check the syscache
* tuple first. If that concluded !dirty, we'd avoid waiting on
* concurrent heap_update() and would avoid exclusive-locking the buffer.
* For now, don't optimize that.
*/
The optimization described in this comment (i.e. scanning and
locking the tuple only if we really intend to update it) seems
valid, so I've created a patch to test it (attached to this
message). It applies on top of the current 'master' branch.
The code now tries to obtain the data from the syscache first and checks
whether it is necessary to call 'systable_inplace_update_begin()' to
actually update the xid values.
To test the effect of the patch, a synthetic test was written using
bpftrace.
The bpftrace script measures the execution time of 'vac_update_datfrozenxid';
it is attached to the email.
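For reference, a minimal sketch of such a script (the actual attached script may differ in details, and the path to the postgres binary is only an assumption for illustration):

uprobe:/usr/local/pgsql/bin/postgres:vac_update_datfrozenxid
{
    // record the entry timestamp per thread
    @start[tid] = nsecs;
}

uretprobe:/usr/local/pgsql/bin/postgres:vac_update_datfrozenxid
/@start[tid]/
{
    // print the per-call duration in nanoseconds to stdout
    printf("%lld\n", nsecs - @start[tid]);
    delete(@start[tid]);
}

Run as root against a cluster started with autovacuum_naptime = '1s'; each autovacuum cycle then prints one duration.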
The test scenario:
1. Create an empty database cluster
2. Set autovacuum_naptime = '1s' (so that 'vac_update_datfrozenxid' is
called more often)
3. Run the bpftrace script (it outputs the function execution time in
ns to stdout)
It is not necessary to apply any load; the autovacuum worker will call
'vac_update_datfrozenxid' regardless.
As a result, the execution time of 'vac_update_datfrozenxid()' with the
patch is about 30% lower:

+--------+----------+----------+----------+
|        | No patch | Patch    | Speed up |
+--------+----------+----------+----------+
| Mean   | 73058 ns | 51444 ns | 30%      |
| Median | 71846 ns | 48650 ns | 33%      |
+--------+----------+----------+----------+
It is worth noting that, because a full 'pg_class' scan is performed
before updating 'datfrozenxid', the performance improvement becomes
less significant as the number of relations increases (more time is
spent on the sequential scan of 'pg_class'). The previous measurement
was performed on an empty cluster with no user relations. To test the
patch in a more realistic case, I generated 1000 tables in the database
and repeated the measurement.

+--------+----------+----------+----------+
|        | No patch | Patch    | Speed up |
+--------+----------+----------+----------+
| Mean   | 97113 ns | 73839 ns | 24%      |
| Median | 98328 ns | 63541 ns | 29%      |
+--------+----------+----------+----------+

As we can see, the speed gain has decreased, but it is still significant.
What do you think about the patch?
Best regards,
Rustam Khamidullin
Attachments:
0001-Check-the-syscache-before-updating-datfrozenxid.patch
From ea2382b5594fc41d040641c23e41867e160fcc3d Mon Sep 17 00:00:00 2001
From: Rustam Khamidullin <r.khamidullin@postgrespro.ru>
Date: Thu, 10 Jul 2025 19:01:05 +0700
Subject: [PATCH] Check the syscache before updating datfrozenxid.
Before updating pg_database.datfrozenxid, check via the syscache whether
it needs to be updated at all, to avoid exclusive locking when it does not.
---
src/backend/commands/vacuum.c | 118 +++++++++++++++++++++-------------
1 file changed, 73 insertions(+), 45 deletions(-)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 733ef40ae7c..2e304a12db7 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1588,6 +1588,13 @@ vac_update_relstats(Relation relation,
RelationGetRelationName(relation))));
}
+static inline bool
+need_update_datxid(TransactionId datxid, TransactionId newXid, TransactionId lastSaneXid)
+{
+ return datxid != newXid &&
+ (TransactionIdPrecedes(datxid, newXid) ||
+ TransactionIdPrecedes(lastSaneXid, datxid));
+}
/*
* vac_update_datfrozenxid() -- update pg_database.datfrozenxid for our DB
@@ -1619,6 +1626,8 @@ vac_update_datfrozenxid(void)
MultiXactId newMinMulti;
TransactionId lastSaneFrozenXid;
MultiXactId lastSaneMinMulti;
+ TransactionId prevDatfrozenxid;
+ TransactionId prevDatminmxid;
bool bogus = false;
bool dirty = false;
ScanKeyData key[1];
@@ -1742,61 +1751,80 @@ vac_update_datfrozenxid(void)
Assert(TransactionIdIsNormal(newFrozenXid));
Assert(MultiXactIdIsValid(newMinMulti));
- /* Now fetch the pg_database tuple we need to update. */
- relation = table_open(DatabaseRelationId, RowExclusiveLock);
-
/*
- * Fetch a copy of the tuple to scribble on. We could check the syscache
- * tuple first. If that concluded !dirty, we'd avoid waiting on
- * concurrent heap_update() and would avoid exclusive-locking the buffer.
- * For now, don't optimize that.
+ * Update datfrozenxid and datminmxid. We check the syscache tuple
+ * first; if that concludes !dirty, we avoid waiting on concurrent
+ * heap_update() and avoid exclusive-locking the buffer.
*/
- ScanKeyInit(&key[0],
- Anum_pg_database_oid,
- BTEqualStrategyNumber, F_OIDEQ,
- ObjectIdGetDatum(MyDatabaseId));
+ prevDatfrozenxid = InvalidTransactionId;
+ prevDatminmxid = InvalidTransactionId;
- systable_inplace_update_begin(relation, DatabaseOidIndexId, true,
- NULL, 1, key, &tuple, &inplace_state);
-
- if (!HeapTupleIsValid(tuple))
- elog(ERROR, "could not find tuple for database %u", MyDatabaseId);
+ tuple = SearchSysCache1(DATABASEOID, ObjectIdGetDatum(MyDatabaseId));
+ if (HeapTupleIsValid(tuple))
+ {
+ dbform = (Form_pg_database) GETSTRUCT(tuple);
- dbform = (Form_pg_database) GETSTRUCT(tuple);
+ prevDatfrozenxid = dbform->datfrozenxid;
+ prevDatminmxid = dbform->datminmxid;
+ ReleaseSysCache(tuple);
+ }
- /*
- * As in vac_update_relstats(), we ordinarily don't want to let
- * datfrozenxid go backward; but if it's "in the future" then it must be
- * corrupt and it seems best to overwrite it.
- */
- if (dbform->datfrozenxid != newFrozenXid &&
- (TransactionIdPrecedes(dbform->datfrozenxid, newFrozenXid) ||
- TransactionIdPrecedes(lastSaneFrozenXid, dbform->datfrozenxid)))
+ if (!TransactionIdIsValid(prevDatfrozenxid) ||
+ !TransactionIdIsValid(prevDatminmxid) ||
+ need_update_datxid(prevDatfrozenxid, newFrozenXid, lastSaneFrozenXid) ||
+ need_update_datxid(prevDatminmxid, newMinMulti, lastSaneMinMulti))
{
- dbform->datfrozenxid = newFrozenXid;
- dirty = true;
+ /* Now fetch the pg_database tuple we need to update. */
+ relation = table_open(DatabaseRelationId, RowExclusiveLock);
+
+ ScanKeyInit(&key[0],
+ Anum_pg_database_oid,
+ BTEqualStrategyNumber, F_OIDEQ,
+ ObjectIdGetDatum(MyDatabaseId));
+
+ systable_inplace_update_begin(relation, DatabaseOidIndexId, true,
+ NULL, 1, key, &tuple, &inplace_state);
+
+ if (!HeapTupleIsValid(tuple))
+ elog(ERROR, "could not find tuple for database %u", MyDatabaseId);
+
+ dbform = (Form_pg_database) GETSTRUCT(tuple);
+
+ /*
+ * As in vac_update_relstats(), we ordinarily don't want to let
+ * datfrozenxid go backward; but if it's "in the future" then it must
+ * be corrupt and it seems best to overwrite it.
+ */
+ if (need_update_datxid(dbform->datfrozenxid, newFrozenXid, lastSaneFrozenXid))
+ {
+ dbform->datfrozenxid = newFrozenXid;
+ dirty = true;
+ }
+ else
+ newFrozenXid = dbform->datfrozenxid;
+
+ /* Ditto for datminmxid */
+ if (need_update_datxid(dbform->datminmxid, newMinMulti, lastSaneMinMulti))
+ {
+ dbform->datminmxid = newMinMulti;
+ dirty = true;
+ }
+ else
+ newMinMulti = dbform->datminmxid;
+
+ if (dirty)
+ systable_inplace_update_finish(inplace_state, tuple);
+ else
+ systable_inplace_update_cancel(inplace_state);
+
+ heap_freetuple(tuple);
+ table_close(relation, RowExclusiveLock);
}
else
- newFrozenXid = dbform->datfrozenxid;
-
- /* Ditto for datminmxid */
- if (dbform->datminmxid != newMinMulti &&
- (MultiXactIdPrecedes(dbform->datminmxid, newMinMulti) ||
- MultiXactIdPrecedes(lastSaneMinMulti, dbform->datminmxid)))
{
- dbform->datminmxid = newMinMulti;
- dirty = true;
+ newFrozenXid = prevDatfrozenxid;
+ newMinMulti = prevDatminmxid;
}
- else
- newMinMulti = dbform->datminmxid;
-
- if (dirty)
- systable_inplace_update_finish(inplace_state, tuple);
- else
- systable_inplace_update_cancel(inplace_state);
-
- heap_freetuple(tuple);
- table_close(relation, RowExclusiveLock);
/*
* If we were able to advance datfrozenxid or datminmxid, see if we can
--
2.50.1
On 15/07/2025 14:02, Rustam Khamidullin wrote:
> [...]
> What do you think about the patch?
How significant is that speedup in the grand scheme of things? What
fraction is vac_update_datfrozenxid of the overall work that autovacuum
does? Did this patch help with the lock contention that you mentioned at
the top?
- Heikki
> How significant is that speedup in the grand scheme of things? What
> fraction is vac_update_datfrozenxid of the overall work that autovacuum
> does?
To answer this question, I conducted additional measurements. To find
out how long vacuum takes, I measured the execution time of
the `do_autovacuum` function and compared it with the execution time of
`vac_update_datfrozenxid`, both with and without load.
The load was generated using the pgbench tool. In all load tests, the
result was approximately 900 TPS. (I dropped the `postgres` database
so that vacuum only worked in one database.)
pgbench -i -s 10 -d template1 -U postgres
pgbench -c 1 -j 1 -T 330 -S -d template1 -U postgres
The execution time was measured using bpftrace scripts.
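They follow the same uprobe/uretprobe pattern as the sketch earlier in the thread; an aggregating variant covering both functions might look roughly like this (the binary path is again an assumption, and the real scripts may differ):

uprobe:/usr/local/pgsql/bin/postgres:do_autovacuum
{
    @av_start[tid] = nsecs;
}

uretprobe:/usr/local/pgsql/bin/postgres:do_autovacuum
/@av_start[tid]/
{
    // count, average and total duration, printed when the script exits
    @do_autovacuum_ns = stats(nsecs - @av_start[tid]);
    delete(@av_start[tid]);
}

uprobe:/usr/local/pgsql/bin/postgres:vac_update_datfrozenxid
{
    @fx_start[tid] = nsecs;
}

uretprobe:/usr/local/pgsql/bin/postgres:vac_update_datfrozenxid
/@fx_start[tid]/
{
    @vac_update_datfrozenxid_ns = stats(nsecs - @fx_start[tid]);
    delete(@fx_start[tid]);
}

Note that stats() only reports count/average/total; medians have to be computed offline from per-call output as in the first sketch.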
Without load:
+-------------------------+----------+------------+------------+--------------+
|                         | Mean, ns | Mean Ratio | Median, ns | Median Ratio |
+-------------------------+----------+------------+------------+--------------+
| do_autovacuum           | 256665   | 3.68       | 247760     | 3.80         |
| vac_update_datfrozenxid | 69694    | 0.27       | 65232      | 0.26         |
+-------------------------+----------+------------+------------+--------------+
With load:
+------------------------------+----------+------------+------------+--------------+
|                              | Mean, ns | Mean Ratio | Median, ns | Median Ratio |
+------------------------------+----------+------------+------------+--------------+
| do_autovacuum_load           | 12673421 | 199.44     | 3109108    | 50.86        |
| vac_update_datfrozenxid_load | 63544    | 0.01       | 61128      | 0.02         |
+------------------------------+----------+------------+------------+--------------+
Thus, with no load on the database, vac_update_datfrozenxid accounts for
about 27% of do_autovacuum's execution time; under load, only 1-2%.
We can also evaluate how much faster do_autovacuum has become after
applying the patch.
Without load:
+---------------------+----------+------------+------------+--------------+
|                     | Mean, ns | Mean Ratio | Median, ns | Median Ratio |
+---------------------+----------+------------+------------+--------------+
| do_autovacuum       | 256665   | 0.97       | 247760     | 0.97         |
| do_autovacuum_patch | 263913   | 1.03       | 255346     | 1.03         |
+---------------------+----------+------------+------------+--------------+
With load:
+--------------------------+----------+------------+------------+--------------+
|                          | Mean, ns | Mean Ratio | Median, ns | Median Ratio |
+--------------------------+----------+------------+------------+--------------+
| do_autovacuum_load       | 12673421 | 1.04       | 3109108    | 1.01         |
| do_autovacuum_load_patch | 12152590 | 0.96       | 3086644    | 0.99         |
+--------------------------+----------+------------+------------+--------------+
Acceleration by 3% without load and by 1-4% (depending on the metrics
being compared) under load.
> Did this patch help with the lock contention that you mentioned at
> the top?
Unfortunately, no; in my case the speedup was not enough, and
reconfiguring autovacuum helped instead.
However, I still think this patch makes sense. What do you think about it?
Best regards,
Rustam Khamidullin