Reducing planning time on tables with many indexes
Hi hackers,
We came across a planning slowdown in queries that use tables with many
indexes. In setups with wide tables it is not uncommon to have easily
10-100 indexes on a single table. The slowdown is already visible in serial
workloads with just a handful of indexes, but it gets drastically amplified
when running queries against tables with more indexes in parallel at high
throughput.
We measured the TPS and planning time of running parallel streams of simple
point look-up queries on a single empty table with 60 columns and 60
indexes. The query used is 'SELECT * FROM synth_table WHERE col5 = 42'. No
rows are returned because the table is empty. We used a machine with 64
physical CPU cores. The schema and sysbench script to reproduce these
numbers are attached. We used the TPS as reported by sysbench and obtained
planning time by running 'EXPLAIN ANALYZE' on the same query in a
separately opened connection. We averaged the planning time of 3 successive
'EXPLAIN ANALYZE' runs. sysbench ran on the same machine with varying
numbers of threads using the following command line:
sysbench repro.lua --db-driver=pgsql --pgsql-host=localhost
--pgsql-db=postgres --pgsql-port=? --pgsql-user=? --pgsql-password=?
--report-interval=1 --threads=64 run
The following table shows the results. It is clearly visible that TPS
already flattens out at 8 parallel streams, while the planning time
increases drastically.
Parallel streams | TPS (before) | Planning time (before)
-----------------|--------------|-----------------------
               1 |        5,486 | 0.13 ms
               2 |        8,098 | 0.22 ms
               4 |       15,013 | 0.19 ms
               8 |       27,107 | 0.29 ms
              16 |       30,938 | 0.43 ms
              32 |       26,330 | 1.68 ms
              64 |       24,314 | 2.48 ms
We tracked down the root cause of this slowdown to lock contention in
'get_relation_info()'. The index lock of every single index of every single
table used in that query is acquired. We attempted a fix by pre-filtering
out all indexes that anyways cannot be used with a certain query, without
taking the index locks (credits to Luc Vlaming for idea and
implementation). The patch does so by caching the columns present in every
index, inside 'struct Relation', similarly to 'rd_indexlist'. Then, before
opening (= locking) the indexes in 'get_relation_info()', we check if the
index can actually contribute to the query and if not it is discarded right
away. Caching the index info saves considerable work for every query run
subsequently, because fewer indexes must be inspected and thereby locked.
This way we also save cycles in any code that later on goes over all
relation indexes.
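In essence, the filtering boils down to the following condensed sketch
(illustrative only; the names are simplified and the actual patch code in
plancat.c and relcache.c differs in details):

#include "postgres.h"

#include "nodes/bitmapset.h"
#include "nodes/pathnodes.h"
#include "optimizer/optimizer.h"

/*
 * Collect the attribute numbers the relation is filtered on, taken from its
 * restriction clauses. The patch additionally looks at join clauses, query
 * pathkeys and equivalence classes.
 */
static Bitmapset *
used_attnos_for_rel(RelOptInfo *rel)
{
	Bitmapset  *used = NULL;
	ListCell   *lc;

	foreach(lc, rel->baserestrictinfo)
	{
		RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);

		pull_varattnos((Node *) rinfo->clause, rel->relid, &used);
	}
	return used;
}

/*
 * An index whose cached key columns (stored with the same attno offset that
 * pull_varattnos() uses, and including columns referenced by index
 * expressions and predicates) do not overlap the attributes used by the
 * query cannot help to filter rows, so we skip opening and locking it.
 */
static bool
index_can_contribute(Bitmapset *index_keys, Bitmapset *used_attnos)
{
	return bms_overlap(index_keys, used_attnos);
}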
The work-in-progress version of the patch is attached. It is still fairly
rough (e.g. uses a global variable, selects the best index in scans without
restrictions by column count instead of physical column size, is missing
some renaming, etc.), but shows the principle.
The following table shows the TPS, planning time and speed-ups after
applying the patch and rerunning the benchmark described above. Now, the
planning time remains roughly constant and TPS roughly doubles each time
the number of parallel streams is doubled. The higher the stream count, the
more severe the lock contention and the more pronounced the gained
speed-up. Interestingly, even for a single query stream the speed-up
in planning time is already very significant. This also holds for lower
index counts. For example, with just 10 indexes the TPS for a single query
stream goes from 9,159 to 12,558. We can do more measurements if there is
interest in details for a lower number of indexes.
Parallel streams | TPS (after) | Planning time (after) | Speed-up TPS | Speed-up planning
-----------------|-------------|-----------------------|--------------|------------------
               1 |      10,344 | 0.046 ms              | 1.9x         | 2.8x
               2 |      20,140 | 0.045 ms              | 2.5x         | 4.9x
               4 |      40,349 | 0.047 ms              | 2.7x         | 4.0x
               8 |      80,121 | 0.046 ms              | 3.0x         | 6.3x
              16 |     152,632 | 0.051 ms              | 4.9x         | 8.4x
              32 |     301,359 | 0.052 ms              | 11.4x        | 32.3x
              64 |     525,115 | 0.062 ms              | 21.6x        | 40.0x
We are happy to receive your feedback and polish up the patch.
--
David Geier
(ServiceNow)
Attachment: 0001-Index-filtering.patch (text/x-patch)
From 60e300b5cac8f527e61483296df81ab783670ac9 Mon Sep 17 00:00:00 2001
From: David Geier <david.geier@servicenow.com>
Date: Fri, 15 Jul 2022 11:53:27 +0200
Subject: [PATCH] Index filtering
---
src/backend/optimizer/path/Makefile | 3 +-
src/backend/optimizer/path/index_filtering.c | 220 ++++++++++++++++++
src/backend/optimizer/plan/planmain.c | 7 +
src/backend/optimizer/util/plancat.c | 124 ++++++----
src/backend/utils/cache/relcache.c | 17 +-
src/include/optimizer/planner_index_locking.h | 40 ++++
6 files changed, 361 insertions(+), 50 deletions(-)
create mode 100644 src/backend/optimizer/path/index_filtering.c
create mode 100644 src/include/optimizer/planner_index_locking.h
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 1e199ff66f..933c4df8f5 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -21,6 +21,7 @@ OBJS = \
joinpath.o \
joinrels.o \
pathkeys.o \
- tidpath.o
+ tidpath.o \
+ index_filtering.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/index_filtering.c b/src/backend/optimizer/path/index_filtering.c
new file mode 100644
index 0000000000..b9d05f85f2
--- /dev/null
+++ b/src/backend/optimizer/path/index_filtering.c
@@ -0,0 +1,220 @@
+#include <postgres.h> // NOTE: this include MUST come first
+
+#include <access/htup_details.h>
+#include <access/sysattr.h>
+#include <catalog/pg_am.h>
+#include <catalog/pg_class.h>
+#include <nodes/pathnodes.h>
+#include <optimizer/optimizer.h>
+#include <optimizer/planner_index_locking.h>
+#include <utils/builtins.h>
+#include <utils/catcache.h>
+#include <utils/rel.h>
+#include <utils/syscache.h>
+
+bool FilterIndexes = false;
+
+// similar to e.g. RelationGetPrimaryKeyIndex
+List *s64_RelationGetIndexBitmapList(Relation relation)
+{
+ if (!relation->rd_indexvalid)
+ list_free(RelationGetIndexList(relation));
+
+ return list_copy(relation->rd_indexprs);
+}
+
+IndexBitmapset *s64_RelationBuildIndexBitmapset(HeapTuple htup)
+{
+ Form_pg_index index = (Form_pg_index)GETSTRUCT(htup);
+ Bitmapset *keys = NULL;
+ IndexBitmapset *bitmap = NULL;
+ MemoryContext oldcxt;
+ int i;
+
+ for (i = 0; i < index->indnatts; ++i)
+ if (index->indkey.values[i] != 0)
+ keys = bms_add_member(keys,
+ index->indkey.values[i] - FirstLowInvalidHeapAttributeNumber);
+
+ // Find also all columns referenced by any expression
+ // From RelationGetIndexExpressions and inlined so that we don't need extra locks
+ if (!heap_attisnull(htup, Anum_pg_index_indexprs, NULL))
+ {
+ bool isNull;
+ Datum exprsDatum =
+ heap_getattr(htup, Anum_pg_index_indexprs, GetPgIndexDescriptor(), &isNull);
+ char *exprsString = TextDatumGetCString(exprsDatum);
+ Node *exprs = (Node *)stringToNode(exprsString);
+ pull_varattnos(exprs, 1, &keys);
+ pfree(exprsString);
+ pfree(exprs);
+ }
+
+ // Also find any column referenced by any predicate
+ if (!heap_attisnull(htup, Anum_pg_index_indpred, NULL))
+ {
+ bool isNull;
+ Datum predDatum =
+ heap_getattr(htup, Anum_pg_index_indpred, GetPgIndexDescriptor(), &isNull);
+ char *predString = TextDatumGetCString(predDatum);
+ Node *pred = (Node *)stringToNode(predString);
+ pull_varattnos(pred, 1, &keys);
+ pfree(predString);
+ pfree(pred);
+ }
+
+ oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
+ bitmap = palloc0(sizeof(IndexBitmapset));
+ bitmap->Index = index->indexrelid;
+ bitmap->Keys = bms_copy(keys);
+ MemoryContextSwitchTo(oldcxt);
+ bms_free(keys);
+ return bitmap;
+}
+
+// The best index for an index-only scan is the index that:
+// - is the smallest
+// - has all the fields so that it can actually be an index-only scan.
+Oid s64_RelationGetBestIndexForIndexOnly(Relation relation, Bitmapset *required)
+{
+ int smallest = INDEX_MAX_KEYS + 1;
+ Oid bestIndex = InvalidOid;
+ ListCell *lc;
+
+ if (!relation->rd_indexvalid)
+ list_free(RelationGetIndexList(relation));
+
+ foreach (lc, relation->rd_indexprs)
+ {
+ IndexBitmapset *bitmap = (IndexBitmapset *)lfirst(lc);
+
+ if (required == NULL || bms_is_subset(required, bitmap->Keys))
+ {
+ if (bms_num_members(bitmap->Keys) < smallest)
+ {
+ bestIndex = bitmap->Index;
+ smallest = bms_num_members(bitmap->Keys);
+ }
+ }
+ }
+
+ return bestIndex;
+}
+
+int list_bitmapset_qcmp(const void *p1, const void *p2)
+{
+ return list_bitmapset_cmp(*(ListCell **)p1, *(ListCell **)p2);
+}
+
+int list_bitmapset_cmp(const ListCell *p1, const ListCell *p2)
+{
+ const IndexBitmapset *v1 = (IndexBitmapset *)(lfirst(p1));
+ const IndexBitmapset *v2 = (IndexBitmapset *)(lfirst(p2));
+ if (v1->Index < v2->Index)
+ return -1;
+ if (v1->Index > v2->Index)
+ return 1;
+ return 0;
+}
+
+Bitmapset *s64_RelationUsedClauses(PlannerInfo *root, RelOptInfo *rel)
+{
+ ListCell *lc1 = NULL;
+ ListCell *lc2 = NULL;
+ Bitmapset *used = NULL;
+
+ // All columns we filter on
+ foreach (lc1, rel->baserestrictinfo)
+ {
+ RestrictInfo *rinfo = (RestrictInfo *)lfirst(lc1);
+ pull_varattnos((Node *)(rinfo->clause), rel->relid, &used);
+ }
+
+ // All columns explicitly used in a join
+ foreach (lc1, rel->joininfo)
+ {
+ RestrictInfo *rinfo = (RestrictInfo *)lfirst(lc1);
+ pull_varattnos((Node *)(rinfo->clause), rel->relid, &used);
+ }
+
+ // All columns used in any ordering.
+ // Query pathkeys contains any useful ordering for query_planner, so we don't have to figure
+ // out which ordering would be most useful (grouping, window, distinct, sort, etc).
+ foreach (lc1, root->query_pathkeys)
+ {
+ PathKey *pathKey = (PathKey *)lfirst(lc1);
+ foreach (lc2, pathKey->pk_eclass->ec_members)
+ {
+ EquivalenceMember *ecMember = (EquivalenceMember *)lfirst(lc2);
+ if (bms_is_member(rel->relid, ecMember->em_relids))
+ pull_varattnos((Node *)(ecMember->em_expr), rel->relid, &used);
+ }
+ }
+
+ // All useful equivalences there might be, can be used e.g. when you have an index
+ // which can return ordered data for a join indirectly (because the data is somehow
+ // ordered the same way, e.g. because of a PK)
+ foreach (lc1, root->eq_classes)
+ {
+ EquivalenceClass *ec = (EquivalenceClass *)lfirst(lc1);
+ foreach (lc2, ec->ec_members)
+ {
+ EquivalenceMember *ecMember = (EquivalenceMember *)lfirst(lc2);
+ if (bms_is_member(rel->relid, ecMember->em_relids))
+ pull_varattnos((Node *)(ecMember->em_expr), rel->relid, &used);
+ }
+ }
+
+ return used;
+}
+
+Bitmapset *s64_RelationUsedTList(PlannerInfo *root, RelOptInfo *rel)
+{
+ Bitmapset *used = NULL;
+
+ pull_varattnos((Node *)(rel->reltarget->exprs), rel->relid, &used);
+ return used;
+}
+
+// Any index that cannot provide a reduction of the data read is unnecessary.
+// An index can filter out (enough) data when:
+// 1. the columns used in the clauses that the index covers can filter
+// out enough rows to not hit all pages of the table
+// 2. the index has all fields required to allow an index-only scan,
+// the data is all-visible, and we want actual fields.
+// 3. we don't want any fields and this is the smallest index.
+// this is used in a `SELECT COUNT(*) FROM`
+// 4. the index can provide the data ordered on a column that is useful
+bool s64_IsUnnecessaryIndex(PlannerInfo *root,
+ IndexBitmapset *bitmap,
+ Bitmapset *clauses,
+ Bitmapset *all,
+ Oid smallestIndex)
+{
+ if (!FilterIndexes)
+ return false;
+
+ // for safety only do this on SELECT statements
+ if (root->parse->commandType != CMD_SELECT)
+ return false;
+
+ // approximation of case 1 and 4. doesn't (yet) take into account visibility fraction
+ if (bms_overlap(clauses, bitmap->Keys))
+ return false;
+
+ if (!bms_is_empty(all))
+ {
+ // case 2.
+ if (bms_is_subset(all, bitmap->Keys))
+ return false;
+ }
+ else
+ {
+ // case 3
+ if (bitmap->Index == smallestIndex)
+ return false;
+ }
+
+ // TestCountIndexesFiltered++;
+ return true;
+}
diff --git a/src/backend/optimizer/plan/planmain.c b/src/backend/optimizer/plan/planmain.c
index 273ac0acf7..370711ca50 100644
--- a/src/backend/optimizer/plan/planmain.c
+++ b/src/backend/optimizer/plan/planmain.c
@@ -29,6 +29,7 @@
#include "optimizer/paths.h"
#include "optimizer/placeholder.h"
#include "optimizer/planmain.h"
+#include "optimizer/planner_index_locking.h"
/*
@@ -163,6 +164,7 @@ query_planner(PlannerInfo *root,
* for rels not actively part of the query, for example views. We don't
* want to make RelOptInfos for them.
*/
+ FilterIndexes = true;
add_base_rels_to_query(root, (Node *) parse->jointree);
/*
@@ -213,6 +215,11 @@ query_planner(PlannerInfo *root,
*/
fix_placeholder_input_needed_levels(root);
+ /* add all required indexes for the baserels, so that the following functions have the required indexes:
+ * reduce_unique_semijoins, remove_useless_joins. This is therefore the latest we can do this. */
+ s64_add_indexes(root);
+ FilterIndexes = false;
+
/*
* Remove any useless outer joins. Ideally this would be done during
* jointree preprocessing, but the necessary information isn't available
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index c5194fdbbf..225b9f9a8e 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -53,6 +53,8 @@
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+#include "optimizer/planner_index_locking.h"
+
/* GUC parameter */
int constraint_exclusion = CONSTRAINT_EXCLUSION_PARTITION;
@@ -113,10 +115,7 @@ void
get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
RelOptInfo *rel)
{
- Index varno = rel->relid;
Relation relation;
- bool hasindex;
- List *indexinfos = NIL;
/*
* We need not lock the relation since it was already locked, either by
@@ -150,9 +149,60 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
estimate_rel_size(relation, rel->attr_widths - rel->min_attr,
&rel->pages, &rel->tuples, &rel->allvisfrac);
+ if (!FilterIndexes)
+ s64_add_indexes_for_rel(root, relation->rd_id, inhparent, rel);
+
/* Retrieve the parallel_workers reloption, or -1 if not set. */
rel->rel_parallel_workers = RelationGetParallelWorkers(relation, -1);
+ rel->statlist = get_relation_statistics(rel, relation);
+
+ /* Grab foreign-table info using the relcache, while we have it */
+ if (relation->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
+ {
+ rel->serverid = GetForeignServerIdByRelId(RelationGetRelid(relation));
+ rel->fdwroutine = GetFdwRoutineForRelation(relation, true);
+ }
+ else
+ {
+ rel->serverid = InvalidOid;
+ rel->fdwroutine = NULL;
+ }
+
+ /* Collect info about relation's foreign keys, if relevant */
+ get_relation_foreign_keys(root, rel, relation, inhparent);
+
+ /* Collect info about functions implemented by the rel's table AM. */
+ if (relation->rd_tableam &&
+ relation->rd_tableam->scan_set_tidrange != NULL &&
+ relation->rd_tableam->scan_getnextslot_tidrange != NULL)
+ rel->amflags |= AMFLAG_HAS_TID_RANGE;
+
+ /*
+ * Collect info about relation's partitioning scheme, if any. Only
+ * inheritance parents may be partitioned.
+ */
+ if (inhparent && relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+ set_relation_partition_info(root, rel, relation);
+
+ table_close(relation, NoLock);
+
+ /*
+ * Allow a plugin to editorialize on the info we obtained from the
+ * catalogs. Actions might include altering the assumed relation size,
+ * removing an index, or adding a hypothetical index to the indexlist.
+ */
+ if (get_relation_info_hook)
+ (*get_relation_info_hook) (root, relationObjectId, inhparent, rel);
+}
+
+void s64_add_indexes_for_rel(PlannerInfo* root, Oid relationObjectId, bool inhparent, RelOptInfo *rel)
+{
+ Relation relation = table_open(relationObjectId, NoLock);
+ List *indexinfos = NIL;
+ Index varno = rel->relid;
+ bool hasindex;
+
/*
* Make list of indexes. Ignore indexes on system catalogs if told to.
* Don't bother with indexes for an inheritance parent, either.
@@ -165,11 +215,15 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
if (hasindex)
{
- List *indexoidlist;
+ List *index_bitmaps;
LOCKMODE lmode;
ListCell *l;
+ Bitmapset* requiredTList = s64_RelationUsedTList(root, rel);
+ Bitmapset* requiredClauses = s64_RelationUsedClauses(root, rel);
+ Bitmapset* required = bms_union(requiredTList, requiredClauses);
+ Oid smallestIndex = s64_RelationGetBestIndexForIndexOnly(relation, required);
- indexoidlist = RelationGetIndexList(relation);
+ index_bitmaps = s64_RelationGetIndexBitmapList(relation);
/*
* For each index, we get the same type of lock that the executor will
@@ -181,9 +235,10 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
*/
lmode = root->simple_rte_array[varno]->rellockmode;
- foreach(l, indexoidlist)
+ foreach(l, index_bitmaps)
{
- Oid indexoid = lfirst_oid(l);
+ IndexBitmapset* bitmap = (IndexBitmapset*) lfirst(l);
+ Oid indexoid = bitmap->Index;
Relation indexRelation;
Form_pg_index index;
IndexAmRoutine *amroutine;
@@ -192,6 +247,9 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
nkeycolumns;
int i;
+ if (s64_IsUnnecessaryIndex(root, bitmap, requiredClauses, required, smallestIndex))
+ continue;
+
/*
* Extract info from the relation descriptor for the index.
*/
@@ -436,51 +494,27 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
indexinfos = lcons(info, indexinfos);
}
- list_free(indexoidlist);
+ list_free(index_bitmaps);
+ bms_free(requiredTList);
+ bms_free(requiredClauses);
+ bms_free(required);
}
rel->indexlist = indexinfos;
- rel->statlist = get_relation_statistics(rel, relation);
+ table_close(relation, NoLock);
+}
- /* Grab foreign-table info using the relcache, while we have it */
- if (relation->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
- {
- rel->serverid = GetForeignServerIdByRelId(RelationGetRelid(relation));
- rel->fdwroutine = GetFdwRoutineForRelation(relation, true);
- }
- else
+void s64_add_indexes(PlannerInfo *root)
+{
+ for (int i = 0; i < root->simple_rel_array_size; ++i)
{
- rel->serverid = InvalidOid;
- rel->fdwroutine = NULL;
+ RangeTblEntry* rte = root->simple_rte_array[i];
+ RelOptInfo* rel = root->simple_rel_array[i];
+ if (rel != NULL && rte->rtekind == RTE_RELATION && rel->indexlist == NULL)
+ s64_add_indexes_for_rel(root, rte->relid, rte->inh, rel);
}
-
- /* Collect info about relation's foreign keys, if relevant */
- get_relation_foreign_keys(root, rel, relation, inhparent);
-
- /* Collect info about functions implemented by the rel's table AM. */
- if (relation->rd_tableam &&
- relation->rd_tableam->scan_set_tidrange != NULL &&
- relation->rd_tableam->scan_getnextslot_tidrange != NULL)
- rel->amflags |= AMFLAG_HAS_TID_RANGE;
-
- /*
- * Collect info about relation's partitioning scheme, if any. Only
- * inheritance parents may be partitioned.
- */
- if (inhparent && relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
- set_relation_partition_info(root, rel, relation);
-
- table_close(relation, NoLock);
-
- /*
- * Allow a plugin to editorialize on the info we obtained from the
- * catalogs. Actions might include altering the assumed relation size,
- * removing an index, or adding a hypothetical index to the indexlist.
- */
- if (get_relation_info_hook)
- (*get_relation_info_hook) (root, relationObjectId, inhparent, rel);
-}
+ }
/*
* get_relation_foreign_keys -
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index dcf56d4790..cc8b1c3cab 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -86,6 +86,7 @@
#include "utils/resowner_private.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
+#include "optimizer/planner_index_locking.h"
#define RELCACHE_INIT_FILEMAGIC 0x573266 /* version ID value */
@@ -290,7 +291,6 @@ static void write_item(const void *data, Size len, FILE *fp);
static void formrdesc(const char *relationName, Oid relationReltype,
bool isshared, int natts, const FormData_pg_attribute *attrs);
-static HeapTuple ScanPgRelation(Oid targetRelId, bool indexOK, bool force_non_historic);
static Relation AllocateRelationDesc(Form_pg_class relp);
static void RelationParseRelOptions(Relation relation, HeapTuple tuple);
static void RelationBuildTupleDesc(Relation relation);
@@ -298,7 +298,6 @@ static Relation RelationBuildDesc(Oid targetRelId, bool insertIt);
static void RelationInitPhysicalAddr(Relation relation);
static void load_critical_index(Oid indexoid, Oid heapoid);
static TupleDesc GetPgClassDescriptor(void);
-static TupleDesc GetPgIndexDescriptor(void);
static void AttrDefaultFetch(Relation relation, int ndef);
static int AttrDefaultCmp(const void *a, const void *b);
static void CheckConstraintFetch(Relation relation);
@@ -330,7 +329,7 @@ static void unlink_initfile(const char *initfilename, int elevel);
* NB: the returned tuple has been copied into palloc'd storage
* and must eventually be freed with heap_freetuple.
*/
-static HeapTuple
+HeapTuple
ScanPgRelation(Oid targetRelId, bool indexOK, bool force_non_historic)
{
HeapTuple pg_class_tuple;
@@ -4334,7 +4333,7 @@ GetPgClassDescriptor(void)
return pgclassdesc;
}
-static TupleDesc
+TupleDesc
GetPgIndexDescriptor(void)
{
static TupleDesc pgindexdesc = NULL;
@@ -4680,6 +4679,7 @@ RelationGetIndexList(Relation relation)
ScanKeyData skey;
HeapTuple htup;
List *result;
+ List *bitmaps;
List *oldlist;
char replident = relation->rd_rel->relreplident;
Oid pkeyIndex = InvalidOid;
@@ -4697,6 +4697,7 @@ RelationGetIndexList(Relation relation)
* if we get some sort of error partway through.
*/
result = NIL;
+ bitmaps = NIL;
/* Prepare to scan pg_index for entries having indrelid = this rel. */
ScanKeyInit(&skey,
@@ -4724,6 +4725,10 @@ RelationGetIndexList(Relation relation)
/* add index's OID to result list */
result = lappend_oid(result, index->indexrelid);
+ /* build bitmap and add to bitmaps list, but only for valid indexes */
+ if (index->indisvalid)
+ bitmaps = lappend(bitmaps, s64_RelationBuildIndexBitmapset(htup));
+
/*
* Invalid, non-unique, non-immediate or predicate indexes aren't
* interesting for either oid indexes or replication identity indexes,
@@ -4750,10 +4755,14 @@ RelationGetIndexList(Relation relation)
/* Sort the result list into OID order, per API spec. */
list_sort(result, list_oid_cmp);
+ /* sort the bitmaps into OID order, per API spec. */
+ list_sort(bitmaps, list_bitmapset_cmp);
+
/* Now save a copy of the completed list in the relcache entry. */
oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
oldlist = relation->rd_indexlist;
relation->rd_indexlist = list_copy(result);
+ relation->rd_indexprs = list_copy(bitmaps);
relation->rd_pkindex = pkeyIndex;
if (replident == REPLICA_IDENTITY_DEFAULT && OidIsValid(pkeyIndex))
relation->rd_replidindex = pkeyIndex;
diff --git a/src/include/optimizer/planner_index_locking.h b/src/include/optimizer/planner_index_locking.h
new file mode 100644
index 0000000000..27adec56d7
--- /dev/null
+++ b/src/include/optimizer/planner_index_locking.h
@@ -0,0 +1,40 @@
+#pragma once
+
+#include <postgres.h>
+
+#include <nodes/plannodes.h>
+#include <optimizer/paths.h>
+#include <utils/rel.h>
+
+extern bool FilterIndexes;
+
+// introduces a bitmap that caches which index covers which columns, so that we can filter the
+// possible indexes we should add to any RelOptInfo* to only be the ones that have any overlap with
+// the requested columns for a relation.
+
+extern TupleDesc GetPgIndexDescriptor(void);
+extern HeapTuple ScanPgRelation(Oid targetRelId, bool indexOK, bool force_non_historic);
+extern List *s64_RelationGetIndexBitmapList(Relation relation);
+extern Oid s64_RelationGetBestIndexForIndexOnly(Relation relation, Bitmapset *required);
+extern void s64_add_indexes(PlannerInfo *root);
+extern void s64_add_indexes_for_rel(PlannerInfo *root,
+ Oid relationObjectId,
+ bool inhparent,
+ RelOptInfo *rel);
+
+typedef struct IndexBitmapset
+{
+ Oid Index;
+ Bitmapset *Keys;
+} IndexBitmapset;
+extern IndexBitmapset *s64_RelationBuildIndexBitmapset(HeapTuple htup);
+extern bool s64_IsUnnecessaryIndex(PlannerInfo *root,
+ IndexBitmapset *bitmap,
+ Bitmapset *clauses,
+ Bitmapset *all,
+ Oid smallestIndex);
+
+extern int list_bitmapset_qcmp(const void *p1, const void *p2);
+extern int list_bitmapset_cmp(const ListCell *p1, const ListCell *p2);
+extern Bitmapset *s64_RelationUsedTList(PlannerInfo *root, RelOptInfo *rel);
+extern Bitmapset *s64_RelationUsedClauses(PlannerInfo *root, RelOptInfo *rel);
--
2.34.1
David Geier <geidav.pg@gmail.com> writes:
We tracked down the root cause of this slowdown to lock contention in
'get_relation_info()'. The index lock of every single index of every single
table used in that query is acquired. We attempted a fix by pre-filtering
out all indexes that anyways cannot be used with a certain query, without
taking the index locks (credits to Luc Vlaming for idea and
implementation). The patch does so by caching the columns present in every
index, inside 'struct Relation', similarly to 'rd_indexlist'.
I wonder how much thought you gave to the costs imposed by that extra
cache space. We have a lot of users who moan about relcache bloat
already. But more to the point, I do not buy the assumption that
an index's set of columns is a good filter for which indexes are of
interest. A trivial counterexample from the regression database is
regression=# explain select count(*) from tenk1;
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Aggregate  (cost=219.28..219.29 rows=1 width=8)
   ->  Index Only Scan using tenk1_hundred on tenk1  (cost=0.29..194.28 rows=10000 width=0)
(2 rows)
It looks to me like the patch also makes unwarranted assumptions about
being able to discard all but the smallest index having a given set
of columns. This would, for example, possibly lead to dropping the
index that has the most useful sort order, or that has the operator
class needed to support a specific WHERE clause.
In short, I'm not sure I buy this concept at all. I think it might
be more useful to attack the locking overhead more directly. I kind
of wonder why we need per-index locks at all during planning ---
I think that we already interpret AccessShareLock on the parent table
as being sufficient to block schema changes on existing indexes.
Unfortunately, as things stand today, the planner needs more than the
right to look at the indexes' schemas, because it makes physical accesses
to btree indexes to find out their tree height (and I think there are some
comparable behaviors in other AMs). I've never particularly cared for
that implementation, and would be glad to rip out that behavior if we can
find another way. Maybe we could persuade VACUUM or ANALYZE to store that
info in the index's pg_index row, or some such, and then the planner
could use it with no lock?
regards, tom lane
I wrote:
Unfortunately, as things stand today, the planner needs more than the
right to look at the indexes' schemas, because it makes physical accesses
to btree indexes to find out their tree height (and I think there are some
comparable behaviors in other AMs). I've never particularly cared for
that implementation, and would be glad to rip out that behavior if we can
find another way. Maybe we could persuade VACUUM or ANALYZE to store that
info in the index's pg_index row, or some such, and then the planner
could use it with no lock?
A first step here could just be to postpone fetching _bt_getrootheight()
until we actually need it during cost estimation. That would avoid the
need to do it at all for indexes that indxpath.c discards as irrelevant,
which is a decision made on considerably more information than the
proposed patch uses.
Having done that, you could look into revising plancat.c to fill the
IndexOptInfo structs from catcache entries instead of opening the
index per se. (You'd have to also make sure that the appropriate
index locks are acquired eventually, for indexes the query does use
at runtime. I think that's the case, but I'm not sure if anything
downstream has been optimized on the assumption the planner did it.)
This'd probably get us a large part of the way there. Further
optimization of acquisition of tree height etc could be an
optional follow-up.
regards, tom lane
Hi Tom,
On Wed, Jul 27, 2022 at 7:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:
Unfortunately, as things stand today, the planner needs more than the
right to look at the indexes' schemas, because it makes physical accesses
to btree indexes to find out their tree height (and I think there are some
comparable behaviors in other AMs). I've never particularly cared for
that implementation, and would be glad to rip out that behavior if we can
find another way. Maybe we could persuade VACUUM or ANALYZE to store that
info in the index's pg_index row, or some such, and then the planner
could use it with no lock?
It seems like _bt_getrootheight() first checks if the height is cached, and
only if it isn't does it access the index meta pages.
If the index locks are only taken for the sake of _bt_getrootheight()
accessing index meta pages in case they are not cached, maybe the index
locks could be taken conditionally.
However, postponing the call to where it is really needed sounds even
better.
A first step here could just be to postpone fetching _bt_getrootheight()
until we actually need it during cost estimation. That would avoid the
need to do it at all for indexes that indxpath.c discards as irrelevant,
which is a decision made on considerably more information than the
proposed patch uses.
Having done that, you could look into revising plancat.c to fill the
IndexOptInfo structs from catcache entries instead of opening the
index per se. (You'd have to also make sure that the appropriate
index locks are acquired eventually, for indexes the query does use
at runtime. I think that's the case, but I'm not sure if anything
downstream has been optimized on the assumption the planner did it.)
I can give this a try.
That way we would get rid of the scalability issues.
However, what about the runtime savings observed with a single query stream?
In that case there is no contention, so it seems like having fewer indexes
to look at further down the road also yields substantial savings.
Any clue where exactly these savings might come from? Or is it actually
the calls to _bt_getrootheight()? I can also do a few perf runs to track
that down.
This'd probably get us a large part of the way there. Further
optimization of acquisition of tree height etc could be an
optional follow-up.
regards, tom lane
Hi Tom,
On Wed, Jul 27, 2022 at 6:39 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
David Geier <geidav.pg@gmail.com> writes:
We tracked down the root cause of this slowdown to lock contention in
'get_relation_info()'. The index lock of every single index of every single
table used in that query is acquired. We attempted a fix by pre-filtering
out all indexes that anyways cannot be used with a certain query, without
taking the index locks (credits to Luc Vlaming for idea and
implementation). The patch does so by caching the columns present inevery
index, inside 'struct Relation', similarly to 'rd_indexlist'.
I wonder how much thought you gave to the costs imposed by that extra
cache space. We have a lot of users who moan about relcache bloat
already.
The current representation could be compacted (e.g. by storing the column
indexes as 16-bit integers, instead of using a Bitmapset). That should make
the additional space needed negligible compared to the current size of
RelationData. On top of that we could maybe reorder the members of
RelationData to save padding bytes. Currently, there is lots of
interleaving of members of different sizes. Even when taking cache locality
into consideration it looks like a fair amount could be saved, probably
accounting for the additional space needed to store the index column data.
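Purely for illustration (names invented for this sketch, not part of the
attached patch), a compacted per-index cache entry could look roughly like
this:

#include "postgres.h"

#include "access/attnum.h"

typedef struct CompactIndexColumns
{
	Oid			indexoid;		/* OID of the index */
	uint16		ncolumns;		/* number of referenced columns */
	AttrNumber	attnos[FLEXIBLE_ARRAY_MEMBER];	/* sorted 16-bit column numbers */
} CompactIndexColumns;

/* Membership test via binary search over the sorted attribute numbers. */
static bool
compact_index_has_column(const CompactIndexColumns *cols, AttrNumber attno)
{
	int			lo = 0;
	int			hi = (int) cols->ncolumns - 1;

	while (lo <= hi)
	{
		int			mid = (lo + hi) / 2;

		if (cols->attnos[mid] == attno)
			return true;
		if (cols->attnos[mid] < attno)
			lo = mid + 1;
		else
			hi = mid - 1;
	}
	return false;
}

The lookup is a bit more expensive than a Bitmapset membership test, but the
per-index footprint is bounded by 2 bytes per referenced column.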
But more to the point, I do not buy the assumption that
an index's set of columns is a good filter for which indexes are of
interest. A trivial counterexample from the regression database is

regression=# explain select count(*) from tenk1;
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Aggregate  (cost=219.28..219.29 rows=1 width=8)
   ->  Index Only Scan using tenk1_hundred on tenk1  (cost=0.29..194.28 rows=10000 width=0)
(2 rows)

Only for queries without index conditions, the current version of the
patch simply discards all indexes but the one with the least columns. This
is case (3) in s64_IsUnnecessaryIndex(). This is an over-simplification
with the goal of picking the index which yields least I/O. For single
column indexes that works, but it can fall short for multi-column indexes
(e.g. an [INT, TEXT] index and an [INT, INT] index both have 2 columns, but the
latter would be better suited when there's no other index and we want to
read the first integer column). What we should do here instead is to
discard indexes based on storage size.
It looks to me like the patch also makes unwarranted assumptions about
being able to discard all but the smallest index having a given set
of columns. This would, for example, possibly lead to dropping the
index that has the most useful sort order, or that has the operator
class needed to support a specific WHERE clause.
Why would that be? If we keep all indexes that contain columns that are
used by the query, we also keep the indexes which provide a certain sort
order or operator class.
In short, I'm not sure I buy this concept at all. I think it might
be more useful to attack the locking overhead more directly. I kind
of wonder why we need per-index locks at all during planning ---
I think that we already interpret AccessShareLock on the parent table
as being sufficient to block schema changes on existing indexes.
As I said in the reply to your other mail, there are huge savings also in
the serial case where lock contention is not an issue. It seems like
pre-filtering saves work down the road. I'll do some perf runs to track
down where exactly the savings come from. One source I can think of is only
having to consider a subset of all indexes during path creation.
Unfortunately, as things stand today, the planner needs more than the
right to look at the indexes' schemas, because it makes physical accesses
to btree indexes to find out their tree height (and I think there are some
comparable behaviors in other AMs). I've never particularly cared for
that implementation, and would be glad to rip out that behavior if we can
find another way. Maybe we could persuade VACUUM or ANALYZE to store that
info in the index's pg_index row, or some such, and then the planner
could use it with no lock?
That's another interesting approach, but I would love to save the planner
cycles for the sequential case.
--
David Geier
(ServiceNow)
On 27.07.22, 18:39, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
David Geier <geidav.pg@gmail.com> writes:
We tracked down the root cause of this slowdown to lock contention in
'get_relation_info()'. The index lock of every single index of every single
table used in that query is acquired. We attempted a fix by pre-filtering
out all indexes that anyways cannot be used with a certain query, without
taking the index locks (credits to Luc Vlaming for idea and
implementation). The patch does so by caching the columns present in every
index, inside 'struct Relation', similarly to 'rd_indexlist'.
I wonder how much thought you gave to the costs imposed by that extra
cache space. We have a lot of users who moan about relcache bloat
already. But more to the point, I do not buy the assumption that
an index's set of columns is a good filter for which indexes are of
interest. A trivial counterexample from the regression database is
regression=# explain select count(*) from tenk1;
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Aggregate  (cost=219.28..219.29 rows=1 width=8)
   ->  Index Only Scan using tenk1_hundred on tenk1  (cost=0.29..194.28 rows=10000 width=0)
(2 rows)
It looks to me like the patch also makes unwarranted assumptions about
being able to discard all but the smallest index having a given set
of columns. This would, for example, possibly lead to dropping the
index that has the most useful sort order, or that has the operator
class needed to support a specific WHERE clause.
Thanks for checking out the patch!
Just to make sure we're on the same page: we're only making this assumption if you select no fields at all.
If you select any fields at all it will check for column overlap, and if there's any overlap with any referenced field,
then the index will not be filtered out.
For producing a row count with no referenced fields it is true that it should select the truly cheapest
index to produce the row count and there should be some Index-am callback introduced for that.
For now it was just a quick-and-dirty solution.
Wouldn't a callback that would estimate the amount of data read be good enough though?
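Purely hypothetically (no such member exists in IndexAmRoutine today; the
name and signature are made up for illustration), such a callback could
look like:

#include "postgres.h"

#include "storage/block.h"

/*
 * Hypothetical index-AM callback: estimate, from catalog information only,
 * how many pages would have to be read to produce a plain row count for the
 * given index, so the planner could pick the cheapest index for a
 * SELECT COUNT(*) without opening and locking each candidate.
 */
typedef BlockNumber (*amestimatecountpages_function) (Oid indexoid);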
For sort orders the field to sort by should be listed and hence the index should not be filtered out,
or what am I missing? Likely I've missed some fields that are referenced somehow (potentially indirectly),
but that shouldn't disqualify the approach completely.
In short, I'm not sure I buy this concept at all. I think it might
be more useful to attack the locking overhead more directly. I kind
of wonder why we need per-index locks at all during planning ---
I think that we already interpret AccessShareLock on the parent table
as being sufficient to block schema changes on existing indexes.
Could you elaborate as to why this approach is not good enough? To me it seems that avoiding work
ahead of time is generally useful. Or are you worried that we remove too much?
Unfortunately, as things stand today, the planner needs more than the
right to look at the indexes' schemas, because it makes physical accesses
to btree indexes to find out their tree height (and I think there are some
comparable behaviors in other AMs). I've never particularly cared for
that implementation, and would be glad to rip out that behavior if we can
find another way. Maybe we could persuade VACUUM or ANALYZE to store that
info in the index's pg_index row, or some such, and then the planner
could use it with no lock?
regards, tom lane
The thing you're touching on is specific to btree. Not sure this generalizes to all index types that
are out there though? I could see there being some property that allows you to be "no-lock",
and then a callback that allows you to cache some generic data that can be transformed
when the indexopt info structs are filled. Is that roughly what you have in mind?
Best,
Luc
On 8/1/22 15:33, David Geier wrote:
Hi Tom,
On Wed, Jul 27, 2022 at 7:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:
Unfortunately, as things stand today, the planner needs more than the
right to look at the indexes' schemas, because it makes physical accesses
to btree indexes to find out their tree height (and I think there are some
comparable behaviors in other AMs). I've never particularly cared for
that implementation, and would be glad to rip out that behavior if we can
find another way. Maybe we could persuade VACUUM or ANALYZE to store that
info in the index's pg_index row, or some such, and then the planner
could use it with no lock?

It seems like _bt_getrootheight() first checks if the height is cached
and only if it isn't it accesses index meta pages.
If the index locks are only taken for the sake of _bt_getrootheight()
accessing index meta pages in case they are not cached, maybe the
index locks could be taken conditionally.
However, postponing the call to where it is really needed sounds even
better.

A first step here could just be to postpone fetching _bt_getrootheight()
until we actually need it during cost estimation. That would avoid the
need to do it at all for indexes that indxpath.c discards as irrelevant,
which is a decision made on considerably more information than the
proposed patch uses.
Hi Tom,
I gave the idea of moving _bt_getrootheight() into costsize.c and
filling IndexOptInfo in get_relation_info() via syscache instead of
relcache a try, but didn't get very far.
Moving out _bt_getrootheight() was straightforward, and we should do that
regardless. However, it seems like get_relation_info() strongly
depends on the index's Relation for quite a few things. A fair number of
fields I could actually fill from syscache, but there are some that
either need data not stored in syscache (e.g. estimate_rel_size(),
Relation::rd_smgr needed by RelationGetNumberOfBlocksInFork()) or need
fields that are cached in the index's Relation and would have to be
recomputed otherwise (e.g. Relation::rd_indexprs filled by
RelationGetIndexExpressions(), Relation::rd_indpred filled by
RelationGetIndexPredicate()). Even if we could somehow obtain the
missing info from somewhere, recomputing the otherwise cached fields
from Relation would likely cause a significant slowdown in the serial case.
Beyond that I did some off-CPU profiling to precisely track down which
lock serializes execution. It turned out to be the MyProc::fpInfoLock
lightweight lock. This lock is used in the fast path of the heavyweight
lock. In the contenting case, fpInfoLock is acquired in LW_EXCLUSIVE
mode to (1) check if there is no other process holding a stronger lock,
and if not, to reserve a process local fast path lock slot and (2) to
return the fast path lock slots all in one go. To do so, the current
implementation always linearly iterates over all lock slots. The
corresponding call stacks are:
get_relation_info()            CommitTransaction()
  index_open()                   ResourceOwnerRelease()
    relation_open()                ResourceOwnerReleaseInternal()
      LockRelationOid()              ProcReleaseLocks()
        LockAcquireExtended()          LockReleaseAll() <-- called twice from ProcReleaseLocks()
          LWLockAcquire()
On top of that there are only 16 fast path lock slots. One slot is
always taken up by the parent relation, leaving only 15 slots for the
indexes. As soon as a session process runs out of slots, it falls back
to the normal lock path which has to mess around with the lock table. To
do so it also acquires a lightweight lock in LW_EXCLUSIVE mode. This
lightweight lock however is partitioned and therefore does not contend.
Hence, normal lock acquisition is slower but contends less.
To prove above findings I increased the number of fast path lock slots
per connection and optimized FastPathGrantRelationLock() and
FastPathUnGrantRelationLock(). With these changes the lock contention
disappeared and the workload scales linearly (the code I tested with
also included moving out _bt_getrootheight()):
| Parallel streams | TPS     | TPS / stream |
|------------------|---------|--------------|
| 1                | 5,253   | 5,253        |
| 10               | 51,406  | 5,140        |
| 20               | 101,401 | 5,070        |
| 30               | 152,023 | 5,067        |
| 40               | 200,607 | 5,015        |
| 50               | 245,359 | 4,907        |
| 60               | 302,994 | 5,049        |
However, with the very same setup, the index filtering approach yields
486k TPS with 60 streams and 9,827 TPS with a single stream. The single
stream number shows that this is not because it scales even better, but
just because less work is spent during planning. A quick perf session
showed that a fair amount of time is spent getting the relation sizes in
blocks (RelationGetNumberOfBlocksInFork() -> lseek64()) and creating
index paths (pull_varattnos() -> bms_add_member(), surprisingly).
- 32.20%  1.58%  postgres  postgres  [.] get_relation_info
- 30.62% get_relation_info
- 16.56% RelationGetNumberOfBlocksInFork
- 16.42% smgrnblocks
- 16.25% mdnblocks
- 16.10% _mdnblocks
+ 15.55% __libc_lseek64
+ 5.83% index_open
+ 2.71% estimate_rel_size
1.56% build_index_tlist
+ 1.22% palloc
+ 1.57% __libc_start_main
- 23.02%  0.03%  postgres  postgres  [.] make_one_rel
- 22.99% make_one_rel
- 22.01% set_base_rel_pathlists
- 21.99% set_rel_pathlist
- 21.89% set_plain_rel_pathlist
- 21.53% create_index_paths
- 18.76% get_index_paths
- 18.33% build_index_paths
- 15.77% check_index_only
- 14.75% pull_varattnos
- 14.58% pull_varattnos_walker
- 13.05% expression_tree_walker
- 9.50% pull_varattnos_walker
5.77% bms_add_member
0.93% bms_add_member
0.52% expression_tree_walker
1.44% pull_varattnos_walker
+ 1.79% create_index_path
+ 0.90% match_restriction_clauses_to_index
+ 0.95% set_base_rel_sizes
Given the findings above, the two patches are actually complementary.
Optimizing the lock fast path not only helps when many indexes exist and
only a small subset is used, but whenever there are many locks used by a
query. The index filtering is another way to reduce lock contention, but
beyond that also greatly reduces the time spent on planning in the
serial case.
I have attached the patch to improve the heavyweight lock fast path. It
also for now contains moving out _bt_getrootheight(). For workloads
where the same set of locks is used over and over again, it only needs
on average a single loop iteration to find the relation (instead of a
linear scan before). This allows to increase the number of fast path
locks by a lot. In this patch I increased them from 16 to 64. The code
can be further improved for cases where the set of relations to be locked
changes frequently, and therefore the chance of not finding a relation and
having to linearly search the whole array is higher.
I would really appreciate your feedback, Tom, also on the questions
around the approach of filtering out indexes discussed in the previous mails.
--
David Geier
(ServiceNow)
Attachment: 0001-Improve-heavyweight-lock-fast-path.patch (text/x-patch)
From 0b03fc09570635cd00248a5f4e3ab60e39b917fb Mon Sep 17 00:00:00 2001
From: David Geier <david.geier@servicenow.com>
Date: Sun, 14 Aug 2022 20:17:54 +0000
Subject: [PATCH] Improve heavyweight lock fast path
---
src/backend/optimizer/util/plancat.c | 9 ----
src/backend/storage/lmgr/lock.c | 71 +++++++++++++++-------------
src/backend/utils/adt/selfuncs.c | 8 ++++
src/include/storage/proc.h | 4 +-
4 files changed, 48 insertions(+), 44 deletions(-)
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 5012bfe142..dfe99ba93b 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -414,16 +414,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->tuples = rel->tuples;
}
- if (info->relam == BTREE_AM_OID)
- {
- /* For btrees, get tree height while we have the index open */
- info->tree_height = _bt_getrootheight(indexRelation);
- }
- else
- {
- /* For other index types, just set it to "unknown" for now */
info->tree_height = -1;
- }
index_close(indexRelation, NoLock);
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 5f5803f681..b7139f51b5 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -163,14 +163,6 @@ typedef struct TwoPhaseLockRecord
} TwoPhaseLockRecord;
-/*
- * Count of the number of fast path lock slots we believe to be used. This
- * might be higher than the real number if another backend has transferred
- * our locks to the primary lock table, but it can never be lower than the
- * real value, since only we can acquire locks on our own behalf.
- */
-static int FastPathLocalUseCount = 0;
-
/*
* Flag to indicate if the relation extension lock is held by this backend.
* This flag is used to ensure that while holding the relation extension lock
@@ -203,18 +195,18 @@ static bool IsPageLockHeld PG_USED_FOR_ASSERTS_ONLY = false;
#define FAST_PATH_LOCKNUMBER_OFFSET 1
#define FAST_PATH_MASK ((1 << FAST_PATH_BITS_PER_SLOT) - 1)
#define FAST_PATH_GET_BITS(proc, n) \
- (((proc)->fpLockBits >> (FAST_PATH_BITS_PER_SLOT * n)) & FAST_PATH_MASK)
+ proc->fpLockBits[n]
#define FAST_PATH_BIT_POSITION(n, l) \
(AssertMacro((l) >= FAST_PATH_LOCKNUMBER_OFFSET), \
AssertMacro((l) < FAST_PATH_BITS_PER_SLOT+FAST_PATH_LOCKNUMBER_OFFSET), \
AssertMacro((n) < FP_LOCK_SLOTS_PER_BACKEND), \
((l) - FAST_PATH_LOCKNUMBER_OFFSET + FAST_PATH_BITS_PER_SLOT * (n)))
#define FAST_PATH_SET_LOCKMODE(proc, n, l) \
- (proc)->fpLockBits |= UINT64CONST(1) << FAST_PATH_BIT_POSITION(n, l)
+ (proc)->fpLockBits[n] |= 1<<((l)-FAST_PATH_LOCKNUMBER_OFFSET)
#define FAST_PATH_CLEAR_LOCKMODE(proc, n, l) \
- (proc)->fpLockBits &= ~(UINT64CONST(1) << FAST_PATH_BIT_POSITION(n, l))
+ (proc)->fpLockBits[n] &= ~(1 << ((l)-FAST_PATH_LOCKNUMBER_OFFSET))
#define FAST_PATH_CHECK_LOCKMODE(proc, n, l) \
- ((proc)->fpLockBits & (UINT64CONST(1) << FAST_PATH_BIT_POSITION(n, l)))
+ ((proc)->fpLockBits[n] & (1 << ((l)-FAST_PATH_LOCKNUMBER_OFFSET)))
/*
* The fast-path lock mechanism is concerned only with relation locks on
@@ -924,8 +916,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
* lock type on a relation we have already locked using the fast-path, but
* for now we don't worry about that case either.
*/
- if (EligibleForRelationFastPath(locktag, lockmode) &&
- FastPathLocalUseCount < FP_LOCK_SLOTS_PER_BACKEND)
+ if (EligibleForRelationFastPath(locktag, lockmode))
{
uint32 fasthashcode = FastPathStrongLockHashPartition(hashcode);
bool acquired;
@@ -940,8 +931,7 @@ LockAcquireExtended(const LOCKTAG *locktag,
if (FastPathStrongRelationLocks->count[fasthashcode] != 0)
acquired = false;
else
- acquired = FastPathGrantRelationLock(locktag->locktag_field2,
- lockmode);
+ acquired = FastPathGrantRelationLock(locktag->locktag_field2, lockmode);
LWLockRelease(&MyProc->fpInfoLock);
if (acquired)
{
@@ -2076,8 +2066,7 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
locallock->lockCleared = false;
/* Attempt fast release of any lock eligible for the fast path. */
- if (EligibleForRelationFastPath(locktag, lockmode) &&
- FastPathLocalUseCount > 0)
+ if (EligibleForRelationFastPath(locktag, lockmode))
{
bool released;
@@ -2647,6 +2636,11 @@ LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent)
ResourceOwnerForgetLock(CurrentResourceOwner, locallock);
}
+//static int ItersGrant = 0;
+//static int CallsGrant = 0;
+//static int ItersUngrant = 0;
+//static int CallsUngrant = 0;
+
/*
* FastPathGrantRelationLock
* Grant lock using per-backend fast-path array, if there is space.
@@ -2654,20 +2648,31 @@ LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent)
static bool
FastPathGrantRelationLock(Oid relid, LOCKMODE lockmode)
{
- uint32 f;
+ uint32 f, i;
uint32 unused_slot = FP_LOCK_SLOTS_PER_BACKEND;
+ // if (CallsGrant % 100000 == 0 || CallsUngrant % 100000 == 0)
+ // elog(WARNING, "counts: %d, %d, %d, %d, %f, %f", CallsGrant, CallsUngrant, ItersGrant, ItersUngrant,
+ // (float)ItersGrant/(float)CallsGrant, (float)ItersUngrant/(float)CallsUngrant);
+
+ // CallsGrant++;
+
/* Scan for existing entry for this relid, remembering empty slot. */
- for (f = 0; f < FP_LOCK_SLOTS_PER_BACKEND; f++)
+ for (i = 0; i < FP_LOCK_SLOTS_PER_BACKEND; i++)
{
- if (FAST_PATH_GET_BITS(MyProc, f) == 0)
- unused_slot = f;
- else if (MyProc->fpRelId[f] == relid)
+ f = ((relid % FP_LOCK_SLOTS_PER_BACKEND) + i) % FP_LOCK_SLOTS_PER_BACKEND;
+ Assert(f < FP_LOCK_SLOTS_PER_BACKEND);
+ // ItersGrant++;
+
+ if (MyProc->fpRelId[f] == relid)
{
Assert(!FAST_PATH_CHECK_LOCKMODE(MyProc, f, lockmode));
FAST_PATH_SET_LOCKMODE(MyProc, f, lockmode);
return true;
}
+ else if (FAST_PATH_GET_BITS(MyProc, f) == 0)
+ if (unused_slot == FP_LOCK_SLOTS_PER_BACKEND)
+ unused_slot = f;
}
/* If no existing entry, use any empty slot. */
@@ -2675,7 +2680,6 @@ FastPathGrantRelationLock(Oid relid, LOCKMODE lockmode)
{
MyProc->fpRelId[unused_slot] = relid;
FAST_PATH_SET_LOCKMODE(MyProc, unused_slot, lockmode);
- ++FastPathLocalUseCount;
return true;
}
@@ -2691,24 +2695,25 @@ FastPathGrantRelationLock(Oid relid, LOCKMODE lockmode)
static bool
FastPathUnGrantRelationLock(Oid relid, LOCKMODE lockmode)
{
- uint32 f;
- bool result = false;
+ uint32 f, i;
- FastPathLocalUseCount = 0;
- for (f = 0; f < FP_LOCK_SLOTS_PER_BACKEND; f++)
+ // CallsUngrant++;
+
+ for (i = 0; i < FP_LOCK_SLOTS_PER_BACKEND; i++)
{
+ f = ((relid % FP_LOCK_SLOTS_PER_BACKEND) + i) % FP_LOCK_SLOTS_PER_BACKEND;
+ // ItersUngrant++;
+
if (MyProc->fpRelId[f] == relid
&& FAST_PATH_CHECK_LOCKMODE(MyProc, f, lockmode))
{
Assert(!result);
FAST_PATH_CLEAR_LOCKMODE(MyProc, f, lockmode);
- result = true;
- /* we continue iterating so as to update FastPathLocalUseCount */
+ return true;
}
- if (FAST_PATH_GET_BITS(MyProc, f) != 0)
- ++FastPathLocalUseCount;
}
- return result;
+
+ return false;
}
/*
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index d35e5605de..dab1aaf195 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -97,6 +97,7 @@
#include <ctype.h>
#include <math.h>
+#include "access/nbtree.h"
#include "access/brin.h"
#include "access/brin_page.h"
#include "access/gin.h"
@@ -6831,6 +6832,13 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
* touched. The number of such pages is btree tree height plus one (ie,
* we charge for the leaf page too). As above, charge once per SA scan.
*/
+ if (index->tree_height < 0)
+ {
+ Relation indexRel = index_open(index->indexoid, AccessShareLock);
+ index->tree_height = _bt_getrootheight(indexRel);
+ index_close(indexRel, NoLock);
+ }
+
descentCost = (index->tree_height + 1) * 50.0 * cpu_operator_cost;
costs.indexStartupCost += descentCost;
costs.indexTotalCost += costs.num_sa_scans * descentCost;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 2579e619eb..736b03d4f8 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -80,7 +80,7 @@ struct XidCache
* rather than the main lock table. This eases contention on the lock
* manager LWLocks. See storage/lmgr/README for additional details.
*/
-#define FP_LOCK_SLOTS_PER_BACKEND 16
+#define FP_LOCK_SLOTS_PER_BACKEND 64
/*
* An invalid pgprocno. Must be larger than the maximum number of PGPROC
@@ -282,7 +282,7 @@ struct PGPROC
/* Lock manager data, recording fast-path locks taken by this backend. */
LWLock fpInfoLock; /* protects per-backend fast-path state */
- uint64 fpLockBits; /* lock modes held for each fast-path slot */
+ uint8 fpLockBits[FP_LOCK_SLOTS_PER_BACKEND]; /* lock modes held for each fast-path slot */
Oid fpRelId[FP_LOCK_SLOTS_PER_BACKEND]; /* slots for rel oids */
bool fpVXIDLock; /* are we holding a fast-path VXID lock? */
LocalTransactionId fpLocalTransactionId; /* lxid for fast-path VXID
--
2.25.1
On 2022-Aug-19, David Geier wrote:
Beyond that I did some off-CPU profiling to precisely track down which lock
serializes execution. It turned out to be the MyProc::fpInfoLock lightweight
lock. This lock is used in the fast path of the heavyweight lock. In the
contenting case, fpInfoLock is acquired in LW_EXCLUSIVE mode to (1) check if
there is no other process holding a stronger lock, and if not, to reserve a
process local fast path lock slot and (2) to return the fast path lock slots
all in one go. To do so, the current implementation always linearly iterates
over all lock slots.
Ah, so this is the aspect that you mentioned to me today. I definitely
think that this analysis deserves its own thread, and the fix is its own
separate patch.
I have attached the patch to improve the heavyweight lock fast path. It also
for now contains moving out _bt_getrootheight(). For workloads where the
same set of locks is used over and over again, it only needs on average a
single loop iteration to find the relation (instead of a linear scan
before). This allows to increase the number of fast path locks by a lot. In
this patch I increased them from 16 to 64. The code can be further improved
for cases where to be locked relations change frequently and therefore the
chance of not finding a relation and because of that having to linearly
search the whole array is higher.
I suggest to put each change in a separate patch:
1. improve fast-path lock algorithm to find the element, perhaps
together with increasing the number of elements in the array
2. change _bt_getrootheight
However, since patch (1) may have nontrivial performance implications,
you would also need to justify the change: not only that improves the
case where many locks are acquired, but also that it does not make the
case with few locks worse.
I strongly suggest to not include C++ comments or any other dirtiness in
the patch, as that might deter some potential reviewers.
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"It takes less than 2 seconds to get to 78% complete; that's a good sign.
A few seconds later it's at 90%, but it seems to have stuck there. Did
somebody make percentages logarithmic while I wasn't looking?"
http://smylers.hates-software.com/2005/09/08/1995c749.html