PATCH: Using BRIN indexes for sorted output
Hi,
There have been a couple discussions about using BRIN indexes for
sorting - in fact this was mentioned even in the "Improving Indexing
Performance" unconference session this year (don't remember by whom).
But I haven't seen any patches, so here's one.
The idea is that we can use the information about ranges to split the table
into smaller parts that can be sorted one at a time. For example, if
you have a tiny 2MB table with two ranges, with values in the [0,100] and
[101,200] intervals, then it's clear we can sort the first range, output
those tuples, and then sort/output the second range.
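Conceptually, for that two-range example, the BRIN Sort does roughly the
equivalent of the following two queries (the table name and the block
boundary are hypothetical, assuming the first range covers blocks 0-127),
except that it's a single plan producing one sorted stream:

select * from tiny where ctid < '(128,0)' order by a;   -- range [0,100]
select * from tiny where ctid >= '(128,0)' order by a;  -- range [101,200]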
The attached patch builds "BRIN Sort" paths/plans, closely resembling
index scans, but for BRIN indexes. This special type of index scan does
what was described above - it incrementally sorts the data. It's a bit
more complicated because of overlapping ranges, ASC/DESC, NULLs, etc.
This is disabled by default, using a GUC enable_brinsort (you may need
to tweak other GUCs to disable parallel plans etc.).
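For example, something like this should be enough to get the BRIN Sort
plans shown below (the exact GUC combination may differ on other machines;
enable_brinsort is the only new GUC):

set enable_brinsort = on;
set max_parallel_workers_per_gather = 0;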
A trivial example, demonstrating the benefits:
create table t (a int) with (fillfactor = 10);
insert into t select i from generate_series(1,10000000) s(i);
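(The plans below reference a BRIN index t_a_idx, so the setup presumably
also included something like this:)

create index t_a_idx on t using brin (a);
vacuum analyze t;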
First, a simple LIMIT query:
explain (analyze, costs off) select * from t order by a limit 10;
QUERY PLAN
------------------------------------------------------------------------
 Limit  (actual time=1879.768..1879.770 rows=10 loops=1)
   ->  Sort  (actual time=1879.767..1879.768 rows=10 loops=1)
         Sort Key: a
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Seq Scan on t
             (actual time=0.007..1353.110 rows=10000000 loops=1)
 Planning Time: 0.083 ms
 Execution Time: 1879.786 ms
(7 rows)
The same query, this time with the BRIN index and enable_brinsort on:

QUERY PLAN
------------------------------------------------------------------------
 Limit  (actual time=1.217..1.219 rows=10 loops=1)
   ->  BRIN Sort using t_a_idx on t
       (actual time=1.216..1.217 rows=10 loops=1)
         Sort Key: a
 Planning Time: 0.084 ms
 Execution Time: 1.234 ms
(5 rows)
That's a pretty nice improvement - of course, this is thanks to the data
being perfectly sequential, and the difference can be made almost arbitrary
by making the table smaller/larger. Similarly, if the table gets less
sequential (making the ranges overlap), the BRIN plan will be more
expensive. Feel free to experiment with other data sets.
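For example, a randomly ordered data set like this one makes the ranges
overlap almost completely, so the BRIN Sort has to sort most of the table
for the first range anyway (just an illustration, not measured):

create table t_random (a int) with (fillfactor = 10);
insert into t_random select (random() * 10000000)::int
  from generate_series(1,10000000) s(i);
create index on t_random using brin (a);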
However, it's not just LIMIT queries that improve - consider a sort of the
whole table:
test=# explain (analyze, costs off) select * from t order by a;
QUERY PLAN
-------------------------------------------------------------------------
 Sort  (actual time=2806.468..3487.213 rows=10000000 loops=1)
   Sort Key: a
   Sort Method: external merge  Disk: 117528kB
   ->  Seq Scan on t  (actual time=0.018..1498.754 rows=10000000 loops=1)
 Planning Time: 0.110 ms
 Execution Time: 3766.825 ms
(6 rows)
And the same query again, with the BRIN Sort plan:
test=# explain (analyze, costs off) select * from t order by a;
QUERY PLAN
----------------------------------------------------------------------------------
 BRIN Sort using t_a_idx on t (actual time=1.210..2670.875 rows=10000000
loops=1)
   Sort Key: a
Planning Time: 0.073 ms
Execution Time: 2939.324 ms
(4 rows)
Right - not a huge difference, but still a nice 25% speedup, mostly due
to not having to spill data to disk and sorting smaller amounts of data.
There are a bunch of issues with this initial version of the patch,
usually described in XXX comments in the relevant places.
1) The paths are created in build_index_paths() because that's what
creates index scans (which the new path resembles). But that is expected
to produce IndexPath, not BrinSortPath, so it's not quite correct.
Should be somewhere "higher" I guess.
2) BRIN indexes don't have internal ordering, i.e. ASC/DESC and NULLS
FIRST/LAST do not really matter for them. The patch just generates
paths for all 4 combinations (or tries to). Maybe there's a better way.
3) I'm not quite sure the separation of responsibilities between
opfamily and opclass is optimal. I added a new amproc, but maybe this
should be split differently. At the moment only minmax indexes have
this, but adding this to minmax-multi should be trivial.
4) The state machine in nodeBrinSort is a bit confusing. It works, but may
need cleanup and refactoring. Ideas welcome.
5) The costing is essentially just plain cost_index. I have some ideas
about BRIN costing in general, which I'll post in a separate thread (as
it's not specific to this patch).
6) At the moment this only picks one of the index keys, specified in the
ORDER BY clause. I think we can generalize this to multiple keys, but
thinking about multi-key ranges was a bit too much for me. The good
thing is this nicely combines with IncrementalSort (see the sketch after
this list of issues).
7) Only plain index keys for the ORDER BY keys, no expressions. Should
not be hard to fix, though.
8) A parallel version is not supported yet, but I think it shouldn't be
impossible. Just make the leader build the range info, and then let the
workers acquire/sort ranges and merge the results by Gather Merge.
9) I was also thinking about leveraging other indexes to quickly
eliminate ranges that need to be sorted. The node does evaluate the filter,
of course, but only after reading the tuples from the range. But imagine
we allow BrinSort to utilize BRIN indexes to evaluate the filter - in
that case we might skip many ranges entirely. Essentially like a bitmap
index scan does, except that building the bitmap incrementally with BRIN
is trivial - you can quickly check if a particular range matches or not.
With other indexes (e.g. btree) you essentially need to evaluate the
filter completely, and only then can you look at the bitmap. Which seems
rather against the idea of this patch, which is about low startup cost.
Of course, the condition might be very selective, but then you probably
can just fetch the matching tuples and do a Sort.
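As a (hypothetical) example: for a table t2(a,b) with BRIN indexes on both
columns, a query like

select * from t2 where b between 100 and 200 order by a limit 10;

could consult the index on "b" to skip entire ranges up front, instead of
fetching every tuple from each range and only then applying the filter, as
the patch does now.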
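And the IncrementalSort combination mentioned in (6): for the same
hypothetical t2(a,b), an ORDER BY on both columns might end up with a plan
shaped roughly like this, with BRIN Sort providing the presorted prefix:

explain select * from t2 order by a, b limit 10;

 Limit
   ->  Incremental Sort
         Sort Key: a, b
         Presorted Key: a
         ->  BRIN Sort using t2_a_idx on t2
               Sort Key: a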
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
0001-Allow-BRIN-indexes-to-produce-sorted-output-20221015.patch
From 6d75cd243c107bc309958ecee98b085dfb7962ad Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Sun, 9 Oct 2022 11:33:37 +0200
Subject: [PATCH] Allow BRIN indexes to produce sorted output
Some BRIN indexes can be used to produce sorted output, by using the
range information to sort tuples incrementally. This is particularly
interesting for LIMIT queries, which only need to scan the first few
rows, and alternative plans (e.g. Seq Scan + Sort) have a very high
startup cost.
Of course, if there are e.g. BTREE indexes this is going to be slower,
but people are unlikely to have both index types on the same column.
This is disabled by default, use enable_brinsort GUC to enable it.
---
src/backend/access/brin/brin_minmax.c | 149 +++
src/backend/commands/explain.c | 44 +
src/backend/executor/Makefile | 1 +
src/backend/executor/execProcnode.c | 10 +
src/backend/executor/nodeBrinSort.c | 1538 +++++++++++++++++++++++
src/backend/optimizer/path/costsize.c | 254 ++++
src/backend/optimizer/path/indxpath.c | 197 +++
src/backend/optimizer/path/pathkeys.c | 50 +
src/backend/optimizer/plan/createplan.c | 189 +++
src/backend/optimizer/plan/setrefs.c | 19 +
src/backend/optimizer/util/pathnode.c | 59 +
src/backend/utils/misc/guc_tables.c | 10 +
src/include/access/brin.h | 20 +
src/include/access/brin_internal.h | 1 +
src/include/catalog/pg_amproc.dat | 64 +
src/include/catalog/pg_opclass.dat | 2 +-
src/include/catalog/pg_proc.dat | 3 +
src/include/executor/nodeBrinSort.h | 47 +
src/include/nodes/execnodes.h | 69 +
src/include/nodes/pathnodes.h | 11 +
src/include/nodes/plannodes.h | 26 +
src/include/optimizer/cost.h | 3 +
src/include/optimizer/pathnode.h | 11 +
src/include/optimizer/paths.h | 3 +
24 files changed, 2779 insertions(+), 1 deletion(-)
create mode 100644 src/backend/executor/nodeBrinSort.c
create mode 100644 src/include/executor/nodeBrinSort.h
diff --git a/src/backend/access/brin/brin_minmax.c b/src/backend/access/brin/brin_minmax.c
index 9e8a8e056cc..0e6ba0893df 100644
--- a/src/backend/access/brin/brin_minmax.c
+++ b/src/backend/access/brin/brin_minmax.c
@@ -10,12 +10,20 @@
*/
#include "postgres.h"
+#include "access/brin.h"
#include "access/brin_internal.h"
+#include "access/brin_revmap.h"
#include "access/brin_tuple.h"
#include "access/genam.h"
#include "access/stratnum.h"
+#include "access/table.h"
+#include "access/tableam.h"
+#include "catalog/index.h"
+#include "catalog/pg_am.h"
#include "catalog/pg_amop.h"
#include "catalog/pg_type.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
#include "utils/builtins.h"
#include "utils/datum.h"
#include "utils/lsyscache.h"
@@ -253,6 +261,147 @@ brin_minmax_union(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+typedef struct BrinOpaque
+{
+ BlockNumber bo_pagesPerRange;
+ BrinRevmap *bo_rmAccess;
+ BrinDesc *bo_bdesc;
+} BrinOpaque;
+
+Datum
+brin_minmax_ranges(PG_FUNCTION_ARGS)
+{
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+ AttrNumber attnum = PG_GETARG_INT16(1);
+ BrinOpaque *opaque;
+ Relation indexRel;
+ Relation heapRel;
+ BlockNumber nblocks;
+ BlockNumber nranges;
+ BlockNumber heapBlk;
+ Oid heapOid;
+ BrinMemTuple *dtup;
+ BrinTuple *btup = NULL;
+ Size btupsz = 0;
+ Buffer buf = InvalidBuffer;
+ BrinRanges *ranges;
+ BlockNumber pagesPerRange;
+ BrinDesc *bdesc;
+
+ /*
+ * Determine how many BRIN ranges there could be, allocate space, and read
+ * all the min/max values.
+ */
+ opaque = (BrinOpaque *) scan->opaque;
+ bdesc = opaque->bo_bdesc;
+ pagesPerRange = opaque->bo_pagesPerRange;
+
+ indexRel = bdesc->bd_index;
+
+ /* make sure the provided attnum is valid */
+ Assert((attnum > 0) && (attnum <= bdesc->bd_tupdesc->natts));
+
+ /*
+ * We need to know the size of the table so that we know how long to iterate
+ * on the revmap (and to pre-allocate the arrays).
+ */
+ heapOid = IndexGetRelation(RelationGetRelid(indexRel), false);
+ heapRel = table_open(heapOid, AccessShareLock);
+ nblocks = RelationGetNumberOfBlocks(heapRel);
+ table_close(heapRel, AccessShareLock);
+
+ /*
+ * How many ranges can there be? We simply look at the number of pages,
+ * divide it by the pages_per_range.
+ *
+ * XXX We need to be careful not to overflow nranges, so we just divide
+ * and then maybe add 1 for partial ranges.
+ */
+ nranges = (nblocks / pagesPerRange);
+ if (nblocks % pagesPerRange != 0)
+ nranges += 1;
+
+ /* allocate space for the ranges */
+ ranges = palloc0(offsetof(BrinRanges, ranges) + nranges * sizeof(BrinRange));
+ ranges->nranges = 0;
+
+ /* allocate an initial in-memory tuple, out of the per-range memcxt */
+ dtup = brin_new_memtuple(bdesc);
+
+ /*
+ * Now scan the revmap. We start by querying for heap page 0,
+ * incrementing by the number of pages per range; this gives us a full
+ * view of the table.
+ */
+ for (heapBlk = 0; heapBlk < nblocks; heapBlk += pagesPerRange)
+ {
+ bool gottuple = false;
+ BrinTuple *tup;
+ OffsetNumber off;
+ Size size;
+ BrinRange *range = &ranges->ranges[ranges->nranges];
+
+ ranges->nranges++;
+
+ CHECK_FOR_INTERRUPTS();
+
+ tup = brinGetTupleForHeapBlock(opaque->bo_rmAccess, heapBlk, &buf,
+ &off, &size, BUFFER_LOCK_SHARE,
+ scan->xs_snapshot);
+ if (tup)
+ {
+ gottuple = true;
+ btup = brin_copy_tuple(tup, size, btup, &btupsz);
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ }
+
+ range->blkno_start = heapBlk;
+ range->blkno_end = heapBlk + (pagesPerRange - 1);
+
+ /*
+ * Ranges with no indexed tuple may contain anything.
+ */
+ if (!gottuple)
+ {
+ range->not_summarized = true;
+ }
+ else
+ {
+ dtup = brin_deform_tuple(bdesc, btup, dtup);
+ if (dtup->bt_placeholder)
+ {
+ /*
+ * Placeholder tuples are treated as if not populated.
+ *
+ * XXX Is this correct?
+ */
+ range->not_summarized = true;
+ }
+ else
+ {
+ BrinValues *bval;
+
+ bval = &dtup->bt_columns[attnum - 1];
+
+ range->has_nulls = bval->bv_hasnulls;
+ range->all_nulls = bval->bv_allnulls;
+
+ if (!bval->bv_allnulls)
+ {
+ /* FIXME copy the values, if needed (e.g. varlena) */
+ range->min_value = bval->bv_values[0];
+ range->max_value = bval->bv_values[1];
+ }
+ }
+ }
+ }
+
+ if (buf != InvalidBuffer)
+ ReleaseBuffer(buf);
+
+ PG_RETURN_POINTER(ranges);
+}
+
/*
* Cache and return the procedure for the given strategy.
*
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index f86983c6601..e15b29246b1 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -85,6 +85,8 @@ static void show_sort_keys(SortState *sortstate, List *ancestors,
ExplainState *es);
static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
List *ancestors, ExplainState *es);
+static void show_brinsort_keys(BrinSortState *sortstate, List *ancestors,
+ ExplainState *es);
static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
ExplainState *es);
static void show_agg_keys(AggState *astate, List *ancestors,
@@ -1100,6 +1102,7 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
case T_IndexScan:
case T_IndexOnlyScan:
case T_BitmapHeapScan:
+ case T_BrinSort:
case T_TidScan:
case T_TidRangeScan:
case T_SubqueryScan:
@@ -1262,6 +1265,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
case T_IndexOnlyScan:
pname = sname = "Index Only Scan";
break;
+ case T_BrinSort:
+ pname = sname = "BRIN Sort";
+ break;
case T_BitmapIndexScan:
pname = sname = "Bitmap Index Scan";
break;
@@ -1508,6 +1514,16 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainScanTarget((Scan *) indexonlyscan, es);
}
break;
+ case T_BrinSort:
+ {
+ BrinSort *brinsort = (BrinSort *) plan;
+
+ ExplainIndexScanDetails(brinsort->indexid,
+ brinsort->indexorderdir,
+ es);
+ ExplainScanTarget((Scan *) brinsort, es);
+ }
+ break;
case T_BitmapIndexScan:
{
BitmapIndexScan *bitmapindexscan = (BitmapIndexScan *) plan;
@@ -1790,6 +1806,18 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainPropertyFloat("Heap Fetches", NULL,
planstate->instrument->ntuples2, 0, es);
break;
+ case T_BrinSort:
+ show_scan_qual(((BrinSort *) plan)->indexqualorig,
+ "Index Cond", planstate, ancestors, es);
+ if (((BrinSort *) plan)->indexqualorig)
+ show_instrumentation_count("Rows Removed by Index Recheck", 2,
+ planstate, es);
+ show_scan_qual(plan->qual, "Filter", planstate, ancestors, es);
+ show_brinsort_keys(castNode(BrinSortState, planstate), ancestors, es);
+ if (plan->qual)
+ show_instrumentation_count("Rows Removed by Filter", 1,
+ planstate, es);
+ break;
case T_BitmapIndexScan:
show_scan_qual(((BitmapIndexScan *) plan)->indexqualorig,
"Index Cond", planstate, ancestors, es);
@@ -2389,6 +2417,21 @@ show_incremental_sort_keys(IncrementalSortState *incrsortstate,
ancestors, es);
}
+/*
+ * Show the sort keys for a BRIN Sort node.
+ */
+static void
+show_brinsort_keys(BrinSortState *sortstate, List *ancestors, ExplainState *es)
+{
+ BrinSort *plan = (BrinSort *) sortstate->ss.ps.plan;
+
+ show_sort_group_keys((PlanState *) sortstate, "Sort Key",
+ plan->numCols, 0, plan->sortColIdx,
+ plan->sortOperators, plan->collations,
+ plan->nullsFirst,
+ ancestors, es);
+}
+
/*
* Likewise, for a MergeAppend node.
*/
@@ -3812,6 +3855,7 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
case T_ForeignScan:
case T_CustomScan:
case T_ModifyTable:
+ case T_BrinSort:
/* Assert it's on a real relation */
Assert(rte->rtekind == RTE_RELATION);
objectname = get_rel_name(rte->relid);
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 11118d0ce02..bcaa2ce8e21 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -38,6 +38,7 @@ OBJS = \
nodeBitmapHeapscan.o \
nodeBitmapIndexscan.o \
nodeBitmapOr.o \
+ nodeBrinSort.o \
nodeCtescan.o \
nodeCustom.o \
nodeForeignscan.o \
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 36406c3af57..4a6dc3f263c 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -79,6 +79,7 @@
#include "executor/nodeBitmapHeapscan.h"
#include "executor/nodeBitmapIndexscan.h"
#include "executor/nodeBitmapOr.h"
+#include "executor/nodeBrinSort.h"
#include "executor/nodeCtescan.h"
#include "executor/nodeCustom.h"
#include "executor/nodeForeignscan.h"
@@ -226,6 +227,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
estate, eflags);
break;
+ case T_BrinSort:
+ result = (PlanState *) ExecInitBrinSort((BrinSort *) node,
+ estate, eflags);
+ break;
+
case T_BitmapIndexScan:
result = (PlanState *) ExecInitBitmapIndexScan((BitmapIndexScan *) node,
estate, eflags);
@@ -639,6 +645,10 @@ ExecEndNode(PlanState *node)
ExecEndIndexOnlyScan((IndexOnlyScanState *) node);
break;
+ case T_BrinSortState:
+ ExecEndBrinSort((BrinSortState *) node);
+ break;
+
case T_BitmapIndexScanState:
ExecEndBitmapIndexScan((BitmapIndexScanState *) node);
break;
diff --git a/src/backend/executor/nodeBrinSort.c b/src/backend/executor/nodeBrinSort.c
new file mode 100644
index 00000000000..ad46169aee3
--- /dev/null
+++ b/src/backend/executor/nodeBrinSort.c
@@ -0,0 +1,1538 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeBrinSort.c
+ * Routines to support sorted scan of relations using a BRIN index
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * FIXME handling of other brin opclasses (minmax-multi)
+ *
+ * FIXME improve costing
+ *
+ *
+ * Improvement ideas:
+ *
+ * 1) multiple tuplestores for overlapping ranges
+ *
+ * When there are many overlapping ranges (so that maxval > current.maxval),
+ * we're loading all the "future" tuples into a new tuplestore. However, if
+ * there are multiple such ranges (imagine ranges "shifting" by 10%, which
+ * gives us 9 more ranges), we know in the next round we'll only need rows
+ * until the next maxval. We'll not sort these rows, but we'll still shuffle
+ * them around until we get to the proper range (so about 10x each row).
+ * Maybe we should pre-allocate the tuplestores (or maybe even tuplesorts)
+ * for future ranges, and route the tuples to the correct one? Maybe we
+ * could be a bit smarter and discard tuples once we have enough rows for
+ * the preceding ranges (say, with LIMIT queries). We'd also need to worry
+ * about work_mem, though - we can't just use many tuplestores, each with
+ * whole work_mem. So we'd probably use e.g. work_mem/2 for the next one,
+ * and then /4, /8 etc. for the following ones. That's work_mem in total.
+ * And there'd need to be some limit on number of tuplestores, I guess.
+ *
+ * 2) handling NULL values
+ *
+ * We need to handle NULLS FIRST / NULLS LAST cases. The question is how
+ * to do that - the easiest way is to simply do a separate scan of ranges
+ * that might contain NULL values, processing just rows with NULLs, and
+ * discarding other rows. And then process non-NULL values as currently.
+ * The NULL scan would happen before/after this regular phase.
+ *
+ * But maybe we could be smarter, and not do separate scans. When reading
+ * a page, we might stash the tuple in a tuplestore, so that we can read
+ * it the next round. Obviously, this might be expensive if we need to
+ * keep too many rows, so the tuplestore would grow too large - in that
+ * case it might be better to just do the two scans.
+ *
+ * 3) parallelism
+ *
+ * Presumably we could do a parallel version of this. The leader or first
+ * worker would prepare the range information, and the workers would then
+ * grab ranges (in a kinda round robin manner), sort them independently,
+ * and then the results would be merged by Gather Merge.
+ *
+ * IDENTIFICATION
+ * src/backend/executor/nodeBrinSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ * ExecBrinSort scans a relation using an index
+ * IndexNext retrieve next tuple using index
+ * ExecInitBrinSort creates and initializes state info.
+ * ExecReScanBrinSort rescans the indexed relation.
+ * ExecEndBrinSort releases all storage.
+ * ExecBrinSortMarkPos marks scan position.
+ * ExecBrinSortRestrPos restores scan position.
+ * ExecBrinSortEstimate estimates DSM space needed for parallel index scan
+ * ExecBrinSortInitializeDSM initialize DSM for parallel BrinSort
+ * ExecBrinSortReInitializeDSM reinitialize DSM for fresh scan
+ * ExecBrinSortInitializeWorker attach to DSM info in parallel worker
+ */
+#include "postgres.h"
+
+#include "access/brin.h"
+#include "access/brin_internal.h"
+#include "access/nbtree.h"
+#include "access/relscan.h"
+#include "access/table.h"
+#include "access/tableam.h"
+#include "catalog/index.h"
+#include "catalog/pg_am.h"
+#include "executor/execdebug.h"
+#include "executor/nodeBrinSort.h"
+#include "lib/pairingheap.h"
+#include "miscadmin.h"
+#include "nodes/nodeFuncs.h"
+#include "utils/array.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+
+/*
+ * When an ordering operator is used, tuples fetched from the index that
+ * need to be reordered are queued in a pairing heap, as ReorderTuples.
+ */
+typedef struct
+{
+ pairingheap_node ph_node;
+ HeapTuple htup;
+ Datum *orderbyvals;
+ bool *orderbynulls;
+} ReorderTuple;
+
+static TupleTableSlot *IndexNext(BrinSortState *node);
+static bool IndexRecheck(BrinSortState *node, TupleTableSlot *slot);
+static void ExecInitBrinSortRanges(BrinSort *node, BrinSortState *planstate);
+
+/* do various consistency checks */
+static void
+AssertCheckRanges(BrinSortState *node)
+{
+#ifdef USE_ASSERT_CHECKING
+
+ /* the primary range index has to be valid */
+ Assert((0 <= node->bs_next_range) &&
+ (node->bs_next_range <= node->bs_nranges));
+
+ /* the intersect range index has to be valid*/
+ Assert((0 <= node->bs_next_range_intersect) &&
+ (node->bs_next_range_intersect <= node->bs_nranges));
+
+ /* all the ranges up to bs_next_range should be marked as processed */
+ for (int i = 0; i < node->bs_next_range; i++)
+ {
+ BrinSortRange *range = &node->bs_ranges[i];
+ Assert(range->processed);
+ }
+
+ /* same for bs_next_range_intersect */
+ for (int i = 0; i < node->bs_next_range_intersect; i++)
+ {
+ BrinSortRange *range = node->bs_ranges_minval[i];
+ Assert(range->processed);
+ }
+#endif
+}
+
+/*
+ * brinsort_start_tidscan
+ * Start scanning tuples from a given page range.
+ *
+ * We open a TID range scan for the given range, and initialize the tuplesort.
+ * Optionally, we update the watermark (with either high/low value). We only
+ * need to do this for the main page range, not for the intersecting ranges.
+ *
+ * XXX Maybe we should initialize the tidscan only once, and then do rescan
+ * for the following ranges? And similarly for the tuplesort?
+ */
+static void
+brinsort_start_tidscan(BrinSortState *node, BrinSortRange *range,
+ bool update_watermark, bool mark_processed)
+{
+ BrinSort *plan = (BrinSort *) node->ss.ps.plan;
+ EState *estate = node->ss.ps.state;
+
+ /*
+ * When scanning a range during NULL processing, the range might already
+ * be marked as processed (for NULLS LAST). So we only check that the
+ * range is not already marked as processed when we're supposed to mark
+ * it as processed.
+ */
+ Assert(!(mark_processed && range->processed));
+
+ /* There must not be any TID scan in progress yet. */
+ Assert(node->ss.ss_currentScanDesc == NULL);
+
+ /* Initialize the TID range scan, for the provided block range. */
+ if (node->ss.ss_currentScanDesc == NULL)
+ {
+ TableScanDesc tscandesc;
+ ItemPointerData mintid,
+ maxtid;
+
+ ItemPointerSetBlockNumber(&mintid, range->blkno_start);
+ ItemPointerSetOffsetNumber(&mintid, 0);
+
+ ItemPointerSetBlockNumber(&maxtid, range->blkno_end);
+ ItemPointerSetOffsetNumber(&maxtid, MaxHeapTuplesPerPage);
+
+ elog(DEBUG1, "loading range blocks [%u, %u]",
+ range->blkno_start, range->blkno_end);
+
+ tscandesc = table_beginscan_tidrange(node->ss.ss_currentRelation,
+ estate->es_snapshot,
+ &mintid, &maxtid);
+ node->ss.ss_currentScanDesc = tscandesc;
+ }
+
+ if (node->bs_tuplesortstate == NULL)
+ {
+ TupleDesc tupDesc = RelationGetDescr(node->ss.ss_currentRelation);
+
+ node->bs_tuplesortstate = tuplesort_begin_heap(tupDesc,
+ plan->numCols,
+ plan->sortColIdx,
+ plan->sortOperators,
+ plan->collations,
+ plan->nullsFirst,
+ work_mem,
+ NULL,
+ TUPLESORT_NONE);
+ }
+
+ if (node->bs_tuplestore == NULL)
+ {
+ node->bs_tuplestore = tuplestore_begin_heap(false, false, work_mem);
+ }
+
+ /*
+ * Remember maximum value for the current range (but not when
+ * processing overlapping ranges). We only do this during the
+ * regular tuple processing, not when scanning NULL values.
+ *
+ * We use the larger value, according to the sort operator, so that this
+ * gets the right value even for DESC ordering (in which case the lower
+ * boundary will be evaluated as "greater").
+ *
+ * XXX Could also use the scan direction, like in other places.
+ */
+ if (update_watermark)
+ {
+ int cmp = ApplySortComparator(range->min_value, false,
+ range->max_value, false,
+ &node->bs_sortsupport);
+
+ if (cmp < 0)
+ node->bs_watermark = range->max_value;
+ else
+ node->bs_watermark = range->min_value;
+ }
+
+ /* Maybe mark the range as processed. */
+ range->processed |= mark_processed;
+}
+
+/*
+ * brinsort_end_tidscan
+ * Finish the TID range scan.
+ */
+static void
+brinsort_end_tidscan(BrinSortState *node)
+{
+ /* get the first range, read all tuples using a tid range scan */
+ if (node->ss.ss_currentScanDesc != NULL)
+ {
+ table_endscan(node->ss.ss_currentScanDesc);
+ node->ss.ss_currentScanDesc = NULL;
+ }
+}
+
+/*
+ * brinsort_load_tuples
+ * Load tuples from the TID range scan, add them to tuplesort/store.
+ *
+ * When called for the "current" range, we don't need to check the watermark,
+ * we know the tuple goes into the tuplesort. So with check_watermark we
+ * skip the comparator call to save CPU cost.
+ */
+static void
+brinsort_load_tuples(BrinSortState *node, bool check_watermark, bool null_processing)
+{
+ BrinSort *plan = (BrinSort *) node->ss.ps.plan;
+ TableScanDesc scan = node->ss.ss_currentScanDesc;
+ EState *estate;
+ ScanDirection direction;
+ TupleTableSlot *slot;
+
+ estate = node->ss.ps.state;
+ direction = estate->es_direction;
+
+ slot = node->ss.ss_ScanTupleSlot;
+
+ /*
+ * Read tuples, evaluate the filter (so that we don't keep tuples only to
+ * discard them later), and decide if it goes into the current range
+ * (tuplesort) or overflow (tuplestore).
+ */
+ while (table_scan_getnextslot_tidrange(scan, direction, slot))
+ {
+ ExprContext *econtext;
+ ExprState *qual;
+
+ /*
+ * Fetch data from node
+ */
+ qual = node->bs_qual;
+ econtext = node->ss.ps.ps_ExprContext;
+
+ /*
+ * place the current tuple into the expr context
+ */
+ econtext->ecxt_scantuple = slot;
+
+ /*
+ * check that the current tuple satisfies the qual-clause
+ *
+ * check for non-null qual here to avoid a function call to ExecQual()
+ * when the qual is null ... saves only a few cycles, but they add up
+ * ...
+ *
+ * XXX Done here, because in ExecScan we'll get different slot type
+ * (minimal tuple vs. buffered tuple). Scan expects slot while reading
+ * from the table (like here), but we're stashing it into a tuplesort.
+ *
+ * XXX Maybe we could eliminate many tuples by leveraging the BRIN
+ * range, by executing the consistent function. But we don't have
+ * the qual in appropriate format at the moment, so we'd preprocess
+ * the keys similarly to bringetbitmap(). In which case we should
+ * probably evaluate the stuff while building the ranges? Although,
+ * if the "consistent" function is expensive, it might be cheaper
+ * to do that incrementally, as we need the ranges. Would be a win
+ * for LIMIT queries, for example.
+ *
+ * XXX However, maybe we could also leverage other bitmap indexes,
+ * particularly for BRIN indexes because that makes it simpler to
+ * eliminate the ranges incrementally - we know which ranges to
+ * load from the index, while for other indexes (e.g. btree) we
+ * have to read the whole index and build a bitmap in order to have
+ * a bitmap for any range. Although, if the condition is very
+ * selective, we may need to read only a small fraction of the
+ * index, so maybe that's OK.
+ */
+ if (qual == NULL || ExecQual(qual, econtext))
+ {
+ int cmp = 0; /* matters for check_watermark=false */
+ Datum value;
+ bool isnull;
+
+ value = slot_getattr(slot, plan->sortColIdx[0], &isnull);
+
+ /*
+ * FIXME Not handling NULLS for now, we need to stash them into
+ * a separate tuplestore (so that we can output them first or
+ * last), and then skip them in the regular processing?
+ */
+ if (null_processing)
+ {
+ /* Stash it in the tuplestore (when NULL), or ignore
+ * it (when not NULL). */
+ if (isnull)
+ tuplestore_puttupleslot(node->bs_tuplestore, slot);
+
+ /* NULL or not, we're done */
+ continue;
+ }
+
+ /* we're not processing NULL values, so ignore NULLs */
+ if (isnull)
+ continue;
+
+ /*
+ * Otherwise compare to watermark, and stash it either to the
+ * tuplesort or tuplestore.
+ */
+ if (check_watermark)
+ cmp = ApplySortComparator(value, false,
+ node->bs_watermark, false,
+ &node->bs_sortsupport);
+
+ if (cmp <= 0)
+ tuplesort_puttupleslot(node->bs_tuplesortstate, slot);
+ else
+ tuplestore_puttupleslot(node->bs_tuplestore, slot);
+ }
+
+ ExecClearTuple(slot);
+ }
+
+ ExecClearTuple(slot);
+}
+
+/*
+ * brinsort_load_spill_tuples
+ * Load tuples from the spill tuplestore, and either stash them into
+ * a tuplesort or a new tuplestore.
+ *
+ * After processing the last range, we want to process all remaining ranges,
+ * so with check_watermark=false we skip the check.
+ */
+static void
+brinsort_load_spill_tuples(BrinSortState *node, bool check_watermark)
+{
+ BrinSort *plan = (BrinSort *) node->ss.ps.plan;
+ Tuplestorestate *tupstore;
+ TupleTableSlot *slot;
+
+ if (node->bs_tuplestore == NULL)
+ return;
+
+ /* start scanning the existing tuplestore (XXX needed?) */
+ tuplestore_rescan(node->bs_tuplestore);
+
+ /*
+ * Create a new tuplestore, for tuples that exceed the watermark and so
+ * should not be included in the current sort.
+ */
+ tupstore = tuplestore_begin_heap(false, false, work_mem);
+
+ /*
+ * We need a slot for minimal tuples. The scan slot uses buffered tuples,
+ * so it'd trigger an error in the loop.
+ */
+ slot = MakeSingleTupleTableSlot(RelationGetDescr(node->ss.ss_currentRelation),
+ &TTSOpsMinimalTuple);
+
+ while (tuplestore_gettupleslot(node->bs_tuplestore, true, true, slot))
+ {
+ int cmp = 0; /* matters for check_watermark=false */
+ bool isnull;
+ Datum value;
+
+ value = slot_getattr(slot, plan->sortColIdx[0], &isnull);
+
+ /* We shouldn't have NULL values in the spill, at least not now. */
+ Assert(!isnull);
+
+ if (check_watermark)
+ cmp = ApplySortComparator(value, false,
+ node->bs_watermark, false,
+ &node->bs_sortsupport);
+
+ if (cmp <= 0)
+ tuplesort_puttupleslot(node->bs_tuplesortstate, slot);
+ else
+ tuplestore_puttupleslot(tupstore, slot);
+ }
+
+ /*
+ * Discard the existing tuplestore (that we just processed), use the new
+ * one instead.
+ */
+ tuplestore_end(node->bs_tuplestore);
+ node->bs_tuplestore = tupstore;
+
+ ExecDropSingleTupleTableSlot(slot);
+}
+
+/*
+ * brinsort_load_intersecting_ranges
+ * Load ranges intersecting with the current watermark.
+ *
+ * This does not increment bs_next_range, but bs_next_range_intersect.
+ */
+static void
+brinsort_load_intersecting_ranges(BrinSort *plan, BrinSortState *node)
+{
+ /* load intersecting ranges */
+ for (int i = node->bs_next_range_intersect; i < node->bs_nranges; i++)
+ {
+ int cmp;
+ BrinSortRange *range = node->bs_ranges_minval[i];
+
+ /* skip already processed ranges */
+ if (range->processed)
+ continue;
+
+ /*
+ * Abort on the first all-null or not-summarized range. These are
+ * intentionally kept at the end, but don't intersect with anything.
+ */
+ if (range->all_nulls || range->not_summarized)
+ break;
+
+ if (ScanDirectionIsForward(plan->indexorderdir))
+ cmp = ApplySortComparator(range->min_value, false,
+ node->bs_watermark, false,
+ &node->bs_sortsupport);
+ else
+ cmp = ApplySortComparator(range->max_value, false,
+ node->bs_watermark, false,
+ &node->bs_sortsupport);
+
+ /*
+ * No possible overlap, so break, we know all following ranges have
+ * a higher minval and thus can't intersect either.
+ */
+ if (cmp > 0)
+ break;
+
+ node->bs_next_range_intersect++;
+
+ elog(DEBUG1, "loading intersecting range %d (%u,%u) [%ld,%ld] %ld", i,
+ range->blkno_start, range->blkno_end,
+ range->min_value, range->max_value,
+ node->bs_watermark);
+
+ /* load tuples from the range, check the watermark */
+ brinsort_start_tidscan(node, range, false, true);
+ brinsort_load_tuples(node, true, false);
+ brinsort_end_tidscan(node);
+ }
+}
+
+/*
+ * brinsort_load_unsummarized_ranges
+ * Load ranges that don't have a proper summary, so we don't know
+ * what values are in them (might be even NULL values).
+ *
+ * We simply load them into the spill tuplestore, because that's the
+ * best thing we can do. We ignore NULL values though - those are handled
+ * in a separate step.
+ */
+static void
+brinsort_load_unsummarized_ranges(BrinSort *plan, BrinSortState *node)
+{
+ /* Should be called only once, right after the first range. */
+ Assert(node->bs_next_range == 1);
+
+ /* load unsummarized ranges */
+ for (int i = 0; i < node->bs_nranges; i++)
+ {
+ BrinSortRange *range = node->bs_ranges_minval[i];
+
+ /* skip already processed ranges (there should be just one) */
+ if (range->processed)
+ continue;
+
+ /* we're interested only in not-summarized ranges */
+ if (!range->not_summarized)
+ continue;
+
+ elog(DEBUG1, "loading not-summarized range %d (%u,%u) [%ld,%ld] %ld", i,
+ range->blkno_start, range->blkno_end,
+ range->min_value, range->max_value,
+ node->bs_watermark);
+
+ /*
+ * Load tuples from the range, check the watermark and mark the
+ * ranges as processed.
+ */
+ brinsort_start_tidscan(node, range, false, true);
+ brinsort_load_tuples(node, true, false);
+ brinsort_end_tidscan(node);
+ }
+}
+
+/* ----------------------------------------------------------------
+ * IndexNext
+ *
+ * Retrieve a tuple from the BrinSort node's currentRelation
+ * using the index specified in the BrinSortState information.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+IndexNext(BrinSortState *node)
+{
+ BrinSort *plan = (BrinSort *) node->ss.ps.plan;
+ EState *estate;
+ ScanDirection direction;
+ IndexScanDesc scandesc;
+ TupleTableSlot *slot;
+ bool nullsFirst;
+
+ /*
+ * extract necessary information from index scan node
+ */
+ estate = node->ss.ps.state;
+ direction = estate->es_direction;
+
+ /* flip direction if this is an overall backward scan */
+ /* XXX For BRIN indexes this is always forward direction */
+ // if (ScanDirectionIsBackward(((BrinSort *) node->ss.ps.plan)->indexorderdir))
+ if (false)
+ {
+ if (ScanDirectionIsForward(direction))
+ direction = BackwardScanDirection;
+ else if (ScanDirectionIsBackward(direction))
+ direction = ForwardScanDirection;
+ }
+ scandesc = node->iss_ScanDesc;
+ slot = node->ss.ss_ScanTupleSlot;
+
+ nullsFirst = plan->nullsFirst[0];
+
+ if (scandesc == NULL)
+ {
+ /*
+ * We reach here if the index scan is not parallel, or if we're
+ * serially executing an index scan that was planned to be parallel.
+ */
+ scandesc = index_beginscan(node->ss.ss_currentRelation,
+ node->iss_RelationDesc,
+ estate->es_snapshot,
+ node->iss_NumScanKeys,
+ node->iss_NumOrderByKeys);
+
+ node->iss_ScanDesc = scandesc;
+
+ /*
+ * If no run-time keys to calculate or they are ready, go ahead and
+ * pass the scankeys to the index AM.
+ */
+ if (node->iss_NumRuntimeKeys == 0 || node->iss_RuntimeKeysReady)
+ index_rescan(scandesc,
+ node->iss_ScanKeys, node->iss_NumScanKeys,
+ node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+
+ /*
+ * Load info about BRIN ranges, sort them to match the desired ordering.
+ */
+ ExecInitBrinSortRanges(plan, node);
+ node->bs_next_range = 0;
+ node->bs_next_range_intersect = 0;
+ node->bs_next_range_nulls = 0;
+ node->bs_phase = BRINSORT_START;
+
+
+ /* dump ranges for debugging */
+ for (int i = 0; i < node->bs_nranges; i++)
+ {
+ elog(DEBUG1, "%d => (%u,%u) [%ld,%ld]", i,
+ node->bs_ranges[i].blkno_start,
+ node->bs_ranges[i].blkno_end,
+ node->bs_ranges[i].min_value,
+ node->bs_ranges[i].max_value);
+ }
+
+ for (int i = 0; i < node->bs_nranges; i++)
+ {
+ elog(DEBUG1, "minval %d => (%u,%u) [%ld,%ld]", i,
+ node->bs_ranges_minval[i]->blkno_start,
+ node->bs_ranges_minval[i]->blkno_end,
+ node->bs_ranges_minval[i]->min_value,
+ node->bs_ranges_minval[i]->max_value);
+ }
+ }
+
+ /*
+ * ok, now that we have what we need, fetch the next tuple.
+ */
+ while (node->bs_phase != BRINSORT_FINISHED)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ elog(DEBUG1, "phase = %d", node->bs_phase);
+
+ AssertCheckRanges(node);
+
+ switch (node->bs_phase)
+ {
+ case BRINSORT_START:
+ /*
+ * If we have NULLS FIRST, move to that stage. Otherwise
+ * start scanning regular ranges.
+ */
+ node->bs_phase = (nullsFirst) ? BRINSORT_LOAD_NULLS : BRINSORT_LOAD_RANGE;
+
+ break;
+
+ case BRINSORT_LOAD_RANGE:
+ {
+ BrinSortRange *range;
+
+ elog(DEBUG1, "phase = LOAD_RANGE %d of %d", node->bs_next_range, node->bs_nranges);
+
+ /*
+ * Some of the ranges might intersect with already processed
+ * range and thus have already been processed, so skip them.
+ *
+ * FIXME Should this care about all-null / not_summarized?
+ */
+ while ((node->bs_next_range < node->bs_nranges) &&
+ (node->bs_ranges[node->bs_next_range].processed))
+ node->bs_next_range++;
+
+ Assert(node->bs_next_range <= node->bs_nranges);
+
+ /* might point just after the last range */
+ range = &node->bs_ranges[node->bs_next_range];
+
+ /*
+ * Is this the last regular range? We might have either run
+ * out of ranges in general, or maybe we just hit the first
+ * all-null or not-summarized range.
+ *
+ * In this case there might still be a bunch of tuples in
+ * the tuplestore, so we need to process them properly. We
+ * load them into the tuplesort and process them.
+ */
+ if ((node->bs_next_range == node->bs_nranges) ||
+ (range->all_nulls || range->not_summarized))
+ {
+ /* still some tuples to process */
+ if (node->bs_tuplestore != NULL)
+ {
+ brinsort_load_spill_tuples(node, false);
+ node->bs_tuplestore = NULL;
+ tuplesort_performsort(node->bs_tuplesortstate);
+
+ node->bs_phase = BRINSORT_PROCESS_RANGE;
+ break;
+ }
+
+ /*
+ * We've reached the end, and there are no more rows in the
+ * tuplestore, so we're done.
+ */
+ if (node->bs_next_range == node->bs_nranges)
+ {
+ elog(DEBUG1, "phase => FINISHED / last range processed");
+ node->bs_phase = (nullsFirst) ? BRINSORT_FINISHED : BRINSORT_LOAD_NULLS;
+ break;
+ }
+ }
+
+ /* Fine, we can process this range, so move the index too. */
+ node->bs_next_range++;
+
+ /*
+ * Load the next unprocessed range. We update the watermark,
+ * so that we don't need to check it when loading tuples.
+ */
+ brinsort_start_tidscan(node, range, true, true);
+ brinsort_load_tuples(node, false, false);
+ brinsort_end_tidscan(node);
+
+ Assert(range->processed);
+
+ /* Load matching tuples from the current spill tuplestore. */
+ brinsort_load_spill_tuples(node, true);
+
+ /*
+ * Load tuples from intersecting ranges.
+ *
+ * XXX We do this after processing the spill tuplestore,
+ * because we will add rows to it - but we know those rows
+ * should be there, and brinsort_load_spill would recheck
+ * them again unnecessarily.
+ */
+ elog(DEBUG1, "loading intersecting ranges");
+ brinsort_load_intersecting_ranges(plan, node);
+
+ /*
+ * If this is the first range, process unsummarized ranges
+ * too. Similarly to the intersecting ranges, we do this
+ * after loading tuples from the spill tuplestore, because
+ * we might write some (many) tuples into that.
+ */
+ if (node->bs_next_range == 1)
+ brinsort_load_unsummarized_ranges(plan, node);
+
+ elog(DEBUG1, "performing sort");
+ tuplesort_performsort(node->bs_tuplesortstate);
+
+ node->bs_phase = BRINSORT_PROCESS_RANGE;
+ break;
+ }
+
+ case BRINSORT_PROCESS_RANGE:
+
+ slot = node->ss.ps.ps_ResultTupleSlot;
+
+ /* read tuples from the tuplesort range, and output them */
+ if (node->bs_tuplesortstate != NULL)
+ {
+ if (tuplesort_gettupleslot(node->bs_tuplesortstate,
+ ScanDirectionIsForward(direction),
+ false, slot, NULL))
+ return slot;
+
+ /* once we're done with the tuplesort, reset it */
+ tuplesort_reset(node->bs_tuplesortstate);
+ node->bs_phase = BRINSORT_LOAD_RANGE; /* load next range */
+ }
+
+ break;
+
+ case BRINSORT_LOAD_NULLS:
+ {
+ BrinSortRange *range;
+
+ elog(DEBUG1, "phase = LOAD_NULLS");
+
+ /*
+ * Ignore ranges that can't possibly have NULL values. We do
+ * not care about whether the range was already processed.
+ */
+ while (node->bs_next_range_nulls < node->bs_nranges)
+ {
+ /* these ranges may have NULL values */
+ if (node->bs_ranges[node->bs_next_range_nulls].has_nulls ||
+ node->bs_ranges[node->bs_next_range_nulls].all_nulls ||
+ node->bs_ranges[node->bs_next_range_nulls].not_summarized)
+ break;
+
+ node->bs_next_range_nulls++;
+ }
+
+ Assert(node->bs_next_range_nulls <= node->bs_nranges);
+
+ /*
+ * Did we process the last range? There should be nothing left
+ * in the tuplestore, because we flush that at the end of
+ * processing regular tuples.
+ */
+ if (node->bs_next_range_nulls == node->bs_nranges)
+ {
+ elog(DEBUG1, "phase => FINISHED / last range processed");
+ Assert(node->bs_tuplestore == NULL);
+ node->bs_phase = BRINSORT_FINISHED;
+ node->bs_phase = (nullsFirst) ? BRINSORT_LOAD_RANGE : BRINSORT_FINISHED;
+ break;
+ }
+
+ range = &node->bs_ranges[node->bs_next_range_nulls];
+ node->bs_next_range_nulls++;
+
+ /*
+ * Load the next range that may contain NULL values. We don't update
+ * the watermark here - we only stash the NULL tuples into a tuplestore.
+ */
+ brinsort_start_tidscan(node, range, false, false);
+ brinsort_load_tuples(node, true, true);
+ brinsort_end_tidscan(node);
+
+ node->bs_phase = BRINSORT_PROCESS_NULLS;
+ break;
+ }
+
+ break;
+
+ case BRINSORT_PROCESS_NULLS:
+
+ slot = node->ss.ps.ps_ResultTupleSlot;
+
+ Assert(node->bs_tuplestore != NULL);
+
+ /* read tuples from the tuplesort range, and output them */
+ if (node->bs_tuplestore != NULL)
+ {
+
+ while (tuplestore_gettupleslot(node->bs_tuplestore, true, true, slot))
+ return slot;
+
+ tuplestore_end(node->bs_tuplestore);
+ node->bs_tuplestore = NULL;
+
+ node->bs_phase = BRINSORT_LOAD_NULLS; /* load next range */
+ }
+
+ break;
+
+ case BRINSORT_FINISHED:
+ elog(ERROR, "unexpected BrinSort phase: FINISHED");
+ break;
+ }
+ }
+
+ /*
+ * if we get here it means the index scan failed so we are at the end of
+ * the scan..
+ */
+ node->iss_ReachedEnd = true;
+ return ExecClearTuple(slot);
+}
+
+/*
+ * IndexRecheck -- access method routine to recheck a tuple in EvalPlanQual
+ */
+static bool
+IndexRecheck(BrinSortState *node, TupleTableSlot *slot)
+{
+ ExprContext *econtext;
+
+ /*
+ * extract necessary information from index scan node
+ */
+ econtext = node->ss.ps.ps_ExprContext;
+
+ /* Does the tuple meet the indexqual condition? */
+ econtext->ecxt_scantuple = slot;
+ return ExecQualAndReset(node->indexqualorig, econtext);
+}
+
+
+/* ----------------------------------------------------------------
+ * ExecBrinSort(node)
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecBrinSort(PlanState *pstate)
+{
+ BrinSortState *node = castNode(BrinSortState, pstate);
+
+ /*
+ * If we have runtime keys and they've not already been set up, do it now.
+ */
+ if (node->iss_NumRuntimeKeys != 0 && !node->iss_RuntimeKeysReady)
+ ExecReScan((PlanState *) node);
+
+ return ExecScan(&node->ss,
+ (ExecScanAccessMtd) IndexNext,
+ (ExecScanRecheckMtd) IndexRecheck);
+}
+
+/* ----------------------------------------------------------------
+ * ExecReScanBrinSort(node)
+ *
+ * Recalculates the values of any scan keys whose value depends on
+ * information known at runtime, then rescans the indexed relation.
+ *
+ * ----------------------------------------------------------------
+ */
+void
+ExecReScanBrinSort(BrinSortState *node)
+{
+ /*
+ * If we are doing runtime key calculations (ie, any of the index key
+ * values weren't simple Consts), compute the new key values. But first,
+ * reset the context so we don't leak memory as each outer tuple is
+ * scanned. Note this assumes that we will recalculate *all* runtime keys
+ * on each call.
+ */
+ if (node->iss_NumRuntimeKeys != 0)
+ {
+ ExprContext *econtext = node->iss_RuntimeContext;
+
+ ResetExprContext(econtext);
+ ExecIndexEvalRuntimeKeys(econtext,
+ node->iss_RuntimeKeys,
+ node->iss_NumRuntimeKeys);
+ }
+ node->iss_RuntimeKeysReady = true;
+
+ /* reset index scan */
+ if (node->iss_ScanDesc)
+ index_rescan(node->iss_ScanDesc,
+ node->iss_ScanKeys, node->iss_NumScanKeys,
+ node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+ node->iss_ReachedEnd = false;
+
+ ExecScanReScan(&node->ss);
+}
+
+
+/* ----------------------------------------------------------------
+ * ExecEndBrinSort
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndBrinSort(BrinSortState *node)
+{
+ Relation indexRelationDesc;
+ IndexScanDesc indexScanDesc;
+
+ /*
+ * extract information from the node
+ */
+ indexRelationDesc = node->iss_RelationDesc;
+ indexScanDesc = node->iss_ScanDesc;
+
+ /*
+ * clear out tuple table slots
+ */
+ if (node->ss.ps.ps_ResultTupleSlot)
+ ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ ExecClearTuple(node->ss.ss_ScanTupleSlot);
+
+ /*
+ * close the index relation (no-op if we didn't open it)
+ */
+ if (indexScanDesc)
+ index_endscan(indexScanDesc);
+ if (indexRelationDesc)
+ index_close(indexRelationDesc, NoLock);
+
+ if (node->ss.ss_currentScanDesc != NULL)
+ table_endscan(node->ss.ss_currentScanDesc);
+
+ if (node->bs_tuplestore != NULL)
+ tuplestore_end(node->bs_tuplestore);
+ node->bs_tuplestore = NULL;
+
+ if (node->bs_tuplesortstate != NULL)
+ tuplesort_end(node->bs_tuplesortstate);
+ node->bs_tuplesortstate = NULL;
+}
+
+/* ----------------------------------------------------------------
+ * ExecBrinSortMarkPos
+ *
+ * Note: we assume that no caller attempts to set a mark before having read
+ * at least one tuple. Otherwise, iss_ScanDesc might still be NULL.
+ * ----------------------------------------------------------------
+ */
+void
+ExecBrinSortMarkPos(BrinSortState *node)
+{
+ EState *estate = node->ss.ps.state;
+ EPQState *epqstate = estate->es_epq_active;
+
+ if (epqstate != NULL)
+ {
+ /*
+ * We are inside an EvalPlanQual recheck. If a test tuple exists for
+ * this relation, then we shouldn't access the index at all. We would
+ * instead need to save, and later restore, the state of the
+ * relsubs_done flag, so that re-fetching the test tuple is possible.
+ * However, given the assumption that no caller sets a mark at the
+ * start of the scan, we can only get here with relsubs_done[i]
+ * already set, and so no state need be saved.
+ */
+ Index scanrelid = ((Scan *) node->ss.ps.plan)->scanrelid;
+
+ Assert(scanrelid > 0);
+ if (epqstate->relsubs_slot[scanrelid - 1] != NULL ||
+ epqstate->relsubs_rowmark[scanrelid - 1] != NULL)
+ {
+ /* Verify the claim above */
+ if (!epqstate->relsubs_done[scanrelid - 1])
+ elog(ERROR, "unexpected ExecBrinSortMarkPos call in EPQ recheck");
+ return;
+ }
+ }
+
+ index_markpos(node->iss_ScanDesc);
+}
+
+/* ----------------------------------------------------------------
+ * ExecBrinSortRestrPos
+ * ----------------------------------------------------------------
+ */
+void
+ExecBrinSortRestrPos(BrinSortState *node)
+{
+ EState *estate = node->ss.ps.state;
+ EPQState *epqstate = estate->es_epq_active;
+
+ if (estate->es_epq_active != NULL)
+ {
+ /* See comments in ExecBrinSortMarkPos */
+ Index scanrelid = ((Scan *) node->ss.ps.plan)->scanrelid;
+
+ Assert(scanrelid > 0);
+ if (epqstate->relsubs_slot[scanrelid - 1] != NULL ||
+ epqstate->relsubs_rowmark[scanrelid - 1] != NULL)
+ {
+ /* Verify the claim above */
+ if (!epqstate->relsubs_done[scanrelid - 1])
+ elog(ERROR, "unexpected ExecBrinSortRestrPos call in EPQ recheck");
+ return;
+ }
+ }
+
+ index_restrpos(node->iss_ScanDesc);
+}
+
+
+/*
+ * We always sort the ranges so that we have them in this general order
+ *
+ * 1) ranges sorted by min/max value, as dictated by ASC/DESC
+ * 2) all-null ranges
+ * 3) not-summarized ranges
+ *
+ */
+static int
+brin_sort_range_asc_cmp(const void *a, const void *b, void *arg)
+{
+ int r;
+ BrinSortRange *ra = (BrinSortRange *) a;
+ BrinSortRange *rb = (BrinSortRange *) b;
+ SortSupport ssup = (SortSupport) arg;
+
+ /* unsummarized ranges are sorted last */
+ if (ra->not_summarized && rb->not_summarized)
+ return 0;
+ else if (ra->not_summarized)
+ return -1;
+ else if (rb->not_summarized)
+ return 1;
+
+ Assert(!(ra->not_summarized || rb->not_summarized));
+
+ /* then we sort all-null ranges */
+ if (ra->all_nulls && rb->all_nulls)
+ return 0;
+ else if (ra->all_nulls)
+ return -1;
+ else if (rb->all_nulls)
+ return 1;
+
+ Assert(!(ra->all_nulls || rb->all_nulls));
+
+ r = ApplySortComparator(ra->max_value, false, rb->max_value, false, ssup);
+ if (r != 0)
+ return r;
+
+ return ApplySortComparator(ra->min_value, false, rb->min_value, false, ssup);
+}
+
+static int
+brin_sort_range_desc_cmp(const void *a, const void *b, void *arg)
+{
+ int r;
+ BrinSortRange *ra = (BrinSortRange *) a;
+ BrinSortRange *rb = (BrinSortRange *) b;
+ SortSupport ssup = (SortSupport) arg;
+
+ /* unsummarized ranges are sorted last */
+ if (ra->not_summarized && rb->not_summarized)
+ return 0;
+ else if (ra->not_summarized)
+ return -1;
+ else if (rb->not_summarized)
+ return 1;
+
+ Assert(!(ra->not_summarized || rb->not_summarized));
+
+ /* then we sort all-null ranges */
+ if (ra->all_nulls && rb->all_nulls)
+ return 0;
+ else if (ra->all_nulls)
+ return -1;
+ else if (rb->all_nulls)
+ return 1;
+
+ Assert(!(ra->all_nulls || rb->all_nulls));
+
+ r = ApplySortComparator(ra->min_value, false, rb->min_value, false, ssup);
+ if (r != 0)
+ return r;
+
+ return ApplySortComparator(ra->max_value, false, rb->max_value, false, ssup);
+}
+
+static int
+brin_sort_rangeptr_asc_cmp(const void *a, const void *b, void *arg)
+{
+ BrinSortRange *ra = *(BrinSortRange **) a;
+ BrinSortRange *rb = *(BrinSortRange **) b;
+ SortSupport ssup = (SortSupport) arg;
+
+ /* unsummarized ranges are sorted last */
+ if (ra->not_summarized && rb->not_summarized)
+ return 0;
+ else if (ra->not_summarized)
+ return -1;
+ else if (rb->not_summarized)
+ return 1;
+
+ Assert(!(ra->not_summarized || rb->not_summarized));
+
+ /* then we sort all-null ranges */
+ if (ra->all_nulls && rb->all_nulls)
+ return 0;
+ else if (ra->all_nulls)
+ return -1;
+ else if (rb->all_nulls)
+ return 1;
+
+ Assert(!(ra->all_nulls || rb->all_nulls));
+
+ return ApplySortComparator(ra->min_value, false, rb->min_value, false, ssup);
+}
+
+static int
+brin_sort_rangeptr_desc_cmp(const void *a, const void *b, void *arg)
+{
+ BrinSortRange *ra = *(BrinSortRange **) a;
+ BrinSortRange *rb = *(BrinSortRange **) b;
+ SortSupport ssup = (SortSupport) arg;
+
+ /* unsummarized ranges are sorted last */
+ if (ra->not_summarized && rb->not_summarized)
+ return 0;
+ else if (ra->not_summarized)
+ return -1;
+ else if (rb->not_summarized)
+ return 1;
+
+ Assert(!(ra->not_summarized || rb->not_summarized));
+
+ /* then we sort all-null ranges */
+ if (ra->all_nulls && rb->all_nulls)
+ return 0;
+ else if (ra->all_nulls)
+ return -1;
+ else if (rb->all_nulls)
+ return 1;
+
+ Assert(!(ra->all_nulls || rb->all_nulls));
+
+ return ApplySortComparator(ra->max_value, false, rb->max_value, false, ssup);
+}
+
+/*
+ * somewhat crippled version of bringetbitmap
+ *
+ * XXX We don't call consistent function (or any other function), so unlike
+ * bringetbitmap we don't set a separate memory context. If we end up filtering
+ * the ranges somehow (e.g. by WHERE conditions), this might be necessary.
+ *
+ * XXX Should be part of opclass, to somewhere in brin_minmax.c etc.
+ */
+static void
+ExecInitBrinSortRanges(BrinSort *node, BrinSortState *planstate)
+{
+ IndexScanDesc scan = planstate->iss_ScanDesc;
+ Relation indexRel = planstate->iss_RelationDesc;
+ int attno;
+ FmgrInfo *rangeproc;
+ BrinRanges *ranges;
+
+ /* BRIN Sort only allows ORDER BY using a single column */
+ Assert(node->numCols == 1);
+
+ /*
+ * Determine index attnum we're interested in. The sortColIdx has attnums
+ * from the table, but we need index attnum so that we can fetch the right
+ * range summary.
+ *
+ * XXX Maybe we could/should arrange the tlists differently, so that this
+ * is not necessary?
+ */
+ attno = 0;
+ for (int i = 0; i < indexRel->rd_index->indnatts; i++)
+ {
+ if (indexRel->rd_index->indkey.values[i] == node->sortColIdx[0])
+ {
+ attno = (i + 1);
+ break;
+ }
+ }
+
+ /* get procedure to generate sort ranges */
+ rangeproc = index_getprocinfo(indexRel, attno, BRIN_PROCNUM_RANGES);
+
+ /*
+ * Should not get here without a proc, thanks to the check before
+ * building the BrinSort path.
+ */
+ Assert(OidIsValid(rangeproc));
+
+ /* XXX maybe call this in a separate memory context? */
+ ranges = (BrinRanges *) DatumGetPointer(FunctionCall2Coll(rangeproc,
+ InvalidOid, /* FIXME use proper collation*/
+ PointerGetDatum(scan),
+ Int16GetDatum(attno)));
+
+ /* allocate space for the ranges, and also for the alternative ordering */
+ planstate->bs_nranges = 0;
+ planstate->bs_ranges = (BrinSortRange *) palloc0(ranges->nranges * sizeof(BrinSortRange));
+ planstate->bs_ranges_minval = (BrinSortRange **) palloc0(ranges->nranges * sizeof(BrinSortRange *));
+
+ for (int i = 0; i < ranges->nranges; i++)
+ {
+ planstate->bs_ranges[i].blkno_start = ranges->ranges[i].blkno_start;
+ planstate->bs_ranges[i].blkno_end = ranges->ranges[i].blkno_end;
+ planstate->bs_ranges[i].min_value = ranges->ranges[i].min_value;
+ planstate->bs_ranges[i].max_value = ranges->ranges[i].max_value;
+ planstate->bs_ranges[i].has_nulls = ranges->ranges[i].has_nulls;
+ planstate->bs_ranges[i].all_nulls = ranges->ranges[i].all_nulls;
+ planstate->bs_ranges[i].not_summarized = ranges->ranges[i].not_summarized;
+
+ planstate->bs_ranges_minval[i] = &planstate->bs_ranges[i];
+ }
+
+ planstate->bs_nranges = ranges->nranges;
+
+ /*
+ * Sort ranges by maximum value, as determined by the sort operator.
+ *
+ * This automatically considers the ASC/DESC, because for DESC we use
+ * an operator that deems the "min_value" value greater.
+ *
+ * XXX Not sure what to do about NULLS FIRST / LAST.
+ */
+ memset(&planstate->bs_sortsupport, 0, sizeof(SortSupportData));
+ PrepareSortSupportFromOrderingOp(node->sortOperators[0], &planstate->bs_sortsupport);
+
+ /*
+ * We need to sort by max_value in the first step, so that we can add
+ * ranges incrementally, as they add "minimum" number of rows.
+ *
+ * But then in the second step we need to add all intersecting ranges X
+ * until X.min_value > A.max_value (where A is the range added in first
+ * step). And for that we probably need a separate sort by min_value,
+ * perhaps of just a pointer array, pointing back to bs_ranges.
+ *
+ * For DESC sort this works the opposite way, i.e. first step sort by
+ * min_value, then max_value.
+ */
+ if (ScanDirectionIsForward(node->indexorderdir))
+ {
+ qsort_arg(planstate->bs_ranges, planstate->bs_nranges, sizeof(BrinSortRange),
+ brin_sort_range_asc_cmp, &planstate->bs_sortsupport);
+
+ qsort_arg(planstate->bs_ranges_minval, planstate->bs_nranges, sizeof(BrinSortRange *),
+ brin_sort_rangeptr_asc_cmp, &planstate->bs_sortsupport);
+ }
+ else
+ {
+ qsort_arg(planstate->bs_ranges, planstate->bs_nranges, sizeof(BrinSortRange),
+ brin_sort_range_desc_cmp, &planstate->bs_sortsupport);
+
+ qsort_arg(planstate->bs_ranges_minval, planstate->bs_nranges, sizeof(BrinSortRange *),
+ brin_sort_rangeptr_desc_cmp, &planstate->bs_sortsupport);
+ }
+}
+
+/* ----------------------------------------------------------------
+ * ExecInitBrinSort
+ *
+ * Initializes the index scan's state information, creates
+ * scan keys, and opens the base and index relations.
+ *
+ * Note: index scans have 2 sets of state information because
+ * we have to keep track of the base relation and the
+ * index relation.
+ * ----------------------------------------------------------------
+ */
+BrinSortState *
+ExecInitBrinSort(BrinSort *node, EState *estate, int eflags)
+{
+ BrinSortState *indexstate;
+ Relation currentRelation;
+ LOCKMODE lockmode;
+
+ /*
+ * create state structure
+ */
+ indexstate = makeNode(BrinSortState);
+ indexstate->ss.ps.plan = (Plan *) node;
+ indexstate->ss.ps.state = estate;
+ indexstate->ss.ps.ExecProcNode = ExecBrinSort;
+
+ /*
+ * Miscellaneous initialization
+ *
+ * create expression context for node
+ */
+ ExecAssignExprContext(estate, &indexstate->ss.ps);
+
+ /*
+ * open the scan relation
+ */
+ currentRelation = ExecOpenScanRelation(estate, node->scan.scanrelid, eflags);
+
+ indexstate->ss.ss_currentRelation = currentRelation;
+ indexstate->ss.ss_currentScanDesc = NULL; /* no heap scan here */
+
+ /*
+ * get the scan type from the relation descriptor.
+ */
+ ExecInitScanTupleSlot(estate, &indexstate->ss,
+ RelationGetDescr(currentRelation),
+ table_slot_callbacks(currentRelation));
+
+ /*
+ * Initialize result type and projection.
+ */
+ ExecInitResultTypeTL(&indexstate->ss.ps);
+ ExecAssignScanProjectionInfo(&indexstate->ss);
+
+ /*
+ * initialize child expressions
+ *
+ * Note: we don't initialize all of the indexqual expression, only the
+ * sub-parts corresponding to runtime keys (see below). Likewise for
+ * indexorderby, if any. But the indexqualorig expression is always
+ * initialized even though it will only be used in some uncommon cases ---
+ * would be nice to improve that. (Problem is that any SubPlans present
+ * in the expression must be found now...)
+ */
+ indexstate->ss.ps.qual =
+ ExecInitQual(node->scan.plan.qual, (PlanState *) indexstate);
+ indexstate->indexqualorig =
+ ExecInitQual(node->indexqualorig, (PlanState *) indexstate);
+
+ /*
+ * If we are just doing EXPLAIN (ie, aren't going to run the plan), stop
+ * here. This allows an index-advisor plugin to EXPLAIN a plan containing
+ * references to nonexistent indexes.
+ */
+ if (eflags & EXEC_FLAG_EXPLAIN_ONLY)
+ return indexstate;
+
+ /* Open the index relation. */
+ lockmode = exec_rt_fetch(node->scan.scanrelid, estate)->rellockmode;
+ indexstate->iss_RelationDesc = index_open(node->indexid, lockmode);
+
+ /*
+ * Initialize index-specific scan state
+ */
+ indexstate->iss_RuntimeKeysReady = false;
+ indexstate->iss_RuntimeKeys = NULL;
+ indexstate->iss_NumRuntimeKeys = 0;
+
+ /*
+ * build the index scan keys from the index qualification
+ */
+ ExecIndexBuildScanKeys((PlanState *) indexstate,
+ indexstate->iss_RelationDesc,
+ node->indexqual,
+ false,
+ &indexstate->iss_ScanKeys,
+ &indexstate->iss_NumScanKeys,
+ &indexstate->iss_RuntimeKeys,
+ &indexstate->iss_NumRuntimeKeys,
+ NULL, /* no ArrayKeys */
+ NULL);
+
+ /*
+ * If we have runtime keys, we need an ExprContext to evaluate them. The
+ * node's standard context won't do because we want to reset that context
+ * for every tuple. So, build another context just like the other one...
+ * -tgl 7/11/00
+ */
+ if (indexstate->iss_NumRuntimeKeys != 0)
+ {
+ ExprContext *stdecontext = indexstate->ss.ps.ps_ExprContext;
+
+ ExecAssignExprContext(estate, &indexstate->ss.ps);
+ indexstate->iss_RuntimeContext = indexstate->ss.ps.ps_ExprContext;
+ indexstate->ss.ps.ps_ExprContext = stdecontext;
+ }
+ else
+ {
+ indexstate->iss_RuntimeContext = NULL;
+ }
+
+ indexstate->bs_tuplesortstate = NULL;
+ indexstate->bs_qual = indexstate->ss.ps.qual;
+ indexstate->ss.ps.qual = NULL;
+ ExecInitResultTupleSlotTL(&indexstate->ss.ps, &TTSOpsMinimalTuple);
+
+ /*
+ * all done.
+ */
+ return indexstate;
+}
+
+/* ----------------------------------------------------------------
+ * Parallel Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ * ExecBrinSortEstimate
+ *
+ * Compute the amount of space we'll need in the parallel
+ * query DSM, and inform pcxt->estimator about our needs.
+ * ----------------------------------------------------------------
+ */
+void
+ExecBrinSortEstimate(BrinSortState *node,
+ ParallelContext *pcxt)
+{
+ EState *estate = node->ss.ps.state;
+
+ node->iss_PscanLen = index_parallelscan_estimate(node->iss_RelationDesc,
+ estate->es_snapshot);
+ shm_toc_estimate_chunk(&pcxt->estimator, node->iss_PscanLen);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ * ExecBrinSortInitializeDSM
+ *
+ * Set up a parallel index scan descriptor.
+ * ----------------------------------------------------------------
+ */
+void
+ExecBrinSortInitializeDSM(BrinSortState *node,
+ ParallelContext *pcxt)
+{
+ EState *estate = node->ss.ps.state;
+ ParallelIndexScanDesc piscan;
+
+ piscan = shm_toc_allocate(pcxt->toc, node->iss_PscanLen);
+ index_parallelscan_initialize(node->ss.ss_currentRelation,
+ node->iss_RelationDesc,
+ estate->es_snapshot,
+ piscan);
+ shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, piscan);
+ node->iss_ScanDesc =
+ index_beginscan_parallel(node->ss.ss_currentRelation,
+ node->iss_RelationDesc,
+ node->iss_NumScanKeys,
+ node->iss_NumOrderByKeys,
+ piscan);
+
+ /*
+ * If no run-time keys to calculate or they are ready, go ahead and pass
+ * the scankeys to the index AM.
+ */
+ if (node->iss_NumRuntimeKeys == 0 || node->iss_RuntimeKeysReady)
+ index_rescan(node->iss_ScanDesc,
+ node->iss_ScanKeys, node->iss_NumScanKeys,
+ node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+}
+
+/* ----------------------------------------------------------------
+ * ExecBrinSortReInitializeDSM
+ *
+ * Reset shared state before beginning a fresh scan.
+ * ----------------------------------------------------------------
+ */
+void
+ExecBrinSortReInitializeDSM(BrinSortState *node,
+ ParallelContext *pcxt)
+{
+ index_parallelrescan(node->iss_ScanDesc);
+}
+
+/* ----------------------------------------------------------------
+ * ExecBrinSortInitializeWorker
+ *
+ * Copy relevant information from TOC into planstate.
+ * ----------------------------------------------------------------
+ */
+void
+ExecBrinSortInitializeWorker(BrinSortState *node,
+ ParallelWorkerContext *pwcxt)
+{
+ ParallelIndexScanDesc piscan;
+
+ piscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
+ node->iss_ScanDesc =
+ index_beginscan_parallel(node->ss.ss_currentRelation,
+ node->iss_RelationDesc,
+ node->iss_NumScanKeys,
+ node->iss_NumOrderByKeys,
+ piscan);
+
+ /*
+ * If no run-time keys to calculate or they are ready, go ahead and pass
+ * the scankeys to the index AM.
+ */
+ if (node->iss_NumRuntimeKeys == 0 || node->iss_RuntimeKeysReady)
+ index_rescan(node->iss_ScanDesc,
+ node->iss_ScanKeys, node->iss_NumScanKeys,
+ node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 4c6b1d1f55b..64d103b19e9 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -790,6 +790,260 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
path->path.total_cost = startup_cost + run_cost;
}
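+
+/*
+ * cost_brinsort
+ *	  Determines and returns the cost of a BRIN Sort path, closely
+ *	  mirroring cost_index() above.
+ */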
+void
+cost_brinsort(BrinSortPath *path, PlannerInfo *root, double loop_count,
+ bool partial_path)
+{
+ IndexOptInfo *index = path->ipath.indexinfo;
+ RelOptInfo *baserel = index->rel;
+ amcostestimate_function amcostestimate;
+ List *qpquals;
+ Cost startup_cost = 0;
+ Cost run_cost = 0;
+ Cost cpu_run_cost = 0;
+ Cost indexStartupCost;
+ Cost indexTotalCost;
+ Selectivity indexSelectivity;
+ double indexCorrelation,
+ csquared;
+ double spc_seq_page_cost,
+ spc_random_page_cost;
+ Cost min_IO_cost,
+ max_IO_cost;
+ QualCost qpqual_cost;
+ Cost cpu_per_tuple;
+ double tuples_fetched;
+ double pages_fetched;
+ double rand_heap_pages;
+ double index_pages;
+
+ /* Should only be applied to base relations */
+ Assert(IsA(baserel, RelOptInfo) &&
+ IsA(index, IndexOptInfo));
+ Assert(baserel->relid > 0);
+ Assert(baserel->rtekind == RTE_RELATION);
+
+ /*
+ * Mark the path with the correct row estimate, and identify which quals
+ * will need to be enforced as qpquals. We need not check any quals that
+ * are implied by the index's predicate, so we can use indrestrictinfo not
+ * baserestrictinfo as the list of relevant restriction clauses for the
+ * rel.
+ */
+ if (path->ipath.path.param_info)
+ {
+ path->ipath.path.rows = path->ipath.path.param_info->ppi_rows;
+ /* qpquals come from the rel's restriction clauses and ppi_clauses */
+ qpquals = list_concat(extract_nonindex_conditions(path->ipath.indexinfo->indrestrictinfo,
+ path->ipath.indexclauses),
+ extract_nonindex_conditions(path->ipath.path.param_info->ppi_clauses,
+ path->ipath.indexclauses));
+ }
+ else
+ {
+ path->ipath.path.rows = baserel->rows;
+ /* qpquals come from just the rel's restriction clauses */
+ qpquals = extract_nonindex_conditions(path->ipath.indexinfo->indrestrictinfo,
+ path->ipath.indexclauses);
+ }
+
+ if (!enable_indexscan)
+ startup_cost += disable_cost;
+ /* we need not check enable_brinsort here; indxpath.c does that when building the path */
+
+ /*
+ * Call index-access-method-specific code to estimate the processing cost
+ * for scanning the index, as well as the selectivity of the index (ie,
+ * the fraction of main-table tuples we will have to retrieve) and its
+ * correlation to the main-table tuple order. We need a cast here because
+ * pathnodes.h uses a weak function type to avoid including amapi.h.
+ */
+ amcostestimate = (amcostestimate_function) index->amcostestimate;
+ amcostestimate(root, &path->ipath, loop_count,
+ &indexStartupCost, &indexTotalCost,
+ &indexSelectivity, &indexCorrelation,
+ &index_pages);
+
+ /*
+ * Save amcostestimate's results for possible use in bitmap scan planning.
+ * We don't bother to save indexStartupCost or indexCorrelation, because a
+ * bitmap scan doesn't care about either.
+ */
+ path->ipath.indextotalcost = indexTotalCost;
+ path->ipath.indexselectivity = indexSelectivity;
+
+ /* all costs for touching index itself included here */
+ startup_cost += indexStartupCost;
+ run_cost += indexTotalCost - indexStartupCost;
+
+ /* estimate number of main-table tuples fetched */
+ tuples_fetched = clamp_row_est(indexSelectivity * baserel->tuples);
+
+ /* fetch estimated page costs for tablespace containing table */
+ get_tablespace_page_costs(baserel->reltablespace,
+ &spc_random_page_cost,
+ &spc_seq_page_cost);
+
+ /*----------
+ * Estimate number of main-table pages fetched, and compute I/O cost.
+ *
+ * When the index ordering is uncorrelated with the table ordering,
+ * we use an approximation proposed by Mackert and Lohman (see
+ * index_pages_fetched() for details) to compute the number of pages
+ * fetched, and then charge spc_random_page_cost per page fetched.
+ *
+ * When the index ordering is exactly correlated with the table ordering
+ * (just after a CLUSTER, for example), the number of pages fetched should
+ * be exactly selectivity * table_size. What's more, all but the first
+ * will be sequential fetches, not the random fetches that occur in the
+ * uncorrelated case. So if the number of pages is more than 1, we
+ * ought to charge
+ * spc_random_page_cost + (pages_fetched - 1) * spc_seq_page_cost
+ * For partially-correlated indexes, we ought to charge somewhere between
+ * these two estimates. We currently interpolate linearly between the
+ * estimates based on the correlation squared (XXX is that appropriate?).
+ *
+ * If it's an index-only scan, then we will not need to fetch any heap
+ * pages for which the visibility map shows all tuples are visible.
+ * Hence, reduce the estimated number of heap fetches accordingly.
+ * We use the measured fraction of the entire heap that is all-visible,
+ * which might not be particularly relevant to the subset of the heap
+ * that this query will fetch; but it's not clear how to do better.
+ *----------
+ */
+ if (loop_count > 1)
+ {
+ /*
+ * For repeated indexscans, the appropriate estimate for the
+ * uncorrelated case is to scale up the number of tuples fetched in
+ * the Mackert and Lohman formula by the number of scans, so that we
+ * estimate the number of pages fetched by all the scans; then
+ * pro-rate the costs for one scan. In this case we assume all the
+ * fetches are random accesses.
+ */
+ pages_fetched = index_pages_fetched(tuples_fetched * loop_count,
+ baserel->pages,
+ (double) index->pages,
+ root);
+
+ rand_heap_pages = pages_fetched;
+
+ max_IO_cost = (pages_fetched * spc_random_page_cost) / loop_count;
+
+ /*
+ * In the perfectly correlated case, the number of pages touched by
+ * each scan is selectivity * table_size, and we can use the Mackert
+ * and Lohman formula at the page level to estimate how much work is
+ * saved by caching across scans. We still assume all the fetches are
+ * random, though, which is an overestimate that's hard to correct for
+ * without double-counting the cache effects. (But in most cases
+ * where such a plan is actually interesting, only one page would get
+ * fetched per scan anyway, so it shouldn't matter much.)
+ */
+ pages_fetched = ceil(indexSelectivity * (double) baserel->pages);
+
+ pages_fetched = index_pages_fetched(pages_fetched * loop_count,
+ baserel->pages,
+ (double) index->pages,
+ root);
+
+ min_IO_cost = (pages_fetched * spc_random_page_cost) / loop_count;
+ }
+ else
+ {
+ /*
+ * Normal case: apply the Mackert and Lohman formula, and then
+ * interpolate between that and the correlation-derived result.
+ */
+ pages_fetched = index_pages_fetched(tuples_fetched,
+ baserel->pages,
+ (double) index->pages,
+ root);
+
+ rand_heap_pages = pages_fetched;
+
+ /* max_IO_cost is for the perfectly uncorrelated case (csquared=0) */
+ max_IO_cost = pages_fetched * spc_random_page_cost;
+
+ /* min_IO_cost is for the perfectly correlated case (csquared=1) */
+ pages_fetched = ceil(indexSelectivity * (double) baserel->pages);
+
+ if (pages_fetched > 0)
+ {
+ min_IO_cost = spc_random_page_cost;
+ if (pages_fetched > 1)
+ min_IO_cost += (pages_fetched - 1) * spc_seq_page_cost;
+ }
+ else
+ min_IO_cost = 0;
+ }
+
+ if (partial_path)
+ {
+ /*
+ * Estimate the number of parallel workers required to scan index. Use
+ * the number of heap pages computed considering heap fetches won't be
+ * sequential as for parallel scans the pages are accessed in random
+ * order.
+ */
+ path->ipath.path.parallel_workers = compute_parallel_worker(baserel,
+ rand_heap_pages,
+ index_pages,
+ max_parallel_workers_per_gather);
+
+ /*
+ * Fall out if workers can't be assigned for parallel scan, because in
+ * such a case this path will be rejected. So there is no benefit in
+ * doing extra computation.
+ */
+ if (path->ipath.path.parallel_workers <= 0)
+ return;
+
+ path->ipath.path.parallel_aware = true;
+ }
+
+ /*
+ * Now interpolate based on estimated index order correlation to get total
+ * disk I/O cost for main table accesses.
+ */
+ csquared = indexCorrelation * indexCorrelation;
+
+ run_cost += max_IO_cost + csquared * (min_IO_cost - max_IO_cost);
+
+ /*
+ * Estimate CPU costs per tuple.
+ *
+ * What we want here is cpu_tuple_cost plus the evaluation costs of any
+ * qual clauses that we have to evaluate as qpquals.
+ */
+ cost_qual_eval(&qpqual_cost, qpquals, root);
+
+ startup_cost += qpqual_cost.startup;
+ cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
+
+ cpu_run_cost += cpu_per_tuple * tuples_fetched;
+
+ /* tlist eval costs are paid per output row, not per tuple scanned */
+ startup_cost += path->ipath.path.pathtarget->cost.startup;
+ cpu_run_cost += path->ipath.path.pathtarget->cost.per_tuple * path->ipath.path.rows;
+
+ /* Adjust costing for parallelism, if used. */
+ if (path->ipath.path.parallel_workers > 0)
+ {
+ double parallel_divisor = get_parallel_divisor(&path->ipath.path);
+
+ path->ipath.path.rows = clamp_row_est(path->ipath.path.rows / parallel_divisor);
+
+ /* The CPU cost is divided among all the workers. */
+ cpu_run_cost /= parallel_divisor;
+ }
+
+ run_cost += cpu_run_cost;
+
+ path->ipath.path.startup_cost = startup_cost;
+ path->ipath.path.total_cost = startup_cost + run_cost;
+}
+
/*
* extract_nonindex_conditions
*
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index c31fcc917df..6ba4347dbdc 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -17,12 +17,16 @@
#include <math.h>
+#include "access/brin_internal.h"
+#include "access/relation.h"
#include "access/stratnum.h"
#include "access/sysattr.h"
#include "catalog/pg_am.h"
#include "catalog/pg_operator.h"
+#include "catalog/pg_opclass.h"
#include "catalog/pg_opfamily.h"
#include "catalog/pg_type.h"
+#include "miscadmin.h"
#include "nodes/makefuncs.h"
#include "nodes/nodeFuncs.h"
#include "nodes/supportnodes.h"
@@ -32,10 +36,13 @@
#include "optimizer/paths.h"
#include "optimizer/prep.h"
#include "optimizer/restrictinfo.h"
+#include "utils/rel.h"
#include "utils/lsyscache.h"
#include "utils/selfuncs.h"
+bool enable_brinsort = false;
+
/* XXX see PartCollMatchesExprColl */
#define IndexCollMatchesExprColl(idxcollation, exprcollation) \
((idxcollation) == InvalidOid || (idxcollation) == (exprcollation))
@@ -1127,6 +1134,196 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
}
}
+ /*
+ * If this is a BRIN index with suitable opclass (minmax or such), we may
+ * try doing BRIN sort. BRIN indexes are not ordered and amcanorderbyop
+ * is set to false, so we probably will need some new opclass flag to
+ * mark indexes that support this.
+ */
+ if (enable_brinsort && pathkeys_possibly_useful)
+ {
+ ListCell *lc;
+ Relation rel2 = relation_open(index->indexoid, NoLock);
+ int idx;
+
+ /*
+ * Try generating sorted paths for each key with the right opclass.
+ */
+ idx = -1;
+ foreach(lc, index->indextlist)
+ {
+ TargetEntry *indextle = (TargetEntry *) lfirst(lc);
+ BrinSortPath *bpath;
+ Oid rangeproc;
+ AttrNumber attnum;
+
+ idx++;
+ attnum = (idx + 1);
+
+ /* skip expressions for now */
+ if (!AttributeNumberIsValid(index->indexkeys[idx]))
+ continue;
+
+ /* XXX ignore non-BRIN indexes */
+ if (rel2->rd_rel->relam != BRIN_AM_OID)
+ continue;
+
+ /*
+ * XXX Ignore keys not using an opclass with the "ranges" proc.
+ * For now we only do this for some minmax opclasses, but adding
+ * it to all minmax is simple, and adding it to minmax-multi
+ * should not be very hard.
+ */
+ rangeproc = index_getprocid(rel2, attnum, BRIN_PROCNUM_RANGES);
+ if (!OidIsValid(rangeproc))
+ continue;
+
+ orderbyclauses = NIL;
+ orderbyclausecols = NIL;
+
+ /*
+ * XXX stuff extracted from build_index_pathkeys, except that we
+ * only deal with a single index key (producing a single pathkey),
+ * so we only sort on a single column. I guess we could use more
+ * index keys and sort on more expressions? Would that mean these
+ * keys need to be rather well correlated? In any case, it seems
+ * rather complex to implement, so I leave it as a possible
+ * future improvement.
+ *
+ * XXX This could also use the other BRIN keys (even from other
+ * indexes) in a different way - we might use the other ranges
+ * to quickly eliminate some of the chunks, essentially like a
+ * bitmap, but maybe without using the bitmap. Or we might use
+ * other indexes through bitmaps.
+ *
+ * XXX This fakes a number of parameters, because we don't store
+ * the btree opclass in the index, instead we use the default
+ * one for the key data type. And BRIN does not allow specifying
+ * ASC/DESC or NULLS FIRST/LAST for a key, either.
+ *
+ * XXX We don't add the path to result, because this function is
+ * supposed to generate IndexPaths. Instead, we just add the path
+ * using add_path(). We should be building this in a different
+ * place, perhaps in create_index_paths() or so.
+ *
+ * XXX By building it elsewhere, we could also leverage the index
+ * paths we've built here, particularly the bitmap index paths,
+ * which we could use to eliminate many of the ranges.
+ *
+ * XXX We don't have any explicit ordering associated with the
+ * BRIN index, e.g. we don't have ASC/DESC and NULLS FIRST/LAST.
+ * So this is not encoded in the index, and we can satisfy all
+ * these cases - but we need to add paths for each combination.
+ * I wonder if there's a better way to do this.
+ */
+
+ /* ASC NULLS LAST */
+ index_pathkeys = build_index_pathkeys_brin(root, index, indextle,
+ idx,
+ false, /* reverse_sort */
+ false); /* nulls_first */
+
+ useful_pathkeys = truncate_useless_pathkeys(root, rel,
+ index_pathkeys);
+
+ if (useful_pathkeys != NIL)
+ {
+ bpath = create_brinsort_path(root, index,
+ index_clauses,
+ orderbyclauses,
+ orderbyclausecols,
+ useful_pathkeys,
+ ForwardScanDirection,
+ index_only_scan,
+ outer_relids,
+ loop_count,
+ false);
+
+ /* cheat and add it anyway */
+ add_path(rel, (Path *) bpath);
+ }
+
+ /* DESC NULLS LAST */
+ index_pathkeys = build_index_pathkeys_brin(root, index, indextle,
+ idx,
+ true, /* reverse_sort */
+ false); /* nulls_first */
+
+ useful_pathkeys = truncate_useless_pathkeys(root, rel,
+ index_pathkeys);
+
+ if (useful_pathkeys != NIL)
+ {
+ bpath = create_brinsort_path(root, index,
+ index_clauses,
+ orderbyclauses,
+ orderbyclausecols,
+ useful_pathkeys,
+ BackwardScanDirection,
+ index_only_scan,
+ outer_relids,
+ loop_count,
+ false);
+
+ /* cheat and add it anyway */
+ add_path(rel, (Path *) bpath);
+ }
+
+ /* ASC NULLS FIRST */
+ index_pathkeys = build_index_pathkeys_brin(root, index, indextle,
+ idx,
+ false, /* reverse_sort */
+ true); /* nulls_first */
+
+ useful_pathkeys = truncate_useless_pathkeys(root, rel,
+ index_pathkeys);
+
+ if (useful_pathkeys != NIL)
+ {
+ bpath = create_brinsort_path(root, index,
+ index_clauses,
+ orderbyclauses,
+ orderbyclausecols,
+ useful_pathkeys,
+ ForwardScanDirection,
+ index_only_scan,
+ outer_relids,
+ loop_count,
+ false);
+
+ /* cheat and add it anyway */
+ add_path(rel, (Path *) bpath);
+ }
+
+ /* DESC NULLS FIRST */
+ index_pathkeys = build_index_pathkeys_brin(root, index, indextle,
+ idx,
+ true, /* reverse_sort */
+ true); /* nulls_first */
+
+ useful_pathkeys = truncate_useless_pathkeys(root, rel,
+ index_pathkeys);
+
+ if (useful_pathkeys != NIL)
+ {
+ bpath = create_brinsort_path(root, index,
+ index_clauses,
+ orderbyclauses,
+ orderbyclausecols,
+ useful_pathkeys,
+ BackwardScanDirection,
+ index_only_scan,
+ outer_relids,
+ loop_count,
+ false);
+
+ /* cheat and add it anyway */
+ add_path(rel, (Path *) bpath);
+ }
+ }
+
+ relation_close(rel2, NoLock);
+ }
+
return result;
}
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index a9943cd6e01..83dde6f22eb 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -27,6 +27,7 @@
#include "optimizer/paths.h"
#include "partitioning/partbounds.h"
#include "utils/lsyscache.h"
+#include "utils/typcache.h"
static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
@@ -630,6 +631,55 @@ build_index_pathkeys(PlannerInfo *root,
return retval;
}
+
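+/*
+ * build_index_pathkeys_brin
+ *	  Build a one-element pathkey list describing the ordering implied by
+ *	  the given BRIN index key, for the requested sort direction and NULLS
+ *	  placement. Returns NIL if no valid pathkey can be constructed.
+ */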
+List *
+build_index_pathkeys_brin(PlannerInfo *root,
+ IndexOptInfo *index,
+ TargetEntry *tle,
+ int idx,
+ bool reverse_sort,
+ bool nulls_first)
+{
+ TypeCacheEntry *typcache;
+ PathKey *cpathkey;
+ Oid sortopfamily;
+
+ /*
+ * Get default btree opfamily for the type, extracted from the
+ * entry in index targetlist.
+ *
+ * XXX Is there a better / more correct way to do this?
+ */
+ typcache = lookup_type_cache(exprType((Node *) tle->expr),
+ TYPECACHE_BTREE_OPFAMILY);
+ sortopfamily = typcache->btree_opf;
+
+ /*
+ * OK, try to make a canonical pathkey for this sort key. Note we're
+ * underneath any outer joins, so nullable_relids should be NULL.
+ */
+ cpathkey = make_pathkey_from_sortinfo(root,
+ tle->expr,
+ NULL,
+ sortopfamily,
+ index->opcintype[idx],
+ index->indexcollations[idx],
+ reverse_sort,
+ nulls_first,
+ 0,
+ index->rel->relids,
+ false);
+
+ /*
+ * There may be no pathkey if we haven't matched any sortkey, in which
+ * case ignore it.
+ */
+ if (!cpathkey)
+ return NIL;
+
+ return list_make1(cpathkey);
+}
+
/*
* partkey_is_bool_constant_for_query
*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index ab4d8e201df..63ffdf9a6ab 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -124,6 +124,8 @@ static SampleScan *create_samplescan_plan(PlannerInfo *root, Path *best_path,
List *tlist, List *scan_clauses);
static Scan *create_indexscan_plan(PlannerInfo *root, IndexPath *best_path,
List *tlist, List *scan_clauses, bool indexonly);
+static BrinSort *create_brinsort_plan(PlannerInfo *root, BrinSortPath *best_path,
+ List *tlist, List *scan_clauses);
static BitmapHeapScan *create_bitmap_scan_plan(PlannerInfo *root,
BitmapHeapPath *best_path,
List *tlist, List *scan_clauses);
@@ -191,6 +193,9 @@ static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
List *indexorderby,
List *indextlist,
ScanDirection indexscandir);
+static BrinSort *make_brinsort(List *qptlist, List *qpqual, Index scanrelid,
+ Oid indexid, List *indexqual, List *indexqualorig,
+ ScanDirection indexscandir);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -410,6 +415,9 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
case T_CustomScan:
plan = create_scan_plan(root, best_path, flags);
break;
+ case T_BrinSort:
+ plan = create_scan_plan(root, best_path, flags);
+ break;
case T_HashJoin:
case T_MergeJoin:
case T_NestLoop:
@@ -776,6 +784,13 @@ create_scan_plan(PlannerInfo *root, Path *best_path, int flags)
scan_clauses);
break;
+ case T_BrinSort:
+ plan = (Plan *) create_brinsort_plan(root,
+ (BrinSortPath *) best_path,
+ tlist,
+ scan_clauses);
+ break;
+
default:
elog(ERROR, "unrecognized node type: %d",
(int) best_path->pathtype);
@@ -3180,6 +3195,155 @@ create_indexscan_plan(PlannerInfo *root,
return scan_plan;
}
+/*
+ * create_brinsort_plan
+ * Returns a BRIN Sort plan for the base relation scanned by 'best_path'
+ * with restriction clauses 'scan_clauses' and targetlist 'tlist'.
+ *
+ * The qual preprocessing closely follows create_indexscan_plan(); the
+ * sort columns are then derived from the path's pathkeys using
+ * prepare_sort_from_pathkeys().
+ */
+static BrinSort *
+create_brinsort_plan(PlannerInfo *root,
+ BrinSortPath *best_path,
+ List *tlist,
+ List *scan_clauses)
+{
+ BrinSort *brinsort_plan;
+ List *indexclauses = best_path->ipath.indexclauses;
+ Index baserelid = best_path->ipath.path.parent->relid;
+ IndexOptInfo *indexinfo = best_path->ipath.indexinfo;
+ Oid indexoid = indexinfo->indexoid;
+ List *qpqual;
+ List *stripped_indexquals;
+ List *fixed_indexquals;
+ ListCell *l;
+
+ List *pathkeys = best_path->ipath.path.pathkeys;
+
+ /* it should be a base rel... */
+ Assert(baserelid > 0);
+ Assert(best_path->ipath.path.parent->rtekind == RTE_RELATION);
+
+ /*
+ * Extract the index qual expressions (stripped of RestrictInfos) from the
+ * IndexClauses list, and prepare a copy with index Vars substituted for
+ * table Vars. (This step also does replace_nestloop_params on the
+ * fixed_indexquals.)
+ */
+ fix_indexqual_references(root, &best_path->ipath,
+ &stripped_indexquals,
+ &fixed_indexquals);
+
+ /*
+ * The qpqual list must contain all restrictions not automatically handled
+ * by the index, other than pseudoconstant clauses which will be handled
+ * by a separate gating plan node. All the predicates in the indexquals
+ * will be checked (either by the index itself, or by nodeIndexscan.c),
+ * but if there are any "special" operators involved then they must be
+ * included in qpqual. The upshot is that qpqual must contain
+ * scan_clauses minus whatever appears in indexquals.
+ *
+ * is_redundant_with_indexclauses() detects cases where a scan clause is
+ * present in the indexclauses list or is generated from the same
+ * EquivalenceClass as some indexclause, and is therefore redundant with
+ * it, though not equal. (The latter happens when indxpath.c prefers a
+ * different derived equality than what generate_join_implied_equalities
+ * picked for a parameterized scan's ppi_clauses.) Note that it will not
+ * match to lossy index clauses, which is critical because we have to
+ * include the original clause in qpqual in that case.
+ *
+ * In some situations (particularly with OR'd index conditions) we may
+ * have scan_clauses that are not equal to, but are logically implied by,
+ * the index quals; so we also try a predicate_implied_by() check to see
+ * if we can discard quals that way. (predicate_implied_by assumes its
+ * first input contains only immutable functions, so we have to check
+ * that.)
+ *
+ * Note: if you change this bit of code you should also look at
+ * extract_nonindex_conditions() in costsize.c.
+ */
+ qpqual = NIL;
+ foreach(l, scan_clauses)
+ {
+ RestrictInfo *rinfo = lfirst_node(RestrictInfo, l);
+
+ if (rinfo->pseudoconstant)
+ continue; /* we may drop pseudoconstants here */
+ if (is_redundant_with_indexclauses(rinfo, indexclauses))
+ continue; /* dup or derived from same EquivalenceClass */
+ if (!contain_mutable_functions((Node *) rinfo->clause) &&
+ predicate_implied_by(list_make1(rinfo->clause), stripped_indexquals,
+ false))
+ continue; /* provably implied by indexquals */
+ qpqual = lappend(qpqual, rinfo);
+ }
+
+ /* Sort clauses into best execution order */
+ qpqual = order_qual_clauses(root, qpqual);
+
+ /* Reduce RestrictInfo list to bare expressions; ignore pseudoconstants */
+ qpqual = extract_actual_clauses(qpqual, false);
+
+ /*
+ * We have to replace any outer-relation variables with nestloop params in
+ * the indexqualorig, qpqual, and indexorderbyorig expressions. A bit
+ * annoying to have to do this separately from the processing in
+ * fix_indexqual_references --- rethink this when generalizing the inner
+ * indexscan support. But note we can't really do this earlier because
+ * it'd break the comparisons to predicates above ... (or would it? Those
+ * wouldn't have outer refs)
+ */
+ if (best_path->ipath.path.param_info)
+ {
+ stripped_indexquals = (List *)
+ replace_nestloop_params(root, (Node *) stripped_indexquals);
+ qpqual = (List *)
+ replace_nestloop_params(root, (Node *) qpqual);
+ }
+
+ /* Finally ready to build the plan node */
+ brinsort_plan = make_brinsort(tlist,
+ qpqual,
+ baserelid,
+ indexoid,
+ fixed_indexquals,
+ stripped_indexquals,
+ best_path->ipath.indexscandir);
+
+ if (pathkeys != NIL)
+ {
+ /*
+ * Compute sort column info, and adjust the BrinSort's tlist as needed.
+ * Because we pass adjust_tlist_in_place = true, we may ignore the
+ * function result; it must be the same plan node. However, we then
+ * need to detect whether any tlist entries were added.
+ */
+ (void) prepare_sort_from_pathkeys((Plan *) brinsort_plan, pathkeys,
+ best_path->ipath.path.parent->relids,
+ NULL,
+ true,
+ &brinsort_plan->numCols,
+ &brinsort_plan->sortColIdx,
+ &brinsort_plan->sortOperators,
+ &brinsort_plan->collations,
+ &brinsort_plan->nullsFirst);
+ //tlist_was_changed = (orig_tlist_length != list_length(plan->plan.targetlist));
+ for (int i = 0; i < brinsort_plan->numCols; i++)
+ elog(DEBUG1, "%d => %d %d %d %d", i,
+ brinsort_plan->sortColIdx[i],
+ brinsort_plan->sortOperators[i],
+ brinsort_plan->collations[i],
+ brinsort_plan->nullsFirst[i]);
+ }
+
+ copy_generic_path_info(&brinsort_plan->scan.plan, &best_path->ipath.path);
+
+ return brinsort_plan;
+}
+
/*
* create_bitmap_scan_plan
* Returns a bitmap scan plan for the base relation scanned by 'best_path'
@@ -5523,6 +5687,31 @@ make_indexscan(List *qptlist,
return node;
}
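+
+/*
+ * make_brinsort
+ *	  Build a BrinSort plan node. Sort column information is filled in
+ *	  later by create_brinsort_plan() via prepare_sort_from_pathkeys().
+ */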
+static BrinSort *
+make_brinsort(List *qptlist,
+ List *qpqual,
+ Index scanrelid,
+ Oid indexid,
+ List *indexqual,
+ List *indexqualorig,
+ ScanDirection indexscandir)
+{
+ BrinSort *node = makeNode(BrinSort);
+ Plan *plan = &node->scan.plan;
+
+ plan->targetlist = qptlist;
+ plan->qual = qpqual;
+ plan->lefttree = NULL;
+ plan->righttree = NULL;
+ node->scan.scanrelid = scanrelid;
+ node->indexid = indexid;
+ node->indexqual = indexqual;
+ node->indexqualorig = indexqualorig;
+ node->indexorderdir = indexscandir;
+
+ return node;
+}
+
static IndexOnlyScan *
make_indexonlyscan(List *qptlist,
List *qpqual,
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 1cb0abdbc1f..2584a1f032d 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -609,6 +609,25 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
return set_indexonlyscan_references(root, splan, rtoffset);
}
break;
+ case T_BrinSort:
+ {
+ BrinSort *splan = (BrinSort *) plan;
+
+ splan->scan.scanrelid += rtoffset;
+ splan->scan.plan.targetlist =
+ fix_scan_list(root, splan->scan.plan.targetlist,
+ rtoffset, NUM_EXEC_TLIST(plan));
+ splan->scan.plan.qual =
+ fix_scan_list(root, splan->scan.plan.qual,
+ rtoffset, NUM_EXEC_QUAL(plan));
+ splan->indexqual =
+ fix_scan_list(root, splan->indexqual,
+ rtoffset, 1);
+ splan->indexqualorig =
+ fix_scan_list(root, splan->indexqualorig,
+ rtoffset, NUM_EXEC_QUAL(plan));
+ }
+ break;
case T_BitmapIndexScan:
{
BitmapIndexScan *splan = (BitmapIndexScan *) plan;
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 70f61ae7b1c..e8beadb17b5 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1030,6 +1030,65 @@ create_index_path(PlannerInfo *root,
return pathnode;
}
+
+/*
+ * create_brinsort_path
+ * Creates a path node for sorted brin index scan.
+ *
+ * 'index' is a usable index.
+ * 'indexclauses' is a list of IndexClause nodes representing clauses
+ * to be enforced as qual conditions in the scan.
+ * 'indexorderbys' is a list of bare expressions (no RestrictInfos)
+ * to be used as index ordering operators in the scan.
+ * 'indexorderbycols' is an integer list of index column numbers (zero based)
+ * the ordering operators can be used with.
+ * 'pathkeys' describes the ordering of the path.
+ * 'indexscandir' is ForwardScanDirection for ascending output, or
+ * BackwardScanDirection for descending output.
+ * 'indexonly' is true if an index-only scan is wanted.
+ * 'required_outer' is the set of outer relids for a parameterized path.
+ * 'loop_count' is the number of repetitions of the indexscan to factor into
+ * estimates of caching behavior.
+ * 'partial_path' is true if constructing a parallel index scan path.
+ *
+ * Returns the new path node.
+ */
+BrinSortPath *
+create_brinsort_path(PlannerInfo *root,
+ IndexOptInfo *index,
+ List *indexclauses,
+ List *indexorderbys,
+ List *indexorderbycols,
+ List *pathkeys,
+ ScanDirection indexscandir,
+ bool indexonly,
+ Relids required_outer,
+ double loop_count,
+ bool partial_path)
+{
+ BrinSortPath *pathnode = makeNode(BrinSortPath);
+ RelOptInfo *rel = index->rel;
+
+ pathnode->ipath.path.pathtype = T_BrinSort;
+ pathnode->ipath.path.parent = rel;
+ pathnode->ipath.path.pathtarget = rel->reltarget;
+ pathnode->ipath.path.param_info = get_baserel_parampathinfo(root, rel,
+ required_outer);
+ pathnode->ipath.path.parallel_aware = false;
+ pathnode->ipath.path.parallel_safe = rel->consider_parallel;
+ pathnode->ipath.path.parallel_workers = 0;
+ pathnode->ipath.path.pathkeys = pathkeys;
+
+ pathnode->ipath.indexinfo = index;
+ pathnode->ipath.indexclauses = indexclauses;
+ pathnode->ipath.indexscandir = indexscandir;
+
+ cost_brinsort(pathnode, root, loop_count, partial_path);
+
+ return pathnode;
+}
+
/*
* create_bitmap_heap_path
* Creates a path node for a bitmap scan.
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 05ab087934c..6c854e355b0 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -967,6 +967,16 @@ struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_brinsort", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of BRIN sort plans."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_brinsort,
+ false,
+ NULL, NULL, NULL
+ },
{
{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
gettext_noop("Enables genetic query optimization."),
diff --git a/src/include/access/brin.h b/src/include/access/brin.h
index 887fb0a5532..e8ffc4a0456 100644
--- a/src/include/access/brin.h
+++ b/src/include/access/brin.h
@@ -34,6 +34,26 @@ typedef struct BrinStatsData
BlockNumber revmapNumPages;
} BrinStatsData;
+/*
+ * Info about ranges for BRIN Sort.
+ */
+typedef struct BrinRange
+{
+ BlockNumber blkno_start;
+ BlockNumber blkno_end;
+
+ Datum min_value;
+ Datum max_value;
+ bool has_nulls;
+ bool all_nulls;
+ bool not_summarized;
+} BrinRange;
+
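+/*
+ * A set of such ranges, as returned by the BRIN_PROCNUM_RANGES support
+ * procedure (brin_minmax_ranges for the minmax opclasses).
+ */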
+typedef struct BrinRanges
+{
+ int nranges;
+ BrinRange ranges[FLEXIBLE_ARRAY_MEMBER];
+} BrinRanges;
#define BRIN_DEFAULT_PAGES_PER_RANGE 128
#define BrinGetPagesPerRange(relation) \
diff --git a/src/include/access/brin_internal.h b/src/include/access/brin_internal.h
index 25186609272..7027b41d5fb 100644
--- a/src/include/access/brin_internal.h
+++ b/src/include/access/brin_internal.h
@@ -73,6 +73,7 @@ typedef struct BrinDesc
#define BRIN_PROCNUM_UNION 4
#define BRIN_MANDATORY_NPROCS 4
#define BRIN_PROCNUM_OPTIONS 5 /* optional */
+#define BRIN_PROCNUM_RANGES 6 /* optional */
/* procedure numbers up to 10 are reserved for BRIN future expansion */
#define BRIN_FIRST_OPTIONAL_PROCNUM 11
#define BRIN_LAST_OPTIONAL_PROCNUM 15
diff --git a/src/include/catalog/pg_amproc.dat b/src/include/catalog/pg_amproc.dat
index 4cc129bebd8..41e7143b870 100644
--- a/src/include/catalog/pg_amproc.dat
+++ b/src/include/catalog/pg_amproc.dat
@@ -804,6 +804,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/bytea_minmax_ops', amproclefttype => 'bytea',
amprocrighttype => 'bytea', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/bytea_minmax_ops', amproclefttype => 'bytea',
+ amprocrighttype => 'bytea', amprocnum => '6', amproc => 'brin_minmax_ranges' },
# bloom bytea
{ amprocfamily => 'brin/bytea_bloom_ops', amproclefttype => 'bytea',
@@ -835,6 +837,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/char_minmax_ops', amproclefttype => 'char',
amprocrighttype => 'char', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/char_minmax_ops', amproclefttype => 'char',
+ amprocrighttype => 'char', amprocnum => '6', amproc => 'brin_minmax_ranges' },
# bloom "char"
{ amprocfamily => 'brin/char_bloom_ops', amproclefttype => 'char',
@@ -864,6 +868,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/name_minmax_ops', amproclefttype => 'name',
amprocrighttype => 'name', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/name_minmax_ops', amproclefttype => 'name',
+ amprocrighttype => 'name', amprocnum => '6', amproc => 'brin_minmax_ranges' },
# bloom name
{ amprocfamily => 'brin/name_bloom_ops', amproclefttype => 'name',
@@ -893,6 +899,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int8',
amprocrighttype => 'int8', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int8',
+ amprocrighttype => 'int8', amprocnum => '6', amproc => 'brin_minmax_ranges' },
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int2',
amprocrighttype => 'int2', amprocnum => '1',
@@ -905,6 +913,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int2',
amprocrighttype => 'int2', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int2',
+ amprocrighttype => 'int2', amprocnum => '6', amproc => 'brin_minmax_ranges' },
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int4',
amprocrighttype => 'int4', amprocnum => '1',
@@ -917,6 +927,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int4',
amprocrighttype => 'int4', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int4',
+ amprocrighttype => 'int4', amprocnum => '6', amproc => 'brin_minmax_ranges' },
# minmax multi integer: int2, int4, int8
{ amprocfamily => 'brin/integer_minmax_multi_ops', amproclefttype => 'int2',
@@ -1034,6 +1046,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/text_minmax_ops', amproclefttype => 'text',
amprocrighttype => 'text', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/text_minmax_ops', amproclefttype => 'text',
+ amprocrighttype => 'text', amprocnum => '6', amproc => 'brin_minmax_ranges' },
# bloom text
{ amprocfamily => 'brin/text_bloom_ops', amproclefttype => 'text',
@@ -1062,6 +1076,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/oid_minmax_ops', amproclefttype => 'oid',
amprocrighttype => 'oid', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/oid_minmax_ops', amproclefttype => 'oid',
+ amprocrighttype => 'oid', amprocnum => '6', amproc => 'brin_minmax_ranges' },
# minmax multi oid
{ amprocfamily => 'brin/oid_minmax_multi_ops', amproclefttype => 'oid',
@@ -1110,6 +1126,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/tid_minmax_ops', amproclefttype => 'tid',
amprocrighttype => 'tid', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/tid_minmax_ops', amproclefttype => 'tid',
+ amprocrighttype => 'tid', amprocnum => '6', amproc => 'brin_minmax_ranges' },
# bloom tid
{ amprocfamily => 'brin/tid_bloom_ops', amproclefttype => 'tid',
@@ -1160,6 +1178,9 @@
{ amprocfamily => 'brin/float_minmax_ops', amproclefttype => 'float4',
amprocrighttype => 'float4', amprocnum => '4',
amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/float_minmax_ops', amproclefttype => 'float4',
+ amprocrighttype => 'float4', amprocnum => '6',
+ amproc => 'brin_minmax_ranges' },
{ amprocfamily => 'brin/float_minmax_ops', amproclefttype => 'float8',
amprocrighttype => 'float8', amprocnum => '1',
@@ -1173,6 +1194,9 @@
{ amprocfamily => 'brin/float_minmax_ops', amproclefttype => 'float8',
amprocrighttype => 'float8', amprocnum => '4',
amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/float_minmax_ops', amproclefttype => 'float8',
+ amprocrighttype => 'float8', amprocnum => '6',
+ amproc => 'brin_minmax_ranges' },
# minmax multi float
{ amprocfamily => 'brin/float_minmax_multi_ops', amproclefttype => 'float4',
@@ -1261,6 +1285,9 @@
{ amprocfamily => 'brin/macaddr_minmax_ops', amproclefttype => 'macaddr',
amprocrighttype => 'macaddr', amprocnum => '4',
amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/macaddr_minmax_ops', amproclefttype => 'macaddr',
+ amprocrighttype => 'macaddr', amprocnum => '6',
+ amproc => 'brin_minmax_ranges' },
# minmax multi macaddr
{ amprocfamily => 'brin/macaddr_minmax_multi_ops', amproclefttype => 'macaddr',
@@ -1314,6 +1341,9 @@
{ amprocfamily => 'brin/macaddr8_minmax_ops', amproclefttype => 'macaddr8',
amprocrighttype => 'macaddr8', amprocnum => '4',
amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/macaddr8_minmax_ops', amproclefttype => 'macaddr8',
+ amprocrighttype => 'macaddr8', amprocnum => '6',
+ amproc => 'brin_minmax_ranges' },
# minmax multi macaddr8
{ amprocfamily => 'brin/macaddr8_minmax_multi_ops',
@@ -1366,6 +1396,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/network_minmax_ops', amproclefttype => 'inet',
amprocrighttype => 'inet', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/network_minmax_ops', amproclefttype => 'inet',
+ amprocrighttype => 'inet', amprocnum => '6', amproc => 'brin_minmax_ranges' },
# minmax multi inet
{ amprocfamily => 'brin/network_minmax_multi_ops', amproclefttype => 'inet',
@@ -1436,6 +1468,9 @@
{ amprocfamily => 'brin/bpchar_minmax_ops', amproclefttype => 'bpchar',
amprocrighttype => 'bpchar', amprocnum => '4',
amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/bpchar_minmax_ops', amproclefttype => 'bpchar',
+ amprocrighttype => 'bpchar', amprocnum => '6',
+ amproc => 'brin_minmax_ranges' },
# bloom character
{ amprocfamily => 'brin/bpchar_bloom_ops', amproclefttype => 'bpchar',
@@ -1467,6 +1502,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/time_minmax_ops', amproclefttype => 'time',
amprocrighttype => 'time', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/time_minmax_ops', amproclefttype => 'time',
+ amprocrighttype => 'time', amprocnum => '6', amproc => 'brin_minmax_ranges' },
# minmax multi time without time zone
{ amprocfamily => 'brin/time_minmax_multi_ops', amproclefttype => 'time',
@@ -1517,6 +1554,9 @@
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'timestamp',
amprocrighttype => 'timestamp', amprocnum => '4',
amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'timestamp',
+ amprocrighttype => 'timestamp', amprocnum => '6',
+ amproc => 'brin_minmax_ranges' },
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'timestamptz',
amprocrighttype => 'timestamptz', amprocnum => '1',
@@ -1530,6 +1570,9 @@
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'timestamptz',
amprocrighttype => 'timestamptz', amprocnum => '4',
amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'timestamptz',
+ amprocrighttype => 'timestamptz', amprocnum => '6',
+ amproc => 'brin_minmax_ranges' },
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'date',
amprocrighttype => 'date', amprocnum => '1',
@@ -1542,6 +1585,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'date',
amprocrighttype => 'date', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'date',
+ amprocrighttype => 'date', amprocnum => '6', amproc => 'brin_minmax_ranges' },
# minmax multi datetime (date, timestamp, timestamptz)
{ amprocfamily => 'brin/datetime_minmax_multi_ops',
@@ -1668,6 +1713,9 @@
{ amprocfamily => 'brin/interval_minmax_ops', amproclefttype => 'interval',
amprocrighttype => 'interval', amprocnum => '4',
amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/interval_minmax_ops', amproclefttype => 'interval',
+ amprocrighttype => 'interval', amprocnum => '6',
+ amproc => 'brin_minmax_ranges' },
# minmax multi interval
{ amprocfamily => 'brin/interval_minmax_multi_ops',
@@ -1721,6 +1769,9 @@
{ amprocfamily => 'brin/timetz_minmax_ops', amproclefttype => 'timetz',
amprocrighttype => 'timetz', amprocnum => '4',
amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/timetz_minmax_ops', amproclefttype => 'timetz',
+ amprocrighttype => 'timetz', amprocnum => '6',
+ amproc => 'brin_minmax_ranges' },
# minmax multi time with time zone
{ amprocfamily => 'brin/timetz_minmax_multi_ops', amproclefttype => 'timetz',
@@ -1771,6 +1822,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/bit_minmax_ops', amproclefttype => 'bit',
amprocrighttype => 'bit', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/bit_minmax_ops', amproclefttype => 'bit',
+ amprocrighttype => 'bit', amprocnum => '6', amproc => 'brin_minmax_ranges' },
# minmax bit varying
{ amprocfamily => 'brin/varbit_minmax_ops', amproclefttype => 'varbit',
@@ -1785,6 +1838,9 @@
{ amprocfamily => 'brin/varbit_minmax_ops', amproclefttype => 'varbit',
amprocrighttype => 'varbit', amprocnum => '4',
amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/varbit_minmax_ops', amproclefttype => 'varbit',
+ amprocrighttype => 'varbit', amprocnum => '6',
+ amproc => 'brin_minmax_ranges' },
# minmax numeric
{ amprocfamily => 'brin/numeric_minmax_ops', amproclefttype => 'numeric',
@@ -1799,6 +1855,9 @@
{ amprocfamily => 'brin/numeric_minmax_ops', amproclefttype => 'numeric',
amprocrighttype => 'numeric', amprocnum => '4',
amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/numeric_minmax_ops', amproclefttype => 'numeric',
+ amprocrighttype => 'numeric', amprocnum => '6',
+ amproc => 'brin_minmax_ranges' },
# minmax multi numeric
{ amprocfamily => 'brin/numeric_minmax_multi_ops', amproclefttype => 'numeric',
@@ -1851,6 +1910,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/uuid_minmax_ops', amproclefttype => 'uuid',
amprocrighttype => 'uuid', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/uuid_minmax_ops', amproclefttype => 'uuid',
+ amprocrighttype => 'uuid', amprocnum => '6', amproc => 'brin_minmax_ranges' },
# minmax multi uuid
{ amprocfamily => 'brin/uuid_minmax_multi_ops', amproclefttype => 'uuid',
@@ -1924,6 +1985,9 @@
{ amprocfamily => 'brin/pg_lsn_minmax_ops', amproclefttype => 'pg_lsn',
amprocrighttype => 'pg_lsn', amprocnum => '4',
amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/pg_lsn_minmax_ops', amproclefttype => 'pg_lsn',
+ amprocrighttype => 'pg_lsn', amprocnum => '6',
+ amproc => 'brin_minmax_ranges' },
# minmax multi pg_lsn
{ amprocfamily => 'brin/pg_lsn_minmax_multi_ops', amproclefttype => 'pg_lsn',
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index dbcae7ffdd2..52fdfa8cc0c 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -301,7 +301,7 @@
opckeytype => 'int2' },
{ opcmethod => 'brin', opcname => 'int4_minmax_ops',
opcfamily => 'brin/integer_minmax_ops', opcintype => 'int4',
- opckeytype => 'int4' },
+ opckeytype => 'int4', oid_symbol => 'INT4_BRIN_MINMAX_OPS_OID'},
{ opcmethod => 'brin', opcname => 'int4_minmax_multi_ops',
opcfamily => 'brin/integer_minmax_multi_ops', opcintype => 'int4',
opcdefault => 'f', opckeytype => 'int4' },
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 62a5b8e655d..9fea2a8387c 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8407,6 +8407,9 @@
{ oid => '3386', descr => 'BRIN minmax support',
proname => 'brin_minmax_union', prorettype => 'bool',
proargtypes => 'internal internal internal', prosrc => 'brin_minmax_union' },
+{ oid => '9976', descr => 'BRIN minmax support',
+ proname => 'brin_minmax_ranges', prorettype => 'bool',
+ proargtypes => 'internal int2', prosrc => 'brin_minmax_ranges' },
# BRIN minmax multi
{ oid => '4616', descr => 'BRIN multi minmax support',
diff --git a/src/include/executor/nodeBrinSort.h b/src/include/executor/nodeBrinSort.h
new file mode 100644
index 00000000000..2c860d926ea
--- /dev/null
+++ b/src/include/executor/nodeBrinSort.h
@@ -0,0 +1,47 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeBrinSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeBrinSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEBRINSORT_H
+#define NODEBRINSORT_H
+
+#include "access/genam.h"
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern BrinSortState *ExecInitBrinSort(BrinSort *node, EState *estate, int eflags);
+extern void ExecEndBrinSort(BrinSortState *node);
+extern void ExecBrinSortMarkPos(BrinSortState *node);
+extern void ExecBrinSortRestrPos(BrinSortState *node);
+extern void ExecReScanBrinSort(BrinSortState *node);
+extern void ExecBrinSortEstimate(BrinSortState *node, ParallelContext *pcxt);
+extern void ExecBrinSortInitializeDSM(BrinSortState *node, ParallelContext *pcxt);
+extern void ExecBrinSortReInitializeDSM(BrinSortState *node, ParallelContext *pcxt);
+extern void ExecBrinSortInitializeWorker(BrinSortState *node,
+ ParallelWorkerContext *pwcxt);
+
+/*
+ * These routines are implemented in nodeIndexscan.c (and declared in
+ * nodeIndexscan.h); they are re-declared here because BRIN Sort shares
+ * the same scan-key machinery.
+ */
+extern void ExecIndexBuildScanKeys(PlanState *planstate, Relation index,
+ List *quals, bool isorderby,
+ ScanKey *scanKeys, int *numScanKeys,
+ IndexRuntimeKeyInfo **runtimeKeys, int *numRuntimeKeys,
+ IndexArrayKeyInfo **arrayKeys, int *numArrayKeys);
+extern void ExecIndexEvalRuntimeKeys(ExprContext *econtext,
+ IndexRuntimeKeyInfo *runtimeKeys, int numRuntimeKeys);
+extern bool ExecIndexEvalArrayKeys(ExprContext *econtext,
+ IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
+extern bool ExecIndexAdvanceArrayKeys(IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
+
+#endif /* NODEBRINSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 01b1727fc09..74fb0467d7f 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1549,6 +1549,75 @@ typedef struct IndexScanState
Size iss_PscanLen;
} IndexScanState;
+typedef struct BrinSortRange
+{
+ BlockNumber blkno_start;
+ BlockNumber blkno_end;
+
+ Datum min_value;
+ Datum max_value;
+ bool has_nulls;
+ bool all_nulls;
+ bool not_summarized;
+
+ bool processed;
+} BrinSortRange;
+
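+/*
+ * Phases of the BRIN Sort executor. Roughly: ranges are loaded and their
+ * tuples sorted and returned incrementally, after which rows from NULL
+ * ranges are handled by the separate NULLS phases (subject to NULLS
+ * FIRST/LAST).
+ */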
+typedef enum {
+ BRINSORT_START,
+ BRINSORT_LOAD_RANGE,
+ BRINSORT_PROCESS_RANGE,
+ BRINSORT_LOAD_NULLS,
+ BRINSORT_PROCESS_NULLS,
+ BRINSORT_FINISHED
+} BrinSortPhase;
+
+typedef struct BrinSortState
+{
+ ScanState ss; /* its first field is NodeTag */
+ ExprState *indexqualorig;
+ List *indexorderbyorig;
+ struct ScanKeyData *iss_ScanKeys;
+ int iss_NumScanKeys;
+ struct ScanKeyData *iss_OrderByKeys;
+ int iss_NumOrderByKeys;
+ IndexRuntimeKeyInfo *iss_RuntimeKeys;
+ int iss_NumRuntimeKeys;
+ bool iss_RuntimeKeysReady;
+ ExprContext *iss_RuntimeContext;
+ Relation iss_RelationDesc;
+ struct IndexScanDescData *iss_ScanDesc;
+
+ /* These are needed for re-checking ORDER BY expr ordering */
+ pairingheap *iss_ReorderQueue;
+ bool iss_ReachedEnd;
+ Datum *iss_OrderByValues;
+ bool *iss_OrderByNulls;
+ SortSupport iss_SortSupport;
+ bool *iss_OrderByTypByVals;
+ int16 *iss_OrderByTypLens;
+ Size iss_PscanLen;
+
+ /* BRIN Sort private state */
+ int bs_nranges;
+ BrinSortRange *bs_ranges;
+ BrinSortRange **bs_ranges_minval;
+ int bs_next_range;
+ int bs_next_range_intersect;
+ int bs_next_range_nulls;
+ ExprState *bs_qual;
+ Datum bs_watermark;
+ BrinSortPhase bs_phase;
+ SortSupportData bs_sortsupport;
+
+ /*
+ * We need a tuplesort for the current range, plus a tuplestore for
+ * spill-over tuples from the overlapping ranges.
+ */
+ void *bs_tuplesortstate;
+ Tuplestorestate *bs_tuplestore;
+} BrinSortState;
+
/* ----------------
* IndexOnlyScanState information
*
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 6bda383bead..e79c904a8fc 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1596,6 +1596,17 @@ typedef struct IndexPath
Selectivity indexselectivity;
} IndexPath;
+/*
+ * BrinSortPath - read data from a BRIN index in sorted order
+ *
+ * We embed an IndexPath, because that's what amcostestimate expects, but
+ * make it a separate struct so the path type can be distinguished.
+ */
+typedef struct BrinSortPath
+{
+ IndexPath ipath;
+} BrinSortPath;
+
/*
* Each IndexClause references a RestrictInfo node from the query's WHERE
* or JOIN conditions, and shows how that restriction can be applied to
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 21e642a64c4..c4ef5362acc 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -495,6 +495,32 @@ typedef struct IndexOnlyScan
ScanDirection indexorderdir; /* forward or backward or don't care */
} IndexOnlyScan;
+
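+/* ----------------
+ *		BRIN sort node
+ *
+ * Resembles an index scan, but returns tuples in sorted order by sorting
+ * one BRIN range at a time; the sort columns mirror those of a Sort node.
+ * ----------------
+ */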
+typedef struct BrinSort
+{
+ Scan scan;
+ Oid indexid; /* OID of index to scan */
+ List *indexqual; /* list of index quals (usually OpExprs) */
+ List *indexqualorig; /* the same in original form */
+ ScanDirection indexorderdir; /* forward or backward or don't care */
+
+ /* number of sort-key columns */
+ int numCols;
+
+ /* their indexes in the target list */
+ AttrNumber *sortColIdx pg_node_attr(array_size(numCols));
+
+ /* OIDs of operators to sort them by */
+ Oid *sortOperators pg_node_attr(array_size(numCols));
+
+ /* OIDs of collations */
+ Oid *collations pg_node_attr(array_size(numCols));
+
+ /* NULLS FIRST/LAST directions */
+ bool *nullsFirst pg_node_attr(array_size(numCols));
+
+} BrinSort;
+
/* ----------------
* bitmap index scan node
*
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 204e94b6d10..b77440728d1 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -69,6 +69,7 @@ extern PGDLLIMPORT bool enable_parallel_append;
extern PGDLLIMPORT bool enable_parallel_hash;
extern PGDLLIMPORT bool enable_partition_pruning;
extern PGDLLIMPORT bool enable_async_append;
+extern PGDLLIMPORT bool enable_brinsort;
extern PGDLLIMPORT int constraint_exclusion;
extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
@@ -79,6 +80,8 @@ extern void cost_samplescan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
ParamPathInfo *param_info);
extern void cost_index(IndexPath *path, PlannerInfo *root,
double loop_count, bool partial_path);
+extern void cost_brinsort(BrinSortPath *path, PlannerInfo *root,
+ double loop_count, bool partial_path);
extern void cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
ParamPathInfo *param_info,
Path *bitmapqual, double loop_count);
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 050f00e79a4..2415c07a856 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -49,6 +49,17 @@ extern IndexPath *create_index_path(PlannerInfo *root,
Relids required_outer,
double loop_count,
bool partial_path);
+extern BrinSortPath *create_brinsort_path(PlannerInfo *root,
+ IndexOptInfo *index,
+ List *indexclauses,
+ List *indexorderbys,
+ List *indexorderbycols,
+ List *pathkeys,
+ ScanDirection indexscandir,
+ bool indexonly,
+ Relids required_outer,
+ double loop_count,
+ bool partial_path);
extern BitmapHeapPath *create_bitmap_heap_path(PlannerInfo *root,
RelOptInfo *rel,
Path *bitmapqual,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 41f765d3422..6aa50257730 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -213,6 +213,9 @@ extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
ScanDirection scandir);
+extern List *build_index_pathkeys_brin(PlannerInfo *root, IndexOptInfo *index,
+ TargetEntry *tle, int idx,
+ bool reverse_sort, bool nulls_first);
extern List *build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
ScanDirection scandir, bool *partialkeys);
extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
--
2.37.3
On Sat, Oct 15, 2022 at 5:34 AM Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:
...
Hi,
I am still going over the patch.
Minor: for #8, I guess you meant `it should be possible` .
Cheers
On 10/15/22 15:46, Zhihong Yu wrote:
...
8) Parallel version is not supported, but I think it shouldn't be
possible. Just make the leader build the range info, and then let the
workers to acquire/sort ranges and merge them by Gather Merge.
...
Hi,
I am still going over the patch.
Minor: for #8, I guess you meant `it should be possible` .
Yes, I meant to say it should be possible. Sorry for the confusion.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Oct 15, 2022 at 8:23 AM Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:
...
Hi,
For brin_minmax_ranges, looking at the assignment to gottuple and
reading gottuple, it seems variable gottuple can be omitted - we can check
tup directly.
+ /* Maybe mark the range as processed. */
+ range->processed |= mark_processed;
`Maybe` can be dropped.
For brinsort_load_tuples(), do we need to check for interrupts inside the
loop ?
Similar question for subsequent methods involving loops, such
as brinsort_load_unsummarized_ranges.
Cheers
On 10/15/22 14:33, Tomas Vondra wrote:
Hi,
...
There's a bunch of issues with this initial version of the patch,
usually described in XXX comments in the relevant places. ...
I forgot to mention one important issue in my list yesterday, and that's
memory consumption. The way the patch is coded now, the new BRIN support
function (brin_minmax_ranges) produces information about *all* ranges in
one go, which may be an issue. The worst case is a 32TB table with 1-page
BRIN ranges, which means ~4 billion ranges. The info is an array of ~32B
structs, so this would require ~128GB of RAM. With the default 128-page
ranges it'd still be ~1GB, which is quite a lot.
We could have a discussion about what's the reasonable size of BRIN
ranges on such large tables (e.g. building a bitmap on 4 billion ranges
is going to be "not cheap" so this is likely pretty rare). But we should
not introduce new nodes that ignore work_mem, so we need a way to deal
with such cases somehow.
The easiest solution likely is to check this while planning - we can
check the table size, calculate the number of BRIN ranges, and check
that the range info fits into work_mem, and just not create the path
when it gets too large. That's what we did for HashAgg, although that
decision was unreliable because estimating GROUP BY cardinality is hard.
The wrinkle here is that counting just the range info (BrinRange struct)
does not include the values for by-reference types. We could use average
width - that's just an estimate, though.
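To make the idea concrete, here is a minimal stand-alone sketch of such a
planning-time check (the function name and the flat 32-byte per-range
estimate are assumptions for illustration, not code from the patch, and
by-reference values are ignored exactly as described above):

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

static bool
brinsort_range_info_fits(uint64_t rel_pages, uint64_t pages_per_range,
                         uint64_t work_mem_bytes)
{
    uint64_t    nranges = (rel_pages + pages_per_range - 1) / pages_per_range;

    /* ~32B of range info per range, ignoring by-reference values */
    return nranges * 32 <= work_mem_bytes;
}

int
main(void)
{
    /* 32TB table with 8kB pages */
    uint64_t    rel_pages = (uint64_t) 32 * 1024 * 1024 * 1024 * 1024 / 8192;

    /* 1-page ranges, work_mem = 4MB: ~128GB needed, clearly does not fit */
    printf("%d\n", brinsort_range_info_fits(rel_pages, 1, 4 * 1024 * 1024));

    /* 128-page ranges, work_mem = 256MB: ~1GB needed, still does not fit */
    printf("%d\n", brinsort_range_info_fits(rel_pages, 128, 256 * 1024 * 1024));

    return 0;
}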
A more comprehensive solution seems to be to allow requesting chunks of
the BRIN ranges. So that we'd get "slices" of ranges and we'd process
those. So for example if you have 1000 ranges, and you can only handle
100 at a time, we'd do 10 loops, each requesting 100 ranges.
This has another problem - we do care about "overlaps", and we can't
really know if the overlapping ranges will be in the same "slice"
easily. The chunks would be sorted (for example) by maxval. But there
can be a range with much higher maxval (thus in some future slice), but
very low minval (thus intersecting with ranges in the current slice).
Imagine ranges with these minval/maxval values, sorted by maxval:
[101,200]
[201,300]
[301,400]
[150,500]
and let's say we can only process 2-range slices. So we'll get the first
two, but both of them intersect with the very last range.
We could always include all the intersecting ranges into the slice, but
what if there are too many very "wide" ranges?
So I think this will need to switch to an iterative communication with
the BRIN index - instead of asking "give me info about all the ranges",
we'll need a way to
- request the next range (sorted by maxval)
- request the intersecting ranges one by one (sorted by minval)
Of course, the BRIN side will have some of the same challenges with
tracking the info without breaking the work_mem limit, but I suppose it
can store the info into a tuplestore/tuplesort, and use that instead of
plain in-memory array. Alternatively, it could just return those, and
BrinSort would use that. OTOH it seems cleaner to have some sort of API,
especially if we want to support e.g. minmax-multi opclasses, that have
a more complicated concept of "intersection".
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Oct 16, 2022 at 6:51 AM Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:
...
Hi,
In your example involving [150,500], can this range be broken down into 4
ranges, ending in 200, 300, 400 and 500, respectively ?
That way, there is no intersection among the ranges.
bq. can store the info into a tuplestore/tuplesort
Wouldn't this involve disk accesses which may reduce the effectiveness of
BRIN sort ?
Cheers
On 10/16/22 03:36, Zhihong Yu wrote:
...
Hi,
For brin_minmax_ranges, looking at the assignment to gottuple and
reading gottuple, it seems variable gottuple can be omitted - we can
check tup directly.
+ /* Maybe mark the range as processed. */
+ range->processed |= mark_processed;
`Maybe` can be dropped.
No, because the "mark_processed" may be false. So we may not mark it as
processed in some cases.
For brinsort_load_tuples(), do we need to check for interrupts inside
the loop ?
Similar question for subsequent methods involving loops, such
as brinsort_load_unsummarized_ranges.
We could/should, although most of the loops should be very short.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 10/16/22 16:01, Zhihong Yu wrote:
On Sun, Oct 16, 2022 at 6:51 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
...
Hi,
In your example involving [150,500], can this range be broken down into
4 ranges, ending in 200, 300, 400 and 500, respectively ?
That way, there is no intersection among the ranges.
Not really, I think. These "value ranges" map to "page ranges" and how
would you split those? I mean, you know values [150,500] map to blocks
[0,127]. You split the values into [150,200], [201,300], [301,400]. How
do you split the page range [0,127]?
Also, splitting a range into more ranges is likely making the issue
worse, because it increases the number of ranges, right? And I mean,
much worse, because imagine a "wide" range that overlaps with every
other range - the number of ranges would explode.
It's not clear to me at which point you'd make the split. At the
beginning, right after loading the ranges from BRIN index? A lot of that
may be unnecessary, in case the range is loaded as a "non-intersecting"
range.
Try to formulate the whole algorithm. Maybe I'm missing something.
The current algorithm is something like this:
1. request info about ranges from the BRIN opclass
2. sort them by maxval and minval
3. NULLS FIRST: read all ranges that might have NULLs => output
4. read the next range (by maxval) into tuplesort
(if no more ranges, go to (9))
5. load all tuples from "splill" tuplestore, compare to maxval
6. load all tuples from no-summarized ranges (first range only)
(into tuplesort/tuplestore, depending on maxval comparison)
7. load all intersecting ranges (with minval < current maxval)
(into tuplesort/tuplestore, depending on maxval comparison)
8. sort the tuplesort, output all tuples, then back to (4)
9. NULLS LAST: read all ranges that might have NULLs => output
10. done
For "DESC" ordering the process is almost the same, except that we swap
minval/maxval in most places.
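To make that flow easier to follow, here is a stand-alone C sketch of the
ASC case, with NULL handling and unsummarized ranges left out and plain
arrays standing in for the tuplesort and the spill tuplestore (all names
and data are made up for illustration; this is not the patch's code):

#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    int         minval;
    int         maxval;
    int         ntuples;
    const int  *tuples;
    int         processed;
} Range;

static int
cmp_int(const void *a, const void *b)
{
    return *(const int *) a - *(const int *) b;
}

static int
cmp_range_maxval(const void *a, const void *b)
{
    return ((const Range *) a)->maxval - ((const Range *) b)->maxval;
}

int
main(void)
{
    static const int t0[] = {3, 1, 90, 40};
    static const int t1[] = {150, 101, 420};
    static const int t2[] = {480, 130, 300};
    Range       ranges[] = {
        {1, 90, 4, t0, 0}, {101, 420, 3, t1, 0}, {130, 480, 3, t2, 0}
    };
    int         nranges = 3;
    int         spill[16];
    int         nspill = 0;

    /* step 2: sort the range info (here only by maxval) */
    qsort(ranges, nranges, sizeof(Range), cmp_range_maxval);

    /* step 4: process ranges in maxval order */
    for (int i = 0; i < nranges; i++)
    {
        int         watermark = ranges[i].maxval;
        int         batch[16];
        int         nbatch = 0;
        int         remaining = 0;

        /* step 5: spilled tuples that now fall under the watermark */
        for (int k = 0; k < nspill; k++)
        {
            if (spill[k] <= watermark)
                batch[nbatch++] = spill[k];
            else
                spill[remaining++] = spill[k];
        }
        nspill = remaining;

        /* step 7: read every unprocessed range that may intersect */
        for (int j = 0; j < nranges; j++)
        {
            if (ranges[j].processed || ranges[j].minval > watermark)
                continue;
            for (int k = 0; k < ranges[j].ntuples; k++)
            {
                if (ranges[j].tuples[k] <= watermark)
                    batch[nbatch++] = ranges[j].tuples[k];
                else
                    spill[nspill++] = ranges[j].tuples[k];
            }
            ranges[j].processed = 1;
        }

        /* step 8: sort this chunk and output it */
        qsort(batch, nbatch, sizeof(int), cmp_int);
        for (int k = 0; k < nbatch; k++)
            printf("%d ", batch[k]);
    }
    printf("\n");
    return 0;
}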
bq. can store the info into a tuplestore/tuplesort
Wouldn't this involve disk accesses which may reduce the effectiveness
of BRIN sort ?
Yes, it might. But the question is whether the result is still faster
than alternative plans (e.g. seqscan+sort), and those are likely to do
even more I/O.
Moreover, for "regular" cases this shouldn't be a significant issue,
because the stuff will fit into work_mem and so there'll be no I/O. But
it'll handle those extreme cases gracefully.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
I forgot to mention one important issue in my list yesterday, and that's
memory consumption.
TBH, this is all looking like vastly more complexity than benefit.
It's going to be impossible to produce a reliable cost estimate
given all the uncertainty, and I fear that will end in picking
BRIN-based sorting when it's not actually a good choice.
The examples you showed initially are cherry-picked to demonstrate
the best possible case, which I doubt has much to do with typical
real-world tables. It would be good to see what happens with
not-perfectly-sequential data before even deciding this is worth
spending more effort on. It also seems kind of unfair to decide
that the relevant comparison point is a seqscan rather than a
btree indexscan.
regards, tom lane
On Sun, Oct 16, 2022 at 7:33 AM Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:
...
Hi,
Thanks for the quick reply.
I don't have good answer w.r.t. splitting the page range [0,127] now. Let
me think more about it.
The 10 step flow (subject to changes down the road) should be either given
in the description of the patch or, written as comment inside the code.
This would help people grasp the concept much faster.
BTW splill seems to be a typo - I assume you meant spill.
Cheers
On 10/16/22 16:41, Tom Lane wrote:
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
I forgot to mention one important issue in my list yesterday, and that's
memory consumption.
TBH, this is all looking like vastly more complexity than benefit.
It's going to be impossible to produce a reliable cost estimate
given all the uncertainty, and I fear that will end in picking
BRIN-based sorting when it's not actually a good choice.
Maybe. If it turns out the estimates we have are insufficient to make
good planning decisions, that's life.
As I wrote in my message, I know the BRIN costing is a bit shaky in
general (not just for this new operation), and I intend to propose some
improvement in a separate patch.
I think the main issue with BRIN costing is that we have no stats about
the ranges, and we can't estimate how many ranges we'll really end up
accessing. If you have 100 rows, will that be 1 range or 100 ranges? Or
for the BRIN Sort, how many overlapping ranges will there be?
I intend to allow index AMs to collect custom statistics, and the BRIN
minmax opfamily would collect e.g. this:
1) number of non-summarized ranges
2) number of all-nulls ranges
3) number of has-nulls ranges
4) average number of overlaps (given a random range, how many other
ranges intersect with it)
5) how likely is it for a row to hit multiple ranges (cross-check
sample rows vs. ranges)
I believe this will allow much better / more reliable BRIN costing (the
number of overlaps is particularly useful for this patch).
The examples you showed initially are cherry-picked to demonstrate
the best possible case, which I doubt has much to do with typical
real-world tables. It would be good to see what happens with
not-perfectly-sequential data before even deciding this is worth
spending more effort on.
Yes, the example was a trivial "happy case" example. Obviously, the
performance degrades as the data becomes more random (with wider ranges),
forcing the BRIN Sort to read/sort more tuples.
But let's see an example with less correlated data, say, like this:
create table t (a int) with (fillfactor = 10);
insert into t select i + 10000 * random()
from generate_series(1,10000000) s(i);
With the fillfactor=10, there are ~2500 values per 1MB range, so this
means each range overlaps with ~4 more. The results then look like this:
1) select * from t order by a;
seqscan+sort: 4437 ms
brinsort: 4233 ms
2) select * from t order by a limit 10;
seqscan+sort: 1859 ms
brinsort: 4 ms
If you increase the random factor from 10000 to 100000 (so, 40 ranges),
the seqscan timings remain about the same, while brinsort gets to 5200
and 20 ms. And with 1M, it's ~6000 and 300 ms.
Only at 5000000, where we pretty much read 1/2 the table because the
ranges intersect, do we get the same timing as the seqscan (for the LIMIT
query). The "full sort" query is more like 5000 vs. 6600 ms, so slower
but not by a huge amount.
Yes, this is a very simple example. I can do more tests with other
datasets (larger/smaller, different distribution, ...).
It also seems kind of unfair to decide
that the relevant comparison point is a seqscan rather than a
btree indexscan.
I don't think it's all that unfair. How likely is it to have both a BRIN
and btree index on the same column? And even if you do have such indexes
(say, on different sets of keys), we kinda already have this costing
issue with index and bitmap index scans.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 10/16/22 16:42, Zhihong Yu wrote:
...
I don't have good answer w.r.t. splitting the page range [0,127] now.
Let me think more about it.
Sure, no problem.
The 10 step flow (subject to changes down the road) should be either
given in the description of the patch or, written as comment inside the
code.
This would help people grasp the concept much faster.
True. I'll add it to the next version of the patch.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, 16 Oct 2022 at 16:42, Tom Lane <tgl@sss.pgh.pa.us> wrote:
It also seems kind of unfair to decide
that the relevant comparison point is a seqscan rather than a
btree indexscan.
I think the comparison against full table scan seems appropriate, as
the benefit of BRIN is less space usage when compared to other
indexes, and better IO selectivity than full table scans.
A btree easily requires 10x the space of a normal BRIN index, and may
require a lot of random IO whilst scanning. This BRIN-sorted scan
would have a much lower random IO cost during its scan, and would help
bridge the performance gap between having index that supports ordered
retrieval, and no index at all, which is especially steep in large
tables.
I think that BRIN would be an alternative to btree as a provider of
sorted data, even when the table is not 100% clustered. This
BRIN-assisted table sort can help reduce the amount of data that is
accessed in top-N sorts significantly, both at the index and at the
relation level, without having the space overhead of "all sortable
columns get a btree index".
If BRIN gets its HOT optimization back, the benefits would be even
larger, as we would then have an index that can speed up top-N sorts
without bloating other indexes, and at very low disk footprint.
Columns that are only occasionally accessed in a sorted manner could
then get BRIN minmax indexes to support this sort, at minimal overhead
to the rest of the application.
Kind regards,
Matthias van de Meent
First of all, it's really great to see that this is being worked on.
On Sun, 16 Oct 2022 at 16:34, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
Try to formulate the whole algorithm. Maybe I'm missing something.
The current algorithm is something like this:
1. request info about ranges from the BRIN opclass
2. sort them by maxval and minval
Why sort on maxval and minval? That seems wasteful for effectively all
sorts, where range sort on minval should suffice: If you find a range
that starts at 100 in a list of ranges sorted at minval, you've
processed all values <100. You can't make a similar comparison when
that range is sorted on maxvals.
3. NULLS FIRST: read all ranges that might have NULLs => output
4. read the next range (by maxval) into tuplesort
(if no more ranges, go to (9))
5. load all tuples from "splill" tuplestore, compare to maxval
Instead of this, shouldn't an update to tuplesort that allows for
restarting the sort be better than this? Moving tuples that we've
accepted into BRINsort state but not yet returned around seems like a
waste of cycles, and I can't think of a reason why it can't work.
6. load all tuples from no-summarized ranges (first range only)
(into tuplesort/tuplestore, depending on maxval comparison)
7. load all intersecting ranges (with minval < current maxval)
(into tuplesort/tuplestore, depending on maxval comparison)
8. sort the tuplesort, output all tuples, then back to (4)
9. NULLS LAST: read all ranges that might have NULLs => output
10. doneFor "DESC" ordering the process is almost the same, except that we swap
minval/maxval in most places.
When I was thinking about this feature at the PgCon unconference, I
was thinking about it more along the lines of the following system
(for ORDER BY col ASC NULLS FIRST):
1. prepare tuplesort Rs (for Rangesort) for BRIN tuples, ordered by
[has_nulls, min ASC]
2. scan info about ranges from BRIN, store them in Rs.
3. Finalize the sorting of Rs.
4. prepare tuplesort Ts (for Tuplesort) for sorting on the specified
column ordering.
5. load all tuples from no-summarized ranges into Ts'
6. while Rs has a block range Rs' with has_nulls:
- Remove Rs' from Rs
- store the tuples of Rs' range in Ts.
We now have all tuples with NULL in our sorted set; max_sorted = (NULL)
7. Finalize the Ts sorted set.
8. While the next tuple Ts' in the Ts tuplesort <= max_sorted
- Remove Ts' from Ts
- Yield Ts'
Now, all tuples up to and including max_sorted are yielded.
9. If there are no more ranges in Rs:
- Yield all remaining tuples from Ts, then return.
10. "un-finalize" Ts, so that we can start adding tuples to that tuplesort.
This is different from Tomas' implementation, as he loads the
tuples into a new tuplestore.
11. get the next item from Rs: Rs'
- remove Rs' from Rs
- assign Rs' min value to max_sorted
- store the tuples of Rs' range in Ts
12. while the next item Rs' from Rs has a min value of max_sorted:
- remove Rs' from Rs
- store the tuples of Rs' range in Ts
13. The 'new' value from the next item from Rs is stored in
max_sorted. If no such item exists, max_sorted is assigned a sentinel
value (+INF)
14. Go to Step 7
This set of operations requires a restarting tuplesort for Ts, but I
don't think that would result in many API changes for tuplesort. It
reduces the overhead of large overlapping ranges, as it doesn't need
to copy all tuples that have been read from disk but have not yet been
returned.
The maximum cost of this tuplesort would be the cost of sorting a
seqscanned table, plus sorting the relevant BRIN ranges, plus the 1
extra compare per tuple and range that are needed to determine whether
the range or tuple should be extracted from the tuplesort. The minimum
cost would be the cost of sorting all BRIN ranges, plus sorting all
tuples in one of the index's ranges.
Kind regards,
Matthias van de Meent
PS. Are you still planning on giving the HOT optimization for BRIN a
second try? I'm fairly confident that my patch at [0] would fix the
issue that lead to the revert of that feature, but it introduced ABI
changes after the feature freeze and thus it didn't get in. The patch
might need some polishing, but I think it shouldn't take too much
extra effort to get into PG16.
[0]: /messages/by-id/CAEze2Wi9=Bay_=rTf8Z6WPgZ5V0tDOayszQJJO=R_9aaHvr+Tg@mail.gmail.com
On 10/16/22 22:17, Matthias van de Meent wrote:
First of all, it's really great to see that this is being worked on.
On Sun, 16 Oct 2022 at 16:34, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
Try to formulate the whole algorithm. Maybe I'm missing something.
The current algorithm is something like this:
1. request info about ranges from the BRIN opclass
2. sort them by maxval and minval
Why sort on maxval and minval? That seems wasteful for effectively all
sorts, where range sort on minval should suffice: If you find a range
that starts at 100 in a list of ranges sorted at minval, you've
processed all values <100. You can't make a similar comparison when
that range is sorted on maxvals.
Because that allows to identify overlapping ranges quickly.
Imagine you have the ranges sorted by maxval, which allows you to add
tuples in small increments. But how do you know there's not a range
(possibly with arbitrarily high maxval), that however overlaps with the
range we're currently processing?
Consider these ranges sorted by maxval
range #1 [0,100]
range #2 [101,200]
range #3 [150,250]
...
range #1000000 [190,1000000000]
processing the range #1 is simple, because there are no overlapping
ranges. When processing range #2, that's not the case - the following
range #3 is overlapping too, so we need to load the tuples too. But
there may be other ranges (in arbitrary distance) also overlapping.
So we either have to cross-check everything with everything - that's
O(N^2) so not great, or we can invent a way to eliminate ranges that
can't overlap.
The patch does that by having two arrays - one sorted by maxval, one
sorted by minval. After proceeding to the next range by maxval (using
the first array), the minval-sorted array is used to detect overlaps.
This can be done quickly, because we only care about new matches since the
previous range, so we can remember the position in the minval array and
resume from it. And we can stop once the minval exceeds the maxval of the
current range, because we'll only sort tuples up to that point.
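A tiny stand-alone illustration of that scheme (the patch keeps the same
idea in the bs_ranges / bs_ranges_minval arrays of BrinSortState; the code
below is simplified and made up purely for illustration):

#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    int         minval;
    int         maxval;
} Range;

static int
by_maxval(const void *a, const void *b)
{
    return (*(const Range *const *) a)->maxval - (*(const Range *const *) b)->maxval;
}

static int
by_minval(const void *a, const void *b)
{
    return (*(const Range *const *) a)->minval - (*(const Range *const *) b)->minval;
}

int
main(void)
{
    Range       ranges[] = {{0, 100}, {101, 200}, {150, 250}, {190, 1000}};
    int         nranges = 4;
    Range      *max_order[4];
    Range      *min_order[4];
    int         cursor = 0;     /* remembered position in the minval array */

    for (int i = 0; i < nranges; i++)
        max_order[i] = min_order[i] = &ranges[i];
    qsort(max_order, nranges, sizeof(Range *), by_maxval);
    qsort(min_order, nranges, sizeof(Range *), by_minval);

    for (int i = 0; i < nranges; i++)
    {
        int         watermark = max_order[i]->maxval;

        printf("watermark %d loads:", watermark);

        /*
         * Only ranges with minval <= watermark can intersect, and the
         * watermark never decreases, so the cursor only moves forward and
         * each range is visited exactly once.
         */
        while (cursor < nranges && min_order[cursor]->minval <= watermark)
        {
            printf(" [%d,%d]", min_order[cursor]->minval,
                   min_order[cursor]->maxval);
            cursor++;
        }
        printf("\n");
    }
    return 0;
}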
3. NULLS FIRST: read all ranges that might have NULLs => output
4. read the next range (by maxval) into tuplesort
(if no more ranges, go to (9))
5. load all tuples from "splill" tuplestore, compare to maxvalInstead of this, shouldn't an update to tuplesort that allows for
restarting the sort be better than this? Moving tuples that we've
accepted into BRINsort state but not yet returned around seems like a
waste of cycles, and I can't think of a reason why it can't work.
I don't understand what you mean by "update to tuplesort". Can you
elaborate?
The point of spilling them into a tuplestore is to make the sort cheaper
by not sorting tuples that can't possibly be produced, because the value
exceeds the current maxval. Consider ranges sorted by maxval
[0,1000]
[500,1500]
[1001,2000]
...
We load tuples from [0,1000] and use 1000 as "threshold" up to which we
can sort. But we have to load tuples from the overlapping range(s) too,
e.g. from [500,1500] except that all tuples with values > 1000 can't be
produced (because there might be yet more ranges intersecting with that
part).
So why sort these tuples at all? Imagine an imperfectly correlated table
where each range overlaps with ~10 other ranges. If we feed all of that
into the tuplesort, we're now sorting 11x the amount of data.
Or maybe I just don't understand what you mean.
6. load all tuples from no-summarized ranges (first range only)
(into tuplesort/tuplestore, depending on maxval comparison)
7. load all intersecting ranges (with minval < current maxval)
(into tuplesort/tuplestore, depending on maxval comparison)
8. sort the tuplesort, output all tuples, then back to (4)
9. NULLS LAST: read all ranges that might have NULLs => output
10. doneFor "DESC" ordering the process is almost the same, except that we swap
minval/maxval in most places.When I was thinking about this feature at the PgCon unconference, I
was thinking about it more along the lines of the following system
(for ORDER BY col ASC NULLS FIRST):1. prepare tuplesort Rs (for Rangesort) for BRIN tuples, ordered by
[has_nulls, min ASC]
2. scan info about ranges from BRIN, store them in Rs.
3. Finalize the sorting of Rs.
4. prepare tuplesort Ts (for Tuplesort) for sorting on the specified
column ordering.
5. load all tuples from no-summarized ranges into Ts'
6. while Rs has a block range Rs' with has_nulls:
- Remove Rs' from Rs
- store the tuples of Rs' range in Ts.
We now have all tuples with NULL in our sorted set; max_sorted = (NULL)
7. Finalize the Ts sorted set.
8. While the next tuple Ts' in the Ts tuplesort <= max_sorted
- Remove Ts' from Ts
- Yield Ts'
Now, all tuples up to and including max_sorted are yielded.
9. If there are no more ranges in Rs:
- Yield all remaining tuples from Ts, then return.
10. "un-finalize" Ts, so that we can start adding tuples to that tuplesort.
This is different from Tomas' implementation, as he loads the
tuples into a new tuplestore.
11. get the next item from Rs: Rs'
- remove Rs' from Rs
- assign Rs' min value to max_sorted
- store the tuples of Rs' range in Ts
I don't think this works, because we may get a range (Rs') with very
high maxval (thus read very late from Rs), but with very low minval.
AFAICS max_sorted must never go back, and this breaks it.
12. while the next item Rs' from Rs has a min value of max_sorted:
- remove Rs' from Rs
- store the tuples of Rs' range in Ts
13. The 'new' value from the next item from Rs is stored in
max_sorted. If no such item exists, max_sorted is assigned a sentinel
value (+INF)
14. Go to Step 7
This set of operations requires a restarting tuplesort for Ts, but I
don't think that would result in many API changes for tuplesort. It
reduces the overhead of large overlapping ranges, as it doesn't need
to copy all tuples that have been read from disk but have not yet been
returned.
The maximum cost of this tuplesort would be the cost of sorting a
seqscanned table, plus sorting the relevant BRIN ranges, plus the 1
extra compare per tuple and range that are needed to determine whether
the range or tuple should be extracted from the tuplesort. The minimum
cost would be the cost of sorting all BRIN ranges, plus sorting all
tuples in one of the index's ranges.
I'm not a tuplesort expert, but my assumption it's better to sort
smaller amounts of rows - which is why the patch sorts only the rows it
knows it can actually output.
Kind regards,
Matthias van de Meent
PS. Are you still planning on giving the HOT optimization for BRIN a
second try? I'm fairly confident that my patch at [0] would fix the
issue that lead to the revert of that feature, but it introduced ABI
changes after the feature freeze and thus it didn't get in. The patch
might need some polishing, but I think it shouldn't take too much
extra effort to get into PG16.
Thanks for reminding me, I'll take a look before the next CF.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, 17 Oct 2022 at 05:43, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
On 10/16/22 22:17, Matthias van de Meent wrote:
On Sun, 16 Oct 2022 at 16:34, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
Try to formulate the whole algorithm. Maybe I'm missing something.
The current algorithm is something like this:
1. request info about ranges from the BRIN opclass
2. sort them by maxval and minval
Why sort on maxval and minval? That seems wasteful for effectively all
sorts, where range sort on minval should suffice: If you find a range
that starts at 100 in a list of ranges sorted at minval, you've
processed all values <100. You can't make a similar comparison when
that range is sorted on maxvals.
Because that allows to identify overlapping ranges quickly.
Imagine you have the ranges sorted by maxval, which allows you to add
tuples in small increments. But how do you know there's not a range
(possibly with arbitrarily high maxval), that however overlaps with the
range we're currently processing?
Why do we need to identify overlapping ranges specifically? If you
sort by minval, it becomes obvious that any subsequent range cannot
contain values < the minval of the next range in the list, allowing
you to emit any values less than the next, unprocessed, minmax range's
minval.
3. NULLS FIRST: read all ranges that might have NULLs => output
4. read the next range (by maxval) into tuplesort
(if no more ranges, go to (9))
5. load all tuples from "splill" tuplestore, compare to maxvalInstead of this, shouldn't an update to tuplesort that allows for
restarting the sort be better than this? Moving tuples that we've
accepted into BRINsort state but not yet returned around seems like a
waste of cycles, and I can't think of a reason why it can't work.
I don't understand what you mean by "update to tuplesort". Can you
elaborate?
Tuplesort currently only allows the following workflow: you load
tuples, then call finalize, then extract tuples. There is currently no
way to add tuples once you've started extracting them.
For my design to work efficiently or without hacking into the
internals of tuplesort, we'd need a way to restart or 'un-finalize'
the tuplesort so that it returns to the 'load tuples' phase. Because
all data of the previous iteration is already sorted, adding more data
shouldn't be too expensive.
The point of spilling them into a tuplestore is to make the sort cheaper
by not sorting tuples that can't possibly be produced, because the value
exceeds the current maxval. Consider ranges sorted by maxval
[...]
Or maybe I just don't understand what you mean.
If we sort the ranges by minval like this:
1. [0,1000]
2. [0,999]
3. [50,998]
4. [100,997]
5. [100,996]
6. [150,995]
Then we can load and sort the values for range 1 and 2, and emit all
values up to (not including) 50 - the minval of the next,
not-yet-loaded range in the ordered list of ranges. Then add the
values from range 3 to the set of tuples we have yet to output; sort;
and then emit values up to 100 (range 4's minval), etc. This reduces
the amount of tuples in the tuplesort to the minimum amount needed to
output any specific value.
If the ranges are sorted and loaded by maxval, like your algorithm expects:
1. [150,995]
2. [100,996]
3. [100,997]
4. [50,998]
5. [0,999]
6. [0,1000]
We need to load all ranges into the sort before it could start
emitting any tuples, as all ranges overlap with the first range.
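For illustration, here is a stand-alone sketch of that minval-driven
scheme, with a plain re-sorted array standing in for the restartable
tuplesort discussed above (names and data are made up; this is not
proposed code):

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    int         minval;
    int         maxval;
    int         ntuples;
    const int  *tuples;
} Range;

static int
cmp_int(const void *a, const void *b)
{
    return *(const int *) a - *(const int *) b;
}

static int
cmp_minval(const void *a, const void *b)
{
    return ((const Range *) a)->minval - ((const Range *) b)->minval;
}

int
main(void)
{
    static const int t0[] = {1000, 0, 512};
    static const int t1[] = {999, 0, 4};
    static const int t2[] = {998, 50, 700};
    static const int t3[] = {997, 100, 333};
    Range       ranges[] = {
        {0, 1000, 3, t0}, {0, 999, 3, t1}, {50, 998, 3, t2}, {100, 997, 3, t3}
    };
    int         nranges = 4;
    int         ws[32];         /* working set: loaded, not yet emitted */
    int         nws = 0;
    int         next = 0;

    qsort(ranges, nranges, sizeof(Range), cmp_minval);

    while (next < nranges || nws > 0)
    {
        int         bound;
        int         kept = 0;

        /* load the next range, plus any further ranges with the same minval */
        if (next < nranges)
        {
            int         cur_min = ranges[next].minval;

            while (next < nranges && ranges[next].minval == cur_min)
            {
                for (int k = 0; k < ranges[next].ntuples; k++)
                    ws[nws++] = ranges[next].tuples[k];
                next++;
            }
        }

        /* everything below the next (unloaded) range's minval is complete */
        bound = (next < nranges) ? ranges[next].minval : INT_MAX;

        qsort(ws, nws, sizeof(int), cmp_int);
        for (int k = 0; k < nws; k++)
        {
            if (ws[k] < bound)
                printf("%d ", ws[k]);
            else
                ws[kept++] = ws[k];
        }
        nws = kept;
    }
    printf("\n");
    return 0;
}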
[algo]
I don't think this works, because we may get a range (Rs') with very
high maxval (thus read very late from Rs), but with very low minval.
AFAICS max_sorted must never go back, and this breaks it.
max_sorted cannot go back, because it is the min value of the next
max_sorted cannot go back, because it is the min value of the next
range in the list of ranges sorted by min value; see also above.
There is a small issue in my algorithm where I use <= for yielding
values where it should be <, where initialization of max_value to NULL
is then incorrect, but apart from that I don't think there are any
issues with the base algorithm.
The maximum cost of this tuplesort would be the cost of sorting a
seqscanned table, plus sorting the relevant BRIN ranges, plus the 1
extra compare per tuple and range that are needed to determine whether
the range or tuple should be extracted from the tuplesort. The minimum
cost would be the cost of sorting all BRIN ranges, plus sorting all
tuples in one of the index's ranges.
I'm not a tuplesort expert, but my assumption it's better to sort
smaller amounts of rows - which is why the patch sorts only the rows it
knows it can actually output.
I see that the two main differences between our designs are in
answering these questions:
- How do we select table ranges for processing?
- How do we handle tuples that we know we can't output yet?
For the first, I think the differences are explained above. The main
drawback of your selection algorithm seems to be that your algorithm's
worst-case is "all ranges overlap", whereas my algorithm's worst case
is "all ranges start at the same value", which is only a subset of
your worst case.
For the second, the difference is whether we choose to sort the tuples
that are out-of-bounds, but are already in the working set due to
being returned from a range overlapping with the current bound.
My algorithm tries to reduce the overhead of increasing the sort
boundaries by also sorting the out-of-bound data, allowing for
O(n-less-than-newbound) overhead of extending the bounds (total
complexity for whole sort O(n-out-of-bound)), and O(n log n)
processing of all tuples during insertion.
Your algorithm - if I understand it correctly - seems to optimize for
faster results within the current bound by not sorting the
out-of-bounds data with O(1) processing when out-of-bounds, at the
cost of needing O(n-out-of-bound-tuples) operations when the maxval /
max_sorted boundary is increased, with a complexity of O(n*m) for an
average of n out-of-bound tuples and m bound updates.
Lastly, there is the small difference in how the ranges are extracted
from BRIN: I prefer and mention an iterative approach where the tuples
are extracted from the index and loaded into a tuplesort in some
iterative fashion (which spills to disk and does not need all tuples
to reside in memory), whereas your current approach was mentioned as
(paraphrasing) 'allocate all this data in one chunk and hope that
there is enough memory available'. I think this is not so much a
disagreement in best approach, but mostly a case of what could be made
to work; so in later updates I hope we'll see improvements here.
Kind regards,
Matthias van de Meent
On 10/17/22 16:00, Matthias van de Meent wrote:
On Mon, 17 Oct 2022 at 05:43, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:On 10/16/22 22:17, Matthias van de Meent wrote:
On Sun, 16 Oct 2022 at 16:34, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
Try to formulate the whole algorithm. Maybe I'm missing something.
The current algorithm is something like this:
1. request info about ranges from the BRIN opclass
2. sort them by maxval and minval
Why sort on maxval and minval? That seems wasteful for effectively all
sorts, where range sort on minval should suffice: If you find a range
that starts at 100 in a list of ranges sorted at minval, you've
processed all values <100. You can't make a similar comparison when
that range is sorted on maxvals.
Because that allows to identify overlapping ranges quickly.
Imagine you have the ranges sorted by maxval, which allows you to add
tuples in small increments. But how do you know there's not a range
(possibly with arbitrarily high maxval), that however overlaps with the
range we're currently processing?
Why do we need to identify overlapping ranges specifically? If you
sort by minval, it becomes obvious that any subsequent range cannot
contain values < the minval of the next range in the list, allowing
you to emit any values less than the next, unprocessed, minmax range's
minval.
D'oh! I think you're right, it should be possible to do this with only a
sort by minval. And it might actually be a better way to do that.
I think I chose the "maxval" ordering because it seemed reasonable.
Looking at the current range and using the maxval as the threshold
seemed reasonable. But it leads to a bunch of complexity with the
intersecting ranges, and I never reconsidered this choice. Silly me.
3. NULLS FIRST: read all ranges that might have NULLs => output
4. read the next range (by maxval) into tuplesort
(if no more ranges, go to (9))
5. load all tuples from "splill" tuplestore, compare to maxvalInstead of this, shouldn't an update to tuplesort that allows for
restarting the sort be better than this? Moving tuples that we've
accepted into BRINsort state but not yet returned around seems like a
waste of cycles, and I can't think of a reason why it can't work.
I don't understand what you mean by "update to tuplesort". Can you
elaborate?
Tuplesort currently only allows the following workflow: you load
tuples, then call finalize, then extract tuples. There is currently no
way to add tuples once you've started extracting them.
For my design to work efficiently or without hacking into the
internals of tuplesort, we'd need a way to restart or 'un-finalize'
the tuplesort so that it returns to the 'load tuples' phase. Because
all data of the previous iteration is already sorted, adding more data
shouldn't be too expensive.
Not sure. I still think it's better to limit the amount of data we have
in the tuplesort. Even if the tuplesort can efficiently skip the already
sorted part, it'll still occupy disk space, possibly even force the data
to disk etc. (We'll still have to write that into a tuplestore, but that
should be relatively small and short-lived/recycled).
FWIW I wonder if the assumption that tuplesort can quickly skip already
sorted data holds e.g. for tuplesorts much larger than work_mem, but I
haven't checked that.
I'd also like to include some more info in the explain, like how many
times we did a sort, and what was the largest amount of data we sorted.
Although, maybe that could be captured by tracking the tuplesort size of
the last sort.
Considering the tuplesort does not currently support this, I'll probably
stick to the existing approach with separate tuplestore. There's enough
complexity in the patch already, I think. The only thing we'll need with
the minval ordering is the ability to "peek ahead" to the next minval
(which is going to be the threshold used to route values either to
tuplesort or tuplestore).
The point of spilling them into a tuplestore is to make the sort cheaper
by not sorting tuples that can't possibly be produced, because the value
exceeds the current maxval. Consider ranges sorted by maxval
[...]
Or maybe I just don't understand what you mean.
If we sort the ranges by minval like this:
1. [0,1000]
2. [0,999]
3. [50,998]
4. [100,997]
5. [100,996]
6. [150,995]
Then we can load and sort the values for range 1 and 2, and emit all
values up to (not including) 50 - the minval of the next,
not-yet-loaded range in the ordered list of ranges. Then add the
values from range 3 to the set of tuples we have yet to output; sort;
and then emit values up to 100 (range 4's minval), etc. This reduces
the amount of tuples in the tuplesort to the minimum amount needed to
output any specific value.
If the ranges are sorted and loaded by maxval, like your algorithm expects:
1. [150,995]
2. [100,996]
3. [100,997]
4. [50,998]
5. [0,999]
6. [0,1000]
We need to load all ranges into the sort before it could start
emitting any tuples, as all ranges overlap with the first range.
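To make the minval-ordered scheme concrete, here's a minimal stand-alone C
sketch (the toy types, data and array sizes are invented for illustration,
and it re-sorts the whole working set every step, which the actual node
would avoid):

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* toy "BRIN range": summary (minval/maxval) plus the values it contains */
typedef struct Range { int minval; int maxval; int nvalues; int values[8]; } Range;

static int cmp_int(const void *a, const void *b)
{
    int va = *(const int *) a, vb = *(const int *) b;
    return (va > vb) - (va < vb);
}

static int cmp_range_minval(const void *a, const void *b)
{
    return ((const Range *) a)->minval - ((const Range *) b)->minval;
}

int main(void)
{
    Range ranges[] = {
        {100,  997, 3, {150, 400, 997}},
        {  0, 1000, 3, {0, 700, 1000}},
        { 50,  998, 3, {50, 60, 998}},
    };
    int nranges = sizeof(ranges) / sizeof(ranges[0]);
    int sorted[64];             /* stands in for the tuplesort */
    int nsorted = 0;

    /* sort ranges by minval */
    qsort(ranges, nranges, sizeof(Range), cmp_range_minval);

    for (int i = 0; i < nranges; i++)
    {
        /* load the next range and (re)sort the working set */
        for (int j = 0; j < ranges[i].nvalues; j++)
            sorted[nsorted++] = ranges[i].values[j];
        qsort(sorted, nsorted, sizeof(int), cmp_int);

        /* watermark = minval of the next, not-yet-loaded range; no later
         * range can contain anything below it, so those values are final */
        int watermark = (i + 1 < nranges) ? ranges[i + 1].minval : INT_MAX;

        int emitted = 0;
        while (emitted < nsorted && sorted[emitted] < watermark)
            printf("emit %d\n", sorted[emitted++]);

        /* keep the rest of the working set for the next step */
        memmove(sorted, sorted + emitted, (nsorted - emitted) * sizeof(int));
        nsorted -= emitted;
    }
    return 0;
}

In the patch the not-yet-emittable tuples go to a tuplestore instead of
staying in the sort, but the watermark logic is the same idea.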
Right, thanks - I get this now.
[algo]
I don't think this works, because we may get a range (Rs') with very
high maxval (thus read very late from Rs), but with very low minval.
AFAICS max_sorted must never go back, and this breaks it.
max_sorted cannot go back, because it is the min value of the next
range in the list of ranges sorted by min value; see also above.
There is a small issue in my algorithm where I use <= for yielding
values where it should be <, where initialization of max_value to NULL
is then incorrect, but apart from that I don't think there are any
issues with the base algorithm.
The maximum cost of this tuplesort would be the cost of sorting a
seqscanned table, plus sorting the relevant BRIN ranges, plus the 1
extra compare per tuple and range that are needed to determine whether
the range or tuple should be extracted from the tuplesort. The minimum
cost would be the cost of sorting all BRIN ranges, plus sorting all
tuples in one of the index's ranges.
I'm not a tuplesort expert, but my assumption is that it's better to sort
smaller amounts of rows - which is why the patch sorts only the rows it
knows it can actually output.
I see that the two main differences between our designs are in
answering these questions:
- How do we select table ranges for processing?
- How do we handle tuples that we know we can't output yet?
For the first, I think the differences are explained above. The main
drawback of your selection algorithm seems to be that your algorithm's
worst-case is "all ranges overlap", whereas my algorithm's worst case
is "all ranges start at the same value", which is only a subset of
your worst case.
Right, those are very good points.
For the second, the difference is whether we choose to sort the tuples
that are out-of-bounds, but are already in the working set due to
being returned from a range overlapping with the current bound.
My algorithm tries to reduce the overhead of increasing the sort
boundaries by also sorting the out-of-bound data, allowing for
O(n-less-than-newbound) overhead of extending the bounds (total
complexity for whole sort O(n-out-of-bound)), and O(n log n)
processing of all tuples during insertion.
Your algorithm - if I understand it correctly - seems to optimize for
faster results within the current bound by not sorting the
out-of-bounds data with O(1) processing when out-of-bounds, at the
cost of needing O(n-out-of-bound-tuples) operations when the maxval /
max_sorted boundary is increased, with a complexity of O(n*m) for an
average of n out-of-bound tuples and m bound updates.
Right. I wonder if these are actually complementary approaches, and
we could/should pick between them depending on how many rows we expect
to consume.
My focus was LIMIT queries, so I favored the approach with the lowest
startup cost. I haven't quite planned for this to work so well even in
full-sort cases. That kinda surprised me (I wonder if the very large
tuplesorts - compared to work_mem - would hurt this, though).
Lastly, there is the small difference in how the ranges are extracted
from BRIN: I prefer and mention an iterative approach where the tuples
are extracted from the index and loaded into a tuplesort in some
iterative fashion (which spills to disk and does not need all tuples
to reside in memory), whereas your current approach was mentioned as
(paraphrasing) 'allocate all this data in one chunk and hope that
there is enough memory available'. I think this is not so much a
disagreement in best approach, but mostly a case of what could be made
to work; so in later updates I hope we'll see improvements here.
Right. I think I mentioned this in my post [1], where I also envisioned
some sort of iterative approach. And I think you're right the approach
with ordering by minval is naturally more suitable because it just
consumes the single sequence of ranges.
regards
[1]: /messages/by-id/1a7c2ff5-a855-64e9-0272-1f9947f8a558@enterprisedb.com
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
here's an updated/reworked version of the patch, on top of the "BRIN
statistics" patch as 0001 (because some of the stuff is useful, but we
can ignore this part in this thread).
Warning: I realized the new node is somewhat broken when it comes to
projection and matching the indexed column, most likely because the
targetlists are wired/processed incorrectly or something like that. So
when experimenting with this, just index the first column of the table
and don't do anything requiring a projection. I'll get this fixed, but
I've been focusing on the other stuff. I'm not particularly familiar
with this tlist/project stuff, so any help is welcome.
The main change in this version is the adoption of multiple ideas
suggested by Matthias in his earlier responses.
Firstly, this changes how the index opclass passes information to the
executor node. Instead of using a plain array, we now use a tuplesort.
This addresses the memory consumption issues with large number of
ranges, and it also simplifies the sorting etc. which is now handled by
the tuplesort. The support procedure simply fills a tuplesort and then
hands it over to the caller (more or less).
Secondly, instead of ordering the ranges by maxval, this orders them by
minval (as suggested by Matthias), which greatly simplifies the code
because we don't need to detect overlapping ranges etc.
More precisely, the ranges are sorted to get this ordering:
- not yet summarized ranges
- ranges sorted by (minval, blkno)
- all-nulls ranges
This particular ordering is beneficial for the algorithm, which does two
passes over the ranges. For the NULLS LAST case (i.e. the default), we
do this:
- produce tuples with non-NULL values, ordered by the value
- produce tuples with NULL values (arbitrary order)
And each of these phases does a separate pass over the ranges (I'll get
to that in a minute). And the ordering is tailored to this.
Note: For DESC we'd sort by maxval, and for NULLS FIRST the phases would
happen in the opposite order, but those are details. Let's assume ASC
ordering with NULLS LAST, unless stated otherwise.
The idea here is that all not-summarized ranges need to be processed
always, both when processing NULLs and non-NULL values, which happens as
two separate passes over ranges.
The all-null ranges don't need to be processed during the non-NULL pass,
and we can terminate this pass early once we hit the first null-only
range. So placing them last helps with this.
The regular ranges are ordered by minval, as dictated by the algorithm
(which is now described in the nodeBrinSort.c comment), but we also sort
them by blkno to make the access a bit more sequential. That only matters
for ranges with the same minval, which is probably rare, but the extra
sort key is cheap, so why not.
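For the ASC / NULLS LAST case, that ordering could be expressed as a qsort
comparator along these lines (the struct and field names are made up for the
sketch, not the ones used in the patch):

#include <stdbool.h>
#include <stdlib.h>

/* hypothetical per-range summary, for this sketch only */
typedef struct RangeInfo
{
    int     blkno;              /* first block of the range */
    bool    not_summarized;     /* no BRIN tuple for this range yet */
    bool    all_nulls;          /* range contains only NULLs */
    int     minval;             /* minimum indexed value in the range */
} RangeInfo;

/* 1. not-summarized, 2. regular ranges by (minval, blkno), 3. all-nulls */
static int
range_order_cmp(const void *pa, const void *pb)
{
    const RangeInfo *a = pa;
    const RangeInfo *b = pb;
    int     ga = a->not_summarized ? 0 : (a->all_nulls ? 2 : 1);
    int     gb = b->not_summarized ? 0 : (b->all_nulls ? 2 : 1);

    if (ga != gb)
        return ga - gb;
    if (ga == 1 && a->minval != b->minval)
        return (a->minval < b->minval) ? -1 : 1;
    return (a->blkno < b->blkno) ? -1 : (a->blkno > b->blkno);
}

Sorting the range array with qsort(ranges, nranges, sizeof(RangeInfo),
range_order_cmp) then yields the order the two passes want to consume.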
I mentioned we do two separate passes - one for non-NULL values, one for
NULL values. That may be somewhat inefficient, because in extreme cases
we might end up scanning the whole table twice (imagine BRIN ranges
where each range has both regular values and NULLs). It might be
possible to do all of this in a single pass, at least in some cases -
for example while scanning ranges, we might stash NULL rows into a
tuplestore, so that the second pass is not needed. That assumes there
are not too many such rows (otherwise we might need to write and then
read many rows, outweighing the cost of just doing two passes). This
should be possible to estimate/cost fairly well, I think, and the
comment in nodeBrinSort actually presents some ideas about this.
And we can't do that for the NULLS FIRST case, because if we stash the
non-NULL rows somewhere, we won't be able to do the "incremental" sort,
i.e. we might just do regular Sort right away. So I've kept this simple
approach with two passes for now.
This still uses the approach with spilling tuples to a tuplestore, and
only sorting rows that we know are safe to output. I still think this is
a good approach, for the reasons I explained before, but changing this
is not hard so we can experiment.
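The per-tuple decision is a single comparison against the current watermark,
roughly like this (a hypothetical helper, just to illustrate the routing):

/*
 * Values below the watermark cannot appear in any not-yet-read range, so
 * they may be sorted and emitted now; everything else is stashed in the
 * tuplestore and reconsidered once the watermark advances.
 */
static void
route_value(int value, int watermark,
            void (*add_to_tuplesort)(int),
            void (*add_to_tuplestore)(int))
{
    if (value < watermark)
        add_to_tuplesort(value);
    else
        add_to_tuplestore(value);
}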
There's however a related question - how quickly should we increment the
minval value, serving as a watermark? One option is to go to the next
distinct minval value - but that may result in an excessive number of tiny
sorts, because the number of ranges and rows between the old and new minval
values tends to be small. Another negative consequence is that this may
cause a lot of spilling (and re-spilling), because we only consume a tiny
number of rows from the tuplestore after incrementing the watermark.
Or we can do larger steps, skipping some of the minval values, so that
more rows qualify for the sort. Of course, too large a step means we'll
exceed work_mem and switch to an on-disk sort, which we probably don't
want. Also, this may be the wrong thing to do for LIMIT queries, which
only need a couple of rows and for which a tiny sort is fine (because
we won't do too many of them).
Patch 0004 introduces a new GUC called brinsort_watermark_step, that can
be used to experiment with this. By default it's set to '1' which means
we simply progress to the next minval value.
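To make the GUC's effect concrete: with ranges already sorted by minval,
advancing the watermark by brinsort_watermark_step distinct minval values
could look roughly like this (a simplified stand-alone helper, not the
code in 0004):

/*
 * Move *pos forward until "step" distinct minval values have been skipped
 * (or we run out of ranges), and return the new watermark.  With step = 1
 * this simply advances to the next distinct minval.
 */
static int
advance_watermark(const int *minvals, int nranges, int *pos, int step)
{
    int     skipped = 0;

    while (*pos + 1 < nranges && skipped < step)
    {
        if (minvals[*pos + 1] != minvals[*pos])
            skipped++;          /* crossed into a new distinct minval */
        (*pos)++;
    }
    return minvals[*pos];
}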
Then 0005 tries to tune this based on statistics - we estimate the number
of rows each minval increment is expected to "add", and then pick a step
value that should not overflow work_mem. This happens in
create_brinsort_plan, and the comment explains the main weakness - the
way the number of rows is estimated is somewhat naive, as it simply
divides reltuples by the number of ranges. But I have a couple ideas about
what statistics we might collect, explained in 0001 in the comment at
brin_minmax_stats.
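The naive estimate could be sketched like this (the names and the
uniform-distribution assumption are mine, the real logic is in
create_brinsort_plan):

#define Max(a, b)   ((a) > (b) ? (a) : (b))     /* as in PostgreSQL's c.h */

/*
 * Pick a watermark step so that one increment is expected to add roughly as
 * many rows as fit into work_mem, assuming rows are spread uniformly across
 * ranges (the naive reltuples / nranges estimate mentioned above).
 */
static int
choose_watermark_step(double reltuples, double nranges,
                      double row_width, double work_mem_bytes)
{
    double  rows_per_increment = reltuples / Max(nranges, 1.0);
    double  rows_in_work_mem = work_mem_bytes / Max(row_width, 1.0);
    int     step = (int) (rows_in_work_mem / Max(rows_per_increment, 1.0));

    return Max(step, 1);        /* never less than "next distinct minval" */
}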
But there's another option - we can tune the step based on past sorts.
If we see the sorts spilling to disk, we can try smaller steps. Patch
0006 implements a very simple variant of this. There are a couple of
ideas about how it might be improved, mentioned in the comment at
brinsort_adjust_watermark_step.
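A feedback rule in the spirit of 0006 might be as simple as the following
(invented names and thresholds; the actual heuristics live in
brinsort_adjust_watermark_step):

#include <stdbool.h>

/*
 * After each sort: back off if the last sort spilled to disk, cautiously
 * grow the step again if it stayed comfortably within work_mem.
 */
static int
adjust_watermark_step(int step, bool last_sort_spilled, double mem_used_fraction)
{
    if (last_sort_spilled)
        return (step > 1) ? step / 2 : 1;
    if (mem_used_fraction < 0.5)
        return step + 1;
    return step;
}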
There's also patch 0003, which extends the EXPLAIN output with counters
tracking the number of sorts, counts of on-disk/in-memory
sorts, space used, number of rows sorted/spilled, and so on. This is
useful when analyzing e.g. the effect of higher/lower watermark steps,
discussed in the preceding paragraphs.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
0001-Allow-index-AMs-to-build-and-use-custom-sta-20221022.patch
From d8da87f72f367cc0364357ff4bda1dd02810215d Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Mon, 17 Oct 2022 18:39:28 +0200
Subject: [PATCH 1/6] Allow index AMs to build and use custom statistics
Some indexing AMs work very differently and estimating them using
existing statistics is problematic, producing unreliable costing. This
applies e.g. to BRIN, which relies on page ranges, not tuple pointers.
This adds an optional AM procedure, allowing the opfamily to build
custom statistics, store them in pg_statistic and then use them during
planning. By default this is disabled, but may be enabled by setting
SET enable_indexam_stats = true;
Then ANALYZE will call the optional procedure for all indexes.
---
src/backend/access/brin/brin.c | 1 +
src/backend/access/brin/brin_minmax.c | 1332 +++++++++++++++++++++++++
src/backend/commands/analyze.c | 138 ++-
src/backend/utils/adt/selfuncs.c | 59 ++
src/backend/utils/cache/lsyscache.c | 41 +
src/backend/utils/misc/guc_tables.c | 10 +
src/include/access/amapi.h | 2 +
src/include/access/brin.h | 51 +
src/include/access/brin_internal.h | 1 +
src/include/catalog/pg_amproc.dat | 64 ++
src/include/catalog/pg_proc.dat | 4 +
src/include/catalog/pg_statistic.h | 5 +
src/include/commands/vacuum.h | 2 +
src/include/utils/lsyscache.h | 1 +
14 files changed, 1706 insertions(+), 5 deletions(-)
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 6fabd14c263..3fe95cd717f 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -96,6 +96,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->amstrategies = 0;
amroutine->amsupport = BRIN_LAST_OPTIONAL_PROCNUM;
amroutine->amoptsprocnum = BRIN_PROCNUM_OPTIONS;
+ amroutine->amstatsprocnum = BRIN_PROCNUM_STATISTICS;
amroutine->amcanorder = false;
amroutine->amcanorderbyop = false;
amroutine->amcanbackward = false;
diff --git a/src/backend/access/brin/brin_minmax.c b/src/backend/access/brin/brin_minmax.c
index ead9e8f4e36..0135a00ae91 100644
--- a/src/backend/access/brin/brin_minmax.c
+++ b/src/backend/access/brin/brin_minmax.c
@@ -10,17 +10,22 @@
*/
#include "postgres.h"
+#include "access/brin.h"
#include "access/brin_internal.h"
+#include "access/brin_revmap.h"
#include "access/brin_tuple.h"
#include "access/genam.h"
#include "access/stratnum.h"
#include "catalog/pg_amop.h"
#include "catalog/pg_type.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
#include "utils/builtins.h"
#include "utils/datum.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"
#include "utils/syscache.h"
+#include "utils/timestamp.h"
typedef struct MinmaxOpaque
{
@@ -31,6 +36,11 @@ typedef struct MinmaxOpaque
static FmgrInfo *minmax_get_strategy_procinfo(BrinDesc *bdesc, uint16 attno,
Oid subtype, uint16 strategynum);
+/* print debugging into about calculated statistics */
+#define STATS_DEBUG
+
+/* calculate the stats in different ways for cross-checking */
+#define STATS_CROSS_CHECK
Datum
brin_minmax_opcinfo(PG_FUNCTION_ARGS)
@@ -262,6 +272,1328 @@ brin_minmax_union(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+/* FIXME copy of a private struct from brin.c */
+typedef struct BrinOpaque
+{
+ BlockNumber bo_pagesPerRange;
+ BrinRevmap *bo_rmAccess;
+ BrinDesc *bo_bdesc;
+} BrinOpaque;
+
+/*
+ * Compare ranges by minval (collation and operator are taken from the extra
+ * argument, which is expected to be TypeCacheEntry).
+ */
+static int
+range_minval_cmp(const void *a, const void *b, void *arg)
+{
+ BrinRange *ra = *(BrinRange **) a;
+ BrinRange *rb = *(BrinRange **) b;
+ TypeCacheEntry *typentry = (TypeCacheEntry *) arg;
+ FmgrInfo *cmpfunc = &typentry->cmp_proc_finfo;
+ Datum c;
+ int r;
+
+ c = FunctionCall2Coll(cmpfunc, typentry->typcollation,
+ ra->min_value, rb->min_value);
+ r = DatumGetInt32(c);
+
+ if (r != 0)
+ return r;
+
+ if (ra->blkno_start < rb->blkno_start)
+ return -1;
+ else
+ return 1;
+}
+
+/*
+ * Compare ranges by maxval (collation and operator are taken from the extra
+ * argument, which is expected to be TypeCacheEntry).
+ */
+static int
+range_maxval_cmp(const void *a, const void *b, void *arg)
+{
+ BrinRange *ra = *(BrinRange **) a;
+ BrinRange *rb = *(BrinRange **) b;
+ TypeCacheEntry *typentry = (TypeCacheEntry *) arg;
+ FmgrInfo *cmpfunc = &typentry->cmp_proc_finfo;
+ Datum c;
+ int r;
+
+ c = FunctionCall2Coll(cmpfunc, typentry->typcollation,
+ ra->max_value, rb->max_value);
+ r = DatumGetInt32(c);
+
+ if (r != 0)
+ return r;
+
+ if (ra->blkno_start < rb->blkno_start)
+ return -1;
+ else
+ return 1;
+}
+
+/* compare values using an operator from typcache */
+static int
+range_values_cmp(const void *a, const void *b, void *arg)
+{
+ Datum da = * (Datum *) a;
+ Datum db = * (Datum *) b;
+ TypeCacheEntry *typentry = (TypeCacheEntry *) arg;
+ FmgrInfo *cmpfunc = &typentry->cmp_proc_finfo;
+ Datum c;
+
+ c = FunctionCall2Coll(cmpfunc, typentry->typcollation,
+ da, db);
+ return DatumGetInt32(c);
+}
+
+/*
+ * maxval_start
+ * Determine first index so that (maxvalue >= value).
+ *
+ * The array of ranges is expected to be sorted by maxvalue, so this is the first
+ * range that can possibly intersect with range having "value" as minval.
+ */
+static int
+maxval_start(BrinRange **ranges, int nranges, Datum value, TypeCacheEntry *typcache)
+{
+ int start = 0,
+ end = (nranges - 1);
+
+ // everything matches
+ if (range_values_cmp(&value, &ranges[start]->max_value, typcache) <= 0)
+ return 0;
+
+ // no matches
+ if (range_values_cmp(&value, &ranges[end]->max_value, typcache) > 0)
+ return nranges;
+
+ while ((end - start) > 0)
+ {
+ int midpoint;
+ int r;
+
+ midpoint = start + (end - start) / 2;
+
+ r = range_values_cmp(&value, &ranges[midpoint]->max_value, typcache);
+
+ if (r <= 0)
+ end = midpoint;
+ else
+ start = (midpoint + 1);
+ }
+
+ Assert(ranges[start]->max_value >= value);
+ Assert(ranges[start-1]->max_value < value);
+
+ return start;
+}
+
+/*
+ * minval_end
+ * Determine first index so that (minval > value).
+ *
+ * The array of ranges is expected to be sorted by minvalue, so this is the first
+ * range that can't possibly intersect with a range having "value" as maxval.
+ */
+static int
+minval_end(BrinRange **ranges, int nranges, Datum value, TypeCacheEntry *typcache)
+{
+ int start = 0,
+ end = (nranges - 1);
+
+ // everything matches
+ if (range_values_cmp(&value, &ranges[end]->min_value, typcache) >= 0)
+ return nranges;
+
+ // no matches
+ if (range_values_cmp(&value, &ranges[start]->min_value, typcache) < 0)
+ return 0;
+
+ while ((end - start) > 0)
+ {
+ int midpoint;
+ int r;
+
+ midpoint = start + (end - start) / 2;
+
+ r = range_values_cmp(&value, &ranges[midpoint]->min_value, typcache);
+
+ if (r >= 0)
+ start = midpoint + 1;
+ else
+ end = midpoint;
+ }
+
+ Assert(ranges[start]->min_value > value);
+ Assert(ranges[start-1]->min_value <= value);
+
+ return start;
+}
+
+
+/*
+ * lower_bound
+ * Determine first index so that (values[index] >= value).
+ *
+ * The array of ranges is expected to be sorted by maxvalue, so this is the first
+ * range that can possibly intersect with range having "value" as minval.
+ */
+static int
+lower_bound(Datum *values, int nvalues, Datum value, TypeCacheEntry *typcache)
+{
+ int start = 0,
+ end = (nvalues - 1);
+
+ // everything matches
+ if (range_values_cmp(&value, &values[start], typcache) <= 0)
+ return 0;
+
+ // no matches
+ if (range_values_cmp(&value, &values[end], typcache) > 0)
+ return nvalues;
+
+ while ((end - start) > 0)
+ {
+ int midpoint;
+ int r;
+
+ midpoint = start + (end - start) / 2;
+
+ r = range_values_cmp(&value, &values[midpoint], typcache);
+
+ if (r <= 0)
+ end = midpoint;
+ else
+ start = (midpoint + 1);
+ }
+
+ Assert(values[start] >= value);
+ Assert(values[start-1] < value);
+
+ return start;
+}
+
+/*
+ * upper_bound
+ * Determine first index so that (values[index] > value).
+ *
+ * The array of ranges is expected to be sorted by minvalue, so this is the first
+ * range that can't possibly intersect with a range having "value" as maxval.
+ */
+static int
+upper_bound(Datum *values, int nvalues, Datum value, TypeCacheEntry *typcache)
+{
+ int start = 0,
+ end = (nvalues - 1);
+
+ // everything matches
+ if (range_values_cmp(&value, &values[end], typcache) >= 0)
+ return nvalues;
+
+ // no matches
+ if (range_values_cmp(&value, &values[start], typcache) < 0)
+ return 0;
+
+ while ((end - start) > 0)
+ {
+ int midpoint;
+ int r;
+
+ midpoint = start + (end - start) / 2;
+
+ r = range_values_cmp(&value, &values[midpoint], typcache);
+
+ if (r >= 0)
+ start = midpoint + 1;
+ else
+ end = midpoint;
+ }
+
+ Assert(values[start] > value);
+ Assert(values[start-1] <= value);
+
+ return start;
+}
+
+/*
+ * Simple histogram, with bins tracking value and two overlap counts.
+ *
+ * XXX Maybe we should have two separate histograms, one for all counts and
+ * another one for "unique" values.
+ *
+ * XXX Serialize the histogram. There might be a data set where we have very
+ * many distinct buckets (values having very different number of matching
+ * ranges) - not sure if there's some sort of upper limit (but hard to say for
+ * other opclasses, like bloom). And we don't want arbitrarily large histogram,
+ * to keep the statistics fairly small, I guess. So we'd need to pick a subset,
+ * merge buckets with "similar" counts, or approximate it somehow. For now we
+ * don't serialize it, because we don't use the histogram.
+ */
+typedef struct histogram_bin_t
+{
+ int value;
+ int count;
+} histogram_bin_t;
+
+typedef struct histogram_t
+{
+ int nbins;
+ int nbins_max;
+ histogram_bin_t bins[FLEXIBLE_ARRAY_MEMBER];
+} histogram_t;
+
+#define HISTOGRAM_BINS_START 32
+
+/* allocate histogram with default number of bins */
+static histogram_t *
+histogram_init(void)
+{
+ histogram_t *hist;
+
+ hist = (histogram_t *) palloc0(offsetof(histogram_t, bins) +
+ sizeof(histogram_bin_t) * HISTOGRAM_BINS_START);
+ hist->nbins_max = HISTOGRAM_BINS_START;
+
+ return hist;
+}
+
+/*
+ * histogram_add
+ * Add a hit for a particular value to the histogram.
+ *
+ * XXX We don't keep the bins sorted, so we can't use binary search. For a large
+ * number of bins this might be an issue; for a small number a linear search is fine.
+ */
+static histogram_t *
+histogram_add(histogram_t *hist, int value)
+{
+ bool found = false;
+ histogram_bin_t *bin;
+
+ for (int i = 0; i < hist->nbins; i++)
+ {
+ if (hist->bins[i].value == value)
+ {
+ bin = &hist->bins[i];
+ found = true;
+ }
+ }
+
+ if (!found)
+ {
+ if (hist->nbins == hist->nbins_max)
+ {
+ int nbins = (2 * hist->nbins_max);
+ hist = repalloc(hist, offsetof(histogram_t, bins) +
+ sizeof(histogram_bin_t) * nbins);
+ hist->nbins_max = nbins;
+ }
+
+ Assert(hist->nbins < hist->nbins_max);
+
+ bin = &hist->bins[hist->nbins++];
+ bin->value = value;
+ bin->count = 0;
+ }
+
+ bin->count += 1;
+
+ Assert(bin->value == value);
+ Assert(bin->count >= 0);
+
+ return hist;
+}
+
+/* used to sort histogram bins by value */
+static int
+histogram_bin_cmp(const void *a, const void *b)
+{
+ histogram_bin_t *ba = (histogram_bin_t *) a;
+ histogram_bin_t *bb = (histogram_bin_t *) b;
+
+ if (ba->value < bb->value)
+ return -1;
+
+ if (bb->value < ba->value)
+ return 1;
+
+ return 0;
+}
+
+static void
+histogram_print(histogram_t *hist)
+{
+ return;
+
+ elog(WARNING, "----- histogram -----");
+ for (int i = 0; i < hist->nbins; i++)
+ {
+ elog(WARNING, "bin %d value %d count %d",
+ i, hist->bins[i].value, hist->bins[i].count);
+ }
+}
+
+/*
+ * brin_minmax_count_overlaps
+ * Calculate number of overlaps.
+ *
+ * This uses the minranges to quickly eliminate ranges that can't possibly
+ * intersect. We simply walk minranges until minval > current maxval, and
+ * we're done.
+ *
+ * Unlike brin_minmax_count_overlaps2, this does not have issues with wide
+ * ranges, so this is what we should use.
+ */
+static int
+brin_minmax_count_overlaps(BrinRange **minranges, int nranges, TypeCacheEntry *typcache)
+{
+ int noverlaps;
+
+#ifdef STATS_DEBUG
+ TimestampTz start_ts = GetCurrentTimestamp();
+#endif
+
+ noverlaps = 0;
+ for (int i = 0; i < nranges; i++)
+ {
+ Datum maxval = minranges[i]->max_value;
+
+ /*
+ * Determine index of the first range with (minval > current maxval)
+ * by binary search. We know all other ranges can't overlap the
+ * current one. We simply subtract indexes to count ranges.
+ */
+ int idx = minval_end(minranges, nranges, maxval, typcache);
+
+ /* -1 because we don't count the range as intersecting with itself */
+ noverlaps += (idx - i - 1);
+ }
+
+ /*
+ * We only count 1/2 the ranges (minval > current minval), so the total
+ * number of overlaps is twice what we counted.
+ */
+ noverlaps *= 2;
+
+#ifdef STATS_DEBUG
+ elog(WARNING, "----- brin_minmax_count_overlaps -----");
+ elog(WARNING, "noverlaps = %d", noverlaps);
+ elog(WARNING, "duration = %ld", TimestampDifferenceMilliseconds(start_ts,
+ GetCurrentTimestamp()));
+#endif
+
+ return noverlaps;
+}
+
+#ifdef STATS_CROSS_CHECK
+/*
+ * brin_minmax_count_overlaps2
+ * Calculate number of overlaps.
+ *
+ * This uses the minranges/maxranges to quickly eliminate ranges that can't
+ * possibly intersect.
+ *
+ * XXX Seems rather complicated and works poorly for wide ranges (with outlier
+ * values), brin_minmax_count_overlaps is likely better.
+ */
+static int
+brin_minmax_count_overlaps2(BrinRanges *ranges,
+ BrinRange **minranges, BrinRange **maxranges,
+ TypeCacheEntry *typcache)
+{
+ int noverlaps;
+
+ TimestampTz start_ts = GetCurrentTimestamp();
+
+ /*
+ * Walk the ranges ordered by max_values, see how many ranges overlap.
+ *
+ * Once we get to a state where (min_value > current.max_value) for
+ * all future ranges, we know none of them can overlap and we can
+ * terminate. This is what min_index_lowest is for.
+ *
+ * XXX If there are very wide ranges (with outlier min/max values),
+ * the min_index_lowest is going to be pretty useless, because the
+ * range will be sorted at the very end by max_value, but will have
+ * very low min_index, so this won't work.
+ *
+ * XXX We could collect a more elaborate stuff, like for example a
+ * histogram of number of overlaps, or maximum number of overlaps.
+ * So we'd have average, but then also an info if there are some
+ * ranges with very many overlaps.
+ */
+ noverlaps = 0;
+ for (int i = 0; i < ranges->nranges; i++)
+ {
+ int idx = i+1;
+ BrinRange *ra = maxranges[i];
+ uint64 min_index = ra->min_index;
+
+ CHECK_FOR_INTERRUPTS();
+
+#ifdef NOT_USED
+ /*
+ * XXX Not needed, we can just count "future" ranges and then
+ * we just multiply by 2.
+ */
+
+ /*
+ * What's the first range that might overlap with this one?
+ * needs to have maxval > current.minval.
+ */
+ while (idx > 0)
+ {
+ BrinRange *rb = maxranges[idx - 1];
+
+ /* the range is before the current one, so can't intersect */
+ if (range_values_cmp(&rb->max_value, &ra->min_value, typcache) < 0)
+ break;
+
+ idx--;
+ }
+#endif
+
+ /*
+ * Find the first min_index that is higher than the max_value,
+ * so that we can compare that instead of the values in the
+ * next loop. There should be fewer value comparisons than in
+ * the next loop, so we'll save on function calls.
+ */
+ while (min_index < ranges->nranges)
+ {
+ if (range_values_cmp(&minranges[min_index]->min_value,
+ &ra->max_value, typcache) > 0)
+ break;
+
+ min_index++;
+ }
+
+ /*
+ * Walk the following ranges (ordered by max_value), and check
+ * if it overlaps. If it matches, we look at the next one. If
+ * not, we check if there can be more ranges.
+ */
+ for (int j = idx; j < ranges->nranges; j++)
+ {
+ BrinRange *rb = maxranges[j];
+
+ /* the range overlaps - just continue with the next one */
+ // if (range_values_cmp(&rb->min_value, &ra->max_value, typcache) <= 0)
+ if (rb->min_index < min_index)
+ {
+ noverlaps++;
+ continue;
+ }
+
+ /*
+ * Are there any future ranges that might overlap? We can
+ * check the min_index_lowest to decide quickly.
+ */
+ if (rb->min_index_lowest >= min_index)
+ break;
+ }
+ }
+
+ /*
+ * We only count intersect for "following" ranges when ordered by maxval,
+ * so we only see 1/2 the overlaps. So double the result.
+ */
+ noverlaps *= 2;
+
+ elog(WARNING, "----- brin_minmax_count_overlaps2 -----");
+ elog(WARNING, "noverlaps = %d", noverlaps);
+ elog(WARNING, "duration = %ld", TimestampDifferenceMilliseconds(start_ts,
+ GetCurrentTimestamp()));
+
+ return noverlaps;
+}
+
+/*
+ * brin_minmax_count_overlaps_bruteforce
+ * Calculate number of overlaps by brute force.
+ *
+ * Actually compares every range to every other range. Quite expensive, used
+ * primarily to cross-check the other algorithms.
+ */
+static int
+brin_minmax_count_overlaps_bruteforce(BrinRanges *ranges, TypeCacheEntry *typcache)
+{
+ int noverlaps;
+
+ TimestampTz start_ts = GetCurrentTimestamp();
+
+ /*
+ * Brute force calculation of overlapping ranges, comparing each
+ * range to every other range - bound to be pretty expensive, as
+ * it's pretty much O(N^2). Kept mostly for easy cross-check with
+ * the preceding "optimized" code.
+ */
+ noverlaps = 0;
+ for (int i = 0; i < ranges->nranges; i++)
+ {
+ BrinRange *ra = &ranges->ranges[i];
+
+ for (int j = 0; j < ranges->nranges; j++)
+ {
+ BrinRange *rb = &ranges->ranges[j];
+
+ CHECK_FOR_INTERRUPTS();
+
+ if (i == j)
+ continue;
+
+ if (range_values_cmp(&ra->max_value, &rb->min_value, typcache) < 0)
+ continue;
+
+ if (range_values_cmp(&rb->max_value, &ra->min_value, typcache) < 0)
+ continue;
+
+ elog(DEBUG1, "[%ld,%ld] overlaps [%ld,%ld]",
+ ra->min_value, ra->max_value,
+ rb->min_value, rb->max_value);
+
+ noverlaps++;
+ }
+ }
+
+ elog(WARNING, "----- brin_minmax_count_overlaps_bruteforce -----");
+ elog(WARNING, "noverlaps = %d", noverlaps);
+ elog(WARNING, "duration = %ld", TimestampDifferenceMilliseconds(start_ts,
+ GetCurrentTimestamp()));
+
+ return noverlaps;
+}
+#endif
+
+/*
+ * brin_minmax_match_tuples_to_ranges
+ * Match tuples to ranges, count average number of ranges per tuple.
+ *
+ * Alternative to brin_minmax_match_tuples_to_ranges2, leveraging ordering
+ * of values, not ranges.
+ *
+ * XXX This seems like the optimal way to do this.
+ */
+static void
+brin_minmax_match_tuples_to_ranges(BrinRanges *ranges,
+ int numrows, HeapTuple *rows,
+ int nvalues, Datum *values,
+ TypeCacheEntry *typcache,
+ int *res_nmatches,
+ int *res_nmatches_unique,
+ int *res_nvalues_unique)
+{
+ int nmatches = 0;
+ int nmatches_unique = 0;
+ int nvalues_unique = 0;
+ int nmatches_value = 0;
+
+ int *unique = (int *) palloc0(sizeof(int) * nvalues);
+
+#ifdef STATS_DEBUG
+ TimestampTz start_ts = GetCurrentTimestamp();
+#endif
+
+ /*
+ * Build running count of unique values. We know there are unique[i]
+ * unique values in values array up to index "i".
+ */
+ unique[0] = 1;
+ for (int i = 1; i < nvalues; i++)
+ {
+ if (range_values_cmp(&values[i-1], &values[i], typcache) == 0)
+ unique[i] = unique[i-1];
+ else
+ unique[i] = unique[i-1] + 1;
+ }
+
+ nvalues_unique = unique[nvalues-1];
+
+ /*
+ * Walk the ranges, for each range determine the first/last mapping
+ * value. Use the "unique" array to count the unique values.
+ */
+ for (int i = 0; i < ranges->nranges; i++)
+ {
+ int start;
+ int end;
+
+ CHECK_FOR_INTERRUPTS();
+
+ start = lower_bound(values, nvalues, ranges->ranges[i].min_value, typcache);
+ end = upper_bound(values, nvalues, ranges->ranges[i].max_value, typcache);
+
+ Assert(end > start);
+
+ nmatches_value = (end - start);
+ nmatches_unique += (unique[end-1] - unique[start] + 1);
+
+ nmatches += nmatches_value;
+ }
+
+#ifdef STATS_DEBUG
+ elog(WARNING, "----- brin_minmax_match_tuples_to_ranges -----");
+ elog(WARNING, "nmatches = %d %f", nmatches, (double) nmatches / numrows);
+ elog(WARNING, "nmatches unique = %d %d %f", nmatches_unique, nvalues_unique,
+ (double) nmatches_unique / nvalues_unique);
+ elog(WARNING, "duration = %ld", TimestampDifferenceMilliseconds(start_ts,
+ GetCurrentTimestamp()));
+#endif
+
+ *res_nmatches = nmatches;
+ *res_nmatches_unique = nmatches_unique;
+ *res_nvalues_unique = nvalues_unique;
+}
+
+#ifdef STATS_CROSS_CHECK
+/*
+ * brin_minmax_match_tuples_to_ranges2
+ * Match tuples to ranges, count average number of ranges per tuple.
+ *
+ * Match sample tuples to the ranges, so that we can count how many ranges
+ * a value matches on average. This might seem redundant to the number of
+ * overlaps, because the value is ~avg_overlaps/2.
+ *
+ * Imagine ranges arranged in "shifted" uniformly by 1/overlaps, e.g. with 3
+ * overlaps [0,100], [33,133], [66, 166] and so on. A random value will hit
+ * only half of there ranges, thus 1/2. This can be extended to randomly
+ * overlapping ranges.
+ *
+ * However, we may not be able to count overlaps for some opclasses (e.g. for
+ * bloom ranges), in which case we have at least this.
+ *
+ * This simply walks the values, and determines matching ranges by looking
+ * for lower/upper bound in ranges ordered by minval/maxval.
+ *
+ * XXX The other question is what to do about duplicate values. If we have a
+ * very frequent value in the sample, it's likely in many places/ranges. Which
+ * will skew the average, because it'll be added repeatedly. So we also count
+ * avg_ranges for unique values.
+ *
+ * XXX The relationship that (average_matches ~ average_overlaps/2) only
+ * works for minmax opclass, and can't be extended to minmax-multi. The
+ * overlaps can only consider the two extreme values (essentially treating
+ * the summary as a single minmax range), because that's what brinsort
+ * needs. But the minmax-multi range may have "gaps" (kinda the whole point
+ * of these opclasses), which affects matching tuples to ranges.
+ *
+ * XXX This also builds histograms of the number of matches, both for the
+ * raw and unique values. At the moment we don't do anything with the
+ * results, though (except for printing those).
+ */
+static void
+brin_minmax_match_tuples_to_ranges2(BrinRanges *ranges,
+ BrinRange **minranges, BrinRange **maxranges,
+ int numrows, HeapTuple *rows,
+ int nvalues, Datum *values,
+ TypeCacheEntry *typcache,
+ int *res_nmatches,
+ int *res_nmatches_unique,
+ int *res_nvalues_unique)
+{
+ int nmatches = 0;
+ int nmatches_unique = 0;
+ int nvalues_unique = 0;
+ histogram_t *hist = histogram_init();
+ histogram_t *hist_unique = histogram_init();
+ int nmatches_value = 0;
+
+ TimestampTz start_ts = GetCurrentTimestamp();
+
+ for (int i = 0; i < nvalues; i++)
+ {
+ int start;
+ int end;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Same value as preceding, so just use the preceding count.
+ * We don't increment the unique counters, because this is
+ * a duplicate.
+ */
+ if ((i > 0) && (range_values_cmp(&values[i-1], &values[i], typcache) == 0))
+ {
+ nmatches += nmatches_value;
+ hist = histogram_add(hist, nmatches_value);
+ continue;
+ }
+
+ nmatches_value = 0;
+
+ start = maxval_start(maxranges, ranges->nranges, values[i], typcache);
+ end = minval_end(minranges, ranges->nranges, values[i], typcache);
+
+ for (int j = start; j < ranges->nranges; j++)
+ {
+ if (maxranges[j]->min_index >= end)
+ continue;
+
+ if (maxranges[j]->min_index_lowest >= end)
+ break;
+
+ nmatches_value++;
+ }
+
+ hist = histogram_add(hist, nmatches_value);
+ hist_unique = histogram_add(hist_unique, nmatches_value);
+
+ nmatches += nmatches_value;
+ nmatches_unique += nmatches_value;
+ nvalues_unique++;
+ }
+
+ elog(WARNING, "----- brin_minmax_match_tuples_to_ranges2 -----");
+ elog(WARNING, "nmatches = %d %f", nmatches, (double) nmatches / numrows);
+ elog(WARNING, "nmatches unique = %d %d %f",
+ nmatches_unique, nvalues_unique, (double) nmatches_unique / nvalues_unique);
+ elog(WARNING, "duration = %ld", TimestampDifferenceMilliseconds(start_ts,
+ GetCurrentTimestamp()));
+
+ pg_qsort(hist->bins, hist->nbins, sizeof(histogram_bin_t), histogram_bin_cmp);
+ pg_qsort(hist_unique->bins, hist_unique->nbins, sizeof(histogram_bin_t), histogram_bin_cmp);
+
+ histogram_print(hist);
+ histogram_print(hist_unique);
+
+ pfree(hist);
+ pfree(hist_unique);
+
+ *res_nmatches = nmatches;
+ *res_nmatches_unique = nmatches_unique;
+ *res_nvalues_unique = nvalues_unique;
+}
+
+/*
+ * brin_minmax_match_tuples_to_ranges_bruteforce
+ * Match tuples to ranges, count average number of ranges per tuple.
+ *
+ * Bruteforce approach, used mostly for cross-checking.
+ */
+static void
+brin_minmax_match_tuples_to_ranges_bruteforce(BrinRanges *ranges,
+ int numrows, HeapTuple *rows,
+ int nvalues, Datum *values,
+ TypeCacheEntry *typcache,
+ int *res_nmatches,
+ int *res_nmatches_unique,
+ int *res_nvalues_unique)
+{
+ int nmatches = 0;
+ int nmatches_unique = 0;
+ int nvalues_unique = 0;
+
+ TimestampTz start_ts = GetCurrentTimestamp();
+
+ for (int i = 0; i < nvalues; i++)
+ {
+ bool is_unique;
+ int nmatches_value = 0;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* is this a new value? */
+ is_unique = ((i == 0) || (range_values_cmp(&values[i-1], &values[i], typcache) != 0));
+
+ /* count unique values */
+ nvalues_unique += (is_unique) ? 1 : 0;
+
+ for (int j = 0; j < ranges->nranges; j++)
+ {
+ if (range_values_cmp(&values[i], &ranges->ranges[j].min_value, typcache) < 0)
+ continue;
+
+ if (range_values_cmp(&values[i], &ranges->ranges[j].max_value, typcache) > 0)
+ continue;
+
+ nmatches_value++;
+ }
+
+ nmatches += nmatches_value;
+ nmatches_unique += (is_unique) ? nmatches_value : 0;
+ }
+
+ elog(WARNING, "----- brin_minmax_match_tuples_to_ranges_bruteforce -----");
+ elog(WARNING, "nmatches = %d %f", nmatches, (double) nmatches / numrows);
+ elog(WARNING, "nmatches unique = %d %d %f", nmatches_unique, nvalues_unique,
+ (double) nmatches_unique / nvalues_unique);
+ elog(WARNING, "duration = %ld", TimestampDifferenceMilliseconds(start_ts,
+ GetCurrentTimestamp()));
+
+ *res_nmatches = nmatches;
+ *res_nmatches_unique = nmatches_unique;
+ *res_nvalues_unique = nvalues_unique;
+}
+#endif
+
+/*
+ * brin_minmax_value_stats
+ * Calculate statistics about minval/maxval values.
+ *
+ * We calculate the number of distinct values, and also correlation with respect
+ * to blkno_start. We don't calculate the regular correlation coefficient, because
+ * our goal is to estimate how sequential the accesses are. The regular correlation
+ * would produce 0 for cyclical data sets like mod(i,1000000), even though the
+ * access may be quite sequential. Maybe it should be called differently, not
+ * correlation?
+ *
+ * XXX Maybe this should calculate minval vs. maxval correlation too?
+ *
+ * XXX I don't know how important the sequentiality is - BRIN generally uses 1MB
+ * page ranges, which is pretty sequential and the one random seek in between is
+ * likely going to be negligible. Maybe for small page ranges it'll matter, though.
+ */
+static void
+brin_minmax_value_stats(BrinRange **minranges, BrinRange **maxranges,
+ int nranges, TypeCacheEntry *typcache,
+ double *minval_correlation, int64 *minval_ndistinct,
+ double *maxval_correlation, int64 *maxval_ndistinct)
+{
+ /* */
+ int64 minval_ndist = 1,
+ maxval_ndist = 1,
+ minval_corr = 0,
+ maxval_corr = 0;
+
+ for (int i = 1; i < nranges; i++)
+ {
+ if (range_values_cmp(&minranges[i-1]->min_value, &minranges[i]->min_value, typcache) != 0)
+ minval_ndist++;
+
+ if (range_values_cmp(&maxranges[i-1]->max_value, &maxranges[i]->max_value, typcache) != 0)
+ maxval_ndist++;
+
+ /* is it immediately sequential? */
+ if (minranges[i-1]->blkno_end + 1 == minranges[i]->blkno_start)
+ minval_corr++;
+
+ /* is it immediately sequential? */
+ if (maxranges[i-1]->blkno_end + 1 == maxranges[i]->blkno_start)
+ maxval_corr++;
+ }
+
+ *minval_ndistinct = minval_ndist;
+ *maxval_ndistinct = maxval_ndist;
+
+ *minval_correlation = (double) minval_corr / nranges;
+ *maxval_correlation = (double) maxval_corr / nranges;
+
+#ifdef STATS_DEBUG
+ elog(WARNING, "----- brin_minmax_value_stats -----");
+ elog(WARNING, "minval ndistinct %ld correlation %f",
+ *minval_ndistinct, *minval_correlation);
+
+ elog(WARNING, "maxval ndistinct %ld correlation %f",
+ *maxval_ndistinct, *maxval_correlation);
+#endif
+}
+
+/*
+ * brin_minmax_stats
+ * Calculate custom statistics for a BRIN minmax index.
+ *
+ * At the moment this calculates:
+ *
+ * - number of summarized/not-summarized and all/has nulls ranges
+ * - average number of overlaps for a range
+ * - average number of rows matching a range
+ * - number of distinct minval/maxval values
+ *
+ * There are multiple ways to calculate some of the metrics, so to allow
+ * cross-checking during development it's possible to run and compare all.
+ * To do that, define STATS_CROSS_CHECK. There's also STATS_DEBUG define
+ * that simply prints the calculated results.
+ *
+ * XXX This could also calculate correlation of the range minval, so that
+ * we can estimate how much random I/O will happen during the BrinSort.
+ * And perhaps we should also sort the ranges by (minval,block_start) to
+ * make this as sequential as possible?
+ *
+ * XXX Another interesting statistics might be the number of ranges with
+ * the same minval (or number of distinct minval values), because that's
+ * essentially what we need to estimate how many ranges will be read in
+ * one brinsort step. In fact, knowing the number of distinct minval
+ * values tells us the number of BrinSort loops.
+ *
+ * XXX We might also calculate a histogram of minval/maxval values.
+ *
+ * XXX I wonder if we could track for each range track probabilities:
+ *
+ * - P1 = P(v <= minval)
+ * - P2 = P(x <= Max(maxval)) for Max(maxval) over preceding ranges
+ *
+ * That would allow us to estimate how many ranges we'll have to read to produce
+ * a particular number of rows, because we need the first probability to exceed
+ * the requested number of rows (fraction of the table):
+ *
+ * (limit rows / reltuples) <= P(v <= minval)
+ *
+ * and then the second probability would say how many rows we'll process (either
+ * sort or spill). And inversely for the DESC ordering.
+ *
+ * The difference between P1 for two ranges is how much we'd have to sort
+ * if we moved the watermark between the ranges (first minval to second one).
+ * The (P2 - P1) for the new watermark range measures the number of rows in
+ * the tuplestore. We'll need to aggregate this, though, we can't keep the
+ * whole data - probably average/median/max for the differences would be nice.
+ * Might be tricky for different watermark step values, though.
+ *
+ * This would also allow estimating how many rows will spill from each range,
+ * because we have an estimate how many rows match a range on average, and
+ * we can compare it to the difference between P1.
+ *
+ * One issue is we don't have actual tuples from the ranges, so we can't
+ * measure exactly how many rows would we add. But we can match the sample
+ * and at least estimate the probability difference.
+ */
+Datum
+brin_minmax_stats(PG_FUNCTION_ARGS)
+{
+ Relation heapRel = (Relation) PG_GETARG_POINTER(0);
+ Relation indexRel = (Relation) PG_GETARG_POINTER(1);
+ AttrNumber attnum = PG_GETARG_INT16(2);
+ AttrNumber heap_attnum = PG_GETARG_INT16(3);
+ HeapTuple *rows = (HeapTuple *) PG_GETARG_POINTER(4);
+ int numrows = PG_GETARG_INT32(5);
+
+ BrinOpaque *opaque;
+ BlockNumber nblocks;
+ BlockNumber nranges;
+ BlockNumber heapBlk;
+ BrinMemTuple *dtup;
+ BrinTuple *btup = NULL;
+ Size btupsz = 0;
+ Buffer buf = InvalidBuffer;
+ BrinRanges *ranges;
+ BlockNumber pagesPerRange;
+ BrinDesc *bdesc;
+ BrinMinmaxStats *stats;
+
+ Oid typoid;
+ TypeCacheEntry *typcache;
+ BrinRange **minranges,
+ **maxranges;
+ int64 noverlaps;
+ int64 prev_min_index;
+
+ /*
+ * Mostly what brinbeginscan does to initialize BrinOpaque, except that
+ * we use active snapshot instead of the scan snapshot.
+ */
+ opaque = palloc_object(BrinOpaque);
+ opaque->bo_rmAccess = brinRevmapInitialize(indexRel,
+ &opaque->bo_pagesPerRange,
+ GetActiveSnapshot());
+ opaque->bo_bdesc = brin_build_desc(indexRel);
+
+ bdesc = opaque->bo_bdesc;
+ pagesPerRange = opaque->bo_pagesPerRange;
+
+ /* make sure the provided attnum is valid */
+ Assert((attnum > 0) && (attnum <= bdesc->bd_tupdesc->natts));
+
+ /*
+ * We need to know the size of the table so that we know how long to iterate
+ * on the revmap (and to pre-allocate the arrays).
+ */
+ nblocks = RelationGetNumberOfBlocks(heapRel);
+
+ /*
+ * How many ranges can there be? We simply look at the number of pages,
+ * divide it by the pages_per_range.
+ *
+ * XXX We need to be careful not to overflow nranges, so we just divide
+ * and then maybe add 1 for partial ranges.
+ */
+ nranges = (nblocks / pagesPerRange);
+ if (nblocks % pagesPerRange != 0)
+ nranges += 1;
+
+ /* allocate for space, and also for the alternative ordering */
+ ranges = palloc0(offsetof(BrinRanges, ranges) + nranges * sizeof(BrinRange));
+ ranges->nranges = 0;
+
+ /* allocate an initial in-memory tuple, out of the per-range memcxt */
+ dtup = brin_new_memtuple(bdesc);
+
+ /* result stats */
+ stats = palloc0(sizeof(BrinMinmaxStats));
+ SET_VARSIZE(stats, sizeof(BrinMinmaxStats));
+
+ /*
+ * Now scan the revmap. We start by querying for heap page 0,
+ * incrementing by the number of pages per range; this gives us a full
+ * view of the table.
+ *
+ * XXX We count the ranges, and count the special types (not summarized,
+ * all-null and has-null). The regular ranges are accumulated into an
+ * array, so that we can calculate additional statistics (overlaps, hits
+ * for sample tuples, etc).
+ *
+ * XXX This needs rethinking to make it work with large indexes with more
+ * ranges than we can fit into memory (work_mem/maintenance_work_mem).
+ */
+ for (heapBlk = 0; heapBlk < nblocks; heapBlk += pagesPerRange)
+ {
+ bool gottuple = false;
+ BrinTuple *tup;
+ OffsetNumber off;
+ Size size;
+
+ stats->n_ranges++;
+
+ CHECK_FOR_INTERRUPTS();
+
+ tup = brinGetTupleForHeapBlock(opaque->bo_rmAccess, heapBlk, &buf,
+ &off, &size, BUFFER_LOCK_SHARE,
+ GetActiveSnapshot());
+ if (tup)
+ {
+ gottuple = true;
+ btup = brin_copy_tuple(tup, size, btup, &btupsz);
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ }
+
+ /* Ranges with no indexed tuple are ignored for overlap analysis. */
+ if (!gottuple)
+ {
+ continue;
+ }
+ else
+ {
+ dtup = brin_deform_tuple(bdesc, btup, dtup);
+ if (dtup->bt_placeholder)
+ {
+ /* Placeholders can be ignored too, as if not summarized. */
+ continue;
+ }
+ else
+ {
+ BrinValues *bval;
+
+ bval = &dtup->bt_columns[attnum - 1];
+
+ /* OK this range is summarized */
+ stats->n_summarized++;
+
+ if (bval->bv_allnulls)
+ stats->n_all_nulls++;
+
+ if (bval->bv_hasnulls)
+ stats->n_has_nulls++;
+
+ if (!bval->bv_allnulls)
+ {
+ BrinRange *range;
+
+ range = &ranges->ranges[ranges->nranges++];
+
+ range->blkno_start = heapBlk;
+ range->blkno_end = heapBlk + (pagesPerRange - 1);
+
+ range->min_value = bval->bv_values[0];
+ range->max_value = bval->bv_values[1];
+ }
+ }
+ }
+ }
+
+ if (buf != InvalidBuffer)
+ ReleaseBuffer(buf);
+
+ elog(WARNING, "extracted ranges %d from BRIN index", ranges->nranges);
+
+ /* if we have no regular ranges, we're done */
+ if (ranges->nranges == 0)
+ goto cleanup;
+
+ /*
+ * Build auxiliary info to optimize the calculation.
+ *
+ * We have ranges in the blocknum order, but that is not very useful when
+ * calculating which ranges interstect - we could cross-check every range
+ * against every other range, but that's O(N^2) and thus may get extremely
+ * expensive pretty quick).
+ *
+ * To make that cheaper, we'll build two orderings, allowing us to quickly
+ * eliminate ranges that can't possibly overlap:
+ *
+ * - minranges = ranges ordered by min_value
+ * - maxranges = ranges ordered by max_value
+ *
+ * To count intersections, we'll then walk maxranges (i.e. ranges ordered
+ * by maxval), and for each following range we'll check if it overlaps.
+ * If yes, we'll proceed to the next one, until we find a range that does
+ * not overlap. But there might be a later page overlapping - but we can
+ * use a min_index_lowest tracking the minimum min_index for "future"
+ * ranges to quickly decide if there are such ranges. If there are none,
+ * we can terminate (and proceed to the next maxranges element), else we
+ * have to process additional ranges.
+ *
+ * Note: This only counts overlaps with ranges with max_value higher than
+ * the current one - we want to count all, but the overlaps with preceding
+ * ranges have already been counted when processing those preceding ranges.
+ * That is, we'll end up with counting each overlap just for one of those
+ * ranges, so we get only 1/2 the count.
+ *
+ * Note: We don't count the range as overlapping with itself. This needs
+ * to be considered later, when applying the statistics.
+ *
+ *
+ * XXX This will not work for very many ranges - we can have up to 2^32 of
+ * them, so allocating a ~32B struct for each would need a lot of memory.
+ * Not sure what to do about that, perhaps we could sample a couple ranges
+ * and do some calculations based on that? That is, we could process all
+ * ranges up to some number (say, statistics_target * 300, as for rows), and
+ * then sample ranges for larger tables. Then sort the sampled ranges, and
+ * walk through all ranges once, comparing them to the sample and counting
+ * overlaps (having them sorted should allow making this quite efficient,
+ * I think - following algorithm similar to the one implemented here).
+ */
+
+ /* info about ordering for the data type */
+ typoid = get_atttype(RelationGetRelid(indexRel), attnum);
+ typcache = lookup_type_cache(typoid, TYPECACHE_CMP_PROC_FINFO);
+
+ /* shouldn't happen, I think - we use this to build the index */
+ Assert(OidIsValid(typcache->cmp_proc_finfo.fn_oid));
+
+ minranges = (BrinRange **) palloc0(ranges->nranges * sizeof(BrinRanges *));
+ maxranges = (BrinRange **) palloc0(ranges->nranges * sizeof(BrinRanges *));
+
+ /*
+ * Build and sort the ranges min_value / max_value (just pointers
+ * to the main array). Then go and assign the min_index to each
+ * range, and finally walk the maxranges array backwards and track
+ * the min_index_lowest as minimum of "future" indexes.
+ */
+ for (int i = 0; i < ranges->nranges; i++)
+ {
+ minranges[i] = &ranges->ranges[i];
+ maxranges[i] = &ranges->ranges[i];
+ }
+
+ qsort_arg(minranges, ranges->nranges, sizeof(BrinRange *),
+ range_minval_cmp, typcache);
+
+ qsort_arg(maxranges, ranges->nranges, sizeof(BrinRange *),
+ range_maxval_cmp, typcache);
+
+ /*
+ * Update the min_index for each range. If the values are equal, be sure to
+ * pick the lowest index with that min_value.
+ */
+ minranges[0]->min_index = 0;
+ for (int i = 1; i < ranges->nranges; i++)
+ {
+ if (range_values_cmp(&minranges[i]->min_value, &minranges[i-1]->min_value, typcache) == 0)
+ minranges[i]->min_index = minranges[i-1]->min_index;
+ else
+ minranges[i]->min_index = i;
+ }
+
+ /*
+ * Walk the maxranges backward and assign the min_index_lowest as
+ * a running minimum.
+ */
+ prev_min_index = ranges->nranges;
+ for (int i = (ranges->nranges - 1); i >= 0; i--)
+ {
+ maxranges[i]->min_index_lowest = Min(maxranges[i]->min_index,
+ prev_min_index);
+ prev_min_index = maxranges[i]->min_index_lowest;
+ }
+
+ /* calculate average number of overlapping ranges for any range */
+ noverlaps = brin_minmax_count_overlaps(minranges, ranges->nranges, typcache);
+
+ stats->avg_overlaps = (double) noverlaps / ranges->nranges;
+
+#ifdef STATS_CROSS_CHECK
+ brin_minmax_count_overlaps2(ranges, minranges, maxranges, typcache);
+ brin_minmax_count_overlaps_bruteforce(ranges, typcache);
+#endif
+
+ /* calculate minval/maxval stats (distinct values and correlation) */
+ brin_minmax_value_stats(minranges, maxranges,
+ ranges->nranges, typcache,
+ &stats->minval_correlation,
+ &stats->minval_ndistinct,
+ &stats->maxval_correlation,
+ &stats->maxval_ndistinct);
+
+ /* match tuples to ranges */
+ {
+ int nvalues = 0;
+ int nmatches,
+ nmatches_unique,
+ nvalues_unique;
+
+ Datum *values = (Datum *) palloc0(numrows * sizeof(Datum));
+
+ TupleDesc tdesc = RelationGetDescr(heapRel);
+
+ for (int i = 0; i < numrows; i++)
+ {
+ bool isnull;
+ Datum value;
+
+ value = heap_getattr(rows[i], heap_attnum, tdesc, &isnull);
+ if (!isnull)
+ values[nvalues++] = value;
+ }
+
+ qsort_arg(values, nvalues, sizeof(Datum), range_values_cmp, typcache);
+
+ /* optimized algorithm */
+ brin_minmax_match_tuples_to_ranges(ranges,
+ numrows, rows, nvalues, values,
+ typcache,
+ &nmatches,
+ &nmatches_unique,
+ &nvalues_unique);
+
+ stats->avg_matches = (double) nmatches / numrows;
+ stats->avg_matches_unique = (double) nmatches_unique / nvalues_unique;
+
+#ifdef STATS_CROSS_CHECK
+ brin_minmax_match_tuples_to_ranges2(ranges, minranges, maxranges,
+ numrows, rows, nvalues, values,
+ typcache,
+ &nmatches,
+ &nmatches_unique,
+ &nvalues_unique);
+
+ brin_minmax_match_tuples_to_ranges_bruteforce(ranges,
+ numrows, rows,
+ nvalues, values,
+ typcache,
+ &nmatches,
+ &nmatches_unique,
+ &nvalues_unique);
+#endif
+ }
+
+ /*
+ * Possibly quite large, so release explicitly and don't rely
+ * on the memory context to discard this.
+ */
+ pfree(minranges);
+ pfree(maxranges);
+
+cleanup:
+ /* possibly quite large, so release explicitly */
+ pfree(ranges);
+
+ /* free the BrinOpaque, just like brinendscan() would */
+ brinRevmapTerminate(opaque->bo_rmAccess);
+ brin_free_desc(opaque->bo_bdesc);
+
+ PG_RETURN_POINTER(stats);
+}
+
/*
* Cache and return the procedure for the given strategy.
*
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index ff1354812bd..b7435194dc0 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -16,6 +16,7 @@
#include <math.h>
+#include "access/brin_internal.h"
#include "access/detoast.h"
#include "access/genam.h"
#include "access/multixact.h"
@@ -30,6 +31,7 @@
#include "catalog/catalog.h"
#include "catalog/index.h"
#include "catalog/indexing.h"
+#include "catalog/pg_am.h"
#include "catalog/pg_collation.h"
#include "catalog/pg_inherits.h"
#include "catalog/pg_namespace.h"
@@ -81,6 +83,7 @@ typedef struct AnlIndexData
/* Default statistics target (GUC parameter) */
int default_statistics_target = 100;
+bool enable_indexam_stats = false;
/* A few variables that don't seem worth passing around as parameters */
static MemoryContext anl_context = NULL;
@@ -92,7 +95,7 @@ static void do_analyze_rel(Relation onerel,
AcquireSampleRowsFunc acquirefunc, BlockNumber relpages,
bool inh, bool in_outer_xact, int elevel);
static void compute_index_stats(Relation onerel, double totalrows,
- AnlIndexData *indexdata, int nindexes,
+ AnlIndexData *indexdata, Relation *indexRels, int nindexes,
HeapTuple *rows, int numrows,
MemoryContext col_context);
static VacAttrStats *examine_attribute(Relation onerel, int attnum,
@@ -454,15 +457,49 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
{
AnlIndexData *thisdata = &indexdata[ind];
IndexInfo *indexInfo;
+ bool collectAmStats;
+ Oid regproc;
thisdata->indexInfo = indexInfo = BuildIndexInfo(Irel[ind]);
thisdata->tupleFract = 1.0; /* fix later if partial */
- if (indexInfo->ii_Expressions != NIL && va_cols == NIL)
+
+ /*
+ * Should we collect AM-specific statistics for any of the columns?
+ *
+ * If AM-specific statistics are enabled (using a GUC), see if we
+ * have an optional support procedure to build the statistics.
+ *
+ * If there's any such attribute, we just force building stats
+ * even for regular index keys (not just expressions) and indexes
+ * without predicates. It'd be good to only build the AM stats, but
+ * for now this is good enough.
+ *
+ * XXX The GUC is there mostly to make it easier to enable/disable
+ * this during development.
+ *
+ * FIXME Only build the AM statistics, not the other stats. And only
+ * do that for the keys with the optional procedure, not all of them.
+ */
+ collectAmStats = false;
+ if (enable_indexam_stats && (Irel[ind]->rd_indam->amstatsprocnum != 0))
+ {
+ for (int j = 0; j < indexInfo->ii_NumIndexAttrs; j++)
+ {
+ regproc = index_getprocid(Irel[ind], (j+1), Irel[ind]->rd_indam->amstatsprocnum);
+ if (OidIsValid(regproc))
+ {
+ collectAmStats = true;
+ break;
+ }
+ }
+ }
+
+ if ((indexInfo->ii_Expressions != NIL || collectAmStats) && va_cols == NIL)
{
ListCell *indexpr_item = list_head(indexInfo->ii_Expressions);
thisdata->vacattrstats = (VacAttrStats **)
- palloc(indexInfo->ii_NumIndexAttrs * sizeof(VacAttrStats *));
+ palloc0(indexInfo->ii_NumIndexAttrs * sizeof(VacAttrStats *));
tcnt = 0;
for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
{
@@ -483,6 +520,12 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
if (thisdata->vacattrstats[tcnt] != NULL)
tcnt++;
}
+ else
+ {
+ thisdata->vacattrstats[tcnt] =
+ examine_attribute(Irel[ind], i + 1, NULL);
+ tcnt++;
+ }
}
thisdata->attr_cnt = tcnt;
}
@@ -588,7 +631,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
if (nindexes > 0)
compute_index_stats(onerel, totalrows,
- indexdata, nindexes,
+ indexdata, Irel, nindexes,
rows, numrows,
col_context);
@@ -822,12 +865,82 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
anl_context = NULL;
}
+/*
+ * compute_indexam_stats
+ * Call the optional procedure to compute AM-specific statistics.
+ *
+ * We simply call the procedure, which is expected to produce a bytea value.
+ *
+ * At the moment only BRIN defines this optional procedure, but the code is
+ * generic - it looks at the AM's amstatsprocnum and bails out for access
+ * methods that don't define the procedure.
+ */
+static void
+compute_indexam_stats(Relation onerel,
+ Relation indexRel, IndexInfo *indexInfo,
+ double totalrows, AnlIndexData *indexdata,
+ HeapTuple *rows, int numrows)
+{
+ if (!enable_indexam_stats)
+ return;
+
+ /* ignore index AMs without the optional procedure */
+ if (indexRel->rd_indam->amstatsprocnum == 0)
+ return;
+
+ /*
+ * Look at attributes, and calculate stats for those that have the
+ * optional stats proc for the opfamily.
+ */
+ for (int i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
+ {
+ AttrNumber attno = (i + 1);
+ AttrNumber attnum = indexInfo->ii_IndexAttrNumbers[i]; /* heap attnum */
+ RegProcedure regproc;
+ FmgrInfo *statsproc;
+ Datum datum;
+ VacAttrStats *stats;
+ MemoryContext oldcxt;
+
+ /* do this first, as it doesn't fail when proc not defined */
+ regproc = index_getprocid(indexRel, attno, indexRel->rd_indam->amstatsprocnum);
+
+ /* ignore opclasses without the optional procedure */
+ if (!RegProcedureIsValid(regproc))
+ continue;
+
+ statsproc = index_getprocinfo(indexRel, attno, indexRel->rd_indam->amstatsprocnum);
+
+ stats = indexdata->vacattrstats[i];
+
+ if (statsproc != NULL)
+ elog(WARNING, "collecting stats on BRIN ranges %p using proc %p attnum %d",
+ indexRel, statsproc, attno);
+
+ oldcxt = MemoryContextSwitchTo(stats->anl_context);
+
+ /* call the proc, let the AM calculate whatever it wants */
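+ /*
+ * The arguments match the signature of the optional stats procedure
+ * (currently brin_minmax_stats for BRIN minmax opclasses): heap relation,
+ * index relation, index attnum, heap attnum, the sampled rows and their
+ * count.
+ */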
+ datum = FunctionCall6Coll(statsproc,
+ InvalidOid, /* FIXME correct collation */
+ PointerGetDatum(onerel),
+ PointerGetDatum(indexRel),
+ Int16GetDatum(attno),
+ Int16GetDatum(attnum),
+ PointerGetDatum(rows),
+ Int32GetDatum(numrows));
+
+ stats->staindexam = datum;
+
+ MemoryContextSwitchTo(oldcxt);
+ }
+}
+
/*
* Compute statistics about indexes of a relation
*/
static void
compute_index_stats(Relation onerel, double totalrows,
- AnlIndexData *indexdata, int nindexes,
+ AnlIndexData *indexdata, Relation *indexRels, int nindexes,
HeapTuple *rows, int numrows,
MemoryContext col_context)
{
@@ -847,6 +960,7 @@ compute_index_stats(Relation onerel, double totalrows,
{
AnlIndexData *thisdata = &indexdata[ind];
IndexInfo *indexInfo = thisdata->indexInfo;
+ Relation indexRel = indexRels[ind];
int attr_cnt = thisdata->attr_cnt;
TupleTableSlot *slot;
EState *estate;
@@ -859,6 +973,13 @@ compute_index_stats(Relation onerel, double totalrows,
rowno;
double totalindexrows;
+ /*
+ * If this is a BRIN index, try calling a procedure to collect
+ * extra opfamily-specific statistics (if procedure defined).
+ */
+ compute_indexam_stats(onerel, indexRel, indexInfo, totalrows,
+ thisdata, rows, numrows);
+
/* Ignore index if no columns to analyze and not partial */
if (attr_cnt == 0 && indexInfo->ii_Predicate == NIL)
continue;
@@ -1661,6 +1782,13 @@ update_attstats(Oid relid, bool inh, int natts, VacAttrStats **vacattrstats)
values[Anum_pg_statistic_stanullfrac - 1] = Float4GetDatum(stats->stanullfrac);
values[Anum_pg_statistic_stawidth - 1] = Int32GetDatum(stats->stawidth);
values[Anum_pg_statistic_stadistinct - 1] = Float4GetDatum(stats->stadistinct);
+
+ /* optional AM-specific stats */
+ if (DatumGetPointer(stats->staindexam) != NULL)
+ values[Anum_pg_statistic_staindexam - 1] = stats->staindexam;
+ else
+ nulls[Anum_pg_statistic_staindexam - 1] = true;
+
i = Anum_pg_statistic_stakind1 - 1;
for (k = 0; k < STATISTIC_NUM_SLOTS; k++)
{
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 69e0fb98f5b..9f640adb13c 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -7715,6 +7715,7 @@ brincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
Relation indexRel;
ListCell *l;
VariableStatData vardata;
+ double averageOverlaps;
Assert(rte->rtekind == RTE_RELATION);
@@ -7762,6 +7763,7 @@ brincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
* correlation statistics, we will keep it as 0.
*/
*indexCorrelation = 0;
+ averageOverlaps = 0.0;
foreach(l, path->indexclauses)
{
@@ -7771,6 +7773,36 @@ brincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
/* attempt to lookup stats in relation for this index column */
if (attnum != 0)
{
+ /*
+ * If AM-specific statistics are enabled, try looking up the stats
+ * for the index key. We only have this for minmax opclasses, so
+ * we just cast it like that. But other BRIN opclasses might need
+ * other stats, so either we need to abstract this somehow, or maybe
+ * just collect sufficiently generic stats for all BRIN indexes.
+ *
+ * XXX Make this non-minmax specific.
+ */
+ if (enable_indexam_stats)
+ {
+ BrinMinmaxStats *amstats
+ = (BrinMinmaxStats *) get_attindexam(index->indexoid, attnum);
+
+ if (amstats)
+ {
+ elog(DEBUG1, "found AM stats: attnum %d n_ranges %ld n_summarized %ld n_all_nulls %ld n_has_nulls %ld avg_overlaps %f",
+ attnum, amstats->n_ranges, amstats->n_summarized,
+ amstats->n_all_nulls, amstats->n_has_nulls,
+ amstats->avg_overlaps);
+
+ /*
+ * The only thing we use at the moment is the average number
+ * of overlaps for a single range. Use the other stuff too.
+ */
+ averageOverlaps = Max(averageOverlaps,
+ 1.0 + amstats->avg_overlaps);
+ }
+ }
+
/* Simple variable -- look to stats for the underlying table */
if (get_relation_stats_hook &&
(*get_relation_stats_hook) (root, rte, attnum, &vardata))
@@ -7851,6 +7883,14 @@ brincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
baserel->relid,
JOIN_INNER, NULL);
+ /*
+ * XXX Can we combine qualSelectivity with the average number of matching
+ * ranges per value? qualSelectivity estimates how many tuples we are
+ * going to match, and the average number of matches says how many ranges
+ * each of those will match on average. We don't know how many of those
+ * will be duplicates, but it gives us a worst-case estimate, at least.
+ */
+
/*
* Now calculate the minimum possible ranges we could match with if all of
* the rows were in the perfect order in the table's heap.
@@ -7867,6 +7907,25 @@ brincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
else
estimatedRanges = Min(minimalRanges / *indexCorrelation, indexRanges);
+ elog(DEBUG1, "before index AM stats: cestimatedRanges = %f", estimatedRanges);
+
+ /*
+ * If we found some AM stats, look at average number of overlapping ranges,
+ * and apply that to the currently estimated ranges.
+ *
+ * XXX We pretty much combine this with correlation info (because it was
+ * already applied in the estimatedRanges formula above), which might be
+ * overly pessimistic. The overlap stats seem somewhat redundant with
+ * the correlation, so maybe we should use just one of them? The AM stats
+ * seem like more reliable information, because the correlation is not
+ * very sensitive to outliers, for example. So maybe let's prefer the AM
+ * stats, and only use the correlation as a fallback when they are not
+ * available?
+ */
+ if (averageOverlaps > 0.0)
+ estimatedRanges = Min(estimatedRanges * averageOverlaps, indexRanges);
+
+ elog(DEBUG1, "after index AM stats: cestimatedRanges = %f", estimatedRanges);
+
/* we expect to visit this portion of the table */
selec = estimatedRanges / indexRanges;
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index a16a63f4957..1725f5af347 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -3138,6 +3138,47 @@ get_attavgwidth(Oid relid, AttrNumber attnum)
return 0;
}
+
+/*
+ * get_attindexam
+ *
+ * Given the table and attribute number of a column, get the index AM
+ * statistics. Return NULL if no data available.
+ *
+ * Currently this is only consulted for individual tables, not for inheritance
+ * trees, so we don't need an "inh" parameter.
+ */
+bytea *
+get_attindexam(Oid relid, AttrNumber attnum)
+{
+ HeapTuple tp;
+
+ tp = SearchSysCache3(STATRELATTINH,
+ ObjectIdGetDatum(relid),
+ Int16GetDatum(attnum),
+ BoolGetDatum(false));
+ if (HeapTupleIsValid(tp))
+ {
+ Datum val;
+ bytea *retval = NULL;
+ bool isnull;
+
+ val = SysCacheGetAttr(STATRELATTINH, tp,
+ Anum_pg_statistic_staindexam,
+ &isnull);
+
+ if (!isnull)
+ retval = (bytea *) PG_DETOAST_DATUM(val);
+
+ ReleaseSysCache(tp);
+
+ return retval;
+ }
+
+ return NULL;
+}
+
/*
* get_attstatsslot
*
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 05ab087934c..06dfeb6cd8b 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -967,6 +967,16 @@ struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexam_stats", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index AM stats."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_indexam_stats,
+ false,
+ NULL, NULL, NULL
+ },
{
{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
gettext_noop("Enables genetic query optimization."),
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 1dc674d2305..8437c2f0e71 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -216,6 +216,8 @@ typedef struct IndexAmRoutine
uint16 amsupport;
/* opclass options support function number or 0 */
uint16 amoptsprocnum;
+ /* opclass statistics support function number or 0 */
+ uint16 amstatsprocnum;
/* does AM support ORDER BY indexed column's value? */
bool amcanorder;
/* does AM support ORDER BY result of an operator on indexed column? */
diff --git a/src/include/access/brin.h b/src/include/access/brin.h
index 887fb0a5532..a7cccac9c90 100644
--- a/src/include/access/brin.h
+++ b/src/include/access/brin.h
@@ -34,6 +34,57 @@ typedef struct BrinStatsData
BlockNumber revmapNumPages;
} BrinStatsData;
+/*
+ * Info about ranges for BRIN Sort.
+ */
+typedef struct BrinRange
+{
+ BlockNumber blkno_start;
+ BlockNumber blkno_end;
+
+ Datum min_value;
+ Datum max_value;
+ bool has_nulls;
+ bool all_nulls;
+ bool not_summarized;
+
+ /*
+ * Index of the range when ordered by min_value (if there are multiple
+ * ranges with the same min_value, it's the lowest one).
+ */
+ uint32 min_index;
+
+ /*
+ * Minimum min_index from all ranges with higher max_value (i.e. when
+ * sorted by max_value). If there are multiple ranges with the same
+ * max_value, it depends on the ordering (i.e. the ranges may get
+ * different min_index_lowest, depending on the exact ordering).
+ */
+ uint32 min_index_lowest;
+} BrinRange;
+
+typedef struct BrinRanges
+{
+ int nranges;
+ BrinRange ranges[FLEXIBLE_ARRAY_MEMBER];
+} BrinRanges;
+
+typedef struct BrinMinmaxStats
+{
+ int32 vl_len_; /* varlena header (do not touch directly!) */
+ int64 n_ranges;
+ int64 n_summarized;
+ int64 n_all_nulls;
+ int64 n_has_nulls;
+ double avg_overlaps;
+ double avg_matches;
+ double avg_matches_unique;
+
+ double minval_correlation;
+ double maxval_correlation;
+ int64 minval_ndistinct;
+ int64 maxval_ndistinct;
+} BrinMinmaxStats;
#define BRIN_DEFAULT_PAGES_PER_RANGE 128
#define BrinGetPagesPerRange(relation) \
diff --git a/src/include/access/brin_internal.h b/src/include/access/brin_internal.h
index 25186609272..ee6c6f9b709 100644
--- a/src/include/access/brin_internal.h
+++ b/src/include/access/brin_internal.h
@@ -73,6 +73,7 @@ typedef struct BrinDesc
#define BRIN_PROCNUM_UNION 4
#define BRIN_MANDATORY_NPROCS 4
#define BRIN_PROCNUM_OPTIONS 5 /* optional */
+#define BRIN_PROCNUM_STATISTICS 6 /* optional */
/* procedure numbers up to 10 are reserved for BRIN future expansion */
#define BRIN_FIRST_OPTIONAL_PROCNUM 11
#define BRIN_LAST_OPTIONAL_PROCNUM 15
diff --git a/src/include/catalog/pg_amproc.dat b/src/include/catalog/pg_amproc.dat
index 4cc129bebd8..ea3de9bcba1 100644
--- a/src/include/catalog/pg_amproc.dat
+++ b/src/include/catalog/pg_amproc.dat
@@ -804,6 +804,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/bytea_minmax_ops', amproclefttype => 'bytea',
amprocrighttype => 'bytea', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/bytea_minmax_ops', amproclefttype => 'bytea',
+ amprocrighttype => 'bytea', amprocnum => '6', amproc => 'brin_minmax_stats' },
# bloom bytea
{ amprocfamily => 'brin/bytea_bloom_ops', amproclefttype => 'bytea',
@@ -835,6 +837,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/char_minmax_ops', amproclefttype => 'char',
amprocrighttype => 'char', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/char_minmax_ops', amproclefttype => 'char',
+ amprocrighttype => 'char', amprocnum => '6', amproc => 'brin_minmax_stats' },
# bloom "char"
{ amprocfamily => 'brin/char_bloom_ops', amproclefttype => 'char',
@@ -864,6 +868,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/name_minmax_ops', amproclefttype => 'name',
amprocrighttype => 'name', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/name_minmax_ops', amproclefttype => 'name',
+ amprocrighttype => 'name', amprocnum => '6', amproc => 'brin_minmax_stats' },
# bloom name
{ amprocfamily => 'brin/name_bloom_ops', amproclefttype => 'name',
@@ -893,6 +899,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int8',
amprocrighttype => 'int8', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int8',
+ amprocrighttype => 'int8', amprocnum => '6', amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int2',
amprocrighttype => 'int2', amprocnum => '1',
@@ -905,6 +913,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int2',
amprocrighttype => 'int2', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int2',
+ amprocrighttype => 'int2', amprocnum => '6', amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int4',
amprocrighttype => 'int4', amprocnum => '1',
@@ -917,6 +927,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int4',
amprocrighttype => 'int4', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int4',
+ amprocrighttype => 'int4', amprocnum => '6', amproc => 'brin_minmax_stats' },
# minmax multi integer: int2, int4, int8
{ amprocfamily => 'brin/integer_minmax_multi_ops', amproclefttype => 'int2',
@@ -1034,6 +1046,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/text_minmax_ops', amproclefttype => 'text',
amprocrighttype => 'text', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/text_minmax_ops', amproclefttype => 'text',
+ amprocrighttype => 'text', amprocnum => '6', amproc => 'brin_minmax_stats' },
# bloom text
{ amprocfamily => 'brin/text_bloom_ops', amproclefttype => 'text',
@@ -1062,6 +1076,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/oid_minmax_ops', amproclefttype => 'oid',
amprocrighttype => 'oid', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/oid_minmax_ops', amproclefttype => 'oid',
+ amprocrighttype => 'oid', amprocnum => '6', amproc => 'brin_minmax_stats' },
# minmax multi oid
{ amprocfamily => 'brin/oid_minmax_multi_ops', amproclefttype => 'oid',
@@ -1110,6 +1126,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/tid_minmax_ops', amproclefttype => 'tid',
amprocrighttype => 'tid', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/tid_minmax_ops', amproclefttype => 'tid',
+ amprocrighttype => 'tid', amprocnum => '6', amproc => 'brin_minmax_stats' },
# bloom tid
{ amprocfamily => 'brin/tid_bloom_ops', amproclefttype => 'tid',
@@ -1160,6 +1178,9 @@
{ amprocfamily => 'brin/float_minmax_ops', amproclefttype => 'float4',
amprocrighttype => 'float4', amprocnum => '4',
amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/float_minmax_ops', amproclefttype => 'float4',
+ amprocrighttype => 'float4', amprocnum => '6',
+ amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/float_minmax_ops', amproclefttype => 'float8',
amprocrighttype => 'float8', amprocnum => '1',
@@ -1173,6 +1194,9 @@
{ amprocfamily => 'brin/float_minmax_ops', amproclefttype => 'float8',
amprocrighttype => 'float8', amprocnum => '4',
amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/float_minmax_ops', amproclefttype => 'float8',
+ amprocrighttype => 'float8', amprocnum => '6',
+ amproc => 'brin_minmax_stats' },
# minmax multi float
{ amprocfamily => 'brin/float_minmax_multi_ops', amproclefttype => 'float4',
@@ -1261,6 +1285,9 @@
{ amprocfamily => 'brin/macaddr_minmax_ops', amproclefttype => 'macaddr',
amprocrighttype => 'macaddr', amprocnum => '4',
amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/macaddr_minmax_ops', amproclefttype => 'macaddr',
+ amprocrighttype => 'macaddr', amprocnum => '6',
+ amproc => 'brin_minmax_stats' },
# minmax multi macaddr
{ amprocfamily => 'brin/macaddr_minmax_multi_ops', amproclefttype => 'macaddr',
@@ -1314,6 +1341,9 @@
{ amprocfamily => 'brin/macaddr8_minmax_ops', amproclefttype => 'macaddr8',
amprocrighttype => 'macaddr8', amprocnum => '4',
amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/macaddr8_minmax_ops', amproclefttype => 'macaddr8',
+ amprocrighttype => 'macaddr8', amprocnum => '6',
+ amproc => 'brin_minmax_stats' },
# minmax multi macaddr8
{ amprocfamily => 'brin/macaddr8_minmax_multi_ops',
@@ -1366,6 +1396,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/network_minmax_ops', amproclefttype => 'inet',
amprocrighttype => 'inet', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/network_minmax_ops', amproclefttype => 'inet',
+ amprocrighttype => 'inet', amprocnum => '6', amproc => 'brin_minmax_stats' },
# minmax multi inet
{ amprocfamily => 'brin/network_minmax_multi_ops', amproclefttype => 'inet',
@@ -1436,6 +1468,9 @@
{ amprocfamily => 'brin/bpchar_minmax_ops', amproclefttype => 'bpchar',
amprocrighttype => 'bpchar', amprocnum => '4',
amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/bpchar_minmax_ops', amproclefttype => 'bpchar',
+ amprocrighttype => 'bpchar', amprocnum => '6',
+ amproc => 'brin_minmax_stats' },
# bloom character
{ amprocfamily => 'brin/bpchar_bloom_ops', amproclefttype => 'bpchar',
@@ -1467,6 +1502,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/time_minmax_ops', amproclefttype => 'time',
amprocrighttype => 'time', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/time_minmax_ops', amproclefttype => 'time',
+ amprocrighttype => 'time', amprocnum => '6', amproc => 'brin_minmax_stats' },
# minmax multi time without time zone
{ amprocfamily => 'brin/time_minmax_multi_ops', amproclefttype => 'time',
@@ -1517,6 +1554,9 @@
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'timestamp',
amprocrighttype => 'timestamp', amprocnum => '4',
amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'timestamp',
+ amprocrighttype => 'timestamp', amprocnum => '6',
+ amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'timestamptz',
amprocrighttype => 'timestamptz', amprocnum => '1',
@@ -1530,6 +1570,9 @@
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'timestamptz',
amprocrighttype => 'timestamptz', amprocnum => '4',
amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'timestamptz',
+ amprocrighttype => 'timestamptz', amprocnum => '6',
+ amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'date',
amprocrighttype => 'date', amprocnum => '1',
@@ -1542,6 +1585,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'date',
amprocrighttype => 'date', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'date',
+ amprocrighttype => 'date', amprocnum => '6', amproc => 'brin_minmax_stats' },
# minmax multi datetime (date, timestamp, timestamptz)
{ amprocfamily => 'brin/datetime_minmax_multi_ops',
@@ -1668,6 +1713,9 @@
{ amprocfamily => 'brin/interval_minmax_ops', amproclefttype => 'interval',
amprocrighttype => 'interval', amprocnum => '4',
amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/interval_minmax_ops', amproclefttype => 'interval',
+ amprocrighttype => 'interval', amprocnum => '6',
+ amproc => 'brin_minmax_stats' },
# minmax multi interval
{ amprocfamily => 'brin/interval_minmax_multi_ops',
@@ -1721,6 +1769,9 @@
{ amprocfamily => 'brin/timetz_minmax_ops', amproclefttype => 'timetz',
amprocrighttype => 'timetz', amprocnum => '4',
amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/timetz_minmax_ops', amproclefttype => 'timetz',
+ amprocrighttype => 'timetz', amprocnum => '6',
+ amproc => 'brin_minmax_stats' },
# minmax multi time with time zone
{ amprocfamily => 'brin/timetz_minmax_multi_ops', amproclefttype => 'timetz',
@@ -1771,6 +1822,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/bit_minmax_ops', amproclefttype => 'bit',
amprocrighttype => 'bit', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/bit_minmax_ops', amproclefttype => 'bit',
+ amprocrighttype => 'bit', amprocnum => '6', amproc => 'brin_minmax_stats' },
# minmax bit varying
{ amprocfamily => 'brin/varbit_minmax_ops', amproclefttype => 'varbit',
@@ -1785,6 +1838,9 @@
{ amprocfamily => 'brin/varbit_minmax_ops', amproclefttype => 'varbit',
amprocrighttype => 'varbit', amprocnum => '4',
amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/varbit_minmax_ops', amproclefttype => 'varbit',
+ amprocrighttype => 'varbit', amprocnum => '6',
+ amproc => 'brin_minmax_stats' },
# minmax numeric
{ amprocfamily => 'brin/numeric_minmax_ops', amproclefttype => 'numeric',
@@ -1799,6 +1855,9 @@
{ amprocfamily => 'brin/numeric_minmax_ops', amproclefttype => 'numeric',
amprocrighttype => 'numeric', amprocnum => '4',
amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/numeric_minmax_ops', amproclefttype => 'numeric',
+ amprocrighttype => 'numeric', amprocnum => '6',
+ amproc => 'brin_minmax_stats' },
# minmax multi numeric
{ amprocfamily => 'brin/numeric_minmax_multi_ops', amproclefttype => 'numeric',
@@ -1851,6 +1910,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/uuid_minmax_ops', amproclefttype => 'uuid',
amprocrighttype => 'uuid', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/uuid_minmax_ops', amproclefttype => 'uuid',
+ amprocrighttype => 'uuid', amprocnum => '6', amproc => 'brin_minmax_stats' },
# minmax multi uuid
{ amprocfamily => 'brin/uuid_minmax_multi_ops', amproclefttype => 'uuid',
@@ -1924,6 +1985,9 @@
{ amprocfamily => 'brin/pg_lsn_minmax_ops', amproclefttype => 'pg_lsn',
amprocrighttype => 'pg_lsn', amprocnum => '4',
amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/pg_lsn_minmax_ops', amproclefttype => 'pg_lsn',
+ amprocrighttype => 'pg_lsn', amprocnum => '6',
+ amproc => 'brin_minmax_stats' },
# minmax multi pg_lsn
{ amprocfamily => 'brin/pg_lsn_minmax_multi_ops', amproclefttype => 'pg_lsn',
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 62a5b8e655d..1dd9177b01c 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8407,6 +8407,10 @@
{ oid => '3386', descr => 'BRIN minmax support',
proname => 'brin_minmax_union', prorettype => 'bool',
proargtypes => 'internal internal internal', prosrc => 'brin_minmax_union' },
+{ oid => '9979', descr => 'BRIN minmax support',
+ proname => 'brin_minmax_stats', prorettype => 'bool',
+ proargtypes => 'internal internal int2 int2 internal int4',
+ prosrc => 'brin_minmax_stats' },
# BRIN minmax multi
{ oid => '4616', descr => 'BRIN multi minmax support',
diff --git a/src/include/catalog/pg_statistic.h b/src/include/catalog/pg_statistic.h
index cdf74481398..7043b169f7c 100644
--- a/src/include/catalog/pg_statistic.h
+++ b/src/include/catalog/pg_statistic.h
@@ -121,6 +121,11 @@ CATALOG(pg_statistic,2619,StatisticRelationId)
anyarray stavalues3;
anyarray stavalues4;
anyarray stavalues5;
+
+ /*
+ * Statistics calculated by index AM (e.g. BRIN for ranges, etc.).
+ */
+ bytea staindexam;
#endif
} FormData_pg_statistic;
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 5d816ba7f4e..319f7d4aadc 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -155,6 +155,7 @@ typedef struct VacAttrStats
float4 *stanumbers[STATISTIC_NUM_SLOTS];
int numvalues[STATISTIC_NUM_SLOTS];
Datum *stavalues[STATISTIC_NUM_SLOTS];
+ Datum staindexam; /* index-specific stats (as bytea) */
/*
* These fields describe the stavalues[n] element types. They will be
@@ -258,6 +259,7 @@ extern PGDLLIMPORT int vacuum_multixact_freeze_min_age;
extern PGDLLIMPORT int vacuum_multixact_freeze_table_age;
extern PGDLLIMPORT int vacuum_failsafe_age;
extern PGDLLIMPORT int vacuum_multixact_failsafe_age;
+extern PGDLLIMPORT bool enable_indexam_stats;
/* Variables for cost-based parallel vacuum */
extern PGDLLIMPORT pg_atomic_uint32 *VacuumSharedCostBalance;
diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h
index 50f02883052..71ce5b15d74 100644
--- a/src/include/utils/lsyscache.h
+++ b/src/include/utils/lsyscache.h
@@ -185,6 +185,7 @@ extern Oid getBaseType(Oid typid);
extern Oid getBaseTypeAndTypmod(Oid typid, int32 *typmod);
extern int32 get_typavgwidth(Oid typid, int32 typmod);
extern int32 get_attavgwidth(Oid relid, AttrNumber attnum);
+extern bytea *get_attindexam(Oid relid, AttrNumber attnum);
extern bool get_attstatsslot(AttStatsSlot *sslot, HeapTuple statstuple,
int reqkind, Oid reqop, int flags);
extern void free_attstatsslot(AttStatsSlot *sslot);
--
2.37.3
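
For completeness, here's a minimal way to check that the statistics part
(the first patch) actually stores something, reusing the t / t_a_idx example
from above. The bytea contents are an implementation detail - this merely
confirms the stats were built (note that reading pg_statistic directly
requires superuser):

set enable_indexam_stats = on;
analyze t;

select staattnum, staindexam is not null as has_am_stats
from pg_statistic where starelid = 't_a_idx'::regclass;

With client_min_messages set to debug1, planning a query with a condition on
the indexed column should then also print the "found AM stats" line from
brincostestimate, including the avg_overlaps value.
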
Attachment: 0002-Allow-BRIN-indexes-to-produce-sorted-output-20221022.patch (text/x-patch)
From 09c7127124c44373c006126adb7391b5fc3e3475 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Sun, 9 Oct 2022 11:33:37 +0200
Subject: [PATCH 2/6] Allow BRIN indexes to produce sorted output
Some BRIN indexes can be used to produce sorted output, by using the
range information to sort tuples incrementally. This is particularly
interesting for LIMIT queries, which only need to scan the first few
rows, and alternative plans (e.g. Seq Scan + Sort) have a very high
startup cost.
Of course, if there are e.g. BTREE indexes this is going to be slower,
but people are unlikely to have both index types on the same column.
This is disabled by default, use enable_brinsort GUC to enable it.
---
src/backend/access/brin/brin_minmax.c | 386 ++++++
src/backend/commands/explain.c | 44 +
src/backend/executor/Makefile | 1 +
src/backend/executor/execProcnode.c | 10 +
src/backend/executor/nodeBrinSort.c | 1550 +++++++++++++++++++++++
src/backend/optimizer/path/costsize.c | 254 ++++
src/backend/optimizer/path/indxpath.c | 186 +++
src/backend/optimizer/path/pathkeys.c | 50 +
src/backend/optimizer/plan/createplan.c | 188 +++
src/backend/optimizer/plan/setrefs.c | 19 +
src/backend/optimizer/util/pathnode.c | 57 +
src/backend/utils/misc/guc_tables.c | 10 +
src/include/access/brin.h | 35 -
src/include/access/brin_internal.h | 1 +
src/include/catalog/pg_amproc.dat | 64 +
src/include/catalog/pg_proc.dat | 3 +
src/include/executor/nodeBrinSort.h | 47 +
src/include/nodes/execnodes.h | 103 ++
src/include/nodes/pathnodes.h | 11 +
src/include/nodes/plannodes.h | 26 +
src/include/optimizer/cost.h | 3 +
src/include/optimizer/pathnode.h | 9 +
src/include/optimizer/paths.h | 3 +
23 files changed, 3025 insertions(+), 35 deletions(-)
create mode 100644 src/backend/executor/nodeBrinSort.c
create mode 100644 src/include/executor/nodeBrinSort.h
diff --git a/src/backend/access/brin/brin_minmax.c b/src/backend/access/brin/brin_minmax.c
index 0135a00ae91..9d84063055c 100644
--- a/src/backend/access/brin/brin_minmax.c
+++ b/src/backend/access/brin/brin_minmax.c
@@ -16,6 +16,10 @@
#include "access/brin_tuple.h"
#include "access/genam.h"
#include "access/stratnum.h"
+#include "access/table.h"
+#include "access/tableam.h"
+#include "catalog/index.h"
+#include "catalog/pg_am.h"
#include "catalog/pg_amop.h"
#include "catalog/pg_type.h"
#include "miscadmin.h"
@@ -42,6 +46,9 @@ static FmgrInfo *minmax_get_strategy_procinfo(BrinDesc *bdesc, uint16 attno,
/* calculate the stats in different ways for cross-checking */
#define STATS_CROSS_CHECK
+/* print info about ranges */
+#define BRINSORT_DEBUG
+
Datum
brin_minmax_opcinfo(PG_FUNCTION_ARGS)
{
@@ -1594,6 +1601,385 @@ cleanup:
PG_RETURN_POINTER(stats);
}
+/*
+ * brin_minmax_range_tupdesc
+ * Create a tuple descriptor to store BrinRange data.
+ */
+static TupleDesc
+brin_minmax_range_tupdesc(BrinDesc *brdesc, AttrNumber attnum)
+{
+ TupleDesc tupdesc;
+ AttrNumber attno = 1;
+
+ /* expect minimum and maximum */
+ Assert(brdesc->bd_info[attnum - 1]->oi_nstored == 2);
+
+ tupdesc = CreateTemplateTupleDesc(7);
+
+ /* blkno_start */
+ TupleDescInitEntry(tupdesc, attno++, NULL, INT8OID, -1, 0);
+
+ /* blkno_end (could be calculated as blkno_start + pages_per_range - 1) */
+ TupleDescInitEntry(tupdesc, attno++, NULL, INT8OID, -1, 0);
+
+ /* has_nulls */
+ TupleDescInitEntry(tupdesc, attno++, NULL, BOOLOID, -1, 0);
+
+ /* all_nulls */
+ TupleDescInitEntry(tupdesc, attno++, NULL, BOOLOID, -1, 0);
+
+ /* not_summarized */
+ TupleDescInitEntry(tupdesc, attno++, NULL, BOOLOID, -1, 0);
+
+ /* min_value */
+ TupleDescInitEntry(tupdesc, attno++, NULL,
+ brdesc->bd_info[attnum - 1]->oi_typcache[0]->type_id,
+ -1, 0);
+
+ /* max_value */
+ TupleDescInitEntry(tupdesc, attno++, NULL,
+ brdesc->bd_info[attnum - 1]->oi_typcache[0]->type_id,
+ -1, 0);
+
+ return tupdesc;
+}
+
+/*
+ * brin_minmax_range_tuple
+ * Form a minimal tuple representing range info.
+ */
+static MinimalTuple
+brin_minmax_range_tuple(TupleDesc tupdesc,
+ BlockNumber block_start, BlockNumber block_end,
+ bool has_nulls, bool all_nulls, bool not_summarized,
+ Datum min_value, Datum max_value)
+{
+ Datum values[7];
+ bool nulls[7];
+
+ memset(nulls, 0, sizeof(nulls));
+
+ values[0] = Int64GetDatum(block_start);
+ values[1] = Int64GetDatum(block_end);
+ values[2] = BoolGetDatum(has_nulls);
+ values[3] = BoolGetDatum(all_nulls);
+ values[4] = BoolGetDatum(not_summarized);
+ values[5] = min_value;
+ values[6] = max_value;
+
+ if (all_nulls || not_summarized)
+ {
+ nulls[5] = true;
+ nulls[6] = true;
+ }
+
+ return heap_form_minimal_tuple(tupdesc, values, nulls);
+}
+
+/*
+ * brin_minmax_scan_init
+ * Prepare the BrinRangeScanDesc including the sorting info etc.
+ *
+ * We want to have the ranges in roughly this order
+ *
+ * - not-summarized
+ * - summarized, non-null values
+ * - summarized, all-nulls
+ *
+ * We do it this way, because the not-summarized ranges need to be
+ * scanned always (both to produce NULL and non-NULL values), and
+ * we need to read all of them into the tuplesort before producing
+ * anything. So placing them at the beginning is reasonable.
+ *
+ * The all-nulls ranges are placed last, because when processing
+ * NULLs we need to scan everything anyway (some of the ranges might
+ * have has_nulls=true). But for non-NULL values we can abort once
+ * we hit the first all-nulls range.
+ *
+ * The regular ranges are sorted by blkno_start, to make it maybe
+ * a bit more sequential (but this only helps if there are ranges
+ * with the same minval).
+ */
+static BrinRangeScanDesc *
+brin_minmax_scan_init(BrinDesc *bdesc, AttrNumber attnum, bool asc)
+{
+ BrinRangeScanDesc *scan;
+
+ /* sort by (not_summarized, all_nulls, minval, blkno_start) */
+ AttrNumber keys[4];
+ Oid collations[4];
+ bool nullsFirst[4];
+ Oid operators[4];
+ Oid typid;
+ TypeCacheEntry *typcache;
+
+ /* we expect to have min/max value for each range, same type for both */
+ Assert(bdesc->bd_info[attnum - 1]->oi_nstored == 2);
+ Assert(bdesc->bd_info[attnum - 1]->oi_typcache[0]->type_id ==
+ bdesc->bd_info[attnum - 1]->oi_typcache[1]->type_id);
+
+ scan = (BrinRangeScanDesc *) palloc0(sizeof(BrinRangeScanDesc));
+
+ /* build tuple descriptor for range data */
+ scan->tdesc = brin_minmax_range_tupdesc(bdesc, attnum);
+
+ /* initialize ordering info */
+ keys[0] = 5; /* not_summarized */
+ keys[1] = 4; /* all_nulls */
+ keys[2] = (asc) ? 6 : 7; /* min_value (asc) or max_value (desc) */
+ keys[3] = 1; /* blkno_start */
+
+ collations[0] = InvalidOid; /* FIXME */
+ collations[1] = InvalidOid; /* FIXME */
+ collations[2] = InvalidOid; /* FIXME */
+ collations[3] = InvalidOid; /* FIXME */
+
+ /* unrelated to the ordering desired by the user */
+ nullsFirst[0] = false;
+ nullsFirst[1] = false;
+ nullsFirst[2] = false;
+ nullsFirst[3] = false;
+
+ /* lookup sort operator for the boolean type (used for not_summarized) */
+ typcache = lookup_type_cache(BOOLOID, TYPECACHE_GT_OPR);
+ operators[0] = typcache->gt_opr;
+
+ /* lookup sort operator for the boolean type (used for all_nulls) */
+ typcache = lookup_type_cache(BOOLOID, TYPECACHE_LT_OPR);
+ operators[1] = typcache->lt_opr;
+
+ /* lookup sort operator for the min/max type */
+ typid = bdesc->bd_info[attnum - 1]->oi_typcache[0]->type_id;
+ typcache = lookup_type_cache(typid, TYPECACHE_LT_OPR | TYPECACHE_GT_OPR);
+ operators[2] = (asc) ? typcache->lt_opr : typcache->gt_opr;
+
+ /* lookup sort operator for the bigint type (used for blkno_start) */
+ typcache = lookup_type_cache(INT8OID, TYPECACHE_LT_OPR);
+ operators[3] = typcache->lt_opr;
+
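+ /*
+ * Random access is requested because the sorted list of ranges gets
+ * scanned repeatedly - the debug dump below rescans it, and the BRIN
+ * Sort executor node marks/restores its position while advancing the
+ * watermark.
+ */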
+ scan->ranges = tuplesort_begin_heap(scan->tdesc,
+ 4, /* nkeys */
+ keys,
+ operators,
+ collations,
+ nullsFirst,
+ work_mem,
+ NULL,
+ TUPLESORT_RANDOMACCESS);
+
+ scan->slot = MakeSingleTupleTableSlot(scan->tdesc,
+ &TTSOpsMinimalTuple);
+
+ return scan;
+}
+
+/*
+ * brin_minmax_scan_add_tuple
+ * Form a tuple representing the BRIN range and store it in the tuplesort.
+ */
+static void
+brin_minmax_scan_add_tuple(BrinRangeScanDesc *scan,
+ BlockNumber block_start, BlockNumber block_end,
+ bool has_nulls, bool all_nulls, bool not_summarized,
+ Datum min_value, Datum max_value)
+{
+ MinimalTuple tup;
+
+ tup = brin_minmax_range_tuple(scan->tdesc, block_start, block_end,
+ has_nulls, all_nulls, not_summarized,
+ min_value, max_value);
+
+ ExecStoreMinimalTuple(tup, scan->slot, false);
+
+ tuplesort_puttupleslot(scan->ranges, scan->slot);
+}
+
+#ifdef BRINSORT_DEBUG
+/*
+ * brin_minmax_scan_next
+ * Return the next BRIN range information from the tuplestore.
+ *
+ * Returns NULL when there are no more ranges.
+ */
+static BrinRange *
+brin_minmax_scan_next(BrinRangeScanDesc *scan)
+{
+ if (tuplesort_gettupleslot(scan->ranges, true, false, scan->slot, NULL))
+ {
+ bool isnull;
+ BrinRange *range = (BrinRange *) palloc(sizeof(BrinRange));
+
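+ /*
+ * Decompose the minimal tuple built by brin_minmax_range_tuple(). Note
+ * that min_value/max_value are stored as NULL for all-nulls and
+ * not-summarized ranges, so they are only meaningful when both of those
+ * flags are false.
+ */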
+ range->blkno_start = slot_getattr(scan->slot, 1, &isnull);
+ range->blkno_end = slot_getattr(scan->slot, 2, &isnull);
+ range->has_nulls = slot_getattr(scan->slot, 3, &isnull);
+ range->all_nulls = slot_getattr(scan->slot, 4, &isnull);
+ range->not_summarized = slot_getattr(scan->slot, 5, &isnull);
+ range->min_value = slot_getattr(scan->slot, 6, &isnull);
+ range->max_value = slot_getattr(scan->slot, 7, &isnull);
+
+ return range;
+ }
+
+ return NULL;
+}
+
+/*
+ * brin_minmax_scan_dump
+ * Print info about all page ranges stored in the tuplesort.
+ */
+static void
+brin_minmax_scan_dump(BrinRangeScanDesc *scan)
+{
+ BrinRange *range;
+
+ elog(WARNING, "===== dumping =====");
+ while ((range = brin_minmax_scan_next(scan)) != NULL)
+ {
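+ /*
+ * XXX Debugging aid only - printing min/max via DatumGetFloat8 assumes
+ * the indexed column is of type float8.
+ */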
+ elog(WARNING, "[%u %u] has_nulls %d all_nulls %d not_summarized %d values [%f %f]",
+ range->blkno_start, range->blkno_end,
+ range->has_nulls, range->all_nulls, range->not_summarized,
+ DatumGetFloat8(range->min_value), DatumGetFloat8(range->max_value));
+
+ pfree(range);
+ }
+
+ /* rescan the tuplesort, so that we can read the ranges again */
+ tuplesort_rescan(scan->ranges);
+}
+#endif
+
+static void
+brin_minmax_scan_finalize(BrinRangeScanDesc *scan)
+{
+ tuplesort_performsort(scan->ranges);
+}
+
+/*
+ * brin_minmax_ranges
+ * Load the BRIN ranges and sort them.
+ */
+Datum
+brin_minmax_ranges(PG_FUNCTION_ARGS)
+{
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+ AttrNumber attnum = PG_GETARG_INT16(1);
+ bool asc = PG_GETARG_BOOL(2);
+ BrinOpaque *opaque;
+ Relation indexRel;
+ Relation heapRel;
+ BlockNumber nblocks;
+ BlockNumber heapBlk;
+ Oid heapOid;
+ BrinMemTuple *dtup;
+ BrinTuple *btup = NULL;
+ Size btupsz = 0;
+ Buffer buf = InvalidBuffer;
+ BlockNumber pagesPerRange;
+ BrinDesc *bdesc;
+ BrinRangeScanDesc *brscan;
+
+ /*
+ * Determine how many BRIN ranges there could be, allocate space and read
+ * all the min/max values.
+ */
+ opaque = (BrinOpaque *) scan->opaque;
+ bdesc = opaque->bo_bdesc;
+ pagesPerRange = opaque->bo_pagesPerRange;
+
+ indexRel = bdesc->bd_index;
+
+ /* make sure the provided attnum is valid */
+ Assert((attnum > 0) && (attnum <= bdesc->bd_tupdesc->natts));
+
+ /*
+ * We need to know the size of the table so that we know how long to iterate
+ * on the revmap (and to pre-allocate the arrays).
+ */
+ heapOid = IndexGetRelation(RelationGetRelid(indexRel), false);
+ heapRel = table_open(heapOid, AccessShareLock);
+ nblocks = RelationGetNumberOfBlocks(heapRel);
+ table_close(heapRel, AccessShareLock);
+
+ /* allocate an initial in-memory tuple, out of the per-range memcxt */
+ dtup = brin_new_memtuple(bdesc);
+
+ /* initialize the scan descriptor for ranges sorted by minval */
+ brscan = brin_minmax_scan_init(bdesc, attnum, asc);
+
+ /*
+ * Now scan the revmap. We start by querying for heap page 0,
+ * incrementing by the number of pages per range; this gives us a full
+ * view of the table.
+ */
+ for (heapBlk = 0; heapBlk < nblocks; heapBlk += pagesPerRange)
+ {
+ bool gottuple = false;
+ BrinTuple *tup;
+ OffsetNumber off;
+ Size size;
+
+ CHECK_FOR_INTERRUPTS();
+
+ tup = brinGetTupleForHeapBlock(opaque->bo_rmAccess, heapBlk, &buf,
+ &off, &size, BUFFER_LOCK_SHARE,
+ scan->xs_snapshot);
+ if (tup)
+ {
+ gottuple = true;
+ btup = brin_copy_tuple(tup, size, btup, &btupsz);
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ }
+
+ /*
+ * Ranges with no indexed tuple may contain anything.
+ */
+ if (!gottuple)
+ {
+ brin_minmax_scan_add_tuple(brscan,
+ heapBlk, heapBlk + (pagesPerRange - 1),
+ false, false, true, 0, 0);
+ }
+ else
+ {
+ dtup = brin_deform_tuple(bdesc, btup, dtup);
+ if (dtup->bt_placeholder)
+ {
+ /*
+ * Placeholder tuples are treated as if not summarized.
+ *
+ * XXX Is this correct?
+ */
+ brin_minmax_scan_add_tuple(brscan,
+ heapBlk, heapBlk + (pagesPerRange - 1),
+ false, false, true, 0, 0);
+ }
+ else
+ {
+ BrinValues *bval;
+
+ bval = &dtup->bt_columns[attnum - 1];
+
+ brin_minmax_scan_add_tuple(brscan,
+ heapBlk, heapBlk + (pagesPerRange - 1),
+ bval->bv_hasnulls, bval->bv_allnulls, false,
+ bval->bv_values[0], bval->bv_values[1]);
+ }
+ }
+ }
+
+ if (buf != InvalidBuffer)
+ ReleaseBuffer(buf);
+
+ /* do the sort and any necessary post-processing */
+ brin_minmax_scan_finalize(brscan);
+
+#ifdef BRINSORT_DEBUG
+ brin_minmax_scan_dump(brscan);
+#endif
+
+ PG_RETURN_POINTER(brscan);
+}
+
/*
* Cache and return the procedure for the given strategy.
*
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index f86983c6601..e15b29246b1 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -85,6 +85,8 @@ static void show_sort_keys(SortState *sortstate, List *ancestors,
ExplainState *es);
static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
List *ancestors, ExplainState *es);
+static void show_brinsort_keys(BrinSortState *sortstate, List *ancestors,
+ ExplainState *es);
static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
ExplainState *es);
static void show_agg_keys(AggState *astate, List *ancestors,
@@ -1100,6 +1102,7 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
case T_IndexScan:
case T_IndexOnlyScan:
case T_BitmapHeapScan:
+ case T_BrinSort:
case T_TidScan:
case T_TidRangeScan:
case T_SubqueryScan:
@@ -1262,6 +1265,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
case T_IndexOnlyScan:
pname = sname = "Index Only Scan";
break;
+ case T_BrinSort:
+ pname = sname = "BRIN Sort";
+ break;
case T_BitmapIndexScan:
pname = sname = "Bitmap Index Scan";
break;
@@ -1508,6 +1514,16 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainScanTarget((Scan *) indexonlyscan, es);
}
break;
+ case T_BrinSort:
+ {
+ BrinSort *brinsort = (BrinSort *) plan;
+
+ ExplainIndexScanDetails(brinsort->indexid,
+ brinsort->indexorderdir,
+ es);
+ ExplainScanTarget((Scan *) brinsort, es);
+ }
+ break;
case T_BitmapIndexScan:
{
BitmapIndexScan *bitmapindexscan = (BitmapIndexScan *) plan;
@@ -1790,6 +1806,18 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainPropertyFloat("Heap Fetches", NULL,
planstate->instrument->ntuples2, 0, es);
break;
+ case T_BrinSort:
+ show_scan_qual(((BrinSort *) plan)->indexqualorig,
+ "Index Cond", planstate, ancestors, es);
+ if (((BrinSort *) plan)->indexqualorig)
+ show_instrumentation_count("Rows Removed by Index Recheck", 2,
+ planstate, es);
+ show_scan_qual(plan->qual, "Filter", planstate, ancestors, es);
+ show_brinsort_keys(castNode(BrinSortState, planstate), ancestors, es);
+ if (plan->qual)
+ show_instrumentation_count("Rows Removed by Filter", 1,
+ planstate, es);
+ break;
case T_BitmapIndexScan:
show_scan_qual(((BitmapIndexScan *) plan)->indexqualorig,
"Index Cond", planstate, ancestors, es);
@@ -2389,6 +2417,21 @@ show_incremental_sort_keys(IncrementalSortState *incrsortstate,
ancestors, es);
}
+/*
+ * Show the sort keys for a BRIN Sort node.
+ */
+static void
+show_brinsort_keys(BrinSortState *sortstate, List *ancestors, ExplainState *es)
+{
+ BrinSort *plan = (BrinSort *) sortstate->ss.ps.plan;
+
+ show_sort_group_keys((PlanState *) sortstate, "Sort Key",
+ plan->numCols, 0, plan->sortColIdx,
+ plan->sortOperators, plan->collations,
+ plan->nullsFirst,
+ ancestors, es);
+}
+
/*
* Likewise, for a MergeAppend node.
*/
@@ -3812,6 +3855,7 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
case T_ForeignScan:
case T_CustomScan:
case T_ModifyTable:
+ case T_BrinSort:
/* Assert it's on a real relation */
Assert(rte->rtekind == RTE_RELATION);
objectname = get_rel_name(rte->relid);
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 11118d0ce02..bcaa2ce8e21 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -38,6 +38,7 @@ OBJS = \
nodeBitmapHeapscan.o \
nodeBitmapIndexscan.o \
nodeBitmapOr.o \
+ nodeBrinSort.o \
nodeCtescan.o \
nodeCustom.o \
nodeForeignscan.o \
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 36406c3af57..4a6dc3f263c 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -79,6 +79,7 @@
#include "executor/nodeBitmapHeapscan.h"
#include "executor/nodeBitmapIndexscan.h"
#include "executor/nodeBitmapOr.h"
+#include "executor/nodeBrinSort.h"
#include "executor/nodeCtescan.h"
#include "executor/nodeCustom.h"
#include "executor/nodeForeignscan.h"
@@ -226,6 +227,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
estate, eflags);
break;
+ case T_BrinSort:
+ result = (PlanState *) ExecInitBrinSort((BrinSort *) node,
+ estate, eflags);
+ break;
+
case T_BitmapIndexScan:
result = (PlanState *) ExecInitBitmapIndexScan((BitmapIndexScan *) node,
estate, eflags);
@@ -639,6 +645,10 @@ ExecEndNode(PlanState *node)
ExecEndIndexOnlyScan((IndexOnlyScanState *) node);
break;
+ case T_BrinSortState:
+ ExecEndBrinSort((BrinSortState *) node);
+ break;
+
case T_BitmapIndexScanState:
ExecEndBitmapIndexScan((BitmapIndexScanState *) node);
break;
diff --git a/src/backend/executor/nodeBrinSort.c b/src/backend/executor/nodeBrinSort.c
new file mode 100644
index 00000000000..ca72c1ed22d
--- /dev/null
+++ b/src/backend/executor/nodeBrinSort.c
@@ -0,0 +1,1550 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeBrinSort.c
+ * Routines to support sorted scan of relations using a BRIN index
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * The overall algorithm is roughly this:
+ *
+ * 0) initialize a tuplestore and a tuplesort
+ *
+ * 1) fetch list of page ranges from the BRIN index, sorted by minval
+ * (with the not-summarized ranges first, and all-null ranges last)
+ *
+ * 2) for NULLS FIRST ordering, walk all ranges that may contain NULL
+ * values and output them (and return to the beginning of the list)
+ *
+ * 3) while there are ranges in the list, do this:
+ *
+ * a) get next (distinct) minval from the list, call it watermark
+ *
+ * b) if there are any tuples in the tuplestore, move them to tuplesort
+ *
+ * c) process all ranges with (minval < watermark) - read tuples and feed
+ * them either into the tuplesort (when value < watermark) or the
+ * tuplestore (otherwise)
+ *
+ * d) sort the tuplesort contents, output all the tuples
+ *
+ * 4) if some tuples remain in the tuplestore, sort and output them
+ *
+ * 5) for NULLS LAST ordering, walk all ranges that may contain NULL
+ * values and output them (and return to the beginning of the list)
+ *
+ *
+ * For DESC orderings the process is almost the same, except that we look
+ * at maxval and use '>' operator (but that's transparent).
+ *
+ * There are a couple of things that might be done in different ways:
+ *
+ * 1) Not using a tuplestore, and feeding tuples only to a tuplesort. Then
+ * while producing the tuples, we'd only output tuples up to the current
+ * watermark, and then we'd keep the remaining tuples for the next round.
+ * Either we'd need to transfer them into a second tuplesort, or allow
+ * "reopening" the tuplesort and adding more tuples. And then only the
+ * part since the watermark would get sorted (possibly using a merge-sort
+ * with the already sorted part).
+ *
+ *
+ * 2) The other question is what to do with NULL values - at the moment we
+ * just read the ranges, output the NULL tuples and that's it - we're not
+ * retaining any non-NULL tuples, so that we'll read the ranges again in
+ * the second range. The logic here is that either there are very few
+ * such ranges, so it's won't cost much to just re-read them. Or maybe
+ * there are very many such ranges, and we'd do a lot of spilling to the
+ * tuplestore, and it's not much more expensive to just re-read the source
+ * data. There are counter-examples, though - e.g., there might be many
+ * has_nulls ranges, but with very few non-NULL tuples. In this case it
+ * might be better to actually spill the tuples instead of re-reading all
+ * the ranges. Maybe this is something we can do at run-time, or maybe we
+ * could estimate this at planning time. We do know the null_frac for the
+ * column, so we know the number of NULL rows. And we also know the number
+ * of all_nulls and has_nulls ranges. We can estimate the number of rows
+ * per range, and we can estimate how many non-NULL rows are in the
+ * has_nulls ranges (we don't need to re-read all-nulls ranges). There's
+ * also the filter, which may reduce the amount of rows to store.
+ *
+ * So we'd need to compare two metrics calculated roughly like this:
+ *
+ * cost(re-reading has-nulls ranges)
+ * = cost(random_page_cost * n_has_nulls + seq_page_cost * pages_per_range)
+ *
+ * cost(spilling non-NULL rows from has-nulls ranges)
+ * = cost(numrows * width / BLCKSZ * seq_page_cost * 2)
+ *
+ * where numrows is the number of non-NULL rows in has_null ranges, which
+ * can be calculated like this:
+ *
+ * // estimated number of rows in has-null ranges
+ * rows_in_has_nulls = (reltuples / relpages) * pages_per_range * n_has_nulls
+ *
+ * // number of NULL rows in the has-nulls ranges
+ * nulls_in_ranges = reltuples * null_frac - n_all_nulls * (reltuples / relpages)
+ *
+ * // numrows is the difference, multiplied by selectivity of the index
+ * // filter condition (value between 0.0 and 1.0)
+ * numrows = (rows_in_has_nulls - nulls_in_ranges) * selectivity
+ *
+ * This ignores non-summarized ranges, but there should be only very few of
+ * those, so it should not make a huge difference. Otherwise we can divide
+ * them between regular, has-nulls and all-nulls pages to keep the ratio.
+ *
+ *
+ * 3) How large a step to make when updating the watermark?
+ *
+ * When updating the watermark, one option is to simply proceed to the next
+ * distinct minval value, which is the smallest possible step we can make.
+ * This may be both fine and very inefficient, depending on how many rows
+ * end up in the tuplesort and how many rows we end up spilling (possibly
+ * repeatedly to the tuplestore).
+ *
+ * When having to sort a large number of rows, it's inefficient to run many
+ * tiny sorts, even if it produces correct result. For example when sorting
+ * 1M rows, we may split this as either (a) 100000x sorts of 10 rows, or
+ * (b) 1000 sorts of 1000 rows. The (b) option is almost certainly more
+ * efficient. Maybe sorts of 10k rows would be even better, if it fits
+ * into work_mem.
+ *
+ * This gets back to how large the page ranges are, and if/how much they
+ * overlap. With tiny ranges (e.g. single-page ranges), a single range
+ * can only add as many rows as we can fit on a single page. So we need
+ * more ranges by default - how many watermark steps that is depends on
+ * how many distinct minval values there are ...
+ *
+ * Then there's overlaps - if ranges do not overlap, we're done and we'll
+ * add the whole range because the next watermark is above maxval. But
+ * when the ranges overlap, we'll only add the first part (assuming the
+ * minval of the next range is the watermark). Assume 10 overlapping
+ * ranges - imagine for example ranges shifted by 10%, so something like
+ *
+ * [0,100] [10,110], [20,120], [30, 130], ..., [90, 190]
+ *
+ * In the first step we use watermark=10 and load the first range, with
+ * maybe 1000 rows in total. But assuming uniform distribution, only about
+ * 100 rows will go into the tuplesort, the remaining 900 rows will go into
+ * the tuplestore. Then in the second step
+ * we sort another 100 rows and the remaining 800 rows will be moved into
+ * a new tuplestore. And so on and so on.
+ *
+ * This means that incrementing the watermarks by single steps may be
+ * quite inefficient, and we need to reflect both the range size and
+ * how much the ranges overlap.
+ *
+ * In fact, maybe we should not determine the step as the number of minval
+ * values to skip, but as the number of ranges that would mean reading.
+ * Because if we have a minval with many duplicates, that may load many rows.
+ * Or even better, we could look at how many rows that would mean loading
+ * into the tuplestore - if we track P(x<minval) for each range (e.g. by
+ * calculating average value during ANALYZE, or perhaps by estimating
+ * it from per-column stats), then we know the increment is going to be
+ * about
+ *
+ * P(x < minval[i]) - P(x < minval[i-1])
+ *
+ * and we can stop once we'd exceed work_mem (with some slack). See comment
+ * for brin_minmax_stats() for more thoughts.
+ *
+ *
+ * 4) LIMIT/OFFSET vs. full sort
+ *
+ * There's one case where very small sorts may be actually optimal, and
+ * that's queries that need to process only very few rows - say, LIMIT
+ * queries with very small bound.
+ *
+ *
+ * FIXME Projection does not work (fails on projection slot expecting
+ * buffer ops, but we're sending it minimal tuple slot).
+ *
+ * FIXME The tlists are not wired quite correctly - the sortColIdx is an
+ * index to the tlist, but we need attnum from the heap table, so that we
+ * can fetch the attribute etc. Or maybe fetching the value from the raw
+ * tuple (before projection) is wrong and needs to be done differently.
+ *
+ * FIXME Indexes on expressions don't work (possibly related to the tlist
+ * being done incorrectly).
+ *
+ * FIXME handling of other brin opclasses (minmax-multi)
+ *
+ * FIXME improve costing
+ *
+ *
+ * Improvement ideas:
+ *
+ * 1) multiple tuplestores for overlapping ranges
+ *
+ * When there are many overlapping ranges (so that maxval > current.maxval),
+ * we're loading all the "future" tuples into a new tuplestore. However, if
+ * there are multiple such ranges (imagine ranges "shifting" by 10%, which
+ * gives us 9 more ranges), we know in the next round we'll only need rows
+ * until the next maxval. We'll not sort these rows, but we'll still shuffle
+ * them around until we get to the proper range (so about 10x each row).
+ * Maybe we should pre-allocate the tuplestores (or maybe even tuplesorts)
+ * for future ranges, and route the tuples to the correct one? Maybe we
+ * could be a bit smarter and discard tuples once we have enough rows for
+ * the preceding ranges (say, with LIMIT queries). We'd also need to worry
+ * about work_mem, though - we can't just use many tuplestores, each with
+ * whole work_mem. So we'd probably use e.g. work_mem/2 for the next one,
+ * and then /4, /8 etc. for the following ones. That's work_mem in total.
+ * And there'd need to be some limit on number of tuplestores, I guess.
+ *
+ * 2) handling NULL values
+ *
+ * We need to handle NULLS FIRST / NULLS LAST cases. The question is how
+ * to do that - the easiest way is to simply do a separate scan of ranges
+ * that might contain NULL values, processing just rows with NULLs, and
+ * discarding other rows. And then process non-NULL values as currently.
+ * The NULL scan would happen before/after this regular phase.
+ *
+ * But maybe we could be smarter, and not do separate scans. When reading
+ * a page, we might stash the tuple in a tuplestore, so that we can read
+ * it the next round. Obviously, this might be expensive if we need to
+ * keep too many rows, so the tuplestore would grow too large - in that
+ * case it might be better to just do the two scans.
+ *
+ * 3) parallelism
+ *
+ * Presumably we could do a parallel version of this. The leader or first
+ * worker would prepare the range information, and the workers would then
+ * grab ranges (in a kinda round robin manner), sort them independently,
+ * and then the results would be merged by Gather Merge.
+ *
+ * IDENTIFICATION
+ * src/backend/executor/nodeBrinSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ * ExecBrinSort scans a relation using an index
+ * IndexNext retrieve next tuple using index
+ * ExecInitBrinSort creates and initializes state info.
+ * ExecReScanBrinSort rescans the indexed relation.
+ * ExecEndBrinSort releases all storage.
+ * ExecBrinSortMarkPos marks scan position.
+ * ExecBrinSortRestrPos restores scan position.
+ * ExecBrinSortEstimate estimates DSM space needed for parallel index scan
+ * ExecBrinSortInitializeDSM initialize DSM for parallel BrinSort
+ * ExecBrinSortReInitializeDSM reinitialize DSM for fresh scan
+ * ExecBrinSortInitializeWorker attach to DSM info in parallel worker
+ */
+#include "postgres.h"
+
+#include "access/brin.h"
+#include "access/brin_internal.h"
+#include "access/nbtree.h"
+#include "access/relscan.h"
+#include "access/table.h"
+#include "access/tableam.h"
+#include "catalog/index.h"
+#include "catalog/pg_am.h"
+#include "executor/execdebug.h"
+#include "executor/nodeBrinSort.h"
+#include "lib/pairingheap.h"
+#include "miscadmin.h"
+#include "nodes/nodeFuncs.h"
+#include "utils/array.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+
+
+static TupleTableSlot *IndexNext(BrinSortState *node);
+static bool IndexRecheck(BrinSortState *node, TupleTableSlot *slot);
+static void ExecInitBrinSortRanges(BrinSort *node, BrinSortState *planstate);
+
+#define BRINSORT_DEBUG
+
+/* do various consistency checks */
+static void
+AssertCheckRanges(BrinSortState *node)
+{
+#ifdef USE_ASSERT_CHECKING
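+ /* no consistency checks implemented yet */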
+
+#endif
+}
+
+/*
+ * brinsort_start_tidscan
+ * Start scanning tuples from a given page range.
+ *
+ * We open a TID range scan for the given range, and initialize the tuplesort.
+ * Optionally, we update the watermark (with either high/low value). We only
+ * need to do this for the main page range, not for the intersecting ranges.
+ *
+ * XXX Maybe we should initialize the tidscan only once, and then do rescan
+ * for the following ranges? And similarly for the tuplesort?
+ */
+static void
+brinsort_start_tidscan(BrinSortState *node)
+{
+ BrinSort *plan = (BrinSort *) node->ss.ps.plan;
+ EState *estate = node->ss.ps.state;
+ BrinRange *range = node->bs_range;
+
+ /* There must not be any TID scan in progress yet. */
+ Assert(node->ss.ss_currentScanDesc == NULL);
+
+ /* Initialize the TID range scan, for the provided block range. */
+ if (node->ss.ss_currentScanDesc == NULL)
+ {
+ TableScanDesc tscandesc;
+ ItemPointerData mintid,
+ maxtid;
+
+ ItemPointerSetBlockNumber(&mintid, range->blkno_start);
+ ItemPointerSetOffsetNumber(&mintid, 0);
+
+ ItemPointerSetBlockNumber(&maxtid, range->blkno_end);
+ ItemPointerSetOffsetNumber(&maxtid, MaxHeapTuplesPerPage);
+
+ elog(DEBUG1, "loading range blocks [%u, %u]",
+ range->blkno_start, range->blkno_end);
+
+ tscandesc = table_beginscan_tidrange(node->ss.ss_currentRelation,
+ estate->es_snapshot,
+ &mintid, &maxtid);
+ node->ss.ss_currentScanDesc = tscandesc;
+ }
+
+ if (node->bs_tuplesortstate == NULL)
+ {
+ TupleDesc tupDesc = RelationGetDescr(node->ss.ss_currentRelation);
+
+ node->bs_tuplesortstate = tuplesort_begin_heap(tupDesc,
+ plan->numCols,
+ plan->sortColIdx,
+ plan->sortOperators,
+ plan->collations,
+ plan->nullsFirst,
+ work_mem,
+ NULL,
+ TUPLESORT_NONE);
+ }
+
+ if (node->bs_tuplestore == NULL)
+ {
+ node->bs_tuplestore = tuplestore_begin_heap(false, false, work_mem);
+ }
+}
+
+/*
+ * brinsort_end_tidscan
+ * Finish the TID range scan.
+ */
+static void
+brinsort_end_tidscan(BrinSortState *node)
+{
+ /* if a TID range scan is in progress, end it */
+ if (node->ss.ss_currentScanDesc != NULL)
+ {
+ table_endscan(node->ss.ss_currentScanDesc);
+ node->ss.ss_currentScanDesc = NULL;
+ }
+}
+
+/*
+ * brinsort_update_watermark
+ * Advance the watermark to the next minval (or maxval for DESC).
+ *
+ * We could actually advance the watermark by multiple steps (not to the
+ * immediately following minval, but a couple of steps further), to
+ * accumulate more rows in the tuplesort. The number of steps we make
+ * correlates with the amount of data we sort in a given round, but we
+ * don't know in advance how many rows (or bytes) that will actually be.
+ * We could use some simple heuristics (measure past sorts and extrapolate).
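+ *
+ * A rough illustration (made-up values): with ranges whose minvals are
+ * 0, 50 and 120 (in this order), the watermark is first set to 50, so the
+ * first batch sorts rows with values up to 50; it then advances to 120
+ * for the next batch, and so on until all ranges are consumed.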
+ */
+static void
+brinsort_update_watermark(BrinSortState *node, bool asc)
+{
+ int cmp;
+ bool found = false;
+
+ tuplesort_markpos(node->bs_scan->ranges);
+
+ while (tuplesort_gettupleslot(node->bs_scan->ranges, true, false, node->bs_scan->slot, NULL))
+ {
+ bool isnull;
+ Datum value;
+ bool all_nulls;
+ bool not_summarized;
+
+ all_nulls = DatumGetBool(slot_getattr(node->bs_scan->slot, 4, &isnull));
+ Assert(!isnull);
+
+ not_summarized = DatumGetBool(slot_getattr(node->bs_scan->slot, 5, &isnull));
+ Assert(!isnull);
+
+ /* we ignore ranges that are either all_nulls or not summarized */
+ if (all_nulls || not_summarized)
+ continue;
+
+ /* use either minval or maxval, depending on the ASC / DESC */
+ if (asc)
+ value = slot_getattr(node->bs_scan->slot, 6, &isnull);
+ else
+ value = slot_getattr(node->bs_scan->slot, 7, &isnull);
+
+ if (!node->bs_watermark_set)
+ {
+ node->bs_watermark_set = true;
+ node->bs_watermark = value;
+ continue;
+ }
+
+ cmp = ApplySortComparator(node->bs_watermark, false, value, false,
+ &node->bs_sortsupport);
+
+ if (cmp < 0)
+ {
+ node->bs_watermark_set = true;
+ node->bs_watermark = value;
+ found = true;
+ break;
+ }
+ }
+
+ tuplesort_restorepos(node->bs_scan->ranges);
+
+ node->bs_watermark_set = found;
+}
+
+/*
+ * brinsort_load_tuples
+ * Load tuples from the TID range scan, add them to tuplesort/store.
+ *
+ * When called for the "current" range, we don't need to check the watermark,
+ * we know the tuple goes into the tuplesort. So with check_watermark=false
+ * we skip the comparator call to save CPU cost.
+ */
+static void
+brinsort_load_tuples(BrinSortState *node, bool check_watermark, bool null_processing)
+{
+ BrinSort *plan = (BrinSort *) node->ss.ps.plan;
+ TableScanDesc scan;
+ EState *estate;
+ ScanDirection direction;
+ TupleTableSlot *slot;
+ BrinRange *range = node->bs_range;
+
+ estate = node->ss.ps.state;
+ direction = estate->es_direction;
+
+ slot = node->ss.ss_ScanTupleSlot;
+
+ Assert(node->bs_range != NULL);
+
+ /*
+ * If we're not processing NULLs, and this is an all-nulls range, we can
+ * just skip it - we won't find any non-NULL tuples in it.
+ *
+ * XXX Shouldn't happen, thanks to logic in brinsort_next_range().
+ */
+ if (!null_processing && range->all_nulls)
+ return;
+
+ /*
+ * Similarly, if we're processing NULLs and this range does not have the
+ * has_nulls flag set, we can skip it.
+ *
+ * XXX Shouldn't happen, thanks to logic in brinsort_next_range().
+ */
+ if (null_processing && !(range->has_nulls || range->not_summarized || range->all_nulls))
+ return;
+
+ brinsort_start_tidscan(node);
+
+ scan = node->ss.ss_currentScanDesc;
+
+ /*
+ * Read tuples, evaluate the filter (so that we don't keep tuples only to
+ * discard them later), and decide whether each one goes into the current
+ * range (tuplesort) or into the overflow (tuplestore).
+ */
+ while (table_scan_getnextslot_tidrange(scan, direction, slot))
+ {
+ ExprContext *econtext;
+ ExprState *qual;
+
+ /*
+ * Fetch data from node
+ */
+ qual = node->bs_qual;
+ econtext = node->ss.ps.ps_ExprContext;
+
+ /*
+ * place the current tuple into the expr context
+ */
+ econtext->ecxt_scantuple = slot;
+
+ /*
+ * check that the current tuple satisfies the qual-clause
+ *
+ * check for non-null qual here to avoid a function call to ExecQual()
+ * when the qual is null ... saves only a few cycles, but they add up
+ * ...
+ *
+ * XXX Done here, because in ExecScan we'll get different slot type
+ * (minimal tuple vs. buffered tuple). Scan expects slot while reading
+ * from the table (like here), but we're stashing it into a tuplesort.
+ *
+ * XXX Maybe we could eliminate many tuples by leveraging the BRIN
+ * range, by executing the consistent function. But we don't have
+ * the qual in appropriate format at the moment, so we'd preprocess
+ * the keys similarly to bringetbitmap(). In which case we should
+ * probably evaluate the stuff while building the ranges? Although,
+ * if the "consistent" function is expensive, it might be cheaper
+ * to do that incrementally, as we need the ranges. Would be a win
+ * for LIMIT queries, for example.
+ *
+ * XXX However, maybe we could also leverage other indexes through
+ * bitmaps, particularly BRIN indexes, because that makes it simpler to
+ * eliminate the ranges incrementally - we know which ranges to
+ * load from the index, while for other indexes (e.g. btree) we
+ * have to read the whole index and build a bitmap in order to have
+ * a bitmap for any range. Although, if the condition is very
+ * selective, we may need to read only a small fraction of the
+ * index, so maybe that's OK.
+ */
+ if (qual == NULL || ExecQual(qual, econtext))
+ {
+ int cmp = 0; /* matters for check_watermark=false */
+ Datum value;
+ bool isnull;
+
+ value = slot_getattr(slot, plan->sortColIdx[0], &isnull);
+
+ /*
+ * FIXME Not handling NULLS for now, we need to stash them into
+ * a separate tuplestore (so that we can output them first or
+ * last), and then skip them in the regular processing?
+ */
+ if (null_processing)
+ {
+ /* Stash it into the tuplestore (when NULL), or ignore
+ * it (when not NULL). */
+ if (isnull)
+ tuplestore_puttupleslot(node->bs_tuplestore, slot);
+
+ /* NULL or not, we're done */
+ continue;
+ }
+
+ /* we're not processing NULL values, so ignore NULLs */
+ if (isnull)
+ continue;
+
+ /*
+ * Otherwise compare to watermark, and stash it either to the
+ * tuplesort or tuplestore.
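+ *
+ * E.g. with watermark = 50 (a made-up value), a row with value 37 is
+ * sorted in the current batch, while 63 goes to the spill tuplestore
+ * and is reconsidered once the watermark advances past it.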
+ */
+ if (check_watermark && node->bs_watermark_set)
+ cmp = ApplySortComparator(value, false,
+ node->bs_watermark, false,
+ &node->bs_sortsupport);
+
+ if (cmp <= 0)
+ tuplesort_puttupleslot(node->bs_tuplesortstate, slot);
+ else
+ tuplestore_puttupleslot(node->bs_tuplestore, slot);
+ }
+
+ ExecClearTuple(slot);
+ }
+
+ ExecClearTuple(slot);
+
+ brinsort_end_tidscan(node);
+}
+
+/*
+ * brinsort_load_spill_tuples
+ * Load tuples from the spill tuplestore, and either stash them into
+ * a tuplesort or a new tuplestore.
+ *
+ * After processing the last range, we want to process all remaining spilled
+ * tuples, so with check_watermark=false we skip the check.
+ */
+static void
+brinsort_load_spill_tuples(BrinSortState *node, bool check_watermark)
+{
+ BrinSort *plan = (BrinSort *) node->ss.ps.plan;
+ Tuplestorestate *tupstore;
+ TupleTableSlot *slot;
+
+ if (node->bs_tuplestore == NULL)
+ return;
+
+ /* start scanning the existing tuplestore (XXX needed?) */
+ tuplestore_rescan(node->bs_tuplestore);
+
+ /*
+ * Create a new tuplestore, for tuples that exceed the watermark and so
+ * should not be included in the current sort.
+ */
+ tupstore = tuplestore_begin_heap(false, false, work_mem);
+
+ /*
+ * We need a slot for minimal tuples. The scan slot uses buffered tuples,
+ * so it'd trigger an error in the loop.
+ */
+ slot = MakeSingleTupleTableSlot(RelationGetDescr(node->ss.ss_currentRelation),
+ &TTSOpsMinimalTuple);
+
+ while (tuplestore_gettupleslot(node->bs_tuplestore, true, true, slot))
+ {
+ int cmp = 0; /* matters for check_watermark=false */
+ bool isnull;
+ Datum value;
+
+ value = slot_getattr(slot, plan->sortColIdx[0], &isnull);
+
+ /* We shouldn't have NULL values in the spill, at least not now. */
+ Assert(!isnull);
+
+ if (check_watermark && node->bs_watermark_set)
+ cmp = ApplySortComparator(value, false,
+ node->bs_watermark, false,
+ &node->bs_sortsupport);
+
+ if (cmp <= 0)
+ tuplesort_puttupleslot(node->bs_tuplesortstate, slot);
+ else
+ tuplestore_puttupleslot(tupstore, slot);
+ }
+
+ /*
+ * Discard the existing tuplestore (that we just processed), use the new
+ * one instead.
+ */
+ tuplestore_end(node->bs_tuplestore);
+ node->bs_tuplestore = tupstore;
+
+ ExecDropSingleTupleTableSlot(slot);
+}
+
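+/*
+ * brinsort_next_range
+ * Get the next range to process, or return false if there are no more
+ * ranges matching the current watermark (or the NULL-processing criteria).
+ */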
+static bool
+brinsort_next_range(BrinSortState *node, bool asc)
+{
+ /* FIXME free the current bs_range, if any */
+ node->bs_range = NULL;
+
+ /*
+ * Mark the position, so that we can restore it in case we reach the
+ * current watermark.
+ */
+ tuplesort_markpos(node->bs_scan->ranges);
+
+ /*
+ * Get the next range and return it, unless we can prove it's the last
+ * range that can possibly match the current condition (thanks to how we
+ * order the ranges).
+ *
+ * Also skip ranges that can't possibly match (e.g. because we are in
+ * NULL processing, and the range has no NULLs).
+ */
+ while (tuplesort_gettupleslot(node->bs_scan->ranges, true, false, node->bs_scan->slot, NULL))
+ {
+ bool isnull;
+ Datum value;
+
+ BrinRange *range = (BrinRange *) palloc(sizeof(BrinRange));
+
+ range->blkno_start = DatumGetUInt32(slot_getattr(node->bs_scan->slot, 1, &isnull));
+ range->blkno_end = DatumGetUInt32(slot_getattr(node->bs_scan->slot, 2, &isnull));
+ range->has_nulls = DatumGetBool(slot_getattr(node->bs_scan->slot, 3, &isnull));
+ range->all_nulls = DatumGetBool(slot_getattr(node->bs_scan->slot, 4, &isnull));
+ range->not_summarized = DatumGetBool(slot_getattr(node->bs_scan->slot, 5, &isnull));
+ range->min_value = slot_getattr(node->bs_scan->slot, 6, &isnull);
+ range->max_value = slot_getattr(node->bs_scan->slot, 7, &isnull);
+
+ /*
+ * Not-summarized ranges match irrespective of the watermark (if it's
+ * set at all).
+ */
+ if (range->not_summarized)
+ {
+ node->bs_range = range;
+ return true;
+ }
+
+ /*
+ * The range is summarized, but maybe the watermark is not? That
+ * would mean we're processing NULL values, so we skip ranges that
+ * can't possibly match (i.e. with all_nulls=has_nulls=false).
+ */
+ if (!node->bs_watermark_set)
+ {
+ if (range->all_nulls || range->has_nulls)
+ {
+ node->bs_range = range;
+ return true;
+ }
+
+ /* update the position and try the next range */
+ tuplesort_markpos(node->bs_scan->ranges);
+ pfree(range);
+
+ continue;
+ }
+
+ /*
+ * So now we have a summarized range, and we know the watermark
+ * is set too (so we're not processing NULLs). We place the ranges
+ * with only nulls last, so once we hit one we're done.
+ */
+ if (range->all_nulls)
+ {
+ pfree(range);
+ return false; /* no more matching ranges */
+ }
+
+ /*
+ * Compare the range to the watermark, using either the minval or
+ * maxval, depending on ASC/DESC ordering. If the range precedes the
+ * watermark, return it. Otherwise abort, all the future ranges are
+ * either not matching the watermark (thanks to ordering) or contain
+ * only NULL values.
+ */
+
+ /* use minval or maxval, depending on ASC / DESC */
+ value = (asc) ? range->min_value : range->max_value;
+
+ /*
+ * compare it to the current watermark (if set)
+ *
+ * XXX We don't use (... <= 0) here, because then we'd load ranges
+ * with that minval (and there might be multiple), but most of the
+ * rows would go into the tuplestore, because only rows matching the
+ * minval exactly would be loaded into tuplesort.
+ */
+ if (ApplySortComparator(value, false,
+ node->bs_watermark, false,
+ &node->bs_sortsupport) < 0)
+ {
+ node->bs_range = range;
+ return true;
+ }
+
+ pfree(range);
+ break;
+ }
+
+ /* not a matching range, we're done */
+ tuplesort_restorepos(node->bs_scan->ranges);
+
+ return false;
+}
+
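+/*
+ * brinsort_range_with_nulls
+ * Does the current range possibly contain NULL values (or is it not
+ * summarized, so we can't rule them out)?
+ */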
+static bool
+brinsort_range_with_nulls(BrinSortState *node)
+{
+ BrinRange *range = node->bs_range;
+
+ if (range->all_nulls || range->has_nulls || range->not_summarized)
+ return true;
+
+ return false;
+}
+
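+/*
+ * brinsort_rescan
+ * Restart the scan of the sorted range information.
+ */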
+static void
+brinsort_rescan(BrinSortState *node)
+{
+ tuplesort_rescan(node->bs_scan->ranges);
+}
+
+/* ----------------------------------------------------------------
+ * IndexNext
+ *
+ * Retrieve a tuple from the BrinSort node's currentRelation
+ * using the index specified in the BrinSortState information.
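+ *
+ * A sketch of the phase transitions (see the switch statement below):
+ *
+ * START -> LOAD_RANGE (or LOAD_NULLS with NULLS FIRST)
+ * LOAD_RANGE -> PROCESS_RANGE (sort one batch of ranges)
+ * PROCESS_RANGE -> LOAD_RANGE (next watermark), or LOAD_NULLS / FINISHED
+ * LOAD_NULLS -> PROCESS_NULLS, or LOAD_RANGE / FINISHED when exhausted
+ * PROCESS_NULLS -> LOAD_NULLS (next range with NULLs)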
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+IndexNext(BrinSortState *node)
+{
+ BrinSort *plan = (BrinSort *) node->ss.ps.plan;
+ EState *estate;
+ ScanDirection direction;
+ IndexScanDesc scandesc;
+ TupleTableSlot *slot;
+ bool nullsFirst;
+ bool asc;
+
+ /*
+ * extract necessary information from index scan node
+ */
+ estate = node->ss.ps.state;
+ direction = estate->es_direction;
+
+ /* flip direction if this is an overall backward scan */
+ /* XXX For BRIN indexes this is always forward direction */
+ // if (ScanDirectionIsBackward(((BrinSort *) node->ss.ps.plan)->indexorderdir))
+ if (false)
+ {
+ if (ScanDirectionIsForward(direction))
+ direction = BackwardScanDirection;
+ else if (ScanDirectionIsBackward(direction))
+ direction = ForwardScanDirection;
+ }
+ scandesc = node->iss_ScanDesc;
+ slot = node->ss.ss_ScanTupleSlot;
+
+ nullsFirst = plan->nullsFirst[0];
+ asc = ScanDirectionIsForward(plan->indexorderdir);
+
+ if (scandesc == NULL)
+ {
+ /*
+ * We reach here if the index scan is not parallel, or if we're
+ * serially executing an index scan that was planned to be parallel.
+ */
+ scandesc = index_beginscan(node->ss.ss_currentRelation,
+ node->iss_RelationDesc,
+ estate->es_snapshot,
+ node->iss_NumScanKeys,
+ node->iss_NumOrderByKeys);
+
+ node->iss_ScanDesc = scandesc;
+
+ /*
+ * If no run-time keys to calculate or they are ready, go ahead and
+ * pass the scankeys to the index AM.
+ */
+ if (node->iss_NumRuntimeKeys == 0 || node->iss_RuntimeKeysReady)
+ index_rescan(scandesc,
+ node->iss_ScanKeys, node->iss_NumScanKeys,
+ node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+
+ /*
+ * Load info about BRIN ranges, sort them to match the desired ordering.
+ */
+ ExecInitBrinSortRanges(plan, node);
+ node->bs_phase = BRINSORT_START;
+ }
+
+ /*
+ * ok, now that we have what we need, fetch the next tuple.
+ */
+ while (node->bs_phase != BRINSORT_FINISHED)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ elog(DEBUG1, "phase = %d", node->bs_phase);
+
+ AssertCheckRanges(node);
+
+ switch (node->bs_phase)
+ {
+ case BRINSORT_START:
+
+ elog(DEBUG1, "phase = START");
+
+ /*
+ * If we have NULLS FIRST, move to that stage. Otherwise
+ * start scanning regular ranges.
+ */
+ if (nullsFirst)
+ node->bs_phase = BRINSORT_LOAD_NULLS;
+ else
+ {
+ node->bs_phase = BRINSORT_LOAD_RANGE;
+
+ /* set the first watermark */
+ brinsort_update_watermark(node, asc);
+ }
+
+ break;
+
+ case BRINSORT_LOAD_RANGE:
+ {
+ elog(DEBUG1, "phase = LOAD_RANGE");
+
+ /*
+ * Load tuples matching the new watermark from the existing
+ * spill tuplestore. We do this before loading tuples from
+ * the next chunk of ranges, because those will add tuples
+ * to the spill, and we'd end up processing those twice.
+ */
+ brinsort_load_spill_tuples(node, true);
+
+ /*
+ * Load tuples from ranges, until we find a range that has
+ * min_value >= watermark.
+ *
+ * XXX In fact, we are guaranteed to find an exact match
+ * for the watermark, because of how we pick the watermark.
+ */
+ while (brinsort_next_range(node, asc))
+ brinsort_load_tuples(node, true, false);
+
+ /*
+ * If we have loaded any tuples into the tuplesort, try
+ * sorting it and move to producing the tuples.
+ *
+ * XXX The range might have no rows matching the current
+ * watermark, in which case the tuplesort is empty.
+ */
+ if (node->bs_tuplesortstate)
+ {
+ tuplesort_performsort(node->bs_tuplesortstate);
+#ifdef BRINSORT_DEBUG
+ {
+ TuplesortInstrumentation stats;
+
+ tuplesort_get_stats(node->bs_tuplesortstate, &stats);
+
+ elog(DEBUG1, "method: %s space: %ld kB (%s)",
+ tuplesort_method_name(stats.sortMethod),
+ stats.spaceUsed,
+ tuplesort_space_type_name(stats.spaceType));
+ }
+#endif
+ }
+
+ node->bs_phase = BRINSORT_PROCESS_RANGE;
+ break;
+ }
+
+ case BRINSORT_PROCESS_RANGE:
+
+ elog(DEBUG1, "phase BRINSORT_PROCESS_RANGE");
+
+ slot = node->ss.ps.ps_ResultTupleSlot;
+
+ /* read tuples from the tuplesort for this batch, and output them */
+ if (node->bs_tuplesortstate != NULL)
+ {
+ if (tuplesort_gettupleslot(node->bs_tuplesortstate,
+ ScanDirectionIsForward(direction),
+ false, slot, NULL))
+ return slot;
+
+ /* once we're done with the tuplesort, reset it */
+ tuplesort_reset(node->bs_tuplesortstate);
+ }
+
+ /*
+ * Now that we processed tuples from the last range batch,
+ * see if we reached the end or if we should try updating
+ * the watermark once again. If the watermark is not set,
+ * we've already processed the last range.
+ */
+ if (!node->bs_watermark_set)
+ {
+ if (nullsFirst)
+ node->bs_phase = BRINSORT_FINISHED;
+ else
+ {
+ brinsort_rescan(node);
+ node->bs_phase = BRINSORT_LOAD_NULLS;
+ }
+ }
+ else
+ {
+ /* update the watermark and try reading more ranges */
+ node->bs_phase = BRINSORT_LOAD_RANGE;
+ brinsort_update_watermark(node, asc);
+ }
+
+ break;
+
+ case BRINSORT_LOAD_NULLS:
+ {
+ elog(DEBUG1, "phase = LOAD_NULLS");
+
+ /*
+ * Try loading another range. If there are no more ranges, we either
+ * move to loading regular ranges (for NULLS FIRST) or we're done.
+ * Otherwise check if this range can contain NULLs.
+ */
+ while (true)
+ {
+ /* no more ranges - terminate or load regular ranges */
+ if (!brinsort_next_range(node, asc))
+ {
+ if (nullsFirst)
+ {
+ brinsort_rescan(node);
+ node->bs_phase = BRINSORT_LOAD_RANGE;
+ brinsort_update_watermark(node, asc);
+ }
+ else
+ node->bs_phase = BRINSORT_FINISHED;
+
+ break;
+ }
+
+ /* If this range may have NULLs, process them */
+ if (brinsort_range_with_nulls(node))
+ break;
+ }
+
+ if (node->bs_range == NULL)
+ break;
+
+ /*
+ * There should be nothing left in the tuplestore, because
+ * we flush that at the end of processing regular tuples,
+ * and we don't retain tuples between NULL ranges.
+ */
+ // Assert(node->bs_tuplestore == NULL);
+
+ /*
+ * Load the next unprocessed / NULL range. We don't need to
+ * check watermark while processing NULLS.
+ */
+ brinsort_load_tuples(node, false, true);
+
+ node->bs_phase = BRINSORT_PROCESS_NULLS;
+ break;
+ }
+
+ break;
+
+ case BRINSORT_PROCESS_NULLS:
+
+ elog(DEBUG1, "phase = PROCESS_NULLS");
+
+ slot = node->ss.ps.ps_ResultTupleSlot;
+
+ Assert(node->bs_tuplestore != NULL);
+
+ /* read NULL tuples from the tuplestore, and output them */
+ if (node->bs_tuplestore != NULL)
+ {
+
+ while (tuplestore_gettupleslot(node->bs_tuplestore, true, true, slot))
+ return slot;
+
+ tuplestore_end(node->bs_tuplestore);
+ node->bs_tuplestore = NULL;
+
+ node->bs_phase = BRINSORT_LOAD_NULLS; /* load next range */
+ }
+
+ break;
+
+ case BRINSORT_FINISHED:
+ elog(ERROR, "unexpected BrinSort phase: FINISHED");
+ break;
+ }
+ }
+
+ /*
+ * if we get here it means we have exhausted all the ranges, so we are at
+ * the end of the scan.
+ */
+ node->iss_ReachedEnd = true;
+ return ExecClearTuple(slot);
+}
+
+/*
+ * IndexRecheck -- access method routine to recheck a tuple in EvalPlanQual
+ */
+static bool
+IndexRecheck(BrinSortState *node, TupleTableSlot *slot)
+{
+ ExprContext *econtext;
+
+ /*
+ * extract necessary information from index scan node
+ */
+ econtext = node->ss.ps.ps_ExprContext;
+
+ /* Does the tuple meet the indexqual condition? */
+ econtext->ecxt_scantuple = slot;
+ return ExecQualAndReset(node->indexqualorig, econtext);
+}
+
+
+/* ----------------------------------------------------------------
+ * ExecBrinSort(node)
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecBrinSort(PlanState *pstate)
+{
+ BrinSortState *node = castNode(BrinSortState, pstate);
+
+ /*
+ * If we have runtime keys and they've not already been set up, do it now.
+ */
+ if (node->iss_NumRuntimeKeys != 0 && !node->iss_RuntimeKeysReady)
+ ExecReScan((PlanState *) node);
+
+ return ExecScan(&node->ss,
+ (ExecScanAccessMtd) IndexNext,
+ (ExecScanRecheckMtd) IndexRecheck);
+}
+
+/* ----------------------------------------------------------------
+ * ExecReScanBrinSort(node)
+ *
+ * Recalculates the values of any scan keys whose value depends on
+ * information known at runtime, then rescans the indexed relation.
+ *
+ * ----------------------------------------------------------------
+ */
+void
+ExecReScanBrinSort(BrinSortState *node)
+{
+ /*
+ * If we are doing runtime key calculations (ie, any of the index key
+ * values weren't simple Consts), compute the new key values. But first,
+ * reset the context so we don't leak memory as each outer tuple is
+ * scanned. Note this assumes that we will recalculate *all* runtime keys
+ * on each call.
+ */
+ if (node->iss_NumRuntimeKeys != 0)
+ {
+ ExprContext *econtext = node->iss_RuntimeContext;
+
+ ResetExprContext(econtext);
+ ExecIndexEvalRuntimeKeys(econtext,
+ node->iss_RuntimeKeys,
+ node->iss_NumRuntimeKeys);
+ }
+ node->iss_RuntimeKeysReady = true;
+
+ /* reset index scan */
+ if (node->iss_ScanDesc)
+ index_rescan(node->iss_ScanDesc,
+ node->iss_ScanKeys, node->iss_NumScanKeys,
+ node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+ node->iss_ReachedEnd = false;
+
+ ExecScanReScan(&node->ss);
+}
+
+
+/* ----------------------------------------------------------------
+ * ExecEndBrinSort
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndBrinSort(BrinSortState *node)
+{
+ Relation indexRelationDesc;
+ IndexScanDesc indexScanDesc;
+
+ /*
+ * extract information from the node
+ */
+ indexRelationDesc = node->iss_RelationDesc;
+ indexScanDesc = node->iss_ScanDesc;
+
+ /*
+ * clear out tuple table slots
+ */
+ if (node->ss.ps.ps_ResultTupleSlot)
+ ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ ExecClearTuple(node->ss.ss_ScanTupleSlot);
+
+ /*
+ * close the index relation (no-op if we didn't open it)
+ */
+ if (indexScanDesc)
+ index_endscan(indexScanDesc);
+ if (indexRelationDesc)
+ index_close(indexRelationDesc, NoLock);
+
+ if (node->ss.ss_currentScanDesc != NULL)
+ table_endscan(node->ss.ss_currentScanDesc);
+
+ if (node->bs_tuplestore != NULL)
+ tuplestore_end(node->bs_tuplestore);
+ node->bs_tuplestore = NULL;
+
+ if (node->bs_tuplesortstate != NULL)
+ tuplesort_end(node->bs_tuplesortstate);
+ node->bs_tuplesortstate = NULL;
+}
+
+/* ----------------------------------------------------------------
+ * ExecBrinSortMarkPos
+ *
+ * Note: we assume that no caller attempts to set a mark before having read
+ * at least one tuple. Otherwise, iss_ScanDesc might still be NULL.
+ * ----------------------------------------------------------------
+ */
+void
+ExecBrinSortMarkPos(BrinSortState *node)
+{
+ EState *estate = node->ss.ps.state;
+ EPQState *epqstate = estate->es_epq_active;
+
+ if (epqstate != NULL)
+ {
+ /*
+ * We are inside an EvalPlanQual recheck. If a test tuple exists for
+ * this relation, then we shouldn't access the index at all. We would
+ * instead need to save, and later restore, the state of the
+ * relsubs_done flag, so that re-fetching the test tuple is possible.
+ * However, given the assumption that no caller sets a mark at the
+ * start of the scan, we can only get here with relsubs_done[i]
+ * already set, and so no state need be saved.
+ */
+ Index scanrelid = ((Scan *) node->ss.ps.plan)->scanrelid;
+
+ Assert(scanrelid > 0);
+ if (epqstate->relsubs_slot[scanrelid - 1] != NULL ||
+ epqstate->relsubs_rowmark[scanrelid - 1] != NULL)
+ {
+ /* Verify the claim above */
+ if (!epqstate->relsubs_done[scanrelid - 1])
+ elog(ERROR, "unexpected ExecBrinSortMarkPos call in EPQ recheck");
+ return;
+ }
+ }
+
+ index_markpos(node->iss_ScanDesc);
+}
+
+/* ----------------------------------------------------------------
+ * ExecBrinSortRestrPos
+ * ----------------------------------------------------------------
+ */
+void
+ExecBrinSortRestrPos(BrinSortState *node)
+{
+ EState *estate = node->ss.ps.state;
+ EPQState *epqstate = estate->es_epq_active;
+
+ if (estate->es_epq_active != NULL)
+ {
+ /* See comments in ExecBrinSortMarkPos */
+ Index scanrelid = ((Scan *) node->ss.ps.plan)->scanrelid;
+
+ Assert(scanrelid > 0);
+ if (epqstate->relsubs_slot[scanrelid - 1] != NULL ||
+ epqstate->relsubs_rowmark[scanrelid - 1] != NULL)
+ {
+ /* Verify the claim above */
+ if (!epqstate->relsubs_done[scanrelid - 1])
+ elog(ERROR, "unexpected ExecBrinSortRestrPos call in EPQ recheck");
+ return;
+ }
+ }
+
+ index_restrpos(node->iss_ScanDesc);
+}
+
+/*
+ * somewhat crippled version of bringetbitmap
+ *
+ * XXX We don't call consistent function (or any other function), so unlike
+ * bringetbitmap we don't set a separate memory context. If we end up filtering
+ * the ranges somehow (e.g. by WHERE conditions), this might be necessary.
+ *
+ * XXX Should be part of opclass, to somewhere in brin_minmax.c etc.
+ */
+static void
+ExecInitBrinSortRanges(BrinSort *node, BrinSortState *planstate)
+{
+ IndexScanDesc scan = planstate->iss_ScanDesc;
+ Relation indexRel = planstate->iss_RelationDesc;
+ int attno;
+ FmgrInfo *rangeproc;
+ BrinRangeScanDesc *brscan;
+ bool asc;
+
+ /* BRIN Sort only allows ORDER BY using a single column */
+ Assert(node->numCols == 1);
+
+ /*
+ * Determine index attnum we're interested in. The sortColIdx has attnums
+ * from the table, but we need index attnum so that we can fetch the right
+ * range summary.
+ *
+ * XXX Maybe we could/should arrange the tlists differently, so that this
+ * is not necessary?
+ *
+ * FIXME This is broken, node->sortColIdx[0] is an index into the target
+ * list, not table attnum.
+ *
+ * FIXME Also the projection is broken.
+ */
+ attno = 0;
+ for (int i = 0; i < indexRel->rd_index->indnatts; i++)
+ {
+ if (indexRel->rd_index->indkey.values[i] == node->sortColIdx[0])
+ {
+ attno = (i + 1);
+ break;
+ }
+ }
+
+ /* make sure we matched the argument */
+ Assert(attno > 0);
+
+ /* get procedure to generate sort ranges */
+ rangeproc = index_getprocinfo(indexRel, attno, BRIN_PROCNUM_RANGES);
+
+ /*
+ * Should not get here without a proc, thanks to the check before
+ * building the BrinSort path.
+ */
+ Assert(rangeproc != NULL);
+
+ memset(&planstate->bs_sortsupport, 0, sizeof(SortSupportData));
+ PrepareSortSupportFromOrderingOp(node->sortOperators[0], &planstate->bs_sortsupport);
+
+ /*
+ * Determine if this is an ASC or DESC sort, so that we can request the
+ * ranges in the appropriate order (ordered either by minval for
+ * ASC, or by maxval for DESC).
+ */
+ asc = ScanDirectionIsForward(node->indexorderdir);
+
+ /*
+ * Ask the opclass to produce ranges in appropriate ordering.
+ *
+ * XXX Pass info about ASC/DESC, NULLS FIRST/LAST.
+ */
+ brscan = (BrinRangeScanDesc *) DatumGetPointer(FunctionCall3Coll(rangeproc,
+ InvalidOid, /* FIXME use proper collation */
+ PointerGetDatum(scan),
+ Int16GetDatum(attno),
+ BoolGetDatum(asc)));
+
+ /* remember the descriptor with ranges in the requested ordering */
+ planstate->bs_scan = brscan;
+}
+
+/* ----------------------------------------------------------------
+ * ExecInitBrinSort
+ *
+ * Initializes the index scan's state information, creates
+ * scan keys, and opens the base and index relations.
+ *
+ * Note: index scans have 2 sets of state information because
+ * we have to keep track of the base relation and the
+ * index relation.
+ * ----------------------------------------------------------------
+ */
+BrinSortState *
+ExecInitBrinSort(BrinSort *node, EState *estate, int eflags)
+{
+ BrinSortState *indexstate;
+ Relation currentRelation;
+ LOCKMODE lockmode;
+
+ /*
+ * create state structure
+ */
+ indexstate = makeNode(BrinSortState);
+ indexstate->ss.ps.plan = (Plan *) node;
+ indexstate->ss.ps.state = estate;
+ indexstate->ss.ps.ExecProcNode = ExecBrinSort;
+
+ /*
+ * Miscellaneous initialization
+ *
+ * create expression context for node
+ */
+ ExecAssignExprContext(estate, &indexstate->ss.ps);
+
+ /*
+ * open the scan relation
+ */
+ currentRelation = ExecOpenScanRelation(estate, node->scan.scanrelid, eflags);
+
+ indexstate->ss.ss_currentRelation = currentRelation;
+ indexstate->ss.ss_currentScanDesc = NULL; /* no heap scan here */
+
+ /*
+ * get the scan type from the relation descriptor.
+ */
+ ExecInitScanTupleSlot(estate, &indexstate->ss,
+ RelationGetDescr(currentRelation),
+ table_slot_callbacks(currentRelation));
+
+ /*
+ * Initialize result type and projection.
+ */
+ ExecInitResultTypeTL(&indexstate->ss.ps);
+ ExecAssignScanProjectionInfo(&indexstate->ss);
+
+ /*
+ * initialize child expressions
+ *
+ * Note: we don't initialize all of the indexqual expression, only the
+ * sub-parts corresponding to runtime keys (see below). Likewise for
+ * indexorderby, if any. But the indexqualorig expression is always
+ * initialized even though it will only be used in some uncommon cases ---
+ * would be nice to improve that. (Problem is that any SubPlans present
+ * in the expression must be found now...)
+ */
+ indexstate->ss.ps.qual =
+ ExecInitQual(node->scan.plan.qual, (PlanState *) indexstate);
+ indexstate->indexqualorig =
+ ExecInitQual(node->indexqualorig, (PlanState *) indexstate);
+
+ /*
+ * If we are just doing EXPLAIN (ie, aren't going to run the plan), stop
+ * here. This allows an index-advisor plugin to EXPLAIN a plan containing
+ * references to nonexistent indexes.
+ */
+ if (eflags & EXEC_FLAG_EXPLAIN_ONLY)
+ return indexstate;
+
+ /* Open the index relation. */
+ lockmode = exec_rt_fetch(node->scan.scanrelid, estate)->rellockmode;
+ indexstate->iss_RelationDesc = index_open(node->indexid, lockmode);
+
+ /*
+ * Initialize index-specific scan state
+ */
+ indexstate->iss_RuntimeKeysReady = false;
+ indexstate->iss_RuntimeKeys = NULL;
+ indexstate->iss_NumRuntimeKeys = 0;
+
+ /*
+ * build the index scan keys from the index qualification
+ */
+ ExecIndexBuildScanKeys((PlanState *) indexstate,
+ indexstate->iss_RelationDesc,
+ node->indexqual,
+ false,
+ &indexstate->iss_ScanKeys,
+ &indexstate->iss_NumScanKeys,
+ &indexstate->iss_RuntimeKeys,
+ &indexstate->iss_NumRuntimeKeys,
+ NULL, /* no ArrayKeys */
+ NULL);
+
+ /*
+ * If we have runtime keys, we need an ExprContext to evaluate them. The
+ * node's standard context won't do because we want to reset that context
+ * for every tuple. So, build another context just like the other one...
+ * -tgl 7/11/00
+ */
+ if (indexstate->iss_NumRuntimeKeys != 0)
+ {
+ ExprContext *stdecontext = indexstate->ss.ps.ps_ExprContext;
+
+ ExecAssignExprContext(estate, &indexstate->ss.ps);
+ indexstate->iss_RuntimeContext = indexstate->ss.ps.ps_ExprContext;
+ indexstate->ss.ps.ps_ExprContext = stdecontext;
+ }
+ else
+ {
+ indexstate->iss_RuntimeContext = NULL;
+ }
+
+ indexstate->bs_tuplesortstate = NULL;
+ indexstate->bs_qual = indexstate->ss.ps.qual;
+ indexstate->ss.ps.qual = NULL;
+ ExecInitResultTupleSlotTL(&indexstate->ss.ps, &TTSOpsMinimalTuple);
+
+ /*
+ * all done.
+ */
+ return indexstate;
+}
+
+/* ----------------------------------------------------------------
+ * Parallel Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ * ExecBrinSortEstimate
+ *
+ * Compute the amount of space we'll need in the parallel
+ * query DSM, and inform pcxt->estimator about our needs.
+ * ----------------------------------------------------------------
+ */
+void
+ExecBrinSortEstimate(BrinSortState *node,
+ ParallelContext *pcxt)
+{
+ EState *estate = node->ss.ps.state;
+
+ node->iss_PscanLen = index_parallelscan_estimate(node->iss_RelationDesc,
+ estate->es_snapshot);
+ shm_toc_estimate_chunk(&pcxt->estimator, node->iss_PscanLen);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ * ExecBrinSortInitializeDSM
+ *
+ * Set up a parallel index scan descriptor.
+ * ----------------------------------------------------------------
+ */
+void
+ExecBrinSortInitializeDSM(BrinSortState *node,
+ ParallelContext *pcxt)
+{
+ EState *estate = node->ss.ps.state;
+ ParallelIndexScanDesc piscan;
+
+ piscan = shm_toc_allocate(pcxt->toc, node->iss_PscanLen);
+ index_parallelscan_initialize(node->ss.ss_currentRelation,
+ node->iss_RelationDesc,
+ estate->es_snapshot,
+ piscan);
+ shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, piscan);
+ node->iss_ScanDesc =
+ index_beginscan_parallel(node->ss.ss_currentRelation,
+ node->iss_RelationDesc,
+ node->iss_NumScanKeys,
+ node->iss_NumOrderByKeys,
+ piscan);
+
+ /*
+ * If no run-time keys to calculate or they are ready, go ahead and pass
+ * the scankeys to the index AM.
+ */
+ if (node->iss_NumRuntimeKeys == 0 || node->iss_RuntimeKeysReady)
+ index_rescan(node->iss_ScanDesc,
+ node->iss_ScanKeys, node->iss_NumScanKeys,
+ node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+}
+
+/* ----------------------------------------------------------------
+ * ExecBrinSortReInitializeDSM
+ *
+ * Reset shared state before beginning a fresh scan.
+ * ----------------------------------------------------------------
+ */
+void
+ExecBrinSortReInitializeDSM(BrinSortState *node,
+ ParallelContext *pcxt)
+{
+ index_parallelrescan(node->iss_ScanDesc);
+}
+
+/* ----------------------------------------------------------------
+ * ExecBrinSortInitializeWorker
+ *
+ * Copy relevant information from TOC into planstate.
+ * ----------------------------------------------------------------
+ */
+void
+ExecBrinSortInitializeWorker(BrinSortState *node,
+ ParallelWorkerContext *pwcxt)
+{
+ ParallelIndexScanDesc piscan;
+
+ piscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
+ node->iss_ScanDesc =
+ index_beginscan_parallel(node->ss.ss_currentRelation,
+ node->iss_RelationDesc,
+ node->iss_NumScanKeys,
+ node->iss_NumOrderByKeys,
+ piscan);
+
+ /*
+ * If no run-time keys to calculate or they are ready, go ahead and pass
+ * the scankeys to the index AM.
+ */
+ if (node->iss_NumRuntimeKeys == 0 || node->iss_RuntimeKeysReady)
+ index_rescan(node->iss_ScanDesc,
+ node->iss_ScanKeys, node->iss_NumScanKeys,
+ node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 4c6b1d1f55b..64d103b19e9 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -790,6 +790,260 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
path->path.total_cost = startup_cost + run_cost;
}
+void
+cost_brinsort(BrinSortPath *path, PlannerInfo *root, double loop_count,
+ bool partial_path)
+{
+ IndexOptInfo *index = path->ipath.indexinfo;
+ RelOptInfo *baserel = index->rel;
+ amcostestimate_function amcostestimate;
+ List *qpquals;
+ Cost startup_cost = 0;
+ Cost run_cost = 0;
+ Cost cpu_run_cost = 0;
+ Cost indexStartupCost;
+ Cost indexTotalCost;
+ Selectivity indexSelectivity;
+ double indexCorrelation,
+ csquared;
+ double spc_seq_page_cost,
+ spc_random_page_cost;
+ Cost min_IO_cost,
+ max_IO_cost;
+ QualCost qpqual_cost;
+ Cost cpu_per_tuple;
+ double tuples_fetched;
+ double pages_fetched;
+ double rand_heap_pages;
+ double index_pages;
+
+ /* Should only be applied to base relations */
+ Assert(IsA(baserel, RelOptInfo) &&
+ IsA(index, IndexOptInfo));
+ Assert(baserel->relid > 0);
+ Assert(baserel->rtekind == RTE_RELATION);
+
+ /*
+ * Mark the path with the correct row estimate, and identify which quals
+ * will need to be enforced as qpquals. We need not check any quals that
+ * are implied by the index's predicate, so we can use indrestrictinfo not
+ * baserestrictinfo as the list of relevant restriction clauses for the
+ * rel.
+ */
+ if (path->ipath.path.param_info)
+ {
+ path->ipath.path.rows = path->ipath.path.param_info->ppi_rows;
+ /* qpquals come from the rel's restriction clauses and ppi_clauses */
+ qpquals = list_concat(extract_nonindex_conditions(path->ipath.indexinfo->indrestrictinfo,
+ path->ipath.indexclauses),
+ extract_nonindex_conditions(path->ipath.path.param_info->ppi_clauses,
+ path->ipath.indexclauses));
+ }
+ else
+ {
+ path->ipath.path.rows = baserel->rows;
+ /* qpquals come from just the rel's restriction clauses */
+ qpquals = extract_nonindex_conditions(path->ipath.indexinfo->indrestrictinfo,
+ path->ipath.indexclauses);
+ }
+
+ if (!enable_indexscan)
+ startup_cost += disable_cost;
+ /* we don't need to check enable_indexonlyscan; indxpath.c does that */
+
+ /*
+ * Call index-access-method-specific code to estimate the processing cost
+ * for scanning the index, as well as the selectivity of the index (ie,
+ * the fraction of main-table tuples we will have to retrieve) and its
+ * correlation to the main-table tuple order. We need a cast here because
+ * pathnodes.h uses a weak function type to avoid including amapi.h.
+ */
+ amcostestimate = (amcostestimate_function) index->amcostestimate;
+ amcostestimate(root, &path->ipath, loop_count,
+ &indexStartupCost, &indexTotalCost,
+ &indexSelectivity, &indexCorrelation,
+ &index_pages);
+
+ /*
+ * Save amcostestimate's results for possible use in bitmap scan planning.
+ * We don't bother to save indexStartupCost or indexCorrelation, because a
+ * bitmap scan doesn't care about either.
+ */
+ path->ipath.indextotalcost = indexTotalCost;
+ path->ipath.indexselectivity = indexSelectivity;
+
+ /* all costs for touching index itself included here */
+ startup_cost += indexStartupCost;
+ run_cost += indexTotalCost - indexStartupCost;
+
+ /* estimate number of main-table tuples fetched */
+ tuples_fetched = clamp_row_est(indexSelectivity * baserel->tuples);
+
+ /* fetch estimated page costs for tablespace containing table */
+ get_tablespace_page_costs(baserel->reltablespace,
+ &spc_random_page_cost,
+ &spc_seq_page_cost);
+
+ /*----------
+ * Estimate number of main-table pages fetched, and compute I/O cost.
+ *
+ * When the index ordering is uncorrelated with the table ordering,
+ * we use an approximation proposed by Mackert and Lohman (see
+ * index_pages_fetched() for details) to compute the number of pages
+ * fetched, and then charge spc_random_page_cost per page fetched.
+ *
+ * When the index ordering is exactly correlated with the table ordering
+ * (just after a CLUSTER, for example), the number of pages fetched should
+ * be exactly selectivity * table_size. What's more, all but the first
+ * will be sequential fetches, not the random fetches that occur in the
+ * uncorrelated case. So if the number of pages is more than 1, we
+ * ought to charge
+ * spc_random_page_cost + (pages_fetched - 1) * spc_seq_page_cost
+ * For partially-correlated indexes, we ought to charge somewhere between
+ * these two estimates. We currently interpolate linearly between the
+ * estimates based on the correlation squared (XXX is that appropriate?).
+ *
+ * If it's an index-only scan, then we will not need to fetch any heap
+ * pages for which the visibility map shows all tuples are visible.
+ * Hence, reduce the estimated number of heap fetches accordingly.
+ * We use the measured fraction of the entire heap that is all-visible,
+ * which might not be particularly relevant to the subset of the heap
+ * that this query will fetch; but it's not clear how to do better.
+ *----------
+ */
+ if (loop_count > 1)
+ {
+ /*
+ * For repeated indexscans, the appropriate estimate for the
+ * uncorrelated case is to scale up the number of tuples fetched in
+ * the Mackert and Lohman formula by the number of scans, so that we
+ * estimate the number of pages fetched by all the scans; then
+ * pro-rate the costs for one scan. In this case we assume all the
+ * fetches are random accesses.
+ */
+ pages_fetched = index_pages_fetched(tuples_fetched * loop_count,
+ baserel->pages,
+ (double) index->pages,
+ root);
+
+ rand_heap_pages = pages_fetched;
+
+ max_IO_cost = (pages_fetched * spc_random_page_cost) / loop_count;
+
+ /*
+ * In the perfectly correlated case, the number of pages touched by
+ * each scan is selectivity * table_size, and we can use the Mackert
+ * and Lohman formula at the page level to estimate how much work is
+ * saved by caching across scans. We still assume all the fetches are
+ * random, though, which is an overestimate that's hard to correct for
+ * without double-counting the cache effects. (But in most cases
+ * where such a plan is actually interesting, only one page would get
+ * fetched per scan anyway, so it shouldn't matter much.)
+ */
+ pages_fetched = ceil(indexSelectivity * (double) baserel->pages);
+
+ pages_fetched = index_pages_fetched(pages_fetched * loop_count,
+ baserel->pages,
+ (double) index->pages,
+ root);
+
+ min_IO_cost = (pages_fetched * spc_random_page_cost) / loop_count;
+ }
+ else
+ {
+ /*
+ * Normal case: apply the Mackert and Lohman formula, and then
+ * interpolate between that and the correlation-derived result.
+ */
+ pages_fetched = index_pages_fetched(tuples_fetched,
+ baserel->pages,
+ (double) index->pages,
+ root);
+
+ rand_heap_pages = pages_fetched;
+
+ /* max_IO_cost is for the perfectly uncorrelated case (csquared=0) */
+ max_IO_cost = pages_fetched * spc_random_page_cost;
+
+ /* min_IO_cost is for the perfectly correlated case (csquared=1) */
+ pages_fetched = ceil(indexSelectivity * (double) baserel->pages);
+
+ if (pages_fetched > 0)
+ {
+ min_IO_cost = spc_random_page_cost;
+ if (pages_fetched > 1)
+ min_IO_cost += (pages_fetched - 1) * spc_seq_page_cost;
+ }
+ else
+ min_IO_cost = 0;
+ }
+
+ if (partial_path)
+ {
+ /*
+ * Estimate the number of parallel workers required to scan index. Use
+ * the number of heap pages computed considering heap fetches won't be
+ * sequential as for parallel scans the pages are accessed in random
+ * order.
+ */
+ path->ipath.path.parallel_workers = compute_parallel_worker(baserel,
+ rand_heap_pages,
+ index_pages,
+ max_parallel_workers_per_gather);
+
+ /*
+ * Fall out if workers can't be assigned for parallel scan, because in
+ * such a case this path will be rejected. So there is no benefit in
+ * doing extra computation.
+ */
+ if (path->ipath.path.parallel_workers <= 0)
+ return;
+
+ path->ipath.path.parallel_aware = true;
+ }
+
+ /*
+ * Now interpolate based on estimated index order correlation to get total
+ * disk I/O cost for main table accesses.
+ */
+ csquared = indexCorrelation * indexCorrelation;
+
+ run_cost += max_IO_cost + csquared * (min_IO_cost - max_IO_cost);
+
+ /*
+ * Estimate CPU costs per tuple.
+ *
+ * What we want here is cpu_tuple_cost plus the evaluation costs of any
+ * qual clauses that we have to evaluate as qpquals.
+ */
+ cost_qual_eval(&qpqual_cost, qpquals, root);
+
+ startup_cost += qpqual_cost.startup;
+ cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
+
+ cpu_run_cost += cpu_per_tuple * tuples_fetched;
+
+ /* tlist eval costs are paid per output row, not per tuple scanned */
+ startup_cost += path->ipath.path.pathtarget->cost.startup;
+ cpu_run_cost += path->ipath.path.pathtarget->cost.per_tuple * path->ipath.path.rows;
+
+ /* Adjust costing for parallelism, if used. */
+ if (path->ipath.path.parallel_workers > 0)
+ {
+ double parallel_divisor = get_parallel_divisor(&path->ipath.path);
+
+ path->ipath.path.rows = clamp_row_est(path->ipath.path.rows / parallel_divisor);
+
+ /* The CPU cost is divided among all the workers. */
+ cpu_run_cost /= parallel_divisor;
+ }
+
+ run_cost += cpu_run_cost;
+
+ path->ipath.path.startup_cost = startup_cost;
+ path->ipath.path.total_cost = startup_cost + run_cost;
+}
+
/*
* extract_nonindex_conditions
*
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index c31fcc917df..18b625460eb 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -17,12 +17,16 @@
#include <math.h>
+#include "access/brin_internal.h"
+#include "access/relation.h"
#include "access/stratnum.h"
#include "access/sysattr.h"
#include "catalog/pg_am.h"
#include "catalog/pg_operator.h"
+#include "catalog/pg_opclass.h"
#include "catalog/pg_opfamily.h"
#include "catalog/pg_type.h"
+#include "miscadmin.h"
#include "nodes/makefuncs.h"
#include "nodes/nodeFuncs.h"
#include "nodes/supportnodes.h"
@@ -32,10 +36,13 @@
#include "optimizer/paths.h"
#include "optimizer/prep.h"
#include "optimizer/restrictinfo.h"
+#include "utils/rel.h"
#include "utils/lsyscache.h"
#include "utils/selfuncs.h"
+bool enable_brinsort = true;
+
/* XXX see PartCollMatchesExprColl */
#define IndexCollMatchesExprColl(idxcollation, exprcollation) \
((idxcollation) == InvalidOid || (idxcollation) == (exprcollation))
@@ -1127,6 +1134,185 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
}
}
+ /*
+ * If this is a BRIN index with suitable opclass (minmax or such), we may
+ * try doing BRIN sort. BRIN indexes are not ordered and amcanorderbyop
+ * is set to false, so we probably will need some new opclass flag to
+ * mark indexes that support this.
+ */
+ if (enable_brinsort && pathkeys_possibly_useful)
+ {
+ ListCell *lc;
+ Relation rel2 = relation_open(index->indexoid, NoLock);
+ int idx;
+
+ /*
+ * Try generating sorted paths for each key with the right opclass.
+ */
+ idx = -1;
+ foreach(lc, index->indextlist)
+ {
+ TargetEntry *indextle = (TargetEntry *) lfirst(lc);
+ BrinSortPath *bpath;
+ Oid rangeproc;
+ AttrNumber attnum;
+
+ idx++;
+ attnum = (idx + 1);
+
+ /* skip expressions for now */
+ if (!AttributeNumberIsValid(index->indexkeys[idx]))
+ continue;
+
+ /* XXX ignore non-BRIN indexes */
+ if (rel2->rd_rel->relam != BRIN_AM_OID)
+ continue;
+
+ /*
+ * XXX Ignore keys not using an opclass with the "ranges" proc.
+ * For now we only do this for some minmax opclasses, but adding
+ * it to all minmax is simple, and adding it to minmax-multi
+ * should not be very hard.
+ */
+ rangeproc = index_getprocid(rel2, attnum, BRIN_PROCNUM_RANGES);
+ if (!OidIsValid(rangeproc))
+ continue;
+
+ /*
+ * XXX stuff extracted from build_index_pathkeys, except that we
+ * only deal with a single index key (producing a single pathkey),
+ * so we only sort on a single column. I guess we could use more
+ * index keys and sort on more expressions? Would that mean these
+ * keys need to be rather well correlated? In any case, it seems
+ * rather complex to implement, so I leave it as a possible
+ * future improvement.
+ *
+ * XXX This could also use the other BRIN keys (even from other
+ * indexes) in a different way - we might use the other ranges
+ * to quickly eliminate some of the chunks, essentially like a
+ * bitmap, but maybe without using the bitmap. Or we might use
+ * other indexes through bitmaps.
+ *
+ * XXX This fakes a number of parameters, because we don't store
+ * the btree opclass in the index, instead we use the default
+ * one for the key data type. And BRIN does not allow specifying one.
+ *
+ * XXX We don't add the path to result, because this function is
+ * supposed to generate IndexPaths. Instead, we just add the path
+ * using add_path(). We should be building this in a different
+ * place, perhaps in create_index_paths() or so.
+ *
+ * XXX By building it elsewhere, we could also leverage the index
+ * paths we've built here, particularly the bitmap index paths,
+ * which we could use to eliminate many of the ranges.
+ *
+ * XXX We don't have any explicit ordering associated with the
+ * BRIN index, e.g. we don't have ASC/DESC and NULLS FIRST/LAST.
+ * So this is not encoded in the index, and we can satisfy all
+ * these cases - but we need to add paths for each combination.
+ * I wonder if there's a better way to do this.
+ */
+
+ /* ASC NULLS LAST */
+ index_pathkeys = build_index_pathkeys_brin(root, index, indextle,
+ idx,
+ false, /* reverse_sort */
+ false); /* nulls_first */
+
+ useful_pathkeys = truncate_useless_pathkeys(root, rel,
+ index_pathkeys);
+
+ if (useful_pathkeys != NIL)
+ {
+ bpath = create_brinsort_path(root, index,
+ index_clauses,
+ useful_pathkeys,
+ ForwardScanDirection,
+ index_only_scan,
+ outer_relids,
+ loop_count,
+ false);
+
+ /* cheat and add it anyway */
+ add_path(rel, (Path *) bpath);
+ }
+
+ /* DESC NULLS LAST */
+ index_pathkeys = build_index_pathkeys_brin(root, index, indextle,
+ idx,
+ true, /* reverse_sort */
+ false); /* nulls_first */
+
+ useful_pathkeys = truncate_useless_pathkeys(root, rel,
+ index_pathkeys);
+
+ if (useful_pathkeys != NIL)
+ {
+ bpath = create_brinsort_path(root, index,
+ index_clauses,
+ useful_pathkeys,
+ BackwardScanDirection,
+ index_only_scan,
+ outer_relids,
+ loop_count,
+ false);
+
+ /* cheat and add it anyway */
+ add_path(rel, (Path *) bpath);
+ }
+
+ /* ASC NULLS FIRST */
+ index_pathkeys = build_index_pathkeys_brin(root, index, indextle,
+ idx,
+ false, /* reverse_sort */
+ true); /* nulls_first */
+
+ useful_pathkeys = truncate_useless_pathkeys(root, rel,
+ index_pathkeys);
+
+ if (useful_pathkeys != NIL)
+ {
+ bpath = create_brinsort_path(root, index,
+ index_clauses,
+ useful_pathkeys,
+ ForwardScanDirection,
+ index_only_scan,
+ outer_relids,
+ loop_count,
+ false);
+
+ /* cheat and add it anyway */
+ add_path(rel, (Path *) bpath);
+ }
+
+ /* DESC NULLS FIRST */
+ index_pathkeys = build_index_pathkeys_brin(root, index, indextle,
+ idx,
+ true, /* reverse_sort */
+ true); /* nulls_first */
+
+ useful_pathkeys = truncate_useless_pathkeys(root, rel,
+ index_pathkeys);
+
+ if (useful_pathkeys != NIL)
+ {
+ bpath = create_brinsort_path(root, index,
+ index_clauses,
+ useful_pathkeys,
+ BackwardScanDirection,
+ index_only_scan,
+ outer_relids,
+ loop_count,
+ false);
+
+ /* cheat and add it anyway */
+ add_path(rel, (Path *) bpath);
+ }
+ }
+
+ relation_close(rel2, NoLock);
+ }
+
return result;
}
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index a9943cd6e01..83dde6f22eb 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -27,6 +27,7 @@
#include "optimizer/paths.h"
#include "partitioning/partbounds.h"
#include "utils/lsyscache.h"
+#include "utils/typcache.h"
static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
@@ -630,6 +631,55 @@ build_index_pathkeys(PlannerInfo *root,
return retval;
}
+
+List *
+build_index_pathkeys_brin(PlannerInfo *root,
+ IndexOptInfo *index,
+ TargetEntry *tle,
+ int idx,
+ bool reverse_sort,
+ bool nulls_first)
+{
+ TypeCacheEntry *typcache;
+ PathKey *cpathkey;
+ Oid sortopfamily;
+
+ /*
+ * Get default btree opfamily for the type, extracted from the
+ * entry in index targetlist.
+ *
+ * XXX Is there a better / more correct way to do this?
+ */
+ typcache = lookup_type_cache(exprType((Node *) tle->expr),
+ TYPECACHE_BTREE_OPFAMILY);
+ sortopfamily = typcache->btree_opf;
+
+ /*
+ * OK, try to make a canonical pathkey for this sort key. Note we're
+ * underneath any outer joins, so nullable_relids should be NULL.
+ */
+ cpathkey = make_pathkey_from_sortinfo(root,
+ tle->expr,
+ NULL,
+ sortopfamily,
+ index->opcintype[idx],
+ index->indexcollations[idx],
+ reverse_sort,
+ nulls_first,
+ 0,
+ index->rel->relids,
+ false);
+
+ /*
+ * There may be no pathkey if we haven't matched any sortkey, in which
+ * case ignore it.
+ */
+ if (!cpathkey)
+ return NIL;
+
+ return list_make1(cpathkey);
+}
+
/*
* partkey_is_bool_constant_for_query
*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index ac86ce90033..395c632f430 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -124,6 +124,8 @@ static SampleScan *create_samplescan_plan(PlannerInfo *root, Path *best_path,
List *tlist, List *scan_clauses);
static Scan *create_indexscan_plan(PlannerInfo *root, IndexPath *best_path,
List *tlist, List *scan_clauses, bool indexonly);
+static BrinSort *create_brinsort_plan(PlannerInfo *root, BrinSortPath *best_path,
+ List *tlist, List *scan_clauses);
static BitmapHeapScan *create_bitmap_scan_plan(PlannerInfo *root,
BitmapHeapPath *best_path,
List *tlist, List *scan_clauses);
@@ -191,6 +193,9 @@ static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
List *indexorderby,
List *indextlist,
ScanDirection indexscandir);
+static BrinSort *make_brinsort(List *qptlist, List *qpqual, Index scanrelid,
+ Oid indexid, List *indexqual, List *indexqualorig,
+ ScanDirection indexscandir);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -410,6 +415,9 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
case T_CustomScan:
plan = create_scan_plan(root, best_path, flags);
break;
+ case T_BrinSort:
+ plan = create_scan_plan(root, best_path, flags);
+ break;
case T_HashJoin:
case T_MergeJoin:
case T_NestLoop:
@@ -776,6 +784,13 @@ create_scan_plan(PlannerInfo *root, Path *best_path, int flags)
scan_clauses);
break;
+ case T_BrinSort:
+ plan = (Plan *) create_brinsort_plan(root,
+ (BrinSortPath *) best_path,
+ tlist,
+ scan_clauses);
+ break;
+
default:
elog(ERROR, "unrecognized node type: %d",
(int) best_path->pathtype);
@@ -3180,6 +3195,154 @@ create_indexscan_plan(PlannerInfo *root,
return scan_plan;
}
+/*
+ * create_brinsort_plan
+ * Returns a brinsort plan for the base relation scanned by 'best_path'
+ * with restriction clauses 'scan_clauses' and targetlist 'tlist'.
+ *
+ * This is mostly a slightly simplified version of create_indexscan_plan, with
+ * the unnecessary parts removed (we don't support index-only scans, or reordering
+ * and similar stuff).
+ */
+static BrinSort *
+create_brinsort_plan(PlannerInfo *root,
+ BrinSortPath *best_path,
+ List *tlist,
+ List *scan_clauses)
+{
+ BrinSort *brinsort_plan;
+ List *indexclauses = best_path->ipath.indexclauses;
+ Index baserelid = best_path->ipath.path.parent->relid;
+ IndexOptInfo *indexinfo = best_path->ipath.indexinfo;
+ Oid indexoid = indexinfo->indexoid;
+ List *qpqual;
+ List *stripped_indexquals;
+ List *fixed_indexquals;
+ ListCell *l;
+
+ List *pathkeys = best_path->ipath.path.pathkeys;
+
+ /* it should be a base rel... */
+ Assert(baserelid > 0);
+ Assert(best_path->ipath.path.parent->rtekind == RTE_RELATION);
+
+ /*
+ * Extract the index qual expressions (stripped of RestrictInfos) from the
+ * IndexClauses list, and prepare a copy with index Vars substituted for
+ * table Vars. (This step also does replace_nestloop_params on the
+ * fixed_indexquals.)
+ */
+ fix_indexqual_references(root, &best_path->ipath,
+ &stripped_indexquals,
+ &fixed_indexquals);
+
+ /*
+ * The qpqual list must contain all restrictions not automatically handled
+ * by the index, other than pseudoconstant clauses which will be handled
+ * by a separate gating plan node. All the predicates in the indexquals
+ * will be checked (either by the index itself, or by nodeBrinSort.c),
+ * but if there are any "special" operators involved then they must be
+ * included in qpqual. The upshot is that qpqual must contain
+ * scan_clauses minus whatever appears in indexquals.
+ *
+ * is_redundant_with_indexclauses() detects cases where a scan clause is
+ * present in the indexclauses list or is generated from the same
+ * EquivalenceClass as some indexclause, and is therefore redundant with
+ * it, though not equal. (The latter happens when indxpath.c prefers a
+ * different derived equality than what generate_join_implied_equalities
+ * picked for a parameterized scan's ppi_clauses.) Note that it will not
+ * match to lossy index clauses, which is critical because we have to
+ * include the original clause in qpqual in that case.
+ *
+ * In some situations (particularly with OR'd index conditions) we may
+ * have scan_clauses that are not equal to, but are logically implied by,
+ * the index quals; so we also try a predicate_implied_by() check to see
+ * if we can discard quals that way. (predicate_implied_by assumes its
+ * first input contains only immutable functions, so we have to check
+ * that.)
+ *
+ * Note: if you change this bit of code you should also look at
+ * extract_nonindex_conditions() in costsize.c.
+ */
+ qpqual = NIL;
+ foreach(l, scan_clauses)
+ {
+ RestrictInfo *rinfo = lfirst_node(RestrictInfo, l);
+
+ if (rinfo->pseudoconstant)
+ continue; /* we may drop pseudoconstants here */
+ if (is_redundant_with_indexclauses(rinfo, indexclauses))
+ continue; /* dup or derived from same EquivalenceClass */
+ if (!contain_mutable_functions((Node *) rinfo->clause) &&
+ predicate_implied_by(list_make1(rinfo->clause), stripped_indexquals,
+ false))
+ continue; /* provably implied by indexquals */
+ qpqual = lappend(qpqual, rinfo);
+ }
+
+ /* Sort clauses into best execution order */
+ qpqual = order_qual_clauses(root, qpqual);
+
+ /* Reduce RestrictInfo list to bare expressions; ignore pseudoconstants */
+ qpqual = extract_actual_clauses(qpqual, false);
+
+ /*
+ * We have to replace any outer-relation variables with nestloop params in
+ * the indexqualorig, qpqual, and indexorderbyorig expressions. A bit
+ * annoying to have to do this separately from the processing in
+ * fix_indexqual_references --- rethink this when generalizing the inner
+ * indexscan support. But note we can't really do this earlier because
+ * it'd break the comparisons to predicates above ... (or would it? Those
+ * wouldn't have outer refs)
+ */
+ if (best_path->ipath.path.param_info)
+ {
+ stripped_indexquals = (List *)
+ replace_nestloop_params(root, (Node *) stripped_indexquals);
+ qpqual = (List *)
+ replace_nestloop_params(root, (Node *) qpqual);
+ }
+
+ /* Finally ready to build the plan node */
+ brinsort_plan = make_brinsort(tlist,
+ qpqual,
+ baserelid,
+ indexoid,
+ fixed_indexquals,
+ stripped_indexquals,
+ best_path->ipath.indexscandir);
+
+ if (pathkeys != NIL)
+ {
+ /*
+ * Compute sort column info, and adjust the plan's tlist as needed.
+ * Because we pass adjust_tlist_in_place = true, we may ignore the
+ * function result; it must be the same plan node. However, we then
+ * need to detect whether any tlist entries were added.
+ */
+ (void) prepare_sort_from_pathkeys((Plan *) brinsort_plan, pathkeys,
+ best_path->ipath.path.parent->relids,
+ NULL,
+ true,
+ &brinsort_plan->numCols,
+ &brinsort_plan->sortColIdx,
+ &brinsort_plan->sortOperators,
+ &brinsort_plan->collations,
+ &brinsort_plan->nullsFirst);
+ //tlist_was_changed = (orig_tlist_length != list_length(plan->plan.targetlist));
+ for (int i = 0; i < brinsort_plan->numCols; i++)
+ elog(DEBUG1, "%d => %d %d %d %d", i,
+ brinsort_plan->sortColIdx[i],
+ brinsort_plan->sortOperators[i],
+ brinsort_plan->collations[i],
+ brinsort_plan->nullsFirst[i]);
+ }
+
+ copy_generic_path_info(&brinsort_plan->scan.plan, &best_path->ipath.path);
+
+ return brinsort_plan;
+}
+
/*
* create_bitmap_scan_plan
* Returns a bitmap scan plan for the base relation scanned by 'best_path'
@@ -5523,6 +5686,31 @@ make_indexscan(List *qptlist,
return node;
}
+static BrinSort *
+make_brinsort(List *qptlist,
+ List *qpqual,
+ Index scanrelid,
+ Oid indexid,
+ List *indexqual,
+ List *indexqualorig,
+ ScanDirection indexscandir)
+{
+ BrinSort *node = makeNode(BrinSort);
+ Plan *plan = &node->scan.plan;
+
+ plan->targetlist = qptlist;
+ plan->qual = qpqual;
+ plan->lefttree = NULL;
+ plan->righttree = NULL;
+ node->scan.scanrelid = scanrelid;
+ node->indexid = indexid;
+ node->indexqual = indexqual;
+ node->indexqualorig = indexqualorig;
+ node->indexorderdir = indexscandir;
+
+ return node;
+}
+
static IndexOnlyScan *
make_indexonlyscan(List *qptlist,
List *qpqual,
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 1cb0abdbc1f..2584a1f032d 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -609,6 +609,25 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
return set_indexonlyscan_references(root, splan, rtoffset);
}
break;
+ case T_BrinSort:
+ {
+ BrinSort *splan = (BrinSort *) plan;
+
+ splan->scan.scanrelid += rtoffset;
+ splan->scan.plan.targetlist =
+ fix_scan_list(root, splan->scan.plan.targetlist,
+ rtoffset, NUM_EXEC_TLIST(plan));
+ splan->scan.plan.qual =
+ fix_scan_list(root, splan->scan.plan.qual,
+ rtoffset, NUM_EXEC_QUAL(plan));
+ splan->indexqual =
+ fix_scan_list(root, splan->indexqual,
+ rtoffset, 1);
+ splan->indexqualorig =
+ fix_scan_list(root, splan->indexqualorig,
+ rtoffset, NUM_EXEC_QUAL(plan));
+ }
+ break;
case T_BitmapIndexScan:
{
BitmapIndexScan *splan = (BitmapIndexScan *) plan;
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 70f61ae7b1c..6471bbb5de8 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1030,6 +1030,63 @@ create_index_path(PlannerInfo *root,
return pathnode;
}
+
+/*
+ * create_brinsort_path
+ * Creates a path node for a BRIN Sort scan, i.e. a scan of a BRIN index
+ * producing sorted output.
+ *
+ * 'index' is a usable index.
+ * 'indexclauses' is a list of IndexClause nodes representing clauses
+ * to be enforced as qual conditions in the scan.
+ * 'pathkeys' describes the ordering of the path.
+ * 'indexscandir' is ForwardScanDirection or BackwardScanDirection
+ * for an ordered index, or NoMovementScanDirection for
+ * an unordered index.
+ * 'indexonly' is true if an index-only scan is wanted.
+ * 'required_outer' is the set of outer relids for a parameterized path.
+ * 'loop_count' is the number of repetitions of the indexscan to factor into
+ * estimates of caching behavior.
+ * 'partial_path' is true if constructing a parallel index scan path.
+ *
+ * Returns the new path node.
+ */
+BrinSortPath *
+create_brinsort_path(PlannerInfo *root,
+ IndexOptInfo *index,
+ List *indexclauses,
+ List *pathkeys,
+ ScanDirection indexscandir,
+ bool indexonly,
+ Relids required_outer,
+ double loop_count,
+ bool partial_path)
+{
+ BrinSortPath *pathnode = makeNode(BrinSortPath);
+ RelOptInfo *rel = index->rel;
+
+ pathnode->ipath.path.pathtype = T_BrinSort;
+ pathnode->ipath.path.parent = rel;
+ pathnode->ipath.path.pathtarget = rel->reltarget;
+ pathnode->ipath.path.param_info = get_baserel_parampathinfo(root, rel,
+ required_outer);
+ pathnode->ipath.path.parallel_aware = false;
+ pathnode->ipath.path.parallel_safe = rel->consider_parallel;
+ pathnode->ipath.path.parallel_workers = 0;
+ pathnode->ipath.path.pathkeys = pathkeys;
+
+ pathnode->ipath.indexinfo = index;
+ pathnode->ipath.indexclauses = indexclauses;
+ pathnode->ipath.indexscandir = indexscandir;
+
+ cost_brinsort(pathnode, root, loop_count, partial_path);
+
+ return pathnode;
+}
+
/*
* create_bitmap_heap_path
* Creates a path node for a bitmap scan.
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 06dfeb6cd8b..a5ca3bd0cc4 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -977,6 +977,16 @@ struct config_bool ConfigureNamesBool[] =
false,
NULL, NULL, NULL
},
+ {
+ {"enable_brinsort", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of BRIN sort plans."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_brinsort,
+ false,
+ NULL, NULL, NULL
+ },
{
{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
gettext_noop("Enables genetic query optimization."),
diff --git a/src/include/access/brin.h b/src/include/access/brin.h
index a7cccac9c90..be05586ec57 100644
--- a/src/include/access/brin.h
+++ b/src/include/access/brin.h
@@ -34,41 +34,6 @@ typedef struct BrinStatsData
BlockNumber revmapNumPages;
} BrinStatsData;
-/*
- * Info about ranges for BRIN Sort.
- */
-typedef struct BrinRange
-{
- BlockNumber blkno_start;
- BlockNumber blkno_end;
-
- Datum min_value;
- Datum max_value;
- bool has_nulls;
- bool all_nulls;
- bool not_summarized;
-
- /*
- * Index of the range when ordered by min_value (if there are multiple
- * ranges with the same min_value, it's the lowest one).
- */
- uint32 min_index;
-
- /*
- * Minimum min_index from all ranges with higher max_value (i.e. when
- * sorted by max_value). If there are multiple ranges with the same
- * max_value, it depends on the ordering (i.e. the ranges may get
- * different min_index_lowest, depending on the exact ordering).
- */
- uint32 min_index_lowest;
-} BrinRange;
-
-typedef struct BrinRanges
-{
- int nranges;
- BrinRange ranges[FLEXIBLE_ARRAY_MEMBER];
-} BrinRanges;
-
typedef struct BrinMinmaxStats
{
int32 vl_len_; /* varlena header (do not touch directly!) */
diff --git a/src/include/access/brin_internal.h b/src/include/access/brin_internal.h
index ee6c6f9b709..fcdd4cafda8 100644
--- a/src/include/access/brin_internal.h
+++ b/src/include/access/brin_internal.h
@@ -74,6 +74,7 @@ typedef struct BrinDesc
#define BRIN_MANDATORY_NPROCS 4
#define BRIN_PROCNUM_OPTIONS 5 /* optional */
#define BRIN_PROCNUM_STATISTICS 6 /* optional */
+#define BRIN_PROCNUM_RANGES 7 /* optional */
/* procedure numbers up to 10 are reserved for BRIN future expansion */
#define BRIN_FIRST_OPTIONAL_PROCNUM 11
#define BRIN_LAST_OPTIONAL_PROCNUM 15
diff --git a/src/include/catalog/pg_amproc.dat b/src/include/catalog/pg_amproc.dat
index ea3de9bcba1..562f481af18 100644
--- a/src/include/catalog/pg_amproc.dat
+++ b/src/include/catalog/pg_amproc.dat
@@ -806,6 +806,8 @@
amprocrighttype => 'bytea', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/bytea_minmax_ops', amproclefttype => 'bytea',
amprocrighttype => 'bytea', amprocnum => '6', amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/bytea_minmax_ops', amproclefttype => 'bytea',
+ amprocrighttype => 'bytea', amprocnum => '7', amproc => 'brin_minmax_ranges' },
# bloom bytea
{ amprocfamily => 'brin/bytea_bloom_ops', amproclefttype => 'bytea',
@@ -839,6 +841,8 @@
amprocrighttype => 'char', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/char_minmax_ops', amproclefttype => 'char',
amprocrighttype => 'char', amprocnum => '6', amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/char_minmax_ops', amproclefttype => 'char',
+ amprocrighttype => 'char', amprocnum => '7', amproc => 'brin_minmax_ranges' },
# bloom "char"
{ amprocfamily => 'brin/char_bloom_ops', amproclefttype => 'char',
@@ -870,6 +874,8 @@
amprocrighttype => 'name', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/name_minmax_ops', amproclefttype => 'name',
amprocrighttype => 'name', amprocnum => '6', amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/name_minmax_ops', amproclefttype => 'name',
+ amprocrighttype => 'name', amprocnum => '7', amproc => 'brin_minmax_ranges' },
# bloom name
{ amprocfamily => 'brin/name_bloom_ops', amproclefttype => 'name',
@@ -901,6 +907,8 @@
amprocrighttype => 'int8', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int8',
amprocrighttype => 'int8', amprocnum => '6', amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int8',
+ amprocrighttype => 'int8', amprocnum => '7', amproc => 'brin_minmax_ranges' },
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int2',
amprocrighttype => 'int2', amprocnum => '1',
@@ -915,6 +923,8 @@
amprocrighttype => 'int2', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int2',
amprocrighttype => 'int2', amprocnum => '6', amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int2',
+ amprocrighttype => 'int2', amprocnum => '7', amproc => 'brin_minmax_ranges' },
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int4',
amprocrighttype => 'int4', amprocnum => '1',
@@ -929,6 +939,8 @@
amprocrighttype => 'int4', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int4',
amprocrighttype => 'int4', amprocnum => '6', amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int4',
+ amprocrighttype => 'int4', amprocnum => '7', amproc => 'brin_minmax_ranges' },
# minmax multi integer: int2, int4, int8
{ amprocfamily => 'brin/integer_minmax_multi_ops', amproclefttype => 'int2',
@@ -1048,6 +1060,8 @@
amprocrighttype => 'text', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/text_minmax_ops', amproclefttype => 'text',
amprocrighttype => 'text', amprocnum => '6', amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/text_minmax_ops', amproclefttype => 'text',
+ amprocrighttype => 'text', amprocnum => '7', amproc => 'brin_minmax_ranges' },
# bloom text
{ amprocfamily => 'brin/text_bloom_ops', amproclefttype => 'text',
@@ -1078,6 +1092,8 @@
amprocrighttype => 'oid', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/oid_minmax_ops', amproclefttype => 'oid',
amprocrighttype => 'oid', amprocnum => '6', amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/oid_minmax_ops', amproclefttype => 'oid',
+ amprocrighttype => 'oid', amprocnum => '7', amproc => 'brin_minmax_ranges' },
# minmax multi oid
{ amprocfamily => 'brin/oid_minmax_multi_ops', amproclefttype => 'oid',
@@ -1128,6 +1144,8 @@
amprocrighttype => 'tid', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/tid_minmax_ops', amproclefttype => 'tid',
amprocrighttype => 'tid', amprocnum => '6', amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/tid_minmax_ops', amproclefttype => 'tid',
+ amprocrighttype => 'tid', amprocnum => '7', amproc => 'brin_minmax_ranges' },
# bloom tid
{ amprocfamily => 'brin/tid_bloom_ops', amproclefttype => 'tid',
@@ -1181,6 +1199,9 @@
{ amprocfamily => 'brin/float_minmax_ops', amproclefttype => 'float4',
amprocrighttype => 'float4', amprocnum => '6',
amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/float_minmax_ops', amproclefttype => 'float4',
+ amprocrighttype => 'float4', amprocnum => '7',
+ amproc => 'brin_minmax_ranges' },
{ amprocfamily => 'brin/float_minmax_ops', amproclefttype => 'float8',
amprocrighttype => 'float8', amprocnum => '1',
@@ -1197,6 +1218,9 @@
{ amprocfamily => 'brin/float_minmax_ops', amproclefttype => 'float8',
amprocrighttype => 'float8', amprocnum => '6',
amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/float_minmax_ops', amproclefttype => 'float8',
+ amprocrighttype => 'float8', amprocnum => '7',
+ amproc => 'brin_minmax_ranges' },
# minmax multi float
{ amprocfamily => 'brin/float_minmax_multi_ops', amproclefttype => 'float4',
@@ -1288,6 +1312,9 @@
{ amprocfamily => 'brin/macaddr_minmax_ops', amproclefttype => 'macaddr',
amprocrighttype => 'macaddr', amprocnum => '6',
amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/macaddr_minmax_ops', amproclefttype => 'macaddr',
+ amprocrighttype => 'macaddr', amprocnum => '7',
+ amproc => 'brin_minmax_ranges' },
# minmax multi macaddr
{ amprocfamily => 'brin/macaddr_minmax_multi_ops', amproclefttype => 'macaddr',
@@ -1344,6 +1371,9 @@
{ amprocfamily => 'brin/macaddr8_minmax_ops', amproclefttype => 'macaddr8',
amprocrighttype => 'macaddr8', amprocnum => '6',
amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/macaddr8_minmax_ops', amproclefttype => 'macaddr8',
+ amprocrighttype => 'macaddr8', amprocnum => '7',
+ amproc => 'brin_minmax_ranges' },
# minmax multi macaddr8
{ amprocfamily => 'brin/macaddr8_minmax_multi_ops',
@@ -1398,6 +1428,8 @@
amprocrighttype => 'inet', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/network_minmax_ops', amproclefttype => 'inet',
amprocrighttype => 'inet', amprocnum => '6', amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/network_minmax_ops', amproclefttype => 'inet',
+ amprocrighttype => 'inet', amprocnum => '7', amproc => 'brin_minmax_ranges' },
# minmax multi inet
{ amprocfamily => 'brin/network_minmax_multi_ops', amproclefttype => 'inet',
@@ -1471,6 +1503,9 @@
{ amprocfamily => 'brin/bpchar_minmax_ops', amproclefttype => 'bpchar',
amprocrighttype => 'bpchar', amprocnum => '6',
amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/bpchar_minmax_ops', amproclefttype => 'bpchar',
+ amprocrighttype => 'bpchar', amprocnum => '7',
+ amproc => 'brin_minmax_ranges' },
# bloom character
{ amprocfamily => 'brin/bpchar_bloom_ops', amproclefttype => 'bpchar',
@@ -1504,6 +1539,8 @@
amprocrighttype => 'time', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/time_minmax_ops', amproclefttype => 'time',
amprocrighttype => 'time', amprocnum => '6', amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/time_minmax_ops', amproclefttype => 'time',
+ amprocrighttype => 'time', amprocnum => '7', amproc => 'brin_minmax_ranges' },
# minmax multi time without time zone
{ amprocfamily => 'brin/time_minmax_multi_ops', amproclefttype => 'time',
@@ -1557,6 +1594,9 @@
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'timestamp',
amprocrighttype => 'timestamp', amprocnum => '6',
amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'timestamp',
+ amprocrighttype => 'timestamp', amprocnum => '7',
+ amproc => 'brin_minmax_ranges' },
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'timestamptz',
amprocrighttype => 'timestamptz', amprocnum => '1',
@@ -1573,6 +1613,9 @@
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'timestamptz',
amprocrighttype => 'timestamptz', amprocnum => '6',
amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'timestamptz',
+ amprocrighttype => 'timestamptz', amprocnum => '7',
+ amproc => 'brin_minmax_ranges' },
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'date',
amprocrighttype => 'date', amprocnum => '1',
@@ -1587,6 +1630,8 @@
amprocrighttype => 'date', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'date',
amprocrighttype => 'date', amprocnum => '6', amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'date',
+ amprocrighttype => 'date', amprocnum => '7', amproc => 'brin_minmax_ranges' },
# minmax multi datetime (date, timestamp, timestamptz)
{ amprocfamily => 'brin/datetime_minmax_multi_ops',
@@ -1716,6 +1761,9 @@
{ amprocfamily => 'brin/interval_minmax_ops', amproclefttype => 'interval',
amprocrighttype => 'interval', amprocnum => '6',
amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/interval_minmax_ops', amproclefttype => 'interval',
+ amprocrighttype => 'interval', amprocnum => '7',
+ amproc => 'brin_minmax_ranges' },
# minmax multi interval
{ amprocfamily => 'brin/interval_minmax_multi_ops',
@@ -1772,6 +1820,9 @@
{ amprocfamily => 'brin/timetz_minmax_ops', amproclefttype => 'timetz',
amprocrighttype => 'timetz', amprocnum => '6',
amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/timetz_minmax_ops', amproclefttype => 'timetz',
+ amprocrighttype => 'timetz', amprocnum => '7',
+ amproc => 'brin_minmax_ranges' },
# minmax multi time with time zone
{ amprocfamily => 'brin/timetz_minmax_multi_ops', amproclefttype => 'timetz',
@@ -1824,6 +1875,8 @@
amprocrighttype => 'bit', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/bit_minmax_ops', amproclefttype => 'bit',
amprocrighttype => 'bit', amprocnum => '6', amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/bit_minmax_ops', amproclefttype => 'bit',
+ amprocrighttype => 'bit', amprocnum => '7', amproc => 'brin_minmax_ranges' },
# minmax bit varying
{ amprocfamily => 'brin/varbit_minmax_ops', amproclefttype => 'varbit',
@@ -1841,6 +1894,9 @@
{ amprocfamily => 'brin/varbit_minmax_ops', amproclefttype => 'varbit',
amprocrighttype => 'varbit', amprocnum => '6',
amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/varbit_minmax_ops', amproclefttype => 'varbit',
+ amprocrighttype => 'varbit', amprocnum => '7',
+ amproc => 'brin_minmax_ranges' },
# minmax numeric
{ amprocfamily => 'brin/numeric_minmax_ops', amproclefttype => 'numeric',
@@ -1858,6 +1914,9 @@
{ amprocfamily => 'brin/numeric_minmax_ops', amproclefttype => 'numeric',
amprocrighttype => 'numeric', amprocnum => '6',
amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/numeric_minmax_ops', amproclefttype => 'numeric',
+ amprocrighttype => 'numeric', amprocnum => '7',
+ amproc => 'brin_minmax_ranges' },
# minmax multi numeric
{ amprocfamily => 'brin/numeric_minmax_multi_ops', amproclefttype => 'numeric',
@@ -1912,6 +1971,8 @@
amprocrighttype => 'uuid', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/uuid_minmax_ops', amproclefttype => 'uuid',
amprocrighttype => 'uuid', amprocnum => '6', amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/uuid_minmax_ops', amproclefttype => 'uuid',
+ amprocrighttype => 'uuid', amprocnum => '7', amproc => 'brin_minmax_ranges' },
# minmax multi uuid
{ amprocfamily => 'brin/uuid_minmax_multi_ops', amproclefttype => 'uuid',
@@ -1988,6 +2049,9 @@
{ amprocfamily => 'brin/pg_lsn_minmax_ops', amproclefttype => 'pg_lsn',
amprocrighttype => 'pg_lsn', amprocnum => '6',
amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/pg_lsn_minmax_ops', amproclefttype => 'pg_lsn',
+ amprocrighttype => 'pg_lsn', amprocnum => '7',
+ amproc => 'brin_minmax_ranges' },
# minmax multi pg_lsn
{ amprocfamily => 'brin/pg_lsn_minmax_multi_ops', amproclefttype => 'pg_lsn',
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 1dd9177b01c..18e0824a08e 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8411,6 +8411,9 @@
proname => 'brin_minmax_stats', prorettype => 'bool',
proargtypes => 'internal internal int2 int2 internal int4',
prosrc => 'brin_minmax_stats' },
+{ oid => '9980', descr => 'BRIN minmax support',
+ proname => 'brin_minmax_ranges', prorettype => 'bool',
+ proargtypes => 'internal int2 bool', prosrc => 'brin_minmax_ranges' },
# BRIN minmax multi
{ oid => '4616', descr => 'BRIN multi minmax support',
diff --git a/src/include/executor/nodeBrinSort.h b/src/include/executor/nodeBrinSort.h
new file mode 100644
index 00000000000..2c860d926ea
--- /dev/null
+++ b/src/include/executor/nodeBrinSort.h
@@ -0,0 +1,47 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeBrinSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeBrinSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEBRINSORT_H
+#define NODEBRINSORT_H
+
+#include "access/genam.h"
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern BrinSortState *ExecInitBrinSort(BrinSort *node, EState *estate, int eflags);
+extern void ExecEndBrinSort(BrinSortState *node);
+extern void ExecBrinSortMarkPos(BrinSortState *node);
+extern void ExecBrinSortRestrPos(BrinSortState *node);
+extern void ExecReScanBrinSort(BrinSortState *node);
+extern void ExecBrinSortEstimate(BrinSortState *node, ParallelContext *pcxt);
+extern void ExecBrinSortInitializeDSM(BrinSortState *node, ParallelContext *pcxt);
+extern void ExecBrinSortReInitializeDSM(BrinSortState *node, ParallelContext *pcxt);
+extern void ExecBrinSortInitializeWorker(BrinSortState *node,
+ ParallelWorkerContext *pwcxt);
+
+/*
+ * These routines are exported to share code with nodeIndexonlyscan.c and
+ * nodeBitmapIndexscan.c
+ */
+extern void ExecIndexBuildScanKeys(PlanState *planstate, Relation index,
+ List *quals, bool isorderby,
+ ScanKey *scanKeys, int *numScanKeys,
+ IndexRuntimeKeyInfo **runtimeKeys, int *numRuntimeKeys,
+ IndexArrayKeyInfo **arrayKeys, int *numArrayKeys);
+extern void ExecIndexEvalRuntimeKeys(ExprContext *econtext,
+ IndexRuntimeKeyInfo *runtimeKeys, int numRuntimeKeys);
+extern bool ExecIndexEvalArrayKeys(ExprContext *econtext,
+ IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
+extern bool ExecIndexAdvanceArrayKeys(IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
+
+#endif /* NODEBRINSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 01b1727fc09..381c2fcd3d6 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1549,6 +1549,109 @@ typedef struct IndexScanState
Size iss_PscanLen;
} IndexScanState;
+typedef enum {
+ BRINSORT_START,
+ BRINSORT_LOAD_RANGE,
+ BRINSORT_PROCESS_RANGE,
+ BRINSORT_LOAD_NULLS,
+ BRINSORT_PROCESS_NULLS,
+ BRINSORT_FINISHED
+} BrinSortPhase;
+
+typedef struct BrinRangeScanDesc
+{
+ /* range info tuple descriptor */
+ TupleDesc tdesc;
+
+ /* ranges, sorted by minval, blkno_start */
+ Tuplesortstate *ranges;
+
+ /* distinct minval (sorted) */
+ Tuplestorestate *minvals;
+
+ /* slot for accessing the tuplesort/tuplestore */
+ TupleTableSlot *slot;
+
+} BrinRangeScanDesc;
+
+/*
+ * Info about ranges for BRIN Sort.
+ */
+typedef struct BrinRange
+{
+ BlockNumber blkno_start;
+ BlockNumber blkno_end;
+
+ Datum min_value;
+ Datum max_value;
+ bool has_nulls;
+ bool all_nulls;
+ bool not_summarized;
+
+ /*
+ * Index of the range when ordered by min_value (if there are multiple
+ * ranges with the same min_value, it's the lowest one).
+ */
+ uint32 min_index;
+
+ /*
+ * Minimum min_index from all ranges with higher max_value (i.e. when
+ * sorted by max_value). If there are multiple ranges with the same
+ * max_value, it depends on the ordering (i.e. the ranges may get
+ * different min_index_lowest, depending on the exact ordering).
+ */
+ uint32 min_index_lowest;
+} BrinRange;
+
+typedef struct BrinRanges
+{
+ int nranges;
+ BrinRange ranges[FLEXIBLE_ARRAY_MEMBER];
+} BrinRanges;
+
+typedef struct BrinSortState
+{
+ ScanState ss; /* its first field is NodeTag */
+ ExprState *indexqualorig;
+ List *indexorderbyorig;
+ struct ScanKeyData *iss_ScanKeys;
+ int iss_NumScanKeys;
+ struct ScanKeyData *iss_OrderByKeys;
+ int iss_NumOrderByKeys;
+ IndexRuntimeKeyInfo *iss_RuntimeKeys;
+ int iss_NumRuntimeKeys;
+ bool iss_RuntimeKeysReady;
+ ExprContext *iss_RuntimeContext;
+ Relation iss_RelationDesc;
+ struct IndexScanDescData *iss_ScanDesc;
+
+ /* These are needed for re-checking ORDER BY expr ordering */
+ pairingheap *iss_ReorderQueue;
+ bool iss_ReachedEnd;
+ Datum *iss_OrderByValues;
+ bool *iss_OrderByNulls;
+ SortSupport iss_SortSupport;
+ bool *iss_OrderByTypByVals;
+ int16 *iss_OrderByTypLens;
+ Size iss_PscanLen;
+
+ /* */
+ BrinRangeScanDesc *bs_scan;
+ BrinRange *bs_range;
+ ExprState *bs_qual;
+ Datum bs_watermark;
+ bool bs_watermark_set;
+ BrinSortPhase bs_phase;
+ SortSupportData bs_sortsupport;
+
+ /*
+ * We need two tuplesort instances - one for current range, one for
+ * spill-over tuples from the overlapping ranges
+ */
+ void *bs_tuplesortstate;
+ Tuplestorestate *bs_tuplestore;
+} BrinSortState;
+
/* ----------------
* IndexOnlyScanState information
*
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 6bda383bead..e79c904a8fc 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1596,6 +1596,17 @@ typedef struct IndexPath
Selectivity indexselectivity;
} IndexPath;
+/*
+ * read sorted data from brin index
+ *
+ * We use IndexPath, because that's what amcostestimate is expecting, but
+ * we typedef it as a separate struct.
+ */
+typedef struct BrinSortPath
+{
+ IndexPath ipath;
+} BrinSortPath;
+
/*
* Each IndexClause references a RestrictInfo node from the query's WHERE
* or JOIN conditions, and shows how that restriction can be applied to
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 21e642a64c4..c4ef5362acc 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -495,6 +495,32 @@ typedef struct IndexOnlyScan
ScanDirection indexorderdir; /* forward or backward or don't care */
} IndexOnlyScan;
+
+typedef struct BrinSort
+{
+ Scan scan;
+ Oid indexid; /* OID of index to scan */
+ List *indexqual; /* list of index quals (usually OpExprs) */
+ List *indexqualorig; /* the same in original form */
+ ScanDirection indexorderdir; /* forward or backward or don't care */
+
+ /* number of sort-key columns */
+ int numCols;
+
+ /* their indexes in the target list */
+ AttrNumber *sortColIdx pg_node_attr(array_size(numCols));
+
+ /* OIDs of operators to sort them by */
+ Oid *sortOperators pg_node_attr(array_size(numCols));
+
+ /* OIDs of collations */
+ Oid *collations pg_node_attr(array_size(numCols));
+
+ /* NULLS FIRST/LAST directions */
+ bool *nullsFirst pg_node_attr(array_size(numCols));
+
+} BrinSort;
+
/* ----------------
* bitmap index scan node
*
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 204e94b6d10..b77440728d1 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -69,6 +69,7 @@ extern PGDLLIMPORT bool enable_parallel_append;
extern PGDLLIMPORT bool enable_parallel_hash;
extern PGDLLIMPORT bool enable_partition_pruning;
extern PGDLLIMPORT bool enable_async_append;
+extern PGDLLIMPORT bool enable_brinsort;
extern PGDLLIMPORT int constraint_exclusion;
extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
@@ -79,6 +80,8 @@ extern void cost_samplescan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
ParamPathInfo *param_info);
extern void cost_index(IndexPath *path, PlannerInfo *root,
double loop_count, bool partial_path);
+extern void cost_brinsort(BrinSortPath *path, PlannerInfo *root,
+ double loop_count, bool partial_path);
extern void cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
ParamPathInfo *param_info,
Path *bitmapqual, double loop_count);
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 050f00e79a4..11caad3ec51 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -49,6 +49,15 @@ extern IndexPath *create_index_path(PlannerInfo *root,
Relids required_outer,
double loop_count,
bool partial_path);
+extern BrinSortPath *create_brinsort_path(PlannerInfo *root,
+ IndexOptInfo *index,
+ List *indexclauses,
+ List *pathkeys,
+ ScanDirection indexscandir,
+ bool indexonly,
+ Relids required_outer,
+ double loop_count,
+ bool partial_path);
extern BitmapHeapPath *create_bitmap_heap_path(PlannerInfo *root,
RelOptInfo *rel,
Path *bitmapqual,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 41f765d3422..6aa50257730 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -213,6 +213,9 @@ extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
ScanDirection scandir);
+extern List *build_index_pathkeys_brin(PlannerInfo *root, IndexOptInfo *index,
+ TargetEntry *tle, int idx,
+ bool reverse_sort, bool nulls_first);
extern List *build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
ScanDirection scandir, bool *partialkeys);
extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
--
2.37.3
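In case anyone wants to try this out, a minimal recipe (reusing the table t from the example at the beginning of this message; the index name is arbitrary, I used t_a_idx):

create index t_a_idx on t using brin (a);
set enable_brinsort = on;
explain (costs off) select * from t order by a limit 10;

The GUC defaults to false (see the guc_tables.c hunk above), so without the SET the BRIN Sort paths are not considered at all.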
Attachment: 0003-wip-brinsort-explain-stats-20221022.patch
From 49d9058eab1d4009de8e82eb1a87ad49372f297b Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Fri, 21 Oct 2022 15:33:16 +0200
Subject: [PATCH 3/6] wip: brinsort explain stats
Show some internal stats about BRIN Sort in EXPLAIN output.
---
src/backend/commands/explain.c | 115 ++++++++++++++++++++++++++++
src/backend/executor/nodeBrinSort.c | 35 +++++++++
src/include/nodes/execnodes.h | 33 ++++++++
3 files changed, 183 insertions(+)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index e15b29246b1..c5ace02a10d 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -87,6 +87,8 @@ static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
List *ancestors, ExplainState *es);
static void show_brinsort_keys(BrinSortState *sortstate, List *ancestors,
ExplainState *es);
+static void show_brinsort_stats(BrinSortState *sortstate, List *ancestors,
+ ExplainState *es);
static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
ExplainState *es);
static void show_agg_keys(AggState *astate, List *ancestors,
@@ -1814,6 +1816,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate, es);
show_scan_qual(plan->qual, "Filter", planstate, ancestors, es);
show_brinsort_keys(castNode(BrinSortState, planstate), ancestors, es);
+ show_brinsort_stats(castNode(BrinSortState, planstate), ancestors, es);
if (plan->qual)
show_instrumentation_count("Rows Removed by Filter", 1,
planstate, es);
@@ -2432,6 +2435,118 @@ show_brinsort_keys(BrinSortState *sortstate, List *ancestors, ExplainState *es)
ancestors, es);
}
+static void
+show_brinsort_stats(BrinSortState *sortstate, List *ancestors, ExplainState *es)
+{
+ BrinSortStats *stats = &sortstate->bs_stats;
+
+ if (stats->sort_count > 0)
+ {
+ ExplainPropertyInteger("Ranges Processed", NULL, (int64)
+ stats->range_count, es);
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ ExplainPropertyInteger("Sorts", NULL, (int64)
+ stats->sort_count, es);
+
+ ExplainIndentText(es);
+ appendStringInfo(es->str, "Tuples Sorted: " INT64_FORMAT " Per-sort: " INT64_FORMAT " Direct: " INT64_FORMAT " Spilled: " INT64_FORMAT " Respilled: " INT64_FORMAT "\n",
+ stats->ntuples_tuplesort_all,
+ stats->ntuples_tuplesort_all / stats->sort_count,
+ stats->ntuples_tuplesort_direct,
+ stats->ntuples_spilled,
+ stats->ntuples_respilled);
+ }
+ else
+ {
+ ExplainOpenGroup("Sorts", "Sorts", true, es);
+
+ ExplainPropertyInteger("Count", NULL, (int64)
+ stats->sort_count, es);
+
+ ExplainPropertyInteger("Tuples per sort", NULL, (int64)
+ stats->ntuples_tuplesort_all / stats->sort_count, es);
+
+ ExplainPropertyInteger("Sorted tuples (all)", NULL, (int64)
+ stats->ntuples_tuplesort_all, es);
+
+ ExplainPropertyInteger("Sorted tuples (direct)", NULL, (int64)
+ stats->ntuples_tuplesort_direct, es);
+
+ ExplainPropertyInteger("Spilled tuples", NULL, (int64)
+ stats->ntuples_spilled, es);
+
+ ExplainPropertyInteger("Respilled tuples", NULL, (int64)
+ stats->ntuples_respilled, es);
+
+ ExplainCloseGroup("Sorts", "Sorts", true, es);
+ }
+ }
+
+ if (stats->sort_count_in_memory > 0)
+ {
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ ExplainIndentText(es);
+ appendStringInfo(es->str, "Sorts (in-memory) Count: " INT64_FORMAT " Space Total: " INT64_FORMAT " kB Maximum: " INT64_FORMAT " kB Average: " INT64_FORMAT " kB\n",
+ stats->sort_count_in_memory,
+ stats->total_space_used_in_memory,
+ stats->max_space_used_in_memory,
+ stats->total_space_used_in_memory / stats->sort_count_in_memory);
+ }
+ else
+ {
+ ExplainOpenGroup("In-Memory Sorts", "In-Memory Sorts", true, es);
+
+ ExplainPropertyInteger("Count", NULL, (int64)
+ stats->sort_count_in_memory, es);
+
+ ExplainPropertyInteger("Average space", "kB", (int64)
+ stats->total_space_used_in_memory / stats->sort_count_in_memory, es);
+
+ ExplainPropertyInteger("Maximum space", "kB", (int64)
+ stats->max_space_used_in_memory, es);
+
+ ExplainPropertyInteger("Total space", "kB", (int64)
+ stats->total_space_used_in_memory, es);
+
+ ExplainCloseGroup("In-Memory Sorts", "In-Memory Sorts", true, es);
+ }
+ }
+
+ if (stats->sort_count_on_disk > 0)
+ {
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ ExplainIndentText(es);
+ appendStringInfo(es->str, "Sorts (on-disk) Count: " INT64_FORMAT " Space Total: " INT64_FORMAT " kB Maximum: " INT64_FORMAT " kB Average: " INT64_FORMAT " kB\n",
+ stats->sort_count_on_disk,
+ stats->total_space_used_on_disk,
+ stats->max_space_used_on_disk,
+ stats->total_space_used_on_disk / stats->sort_count_on_disk);
+ }
+ else
+ {
+ ExplainOpenGroup("On-Disk Sorts", "On-Disk Sorts", true, es);
+
+ ExplainPropertyInteger("Count", NULL, (int64)
+ stats->sort_count_on_disk, es);
+
+ ExplainPropertyInteger("Average space", "kB", (int64)
+ stats->total_space_used_on_disk / stats->sort_count_on_disk, es);
+
+ ExplainPropertyInteger("Maximum space", "kB", (int64)
+ stats->max_space_used_on_disk, es);
+
+ ExplainPropertyInteger("Total space", "kB", (int64)
+ stats->total_space_used_on_disk, es);
+
+ ExplainCloseGroup("On-Disk Sorts", "On-Disk Sorts", true, es);
+ }
+ }
+}
+
/*
* Likewise, for a MergeAppend node.
*/
diff --git a/src/backend/executor/nodeBrinSort.c b/src/backend/executor/nodeBrinSort.c
index ca72c1ed22d..c7d417d6e57 100644
--- a/src/backend/executor/nodeBrinSort.c
+++ b/src/backend/executor/nodeBrinSort.c
@@ -454,6 +454,8 @@ brinsort_load_tuples(BrinSortState *node, bool check_watermark, bool null_proces
if (null_processing && !(range->has_nulls || range->not_summarized || range->all_nulls))
return;
+ node->bs_stats.range_count++;
+
brinsort_start_tidscan(node);
scan = node->ss.ss_currentScanDesc;
@@ -526,7 +528,10 @@ brinsort_load_tuples(BrinSortState *node, bool check_watermark, bool null_proces
/* Stash it to the tuplestore (when NULL), or ignore
* it (when not-NULL). */
if (isnull)
+ {
tuplestore_puttupleslot(node->bs_tuplestore, slot);
+ node->bs_stats.ntuples_spilled++;
+ }
/* NULL or not, we're done */
continue;
@@ -546,9 +551,16 @@ brinsort_load_tuples(BrinSortState *node, bool check_watermark, bool null_proces
&node->bs_sortsupport);
if (cmp <= 0)
+ {
tuplesort_puttupleslot(node->bs_tuplesortstate, slot);
+ node->bs_stats.ntuples_tuplesort_direct++;
+ node->bs_stats.ntuples_tuplesort_all++;
+ }
else
+ {
tuplestore_puttupleslot(node->bs_tuplestore, slot);
+ node->bs_stats.ntuples_spilled++;
+ }
}
ExecClearTuple(slot);
@@ -610,9 +622,15 @@ brinsort_load_spill_tuples(BrinSortState *node, bool check_watermark)
&node->bs_sortsupport);
if (cmp <= 0)
+ {
tuplesort_puttupleslot(node->bs_tuplesortstate, slot);
+ node->bs_stats.ntuples_tuplesort_all++;
+ }
else
+ {
tuplestore_puttupleslot(tupstore, slot);
+ node->bs_stats.ntuples_respilled++;
+ }
}
/*
@@ -890,12 +908,29 @@ IndexNext(BrinSortState *node)
if (node->bs_tuplesortstate)
{
tuplesort_performsort(node->bs_tuplesortstate);
+ node->bs_stats.sort_count++;
+
#ifdef BRINSORT_DEBUG
{
TuplesortInstrumentation stats;
tuplesort_get_stats(node->bs_tuplesortstate, &stats);
+ if (stats.spaceType == SORT_SPACE_TYPE_DISK)
+ {
+ node->bs_stats.sort_count_on_disk++;
+ node->bs_stats.total_space_used_on_disk += stats.spaceUsed;
+ node->bs_stats.max_space_used_on_disk = Max(node->bs_stats.max_space_used_on_disk,
+ stats.spaceUsed);
+ }
+ else if (stats.spaceType == SORT_SPACE_TYPE_MEMORY)
+ {
+ node->bs_stats.sort_count_in_memory++;
+ node->bs_stats.total_space_used_in_memory += stats.spaceUsed;
+ node->bs_stats.max_space_used_in_memory = Max(node->bs_stats.max_space_used_in_memory,
+ stats.spaceUsed);
+ }
+
elog(DEBUG1, "method: %s space: %ld kB (%s)",
tuplesort_method_name(stats.sortMethod),
stats.spaceUsed,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 381c2fcd3d6..e8f7b25549f 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1609,6 +1609,38 @@ typedef struct BrinRanges
BrinRange ranges[FLEXIBLE_ARRAY_MEMBER];
} BrinRanges;
+typedef struct BrinSortStats
+{
+ /* number of sorts */
+ int64 sort_count;
+
+ /* number of ranges loaded */
+ int64 range_count;
+
+ /* tuples written directly to tuplesort */
+ int64 ntuples_tuplesort_direct;
+
+ /* tuples written to tuplesort (all) */
+ int64 ntuples_tuplesort_all;
+
+ /* tuples written to tuplestore */
+ int64 ntuples_spilled;
+
+ /* tuples copied from old to new tuplestore */
+ int64 ntuples_respilled;
+
+ /* number of in-memory/on-disk sorts */
+ int64 sort_count_in_memory;
+ int64 sort_count_on_disk;
+
+ /* total/maximum amount of space used by either sort */
+ int64 total_space_used_in_memory;
+ int64 total_space_used_on_disk;
+ int64 max_space_used_in_memory;
+ int64 max_space_used_on_disk;
+
+} BrinSortStats;
+
typedef struct BrinSortState
{
ScanState ss; /* its first field is NodeTag */
@@ -1643,6 +1675,7 @@ typedef struct BrinSortState
bool bs_watermark_set;
BrinSortPhase bs_phase;
SortSupportData bs_sortsupport;
+ BrinSortStats bs_stats;
/*
* We need two tuplesort instances - one for current range, one for
--
2.37.3
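The counters from 0003 show up in the EXPLAIN (ANALYZE) output of the BRIN Sort node, so to see them just run something like:

explain (analyze, costs off, timing off) select * from t order by a;

In the text format this adds the "Ranges Processed", "Sorts" and "Tuples Sorted" lines, the other formats get the "Sorts" / "In-Memory Sorts" / "On-Disk Sorts" groups produced by show_brinsort_stats(). I'm not pasting actual output here - the numbers depend entirely on the data, the range overlaps and work_mem.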
Attachment: 0004-wip-multiple-watermark-steps-20221022.patch
From ad9f73005b425ff15f2ff2687a23f2b024db56b2 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 20 Oct 2022 13:03:00 +0200
Subject: [PATCH 4/6] wip: multiple watermark steps
Allow incrementing the minval watermark faster, by skipping some minval
values. This allows sorting more data at once (instead of many tiny
sorts, which is inefficient). This also reduces the number of rows we
need to spill (and possibly transfer multiple times).
To use a different watermark step, use a new GUC:
SET brinsort_watermark_step = 16
---
src/backend/executor/nodeBrinSort.c | 59 ++++++++++++++++++++++++++---
src/backend/utils/misc/guc_tables.c | 11 ++++++
2 files changed, 64 insertions(+), 6 deletions(-)
diff --git a/src/backend/executor/nodeBrinSort.c b/src/backend/executor/nodeBrinSort.c
index c7d417d6e57..3563bf3c1ad 100644
--- a/src/backend/executor/nodeBrinSort.c
+++ b/src/backend/executor/nodeBrinSort.c
@@ -257,6 +257,14 @@ static void ExecInitBrinSortRanges(BrinSort *node, BrinSortState *planstate);
#define BRINSORT_DEBUG
+/*
+ * How many distinct minval values to look forward for the next watermark?
+ *
+ * The smallest step we can do is 1, which means the immediately following
+ * (while distinct) minval.
+ */
+int brinsort_watermark_step = 1;
+
/* do various consistency checks */
static void
AssertCheckRanges(BrinSortState *node)
@@ -357,11 +365,24 @@ brinsort_end_tidscan(BrinSortState *node)
* heuristics (measure past sorts and extrapolate).
*/
static void
-brinsort_update_watermark(BrinSortState *node, bool asc)
+brinsort_update_watermark(BrinSortState *node, bool first, bool asc, int steps)
{
int cmp;
+
+ /* assume we haven't found a watermark */
bool found = false;
+ Assert(steps > 0);
+
+ /*
+ * If the watermark is not set, either this is the first call (in which
+ * case we just use the first, or rather second, value), or it means
+ * we've reached the end, so there's no point in looking for more
+ * watermarks.
+ */
+ if (!node->bs_watermark_set && !first)
+ return;
+
tuplesort_markpos(node->bs_scan->ranges);
while (tuplesort_gettupleslot(node->bs_scan->ranges, true, false, node->bs_scan->slot, NULL))
@@ -387,22 +408,48 @@ brinsort_update_watermark(BrinSortState *node, bool asc)
else
value = slot_getattr(node->bs_scan->slot, 7, &isnull);
+ /*
+ * Has to be the first call (otherwise we would not get here, because we
+ * terminate after bs_watermark_set gets flipped back to false), so we
+ * just set the value. But we don't count this as a step, because that
+ * just picks the first minval value, as we certainly need to do at least
+ * one more step.
+ *
+ * XXX Actually, do we need to make another step? Maybe there are enough
+ * not-summarized ranges? Although we don't know what values are in
+ * those ranges, and with increasing data we might easily end up just
+ * writing all of it into the spill tuplestore. So making one more step
+ * seems like a better idea - we'll at least be able to produce something
+ * useful for LIMIT queries.
+ */
if (!node->bs_watermark_set)
{
+ Assert(first);
node->bs_watermark_set = true;
node->bs_watermark = value;
+ found = true;
continue;
}
cmp = ApplySortComparator(node->bs_watermark, false, value, false,
&node->bs_sortsupport);
- if (cmp < 0)
+ /*
+ * Values should not decrease (in the ordering given by the sort
+ * operator - it might be a DESC sort).
+ */
+ Assert(cmp <= 0);
+
+ if (cmp < 0) /* new watermark value */
{
node->bs_watermark_set = true;
node->bs_watermark = value;
found = true;
- break;
+
+ steps--;
+
+ if (steps == 0)
+ break;
}
}
@@ -871,7 +918,7 @@ IndexNext(BrinSortState *node)
node->bs_phase = BRINSORT_LOAD_RANGE;
/* set the first watermark */
- brinsort_update_watermark(node, asc);
+ brinsort_update_watermark(node, true, asc, brinsort_watermark_step);
}
break;
@@ -981,7 +1028,7 @@ IndexNext(BrinSortState *node)
{
/* update the watermark and try reading more ranges */
node->bs_phase = BRINSORT_LOAD_RANGE;
- brinsort_update_watermark(node, asc);
+ brinsort_update_watermark(node, false, asc, brinsort_watermark_step);
}
break;
@@ -1004,7 +1051,7 @@ IndexNext(BrinSortState *node)
{
brinsort_rescan(node);
node->bs_phase = BRINSORT_LOAD_RANGE;
- brinsort_update_watermark(node, asc);
+ brinsort_update_watermark(node, true, asc, brinsort_watermark_step);
}
else
node->bs_phase = BRINSORT_FINISHED;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index a5ca3bd0cc4..c7abdade496 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -95,6 +95,7 @@ extern char *temp_tablespaces;
extern bool ignore_checksum_failure;
extern bool ignore_invalid_pages;
extern bool synchronize_seqscans;
+extern int brinsort_watermark_step;
#ifdef TRACE_SYNCSCAN
extern bool trace_syncscan;
@@ -3425,6 +3426,16 @@ struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"brinsort_watermark_step", PGC_USERSET, DEVELOPER_OPTIONS,
+ gettext_noop("sets the step for brinsort watermark increments"),
+ NULL
+ },
+ &brinsort_watermark_step,
+ 1, 1, INT_MAX,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
--
2.37.3
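To see the effect of 0004, compare the counters from 0003 for different step values, e.g.:

set brinsort_watermark_step = 1;
explain (analyze, costs off, timing off) select * from t order by a;
set brinsort_watermark_step = 16;
explain (analyze, costs off, timing off) select * from t order by a;

With a larger step there should be fewer (but larger) sorts and fewer spilled/respilled tuples, at the cost of more memory used by each sort.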
Attachment: 0005-wip-adjust-watermark-step-20221022.patch
From 223037eaff1a5b008be3230f713875a4b05f0453 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Sat, 22 Oct 2022 00:06:28 +0200
Subject: [PATCH 5/6] wip: adjust watermark step
Look at available statistics - number of possible watermark values,
number of rows, work_mem, etc. and pick a good watermark_step value.
To calculate step using statistics, set the GUC to 0:
SET brinsort_watermark_step = 0;
---
src/backend/commands/explain.c | 4 ++
src/backend/executor/nodeBrinSort.c | 20 +++----
src/backend/optimizer/plan/createplan.c | 70 +++++++++++++++++++++++++
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/nodes/execnodes.h | 1 +
src/include/nodes/plannodes.h | 3 ++
6 files changed, 86 insertions(+), 14 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index c5ace02a10d..114846ebe0b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2439,6 +2439,10 @@ static void
show_brinsort_stats(BrinSortState *sortstate, List *ancestors, ExplainState *es)
{
BrinSortStats *stats = &sortstate->bs_stats;
+ BrinSort *plan = (BrinSort *) sortstate->ss.ps.plan;
+
+ ExplainPropertyInteger("Step", NULL, (int64)
+ plan->watermark_step, es);
if (stats->sort_count > 0)
{
diff --git a/src/backend/executor/nodeBrinSort.c b/src/backend/executor/nodeBrinSort.c
index 3563bf3c1ad..2f8e92753cd 100644
--- a/src/backend/executor/nodeBrinSort.c
+++ b/src/backend/executor/nodeBrinSort.c
@@ -257,14 +257,6 @@ static void ExecInitBrinSortRanges(BrinSort *node, BrinSortState *planstate);
#define BRINSORT_DEBUG
-/*
- * How many distinct minval values to look forward for the next watermark?
- *
- * The smallest step we can do is 1, which means the immediately following
- * (while distinct) minval.
- */
-int brinsort_watermark_step = 1;
-
/* do various consistency checks */
static void
AssertCheckRanges(BrinSortState *node)
@@ -365,9 +357,11 @@ brinsort_end_tidscan(BrinSortState *node)
* heuristics (measure past sorts and extrapolate).
*/
static void
-brinsort_update_watermark(BrinSortState *node, bool first, bool asc, int steps)
+brinsort_update_watermark(BrinSortState *node, bool first, bool asc)
{
int cmp;
+ BrinSort *plan = (BrinSort *) node->ss.ps.plan;
+ int steps = plan->watermark_step;
/* assume we haven't found a watermark */
bool found = false;
@@ -918,7 +912,7 @@ IndexNext(BrinSortState *node)
node->bs_phase = BRINSORT_LOAD_RANGE;
/* set the first watermark */
- brinsort_update_watermark(node, true, asc, brinsort_watermark_step);
+ brinsort_update_watermark(node, true, asc);
}
break;
@@ -978,7 +972,7 @@ IndexNext(BrinSortState *node)
stats.spaceUsed);
}
- elog(DEBUG1, "method: %s space: %ld kB (%s)",
+ elog(WARNING, "method: %s space: %ld kB (%s)",
tuplesort_method_name(stats.sortMethod),
stats.spaceUsed,
tuplesort_space_type_name(stats.spaceType));
@@ -1028,7 +1022,7 @@ IndexNext(BrinSortState *node)
{
/* update the watermark and try reading more ranges */
node->bs_phase = BRINSORT_LOAD_RANGE;
- brinsort_update_watermark(node, false, asc, brinsort_watermark_step);
+ brinsort_update_watermark(node, false, asc);
}
break;
@@ -1051,7 +1045,7 @@ IndexNext(BrinSortState *node)
{
brinsort_rescan(node);
node->bs_phase = BRINSORT_LOAD_RANGE;
- brinsort_update_watermark(node, true, asc, brinsort_watermark_step);
+ brinsort_update_watermark(node, true, asc);
}
else
node->bs_phase = BRINSORT_FINISHED;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 395c632f430..997c272dec0 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -18,6 +18,7 @@
#include <math.h>
+#include "access/brin.h"
#include "access/sysattr.h"
#include "catalog/pg_class.h"
#include "foreign/fdwapi.h"
@@ -321,6 +322,14 @@ static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
GatherMergePath *best_path);
+/*
+ * How many distinct minval values to look forward for the next watermark?
+ *
+ * The smallest step we can do is 1, which means the immediately following
+ * (while distinct) minval.
+ */
+int brinsort_watermark_step = 0;
+
/*
* create_plan
* Creates the access plan for a query by recursively processing the
@@ -3340,6 +3349,67 @@ create_brinsort_plan(PlannerInfo *root,
copy_generic_path_info(&brinsort_plan->scan.plan, &best_path->ipath.path);
+ /*
+ * determine watermark step (how fast to advance)
+ *
+ * If the brinsort_watermark_step is set to a non-zero value, we just use
+ * that value directly. Otherwise we pick a value using some simple
+ * heuristics - we don't want the rows to exceed work_mem, and we leave
+ * a bit of slack (because we're adding batches of rows, not row
+ * by row).
+ *
+ * This has a weakness, because it assumes we incrementally add the same
+ * number of rows into the "sort" set - but imagine very wide overlapping
+ * ranges (e.g. random data on the same domain). Most of them will have
+ * about the same minval, so the sort grows only very slowly. Until the
+ * very last range, that removes the watermark and only then do most of
+ * the rows get to the tuplesort.
+ *
+ * XXX But maybe we can look at the other statistics we have, like number
+ * of overlaps and average range selectivity (% of tuples matching), and
+ * deduce something from that?
+ *
+ * XXX Could we maybe adjust the watermark step adaptively at runtime?
+ * That is, when we get to the "sort" step, maybe check how many rows
+ * are there, and if there are only few then try increasing the step?
+ */
+ brinsort_plan->watermark_step = brinsort_watermark_step;
+
+ if (brinsort_plan->watermark_step == 0)
+ {
+ BrinMinmaxStats *amstats;
+
+ /**/
+ Cardinality rows = brinsort_plan->scan.plan.plan_rows;
+
+ /* estimate rowsize in the tuplesort */
+ int width = brinsort_plan->scan.plan.plan_width;
+ int tupwidth = (MAXALIGN(width) + MAXALIGN(SizeofHeapTupleHeader));
+
+ /* Don't overflow work_mem (use only half, to absorb variations). */
+ int maxrows = (work_mem * 1024L / tupwidth / 2);
+
+ /* If this is a LIMIT query, aim only for the required number of rows. */
+ if (root->limit_tuples > 0)
+ maxrows = Min(maxrows, root->limit_tuples);
+
+ /* FIXME hard-coded attnum */
+ amstats = (BrinMinmaxStats *) get_attindexam(brinsort_plan->indexid, 1);
+
+ if (amstats)
+ {
+ double rows_per_step = rows / amstats->minval_ndistinct;
+ elog(WARNING, "rows_per_step = %f", rows_per_step);
+
+ brinsort_plan->watermark_step = (int) (maxrows / rows_per_step);
+
+ elog(WARNING, "calculated step = %d", brinsort_plan->watermark_step);
+ }
+
+ brinsort_plan->watermark_step = Max(brinsort_plan->watermark_step, 1);
+ brinsort_plan->watermark_step = Min(brinsort_plan->watermark_step, 1024);
+ }
+
return brinsort_plan;
}
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index c7abdade496..9ab51a22db7 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3432,7 +3432,7 @@ struct config_int ConfigureNamesInt[] =
NULL
},
&brinsort_watermark_step,
- 1, 1, INT_MAX,
+ 0, 0, INT_MAX,
NULL, NULL, NULL
},
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index e8f7b25549f..86879bed2f4 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1671,6 +1671,7 @@ typedef struct BrinSortState
BrinRangeScanDesc *bs_scan;
BrinRange *bs_range;
ExprState *bs_qual;
+ int bs_watermark_step;
Datum bs_watermark;
bool bs_watermark_set;
BrinSortPhase bs_phase;
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index c4ef5362acc..9f9ad97ac2d 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -519,6 +519,9 @@ typedef struct BrinSort
/* NULLS FIRST/LAST directions */
bool *nullsFirst pg_node_attr(array_size(numCols));
+ /* number of watermark steps to make */
+ int watermark_step;
+
} BrinSort;
/* ----------------
--
2.37.3
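To give a feeling for the heuristics in create_brinsort_plan, with made-up but hopefully realistic numbers: for a single int4 column the estimated tuple size is MAXALIGN(4) + MAXALIGN(SizeofHeapTupleHeader) = 8 + 24 = 32 bytes (on a 64-bit build), so with the default work_mem = 4MB we get

maxrows = 4096 * 1024 / 32 / 2 = 65536

If the plan expects 10M rows and the BRIN stats report 100k distinct minval values, then rows_per_step = 100 and the computed step is 65536 / 100 = 655, which fits within the [1, 1024] clamp. In other words, the step is driven mostly by work_mem and by how many rows share each distinct minval.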
Attachment: 0006-wip-adaptive-watermark-step-20221022.patch
From 3afa1148fe17a5c47267853e4cb3cac03bd595b0 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Sat, 22 Oct 2022 01:39:39 +0200
Subject: [PATCH 6/6] wip: adaptive watermark step
Another option is to adjust the watermark step based on past tuplesort
executions, and either increase or decrease the step, based on whether
the sort was in-memory or on-disk, etc.
To do this, set the GUC to -1:
SET brinsort_watermark_step = -1;
---
src/backend/executor/nodeBrinSort.c | 51 +++++++++++++++++++++++--
src/backend/optimizer/plan/createplan.c | 7 +---
src/backend/utils/misc/guc_tables.c | 2 +-
src/backend/utils/sort/tuplesort.c | 12 ++++++
src/include/nodes/execnodes.h | 2 +-
src/include/utils/tuplesort.h | 1 +
6 files changed, 64 insertions(+), 11 deletions(-)
diff --git a/src/backend/executor/nodeBrinSort.c b/src/backend/executor/nodeBrinSort.c
index 2f8e92753cd..2bf75fd603a 100644
--- a/src/backend/executor/nodeBrinSort.c
+++ b/src/backend/executor/nodeBrinSort.c
@@ -255,6 +255,8 @@ static TupleTableSlot *IndexNext(BrinSortState *node);
static bool IndexRecheck(BrinSortState *node, TupleTableSlot *slot);
static void ExecInitBrinSortRanges(BrinSort *node, BrinSortState *planstate);
+extern int brinsort_watermark_step;
+
#define BRINSORT_DEBUG
/* do various consistency checks */
@@ -814,6 +816,42 @@ brinsort_rescan(BrinSortState *node)
tuplesort_rescan(node->bs_scan->ranges);
}
+/*
+ * Look at the tuplesort statistics, and maybe increase or decrease the
+ * watermark step. If the last sort spilled to disk, we decrease the step.
+ * If the sort was in-memory but used less than work_mem/3, we increment
+ * the step value.
+ *
+ * XXX This should probably behave differently for LIMIT queries, so that
+ * we don't load too many rows unnecessarily. We already consider that in
+ * create_brinsort_plan, but maybe we should limit increments to the step
+ * value here too - say, by tracking how many rows we are supposed to
+ * produce, and limiting the watermark so that we don't process too many
+ * rows in future steps.
+ *
+ * XXX We might also track the number of rows in the sort and space used,
+ * to calculate more accurate estimate of row width. And then use that to
+ * calculate the number of rows that fit into work_mem. But the number of rows
+ * added to the tuplesort (per range) would still remain fairly
+ * inaccurate, so I'm not sure how good this would be.
+ */
+static void
+brinsort_adjust_watermark_step(BrinSortState *node, TuplesortInstrumentation *stats)
+{
+ BrinSort *plan = (BrinSort *) node->ss.ps.plan;
+
+ if (brinsort_watermark_step != -1)
+ return;
+
+ if (stats->spaceType == SORT_SPACE_TYPE_DISK)
+ plan->watermark_step--;
+ else if (stats->spaceUsed < work_mem / 3)
+ plan->watermark_step++;
+
+ plan->watermark_step = Max(1, plan->watermark_step);
+ plan->watermark_step = Min(1024, plan->watermark_step);
+}
+
/* ----------------------------------------------------------------
* IndexNext
*
@@ -948,15 +986,20 @@ IndexNext(BrinSortState *node)
*/
if (node->bs_tuplesortstate)
{
+ TuplesortInstrumentation stats;
+
+ tuplesort_reset_stats(node->bs_tuplesortstate);
+
tuplesort_performsort(node->bs_tuplesortstate);
node->bs_stats.sort_count++;
-#ifdef BRINSORT_DEBUG
- {
- TuplesortInstrumentation stats;
+ memset(&stats, 0, sizeof(TuplesortInstrumentation));
+ tuplesort_get_stats(node->bs_tuplesortstate, &stats);
- tuplesort_get_stats(node->bs_tuplesortstate, &stats);
+ brinsort_adjust_watermark_step(node, &stats);
+#ifdef BRINSORT_DEBUG
+ {
if (stats.spaceType == SORT_SPACE_TYPE_DISK)
{
node->bs_stats.sort_count_on_disk++;
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 997c272dec0..dc0a3669df2 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -3375,7 +3375,7 @@ create_brinsort_plan(PlannerInfo *root,
*/
brinsort_plan->watermark_step = brinsort_watermark_step;
- if (brinsort_plan->watermark_step == 0)
+ if (brinsort_plan->watermark_step <= 0)
{
BrinMinmaxStats *amstats;
@@ -3398,12 +3398,9 @@ create_brinsort_plan(PlannerInfo *root,
if (amstats)
{
- double rows_per_step = rows / amstats->minval_ndistinct;
- elog(WARNING, "rows_per_step = %f", rows_per_step);
+ double rows_per_step = Max(1.0, (rows / amstats->minval_ndistinct));
brinsort_plan->watermark_step = (int) (maxrows / rows_per_step);
-
- elog(WARNING, "calculated step = %d", brinsort_plan->watermark_step);
}
brinsort_plan->watermark_step = Max(brinsort_plan->watermark_step, 1);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 9ab51a22db7..b6d4186241f 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3432,7 +3432,7 @@ struct config_int ConfigureNamesInt[] =
NULL
},
&brinsort_watermark_step,
- 0, 0, INT_MAX,
+ 0, -1, INT_MAX,
NULL, NULL, NULL
},
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 416f02ba3cb..c61f27b6fa2 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -2574,6 +2574,18 @@ tuplesort_get_stats(Tuplesortstate *state,
}
}
+/*
+ * tuplesort_reset_stats - reset summary statistics
+ *
+ * This can be called before tuplesort_performsort() starts.
+ */
+void
+tuplesort_reset_stats(Tuplesortstate *state)
+{
+ state->isMaxSpaceDisk = false;
+ state->maxSpace = 0;
+}
+
/*
* Convert TuplesortMethod to a string.
*/
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 86879bed2f4..485b7b2eeb3 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1682,7 +1682,7 @@ typedef struct BrinSortState
* We need two tuplesort instances - one for current range, one for
* spill-over tuples from the overlapping ranges
*/
- void *bs_tuplesortstate;
+ Tuplesortstate *bs_tuplesortstate;
Tuplestorestate *bs_tuplestore;
} BrinSortState;
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 44412749906..897dfeb274f 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -367,6 +367,7 @@ extern void tuplesort_reset(Tuplesortstate *state);
extern void tuplesort_get_stats(Tuplesortstate *state,
TuplesortInstrumentation *stats);
+extern void tuplesort_reset_stats(Tuplesortstate *state);
extern const char *tuplesort_method_name(TuplesortMethod m);
extern const char *tuplesort_space_type_name(TuplesortSpaceType t);
--
2.37.3
On Sat, Oct 15, 2022 at 02:33:50PM +0200, Tomas Vondra wrote:
Of course, if there are e.g. BTREE indexes this is going to be slower,
but people are unlikely to have both index types on the same column.
On Sun, Oct 16, 2022 at 05:48:31PM +0200, Tomas Vondra wrote:
I don't think it's all that unfair. How likely is it to have both a BRIN
and btree index on the same column? And even if you do have such indexes
Note that we (at my work) use unique, btree indexes on multiple columns
for INSERT ON CONFLICT into the most-recent tables: UNIQUE(a,b,c,...),
plus a separate set of indexes on all tables, used for searching:
BRIN(a) and BTREE(b). I'd hope that the costing is accurate enough to
prefer the btree index for searching the most-recent table, if that's
what's faster (for example, if columns b and c are specified).
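Roughly like this (hypothetical names and types):
  CREATE TABLE recent (a timestamptz, b int, c int);
  CREATE UNIQUE INDEX ON recent (a, b, c);   -- for INSERT ON CONFLICT
  CREATE INDEX ON recent USING brin (a);     -- for searching
  CREATE INDEX ON recent (b);                -- for searching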
+ /* There must not be any TID scan in progress yet. */
+ Assert(node->ss.ss_currentScanDesc == NULL);
+
+ /* Initialize the TID range scan, for the provided block range. */
+ if (node->ss.ss_currentScanDesc == NULL)
+ {
Why is this conditional on the condition that was just Assert()ed ?
+void
+cost_brinsort(BrinSortPath *path, PlannerInfo *root, double loop_count,
+              bool partial_path)
It'd be nice to refactor the existing code to avoid this part being so
duplicative.
+ * In some situations (particularly with OR'd index conditions) we may
+ * have scan_clauses that are not equal to, but are logically implied by,
+ * the index quals; so we also try a predicate_implied_by() check to see
Isn't that somewhat expensive ?
If that's known, then it'd be good to say that in the documentation.
+ { + {"enable_brinsort", PGC_USERSET, QUERY_TUNING_METHOD, + gettext_noop("Enables the planner's use of BRIN sort plans."), + NULL, + GUC_EXPLAIN + }, + &enable_brinsort, + false,
I think new GUCs should be enabled during patch development.
Maybe in a separate 0002 patch "for CI only not for commit".
That way "make check" at least has a chance to hit that new code paths.
Also, note that indxpath.c had the var initialized to true.
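Something like this in guc_tables.c for that development-only patch, I guess
(the same entry as above, just defaulting to true):
  &enable_brinsort,
  true,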
+ attno = (i + 1);
+ nranges = (nblocks / pagesPerRange);
+ node->bs_phase = (nullsFirst) ? BRINSORT_LOAD_NULLS : BRINSORT_LOAD_RANGE;
I'm curious why you have parentheses in these places?
+#ifndef NODEBrinSort_H
+#define NODEBrinSort_H
NODEBRIN_SORT would be more consistent with NODEINCREMENTALSORT.
But I'd prefer NODE_* - otherwise it looks like NO DEBRIN.
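I.e. something like:
  #ifndef NODE_BRIN_SORT_H
  #define NODE_BRIN_SORT_H
  /* declarations */
  #endif   /* NODE_BRIN_SORT_H */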
This needed a bunch of work to pass any of the regression tests -
even with the feature set to off.
. meson.build needs the same change as the corresponding ./Makefile.
. guc missing from postgresql.conf.sample
. brin_validate.c is missing support for the opr function.
I gather you're planning on changing this part (?) but this allows the
tests to pass for now.
. mingw is warning about OidIsValid(pointer) in nodeBrinSort.c.
https://cirrus-ci.com/task/5771227447951360?logs=mingw_cross_warning#L969
. Uninitialized catalog attribute.
. Some typos in your other patches: "heuristics heuristics". ste.
lest (least).
--
Justin
Attachments:
Attachment: 0001-Allow-index-AMs-to-build-and-use-custom-statistics.patch (text/x-diff)
From 1db91c1b55d6bc8016274ce880b799081021ab0a Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Mon, 17 Oct 2022 18:39:28 +0200
Subject: [PATCH 1/4] Allow index AMs to build and use custom statistics
Some indexing AMs work very differently and estimating them using
existing statistics is problematic, producing unreliable costing. This
applies e.g. to BRIN, which relies on page ranges, not tuple pointers.
This adds an optional AM procedure, allowing the opfamily to build
custom statistics, store them in pg_statistic and then use them during
planning. By default this is disabled, but may be enabled by setting
SET enable_indexam_stats = true;
Then ANALYZE will call the optional procedure for all indexes.
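For example (with a hypothetical table t that has a BRIN minmax index
t_a_idx on column a):

  SET enable_indexam_stats = true;
  ANALYZE t;

  -- the AM-specific statistics end up in the new pg_statistic.staindexam
  -- column (a bytea), in the rows for the index relation
  SELECT staattnum, staindexam IS NOT NULL AS has_am_stats
    FROM pg_statistic
   WHERE starelid = 't_a_idx'::regclass;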
---
src/backend/access/brin/brin.c | 1 +
src/backend/access/brin/brin_minmax.c | 1332 +++++++++++++++++++++++++
src/backend/commands/analyze.c | 138 ++-
src/backend/utils/adt/selfuncs.c | 59 ++
src/backend/utils/cache/lsyscache.c | 41 +
src/backend/utils/misc/guc_tables.c | 10 +
src/include/access/amapi.h | 2 +
src/include/access/brin.h | 51 +
src/include/access/brin_internal.h | 1 +
src/include/catalog/pg_amproc.dat | 64 ++
src/include/catalog/pg_proc.dat | 4 +
src/include/catalog/pg_statistic.h | 5 +
src/include/commands/vacuum.h | 2 +
src/include/utils/lsyscache.h | 1 +
14 files changed, 1706 insertions(+), 5 deletions(-)
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 20b7d65b948..d2c30336981 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -95,6 +95,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->amstrategies = 0;
amroutine->amsupport = BRIN_LAST_OPTIONAL_PROCNUM;
amroutine->amoptsprocnum = BRIN_PROCNUM_OPTIONS;
+ amroutine->amstatsprocnum = BRIN_PROCNUM_STATISTICS;
amroutine->amcanorder = false;
amroutine->amcanorderbyop = false;
amroutine->amcanbackward = false;
diff --git a/src/backend/access/brin/brin_minmax.c b/src/backend/access/brin/brin_minmax.c
index 9e8a8e056cc..e4c9e56c623 100644
--- a/src/backend/access/brin/brin_minmax.c
+++ b/src/backend/access/brin/brin_minmax.c
@@ -10,17 +10,22 @@
*/
#include "postgres.h"
+#include "access/brin.h"
#include "access/brin_internal.h"
+#include "access/brin_revmap.h"
#include "access/brin_tuple.h"
#include "access/genam.h"
#include "access/stratnum.h"
#include "catalog/pg_amop.h"
#include "catalog/pg_type.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
#include "utils/builtins.h"
#include "utils/datum.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"
#include "utils/syscache.h"
+#include "utils/timestamp.h"
typedef struct MinmaxOpaque
{
@@ -31,6 +36,11 @@ typedef struct MinmaxOpaque
static FmgrInfo *minmax_get_strategy_procinfo(BrinDesc *bdesc, uint16 attno,
Oid subtype, uint16 strategynum);
+/* print debugging into about calculated statistics */
+#define STATS_DEBUG
+
+/* calculate the stats in different ways for cross-checking */
+#define STATS_CROSS_CHECK
Datum
brin_minmax_opcinfo(PG_FUNCTION_ARGS)
@@ -253,6 +263,1328 @@ brin_minmax_union(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+/* FIXME copy of a private struct from brin.c */
+typedef struct BrinOpaque
+{
+ BlockNumber bo_pagesPerRange;
+ BrinRevmap *bo_rmAccess;
+ BrinDesc *bo_bdesc;
+} BrinOpaque;
+
+/*
+ * Compare ranges by minval (collation and operator are taken from the extra
+ * argument, which is expected to be TypeCacheEntry).
+ */
+static int
+range_minval_cmp(const void *a, const void *b, void *arg)
+{
+ BrinRange *ra = *(BrinRange **) a;
+ BrinRange *rb = *(BrinRange **) b;
+ TypeCacheEntry *typentry = (TypeCacheEntry *) arg;
+ FmgrInfo *cmpfunc = &typentry->cmp_proc_finfo;
+ Datum c;
+ int r;
+
+ c = FunctionCall2Coll(cmpfunc, typentry->typcollation,
+ ra->min_value, rb->min_value);
+ r = DatumGetInt32(c);
+
+ if (r != 0)
+ return r;
+
+ if (ra->blkno_start < rb->blkno_start)
+ return -1;
+ else
+ return 1;
+}
+
+/*
+ * Compare ranges by maxval (collation and operator are taken from the extra
+ * argument, which is expected to be TypeCacheEntry).
+ */
+static int
+range_maxval_cmp(const void *a, const void *b, void *arg)
+{
+ BrinRange *ra = *(BrinRange **) a;
+ BrinRange *rb = *(BrinRange **) b;
+ TypeCacheEntry *typentry = (TypeCacheEntry *) arg;
+ FmgrInfo *cmpfunc = &typentry->cmp_proc_finfo;
+ Datum c;
+ int r;
+
+ c = FunctionCall2Coll(cmpfunc, typentry->typcollation,
+ ra->max_value, rb->max_value);
+ r = DatumGetInt32(c);
+
+ if (r != 0)
+ return r;
+
+ if (ra->blkno_start < rb->blkno_start)
+ return -1;
+ else
+ return 1;
+}
+
+/* compare values using an operator from typcache */
+static int
+range_values_cmp(const void *a, const void *b, void *arg)
+{
+ Datum da = * (Datum *) a;
+ Datum db = * (Datum *) b;
+ TypeCacheEntry *typentry = (TypeCacheEntry *) arg;
+ FmgrInfo *cmpfunc = &typentry->cmp_proc_finfo;
+ Datum c;
+
+ c = FunctionCall2Coll(cmpfunc, typentry->typcollation,
+ da, db);
+ return DatumGetInt32(c);
+}
+
+/*
+ * maxval_start
+ * Determine first index so that (maxvalue >= value).
+ *
+ * The array of ranges is expected to be sorted by maxvalue, so this is the first
+ * range that can possibly intersect with range having "value" as minval.
+ */
+static int
+maxval_start(BrinRange **ranges, int nranges, Datum value, TypeCacheEntry *typcache)
+{
+ int start = 0,
+ end = (nranges - 1);
+
+ // everything matches
+ if (range_values_cmp(&value, &ranges[start]->max_value, typcache) <= 0)
+ return 0;
+
+ // no matches
+ if (range_values_cmp(&value, &ranges[end]->max_value, typcache) > 0)
+ return nranges;
+
+ while ((end - start) > 0)
+ {
+ int midpoint;
+ int r;
+
+ midpoint = start + (end - start) / 2;
+
+ r = range_values_cmp(&value, &ranges[midpoint]->max_value, typcache);
+
+ if (r <= 0)
+ end = midpoint;
+ else
+ start = (midpoint + 1);
+ }
+
+ Assert(ranges[start]->max_value >= value);
+ Assert(ranges[start-1]->max_value < value);
+
+ return start;
+}
+
+/*
+ * minval_end
+ * Determine first index so that (minval > value).
+ *
+ * The array of ranges is expected to be sorted by minvalue, so this is the first
+ * range that can't possibly intersect with a range having "value" as maxval.
+ */
+static int
+minval_end(BrinRange **ranges, int nranges, Datum value, TypeCacheEntry *typcache)
+{
+ int start = 0,
+ end = (nranges - 1);
+
+ // everything matches
+ if (range_values_cmp(&value, &ranges[end]->min_value, typcache) >= 0)
+ return nranges;
+
+ // no matches
+ if (range_values_cmp(&value, &ranges[start]->min_value, typcache) < 0)
+ return 0;
+
+ while ((end - start) > 0)
+ {
+ int midpoint;
+ int r;
+
+ midpoint = start + (end - start) / 2;
+
+ r = range_values_cmp(&value, &ranges[midpoint]->min_value, typcache);
+
+ if (r >= 0)
+ start = midpoint + 1;
+ else
+ end = midpoint;
+ }
+
+ Assert(ranges[start]->min_value > value);
+ Assert(ranges[start-1]->min_value <= value);
+
+ return start;
+}
+
+
+/*
+ * lower_bound
+ * Determine first index so that (values[index] >= value).
+ *
+ * The values array is expected to be sorted, so this is the first element
+ * that is not smaller than the given value.
+ */
+static int
+lower_bound(Datum *values, int nvalues, Datum value, TypeCacheEntry *typcache)
+{
+ int start = 0,
+ end = (nvalues - 1);
+
+ // everything matches
+ if (range_values_cmp(&value, &values[start], typcache) <= 0)
+ return 0;
+
+ // no matches
+ if (range_values_cmp(&value, &values[end], typcache) > 0)
+ return nvalues;
+
+ while ((end - start) > 0)
+ {
+ int midpoint;
+ int r;
+
+ midpoint = start + (end - start) / 2;
+
+ r = range_values_cmp(&value, &values[midpoint], typcache);
+
+ if (r <= 0)
+ end = midpoint;
+ else
+ start = (midpoint + 1);
+ }
+
+ Assert(values[start] >= value);
+ Assert(values[start-1] < value);
+
+ return start;
+}
+
+/*
+ * upper_bound
+ * Determine first index so that (values[index] > value).
+ *
+ * The values array is expected to be sorted, so this is the first element
+ * that is greater than the given value.
+ */
+static int
+upper_bound(Datum *values, int nvalues, Datum value, TypeCacheEntry *typcache)
+{
+ int start = 0,
+ end = (nvalues - 1);
+
+ // everything matches
+ if (range_values_cmp(&value, &values[end], typcache) >= 0)
+ return nvalues;
+
+ // no matches
+ if (range_values_cmp(&value, &values[start], typcache) < 0)
+ return 0;
+
+ while ((end - start) > 0)
+ {
+ int midpoint;
+ int r;
+
+ midpoint = start + (end - start) / 2;
+
+ r = range_values_cmp(&value, &values[midpoint], typcache);
+
+ if (r >= 0)
+ start = midpoint + 1;
+ else
+ end = midpoint;
+ }
+
+ Assert(values[start] > value);
+ Assert(values[start-1] <= value);
+
+ return start;
+}
+
+/*
+ * Simple histogram, with bins tracking value and two overlap counts.
+ *
+ * XXX Maybe we should have two separate histograms, one for all counts and
+ * another one for "unique" values.
+ *
+ * XXX Serialize the histogram. There might be a data set where we have very
+ * many distinct buckets (values having very different number of matching
+ * ranges) - not sure if there's some sort of upper limit (but hard to say for
+ * other opclasses, like bloom). And we don't want arbitrarily large histogram,
+ * to keep the statistics fairly small, I guess. So we'd need to pick a subset,
+ * merge buckets with "similar" counts, or approximate it somehow. For now we
+ * don't serialize it, because we don't use the histogram.
+ */
+typedef struct histogram_bin_t
+{
+ int value;
+ int count;
+} histogram_bin_t;
+
+typedef struct histogram_t
+{
+ int nbins;
+ int nbins_max;
+ histogram_bin_t bins[FLEXIBLE_ARRAY_MEMBER];
+} histogram_t;
+
+#define HISTOGRAM_BINS_START 32
+
+/* allocate histogram with default number of bins */
+static histogram_t *
+histogram_init(void)
+{
+ histogram_t *hist;
+
+ hist = (histogram_t *) palloc0(offsetof(histogram_t, bins) +
+ sizeof(histogram_bin_t) * HISTOGRAM_BINS_START);
+ hist->nbins_max = HISTOGRAM_BINS_START;
+
+ return hist;
+}
+
+/*
+ * histogram_add
+ * Add a hit for a particular value to the histogram.
+ *
+ * XXX We don't sort the bins, so just do a linear search. For a large number of
+ * distinct values this might be an issue, for a small number a linear search is fine.
+ */
+static histogram_t *
+histogram_add(histogram_t *hist, int value)
+{
+ bool found = false;
+ histogram_bin_t *bin;
+
+ for (int i = 0; i < hist->nbins; i++)
+ {
+ if (hist->bins[i].value == value)
+ {
+ bin = &hist->bins[i];
+ found = true;
+ }
+ }
+
+ if (!found)
+ {
+ if (hist->nbins == hist->nbins_max)
+ {
+ int nbins = (2 * hist->nbins_max);
+ hist = repalloc(hist, offsetof(histogram_t, bins) +
+ sizeof(histogram_bin_t) * nbins);
+ hist->nbins_max = nbins;
+ }
+
+ Assert(hist->nbins < hist->nbins_max);
+
+ bin = &hist->bins[hist->nbins++];
+ bin->value = value;
+ bin->count = 0;
+ }
+
+ bin->count += 1;
+
+ Assert(bin->value == value);
+ Assert(bin->count >= 0);
+
+ return hist;
+}
+
+/* used to sort histogram bins by value */
+static int
+histogram_bin_cmp(const void *a, const void *b)
+{
+ histogram_bin_t *ba = (histogram_bin_t *) a;
+ histogram_bin_t *bb = (histogram_bin_t *) b;
+
+ if (ba->value < bb->value)
+ return -1;
+
+ if (bb->value < ba->value)
+ return 1;
+
+ return 0;
+}
+
+static void
+histogram_print(histogram_t *hist)
+{
+ return;
+
+ elog(WARNING, "----- histogram -----");
+ for (int i = 0; i < hist->nbins; i++)
+ {
+ elog(WARNING, "bin %d value %d count %d",
+ i, hist->bins[i].value, hist->bins[i].count);
+ }
+}
+
+/*
+ * brin_minmax_count_overlaps
+ * Calculate number of overlaps.
+ *
+ * This uses the minranges to quickly eliminate ranges that can't possibly
+ * intersect. We simply walk minranges until minval > current maxval, and
+ * we're done.
+ *
+ * Unlike brin_minmax_count_overlaps2, this does not have issues with wide
+ * ranges, so this is what we should use.
+ */
+static int
+brin_minmax_count_overlaps(BrinRange **minranges, int nranges, TypeCacheEntry *typcache)
+{
+ int noverlaps;
+
+#ifdef STATS_DEBUG
+ TimestampTz start_ts = GetCurrentTimestamp();
+#endif
+
+ noverlaps = 0;
+ for (int i = 0; i < nranges; i++)
+ {
+ Datum maxval = minranges[i]->max_value;
+
+ /*
+ * Determine index of the first range with (minval > current maxval)
+ * by binary search. We know all other ranges can't overlap the
+ * current one. We simply subtract indexes to count ranges.
+ */
+ int idx = minval_end(minranges, nranges, maxval, typcache);
+
+ /* -1 because we don't count the range as intersecting with itself */
+ noverlaps += (idx - i - 1);
+ }
+
+ /*
+ * We only count 1/2 the ranges (minval > current minval), so the total
+ * number of overlaps is twice what we counted.
+ */
+ noverlaps *= 2;
+
+#ifdef STATS_DEBUG
+ elog(WARNING, "----- brin_minmax_count_overlaps -----");
+ elog(WARNING, "noverlaps = %d", noverlaps);
+ elog(WARNING, "duration = %ld", TimestampDifferenceMilliseconds(start_ts,
+ GetCurrentTimestamp()));
+#endif
+
+ return noverlaps;
+}
+
+#ifdef STATS_CROSS_CHECK
+/*
+ * brin_minmax_count_overlaps2
+ * Calculate number of overlaps.
+ *
+ * This uses the minranges/maxranges to quickly eliminate ranges that can't
+ * possibly intersect.
+ *
+ * XXX Seems rather complicated and works poorly for wide ranges (with outlier
+ * values), brin_minmax_count_overlaps is likely better.
+ */
+static int
+brin_minmax_count_overlaps2(BrinRanges *ranges,
+ BrinRange **minranges, BrinRange **maxranges,
+ TypeCacheEntry *typcache)
+{
+ int noverlaps;
+
+ TimestampTz start_ts = GetCurrentTimestamp();
+
+ /*
+ * Walk the ranges ordered by max_values, see how many ranges overlap.
+ *
+ * Once we get to a state where (min_value > current.max_value) for
+ * all future ranges, we know none of them can overlap and we can
+ * terminate. This is what min_index_lowest is for.
+ *
+ * XXX If there are very wide ranges (with outlier min/max values),
+ * the min_index_lowest is going to be pretty useless, because the
+ * range will be sorted at the very end by max_value, but will have
+ * very low min_index, so this won't work.
+ *
+ * XXX We could collect a more elaborate stuff, like for example a
+ * histogram of number of overlaps, or maximum number of overlaps.
+ * So we'd have average, but then also an info if there are some
+ * ranges with very many overlaps.
+ */
+ noverlaps = 0;
+ for (int i = 0; i < ranges->nranges; i++)
+ {
+ int idx = i+1;
+ BrinRange *ra = maxranges[i];
+ uint64 min_index = ra->min_index;
+
+ CHECK_FOR_INTERRUPTS();
+
+#ifdef NOT_USED
+ /*
+ * XXX Not needed, we can just count "future" ranges and then
+ * we just multiply by 2.
+ */
+
+ /*
+ * What's the first range that might overlap with this one?
+ * needs to have maxval > current.minval.
+ */
+ while (idx > 0)
+ {
+ BrinRange *rb = maxranges[idx - 1];
+
+ /* the range is before the current one, so can't intersect */
+ if (range_values_cmp(&rb->max_value, &ra->min_value, typcache) < 0)
+ break;
+
+ idx--;
+ }
+#endif
+
+ /*
+ * Find the first min_index that is higher than the max_value,
+ * so that we can compare that instead of the values in the
+ * next loop. There should be fewer value comparisons than in
+ * the next loop, so we'll save on function calls.
+ */
+ while (min_index < ranges->nranges)
+ {
+ if (range_values_cmp(&minranges[min_index]->min_value,
+ &ra->max_value, typcache) > 0)
+ break;
+
+ min_index++;
+ }
+
+ /*
+ * Walk the following ranges (ordered by max_value), and check
+ * if it overlaps. If it matches, we look at the next one. If
+ * not, we check if there can be more ranges.
+ */
+ for (int j = idx; j < ranges->nranges; j++)
+ {
+ BrinRange *rb = maxranges[j];
+
+ /* the range overlaps - just continue with the next one */
+ // if (range_values_cmp(&rb->min_value, &ra->max_value, typcache) <= 0)
+ if (rb->min_index < min_index)
+ {
+ noverlaps++;
+ continue;
+ }
+
+ /*
+ * Are there any future ranges that might overlap? We can
+ * check the min_index_lowest to decide quickly.
+ */
+ if (rb->min_index_lowest >= min_index)
+ break;
+ }
+ }
+
+ /*
+ * We only count intersect for "following" ranges when ordered by maxval,
+ * so we only see 1/2 the overlaps. So double the result.
+ */
+ noverlaps *= 2;
+
+ elog(WARNING, "----- brin_minmax_count_overlaps2 -----");
+ elog(WARNING, "noverlaps = %d", noverlaps);
+ elog(WARNING, "duration = %ld", TimestampDifferenceMilliseconds(start_ts,
+ GetCurrentTimestamp()));
+
+ return noverlaps;
+}
+
+/*
+ * brin_minmax_count_overlaps_bruteforce
+ * Calculate number of overlaps by brute force.
+ *
+ * Actually compares every range to every other range. Quite expensive, used
+ * primarily to cross-check the other algorithms.
+ */
+static int
+brin_minmax_count_overlaps_bruteforce(BrinRanges *ranges, TypeCacheEntry *typcache)
+{
+ int noverlaps;
+
+ TimestampTz start_ts = GetCurrentTimestamp();
+
+ /*
+ * Brute force calculation of overlapping ranges, comparing each
+ * range to every other range - bound to be pretty expensive, as
+ * it's pretty much O(N^2). Kept mostly for easy cross-check with
+ * the preceding "optimized" code.
+ */
+ noverlaps = 0;
+ for (int i = 0; i < ranges->nranges; i++)
+ {
+ BrinRange *ra = &ranges->ranges[i];
+
+ for (int j = 0; j < ranges->nranges; j++)
+ {
+ BrinRange *rb = &ranges->ranges[j];
+
+ CHECK_FOR_INTERRUPTS();
+
+ if (i == j)
+ continue;
+
+ if (range_values_cmp(&ra->max_value, &rb->min_value, typcache) < 0)
+ continue;
+
+ if (range_values_cmp(&rb->max_value, &ra->min_value, typcache) < 0)
+ continue;
+
+ elog(DEBUG1, "[%ld,%ld] overlaps [%ld,%ld]",
+ ra->min_value, ra->max_value,
+ rb->min_value, rb->max_value);
+
+ noverlaps++;
+ }
+ }
+
+ elog(WARNING, "----- brin_minmax_count_overlaps_bruteforce -----");
+ elog(WARNING, "noverlaps = %d", noverlaps);
+ elog(WARNING, "duration = %ld", TimestampDifferenceMilliseconds(start_ts,
+ GetCurrentTimestamp()));
+
+ return noverlaps;
+}
+#endif
+
+/*
+ * brin_minmax_match_tuples_to_ranges
+ * Match tuples to ranges, count average number of ranges per tuple.
+ *
+ * Alternative to brin_minmax_match_tuples_to_ranges2, leveraging ordering
+ * of values, not ranges.
+ *
+ * XXX This seems like the optimal way to do this.
+ */
+static void
+brin_minmax_match_tuples_to_ranges(BrinRanges *ranges,
+ int numrows, HeapTuple *rows,
+ int nvalues, Datum *values,
+ TypeCacheEntry *typcache,
+ int *res_nmatches,
+ int *res_nmatches_unique,
+ int *res_nvalues_unique)
+{
+ int nmatches = 0;
+ int nmatches_unique = 0;
+ int nvalues_unique = 0;
+ int nmatches_value = 0;
+
+ int *unique = (int *) palloc0(sizeof(int) * nvalues);
+
+#ifdef STATS_DEBUG
+ TimestampTz start_ts = GetCurrentTimestamp();
+#endif
+
+ /*
+ * Build running count of unique values. We know there are unique[i]
+ * unique values in values array up to index "i".
+ */
+ unique[0] = 1;
+ for (int i = 1; i < nvalues; i++)
+ {
+ if (range_values_cmp(&values[i-1], &values[i], typcache) == 0)
+ unique[i] = unique[i-1];
+ else
+ unique[i] = unique[i-1] + 1;
+ }
+
+ nvalues_unique = unique[nvalues-1];
+
+ /*
+ * Walk the ranges, for each range determine the first/last mapping
+ * value. Use the "unique" array to count the unique values.
+ */
+ for (int i = 0; i < ranges->nranges; i++)
+ {
+ int start;
+ int end;
+
+ CHECK_FOR_INTERRUPTS();
+
+ start = lower_bound(values, nvalues, ranges->ranges[i].min_value, typcache);
+ end = upper_bound(values, nvalues, ranges->ranges[i].max_value, typcache);
+
+ Assert(end > start);
+
+ nmatches_value = (end - start);
+ nmatches_unique += (unique[end-1] - unique[start] + 1);
+
+ nmatches += nmatches_value;
+ }
+
+#ifdef STATS_DEBUG
+ elog(WARNING, "----- brin_minmax_match_tuples_to_ranges -----");
+ elog(WARNING, "nmatches = %d %f", nmatches, (double) nmatches / numrows);
+ elog(WARNING, "nmatches unique = %d %d %f", nmatches_unique, nvalues_unique,
+ (double) nmatches_unique / nvalues_unique);
+ elog(WARNING, "duration = %ld", TimestampDifferenceMilliseconds(start_ts,
+ GetCurrentTimestamp()));
+#endif
+
+ *res_nmatches = nmatches;
+ *res_nmatches_unique = nmatches_unique;
+ *res_nvalues_unique = nvalues_unique;
+}
+
+#ifdef STATS_CROSS_CHECK
+/*
+ * brin_minmax_match_tuples_to_ranges2
+ * Match tuples to ranges, count average number of ranges per tuple.
+ *
+ * Match sample tuples to the ranges, so that we can count how many ranges
+ * a value matches on average. This might seem redundant to the number of
+ * overlaps, because the value is ~avg_overlaps/2.
+ *
+ * Imagine ranges "shifted" uniformly by 1/overlaps, e.g. with 3
+ * overlaps [0,100], [33,133], [66,166] and so on. A random value will hit
+ * only half of these ranges, thus 1/2. This can be extended to randomly
+ * overlapping ranges.
+ *
+ * However, we may not be able to count overlaps for some opclasses (e.g. for
+ * bloom ranges), in which case we have at least this.
+ *
+ * This simply walks the values, and determines matching ranges by looking
+ * for lower/upper bound in ranges ordered by minval/maxval.
+ *
+ * XXX The other question is what to do about duplicate values. If we have a
+ * very frequent value in the sample, it's likely in many places/ranges. Which
+ * will skew the average, because it'll be added repeatedly. So we also count
+ * avg_ranges for unique values.
+ *
+ * XXX The relationship that (average_matches ~ average_overlaps/2) only
+ * works for minmax opclass, and can't be extended to minmax-multi. The
+ * overlaps can only consider the two extreme values (essentially treating
+ * the summary as a single minmax range), because that's what brinsort
+ * needs. But the minmax-multi range may have "gaps" (kinda the whole point
+ * of these opclasses), which affects matching tuples to ranges.
+ *
+ * XXX This also builds histograms of the number of matches, both for the
+ * raw and unique values. At the moment we don't do anything with the
+ * results, though (except for printing those).
+ */
+static void
+brin_minmax_match_tuples_to_ranges2(BrinRanges *ranges,
+ BrinRange **minranges, BrinRange **maxranges,
+ int numrows, HeapTuple *rows,
+ int nvalues, Datum *values,
+ TypeCacheEntry *typcache,
+ int *res_nmatches,
+ int *res_nmatches_unique,
+ int *res_nvalues_unique)
+{
+ int nmatches = 0;
+ int nmatches_unique = 0;
+ int nvalues_unique = 0;
+ histogram_t *hist = histogram_init();
+ histogram_t *hist_unique = histogram_init();
+ int nmatches_value = 0;
+
+ TimestampTz start_ts = GetCurrentTimestamp();
+
+ for (int i = 0; i < nvalues; i++)
+ {
+ int start;
+ int end;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Same value as preceding, so just use the preceding count.
+ * We don't increment the unique counters, because this is
+ * a duplicate.
+ */
+ if ((i > 0) && (range_values_cmp(&values[i-1], &values[i], typcache) == 0))
+ {
+ nmatches += nmatches_value;
+ hist = histogram_add(hist, nmatches_value);
+ continue;
+ }
+
+ nmatches_value = 0;
+
+ start = maxval_start(maxranges, ranges->nranges, values[i], typcache);
+ end = minval_end(minranges, ranges->nranges, values[i], typcache);
+
+ for (int j = start; j < ranges->nranges; j++)
+ {
+ if (maxranges[j]->min_index >= end)
+ continue;
+
+ if (maxranges[j]->min_index_lowest >= end)
+ break;
+
+ nmatches_value++;
+ }
+
+ hist = histogram_add(hist, nmatches_value);
+ hist_unique = histogram_add(hist_unique, nmatches_value);
+
+ nmatches += nmatches_value;
+ nmatches_unique += nmatches_value;
+ nvalues_unique++;
+ }
+
+ elog(WARNING, "----- brin_minmax_match_tuples_to_ranges2 -----");
+ elog(WARNING, "nmatches = %d %f", nmatches, (double) nmatches / numrows);
+ elog(WARNING, "nmatches unique = %d %d %f",
+ nmatches_unique, nvalues_unique, (double) nmatches_unique / nvalues_unique);
+ elog(WARNING, "duration = %ld", TimestampDifferenceMilliseconds(start_ts,
+ GetCurrentTimestamp()));
+
+ pg_qsort(hist->bins, hist->nbins, sizeof(histogram_bin_t), histogram_bin_cmp);
+ pg_qsort(hist_unique->bins, hist_unique->nbins, sizeof(histogram_bin_t), histogram_bin_cmp);
+
+ histogram_print(hist);
+ histogram_print(hist_unique);
+
+ pfree(hist);
+ pfree(hist_unique);
+
+ *res_nmatches = nmatches;
+ *res_nmatches_unique = nmatches_unique;
+ *res_nvalues_unique = nvalues_unique;
+}
+
+/*
+ * brin_minmax_match_tuples_to_ranges_bruteforce
+ * Match tuples to ranges, count average number of ranges per tuple.
+ *
+ * Bruteforce approach, used mostly for cross-checking.
+ */
+static void
+brin_minmax_match_tuples_to_ranges_bruteforce(BrinRanges *ranges,
+ int numrows, HeapTuple *rows,
+ int nvalues, Datum *values,
+ TypeCacheEntry *typcache,
+ int *res_nmatches,
+ int *res_nmatches_unique,
+ int *res_nvalues_unique)
+{
+ int nmatches = 0;
+ int nmatches_unique = 0;
+ int nvalues_unique = 0;
+
+ TimestampTz start_ts = GetCurrentTimestamp();
+
+ for (int i = 0; i < nvalues; i++)
+ {
+ bool is_unique;
+ int nmatches_value = 0;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* is this a new value? */
+ is_unique = ((i == 0) || (range_values_cmp(&values[i-1], &values[i], typcache) != 0));
+
+ /* count unique values */
+ nvalues_unique += (is_unique) ? 1 : 0;
+
+ for (int j = 0; j < ranges->nranges; j++)
+ {
+ if (range_values_cmp(&values[i], &ranges->ranges[j].min_value, typcache) < 0)
+ continue;
+
+ if (range_values_cmp(&values[i], &ranges->ranges[j].max_value, typcache) > 0)
+ continue;
+
+ nmatches_value++;
+ }
+
+ nmatches += nmatches_value;
+ nmatches_unique += (is_unique) ? nmatches_value : 0;
+ }
+
+ elog(WARNING, "----- brin_minmax_match_tuples_to_ranges_bruteforce -----");
+ elog(WARNING, "nmatches = %d %f", nmatches, (double) nmatches / numrows);
+ elog(WARNING, "nmatches unique = %d %d %f", nmatches_unique, nvalues_unique,
+ (double) nmatches_unique / nvalues_unique);
+ elog(WARNING, "duration = %ld", TimestampDifferenceMilliseconds(start_ts,
+ GetCurrentTimestamp()));
+
+ *res_nmatches = nmatches;
+ *res_nmatches_unique = nmatches_unique;
+ *res_nvalues_unique = nvalues_unique;
+}
+#endif
+
+/*
+ * brin_minmax_value_stats
+ * Calculate statistics about minval/maxval values.
+ *
+ * We calculate the number of distinct values, and also correlation with respect
+ * to blkno_start. We don't calculate the regular correlation coefficient, because
+ * our goal is to estimate how sequential the accesses are. The regular correlation
+ * would produce 0 for cyclical data sets like mod(i,1000000), even though the
+ * access may be quite sequential. Maybe it should be called differently, not correlation?
+ *
+ * XXX Maybe this should calculate minval vs. maxval correlation too?
+ *
+ * XXX I don't know how important the sequentiality is - BRIN generally uses 1MB
+ * page ranges, which is pretty sequential and the one random seek in between is
+ * likely going to be negligible. Maybe for small page ranges it'll matter, though.
+ */
+static void
+brin_minmax_value_stats(BrinRange **minranges, BrinRange **maxranges,
+ int nranges, TypeCacheEntry *typcache,
+ double *minval_correlation, int64 *minval_ndistinct,
+ double *maxval_correlation, int64 *maxval_ndistinct)
+{
+ /* */
+ int64 minval_ndist = 1,
+ maxval_ndist = 1,
+ minval_corr = 0,
+ maxval_corr = 0;
+
+ for (int i = 1; i < nranges; i++)
+ {
+ if (range_values_cmp(&minranges[i-1]->min_value, &minranges[i]->min_value, typcache) != 0)
+ minval_ndist++;
+
+ if (range_values_cmp(&maxranges[i-1]->max_value, &maxranges[i]->max_value, typcache) != 0)
+ maxval_ndist++;
+
+ /* is it immediately sequential? */
+ if (minranges[i-1]->blkno_end + 1 == minranges[i]->blkno_start)
+ minval_corr++;
+
+ /* is it immediately sequential? */
+ if (maxranges[i-1]->blkno_end + 1 == maxranges[i]->blkno_start)
+ maxval_corr++;
+ }
+
+ *minval_ndistinct = minval_ndist;
+ *maxval_ndistinct = maxval_ndist;
+
+ *minval_correlation = (double) minval_corr / nranges;
+ *maxval_correlation = (double) maxval_corr / nranges;
+
+#ifdef STATS_DEBUG
+ elog(WARNING, "----- brin_minmax_value_stats -----");
+ elog(WARNING, "minval ndistinct %ld correlation %f",
+ *minval_ndistinct, *minval_correlation);
+
+ elog(WARNING, "maxval ndistinct %ld correlation %f",
+ *maxval_ndistinct, *maxval_correlation);
+#endif
+}
+
+/*
+ * brin_minmax_stats
+ * Calculate custom statistics for a BRIN minmax index.
+ *
+ * At the moment this calculates:
+ *
+ * - number of summarized/not-summarized and all/has nulls ranges
+ * - average number of overlaps for a range
+ * - average number of rows matching a range
+ * - number of distinct minval/maxval values
+ *
+ * There are multiple ways to calculate some of the metrics, so to allow
+ * cross-checking during development it's possible to run and compare all.
+ * To do that, define STATS_CROSS_CHECK. There's also STATS_DEBUG define
+ * that simply prints the calculated results.
+ *
+ * XXX This could also calculate correlation of the range minval, so that
+ * we can estimate how much random I/O will happen during the BrinSort.
+ * And perhaps we should also sort the ranges by (minval,block_start) to
+ * make this as sequential as possible?
+ *
+ * XXX Another interesting statistics might be the number of ranges with
+ * the same minval (or number of distinct minval values), because that's
+ * essentially what we need to estimate how many ranges will be read in
+ * one brinsort step. In fact, knowing the number of distinct minval
+ * values tells us the number of BrinSort loops.
+ *
+ * XXX We might also calculate a histogram of minval/maxval values.
+ *
+ * XXX I wonder if we could track for each range track probabilities:
+ *
+ * - P1 = P(v <= minval)
+ * - P2 = P(x <= Max(maxval)) for Max(maxval) over preceding ranges
+ *
+ * That would allow us to estimate how many ranges we'll have to read to produce
+ * a particular number of rows, because we need the first probability to exceed
+ * the requested number of rows (fraction of the table):
+ *
+ * (limit rows / reltuples) <= P(v <= minval)
+ *
+ * and then the second probability would say how many rows we'll process (either
+ * sort or spill). And inversely for the DESC ordering.
+ *
+ * The difference between P1 for two ranges is how much we'd have to sort
+ * if we moved the watermark between the ranges (first minval to second one).
+ * The (P2 - P1) for the new watermark range measures the number of rows in
+ * the tuplestore. We'll need to aggregate this, though, we can't keep the
+ * whole data - probably average/median/max for the differences would be nice.
+ * Might be tricky for different watermark step values, though.
+ *
+ * This would also allow estimating how many rows will spill from each range,
+ * because we have an estimate how many rows match a range on average, and
+ * we can compare it to the difference between P1.
+ *
+ * One issue is we don't have actual tuples from the ranges, so we can't
+ * measure exactly how many rows we would add. But we can match the sample
+ * and at least estimate the probability difference.
+ */
+Datum
+brin_minmax_stats(PG_FUNCTION_ARGS)
+{
+ Relation heapRel = (Relation) PG_GETARG_POINTER(0);
+ Relation indexRel = (Relation) PG_GETARG_POINTER(1);
+ AttrNumber attnum = PG_GETARG_INT16(2);
+ AttrNumber heap_attnum = PG_GETARG_INT16(3);
+ HeapTuple *rows = (HeapTuple *) PG_GETARG_POINTER(4);
+ int numrows = PG_GETARG_INT32(5);
+
+ BrinOpaque *opaque;
+ BlockNumber nblocks;
+ BlockNumber nranges;
+ BlockNumber heapBlk;
+ BrinMemTuple *dtup;
+ BrinTuple *btup = NULL;
+ Size btupsz = 0;
+ Buffer buf = InvalidBuffer;
+ BrinRanges *ranges;
+ BlockNumber pagesPerRange;
+ BrinDesc *bdesc;
+ BrinMinmaxStats *stats;
+
+ Oid typoid;
+ TypeCacheEntry *typcache;
+ BrinRange **minranges,
+ **maxranges;
+ int64 noverlaps;
+ int64 prev_min_index;
+
+ /*
+ * Mostly what brinbeginscan does to initialize BrinOpaque, except that
+ * we use active snapshot instead of the scan snapshot.
+ */
+ opaque = palloc_object(BrinOpaque);
+ opaque->bo_rmAccess = brinRevmapInitialize(indexRel,
+ &opaque->bo_pagesPerRange,
+ GetActiveSnapshot());
+ opaque->bo_bdesc = brin_build_desc(indexRel);
+
+ bdesc = opaque->bo_bdesc;
+ pagesPerRange = opaque->bo_pagesPerRange;
+
+ /* make sure the provided attnum is valid */
+ Assert((attnum > 0) && (attnum <= bdesc->bd_tupdesc->natts));
+
+ /*
+ * We need to know the size of the table so that we know how long to iterate
+ * on the revmap (and to pre-allocate the arrays).
+ */
+ nblocks = RelationGetNumberOfBlocks(heapRel);
+
+ /*
+ * How many ranges can there be? We simply look at the number of pages,
+ * divide it by the pages_per_range.
+ *
+ * XXX We need to be careful not to overflow nranges, so we just divide
+ * and then maybe add 1 for partial ranges.
+ */
+ nranges = (nblocks / pagesPerRange);
+ if (nblocks % pagesPerRange != 0)
+ nranges += 1;
+
+ /* allocate for space, and also for the alternative ordering */
+ ranges = palloc0(offsetof(BrinRanges, ranges) + nranges * sizeof(BrinRange));
+ ranges->nranges = 0;
+
+ /* allocate an initial in-memory tuple, out of the per-range memcxt */
+ dtup = brin_new_memtuple(bdesc);
+
+ /* result stats */
+ stats = palloc0(sizeof(BrinMinmaxStats));
+ SET_VARSIZE(stats, sizeof(BrinMinmaxStats));
+
+ /*
+ * Now scan the revmap. We start by querying for heap page 0,
+ * incrementing by the number of pages per range; this gives us a full
+ * view of the table.
+ *
+ * XXX We count the ranges, and count the special types (not summarized,
+ * all-null and has-null). The regular ranges are accumulated into an
+ * array, so that we can calculate additional statistics (overlaps, hits
+ * for sample tuples, etc).
+ *
+ * XXX This needs rethinking to make it work with large indexes with more
+ * ranges than we can fit into memory (work_mem/maintenance_work_mem).
+ */
+ for (heapBlk = 0; heapBlk < nblocks; heapBlk += pagesPerRange)
+ {
+ bool gottuple = false;
+ BrinTuple *tup;
+ OffsetNumber off;
+ Size size;
+
+ stats->n_ranges++;
+
+ CHECK_FOR_INTERRUPTS();
+
+ tup = brinGetTupleForHeapBlock(opaque->bo_rmAccess, heapBlk, &buf,
+ &off, &size, BUFFER_LOCK_SHARE,
+ GetActiveSnapshot());
+ if (tup)
+ {
+ gottuple = true;
+ btup = brin_copy_tuple(tup, size, btup, &btupsz);
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ }
+
+ /* Ranges with no indexed tuple are ignored for overlap analysis. */
+ if (!gottuple)
+ {
+ continue;
+ }
+ else
+ {
+ dtup = brin_deform_tuple(bdesc, btup, dtup);
+ if (dtup->bt_placeholder)
+ {
+ /* Placeholders can be ignored too, as if not summarized. */
+ continue;
+ }
+ else
+ {
+ BrinValues *bval;
+
+ bval = &dtup->bt_columns[attnum - 1];
+
+ /* OK this range is summarized */
+ stats->n_summarized++;
+
+ if (bval->bv_allnulls)
+ stats->n_all_nulls++;
+
+ if (bval->bv_hasnulls)
+ stats->n_has_nulls++;
+
+ if (!bval->bv_allnulls)
+ {
+ BrinRange *range;
+
+ range = &ranges->ranges[ranges->nranges++];
+
+ range->blkno_start = heapBlk;
+ range->blkno_end = heapBlk + (pagesPerRange - 1);
+
+ range->min_value = bval->bv_values[0];
+ range->max_value = bval->bv_values[1];
+ }
+ }
+ }
+ }
+
+ if (buf != InvalidBuffer)
+ ReleaseBuffer(buf);
+
+ elog(WARNING, "extracted ranges %d from BRIN index", ranges->nranges);
+
+ /* if we have no regular ranges, we're done */
+ if (ranges->nranges == 0)
+ goto cleanup;
+
+ /*
+ * Build auxiliary info to optimize the calculation.
+ *
+ * We have ranges in the blocknum order, but that is not very useful when
+ * calculating which ranges interstect - we could cross-check every range
+ * against every other range, but that's O(N^2) and thus may get extremely
+ * expensive pretty quick).
+ *
+ * To make that cheaper, we'll build two orderings, allowing us to quickly
+ * eliminate ranges that can't possibly overlap:
+ *
+ * - minranges = ranges ordered by min_value
+ * - maxranges = ranges ordered by max_value
+ *
+ * To count intersections, we'll then walk maxranges (i.e. ranges ordered
+ * by maxval), and for each following range we'll check if it overlaps.
+ * If yes, we'll proceed to the next one, until we find a range that does
+ * not overlap. But there might be a later page overlapping - but we can
+ * use a min_index_lowest tracking the minimum min_index for "future"
+ * ranges to quickly decide if there are such ranges. If there are none,
+ * we can terminate (and proceed to the next maxranges element), else we
+ * have to process additional ranges.
+ *
+ * Note: This only counts overlaps with ranges with max_value higher than
+ * the current one - we want to count all, but the overlaps with preceding
+ * ranges have already been counted when processing those preceding ranges.
+ * That is, we'll end up with counting each overlap just for one of those
+ * ranges, so we get only 1/2 the count.
+ *
+ * Note: We don't count the range as overlapping with itself. This needs
+ * to be considered later, when applying the statistics.
+ *
+ *
+ * XXX This will not work for very many ranges - we can have up to 2^32 of
+ * them, so allocating a ~32B struct for each would need a lot of memory.
+ * Not sure what to do about that, perhaps we could sample a couple ranges
+ * and do some calculations based on that? That is, we could process all
+ * ranges up to some number (say, statistics_target * 300, as for rows), and
+ * then sample ranges for larger tables. Then sort the sampled ranges, and
+ * walk through all ranges once, comparing them to the sample and counting
+ * overlaps (having them sorted should allow making this quite efficient,
+ * I think - following algorithm similar to the one implemented here).
+ */
+
+ /* info about ordering for the data type */
+ typoid = get_atttype(RelationGetRelid(indexRel), attnum);
+ typcache = lookup_type_cache(typoid, TYPECACHE_CMP_PROC_FINFO);
+
+ /* shouldn't happen, I think - we use this to build the index */
+ Assert(OidIsValid(typcache->cmp_proc_finfo.fn_oid));
+
+ minranges = (BrinRange **) palloc0(ranges->nranges * sizeof(BrinRange *));
+ maxranges = (BrinRange **) palloc0(ranges->nranges * sizeof(BrinRange *));
+
+ /*
+ * Build and sort the ranges min_value / max_value (just pointers
+ * to the main array). Then go and assign the min_index to each
+ * range, and finally walk the maxranges array backwards and track
+ * the min_index_lowest as minimum of "future" indexes.
+ */
+ for (int i = 0; i < ranges->nranges; i++)
+ {
+ minranges[i] = &ranges->ranges[i];
+ maxranges[i] = &ranges->ranges[i];
+ }
+
+ qsort_arg(minranges, ranges->nranges, sizeof(BrinRange *),
+ range_minval_cmp, typcache);
+
+ qsort_arg(maxranges, ranges->nranges, sizeof(BrinRange *),
+ range_maxval_cmp, typcache);
+
+ /*
+ * Update the min_index for each range. If the values are equal, be sure to
+ * pick the lowest index with that min_value.
+ */
+ minranges[0]->min_index = 0;
+ for (int i = 1; i < ranges->nranges; i++)
+ {
+ if (range_values_cmp(&minranges[i]->min_value, &minranges[i-1]->min_value, typcache) == 0)
+ minranges[i]->min_index = minranges[i-1]->min_index;
+ else
+ minranges[i]->min_index = i;
+ }
+
+ /*
+ * Walk the maxranges backward and assign the min_index_lowest as
+ * a running minimum.
+ */
+ prev_min_index = ranges->nranges;
+ for (int i = (ranges->nranges - 1); i >= 0; i--)
+ {
+ maxranges[i]->min_index_lowest = Min(maxranges[i]->min_index,
+ prev_min_index);
+ prev_min_index = maxranges[i]->min_index_lowest;
+ }
+
+ /* calculate average number of overlapping ranges for any range */
+ noverlaps = brin_minmax_count_overlaps(minranges, ranges->nranges, typcache);
+
+ stats->avg_overlaps = (double) noverlaps / ranges->nranges;
+
+#ifdef STATS_CROSS_CHECK
+ brin_minmax_count_overlaps2(ranges, minranges, maxranges, typcache);
+ brin_minmax_count_overlaps_bruteforce(ranges, typcache);
+#endif
+
+ /* calculate minval/maxval stats (distinct values and correlation) */
+ brin_minmax_value_stats(minranges, maxranges,
+ ranges->nranges, typcache,
+ &stats->minval_correlation,
+ &stats->minval_ndistinct,
+ &stats->maxval_correlation,
+ &stats->maxval_ndistinct);
+
+ /* match tuples to ranges */
+ {
+ int nvalues = 0;
+ int nmatches,
+ nmatches_unique,
+ nvalues_unique;
+
+ Datum *values = (Datum *) palloc0(numrows * sizeof(Datum));
+
+ TupleDesc tdesc = RelationGetDescr(heapRel);
+
+ for (int i = 0; i < numrows; i++)
+ {
+ bool isnull;
+ Datum value;
+
+ value = heap_getattr(rows[i], heap_attnum, tdesc, &isnull);
+ if (!isnull)
+ values[nvalues++] = value;
+ }
+
+ qsort_arg(values, nvalues, sizeof(Datum), range_values_cmp, typcache);
+
+ /* optimized algorithm */
+ brin_minmax_match_tuples_to_ranges(ranges,
+ numrows, rows, nvalues, values,
+ typcache,
+ &nmatches,
+ &nmatches_unique,
+ &nvalues_unique);
+
+ stats->avg_matches = (double) nmatches / numrows;
+ stats->avg_matches_unique = (double) nmatches_unique / nvalues_unique;
+
+#ifdef STATS_CROSS_CHECK
+ brin_minmax_match_tuples_to_ranges2(ranges, minranges, maxranges,
+ numrows, rows, nvalues, values,
+ typcache,
+ &nmatches,
+ &nmatches_unique,
+ &nvalues_unique);
+
+ brin_minmax_match_tuples_to_ranges_bruteforce(ranges,
+ numrows, rows,
+ nvalues, values,
+ typcache,
+ &nmatches,
+ &nmatches_unique,
+ &nvalues_unique);
+#endif
+ }
+
+ /*
+ * Possibly quite large, so release explicitly and don't rely
+ * on the memory context to discard this.
+ */
+ pfree(minranges);
+ pfree(maxranges);
+
+cleanup:
+ /* possibly quite large, so release explicitly */
+ pfree(ranges);
+
+ /* free the BrinOpaque, just like brinendscan() would */
+ brinRevmapTerminate(opaque->bo_rmAccess);
+ brin_free_desc(opaque->bo_bdesc);
+
+ PG_RETURN_POINTER(stats);
+}
+
/*
* Cache and return the procedure for the given strategy.
*
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index ff1354812bd..b7435194dc0 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -16,6 +16,7 @@
#include <math.h>
+#include "access/brin_internal.h"
#include "access/detoast.h"
#include "access/genam.h"
#include "access/multixact.h"
@@ -30,6 +31,7 @@
#include "catalog/catalog.h"
#include "catalog/index.h"
#include "catalog/indexing.h"
+#include "catalog/pg_am.h"
#include "catalog/pg_collation.h"
#include "catalog/pg_inherits.h"
#include "catalog/pg_namespace.h"
@@ -81,6 +83,7 @@ typedef struct AnlIndexData
/* Default statistics target (GUC parameter) */
int default_statistics_target = 100;
+bool enable_indexam_stats = false;
/* A few variables that don't seem worth passing around as parameters */
static MemoryContext anl_context = NULL;
@@ -92,7 +95,7 @@ static void do_analyze_rel(Relation onerel,
AcquireSampleRowsFunc acquirefunc, BlockNumber relpages,
bool inh, bool in_outer_xact, int elevel);
static void compute_index_stats(Relation onerel, double totalrows,
- AnlIndexData *indexdata, int nindexes,
+ AnlIndexData *indexdata, Relation *indexRels, int nindexes,
HeapTuple *rows, int numrows,
MemoryContext col_context);
static VacAttrStats *examine_attribute(Relation onerel, int attnum,
@@ -454,15 +457,49 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
{
AnlIndexData *thisdata = &indexdata[ind];
IndexInfo *indexInfo;
+ bool collectAmStats;
+ Oid regproc;
thisdata->indexInfo = indexInfo = BuildIndexInfo(Irel[ind]);
thisdata->tupleFract = 1.0; /* fix later if partial */
- if (indexInfo->ii_Expressions != NIL && va_cols == NIL)
+
+ /*
+ * Should we collect AM-specific statistics for any of the columns?
+ *
+ * If AM-specific statistics are enabled (using a GUC), see if we
+ * have an optional support procedure to build the statistics.
+ *
+ * If there's any such attribute, we just force building stats
+ * even for regular index keys (not just expressions) and indexes
+ * without predicates. It'd be good to only build the AM stats, but
+ * for now this is good enough.
+ *
+ * XXX The GUC is there mostly to make it easier to enable/disable
+ * this during development.
+ *
+ * FIXME Only build the AM statistics, not the other stats. And only
+ * do that for the keys with the optional procedure, not all of them.
+ */
+ collectAmStats = false;
+ if (enable_indexam_stats && (Irel[ind]->rd_indam->amstatsprocnum != 0))
+ {
+ for (int j = 0; j < indexInfo->ii_NumIndexAttrs; j++)
+ {
+ regproc = index_getprocid(Irel[ind], (j+1), Irel[ind]->rd_indam->amstatsprocnum);
+ if (OidIsValid(regproc))
+ {
+ collectAmStats = true;
+ break;
+ }
+ }
+ }
+
+ if ((indexInfo->ii_Expressions != NIL || collectAmStats) && va_cols == NIL)
{
ListCell *indexpr_item = list_head(indexInfo->ii_Expressions);
thisdata->vacattrstats = (VacAttrStats **)
- palloc(indexInfo->ii_NumIndexAttrs * sizeof(VacAttrStats *));
+ palloc0(indexInfo->ii_NumIndexAttrs * sizeof(VacAttrStats *));
tcnt = 0;
for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
{
@@ -483,6 +520,12 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
if (thisdata->vacattrstats[tcnt] != NULL)
tcnt++;
}
+ else
+ {
+ thisdata->vacattrstats[tcnt] =
+ examine_attribute(Irel[ind], i + 1, NULL);
+ tcnt++;
+ }
}
thisdata->attr_cnt = tcnt;
}
@@ -588,7 +631,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
if (nindexes > 0)
compute_index_stats(onerel, totalrows,
- indexdata, nindexes,
+ indexdata, Irel, nindexes,
rows, numrows,
col_context);
@@ -822,12 +865,82 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
anl_context = NULL;
}
+/*
+ * compute_indexam_stats
+ * Call the optional procedure to compute AM-specific statistics.
+ *
+ * We simply call the procedure, which is expected to produce a bytea value.
+ *
+ * At the moment this only deals with BRIN indexes, and bails out for other
+ * access methods, but it should be generic - use something like amoptsprocnum
+ * and just check if the procedure exists.
+ */
+static void
+compute_indexam_stats(Relation onerel,
+ Relation indexRel, IndexInfo *indexInfo,
+ double totalrows, AnlIndexData *indexdata,
+ HeapTuple *rows, int numrows)
+{
+ if (!enable_indexam_stats)
+ return;
+
+ /* ignore index AMs without the optional procedure */
+ if (indexRel->rd_indam->amstatsprocnum == 0)
+ return;
+
+ /*
+ * Look at attributes, and calculate stats for those that have the
+ * optional stats proc for the opfamily.
+ */
+ for (int i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
+ {
+ AttrNumber attno = (i + 1);
+ AttrNumber attnum = indexInfo->ii_IndexAttrNumbers[i]; /* heap attnum */
+ RegProcedure regproc;
+ FmgrInfo *statsproc;
+ Datum datum;
+ VacAttrStats *stats;
+ MemoryContext oldcxt;
+
+ /* do this first, as it doesn't fail when proc not defined */
+ regproc = index_getprocid(indexRel, attno, indexRel->rd_indam->amstatsprocnum);
+
+ /* ignore opclasses without the optional procedure */
+ if (!RegProcedureIsValid(regproc))
+ continue;
+
+ statsproc = index_getprocinfo(indexRel, attno, indexRel->rd_indam->amstatsprocnum);
+
+ stats = indexdata->vacattrstats[i];
+
+ if (statsproc != NULL)
+ elog(WARNING, "collecting stats on BRIN ranges %p using proc %p attnum %d",
+ indexRel, statsproc, attno);
+
+ oldcxt = MemoryContextSwitchTo(stats->anl_context);
+
+ /* call the proc, let the AM calculate whatever it wants */
+ datum = FunctionCall6Coll(statsproc,
+ InvalidOid, /* FIXME correct collation */
+ PointerGetDatum(onerel),
+ PointerGetDatum(indexRel),
+ Int16GetDatum(attno),
+ Int16GetDatum(attnum),
+ PointerGetDatum(rows),
+ Int32GetDatum(numrows));
+
+ stats->staindexam = datum;
+
+ MemoryContextSwitchTo(oldcxt);
+ }
+}
+
/*
* Compute statistics about indexes of a relation
*/
static void
compute_index_stats(Relation onerel, double totalrows,
- AnlIndexData *indexdata, int nindexes,
+ AnlIndexData *indexdata, Relation *indexRels, int nindexes,
HeapTuple *rows, int numrows,
MemoryContext col_context)
{
@@ -847,6 +960,7 @@ compute_index_stats(Relation onerel, double totalrows,
{
AnlIndexData *thisdata = &indexdata[ind];
IndexInfo *indexInfo = thisdata->indexInfo;
+ Relation indexRel = indexRels[ind];
int attr_cnt = thisdata->attr_cnt;
TupleTableSlot *slot;
EState *estate;
@@ -859,6 +973,13 @@ compute_index_stats(Relation onerel, double totalrows,
rowno;
double totalindexrows;
+ /*
+ * If this is a BRIN index, try calling a procedure to collect
+ * extra opfamily-specific statistics (if procedure defined).
+ */
+ compute_indexam_stats(onerel, indexRel, indexInfo, totalrows,
+ thisdata, rows, numrows);
+
/* Ignore index if no columns to analyze and not partial */
if (attr_cnt == 0 && indexInfo->ii_Predicate == NIL)
continue;
@@ -1661,6 +1782,13 @@ update_attstats(Oid relid, bool inh, int natts, VacAttrStats **vacattrstats)
values[Anum_pg_statistic_stanullfrac - 1] = Float4GetDatum(stats->stanullfrac);
values[Anum_pg_statistic_stawidth - 1] = Int32GetDatum(stats->stawidth);
values[Anum_pg_statistic_stadistinct - 1] = Float4GetDatum(stats->stadistinct);
+
+ /* optional AM-specific stats */
+ if (DatumGetPointer(stats->staindexam) != NULL)
+ values[Anum_pg_statistic_staindexam - 1] = stats->staindexam;
+ else
+ nulls[Anum_pg_statistic_staindexam - 1] = true;
+
i = Anum_pg_statistic_stakind1 - 1;
for (k = 0; k < STATISTIC_NUM_SLOTS; k++)
{
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 69e0fb98f5b..9f640adb13c 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -7715,6 +7715,7 @@ brincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
Relation indexRel;
ListCell *l;
VariableStatData vardata;
+ double averageOverlaps;
Assert(rte->rtekind == RTE_RELATION);
@@ -7762,6 +7763,7 @@ brincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
* correlation statistics, we will keep it as 0.
*/
*indexCorrelation = 0;
+ averageOverlaps = 0.0;
foreach(l, path->indexclauses)
{
@@ -7771,6 +7773,36 @@ brincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
/* attempt to lookup stats in relation for this index column */
if (attnum != 0)
{
+ /*
+ * If AM-specific statistics are enabled, try looking up the stats
+ * for the index key. We only have this for minmax opclasses, so
+ * we just cast it like that. But other BRIN opclasses might need
+ * other stats, so either we need to abstract this somehow, or maybe
+ * just collect sufficiently generic stats for all BRIN indexes.
+ *
+ * XXX Make this non-minmax specific.
+ */
+ if (enable_indexam_stats)
+ {
+ BrinMinmaxStats *amstats
+ = (BrinMinmaxStats *) get_attindexam(index->indexoid, attnum);
+
+ if (amstats)
+ {
+ elog(DEBUG1, "found AM stats: attnum %d n_ranges %ld n_summarized %ld n_all_nulls %ld n_has_nulls %ld avg_overlaps %f",
+ attnum, amstats->n_ranges, amstats->n_summarized,
+ amstats->n_all_nulls, amstats->n_has_nulls,
+ amstats->avg_overlaps);
+
+ /*
+ * The only thing we use at the moment is the average number
+ * of overlaps for a single range. Use the other stuff too.
+ */
+ averageOverlaps = Max(averageOverlaps,
+ 1.0 + amstats->avg_overlaps);
+ }
+ }
+
/* Simple variable -- look to stats for the underlying table */
if (get_relation_stats_hook &&
(*get_relation_stats_hook) (root, rte, attnum, &vardata))
@@ -7851,6 +7883,14 @@ brincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
baserel->relid,
JOIN_INNER, NULL);
+ /*
+ * XXX Can we combine qualSelectivity with the average number of matching
+ * ranges per value? qualSelectivity estimates how many tuples we are
+ * going to match, and the average number of matches says how many ranges
+ * each of those will match on average. We don't know how many will
+ * be duplicates, but it gives us a worst-case estimate, at least.
+ */
+
/*
* Now calculate the minimum possible ranges we could match with if all of
* the rows were in the perfect order in the table's heap.
@@ -7867,6 +7907,25 @@ brincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
else
estimatedRanges = Min(minimalRanges / *indexCorrelation, indexRanges);
+ elog(DEBUG1, "before index AM stats: cestimatedRanges = %f", estimatedRanges);
+
+ /*
+ * If we found some AM stats, look at average number of overlapping ranges,
+ * and apply that to the currently estimated ranges.
+ *
+ * XXX We pretty much combine this with correlation info (because it was
+ * already applied in the estimatedRanges formula above), which might be
+ * overly pessimistic. The overlaps stats seem somewhat redundant with
+ * the correlation, so maybe we should use just one? The AM stats seem
+ * like more reliable information, because the correlation is not very
+ * sensitive to outliers, for example. So maybe let's prefer that, and
+ * only use the correlation as fallback when AM stats are not available?
+ */
+ if (averageOverlaps > 0.0)
+ estimatedRanges = Min(estimatedRanges * averageOverlaps, indexRanges);
+
+ elog(DEBUG1, "after index AM stats: cestimatedRanges = %f", estimatedRanges);
+
/* we expect to visit this portion of the table */
selec = estimatedRanges / indexRanges;
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index a16a63f4957..1725f5af347 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -3138,6 +3138,47 @@ get_attavgwidth(Oid relid, AttrNumber attnum)
return 0;
}
+
+/*
+ * get_attindexam
+ *
+ * Given the table and attribute number of a column, get the index AM
+ * statistics. Return NULL if no data is available.
+ *
+ * Currently this is only consulted for individual tables, not for inheritance
+ * trees, so we don't need an "inh" parameter.
+ */
+bytea *
+get_attindexam(Oid relid, AttrNumber attnum)
+{
+ HeapTuple tp;
+
+ tp = SearchSysCache3(STATRELATTINH,
+ ObjectIdGetDatum(relid),
+ Int16GetDatum(attnum),
+ BoolGetDatum(false));
+ if (HeapTupleIsValid(tp))
+ {
+ Datum val;
+ bytea *retval = NULL;
+ bool isnull;
+
+ val = SysCacheGetAttr(STATRELATTINH, tp,
+ Anum_pg_statistic_staindexam,
+ &isnull);
+
+ if (!isnull)
+ retval = (bytea *) PG_DETOAST_DATUM(val);
+
+ // staindexam = ((Form_pg_statistic) GETSTRUCT(tp))->staindexam;
+ ReleaseSysCache(tp);
+
+ return retval;
+ }
+
+ return NULL;
+}
+
/*
* get_attstatsslot
*
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 05ab087934c..06dfeb6cd8b 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -967,6 +967,16 @@ struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexam_stats", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index AM stats."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_indexam_stats,
+ false,
+ NULL, NULL, NULL
+ },
{
{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
gettext_noop("Enables genetic query optimization."),
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 1dc674d2305..8437c2f0e71 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -216,6 +216,8 @@ typedef struct IndexAmRoutine
uint16 amsupport;
/* opclass options support function number or 0 */
uint16 amoptsprocnum;
+ /* opclass statistics support function number or 0 */
+ uint16 amstatsprocnum;
/* does AM support ORDER BY indexed column's value? */
bool amcanorder;
/* does AM support ORDER BY result of an operator on indexed column? */
diff --git a/src/include/access/brin.h b/src/include/access/brin.h
index 887fb0a5532..a7cccac9c90 100644
--- a/src/include/access/brin.h
+++ b/src/include/access/brin.h
@@ -34,6 +34,57 @@ typedef struct BrinStatsData
BlockNumber revmapNumPages;
} BrinStatsData;
+/*
+ * Info about ranges for BRIN Sort.
+ */
+typedef struct BrinRange
+{
+ BlockNumber blkno_start;
+ BlockNumber blkno_end;
+
+ Datum min_value;
+ Datum max_value;
+ bool has_nulls;
+ bool all_nulls;
+ bool not_summarized;
+
+ /*
+ * Index of the range when ordered by min_value (if there are multiple
+ * ranges with the same min_value, it's the lowest one).
+ */
+ uint32 min_index;
+
+ /*
+ * Minimum min_index from all ranges with higher max_value (i.e. when
+ * sorted by max_value). If there are multiple ranges with the same
+ * max_value, it depends on the ordering (i.e. the ranges may get
+ * different min_index_lowest, depending on the exact ordering).
+ */
+ uint32 min_index_lowest;
+} BrinRange;
+
+typedef struct BrinRanges
+{
+ int nranges;
+ BrinRange ranges[FLEXIBLE_ARRAY_MEMBER];
+} BrinRanges;
+
+typedef struct BrinMinmaxStats
+{
+ int32 vl_len_; /* varlena header (do not touch directly!) */
+ int64 n_ranges;
+ int64 n_summarized;
+ int64 n_all_nulls;
+ int64 n_has_nulls;
+ double avg_overlaps;
+ double avg_matches;
+ double avg_matches_unique;
+
+ double minval_correlation;
+ double maxval_correlation;
+ int64 minval_ndistinct;
+ int64 maxval_ndistinct;
+} BrinMinmaxStats;
#define BRIN_DEFAULT_PAGES_PER_RANGE 128
#define BrinGetPagesPerRange(relation) \
diff --git a/src/include/access/brin_internal.h b/src/include/access/brin_internal.h
index 25186609272..ee6c6f9b709 100644
--- a/src/include/access/brin_internal.h
+++ b/src/include/access/brin_internal.h
@@ -73,6 +73,7 @@ typedef struct BrinDesc
#define BRIN_PROCNUM_UNION 4
#define BRIN_MANDATORY_NPROCS 4
#define BRIN_PROCNUM_OPTIONS 5 /* optional */
+#define BRIN_PROCNUM_STATISTICS 6 /* optional */
/* procedure numbers up to 10 are reserved for BRIN future expansion */
#define BRIN_FIRST_OPTIONAL_PROCNUM 11
#define BRIN_LAST_OPTIONAL_PROCNUM 15
diff --git a/src/include/catalog/pg_amproc.dat b/src/include/catalog/pg_amproc.dat
index 4cc129bebd8..ea3de9bcba1 100644
--- a/src/include/catalog/pg_amproc.dat
+++ b/src/include/catalog/pg_amproc.dat
@@ -804,6 +804,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/bytea_minmax_ops', amproclefttype => 'bytea',
amprocrighttype => 'bytea', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/bytea_minmax_ops', amproclefttype => 'bytea',
+ amprocrighttype => 'bytea', amprocnum => '6', amproc => 'brin_minmax_stats' },
# bloom bytea
{ amprocfamily => 'brin/bytea_bloom_ops', amproclefttype => 'bytea',
@@ -835,6 +837,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/char_minmax_ops', amproclefttype => 'char',
amprocrighttype => 'char', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/char_minmax_ops', amproclefttype => 'char',
+ amprocrighttype => 'char', amprocnum => '6', amproc => 'brin_minmax_stats' },
# bloom "char"
{ amprocfamily => 'brin/char_bloom_ops', amproclefttype => 'char',
@@ -864,6 +868,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/name_minmax_ops', amproclefttype => 'name',
amprocrighttype => 'name', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/name_minmax_ops', amproclefttype => 'name',
+ amprocrighttype => 'name', amprocnum => '6', amproc => 'brin_minmax_stats' },
# bloom name
{ amprocfamily => 'brin/name_bloom_ops', amproclefttype => 'name',
@@ -893,6 +899,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int8',
amprocrighttype => 'int8', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int8',
+ amprocrighttype => 'int8', amprocnum => '6', amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int2',
amprocrighttype => 'int2', amprocnum => '1',
@@ -905,6 +913,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int2',
amprocrighttype => 'int2', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int2',
+ amprocrighttype => 'int2', amprocnum => '6', amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int4',
amprocrighttype => 'int4', amprocnum => '1',
@@ -917,6 +927,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int4',
amprocrighttype => 'int4', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int4',
+ amprocrighttype => 'int4', amprocnum => '6', amproc => 'brin_minmax_stats' },
# minmax multi integer: int2, int4, int8
{ amprocfamily => 'brin/integer_minmax_multi_ops', amproclefttype => 'int2',
@@ -1034,6 +1046,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/text_minmax_ops', amproclefttype => 'text',
amprocrighttype => 'text', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/text_minmax_ops', amproclefttype => 'text',
+ amprocrighttype => 'text', amprocnum => '6', amproc => 'brin_minmax_stats' },
# bloom text
{ amprocfamily => 'brin/text_bloom_ops', amproclefttype => 'text',
@@ -1062,6 +1076,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/oid_minmax_ops', amproclefttype => 'oid',
amprocrighttype => 'oid', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/oid_minmax_ops', amproclefttype => 'oid',
+ amprocrighttype => 'oid', amprocnum => '6', amproc => 'brin_minmax_stats' },
# minmax multi oid
{ amprocfamily => 'brin/oid_minmax_multi_ops', amproclefttype => 'oid',
@@ -1110,6 +1126,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/tid_minmax_ops', amproclefttype => 'tid',
amprocrighttype => 'tid', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/tid_minmax_ops', amproclefttype => 'tid',
+ amprocrighttype => 'tid', amprocnum => '6', amproc => 'brin_minmax_stats' },
# bloom tid
{ amprocfamily => 'brin/tid_bloom_ops', amproclefttype => 'tid',
@@ -1160,6 +1178,9 @@
{ amprocfamily => 'brin/float_minmax_ops', amproclefttype => 'float4',
amprocrighttype => 'float4', amprocnum => '4',
amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/float_minmax_ops', amproclefttype => 'float4',
+ amprocrighttype => 'float4', amprocnum => '6',
+ amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/float_minmax_ops', amproclefttype => 'float8',
amprocrighttype => 'float8', amprocnum => '1',
@@ -1173,6 +1194,9 @@
{ amprocfamily => 'brin/float_minmax_ops', amproclefttype => 'float8',
amprocrighttype => 'float8', amprocnum => '4',
amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/float_minmax_ops', amproclefttype => 'float8',
+ amprocrighttype => 'float8', amprocnum => '6',
+ amproc => 'brin_minmax_stats' },
# minmax multi float
{ amprocfamily => 'brin/float_minmax_multi_ops', amproclefttype => 'float4',
@@ -1261,6 +1285,9 @@
{ amprocfamily => 'brin/macaddr_minmax_ops', amproclefttype => 'macaddr',
amprocrighttype => 'macaddr', amprocnum => '4',
amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/macaddr_minmax_ops', amproclefttype => 'macaddr',
+ amprocrighttype => 'macaddr', amprocnum => '6',
+ amproc => 'brin_minmax_stats' },
# minmax multi macaddr
{ amprocfamily => 'brin/macaddr_minmax_multi_ops', amproclefttype => 'macaddr',
@@ -1314,6 +1341,9 @@
{ amprocfamily => 'brin/macaddr8_minmax_ops', amproclefttype => 'macaddr8',
amprocrighttype => 'macaddr8', amprocnum => '4',
amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/macaddr8_minmax_ops', amproclefttype => 'macaddr8',
+ amprocrighttype => 'macaddr8', amprocnum => '6',
+ amproc => 'brin_minmax_stats' },
# minmax multi macaddr8
{ amprocfamily => 'brin/macaddr8_minmax_multi_ops',
@@ -1366,6 +1396,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/network_minmax_ops', amproclefttype => 'inet',
amprocrighttype => 'inet', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/network_minmax_ops', amproclefttype => 'inet',
+ amprocrighttype => 'inet', amprocnum => '6', amproc => 'brin_minmax_stats' },
# minmax multi inet
{ amprocfamily => 'brin/network_minmax_multi_ops', amproclefttype => 'inet',
@@ -1436,6 +1468,9 @@
{ amprocfamily => 'brin/bpchar_minmax_ops', amproclefttype => 'bpchar',
amprocrighttype => 'bpchar', amprocnum => '4',
amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/bpchar_minmax_ops', amproclefttype => 'bpchar',
+ amprocrighttype => 'bpchar', amprocnum => '6',
+ amproc => 'brin_minmax_stats' },
# bloom character
{ amprocfamily => 'brin/bpchar_bloom_ops', amproclefttype => 'bpchar',
@@ -1467,6 +1502,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/time_minmax_ops', amproclefttype => 'time',
amprocrighttype => 'time', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/time_minmax_ops', amproclefttype => 'time',
+ amprocrighttype => 'time', amprocnum => '6', amproc => 'brin_minmax_stats' },
# minmax multi time without time zone
{ amprocfamily => 'brin/time_minmax_multi_ops', amproclefttype => 'time',
@@ -1517,6 +1554,9 @@
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'timestamp',
amprocrighttype => 'timestamp', amprocnum => '4',
amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'timestamp',
+ amprocrighttype => 'timestamp', amprocnum => '6',
+ amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'timestamptz',
amprocrighttype => 'timestamptz', amprocnum => '1',
@@ -1530,6 +1570,9 @@
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'timestamptz',
amprocrighttype => 'timestamptz', amprocnum => '4',
amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'timestamptz',
+ amprocrighttype => 'timestamptz', amprocnum => '6',
+ amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'date',
amprocrighttype => 'date', amprocnum => '1',
@@ -1542,6 +1585,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'date',
amprocrighttype => 'date', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'date',
+ amprocrighttype => 'date', amprocnum => '6', amproc => 'brin_minmax_stats' },
# minmax multi datetime (date, timestamp, timestamptz)
{ amprocfamily => 'brin/datetime_minmax_multi_ops',
@@ -1668,6 +1713,9 @@
{ amprocfamily => 'brin/interval_minmax_ops', amproclefttype => 'interval',
amprocrighttype => 'interval', amprocnum => '4',
amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/interval_minmax_ops', amproclefttype => 'interval',
+ amprocrighttype => 'interval', amprocnum => '6',
+ amproc => 'brin_minmax_stats' },
# minmax multi interval
{ amprocfamily => 'brin/interval_minmax_multi_ops',
@@ -1721,6 +1769,9 @@
{ amprocfamily => 'brin/timetz_minmax_ops', amproclefttype => 'timetz',
amprocrighttype => 'timetz', amprocnum => '4',
amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/timetz_minmax_ops', amproclefttype => 'timetz',
+ amprocrighttype => 'timetz', amprocnum => '6',
+ amproc => 'brin_minmax_stats' },
# minmax multi time with time zone
{ amprocfamily => 'brin/timetz_minmax_multi_ops', amproclefttype => 'timetz',
@@ -1771,6 +1822,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/bit_minmax_ops', amproclefttype => 'bit',
amprocrighttype => 'bit', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/bit_minmax_ops', amproclefttype => 'bit',
+ amprocrighttype => 'bit', amprocnum => '6', amproc => 'brin_minmax_stats' },
# minmax bit varying
{ amprocfamily => 'brin/varbit_minmax_ops', amproclefttype => 'varbit',
@@ -1785,6 +1838,9 @@
{ amprocfamily => 'brin/varbit_minmax_ops', amproclefttype => 'varbit',
amprocrighttype => 'varbit', amprocnum => '4',
amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/varbit_minmax_ops', amproclefttype => 'varbit',
+ amprocrighttype => 'varbit', amprocnum => '6',
+ amproc => 'brin_minmax_stats' },
# minmax numeric
{ amprocfamily => 'brin/numeric_minmax_ops', amproclefttype => 'numeric',
@@ -1799,6 +1855,9 @@
{ amprocfamily => 'brin/numeric_minmax_ops', amproclefttype => 'numeric',
amprocrighttype => 'numeric', amprocnum => '4',
amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/numeric_minmax_ops', amproclefttype => 'numeric',
+ amprocrighttype => 'numeric', amprocnum => '6',
+ amproc => 'brin_minmax_stats' },
# minmax multi numeric
{ amprocfamily => 'brin/numeric_minmax_multi_ops', amproclefttype => 'numeric',
@@ -1851,6 +1910,8 @@
amproc => 'brin_minmax_consistent' },
{ amprocfamily => 'brin/uuid_minmax_ops', amproclefttype => 'uuid',
amprocrighttype => 'uuid', amprocnum => '4', amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/uuid_minmax_ops', amproclefttype => 'uuid',
+ amprocrighttype => 'uuid', amprocnum => '6', amproc => 'brin_minmax_stats' },
# minmax multi uuid
{ amprocfamily => 'brin/uuid_minmax_multi_ops', amproclefttype => 'uuid',
@@ -1924,6 +1985,9 @@
{ amprocfamily => 'brin/pg_lsn_minmax_ops', amproclefttype => 'pg_lsn',
amprocrighttype => 'pg_lsn', amprocnum => '4',
amproc => 'brin_minmax_union' },
+{ amprocfamily => 'brin/pg_lsn_minmax_ops', amproclefttype => 'pg_lsn',
+ amprocrighttype => 'pg_lsn', amprocnum => '6',
+ amproc => 'brin_minmax_stats' },
# minmax multi pg_lsn
{ amprocfamily => 'brin/pg_lsn_minmax_multi_ops', amproclefttype => 'pg_lsn',
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 62a5b8e655d..1dd9177b01c 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8407,6 +8407,10 @@
{ oid => '3386', descr => 'BRIN minmax support',
proname => 'brin_minmax_union', prorettype => 'bool',
proargtypes => 'internal internal internal', prosrc => 'brin_minmax_union' },
+{ oid => '9979', descr => 'BRIN minmax support',
+ proname => 'brin_minmax_stats', prorettype => 'bool',
+ proargtypes => 'internal internal int2 int2 internal int4',
+ prosrc => 'brin_minmax_stats' },
# BRIN minmax multi
{ oid => '4616', descr => 'BRIN multi minmax support',
diff --git a/src/include/catalog/pg_statistic.h b/src/include/catalog/pg_statistic.h
index cdf74481398..7043b169f7c 100644
--- a/src/include/catalog/pg_statistic.h
+++ b/src/include/catalog/pg_statistic.h
@@ -121,6 +121,11 @@ CATALOG(pg_statistic,2619,StatisticRelationId)
anyarray stavalues3;
anyarray stavalues4;
anyarray stavalues5;
+
+ /*
+ * Statistics calculated by index AM (e.g. BRIN for ranges, etc.).
+ */
+ bytea staindexam;
#endif
} FormData_pg_statistic;
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 5d816ba7f4e..319f7d4aadc 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -155,6 +155,7 @@ typedef struct VacAttrStats
float4 *stanumbers[STATISTIC_NUM_SLOTS];
int numvalues[STATISTIC_NUM_SLOTS];
Datum *stavalues[STATISTIC_NUM_SLOTS];
+ Datum staindexam; /* index-specific stats (as bytea) */
/*
* These fields describe the stavalues[n] element types. They will be
@@ -258,6 +259,7 @@ extern PGDLLIMPORT int vacuum_multixact_freeze_min_age;
extern PGDLLIMPORT int vacuum_multixact_freeze_table_age;
extern PGDLLIMPORT int vacuum_failsafe_age;
extern PGDLLIMPORT int vacuum_multixact_failsafe_age;
+extern PGDLLIMPORT bool enable_indexam_stats;
/* Variables for cost-based parallel vacuum */
extern PGDLLIMPORT pg_atomic_uint32 *VacuumSharedCostBalance;
diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h
index 50f02883052..71ce5b15d74 100644
--- a/src/include/utils/lsyscache.h
+++ b/src/include/utils/lsyscache.h
@@ -185,6 +185,7 @@ extern Oid getBaseType(Oid typid);
extern Oid getBaseTypeAndTypmod(Oid typid, int32 *typmod);
extern int32 get_typavgwidth(Oid typid, int32 typmod);
extern int32 get_attavgwidth(Oid relid, AttrNumber attnum);
+extern bytea *get_attindexam(Oid relid, AttrNumber attnum);
extern bool get_attstatsslot(AttStatsSlot *sslot, HeapTuple statstuple,
int reqkind, Oid reqop, int flags);
extern void free_attstatsslot(AttStatsSlot *sslot);
--
2.25.1
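
For anyone wanting to poke at the new statistics from patch 0001, a minimal
psql session might look like this (just a sketch: t and t_a_idx are
hypothetical objects, and it assumes the serialized stats end up in the
index's pg_statistic rows, which is what the get_attindexam(index->indexoid,
...) lookup in brincostestimate suggests):

-- assumes an existing table t with a BRIN index t_a_idx on column a
set enable_indexam_stats = on;  -- GUC added by this patch (off by default)
analyze t;                      -- should invoke the new opclass stats proc

select staattnum, staindexam is not null as has_am_stats
  from pg_statistic
 where starelid = 't_a_idx'::regclass;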
Attachment: 0002-f-Allow-index-AMs-to-build-and-use-custom-statistics.patch (text/x-diff)
From 42c8bc879e11ef816e50751e149f116112dc9905 Mon Sep 17 00:00:00 2001
From: Justin Pryzby <pryzbyj@telsasoft.com>
Date: Sun, 23 Oct 2022 12:29:33 -0500
Subject: [PATCH 2/4] f!Allow index AMs to build and use custom statistics
XXX: should enable GUC for CI during development
ci-os-only: windows-cross, windows-run-cross, windows-msvc
---
src/backend/access/brin/brin_minmax.c | 10 ++--
src/backend/statistics/extended_stats.c | 2 +
src/backend/utils/adt/selfuncs.c | 6 +--
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/access/brin_internal.h | 2 +-
src/include/catalog/pg_amproc.dat | 52 +++++++++----------
src/test/regress/expected/sysviews.out | 3 +-
7 files changed, 41 insertions(+), 35 deletions(-)
diff --git a/src/backend/access/brin/brin_minmax.c b/src/backend/access/brin/brin_minmax.c
index e4c9e56c623..be1d9b47d5b 100644
--- a/src/backend/access/brin/brin_minmax.c
+++ b/src/backend/access/brin/brin_minmax.c
@@ -842,9 +842,11 @@ brin_minmax_count_overlaps_bruteforce(BrinRanges *ranges, TypeCacheEntry *typcac
if (range_values_cmp(&rb->max_value, &ra->min_value, typcache) < 0)
continue;
+#if 0
elog(DEBUG1, "[%ld,%ld] overlaps [%ld,%ld]",
ra->min_value, ra->max_value,
rb->min_value, rb->max_value);
+#endif
noverlaps++;
}
@@ -1173,11 +1175,11 @@ brin_minmax_value_stats(BrinRange **minranges, BrinRange **maxranges,
#ifdef STATS_DEBUG
elog(WARNING, "----- brin_minmax_value_stats -----");
- elog(WARNING, "minval ndistinct %ld correlation %f",
- *minval_ndistinct, *minval_correlation);
+ elog(WARNING, "minval ndistinct %lld correlation %f",
+ (long long)*minval_ndistinct, *minval_correlation);
- elog(WARNING, "maxval ndistinct %ld correlation %f",
- *maxval_ndistinct, *maxval_correlation);
+ elog(WARNING, "maxval ndistinct %lld correlation %f",
+ (long long)*maxval_ndistinct, *maxval_correlation);
#endif
}
diff --git a/src/backend/statistics/extended_stats.c b/src/backend/statistics/extended_stats.c
index ab97e71dd79..d91b4fd93eb 100644
--- a/src/backend/statistics/extended_stats.c
+++ b/src/backend/statistics/extended_stats.c
@@ -2370,6 +2370,8 @@ serialize_expr_stats(AnlExprData *exprdata, int nexprs)
values[Anum_pg_statistic_stanullfrac - 1] = Float4GetDatum(stats->stanullfrac);
values[Anum_pg_statistic_stawidth - 1] = Int32GetDatum(stats->stawidth);
values[Anum_pg_statistic_stadistinct - 1] = Float4GetDatum(stats->stadistinct);
+ nulls[Anum_pg_statistic_staindexam - 1] = true;
+
i = Anum_pg_statistic_stakind1 - 1;
for (k = 0; k < STATISTIC_NUM_SLOTS; k++)
{
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 9f640adb13c..14e0885f19f 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -7789,9 +7789,9 @@ brincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
if (amstats)
{
- elog(DEBUG1, "found AM stats: attnum %d n_ranges %ld n_summarized %ld n_all_nulls %ld n_has_nulls %ld avg_overlaps %f",
- attnum, amstats->n_ranges, amstats->n_summarized,
- amstats->n_all_nulls, amstats->n_has_nulls,
+ elog(DEBUG1, "found AM stats: attnum %d n_ranges %lld n_summarized %lld n_all_nulls %lld n_has_nulls %lld avg_overlaps %f",
+ attnum, (long long)amstats->n_ranges, (long long)amstats->n_summarized,
+ (long long)amstats->n_all_nulls, (long long)amstats->n_has_nulls,
amstats->avg_overlaps);
/*
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 868d21c351e..8c5d442ff45 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -374,6 +374,7 @@
#enable_hashagg = on
#enable_hashjoin = on
#enable_incremental_sort = on
+#enable_indexam_stats = off
#enable_indexscan = on
#enable_indexonlyscan = on
#enable_material = on
diff --git a/src/include/access/brin_internal.h b/src/include/access/brin_internal.h
index ee6c6f9b709..f4be357c176 100644
--- a/src/include/access/brin_internal.h
+++ b/src/include/access/brin_internal.h
@@ -73,9 +73,9 @@ typedef struct BrinDesc
#define BRIN_PROCNUM_UNION 4
#define BRIN_MANDATORY_NPROCS 4
#define BRIN_PROCNUM_OPTIONS 5 /* optional */
-#define BRIN_PROCNUM_STATISTICS 6 /* optional */
/* procedure numbers up to 10 are reserved for BRIN future expansion */
#define BRIN_FIRST_OPTIONAL_PROCNUM 11
+#define BRIN_PROCNUM_STATISTICS 11 /* optional */
#define BRIN_LAST_OPTIONAL_PROCNUM 15
#undef BRIN_DEBUG
diff --git a/src/include/catalog/pg_amproc.dat b/src/include/catalog/pg_amproc.dat
index ea3de9bcba1..558df53206d 100644
--- a/src/include/catalog/pg_amproc.dat
+++ b/src/include/catalog/pg_amproc.dat
@@ -805,7 +805,7 @@
{ amprocfamily => 'brin/bytea_minmax_ops', amproclefttype => 'bytea',
amprocrighttype => 'bytea', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/bytea_minmax_ops', amproclefttype => 'bytea',
- amprocrighttype => 'bytea', amprocnum => '6', amproc => 'brin_minmax_stats' },
+ amprocrighttype => 'bytea', amprocnum => '11', amproc => 'brin_minmax_stats' },
# bloom bytea
{ amprocfamily => 'brin/bytea_bloom_ops', amproclefttype => 'bytea',
@@ -838,7 +838,7 @@
{ amprocfamily => 'brin/char_minmax_ops', amproclefttype => 'char',
amprocrighttype => 'char', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/char_minmax_ops', amproclefttype => 'char',
- amprocrighttype => 'char', amprocnum => '6', amproc => 'brin_minmax_stats' },
+ amprocrighttype => 'char', amprocnum => '11', amproc => 'brin_minmax_stats' },
# bloom "char"
{ amprocfamily => 'brin/char_bloom_ops', amproclefttype => 'char',
@@ -869,7 +869,7 @@
{ amprocfamily => 'brin/name_minmax_ops', amproclefttype => 'name',
amprocrighttype => 'name', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/name_minmax_ops', amproclefttype => 'name',
- amprocrighttype => 'name', amprocnum => '6', amproc => 'brin_minmax_stats' },
+ amprocrighttype => 'name', amprocnum => '11', amproc => 'brin_minmax_stats' },
# bloom name
{ amprocfamily => 'brin/name_bloom_ops', amproclefttype => 'name',
@@ -900,7 +900,7 @@
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int8',
amprocrighttype => 'int8', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int8',
- amprocrighttype => 'int8', amprocnum => '6', amproc => 'brin_minmax_stats' },
+ amprocrighttype => 'int8', amprocnum => '11', amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int2',
amprocrighttype => 'int2', amprocnum => '1',
@@ -914,7 +914,7 @@
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int2',
amprocrighttype => 'int2', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int2',
- amprocrighttype => 'int2', amprocnum => '6', amproc => 'brin_minmax_stats' },
+ amprocrighttype => 'int2', amprocnum => '11', amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int4',
amprocrighttype => 'int4', amprocnum => '1',
@@ -928,7 +928,7 @@
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int4',
amprocrighttype => 'int4', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int4',
- amprocrighttype => 'int4', amprocnum => '6', amproc => 'brin_minmax_stats' },
+ amprocrighttype => 'int4', amprocnum => '11', amproc => 'brin_minmax_stats' },
# minmax multi integer: int2, int4, int8
{ amprocfamily => 'brin/integer_minmax_multi_ops', amproclefttype => 'int2',
@@ -1047,7 +1047,7 @@
{ amprocfamily => 'brin/text_minmax_ops', amproclefttype => 'text',
amprocrighttype => 'text', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/text_minmax_ops', amproclefttype => 'text',
- amprocrighttype => 'text', amprocnum => '6', amproc => 'brin_minmax_stats' },
+ amprocrighttype => 'text', amprocnum => '11', amproc => 'brin_minmax_stats' },
# bloom text
{ amprocfamily => 'brin/text_bloom_ops', amproclefttype => 'text',
@@ -1077,7 +1077,7 @@
{ amprocfamily => 'brin/oid_minmax_ops', amproclefttype => 'oid',
amprocrighttype => 'oid', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/oid_minmax_ops', amproclefttype => 'oid',
- amprocrighttype => 'oid', amprocnum => '6', amproc => 'brin_minmax_stats' },
+ amprocrighttype => 'oid', amprocnum => '11', amproc => 'brin_minmax_stats' },
# minmax multi oid
{ amprocfamily => 'brin/oid_minmax_multi_ops', amproclefttype => 'oid',
@@ -1127,7 +1127,7 @@
{ amprocfamily => 'brin/tid_minmax_ops', amproclefttype => 'tid',
amprocrighttype => 'tid', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/tid_minmax_ops', amproclefttype => 'tid',
- amprocrighttype => 'tid', amprocnum => '6', amproc => 'brin_minmax_stats' },
+ amprocrighttype => 'tid', amprocnum => '11', amproc => 'brin_minmax_stats' },
# bloom tid
{ amprocfamily => 'brin/tid_bloom_ops', amproclefttype => 'tid',
@@ -1179,7 +1179,7 @@
amprocrighttype => 'float4', amprocnum => '4',
amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/float_minmax_ops', amproclefttype => 'float4',
- amprocrighttype => 'float4', amprocnum => '6',
+ amprocrighttype => 'float4', amprocnum => '11',
amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/float_minmax_ops', amproclefttype => 'float8',
@@ -1195,7 +1195,7 @@
amprocrighttype => 'float8', amprocnum => '4',
amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/float_minmax_ops', amproclefttype => 'float8',
- amprocrighttype => 'float8', amprocnum => '6',
+ amprocrighttype => 'float8', amprocnum => '11',
amproc => 'brin_minmax_stats' },
# minmax multi float
@@ -1286,7 +1286,7 @@
amprocrighttype => 'macaddr', amprocnum => '4',
amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/macaddr_minmax_ops', amproclefttype => 'macaddr',
- amprocrighttype => 'macaddr', amprocnum => '6',
+ amprocrighttype => 'macaddr', amprocnum => '11',
amproc => 'brin_minmax_stats' },
# minmax multi macaddr
@@ -1342,7 +1342,7 @@
amprocrighttype => 'macaddr8', amprocnum => '4',
amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/macaddr8_minmax_ops', amproclefttype => 'macaddr8',
- amprocrighttype => 'macaddr8', amprocnum => '6',
+ amprocrighttype => 'macaddr8', amprocnum => '11',
amproc => 'brin_minmax_stats' },
# minmax multi macaddr8
@@ -1397,7 +1397,7 @@
{ amprocfamily => 'brin/network_minmax_ops', amproclefttype => 'inet',
amprocrighttype => 'inet', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/network_minmax_ops', amproclefttype => 'inet',
- amprocrighttype => 'inet', amprocnum => '6', amproc => 'brin_minmax_stats' },
+ amprocrighttype => 'inet', amprocnum => '11', amproc => 'brin_minmax_stats' },
# minmax multi inet
{ amprocfamily => 'brin/network_minmax_multi_ops', amproclefttype => 'inet',
@@ -1469,7 +1469,7 @@
amprocrighttype => 'bpchar', amprocnum => '4',
amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/bpchar_minmax_ops', amproclefttype => 'bpchar',
- amprocrighttype => 'bpchar', amprocnum => '6',
+ amprocrighttype => 'bpchar', amprocnum => '11',
amproc => 'brin_minmax_stats' },
# bloom character
@@ -1503,7 +1503,7 @@
{ amprocfamily => 'brin/time_minmax_ops', amproclefttype => 'time',
amprocrighttype => 'time', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/time_minmax_ops', amproclefttype => 'time',
- amprocrighttype => 'time', amprocnum => '6', amproc => 'brin_minmax_stats' },
+ amprocrighttype => 'time', amprocnum => '11', amproc => 'brin_minmax_stats' },
# minmax multi time without time zone
{ amprocfamily => 'brin/time_minmax_multi_ops', amproclefttype => 'time',
@@ -1555,7 +1555,7 @@
amprocrighttype => 'timestamp', amprocnum => '4',
amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'timestamp',
- amprocrighttype => 'timestamp', amprocnum => '6',
+ amprocrighttype => 'timestamp', amprocnum => '11',
amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'timestamptz',
@@ -1571,7 +1571,7 @@
amprocrighttype => 'timestamptz', amprocnum => '4',
amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'timestamptz',
- amprocrighttype => 'timestamptz', amprocnum => '6',
+ amprocrighttype => 'timestamptz', amprocnum => '11',
amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'date',
@@ -1586,7 +1586,7 @@
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'date',
amprocrighttype => 'date', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'date',
- amprocrighttype => 'date', amprocnum => '6', amproc => 'brin_minmax_stats' },
+ amprocrighttype => 'date', amprocnum => '11', amproc => 'brin_minmax_stats' },
# minmax multi datetime (date, timestamp, timestamptz)
{ amprocfamily => 'brin/datetime_minmax_multi_ops',
@@ -1714,7 +1714,7 @@
amprocrighttype => 'interval', amprocnum => '4',
amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/interval_minmax_ops', amproclefttype => 'interval',
- amprocrighttype => 'interval', amprocnum => '6',
+ amprocrighttype => 'interval', amprocnum => '11',
amproc => 'brin_minmax_stats' },
# minmax multi interval
@@ -1770,7 +1770,7 @@
amprocrighttype => 'timetz', amprocnum => '4',
amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/timetz_minmax_ops', amproclefttype => 'timetz',
- amprocrighttype => 'timetz', amprocnum => '6',
+ amprocrighttype => 'timetz', amprocnum => '11',
amproc => 'brin_minmax_stats' },
# minmax multi time with time zone
@@ -1823,7 +1823,7 @@
{ amprocfamily => 'brin/bit_minmax_ops', amproclefttype => 'bit',
amprocrighttype => 'bit', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/bit_minmax_ops', amproclefttype => 'bit',
- amprocrighttype => 'bit', amprocnum => '6', amproc => 'brin_minmax_stats' },
+ amprocrighttype => 'bit', amprocnum => '11', amproc => 'brin_minmax_stats' },
# minmax bit varying
{ amprocfamily => 'brin/varbit_minmax_ops', amproclefttype => 'varbit',
@@ -1839,7 +1839,7 @@
amprocrighttype => 'varbit', amprocnum => '4',
amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/varbit_minmax_ops', amproclefttype => 'varbit',
- amprocrighttype => 'varbit', amprocnum => '6',
+ amprocrighttype => 'varbit', amprocnum => '11',
amproc => 'brin_minmax_stats' },
# minmax numeric
@@ -1856,7 +1856,7 @@
amprocrighttype => 'numeric', amprocnum => '4',
amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/numeric_minmax_ops', amproclefttype => 'numeric',
- amprocrighttype => 'numeric', amprocnum => '6',
+ amprocrighttype => 'numeric', amprocnum => '11',
amproc => 'brin_minmax_stats' },
# minmax multi numeric
@@ -1911,7 +1911,7 @@
{ amprocfamily => 'brin/uuid_minmax_ops', amproclefttype => 'uuid',
amprocrighttype => 'uuid', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/uuid_minmax_ops', amproclefttype => 'uuid',
- amprocrighttype => 'uuid', amprocnum => '6', amproc => 'brin_minmax_stats' },
+ amprocrighttype => 'uuid', amprocnum => '11', amproc => 'brin_minmax_stats' },
# minmax multi uuid
{ amprocfamily => 'brin/uuid_minmax_multi_ops', amproclefttype => 'uuid',
@@ -1986,7 +1986,7 @@
amprocrighttype => 'pg_lsn', amprocnum => '4',
amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/pg_lsn_minmax_ops', amproclefttype => 'pg_lsn',
- amprocrighttype => 'pg_lsn', amprocnum => '6',
+ amprocrighttype => 'pg_lsn', amprocnum => '11',
amproc => 'brin_minmax_stats' },
# minmax multi pg_lsn
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 579b861d84f..b19dae255e9 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -117,6 +117,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_hashagg | on
enable_hashjoin | on
enable_incremental_sort | on
+ enable_indexam_stats | off
enable_indexonlyscan | on
enable_indexscan | on
enable_material | on
@@ -131,7 +132,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(20 rows)
+(21 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
--
2.25.1
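
To check that brincostestimate() actually consumes the stats, the DEBUG1
messages added in patch 0001 can be surfaced in the client, e.g. (again only
a sketch, with the same hypothetical t / t_a_idx and an arbitrary WHERE
clause):

set enable_indexam_stats = on;
set client_min_messages = debug1;

-- a query that considers a bitmap scan on the BRIN index should log the
-- "found AM stats: ..." and "before/after index AM stats: ..." lines
explain (costs off) select * from t where a between 1000 and 2000;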
Attachment: 0003-Allow-BRIN-indexes-to-produce-sorted-output.patch (text/x-diff)
From 63ca62c13fa852c12d52ba0c53d801b7992ecb4b Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Sun, 9 Oct 2022 11:33:37 +0200
Subject: [PATCH 3/4] Allow BRIN indexes to produce sorted output
Some BRIN indexes can be used to produce sorted output, by using the
range information to sort tuples incrementally. This is particularly
interesting for LIMIT queries, which only need to scan the first few
rows and for which alternative plans (e.g. Seq Scan + Sort) have a very
high startup cost.
Of course, if there are e.g. BTREE indexes this is going to be slower,
but people are unlikely to have both index types on the same column.
This is disabled by default; use the enable_brinsort GUC to enable it.
---
src/backend/access/brin/brin_minmax.c | 386 ++++++
src/backend/commands/explain.c | 44 +
src/backend/executor/Makefile | 1 +
src/backend/executor/execProcnode.c | 10 +
src/backend/executor/nodeBrinSort.c | 1550 +++++++++++++++++++++++
src/backend/optimizer/path/costsize.c | 254 ++++
src/backend/optimizer/path/indxpath.c | 186 +++
src/backend/optimizer/path/pathkeys.c | 50 +
src/backend/optimizer/plan/createplan.c | 188 +++
src/backend/optimizer/plan/setrefs.c | 19 +
src/backend/optimizer/util/pathnode.c | 57 +
src/backend/utils/misc/guc_tables.c | 10 +
src/include/access/brin.h | 35 -
src/include/access/brin_internal.h | 1 +
src/include/catalog/pg_amproc.dat | 64 +
src/include/catalog/pg_proc.dat | 3 +
src/include/executor/nodeBrinSort.h | 47 +
src/include/nodes/execnodes.h | 103 ++
src/include/nodes/pathnodes.h | 11 +
src/include/nodes/plannodes.h | 26 +
src/include/optimizer/cost.h | 3 +
src/include/optimizer/pathnode.h | 9 +
src/include/optimizer/paths.h | 3 +
23 files changed, 3025 insertions(+), 35 deletions(-)
create mode 100644 src/backend/executor/nodeBrinSort.c
create mode 100644 src/include/executor/nodeBrinSort.h
diff --git a/src/backend/access/brin/brin_minmax.c b/src/backend/access/brin/brin_minmax.c
index be1d9b47d5b..9064cd43852 100644
--- a/src/backend/access/brin/brin_minmax.c
+++ b/src/backend/access/brin/brin_minmax.c
@@ -16,6 +16,10 @@
#include "access/brin_tuple.h"
#include "access/genam.h"
#include "access/stratnum.h"
+#include "access/table.h"
+#include "access/tableam.h"
+#include "catalog/index.h"
+#include "catalog/pg_am.h"
#include "catalog/pg_amop.h"
#include "catalog/pg_type.h"
#include "miscadmin.h"
@@ -42,6 +46,9 @@ static FmgrInfo *minmax_get_strategy_procinfo(BrinDesc *bdesc, uint16 attno,
/* calculate the stats in different ways for cross-checking */
#define STATS_CROSS_CHECK
+/* print info about ranges */
+#define BRINSORT_DEBUG
+
Datum
brin_minmax_opcinfo(PG_FUNCTION_ARGS)
{
@@ -1587,6 +1594,385 @@ cleanup:
PG_RETURN_POINTER(stats);
}
+/*
+ * brin_minmax_range_tupdesc
+ * Create a tuple descriptor to store BrinRange data.
+ */
+static TupleDesc
+brin_minmax_range_tupdesc(BrinDesc *brdesc, AttrNumber attnum)
+{
+ TupleDesc tupdesc;
+ AttrNumber attno = 1;
+
+ /* expect minimum and maximum */
+ Assert(brdesc->bd_info[attnum - 1]->oi_nstored == 2);
+
+ tupdesc = CreateTemplateTupleDesc(7);
+
+ /* blkno_start */
+ TupleDescInitEntry(tupdesc, attno++, NULL, INT8OID, -1, 0);
+
+ /* blkno_end (could be calculated as blkno_start + pages_per_range) */
+ TupleDescInitEntry(tupdesc, attno++, NULL, INT8OID, -1, 0);
+
+ /* has_nulls */
+ TupleDescInitEntry(tupdesc, attno++, NULL, BOOLOID, -1, 0);
+
+ /* all_nulls */
+ TupleDescInitEntry(tupdesc, attno++, NULL, BOOLOID, -1, 0);
+
+ /* not_summarized */
+ TupleDescInitEntry(tupdesc, attno++, NULL, BOOLOID, -1, 0);
+
+ /* min_value */
+ TupleDescInitEntry(tupdesc, attno++, NULL,
+ brdesc->bd_info[attnum - 1]->oi_typcache[0]->type_id,
+ -1, 0);
+
+ /* max_value */
+ TupleDescInitEntry(tupdesc, attno++, NULL,
+ brdesc->bd_info[attnum - 1]->oi_typcache[0]->type_id,
+ -1, 0);
+
+ return tupdesc;
+}
+
+/*
+ * brin_minmax_range_tuple
+ * Form a minimal tuple representing range info.
+ */
+static MinimalTuple
+brin_minmax_range_tuple(TupleDesc tupdesc,
+ BlockNumber block_start, BlockNumber block_end,
+ bool has_nulls, bool all_nulls, bool not_summarized,
+ Datum min_value, Datum max_value)
+{
+ Datum values[7];
+ bool nulls[7];
+
+ memset(nulls, 0, sizeof(nulls));
+
+ values[0] = Int64GetDatum(block_start);
+ values[1] = Int64GetDatum(block_end);
+ values[2] = BoolGetDatum(has_nulls);
+ values[3] = BoolGetDatum(all_nulls);
+ values[4] = BoolGetDatum(not_summarized);
+ values[5] = min_value;
+ values[6] = max_value;
+
+ if (all_nulls || not_summarized)
+ {
+ nulls[5] = true;
+ nulls[6] = true;
+ }
+
+ return heap_form_minimal_tuple(tupdesc, values, nulls);
+}
+
+/*
+ * brin_minmax_scan_init
+ * Prepare the BrinRangeScanDesc including the sorting info etc.
+ *
+ * We want to have the ranges in roughly this order
+ *
+ * - not-summarized
+ * - summarized, non-null values
+ * - summarized, all-nulls
+ *
+ * We do it this way, because the not-summarized ranges need to be
+ * scanned always (both to produce NULL and non-NULL values), and
+ * we need to read all of them into the tuplesort before producing
+ * anything. So placing them at the beginning is reasonable.
+ *
+ * The all-nulls ranges are placed last, because when processing
+ * NULLs we need to scan everything anyway (some of the ranges might
+ * have has_nulls=true). But for non-NULL values we can abort once
+ * we hit the first all-nulls range.
+ *
+ * The regular ranges are sorted by blkno_start, to make it maybe
+ * a bit more sequential (but this only helps if there are ranges
+ * with the same minval).
+ */
+static BrinRangeScanDesc *
+brin_minmax_scan_init(BrinDesc *bdesc, AttrNumber attnum, bool asc)
+{
+ BrinRangeScanDesc *scan;
+
+ /* sort by (not_summarized, minval, blkno_start, all_nulls) */
+ AttrNumber keys[4];
+ Oid collations[4];
+ bool nullsFirst[4];
+ Oid operators[4];
+ Oid typid;
+ TypeCacheEntry *typcache;
+
+ /* we expect to have min/max value for each range, same type for both */
+ Assert(bdesc->bd_info[attnum - 1]->oi_nstored == 2);
+ Assert(bdesc->bd_info[attnum - 1]->oi_typcache[0]->type_id ==
+ bdesc->bd_info[attnum - 1]->oi_typcache[1]->type_id);
+
+ scan = (BrinRangeScanDesc *) palloc0(sizeof(BrinRangeScanDesc));
+
+ /* build tuple descriptor for range data */
+ scan->tdesc = brin_minmax_range_tupdesc(bdesc, attnum);
+
+ /* initialize ordering info */
+ keys[0] = 5; /* not_summarized */
+ keys[1] = 4; /* all_nulls */
+ keys[2] = (asc) ? 6 : 7; /* min_value (asc) or max_value (desc) */
+ keys[3] = 1; /* blkno_start */
+
+ collations[0] = InvalidOid; /* FIXME */
+ collations[1] = InvalidOid; /* FIXME */
+ collations[2] = InvalidOid; /* FIXME */
+ collations[3] = InvalidOid; /* FIXME */
+
+ /* unrelated to the ordering desired by the user */
+ nullsFirst[0] = false;
+ nullsFirst[1] = false;
+ nullsFirst[2] = false;
+ nullsFirst[3] = false;
+
+ /* lookup sort operator for the boolean type (used for not_summarized) */
+ typcache = lookup_type_cache(BOOLOID, TYPECACHE_GT_OPR);
+ operators[0] = typcache->gt_opr;
+
+ /* lookup sort operator for the boolean type (used for all_nulls) */
+ typcache = lookup_type_cache(BOOLOID, TYPECACHE_LT_OPR);
+ operators[1] = typcache->lt_opr;
+
+ /* lookup sort operator for the min/max type */
+ typid = bdesc->bd_info[attnum - 1]->oi_typcache[0]->type_id;
+ typcache = lookup_type_cache(typid, TYPECACHE_LT_OPR | TYPECACHE_GT_OPR);
+ operators[2] = (asc) ? typcache->lt_opr : typcache->gt_opr;
+
+ /* lookup sort operator for the bigint type (used for blkno_start) */
+ typcache = lookup_type_cache(INT8OID, TYPECACHE_LT_OPR);
+ operators[3] = typcache->lt_opr;
+
+ scan->ranges = tuplesort_begin_heap(scan->tdesc,
+ 4, /* nkeys */
+ keys,
+ operators,
+ collations,
+ nullsFirst,
+ work_mem,
+ NULL,
+ TUPLESORT_RANDOMACCESS);
+
+ scan->slot = MakeSingleTupleTableSlot(scan->tdesc,
+ &TTSOpsMinimalTuple);
+
+ return scan;
+}
+
+/*
+ * brin_minmax_scan_add_tuple
+ * Form and store a tuple representing the BRIN range to the tuplestore.
+ */
+static void
+brin_minmax_scan_add_tuple(BrinRangeScanDesc *scan,
+ BlockNumber block_start, BlockNumber block_end,
+ bool has_nulls, bool all_nulls, bool not_summarized,
+ Datum min_value, Datum max_value)
+{
+ MinimalTuple tup;
+
+ tup = brin_minmax_range_tuple(scan->tdesc, block_start, block_end,
+ has_nulls, all_nulls, not_summarized,
+ min_value, max_value);
+
+ ExecStoreMinimalTuple(tup, scan->slot, false);
+
+ tuplesort_puttupleslot(scan->ranges, scan->slot);
+}
+
+#ifdef BRINSORT_DEBUG
+/*
+ * brin_minmax_scan_next
+ * Return the next BRIN range information from the tuplestore.
+ *
+ * Returns NULL when there are no more ranges.
+ */
+static BrinRange *
+brin_minmax_scan_next(BrinRangeScanDesc *scan)
+{
+ if (tuplesort_gettupleslot(scan->ranges, true, false, scan->slot, NULL))
+ {
+ bool isnull;
+ BrinRange *range = (BrinRange *) palloc(sizeof(BrinRange));
+
+ range->blkno_start = slot_getattr(scan->slot, 1, &isnull);
+ range->blkno_end = slot_getattr(scan->slot, 2, &isnull);
+ range->has_nulls = slot_getattr(scan->slot, 3, &isnull);
+ range->all_nulls = slot_getattr(scan->slot, 4, &isnull);
+ range->not_summarized = slot_getattr(scan->slot, 5, &isnull);
+ range->min_value = slot_getattr(scan->slot, 6, &isnull);
+ range->max_value = slot_getattr(scan->slot, 7, &isnull);
+
+ return range;
+ }
+
+ return NULL;
+}
+
+/*
+ * brin_minmax_scan_dump
+ * Print info about all page ranges stored in the tuplesort.
+ */
+static void
+brin_minmax_scan_dump(BrinRangeScanDesc *scan)
+{
+ BrinRange *range;
+
+ elog(WARNING, "===== dumping =====");
+ while ((range = brin_minmax_scan_next(scan)) != NULL)
+ {
+ elog(WARNING, "[%u %u] has_nulls %d all_nulls %d not_summarized %d values [%f %f]",
+ range->blkno_start, range->blkno_end,
+ range->has_nulls, range->all_nulls, range->not_summarized,
+ DatumGetFloat8(range->min_value), DatumGetFloat8(range->max_value));
+
+ pfree(range);
+ }
+
+ /* reset the tuplesort, so that we can start scanning again */
+ tuplesort_rescan(scan->ranges);
+}
+#endif
+
+static void
+brin_minmax_scan_finalize(BrinRangeScanDesc *scan)
+{
+ tuplesort_performsort(scan->ranges);
+}
+
+/*
+ * brin_minmax_ranges
+ * Load the BRIN ranges and sort them.
+ */
+Datum
+brin_minmax_ranges(PG_FUNCTION_ARGS)
+{
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+ AttrNumber attnum = PG_GETARG_INT16(1);
+ bool asc = PG_GETARG_BOOL(2);
+ BrinOpaque *opaque;
+ Relation indexRel;
+ Relation heapRel;
+ BlockNumber nblocks;
+ BlockNumber heapBlk;
+ Oid heapOid;
+ BrinMemTuple *dtup;
+ BrinTuple *btup = NULL;
+ Size btupsz = 0;
+ Buffer buf = InvalidBuffer;
+ BlockNumber pagesPerRange;
+ BrinDesc *bdesc;
+ BrinRangeScanDesc *brscan;
+
+ /*
+ * Determine how many BRIN ranges there could be, allocate space and read
+ * all the min/max values.
+ */
+ opaque = (BrinOpaque *) scan->opaque;
+ bdesc = opaque->bo_bdesc;
+ pagesPerRange = opaque->bo_pagesPerRange;
+
+ indexRel = bdesc->bd_index;
+
+ /* make sure the provided attnum is valid */
+ Assert((attnum > 0) && (attnum <= bdesc->bd_tupdesc->natts));
+
+ /*
+ * We need to know the size of the table so that we know how long to iterate
+ * on the revmap (and to pre-allocate the arrays).
+ */
+ heapOid = IndexGetRelation(RelationGetRelid(indexRel), false);
+ heapRel = table_open(heapOid, AccessShareLock);
+ nblocks = RelationGetNumberOfBlocks(heapRel);
+ table_close(heapRel, AccessShareLock);
+
+ /* allocate an initial in-memory tuple, out of the per-range memcxt */
+ dtup = brin_new_memtuple(bdesc);
+
+ /* initialize the scan describing scan of ranges sorted by minval */
+ brscan = brin_minmax_scan_init(bdesc, attnum, asc);
+
+ /*
+ * Now scan the revmap. We start by querying for heap page 0,
+ * incrementing by the number of pages per range; this gives us a full
+ * view of the table.
+ */
+ for (heapBlk = 0; heapBlk < nblocks; heapBlk += pagesPerRange)
+ {
+ bool gottuple = false;
+ BrinTuple *tup;
+ OffsetNumber off;
+ Size size;
+
+ CHECK_FOR_INTERRUPTS();
+
+ tup = brinGetTupleForHeapBlock(opaque->bo_rmAccess, heapBlk, &buf,
+ &off, &size, BUFFER_LOCK_SHARE,
+ scan->xs_snapshot);
+ if (tup)
+ {
+ gottuple = true;
+ btup = brin_copy_tuple(tup, size, btup, &btupsz);
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ }
+
+ /*
+ * Ranges with no indexed tuple may contain anything.
+ */
+ if (!gottuple)
+ {
+ brin_minmax_scan_add_tuple(brscan,
+ heapBlk, heapBlk + (pagesPerRange - 1),
+ false, false, true, 0, 0);
+ }
+ else
+ {
+ dtup = brin_deform_tuple(bdesc, btup, dtup);
+ if (dtup->bt_placeholder)
+ {
+ /*
+ * Placeholder tuples are treated as if not summarized.
+ *
+ * XXX Is this correct?
+ */
+ brin_minmax_scan_add_tuple(brscan,
+ heapBlk, heapBlk + (pagesPerRange - 1),
+ false, false, true, 0, 0);
+ }
+ else
+ {
+ BrinValues *bval;
+
+ bval = &dtup->bt_columns[attnum - 1];
+
+ brin_minmax_scan_add_tuple(brscan,
+ heapBlk, heapBlk + (pagesPerRange - 1),
+ bval->bv_hasnulls, bval->bv_allnulls, false,
+ bval->bv_values[0], bval->bv_values[1]);
+ }
+ }
+ }
+
+ if (buf != InvalidBuffer)
+ ReleaseBuffer(buf);
+
+ /* do the sort and any necessary post-processing */
+ brin_minmax_scan_finalize(brscan);
+
+#ifdef BRINSORT_DEBUG
+ brin_minmax_scan_dump(brscan);
+#endif
+
+ PG_RETURN_POINTER(brscan);
+}
+
/*
* Cache and return the procedure for the given strategy.
*
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index f86983c6601..e15b29246b1 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -85,6 +85,8 @@ static void show_sort_keys(SortState *sortstate, List *ancestors,
ExplainState *es);
static void show_incremental_sort_keys(IncrementalSortState *incrsortstate,
List *ancestors, ExplainState *es);
+static void show_brinsort_keys(BrinSortState *sortstate, List *ancestors,
+ ExplainState *es);
static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
ExplainState *es);
static void show_agg_keys(AggState *astate, List *ancestors,
@@ -1100,6 +1102,7 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
case T_IndexScan:
case T_IndexOnlyScan:
case T_BitmapHeapScan:
+ case T_BrinSort:
case T_TidScan:
case T_TidRangeScan:
case T_SubqueryScan:
@@ -1262,6 +1265,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
case T_IndexOnlyScan:
pname = sname = "Index Only Scan";
break;
+ case T_BrinSort:
+ pname = sname = "BRIN Sort";
+ break;
case T_BitmapIndexScan:
pname = sname = "Bitmap Index Scan";
break;
@@ -1508,6 +1514,16 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainScanTarget((Scan *) indexonlyscan, es);
}
break;
+ case T_BrinSort:
+ {
+ BrinSort *brinsort = (BrinSort *) plan;
+
+ ExplainIndexScanDetails(brinsort->indexid,
+ brinsort->indexorderdir,
+ es);
+ ExplainScanTarget((Scan *) brinsort, es);
+ }
+ break;
case T_BitmapIndexScan:
{
BitmapIndexScan *bitmapindexscan = (BitmapIndexScan *) plan;
@@ -1790,6 +1806,18 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainPropertyFloat("Heap Fetches", NULL,
planstate->instrument->ntuples2, 0, es);
break;
+ case T_BrinSort:
+ show_scan_qual(((BrinSort *) plan)->indexqualorig,
+ "Index Cond", planstate, ancestors, es);
+ if (((BrinSort *) plan)->indexqualorig)
+ show_instrumentation_count("Rows Removed by Index Recheck", 2,
+ planstate, es);
+ show_scan_qual(plan->qual, "Filter", planstate, ancestors, es);
+ show_brinsort_keys(castNode(BrinSortState, planstate), ancestors, es);
+ if (plan->qual)
+ show_instrumentation_count("Rows Removed by Filter", 1,
+ planstate, es);
+ break;
case T_BitmapIndexScan:
show_scan_qual(((BitmapIndexScan *) plan)->indexqualorig,
"Index Cond", planstate, ancestors, es);
@@ -2389,6 +2417,21 @@ show_incremental_sort_keys(IncrementalSortState *incrsortstate,
ancestors, es);
}
+/*
+ * Show the sort keys for a BRIN Sort node.
+ */
+static void
+show_brinsort_keys(BrinSortState *sortstate, List *ancestors, ExplainState *es)
+{
+ BrinSort *plan = (BrinSort *) sortstate->ss.ps.plan;
+
+ show_sort_group_keys((PlanState *) sortstate, "Sort Key",
+ plan->numCols, 0, plan->sortColIdx,
+ plan->sortOperators, plan->collations,
+ plan->nullsFirst,
+ ancestors, es);
+}
+
/*
* Likewise, for a MergeAppend node.
*/
@@ -3812,6 +3855,7 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
case T_ForeignScan:
case T_CustomScan:
case T_ModifyTable:
+ case T_BrinSort:
/* Assert it's on a real relation */
Assert(rte->rtekind == RTE_RELATION);
objectname = get_rel_name(rte->relid);
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 11118d0ce02..bcaa2ce8e21 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -38,6 +38,7 @@ OBJS = \
nodeBitmapHeapscan.o \
nodeBitmapIndexscan.o \
nodeBitmapOr.o \
+ nodeBrinSort.o \
nodeCtescan.o \
nodeCustom.o \
nodeForeignscan.o \
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 36406c3af57..4a6dc3f263c 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -79,6 +79,7 @@
#include "executor/nodeBitmapHeapscan.h"
#include "executor/nodeBitmapIndexscan.h"
#include "executor/nodeBitmapOr.h"
+#include "executor/nodeBrinSort.h"
#include "executor/nodeCtescan.h"
#include "executor/nodeCustom.h"
#include "executor/nodeForeignscan.h"
@@ -226,6 +227,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
estate, eflags);
break;
+ case T_BrinSort:
+ result = (PlanState *) ExecInitBrinSort((BrinSort *) node,
+ estate, eflags);
+ break;
+
case T_BitmapIndexScan:
result = (PlanState *) ExecInitBitmapIndexScan((BitmapIndexScan *) node,
estate, eflags);
@@ -639,6 +645,10 @@ ExecEndNode(PlanState *node)
ExecEndIndexOnlyScan((IndexOnlyScanState *) node);
break;
+ case T_BrinSortState:
+ ExecEndBrinSort((BrinSortState *) node);
+ break;
+
case T_BitmapIndexScanState:
ExecEndBitmapIndexScan((BitmapIndexScanState *) node);
break;
diff --git a/src/backend/executor/nodeBrinSort.c b/src/backend/executor/nodeBrinSort.c
new file mode 100644
index 00000000000..ca72c1ed22d
--- /dev/null
+++ b/src/backend/executor/nodeBrinSort.c
@@ -0,0 +1,1550 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeBrinSort.c
+ * Routines to support sorted scan of relations using a BRIN index
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * The overall algorithm is roughly this:
+ *
+ * 0) initialize a tuplestore and a tuplesort
+ *
+ * 1) fetch list of page ranges from the BRIN index, sorted by minval
+ * (with the not-summarized ranges first, and all-null ranges last)
+ *
+ * 2) for NULLS FIRST ordering, walk all ranges that may contain NULL
+ * values and output them (and return to the beginning of the list)
+ *
+ * 3) while there are ranges in the list, do this:
+ *
+ * a) get next (distinct) minval from the list, call it watermark
+ *
+ * b) if there are any tuples in the tuplestore, move them to tuplesort
+ *
+ * c) process all ranges with (minval < watermark) - read tuples and feed
+ * them either into the tuplesort (when value < watermark) or the tuplestore
+ *
+ * d) sort the tuplestore, output all the tuples
+ *
+ * 4) if some tuples remain in the tuplestore, sort and output them
+ *
+ * 5) for NULLS LAST ordering, walk all ranges that may contain NULL
+ * values and output them (and return to the beginning of the list)
+ *
+ *
+ * For DESC orderings the process is almost the same, except that we look
+ * at maxval and use '>' operator (but that's transparent).
+ *
+ * There are a couple of things that might be done in different ways:
+ *
+ * 1) Not using tuplestore, and feeding tuples only to a tuplesort. Then
+ * while producing the tuples, we'd only output tuples up to the current
+ * watermark, and then we'd keep the remaining tuples for the next round.
+ * Either we'd need to transfer them into a second tuplesort, or allow
+ * "reopening" the tuplesort and adding more tuples. And then only the
+ * part since the watermark would get sorted (possibly using a merge-sort
+ * with the already sorted part).
+ *
+ *
+ * 2) The other question is what to do with NULL values - at the moment we
+ * just read the ranges, output the NULL tuples and that's it - we're not
+ * retaining any non-NULL tuples, so we'll read the ranges again in
+ * the second pass. The logic here is that either there are very few
+ * such ranges, so it won't cost much to just re-read them. Or maybe
+ * there are very many such ranges, and we'd do a lot of spilling to the
+ * tuplestore, and it's not much more expensive to just re-read the source
+ * data. There are counter-examples, though - e.g., there might be many
+ * has_nulls ranges, but with very few non-NULL tuples. In this case it
+ * might be better to actually spill the tuples instead of re-reading all
+ * the ranges. Maybe this is something we can do at run-time, or maybe we
+ * could estimate this at planning time. We do know the null_frac for the
+ * column, so we know the number of NULL rows. And we also know the number
+ * of all_nulls and has_nulls ranges. We can estimate the number of rows
+ * per range, and we can estimate how many non-NULL rows are in the
+ * has_nulls ranges (we don't need to re-read all-nulls ranges). There's
+ * also the filter, which may reduce the amount of rows to store.
+ *
+ * So we'd need to compare two metrics calculated roughly like this:
+ *
+ * cost(re-reading has-nulls ranges)
+ * = cost(random_page_cost * n_has_nulls + seq_page_cost * pages_per_range)
+ *
+ * cost(spilling non-NULL rows from has-nulls ranges)
+ * = cost(numrows * width / BLCKSZ * seq_page_cost * 2)
+ *
+ * where numrows is the number of non-NULL rows in has_null ranges, which
+ * can be calculated like this:
+ *
+ * // estimated number of rows in has-null ranges
+ * rows_in_has_nulls = (reltuples / relpages) * pages_per_range * n_has_nulls
+ *
+ * // number of NULL rows in the has-nulls ranges
+ * nulls_in_ranges = reltuples * null_frac - n_all_nulls * (reltuples / relpages)
+ *
+ * // numrows is the difference, multiplied by selectivity of the index
+ * // filter condition (value between 0.0 and 1.0)
+ * numrows = (rows_in_has_nulls - nulls_in_ranges) * selectivity
+ *
+ * This ignores non-summarized ranges, but there should be only very few of
+ * those, so it should not make a huge difference. Otherwise we can divide
+ * them between regular, has-nulls and all-nulls pages to keep the ratio.
+ *
+ *
+ * 3) How large step to make when updating the watermark?
+ *
+ * When updating the watermark, one option is to simply proceed to the next
+ * distinct minval value, which is the smallest possible step we can make.
+ * This may be both fine and very inefficient, depending on how many rows
+ * end up in the tuplesort and how many rows we end up spilling (possibly
+ * repeatedly to the tuplestore).
+ *
+ * When having to sort a large number of rows, it's inefficient to run many
+ * tiny sorts, even if it produces the correct result. For example when
+ * sorting 1M rows, we may split this as either (a) 100000x sorts of 10 rows,
+ * or (b) 1000 sorts of 1000 rows. The (b) option is almost certainly more
+ * efficient. Maybe sorts of 10k rows would be even better, if they fit
+ * into work_mem.
+ *
+ * This gets back to how large the page ranges are, and if/how much they
+ * overlap. With tiny ranges (e.g. single-page ranges), a single range
+ * can only add as many rows as we can fit on a single page. So we need
+ * more ranges by default - how many watermark steps that is depends on
+ * how many distinct minval values there are ...
+ *
+ * Then there's overlaps - if ranges do not overlap, we're done and we'll
+ * add the whole range because the next watermark is above maxval. But
+ * when the ranges overlap, we'll only add the first part (assuming the
+ * minval of the next range is the watermark). Assume 10 overlapping
+ * ranges - imagine for example ranges shifted by 10%, so something like
+ *
+ * [0,100] [10,110], [20,120], [30, 130], ..., [90, 190]
+ *
+ * In the first step we use watermark=10 and load the first range, with
+ * maybe 1000 rows in total. But assuming uniform distribution, only about
+ * 100 rows will go into the tuplesort, and the remaining 900 rows will go
+ * into the tuplestore. Then in the second step
+ * we sort another 100 rows and the remaining 800 rows will be moved into
+ * a new tuplestore. And so on and so on.
+ *
+ * This means that incrementing the watermarks by single steps may be
+ * quite inefficient, and we need to reflect both the range size and
+ * how much the ranges overlap.
+ *
+ * In fact, maybe we should not determine the step as the number of minval
+ * values to skip, but as the number of ranges that would mean reading.
+ * Because if we have a minval with many duplicates, that may load many rows.
+ * Or even better, we could look at how many rows that would mean loading
+ * into the tuplestore - if we track P(x<minval) for each range (e.g. by
+ * calculating average value during ANALYZE, or perhaps by estimating
+ * it from per-column stats), then we know the increment is going to be
+ * about
+ *
+ * P(x < minval[i]) - P(x < minval[i-1])
+ *
+ * and we can stop once we'd exceed work_mem (with some slack). See comment
+ * for brin_minmax_stats() for more thoughts.
+ *
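+ * As a rough illustration (made-up numbers): if P(x < minval[i-1]) = 0.30
+ * and P(x < minval[i]) = 0.33, this one step is expected to add about 3% of
+ * the table to the sort. For 1M rows of ~100 bytes each that is roughly
+ * 30k rows (about 3MB), so with work_mem = 4MB we'd stop after a single
+ * step, while with work_mem = 32MB we could afford about ten such steps.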
+ *
+ * 4) LIMIT/OFFSET vs. full sort
+ *
+ * There's one case where very small sorts may be actually optimal, and
+ * that's queries that need to process only very few rows - say, LIMIT
+ * queries with very small bound.
+ *
+ *
+ * FIXME Projection does not work (fails on projection slot expecting
+ * buffer ops, but we're sending it a minimal tuple slot).
+ *
+ * FIXME The tlists are not wired quite correctly - the sortColIdx is an
+ * index into the tlist, but we need the attnum from the heap table, so that
+ * can fetch the attribute etc. Or maybe fetching the value from the raw
+ * tuple (before projection) is wrong and needs to be done differently.
+ *
+ * FIXME Indexes on expressions don't work (possibly related to the tlist
+ * being done incorrectly).
+ *
+ * FIXME handling of other brin opclasses (minmax-multi)
+ *
+ * FIXME improve costing
+ *
+ *
+ * Improvement ideas:
+ *
+ * 1) multiple tuplestores for overlapping ranges
+ *
+ * When there are many overlapping ranges (so that maxval > current.maxval),
+ * we're loading all the "future" tuples into a new tuplestore. However, if
+ * there are multiple such ranges (imagine ranges "shifting" by 10%, which
+ * gives us 9 more ranges), we know in the next round we'll only need rows
+ * until the next maxval. We'll not sort these rows, but we'll still shuffle
+ * them around until we get to the proper range (so about 10x each row).
+ * Maybe we should pre-allocate the tuplestores (or maybe even tuplesorts)
+ * for future ranges, and route the tuples to the correct one? Maybe we
+ * could be a bit smarter and discard tuples once we have enough rows for
+ * the preceding ranges (say, with LIMIT queries). We'd also need to worry
+ * about work_mem, though - we can't just use many tuplestores, each with
+ * whole work_mem. So we'd probably use e.g. work_mem/2 for the next one,
+ * and then /4, /8 etc. for the following ones. That's work_mem in total.
+ * And there'd need to be some limit on number of tuplestores, I guess.
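+ * (For example, with work_mem = 64MB the next tuplestore would get 32MB,
+ * the one after it 16MB, then 8MB and so on, so all of them together stay
+ * within the original 64MB budget.)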
+ *
+ * 2) handling NULL values
+ *
+ * We need to handle NULLS FIRST / NULLS LAST cases. The question is how
+ * to do that - the easiest way is to simply do a separate scan of ranges
+ * that might contain NULL values, processing just rows with NULLs, and
+ * discarding other rows. And then process non-NULL values as currently.
+ * The NULL scan would happen before/after this regular phase.
+ *
+ * But maybe we could be smarter, and not do separate scans. When reading
+ * a page, we might stash the tuple in a tuplestore, so that we can read
+ * it in the next round. Obviously, this might be expensive if we need to
+ * keep too many rows, so the tuplestore would grow too large - in that
+ * case it might be better to just do the two scans.
+ *
+ * 3) parallelism
+ *
+ * Presumably we could do a parallel version of this. The leader or first
+ * worker would prepare the range information, and the workers would then
+ * grab ranges (in a kinda round robin manner), sort them independently,
+ * and then the results would be merged by Gather Merge.
+ *
+ * IDENTIFICATION
+ * src/backend/executor/nodeBrinSort.c
+ *
+ *-------------------------------------------------------------------------
+ */
+/*
+ * INTERFACE ROUTINES
+ * ExecBrinSort scans a relation using an index
+ * IndexNext retrieve next tuple using index
+ * ExecInitBrinSort creates and initializes state info.
+ * ExecReScanBrinSort rescans the indexed relation.
+ * ExecEndBrinSort releases all storage.
+ * ExecBrinSortMarkPos marks scan position.
+ * ExecBrinSortRestrPos restores scan position.
+ * ExecBrinSortEstimate estimates DSM space needed for parallel index scan
+ * ExecBrinSortInitializeDSM initialize DSM for parallel BrinSort
+ * ExecBrinSortReInitializeDSM reinitialize DSM for fresh scan
+ * ExecBrinSortInitializeWorker attach to DSM info in parallel worker
+ */
+#include "postgres.h"
+
+#include "access/brin.h"
+#include "access/brin_internal.h"
+#include "access/nbtree.h"
+#include "access/relscan.h"
+#include "access/table.h"
+#include "access/tableam.h"
+#include "catalog/index.h"
+#include "catalog/pg_am.h"
+#include "executor/execdebug.h"
+#include "executor/nodeBrinSort.h"
+#include "lib/pairingheap.h"
+#include "miscadmin.h"
+#include "nodes/nodeFuncs.h"
+#include "utils/array.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+
+
+static TupleTableSlot *IndexNext(BrinSortState *node);
+static bool IndexRecheck(BrinSortState *node, TupleTableSlot *slot);
+static void ExecInitBrinSortRanges(BrinSort *node, BrinSortState *planstate);
+
+#define BRINSORT_DEBUG
+
+/* do various consistency checks */
+static void
+AssertCheckRanges(BrinSortState *node)
+{
+#ifdef USE_ASSERT_CHECKING
+
+#endif
+}
+
+/*
+ * brinsort_start_tidscan
+ * Start scanning tuples from a given page range.
+ *
+ * We open a TID range scan for the given range, and initialize the tuplesort.
+ * Optionally, we update the watermark (with either high/low value). We only
+ * need to do this for the main page range, not for the intersecting ranges.
+ *
+ * XXX Maybe we should initialize the tidscan only once, and then do rescan
+ * for the following ranges? And similarly for the tuplesort?
+ */
+static void
+brinsort_start_tidscan(BrinSortState *node)
+{
+ BrinSort *plan = (BrinSort *) node->ss.ps.plan;
+ EState *estate = node->ss.ps.state;
+ BrinRange *range = node->bs_range;
+
+ /* There must not be any TID scan in progress yet. */
+ Assert(node->ss.ss_currentScanDesc == NULL);
+
+ /* Initialize the TID range scan, for the provided block range. */
+ if (node->ss.ss_currentScanDesc == NULL)
+ {
+ TableScanDesc tscandesc;
+ ItemPointerData mintid,
+ maxtid;
+
+ ItemPointerSetBlockNumber(&mintid, range->blkno_start);
+ ItemPointerSetOffsetNumber(&mintid, 0);
+
+ ItemPointerSetBlockNumber(&maxtid, range->blkno_end);
+ ItemPointerSetOffsetNumber(&maxtid, MaxHeapTuplesPerPage);
+
+ elog(DEBUG1, "loading range blocks [%u, %u]",
+ range->blkno_start, range->blkno_end);
+
+ tscandesc = table_beginscan_tidrange(node->ss.ss_currentRelation,
+ estate->es_snapshot,
+ &mintid, &maxtid);
+ node->ss.ss_currentScanDesc = tscandesc;
+ }
+
+ if (node->bs_tuplesortstate == NULL)
+ {
+ TupleDesc tupDesc = RelationGetDescr(node->ss.ss_currentRelation);
+
+ node->bs_tuplesortstate = tuplesort_begin_heap(tupDesc,
+ plan->numCols,
+ plan->sortColIdx,
+ plan->sortOperators,
+ plan->collations,
+ plan->nullsFirst,
+ work_mem,
+ NULL,
+ TUPLESORT_NONE);
+ }
+
+ if (node->bs_tuplestore == NULL)
+ {
+ node->bs_tuplestore = tuplestore_begin_heap(false, false, work_mem);
+ }
+}
+
+/*
+ * brinsort_end_tidscan
+ * Finish the TID range scan.
+ */
+static void
+brinsort_end_tidscan(BrinSortState *node)
+{
+ /* get the first range, read all tuples using a tid range scan */
+ if (node->ss.ss_currentScanDesc != NULL)
+ {
+ table_endscan(node->ss.ss_currentScanDesc);
+ node->ss.ss_currentScanDesc = NULL;
+ }
+}
+
+/*
+ * brinsort_update_watermark
+ * Advance the watermark to the next minval (or maxval for DESC).
+ *
+ * We could actually advance the watermark by multiple steps (not to
+ * the immediately following minval, but a couple more), to accumulate more
+ * rows in the tuplesort. The number of steps we make correlates with the
+ * amount of data we sort in a given step, but we don't know in advance
+ * how many rows (or bytes) that will actually be. We could do some simple
+ * heuristics (measure past sorts and extrapolate).
+ */
+static void
+brinsort_update_watermark(BrinSortState *node, bool asc)
+{
+ int cmp;
+ bool found = false;
+
+ tuplesort_markpos(node->bs_scan->ranges);
+
+ while (tuplesort_gettupleslot(node->bs_scan->ranges, true, false, node->bs_scan->slot, NULL))
+ {
+ bool isnull;
+ Datum value;
+ bool all_nulls;
+ bool not_summarized;
+
+ all_nulls = DatumGetBool(slot_getattr(node->bs_scan->slot, 4, &isnull));
+ Assert(!isnull);
+
+ not_summarized = DatumGetBool(slot_getattr(node->bs_scan->slot, 5, &isnull));
+ Assert(!isnull);
+
+ /* we ignore ranges that are either all_nulls or not summarized */
+ if (all_nulls || not_summarized)
+ continue;
+
+ /* use either minval or maxval, depending on the ASC / DESC */
+ if (asc)
+ value = slot_getattr(node->bs_scan->slot, 6, &isnull);
+ else
+ value = slot_getattr(node->bs_scan->slot, 7, &isnull);
+
+ if (!node->bs_watermark_set)
+ {
+ node->bs_watermark_set = true;
+ node->bs_watermark = value;
+ continue;
+ }
+
+ cmp = ApplySortComparator(node->bs_watermark, false, value, false,
+ &node->bs_sortsupport);
+
+ if (cmp < 0)
+ {
+ node->bs_watermark_set = true;
+ node->bs_watermark = value;
+ found = true;
+ break;
+ }
+ }
+
+ tuplesort_restorepos(node->bs_scan->ranges);
+
+ node->bs_watermark_set = found;
+}
+
+/*
+ * brinsort_load_tuples
+ * Load tuples from the TID range scan, add them to tuplesort/store.
+ *
+ * When called for the "current" range, we don't need to check the watermark,
+ * we know the tuple goes into the tuplesort. So with check_watermark=false
+ * we skip the comparator call to save CPU cost.
+ */
+static void
+brinsort_load_tuples(BrinSortState *node, bool check_watermark, bool null_processing)
+{
+ BrinSort *plan = (BrinSort *) node->ss.ps.plan;
+ TableScanDesc scan;
+ EState *estate;
+ ScanDirection direction;
+ TupleTableSlot *slot;
+ BrinRange *range = node->bs_range;
+
+ estate = node->ss.ps.state;
+ direction = estate->es_direction;
+
+ slot = node->ss.ss_ScanTupleSlot;
+
+ Assert(node->bs_range != NULL);
+
+ /*
+ * If we're not processing NULLs, and this is an all-nulls range, we can
+ * just skip it - we won't find any non-NULL tuples in it.
+ *
+ * XXX Shouldn't happen, thanks to logic in brinsort_next_range().
+ */
+ if (!null_processing && range->all_nulls)
+ return;
+
+ /*
+ * Similarly, if we're processing NULLs and this range does not have
+ * has_nulls flag, we can skip it.
+ *
+ * XXX Shouldn't happen, thanks to logic in brinsort_next_range().
+ */
+ if (null_processing && !(range->has_nulls || range->not_summarized || range->all_nulls))
+ return;
+
+ brinsort_start_tidscan(node);
+
+ scan = node->ss.ss_currentScanDesc;
+
+ /*
+ * Read tuples, evaluate the filter (so that we don't keep tuples only to
+ * discard them later), and decide if it goes into the current range
+ * (tuplesort) or overflow (tuplestore).
+ */
+ while (table_scan_getnextslot_tidrange(scan, direction, slot))
+ {
+ ExprContext *econtext;
+ ExprState *qual;
+
+ /*
+ * Fetch data from node
+ */
+ qual = node->bs_qual;
+ econtext = node->ss.ps.ps_ExprContext;
+
+ /*
+ * place the current tuple into the expr context
+ */
+ econtext->ecxt_scantuple = slot;
+
+ /*
+ * check that the current tuple satisfies the qual-clause
+ *
+ * check for non-null qual here to avoid a function call to ExecQual()
+ * when the qual is null ... saves only a few cycles, but they add up
+ * ...
+ *
+ * XXX Done here, because in ExecScan we'll get different slot type
+ * (minimal tuple vs. buffered tuple). Scan expects slot while reading
+ * from the table (like here), but we're stashing it into a tuplesort.
+ *
+ * XXX Maybe we could eliminate many tuples by leveraging the BRIN
+ * range, by executing the consistent function. But we don't have
+ * the qual in appropriate format at the moment, so we'd preprocess
+ * the keys similarly to bringetbitmap(). In which case we should
+ * probably evaluate the stuff while building the ranges? Although,
+ * if the "consistent" function is expensive, it might be cheaper
+ * to do that incrementally, as we need the ranges. Would be a win
+ * for LIMIT queries, for example.
+ *
+ * XXX However, maybe we could also leverage other bitmap indexes,
+ * particularly for BRIN indexes because that makes it simpler to
+ * eliminate the ranges incrementally - we know which ranges to
+ * load from the index, while for other indexes (e.g. btree) we
+ * have to read the whole index and build a bitmap in order to have
+ * a bitmap for any range. Although, if the condition is very
+ * selective, we may need to read only a small fraction of the
+ * index, so maybe that's OK.
+ */
+ if (qual == NULL || ExecQual(qual, econtext))
+ {
+ int cmp = 0; /* matters for check_watermark=false */
+ Datum value;
+ bool isnull;
+
+ value = slot_getattr(slot, plan->sortColIdx[0], &isnull);
+
+ /*
+ * FIXME Not handling NULLS for now, we need to stash them into
+ * a separate tuplestore (so that we can output them first or
+ * last), and then skip them in the regular processing?
+ */
+ if (null_processing)
+ {
+ /* Stash it into the tuplestore (when NULL), or ignore
+ * it (when not NULL). */
+ if (isnull)
+ tuplestore_puttupleslot(node->bs_tuplestore, slot);
+
+ /* NULL or not, we're done */
+ continue;
+ }
+
+ /* we're not processing NULL values, so ignore NULLs */
+ if (isnull)
+ continue;
+
+ /*
+ * Otherwise compare to watermark, and stash it either to the
+ * tuplesort or tuplestore.
+ */
+ if (check_watermark && node->bs_watermark_set)
+ cmp = ApplySortComparator(value, false,
+ node->bs_watermark, false,
+ &node->bs_sortsupport);
+
+ if (cmp <= 0)
+ tuplesort_puttupleslot(node->bs_tuplesortstate, slot);
+ else
+ tuplestore_puttupleslot(node->bs_tuplestore, slot);
+ }
+
+ ExecClearTuple(slot);
+ }
+
+ ExecClearTuple(slot);
+
+ brinsort_end_tidscan(node);
+}
+
+/*
+ * brinsort_load_spill_tuples
+ * Load tuples from the spill tuplestore, and either stash them into
+ * a tuplesort or a new tuplestore.
+ *
+ * After processing the last range, we want to process all remaining ranges,
+ * so with check_watermark=false we skip the check.
+ */
+static void
+brinsort_load_spill_tuples(BrinSortState *node, bool check_watermark)
+{
+ BrinSort *plan = (BrinSort *) node->ss.ps.plan;
+ Tuplestorestate *tupstore;
+ TupleTableSlot *slot;
+
+ if (node->bs_tuplestore == NULL)
+ return;
+
+ /* start scanning the existing tuplestore (XXX needed?) */
+ tuplestore_rescan(node->bs_tuplestore);
+
+ /*
+ * Create a new tuplestore, for tuples that exceed the watermark and so
+ * should not be included in the current sort.
+ */
+ tupstore = tuplestore_begin_heap(false, false, work_mem);
+
+ /*
+ * We need a slot for minimal tuples. The scan slot uses buffered tuples,
+ * so it'd trigger an error in the loop.
+ */
+ slot = MakeSingleTupleTableSlot(RelationGetDescr(node->ss.ss_currentRelation),
+ &TTSOpsMinimalTuple);
+
+ while (tuplestore_gettupleslot(node->bs_tuplestore, true, true, slot))
+ {
+ int cmp = 0; /* matters for check_watermark=false */
+ bool isnull;
+ Datum value;
+
+ value = slot_getattr(slot, plan->sortColIdx[0], &isnull);
+
+ /* We shouldn't have NULL values in the spill, at least not now. */
+ Assert(!isnull);
+
+ if (check_watermark && node->bs_watermark_set)
+ cmp = ApplySortComparator(value, false,
+ node->bs_watermark, false,
+ &node->bs_sortsupport);
+
+ if (cmp <= 0)
+ tuplesort_puttupleslot(node->bs_tuplesortstate, slot);
+ else
+ tuplestore_puttupleslot(tupstore, slot);
+ }
+
+ /*
+ * Discard the existing tuplestore (that we just processed), use the new
+ * one instead.
+ */
+ tuplestore_end(node->bs_tuplestore);
+ node->bs_tuplestore = tupstore;
+
+ ExecDropSingleTupleTableSlot(slot);
+}
+
+static bool
+brinsort_next_range(BrinSortState *node, bool asc)
+{
+ /* FIXME free the current bs_range, if any */
+ node->bs_range = NULL;
+
+ /*
+ * Mark the position, so that we can restore it in case we reach the
+ * current watermark.
+ */
+ tuplesort_markpos(node->bs_scan->ranges);
+
+ /*
+ * Get the next range and return it, unless we can prove it's the last
+ * range that can possibly match the current condition (thanks to how we
+ * order the ranges).
+ *
+ * Also skip ranges that can't possibly match (e.g. because we are in
+ * NULL processing, and the range has no NULLs).
+ */
+ while (tuplesort_gettupleslot(node->bs_scan->ranges, true, false, node->bs_scan->slot, NULL))
+ {
+ bool isnull;
+ Datum value;
+
+ BrinRange *range = (BrinRange *) palloc(sizeof(BrinRange));
+
+ range->blkno_start = slot_getattr(node->bs_scan->slot, 1, &isnull);
+ range->blkno_end = slot_getattr(node->bs_scan->slot, 2, &isnull);
+ range->has_nulls = slot_getattr(node->bs_scan->slot, 3, &isnull);
+ range->all_nulls = slot_getattr(node->bs_scan->slot, 4, &isnull);
+ range->not_summarized = slot_getattr(node->bs_scan->slot, 5, &isnull);
+ range->min_value = slot_getattr(node->bs_scan->slot, 6, &isnull);
+ range->max_value = slot_getattr(node->bs_scan->slot, 7, &isnull);
+
+ /*
+ * Not-summarized ranges match irrespective of the watermark (if
+ * it's set at all).
+ */
+ if (range->not_summarized)
+ {
+ node->bs_range = range;
+ return true;
+ }
+
+ /*
+ * The range is summarized, but maybe the watermark is not? That
+ * would mean we're processing NULL values, so we skip ranges that
+ * can't possibly match (i.e. with all_nulls=has_nulls=false).
+ */
+ if (!node->bs_watermark_set)
+ {
+ if (range->all_nulls || range->has_nulls)
+ {
+ node->bs_range = range;
+ return true;
+ }
+
+ /* update the position and try the next range */
+ tuplesort_markpos(node->bs_scan->ranges);
+ pfree(range);
+
+ continue;
+ }
+
+ /*
+ * So now we have a summarized range, and we know the watermark
+ * is set too (so we're not processing NULLs). We place the ranges
+ * with only nulls last, so once we hit one we're done.
+ */
+ if (range->all_nulls)
+ {
+ pfree(range);
+ return false; /* no more matching ranges */
+ }
+
+ /*
+ * Compare the range to the watermark, using either the minval or
+ * maxval, depending on ASC/DESC ordering. If the range precedes the
+ * watermark, return it. Otherwise abort, all the future ranges are
+ * either not matching the watermark (thanks to ordering) or contain
+ * only NULL values.
+ */
+
+ /* use minval or maxval, depending on ASC / DESC */
+ value = (asc) ? range->min_value : range->max_value;
+
+ /*
+ * compare it to the current watermark (if set)
+ *
+ * XXX We don't use (... <= 0) here, because then we'd load ranges
+ * with that minval (and there might be multiple), but most of the
+ * rows would go into the tuplestore, because only rows matching the
+ * minval exactly would be loaded into tuplesort.
+ */
+ if (ApplySortComparator(value, false,
+ node->bs_watermark, false,
+ &node->bs_sortsupport) < 0)
+ {
+ node->bs_range = range;
+ return true;
+ }
+
+ pfree(range);
+ break;
+ }
+
+ /* not a matching range, we're done */
+ tuplesort_restorepos(node->bs_scan->ranges);
+
+ return false;
+}
+
+static bool
+brinsort_range_with_nulls(BrinSortState *node)
+{
+ BrinRange *range = node->bs_range;
+
+ if (range->all_nulls || range->has_nulls || range->not_summarized)
+ return true;
+
+ return false;
+}
+
+static void
+brinsort_rescan(BrinSortState *node)
+{
+ tuplesort_rescan(node->bs_scan->ranges);
+}
+
+/* ----------------------------------------------------------------
+ * IndexNext
+ *
+ * Retrieve a tuple from the BrinSort node's currentRelation
+ * using the index specified in the BrinSortState information.
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+IndexNext(BrinSortState *node)
+{
+ BrinSort *plan = (BrinSort *) node->ss.ps.plan;
+ EState *estate;
+ ScanDirection direction;
+ IndexScanDesc scandesc;
+ TupleTableSlot *slot;
+ bool nullsFirst;
+ bool asc;
+
+ /*
+ * extract necessary information from index scan node
+ */
+ estate = node->ss.ps.state;
+ direction = estate->es_direction;
+
+ /* flip direction if this is an overall backward scan */
+ /* XXX For BRIN indexes this is always forward direction */
+ // if (ScanDirectionIsBackward(((BrinSort *) node->ss.ps.plan)->indexorderdir))
+ if (false)
+ {
+ if (ScanDirectionIsForward(direction))
+ direction = BackwardScanDirection;
+ else if (ScanDirectionIsBackward(direction))
+ direction = ForwardScanDirection;
+ }
+ scandesc = node->iss_ScanDesc;
+ slot = node->ss.ss_ScanTupleSlot;
+
+ nullsFirst = plan->nullsFirst[0];
+ asc = ScanDirectionIsForward(plan->indexorderdir);
+
+ if (scandesc == NULL)
+ {
+ /*
+ * We reach here if the index scan is not parallel, or if we're
+ * serially executing an index scan that was planned to be parallel.
+ */
+ scandesc = index_beginscan(node->ss.ss_currentRelation,
+ node->iss_RelationDesc,
+ estate->es_snapshot,
+ node->iss_NumScanKeys,
+ node->iss_NumOrderByKeys);
+
+ node->iss_ScanDesc = scandesc;
+
+ /*
+ * If no run-time keys to calculate or they are ready, go ahead and
+ * pass the scankeys to the index AM.
+ */
+ if (node->iss_NumRuntimeKeys == 0 || node->iss_RuntimeKeysReady)
+ index_rescan(scandesc,
+ node->iss_ScanKeys, node->iss_NumScanKeys,
+ node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+
+ /*
+ * Load info about BRIN ranges, sort them to match the desired ordering.
+ */
+ ExecInitBrinSortRanges(plan, node);
+ node->bs_phase = BRINSORT_START;
+ }
+
+ /*
+ * ok, now that we have what we need, fetch the next tuple.
+ */
+ while (node->bs_phase != BRINSORT_FINISHED)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ elog(DEBUG1, "phase = %d", node->bs_phase);
+
+ AssertCheckRanges(node);
+
+ switch (node->bs_phase)
+ {
+ case BRINSORT_START:
+
+ elog(DEBUG1, "phase = START");
+
+ /*
+ * If we have NULLS FIRST, move to that stage. Otherwise
+ * start scanning regular ranges.
+ */
+ if (nullsFirst)
+ node->bs_phase = BRINSORT_LOAD_NULLS;
+ else
+ {
+ node->bs_phase = BRINSORT_LOAD_RANGE;
+
+ /* set the first watermark */
+ brinsort_update_watermark(node, asc);
+ }
+
+ break;
+
+ case BRINSORT_LOAD_RANGE:
+ {
+ elog(DEBUG1, "phase = LOAD_RANGE");
+
+ /*
+ * Load tuples matching the new watermark from the existing
+ * spill tuplestore. We do this before loading tuples from
+ * the next chunk of ranges, because those will add tuples
+ * to the spill, and we'd end up processing those twice.
+ */
+ brinsort_load_spill_tuples(node, true);
+
+ /*
+ * Load tuples from ranges, until we find a range that has
+ * min_value >= watermark.
+ *
+ * XXX In fact, we are guaranteed to find an exact match
+ * for the watermark, because of how we pick the watermark.
+ */
+ while (brinsort_next_range(node, asc))
+ brinsort_load_tuples(node, true, false);
+
+ /*
+ * If we have loaded any tuples into the tuplesort, try
+ * sorting it and move to producing the tuples.
+ *
+ * XXX The range might have no rows matching the current
+ * watermark, in which case the tuplesort is empty.
+ */
+ if (node->bs_tuplesortstate)
+ {
+ tuplesort_performsort(node->bs_tuplesortstate);
+#ifdef BRINSORT_DEBUG
+ {
+ TuplesortInstrumentation stats;
+
+ tuplesort_get_stats(node->bs_tuplesortstate, &stats);
+
+ elog(DEBUG1, "method: %s space: %ld kB (%s)",
+ tuplesort_method_name(stats.sortMethod),
+ stats.spaceUsed,
+ tuplesort_space_type_name(stats.spaceType));
+ }
+#endif
+ }
+
+ node->bs_phase = BRINSORT_PROCESS_RANGE;
+ break;
+ }
+
+ case BRINSORT_PROCESS_RANGE:
+
+ elog(DEBUG1, "phase BRINSORT_PROCESS_RANGE");
+
+ slot = node->ss.ps.ps_ResultTupleSlot;
+
+ /* read tuples from the tuplesort, and output them */
+ if (node->bs_tuplesortstate != NULL)
+ {
+ if (tuplesort_gettupleslot(node->bs_tuplesortstate,
+ ScanDirectionIsForward(direction),
+ false, slot, NULL))
+ return slot;
+
+ /* once we're done with the tuplesort, reset it */
+ tuplesort_reset(node->bs_tuplesortstate);
+ }
+
+ /*
+ * Now that we processed tuples from the last range batch,
+ * see if we reached the end or if we should try updating
+ * the watermark once again. If the watermark is not set,
+ * we've already processed the last range.
+ */
+ if (!node->bs_watermark_set)
+ {
+ if (nullsFirst)
+ node->bs_phase = BRINSORT_FINISHED;
+ else
+ {
+ brinsort_rescan(node);
+ node->bs_phase = BRINSORT_LOAD_NULLS;
+ }
+ }
+ else
+ {
+ /* update the watermark and try reading more ranges */
+ node->bs_phase = BRINSORT_LOAD_RANGE;
+ brinsort_update_watermark(node, asc);
+ }
+
+ break;
+
+ case BRINSORT_LOAD_NULLS:
+ {
+ elog(DEBUG1, "phase = LOAD_NULLS");
+
+ /*
+ * Try loading another range. If there are no more ranges,
+ * we either move on to loading regular ranges or we're done
+ * (depending on NULLS FIRST/LAST). Otherwise check if this
+ * range can contain NULL values.
+ */
+ while (true)
+ {
+ /* no more ranges - terminate or load regular ranges */
+ if (!brinsort_next_range(node, asc))
+ {
+ if (nullsFirst)
+ {
+ brinsort_rescan(node);
+ node->bs_phase = BRINSORT_LOAD_RANGE;
+ brinsort_update_watermark(node, asc);
+ }
+ else
+ node->bs_phase = BRINSORT_FINISHED;
+
+ break;
+ }
+
+ /* If this range may contain NULLs, process them */
+ if (brinsort_range_with_nulls(node))
+ break;
+ }
+
+ if (node->bs_range == NULL)
+ break;
+
+ /*
+ * There should be nothing left in the tuplestore, because
+ * we flush that at the end of processing regular tuples,
+ * and we don't retain tuples between NULL ranges.
+ */
+ // Assert(node->bs_tuplestore == NULL);
+
+ /*
+ * Load the next unprocessed / NULL range. We don't need to
+ * check watermark while processing NULLS.
+ */
+ brinsort_load_tuples(node, false, true);
+
+ node->bs_phase = BRINSORT_PROCESS_NULLS;
+ break;
+ }
+
+ break;
+
+ case BRINSORT_PROCESS_NULLS:
+
+ elog(DEBUG1, "phase = PROCESS_NULLS");
+
+ slot = node->ss.ps.ps_ResultTupleSlot;
+
+ Assert(node->bs_tuplestore != NULL);
+
+ /* read tuples from the tuplestore, and output them */
+ if (node->bs_tuplestore != NULL)
+ {
+
+ while (tuplestore_gettupleslot(node->bs_tuplestore, true, true, slot))
+ return slot;
+
+ tuplestore_end(node->bs_tuplestore);
+ node->bs_tuplestore = NULL;
+
+ node->bs_phase = BRINSORT_LOAD_NULLS; /* load next range */
+ }
+
+ break;
+
+ case BRINSORT_FINISHED:
+ elog(ERROR, "unexpected BrinSort phase: FINISHED");
+ break;
+ }
+ }
+
+ /*
+ * if we get here it means the index scan failed so we are at the end of
+ * the scan..
+ */
+ node->iss_ReachedEnd = true;
+ return ExecClearTuple(slot);
+}
+
+/*
+ * IndexRecheck -- access method routine to recheck a tuple in EvalPlanQual
+ */
+static bool
+IndexRecheck(BrinSortState *node, TupleTableSlot *slot)
+{
+ ExprContext *econtext;
+
+ /*
+ * extract necessary information from index scan node
+ */
+ econtext = node->ss.ps.ps_ExprContext;
+
+ /* Does the tuple meet the indexqual condition? */
+ econtext->ecxt_scantuple = slot;
+ return ExecQualAndReset(node->indexqualorig, econtext);
+}
+
+
+/* ----------------------------------------------------------------
+ * ExecBrinSort(node)
+ * ----------------------------------------------------------------
+ */
+static TupleTableSlot *
+ExecBrinSort(PlanState *pstate)
+{
+ BrinSortState *node = castNode(BrinSortState, pstate);
+
+ /*
+ * If we have runtime keys and they've not already been set up, do it now.
+ */
+ if (node->iss_NumRuntimeKeys != 0 && !node->iss_RuntimeKeysReady)
+ ExecReScan((PlanState *) node);
+
+ return ExecScan(&node->ss,
+ (ExecScanAccessMtd) IndexNext,
+ (ExecScanRecheckMtd) IndexRecheck);
+}
+
+/* ----------------------------------------------------------------
+ * ExecReScanBrinSort(node)
+ *
+ * Recalculates the values of any scan keys whose value depends on
+ * information known at runtime, then rescans the indexed relation.
+ *
+ * ----------------------------------------------------------------
+ */
+void
+ExecReScanBrinSort(BrinSortState *node)
+{
+ /*
+ * If we are doing runtime key calculations (ie, any of the index key
+ * values weren't simple Consts), compute the new key values. But first,
+ * reset the context so we don't leak memory as each outer tuple is
+ * scanned. Note this assumes that we will recalculate *all* runtime keys
+ * on each call.
+ */
+ if (node->iss_NumRuntimeKeys != 0)
+ {
+ ExprContext *econtext = node->iss_RuntimeContext;
+
+ ResetExprContext(econtext);
+ ExecIndexEvalRuntimeKeys(econtext,
+ node->iss_RuntimeKeys,
+ node->iss_NumRuntimeKeys);
+ }
+ node->iss_RuntimeKeysReady = true;
+
+ /* reset index scan */
+ if (node->iss_ScanDesc)
+ index_rescan(node->iss_ScanDesc,
+ node->iss_ScanKeys, node->iss_NumScanKeys,
+ node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+ node->iss_ReachedEnd = false;
+
+ ExecScanReScan(&node->ss);
+}
+
+
+/* ----------------------------------------------------------------
+ * ExecEndBrinSort
+ * ----------------------------------------------------------------
+ */
+void
+ExecEndBrinSort(BrinSortState *node)
+{
+ Relation indexRelationDesc;
+ IndexScanDesc IndexScanDesc;
+
+ /*
+ * extract information from the node
+ */
+ indexRelationDesc = node->iss_RelationDesc;
+ IndexScanDesc = node->iss_ScanDesc;
+
+ /*
+ * clear out tuple table slots
+ */
+ if (node->ss.ps.ps_ResultTupleSlot)
+ ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ ExecClearTuple(node->ss.ss_ScanTupleSlot);
+
+ /*
+ * close the index relation (no-op if we didn't open it)
+ */
+ if (IndexScanDesc)
+ index_endscan(IndexScanDesc);
+ if (indexRelationDesc)
+ index_close(indexRelationDesc, NoLock);
+
+ if (node->ss.ss_currentScanDesc != NULL)
+ table_endscan(node->ss.ss_currentScanDesc);
+
+ if (node->bs_tuplestore != NULL)
+ tuplestore_end(node->bs_tuplestore);
+ node->bs_tuplestore = NULL;
+
+ if (node->bs_tuplesortstate != NULL)
+ tuplesort_end(node->bs_tuplesortstate);
+ node->bs_tuplesortstate = NULL;
+}
+
+/* ----------------------------------------------------------------
+ * ExecBrinSortMarkPos
+ *
+ * Note: we assume that no caller attempts to set a mark before having read
+ * at least one tuple. Otherwise, iss_ScanDesc might still be NULL.
+ * ----------------------------------------------------------------
+ */
+void
+ExecBrinSortMarkPos(BrinSortState *node)
+{
+ EState *estate = node->ss.ps.state;
+ EPQState *epqstate = estate->es_epq_active;
+
+ if (epqstate != NULL)
+ {
+ /*
+ * We are inside an EvalPlanQual recheck. If a test tuple exists for
+ * this relation, then we shouldn't access the index at all. We would
+ * instead need to save, and later restore, the state of the
+ * relsubs_done flag, so that re-fetching the test tuple is possible.
+ * However, given the assumption that no caller sets a mark at the
+ * start of the scan, we can only get here with relsubs_done[i]
+ * already set, and so no state need be saved.
+ */
+ Index scanrelid = ((Scan *) node->ss.ps.plan)->scanrelid;
+
+ Assert(scanrelid > 0);
+ if (epqstate->relsubs_slot[scanrelid - 1] != NULL ||
+ epqstate->relsubs_rowmark[scanrelid - 1] != NULL)
+ {
+ /* Verify the claim above */
+ if (!epqstate->relsubs_done[scanrelid - 1])
+ elog(ERROR, "unexpected ExecBrinSortMarkPos call in EPQ recheck");
+ return;
+ }
+ }
+
+ index_markpos(node->iss_ScanDesc);
+}
+
+/* ----------------------------------------------------------------
+ * ExecBrinSortRestrPos
+ * ----------------------------------------------------------------
+ */
+void
+ExecBrinSortRestrPos(BrinSortState *node)
+{
+ EState *estate = node->ss.ps.state;
+ EPQState *epqstate = estate->es_epq_active;
+
+ if (estate->es_epq_active != NULL)
+ {
+ /* See comments in ExecBrinSortMarkPos */
+ Index scanrelid = ((Scan *) node->ss.ps.plan)->scanrelid;
+
+ Assert(scanrelid > 0);
+ if (epqstate->relsubs_slot[scanrelid - 1] != NULL ||
+ epqstate->relsubs_rowmark[scanrelid - 1] != NULL)
+ {
+ /* Verify the claim above */
+ if (!epqstate->relsubs_done[scanrelid - 1])
+ elog(ERROR, "unexpected ExecBrinSortRestrPos call in EPQ recheck");
+ return;
+ }
+ }
+
+ index_restrpos(node->iss_ScanDesc);
+}
+
+/*
+ * somewhat crippled version of bringetbitmap
+ *
+ * XXX We don't call consistent function (or any other function), so unlike
+ * bringetbitmap we don't set a separate memory context. If we end up filtering
+ * the ranges somehow (e.g. by WHERE conditions), this might be necessary.
+ *
+ * XXX Should be part of the opclass, i.e. moved somewhere into brin_minmax.c etc.
+ */
+static void
+ExecInitBrinSortRanges(BrinSort *node, BrinSortState *planstate)
+{
+ IndexScanDesc scan = planstate->iss_ScanDesc;
+ Relation indexRel = planstate->iss_RelationDesc;
+ int attno;
+ FmgrInfo *rangeproc;
+ BrinRangeScanDesc *brscan;
+ bool asc;
+
+ /* BRIN Sort only allows ORDER BY using a single column */
+ Assert(node->numCols == 1);
+
+ /*
+ * Determine index attnum we're interested in. The sortColIdx has attnums
+ * from the table, but we need index attnum so that we can fetch the right
+ * range summary.
+ *
+ * XXX Maybe we could/should arrange the tlists differently, so that this
+ * is not necessary?
+ *
+ * FIXME This is broken, node->sortColIdx[0] is an index into the target
+ * list, not table attnum.
+ *
+ * FIXME Also the projection is broken.
+ */
+ attno = 0;
+ for (int i = 0; i < indexRel->rd_index->indnatts; i++)
+ {
+ if (indexRel->rd_index->indkey.values[i] == node->sortColIdx[0])
+ {
+ attno = (i + 1);
+ break;
+ }
+ }
+
+ /* make sure we matched the argument */
+ Assert(attno > 0);
+
+ /* get procedure to generate sort ranges */
+ rangeproc = index_getprocinfo(indexRel, attno, BRIN_PROCNUM_RANGES);
+
+ /*
+ * Should not get here without a proc, thanks to the check before
+ * building the BrinSort path.
+ */
+ Assert(rangeproc != NULL);
+
+ memset(&planstate->bs_sortsupport, 0, sizeof(SortSupportData));
+ PrepareSortSupportFromOrderingOp(node->sortOperators[0], &planstate->bs_sortsupport);
+
+ /*
+ * Determine if this ASC or DESC sort, so that we can request the
+ * ranges in the appropriate order (ordered either by minval for
+ * ASC, or by maxval for DESC).
+ */
+ asc = ScanDirectionIsForward(node->indexorderdir);
+
+ /*
+ * Ask the opclass to produce ranges in appropriate ordering.
+ *
+ * XXX Pass info about ASC/DESC, NULLS FIRST/LAST.
+ */
+ brscan = (BrinRangeScanDesc *) DatumGetPointer(FunctionCall3Coll(rangeproc,
+ InvalidOid, /* FIXME use proper collation */
+ PointerGetDatum(scan),
+ Int16GetDatum(attno),
+ BoolGetDatum(asc)));
+
+ /* remember the range scan descriptor (ranges in the requested ordering) */
+ planstate->bs_scan = brscan;
+}
+
+/* ----------------------------------------------------------------
+ * ExecInitBrinSort
+ *
+ * Initializes the index scan's state information, creates
+ * scan keys, and opens the base and index relations.
+ *
+ * Note: index scans have 2 sets of state information because
+ * we have to keep track of the base relation and the
+ * index relation.
+ * ----------------------------------------------------------------
+ */
+BrinSortState *
+ExecInitBrinSort(BrinSort *node, EState *estate, int eflags)
+{
+ BrinSortState *indexstate;
+ Relation currentRelation;
+ LOCKMODE lockmode;
+
+ /*
+ * create state structure
+ */
+ indexstate = makeNode(BrinSortState);
+ indexstate->ss.ps.plan = (Plan *) node;
+ indexstate->ss.ps.state = estate;
+ indexstate->ss.ps.ExecProcNode = ExecBrinSort;
+
+ /*
+ * Miscellaneous initialization
+ *
+ * create expression context for node
+ */
+ ExecAssignExprContext(estate, &indexstate->ss.ps);
+
+ /*
+ * open the scan relation
+ */
+ currentRelation = ExecOpenScanRelation(estate, node->scan.scanrelid, eflags);
+
+ indexstate->ss.ss_currentRelation = currentRelation;
+ indexstate->ss.ss_currentScanDesc = NULL; /* no heap scan here */
+
+ /*
+ * get the scan type from the relation descriptor.
+ */
+ ExecInitScanTupleSlot(estate, &indexstate->ss,
+ RelationGetDescr(currentRelation),
+ table_slot_callbacks(currentRelation));
+
+ /*
+ * Initialize result type and projection.
+ */
+ ExecInitResultTypeTL(&indexstate->ss.ps);
+ ExecAssignScanProjectionInfo(&indexstate->ss);
+
+ /*
+ * initialize child expressions
+ *
+ * Note: we don't initialize all of the indexqual expression, only the
+ * sub-parts corresponding to runtime keys (see below). Likewise for
+ * indexorderby, if any. But the indexqualorig expression is always
+ * initialized even though it will only be used in some uncommon cases ---
+ * would be nice to improve that. (Problem is that any SubPlans present
+ * in the expression must be found now...)
+ */
+ indexstate->ss.ps.qual =
+ ExecInitQual(node->scan.plan.qual, (PlanState *) indexstate);
+ indexstate->indexqualorig =
+ ExecInitQual(node->indexqualorig, (PlanState *) indexstate);
+
+ /*
+ * If we are just doing EXPLAIN (ie, aren't going to run the plan), stop
+ * here. This allows an index-advisor plugin to EXPLAIN a plan containing
+ * references to nonexistent indexes.
+ */
+ if (eflags & EXEC_FLAG_EXPLAIN_ONLY)
+ return indexstate;
+
+ /* Open the index relation. */
+ lockmode = exec_rt_fetch(node->scan.scanrelid, estate)->rellockmode;
+ indexstate->iss_RelationDesc = index_open(node->indexid, lockmode);
+
+ /*
+ * Initialize index-specific scan state
+ */
+ indexstate->iss_RuntimeKeysReady = false;
+ indexstate->iss_RuntimeKeys = NULL;
+ indexstate->iss_NumRuntimeKeys = 0;
+
+ /*
+ * build the index scan keys from the index qualification
+ */
+ ExecIndexBuildScanKeys((PlanState *) indexstate,
+ indexstate->iss_RelationDesc,
+ node->indexqual,
+ false,
+ &indexstate->iss_ScanKeys,
+ &indexstate->iss_NumScanKeys,
+ &indexstate->iss_RuntimeKeys,
+ &indexstate->iss_NumRuntimeKeys,
+ NULL, /* no ArrayKeys */
+ NULL);
+
+ /*
+ * If we have runtime keys, we need an ExprContext to evaluate them. The
+ * node's standard context won't do because we want to reset that context
+ * for every tuple. So, build another context just like the other one...
+ * -tgl 7/11/00
+ */
+ if (indexstate->iss_NumRuntimeKeys != 0)
+ {
+ ExprContext *stdecontext = indexstate->ss.ps.ps_ExprContext;
+
+ ExecAssignExprContext(estate, &indexstate->ss.ps);
+ indexstate->iss_RuntimeContext = indexstate->ss.ps.ps_ExprContext;
+ indexstate->ss.ps.ps_ExprContext = stdecontext;
+ }
+ else
+ {
+ indexstate->iss_RuntimeContext = NULL;
+ }
+
+ indexstate->bs_tuplesortstate = NULL;
+ indexstate->bs_qual = indexstate->ss.ps.qual;
+ indexstate->ss.ps.qual = NULL;
+ ExecInitResultTupleSlotTL(&indexstate->ss.ps, &TTSOpsMinimalTuple);
+
+ /*
+ * all done.
+ */
+ return indexstate;
+}
+
+/* ----------------------------------------------------------------
+ * Parallel Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ * ExecBrinSortEstimate
+ *
+ * Compute the amount of space we'll need in the parallel
+ * query DSM, and inform pcxt->estimator about our needs.
+ * ----------------------------------------------------------------
+ */
+void
+ExecBrinSortEstimate(BrinSortState *node,
+ ParallelContext *pcxt)
+{
+ EState *estate = node->ss.ps.state;
+
+ node->iss_PscanLen = index_parallelscan_estimate(node->iss_RelationDesc,
+ estate->es_snapshot);
+ shm_toc_estimate_chunk(&pcxt->estimator, node->iss_PscanLen);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ * ExecBrinSortInitializeDSM
+ *
+ * Set up a parallel index scan descriptor.
+ * ----------------------------------------------------------------
+ */
+void
+ExecBrinSortInitializeDSM(BrinSortState *node,
+ ParallelContext *pcxt)
+{
+ EState *estate = node->ss.ps.state;
+ ParallelIndexScanDesc piscan;
+
+ piscan = shm_toc_allocate(pcxt->toc, node->iss_PscanLen);
+ index_parallelscan_initialize(node->ss.ss_currentRelation,
+ node->iss_RelationDesc,
+ estate->es_snapshot,
+ piscan);
+ shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, piscan);
+ node->iss_ScanDesc =
+ index_beginscan_parallel(node->ss.ss_currentRelation,
+ node->iss_RelationDesc,
+ node->iss_NumScanKeys,
+ node->iss_NumOrderByKeys,
+ piscan);
+
+ /*
+ * If no run-time keys to calculate or they are ready, go ahead and pass
+ * the scankeys to the index AM.
+ */
+ if (node->iss_NumRuntimeKeys == 0 || node->iss_RuntimeKeysReady)
+ index_rescan(node->iss_ScanDesc,
+ node->iss_ScanKeys, node->iss_NumScanKeys,
+ node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+}
+
+/* ----------------------------------------------------------------
+ * ExecBrinSortReInitializeDSM
+ *
+ * Reset shared state before beginning a fresh scan.
+ * ----------------------------------------------------------------
+ */
+void
+ExecBrinSortReInitializeDSM(BrinSortState *node,
+ ParallelContext *pcxt)
+{
+ index_parallelrescan(node->iss_ScanDesc);
+}
+
+/* ----------------------------------------------------------------
+ * ExecBrinSortInitializeWorker
+ *
+ * Copy relevant information from TOC into planstate.
+ * ----------------------------------------------------------------
+ */
+void
+ExecBrinSortInitializeWorker(BrinSortState *node,
+ ParallelWorkerContext *pwcxt)
+{
+ ParallelIndexScanDesc piscan;
+
+ piscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
+ node->iss_ScanDesc =
+ index_beginscan_parallel(node->ss.ss_currentRelation,
+ node->iss_RelationDesc,
+ node->iss_NumScanKeys,
+ node->iss_NumOrderByKeys,
+ piscan);
+
+ /*
+ * If no run-time keys to calculate or they are ready, go ahead and pass
+ * the scankeys to the index AM.
+ */
+ if (node->iss_NumRuntimeKeys == 0 || node->iss_RuntimeKeysReady)
+ index_rescan(node->iss_ScanDesc,
+ node->iss_ScanKeys, node->iss_NumScanKeys,
+ node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 4c6b1d1f55b..64d103b19e9 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -790,6 +790,260 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
path->path.total_cost = startup_cost + run_cost;
}
+void
+cost_brinsort(BrinSortPath *path, PlannerInfo *root, double loop_count,
+ bool partial_path)
+{
+ IndexOptInfo *index = path->ipath.indexinfo;
+ RelOptInfo *baserel = index->rel;
+ amcostestimate_function amcostestimate;
+ List *qpquals;
+ Cost startup_cost = 0;
+ Cost run_cost = 0;
+ Cost cpu_run_cost = 0;
+ Cost indexStartupCost;
+ Cost indexTotalCost;
+ Selectivity indexSelectivity;
+ double indexCorrelation,
+ csquared;
+ double spc_seq_page_cost,
+ spc_random_page_cost;
+ Cost min_IO_cost,
+ max_IO_cost;
+ QualCost qpqual_cost;
+ Cost cpu_per_tuple;
+ double tuples_fetched;
+ double pages_fetched;
+ double rand_heap_pages;
+ double index_pages;
+
+ /* Should only be applied to base relations */
+ Assert(IsA(baserel, RelOptInfo) &&
+ IsA(index, IndexOptInfo));
+ Assert(baserel->relid > 0);
+ Assert(baserel->rtekind == RTE_RELATION);
+
+ /*
+ * Mark the path with the correct row estimate, and identify which quals
+ * will need to be enforced as qpquals. We need not check any quals that
+ * are implied by the index's predicate, so we can use indrestrictinfo not
+ * baserestrictinfo as the list of relevant restriction clauses for the
+ * rel.
+ */
+ if (path->ipath.path.param_info)
+ {
+ path->ipath.path.rows = path->ipath.path.param_info->ppi_rows;
+ /* qpquals come from the rel's restriction clauses and ppi_clauses */
+ qpquals = list_concat(extract_nonindex_conditions(path->ipath.indexinfo->indrestrictinfo,
+ path->ipath.indexclauses),
+ extract_nonindex_conditions(path->ipath.path.param_info->ppi_clauses,
+ path->ipath.indexclauses));
+ }
+ else
+ {
+ path->ipath.path.rows = baserel->rows;
+ /* qpquals come from just the rel's restriction clauses */
+ qpquals = extract_nonindex_conditions(path->ipath.indexinfo->indrestrictinfo,
+ path->ipath.indexclauses);
+ }
+
+ if (!enable_indexscan)
+ startup_cost += disable_cost;
+ /* we don't need to check enable_indexonlyscan; indxpath.c does that */
+
+ /*
+ * Call index-access-method-specific code to estimate the processing cost
+ * for scanning the index, as well as the selectivity of the index (ie,
+ * the fraction of main-table tuples we will have to retrieve) and its
+ * correlation to the main-table tuple order. We need a cast here because
+ * pathnodes.h uses a weak function type to avoid including amapi.h.
+ */
+ amcostestimate = (amcostestimate_function) index->amcostestimate;
+ amcostestimate(root, &path->ipath, loop_count,
+ &indexStartupCost, &indexTotalCost,
+ &indexSelectivity, &indexCorrelation,
+ &index_pages);
+
+ /*
+ * Save amcostestimate's results for possible use in bitmap scan planning.
+ * We don't bother to save indexStartupCost or indexCorrelation, because a
+ * bitmap scan doesn't care about either.
+ */
+ path->ipath.indextotalcost = indexTotalCost;
+ path->ipath.indexselectivity = indexSelectivity;
+
+ /* all costs for touching index itself included here */
+ startup_cost += indexStartupCost;
+ run_cost += indexTotalCost - indexStartupCost;
+
+ /* estimate number of main-table tuples fetched */
+ tuples_fetched = clamp_row_est(indexSelectivity * baserel->tuples);
+
+ /* fetch estimated page costs for tablespace containing table */
+ get_tablespace_page_costs(baserel->reltablespace,
+ &spc_random_page_cost,
+ &spc_seq_page_cost);
+
+ /*----------
+ * Estimate number of main-table pages fetched, and compute I/O cost.
+ *
+ * When the index ordering is uncorrelated with the table ordering,
+ * we use an approximation proposed by Mackert and Lohman (see
+ * index_pages_fetched() for details) to compute the number of pages
+ * fetched, and then charge spc_random_page_cost per page fetched.
+ *
+ * When the index ordering is exactly correlated with the table ordering
+ * (just after a CLUSTER, for example), the number of pages fetched should
+ * be exactly selectivity * table_size. What's more, all but the first
+ * will be sequential fetches, not the random fetches that occur in the
+ * uncorrelated case. So if the number of pages is more than 1, we
+ * ought to charge
+ * spc_random_page_cost + (pages_fetched - 1) * spc_seq_page_cost
+ * For partially-correlated indexes, we ought to charge somewhere between
+ * these two estimates. We currently interpolate linearly between the
+ * estimates based on the correlation squared (XXX is that appropriate?).
+ *
+ * If it's an index-only scan, then we will not need to fetch any heap
+ * pages for which the visibility map shows all tuples are visible.
+ * Hence, reduce the estimated number of heap fetches accordingly.
+ * We use the measured fraction of the entire heap that is all-visible,
+ * which might not be particularly relevant to the subset of the heap
+ * that this query will fetch; but it's not clear how to do better.
+ *----------
+ */
+ if (loop_count > 1)
+ {
+ /*
+ * For repeated indexscans, the appropriate estimate for the
+ * uncorrelated case is to scale up the number of tuples fetched in
+ * the Mackert and Lohman formula by the number of scans, so that we
+ * estimate the number of pages fetched by all the scans; then
+ * pro-rate the costs for one scan. In this case we assume all the
+ * fetches are random accesses.
+ */
+ pages_fetched = index_pages_fetched(tuples_fetched * loop_count,
+ baserel->pages,
+ (double) index->pages,
+ root);
+
+ rand_heap_pages = pages_fetched;
+
+ max_IO_cost = (pages_fetched * spc_random_page_cost) / loop_count;
+
+ /*
+ * In the perfectly correlated case, the number of pages touched by
+ * each scan is selectivity * table_size, and we can use the Mackert
+ * and Lohman formula at the page level to estimate how much work is
+ * saved by caching across scans. We still assume all the fetches are
+ * random, though, which is an overestimate that's hard to correct for
+ * without double-counting the cache effects. (But in most cases
+ * where such a plan is actually interesting, only one page would get
+ * fetched per scan anyway, so it shouldn't matter much.)
+ */
+ pages_fetched = ceil(indexSelectivity * (double) baserel->pages);
+
+ pages_fetched = index_pages_fetched(pages_fetched * loop_count,
+ baserel->pages,
+ (double) index->pages,
+ root);
+
+ min_IO_cost = (pages_fetched * spc_random_page_cost) / loop_count;
+ }
+ else
+ {
+ /*
+ * Normal case: apply the Mackert and Lohman formula, and then
+ * interpolate between that and the correlation-derived result.
+ */
+ pages_fetched = index_pages_fetched(tuples_fetched,
+ baserel->pages,
+ (double) index->pages,
+ root);
+
+ rand_heap_pages = pages_fetched;
+
+ /* max_IO_cost is for the perfectly uncorrelated case (csquared=0) */
+ max_IO_cost = pages_fetched * spc_random_page_cost;
+
+ /* min_IO_cost is for the perfectly correlated case (csquared=1) */
+ pages_fetched = ceil(indexSelectivity * (double) baserel->pages);
+
+ if (pages_fetched > 0)
+ {
+ min_IO_cost = spc_random_page_cost;
+ if (pages_fetched > 1)
+ min_IO_cost += (pages_fetched - 1) * spc_seq_page_cost;
+ }
+ else
+ min_IO_cost = 0;
+ }
+
+ if (partial_path)
+ {
+ /*
+ * Estimate the number of parallel workers required to scan index. Use
+ * the number of heap pages computed considering heap fetches won't be
+ * sequential as for parallel scans the pages are accessed in random
+ * order.
+ */
+ path->ipath.path.parallel_workers = compute_parallel_worker(baserel,
+ rand_heap_pages,
+ index_pages,
+ max_parallel_workers_per_gather);
+
+ /*
+ * Fall out if workers can't be assigned for parallel scan, because in
+ * such a case this path will be rejected. So there is no benefit in
+ * doing extra computation.
+ */
+ if (path->ipath.path.parallel_workers <= 0)
+ return;
+
+ path->ipath.path.parallel_aware = true;
+ }
+
+ /*
+ * Now interpolate based on estimated index order correlation to get total
+ * disk I/O cost for main table accesses.
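+ * (For example, indexCorrelation = 0.5 gives csquared = 0.25, so the
+ * estimate below lands a quarter of the way from the fully-random
+ * max_IO_cost towards the fully-sequential min_IO_cost.)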
+ */
+ csquared = indexCorrelation * indexCorrelation;
+
+ run_cost += max_IO_cost + csquared * (min_IO_cost - max_IO_cost);
+
+ /*
+ * Estimate CPU costs per tuple.
+ *
+ * What we want here is cpu_tuple_cost plus the evaluation costs of any
+ * qual clauses that we have to evaluate as qpquals.
+ */
+ cost_qual_eval(&qpqual_cost, qpquals, root);
+
+ startup_cost += qpqual_cost.startup;
+ cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
+
+ cpu_run_cost += cpu_per_tuple * tuples_fetched;
+
+ /* tlist eval costs are paid per output row, not per tuple scanned */
+ startup_cost += path->ipath.path.pathtarget->cost.startup;
+ cpu_run_cost += path->ipath.path.pathtarget->cost.per_tuple * path->ipath.path.rows;
+
+ /* Adjust costing for parallelism, if used. */
+ if (path->ipath.path.parallel_workers > 0)
+ {
+ double parallel_divisor = get_parallel_divisor(&path->ipath.path);
+
+ path->ipath.path.rows = clamp_row_est(path->ipath.path.rows / parallel_divisor);
+
+ /* The CPU cost is divided among all the workers. */
+ cpu_run_cost /= parallel_divisor;
+ }
+
+ run_cost += cpu_run_cost;
+
+ path->ipath.path.startup_cost = startup_cost;
+ path->ipath.path.total_cost = startup_cost + run_cost;
+}
+
/*
* extract_nonindex_conditions
*
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index c31fcc917df..18b625460eb 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -17,12 +17,16 @@
#include <math.h>
+#include "access/brin_internal.h"
+#include "access/relation.h"
#include "access/stratnum.h"
#include "access/sysattr.h"
#include "catalog/pg_am.h"
#include "catalog/pg_operator.h"
+#include "catalog/pg_opclass.h"
#include "catalog/pg_opfamily.h"
#include "catalog/pg_type.h"
+#include "miscadmin.h"
#include "nodes/makefuncs.h"
#include "nodes/nodeFuncs.h"
#include "nodes/supportnodes.h"
@@ -32,10 +36,13 @@
#include "optimizer/paths.h"
#include "optimizer/prep.h"
#include "optimizer/restrictinfo.h"
+#include "utils/rel.h"
#include "utils/lsyscache.h"
#include "utils/selfuncs.h"
+bool enable_brinsort = true;
+
/* XXX see PartCollMatchesExprColl */
#define IndexCollMatchesExprColl(idxcollation, exprcollation) \
((idxcollation) == InvalidOid || (idxcollation) == (exprcollation))
@@ -1127,6 +1134,185 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
}
}
+ /*
+ * If this is a BRIN index with a suitable opclass (minmax or such), we may
+ * try doing BRIN sort. BRIN indexes are not ordered and amcanorderbyop
+ * is set to false, so we probably will need some new opclass flag to
+ * mark indexes that support this.
+ */
+ if (enable_brinsort && pathkeys_possibly_useful)
+ {
+ ListCell *lc;
+ Relation rel2 = relation_open(index->indexoid, NoLock);
+ int idx;
+
+ /*
+ * Try generating sorted paths for each key with the right opclass.
+ */
+ idx = -1;
+ foreach(lc, index->indextlist)
+ {
+ TargetEntry *indextle = (TargetEntry *) lfirst(lc);
+ BrinSortPath *bpath;
+ Oid rangeproc;
+ AttrNumber attnum;
+
+ idx++;
+ attnum = (idx + 1);
+
+ /* skip expressions for now */
+ if (!AttributeNumberIsValid(index->indexkeys[idx]))
+ continue;
+
+ /* XXX ignore non-BRIN indexes */
+ if (rel2->rd_rel->relam != BRIN_AM_OID)
+ continue;
+
+ /*
+ * XXX Ignore keys not using an opclass with the "ranges" proc.
+ * For now we only do this for some minmax opclasses, but adding
+ * it to all minmax is simple, and adding it to minmax-multi
+ * should not be very hard.
+ */
+ rangeproc = index_getprocid(rel2, attnum, BRIN_PROCNUM_RANGES);
+ if (!OidIsValid(rangeproc))
+ continue;
+
+ /*
+ * XXX stuff extracted from build_index_pathkeys, except that we
+ * only deal with a single index key (producing a single pathkey),
+ * so we only sort on a single column. I guess we could use more
+ * index keys and sort on more expressions? Would that mean these
+ * keys need to be rather well correlated? In any case, it seems
+ * rather complex to implement, so I leave it as a possible
+ * future improvement.
+ *
+ * XXX This could also use the other BRIN keys (even from other
+ * indexes) in a different way - we might use the other ranges
+ * to quickly eliminate some of the chunks, essentially like a
+ * bitmap, but maybe without using the bitmap. Or we might use
+ * other indexes through bitmaps.
+ *
+ * XXX This fakes a number of parameters, because we don't store
+ * the btree opclass in the index; instead we use the default
+ * one for the key data type. And BRIN does not allow specifying
+ *
+ * XXX We don't add the path to result, because this function is
+ * supposed to generate IndexPaths. Instead, we just add the path
+ * using add_path(). We should be building this in a different
+ * place, perhaps in create_index_paths() or so.
+ *
+ * XXX By building it elsewhere, we could also leverage the index
+ * paths we've built here, particularly the bitmap index paths,
+ * which we could use to eliminate many of the ranges.
+ *
+ * XXX We don't have any explicit ordering associated with the
+ * BRIN index, e.g. we don't have ASC/DESC and NULLS FIRST/LAST.
+ * So this is not encoded in the index, and we can satisfy all
+ * these cases - but we need to add paths for each combination.
+ * I wonder if there's a better way to do this.
+ */
+
+ /* ASC NULLS LAST */
+ index_pathkeys = build_index_pathkeys_brin(root, index, indextle,
+ idx,
+ false, /* reverse_sort */
+ false); /* nulls_first */
+
+ useful_pathkeys = truncate_useless_pathkeys(root, rel,
+ index_pathkeys);
+
+ if (useful_pathkeys != NIL)
+ {
+ bpath = create_brinsort_path(root, index,
+ index_clauses,
+ useful_pathkeys,
+ ForwardScanDirection,
+ index_only_scan,
+ outer_relids,
+ loop_count,
+ false);
+
+ /* cheat and add it anyway */
+ add_path(rel, (Path *) bpath);
+ }
+
+ /* DESC NULLS LAST */
+ index_pathkeys = build_index_pathkeys_brin(root, index, indextle,
+ idx,
+ true, /* reverse_sort */
+ false); /* nulls_first */
+
+ useful_pathkeys = truncate_useless_pathkeys(root, rel,
+ index_pathkeys);
+
+ if (useful_pathkeys != NIL)
+ {
+ bpath = create_brinsort_path(root, index,
+ index_clauses,
+ useful_pathkeys,
+ BackwardScanDirection,
+ index_only_scan,
+ outer_relids,
+ loop_count,
+ false);
+
+ /* cheat and add it anyway */
+ add_path(rel, (Path *) bpath);
+ }
+
+ /* ASC NULLS FIRST */
+ index_pathkeys = build_index_pathkeys_brin(root, index, indextle,
+ idx,
+ false, /* reverse_sort */
+ true); /* nulls_first */
+
+ useful_pathkeys = truncate_useless_pathkeys(root, rel,
+ index_pathkeys);
+
+ if (useful_pathkeys != NIL)
+ {
+ bpath = create_brinsort_path(root, index,
+ index_clauses,
+ useful_pathkeys,
+ ForwardScanDirection,
+ index_only_scan,
+ outer_relids,
+ loop_count,
+ false);
+
+ /* cheat and add it anyway */
+ add_path(rel, (Path *) bpath);
+ }
+
+ /* DESC NULLS FIRST */
+ index_pathkeys = build_index_pathkeys_brin(root, index, indextle,
+ idx,
+ true, /* reverse_sort */
+ true); /* nulls_first */
+
+ useful_pathkeys = truncate_useless_pathkeys(root, rel,
+ index_pathkeys);
+
+ if (useful_pathkeys != NIL)
+ {
+ bpath = create_brinsort_path(root, index,
+ index_clauses,
+ useful_pathkeys,
+ BackwardScanDirection,
+ index_only_scan,
+ outer_relids,
+ loop_count,
+ false);
+
+ /* cheat and add it anyway */
+ add_path(rel, (Path *) bpath);
+ }
+ }
+
+ relation_close(rel2, NoLock);
+ }
+
return result;
}
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index a9943cd6e01..83dde6f22eb 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -27,6 +27,7 @@
#include "optimizer/paths.h"
#include "partitioning/partbounds.h"
#include "utils/lsyscache.h"
+#include "utils/typcache.h"
static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
@@ -630,6 +631,55 @@ build_index_pathkeys(PlannerInfo *root,
return retval;
}
+
+List *
+build_index_pathkeys_brin(PlannerInfo *root,
+ IndexOptInfo *index,
+ TargetEntry *tle,
+ int idx,
+ bool reverse_sort,
+ bool nulls_first)
+{
+ TypeCacheEntry *typcache;
+ PathKey *cpathkey;
+ Oid sortopfamily;
+
+ /*
+ * Get default btree opfamily for the type, extracted from the
+ * entry in index targetlist.
+ *
+ * XXX Is there a better / more correct way to do this?
+ */
+ typcache = lookup_type_cache(exprType((Node *) tle->expr),
+ TYPECACHE_BTREE_OPFAMILY);
+ sortopfamily = typcache->btree_opf;
+
+ /*
+ * OK, try to make a canonical pathkey for this sort key. Note we're
+ * underneath any outer joins, so nullable_relids should be NULL.
+ */
+ cpathkey = make_pathkey_from_sortinfo(root,
+ tle->expr,
+ NULL,
+ sortopfamily,
+ index->opcintype[idx],
+ index->indexcollations[idx],
+ reverse_sort,
+ nulls_first,
+ 0,
+ index->rel->relids,
+ false);
+
+ /*
+ * There may be no pathkey if we haven't matched any sortkey, in which
+ * case ignore it.
+ */
+ if (!cpathkey)
+ return NIL;
+
+ return list_make1(cpathkey);
+}
+
/*
* partkey_is_bool_constant_for_query
*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index ac86ce90033..395c632f430 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -124,6 +124,8 @@ static SampleScan *create_samplescan_plan(PlannerInfo *root, Path *best_path,
List *tlist, List *scan_clauses);
static Scan *create_indexscan_plan(PlannerInfo *root, IndexPath *best_path,
List *tlist, List *scan_clauses, bool indexonly);
+static BrinSort *create_brinsort_plan(PlannerInfo *root, BrinSortPath *best_path,
+ List *tlist, List *scan_clauses);
static BitmapHeapScan *create_bitmap_scan_plan(PlannerInfo *root,
BitmapHeapPath *best_path,
List *tlist, List *scan_clauses);
@@ -191,6 +193,9 @@ static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
List *indexorderby,
List *indextlist,
ScanDirection indexscandir);
+static BrinSort *make_brinsort(List *qptlist, List *qpqual, Index scanrelid,
+ Oid indexid, List *indexqual, List *indexqualorig,
+ ScanDirection indexscandir);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -410,6 +415,9 @@ create_plan_recurse(PlannerInfo *root, Path *best_path, int flags)
case T_CustomScan:
plan = create_scan_plan(root, best_path, flags);
break;
+ case T_BrinSort:
+ plan = create_scan_plan(root, best_path, flags);
+ break;
case T_HashJoin:
case T_MergeJoin:
case T_NestLoop:
@@ -776,6 +784,13 @@ create_scan_plan(PlannerInfo *root, Path *best_path, int flags)
scan_clauses);
break;
+ case T_BrinSort:
+ plan = (Plan *) create_brinsort_plan(root,
+ (BrinSortPath *) best_path,
+ tlist,
+ scan_clauses);
+ break;
+
default:
elog(ERROR, "unrecognized node type: %d",
(int) best_path->pathtype);
@@ -3180,6 +3195,154 @@ create_indexscan_plan(PlannerInfo *root,
return scan_plan;
}
+/*
+ * create_brinsort_plan
+ * Returns a brinsort plan for the base relation scanned by 'best_path'
+ * with restriction clauses 'scan_clauses' and targetlist 'tlist'.
+ *
+ * This is mostly a slightly simplified version of create_indexscan_plan, with
+ * the unnecessary parts removed (we don't support index-only scans, or reordering
+ * and similar stuff).
+ */
+static BrinSort *
+create_brinsort_plan(PlannerInfo *root,
+ BrinSortPath *best_path,
+ List *tlist,
+ List *scan_clauses)
+{
+ BrinSort *brinsort_plan;
+ List *indexclauses = best_path->ipath.indexclauses;
+ Index baserelid = best_path->ipath.path.parent->relid;
+ IndexOptInfo *indexinfo = best_path->ipath.indexinfo;
+ Oid indexoid = indexinfo->indexoid;
+ List *qpqual;
+ List *stripped_indexquals;
+ List *fixed_indexquals;
+ ListCell *l;
+
+ List *pathkeys = best_path->ipath.path.pathkeys;
+
+ /* it should be a base rel... */
+ Assert(baserelid > 0);
+ Assert(best_path->ipath.path.parent->rtekind == RTE_RELATION);
+
+ /*
+ * Extract the index qual expressions (stripped of RestrictInfos) from the
+ * IndexClauses list, and prepare a copy with index Vars substituted for
+ * table Vars. (This step also does replace_nestloop_params on the
+ * fixed_indexquals.)
+ */
+ fix_indexqual_references(root, &best_path->ipath,
+ &stripped_indexquals,
+ &fixed_indexquals);
+
+ /*
+ * The qpqual list must contain all restrictions not automatically handled
+ * by the index, other than pseudoconstant clauses which will be handled
+ * by a separate gating plan node. All the predicates in the indexquals
+ * will be checked (either by the index itself, or by nodeIndexscan.c),
+ * but if there are any "special" operators involved then they must be
+ * included in qpqual. The upshot is that qpqual must contain
+ * scan_clauses minus whatever appears in indexquals.
+ *
+ * is_redundant_with_indexclauses() detects cases where a scan clause is
+ * present in the indexclauses list or is generated from the same
+ * EquivalenceClass as some indexclause, and is therefore redundant with
+ * it, though not equal. (The latter happens when indxpath.c prefers a
+ * different derived equality than what generate_join_implied_equalities
+ * picked for a parameterized scan's ppi_clauses.) Note that it will not
+ * match to lossy index clauses, which is critical because we have to
+ * include the original clause in qpqual in that case.
+ *
+ * In some situations (particularly with OR'd index conditions) we may
+ * have scan_clauses that are not equal to, but are logically implied by,
+ * the index quals; so we also try a predicate_implied_by() check to see
+ * if we can discard quals that way. (predicate_implied_by assumes its
+ * first input contains only immutable functions, so we have to check
+ * that.)
+ *
+ * Note: if you change this bit of code you should also look at
+ * extract_nonindex_conditions() in costsize.c.
+ */
+ qpqual = NIL;
+ foreach(l, scan_clauses)
+ {
+ RestrictInfo *rinfo = lfirst_node(RestrictInfo, l);
+
+ if (rinfo->pseudoconstant)
+ continue; /* we may drop pseudoconstants here */
+ if (is_redundant_with_indexclauses(rinfo, indexclauses))
+ continue; /* dup or derived from same EquivalenceClass */
+ if (!contain_mutable_functions((Node *) rinfo->clause) &&
+ predicate_implied_by(list_make1(rinfo->clause), stripped_indexquals,
+ false))
+ continue; /* provably implied by indexquals */
+ qpqual = lappend(qpqual, rinfo);
+ }
+
+ /* Sort clauses into best execution order */
+ qpqual = order_qual_clauses(root, qpqual);
+
+ /* Reduce RestrictInfo list to bare expressions; ignore pseudoconstants */
+ qpqual = extract_actual_clauses(qpqual, false);
+
+ /*
+ * We have to replace any outer-relation variables with nestloop params in
+ * the indexqualorig, qpqual, and indexorderbyorig expressions. A bit
+ * annoying to have to do this separately from the processing in
+ * fix_indexqual_references --- rethink this when generalizing the inner
+ * indexscan support. But note we can't really do this earlier because
+ * it'd break the comparisons to predicates above ... (or would it? Those
+ * wouldn't have outer refs)
+ */
+ if (best_path->ipath.path.param_info)
+ {
+ stripped_indexquals = (List *)
+ replace_nestloop_params(root, (Node *) stripped_indexquals);
+ qpqual = (List *)
+ replace_nestloop_params(root, (Node *) qpqual);
+ }
+
+ /* Finally ready to build the plan node */
+ brinsort_plan = make_brinsort(tlist,
+ qpqual,
+ baserelid,
+ indexoid,
+ fixed_indexquals,
+ stripped_indexquals,
+ best_path->ipath.indexscandir);
+
+ if (pathkeys != NIL)
+ {
+ /*
+ * Compute sort column info, and adjust the BrinSort's tlist as needed.
+ * Because we pass adjust_tlist_in_place = true, we may ignore the
+ * function result; it must be the same plan node. However, we then
+ * need to detect whether any tlist entries were added.
+ */
+ (void) prepare_sort_from_pathkeys((Plan *) brinsort_plan, pathkeys,
+ best_path->ipath.path.parent->relids,
+ NULL,
+ true,
+ &brinsort_plan->numCols,
+ &brinsort_plan->sortColIdx,
+ &brinsort_plan->sortOperators,
+ &brinsort_plan->collations,
+ &brinsort_plan->nullsFirst);
+ //tlist_was_changed = (orig_tlist_length != list_length(plan->plan.targetlist));
+ for (int i = 0; i < brinsort_plan->numCols; i++)
+ elog(DEBUG1, "%d => %d %d %d %d", i,
+ brinsort_plan->sortColIdx[i],
+ brinsort_plan->sortOperators[i],
+ brinsort_plan->collations[i],
+ brinsort_plan->nullsFirst[i]);
+ }
+
+ copy_generic_path_info(&brinsort_plan->scan.plan, &best_path->ipath.path);
+
+ return brinsort_plan;
+}
+
/*
* create_bitmap_scan_plan
* Returns a bitmap scan plan for the base relation scanned by 'best_path'
@@ -5523,6 +5686,31 @@ make_indexscan(List *qptlist,
return node;
}
+static BrinSort *
+make_brinsort(List *qptlist,
+ List *qpqual,
+ Index scanrelid,
+ Oid indexid,
+ List *indexqual,
+ List *indexqualorig,
+ ScanDirection indexscandir)
+{
+ BrinSort *node = makeNode(BrinSort);
+ Plan *plan = &node->scan.plan;
+
+ plan->targetlist = qptlist;
+ plan->qual = qpqual;
+ plan->lefttree = NULL;
+ plan->righttree = NULL;
+ node->scan.scanrelid = scanrelid;
+ node->indexid = indexid;
+ node->indexqual = indexqual;
+ node->indexqualorig = indexqualorig;
+ node->indexorderdir = indexscandir;
+
+ return node;
+}
+
static IndexOnlyScan *
make_indexonlyscan(List *qptlist,
List *qpqual,
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 1cb0abdbc1f..2584a1f032d 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -609,6 +609,25 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
return set_indexonlyscan_references(root, splan, rtoffset);
}
break;
+ case T_BrinSort:
+ {
+ BrinSort *splan = (BrinSort *) plan;
+
+ splan->scan.scanrelid += rtoffset;
+ splan->scan.plan.targetlist =
+ fix_scan_list(root, splan->scan.plan.targetlist,
+ rtoffset, NUM_EXEC_TLIST(plan));
+ splan->scan.plan.qual =
+ fix_scan_list(root, splan->scan.plan.qual,
+ rtoffset, NUM_EXEC_QUAL(plan));
+ splan->indexqual =
+ fix_scan_list(root, splan->indexqual,
+ rtoffset, 1);
+ splan->indexqualorig =
+ fix_scan_list(root, splan->indexqualorig,
+ rtoffset, NUM_EXEC_QUAL(plan));
+ }
+ break;
case T_BitmapIndexScan:
{
BitmapIndexScan *splan = (BitmapIndexScan *) plan;
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 70f61ae7b1c..6471bbb5de8 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1030,6 +1030,63 @@ create_index_path(PlannerInfo *root,
return pathnode;
}
+
+/*
+ * create_brinsort_path
+ * Creates a path node for a BRIN Sort scan, i.e. a BRIN index scan
+ * producing sorted output.
+ *
+ * 'index' is a usable index.
+ * 'indexclauses' is a list of IndexClause nodes representing clauses
+ * to be enforced as qual conditions in the scan.
+ * 'pathkeys' describes the ordering of the path.
+ * 'indexscandir' is ForwardScanDirection or BackwardScanDirection,
+ * depending on the requested sort order.
+ * 'indexonly' is true if an index-only scan is wanted.
+ * 'required_outer' is the set of outer relids for a parameterized path.
+ * 'loop_count' is the number of repetitions of the indexscan to factor into
+ * estimates of caching behavior.
+ * 'partial_path' is true if constructing a parallel index scan path.
+ *
+ * Returns the new path node.
+ */
+BrinSortPath *
+create_brinsort_path(PlannerInfo *root,
+ IndexOptInfo *index,
+ List *indexclauses,
+ List *pathkeys,
+ ScanDirection indexscandir,
+ bool indexonly,
+ Relids required_outer,
+ double loop_count,
+ bool partial_path)
+{
+ BrinSortPath *pathnode = makeNode(BrinSortPath);
+ RelOptInfo *rel = index->rel;
+
+ pathnode->ipath.path.pathtype = T_BrinSort;
+ pathnode->ipath.path.parent = rel;
+ pathnode->ipath.path.pathtarget = rel->reltarget;
+ pathnode->ipath.path.param_info = get_baserel_parampathinfo(root, rel,
+ required_outer);
+ pathnode->ipath.path.parallel_aware = false;
+ pathnode->ipath.path.parallel_safe = rel->consider_parallel;
+ pathnode->ipath.path.parallel_workers = 0;
+ pathnode->ipath.path.pathkeys = pathkeys;
+
+ pathnode->ipath.indexinfo = index;
+ pathnode->ipath.indexclauses = indexclauses;
+ pathnode->ipath.indexscandir = indexscandir;
+
+ cost_brinsort(pathnode, root, loop_count, partial_path);
+
+ return pathnode;
+}
+
/*
* create_bitmap_heap_path
* Creates a path node for a bitmap scan.
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 06dfeb6cd8b..a5ca3bd0cc4 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -977,6 +977,16 @@ struct config_bool ConfigureNamesBool[] =
false,
NULL, NULL, NULL
},
+ {
+ {"enable_brinsort", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of BRIN sort plans."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_brinsort,
+ false,
+ NULL, NULL, NULL
+ },
{
{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
gettext_noop("Enables genetic query optimization."),
diff --git a/src/include/access/brin.h b/src/include/access/brin.h
index a7cccac9c90..be05586ec57 100644
--- a/src/include/access/brin.h
+++ b/src/include/access/brin.h
@@ -34,41 +34,6 @@ typedef struct BrinStatsData
BlockNumber revmapNumPages;
} BrinStatsData;
-/*
- * Info about ranges for BRIN Sort.
- */
-typedef struct BrinRange
-{
- BlockNumber blkno_start;
- BlockNumber blkno_end;
-
- Datum min_value;
- Datum max_value;
- bool has_nulls;
- bool all_nulls;
- bool not_summarized;
-
- /*
- * Index of the range when ordered by min_value (if there are multiple
- * ranges with the same min_value, it's the lowest one).
- */
- uint32 min_index;
-
- /*
- * Minimum min_index from all ranges with higher max_value (i.e. when
- * sorted by max_value). If there are multiple ranges with the same
- * max_value, it depends on the ordering (i.e. the ranges may get
- * different min_index_lowest, depending on the exact ordering).
- */
- uint32 min_index_lowest;
-} BrinRange;
-
-typedef struct BrinRanges
-{
- int nranges;
- BrinRange ranges[FLEXIBLE_ARRAY_MEMBER];
-} BrinRanges;
-
typedef struct BrinMinmaxStats
{
int32 vl_len_; /* varlena header (do not touch directly!) */
diff --git a/src/include/access/brin_internal.h b/src/include/access/brin_internal.h
index f4be357c176..06a36f769c5 100644
--- a/src/include/access/brin_internal.h
+++ b/src/include/access/brin_internal.h
@@ -73,6 +73,7 @@ typedef struct BrinDesc
#define BRIN_PROCNUM_UNION 4
#define BRIN_MANDATORY_NPROCS 4
#define BRIN_PROCNUM_OPTIONS 5 /* optional */
+#define BRIN_PROCNUM_RANGES 7 /* optional */
/* procedure numbers up to 10 are reserved for BRIN future expansion */
#define BRIN_FIRST_OPTIONAL_PROCNUM 11
#define BRIN_PROCNUM_STATISTICS 11 /* optional */
diff --git a/src/include/catalog/pg_amproc.dat b/src/include/catalog/pg_amproc.dat
index 558df53206d..7a22eaef33c 100644
--- a/src/include/catalog/pg_amproc.dat
+++ b/src/include/catalog/pg_amproc.dat
@@ -806,6 +806,8 @@
amprocrighttype => 'bytea', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/bytea_minmax_ops', amproclefttype => 'bytea',
amprocrighttype => 'bytea', amprocnum => '11', amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/bytea_minmax_ops', amproclefttype => 'bytea',
+ amprocrighttype => 'bytea', amprocnum => '7', amproc => 'brin_minmax_ranges' },
# bloom bytea
{ amprocfamily => 'brin/bytea_bloom_ops', amproclefttype => 'bytea',
@@ -839,6 +841,8 @@
amprocrighttype => 'char', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/char_minmax_ops', amproclefttype => 'char',
amprocrighttype => 'char', amprocnum => '11', amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/char_minmax_ops', amproclefttype => 'char',
+ amprocrighttype => 'char', amprocnum => '7', amproc => 'brin_minmax_ranges' },
# bloom "char"
{ amprocfamily => 'brin/char_bloom_ops', amproclefttype => 'char',
@@ -870,6 +874,8 @@
amprocrighttype => 'name', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/name_minmax_ops', amproclefttype => 'name',
amprocrighttype => 'name', amprocnum => '11', amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/name_minmax_ops', amproclefttype => 'name',
+ amprocrighttype => 'name', amprocnum => '7', amproc => 'brin_minmax_ranges' },
# bloom name
{ amprocfamily => 'brin/name_bloom_ops', amproclefttype => 'name',
@@ -901,6 +907,8 @@
amprocrighttype => 'int8', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int8',
amprocrighttype => 'int8', amprocnum => '11', amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int8',
+ amprocrighttype => 'int8', amprocnum => '7', amproc => 'brin_minmax_ranges' },
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int2',
amprocrighttype => 'int2', amprocnum => '1',
@@ -915,6 +923,8 @@
amprocrighttype => 'int2', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int2',
amprocrighttype => 'int2', amprocnum => '11', amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int2',
+ amprocrighttype => 'int2', amprocnum => '7', amproc => 'brin_minmax_ranges' },
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int4',
amprocrighttype => 'int4', amprocnum => '1',
@@ -929,6 +939,8 @@
amprocrighttype => 'int4', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int4',
amprocrighttype => 'int4', amprocnum => '11', amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int4',
+ amprocrighttype => 'int4', amprocnum => '7', amproc => 'brin_minmax_ranges' },
# minmax multi integer: int2, int4, int8
{ amprocfamily => 'brin/integer_minmax_multi_ops', amproclefttype => 'int2',
@@ -1048,6 +1060,8 @@
amprocrighttype => 'text', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/text_minmax_ops', amproclefttype => 'text',
amprocrighttype => 'text', amprocnum => '11', amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/text_minmax_ops', amproclefttype => 'text',
+ amprocrighttype => 'text', amprocnum => '7', amproc => 'brin_minmax_ranges' },
# bloom text
{ amprocfamily => 'brin/text_bloom_ops', amproclefttype => 'text',
@@ -1078,6 +1092,8 @@
amprocrighttype => 'oid', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/oid_minmax_ops', amproclefttype => 'oid',
amprocrighttype => 'oid', amprocnum => '11', amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/oid_minmax_ops', amproclefttype => 'oid',
+ amprocrighttype => 'oid', amprocnum => '7', amproc => 'brin_minmax_ranges' },
# minmax multi oid
{ amprocfamily => 'brin/oid_minmax_multi_ops', amproclefttype => 'oid',
@@ -1128,6 +1144,8 @@
amprocrighttype => 'tid', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/tid_minmax_ops', amproclefttype => 'tid',
amprocrighttype => 'tid', amprocnum => '11', amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/tid_minmax_ops', amproclefttype => 'tid',
+ amprocrighttype => 'tid', amprocnum => '7', amproc => 'brin_minmax_ranges' },
# bloom tid
{ amprocfamily => 'brin/tid_bloom_ops', amproclefttype => 'tid',
@@ -1181,6 +1199,9 @@
{ amprocfamily => 'brin/float_minmax_ops', amproclefttype => 'float4',
amprocrighttype => 'float4', amprocnum => '11',
amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/float_minmax_ops', amproclefttype => 'float4',
+ amprocrighttype => 'float4', amprocnum => '7',
+ amproc => 'brin_minmax_ranges' },
{ amprocfamily => 'brin/float_minmax_ops', amproclefttype => 'float8',
amprocrighttype => 'float8', amprocnum => '1',
@@ -1197,6 +1218,9 @@
{ amprocfamily => 'brin/float_minmax_ops', amproclefttype => 'float8',
amprocrighttype => 'float8', amprocnum => '11',
amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/float_minmax_ops', amproclefttype => 'float8',
+ amprocrighttype => 'float8', amprocnum => '7',
+ amproc => 'brin_minmax_ranges' },
# minmax multi float
{ amprocfamily => 'brin/float_minmax_multi_ops', amproclefttype => 'float4',
@@ -1288,6 +1312,9 @@
{ amprocfamily => 'brin/macaddr_minmax_ops', amproclefttype => 'macaddr',
amprocrighttype => 'macaddr', amprocnum => '11',
amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/macaddr_minmax_ops', amproclefttype => 'macaddr',
+ amprocrighttype => 'macaddr', amprocnum => '7',
+ amproc => 'brin_minmax_ranges' },
# minmax multi macaddr
{ amprocfamily => 'brin/macaddr_minmax_multi_ops', amproclefttype => 'macaddr',
@@ -1344,6 +1371,9 @@
{ amprocfamily => 'brin/macaddr8_minmax_ops', amproclefttype => 'macaddr8',
amprocrighttype => 'macaddr8', amprocnum => '11',
amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/macaddr8_minmax_ops', amproclefttype => 'macaddr8',
+ amprocrighttype => 'macaddr8', amprocnum => '7',
+ amproc => 'brin_minmax_ranges' },
# minmax multi macaddr8
{ amprocfamily => 'brin/macaddr8_minmax_multi_ops',
@@ -1398,6 +1428,8 @@
amprocrighttype => 'inet', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/network_minmax_ops', amproclefttype => 'inet',
amprocrighttype => 'inet', amprocnum => '11', amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/network_minmax_ops', amproclefttype => 'inet',
+ amprocrighttype => 'inet', amprocnum => '7', amproc => 'brin_minmax_ranges' },
# minmax multi inet
{ amprocfamily => 'brin/network_minmax_multi_ops', amproclefttype => 'inet',
@@ -1471,6 +1503,9 @@
{ amprocfamily => 'brin/bpchar_minmax_ops', amproclefttype => 'bpchar',
amprocrighttype => 'bpchar', amprocnum => '11',
amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/bpchar_minmax_ops', amproclefttype => 'bpchar',
+ amprocrighttype => 'bpchar', amprocnum => '7',
+ amproc => 'brin_minmax_ranges' },
# bloom character
{ amprocfamily => 'brin/bpchar_bloom_ops', amproclefttype => 'bpchar',
@@ -1504,6 +1539,8 @@
amprocrighttype => 'time', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/time_minmax_ops', amproclefttype => 'time',
amprocrighttype => 'time', amprocnum => '11', amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/time_minmax_ops', amproclefttype => 'time',
+ amprocrighttype => 'time', amprocnum => '7', amproc => 'brin_minmax_ranges' },
# minmax multi time without time zone
{ amprocfamily => 'brin/time_minmax_multi_ops', amproclefttype => 'time',
@@ -1557,6 +1594,9 @@
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'timestamp',
amprocrighttype => 'timestamp', amprocnum => '11',
amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'timestamp',
+ amprocrighttype => 'timestamp', amprocnum => '7',
+ amproc => 'brin_minmax_ranges' },
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'timestamptz',
amprocrighttype => 'timestamptz', amprocnum => '1',
@@ -1573,6 +1613,9 @@
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'timestamptz',
amprocrighttype => 'timestamptz', amprocnum => '11',
amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'timestamptz',
+ amprocrighttype => 'timestamptz', amprocnum => '7',
+ amproc => 'brin_minmax_ranges' },
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'date',
amprocrighttype => 'date', amprocnum => '1',
@@ -1587,6 +1630,8 @@
amprocrighttype => 'date', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'date',
amprocrighttype => 'date', amprocnum => '11', amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'date',
+ amprocrighttype => 'date', amprocnum => '7', amproc => 'brin_minmax_ranges' },
# minmax multi datetime (date, timestamp, timestamptz)
{ amprocfamily => 'brin/datetime_minmax_multi_ops',
@@ -1716,6 +1761,9 @@
{ amprocfamily => 'brin/interval_minmax_ops', amproclefttype => 'interval',
amprocrighttype => 'interval', amprocnum => '11',
amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/interval_minmax_ops', amproclefttype => 'interval',
+ amprocrighttype => 'interval', amprocnum => '7',
+ amproc => 'brin_minmax_ranges' },
# minmax multi interval
{ amprocfamily => 'brin/interval_minmax_multi_ops',
@@ -1772,6 +1820,9 @@
{ amprocfamily => 'brin/timetz_minmax_ops', amproclefttype => 'timetz',
amprocrighttype => 'timetz', amprocnum => '11',
amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/timetz_minmax_ops', amproclefttype => 'timetz',
+ amprocrighttype => 'timetz', amprocnum => '7',
+ amproc => 'brin_minmax_ranges' },
# minmax multi time with time zone
{ amprocfamily => 'brin/timetz_minmax_multi_ops', amproclefttype => 'timetz',
@@ -1824,6 +1875,8 @@
amprocrighttype => 'bit', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/bit_minmax_ops', amproclefttype => 'bit',
amprocrighttype => 'bit', amprocnum => '11', amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/bit_minmax_ops', amproclefttype => 'bit',
+ amprocrighttype => 'bit', amprocnum => '7', amproc => 'brin_minmax_ranges' },
# minmax bit varying
{ amprocfamily => 'brin/varbit_minmax_ops', amproclefttype => 'varbit',
@@ -1841,6 +1894,9 @@
{ amprocfamily => 'brin/varbit_minmax_ops', amproclefttype => 'varbit',
amprocrighttype => 'varbit', amprocnum => '11',
amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/varbit_minmax_ops', amproclefttype => 'varbit',
+ amprocrighttype => 'varbit', amprocnum => '7',
+ amproc => 'brin_minmax_ranges' },
# minmax numeric
{ amprocfamily => 'brin/numeric_minmax_ops', amproclefttype => 'numeric',
@@ -1858,6 +1914,9 @@
{ amprocfamily => 'brin/numeric_minmax_ops', amproclefttype => 'numeric',
amprocrighttype => 'numeric', amprocnum => '11',
amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/numeric_minmax_ops', amproclefttype => 'numeric',
+ amprocrighttype => 'numeric', amprocnum => '7',
+ amproc => 'brin_minmax_ranges' },
# minmax multi numeric
{ amprocfamily => 'brin/numeric_minmax_multi_ops', amproclefttype => 'numeric',
@@ -1912,6 +1971,8 @@
amprocrighttype => 'uuid', amprocnum => '4', amproc => 'brin_minmax_union' },
{ amprocfamily => 'brin/uuid_minmax_ops', amproclefttype => 'uuid',
amprocrighttype => 'uuid', amprocnum => '11', amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/uuid_minmax_ops', amproclefttype => 'uuid',
+ amprocrighttype => 'uuid', amprocnum => '7', amproc => 'brin_minmax_ranges' },
# minmax multi uuid
{ amprocfamily => 'brin/uuid_minmax_multi_ops', amproclefttype => 'uuid',
@@ -1988,6 +2049,9 @@
{ amprocfamily => 'brin/pg_lsn_minmax_ops', amproclefttype => 'pg_lsn',
amprocrighttype => 'pg_lsn', amprocnum => '11',
amproc => 'brin_minmax_stats' },
+{ amprocfamily => 'brin/pg_lsn_minmax_ops', amproclefttype => 'pg_lsn',
+ amprocrighttype => 'pg_lsn', amprocnum => '7',
+ amproc => 'brin_minmax_ranges' },
# minmax multi pg_lsn
{ amprocfamily => 'brin/pg_lsn_minmax_multi_ops', amproclefttype => 'pg_lsn',
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 1dd9177b01c..18e0824a08e 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8411,6 +8411,9 @@
proname => 'brin_minmax_stats', prorettype => 'bool',
proargtypes => 'internal internal int2 int2 internal int4',
prosrc => 'brin_minmax_stats' },
+{ oid => '9980', descr => 'BRIN minmax support',
+ proname => 'brin_minmax_ranges', prorettype => 'bool',
+ proargtypes => 'internal int2 bool', prosrc => 'brin_minmax_ranges' },
# BRIN minmax multi
{ oid => '4616', descr => 'BRIN multi minmax support',
diff --git a/src/include/executor/nodeBrinSort.h b/src/include/executor/nodeBrinSort.h
new file mode 100644
index 00000000000..2c860d926ea
--- /dev/null
+++ b/src/include/executor/nodeBrinSort.h
@@ -0,0 +1,47 @@
+/*-------------------------------------------------------------------------
+ *
+ * nodeBrinSort.h
+ *
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/executor/nodeBrinSort.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef NODEBRINSORT_H
+#define NODEBRINSORT_H
+
+#include "access/genam.h"
+#include "access/parallel.h"
+#include "nodes/execnodes.h"
+
+extern BrinSortState *ExecInitBrinSort(BrinSort *node, EState *estate, int eflags);
+extern void ExecEndBrinSort(BrinSortState *node);
+extern void ExecBrinSortMarkPos(BrinSortState *node);
+extern void ExecBrinSortRestrPos(BrinSortState *node);
+extern void ExecReScanBrinSort(BrinSortState *node);
+extern void ExecBrinSortEstimate(BrinSortState *node, ParallelContext *pcxt);
+extern void ExecBrinSortInitializeDSM(BrinSortState *node, ParallelContext *pcxt);
+extern void ExecBrinSortReInitializeDSM(BrinSortState *node, ParallelContext *pcxt);
+extern void ExecBrinSortInitializeWorker(BrinSortState *node,
+ ParallelWorkerContext *pwcxt);
+
+/*
+ * These routines are exported to share code with nodeIndexonlyscan.c and
+ * nodeBitmapIndexscan.c
+ */
+extern void ExecIndexBuildScanKeys(PlanState *planstate, Relation index,
+ List *quals, bool isorderby,
+ ScanKey *scanKeys, int *numScanKeys,
+ IndexRuntimeKeyInfo **runtimeKeys, int *numRuntimeKeys,
+ IndexArrayKeyInfo **arrayKeys, int *numArrayKeys);
+extern void ExecIndexEvalRuntimeKeys(ExprContext *econtext,
+ IndexRuntimeKeyInfo *runtimeKeys, int numRuntimeKeys);
+extern bool ExecIndexEvalArrayKeys(ExprContext *econtext,
+ IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
+extern bool ExecIndexAdvanceArrayKeys(IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
+
+#endif /* NODEBRINSORT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 01b1727fc09..381c2fcd3d6 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1549,6 +1549,109 @@ typedef struct IndexScanState
Size iss_PscanLen;
} IndexScanState;
+typedef enum {
+ BRINSORT_START,
+ BRINSORT_LOAD_RANGE,
+ BRINSORT_PROCESS_RANGE,
+ BRINSORT_LOAD_NULLS,
+ BRINSORT_PROCESS_NULLS,
+ BRINSORT_FINISHED
+} BrinSortPhase;
+
+typedef struct BrinRangeScanDesc
+{
+ /* range info tuple descriptor */
+ TupleDesc tdesc;
+
+ /* ranges, sorted by minval, blkno_start */
+ Tuplesortstate *ranges;
+
+ /* distinct minval (sorted) */
+ Tuplestorestate *minvals;
+
+ /* slot for accessing the tuplesort/tuplestore */
+ TupleTableSlot *slot;
+
+} BrinRangeScanDesc;
+
+/*
+ * Info about ranges for BRIN Sort.
+ */
+typedef struct BrinRange
+{
+ BlockNumber blkno_start;
+ BlockNumber blkno_end;
+
+ Datum min_value;
+ Datum max_value;
+ bool has_nulls;
+ bool all_nulls;
+ bool not_summarized;
+
+ /*
+ * Index of the range when ordered by min_value (if there are multiple
+ * ranges with the same min_value, it's the lowest one).
+ */
+ uint32 min_index;
+
+ /*
+ * Minimum min_index from all ranges with higher max_value (i.e. when
+ * sorted by max_value). If there are multiple ranges with the same
+ * max_value, it depends on the ordering (i.e. the ranges may get
+ * different min_index_lowest, depending on the exact ordering).
+ */
+ uint32 min_index_lowest;
+} BrinRange;
+
+typedef struct BrinRanges
+{
+ int nranges;
+ BrinRange ranges[FLEXIBLE_ARRAY_MEMBER];
+} BrinRanges;
+
+typedef struct BrinSortState
+{
+ ScanState ss; /* its first field is NodeTag */
+ ExprState *indexqualorig;
+ List *indexorderbyorig;
+ struct ScanKeyData *iss_ScanKeys;
+ int iss_NumScanKeys;
+ struct ScanKeyData *iss_OrderByKeys;
+ int iss_NumOrderByKeys;
+ IndexRuntimeKeyInfo *iss_RuntimeKeys;
+ int iss_NumRuntimeKeys;
+ bool iss_RuntimeKeysReady;
+ ExprContext *iss_RuntimeContext;
+ Relation iss_RelationDesc;
+ struct IndexScanDescData *iss_ScanDesc;
+
+ /* These are needed for re-checking ORDER BY expr ordering */
+ pairingheap *iss_ReorderQueue;
+ bool iss_ReachedEnd;
+ Datum *iss_OrderByValues;
+ bool *iss_OrderByNulls;
+ SortSupport iss_SortSupport;
+ bool *iss_OrderByTypByVals;
+ int16 *iss_OrderByTypLens;
+ Size iss_PscanLen;
+
+ /* */
+ BrinRangeScanDesc *bs_scan;
+ BrinRange *bs_range;
+ ExprState *bs_qual;
+ Datum bs_watermark;
+ bool bs_watermark_set;
+ BrinSortPhase bs_phase;
+ SortSupportData bs_sortsupport;
+
+ /*
+ * We need two stores - a tuplesort for the current range, and a
+ * tuplestore for spill-over tuples from the overlapping ranges.
+ */
+ void *bs_tuplesortstate;
+ Tuplestorestate *bs_tuplestore;
+} BrinSortState;
+
/* ----------------
* IndexOnlyScanState information
*
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 6bda383bead..e79c904a8fc 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1596,6 +1596,17 @@ typedef struct IndexPath
Selectivity indexselectivity;
} IndexPath;
+/*
+ * read sorted data from brin index
+ *
+ * We use IndexPath, because that's what amcostestimate is expecting, but
+ * we typedef it as a separate struct.
+ */
+typedef struct BrinSortPath
+{
+ IndexPath ipath;
+} BrinSortPath;
+
/*
* Each IndexClause references a RestrictInfo node from the query's WHERE
* or JOIN conditions, and shows how that restriction can be applied to
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 21e642a64c4..c4ef5362acc 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -495,6 +495,32 @@ typedef struct IndexOnlyScan
ScanDirection indexorderdir; /* forward or backward or don't care */
} IndexOnlyScan;
+
+typedef struct BrinSort
+{
+ Scan scan;
+ Oid indexid; /* OID of index to scan */
+ List *indexqual; /* list of index quals (usually OpExprs) */
+ List *indexqualorig; /* the same in original form */
+ ScanDirection indexorderdir; /* forward or backward or don't care */
+
+ /* number of sort-key columns */
+ int numCols;
+
+ /* their indexes in the target list */
+ AttrNumber *sortColIdx pg_node_attr(array_size(numCols));
+
+ /* OIDs of operators to sort them by */
+ Oid *sortOperators pg_node_attr(array_size(numCols));
+
+ /* OIDs of collations */
+ Oid *collations pg_node_attr(array_size(numCols));
+
+ /* NULLS FIRST/LAST directions */
+ bool *nullsFirst pg_node_attr(array_size(numCols));
+
+} BrinSort;
+
/* ----------------
* bitmap index scan node
*
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 204e94b6d10..b77440728d1 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -69,6 +69,7 @@ extern PGDLLIMPORT bool enable_parallel_append;
extern PGDLLIMPORT bool enable_parallel_hash;
extern PGDLLIMPORT bool enable_partition_pruning;
extern PGDLLIMPORT bool enable_async_append;
+extern PGDLLIMPORT bool enable_brinsort;
extern PGDLLIMPORT int constraint_exclusion;
extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
@@ -79,6 +80,8 @@ extern void cost_samplescan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
ParamPathInfo *param_info);
extern void cost_index(IndexPath *path, PlannerInfo *root,
double loop_count, bool partial_path);
+extern void cost_brinsort(BrinSortPath *path, PlannerInfo *root,
+ double loop_count, bool partial_path);
extern void cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
ParamPathInfo *param_info,
Path *bitmapqual, double loop_count);
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 050f00e79a4..11caad3ec51 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -49,6 +49,15 @@ extern IndexPath *create_index_path(PlannerInfo *root,
Relids required_outer,
double loop_count,
bool partial_path);
+extern BrinSortPath *create_brinsort_path(PlannerInfo *root,
+ IndexOptInfo *index,
+ List *indexclauses,
+ List *pathkeys,
+ ScanDirection indexscandir,
+ bool indexonly,
+ Relids required_outer,
+ double loop_count,
+ bool partial_path);
extern BitmapHeapPath *create_bitmap_heap_path(PlannerInfo *root,
RelOptInfo *rel,
Path *bitmapqual,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 41f765d3422..6aa50257730 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -213,6 +213,9 @@ extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
ScanDirection scandir);
+extern List *build_index_pathkeys_brin(PlannerInfo *root, IndexOptInfo *index,
+ TargetEntry *tle, int idx,
+ bool reverse_sort, bool nulls_first);
extern List *build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
ScanDirection scandir, bool *partialkeys);
extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
--
2.25.1
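
(Not part of the patch - just a rough sketch of how a reviewer might exercise the
DESC / NULLS FIRST path variants generated in build_index_paths() above. The table
"tst", its "ts" column, the index name and the pages_per_range value are made up
for illustration; because the GUC default differs between the main patch and the
0004 fixup below, enable_brinsort is set explicitly, and other GUCs may need
tweaking to keep the planner from picking a different plan.)

set enable_brinsort = on;
set enable_seqscan = off;   -- discourage the plain seqscan + sort plan

create index tst_ts_minmax on tst using brin (ts) with (pages_per_range = 16);

explain (costs off)
select * from tst order by ts desc nulls first;

If the path was built, the plan should show a "BRIN Sort using tst_ts_minmax on tst"
node with the matching Sort Key, instead of an explicit Sort on top of a Seq Scan.
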
0004-f-brinsort.patch
From 65cf8ac1bb314e9f787f1e0fbf939e3cce561fe1 Mon Sep 17 00:00:00 2001
From: Justin Pryzby <pryzbyj@telsasoft.com>
Date: Sat, 15 Oct 2022 10:51:34 -0500
Subject: [PATCH 4/4] f!brinsort
//-os-only: linux-meson
---
src/backend/executor/meson.build | 1 +
src/backend/executor/nodeBrinSort.c | 11 ++--
src/backend/utils/misc/guc_tables.c | 2 +-
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/access/brin_internal.h | 2 +-
src/include/catalog/pg_amproc.dat | 52 +++++++++----------
src/include/catalog/pg_proc.dat | 2 +-
src/include/executor/nodeBrinSort.h | 6 +--
src/test/regress/expected/sysviews.out | 3 +-
9 files changed, 41 insertions(+), 39 deletions(-)
diff --git a/src/backend/executor/meson.build b/src/backend/executor/meson.build
index 518674cfa28..c1fc50120d1 100644
--- a/src/backend/executor/meson.build
+++ b/src/backend/executor/meson.build
@@ -24,6 +24,7 @@ backend_sources += files(
'nodeBitmapHeapscan.c',
'nodeBitmapIndexscan.c',
'nodeBitmapOr.c',
+ 'nodeBrinSort.c',
'nodeCtescan.c',
'nodeCustom.c',
'nodeForeignscan.c',
diff --git a/src/backend/executor/nodeBrinSort.c b/src/backend/executor/nodeBrinSort.c
index ca72c1ed22d..fcb0de71b9b 100644
--- a/src/backend/executor/nodeBrinSort.c
+++ b/src/backend/executor/nodeBrinSort.c
@@ -459,7 +459,7 @@ brinsort_load_tuples(BrinSortState *node, bool check_watermark, bool null_proces
scan = node->ss.ss_currentScanDesc;
/*
- * Read tuples, evaluate the filer (so that we don't keep tuples only to
+ * Read tuples, evaluate the filter (so that we don't keep tuples only to
* discard them later), and decide if it goes into the current range
* (tuplesort) or overflow (tuplestore).
*/
@@ -501,7 +501,7 @@ brinsort_load_tuples(BrinSortState *node, bool check_watermark, bool null_proces
*
* XXX However, maybe we could also leverage other bitmap indexes,
* particularly for BRIN indexes because that makes it simpler to
- * eliminage the ranges incrementally - we know which ranges to
+ * eliminate the ranges incrementally - we know which ranges to
* load from the index, while for other indexes (e.g. btree) we
* have to read the whole index and build a bitmap in order to have
* a bitmap for any range. Although, if the condition is very
@@ -896,9 +896,9 @@ IndexNext(BrinSortState *node)
tuplesort_get_stats(node->bs_tuplesortstate, &stats);
- elog(DEBUG1, "method: %s space: %ld kB (%s)",
+ elog(DEBUG1, "method: %s space: %lld kB (%s)",
tuplesort_method_name(stats.sortMethod),
- stats.spaceUsed,
+ (long long)stats.spaceUsed,
tuplesort_space_type_name(stats.spaceType));
}
#endif
@@ -1015,7 +1015,6 @@ IndexNext(BrinSortState *node)
/* read tuples from the tuplesort range, and output them */
if (node->bs_tuplestore != NULL)
{
-
while (tuplestore_gettupleslot(node->bs_tuplestore, true, true, slot))
return slot;
@@ -1287,7 +1286,7 @@ ExecInitBrinSortRanges(BrinSort *node, BrinSortState *planstate)
* Should not get here without a proc, thanks to the check before
* building the BrinSort path.
*/
- Assert(rangeproc != NULL);
+ Assert(OidIsValid(rangeproc->fn_oid));
memset(&planstate->bs_sortsupport, 0, sizeof(SortSupportData));
PrepareSortSupportFromOrderingOp(node->sortOperators[0], &planstate->bs_sortsupport);
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index a5ca3bd0cc4..27fb720d842 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -984,7 +984,7 @@ struct config_bool ConfigureNamesBool[] =
GUC_EXPLAIN
},
&enable_brinsort,
- false,
+ true,
NULL, NULL, NULL
},
{
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 8c5d442ff45..3f44d1229f4 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -370,6 +370,7 @@
#enable_async_append = on
#enable_bitmapscan = on
+#enable_brinsort = off
#enable_gathermerge = on
#enable_hashagg = on
#enable_hashjoin = on
diff --git a/src/include/access/brin_internal.h b/src/include/access/brin_internal.h
index 06a36f769c5..355dddcc225 100644
--- a/src/include/access/brin_internal.h
+++ b/src/include/access/brin_internal.h
@@ -73,10 +73,10 @@ typedef struct BrinDesc
#define BRIN_PROCNUM_UNION 4
#define BRIN_MANDATORY_NPROCS 4
#define BRIN_PROCNUM_OPTIONS 5 /* optional */
-#define BRIN_PROCNUM_RANGES 7 /* optional */
/* procedure numbers up to 10 are reserved for BRIN future expansion */
#define BRIN_FIRST_OPTIONAL_PROCNUM 11
#define BRIN_PROCNUM_STATISTICS 11 /* optional */
+#define BRIN_PROCNUM_RANGES 12 /* optional */
#define BRIN_LAST_OPTIONAL_PROCNUM 15
#undef BRIN_DEBUG
diff --git a/src/include/catalog/pg_amproc.dat b/src/include/catalog/pg_amproc.dat
index 7a22eaef33c..0d192ce40ee 100644
--- a/src/include/catalog/pg_amproc.dat
+++ b/src/include/catalog/pg_amproc.dat
@@ -807,7 +807,7 @@
{ amprocfamily => 'brin/bytea_minmax_ops', amproclefttype => 'bytea',
amprocrighttype => 'bytea', amprocnum => '11', amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/bytea_minmax_ops', amproclefttype => 'bytea',
- amprocrighttype => 'bytea', amprocnum => '7', amproc => 'brin_minmax_ranges' },
+ amprocrighttype => 'bytea', amprocnum => '12', amproc => 'brin_minmax_ranges' },
# bloom bytea
{ amprocfamily => 'brin/bytea_bloom_ops', amproclefttype => 'bytea',
@@ -842,7 +842,7 @@
{ amprocfamily => 'brin/char_minmax_ops', amproclefttype => 'char',
amprocrighttype => 'char', amprocnum => '11', amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/char_minmax_ops', amproclefttype => 'char',
- amprocrighttype => 'char', amprocnum => '7', amproc => 'brin_minmax_ranges' },
+ amprocrighttype => 'char', amprocnum => '12', amproc => 'brin_minmax_ranges' },
# bloom "char"
{ amprocfamily => 'brin/char_bloom_ops', amproclefttype => 'char',
@@ -875,7 +875,7 @@
{ amprocfamily => 'brin/name_minmax_ops', amproclefttype => 'name',
amprocrighttype => 'name', amprocnum => '11', amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/name_minmax_ops', amproclefttype => 'name',
- amprocrighttype => 'name', amprocnum => '7', amproc => 'brin_minmax_ranges' },
+ amprocrighttype => 'name', amprocnum => '12', amproc => 'brin_minmax_ranges' },
# bloom name
{ amprocfamily => 'brin/name_bloom_ops', amproclefttype => 'name',
@@ -908,7 +908,7 @@
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int8',
amprocrighttype => 'int8', amprocnum => '11', amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int8',
- amprocrighttype => 'int8', amprocnum => '7', amproc => 'brin_minmax_ranges' },
+ amprocrighttype => 'int8', amprocnum => '12', amproc => 'brin_minmax_ranges' },
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int2',
amprocrighttype => 'int2', amprocnum => '1',
@@ -924,7 +924,7 @@
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int2',
amprocrighttype => 'int2', amprocnum => '11', amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int2',
- amprocrighttype => 'int2', amprocnum => '7', amproc => 'brin_minmax_ranges' },
+ amprocrighttype => 'int2', amprocnum => '12', amproc => 'brin_minmax_ranges' },
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int4',
amprocrighttype => 'int4', amprocnum => '1',
@@ -940,7 +940,7 @@
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int4',
amprocrighttype => 'int4', amprocnum => '11', amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/integer_minmax_ops', amproclefttype => 'int4',
- amprocrighttype => 'int4', amprocnum => '7', amproc => 'brin_minmax_ranges' },
+ amprocrighttype => 'int4', amprocnum => '12', amproc => 'brin_minmax_ranges' },
# minmax multi integer: int2, int4, int8
{ amprocfamily => 'brin/integer_minmax_multi_ops', amproclefttype => 'int2',
@@ -1061,7 +1061,7 @@
{ amprocfamily => 'brin/text_minmax_ops', amproclefttype => 'text',
amprocrighttype => 'text', amprocnum => '11', amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/text_minmax_ops', amproclefttype => 'text',
- amprocrighttype => 'text', amprocnum => '7', amproc => 'brin_minmax_ranges' },
+ amprocrighttype => 'text', amprocnum => '12', amproc => 'brin_minmax_ranges' },
# bloom text
{ amprocfamily => 'brin/text_bloom_ops', amproclefttype => 'text',
@@ -1093,7 +1093,7 @@
{ amprocfamily => 'brin/oid_minmax_ops', amproclefttype => 'oid',
amprocrighttype => 'oid', amprocnum => '11', amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/oid_minmax_ops', amproclefttype => 'oid',
- amprocrighttype => 'oid', amprocnum => '7', amproc => 'brin_minmax_ranges' },
+ amprocrighttype => 'oid', amprocnum => '12', amproc => 'brin_minmax_ranges' },
# minmax multi oid
{ amprocfamily => 'brin/oid_minmax_multi_ops', amproclefttype => 'oid',
@@ -1145,7 +1145,7 @@
{ amprocfamily => 'brin/tid_minmax_ops', amproclefttype => 'tid',
amprocrighttype => 'tid', amprocnum => '11', amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/tid_minmax_ops', amproclefttype => 'tid',
- amprocrighttype => 'tid', amprocnum => '7', amproc => 'brin_minmax_ranges' },
+ amprocrighttype => 'tid', amprocnum => '12', amproc => 'brin_minmax_ranges' },
# bloom tid
{ amprocfamily => 'brin/tid_bloom_ops', amproclefttype => 'tid',
@@ -1200,7 +1200,7 @@
amprocrighttype => 'float4', amprocnum => '11',
amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/float_minmax_ops', amproclefttype => 'float4',
- amprocrighttype => 'float4', amprocnum => '7',
+ amprocrighttype => 'float4', amprocnum => '12',
amproc => 'brin_minmax_ranges' },
{ amprocfamily => 'brin/float_minmax_ops', amproclefttype => 'float8',
@@ -1219,7 +1219,7 @@
amprocrighttype => 'float8', amprocnum => '11',
amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/float_minmax_ops', amproclefttype => 'float8',
- amprocrighttype => 'float8', amprocnum => '7',
+ amprocrighttype => 'float8', amprocnum => '12',
amproc => 'brin_minmax_ranges' },
# minmax multi float
@@ -1313,7 +1313,7 @@
amprocrighttype => 'macaddr', amprocnum => '11',
amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/macaddr_minmax_ops', amproclefttype => 'macaddr',
- amprocrighttype => 'macaddr', amprocnum => '7',
+ amprocrighttype => 'macaddr', amprocnum => '12',
amproc => 'brin_minmax_ranges' },
# minmax multi macaddr
@@ -1372,7 +1372,7 @@
amprocrighttype => 'macaddr8', amprocnum => '11',
amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/macaddr8_minmax_ops', amproclefttype => 'macaddr8',
- amprocrighttype => 'macaddr8', amprocnum => '7',
+ amprocrighttype => 'macaddr8', amprocnum => '12',
amproc => 'brin_minmax_ranges' },
# minmax multi macaddr8
@@ -1429,7 +1429,7 @@
{ amprocfamily => 'brin/network_minmax_ops', amproclefttype => 'inet',
amprocrighttype => 'inet', amprocnum => '11', amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/network_minmax_ops', amproclefttype => 'inet',
- amprocrighttype => 'inet', amprocnum => '7', amproc => 'brin_minmax_ranges' },
+ amprocrighttype => 'inet', amprocnum => '12', amproc => 'brin_minmax_ranges' },
# minmax multi inet
{ amprocfamily => 'brin/network_minmax_multi_ops', amproclefttype => 'inet',
@@ -1504,7 +1504,7 @@
amprocrighttype => 'bpchar', amprocnum => '11',
amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/bpchar_minmax_ops', amproclefttype => 'bpchar',
- amprocrighttype => 'bpchar', amprocnum => '7',
+ amprocrighttype => 'bpchar', amprocnum => '12',
amproc => 'brin_minmax_ranges' },
# bloom character
@@ -1540,7 +1540,7 @@
{ amprocfamily => 'brin/time_minmax_ops', amproclefttype => 'time',
amprocrighttype => 'time', amprocnum => '11', amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/time_minmax_ops', amproclefttype => 'time',
- amprocrighttype => 'time', amprocnum => '7', amproc => 'brin_minmax_ranges' },
+ amprocrighttype => 'time', amprocnum => '12', amproc => 'brin_minmax_ranges' },
# minmax multi time without time zone
{ amprocfamily => 'brin/time_minmax_multi_ops', amproclefttype => 'time',
@@ -1595,7 +1595,7 @@
amprocrighttype => 'timestamp', amprocnum => '11',
amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'timestamp',
- amprocrighttype => 'timestamp', amprocnum => '7',
+ amprocrighttype => 'timestamp', amprocnum => '12',
amproc => 'brin_minmax_ranges' },
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'timestamptz',
@@ -1614,7 +1614,7 @@
amprocrighttype => 'timestamptz', amprocnum => '11',
amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'timestamptz',
- amprocrighttype => 'timestamptz', amprocnum => '7',
+ amprocrighttype => 'timestamptz', amprocnum => '12',
amproc => 'brin_minmax_ranges' },
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'date',
@@ -1631,7 +1631,7 @@
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'date',
amprocrighttype => 'date', amprocnum => '11', amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/datetime_minmax_ops', amproclefttype => 'date',
- amprocrighttype => 'date', amprocnum => '7', amproc => 'brin_minmax_ranges' },
+ amprocrighttype => 'date', amprocnum => '12', amproc => 'brin_minmax_ranges' },
# minmax multi datetime (date, timestamp, timestamptz)
{ amprocfamily => 'brin/datetime_minmax_multi_ops',
@@ -1762,7 +1762,7 @@
amprocrighttype => 'interval', amprocnum => '11',
amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/interval_minmax_ops', amproclefttype => 'interval',
- amprocrighttype => 'interval', amprocnum => '7',
+ amprocrighttype => 'interval', amprocnum => '12',
amproc => 'brin_minmax_ranges' },
# minmax multi interval
@@ -1821,7 +1821,7 @@
amprocrighttype => 'timetz', amprocnum => '11',
amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/timetz_minmax_ops', amproclefttype => 'timetz',
- amprocrighttype => 'timetz', amprocnum => '7',
+ amprocrighttype => 'timetz', amprocnum => '12',
amproc => 'brin_minmax_ranges' },
# minmax multi time with time zone
@@ -1876,7 +1876,7 @@
{ amprocfamily => 'brin/bit_minmax_ops', amproclefttype => 'bit',
amprocrighttype => 'bit', amprocnum => '11', amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/bit_minmax_ops', amproclefttype => 'bit',
- amprocrighttype => 'bit', amprocnum => '7', amproc => 'brin_minmax_ranges' },
+ amprocrighttype => 'bit', amprocnum => '12', amproc => 'brin_minmax_ranges' },
# minmax bit varying
{ amprocfamily => 'brin/varbit_minmax_ops', amproclefttype => 'varbit',
@@ -1895,7 +1895,7 @@
amprocrighttype => 'varbit', amprocnum => '11',
amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/varbit_minmax_ops', amproclefttype => 'varbit',
- amprocrighttype => 'varbit', amprocnum => '7',
+ amprocrighttype => 'varbit', amprocnum => '12',
amproc => 'brin_minmax_ranges' },
# minmax numeric
@@ -1915,7 +1915,7 @@
amprocrighttype => 'numeric', amprocnum => '11',
amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/numeric_minmax_ops', amproclefttype => 'numeric',
- amprocrighttype => 'numeric', amprocnum => '7',
+ amprocrighttype => 'numeric', amprocnum => '12',
amproc => 'brin_minmax_ranges' },
# minmax multi numeric
@@ -1972,7 +1972,7 @@
{ amprocfamily => 'brin/uuid_minmax_ops', amproclefttype => 'uuid',
amprocrighttype => 'uuid', amprocnum => '11', amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/uuid_minmax_ops', amproclefttype => 'uuid',
- amprocrighttype => 'uuid', amprocnum => '7', amproc => 'brin_minmax_ranges' },
+ amprocrighttype => 'uuid', amprocnum => '12', amproc => 'brin_minmax_ranges' },
# minmax multi uuid
{ amprocfamily => 'brin/uuid_minmax_multi_ops', amproclefttype => 'uuid',
@@ -2050,7 +2050,7 @@
amprocrighttype => 'pg_lsn', amprocnum => '11',
amproc => 'brin_minmax_stats' },
{ amprocfamily => 'brin/pg_lsn_minmax_ops', amproclefttype => 'pg_lsn',
- amprocrighttype => 'pg_lsn', amprocnum => '7',
+ amprocrighttype => 'pg_lsn', amprocnum => '12',
amproc => 'brin_minmax_ranges' },
# minmax multi pg_lsn
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 18e0824a08e..2bd034e7616 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5252,7 +5252,7 @@
proname => 'pg_stat_get_numscans', provolatile => 's', proparallel => 'r',
prorettype => 'int8', proargtypes => 'oid',
prosrc => 'pg_stat_get_numscans' },
-{ oid => '9976', descr => 'statistics: time of the last scan for table/index',
+{ oid => '8912', descr => 'statistics: time of the last scan for table/index',
proname => 'pg_stat_get_lastscan', provolatile => 's', proparallel => 'r',
prorettype => 'timestamptz', proargtypes => 'oid',
prosrc => 'pg_stat_get_lastscan' },
diff --git a/src/include/executor/nodeBrinSort.h b/src/include/executor/nodeBrinSort.h
index 2c860d926ea..3cac599d811 100644
--- a/src/include/executor/nodeBrinSort.h
+++ b/src/include/executor/nodeBrinSort.h
@@ -11,8 +11,8 @@
*
*-------------------------------------------------------------------------
*/
-#ifndef NODEBrinSort_H
-#define NODEBrinSort_H
+#ifndef NODEBRIN_SORT_H
+#define NODEBRIN_SORT_H
#include "access/genam.h"
#include "access/parallel.h"
@@ -44,4 +44,4 @@ extern bool ExecIndexEvalArrayKeys(ExprContext *econtext,
IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
extern bool ExecIndexAdvanceArrayKeys(IndexArrayKeyInfo *arrayKeys, int numArrayKeys);
-#endif /* NODEBrinSort_H */
+#endif /* NODEBRIN_SORT_H */
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index b19dae255e9..7b697941cad 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -113,6 +113,7 @@ select name, setting from pg_settings where name like 'enable%';
--------------------------------+---------
enable_async_append | on
enable_bitmapscan | on
+ enable_brinsort | on
enable_gathermerge | on
enable_hashagg | on
enable_hashjoin | on
@@ -132,7 +133,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(21 rows)
+(22 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
--
2.25.1
On 10/24/22 06:32, Justin Pryzby wrote:
On Sat, Oct 15, 2022 at 02:33:50PM +0200, Tomas Vondra wrote:
Of course, if there are e.g. BTREE indexes this is going to be slower,
but people are unlikely to have both index types on the same column.
On Sun, Oct 16, 2022 at 05:48:31PM +0200, Tomas Vondra wrote:
I don't think it's all that unfair. How likely is it to have both a BRIN
and btree index on the same column? And even if you do have such indexes
Note that we (at my work) use unique, btree indexes on multiple columns
for INSERT ON CONFLICT into the most-recent tables: UNIQUE(a,b,c,...),
plus a separate set of indexes on all tables, used for searching:
BRIN(a) and BTREE(b). I'd hope that the costing is accurate enough to
prefer the btree index for searching the most-recent table, if that's
what's faster (for example, if columns b and c are specified).
Well, the costing is very crude at the moment - it's pretty much just a
copy of the existing BRIN costing. And the cost is likely going to
increase, because brinsort needs to do a regular BRIN bitmap scan (more
or less) and then also a sort (which is an extra cost, of course). So if
it works now, I don't see why brinsort would break it. Moreover, if you
don't have an ORDER BY in the query, I don't see why we would create a
brinsort at all.
But if you could test this once the costing gets improved, that'd be
very valuable.
+ /* There must not be any TID scan in progress yet. */
+ Assert(node->ss.ss_currentScanDesc == NULL);
+
+ /* Initialize the TID range scan, for the provided block range. */
+ if (node->ss.ss_currentScanDesc == NULL)
+ {

Why is this conditional on the condition that was just Assert()ed ?
Yeah, that's a mistake, due to how the code evolved.
+void
+cost_brinsort(BrinSortPath *path, PlannerInfo *root, double loop_count,
+              bool partial_path)

It'd be nice to refactor existing code to avoid this part being so
duplicative.

+ * In some situations (particularly with OR'd index conditions) we may
+ * have scan_clauses that are not equal to, but are logically implied by,
+ * the index quals; so we also try a predicate_implied_by() check to see

Isn't that somewhat expensive ?
If that's known, then it'd be good to say that in the documentation.
Some of this is probably a residue from create_indexscan_path and may
not be needed for this new node.
+ {
+ {"enable_brinsort", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of BRIN sort plans."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_brinsort,
+ false,

I think new GUCs should be enabled during patch development.
Maybe in a separate 0002 patch "for CI only not for commit".
That way "make check" at least has a chance to hit the new code paths.
Also, note that indxpath.c had the var initialized to true.
Good point.
+ attno = (i + 1);
+ nranges = (nblocks / pagesPerRange);
+ node->bs_phase = (nullsFirst) ? BRINSORT_LOAD_NULLS : BRINSORT_LOAD_RANGE;

I'm curious why you have parentheses in these places?
Not sure, it seemed more readable when writing the code I guess.
+#ifndef NODEBrinSort_H
+#define NODEBrinSort_H

NODEBRIN_SORT would be more consistent with NODEINCREMENTALSORT.
But I'd prefer NODE_* - otherwise it looks like NO DEBRIN.
Yeah, stupid search/replace on the indexscan code, which was used as a
starting point.
This needed a bunch of work to pass any of the regression tests -
even with the feature set to off.
. meson.build needs the same change as the corresponding ./Makefile.
. guc missing from postgresql.conf.sample
. brin_validate.c is missing support for the opr function.
I gather you're planning on changing this part (?) but this allows the
tests to pass for now.
. mingw is warning about OidIsValid(pointer) in nodeBrinSort.c.
https://cirrus-ci.com/task/5771227447951360?logs=mingw_cross_warning#L969
. Uninitialized catalog attribute.
. Some typos in your other patches: "heuristics heuristics". ste.
lest (least).
Thanks, I'll get this fixed. I've posted the patch as a PoC to showcase
it and gather some feedback; I should have mentioned it's incomplete in
these ways.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Fwiw tuplesort does do something like what you want for the top-k
case. At least it used to, last I looked -- not sure if it went out
with the tapesort ...
For top-k it inserts new tuples into the heap data structure and then
pops the top element out of the heap. That keeps a fixed number of
elements in the heap. It's always inserting and removing at the same
time. I don't think it would be very hard to add a tuplesort interface
to access that behaviour.
For something like BRIN you would sort the ranges by minvalue then
insert all the tuples for each range. Before inserting tuples for a
new range you would first pop out all the tuples that are < the
minvalue for the new range.
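To make that concrete, here's a minimal stand-alone sketch of the
range-ordered merge described above - this is not code from the patch or
from tuplesort; plain ints stand in for tuples, and a qsort'ed buffer
stands in for the heap:

#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *) a, y = *(const int *) b;
    return (x > y) - (x < y);
}

typedef struct
{
    int minval;        /* range minimum, as stored in the BRIN summary */
    int nvals;
    const int *vals;   /* the tuples that happen to live in this range */
} Range;

static int cmp_range(const void *a, const void *b)
{
    return cmp_int(&((const Range *) a)->minval, &((const Range *) b)->minval);
}

int main(void)
{
    /* two overlapping ranges, listed in on-disk (block) order */
    static const int r1[] = {6, 12, 8, 15};
    static const int r2[] = {1, 7, 3, 9};
    Range ranges[] = {{6, 4, r1}, {1, 4, r2}};
    int nranges = 2;

    int buf[64];       /* stand-in for the heap / tuplesort */
    int nbuf = 0;

    /* sort the ranges by minvalue */
    qsort(ranges, nranges, sizeof(Range), cmp_range);

    for (int i = 0; i < nranges; i++)
    {
        /*
         * Anything below the next range's minvalue cannot be beaten by
         * tuples we haven't read yet, so it is safe to emit it now.
         */
        int nkept = 0;

        qsort(buf, nbuf, sizeof(int), cmp_int);
        for (int j = 0; j < nbuf; j++)
        {
            if (buf[j] < ranges[i].minval)
                printf("%d\n", buf[j]);
            else
                buf[nkept++] = buf[j];
        }
        nbuf = nkept;

        /* now load the tuples of this range */
        for (int j = 0; j < ranges[i].nvals; j++)
            buf[nbuf++] = ranges[i].vals[j];
    }

    /* drain whatever is left after the last range */
    qsort(buf, nbuf, sizeof(int), cmp_int);
    for (int j = 0; j < nbuf; j++)
        printf("%d\n", buf[j]);

    return 0;
}

With these two overlapping ranges it prints 1 3 6 7 8 9 12 15, i.e. fully
sorted output while only ever buffering the tuples of ranges that have
not been drained yet.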
I'm not sure how you handle degenerate BRIN indexes that behave
terribly. Like, if many BRIN ranges covered the entire key range.
Perhaps there would be a clever way to spill the overflow and switch
to quicksort for the spilled tuples without wasting lots of work
already done and without being too inefficient.
On 11/16/22 22:52, Greg Stark wrote:
Fwiw tuplesort does do something like what you want for the top-k
case. At least it used to, last I looked -- not sure if it went out
with the tapesort ...

For top-k it inserts new tuples into the heap data structure and then
pops the top element out of the heap. That keeps a fixed number of
elements in the heap. It's always inserting and removing at the same
time. I don't think it would be very hard to add a tuplesort interface
to access that behaviour.
Bounded sorts are still there, implemented using a heap (which is what
you're talking about, I think). I actually looked at it some time ago,
and it didn't look like a particularly good match for the general case
(without explicit LIMIT). Bounded sorts require specifying the number of
tuples, and then discard the remaining tuples. But you don't know how
many tuples you'll actually find until the next minval - you have to
keep them all.
Maybe we could feed the tuples into a (sorted) heap incrementally, and
consume tuples until the next range's minval. I'm not against exploring
that idea, but it certainly requires more work than just slapping an
interface onto the existing code.
For something like BRIN you would sort the ranges by minvalue then
insert all the tuples for each range. Before inserting tuples for a
new range you would first pop out all the tuples that are < the
minvalue for the new range.
Well, yeah. That's pretty much exactly what the last version of this
patch (from October 23) does.
I'm not sure how you handle degenerate BRIN indexes that behave
terribly. Like, if many BRIN ranges covered the entire key range.
Perhaps there would be a clever way to spill the overflow and switch
to quicksort for the spilled tuples without wasting lots of work
already done and without being too inefficient.
In two ways:
1) Don't have such a BRIN index - if it has many degraded ranges, it's
bound to perform poorly even for WHERE conditions. We've lived with this
until now; I don't think this makes the issue any worse.
2) Improving statistics for BRIN indexes - until now the BRIN costing
has been very crude; we have almost no information about how wide the
ranges are, how much they overlap, etc. The 0001 part (discussed in a
thread [1]) aims to provide much better statistics. Yes, the costing
still doesn't use that information very much.
regards
[1]: https://commitfest.postgresql.org/40/3952/
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2022-11-17 00:52:35 +0100, Tomas Vondra wrote:
Well, yeah. That's pretty much exactly what the last version of this
patch (from October 23) does.
That version unfortunately doesn't build successfully:
https://cirrus-ci.com/task/5108789846736896
[03:02:48.641] Duplicate OIDs detected:
[03:02:48.641] 9979
[03:02:48.641] 9980
[03:02:48.641] found 2 duplicate OID(s) in catalog data
Greetings,
Andres Freund
Hi,
Here's an updated version of this patch series. There's still plenty of
stuff to improve, but it fixes a number of issues I mentioned earlier.
The two most important changes are:
1) handling of projections
Until now queries with projection might have failed, due to not using
the right slot, bogus var references and so on. The code was somewhat
confused because the new node is somewhere in between a scan node and a
sort (or more precisely, it combines both).
I believe this version handles all of this correctly - the code that
initializes the slots/projection info etc. needs serious cleanup, but
should be correct.
2) handling of expressions
The other improvement is handling of expressions - if you have a BRIN
index on an expression, this should now work too. This also includes
correct handling of collations (which the previous patches ignored).
Similarly to the projections, I believe the code is correct but needs
cleanup. In particular, I haven't paid close attention to memory
management, so there might be memory leaks when evaluating expressions.
The last two parts of the patch series (0009 and 0010) are about
testing. 0009 adds a regular regression test with various combinations
(projections, expressions, single- vs. multi-column indexes, ...).
0010 introduces a python script that randomly generates data sets,
indexes and queries. I use it to both test random combinations and to
evaluate performance. I don't expect it to be committed etc. - it's
included only to keep it versioned with the rest of the patch.
I did some basic benchmarking using the 0010 part, to evaluate how
this works for various cases. The script varies a number of parameters:
- number of rows
- table fill factor
- randomness (how much the ranges overlap)
- pages per range
- limit / offset for queries
- ...
The script forces both a "seqscan" and "brinsort" plan, and collects
timing info.
The results are encouraging, I think. Attached are two charts, plotting
speedup vs. fraction of tuples the query has to sort.
speedup = (seqscan timing / brinsort timing)
fraction = (limit + offset) / (table rows)
A query with "limit 1 offset 0" has fraction ~0.0, while a query that
scans everything (perhaps because it has no LIMIT/OFFSET) has ~1.0.
For speedup, 1.0 means "no change" while values above 1.0 mean the
query gets faster. Both plots have a log-scale y-axis.
brinsort-all-data.gif shows results for all queries. There's significant
speedup for small values of fraction (i.e. queries with limit, requiring
few rows). This is expected, as this is pretty much the primary use case
for the patch.
The other thing is that the benefits quickly diminish - for fractions
close to 0.0 the potential benefits are huge, but once you cross ~10% of
the table it's within 10x, at ~25% less than 5x, etc.
OTOH there are also a fair number of queries that got slower - those are
the data points below 1.0. I've looked into many of them, and there are
a couple reasons why that can happen:
1) random data set - When the ranges are very wide, BRIN Sort has to
read most of the data, and it ends up sorting almost as many rows as the
sequential scan. But it's more expensive, especially when combined with
the following points.
Note: I don't think this is an issue in practice, because BRIN indexes
would suck quite badly on such data, so no one is going to create
such indexes in the first place.
2) tiny ranges - By default ranges are 1MB, but it's possible to make
them much smaller. But BRIN Sort has to read/sort all ranges, and that
gets more expensive with the number of ranges.
Note: I'm not sure there's a way around this, although Matthias
had some interesting ideas about how to keep the ranges sorted.
But ultimately, I think this is fine, as long as it's costed
correctly. For fractions close to 0.0 this is still going to be
a huge win.
3) non-adaptive (and low) watermark_step - The number of sorts makes a
huge difference - in an extreme case we could add the ranges one by one,
with a sort after each. For small limit/offset that works, but for more
rows it's quite pointless.
Note: The adaptive step (adjusted during execution) works great, and
the script sets explicit values mostly to trigger more corner cases.
Also, I wonder if we should force higher values as we progress
through the table - we still don't want to exceed work_mem, but the
larger fraction we scan the more we should prefer larger "batches".
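Just to illustrate that last idea - this is not what the patch does,
and the function name and growth rule below are entirely made up - the
heuristic could have roughly this shape:

#include <stdio.h>

/*
 * Hypothetical heuristic: start with tiny batches so a LIMIT query can
 * return its first rows quickly, then grow the step as we get deeper
 * into the table, but never let one sort batch exceed the memory budget.
 */
static int
watermark_step(double fraction_scanned,  /* 0.0 .. 1.0 of ranges consumed */
               double range_bytes,       /* estimated tuple bytes per range */
               double budget_bytes)      /* memory allowed for one sort batch */
{
    int max_step = (int) (budget_bytes / range_bytes);
    int step;

    if (max_step < 1)
        max_step = 1;

    step = 1 + (int) (fraction_scanned * max_step);
    return (step > max_step) ? max_step : step;
}

int main(void)
{
    /* e.g. 1MB ranges and a 64MB work_mem-like budget */
    for (double f = 0.0; f <= 1.0; f += 0.25)
        printf("scanned %.0f%% -> add %d ranges before the next sort\n",
               f * 100, watermark_step(f, 1024.0 * 1024, 64.0 * 1024 * 1024));
    return 0;
}

The point is only the "grow with progress, cap by memory" shape; the
actual batch sizing in the patch is done by the adaptive watermark step
mentioned above.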
The second "filter" chart (brinsort-filtered-data.gif) shows results
filtered to only show runs with:
- pages_per_range >= 32
- randomness <= 5% (i.e. each range covers about 5% of the domain)
- adaptive step (= -1)
And IMO this looks much better - there are almost no slower queries,
except for a bunch of queries that scan all the data.
So, what are the next steps for this patch:
1) cleanup of the existing code (mentioned above)
2) improvement of the costing - This is probably the critical part,
because we need a costing that allows us to identify the queries that
are likely to be faster/slower. I believe this is doable - either now or
using the new opclass-specific stats proposed in a separate patch (and
kept in part 0001 for completeness).
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
brinsort-filtered-data.gif (image/gif)