Minmax indexes
Hi,
Here's a reviewable version of what I've dubbed Minmax indexes. Some
people said they would like to use some other name for this feature, but
I have yet to hear a usable alternative, so for now I will keep calling
them that. I'm open to proposals, but if you pick something that cannot
be abbreviated "mm", I might have you prepare a rebased version that
renames the files and structs.
The implementation here has been simplified from what I originally
proposed at 20130614222805.GZ5491@eldon.alvh.no-ip.org -- in particular,
I noticed that there's no need to involve aggregate functions at all; we
can just use inequality operators. So the pg_amproc entries are gone;
only the pg_amop entries are necessary.
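As a sanity check, once the opclasses are in place the registered
strategies can be inspected with a catalog query along these lines (the
opclass name int4_minmax_ops is only a guess; substitute whatever name
the patch ends up using):

    -- sketch: list the operators a minmax opclass registers in pg_amop
    SELECT amopstrategy, amopopr::regoperator
      FROM pg_amop
     WHERE amopfamily = (SELECT opcfamily
                           FROM pg_opclass
                          WHERE opcname = 'int4_minmax_ops');
    -- expected: the five btree-style strategies (<, <=, =, >=, >)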
I've somewhat punted on the question of doing resummarization separately
from vacuuming. Right now, resummarization (as well as other necessary
index cleanup) takes place in amvacuumcleanup. This is not optimal; I
have stated elsewhere that I'd like to create separate maintenance
actions that can be carried out by autovacuum. That would be useful
both for Minmax indexes and GIN indexes (pending insertion list); maybe
others. That's not part of this patch, however.
The design of this stuff is in the file "minmax-proposal" at the top of
the tree. That file is up to date, though it still contains some open
questions that were present in the original proposal. (I have not fixed
some bogosities pointed out by Noah, for instance. I will do that
shortly.) In a final version, that file would be applied as
src/backend/access/minmax/README, most likely.
One area where I needed to modify core code is IndexBuildHeapScan. I
needed a version that can scan only a certain range of pages rather than
the entire table, so I introduced a new IndexBuildHeapRangeScan and added
a quick "heap_setscanlimits" function. I haven't tested that the latter
works outside of the HeapRangeScan thingy, so it's probably completely
bogus; I'm open to suggestions if people think this should be implemented
differently. In any case, keeping that implementation together with
vanilla IndexBuildHeapScan makes a lot of sense.
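For clarity, this is roughly how the range-limited scan is meant to be
driven. This is a sketch only: the helper name summarize_one_range is
made up, heapRel and rangeStart are placeholders, and
IndexBuildHeapRangeScan may not do exactly this internally.

    static void
    summarize_one_range(Relation heapRel, BlockNumber rangeStart)
    {
        HeapScanDesc scan;

        scan = heap_beginscan_strat(heapRel, SnapshotAny, 0, NULL,
                                    true,   /* allow buffer access strategy */
                                    false); /* no syncscan; need blocks in order */

        /* restrict the scan to the pages of a single range */
        heap_setscanlimits(scan, rangeStart, MINMAX_PAGES_PER_RANGE);

        while (heap_getnext(scan, ForwardScanDirection) != NULL)
        {
            /* feed each tuple to the summarization callback here */
        }

        heap_endscan(scan);
    }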
One thing still to tackle is when to mark ranges as unsummarized. Right
now, any new tuple on a page range would cause a new index entry to be
created and a new revmap update. This would cause huge index bloat if,
say, a page is emptied and vacuumed and filled with new tuples with
increasing values outside the original range; each new tuple would
create a new index tuple. I have two ideas about this (1. mark the range
as unsummarized the third time we touch the same page range; 2. vacuum
the affected index page if it's full, so we can keep the index always up
to date without causing undue bloat), but I haven't implemented anything
yet.
The "amcostestimate" routine is completely bogus; right now it returns
constant 0, meaning the index is always chosen if it exists.
There are opclasses for int4, numeric and text. The latter doesn't work
at all, because collation info is not being passed down. I will have to
figure that out (even though I find it unlikely that minmax indexes are
of much use on text columns). I admit that numeric hasn't been tested,
and it quite likely doesn't work, mainly because of missing datumCopy()
calls, about which the code contains some /* XXX */ lines. I think this
should be relatively straightforward.
Ideally, the final version of this patch would contain opclasses for all
supported datatypes (i.e. the same types that have btree opclasses).
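For the int4 case, which does work, usage looks roughly like this (table
and index names are made up):

    CREATE TABLE events (id int4, payload text);
    -- ... load plenty of data ...
    CREATE INDEX events_mm_idx ON events USING minmax (id);
    -- a range qual like this should produce a bitmap scan over the index:
    SELECT count(*) FROM events WHERE id BETWEEN 10000 AND 20000;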
I have messed up the opclass information, as evidenced by failures in the
opr_sanity regression test. I will research that later.
There's working contrib/pageinspect support; pg_xlogdump (and wal_debug)
seems to work sanely too.
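For instance, assuming the index from the example above, the items on
index page 1 can be examined with:

    SELECT * FROM minmax_page_items(get_raw_page('events_mm_idx', 1),
                                    'events_mm_idx'::regclass);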
This patch compiles cleanly under -Werror.
The research leading to these results has received funding from the
European Union's Seventh Framework Programme (FP7/2007-2013) under
grant agreement n° 318633
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
minmax-1.patch (text/x-diff; charset=us-ascii)
*** a/contrib/pageinspect/Makefile
--- b/contrib/pageinspect/Makefile
***************
*** 1,7 ****
# contrib/pageinspect/Makefile
MODULE_big = pageinspect
! OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o
EXTENSION = pageinspect
DATA = pageinspect--1.1.sql pageinspect--1.0--1.1.sql \
--- 1,7 ----
# contrib/pageinspect/Makefile
MODULE_big = pageinspect
! OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o mmfuncs.o
EXTENSION = pageinspect
DATA = pageinspect--1.1.sql pageinspect--1.0--1.1.sql \
*** /dev/null
--- b/contrib/pageinspect/mmfuncs.c
***************
*** 0 ****
--- 1,217 ----
+ /*
+ * mmfuncs.c
+ * Functions to investigate MinMax indexes
+ *
+ * Copyright (c) 2013, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/pageinspect/mmfuncs.c
+ */
+ #include "postgres.h"
+
+ #include "access/htup_details.h"
+ #include "access/minmax.h"
+ #include "access/minmax_tuple.h"
+ #include "catalog/index.h"
+ #include "funcapi.h"
+ #include "utils/builtins.h"
+ #include "utils/lsyscache.h"
+ #include "utils/rel.h"
+ #include "miscadmin.h"
+
+ Datum minmax_page_items(PG_FUNCTION_ARGS);
+
+ PG_FUNCTION_INFO_V1(minmax_page_items);
+
+ typedef struct mm_page_state
+ {
+ TupleDesc tupdesc;
+ Page page;
+ OffsetNumber offset;
+ bool unusedItem;
+ bool done;
+ AttrNumber attno;
+ DeformedMMTuple *dtup;
+ FmgrInfo outputfn[FLEXIBLE_ARRAY_MEMBER];
+ } mm_page_state;
+
+ /*
+ * Extract all item values from a minmax index page
+ *
+ * Usage: SELECT * FROM minmax_page_items(get_raw_page('idx', 1), 'idx'::regclass);
+ */
+ Datum
+ minmax_page_items(PG_FUNCTION_ARGS)
+ {
+ mm_page_state *state;
+ FuncCallContext *fctx;
+
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ (errmsg("must be superuser to use raw page functions"))));
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Oid indexRelid = PG_GETARG_OID(1);
+ int raw_page_size;
+ TupleDesc tupdesc;
+ MemoryContext mctx;
+ Relation indexRel;
+ AttrNumber attno;
+
+ raw_page_size = VARSIZE(raw_page) - VARHDRSZ;
+
+ if (raw_page_size < SizeOfPageHeaderData)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("input page too small (%d bytes)", raw_page_size)));
+
+ /* create a function context for cross-call persistence */
+ fctx = SRF_FIRSTCALL_INIT();
+
+ /* switch to memory context appropriate for multiple function calls */
+ mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ indexRel = index_open(indexRelid, AccessShareLock);
+
+ state = palloc(offsetof(mm_page_state, outputfn) +
+ sizeof(FmgrInfo) * RelationGetDescr(indexRel)->natts);
+
+ state->tupdesc = CreateTupleDescCopy(RelationGetDescr(indexRel));
+ state->page = VARDATA(raw_page);
+ state->offset = FirstOffsetNumber;
+ state->unusedItem = false;
+ state->done = false;
+ state->dtup = NULL;
+
+ index_close(indexRel, AccessShareLock);
+
+ for (attno = 1; attno <= state->tupdesc->natts; attno++)
+ {
+ Oid output;
+ bool isVarlena;
+
+ getTypeOutputInfo(state->tupdesc->attrs[attno - 1]->atttypid,
+ &output, &isVarlena);
+ fmgr_info(output, &state->outputfn[attno - 1]);
+ }
+
+ fctx->user_fctx = state;
+ fctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+ MemoryContextSwitchTo(mctx);
+ }
+
+ fctx = SRF_PERCALL_SETUP();
+ state = fctx->user_fctx;
+
+ if (!state->done)
+ {
+ HeapTuple result;
+ Datum values[6];
+ bool nulls[6];
+
+ /*
+ * This loop is called once for every attribute of every tuple in the
+ * page. At the start of a tuple, we get a NULL dtup; that's our
+ * signal for obtaining and decoding the next one. If that's not the
+ * case, we output the next attribute.
+ */
+ if (state->dtup == NULL)
+ {
+ MMTuple *tup;
+ MemoryContext mctx;
+ ItemId itemId;
+
+ /* deformed tuple must live across calls */
+ mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx);
+
+ /* verify item status: if there's no data, we can't decode */
+ itemId = PageGetItemId(state->page, state->offset);
+ if (ItemIdIsUsed(itemId))
+ {
+ tup = (MMTuple *) PageGetItem(state->page,
+ PageGetItemId(state->page,
+ state->offset));
+ state->dtup = minmax_deform_tuple(state->tupdesc, tup);
+ state->attno = 1;
+ state->unusedItem = false;
+ }
+ else
+ state->unusedItem = true;
+
+ MemoryContextSwitchTo(mctx);
+ }
+ else
+ state->attno++;
+
+ MemSet(nulls, 0, sizeof(nulls));
+
+ if (state->unusedItem)
+ {
+ values[0] = UInt16GetDatum(state->offset);
+ nulls[1] = true;
+ nulls[2] = true;
+ nulls[3] = true;
+ nulls[4] = true;
+ nulls[5] = true;
+ }
+ else
+ {
+ int att = state->attno - 1;
+
+ values[0] = UInt16GetDatum(state->offset);
+ values[1] = UInt16GetDatum(state->attno);
+ values[2] = BoolGetDatum(state->dtup->values[att].allnulls);
+ values[3] = BoolGetDatum(state->dtup->values[att].hasnulls);
+ if (!state->dtup->values[att].allnulls)
+ {
+ FmgrInfo *outputfn = &state->outputfn[att];
+ MMValues *mmvalues = &state->dtup->values[att];
+
+ values[4] = CStringGetTextDatum(OutputFunctionCall(outputfn,
+ mmvalues->min));
+ values[5] = CStringGetTextDatum(OutputFunctionCall(outputfn,
+ mmvalues->max));
+ }
+ else
+ {
+ nulls[4] = true;
+ nulls[5] = true;
+ }
+ }
+
+ result = heap_form_tuple(fctx->tuple_desc, values, nulls);
+
+ /*
+ * If the item was unused, jump straight to the next one; otherwise,
+ * the only cleanup needed here is to set our signal to go to the next
+ * tuple in the following iteration, by freeing the current one.
+ */
+ if (state->unusedItem)
+ state->offset = OffsetNumberNext(state->offset);
+ else if (state->attno >= state->tupdesc->natts)
+ {
+ pfree(state->dtup);
+ state->dtup = NULL;
+ state->offset = OffsetNumberNext(state->offset);
+ }
+
+ /*
+ * If we're beyond the end of the page, set flag to end the function in
+ * the following iteration.
+ */
+ if (state->offset > PageGetMaxOffsetNumber(state->page))
+ state->done = true;
+
+ SRF_RETURN_NEXT(fctx, HeapTupleGetDatum(result));
+ }
+
+ SRF_RETURN_DONE(fctx);
+ }
*** a/contrib/pageinspect/pageinspect--1.1.sql
--- b/contrib/pageinspect/pageinspect--1.1.sql
***************
*** 99,104 **** AS 'MODULE_PATHNAME', 'bt_page_items'
--- 99,118 ----
LANGUAGE C STRICT;
--
+ -- minmax_page_items()
+ --
+ CREATE FUNCTION minmax_page_items(IN page bytea, IN index_oid oid,
+ OUT itemoffset int,
+ OUT attnum int,
+ OUT allnulls bool,
+ OUT hasnulls bool,
+ OUT min text,
+ OUT max text)
+ RETURNS SETOF record
+ AS 'MODULE_PATHNAME', 'minmax_page_items'
+ LANGUAGE C STRICT;
+
+ --
-- fsm_page_contents()
--
CREATE FUNCTION fsm_page_contents(IN page bytea)
*** a/contrib/pg_xlogdump/rmgrdesc.c
--- b/contrib/pg_xlogdump/rmgrdesc.c
***************
*** 13,18 ****
--- 13,19 ----
#include "access/gist_private.h"
#include "access/hash.h"
#include "access/heapam_xlog.h"
+ #include "access/minmax_xlog.h"
#include "access/multixact.h"
#include "access/nbtree.h"
#include "access/rmgr.h"
*** /dev/null
--- b/minmax-proposal
***************
*** 0 ****
--- 1,300 ----
+ Minmax Range Indexes
+ ====================
+
+ Minmax indexes are a new access method intended to enable very fast scanning of
+ extremely large tables.
+
+ The essential idea of a minmax index is to keep track of the min() and max()
+ values in consecutive groups of heap pages (page ranges). These values can be
+ used by constraint exclusion to avoid scanning such pages, depending on query
+ quals.
+
+ The main drawback of this is having to update the stored min/max values of each
+ page range as tuples are inserted into them.
+
+ Other database systems already have this feature. Some examples:
+
+ * Oracle Exadata calls this "storage indexes"
+ http://richardfoote.wordpress.com/category/storage-indexes/
+
+ * Netezza has "zone maps"
+ http://nztips.com/2010/11/netezza-integer-join-keys/
+
+ * Infobright has this automatically within their "data packs"
+ http://www.infobright.org/Blog/Entry/organizing_data_and_more_about_rough_data_contest/
+
+ * MonetDB seems to have it
+ http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.2662
+ "Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS"
+
+ Index creation
+ --------------
+
+ To create a minmax index, we use the standard wording:
+
+ CREATE INDEX foo_minmax_idx ON foo USING MINMAX (a, b, e);
+
+ Partial indexes are not supported; since an index is concerned with minimum and
+ maximum values of the involved columns across all the pages in the table, it
+ doesn't make sense to exclude values. Another way to see "partial" indexes
+ here would be those that only considered some pages in the table instead of all
+ of them; but this would be difficult to implement and manage and, most likely,
+ pointless.
+
+ Expressional indexes can probably be supported in the future, but we disallow
+ them initially for conceptual simplicity.
+
+ Having multiple minmax indexes in the same table is acceptable, though most of
+ the time it would make more sense to have a single index covering all the
+ interesting columns. Multiple indexes might be useful for columns added later.
+
+ Access Method Design
+ --------------------
+
+ Since item pointers are not stored inside indexes of this type, it is not
+ possible to support the amgettuple interface. Instead, we only provide
+ amgetbitmap support; scanning a relation using this index requires a recheck
+ node on top. The amgetbitmap routine would return a TIDBitmap comprising all
+ the pages in those page groups that match the query qualifications; the recheck
+ node prunes tuples that are not visible per snapshot and those that are not
+ visible per query quals.
+
+ For each supported datatype, we need an opclass with the following catalog
+ entries:
+
+ - support operators (pg_amop): same as btree (<, <=, =, >=, >)
+
+ These operators are used pervasively:
+
+ - The optimizer requires them to evaluate queries, so that the index is chosen
+ when queries on the indexed table are planned.
+ - During index construction (ambuild), they are used to determine the boundary
+ values for each page range.
+ - During index updates (aminsert), they are used to determine whether the new
+ heap tuple matches the existing index tuple; and if not, they are used to
+ construct the new index tuple.
+
+ In each index tuple (corresponding to one page range), we store:
+ - for each indexed column:
+ * minimum value across all tuples in the range
+ * maximum value across all tuples in the range
+ * are there nulls present in any tuple?
+ * are all the values null in all the tuples in the range?
+
+ These null bits are stored in a single null bitmask of length 2x number of
+ columns.
+
+ With the default INDEX_MAX_KEYS of 32, and considering columns of 8-byte length
+ types such as timestamptz or bigint, each tuple would be 522 bytes in length,
+ which seems reasonable. There are 6 extra bytes for padding between the null
+ bitmask and the first data item, assuming 64-bit alignment; so the total size
+ for such an index tuple would actually be 528 bytes.
+
+ This maximum index tuple size is calculated as: mt_info (2 bytes) + null bitmap
+ (8 bytes) + data value (8 bytes) * 32 * 2
+
+ (Of course, larger columns are possible, such as varchar, but creating minmax
+ indexes on such columns seems of little practical usefulness. Also, the
+ usefulness of an index containing so many columns is dubious, at best.)
+
+ There can be gaps where some pages have no covering index entry. In particular,
+ the last few pages of the table would commonly not be summarized.
+
+ The Range Reverse Map
+ ---------------------
+
+ To find out the index tuple for a particular page range, we have a
+ separate fork called the range reverse map. This fork stores one TID per
+ range, which is the address of the index tuple summarizing that range. Since
+ these map entries are fixed size, it is possible to compute the address of the
+ range map entry for any given heap page.
+
+ When a new heap tuple is inserted in a summarized page range, it is possible to
+ compare the existing index tuple with the new heap tuple. If the heap tuple is
+ outside the minimum/maximum boundaries given by the index tuple for any indexed
+ column (or if the new heap tuple contains null values but the index tuple
+ indicates there are no nulls), it is necessary to create a new index tuple with
+ the new values. To do this, a new index tuple is inserted, and the reverse range
+ map is updated to point to it. The old index tuple is left in place, for later
+ garbage collection.
+
+ If the reverse range map points to an invalid TID, the corresponding page range
+ is not summarized.
+
+ A minmax index is updated by creating a new summary tuple whenever an
+ insertion outside the min-max interval occurs in the pages within the range.
+
+ To scan a table following a minmax index, we scan the reverse range map
+ sequentially. This yields index tuples in ascending page range order.
+ Query quals are matched to each index tuple; if they match, each page within
+ the page range is returned as part of the output TID bitmap. If there's no
+ match, they are skipped. Reverse range map entries returning invalid index
+ TIDs, that is unsummarized page ranges, are also returned in the TID bitmap.
+
+ To store the range reverse map, we reuse the VISIBILITYMAP_FORKNUM, since a VM
+ does not make sense for a minmax index anyway (XXX -- really??)
+
+ When tuples are added to unsummarized pages, nothing needs to happen.
+
+ Heap tuples can be removed from anywhere without restriction.
+
+ Index entries that are not referenced from the revmap can be removed from the
+ main fork. This currently happens at amvacuumcleanup, though it could be
+ carried out separately; no heap scan is necessary to determine which tuples
+ are unreachable.
+
+ Summarization
+ -------------
+
+ At index creation time, the whole table is scanned; for each page range the
+ minimum and maximum values of each indexed column and nulls bitmap are
+ collected and stored in the index. The possibly-incomplete range at the end
+ of the table is not included.
+
+ Once in a while, it is necessary to summarize a bunch of unsummarized pages
+ (because the table has grown since the index was created), or re-summarize a
+ range that has been marked invalid. This is simple: scan the page range
+ calculating the min() and max() for each indexed column, then insert the new
+ index entry at the end of the index. The main interesting questions are:
+
+ a) when to do it
+ The perfect time to do it is as soon as a complete page range of the
+ configured range size has been filled.
+
+ b) who does it (what process)
+ It doesn't seem a good idea to have a client-connected process do it;
+ it would incur unwanted latency. Three other options are (i) to spawn a
+ specialized process to do it, which perhaps can be signalled by a
+ client-connected process that executes a scan and notices the need to run
+ summarization; or (ii) to let autovacuum do it, as a separate new
+ maintenance task. This seems simple enough to bolt on top of already
+ existing autovacuum infrastructure. The timing constraints of autovacuum
+ might be undesirable, though. (iii) wait for user command.
+
+ The easiest way around this seems to be to have vacuum do it. That way we can
+ simply do re-summarization on the amvacuumcleanup routine. Other answers would
+ mean we need a separate AM routine, which appears unwarranted at this stage.
+
+ Vacuuming
+ ---------
+
+ Vacuuming a table that has a minmax index does not represent a significant
+ challenge. Since no heap TIDs are stored, it's not necessary to scan the index
+ when heap tuples are removed. It might be that some min() value can be
+ incremented, or some max() value can be decremented; but this would represent
+ an optimization opportunity only, not a correctness issue. Perhaps it's
+ simpler to represent this as the need to re-run summarization on the affected
+ page range.
+
+ Note that if there are no indexes on the table other than the minmax index,
+ usage of maintenance_work_mem by vacuum can be decreased significantly, because
+ no detailed index scan needs to take place (and thus it's not necessary for
+ vacuum to save TIDs to remove). This optimization opportunity is best left for
+ future improvement.
+
+ Locking considerations
+ ----------------------
+
+ To read the TID during an index scan, we follow this protocol:
+
+ * read revmap page
+ * obtain share lock on the revmap buffer
+ * read the TID
+ * obtain share lock on buffer of main fork
+ * LockTuple the TID (using the index as relation). A shared lock is
+ sufficient. We need the LockTuple to prevent VACUUM from recycling
+ the index tuple; see below.
+ * release revmap buffer lock
+ * read the index tuple
+ * release the tuple lock
+ * release main fork buffer lock
+
+
+ To update the summary tuple for a page range, we use this protocol:
+
+ * insert a new index tuple somewhere in the main fork; note its TID
+ * read revmap page
+ * obtain exclusive lock on revmap buffer
+ * write the TID
+ * release lock
+
+ This ensures no concurrent reader can obtain a partially-written TID.
+ Note we don't need a tuple lock here. Concurrent scans don't have to
+ worry about whether they got the old or new index tuple: if they get the
+ old one, the tighter values are okay from a correctness standpoint because
+ due to MVCC they can't possibly see the just-inserted heap tuples anyway.
+
+
+ For vacuuming, we need to figure out which index tuples are no longer
+ referenced from the reverse range map. This requires some brute force,
+ but is simple:
+
+ 1) scan the complete index, store each existing TID in a dynahash.
+ Hash key is the TID, hash value is a boolean initially set to false.
+ 2) scan the complete revmap sequentially, read the TIDs on each page. Share
+ lock on each page is sufficient. For each TID so obtained, grab the
+ element from the hash and update the boolean to true.
+ 3) Scan the index again; for each tuple found, search the hash table.
+ If the tuple is not present in hash, it must have been added after our
+ initial scan; ignore it. If the tuple is present in the hash, and the hash flag is
+ true, then the tuple is referenced from the revmap; ignore it. If the hash
+ flag is false, then the index tuple is no longer referenced by the revmap;
+ but it could be about to be accessed by a concurrent scan. Do
+ ConditionalLockTuple. If this fails, ignore the tuple (it's in use); it
+ will be deleted by a future vacuum. If lock is acquired, then we can safely
+ remove the index tuple.
+ 4) Index pages with free space can be detected by this second scan. Register
+ those with the FSM.
+
+ Note this doesn't require scanning the heap at all, or being involved in
+ the heap's cleanup procedure. Also, there is no need to LockBufferForCleanup,
+ which is a nice property because index scans keep pages pinned for long
+ periods.
+
+
+
+ Optimizer
+ ---------
+
+ In order to make this all work, the only thing we need to do is ensure we have a
+ good enough opclass and amcostestimate. With this, the optimizer is able to pick
+ up the index on its own.
+
+
+ Open questions
+ --------------
+
+ * Same-size page ranges?
+ Current related literature seems to consider that each "index entry" in a
+ minmax index must cover the same number of pages. There doesn't seem to be a
+ hard reason for this to be so; it might make sense to allow the index to
+ self-tune so that some index entries cover smaller page ranges, if this allows
+ the min()/max() values to be more compact. This would incur larger minmax
+ overhead for the index itself, but might allow better pruning of page ranges
+ during scan. In the limit of one index tuple per page, the index itself would
+ occupy too much space, even though we would be able to skip reading most
+ heap pages, because the min()/max() ranges are tight; in the opposite limit of
+ a single tuple that summarizes the whole table, we wouldn't be able to prune
+ anything even though the index is very small. This can probably be made to work
+ by using the reverse range map as an index in itself.
+
+ * More compact representation for TIDBitmap?
+ TIDBitmap is the structure used to represent bitmap scans. The
+ representation of lossy page ranges is not optimal for our purposes, because
+ it uses a Bitmapset to represent pages in the range; since we're going to return
+ all pages in a large range, it might be more convenient to allow for a
+ struct that uses start and end page numbers to represent the range, instead.
+
+
+
+ References:
+
+ Email thread on pgsql-hackers
+ http://www.postgresql.org/message-id/1199296574.7260.149.camel@ebony.site
+ From: Simon Riggs
+ To: pgsql-hackers
+ Subject: Dynamic Partitioning using Segment Visibility Map
+
+ http://wiki.postgresql.org/wiki/Segment_Exclusion
+ http://wiki.postgresql.org/wiki/Segment_Visibility_Map
+
*** a/src/backend/access/Makefile
--- b/src/backend/access/Makefile
***************
*** 8,13 **** subdir = src/backend/access
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
! SUBDIRS = common gin gist hash heap index nbtree rmgrdesc spgist transam
include $(top_srcdir)/src/backend/common.mk
--- 8,13 ----
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
! SUBDIRS = common gin gist hash heap index minmax nbtree rmgrdesc spgist transam
include $(top_srcdir)/src/backend/common.mk
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 268,273 **** initscan(HeapScanDesc scan, ScanKey key, bool is_rescan)
--- 268,275 ----
scan->rs_startblock = 0;
}
+ scan->rs_initblock = 0;
+ scan->rs_numblocks = InvalidBlockNumber;
scan->rs_inited = false;
scan->rs_ctup.t_data = NULL;
ItemPointerSetInvalid(&scan->rs_ctup.t_self);
***************
*** 293,298 **** initscan(HeapScanDesc scan, ScanKey key, bool is_rescan)
--- 295,308 ----
pgstat_count_heap_scan(scan->rs_rd);
}
+ void
+ heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk, BlockNumber numBlks)
+ {
+ scan->rs_startblock = startBlk;
+ scan->rs_initblock = startBlk;
+ scan->rs_numblocks = numBlks;
+ }
+
/*
* heapgetpage - subroutine for heapgettup()
*
***************
*** 634,640 **** heapgettup(HeapScanDesc scan,
*/
if (backward)
{
! finished = (page == scan->rs_startblock);
if (page == 0)
page = scan->rs_nblocks;
page--;
--- 644,651 ----
*/
if (backward)
{
! finished = --scan->rs_numblocks <= 0 ||
! (page == scan->rs_startblock);
if (page == 0)
page = scan->rs_nblocks;
page--;
***************
*** 644,650 **** heapgettup(HeapScanDesc scan,
page++;
if (page >= scan->rs_nblocks)
page = 0;
! finished = (page == scan->rs_startblock);
/*
* Report our new scan position for synchronization purposes. We
--- 655,662 ----
page++;
if (page >= scan->rs_nblocks)
page = 0;
! finished = --scan->rs_numblocks <= 0 ||
! (page == scan->rs_startblock);
/*
* Report our new scan position for synchronization purposes. We
***************
*** 895,901 **** heapgettup_pagemode(HeapScanDesc scan,
*/
if (backward)
{
! finished = (page == scan->rs_startblock);
if (page == 0)
page = scan->rs_nblocks;
page--;
--- 907,913 ----
*/
if (backward)
{
! finished = --scan->rs_numblocks <= 0 || page == scan->rs_startblock;
if (page == 0)
page = scan->rs_nblocks;
page--;
***************
*** 905,911 **** heapgettup_pagemode(HeapScanDesc scan,
page++;
if (page >= scan->rs_nblocks)
page = 0;
! finished = (page == scan->rs_startblock);
/*
* Report our new scan position for synchronization purposes. We
--- 917,923 ----
page++;
if (page >= scan->rs_nblocks)
page = 0;
! finished = --scan->rs_numblocks <= 0 || page == scan->rs_startblock;
/*
* Report our new scan position for synchronization purposes. We
*** /dev/null
--- b/src/backend/access/minmax/Makefile
***************
*** 0 ****
--- 1,17 ----
+ #-------------------------------------------------------------------------
+ #
+ # Makefile--
+ # Makefile for access/minmax
+ #
+ # IDENTIFICATION
+ # src/backend/access/minmax/Makefile
+ #
+ #-------------------------------------------------------------------------
+
+ subdir = src/backend/access/minmax
+ top_builddir = ../../../..
+ include $(top_builddir)/src/Makefile.global
+
+ OBJS = minmax.o mmrevmap.o mmtuple.o mmxlog.o
+
+ include $(top_srcdir)/src/backend/common.mk
*** /dev/null
--- b/src/backend/access/minmax/minmax.c
***************
*** 0 ****
--- 1,1523 ----
+ /*
+ * minmax.c
+ * Implementation of Minmax indexes for Postgres
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/minmax.c
+ *
+ * TODO
+ * * do we need to reserve special space on pages?
+ * * support collatable datatypes
+ * * on heap insert, we always create a new index entry. Need to mark
+ * range as unsummarized at some point, to avoid index bloat?
+ * * index truncation on vacuum?
+ * * datumCopy() is needed in several places?
+ */
+ #include "postgres.h"
+
+ #include "access/htup_details.h"
+ #include "access/minmax.h"
+ #include "access/minmax_internal.h"
+ #include "access/minmax_revmap.h"
+ #include "access/minmax_tuple.h"
+ #include "access/minmax_xlog.h"
+ #include "access/relscan.h"
+ #include "access/xlogutils.h"
+ #include "catalog/index.h"
+ #include "catalog/pg_operator.h"
+ #include "commands/vacuum.h"
+ #include "miscadmin.h"
+ #include "pgstat.h"
+ #include "storage/bufmgr.h"
+ #include "storage/freespace.h"
+ #include "storage/lmgr.h"
+ #include "utils/datum.h"
+ #include "utils/lsyscache.h"
+ #include "utils/memutils.h"
+ #include "utils/syscache.h"
+
+
+ /*
+ * We use a MMBuildState during initial construction of a Minmax index.
+ * Within that struct, each column's construction info is represented by a
+ * MMPerColBuildInfo struct. The running state is all kept in a
+ * DeformedMMTuple.
+ */
+ typedef struct MMPerColBuildInfo
+ {
+ AttrNumber heapAttno;
+ int typLen;
+ bool typByVal;
+ FmgrInfo lt;
+ FmgrInfo gt;
+ } MMPerColBuildInfo;
+
+ typedef struct MMBuildState
+ {
+ Relation irel;
+ int numtuples;
+ Buffer currentInsertBuf;
+ BlockNumber currRangeStart;
+ BlockNumber nextRangeAt;
+ mmRevmapAccess *rmAccess;
+ TupleDesc indexDesc;
+ TupleDesc diskDesc;
+ DeformedMMTuple *dtuple;
+ MMPerColBuildInfo perColState[FLEXIBLE_ARRAY_MEMBER];
+ } MMBuildState;
+
+ static void mmbuildCallback(Relation index,
+ HeapTuple htup, Datum *values, bool *isnull,
+ bool tupleIsAlive, void *state);
+ static void get_mm_operator(Oid opfam, Oid idxtypid, Oid keytypid,
+ StrategyNumber strategy, FmgrInfo *finfo);
+ static inline bool invoke_mm_operator(FmgrInfo *operator, Oid collation,
+ Datum left, Datum right);
+ static void mm_doinsert(Relation idxrel, mmRevmapAccess *rmAccess,
+ Buffer *buffer, BlockNumber heapblkno, MMTuple *tup, Size itemsz);
+ static Buffer mm_getnewbuffer(Relation irel);
+ static bool mm_getinsertbuffer(Relation irel, Buffer *buffer, Size itemsz);
+
+
+ #define MINMAX_PAGES_PER_RANGE 2
+
+
+ /*
+ * A tuple in the heap is being inserted. To keep a minmax index up to date,
+ * we need to obtain the relevant index tuple, compare its min()/max() stored
+ * values with those of the new tuple; if the tuple values are in range,
+ * there's nothing to do; otherwise we need to create a new index tuple and
+ * point the revmap to it.
+ *
+ * If the range is not currently summarized (i.e. the revmap returns InvalidTid
+ * for it), there's nothing to do either.
+ */
+ Datum
+ mminsert(PG_FUNCTION_ARGS)
+ {
+ Relation idxRel = (Relation) PG_GETARG_POINTER(0);
+ Datum *values = (Datum *) PG_GETARG_POINTER(1);
+ bool *nulls = (bool *) PG_GETARG_POINTER(2);
+ ItemPointer heaptid = (ItemPointer) PG_GETARG_POINTER(3);
+
+ /* we ignore the rest of our arguments */
+ mmRevmapAccess *rmAccess;
+ Datum indclassDatum;
+ bool isnull;
+ oidvector *indclass;
+ TupleDesc tupdesc;
+ MMTuple *mmtup;
+ DeformedMMTuple *dtup;
+ ItemPointerData idxtid;
+ BlockNumber heapBlk;
+ BlockNumber iblk;
+ OffsetNumber ioff;
+ Buffer buf;
+ IndexInfo *indexInfo;
+ Page page;
+ int keyno;
+ FmgrInfo *lt;
+ FmgrInfo *gt;
+ bool need_insert = false;
+
+ rmAccess = mmRevmapAccessInit(idxRel, MINMAX_PAGES_PER_RANGE);
+
+ heapBlk = ItemPointerGetBlockNumber(heaptid);
+ mmGetHeapBlockItemptr(rmAccess, heapBlk, &idxtid);
+ /* tuple lock on idxtid is grabbed by mmGetHeapBlockItemptr */
+
+ if (!ItemPointerIsValid(&idxtid))
+ {
+ /* nothing to do, range is unsummarized */
+ mmRevmapAccessTerminate(rmAccess);
+ return BoolGetDatum(false);
+ }
+
+ tupdesc = RelationGetDescr(idxRel);
+ indexInfo = BuildIndexInfo(idxRel);
+
+ lt = palloc(sizeof(FmgrInfo) * indexInfo->ii_NumIndexAttrs);
+ gt = palloc(sizeof(FmgrInfo) * indexInfo->ii_NumIndexAttrs);
+
+ /* grab the operators we will need: < and > for each indexed column */
+ indclassDatum = SysCacheGetAttr(INDEXRELID, idxRel->rd_indextuple,
+ Anum_pg_index_indclass, &isnull);
+ Assert(!isnull);
+ indclass = (oidvector *) DatumGetPointer(indclassDatum);
+ for (keyno = 0; keyno < indexInfo->ii_NumIndexAttrs; keyno++)
+ {
+ Oid opfam = get_opclass_family(indclass->values[keyno]);
+ Oid idxtypid = tupdesc->attrs[keyno]->atttypid;
+
+ get_mm_operator(opfam, idxtypid, idxtypid, BTLessStrategyNumber,
+ <[keyno]);
+ get_mm_operator(opfam, idxtypid, idxtypid, BTGreaterStrategyNumber,
+ >[keyno]);
+ }
+
+ iblk = ItemPointerGetBlockNumber(&idxtid);
+ ioff = ItemPointerGetOffsetNumber(&idxtid);
+ buf = ReadBuffer(idxRel, iblk);
+
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ UnlockTuple(idxRel, &idxtid, ShareLock);
+ page = BufferGetPage(buf);
+ mmtup = (MMTuple *) PageGetItem(page, PageGetItemId(page, ioff));
+
+ dtup = minmax_deform_tuple(tupdesc, mmtup);
+
+ /* compare the key values of the new tuple to the stored index values */
+ for (keyno = 0; keyno < indexInfo->ii_NumIndexAttrs; keyno++)
+ {
+ /*
+ * If the new tuple contains a null in this attr, but the range index
+ * tuple doesn't allow for nulls, we need a new summary tuple
+ */
+ if (nulls[keyno])
+ {
+ if (!dtup->values[keyno].hasnulls)
+ {
+ dtup->values[keyno].hasnulls = true;
+ need_insert = true;
+ }
+ /* a null key value never changes the min/max interval itself */
+ continue;
+ }
+
+ /*
+ * If the new key value is not within the min/max interval for this
+ * range, we need a new summary tuple
+ */
+ if (invoke_mm_operator(<[keyno], InvalidOid, values[keyno],
+ dtup->values[keyno].min))
+ {
+ dtup->values[keyno].min = values[keyno]; /* XXX datumCopy? */
+ need_insert = true;
+ }
+ if (invoke_mm_operator(>[keyno], InvalidOid, values[keyno],
+ dtup->values[keyno].max))
+ {
+ dtup->values[keyno].max = values[keyno]; /* XXX datumCopy? */
+ need_insert = true;
+ }
+ }
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ if (need_insert)
+ {
+ TupleDesc diskDesc;
+ Size tupsz;
+ MMTuple *tup;
+
+ diskDesc = minmax_get_descr(tupdesc);
+ tup = minmax_form_tuple(tupdesc, diskDesc, dtup, &tupsz);
+
+ mm_doinsert(idxRel, rmAccess, &buf, heapBlk, tup, tupsz);
+ }
+
+ ReleaseBuffer(buf);
+
+ mmRevmapAccessTerminate(rmAccess);
+
+ return BoolGetDatum(false);
+ }
+
+ Datum
+ mmbeginscan(PG_FUNCTION_ARGS)
+ {
+ Relation r = (Relation) PG_GETARG_POINTER(0);
+ int nkeys = PG_GETARG_INT32(1);
+ int norderbys = PG_GETARG_INT32(2);
+ IndexScanDesc scan;
+
+ scan = RelationGetIndexScan(r, nkeys, norderbys);
+
+ PG_RETURN_POINTER(scan);
+ }
+
+
+ /*
+ * Execute the index scan.
+ *
+ * This works by reading index TIDs from the revmap, and obtaining the index
+ * tuples pointed to by them; the min/max values in them are compared to the
+ * scan keys. We return into the TID bitmap all the pages in ranges
+ * corresponding to index tuples that match the scan keys.
+ *
+ * If a TID from the revmap is read as InvalidTID, we know that range is
+ * unsummarized. Pages in those ranges need to be returned regardless of scan
+ * keys.
+ */
+ Datum
+ mmgetbitmap(PG_FUNCTION_ARGS)
+ {
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+ TIDBitmap *tbm = (TIDBitmap *) PG_GETARG_POINTER(1);
+ Relation idxRel = scan->indexRelation;
+ Buffer currIdxBuf = InvalidBuffer;
+ Oid heapOid;
+ Relation heapRel;
+ mmRevmapAccess *rmAccess;
+ BlockNumber nblocks;
+ BlockNumber heapBlk;
+ TupleDesc tupdesc;
+ AttrNumber keyno;
+ Datum indclassDatum;
+ bool isnull;
+ oidvector *indclass;
+ FmgrInfo *lt;
+ FmgrInfo *lteq;
+ FmgrInfo *gteq;
+ FmgrInfo *gt;
+
+ pgstat_count_index_scan(idxRel);
+
+ heapOid = IndexGetRelation(RelationGetRelid(idxRel), false);
+ heapRel = heap_open(heapOid, AccessShareLock);
+ nblocks = RelationGetNumberOfBlocks(heapRel);
+ heap_close(heapRel, AccessShareLock);
+
+ tupdesc = RelationGetDescr(idxRel);
+
+ lt = palloc(sizeof(FmgrInfo) * scan->numberOfKeys);
+ lteq = palloc(sizeof(FmgrInfo) * scan->numberOfKeys);
+ gteq = palloc(sizeof(FmgrInfo) * scan->numberOfKeys);
+ gt = palloc(sizeof(FmgrInfo) * scan->numberOfKeys);
+
+ /*
+ * lookup the operators needed to determine range containment of each key
+ * value.
+ */
+ indclassDatum = SysCacheGetAttr(INDEXRELID, idxRel->rd_indextuple,
+ Anum_pg_index_indclass, &isnull);
+ Assert(!isnull);
+ indclass = (oidvector *) DatumGetPointer(indclassDatum);
+ for (keyno = 0; keyno < scan->numberOfKeys; keyno++)
+ {
+ AttrNumber keyattno;
+ Oid opfam;
+ Oid keytypid;
+ Oid idxtypid;
+
+ keyattno = scan->keyData[keyno].sk_attno;
+ opfam = get_opclass_family(indclass->values[keyattno - 1]);
+ keytypid = scan->keyData[keyno].sk_subtype;
+ idxtypid = tupdesc->attrs[keyattno - 1]->atttypid;
+
+ get_mm_operator(opfam, idxtypid, keytypid, BTLessStrategyNumber,
+ <[keyno]);
+ get_mm_operator(opfam, idxtypid, keytypid, BTLessEqualStrategyNumber,
+ <eq[keyno]);
+ get_mm_operator(opfam, idxtypid, keytypid, BTGreaterStrategyNumber,
+ >[keyno]);
+ get_mm_operator(opfam, idxtypid, keytypid, BTGreaterEqualStrategyNumber,
+ >eq[keyno]);
+ }
+
+ /*
+ * Now scan the revmap. We start by querying for heap page 0,
+ * incrementing by the number of pages per range; this gives us a full
+ * view of the table.
+ */
+ rmAccess = mmRevmapAccessInit(idxRel, MINMAX_PAGES_PER_RANGE);
+ for (heapBlk = 0; heapBlk < nblocks; heapBlk += MINMAX_PAGES_PER_RANGE)
+ {
+ ItemPointerData itupptr;
+ bool addrange;
+
+ mmGetHeapBlockItemptr(rmAccess, heapBlk, &itupptr);
+
+ /*
+ * For revmap items that return InvalidTID, we must return the whole
+ * range; otherwise, fetch the index item and compare it to the scan
+ * keys.
+ */
+ if (!ItemPointerIsValid(&itupptr))
+ {
+ addrange = true;
+ }
+ else
+ {
+ Page page;
+ OffsetNumber idxoffno;
+ BlockNumber idxblkno;
+ MMTuple *tup;
+ DeformedMMTuple *dtup;
+ int keyno;
+
+ idxoffno = ItemPointerGetOffsetNumber(&itupptr);
+ idxblkno = ItemPointerGetBlockNumber(&itupptr);
+
+ if (currIdxBuf == InvalidBuffer ||
+ idxblkno != BufferGetBlockNumber(currIdxBuf))
+ {
+ if (currIdxBuf != InvalidBuffer)
+ ReleaseBuffer(currIdxBuf);
+
+ currIdxBuf = ReadBuffer(idxRel, idxblkno);
+ }
+
+ /*
+ * To keep the buffer locked for a short time, we grab and
+ * immediately deform the index tuple to operate on. As soon as
+ * we have acquired the lock on the index buffer, we can release
+ * the tuple lock the revmap acquired for us. So vacuum would be
+ * able to remove this index row as soon as we release the buffer
+ * lock, if it has become stale.
+ */
+ LockBuffer(currIdxBuf, BUFFER_LOCK_SHARE);
+
+ UnlockTuple(idxRel, &itupptr, ShareLock);
+
+ page = BufferGetPage(currIdxBuf);
+ tup = (MMTuple *)
+ PageGetItem(page, PageGetItemId(page, idxoffno));
+ /* XXX probably need copies */
+ dtup = minmax_deform_tuple(tupdesc, tup);
+
+ /* done with the index page */
+ LockBuffer(currIdxBuf, BUFFER_LOCK_UNLOCK);
+
+ /*
+ * Compare scan keys with min/max values stored in range. If scan
+ * keys are matched, the page range must be added to the bitmap.
+ */
+ for (keyno = 0, addrange = true;
+ keyno < scan->numberOfKeys;
+ keyno++)
+ {
+ ScanKey key = &scan->keyData[keyno];
+ AttrNumber keyattno = key->sk_attno;
+
+ /*
+ * The analysis we need to make to decide whether to include a
+ * page range in the output result is: is it possible for a
+ * tuple contained within the min/max interval specified by
+ * this index tuple to match what's specified by the scan key?
+ * For example, for a query qual such as "WHERE col < 10" we
+ * need to include a range whose minimum value is less than
+ * 10.
+ *
+ * When there are multiple scan keys, failure to meet the
+ * criteria for a single one of them is enough to discard the
+ * range as a whole.
+ */
+ switch (key->sk_strategy)
+ {
+ case BTLessStrategyNumber:
+ addrange =
+ invoke_mm_operator(<[keyno], InvalidOid,
+ dtup->values[keyattno - 1].min,
+ key->sk_argument);
+ break;
+ case BTLessEqualStrategyNumber:
+ addrange =
+ invoke_mm_operator(<eq[keyno], InvalidOid,
+ dtup->values[keyattno - 1].min,
+ key->sk_argument);
+ break;
+ case BTEqualStrategyNumber:
+
+ /*
+ * In the equality case (WHERE col = someval), we want
+ * to return the current page range if the minimum
+ * value in the range <= scan key, and the maximum
+ * value >= scan key.
+ */
+ addrange =
+ invoke_mm_operator(<eq[keyno], InvalidOid,
+ dtup->values[keyattno - 1].min,
+ key->sk_argument);
+ if (!addrange)
+ break;
+ /* max() >= scankey */
+ addrange =
+ invoke_mm_operator(>eq[keyno], InvalidOid,
+ dtup->values[keyattno - 1].max,
+ key->sk_argument);
+ break;
+ case BTGreaterEqualStrategyNumber:
+ addrange =
+ invoke_mm_operator(>eq[keyno], InvalidOid,
+ dtup->values[keyattno - 1].max,
+ key->sk_argument);
+ break;
+ case BTGreaterStrategyNumber:
+ addrange =
+ invoke_mm_operator(>[keyno], InvalidOid,
+ dtup->values[keyattno - 1].max,
+ key->sk_argument);
+ break;
+ }
+
+ /*
+ * If the current scan key doesn't match the range values,
+ * don't look at further ones.
+ */
+ if (!addrange)
+ break;
+ }
+
+ /* XXX anything to free here? */
+ }
+
+ if (addrange)
+ {
+ BlockNumber pageno;
+
+ for (pageno = heapBlk;
+ pageno <= heapBlk + MINMAX_PAGES_PER_RANGE - 1;
+ pageno++)
+ tbm_add_page(tbm, pageno);
+ }
+ }
+
+ mmRevmapAccessTerminate(rmAccess);
+ if (currIdxBuf != InvalidBuffer)
+ ReleaseBuffer(currIdxBuf);
+
+ pfree(lt);
+ pfree(lteq);
+ pfree(gt);
+ pfree(gteq);
+
+ PG_RETURN_INT64(MaxHeapTuplesPerPage);
+ }
+
+
+ Datum
+ mmrescan(PG_FUNCTION_ARGS)
+ {
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+ ScanKey scankey = (ScanKey) PG_GETARG_POINTER(1);
+
+ /* other arguments ignored */
+
+ if (scankey && scan->numberOfKeys > 0)
+ {
+ memmove(scan->keyData, scankey,
+ scan->numberOfKeys * sizeof(ScanKeyData));
+ }
+
+ PG_RETURN_VOID();
+ }
+
+ Datum
+ mmendscan(PG_FUNCTION_ARGS)
+ {
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+
+ /* anything to do here? */
+ (void) scan; /* silence compiler */
+
+ PG_RETURN_VOID();
+ }
+
+ Datum
+ mmmarkpos(PG_FUNCTION_ARGS)
+ {
+ elog(ERROR, "MinMax does not support mark/restore");
+ PG_RETURN_VOID();
+ }
+
+ Datum
+ mmrestrpos(PG_FUNCTION_ARGS)
+ {
+ elog(ERROR, "MinMax does not support mark/restore");
+ PG_RETURN_VOID();
+ }
+
+ /*
+ * Reset the per-column build state in an MMBuildState.
+ */
+ static void
+ clear_mm_percol_buildstate(MMBuildState *mmstate)
+ {
+ int i;
+
+ for (i = 0; i < mmstate->indexDesc->natts; i++)
+ {
+ mmstate->dtuple->values[i].allnulls = true;
+ mmstate->dtuple->values[i].hasnulls = false;
+ mmstate->dtuple->values[i].min = (Datum) 0;
+ mmstate->dtuple->values[i].max = (Datum) 0;
+ }
+ }
+
+ /*
+ * Per-heap-tuple callback for IndexBuildHeapScan.
+ *
+ * Note that we don't worry about the page range at the end of the table here;
+ * its values are present in the build state struct but not inserted into the
+ * index. The caller must take care of that, if appropriate.
+ */
+ static void
+ mmbuildCallback(Relation index,
+ HeapTuple htup,
+ Datum *values,
+ bool *isnull,
+ bool tupleIsAlive,
+ void *state)
+ {
+ MMBuildState *mmstate = (MMBuildState *) state;
+ BlockNumber thisblock;
+ int i;
+
+ thisblock = ItemPointerGetBlockNumber(&htup->t_self);
+
+ /*
+ * If we're in a new block which belongs to the next range, summarize what
+ * we've got and start afresh.
+ */
+ if (thisblock == mmstate->nextRangeAt)
+ {
+ MMTuple *tup;
+ Size size;
+
+ #if 0
+ for (i = 0; i < mmstate->indexDesc->natts; i++)
+ {
+ elog(DEBUG2, "completed a range for column %d, range: %u .. %u",
+ i,
+ DatumGetUInt32(mmstate->dtuple->values[i].min),
+ DatumGetUInt32(mmstate->dtuple->values[i].max));
+ }
+ #endif
+
+ /*
+ * Create the index tuple containing min/max values, and insert it.
+ */
+ tup = minmax_form_tuple(mmstate->indexDesc, mmstate->diskDesc,
+ mmstate->dtuple, &size);
+ mm_doinsert(mmstate->irel, mmstate->rmAccess,
+ &mmstate->currentInsertBuf, mmstate->currRangeStart, tup,
+ size);
+ mmstate->numtuples++;
+ pfree(tup);
+
+ /* and set state to correspond to the new current range */
+ mmstate->currRangeStart = mmstate->nextRangeAt;
+ mmstate->nextRangeAt = mmstate->currRangeStart + MINMAX_PAGES_PER_RANGE;
+
+ /* initialize aggregate state for the new range */
+ for (i = 0; i < mmstate->indexDesc->natts; i++)
+ {
+ if (!mmstate->dtuple->values[i].allnulls &&
+ !mmstate->perColState[i].typByVal)
+ {
+ pfree(DatumGetPointer(mmstate->dtuple->values[i].min));
+ pfree(DatumGetPointer(mmstate->dtuple->values[i].max));
+ }
+ }
+
+ clear_mm_percol_buildstate(mmstate);
+ }
+
+ /* Accumulate the current tuple into the running state */
+ for (i = 0; i < mmstate->indexDesc->natts; i++)
+ {
+ AttrNumber heapAttno = mmstate->perColState[i].heapAttno;
+
+ /*
+ * If the value in the current heap tuple is null, there's not much to
+ * do other than keep track that we saw it.
+ */
+ if (isnull[heapAttno - 1])
+ {
+ mmstate->dtuple->values[i].hasnulls = true;
+ continue;
+ }
+
+ /*
+ * If this is the first tuple in the range containing a not-null value
+ * for this column, initialize our state.
+ */
+ if (mmstate->dtuple->values[i].allnulls)
+ {
+ mmstate->dtuple->values[i].allnulls = false;
+ mmstate->dtuple->values[i].min =
+ datumCopy(values[heapAttno - 1],
+ mmstate->perColState[i].typByVal,
+ mmstate->perColState[i].typLen);
+ mmstate->dtuple->values[i].max =
+ datumCopy(values[heapAttno - 1],
+ mmstate->perColState[i].typByVal,
+ mmstate->perColState[i].typLen);
+ continue;
+ }
+
+ /*
+ * Otherwise, dtuple state was already initialized, and the current
+ * tuple is not null: therefore we need to compare it to the current
+ * state and possibly update the min/max boundaries.
+ */
+ if (invoke_mm_operator(&mmstate->perColState[i].lt, InvalidOid,
+ values[heapAttno - 1],
+ mmstate->dtuple->values[i].min))
+ {
+ if (!mmstate->perColState[i].typByVal)
+ pfree(DatumGetPointer(mmstate->dtuple->values[i].min));
+ mmstate->dtuple->values[i].min =
+ datumCopy(values[heapAttno - 1],
+ mmstate->perColState[i].typByVal,
+ mmstate->perColState[i].typLen);
+ }
+
+ if (invoke_mm_operator(&mmstate->perColState[i].gt, InvalidOid,
+ values[heapAttno - 1],
+ mmstate->dtuple->values[i].max))
+ {
+ if (!mmstate->perColState[i].typByVal)
+ pfree(DatumGetPointer(mmstate->dtuple->values[i].max));
+ mmstate->dtuple->values[i].max =
+ datumCopy(values[heapAttno - 1],
+ mmstate->perColState[i].typByVal,
+ mmstate->perColState[i].typLen);
+ }
+ }
+ }
+
+ static MMBuildState *
+ initialize_mm_buildstate(Relation heapRel, Relation idxRel,
+ mmRevmapAccess *rmAccess, IndexInfo *indexInfo)
+ {
+ MMBuildState *mmstate;
+ TupleDesc heapDesc = RelationGetDescr(heapRel);
+ Datum indclassDatum;
+ bool isnull;
+ oidvector *indclass;
+ int i;
+
+ mmstate = palloc(offsetof(MMBuildState, perColState) +
+ sizeof(MMPerColBuildInfo) * indexInfo->ii_NumIndexAttrs);
+
+ mmstate->irel = idxRel;
+ mmstate->numtuples = 0;
+ mmstate->currentInsertBuf = InvalidBuffer;
+ mmstate->currRangeStart = 0;
+ mmstate->nextRangeAt = MINMAX_PAGES_PER_RANGE;
+ mmstate->rmAccess = rmAccess;
+ mmstate->indexDesc = RelationGetDescr(idxRel);
+ mmstate->diskDesc = minmax_get_descr(mmstate->indexDesc);
+
+ mmstate->dtuple = palloc(offsetof(DeformedMMTuple, values) +
+ sizeof(MMValues) * indexInfo->ii_NumIndexAttrs);
+ /* other stuff in dtuple is initialized below */
+
+ indclassDatum = SysCacheGetAttr(INDEXRELID, idxRel->rd_indextuple,
+ Anum_pg_index_indclass, &isnull);
+ Assert(!isnull);
+ indclass = (oidvector *) DatumGetPointer(indclassDatum);
+
+ for (i = 0; i < mmstate->indexDesc->natts; i++)
+ {
+ int heapAttno;
+ Form_pg_attribute attr;
+ Oid opfam = get_opclass_family(indclass->values[i]);
+ Oid idxtypid = mmstate->indexDesc->attrs[i]->atttypid;
+
+ heapAttno = indexInfo->ii_KeyAttrNumbers[i];
+ if (heapAttno == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot create minmax indexes on expressions")));
+
+ attr = heapDesc->attrs[heapAttno - 1];
+ mmstate->perColState[i].heapAttno = heapAttno;
+ mmstate->perColState[i].typByVal = attr->attbyval;
+ mmstate->perColState[i].typLen = attr->attlen;
+ get_mm_operator(opfam, idxtypid, idxtypid, BTLessStrategyNumber,
+ &(mmstate->perColState[i].lt));
+ get_mm_operator(opfam, idxtypid, idxtypid, BTGreaterStrategyNumber,
+ &(mmstate->perColState[i].gt));
+
+ /* initialize per-column state */
+ }
+
+ clear_mm_percol_buildstate(mmstate);
+
+ return mmstate;
+ }
+
+ void
+ mm_init_metapage(Buffer meta)
+ {
+ MinmaxMetaPageData *metadata;
+ Page page = BufferGetPage(meta);
+
+ PageInit(page, BLCKSZ, 0);
+
+ metadata = (MinmaxMetaPageData *) PageGetContents(page);
+
+ metadata->minmaxMagic = MINMAX_META_MAGIC;
+ metadata->minmaxVersion = MINMAX_CURRENT_VERSION;
+ }
+
+ /*
+ * mmbuild() -- build a new minmax index.
+ */
+ Datum
+ mmbuild(PG_FUNCTION_ARGS)
+ {
+ Relation heap = (Relation) PG_GETARG_POINTER(0);
+ Relation index = (Relation) PG_GETARG_POINTER(1);
+ IndexInfo *indexInfo = (IndexInfo *) PG_GETARG_POINTER(2);
+ IndexBuildResult *result;
+ double reltuples;
+ mmRevmapAccess *rmAccess;
+ MMBuildState *mmstate;
+ Buffer meta;
+
+ /*
+ * We expect to be called exactly once for any index relation.
+ */
+ if (RelationGetNumberOfBlocks(index) != 0)
+ elog(ERROR, "index \"%s\" already contains data",
+ RelationGetRelationName(index));
+
+ /* partial indexes not supported */
+ if (indexInfo->ii_Predicate != NIL)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("partial indexes not supported")));
+ /* expressions not supported (yet?) */
+ if (indexInfo->ii_Expressions != NIL)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("expression indexes not supported")));
+
+ START_CRIT_SECTION();
+ meta = mm_getnewbuffer(index);
+ mm_init_metapage(meta);
+ MarkBufferDirty(meta);
+
+ if (RelationNeedsWAL(index))
+ {
+ XLogRecPtr recptr;
+ XLogRecData rdata;
+ Page page;
+
+ rdata.buffer = InvalidBuffer;
+ rdata.data = (char *) &(index->rd_node);
+ rdata.len = sizeof(RelFileNode);
+ rdata.next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_CREATE_INDEX, &rdata);
+
+ page = BufferGetPage(meta);
+ PageSetLSN(page, recptr);
+ }
+
+ UnlockReleaseBuffer(meta);
+ END_CRIT_SECTION();
+
+ /* set up our "reverse map" fork */
+ mmRevmapCreate(index);
+
+ /*
+ * Initialize our state, including the deformed tuple state.
+ */
+ rmAccess = mmRevmapAccessInit(index, MINMAX_PAGES_PER_RANGE);
+ mmstate = initialize_mm_buildstate(heap, index, rmAccess, indexInfo);
+
+ /*
+ * Now scan the relation. No syncscan allowed here because we want the
+ * heap blocks in order
+ */
+ reltuples = IndexBuildHeapScan(heap, index, indexInfo, false,
+ mmbuildCallback, (void *) mmstate);
+
+ /* XXX process the final batch, if needed */
+
+
+ /* release the last index buffer used */
+ if (!BufferIsInvalid(mmstate->currentInsertBuf))
+ {
+ ReleaseBuffer(mmstate->currentInsertBuf);
+ mmstate->currentInsertBuf = InvalidBuffer;
+ }
+
+ mmRevmapAccessTerminate(mmstate->rmAccess);
+
+ /*
+ * Return statistics
+ */
+ result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+
+ result->heap_tuples = reltuples;
+ result->index_tuples = mmstate->numtuples;
+
+ PG_RETURN_POINTER(result);
+ }
+
+ Datum
+ mmbuildempty(PG_FUNCTION_ARGS)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("unlogged MinMax indexes are not supported")));
+
+ PG_RETURN_VOID();
+ }
+
+ Datum
+ mmbulkdelete(PG_FUNCTION_ARGS)
+ {
+ PG_RETURN_POINTER(NULL);
+ }
+
+ /*
+ * qsort comparator for ItemPointerData items
+ */
+ static int
+ qsortCompareItemPointers(const void *a, const void *b)
+ {
+ return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
+ }
+
+ /*
+ * Remove index tuples that are no longer useful.
+ *
+ * While at it, return an array of block numbers for which the revmap returns
+ * InvalidTid; this is used in a later stage to execute re-summarization.
+ * (The block numbers are the heap page numbers at which each unsummarized
+ * range starts.) Space for the array is palloc'ed, and must be
+ * freed by caller.
+ */
+ static void
+ remove_deletable_tuples(Relation idxRel, BlockNumber heapNumBlocks,
+ BufferAccessStrategy strategy,
+ BlockNumber **nonsummed, int *numnonsummed)
+ {
+ HASHCTL hctl;
+ HTAB *tuples;
+ HASH_SEQ_STATUS status;
+ MemoryContext hashcxt;
+ BlockNumber nblocks;
+ BlockNumber blk;
+ mmRevmapAccess *rmAccess;
+ BlockNumber heapBlk;
+ int numitems = 0;
+ int numdeletable = 0;
+ ItemPointerData *deletable;
+ int start;
+ int i;
+ BlockNumber *nonsumm = NULL;
+ int maxnonsumm = 0;
+ int numnonsumm = 0;
+
+ typedef struct DeletableTuple
+ {
+ ItemPointerData tid;
+ bool referenced;
+ } DeletableTuple;
+
+ nblocks = RelationGetNumberOfBlocks(idxRel);
+
+ hashcxt = AllocSetContextCreate(CurrentMemoryContext,
+ "mm remove deletable hash",
+ ALLOCSET_SMALL_MINSIZE,
+ ALLOCSET_SMALL_INITSIZE,
+ ALLOCSET_SMALL_MAXSIZE);
+
+ /* Initialize hash used to track deletable tuples */
+ memset(&hctl, 0, sizeof(hctl));
+ hctl.keysize = sizeof(ItemPointerData);
+ hctl.entrysize = sizeof(DeletableTuple);
+ hctl.hcxt = hashcxt;
+ hctl.hash = tag_hash;
+
+ /* assume ten entries per page. No harm in getting this wrong */
+ tuples = hash_create("mmvacuumcleanup", nblocks * 10, &hctl,
+ HASH_CONTEXT | HASH_FUNCTION | HASH_ELEM);
+
+ /*
+ * Scan the index sequentially, entering each item into a hash table.
+ * Initially, the items are marked as not referenced.
+ */
+ for (blk = 0; blk < nblocks; blk++)
+ {
+ Buffer buf;
+ Page page;
+ OffsetNumber offno;
+
+ vacuum_delay_point();
+
+ buf = ReadBufferExtended(idxRel, MAIN_FORKNUM, blk, RBM_NORMAL,
+ strategy);
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ page = BufferGetPage(buf);
+
+ for (offno = 1; offno <= PageGetMaxOffsetNumber(page); offno++)
+ {
+ ItemPointerData tid;
+ ItemId itemid;
+ bool found;
+ DeletableTuple *hitem;
+
+ itemid = PageGetItemId(page, offno);
+ if (!ItemIdHasStorage(itemid))
+ continue;
+
+ ItemPointerSet(&tid, blk, offno);
+ hitem = (DeletableTuple *) hash_search(tuples,
+ &tid,
+ HASH_ENTER,
+ &found);
+ Assert(!found);
+ hitem->referenced = false;
+ }
+ UnlockReleaseBuffer(buf);
+ }
+
+ /*
+ * now scan the revmap, and determine which of these TIDs are still
+ * referenced
+ */
+ rmAccess = mmRevmapAccessInit(idxRel, MINMAX_PAGES_PER_RANGE);
+ for (heapBlk = 0, numitems = 0;
+ heapBlk < heapNumBlocks;
+ heapBlk += MINMAX_PAGES_PER_RANGE)
+ {
+ ItemPointerData itupptr;
+ DeletableTuple *hitem;
+ bool found;
+
+ mmGetHeapBlockItemptr(rmAccess, heapBlk, &itupptr);
+
+ if (!ItemPointerIsValid(&itupptr))
+ {
+ /*
+ * Ignore revmap entries set to invalid. However, if the heap page
+ * range is complete but not summarized, store its initial page
+ * number in the unsummarized array, for later summarization.
+ */
+ if (heapBlk + MINMAX_PAGES_PER_RANGE < heapNumBlocks)
+ {
+ if (maxnonsumm == 0)
+ {
+ Assert(!nonsumm);
+ maxnonsumm = 8;
+ nonsumm = palloc(sizeof(BlockNumber) * maxnonsumm);
+ }
+ else if (numnonsumm >= maxnonsumm)
+ {
+ maxnonsumm *= 2;
+ nonsumm = repalloc(nonsumm, sizeof(BlockNumber) * maxnonsumm);
+ }
+
+ nonsumm[numnonsumm++] = heapBlk;
+ }
+
+ continue;
+ }
+
+ hitem = (DeletableTuple *) hash_search(tuples,
+ &itupptr,
+ HASH_FIND,
+ &found);
+ if (!found)
+ elog(ERROR, "reverse map references nonexistant index tuple %u/%u",
+ ItemPointerGetBlockNumber(&itupptr),
+ ItemPointerGetOffsetNumber(&itupptr));
+ hitem->referenced = true;
+ numitems++;
+ }
+
+ mmRevmapAccessTerminate(rmAccess);
+
+ /*
+ * Now scan the hash, and keep track of the removable (i.e. not referenced,
+ * not locked) tuples. Allocate this in the hash context, so that it goes
+ * away with it.
+ */
+ deletable = MemoryContextAlloc(hashcxt, sizeof(ItemPointerData) * numitems);
+
+ hash_freeze(tuples);
+ hash_seq_init(&status, tuples);
+ for (;;)
+ {
+ DeletableTuple *hitem;
+
+ hitem = hash_seq_search(&status);
+ if (!hitem)
+ break;
+ if (hitem->referenced)
+ continue;
+ if (!ConditionalLockTuple(idxRel, &hitem->tid, ExclusiveLock))
+ continue;
+
+ /*
+ * By here, we know this tuple is not referenced from the revmap.
+ * Also, since we hold the tuple lock, we know that if there is a
+ * concurrent scan that had obtained the tuple before the reference
+ * got removed, either that scan is not looking at the tuple (because
+ * that would have prevented us from getting the tuple lock) or it is
+ * holding the containing buffer's lock. If the former, then there's
+ * no problem with removing the tuple immediately; if the latter, we
+ * will block below trying to acquire that lock, so by the time we are
+ * unblocked, the concurrent scan will no longer be interested in the
+ * tuple contents anymore. Therefore, this tuple can be removed from
+ * the block.
+ */
+ UnlockTuple(idxRel, &hitem->tid, ExclusiveLock);
+
+ deletable[numdeletable++] = hitem->tid;
+ }
+
+ /*
+ * Now sort the array of deletable index tuples, and walk this array by
+ * pages doing bulk deletion of items on each page; the free space map is
+ * updated for pages from which we delete items.
+ */
+ qsort(deletable, numdeletable, sizeof(ItemPointerData),
+ qsortCompareItemPointers);
+
+ start = 0;
+ for (i = 0; i < numdeletable; i++)
+ {
+ if (i == numdeletable - 1 ||
+ (ItemPointerGetBlockNumber(&deletable[start]) !=
+ ItemPointerGetBlockNumber(&deletable[i + 1])))
+ {
+ OffsetNumber *offnos;
+ int noffs;
+ Buffer buf;
+ Page page;
+ int j;
+ BlockNumber blk;
+
+ vacuum_delay_point();
+
+ blk = ItemPointerGetBlockNumber(&deletable[start]);
+ buf = ReadBufferExtended(idxRel, MAIN_FORKNUM, blk,
+ RBM_NORMAL, strategy);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ page = BufferGetPage(buf);
+
+ noffs = i + 1 - start;
+ offnos = palloc(sizeof(OffsetNumber) * noffs);
+ for (j = 0; j < noffs; j++)
+ offnos[j] = ItemPointerGetOffsetNumber(&deletable[start + j]);
+
+ START_CRIT_SECTION();
+
+ PageIndexDeleteNoCompact(page, offnos, noffs);
+
+ MarkBufferDirty(buf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(idxRel))
+ {
+ xl_minmax_bulkremove xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+ uint8 info = XLOG_MINMAX_BULKREMOVE;
+
+ xlrec.node = idxRel->rd_node;
+ xlrec.block = blk;
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxBulkRemove;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].next = &(rdata[1]);
+
+ /*
+ * The OffsetNumber array is not actually in the buffer, but we
+ * pretend that it is. When XLogInsert stores the whole
+ * buffer, the offset array need not be stored too.
+ */
+ rdata[1].data = (char *) offnos;
+ rdata[1].len = sizeof(OffsetNumber) * noffs;
+ rdata[1].buffer = buf;
+ rdata[1].buffer_std = true;
+ rdata[1].next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ RecordPageWithFreeSpace(idxRel, blk, PageGetFreeSpace(page));
+
+ start = i + 1;
+
+ UnlockReleaseBuffer(buf);
+ pfree(offnos);
+ }
+ }
+
+ /* Finally, ensure the index's FSM is consistent */
+ FreeSpaceMapVacuum(idxRel);
+
+ *nonsummed = nonsumm;
+ *numnonsummed = numnonsumm;
+
+ hash_destroy(tuples);
+ }
+
+ /*
+ * Summarize the given page ranges of the given index.
+ */
+ static void
+ rerun_summarization(Relation idxRel, Relation heapRel, mmRevmapAccess *rmAccess,
+ BlockNumber *nonsummarized, int numnonsummarized)
+ {
+ int i;
+ IndexInfo *indexInfo;
+ MMBuildState *mmstate;
+
+ indexInfo = BuildIndexInfo(idxRel);
+
+ mmstate = initialize_mm_buildstate(heapRel, idxRel, rmAccess, indexInfo);
+
+ for (i = 0; i < numnonsummarized; i++)
+ {
+ BlockNumber blk = nonsummarized[i];
+ ItemPointerData iptr;
+ MMTuple *tup;
+ Size size;
+
+ mmGetHeapBlockItemptr(rmAccess, blk, &iptr);
+
+ mmstate->currRangeStart = blk;
+ mmstate->nextRangeAt = blk + MINMAX_PAGES_PER_RANGE;
+
+ /* it can't have been re-summarized concurrently .. */
+ Assert(!ItemPointerIsValid(&iptr));
+
+ IndexBuildHeapRangeScan(heapRel, idxRel, indexInfo, false,
+ blk, MINMAX_PAGES_PER_RANGE,
+ mmbuildCallback, (void *) mmstate);
+
+ /*
+ * Create the index tuple containing min/max values, and insert it.
+ * Note mmbuildCallback didn't have the chance to actually insert
+ * anything into the index, because the heapscan should have ended
+ * just as it reached the final tuple in the range.
+ */
+ tup = minmax_form_tuple(mmstate->indexDesc, mmstate->diskDesc,
+ mmstate->dtuple, &size);
+ mm_doinsert(mmstate->irel, mmstate->rmAccess,
+ &mmstate->currentInsertBuf, mmstate->currRangeStart, tup,
+ size);
+ mmstate->numtuples++;
+ pfree(tup);
+
+ clear_mm_percol_buildstate(mmstate);
+ }
+
+ if (!BufferIsInvalid(mmstate->currentInsertBuf))
+ {
+ ReleaseBuffer(mmstate->currentInsertBuf);
+ mmstate->currentInsertBuf = InvalidBuffer;
+ }
+ }
+
+ /*
+ * During amvacuumcleanup of a MinMax index, we do three main things:
+ *
+ * 1) remove revmap entries which are no longer interesting (heap has been
+ * truncated).
+ *
+ * 2) remove index tuples that are no longer referenced from the revmap.
+ *
+ * 3) summarize ranges that are currently unsummarized.
+ */
+ Datum
+ mmvacuumcleanup(PG_FUNCTION_ARGS)
+ {
+ IndexVacuumInfo *info = (IndexVacuumInfo *) PG_GETARG_POINTER(0);
+ IndexBulkDeleteResult *stats = (IndexBulkDeleteResult *) PG_GETARG_POINTER(1);
+ mmRevmapAccess *rmAccess;
+ BlockNumber *nonsummarized = NULL;
+ int numnonsummarized;
+ Relation heapRel;
+ BlockNumber heapNumBlocks;
+
+ rmAccess = mmRevmapAccessInit(info->index, MINMAX_PAGES_PER_RANGE);
+
+ heapRel = heap_open(IndexGetRelation(RelationGetRelid(info->index), false),
+ AccessShareLock);
+
+ /*
+ * First: truncate the revmap to the range that covers pages actually in
+ * the heap. We must do this while holding the relation extension lock,
+ * or we risk someone else extending the relation in the meantime.
+ */
+ LockRelationForExtension(heapRel, ExclusiveLock);
+ heapNumBlocks = RelationGetNumberOfBlocks(heapRel);
+ mmRevmapTruncate(rmAccess, heapNumBlocks);
+ UnlockRelationForExtension(heapRel, ExclusiveLock);
+
+ /*
+ * Second: scan the index, removing index tuples that are no longer
+ * referenced from the revmap. While at it, collect the page numbers
+ * of ranges that are not summarized.
+ */
+ remove_deletable_tuples(info->index, heapNumBlocks, info->strategy,
+ &nonsummarized, &numnonsummarized);
+
+ /* Finally, summarize the ranges collected above */
+ if (nonsummarized)
+ {
+ rerun_summarization(info->index, heapRel, rmAccess,
+ nonsummarized, numnonsummarized);
+ pfree(nonsummarized);
+ }
+
+ mmRevmapAccessTerminate(rmAccess);
+ heap_close(heapRel, AccessShareLock);
+
+ PG_RETURN_POINTER(stats);
+ }
+
+ Datum
+ mmcostestimate(PG_FUNCTION_ARGS)
+ {
+ PG_RETURN_INT64(0);
+ }
+
+ Datum
+ mmoptions(PG_FUNCTION_ARGS)
+ {
+ PG_RETURN_INT64(0);
+ }
+
+ /*
+ * Fill the given finfo to enable calls to the operator specified by the given
+ * parameters.
+ */
+ static void
+ get_mm_operator(Oid opfam, Oid idxtypid, Oid keytypid,
+ StrategyNumber strategy, FmgrInfo *finfo)
+ {
+ Oid oprid;
+ HeapTuple oper;
+
+ oprid = get_opfamily_member(opfam, idxtypid, keytypid, strategy);
+ if (!OidIsValid(oprid))
+ elog(ERROR, "missing operator %d(%u,%u) in opfamily %u",
+ strategy, idxtypid, keytypid, opfam);
+
+ oper = SearchSysCache1(OPEROID, oprid);
+ if (!HeapTupleIsValid(oper))
+ elog(ERROR, "cache lookup failed for operator %u", oprid);
+
+ fmgr_info(((Form_pg_operator) GETSTRUCT(oper))->oprcode, finfo);
+ ReleaseSysCache(oper);
+ }
+
+ /*
+ * Invoke the given operator, and return the result as a C boolean.
+ */
+ static inline bool
+ invoke_mm_operator(FmgrInfo *operator, Oid collation, Datum left, Datum right)
+ {
+ Datum result;
+
+ result = FunctionCall2Coll(operator, collation, left, right);
+
+ return DatumGetBool(result);
+ }
+
+ /*
+ * Insert an index tuple into the index relation. The revmap is updated to
+ * mark the range containing the given page as pointing to the inserted entry.
+ *
+ * The buffer, if valid, is checked for free space to insert the new entry;
+ * if there isn't enough, a new buffer is obtained and pinned.
+ *
+ * The buffer is marked dirty.
+ */
+ static void
+ mm_doinsert(Relation idxrel, mmRevmapAccess *rmAccess, Buffer *buffer,
+ BlockNumber heapblkno, MMTuple *tup, Size itemsz)
+ {
+ Page page;
+ BlockNumber blk;
+ OffsetNumber off;
+ bool extended;
+
+ itemsz = MAXALIGN(itemsz);
+
+ extended = mm_getinsertbuffer(idxrel, buffer, itemsz);
+ page = BufferGetPage(*buffer);
+
+ if (PageGetFreeSpace(page) < itemsz)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("index row size %lu exceeds maximum for index \"%s\"",
+ itemsz, RelationGetRelationName(idxrel))));
+
+ blk = BufferGetBlockNumber(*buffer);
+
+ /* the page modification and WAL insertion must form one atomic action */
+ START_CRIT_SECTION();
+
+ off = PageAddItem(page, (Item) tup, itemsz, InvalidOffsetNumber,
+ false, false);
+ if (off == InvalidOffsetNumber)
+ elog(PANIC, "could not add tuple of size %lu to minmax index page",
+ (unsigned long) itemsz);
+
+ MarkBufferDirty(*buffer);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(idxrel))
+ {
+ xl_minmax_insert xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+ uint8 info = XLOG_MINMAX_INSERT;
+
+
+ xlrec.target.node = idxrel->rd_node;
+ ItemPointerSet(&xlrec.target.tid, blk, off);
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxInsert;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = (char *) tup;
+ rdata[1].len = itemsz;
+ rdata[1].buffer = *buffer;
+ rdata[1].buffer_std = true;
+ rdata[1].next = NULL;
+
+ /*
+ * If this is the first tuple in the page, we can reinit the page
+ * instead of restoring the whole thing. Set flag, and hide buffer
+ * references from XLogInsert.
+ */
+ if (extended)
+ {
+ info |= XLOG_MINMAX_INIT_PAGE;
+ rdata[1].buffer = InvalidBuffer;
+ }
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /*
+ * Note we need to keep the lock on the buffer until after the revmap
+ * has been updated. Otherwise, a concurrent scanner could try to obtain
+ * the index tuple from the revmap before we're done writing it.
+ */
+ mmSetHeapBlockItemptr(rmAccess, heapblkno, blk, off);
+
+ LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+ }
+
+ /*
+ * Return an exclusively-locked buffer resulting from extending the relation.
+ */
+ static Buffer
+ mm_getnewbuffer(Relation irel)
+ {
+ Buffer buffer;
+ bool needLock = !RELATION_IS_LOCAL(irel);
+
+ if (needLock)
+ LockRelationForExtension(irel, ExclusiveLock);
+
+ buffer = ReadBuffer(irel, P_NEW);
+ LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+ if (needLock)
+ UnlockRelationForExtension(irel, ExclusiveLock);
+
+ return buffer;
+ }
+
+ /*
+ * Return a pinned and locked buffer which can be used to insert an index item
+ * of size itemsz.
+ *
+ * The passed buffer argument is tested for free space; if it has enough, it
+ * is locked and returned. Otherwise, that buffer (if valid) is unpinned, and
+ * a new buffer is obtained and returned pinned and locked.
+ *
+ * If there's no existing page with enough free space to accommodate the new
+ * item, the relation is extended. The function returns true if this happens,
+ * false otherwise.
+ */
+ static bool
+ mm_getinsertbuffer(Relation irel, Buffer *buffer, Size itemsz)
+ {
+ Buffer buf;
+ bool extended = false;
+
+ buf = *buffer;
+
+ if (BufferIsInvalid(buf) ||
+ (PageGetFreeSpace(BufferGetPage(buf)) < itemsz))
+ {
+ Page page;
+
+ /*
+ * By the time we break out of this loop, buf is a locked and pinned
+ * buffer which has enough free space to satisfy the requirement.
+ */
+ for (;;)
+ {
+ BlockNumber blk;
+ int freespace;
+
+ blk = GetPageWithFreeSpace(irel, itemsz);
+ if (blk == InvalidBlockNumber)
+ {
+ /*
+ * There's not enough free space in any existing index page,
+ * according to the FSM: extend the relation to obtain a shiny
+ * new page.
+ */
+ buf = mm_getnewbuffer(irel);
+ page = BufferGetPage(buf);
+ PageInit(page, BLCKSZ, 0);
+
+ /*
+ * If an entirely new page does not contain enough free space
+ * for the new item, then surely that item is oversized.
+ * Complain loudly.
+ */
+ freespace = PageGetFreeSpace(page);
+ if (freespace < itemsz)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("index row size %lu exceeds maximum %lu for index \"%s\"",
+ (unsigned long) itemsz,
+ (unsigned long) freespace,
+ RelationGetRelationName(irel))));
+ extended = true;
+ break;
+ }
+
+ buf = ReadBuffer(irel, blk);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ page = BufferGetPage(buf);
+ freespace = PageGetFreeSpace(page);
+ if (freespace >= itemsz)
+ break;
+
+ /* Not enough space: register reality and start over */
+ /* XXX register and unlock, or unlock and register?? */
+ RecordPageWithFreeSpace(irel, blk, freespace);
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ }
+
+ /*
+ * The previously used buffer is only pinned at this point (the caller
+ * dropped its lock after the last insertion), so just release the pin.
+ */
+ if (!BufferIsInvalid(*buffer))
+ ReleaseBuffer(*buffer);
+ *buffer = buf;
+ }
+ else
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ return extended;
+ }
*** /dev/null
--- b/src/backend/access/minmax/mmrevmap.c
***************
*** 0 ****
--- 1,340 ----
+ /*
+ * mmrevmap.c
+ * Reverse range map for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmrevmap.c
+ */
+ #include "postgres.h"
+
+ #include "access/minmax.h"
+ #include "access/minmax_internal.h"
+ #include "access/minmax_revmap.h"
+ #include "access/minmax_xlog.h"
+ #include "access/rmgr.h"
+ #include "miscadmin.h"
+ #include "storage/bufmgr.h"
+ #include "storage/lmgr.h"
+ #include "storage/relfilenode.h"
+ #include "storage/smgr.h"
+
+
+ #define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
+ #define IDXITEMS_PER_PAGE (MAPSIZE / SizeOfIptrData)
+
+ #define HEAPBLK_TO_REVMAP_BLK(pagesPerRange, heapBlk) \
+ ((heapBlk / pagesPerRange) / IDXITEMS_PER_PAGE)
+
+ #define HEAPBLK_TO_REVMAP_INDEX(pagesPerRange, heapBlk) \
+ ((heapBlk / pagesPerRange) % IDXITEMS_PER_PAGE)
+
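+ /*
+ * Worked example of the addressing scheme (illustration only; the value of
+ * MINMAX_PAGES_PER_RANGE is defined elsewhere, 128 is just an assumed
+ * figure): with the default BLCKSZ of 8192, MAPSIZE is 8168 bytes and
+ * SizeOfIptrData is 6, so IDXITEMS_PER_PAGE is 1361. Heap block 1000000
+ * then belongs to range number 1000000 / 128 = 7812, which is stored in
+ * revmap block 7812 / 1361 = 5, at slot 7812 % 1361 = 1007.
+ */
+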
+ static void mmRevmapExtend(mmRevmapAccess *rmAccess, BlockNumber blkno);
+
+ /* typedef appears in minmax_revmap.h */
+ struct mmRevmapAccess
+ {
+ Relation idxrel;
+ BlockNumber pagesPerRange;
+ Buffer currBuf;
+ BlockNumber physPagesInRevmap;
+ };
+
+
+ /*
+ * Initialize an access object for a reverse range map, which can be used to
+ * read and update its entries. This must be freed with mmRevmapAccessTerminate
+ * when the caller is done with it.
+ */
+ mmRevmapAccess *
+ mmRevmapAccessInit(Relation idxrel, BlockNumber pagesPerRange)
+ {
+ mmRevmapAccess *rmAccess = palloc(sizeof(mmRevmapAccess));
+
+ rmAccess->idxrel = idxrel;
+ rmAccess->pagesPerRange = pagesPerRange;
+ rmAccess->currBuf = InvalidBuffer;
+ rmAccess->physPagesInRevmap =
+ RelationGetNumberOfBlocksInFork(idxrel, MM_REVMAP_FORKNUM);
+
+ return rmAccess;
+ }
+
+ /*
+ * Release resources associated with a revmap access object.
+ */
+ void
+ mmRevmapAccessTerminate(mmRevmapAccess *rmAccess)
+ {
+ if (rmAccess->currBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currBuf);
+ pfree(rmAccess);
+ }
+
+ /*
+ * In the given revmap page, which is used in a minmax index of pagesPerRange
+ * pages-per-range, set the element corresponding to heap block number heapBlk
+ * to the value (blkno, offno).
+ *
+ * Caller must have obtained the correct page.
+ *
+ * This is used both in regular operation and during WAL replay.
+ */
+ void
+ rm_page_set_iptr(Page page, int pagesPerRange, BlockNumber heapBlk,
+ BlockNumber blkno, OffsetNumber offno)
+ {
+ ItemPointerData *iptr;
+
+ iptr = (ItemPointerData *) PageGetContents(page);
+ iptr += HEAPBLK_TO_REVMAP_INDEX(pagesPerRange, heapBlk);
+
+ ItemPointerSet(iptr, blkno, offno);
+ }
+
+ /*
+ * Set the TID of the index entry corresponding to the range that includes
+ * the given heap page to the given item pointer.
+ *
+ * The map is extended, if necessary.
+ */
+ void
+ mmSetHeapBlockItemptr(mmRevmapAccess *rmAccess, BlockNumber heapBlk,
+ BlockNumber blkno, OffsetNumber offno)
+ {
+ BlockNumber mapBlk;
+ bool extend = false;
+
+ mapBlk = HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk);
+
+ /*
+ * If the revmap is out of space, extend it first.
+ */
+ if (mapBlk > rmAccess->physPagesInRevmap - 1)
+ {
+ mmRevmapExtend(rmAccess, mapBlk);
+ extend = true;
+ }
+
+ /*
+ * Obtain the buffer we need to modify. If we already have the correct
+ * buffer pinned in our access struct, use that; otherwise release the old
+ * one (if valid) and read the one we need.
+ */
+ if (rmAccess->currBuf == InvalidBuffer ||
+ mapBlk != BufferGetBlockNumber(rmAccess->currBuf))
+ {
+ if (rmAccess->currBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currBuf);
+ rmAccess->currBuf = ReadBufferExtended(rmAccess->idxrel,
+ MM_REVMAP_FORKNUM, mapBlk,
+ RBM_NORMAL, NULL);
+ }
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_EXCLUSIVE);
+ START_CRIT_SECTION();
+
+ rm_page_set_iptr(BufferGetPage(rmAccess->currBuf),
+ rmAccess->pagesPerRange,
+ heapBlk,
+ blkno, offno);
+
+ MarkBufferDirty(rmAccess->currBuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rmAccess->idxrel))
+ {
+ xl_minmax_rm_set xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata;
+ uint8 info;
+
+ info = XLOG_MINMAX_REVMAP_SET;
+ if (extend)
+ info |= XLOG_MINMAX_INIT_PAGE;
+
+ xlrec.node = rmAccess->idxrel->rd_node;
+ xlrec.mapBlock = mapBlk;
+ xlrec.pagesPerRange = rmAccess->pagesPerRange;
+ xlrec.heapBlock = heapBlk;
+ ItemPointerSet(&(xlrec.newval), blkno, offno);
+
+ rdata.data = (char *) &xlrec;
+ rdata.len = SizeOfMinmaxRevmapSet;
+ rdata.buffer = rmAccess->currBuf;
+ rdata.buffer_std = false;
+ rdata.next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, &rdata);
+
+ PageSetLSN(BufferGetPage(rmAccess->currBuf), recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_UNLOCK);
+ }
+
+
+ /*
+ * Return the TID of the index entry corresponding to the range that includes
+ * the given heap page. If the TID is valid, the tuple is locked with LockTuple.
+ * It is the caller's responsibility to release that lock.
+ */
+ void
+ mmGetHeapBlockItemptr(mmRevmapAccess *rmAccess, BlockNumber heapBlk,
+ ItemPointerData *out)
+ {
+ BlockNumber mapBlk;
+ ItemPointerData *iptr;
+
+ mapBlk = HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk);
+
+ if (mapBlk >= rmAccess->physPagesInRevmap)
+ {
+ ItemPointerSetInvalid(out);
+ return;
+ }
+
+ if (rmAccess->currBuf == InvalidBuffer ||
+ BufferGetBlockNumber(rmAccess->currBuf) != mapBlk)
+ {
+ if (rmAccess->currBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currBuf);
+
+ rmAccess->currBuf =
+ ReadBufferExtended(rmAccess->idxrel, MM_REVMAP_FORKNUM, mapBlk,
+ RBM_NORMAL, NULL);
+ }
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_SHARE);
+
+ iptr = (ItemPointerData *)
+ PageGetContents(BufferGetPage(rmAccess->currBuf));
+ iptr += HEAPBLK_TO_REVMAP_INDEX(rmAccess->pagesPerRange, heapBlk);
+
+ ItemPointerCopy(iptr, out);
+
+ if (ItemPointerIsValid(iptr))
+ LockTuple(rmAccess->idxrel, iptr, ShareLock);
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_UNLOCK);
+ }
+
+ /*
+ * Create a single-page reverse range map fork for a new minmax index
+ *
+ * NB -- caller is assumed to WAL-log this operation
+ */
+ void
+ mmRevmapCreate(Relation idxrel)
+ {
+ bool needLock;
+ Buffer buf;
+ Page page;
+
+ needLock = !RELATION_IS_LOCAL(idxrel);
+
+ /*
+ * XXX it's unclear that we need this lock, considering that the relation
+ * is likely being created ...
+ */
+ if (needLock)
+ LockRelationForExtension(idxrel, ExclusiveLock);
+
+ START_CRIT_SECTION();
+ RelationOpenSmgr(idxrel);
+ smgrcreate(idxrel->rd_smgr, MM_REVMAP_FORKNUM, false);
+ buf = ReadBufferExtended(idxrel, MM_REVMAP_FORKNUM, P_NEW, RBM_NORMAL,
+ NULL);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ page = BufferGetPage(buf);
+ PageInit(page, BLCKSZ, 0);
+ MarkBufferDirty(buf);
+
+ UnlockReleaseBuffer(buf);
+ END_CRIT_SECTION();
+
+ if (needLock)
+ UnlockRelationForExtension(idxrel, ExclusiveLock);
+ }
+
+ /*
+ * Extend the reverse range map to cover the given block number.
+ *
+ * NB -- caller is responsible for ensuring this action is properly WAL-logged.
+ */
+ static void
+ mmRevmapExtend(mmRevmapAccess *rmAccess, BlockNumber blkno)
+ {
+ char page[BLCKSZ];
+
+ MemSet(page, 0, sizeof(page));
+ PageInit(page, BLCKSZ, 0);
+
+ LockRelationForExtension(rmAccess->idxrel, ExclusiveLock);
+
+ /*
+ * First, refresh our idea of the current size; it might well have grown
+ * to what we need since we last checked.
+ */
+ rmAccess->physPagesInRevmap =
+ RelationGetNumberOfBlocksInFork(rmAccess->idxrel,
+ MM_REVMAP_FORKNUM);
+
+ /*
+ * Now extend it one page at a time. This might seem a bit inefficient,
+ * but normally we'd be extending by a single page anyway.
+ */
+ while (blkno > rmAccess->physPagesInRevmap - 1)
+ {
+ smgrextend(rmAccess->idxrel->rd_smgr, MM_REVMAP_FORKNUM,
+ rmAccess->physPagesInRevmap, page, false);
+ rmAccess->physPagesInRevmap++;
+ }
+
+ Assert(rmAccess->physPagesInRevmap ==
+ RelationGetNumberOfBlocksInFork(rmAccess->idxrel,
+ MM_REVMAP_FORKNUM));
+
+ UnlockRelationForExtension(rmAccess->idxrel, ExclusiveLock);
+ }
+
+ /*
+ * Truncate a revmap to the size needed for a table of the given number of
+ * blocks. This includes removing pages beyond the last one needed, and also
+ * zeroing out the excess entries in the last page.
+ *
+ * The caller should hold a lock that prevents the table from growing in
+ * the meantime.
+ */
+ void
+ mmRevmapTruncate(mmRevmapAccess *rmAccess, BlockNumber heapNumBlocks)
+ {
+ BlockNumber rmBlks;
+ char *data;
+ Page page;
+ Buffer buffer;
+
+ /* Remove blocks at the end */
+ rmBlks = HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapNumBlocks);
+
+ RelationOpenSmgr(rmAccess->idxrel);
+ smgrtruncate(rmAccess->idxrel->rd_smgr, MM_REVMAP_FORKNUM, rmBlks + 1);
+
+ /* zero out the remaining items in the last page */
+ buffer = ReadBufferExtended(rmAccess->idxrel,
+ MM_REVMAP_FORKNUM, rmBlks,
+ RBM_NORMAL, NULL);
+ LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+ page = PageGetContents(BufferGetPage(buffer));
+ data = page + sizeof(ItemPointerData) *
+ HEAPBLK_TO_REVMAP_INDEX(rmAccess->pagesPerRange, heapNumBlocks + 1);
+
+ memset(data, 0, page + MAPSIZE - data);
+ MarkBufferDirty(buffer);
+
+ UnlockReleaseBuffer(buffer);
+ }
*** /dev/null
--- b/src/backend/access/minmax/mmtuple.c
***************
*** 0 ****
--- 1,364 ----
+ /*
+ * MinMax-specific tuples
+ * Method implementations for tuples in minmax indexes.
+ *
+ * The intended interface is that code outside this file only deals with
+ * DeformedMMTuples, and converts to and from the on-disk representation
+ * using the functions in this file.
+ *
+ * NOTES
+ *
+ * A minmax tuple is similar to a heap tuple, with a few key differences. The
+ * first interesting difference is that the tuple header is much simpler, only
+ * containing its total length and a small area for flags. Also, the stored
+ * data does not match the tuple descriptor exactly: for each attribute in the
+ * descriptor, the index tuple carries two values, one for the minimum value in
+ * that column and one for the maximum.
+ *
+ * Also, for each column there are two null bits: one (hasnulls) stores whether
+ * any tuple within the page range has that column set to null; the other
+ * (allnulls) stores whether the column values are all null. If allnulls is
+ * true, then the tuple data area does not contain min/max values for that
+ * column at all; whereas it does if only hasnulls is set. Note we always
+ * store a double-length null bitmask; for typical indexes of four columns or
+ * fewer, it takes a single byte anyway. It doesn't seem worth trying to
+ * optimize this further.
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmtuple.c
+ */
+ #include "postgres.h"
+
+ #include "access/htup_details.h"
+ #include "access/minmax_tuple.h"
+ #include "access/tupdesc.h"
+ #include "access/tupmacs.h"
+
+
+ static inline void mm_deconstruct_tuple(char *tp, bits8 *nullbits, bool nulls,
+ int natts, Form_pg_attribute *att,
+ Datum *values, bool *allnulls, bool *hasnulls);
+
+
+ /*
+ * Generate an internal-style tuple descriptor to pass to minmax_form_tuple.
+ * These have no use outside this module.
+ *
+ * The argument is a minmax index's regular tuple descriptor.
+ */
+ TupleDesc
+ minmax_get_descr(TupleDesc tupdesc)
+ {
+ TupleDesc diskDesc;
+ int i,
+ j;
+
+ diskDesc = CreateTemplateTupleDesc(tupdesc->natts * 2, false);
+
+ for (i = 0, j = 1; i < tupdesc->natts; i++)
+ {
+ /* min */
+ TupleDescInitEntry(diskDesc,
+ j++,
+ NULL,
+ tupdesc->attrs[i]->atttypid,
+ tupdesc->attrs[i]->atttypmod,
+ 0);
+ /* max */
+ TupleDescInitEntry(diskDesc,
+ j++,
+ NULL,
+ tupdesc->attrs[i]->atttypid,
+ tupdesc->attrs[i]->atttypmod,
+ 0);
+ }
+
+ return diskDesc;
+ }
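+
+ /*
+ * For illustration (an assumed example, not a requirement of the code): a
+ * minmax index over two int4 columns yields an internal descriptor of four
+ * int4 attributes, laid out as (col1_min, col1_max, col2_min, col2_max).
+ * minmax_form_tuple relies on that ordering when it fills values[keyno * 2]
+ * and values[keyno * 2 + 1].
+ */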
+
+ /*
+ * Generate a new on-disk tuple to be inserted in a minmax index.
+ *
+ * The first tuple descriptor passed corresponds to the catalogued index info,
+ * that is, it is the index's descriptor; the second descriptor must be
+ * obtained by calling minmax_get_descr() on that descriptor.
+ *
+ * (The reason for this slightly grotty arrangement is that we use heap tuple
+ * functions to implement packing of a tuple into the on-disk format.)
+ */
+ MMTuple *
+ minmax_form_tuple(TupleDesc idxDsc, TupleDesc diskDsc, DeformedMMTuple *tuple,
+ Size *size)
+ {
+ Datum values[diskDsc->natts];
+ bool nulls[diskDsc->natts];
+ bool anynulls = false;
+ MMTuple *rettuple;
+ int keyno;
+ uint16 phony_infomask;
+ bits8 phony_nullbitmap[BITMAPLEN(diskDsc->natts)];
+ Size len,
+ hoff,
+ data_len;
+
+ /*
+ * Set up the values/nulls arrays for heap_fill_tuple
+ */
+ MemSet(nulls, 0, sizeof(nulls));
+ for (keyno = 0; keyno < idxDsc->natts; keyno++)
+ {
+ AttrNumber idxattno = keyno * 2;
+
+ if (tuple->values[keyno].allnulls)
+ {
+ nulls[idxattno] = true;
+ nulls[idxattno + 1] = true;
+ anynulls = true;
+ continue;
+ }
+
+ if (tuple->values[keyno].hasnulls)
+ anynulls = true;
+
+ values[idxattno] = tuple->values[keyno].min;
+ values[idxattno + 1] = tuple->values[keyno].max;
+ }
+
+ /* compute total space needed */
+ len = SizeOfMinMaxTuple;
+ if (anynulls)
+ {
+ /*
+ * We need a double-length bitmap on an on-disk minmax index tuple;
+ * the first half stores the "allnulls" bits, the second stores
+ * "hasnulls".
+ */
+ len += BITMAPLEN(idxDsc->natts * 2);
+ }
+
+ /*
+ * TODO: we can probably do away with alignment here, and save some
+ * precious disk space. When there's no bitmap we can save 6 bytes. Maybe
+ * we can use the first col's type alignment instead of maxalign.
+ */
+ len = hoff = MAXALIGN(len);
+
+ data_len = heap_compute_data_size(diskDsc, values, nulls);
+
+ len += data_len;
+
+ rettuple = palloc0(len);
+ rettuple->mt_info = hoff;
+ Assert((rettuple->mt_info & MMIDX_OFFSET_MASK) == hoff);
+
+ /*
+ * The infomask and null bitmap as computed by heap_fill_tuple are useless
+ * to us. However, that function will not accept a null infomask; and we
+ * need to pass a valid null bitmap so that it will correctly skip
+ * outputting null attributes in the data area.
+ */
+ heap_fill_tuple(diskDsc,
+ values,
+ nulls,
+ (char *) rettuple + hoff,
+ data_len,
+ &phony_infomask,
+ phony_nullbitmap);
+
+ /*
+ * Now fill in the real null bitmasks. allnulls first.
+ */
+ if (anynulls)
+ {
+ bits8 *bitP;
+ int bitmask;
+
+ rettuple->mt_info |= MMIDX_NULLS_MASK;
+
+ bitP = ((bits8 *) ((char *) rettuple + SizeOfMinMaxTuple)) - 1;
+ bitmask = HIGHBIT;
+ for (keyno = 0; keyno < idxDsc->natts; keyno++)
+ {
+ if (bitmask != HIGHBIT)
+ bitmask <<= 1;
+ else
+ {
+ bitP += 1;
+ *bitP = 0x0;
+ bitmask = 1;
+ }
+
+ if (tuple->values[keyno].allnulls)
+ continue;
+
+ *bitP |= bitmask;
+ }
+ /* hasnulls bits follow */
+ for (keyno = 0; keyno < idxDsc->natts; keyno++)
+ {
+ if (bitmask != HIGHBIT)
+ bitmask <<= 1;
+ else
+ {
+ bitP += 1;
+ *bitP = 0x0;
+ bitmask = 1;
+ }
+
+ if (tuple->values[keyno].hasnulls)
+ continue;
+
+ *bitP |= bitmask;
+ }
+ }
+
+ *size = len;
+ return rettuple;
+ }
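+
+ /*
+ * Worked example of the bitmask convention used above (illustration only):
+ * for a three-column index where column 1 is all-null, column 2 has no nulls
+ * and column 3 has some nulls, the allnulls half of the bitmap reads 0,1,1,
+ * and in the hasnulls half column 2's bit is set while column 3's is clear
+ * (a set bit means "not null", as in heap tuples). The data area then
+ * contains min/max values for columns 2 and 3 only.
+ */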
+
+ /*
+ * Free a tuple created by minmax_form_tuple
+ */
+ void
+ minmax_free_tuple(MMTuple *tuple)
+ {
+ pfree(tuple);
+ }
+
+ /*
+ * Convert a MMTuple back to a DeformedMMTuple. This is the reverse of
+ * minmax_form_tuple.
+ *
+ * Note we don't need the "on disk tupdesc" here; we rely on our own routine to
+ * deconstruct the tuple from the on-disk format.
+ *
+ * XXX some callers might need copies of each datum; if so we need
+ * to apply datumCopy inside the loop. We probably also need a
+ * minmax_free_dtuple() function.
+ */
+ DeformedMMTuple *
+ minmax_deform_tuple(TupleDesc tupdesc, MMTuple *tuple)
+ {
+ DeformedMMTuple *dtup;
+ Datum values[tupdesc->natts * 2];
+ bool allnulls[tupdesc->natts];
+ bool hasnulls[tupdesc->natts];
+ char *tp;
+ bits8 *nullbits = NULL;
+ int keyno;
+
+ dtup = palloc(offsetof(DeformedMMTuple, values) +
+ sizeof(MMValues) * tupdesc->natts);
+
+ tp = (char *) tuple + MMTupleDataOffset(tuple);
+
+ if (MMTupleHasNulls(tuple))
+ nullbits = (bits8 *) ((char *) tuple + SizeOfMinMaxTuple);
+ mm_deconstruct_tuple(tp, nullbits,
+ MMTupleHasNulls(tuple),
+ tupdesc->natts, tupdesc->attrs, values,
+ allnulls, hasnulls);
+
+ for (keyno = 0; keyno < tupdesc->natts; keyno++)
+ {
+ if (allnulls[keyno])
+ {
+ dtup->values[keyno].allnulls = true;
+ continue;
+ }
+
+ /* XXX optional datumCopy() */
+ dtup->values[keyno].min = values[keyno * 2];
+ dtup->values[keyno].max = values[keyno * 2 + 1];
+ dtup->values[keyno].hasnulls = hasnulls[keyno];
+ dtup->values[keyno].allnulls = false;
+ }
+
+ return dtup;
+ }
+
+ /*
+ * mm_deconstruct_tuple
+ * Guts of attribute extraction from an on-disk minmax tuple.
+ *
+ * Its arguments are:
+ * tp pointer to the tuple data area
+ * nullbits pointer to the tuple nulls bitmask
+ * nulls "has nulls" bit in tuple infomask
+ * natts number of array members in att
+ * att the tuple's TupleDesc Form_pg_attribute array
+ * values output values, size 2 * natts (alternates min and max)
+ * allnulls output "allnulls", size natts
+ * hasnulls output "hasnulls", size natts
+ */
+ static inline void
+ mm_deconstruct_tuple(char *tp, bits8 *nullbits, bool nulls,
+ int natts, Form_pg_attribute *att,
+ Datum *values, bool *allnulls, bool *hasnulls)
+ {
+ int attnum;
+ long off = 0;
+
+ /*
+ * First iterate to natts to obtain both null flags for each attribute.
+ */
+ for (attnum = 0; attnum < natts; attnum++)
+ {
+ /*
+ * the "all nulls" bit means that all values in the page range for
+ * this column are nulls. Therefore there are no values in the tuple
+ * data area.
+ */
+ if (nulls && att_isnull(attnum, nullbits))
+ {
+ values[attnum] = (Datum) 0;
+ allnulls[attnum] = true;
+ hasnulls[attnum] = true; /* XXX ? */
+ continue;
+ }
+
+ allnulls[attnum] = false;
+
+ /*
+ * the "has nulls" bit means that some tuples have nulls, but others
+ * have not-null values. So the tuple data does have data for this
+ * column.
+ *
+ * The hasnulls bits follow the allnulls bits in the same bitmask.
+ */
+ hasnulls[attnum] = nulls && att_isnull(natts + attnum, nullbits);
+ }
+
+ /*
+ * Then we iterate to natts * 2 to obtain each attribute's min and max
+ * values. Note that since we reuse attribute entries (first for the
+ * minimum value of the corresponding column, then for max), we cannot
+ * cache offsets here.
+ */
+ for (attnum = 0; attnum < natts * 2; attnum++)
+ {
+ int true_attnum = attnum / 2;
+ Form_pg_attribute thisatt = att[true_attnum];
+
+ if (allnulls[true_attnum])
+ continue;
+
+ if (thisatt->attlen == -1)
+ {
+ off = att_align_pointer(off, thisatt->attalign, -1,
+ tp + off);
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ off = att_align_nominal(off, thisatt->attalign);
+ }
+
+ values[attnum] = fetchatt(thisatt, tp + off);
+
+ off = att_addlength_pointer(off, thisatt->attlen, tp + off);
+ }
+ }
*** /dev/null
--- b/src/backend/access/minmax/mmxlog.c
***************
*** 0 ****
--- 1,213 ----
+ /*
+ * mmxlog.c
+ * XLog replay routines for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmxlog.c
+ */
+ #include "postgres.h"
+
+ #include "access/minmax.h"
+ #include "access/minmax_internal.h"
+ #include "access/minmax_tuple.h"
+ #include "access/minmax_xlog.h"
+ #include "access/xlogutils.h"
+ #include "storage/freespace.h"
+
+
+ /*
+ * xlog replay routines
+ */
+ static void
+ minmax_xlog_createidx(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_createidx *xlrec = (xl_minmax_createidx *) XLogRecGetData(record);
+ Buffer buf;
+ Page page;
+
+ /* Backup blocks are not used in create_index records */
+ Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+
+ /* create the index's metapage */
+ buf = XLogReadBuffer(xlrec->node, MINMAX_METAPAGE_BLKNO, true);
+ Assert(BufferIsValid(buf));
+ page = (Page) BufferGetPage(buf);
+ mm_init_metapage(buf);
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ UnlockReleaseBuffer(buf);
+
+ /* also initialize its revmap fork */
+ buf = XLogReadBufferExtended(xlrec->node, MM_REVMAP_FORKNUM, 0, RBM_ZERO);
+ Assert(BufferIsValid(buf));
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ page = (Page) BufferGetPage(buf);
+ PageInit(page, BLCKSZ, 0);
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ UnlockReleaseBuffer(buf);
+ }
+
+ static void
+ minmax_xlog_insert(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_insert *xlrec = (xl_minmax_insert *) XLogRecGetData(record);
+ BlockNumber blkno;
+ Buffer buffer;
+ Page page;
+ OffsetNumber offnum;
+ int tuplen;
+ MMTuple *mmtuple;
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ blkno = ItemPointerGetBlockNumber(&(xlrec->target.tid));
+ if (record->xl_info & XLOG_MINMAX_INIT_PAGE)
+ {
+ buffer = XLogReadBuffer(xlrec->target.node, blkno, true);
+ Assert(BufferIsValid(buffer));
+ page = (Page) BufferGetPage(buffer);
+
+ PageInit(page, BufferGetPageSize(buffer), 0); /* XXX size correct?? */
+ }
+ else
+ {
+ buffer = XLogReadBuffer(xlrec->target.node, blkno, false);
+ if (!BufferIsValid(buffer))
+ return;
+ page = (Page) BufferGetPage(buffer);
+
+ if (lsn <= PageGetLSN(page)) /* changes are applied */
+ {
+ UnlockReleaseBuffer(buffer);
+ return;
+ }
+ }
+ offnum = ItemPointerGetOffsetNumber(&(xlrec->target.tid));
+ if (PageGetMaxOffsetNumber(page) + 1 < offnum)
+ elog(PANIC, "minmax_xlog_insert: invalid max offset number");
+
+ tuplen = record->xl_len - SizeOfMinmaxInsert;
+ mmtuple = (MMTuple *) ((char *) xlrec + SizeOfMinmaxInsert);
+
+ offnum = PageAddItem(page, (Item) mmtuple, tuplen, offnum, true, true);
+ if (offnum == InvalidOffsetNumber)
+ elog(PANIC, "minmax_xlog_insert: failed to add tuple");
+
+ PageSetLSN(page, lsn);
+
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+
+ /* XXX no FSM updates here ... */
+ }
+
+ static void
+ minmax_xlog_bulkremove(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_bulkremove *xlrec = (xl_minmax_bulkremove *) XLogRecGetData(record);
+ Buffer buffer;
+ Page page;
+ OffsetNumber *offnos;
+ int noffs;
+ Size freespace;
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ buffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
+ if (!BufferIsValid(buffer))
+ return;
+ page = (Page) BufferGetPage(buffer);
+
+ if (lsn <= PageGetLSN(page)) /* changes are applied */
+ {
+ UnlockReleaseBuffer(buffer);
+ return;
+ }
+
+ offnos = (OffsetNumber *) ((char *) xlrec + SizeOfMinmaxBulkRemove);
+ noffs = (record->xl_len - SizeOfMinmaxBulkRemove) / sizeof(OffsetNumber);
+
+ PageIndexDeleteNoCompact(page, offnos, noffs);
+ freespace = PageGetFreeSpace(page);
+
+ PageSetLSN(page, lsn);
+
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+
+ /* update FSM as well */
+ XLogRecordPageWithFreeSpace(xlrec->node, xlrec->block, freespace);
+ }
+
+ static void
+ minmax_xlog_revmap_set(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_rm_set *xlrec = (xl_minmax_rm_set *) XLogRecGetData(record);
+ bool init;
+ BlockNumber blkno;
+ Buffer buffer;
+ Page page;
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ init = (record->xl_info & XLOG_MINMAX_INIT_PAGE) != 0;
+ buffer = XLogReadBufferExtended(xlrec->node,
+ MM_REVMAP_FORKNUM, xlrec->mapBlock,
+ init ? RBM_ZERO : RBM_NORMAL);
+ Assert(BufferIsValid(buffer));
+ LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+ page = BufferGetPage(buffer);
+ if (init)
+ PageInit(page, BufferGetPageSize(buffer), 0);
+
+ rm_page_set_iptr(page, xlrec->pagesPerRange, xlrec->heapBlock,
+ ItemPointerGetBlockNumber(&(xlrec->newval)),
+ ItemPointerGetOffsetNumber(&(xlrec->newval)));
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+ }
+
+ void
+ minmax_redo(XLogRecPtr lsn, XLogRecord *record)
+ {
+ uint8 info = record->xl_info & ~XLR_INFO_MASK;
+
+ switch (info & XLOG_MINMAX_OPMASK)
+ {
+ case XLOG_MINMAX_CREATE_INDEX:
+ minmax_xlog_createidx(lsn, record);
+ break;
+ case XLOG_MINMAX_INSERT:
+ minmax_xlog_insert(lsn, record);
+ break;
+ case XLOG_MINMAX_BULKREMOVE:
+ minmax_xlog_bulkremove(lsn, record);
+ break;
+ case XLOG_MINMAX_REVMAP_SET:
+ minmax_xlog_revmap_set(lsn, record);
+ break;
+ default:
+ elog(PANIC, "minmax_redo: unknown op code %u", info);
+ }
+ }
*** a/src/backend/access/rmgrdesc/Makefile
--- b/src/backend/access/rmgrdesc/Makefile
***************
*** 9,15 **** top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = clogdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
! mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o smgrdesc.o spgdesc.o \
standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
include $(top_srcdir)/src/backend/common.mk
--- 9,16 ----
include $(top_builddir)/src/Makefile.global
OBJS = clogdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
! minmaxdesc.o mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o \
! smgrdesc.o spgdesc.o \
standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
include $(top_srcdir)/src/backend/common.mk
*** /dev/null
--- b/src/backend/access/rmgrdesc/minmaxdesc.c
***************
*** 0 ****
--- 1,74 ----
+ /*-------------------------------------------------------------------------
+ *
+ * minmaxdesc.c
+ * rmgr descriptor routines for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/rmgrdesc/minmaxdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+ #include "postgres.h"
+
+ #include "access/minmax_xlog.h"
+
+ static void
+ out_target(StringInfo buf, xl_minmax_tid *target)
+ {
+ appendStringInfo(buf, "rel %u/%u/%u; tid %u/%u",
+ target->node.spcNode, target->node.dbNode, target->node.relNode,
+ ItemPointerGetBlockNumber(&(target->tid)),
+ ItemPointerGetOffsetNumber(&(target->tid)));
+ }
+
+ void
+ minmax_desc(StringInfo buf, uint8 xl_info, char *rec)
+ {
+ uint8 info = xl_info & ~XLR_INFO_MASK;
+
+ info &= XLOG_MINMAX_OPMASK;
+ if (info == XLOG_MINMAX_CREATE_INDEX)
+ {
+ xl_minmax_createidx *xlrec = (xl_minmax_createidx *) rec;
+
+ appendStringInfo(buf, "create index: %u/%u/%u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode);
+ }
+ else if (info == XLOG_MINMAX_INSERT)
+ {
+ xl_minmax_insert *xlrec = (xl_minmax_insert *) rec;
+
+ if (xl_info & XLOG_MINMAX_INIT_PAGE)
+ appendStringInfo(buf, "insert(init): ");
+ else
+ appendStringInfo(buf, "insert: ");
+ out_target(buf, &(xlrec->target));
+ }
+ else if (info == XLOG_MINMAX_BULKREMOVE)
+ {
+ xl_minmax_bulkremove *xlrec = (xl_minmax_bulkremove *) rec;
+
+ appendStringInfo(buf, "bulkremove: rel %u/%u/%u blk %u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode, xlrec->block);
+ }
+ else if (info == XLOG_MINMAX_REVMAP_SET)
+ {
+ xl_minmax_rm_set *xlrec = (xl_minmax_rm_set *) rec;
+
+ appendStringInfo(buf, "revmap set: rel %u/%u/%u mapblk %u pagesPerRange %u item %u value %u/%u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode, xlrec->mapBlock,
+ xlrec->pagesPerRange, xlrec->heapBlock,
+ ItemPointerGetBlockNumber(&(xlrec->newval)),
+ ItemPointerGetOffsetNumber(&(xlrec->newval)));
+ }
+ else
+ appendStringInfo(buf, "UNKNOWN");
+ }
+
*** a/src/backend/access/transam/rmgr.c
--- b/src/backend/access/transam/rmgr.c
***************
*** 12,17 ****
--- 12,18 ----
#include "access/gist_private.h"
#include "access/hash.h"
#include "access/heapam_xlog.h"
+ #include "access/minmax_xlog.h"
#include "access/multixact.h"
#include "access/nbtree.h"
#include "access/spgist.h"
*** a/src/backend/catalog/index.c
--- b/src/backend/catalog/index.c
***************
*** 2116,2121 **** IndexBuildHeapScan(Relation heapRelation,
--- 2116,2142 ----
IndexBuildCallback callback,
void *callback_state)
{
+ return IndexBuildHeapRangeScan(heapRelation, indexRelation,
+ indexInfo, allow_sync,
+ 0, InvalidBlockNumber,
+ callback, callback_state);
+ }
+
+ /*
+ * As above, except that instead of scanning the complete heap, only the given
+ * number of blocks starting at start_blockno is scanned. Scanning to the end
+ * of the relation can be signalled by passing InvalidBlockNumber as the number
+ * of blocks.
+ */
+ double
+ IndexBuildHeapRangeScan(Relation heapRelation,
+ Relation indexRelation,
+ IndexInfo *indexInfo,
+ bool allow_sync,
+ BlockNumber start_blockno,
+ BlockNumber numblocks,
+ IndexBuildCallback callback,
+ void *callback_state)
+ {
bool is_system_catalog;
bool checking_uniqueness;
HeapScanDesc scan;
***************
*** 2186,2191 **** IndexBuildHeapScan(Relation heapRelation,
--- 2207,2215 ----
true, /* buffer access strategy OK */
allow_sync); /* syncscan OK? */
+ /* set our endpoints */
+ heap_setscanlimits(scan, start_blockno, numblocks);
+
reltuples = 0;
/*
*** a/src/backend/storage/page/bufpage.c
--- b/src/backend/storage/page/bufpage.c
***************
*** 899,904 **** PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
--- 899,1073 ----
pfree(itemidbase);
}
+ /*
+ * PageIndexDeleteNoCompact
+ * Delete the given items from an index page, and defragment the resulting
+ * free space, but do not compact the item pointers array.
+ *
+ * Unused items at the end of the array are removed.
+ *
+ * This is used for index AMs that require that existing TIDs of live tuples
+ * remain unchanged.
+ */
+ void
+ PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos, int nitems)
+ {
+ PageHeader phdr = (PageHeader) page;
+ LocationIndex pd_lower = phdr->pd_lower;
+ LocationIndex pd_upper = phdr->pd_upper;
+ LocationIndex pd_special = phdr->pd_special;
+ int nline,
+ nstorage;
+ OffsetNumber offnum;
+ int nextitm;
+
+ /*
+ * As with PageRepairFragmentation, paranoia seems justified.
+ */
+ if (pd_lower < SizeOfPageHeaderData ||
+ pd_lower > pd_upper ||
+ pd_upper > pd_special ||
+ pd_special > BLCKSZ ||
+ pd_special != MAXALIGN(pd_special))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u",
+ pd_lower, pd_upper, pd_special)));
+
+ /*
+ * Scan the item pointer array and build a list of just the ones we are
+ * going to keep. Notice we do not modify the page just yet, since we are
+ * still validity-checking.
+ */
+ nline = PageGetMaxOffsetNumber(page);
+ nstorage = 0;
+ nextitm = 0;
+ for (offnum = FirstOffsetNumber; offnum <= nline; offnum = OffsetNumberNext(offnum))
+ {
+ ItemId lp;
+ ItemLength itemlen;
+ ItemOffset offset;
+
+ lp = PageGetItemId(page, offnum);
+
+ itemlen = ItemIdGetLength(lp);
+ offset = ItemIdGetOffset(lp);
+
+ if (ItemIdIsUsed(lp))
+ {
+ if (offset < pd_upper ||
+ (offset + itemlen) > pd_special ||
+ offset != MAXALIGN(offset))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted item pointer: offset = %u, length = %u",
+ offset, (unsigned int) itemlen)));
+
+ if (nextitm < nitems && offnum == itemnos[nextitm])
+ {
+ ItemIdSetUnused(lp);
+ nextitm++;
+ }
+ else if (ItemIdHasStorage(lp))
+ nstorage++;
+ }
+ }
+
+ /* this will catch invalid or out-of-order itemnos[] */
+ if (nextitm != nitems)
+ elog(ERROR, "incorrect index offsets supplied");
+
+ if (nstorage == 0)
+ {
+ /* Page is completely empty, so just reset it quickly */
+ phdr->pd_lower = SizeOfPageHeaderData;
+ phdr->pd_upper = pd_special;
+ }
+ else
+ {
+ /* There are live items: need to compact the page the hard way */
+ char pageCopy[BLCKSZ];
+ itemIdSort itemidbase,
+ itemidptr;
+ int lastused;
+ int i;
+ Size totallen;
+ Offset upper;
+
+ /*
+ * First scan the page taking note of each item that we need to
+ * preserve. This includes both live items (those that contain data)
+ * and interspersed unused ones. It's critical to preserve these unused
+ * items, because otherwise the offset numbers for later live items
+ * would change, which is not acceptable.
+ */
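+ /*
+ * For example (illustration only): if a page holds five items and items 2
+ * and 4 are being deleted, items 1, 3 and 5 must keep offset numbers 1, 3
+ * and 5, so the line pointer array retains all five entries, with entries
+ * 2 and 4 marked unused.
+ */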
+ itemidbase = (itemIdSort) palloc(sizeof(itemIdSortData) * nline);
+ itemidptr = itemidbase;
+ totallen = 0;
+ for (i = 0; i < nline; i++, itemidptr++)
+ {
+ ItemId lp;
+
+ itemidptr->offsetindex = i;
+
+ lp = PageGetItemId(page, i + 1);
+ if (ItemIdHasStorage(lp))
+ {
+ itemidptr->itemoff = ItemIdGetOffset(lp);
+ itemidptr->alignedlen = MAXALIGN(ItemIdGetLength(lp));
+ totallen += itemidptr->alignedlen;
+ }
+ else
+ {
+ itemidptr->itemoff = 0;
+ itemidptr->alignedlen = 0;
+ }
+ }
+
+ if (totallen > (Size) (pd_special - pd_lower))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted item lengths: total %u, available space %u",
+ (unsigned int) totallen, pd_special - pd_lower)));
+
+ /*
+ * Defragment the data areas of each tuple. Note that since offset
+ * numbers must remain unchanged in these pages, we can't do a qsort()
+ * of the itemIdSort elements here; and because the elements are not
+ * sorted by offset, we can't use memmove() to defragment the occupied
+ * data space. So we first create a temporary copy of the original
+ * data page, from which we memcpy() each item's data onto the final
+ * page.
+ */
+ memcpy(pageCopy, page, BLCKSZ);
+ upper = pd_special;
+ PageClearHasFreeLinePointers(page);
+ for (i = 0, itemidptr = itemidbase; i < nline; i++, itemidptr++)
+ {
+ ItemId lp;
+
+ if (itemidptr->alignedlen == 0)
+ {
+ PageSetHasFreeLinePointers(page);
+ continue;
+ }
+ lp = PageGetItemId(page, itemidptr->offsetindex + 1);
+ upper -= itemidptr->alignedlen;
+ memcpy((char *) page + upper,
+ pageCopy + itemidptr->itemoff,
+ itemidptr->alignedlen);
+ lp->lp_off = upper;
+
+ lastused = i + 1;
+ }
+
+ /* Set the new page limits */
+ phdr->pd_upper = upper;
+ phdr->pd_lower = SizeOfPageHeaderData + lastused * sizeof(ItemIdData);
+
+ pfree(itemidbase);
+ }
+ }
/*
* Set checksum for a page in shared buffers.
*** a/src/include/access/heapam.h
--- b/src/include/access/heapam.h
***************
*** 112,117 **** extern HeapScanDesc heap_beginscan_strat(Relation relation, Snapshot snapshot,
--- 112,119 ----
bool allow_strat, bool allow_sync);
extern HeapScanDesc heap_beginscan_bm(Relation relation, Snapshot snapshot,
int nkeys, ScanKey key);
+ extern void heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk,
+ BlockNumber endBlk);
extern void heap_rescan(HeapScanDesc scan, ScanKey key);
extern void heap_endscan(HeapScanDesc scan);
extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
*** /dev/null
--- b/src/include/access/minmax.h
***************
*** 0 ****
--- 1,35 ----
+ /*
+ * AM-callable functions for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax.h
+ */
+ #ifndef MINMAX_H
+ #define MINMAX_H
+
+ #include "fmgr.h"
+
+
+ /*
+ * prototypes for functions in minmax.c (external entry points for minmax)
+ */
+ extern Datum mmbuild(PG_FUNCTION_ARGS);
+ extern Datum mmbuildempty(PG_FUNCTION_ARGS);
+ extern Datum mminsert(PG_FUNCTION_ARGS);
+ extern Datum mmbeginscan(PG_FUNCTION_ARGS);
+ extern Datum mmgettuple(PG_FUNCTION_ARGS);
+ extern Datum mmgetbitmap(PG_FUNCTION_ARGS);
+ extern Datum mmrescan(PG_FUNCTION_ARGS);
+ extern Datum mmendscan(PG_FUNCTION_ARGS);
+ extern Datum mmmarkpos(PG_FUNCTION_ARGS);
+ extern Datum mmrestrpos(PG_FUNCTION_ARGS);
+ extern Datum mmbulkdelete(PG_FUNCTION_ARGS);
+ extern Datum mmvacuumcleanup(PG_FUNCTION_ARGS);
+ extern Datum mmcanreturn(PG_FUNCTION_ARGS);
+ extern Datum mmcostestimate(PG_FUNCTION_ARGS);
+ extern Datum mmoptions(PG_FUNCTION_ARGS);
+
+ #endif /* MINMAX_H */
*** /dev/null
--- b/src/include/access/minmax_internal.h
***************
*** 0 ****
--- 1,39 ----
+ /*
+ * minmax_internal.h
+ * internal declarations for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_internal.h
+ */
+ #ifndef MINMAX_INTERNAL_H
+ #define MINMAX_INTERNAL_H
+
+ #include "storage/buf.h"
+ #include "storage/bufpage.h"
+ #include "storage/off.h"
+
+ /* Metapage definitions */
+ typedef struct MinmaxMetaPageData
+ {
+ int32 minmaxMagic;
+ int32 minmaxVersion;
+ } MinmaxMetaPageData;
+
+ #define MINMAX_CURRENT_VERSION 1
+ #define MINMAX_META_MAGIC 0xA8109CFA
+
+ #define MINMAX_METAPAGE_BLKNO 0
+
+ #define MM_REVMAP_FORKNUM VISIBILITYMAP_FORKNUM /* reuse the VM forknum */
+
+
+ extern void mm_init_metapage(Buffer meta);
+ extern void
+ rm_page_set_iptr(Page page, int pagesPerRange, BlockNumber heapBlk,
+ BlockNumber blkno, OffsetNumber offno);
+
+
+ #endif /* MINMAX_INTERNAL_H */
*** /dev/null
--- b/src/include/access/minmax_revmap.h
***************
*** 0 ****
--- 1,34 ----
+ /*
+ * prototypes for minmax reverse range maps
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_revmap.h
+ */
+
+ #ifndef MINMAX_REVMAP_H
+ #define MINMAX_REVMAP_H
+
+ #include "storage/block.h"
+ #include "storage/itemptr.h"
+ #include "storage/off.h"
+ #include "utils/relcache.h"
+
+ /* struct definition lives in mmrevmap.c */
+ typedef struct mmRevmapAccess mmRevmapAccess;
+
+ extern mmRevmapAccess *mmRevmapAccessInit(Relation idxrel,
+ BlockNumber pagesPerRange);
+ extern void mmRevmapAccessTerminate(mmRevmapAccess *rmAccess);
+
+ extern void mmRevmapCreate(Relation idxrel);
+ extern void mmSetHeapBlockItemptr(mmRevmapAccess *rmAccess, BlockNumber blk,
+ BlockNumber blkno, OffsetNumber offno);
+ extern void mmGetHeapBlockItemptr(mmRevmapAccess *rmAccess, BlockNumber blk,
+ ItemPointerData *iptr);
+ extern void mmRevmapTruncate(mmRevmapAccess *rmAccess,
+ BlockNumber heapNumBlocks);
+
+ #endif /* MINMAX_REVMAP_H */
*** /dev/null
--- b/src/include/access/minmax_tuple.h
***************
*** 0 ****
--- 1,79 ----
+ /*
+ * Declarations for dealing with MinMax-specific tuples.
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_tuple.h
+ */
+ #ifndef MINMAX_TUPLE_H
+ #define MINMAX_TUPLE_H
+
+ #include "access/tupdesc.h"
+
+
+ /*
+ * This struct is used to represent the indexed values for one column, within
+ * one page range.
+ */
+ typedef struct MMValues
+ {
+ Datum min;
+ Datum max;
+ bool hasnulls;
+ bool allnulls;
+ } MMValues;
+
+ /*
+ * This struct represents one index tuple, comprising the minimum and
+ * maximum values for all indexed columns, within one page range.
+ * The number of elements in the values array is determined by the accompanying
+ * tuple descriptor.
+ */
+ typedef struct DeformedMMTuple
+ {
+ bool nvalues; /* XXX unused */
+ MMValues values[FLEXIBLE_ARRAY_MEMBER];
+ } DeformedMMTuple;
+
+ /*
+ * An on-disk minmax tuple. This is possibly followed by a nulls bitmask, with
+ * room for natts*2 null bits; min and max Datum values for each column follow
+ * that.
+ */
+ typedef struct MMTuple
+ {
+ /* ---------------
+ * mt_info is laid out in the following fashion:
+ *
+ * 7th (high) bit: has nulls
+ * 6th bit: unused
+ * 5th bit: unused
+ * 4-0 bit: offset of data
+ * ---------------
+ */
+ uint8 mt_info;
+ } MMTuple;
+
+ #define SizeOfMinMaxTuple (offsetof(MMTuple, mt_info) + sizeof(uint8))
+
+ /*
+ * t_info manipulation macros
+ */
+ #define MMIDX_OFFSET_MASK 0x1F
+ /* bit 0x20 is not used at present */
+ /* bit 0x40 is not used at present */
+ #define MMIDX_NULLS_MASK 0x80
+
+ #define MMTupleDataOffset(mmtup) ((Size) (((MMTuple *) (mmtup))->mt_info & MMIDX_OFFSET_MASK))
+ #define MMTupleHasNulls(mmtup) (((((MMTuple *) (mmtup))->mt_info & MMIDX_NULLS_MASK)) != 0)
+
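+ /*
+ * Example decoding (illustrative; assumes a MAXALIGN of 8): a two-column
+ * tuple containing nulls has a 1-byte header plus a 1-byte bitmap, MAXALIGN'd
+ * to a data offset of 8, so mt_info is 0x80 | 8 = 0x88; MMTupleDataOffset
+ * then yields 8 and MMTupleHasNulls yields true.
+ */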
+
+ extern TupleDesc minmax_get_descr(TupleDesc tupdesc);
+ extern MMTuple *minmax_form_tuple(TupleDesc idxDesc, TupleDesc diskDesc,
+ DeformedMMTuple *tuple, Size *size);
+ extern void minmax_free_tuple(MMTuple *tuple);
+ extern DeformedMMTuple *minmax_deform_tuple(TupleDesc tupdesc, MMTuple *tuple);
+
+ #endif /* MINMAX_TUPLE_H */
*** /dev/null
--- b/src/include/access/minmax_xlog.h
***************
*** 0 ****
--- 1,93 ----
+ /*-------------------------------------------------------------------------
+ *
+ * minmax_xlog.h
+ * POSTGRES MinMax access XLOG definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/minmax_xlog.h
+ *
+ *-------------------------------------------------------------------------
+ */
+ #ifndef MINMAX_XLOG_H
+ #define MINMAX_XLOG_H
+
+ #include "access/xlog.h"
+ #include "storage/bufpage.h"
+ #include "storage/itemptr.h"
+ #include "storage/relfilenode.h"
+ #include "utils/relcache.h"
+
+
+ /*
+ * WAL record definitions for minmax's WAL operations
+ *
+ * XLOG allows to store some information in high 4 bits of log
+ * record xl_info field.
+ */
+ #define XLOG_MINMAX_CREATE_INDEX 0x00
+ #define XLOG_MINMAX_INSERT 0x10
+ #define XLOG_MINMAX_BULKREMOVE 0x20
+ #define XLOG_MINMAX_REVMAP_SET 0x30
+
+ #define XLOG_MINMAX_OPMASK 0x70
+ /*
+ * When we insert the first item on a new page, we restore the entire page in
+ * redo.
+ */
+ #define XLOG_MINMAX_INIT_PAGE 0x80
+
+ /* This is what we need to know about a minmax index create */
+ typedef struct xl_minmax_createidx
+ {
+ RelFileNode node;
+ } xl_minmax_createidx;
+ #define SizeOfMinmaxCreateIdx (offsetof(xl_minmax_createidx, node) + sizeof(RelFileNode))
+
+ /* All that we need to find a minmax tuple */
+ typedef struct xl_minmax_tid
+ {
+ RelFileNode node;
+ ItemPointerData tid;
+ } xl_minmax_tid;
+
+ #define SizeOfMinmaxTid (offsetof(xl_minmax_tid, tid) + SizeOfIptrData)
+
+ /* This is what we need to know about a minmax tuple insert */
+ typedef struct xl_minmax_insert
+ {
+ xl_minmax_tid target;
+ /* tuple data follows at end of struct */
+ } xl_minmax_insert;
+
+ #define SizeOfMinmaxInsert (offsetof(xl_minmax_insert, target) + SizeOfMinmaxTid)
+
+ /* This is what we need to know about a bulk minmax tuple remove */
+ typedef struct xl_minmax_bulkremove
+ {
+ RelFileNode node;
+ BlockNumber block;
+ /* offset number array follows at end of struct */
+ } xl_minmax_bulkremove;
+
+ #define SizeOfMinmaxBulkRemove (offsetof(xl_minmax_bulkremove, block) + sizeof(BlockNumber))
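+
+ /*
+ * Record layout sketch (illustration only): a bulkremove record that deletes
+ * N items consists of SizeOfMinmaxBulkRemove bytes of header followed
+ * immediately by N OffsetNumbers; replay recovers N as
+ * (xl_len - SizeOfMinmaxBulkRemove) / sizeof(OffsetNumber).
+ */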
+
+ /* This is what we need to know about a revmap "set heap ptr" */
+ typedef struct xl_minmax_rm_set
+ {
+ RelFileNode node;
+ BlockNumber mapBlock;
+ int pagesPerRange;
+ BlockNumber heapBlock;
+ ItemPointerData newval;
+ } xl_minmax_rm_set;
+
+ #define SizeOfMinmaxRevmapSet (offsetof(xl_minmax_rm_set, newval) + SizeOfIptrData)
+
+
+ extern void minmax_desc(StringInfo buf, uint8 xl_info, char *rec);
+ extern void minmax_redo(XLogRecPtr lsn, XLogRecord *record);
+
+ #endif /* MINMAX_XLOG_H */
*** a/src/include/access/relscan.h
--- b/src/include/access/relscan.h
***************
*** 35,42 **** typedef struct HeapScanDescData
bool rs_temp_snap; /* unregister snapshot at scan end? */
/* state set up at initscan time */
! BlockNumber rs_nblocks; /* number of blocks to scan */
BlockNumber rs_startblock; /* block # to start at */
BufferAccessStrategy rs_strategy; /* access strategy for reads */
bool rs_syncscan; /* report location to syncscan logic? */
--- 35,44 ----
bool rs_temp_snap; /* unregister snapshot at scan end? */
/* state set up at initscan time */
! BlockNumber rs_nblocks; /* total number of blocks in rel */
BlockNumber rs_startblock; /* block # to start at */
+ BlockNumber rs_initblock; /* block # to consider initial of rel */
+ BlockNumber rs_numblocks; /* number of blocks to scan */
BufferAccessStrategy rs_strategy; /* access strategy for reads */
bool rs_syncscan; /* report location to syncscan logic? */
*** a/src/include/access/rmgrlist.h
--- b/src/include/access/rmgrlist.h
***************
*** 42,44 **** PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_xlog_startup, gin_xlog_cleanup
--- 42,45 ----
PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_xlog_startup, gist_xlog_cleanup, NULL)
PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, NULL, NULL, NULL)
PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_xlog_startup, spg_xlog_cleanup, NULL)
+ PG_RMGR(RM_MINMAX_ID, "MinMax", minmax_redo, minmax_desc, NULL, NULL, NULL)
*** a/src/include/catalog/index.h
--- b/src/include/catalog/index.h
***************
*** 97,102 **** extern double IndexBuildHeapScan(Relation heapRelation,
--- 97,110 ----
bool allow_sync,
IndexBuildCallback callback,
void *callback_state);
+ extern double IndexBuildHeapRangeScan(Relation heapRelation,
+ Relation indexRelation,
+ IndexInfo *indexInfo,
+ bool allow_sync,
+ BlockNumber start_blockno,
+ BlockNumber end_blockno,
+ IndexBuildCallback callback,
+ void *callback_state);
extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
*** a/src/include/catalog/pg_am.h
--- b/src/include/catalog/pg_am.h
***************
*** 132,136 **** DESCR("GIN index access method");
--- 132,138 ----
DATA(insert OID = 4000 ( spgist 0 5 f f f f f t f t f f f 0 spginsert spgbeginscan spggettuple spggetbitmap spgrescan spgendscan spgmarkpos spgrestrpos spgbuild spgbuildempty spgbulkdelete spgvacuumcleanup spgcanreturn spgcostestimate spgoptions ));
DESCR("SP-GiST index access method");
#define SPGIST_AM_OID 4000
+ DATA(insert OID = 3847 ( minmax 5 0 f f f f t t f t t f f 0 mminsert mmbeginscan - mmgetbitmap mmrescan mmendscan mmmarkpos mmrestrpos mmbuild mmbuildempty mmbulkdelete mmvacuumcleanup - mmcostestimate mmoptions ));
+ #define MINMAX_AM_OID 3847
#endif /* PG_AM_H */
*** a/src/include/catalog/pg_amop.h
--- b/src/include/catalog/pg_amop.h
***************
*** 781,784 **** DATA(insert ( 3474 3831 3831 8 s 3892 4000 0 ));
--- 781,811 ----
DATA(insert ( 3474 3831 2283 16 s 3889 4000 0 ));
DATA(insert ( 3474 3831 3831 18 s 3882 4000 0 ));
+ /*
+ * MinMax int4_ops
+ */
+ DATA(insert ( 3177 23 23 1 s 97 403 0 ));
+ DATA(insert ( 3177 23 23 2 s 523 403 0 ));
+ DATA(insert ( 3177 23 23 3 s 96 403 0 ));
+ DATA(insert ( 3177 23 23 4 s 525 403 0 ));
+ DATA(insert ( 3177 23 23 5 s 521 403 0 ));
+
+ /*
+ * MinMax numeric_ops
+ */
+ DATA(insert ( 3192 1700 1700 1 s 1754 403 0 ));
+ DATA(insert ( 3192 1700 1700 2 s 1755 403 0 ));
+ DATA(insert ( 3192 1700 1700 3 s 1752 403 0 ));
+ DATA(insert ( 3192 1700 1700 4 s 1757 403 0 ));
+ DATA(insert ( 3192 1700 1700 5 s 1756 403 0 ));
+
+ /*
+ * MinMax text_ops
+ */
+ DATA(insert ( 3193 25 25 1 s 664 403 0 ));
+ DATA(insert ( 3193 25 25 2 s 665 403 0 ));
+ DATA(insert ( 3193 25 25 3 s 98 403 0 ));
+ DATA(insert ( 3193 25 25 4 s 667 403 0 ));
+ DATA(insert ( 3193 25 25 5 s 666 403 0 ));
+
#endif /* PG_AMOP_H */
*** a/src/include/catalog/pg_amproc.h
--- b/src/include/catalog/pg_amproc.h
***************
*** 379,382 **** DATA(insert ( 3474 3831 3831 3 3471 ));
--- 379,388 ----
DATA(insert ( 3474 3831 3831 4 3472 ));
DATA(insert ( 3474 3831 3831 5 3473 ));
+ /* MinMax */
+ DATA(insert ( 3177 23 23 1 2132 ));
+ DATA(insert ( 3177 23 23 2 2116 ));
+ DATA(insert ( 3192 1700 1700 1 2146 ));
+ DATA(insert ( 3192 1700 1700 2 2130 ));
+
#endif /* PG_AMPROC_H */
*** a/src/include/catalog/pg_opclass.h
--- b/src/include/catalog/pg_opclass.h
***************
*** 227,231 **** DATA(insert ( 4000 range_ops PGNSP PGUID 3474 3831 t 0 ));
--- 227,234 ----
DATA(insert ( 4000 quad_point_ops PGNSP PGUID 4015 600 t 0 ));
DATA(insert ( 4000 kd_point_ops PGNSP PGUID 4016 600 f 0 ));
DATA(insert ( 4000 text_ops PGNSP PGUID 4017 25 t 0 ));
+ DATA(insert ( 3847 int4_ops PGNSP PGUID 3177 23 t 0 ));
+ DATA(insert ( 3847 numeric_ops PGNSP PGUID 3192 1700 t 0 ));
+ DATA(insert ( 3847 text_ops PGNSP PGUID 3193 25 t 0 ));
#endif /* PG_OPCLASS_H */
*** a/src/include/catalog/pg_opfamily.h
--- b/src/include/catalog/pg_opfamily.h
***************
*** 147,151 **** DATA(insert OID = 4015 ( 4000 quad_point_ops PGNSP PGUID ));
--- 147,154 ----
DATA(insert OID = 4016 ( 4000 kd_point_ops PGNSP PGUID ));
DATA(insert OID = 4017 ( 4000 text_ops PGNSP PGUID ));
#define TEXT_SPGIST_FAM_OID 4017
+ DATA(insert OID = 3177 ( 3847 int4_ops PGNSP PGUID ));
+ DATA(insert OID = 3192 ( 3847 numeric_ops PGNSP PGUID ));
+ DATA(insert OID = 3193 ( 3847 text_ops PGNSP PGUID ));
#endif /* PG_OPFAMILY_H */
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
***************
*** 561,566 **** DESCR("btree(internal)");
--- 561,594 ----
DATA(insert OID = 2785 ( btoptions PGNSP PGUID 12 1 0 0 0 f f f f t f s 2 0 17 "1009 16" _null_ _null_ _null_ _null_ btoptions _null_ _null_ _null_ ));
DESCR("btree(internal)");
+ DATA(insert OID = 3178 ( mmgetbitmap PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 20 "2281 2281" _null_ _null_ _null_ _null_ mmgetbitmap _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3179 ( mminsert PGNSP PGUID 12 1 0 0 0 f f f f t f v 6 0 16 "2281 2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mminsert _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3180 ( mmbeginscan PGNSP PGUID 12 1 0 0 0 f f f f t f v 3 0 2281 "2281 2281 2281" _null_ _null_ _null_ _null_ mmbeginscan _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3181 ( mmrescan PGNSP PGUID 12 1 0 0 0 f f f f t f v 5 0 2278 "2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmrescan _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3182 ( mmendscan PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmendscan _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3183 ( mmmarkpos PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmmarkpos _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3184 ( mmrestrpos PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmrestrpos _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3185 ( mmbuild PGNSP PGUID 12 1 0 0 0 f f f f t f v 3 0 2281 "2281 2281 2281" _null_ _null_ _null_ _null_ mmbuild _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3186 ( mmbuildempty PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmbuildempty _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3187 ( mmbulkdelete PGNSP PGUID 12 1 0 0 0 f f f f t f v 4 0 2281 "2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmbulkdelete _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3188 ( mmvacuumcleanup PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ mmvacuumcleanup _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3190 ( mmcostestimate PGNSP PGUID 12 1 0 0 0 f f f f t f v 7 0 2278 "2281 2281 2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmcostestimate _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3191 ( mmoptions PGNSP PGUID 12 1 0 0 0 f f f f t f s 2 0 17 "1009 16" _null_ _null_ _null_ _null_ mmoptions _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+
+
DATA(insert OID = 339 ( poly_same PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "604 604" _null_ _null_ _null_ _null_ poly_same _null_ _null_ _null_ ));
DATA(insert OID = 340 ( poly_contain PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "604 604" _null_ _null_ _null_ _null_ poly_contain _null_ _null_ _null_ ));
DATA(insert OID = 341 ( poly_left PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "604 604" _null_ _null_ _null_ _null_ poly_left _null_ _null_ _null_ ));
*** a/src/include/storage/bufpage.h
--- b/src/include/storage/bufpage.h
***************
*** 403,408 **** extern Size PageGetExactFreeSpace(Page page);
--- 403,409 ----
extern Size PageGetHeapFreeSpace(Page page);
extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
+ extern void PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos, int nitems);
extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
On Sat, 2013-09-14 at 21:14 -0300, Alvaro Herrera wrote:
Here's a reviewable version of what I've dubbed Minmax indexes.
Please fix duplicate OID 3177.
On 15 September 2013 01:14, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Hi,
Here's a reviewable version of what I've dubbed Minmax indexes. [...]
I have messed up the opclass information, as evidenced by failures in
opr_sanity regression test. I will research that later.
There's working contrib/pageinspect support; pg_xlogdump (and wal_debug)
seems to work sanely too.
This patch compiles cleanly under -Werror.
The research leading to these results has received funding from the
European Union's Seventh Framework Programme (FP7/2007-2013) under
grant agreement n° 318633
Thanks for the patch, but I seem to have immediately hit a snag:
pgbench=# CREATE INDEX minmaxtest ON pgbench_accounts USING minmax (aid);
PANIC: invalid xlog record length 0
--
Thom
On 15.09.2013 03:14, Alvaro Herrera wrote:
+ Partial indexes are not supported; since an index is concerned with minimum and
+ maximum values of the involved columns across all the pages in the table, it
+ doesn't make sense to exclude values. Another way to see "partial" indexes
+ here would be those that only considered some pages in the table instead of all
+ of them; but this would be difficult to implement and manage and, most likely,
+ pointless.
Something like this seems completely sensible to me:
create index i_accounts on accounts using minmax (ts) where valid = true;
The situation where that would be useful is if 'valid' accounts are
fairly well clustered, but invalid ones are scattered all over the
table. The minimum and maximum stored in the index would only concern
valid accounts.
- Heikki
On 16 September 2013 at 11:03 Heikki Linnakangas <hlinnakangas@vmware.com>
wrote:
Something like this seems completely sensible to me:
create index i_accounts on accounts using minmax (ts) where valid = true;
The situation where that would be useful is if 'valid' accounts are
fairly well clustered, but invalid ones are scattered all over the
table. The minimum and maximum stoed in the index would only concern
valid accounts.
Here's one that occurs to me:
CREATE INDEX i_billing_id_mm ON billing(id) WHERE paid_in_full IS NOT TRUE;
Note that this would be a frequently moving target and over years of billing,
the subset would be quite small compared to the full system (imagine, say, 50k
rows out of 20M).
Best Wishes,
Chris Travers
- Heikki
On 2013-09-16 11:19:19 +0100, Chris Travers wrote:
On 16 September 2013 at 11:03 Heikki Linnakangas <hlinnakangas@vmware.com>
wrote:Something like this seems completely sensible to me:
create index i_accounts on accounts using minmax (ts) where valid = true;
The situation where that would be useful is if 'valid' accounts are
fairly well clustered, but invalid ones are scattered all over the
table. The minimum and maximum stoed in the index would only concern
valid accounts.
Yes, I wondered the same myself.
Here's one that occurs to me:
CREATE INDEX i_billing_id_mm ON billing(id) WHERE paid_in_full IS NOT TRUE;
Note that this would be a frequently moving target and over years of billing,
the subset would be quite small compared to the full system (imagine, say, 50k
rows out of 20M).
In that case you'd just use a normal btree index, no?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Sep 16, 2013 at 3:47 AM, Thom Brown <thom@linux.com> wrote:
On 15 September 2013 01:14, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Hi,
Here's a reviewable version of what I've dubbed Minmax indexes.
Thanks for the patch, but I seem to have immediately hit a snag:
pgbench=# CREATE INDEX minmaxtest ON pgbench_accounts USING minmax (aid);
PANIC: invalid xlog record length 0
fwiw, this seems to be triggered by ANALYZE.
At least i can trigger it by executing ANALYZE on the table (attached
is a stacktrace of a backend exhibiting the failure)
Another thing is these messages I got when compiling:
"""
mmxlog.c: In function ‘minmax_xlog_revmap_set’:
mmxlog.c:161:14: warning: unused variable ‘blkno’ [-Wunused-variable]
bufpage.c: In function ‘PageIndexDeleteNoCompact’:
bufpage.c:1066:18: warning: ‘lastused’ may be used uninitialized in
this function [-Wmaybe-uninitialized]
"""
--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566 Cell: +593 987171157
Attachments:
On 17 September 2013 07:20, Jaime Casanova <jaime@2ndquadrant.com> wrote:
On Mon, Sep 16, 2013 at 3:47 AM, Thom Brown <thom@linux.com> wrote:
On 15 September 2013 01:14, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:
Hi,
Here's a reviewable version of what I've dubbed Minmax indexes.
Thanks for the patch, but I seem to have immediately hit a snag:
pgbench=# CREATE INDEX minmaxtest ON pgbench_accounts USING minmax (aid);
PANIC: invalid xlog record length 0
fwiw, this seems to be triggered by ANALYZE.
At least i can trigger it by executing ANALYZE on the table (attached
is a stacktrace of a backend exhibiting the failure)
Another thing is this messages i got when compiling:
"""
mmxlog.c: In function ‘minmax_xlog_revmap_set’:
mmxlog.c:161:14: warning: unused variable ‘blkno’ [-Wunused-variable]
bufpage.c: In function ‘PageIndexDeleteNoCompact’:
bufpage.c:1066:18: warning: ‘lastused’ may be used uninitialized in
this function [-Wmaybe-uninitialized]
"""
I'm able to run ANALYSE manually without it dying:
pgbench=# analyse pgbench_accounts;
ANALYZE
pgbench=# analyse pgbench_accounts;
ANALYZE
pgbench=# create index minmaxtest on pgbench_accounts using minmax (aid);
PANIC: invalid xlog record length 0
--
Thom
On Tue, Sep 17, 2013 at 3:30 AM, Thom Brown <thom@linux.com> wrote:
On 17 September 2013 07:20, Jaime Casanova <jaime@2ndquadrant.com> wrote:
On Mon, Sep 16, 2013 at 3:47 AM, Thom Brown <thom@linux.com> wrote:
On 15 September 2013 01:14, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:Hi,
Here's a reviewable version of what I've dubbed Minmax indexes.
Thanks for the patch, but I seem to have immediately hit a snag:
pgbench=# CREATE INDEX minmaxtest ON pgbench_accounts USING minmax
(aid);
PANIC: invalid xlog record length 0
fwiw, this seems to be triggered by ANALYZE.
At least i can trigger it by executing ANALYZE on the table (attached
is a stacktrace of a backend exhibiting the failure)
I'm able to run ANALYSE manually without it dying:
try inserting some data before the ANALYZE; that will force a
resummarization, which is mentioned in the stack trace of the failure
--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566 Cell: +593 987171157
On 17 September 2013 14:37, Jaime Casanova <jaime@2ndquadrant.com> wrote:
On Tue, Sep 17, 2013 at 3:30 AM, Thom Brown <thom@linux.com> wrote:
On 17 September 2013 07:20, Jaime Casanova <jaime@2ndquadrant.com>
wrote:
On Mon, Sep 16, 2013 at 3:47 AM, Thom Brown <thom@linux.com> wrote:
On 15 September 2013 01:14, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:Hi,
Here's a reviewable version of what I've dubbed Minmax indexes.
Thanks for the patch, but I seem to have immediately hit a snag:
pgbench=# CREATE INDEX minmaxtest ON pgbench_accounts USING minmax
(aid);
PANIC: invalid xlog record length 0
fwiw, this seems to be triggered by ANALYZE.
At least i can trigger it by executing ANALYZE on the table (attached
is a stacktrace of a backend exhibiting the failure)
I'm able to run ANALYSE manually without it dying:
try inserting some data before the ANALYZE, that will force a
resumarization which is mentioned in the stack trace of the failure
I've tried inserting 1 row then ANALYSE and 10,000 rows then ANALYSE, and
in both cases there's no error. But then trying to create the index again
results in my original error.
--
Thom
On Tue, Sep 17, 2013 at 8:43 AM, Thom Brown <thom@linux.com> wrote:
On 17 September 2013 14:37, Jaime Casanova <jaime@2ndquadrant.com> wrote:
On Tue, Sep 17, 2013 at 3:30 AM, Thom Brown <thom@linux.com> wrote:
On 17 September 2013 07:20, Jaime Casanova <jaime@2ndquadrant.com>
wrote:On Mon, Sep 16, 2013 at 3:47 AM, Thom Brown <thom@linux.com> wrote:
On 15 September 2013 01:14, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:Hi,
Here's a reviewable version of what I've dubbed Minmax indexes.
Thanks for the patch, but I seem to have immediately hit a snag:
pgbench=# CREATE INDEX minmaxtest ON pgbench_accounts USING minmax
(aid);
PANIC: invalid xlog record length 0
fwiw, this seems to be triggered by ANALYZE.
At least i can trigger it by executing ANALYZE on the table (attached
is a stacktrace of a backend exhibiting the failure)
I'm able to run ANALYSE manually without it dying:
try inserting some data before the ANALYZE, that will force a
resumarization which is mentioned in the stack trace of the failure
I've tried inserting 1 row then ANALYSE and 10,000 rows then ANALYSE, and in
both cases there's no error. But then trying to create the index again
results in my original error.
Ok
So, please confirm if this is the pattern you are following:
CREATE TABLE t1(i int);
INSERT INTO t1 SELECT generate_series(1, 10000);
CREATE INDEX idx1 ON t1 USING minmax (i);
if so, then the attached stack trace (index_failure_thom.txt) should
correspond to the failure you are seeing.
My test was slightly different:
CREATE TABLE t1(i int);
CREATE INDEX idx1 ON t1 USING minmax (i);
INSERT INTO t1 SELECT generate_series(1, 10000);
ANALYZE t1;
and the failure happened at a different time, during resummarization
(attached index_failure_jcm.txt)
but in the end, both failures seem to happen for the same reason: a
record of length 0... at XLogInsert time
#4 XLogInsert at xlog.c:966
#5 mmSetHeapBlockItemptr at mmrevmap.c:169
#6 mm_doinsert at minmax.c:1410
actually, if you create a temp table, both tests work fine
--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566 Cell: +593 987171157
Thom Brown wrote:
Thanks for testing.
Thanks for the patch, but I seem to have immediately hit a snag:
pgbench=# CREATE INDEX minmaxtest ON pgbench_accounts USING minmax (aid);
PANIC: invalid xlog record length 0
Silly mistake I had already made in another patch. Here's an
incremental patch which fixes this bug. Apply this on top of previous
minmax-1.patch.
I also renumbered the duplicate OID pointed out by Peter, and fixed the
two compiler warnings reported by Jaime.
Note you'll need to re-initdb in order to get the right catalog entries.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
minmax-2-incr.patch (text/x-diff; charset=us-ascii)
diff --git a/src/backend/access/minmax/mmrevmap.c b/src/backend/access/minmax/mmrevmap.c
index 3e19f90..76cddde 100644
--- a/src/backend/access/minmax/mmrevmap.c
+++ b/src/backend/access/minmax/mmrevmap.c
@@ -147,12 +147,10 @@ mmSetHeapBlockItemptr(mmRevmapAccess *rmAccess, BlockNumber heapBlk,
{
xl_minmax_rm_set xlrec;
XLogRecPtr recptr;
- XLogRecData rdata;
+ XLogRecData rdata[2];
uint8 info;
info = XLOG_MINMAX_REVMAP_SET;
- if (extend)
- info |= XLOG_MINMAX_INIT_PAGE;
xlrec.node = rmAccess->idxrel->rd_node;
xlrec.mapBlock = mapBlk;
@@ -160,13 +158,26 @@ mmSetHeapBlockItemptr(mmRevmapAccess *rmAccess, BlockNumber heapBlk,
xlrec.heapBlock = heapBlk;
ItemPointerSet(&(xlrec.newval), blkno, offno);
- rdata.data = (char *) &xlrec;
- rdata.len = SizeOfMinmaxRevmapSet;
- rdata.buffer = rmAccess->currBuf;
- rdata.buffer_std = false;
- rdata.next = NULL;
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxRevmapSet;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].buffer_std = false;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = NULL;
+ rdata[1].len = 0;
+ rdata[1].buffer = rmAccess->currBuf;
+ rdata[1].buffer_std = false;
+ rdata[1].next = NULL;
+
+ if (extend)
+ {
+ info |= XLOG_MINMAX_INIT_PAGE;
+ /* If the page is new, there's no need for a full page image */
+ rdata[0].next = NULL;
+ }
- recptr = XLogInsert(RM_MINMAX_ID, info, &rdata);
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
PageSetLSN(BufferGetPage(rmAccess->currBuf), recptr);
}
diff --git a/src/backend/access/minmax/mmxlog.c b/src/backend/access/minmax/mmxlog.c
index ee095a2..758fc5f 100644
--- a/src/backend/access/minmax/mmxlog.c
+++ b/src/backend/access/minmax/mmxlog.c
@@ -158,7 +158,6 @@ minmax_xlog_revmap_set(XLogRecPtr lsn, XLogRecord *record)
{
xl_minmax_rm_set *xlrec = (xl_minmax_rm_set *) XLogRecGetData(record);
bool init;
- BlockNumber blkno;
Buffer buffer;
Page page;
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 1e7cbac..d1a3ca7 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1040,6 +1040,7 @@ PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos, int nitems)
* page.
*/
memcpy(pageCopy, page, BLCKSZ);
+ lastused = FirstOffsetNumber;
upper = pd_special;
PageClearHasFreeLinePointers(page);
for (i = 0, itemidptr = itemidbase; i < nline; i++, itemidptr++)
diff --git a/src/include/catalog/pg_amop.h b/src/include/catalog/pg_amop.h
index 8109949..192e295 100644
--- a/src/include/catalog/pg_amop.h
+++ b/src/include/catalog/pg_amop.h
@@ -784,28 +784,28 @@ DATA(insert ( 3474 3831 3831 18 s 3882 4000 0 ));
/*
* MinMax int4_ops
*/
-DATA(insert ( 3177 23 23 1 s 97 403 0 ));
-DATA(insert ( 3177 23 23 2 s 523 403 0 ));
-DATA(insert ( 3177 23 23 3 s 96 403 0 ));
-DATA(insert ( 3177 23 23 4 s 525 403 0 ));
-DATA(insert ( 3177 23 23 5 s 521 403 0 ));
+DATA(insert ( 3192 23 23 1 s 97 3847 0 ));
+DATA(insert ( 3192 23 23 2 s 523 3847 0 ));
+DATA(insert ( 3192 23 23 3 s 96 3847 0 ));
+DATA(insert ( 3192 23 23 4 s 525 3847 0 ));
+DATA(insert ( 3192 23 23 5 s 521 3847 0 ));
/*
* MinMax numeric_ops
*/
-DATA(insert ( 3192 1700 1700 1 s 1754 403 0 ));
-DATA(insert ( 3192 1700 1700 2 s 1755 403 0 ));
-DATA(insert ( 3192 1700 1700 3 s 1752 403 0 ));
-DATA(insert ( 3192 1700 1700 4 s 1757 403 0 ));
-DATA(insert ( 3192 1700 1700 5 s 1756 403 0 ));
+DATA(insert ( 3193 1700 1700 1 s 1754 3847 0 ));
+DATA(insert ( 3193 1700 1700 2 s 1755 3847 0 ));
+DATA(insert ( 3193 1700 1700 3 s 1752 3847 0 ));
+DATA(insert ( 3193 1700 1700 4 s 1757 3847 0 ));
+DATA(insert ( 3193 1700 1700 5 s 1756 3847 0 ));
/*
* MinMax text_ops
*/
-DATA(insert ( 3193 25 25 1 s 664 403 0 ));
-DATA(insert ( 3193 25 25 2 s 665 403 0 ));
-DATA(insert ( 3193 25 25 3 s 98 403 0 ));
-DATA(insert ( 3193 25 25 4 s 667 403 0 ));
-DATA(insert ( 3193 25 25 5 s 666 403 0 ));
+DATA(insert ( 3194 25 25 1 s 664 3847 0 ));
+DATA(insert ( 3194 25 25 2 s 665 3847 0 ));
+DATA(insert ( 3194 25 25 3 s 98 3847 0 ));
+DATA(insert ( 3194 25 25 4 s 667 3847 0 ));
+DATA(insert ( 3194 25 25 5 s 666 3847 0 ));
#endif /* PG_AMOP_H */
diff --git a/src/include/catalog/pg_amproc.h b/src/include/catalog/pg_amproc.h
index 53ecb58..7155cb2 100644
--- a/src/include/catalog/pg_amproc.h
+++ b/src/include/catalog/pg_amproc.h
@@ -379,10 +379,4 @@ DATA(insert ( 3474 3831 3831 3 3471 ));
DATA(insert ( 3474 3831 3831 4 3472 ));
DATA(insert ( 3474 3831 3831 5 3473 ));
-/* MinMax */
-DATA(insert ( 3177 23 23 1 2132 ));
-DATA(insert ( 3177 23 23 2 2116 ));
-DATA(insert ( 3192 1700 1700 1 2146 ));
-DATA(insert ( 3192 1700 1700 2 2130 ));
-
#endif /* PG_AMPROC_H */
diff --git a/src/include/catalog/pg_opclass.h b/src/include/catalog/pg_opclass.h
index da3337d..3a434de 100644
--- a/src/include/catalog/pg_opclass.h
+++ b/src/include/catalog/pg_opclass.h
@@ -227,8 +227,8 @@ DATA(insert ( 4000 range_ops PGNSP PGUID 3474 3831 t 0 ));
DATA(insert ( 4000 quad_point_ops PGNSP PGUID 4015 600 t 0 ));
DATA(insert ( 4000 kd_point_ops PGNSP PGUID 4016 600 f 0 ));
DATA(insert ( 4000 text_ops PGNSP PGUID 4017 25 t 0 ));
-DATA(insert ( 3847 int4_ops PGNSP PGUID 3177 23 t 0 ));
-DATA(insert ( 3847 numeric_ops PGNSP PGUID 3192 1700 t 0 ));
-DATA(insert ( 3847 text_ops PGNSP PGUID 3193 25 t 0 ));
+DATA(insert ( 3847 int4_ops PGNSP PGUID 3192 23 t 0 ));
+DATA(insert ( 3847 numeric_ops PGNSP PGUID 3193 1700 t 0 ));
+DATA(insert ( 3847 text_ops PGNSP PGUID 3194 25 t 0 ));
#endif /* PG_OPCLASS_H */
diff --git a/src/include/catalog/pg_opfamily.h b/src/include/catalog/pg_opfamily.h
index a9ac8d7..4fd761a 100644
--- a/src/include/catalog/pg_opfamily.h
+++ b/src/include/catalog/pg_opfamily.h
@@ -147,8 +147,8 @@ DATA(insert OID = 4015 ( 4000 quad_point_ops PGNSP PGUID ));
DATA(insert OID = 4016 ( 4000 kd_point_ops PGNSP PGUID ));
DATA(insert OID = 4017 ( 4000 text_ops PGNSP PGUID ));
#define TEXT_SPGIST_FAM_OID 4017
-DATA(insert OID = 3177 ( 3847 int4_ops PGNSP PGUID ));
-DATA(insert OID = 3192 ( 3847 numeric_ops PGNSP PGUID ));
-DATA(insert OID = 3193 ( 3847 text_ops PGNSP PGUID ));
+DATA(insert OID = 3192 ( 3847 int4_ops PGNSP PGUID ));
+DATA(insert OID = 3193 ( 3847 numeric_ops PGNSP PGUID ));
+DATA(insert OID = 3194 ( 3847 text_ops PGNSP PGUID ));
#endif /* PG_OPFAMILY_H */
On 17 September 2013 22:03, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Thom Brown wrote:
Thanks for testing.
Thanks for the patch, but I seem to have immediately hit a snag:
pgbench=# CREATE INDEX minmaxtest ON pgbench_accounts USING minmax (aid);
PANIC: invalid xlog record length 0
Silly mistake I had already made in another patch. Here's an
incremental patch which fixes this bug. Apply this on top of previous
minmax-1.patch.
Thanks.
Hit another issue with exactly the same procedure:
pgbench=# create index minmaxtest on pgbench_accounts using minmax (aid);
ERROR: lock 176475 is not held
--
Thom
On Tue, September 17, 2013 23:03, Alvaro Herrera wrote:
[minmax-1.patch. + minmax-2-incr.patch. (and initdb)]
The patches apply and compile OK.
I've not yet really tested; I just wanted to mention that make check gives the following differences:
*** /home/aardvark/pg_stuff/pg_sandbox/pgsql.minmax/src/test/regress/expected/opr_sanity.out 2013-09-17 23:18:31.427356703 +0200
--- /home/aardvark/pg_stuff/pg_sandbox/pgsql.minmax/src/test/regress/results/opr_sanity.out 2013-09-17 23:20:48.208150824 +0200
***************
*** 1076,1081 ****
--- 1076,1086 ----
2742 | 2 | @@@
2742 | 3 | <@
2742 | 4 | =
+ 3847 | 1 | <
+ 3847 | 2 | <=
+ 3847 | 3 | =
+ 3847 | 4 | >=
+ 3847 | 5 | >
4000 | 1 | <<
4000 | 1 | ~<~
4000 | 2 | &<
***************
*** 1098,1104 ****
4000 | 15 | >
4000 | 16 | @>
4000 | 18 | =
! (62 rows)
-- Check that all opclass search operators have selectivity estimators.
-- This is not absolutely required, but it seems a reasonable thing
--- 1103,1109 ----
4000 | 15 | >
4000 | 16 | @>
4000 | 18 | =
! (67 rows)
-- Check that all opclass search operators have selectivity estimators.
-- This is not absolutely required, but it seems a reasonable thing
***************
*** 1272,1280 ****
WHERE am.amname <> 'btree' AND am.amname <> 'gist' AND am.amname <> 'gin'
GROUP BY amname, amsupport, opcname, amprocfamily
HAVING count(*) != amsupport OR amprocfamily IS NULL;
! amname | opcname | count
! --------+---------+-------
! (0 rows)
SELECT amname, opcname, count(*)
FROM pg_am am JOIN pg_opclass op ON opcmethod = am.oid
--- 1277,1288 ----
WHERE am.amname <> 'btree' AND am.amname <> 'gist' AND am.amname <> 'gin'
GROUP BY amname, amsupport, opcname, amprocfamily
HAVING count(*) != amsupport OR amprocfamily IS NULL;
! amname | opcname | count
! --------+-------------+-------
! minmax | int4_ops | 1
! minmax | text_ops | 1
! minmax | numeric_ops | 1
! (3 rows)
SELECT amname, opcname, count(*)
FROM pg_am am JOIN pg_opclass op ON opcmethod = am.oid
======================================================================
Erik Rijkers
Thom Brown wrote:
Hit another issue with exactly the same procedure:
pgbench=# create index minmaxtest on pgbench_accounts using minmax (aid);
ERROR: lock 176475 is not held
That's what I get for restructuring the way buffers are acquired to use
the FSM, and then neglecting to test creation on decently-sized indexes.
Fix attached.
I just realized that xlog replay is also broken.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
minmax-3-incr.patch (text/x-diff; charset=us-ascii)
diff --git a/src/backend/access/minmax/minmax.c b/src/backend/access/minmax/minmax.c
index 3b41100..47cb05e 100644
--- a/src/backend/access/minmax/minmax.c
+++ b/src/backend/access/minmax/minmax.c
@@ -1510,10 +1510,8 @@ mm_getinsertbuffer(Relation irel, Buffer *buffer, Size itemsz)
}
if (!BufferIsInvalid(*buffer))
- {
- MarkBufferDirty(*buffer);
- UnlockReleaseBuffer(*buffer);
- }
+ ReleaseBuffer(*buffer);
+
*buffer = buf;
}
else
Erik Rijkers wrote:
On Tue, September 17, 2013 23:03, Alvaro Herrera wrote:
[minmax-1.patch. + minmax-2-incr.patch. (and initdb)]
The patches apply and compile OK.
I've not yet really tested; I just wanted to mention that make check gives the following differences:
Oops, I forgot to update the expected file. I meant to comment on this
when submitting minmax-2-incr.patch but forgot. First, those extra five
operators are supposed to be there; expected file needs an update. As
for this:
--- 1277,1288 ----
  WHERE am.amname <> 'btree' AND am.amname <> 'gist' AND am.amname <> 'gin'
  GROUP BY amname, amsupport, opcname, amprocfamily
  HAVING count(*) != amsupport OR amprocfamily IS NULL;
!  amname |   opcname   | count
! --------+-------------+-------
!  minmax | int4_ops    |     1
!  minmax | text_ops    |     1
!  minmax | numeric_ops |     1
! (3 rows)
I think the problem is that the query is wrong. This is the complete query:
SELECT amname, opcname, count(*)
FROM pg_am am JOIN pg_opclass op ON opcmethod = am.oid
LEFT JOIN pg_amproc p ON amprocfamily = opcfamily AND
amproclefttype = amprocrighttype AND amproclefttype = opcintype
WHERE am.amname <> 'btree' AND am.amname <> 'gist' AND am.amname <> 'gin'
GROUP BY amname, amsupport, opcname, amprocfamily
HAVING count(*) != amsupport OR amprocfamily IS NULL;
It should be, instead, this:
SELECT amname, opcname, count(*)
FROM pg_am am JOIN pg_opclass op ON opcmethod = am.oid
LEFT JOIN pg_amproc p ON amprocfamily = opcfamily AND
amproclefttype = amprocrighttype AND amproclefttype = opcintype
WHERE am.amname <> 'btree' AND am.amname <> 'gist' AND am.amname <> 'gin'
GROUP BY amname, amsupport, opcname, amprocfamily
HAVING count(*) != amsupport AND (amprocfamily IS NOT NULL);
This query is supposed to check that there are no opclasses with
a mismatching number of support procedures; but if the left join returns a
null-extended row for pg_amproc, that means there is no support proc,
yet count(*) will return 1. So count(*) will not match amsupport, and
the row is supposed to be excluded by the amprocfamily IS NULL clause in
HAVING.
Both queries return empty in HEAD, but only the second one correctly
returns empty with the patch applied.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Sep 17, 2013 at 4:03 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Thom Brown wrote:
Thanks for testing.
Thanks for the patch, but I seem to have immediately hit a snag:
pgbench=# CREATE INDEX minmaxtest ON pgbench_accounts USING minmax (aid);
PANIC: invalid xlog record length 0
Silly mistake I had already made in another patch. Here's an
incremental patch which fixes this bug. Apply this on top of previous
minmax-1.patch.
I also renumbered the duplicate OID pointed out by Peter, and fixed the
two compiler warnings reported by Jaime.
Note you'll need to re-initdb in order to get the right catalog entries.
Hi,
Found another problem with these steps:
create table t1 (i int);
create index idx_t1_i on t1 using minmax(i);
insert into t1 select generate_series(1, 2000000);
ERROR: could not read block 1 in file "base/12645/16397_vm": read
only 0 of 8192 bytes
STATEMENT: insert into t1 select generate_series(1, 2000000);
ERROR: could not read block 1 in file "base/12645/16397_vm": read
only 0 of 8192 bytes
After that, i keep receiving these messages (when autovacuum tries to
vacuum this table):
ERROR: could not truncate file "base/12645/16397_vm" to 2 blocks:
it's only 1 blocks now
CONTEXT: automatic vacuum of table "postgres.public.t1"
ERROR: could not truncate file "base/12645/16397_vm" to 2 blocks:
it's only 1 blocks now
CONTEXT: automatic vacuum of table "postgres.public.t1"
--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566 Cell: +593 987171157
Jaime Casanova wrote:
Found another problem with the this steps:
create table t1 (i int);
create index idx_t1_i on t1 using minmax(i);
insert into t1 select generate_series(1, 2000000);
ERROR: could not read block 1 in file "base/12645/16397_vm": read
only 0 of 8192 bytes
Thanks. This was a trivial off-by-one bug; fixed in the attached patch.
While studying it, I noticed that I was also failing to notice extension
of the fork by another process. I have tried to fix that also in the
current patch, but I'm afraid that a fully robust solution for this will
involve having a cached fork size in the index's relcache entry -- just
like we have smgr_vm_nblocks. In fact, since the revmap fork is
currently reusing the VM forknum, I might even be able to use the same
variable to keep track of the fork size. But I don't really like this
bit of reusing the VM forknum for revmap, so I've refrained from
extending that assumption into further code for the time being.
There was also a bug that we would try to initialize a revmap page twice
during recovery, if two backends thought they needed to extend it; that
would cause the data written by the first extender to be lost.
This patch applies on top of the two previous incremental patches. I
will send a full patch later, including all those fixes and the fix for
the opr_sanity regression test.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
minmax-4-incr.patch (text/x-diff; charset=us-ascii)
*** a/src/backend/access/minmax/mmrevmap.c
--- b/src/backend/access/minmax/mmrevmap.c
***************
*** 30,36 ****
#define HEAPBLK_TO_REVMAP_INDEX(pagesPerRange, heapBlk) \
((heapBlk / pagesPerRange) % IDXITEMS_PER_PAGE)
! static void mmRevmapExtend(mmRevmapAccess *rmAccess, BlockNumber blkno);
/* typedef appears in minmax_revmap.h */
struct mmRevmapAccess
--- 30,36 ----
#define HEAPBLK_TO_REVMAP_INDEX(pagesPerRange, heapBlk) \
((heapBlk / pagesPerRange) % IDXITEMS_PER_PAGE)
! static bool mmRevmapExtend(mmRevmapAccess *rmAccess, BlockNumber blkno);
/* typedef appears in minmax_revmap.h */
struct mmRevmapAccess
***************
*** 52,62 **** mmRevmapAccessInit(Relation idxrel, BlockNumber pagesPerRange)
{
mmRevmapAccess *rmAccess = palloc(sizeof(mmRevmapAccess));
rmAccess->idxrel = idxrel;
rmAccess->pagesPerRange = pagesPerRange;
rmAccess->currBuf = InvalidBuffer;
rmAccess->physPagesInRevmap =
! RelationGetNumberOfBlocksInFork(idxrel, MM_REVMAP_FORKNUM);
return rmAccess;
}
--- 52,64 ----
{
mmRevmapAccess *rmAccess = palloc(sizeof(mmRevmapAccess));
+ RelationOpenSmgr(idxrel);
+
rmAccess->idxrel = idxrel;
rmAccess->pagesPerRange = pagesPerRange;
rmAccess->currBuf = InvalidBuffer;
rmAccess->physPagesInRevmap =
! smgrnblocks(idxrel->rd_smgr, MM_REVMAP_FORKNUM);
return rmAccess;
}
***************
*** 111,121 **** mmSetHeapBlockItemptr(mmRevmapAccess *rmAccess, BlockNumber heapBlk,
/*
* If the revmap is out of space, extend it first.
*/
! if (mapBlk > rmAccess->physPagesInRevmap - 1)
! {
! mmRevmapExtend(rmAccess, mapBlk);
! extend = true;
! }
/*
* Obtain the buffer from which we need to read. If we already have the
--- 113,120 ----
/*
* If the revmap is out of space, extend it first.
*/
! if (mapBlk >= rmAccess->physPagesInRevmap)
! extend = mmRevmapExtend(rmAccess, mapBlk);
/*
* Obtain the buffer from which we need to read. If we already have the
***************
*** 202,211 **** mmGetHeapBlockItemptr(mmRevmapAccess *rmAccess, BlockNumber heapBlk,
mapBlk = HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk);
! if (mapBlk > rmAccess->physPagesInRevmap)
{
! ItemPointerSetInvalid(out);
! return;
}
if (rmAccess->currBuf == InvalidBuffer ||
--- 201,229 ----
mapBlk = HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk);
! /*
! * If we are asked for a block of the map which is beyond what we know
! * about it, try to see if our fork has grown since we last checked its
! * size; a concurrent inserter could have extended it.
! */
! if (mapBlk >= rmAccess->physPagesInRevmap)
{
! RelationOpenSmgr(rmAccess->idxrel);
! LockRelationForExtension(rmAccess->idxrel, ShareLock);
! rmAccess->physPagesInRevmap =
! smgrnblocks(rmAccess->idxrel->rd_smgr, MM_REVMAP_FORKNUM);
!
! if (mapBlk >= rmAccess->physPagesInRevmap)
! {
! /* definitely not in range */
!
! UnlockRelationForExtension(rmAccess->idxrel, ShareLock);
! ItemPointerSetInvalid(out);
! return;
! }
!
! /* the block exists now, proceed */
! UnlockRelationForExtension(rmAccess->idxrel, ShareLock);
}
if (rmAccess->currBuf == InvalidBuffer ||
***************
*** 273,286 **** mmRevmapCreate(Relation idxrel)
}
/*
! * Extend the reverse range map to cover the given block number.
*
* NB -- caller is responsible for ensuring this action is properly WAL-logged.
*/
! static void
mmRevmapExtend(mmRevmapAccess *rmAccess, BlockNumber blkno)
{
char page[BLCKSZ];
MemSet(page, 0, sizeof(page));
PageInit(page, BLCKSZ, 0);
--- 291,307 ----
}
/*
! * Extend the reverse range map to cover the given block number. Return false
! * if the map already covered the requested range (no extension actually done),
! * true otherwise.
*
* NB -- caller is responsible for ensuring this action is properly WAL-logged.
*/
! static bool
mmRevmapExtend(mmRevmapAccess *rmAccess, BlockNumber blkno)
{
char page[BLCKSZ];
+ bool extended = false;
MemSet(page, 0, sizeof(page));
PageInit(page, BLCKSZ, 0);
***************
*** 291,316 **** mmRevmapExtend(mmRevmapAccess *rmAccess, BlockNumber blkno)
* first, refresh our idea of the current size; it might well have grown
* up to what we need since we last checked.
*/
rmAccess->physPagesInRevmap =
! RelationGetNumberOfBlocksInFork(rmAccess->idxrel,
! MM_REVMAP_FORKNUM);
/*
* Now extend it one page at a time. This might seem a bit inefficient,
* but normally we'd be extending for a single page anyway.
*/
! while (blkno > rmAccess->physPagesInRevmap - 1)
{
smgrextend(rmAccess->idxrel->rd_smgr, MM_REVMAP_FORKNUM,
rmAccess->physPagesInRevmap, page, false);
rmAccess->physPagesInRevmap++;
}
Assert(rmAccess->physPagesInRevmap ==
! RelationGetNumberOfBlocksInFork(rmAccess->idxrel,
! MM_REVMAP_FORKNUM));
UnlockRelationForExtension(rmAccess->idxrel, ExclusiveLock);
}
/*
--- 312,339 ----
* first, refresh our idea of the current size; it might well have grown
* up to what we need since we last checked.
*/
+ RelationOpenSmgr(rmAccess->idxrel);
rmAccess->physPagesInRevmap =
! smgrnblocks(rmAccess->idxrel->rd_smgr, MM_REVMAP_FORKNUM);
/*
* Now extend it one page at a time. This might seem a bit inefficient,
* but normally we'd be extending for a single page anyway.
*/
! while (blkno >= rmAccess->physPagesInRevmap)
{
+ extended = true;
smgrextend(rmAccess->idxrel->rd_smgr, MM_REVMAP_FORKNUM,
rmAccess->physPagesInRevmap, page, false);
rmAccess->physPagesInRevmap++;
}
Assert(rmAccess->physPagesInRevmap ==
! smgrnblocks(rmAccess->idxrel->rd_smgr, MM_REVMAP_FORKNUM));
UnlockRelationForExtension(rmAccess->idxrel, ExclusiveLock);
+
+ return extended;
}
/*
On Wed, September 25, 2013 00:14, Alvaro Herrera wrote:
[minmax-4-incr.patch]
After a --data-checksums initdb (successful), the following error came up:
after the statement: create index t_minmax_idx on t using minmax (r);
WARNING: page verification failed, calculated checksum 25951 but expected 0
ERROR: invalid page in block 1 of relation base/21324/26267_vm
It happens reliably, every time I run the program.
Below is the whole program that I used.
Thanks,
Erik Rijkers
#!/bin/sh
t=t
if [[ 1 -eq 1 ]]; then
echo "
drop table if exists $t ;
create table $t
as
select i, cast( random() * 10^9 as integer ) as r
from generate_series(1, 1000000) as f(i) ;
analyze $t;
table $t limit 5;
select count(*) from $t;
explain analyze select min(r), max(r) from $t;
select min(r), max(r) from $t;
create index ${t}_minmax_idx on $t using minmax (r);
analyze $t;
explain analyze select min(r), max(r) from $t;
select min(r), max(r) from $t;
" | psql
fi
On Sun, Sep 15, 2013 at 5:44 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Hi,
Here's a reviewable version of what I've dubbed Minmax indexes. [...]
One thing still to tackle is when to mark ranges as unsummarized. Right
now, any new tuple on a page range would cause a new index entry to be
created and a new revmap update. This would cause huge index bloat if,
say, a page is emptied and vacuumed and filled with new tuples with
increasing values outside the original range; each new tuple would
create a new index tuple. I have two ideas about this (1. mark range as
unsummarized if 3rd time we touch the same page range;
Why only at 3rd time?
Doesn't it need to be precise? If someone inserts a row having a value
greater than the max value of the corresponding index tuple, then that
index tuple's max value needs to be updated, and I think it's updated
with the help of the validity map.
For example:
considering we need to store below info for each index tuple:
In each index tuple (corresponding to one page range), we store:
- first block this tuple applies to
- last block this tuple applies to
- for each indexed column:
* min() value across all tuples in the range
* max() value across all tuples in the range
Assume the first and last block for an index tuple are the same (say block
no. 'x'), and the min value is 5 and the max is 10.
Now a user inserts or updates a value in block 'x' such that the max value of
the indexed column becomes 11; if we don't update the corresponding index
tuple, or at least invalidate it, won't that lead to wrong results?
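(To make the concern concrete, here is a tiny standalone C sketch -- toy names
like range_summary and range_matches, not the patch's actual structures -- of
why the stored [min, max] must be widened or invalidated when an out-of-range
value is inserted: a scan that trusted a stale summary would skip the block
range and miss the new row.)

#include <stdbool.h>
#include <stdio.h>

/* Toy per-block-range summary; not the patch's real MMTuple layout. */
typedef struct
{
    bool    valid;      /* false = unsummarized: always scan the range */
    int     min;
    int     max;
} range_summary;

/* A range may hold matches for "col = val" only if the summary allows it. */
static bool
range_matches(const range_summary *s, int val)
{
    if (!s->valid)
        return true;            /* no summary: must scan the heap range */
    return val >= s->min && val <= s->max;
}

int
main(void)
{
    range_summary s = {true, 5, 10};    /* block range currently holds 5..10 */
    int           newval = 11;          /* a new row with value 11 arrives   */

    /* The summary must be widened (or invalidated) at insert time, otherwise
     * a later query for 11 would skip this block range entirely. */
    if (newval > s.max)
        s.max = newval;

    printf("range matches 11? %s\n", range_matches(&s, 11) ? "yes" : "no");
    return 0;
}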
2. vacuum the
affected index page if it's full, so we can maintain the index always up
to date without causing unduly bloat), but I haven't implemented
anything yet.
The "amcostestimate" routine is completely bogus; right now it returns
constant 0, meaning the index is always chosen if it exists.
I think for a first version you might want to keep things simple, but
there should be some way for the optimizer to select this index.
So rather than choosing it whenever it is present, we could make the optimizer
choose it only when someone sets something like enable_minmax to true.
How about keeping this up to date during foreground operations?
Vacuum or maintainer tasks that maintain such things usually have problems
with bloat, and then we need to optimize or work around those issues.
A lot of people have raised this or a similar point previously, and from what
I read you are of the opinion that it would be too slow.
I really don't think it can be so slow that adding so much handling to get it
up to date by some maintainer task is worthwhile.
Currently there are systems like Oracle where index clean-up is mainly done
during foreground operations, so this alone cannot be the reason for slowness.
Comparing the logic with index-only scans is also not entirely right, since
for IOS we need to know each tuple's visibility, which is not the case here.
It can happen that the min and max values are sometimes not accurate because
the operation is later rolled back, but I think such cases will be rare, and
we can find some way to handle them, perhaps in a maintainer task only; that
handling would be much simpler.
On Windows, the patch gives the compilation errors below:
src\backend\access\minmax\mmtuple.c(96): error C2057: expected
constant expression
src\backend\access\minmax\mmtuple.c(96): error C2466: cannot
allocate an array of constant size 0
src\backend\access\minmax\mmtuple.c(96): error C2133: 'values' : unknown size
src\backend\access\minmax\mmtuple.c(97): error C2057: expected
constant expression
src\backend\access\minmax\mmtuple.c(97): error C2466: cannot
allocate an array of constant size 0
src\backend\access\minmax\mmtuple.c(97): error C2133: 'nulls' : unknown size
src\backend\access\minmax\mmtuple.c(102): error C2057: expected
constant expression
src\backend\access\minmax\mmtuple.c(102): error C2466: cannot
allocate an array of constant size 0
src\backend\access\minmax\mmtuple.c(102): error C2133:
'phony_nullbitmap' : unknown size
src\backend\access\minmax\mmtuple.c(110): warning C4034: sizeof returns 0
src\backend\access\minmax\mmtuple.c(246): error C2057: expected
constant expression
src\backend\access\minmax\mmtuple.c(246): error C2466: cannot
allocate an array of constant size 0
src\backend\access\minmax\mmtuple.c(246): error C2133: 'values' : unknown size
src\backend\access\minmax\mmtuple.c(247): error C2057: expected
constant expression
src\backend\access\minmax\mmtuple.c(247): error C2466: cannot
allocate an array of constant size 0
src\backend\access\minmax\mmtuple.c(247): error C2133: 'allnulls' :
unknown size
src\backend\access\minmax\mmtuple.c(248): error C2057: expected
constant expression
src\backend\access\minmax\mmtuple.c(248): error C2466: cannot
allocate an array of constant size 0
src\backend\access\minmax\mmtuple.c(248): error C2133: 'hasnulls' :
unknown size
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Amit Kapila wrote:
On Sun, Sep 15, 2013 at 5:44 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
One thing still to tackle is when to mark ranges as unsummarized. Right
now, any new tuple on a page range would cause a new index entry to be
created and a new revmap update. This would cause huge index bloat if,
say, a page is emptied and vacuumed and filled with new tuples with
increasing values outside the original range; each new tuple would
create a new index tuple. I have two ideas about this (1. mark range as
unsummarized if 3rd time we touch the same page range;
Why only at 3rd time?
Doesn't it need to be precise, like if someone inserts a row having
value greater than max value of corresponding index tuple,
then that index tuple's corresponding max value needs to be updated
and I think its updated with the help of validity map.
Of course. Note I no longer have the concept of a validity map; I have
switched things to use a "reverse range map", or revmap for short. The
revmap is responsible for mapping each page number to an individual
index TID. If the TID stored in the revmap is InvalidTid, that means
the range is not summarized. Unsummarized ranges are always considered to
match the query quals, and thus all tuples in them are returned in the bitmap
for heap recheck.
The way it works currently is that any tuple insert (that's outside the
bounds of the current index tuple) causes a new index tuple to be
created, and the revmap is updated to point to the new index tuple. The
old index tuple is orphaned and will be deleted at next vacuum. This
works fine. However the problem is excess orphaned tuples; I don't want
a long series of updates to create many orphaned dead tuples. Instead I
would like the system to, at some point, stop creating new index tuples
and instead set the revmap to InvalidTid. That would stop the index
bloat.
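(To illustrate the revmap idea, here is a minimal standalone C model -- with
invented names such as toy_revmap and toy_tid, not the real API, which keeps
the map in a separate relation fork and WAL-logs every update: each heap block
range maps to one index-tuple TID, an invalid TID means the range is
unsummarized and must always be treated as matching, and repointing an entry
is what leaves the previous index tuple orphaned until vacuum removes it.)

#include <stdbool.h>
#include <stdio.h>

#define PAGES_PER_RANGE 128
#define MAX_RANGES      16

/* Toy TID: (block, offset); offset 0 plays the role of InvalidTid. */
typedef struct { unsigned blk; unsigned off; } toy_tid;

static const toy_tid invalid_tid = {0, 0};

static bool
tid_is_valid(toy_tid t)
{
    return t.off != 0;
}

/* Toy reverse range map: one entry per heap block range. */
typedef struct { toy_tid map[MAX_RANGES]; } toy_revmap;

/* Which revmap slot covers a given heap block. */
static unsigned
range_of(unsigned heap_blk)
{
    return heap_blk / PAGES_PER_RANGE;
}

/* Point a range at a (new) index tuple; the old one becomes orphaned. */
static void
revmap_set(toy_revmap *rm, unsigned heap_blk, toy_tid idx_tid)
{
    rm->map[range_of(heap_blk)] = idx_tid;
}

/* Unsummarize: scans must now treat the whole range as a match. */
static void
revmap_clear(toy_revmap *rm, unsigned heap_blk)
{
    rm->map[range_of(heap_blk)] = invalid_tid;
}

int
main(void)
{
    toy_revmap rm = {{{0, 0}}};
    toy_tid    first = {3, 1}, second = {3, 2};

    revmap_set(&rm, 200, first);    /* range 1 summarized by tuple (3,1) */
    revmap_set(&rm, 200, second);   /* re-point; (3,1) is now orphaned   */
    revmap_clear(&rm, 200);         /* give up: range always matches     */

    printf("range 1 summarized? %s\n",
           tid_is_valid(rm.map[1]) ? "yes" : "no");
    return 0;
}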
For example:
considering we need to store below info for each index tuple:
In each index tuple (corresponding to one page range), we store:
- first block this tuple applies to
- last block this tuple applies to
- for each indexed column:
* min() value across all tuples in the range
* max() value across all tuples in the range
Assume first and last block for index tuple is same (assume block
no. 'x') and min value is 5 and max is 10.
Now user insert/update value in block 'x' such that max value of
index col. is 11, if we don't update corresponding
index tuple or at least invalidate it, won't it lead to wrong results?
Sure, that would result in wrong results. Fortunately that's not how I
am suggesting to do it.
I note you're reading an old version of the design. I realize now that
this is my mistake because instead of posting the new design in the
cover letter for the patch, I only put it in the "minmax-proposal" file.
Please give that file a read to see how the design differs from the
design I originally posted in the old thread.
The "amcostestimate" routine is completely bogus; right now it returns
constant 0, meaning the index is always chosen if it exists.
I think for first version, you might want to keep things simple, but
there should be some way for optimizer to select this index.
So rather than choose if it is present, we can make optimizer choose
when some-one says set enable_minmax index to true.
Well, enable_bitmapscan already disables minmax indexes, just like it
disables other indexes.
How about keeping this up-to-date during foreground operations.
Vacuum/Maintainer task maintaining things usually have problems of
bloat and
then we need optimize/workaround issues.
Lot of people have raised this or similar point previously and what
I read you are of opinion that it seems to be slow.
Well, the current code does keep the index up to date -- I did choose to
implement what people suggested :-)
Now it can so happen that min and max values are sometimes not right
because later the operation is rolled back, but I think such cases
will
be less and we can find some way to handle such cases may be
maintainer task only, but the handling will be quite simpler.
Agreed.
On Windows, patch gives below compilation errors:
src\backend\access\minmax\mmtuple.c(96): error C2057: expected
constant expression
I have fixed all these compile errors (fix attached). Thanks for
reporting them. I'll post a new version shortly.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
minmax-5a-incr.patch (text/x-diff; charset=us-ascii)
*** a/src/backend/access/minmax/mmtuple.c
--- b/src/backend/access/minmax/mmtuple.c
***************
*** 93,117 **** MMTuple *
minmax_form_tuple(TupleDesc idxDsc, TupleDesc diskDsc, DeformedMMTuple *tuple,
Size *size)
{
! Datum values[diskDsc->natts];
! bool nulls[diskDsc->natts];
bool anynulls = false;
MMTuple *rettuple;
int keyno;
uint16 phony_infomask;
! bits8 phony_nullbitmap[BITMAPLEN(diskDsc->natts)];
Size len,
hoff,
data_len;
/*
* Set up the values/nulls arrays for heap_fill_tuple
*/
- MemSet(nulls, 0, sizeof(nulls));
for (keyno = 0; keyno < idxDsc->natts; keyno++)
{
! AttrNumber idxattno = keyno * 2;
if (tuple->values[keyno].allnulls)
{
nulls[idxattno] = true;
--- 93,126 ----
minmax_form_tuple(TupleDesc idxDsc, TupleDesc diskDsc, DeformedMMTuple *tuple,
Size *size)
{
! Datum *values;
! bool *nulls;
bool anynulls = false;
MMTuple *rettuple;
int keyno;
uint16 phony_infomask;
! bits8 *phony_nullbitmap;
Size len,
hoff,
data_len;
+ Assert(diskDsc->natts > 0);
+
+ values = palloc(sizeof(Datum) * diskDsc->natts);
+ nulls = palloc0(sizeof(bool) * diskDsc->natts);
+ phony_nullbitmap = palloc(sizeof(bits8) * BITMAPLEN(diskDsc->natts));
+
/*
* Set up the values/nulls arrays for heap_fill_tuple
*/
for (keyno = 0; keyno < idxDsc->natts; keyno++)
{
! int idxattno = keyno * 2;
+ /*
+ * "allnulls" is set when there's no nonnull value in any row in
+ * the column; set the nullable bits for both min and max attrs.
+ */
if (tuple->values[keyno].allnulls)
{
nulls[idxattno] = true;
***************
*** 168,173 **** minmax_form_tuple(TupleDesc idxDsc, TupleDesc diskDsc, DeformedMMTuple *tuple,
--- 177,187 ----
&phony_infomask,
phony_nullbitmap);
+ /* done with these */
+ pfree(values);
+ pfree(nulls);
+ pfree(phony_nullbitmap);
+
/*
* Now fill in the real null bitmasks. allnulls first.
*/
***************
*** 243,251 **** DeformedMMTuple *
minmax_deform_tuple(TupleDesc tupdesc, MMTuple *tuple)
{
DeformedMMTuple *dtup;
! Datum values[tupdesc->natts * 2];
! bool allnulls[tupdesc->natts];
! bool hasnulls[tupdesc->natts];
char *tp;
bits8 *nullbits = NULL;
int keyno;
--- 257,265 ----
minmax_deform_tuple(TupleDesc tupdesc, MMTuple *tuple)
{
DeformedMMTuple *dtup;
! Datum *values;
! bool *allnulls;
! bool *hasnulls;
char *tp;
bits8 *nullbits = NULL;
int keyno;
***************
*** 253,258 **** minmax_deform_tuple(TupleDesc tupdesc, MMTuple *tuple)
--- 267,276 ----
dtup = palloc(offsetof(DeformedMMTuple, values) +
sizeof(MMValues) * tupdesc->natts);
+ values = palloc(sizeof(Datum) * tupdesc->natts * 2);
+ allnulls = palloc(sizeof(bool) * tupdesc->natts);
+ hasnulls = palloc(sizeof(bool) * tupdesc->natts);
+
tp = (char *) tuple + MMTupleDataOffset(tuple);
if (MMTupleHasNulls(tuple))
***************
*** 277,282 **** minmax_deform_tuple(TupleDesc tupdesc, MMTuple *tuple)
--- 295,304 ----
dtup->values[keyno].allnulls = false;
}
+ pfree(values);
+ pfree(allnulls);
+ pfree(hasnulls);
+
return dtup;
}
***************
*** 293,298 **** minmax_deform_tuple(TupleDesc tupdesc, MMTuple *tuple)
--- 315,322 ----
* values output values, size 2 * natts (alternates min and max)
* allnulls output "allnulls", size natts
* hasnulls output "hasnulls", size natts
+ *
+ * Output arrays are allocated by caller.
*/
static inline void
mm_deconstruct_tuple(char *tp, bits8 *nullbits, bool nulls,
Erik Rijkers wrote:
After a --data-checksums initdb (successful), the following error came up:
after the statement: create index t_minmax_idx on t using minmax (r);
WARNING: page verification failed, calculated checksum 25951 but expected 0
ERROR: invalid page in block 1 of relation base/21324/26267_vm
It happens reliably, every time I run the program.
Thanks for the report. That's fixed with the attached.
Below is the whole program that I used.
Hmm, this test program shows that you're trying to use the index to
optimize min() and max() queries, but that's not what these indexes do.
You will need to use operators > >= = <= < (or BETWEEN, which is the
same thing) to see your index in action.
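For instance, with a hypothetical table t and a minmax index on column r:

    -- these quals can use the index (bitmap scan plus recheck):
    SELECT * FROM t WHERE r > 100 AND r <= 200;
    SELECT * FROM t WHERE r BETWEEN 100 AND 200;

    -- these are not helped; min()/max() aggregates do not use minmax indexes:
    SELECT min(r), max(r) FROM t;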
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
minmax-5b-incr.patch (text/x-diff; charset=us-ascii)
commit 762ebb8f6ecfb36b9976bb9aa87d0a7f8fb601b4
Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Wed Sep 25 17:03:28 2013 -0300
Set checksum to new revmap pages
per Erik Rijkers
diff --git a/src/backend/access/minmax/mmrevmap.c b/src/backend/access/minmax/mmrevmap.c
index 43bff95..5ff5ca2 100644
--- a/src/backend/access/minmax/mmrevmap.c
+++ b/src/backend/access/minmax/mmrevmap.c
@@ -323,6 +323,7 @@ mmRevmapExtend(mmRevmapAccess *rmAccess, BlockNumber blkno)
while (blkno >= rmAccess->physPagesInRevmap)
{
extended = true;
+ PageSetChecksumInplace(page, blkno);
smgrextend(rmAccess->idxrel->rd_smgr, MM_REVMAP_FORKNUM,
rmAccess->physPagesInRevmap, page, false);
rmAccess->physPagesInRevmap++;
Here's an updated version of this patch, with fixes to all the bugs
reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and
Amit Kapila for the reports.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
minmax-5.patch (text/x-diff; charset=us-ascii)
*** a/contrib/pageinspect/Makefile
--- b/contrib/pageinspect/Makefile
***************
*** 1,7 ****
# contrib/pageinspect/Makefile
MODULE_big = pageinspect
! OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o
EXTENSION = pageinspect
DATA = pageinspect--1.1.sql pageinspect--1.0--1.1.sql \
--- 1,7 ----
# contrib/pageinspect/Makefile
MODULE_big = pageinspect
! OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o mmfuncs.o
EXTENSION = pageinspect
DATA = pageinspect--1.1.sql pageinspect--1.0--1.1.sql \
*** /dev/null
--- b/contrib/pageinspect/mmfuncs.c
***************
*** 0 ****
--- 1,217 ----
+ /*
+ * mmfuncs.c
+ * Functions to investigate MinMax indexes
+ *
+ * Copyright (c) 2013, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/pageinspect/mmfuncs.c
+ */
+ #include "postgres.h"
+
+ #include "access/htup_details.h"
+ #include "access/minmax.h"
+ #include "access/minmax_tuple.h"
+ #include "catalog/index.h"
+ #include "funcapi.h"
+ #include "utils/builtins.h"
+ #include "utils/lsyscache.h"
+ #include "utils/rel.h"
+ #include "miscadmin.h"
+
+ Datum minmax_page_items(PG_FUNCTION_ARGS);
+
+ PG_FUNCTION_INFO_V1(minmax_page_items);
+
+ typedef struct mm_page_state
+ {
+ TupleDesc tupdesc;
+ Page page;
+ OffsetNumber offset;
+ bool unusedItem;
+ bool done;
+ AttrNumber attno;
+ DeformedMMTuple *dtup;
+ FmgrInfo outputfn[FLEXIBLE_ARRAY_MEMBER];
+ } mm_page_state;
+
+ /*
+ * Extract all item values from a minmax index page
+ *
+ * Usage: SELECT * FROM minmax_page_items(get_raw_page('idx', 1), 'idx'::regclass);
+ */
+ Datum
+ minmax_page_items(PG_FUNCTION_ARGS)
+ {
+ mm_page_state *state;
+ FuncCallContext *fctx;
+
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ (errmsg("must be superuser to use raw page functions"))));
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Oid indexRelid = PG_GETARG_OID(1);
+ int raw_page_size;
+ TupleDesc tupdesc;
+ MemoryContext mctx;
+ Relation indexRel;
+ AttrNumber attno;
+
+ raw_page_size = VARSIZE(raw_page) - VARHDRSZ;
+
+ if (raw_page_size < SizeOfPageHeaderData)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("input page too small (%d bytes)", raw_page_size)));
+
+ /* create a function context for cross-call persistence */
+ fctx = SRF_FIRSTCALL_INIT();
+
+ /* switch to memory context appropriate for multiple function calls */
+ mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ indexRel = index_open(indexRelid, AccessShareLock);
+
+ state = palloc(offsetof(mm_page_state, outputfn) +
+ sizeof(FmgrInfo) * RelationGetDescr(indexRel)->natts);
+
+ state->tupdesc = CreateTupleDescCopy(RelationGetDescr(indexRel));
+ state->page = VARDATA(raw_page);
+ state->offset = FirstOffsetNumber;
+ state->unusedItem = false;
+ state->done = false;
+ state->dtup = NULL;
+
+ index_close(indexRel, AccessShareLock);
+
+ for (attno = 1; attno <= state->tupdesc->natts; attno++)
+ {
+ Oid output;
+ bool isVarlena;
+
+ getTypeOutputInfo(state->tupdesc->attrs[attno - 1]->atttypid,
+ &output, &isVarlena);
+ fmgr_info(output, &state->outputfn[attno - 1]);
+ }
+
+ fctx->user_fctx = state;
+ fctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+ MemoryContextSwitchTo(mctx);
+ }
+
+ fctx = SRF_PERCALL_SETUP();
+ state = fctx->user_fctx;
+
+ if (!state->done)
+ {
+ HeapTuple result;
+ Datum values[6];
+ bool nulls[6];
+
+ /*
+ * This loop is called once for every attribute of every tuple in the
+ * page. At the start of a tuple, we get a NULL dtup; that's our
+ * signal for obtaining and decoding the next one. If that's not the
+ * case, we output the next attribute.
+ */
+ if (state->dtup == NULL)
+ {
+ MMTuple *tup;
+ MemoryContext mctx;
+ ItemId itemId;
+
+ /* deformed tuple must live across calls */
+ mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx);
+
+ /* verify item status: if there's no data, we can't decode */
+ itemId = PageGetItemId(state->page, state->offset);
+ if (ItemIdIsUsed(itemId))
+ {
+ tup = (MMTuple *) PageGetItem(state->page,
+ PageGetItemId(state->page,
+ state->offset));
+ state->dtup = minmax_deform_tuple(state->tupdesc, tup);
+ state->attno = 1;
+ state->unusedItem = false;
+ }
+ else
+ state->unusedItem = true;
+
+ MemoryContextSwitchTo(mctx);
+ }
+ else
+ state->attno++;
+
+ MemSet(nulls, 0, sizeof(nulls));
+
+ if (state->unusedItem)
+ {
+ values[0] = UInt16GetDatum(state->offset);
+ nulls[1] = true;
+ nulls[2] = true;
+ nulls[3] = true;
+ nulls[4] = true;
+ nulls[5] = true;
+ }
+ else
+ {
+ int att = state->attno - 1;
+
+ values[0] = UInt16GetDatum(state->offset);
+ values[1] = UInt16GetDatum(state->attno);
+ values[2] = BoolGetDatum(state->dtup->values[att].allnulls);
+ values[3] = BoolGetDatum(state->dtup->values[att].hasnulls);
+ if (!state->dtup->values[att].allnulls)
+ {
+ FmgrInfo *outputfn = &state->outputfn[att];
+ MMValues *mmvalues = &state->dtup->values[att];
+
+ values[4] = CStringGetTextDatum(OutputFunctionCall(outputfn,
+ mmvalues->min));
+ values[5] = CStringGetTextDatum(OutputFunctionCall(outputfn,
+ mmvalues->max));
+ }
+ else
+ {
+ nulls[4] = true;
+ nulls[5] = true;
+ }
+ }
+
+ result = heap_form_tuple(fctx->tuple_desc, values, nulls);
+
+ /*
+ * If the item was unused, jump straight to the next one; otherwise,
+ * the only cleanup needed here is to set our signal to go to the next
+ * tuple in the following iteration, by freeing the current one.
+ */
+ if (state->unusedItem)
+ state->offset = OffsetNumberNext(state->offset);
+ else if (state->attno >= state->tupdesc->natts)
+ {
+ pfree(state->dtup);
+ state->dtup = NULL;
+ state->offset = OffsetNumberNext(state->offset);
+ }
+
+ /*
+ * If we're beyond the end of the page, set flag to end the function in
+ * the following iteration.
+ */
+ if (state->offset > PageGetMaxOffsetNumber(state->page))
+ state->done = true;
+
+ SRF_RETURN_NEXT(fctx, HeapTupleGetDatum(result));
+ }
+
+ SRF_RETURN_DONE(fctx);
+ }
*** a/contrib/pageinspect/pageinspect--1.1.sql
--- b/contrib/pageinspect/pageinspect--1.1.sql
***************
*** 99,104 **** AS 'MODULE_PATHNAME', 'bt_page_items'
--- 99,118 ----
LANGUAGE C STRICT;
--
+ -- minmax_page_items()
+ --
+ CREATE FUNCTION minmax_page_items(IN page bytea, IN index_oid oid,
+ OUT itemoffset int,
+ OUT attnum int,
+ OUT allnulls bool,
+ OUT hasnulls bool,
+ OUT min text,
+ OUT max text)
+ RETURNS SETOF record
+ AS 'MODULE_PATHNAME', 'minmax_page_items'
+ LANGUAGE C STRICT;
+
+ --
-- fsm_page_contents()
--
CREATE FUNCTION fsm_page_contents(IN page bytea)
*** a/contrib/pg_xlogdump/rmgrdesc.c
--- b/contrib/pg_xlogdump/rmgrdesc.c
***************
*** 13,18 ****
--- 13,19 ----
#include "access/gist_private.h"
#include "access/hash.h"
#include "access/heapam_xlog.h"
+ #include "access/minmax_xlog.h"
#include "access/multixact.h"
#include "access/nbtree.h"
#include "access/rmgr.h"
*** /dev/null
--- b/minmax-proposal
***************
*** 0 ****
--- 1,300 ----
+ Minmax Range Indexes
+ ====================
+
+ Minmax indexes are a new access method intended to enable very fast scanning of
+ extremely large tables.
+
+ The essential idea of a minmax index is to keep track of the min() and max()
+ values in consecutive groups of heap pages (page ranges). These values can be
+ used by constraint exclusion to avoid scanning such pages, depending on query
+ quals.
+
+ The main drawback of this is having to update the stored min/max values of each
+ page range as tuples are inserted into them.
+
+ Other database systems already have this feature. Some examples:
+
+ * Oracle Exadata calls this "storage indexes"
+ http://richardfoote.wordpress.com/category/storage-indexes/
+
+ * Netezza has "zone maps"
+ http://nztips.com/2010/11/netezza-integer-join-keys/
+
+ * Infobright has this automatically within their "data packs"
+ http://www.infobright.org/Blog/Entry/organizing_data_and_more_about_rough_data_contest/
+
+ * MonetDB seems to have it
+ http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.2662
+ "Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS"
+
+ Index creation
+ --------------
+
+ To create a minmax index, we use the standard wording:
+
+ CREATE INDEX foo_minmax_idx ON foo USING MINMAX (a, b, e);
+
+ Partial indexes are not supported; since an index is concerned with minimum and
+ maximum values of the involved columns across all the pages in the table, it
+ doesn't make sense to exclude values. Another way to see "partial" indexes
+ here would be those that only considered some pages in the table instead of all
+ of them; but this would be difficult to implement and manage and, most likely,
+ pointless.
+
+ Expressional indexes can probably be supported in the future, but we disallow
+ them initially for conceptual simplicity.
+
+ Having multiple minmax indexes in the same table is acceptable, though most of
+ the time it would make more sense to have a single index covering all the
+ interesting columns. Multiple indexes might be useful for columns added later.
+
+ Access Method Design
+ --------------------
+
+ Since item pointers are not stored inside indexes of this type, it is not
+ possible to support the amgettuple interface. Instead, we only provide
+ amgetbitmap support; scanning a relation using this index requires a recheck
+ node on top. The amgetbitmap routine would return a TIDBitmap comprising all
+ the pages in those page groups that match the query qualifications; the recheck
+ node prunes tuples that are not visible per snapshot and those that are not
+ visible per query quals.
+
+ For each supported datatype, we need an opclass with the following catalog
+ entries:
+
+ - support operators (pg_amop): same as btree (<, <=, =, >=, >)
+
+ These operators are used pervasively:
+
+ - The optimizer requires them to evaluate queries, so that the index is chosen
+ when queries on the indexed table are planned.
+ - During index construction (ambuild), they are used to determine the boundary
+ values for each page range.
+ - During index updates (aminsert), they are used to determine whether the new
+ heap tuple matches the existing index tuple; and if not, they are used to
+ construct the new index tuple.
+
+ In each index tuple (corresponding to one page range), we store:
+ - for each indexed column:
+ * minimum value across all tuples in the range
+ * maximum value across all tuples in the range
+ * are there nulls present in any tuple?
+ * are null all the values in all tuples in the range?
+
+ These null bits are stored in a single null bitmask of length 2x number of
+ columns.
+
+ With the default INDEX_MAX_KEYS of 32, and considering columns of 8-byte length
+ types such as timestamptz or bigint, each tuple would be 522 bytes in length,
+ which seems reasonable. There are 6 extra bytes for padding between the null
+ bitmask and the first data item, assuming 64-bit alignment; so the total size
+ for such an index would actually be 528 bytes.
+
+ This maximum index tuple size is calculated as: mt_info (2 bytes) + null bitmap
+ (8 bytes) + data value (8 bytes) * 32 * 2
+
+ (Of course, larger columns are possible, such as varchar, but creating minmax
+ indexes on such columns seems of little practical usefulness. Also, the
+ usefulness of an index containing so many columns is dubious, at best.)
+
+ There can be gaps where some pages have no covering index entry. In particular,
+ the last few pages of the table would commonly not be summarized.
+
+ The Range Reverse Map
+ ---------------------
+
+ To find out the index tuple for a particular page range, we have a
+ separate fork called the range reverse map. This fork stores one TID per
+ range, which is the address of the index tuple summarizing that range. Since
+ these map entries are fixed size, it is possible to compute the address of the
+ range map entry for any given heap page.
+
+ When a new heap tuple is inserted in a summarized page range, it is possible to
+ compare the existing index tuple with the new heap tuple. If the heap tuple is
+ outside the minimum/maximum boundaries given by the index tuple for any indexed
+ column (or if the new heap tuple contains null values but the index tuple
+ indicate there are no nulls), it is necessary to create a new index tuple with
+ the new values. To do this, a new index tuple is inserted, and the reverse range
+ map is updated to point to it. The old index tuple is left in place, for later
+ garbage collection.
+
+ If the reverse range map points to an invalid TID, the corresponding page range
+ is not summarized.
+
+ A minmax index is updated by creating a new summary tuple whenever an
+ insertion outside the min-max interval occurs in the pages within the range.
+
+ To scan a table following a minmax index, we scan the reverse range map
+ sequentially. This yields index tuples in ascending page range order.
+ Query quals are matched to each index tuple; if they match, each page within
+ the page range is returned as part of the output TID bitmap. If there's no
+ match, they are skipped. Reverse range map entries returning invalid index
+ TIDs, that is unsummarized page ranges, are also returned in the TID bitmap.
+
+ To store the range reverse map, we reuse the VISIBILITYMAP_FORKNUM, since a VM
+ does not make sense for a minmax index anyway (XXX -- really??)
+
+ When tuples are added to unsummarized pages, nothing needs to happen.
+
+ Heap tuples can be removed from anywhere without restriction.
+
+ Index entries that are not referenced from the revmap can be removed from the
+ main fork. This currently happens at amvacuumcleanup, though it could be
+ carried out separately; no heap scan is necessary to determine which tuples
+ are unreachable.
+
+ Summarization
+ -------------
+
+ At index creation time, the whole table is scanned; for each page range the
+ minimum and maximum values of each indexed column and nulls bitmap are
+ collected and stored in the index. The possibly-incomplete range at the end
+ of the table is not included.
+
+ Once in a while, it is necessary to summarize a bunch of unsummarized pages
+ (because the table has grown since the index was created), or re-summarize a
+ range that has been marked invalid. This is simple: scan the page range
+ calculating the min() and max() for each indexed column, then insert the new
+ index entry at the end of the index. The main interesting questions are:
+
+ a) when to do it
+ The perfect time to do it is as soon as a complete page range of the
+ configured range size has been filled.
+
+ b) who does it (what process)
+ It doesn't seem a good idea to have a client-connected process do it;
+ it would incur unwanted latency. Three other options are (i) to spawn a
+ specialized process to do it, which perhaps can be signalled by a
+ client-connected process that executes a scan and notices the need to run
+ summarization; or (ii) to let autovacuum do it, as a separate new
+ maintenance task. This seems simple enough to bolt on top of already
+ existing autovacuum infrastructure. The timing constraints of autovacuum
+ might be undesirable, though. (iii) wait for user command.
+
+ The easiest way to go around this seems to have vacuum do it. That way we can
+ simply do re-summarization on the amvacuumcleanup routine. Other answers would
+ mean we need a separate AM routine, which appears unwarranted at this stage.
+
+ Vacuuming
+ ---------
+
+ Vacuuming a table that has a minmax index does not represent a significant
+ challenge. Since no heap TIDs are stored, it's not necessary to scan the index
+ when heap tuples are removed. It might be that some min() value can be
+ incremented, or some max() value can be decremented; but this would represent
+ an optimization opportunity only, not a correctness issue. Perhaps it's
+ simpler to represent this as the need to re-run summarization on the affected
+ page range.
+
+ Note that if there are no indexes on the table other than the minmax index,
+ usage of maintenance_work_mem by vacuum can be decreased significantly, because
+ no detailed index scan needs to take place (and thus it's not necessary for
+ vacuum to save TIDs to remove). This optimization opportunity is best left for
+ future improvement.
+
+ Locking considerations
+ ----------------------
+
+ To read the TID during an index scan, we follow this protocol:
+
+ * read revmap page
+ * obtain share lock on the revmap buffer
+ * read the TID
+ * obtain share lock on buffer of main fork
+ * LockTuple the TID (using the index as relation). A shared lock is
+ sufficient. We need the LockTuple to prevent VACUUM from recycling
+ the index tuple; see below.
+ * release revmap buffer lock
+ * read the index tuple
+ * release the tuple lock
+ * release main fork buffer lock
+
+
+ To update the summary tuple for a page range, we use this protocol:
+
+ * insert a new index tuple somewhere in the main fork; note its TID
+ * read revmap page
+ * obtain exclusive lock on revmap buffer
+ * write the TID
+ * release lock
+
+ This ensures no concurrent reader can obtain a partially-written TID.
+ Note we don't need a tuple lock here. Concurrent scans don't have to
+ worry about whether they got the old or new index tuple: if they get the
+ old one, the tighter values are okay from a correctness standpoint because
+ due to MVCC they can't possibly see the just-inserted heap tuples anyway.
+
+
+ For vacuuming, we need to figure out which index tuples are no longer
+ referenced from the reverse range map. This requires some brute force,
+ but is simple:
+
+ 1) scan the complete index, store each existing TIDs in a dynahash.
+ Hash key is the TID, hash value is a boolean initially set to false.
+ 2) scan the complete revmap sequentially, read the TIDs on each page. Share
+ lock on each page is sufficient. For each TID so obtained, grab the
+ element from the hash and update the boolean to true.
+ 3) Scan the index again; for each tuple found, search the hash table.
+ If the tuple is not present in hash, it must have been added after our
+ initial scan; ignore it. If tuple is present in hash, and the hash flag is
+ true, then the tuple is referenced from the revmap; ignore it. If the hash
+ flag is false, then the index tuple is no longer referenced by the revmap;
+ but it could be about to be accessed by a concurrent scan. Do
+ ConditionalLockTuple. If this fails, ignore the tuple (it's in use), it
+ will be deleted by a future vacuum. If lock is acquired, then we can safely
+ remove the index tuple.
+ 4) Index pages with free space can be detected by this second scan. Register
+ those with the FSM.
+
+ Note this doesn't require scanning the heap at all, or being involved in
+ the heap's cleanup procedure. Also, there is no need to LockBufferForCleanup,
+ which is a nice property because index scans keep pages pinned for long
+ periods.
+
+
+
+ Optimizer
+ ---------
+
+ In order to make this all work, the only thing we need to do is ensure we have a
+ good enough opclass and amcostestimate. With this, the optimizer is able to pick
+ up the index on its own.
+
+
+ Open questions
+ --------------
+
+ * Same-size page ranges?
+ Current related literature seems to consider that each "index entry" in a
+ minmax index must cover the same number of pages. There doesn't seem to be a
+ hard reason for this to be so; it might make sense to allow the index to
+ self-tune so that some index entries cover smaller page ranges, if this allows
+ the min()/max() values to be more compact. This would incur larger minmax
+ overhead for the index itself, but might allow better pruning of page ranges
+ during scan. In the limit of one index tuple per page, the index itself would
+ occupy too much space, even though we would be able to skip reading the most
+ heap pages, because the min()/max() ranges are tight; in the opposite limit of
+ a single tuple that summarizes the whole table, we wouldn't be able to prune
+ anything even though the index is very small. This can probably be made to work
+ by using the reverse range map as an index in itself.
+
+ * More compact representation for TIDBitmap?
+ TIDBitmap is the structure used to represent bitmap scans. The
+ representation of lossy page ranges is not optimal for our purposes, because
+ it uses a Bitmapset to represent pages in the range; since we're going to return
+ all pages in a large range, it might be more convenient to allow for a
+ struct that uses start and end page numbers to represent the range, instead.
+
+
+
+ References:
+
+ Email thread on pgsql-hackers
+ http://www.postgresql.org/message-id/1199296574.7260.149.camel@ebony.site
+ From: Simon Riggs
+ To: pgsql-hackers
+ Subject: Dynamic Partitioning using Segment Visibility Map
+
+ http://wiki.postgresql.org/wiki/Segment_Exclusion
+ http://wiki.postgresql.org/wiki/Segment_Visibility_Map
+
*** a/src/backend/access/Makefile
--- b/src/backend/access/Makefile
***************
*** 8,13 **** subdir = src/backend/access
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
! SUBDIRS = common gin gist hash heap index nbtree rmgrdesc spgist transam
include $(top_srcdir)/src/backend/common.mk
--- 8,13 ----
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
! SUBDIRS = common gin gist hash heap index minmax nbtree rmgrdesc spgist transam
include $(top_srcdir)/src/backend/common.mk
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 268,273 **** initscan(HeapScanDesc scan, ScanKey key, bool is_rescan)
--- 268,275 ----
scan->rs_startblock = 0;
}
+ scan->rs_initblock = 0;
+ scan->rs_numblocks = InvalidBlockNumber;
scan->rs_inited = false;
scan->rs_ctup.t_data = NULL;
ItemPointerSetInvalid(&scan->rs_ctup.t_self);
***************
*** 293,298 **** initscan(HeapScanDesc scan, ScanKey key, bool is_rescan)
--- 295,308 ----
pgstat_count_heap_scan(scan->rs_rd);
}
+ void
+ heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk, BlockNumber numBlks)
+ {
+ scan->rs_startblock = startBlk;
+ scan->rs_initblock = startBlk;
+ scan->rs_numblocks = numBlks;
+ }
+
/*
* heapgetpage - subroutine for heapgettup()
*
***************
*** 634,640 **** heapgettup(HeapScanDesc scan,
*/
if (backward)
{
! finished = (page == scan->rs_startblock);
if (page == 0)
page = scan->rs_nblocks;
page--;
--- 644,651 ----
*/
if (backward)
{
! finished = --scan->rs_numblocks <= 0 ||
! (page == scan->rs_startblock);
if (page == 0)
page = scan->rs_nblocks;
page--;
***************
*** 644,650 **** heapgettup(HeapScanDesc scan,
page++;
if (page >= scan->rs_nblocks)
page = 0;
! finished = (page == scan->rs_startblock);
/*
* Report our new scan position for synchronization purposes. We
--- 655,662 ----
page++;
if (page >= scan->rs_nblocks)
page = 0;
! finished = --scan->rs_numblocks <= 0 ||
! (page == scan->rs_startblock);
/*
* Report our new scan position for synchronization purposes. We
***************
*** 895,901 **** heapgettup_pagemode(HeapScanDesc scan,
*/
if (backward)
{
! finished = (page == scan->rs_startblock);
if (page == 0)
page = scan->rs_nblocks;
page--;
--- 907,913 ----
*/
if (backward)
{
! finished = --scan->rs_numblocks <= 0 || page == scan->rs_startblock;
if (page == 0)
page = scan->rs_nblocks;
page--;
***************
*** 905,911 **** heapgettup_pagemode(HeapScanDesc scan,
page++;
if (page >= scan->rs_nblocks)
page = 0;
! finished = (page == scan->rs_startblock);
/*
* Report our new scan position for synchronization purposes. We
--- 917,923 ----
page++;
if (page >= scan->rs_nblocks)
page = 0;
! finished = --scan->rs_numblocks <= 0 || page == scan->rs_startblock;
/*
* Report our new scan position for synchronization purposes. We
*** /dev/null
--- b/src/backend/access/minmax/Makefile
***************
*** 0 ****
--- 1,17 ----
+ #-------------------------------------------------------------------------
+ #
+ # Makefile--
+ # Makefile for access/minmax
+ #
+ # IDENTIFICATION
+ # src/backend/access/minmax/Makefile
+ #
+ #-------------------------------------------------------------------------
+
+ subdir = src/backend/access/minmax
+ top_builddir = ../../../..
+ include $(top_builddir)/src/Makefile.global
+
+ OBJS = minmax.o mmrevmap.o mmtuple.o mmxlog.o
+
+ include $(top_srcdir)/src/backend/common.mk
*** /dev/null
--- b/src/backend/access/minmax/minmax.c
***************
*** 0 ****
--- 1,1521 ----
+ /*
+ * minmax.c
+ * Implementation of Minmax indexes for Postgres
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/minmax.c
+ *
+ * TODO
+ * * do we need to reserve special space on pages?
+ * * support collatable datatypes
+ * * on heap insert, we always create a new index entry. Need to mark
+ * range as unsummarized at some point, to avoid index bloat?
+ * * index truncation on vacuum?
+ * * datumCopy() is needed in several places?
+ */
+ #include "postgres.h"
+
+ #include "access/htup_details.h"
+ #include "access/minmax.h"
+ #include "access/minmax_internal.h"
+ #include "access/minmax_revmap.h"
+ #include "access/minmax_tuple.h"
+ #include "access/minmax_xlog.h"
+ #include "access/relscan.h"
+ #include "access/xlogutils.h"
+ #include "catalog/index.h"
+ #include "catalog/pg_operator.h"
+ #include "commands/vacuum.h"
+ #include "miscadmin.h"
+ #include "pgstat.h"
+ #include "storage/bufmgr.h"
+ #include "storage/freespace.h"
+ #include "storage/lmgr.h"
+ #include "utils/datum.h"
+ #include "utils/lsyscache.h"
+ #include "utils/memutils.h"
+ #include "utils/syscache.h"
+
+
+ /*
+ * We use a MMBuildState during initial construction of a Minmax index.
+ * Within that struct, each column's contruction info is represented by a
+ * MMPerColBuildInfo struct. The running state is all kept in a
+ * DeformedMMTuple.
+ */
+ typedef struct MMPerColBuildInfo
+ {
+ AttrNumber heapAttno;
+ int typLen;
+ bool typByVal;
+ FmgrInfo lt;
+ FmgrInfo gt;
+ } MMPerColBuildInfo;
+
+ typedef struct MMBuildState
+ {
+ Relation irel;
+ int numtuples;
+ Buffer currentInsertBuf;
+ BlockNumber currRangeStart;
+ BlockNumber nextRangeAt;
+ mmRevmapAccess *rmAccess;
+ TupleDesc indexDesc;
+ TupleDesc diskDesc;
+ DeformedMMTuple *dtuple;
+ MMPerColBuildInfo perColState[FLEXIBLE_ARRAY_MEMBER];
+ } MMBuildState;
+
+ static void mmbuildCallback(Relation index,
+ HeapTuple htup, Datum *values, bool *isnull,
+ bool tupleIsAlive, void *state);
+ static void get_mm_operator(Oid opfam, Oid idxtypid, Oid keytypid,
+ StrategyNumber strategy, FmgrInfo *finfo);
+ static inline bool invoke_mm_operator(FmgrInfo *operator, Oid collation,
+ Datum left, Datum right);
+ static void mm_doinsert(Relation idxrel, mmRevmapAccess *rmAccess,
+ Buffer *buffer, BlockNumber heapblkno, MMTuple *tup, Size itemsz);
+ static Buffer mm_getnewbuffer(Relation irel);
+ static bool mm_getinsertbuffer(Relation irel, Buffer *buffer, Size itemsz);
+
+
+ #define MINMAX_PAGES_PER_RANGE 2
+
+
+ /*
+ * A tuple in the heap is being inserted. To keep a minmax index up to date,
+ * we need to obtain the relevant index tuple, compare its min()/max() stored
+ * values with those of the new tuple; if the tuple values are in range,
+ * there's nothing to do; otherwise we need to create a new index tuple and
+ * point the revmap to it.
+ *
+ * If the range is not currently summarized (i.e. the revmap returns InvalidTid
+ * for it), there's nothing to do either.
+ */
+ Datum
+ mminsert(PG_FUNCTION_ARGS)
+ {
+ Relation idxRel = (Relation) PG_GETARG_POINTER(0);
+ Datum *values = (Datum *) PG_GETARG_POINTER(1);
+ bool *nulls = (bool *) PG_GETARG_POINTER(2);
+ ItemPointer heaptid = (ItemPointer) PG_GETARG_POINTER(3);
+
+ /* we ignore the rest of our arguments */
+ mmRevmapAccess *rmAccess;
+ Datum indclassDatum;
+ bool isnull;
+ oidvector *indclass;
+ TupleDesc tupdesc;
+ MMTuple *mmtup;
+ DeformedMMTuple *dtup;
+ ItemPointerData idxtid;
+ BlockNumber heapBlk;
+ BlockNumber iblk;
+ OffsetNumber ioff;
+ Buffer buf;
+ IndexInfo *indexInfo;
+ Page page;
+ int keyno;
+ FmgrInfo *lt;
+ FmgrInfo *gt;
+ bool need_insert;
+
+ rmAccess = mmRevmapAccessInit(idxRel, MINMAX_PAGES_PER_RANGE);
+
+ heapBlk = ItemPointerGetBlockNumber(heaptid);
+ mmGetHeapBlockItemptr(rmAccess, heapBlk, &idxtid);
+ /* tuple lock on idxtid is grabbed by mmGetHeapBlockItemptr */
+
+ if (!ItemPointerIsValid(&idxtid))
+ {
+ /* nothing to do, range is unsummarized */
+ mmRevmapAccessTerminate(rmAccess);
+ return BoolGetDatum(false);
+ }
+
+ tupdesc = RelationGetDescr(idxRel);
+ indexInfo = BuildIndexInfo(idxRel);
+
+ lt = palloc(sizeof(FmgrInfo) * indexInfo->ii_NumIndexAttrs);
+ gt = palloc(sizeof(FmgrInfo) * indexInfo->ii_NumIndexAttrs);
+
+ /* grab the operators we will need: < and > for each indexed column */
+ indclassDatum = SysCacheGetAttr(INDEXRELID, idxRel->rd_indextuple,
+ Anum_pg_index_indclass, &isnull);
+ Assert(!isnull);
+ indclass = (oidvector *) DatumGetPointer(indclassDatum);
+ for (keyno = 0; keyno < indexInfo->ii_NumIndexAttrs; keyno++)
+ {
+ Oid opfam = get_opclass_family(indclass->values[keyno]);
+ Oid idxtypid = tupdesc->attrs[keyno]->atttypid;
+
+ get_mm_operator(opfam, idxtypid, idxtypid, BTLessStrategyNumber,
&lt[keyno]);
get_mm_operator(opfam, idxtypid, idxtypid, BTGreaterStrategyNumber,
&gt[keyno]);
+ }
+
+ iblk = ItemPointerGetBlockNumber(&idxtid);
+ ioff = ItemPointerGetOffsetNumber(&idxtid);
+ buf = ReadBuffer(idxRel, iblk);
+
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ UnlockTuple(idxRel, &idxtid, ShareLock);
+ page = BufferGetPage(buf);
+ mmtup = (MMTuple *) PageGetItem(page, PageGetItemId(page, ioff));
+
+ dtup = minmax_deform_tuple(tupdesc, mmtup);
+
+ /* compare the key values of the new tuple to the stored index values */
+ for (keyno = 0; keyno < indexInfo->ii_NumIndexAttrs; keyno++)
+ {
+ /*
+ * If the new tuple contains a null in this attr, but the range index
+ * tuple doesn't allow for nulls, we need a new summary tuple
+ */
+ if (nulls[keyno])
+ {
+ if (!dtup->values[keyno].hasnulls)
+ {
+ need_insert = true;
+ }
+ else
+ continue;
+ }
+
+ /*
+ * If the new key value is not within the min/max interval for this
+ * range, we need a new summary tuple
+ */
if (invoke_mm_operator(&lt[keyno], InvalidOid, values[keyno],
+ dtup->values[keyno].min))
+ {
+ dtup->values[keyno].min = values[keyno]; /* XXX datumCopy? */
+ need_insert = true;
+ }
if (invoke_mm_operator(&gt[keyno], InvalidOid, values[keyno],
+ dtup->values[keyno].max))
+ {
+ dtup->values[keyno].max = values[keyno]; /* XXX datumCopy? */
+ need_insert = true;
+ }
+ }
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ if (need_insert)
+ {
+ TupleDesc diskDesc;
+ Size tupsz;
+ MMTuple *tup;
+
+ diskDesc = minmax_get_descr(tupdesc);
+ tup = minmax_form_tuple(tupdesc, diskDesc, dtup, &tupsz);
+
+ mm_doinsert(idxRel, rmAccess, &buf, heapBlk, tup, tupsz);
+ }
+
+ ReleaseBuffer(buf);
+
+ mmRevmapAccessTerminate(rmAccess);
+
+ return BoolGetDatum(false);
+ }
+
+ Datum
+ mmbeginscan(PG_FUNCTION_ARGS)
+ {
+ Relation r = (Relation) PG_GETARG_POINTER(0);
+ int nkeys = PG_GETARG_INT32(1);
+ int norderbys = PG_GETARG_INT32(2);
+ IndexScanDesc scan;
+
+ scan = RelationGetIndexScan(r, nkeys, norderbys);
+
+ PG_RETURN_POINTER(scan);
+ }
+
+
+ /*
+ * Execute the index scan.
+ *
+ * This works by reading index TIDs from the revmap, and obtaining the index
+ * tuples pointed to by them; the min/max values in them are compared to the
+ * scan keys. We return into the TID bitmap all the pages in ranges
+ * corresponding to index tuples that match the scan keys.
+ *
+ * If a TID from the revmap is read as InvalidTID, we know that range is
+ * unsummarized. Pages in those ranges need to be returned regardless of scan
+ * keys.
+ */
+ Datum
+ mmgetbitmap(PG_FUNCTION_ARGS)
+ {
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+ TIDBitmap *tbm = (TIDBitmap *) PG_GETARG_POINTER(1);
+ Relation idxRel = scan->indexRelation;
+ Buffer currIdxBuf = InvalidBuffer;
+ Oid heapOid;
+ Relation heapRel;
+ mmRevmapAccess *rmAccess;
+ BlockNumber nblocks;
+ BlockNumber heapBlk;
+ TupleDesc tupdesc;
+ AttrNumber keyno;
+ Datum indclassDatum;
+ bool isnull;
+ oidvector *indclass;
+ FmgrInfo *lt;
+ FmgrInfo *lteq;
+ FmgrInfo *gteq;
+ FmgrInfo *gt;
+
+ pgstat_count_index_scan(idxRel);
+
+ heapOid = IndexGetRelation(RelationGetRelid(idxRel), false);
+ heapRel = heap_open(heapOid, AccessShareLock);
+ nblocks = RelationGetNumberOfBlocks(heapRel);
+ heap_close(heapRel, AccessShareLock);
+
+ tupdesc = RelationGetDescr(idxRel);
+
+ lt = palloc(sizeof(FmgrInfo) * scan->numberOfKeys);
+ lteq = palloc(sizeof(FmgrInfo) * scan->numberOfKeys);
+ gteq = palloc(sizeof(FmgrInfo) * scan->numberOfKeys);
+ gt = palloc(sizeof(FmgrInfo) * scan->numberOfKeys);
+
+ /*
+ * lookup the operators needed to determine range containment of each key
+ * value.
+ */
+ indclassDatum = SysCacheGetAttr(INDEXRELID, idxRel->rd_indextuple,
+ Anum_pg_index_indclass, &isnull);
+ Assert(!isnull);
+ indclass = (oidvector *) DatumGetPointer(indclassDatum);
+ for (keyno = 0; keyno < scan->numberOfKeys; keyno++)
+ {
+ AttrNumber keyattno;
+ Oid opfam;
+ Oid keytypid;
+ Oid idxtypid;
+
+ keyattno = scan->keyData[keyno].sk_attno;
+ opfam = get_opclass_family(indclass->values[keyattno - 1]);
+ keytypid = scan->keyData[keyno].sk_subtype;
+ idxtypid = tupdesc->attrs[keyattno - 1]->atttypid;
+
+ get_mm_operator(opfam, idxtypid, keytypid, BTLessStrategyNumber,
&lt[keyno]);
get_mm_operator(opfam, idxtypid, keytypid, BTLessEqualStrategyNumber,
&lteq[keyno]);
get_mm_operator(opfam, idxtypid, keytypid, BTGreaterStrategyNumber,
&gt[keyno]);
get_mm_operator(opfam, idxtypid, keytypid, BTGreaterEqualStrategyNumber,
&gteq[keyno]);
+ }
+
+ /*
+ * Now scan the revmap. We start by querying for heap page 0,
+ * incrementing by the number of pages per range; this gives us a full
+ * view of the table.
+ */
+ rmAccess = mmRevmapAccessInit(idxRel, MINMAX_PAGES_PER_RANGE);
+ for (heapBlk = 0; heapBlk < nblocks; heapBlk += MINMAX_PAGES_PER_RANGE)
+ {
+ ItemPointerData itupptr;
+ bool addrange;
+
+ mmGetHeapBlockItemptr(rmAccess, heapBlk, &itupptr);
+
+ /*
+ * For revmap items that return InvalidTID, we must return the whole
+ * range; otherwise, fetch the index item and compare it to the scan
+ * keys.
+ */
+ if (!ItemPointerIsValid(&itupptr))
+ {
+ addrange = true;
+ }
+ else
+ {
+ Page page;
+ OffsetNumber idxoffno;
+ BlockNumber idxblkno;
+ MMTuple *tup;
+ DeformedMMTuple *dtup;
+ int keyno;
+
+ idxoffno = ItemPointerGetOffsetNumber(&itupptr);
+ idxblkno = ItemPointerGetBlockNumber(&itupptr);
+
+ if (currIdxBuf == InvalidBuffer ||
+ idxblkno != BufferGetBlockNumber(currIdxBuf))
+ {
+ if (currIdxBuf != InvalidBuffer)
+ ReleaseBuffer(currIdxBuf);
+
+ currIdxBuf = ReadBuffer(idxRel, idxblkno);
+ }
+
+ /*
+ * To keep the buffer locked for a short time, we grab and
+ * immediately deform the index tuple to operate on. As soon as
+ * we have acquired the lock on the index buffer, we can release
+ * the tuple lock the revmap acquired for us. So vacuum would be
+ * able to remove this index row as soon as we release the buffer
+ * lock, if it has become stale.
+ */
+ LockBuffer(currIdxBuf, BUFFER_LOCK_SHARE);
+
+ UnlockTuple(idxRel, &itupptr, ShareLock);
+
+ page = BufferGetPage(currIdxBuf);
+ tup = (MMTuple *)
+ PageGetItem(page, PageGetItemId(page, idxoffno));
+ /* XXX probably need copies */
+ dtup = minmax_deform_tuple(tupdesc, tup);
+
+ /* done with the index page */
+ LockBuffer(currIdxBuf, BUFFER_LOCK_UNLOCK);
+
+ /*
+ * Compare scan keys with min/max values stored in range. If scan
+ * keys are matched, the page range must be added to the bitmap.
+ */
+ for (keyno = 0, addrange = true;
+ keyno < scan->numberOfKeys;
+ keyno++)
+ {
+ ScanKey key = &scan->keyData[keyno];
+ AttrNumber keyattno = key->sk_attno;
+
+ /*
+ * The analysis we need to make to decide whether to include a
+ * page range in the output result is: is it possible for a
+ * tuple contained within the min/max interval specified by
+ * this index tuple to match what's specified by the scan key?
+ * For example, for a query qual such as "WHERE col < 10" we
+ * need to include a range whose minimum value is less than
+ * 10.
+ *
+ * When there are multiple scan keys, failure to meet the
+ * criteria for a single one of them is enough to discard the
+ * range as a whole.
+ */
+ switch (key->sk_strategy)
+ {
+ case BTLessStrategyNumber:
+ addrange =
invoke_mm_operator(&lt[keyno], InvalidOid,
+ dtup->values[keyattno - 1].min,
+ key->sk_argument);
+ break;
+ case BTLessEqualStrategyNumber:
+ addrange =
invoke_mm_operator(&lteq[keyno], InvalidOid,
+ dtup->values[keyattno - 1].min,
+ key->sk_argument);
+ break;
+ case BTEqualStrategyNumber:
+
+ /*
+ * In the equality case (WHERE col = someval), we want
+ * to return the current page range if the minimum
+ * value in the range <= scan key, and the maximum
+ * value >= scan key.
+ */
+ addrange =
invoke_mm_operator(&lteq[keyno], InvalidOid,
+ dtup->values[keyattno - 1].min,
+ key->sk_argument);
+ if (!addrange)
+ break;
+ /* max() >= scankey */
+ addrange =
invoke_mm_operator(&gteq[keyno], InvalidOid,
+ dtup->values[keyattno - 1].max,
+ key->sk_argument);
+ break;
+ case BTGreaterEqualStrategyNumber:
+ addrange =
invoke_mm_operator(&gteq[keyno], InvalidOid,
+ dtup->values[keyattno - 1].max,
+ key->sk_argument);
+ break;
+ case BTGreaterStrategyNumber:
+ addrange =
invoke_mm_operator(&gt[keyno], InvalidOid,
+ dtup->values[keyattno - 1].max,
+ key->sk_argument);
+ break;
+ }
+
+ /*
+ * If the current scan key doesn't match the range values,
+ * don't look at further ones.
+ */
+ if (!addrange)
+ break;
+ }
+
+ /* XXX anything to free here? */
+ }
+
+ if (addrange)
+ {
+ BlockNumber pageno;
+
+ for (pageno = heapBlk;
+ pageno <= heapBlk + MINMAX_PAGES_PER_RANGE - 1;
+ pageno++)
+ tbm_add_page(tbm, pageno);
+ }
+ }
+
+ mmRevmapAccessTerminate(rmAccess);
+ if (currIdxBuf != InvalidBuffer)
+ ReleaseBuffer(currIdxBuf);
+
+ pfree(lt);
+ pfree(lteq);
+ pfree(gt);
+ pfree(gteq);
+
+ PG_RETURN_INT64(MaxHeapTuplesPerPage);
+ }
+
+
+ Datum
+ mmrescan(PG_FUNCTION_ARGS)
+ {
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+ ScanKey scankey = (ScanKey) PG_GETARG_POINTER(1);
+
+ /* other arguments ignored */
+
+ if (scankey && scan->numberOfKeys > 0)
+ {
+ memmove(scan->keyData, scankey,
+ scan->numberOfKeys * sizeof(ScanKeyData));
+ }
+
+ PG_RETURN_VOID();
+ }
+
+ Datum
+ mmendscan(PG_FUNCTION_ARGS)
+ {
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+
+ /* anything to do here? */
+ (void) scan; /* silence compiler */
+
+ PG_RETURN_VOID();
+ }
+
+ Datum
+ mmmarkpos(PG_FUNCTION_ARGS)
+ {
+ elog(ERROR, "MinMax does not support mark/restore");
+ PG_RETURN_VOID();
+ }
+
+ Datum
+ mmrestrpos(PG_FUNCTION_ARGS)
+ {
+ elog(ERROR, "MinMax does not support mark/restore");
+ PG_RETURN_VOID();
+ }
+
+ /*
+ * Reset the per-column build state in an MMBuildState.
+ */
+ static void
+ clear_mm_percol_buildstate(MMBuildState *mmstate)
+ {
+ int i;
+
+ for (i = 0; i < mmstate->indexDesc->natts; i++)
+ {
+ mmstate->dtuple->values[i].allnulls = true;
+ mmstate->dtuple->values[i].hasnulls = false;
+ mmstate->dtuple->values[i].min = (Datum) 0;
+ mmstate->dtuple->values[i].max = (Datum) 0;
+ }
+ }
+
+ /*
+ * Per-heap-tuple callback for IndexBuildHeapScan.
+ *
+ * Note we don't worry about the page range at the end of the table here; they
+ * are present in the build state struct but not inserted into the index.
+ * Caller must ensure to do so, if appropriate.
+ */
+ static void
+ mmbuildCallback(Relation index,
+ HeapTuple htup,
+ Datum *values,
+ bool *isnull,
+ bool tupleIsAlive,
+ void *state)
+ {
+ MMBuildState *mmstate = (MMBuildState *) state;
+ BlockNumber thisblock;
+ int i;
+
+ thisblock = ItemPointerGetBlockNumber(&htup->t_self);
+
+ /*
+ * If we're in a new block which belongs to the next range, summarize what
+ * we've got and start afresh.
+ */
+ if (thisblock == mmstate->nextRangeAt)
+ {
+ MMTuple *tup;
+ Size size;
+
+ #if 0
+ for (i = 0; i < mmstate->indexDesc->natts; i++)
+ {
+ elog(DEBUG2, "completed a range for column %d, range: %u .. %u",
+ i,
+ DatumGetUInt32(mmstate->dtuple->values[i].min),
+ DatumGetUInt32(mmstate->dtuple->values[i].max));
+ }
+ #endif
+
+ /*
+ * Create the index tuple containing min/max values, and insert it.
+ */
+ tup = minmax_form_tuple(mmstate->indexDesc, mmstate->diskDesc,
+ mmstate->dtuple, &size);
+ mm_doinsert(mmstate->irel, mmstate->rmAccess,
+ &mmstate->currentInsertBuf, mmstate->currRangeStart, tup,
+ size);
+ mmstate->numtuples++;
+ pfree(tup);
+
+ /* and set state to correspond to the new current range */
+ mmstate->currRangeStart = mmstate->nextRangeAt;
+ mmstate->nextRangeAt = mmstate->currRangeStart + MINMAX_PAGES_PER_RANGE;
+
+ /* initialize aggregate state for the new range */
+ for (i = 0; i < mmstate->indexDesc->natts; i++)
+ {
+ if (!mmstate->dtuple->values[i].allnulls &&
+ !mmstate->perColState[i].typByVal)
+ {
+ pfree(DatumGetPointer(mmstate->dtuple->values[i].min));
+ pfree(DatumGetPointer(mmstate->dtuple->values[i].max));
+ }
+ }
+
+ clear_mm_percol_buildstate(mmstate);
+ }
+
+ /* Accumulate the current tuple into the running state */
+ for (i = 0; i < mmstate->indexDesc->natts; i++)
+ {
+ AttrNumber heapAttno = mmstate->perColState[i].heapAttno;
+
+ /*
+ * If the value in the current heap tuple is null, there's not much to
+ * do other than keep track that we saw it.
+ */
+ if (isnull[heapAttno - 1])
+ {
+ mmstate->dtuple->values[i].hasnulls = true;
+ continue;
+ }
+
+ /*
+ * If this is the first tuple in the range containing a not-null value
+ * for this column, initialize our state.
+ */
+ if (mmstate->dtuple->values[i].allnulls)
+ {
+ mmstate->dtuple->values[i].allnulls = false;
+ mmstate->dtuple->values[i].min =
+ datumCopy(values[heapAttno - 1],
+ mmstate->perColState[i].typByVal,
+ mmstate->perColState[i].typLen);
+ mmstate->dtuple->values[i].max =
+ datumCopy(values[heapAttno - 1],
+ mmstate->perColState[i].typByVal,
+ mmstate->perColState[i].typLen);
+ continue;
+ }
+
+ /*
+ * Otherwise, dtuple state was already initialized, and the current
+ * tuple is not null: therefore we need to compare it to the current
+ * state and possibly update the min/max boundaries.
+ */
+ if (invoke_mm_operator(&mmstate->perColState[i].lt, InvalidOid,
+ values[heapAttno - 1],
+ mmstate->dtuple->values[i].min))
+ {
+ if (!mmstate->perColState[i].typByVal)
+ pfree(DatumGetPointer(mmstate->dtuple->values[i].min));
+ mmstate->dtuple->values[i].min =
+ datumCopy(values[heapAttno - 1],
+ mmstate->perColState[i].typByVal,
+ mmstate->perColState[i].typLen);
+ }
+
+ if (invoke_mm_operator(&mmstate->perColState[i].gt, InvalidOid,
+ values[heapAttno - 1],
+ mmstate->dtuple->values[i].max))
+ {
+ if (!mmstate->perColState[i].typByVal)
+ pfree(DatumGetPointer(mmstate->dtuple->values[i].min));
+ mmstate->dtuple->values[i].max =
+ datumCopy(values[heapAttno - 1],
+ mmstate->perColState[i].typByVal,
+ mmstate->perColState[i].typLen);
+ }
+ }
+ }
+
+ static MMBuildState *
+ initialize_mm_buildstate(Relation heapRel, Relation idxRel,
+ mmRevmapAccess *rmAccess, IndexInfo *indexInfo)
+ {
+ MMBuildState *mmstate;
+ TupleDesc heapDesc = RelationGetDescr(heapRel);
+ Datum indclassDatum;
+ bool isnull;
+ oidvector *indclass;
+ int i;
+
+ mmstate = palloc(offsetof(MMBuildState, perColState) +
+ sizeof(MMPerColBuildInfo) * indexInfo->ii_NumIndexAttrs);
+
+ mmstate->irel = idxRel;
+ mmstate->numtuples = 0;
+ mmstate->currentInsertBuf = InvalidBuffer;
+ mmstate->currRangeStart = 0;
+ mmstate->nextRangeAt = MINMAX_PAGES_PER_RANGE;
+ mmstate->rmAccess = rmAccess;
+ mmstate->indexDesc = RelationGetDescr(idxRel);
+ mmstate->diskDesc = minmax_get_descr(mmstate->indexDesc);
+
+ mmstate->dtuple = palloc(offsetof(DeformedMMTuple, values) +
+ sizeof(MMValues) * indexInfo->ii_NumIndexAttrs);
+ /* other stuff in dtuple is initialized below */
+
+ indclassDatum = SysCacheGetAttr(INDEXRELID, idxRel->rd_indextuple,
+ Anum_pg_index_indclass, &isnull);
+ Assert(!isnull);
+ indclass = (oidvector *) DatumGetPointer(indclassDatum);
+
+ for (i = 0; i < mmstate->indexDesc->natts; i++)
+ {
+ int heapAttno;
+ Form_pg_attribute attr;
+ Oid opfam = get_opclass_family(indclass->values[i]);
+ Oid idxtypid = mmstate->indexDesc->attrs[i]->atttypid;
+
+ heapAttno = indexInfo->ii_KeyAttrNumbers[i];
+ if (heapAttno == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot create minmax indexes on expressions")));
+
+ attr = heapDesc->attrs[heapAttno - 1];
+ mmstate->perColState[i].heapAttno = heapAttno;
+ mmstate->perColState[i].typByVal = attr->attbyval;
+ mmstate->perColState[i].typLen = attr->attlen;
+ get_mm_operator(opfam, idxtypid, idxtypid, BTLessStrategyNumber,
+ &(mmstate->perColState[i].lt));
+ get_mm_operator(opfam, idxtypid, idxtypid, BTGreaterStrategyNumber,
+ &(mmstate->perColState[i].gt));
+
+ /* initialize per-column state */
+ }
+
+ clear_mm_percol_buildstate(mmstate);
+
+ return mmstate;
+ }
+
+ void
+ mm_init_metapage(Buffer meta)
+ {
+ MinmaxMetaPageData *metadata;
+ Page page = BufferGetPage(meta);
+
+ PageInit(page, BLCKSZ, 0);
+
+ metadata = (MinmaxMetaPageData *) PageGetContents(page);
+
+ metadata->minmaxMagic = MINMAX_META_MAGIC;
+ metadata->minmaxVersion = MINMAX_CURRENT_VERSION;
+ }
+
+ /*
+ * mmbuild() -- build a new minmax index.
+ */
+ Datum
+ mmbuild(PG_FUNCTION_ARGS)
+ {
+ Relation heap = (Relation) PG_GETARG_POINTER(0);
+ Relation index = (Relation) PG_GETARG_POINTER(1);
+ IndexInfo *indexInfo = (IndexInfo *) PG_GETARG_POINTER(2);
+ IndexBuildResult *result;
+ double reltuples;
+ mmRevmapAccess *rmAccess;
+ MMBuildState *mmstate;
+ Buffer meta;
+
+ /*
+ * We expect to be called exactly once for any index relation.
+ */
+ if (RelationGetNumberOfBlocks(index) != 0)
+ elog(ERROR, "index \"%s\" already contains data",
+ RelationGetRelationName(index));
+
+ /* partial indexes not supported */
+ if (indexInfo->ii_Predicate != NIL)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("partial indexes not supported")));
+ /* expressions not supported (yet?) */
+ if (indexInfo->ii_Expressions != NIL)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("expression indexes not supported")));
+
+ START_CRIT_SECTION();
+ meta = mm_getnewbuffer(index);
+ mm_init_metapage(meta);
+ MarkBufferDirty(meta);
+
+ if (RelationNeedsWAL(index))
+ {
+ XLogRecPtr recptr;
+ XLogRecData rdata;
+ Page page;
+
+ rdata.buffer = InvalidBuffer;
+ rdata.data = (char *) &(index->rd_node);
+ rdata.len = sizeof(RelFileNode);
+ rdata.next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_CREATE_INDEX, &rdata);
+
+ page = BufferGetPage(meta);
+ PageSetLSN(page, recptr);
+ }
+
+ UnlockReleaseBuffer(meta);
+ END_CRIT_SECTION();
+
+ /* set up our "reverse map" fork */
+ mmRevmapCreate(index);
+
+ /*
+ * Initialize our state, including the deformed tuple state.
+ */
+ rmAccess = mmRevmapAccessInit(index, MINMAX_PAGES_PER_RANGE);
+ mmstate = initialize_mm_buildstate(heap, index, rmAccess, indexInfo);
+
+ /*
+ * Now scan the relation. No syncscan allowed here because we want the
+ * heap blocks in order
+ */
+ reltuples = IndexBuildHeapScan(heap, index, indexInfo, false,
+ mmbuildCallback, (void *) mmstate);
+
+ /* XXX process the final batch, if needed */
+
+
+ /* release the last index buffer used */
+ if (!BufferIsInvalid(mmstate->currentInsertBuf))
+ {
+ ReleaseBuffer(mmstate->currentInsertBuf);
+ mmstate->currentInsertBuf = InvalidBuffer;
+ }
+
+ mmRevmapAccessTerminate(mmstate->rmAccess);
+
+ /*
+ * Return statistics
+ */
+ result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+
+ result->heap_tuples = reltuples;
+ result->index_tuples = mmstate->numtuples;
+
+ PG_RETURN_POINTER(result);
+ }
+
+ Datum
+ mmbuildempty(PG_FUNCTION_ARGS)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("unlogged MinMax indexes are not supported")));
+
+ PG_RETURN_VOID();
+ }
+
+ Datum
+ mmbulkdelete(PG_FUNCTION_ARGS)
+ {
+ PG_RETURN_POINTER(NULL);
+ }
+
+ /*
+ * qsort comparator for ItemPointerData items
+ */
+ static int
+ qsortCompareItemPointers(const void *a, const void *b)
+ {
+ return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
+ }
+
+ /*
+ * Remove index tuples that are no longer useful.
+ *
+ * While at it, return an array of block numbers for which the revmap returns
+ * InvalidTid; this is used in a later stage to execute re-summarization.
+ * (The block numbers correspond to the start heap page numbers with which each
+ * unsummarized range starts.) Space for the array is palloc'ed, and must be
+ * freed by caller.
+ */
+ static void
+ remove_deletable_tuples(Relation idxRel, BlockNumber heapNumBlocks,
+ BufferAccessStrategy strategy,
+ BlockNumber **nonsummed, int *numnonsummed)
+ {
+ HASHCTL hctl;
+ HTAB *tuples;
+ HASH_SEQ_STATUS status;
+ MemoryContext hashcxt;
+ BlockNumber nblocks;
+ BlockNumber blk;
+ mmRevmapAccess *rmAccess;
+ BlockNumber heapBlk;
+ int numitems = 0;
+ int numdeletable = 0;
+ ItemPointerData *deletable;
+ int start;
+ int i;
+ BlockNumber *nonsumm = NULL;
+ int maxnonsumm = 0;
+ int numnonsumm = 0;
+
+ typedef struct DeletableTuple
+ {
+ ItemPointerData tid;
+ bool referenced;
+ } DeletableTuple;
+
+ nblocks = RelationGetNumberOfBlocks(idxRel);
+
+ hashcxt = AllocSetContextCreate(CurrentMemoryContext,
+ "mm remove deletable hash",
+ ALLOCSET_SMALL_MINSIZE,
+ ALLOCSET_SMALL_INITSIZE,
+ ALLOCSET_SMALL_MAXSIZE);
+
+ /* Initialize hash used to track deletable tuples */
+ memset(&hctl, 0, sizeof(hctl));
+ hctl.keysize = sizeof(ItemPointerData);
+ hctl.entrysize = sizeof(DeletableTuple);
+ hctl.hcxt = hashcxt;
+ hctl.hash = tag_hash;
+
+ /* assume ten entries per page. No harm in getting this wrong */
+ tuples = hash_create("mmvacuumcleanup", nblocks * 10, &hctl,
+ HASH_CONTEXT | HASH_FUNCTION | HASH_ELEM);
+
+ /*
+ * Scan the index sequentially, entering each item into a hash table.
+ * Initially, the items are marked as not referenced.
+ */
+ for (blk = 0; blk < nblocks; blk++)
+ {
+ Buffer buf;
+ Page page;
+ OffsetNumber offno;
+
+ vacuum_delay_point();
+
+ buf = ReadBufferExtended(idxRel, MAIN_FORKNUM, blk, RBM_NORMAL,
+ strategy);
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ page = BufferGetPage(buf);
+
+ for (offno = 1; offno <= PageGetMaxOffsetNumber(page); offno++)
+ {
+ ItemPointerData tid;
+ ItemId itemid;
+ bool found;
+ DeletableTuple *hitem;
+
+ itemid = PageGetItemId(page, offno);
+ if (!ItemIdHasStorage(itemid))
+ continue;
+
+ ItemPointerSet(&tid, blk, offno);
+ hitem = (DeletableTuple *) hash_search(tuples,
+ &tid,
+ HASH_ENTER,
+ &found);
+ Assert(!found);
+ hitem->referenced = false;
+ }
+ UnlockReleaseBuffer(buf);
+ }
+
+ /*
+ * now scan the revmap, and determine which of these TIDs are still
+ * referenced
+ */
+ rmAccess = mmRevmapAccessInit(idxRel, MINMAX_PAGES_PER_RANGE);
+ for (heapBlk = 0, numitems = 0;
+ heapBlk < heapNumBlocks;
+ heapBlk += MINMAX_PAGES_PER_RANGE)
+ {
+ ItemPointerData itupptr;
+ DeletableTuple *hitem;
+ bool found;
+
+ mmGetHeapBlockItemptr(rmAccess, heapBlk, &itupptr);
+
+ if (!ItemPointerIsValid(&itupptr))
+ {
+ /*
+ * Ignore revmap entries set to invalid. However, if the heap page
+ * range is complete but not summarized, store its initial page
+ * number in the unsummarized array, for later summarization.
+ */
+ if (heapBlk + MINMAX_PAGES_PER_RANGE <= heapNumBlocks)
+ {
+ if (maxnonsumm == 0)
+ {
+ Assert(!nonsumm);
+ maxnonsumm = 8;
+ nonsumm = palloc(sizeof(BlockNumber) * maxnonsumm);
+ }
+ else if (numnonsumm >= maxnonsumm)
+ {
+ maxnonsumm *= 2;
+ nonsumm = repalloc(nonsumm, sizeof(BlockNumber) * maxnonsumm);
+ }
+
+ nonsumm[numnonsumm++] = heapBlk;
+ }
+
+ continue;
+ }
+
+ hitem = (DeletableTuple *) hash_search(tuples,
+ &itupptr,
+ HASH_FIND,
+ &found);
+ if (!found)
+ elog(ERROR, "reverse map references nonexistant index tuple %u/%u",
+ ItemPointerGetBlockNumber(&itupptr),
+ ItemPointerGetOffsetNumber(&itupptr));
+ hitem->referenced = true;
+ numitems++;
+ }
+
+ mmRevmapAccessTerminate(rmAccess);
+
+ /*
+ * Now scan the hash, collecting the removable (i.e. unreferenced and
+ * unlocked) tuples. The array is allocated in the hash context, so that it
+ * goes away along with the hash; size it for the worst case of every hashed
+ * tuple being removable.
+ */
+ deletable = MemoryContextAlloc(hashcxt,
+ sizeof(ItemPointerData) * hash_get_num_entries(tuples));
+
+ hash_freeze(tuples);
+ hash_seq_init(&status, tuples);
+ for (;;)
+ {
+ DeletableTuple *hitem;
+
+ hitem = hash_seq_search(&status);
+ if (!hitem)
+ break;
+ if (hitem->referenced)
+ continue;
+ if (!ConditionalLockTuple(idxRel, &hitem->tid, ExclusiveLock))
+ continue;
+
+ /*
+ * By here, we know this tuple is not referenced from the revmap.
+ * Also, since we hold the tuple lock, we know that if there is a
+ * concurrent scan that had obtained the tuple before the reference
+ * got removed, either that scan is not looking at the tuple (because
+ * that would have prevented us from getting the tuple lock) or it is
+ * holding the containing buffer's lock. If the former, then there's
+ * no problem with removing the tuple immediately; if the latter, we
+ * will block below trying to acquire that lock, so by the time we are
+ * unblocked, the concurrent scan will no longer be interested in the
+ * tuple contents anymore. Therefore, this tuple can be removed from
+ * the block.
+ */
+ UnlockTuple(idxRel, &hitem->tid, ExclusiveLock);
+
+ deletable[numdeletable++] = hitem->tid;
+ }
+
+ /*
+ * Now sort the array of deletable index tuples, and walk it one page at a
+ * time, doing bulk deletion of the items on each page; the free space map
+ * is updated for each page on which we delete items.
+ */
+ qsort(deletable, numdeletable, sizeof(ItemPointerData),
+ qsortCompareItemPointers);
+
+ start = 0;
+ for (i = 0; i < numdeletable; i++)
+ {
+ if (i == numdeletable - 1 ||
+ (ItemPointerGetBlockNumber(&deletable[start]) !=
+ ItemPointerGetBlockNumber(&deletable[i + 1])))
+ {
+ OffsetNumber *offnos;
+ int noffs;
+ Buffer buf;
+ Page page;
+ int j;
+ BlockNumber blk;
+
+ vacuum_delay_point();
+
+ blk = ItemPointerGetBlockNumber(&deletable[start]);
+ buf = ReadBufferExtended(idxRel, MAIN_FORKNUM, blk,
+ RBM_NORMAL, strategy);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ page = BufferGetPage(buf);
+
+ noffs = i + 1 - start;
+ offnos = palloc(sizeof(OffsetNumber) * noffs);
+ for (j = 0; j < noffs; j++)
+ offnos[j] = ItemPointerGetOffsetNumber(&deletable[start + j]);
+
+ START_CRIT_SECTION();
+
+ PageIndexDeleteNoCompact(page, offnos, noffs);
+
+ MarkBufferDirty(buf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(idxRel))
+ {
+ xl_minmax_bulkremove xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+ uint8 info = XLOG_MINMAX_BULKREMOVE;
+
+ xlrec.node = idxRel->rd_node;
+ xlrec.block = blk;
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxBulkRemove;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].next = &(rdata[1]);
+
+ /*
+ * The OffsetNumber array is not actually in the buffer, but we
+ * pretend that it is. When XLogInsert stores the whole
+ * buffer, the offset array need not be stored too.
+ */
+ rdata[1].data = (char *) offnos;
+ rdata[1].len = sizeof(OffsetNumber) * noffs;
+ rdata[1].buffer = buf;
+ rdata[1].buffer_std = true;
+ rdata[1].next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ RecordPageWithFreeSpace(idxRel, blk, PageGetFreeSpace(page));
+
+ start = i + 1;
+
+ UnlockReleaseBuffer(buf);
+ pfree(offnos);
+ }
+ }
+
+ /* Finally, ensure the index's FSM is consistent */
+ FreeSpaceMapVacuum(idxRel);
+
+ *nonsummed = nonsumm;
+ *numnonsummed = numnonsumm;
+
+ hash_destroy(tuples);
+ }
+
+ /*
+ * Summarize the given page ranges of the given index.
+ */
+ static void
+ rerun_summarization(Relation idxRel, Relation heapRel, mmRevmapAccess *rmAccess,
+ BlockNumber *nonsummarized, int numnonsummarized)
+ {
+ int i;
+ IndexInfo *indexInfo;
+ MMBuildState *mmstate;
+
+ indexInfo = BuildIndexInfo(idxRel);
+
+ mmstate = initialize_mm_buildstate(heapRel, idxRel, rmAccess, indexInfo);
+
+ for (i = 0; i < numnonsummarized; i++)
+ {
+ BlockNumber blk = nonsummarized[i];
+ ItemPointerData iptr;
+ MMTuple *tup;
+ Size size;
+
+ mmGetHeapBlockItemptr(rmAccess, blk, &iptr);
+
+ mmstate->currRangeStart = blk;
+ mmstate->nextRangeAt = blk + MINMAX_PAGES_PER_RANGE;
+
+ /* it can't have been re-summarized concurrently .. */
+ Assert(!ItemPointerIsValid(&iptr));
+
+ IndexBuildHeapRangeScan(heapRel, idxRel, indexInfo, false,
+ blk, MINMAX_PAGES_PER_RANGE,
+ mmbuildCallback, (void *) mmstate);
+
+ /*
+ * Create the index tuple containing min/max values, and insert it.
+ * Note mmbuildCallback didn't have the chance to actually insert
+ * anything into the index, because the heapscan should have ended
+ * just as it reached the final tuple in the range.
+ */
+ tup = minmax_form_tuple(mmstate->indexDesc, mmstate->diskDesc,
+ mmstate->dtuple, &size);
+ mm_doinsert(mmstate->irel, mmstate->rmAccess,
+ &mmstate->currentInsertBuf, mmstate->currRangeStart, tup,
+ size);
+ mmstate->numtuples++;
+ pfree(tup);
+
+ clear_mm_percol_buildstate(mmstate);
+ }
+
+ if (!BufferIsInvalid(mmstate->currentInsertBuf))
+ {
+ ReleaseBuffer(mmstate->currentInsertBuf);
+ mmstate->currentInsertBuf = InvalidBuffer;
+ }
+ }
+
+ /*
+ * During amvacuumcleanup of a MinMax index, we do three main things:
+ *
+ * 1) remove revmap entries which are no longer interesting (heap has been
+ * truncated).
+ *
+ * 2) remove index tuples that are no longer referenced from the revmap.
+ *
+ * 3) summarize ranges that are currently unsummarized.
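+ *
+ * Note that steps 2 and 3 share one index pass: remove_deletable_tuples
+ * collects the start blocks of unsummarized ranges as a side effect, and
+ * rerun_summarization then processes that list.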
+ */
+ Datum
+ mmvacuumcleanup(PG_FUNCTION_ARGS)
+ {
+ IndexVacuumInfo *info = (IndexVacuumInfo *) PG_GETARG_POINTER(0);
+ IndexBulkDeleteResult *stats = (IndexBulkDeleteResult *) PG_GETARG_POINTER(1);
+ mmRevmapAccess *rmAccess;
+ BlockNumber *nonsummarized = NULL;
+ int numnonsummarized;
+ Relation heapRel;
+ BlockNumber heapNumBlocks;
+
+ rmAccess = mmRevmapAccessInit(info->index, MINMAX_PAGES_PER_RANGE);
+
+ heapRel = heap_open(IndexGetRelation(RelationGetRelid(info->index), false),
+ AccessShareLock);
+
+ /*
+ * First: truncate the revmap to the range that covers pages actually in
+ * the heap. We must do this while holding the relation extension lock,
+ * or we risk someone else extending the relation in the meantime.
+ */
+ LockRelationForExtension(heapRel, ShareLock);
+ heapNumBlocks = RelationGetNumberOfBlocks(heapRel);
+ mmRevmapTruncate(rmAccess, heapNumBlocks);
+ UnlockRelationForExtension(heapRel, ShareLock);
+
+ /*
+ * Second: scan the index, removing index tuples that are no longer
+ * referenced from the revmap. While at it, collect the page numbers
+ * of ranges that are not summarized.
+ */
+ remove_deletable_tuples(info->index, heapNumBlocks, info->strategy,
+ &nonsummarized, &numnonsummarized);
+
+ /* Finally, summarize the ranges collected above */
+ if (nonsummarized)
+ {
+ rerun_summarization(info->index, heapRel, rmAccess,
+ nonsummarized, numnonsummarized);
+ pfree(nonsummarized);
+ }
+
+ mmRevmapAccessTerminate(rmAccess);
+ heap_close(heapRel, AccessShareLock);
+
+ PG_RETURN_POINTER(stats);
+ }
+
+ Datum
+ mmcostestimate(PG_FUNCTION_ARGS)
+ {
+ PG_RETURN_INT64(0);
+ }
+
+ Datum
+ mmoptions(PG_FUNCTION_ARGS)
+ {
+ PG_RETURN_INT64(0);
+ }
+
+ /*
+ * Fill the given finfo to enable calls to the operator specified by the given
+ * parameters.
+ */
+ static void
+ get_mm_operator(Oid opfam, Oid idxtypid, Oid keytypid,
+ StrategyNumber strategy, FmgrInfo *finfo)
+ {
+ Oid oprid;
+ HeapTuple oper;
+
+ oprid = get_opfamily_member(opfam, idxtypid, keytypid, strategy);
+ if (!OidIsValid(oprid))
+ elog(ERROR, "missing operator %d(%u,%u) in opfamily %u",
+ strategy, idxtypid, keytypid, opfam);
+
+ oper = SearchSysCache1(OPEROID, oprid);
+ if (!HeapTupleIsValid(oper))
+ elog(ERROR, "cache lookup failed for operator %u", oprid);
+
+ fmgr_info(((Form_pg_operator) GETSTRUCT(oper))->oprcode, finfo);
+ ReleaseSysCache(oper);
+ }
+
+ /*
+ * Invoke the given operator, and return the result as a C boolean.
+ */
+ static inline bool
+ invoke_mm_operator(FmgrInfo *operator, Oid collation, Datum left, Datum right)
+ {
+ Datum result;
+
+ result = FunctionCall2Coll(operator, collation, left, right);
+
+ return DatumGetBool(result);
+ }
+
+ /*
+ * Insert an index tuple into the index relation. The revmap is updated to
+ * mark the range containing the given page as pointing to the inserted entry.
+ *
+ * The buffer, if valid, is checked for free space to insert the new entry;
+ * if there isn't enough, a new buffer is obtained and pinned.
+ *
+ * The buffer is marked dirty.
+ */
+ static void
+ mm_doinsert(Relation idxrel, mmRevmapAccess *rmAccess, Buffer *buffer,
+ BlockNumber heapblkno, MMTuple *tup, Size itemsz)
+ {
+ Page page;
+ BlockNumber blk;
+ OffsetNumber off;
+ bool extended;
+
+ itemsz = MAXALIGN(itemsz);
+
+ extended = mm_getinsertbuffer(idxrel, buffer, itemsz);
+ page = BufferGetPage(*buffer);
+
+ if (PageGetFreeSpace(page) < itemsz)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("index row size %lu exceeds maximum for index \"%s\"",
+ itemsz, RelationGetRelationName(idxrel))));
+
+ blk = BufferGetBlockNumber(*buffer);
+
+ /* the page change and its WAL record must be made as one atomic action */
+ START_CRIT_SECTION();
+
+ off = PageAddItem(page, (Item) tup, itemsz, InvalidOffsetNumber,
+ false, false);
+ if (off == InvalidOffsetNumber)
+ elog(ERROR, "could not insert new index tuple to page");
+
+ MarkBufferDirty(*buffer);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(idxrel))
+ {
+ xl_minmax_insert xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+ uint8 info = XLOG_MINMAX_INSERT;
+
+
+ xlrec.target.node = idxrel->rd_node;
+ ItemPointerSet(&xlrec.target.tid, blk, off);
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxInsert;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = (char *) tup;
+ rdata[1].len = itemsz;
+ rdata[1].buffer = *buffer;
+ rdata[1].buffer_std = true;
+ rdata[1].next = NULL;
+
+ /*
+ * If this is the first tuple in the page, we can reinit the page
+ * instead of restoring the whole thing. Set flag, and hide buffer
+ * references from XLogInsert.
+ */
+ if (extended)
+ {
+ info |= XLOG_MINMAX_INIT_PAGE;
+ rdata[1].buffer = InvalidBuffer;
+ }
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /*
+ * Note we need to keep the lock on the buffer until after the revmap
+ * has been updated. Otherwise, a concurrent scanner could try to obtain
+ * the index tuple from the revmap before we're done writing it.
+ */
+ mmSetHeapBlockItemptr(rmAccess, heapblkno, blk, off);
+
+ LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+ }
+
+ /*
+ * Return an exclusively-locked buffer obtained by extending the relation.
+ */
+ static Buffer
+ mm_getnewbuffer(Relation irel)
+ {
+ Buffer buffer;
+ bool needLock = !RELATION_IS_LOCAL(irel);
+
+ if (needLock)
+ LockRelationForExtension(irel, ExclusiveLock);
+
+ buffer = ReadBuffer(irel, P_NEW);
+ LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+ if (needLock)
+ UnlockRelationForExtension(irel, ExclusiveLock);
+
+ return buffer;
+ }
+
+ /*
+ * Return a pinned and locked buffer which can be used to insert an index item
+ * of size itemsz.
+ *
+ * The passed buffer argument is tested for free space; if it has enough, it
+ * is locked and returned. Otherwise, that buffer (if valid) is unpinned, and
+ * a new buffer is obtained and returned pinned and locked.
+ *
+ * If no existing page has enough free space to accommodate the new item,
+ * the relation is extended. The function returns true if this happens,
+ * false otherwise.
+ */
+ static bool
+ mm_getinsertbuffer(Relation irel, Buffer *buffer, Size itemsz)
+ {
+ Buffer buf;
+ bool extended = false;
+
+ buf = *buffer;
+
+ if (BufferIsInvalid(buf) ||
+ (PageGetFreeSpace(BufferGetPage(buf)) < itemsz))
+ {
+ Page page;
+
+ /*
+ * By the time we break out of this loop, buf is a locked and pinned
+ * buffer which has enough free space to satisfy the requirement.
+ */
+ for (;;)
+ {
+ BlockNumber blk;
+ int freespace;
+
+ blk = GetPageWithFreeSpace(irel, itemsz);
+ if (blk == InvalidBlockNumber)
+ {
+ /*
+ * There's not enough free space in any existing index page,
+ * according to the FSM: extend the relation to obtain a shiny
+ * new page.
+ */
+ buf = mm_getnewbuffer(irel);
+ page = BufferGetPage(buf);
+ PageInit(page, BLCKSZ, 0);
+
+ /*
+ * If an entirely new page does not contain enough free space
+ * for the new item, then surely that item is oversized.
+ * Complain loudly.
+ */
+ freespace = PageGetFreeSpace(page);
+ if (freespace < itemsz)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("index row size %lu exceeds maximum %lu for index \"%s\"",
+ (unsigned long) itemsz,
+ (unsigned long) freespace,
+ RelationGetRelationName(irel))));
+ extended = true;
+ break;
+ }
+
+ buf = ReadBuffer(irel, blk);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ page = BufferGetPage(buf);
+ freespace = PageGetFreeSpace(page);
+ if (freespace >= itemsz)
+ break;
+
+ /* Not enough space: register reality and start over */
+ /* XXX register and unlock, or unlock and register?? */
+ RecordPageWithFreeSpace(irel, blk, freespace);
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ }
+
+ if (!BufferIsInvalid(*buffer))
+ ReleaseBuffer(*buffer);
+
+ *buffer = buf;
+ }
+ else
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ return extended;
+ }
*** /dev/null
--- b/src/backend/access/minmax/mmrevmap.c
***************
*** 0 ****
--- 1,375 ----
+ /*
+ * mmrevmap.c
+ * Reverse range map for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmrevmap.c
+ */
+ #include "postgres.h"
+
+ #include "access/minmax.h"
+ #include "access/minmax_internal.h"
+ #include "access/minmax_revmap.h"
+ #include "access/minmax_xlog.h"
+ #include "access/rmgr.h"
+ #include "miscadmin.h"
+ #include "storage/bufmgr.h"
+ #include "storage/lmgr.h"
+ #include "storage/relfilenode.h"
+ #include "storage/smgr.h"
+
+
+ #define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
+ #define IDXITEMS_PER_PAGE (MAPSIZE / SizeOfIptrData)
+
+ #define HEAPBLK_TO_REVMAP_BLK(pagesPerRange, heapBlk) \
+ ((heapBlk / pagesPerRange) / IDXITEMS_PER_PAGE)
+
+ #define HEAPBLK_TO_REVMAP_INDEX(pagesPerRange, heapBlk) \
+ ((heapBlk / pagesPerRange) % IDXITEMS_PER_PAGE)
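+
+ /*
+ * For illustration: with pagesPerRange = 128, heap block 1000 falls in the
+ * range starting at heap block 896 (range number 7), so its entry is item
+ * number 7 on revmap page 0 (IDXITEMS_PER_PAGE is far larger than 7 for any
+ * supported BLCKSZ).
+ */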
+
+ static bool mmRevmapExtend(mmRevmapAccess *rmAccess, BlockNumber blkno);
+
+ /* typedef appears in minmax_revmap.h */
+ struct mmRevmapAccess
+ {
+ Relation idxrel;
+ BlockNumber pagesPerRange;
+ Buffer currBuf;
+ BlockNumber physPagesInRevmap;
+ };
+
+
+ /*
+ * Initialize an access object for a reverse range map, which can be used to
+ * read and update it. This must be freed with mmRevmapAccessTerminate when
+ * the caller is done with it.
+ */
+ mmRevmapAccess *
+ mmRevmapAccessInit(Relation idxrel, BlockNumber pagesPerRange)
+ {
+ mmRevmapAccess *rmAccess = palloc(sizeof(mmRevmapAccess));
+
+ RelationOpenSmgr(idxrel);
+
+ rmAccess->idxrel = idxrel;
+ rmAccess->pagesPerRange = pagesPerRange;
+ rmAccess->currBuf = InvalidBuffer;
+ rmAccess->physPagesInRevmap =
+ smgrnblocks(idxrel->rd_smgr, MM_REVMAP_FORKNUM);
+
+ return rmAccess;
+ }
+
+ /*
+ * Release resources associated with a revmap access object.
+ */
+ void
+ mmRevmapAccessTerminate(mmRevmapAccess *rmAccess)
+ {
+ if (rmAccess->currBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currBuf);
+ pfree(rmAccess);
+ }
+
+ /*
+ * In the given revmap page, belonging to a minmax index that uses
+ * pagesPerRange pages per range, set the element corresponding to heap block
+ * number heapBlk to the value (blkno, offno).
+ *
+ * The caller must have obtained the correct page.
+ *
+ * This is used both in regular operation and during WAL replay.
+ */
+ void
+ rm_page_set_iptr(Page page, int pagesPerRange, BlockNumber heapBlk,
+ BlockNumber blkno, OffsetNumber offno)
+ {
+ ItemPointerData *iptr;
+
+ iptr = (ItemPointerData *) PageGetContents(page);
+ iptr += HEAPBLK_TO_REVMAP_INDEX(pagesPerRange, heapBlk);
+
+ ItemPointerSet(iptr, blkno, offno);
+ }
+
+ /*
+ * Set the TID of the index entry corresponding to the range that includes
+ * the given heap page to the given item pointer.
+ *
+ * The map is extended, if necessary.
+ */
+ void
+ mmSetHeapBlockItemptr(mmRevmapAccess *rmAccess, BlockNumber heapBlk,
+ BlockNumber blkno, OffsetNumber offno)
+ {
+ BlockNumber mapBlk;
+ bool extend = false;
+
+ mapBlk = HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk);
+
+ /*
+ * If the revmap is out of space, extend it first.
+ */
+ if (mapBlk >= rmAccess->physPagesInRevmap)
+ extend = mmRevmapExtend(rmAccess, mapBlk);
+
+ /*
+ * Obtain the buffer we need to modify. If we already have the correct
+ * buffer pinned in our access struct, use that; otherwise, release the
+ * current one (if valid) and read the one we need.
+ */
+ if (rmAccess->currBuf == InvalidBuffer ||
+ mapBlk != BufferGetBlockNumber(rmAccess->currBuf))
+ {
+ if (rmAccess->currBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currBuf);
+ rmAccess->currBuf = ReadBufferExtended(rmAccess->idxrel,
+ MM_REVMAP_FORKNUM, mapBlk,
+ RBM_NORMAL, NULL);
+ }
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_EXCLUSIVE);
+ START_CRIT_SECTION();
+
+ rm_page_set_iptr(BufferGetPage(rmAccess->currBuf),
+ rmAccess->pagesPerRange,
+ heapBlk,
+ blkno, offno);
+
+ MarkBufferDirty(rmAccess->currBuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rmAccess->idxrel))
+ {
+ xl_minmax_rm_set xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+ uint8 info;
+
+ info = XLOG_MINMAX_REVMAP_SET;
+
+ xlrec.node = rmAccess->idxrel->rd_node;
+ xlrec.mapBlock = mapBlk;
+ xlrec.pagesPerRange = rmAccess->pagesPerRange;
+ xlrec.heapBlock = heapBlk;
+ ItemPointerSet(&(xlrec.newval), blkno, offno);
+
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxRevmapSet;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].buffer_std = false;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = NULL;
+ rdata[1].len = 0;
+ rdata[1].buffer = rmAccess->currBuf;
+ rdata[1].buffer_std = false;
+ rdata[1].next = NULL;
+
+ if (extend)
+ {
+ info |= XLOG_MINMAX_INIT_PAGE;
+ /* If the page is new, there's no need for a full page image */
+ rdata[0].next = NULL;
+ }
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+
+ PageSetLSN(BufferGetPage(rmAccess->currBuf), recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_UNLOCK);
+ }
+
+
+ /*
+ * Return, in *out, the TID of the index entry corresponding to the range
+ * that includes the given heap page. If the TID is valid, the tuple is
+ * locked with LockTuple; it is the caller's responsibility to release that
+ * lock.
+ */
+ void
+ mmGetHeapBlockItemptr(mmRevmapAccess *rmAccess, BlockNumber heapBlk,
+ ItemPointerData *out)
+ {
+ BlockNumber mapBlk;
+ ItemPointerData *iptr;
+
+ mapBlk = HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk);
+
+ /*
+ * If we are asked for a block of the map which is beyond what we know
+ * about it, try to see if our fork has grown since we last checked its
+ * size; a concurrent inserter could have extended it.
+ */
+ if (mapBlk >= rmAccess->physPagesInRevmap)
+ {
+ RelationOpenSmgr(rmAccess->idxrel);
+ LockRelationForExtension(rmAccess->idxrel, ShareLock);
+ rmAccess->physPagesInRevmap =
+ smgrnblocks(rmAccess->idxrel->rd_smgr, MM_REVMAP_FORKNUM);
+
+ if (mapBlk >= rmAccess->physPagesInRevmap)
+ {
+ /* definitely not in range */
+
+ UnlockRelationForExtension(rmAccess->idxrel, ShareLock);
+ ItemPointerSetInvalid(out);
+ return;
+ }
+
+ /* the block exists now, proceed */
+ UnlockRelationForExtension(rmAccess->idxrel, ShareLock);
+ }
+
+ if (rmAccess->currBuf == InvalidBuffer ||
+ BufferGetBlockNumber(rmAccess->currBuf) != mapBlk)
+ {
+ if (rmAccess->currBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currBuf);
+
+ rmAccess->currBuf =
+ ReadBufferExtended(rmAccess->idxrel, MM_REVMAP_FORKNUM, mapBlk,
+ RBM_NORMAL, NULL);
+ }
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_SHARE);
+
+ iptr = (ItemPointerData *)
+ PageGetContents(BufferGetPage(rmAccess->currBuf));
+ iptr += HEAPBLK_TO_REVMAP_INDEX(rmAccess->pagesPerRange, heapBlk);
+
+ ItemPointerCopy(iptr, out);
+
+ if (ItemPointerIsValid(iptr))
+ LockTuple(rmAccess->idxrel, iptr, ShareLock);
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_UNLOCK);
+ }
+
+ /*
+ * Create a single-page reverse range map fork for a new minmax index
+ *
+ * NB -- caller is assumed to WAL-log this operation
+ */
+ void
+ mmRevmapCreate(Relation idxrel)
+ {
+ bool needLock;
+ Buffer buf;
+ Page page;
+
+ needLock = !RELATION_IS_LOCAL(idxrel);
+
+ /*
+ * XXX it's unclear that we need this lock, considering that the relation
+ * is likely being created ...
+ */
+ if (needLock)
+ LockRelationForExtension(idxrel, ExclusiveLock);
+
+ START_CRIT_SECTION();
+ RelationOpenSmgr(idxrel);
+ smgrcreate(idxrel->rd_smgr, MM_REVMAP_FORKNUM, false);
+ buf = ReadBufferExtended(idxrel, MM_REVMAP_FORKNUM, P_NEW, RBM_NORMAL,
+ NULL);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ page = BufferGetPage(buf);
+ PageInit(page, BLCKSZ, 0);
+ MarkBufferDirty(buf);
+
+ UnlockReleaseBuffer(buf);
+ END_CRIT_SECTION();
+
+ if (needLock)
+ UnlockRelationForExtension(idxrel, ExclusiveLock);
+ }
+
+ /*
+ * Extend the reverse range map to cover the given block number. Return false
+ * if the map already covered the requested range (no extension actually done),
+ * true otherwise.
+ *
+ * NB -- caller is responsible for ensuring this action is properly WAL-logged.
+ */
+ static bool
+ mmRevmapExtend(mmRevmapAccess *rmAccess, BlockNumber blkno)
+ {
+ char page[BLCKSZ];
+ bool extended = false;
+
+ MemSet(page, 0, sizeof(page));
+ PageInit(page, BLCKSZ, 0);
+
+ LockRelationForExtension(rmAccess->idxrel, ExclusiveLock);
+
+ /*
+ * first, refresh our idea of the current size; it might well have grown
+ * up to what we need since we last checked.
+ */
+ RelationOpenSmgr(rmAccess->idxrel);
+ rmAccess->physPagesInRevmap =
+ smgrnblocks(rmAccess->idxrel->rd_smgr, MM_REVMAP_FORKNUM);
+
+ /*
+ * Now extend it one page at a time. This might seem a bit inefficient,
+ * but normally we'd be extending by a single page anyway.
+ */
+ while (blkno >= rmAccess->physPagesInRevmap)
+ {
+ extended = true;
+ PageSetChecksumInplace(page, blkno);
+ smgrextend(rmAccess->idxrel->rd_smgr, MM_REVMAP_FORKNUM,
+ rmAccess->physPagesInRevmap, page, false);
+ rmAccess->physPagesInRevmap++;
+ }
+
+ Assert(rmAccess->physPagesInRevmap ==
+ smgrnblocks(rmAccess->idxrel->rd_smgr, MM_REVMAP_FORKNUM));
+
+ UnlockRelationForExtension(rmAccess->idxrel, ExclusiveLock);
+
+ return extended;
+ }
+
+ /*
+ * Truncate a revmap to the size needed for a table of the given number of
+ * blocks. This includes removing pages beyond the last one needed, and also
+ * zeroing out the excess entries in the last page.
+ *
+ * The caller should hold a lock that prevents the table from growing in
+ * the meantime.
+ */
+ void
+ mmRevmapTruncate(mmRevmapAccess *rmAccess, BlockNumber heapNumBlocks)
+ {
+ BlockNumber rmBlks;
+ char *data;
+ Page page;
+ Buffer buffer;
+
+ /* Remove blocks at the end */
+ rmBlks = HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapNumBlocks);
+
+ RelationOpenSmgr(rmAccess->idxrel);
+ smgrtruncate(rmAccess->idxrel->rd_smgr, MM_REVMAP_FORKNUM, rmBlks + 1);
+
+ /* zero out the remaining items in the last page */
+ buffer = ReadBufferExtended(rmAccess->idxrel,
+ MM_REVMAP_FORKNUM, rmBlks,
+ RBM_NORMAL, NULL);
+ LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+ page = PageGetContents(BufferGetPage(buffer));
+ data = page + sizeof(ItemPointerData) *
+ HEAPBLK_TO_REVMAP_INDEX(rmAccess->pagesPerRange, heapNumBlocks + 1);
+
+ memset(data, 0, page + MAPSIZE - data);
+
+ UnlockReleaseBuffer(buffer);
+ }
*** /dev/null
--- b/src/backend/access/minmax/mmtuple.c
***************
*** 0 ****
--- 1,388 ----
+ /*
+ * MinMax-specific tuples
+ * Method implementations for tuples in minmax indexes.
+ *
+ * The intended interface is that code outside this file only deals with
+ * DeformedMMTuples, converting to and from the on-disk representation by
+ * using the functions in this file.
+ *
+ * NOTES
+ *
+ * A minmax tuple is similar to a heap tuple, with a few key differences. The
+ * first interesting difference is that the tuple header is much simpler,
+ * containing only the offset of the data area and a small area for flags.
+ * Also, the stored
+ * data does not match the tuple descriptor exactly: for each attribute in the
+ * descriptor, the index tuple carries two values, one for the minimum value in
+ * that column and one for the maximum.
+ *
+ * Also, for each column there are two null bits: one (hasnulls) stores whether
+ * any tuple within the page range has that column set to null; the other
+ * (allnulls) stores whether the column values are all null. If allnulls is
+ * true, then the tuple data area does not contain min/max values for that
+ * column at all; whereas it does if only hasnulls is set. Note we always
+ * store a double-length null bitmask; for typical indexes of four columns or
+ * fewer, it takes a single byte anyway. It doesn't seem worth trying to
+ * optimize this further.
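+ *
+ * For illustration, a tuple over a two-column index in which neither column
+ * is all-nulls is laid out as the MMTuple header, optionally followed by the
+ * double-length null bitmask, followed by a data area containing, in order:
+ * min(col1), max(col1), min(col2), max(col2).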
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmtuple.c
+ */
+ #include "postgres.h"
+
+ #include "access/htup_details.h"
+ #include "access/minmax_tuple.h"
+ #include "access/tupdesc.h"
+ #include "access/tupmacs.h"
+
+
+ static inline void mm_deconstruct_tuple(char *tp, bits8 *nullbits, bool nulls,
+ int natts, Form_pg_attribute *att,
+ Datum *values, bool *allnulls, bool *hasnulls);
+
+
+ /*
+ * Generate an internal-style tuple descriptor to pass to minmax_form_tuple.
+ * These have no use outside this module.
+ *
+ * The argument is a minmax index's regular tuple descriptor.
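+ *
+ * For example, an index over (a int4, b text) yields a four-attribute
+ * descriptor (int4, int4, text, text), holding min and max for each of the
+ * two indexed columns.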
+ */
+ TupleDesc
+ minmax_get_descr(TupleDesc tupdesc)
+ {
+ TupleDesc diskDesc;
+ int i,
+ j;
+
+ diskDesc = CreateTemplateTupleDesc(tupdesc->natts * 2, false);
+
+ for (i = 0, j = 1; i < tupdesc->natts; i++)
+ {
+ /* min */
+ TupleDescInitEntry(diskDesc,
+ j++,
+ NULL,
+ tupdesc->attrs[i]->atttypid,
+ tupdesc->attrs[i]->atttypmod,
+ 0);
+ /* max */
+ TupleDescInitEntry(diskDesc,
+ j++,
+ NULL,
+ tupdesc->attrs[i]->atttypid,
+ tupdesc->attrs[i]->atttypmod,
+ 0);
+ }
+
+ return diskDesc;
+ }
+
+ /*
+ * Generate a new on-disk tuple to be inserted in a minmax index.
+ *
+ * The first tuple descriptor passed corresponds to the catalogued index info,
+ * that is, it is the index's descriptor; the second descriptor must be
+ * obtained by calling minmax_get_descr() on that descriptor.
+ *
+ * (The reason for this slightly grotty arrangement is that we use heap tuple
+ * functions to implement packing of a tuple into the on-disk format.)
+ */
+ MMTuple *
+ minmax_form_tuple(TupleDesc idxDsc, TupleDesc diskDsc, DeformedMMTuple *tuple,
+ Size *size)
+ {
+ Datum *values;
+ bool *nulls;
+ bool anynulls = false;
+ MMTuple *rettuple;
+ int keyno;
+ uint16 phony_infomask;
+ bits8 *phony_nullbitmap;
+ Size len,
+ hoff,
+ data_len;
+
+ Assert(diskDsc->natts > 0);
+
+ values = palloc(sizeof(Datum) * diskDsc->natts);
+ nulls = palloc0(sizeof(bool) * diskDsc->natts);
+ phony_nullbitmap = palloc(sizeof(bits8) * BITMAPLEN(diskDsc->natts));
+
+ /*
+ * Set up the values/nulls arrays for heap_fill_tuple
+ */
+ for (keyno = 0; keyno < idxDsc->natts; keyno++)
+ {
+ int idxattno = keyno * 2;
+
+ /*
+ * "allnulls" is set when there's no nonnull value in any row in
+ * the column; set the nullable bits for both min and max attrs.
+ */
+ if (tuple->values[keyno].allnulls)
+ {
+ nulls[idxattno] = true;
+ nulls[idxattno + 1] = true;
+ anynulls = true;
+ continue;
+ }
+
+ if (tuple->values[keyno].hasnulls)
+ anynulls = true;
+
+ values[idxattno] = tuple->values[keyno].min;
+ values[idxattno + 1] = tuple->values[keyno].max;
+ }
+
+ /* compute total space needed */
+ len = SizeOfMinMaxTuple;
+ if (anynulls)
+ {
+ /*
+ * We need a double-length bitmap on an on-disk minmax index tuple;
+ * the first half stores the "allnulls" bits, the second stores
+ * "hasnulls".
+ */
+ len += BITMAPLEN(idxDsc->natts * 2);
+ }
+
+ /*
+ * TODO: we can probably do away with alignment here, and save some
+ * precious disk space. When there's no bitmap we can save 6 bytes. Maybe
+ * we can use the first col's type alignment instead of maxalign.
+ */
+ len = hoff = MAXALIGN(len);
+
+ data_len = heap_compute_data_size(diskDsc, values, nulls);
+
+ len += data_len;
+
+ rettuple = palloc0(len);
+ rettuple->mt_info = hoff;
+ Assert((rettuple->mt_info & MMIDX_OFFSET_MASK) == hoff);
+
+ /*
+ * The infomask and null bitmap as computed by heap_fill_tuple are useless
+ * to us. However, that function will not accept a null infomask; and we
+ * need to pass a valid null bitmap so that it will correctly skip
+ * outputting null attributes in the data area.
+ */
+ heap_fill_tuple(diskDsc,
+ values,
+ nulls,
+ (char *) rettuple + hoff,
+ data_len,
+ &phony_infomask,
+ phony_nullbitmap);
+
+ /* done with these */
+ pfree(values);
+ pfree(nulls);
+ pfree(phony_nullbitmap);
+
+ /*
+ * Now fill in the real null bitmasks. allnulls first.
+ */
+ if (anynulls)
+ {
+ bits8 *bitP;
+ int bitmask;
+
+ rettuple->mt_info |= MMIDX_NULLS_MASK;
+
+ bitP = ((bits8 *) ((char *) rettuple + SizeOfMinMaxTuple)) - 1;
+ bitmask = HIGHBIT;
+ for (keyno = 0; keyno < idxDsc->natts; keyno++)
+ {
+ if (bitmask != HIGHBIT)
+ bitmask <<= 1;
+ else
+ {
+ bitP += 1;
+ *bitP = 0x0;
+ bitmask = 1;
+ }
+
+ if (tuple->values[keyno].allnulls)
+ continue;
+
+ *bitP |= bitmask;
+ }
+ /* hasnulls bits follow */
+ for (keyno = 0; keyno < idxDsc->natts; keyno++)
+ {
+ if (bitmask != HIGHBIT)
+ bitmask <<= 1;
+ else
+ {
+ bitP += 1;
+ *bitP = 0x0;
+ bitmask = 1;
+ }
+
+ if (tuple->values[keyno].hasnulls)
+ continue;
+
+ *bitP |= bitmask;
+ }
+ }
+
+ *size = len;
+ return rettuple;
+ }
+
+ /*
+ * Free a tuple created by minmax_form_tuple
+ */
+ void
+ minmax_free_tuple(MMTuple *tuple)
+ {
+ pfree(tuple);
+ }
+
+ /*
+ * Convert a MMTuple back to a DeformedMMTuple. This is the reverse of
+ * minmax_form_tuple.
+ *
+ * Note we don't need the "on disk tupdesc" here; we rely on our own routine to
+ * deconstruct the tuple from the on-disk format.
+ *
+ * XXX some callers might need copies of each datum; if so we need
+ * to apply datumCopy inside the loop. We probably also need a
+ * minmax_free_dtuple() function.
+ */
+ DeformedMMTuple *
+ minmax_deform_tuple(TupleDesc tupdesc, MMTuple *tuple)
+ {
+ DeformedMMTuple *dtup;
+ Datum *values;
+ bool *allnulls;
+ bool *hasnulls;
+ char *tp;
+ bits8 *nullbits = NULL;
+ int keyno;
+
+ dtup = palloc(offsetof(DeformedMMTuple, values) +
+ sizeof(MMValues) * tupdesc->natts);
+
+ values = palloc(sizeof(Datum) * tupdesc->natts * 2);
+ allnulls = palloc(sizeof(bool) * tupdesc->natts);
+ hasnulls = palloc(sizeof(bool) * tupdesc->natts);
+
+ tp = (char *) tuple + MMTupleDataOffset(tuple);
+
+ if (MMTupleHasNulls(tuple))
+ nullbits = (bits8 *) ((char *) tuple + SizeOfMinMaxTuple);
+ mm_deconstruct_tuple(tp, nullbits,
+ MMTupleHasNulls(tuple),
+ tupdesc->natts, tupdesc->attrs, values,
+ allnulls, hasnulls);
+
+ for (keyno = 0; keyno < tupdesc->natts; keyno++)
+ {
+ if (allnulls[keyno])
+ {
+ dtup->values[keyno].allnulls = true;
+ continue;
+ }
+
+ /* XXX optional datumCopy() */
+ dtup->values[keyno].min = values[keyno * 2];
+ dtup->values[keyno].max = values[keyno * 2 + 1];
+ dtup->values[keyno].hasnulls = hasnulls[keyno];
+ dtup->values[keyno].allnulls = false;
+ }
+
+ pfree(values);
+ pfree(allnulls);
+ pfree(hasnulls);
+
+ return dtup;
+ }
+
+ /*
+ * mm_deconstruct_tuple
+ * Guts of attribute extraction from an on-disk minmax tuple.
+ *
+ * Its arguments are:
+ * tp pointer to the tuple data area
+ * nullbits pointer to the tuple nulls bitmask
+ * nulls "has nulls" bit in tuple infomask
+ * natts number of array members in att
+ * att the tuple's TupleDesc Form_pg_attribute array
+ * values output values, size 2 * natts (alternates min and max)
+ * allnulls output "allnulls", size natts
+ * hasnulls output "hasnulls", size natts
+ *
+ * Output arrays are allocated by caller.
+ */
+ static inline void
+ mm_deconstruct_tuple(char *tp, bits8 *nullbits, bool nulls,
+ int natts, Form_pg_attribute *att,
+ Datum *values, bool *allnulls, bool *hasnulls)
+ {
+ int attnum;
+ long off = 0;
+
+ /*
+ * First, loop over the natts attributes to obtain both null flags for each.
+ */
+ for (attnum = 0; attnum < natts; attnum++)
+ {
+ /*
+ * the "all nulls" bit means that all values in the page range for
+ * this column are nulls. Therefore there are no values in the tuple
+ * data area.
+ */
+ if (nulls && att_isnull(attnum, nullbits))
+ {
+ values[attnum] = (Datum) 0;
+ allnulls[attnum] = true;
+ hasnulls[attnum] = true; /* XXX ? */
+ continue;
+ }
+
+ allnulls[attnum] = false;
+
+ /*
+ * the "has nulls" bit means that some tuples have nulls, but others
+ * have not-null values. So the tuple data does have data for this
+ * column.
+ *
+ * The hasnulls bits follow the allnulls bits in the same bitmask.
+ */
+ hasnulls[attnum] = nulls && att_isnull(natts + attnum, nullbits);
+ }
+
+ /*
+ * Then we iterate up to natts * 2 to obtain each attribute's min and max
+ * values. Note that since we reuse attribute entries (first for the
+ * minimum value of the corresponding column, then for max), we cannot
+ * cache offsets here.
+ */
+ for (attnum = 0; attnum < natts * 2; attnum++)
+ {
+ int true_attnum = attnum / 2;
+ Form_pg_attribute thisatt = att[true_attnum];
+
+ if (allnulls[true_attnum])
+ continue;
+
+ if (thisatt->attlen == -1)
+ {
+ off = att_align_pointer(off, thisatt->attalign, -1,
+ tp + off);
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ off = att_align_nominal(off, thisatt->attalign);
+ }
+
+ values[attnum] = fetchatt(thisatt, tp + off);
+
+ off = att_addlength_pointer(off, thisatt->attlen, tp + off);
+ }
+ }
*** /dev/null
--- b/src/backend/access/minmax/mmxlog.c
***************
*** 0 ****
--- 1,212 ----
+ /*
+ * mmxlog.c
+ * XLog replay routines for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmxlog.c
+ */
+ #include "postgres.h"
+
+ #include "access/minmax.h"
+ #include "access/minmax_internal.h"
+ #include "access/minmax_tuple.h"
+ #include "access/minmax_xlog.h"
+ #include "access/xlogutils.h"
+ #include "storage/freespace.h"
+
+
+ /*
+ * xlog replay routines
+ */
+ static void
+ minmax_xlog_createidx(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_createidx *xlrec = (xl_minmax_createidx *) XLogRecGetData(record);
+ Buffer buf;
+ Page page;
+
+ /* Backup blocks are not used in create_index records */
+ Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+
+ /* create the index's metapage */
+ buf = XLogReadBuffer(xlrec->node, MINMAX_METAPAGE_BLKNO, true);
+ Assert(BufferIsValid(buf));
+ page = (Page) BufferGetPage(buf);
+ mm_init_metapage(buf);
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ UnlockReleaseBuffer(buf);
+
+ /* also initialize its revmap fork */
+ buf = XLogReadBufferExtended(xlrec->node, MM_REVMAP_FORKNUM, 0, RBM_ZERO);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ Assert(BufferIsValid(buf));
+ page = (Page) BufferGetPage(buf);
+ PageInit(page, BLCKSZ, 0);
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ UnlockReleaseBuffer(buf);
+ }
+
+ static void
+ minmax_xlog_insert(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_insert *xlrec = (xl_minmax_insert *) XLogRecGetData(record);
+ BlockNumber blkno;
+ Buffer buffer;
+ Page page;
+ OffsetNumber offnum;
+ int tuplen;
+ MMTuple *mmtuple;
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ blkno = ItemPointerGetBlockNumber(&(xlrec->target.tid));
+ if (record->xl_info & XLOG_MINMAX_INIT_PAGE)
+ {
+ buffer = XLogReadBuffer(xlrec->target.node, blkno, true);
+ Assert(BufferIsValid(buffer));
+ page = (Page) BufferGetPage(buffer);
+
+ PageInit(page, BufferGetPageSize(buffer), 0); /* XXX size correct?? */
+ }
+ else
+ {
+ buffer = XLogReadBuffer(xlrec->target.node, blkno, false);
+ if (!BufferIsValid(buffer))
+ return;
+ page = (Page) BufferGetPage(buffer);
+
+ if (lsn <= PageGetLSN(page)) /* changes are applied */
+ {
+ UnlockReleaseBuffer(buffer);
+ return;
+ }
+ }
+ offnum = ItemPointerGetOffsetNumber(&(xlrec->target.tid));
+ if (PageGetMaxOffsetNumber(page) + 1 < offnum)
+ elog(PANIC, "minmax_xlog_insert: invalid max offset number");
+
+ tuplen = record->xl_len - SizeOfMinmaxInsert;
+ mmtuple = (MMTuple *) ((char *) xlrec + SizeOfMinmaxInsert);
+
+ offnum = PageAddItem(page, (Item) mmtuple, tuplen, offnum, true, false);
+ if (offnum == InvalidOffsetNumber)
+ elog(PANIC, "minmax_xlog_insert: failed to add tuple");
+
+ PageSetLSN(page, lsn);
+
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+
+ /* XXX no FSM updates here ... */
+ }
+
+ static void
+ minmax_xlog_bulkremove(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_bulkremove *xlrec = (xl_minmax_bulkremove *) XLogRecGetData(record);
+ Buffer buffer;
+ Page page;
+ OffsetNumber *offnos;
+ int noffs;
+ Size freespace;
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ buffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
+ if (!BufferIsValid(buffer))
+ return;
+ page = (Page) BufferGetPage(buffer);
+
+ if (lsn <= PageGetLSN(page)) /* changes are applied */
+ {
+ UnlockReleaseBuffer(buffer);
+ return;
+ }
+
+ offnos = (OffsetNumber *) ((char *) xlrec + SizeOfMinmaxBulkRemove);
+ noffs = (record->xl_len - SizeOfMinmaxBulkRemove) / sizeof(OffsetNumber);
+
+ PageIndexDeleteNoCompact(page, offnos, noffs);
+ freespace = PageGetFreeSpace(page);
+
+ PageSetLSN(page, lsn);
+
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+
+ /* update FSM as well */
+ XLogRecordPageWithFreeSpace(xlrec->node, xlrec->block, freespace);
+ }
+
+ static void
+ minmax_xlog_revmap_set(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_rm_set *xlrec = (xl_minmax_rm_set *) XLogRecGetData(record);
+ bool init;
+ Buffer buffer;
+ Page page;
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ init = (record->xl_info & XLOG_MINMAX_INIT_PAGE) != 0;
+ buffer = XLogReadBufferExtended(xlrec->node,
+ MM_REVMAP_FORKNUM, xlrec->mapBlock,
+ init ? RBM_ZERO : RBM_NORMAL);
+ Assert(BufferIsValid(buffer));
+ LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+ page = BufferGetPage(buffer);
+ if (init)
+ PageInit(page, BufferGetPageSize(buffer), 0);
+
+ rm_page_set_iptr(page, xlrec->pagesPerRange, xlrec->heapBlock,
+ ItemPointerGetBlockNumber(&(xlrec->newval)),
+ ItemPointerGetOffsetNumber(&(xlrec->newval)));
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+ }
+
+ void
+ minmax_redo(XLogRecPtr lsn, XLogRecord *record)
+ {
+ uint8 info = record->xl_info & ~XLR_INFO_MASK;
+
+ switch (info & XLOG_MINMAX_OPMASK)
+ {
+ case XLOG_MINMAX_CREATE_INDEX:
+ minmax_xlog_createidx(lsn, record);
+ break;
+ case XLOG_MINMAX_INSERT:
+ minmax_xlog_insert(lsn, record);
+ break;
+ case XLOG_MINMAX_BULKREMOVE:
+ minmax_xlog_bulkremove(lsn, record);
+ break;
+ case XLOG_MINMAX_REVMAP_SET:
+ minmax_xlog_revmap_set(lsn, record);
+ break;
+ default:
+ elog(PANIC, "minmax_redo: unknown op code %u", info);
+ }
+ }
*** a/src/backend/access/rmgrdesc/Makefile
--- b/src/backend/access/rmgrdesc/Makefile
***************
*** 9,15 **** top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = clogdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
! mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o smgrdesc.o spgdesc.o \
standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
include $(top_srcdir)/src/backend/common.mk
--- 9,16 ----
include $(top_builddir)/src/Makefile.global
OBJS = clogdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
! minmaxdesc.o mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o \
! smgrdesc.o spgdesc.o \
standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
include $(top_srcdir)/src/backend/common.mk
*** /dev/null
--- b/src/backend/access/rmgrdesc/minmaxdesc.c
***************
*** 0 ****
--- 1,74 ----
+ /*-------------------------------------------------------------------------
+ *
+ * minmaxdesc.c
+ * rmgr descriptor routines for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/rmgrdesc/minmaxdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+ #include "postgres.h"
+
+ #include "access/minmax_xlog.h"
+
+ static void
+ out_target(StringInfo buf, xl_minmax_tid *target)
+ {
+ appendStringInfo(buf, "rel %u/%u/%u; tid %u/%u",
+ target->node.spcNode, target->node.dbNode, target->node.relNode,
+ ItemPointerGetBlockNumber(&(target->tid)),
+ ItemPointerGetOffsetNumber(&(target->tid)));
+ }
+
+ void
+ minmax_desc(StringInfo buf, uint8 xl_info, char *rec)
+ {
+ uint8 info = xl_info & ~XLR_INFO_MASK;
+
+ info &= XLOG_MINMAX_OPMASK;
+ if (info == XLOG_MINMAX_CREATE_INDEX)
+ {
+ xl_minmax_createidx *xlrec = (xl_minmax_createidx *) rec;
+
+ appendStringInfo(buf, "create index: %u/%u/%u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode);
+ }
+ else if (info == XLOG_MINMAX_INSERT)
+ {
+ xl_minmax_insert *xlrec = (xl_minmax_insert *) rec;
+
+ if (xl_info & XLOG_MINMAX_INIT_PAGE)
+ appendStringInfo(buf, "insert(init): ");
+ else
+ appendStringInfo(buf, "insert: ");
+ out_target(buf, &(xlrec->target));
+ }
+ else if (info == XLOG_MINMAX_BULKREMOVE)
+ {
+ xl_minmax_bulkremove *xlrec = (xl_minmax_bulkremove *) rec;
+
+ appendStringInfo(buf, "bulkremove: rel %u/%u/%u blk %u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode, xlrec->block);
+ }
+ else if (info == XLOG_MINMAX_REVMAP_SET)
+ {
+ xl_minmax_rm_set *xlrec = (xl_minmax_rm_set *) rec;
+
+ appendStringInfo(buf, "revmap set: rel %u/%u/%u mapblk %u pagesPerRange %u item %u value %u/%u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode, xlrec->mapBlock,
+ xlrec->pagesPerRange, xlrec->heapBlock,
+ ItemPointerGetBlockNumber(&(xlrec->newval)),
+ ItemPointerGetOffsetNumber(&(xlrec->newval)));
+ }
+ else
+ appendStringInfo(buf, "UNKNOWN");
+ }
+
*** a/src/backend/access/transam/rmgr.c
--- b/src/backend/access/transam/rmgr.c
***************
*** 12,17 ****
--- 12,18 ----
#include "access/gist_private.h"
#include "access/hash.h"
#include "access/heapam_xlog.h"
+ #include "access/minmax_xlog.h"
#include "access/multixact.h"
#include "access/nbtree.h"
#include "access/spgist.h"
*** a/src/backend/catalog/index.c
--- b/src/backend/catalog/index.c
***************
*** 2116,2121 **** IndexBuildHeapScan(Relation heapRelation,
--- 2116,2142 ----
IndexBuildCallback callback,
void *callback_state)
{
+ return IndexBuildHeapRangeScan(heapRelation, indexRelation,
+ indexInfo, allow_sync,
+ 0, InvalidBlockNumber,
+ callback, callback_state);
+ }
+
+ /*
+ * As above, except that instead of scanning the complete heap, only the given
+ * number of blocks starting at start_blockno is scanned. Scanning to the end
+ * of the relation can be signalled by passing InvalidBlockNumber as numblocks.
+ */
+ double
+ IndexBuildHeapRangeScan(Relation heapRelation,
+ Relation indexRelation,
+ IndexInfo *indexInfo,
+ bool allow_sync,
+ BlockNumber start_blockno,
+ BlockNumber numblocks,
+ IndexBuildCallback callback,
+ void *callback_state)
+ {
bool is_system_catalog;
bool checking_uniqueness;
HeapScanDesc scan;
***************
*** 2186,2191 **** IndexBuildHeapScan(Relation heapRelation,
--- 2207,2215 ----
true, /* buffer access strategy OK */
allow_sync); /* syncscan OK? */
+ /* set our endpoints */
+ heap_setscanlimits(scan, start_blockno, numblocks);
+
reltuples = 0;
/*
*** a/src/backend/storage/page/bufpage.c
--- b/src/backend/storage/page/bufpage.c
***************
*** 899,904 **** PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
--- 899,1074 ----
pfree(itemidbase);
}
+ /*
+ * PageIndexDeleteNoCompact
+ * Delete the given items from an index page, and defragment the resulting
+ * free space, but do not compact the item pointers array.
+ *
+ * Unused items at the end of the array are removed.
+ *
+ * This is used for index AMs that require that existing TIDs of live tuples
+ * remain unchanged.
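+ *
+ * (Contrast with PageIndexMultiDelete, which compacts the line pointer array
+ * and would therefore change the offset numbers of the remaining items.)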
+ */
+ void
+ PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos, int nitems)
+ {
+ PageHeader phdr = (PageHeader) page;
+ LocationIndex pd_lower = phdr->pd_lower;
+ LocationIndex pd_upper = phdr->pd_upper;
+ LocationIndex pd_special = phdr->pd_special;
+ int nline,
+ nstorage;
+ OffsetNumber offnum;
+ int nextitm;
+
+ /*
+ * As with PageRepairFragmentation, paranoia seems justified.
+ */
+ if (pd_lower < SizeOfPageHeaderData ||
+ pd_lower > pd_upper ||
+ pd_upper > pd_special ||
+ pd_special > BLCKSZ ||
+ pd_special != MAXALIGN(pd_special))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u",
+ pd_lower, pd_upper, pd_special)));
+
+ /*
+ * Scan the item pointer array: validity-check each entry, mark the items
+ * to be deleted as unused, and count the remaining items that have storage.
+ */
+ nline = PageGetMaxOffsetNumber(page);
+ nstorage = 0;
+ nextitm = 0;
+ for (offnum = FirstOffsetNumber; offnum <= nline; offnum = OffsetNumberNext(offnum))
+ {
+ ItemId lp;
+ ItemLength itemlen;
+ ItemOffset offset;
+
+ lp = PageGetItemId(page, offnum);
+
+ itemlen = ItemIdGetLength(lp);
+ offset = ItemIdGetOffset(lp);
+
+ if (ItemIdIsUsed(lp))
+ {
+ if (offset < pd_upper ||
+ (offset + itemlen) > pd_special ||
+ offset != MAXALIGN(offset))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted item pointer: offset = %u, length = %u",
+ offset, (unsigned int) itemlen)));
+
+ if (nextitm < nitems && offnum == itemnos[nextitm])
+ {
+ ItemIdSetUnused(lp);
+ nextitm++;
+ }
+ else if (ItemIdHasStorage(lp))
+ nstorage++;
+ }
+ }
+
+ /* this will catch invalid or out-of-order itemnos[] */
+ if (nextitm != nitems)
+ elog(ERROR, "incorrect index offsets supplied");
+
+ if (nstorage == 0)
+ {
+ /* Page is completely empty, so just reset it quickly */
+ phdr->pd_lower = SizeOfPageHeaderData;
+ phdr->pd_upper = pd_special;
+ }
+ else
+ {
+ /* There are live items: need to compact the page the hard way */
+ char pageCopy[BLCKSZ];
+ itemIdSort itemidbase,
+ itemidptr;
+ int lastused;
+ int i;
+ Size totallen;
+ Offset upper;
+
+ /*
+ * First scan the page taking note of each item that we need to
+ * preserve. This includes both live items (those that contain data)
+ * and interspersed unused ones. It's critical to preserve these unused
+ * items, because otherwise the offset numbers for later live items
+ * would change, which is not acceptable.
+ */
+ itemidbase = (itemIdSort) palloc(sizeof(itemIdSortData) * nline);
+ itemidptr = itemidbase;
+ totallen = 0;
+ for (i = 0; i < nline; i++, itemidptr++)
+ {
+ ItemId lp;
+
+ itemidptr->offsetindex = i;
+
+ lp = PageGetItemId(page, i + 1);
+ if (ItemIdHasStorage(lp))
+ {
+ itemidptr->itemoff = ItemIdGetOffset(lp);
+ itemidptr->alignedlen = MAXALIGN(ItemIdGetLength(lp));
+ totallen += itemidptr->alignedlen;
+ }
+ else
+ {
+ itemidptr->itemoff = 0;
+ itemidptr->alignedlen = 0;
+ }
+ }
+
+ if (totallen > (Size) (pd_special - pd_lower))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted item lengths: total %u, available space %u",
+ (unsigned int) totallen, pd_special - pd_lower)));
+
+ /*
+ * Defragment the data areas of each tuple. Note that since offset
+ * numbers must remain unchanged in these pages, we can't do a qsort()
+ * of the itemIdSort elements here; and because the elements are not
+ * sorted by offset, we can't use memmove() to defragment the occupied
+ * data space. So we first create a temporary copy of the original
+ * data page, from which we memcpy() each item's data onto the final
+ * page.
+ */
+ memcpy(pageCopy, page, BLCKSZ);
+ lastused = FirstOffsetNumber;
+ upper = pd_special;
+ PageClearHasFreeLinePointers(page);
+ for (i = 0, itemidptr = itemidbase; i < nline; i++, itemidptr++)
+ {
+ ItemId lp;
+
+ if (itemidptr->alignedlen == 0)
+ {
+ PageSetHasFreeLinePointers(page);
+ continue;
+ }
+ lp = PageGetItemId(page, itemidptr->offsetindex + 1);
+ upper -= itemidptr->alignedlen;
+ memcpy((char *) page + upper,
+ pageCopy + itemidptr->itemoff,
+ itemidptr->alignedlen);
+ lp->lp_off = upper;
+
+ lastused = i + 1;
+ }
+
+ /* Set the new page limits */
+ phdr->pd_upper = upper;
+ phdr->pd_lower = SizeOfPageHeaderData + lastused * sizeof(ItemIdData);
+
+ pfree(itemidbase);
+ }
+ }
/*
* Set checksum for a page in shared buffers.
*** a/src/include/access/heapam.h
--- b/src/include/access/heapam.h
***************
*** 112,117 **** extern HeapScanDesc heap_beginscan_strat(Relation relation, Snapshot snapshot,
--- 112,119 ----
bool allow_strat, bool allow_sync);
extern HeapScanDesc heap_beginscan_bm(Relation relation, Snapshot snapshot,
int nkeys, ScanKey key);
+ extern void heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk,
+ BlockNumber numBlks);
extern void heap_rescan(HeapScanDesc scan, ScanKey key);
extern void heap_endscan(HeapScanDesc scan);
extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
*** /dev/null
--- b/src/include/access/minmax.h
***************
*** 0 ****
--- 1,35 ----
+ /*
+ * AM-callable functions for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax.h
+ */
+ #ifndef MINMAX_H
+ #define MINMAX_H
+
+ #include "fmgr.h"
+
+
+ /*
+ * prototypes for functions in minmax.c (external entry points for minmax)
+ */
+ extern Datum mmbuild(PG_FUNCTION_ARGS);
+ extern Datum mmbuildempty(PG_FUNCTION_ARGS);
+ extern Datum mminsert(PG_FUNCTION_ARGS);
+ extern Datum mmbeginscan(PG_FUNCTION_ARGS);
+ extern Datum mmgettuple(PG_FUNCTION_ARGS);
+ extern Datum mmgetbitmap(PG_FUNCTION_ARGS);
+ extern Datum mmrescan(PG_FUNCTION_ARGS);
+ extern Datum mmendscan(PG_FUNCTION_ARGS);
+ extern Datum mmmarkpos(PG_FUNCTION_ARGS);
+ extern Datum mmrestrpos(PG_FUNCTION_ARGS);
+ extern Datum mmbulkdelete(PG_FUNCTION_ARGS);
+ extern Datum mmvacuumcleanup(PG_FUNCTION_ARGS);
+ extern Datum mmcanreturn(PG_FUNCTION_ARGS);
+ extern Datum mmcostestimate(PG_FUNCTION_ARGS);
+ extern Datum mmoptions(PG_FUNCTION_ARGS);
+
+ #endif /* MINMAX_H */
*** /dev/null
--- b/src/include/access/minmax_internal.h
***************
*** 0 ****
--- 1,39 ----
+ /*
+ * minmax_internal.h
+ * internal declarations for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_internal.h
+ */
+ #ifndef MINMAX_INTERNAL_H
+ #define MINMAX_INTERNAL_H
+
+ #include "storage/buf.h"
+ #include "storage/bufpage.h"
+ #include "storage/off.h"
+
+ /* Metapage definitions */
+ typedef struct MinmaxMetaPageData
+ {
+ int32 minmaxMagic;
+ int32 minmaxVersion;
+ } MinmaxMetaPageData;
+
+ #define MINMAX_CURRENT_VERSION 1
+ #define MINMAX_META_MAGIC 0xA8109CFA
+
+ #define MINMAX_METAPAGE_BLKNO 0
+
+ #define MM_REVMAP_FORKNUM VISIBILITYMAP_FORKNUM /* reuse the VM forknum */
+
+
+ extern void mm_init_metapage(Buffer meta);
+ extern void
+ rm_page_set_iptr(Page page, int pagesPerRange, BlockNumber heapBlk,
+ BlockNumber blkno, OffsetNumber offno);
+
+
+ #endif /* MINMAX_INTERNAL_H */
*** /dev/null
--- b/src/include/access/minmax_revmap.h
***************
*** 0 ****
--- 1,34 ----
+ /*
+ * prototypes for minmax reverse range maps
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_revmap.h
+ */
+
+ #ifndef MINMAX_REVMAP_H
+ #define MINMAX_REVMAP_H
+
+ #include "storage/block.h"
+ #include "storage/itemptr.h"
+ #include "storage/off.h"
+ #include "utils/relcache.h"
+
+ /* struct definition lives in mmrevmap.c */
+ typedef struct mmRevmapAccess mmRevmapAccess;
+
+ extern mmRevmapAccess *mmRevmapAccessInit(Relation idxrel,
+ BlockNumber pagesPerRange);
+ extern void mmRevmapAccessTerminate(mmRevmapAccess *rmAccess);
+
+ extern void mmRevmapCreate(Relation idxrel);
+ extern void mmSetHeapBlockItemptr(mmRevmapAccess *rmAccess, BlockNumber blk,
+ BlockNumber blkno, OffsetNumber offno);
+ extern void mmGetHeapBlockItemptr(mmRevmapAccess *rmAccess, BlockNumber blk,
+ ItemPointerData *iptr);
+ extern void mmRevmapTruncate(mmRevmapAccess *rmAccess,
+ BlockNumber heapNumBlocks);
+
+ #endif /* MINMAX_REVMAP_H */
*** /dev/null
--- b/src/include/access/minmax_tuple.h
***************
*** 0 ****
--- 1,79 ----
+ /*
+ * Declarations for dealing with MinMax-specific tuples.
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_tuple.h
+ */
+ #ifndef MINMAX_TUPLE_H
+ #define MINMAX_TUPLE_H
+
+ #include "access/tupdesc.h"
+
+
+ /*
+ * This struct is used to represent the indexed values for one column, within
+ * one page range.
+ */
+ typedef struct MMValues
+ {
+ Datum min;
+ Datum max;
+ bool hasnulls;
+ bool allnulls;
+ } MMValues;
+
+ /*
+ * This struct represents one index tuple, comprising the minimum and
+ * maximum values for all indexed columns, within one page range.
+ * The number of elements in the values array is determined by the accompanying
+ * tuple descriptor.
+ */
+ typedef struct DeformedMMTuple
+ {
+ bool nvalues; /* XXX unused */
+ MMValues values[FLEXIBLE_ARRAY_MEMBER];
+ } DeformedMMTuple;
+
+ /*
+ * An on-disk minmax tuple. This is possibly followed by a nulls bitmask, with
+ * room for natts*2 null bits; min and max Datum values for each column follow
+ * that.
+ */
+ typedef struct MMTuple
+ {
+ /* ---------------
+ * mt_info is laid out in the following fashion:
+ *
+ * 7th (high) bit: has nulls
+ * 6th bit: unused
+ * 5th bit: unused
+ * 4-0 bit: offset of data
+ * ---------------
+ */
+ uint8 mt_info;
+ } MMTuple;
+
+ #define SizeOfMinMaxTuple (offsetof(MMTuple, mt_info) + sizeof(uint8))
+
+ /*
+ * t_info manipulation macros
+ */
+ #define MMIDX_OFFSET_MASK 0x1F
+ /* bit 0x20 is not used at present */
+ /* bit 0x40 is not used at present */
+ #define MMIDX_NULLS_MASK 0x80
+
+ #define MMTupleDataOffset(mmtup) ((Size) (((MMTuple *) (mmtup))->mt_info & MMIDX_OFFSET_MASK))
+ #define MMTupleHasNulls(mmtup) (((((MMTuple *) (mmtup))->mt_info & MMIDX_NULLS_MASK)) != 0)
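+
+ /*
+ * For illustration: a tuple with nulls whose data area starts at offset 8
+ * (the usual MAXALIGN'ed header size on 64-bit machines) has
+ * mt_info = 0x08 | MMIDX_NULLS_MASK = 0x88.
+ */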
+
+
+ extern TupleDesc minmax_get_descr(TupleDesc tupdesc);
+ extern MMTuple *minmax_form_tuple(TupleDesc idxDesc, TupleDesc diskDesc,
+ DeformedMMTuple *tuple, Size *size);
+ extern void minmax_free_tuple(MMTuple *tuple);
+ extern DeformedMMTuple *minmax_deform_tuple(TupleDesc tupdesc, MMTuple *tuple);
+
+ #endif /* MINMAX_TUPLE_H */
*** /dev/null
--- b/src/include/access/minmax_xlog.h
***************
*** 0 ****
--- 1,93 ----
+ /*-------------------------------------------------------------------------
+ *
+ * minmax_xlog.h
+ * POSTGRES MinMax access XLOG definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/minmax_xlog.h
+ *
+ *-------------------------------------------------------------------------
+ */
+ #ifndef MINMAX_XLOG_H
+ #define MINMAX_XLOG_H
+
+ #include "access/xlog.h"
+ #include "storage/bufpage.h"
+ #include "storage/itemptr.h"
+ #include "storage/relfilenode.h"
+ #include "utils/relcache.h"
+
+
+ /*
+ * WAL record definitions for minmax's WAL operations
+ *
+ * XLOG allows to store some information in high 4 bits of log
+ * record xl_info field.
+ */
+ #define XLOG_MINMAX_CREATE_INDEX 0x00
+ #define XLOG_MINMAX_INSERT 0x10
+ #define XLOG_MINMAX_BULKREMOVE 0x20
+ #define XLOG_MINMAX_REVMAP_SET 0x30
+
+ #define XLOG_MINMAX_OPMASK 0x70
+ /*
+ * When we insert the first item on a new page, we restore the entire page in
+ * redo.
+ */
+ #define XLOG_MINMAX_INIT_PAGE 0x80
+
+ /* This is what we need to know about a minmax index create */
+ typedef struct xl_minmax_createidx
+ {
+ RelFileNode node;
+ } xl_minmax_createidx;
+ #define SizeOfMinmaxCreateIdx (offsetof(xl_minmax_createidx, node) + sizeof(RelFileNode))
+
+ /* All that we need to find a minmax tuple */
+ typedef struct xl_minmax_tid
+ {
+ RelFileNode node;
+ ItemPointerData tid;
+ } xl_minmax_tid;
+
+ #define SizeOfMinmaxTid (offsetof(xl_minmax_tid, tid) + SizeOfIptrData)
+
+ /* This is what we need to know about a minmax tuple insert */
+ typedef struct xl_minmax_insert
+ {
+ xl_minmax_tid target;
+ /* tuple data follows at end of struct */
+ } xl_minmax_insert;
+
+ #define SizeOfMinmaxInsert (offsetof(xl_minmax_insert, target) + SizeOfMinmaxTid)
+
+ /* This is what we need to know about a bulk minmax tuple remove */
+ typedef struct xl_minmax_bulkremove
+ {
+ RelFileNode node;
+ BlockNumber block;
+ /* offset number array follows at end of struct */
+ } xl_minmax_bulkremove;
+
+ #define SizeOfMinmaxBulkRemove (offsetof(xl_minmax_bulkremove, block) + sizeof(BlockNumber))
+
+ /* This is what we need to know about a revmap "set heap ptr" */
+ typedef struct xl_minmax_rm_set
+ {
+ RelFileNode node;
+ BlockNumber mapBlock;
+ int pagesPerRange;
+ BlockNumber heapBlock;
+ ItemPointerData newval;
+ } xl_minmax_rm_set;
+
+ #define SizeOfMinmaxRevmapSet (offsetof(xl_minmax_rm_set, newval) + SizeOfIptrData)
+
+
+ extern void minmax_desc(StringInfo buf, uint8 xl_info, char *rec);
+ extern void minmax_redo(XLogRecPtr lsn, XLogRecord *record);
+
+ #endif /* MINMAX_XLOG_H */
*** a/src/include/access/relscan.h
--- b/src/include/access/relscan.h
***************
*** 35,42 **** typedef struct HeapScanDescData
bool rs_temp_snap; /* unregister snapshot at scan end? */
/* state set up at initscan time */
! BlockNumber rs_nblocks; /* number of blocks to scan */
BlockNumber rs_startblock; /* block # to start at */
BufferAccessStrategy rs_strategy; /* access strategy for reads */
bool rs_syncscan; /* report location to syncscan logic? */
--- 35,44 ----
bool rs_temp_snap; /* unregister snapshot at scan end? */
/* state set up at initscan time */
! BlockNumber rs_nblocks; /* total number of blocks in rel */
BlockNumber rs_startblock; /* block # to start at */
+ BlockNumber rs_initblock; /* block # to consider initial of rel */
+ BlockNumber rs_numblocks; /* number of blocks to scan */
BufferAccessStrategy rs_strategy; /* access strategy for reads */
bool rs_syncscan; /* report location to syncscan logic? */
*** a/src/include/access/rmgrlist.h
--- b/src/include/access/rmgrlist.h
***************
*** 42,44 **** PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_xlog_startup, gin_xlog_cleanup
--- 42,45 ----
PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_xlog_startup, gist_xlog_cleanup, NULL)
PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, NULL, NULL, NULL)
PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_xlog_startup, spg_xlog_cleanup, NULL)
+ PG_RMGR(RM_MINMAX_ID, "MinMax", minmax_redo, minmax_desc, NULL, NULL, NULL)
*** a/src/include/catalog/index.h
--- b/src/include/catalog/index.h
***************
*** 97,102 **** extern double IndexBuildHeapScan(Relation heapRelation,
--- 97,110 ----
bool allow_sync,
IndexBuildCallback callback,
void *callback_state);
+ extern double IndexBuildHeapRangeScan(Relation heapRelation,
+ Relation indexRelation,
+ IndexInfo *indexInfo,
+ bool allow_sync,
+ BlockNumber start_blockno,
+ BlockNumber end_blockno,
+ IndexBuildCallback callback,
+ void *callback_state);
extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
*** a/src/include/catalog/pg_am.h
--- b/src/include/catalog/pg_am.h
***************
*** 132,136 **** DESCR("GIN index access method");
--- 132,138 ----
DATA(insert OID = 4000 ( spgist 0 5 f f f f f t f t f f f 0 spginsert spgbeginscan spggettuple spggetbitmap spgrescan spgendscan spgmarkpos spgrestrpos spgbuild spgbuildempty spgbulkdelete spgvacuumcleanup spgcanreturn spgcostestimate spgoptions ));
DESCR("SP-GiST index access method");
#define SPGIST_AM_OID 4000
+ DATA(insert OID = 3847 ( minmax 5 0 f f f f t t f t t f f 0 mminsert mmbeginscan - mmgetbitmap mmrescan mmendscan mmmarkpos mmrestrpos mmbuild mmbuildempty mmbulkdelete mmvacuumcleanup - mmcostestimate mmoptions ));
+ #define MINMAX_AM_OID 3847
#endif /* PG_AM_H */
*** a/src/include/catalog/pg_amop.h
--- b/src/include/catalog/pg_amop.h
***************
*** 781,784 **** DATA(insert ( 3474 3831 3831 8 s 3892 4000 0 ));
--- 781,811 ----
DATA(insert ( 3474 3831 2283 16 s 3889 4000 0 ));
DATA(insert ( 3474 3831 3831 18 s 3882 4000 0 ));
+ /*
+ * MinMax int4_ops
+ */
+ DATA(insert ( 3192 23 23 1 s 97 3847 0 ));
+ DATA(insert ( 3192 23 23 2 s 523 3847 0 ));
+ DATA(insert ( 3192 23 23 3 s 96 3847 0 ));
+ DATA(insert ( 3192 23 23 4 s 525 3847 0 ));
+ DATA(insert ( 3192 23 23 5 s 521 3847 0 ));
+
+ /*
+ * MinMax numeric_ops
+ */
+ DATA(insert ( 3193 1700 1700 1 s 1754 3847 0 ));
+ DATA(insert ( 3193 1700 1700 2 s 1755 3847 0 ));
+ DATA(insert ( 3193 1700 1700 3 s 1752 3847 0 ));
+ DATA(insert ( 3193 1700 1700 4 s 1757 3847 0 ));
+ DATA(insert ( 3193 1700 1700 5 s 1756 3847 0 ));
+
+ /*
+ * MinMax text_ops
+ */
+ DATA(insert ( 3194 25 25 1 s 664 3847 0 ));
+ DATA(insert ( 3194 25 25 2 s 665 3847 0 ));
+ DATA(insert ( 3194 25 25 3 s 98 3847 0 ));
+ DATA(insert ( 3194 25 25 4 s 667 3847 0 ));
+ DATA(insert ( 3194 25 25 5 s 666 3847 0 ));
+
#endif /* PG_AMOP_H */
*** a/src/include/catalog/pg_opclass.h
--- b/src/include/catalog/pg_opclass.h
***************
*** 227,231 **** DATA(insert ( 4000 range_ops PGNSP PGUID 3474 3831 t 0 ));
--- 227,234 ----
DATA(insert ( 4000 quad_point_ops PGNSP PGUID 4015 600 t 0 ));
DATA(insert ( 4000 kd_point_ops PGNSP PGUID 4016 600 f 0 ));
DATA(insert ( 4000 text_ops PGNSP PGUID 4017 25 t 0 ));
+ DATA(insert ( 3847 int4_ops PGNSP PGUID 3192 23 t 0 ));
+ DATA(insert ( 3847 numeric_ops PGNSP PGUID 3193 1700 t 0 ));
+ DATA(insert ( 3847 text_ops PGNSP PGUID 3194 25 t 0 ));
#endif /* PG_OPCLASS_H */
*** a/src/include/catalog/pg_opfamily.h
--- b/src/include/catalog/pg_opfamily.h
***************
*** 147,151 **** DATA(insert OID = 4015 ( 4000 quad_point_ops PGNSP PGUID ));
--- 147,154 ----
DATA(insert OID = 4016 ( 4000 kd_point_ops PGNSP PGUID ));
DATA(insert OID = 4017 ( 4000 text_ops PGNSP PGUID ));
#define TEXT_SPGIST_FAM_OID 4017
+ DATA(insert OID = 3192 ( 3847 int4_ops PGNSP PGUID ));
+ DATA(insert OID = 3193 ( 3847 numeric_ops PGNSP PGUID ));
+ DATA(insert OID = 3194 ( 3847 text_ops PGNSP PGUID ));
#endif /* PG_OPFAMILY_H */
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
***************
*** 561,566 **** DESCR("btree(internal)");
--- 561,594 ----
DATA(insert OID = 2785 ( btoptions PGNSP PGUID 12 1 0 0 0 f f f f t f s 2 0 17 "1009 16" _null_ _null_ _null_ _null_ btoptions _null_ _null_ _null_ ));
DESCR("btree(internal)");
+ DATA(insert OID = 3178 ( mmgetbitmap PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 20 "2281 2281" _null_ _null_ _null_ _null_ mmgetbitmap _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3179 ( mminsert PGNSP PGUID 12 1 0 0 0 f f f f t f v 6 0 16 "2281 2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mminsert _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3180 ( mmbeginscan PGNSP PGUID 12 1 0 0 0 f f f f t f v 3 0 2281 "2281 2281 2281" _null_ _null_ _null_ _null_ mmbeginscan _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3181 ( mmrescan PGNSP PGUID 12 1 0 0 0 f f f f t f v 5 0 2278 "2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmrescan _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3182 ( mmendscan PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmendscan _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3183 ( mmmarkpos PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmmarkpos _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3184 ( mmrestrpos PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmrestrpos _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3185 ( mmbuild PGNSP PGUID 12 1 0 0 0 f f f f t f v 3 0 2281 "2281 2281 2281" _null_ _null_ _null_ _null_ mmbuild _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3186 ( mmbuildempty PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmbuildempty _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3187 ( mmbulkdelete PGNSP PGUID 12 1 0 0 0 f f f f t f v 4 0 2281 "2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmbulkdelete _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3188 ( mmvacuumcleanup PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ mmvacuumcleanup _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3190 ( mmcostestimate PGNSP PGUID 12 1 0 0 0 f f f f t f v 7 0 2278 "2281 2281 2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmcostestimate _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3191 ( mmoptions PGNSP PGUID 12 1 0 0 0 f f f f t f s 2 0 17 "1009 16" _null_ _null_ _null_ _null_ mmoptions _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+
+
DATA(insert OID = 339 ( poly_same PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "604 604" _null_ _null_ _null_ _null_ poly_same _null_ _null_ _null_ ));
DATA(insert OID = 340 ( poly_contain PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "604 604" _null_ _null_ _null_ _null_ poly_contain _null_ _null_ _null_ ));
DATA(insert OID = 341 ( poly_left PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "604 604" _null_ _null_ _null_ _null_ poly_left _null_ _null_ _null_ ));
*** a/src/include/storage/bufpage.h
--- b/src/include/storage/bufpage.h
***************
*** 403,408 **** extern Size PageGetExactFreeSpace(Page page);
--- 403,409 ----
extern Size PageGetHeapFreeSpace(Page page);
extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
+ extern void PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos, int nitems);
extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
*** a/src/test/regress/expected/opr_sanity.out
--- b/src/test/regress/expected/opr_sanity.out
***************
*** 1076,1081 **** ORDER BY 1, 2, 3;
--- 1076,1086 ----
2742 | 2 | @@@
2742 | 3 | <@
2742 | 4 | =
+ 3847 | 1 | <
+ 3847 | 2 | <=
+ 3847 | 3 | =
+ 3847 | 4 | >=
+ 3847 | 5 | >
4000 | 1 | <<
4000 | 1 | ~<~
4000 | 2 | &<
***************
*** 1098,1104 **** ORDER BY 1, 2, 3;
4000 | 15 | >
4000 | 16 | @>
4000 | 18 | =
! (62 rows)
-- Check that all opclass search operators have selectivity estimators.
-- This is not absolutely required, but it seems a reasonable thing
--- 1103,1109 ----
4000 | 15 | >
4000 | 16 | @>
4000 | 18 | =
! (67 rows)
-- Check that all opclass search operators have selectivity estimators.
-- This is not absolutely required, but it seems a reasonable thing
***************
*** 1271,1277 **** FROM pg_am am JOIN pg_opclass op ON opcmethod = am.oid
amproclefttype = amprocrighttype AND amproclefttype = opcintype
WHERE am.amname <> 'btree' AND am.amname <> 'gist' AND am.amname <> 'gin'
GROUP BY amname, amsupport, opcname, amprocfamily
! HAVING count(*) != amsupport OR amprocfamily IS NULL;
amname | opcname | count
--------+---------+-------
(0 rows)
--- 1276,1282 ----
amproclefttype = amprocrighttype AND amproclefttype = opcintype
WHERE am.amname <> 'btree' AND am.amname <> 'gist' AND am.amname <> 'gin'
GROUP BY amname, amsupport, opcname, amprocfamily
! HAVING count(*) != amsupport AND amprocfamily IS NOT NULL;
amname | opcname | count
--------+---------+-------
(0 rows)
*** a/src/test/regress/sql/opr_sanity.sql
--- b/src/test/regress/sql/opr_sanity.sql
***************
*** 978,984 **** FROM pg_am am JOIN pg_opclass op ON opcmethod = am.oid
amproclefttype = amprocrighttype AND amproclefttype = opcintype
WHERE am.amname <> 'btree' AND am.amname <> 'gist' AND am.amname <> 'gin'
GROUP BY amname, amsupport, opcname, amprocfamily
! HAVING count(*) != amsupport OR amprocfamily IS NULL;
SELECT amname, opcname, count(*)
FROM pg_am am JOIN pg_opclass op ON opcmethod = am.oid
--- 978,984 ----
amproclefttype = amprocrighttype AND amproclefttype = opcintype
WHERE am.amname <> 'btree' AND am.amname <> 'gist' AND am.amname <> 'gin'
GROUP BY amname, amsupport, opcname, amprocfamily
! HAVING count(*) != amsupport AND amprocfamily IS NOT NULL;
SELECT amname, opcname, count(*)
FROM pg_am am JOIN pg_opclass op ON opcmethod = am.oid
On Wed, September 25, 2013 22:34, Alvaro Herrera wrote:
[minmax-5.patch]
I have the impression it's not quite working correctly.
The attached program returns different results for different values of enable_bitmapscan (consistently).
( Btw, I had to make the max_locks_per_transaction higher for even not-so-large tables -- is that expected? For a 100M row
table, max_locks_per_transaction=1024 was not enough; I set it to 2048. Might be worth some documentation, eventually. )
From eyeballing the results it looks like the minmax result (i.e. the result set with enable_bitmapscan = 1) yields only
the last part because only the 'last' rows seem to be present (see the values in column i in table tmm in the attached
program).
Thanks,
Erikjan Rijkers
On Thu, September 26, 2013 00:34, Erik Rijkers wrote:
On Wed, September 25, 2013 22:34, Alvaro Herrera wrote:
[minmax-5.patch]
I have the impression it's not quite working correctly.
The attached program returns different results for different values of enable_bitmapscan (consistently).
( Btw, I had to make the max_locks_per_transaction higher for even not-so-large tables -- is that expected? For a 100M row
table, max_locks_per_transaction=1024 was not enough; I set it to 2048. Might be worth some documentation, eventually. )
From eyeballing the results it looks like the minmax result (i.e. the result set with enable_bitmapscan = 1) yields only
the last part because only the 'last' rows seem to be present (see the values in column i in table tmm in the attached
program).
Looking back at that, I realize I should have added a bit more detail on that test.sh program and its output (attached on
previous mail).
test.sh creates a table tmm and a minmax index on that table:
testdb=# \d tmm
Table "public.tmm"
Column | Type | Modifiers
--------+---------+-----------
i | integer |
r | integer |
Indexes:
"tmm_minmax_idx" minmax (r)
The following shows the problem: the same search with the minmax index on versus off gives different result sets:
testdb=# set enable_bitmapscan=0; select count(*) from tmm where r between symmetric 19494484 and 145288238;
SET
Time: 0.473 ms
count
-------
1261
(1 row)
Time: 7.764 ms
testdb=# set enable_bitmapscan=1; select count(*) from tmm where r between symmetric 19494484 and 145288238;
SET
Time: 0.471 ms
count
-------
3
(1 row)
Time: 1.014 ms
testdb=# set enable_bitmapscan =1; select * from tmm where r between symmetric 19494484 and 145288238;
SET
Time: 0.615 ms
i | r
------+-----------
9945 | 45405603
9951 | 102552485
9966 | 63763962
(3 rows)
Time: 0.984 ms
testdb=# set enable_bitmapscan=0; select * from ( select * from tmm where r between symmetric 19494484 and 145288238 order
by i desc limit 10) f order by i ;
SET
Time: 0.470 ms
i | r
------+-----------
9852 | 114996906
9858 | 69907169
9875 | 43341583
9894 | 127862657
9895 | 44740033
9911 | 51797553
9916 | 58538774
9945 | 45405603
9951 | 102552485
9966 | 63763962
(10 rows)
Time: 8.704 ms
testdb=#
If enable_bitmapscan=1 (i.e. using the minmax index), then only some values are retrieved (in this case 3 rows). It turns
out those are always the last N rows of the full resultset (i.e. with enable_bitmapscan=0).
Erikjan Rijkers
On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Here's an updated version of this patch, with fixes to all the bugs
reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and
Amit Kapila for the reports.
I'm not very happy with the use of a separate relation fork for
storing this data. Using an existing fork number rather than creating
a new one avoids some of them (like, the fact that we loop over all
known fork numbers in various places, and adding another one will add
latency in all of those places, particularly when there is a system
call in the loop) but not all of them (like, what happens if the index
is unlogged? we have provisions to reset the main fork but any others
are just removed; is that OK?), and it also creates some new ones
(like, files having misleading names).
More generally, I fear we really opened a bag of worms with this
relation fork stuff. Every time I turn around I run into a problem
that could be solved by adding another relation fork. I'm not
terribly sure that it was a good idea to go that way to begin with,
because we've got customers who are unhappy about 3 files/heap due to
inode consumption and slow directory lookups. I think we would have
been smarter to devise a strategy for storing the fsm and vm pages
within the main fork in some fashion, and I tend to think that's the
right solution here as well. Of course, it may be hopeless to put the
worms back in the can at this point, and surely these indexes will be
lightly used compared to heaps, so it's not incrementally exacerbating
the problems all that much. But I still feel uneasy about widening
use of that mechanism.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas escribió:
On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Here's an updated version of this patch, with fixes to all the bugs
reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and
Amit Kapila for the reports.
I'm not very happy with the use of a separate relation fork for
storing this data.
I understand this opinion, as I considered it myself while developing
it. Also, GIN already does things this way. Perhaps I should just bite
the bullet and do this.
Using an existing fork number rather than creating
a new one avoids some of them (like, the fact that we loop over all
known fork numbers in various places, and adding another one will add
latency in all of those places, particularly when there is a system
call in the loop) but not all of them (like, what happens if the index
is unlogged? we have provisions to reset the main fork but any others
are just removed; is that OK?), and it also creates some new ones
(like, files having misleading names).
All good points.
Index scans will normally access the revmap in sequential fashion; it
would be enough to chain revmap pages, keeping a single block number in
the metapage pointing to the first one, and subsequent ones are accessed
from a "next" block number in each page. However, heap insertion might
need to access a random revmap page, and this would be too slow. I
think it would be enough to keep an array of block numbers in the
index's metapage; the metapage would be share locked on every scan and
insert, but that's not a big deal because exclusive lock would only be
needed on the metapage to extend the revmap, which would be a very
infrequent operation.
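To make that concrete, here is a minimal standalone sketch (the struct,
names and sizes are made up for illustration; this is not the patch's
actual layout) of a metapage carrying an array of revmap block numbers,
and of the constant-time lookup it would give heap insertions:

#include <stdint.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* revmap item pointers per revmap page; the real figure depends on BLCKSZ */
#define REVMAP_ENTRIES_PER_PAGE 1360

typedef struct MinmaxMetaPageData
{
    uint32_t    version;            /* format version, for upgradability */
    BlockNumber pagesPerRange;      /* heap pages summarized per range */
    uint32_t    nRevmapPages;       /* valid entries in revmapBlocks[] */
    BlockNumber revmapBlocks[64];   /* block number of each revmap page */
} MinmaxMetaPageData;

/*
 * Translate a heap block number into the revmap page holding the item
 * pointer for its page range.  A share lock on the metapage suffices;
 * an exclusive lock would only be needed when the revmap is extended and
 * a new entry is appended to revmapBlocks[].
 */
static BlockNumber
revmap_block_for_heap_block(const MinmaxMetaPageData *meta,
                            BlockNumber heapBlk)
{
    BlockNumber rangeno = heapBlk / meta->pagesPerRange;
    BlockNumber mapidx = rangeno / REVMAP_ENTRIES_PER_PAGE;

    if (mapidx >= meta->nRevmapPages)
        return InvalidBlockNumber;  /* range not summarized yet */
    return meta->revmapBlocks[mapidx];
}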
As this will require some rework to this code, I think it's fair to mark
this as returned with feedback for the time being. I will return with
an updated version soon, fixing the relation fork issue as well as the
locking and visibility bugs reported by Erik.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Erik Rijkers wrote:
I have the impression it's not quite working correctly.
The attached program returns different results for different values of enable_bitmapscan (consistently).
Clearly there's some bug somewhere. I'll investigate it more.
( Btw, I had to make the max_locks_per_transaction higher for even not-so-large tables -- is that expected? For a 100M row
table, max_locks_per_transaction=1024 was not enough; I set it to 2048. Might be worth some documentation, eventually. )
Not documentation -- that would also be a bug which needs to be fixed.
Thanks for testing.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 9/26/13 12:00 PM, Robert Haas wrote:
More generally, I fear we really opened a bag of worms with this
relation fork stuff. Every time I turn around I run into a problem
that could be solved by adding another relation fork. I'm not
terribly sure that it was a good idea to go that way to begin with,
because we've got customers who are unhappy about 3 files/heap due to
inode consumption and slow directory lookups. I think we would have
been smarter to devise a strategy for storing the fsm and vm pages
within the main fork in some fashion, and I tend to think that's the
right solution here as well. Of course, it may be hopeless to put the
worms back in the can at this point, and surely these indexes will be
lightly used compared to heaps, so it's not incrementally exacerbating
the problems all that much. But I still feel uneasy about widening
use of that mechanism.
Why would we add additional code complexity when forks do the trick? That seems like a step backwards, not forward.
If the only complaint about forks is directory traversal why wouldn't we go with the well established practice of using multiple directories instead of glomming everything into one place?
--
Jim C. Nasby, Data Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
On Thu, Sep 26, 2013 at 2:58 PM, Jim Nasby <jim@nasby.net> wrote:
Why would we add additional code complexity when forks do the trick? That
seems like a step backwards, not forward.
Well, they sorta do the trick, but see e.g. commit
ece01aae479227d9836294b287d872c5a6146a11. I doubt that's the only
code that's poorly-optimized for multiple forks; IOW, every time
someone adds a new fork, there's a system-wide cost to that, even if
that fork is only used in a tiny percentage of the relations that
exist in the system.
It's tempting to think that we can use the fork mechanism every time
we have multiple logical "streams" of blocks within a relation and
don't want to figure out a way to multiplex them onto the same
physical file. However, the reality is that the fork mechanism isn't
up to the job. I certainly don't want to imply that we shouldn't have
gone in that direction - both the fsm and the vm are huge steps
forward, and we wouldn't have gotten them in 8.4 without that
mechanism. But they haven't been entirely without their own pain,
too, and that pain is going to grow the more we push in the direction
of relying on forks.
If the only complaint about forks is directory traversal why wouldn't we go
with the well established practice of using multiple directories instead of
glomming everything into one place?
That's not the only complaint about forks - but I support what you're
proposing there anyhow, because it will be helpful to users with lots
of relations regardless of what we do or do not decide to do about
forks.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Sep 26, 2013 at 1:46 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Amit Kapila escribió:
On Sun, Sep 15, 2013 at 5:44 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
On Windows, patch gives below compilation errors:
src\backend\access\minmax\mmtuple.c(96): error C2057: expected
constant expression
I have fixed all these compile errors (fix attached). Thanks for
reporting them. I'll post a new version shortly.
Thanks for fixing it. In the last few days I have spent some time
reading about minmax (or equivalent) indexes in other databases (Netezza
and Oracle) and going through parts of your proposal. It's a fairly big
patch and needs much more time, but I would like to share the
findings/thoughts I have developed so far.
Firstly, about the interface and use case: as far as I could understand,
other databases provide this index automatically rather than through a
separate CREATE INDEX command, which may be because such an index is
mainly useful when the data is ordered, or distributed in such a way
that it helps repeatedly executed queries. You have proposed it as a
command, which means the user needs to take care of it; I find that okay
for a first version, and later maybe we can add some optimisations so
that it can get created automatically.
For the page range size, if I read correctly, you currently use a
#define; do you want to expose it to the user in some way, like a GUC,
or maintain it internally and assign the right value based on the
performance of different queries?
Operations on this index seem to be very fast; Oracle has this as an
in-memory structure, and I read that in Netezza write operations don't
carry any significant overhead for zone maps compared to other indexes,
so shouldn't we consider making it not WAL-logged?
OTOH, because these structures get created automatically in those
databases it might be okay there, but if we provide it as a command,
then the user might be bothered if he didn't find the index there
automatically after a server restart.
A few questions and observations:
1.
+ When a new heap tuple is inserted in a summarized page range, it is possible to
+ compare the existing index tuple with the new heap tuple. If the heap tuple is
+ outside the minimum/maximum boundaries given by the index tuple for any indexed
+ column (or if the new heap tuple contains null values but the index tuple
+ indicate there are no nulls), it is necessary to create a new index tuple with
+ the new values. To do this, a new index tuple is inserted, and the reverse range
+ map is updated to point to it. The old index tuple is left in place, for later
+ garbage collection.
Is there a reason why we can't directly update the value rather than
inserting a new index tuple? As I understand it, for other indexes like
btree we do this because we might need to roll back, but here, even if a
rollback happens after updating the min or max value, it will not cause
any harm (tuple loss).
2.
+ If the reverse range map points to an invalid TID, the corresponding page range
+ is not summarized.
3.
It might be better if you could mention when the range map will point to
an invalid TID; it's not explained in your proposal, but you have used it
in the proposal to explain some other things.
4.
Range reverse map is good terminology, but isn't "range translation map"
better? I don't mind either way; it's just a thought that came to my
mind while understanding the concept of the range reverse map.
5.
/*
* As above, except that instead of scanning the complete heap, only the given
* range is scanned. Scan to end-of-rel can be signalled by passing
* InvalidBlockNumber as end block number.
*/
double
IndexBuildHeapRangeScan(Relation heapRelation,
Relation indexRelation,
IndexInfo *indexInfo,
bool allow_sync,
BlockNumber start_blockno,
BlockNumber numblocks,
IndexBuildCallback callback,
void *callback_state)
In the comment you have used "end block number"; which parameter does it
refer to? I can see only start_blockno and numblocks.
6.
Currently you are passing 0 as the start block and InvalidBlockNumber as
the number of blocks; what's the logic for that?
return IndexBuildHeapRangeScan(heapRelation, indexRelation,
indexInfo, allow_sync,
0, InvalidBlockNumber,
callback, callback_state);
7.
In mmbuildCallback, the tuple is only added to the minmax index if it
satisfies the page range; otherwise this can waste a big scan in case the
page range is large (1280 pages, as you mentioned in one of your mails).
Why can't we include it at the end of the scan?
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Sep 27, 2013 at 11:49 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
[...]
6.
Currently you are passing 0 as the start block and InvalidBlockNumber as
the number of blocks; what's the logic for that?
return IndexBuildHeapRangeScan(heapRelation, indexRelation,
indexInfo, allow_sync,
0, InvalidBlockNumber,
callback, callback_state);
I got it, I think here it means scan all the pages.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 9/26/13 2:46 PM, Robert Haas wrote:
On Thu, Sep 26, 2013 at 2:58 PM, Jim Nasby <jim@nasby.net> wrote:
Why would we add additional code complexity when forks do the trick? That
seems like a step backwards, not forward.
Well, they sorta do the trick, but see e.g. commit
ece01aae479227d9836294b287d872c5a6146a11. I doubt that's the only
code that's poorly-optimized for multiple forks; IOW, every time
someone adds a new fork, there's a system-wide cost to that, even if
that fork is only used in a tiny percentage of the relations that
exist in the system.
Yeah, we obviously kept things simpler when adding forks in order to get the feature out the door. There's improvements that need to be made. But IMHO that's not reason to automatically avoid forks; we need to consider the cost of improving them vs what we gain by using them.
Of course there's always some added cost so we shouldn't just blindly use them all over the place without considering the fork cost either...
It's tempting to think that we can use the fork mechanism every time
we have multiple logical "streams" of blocks within a relation and
don't want to figure out a way to multiplex them onto the same
physical file. However, the reality is that the fork mechanism isn't
up to the job. I certainly don't want to imply that we shouldn't have
gone in that direction - both the fsm and the vm are huge steps
forward, and we wouldn't have gotten them in 8.4 without that
mechanism. But they haven't been entirely without their own pain,
too, and that pain is going to grow the more we push in the direction
of relying on forks.
Agreed.
Honestly, I think we actually need more obfuscation between what happens on the filesystem and the rest of postgres... we're starting to look at areas where that would help. For example, the recent idea of being able to truncate individual relation files and not being limited to only truncating the end of the relation. My concern in that case is that 1GB is a pretty arbitrary size that we happened to pick, so if we're going to go for more efficiency in storage we probably shouldn't just blindly stick with 1G (though of course initial implementation might do that to reduce complexity, but we better still consider where we're headed).
If the only complaint about forks is directory traversal why wouldn't we go
with the well established practice of using multiple directories instead of
glomming everything into one place?
That's not the only complaint about forks - but I support what you're
proposing there anyhow, because it will be helpful to users with lots
of relations regardless of what we do or do not decide to do about
forks.
--
Jim C. Nasby, Data Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
On Fri, Sep 27, 2013 at 7:22 PM, Jim Nasby <jim@nasby.net> wrote:
Yeah, we obviously kept things simpler when adding forks in order to get the feature out the door. There's improvements that need to be made. But IMHO that's not reason to automatically avoid forks; we need to consider the cost of improving them vs what we gain by using them.
We think this gives short shrift to the decision to introduce forks.
If you go back to the discussion at the time it was a topic of debate
and the argument which won the day is that interleaving different
streams of data in one storage system is exactly what the file system
is designed to do and we would just be reinventing the wheel if we
tried to do it ourselves. I think that makes a lot of sense for things
like the fsm or vm which grow indefinitely and are maintained by a
different piece of code from the main heap.
The tradeoff might be somewhat different for the pieces of a data
structure like a bitmap index or gin index where the code responsible
for maintaining it is all the same.
Honestly, I think we actually need more obfuscation between what happens on the filesystem and the rest of postgres... we're starting to look at areas where that would help. For example, the recent idea of being able to truncate individual relation files and not being limited to only truncating the end of the relation. My concern in that case is that 1GB is a pretty arbitrary size that we happened to pick, so if we're going to go for more efficiency in storage we probably shouldn't just blindly stick with 1G (though of course initial implementation might do that to reduce complexity, but we better still consider where we're headed).
The ultimate goal here would be to get the filesystem to issue a TRIM
call so an SSD storage system can reuse the underlying blocks.
Truncating 1GB files might be a convenient way to do it, especially if
we have some new kind of vacuum full that can pack tuples within each
1GB file.
But there may be easier ways to achieve the same thing. If we can
notify the filesystem that we're not using some of the blocks in the
middle of the file we might be able to just leave things where they
are and have holes in the files. Or we might be better off not
depending on truncate and just look for ways to mark entire 1GB files
as "deprecated" and move tuples out of them until we can just remove
that whole file.
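For reference, the "notify the filesystem" part already exists on Linux as
fallocate()'s hole-punching mode; here is a small sketch of how a
segment's worth of dead blocks could be deallocated in place
(Linux-specific, and not something the current patch or backend does):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * Deallocate [offset, offset + len) in the middle of an already-open file,
 * leaving a hole but keeping the apparent file size unchanged.  Returns 0
 * on success, -1 if the kernel or filesystem doesn't support it.
 */
static int
punch_hole(int fd, off_t offset, off_t len)
{
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  offset, len) != 0)
    {
        perror("fallocate");
        return -1;
    }
    return 0;
}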
--
greg
On 9/27/13 1:43 PM, Greg Stark wrote:
Honestly, I think we actually need more obfuscation between what happens on the filesystem and the rest of postgres... we're starting to look at areas where that would help. For example, the recent idea of being able to truncate individual relation files and not being limited to only truncating the end of the relation. My concern in that case is that 1GB is a pretty arbitrary size that we happened to pick, so if we're going to go for more efficiency in storage we probably shouldn't just blindly stick with 1G (though of course initial implementation might do that to reduce complexity, but we better still consider where we're headed).
The ultimate goal here would be to get the filesystem to issue a TRIM
call so an SSD storage system can reuse the underlying blocks.
Truncating 1GB files might be a convenient way to do it, especially if
we have some new kind of vacuum full that can pack tuples within each
1GB file.
But there may be easier ways to achieve the same thing. If we can
notify the filesystem that we're not using some of the blocks in the
middle of the file we might be able to just leave things where they
are and have holes in the files. Or we might be better off not
depending on truncate and just look for ways to mark entire 1GB files
as "deprecated" and move tuples out of them until we can just remove
that whole file.
Yeah, there's a ton of different things we might do. And dealing with free space is just one example... things like the VM give us the ability to detect areas of the heap that have gone "dormant"; imagine if we could seamlessly move that data to its own storage, possibly compressing it at the same time. (Yes, I realize there's partitioning and tablespaces and compressing filesystems, but those are a lot more work and will never be as efficient as what the database itself can do).
Anyway, I think we're all on the same page. We should stop hijacking Alvaro's thread... ;)
--
Jim C. Nasby, Data Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net
On 27.09.2013 21:43, Greg Stark wrote:
On Fri, Sep 27, 2013 at 7:22 PM, Jim Nasby<jim@nasby.net> wrote:
Yeah, we obviously kept things simpler when adding forks in order to get the feature out the door. There's improvements that need to be made. But IMHO that's not reason to automatically avoid forks; we need to consider the cost of improving them vs what we gain by using them.
We think this gives short shrift to the decision to introduce forks.
If you go back to the discussion at the time it was a topic of debate
and the argument which won the day is that interleaving different
streams of data in one storage system is exactly what the file system
is designed to do and we would just be reinventing the wheel if we
tried to do it ourselves. I think that makes a lot of sense for things
like the fsm or vm which grow indefinitely and are maintained by a
different piece of code from the main heap.
The tradeoff might be somewhat different for the pieces of a data
structure like a bitmap index or gin index where the code responsible
for maintaining it is all the same.
There are quite a few cases where we have several "streams" of data,
all related to a single relation. We've solved them all in slightly
different ways:
1. TOAST. A separate heap relation with accompanying b-tree index is
created.
2. GIN. GIN contains a b-tree, and data pages (and some other kinds of
pages too IIRC). It would be natural to use the regular B-tree code for
the B-tree, but instead it contains a completely separate
implementation. All the different kinds of streams are stored in the
main fork.
3. Free space map. Stored as a separate fork.
4. Visibility map. Stored as a separate fork.
And upcoming:
5. Minmax indexes, with the linearly-addressed range reverse map and
variable-length index tuples.
6. Bitmap indexes. Like in GIN, there's a B-tree and the data pages
containing the bitmaps.
A nice property of the VM and FSM forks currently is that they are just
auxiliary information to speed things up. You can safely remove them
(when the server is shut down), and the system will recreate them on
next vacuum. It's not carved in stone that it has to be that way for all
extra forks, but it is today and I like it.
I feel we need a new kind of a relation fork, something more
heavy-weight than the current forks, but not as heavy-weight as the way
TOAST does it. It would be nice if GIN and bitmap indexes could use the
regular nbtree code. Or any other index type - imagine a bitmap index
using a SP-GiST index instead of a B-tree! You could create a bitmap
index for 2d points, and use it to speed up operations like overlap for
example.
The nbtree code expects the data to be in the main fork and uses the FSM
fork too. Maybe it could be abstracted, so that the regular b-tree could
be used as part of another index type. Same with other indexams.
Perhaps relation forks need to be made more flexible, allowing access
methods to define what forks exist. IOW, let's not avoid using relation
forks, let's make them better instead.
- Heikki
What would it take to abstract the minmax indexes to allow maintaining a
bounding box for points, instead of a plain min/max? Or for ranges. In
other words, why is this restricted to b-tree operators?
- Heikki
On Mon, Sep 30, 2013 at 02:17:39PM +0300, Heikki Linnakangas wrote:
What would it take to abstract the minmax indexes to allow maintaining
a bounding box for points, instead of a plain min/max? Or for
ranges. In other words, why is this restricted to b-tree operators?
If I had to guess, I'd guess, "first cut."
I take it this also occurred to you and that you believe that this
approach makes the more general case harder, or at least further out,
than it would need to be. Am I close?
Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics
Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
David Fetter wrote:
On Mon, Sep 30, 2013 at 02:17:39PM +0300, Heikki Linnakangas wrote:
What would it take to abstract the minmax indexes to allow maintaining
a bounding box for points, instead of a plain min/max? Or for
ranges. In other words, why is this restricted to b-tree operators?
If I had to guess, I'd guess, "first cut."
Yeah, there were a few other simplifications in the design too, though I
admit allowing for multidimensional datatypes hadn't occurred to me
(though I will guess Simon did think about it and just didn't tell me to
avoid me going overboard with stuff that would make the first version
take forever).
I think we'd better add version numbers and stuff to the metapage to
allow for extensions and proper upgradability.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 30.09.2013 19:49, Alvaro Herrera wrote:
David Fetter wrote:
On Mon, Sep 30, 2013 at 02:17:39PM +0300, Heikki Linnakangas wrote:
What would it take to abstract the minmax indexes to allow maintaining
a bounding box for points, instead of a plain min/max? Or for
ranges. In other words, why is this restricted to b-tree operators?
If I had to guess, I'd guess, "first cut."
Yeah, there were a few other simplifications in the design too, though I
admit allowing for multidimensional datatypes hadn't occurred to me
You can almost create a bounding box opclass in the current
implementation, by mapping < operator to "contains" and > to "not
contains". But there's no support for creating a new, larger, bounding
box on insert. It will just replace the max with the new value if it's
"greater than", when it should create a whole new value to store in the
index that covers both the old and the new values. (or less than? I'm
not sure which way those operators would work..)
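In other words, the missing piece is a union-style step; a rough sketch
(hypothetical type and function, not something the patch provides) of
what enlarging the stored summary on insert would have to do for
bounding boxes:

typedef struct BBox
{
    double xlo, ylo;    /* lower-left corner */
    double xhi, yhi;    /* upper-right corner */
} BBox;

/* enlarge the stored summary so it also covers a newly inserted box */
static void
bbox_union(BBox *summary, const BBox *newval)
{
    if (newval->xlo < summary->xlo) summary->xlo = newval->xlo;
    if (newval->ylo < summary->ylo) summary->ylo = newval->ylo;
    if (newval->xhi > summary->xhi) summary->xhi = newval->xhi;
    if (newval->yhi > summary->yhi) summary->yhi = newval->yhi;
}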
When you think of the general case, it's weird that the current
implementation requires storing both the min and the max. For a bounding
box, you store the bounding box that covers all heap tuples in the
range. If that corresponds to "max", what does "min" mean?
In fact, even with regular b-tree operators, over integers for example,
you don't necessarily want to store both min and max. If you only ever
perform queries like "WHERE col > ?", there's no need to track the min
value. So to make this really general, you should be able to create an
index on only the minimum or maximum. Or if you want both, you can store
them as separate index columns. Something like:
CREATE INDEX minindex ON table (col ASC); -- For min
CREATE INDEX minindex ON table (col DESC); -- For max
CREATE INDEX minindex ON table (col ASC, col DESC); -- For both
That said, in practice most people probably want to store both min and
max. Maybe it's a bit too finicky if we require people to write "col
ASC, col DESC" to get that. Some kind of a shorthand probably makes sense.
(though I will guess Simon did think about it and just didn't tell me to
avoid me going overboard with stuff that would make the first version
take forever).
Heh, and I ruined that great plan :-).
- Heikki
On Mon, Sep 30, 2013 at 1:20 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
You can almost create a bounding box opclass in the current implementation,
by mapping < operator to "contains" and > to "not contains". But there's no
support for creating a new, larger, bounding box on insert. It will just
replace the max with the new value if it's "greater than", when it should
create a whole new value to store in the index that covers both the old and
the new values. (or less than? I'm not sure which way those operators would
work..)
This sounds an awful lot like GiST's "union" operation. Actually,
following the GiST model of having "union" and "consistent" operations
might be a smart way to go. Then the exact index semantics could be
decided by the opclass. This might not even be that much extra code;
the existing consistent and union functions for GiST are pretty short.
That way, it'd be easy to add new opclasses with somewhat different
behavior; the common thread would be that every opclass of this new AM
works by summarizing a physical page range into a single indexed
value. You might call the AM something like "summary" or "sparse" and
then have "minmax_ops" for your first opclass.
In fact, even with regular b-tree operators, over integers for example, you
don't necessarily want to store both min and max. If you only ever perform
queries like "WHERE col > ?", there's no need to track the min value. So to
make this really general, you should be able to create an index on only the
minimum or maximum. Or if you want both, you can store them as separate
index columns. Something like:
CREATE INDEX minindex ON table (col ASC); -- For min
CREATE INDEX minindex ON table (col DESC); -- For max
CREATE INDEX minindex ON table (col ASC, col DESC); -- For both
This doesn't seem very general, since you're relying on the fact that
ASC and DESC map to < and >. It's not clear what you'd write here if
you wanted to optimize #$ and @!. But something based on opclasses
will work, since each opclass can support an arbitrary set of
operators.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Erik Rijkers wrote:
On Thu, September 26, 2013 00:34, Erik Rijkers wrote:
On Wed, September 25, 2013 22:34, Alvaro Herrera wrote:
[minmax-5.patch]
I have the impression it's not quite working correctly.
Here's a version 7 of the patch, which fixes these bugs and adds
opclasses for a bunch more types (timestamp, timestamptz, date, time,
timetz), courtesy of Martín Marqués. It's also been rebased to apply
cleanly on top of today's master branch.
I have also added a selectivity function, but I'm not positive that it's
very useful yet.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
minmax-7.patch (text/x-diff; charset=us-ascii)
*** a/contrib/pageinspect/Makefile
--- b/contrib/pageinspect/Makefile
***************
*** 1,7 ****
# contrib/pageinspect/Makefile
MODULE_big = pageinspect
! OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o
EXTENSION = pageinspect
DATA = pageinspect--1.1.sql pageinspect--1.0--1.1.sql \
--- 1,7 ----
# contrib/pageinspect/Makefile
MODULE_big = pageinspect
! OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o mmfuncs.o
EXTENSION = pageinspect
DATA = pageinspect--1.1.sql pageinspect--1.0--1.1.sql \
*** /dev/null
--- b/contrib/pageinspect/mmfuncs.c
***************
*** 0 ****
--- 1,217 ----
+ /*
+ * mmfuncs.c
+ * Functions to investigate MinMax indexes
+ *
+ * Copyright (c) 2013, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/pageinspect/mmfuncs.c
+ */
+ #include "postgres.h"
+
+ #include "access/htup_details.h"
+ #include "access/minmax.h"
+ #include "access/minmax_tuple.h"
+ #include "catalog/index.h"
+ #include "funcapi.h"
+ #include "utils/builtins.h"
+ #include "utils/lsyscache.h"
+ #include "utils/rel.h"
+ #include "miscadmin.h"
+
+ Datum minmax_page_items(PG_FUNCTION_ARGS);
+
+ PG_FUNCTION_INFO_V1(minmax_page_items);
+
+ typedef struct mm_page_state
+ {
+ TupleDesc tupdesc;
+ Page page;
+ OffsetNumber offset;
+ bool unusedItem;
+ bool done;
+ AttrNumber attno;
+ DeformedMMTuple *dtup;
+ FmgrInfo outputfn[FLEXIBLE_ARRAY_MEMBER];
+ } mm_page_state;
+
+ /*
+ * Extract all item values from a minmax index page
+ *
+ * Usage: SELECT * FROM minmax_page_items(get_raw_page('idx', 1), 'idx'::regclass);
+ */
+ Datum
+ minmax_page_items(PG_FUNCTION_ARGS)
+ {
+ mm_page_state *state;
+ FuncCallContext *fctx;
+
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ (errmsg("must be superuser to use raw page functions"))));
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Oid indexRelid = PG_GETARG_OID(1);
+ int raw_page_size;
+ TupleDesc tupdesc;
+ MemoryContext mctx;
+ Relation indexRel;
+ AttrNumber attno;
+
+ raw_page_size = VARSIZE(raw_page) - VARHDRSZ;
+
+ if (raw_page_size < SizeOfPageHeaderData)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("input page too small (%d bytes)", raw_page_size)));
+
+ /* create a function context for cross-call persistence */
+ fctx = SRF_FIRSTCALL_INIT();
+
+ /* switch to memory context appropriate for multiple function calls */
+ mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ indexRel = index_open(indexRelid, AccessShareLock);
+
+ state = palloc(offsetof(mm_page_state, outputfn) +
+ sizeof(FmgrInfo) * RelationGetDescr(indexRel)->natts);
+
+ state->tupdesc = CreateTupleDescCopy(RelationGetDescr(indexRel));
+ state->page = VARDATA(raw_page);
+ state->offset = FirstOffsetNumber;
+ state->unusedItem = false;
+ state->done = false;
+ state->dtup = NULL;
+
+ index_close(indexRel, AccessShareLock);
+
+ for (attno = 1; attno <= state->tupdesc->natts; attno++)
+ {
+ Oid output;
+ bool isVarlena;
+
+ getTypeOutputInfo(state->tupdesc->attrs[attno - 1]->atttypid,
+ &output, &isVarlena);
+ fmgr_info(output, &state->outputfn[attno - 1]);
+ }
+
+ fctx->user_fctx = state;
+ fctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+ MemoryContextSwitchTo(mctx);
+ }
+
+ fctx = SRF_PERCALL_SETUP();
+ state = fctx->user_fctx;
+
+ if (!state->done)
+ {
+ HeapTuple result;
+ Datum values[6];
+ bool nulls[6];
+
+ /*
+ * This loop is called once for every attribute of every tuple in the
+ * page. At the start of a tuple, we get a NULL dtup; that's our
+ * signal for obtaining and decoding the next one. If that's not the
+ * case, we output the next attribute.
+ */
+ if (state->dtup == NULL)
+ {
+ MMTuple *tup;
+ MemoryContext mctx;
+ ItemId itemId;
+
+ /* deformed tuple must live across calls */
+ mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx);
+
+ /* verify item status: if there's no data, we can't decode */
+ itemId = PageGetItemId(state->page, state->offset);
+ if (ItemIdIsUsed(itemId))
+ {
+ tup = (MMTuple *) PageGetItem(state->page,
+ PageGetItemId(state->page,
+ state->offset));
+ state->dtup = minmax_deform_tuple(state->tupdesc, tup);
+ state->attno = 1;
+ state->unusedItem = false;
+ }
+ else
+ state->unusedItem = true;
+
+ MemoryContextSwitchTo(mctx);
+ }
+ else
+ state->attno++;
+
+ MemSet(nulls, 0, sizeof(nulls));
+
+ if (state->unusedItem)
+ {
+ values[0] = UInt16GetDatum(state->offset);
+ nulls[1] = true;
+ nulls[2] = true;
+ nulls[3] = true;
+ nulls[4] = true;
+ nulls[5] = true;
+ }
+ else
+ {
+ int att = state->attno - 1;
+
+ values[0] = UInt16GetDatum(state->offset);
+ values[1] = UInt16GetDatum(state->attno);
+ values[2] = BoolGetDatum(state->dtup->values[att].allnulls);
+ values[3] = BoolGetDatum(state->dtup->values[att].hasnulls);
+ if (!state->dtup->values[att].allnulls)
+ {
+ FmgrInfo *outputfn = &state->outputfn[att];
+ MMValues *mmvalues = &state->dtup->values[att];
+
+ values[4] = CStringGetTextDatum(OutputFunctionCall(outputfn,
+ mmvalues->min));
+ values[5] = CStringGetTextDatum(OutputFunctionCall(outputfn,
+ mmvalues->max));
+ }
+ else
+ {
+ nulls[4] = true;
+ nulls[5] = true;
+ }
+ }
+
+ result = heap_form_tuple(fctx->tuple_desc, values, nulls);
+
+ /*
+ * If the item was unused, jump straight to the next one; otherwise,
+ * the only cleanup needed here is to set our signal to go to the next
+ * tuple in the following iteration, by freeing the current one.
+ */
+ if (state->unusedItem)
+ state->offset = OffsetNumberNext(state->offset);
+ else if (state->attno >= state->tupdesc->natts)
+ {
+ pfree(state->dtup);
+ state->dtup = NULL;
+ state->offset = OffsetNumberNext(state->offset);
+ }
+
+ /*
+ * If we're beyond the end of the page, set flag to end the function in
+ * the following iteration.
+ */
+ if (state->offset > PageGetMaxOffsetNumber(state->page))
+ state->done = true;
+
+ SRF_RETURN_NEXT(fctx, HeapTupleGetDatum(result));
+ }
+
+ SRF_RETURN_DONE(fctx);
+ }
*** a/contrib/pageinspect/pageinspect--1.1.sql
--- b/contrib/pageinspect/pageinspect--1.1.sql
***************
*** 99,104 **** AS 'MODULE_PATHNAME', 'bt_page_items'
--- 99,118 ----
LANGUAGE C STRICT;
--
+ -- minmax_page_items()
+ --
+ CREATE FUNCTION minmax_page_items(IN page bytea, IN index_oid oid,
+ OUT itemoffset int,
+ OUT attnum int,
+ OUT allnulls bool,
+ OUT hasnulls bool,
+ OUT min text,
+ OUT max text)
+ RETURNS SETOF record
+ AS 'MODULE_PATHNAME', 'minmax_page_items'
+ LANGUAGE C STRICT;
+
+ --
-- fsm_page_contents()
--
CREATE FUNCTION fsm_page_contents(IN page bytea)
*** a/contrib/pg_xlogdump/rmgrdesc.c
--- b/contrib/pg_xlogdump/rmgrdesc.c
***************
*** 13,18 ****
--- 13,19 ----
#include "access/gist_private.h"
#include "access/hash.h"
#include "access/heapam_xlog.h"
+ #include "access/minmax_xlog.h"
#include "access/multixact.h"
#include "access/nbtree.h"
#include "access/rmgr.h"
*** /dev/null
--- b/minmax-proposal
***************
*** 0 ****
--- 1,300 ----
+ Minmax Range Indexes
+ ====================
+
+ Minmax indexes are a new access method intended to enable very fast scanning of
+ extremely large tables.
+
+ The essential idea of a minmax index is to keep track of the min() and max()
+ values in consecutive groups of heap pages (page ranges). These values can be
+ used by constraint exclusion to avoid scanning such pages, depending on query
+ quals.
+
+ The main drawback of this is having to update the stored min/max values of each
+ page range as tuples are inserted into them.
+
+ Other database systems already have this feature. Some examples:
+
+ * Oracle Exadata calls this "storage indexes"
+ http://richardfoote.wordpress.com/category/storage-indexes/
+
+ * Netezza has "zone maps"
+ http://nztips.com/2010/11/netezza-integer-join-keys/
+
+ * Infobright has this automatically within their "data packs"
+ http://www.infobright.org/Blog/Entry/organizing_data_and_more_about_rough_data_contest/
+
+ * MonetDB seems to have it
+ http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.2662
+ "Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS"
+
+ Index creation
+ --------------
+
+ To create a minmax index, we use the standard wording:
+
+ CREATE INDEX foo_minmax_idx ON foo USING MINMAX (a, b, e);
+
+ Partial indexes are not supported; since an index is concerned with minimum and
+ maximum values of the involved columns across all the pages in the table, it
+ doesn't make sense to exclude values. Another way to see "partial" indexes
+ here would be those that only considered some pages in the table instead of all
+ of them; but this would be difficult to implement and manage and, most likely,
+ pointless.
+
+ Expressional indexes can probably be supported in the future, but we disallow
+ them initially for conceptual simplicity.
+
+ Having multiple minmax indexes in the same table is acceptable, though most of
+ the time it would make more sense to have a single index covering all the
+ interesting columns. Multiple indexes might be useful for columns added later.
+
+ Access Method Design
+ --------------------
+
+ Since item pointers are not stored inside indexes of this type, it is not
+ possible to support the amgettuple interface. Instead, we only provide
+ amgetbitmap support; scanning a relation using this index requires a recheck
+ node on top. The amgetbitmap routine would return a TIDBitmap comprising all
+ the pages in those page groups that match the query qualifications; the recheck
+ node prunes tuples that are not visible per snapshot and those that are not
+ visible per query quals.
+
+ For each supported datatype, we need an opclass with the following catalog
+ entries:
+
+ - support operators (pg_amop): same as btree (<, <=, =, >=, >)
+
+ These operators are used pervasively:
+
+ - The optimizer requires them to evaluate queries, so that the index is chosen
+ when queries on the indexed table are planned.
+ - During index construction (ambuild), they are used to determine the boundary
+ values for each page range.
+ - During index updates (aminsert), they are used to determine whether the new
+ heap tuple matches the existing index tuple; and if not, they are used to
+ construct the new index tuple.
+
+ In each index tuple (corresponding to one page range), we store:
+ - for each indexed column:
+ * minimum value across all tuples in the range
+ * maximum value across all tuples in the range
+ * are there nulls present in any tuple?
+ * are all the values in all tuples in the range null?
+
+ These null bits are stored in a single null bitmask of length 2x number of
+ columns.
+
+ With the default INDEX_MAX_KEYS of 32, and considering columns of 8-byte length
+ types such as timestamptz or bigint, each tuple would be 522 bytes in length,
+ which seems reasonable. There are 6 extra bytes for padding between the null
+ bitmask and the first data item, assuming 64-bit alignment; so the total size
+ for such an index would actually be 528 bytes.
+
+ This maximum index tuple size is calculated as: mt_info (2 bytes) + null bitmap
+ (8 bytes) + data value (8 bytes) * 32 * 2
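+
+ Spelled out: 2 + 8 + (8 * 32 * 2) = 522 bytes of logical content. The header
+ and null bitmask occupy the first 10 bytes; aligning the first datum to a
+ 64-bit boundary pushes it to offset 16, which is where the 6 padding bytes
+ come from, and 16 + 512 = 528.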
+
+ (Of course, larger columns are possible, such as varchar, but creating minmax
+ indexes on such columns seems of little practical usefulness. Also, the
+ usefulness of an index containing so many columns is dubious, at best.)
+
+ There can be gaps where some pages have no covering index entry. In particular,
+ the last few pages of the table would commonly not be summarized.
+
+ The Range Reverse Map
+ ---------------------
+
+ To find out the index tuple for a particular page range, we have a
+ separate fork called the range reverse map. This fork stores one TID per
+ range, which is the address of the index tuple summarizing that range. Since
+ these map entries are fixed size, it is possible to compute the address of the
+ range map entry for any given heap page.
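+
+ For illustration, the lookup could be computed as in the sketch below. This is
+ a minimal sketch only; it assumes the TIDs are stored as a dense array of
+ ItemPointerData right after the standard page header, which need not be the
+ exact on-disk layout, and the function name is made up:
+
+     /* revmap entries that fit on one revmap page, under the assumed layout */
+     #define ENTRIES_PER_REVMAP_PAGE \
+         ((BLCKSZ - SizeOfPageHeaderData) / sizeof(ItemPointerData))
+
+     /* which revmap page, and which slot within it, covers a given heap block */
+     static void
+     revmap_entry_location(BlockNumber heapBlk, BlockNumber pagesPerRange,
+                           BlockNumber *revmapBlk, int *revmapIndex)
+     {
+         BlockNumber rangeno = heapBlk / pagesPerRange;
+
+         *revmapBlk = rangeno / ENTRIES_PER_REVMAP_PAGE;
+         *revmapIndex = rangeno % ENTRIES_PER_REVMAP_PAGE;
+     }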
+
+ When a new heap tuple is inserted in a summarized page range, it is possible to
+ compare the existing index tuple with the new heap tuple. If the heap tuple is
+ outside the minimum/maximum boundaries given by the index tuple for any indexed
+ column (or if the new heap tuple contains null values but the index tuple
+ indicates there are no nulls), it is necessary to create a new index tuple with
+ the new values. To do this, a new index tuple is inserted, and the reverse range
+ map is updated to point to it. The old index tuple is left in place, for later
+ garbage collection.
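+
+ The per-column decision can be modeled with ordinary comparisons. The sketch
+ below is a simplified, self-contained model using plain int columns; the real
+ code goes through the opclass "<" and ">" operators and must datumCopy()
+ by-reference values:
+
+     #include <stdbool.h>
+
+     typedef struct ColumnSummary
+     {
+         bool    hasnulls;       /* were any nulls seen in the range? */
+         int     min;            /* smallest value seen so far */
+         int     max;            /* largest value seen so far */
+     } ColumnSummary;
+
+     /* Returns true if a widened summary tuple must replace the current one. */
+     static bool
+     summary_needs_update(ColumnSummary *cols, int natts,
+                          const int *newvals, const bool *newnulls)
+     {
+         bool    need_new = false;
+         int     i;
+
+         for (i = 0; i < natts; i++)
+         {
+             if (newnulls[i])
+             {
+                 if (!cols[i].hasnulls)
+                 {
+                     cols[i].hasnulls = true;
+                     need_new = true;
+                 }
+                 continue;
+             }
+             if (newvals[i] < cols[i].min)
+             {
+                 cols[i].min = newvals[i];
+                 need_new = true;
+             }
+             if (newvals[i] > cols[i].max)
+             {
+                 cols[i].max = newvals[i];
+                 need_new = true;
+             }
+         }
+         return need_new;
+     }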
+
+ If the reverse range map points to an invalid TID, the corresponding page range
+ is not summarized.
+
+ A minmax index is updated by creating a new summary tuple whenever an
+ insertion outside the min-max interval occurs in the pages within the range.
+
+ To scan a table following a minmax index, we scan the reverse range map
+ sequentially. This yields index tuples in ascending page range order.
+ Query quals are matched to each index tuple; if they match, each page within
+ the page range is returned as part of the output TID bitmap. If there's no
+ match, they are skipped. Reverse range map entries containing an invalid index
+ TID, that is, unsummarized page ranges, are also returned in the TID bitmap.
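+
+ Condensed, the loop in the amgetbitmap routine looks roughly like the sketch
+ below (tuple-lock and buffer handling are elided, pagesPerRange stands for the
+ configured range size, and range_matches_quals is a placeholder for the
+ per-key comparisons described above):
+
+     for (heapBlk = 0; heapBlk < nblocks; heapBlk += pagesPerRange)
+     {
+         ItemPointerData idxtid;
+         bool            addrange;
+
+         mmGetHeapBlockItemptr(rmAccess, heapBlk, &idxtid);
+
+         if (!ItemPointerIsValid(&idxtid))
+             addrange = true;        /* unsummarized range: cannot exclude it */
+         else
+             addrange = range_matches_quals(idxRel, &idxtid, scankeys, nkeys);
+
+         if (addrange)
+         {
+             BlockNumber pg;
+
+             for (pg = heapBlk; pg < heapBlk + pagesPerRange && pg < nblocks; pg++)
+                 tbm_add_page(tbm, pg);
+         }
+     }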
+
+ To store the range reverse map, we reuse the VISIBILITYMAP_FORKNUM, since a VM
+ does not make sense for a minmax index anyway (XXX -- really??)
+
+ When tuples are added to unsummarized pages, nothing needs to happen.
+
+ Heap tuples can be removed from anywhere without restriction.
+
+ Index entries that are not referenced from the revmap can be removed from the
+ main fork. This currently happens at amvacuumcleanup, though it could be
+ carried out separately; no heap scan is necessary to determine which tuples
+ are unreachable.
+
+ Summarization
+ -------------
+
+ At index creation time, the whole table is scanned; for each page range the
+ minimum and maximum values of each indexed column and nulls bitmap are
+ collected and stored in the index. The possibly-incomplete range at the end
+ of the table is not included.
+
+ Once in a while, it is necessary to summarize a bunch of unsummarized pages
+ (because the table has grown since the index was created), or re-summarize a
+ range that has been marked invalid. This is simple: scan the page range
+ calculating the min() and max() for each indexed column, then insert the new
+ index entry at the end of the index. The main interesting questions are:
+
+ a) when to do it
+ The perfect time to do it is as soon as a complete page range of the
+ configured range size has been filled.
+
+ b) who does it (what process)
+ It doesn't seem a good idea to have a client-connected process do it;
+ it would incur unwanted latency. Three other options are (i) to spawn a
+ specialized process to do it, which perhaps can be signalled by a
+ client-connected process that executes a scan and notices the need to run
+ summarization; or (ii) to let autovacuum do it, as a separate new
+ maintenance task. This seems simple enough to bolt on top of already
+ existing autovacuum infrastructure. The timing constraints of autovacuum
+ might be undesirable, though. (iii) wait for user command.
+
+ The easiest way to go about this seems to be to have vacuum do it. That way we
+ can simply do re-summarization in the amvacuumcleanup routine. Other answers would
+ mean we need a separate AM routine, which appears unwarranted at this stage.
+
+ Vacuuming
+ ---------
+
+ Vacuuming a table that has a minmax index does not represent a significant
+ challenge. Since no heap TIDs are stored, it's not necessary to scan the index
+ when heap tuples are removed. It might be that some min() value can be
+ incremented, or some max() value can be decremented; but this would represent
+ an optimization opportunity only, not a correctness issue. Perhaps it's
+ simpler to represent this as the need to re-run summarization on the affected
+ page range.
+
+ Note that if there are no indexes on the table other than the minmax index,
+ usage of maintenance_work_mem by vacuum can be decreased significantly, because
+ no detailed index scan needs to take place (and thus it's not necessary for
+ vacuum to save TIDs to remove). This optimization opportunity is best left for
+ future improvement.
+
+ Locking considerations
+ ----------------------
+
+ To read the TID during an index scan, we follow this protocol:
+
+ * read revmap page
+ * obtain share lock on the revmap buffer
+ * read the TID
+ * obtain share lock on buffer of main fork
+ * LockTuple the TID (using the index as relation). A shared lock is
+ sufficient. We need the LockTuple to prevent VACUUM from recycling
+ the index tuple; see below.
+ * release revmap buffer lock
+ * read the index tuple
+ * release the tuple lock
+ * release main fork buffer lock
+
+
+ To update the summary tuple for a page range, we use this protocol:
+
+ * insert a new index tuple somewhere in the main fork; note its TID
+ * read revmap page
+ * obtain exclusive lock on revmap buffer
+ * write the TID
+ * release lock
+
+ This ensures no concurrent reader can obtain a partially-written TID.
+ Note we don't need a tuple lock here. Concurrent scans don't have to
+ worry about whether they got the old or new index tuple: if they get the
+ old one, the tighter values are okay from a correctness standpoint because
+ due to MVCC they can't possibly see the just-inserted heap tuples anyway.
+
+
+ For vacuuming, we need to figure out which index tuples are no longer
+ referenced from the reverse range map. This requires some brute force,
+ but is simple:
+
+ 1) scan the complete index, store each existing TID in a dynahash.
+ Hash key is the TID, hash value is a boolean initially set to false.
+ 2) scan the complete revmap sequentially, read the TIDs on each page. Share
+ lock on each page is sufficient. For each TID so obtained, grab the
+ element from the hash and update the boolean to true.
+ 3) Scan the index again; for each tuple found, search the hash table.
+ If the tuple is not present in hash, it must have been added after our
+ initial scan; ignore it. If tuple is present in hash, and the hash flag is
+ true, then the tuple is referenced from the revmap; ignore it. If the hash
+ flag is false, then the index tuple is no longer referenced by the revmap;
+ but it could be about to be accessed by a concurrent scan. Do
+ ConditionalLockTuple. If this fails, ignore the tuple (it's in use), it
+ will be deleted by a future vacuum. If lock is acquired, then we can safely
+ remove the index tuple.
+ 4) Index pages with free space can be detected by this second scan. Register
+ those with the FSM.
+
+ Note this doesn't require scanning the heap at all, or being involved in
+ the heap's cleanup procedure. Also, there is no need to LockBufferForCleanup,
+ which is a nice property because index scans keep pages pinned for long
+ periods.
+
+
+
+ Optimizer
+ ---------
+
+ In order to make this all work, the only thing we need to do is ensure we have a
+ good enough opclass and amcostestimate. With this, the optimizer is able to pick
+ up the index on its own.
+
+
+ Open questions
+ --------------
+
+ * Same-size page ranges?
+ Current related literature seems to consider that each "index entry" in a
+ minmax index must cover the same number of pages. There doesn't seem to be a
+ hard reason for this to be so; it might make sense to allow the index to
+ self-tune so that some index entries cover smaller page ranges, if this allows
+ the min()/max() values to be more compact. This would incur larger minmax
+ overhead for the index itself, but might allow better pruning of page ranges
+ during scan. In the limit of one index tuple per page, the index itself would
+ occupy too much space, even though we would be able to skip reading most
+ heap pages, because the min()/max() ranges are tight; in the opposite limit of
+ a single tuple that summarizes the whole table, we wouldn't be able to prune
+ anything even though the index is very small. This can probably be made to work
+ by using the reverse range map as an index in itself.
+
+ * More compact representation for TIDBitmap?
+ TIDBitmap is the structure used to represent bitmap scans. The
+ representation of lossy page ranges is not optimal for our purposes, because
+ it uses a Bitmapset to represent pages in the range; since we're going to return
+ all pages in a large range, it might be more convenient to allow for a
+ struct that uses start and end page numbers to represent the range, instead.
+
+
+
+ References:
+
+ Email thread on pgsql-hackers
+ http://www.postgresql.org/message-id/1199296574.7260.149.camel@ebony.site
+ From: Simon Riggs
+ To: pgsql-hackers
+ Subject: Dynamic Partitioning using Segment Visibility Map
+
+ http://wiki.postgresql.org/wiki/Segment_Exclusion
+ http://wiki.postgresql.org/wiki/Segment_Visibility_Map
+
*** a/src/backend/access/Makefile
--- b/src/backend/access/Makefile
***************
*** 8,13 **** subdir = src/backend/access
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
! SUBDIRS = common gin gist hash heap index nbtree rmgrdesc spgist transam
include $(top_srcdir)/src/backend/common.mk
--- 8,13 ----
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
! SUBDIRS = common gin gist hash heap index minmax nbtree rmgrdesc spgist transam
include $(top_srcdir)/src/backend/common.mk
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 268,273 **** initscan(HeapScanDesc scan, ScanKey key, bool is_rescan)
--- 268,275 ----
scan->rs_startblock = 0;
}
+ scan->rs_initblock = 0;
+ scan->rs_numblocks = InvalidBlockNumber;
scan->rs_inited = false;
scan->rs_ctup.t_data = NULL;
ItemPointerSetInvalid(&scan->rs_ctup.t_self);
***************
*** 293,298 **** initscan(HeapScanDesc scan, ScanKey key, bool is_rescan)
--- 295,308 ----
pgstat_count_heap_scan(scan->rs_rd);
}
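+
+ /*
+ * heap_setscanlimits - restrict a heap scan to a subrange of the relation
+ *
+ * startBlk is the first page to scan; numBlks is the number of pages to scan
+ * from there. This is intended to be called after heap_beginscan and before
+ * the first call to heap_getnext.
+ */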
+ void
+ heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk, BlockNumber numBlks)
+ {
+ scan->rs_startblock = startBlk;
+ scan->rs_initblock = startBlk;
+ scan->rs_numblocks = numBlks;
+ }
+
/*
* heapgetpage - subroutine for heapgettup()
*
***************
*** 634,640 **** heapgettup(HeapScanDesc scan,
*/
if (backward)
{
! finished = (page == scan->rs_startblock);
if (page == 0)
page = scan->rs_nblocks;
page--;
--- 644,651 ----
*/
if (backward)
{
! finished = --scan->rs_numblocks <= 0 ||
! (page == scan->rs_startblock);
if (page == 0)
page = scan->rs_nblocks;
page--;
***************
*** 644,650 **** heapgettup(HeapScanDesc scan,
page++;
if (page >= scan->rs_nblocks)
page = 0;
! finished = (page == scan->rs_startblock);
/*
* Report our new scan position for synchronization purposes. We
--- 655,662 ----
page++;
if (page >= scan->rs_nblocks)
page = 0;
! finished = --scan->rs_numblocks <= 0 ||
! (page == scan->rs_startblock);
/*
* Report our new scan position for synchronization purposes. We
***************
*** 895,901 **** heapgettup_pagemode(HeapScanDesc scan,
*/
if (backward)
{
! finished = (page == scan->rs_startblock);
if (page == 0)
page = scan->rs_nblocks;
page--;
--- 907,913 ----
*/
if (backward)
{
! finished = --scan->rs_numblocks <= 0 || page == scan->rs_startblock;
if (page == 0)
page = scan->rs_nblocks;
page--;
***************
*** 905,911 **** heapgettup_pagemode(HeapScanDesc scan,
page++;
if (page >= scan->rs_nblocks)
page = 0;
! finished = (page == scan->rs_startblock);
/*
* Report our new scan position for synchronization purposes. We
--- 917,923 ----
page++;
if (page >= scan->rs_nblocks)
page = 0;
! finished = --scan->rs_numblocks <= 0 || page == scan->rs_startblock;
/*
* Report our new scan position for synchronization purposes. We
*** /dev/null
--- b/src/backend/access/minmax/Makefile
***************
*** 0 ****
--- 1,17 ----
+ #-------------------------------------------------------------------------
+ #
+ # Makefile--
+ # Makefile for access/minmax
+ #
+ # IDENTIFICATION
+ # src/backend/access/minmax/Makefile
+ #
+ #-------------------------------------------------------------------------
+
+ subdir = src/backend/access/minmax
+ top_builddir = ../../../..
+ include $(top_builddir)/src/Makefile.global
+
+ OBJS = minmax.o mmrevmap.o mmtuple.o mmxlog.o
+
+ include $(top_srcdir)/src/backend/common.mk
*** /dev/null
--- b/src/backend/access/minmax/minmax.c
***************
*** 0 ****
--- 1,1525 ----
+ /*
+ * minmax.c
+ * Implementation of Minmax indexes for Postgres
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/minmax.c
+ *
+ * TODO
+ * * do we need to reserve special space on pages?
+ * * support collatable datatypes
+ * * on heap insert, we always create a new index entry. Need to mark
+ * range as unsummarized at some point, to avoid index bloat?
+ * * index truncation on vacuum?
+ * * datumCopy() is needed in several places?
+ */
+ #include "postgres.h"
+
+ #include "access/htup_details.h"
+ #include "access/minmax.h"
+ #include "access/minmax_internal.h"
+ #include "access/minmax_revmap.h"
+ #include "access/minmax_tuple.h"
+ #include "access/minmax_xlog.h"
+ #include "access/relscan.h"
+ #include "access/xlogutils.h"
+ #include "catalog/index.h"
+ #include "catalog/pg_operator.h"
+ #include "commands/vacuum.h"
+ #include "miscadmin.h"
+ #include "pgstat.h"
+ #include "storage/bufmgr.h"
+ #include "storage/freespace.h"
+ #include "storage/lmgr.h"
+ #include "utils/datum.h"
+ #include "utils/lsyscache.h"
+ #include "utils/memutils.h"
+ #include "utils/syscache.h"
+
+
+ /*
+ * We use a MMBuildState during initial construction of a Minmax index.
+ * Within that struct, each column's construction info is represented by a
+ * MMPerColBuildInfo struct. The running state is all kept in a
+ * DeformedMMTuple.
+ */
+ typedef struct MMPerColBuildInfo
+ {
+ int typLen;
+ bool typByVal;
+ FmgrInfo lt;
+ FmgrInfo gt;
+ } MMPerColBuildInfo;
+
+ typedef struct MMBuildState
+ {
+ Relation irel;
+ int numtuples;
+ Buffer currentInsertBuf;
+ BlockNumber currRangeStart;
+ BlockNumber nextRangeAt;
+ mmRevmapAccess *rmAccess;
+ TupleDesc indexDesc;
+ TupleDesc diskDesc;
+ DeformedMMTuple *dtuple;
+ MMPerColBuildInfo perColState[FLEXIBLE_ARRAY_MEMBER];
+ } MMBuildState;
+
+ static void mmbuildCallback(Relation index,
+ HeapTuple htup, Datum *values, bool *isnull,
+ bool tupleIsAlive, void *state);
+ static void get_mm_operator(Oid opfam, Oid idxtypid, Oid keytypid,
+ StrategyNumber strategy, FmgrInfo *finfo);
+ static inline bool invoke_mm_operator(FmgrInfo *operator, Oid collation,
+ Datum left, Datum right);
+ static void mm_doinsert(Relation idxrel, mmRevmapAccess *rmAccess,
+ Buffer *buffer, BlockNumber heapblkno, MMTuple *tup, Size itemsz);
+ static Buffer mm_getnewbuffer(Relation irel);
+ static bool mm_getinsertbuffer(Relation irel, Buffer *buffer, Size itemsz);
+
+
+ #define MINMAX_PAGES_PER_RANGE 2
+
+
+ /*
+ * A tuple in the heap is being inserted. To keep a minmax index up to date,
+ * we need to obtain the relevant index tuple, compare its min()/max() stored
+ * values with those of the new tuple; if the tuple values are in range,
+ * there's nothing to do; otherwise we need to create a new index tuple and
+ * point the revmap to it.
+ *
+ * If the range is not currently summarized (i.e. the revmap returns InvalidTid
+ * for it), there's nothing to do either.
+ */
+ Datum
+ mminsert(PG_FUNCTION_ARGS)
+ {
+ Relation idxRel = (Relation) PG_GETARG_POINTER(0);
+ Datum *values = (Datum *) PG_GETARG_POINTER(1);
+ bool *nulls = (bool *) PG_GETARG_POINTER(2);
+ ItemPointer heaptid = (ItemPointer) PG_GETARG_POINTER(3);
+
+ /* we ignore the rest of our arguments */
+ mmRevmapAccess *rmAccess;
+ Datum indclassDatum;
+ bool isnull;
+ oidvector *indclass;
+ TupleDesc tupdesc;
+ MMTuple *mmtup;
+ DeformedMMTuple *dtup;
+ ItemPointerData idxtid;
+ BlockNumber heapBlk;
+ BlockNumber iblk;
+ OffsetNumber ioff;
+ Buffer buf;
+ IndexInfo *indexInfo;
+ Page page;
+ int keyno;
+ FmgrInfo *lt;
+ FmgrInfo *gt;
+ bool need_insert = false;
+
+ rmAccess = mmRevmapAccessInit(idxRel, MINMAX_PAGES_PER_RANGE);
+
+ heapBlk = ItemPointerGetBlockNumber(heaptid);
+ mmGetHeapBlockItemptr(rmAccess, heapBlk, &idxtid);
+ /* tuple lock on idxtid is grabbed by mmGetHeapBlockItemptr */
+
+ if (!ItemPointerIsValid(&idxtid))
+ {
+ /* nothing to do, range is unsummarized */
+ mmRevmapAccessTerminate(rmAccess);
+ return BoolGetDatum(false);
+ }
+
+ tupdesc = RelationGetDescr(idxRel);
+ indexInfo = BuildIndexInfo(idxRel);
+
+ lt = palloc(sizeof(FmgrInfo) * indexInfo->ii_NumIndexAttrs);
+ gt = palloc(sizeof(FmgrInfo) * indexInfo->ii_NumIndexAttrs);
+
+ /* grab the operators we will need: < and > for each indexed column */
+ indclassDatum = SysCacheGetAttr(INDEXRELID, idxRel->rd_indextuple,
+ Anum_pg_index_indclass, &isnull);
+ Assert(!isnull);
+ indclass = (oidvector *) DatumGetPointer(indclassDatum);
+ for (keyno = 0; keyno < indexInfo->ii_NumIndexAttrs; keyno++)
+ {
+ Oid opfam = get_opclass_family(indclass->values[keyno]);
+ Oid idxtypid = tupdesc->attrs[keyno]->atttypid;
+
+ get_mm_operator(opfam, idxtypid, idxtypid, BTLessStrategyNumber,
+ <[keyno]);
+ get_mm_operator(opfam, idxtypid, idxtypid, BTGreaterStrategyNumber,
+ >[keyno]);
+ }
+
+ iblk = ItemPointerGetBlockNumber(&idxtid);
+ ioff = ItemPointerGetOffsetNumber(&idxtid);
+ buf = ReadBuffer(idxRel, iblk);
+
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ UnlockTuple(idxRel, &idxtid, ShareLock);
+ page = BufferGetPage(buf);
+ mmtup = (MMTuple *) PageGetItem(page, PageGetItemId(page, ioff));
+
+ dtup = minmax_deform_tuple(tupdesc, mmtup);
+
+ /*
+ * Compare the key values of the new tuple to the stored index values.
+ * Note that we need to keep checking each column even after noticing that a
+ * new tuple is necessary, because as a side effect this loop will update
+ * the dtup with the values to insert in the new tuple.
+ */
+ for (keyno = 0; keyno < indexInfo->ii_NumIndexAttrs; keyno++)
+ {
+ /*
+ * If the new tuple contains a null in this attr, but the range index
+ * tuple doesn't allow for nulls, we need a new summary tuple.
+ */
+ if (nulls[keyno])
+ {
+ if (!dtup->values[keyno].hasnulls)
+ {
+ need_insert = true;
+ dtup->values[keyno].hasnulls = true;
+ }
+ /* a null value cannot widen the min/max interval itself */
+ continue;
+ }
+
+ /*
+ * If the new key value is not within the min/max interval for this
+ * range, we need a new summary tuple.
+ */
+ if (invoke_mm_operator(<[keyno], InvalidOid, values[keyno],
+ dtup->values[keyno].min))
+ {
+ dtup->values[keyno].min = values[keyno]; /* XXX datumCopy? */
+ need_insert = true;
+ }
+ if (invoke_mm_operator(>[keyno], InvalidOid, values[keyno],
+ dtup->values[keyno].max))
+ {
+ dtup->values[keyno].max = values[keyno]; /* XXX datumCopy? */
+ need_insert = true;
+ }
+ }
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ if (need_insert)
+ {
+ TupleDesc diskDesc;
+ Size tupsz;
+ MMTuple *tup;
+
+ diskDesc = minmax_get_descr(tupdesc);
+ tup = minmax_form_tuple(tupdesc, diskDesc, dtup, &tupsz);
+
+ mm_doinsert(idxRel, rmAccess, &buf, heapBlk, tup, tupsz);
+ }
+
+ ReleaseBuffer(buf);
+
+ mmRevmapAccessTerminate(rmAccess);
+
+ return BoolGetDatum(false);
+ }
+
+ Datum
+ mmbeginscan(PG_FUNCTION_ARGS)
+ {
+ Relation r = (Relation) PG_GETARG_POINTER(0);
+ int nkeys = PG_GETARG_INT32(1);
+ int norderbys = PG_GETARG_INT32(2);
+ IndexScanDesc scan;
+
+ scan = RelationGetIndexScan(r, nkeys, norderbys);
+
+ PG_RETURN_POINTER(scan);
+ }
+
+
+ /*
+ * Execute the index scan.
+ *
+ * This works by reading index TIDs from the revmap, and obtaining the index
+ * tuples pointed to by them; the min/max values in them are compared to the
+ * scan keys. We return into the TID bitmap all the pages in ranges
+ * corresponding to index tuples that match the scan keys.
+ *
+ * If a TID from the revmap is read as InvalidTID, we know that range is
+ * unsummarized. Pages in those ranges need to be returned regardless of scan
+ * keys.
+ */
+ Datum
+ mmgetbitmap(PG_FUNCTION_ARGS)
+ {
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+ TIDBitmap *tbm = (TIDBitmap *) PG_GETARG_POINTER(1);
+ Relation idxRel = scan->indexRelation;
+ Buffer currIdxBuf = InvalidBuffer;
+ Oid heapOid;
+ Relation heapRel;
+ mmRevmapAccess *rmAccess;
+ BlockNumber nblocks;
+ BlockNumber heapBlk;
+ TupleDesc tupdesc;
+ AttrNumber keyno;
+ Datum indclassDatum;
+ bool isnull;
+ oidvector *indclass;
+ FmgrInfo *lt;
+ FmgrInfo *lteq;
+ FmgrInfo *gteq;
+ FmgrInfo *gt;
+
+ pgstat_count_index_scan(idxRel);
+
+ heapOid = IndexGetRelation(RelationGetRelid(idxRel), false);
+ heapRel = heap_open(heapOid, AccessShareLock);
+ nblocks = RelationGetNumberOfBlocks(heapRel);
+ heap_close(heapRel, AccessShareLock);
+
+ tupdesc = RelationGetDescr(idxRel);
+
+ lt = palloc(sizeof(FmgrInfo) * scan->numberOfKeys);
+ lteq = palloc(sizeof(FmgrInfo) * scan->numberOfKeys);
+ gteq = palloc(sizeof(FmgrInfo) * scan->numberOfKeys);
+ gt = palloc(sizeof(FmgrInfo) * scan->numberOfKeys);
+
+ /*
+ * lookup the operators needed to determine range containment of each key
+ * value.
+ */
+ indclassDatum = SysCacheGetAttr(INDEXRELID, idxRel->rd_indextuple,
+ Anum_pg_index_indclass, &isnull);
+ Assert(!isnull);
+ indclass = (oidvector *) DatumGetPointer(indclassDatum);
+ for (keyno = 0; keyno < scan->numberOfKeys; keyno++)
+ {
+ AttrNumber keyattno;
+ Oid opfam;
+ Oid keytypid;
+ Oid idxtypid;
+
+ keyattno = scan->keyData[keyno].sk_attno;
+ opfam = get_opclass_family(indclass->values[keyattno - 1]);
+ keytypid = scan->keyData[keyno].sk_subtype;
+ idxtypid = tupdesc->attrs[keyattno - 1]->atttypid;
+
+ get_mm_operator(opfam, idxtypid, keytypid, BTLessStrategyNumber,
+ <[keyno]);
+ get_mm_operator(opfam, idxtypid, keytypid, BTLessEqualStrategyNumber,
+ <eq[keyno]);
+ get_mm_operator(opfam, idxtypid, keytypid, BTGreaterStrategyNumber,
+ >[keyno]);
+ get_mm_operator(opfam, idxtypid, keytypid, BTGreaterEqualStrategyNumber,
+ >eq[keyno]);
+ }
+
+ /*
+ * Now scan the revmap. We start by querying for heap page 0,
+ * incrementing by the number of pages per range; this gives us a full
+ * view of the table.
+ */
+ rmAccess = mmRevmapAccessInit(idxRel, MINMAX_PAGES_PER_RANGE);
+ for (heapBlk = 0; heapBlk < nblocks; heapBlk += MINMAX_PAGES_PER_RANGE)
+ {
+ ItemPointerData itupptr;
+ bool addrange;
+
+ mmGetHeapBlockItemptr(rmAccess, heapBlk, &itupptr);
+
+ /*
+ * For revmap items that return InvalidTID, we must return the whole
+ * range; otherwise, fetch the index item and compare it to the scan
+ * keys.
+ */
+ if (!ItemPointerIsValid(&itupptr))
+ {
+ addrange = true;
+ }
+ else
+ {
+ Page page;
+ OffsetNumber idxoffno;
+ BlockNumber idxblkno;
+ MMTuple *tup;
+ DeformedMMTuple *dtup;
+ int keyno;
+
+ idxoffno = ItemPointerGetOffsetNumber(&itupptr);
+ idxblkno = ItemPointerGetBlockNumber(&itupptr);
+
+ if (currIdxBuf == InvalidBuffer ||
+ idxblkno != BufferGetBlockNumber(currIdxBuf))
+ {
+ if (currIdxBuf != InvalidBuffer)
+ ReleaseBuffer(currIdxBuf);
+
+ currIdxBuf = ReadBuffer(idxRel, idxblkno);
+ }
+
+ /*
+ * To keep the buffer locked for a short time, we grab and
+ * immediately deform the index tuple to operate on. As soon as
+ * we have acquired the lock on the index buffer, we can release
+ * the tuple lock the revmap acquired for us. So vacuum would be
+ * able to remove this index row as soon as we release the buffer
+ * lock, if it has become stale.
+ */
+ LockBuffer(currIdxBuf, BUFFER_LOCK_SHARE);
+
+ UnlockTuple(idxRel, &itupptr, ShareLock);
+
+ page = BufferGetPage(currIdxBuf);
+ tup = (MMTuple *)
+ PageGetItem(page, PageGetItemId(page, idxoffno));
+ /* XXX probably need copies */
+ dtup = minmax_deform_tuple(tupdesc, tup);
+
+ /* done with the index page */
+ LockBuffer(currIdxBuf, BUFFER_LOCK_UNLOCK);
+
+ /*
+ * Compare scan keys with min/max values stored in range. If scan
+ * keys are matched, the page range must be added to the bitmap.
+ */
+ for (keyno = 0, addrange = true;
+ keyno < scan->numberOfKeys;
+ keyno++)
+ {
+ ScanKey key = &scan->keyData[keyno];
+ AttrNumber keyattno = key->sk_attno;
+
+ /*
+ * The analysis we need to make to decide whether to include a
+ * page range in the output result is: is it possible for a
+ * tuple contained within the min/max interval specified by
+ * this index tuple to match what's specified by the scan key?
+ * For example, for a query qual such as "WHERE col < 10" we
+ * need to include a range whose minimum value is less than
+ * 10.
+ *
+ * When there are multiple scan keys, failure to meet the
+ * criteria for a single one of them is enough to discard the
+ * range as a whole.
+ */
+ switch (key->sk_strategy)
+ {
+ case BTLessStrategyNumber:
+ addrange =
+ invoke_mm_operator(<[keyno], InvalidOid,
+ dtup->values[keyattno - 1].min,
+ key->sk_argument);
+ break;
+ case BTLessEqualStrategyNumber:
+ addrange =
+ invoke_mm_operator(<eq[keyno], InvalidOid,
+ dtup->values[keyattno - 1].min,
+ key->sk_argument);
+ break;
+ case BTEqualStrategyNumber:
+
+ /*
+ * In the equality case (WHERE col = someval), we want
+ * to return the current page range if the minimum
+ * value in the range <= scan key, and the maximum
+ * value >= scan key.
+ */
+ addrange =
+ invoke_mm_operator(<eq[keyno], InvalidOid,
+ dtup->values[keyattno - 1].min,
+ key->sk_argument);
+ if (!addrange)
+ break;
+ /* max() >= scankey */
+ addrange =
+ invoke_mm_operator(>eq[keyno], InvalidOid,
+ dtup->values[keyattno - 1].max,
+ key->sk_argument);
+ break;
+ case BTGreaterEqualStrategyNumber:
+ addrange =
+ invoke_mm_operator(>eq[keyno], InvalidOid,
+ dtup->values[keyattno - 1].max,
+ key->sk_argument);
+ break;
+ case BTGreaterStrategyNumber:
+ addrange =
+ invoke_mm_operator(>[keyno], InvalidOid,
+ dtup->values[keyattno - 1].max,
+ key->sk_argument);
+ break;
+ default:
+ /* can't happen */
+ elog(ERROR, "invalid strategy number %d", key->sk_strategy);
+ addrange = false;
+ break;
+ }
+
+ /*
+ * If the current scan key doesn't match the range values,
+ * don't look at further ones.
+ */
+ if (!addrange)
+ break;
+ }
+
+ /* XXX anything to free here? */
+ pfree(dtup);
+ }
+
+ if (addrange)
+ {
+ BlockNumber pageno;
+
+ for (pageno = heapBlk;
+ pageno < heapBlk + MINMAX_PAGES_PER_RANGE && pageno < nblocks;
+ pageno++)
+ tbm_add_page(tbm, pageno);
+ }
+ }
+
+ mmRevmapAccessTerminate(rmAccess);
+ if (currIdxBuf != InvalidBuffer)
+ ReleaseBuffer(currIdxBuf);
+
+ pfree(lt);
+ pfree(lteq);
+ pfree(gt);
+ pfree(gteq);
+
+ PG_RETURN_INT64(MaxHeapTuplesPerPage);
+ }
+
+
+ Datum
+ mmrescan(PG_FUNCTION_ARGS)
+ {
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+ ScanKey scankey = (ScanKey) PG_GETARG_POINTER(1);
+
+ /* other arguments ignored */
+
+ if (scankey && scan->numberOfKeys > 0)
+ {
+ memmove(scan->keyData, scankey,
+ scan->numberOfKeys * sizeof(ScanKeyData));
+ }
+
+ PG_RETURN_VOID();
+ }
+
+ Datum
+ mmendscan(PG_FUNCTION_ARGS)
+ {
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+
+ /* anything to do here? */
+ (void) scan; /* silence compiler */
+
+ PG_RETURN_VOID();
+ }
+
+ Datum
+ mmmarkpos(PG_FUNCTION_ARGS)
+ {
+ elog(ERROR, "MinMax does not support mark/restore");
+ PG_RETURN_VOID();
+ }
+
+ Datum
+ mmrestrpos(PG_FUNCTION_ARGS)
+ {
+ elog(ERROR, "MinMax does not support mark/restore");
+ PG_RETURN_VOID();
+ }
+
+ /*
+ * Reset the per-column build state in an MMBuildState.
+ */
+ static void
+ clear_mm_percol_buildstate(MMBuildState *mmstate)
+ {
+ int i;
+
+ for (i = 0; i < mmstate->indexDesc->natts; i++)
+ {
+ mmstate->dtuple->values[i].allnulls = true;
+ mmstate->dtuple->values[i].hasnulls = false;
+ mmstate->dtuple->values[i].min = (Datum) 0;
+ mmstate->dtuple->values[i].max = (Datum) 0;
+ }
+ }
+
+ /*
+ * Per-heap-tuple callback for IndexBuildHeapScan.
+ *
+ * Note we don't worry about the final page range at the end of the table here;
+ * its values are present in the build state struct but not inserted into the
+ * index. The caller must do that, if appropriate.
+ */
+ static void
+ mmbuildCallback(Relation index,
+ HeapTuple htup,
+ Datum *values,
+ bool *isnull,
+ bool tupleIsAlive,
+ void *state)
+ {
+ MMBuildState *mmstate = (MMBuildState *) state;
+ BlockNumber thisblock;
+ int i;
+
+ thisblock = ItemPointerGetBlockNumber(&htup->t_self);
+
+ /*
+ * If we're in a new block which belongs to the next range, summarize what
+ * we've got and start afresh.
+ */
+ if (thisblock == mmstate->nextRangeAt)
+ {
+ MMTuple *tup;
+ Size size;
+
+ #if 0
+ for (i = 0; i < mmstate->indexDesc->natts; i++)
+ {
+ elog(DEBUG2, "completed a range for column %d, range: %u .. %u",
+ i,
+ DatumGetUInt32(mmstate->dtuple->values[i].min),
+ DatumGetUInt32(mmstate->dtuple->values[i].max));
+ }
+ #endif
+
+ /*
+ * Create the index tuple containing min/max values, and insert it.
+ */
+ tup = minmax_form_tuple(mmstate->indexDesc, mmstate->diskDesc,
+ mmstate->dtuple, &size);
+ mm_doinsert(mmstate->irel, mmstate->rmAccess,
+ &mmstate->currentInsertBuf, mmstate->currRangeStart, tup,
+ size);
+ mmstate->numtuples++;
+ pfree(tup);
+
+ /* and set state to correspond to the new current range */
+ mmstate->currRangeStart = mmstate->nextRangeAt;
+ mmstate->nextRangeAt = mmstate->currRangeStart + MINMAX_PAGES_PER_RANGE;
+
+ /* initialize aggregate state for the new range */
+ for (i = 0; i < mmstate->indexDesc->natts; i++)
+ {
+ if (!mmstate->dtuple->values[i].allnulls &&
+ !mmstate->perColState[i].typByVal)
+ {
+ pfree(DatumGetPointer(mmstate->dtuple->values[i].min));
+ pfree(DatumGetPointer(mmstate->dtuple->values[i].max));
+ }
+ }
+
+ clear_mm_percol_buildstate(mmstate);
+ }
+
+ /* Accumulate the current tuple into the running state */
+ for (i = 0; i < mmstate->indexDesc->natts; i++)
+ {
+ /*
+ * If the value in the current heap tuple is null, there's not much to
+ * do other than keep track that we saw it.
+ */
+ if (isnull[i])
+ {
+ mmstate->dtuple->values[i].hasnulls = true;
+ continue;
+ }
+
+ /*
+ * If this is the first tuple in the range containing a not-null value
+ * for this column, initialize our state.
+ */
+ if (mmstate->dtuple->values[i].allnulls)
+ {
+ mmstate->dtuple->values[i].allnulls = false;
+ mmstate->dtuple->values[i].min =
+ datumCopy(values[i],
+ mmstate->perColState[i].typByVal,
+ mmstate->perColState[i].typLen);
+ mmstate->dtuple->values[i].max =
+ datumCopy(values[i],
+ mmstate->perColState[i].typByVal,
+ mmstate->perColState[i].typLen);
+ continue;
+ }
+
+ /*
+ * Otherwise, dtuple state was already initialized, and the current
+ * tuple is not null: therefore we need to compare it to the current
+ * state and possibly update the min/max boundaries.
+ */
+ if (invoke_mm_operator(&mmstate->perColState[i].lt, InvalidOid,
+ values[i],
+ mmstate->dtuple->values[i].min))
+ {
+ if (!mmstate->perColState[i].typByVal)
+ pfree(DatumGetPointer(mmstate->dtuple->values[i].min));
+ mmstate->dtuple->values[i].min =
+ datumCopy(values[i],
+ mmstate->perColState[i].typByVal,
+ mmstate->perColState[i].typLen);
+ }
+
+ if (invoke_mm_operator(&mmstate->perColState[i].gt, InvalidOid,
+ values[i],
+ mmstate->dtuple->values[i].max))
+ {
+ if (!mmstate->perColState[i].typByVal)
+ pfree(DatumGetPointer(mmstate->dtuple->values[i].max));
+ mmstate->dtuple->values[i].max =
+ datumCopy(values[i],
+ mmstate->perColState[i].typByVal,
+ mmstate->perColState[i].typLen);
+ }
+ }
+ }
+
+ static MMBuildState *
+ initialize_mm_buildstate(Relation heapRel, Relation idxRel,
+ mmRevmapAccess *rmAccess, IndexInfo *indexInfo)
+ {
+ MMBuildState *mmstate;
+ TupleDesc heapDesc = RelationGetDescr(heapRel);
+ Datum indclassDatum;
+ bool isnull;
+ oidvector *indclass;
+ int i;
+
+ mmstate = palloc(offsetof(MMBuildState, perColState) +
+ sizeof(MMPerColBuildInfo) * indexInfo->ii_NumIndexAttrs);
+
+ mmstate->irel = idxRel;
+ mmstate->numtuples = 0;
+ mmstate->currentInsertBuf = InvalidBuffer;
+ mmstate->currRangeStart = 0;
+ mmstate->nextRangeAt = MINMAX_PAGES_PER_RANGE;
+ mmstate->rmAccess = rmAccess;
+ mmstate->indexDesc = RelationGetDescr(idxRel);
+ mmstate->diskDesc = minmax_get_descr(mmstate->indexDesc);
+
+ mmstate->dtuple = palloc(offsetof(DeformedMMTuple, values) +
+ sizeof(MMValues) * indexInfo->ii_NumIndexAttrs);
+ /* other stuff in dtuple is initialized below */
+
+ indclassDatum = SysCacheGetAttr(INDEXRELID, idxRel->rd_indextuple,
+ Anum_pg_index_indclass, &isnull);
+ Assert(!isnull);
+ indclass = (oidvector *) DatumGetPointer(indclassDatum);
+
+ for (i = 0; i < mmstate->indexDesc->natts; i++)
+ {
+ int heapAttno;
+ Form_pg_attribute attr;
+ Oid opfam = get_opclass_family(indclass->values[i]);
+ Oid idxtypid = mmstate->indexDesc->attrs[i]->atttypid;
+
+ heapAttno = indexInfo->ii_KeyAttrNumbers[i];
+ if (heapAttno == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot create minmax indexes on expressions")));
+
+ attr = heapDesc->attrs[heapAttno - 1];
+ mmstate->perColState[i].typByVal = attr->attbyval;
+ mmstate->perColState[i].typLen = attr->attlen;
+ get_mm_operator(opfam, idxtypid, idxtypid, BTLessStrategyNumber,
+ &(mmstate->perColState[i].lt));
+ get_mm_operator(opfam, idxtypid, idxtypid, BTGreaterStrategyNumber,
+ &(mmstate->perColState[i].gt));
+
+ /* initialize per-column state */
+ }
+
+ clear_mm_percol_buildstate(mmstate);
+
+ return mmstate;
+ }
+
+ void
+ mm_init_metapage(Buffer meta)
+ {
+ MinmaxMetaPageData *metadata;
+ Page page = BufferGetPage(meta);
+
+ PageInit(page, BLCKSZ, 0);
+
+ metadata = (MinmaxMetaPageData *) PageGetContents(page);
+
+ metadata->minmaxMagic = MINMAX_META_MAGIC;
+ metadata->minmaxVersion = MINMAX_CURRENT_VERSION;
+ }
+
+ /*
+ * mmbuild() -- build a new minmax index.
+ */
+ Datum
+ mmbuild(PG_FUNCTION_ARGS)
+ {
+ Relation heap = (Relation) PG_GETARG_POINTER(0);
+ Relation index = (Relation) PG_GETARG_POINTER(1);
+ IndexInfo *indexInfo = (IndexInfo *) PG_GETARG_POINTER(2);
+ IndexBuildResult *result;
+ double reltuples;
+ mmRevmapAccess *rmAccess;
+ MMBuildState *mmstate;
+ Buffer meta;
+
+ /*
+ * We expect to be called exactly once for any index relation.
+ */
+ if (RelationGetNumberOfBlocks(index) != 0)
+ elog(ERROR, "index \"%s\" already contains data",
+ RelationGetRelationName(index));
+
+ /* partial indexes not supported */
+ if (indexInfo->ii_Predicate != NIL)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("partial indexes not supported")));
+ /* expressions not supported (yet?) */
+ if (indexInfo->ii_Expressions != NIL)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("expression indexes not supported")));
+
+ /* get the metapage buffer before starting the critical section */
+ meta = mm_getnewbuffer(index);
+
+ START_CRIT_SECTION();
+ mm_init_metapage(meta);
+ MarkBufferDirty(meta);
+
+ if (RelationNeedsWAL(index))
+ {
+ XLogRecPtr recptr;
+ XLogRecData rdata;
+ Page page;
+
+ rdata.buffer = InvalidBuffer;
+ rdata.data = (char *) &(index->rd_node);
+ rdata.len = sizeof(RelFileNode);
+ rdata.next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_CREATE_INDEX, &rdata);
+
+ page = BufferGetPage(meta);
+ PageSetLSN(page, recptr);
+ }
+
+ UnlockReleaseBuffer(meta);
+ END_CRIT_SECTION();
+
+ /* set up our "reverse map" fork */
+ mmRevmapCreate(index);
+
+ /*
+ * Initialize our state, including the deformed tuple state.
+ */
+ rmAccess = mmRevmapAccessInit(index, MINMAX_PAGES_PER_RANGE);
+ mmstate = initialize_mm_buildstate(heap, index, rmAccess, indexInfo);
+
+ /*
+ * Now scan the relation. No syncscan allowed here because we want the
+ * heap blocks in order
+ */
+ reltuples = IndexBuildHeapScan(heap, index, indexInfo, false,
+ mmbuildCallback, (void *) mmstate);
+
+ /* XXX process the final batch, if needed */
+
+
+ /* release the last index buffer used */
+ if (!BufferIsInvalid(mmstate->currentInsertBuf))
+ {
+ ReleaseBuffer(mmstate->currentInsertBuf);
+ mmstate->currentInsertBuf = InvalidBuffer;
+ }
+
+ mmRevmapAccessTerminate(mmstate->rmAccess);
+
+ /*
+ * Return statistics
+ */
+ result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+
+ result->heap_tuples = reltuples;
+ result->index_tuples = mmstate->numtuples;
+
+ PG_RETURN_POINTER(result);
+ }
+
+ Datum
+ mmbuildempty(PG_FUNCTION_ARGS)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("unlogged MinMax indexes are not supported")));
+
+ PG_RETURN_VOID();
+ }
+
+ Datum
+ mmbulkdelete(PG_FUNCTION_ARGS)
+ {
+ PG_RETURN_POINTER(NULL);
+ }
+
+ /*
+ * qsort comparator for ItemPointerData items
+ */
+ static int
+ qsortCompareItemPointers(const void *a, const void *b)
+ {
+ return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
+ }
+
+ /*
+ * Remove index tuples that are no longer useful.
+ *
+ * While at it, return an array of block numbers for which the revmap returns
+ * InvalidTid; this is used in a later stage to execute re-summarization.
+ * (The block numbers are the heap page numbers at which each unsummarized
+ * range starts.) Space for the array is palloc'ed, and must be
+ * freed by caller.
+ */
+ static void
+ remove_deletable_tuples(Relation idxRel, BlockNumber heapNumBlocks,
+ BufferAccessStrategy strategy,
+ BlockNumber **nonsummed, int *numnonsummed)
+ {
+ HASHCTL hctl;
+ HTAB *tuples;
+ HASH_SEQ_STATUS status;
+ MemoryContext hashcxt;
+ BlockNumber nblocks;
+ BlockNumber blk;
+ mmRevmapAccess *rmAccess;
+ BlockNumber heapBlk;
+ int numitems = 0;
+ int numdeletable = 0;
+ ItemPointerData *deletable;
+ int start;
+ int i;
+ BlockNumber *nonsumm = NULL;
+ int maxnonsumm = 0;
+ int numnonsumm = 0;
+
+ typedef struct DeletableTuple
+ {
+ ItemPointerData tid;
+ bool referenced;
+ } DeletableTuple;
+
+ nblocks = RelationGetNumberOfBlocks(idxRel);
+
+ hashcxt = AllocSetContextCreate(CurrentMemoryContext,
+ "mm remove deletable hash",
+ ALLOCSET_SMALL_MINSIZE,
+ ALLOCSET_SMALL_INITSIZE,
+ ALLOCSET_SMALL_MAXSIZE);
+
+ /* Initialize hash used to track deletable tuples */
+ memset(&hctl, 0, sizeof(hctl));
+ hctl.keysize = sizeof(ItemPointerData);
+ hctl.entrysize = sizeof(DeletableTuple);
+ hctl.hcxt = hashcxt;
+ hctl.hash = tag_hash;
+
+ /* assume ten entries per page. No harm in getting this wrong */
+ tuples = hash_create("mmvacuumcleanup", nblocks * 10, &hctl,
+ HASH_CONTEXT | HASH_FUNCTION | HASH_ELEM);
+
+ /*
+ * Scan the index sequentially, entering each item into a hash table.
+ * Initially, the items are marked as not referenced.
+ */
+ for (blk = 0; blk < nblocks; blk++)
+ {
+ Buffer buf;
+ Page page;
+ OffsetNumber offno;
+
+ vacuum_delay_point();
+
+ buf = ReadBufferExtended(idxRel, MAIN_FORKNUM, blk, RBM_NORMAL,
+ strategy);
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ page = BufferGetPage(buf);
+
+ for (offno = 1; offno <= PageGetMaxOffsetNumber(page); offno++)
+ {
+ ItemPointerData tid;
+ ItemId itemid;
+ bool found;
+ DeletableTuple *hitem;
+
+ itemid = PageGetItemId(page, offno);
+ if (!ItemIdHasStorage(itemid))
+ continue;
+
+ ItemPointerSet(&tid, blk, offno);
+ hitem = (DeletableTuple *) hash_search(tuples,
+ &tid,
+ HASH_ENTER,
+ &found);
+ Assert(!found);
+ hitem->referenced = false;
+ }
+ UnlockReleaseBuffer(buf);
+ }
+
+ /*
+ * now scan the revmap, and determine which of these TIDs are still
+ * referenced
+ */
+ rmAccess = mmRevmapAccessInit(idxRel, MINMAX_PAGES_PER_RANGE);
+ for (heapBlk = 0, numitems = 0;
+ heapBlk < heapNumBlocks;
+ heapBlk += MINMAX_PAGES_PER_RANGE)
+ {
+ ItemPointerData itupptr;
+ DeletableTuple *hitem;
+ bool found;
+
+ mmGetHeapBlockItemptr(rmAccess, heapBlk, &itupptr);
+
+ if (!ItemPointerIsValid(&itupptr))
+ {
+ /*
+ * Ignore revmap entries set to invalid. However, if the heap page
+ * range is complete but not summarized, store its initial page
+ * number in the unsummarized array, for later summarization.
+ */
+ if (heapBlk + MINMAX_PAGES_PER_RANGE < heapNumBlocks)
+ {
+ if (maxnonsumm == 0)
+ {
+ Assert(!nonsumm);
+ maxnonsumm = 8;
+ nonsumm = palloc(sizeof(BlockNumber) * maxnonsumm);
+ }
+ else if (numnonsumm >= maxnonsumm)
+ {
+ maxnonsumm *= 2;
+ nonsumm = repalloc(nonsumm, sizeof(BlockNumber) * maxnonsumm);
+ }
+
+ nonsumm[numnonsumm++] = heapBlk;
+ }
+
+ continue;
+ }
+ else
+ UnlockTuple(idxRel, &itupptr, ShareLock);
+
+ hitem = (DeletableTuple *) hash_search(tuples,
+ &itupptr,
+ HASH_FIND,
+ &found);
+ if (!found)
+ elog(ERROR, "reverse map references nonexistant index tuple %u/%u",
+ ItemPointerGetBlockNumber(&itupptr),
+ ItemPointerGetOffsetNumber(&itupptr));
+ hitem->referenced = true;
+ numitems++;
+ }
+
+ mmRevmapAccessTerminate(rmAccess);
+
+ /*
+ * Now scan the hash, and keep track of the removable (i.e. not referenced,
+ * not locked) tuples. Allocate this in the hash context, so that it goes
+ * away with it.
+ */
+ deletable = MemoryContextAlloc(hashcxt,
+ sizeof(ItemPointerData) * hash_get_num_entries(tuples));
+
+ hash_freeze(tuples);
+ hash_seq_init(&status, tuples);
+ for (;;)
+ {
+ DeletableTuple *hitem;
+
+ hitem = hash_seq_search(&status);
+ if (!hitem)
+ break;
+ if (hitem->referenced)
+ continue;
+ if (!ConditionalLockTuple(idxRel, &hitem->tid, ExclusiveLock))
+ continue;
+
+ /*
+ * By here, we know this tuple is not referenced from the revmap.
+ * Also, since we hold the tuple lock, we know that if there is a
+ * concurrent scan that had obtained the tuple before the reference
+ * got removed, either that scan is not looking at the tuple (because
+ * that would have prevented us from getting the tuple lock) or it is
+ * holding the containing buffer's lock. If the former, then there's
+ * no problem with removing the tuple immediately; if the latter, we
+ * will block below trying to acquire that lock, so by the time we are
+ * unblocked, the concurrent scan will no longer be interested in the
+ * tuple contents anymore. Therefore, this tuple can be removed from
+ * the block.
+ */
+ UnlockTuple(idxRel, &hitem->tid, ExclusiveLock);
+
+ deletable[numdeletable++] = hitem->tid;
+ }
+
+ /*
+ * Now sort the array of deletable index tuples, and walk this array by
+ * pages doing bulk deletion of items on each page; the free space map is
+ * updated for pages on which we delete item.
+ */
+ qsort(deletable, numdeletable, sizeof(ItemPointerData),
+ qsortCompareItemPointers);
+
+ start = 0;
+ for (i = 0; i < numdeletable; i++)
+ {
+ if (i == numdeletable - 1 ||
+ (ItemPointerGetBlockNumber(&deletable[start]) !=
+ ItemPointerGetBlockNumber(&deletable[i + 1])))
+ {
+ OffsetNumber *offnos;
+ int noffs;
+ Buffer buf;
+ Page page;
+ int j;
+ BlockNumber blk;
+
+ vacuum_delay_point();
+
+ blk = ItemPointerGetBlockNumber(&deletable[start]);
+ buf = ReadBufferExtended(idxRel, MAIN_FORKNUM, blk,
+ RBM_NORMAL, strategy);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ page = BufferGetPage(buf);
+
+ noffs = i + 1 - start;
+ offnos = palloc(sizeof(OffsetNumber) * noffs);
+ for (j = 0; j < noffs; j++)
+ offnos[j] = ItemPointerGetOffsetNumber(&deletable[start + j]);
+
+ START_CRIT_SECTION();
+
+ PageIndexDeleteNoCompact(page, offnos, noffs);
+
+ MarkBufferDirty(buf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(idxRel))
+ {
+ xl_minmax_bulkremove xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+ uint8 info = XLOG_MINMAX_BULKREMOVE;
+
+ xlrec.node = idxRel->rd_node;
+ xlrec.block = blk;
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxBulkRemove;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].next = &(rdata[1]);
+
+ /*
+ * The OffsetNumber array is not actually in the buffer, but we
+ * pretend that it is. When XLogInsert stores the whole
+ * buffer, the offset array need not be stored too.
+ */
+ rdata[1].data = (char *) offnos;
+ rdata[1].len = sizeof(OffsetNumber) * noffs;
+ rdata[1].buffer = buf;
+ rdata[1].buffer_std = true;
+ rdata[1].next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ RecordPageWithFreeSpace(idxRel, blk, PageGetFreeSpace(page));
+
+ start = i + 1;
+
+ UnlockReleaseBuffer(buf);
+ pfree(offnos);
+ }
+ }
+
+ /* Finally, ensure the index's FSM is consistent */
+ FreeSpaceMapVacuum(idxRel);
+
+ *nonsummed = nonsumm;
+ *numnonsummed = numnonsumm;
+
+ hash_destroy(tuples);
+ }
+
+ /*
+ * Summarize the given page ranges of the given index.
+ */
+ static void
+ rerun_summarization(Relation idxRel, Relation heapRel, mmRevmapAccess *rmAccess,
+ BlockNumber *nonsummarized, int numnonsummarized)
+ {
+ int i;
+ IndexInfo *indexInfo;
+ MMBuildState *mmstate;
+
+ indexInfo = BuildIndexInfo(idxRel);
+
+ mmstate = initialize_mm_buildstate(heapRel, idxRel, rmAccess, indexInfo);
+
+ for (i = 0; i < numnonsummarized; i++)
+ {
+ BlockNumber blk = nonsummarized[i];
+ ItemPointerData iptr;
+ MMTuple *tup;
+ Size size;
+
+ mmGetHeapBlockItemptr(rmAccess, blk, &iptr);
+
+ mmstate->currRangeStart = blk;
+ mmstate->nextRangeAt = blk + MINMAX_PAGES_PER_RANGE;
+
+ /* it can't have been re-summarized concurrently .. */
+ Assert(!ItemPointerIsValid(&iptr));
+
+ IndexBuildHeapRangeScan(heapRel, idxRel, indexInfo, false,
+ blk, MINMAX_PAGES_PER_RANGE,
+ mmbuildCallback, (void *) mmstate);
+
+ /*
+ * Create the index tuple containing min/max values, and insert it.
+ * Note mmbuildCallback didn't have the chance to actually insert
+ * anything into the index, because the heapscan should have ended
+ * just as it reached the final tuple in the range.
+ */
+ tup = minmax_form_tuple(mmstate->indexDesc, mmstate->diskDesc,
+ mmstate->dtuple, &size);
+ mm_doinsert(mmstate->irel, mmstate->rmAccess,
+ &mmstate->currentInsertBuf, mmstate->currRangeStart, tup,
+ size);
+ mmstate->numtuples++;
+ pfree(tup);
+
+ clear_mm_percol_buildstate(mmstate);
+ }
+
+ if (!BufferIsInvalid(mmstate->currentInsertBuf))
+ {
+ ReleaseBuffer(mmstate->currentInsertBuf);
+ mmstate->currentInsertBuf = InvalidBuffer;
+ }
+ }
+
+ /*
+ * During amvacuumcleanup of a MinMax index, we do three main things:
+ *
+ * 1) remove revmap entries which are no longer interesting (heap has been
+ * truncated).
+ *
+ * 2) remove index tuples that are no longer referenced from the revmap.
+ *
+ * 3) summarize ranges that are currently unsummarized.
+ */
+ Datum
+ mmvacuumcleanup(PG_FUNCTION_ARGS)
+ {
+ IndexVacuumInfo *info = (IndexVacuumInfo *) PG_GETARG_POINTER(0);
+ IndexBulkDeleteResult *stats = (IndexBulkDeleteResult *) PG_GETARG_POINTER(1);
+ mmRevmapAccess *rmAccess;
+ BlockNumber *nonsummarized = NULL;
+ int numnonsummarized;
+ Relation heapRel;
+ BlockNumber heapNumBlocks;
+
+ rmAccess = mmRevmapAccessInit(info->index, MINMAX_PAGES_PER_RANGE);
+
+ heapRel = heap_open(IndexGetRelation(RelationGetRelid(info->index), false),
+ AccessShareLock);
+
+ /*
+ * First: truncate the revmap to the range that covers pages actually in
+ * the heap. We must do this while holding the relation extension lock,
+ * or we risk someone else extending the relation in the meantime.
+ */
+ LockRelationForExtension(heapRel, ExclusiveLock);
+ heapNumBlocks = RelationGetNumberOfBlocks(heapRel);
+ mmRevmapTruncate(rmAccess, heapNumBlocks);
+ UnlockRelationForExtension(heapRel, ExclusiveLock);
+
+ /*
+ * Second: scan the index, removing index tuples that are no longer
+ * referenced from the revmap. While at it, collect the page numbers
+ * of ranges that are not summarized.
+ */
+ remove_deletable_tuples(info->index, heapNumBlocks, info->strategy,
+ &nonsummarized, &numnonsummarized);
+
+ /* Finally, summarize the ranges collected above */
+ if (nonsummarized)
+ {
+ rerun_summarization(info->index, heapRel, rmAccess,
+ nonsummarized, numnonsummarized);
+ pfree(nonsummarized);
+ }
+
+ mmRevmapAccessTerminate(rmAccess);
+ heap_close(heapRel, AccessShareLock);
+
+ PG_RETURN_POINTER(stats);
+ }
+
+ Datum
+ mmoptions(PG_FUNCTION_ARGS)
+ {
+ PG_RETURN_INT64(0);
+ }
+
+ /*
+ * Fill the given finfo to enable calls to the operator specified by the given
+ * parameters.
+ */
+ static void
+ get_mm_operator(Oid opfam, Oid idxtypid, Oid keytypid,
+ StrategyNumber strategy, FmgrInfo *finfo)
+ {
+ Oid oprid;
+ HeapTuple oper;
+
+ oprid = get_opfamily_member(opfam, idxtypid, keytypid, strategy);
+ if (!OidIsValid(oprid))
+ elog(ERROR, "missing operator %d(%u,%u) in opfamily %u",
+ strategy, idxtypid, keytypid, opfam);
+
+ oper = SearchSysCache1(OPEROID, oprid);
+ if (!HeapTupleIsValid(oper))
+ elog(ERROR, "cache lookup failed for operator %u", oprid);
+
+ fmgr_info(((Form_pg_operator) GETSTRUCT(oper))->oprcode, finfo);
+ ReleaseSysCache(oper);
+ }
+
+ /*
+ * Invoke the given operator, and return the result as a C boolean.
+ */
+ static inline bool
+ invoke_mm_operator(FmgrInfo *operator, Oid collation, Datum left, Datum right)
+ {
+ Datum result;
+
+ result = FunctionCall2Coll(operator, collation, left, right);
+
+ return DatumGetBool(result);
+ }
+
+ /*
+ * Insert an index tuple into the index relation. The revmap is updated to
+ * mark the range containing the given page as pointing to the inserted entry.
+ *
+ * The buffer, if valid, is checked for free space to insert the new entry;
+ * if there isn't enough, a new buffer is obtained and pinned.
+ *
+ * The buffer is marked dirty.
+ */
+ static void
+ mm_doinsert(Relation idxrel, mmRevmapAccess *rmAccess, Buffer *buffer,
+ BlockNumber heapblkno, MMTuple *tup, Size itemsz)
+ {
+ Page page;
+ BlockNumber blk;
+ OffsetNumber off;
+ bool extended;
+
+ itemsz = MAXALIGN(itemsz);
+
+ extended = mm_getinsertbuffer(idxrel, buffer, itemsz);
+ page = BufferGetPage(*buffer);
+
+ if (PageGetFreeSpace(page) < itemsz)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("index row size %lu exceeds maximum for index \"%s\"",
+ itemsz, RelationGetRelationName(idxrel))));
+
+ off = PageAddItem(page, (Item) tup, itemsz, InvalidOffsetNumber,
+ false, false);
+ blk = BufferGetBlockNumber(*buffer);
+
+ MarkBufferDirty(*buffer);
+
+ START_CRIT_SECTION();
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(idxrel))
+ {
+ xl_minmax_insert xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+ uint8 info = XLOG_MINMAX_INSERT;
+
+
+ xlrec.target.node = idxrel->rd_node;
+ ItemPointerSet(&xlrec.target.tid, blk, off);
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxInsert;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = (char *) tup;
+ rdata[1].len = itemsz;
+ rdata[1].buffer = *buffer;
+ rdata[1].buffer_std = true;
+ rdata[1].next = NULL;
+
+ /*
+ * If this is the first tuple in the page, we can reinit the page
+ * instead of restoring the whole thing. Set flag, and hide buffer
+ * references from XLogInsert.
+ */
+ if (extended)
+ {
+ info |= XLOG_MINMAX_INIT_PAGE;
+ rdata[1].buffer = InvalidBuffer;
+ }
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /*
+ * Note we need to keep the lock on the buffer until after the revmap
+ * has been updated. Otherwise, a concurrent scanner could try to obtain
+ * the index tuple from the revmap before we're done writing it.
+ */
+ mmSetHeapBlockItemptr(rmAccess, heapblkno, blk, off);
+
+ LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+ }
+
+ /*
+ * Return an exclusively-locked buffer obtained by extending the relation.
+ */
+ static Buffer
+ mm_getnewbuffer(Relation irel)
+ {
+ Buffer buffer;
+ bool needLock = !RELATION_IS_LOCAL(irel);
+
+ if (needLock)
+ LockRelationForExtension(irel, ExclusiveLock);
+
+ buffer = ReadBuffer(irel, P_NEW);
+ LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+ if (needLock)
+ UnlockRelationForExtension(irel, ExclusiveLock);
+
+ return buffer;
+ }
+
+ /*
+ * Return a pinned and locked buffer which can be used to insert an index item
+ * of size itemsz.
+ *
+ * The passed buffer argument is tested for free space; if it has enough, it
+ * is locked and returned. Otherwise, that buffer (if valid) is unpinned, and
+ * a new buffer is obtained and returned pinned and locked.
+ *
+ * If there's no existing page with enough free space to accommodate the new
+ * item,
+ * the relation is extended. The function returns true if this happens, false
+ * otherwise.
+ */
+ static bool
+ mm_getinsertbuffer(Relation irel, Buffer *buffer, Size itemsz)
+ {
+ Buffer buf;
+ bool extended = false;
+
+ buf = *buffer;
+
+ if (BufferIsInvalid(buf) ||
+ (PageGetFreeSpace(BufferGetPage(buf)) < itemsz))
+ {
+ Page page;
+
+ /*
+ * By the time we break out of this loop, buf is a locked and pinned
+ * buffer which has enough free space to satisfy the requirement.
+ */
+ for (;;)
+ {
+ BlockNumber blk;
+ int freespace;
+
+ blk = GetPageWithFreeSpace(irel, itemsz);
+ if (blk == InvalidBlockNumber)
+ {
+ /*
+ * There's not enough free space in any existing index page,
+ * according to the FSM: extend the relation to obtain a shiny
+ * new page.
+ */
+ buf = mm_getnewbuffer(irel);
+ page = BufferGetPage(buf);
+ PageInit(page, BLCKSZ, 0);
+
+ /*
+ * If an entirely new page does not contain enough free space
+ * for the new item, then surely that item is oversized.
+ * Complain loudly.
+ */
+ freespace = PageGetFreeSpace(page);
+ if (freespace < itemsz)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("index row size %lu exceeds maximum %lu for index \"%s\"",
+ (unsigned long) itemsz,
+ (unsigned long) freespace,
+ RelationGetRelationName(irel))));
+ extended = true;
+ break;
+ }
+
+ buf = ReadBuffer(irel, blk);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ page = BufferGetPage(buf);
+ freespace = PageGetFreeSpace(page);
+ if (freespace >= itemsz)
+ break;
+
+ /* Not enough space: register reality and start over */
+ /* XXX register and unlock, or unlock and register?? */
+ RecordPageWithFreeSpace(irel, blk, freespace);
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ }
+
+ if (!BufferIsInvalid(*buffer))
+ ReleaseBuffer(*buffer);
+
+ *buffer = buf;
+ }
+ else
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ return extended;
+ }
*** /dev/null
--- b/src/backend/access/minmax/mmrevmap.c
***************
*** 0 ****
--- 1,375 ----
+ /*
+ * mmrevmap.c
+ * Reverse range map for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmrevmap.c
+ */
+ #include "postgres.h"
+
+ #include "access/minmax.h"
+ #include "access/minmax_internal.h"
+ #include "access/minmax_revmap.h"
+ #include "access/minmax_xlog.h"
+ #include "access/rmgr.h"
+ #include "miscadmin.h"
+ #include "storage/bufmgr.h"
+ #include "storage/lmgr.h"
+ #include "storage/relfilenode.h"
+ #include "storage/smgr.h"
+
+
+ #define MAPSIZE (BLCKSZ - MAXALIGN(SizeOfPageHeaderData))
+ #define IDXITEMS_PER_PAGE (MAPSIZE / SizeOfIptrData)
+
+ #define HEAPBLK_TO_REVMAP_BLK(pagesPerRange, heapBlk) \
+ ((heapBlk / pagesPerRange) / IDXITEMS_PER_PAGE)
+
+ #define HEAPBLK_TO_REVMAP_INDEX(pagesPerRange, heapBlk) \
+ ((heapBlk / pagesPerRange) % IDXITEMS_PER_PAGE)
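+
+ /*
+ * Worked example (illustration only; assumes the default 8K BLCKSZ and
+ * pagesPerRange = 128): MAPSIZE is 8168 bytes and item pointers are 6 bytes,
+ * so IDXITEMS_PER_PAGE is 1361. Heap block 500000 then belongs to range
+ * 500000 / 128 = 3906, whose entry lives on revmap page 3906 / 1361 = 2,
+ * at index 3906 % 1361 = 1184 within that page.
+ */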
+
+ static bool mmRevmapExtend(mmRevmapAccess *rmAccess, BlockNumber blkno);
+
+ /* typedef appears in minmax_revmap.h */
+ struct mmRevmapAccess
+ {
+ Relation idxrel;
+ BlockNumber pagesPerRange;
+ Buffer currBuf;
+ BlockNumber physPagesInRevmap;
+ };
+
+
+ /*
+ * Initialize an access object for a reverse range map, which can be used to
+ * read stuff from it. This must be freed by mmRevmapAccessTerminate when caller
+ * is done with it.
+ */
+ mmRevmapAccess *
+ mmRevmapAccessInit(Relation idxrel, BlockNumber pagesPerRange)
+ {
+ mmRevmapAccess *rmAccess = palloc(sizeof(mmRevmapAccess));
+
+ RelationOpenSmgr(idxrel);
+
+ rmAccess->idxrel = idxrel;
+ rmAccess->pagesPerRange = pagesPerRange;
+ rmAccess->currBuf = InvalidBuffer;
+ rmAccess->physPagesInRevmap =
+ smgrnblocks(idxrel->rd_smgr, MM_REVMAP_FORKNUM);
+
+ return rmAccess;
+ }
+
+ /*
+ * Release resources associated with a revmap access object.
+ */
+ void
+ mmRevmapAccessTerminate(mmRevmapAccess *rmAccess)
+ {
+ if (rmAccess->currBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currBuf);
+ pfree(rmAccess);
+ }
+
+ /*
+ * In the given revmap page, which belongs to a minmax index using
+ * pagesPerRange pages per range, set the element corresponding to heap block
+ * number heapBlk to the value (blkno, offno).
+ *
+ * Caller must have obtained the correct page.
+ *
+ * This is used both in regular operation and during WAL replay.
+ */
+ void
+ rm_page_set_iptr(Page page, int pagesPerRange, BlockNumber heapBlk,
+ BlockNumber blkno, OffsetNumber offno)
+ {
+ ItemPointerData *iptr;
+
+ iptr = (ItemPointerData *) PageGetContents(page);
+ iptr += HEAPBLK_TO_REVMAP_INDEX(pagesPerRange, heapBlk);
+
+ ItemPointerSet(iptr, blkno, offno);
+ }
+
+ /*
+ * Set the TID of the index entry corresponding to the range that includes
+ * the given heap page to the given item pointer.
+ *
+ * The map is extended, if necessary.
+ */
+ void
+ mmSetHeapBlockItemptr(mmRevmapAccess *rmAccess, BlockNumber heapBlk,
+ BlockNumber blkno, OffsetNumber offno)
+ {
+ BlockNumber mapBlk;
+ bool extend = false;
+
+ mapBlk = HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk);
+
+ /*
+ * If the revmap is out of space, extend it first.
+ */
+ if (mapBlk >= rmAccess->physPagesInRevmap)
+ extend = mmRevmapExtend(rmAccess, mapBlk);
+
+ /*
+ * Obtain the revmap buffer we need to update. If we already have the
+ * correct buffer in our access struct, use that; otherwise, release the
+ * one we have (if valid) and read the one we need.
+ */
+ if (rmAccess->currBuf == InvalidBuffer ||
+ mapBlk != BufferGetBlockNumber(rmAccess->currBuf))
+ {
+ if (rmAccess->currBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currBuf);
+ rmAccess->currBuf = ReadBufferExtended(rmAccess->idxrel,
+ MM_REVMAP_FORKNUM, mapBlk,
+ RBM_NORMAL, NULL);
+ }
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_EXCLUSIVE);
+ START_CRIT_SECTION();
+
+ rm_page_set_iptr(BufferGetPage(rmAccess->currBuf),
+ rmAccess->pagesPerRange,
+ heapBlk,
+ blkno, offno);
+
+ MarkBufferDirty(rmAccess->currBuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rmAccess->idxrel))
+ {
+ xl_minmax_rm_set xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+ uint8 info;
+
+ info = XLOG_MINMAX_REVMAP_SET;
+
+ xlrec.node = rmAccess->idxrel->rd_node;
+ xlrec.mapBlock = mapBlk;
+ xlrec.pagesPerRange = rmAccess->pagesPerRange;
+ xlrec.heapBlock = heapBlk;
+ ItemPointerSet(&(xlrec.newval), blkno, offno);
+
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxRevmapSet;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].buffer_std = false;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = NULL;
+ rdata[1].len = 0;
+ rdata[1].buffer = rmAccess->currBuf;
+ rdata[1].buffer_std = false;
+ rdata[1].next = NULL;
+
+ if (extend)
+ {
+ info |= XLOG_MINMAX_INIT_PAGE;
+ /* If the page is new, there's no need for a full page image */
+ rdata[0].next = NULL;
+ }
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+
+ PageSetLSN(BufferGetPage(rmAccess->currBuf), recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_UNLOCK);
+ }
+
+
+ /*
+ * Return the TID of the index entry corresponding to the range that includes
+ * the given heap page. If the TID is valid, the tuple is locked with LockTuple.
+ * It is the caller's responsibility to release that lock.
+ */
+ void
+ mmGetHeapBlockItemptr(mmRevmapAccess *rmAccess, BlockNumber heapBlk,
+ ItemPointerData *out)
+ {
+ BlockNumber mapBlk;
+ ItemPointerData *iptr;
+
+ mapBlk = HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk);
+
+ /*
+ * If we are asked for a block of the map which is beyond what we know
+ * about it, try to see if our fork has grown since we last checked its
+ * size; a concurrent inserter could have extended it.
+ */
+ if (mapBlk >= rmAccess->physPagesInRevmap)
+ {
+ RelationOpenSmgr(rmAccess->idxrel);
+ LockRelationForExtension(rmAccess->idxrel, ShareLock);
+ rmAccess->physPagesInRevmap =
+ smgrnblocks(rmAccess->idxrel->rd_smgr, MM_REVMAP_FORKNUM);
+
+ if (mapBlk >= rmAccess->physPagesInRevmap)
+ {
+ /* definitely not in range */
+
+ UnlockRelationForExtension(rmAccess->idxrel, ShareLock);
+ ItemPointerSetInvalid(out);
+ return;
+ }
+
+ /* the block exists now, proceed */
+ UnlockRelationForExtension(rmAccess->idxrel, ShareLock);
+ }
+
+ if (rmAccess->currBuf == InvalidBuffer ||
+ BufferGetBlockNumber(rmAccess->currBuf) != mapBlk)
+ {
+ if (rmAccess->currBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currBuf);
+
+ rmAccess->currBuf =
+ ReadBufferExtended(rmAccess->idxrel, MM_REVMAP_FORKNUM, mapBlk,
+ RBM_NORMAL, NULL);
+ }
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_SHARE);
+
+ iptr = (ItemPointerData *)
+ PageGetContents(BufferGetPage(rmAccess->currBuf));
+ iptr += HEAPBLK_TO_REVMAP_INDEX(rmAccess->pagesPerRange, heapBlk);
+
+ ItemPointerCopy(iptr, out);
+
+ if (ItemPointerIsValid(iptr))
+ LockTuple(rmAccess->idxrel, iptr, ShareLock);
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_UNLOCK);
+ }
+
+ /*
+ * Create a single-page reverse range map fork for a new minmax index
+ *
+ * NB -- caller is assumed to WAL-log this operation
+ */
+ void
+ mmRevmapCreate(Relation idxrel)
+ {
+ bool needLock;
+ Buffer buf;
+ Page page;
+
+ needLock = !RELATION_IS_LOCAL(idxrel);
+
+ /*
+ * XXX it's unclear that we need this lock, considering that the relation
+ * is likely being created ...
+ */
+ if (needLock)
+ LockRelationForExtension(idxrel, ExclusiveLock);
+
+ START_CRIT_SECTION();
+ RelationOpenSmgr(idxrel);
+ smgrcreate(idxrel->rd_smgr, MM_REVMAP_FORKNUM, false);
+ buf = ReadBufferExtended(idxrel, MM_REVMAP_FORKNUM, P_NEW, RBM_NORMAL,
+ NULL);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ page = BufferGetPage(buf);
+ PageInit(page, BLCKSZ, 0);
+ MarkBufferDirty(buf);
+
+ UnlockReleaseBuffer(buf);
+ END_CRIT_SECTION();
+
+ if (needLock)
+ UnlockRelationForExtension(idxrel, ExclusiveLock);
+ }
+
+ /*
+ * Extend the reverse range map to cover the given block number. Return false
+ * if the map already covered the requested range (no extension actually done),
+ * true otherwise.
+ *
+ * NB -- caller is responsible for ensuring this action is properly WAL-logged.
+ */
+ static bool
+ mmRevmapExtend(mmRevmapAccess *rmAccess, BlockNumber blkno)
+ {
+ char page[BLCKSZ];
+ bool extended = false;
+
+ MemSet(page, 0, sizeof(page));
+ PageInit(page, BLCKSZ, 0);
+
+ LockRelationForExtension(rmAccess->idxrel, ExclusiveLock);
+
+ /*
+ * First, refresh our idea of the current size; it might well have grown
+ * to cover what we need since we last checked.
+ */
+ RelationOpenSmgr(rmAccess->idxrel);
+ rmAccess->physPagesInRevmap =
+ smgrnblocks(rmAccess->idxrel->rd_smgr, MM_REVMAP_FORKNUM);
+
+ /*
+ * Now extend it one page at a time. This might seem a bit inefficient,
+ * but normally we'd be extending for a single page anyway.
+ */
+ while (blkno >= rmAccess->physPagesInRevmap)
+ {
+ extended = true;
+ PageSetChecksumInplace(page, blkno);
+ smgrextend(rmAccess->idxrel->rd_smgr, MM_REVMAP_FORKNUM,
+ rmAccess->physPagesInRevmap, page, false);
+ rmAccess->physPagesInRevmap++;
+ }
+
+ Assert(rmAccess->physPagesInRevmap ==
+ smgrnblocks(rmAccess->idxrel->rd_smgr, MM_REVMAP_FORKNUM));
+
+ UnlockRelationForExtension(rmAccess->idxrel, ExclusiveLock);
+
+ return extended;
+ }
+
+ /*
+ * Truncate a revmap to the size needed for a table of the given number of
+ * blocks. This includes removing pages beyond the last one needed, and also
+ * zeroing out the excess entries in the last page.
+ *
+ * The caller should hold a lock to prevent the table from growing in
+ * the meantime.
+ */
+ void
+ mmRevmapTruncate(mmRevmapAccess *rmAccess, BlockNumber heapNumBlocks)
+ {
+ BlockNumber rmBlks;
+ char *data;
+ Page page;
+ Buffer buffer;
+
+ /* Remove blocks at the end */
+ rmBlks = HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapNumBlocks);
+
+ RelationOpenSmgr(rmAccess->idxrel);
+ smgrtruncate(rmAccess->idxrel->rd_smgr, MM_REVMAP_FORKNUM, rmBlks + 1);
+
+ /* zero out the remaining items in the last page */
+ buffer = ReadBufferExtended(rmAccess->idxrel,
+ MM_REVMAP_FORKNUM, rmBlks,
+ RBM_NORMAL, NULL);
+ LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+ page = PageGetContents(BufferGetPage(buffer));
+ data = page + sizeof(ItemPointerData) *
+ HEAPBLK_TO_REVMAP_INDEX(rmAccess->pagesPerRange, heapNumBlocks + 1);
+
+ memset(data, 0, page + MAPSIZE - data);
+
+ UnlockReleaseBuffer(buffer);
+ }
*** /dev/null
--- b/src/backend/access/minmax/mmtuple.c
***************
*** 0 ****
--- 1,388 ----
+ /*
+ * MinMax-specific tuples
+ * Method implementations for tuples in minmax indexes.
+ *
+ * The intended interface is that code outside this file only deals with
+ * DeformedMMTuples, and convert to and from the on-disk representation by
+ * using functions in this file.
+ *
+ * NOTES
+ *
+ * A minmax tuple is similar to a heap tuple, with a few key differences. The
+ * first interesting difference is that the tuple header is much simpler, only
+ * containing its total length and a small area for flags. Also, the stored
+ * data does not match the tuple descriptor exactly: for each attribute in the
+ * descriptor, the index tuple carries two values, one for the minimum value in
+ * that column and one for the maximum.
+ *
+ * Also, for each column there are two null bits: one (hasnulls) stores whether
+ * any tuple within the page range has that column set to null; the other
+ * (allnulls) stores whether the column values are all null. If allnulls is
+ * true, then the tuple data area does not contain min/max values for that
+ * column at all; if it is false, the values are present (hasnulls merely
+ * records whether any row in the range has a null in that column). Note we always store
+ * a double-length null bitmask; for typical indexes of four columns or less,
+ * they take a single byte anyway. It doesn't seem worth trying to optimize
+ * this further.
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmtuple.c
+ */
+ #include "postgres.h"
+
+ #include "access/htup_details.h"
+ #include "access/minmax_tuple.h"
+ #include "access/tupdesc.h"
+ #include "access/tupmacs.h"
+
+
+ static inline void mm_deconstruct_tuple(char *tp, bits8 *nullbits, bool nulls,
+ int natts, Form_pg_attribute *att,
+ Datum *values, bool *allnulls, bool *hasnulls);
+
+
+ /*
+ * Generate an internal-style tuple descriptor to pass to minmax_form_tuple.
+ * These have no use outside this module.
+ *
+ * The argument is the minmax index's regular tuple descriptor.
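+ *
+ * For example (illustration only), a two-column index on (int4, numeric)
+ * yields a four-column disk descriptor: (int4 min, int4 max, numeric min,
+ * numeric max).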
+ */
+ TupleDesc
+ minmax_get_descr(TupleDesc tupdesc)
+ {
+ TupleDesc diskDesc;
+ int i,
+ j;
+
+ diskDesc = CreateTemplateTupleDesc(tupdesc->natts * 2, false);
+
+ for (i = 0, j = 1; i < tupdesc->natts; i++)
+ {
+ /* min */
+ TupleDescInitEntry(diskDesc,
+ j++,
+ NULL,
+ tupdesc->attrs[i]->atttypid,
+ tupdesc->attrs[i]->atttypmod,
+ 0);
+ /* max */
+ TupleDescInitEntry(diskDesc,
+ j++,
+ NULL,
+ tupdesc->attrs[i]->atttypid,
+ tupdesc->attrs[i]->atttypmod,
+ 0);
+ }
+
+ return diskDesc;
+ }
+
+ /*
+ * Generate a new on-disk tuple to be inserted in a minmax index.
+ *
+ * The first tuple descriptor passed corresponds to the catalogued index info,
+ * that is, it is the index's descriptor; the second descriptor must be
+ * obtained by calling minmax_get_descr() on that descriptor.
+ *
+ * (The reason for this slightly grotty arrangement is that we use heap tuple
+ * functions to implement packing of a tuple into the on-disk format.)
+ */
+ MMTuple *
+ minmax_form_tuple(TupleDesc idxDsc, TupleDesc diskDsc, DeformedMMTuple *tuple,
+ Size *size)
+ {
+ Datum *values;
+ bool *nulls;
+ bool anynulls = false;
+ MMTuple *rettuple;
+ int keyno;
+ uint16 phony_infomask;
+ bits8 *phony_nullbitmap;
+ Size len,
+ hoff,
+ data_len;
+
+ Assert(diskDsc->natts > 0);
+
+ values = palloc(sizeof(Datum) * diskDsc->natts);
+ nulls = palloc0(sizeof(bool) * diskDsc->natts);
+ phony_nullbitmap = palloc(sizeof(bits8) * BITMAPLEN(diskDsc->natts));
+
+ /*
+ * Set up the values/nulls arrays for heap_fill_tuple
+ */
+ for (keyno = 0; keyno < idxDsc->natts; keyno++)
+ {
+ int idxattno = keyno * 2;
+
+ /*
+ * "allnulls" is set when there's no nonnull value in any row in
+ * the column; set the nullable bits for both min and max attrs.
+ */
+ if (tuple->values[keyno].allnulls)
+ {
+ nulls[idxattno] = true;
+ nulls[idxattno + 1] = true;
+ anynulls = true;
+ continue;
+ }
+
+ if (tuple->values[keyno].hasnulls)
+ anynulls = true;
+
+ values[idxattno] = tuple->values[keyno].min;
+ values[idxattno + 1] = tuple->values[keyno].max;
+ }
+
+ /* compute total space needed */
+ len = SizeOfMinMaxTuple;
+ if (anynulls)
+ {
+ /*
+ * We need a double-length bitmap on an on-disk minmax index tuple;
+ * the first half stores the "allnulls" bits, the second stores
+ * "hasnulls".
+ */
+ len += BITMAPLEN(idxDsc->natts * 2);
+ }
+
+ /*
+ * TODO: we can probably do away with alignment here, and save some
+ * precious disk space. When there's no bitmap we can save 6 bytes. Maybe
+ * we can use the first col's type alignment instead of maxalign.
+ */
+ len = hoff = MAXALIGN(len);
+
+ data_len = heap_compute_data_size(diskDsc, values, nulls);
+
+ len += data_len;
+
+ rettuple = palloc0(len);
+ rettuple->mt_info = hoff;
+ Assert((rettuple->mt_info & MMIDX_OFFSET_MASK) == hoff);
+
+ /*
+ * The infomask and null bitmap as computed by heap_fill_tuple are useless
+ * to us. However, that function will not accept a null infomask; and we
+ * need to pass a valid null bitmap so that it will correctly skip
+ * outputting null attributes in the data area.
+ */
+ heap_fill_tuple(diskDsc,
+ values,
+ nulls,
+ (char *) rettuple + hoff,
+ data_len,
+ &phony_infomask,
+ phony_nullbitmap);
+
+ /* done with these */
+ pfree(values);
+ pfree(nulls);
+ pfree(phony_nullbitmap);
+
+ /*
+ * Now fill in the real null bitmasks. allnulls first.
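+ *
+ * Note that, following the att_isnull() convention used for heap tuples, a
+ * *set* bit means the corresponding flag is false: the bit is set for
+ * columns that are not all-nulls (first half) or have no nulls (second
+ * half), and left clear otherwise.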
+ */
+ if (anynulls)
+ {
+ bits8 *bitP;
+ int bitmask;
+
+ rettuple->mt_info |= MMIDX_NULLS_MASK;
+
+ bitP = ((bits8 *) ((char *) rettuple + SizeOfMinMaxTuple)) - 1;
+ bitmask = HIGHBIT;
+ for (keyno = 0; keyno < idxDsc->natts; keyno++)
+ {
+ if (bitmask != HIGHBIT)
+ bitmask <<= 1;
+ else
+ {
+ bitP += 1;
+ *bitP = 0x0;
+ bitmask = 1;
+ }
+
+ if (tuple->values[keyno].allnulls)
+ continue;
+
+ *bitP |= bitmask;
+ }
+ /* hasnulls bits follow */
+ for (keyno = 0; keyno < idxDsc->natts; keyno++)
+ {
+ if (bitmask != HIGHBIT)
+ bitmask <<= 1;
+ else
+ {
+ bitP += 1;
+ *bitP = 0x0;
+ bitmask = 1;
+ }
+
+ if (tuple->values[keyno].hasnulls)
+ continue;
+
+ *bitP |= bitmask;
+ }
+ }
+
+ *size = len;
+ return rettuple;
+ }
+
+ /*
+ * Free a tuple created by minmax_form_tuple
+ */
+ void
+ minmax_free_tuple(MMTuple *tuple)
+ {
+ pfree(tuple);
+ }
+
+ /*
+ * Convert a MMTuple back to a DeformedMMTuple. This is the reverse of
+ * minmax_form_tuple.
+ *
+ * Note we don't need the "on disk tupdesc" here; we rely on our own routine to
+ * deconstruct the tuple from the on-disk format.
+ *
+ * XXX some callers might need copies of each datum; if so we need
+ * to apply datumCopy inside the loop. We probably also need a
+ * minmax_free_dtuple() function.
+ */
+ DeformedMMTuple *
+ minmax_deform_tuple(TupleDesc tupdesc, MMTuple *tuple)
+ {
+ DeformedMMTuple *dtup;
+ Datum *values;
+ bool *allnulls;
+ bool *hasnulls;
+ char *tp;
+ bits8 *nullbits = NULL;
+ int keyno;
+
+ dtup = palloc(offsetof(DeformedMMTuple, values) +
+ sizeof(MMValues) * tupdesc->natts);
+
+ values = palloc(sizeof(Datum) * tupdesc->natts * 2);
+ allnulls = palloc(sizeof(bool) * tupdesc->natts);
+ hasnulls = palloc(sizeof(bool) * tupdesc->natts);
+
+ tp = (char *) tuple + MMTupleDataOffset(tuple);
+
+ if (MMTupleHasNulls(tuple))
+ nullbits = (bits8 *) ((char *) tuple + SizeOfMinMaxTuple);
+ mm_deconstruct_tuple(tp, nullbits,
+ MMTupleHasNulls(tuple),
+ tupdesc->natts, tupdesc->attrs, values,
+ allnulls, hasnulls);
+
+ for (keyno = 0; keyno < tupdesc->natts; keyno++)
+ {
+ if (allnulls[keyno])
+ {
+ dtup->values[keyno].allnulls = true;
+ continue;
+ }
+
+ /* XXX optional datumCopy() */
+ dtup->values[keyno].min = values[keyno * 2];
+ dtup->values[keyno].max = values[keyno * 2 + 1];
+ dtup->values[keyno].hasnulls = hasnulls[keyno];
+ dtup->values[keyno].allnulls = false;
+ }
+
+ pfree(values);
+ pfree(allnulls);
+ pfree(hasnulls);
+
+ return dtup;
+ }
+
+ /*
+ * mm_deconstruct_tuple
+ * Guts of attribute extraction from an on-disk minmax tuple.
+ *
+ * Its arguments are:
+ * tp pointer to the tuple data area
+ * nullbits pointer to the tuple nulls bitmask
+ * nulls "has nulls" bit in tuple infomask
+ * natts number of array members in att
+ * att the tuple's TupleDesc Form_pg_attribute array
+ * values output values, size 2 * natts (alternates min and max)
+ * allnulls output "allnulls", size natts
+ * hasnulls output "hasnulls", size natts
+ *
+ * Output arrays are allocated by caller.
+ */
+ static inline void
+ mm_deconstruct_tuple(char *tp, bits8 *nullbits, bool nulls,
+ int natts, Form_pg_attribute *att,
+ Datum *values, bool *allnulls, bool *hasnulls)
+ {
+ int attnum;
+ long off = 0;
+
+ /*
+ * First iterate to natts to obtain both null flags for each attribute.
+ */
+ for (attnum = 0; attnum < natts; attnum++)
+ {
+ /*
+ * the "all nulls" bit means that all values in the page range for
+ * this column are nulls. Therefore there are no values in the tuple
+ * data area.
+ */
+ if (nulls && att_isnull(attnum, nullbits))
+ {
+ values[attnum] = (Datum) 0;
+ allnulls[attnum] = true;
+ hasnulls[attnum] = true; /* XXX ? */
+ continue;
+ }
+
+ allnulls[attnum] = false;
+
+ /*
+ * the "has nulls" bit means that some tuples have nulls, but others
+ * have not-null values. So the tuple data area does contain values for this
+ * column.
+ *
+ * The hasnulls bits follow the allnulls bits in the same bitmask.
+ */
+ hasnulls[attnum] = nulls && att_isnull(natts + attnum, nullbits);
+ }
+
+ /*
+ * Then we iterate to natts * 2 to obtain each attribute's min and max
+ * values. Note that since we reuse attribute entries (first for the
+ * minimum value of the corresponding column, then for max), we cannot
+ * cache offsets here.
+ */
+ for (attnum = 0; attnum < natts * 2; attnum++)
+ {
+ int true_attnum = attnum / 2;
+ Form_pg_attribute thisatt = att[true_attnum];
+
+ if (allnulls[true_attnum])
+ continue;
+
+ if (thisatt->attlen == -1)
+ {
+ off = att_align_pointer(off, thisatt->attalign, -1,
+ tp + off);
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ off = att_align_nominal(off, thisatt->attalign);
+ }
+
+ values[attnum] = fetchatt(thisatt, tp + off);
+
+ off = att_addlength_pointer(off, thisatt->attlen, tp + off);
+ }
+ }
*** /dev/null
--- b/src/backend/access/minmax/mmxlog.c
***************
*** 0 ****
--- 1,212 ----
+ /*
+ * mmxlog.c
+ * XLog replay routines for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmxlog.c
+ */
+ #include "postgres.h"
+
+ #include "access/minmax.h"
+ #include "access/minmax_internal.h"
+ #include "access/minmax_tuple.h"
+ #include "access/minmax_xlog.h"
+ #include "access/xlogutils.h"
+ #include "storage/freespace.h"
+
+
+ /*
+ * xlog replay routines
+ */
+ static void
+ minmax_xlog_createidx(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_createidx *xlrec = (xl_minmax_createidx *) XLogRecGetData(record);
+ Buffer buf;
+ Page page;
+
+ /* Backup blocks are not used in create_index records */
+ Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+
+ /* create the index's metapage */
+ buf = XLogReadBuffer(xlrec->node, MINMAX_METAPAGE_BLKNO, true);
+ Assert(BufferIsValid(buf));
+ page = (Page) BufferGetPage(buf);
+ mm_init_metapage(buf);
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ UnlockReleaseBuffer(buf);
+
+ /* also initialize its revmap fork */
+ buf = XLogReadBufferExtended(xlrec->node, MM_REVMAP_FORKNUM, 0, RBM_ZERO);
+ Assert(BufferIsValid(buf));
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ page = (Page) BufferGetPage(buf);
+ PageInit(page, BLCKSZ, 0);
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ UnlockReleaseBuffer(buf);
+ }
+
+ static void
+ minmax_xlog_insert(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_insert *xlrec = (xl_minmax_insert *) XLogRecGetData(record);
+ BlockNumber blkno;
+ Buffer buffer;
+ Page page;
+ OffsetNumber offnum;
+ int tuplen;
+ MMTuple *mmtuple;
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ blkno = ItemPointerGetBlockNumber(&(xlrec->target.tid));
+ if (record->xl_info & XLOG_MINMAX_INIT_PAGE)
+ {
+ buffer = XLogReadBuffer(xlrec->target.node, blkno, true);
+ Assert(BufferIsValid(buffer));
+ page = (Page) BufferGetPage(buffer);
+
+ PageInit(page, BufferGetPageSize(buffer), 0); /* XXX size correct?? */
+ }
+ else
+ {
+ buffer = XLogReadBuffer(xlrec->target.node, blkno, false);
+ if (!BufferIsValid(buffer))
+ return;
+ page = (Page) BufferGetPage(buffer);
+
+ if (lsn <= PageGetLSN(page)) /* changes are applied */
+ {
+ UnlockReleaseBuffer(buffer);
+ return;
+ }
+ }
+ offnum = ItemPointerGetOffsetNumber(&(xlrec->target.tid));
+ if (PageGetMaxOffsetNumber(page) + 1 < offnum)
+ elog(PANIC, "minmax_xlog_insert: invalid max offset number");
+
+ tuplen = record->xl_len - SizeOfMinmaxInsert;
+ mmtuple = (MMTuple *) ((char *) xlrec + SizeOfMinmaxInsert);
+
+ offnum = PageAddItem(page, (Item) mmtuple, tuplen, offnum, true, false);
+ if (offnum == InvalidOffsetNumber)
+ elog(PANIC, "minmax_xlog_insert: failed to add tuple");
+
+ PageSetLSN(page, lsn);
+
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+
+ /* XXX no FSM updates here ... */
+ }
+
+ static void
+ minmax_xlog_bulkremove(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_bulkremove *xlrec = (xl_minmax_bulkremove *) XLogRecGetData(record);
+ Buffer buffer;
+ Page page;
+ OffsetNumber *offnos;
+ int noffs;
+ Size freespace;
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ buffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
+ if (!BufferIsValid(buffer))
+ return;
+ page = (Page) BufferGetPage(buffer);
+
+ if (lsn <= PageGetLSN(page)) /* changes are applied */
+ {
+ UnlockReleaseBuffer(buffer);
+ return;
+ }
+
+ offnos = (OffsetNumber *) ((char *) xlrec + SizeOfMinmaxBulkRemove);
+ noffs = (record->xl_len - SizeOfMinmaxBulkRemove) / sizeof(OffsetNumber);
+
+ PageIndexDeleteNoCompact(page, offnos, noffs);
+ freespace = PageGetFreeSpace(page);
+
+ PageSetLSN(page, lsn);
+
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+
+ /* update FSM as well */
+ XLogRecordPageWithFreeSpace(xlrec->node, xlrec->block, freespace);
+ }
+
+ static void
+ minmax_xlog_revmap_set(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_rm_set *xlrec = (xl_minmax_rm_set *) XLogRecGetData(record);
+ bool init;
+ Buffer buffer;
+ Page page;
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ init = (record->xl_info & XLOG_MINMAX_INIT_PAGE) != 0;
+ buffer = XLogReadBufferExtended(xlrec->node,
+ MM_REVMAP_FORKNUM, xlrec->mapBlock,
+ init ? RBM_ZERO : RBM_NORMAL);
+ Assert(BufferIsValid(buffer));
+ LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+ page = BufferGetPage(buffer);
+ if (init)
+ PageInit(page, BufferGetPageSize(buffer), 0);
+
+ rm_page_set_iptr(page, xlrec->pagesPerRange, xlrec->heapBlock,
+ ItemPointerGetBlockNumber(&(xlrec->newval)),
+ ItemPointerGetOffsetNumber(&(xlrec->newval)));
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+ }
+
+ void
+ minmax_redo(XLogRecPtr lsn, XLogRecord *record)
+ {
+ uint8 info = record->xl_info & ~XLR_INFO_MASK;
+
+ switch (info & XLOG_MINMAX_OPMASK)
+ {
+ case XLOG_MINMAX_CREATE_INDEX:
+ minmax_xlog_createidx(lsn, record);
+ break;
+ case XLOG_MINMAX_INSERT:
+ minmax_xlog_insert(lsn, record);
+ break;
+ case XLOG_MINMAX_BULKREMOVE:
+ minmax_xlog_bulkremove(lsn, record);
+ break;
+ case XLOG_MINMAX_REVMAP_SET:
+ minmax_xlog_revmap_set(lsn, record);
+ break;
+ default:
+ elog(PANIC, "minmax_redo: unknown op code %u", info);
+ }
+ }
*** a/src/backend/access/rmgrdesc/Makefile
--- b/src/backend/access/rmgrdesc/Makefile
***************
*** 9,15 **** top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = clogdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
! mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o smgrdesc.o spgdesc.o \
standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
include $(top_srcdir)/src/backend/common.mk
--- 9,16 ----
include $(top_builddir)/src/Makefile.global
OBJS = clogdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
! minmaxdesc.o mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o \
! smgrdesc.o spgdesc.o \
standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
include $(top_srcdir)/src/backend/common.mk
*** /dev/null
--- b/src/backend/access/rmgrdesc/minmaxdesc.c
***************
*** 0 ****
--- 1,74 ----
+ /*-------------------------------------------------------------------------
+ *
+ * minmaxdesc.c
+ * rmgr descriptor routines for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/rmgrdesc/minmaxdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+ #include "postgres.h"
+
+ #include "access/minmax_xlog.h"
+
+ static void
+ out_target(StringInfo buf, xl_minmax_tid *target)
+ {
+ appendStringInfo(buf, "rel %u/%u/%u; tid %u/%u",
+ target->node.spcNode, target->node.dbNode, target->node.relNode,
+ ItemPointerGetBlockNumber(&(target->tid)),
+ ItemPointerGetOffsetNumber(&(target->tid)));
+ }
+
+ void
+ minmax_desc(StringInfo buf, uint8 xl_info, char *rec)
+ {
+ uint8 info = xl_info & ~XLR_INFO_MASK;
+
+ info &= XLOG_MINMAX_OPMASK;
+ if (info == XLOG_MINMAX_CREATE_INDEX)
+ {
+ xl_minmax_createidx *xlrec = (xl_minmax_createidx *) rec;
+
+ appendStringInfo(buf, "create index: %u/%u/%u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode);
+ }
+ else if (info == XLOG_MINMAX_INSERT)
+ {
+ xl_minmax_insert *xlrec = (xl_minmax_insert *) rec;
+
+ if (xl_info & XLOG_MINMAX_INIT_PAGE)
+ appendStringInfo(buf, "insert(init): ");
+ else
+ appendStringInfo(buf, "insert: ");
+ out_target(buf, &(xlrec->target));
+ }
+ else if (info == XLOG_MINMAX_BULKREMOVE)
+ {
+ xl_minmax_bulkremove *xlrec = (xl_minmax_bulkremove *) rec;
+
+ appendStringInfo(buf, "bulkremove: rel %u/%u/%u blk %u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode, xlrec->block);
+ }
+ else if (info == XLOG_MINMAX_REVMAP_SET)
+ {
+ xl_minmax_rm_set *xlrec = (xl_minmax_rm_set *) rec;
+
+ appendStringInfo(buf, "revmap set: rel %u/%u/%u mapblk %u pagesPerRange %u item %u value %u/%u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode, xlrec->mapBlock,
+ xlrec->pagesPerRange, xlrec->heapBlock,
+ ItemPointerGetBlockNumber(&(xlrec->newval)),
+ ItemPointerGetOffsetNumber(&(xlrec->newval)));
+ }
+ else
+ appendStringInfo(buf, "UNKNOWN");
+ }
+
*** a/src/backend/access/transam/rmgr.c
--- b/src/backend/access/transam/rmgr.c
***************
*** 12,17 ****
--- 12,18 ----
#include "access/gist_private.h"
#include "access/hash.h"
#include "access/heapam_xlog.h"
+ #include "access/minmax_xlog.h"
#include "access/multixact.h"
#include "access/nbtree.h"
#include "access/spgist.h"
*** a/src/backend/catalog/index.c
--- b/src/backend/catalog/index.c
***************
*** 2097,2102 **** IndexBuildHeapScan(Relation heapRelation,
--- 2097,2123 ----
IndexBuildCallback callback,
void *callback_state)
{
+ return IndexBuildHeapRangeScan(heapRelation, indexRelation,
+ indexInfo, allow_sync,
+ 0, InvalidBlockNumber,
+ callback, callback_state);
+ }
+
+ /*
+ * As above, except that instead of scanning the complete heap, only the given
+ * range is scanned. Scan to end-of-rel can be signalled by passing
+ * InvalidBlockNumber as end block number.
+ */
+ double
+ IndexBuildHeapRangeScan(Relation heapRelation,
+ Relation indexRelation,
+ IndexInfo *indexInfo,
+ bool allow_sync,
+ BlockNumber start_blockno,
+ BlockNumber numblocks,
+ IndexBuildCallback callback,
+ void *callback_state)
+ {
bool is_system_catalog;
bool checking_uniqueness;
HeapScanDesc scan;
***************
*** 2167,2172 **** IndexBuildHeapScan(Relation heapRelation,
--- 2188,2196 ----
true, /* buffer access strategy OK */
allow_sync); /* syncscan OK? */
+ /* set our endpoints */
+ heap_setscanlimits(scan, start_blockno, numblocks);
+
reltuples = 0;
/*
*** a/src/backend/storage/page/bufpage.c
--- b/src/backend/storage/page/bufpage.c
***************
*** 899,904 **** PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
--- 899,1074 ----
pfree(itemidbase);
}
+ /*
+ * PageIndexDeleteNoCompact
+ * Delete the given items from an index page, and defragment the resulting
+ * free space, but do not compact the item pointers array.
+ *
+ * Unused items at the end of the array are removed.
+ *
+ * This is used for index AMs that require that existing TIDs of live tuples
+ * remain unchanged.
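+ *
+ * For example (illustration only), deleting item 2 from a page holding items
+ * 1..4 leaves its line pointer in place but marked unused, so items 3 and 4
+ * keep their offset numbers; only unused line pointers at the end of the
+ * array are truncated away.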
+ */
+ void
+ PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos, int nitems)
+ {
+ PageHeader phdr = (PageHeader) page;
+ LocationIndex pd_lower = phdr->pd_lower;
+ LocationIndex pd_upper = phdr->pd_upper;
+ LocationIndex pd_special = phdr->pd_special;
+ int nline,
+ nstorage;
+ OffsetNumber offnum;
+ int nextitm;
+
+ /*
+ * As with PageRepairFragmentation, paranoia seems justified.
+ */
+ if (pd_lower < SizeOfPageHeaderData ||
+ pd_lower > pd_upper ||
+ pd_upper > pd_special ||
+ pd_special > BLCKSZ ||
+ pd_special != MAXALIGN(pd_special))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u",
+ pd_lower, pd_upper, pd_special)));
+
+ /*
+ * Scan the item pointer array and build a list of just the ones we are
+ * going to keep. Notice we do not modify the page just yet, since we are
+ * still validity-checking.
+ */
+ nline = PageGetMaxOffsetNumber(page);
+ nstorage = 0;
+ nextitm = 0;
+ for (offnum = FirstOffsetNumber; offnum <= nline; offnum = OffsetNumberNext(offnum))
+ {
+ ItemId lp;
+ ItemLength itemlen;
+ ItemOffset offset;
+
+ lp = PageGetItemId(page, offnum);
+
+ itemlen = ItemIdGetLength(lp);
+ offset = ItemIdGetOffset(lp);
+
+ if (ItemIdIsUsed(lp))
+ {
+ if (offset < pd_upper ||
+ (offset + itemlen) > pd_special ||
+ offset != MAXALIGN(offset))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted item pointer: offset = %u, length = %u",
+ offset, (unsigned int) itemlen)));
+
+ if (nextitm < nitems && offnum == itemnos[nextitm])
+ {
+ ItemIdSetUnused(lp);
+ nextitm++;
+ }
+ else if (ItemIdHasStorage(lp))
+ nstorage++;
+ }
+ }
+
+ /* this will catch invalid or out-of-order itemnos[] */
+ if (nextitm != nitems)
+ elog(ERROR, "incorrect index offsets supplied");
+
+ if (nstorage == 0)
+ {
+ /* Page is completely empty, so just reset it quickly */
+ phdr->pd_lower = SizeOfPageHeaderData;
+ phdr->pd_upper = pd_special;
+ }
+ else
+ {
+ /* There are live items: need to compact the page the hard way */
+ char pageCopy[BLCKSZ];
+ itemIdSort itemidbase,
+ itemidptr;
+ int lastused;
+ int i;
+ Size totallen;
+ Offset upper;
+
+ /*
+ * First scan the page taking note of each item that we need to
+ * preserve. This includes both live items (those that contain data)
+ * and interspersed unused ones. It's critical to preserve these unused
+ * items, because otherwise the offset numbers for later live items
+ * would change, which is not acceptable.
+ */
+ itemidbase = (itemIdSort) palloc(sizeof(itemIdSortData) * nline);
+ itemidptr = itemidbase;
+ totallen = 0;
+ for (i = 0; i < nline; i++, itemidptr++)
+ {
+ ItemId lp;
+
+ itemidptr->offsetindex = i;
+
+ lp = PageGetItemId(page, i + 1);
+ if (ItemIdHasStorage(lp))
+ {
+ itemidptr->itemoff = ItemIdGetOffset(lp);
+ itemidptr->alignedlen = MAXALIGN(ItemIdGetLength(lp));
+ totallen += itemidptr->alignedlen;
+ }
+ else
+ {
+ itemidptr->itemoff = 0;
+ itemidptr->alignedlen = 0;
+ }
+ }
+
+ if (totallen > (Size) (pd_special - pd_lower))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted item lengths: total %u, available space %u",
+ (unsigned int) totallen, pd_special - pd_lower)));
+
+ /*
+ * Defragment the data areas of each tuple. Note that since offset
+ * numbers must remain unchanged in these pages, we can't do a qsort()
+ * of the itemIdSort elements here; and because the elements are not
+ * sorted by offset, we can't use memmove() to defragment the occupied
+ * data space. So we first create a temporary copy of the original
+ * data page, from which we memcpy() each item's data onto the final
+ * page.
+ */
+ memcpy(pageCopy, page, BLCKSZ);
+ lastused = FirstOffsetNumber;
+ upper = pd_special;
+ PageClearHasFreeLinePointers(page);
+ for (i = 0, itemidptr = itemidbase; i < nline; i++, itemidptr++)
+ {
+ ItemId lp;
+
+ if (itemidptr->alignedlen == 0)
+ {
+ PageSetHasFreeLinePointers(page);
+ continue;
+ }
+ lp = PageGetItemId(page, itemidptr->offsetindex + 1);
+ upper -= itemidptr->alignedlen;
+ memcpy((char *) page + upper,
+ pageCopy + itemidptr->itemoff,
+ itemidptr->alignedlen);
+ lp->lp_off = upper;
+
+ lastused = i + 1;
+ }
+
+ /* Set the new page limits */
+ phdr->pd_upper = upper;
+ phdr->pd_lower = SizeOfPageHeaderData + lastused * sizeof(ItemIdData);
+
+ pfree(itemidbase);
+ }
+ }
/*
* Set checksum for a page in shared buffers.
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
***************
*** 7314,7316 **** gincostestimate(PG_FUNCTION_ARGS)
--- 7314,7341 ----
PG_RETURN_VOID();
}
+
+ Datum
+ mmcostestimate(PG_FUNCTION_ARGS)
+ {
+ PlannerInfo *root = (PlannerInfo *) PG_GETARG_POINTER(0);
+ IndexPath *path = (IndexPath *) PG_GETARG_POINTER(1);
+ double loop_count = PG_GETARG_FLOAT8(2);
+ Cost *indexStartupCost = (Cost *) PG_GETARG_POINTER(3);
+ Cost *indexTotalCost = (Cost *) PG_GETARG_POINTER(4);
+ Selectivity *indexSelectivity = (Selectivity *) PG_GETARG_POINTER(5);
+ double *indexCorrelation = (double *) PG_GETARG_POINTER(6);
+ IndexOptInfo *index = path->indexinfo;
+
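+ /*
+ * Charge a sequential read of every index page per loop; selectivity comes
+ * from the index quals and correlation is assumed perfect.
+ */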
+ *indexStartupCost = (Cost) seq_page_cost * index->pages * loop_count;
+ *indexTotalCost = *indexStartupCost;
+
+ *indexSelectivity =
+ clauselist_selectivity(root, path->indexquals,
+ path->indexinfo->rel->relid,
+ JOIN_INNER, NULL);
+ *indexCorrelation = 1;
+
+ PG_RETURN_VOID();
+ }
+
*** a/src/include/access/heapam.h
--- b/src/include/access/heapam.h
***************
*** 112,117 **** extern HeapScanDesc heap_beginscan_strat(Relation relation, Snapshot snapshot,
--- 112,119 ----
bool allow_strat, bool allow_sync);
extern HeapScanDesc heap_beginscan_bm(Relation relation, Snapshot snapshot,
int nkeys, ScanKey key);
+ extern void heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk,
+ BlockNumber endBlk);
extern void heap_rescan(HeapScanDesc scan, ScanKey key);
extern void heap_endscan(HeapScanDesc scan);
extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
*** /dev/null
--- b/src/include/access/minmax.h
***************
*** 0 ****
--- 1,35 ----
+ /*
+ * AM-callable functions for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax.h
+ */
+ #ifndef MINMAX_H
+ #define MINMAX_H
+
+ #include "fmgr.h"
+
+
+ /*
+ * prototypes for functions in minmax.c (external entry points for minmax)
+ */
+ extern Datum mmbuild(PG_FUNCTION_ARGS);
+ extern Datum mmbuildempty(PG_FUNCTION_ARGS);
+ extern Datum mminsert(PG_FUNCTION_ARGS);
+ extern Datum mmbeginscan(PG_FUNCTION_ARGS);
+ extern Datum mmgettuple(PG_FUNCTION_ARGS);
+ extern Datum mmgetbitmap(PG_FUNCTION_ARGS);
+ extern Datum mmrescan(PG_FUNCTION_ARGS);
+ extern Datum mmendscan(PG_FUNCTION_ARGS);
+ extern Datum mmmarkpos(PG_FUNCTION_ARGS);
+ extern Datum mmrestrpos(PG_FUNCTION_ARGS);
+ extern Datum mmbulkdelete(PG_FUNCTION_ARGS);
+ extern Datum mmvacuumcleanup(PG_FUNCTION_ARGS);
+ extern Datum mmcanreturn(PG_FUNCTION_ARGS);
+ extern Datum mmcostestimate(PG_FUNCTION_ARGS);
+ extern Datum mmoptions(PG_FUNCTION_ARGS);
+
+ #endif /* MINMAX_H */
*** /dev/null
--- b/src/include/access/minmax_internal.h
***************
*** 0 ****
--- 1,39 ----
+ /*
+ * minmax_internal.h
+ * internal declarations for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_internal.h
+ */
+ #ifndef MINMAX_INTERNAL_H
+ #define MINMAX_INTERNAL_H
+
+ #include "storage/buf.h"
+ #include "storage/bufpage.h"
+ #include "storage/off.h"
+
+ /* Metapage definitions */
+ typedef struct MinmaxMetaPageData
+ {
+ int32 minmaxMagic;
+ int32 minmaxVersion;
+ } MinmaxMetaPageData;
+
+ #define MINMAX_CURRENT_VERSION 1
+ #define MINMAX_META_MAGIC 0xA8109CFA
+
+ #define MINMAX_METAPAGE_BLKNO 0
+
+ #define MM_REVMAP_FORKNUM VISIBILITYMAP_FORKNUM /* reuse the VM forknum */
+
+
+ extern void mm_init_metapage(Buffer meta);
+ extern void
+ rm_page_set_iptr(Page page, int pagesPerRange, BlockNumber heapBlk,
+ BlockNumber blkno, OffsetNumber offno);
+
+
+ #endif /* MINMAX_INTERNAL_H */
*** /dev/null
--- b/src/include/access/minmax_revmap.h
***************
*** 0 ****
--- 1,34 ----
+ /*
+ * prototypes for minmax reverse range maps
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_revmap.h
+ */
+
+ #ifndef MINMAX_REVMAP_H
+ #define MINMAX_REVMAP_H
+
+ #include "storage/block.h"
+ #include "storage/itemptr.h"
+ #include "storage/off.h"
+ #include "utils/relcache.h"
+
+ /* struct definition lives in mmrevmap.c */
+ typedef struct mmRevmapAccess mmRevmapAccess;
+
+ extern mmRevmapAccess *mmRevmapAccessInit(Relation idxrel,
+ BlockNumber pagesPerRange);
+ extern void mmRevmapAccessTerminate(mmRevmapAccess *rmAccess);
+
+ extern void mmRevmapCreate(Relation idxrel);
+ extern void mmSetHeapBlockItemptr(mmRevmapAccess *rmAccess, BlockNumber blk,
+ BlockNumber blkno, OffsetNumber offno);
+ extern void mmGetHeapBlockItemptr(mmRevmapAccess *rmAccess, BlockNumber blk,
+ ItemPointerData *iptr);
+ extern void mmRevmapTruncate(mmRevmapAccess *rmAccess,
+ BlockNumber heapNumBlocks);
+
+ #endif /* MINMAX_REVMAP_H */
*** /dev/null
--- b/src/include/access/minmax_tuple.h
***************
*** 0 ****
--- 1,79 ----
+ /*
+ * Declarations for dealing with MinMax-specific tuples.
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_tuple.h
+ */
+ #ifndef MINMAX_TUPLE_H
+ #define MINMAX_TUPLE_H
+
+ #include "access/tupdesc.h"
+
+
+ /*
+ * This struct is used to represent the indexed values for one column, within
+ * one page range.
+ */
+ typedef struct MMValues
+ {
+ Datum min;
+ Datum max;
+ bool hasnulls;
+ bool allnulls;
+ } MMValues;
+
+ /*
+ * This struct represents one index tuple, comprising the minimum and
+ * maximum values for all indexed columns, within one page range.
+ * The number of elements in the values array is determined by the accompanying
+ * tuple descriptor.
+ */
+ typedef struct DeformedMMTuple
+ {
+ bool nvalues; /* XXX unused */
+ MMValues values[FLEXIBLE_ARRAY_MEMBER];
+ } DeformedMMTuple;
+
+ /*
+ * An on-disk minmax tuple. This is possibly followed by a nulls bitmask, with
+ * room for natts*2 null bits; min and max Datum values for each column follow
+ * that.
+ */
+ typedef struct MMTuple
+ {
+ /* ---------------
+ * mt_info is laid out in the following fashion:
+ *
+ * 7th (high) bit: has nulls
+ * 6th bit: unused
+ * 5th bit: unused
+ * 4-0 bit: offset of data
+ * ---------------
+ */
+ uint8 mt_info;
+ } MMTuple;
+
+ #define SizeOfMinMaxTuple (offsetof(MMTuple, mt_info) + sizeof(uint8))
+
+ /*
+ * t_info manipulation macros
+ */
+ #define MMIDX_OFFSET_MASK 0x1F
+ /* bit 0x20 is not used at present */
+ /* bit 0x40 is not used at present */
+ #define MMIDX_NULLS_MASK 0x80
+
+ #define MMTupleDataOffset(mmtup) ((Size) (((MMTuple *) (mmtup))->mt_info & MMIDX_OFFSET_MASK))
+ #define MMTupleHasNulls(mmtup) (((((MMTuple *) (mmtup))->mt_info & MMIDX_NULLS_MASK)) != 0)
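+
+ /*
+ * For example (illustration only), mt_info = 0x88 describes a tuple that has
+ * a nulls bitmask (0x80 set) and whose data area starts at offset 8
+ * (0x88 & MMIDX_OFFSET_MASK).
+ */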
+
+
+ extern TupleDesc minmax_get_descr(TupleDesc tupdesc);
+ extern MMTuple *minmax_form_tuple(TupleDesc idxDesc, TupleDesc diskDesc,
+ DeformedMMTuple *tuple, Size *size);
+ extern void minmax_free_tuple(MMTuple *tuple);
+ extern DeformedMMTuple *minmax_deform_tuple(TupleDesc tupdesc, MMTuple *tuple);
+
+ #endif /* MINMAX_TUPLE_H */
*** /dev/null
--- b/src/include/access/minmax_xlog.h
***************
*** 0 ****
--- 1,93 ----
+ /*-------------------------------------------------------------------------
+ *
+ * minmax_xlog.h
+ * POSTGRES MinMax access XLOG definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/minmax_xlog.h
+ *
+ *-------------------------------------------------------------------------
+ */
+ #ifndef MINMAX_XLOG_H
+ #define MINMAX_XLOG_H
+
+ #include "access/xlog.h"
+ #include "storage/bufpage.h"
+ #include "storage/itemptr.h"
+ #include "storage/relfilenode.h"
+ #include "utils/relcache.h"
+
+
+ /*
+ * WAL record definitions for minmax's WAL operations
+ *
+ * XLOG allows us to store some information in the high 4 bits of the
+ * xl_info field of the log record.
+ */
+ #define XLOG_MINMAX_CREATE_INDEX 0x00
+ #define XLOG_MINMAX_INSERT 0x10
+ #define XLOG_MINMAX_BULKREMOVE 0x20
+ #define XLOG_MINMAX_REVMAP_SET 0x30
+
+ #define XLOG_MINMAX_OPMASK 0x70
+ /*
+ * When we insert the first item on a new page, we restore the entire page in
+ * redo.
+ */
+ #define XLOG_MINMAX_INIT_PAGE 0x80
+
+ /* This is what we need to know about a minmax index create */
+ typedef struct xl_minmax_createidx
+ {
+ RelFileNode node;
+ } xl_minmax_createidx;
+ #define SizeOfMinmaxCreateIdx (offsetof(xl_minmax_createidx, node) + sizeof(RelFileNode))
+
+ /* All that we need to find a minmax tuple */
+ typedef struct xl_minmax_tid
+ {
+ RelFileNode node;
+ ItemPointerData tid;
+ } xl_minmax_tid;
+
+ #define SizeOfMinmaxTid (offsetof(xl_minmax_tid, tid) + SizeOfIptrData)
+
+ /* This is what we need to know about a minmax tuple insert */
+ typedef struct xl_minmax_insert
+ {
+ xl_minmax_tid target;
+ /* tuple data follows at end of struct */
+ } xl_minmax_insert;
+
+ #define SizeOfMinmaxInsert (offsetof(xl_minmax_insert, target) + SizeOfMinmaxTid)
+
+ /* This is what we need to know about a bulk minmax tuple remove */
+ typedef struct xl_minmax_bulkremove
+ {
+ RelFileNode node;
+ BlockNumber block;
+ /* offset number array follows at end of struct */
+ } xl_minmax_bulkremove;
+
+ #define SizeOfMinmaxBulkRemove (offsetof(xl_minmax_bulkremove, block) + sizeof(BlockNumber))
+
+ /* This is what we need to know about a revmap "set heap ptr" */
+ typedef struct xl_minmax_rm_set
+ {
+ RelFileNode node;
+ BlockNumber mapBlock;
+ int pagesPerRange;
+ BlockNumber heapBlock;
+ ItemPointerData newval;
+ } xl_minmax_rm_set;
+
+ #define SizeOfMinmaxRevmapSet (offsetof(xl_minmax_rm_set, newval) + SizeOfIptrData)
+
+
+ extern void minmax_desc(StringInfo buf, uint8 xl_info, char *rec);
+ extern void minmax_redo(XLogRecPtr lsn, XLogRecord *record);
+
+ #endif /* MINMAX_XLOG_H */
*** a/src/include/access/relscan.h
--- b/src/include/access/relscan.h
***************
*** 35,42 **** typedef struct HeapScanDescData
bool rs_temp_snap; /* unregister snapshot at scan end? */
/* state set up at initscan time */
! BlockNumber rs_nblocks; /* number of blocks to scan */
BlockNumber rs_startblock; /* block # to start at */
BufferAccessStrategy rs_strategy; /* access strategy for reads */
bool rs_syncscan; /* report location to syncscan logic? */
--- 35,44 ----
bool rs_temp_snap; /* unregister snapshot at scan end? */
/* state set up at initscan time */
! BlockNumber rs_nblocks; /* total number of blocks in rel */
BlockNumber rs_startblock; /* block # to start at */
+ BlockNumber rs_initblock; /* block # to consider initial of rel */
+ BlockNumber rs_numblocks; /* number of blocks to scan */
BufferAccessStrategy rs_strategy; /* access strategy for reads */
bool rs_syncscan; /* report location to syncscan logic? */
*** a/src/include/access/rmgrlist.h
--- b/src/include/access/rmgrlist.h
***************
*** 42,44 **** PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_xlog_startup, gin_xlog_cleanup
--- 42,45 ----
PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_xlog_startup, gist_xlog_cleanup, NULL)
PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, NULL, NULL, NULL)
PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_xlog_startup, spg_xlog_cleanup, NULL)
+ PG_RMGR(RM_MINMAX_ID, "MinMax", minmax_redo, minmax_desc, NULL, NULL, NULL)
*** a/src/include/catalog/index.h
--- b/src/include/catalog/index.h
***************
*** 97,102 **** extern double IndexBuildHeapScan(Relation heapRelation,
--- 97,110 ----
bool allow_sync,
IndexBuildCallback callback,
void *callback_state);
+ extern double IndexBuildHeapRangeScan(Relation heapRelation,
+ Relation indexRelation,
+ IndexInfo *indexInfo,
+ bool allow_sync,
+ BlockNumber start_blockno,
+ BlockNumber end_blockno,
+ IndexBuildCallback callback,
+ void *callback_state);
extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
*** a/src/include/catalog/pg_am.h
--- b/src/include/catalog/pg_am.h
***************
*** 132,136 **** DESCR("GIN index access method");
--- 132,138 ----
DATA(insert OID = 4000 ( spgist 0 5 f f f f f t f t f f f 0 spginsert spgbeginscan spggettuple spggetbitmap spgrescan spgendscan spgmarkpos spgrestrpos spgbuild spgbuildempty spgbulkdelete spgvacuumcleanup spgcanreturn spgcostestimate spgoptions ));
DESCR("SP-GiST index access method");
#define SPGIST_AM_OID 4000
+ DATA(insert OID = 3847 ( minmax 5 0 f f f f t t f t t f f 0 mminsert mmbeginscan - mmgetbitmap mmrescan mmendscan mmmarkpos mmrestrpos mmbuild mmbuildempty mmbulkdelete mmvacuumcleanup - mmcostestimate mmoptions ));
+ #define MINMAX_AM_OID 3847
#endif /* PG_AM_H */
*** a/src/include/catalog/pg_amop.h
--- b/src/include/catalog/pg_amop.h
***************
*** 791,794 **** DATA(insert ( 3474 3831 3831 8 s 3892 4000 0 ));
--- 791,866 ----
DATA(insert ( 3474 3831 2283 16 s 3889 4000 0 ));
DATA(insert ( 3474 3831 3831 18 s 3882 4000 0 ));
+ /*
+ * MinMax int4_ops
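+ * (strategy numbers follow the btree convention: 1 = <, 2 = <=, 3 = =,
+ * 4 = >=, 5 = >)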
+ */
+ DATA(insert ( 3969 23 23 1 s 97 3847 0 ));
+ DATA(insert ( 3969 23 23 2 s 523 3847 0 ));
+ DATA(insert ( 3969 23 23 3 s 96 3847 0 ));
+ DATA(insert ( 3969 23 23 4 s 525 3847 0 ));
+ DATA(insert ( 3969 23 23 5 s 521 3847 0 ));
+
+ /*
+ * MinMax numeric_ops
+ */
+ DATA(insert ( 3970 1700 1700 1 s 1754 3847 0 ));
+ DATA(insert ( 3970 1700 1700 2 s 1755 3847 0 ));
+ DATA(insert ( 3970 1700 1700 3 s 1752 3847 0 ));
+ DATA(insert ( 3970 1700 1700 4 s 1757 3847 0 ));
+ DATA(insert ( 3970 1700 1700 5 s 1756 3847 0 ));
+
+ /*
+ * MinMax text_ops
+ */
+ DATA(insert ( 3971 25 25 1 s 664 3847 0 ));
+ DATA(insert ( 3971 25 25 2 s 665 3847 0 ));
+ DATA(insert ( 3971 25 25 3 s 98 3847 0 ));
+ DATA(insert ( 3971 25 25 4 s 667 3847 0 ));
+ DATA(insert ( 3971 25 25 5 s 666 3847 0 ));
+
+ /*
+ * MinMax time_ops
+ */
+ DATA(insert ( 3972 1083 1083 1 s 1110 3847 0 ));
+ DATA(insert ( 3972 1083 1083 2 s 1111 3847 0 ));
+ DATA(insert ( 3972 1083 1083 3 s 1108 3847 0 ));
+ DATA(insert ( 3972 1083 1083 4 s 1113 3847 0 ));
+ DATA(insert ( 3972 1083 1083 5 s 1112 3847 0 ));
+
+ /*
+ * MinMax timetz_ops
+ */
+ DATA(insert ( 3973 1266 1266 1 s 1552 3847 0 ));
+ DATA(insert ( 3973 1266 1266 2 s 1553 3847 0 ));
+ DATA(insert ( 3973 1266 1266 3 s 1550 3847 0 ));
+ DATA(insert ( 3973 1266 1266 4 s 1555 3847 0 ));
+ DATA(insert ( 3973 1266 1266 5 s 1554 3847 0 ));
+
+ /*
+ * MinMax timestamp_ops
+ */
+ DATA(insert ( 3974 1114 1114 1 s 2062 3847 0 ));
+ DATA(insert ( 3974 1114 1114 2 s 2063 3847 0 ));
+ DATA(insert ( 3974 1114 1114 3 s 2060 3847 0 ));
+ DATA(insert ( 3974 1114 1114 4 s 2065 3847 0 ));
+ DATA(insert ( 3974 1114 1114 5 s 2064 3847 0 ));
+
+ /*
+ * MinMax timestamptz_ops
+ */
+ DATA(insert ( 3975 1184 1184 1 s 1322 3847 0 ));
+ DATA(insert ( 3975 1184 1184 2 s 1323 3847 0 ));
+ DATA(insert ( 3975 1184 1184 3 s 1320 3847 0 ));
+ DATA(insert ( 3975 1184 1184 4 s 1325 3847 0 ));
+ DATA(insert ( 3975 1184 1184 5 s 1324 3847 0 ));
+
+ /*
+ * MinMax date_ops
+ */
+ DATA(insert ( 3976 1082 1082 1 s 1095 3847 0 ));
+ DATA(insert ( 3976 1082 1082 2 s 1096 3847 0 ));
+ DATA(insert ( 3976 1082 1082 3 s 1093 3847 0 ));
+ DATA(insert ( 3976 1082 1082 4 s 1098 3847 0 ));
+ DATA(insert ( 3976 1082 1082 5 s 1097 3847 0 ));
+
#endif /* PG_AMOP_H */
*** a/src/include/catalog/pg_opclass.h
--- b/src/include/catalog/pg_opclass.h
***************
*** 228,232 **** DATA(insert ( 4000 range_ops PGNSP PGUID 3474 3831 t 0 ));
--- 228,240 ----
DATA(insert ( 4000 quad_point_ops PGNSP PGUID 4015 600 t 0 ));
DATA(insert ( 4000 kd_point_ops PGNSP PGUID 4016 600 f 0 ));
DATA(insert ( 4000 text_ops PGNSP PGUID 4017 25 t 0 ));
+ DATA(insert ( 3847 int4_ops PGNSP PGUID 3969 23 t 0 ));
+ DATA(insert ( 3847 numeric_ops PGNSP PGUID 3970 1700 t 0 ));
+ DATA(insert ( 3847 text_ops PGNSP PGUID 3971 25 t 0 ));
+ DATA(insert ( 3847 time_ops PGNSP PGUID 3972 1083 t 0 ));
+ DATA(insert ( 3847 timetz_ops PGNSP PGUID 3973 1266 t 0 ));
+ DATA(insert ( 3847 timestamp_ops PGNSP PGUID 3974 1114 t 0 ));
+ DATA(insert ( 3847 timestamptz_ops PGNSP PGUID 3975 1184 t 0 ));
+ DATA(insert ( 3847 date_ops PGNSP PGUID 3976 1082 t 0 ));
#endif /* PG_OPCLASS_H */
*** a/src/include/catalog/pg_opfamily.h
--- b/src/include/catalog/pg_opfamily.h
***************
*** 148,152 **** DATA(insert OID = 4015 ( 4000 quad_point_ops PGNSP PGUID ));
--- 148,160 ----
DATA(insert OID = 4016 ( 4000 kd_point_ops PGNSP PGUID ));
DATA(insert OID = 4017 ( 4000 text_ops PGNSP PGUID ));
#define TEXT_SPGIST_FAM_OID 4017
+ DATA(insert OID = 3969 ( 3847 int4_ops PGNSP PGUID ));
+ DATA(insert OID = 3970 ( 3847 numeric_ops PGNSP PGUID ));
+ DATA(insert OID = 3971 ( 3847 text_ops PGNSP PGUID ));
+ DATA(insert OID = 3972 ( 3847 time_ops PGNSP PGUID ));
+ DATA(insert OID = 3973 ( 3847 timetz_ops PGNSP PGUID ));
+ DATA(insert OID = 3974 ( 3847 timestamp_ops PGNSP PGUID ));
+ DATA(insert OID = 3975 ( 3847 timestamptz_ops PGNSP PGUID ));
+ DATA(insert OID = 3976 ( 3847 date_ops PGNSP PGUID ));
#endif /* PG_OPFAMILY_H */
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
***************
*** 561,566 **** DESCR("btree(internal)");
--- 561,594 ----
DATA(insert OID = 2785 ( btoptions PGNSP PGUID 12 1 0 0 0 f f f f t f s 2 0 17 "1009 16" _null_ _null_ _null_ _null_ btoptions _null_ _null_ _null_ ));
DESCR("btree(internal)");
+ DATA(insert OID = 3178 ( mmgetbitmap PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 20 "2281 2281" _null_ _null_ _null_ _null_ mmgetbitmap _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3179 ( mminsert PGNSP PGUID 12 1 0 0 0 f f f f t f v 6 0 16 "2281 2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mminsert _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3180 ( mmbeginscan PGNSP PGUID 12 1 0 0 0 f f f f t f v 3 0 2281 "2281 2281 2281" _null_ _null_ _null_ _null_ mmbeginscan _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3195 ( mmrescan PGNSP PGUID 12 1 0 0 0 f f f f t f v 5 0 2278 "2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmrescan _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3196 ( mmendscan PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmendscan _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3197 ( mmmarkpos PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmmarkpos _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3198 ( mmrestrpos PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmrestrpos _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3199 ( mmbuild PGNSP PGUID 12 1 0 0 0 f f f f t f v 3 0 2281 "2281 2281 2281" _null_ _null_ _null_ _null_ mmbuild _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3200 ( mmbuildempty PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmbuildempty _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3201 ( mmbulkdelete PGNSP PGUID 12 1 0 0 0 f f f f t f v 4 0 2281 "2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmbulkdelete _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3202 ( mmvacuumcleanup PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ mmvacuumcleanup _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3203 ( mmcostestimate PGNSP PGUID 12 1 0 0 0 f f f f t f v 7 0 2278 "2281 2281 2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmcostestimate _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3204 ( mmoptions PGNSP PGUID 12 1 0 0 0 f f f f t f s 2 0 17 "1009 16" _null_ _null_ _null_ _null_ mmoptions _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+
+
DATA(insert OID = 339 ( poly_same PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "604 604" _null_ _null_ _null_ _null_ poly_same _null_ _null_ _null_ ));
DATA(insert OID = 340 ( poly_contain PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "604 604" _null_ _null_ _null_ _null_ poly_contain _null_ _null_ _null_ ));
DATA(insert OID = 341 ( poly_left PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "604 604" _null_ _null_ _null_ _null_ poly_left _null_ _null_ _null_ ));
*** a/src/include/storage/bufpage.h
--- b/src/include/storage/bufpage.h
***************
*** 403,408 **** extern Size PageGetExactFreeSpace(Page page);
--- 403,409 ----
extern Size PageGetHeapFreeSpace(Page page);
extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
+ extern void PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos, int nitems);
extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
*** a/src/include/utils/selfuncs.h
--- b/src/include/utils/selfuncs.h
***************
*** 195,200 **** extern Datum hashcostestimate(PG_FUNCTION_ARGS);
--- 195,201 ----
extern Datum gistcostestimate(PG_FUNCTION_ARGS);
extern Datum spgcostestimate(PG_FUNCTION_ARGS);
extern Datum gincostestimate(PG_FUNCTION_ARGS);
+ extern Datum mmcostestimate(PG_FUNCTION_ARGS);
/* Functions in array_selfuncs.c */
*** a/src/test/regress/expected/opr_sanity.out
--- b/src/test/regress/expected/opr_sanity.out
***************
*** 1081,1086 **** ORDER BY 1, 2, 3;
--- 1081,1091 ----
2742 | 2 | @@@
2742 | 3 | <@
2742 | 4 | =
+ 3847 | 1 | <
+ 3847 | 2 | <=
+ 3847 | 3 | =
+ 3847 | 4 | >=
+ 3847 | 5 | >
4000 | 1 | <<
4000 | 1 | ~<~
4000 | 2 | &<
***************
*** 1276,1282 **** FROM pg_am am JOIN pg_opclass op ON opcmethod = am.oid
amproclefttype = amprocrighttype AND amproclefttype = opcintype
WHERE am.amname <> 'btree' AND am.amname <> 'gist' AND am.amname <> 'gin'
GROUP BY amname, amsupport, opcname, amprocfamily
! HAVING count(*) != amsupport OR amprocfamily IS NULL;
amname | opcname | count
--------+---------+-------
(0 rows)
--- 1281,1287 ----
amproclefttype = amprocrighttype AND amproclefttype = opcintype
WHERE am.amname <> 'btree' AND am.amname <> 'gist' AND am.amname <> 'gin'
GROUP BY amname, amsupport, opcname, amprocfamily
! HAVING count(*) != amsupport AND amprocfamily IS NOT NULL;
amname | opcname | count
--------+---------+-------
(0 rows)
*** a/src/test/regress/sql/opr_sanity.sql
--- b/src/test/regress/sql/opr_sanity.sql
***************
*** 978,984 **** FROM pg_am am JOIN pg_opclass op ON opcmethod = am.oid
amproclefttype = amprocrighttype AND amproclefttype = opcintype
WHERE am.amname <> 'btree' AND am.amname <> 'gist' AND am.amname <> 'gin'
GROUP BY amname, amsupport, opcname, amprocfamily
! HAVING count(*) != amsupport OR amprocfamily IS NULL;
SELECT amname, opcname, count(*)
FROM pg_am am JOIN pg_opclass op ON opcmethod = am.oid
--- 978,984 ----
amproclefttype = amprocrighttype AND amproclefttype = opcintype
WHERE am.amname <> 'btree' AND am.amname <> 'gist' AND am.amname <> 'gin'
GROUP BY amname, amsupport, opcname, amprocfamily
! HAVING count(*) != amsupport AND amprocfamily IS NOT NULL;
SELECT amname, opcname, count(*)
FROM pg_am am JOIN pg_opclass op ON opcmethod = am.oid
Robert Haas wrote:
On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Here's an updated version of this patch, with fixes to all the bugs
reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and
Amit Kapila for the reports.
I'm not very happy with the use of a separate relation fork for
storing this data.
I have been playing with having the revmap in the main fork of the index
rather than a separate one. On the surface many things stay just as
they are; I only had to add a layer beneath the revmap that maps its
logical block numbers to physical block numbers. The problem with this
is that it needs more disk access, because revmap block numbers cannot be
hardcoded.
After doing some quick math, what I ended up with was to keep an array
of BlockNumbers in the metapage. Each element in this array points to
array pages; each array page is, in turn, filled with more BlockNumbers,
which this time correspond to the logical revmap pages we used to have
in the revmap fork. (I initially feared that this design would not
allow me to address enough revmap pages for the largest of tables; but
fortunately this is sufficient unless you configure very small pages,
say BLCKSZ 2kB, use small page ranges, and use small datatypes, say
"char". I have no problem with saying that that scenario is not
supported if you want to have minmax indexes on 32 TB tables. I mean,
who uses BLCKSZ smaller than 8kB anyway?).
The advantage of this design is that in order to find any particular
logical revmap page, you always have to do a constant number of page
accesses. You read the metapage, then read the array page, then read
the revmap page; done. Another idea I considered was chaining revmap
pages (so each would have a pointer-to-next), or chaining array pages;
but this would have meant that to locate an individual page towards the end
of the revmap, you might need to do many accesses. Not good.
As an optimization for relatively small indexes, we hardcode the page
number for the first revmap page: it's always the page right after the
metapage (so BlockNumber 1). A revmap page can store, with the default
page size, about 1350 item pointers; so with an index built for page
ranges of 1000 pages per range, you can point to enough index entries
for a ~10 GB table without needing to examine the first array
page. This seems pretty acceptable; people with larger tables can
likely spare one extra page access every now and then.
(For comparison, each regular minmax page can store about 500 index
tuples, if it's built for a single 4-byte column; this means that the 10
GB table requires a 5-page index.)
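To make that arithmetic concrete, here is the same back-of-the-envelope
calculation as a psql query (just an illustrative sketch; the constants are
the figures quoted above -- 8 kB pages, ~1350 TIDs per revmap page, ~500
index tuples per regular page, 1000-page ranges -- not values read from the
patch itself):

select pg_size_pretty(8192::bigint * 1000 * 1350) as heap_covered_by_revmap_page_1,
       ceil(1350 / 500.0)::int as regular_index_pages,
       2 + ceil(1350 / 500.0)::int as total_index_pages; -- plus metapage and revmap page
-- heap_covered_by_revmap_page_1 | regular_index_pages | total_index_pages
-- 10 GB                         | 3                   | 5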
This is not complete yet; although I have a proof-of-concept working, I
still need to write XLog support code and update the pageinspect code to
match.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Alvaro Herrera wrote:
I have been playing with having the revmap in the main fork of the index
rather than a separate one.
...
This is not complete yet; although I have a proof-of-concept working, I
still need to write XLog support code and update the pageinspect code to
match.
Just to be clear: the v7 published elsewhere in this thread does not
contain this revmap-in-main-fork code.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, November 8, 2013 21:11, Alvaro Herrera wrote:
Here's a version 7 of the patch, which fixes these bugs and adds
opclasses for a bunch more types (timestamp, timestamptz, date, time,
timetz), courtesy of Martín Marqués. It's also been rebased to apply
cleanly on top of today's master branch.
I have also added a selectivity function, but I'm not positive that it's
very useful yet.
[minmax-7.patch]
The earlier errors are indeed fixed; now I've been trying with the attached test case, but I'm unable to find a query that
improves with minmax index use (it gets used sometimes, but the speedup is negligible).
That probably means I'm doing something wrong; could you (or anyone) give some hints about what use case would be expected?
(Or is it just the unfinished selectivity function?)
Thanks,
Erikjan Rijkers
Attachments:
On Mon, November 11, 2013 09:53, Erik Rijkers wrote:
On Fri, November 8, 2013 21:11, Alvaro Herrera wrote:
Here's a version 7 of the patch, which fixes these bugs and adds
opclasses for a bunch more types (timestamp, timestamptz, date, time,
timetz), courtesy of Martín Marqués. It's also been rebased to apply
cleanly on top of today's master branch.
I have also added a selectivity function, but I'm not positive that it's
very useful yet.
[minmax-7.patch]
The earlier errors are indeed fixed; now I've been trying with the attached test case, but I'm unable to find a query that
improves with minmax index use (it gets used sometimes, but the speedup is negligible).
Another issue (I think):
Attached is a program (and output as a .txt file) that gives the following (repeatable) error:
$ ./casanova_test.sh
\timing on
drop table if exists t1;
Time: 333.159 ms
create table t1 (i int);
Time: 155.827 ms
create index t1_i_idx on t1 using minmax(i);
Time: 204.031 ms
insert into t1 select generate_series(1, 25000000);
Time: 126312.302 ms
analyze t1;
ERROR: could not truncate file base/21324/26339_vm to 41 blocks: it's only 1 blocks now
Time: 472.504 ms
[...]
Thanks,
Erik Rijkers
On Mon, Nov 11, 2013 at 12:53 AM, Erik Rijkers <er@xs4all.nl> wrote:
On Fri, November 8, 2013 21:11, Alvaro Herrera wrote:
Here's a version 7 of the patch, which fixes these bugs and adds
opclasses for a bunch more types (timestamp, timestamptz, date, time,
timetz), courtesy of Martín Marqués. It's also been rebased to apply
cleanly on top of today's master branch.
I have also added a selectivity function, but I'm not positive that it's
very useful yet.
[minmax-7.patch]
The earlier errors are indeed fixed; now I've been trying with the
attached test case, but I'm unable to find a query that
improves with minmax index use (it gets used sometimes, but the speedup is
negligible).
Your data set seems to be completely random. I believe that minmax indices
would only be expected to be useful when the data is clustered. Perhaps
you could try it on a table where it is populated something like
i+random()/10*max_i.
Cheers,
Jeff
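As an aside, here is a minimal sketch of the kind of clustered test data Jeff
describes (table and index names are made up, and this is not one of the test
scripts attached in this thread):

drop table if exists t_clustered;
create table t_clustered (i int);
-- values grow with physical insertion order, plus ~10% noise,
-- i.e. roughly i + random()/10 * max_i with max_i = 10 million
insert into t_clustered
  select g + (random() * 10000000 / 10)::int
  from generate_series(1, 10000000) g;
create index t_clustered_mm on t_clustered using minmax (i);
explain analyze select count(*) from t_clustered where i between 5000000 and 5010000;

With data laid out like this, each page range covers a narrow interval of i,
so most ranges can be excluded by the query quals.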
On Mon, November 11, 2013 09:53, Erik Rijkers wrote:
On Fri, November 8, 2013 21:11, Alvaro Herrera wrote:
Here's a version 7 of the patch, which fixes these bugs and adds
[minmax-7.patch]
[...]
some hints about use-case would be expected?
I've been messing with minmax indexes some more so here are some results of that.
Perhaps someone finds these timings useful.
Centos 5.7, 32 GB memory, 2 quadcores.
'--prefix=/var/data1/pg_stuff/pg_installations/pgsql.minmax' '--with-pgport=6444' '--enable-depend' '--enable-cassert'
'--enable-debug' '--with-perl' '--with-openssl' '--with-libxml' '--enable-dtrace'
Detail is in the attached files; the below is a grep through these.
-- rowcount (size_string): 10_000
368,640 | size table
245,760 | size btree index
16,384 | size minmax index
Total runtime: 0.167 ms <-- btree (4x) ( last 2x disabled index-only )
Total runtime: 0.046 ms
Total runtime: 0.046 ms
Total runtime: 0.049 ms
Total runtime: 0.102 ms <-- minmax (4x)
Total runtime: 0.047 ms
Total runtime: 0.047 ms
Total runtime: 0.047 ms
Total runtime: 1.066 ms <-- seqscan
-- rowcount (size_string): 100_000
3,629,056 | size table
2,260,992 | size btree index
16,384 | size minmax index
Total runtime: 0.090 ms <-- btree (4x) ( last 2x disabled index-only )
Total runtime: 0.046 ms
Total runtime: 0.426 ms
Total runtime: 0.287 ms
Total runtime: 0.391 ms <-- minmax (4x)
Total runtime: 0.285 ms
Total runtime: 0.285 ms
Total runtime: 0.291 ms
Total runtime: 14.065 ms <-- seqscan
-- rowcount (size_string): 1_000_000
36,249,600 | size table
22,487,040 | size btree index
57,344 | size minmax index
Total runtime: 0.077 ms <-- btree (4x) ( last 2x disabled index-only )
Total runtime: 0.048 ms
Total runtime: 0.044 ms
Total runtime: 0.038 ms
Total runtime: 2.284 ms <-- minmax (4x)
Total runtime: 1.812 ms
Total runtime: 1.813 ms
Total runtime: 1.809 ms
Total runtime: 142.958 ms <-- seqscan
-- rowcount (size_string): 100_000_000
3,624,779,776 | size table
2,246,197,248 | size btree index
4,456,448 | size minmax index
Total runtime: 0.091 ms <-- btree (4x) ( last 2x disabled index-only )
Total runtime: 0.047 ms
Total runtime: 0.046 ms
Total runtime: 0.038 ms
Total runtime: 181.874 ms <-- minmax (4x)
Total runtime: 175.084 ms
Total runtime: 175.104 ms
Total runtime: 174.349 ms
Total runtime: 14833.994 ms <-- seqscan
-- rowcount (size_string): 1_000_000_000
36,247,789,568 | size table
22,461,628,416 | size btree index
44,433,408 | size minmax index
Total runtime: 14.735 ms <-- btree (4x) ( last 2x disabled index-only )
Total runtime: 0.046 ms
Total runtime: 0.044 ms
Total runtime: 0.041 ms
Total runtime: 1790.591 ms <-- minmax (4x)
Total runtime: 1750.129 ms
Total runtime: 1747.987 ms
Total runtime: 1748.476 ms
Total runtime: 169770.455 ms <-- seqscan
The messy "program" is attached too (although it still has Jaime's name, the mess is mine).
hth,
Erik Rijkers
PS.
The bug I reported earlier is (of course) still there; but I noticed that it only occurs on larger table sizes (e.g. 1M+
rows).
On 2013-11-15 17:11:46 +0100, Erik Rijkers wrote:
I've been messing with minmax indexes some more so here are some results of that.
Perhaps someone finds these timings useful.
Centos 5.7, 32 GB memory, 2 quadcores.
'--prefix=/var/data1/pg_stuff/pg_installations/pgsql.minmax' '--with-pgport=6444' '--enable-depend' '--enable-cassert'
'--enable-debug' '--with-perl' '--with-openssl' '--with-libxml' '--enable-dtrace'
Just some general advice: doing timings with --enable-cassert isn't that
meaningful - it can often distort results significantly.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Erik Rijkers <er@xs4all.nl> wrote:
Perhaps someone finds these timings useful.
'--enable-cassert'
Assertions can really distort the timings, and not always equally
for all code paths. Any chance of re-running those tests without
that?
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, November 15, 2013 17:33, Kevin Grittner wrote:
Erik Rijkers <er@xs4all.nl> wrote:
Perhaps someone finds these timings useful.
'--enable-cassert'
Assertions can really distort the timings, and not always equally
for all code paths. Any chance of re-running those tests without
that?
Fair enough. It seems it doesn't make all that much difference for this case; here are the results:
'--prefix=/var/data1/pg_stuff/pg_installations/pgsql.minmax' '--with-pgport=6444' '--enable-depend' '--with-perl'
'--with-openssl' '--with-libxml'
-- rowcount (size_string): 10_000
368640 | size table | 360 kB
245760 | size btree index | 240 kB
16384 | size minmax index | 16 kB
Total runtime: 0.121 ms
Total runtime: 0.041 ms
Total runtime: 0.039 ms
Total runtime: 0.040 ms
Total runtime: 0.043 ms
Total runtime: 0.041 ms
Total runtime: 0.040 ms
Total runtime: 0.040 ms
Total runtime: 0.948 ms
-- rowcount (size_string): 100_000
3629056 | size table | 3544 kB
2260992 | size btree index | 2208 kB
16384 | size minmax index | 16 kB
Total runtime: 0.082 ms
Total runtime: 0.039 ms
Total runtime: 0.396 ms
Total runtime: 0.252 ms
Total runtime: 0.339 ms
Total runtime: 0.245 ms
Total runtime: 0.240 ms
Total runtime: 0.241 ms
Total runtime: 13.268 ms
-- rowcount (size_string): 1_000_000
36249600 | size table | 35 MB
22487040 | size btree index | 21 MB
57344 | size minmax index | 56 kB
Total runtime: 0.096 ms
Total runtime: 0.039 ms
Total runtime: 0.039 ms
Total runtime: 0.034 ms
Total runtime: 1.975 ms
Total runtime: 1.527 ms
Total runtime: 1.523 ms
Total runtime: 1.519 ms
Total runtime: 145.125 ms
-- rowcount (size_string): 100_000_000
3624779776 | size table | 3457 MB
2246197248 | size btree index | 2142 MB
4456448 | size minmax index | 4352 kB
Total runtime: 0.074 ms
Total runtime: 0.039 ms
Total runtime: 0.040 ms
Total runtime: 0.033 ms
Total runtime: 150.450 ms
Total runtime: 147.039 ms
Total runtime: 145.410 ms
Total runtime: 145.142 ms
Total runtime: 15068.171 ms
-- rowcount (size_string): 1_000_000_000
36247789568 | size table | 34 GB
22461628416 | size btree index | 21 GB
44433408 | size minmax index | 42 MB
Total runtime: 15.454 ms <-- 4x btree
Total runtime: 0.040 ms
Total runtime: 0.040 ms
Total runtime: 0.034 ms
Total runtime: 1502.353 ms <-- 4x minmax
Total runtime: 1482.322 ms
Total runtime: 1489.522 ms
Total runtime: 1481.424 ms
Total runtime: 162213.392 ms <-- seqscan
I'd say minmax indexes give spectacular gains for a very small index size.
Erik Rijkers
Attachments:
On Fri, Nov 8, 2013 at 12:11 PM, Alvaro Herrera <alvherre@2ndquadrant.com>wrote:
Erik Rijkers wrote:
On Thu, September 26, 2013 00:34, Erik Rijkers wrote:
On Wed, September 25, 2013 22:34, Alvaro Herrera wrote:
[minmax-5.patch]
I have the impression it's not quite working correctly.
Here's a version 7 of the patch, which fixes these bugs and adds
opclasses for a bunch more types (timestamp, timestamptz, date, time,
timetz), courtesy of Martín Marqués. It's also been rebased to apply
cleanly on top of today's master branch.
I have also added a selectivity function, but I'm not positive that it's
very useful yet.
I tested it with the attached script, but broke out of the "for" loop after 5
iterations (when it had 300,000,005 rows inserted).
Then I did an analyze, and got the error message below:
jjanes=# analyze;
ERROR: could not truncate file "base/16384/16388_vm" to 488 blocks: it's
only 82 blocks now
16388 is the index's relfilenode.
Here is the backtrace upon entry to the truncate that is going to fail:
#0 mdtruncate (reln=0x23c91b0, forknum=VISIBILITYMAP_FORKNUM, nblocks=488)
at md.c:858
#1 0x000000000048eb4a in mmRevmapTruncate (rmAccess=0x26ad878,
heapNumBlocks=1327434) at mmrevmap.c:360
#2 0x000000000048d37a in mmvacuumcleanup (fcinfo=<value optimized out>) at
minmax.c:1264
#3 0x000000000072dcef in FunctionCall2Coll (flinfo=<value optimized out>,
collation=<value optimized out>, arg1=<value optimized out>,
arg2=<value optimized out>) at fmgr.c:1323
#4 0x000000000048c1e5 in index_vacuum_cleanup (info=<value optimized out>,
stats=0x0) at indexam.c:715
#5 0x000000000052a7ce in do_analyze_rel (onerel=0x7f59798589e8,
vacstmt=0x23b0bd8, acquirefunc=0x5298d0 <acquire_sample_rows>,
relpages=1327434,
inh=0 '\000', elevel=13) at analyze.c:634
#6 0x000000000052b320 in analyze_rel (relid=<value optimized out>,
vacstmt=0x23b0bd8, bstrategy=<value optimized out>) at analyze.c:267
#7 0x000000000057cba7 in vacuum (vacstmt=0x23b0bd8, relid=<value optimized
out>, do_toast=1 '\001', bstrategy=<value optimized out>,
for_wraparound=0 '\000', isTopLevel=<value optimized out>) at
vacuum.c:249
#8 0x0000000000663177 in standard_ProcessUtility (parsetree=0x23b0bd8,
queryString=<value optimized out>, context=<value optimized out>,
params=0x0,
dest=<value optimized out>, completionTag=<value optimized out>) at
utility.c:682
#9 0x00007f598290b791 in pgss_ProcessUtility (parsetree=0x23b0bd8,
queryString=0x23b0220 "analyze \n;", context=PROCESS_UTILITY_TOPLEVEL,
params=0x0,
dest=0x23b0f18, completionTag=0x7fffd3442f30 "") at
pg_stat_statements.c:825
#10 0x000000000065fcf7 in PortalRunUtility (portal=0x24195e0,
utilityStmt=0x23b0bd8, isTopLevel=1 '\001', dest=0x23b0f18,
completionTag=0x7fffd3442f30 "")
at pquery.c:1187
#11 0x0000000000660c6d in PortalRunMulti (portal=0x24195e0, isTopLevel=1
'\001', dest=0x23b0f18, altdest=0x23b0f18, completionTag=0x7fffd3442f30 "")
at pquery.c:1318
#12 0x0000000000661323 in PortalRun (portal=0x24195e0,
count=9223372036854775807, isTopLevel=1 '\001', dest=0x23b0f18,
altdest=0x23b0f18,
completionTag=0x7fffd3442f30 "") at pquery.c:816
#13 0x000000000065dbb4 in exec_simple_query (query_string=0x23b0220
"analyze \n;") at postgres.c:1048
#14 0x000000000065f259 in PostgresMain (argc=<value optimized out>,
argv=<value optimized out>, dbname=0x2347be8 "jjanes", username=<value
optimized out>)
at postgres.c:3992
#15 0x000000000061b7d0 in BackendRun (argc=<value optimized out>,
argv=<value optimized out>) at postmaster.c:4085
#16 BackendStartup (argc=<value optimized out>, argv=<value optimized out>)
at postmaster.c:3774
#17 ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at
postmaster.c:1585
#18 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>)
at postmaster.c:1240
#19 0x00000000005b5e90 in main (argc=3, argv=0x2346cd0) at main.c:196
Cheers,
Jeff
Attachments:
On 8 November 2013 20:11, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Erik Rijkers wrote:
On Thu, September 26, 2013 00:34, Erik Rijkers wrote:
On Wed, September 25, 2013 22:34, Alvaro Herrera wrote:
[minmax-5.patch]
I have the impression it's not quite working correctly.
Here's a version 7 of the patch, which fixes these bugs and adds
opclasses for a bunch more types (timestamp, timestamptz, date, time,
timetz), courtesy of Martín Marqués. It's also been rebased to apply
cleanly on top of today's master branch.
I have also added a selectivity function, but I'm not positive that it's
very useful yet.
This patch doesn't appear to have been submitted to any Commitfest.
Is this still a feature undergoing research then?
--
Thom
Thom Brown wrote:
On 8 November 2013 20:11, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Erik Rijkers wrote:
On Thu, September 26, 2013 00:34, Erik Rijkers wrote:
On Wed, September 25, 2013 22:34, Alvaro Herrera wrote:
[minmax-5.patch]
I have the impression it's not quite working correctly.
Here's a version 7 of the patch, which fixes these bugs and adds
opclasses for a bunch more types (timestamp, timestamptz, date, time,
timetz), courtesy of Martín Marqués. It's also been rebased to apply
cleanly on top of today's master branch.
I have also added a selectivity function, but I'm not positive that it's
very useful yet.
This patch doesn't appear to have been submitted to any Commitfest.
Is this still a feature undergoing research then?
It's still a planned feature, but I didn't have time to continue work
for 2014-01.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 24 January 2014 17:53, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Thom Brown wrote:
On 8 November 2013 20:11, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Erik Rijkers wrote:
On Thu, September 26, 2013 00:34, Erik Rijkers wrote:
On Wed, September 25, 2013 22:34, Alvaro Herrera wrote:
[minmax-5.patch]
I have the impression it's not quite working correctly.
Here's a version 7 of the patch, which fixes these bugs and adds
opclasses for a bunch more types (timestamp, timestamptz, date, time,
timetz), courtesy of Martín Marqués. It's also been rebased to apply
cleanly on top of today's master branch.
I have also added a selectivity function, but I'm not positive that it's
very useful yet.
This patch doesn't appear to have been submitted to any Commitfest.
Is this still a feature undergoing research then?
It's still a planned feature, but I didn't have time to continue work
for 2014-01.
Alles klar.
Thanks
--
Thom
On Fri, Jan 24, 2014 at 2:54 PM, Thom Brown <thom@linux.com> wrote:
On 24 January 2014 17:53, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Thom Brown wrote:
On 8 November 2013 20:11, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Erik Rijkers wrote:
On Thu, September 26, 2013 00:34, Erik Rijkers wrote:
On Wed, September 25, 2013 22:34, Alvaro Herrera wrote:
[minmax-5.patch]
I have the impression it's not quite working correctly.
Here's a version 7 of the patch, which fixes these bugs and adds
opclasses for a bunch more types (timestamp, timestamptz, date, time,
timetz), courtesy of Martín Marqués. It's also been rebased to apply
cleanly on top of today's master branch.
I have also added a selectivity function, but I'm not positive that it's
very useful yet.
This patch doesn't appear to have been submitted to any Commitfest.
Is this still a feature undergoing research then?
It's still a planned feature, but I didn't have time to continue work
for 2014-01.
What's the status?
I believe I have more than a use for minmax indexes, and wouldn't mind
lending a hand if it's within my grasp.
On Fri, Jan 24, 2014 at 12:58 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
What's the status?
I believe I have more than a use for minmax indexes, and wouldn't mind
lending a hand if it's within my grasp.
I'm also interested in looking at this. Mostly because I have ideas
for other "summary" functions that would be interesting and could use
the same infrastructure otherwise.
--
greg
Robert Haas wrote:
On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Here's an updated version of this patch, with fixes to all the bugs
reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and
Amit Kapila for the reports.
I'm not very happy with the use of a separate relation fork for
storing this data.
Here's a new version of this patch. Now the revmap is not stored in a
separate fork, but together with all the regular data, as explained
elsewhere in the thread.
I added a few pageinspect functions that let one explore the data in the
index. With this you can start by reading the metapage, and from there
obtain the block numbers for the revmap array pages; and explore revmap
array pages to read regular revmap pages, which contain the TIDs of
index entries. None of these pageinspect functions currently has any
documentation, but using them is as easy as
with idxname as (select 'ti'::text as idxname)
select *
from idxname,
generate_series(0, pg_relation_size(idxname) / 8192 - 1) i,
minmax_page_type(get_raw_page(idxname, i::int));
select * -- data in metapage
from
minmax_metapage_info(get_raw_page('ti', 0));
select * -- data in revmap array pages
from minmax_revmap_array_data(get_raw_page('ti', 6));
select logblk, unnest(pages) -- data in regular revmap pages
from minmax_revmap_data(get_raw_page('ti', 15));
select * -- data in regular index pages
from minmax_page_items(get_raw_page('ti', 2), 'ti'::regclass);
Note that in this last case you need to give it the OID of the index as
the second parameter, so that it can construct a tupledesc for decoding
the min/max data.
I have followed the suggestion by Amit to overwrite the index tuple when
a new heap tuple is inserted, instead of creating a separate index
tuple. This avoids a lot of index bloat. This required a new entry
point in bufpage.c, PageOverwriteItemData(). bufpage.c also has a new
function PageIndexDeleteNoCompact which is similar in spirit to
PageIndexMultiDelete except that item pointers do not change. This is
necessary because the revmap stores item pointers, and such references
would break if we were to renumber items in index pages.
I have also added a reloption for the size of each page range, so you
can do
create index ti on t using minmax (a) with (pages_per_range = 2);
The default is 128 pages per range, and I have an arbitrary maximum of
131072 (the number of pages in a 1 GB segment at the default block size). There doesn't seem to be much
point in having larger page ranges; intuitively I think page ranges
should be more or less the size of kernel readahead, but I haven't
tested this.
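Just to illustrate the knob (the index names below are made up; the table and
column are the ones from the example above), one way to see the effect of the
range size on index size is:

create index ti_32 on t using minmax (a) with (pages_per_range = 32);
create index ti_128 on t using minmax (a); -- default, 128 pages per range
select relname, reloptions, pg_size_pretty(pg_relation_size(oid))
  from pg_class
 where relname in ('ti_32', 'ti_128');

The 32-page-range index should come out roughly four times larger, since it
stores one summary tuple per range.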
I didn't want to rebase past 0ef0b6784 in a hurry. I only know this
applies cleanly on top of fe7337f2dc, so please use that if you want to
play with it. I will post a rebased version shortly.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
minmax-8.patch (text/x-diff)
*** a/contrib/pageinspect/Makefile
--- b/contrib/pageinspect/Makefile
***************
*** 1,7 ****
# contrib/pageinspect/Makefile
MODULE_big = pageinspect
! OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o
EXTENSION = pageinspect
DATA = pageinspect--1.2.sql pageinspect--1.0--1.1.sql \
--- 1,7 ----
# contrib/pageinspect/Makefile
MODULE_big = pageinspect
! OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o mmfuncs.o
EXTENSION = pageinspect
DATA = pageinspect--1.2.sql pageinspect--1.0--1.1.sql \
*** /dev/null
--- b/contrib/pageinspect/mmfuncs.c
***************
*** 0 ****
--- 1,418 ----
+ /*
+ * mmfuncs.c
+ * Functions to investigate MinMax indexes
+ *
+ * Copyright (c) 2013, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/pageinspect/mmfuncs.c
+ */
+ #include "postgres.h"
+
+ #include "access/htup_details.h"
+ #include "access/minmax.h"
+ #include "access/minmax_internal.h"
+ #include "access/minmax_page.h"
+ #include "access/minmax_revmap.h"
+ #include "access/minmax_tuple.h"
+ #include "catalog/index.h"
+ #include "catalog/pg_type.h"
+ #include "funcapi.h"
+ #include "utils/array.h"
+ #include "utils/builtins.h"
+ #include "utils/lsyscache.h"
+ #include "utils/rel.h"
+ #include "miscadmin.h"
+
+ Datum minmax_page_type(PG_FUNCTION_ARGS);
+ Datum minmax_page_items(PG_FUNCTION_ARGS);
+ Datum minmax_metapage_info(PG_FUNCTION_ARGS);
+ Datum minmax_revmap_array_data(PG_FUNCTION_ARGS);
+ Datum minmax_revmap_data(PG_FUNCTION_ARGS);
+
+ PG_FUNCTION_INFO_V1(minmax_page_type);
+ PG_FUNCTION_INFO_V1(minmax_page_items);
+ PG_FUNCTION_INFO_V1(minmax_metapage_info);
+ PG_FUNCTION_INFO_V1(minmax_revmap_array_data);
+ PG_FUNCTION_INFO_V1(minmax_revmap_data);
+
+ typedef struct mm_page_state
+ {
+ TupleDesc tupdesc;
+ Page page;
+ OffsetNumber offset;
+ bool unusedItem;
+ bool done;
+ AttrNumber attno;
+ DeformedMMTuple *dtup;
+ FmgrInfo outputfn[FLEXIBLE_ARRAY_MEMBER];
+ } mm_page_state;
+
+
+ static Page verify_minmax_page(bytea *raw_page, uint16 type,
+ const char *strtype);
+
+ Datum
+ minmax_page_type(PG_FUNCTION_ARGS)
+ {
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Page page = VARDATA(raw_page);
+ MinmaxSpecialSpace *special;
+ char *type;
+
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
+
+ switch (special->type)
+ {
+ case MINMAX_PAGETYPE_META:
+ type = "meta";
+ break;
+ case MINMAX_PAGETYPE_REVMAP_ARRAY:
+ type = "revmap array";
+ break;
+ case MINMAX_PAGETYPE_REVMAP:
+ type = "revmap";
+ break;
+ case MINMAX_PAGETYPE_REGULAR:
+ type = "regular";
+ break;
+ default:
+ type = psprintf("unknown (%02x)", special->type);
+ break;
+ }
+
+ PG_RETURN_TEXT_P(cstring_to_text(type));
+ }
+
+ /*
+ * Verify that the given bytea contains a minmax page of the indicated page
+ * type, or die in the attempt. A pointer to the page is returned.
+ */
+ static Page
+ verify_minmax_page(bytea *raw_page, uint16 type, const char *strtype)
+ {
+ Page page;
+ int raw_page_size;
+ MinmaxSpecialSpace *special;
+
+ raw_page_size = VARSIZE(raw_page) - VARHDRSZ;
+
+ if (raw_page_size < SizeOfPageHeaderData)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("input page too small"),
+ errdetail("Expected size %d, got %d", raw_page_size, BLCKSZ)));
+
+ page = VARDATA(raw_page);
+
+ /* verify the special space says this page is what we want */
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
+ if (special->type != type)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("page is not a Minmax page of type \"%s\"", strtype),
+ errdetail("Expected special type %08x, got %08x.",
+ type, special->type)));
+
+ return page;
+ }
+
+
+ /*
+ * Extract all item values from a minmax index page
+ *
+ * Usage: SELECT * FROM minmax_page_items(get_raw_page('idx', 1), 'idx'::regclass);
+ */
+ Datum
+ minmax_page_items(PG_FUNCTION_ARGS)
+ {
+ mm_page_state *state;
+ FuncCallContext *fctx;
+
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ (errmsg("must be superuser to use raw page functions"))));
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Oid indexRelid = PG_GETARG_OID(1);
+ Page page;
+ TupleDesc tupdesc;
+ MemoryContext mctx;
+ Relation indexRel;
+ AttrNumber attno;
+
+ /* minimally verify the page we got */
+ page = verify_minmax_page(raw_page, MINMAX_PAGETYPE_REGULAR, "regular");
+
+ /* create a function context for cross-call persistence */
+ fctx = SRF_FIRSTCALL_INIT();
+
+ /* switch to memory context appropriate for multiple function calls */
+ mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ indexRel = index_open(indexRelid, AccessShareLock);
+
+ state = palloc(offsetof(mm_page_state, outputfn) +
+ sizeof(FmgrInfo) * RelationGetDescr(indexRel)->natts);
+
+ state->tupdesc = CreateTupleDescCopy(RelationGetDescr(indexRel));
+ state->page = page;
+ state->offset = FirstOffsetNumber;
+ state->unusedItem = false;
+ state->done = false;
+ state->dtup = NULL;
+
+ index_close(indexRel, AccessShareLock);
+
+ for (attno = 1; attno <= state->tupdesc->natts; attno++)
+ {
+ Oid output;
+ bool isVarlena;
+
+ getTypeOutputInfo(state->tupdesc->attrs[attno - 1]->atttypid,
+ &output, &isVarlena);
+ fmgr_info(output, &state->outputfn[attno - 1]);
+ }
+
+ fctx->user_fctx = state;
+ fctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+ MemoryContextSwitchTo(mctx);
+ }
+
+ fctx = SRF_PERCALL_SETUP();
+ state = fctx->user_fctx;
+
+ if (!state->done)
+ {
+ HeapTuple result;
+ Datum values[6];
+ bool nulls[6];
+
+ /*
+ * This loop is called once for every attribute of every tuple in the
+ * page. At the start of a tuple, we get a NULL dtup; that's our
+ * signal for obtaining and decoding the next one. If that's not the
+ * case, we output the next attribute.
+ */
+ if (state->dtup == NULL)
+ {
+ MMTuple *tup;
+ MemoryContext mctx;
+ ItemId itemId;
+
+ /* deformed tuple must live across calls */
+ mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx);
+
+ /* verify item status: if there's no data, we can't decode */
+ itemId = PageGetItemId(state->page, state->offset);
+ if (ItemIdIsUsed(itemId))
+ {
+ tup = (MMTuple *) PageGetItem(state->page,
+ PageGetItemId(state->page,
+ state->offset));
+ state->dtup = minmax_deform_tuple(state->tupdesc, tup);
+ state->attno = 1;
+ state->unusedItem = false;
+ }
+ else
+ state->unusedItem = true;
+
+ MemoryContextSwitchTo(mctx);
+ }
+ else
+ state->attno++;
+
+ MemSet(nulls, 0, sizeof(nulls));
+
+ if (state->unusedItem)
+ {
+ values[0] = UInt16GetDatum(state->offset);
+ nulls[1] = true;
+ nulls[2] = true;
+ nulls[3] = true;
+ nulls[4] = true;
+ nulls[5] = true;
+ }
+ else
+ {
+ int att = state->attno - 1;
+
+ values[0] = UInt16GetDatum(state->offset);
+ values[1] = UInt16GetDatum(state->attno);
+ values[2] = BoolGetDatum(state->dtup->values[att].allnulls);
+ values[3] = BoolGetDatum(state->dtup->values[att].hasnulls);
+ if (!state->dtup->values[att].allnulls)
+ {
+ FmgrInfo *outputfn = &state->outputfn[att];
+ MMValues *mmvalues = &state->dtup->values[att];
+
+ values[4] = CStringGetTextDatum(OutputFunctionCall(outputfn,
+ mmvalues->min));
+ values[5] = CStringGetTextDatum(OutputFunctionCall(outputfn,
+ mmvalues->max));
+ }
+ else
+ {
+ nulls[4] = true;
+ nulls[5] = true;
+ }
+ }
+
+ result = heap_form_tuple(fctx->tuple_desc, values, nulls);
+
+ /*
+ * If the item was unused, jump straight to the next one; otherwise,
+ * the only cleanup needed here is to set our signal to go to the next
+ * tuple in the following iteration, by freeing the current one.
+ */
+ if (state->unusedItem)
+ state->offset = OffsetNumberNext(state->offset);
+ else if (state->attno >= state->tupdesc->natts)
+ {
+ pfree(state->dtup);
+ state->dtup = NULL;
+ state->offset = OffsetNumberNext(state->offset);
+ }
+
+ /*
+ * If we're beyond the end of the page, set flag to end the function in
+ * the following iteration.
+ */
+ if (state->offset > PageGetMaxOffsetNumber(state->page))
+ state->done = true;
+
+ SRF_RETURN_NEXT(fctx, HeapTupleGetDatum(result));
+ }
+
+ SRF_RETURN_DONE(fctx);
+ }
+
+ Datum
+ minmax_metapage_info(PG_FUNCTION_ARGS)
+ {
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Page page;
+ MinmaxMetaPageData *meta;
+ TupleDesc tupdesc;
+ Datum values[2];
+ bool nulls[2];
+ ArrayBuildState *astate = NULL;
+ HeapTuple htup;
+ int i;
+
+ page = verify_minmax_page(raw_page, MINMAX_PAGETYPE_META, "metapage");
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+ tupdesc = BlessTupleDesc(tupdesc);
+
+ /* Extract values from the metapage */
+ meta = (MinmaxMetaPageData *) PageGetContents(page);
+ MemSet(nulls, 0, sizeof(nulls));
+ values[0] = Int32GetDatum(meta->minmaxVersion);
+
+ /* Extract (possibly empty) list of revmap array page numbers. */
+ for (i = 0; i < MAX_REVMAP_ARRAYPAGES; i++)
+ {
+ BlockNumber blkno;
+
+ blkno = meta->revmapArrayPages[i];
+ if (blkno == InvalidBlockNumber)
+ break; /* XXX or continue? */
+ astate = accumArrayResult(astate, Int64GetDatum((int64) blkno),
+ false, INT8OID, CurrentMemoryContext);
+ }
+ if (astate == NULL)
+ nulls[1] = true;
+ else
+ values[1] = makeArrayResult(astate, CurrentMemoryContext);
+
+ htup = heap_form_tuple(tupdesc, values, nulls);
+
+ PG_RETURN_DATUM(HeapTupleGetDatum(htup));
+ }
+
+ /*
+ * Return the BlockNumber array stored in a revmap array page
+ */
+ Datum
+ minmax_revmap_array_data(PG_FUNCTION_ARGS)
+ {
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Page page;
+ ArrayBuildState *astate = NULL;
+ RevmapArrayContents *contents;
+ Datum blkarr;
+ int i;
+
+ page = verify_minmax_page(raw_page, MINMAX_PAGETYPE_REVMAP_ARRAY,
+ "revmap array");
+
+ contents = (RevmapArrayContents *) PageGetContents(page);
+
+ for (i = 0; i < contents->rma_nblocks; i++)
+ astate = accumArrayResult(astate,
+ Int64GetDatum((int64) contents->rma_blocks[i]),
+ false, INT8OID, CurrentMemoryContext);
+ Assert(astate != NULL);
+
+ blkarr = makeArrayResult(astate, CurrentMemoryContext);
+ PG_RETURN_DATUM(blkarr);
+ }
+
+ /*
+ * Return the TID array stored in a minmax revmap page
+ */
+ Datum
+ minmax_revmap_data(PG_FUNCTION_ARGS)
+ {
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Page page;
+ RevmapContents *contents;
+ TupleDesc tupdesc;
+ Datum values[2];
+ bool nulls[2];
+ HeapTuple htup;
+ ArrayBuildState *astate = NULL;
+ int i;
+
+ page = verify_minmax_page(raw_page, MINMAX_PAGETYPE_REVMAP, "revmap");
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+ tupdesc = BlessTupleDesc(tupdesc);
+
+ /* Extract values from the revmap page */
+ contents = (RevmapContents *) PageGetContents(page);
+ MemSet(nulls, 0, sizeof(nulls));
+ values[0] = Int64GetDatum((uint64) contents->rmr_logblk);
+
+ /* Extract (possibly empty) list of TIDs in this page. */
+ for (i = 0; i < REGULAR_REVMAP_PAGE_MAXITEMS; i++)
+ {
+ ItemPointer tid;
+
+ tid = &contents->rmr_tids[i];
+ astate = accumArrayResult(astate,
+ PointerGetDatum(tid),
+ false, TIDOID, CurrentMemoryContext);
+ }
+ if (astate == NULL)
+ nulls[1] = true;
+ else
+ values[1] = makeArrayResult(astate, CurrentMemoryContext);
+
+ htup = heap_form_tuple(tupdesc, values, nulls);
+
+ PG_RETURN_DATUM(HeapTupleGetDatum(htup));
+ }
*** a/contrib/pageinspect/pageinspect--1.2.sql
--- b/contrib/pageinspect/pageinspect--1.2.sql
***************
*** 99,104 **** AS 'MODULE_PATHNAME', 'bt_page_items'
--- 99,148 ----
LANGUAGE C STRICT;
--
+ -- minmax_page_type()
+ --
+ CREATE FUNCTION minmax_page_type(IN page bytea)
+ RETURNS text
+ AS 'MODULE_PATHNAME', 'minmax_page_type'
+ LANGUAGE C STRICT;
+
+ --
+ -- minmax_metapage_info()
+ --
+ CREATE FUNCTION minmax_metapage_info(IN page bytea,
+ OUT version integer, OUT revmap_array_pages BIGINT[])
+ AS 'MODULE_PATHNAME', 'minmax_metapage_info'
+ LANGUAGE C STRICT;
+
+ --
+ -- minmax_page_items()
+ --
+ CREATE FUNCTION minmax_page_items(IN page bytea, IN index_oid oid,
+ OUT itemoffset int,
+ OUT attnum int,
+ OUT allnulls bool,
+ OUT hasnulls bool,
+ OUT min text,
+ OUT max text)
+ RETURNS SETOF record
+ AS 'MODULE_PATHNAME', 'minmax_page_items'
+ LANGUAGE C STRICT;
+
+ --
+ -- minmax_revmap_array_data()
+ CREATE FUNCTION minmax_revmap_array_data(IN page bytea,
+ OUT revmap_pages BIGINT[])
+ AS 'MODULE_PATHNAME', 'minmax_revmap_array_data'
+ LANGUAGE C STRICT;
+
+ --
+ -- minmax_revmap_data()
+ CREATE FUNCTION minmax_revmap_data(IN page bytea,
+ OUT logblk BIGINT, OUT pages tid[])
+ AS 'MODULE_PATHNAME', 'minmax_revmap_data'
+ LANGUAGE C STRICT;
+
+ --
-- fsm_page_contents()
--
CREATE FUNCTION fsm_page_contents(IN page bytea)
*** a/contrib/pg_xlogdump/rmgrdesc.c
--- b/contrib/pg_xlogdump/rmgrdesc.c
***************
*** 13,18 ****
--- 13,19 ----
#include "access/gist_private.h"
#include "access/hash.h"
#include "access/heapam_xlog.h"
+ #include "access/minmax_xlog.h"
#include "access/multixact.h"
#include "access/nbtree.h"
#include "access/rmgr.h"
*** /dev/null
--- b/minmax-proposal
***************
*** 0 ****
--- 1,300 ----
+ Minmax Range Indexes
+ ====================
+
+ Minmax indexes are a new access method intended to enable very fast scanning of
+ extremely large tables.
+
+ The essential idea of a minmax index is to keep track of the min() and max()
+ values in consecutive groups of heap pages (page ranges). These values can be
+ used by constraint exclusion to avoid scanning such pages, depending on query
+ quals.
+
+ The main drawback of this is having to update the stored min/max values of each
+ page range as tuples are inserted into them.
+
+ Other database systems already have this feature. Some examples:
+
+ * Oracle Exadata calls this "storage indexes"
+ http://richardfoote.wordpress.com/category/storage-indexes/
+
+ * Netezza has "zone maps"
+ http://nztips.com/2010/11/netezza-integer-join-keys/
+
+ * Infobright has this automatically within their "data packs"
+ http://www.infobright.org/Blog/Entry/organizing_data_and_more_about_rough_data_contest/
+
+ * MonetDB seems to have it
+ http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.2662
+ "Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS"
+
+ Index creation
+ --------------
+
+ To create a minmax index, we use the standard wording:
+
+ CREATE INDEX foo_minmax_idx ON foo USING MINMAX (a, b, e);
+
+ Partial indexes are not supported; since an index is concerned with minimum and
+ maximum values of the involved columns across all the pages in the table, it
+ doesn't make sense to exclude values. Another way to see "partial" indexes
+ here would be those that only considered some pages in the table instead of all
+ of them; but this would be difficult to implement and manage and, most likely,
+ pointless.
+
+ Expressional indexes can probably be supported in the future, but we disallow
+ them initially for conceptual simplicity.
+
+ Having multiple minmax indexes in the same table is acceptable, though most of
+ the time it would make more sense to have a single index covering all the
+ interesting columns. Multiple indexes might be useful for columns added later.
+
+ Access Method Design
+ --------------------
+
+ Since item pointers are not stored inside indexes of this type, it is not
+ possible to support the amgettuple interface. Instead, we only provide
+ amgetbitmap support; scanning a relation using this index requires a recheck
+ node on top. The amgetbitmap routine would return a TIDBitmap comprising all
+ the pages in those page groups that match the query qualifications; the recheck
+ node prunes tuples that are not visible per snapshot and those that are not
+ visible per query quals.
+
+ For each supported datatype, we need an opclass with the following catalog
+ entries:
+
+ - support operators (pg_amop): same as btree (<, <=, =, >=, >)
+
+ These operators are used pervasively:
+
+ - The optimizer requires them to evaluate queries, so that the index is chosen
+ when queries on the indexed table are planned.
+ - During index construction (ambuild), they are used to determine the boundary
+ values for each page range.
+ - During index updates (aminsert), they are used to determine whether the new
+ heap tuple matches the existing index tuple; and if not, they are used to
+ construct the new index tuple.
+
+ In each index tuple (corresponding to one page range), we store:
+ - for each indexed column:
+ * minimum value across all tuples in the range
+ * maximum value across all tuples in the range
+ * are there nulls present in any tuple?
+ * are all the values in all tuples in the range null?
+
+ These null bits are stored in a single null bitmask of length 2x number of
+ columns.
+
+ With the default INDEX_MAX_KEYS of 32, and considering columns of 8-byte length
+ types such as timestamptz or bigint, each tuple would be 522 bytes in length,
+ which seems reasonable. There are 6 extra bytes for padding between the null
+ bitmask and the first data item, assuming 64-bit alignment; so the total size
+ for such an index tuple would actually be 528 bytes.
+
+ This maximum index tuple size is calculated as: mt_info (2 bytes) + null bitmap
+ (8 bytes) + data value (8 bytes) * 32 * 2
+
+ (Of course, larger columns are possible, such as varchar, but creating minmax
+ indexes on such columns seems of little practical usefulness. Also, the
+ usefulness of an index containing so many columns is dubious, at best.)
+
+ There can be gaps where some pages have no covering index entry. In particular,
+ the last few pages of the table would commonly not be summarized.
+
+ The Range Reverse Map
+ ---------------------
+
+ To find out the index tuple for a particular page range, we have a
+ separate fork called the range reverse map. This fork stores one TID per
+ range, which is the address of the index tuple summarizing that range. Since
+ these map entries are fixed size, it is possible to compute the address of the
+ range map entry for any given heap page.
+
+ When a new heap tuple is inserted in a summarized page range, it is possible to
+ compare the existing index tuple with the new heap tuple. If the heap tuple is
+ outside the minimum/maximum boundaries given by the index tuple for any indexed
+ column (or if the new heap tuple contains null values but the index tuple
+ indicates there are no nulls), it is necessary to create a new index tuple with
+ the new values. To do this, a new index tuple is inserted, and the reverse range
+ map is updated to point to it. The old index tuple is left in place, for later
+ garbage collection.
+
+ If the reverse range map points to an invalid TID, the corresponding page range
+ is not summarized.
+
+ A minmax index is updated by creating a new summary tuple whenever an
+ insertion outside the min-max interval occurs in the pages within the range.
+
+ To scan a table following a minmax index, we scan the reverse range map
+ sequentially. This yields index tuples in ascending page range order.
+ Query quals are matched to each index tuple; if they match, each page within
+ the page range is returned as part of the output TID bitmap. If there's no
+ match, they are skipped. Reverse range map entries returning invalid index
+ TIDs, that is unsummarized page ranges, are also returned in the TID bitmap.
+
+ To store the range reverse map, we reuse the VISIBILITYMAP_FORKNUM, since a VM
+ does not make sense for a minmax index anyway (XXX -- really??)
+
+ When tuples are added to unsummarized pages, nothing needs to happen.
+
+ Heap tuples can be removed from anywhere without restriction.
+
+ Index entries that are not referenced from the revmap can be removed from the
+ main fork. This currently happens at amvacuumcleanup, though it could be
+ carried out separately; no heap scan is necessary to determine which tuples
+ are unreachable.
+
+ Summarization
+ -------------
+
+ At index creation time, the whole table is scanned; for each page range the
+ minimum and maximum values of each indexed column and nulls bitmap are
+ collected and stored in the index. The possibly-incomplete range at the end
+ of the table is not included.
+
+ Once in a while, it is necessary to summarize a bunch of unsummarized pages
+ (because the table has grown since the index was created), or re-summarize a
+ range that has been marked invalid. This is simple: scan the page range
+ calculating the min() and max() for each indexed column, then insert the new
+ index entry at the end of the index. The main interesting questions are:
+
+ a) when to do it
+ The perfect time to do it is as soon as a complete page range of the
+ configured range size has been filled.
+
+ b) who does it (what process)
+ It doesn't seem a good idea to have a client-connected process do it;
+ it would incur unwanted latency. Three other options are (i) to spawn a
+ specialized process to do it, which perhaps can be signalled by a
+ client-connected process that executes a scan and notices the need to run
+ summarization; or (ii) to let autovacuum do it, as a separate new
+ maintenance task. This seems simple enough to bolt on top of already
+ existing autovacuum infrastructure. The timing constraints of autovacuum
+ might be undesirable, though. (iii) wait for user command.
+
+ The easiest way to go about this seems to be to have vacuum do it. That way we can
+ simply do re-summarization on the amvacuumcleanup routine. Other answers would
+ mean we need a separate AM routine, which appears unwarranted at this stage.
+
+ Vacuuming
+ ---------
+
+ Vacuuming a table that has a minmax index does not represent a significant
+ challenge. Since no heap TIDs are stored, it's not necessary to scan the index
+ when heap tuples are removed. It might be that some min() value can be
+ incremented, or some max() value can be decremented; but this would represent
+ an optimization opportunity only, not a correctness issue. Perhaps it's
+ simpler to represent this as the need to re-run summarization on the affected
+ page range.
+
+ Note that if there are no indexes on the table other than the minmax index,
+ usage of maintenance_work_mem by vacuum can be decreased significantly, because
+ no detailed index scan needs to take place (and thus it's not necessary for
+ vacuum to save TIDs to remove). This optimization opportunity is best left for
+ future improvement.
+
+ Locking considerations
+ ----------------------
+
+ To read the TID during an index scan, we follow this protocol:
+
+ * read revmap page
+ * obtain share lock on the revmap buffer
+ * read the TID
+ * obtain share lock on buffer of main fork
+ * LockTuple the TID (using the index as relation). A shared lock is
+ sufficient. We need the LockTuple to prevent VACUUM from recycling
+ the index tuple; see below.
+ * release revmap buffer lock
+ * read the index tuple
+ * release the tuple lock
+ * release main fork buffer lock
+
+
+ To update the summary tuple for a page range, we use this protocol:
+
+ * insert a new index tuple somewhere in the main fork; note its TID
+ * read revmap page
+ * obtain exclusive lock on revmap buffer
+ * write the TID
+ * release lock
+
+ This ensures no concurrent reader can obtain a partially-written TID.
+ Note we don't need a tuple lock here. Concurrent scans don't have to
+ worry about whether they got the old or new index tuple: if they get the
+ old one, the tighter values are okay from a correctness standpoint because
+ due to MVCC they can't possibly see the just-inserted heap tuples anyway.
+
+
+ For vacuuming, we need to figure out which index tuples are no longer
+ referenced from the reverse range map. This requires some brute force,
+ but is simple:
+
+ 1) scan the complete index, store each existing TID in a dynahash.
+ Hash key is the TID, hash value is a boolean initially set to false.
+ 2) scan the complete revmap sequentially, read the TIDs on each page. Share
+ lock on each page is sufficient. For each TID so obtained, grab the
+ element from the hash and update the boolean to true.
+ 3) Scan the index again; for each tuple found, search the hash table.
+ If the tuple is not present in hash, it must have been added after our
+ initial scan; ignore it. If tuple is present in hash, and the hash flag is
+ true, then the tuple is referenced from the revmap; ignore it. If the hash
+ flag is false, then the index tuple is no longer referenced by the revmap;
+ but it could be about to be accessed by a concurrent scan. Do
+ ConditionalLockTuple. If this fails, ignore the tuple (it's in use), it
+ will be deleted by a future vacuum. If lock is acquired, then we can safely
+ remove the index tuple.
+ 4) Index pages with free space can be detected by this second scan. Register
+ those with the FSM.
+
+ Note this doesn't require scanning the heap at all, or being involved in
+ the heap's cleanup procedure. Also, there is no need to LockBufferForCleanup,
+ which is a nice property because index scans keep pages pinned for long
+ periods.
+
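+ To make step 3 above concrete, its core looks like the condensed sketch below
+ (remove_deletable_tuples() in minmax.c is the full version, with WAL logging
+ and FSM maintenance; DeletableTuple is the TID-plus-flag hash entry described
+ in step 1, and the hash is assumed to have been filled and flagged by the
+ first two passes already):
+
+   HASH_SEQ_STATUS status;
+   DeletableTuple *hitem;
+
+   hash_seq_init(&status, tuples);
+   while ((hitem = (DeletableTuple *) hash_seq_search(&status)) != NULL)
+   {
+       if (hitem->referenced)
+           continue;       /* still pointed to by the revmap */
+       if (!ConditionalLockTuple(idxRel, &hitem->tid, ExclusiveLock))
+           continue;       /* in use by a scan; a later vacuum will get it */
+       UnlockTuple(idxRel, &hitem->tid, ExclusiveLock);
+       /* hitem->tid is now safe to delete from its index page */
+   }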
+
+
+ Optimizer
+ ---------
+
+ In order to make this all work, the only thing we need is a good enough
+ opclass and amcostestimate routine. With those in place, the optimizer is
+ able to pick up the index on its own.
+
+
+ Open questions
+ --------------
+
+ * Same-size page ranges?
+ Current related literature seems to consider that each "index entry" in a
+ minmax index must cover the same number of pages. There doesn't seem to be a
+ hard reason for this to be so; it might make sense to allow the index to
+ self-tune so that some index entries cover smaller page ranges, if this allows
+ the min()/max() values to be more compact. This would incur larger minmax
+ overhead for the index itself, but might allow better pruning of page ranges
+ during scan. In the limit of one index tuple per page, the index itself would
+ occupy too much space, even though we would be able to skip reading most heap
+ pages because the min()/max() ranges are tight; in the opposite limit of
+ a single tuple that summarizes the whole table, we wouldn't be able to prune
+ anything even though the index is very small. This can probably be made to work
+ by using the reverse range map as an index in itself.
+
+ * More compact representation for TIDBitmap?
+ TIDBitmap is the structure used to represent bitmap scans. The
+ representation of lossy page ranges is not optimal for our purposes, because
+ it uses a Bitmapset to represent pages in the range; since we're going to return
+ all pages in a large range, it might be more convenient to allow for a
+ struct that uses start and end page numbers to represent the range, instead.
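+
+ A struct of roughly this shape would do (purely illustrative; nothing like it
+ exists in tidbitmap.c today):
+
+   typedef struct PageRangeEntry
+   {
+       BlockNumber start;      /* first heap page in the range */
+       BlockNumber end;        /* last heap page in the range, inclusive */
+   } PageRangeEntry;
+
+ A lossy TIDBitmap entry of this form would let mmgetbitmap hand back a whole
+ page range in constant space, instead of setting one Bitmapset bit per page.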
+
+
+
+ References:
+
+ Email thread on pgsql-hackers
+ http://www.postgresql.org/message-id/1199296574.7260.149.camel@ebony.site
+ From: Simon Riggs
+ To: pgsql-hackers
+ Subject: Dynamic Partitioning using Segment Visibility Map
+
+ http://wiki.postgresql.org/wiki/Segment_Exclusion
+ http://wiki.postgresql.org/wiki/Segment_Visibility_Map
+
*** a/src/backend/access/Makefile
--- b/src/backend/access/Makefile
***************
*** 8,13 **** subdir = src/backend/access
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
! SUBDIRS = common gin gist hash heap index nbtree rmgrdesc spgist transam
include $(top_srcdir)/src/backend/common.mk
--- 8,13 ----
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
! SUBDIRS = common gin gist hash heap index minmax nbtree rmgrdesc spgist transam
include $(top_srcdir)/src/backend/common.mk
*** a/src/backend/access/common/reloptions.c
--- b/src/backend/access/common/reloptions.c
***************
*** 209,214 **** static relopt_int intRelOpts[] =
--- 209,221 ----
RELOPT_KIND_HEAP | RELOPT_KIND_TOAST
}, -1, 0, 2000000000
},
+ {
+ {
+ "pages_per_range",
+ "Number of pages that each page range covers in a Minmax index",
+ RELOPT_KIND_MINMAX
+ }, 128, 1, 131072
+ },
/* list terminator */
{{NULL}}
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 271,276 **** initscan(HeapScanDesc scan, ScanKey key, bool is_rescan)
--- 271,278 ----
scan->rs_startblock = 0;
}
+ scan->rs_initblock = 0;
+ scan->rs_numblocks = InvalidBlockNumber;
scan->rs_inited = false;
scan->rs_ctup.t_data = NULL;
ItemPointerSetInvalid(&scan->rs_ctup.t_self);
***************
*** 296,301 **** initscan(HeapScanDesc scan, ScanKey key, bool is_rescan)
--- 298,311 ----
pgstat_count_heap_scan(scan->rs_rd);
}
+ void
+ heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk, BlockNumber numBlks)
+ {
+ scan->rs_startblock = startBlk;
+ scan->rs_initblock = startBlk;
+ scan->rs_numblocks = numBlks;
+ }
+
/*
* heapgetpage - subroutine for heapgettup()
*
***************
*** 636,642 **** heapgettup(HeapScanDesc scan,
*/
if (backward)
{
! finished = (page == scan->rs_startblock);
if (page == 0)
page = scan->rs_nblocks;
page--;
--- 646,653 ----
*/
if (backward)
{
! finished = (page == scan->rs_startblock) ||
! (scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
if (page == 0)
page = scan->rs_nblocks;
page--;
***************
*** 646,652 **** heapgettup(HeapScanDesc scan,
page++;
if (page >= scan->rs_nblocks)
page = 0;
! finished = (page == scan->rs_startblock);
/*
* Report our new scan position for synchronization purposes. We
--- 657,664 ----
page++;
if (page >= scan->rs_nblocks)
page = 0;
! finished = (page == scan->rs_startblock) ||
! (scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
/*
* Report our new scan position for synchronization purposes. We
***************
*** 897,903 **** heapgettup_pagemode(HeapScanDesc scan,
*/
if (backward)
{
! finished = (page == scan->rs_startblock);
if (page == 0)
page = scan->rs_nblocks;
page--;
--- 909,916 ----
*/
if (backward)
{
! finished = (page == scan->rs_startblock) ||
! (scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
if (page == 0)
page = scan->rs_nblocks;
page--;
***************
*** 907,913 **** heapgettup_pagemode(HeapScanDesc scan,
page++;
if (page >= scan->rs_nblocks)
page = 0;
! finished = (page == scan->rs_startblock);
/*
* Report our new scan position for synchronization purposes. We
--- 920,927 ----
page++;
if (page >= scan->rs_nblocks)
page = 0;
! finished = (page == scan->rs_startblock) ||
! (scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
/*
* Report our new scan position for synchronization purposes. We
*** /dev/null
--- b/src/backend/access/minmax/Makefile
***************
*** 0 ****
--- 1,17 ----
+ #-------------------------------------------------------------------------
+ #
+ # Makefile--
+ # Makefile for access/minmax
+ #
+ # IDENTIFICATION
+ # src/backend/access/minmax/Makefile
+ #
+ #-------------------------------------------------------------------------
+
+ subdir = src/backend/access/minmax
+ top_builddir = ../../../..
+ include $(top_builddir)/src/Makefile.global
+
+ OBJS = minmax.o mmrevmap.o mmtuple.o mmxlog.o
+
+ include $(top_srcdir)/src/backend/common.mk
*** /dev/null
--- b/src/backend/access/minmax/minmax.c
***************
*** 0 ****
--- 1,1680 ----
+ /*
+ * minmax.c
+ * Implementation of Minmax indexes for Postgres
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/minmax.c
+ *
+ * TODO
+ * * support collatable datatypes
+ * * ScalarArrayOpExpr
+ * * Make use of the stored NULL bits
+ * * fill in the XLog routines
+ * * we can support unlogged indexes now
+ */
+ #include "postgres.h"
+
+ #include "access/htup_details.h"
+ #include "access/minmax.h"
+ #include "access/minmax_internal.h"
+ #include "access/minmax_page.h"
+ #include "access/minmax_revmap.h"
+ #include "access/minmax_tuple.h"
+ #include "access/minmax_xlog.h"
+ #include "access/reloptions.h"
+ #include "access/relscan.h"
+ #include "access/xlogutils.h"
+ #include "catalog/index.h"
+ #include "catalog/pg_operator.h"
+ #include "commands/vacuum.h"
+ #include "miscadmin.h"
+ #include "pgstat.h"
+ #include "storage/bufmgr.h"
+ #include "storage/freespace.h"
+ #include "storage/lmgr.h"
+ #include "utils/datum.h"
+ #include "utils/lsyscache.h"
+ #include "utils/memutils.h"
+ #include "utils/syscache.h"
+
+
+ /*
+ * We use a MMBuildState during initial construction of a Minmax index.
+ * Within that struct, each column's contruction info is represented by a
+ * MMPerColBuildInfo struct. The running state is all kept in a
+ * DeformedMMTuple.
+ */
+ typedef struct MMPerColBuildInfo
+ {
+ int typLen;
+ bool typByVal;
+ FmgrInfo lt;
+ FmgrInfo gt;
+ } MMPerColBuildInfo;
+
+ typedef struct MMBuildState
+ {
+ Relation irel;
+ int numtuples;
+ Buffer currentInsertBuf;
+ BlockNumber currRangeStart;
+ BlockNumber nextRangeAt;
+ mmRevmapAccess *rmAccess;
+ TupleDesc indexDesc;
+ TupleDesc diskDesc;
+ DeformedMMTuple *dtuple;
+ MMPerColBuildInfo perColState[FLEXIBLE_ARRAY_MEMBER];
+ } MMBuildState;
+
+ static void mmbuildCallback(Relation index,
+ HeapTuple htup, Datum *values, bool *isnull,
+ bool tupleIsAlive, void *state);
+ static void get_mm_operator(Oid opfam, Oid idxtypid, Oid keytypid,
+ StrategyNumber strategy, FmgrInfo *finfo);
+ static inline bool invoke_mm_operator(FmgrInfo *operator, Oid collation,
+ Datum left, Datum right);
+ static void mm_doinsert(Relation idxrel, mmRevmapAccess *rmAccess,
+ Buffer *buffer, BlockNumber heapblkno, MMTuple *tup, Size itemsz);
+ static bool mm_getinsertbuffer(Relation irel, Buffer *buffer, Size itemsz);
+
+
+
+
+ /*
+ * A tuple in the heap is being inserted. To keep a minmax index up to date,
+ * we need to obtain the relevant index tuple, compare its min()/max() stored
+ * values with those of the new tuple; if the tuple values are in range,
+ * there's nothing to do; otherwise we need to update the index (either by
+ * a new index tuple and repointing the revmap, or by overwriting the existing
+ * index tuple).
+ *
+ * If the range is not currently summarized (i.e. the revmap returns InvalidTid
+ * for it), there's nothing to do either.
+ */
+ Datum
+ mminsert(PG_FUNCTION_ARGS)
+ {
+ Relation idxRel = (Relation) PG_GETARG_POINTER(0);
+ Datum *values = (Datum *) PG_GETARG_POINTER(1);
+ bool *nulls = (bool *) PG_GETARG_POINTER(2);
+ ItemPointer heaptid = (ItemPointer) PG_GETARG_POINTER(3);
+
+ /* we ignore the rest of our arguments */
+ mmRevmapAccess *rmAccess;
+ Datum indclassDatum;
+ bool isnull;
+ oidvector *indclass;
+ TupleDesc tupdesc;
+ ItemId origlp;
+ MMTuple *mmtup;
+ DeformedMMTuple *dtup;
+ ItemPointerData idxtid;
+ BlockNumber heapBlk;
+ BlockNumber iblk;
+ OffsetNumber ioff;
+ Buffer buf;
+ IndexInfo *indexInfo;
+ Page page;
+ int keyno;
+ FmgrInfo *lt;
+ FmgrInfo *gt;
+ bool need_insert = false;
+
+ rmAccess = mmRevmapAccessInit(idxRel);
+
+ heapBlk = ItemPointerGetBlockNumber(heaptid);
+ mmGetHeapBlockItemptr(rmAccess, heapBlk, &idxtid);
+ /* tuple lock on idxtid is grabbed by mmGetHeapBlockItemptr */
+
+ if (!ItemPointerIsValid(&idxtid))
+ {
+ /* nothing to do, range is unsummarized */
+ mmRevmapAccessTerminate(rmAccess);
+ return BoolGetDatum(false);
+ }
+
+ tupdesc = RelationGetDescr(idxRel);
+ indexInfo = BuildIndexInfo(idxRel);
+
+ lt = palloc(sizeof(FmgrInfo) * indexInfo->ii_NumIndexAttrs);
+ gt = palloc(sizeof(FmgrInfo) * indexInfo->ii_NumIndexAttrs);
+
+ /* grab the operators we will need: < and > for each indexed column */
+ indclassDatum = SysCacheGetAttr(INDEXRELID, idxRel->rd_indextuple,
+ Anum_pg_index_indclass, &isnull);
+ Assert(!isnull);
+ indclass = (oidvector *) DatumGetPointer(indclassDatum);
+ for (keyno = 0; keyno < indexInfo->ii_NumIndexAttrs; keyno++)
+ {
+ Oid opfam = get_opclass_family(indclass->values[keyno]);
+ Oid idxtypid = tupdesc->attrs[keyno]->atttypid;
+
+ get_mm_operator(opfam, idxtypid, idxtypid, BTLessStrategyNumber,
+ <[keyno]);
+ get_mm_operator(opfam, idxtypid, idxtypid, BTGreaterStrategyNumber,
+ >[keyno]);
+ }
+
+ iblk = ItemPointerGetBlockNumber(&idxtid);
+ ioff = ItemPointerGetOffsetNumber(&idxtid);
+ Assert(iblk != InvalidBlockNumber);
+ buf = ReadBuffer(idxRel, iblk);
+
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ UnlockTuple(idxRel, &idxtid, ShareLock);
+ page = BufferGetPage(buf);
+ origlp = PageGetItemId(page, ioff);
+ mmtup = (MMTuple *) PageGetItem(page, origlp);
+
+ dtup = minmax_deform_tuple(tupdesc, mmtup);
+
+ /*
+ * Compare the key values of the new tuple to the stored index values.
+ * Note that we need to keep checking each column even after noticing that a
+ * new tuple is necessary, because as a side effect this loop will update
+ * the dtup with the values to insert in the new tuple.
+ */
+ for (keyno = 0; keyno < indexInfo->ii_NumIndexAttrs; keyno++)
+ {
+ /*
+ * If the new tuple contains a null in this attr, but the range index
+ * tuple doesn't allow for nulls, we need a new summary tuple.
+ */
+ if (nulls[keyno])
+ {
+ if (!dtup->values[keyno].hasnulls)
+ {
+ need_insert = true;
+ dtup->values[keyno].hasnulls = true;
+ }
+ else
+ continue;
+ }
+
+ /*
+ * If the new key value is not within the min/max interval for this
+ * range, we need a new summary tuple.
+ */
+ if (invoke_mm_operator(<[keyno], InvalidOid, values[keyno],
+ dtup->values[keyno].min))
+ {
+ dtup->values[keyno].min = values[keyno]; /* XXX datumCopy? */
+ need_insert = true;
+ }
+ if (invoke_mm_operator(>[keyno], InvalidOid, values[keyno],
+ dtup->values[keyno].max))
+ {
+ dtup->values[keyno].max = values[keyno]; /* XXX datumCopy? */
+ need_insert = true;
+ }
+ }
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ if (need_insert)
+ {
+ TupleDesc diskDesc;
+ Size tupsz;
+ MMTuple *tup;
+
+ diskDesc = minmax_get_descr(tupdesc);
+ tup = minmax_form_tuple(tupdesc, diskDesc, dtup, &tupsz);
+
+ /*
+ * If the size of the original tuple is greater than or equal to the size
+ * of the new index tuple, we can overwrite it in place. This saves
+ * regular page bloat, and
+ * also saves revmap traffic. This might leave some unused space
+ * before the start of the next tuple, but we don't worry about that
+ * here.
+ *
+ * We avoid doing this when the itempointer of the index tuple would
+ * change, because that would require an update to the revmap while
+ * holding exclusive lock on this page, which would reduce concurrency.
+ *
+ * Note that we continue to access 'origlp' here, even though there
+ * was an interval during which the page wasn't locked. Since we hold
+ * pin on the page, this is okay -- the buffer cannot go away from
+ * under us, and also tuples cannot be shuffled around.
+ */
+ if (tupsz <= ItemIdGetLength(origlp))
+ {
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ START_CRIT_SECTION();
+ PageOverwriteItemData(BufferGetPage(buf),
+ ioff,
+ (Item) tup, tupsz);
+ MarkBufferDirty(buf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(idxRel))
+ {
+ xl_minmax_insert xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+ uint8 info = XLOG_MINMAX_INSERT;
+
+ xlrec.target.node = idxRel->rd_node;
+ xlrec.target.tid = idxtid;
+ xlrec.overwrite = true;
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxInsert;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = (char *) tup;
+ rdata[1].len = tupsz;
+ rdata[1].buffer = buf;
+ rdata[1].buffer_std = true;
+ rdata[1].next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ }
+ else
+ {
+ /*
+ * The new tuple is larger than the original one, so we must insert
+ * a new one the slow way.
+ */
+ mm_doinsert(idxRel, rmAccess, &buf, heapBlk, tup, tupsz);
+
+ #ifdef NOT_YET
+ /*
+ * Possible optimization: if we can grab an exclusive lock on the
+ * buffer containing the old tuple right away, we can also seize
+ * the opportunity to prune the old tuple and avoid some bloat.
+ * This is not necessary for correctness.
+ */
+ if (ConditionalLockBuffer(buf))
+ {
+ /* prune the old tuple */
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ }
+ #endif
+ }
+ }
+
+ ReleaseBuffer(buf);
+
+ mmRevmapAccessTerminate(rmAccess);
+
+ return BoolGetDatum(false);
+ }
+
+ Datum
+ mmbeginscan(PG_FUNCTION_ARGS)
+ {
+ Relation r = (Relation) PG_GETARG_POINTER(0);
+ int nkeys = PG_GETARG_INT32(1);
+ int norderbys = PG_GETARG_INT32(2);
+ IndexScanDesc scan;
+
+ scan = RelationGetIndexScan(r, nkeys, norderbys);
+
+ PG_RETURN_POINTER(scan);
+ }
+
+
+ /*
+ * Execute the index scan.
+ *
+ * This works by reading index TIDs from the revmap, and obtaining the index
+ * tuples pointed to by them; the min/max values in them are compared to the
+ * scan keys. We return into the TID bitmap all the pages in ranges
+ * corresponding to index tuples that match the scan keys.
+ *
+ * If a TID from the revmap is read as InvalidTID, we know that range is
+ * unsummarized. Pages in those ranges need to be returned regardless of scan
+ * keys.
+ */
+ Datum
+ mmgetbitmap(PG_FUNCTION_ARGS)
+ {
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+ TIDBitmap *tbm = (TIDBitmap *) PG_GETARG_POINTER(1);
+ Relation idxRel = scan->indexRelation;
+ Buffer currIdxBuf = InvalidBuffer;
+ Oid heapOid;
+ Relation heapRel;
+ mmRevmapAccess *rmAccess;
+ BlockNumber nblocks;
+ BlockNumber heapBlk;
+ BlockNumber pagesPerRange;
+ TupleDesc tupdesc;
+ AttrNumber keyno;
+ Datum indclassDatum;
+ bool isnull;
+ oidvector *indclass;
+ FmgrInfo *lt;
+ FmgrInfo *lteq;
+ FmgrInfo *gteq;
+ FmgrInfo *gt;
+
+ pgstat_count_index_scan(idxRel);
+
+ heapOid = IndexGetRelation(RelationGetRelid(idxRel), false);
+ heapRel = heap_open(heapOid, AccessShareLock);
+ nblocks = RelationGetNumberOfBlocks(heapRel);
+ heap_close(heapRel, AccessShareLock);
+
+ tupdesc = RelationGetDescr(idxRel);
+
+ lt = palloc(sizeof(FmgrInfo) * scan->numberOfKeys);
+ lteq = palloc(sizeof(FmgrInfo) * scan->numberOfKeys);
+ gteq = palloc(sizeof(FmgrInfo) * scan->numberOfKeys);
+ gt = palloc(sizeof(FmgrInfo) * scan->numberOfKeys);
+
+ /*
+ * lookup the operators needed to determine range containment of each key
+ * value.
+ */
+ indclassDatum = SysCacheGetAttr(INDEXRELID, idxRel->rd_indextuple,
+ Anum_pg_index_indclass, &isnull);
+ Assert(!isnull);
+ indclass = (oidvector *) DatumGetPointer(indclassDatum);
+ for (keyno = 0; keyno < scan->numberOfKeys; keyno++)
+ {
+ AttrNumber keyattno;
+ Oid opfam;
+ Oid keytypid;
+ Oid idxtypid;
+
+ keyattno = scan->keyData[keyno].sk_attno;
+ opfam = get_opclass_family(indclass->values[keyattno - 1]);
+ keytypid = scan->keyData[keyno].sk_subtype;
+ idxtypid = tupdesc->attrs[keyattno - 1]->atttypid;
+
+ get_mm_operator(opfam, idxtypid, keytypid, BTLessStrategyNumber,
+ <[keyno]);
+ get_mm_operator(opfam, idxtypid, keytypid, BTLessEqualStrategyNumber,
+ <eq[keyno]);
+ get_mm_operator(opfam, idxtypid, keytypid, BTGreaterStrategyNumber,
+ >[keyno]);
+ get_mm_operator(opfam, idxtypid, keytypid, BTGreaterEqualStrategyNumber,
+ >eq[keyno]);
+ }
+
+ /*
+ * Now scan the revmap. We start by querying for heap page 0,
+ * incrementing by the number of pages per range; this gives us a full
+ * view of the table.
+ */
+ pagesPerRange = MinmaxGetPagesPerRange(idxRel);
+ rmAccess = mmRevmapAccessInit(idxRel);
+ for (heapBlk = 0; heapBlk < nblocks; heapBlk += pagesPerRange)
+ {
+ ItemPointerData itupptr;
+ bool addrange;
+
+ mmGetHeapBlockItemptr(rmAccess, heapBlk, &itupptr);
+
+ /*
+ * For revmap items that return InvalidTID, we must return the whole
+ * range; otherwise, fetch the index item and compare it to the scan
+ * keys.
+ */
+ if (!ItemPointerIsValid(&itupptr))
+ {
+ addrange = true;
+ }
+ else
+ {
+ Page page;
+ OffsetNumber idxoffno;
+ BlockNumber idxblkno;
+ MMTuple *tup;
+ DeformedMMTuple *dtup;
+ int keyno;
+
+ idxoffno = ItemPointerGetOffsetNumber(&itupptr);
+ idxblkno = ItemPointerGetBlockNumber(&itupptr);
+
+ if (currIdxBuf == InvalidBuffer ||
+ idxblkno != BufferGetBlockNumber(currIdxBuf))
+ {
+ if (currIdxBuf != InvalidBuffer)
+ UnlockReleaseBuffer(currIdxBuf);
+
+ Assert(idxblkno != InvalidBlockNumber);
+ currIdxBuf = ReadBuffer(idxRel, idxblkno);
+ LockBuffer(currIdxBuf, BUFFER_LOCK_SHARE);
+ }
+
+ UnlockTuple(idxRel, &itupptr, ShareLock);
+
+ page = BufferGetPage(currIdxBuf);
+ tup = (MMTuple *)
+ PageGetItem(page, PageGetItemId(page, idxoffno));
+ /* XXX probably need copies */
+ dtup = minmax_deform_tuple(tupdesc, tup);
+
+ /*
+ * Compare scan keys with min/max values stored in range. If scan
+ * keys are matched, the page range must be added to the bitmap.
+ */
+ for (keyno = 0, addrange = true;
+ keyno < scan->numberOfKeys;
+ keyno++)
+ {
+ ScanKey key = &scan->keyData[keyno];
+ AttrNumber keyattno = key->sk_attno;
+
+ /*
+ * The analysis we need to make to decide whether to include a
+ * page range in the output result is: is it possible for a
+ * tuple contained within the min/max interval specified by
+ * this index tuple to match what's specified by the scan key?
+ * For example, for a query qual such as "WHERE col < 10" we
+ * need to include a range whose minimum value is less than
+ * 10.
+ *
+ * When there are multiple scan keys, failure to meet the
+ * criteria for a single one of them is enough to discard the
+ * range as a whole.
+ */
+ switch (key->sk_strategy)
+ {
+ case BTLessStrategyNumber:
+ addrange =
+ invoke_mm_operator(<[keyno], InvalidOid,
+ dtup->values[keyattno - 1].min,
+ key->sk_argument);
+ break;
+ case BTLessEqualStrategyNumber:
+ addrange =
+ invoke_mm_operator(<eq[keyno], InvalidOid,
+ dtup->values[keyattno - 1].min,
+ key->sk_argument);
+ break;
+ case BTEqualStrategyNumber:
+
+ /*
+ * In the equality case (WHERE col = someval), we want
+ * to return the current page range if the minimum
+ * value in the range <= scan key, and the maximum
+ * value >= scan key.
+ */
+ addrange =
+ invoke_mm_operator(<eq[keyno], InvalidOid,
+ dtup->values[keyattno - 1].min,
+ key->sk_argument);
+ if (!addrange)
+ break;
+ /* max() >= scankey */
+ addrange =
+ invoke_mm_operator(>eq[keyno], InvalidOid,
+ dtup->values[keyattno - 1].max,
+ key->sk_argument);
+ break;
+ case BTGreaterEqualStrategyNumber:
+ addrange =
+ invoke_mm_operator(>eq[keyno], InvalidOid,
+ dtup->values[keyattno - 1].max,
+ key->sk_argument);
+ break;
+ case BTGreaterStrategyNumber:
+ addrange =
+ invoke_mm_operator(>[keyno], InvalidOid,
+ dtup->values[keyattno - 1].max,
+ key->sk_argument);
+ break;
+ default:
+ /* can't happen */
+ elog(ERROR, "invalid strategy number %d", key->sk_strategy);
+ addrange = false;
+ break;
+ }
+
+ /*
+ * If the current scan key doesn't match the range values,
+ * don't look at further ones.
+ */
+ if (!addrange)
+ break;
+ }
+
+ /* XXX anything to free here? */
+ pfree(dtup);
+ }
+
+ if (addrange)
+ {
+ BlockNumber pageno;
+
+ for (pageno = heapBlk;
+ pageno <= heapBlk + pagesPerRange - 1;
+ pageno++)
+ tbm_add_page(tbm, pageno);
+ }
+ }
+
+ mmRevmapAccessTerminate(rmAccess);
+ if (currIdxBuf != InvalidBuffer)
+ UnlockReleaseBuffer(currIdxBuf);
+
+ pfree(lt);
+ pfree(lteq);
+ pfree(gt);
+ pfree(gteq);
+
+ PG_RETURN_INT64(MaxHeapTuplesPerPage);
+ }
+
+
+ Datum
+ mmrescan(PG_FUNCTION_ARGS)
+ {
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+ ScanKey scankey = (ScanKey) PG_GETARG_POINTER(1);
+
+ /* other arguments ignored */
+
+ if (scankey && scan->numberOfKeys > 0)
+ {
+ memmove(scan->keyData, scankey,
+ scan->numberOfKeys * sizeof(ScanKeyData));
+ }
+
+ PG_RETURN_VOID();
+ }
+
+ Datum
+ mmendscan(PG_FUNCTION_ARGS)
+ {
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+
+ /* anything to do here? */
+ (void) scan; /* silence compiler */
+
+ PG_RETURN_VOID();
+ }
+
+ Datum
+ mmmarkpos(PG_FUNCTION_ARGS)
+ {
+ elog(ERROR, "MinMax does not support mark/restore");
+ PG_RETURN_VOID();
+ }
+
+ Datum
+ mmrestrpos(PG_FUNCTION_ARGS)
+ {
+ elog(ERROR, "MinMax does not support mark/restore");
+ PG_RETURN_VOID();
+ }
+
+ /*
+ * Reset the per-column build state in an MMBuildState.
+ */
+ static void
+ clear_mm_percol_buildstate(MMBuildState *mmstate)
+ {
+ int i;
+
+ for (i = 0; i < mmstate->indexDesc->natts; i++)
+ {
+ mmstate->dtuple->values[i].allnulls = true;
+ mmstate->dtuple->values[i].hasnulls = false;
+ mmstate->dtuple->values[i].min = (Datum) 0;
+ mmstate->dtuple->values[i].max = (Datum) 0;
+ }
+ }
+
+ /*
+ * Per-heap-tuple callback for IndexBuildHeapScan.
+ *
+ * Note we don't worry about the page range at the end of the table here; it is
+ * present in the build state struct after we're called the last time, but not
+ * inserted into the index. The caller must insert it, if appropriate.
+ */
+ static void
+ mmbuildCallback(Relation index,
+ HeapTuple htup,
+ Datum *values,
+ bool *isnull,
+ bool tupleIsAlive,
+ void *state)
+ {
+ MMBuildState *mmstate = (MMBuildState *) state;
+ BlockNumber thisblock;
+ int i;
+
+ thisblock = ItemPointerGetBlockNumber(&htup->t_self);
+
+ /*
+ * If we're in a new block which belongs to the next range, summarize what
+ * we've got and start afresh.
+ */
+ if (thisblock == mmstate->nextRangeAt)
+ {
+ MMTuple *tup;
+ Size size;
+
+ MINMAX_elog(DEBUG2, "mmbuildCallback: completed a range: %u--%u",
+ mmstate->currRangeStart,
+ mmstate->nextRangeAt);
+ #if 0
+ for (i = 0; i < mmstate->indexDesc->natts; i++)
+ {
+ elog(DEBUG2, "completed a range for column %d, range: %u .. %u",
+ i,
+ DatumGetUInt32(mmstate->dtuple->values[i].min),
+ DatumGetUInt32(mmstate->dtuple->values[i].max));
+ }
+ #endif
+
+ /*
+ * Create the index tuple containing min/max values, and insert it.
+ */
+ tup = minmax_form_tuple(mmstate->indexDesc, mmstate->diskDesc,
+ mmstate->dtuple, &size);
+ mm_doinsert(mmstate->irel, mmstate->rmAccess,
+ &mmstate->currentInsertBuf, mmstate->currRangeStart, tup,
+ size);
+ mmstate->numtuples++;
+ pfree(tup);
+
+ /* and set state to correspond to the new current range */
+ mmstate->currRangeStart = mmstate->nextRangeAt;
+ mmstate->nextRangeAt = mmstate->currRangeStart + MinmaxGetPagesPerRange(index);
+
+ /* initialize aggregate state for the new range */
+ for (i = 0; i < mmstate->indexDesc->natts; i++)
+ {
+ if (!mmstate->dtuple->values[i].allnulls &&
+ !mmstate->perColState[i].typByVal)
+ {
+ pfree(DatumGetPointer(mmstate->dtuple->values[i].min));
+ pfree(DatumGetPointer(mmstate->dtuple->values[i].max));
+ }
+ }
+
+ clear_mm_percol_buildstate(mmstate);
+ }
+
+ /* Accumulate the current tuple into the running state */
+ for (i = 0; i < mmstate->indexDesc->natts; i++)
+ {
+ /*
+ * If the value in the current heap tuple is null, there's not much to
+ * do other than keep track that we saw it.
+ */
+ if (isnull[i])
+ {
+ mmstate->dtuple->values[i].hasnulls = true;
+ continue;
+ }
+
+ /*
+ * If this is the first tuple in the range containing a not-null value
+ * for this column, initialize our state.
+ */
+ if (mmstate->dtuple->values[i].allnulls)
+ {
+ mmstate->dtuple->values[i].allnulls = false;
+ mmstate->dtuple->values[i].min =
+ datumCopy(values[i],
+ mmstate->perColState[i].typByVal,
+ mmstate->perColState[i].typLen);
+ mmstate->dtuple->values[i].max =
+ datumCopy(values[i],
+ mmstate->perColState[i].typByVal,
+ mmstate->perColState[i].typLen);
+ continue;
+ }
+
+ /*
+ * Otherwise, dtuple state was already initialized, and the current
+ * tuple is not null: therefore we need to compare it to the current
+ * state and possibly update the min/max boundaries.
+ */
+ if (invoke_mm_operator(&mmstate->perColState[i].lt, InvalidOid,
+ values[i],
+ mmstate->dtuple->values[i].min))
+ {
+ if (!mmstate->perColState[i].typByVal)
+ pfree(DatumGetPointer(mmstate->dtuple->values[i].min));
+ mmstate->dtuple->values[i].min =
+ datumCopy(values[i],
+ mmstate->perColState[i].typByVal,
+ mmstate->perColState[i].typLen);
+ }
+
+ if (invoke_mm_operator(&mmstate->perColState[i].gt, InvalidOid,
+ values[i],
+ mmstate->dtuple->values[i].max))
+ {
+ if (!mmstate->perColState[i].typByVal)
+ pfree(DatumGetPointer(mmstate->dtuple->values[i].max));
+ mmstate->dtuple->values[i].max =
+ datumCopy(values[i],
+ mmstate->perColState[i].typByVal,
+ mmstate->perColState[i].typLen);
+ }
+ }
+ }
+
+ /*
+ * Initialize a MMBuildState appropriate to create tuples on the given index.
+ */
+ static MMBuildState *
+ initialize_mm_buildstate(Relation heapRel, Relation idxRel,
+ mmRevmapAccess *rmAccess, IndexInfo *indexInfo)
+ {
+ MMBuildState *mmstate;
+ TupleDesc heapDesc = RelationGetDescr(heapRel);
+ Datum indclassDatum;
+ bool isnull;
+ oidvector *indclass;
+ int i;
+
+ mmstate = palloc(offsetof(MMBuildState, perColState) +
+ sizeof(MMPerColBuildInfo) * indexInfo->ii_NumIndexAttrs);
+
+ mmstate->irel = idxRel;
+ mmstate->numtuples = 0;
+ mmstate->currentInsertBuf = InvalidBuffer;
+ mmstate->currRangeStart = 0;
+ mmstate->nextRangeAt = MinmaxGetPagesPerRange(idxRel);
+ mmstate->rmAccess = rmAccess;
+ mmstate->indexDesc = RelationGetDescr(idxRel);
+ mmstate->diskDesc = minmax_get_descr(mmstate->indexDesc);
+
+ mmstate->dtuple = palloc(offsetof(DeformedMMTuple, values) +
+ sizeof(MMValues) * indexInfo->ii_NumIndexAttrs);
+ /* other stuff in dtuple is initialized below */
+
+ indclassDatum = SysCacheGetAttr(INDEXRELID, idxRel->rd_indextuple,
+ Anum_pg_index_indclass, &isnull);
+ Assert(!isnull);
+ indclass = (oidvector *) DatumGetPointer(indclassDatum);
+
+ for (i = 0; i < mmstate->indexDesc->natts; i++)
+ {
+ int heapAttno;
+ Form_pg_attribute attr;
+ Oid opfam = get_opclass_family(indclass->values[i]);
+ Oid idxtypid = mmstate->indexDesc->attrs[i]->atttypid;
+
+ heapAttno = indexInfo->ii_KeyAttrNumbers[i];
+ if (heapAttno == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot create minmax indexes on expressions")));
+
+ attr = heapDesc->attrs[heapAttno - 1];
+ mmstate->perColState[i].typByVal = attr->attbyval;
+ mmstate->perColState[i].typLen = attr->attlen;
+ get_mm_operator(opfam, idxtypid, idxtypid, BTLessStrategyNumber,
+ &(mmstate->perColState[i].lt));
+ get_mm_operator(opfam, idxtypid, idxtypid, BTGreaterStrategyNumber,
+ &(mmstate->perColState[i].gt));
+
+ /* initialize per-column state */
+ }
+
+ clear_mm_percol_buildstate(mmstate);
+
+ return mmstate;
+ }
+
+ /*
+ * Initialize a page with the given type.
+ *
+ * Caller is responsible for marking it dirty, as appropriate.
+ */
+ void
+ mm_page_init(Page page, uint16 type)
+ {
+ MinmaxSpecialSpace *special;
+
+ PageInit(page, BLCKSZ, sizeof(MinmaxSpecialSpace));
+
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
+ special->type = type;
+ }
+
+ /*
+ * Initialize a new minmax index' metapage.
+ */
+ void
+ mm_metapage_init(Page page)
+ {
+ MinmaxMetaPageData *metadata;
+ int i;
+
+ mm_page_init(page, MINMAX_PAGETYPE_META);
+
+ metadata = (MinmaxMetaPageData *) PageGetContents(page);
+
+ metadata->minmaxVersion = MINMAX_CURRENT_VERSION;
+ for (i = 0; i < MAX_REVMAP_ARRAYPAGES; i++)
+ metadata->revmapArrayPages[i] = InvalidBlockNumber;
+ }
+
+ /*
+ * mmbuild() -- build a new minmax index.
+ */
+ Datum
+ mmbuild(PG_FUNCTION_ARGS)
+ {
+ Relation heap = (Relation) PG_GETARG_POINTER(0);
+ Relation index = (Relation) PG_GETARG_POINTER(1);
+ IndexInfo *indexInfo = (IndexInfo *) PG_GETARG_POINTER(2);
+ IndexBuildResult *result;
+ double reltuples;
+ mmRevmapAccess *rmAccess;
+ MMBuildState *mmstate;
+ Buffer meta;
+
+ /*
+ * We expect to be called exactly once for any index relation.
+ */
+ if (RelationGetNumberOfBlocks(index) != 0)
+ elog(ERROR, "index \"%s\" already contains data",
+ RelationGetRelationName(index));
+
+ /* partial indexes not supported */
+ if (indexInfo->ii_Predicate != NIL)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("partial indexes not supported")));
+ /* expressions not supported (yet?) */
+ if (indexInfo->ii_Expressions != NIL)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("expression indexes not supported")));
+
+ meta = mm_getnewbuffer(index);
+ START_CRIT_SECTION();
+ mm_metapage_init(BufferGetPage(meta));
+ MarkBufferDirty(meta);
+
+ if (RelationNeedsWAL(index))
+ {
+ XLogRecPtr recptr;
+ XLogRecData rdata;
+ Page page;
+
+ rdata.buffer = InvalidBuffer;
+ rdata.data = (char *) &(index->rd_node);
+ rdata.len = sizeof(RelFileNode);
+ rdata.next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_CREATE_INDEX, &rdata);
+
+ page = BufferGetPage(meta);
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+ UnlockReleaseBuffer(meta);
+
+ /* set up our "reverse map" */
+ mmRevmapCreate(index);
+
+ /*
+ * Initialize our state, including the deformed tuple state.
+ */
+ rmAccess = mmRevmapAccessInit(index);
+ mmstate = initialize_mm_buildstate(heap, index, rmAccess, indexInfo);
+
+ /*
+ * Now scan the relation. No syncscan allowed here because we want the
+ * heap blocks in order
+ */
+ reltuples = IndexBuildHeapScan(heap, index, indexInfo, false,
+ mmbuildCallback, (void *) mmstate);
+
+ /* XXX process the final batch, if needed */
+
+
+ /* release the last index buffer used */
+ if (!BufferIsInvalid(mmstate->currentInsertBuf))
+ {
+ ReleaseBuffer(mmstate->currentInsertBuf);
+ mmstate->currentInsertBuf = InvalidBuffer;
+ }
+
+ mmRevmapAccessTerminate(mmstate->rmAccess);
+
+ /*
+ * Return statistics
+ */
+ result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+
+ result->heap_tuples = reltuples;
+ result->index_tuples = mmstate->numtuples;
+
+ PG_RETURN_POINTER(result);
+ }
+
+ Datum
+ mmbuildempty(PG_FUNCTION_ARGS)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("unlogged MinMax indexes are not supported")));
+
+ PG_RETURN_VOID();
+ }
+
+
+ /*
+ * qsort comparator for ItemPointerData items
+ */
+ static int
+ qsortCompareItemPointers(const void *a, const void *b)
+ {
+ return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
+ }
+
+ /*
+ * Remove index tuples that are no longer useful.
+ *
+ * While at it, return in nonsummed the array (and in numnonsummed its size) of
+ * block numbers for which the revmap returns InvalidTid; this is used in a
+ * later stage to execute re-summarization. (Each block number returned
+ * corresponds to the heap page number with which each unsummarized range
+ * starts.) Space for the array is palloc'ed, and must be freed by caller.
+ *
+ * idxRel is the index relation; heapNumBlocks is the size of the heap
+ * relation; strategy is appropriate for bulk scanning.
+ */
+ static void
+ remove_deletable_tuples(Relation idxRel, BlockNumber heapNumBlocks,
+ BufferAccessStrategy strategy,
+ BlockNumber **nonsummed, int *numnonsummed)
+ {
+ HASHCTL hctl;
+ HTAB *tuples;
+ HASH_SEQ_STATUS status;
+ BlockNumber nblocks;
+ BlockNumber blk;
+ mmRevmapAccess *rmAccess;
+ BlockNumber heapBlk;
+ BlockNumber pagesPerRange;
+ int numitems = 0;
+ int numdeletable = 0;
+ ItemPointerData *deletable;
+ int start;
+ int i;
+ BlockNumber *nonsumm = NULL;
+ int maxnonsumm = 0;
+ int numnonsumm = 0;
+
+ typedef struct DeletableTuple
+ {
+ ItemPointerData tid;
+ bool referenced;
+ } DeletableTuple;
+
+ nblocks = RelationGetNumberOfBlocks(idxRel);
+
+ /* Initialize hash used to track deletable tuples */
+ memset(&hctl, 0, sizeof(hctl));
+ hctl.keysize = sizeof(ItemPointerData);
+ hctl.entrysize = sizeof(DeletableTuple);
+ hctl.hcxt = CurrentMemoryContext;
+ hctl.hash = tag_hash;
+
+ /* assume ten entries per page. No harm in getting this wrong */
+ tuples = hash_create("mmvacuumcleanup", nblocks * 10, &hctl,
+ HASH_CONTEXT | HASH_FUNCTION | HASH_ELEM);
+
+ /*
+ * Scan the index sequentially, entering each item into a hash table.
+ * Initially, the items are marked as not referenced.
+ */
+ for (blk = 0; blk < nblocks; blk++)
+ {
+ Buffer buf;
+ Page page;
+ OffsetNumber offno;
+
+ vacuum_delay_point();
+
+ buf = ReadBufferExtended(idxRel, MAIN_FORKNUM, blk, RBM_NORMAL,
+ strategy);
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ page = BufferGetPage(buf);
+
+ for (offno = 1; offno <= PageGetMaxOffsetNumber(page); offno++)
+ {
+ ItemPointerData tid;
+ ItemId itemid;
+ bool found;
+ DeletableTuple *hitem;
+
+ itemid = PageGetItemId(page, offno);
+ if (!ItemIdHasStorage(itemid))
+ continue;
+
+ ItemPointerSet(&tid, blk, offno);
+ hitem = (DeletableTuple *)
+ hash_search(tuples, &tid, HASH_ENTER, &found);
+ Assert(!found);
+ hitem->referenced = false;
+ numitems++;
+ }
+ UnlockReleaseBuffer(buf);
+ }
+
+ /*
+ * now scan the revmap, and determine which of these TIDs are still
+ * referenced
+ */
+ pagesPerRange = MinmaxGetPagesPerRange(idxRel);
+ rmAccess = mmRevmapAccessInit(idxRel);
+ for (heapBlk = 0; heapBlk < heapNumBlocks; heapBlk += pagesPerRange)
+ {
+ ItemPointerData itupptr;
+ DeletableTuple *hitem;
+ bool found;
+
+ mmGetHeapBlockItemptr(rmAccess, heapBlk, &itupptr);
+
+ if (!ItemPointerIsValid(&itupptr))
+ {
+ /*
+ * Ignore revmap entries set to invalid. Before doing so, if the
+ * heap page range is complete but not summarized, store its
+ * initial page number in the unsummarized array, for later
+ * summarization.
+ */
+ if (heapBlk + pagesPerRange <= heapNumBlocks)
+ {
+ if (maxnonsumm == 0)
+ {
+ Assert(!nonsumm);
+ maxnonsumm = 8;
+ nonsumm = palloc(sizeof(BlockNumber) * maxnonsumm);
+ }
+ else if (numnonsumm >= maxnonsumm)
+ {
+ maxnonsumm *= 2;
+ nonsumm = repalloc(nonsumm, sizeof(BlockNumber) * maxnonsumm);
+ }
+
+ nonsumm[numnonsumm++] = heapBlk;
+ }
+
+ continue;
+ }
+ else
+ UnlockTuple(idxRel, &itupptr, ShareLock);
+
+ hitem = (DeletableTuple *) hash_search(tuples,
+ &itupptr,
+ HASH_FIND,
+ &found);
+ if (!found)
+ elog(ERROR, "reverse map references nonexistent index tuple %u/%u",
+ ItemPointerGetBlockNumber(&itupptr),
+ ItemPointerGetOffsetNumber(&itupptr));
+ hitem->referenced = true;
+
+ /* discount items set as referenced */
+ numitems--;
+ }
+ Assert(numitems >= 0);
+
+ mmRevmapAccessTerminate(rmAccess);
+
+ /*
+ * Now scan the hash, and keep track of the removable (i.e. not referenced,
+ * not locked) tuples.
+ */
+ deletable = palloc(sizeof(ItemPointerData) * numitems);
+
+ hash_freeze(tuples);
+ hash_seq_init(&status, tuples);
+ for (;;)
+ {
+ DeletableTuple *hitem;
+
+ hitem = hash_seq_search(&status);
+ if (!hitem)
+ break;
+ if (hitem->referenced)
+ continue;
+ if (!ConditionalLockTuple(idxRel, &hitem->tid, ExclusiveLock))
+ continue;
+
+ /*
+ * By here, we know this tuple is not referenced from the revmap.
+ * Also, since we hold the tuple lock, we know that if there is a
+ * concurrent scan that had obtained the tuple before the reference
+ * got removed, either that scan is not looking at the tuple (because
+ * that would have prevented us from getting the tuple lock) or it is
+ * holding the containing buffer's lock. If the former, then there's
+ * no problem with removing the tuple immediately; if the latter, we
+ * will block below trying to acquire that lock, so by the time we are
+ * unblocked, the concurrent scan will no longer be interested in the
+ * tuple contents anymore. Therefore, this tuple can be removed from
+ * the block.
+ */
+ UnlockTuple(idxRel, &hitem->tid, ExclusiveLock);
+
+ deletable[numdeletable++] = hitem->tid;
+ }
+
+ /*
+ * Now sort the array of deletable index tuples, and walk this array by
+ * pages doing bulk deletion of items on each page; the free space map is
+ * updated for pages on which we delete item.
+ */
+ qsort(deletable, numdeletable, sizeof(ItemPointerData),
+ qsortCompareItemPointers);
+
+ start = 0;
+ for (i = 0; i < numdeletable; i++)
+ {
+ if (i == numdeletable - 1 ||
+ (ItemPointerGetBlockNumber(&deletable[start]) !=
+ ItemPointerGetBlockNumber(&deletable[i + 1])))
+ {
+ OffsetNumber *offnos;
+ int noffs;
+ Buffer buf;
+ Page page;
+ int j;
+ BlockNumber blk;
+ int freespace;
+
+ vacuum_delay_point();
+
+ blk = ItemPointerGetBlockNumber(&deletable[start]);
+ buf = ReadBufferExtended(idxRel, MAIN_FORKNUM, blk,
+ RBM_NORMAL, strategy);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ page = BufferGetPage(buf);
+
+ noffs = i + 1 - start;
+ offnos = palloc(sizeof(OffsetNumber) * noffs);
+
+ for (j = 0; j < noffs; j++)
+ offnos[j] = ItemPointerGetOffsetNumber(&deletable[start + j]);
+
+ /*
+ * Now defragment the page.
+ */
+ START_CRIT_SECTION();
+
+ PageIndexDeleteNoCompact(page, offnos, noffs);
+ MarkBufferDirty(buf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(idxRel))
+ {
+ xl_minmax_bulkremove xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+
+ xlrec.node = idxRel->rd_node;
+ xlrec.block = blk;
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxBulkRemove;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].buffer_std = false;
+ rdata[0].next = &(rdata[1]);
+
+ /*
+ * The OffsetNumber array is not actually in the buffer, but we
+ * pretend that it is. When XLogInsert stores the whole
+ * buffer, the offset array need not be stored too.
+ */
+ rdata[1].data = (char *) offnos;
+ rdata[1].len = sizeof(OffsetNumber) * noffs;
+ rdata[1].buffer = buf;
+ rdata[1].buffer_std = true;
+ rdata[1].next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_BULKREMOVE,
+ rdata);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /* next iteration starts where this one ended */
+ start = i + 1;
+
+ /* remember free space while we have the buffer locked */
+ freespace = PageGetFreeSpace(page);
+
+ UnlockReleaseBuffer(buf);
+ pfree(offnos);
+
+ RecordPageWithFreeSpace(idxRel, blk, freespace);
+ }
+ }
+
+ pfree(deletable);
+
+ /* Finally, ensure the index' FSM is consistent */
+ FreeSpaceMapVacuum(idxRel);
+
+ *nonsummed = nonsumm;
+ *numnonsummed = numnonsumm;
+
+ hash_destroy(tuples);
+ }
+
+ /*
+ * Summarize the given page ranges of the given index.
+ */
+ static void
+ rerun_summarization(Relation idxRel, Relation heapRel, mmRevmapAccess *rmAccess,
+ BlockNumber *nonsummarized, int numnonsummarized)
+ {
+ int i;
+ IndexInfo *indexInfo;
+ MMBuildState *mmstate;
+ BlockNumber pagesPerRange;
+
+ indexInfo = BuildIndexInfo(idxRel);
+ pagesPerRange = MinmaxGetPagesPerRange(idxRel);
+
+ mmstate = initialize_mm_buildstate(heapRel, idxRel, rmAccess, indexInfo);
+
+ for (i = 0; i < numnonsummarized; i++)
+ {
+ BlockNumber blk = nonsummarized[i];
+ ItemPointerData iptr;
+ MMTuple *tup;
+ Size size;
+
+ mmstate->currRangeStart = blk;
+ mmstate->nextRangeAt = blk + pagesPerRange;
+
+ mmGetHeapBlockItemptr(rmAccess, blk, &iptr);
+ /* it can't have been re-summarized concurrently .. */
+ Assert(!ItemPointerIsValid(&iptr));
+
+ IndexBuildHeapRangeScan(heapRel, idxRel, indexInfo, false,
+ blk, pagesPerRange,
+ mmbuildCallback, (void *) mmstate);
+
+ /*
+ * Create the index tuple containing min/max values, and insert it.
+ * Note mmbuildCallback didn't have the chance to actually insert
+ * anything into the index, because the heapscan should have ended
+ * just as it reached the final tuple in the range.
+ */
+ tup = minmax_form_tuple(mmstate->indexDesc, mmstate->diskDesc,
+ mmstate->dtuple, &size);
+ mm_doinsert(mmstate->irel, mmstate->rmAccess,
+ &mmstate->currentInsertBuf, mmstate->currRangeStart, tup,
+ size);
+ mmstate->numtuples++;
+ pfree(tup);
+
+ clear_mm_percol_buildstate(mmstate);
+ }
+
+ if (!BufferIsInvalid(mmstate->currentInsertBuf))
+ {
+ ReleaseBuffer(mmstate->currentInsertBuf);
+ mmstate->currentInsertBuf = InvalidBuffer;
+ }
+ }
+
+ /*
+ * ambulkdelete
+ * Since minmax indexes have no per-heap-tuple index tuples, there is not
+ * a lot we can do here.
+ *
+ * XXX we could mark item tuples as "dirty" (when a minimum or maximum heap
+ * tuple is deleted), meaning the need to re-run summarization on the affected
+ * range. We'd need to expand on-disk mmtuples with an extra flag for that,
+ * though.
+ */
+ Datum
+ mmbulkdelete(PG_FUNCTION_ARGS)
+ {
+ /* other arguments are not currently used */
+ IndexBulkDeleteResult *stats = (IndexBulkDeleteResult *) PG_GETARG_POINTER(1);
+
+ /* allocate stats if first time through, else re-use existing struct */
+ if (stats == NULL)
+ stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+ PG_RETURN_POINTER(stats);
+ }
+
+ /*
+ * This routine is in charge of "vacuuming" a minmax index: 1) remove index
+ * tuples that are no longer referenced from the revmap. 2) summarize ranges
+ * that are currently unsummarized.
+ */
+ Datum
+ mmvacuumcleanup(PG_FUNCTION_ARGS)
+ {
+ IndexVacuumInfo *info = (IndexVacuumInfo *) PG_GETARG_POINTER(0);
+ IndexBulkDeleteResult *stats = (IndexBulkDeleteResult *) PG_GETARG_POINTER(1);
+ mmRevmapAccess *rmAccess;
+ BlockNumber *nonsummarized = NULL;
+ int numnonsummarized;
+ Relation heapRel;
+ BlockNumber heapNumBlocks;
+
+ /* No-op in ANALYZE ONLY mode */
+ if (info->analyze_only)
+ PG_RETURN_POINTER(stats);
+
+ rmAccess = mmRevmapAccessInit(info->index);
+
+ heapRel = heap_open(IndexGetRelation(RelationGetRelid(info->index), false),
+ AccessShareLock);
+
+ /*
+ * First scan the index, removing index tuples that are no longer
+ * referenced from the revmap. While at it, collect the page numbers of
+ * ranges that are not summarized.
+ */
+ heapNumBlocks = RelationGetNumberOfBlocks(heapRel);
+ remove_deletable_tuples(info->index, heapNumBlocks, info->strategy,
+ &nonsummarized, &numnonsummarized);
+
+ /* and summarize the ranges collected above */
+ if (nonsummarized)
+ {
+ rerun_summarization(info->index, heapRel, rmAccess,
+ nonsummarized, numnonsummarized);
+ pfree(nonsummarized);
+ }
+
+ mmRevmapAccessTerminate(rmAccess);
+ heap_close(heapRel, AccessShareLock);
+
+ PG_RETURN_POINTER(stats);
+ }
+
+ Datum
+ mmoptions(PG_FUNCTION_ARGS)
+ {
+ Datum reloptions = PG_GETARG_DATUM(0);
+ bool validate = PG_GETARG_BOOL(1);
+ relopt_value *options;
+ MinmaxOptions *rdopts;
+ int numoptions;
+ static const relopt_parse_elt tab[] = {
+ {"pages_per_range", RELOPT_TYPE_INT, offsetof(MinmaxOptions, pagesPerRange)}
+ };
+
+ options = parseRelOptions(reloptions, validate, RELOPT_KIND_MINMAX,
+ &numoptions);
+
+ /* if none set, we're done */
+ if (numoptions == 0)
+ PG_RETURN_NULL();
+
+ rdopts = allocateReloptStruct(sizeof(MinmaxOptions), options, numoptions);
+
+ fillRelOptions((void *) rdopts, sizeof(MinmaxOptions), options, numoptions,
+ validate, tab, lengthof(tab));
+
+ pfree(options);
+
+ PG_RETURN_BYTEA_P(rdopts);
+ }
+
+ /*
+ * Fill the given finfo to enable calls to the operator specified by the given
+ * parameters.
+ */
+ static void
+ get_mm_operator(Oid opfam, Oid idxtypid, Oid keytypid,
+ StrategyNumber strategy, FmgrInfo *finfo)
+ {
+ Oid oprid;
+ HeapTuple oper;
+
+ oprid = get_opfamily_member(opfam, idxtypid, keytypid, strategy);
+ if (!OidIsValid(oprid))
+ elog(ERROR, "missing operator %d(%u,%u) in opfamily %u",
+ strategy, idxtypid, keytypid, opfam);
+
+ oper = SearchSysCache1(OPEROID, oprid);
+ if (!HeapTupleIsValid(oper))
+ elog(ERROR, "cache lookup failed for operator %u", oprid);
+
+ fmgr_info(((Form_pg_operator) GETSTRUCT(oper))->oprcode, finfo);
+ ReleaseSysCache(oper);
+ }
+
+ /*
+ * Invoke the given operator, and return the result as a C boolean.
+ */
+ static inline bool
+ invoke_mm_operator(FmgrInfo *operator, Oid collation, Datum left, Datum right)
+ {
+ Datum result;
+
+ result = FunctionCall2Coll(operator, collation, left, right);
+
+ return DatumGetBool(result);
+ }
+
+ /*
+ * Insert an index tuple into the index relation. The revmap is updated to
+ * mark the range containing the given page as pointing to the inserted entry.
+ *
+ * The buffer, if valid, is checked for free space to insert the new entry;
+ * if there isn't enough, a new buffer is obtained and pinned.
+ *
+ * The buffer is marked dirty.
+ */
+ static void
+ mm_doinsert(Relation idxrel, mmRevmapAccess *rmAccess, Buffer *buffer,
+ BlockNumber heapblkno, MMTuple *tup, Size itemsz)
+ {
+ Page page;
+ BlockNumber blk;
+ OffsetNumber off;
+ bool extended;
+
+ itemsz = MAXALIGN(itemsz);
+
+ extended = mm_getinsertbuffer(idxrel, buffer, itemsz);
+ page = BufferGetPage(*buffer);
+
+ if (PageGetFreeSpace(page) < itemsz)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("index row size %lu exceeds maximum for index \"%s\"",
+ (unsigned long) itemsz, RelationGetRelationName(idxrel))));
+
+ START_CRIT_SECTION();
+ off = PageAddItem(page, (Item) tup, itemsz, InvalidOffsetNumber,
+ false, false);
+ MarkBufferDirty(*buffer);
+
+ blk = BufferGetBlockNumber(*buffer);
+ MINMAX_elog(DEBUG2, "inserted tuple (%u,%u) for range starting at %u",
+ blk, off, heapblkno);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(idxrel))
+ {
+ xl_minmax_insert xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+ uint8 info = XLOG_MINMAX_INSERT;
+
+ xlrec.target.node = idxrel->rd_node;
+ ItemPointerSet(&xlrec.target.tid, blk, off);
+ xlrec.overwrite = false;
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxInsert;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].buffer_std = false;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = (char *) tup;
+ rdata[1].len = itemsz;
+ rdata[1].buffer = *buffer;
+ rdata[1].buffer_std = true;
+ rdata[1].next = NULL;
+
+ /*
+ * If this is the first tuple in the page, we can reinit the page
+ * instead of restoring the whole thing. Set flag, and hide buffer
+ * references from XLogInsert.
+ */
+ if (extended)
+ {
+ info |= XLOG_MINMAX_INIT_PAGE;
+ rdata[1].buffer = InvalidBuffer;
+ }
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /*
+ * Note we need to keep the lock on the buffer until after the revmap
+ * has been updated. Otherwise, a concurrent scanner could try to obtain
+ * the index tuple from the revmap before we're done writing it.
+ */
+ mmSetHeapBlockItemptr(rmAccess, heapblkno, blk, off);
+
+ LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+ }
+
+ /*
+ * Return an exclusively-locked buffer resulting from extending the relation.
+ */
+ Buffer
+ mm_getnewbuffer(Relation irel)
+ {
+ Buffer buffer;
+ bool needLock = !RELATION_IS_LOCAL(irel);
+
+ /* FIXME need to request a MaxFSMRequestSize page from the FSM here */
+
+ if (needLock)
+ LockRelationForExtension(irel, ExclusiveLock);
+
+ buffer = ReadBuffer(irel, P_NEW);
+ LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+ MINMAX_elog(DEBUG2, "mm_getnewbuffer: extending to page %u",
+ BufferGetBlockNumber(buffer));
+
+ if (needLock)
+ UnlockRelationForExtension(irel, ExclusiveLock);
+
+ return buffer;
+ }
+
+ /*
+ * Return a pinned and locked buffer which can be used to insert an index item
+ * of size itemsz.
+ *
+ * The passed buffer argument is tested for free space; if it has enough, it is
+ * locked and returned. Otherwise, that buffer (if valid) is unpinned, a new
+ * buffer is obtained, and returned pinned and locked.
+ *
+ * If there's no existing page with enough free space to accommodate the new item,
+ * the relation is extended. This function returns true if this happens, false
+ * otherwise.
+ */
+ static bool
+ mm_getinsertbuffer(Relation irel, Buffer *buffer, Size itemsz)
+ {
+ Buffer buf;
+ bool extended = false;
+
+ buf = *buffer;
+
+ if (BufferIsInvalid(buf) ||
+ (PageGetFreeSpace(BufferGetPage(buf)) < itemsz))
+ {
+ Page page;
+
+ /*
+ * By the time we break out of this loop, buf is a locked and pinned
+ * buffer which has enough free space to satisfy the requirement.
+ */
+ for (;;)
+ {
+ BlockNumber blk;
+ int freespace;
+
+ blk = GetPageWithFreeSpace(irel, itemsz);
+ if (blk == InvalidBlockNumber)
+ {
+ /*
+ * There's not enough free space in any existing index page,
+ * according to the FSM: extend the relation to obtain a shiny
+ * new page.
+ */
+ buf = mm_getnewbuffer(irel);
+ page = BufferGetPage(buf);
+ mm_page_init(page, MINMAX_PAGETYPE_REGULAR);
+
+ /*
+ * If an entirely new page does not contain enough free space
+ * for the new item, then surely that item is oversized.
+ * Complain loudly; but first make sure we record the page as
+ * free, for next time.
+ */
+ freespace = PageGetFreeSpace(page);
+ RecordPageWithFreeSpace(irel, BufferGetBlockNumber(buf),
+ freespace);
+ if (freespace < itemsz)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("index row size %lu exceeds maximum %lu for index \"%s\"",
+ (unsigned long) itemsz,
+ (unsigned long) freespace,
+ RelationGetRelationName(irel))));
+ extended = true;
+ break;
+ }
+
+ Assert(blk != InvalidBlockNumber);
+ buf = ReadBuffer(irel, blk);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ page = BufferGetPage(buf);
+ freespace = PageGetFreeSpace(page);
+ if (freespace >= itemsz)
+ break;
+
+ /* Not really enough space: register reality and start over */
+ UnlockReleaseBuffer(buf);
+ RecordPageWithFreeSpace(irel, blk, freespace);
+ }
+
+ if (!BufferIsInvalid(*buffer))
+ ReleaseBuffer(*buffer);
+ *buffer = buf;
+ }
+ else
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ /*
+ * XXX we could perhaps avoid this if we used RelationSetTargetBlock ...
+ */
+ if (extended)
+ FreeSpaceMapVacuum(irel);
+
+ return extended;
+ }
*** /dev/null
--- b/src/backend/access/minmax/mmrevmap.c
***************
*** 0 ****
--- 1,679 ----
+ /*
+ * mmrevmap.c
+ * Reverse range map for MinMax indexes
+ *
+ * The reverse range map (revmap) is a translation structure for minmax
+ * indexes: for each page range, there is one most-up-to-date summary tuple,
+ * and its location is tracked by the revmap. Whenever a new tuple is inserted
+ * into a table that violates the previously recorded min/max values, a new
+ * tuple is inserted into the index and the revmap is updated to point to it.
+ *
+ * The pages of the revmap are interspersed in the index's main fork. The
+ * first revmap page is always the index's page number one (that is,
+ * immediately after the metapage). Subsequent revmap pages are allocated as
+ * they are needed; their locations are tracked by "array pages". The metapage
+ * contains a large BlockNumber array whose elements correspond to array
+ * pages. Thus, to find the second revmap page, we read the metapage and
+ * obtain the block
+ * number of the first array page; we then read that page, and the first
+ * element in it is the revmap page we're looking for.
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmrevmap.c
+ */
+ #include "postgres.h"
+
+ #include "access/heapam_xlog.h"
+ #include "access/minmax.h"
+ #include "access/minmax_internal.h"
+ #include "access/minmax_page.h"
+ #include "access/minmax_revmap.h"
+ #include "access/minmax_xlog.h"
+ #include "access/rmgr.h"
+ #include "miscadmin.h"
+ #include "storage/bufmgr.h"
+ #include "storage/lmgr.h"
+ #include "storage/relfilenode.h"
+ #include "storage/smgr.h"
+ #include "utils/memutils.h"
+
+
+
+ /*
+ * In regular revmap pages, each item stores an ItemPointerData. These defines
+ * let one find the logical revmap page number and index number of the revmap
+ * item for the given heap block number.
+ */
+ #define HEAPBLK_TO_REVMAP_BLK(pagesPerRange, heapBlk) \
+ ((heapBlk / pagesPerRange) / REGULAR_REVMAP_PAGE_MAXITEMS)
+ #define HEAPBLK_TO_REVMAP_INDEX(pagesPerRange, heapBlk) \
+ ((heapBlk / pagesPerRange) % REGULAR_REVMAP_PAGE_MAXITEMS)
+
+ /*
+ * In array revmap pages, each item stores a BlockNumber. These defines let
+ * one find the page and index number of a given revmap block number. Note
+ * that the first revmap page (revmap logical page number 0) is always stored
+ * in physical block number 1, so array pages do not store that one.
+ */
+ #define MAPBLK_TO_RMARRAY_BLK(rmBlk) ((rmBlk - 1) / ARRAY_REVMAP_PAGE_MAXITEMS)
+ #define MAPBLK_TO_RMARRAY_INDEX(rmBlk) ((rmBlk - 1) % ARRAY_REVMAP_PAGE_MAXITEMS)
+
+
+ struct mmRevmapAccess
+ {
+ Relation idxrel;
+ BlockNumber pagesPerRange;
+ Buffer currBuf;
+ Buffer currArrayBuf;
+ BlockNumber *revmapArrayPages;
+ };
+ /* typedef appears in minmax_revmap.h */
+
+
+ /*
+ * Initialize an access object for a reverse range map, which can be used to
+ * read stuff from it. This must be freed by mmRevmapAccessTerminate when caller
+ * is done with it.
+ */
+ mmRevmapAccess *
+ mmRevmapAccessInit(Relation idxrel)
+ {
+ mmRevmapAccess *rmAccess = palloc(sizeof(mmRevmapAccess));
+
+ rmAccess->idxrel = idxrel;
+ rmAccess->pagesPerRange = MinmaxGetPagesPerRange(idxrel);
+ rmAccess->currBuf = InvalidBuffer;
+ rmAccess->currArrayBuf = InvalidBuffer;
+ rmAccess->revmapArrayPages = NULL;
+
+ return rmAccess;
+ }
+
+ /*
+ * Release resources associated with a revmap access object.
+ */
+ void
+ mmRevmapAccessTerminate(mmRevmapAccess *rmAccess)
+ {
+ if (rmAccess->revmapArrayPages != NULL)
+ pfree(rmAccess->revmapArrayPages);
+ if (rmAccess->currBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currBuf);
+ if (rmAccess->currArrayBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currArrayBuf);
+ pfree(rmAccess);
+ }
+
+ /*
+ * In the given revmap page, which is used in a minmax index of pagesPerRange
+ * pages per range, set the element corresponding to heap block number heapBlk
+ * to the value (blkno, offno).
+ *
+ * Caller must have obtained the correct revmap page.
+ *
+ * This is used both in regular operation and during WAL replay.
+ */
+ void
+ rm_page_set_iptr(Page page, int pagesPerRange, BlockNumber heapBlk,
+ BlockNumber blkno, OffsetNumber offno)
+ {
+ RevmapContents *contents;
+ ItemPointerData *iptr;
+
+ contents = (RevmapContents *) PageGetContents(page);
+ iptr = (ItemPointerData *) contents->rmr_tids;
+ iptr += HEAPBLK_TO_REVMAP_INDEX(pagesPerRange, heapBlk);
+
+ ItemPointerSet(iptr, blkno, offno);
+ }
+
+ /*
+ * Initialize a new regular revmap page, which stores the given revmap logical
+ * page number. The newly allocated physical block number is returned.
+ *
+ * Used both by regular code path as well as during xlog replay.
+ */
+ BlockNumber
+ initialize_rmr_page(Buffer newbuf, BlockNumber mapBlk)
+ {
+ BlockNumber blkno;
+ Page page;
+ RevmapContents *contents;
+
+ page = BufferGetPage(newbuf);
+
+ mm_page_init(page, MINMAX_PAGETYPE_REVMAP);
+ contents = (RevmapContents *) PageGetContents(page);
+ contents->rmr_logblk = mapBlk;
+ /* the rmr_tids array is initialized to all invalid by PageInit */
+
+ blkno = BufferGetBlockNumber(newbuf);
+
+ return blkno;
+ }
+
+ /*
+ * Read the metapage, lock it as specified by caller, and update the given
+ * rmAccess with the metapage data. Return value is the locked buffer, which
+ * must be unlocked and released by caller.
+ */
+ static Buffer
+ rmaccess_get_metapage(mmRevmapAccess *rmAccess, int lockmode)
+ {
+ Buffer meta;
+ MinmaxMetaPageData *metadata;
+ MinmaxSpecialSpace *special PG_USED_FOR_ASSERTS_ONLY;
+ Page metapage;
+
+ meta = ReadBuffer(rmAccess->idxrel, MINMAX_METAPAGE_BLKNO);
+ LockBuffer(meta, lockmode);
+
+ metapage = BufferGetPage(meta);
+
+ #ifdef USE_ASSERT_CHECKING
+ /* ensure we really got the metapage */
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(metapage);
+ Assert(special->type == MINMAX_PAGETYPE_META);
+ #endif
+
+ /* first time through? allocate the array */
+ if (rmAccess->revmapArrayPages == NULL)
+ rmAccess->revmapArrayPages =
+ palloc(sizeof(BlockNumber) * MAX_REVMAP_ARRAYPAGES);
+
+ metadata = (MinmaxMetaPageData *) PageGetContents(metapage);
+ memcpy(rmAccess->revmapArrayPages, metadata->revmapArrayPages,
+ sizeof(BlockNumber) * MAX_REVMAP_ARRAYPAGES);
+
+ return meta;
+ }
+
+ /*
+ * Given a buffer (hopefully containing a blank page), set it up as a revmap
+ * array page.
+ *
+ * Used both by regular code path as well as during xlog replay.
+ */
+ void
+ initialize_rma_page(Buffer buf)
+ {
+ Page arrayPg;
+ RevmapArrayContents *contents;
+
+ arrayPg = BufferGetPage(buf);
+ mm_page_init(arrayPg, MINMAX_PAGETYPE_REVMAP_ARRAY);
+ contents = (RevmapArrayContents *) PageGetContents(arrayPg);
+ contents->rma_nblocks = 0;
+ /* set the whole array to InvalidBlockNumber */
+ memset(contents->rma_blocks, 0xFF,
+ sizeof(BlockNumber) * ARRAY_REVMAP_PAGE_MAXITEMS);
+ }
+
+ /*
+ * Update the metapage, so that item arrayBlkIdx in the array of revmap array
+ * pages points to block number newPgBlkno
+ */
+ static void
+ update_minmax_metapg(Relation idxrel, Buffer meta, uint32 arrayBlkIdx,
+ BlockNumber newPgBlkno)
+ {
+ MinmaxMetaPageData *metadata;
+
+ metadata = (MinmaxMetaPageData *) PageGetContents(BufferGetPage(meta));
+
+ START_CRIT_SECTION();
+ metadata->revmapArrayPages[arrayBlkIdx] = newPgBlkno;
+ MarkBufferDirty(meta);
+ if (RelationNeedsWAL(idxrel))
+ {
+ xl_minmax_metapg_set xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata;
+
+ xlrec.node = idxrel->rd_node;
+ xlrec.blkidx = arrayBlkIdx;
+ xlrec.newpg = newPgBlkno;
+
+ rdata.data = (char *) &xlrec;
+ rdata.len = SizeOfMinmaxMetapgSet;
+ rdata.buffer = InvalidBuffer;
+ rdata.buffer_std = false;
+ rdata.next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_METAPG_SET, &rdata);
+ PageSetLSN(BufferGetPage(meta), recptr);
+ }
+ END_CRIT_SECTION();
+ }
+
+ /*
+ * Given a logical revmap block number, find its physical block number.
+ *
+ * Note this might involve up to two buffer reads, including a possible
+ * update to the metapage.
+ *
+ * If extend is set to true, and the page hasn't been set yet, extend the
+ * array to point to a newly allocated page.
+ */
+ static BlockNumber
+ rm_get_phys_blkno(mmRevmapAccess *rmAccess, BlockNumber mapBlk, bool extend)
+ {
+ int arrayBlkIdx;
+ BlockNumber arrayBlk;
+ RevmapArrayContents *contents;
+ int revmapIdx;
+ BlockNumber targetblk;
+
+ /* the first revmap page is always block number 1 */
+ if (mapBlk == 0)
+ return (BlockNumber) 1;
+
+ /*
+ * For all other cases, take the long route of checking the metapage and
+ * revmap array pages.
+ */
+
+ /*
+ * Copy the revmap array from the metapage into private storage, if not
+ * done already in this scan.
+ */
+ if (rmAccess->revmapArrayPages == NULL)
+ {
+ Buffer meta;
+
+ meta = rmaccess_get_metapage(rmAccess, BUFFER_LOCK_SHARE);
+ UnlockReleaseBuffer(meta);
+ }
+
+ /*
+ * Consult the metapage array; if the array page we need is not set there,
+ * we need to extend the index to allocate the array page, and update the
+ * metapage array.
+ */
+ arrayBlkIdx = MAPBLK_TO_RMARRAY_BLK(mapBlk);
+ if (arrayBlkIdx >= MAX_REVMAP_ARRAYPAGES)
+ elog(ERROR, "non-existent revmap array page requested");
+
+ arrayBlk = rmAccess->revmapArrayPages[arrayBlkIdx];
+ if (arrayBlk == InvalidBlockNumber)
+ {
+ Buffer meta;
+
+ /* if not asked to extend, there's no further work to do here */
+ if (!extend)
+ return InvalidBlockNumber;
+
+ /*
+ * If we need to create a new array page, check the metapage again;
+ * someone might have created it after the last time we read the
+ * metapage. This time we acquire an exclusive lock, since we may need
+ * to extend. Lock before doing the physical relation extension, to
+ * avoid leaving an unused page around in case someone does this
+ * concurrently. Note that, unfortunately, we will be keeping the lock
+ * on the metapage alongside the relation extension lock, while doing a
+ * syscall involving disk I/O. Extending to add a new revmap array page
+ * is fairly infrequent, so it shouldn't be too bad.
+ *
+ * XXX it is possible to extend the relation unconditionally before
+ * locking the metapage, and later if we find that someone else had
+ * already added this page, save the page in FSM as MaxFSMRequestSize.
+ * That would be better for concurrency. Explore someday.
+ */
+ meta = rmaccess_get_metapage(rmAccess, BUFFER_LOCK_EXCLUSIVE);
+
+ if (rmAccess->revmapArrayPages[arrayBlkIdx] == InvalidBlockNumber)
+ {
+ BlockNumber newPgBlkno;
+
+ /*
+ * Ok, definitely need to allocate a new revmap array page;
+ * initialize a new page to the initial (empty) array revmap state
+ * and register it in metapage.
+ */
+ START_CRIT_SECTION();
+ rmAccess->currArrayBuf = mm_getnewbuffer(rmAccess->idxrel);
+ initialize_rma_page(rmAccess->currArrayBuf);
+ MarkBufferDirty(rmAccess->currArrayBuf);
+ if (RelationNeedsWAL(rmAccess->idxrel))
+ {
+ xl_minmax_init_rmpg xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata;
+
+ xlrec.node = rmAccess->idxrel->rd_node;
+ xlrec.blkno = BufferGetBlockNumber(rmAccess->currArrayBuf);
+ xlrec.array = true;
+ xlrec.logblk = InvalidBlockNumber;
+
+ rdata.data = (char *) &xlrec;
+ rdata.len = SizeOfMinmaxInitRmpg;
+ rdata.buffer = InvalidBuffer; /* FIXME */
+ rdata.buffer_std = false;
+ rdata.next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_INIT_RMPG, &rdata);
+ PageSetLSN(BufferGetPage(rmAccess->currArrayBuf), recptr);
+ }
+ END_CRIT_SECTION();
+ LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_UNLOCK);
+ newPgBlkno = BufferGetBlockNumber(rmAccess->currArrayBuf);
+ rmAccess->revmapArrayPages[arrayBlkIdx] = newPgBlkno;
+
+ MINMAX_elog(DEBUG2, "allocated block for revmap array page: %u",
+ BufferGetBlockNumber(rmAccess->currArrayBuf));
+
+ /* Update the metapage to point to the new array page. */
+ update_minmax_metapg(rmAccess->idxrel, meta, arrayBlkIdx,
+ newPgBlkno);
+ }
+
+ UnlockReleaseBuffer(meta);
+ arrayBlk = rmAccess->revmapArrayPages[arrayBlkIdx];
+ }
+
+ /*
+ * By here, we know the array page is set in the metapage array. Read that
+ * page; except that if we just allocated it, or we already hold pin on it,
+ * we don't need to read it again. XXX but we didn't hold lock!
+ */
+ Assert(arrayBlk != InvalidBlockNumber);
+
+ if (rmAccess->currArrayBuf == InvalidBuffer ||
+ BufferGetBlockNumber(rmAccess->currArrayBuf) != arrayBlk)
+ {
+ if (rmAccess->currArrayBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currArrayBuf);
+
+ rmAccess->currArrayBuf =
+ ReadBuffer(rmAccess->idxrel, arrayBlk);
+ }
+
+ LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_SHARE);
+
+ /*
+ * And now we can inspect its contents; if the target page is set, we can
+ * just return. Even if not set, we can also return if caller asked us not
+ * to extend the revmap.
+ */
+ contents = (RevmapArrayContents *)
+ PageGetContents(BufferGetPage(rmAccess->currArrayBuf));
+ revmapIdx = MAPBLK_TO_RMARRAY_INDEX(mapBlk);
+ if (!extend || revmapIdx <= contents->rma_nblocks - 1)
+ {
+ LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_UNLOCK);
+
+ return contents->rma_blocks[revmapIdx];
+ }
+
+ /*
+ * Trade our shared lock in the array page for exclusive, because we now
+ * need to allocate one more revmap page and modify the array page.
+ */
+ LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_UNLOCK);
+ LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_EXCLUSIVE);
+
+ contents = (RevmapArrayContents *)
+ PageGetContents(BufferGetPage(rmAccess->currArrayBuf));
+
+ /*
+ * If someone else already set the value while we were waiting for the
+ * exclusive lock, we're done; otherwise, allocate a new block as the
+ * new revmap page, and update the array page to point to it.
+ *
+ * FIXME -- what if we were asked not to extend?
+ */
+ if (contents->rma_blocks[revmapIdx] != InvalidBlockNumber)
+ {
+ targetblk = contents->rma_blocks[revmapIdx];
+ }
+ else
+ {
+ Buffer newbuf;
+
+ START_CRIT_SECTION();
+ newbuf = mm_getnewbuffer(rmAccess->idxrel);
+ targetblk = initialize_rmr_page(newbuf, mapBlk);
+ MarkBufferDirty(newbuf);
+ if (RelationNeedsWAL(rmAccess->idxrel))
+ {
+ xl_minmax_init_rmpg xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata;
+
+ xlrec.node = rmAccess->idxrel->rd_node;
+ xlrec.blkno = BufferGetBlockNumber(newbuf);
+ xlrec.array = false;
+ xlrec.logblk = mapBlk;
+
+ rdata.data = (char *) &xlrec;
+ rdata.len = SizeOfMinmaxInitRmpg;
+ rdata.buffer = InvalidBuffer;
+ rdata.buffer_std = false;
+ rdata.next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_INIT_RMPG, &rdata);
+ PageSetLSN(BufferGetPage(newbuf), recptr);
+ }
+ END_CRIT_SECTION();
+
+ UnlockReleaseBuffer(newbuf);
+
+ /*
+ * Modify the revmap array page to point to the newly allocated revmap
+ * page.
+ */
+ START_CRIT_SECTION();
+
+ contents->rma_blocks[revmapIdx] = targetblk;
+ /*
+ * XXX this rma_nblocks assignment should probably be conditional on the
+ * current rma_blocks value.
+ */
+ contents->rma_nblocks = revmapIdx + 1;
+ MarkBufferDirty(rmAccess->currArrayBuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rmAccess->idxrel))
+ {
+ xl_minmax_rmarray_set xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+ uint8 info;
+
+ info = XLOG_MINMAX_RMARRAY_SET;
+
+ xlrec.node = rmAccess->idxrel->rd_node;
+ xlrec.rmarray = BufferGetBlockNumber(rmAccess->currArrayBuf);
+ xlrec.blkidx = revmapIdx;
+ xlrec.newpg = targetblk;
+
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxRmarraySet;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].buffer_std = false;
+ rdata[0].next = &rdata[1];
+
+ rdata[1].data = NULL;
+ rdata[1].len = 0;
+ rdata[1].buffer = rmAccess->currArrayBuf;
+ rdata[1].buffer_std = false;
+ rdata[1].next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+ PageSetLSN(BufferGetPage(rmAccess->currArrayBuf), recptr);
+ }
+
+ END_CRIT_SECTION();
+ }
+
+ LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_UNLOCK);
+
+ return targetblk;
+ }
+
+ /*
+ * Set the TID of the index entry corresponding to the range that includes
+ * the given heap page to the given item pointer.
+ *
+ * The map is extended, if necessary.
+ */
+ void
+ mmSetHeapBlockItemptr(mmRevmapAccess *rmAccess, BlockNumber heapBlk,
+ BlockNumber blkno, OffsetNumber offno)
+ {
+ BlockNumber mapBlk;
+ bool extend = false;
+
+ mapBlk = HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk);
+
+ /* Translate the map block number to physical location */
+ mapBlk = rm_get_phys_blkno(rmAccess, mapBlk, true);
+
+ MINMAX_elog(DEBUG2, "setting %u/%u in logical page %lu (physical %u) for heap %u",
+ blkno, offno,
+ HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk),
+ mapBlk, heapBlk);
+
+ /*
+ * Obtain the buffer we need to update. If we already have the correct
+ * buffer pinned in our access struct, use it; otherwise, release the old
+ * one (if valid) and read the one we need.
+ */
+ if (rmAccess->currBuf == InvalidBuffer ||
+ mapBlk != BufferGetBlockNumber(rmAccess->currBuf))
+ {
+ if (rmAccess->currBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currBuf);
+
+ Assert(mapBlk != InvalidBlockNumber);
+ rmAccess->currBuf = ReadBuffer(rmAccess->idxrel, mapBlk);
+ }
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_EXCLUSIVE);
+ START_CRIT_SECTION();
+
+ rm_page_set_iptr(BufferGetPage(rmAccess->currBuf),
+ rmAccess->pagesPerRange,
+ heapBlk,
+ blkno, offno);
+
+ MarkBufferDirty(rmAccess->currBuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rmAccess->idxrel))
+ {
+ xl_minmax_rm_set xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+ uint8 info;
+
+ info = XLOG_MINMAX_REVMAP_SET;
+
+ xlrec.node = rmAccess->idxrel->rd_node;
+ xlrec.mapBlock = mapBlk;
+ xlrec.pagesPerRange = rmAccess->pagesPerRange;
+ xlrec.heapBlock = heapBlk;
+ ItemPointerSet(&(xlrec.newval), blkno, offno);
+
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxRevmapSet;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].buffer_std = false;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = NULL;
+ rdata[1].len = 0;
+ rdata[1].buffer = rmAccess->currBuf;
+ rdata[1].buffer_std = false;
+ rdata[1].next = NULL;
+
+ if (extend)
+ {
+ info |= XLOG_MINMAX_INIT_PAGE;
+ /* If the page is new, there's no need for a full page image */
+ rdata[0].next = NULL;
+ }
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+ PageSetLSN(BufferGetPage(rmAccess->currBuf), recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_UNLOCK);
+ }
+
+
+ /*
+ * Return the TID of the index entry corresponding to the range that includes
+ * the given heap page. If the TID is valid, the tuple is locked with
+ * LockTuple. It is the caller's responsibility to release that lock.
+ */
+ void
+ mmGetHeapBlockItemptr(mmRevmapAccess *rmAccess, BlockNumber heapBlk,
+ ItemPointerData *out)
+ {
+ BlockNumber mapBlk;
+ RevmapContents *contents;
+ ItemPointerData *iptr;
+
+ mapBlk = HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk);
+ /* Translate the map block number to physical location */
+ mapBlk = rm_get_phys_blkno(rmAccess, mapBlk, false);
+ if (mapBlk == InvalidBlockNumber)
+ {
+ ItemPointerSetInvalid(out);
+ return;
+ }
+
+ if (rmAccess->currBuf == InvalidBuffer ||
+ BufferGetBlockNumber(rmAccess->currBuf) != mapBlk)
+ {
+ if (rmAccess->currBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currBuf);
+
+ Assert(mapBlk != InvalidBlockNumber);
+ rmAccess->currBuf = ReadBuffer(rmAccess->idxrel, mapBlk);
+ }
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_SHARE);
+
+ contents = (RevmapContents *)
+ PageGetContents(BufferGetPage(rmAccess->currBuf));
+ iptr = contents->rmr_tids;
+ iptr += HEAPBLK_TO_REVMAP_INDEX(rmAccess->pagesPerRange, heapBlk);
+
+ ItemPointerCopy(iptr, out);
+
+ if (ItemPointerIsValid(iptr))
+ LockTuple(rmAccess->idxrel, iptr, ShareLock);
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_UNLOCK);
+ }
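To show how the access API above is meant to be used, here is a minimal sketch (assuming idxrel is an already-open minmax index relation and heapBlk a heap block number supplied by the caller):

    mmRevmapAccess *rmAccess = mmRevmapAccessInit(idxrel);
    ItemPointerData iptr;

    mmGetHeapBlockItemptr(rmAccess, heapBlk, &iptr);
    if (ItemPointerIsValid(&iptr))
    {
        /* ... fetch and examine the index tuple that iptr points to ... */

        /* release the tuple lock that mmGetHeapBlockItemptr acquired */
        UnlockTuple(idxrel, &iptr, ShareLock);
    }
    mmRevmapAccessTerminate(rmAccess);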
+
+ /*
+ * Initialize the revmap of a new minmax index.
+ *
+ * NB -- caller is assumed to WAL-log this operation
+ */
+ void
+ mmRevmapCreate(Relation idxrel)
+ {
+ Buffer buf;
+
+ /*
+ * The first page of the revmap is always stored in block number 1 of the
+ * main fork. Because of this, the only thing we need to do is request
+ * a new page; we assume we are called immediately after the metapage has
+ * been initialized.
+ */
+ buf = mm_getnewbuffer(idxrel);
+ Assert(BufferGetBlockNumber(buf) == 1);
+
+ mm_page_init(BufferGetPage(buf), MINMAX_PAGETYPE_REVMAP);
+ MarkBufferDirty(buf);
+
+ UnlockReleaseBuffer(buf);
+ }
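In other words, a freshly created minmax index contains exactly two pages: block 0 is the metapage (initialized by the caller before this function runs) and block 1 is the first regular revmap page.  Revmap array pages and regular data pages are only allocated later, as revmap entries and index tuples are needed.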
*** /dev/null
--- b/src/backend/access/minmax/mmtuple.c
***************
*** 0 ****
--- 1,388 ----
+ /*
+ * MinMax-specific tuples
+ * Method implementations for tuples in minmax indexes.
+ *
+ * The intended interface is that code outside this file only deals with
+ * DeformedMMTuples, converting to and from the on-disk representation by
+ * using functions in this file.
+ *
+ * NOTES
+ *
+ * A minmax tuple is similar to a heap tuple, with a few key differences. The
+ * first interesting difference is that the tuple header is much simpler, only
+ * containing its total length and a small area for flags. Also, the stored
+ * data does not match the tuple descriptor exactly: for each attribute in the
+ * descriptor, the index tuple carries two values, one for the minimum value in
+ * that column and one for the maximum.
+ *
+ * Also, for each column there are two null bits: one (hasnulls) stores whether
+ * any tuple within the page range has that column set to null; the other
+ * (allnulls) stores whether the column values are all null. If allnulls is
+ * true, then the tuple data area does not contain min/max values for that
+ * column at all; otherwise the values are present, even when hasnulls is set.
+ * Note we always store
+ * a double-length null bitmask; for typical indexes of four columns or less,
+ * they take a single byte anyway. It doesn't seem worth trying to optimize
+ * this further.
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmtuple.c
+ */
+ #include "postgres.h"
+
+ #include "access/htup_details.h"
+ #include "access/minmax_tuple.h"
+ #include "access/tupdesc.h"
+ #include "access/tupmacs.h"
+
+
+ static inline void mm_deconstruct_tuple(char *tp, bits8 *nullbits, bool nulls,
+ int natts, Form_pg_attribute *att,
+ Datum *values, bool *allnulls, bool *hasnulls);
+
+
+ /*
+ * Generate an internal-style tuple descriptor to pass to minmax_form_tuple.
+ * These have no use outside this module.
+ *
+ * The argument is a minmax index's regular tuple descriptor.
+ */
+ TupleDesc
+ minmax_get_descr(TupleDesc tupdesc)
+ {
+ TupleDesc diskDesc;
+ int i,
+ j;
+
+ diskDesc = CreateTemplateTupleDesc(tupdesc->natts * 2, false);
+
+ for (i = 0, j = 1; i < tupdesc->natts; i++)
+ {
+ /* min */
+ TupleDescInitEntry(diskDesc,
+ j++,
+ NULL,
+ tupdesc->attrs[i]->atttypid,
+ tupdesc->attrs[i]->atttypmod,
+ 0);
+ /* max */
+ TupleDescInitEntry(diskDesc,
+ j++,
+ NULL,
+ tupdesc->attrs[i]->atttypid,
+ tupdesc->attrs[i]->atttypmod,
+ 0);
+ }
+
+ return diskDesc;
+ }
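As a hypothetical illustration: for a minmax index over two int4 columns, minmax_get_descr returns a four-attribute descriptor laid out as (col1 min, col1 max, col2 min, col2 max).  The attributes keep the original type and typmod but have no names, since NULL is passed to TupleDescInitEntry for the attribute name.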
+
+ /*
+ * Generate a new on-disk tuple to be inserted in a minmax index.
+ *
+ * The first tuple descriptor passed corresponds to the catalogued index info,
+ * that is, it is the index's descriptor; the second descriptor must be
+ * obtained by calling minmax_get_descr() on that descriptor.
+ *
+ * (The reason for this slightly grotty arrangement is that we use heap tuple
+ * functions to implement packing of a tuple into the on-disk format.)
+ */
+ MMTuple *
+ minmax_form_tuple(TupleDesc idxDsc, TupleDesc diskDsc, DeformedMMTuple *tuple,
+ Size *size)
+ {
+ Datum *values;
+ bool *nulls;
+ bool anynulls = false;
+ MMTuple *rettuple;
+ int keyno;
+ uint16 phony_infomask;
+ bits8 *phony_nullbitmap;
+ Size len,
+ hoff,
+ data_len;
+
+ Assert(diskDsc->natts > 0);
+
+ values = palloc(sizeof(Datum) * diskDsc->natts);
+ nulls = palloc0(sizeof(bool) * diskDsc->natts);
+ phony_nullbitmap = palloc(sizeof(bits8) * BITMAPLEN(diskDsc->natts));
+
+ /*
+ * Set up the values/nulls arrays for heap_fill_tuple
+ */
+ for (keyno = 0; keyno < idxDsc->natts; keyno++)
+ {
+ int idxattno = keyno * 2;
+
+ /*
+ * "allnulls" is set when there's no nonnull value in any row in
+ * the column; set the nullable bits for both min and max attrs.
+ */
+ if (tuple->values[keyno].allnulls)
+ {
+ nulls[idxattno] = true;
+ nulls[idxattno + 1] = true;
+ anynulls = true;
+ continue;
+ }
+
+ if (tuple->values[keyno].hasnulls)
+ anynulls = true;
+
+ values[idxattno] = tuple->values[keyno].min;
+ values[idxattno + 1] = tuple->values[keyno].max;
+ }
+
+ /* compute total space needed */
+ len = SizeOfMinMaxTuple;
+ if (anynulls)
+ {
+ /*
+ * We need a double-length bitmap on an on-disk minmax index tuple;
+ * the first half stores the "allnulls" bits, the second stores
+ * "hasnulls".
+ */
+ len += BITMAPLEN(idxDsc->natts * 2);
+ }
+
+ /*
+ * TODO: we can probably do away with alignment here, and save some
+ * precious disk space. When there's no bitmap we can save 6 bytes. Maybe
+ * we can use the first col's type alignment instead of maxalign.
+ */
+ len = hoff = MAXALIGN(len);
+
+ data_len = heap_compute_data_size(diskDsc, values, nulls);
+
+ len += data_len;
+
+ rettuple = palloc0(len);
+ rettuple->mt_info = hoff;
+ Assert((rettuple->mt_info & MMIDX_OFFSET_MASK) == hoff);
+
+ /*
+ * The infomask and null bitmap as computed by heap_fill_tuple are useless
+ * to us. However, that function will not accept a null infomask; and we
+ * need to pass a valid null bitmap so that it will correctly skip
+ * outputting null attributes in the data area.
+ */
+ heap_fill_tuple(diskDsc,
+ values,
+ nulls,
+ (char *) rettuple + hoff,
+ data_len,
+ &phony_infomask,
+ phony_nullbitmap);
+
+ /* done with these */
+ pfree(values);
+ pfree(nulls);
+ pfree(phony_nullbitmap);
+
+ /*
+ * Now fill in the real null bitmasks. allnulls first.
+ */
+ if (anynulls)
+ {
+ bits8 *bitP;
+ int bitmask;
+
+ rettuple->mt_info |= MMIDX_NULLS_MASK;
+
+ bitP = ((bits8 *) ((char *) rettuple + SizeOfMinMaxTuple)) - 1;
+ bitmask = HIGHBIT;
+ for (keyno = 0; keyno < idxDsc->natts; keyno++)
+ {
+ if (bitmask != HIGHBIT)
+ bitmask <<= 1;
+ else
+ {
+ bitP += 1;
+ *bitP = 0x0;
+ bitmask = 1;
+ }
+
+ if (tuple->values[keyno].allnulls)
+ continue;
+
+ *bitP |= bitmask;
+ }
+ /* hasnulls bits follow */
+ for (keyno = 0; keyno < idxDsc->natts; keyno++)
+ {
+ if (bitmask != HIGHBIT)
+ bitmask <<= 1;
+ else
+ {
+ bitP += 1;
+ *bitP = 0x0;
+ bitmask = 1;
+ }
+
+ if (tuple->values[keyno].hasnulls)
+ continue;
+
+ *bitP |= bitmask;
+ }
+ }
+
+ *size = len;
+ return rettuple;
+ }
+
+ /*
+ * Free a tuple created by minmax_form_tuple
+ */
+ void
+ minmax_free_tuple(MMTuple *tuple)
+ {
+ pfree(tuple);
+ }
+
+ /*
+ * Convert a MMTuple back to a DeformedMMTuple. This is the reverse of
+ * minmax_form_tuple.
+ *
+ * Note we don't need the "on disk tupdesc" here; we rely on our own routine to
+ * deconstruct the tuple from the on-disk format.
+ *
+ * XXX some callers might need copies of each datum; if so we need
+ * to apply datumCopy inside the loop. We probably also need a
+ * minmax_free_dtuple() function.
+ */
+ DeformedMMTuple *
+ minmax_deform_tuple(TupleDesc tupdesc, MMTuple *tuple)
+ {
+ DeformedMMTuple *dtup;
+ Datum *values;
+ bool *allnulls;
+ bool *hasnulls;
+ char *tp;
+ bits8 *nullbits = NULL;
+ int keyno;
+
+ dtup = palloc(offsetof(DeformedMMTuple, values) +
+ sizeof(MMValues) * tupdesc->natts);
+
+ values = palloc(sizeof(Datum) * tupdesc->natts * 2);
+ allnulls = palloc(sizeof(bool) * tupdesc->natts);
+ hasnulls = palloc(sizeof(bool) * tupdesc->natts);
+
+ tp = (char *) tuple + MMTupleDataOffset(tuple);
+
+ if (MMTupleHasNulls(tuple))
+ nullbits = (bits8 *) ((char *) tuple + SizeOfMinMaxTuple);
+ mm_deconstruct_tuple(tp, nullbits,
+ MMTupleHasNulls(tuple),
+ tupdesc->natts, tupdesc->attrs, values,
+ allnulls, hasnulls);
+
+ for (keyno = 0; keyno < tupdesc->natts; keyno++)
+ {
+ if (allnulls[keyno])
+ {
+ dtup->values[keyno].allnulls = true;
+ continue;
+ }
+
+ /* XXX optional datumCopy() */
+ dtup->values[keyno].min = values[keyno * 2];
+ dtup->values[keyno].max = values[keyno * 2 + 1];
+ dtup->values[keyno].hasnulls = hasnulls[keyno];
+ dtup->values[keyno].allnulls = false;
+ }
+
+ pfree(values);
+ pfree(allnulls);
+ pfree(hasnulls);
+
+ return dtup;
+ }
+
+ /*
+ * mm_deconstruct_tuple
+ * Guts of attribute extraction from an on-disk minmax tuple.
+ *
+ * Its arguments are:
+ * tp pointer to the tuple data area
+ * nullbits pointer to the tuple nulls bitmask
+ * nulls "has nulls" bit in tuple infomask
+ * natts number of array members in att
+ * att the tuple's TupleDesc Form_pg_attribute array
+ * values output values, size 2 * natts (alternates min and max)
+ * allnulls output "allnulls", size natts
+ * hasnulls output "hasnulls", size natts
+ *
+ * Output arrays are allocated by caller.
+ */
+ static inline void
+ mm_deconstruct_tuple(char *tp, bits8 *nullbits, bool nulls,
+ int natts, Form_pg_attribute *att,
+ Datum *values, bool *allnulls, bool *hasnulls)
+ {
+ int attnum;
+ long off = 0;
+
+ /*
+ * First iterate to natts to obtain both null flags for each attribute.
+ */
+ for (attnum = 0; attnum < natts; attnum++)
+ {
+ /*
+ * the "all nulls" bit means that all values in the page range for
+ * this column are nulls. Therefore there are no values in the tuple
+ * data area.
+ */
+ if (nulls && att_isnull(attnum, nullbits))
+ {
+ values[attnum] = (Datum) 0;
+ allnulls[attnum] = true;
+ hasnulls[attnum] = true; /* XXX ? */
+ continue;
+ }
+
+ allnulls[attnum] = false;
+
+ /*
+ * the "has nulls" bit means that some tuples have nulls, but others
+ * have not-null values. So the tuple data does have data for this
+ * column.
+ *
+ * The hasnulls bits follow the allnulls bits in the same bitmask.
+ */
+ hasnulls[attnum] = nulls && att_isnull(natts + attnum, nullbits);
+ }
+
+ /*
+ * Then we iterate up to natts * 2 to obtain each attribute's min and max
+ * values. Note that since we reuse attribute entries (first for the
+ * minimum value of the corresponding column, then for max), we cannot
+ * cache offsets here.
+ */
+ for (attnum = 0; attnum < natts * 2; attnum++)
+ {
+ int true_attnum = attnum / 2;
+ Form_pg_attribute thisatt = att[true_attnum];
+
+ if (allnulls[true_attnum])
+ continue;
+
+ if (thisatt->attlen == -1)
+ {
+ off = att_align_pointer(off, thisatt->attalign, -1,
+ tp + off);
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ off = att_align_nominal(off, thisatt->attalign);
+ }
+
+ values[attnum] = fetchatt(thisatt, tp + off);
+
+ off = att_addlength_pointer(off, thisatt->attlen, tp + off);
+ }
+ }
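The intended round trip through this module looks roughly like the following sketch (assuming idxDesc is the index's regular tuple descriptor and that the caller fills one MMValues entry per indexed column):

    TupleDesc        diskDesc = minmax_get_descr(idxDesc);
    DeformedMMTuple *dtup;
    MMTuple         *mmtup;
    Size             size;

    dtup = palloc0(offsetof(DeformedMMTuple, values) +
                   sizeof(MMValues) * idxDesc->natts);
    /* ... fill dtup->values[i].min/max/hasnulls/allnulls for each column ... */

    mmtup = minmax_form_tuple(idxDesc, diskDesc, dtup, &size);
    /* ... insert mmtup (size bytes long) into an index page ... */

    /* later, to examine a stored tuple again */
    dtup = minmax_deform_tuple(idxDesc, mmtup);
    minmax_free_tuple(mmtup);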
*** /dev/null
--- b/src/backend/access/minmax/mmxlog.c
***************
*** 0 ****
--- 1,304 ----
+ /*
+ * mmxlog.c
+ * XLog replay routines for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmxlog.c
+ */
+ #include "postgres.h"
+
+ #include "access/minmax.h"
+ #include "access/minmax_internal.h"
+ #include "access/minmax_page.h"
+ #include "access/minmax_tuple.h"
+ #include "access/minmax_xlog.h"
+ #include "access/xlogutils.h"
+ #include "storage/freespace.h"
+
+
+ /*
+ * xlog replay routines
+ */
+ static void
+ minmax_xlog_createidx(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_createidx *xlrec = (xl_minmax_createidx *) XLogRecGetData(record);
+ Buffer buf;
+ Page page;
+
+ /* Backup blocks are not used in create_index records */
+ Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+
+ /* create the index's metapage */
+ buf = XLogReadBuffer(xlrec->node, MINMAX_METAPAGE_BLKNO, true);
+ Assert(BufferIsValid(buf));
+ page = (Page) BufferGetPage(buf);
+ mm_metapage_init(page);
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ UnlockReleaseBuffer(buf);
+
+ /* also initialize its first revmap page */
+ buf = XLogReadBuffer(xlrec->node, 1, true);
+ Assert(BufferIsValid(buf));
+ page = (Page) BufferGetPage(buf);
+ mm_page_init(page, MINMAX_PAGETYPE_REVMAP);
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ UnlockReleaseBuffer(buf);
+ }
+
+ static void
+ minmax_xlog_insert(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_insert *xlrec = (xl_minmax_insert *) XLogRecGetData(record);
+ BlockNumber blkno;
+ Buffer buffer;
+ Page page;
+ OffsetNumber offnum;
+ int tuplen;
+ MMTuple *mmtuple;
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ blkno = ItemPointerGetBlockNumber(&(xlrec->target.tid));
+ if (record->xl_info & XLOG_MINMAX_INIT_PAGE)
+ {
+ buffer = XLogReadBuffer(xlrec->target.node, blkno, true);
+ Assert(BufferIsValid(buffer));
+ page = (Page) BufferGetPage(buffer);
+
+ mm_page_init(page, MINMAX_PAGETYPE_REGULAR);
+ }
+ else
+ {
+ buffer = XLogReadBuffer(xlrec->target.node, blkno, false);
+ if (!BufferIsValid(buffer))
+ return;
+ page = (Page) BufferGetPage(buffer);
+
+ if (lsn <= PageGetLSN(page)) /* changes are applied */
+ {
+ UnlockReleaseBuffer(buffer);
+ return;
+ }
+ }
+ offnum = ItemPointerGetOffsetNumber(&(xlrec->target.tid));
+ if (PageGetMaxOffsetNumber(page) + 1 < offnum)
+ elog(PANIC, "minmax_xlog_insert: invalid max offset number");
+
+ tuplen = record->xl_len - SizeOfMinmaxInsert;
+ mmtuple = (MMTuple *) ((char *) xlrec + SizeOfMinmaxInsert);
+
+ if (xlrec->overwrite)
+ PageOverwriteItemData(page, offnum, (Item) mmtuple, tuplen);
+ else
+ {
+ offnum = PageAddItem(page, (Item) mmtuple, tuplen, offnum, true, false);
+ if (offnum == InvalidOffsetNumber)
+ elog(PANIC, "minmax_xlog_insert: failed to add tuple");
+ }
+
+ PageSetLSN(page, lsn);
+
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+
+ /* XXX no FSM updates here ... */
+ }
+
+ static void
+ minmax_xlog_bulkremove(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_bulkremove *xlrec = (xl_minmax_bulkremove *) XLogRecGetData(record);
+ Buffer buffer;
+ Page page;
+ OffsetNumber *offnos;
+ int noffs;
+ Size freespace;
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ buffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
+ if (!BufferIsValid(buffer))
+ return;
+ page = (Page) BufferGetPage(buffer);
+
+ if (lsn <= PageGetLSN(page)) /* changes are applied */
+ {
+ UnlockReleaseBuffer(buffer);
+ return;
+ }
+
+ offnos = (OffsetNumber *) ((char *) xlrec + SizeOfMinmaxBulkRemove);
+ noffs = (record->xl_len - SizeOfMinmaxBulkRemove) / sizeof(OffsetNumber);
+
+ PageIndexDeleteNoCompact(page, offnos, noffs);
+ freespace = PageGetFreeSpace(page);
+
+ PageSetLSN(page, lsn);
+
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+
+ /* update FSM as well */
+ XLogRecordPageWithFreeSpace(xlrec->node, xlrec->block, freespace);
+ }
+
+ static void
+ minmax_xlog_revmap_set(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_rm_set *xlrec = (xl_minmax_rm_set *) XLogRecGetData(record);
+ bool init;
+ Buffer buffer;
+ Page page;
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ init = (record->xl_info & XLOG_MINMAX_INIT_PAGE) != 0;
+ buffer = XLogReadBuffer(xlrec->node, xlrec->mapBlock, init);
+ Assert(BufferIsValid(buffer));
+ page = BufferGetPage(buffer);
+ if (init)
+ mm_page_init(page, MINMAX_PAGETYPE_REVMAP);
+
+ rm_page_set_iptr(page, xlrec->pagesPerRange, xlrec->heapBlock,
+ ItemPointerGetBlockNumber(&(xlrec->newval)),
+ ItemPointerGetOffsetNumber(&(xlrec->newval)));
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+ }
+
+ static void
+ minmax_xlog_metapg_set(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_metapg_set *xlrec = (xl_minmax_metapg_set *) XLogRecGetData(record);
+ Buffer meta;
+ Page metapg;
+ MinmaxMetaPageData *metadata;
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ meta = XLogReadBuffer(xlrec->node, MINMAX_METAPAGE_BLKNO, false);
+ Assert(BufferIsValid(meta));
+
+ metapg = BufferGetPage(meta);
+ metadata = (MinmaxMetaPageData *) PageGetContents(metapg);
+ metadata->revmapArrayPages[xlrec->blkidx] = xlrec->newpg;
+
+ PageSetLSN(metapg, lsn);
+ MarkBufferDirty(meta);
+ UnlockReleaseBuffer(meta);
+ }
+
+ static void
+ minmax_xlog_init_rmpg(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_init_rmpg *xlrec = (xl_minmax_init_rmpg *) XLogRecGetData(record);
+ Buffer buffer;
+
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ buffer = XLogReadBuffer(xlrec->node, xlrec->blkno, true);
+ Assert(BufferIsValid(buffer));
+
+ if (xlrec->array)
+ initialize_rma_page(buffer);
+ else
+ initialize_rmr_page(buffer, xlrec->logblk);
+
+ PageSetLSN(BufferGetPage(buffer), lsn);
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+ }
+
+ static void
+ minmax_xlog_rmarray_set(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_rmarray_set *xlrec = (xl_minmax_rmarray_set *) XLogRecGetData(record);
+ Buffer buffer;
+ Page page;
+ RevmapArrayContents *contents;
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ buffer = XLogReadBuffer(xlrec->node, xlrec->rmarray, false);
+ Assert(BufferIsValid(buffer));
+
+ page = BufferGetPage(buffer);
+
+ contents = (RevmapArrayContents *) PageGetContents(page);
+ contents->rma_blocks[xlrec->blkidx] = xlrec->newpg;
+ contents->rma_nblocks = xlrec->blkidx + 1; /* XXX is this okay? */
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+ }
+
+ void
+ minmax_redo(XLogRecPtr lsn, XLogRecord *record)
+ {
+ uint8 info = record->xl_info & ~XLR_INFO_MASK;
+
+ switch (info & XLOG_MINMAX_OPMASK)
+ {
+ case XLOG_MINMAX_CREATE_INDEX:
+ minmax_xlog_createidx(lsn, record);
+ break;
+ case XLOG_MINMAX_INSERT:
+ minmax_xlog_insert(lsn, record);
+ break;
+ case XLOG_MINMAX_BULKREMOVE:
+ minmax_xlog_bulkremove(lsn, record);
+ break;
+ case XLOG_MINMAX_REVMAP_SET:
+ minmax_xlog_revmap_set(lsn, record);
+ break;
+ case XLOG_MINMAX_METAPG_SET:
+ minmax_xlog_metapg_set(lsn, record);
+ break;
+ case XLOG_MINMAX_RMARRAY_SET:
+ minmax_xlog_rmarray_set(lsn, record);
+ break;
+ case XLOG_MINMAX_INIT_RMPG:
+ minmax_xlog_init_rmpg(lsn, record);
+ break;
+ default:
+ elog(PANIC, "minmax_redo: unknown op code %u", info);
+ }
+ }
*** a/src/backend/access/rmgrdesc/Makefile
--- b/src/backend/access/rmgrdesc/Makefile
***************
*** 9,15 **** top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = clogdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
! mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o smgrdesc.o spgdesc.o \
standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
include $(top_srcdir)/src/backend/common.mk
--- 9,16 ----
include $(top_builddir)/src/Makefile.global
OBJS = clogdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
! minmaxdesc.o mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o \
! smgrdesc.o spgdesc.o \
standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
include $(top_srcdir)/src/backend/common.mk
*** /dev/null
--- b/src/backend/access/rmgrdesc/minmaxdesc.c
***************
*** 0 ****
--- 1,95 ----
+ /*-------------------------------------------------------------------------
+ *
+ * minmaxdesc.c
+ * rmgr descriptor routines for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/rmgrdesc/minmaxdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+ #include "postgres.h"
+
+ #include "access/minmax_xlog.h"
+
+ static void
+ out_target(StringInfo buf, xl_minmax_tid *target)
+ {
+ appendStringInfo(buf, "rel %u/%u/%u; tid %u/%u",
+ target->node.spcNode, target->node.dbNode, target->node.relNode,
+ ItemPointerGetBlockNumber(&(target->tid)),
+ ItemPointerGetOffsetNumber(&(target->tid)));
+ }
+
+ void
+ minmax_desc(StringInfo buf, uint8 xl_info, char *rec)
+ {
+ uint8 info = xl_info & ~XLR_INFO_MASK;
+
+ info &= XLOG_MINMAX_OPMASK;
+ if (info == XLOG_MINMAX_CREATE_INDEX)
+ {
+ xl_minmax_createidx *xlrec = (xl_minmax_createidx *) rec;
+
+ appendStringInfo(buf, "create index: %u/%u/%u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode);
+ }
+ else if (info == XLOG_MINMAX_INSERT)
+ {
+ xl_minmax_insert *xlrec = (xl_minmax_insert *) rec;
+
+ if (xl_info & XLOG_MINMAX_INIT_PAGE)
+ appendStringInfo(buf, "insert(init): ");
+ else
+ appendStringInfo(buf, "insert: ");
+ out_target(buf, &(xlrec->target));
+ }
+ else if (info == XLOG_MINMAX_BULKREMOVE)
+ {
+ xl_minmax_bulkremove *xlrec = (xl_minmax_bulkremove *) rec;
+
+ appendStringInfo(buf, "bulkremove: rel %u/%u/%u blk %u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode, xlrec->block);
+ }
+ else if (info == XLOG_MINMAX_REVMAP_SET)
+ {
+ xl_minmax_rm_set *xlrec = (xl_minmax_rm_set *) rec;
+
+ appendStringInfo(buf, "revmap set: rel %u/%u/%u mapblk %u pagesPerRange %u item %u value %u/%u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode, xlrec->mapBlock,
+ xlrec->pagesPerRange, xlrec->heapBlock,
+ ItemPointerGetBlockNumber(&(xlrec->newval)),
+ ItemPointerGetOffsetNumber(&(xlrec->newval)));
+ }
+ else if (info == XLOG_MINMAX_METAPG_SET)
+ {
+ xl_minmax_metapg_set *xlrec = (xl_minmax_metapg_set *) rec;
+
+ appendStringInfo(buf, "metapg: rel %u/%u/%u array revmap idx %d block %u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode,
+ xlrec->blkidx, xlrec->newpg);
+ }
+ else if (info == XLOG_MINMAX_RMARRAY_SET)
+ {
+ xl_minmax_rmarray_set *xlrec = (xl_minmax_rmarray_set *) rec;
+
+ appendStringInfoString(buf, "revmap array: ");
+ appendStringInfo(buf, "rel %u/%u/%u array pg %u revmap idx %d block %u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode,
+ xlrec->rmarray,
+ xlrec->blkidx, xlrec->newpg);
+ }
+
+ else
+ appendStringInfo(buf, "UNKNOWN");
+ }
+
*** a/src/backend/access/transam/rmgr.c
--- b/src/backend/access/transam/rmgr.c
***************
*** 12,17 ****
--- 12,18 ----
#include "access/gist_private.h"
#include "access/hash.h"
#include "access/heapam_xlog.h"
+ #include "access/minmax_xlog.h"
#include "access/multixact.h"
#include "access/nbtree.h"
#include "access/spgist.h"
*** a/src/backend/catalog/index.c
--- b/src/backend/catalog/index.c
***************
*** 2096,2101 **** IndexBuildHeapScan(Relation heapRelation,
--- 2096,2122 ----
IndexBuildCallback callback,
void *callback_state)
{
+ return IndexBuildHeapRangeScan(heapRelation, indexRelation,
+ indexInfo, allow_sync,
+ 0, InvalidBlockNumber,
+ callback, callback_state);
+ }
+
+ /*
+ * As above, except that instead of scanning the complete heap, only the given
+ * number of blocks are scanned. Scan to end-of-rel can be signalled by
+ * passing InvalidBlockNumber as numblocks.
+ */
+ double
+ IndexBuildHeapRangeScan(Relation heapRelation,
+ Relation indexRelation,
+ IndexInfo *indexInfo,
+ bool allow_sync,
+ BlockNumber start_blockno,
+ BlockNumber numblocks,
+ IndexBuildCallback callback,
+ void *callback_state)
+ {
bool is_system_catalog;
bool checking_uniqueness;
HeapScanDesc scan;
***************
*** 2166,2171 **** IndexBuildHeapScan(Relation heapRelation,
--- 2187,2195 ----
true, /* buffer access strategy OK */
allow_sync); /* syncscan OK? */
+ /* set our endpoints */
+ heap_setscanlimits(scan, start_blockno, numblocks);
+
reltuples = 0;
/*
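For instance, the minmax build code can summarize a single page range with a bounded scan along these lines (a sketch only; rangeStart, pagesPerRange, mmbuildCallback and state stand for the caller's own variables):

    reltuples += IndexBuildHeapRangeScan(heapRel, indexRel, indexInfo,
                                         false,         /* no syncscan for a bounded scan */
                                         rangeStart,    /* first block of the range */
                                         pagesPerRange, /* number of blocks to scan */
                                         mmbuildCallback, (void *) &state);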
*** a/src/backend/replication/logical/decode.c
--- b/src/backend/replication/logical/decode.c
***************
*** 132,137 **** LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogRecord *record)
--- 132,138 ----
case RM_GIST_ID:
case RM_SEQ_ID:
case RM_SPGIST_ID:
+ case RM_MINMAX_ID:
break;
case RM_NEXT_ID:
elog(ERROR, "unexpected RM_NEXT_ID rmgr_id: %u", (RmgrIds) buf.record.xl_rmid);
*** a/src/backend/storage/page/bufpage.c
--- b/src/backend/storage/page/bufpage.c
***************
*** 324,329 **** PageAddItem(Page page,
--- 324,364 ----
}
/*
+ * PageOverwriteItemData
+ * Overwrite the data for the item at the given offset.
+ *
+ * The new data must fit in the existing data space for the old tuple.
+ */
+ void
+ PageOverwriteItemData(Page page, OffsetNumber offset, Item item, Size size)
+ {
+ PageHeader phdr = (PageHeader) page;
+ ItemId itemId;
+
+ /*
+ * Be wary about corrupted page pointers
+ */
+ if (phdr->pd_lower < SizeOfPageHeaderData ||
+ phdr->pd_lower > phdr->pd_upper ||
+ phdr->pd_upper > phdr->pd_special ||
+ phdr->pd_special > BLCKSZ)
+ ereport(PANIC,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u",
+ phdr->pd_lower, phdr->pd_upper, phdr->pd_special)));
+
+ itemId = PageGetItemId(phdr, offset);
+ if (!ItemIdIsUsed(itemId) || !ItemIdHasStorage(itemId))
+ elog(ERROR, "existing item to overwrite is not used");
+
+ if (ItemIdGetLength(itemId) < size)
+ elog(ERROR, "existing item is not large enough to be overwritten");
+
+ memcpy((char *) page + ItemIdGetOffset(itemId), item, size);
+ ItemIdSetNormal(itemId, ItemIdGetOffset(itemId), size);
+ }
+
+ /*
* PageGetTempPage
* Get a temporary page in local memory for special processing.
* The returned page is not initialized at all; caller must do that.
***************
*** 399,405 **** PageRestoreTempPage(Page tempPage, Page oldPage)
}
/*
! * sorting support for PageRepairFragmentation and PageIndexMultiDelete
*/
typedef struct itemIdSortData
{
--- 434,441 ----
}
/*
! * sorting support for PageRepairFragmentation, PageIndexMultiDelete,
! * PageIndexDeleteNoCompact
*/
typedef struct itemIdSortData
{
***************
*** 896,901 **** PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
--- 932,1113 ----
phdr->pd_upper = upper;
}
+ /*
+ * PageIndexDeleteNoCompact
+ * Delete the given items for an index page, and defragment the resulting
+ * free space, but do not compact the item pointers array.
+ *
+ * itemnos is the array of offset numbers of the items to delete; nitems is
+ * its size.
+ *
+ * Unused items at the end of the array are removed.
+ *
+ * This is used for index AMs that require that existing TIDs of live tuples
+ * remain unchanged.
+ */
+ void
+ PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos, int nitems)
+ {
+ PageHeader phdr = (PageHeader) page;
+ LocationIndex pd_lower = phdr->pd_lower;
+ LocationIndex pd_upper = phdr->pd_upper;
+ LocationIndex pd_special = phdr->pd_special;
+ int nline;
+ bool empty;
+ OffsetNumber offnum;
+ int nextitm;
+
+ /*
+ * As with PageRepairFragmentation, paranoia seems justified.
+ */
+ if (pd_lower < SizeOfPageHeaderData ||
+ pd_lower > pd_upper ||
+ pd_upper > pd_special ||
+ pd_special > BLCKSZ ||
+ pd_special != MAXALIGN(pd_special))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u",
+ pd_lower, pd_upper, pd_special)));
+
+ /*
+ * Scan the existing item pointer array and mark as unused those that are
+ * in our kill-list; make sure any non-interesting ones are marked unused
+ * as well.
+ */
+ nline = PageGetMaxOffsetNumber(page);
+ empty = true;
+ nextitm = 0;
+ for (offnum = FirstOffsetNumber; offnum <= nline; offnum = OffsetNumberNext(offnum))
+ {
+ ItemId lp;
+ ItemLength itemlen;
+ ItemOffset offset;
+
+ lp = PageGetItemId(page, offnum);
+
+ itemlen = ItemIdGetLength(lp);
+ offset = ItemIdGetOffset(lp);
+
+ if (ItemIdIsUsed(lp))
+ {
+ if (offset < pd_upper ||
+ (offset + itemlen) > pd_special ||
+ offset != MAXALIGN(offset))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted item pointer: offset = %u, length = %u",
+ offset, (unsigned int) itemlen)));
+
+ if (nextitm < nitems && offnum == itemnos[nextitm])
+ {
+ /* this one is on our list to delete, so mark it unused */
+ ItemIdSetUnused(lp);
+ nextitm++;
+ }
+ else if (ItemIdHasStorage(lp))
+ {
+ /* This one's live -- must do the compaction dance */
+ empty = false;
+ }
+ else
+ {
+ /* get rid of this one too */
+ ItemIdSetUnused(lp);
+ }
+ }
+ }
+
+ /* this will catch invalid or out-of-order itemnos[] */
+ if (nextitm != nitems)
+ elog(ERROR, "incorrect index offsets supplied");
+
+ if (empty)
+ {
+ /* Page is completely empty, so just reset it quickly */
+ phdr->pd_lower = SizeOfPageHeaderData;
+ phdr->pd_upper = pd_special;
+ }
+ else
+ {
+ /* There are live items: need to compact the page the hard way */
+ itemIdSortData itemidbase[MaxOffsetNumber];
+ itemIdSort itemidptr;
+ int i;
+ Size totallen;
+ Offset upper;
+
+ /*
+ * Scan the page taking note of each item that we need to preserve.
+ * This includes both live items (those that contain data) and
+ * interspersed unused ones. It's critical to preserve these unused
+ * items, because otherwise the offset numbers for later live items
+ * would change, which is not acceptable. Unused items might get used
+ * again later; that is fine.
+ */
+ itemidptr = itemidbase;
+ totallen = 0;
+ for (i = 0; i < nline; i++, itemidptr++)
+ {
+ ItemId lp;
+
+ itemidptr->offsetindex = i;
+
+ lp = PageGetItemId(page, i + 1);
+ if (ItemIdHasStorage(lp))
+ {
+ itemidptr->itemoff = ItemIdGetOffset(lp);
+ itemidptr->alignedlen = MAXALIGN(ItemIdGetLength(lp));
+ totallen += itemidptr->alignedlen;
+ }
+ else
+ {
+ itemidptr->itemoff = 0;
+ itemidptr->alignedlen = 0;
+ }
+ }
+ /* By here, there are exactly nline elements in itemidbase array */
+
+ if (totallen > (Size) (pd_special - pd_lower))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted item lengths: total %u, available space %u",
+ (unsigned int) totallen, pd_special - pd_lower)));
+
+ /* sort itemIdSortData array into decreasing itemoff order */
+ qsort((char *) itemidbase, nline, sizeof(itemIdSortData),
+ itemoffcompare);
+
+ /*
+ * Defragment the data areas of each tuple, being careful to preserve
+ * each item's position in the linp array.
+ */
+ upper = pd_special;
+ PageClearHasFreeLinePointers(page);
+ for (i = 0, itemidptr = itemidbase; i < nline; i++, itemidptr++)
+ {
+ ItemId lp;
+
+ lp = PageGetItemId(page, itemidptr->offsetindex + 1);
+ if (itemidptr->alignedlen == 0)
+ {
+ PageSetHasFreeLinePointers(page);
+ ItemIdSetUnused(lp);
+ continue;
+ }
+ upper -= itemidptr->alignedlen;
+ memmove((char *) page + upper,
+ (char *) page + itemidptr->itemoff,
+ itemidptr->alignedlen);
+ lp->lp_off = upper;
+ /* lp_flags and lp_len remain the same as originally */
+ }
+
+ /* Set the new page limits */
+ phdr->pd_upper = upper;
+ phdr->pd_lower = SizeOfPageHeaderData + i * sizeof(ItemIdData);
+ }
+ }
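The reason minmax needs this variant rather than PageIndexMultiDelete: the revmap stores plain (block, offset) item pointers into regular index pages, so compacting the line pointer array would change the offset numbers of the surviving tuples and silently invalidate their revmap entries.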
/*
* Set checksum for a page in shared buffers.
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
***************
*** 7349,7351 **** gincostestimate(PG_FUNCTION_ARGS)
--- 7349,7376 ----
PG_RETURN_VOID();
}
+
+ Datum
+ mmcostestimate(PG_FUNCTION_ARGS)
+ {
+ PlannerInfo *root = (PlannerInfo *) PG_GETARG_POINTER(0);
+ IndexPath *path = (IndexPath *) PG_GETARG_POINTER(1);
+ double loop_count = PG_GETARG_FLOAT8(2);
+ Cost *indexStartupCost = (Cost *) PG_GETARG_POINTER(3);
+ Cost *indexTotalCost = (Cost *) PG_GETARG_POINTER(4);
+ Selectivity *indexSelectivity = (Selectivity *) PG_GETARG_POINTER(5);
+ double *indexCorrelation = (double *) PG_GETARG_POINTER(6);
+ IndexOptInfo *index = path->indexinfo;
+
+ *indexStartupCost = (Cost) seq_page_cost * index->pages * loop_count;
+ *indexTotalCost = *indexStartupCost;
+
+ *indexSelectivity =
+ clauselist_selectivity(root, path->indexquals,
+ path->indexinfo->rel->relid,
+ JOIN_INNER, NULL);
+ *indexCorrelation = 1;
+
+ PG_RETURN_VOID();
+ }
+
*** a/src/backend/utils/mmgr/mcxt.c
--- b/src/backend/utils/mmgr/mcxt.c
***************
*** 68,74 **** static void MemoryContextStatsInternal(MemoryContext context, int level);
*/
#define AssertNotInCriticalSection(context) \
Assert(CritSectionCount == 0 || (context) == ErrorContext || \
! AmCheckpointerProcess())
/*****************************************************************************
* EXPORTED ROUTINES *
--- 68,74 ----
*/
#define AssertNotInCriticalSection(context) \
Assert(CritSectionCount == 0 || (context) == ErrorContext || \
! AmCheckpointerProcess() || true)
/*****************************************************************************
* EXPORTED ROUTINES *
*** a/src/include/access/heapam.h
--- b/src/include/access/heapam.h
***************
*** 112,117 **** extern HeapScanDesc heap_beginscan_strat(Relation relation, Snapshot snapshot,
--- 112,119 ----
bool allow_strat, bool allow_sync);
extern HeapScanDesc heap_beginscan_bm(Relation relation, Snapshot snapshot,
int nkeys, ScanKey key);
+ extern void heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk,
+ BlockNumber endBlk);
extern void heap_rescan(HeapScanDesc scan, ScanKey key);
extern void heap_endscan(HeapScanDesc scan);
extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
*** /dev/null
--- b/src/include/access/minmax.h
***************
*** 0 ****
--- 1,52 ----
+ /*
+ * AM-callable functions for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax.h
+ */
+ #ifndef MINMAX_H
+ #define MINMAX_H
+
+ #include "fmgr.h"
+ #include "nodes/execnodes.h"
+ #include "utils/relcache.h"
+
+
+ /*
+ * prototypes for functions in minmax.c (external entry points for minmax)
+ */
+ extern Datum mmbuild(PG_FUNCTION_ARGS);
+ extern Datum mmbuildempty(PG_FUNCTION_ARGS);
+ extern Datum mminsert(PG_FUNCTION_ARGS);
+ extern Datum mmbeginscan(PG_FUNCTION_ARGS);
+ extern Datum mmgettuple(PG_FUNCTION_ARGS);
+ extern Datum mmgetbitmap(PG_FUNCTION_ARGS);
+ extern Datum mmrescan(PG_FUNCTION_ARGS);
+ extern Datum mmendscan(PG_FUNCTION_ARGS);
+ extern Datum mmmarkpos(PG_FUNCTION_ARGS);
+ extern Datum mmrestrpos(PG_FUNCTION_ARGS);
+ extern Datum mmbulkdelete(PG_FUNCTION_ARGS);
+ extern Datum mmvacuumcleanup(PG_FUNCTION_ARGS);
+ extern Datum mmcanreturn(PG_FUNCTION_ARGS);
+ extern Datum mmcostestimate(PG_FUNCTION_ARGS);
+ extern Datum mmoptions(PG_FUNCTION_ARGS);
+
+ /*
+ * Storage type for MinMax's reloptions
+ */
+ typedef struct MinmaxOptions
+ {
+ int32 vl_len_; /* varlena header (do not touch directly!) */
+ int pagesPerRange;
+ } MinmaxOptions;
+
+ #define MINMAX_DEFAULT_PAGES_PER_RANGE 128
+ #define MinmaxGetPagesPerRange(relation) \
+ ((relation)->rd_options ? \
+ ((MinmaxOptions *) (relation)->rd_options)->pagesPerRange : \
+ MINMAX_DEFAULT_PAGES_PER_RANGE)
+
+ #endif /* MINMAX_H */
*** /dev/null
--- b/src/include/access/minmax_internal.h
***************
*** 0 ****
--- 1,37 ----
+ /*
+ * minmax_internal.h
+ * internal declarations for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_internal.h
+ */
+ #ifndef MINMAX_INTERNAL_H
+ #define MINMAX_INTERNAL_H
+
+ #include "storage/buf.h"
+ #include "storage/bufpage.h"
+ #include "storage/off.h"
+ #include "utils/relcache.h"
+
+
+ extern void mm_metapage_init(Page page);
+ extern Buffer mm_getnewbuffer(Relation irel);
+ extern void rm_page_set_iptr(Page page, int pagesPerRange, BlockNumber heapBlk,
+ BlockNumber blkno, OffsetNumber offno);
+ extern BlockNumber initialize_rmr_page(Buffer newbuf, BlockNumber mapBlk);
+ extern void initialize_rma_page(Buffer buf);
+
+ #define MINMAX_DEBUG
+
+ /* we allow debug if using GCC; otherwise don't bother */
+ #if defined(MINMAX_DEBUG) && defined(__GNUC__)
+ #define MINMAX_elog(level, ...) elog(level, __VA_ARGS__)
+ #else
+ #define MINMAX_elog(...) ((void) 0)
+ #endif
+
+
+ #endif /* MINMAX_INTERNAL_H */
*** /dev/null
--- b/src/include/access/minmax_page.h
***************
*** 0 ****
--- 1,87 ----
+ /*
+ * prototypes and definitions for minmax page layouts
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_page.h
+ *
+ * NOTES
+ *
+ * These structs should really be private to specific minmax files, but it's
+ * useful to have them here so that they can be used by pageinspect and similar
+ * tools.
+ */
+ #ifndef MINMAX_PAGE_H
+ #define MINMAX_PAGE_H
+
+
+ /* special space on all minmax pages stores a "type" identifier */
+ #define MINMAX_PAGETYPE_META 0xF091
+ #define MINMAX_PAGETYPE_REVMAP_ARRAY 0xF092
+ #define MINMAX_PAGETYPE_REVMAP 0xF093
+ #define MINMAX_PAGETYPE_REGULAR 0xF094
+
+ typedef struct MinmaxSpecialSpace
+ {
+ uint16 type;
+ } MinmaxSpecialSpace;
+
+ /* Metapage definitions */
+ typedef struct MinmaxMetaPageData
+ {
+ uint32 minmaxVersion;
+ BlockNumber revmapArrayPages[1]; /* actually MAX_REVMAP_ARRAYPAGES */
+ } MinmaxMetaPageData;
+
+ /*
+ * Number of array pages listed in metapage. Need to consider leaving enough
+ * space for the page header, the metapage struct, and the minmax special
+ * space.
+ */
+ #define MAX_REVMAP_ARRAYPAGES \
+ ((BLCKSZ - \
+ MAXALIGN(SizeOfPageHeaderData) - \
+ offsetof(MinmaxMetaPageData, revmapArrayPages) - \
+ MAXALIGN(sizeof(MinmaxSpecialSpace)) ) / \
+ sizeof(BlockNumber))
+
+ #define MINMAX_CURRENT_VERSION 1
+
+ #define MINMAX_METAPAGE_BLKNO 0
+
+ /* Definitions for regular revmap pages */
+ typedef struct RevmapContents
+ {
+ int32 rmr_logblk; /* logical blkno of this revmap page */
+ ItemPointerData rmr_tids[1]; /* really REGULAR_REVMAP_PAGE_MAXITEMS */
+ } RevmapContents;
+
+ #define REGULAR_REVMAP_CONTENT_SIZE \
+ (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - \
+ offsetof(RevmapContents, rmr_tids) - \
+ MAXALIGN(sizeof(MinmaxSpecialSpace)))
+ /* max num of items in the array */
+ #define REGULAR_REVMAP_PAGE_MAXITEMS \
+ (REGULAR_REVMAP_CONTENT_SIZE / sizeof(ItemPointerData))
+
+ /* Definitions for array revmap pages */
+ typedef struct RevmapArrayContents
+ {
+ int32 rma_nblocks;
+ BlockNumber rma_blocks[1]; /* really ARRAY_REVMAP_PAGE_MAXITEMS */
+ } RevmapArrayContents;
+
+ #define REVMAP_ARRAY_CONTENT_SIZE \
+ (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - \
+ offsetof(RevmapArrayContents, rma_blocks) - \
+ MAXALIGN(sizeof(MinmaxSpecialSpace)))
+ /* max num of items in the array */
+ #define ARRAY_REVMAP_PAGE_MAXITEMS \
+ (REVMAP_ARRAY_CONTENT_SIZE / sizeof(BlockNumber))
+
+
+ extern void mm_page_init(Page page, uint16 type);
+
+ #endif /* MINMAX_PAGE_H */
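Some rough capacity figures, assuming the default 8 kB BLCKSZ: a regular revmap page holds on the order of 1350 item pointers (6 bytes each, after subtracting the page header, rmr_logblk and the special space), so with the default 128 pages per range a single revmap page addresses roughly 170000 heap blocks, i.e. somewhat over 1 GB of heap.  The metapage can list close to 2000 array pages and each array page close to 2000 regular revmap pages, so the two-level scheme is nowhere near a limiting factor for realistic table sizes.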
*** /dev/null
--- b/src/include/access/minmax_revmap.h
***************
*** 0 ****
--- 1,34 ----
+ /*
+ * prototypes for minmax reverse range maps
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_revmap.h
+ */
+
+ #ifndef MINMAX_REVMAP_H
+ #define MINMAX_REVMAP_H
+
+ #include "storage/block.h"
+ #include "storage/itemptr.h"
+ #include "storage/off.h"
+ #include "utils/relcache.h"
+
+ /* struct definition lives in mmrevmap.c */
+ typedef struct mmRevmapAccess mmRevmapAccess;
+
+ extern mmRevmapAccess *mmRevmapAccessInit(Relation idxrel);
+ extern void mmRevmapAccessTerminate(mmRevmapAccess *rmAccess);
+
+ extern void mmRevmapCreate(Relation idxrel);
+ extern void mmSetHeapBlockItemptr(mmRevmapAccess *rmAccess, BlockNumber blk,
+ BlockNumber blkno, OffsetNumber offno);
+ extern void mmGetHeapBlockItemptr(mmRevmapAccess *rmAccess, BlockNumber blk,
+ ItemPointerData *iptr);
+ extern void mmRevmapTruncate(mmRevmapAccess *rmAccess,
+ BlockNumber heapNumBlocks);
+
+
+ #endif /* MINMAX_REVMAP_H */
*** /dev/null
--- b/src/include/access/minmax_tuple.h
***************
*** 0 ****
--- 1,79 ----
+ /*
+ * Declarations for dealing with MinMax-specific tuples.
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_tuple.h
+ */
+ #ifndef MINMAX_TUPLE_H
+ #define MINMAX_TUPLE_H
+
+ #include "access/tupdesc.h"
+
+
+ /*
+ * This struct is used to represent the indexed values for one column, within
+ * one page range.
+ */
+ typedef struct MMValues
+ {
+ Datum min;
+ Datum max;
+ bool hasnulls;
+ bool allnulls;
+ } MMValues;
+
+ /*
+ * This struct represents one index tuple, comprising the minimum and
+ * maximum values for all indexed columns, within one page range.
+ * The number of elements in the values array is determined by the accompanying
+ * tuple descriptor.
+ */
+ typedef struct DeformedMMTuple
+ {
+ bool nvalues; /* XXX unused */
+ MMValues values[FLEXIBLE_ARRAY_MEMBER];
+ } DeformedMMTuple;
+
+ /*
+ * An on-disk minmax tuple. This is possibly followed by a nulls bitmask, with
+ * room for natts*2 null bits; min and max Datum values for each column follow
+ * that.
+ */
+ typedef struct MMTuple
+ {
+ /* ---------------
+ * mt_info is laid out in the following fashion:
+ *
+ * 7th (high) bit: has nulls
+ * 6th bit: unused
+ * 5th bit: unused
+ * 4-0 bit: offset of data
+ * ---------------
+ */
+ uint8 mt_info;
+ } MMTuple;
+
+ #define SizeOfMinMaxTuple (offsetof(MMTuple, mt_info) + sizeof(uint8))
+
+ /*
+ * t_info manipulation macros
+ */
+ #define MMIDX_OFFSET_MASK 0x1F
+ /* bit 0x20 is not used at present */
+ /* bit 0x40 is not used at present */
+ #define MMIDX_NULLS_MASK 0x80
+
+ #define MMTupleDataOffset(mmtup) ((Size) (((MMTuple *) (mmtup))->mt_info & MMIDX_OFFSET_MASK))
+ #define MMTupleHasNulls(mmtup) (((((MMTuple *) (mmtup))->mt_info & MMIDX_NULLS_MASK)) != 0)
+
+
+ extern TupleDesc minmax_get_descr(TupleDesc tupdesc);
+ extern MMTuple *minmax_form_tuple(TupleDesc idxDesc, TupleDesc diskDesc,
+ DeformedMMTuple *tuple, Size *size);
+ extern void minmax_free_tuple(MMTuple *tuple);
+ extern DeformedMMTuple *minmax_deform_tuple(TupleDesc tupdesc, MMTuple *tuple);
+
+ #endif /* MINMAX_TUPLE_H */
*** /dev/null
--- b/src/include/access/minmax_xlog.h
***************
*** 0 ****
--- 1,132 ----
+ /*-------------------------------------------------------------------------
+ *
+ * minmax_xlog.h
+ * POSTGRES MinMax access XLOG definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/minmax_xlog.h
+ *
+ *-------------------------------------------------------------------------
+ */
+ #ifndef MINMAX_XLOG_H
+ #define MINMAX_XLOG_H
+
+ #include "access/xlog.h"
+ #include "storage/bufpage.h"
+ #include "storage/itemptr.h"
+ #include "storage/relfilenode.h"
+ #include "utils/relcache.h"
+
+
+ /*
+ * WAL record definitions for minmax's WAL operations
+ *
+ * XLOG allows to store some information in high 4 bits of log
+ * record xl_info field.
+ */
+ #define XLOG_MINMAX_CREATE_INDEX 0x00
+ #define XLOG_MINMAX_INSERT 0x10
+ #define XLOG_MINMAX_BULKREMOVE 0x20
+ #define XLOG_MINMAX_REVMAP_SET 0x30
+ #define XLOG_MINMAX_METAPG_SET 0x40
+ #define XLOG_MINMAX_RMARRAY_SET 0x50
+ #define XLOG_MINMAX_INIT_RMPG 0x60
+
+ #define XLOG_MINMAX_OPMASK 0x70
+ /*
+ * When we insert the first item on a new page, we restore the entire page in
+ * redo.
+ */
+ #define XLOG_MINMAX_INIT_PAGE 0x80
+
+ /* This is what we need to know about a minmax index create */
+ typedef struct xl_minmax_createidx
+ {
+ RelFileNode node;
+ } xl_minmax_createidx;
+ #define SizeOfMinmaxCreateIdx (offsetof(xl_minmax_createidx, node) + sizeof(RelFileNode))
+
+ /* All that we need to find a minmax tuple */
+ typedef struct xl_minmax_tid
+ {
+ RelFileNode node;
+ ItemPointerData tid;
+ } xl_minmax_tid;
+
+ #define SizeOfMinmaxTid (offsetof(xl_minmax_tid, tid) + SizeOfIptrData)
+
+ /* This is what we need to know about a minmax tuple insert */
+ typedef struct xl_minmax_insert
+ {
+ xl_minmax_tid target;
+ bool overwrite;
+ /* tuple data follows at end of struct */
+ } xl_minmax_insert;
+
+ #define SizeOfMinmaxInsert (offsetof(xl_minmax_insert, overwrite) + sizeof(bool))
+
+ /* This is what we need to know about a bulk minmax tuple remove */
+ typedef struct xl_minmax_bulkremove
+ {
+ RelFileNode node;
+ BlockNumber block;
+ /* offset number array follows at end of struct */
+ } xl_minmax_bulkremove;
+
+ #define SizeOfMinmaxBulkRemove (offsetof(xl_minmax_bulkremove, block) + sizeof(BlockNumber))
+
+ /* This is what we need to know about a revmap "set heap ptr" */
+ typedef struct xl_minmax_rm_set
+ {
+ RelFileNode node;
+ BlockNumber mapBlock;
+ int pagesPerRange;
+ BlockNumber heapBlock;
+ ItemPointerData newval;
+ } xl_minmax_rm_set;
+
+ #define SizeOfMinmaxRevmapSet (offsetof(xl_minmax_rm_set, newval) + SizeOfIptrData)
+
+ /* This is what we need to know about a "metapage set" operation */
+ typedef struct xl_minmax_metapg_set
+ {
+ RelFileNode node;
+ uint32 blkidx;
+ BlockNumber newpg;
+ } xl_minmax_metapg_set;
+
+ #define SizeOfMinmaxMetapgSet (offsetof(xl_minmax_metapg_set, newpg) + \
+ sizeof(BlockNumber))
+
+ /* This is what we need to know about a "revmap array set" operation */
+ typedef struct xl_minmax_rmarray_set
+ {
+ RelFileNode node;
+ BlockNumber rmarray;
+ uint32 blkidx;
+ BlockNumber newpg;
+ } xl_minmax_rmarray_set;
+
+ #define SizeOfMinmaxRmarraySet (offsetof(xl_minmax_rmarray_set, newpg) + \
+ sizeof(BlockNumber))
+
+ /* This is what we need to know when we initialize a new revmap page */
+ typedef struct xl_minmax_init_rmpg
+ {
+ RelFileNode node;
+ bool array; /* array revmap page or regular revmap page */
+ BlockNumber blkno;
+ BlockNumber logblk; /* only used by regular revmap pages */
+ } xl_minmax_init_rmpg;
+
+ #define SizeOfMinmaxInitRmpg (offsetof(xl_minmax_init_rmpg, blkno) + \
+ sizeof(BlockNumber))
+
+
+ extern void minmax_desc(StringInfo buf, uint8 xl_info, char *rec);
+ extern void minmax_redo(XLogRecPtr lsn, XLogRecord *record);
+
+ #endif /* MINMAX_XLOG_H */
*** a/src/include/access/reloptions.h
--- b/src/include/access/reloptions.h
***************
*** 45,52 **** typedef enum relopt_kind
RELOPT_KIND_TABLESPACE = (1 << 7),
RELOPT_KIND_SPGIST = (1 << 8),
RELOPT_KIND_VIEW = (1 << 9),
/* if you add a new kind, make sure you update "last_default" too */
! RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_VIEW,
/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
RELOPT_KIND_MAX = (1 << 30)
} relopt_kind;
--- 45,53 ----
RELOPT_KIND_TABLESPACE = (1 << 7),
RELOPT_KIND_SPGIST = (1 << 8),
RELOPT_KIND_VIEW = (1 << 9),
+ RELOPT_KIND_MINMAX = (1 << 10),
/* if you add a new kind, make sure you update "last_default" too */
! RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_MINMAX,
/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
RELOPT_KIND_MAX = (1 << 30)
} relopt_kind;
*** a/src/include/access/relscan.h
--- b/src/include/access/relscan.h
***************
*** 35,42 **** typedef struct HeapScanDescData
bool rs_temp_snap; /* unregister snapshot at scan end? */
/* state set up at initscan time */
! BlockNumber rs_nblocks; /* number of blocks to scan */
BlockNumber rs_startblock; /* block # to start at */
BufferAccessStrategy rs_strategy; /* access strategy for reads */
bool rs_syncscan; /* report location to syncscan logic? */
--- 35,44 ----
bool rs_temp_snap; /* unregister snapshot at scan end? */
/* state set up at initscan time */
! BlockNumber rs_nblocks; /* total number of blocks in rel */
BlockNumber rs_startblock; /* block # to start at */
+ BlockNumber rs_initblock; /* block # to consider initial of rel */
+ BlockNumber rs_numblocks; /* number of blocks to scan */
BufferAccessStrategy rs_strategy; /* access strategy for reads */
bool rs_syncscan; /* report location to syncscan logic? */
*** a/src/include/access/rmgrlist.h
--- b/src/include/access/rmgrlist.h
***************
*** 42,44 **** PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_xlog_startup, gin_xlog_cleanup
--- 42,45 ----
PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_xlog_startup, gist_xlog_cleanup)
PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, NULL, NULL)
PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_xlog_startup, spg_xlog_cleanup)
+ PG_RMGR(RM_MINMAX_ID, "MinMax", minmax_redo, minmax_desc, NULL, NULL)
*** a/src/include/catalog/index.h
--- b/src/include/catalog/index.h
***************
*** 97,102 **** extern double IndexBuildHeapScan(Relation heapRelation,
--- 97,110 ----
bool allow_sync,
IndexBuildCallback callback,
void *callback_state);
+ extern double IndexBuildHeapRangeScan(Relation heapRelation,
+ Relation indexRelation,
+ IndexInfo *indexInfo,
+ bool allow_sync,
+ BlockNumber start_blockno,
+ BlockNumber end_blockno,
+ IndexBuildCallback callback,
+ void *callback_state);
extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
*** a/src/include/catalog/pg_am.h
--- b/src/include/catalog/pg_am.h
***************
*** 132,136 **** DESCR("GIN index access method");
--- 132,138 ----
DATA(insert OID = 4000 ( spgist 0 5 f f f f f t f t f f f 0 spginsert spgbeginscan spggettuple spggetbitmap spgrescan spgendscan spgmarkpos spgrestrpos spgbuild spgbuildempty spgbulkdelete spgvacuumcleanup spgcanreturn spgcostestimate spgoptions ));
DESCR("SP-GiST index access method");
#define SPGIST_AM_OID 4000
+ DATA(insert OID = 3580 ( minmax 5 0 f f f f t t f t t f f 0 mminsert mmbeginscan - mmgetbitmap mmrescan mmendscan mmmarkpos mmrestrpos mmbuild mmbuildempty mmbulkdelete mmvacuumcleanup - mmcostestimate mmoptions ));
+ #define MINMAX_AM_OID 3580
#endif /* PG_AM_H */
*** a/src/include/catalog/pg_amop.h
--- b/src/include/catalog/pg_amop.h
***************
*** 845,848 **** DATA(insert ( 3550 869 869 25 s 932 783 0 ));
--- 845,929 ----
DATA(insert ( 3550 869 869 26 s 933 783 0 ));
DATA(insert ( 3550 869 869 27 s 934 783 0 ));
+ /*
+ * MinMax int4_ops
+ */
+ DATA(insert ( 4054 23 23 1 s 97 3580 0 ));
+ DATA(insert ( 4054 23 23 2 s 523 3580 0 ));
+ DATA(insert ( 4054 23 23 3 s 96 3580 0 ));
+ DATA(insert ( 4054 23 23 4 s 525 3580 0 ));
+ DATA(insert ( 4054 23 23 5 s 521 3580 0 ));
+
+ /*
+ * MinMax numeric_ops
+ */
+ DATA(insert ( 4055 1700 1700 1 s 1754 3580 0 ));
+ DATA(insert ( 4055 1700 1700 2 s 1755 3580 0 ));
+ DATA(insert ( 4055 1700 1700 3 s 1752 3580 0 ));
+ DATA(insert ( 4055 1700 1700 4 s 1757 3580 0 ));
+ DATA(insert ( 4055 1700 1700 5 s 1756 3580 0 ));
+
+ /*
+ * MinMax text_ops
+ */
+ DATA(insert ( 4056 25 25 1 s 664 3580 0 ));
+ DATA(insert ( 4056 25 25 2 s 665 3580 0 ));
+ DATA(insert ( 4056 25 25 3 s 98 3580 0 ));
+ DATA(insert ( 4056 25 25 4 s 667 3580 0 ));
+ DATA(insert ( 4056 25 25 5 s 666 3580 0 ));
+
+ /*
+ * MinMax time_ops
+ */
+ DATA(insert ( 4057 1083 1083 1 s 1110 3580 0 ));
+ DATA(insert ( 4057 1083 1083 2 s 1111 3580 0 ));
+ DATA(insert ( 4057 1083 1083 3 s 1108 3580 0 ));
+ DATA(insert ( 4057 1083 1083 4 s 1113 3580 0 ));
+ DATA(insert ( 4057 1083 1083 5 s 1112 3580 0 ));
+
+ /*
+ * MinMax timetz_ops
+ */
+ DATA(insert ( 4058 1266 1266 1 s 1552 3580 0 ));
+ DATA(insert ( 4058 1266 1266 2 s 1553 3580 0 ));
+ DATA(insert ( 4058 1266 1266 3 s 1550 3580 0 ));
+ DATA(insert ( 4058 1266 1266 4 s 1555 3580 0 ));
+ DATA(insert ( 4058 1266 1266 5 s 1554 3580 0 ));
+
+ /*
+ * MinMax timestamp_ops
+ */
+ DATA(insert ( 4059 1114 1114 1 s 2062 3580 0 ));
+ DATA(insert ( 4059 1114 1114 2 s 2063 3580 0 ));
+ DATA(insert ( 4059 1114 1114 3 s 2060 3580 0 ));
+ DATA(insert ( 4059 1114 1114 4 s 2065 3580 0 ));
+ DATA(insert ( 4059 1114 1114 5 s 2064 3580 0 ));
+
+ /*
+ * MinMax timestamptz_ops
+ */
+ DATA(insert ( 4060 1184 1184 1 s 1322 3580 0 ));
+ DATA(insert ( 4060 1184 1184 2 s 1323 3580 0 ));
+ DATA(insert ( 4060 1184 1184 3 s 1320 3580 0 ));
+ DATA(insert ( 4060 1184 1184 4 s 1325 3580 0 ));
+ DATA(insert ( 4060 1184 1184 5 s 1324 3580 0 ));
+
+ /*
+ * MinMax date_ops
+ */
+ DATA(insert ( 4061 1082 1082 1 s 1095 3580 0 ));
+ DATA(insert ( 4061 1082 1082 2 s 1096 3580 0 ));
+ DATA(insert ( 4061 1082 1082 3 s 1093 3580 0 ));
+ DATA(insert ( 4061 1082 1082 4 s 1098 3580 0 ));
+ DATA(insert ( 4061 1082 1082 5 s 1097 3580 0 ));
+
+ /*
+ * MinMax char_ops
+ */
+ DATA(insert ( 4062 18 18 1 s 631 3580 0 ));
+ DATA(insert ( 4062 18 18 2 s 632 3580 0 ));
+ DATA(insert ( 4062 18 18 3 s 92 3580 0 ));
+ DATA(insert ( 4062 18 18 4 s 634 3580 0 ));
+ DATA(insert ( 4062 18 18 5 s 633 3580 0 ));
+
#endif /* PG_AMOP_H */
*** a/src/include/catalog/pg_opclass.h
--- b/src/include/catalog/pg_opclass.h
***************
*** 235,239 **** DATA(insert ( 403 jsonb_ops PGNSP PGUID 4033 3802 t 0 ));
--- 235,248 ----
DATA(insert ( 405 jsonb_ops PGNSP PGUID 4034 3802 t 0 ));
DATA(insert ( 2742 jsonb_ops PGNSP PGUID 4036 3802 t 25 ));
DATA(insert ( 2742 jsonb_path_ops PGNSP PGUID 4037 3802 f 23 ));
+ DATA(insert ( 3580 int4_ops PGNSP PGUID 4054 23 t 0 ));
+ DATA(insert ( 3580 numeric_ops PGNSP PGUID 4055 1700 t 0 ));
+ DATA(insert ( 3580 text_ops PGNSP PGUID 4056 25 t 0 ));
+ DATA(insert ( 3580 time_ops PGNSP PGUID 4057 1083 t 0 ));
+ DATA(insert ( 3580 timetz_ops PGNSP PGUID 4058 1266 t 0 ));
+ DATA(insert ( 3580 timestamp_ops PGNSP PGUID 4059 1114 t 0 ));
+ DATA(insert ( 3580 timestamptz_ops PGNSP PGUID 4060 1184 t 0 ));
+ DATA(insert ( 3580 date_ops PGNSP PGUID 4061 1082 t 0 ));
+ DATA(insert ( 3580 char_ops PGNSP PGUID 4062 18 t 0 ));
#endif /* PG_OPCLASS_H */
*** a/src/include/catalog/pg_opfamily.h
--- b/src/include/catalog/pg_opfamily.h
***************
*** 157,160 **** DATA(insert OID = 4035 ( 783 jsonb_ops PGNSP PGUID ));
--- 157,170 ----
DATA(insert OID = 4036 ( 2742 jsonb_ops PGNSP PGUID ));
DATA(insert OID = 4037 ( 2742 jsonb_path_ops PGNSP PGUID ));
+ DATA(insert OID = 4054 ( 3580 int4_ops PGNSP PGUID ));
+ DATA(insert OID = 4055 ( 3580 numeric_ops PGNSP PGUID ));
+ DATA(insert OID = 4056 ( 3580 text_ops PGNSP PGUID ));
+ DATA(insert OID = 4057 ( 3580 time_ops PGNSP PGUID ));
+ DATA(insert OID = 4058 ( 3580 timetz_ops PGNSP PGUID ));
+ DATA(insert OID = 4059 ( 3580 timestamp_ops PGNSP PGUID ));
+ DATA(insert OID = 4060 ( 3580 timestamptz_ops PGNSP PGUID ));
+ DATA(insert OID = 4061 ( 3580 date_ops PGNSP PGUID ));
+ DATA(insert OID = 4062 ( 3580 char_ops PGNSP PGUID ));
+
#endif /* PG_OPFAMILY_H */
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
***************
*** 565,570 **** DESCR("btree(internal)");
--- 565,598 ----
DATA(insert OID = 2785 ( btoptions PGNSP PGUID 12 1 0 0 0 f f f f t f s 2 0 17 "1009 16" _null_ _null_ _null_ _null_ btoptions _null_ _null_ _null_ ));
DESCR("btree(internal)");
+ DATA(insert OID = 3789 ( mmgetbitmap PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 20 "2281 2281" _null_ _null_ _null_ _null_ mmgetbitmap _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3790 ( mminsert PGNSP PGUID 12 1 0 0 0 f f f f t f v 6 0 16 "2281 2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mminsert _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3791 ( mmbeginscan PGNSP PGUID 12 1 0 0 0 f f f f t f v 3 0 2281 "2281 2281 2281" _null_ _null_ _null_ _null_ mmbeginscan _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3792 ( mmrescan PGNSP PGUID 12 1 0 0 0 f f f f t f v 5 0 2278 "2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmrescan _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3793 ( mmendscan PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmendscan _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3794 ( mmmarkpos PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmmarkpos _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3795 ( mmrestrpos PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmrestrpos _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3796 ( mmbuild PGNSP PGUID 12 1 0 0 0 f f f f t f v 3 0 2281 "2281 2281 2281" _null_ _null_ _null_ _null_ mmbuild _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3797 ( mmbuildempty PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmbuildempty _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3798 ( mmbulkdelete PGNSP PGUID 12 1 0 0 0 f f f f t f v 4 0 2281 "2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmbulkdelete _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3799 ( mmvacuumcleanup PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ mmvacuumcleanup _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3800 ( mmcostestimate PGNSP PGUID 12 1 0 0 0 f f f f t f v 7 0 2278 "2281 2281 2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmcostestimate _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3801 ( mmoptions PGNSP PGUID 12 1 0 0 0 f f f f t f s 2 0 17 "1009 16" _null_ _null_ _null_ _null_ mmoptions _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+
+
DATA(insert OID = 339 ( poly_same PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "604 604" _null_ _null_ _null_ _null_ poly_same _null_ _null_ _null_ ));
DATA(insert OID = 340 ( poly_contain PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "604 604" _null_ _null_ _null_ _null_ poly_contain _null_ _null_ _null_ ));
DATA(insert OID = 341 ( poly_left PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "604 604" _null_ _null_ _null_ _null_ poly_left _null_ _null_ _null_ ));
*** a/src/include/storage/bufpage.h
--- b/src/include/storage/bufpage.h
***************
*** 393,398 **** extern void PageInit(Page page, Size pageSize, Size specialSize);
--- 393,400 ----
extern bool PageIsVerified(Page page, BlockNumber blkno);
extern OffsetNumber PageAddItem(Page page, Item item, Size size,
OffsetNumber offsetNumber, bool overwrite, bool is_heap);
+ extern void PageOverwriteItemData(Page page, OffsetNumber offset, Item item,
+ Size size);
extern Page PageGetTempPage(Page page);
extern Page PageGetTempPageCopy(Page page);
extern Page PageGetTempPageCopySpecial(Page page);
***************
*** 403,408 **** extern Size PageGetExactFreeSpace(Page page);
--- 405,412 ----
extern Size PageGetHeapFreeSpace(Page page);
extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
+ extern void PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos,
+ int nitems);
extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
*** a/src/include/utils/selfuncs.h
--- b/src/include/utils/selfuncs.h
***************
*** 195,200 **** extern Datum hashcostestimate(PG_FUNCTION_ARGS);
--- 195,201 ----
extern Datum gistcostestimate(PG_FUNCTION_ARGS);
extern Datum spgcostestimate(PG_FUNCTION_ARGS);
extern Datum gincostestimate(PG_FUNCTION_ARGS);
+ extern Datum mmcostestimate(PG_FUNCTION_ARGS);
/* Functions in array_selfuncs.c */
*** a/src/test/regress/expected/opr_sanity.out
--- b/src/test/regress/expected/opr_sanity.out
***************
*** 1547,1552 **** ORDER BY 1, 2, 3;
--- 1547,1557 ----
2742 | 9 | ?
2742 | 10 | ?|
2742 | 11 | ?&
+ 3580 | 1 | <
+ 3580 | 2 | <=
+ 3580 | 3 | =
+ 3580 | 4 | >=
+ 3580 | 5 | >
4000 | 1 | <<
4000 | 1 | ~<~
4000 | 2 | &<
***************
*** 1569,1575 **** ORDER BY 1, 2, 3;
4000 | 15 | >
4000 | 16 | @>
4000 | 18 | =
! (80 rows)
-- Check that all opclass search operators have selectivity estimators.
-- This is not absolutely required, but it seems a reasonable thing
--- 1574,1580 ----
4000 | 15 | >
4000 | 16 | @>
4000 | 18 | =
! (85 rows)
-- Check that all opclass search operators have selectivity estimators.
-- This is not absolutely required, but it seems a reasonable thing
***************
*** 1731,1736 **** WHERE NOT (
--- 1736,1742 ----
-- GIN has six support functions. 1-3 are mandatory, 5 is optional, and
-- at least one of 4 and 6 must be given.
-- SP-GiST has five support functions, all mandatory
+ -- MinMax has no support functions
amname = 'btree' AND procnums @> '{1}' OR
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
***************
*** 1756,1762 **** WHERE NOT (
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}'
);
amname | opcname | procnums
--------+---------+----------
--- 1762,1769 ----
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}' OR
! amname = 'minmax' AND procnums = '{NULL}'
);
amname | opcname | procnums
--------+---------+----------
*** a/src/test/regress/sql/opr_sanity.sql
--- b/src/test/regress/sql/opr_sanity.sql
***************
*** 1154,1159 **** WHERE NOT (
--- 1154,1160 ----
-- GIN has six support functions. 1-3 are mandatory, 5 is optional, and
-- at least one of 4 and 6 must be given.
-- SP-GiST has five support functions, all mandatory
+ -- MinMax has no support functions
amname = 'btree' AND procnums @> '{1}' OR
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
***************
*** 1177,1183 **** WHERE NOT (
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}'
);
-- Unfortunately, we can't check the amproc link very well because the
--- 1178,1185 ----
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}' OR
! amname = 'minmax' AND procnums = '{NULL}'
);
-- Unfortunately, we can't check the amproc link very well because the
On Sat, Jun 14, 2014 at 10:34 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Robert Haas wrote:
On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Here's an updated version of this patch, with fixes to all the bugs
reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and
Amit Kapila for the reports.
I'm not very happy with the use of a separate relation fork for
storing this data.
Here's a new version of this patch. Now the revmap is not stored in a
separate fork, but together with all the regular data, as explained
elsewhere in the thread.
Cool.
Have you thought more about this comment from Heikki?
/messages/by-id/52495DD3.9010809@vmware.com
I'm concerned that we could end up with one index type of this general
nature for min/max type operations, and then another very similar
index type for geometric operators or text-search operators or what
have you. Considering the overhead in adding and maintaining an index
AM, I think we should try to be sure that we've done a reasonably
solid job making each one as general as we reasonably can.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2014-06-17 10:26:11 -0400, Robert Haas wrote:
On Sat, Jun 14, 2014 at 10:34 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Robert Haas wrote:
On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Here's an updated version of this patch, with fixes to all the bugs
reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and
Amit Kapila for the reports.
I'm not very happy with the use of a separate relation fork for
storing this data.
Here's a new version of this patch. Now the revmap is not stored in a
separate fork, but together with all the regular data, as explained
elsewhere in the thread.
Cool.
Have you thought more about this comment from Heikki?
Is there actually a significant usecase behind that wish or just a
general demand for being generic? To me it seems fairly unlikely you'd
end up with something useful by doing a minmax index over bounding
boxes.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jun 17, 2014 at 3:31 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Is there actually a significant usecase behind that wish or just a
general demand for being generic? To me it seems fairly unlikely you'd
end up with something useful by doing a minmax index over bounding
boxes.
Isn't min/max just a 2d bounding box? If you do a bulk data load of
something like the census data then sure, every page will have data
points for some geometrically clustered set of data.
I had in mind to do a small bloom filter per block. In general any
kind of predicate like bounding box should work.
--
greg
On Tue, Jun 17, 2014 at 10:31 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-06-17 10:26:11 -0400, Robert Haas wrote:
On Sat, Jun 14, 2014 at 10:34 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Robert Haas wrote:
On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Here's an updated version of this patch, with fixes to all the bugs
reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and
Amit Kapila for the reports.
I'm not very happy with the use of a separate relation fork for
storing this data.
Here's a new version of this patch. Now the revmap is not stored in a
separate fork, but together with all the regular data, as explained
elsewhere in the thread.
Cool.
Have you thought more about this comment from Heikki?
Is there actually a significant usecase behind that wish or just a
general demand for being generic? To me it seems fairly unlikely you'd
end up with something useful by doing a minmax index over bounding
boxes.
Well, I'm not the guy who does things with geometric data, but I don't
want to ignore the significant percentage of our users who are. As
you must surely know, the GIST implementations for geometric data
types store bounding boxes on internal pages, and that seems to be
useful to people. What is your reason for thinking that it would be
any less useful in this context?
I do also think that a general demand for being generic ought to carry
some weight. We have gone to great lengths to make sure that our
indexing can handle more than just < and >, where a lot of other
products have not bothered. I think we have gotten a lot of mileage
out of that decision and feel that we shouldn't casually back away
from it. Obviously, we do already have some special-case
optimizations and will likely have more in the future, and there
can certainly be valid reasons for taking that approach. But it needs
to be justified in some way; we shouldn't accept a less-generic
approach blindly, without questioning whether it's possible to do
better.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2014-06-17 11:48:10 -0400, Robert Haas wrote:
On Tue, Jun 17, 2014 at 10:31 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-06-17 10:26:11 -0400, Robert Haas wrote:
On Sat, Jun 14, 2014 at 10:34 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Robert Haas wrote:
On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Here's an updated version of this patch, with fixes to all the bugs
reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and
Amit Kapila for the reports.
I'm not very happy with the use of a separate relation fork for
storing this data.
Here's a new version of this patch. Now the revmap is not stored in a
separate fork, but together with all the regular data, as explained
elsewhere in the thread.
Cool.
Have you thought more about this comment from Heikki?
Is there actually a significant usecase behind that wish or just a
general demand for being generic? To me it seems fairly unlikely you'd
end up with something useful by doing a minmax index over bounding
boxes.
Well, I'm not the guy who does things with geometric data, but I don't
want to ignore the significant percentage of our users who are. As
you must surely know, the GIST implementations for geometric data
types store bounding boxes on internal pages, and that seems to be
useful to people. What is your reason for thinking that it would be
any less useful in this context?
For me minmax indexes are helpful because they allow generating *small*
'coarse' indexes over large volumes of data. From my pov that's
possible because they don't contain item pointers for every contained
row.
That'll imo work well if there are consecutive rows in the table that
can be summarized into one min/max range. That's quite likely to happen
for common applications of a number of scalar datatypes. But the
likelihood of placing sufficiently many rows with very similar bounding
boxes close together seems much lower in practice. And I think
that's generally the case for operations which can't be well represented
as btree opclasses - the substructure implied inside a Datum will
make correlation between consecutive rows less likely.
Maybe I have a major intuition failure here though...
I do also think that a general demand for being generic ought to carry
some weight.
Agreed. It's always a balancing act. But it's not like this doesn't use a
datatype abstraction concept...
We have gone to great lengths to make sure that our
indexing can handle more than just < and >, where a lot of other
products have not bothered. I think we have gotten a lot of mileage
out of that decision and feel that we shouldn't casually back away
from it.
I don't see this as a case of backing away from that though?
we shouldn't accept a less-generic
approach blindly, without questioning whether it's possible to do
better.
But the aim shouldn't be to add genericity that's not going to be used,
but to add it where it's somewhat likely to help...
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jun 17, 2014 at 12:04 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Well, I'm not the guy who does things with geometric data, but I don't
want to ignore the significant percentage of our users who are. As
you must surely know, the GIST implementations for geometric data
types store bounding boxes on internal pages, and that seems to be
useful to people. What is your reason for thinking that it would be
any less useful in this context?
For me minmax indexes are helpful because they allow to generate *small*
'coarse' indexes over large volumes of data. From my pov that's possible
possible because they don't contain item pointers for every contained
row.
That'ill imo work well if there are consecutive rows in the table that
can be summarized into one min/max range. That's quite likely to happen
for common applications of number of scalar datatypes. But the
likelihood of placing sufficiently many rows with very similar bounding
boxes close together seems much less relevant in practice. And I think
that's generally likely for operations which can't be well represented
as btree opclasses - the substructure that implies inside a Datum will
make correlation between consecutive rows less likely.
Well, I don't know: suppose you're loading geospatial data showing the
location of every building in some country. It might easily be the
case that the data is or can be loaded in an order that provides
pretty good spatial locality, leading to tight bounding boxes over
physically consecutive data ranges.
But I'm not trying to say that we absolutely have to support that kind
of thing; what I am trying to say is that there should be a README or
a mailing list post or some such that says: "We thought about how
generic to make this. We considered A, B, and C. We rejected C as
too narrow, and A because if we made it that general it would have
greatly enlarged the disk footprint for the following reasons.
Therefore we selected B." Basically, I think Heikki asked a good
question - which was "could we abstract this more?" - and I can't
recall seeing a clear answer explaining why we could or couldn't and
what the trade-offs would be.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2014-06-17 12:14:00 -0400, Robert Haas wrote:
On Tue, Jun 17, 2014 at 12:04 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Well, I'm not the guy who does things with geometric data, but I don't
want to ignore the significant percentage of our users who are. As
you must surely know, the GIST implementations for geometric data
types store bounding boxes on internal pages, and that seems to be
useful to people. What is your reason for thinking that it would be
any less useful in this context?
For me minmax indexes are helpful because they allow to generate *small*
'coarse' indexes over large volumes of data. From my pov that's possible
possible because they don't contain item pointers for every contained
row.
That'ill imo work well if there are consecutive rows in the table that
can be summarized into one min/max range. That's quite likely to happen
for common applications of number of scalar datatypes. But the
likelihood of placing sufficiently many rows with very similar bounding
boxes close together seems much less relevant in practice. And I think
that's generally likely for operations which can't be well represented
as btree opclasses - the substructure that implies inside a Datum will
make correlation between consecutive rows less likely.
Well, I don't know: suppose you're loading geospatial data showing the
location of every building in some country. It might easily be the
case that the data is or can be loaded in an order that provides
pretty good spatial locality, leading to tight bounding boxes over
physically consecutive data ranges.
Well, it might be doable to correlate them along one axis, but along
both? That's more complicated... And even along one axis you
already get into problems if your geometries are irregularly sized.
A single large polygon will completely destroy indexability for anything
stored physically close by because suddenly the minmax range will be
huge... So you'll need to cleverly sort for that as well.
I think hierarchical datastructures are so much better suited for this,
that there's little point trying to fit them into minmax. I can very
well imagine that there's benefit in a gist support for only storing one
pointer per block instead of one pointer per item or such. But that seems
like a separate feature.
But I'm not trying to say that we absolutely have to support that kind
of thing; what I am trying to say is that there should be a README or
a mailing list post or some such that says: "We thought about how
generic to make this. We considered A, B, and C. We rejected C as
too narrow, and A because if we made it that general it would have
greatly enlarged the disk footprint for the following reasons.
Therefore we selected B."
Isn't 'simpler implementation' a valid reason that's already been
discussed onlist? Obviously simpler implementation doesn't trump
everything, but it's one valid reason...
Note that I have zap to do with the design of this feature. I work for
the same company as Alvaro, but that's pretty much it...
Basically, I think Heikki asked a good
question - which was "could we abstract this more?" - and I can't
recall seeing a clear answer explaining why we could or couldn't and
what the trade-offs would be.
'could we abstract more' imo is a pretty bad design guideline. It's 'is
there benefit in abstracting more'. Otherwise you end up with way too
complicated systems.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jun 17, 2014 at 1:04 PM, Andres Freund <andres@2ndquadrant.com> wrote:
For me minmax indexes are helpful because they allow to generate *small*
'coarse' indexes over large volumes of data. From my pov that's possible
possible because they don't contain item pointers for every contained
row.
But minmax is just a specific form of bloom filter.
This could certainly be generalized to a bloom filter index with some
set of bloom&hashing operators (minmax being just one).
On 06/17/2014 09:14 AM, Robert Haas wrote:
Well, I don't know: suppose you're loading geospatial data showing the
location of every building in some country. It might easily be the
case that the data is or can be loaded in an order that provides
pretty good spatial locality, leading to tight bounding boxes over
physically consecutive data ranges.
I admin a production application which has exactly this. However, that
application doesn't have big enough data to benefit from minmax indexes;
it uses the basic spatial indexes.
So, my $0.02: bounding box minmax falls under the heading of "would be
nice to have, but not if it delays the feature".
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Tue, Jun 17, 2014 at 11:16 AM, Andres Freund <andres@2ndquadrant.com> wrote:
Well, it might be doable to correlate them along one axis, but along
both? That's more complicated... And even alongside one axis you
already get into problems if your geometries are irregularly sized.
Asingle large polygon will completely destroy indexability for anything
stored physically close by because suddently the minmax range will be
huge... So you'll need to cleverly sort for that as well.
I think there's a misunderstanding here, possibly mine. My
understanding is that a min/max index will always be exactly the same
size for a given size table. It stores the minimum and maximum value
of the key for each page. Then you can do a bitmap scan by comparing
the search key with each page's minimum and maximum to see if that
page needs to be included in the scan. The failure mode is not that
the index is large but that a page that has an outlier will be
included in every scan as a false positive incurring an extra iop.
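To make that concrete, here is a rough sketch (purely illustrative, not
code from the patch; the summarized unit is a page or, per the patch's
default, a range of pages) of the per-range check such a bitmap scan
performs. The executor rechecks the actual heap tuples afterwards, so a
false positive only costs extra I/O, never a wrong answer:

#include <stdbool.h>
#include <stdint.h>

/* Illustrative only: can a summarized page range possibly satisfy
 * "col = key", given the stored per-range minimum and maximum?  If so,
 * every page in the range goes into the (lossy) bitmap. */
static bool
range_might_contain_eq(int32_t range_min, int32_t range_max, int32_t key)
{
    return key >= range_min && key <= range_max;
}

/* For "col < key" the test would be (range_min < key), for "col > key"
 * it would be (range_max > key), and analogously for <= and >=. */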
I don't think it's implausible at all that geometric data would work
well. If you load geometric data it's very common to load it by
geographic area, so that all objects in San Francisco land in one part of
the data load, probably even grouped by zip code or census block.
What operations would an opclass for min/max need? I think it would be
pretty similar to the operators that GiST needs (thankfully minus the
baroque page split function):
An aggregate to generate a min/max "bounding box" from several values
A function which takes a "bounding box" and a new value and returns
the new "bounding box"
A function which tests if a value is in a "bounding box"
A function which tests if a "bounding box" overlaps a "bounding box"
The nice thing is this would let us add things like range @> (contains
element) to the plain integer min/max case.
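For a plain int4 opclass those operations could be as small as the
following sketch (names and signatures invented here for illustration;
the patch as posted defines no opclass support procs at all, per the
opr_sanity change):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical summary state for one page range of an int4 column. */
typedef struct
{
    int32_t min;
    int32_t max;
    bool    empty;      /* nothing summarized yet */
} IntRange;

/* The "add a value" operation (the aggregate is just this applied to
 * every value in the range): fold one heap value into the summary.
 * Note it can only ever widen the range. */
static void
intrange_add_value(IntRange *r, int32_t val)
{
    if (r->empty)
    {
        r->min = r->max = val;
        r->empty = false;
        return;
    }
    if (val < r->min)
        r->min = val;
    if (val > r->max)
        r->max = val;
}

/* "Is a value possibly in the summary?" - the check used at scan time. */
static bool
intrange_might_contain(const IntRange *r, int32_t val)
{
    return !r->empty && val >= r->min && val <= r->max;
}

/* "Do two summaries overlap?" - what a range @> / && style predicate
 * would need. */
static bool
intrange_overlaps(const IntRange *a, const IntRange *b)
{
    return !a->empty && !b->empty && a->min <= b->max && b->min <= a->max;
}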
--
greg
On 06/17/2014 09:16 PM, Andres Freund wrote:
On 2014-06-17 12:14:00 -0400, Robert Haas wrote:
On Tue, Jun 17, 2014 at 12:04 PM, Andres Freund <andres@2ndquadrant.com> wrote:
Well, I'm not the guy who does things with geometric data, but I don't
want to ignore the significant percentage of our users who are. As
you must surely know, the GIST implementations for geometric data
types store bounding boxes on internal pages, and that seems to be
useful to people. What is your reason for thinking that it would be
any less useful in this context?
For me minmax indexes are helpful because they allow to generate *small*
'coarse' indexes over large volumes of data. From my pov that's possible
possible because they don't contain item pointers for every contained
row.
That'ill imo work well if there are consecutive rows in the table that
can be summarized into one min/max range. That's quite likely to happen
for common applications of number of scalar datatypes. But the
likelihood of placing sufficiently many rows with very similar bounding
boxes close together seems much less relevant in practice. And I think
that's generally likely for operations which can't be well represented
as btree opclasses - the substructure that implies inside a Datum will
make correlation between consecutive rows less likely.
Well, I don't know: suppose you're loading geospatial data showing the
location of every building in some country. It might easily be the
case that the data is or can be loaded in an order that provides
pretty good spatial locality, leading to tight bounding boxes over
physically consecutive data ranges.
Well, it might be doable to correlate them along one axis, but along
both? That's more complicated... And even alongside one axis you
already get into problems if your geometries are irregularly sized.
Sure, there are cases where it would be useless. But it's easy to
imagine scenarios where it would work well, where points are loaded in
clusters and points that are close to each other also end up physically
close to each other.
Asingle large polygon will completely destroy indexability for anything
stored physically close by because suddently the minmax range will be
huge... So you'll need to cleverly sort for that as well.
That's an inherent risk with minmax indexes: insert a few rows to the
"wrong" locations in the heap, and the selectivity of the index degrades
rapidly.
The main problem with using it for geometric types is that you can't
easily CLUSTER the table to make the minmax index effective again. But
there are ways around that.
But I'm not trying to say that we absolutely have to support that kind
of thing; what I am trying to say is that there should be a README or
a mailing list post or some such that says: "We thought about how
generic to make this. We considered A, B, and C. We rejected C as
too narrow, and A because if we made it that general it would have
greatly enlarged the disk footprint for the following reasons.
Therefore we selected B."Isn't 'simpler implementation' a valid reason that's already been
discussed onlist? Obviously simpler implementation doesn't trump
everything, but it's one valid reason...
Note that I have zap to do with the design of this feature. I work for
the same company as Alvaro, but that's pretty much it...
Without some analysis (e.g. implementing it and comparing), I don't buy
that it makes the implementation simpler to restrict it in this way.
Maybe it does, but often it's actually simpler to solve the general case.
- Heikki
On 2014-06-18 12:18:26 +0300, Heikki Linnakangas wrote:
On 06/17/2014 09:16 PM, Andres Freund wrote:
Well, it might be doable to correlate them along one axis, but along
both? That's more complicated... And even alongside one axis you
already get into problems if your geometries are irregularly sized.
Sure, there are cases where it would be useless. But it's easy to imagine
scenarios where it would work well, where points are loaded in clusters and
points that are close to each other also end up physically close to each
other.
Asingle large polygon will completely destroy indexability for anything
stored physically close by because suddently the minmax range will be
huge... So you'll need to cleverly sort for that as well.
That's an inherent risk with minmax indexes: insert a few rows to the
"wrong" locations in the heap, and the selectivity of the index degrades
rapidly.
Sure. But it's fairly normal to have natural clustering in many
columns (surrogate keys, date-series type of data). Even if you insert
geometric types in geographic clusters you'll have worse results
because some bounding boxes will be big and such.
And:
The main problem with using it for geometric types is that you can't easily
CLUSTER the table to make the minmax index effective again. But there are
ways around that.
Which are? Sure you can try stuff like recreating the table, sorting
rows with boundary boxes area above threshold first, and then go on to
sort by the lop left corner of the bounding box. But that'll be neither
builtin, nor convenient, nor perfect. In contrast to a normal CLUSTER
for types with a btree opclass which will yield the perfect order.
Isn't 'simpler implementation' a valid reason that's already been
discussed onlist? Obviously simpler implementation doesn't trump
everything, but it's one valid reason...
Note that I have zap to do with the design of this feature. I work for
the same company as Alvaro, but that's pretty much it...
Without some analysis (e.g implementing it and comparing), I don't buy that
it makes the implementation simpler to restrict it in this way. Maybe it
does, but often it's actually simpler to solve the general case.
So to implement a feature one now has to implement the most generic
variant as a prototype first? Really?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2014-06-17 16:48:07 -0700, Greg Stark wrote:
On Tue, Jun 17, 2014 at 11:16 AM, Andres Freund <andres@2ndquadrant.com> wrote:
Well, it might be doable to correlate them along one axis, but along
both? That's more complicated... And even alongside one axis you
already get into problems if your geometries are irregularly sized.
Asingle large polygon will completely destroy indexability for anything
stored physically close by because suddently the minmax range will be
huge... So you'll need to cleverly sort for that as well.
I think there's a misunderstanding here, possibly mine. My
understanding is that a min/max index will always be exactly the same
size for a given size table. It stores the minimum and maximum value
of the key for each page. Then you can do a bitmap scan by comparing
the search key with each page's minimum and maximum to see if that
page needs to be included in the scan. The failure mode is not that
the index is large but that a page that has an outlier will be
included in every scan as a false positive incurring an extra iop.
I just rechecked, and no, it doesn't, by default, store a range for each
page. It's MINMAX_DEFAULT_PAGES_PER_RANGE=128 pages by
default... Haven't checked what's the lowest it can be set to.
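To put that default in perspective (back-of-the-envelope only, assuming
8 kB pages and an int4 column, i.e. a few dozen bytes of summary tuple
plus revmap entry per range): a 1 TB heap is about 134 million pages,
which at 128 pages per range is roughly one million summarized ranges -
on the order of tens of megabytes of index for a terabyte of table.
Lowering pages_per_range (it looks like a per-index reloption, given the
RELOPT_KIND_MINMAX addition in the patch) trades a larger index for
tighter, more selective ranges.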
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 06/18/2014 01:46 PM, Andres Freund wrote:
On 2014-06-18 12:18:26 +0300, Heikki Linnakangas wrote:
The main problem with using it for geometric types is that you can't easily
CLUSTER the table to make the minmax index effective again. But there are
ways around that.
Which are? Sure you can try stuff like recreating the table, sorting
rows with boundary boxes area above threshold first, and then go on to
sort by the lop left corner of the bounding box.
Right, something like that. Or cluster using some other column that
correlates with the geometry, like a zip code.
But that'll be neither
builtin, nor convenient, nor perfect. In contrast to a normal CLUSTER
for types with a btree opclass which will yield the perfect order.
Sure.
BTW, CLUSTERing by a geometric type would be useful anyway, even without
minmax indexes.
Isn't 'simpler implementation' a valid reason that's already been
discussed onlist? Obviously simpler implementation doesn't trump
everything, but it's one valid reason...
Note that I have zap to do with the design of this feature. I work for
the same company as Alvaro, but that's pretty much it...
Without some analysis (e.g implementing it and comparing), I don't buy that
it makes the implementation simpler to restrict it in this way. Maybe it
does, but often it's actually simpler to solve the general case.
So to implement a feature one now has to implement the most generic
variant as a prototype first? Really?
Implementing something is a good way to demonstrate what it would look
like. But no, I don't insist on implementing every possible design
whenever a new feature is proposed.
I liked Greg's sketch of what the opclass support functions would be. It
doesn't seem significantly more complicated than what's in the patch now.
- Heikki
On Tue, Jun 17, 2014 at 2:16 PM, Andres Freund <andres@2ndquadrant.com> wrote:
But I'm not trying to say that we absolutely have to support that kind
of thing; what I am trying to say is that there should be a README or
a mailing list post or some such that says: "We thought about how
generic to make this. We considered A, B, and C. We rejected C as
too narrow, and A because if we made it that general it would have
greatly enlarged the disk footprint for the following reasons.
Therefore we selected B."Isn't 'simpler implementation' a valid reason that's already been
discussed onlist? Obviously simpler implementation doesn't trump
everything, but it's one valid reason...
Note that I have zap to do with the design of this feature. I work for
the same company as Alvaro, but that's pretty much it...
It really *hasn't* been discussed on-list. See these emails,
discussing the same ideas, from 8 months ago:
/messages/by-id/5249B2D3.6030606@vmware.com
/messages/by-id/CA+TgmoYSCbW-UC8LQV96sziGnXSuzAyQbfdQmK-FGu22HdKkaw@mail.gmail.com
Now, Alvaro did not respond to those emails, nor did anyone involved
in the development of the feature. There may be an argument that
implementing that would be too complicated, but Heikki said he didn't
think it would be, and nobody's made a concrete argument as to why
he's wrong (and Heikki knows a lot about indexing).
Basically, I think Heikki asked a good
question - which was "could we abstract this more?" - and I can't
recall seeing a clear answer explaining why we could or couldn't and
what the trade-offs would be.
'could we abstract more' imo is a pretty bad design guideline. It's 'is
there benefit in abstracting more'. Otherwise you end up with way to
complicated systems.
On the flip side, if you don't abstract enough, you end up being able
to cover only a small set of the relevant use cases, or else you end
up with a bunch of poorly-coordinated tools to cover slightly
different use cases.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jun 18, 2014 at 8:51 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
I liked Greg's sketch of what the opclass support functions would be. It
doesn't seem significantly more complicated than what's in the patch now.
Which was
On Tue, Jun 17, 2014 at 8:48 PM, Greg Stark <stark@mit.edu> wrote:
An aggregate to generate a min/max "bounding box" from several values
A function which takes an "bounding box" and a new value and returns
the new "bounding box"
A function which tests if a value is in a "bounding box"
A function which tests if a "bounding box" overlaps a "bounding box"
Which I'd generalize a bit further by replacing "bounding box" with
"compressed set", and allowing it to be parameterized.
So, you have:
An aggregate to generate a "compressed set" from several values
A function which adds a new value to the "compressed set" and returns
the new "compressed set"
A function which tests if a value is in a "compressed set"
A function which tests if a "compressed set" overlaps another
"compressed set" of equal type
If you can define different compressed sets, you can use this to
generate both min/max indexes as well as bloom filter indexes. Whether
we'd want to have both is perhaps questionable, but having the ability
to is probably desirable.
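A rough sketch of what that opclass-level interface might look like
(purely illustrative; none of these names exist in the patch):

#include <stdbool.h>
#include <stddef.h>

/* One "compressed set" summarizing all values in a page range.  Its
 * representation is opaque to the AM; only the opclass knows whether it
 * is a min/max pair, a bloom filter bitmap, or something else. */
typedef struct CompressedSet CompressedSet;

typedef struct CompressedSetOps
{
    /* start an empty summary (the aggregate's initial state) */
    CompressedSet *(*init) (void *opclass_options);

    /* fold one value into the summary; may only ever widen the set */
    void        (*add_value) (CompressedSet *set,
                              const void *value, size_t len);

    /* may the set contain this value?  False negatives are forbidden. */
    bool        (*maybe_contains) (const CompressedSet *set,
                                   const void *value, size_t len);

    /* may two summaries overlap? (for set-vs-set predicates) */
    bool        (*maybe_overlaps) (const CompressedSet *a,
                                   const CompressedSet *b);
} CompressedSetOps;

A min/max opclass would implement add_value by widening the two
endpoints, a bloom-filter opclass by setting its k hash bits; the "may
only ever widen" rule is what lets the AM replace the stored summary in
place, which is the concern raised just below.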
One problem with such a generalized implementation would be that I'm
not sure in-place modification of the "compressed set" on-disk can be
assumed to be safe in all cases. Surely, for strictly-enlarging sets
it would be, but while min/max and bloom filters both fit the bill, it's
not clear that one can assume this for all structures.
Also adding an "in-place updatable" bit to the "type" would perhaps
inflate the complexity of the patch due to the need to provide both
code paths?
On 06/18/2014 12:46 PM, Andres Freund wrote:
Isn't 'simpler implementation' a valid reason that's already been
discussed onlist? Obviously simpler implementation doesn't trump
everything, but it's one valid reason...
Note that I have zap to do with the design of this feature. I work for
the same company as Alvaro, but that's pretty much it...
Without some analysis (e.g implementing it and comparing), I don't buy that
it makes the implementation simpler to restrict it in this way. Maybe it
does, but often it's actually simpler to solve the general case.
So to implement a feature one now has to implement the most generic
variant as a prototype first? Really?
Well, there is the inventor's paradox to consider.
--
Vik
On 06/18/2014 06:09 PM, Claudio Freire wrote:
On Tue, Jun 17, 2014 at 8:48 PM, Greg Stark <stark@mit.edu> wrote:
An aggregate to generate a min/max "bounding box" from several values
A function which takes an "bounding box" and a new value and returns
the new "bounding box"
A function which tests if a value is in a "bounding box"
A function which tests if a "bounding box" overlaps a "bounding box"Which I'd generalize a bit further by renaming "bounding box" with
"compressed set", and allow it to be parameterized.
What do you mean by parameterized?
So, you have:
An aggregate to generate a "compressed set" from several values
A function which adds a new value to the "compressed set" and returns
the new "compressed set"
A function which tests if a value is in a "compressed set"
A function which tests if a "compressed set" overlaps another
"compressed set" of equal type
Yeah, something like that. I'm not sure I like the "compressed set" term
any more than bounding box, though. GiST seems to have avoided naming
the thing, and just talks about "index entries". But if we can come up
with a good name, that would be more clear.
One problem with such a generalized implementation would be, that I'm
not sure in-place modification of the "compressed set" on-disk can be
assumed to be safe on all cases. Surely, for strictly-enlarging sets
it would, but while min/max and bloom filters both fit the bill, it's
not clear that one can assume this for all structures.
I don't understand what you mean. It's a fundamental property of minmax
indexes that you can always replace the "min" or "max" or "compressing
set" or "bounding box" or whatever with another datum that represents
all the keys that the old one did, plus some.
- Heikki
Vik Fearing <vik.fearing@dalibo.com> writes:
On 06/18/2014 12:46 PM, Andres Freund wrote:
So to implement a feature one now has to implement the most generic
variant as a prototype first? Really?
Well, there is the inventor's paradox to consider.
I have not seen anyone demanding a different implementation in this
thread. What *has* been asked for, and not supplied, is a concrete
defense of the particular level of generality that's been selected
in this implementation. It's not at all clear to the rest of us
whether it was the right choice, and that is something that ought
to be asked now not later.
regards, tom lane
On Wed, Jun 18, 2014 at 4:51 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
Implementing something is a good way to demonstrate what it would look like.
But no, I don't insist on implementing every possible design whenever a new
feature is proposed.
I liked Greg's sketch of what the opclass support functions would be. It
doesn't seem significantly more complicated than what's in the patch now.
As a counter-point to my own point, there will be nothing stopping us
in the future from generalizing things. Dealing with catalogs is
mostly book-keeping headaches and careful work. It's something that
might be well-suited for a GSOC or first patch from someone looking to
familiarize themselves with the system architecture. It's hard to
invent a whole new underlying infrastructure at the same time as
dealing with all that book-keeping, and it's hard for someone
familiarizing themselves with the system to also have a great new
idea. Having tasks like this that are easy to explain and that a mentor
understands well can be easier to manage than tasks where the newcomer
has some radical new idea.
--
greg
On Thu, Jun 19, 2014 at 10:06 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
On 06/18/2014 06:09 PM, Claudio Freire wrote:
On Tue, Jun 17, 2014 at 8:48 PM, Greg Stark <stark@mit.edu> wrote:
An aggregate to generate a min/max "bounding box" from several values
A function which takes an "bounding box" and a new value and returns
the new "bounding box"
A function which tests if a value is in a "bounding box"
A function which tests if a "bounding box" overlaps a "bounding box"Which I'd generalize a bit further by renaming "bounding box" with
"compressed set", and allow it to be parameterized.What do you mean by parameterized?
Bloom filters can be parameterized by the number of hashes, the number of bit
positions, and the hash function, so it's not a simple bloom filter index,
but a bloom filter index with N SHA-1-based hashes spread over a
K-length bitmap.
So, you have:
An aggregate to generate a "compressed set" from several values
A function which adds a new value to the "compressed set" and returns
the new "compressed set"
A function which tests if a value is in a "compressed set"
A function which tests if a "compressed set" overlaps another
"compressed set" of equal typeYeah, something like that. I'm not sure I like the "compressed set" term any
more than bounding box, though. GiST seems to have avoided naming the thing,
and just talks about "index entries". But if we can come up with a good
name, that would be more clear.
I don't want to use the term bloom filter since it's specific to
a particular technique, but it's basically that - an approximate set
without false negatives. I.e.: a compressed set.
It's not a bounding box either when using bloom filters. So...
One problem with such a generalized implementation would be that I'm
not sure in-place modification of the "compressed set" on-disk can be
assumed to be safe in all cases. Surely, for strictly-enlarging sets
it would, but while min/max and bloom filters both fit the bill, it's
not clear that one can assume this for all structures.
I don't understand what you mean. It's a fundamental property of minmax
indexes that you can always replace the "min" or "max" or "compressing set"
or "bounding box" or whatever with another datum that represents all the
keys that the old one did, plus some.
Yes, and bloom filters happen to fall into that category too.
Never mind what I said. I was thinking of some other potential,
imaginary implementation that supports removal or updates, which might
need care with transaction lifetimes, but that's easily fixed by
letting vacuum or some lazy process do the deleting, just as happens
with other indexes anyway.
So, I guess the interface must also include the invariant that
compressed sets only grow, never shrinking except during a rebuild or a
vacuum operation.
I'm sorry if I missed something, but ISTM this is beginning to look a
lot like GiST. This was pointed out by Robert Haas last year.
On Wed, Jun 18, 2014 at 12:09:42PM -0300, Claudio Freire wrote:
So, you have:
An aggregate to generate a "compressed set" from several values
Which GiST does by calling 'compress' on each value, and then 'union' to
combine the results.
A function which adds a new value to the "compressed set" and returns
the new "compressed set"
Again, 'compress' + 'union'
A function which tests if a value is in a "compressed set"
Which GiST does using 'compress' + 'consistent'
A function which tests if a "compressed set" overlaps another
"compressed set" of equal type
Which GiST calls 'consistent'
So I'm wondering why you can't just reuse the btree_gist functions we
already have in contrib. It seems to me that these MinMax indexes are
in fact a variation on GiST that indexes the pages of a table based
upon the 'union' of all the elements in a page. By reusing the GiST
operator class you get support for many datatypes for free.
If you can define different compressed sets, you can use this to
generate both min/max indexes as well as bloom filter indexes. Whether
we'd want to have both is perhaps questionable, but having the ability
to is probably desirable.
You could implement a bloom filter in GiST too. It's been discussed
before, but I can't find any implementation. Probably because the filter
needs to be parameterised, and if you store the bloom filter for each
element it gets expensive very quickly. However, hooked into a minmax
structure which only indexes whole pages, it could be quite efficient.
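To illustrate (purely hypothetical, not from any posted patch), a per-range
summary for such a parameterized filter might be stored roughly like this,
with the parameters fixed at index creation time:

#include "postgres.h"

/* Hypothetical per-range bloom filter summary; nhashes and nbits would be
 * parameters chosen when the index is created. */
typedef struct BloomSummary
{
	uint16		nhashes;	/* number of hash probes (parameter) */
	uint32		nbits;		/* size of the bitmap in bits (parameter) */
	uint8		bitmap[FLEXIBLE_ARRAY_MEMBER];
} BloomSummary;

/*
 * Folding another heap value into the summary only ever sets bits, so the
 * summary is strictly enlarging and can be overwritten in place, just like
 * a min/max pair.
 */
static void
bloom_add_hash(BloomSummary *bs, uint32 hash)
{
	int			i;

	for (i = 0; i < bs->nhashes; i++)
	{
		uint32		bit = (hash + i * 0x9e3779b9) % bs->nbits;

		bs->bitmap[bit / 8] |= 1 << (bit % 8);
	}
}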
One problem with such a generalized implementation would be that I'm
not sure in-place modification of the "compressed set" on-disk can be
assumed to be safe in all cases. Surely, for strictly-enlarging sets
it would, but while min/max and bloom filters both fit the bill, it's
not clear that one can assume this for all structures.
I think GiST has already solved this problem.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
He who writes carelessly confesses thereby at the very outset that he does
not attach much importance to his own thoughts.
-- Arthur Schopenhauer
Some comments, aside from the design wrt. bounding boxes etc. :
On 06/15/2014 05:34 AM, Alvaro Herrera wrote:
Robert Haas wrote:
On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Here's an updated version of this patch, with fixes to all the bugs
reported so far. Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and
Amit Kapila for the reports.
I'm not very happy with the use of a separate relation fork for
storing this data.
Here's a new version of this patch. Now the revmap is not stored in a
separate fork, but together with all the regular data, as explained
elsewhere in the thread.
Thanks! Please update the README accordingly.
If I understand the code correctly, the revmap is a three-level deep
structure. The bottom level consists of "regular revmap pages", and each
regular revmap page is filled with ItemPointerDatas, which point to the
index tuples. The middle level consists of "array revmap pages", and
each array revmap page contains an array of BlockNumbers of the "regular
revmap" pages. The top level is an array of BlockNumbers of the array
revmap pages, and it is stored in the metapage.
With 8k block size, that's just enough to cover the full range of 2^32-1
blocks that you'll need if you set mm_pages_per_range=1. Each regular
revmap page can store about 8192/6 = 1365 item pointers, each array
revmap page can store about 8192/4 = 2048 block references, and the size
of the top array is 8192/4 = 2048. That's just enough; to store the required
number of array pages in the top array, the array needs to be
(2^32/1365)/2048 = 1536 elements large.
But with 4k or smaller blocks, it's not enough.
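Spelling that arithmetic out (a quick throwaway check, using the same
simplifications as above: 6-byte item pointers, 4-byte block numbers, page
header overhead ignored):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	uint64_t	table_blocks = UINT64_C(1) << 32;	/* worst case: pages_per_range = 1 */
	int			blcksz[] = {8192, 4096};
	int			i;

	for (i = 0; i < 2; i++)
	{
		uint64_t	tids_per_revmap_page = blcksz[i] / 6;
		uint64_t	refs_per_array_page = blcksz[i] / 4;
		uint64_t	top_array_slots = blcksz[i] / 4;
		uint64_t	revmap_pages = (table_blocks + tids_per_revmap_page - 1) / tids_per_revmap_page;
		uint64_t	array_pages = (revmap_pages + refs_per_array_page - 1) / refs_per_array_page;

		printf("BLCKSZ=%d: %llu array pages needed, metapage holds %llu -> %s\n",
			   blcksz[i],
			   (unsigned long long) array_pages,
			   (unsigned long long) top_array_slots,
			   array_pages <= top_array_slots ? "fits" : "does not fit");
	}
	return 0;
}

That comes out to roughly 1537 array pages against ~2048 metapage slots at
8k, versus roughly 6151 against ~1024 at 4k.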
I wonder if it would be simpler to just always store the revmap pages in
the beginning of the index, before any other pages. Finding the revmap
page would then be just as easy as with a separate fork. When the
table/index is extended so that a new revmap page is needed, move the
existing page at that block out of the way. Locking needs some
consideration, but I think it would be feasible and simpler than you
have now.
I have followed the suggestion by Amit to overwrite the index tuple when
a new heap tuple is inserted, instead of creating a separate index
tuple. This saves a lot of index bloat. This required a new entry
point in bufpage.c, PageOverwriteItemData(). bufpage.c also has a new
function PageIndexDeleteNoCompact which is similar in spirit to
PageIndexMultiDelete except that item pointers do not change. This is
necessary because the revmap stores item pointers, and such references
would break if we were to renumber items in index pages.
ISTM that when the old tuple cannot be updated in-place, the new index
tuple is inserted with mm_doinsert(), but the old tuple is never deleted.
- Heikki
Heikki Linnakangas wrote:
Some comments, aside from the design wrt. bounding boxes etc. :
Thanks. I haven't commented on that sub-thread because I think it's
possible to come up with a reasonable design that solves the issue by
adding a couple of amprocs. I need to do some more thinking to ensure
it is really workable, and then I'll post my ideas.
On 06/15/2014 05:34 AM, Alvaro Herrera wrote:
Robert Haas wrote:
If I understand the code correctly, the revmap is a three-level deep
structure. The bottom level consists of "regular revmap pages", and
each regular revmap page is filled with ItemPointerDatas, which
point to the index tuples. The middle level consists of "array
revmap pages", and each array revmap page contains an array of
BlockNumbers of the "regular revmap" pages. The top level is an
array of BlockNumbers of the array revmap pages, and it is stored in
the metapage.
Yep, that's correct. Essentially, we still have the revmap as a linear
space (containing TIDs); the other two levels on top of that are only
there to enable locating the physical page numbers for each revmap
logical page. I make one exception: the first logical revmap page
is always stored in page 1, to optimize the case of a smallish table
(~1360 page ranges; approximately 1.3 gigabytes of data at 128 pages per
range, or 170 megabytes at 16 pages per range.)
Each page has a page header (24 bytes) and special space (4 bytes), so
it has 8192-28=8164 bytes available for data, so 1360 item pointers.
With 8k block size, that's just enough to cover the full range of
2^32-1 blocks that you'll need if you set mm_pages_per_range=1. Each
regular revmap page can store about 8192/6 = 1365 item pointers,
each array revmap page can store about 8192/4 = 2048 block
references, and the size of the top array is 8192/4. That's just
enough; to store the required number of array pages in the top
array, the array needs to be (2^32/1365)/2048 = 1536 elements large.
But with 4k or smaller blocks, it's not enough.
Yeah. As I said elsewhere, actual useful values are likely to be close
to the read-ahead setting of the underlying disk; by default that'd be
16 pages (128kB), but I think it's common advice to increase the kernel
setting to improve performance. I don't think we need to prevent
minmax indexes with pages_per_range=1, but I don't think we need to
ensure that that setting works with the largest tables, either, because
it doesn't make any sense to set it up like that.
Also, while there are some recommendations to set up a system with
larger page sizes (32kB), I have never seen any recommendation to set it
lower. It wouldn't make sense to build a system that has very large
tables and use a smaller page size.
So in other words, yes, you're correct that the mechanism doesn't work
in some cases (small page size and index configured for highest level of
detail), but the conditions are such that I don't think it matters.
ISTM the thing to do here is to do the math at index creation time, and
if we find that we don't have enough space in the metapage for all array
revmap pointers we need, bail out and require the index to be created
with a larger pages_per_range setting.
I wonder if it would be simpler to just always store the revmap
pages in the beginning of the index, before any other pages. Finding
the revmap page would then be just as easy as with a separate fork.
When the table/index is extended so that a new revmap page is
needed, move the existing page at that block out of the way. Locking
needs some consideration, but I think it would be feasible and
simpler than you have now.
Moving index items around is not easy, because you'd have to adjust the
revmap to rewrite the item pointers.
I have followed the suggestion by Amit to overwrite the index tuple when
a new heap tuple is inserted, instead of creating a separate index
tuple. This saves a lot of index bloat. This required a new entry
point in bufpage.c, PageOverwriteItemData(). bufpage.c also has a new
function PageIndexDeleteNoCompact which is similar in spirit to
PageIndexMultiDelete except that item pointers do not change. This is
necessary because the revmap stores item pointers, and such references
would break if we were to renumber items in index pages.
ISTM that when the old tuple cannot be updated in-place, the new
index tuple is inserted with mm_doinsert(), but the old tuple is
never deleted.
It's deleted by the next vacuum.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 06/23/2014 08:07 PM, Alvaro Herrera wrote:
Heikki Linnakangas wrote:
With 8k block size, that's just enough to cover the full range of
2^32-1 blocks that you'll need if you set mm_pages_per_range=1. Each
regular revmap page can store about 8192/6 = 1365 item pointers,
each array revmap page can store about 8192/4 = 2048 block
references, and the size of the top array is 8192/4. That's just
enough; to store the required number of array pages in the top
array, the array needs to be (2^32/1365)/2048 = 1536 elements large.
But with 4k or smaller blocks, it's not enough.
Yeah. As I said elsewhere, actual useful values are likely to be close
to the read-ahead setting of the underlying disk; by default that'd be
16 pages (128kB), but I think it's common advice to increase the kernel
setting to improve performance.
My gut feeling is that it might well be best to set pages_per_range=1.
Even if you do the same amount of I/O, thanks to kernel read-ahead, you
might still avoid processing a lot of tuples. But we would need to see some
benchmarks to know...
I don't think we need to prevent
minmax indexes with pages_per_range=1, but I don't think we need to
ensure that that setting works with the largest tables, either, because
it doesn't make any sense to set it up like that.
Also, while there are some recommendations to set up a system with
larger page sizes (32kB), I have never seen any recommendation to set it
lower. It wouldn't make sense to build a system that has very large
tables and use a smaller page size.
So in other words, yes, you're correct that the mechanism doesn't work
in some cases (small page size and index configured for highest level of
detail), but the conditions are such that I don't think it matters.
ISTM the thing to do here is to do the math at index creation time, and
if we find that we don't have enough space in the metapage for all array
revmap pointers we need, bail out and require the index to be created
with a larger pages_per_range setting.
Yeah, I agree that would be acceptable.
I feel that the below would nevertheless be simpler:
I wonder if it would be simpler to just always store the revmap
pages in the beginning of the index, before any other pages. Finding
the revmap page would then be just as easy as with a separate fork.
When the table/index is extended so that a new revmap page is
needed, move the existing page at that block out of the way. Locking
needs some consideration, but I think it would be feasible and
simpler than you have now.
Moving index items around is not easy, because you'd have to adjust the
revmap to rewrite the item pointers.
Hmm. Two alternative schemes come to mind:
1. Move each index tuple off the page individually, updating the revmap
while you do it, until the page is empty. Updating the revmap for a
single index tuple isn't difficult; you have to do it anyway when an
index tuple is replaced. (MMTuples don't contain a heap block number
ATM, but IMHO they should, see below)
2. Store the new block number of the page that you moved out of the way
in the revmap page, and leave the revmap pointers unchanged. The revmap
pointers can be updated later, lazily.
Both of those seem pretty straightforward.
I have followed the suggestion by Amit to overwrite the index tuple when
a new heap tuple is inserted, instead of creating a separate index
tuple. This saves a lot of index bloat. This required a new entry
point in bufpage.c, PageOverwriteItemData(). bufpage.c also has a new
function PageIndexDeleteNoCompact which is similar in spirit to
PageIndexMultiDelete except that item pointers do not change. This is
necessary because the revmap stores item pointers, and such references
would break if we were to renumber items in index pages.
ISTM that when the old tuple cannot be updated in-place, the new
index tuple is inserted with mm_doinsert(), but the old tuple is
never deleted.
It's deleted by the next vacuum.
Ah I see. Vacuum reads the whole index, and builds an in-memory hash
table that contains an ItemPointerData for every tuple in the index.
Doesn't that require a lot of memory, for a large index? That might be
acceptable - you ought to have plenty of RAM if you're pushing around
multi-terabyte tables - but it would nevertheless be nice to not have a
hard requirement for something as essential as vacuum.
In addition to the hash table, remove_deletable_tuples() pallocs an
array to hold an ItemPointer for every index tuple about to be removed.
A single palloc is limited to 1GB, so that will fail outright if there
are more than ~179 million dead index tuples. You're unlikely to hit
that in practice, but if you do, you'll never be able to vacuum the
index. So that's not very nice.
Wouldn't it be simpler to remove the old tuple atomically with inserting
the new tuple and updating the revmap? Or at least mark the old tuple as
deletable, so that vacuum can just delete it, without building the large
hash table to determine that it's deletable.
As it is, remove_deletable_tuples looks racy:
1. Vacuum begins, and remove_deletable_tuples performs the first pass
over the regular, non-revmap index pages, building the hash table of all
items in the index.
2. Another process inserts a new row to the heap, which causes a new
minmax tuple to be inserted and the revmap to be updated to point to the
new tuple.
3. Vacuum proceeds to scan the revmap. It will find the updated revmap
entry that points to the new index tuple. The new index tuples is not
found in the hash table, so it throws an error: "reverse map references
nonexistant (sic) index tuple".
I think to fix that you can just ignore tuples that are not found in the
hash table. (Although as I said above I think it would be simpler to not
leave behind any dead index tuples in the first place and get rid of the
vacuum scans altogether)
Regarding locking, I think it would be good to mention explicitly the
order that the pages must be locked if you need to lock multiple pages
at the same time, to avoid deadlock. Based on the Locking
considerations-section in the README, I believe the order is that you
always lock the regular index page first, and then the revmap page.
There's no mention of the order of locking two regular or two revmap
pages, but I guess you never do that ATM.
I'm quite surprised by the use of LockTuple on the index tuples. I think
the main reason for needing that is the fact that MMTuple doesn't store
the heap (range) block number that the tuple points to: LockTuple is
required to ensure that the tuple doesn't go away while a scan is
following a pointer from the revmap to it. If the MMTuple contained the
BlockNumber, a scan could check that and go back to the revmap if it
doesn't match. Alternatively, you could keep the revmap page locked when
you follow a pointer to the regular index page.
The lack of a block number on index tuples also makes my idea of moving
tuples out of the way of extending the revmap much more difficult;
there's no way to find the revmap entry pointing to an index tuple,
short of scanning the whole revmap. And also on general robustness
grounds, and for debugging purposes, it would be nice to have the block
number there.
- Heikki
On Thu, Jun 19, 2014 at 12:32 PM, Greg Stark <stark@mit.edu> wrote:
On Wed, Jun 18, 2014 at 4:51 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
Implementing something is a good way to demonstrate what it would look like.
But no, I don't insist on implementing every possible design whenever a new
feature is proposed.
I liked Greg's sketch of what the opclass support functions would be. It
doesn't seem significantly more complicated than what's in the patch now.
As a counter-point to my own point, there will be nothing stopping us
in the future from generalizing things. Dealing with catalogs is
mostly book-keeping headaches and careful work. It's something that
might be well-suited for a GSOC or first patch from someone looking to
familiarize themselves with the system architecture. It's hard to
invent a whole new underlying infrastructure at the same time as
dealing with all that book-keeping, and it's hard for someone
familiarizing themselves with the system to also have a great new
idea. Having tasks like this that are easy to explain and that a mentor
understands well can be easier to manage than tasks where the newcomer
has some radical new idea.
Generalizing this in the future would be highly likely to change the
on-disk format for existing indexes, which would be a problem for
pg_upgrade. I think we will likely be stuck with whatever the initial
on-disk format looks like for a very long time, which is why I think
we need to try rather hard to get this right the first time.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Claudio Freire wrote:
An aggregate to generate a "compressed set" from several values
A function which adds a new value to the "compressed set" and returns
the new "compressed set"
A function which tests if a value is in a "compressed set"
A function which tests if a "compressed set" overlaps another
"compressed set" of equal typeIf you can define different compressed sets, you can use this to
generate both min/max indexes as well as bloom filter indexes. Whether
we'd want to have both is perhaps questionable, but having the ability
to is probably desirable.
Here's a new version of this patch, which is more generic than the original
versions, and similar to what you describe.
The way it works now, each opclass needs to have three support
procedures; I've called them getOpers, maybeUpdateValues, and compare.
(I realize these names are pretty bad, and will be changing them.)
getOpers is used to obtain information about what is stored for that
data type; it says how many datum values are stored for a column of that
type (two for sortable: min and max), and how many operators it needs
set up. Then, the generic code fills in a MinmaxDesc(riptor) and creates
an initial DeformedMMTuple (which is a rather ugly name for a minmax
tuple held in memory). The maybeUpdateValues amproc can then be called
when there's a new heap tuple, which updates the DeformedMMTuple to
account for the new tuple (in essence, it's a union of the original
values and the new tuple). This can be done repeatedly (when a new
index is being created) or only once (when a new heap tuple is inserted
into an existing index). There is no need for an "aggregate".
This DeformedMMTuple can easily be turned into the on-disk
representation; there is no hardcoded assumption on the number of index
values stored per heap column, so it is possible to build an opclass
that stores a bounding box column for a geometry heap column, for
instance.
Then we have the "compare" amproc. This is used during index scans;
after extracting an index tuple, it is turned into a DeformedMMTuple, and
the "compare" amproc for each column is called with the values of the scan
keys. (Now that I think about this, it's pretty much what
"consistent" is for GiST opclasses.) A true return value indicates that
the scan key matches the page range boundaries and thus all pages in the
range are added to the output TID bitmap.
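In rough terms the signatures look like this (a sketch only: in the patch
these are fmgr-callable procedures invoked with FunctionCall5, i.e. the real
entry points have the form "Datum fn(PG_FUNCTION_ARGS)", and the argument
lists below are inferred from the description and from the calls in minmax.c
rather than copied from it):

#include "postgres.h"
#include "access/attnum.h"
#include "access/skey.h"				/* StrategyNumber */
#include "access/minmax_internal.h"		/* MinmaxDesc (header added by the patch) */
#include "access/minmax_tuple.h"		/* DeformedMMTuple (header added by the patch) */

/*
 * getOpers: reports how many datum values are stored per indexed column and
 * which operators the column needs; its exact argument list is not shown
 * here.
 */

/* Union a new heap value into the in-memory summary; returns true if the
 * stored values changed and the index tuple therefore needs to be rewritten. */
extern bool maybeUpdateValues(MinmaxDesc *mmdesc, DeformedMMTuple *dtup,
							  AttrNumber attno, Datum newval, bool isnull);

/* Can any heap value within the summarized page range satisfy the scan key? */
extern bool compare(MinmaxDesc *mmdesc, DeformedMMTuple *dtup,
					AttrNumber attno, StrategyNumber strategy, Datum query);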
Of course, you can have multicolumn indexes, and (as would be expected)
each column can have totally different opclasses; so for instance you
could have an integer column and a geometric column in the same index,
and it should work fine. In a query that constrained both columns, only
those page ranges that satisfied the scan keys for both columns would be
returned.
I think this level of abstraction is good --- AFAICS it is easy to
implement opclasses for datatypes not suitable for btree opclasses such
as geometric ones, etc. This answers the concerns of those who wanted
to see this support datatypes that don't have the concept of min/max at
all. I'm not sure about bloom filters, as I've forgotten how those
work. Of course, the implementation of min/max is there: that logic has
been abstracted out into what I've called "sortable opfamilies" for now
(name suggestions welcome) --- it can be used for any datatype with a
btree opclass.
I think design-wise it ended up making a lot of sense, because all the
opclass-specific assumptions about usage of "min" and "max" values and
comparisons using the less-than etc operators are contained in the
mmsortable.c file, and the basic minmax.c file only needs to know to
call the right opclass-specific procedures. The basic code might need
some tweaks to ensure that we're not assuming anything about the
datatypes of the stored values (as opposed to the datatypes of the
indexed columns), but this is a matter of tweaking the MinmaxDesc and
the getOpers amprocs; it wouldn't require changing the on-disk
representation, and thus is upgrade-compatible.
There's a bit of boilerplate code in the amproc routines which would be
nice to be able to remove (mainly involving null values), but I think to
do that we would need to split those three amprocs into maybe four or
five, which is not as nice, so I'm not real sure about doing it.
All this being said, I'm sticking to the name "Minmax indexes". There
was a poll in pgsql-advocacy
/messages/by-id/53A0B4F8.8080803@agliodbs.com
about a new name, but there were no suggestions supported by more than
one person. If a brilliant new name comes up, I'm open to changing it.
Another thing I noticed is that version 8 of the patch blindly believed
the "pages_per_range" declared in catalogs. This meant that if somebody
did "alter index foo set pages_per_range=123" the index would
immediately break (i.e. return corrupted results when queried). I have
fixed this by storing the pages_per_range value used to construct the
index in the metapage. Now if you do the ALTER INDEX thing, the new
value is only used when the index is recreated by REINDEX.
There are still things to go over before this is committable,
particularly regarding vacuuming the index, but as far as index creation
and scanning it should be good to test. (Vacuuming should work just
fine most of the time also, but there are a few wrinkles pointed out by
Heikki.)
One thing I've disabled for now is the pageinspect code that displays
index items. I need to rewrite that.
Closing thought: thinking more about it, the bit about returning
function OIDs by getOpers when creating a MinmaxDesc seems unnecessary.
I think we could get by with just returning the number of values stored
in the column, and have the operators be part of an opaque struct that's
initialized and only touched by the opclass amprocs, not by the generic
code.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
minmax-9.patch (text/x-diff; charset=us-ascii)
*** a/contrib/pageinspect/Makefile
--- b/contrib/pageinspect/Makefile
***************
*** 1,7 ****
# contrib/pageinspect/Makefile
MODULE_big = pageinspect
! OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o
EXTENSION = pageinspect
DATA = pageinspect--1.2.sql pageinspect--1.0--1.1.sql \
--- 1,7 ----
# contrib/pageinspect/Makefile
MODULE_big = pageinspect
! OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o mmfuncs.o
EXTENSION = pageinspect
DATA = pageinspect--1.2.sql pageinspect--1.0--1.1.sql \
*** /dev/null
--- b/contrib/pageinspect/mmfuncs.c
***************
*** 0 ****
--- 1,420 ----
+ /*
+ * mmfuncs.c
+ * Functions to investigate MinMax indexes
+ *
+ * Copyright (c) 2013, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/pageinspect/mmfuncs.c
+ */
+ #include "postgres.h"
+
+ #include "access/htup_details.h"
+ #include "access/minmax.h"
+ #include "access/minmax_internal.h"
+ #include "access/minmax_page.h"
+ #include "access/minmax_revmap.h"
+ #include "access/minmax_tuple.h"
+ #include "catalog/index.h"
+ #include "catalog/pg_type.h"
+ #include "funcapi.h"
+ #include "utils/array.h"
+ #include "utils/builtins.h"
+ #include "utils/lsyscache.h"
+ #include "utils/rel.h"
+ #include "miscadmin.h"
+
+ Datum minmax_page_type(PG_FUNCTION_ARGS);
+ Datum minmax_page_items(PG_FUNCTION_ARGS);
+ Datum minmax_metapage_info(PG_FUNCTION_ARGS);
+ Datum minmax_revmap_array_data(PG_FUNCTION_ARGS);
+ Datum minmax_revmap_data(PG_FUNCTION_ARGS);
+
+ PG_FUNCTION_INFO_V1(minmax_page_type);
+ PG_FUNCTION_INFO_V1(minmax_page_items);
+ PG_FUNCTION_INFO_V1(minmax_metapage_info);
+ PG_FUNCTION_INFO_V1(minmax_revmap_array_data);
+ PG_FUNCTION_INFO_V1(minmax_revmap_data);
+
+ typedef struct mm_page_state
+ {
+ TupleDesc tupdesc;
+ Page page;
+ OffsetNumber offset;
+ bool unusedItem;
+ bool done;
+ AttrNumber attno;
+ DeformedMMTuple *dtup;
+ FmgrInfo outputfn[FLEXIBLE_ARRAY_MEMBER];
+ } mm_page_state;
+
+
+ static Page verify_minmax_page(bytea *raw_page, uint16 type,
+ const char *strtype);
+
+ Datum
+ minmax_page_type(PG_FUNCTION_ARGS)
+ {
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Page page = VARDATA(raw_page);
+ MinmaxSpecialSpace *special;
+ char *type;
+
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
+
+ switch (special->type)
+ {
+ case MINMAX_PAGETYPE_META:
+ type = "meta";
+ break;
+ case MINMAX_PAGETYPE_REVMAP_ARRAY:
+ type = "revmap array";
+ break;
+ case MINMAX_PAGETYPE_REVMAP:
+ type = "revmap";
+ break;
+ case MINMAX_PAGETYPE_REGULAR:
+ type = "regular";
+ break;
+ default:
+ type = psprintf("unknown (%02x)", special->type);
+ break;
+ }
+
+ PG_RETURN_TEXT_P(cstring_to_text(type));
+ }
+
+ /*
+ * Verify that the given bytea contains a minmax page of the indicated page
+ * type, or die in the attempt. A pointer to the page is returned.
+ */
+ static Page
+ verify_minmax_page(bytea *raw_page, uint16 type, const char *strtype)
+ {
+ Page page;
+ int raw_page_size;
+ MinmaxSpecialSpace *special;
+
+ raw_page_size = VARSIZE(raw_page) - VARHDRSZ;
+
+ if (raw_page_size < SizeOfPageHeaderData)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("input page too small"),
+ errdetail("Expected size %d, got %d", raw_page_size, BLCKSZ)));
+
+ page = VARDATA(raw_page);
+
+ /* verify the special space says this page is what we want */
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
+ if (special->type != type)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("page is not a Minmax page of type \"%s\"", strtype),
+ errdetail("Expected special type %08x, got %08x.",
+ type, special->type)));
+
+ return page;
+ }
+
+
+ #ifdef NOT_YET
+ /*
+ * Extract all item values from a minmax index page
+ *
+ * Usage: SELECT * FROM minmax_page_items(get_raw_page('idx', 1), 'idx'::regclass);
+ */
+ Datum
+ minmax_page_items(PG_FUNCTION_ARGS)
+ {
+ mm_page_state *state;
+ FuncCallContext *fctx;
+
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ (errmsg("must be superuser to use raw page functions"))));
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Oid indexRelid = PG_GETARG_OID(1);
+ Page page;
+ TupleDesc tupdesc;
+ MemoryContext mctx;
+ Relation indexRel;
+ AttrNumber attno;
+
+ /* minimally verify the page we got */
+ page = verify_minmax_page(raw_page, MINMAX_PAGETYPE_REGULAR, "regular");
+
+ /* create a function context for cross-call persistence */
+ fctx = SRF_FIRSTCALL_INIT();
+
+ /* switch to memory context appropriate for multiple function calls */
+ mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ indexRel = index_open(indexRelid, AccessShareLock);
+
+ state = palloc(offsetof(mm_page_state, outputfn) +
+ sizeof(FmgrInfo) * RelationGetDescr(indexRel)->natts);
+
+ state->tupdesc = CreateTupleDescCopy(RelationGetDescr(indexRel));
+ state->page = page;
+ state->offset = FirstOffsetNumber;
+ state->unusedItem = false;
+ state->done = false;
+ state->dtup = NULL;
+
+ index_close(indexRel, AccessShareLock);
+
+ for (attno = 1; attno <= state->tupdesc->natts; attno++)
+ {
+ Oid output;
+ bool isVarlena;
+
+ getTypeOutputInfo(state->tupdesc->attrs[attno - 1]->atttypid,
+ &output, &isVarlena);
+ fmgr_info(output, &state->outputfn[attno - 1]);
+ }
+
+ fctx->user_fctx = state;
+ fctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+ MemoryContextSwitchTo(mctx);
+ }
+
+ fctx = SRF_PERCALL_SETUP();
+ state = fctx->user_fctx;
+
+ if (!state->done)
+ {
+ HeapTuple result;
+ Datum values[6];
+ bool nulls[6];
+
+ /*
+ * This loop is called once for every attribute of every tuple in the
+ * page. At the start of a tuple, we get a NULL dtup; that's our
+ * signal for obtaining and decoding the next one. If that's not the
+ * case, we output the next attribute.
+ */
+ if (state->dtup == NULL)
+ {
+ MMTuple *tup;
+ MemoryContext mctx;
+ ItemId itemId;
+
+ /* deformed tuple must live across calls */
+ mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx);
+
+ /* verify item status: if there's no data, we can't decode */
+ itemId = PageGetItemId(state->page, state->offset);
+ if (ItemIdIsUsed(itemId))
+ {
+ tup = (MMTuple *) PageGetItem(state->page,
+ PageGetItemId(state->page,
+ state->offset));
+ state->dtup = minmax_deform_tuple(state->tupdesc, tup);
+ state->attno = 1;
+ state->unusedItem = false;
+ }
+ else
+ state->unusedItem = true;
+
+ MemoryContextSwitchTo(mctx);
+ }
+ else
+ state->attno++;
+
+ MemSet(nulls, 0, sizeof(nulls));
+
+ if (state->unusedItem)
+ {
+ values[0] = UInt16GetDatum(state->offset);
+ nulls[1] = true;
+ nulls[2] = true;
+ nulls[3] = true;
+ nulls[4] = true;
+ nulls[5] = true;
+ }
+ else
+ {
+ int att = state->attno - 1;
+
+ values[0] = UInt16GetDatum(state->offset);
+ values[1] = UInt16GetDatum(state->attno);
+ values[2] = BoolGetDatum(state->dtup->values[att].allnulls);
+ values[3] = BoolGetDatum(state->dtup->values[att].hasnulls);
+ if (!state->dtup->values[att].allnulls)
+ {
+ FmgrInfo *outputfn = &state->outputfn[att];
+ MMValues *mmvalues = &state->dtup->values[att];
+
+ values[4] = CStringGetTextDatum(OutputFunctionCall(outputfn,
+ mmvalues->min));
+ values[5] = CStringGetTextDatum(OutputFunctionCall(outputfn,
+ mmvalues->max));
+ }
+ else
+ {
+ nulls[4] = true;
+ nulls[5] = true;
+ }
+ }
+
+ result = heap_form_tuple(fctx->tuple_desc, values, nulls);
+
+ /*
+ * If the item was unused, jump straight to the next one; otherwise,
+ * the only cleanup needed here is to set our signal to go to the next
+ * tuple in the following iteration, by freeing the current one.
+ */
+ if (state->unusedItem)
+ state->offset = OffsetNumberNext(state->offset);
+ else if (state->attno >= state->tupdesc->natts)
+ {
+ pfree(state->dtup);
+ state->dtup = NULL;
+ state->offset = OffsetNumberNext(state->offset);
+ }
+
+ /*
+ * If we're beyond the end of the page, set flag to end the function in
+ * the following iteration.
+ */
+ if (state->offset > PageGetMaxOffsetNumber(state->page))
+ state->done = true;
+
+ SRF_RETURN_NEXT(fctx, HeapTupleGetDatum(result));
+ }
+
+ SRF_RETURN_DONE(fctx);
+ }
+ #endif
+
+ Datum
+ minmax_metapage_info(PG_FUNCTION_ARGS)
+ {
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Page page;
+ MinmaxMetaPageData *meta;
+ TupleDesc tupdesc;
+ Datum values[2];
+ bool nulls[2];
+ ArrayBuildState *astate = NULL;
+ HeapTuple htup;
+ int i;
+
+ page = verify_minmax_page(raw_page, MINMAX_PAGETYPE_META, "metapage");
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+ tupdesc = BlessTupleDesc(tupdesc);
+
+ /* Extract values from the metapage */
+ meta = (MinmaxMetaPageData *) PageGetContents(page);
+ MemSet(nulls, 0, sizeof(nulls));
+ values[0] = Int32GetDatum(meta->minmaxVersion);
+
+ /* Extract (possibly empty) list of revmap array page numbers. */
+ for (i = 0; i < MAX_REVMAP_ARRAYPAGES; i++)
+ {
+ BlockNumber blkno;
+
+ blkno = meta->revmapArrayPages[i];
+ if (blkno == InvalidBlockNumber)
+ break; /* XXX or continue? */
+ astate = accumArrayResult(astate, Int64GetDatum((int64) blkno),
+ false, INT8OID, CurrentMemoryContext);
+ }
+ if (astate == NULL)
+ nulls[1] = true;
+ else
+ values[1] = makeArrayResult(astate, CurrentMemoryContext);
+
+ htup = heap_form_tuple(tupdesc, values, nulls);
+
+ PG_RETURN_DATUM(HeapTupleGetDatum(htup));
+ }
+
+ /*
+ * Return the BlockNumber array stored in a revmap array page
+ */
+ Datum
+ minmax_revmap_array_data(PG_FUNCTION_ARGS)
+ {
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Page page;
+ ArrayBuildState *astate = NULL;
+ RevmapArrayContents *contents;
+ Datum blkarr;
+ int i;
+
+ page = verify_minmax_page(raw_page, MINMAX_PAGETYPE_REVMAP_ARRAY,
+ "revmap array");
+
+ contents = (RevmapArrayContents *) PageGetContents(page);
+
+ for (i = 0; i < contents->rma_nblocks; i++)
+ astate = accumArrayResult(astate,
+ Int64GetDatum((int64) contents->rma_blocks[i]),
+ false, INT8OID, CurrentMemoryContext);
+ Assert(astate != NULL);
+
+ blkarr = makeArrayResult(astate, CurrentMemoryContext);
+ PG_RETURN_DATUM(blkarr);
+ }
+
+ /*
+ * Return the TID array stored in a minmax revmap page
+ */
+ Datum
+ minmax_revmap_data(PG_FUNCTION_ARGS)
+ {
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Page page;
+ RevmapContents *contents;
+ TupleDesc tupdesc;
+ Datum values[2];
+ bool nulls[2];
+ HeapTuple htup;
+ ArrayBuildState *astate = NULL;
+ int i;
+
+ page = verify_minmax_page(raw_page, MINMAX_PAGETYPE_REVMAP, "revmap");
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+ tupdesc = BlessTupleDesc(tupdesc);
+
+ /* Extract values from the revmap page */
+ contents = (RevmapContents *) PageGetContents(page);
+ MemSet(nulls, 0, sizeof(nulls));
+ values[0] = Int64GetDatum((uint64) contents->rmr_logblk);
+
+ /* Extract (possibly empty) list of TIDs in this page. */
+ for (i = 0; i < REGULAR_REVMAP_PAGE_MAXITEMS; i++)
+ {
+ ItemPointer tid;
+
+ tid = &contents->rmr_tids[i];
+ astate = accumArrayResult(astate,
+ PointerGetDatum(tid),
+ false, TIDOID, CurrentMemoryContext);
+ }
+ if (astate == NULL)
+ nulls[1] = true;
+ else
+ values[1] = makeArrayResult(astate, CurrentMemoryContext);
+
+ htup = heap_form_tuple(tupdesc, values, nulls);
+
+ PG_RETURN_DATUM(HeapTupleGetDatum(htup));
+ }
*** a/contrib/pageinspect/pageinspect--1.2.sql
--- b/contrib/pageinspect/pageinspect--1.2.sql
***************
*** 99,104 **** AS 'MODULE_PATHNAME', 'bt_page_items'
--- 99,150 ----
LANGUAGE C STRICT;
--
+ -- minmax_page_type()
+ --
+ CREATE FUNCTION minmax_page_type(IN page bytea)
+ RETURNS text
+ AS 'MODULE_PATHNAME', 'minmax_page_type'
+ LANGUAGE C STRICT;
+
+ --
+ -- minmax_metapage_info()
+ --
+ CREATE FUNCTION minmax_metapage_info(IN page bytea,
+ OUT version integer, OUT revmap_array_pages BIGINT[])
+ AS 'MODULE_PATHNAME', 'minmax_metapage_info'
+ LANGUAGE C STRICT;
+
+ --
+ -- minmax_page_items()
+ --
+ /* needs more work
+ CREATE FUNCTION minmax_page_items(IN page bytea, IN index_oid oid,
+ OUT itemoffset int,
+ OUT attnum int,
+ OUT allnulls bool,
+ OUT hasnulls bool,
+ OUT min text,
+ OUT max text)
+ RETURNS SETOF record
+ AS 'MODULE_PATHNAME', 'minmax_page_items'
+ LANGUAGE C STRICT;
+ */
+
+ --
+ -- minmax_revmap_array_data()
+ CREATE FUNCTION minmax_revmap_array_data(IN page bytea,
+ OUT revmap_pages BIGINT[])
+ AS 'MODULE_PATHNAME', 'minmax_revmap_array_data'
+ LANGUAGE C STRICT;
+
+ --
+ -- minmax_revmap_data()
+ CREATE FUNCTION minmax_revmap_data(IN page bytea,
+ OUT logblk BIGINT, OUT pages tid[])
+ AS 'MODULE_PATHNAME', 'minmax_revmap_data'
+ LANGUAGE C STRICT;
+
+ --
-- fsm_page_contents()
--
CREATE FUNCTION fsm_page_contents(IN page bytea)
*** a/contrib/pg_xlogdump/rmgrdesc.c
--- b/contrib/pg_xlogdump/rmgrdesc.c
***************
*** 13,18 ****
--- 13,19 ----
#include "access/gist_private.h"
#include "access/hash.h"
#include "access/heapam_xlog.h"
+ #include "access/minmax_xlog.h"
#include "access/multixact.h"
#include "access/nbtree.h"
#include "access/rmgr.h"
*** a/src/backend/access/Makefile
--- b/src/backend/access/Makefile
***************
*** 8,13 **** subdir = src/backend/access
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
! SUBDIRS = common gin gist hash heap index nbtree rmgrdesc spgist transam
include $(top_srcdir)/src/backend/common.mk
--- 8,13 ----
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
! SUBDIRS = common gin gist hash heap index minmax nbtree rmgrdesc spgist transam
include $(top_srcdir)/src/backend/common.mk
*** a/src/backend/access/common/reloptions.c
--- b/src/backend/access/common/reloptions.c
***************
*** 209,214 **** static relopt_int intRelOpts[] =
--- 209,221 ----
RELOPT_KIND_HEAP | RELOPT_KIND_TOAST
}, -1, 0, 2000000000
},
+ {
+ {
+ "pages_per_range",
+ "Number of pages that each page range covers in a Minmax index",
+ RELOPT_KIND_MINMAX
+ }, 128, 1, 131072
+ },
/* list terminator */
{{NULL}}
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 271,276 **** initscan(HeapScanDesc scan, ScanKey key, bool is_rescan)
--- 271,278 ----
scan->rs_startblock = 0;
}
+ scan->rs_initblock = 0;
+ scan->rs_numblocks = InvalidBlockNumber;
scan->rs_inited = false;
scan->rs_ctup.t_data = NULL;
ItemPointerSetInvalid(&scan->rs_ctup.t_self);
***************
*** 296,301 **** initscan(HeapScanDesc scan, ScanKey key, bool is_rescan)
--- 298,311 ----
pgstat_count_heap_scan(scan->rs_rd);
}
+ void
+ heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk, BlockNumber numBlks)
+ {
+ scan->rs_startblock = startBlk;
+ scan->rs_initblock = startBlk;
+ scan->rs_numblocks = numBlks;
+ }
+
/*
* heapgetpage - subroutine for heapgettup()
*
***************
*** 636,642 **** heapgettup(HeapScanDesc scan,
*/
if (backward)
{
! finished = (page == scan->rs_startblock);
if (page == 0)
page = scan->rs_nblocks;
page--;
--- 646,653 ----
*/
if (backward)
{
! finished = (page == scan->rs_startblock) ||
! (scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
if (page == 0)
page = scan->rs_nblocks;
page--;
***************
*** 646,652 **** heapgettup(HeapScanDesc scan,
page++;
if (page >= scan->rs_nblocks)
page = 0;
! finished = (page == scan->rs_startblock);
/*
* Report our new scan position for synchronization purposes. We
--- 657,664 ----
page++;
if (page >= scan->rs_nblocks)
page = 0;
! finished = (page == scan->rs_startblock) ||
! (scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
/*
* Report our new scan position for synchronization purposes. We
***************
*** 897,903 **** heapgettup_pagemode(HeapScanDesc scan,
*/
if (backward)
{
! finished = (page == scan->rs_startblock);
if (page == 0)
page = scan->rs_nblocks;
page--;
--- 909,916 ----
*/
if (backward)
{
! finished = (page == scan->rs_startblock) ||
! (scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
if (page == 0)
page = scan->rs_nblocks;
page--;
***************
*** 907,913 **** heapgettup_pagemode(HeapScanDesc scan,
page++;
if (page >= scan->rs_nblocks)
page = 0;
! finished = (page == scan->rs_startblock);
/*
* Report our new scan position for synchronization purposes. We
--- 920,927 ----
page++;
if (page >= scan->rs_nblocks)
page = 0;
! finished = (page == scan->rs_startblock) ||
! (scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
/*
* Report our new scan position for synchronization purposes. We
*** /dev/null
--- b/src/backend/access/minmax/Makefile
***************
*** 0 ****
--- 1,17 ----
+ #-------------------------------------------------------------------------
+ #
+ # Makefile--
+ # Makefile for access/minmax
+ #
+ # IDENTIFICATION
+ # src/backend/access/minmax/Makefile
+ #
+ #-------------------------------------------------------------------------
+
+ subdir = src/backend/access/minmax
+ top_builddir = ../../../..
+ include $(top_builddir)/src/Makefile.global
+
+ OBJS = minmax.o mmrevmap.o mmtuple.o mmxlog.o mmsortable.o
+
+ include $(top_srcdir)/src/backend/common.mk
*** /dev/null
--- b/src/backend/access/minmax/minmax.c
***************
*** 0 ****
--- 1,1553 ----
+ /*
+ * minmax.c
+ * Implementation of Minmax indexes for Postgres
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/minmax.c
+ *
+ * TODO
+ * * support collatable datatypes
+ * * ScalarArrayOpExpr
+ * * Make use of the stored NULL bits
+ * * we can support unlogged indexes now
+ */
+ #include "postgres.h"
+
+ #include "access/htup_details.h"
+ #include "access/minmax.h"
+ #include "access/minmax_internal.h"
+ #include "access/minmax_page.h"
+ #include "access/minmax_revmap.h"
+ #include "access/minmax_tuple.h"
+ #include "access/minmax_xlog.h"
+ #include "access/reloptions.h"
+ #include "access/relscan.h"
+ #include "access/xlogutils.h"
+ #include "catalog/index.h"
+ #include "catalog/pg_operator.h"
+ #include "commands/vacuum.h"
+ #include "miscadmin.h"
+ #include "pgstat.h"
+ #include "storage/bufmgr.h"
+ #include "storage/freespace.h"
+ #include "storage/lmgr.h"
+ #include "utils/datum.h"
+ #include "utils/lsyscache.h"
+ #include "utils/memutils.h"
+ #include "utils/syscache.h"
+
+
+ /*
+ * We use a MMBuildState during initial construction of a Minmax index.
+ * The running state is kept in a DeformedMMTuple.
+ */
+ typedef struct MMBuildState
+ {
+ Relation irel;
+ int numtuples;
+ Buffer currentInsertBuf;
+ BlockNumber pagesPerRange;
+ BlockNumber currRangeStart;
+ mmRevmapAccess *rmAccess;
+ MinmaxDesc *mmDesc;
+ DeformedMMTuple *dtuple;
+ } MMBuildState;
+
+ /*
+ * Struct used as "opaque" during index scans
+ */
+ typedef struct MinmaxOpaque
+ {
+ BlockNumber pagesPerRange;
+ mmRevmapAccess *rmAccess;
+ MinmaxDesc *mmDesc;
+ } MinmaxOpaque;
+
+ static MinmaxDesc *minmax_build_mmdesc(Relation rel);
+ static void mmbuildCallback(Relation index,
+ HeapTuple htup, Datum *values, bool *isnull,
+ bool tupleIsAlive, void *state);
+ static void mm_doinsert(Relation idxrel, mmRevmapAccess *rmAccess,
+ Buffer *buffer, BlockNumber heapblkno, MMTuple *tup, Size itemsz);
+ static bool mm_getinsertbuffer(Relation irel, Buffer *buffer, Size itemsz);
+
+
+
+ /*
+ * A tuple in the heap is being inserted. To keep a minmax index up to date,
+ * we need to obtain the relevant index tuple, compare its min()/max() stored
+ * values with those of the new tuple; if the tuple values are in range,
+ * there's nothing to do; otherwise we need to update the index (either by
+ * a new index tuple and repointing the revmap, or by overwriting the existing
+ * index tuple).
+ *
+ * If the range is not currently summarized (i.e. the revmap returns InvalidTid
+ * for it), there's nothing to do either.
+ */
+ Datum
+ mminsert(PG_FUNCTION_ARGS)
+ {
+ Relation idxRel = (Relation) PG_GETARG_POINTER(0);
+ Datum *values = (Datum *) PG_GETARG_POINTER(1);
+ bool *nulls = (bool *) PG_GETARG_POINTER(2);
+ ItemPointer heaptid = (ItemPointer) PG_GETARG_POINTER(3);
+
+ /* we ignore the rest of our arguments */
+ MinmaxDesc *mmdesc;
+ mmRevmapAccess *rmAccess;
+ ItemId origlp;
+ MMTuple *mmtup;
+ DeformedMMTuple *dtup;
+ ItemPointerData idxtid;
+ BlockNumber heapBlk;
+ BlockNumber iblk;
+ OffsetNumber ioff;
+ Buffer buf;
+ IndexInfo *indexInfo;
+ Page page;
+ int keyno;
+ bool need_insert = false;
+
+ rmAccess = mmRevmapAccessInit(idxRel, NULL);
+
+ heapBlk = ItemPointerGetBlockNumber(heaptid);
+ mmGetHeapBlockItemptr(rmAccess, heapBlk, &idxtid);
+ /* tuple lock on idxtid is grabbed by mmGetHeapBlockItemptr */
+
+ if (!ItemPointerIsValid(&idxtid))
+ {
+ /* nothing to do, range is unsummarized */
+ mmRevmapAccessTerminate(rmAccess);
+ return BoolGetDatum(false);
+ }
+
+ indexInfo = BuildIndexInfo(idxRel);
+ mmdesc = minmax_build_mmdesc(idxRel);
+
+ iblk = ItemPointerGetBlockNumber(&idxtid);
+ ioff = ItemPointerGetOffsetNumber(&idxtid);
+ Assert(iblk != InvalidBlockNumber);
+ buf = ReadBuffer(idxRel, iblk);
+
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ UnlockTuple(idxRel, &idxtid, ShareLock);
+ page = BufferGetPage(buf);
+ origlp = PageGetItemId(page, ioff);
+ mmtup = (MMTuple *) PageGetItem(page, origlp);
+
+ dtup = minmax_deform_tuple(mmdesc, mmtup);
+
+ /*
+ * Compare the key values of the new tuple to the stored index values; our
+ * deformed tuple will get updated if the new tuple doesn't fit the
+ * original range (note this means we can't break out of the loop early).
+ * Make a note of whether this happens, so that we know to insert the
+ * modified tuple later, if necessary.
+ */
+ for (keyno = 0; keyno < indexInfo->ii_NumIndexAttrs; keyno++)
+ {
+ Datum result;
+ FmgrInfo *maybeUpdateFn;
+
+ /* FIXME must be cached somewhere */
+ maybeUpdateFn = index_getprocinfo(idxRel, keyno + 1,
+ MINMAX_PROCNUM_MAYBEUPDATE);
+
+ result = FunctionCall5(maybeUpdateFn,
+ PointerGetDatum(mmdesc),
+ PointerGetDatum(dtup),
+ UInt16GetDatum(keyno + 1),
+ values[keyno],
+ nulls[keyno]);
+ /* if that returned true, we need to insert the updated tuple */
+ if (DatumGetBool(result))
+ need_insert = true;
+ }
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ if (need_insert)
+ {
+ Size tupsz;
+ MMTuple *tup;
+
+ tup = minmax_form_tuple(mmdesc, dtup, &tupsz);
+
+ /*
+ * If the size of the original tuple is greater or equal to the new
+ * index tuple, we can overwrite. This saves regular page bloat, and
+ * also saves revmap traffic. This might leave some unused space
+ * before the start of the next tuple, but we don't worry about that
+ * here.
+ *
+ * We avoid doing this when the itempointer of the index tuple would
+ * change, because that would require an update to the revmap while
+ * holding exclusive lock on this page, which would reduce concurrency.
+ *
+ * Note that we continue to access 'origlp' here, even though there
+ * was an interval during which the page wasn't locked. Since we hold
+ * pin on the page, this is okay -- the buffer cannot go away from
+ * under us, and also tuples cannot be shuffled around.
+ */
+ if (tupsz >= ItemIdGetLength(origlp))
+ {
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ START_CRIT_SECTION();
+ PageOverwriteItemData(BufferGetPage(buf),
+ ioff,
+ (Item) tup, tupsz);
+ MarkBufferDirty(buf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(idxRel))
+ {
+ xl_minmax_insert xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+ uint8 info = XLOG_MINMAX_INSERT;
+
+ xlrec.target.node = idxRel->rd_node;
+ xlrec.target.tid = idxtid;
+ xlrec.overwrite = true;
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxInsert;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = (char *) tup;
+ rdata[1].len = tupsz;
+ rdata[1].buffer = buf;
+ rdata[1].buffer_std = true;
+ rdata[1].next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ }
+ else
+ {
+ /*
+ * The new tuple is larger than the original one, so we must insert
+ * a new one the slow way.
+ */
+ mm_doinsert(idxRel, rmAccess, &buf, heapBlk, tup, tupsz);
+
+ #ifdef NOT_YET
+ /*
+ * Possible optimization: if we can grab an exclusive lock on the
+ * buffer containing the old tuple right away, we can also seize
+ * the opportunity to prune the old tuple and avoid some bloat.
+ * This is not necessary for correctness.
+ */
+ if (ConditionalLockBuffer(buf))
+ {
+ /* prune the old tuple */
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ }
+ #endif
+ }
+ }
+
+ ReleaseBuffer(buf);
+
+ mmRevmapAccessTerminate(rmAccess);
+
+ return BoolGetDatum(false);
+ }
+
+ /*
+ * ambeginscan implementation for minmax.
+ *
+ * We read the metapage here to determine the pages-per-range number that this
+ * index was built with. Note that since this cannot be changed while we're
+ * holding lock on index, it's not necessary to recompute it during mmrescan.
+ */
+ Datum
+ mmbeginscan(PG_FUNCTION_ARGS)
+ {
+ Relation r = (Relation) PG_GETARG_POINTER(0);
+ int nkeys = PG_GETARG_INT32(1);
+ int norderbys = PG_GETARG_INT32(2);
+ IndexScanDesc scan;
+ MinmaxOpaque *opaque;
+
+ scan = RelationGetIndexScan(r, nkeys, norderbys);
+
+ opaque = (MinmaxOpaque *) palloc(sizeof(MinmaxOpaque));
+ opaque->rmAccess = mmRevmapAccessInit(r, &opaque->pagesPerRange);
+ scan->opaque = opaque;
+
+ PG_RETURN_POINTER(scan);
+ }
+
+ /*
+ * Execute the index scan.
+ *
+ * This works by reading index TIDs from the revmap, and obtaining the index
+ * tuples pointed to by them; the min/max values in them are compared to the
+ * scan keys. We return into the TID bitmap all the pages in ranges
+ * corresponding to index tuples that match the scan keys.
+ *
+ * If a TID from the revmap is read as InvalidTID, we know that range is
+ * unsummarized. Pages in those ranges need to be returned regardless of scan
+ * keys.
+ */
+ Datum
+ mmgetbitmap(PG_FUNCTION_ARGS)
+ {
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+ TIDBitmap *tbm = (TIDBitmap *) PG_GETARG_POINTER(1);
+ Relation idxRel = scan->indexRelation;
+ Buffer currIdxBuf = InvalidBuffer;
+ MinmaxDesc *mmdesc = minmax_build_mmdesc(idxRel);
+ Oid heapOid;
+ Relation heapRel;
+ MinmaxOpaque *opaque;
+ BlockNumber nblocks;
+ BlockNumber heapBlk;
+ int totalpages = 0;
+ int keyno;
+ FmgrInfo *compareFn;
+
+ opaque = (MinmaxOpaque *) scan->opaque;
+ pgstat_count_index_scan(idxRel);
+
+ /*
+ * XXX We need to know the size of the table so that we know how long to
+ * iterate on the revmap. There's room for improvement here, in that we
+ * could have the revmap tell us when to stop iterating.
+ */
+ heapOid = IndexGetRelation(RelationGetRelid(idxRel), false);
+ heapRel = heap_open(heapOid, AccessShareLock);
+ nblocks = RelationGetNumberOfBlocks(heapRel);
+ heap_close(heapRel, AccessShareLock);
+
+ /*
+ * Obtain compare functions for all indexed columns. Maybe it'd be possible
+ * to do this lazily only the first time we see a scan key that involves
+ * each particular attribute.
+ */
+ compareFn = palloc(sizeof(FmgrInfo) * mmdesc->tupdesc->natts);
+ for (keyno = 0; keyno < mmdesc->tupdesc->natts; keyno++)
+ {
+ FmgrInfo *tmp;
+
+ tmp = index_getprocinfo(idxRel, keyno + 1, MINMAX_PROCNUM_COMPARE);
+ fmgr_info_copy(&compareFn[keyno], tmp, CurrentMemoryContext);
+ }
+
+ /*
+ * Now scan the revmap. We start by querying for heap page 0,
+ * incrementing by the number of pages per range; this gives us a full
+ * view of the table.
+ */
+ for (heapBlk = 0; heapBlk < nblocks; heapBlk += opaque->pagesPerRange)
+ {
+ ItemPointerData itupptr;
+ bool addrange;
+
+ mmGetHeapBlockItemptr(opaque->rmAccess, heapBlk, &itupptr);
+
+ /*
+ * For revmap items that return InvalidTID, we must return the whole
+ * range; otherwise, fetch the index item and compare it to the scan
+ * keys.
+ */
+ if (!ItemPointerIsValid(&itupptr))
+ {
+ addrange = true;
+ }
+ else
+ {
+ Page page;
+ OffsetNumber idxoffno;
+ BlockNumber idxblkno;
+ MMTuple *tup;
+ DeformedMMTuple *dtup;
+ int keyno;
+
+ idxoffno = ItemPointerGetOffsetNumber(&itupptr);
+ idxblkno = ItemPointerGetBlockNumber(&itupptr);
+
+ if (currIdxBuf == InvalidBuffer ||
+ idxblkno != BufferGetBlockNumber(currIdxBuf))
+ {
+ if (currIdxBuf != InvalidBuffer)
+ UnlockReleaseBuffer(currIdxBuf);
+
+ Assert(idxblkno != InvalidBlockNumber);
+ currIdxBuf = ReadBuffer(idxRel, idxblkno);
+ LockBuffer(currIdxBuf, BUFFER_LOCK_SHARE);
+ }
+
+ UnlockTuple(idxRel, &itupptr, ShareLock);
+
+ page = BufferGetPage(currIdxBuf);
+ tup = (MMTuple *)
+ PageGetItem(page, PageGetItemId(page, idxoffno));
+ /* XXX probably need copies */
+ dtup = minmax_deform_tuple(mmdesc, tup);
+
+ /*
+ * Compare scan keys with min/max values stored in range. If scan
+ * keys are matched, the page range must be added to the bitmap.
+ */
+ for (keyno = 0, addrange = true;
+ keyno < scan->numberOfKeys;
+ keyno++)
+ {
+ ScanKey key = &scan->keyData[keyno];
+ AttrNumber keyattno = key->sk_attno;
+ Datum add;
+
+ /*
+ * The analysis we need to make to decide whether to include a
+ * page range in the output result is: is it possible for a
+ * tuple contained within the min/max interval specified by
+ * this index tuple to match what's specified by the scan key?
+ * For example, for a query qual such as "WHERE col < 10" we
+ * need to include a range whose minimum value is less than
+ * 10.
+ *
+ * When there are multiple scan keys, failure to meet the
+ * criteria for a single one of them is enough to discard the
+ * range as a whole.
+ */
+ add = FunctionCall5Coll(&compareFn[keyattno - 1],
+ InvalidOid, /* FIXME collation */
+ PointerGetDatum(mmdesc),
+ PointerGetDatum(dtup),
+ Int16GetDatum(keyattno),
+ UInt16GetDatum(key->sk_strategy),
+ key->sk_argument);
+ addrange = DatumGetBool(add);
+
+ /*
+ * If the current scan key doesn't match the range values,
+ * don't look at further ones.
+ */
+ if (!addrange)
+ break;
+ }
+
+ /* XXX anything to free here? */
+ pfree(dtup);
+ }
+
+ if (addrange)
+ {
+ BlockNumber pageno;
+
+ for (pageno = heapBlk;
+ pageno <= Min(heapBlk + opaque->pagesPerRange, nblocks) - 1;
+ pageno++)
+ {
+ tbm_add_page(tbm, pageno);
+ totalpages++;
+ }
+ }
+ }
+
+ if (currIdxBuf != InvalidBuffer)
+ UnlockReleaseBuffer(currIdxBuf);
+
+ /*
+ * XXX We have an approximation of the number of *pages* that our scan
+ * returns, but we don't have a precise idea of the number of heap tuples
+ * involved.
+ */
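+ /* Hand-wave ten heap tuples per returned page as a crude estimate. */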
+ PG_RETURN_INT64(totalpages * 10);
+ }
+
+ Datum
+ mmrescan(PG_FUNCTION_ARGS)
+ {
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+ ScanKey scankey = (ScanKey) PG_GETARG_POINTER(1);
+ /* other arguments ignored */
+
+ if (scankey && scan->numberOfKeys > 0)
+ memmove(scan->keyData, scankey,
+ scan->numberOfKeys * sizeof(ScanKeyData));
+
+ PG_RETURN_VOID();
+ }
+
+ Datum
+ mmendscan(PG_FUNCTION_ARGS)
+ {
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+ MinmaxOpaque *opaque = (MinmaxOpaque *) scan->opaque;
+
+ mmRevmapAccessTerminate(opaque->rmAccess);
+ pfree(opaque);
+
+ PG_RETURN_VOID();
+ }
+
+ Datum
+ mmmarkpos(PG_FUNCTION_ARGS)
+ {
+ elog(ERROR, "MinMax does not support mark/restore");
+ PG_RETURN_VOID();
+ }
+
+ Datum
+ mmrestrpos(PG_FUNCTION_ARGS)
+ {
+ elog(ERROR, "MinMax does not support mark/restore");
+ PG_RETURN_VOID();
+ }
+
+ /*
+ * Per-heap-tuple callback for IndexBuildHeapScan.
+ *
+ * Note we don't worry about the page range at the end of the table here; it is
+ * present in the build state struct after the last call, but is not
+ * inserted into the index.  Caller must take care of that, if appropriate.
+ */
+ static void
+ mmbuildCallback(Relation index,
+ HeapTuple htup,
+ Datum *values,
+ bool *isnull,
+ bool tupleIsAlive,
+ void *state)
+ {
+ MMBuildState *mmstate = (MMBuildState *) state;
+ BlockNumber thisblock;
+ int i;
+
+ thisblock = ItemPointerGetBlockNumber(&htup->t_self);
+
+ /*
+ * If we're in a new block which belongs to the next range, summarize what
+ * we've got and start afresh. We're careful to avoid being fooled by
+ * wraparound.
+ */
+ if (thisblock > (mmstate->currRangeStart + mmstate->pagesPerRange - 1) ||
+ thisblock < mmstate->currRangeStart)
+ {
+ MMTuple *tup;
+ Size size;
+
+ MINMAX_elog(DEBUG2, "mmbuildCallback: completed a range: %u--%u",
+ mmstate->currRangeStart,
+ mmstate->currRangeStart + mmstate->pagesPerRange);
+ #if 0
+ for (i = 0; i < mmstate->indexDesc->natts; i++)
+ {
+ elog(DEBUG2, "completed a range for column %d, range: %u .. %u",
+ i,
+ DatumGetUInt32(mmstate->dtuple->values[i].min),
+ DatumGetUInt32(mmstate->dtuple->values[i].max));
+ }
+ #endif
+
+ /*
+ * Create the index tuple and insert it.
+ */
+ tup = minmax_form_tuple(mmstate->mmDesc, mmstate->dtuple, &size);
+ mm_doinsert(mmstate->irel, mmstate->rmAccess,
+ &mmstate->currentInsertBuf, mmstate->currRangeStart, tup,
+ size);
+ mmstate->numtuples++;
+ pfree(tup);
+
+ /* and set state to correspond to the new current range */
+ mmstate->currRangeStart += mmstate->pagesPerRange;
+
+ /* re-initialize state for the new range */
+ minmax_dtuple_initialize(mmstate->dtuple, mmstate->mmDesc);
+ }
+
+ /* Accumulate the current tuple into the running state */
+ for (i = 0; i < mmstate->mmDesc->tupdesc->natts; i++)
+ {
+ FmgrInfo *maybeUpdateFn;
+
+ /* FIXME must be cached somewhere */
+ maybeUpdateFn = index_getprocinfo(index, i + 1,
+ MINMAX_PROCNUM_MAYBEUPDATE);
+
+ /*
+ * Update dtuple state, if and as necessary.
+ */
+ FunctionCall5(maybeUpdateFn,
+ PointerGetDatum(mmstate->mmDesc),
+ PointerGetDatum(mmstate->dtuple),
+ UInt16GetDatum(i + 1), values[i],
+ BoolGetDatum(isnull[i]));
+ }
+ }
+
+ /*
+ * Initialize a MMBuildState appropriate to create tuples on the given index.
+ */
+ static MMBuildState *
+ initialize_mm_buildstate(Relation heapRel, Relation idxRel,
+ mmRevmapAccess *rmAccess, BlockNumber pagesPerRange,
+ IndexInfo *indexInfo)
+ {
+ MMBuildState *mmstate;
+
+ mmstate = palloc(sizeof(MMBuildState));
+
+ mmstate->irel = idxRel;
+ mmstate->numtuples = 0;
+ mmstate->currentInsertBuf = InvalidBuffer;
+ mmstate->pagesPerRange = pagesPerRange;
+ mmstate->currRangeStart = 0;
+ mmstate->rmAccess = rmAccess;
+ mmstate->mmDesc = minmax_build_mmdesc(idxRel);
+ mmstate->dtuple = minmax_new_dtuple(mmstate->mmDesc);
+ minmax_dtuple_initialize(mmstate->dtuple, mmstate->mmDesc);
+
+ return mmstate;
+ }
+
+ /*
+ * Initialize a page with the given type.
+ *
+ * Caller is responsible for marking it dirty, as appropriate.
+ */
+ void
+ mm_page_init(Page page, uint16 type)
+ {
+ MinmaxSpecialSpace *special;
+
+ PageInit(page, BLCKSZ, sizeof(MinmaxSpecialSpace));
+
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
+ special->type = type;
+ }
+
+ /*
+ * Initialize a new minmax index's metapage.
+ */
+ void
+ mm_metapage_init(Page page, BlockNumber pagesPerRange, uint16 version)
+ {
+ MinmaxMetaPageData *metadata;
+ int i;
+
+ mm_page_init(page, MINMAX_PAGETYPE_META);
+
+ metadata = (MinmaxMetaPageData *) PageGetContents(page);
+
+ metadata->pagesPerRange = pagesPerRange;
+ metadata->minmaxVersion = version;
+ for (i = 0; i < MAX_REVMAP_ARRAYPAGES; i++)
+ metadata->revmapArrayPages[i] = InvalidBlockNumber;
+ }
+
+ /*
+ * mmbuild() -- build a new minmax index.
+ */
+ Datum
+ mmbuild(PG_FUNCTION_ARGS)
+ {
+ Relation heap = (Relation) PG_GETARG_POINTER(0);
+ Relation index = (Relation) PG_GETARG_POINTER(1);
+ IndexInfo *indexInfo = (IndexInfo *) PG_GETARG_POINTER(2);
+ IndexBuildResult *result;
+ double reltuples;
+ mmRevmapAccess *rmAccess;
+ MMBuildState *mmstate;
+ Buffer meta;
+ BlockNumber pagesPerRange;
+
+ /*
+ * We expect to be called exactly once for any index relation.
+ */
+ if (RelationGetNumberOfBlocks(index) != 0)
+ elog(ERROR, "index \"%s\" already contains data",
+ RelationGetRelationName(index));
+
+ /* partial indexes not supported */
+ if (indexInfo->ii_Predicate != NIL)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("partial indexes not supported")));
+ /* expressions not supported (yet?) */
+ if (indexInfo->ii_Expressions != NIL)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("expression indexes not supported")));
+
+ meta = mm_getnewbuffer(index);
+ START_CRIT_SECTION();
+ mm_metapage_init(BufferGetPage(meta), MinmaxGetPagesPerRange(index),
+ MINMAX_CURRENT_VERSION);
+ MarkBufferDirty(meta);
+
+ if (RelationNeedsWAL(index))
+ {
+ xl_minmax_createidx xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata;
+ Page page;
+
+ xlrec.node = index->rd_node;
+ xlrec.version = MINMAX_CURRENT_VERSION;
+ xlrec.pagesPerRange = MinmaxGetPagesPerRange(index);
+
+ rdata.buffer = InvalidBuffer;
+ rdata.data = (char *) &xlrec;
+ rdata.len = SizeOfMinmaxCreateIdx;
+ rdata.next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_CREATE_INDEX, &rdata);
+
+ page = BufferGetPage(meta);
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+ UnlockReleaseBuffer(meta);
+
+ /*
+ * Set up an empty revmap, and get access to it
+ */
+ mmRevmapCreate(index);
+ rmAccess = mmRevmapAccessInit(index, &pagesPerRange);
+
+ /*
+ * Initialize our state, including the deformed tuple state.
+ */
+ mmstate = initialize_mm_buildstate(heap, index, rmAccess, pagesPerRange,
+ indexInfo);
+
+ /*
+ * Now scan the relation. No syncscan allowed here because we want the
+ * heap blocks in physical order.
+ */
+ reltuples = IndexBuildHeapScan(heap, index, indexInfo, false,
+ mmbuildCallback, (void *) mmstate);
+
+ /* FIXME process the final batch */
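+ /*
+ * (What's pending here: the last, possibly partial, page range is still
+ * accumulated in mmstate->dtuple at this point; it would need to be formed
+ * with minmax_form_tuple and inserted with mm_doinsert, the same way
+ * mmbuildCallback does when it crosses a range boundary.)
+ */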
+
+
+ /* release the last index buffer used */
+ if (!BufferIsInvalid(mmstate->currentInsertBuf))
+ {
+ ReleaseBuffer(mmstate->currentInsertBuf);
+ mmstate->currentInsertBuf = InvalidBuffer;
+ }
+
+ mmRevmapAccessTerminate(mmstate->rmAccess);
+
+ /*
+ * Return statistics
+ */
+ result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+
+ result->heap_tuples = reltuples;
+ result->index_tuples = mmstate->numtuples;
+
+ PG_RETURN_POINTER(result);
+ }
+
+ Datum
+ mmbuildempty(PG_FUNCTION_ARGS)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("unlogged MinMax indexes are not supported")));
+
+ PG_RETURN_VOID();
+ }
+
+ static MinmaxDesc *
+ minmax_build_mmdesc(Relation rel)
+ {
+ MinmaxOpers **opers;
+ MinmaxDesc *mmdesc;
+ TupleDesc tupdesc;
+ int totalopers = 0;
+ int totalstored = 0;
+ int keyno;
+ long totalsize;
+ int curroffset;
+ Datum indclassDatum;
+ oidvector *indclass;
+ bool isnull;
+
+ tupdesc = RelationGetDescr(rel);
+
+ indclassDatum = SysCacheGetAttr(INDEXRELID, rel->rd_indextuple,
+ Anum_pg_index_indclass, &isnull);
+ Assert(!isnull);
+ indclass = (oidvector *) DatumGetPointer(indclassDatum);
+ opers = (MinmaxOpers **) palloc(sizeof(MinmaxOpers *) * tupdesc->natts);
+ for (keyno = 0; keyno < tupdesc->natts; keyno++)
+ {
+ Oid opfam = get_opclass_family(indclass->values[keyno]);
+ Oid idxtypid = tupdesc->attrs[keyno]->atttypid;
+ FmgrInfo getOpers;
+
+ fmgr_info_copy(&getOpers,
+ index_getprocinfo(rel, keyno + 1, MINMAX_PROCNUM_GETOPERS),
+ CurrentMemoryContext);
+
+ opers[keyno] = (MinmaxOpers *)
+ DatumGetPointer(FunctionCall2(&getOpers,
+ ObjectIdGetDatum(opfam),
+ ObjectIdGetDatum(idxtypid)));
+ totalopers += opers[keyno]->nopers;
+ totalstored += opers[keyno]->nstored;
+ }
+
+ totalsize = offsetof(MinmaxDesc, perCol) +
+ sizeof(MinmaxDescPerCol) * tupdesc->natts +
+ sizeof(FmgrInfo) * totalopers +
+ sizeof(Oid) * totalopers;
+
+ mmdesc = palloc(totalsize);
+ mmdesc->tupdesc = CreateTupleDescCopy(tupdesc);
+ mmdesc->disktdesc = NULL; /* generated lazily */
+ mmdesc->totalstored = totalstored;
+ mmdesc->operoids = (Oid *) ((char *) mmdesc +
+ offsetof(MinmaxDesc, perCol) +
+ sizeof(MinmaxDescPerCol) * tupdesc->natts);
+ mmdesc->opers = (FmgrInfo *) ((char *) mmdesc->operoids +
+ sizeof(Oid) * totalopers);
+
+ curroffset = 0;
+ for (keyno = 0; keyno < tupdesc->natts; keyno++)
+ {
+ int opno;
+
+ mmdesc->perCol[keyno].numopers = opers[keyno]->nopers;
+ mmdesc->perCol[keyno].numstored = opers[keyno]->nstored;
+
+ /* Copy the operator OIDs from what the opclass told us */
+ mmdesc->perCol[keyno].operoids = mmdesc->operoids + curroffset;
+ memcpy(mmdesc->perCol[keyno].operoids, opers[keyno]->opers,
+ sizeof(Oid) * opers[keyno]->nopers);
+
+ /*
+ * Fill in the operators' FmgrInfos.  XXX is it possible to do this
+ * lazily to avoid initializing operators that go unused much of the
+ * time? We'd need a tweak to minmax_get_operfn plus a flag array
+ * to indicate which ones have been initialized ...
+ */
+ mmdesc->perCol[keyno].opers = mmdesc->opers + curroffset;
+ for (opno = 0; opno < mmdesc->perCol[keyno].numopers; opno++)
+ fmgr_info(mmdesc->perCol[keyno].operoids[opno],
+ &mmdesc->perCol[keyno].opers[opno]);
+
+ curroffset += opers[keyno]->nopers;
+ }
+
+ for (keyno = 0; keyno < tupdesc->natts; keyno++)
+ pfree(opers[keyno]);
+ pfree(opers);
+
+ return mmdesc;
+ }
+
+ /*
+ * qsort comparator for ItemPointerData items
+ */
+ static int
+ qsortCompareItemPointers(const void *a, const void *b)
+ {
+ return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
+ }
+
+ /*
+ * Remove index tuples that are no longer useful.
+ *
+ * While at it, return in nonsummed the array (and in numnonsummed its size) of
+ * block numbers for which the revmap returns InvalidTID; this is used in a
+ * later stage to execute re-summarization. (Each block number returned
+ * corresponds to the heap page number with which each unsummarized range
+ * starts.) Space for the array is palloc'ed, and must be freed by caller.
+ *
+ * idxRel is the index relation; heapNumBlocks is the size of the heap
+ * relation; strategy is appropriate for bulk scanning.
+ */
+ static void
+ remove_deletable_tuples(Relation idxRel, BlockNumber heapNumBlocks,
+ BufferAccessStrategy strategy,
+ BlockNumber **nonsummed, int *numnonsummed)
+ {
+ HASHCTL hctl;
+ HTAB *tuples;
+ HASH_SEQ_STATUS status;
+ BlockNumber nblocks;
+ BlockNumber blk;
+ mmRevmapAccess *rmAccess;
+ BlockNumber heapBlk;
+ BlockNumber pagesPerRange;
+ int numitems = 0;
+ int numdeletable = 0;
+ ItemPointerData *deletable;
+ int start;
+ int i;
+ BlockNumber *nonsumm = NULL;
+ int maxnonsumm = 0;
+ int numnonsumm = 0;
+
+ typedef struct DeletableTuple
+ {
+ ItemPointerData tid;
+ bool referenced;
+ } DeletableTuple;
+
+ nblocks = RelationGetNumberOfBlocks(idxRel);
+
+ /* Initialize hash used to track deletable tuples */
+ memset(&hctl, 0, sizeof(hctl));
+ hctl.keysize = sizeof(ItemPointerData);
+ hctl.entrysize = sizeof(DeletableTuple);
+ hctl.hcxt = CurrentMemoryContext;
+ hctl.hash = tag_hash;
+
+ /* assume ten entries per page. No harm in getting this wrong */
+ tuples = hash_create("mmvacuumcleanup", nblocks * 10, &hctl,
+ HASH_CONTEXT | HASH_FUNCTION | HASH_ELEM);
+
+ /*
+ * Scan the index sequentially, entering each item into a hash table.
+ * Initially, the items are marked as not referenced.
+ */
+ for (blk = 0; blk < nblocks; blk++)
+ {
+ Buffer buf;
+ Page page;
+ OffsetNumber offno;
+ MinmaxSpecialSpace *special;
+
+ vacuum_delay_point();
+
+ buf = ReadBufferExtended(idxRel, MAIN_FORKNUM, blk, RBM_NORMAL,
+ strategy);
+ page = BufferGetPage(buf);
+
+ /*
+ * Verify the type of the page we got; if it's not a regular page,
+ * ignore it.
+ */
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
+ if (special->type != MINMAX_PAGETYPE_REGULAR)
+ {
+ ReleaseBuffer(buf);
+ continue;
+ }
+
+ /*
+ * Enter each live tuple into the hash table
+ */
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ for (offno = 1; offno <= PageGetMaxOffsetNumber(page); offno++)
+ {
+ ItemPointerData tid;
+ ItemId itemid;
+ bool found;
+ DeletableTuple *hitem;
+
+ itemid = PageGetItemId(page, offno);
+ if (!ItemIdHasStorage(itemid))
+ continue;
+
+ ItemPointerSet(&tid, blk, offno);
+ hitem = (DeletableTuple *)
+ hash_search(tuples, &tid, HASH_ENTER, &found);
+ Assert(!found);
+ hitem->referenced = false;
+ numitems++;
+ }
+ UnlockReleaseBuffer(buf);
+ }
+
+ /*
+ * Now scan the revmap, and determine which of these TIDs are still
+ * referenced
+ */
+ rmAccess = mmRevmapAccessInit(idxRel, &pagesPerRange);
+ for (heapBlk = 0; heapBlk < heapNumBlocks; heapBlk += pagesPerRange)
+ {
+ ItemPointerData itupptr;
+ DeletableTuple *hitem;
+ bool found;
+
+ mmGetHeapBlockItemptr(rmAccess, heapBlk, &itupptr);
+
+ if (!ItemPointerIsValid(&itupptr))
+ {
+ /*
+ * Ignore revmap entries set to invalid. Before doing so, if the
+ * heap page range is complete but not summarized, store its
+ * initial page number in the unsummarized array, for later
+ * summarization.
+ */
+ if (heapBlk + pagesPerRange < heapNumBlocks)
+ {
+ if (maxnonsumm == 0)
+ {
+ Assert(!nonsumm);
+ maxnonsumm = 8;
+ nonsumm = palloc(sizeof(BlockNumber) * maxnonsumm);
+ }
+ else if (numnonsumm >= maxnonsumm)
+ {
+ maxnonsumm *= 2;
+ nonsumm = repalloc(nonsumm, sizeof(BlockNumber) * maxnonsumm);
+ }
+
+ nonsumm[numnonsumm++] = heapBlk;
+ }
+
+ continue;
+ }
+ else
+ UnlockTuple(idxRel, &itupptr, ShareLock);
+
+ hitem = (DeletableTuple *) hash_search(tuples,
+ &itupptr,
+ HASH_FIND,
+ &found);
+ /*
+ * If the item is not in the hash, it must have been inserted after the
+ * index was scanned, and therefore we should leave things well alone.
+ * (There might be a leftover entry, but it's okay because next vacuum
+ * will remove it.)
+ */
+ if (!found)
+ continue;
+
+ hitem->referenced = true;
+
+ /* discount items set as referenced */
+ numitems--;
+ }
+ Assert(numitems >= 0);
+
+ mmRevmapAccessTerminate(rmAccess);
+
+ /*
+ * Now scan the hash, and keep track of the removable (i.e. not referenced,
+ * not locked) tuples.
+ */
+ deletable = palloc(sizeof(ItemPointerData) * numitems);
+
+ hash_freeze(tuples);
+ hash_seq_init(&status, tuples);
+ for (;;)
+ {
+ DeletableTuple *hitem;
+
+ hitem = hash_seq_search(&status);
+ if (!hitem)
+ break;
+ if (hitem->referenced)
+ continue;
+ if (!ConditionalLockTuple(idxRel, &hitem->tid, ExclusiveLock))
+ continue;
+
+ /*
+ * By here, we know this tuple is not referenced from the revmap.
+ * Also, since we hold the tuple lock, we know that if there is a
+ * concurrent scan that had obtained the tuple before the reference
+ * got removed, either that scan is not looking at the tuple (because
+ * that would have prevented us from getting the tuple lock) or it is
+ * holding the containing buffer's lock. If the former, then there's
+ * no problem with removing the tuple immediately; if the latter, we
+ * will block below trying to acquire that lock, so by the time we are
+ * unblocked, the concurrent scan will no longer be interested in the
+ * tuple contents anymore. Therefore, this tuple can be removed from
+ * the block.
+ */
+ UnlockTuple(idxRel, &hitem->tid, ExclusiveLock);
+
+ deletable[numdeletable++] = hitem->tid;
+ }
+
+ /*
+ * Now sort the array of deletable index tuples, and walk this array by
+ * pages doing bulk deletion of items on each page; the free space map is
+ * updated for pages from which we delete items.
+ */
+ qsort(deletable, numdeletable, sizeof(ItemPointerData),
+ qsortCompareItemPointers);
+
+ for (start = 0, i = 0; i < numdeletable; i++)
+ {
+ /*
+ * Are we at the end of the items that together belong in one
+ * particular page? If so, then it's deletion time.
+ */
+ if (i == numdeletable - 1 ||
+ (ItemPointerGetBlockNumber(&deletable[start]) !=
+ ItemPointerGetBlockNumber(&deletable[i + 1])))
+ {
+ OffsetNumber *offnos;
+ int noffs;
+ Buffer buf;
+ Page page;
+ int j;
+ BlockNumber blk;
+ int freespace;
+
+ vacuum_delay_point();
+
+ blk = ItemPointerGetBlockNumber(&deletable[start]);
+ buf = ReadBufferExtended(idxRel, MAIN_FORKNUM, blk,
+ RBM_NORMAL, strategy);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ page = BufferGetPage(buf);
+
+ noffs = i + 1 - start;
+ offnos = palloc(sizeof(OffsetNumber) * noffs);
+
+ for (j = 0; j < noffs; j++)
+ offnos[j] = ItemPointerGetOffsetNumber(&deletable[start + j]);
+
+ /*
+ * Now remove the deletable items from the page.
+ */
+ START_CRIT_SECTION();
+
+ PageIndexDeleteNoCompact(page, offnos, noffs);
+ MarkBufferDirty(buf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(idxRel))
+ {
+ xl_minmax_bulkremove xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+
+ xlrec.node = idxRel->rd_node;
+ xlrec.block = blk;
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxBulkRemove;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].buffer_std = false;
+ rdata[0].next = &(rdata[1]);
+
+ /*
+ * The OffsetNumber array is not actually in the buffer, but we
+ * pretend that it is. When XLogInsert stores the whole
+ * buffer, the offset array need not be stored too.
+ */
+ rdata[1].data = (char *) offnos;
+ rdata[1].len = sizeof(OffsetNumber) * noffs;
+ rdata[1].buffer = buf;
+ rdata[1].buffer_std = true;
+ rdata[1].next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_BULKREMOVE,
+ rdata);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /* next iteration starts where this one ended */
+ start = i + 1;
+
+ /* remember free space while we have the buffer locked */
+ freespace = PageGetFreeSpace(page);
+
+ UnlockReleaseBuffer(buf);
+ pfree(offnos);
+
+ RecordPageWithFreeSpace(idxRel, blk, freespace);
+ }
+ }
+
+ pfree(deletable);
+
+ /* Finally, ensure the index's FSM is consistent */
+ FreeSpaceMapVacuum(idxRel);
+
+ *nonsummed = nonsumm;
+ *numnonsummed = numnonsumm;
+
+ hash_destroy(tuples);
+ }
+
+ /*
+ * Summarize the given page ranges of the given index.
+ */
+ static void
+ rerun_summarization(Relation idxRel, Relation heapRel,
+ mmRevmapAccess *rmAccess, BlockNumber pagesPerRange,
+ BlockNumber *nonsummarized, int numnonsummarized)
+ {
+ int i;
+ IndexInfo *indexInfo;
+ MMBuildState *mmstate;
+
+ indexInfo = BuildIndexInfo(idxRel);
+
+ mmstate = initialize_mm_buildstate(heapRel, idxRel, rmAccess, pagesPerRange, indexInfo);
+
+ for (i = 0; i < numnonsummarized; i++)
+ {
+ BlockNumber blk = nonsummarized[i];
+ ItemPointerData iptr;
+ MMTuple *tup;
+ Size size;
+
+ mmstate->currRangeStart = blk;
+
+ mmGetHeapBlockItemptr(rmAccess, blk, &iptr);
+ /* it can't have been re-summarized concurrently .. */
+ Assert(!ItemPointerIsValid(&iptr));
+
+ IndexBuildHeapRangeScan(heapRel, idxRel, indexInfo, false,
+ blk, pagesPerRange,
+ mmbuildCallback, (void *) mmstate);
+
+ /*
+ * Create the index tuple containing min/max values, and insert it.
+ * Note mmbuildCallback didn't have the chance to actually insert
+ * anything into the index, because the heapscan should have ended
+ * just as it reached the final tuple in the range.
+ */
+ tup = minmax_form_tuple(mmstate->mmDesc, mmstate->dtuple, &size);
+ mm_doinsert(mmstate->irel, mmstate->rmAccess,
+ &mmstate->currentInsertBuf, mmstate->currRangeStart, tup,
+ size);
+ mmstate->numtuples++;
+ pfree(tup);
+
+ minmax_dtuple_initialize(mmstate->dtuple, mmstate->mmDesc);
+ }
+
+ if (!BufferIsInvalid(mmstate->currentInsertBuf))
+ {
+ ReleaseBuffer(mmstate->currentInsertBuf);
+ mmstate->currentInsertBuf = InvalidBuffer;
+ }
+ }
+
+ /*
+ * ambulkdelete
+ * Since there are no per-heap-tuple index tuples in minmax indexes,
+ * there's not a lot we can do here.
+ *
+ * XXX we could mark item tuples as "dirty" (when a minimum or maximum heap
+ * tuple is deleted), meaning that summarization needs to be re-run on the
+ * affected range.  We'd need an extra flag in mmtuples for that.
+ */
+ Datum
+ mmbulkdelete(PG_FUNCTION_ARGS)
+ {
+ /* other arguments are not currently used */
+ IndexBulkDeleteResult *stats = (IndexBulkDeleteResult *) PG_GETARG_POINTER(1);
+
+ /* allocate stats if first time through, else re-use existing struct */
+ if (stats == NULL)
+ stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+ PG_RETURN_POINTER(stats);
+ }
+
+ /*
+ * This routine is in charge of "vacuuming" a minmax index: 1) removing index
+ * tuples that are no longer referenced from the revmap, and 2) summarizing
+ * ranges that are currently unsummarized.
+ */
+ Datum
+ mmvacuumcleanup(PG_FUNCTION_ARGS)
+ {
+ IndexVacuumInfo *info = (IndexVacuumInfo *) PG_GETARG_POINTER(0);
+ IndexBulkDeleteResult *stats = (IndexBulkDeleteResult *) PG_GETARG_POINTER(1);
+ mmRevmapAccess *rmAccess;
+ BlockNumber *nonsummarized = NULL;
+ int numnonsummarized;
+ Relation heapRel;
+ BlockNumber heapNumBlocks;
+ BlockNumber pagesPerRange;
+
+ /* No-op in ANALYZE ONLY mode */
+ if (info->analyze_only)
+ PG_RETURN_POINTER(stats);
+
+ rmAccess = mmRevmapAccessInit(info->index, &pagesPerRange);
+
+ heapRel = heap_open(IndexGetRelation(RelationGetRelid(info->index), false),
+ AccessShareLock);
+
+ /*
+ * First scan the index, removing index tuples that are no longer
+ * referenced from the revmap. While at it, collect the page numbers of
+ * ranges that are not summarized.
+ */
+ heapNumBlocks = RelationGetNumberOfBlocks(heapRel);
+ remove_deletable_tuples(info->index, heapNumBlocks, info->strategy,
+ &nonsummarized, &numnonsummarized);
+
+ /* and summarize the ranges collected above */
+ if (nonsummarized)
+ {
+ rerun_summarization(info->index, heapRel, rmAccess, pagesPerRange,
+ nonsummarized, numnonsummarized);
+ pfree(nonsummarized);
+ }
+
+ mmRevmapAccessTerminate(rmAccess);
+ heap_close(heapRel, AccessShareLock);
+
+ PG_RETURN_POINTER(stats);
+ }
+
+ Datum
+ mmoptions(PG_FUNCTION_ARGS)
+ {
+ Datum reloptions = PG_GETARG_DATUM(0);
+ bool validate = PG_GETARG_BOOL(1);
+ relopt_value *options;
+ MinmaxOptions *rdopts;
+ int numoptions;
+ static const relopt_parse_elt tab[] = {
+ {"pages_per_range", RELOPT_TYPE_INT, offsetof(MinmaxOptions, pagesPerRange)}
+ };
+
+ options = parseRelOptions(reloptions, validate, RELOPT_KIND_MINMAX,
+ &numoptions);
+
+ /* if none set, we're done */
+ if (numoptions == 0)
+ PG_RETURN_NULL();
+
+ rdopts = allocateReloptStruct(sizeof(MinmaxOptions), options, numoptions);
+
+ fillRelOptions((void *) rdopts, sizeof(MinmaxOptions), options, numoptions,
+ validate, tab, lengthof(tab));
+
+ pfree(options);
+
+ PG_RETURN_BYTEA_P(rdopts);
+ }
+
+ /*
+ * Insert an index tuple into the index relation. The revmap is updated to
+ * mark the range containing the given page as pointing to the inserted entry.
+ *
+ * The buffer, if valid, is checked for free space to insert the new entry;
+ * if there isn't enough, a new buffer is obtained and pinned.
+ *
+ * The buffer is marked dirty.
+ */
+ static void
+ mm_doinsert(Relation idxrel, mmRevmapAccess *rmAccess, Buffer *buffer,
+ BlockNumber heapblkno, MMTuple *tup, Size itemsz)
+ {
+ Page page;
+ BlockNumber blk;
+ OffsetNumber off;
+ bool extended;
+
+ itemsz = MAXALIGN(itemsz);
+
+ extended = mm_getinsertbuffer(idxrel, buffer, itemsz);
+ page = BufferGetPage(*buffer);
+
+ if (PageGetFreeSpace(page) < itemsz)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("index row size %lu exceeds maximum for index \"%s\"",
+ (unsigned long) itemsz, RelationGetRelationName(idxrel))));
+
+ START_CRIT_SECTION();
+ off = PageAddItem(page, (Item) tup, itemsz, InvalidOffsetNumber,
+ false, false);
+ MarkBufferDirty(*buffer);
+
+ blk = BufferGetBlockNumber(*buffer);
+ MINMAX_elog(DEBUG2, "inserted tuple (%u,%u) for range starting at %u",
+ blk, off, heapblkno);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(idxrel))
+ {
+ xl_minmax_insert xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+ uint8 info = XLOG_MINMAX_INSERT;
+
+ xlrec.target.node = idxrel->rd_node;
+ ItemPointerSet(&xlrec.target.tid, blk, off);
+ xlrec.overwrite = false;
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxInsert;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].buffer_std = false;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = (char *) tup;
+ rdata[1].len = itemsz;
+ rdata[1].buffer = *buffer;
+ rdata[1].buffer_std = true;
+ rdata[1].next = NULL;
+
+ /*
+ * If this is the first tuple in the page, we can reinit the page
+ * instead of restoring the whole thing. Set flag, and hide buffer
+ * references from XLogInsert.
+ */
+ if (extended)
+ {
+ info |= XLOG_MINMAX_INIT_PAGE;
+ rdata[1].buffer = InvalidBuffer;
+ }
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /*
+ * Note we need to keep the lock on the buffer until after the revmap
+ * has been updated. Otherwise, a concurrent scanner could try to obtain
+ * the index tuple from the revmap before we're done writing it.
+ */
+ mmSetHeapBlockItemptr(rmAccess, heapblkno, blk, off);
+
+ LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+ }
+
+ /*
+ * Return an exclusively-locked buffer resulting from extending the relation.
+ */
+ Buffer
+ mm_getnewbuffer(Relation irel)
+ {
+ Buffer buffer;
+ bool needLock = !RELATION_IS_LOCAL(irel);
+
+ /* FIXME need to request a MaxFSMRequestSize page from the FSM here */
+
+ if (needLock)
+ LockRelationForExtension(irel, ExclusiveLock);
+
+ buffer = ReadBuffer(irel, P_NEW);
+ LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+ MINMAX_elog(DEBUG2, "mm_getnewbuffer: extending to page %u",
+ BufferGetBlockNumber(buffer));
+
+ if (needLock)
+ UnlockRelationForExtension(irel, ExclusiveLock);
+
+ return buffer;
+ }
+
+ /*
+ * Return a pinned and locked buffer which can be used to insert an index item
+ * of size itemsz.
+ *
+ * The passed buffer argument is tested for free space; if it has enough, it is
+ * locked and returned. Otherwise, that buffer (if valid) is unpinned, a new
+ * buffer is obtained, and returned pinned and locked.
+ *
+ * If there's no existing page with enough free space to accommodate the new
+ * item, the relation is extended.  This function returns true if that
+ * happened, false otherwise.
+ */
+ static bool
+ mm_getinsertbuffer(Relation irel, Buffer *buffer, Size itemsz)
+ {
+ Buffer buf;
+ bool extended = false;
+
+ buf = *buffer;
+
+ if (BufferIsInvalid(buf) ||
+ (PageGetFreeSpace(BufferGetPage(buf)) < itemsz))
+ {
+ Page page;
+
+ /*
+ * By the time we break out of this loop, buf is a locked and pinned
+ * buffer which has enough free space to satisfy the requirement.
+ */
+ for (;;)
+ {
+ BlockNumber blk;
+ int freespace;
+
+ blk = GetPageWithFreeSpace(irel, itemsz);
+ if (blk == InvalidBlockNumber)
+ {
+ /*
+ * There's not enough free space in any existing index page,
+ * according to the FSM: extend the relation to obtain a shiny
+ * new page.
+ */
+ buf = mm_getnewbuffer(irel);
+ page = BufferGetPage(buf);
+ mm_page_init(page, MINMAX_PAGETYPE_REGULAR);
+
+ /*
+ * If an entirely new page does not contain enough free space
+ * for the new item, then surely that item is oversized.
+ * Complain loudly; but first make sure we record the page as
+ * free, for next time.
+ */
+ freespace = PageGetFreeSpace(page);
+ RecordPageWithFreeSpace(irel, BufferGetBlockNumber(buf),
+ freespace);
+ if (freespace < itemsz)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("index row size %lu exceeds maximum %lu for index \"%s\"",
+ (unsigned long) itemsz,
+ (unsigned long) freespace,
+ RelationGetRelationName(irel))));
+ extended = true;
+ break;
+ }
+
+ Assert(blk != InvalidBlockNumber);
+ buf = ReadBuffer(irel, blk);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ page = BufferGetPage(buf);
+ freespace = PageGetFreeSpace(page);
+ if (freespace >= itemsz)
+ break;
+
+ /* Not really enough space: register reality and start over */
+ UnlockReleaseBuffer(buf);
+ RecordPageWithFreeSpace(irel, blk, freespace);
+ }
+
+ if (!BufferIsInvalid(*buffer))
+ ReleaseBuffer(*buffer);
+ *buffer = buf;
+ }
+ else
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ /*
+ * XXX we could perhaps avoid this if we used RelationSetTargetBlock ...
+ */
+ if (extended)
+ FreeSpaceMapVacuum(irel);
+
+ return extended;
+ }
+
+ FmgrInfo *
+ minmax_get_operfn(MinmaxDesc *mmdesc, AttrNumber attno, uint16 operno)
+ {
+ return &(mmdesc->perCol[attno - 1].opers[operno]);
+ }
*** /dev/null
--- b/src/backend/access/minmax/mmrevmap.c
***************
*** 0 ****
--- 1,683 ----
+ /*
+ * mmrevmap.c
+ * Reverse range map for MinMax indexes
+ *
+ * The reverse range map (revmap) is a translation structure for minmax
+ * indexes: for each page range, there is one most-up-to-date summary tuple,
+ * and its location is tracked by the revmap.  Whenever a tuple that falls
+ * outside the previously recorded min/max values is inserted into the table,
+ * a new summary tuple is inserted into the index and the revmap is updated
+ * to point to it.
+ *
+ * The pages of the revmap are interspersed in the index's main fork. The
+ * first revmap page is always the index's page number one (that is,
+ * immediately after the metapage). Subsequent revmap pages are allocated as
+ * they are needed; their locations are tracked by "array pages". The metapage
+ * contains a large BlockNumber array, whose entries correspond to array pages.  Thus,
+ * to find the second revmap page, we read the metapage and obtain the block
+ * number of the first array page; we then read that page, and the first
+ * element in it is the revmap page we're looking for.
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmrevmap.c
+ */
+ #include "postgres.h"
+
+ #include "access/heapam_xlog.h"
+ #include "access/minmax.h"
+ #include "access/minmax_internal.h"
+ #include "access/minmax_page.h"
+ #include "access/minmax_revmap.h"
+ #include "access/minmax_xlog.h"
+ #include "access/rmgr.h"
+ #include "miscadmin.h"
+ #include "storage/bufmgr.h"
+ #include "storage/lmgr.h"
+ #include "storage/relfilenode.h"
+ #include "storage/smgr.h"
+ #include "utils/memutils.h"
+
+
+
+ /*
+ * In regular revmap pages, each item stores an ItemPointerData. These defines
+ * let one find the logical revmap page number and index number of the revmap
+ * item for the given heap block number.
+ */
+ #define HEAPBLK_TO_REVMAP_BLK(pagesPerRange, heapBlk) \
+ ((heapBlk / pagesPerRange) / REGULAR_REVMAP_PAGE_MAXITEMS)
+ #define HEAPBLK_TO_REVMAP_INDEX(pagesPerRange, heapBlk) \
+ ((heapBlk / pagesPerRange) % REGULAR_REVMAP_PAGE_MAXITEMS)
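+ /*
+ * For example, with pagesPerRange = 4, heap block 8002 belongs to range
+ * number 2000; assuming (say) REGULAR_REVMAP_PAGE_MAXITEMS were 1000, its
+ * revmap item would be item 0 of logical revmap page 2.
+ */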
+
+ /*
+ * In array revmap pages, each item stores a BlockNumber. These defines let
+ * one find the page and index number of a given revmap block number. Note
+ * that the first revmap page (revmap logical page number 0) is always stored
+ * in physical block number 1, so array pages do not store that one.
+ */
+ #define MAPBLK_TO_RMARRAY_BLK(rmBlk) ((rmBlk - 1) / ARRAY_REVMAP_PAGE_MAXITEMS)
+ #define MAPBLK_TO_RMARRAY_INDEX(rmBlk) ((rmBlk - 1) % ARRAY_REVMAP_PAGE_MAXITEMS)
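+ /*
+ * For example, revmap logical page 1 (the second revmap page) is tracked by
+ * item 0 of array page 0, regardless of ARRAY_REVMAP_PAGE_MAXITEMS.
+ */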
+
+
+ struct mmRevmapAccess
+ {
+ Relation idxrel;
+ BlockNumber pagesPerRange;
+ Buffer metaBuf;
+ Buffer currBuf;
+ Buffer currArrayBuf;
+ BlockNumber *revmapArrayPages;
+ };
+ /* typedef appears in minmax_revmap.h */
+
+
+ /*
+ * Initialize an access object for a reverse range map, which can be used to
+ * read entries from it.  This must be freed by mmRevmapAccessTerminate when
+ * the caller is done with it.
+ */
+ mmRevmapAccess *
+ mmRevmapAccessInit(Relation idxrel, BlockNumber *pagesPerRange)
+ {
+ mmRevmapAccess *rmAccess;
+ Buffer meta;
+ MinmaxMetaPageData *metadata;
+
+ meta = ReadBuffer(idxrel, MINMAX_METAPAGE_BLKNO);
+ metadata = (MinmaxMetaPageData *) PageGetContents(BufferGetPage(meta));
+
+ rmAccess = palloc(sizeof(mmRevmapAccess));
+ rmAccess->metaBuf = meta;
+ rmAccess->idxrel = idxrel;
+ rmAccess->pagesPerRange = metadata->pagesPerRange;
+ rmAccess->currBuf = InvalidBuffer;
+ rmAccess->currArrayBuf = InvalidBuffer;
+ rmAccess->revmapArrayPages = NULL;
+
+ if (pagesPerRange)
+ *pagesPerRange = metadata->pagesPerRange;
+
+ return rmAccess;
+ }
+
+ /*
+ * Release resources associated with a revmap access object.
+ */
+ void
+ mmRevmapAccessTerminate(mmRevmapAccess *rmAccess)
+ {
+ if (rmAccess->revmapArrayPages != NULL)
+ pfree(rmAccess->revmapArrayPages);
+ if (rmAccess->metaBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->metaBuf);
+ if (rmAccess->currBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currBuf);
+ if (rmAccess->currArrayBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currArrayBuf);
+ pfree(rmAccess);
+ }
+
+ /*
+ * In the given revmap page, which is used in a minmax index of pagesPerRange
+ * pages per range, set the element corresponding to heap block number heapBlk
+ * to the value (blkno, offno).
+ *
+ * Caller must have obtained the correct revmap page.
+ *
+ * This is used both in regular operation and during WAL replay.
+ */
+ void
+ rm_page_set_iptr(Page page, BlockNumber pagesPerRange, BlockNumber heapBlk,
+ BlockNumber blkno, OffsetNumber offno)
+ {
+ RevmapContents *contents;
+ ItemPointerData *iptr;
+
+ contents = (RevmapContents *) PageGetContents(page);
+ iptr = (ItemPointerData *) contents->rmr_tids;
+ iptr += HEAPBLK_TO_REVMAP_INDEX(pagesPerRange, heapBlk);
+
+ ItemPointerSet(iptr, blkno, offno);
+ }
+
+ /*
+ * Initialize a new regular revmap page, which stores the given revmap logical
+ * page number.  The physical block number of the given buffer is returned.
+ *
+ * Used both by regular code path as well as during xlog replay.
+ */
+ BlockNumber
+ initialize_rmr_page(Buffer newbuf, BlockNumber mapBlk)
+ {
+ BlockNumber blkno;
+ Page page;
+ RevmapContents *contents;
+
+ page = BufferGetPage(newbuf);
+
+ mm_page_init(page, MINMAX_PAGETYPE_REVMAP);
+ contents = (RevmapContents *) PageGetContents(page);
+ contents->rmr_logblk = mapBlk;
+ /* the rmr_tids array is initialized to all invalid by PageInit */
+
+ blkno = BufferGetBlockNumber(newbuf);
+
+ return blkno;
+ }
+
+ /*
+ * Lock the metapage in the mode specified by the caller, and update the given
+ * rmAccess with the metapage data.  The metapage buffer remains locked when
+ * this function returns; it's the caller's responsibility to unlock it.
+ */
+ static void
+ rmaccess_get_metapage(mmRevmapAccess *rmAccess, int lockmode)
+ {
+ MinmaxMetaPageData *metadata;
+ MinmaxSpecialSpace *special PG_USED_FOR_ASSERTS_ONLY;
+ Page metapage;
+
+ LockBuffer(rmAccess->metaBuf, lockmode);
+ metapage = BufferGetPage(rmAccess->metaBuf);
+
+ #ifdef USE_ASSERT_CHECKING
+ /* ensure we really got the metapage */
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(metapage);
+ Assert(special->type == MINMAX_PAGETYPE_META);
+ #endif
+
+ /* first time through? allocate the array */
+ if (rmAccess->revmapArrayPages == NULL)
+ rmAccess->revmapArrayPages =
+ palloc(sizeof(BlockNumber) * MAX_REVMAP_ARRAYPAGES);
+
+ metadata = (MinmaxMetaPageData *) PageGetContents(metapage);
+ memcpy(rmAccess->revmapArrayPages, metadata->revmapArrayPages,
+ sizeof(BlockNumber) * MAX_REVMAP_ARRAYPAGES);
+ }
+
+ /*
+ * Given a buffer (hopefully containing a blank page), set it up as a revmap
+ * array page.
+ *
+ * Used both by regular code path as well as during xlog replay.
+ */
+ void
+ initialize_rma_page(Buffer buf)
+ {
+ Page arrayPg;
+ RevmapArrayContents *contents;
+
+ arrayPg = BufferGetPage(buf);
+ mm_page_init(arrayPg, MINMAX_PAGETYPE_REVMAP_ARRAY);
+ contents = (RevmapArrayContents *) PageGetContents(arrayPg);
+ contents->rma_nblocks = 0;
+ /* set the whole array to InvalidBlockNumber */
+ memset(contents->rma_blocks, 0xFF,
+ sizeof(BlockNumber) * ARRAY_REVMAP_PAGE_MAXITEMS);
+ }
+
+ /*
+ * Update the metapage, so that item arrayBlkIdx in the array of revmap array
+ * pages points to block number newPgBlkno.
+ */
+ static void
+ update_minmax_metapg(Relation idxrel, Buffer meta, uint32 arrayBlkIdx,
+ BlockNumber newPgBlkno)
+ {
+ MinmaxMetaPageData *metadata;
+
+ metadata = (MinmaxMetaPageData *) PageGetContents(BufferGetPage(meta));
+
+ START_CRIT_SECTION();
+ metadata->revmapArrayPages[arrayBlkIdx] = newPgBlkno;
+ MarkBufferDirty(meta);
+ if (RelationNeedsWAL(idxrel))
+ {
+ xl_minmax_metapg_set xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata;
+
+ xlrec.node = idxrel->rd_node;
+ xlrec.blkidx = arrayBlkIdx;
+ xlrec.newpg = newPgBlkno;
+
+ rdata.data = (char *) &xlrec;
+ rdata.len = SizeOfMinmaxMetapgSet;
+ rdata.buffer = InvalidBuffer;
+ rdata.buffer_std = false;
+ rdata.next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_METAPG_SET, &rdata);
+ PageSetLSN(BufferGetPage(meta), recptr);
+ }
+ END_CRIT_SECTION();
+ }
+
+ /*
+ * Given a logical revmap block number, find its physical block number.
+ *
+ * Note this might involve up to two buffer reads, including a possible
+ * update to the metapage.
+ *
+ * If extend is set to true, and the page hasn't been set yet, extend the
+ * array to point to a newly allocated page.
+ */
+ static BlockNumber
+ rm_get_phys_blkno(mmRevmapAccess *rmAccess, BlockNumber mapBlk, bool extend)
+ {
+ int arrayBlkIdx;
+ BlockNumber arrayBlk;
+ RevmapArrayContents *contents;
+ int revmapIdx;
+ BlockNumber targetblk;
+
+ /* the first revmap page is always block number 1 */
+ if (mapBlk == 0)
+ return (BlockNumber) 1;
+
+ /*
+ * For all other cases, take the long route of checking the metapage and
+ * revmap array pages.
+ */
+
+ /*
+ * Copy the revmap array from the metapage into private storage, if not
+ * done already in this scan.
+ */
+ if (rmAccess->revmapArrayPages == NULL)
+ {
+ rmaccess_get_metapage(rmAccess, BUFFER_LOCK_SHARE);
+ LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_UNLOCK);
+ }
+
+ /*
+ * Consult the metapage array; if the array page we need is not set there,
+ * we need to extend the index to allocate the array page, and update the
+ * metapage array.
+ */
+ arrayBlkIdx = MAPBLK_TO_RMARRAY_BLK(mapBlk);
+ if (arrayBlkIdx >= MAX_REVMAP_ARRAYPAGES)
+ elog(ERROR, "non-existent revmap array page requested");
+
+ arrayBlk = rmAccess->revmapArrayPages[arrayBlkIdx];
+ if (arrayBlk == InvalidBlockNumber)
+ {
+ /* if not asked to extend, there's no further work to do here */
+ if (!extend)
+ return InvalidBlockNumber;
+
+ /*
+ * If we need to create a new array page, check the metapage again;
+ * someone might have created it after the last time we read the
+ * metapage. This time we acquire an exclusive lock, since we may need
+ * to extend. Lock before doing the physical relation extension, to
+ * avoid leaving an unused page around in case someone does this
+ * concurrently. Note that, unfortunately, we will be keeping the lock
+ * on the metapage alongside the relation extension lock, while doing a
+ * syscall involving disk I/O. Extending to add a new revmap array page
+ * is fairly infrequent, so it shouldn't be too bad.
+ *
+ * XXX it is possible to extend the relation unconditionally before
+ * locking the metapage, and later if we find that someone else had
+ * already added this page, save the page in FSM as MaxFSMRequestSize.
+ * That would be better for concurrency. Explore someday.
+ */
+ rmaccess_get_metapage(rmAccess, BUFFER_LOCK_EXCLUSIVE);
+
+ if (rmAccess->revmapArrayPages[arrayBlkIdx] == InvalidBlockNumber)
+ {
+ BlockNumber newPgBlkno;
+
+ /*
+ * Ok, definitely need to allocate a new revmap array page;
+ * initialize a new page to the initial (empty) array revmap state
+ * and register it in metapage.
+ */
+ START_CRIT_SECTION();
+ rmAccess->currArrayBuf = mm_getnewbuffer(rmAccess->idxrel);
+ initialize_rma_page(rmAccess->currArrayBuf);
+ MarkBufferDirty(rmAccess->currArrayBuf);
+ if (RelationNeedsWAL(rmAccess->idxrel))
+ {
+ xl_minmax_init_rmpg xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata;
+
+ xlrec.node = rmAccess->idxrel->rd_node;
+ xlrec.blkno = BufferGetBlockNumber(rmAccess->currArrayBuf);
+ xlrec.array = true;
+ xlrec.logblk = InvalidBlockNumber;
+
+ rdata.data = (char *) &xlrec;
+ rdata.len = SizeOfMinmaxInitRmpg;
+ rdata.buffer = InvalidBuffer; /* FIXME */
+ rdata.buffer_std = false;
+ rdata.next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_INIT_RMPG, &rdata);
+ PageSetLSN(BufferGetPage(rmAccess->currArrayBuf), recptr);
+ }
+ END_CRIT_SECTION();
+ LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_UNLOCK);
+ newPgBlkno = BufferGetBlockNumber(rmAccess->currArrayBuf);
+ rmAccess->revmapArrayPages[arrayBlkIdx] = newPgBlkno;
+
+ MINMAX_elog(DEBUG2, "allocated block for revmap array page: %u",
+ BufferGetBlockNumber(rmAccess->currArrayBuf));
+
+ /* Update the metapage to point to the new array page. */
+ update_minmax_metapg(rmAccess->idxrel, rmAccess->metaBuf, arrayBlkIdx,
+ newPgBlkno);
+ }
+
+ LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_UNLOCK);
+ arrayBlk = rmAccess->revmapArrayPages[arrayBlkIdx];
+ }
+
+ /*
+ * By here, we know the array page is set in the metapage array. Read that
+ * page; except that if we just allocated it, or we already hold a pin on it,
+ * we don't need to read it again.  XXX but we didn't hold the lock!
+ */
+ Assert(arrayBlk != InvalidBlockNumber);
+
+ if (rmAccess->currArrayBuf == InvalidBuffer ||
+ BufferGetBlockNumber(rmAccess->currArrayBuf) != arrayBlk)
+ {
+ if (rmAccess->currArrayBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currArrayBuf);
+
+ rmAccess->currArrayBuf =
+ ReadBuffer(rmAccess->idxrel, arrayBlk);
+ }
+
+ LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_SHARE);
+
+ /*
+ * And now we can inspect its contents; if the target page is set, we can
+ * just return. Even if not set, we can also return if caller asked us not
+ * to extend the revmap.
+ */
+ contents = (RevmapArrayContents *)
+ PageGetContents(BufferGetPage(rmAccess->currArrayBuf));
+ revmapIdx = MAPBLK_TO_RMARRAY_INDEX(mapBlk);
+ if (!extend || revmapIdx <= contents->rma_nblocks - 1)
+ {
+ LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_UNLOCK);
+
+ return contents->rma_blocks[revmapIdx];
+ }
+
+ /*
+ * Trade our shared lock in the array page for exclusive, because we now
+ * need to allocate one more revmap page and modify the array page.
+ */
+ LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_UNLOCK);
+ LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_EXCLUSIVE);
+
+ contents = (RevmapArrayContents *)
+ PageGetContents(BufferGetPage(rmAccess->currArrayBuf));
+
+ /*
+ * If someone else already set the value while we were waiting for the
+ * exclusive lock, we're done; otherwise, allocate a new block as the
+ * new revmap page, and update the array page to point to it.
+ *
+ * FIXME -- what if we were asked not to extend?
+ */
+ if (contents->rma_blocks[revmapIdx] != InvalidBlockNumber)
+ {
+ targetblk = contents->rma_blocks[revmapIdx];
+ }
+ else
+ {
+ Buffer newbuf;
+
+ START_CRIT_SECTION();
+ newbuf = mm_getnewbuffer(rmAccess->idxrel);
+ targetblk = initialize_rmr_page(newbuf, mapBlk);
+ MarkBufferDirty(newbuf);
+ if (RelationNeedsWAL(rmAccess->idxrel))
+ {
+ xl_minmax_init_rmpg xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata;
+
+ xlrec.node = rmAccess->idxrel->rd_node;
+ xlrec.blkno = BufferGetBlockNumber(newbuf);
+ xlrec.array = false;
+ xlrec.logblk = mapBlk;
+
+ rdata.data = (char *) &xlrec;
+ rdata.len = SizeOfMinmaxInitRmpg;
+ rdata.buffer = InvalidBuffer;
+ rdata.buffer_std = false;
+ rdata.next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_INIT_RMPG, &rdata);
+ PageSetLSN(BufferGetPage(newbuf), recptr);
+ }
+ END_CRIT_SECTION();
+
+ UnlockReleaseBuffer(newbuf);
+
+ /*
+ * Modify the revmap array page to point to the newly allocated revmap
+ * page.
+ */
+ START_CRIT_SECTION();
+
+ contents->rma_blocks[revmapIdx] = targetblk;
+ /*
+ * XXX this rma_nblocks assignment should probably be conditional on the
+ * current rma_blocks value.
+ */
+ contents->rma_nblocks = revmapIdx + 1;
+ MarkBufferDirty(rmAccess->currArrayBuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rmAccess->idxrel))
+ {
+ xl_minmax_rmarray_set xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+ uint8 info;
+
+ info = XLOG_MINMAX_RMARRAY_SET;
+
+ xlrec.node = rmAccess->idxrel->rd_node;
+ xlrec.rmarray = BufferGetBlockNumber(rmAccess->currArrayBuf);
+ xlrec.blkidx = revmapIdx;
+ xlrec.newpg = targetblk;
+
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxRmarraySet;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].buffer_std = false;
+ rdata[0].next = &rdata[1];
+
+ rdata[1].data = NULL;
+ rdata[1].len = 0;
+ rdata[1].buffer = rmAccess->currArrayBuf;
+ rdata[1].buffer_std = false;
+ rdata[1].next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+ PageSetLSN(BufferGetPage(rmAccess->currArrayBuf), recptr);
+ }
+
+ END_CRIT_SECTION();
+ }
+
+ LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_UNLOCK);
+
+ return targetblk;
+ }
+
+ /*
+ * Set the TID of the index entry corresponding to the range that includes
+ * the given heap page to the given item pointer.
+ *
+ * The map is extended, if necessary.
+ */
+ void
+ mmSetHeapBlockItemptr(mmRevmapAccess *rmAccess, BlockNumber heapBlk,
+ BlockNumber blkno, OffsetNumber offno)
+ {
+ BlockNumber mapBlk;
+ bool extend = false;
+
+ mapBlk = HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk);
+
+ /* Translate the map block number to physical location */
+ mapBlk = rm_get_phys_blkno(rmAccess, mapBlk, true);
+
+ MINMAX_elog(DEBUG2, "setting %u/%u in logical page %lu (physical %u) for heap %u",
+ blkno, offno,
+ HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk),
+ mapBlk, heapBlk);
+
+ /*
+ * Obtain the buffer from which we need to read. If we already have the
+ * correct buffer in our access struct, use that; otherwise, release it
+ * (if valid) and read the one we need.
+ */
+ if (rmAccess->currBuf == InvalidBuffer ||
+ mapBlk != BufferGetBlockNumber(rmAccess->currBuf))
+ {
+ if (rmAccess->currBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currBuf);
+
+ Assert(mapBlk != InvalidBlockNumber);
+ rmAccess->currBuf = ReadBuffer(rmAccess->idxrel, mapBlk);
+ }
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_EXCLUSIVE);
+ START_CRIT_SECTION();
+
+ rm_page_set_iptr(BufferGetPage(rmAccess->currBuf),
+ rmAccess->pagesPerRange,
+ heapBlk,
+ blkno, offno);
+
+ MarkBufferDirty(rmAccess->currBuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rmAccess->idxrel))
+ {
+ xl_minmax_rm_set xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+ uint8 info;
+
+ info = XLOG_MINMAX_REVMAP_SET;
+
+ xlrec.node = rmAccess->idxrel->rd_node;
+ xlrec.mapBlock = mapBlk;
+ xlrec.pagesPerRange = rmAccess->pagesPerRange;
+ xlrec.heapBlock = heapBlk;
+ ItemPointerSet(&(xlrec.newval), blkno, offno);
+
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxRevmapSet;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].buffer_std = false;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = NULL;
+ rdata[1].len = 0;
+ rdata[1].buffer = rmAccess->currBuf;
+ rdata[1].buffer_std = false;
+ rdata[1].next = NULL;
+
+ if (extend)
+ {
+ info |= XLOG_MINMAX_INIT_PAGE;
+ /* If the page is new, there's no need for a full page image */
+ rdata[0].next = NULL;
+ }
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+ PageSetLSN(BufferGetPage(rmAccess->currBuf), recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_UNLOCK);
+ }
+
+
+ /*
+ * Return the TID of the index entry corresponding to the range that includes
+ * the given heap page. If the TID is valid, the tuple is locked with
+ * LockTuple. It is the caller's responsibility to release that lock.
+ */
+ void
+ mmGetHeapBlockItemptr(mmRevmapAccess *rmAccess, BlockNumber heapBlk,
+ ItemPointerData *out)
+ {
+ BlockNumber mapBlk;
+ RevmapContents *contents;
+ ItemPointerData *iptr;
+
+ mapBlk = HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk);
+ /* Translate the map block number to physical location */
+ mapBlk = rm_get_phys_blkno(rmAccess, mapBlk, false);
+ if (mapBlk == InvalidBlockNumber)
+ {
+ ItemPointerSetInvalid(out);
+ return;
+ }
+
+ if (rmAccess->currBuf == InvalidBuffer ||
+ BufferGetBlockNumber(rmAccess->currBuf) != mapBlk)
+ {
+ if (rmAccess->currBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currBuf);
+
+ Assert(mapBlk != InvalidBlockNumber);
+ rmAccess->currBuf = ReadBuffer(rmAccess->idxrel, mapBlk);
+ }
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_SHARE);
+
+ contents = (RevmapContents *)
+ PageGetContents(BufferGetPage(rmAccess->currBuf));
+ iptr = contents->rmr_tids;
+ iptr += HEAPBLK_TO_REVMAP_INDEX(rmAccess->pagesPerRange, heapBlk);
+
+ ItemPointerCopy(iptr, out);
+
+ if (ItemPointerIsValid(iptr))
+ LockTuple(rmAccess->idxrel, iptr, ShareLock);
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_UNLOCK);
+ }
+
+ /*
+ * Initialize the revmap of a new minmax index.
+ *
+ * NB -- caller is assumed to WAL-log this operation
+ */
+ void
+ mmRevmapCreate(Relation idxrel)
+ {
+ Buffer buf;
+
+ /*
+ * The first page of the revmap is always stored in block number 1 of the
+ * main fork. Because of this, the only thing we need to do is request
+ * a new page; we assume we are called immediately after the metapage has
+ * been initialized.
+ */
+ buf = mm_getnewbuffer(idxrel);
+ Assert(BufferGetBlockNumber(buf) == 1);
+
+ mm_page_init(BufferGetPage(buf), MINMAX_PAGETYPE_REVMAP);
+ MarkBufferDirty(buf);
+
+ UnlockReleaseBuffer(buf);
+ }
*** /dev/null
--- b/src/backend/access/minmax/mmsortable.c
***************
*** 0 ****
--- 1,225 ----
+ /*
+ * mmsortable.c
+ * Implementation of Minmax indexes for sortable datatypes
+ * (that is, anything with a btree opclass)
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmsortable.c
+ */
+ #include "postgres.h"
+
+ #include "access/minmax_internal.h"
+ #include "access/minmax_tuple.h"
+ #include "access/skey.h"
+ #include "utils/datum.h"
+ #include "utils/lsyscache.h"
+ #include "utils/syscache.h"
+
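+ /*
+ * The OPER_* values index the MinmaxOpers->opers array, in the order in
+ * which mmSortableGetOpers fills it.
+ */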
+ #define OPER_LT 0
+ #define OPER_LTEQ 1
+ #define OPER_EQ 2
+ #define OPER_GTEQ 3
+ #define OPER_GT 4
+
+
+ PG_FUNCTION_INFO_V1(mmSortableGetOpers);
+ PG_FUNCTION_INFO_V1(mmSortableMaybeUpdateValues);
+ PG_FUNCTION_INFO_V1(mmSortableCompare);
+
+ Datum mmSortableGetOpers(PG_FUNCTION_ARGS);
+ Datum mmSortableMaybeUpdateValues(PG_FUNCTION_ARGS);
+ Datum mmSortableCompare(PG_FUNCTION_ARGS);
+
+ /*
+ * Return the number and OIDs of (the functions that underlie) operators we
+ * need to build a minmax index, as a pointer to a newly palloc'ed MinmaxOpers.
+ */
+ Datum
+ mmSortableGetOpers(PG_FUNCTION_ARGS)
+ {
+ Oid opfamily = PG_GETARG_OID(0);
+ Oid optypid = PG_GETARG_OID(1);
+ MinmaxOpers *mmopers;
+
+ mmopers = palloc(offsetof(MinmaxOpers, opers) + sizeof(Oid) * 5);
+ mmopers->nopers = 5; /* <, <=, =, >=, > */
+ mmopers->nstored = 2; /* min, max */
+
+ mmopers->opers[OPER_LT] =
+ get_opcode(get_opfamily_member(opfamily, optypid, optypid,
+ BTLessStrategyNumber));
+ mmopers->opers[OPER_LTEQ] =
+ get_opcode(get_opfamily_member(opfamily, optypid, optypid,
+ BTLessEqualStrategyNumber));
+ mmopers->opers[OPER_EQ] =
+ get_opcode(get_opfamily_member(opfamily, optypid, optypid,
+ BTEqualStrategyNumber));
+ mmopers->opers[OPER_GTEQ] =
+ get_opcode(get_opfamily_member(opfamily, optypid, optypid,
+ BTGreaterEqualStrategyNumber));
+ mmopers->opers[OPER_GT] =
+ get_opcode(get_opfamily_member(opfamily, optypid, optypid,
+ BTGreaterStrategyNumber));
+
+ PG_RETURN_POINTER(mmopers);
+ }
+
+ /*
+ * Examine the given index tuple (which contains partial status of a certain
+ * page range) by comparing it to the given value that comes from another heap
+ * tuple. If the new value is outside the domain specified by the existing
+ * tuple values, update the index range and return true.  Otherwise, return
+ * false without modifying the tuple.
+ */
+ Datum
+ mmSortableMaybeUpdateValues(PG_FUNCTION_ARGS)
+ {
+ MinmaxDesc *mmdesc = (MinmaxDesc *) PG_GETARG_POINTER(0);
+ DeformedMMTuple *dtuple = (DeformedMMTuple *) PG_GETARG_POINTER(1);
+ AttrNumber attno = PG_GETARG_UINT16(2);
+ Datum newval = PG_GETARG_DATUM(3);
+ bool isnull = PG_GETARG_BOOL(4);
+ FmgrInfo *cmpFn;
+ Datum compar;
+ bool updated = false;
+ Oid colloid = InvalidOid; /* FIXME -- figure out collation stuff */
+
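+ /*
+ * For sortable opclasses, values[0] holds the current minimum and
+ * values[1] the current maximum of the column within the page range.
+ */
+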
+ /*
+ * If the new value is null, we record that we saw it if it's the first
+ * one; otherwise, there's nothing to do.
+ */
+ if (isnull)
+ {
+ if (dtuple->perCol[attno - 1].hasnulls)
+ PG_RETURN_BOOL(false);
+
+ dtuple->perCol[attno - 1].hasnulls = true;
+ PG_RETURN_BOOL(true);
+ }
+
+ /*
+ * If only nulls have been recorded so far, store the new value (which we
+ * know is not null) as both minimum and maximum, and we're done.
+ */
+ if (dtuple->perCol[attno - 1].allnulls)
+ {
+ dtuple->perCol[attno - 1].values[0] =
+ datumCopy(newval, mmdesc->tupdesc->attrs[attno - 1]->attbyval,
+ mmdesc->tupdesc->attrs[attno - 1]->attlen);
+ dtuple->perCol[attno - 1].values[1] =
+ datumCopy(newval, mmdesc->tupdesc->attrs[attno - 1]->attbyval,
+ mmdesc->tupdesc->attrs[attno - 1]->attlen);
+ dtuple->perCol[attno - 1].allnulls = false;
+ PG_RETURN_BOOL(true);
+ }
+
+ /*
+ * Otherwise, we need to compare the new value with the existing boundaries
+ * and update them accordingly. First check if it's less than the existing
+ * minimum.
+ */
+ cmpFn = minmax_get_operfn(mmdesc, attno, OPER_LT);
+ compar = FunctionCall2Coll(cmpFn, colloid, newval,
+ dtuple->perCol[attno - 1].values[0]);
+ if (DatumGetBool(compar))
+ {
+ dtuple->perCol[attno - 1].values[0] =
+ datumCopy(newval, mmdesc->tupdesc->attrs[attno - 1]->attbyval,
+ mmdesc->tupdesc->attrs[attno - 1]->attlen);
+ updated = true;
+ }
+
+ /*
+ * And now compare it to the existing maximum.
+ */
+ cmpFn = minmax_get_operfn(mmdesc, attno, OPER_GT);
+ compar = FunctionCall2Coll(cmpFn, colloid, newval,
+ dtuple->perCol[attno - 1].values[1]);
+ if (DatumGetBool(compar))
+ {
+ dtuple->perCol[attno - 1].values[1] =
+ datumCopy(newval, mmdesc->tupdesc->attrs[attno - 1]->attbyval,
+ mmdesc->tupdesc->attrs[attno - 1]->attlen);
+ updated = true;
+ }
+
+ PG_RETURN_BOOL(updated);
+ }
+
+ /*
+ * Given an index tuple corresponding to a certain page range, and a scan key
+ * (represented by its index attribute number, the value and an operator
+ * strategy number), return whether the scan key is consistent with the page
+ * range. Return true if so, false otherwise.
+ *
+ * XXX what do we need to do with NULL values here, if anything?
+ */
+ Datum
+ mmSortableCompare(PG_FUNCTION_ARGS)
+ {
+ MinmaxDesc *mmdesc = (MinmaxDesc *) PG_GETARG_POINTER(0);
+ DeformedMMTuple *dtup = (DeformedMMTuple *) PG_GETARG_POINTER(1);
+ AttrNumber attno = PG_GETARG_INT16(2);
+ StrategyNumber strat = PG_GETARG_UINT16(3);
+ Datum value = PG_GETARG_DATUM(4);
+ Datum matches;
+ Oid colloid = InvalidOid; /* figure out collation stuff */
+ FmgrInfo *cmpFn;
+
+ switch (strat)
+ {
+ case BTLessStrategyNumber:
+ cmpFn = minmax_get_operfn(mmdesc, attno, OPER_LT);
+ matches = FunctionCall2Coll(cmpFn, colloid,
+ dtup->perCol[attno - 1].values[0],
+ value);
+ break;
+ case BTLessEqualStrategyNumber:
+ cmpFn = minmax_get_operfn(mmdesc, attno, OPER_LTEQ);
+ matches = FunctionCall2Coll(cmpFn, colloid,
+ dtup->perCol[attno - 1].values[0],
+ value);
+ break;
+ case BTEqualStrategyNumber:
+
+ /*
+ * In the equality case (WHERE col = someval), we want to return
+ * the current page range if the minimum value in the range <= scan
+ * key, and the maximum value >= scan key.
+ */
+ cmpFn = minmax_get_operfn(mmdesc, attno, OPER_LTEQ);
+ matches = FunctionCall2Coll(cmpFn, colloid,
+ dtup->perCol[attno - 1].values[0],
+ value);
+ if (!DatumGetBool(matches))
+ break;
+ /* max() >= scankey */
+ cmpFn = minmax_get_operfn(mmdesc, attno, OPER_GTEQ);
+ matches = FunctionCall2Coll(cmpFn, colloid,
+ dtup->perCol[attno - 1].values[1],
+ value);
+ break;
+ case BTGreaterEqualStrategyNumber:
+ cmpFn = minmax_get_operfn(mmdesc, attno, OPER_GTEQ);
+ matches = FunctionCall2Coll(cmpFn, colloid,
+ dtup->perCol[attno - 1].values[1],
+ value);
+ break;
+ case BTGreaterStrategyNumber:
+ cmpFn = minmax_get_operfn(mmdesc, attno, OPER_GT);
+ matches = FunctionCall2Coll(cmpFn, colloid,
+ dtup->perCol[attno - 1].values[1],
+ value);
+ break;
+ default:
+ /* shouldn't happen */
+ elog(ERROR, "invalid strategy number %d", strat);
+ matches = 0;
+ break;
+ }
+
+ PG_RETURN_DATUM(matches);
+ }
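
These three functions are the whole opclass-side interface: GetOpers tells the
AM which operators to look up, MaybeUpdateValues maintains the per-range
summary, and Compare decides whether a range can be skipped during a scan.
Just as a sketch (not necessarily how the AM-side code ends up looking; the
variable names are illustrative), the insertion path is expected to drive
MaybeUpdateValues once per indexed column for each heap tuple in the range,
rewriting the index tuple only if any column's summary actually changed:

	bool		need_rewrite = false;
	AttrNumber	attno;

	for (attno = 1; attno <= mmdesc->tupdesc->natts; attno++)
	{
		FmgrInfo   *updproc;
		Datum		result;

		/* support proc 2 of the opclass, per MINMAX_PROCNUM_MAYBEUPDATE */
		updproc = index_getprocinfo(idxRel, attno, MINMAX_PROCNUM_MAYBEUPDATE);
		result = FunctionCall5Coll(updproc, InvalidOid,	/* XXX collation */
								   PointerGetDatum(mmdesc),
								   PointerGetDatum(dtup),
								   UInt16GetDatum(attno),
								   values[attno - 1],
								   BoolGetDatum(nulls[attno - 1]));
		need_rewrite |= DatumGetBool(result);
	}
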
*** /dev/null
--- b/src/backend/access/minmax/mmtuple.c
***************
*** 0 ****
--- 1,449 ----
+ /*
+ * MinMax-specific tuples
+ * Method implementations for tuples in minmax indexes.
+ *
+ * The intended interface is that code outside this file only deals with
+ * DeformedMMTuples, and converts to and from the on-disk representation
+ * using the functions in this file.
+ *
+ * NOTES
+ *
+ * A minmax tuple is similar to a heap tuple, with a few key differences. The
+ * first interesting difference is that the tuple header is much simpler, only
+ * containing its total length and a small area for flags. Also, the stored
+ * data does not match the relation tuple descriptor exactly: for each
+ * attribute in the descriptor, the index tuple carries an arbitrary number
+ * of values, depending on the opclass.
+ *
+ * Also, for each column of the index relation there are two null bits: one
+ * (hasnulls) stores whether any tuple within the page range has that column
+ * set to null; the other one (allnulls) stores whether the column values are
+ * all null. If allnulls is true, then the tuple data area does not contain
+ * values for that column at all; if only hasnulls is set, values are stored
+ * as usual. Note that the size of the null bitmask may differ from that of
+ * the datum array.
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmtuple.c
+ */
+ #include "postgres.h"
+
+ #include "access/htup_details.h"
+ #include "access/minmax_tuple.h"
+ #include "access/tupdesc.h"
+ #include "access/tupmacs.h"
+
+
+ static inline void mm_deconstruct_tuple(MinmaxDesc *mmdesc,
+ char *tp, bits8 *nullbits, bool nulls,
+ Datum *values, bool *allnulls, bool *hasnulls);
+
+
+ /*
+ * Return a tuple descriptor used for on-disk storage of minmax tuples.
+ */
+ static TupleDesc
+ mmtuple_disk_tupdesc(MinmaxDesc *mmdesc)
+ {
+ /* We cache these in the MinmaxDesc */
+ if (mmdesc->disktdesc == NULL)
+ {
+ int i;
+ int j;
+ AttrNumber attno = 1;
+ TupleDesc tupdesc;
+
+ tupdesc = CreateTemplateTupleDesc(mmdesc->totalstored, false);
+
+ for (i = 0; i < mmdesc->tupdesc->natts; i++)
+ {
+ for (j = 0; j < mmdesc->perCol[i].numstored; j++)
+ TupleDescInitEntry(tupdesc, attno++, NULL,
+ mmdesc->tupdesc->attrs[i]->atttypid,
+ mmdesc->tupdesc->attrs[i]->atttypmod,
+ 0);
+ }
+
+ mmdesc->disktdesc = tupdesc;
+ }
+
+ return mmdesc->disktdesc;
+ }
+
+ /*
+ * Generate a new on-disk tuple to be inserted in a minmax index.
+ */
+ MMTuple *
+ minmax_form_tuple(MinmaxDesc *mmdesc, DeformedMMTuple *tuple, Size *size)
+ {
+ Datum *values;
+ bool *nulls;
+ bool anynulls = false;
+ MMTuple *rettuple;
+ int keyno;
+ int idxattno;
+ uint16 phony_infomask;
+ bits8 *phony_nullbitmap;
+ Size len,
+ hoff,
+ data_len;
+
+ Assert(mmdesc->totalstored > 0);
+
+ values = palloc(sizeof(Datum) * mmdesc->totalstored);
+ nulls = palloc0(sizeof(bool) * mmdesc->totalstored);
+ phony_nullbitmap = palloc(sizeof(bits8) * BITMAPLEN(mmdesc->totalstored));
+
+ /*
+ * Set up the values/nulls arrays for heap_fill_tuple
+ */
+ for (idxattno = 0, keyno = 0; keyno < mmdesc->tupdesc->natts; keyno++)
+ {
+ int datumno;
+
+ /*
+ * "allnulls" is set when there's no nonnull value in any row in
+ * the column; when this happens, there is no data to store. Thus
+ * set the nullable bits for all data elements of this column and
+ * we're done.
+ */
+ if (tuple->perCol[keyno].allnulls)
+ {
+ for (datumno = 0;
+ datumno < mmdesc->perCol[keyno].numstored;
+ datumno++)
+ nulls[idxattno++] = true;
+ anynulls = true;
+ continue;
+ }
+
+ /*
+ * The "hasnulls" bit is set when there are some null values in the
+ * data. We still need to store real values in that case, but the presence
+ * of this bit means the on-disk tuple needs a null bitmap.
+ */
+ if (tuple->perCol[keyno].hasnulls)
+ anynulls = true;
+
+ for (datumno = 0;
+ datumno < mmdesc->perCol[keyno].numstored;
+ datumno++)
+ /* XXX datumCopy ?? */
+ values[idxattno++] = tuple->perCol[keyno].values[datumno];
+ }
+
+ /* compute total space needed */
+ len = SizeOfMinMaxTuple;
+ if (anynulls)
+ {
+ /*
+ * We need a double-length bitmap on an on-disk minmax index tuple;
+ * the first half stores the "allnulls" bits, the second stores
+ * "hasnulls".
+ */
+ len += BITMAPLEN(mmdesc->tupdesc->natts * 2);
+ }
+
+ /*
+ * TODO: we can probably do away with alignment here, and save some
+ * precious disk space. When there's no bitmap we can save 6 bytes. Maybe
+ * we can use the first col's type alignment instead of maxalign.
+ */
+ len = hoff = MAXALIGN(len);
+
+ data_len = heap_compute_data_size(mmtuple_disk_tupdesc(mmdesc),
+ values, nulls);
+
+ len += data_len;
+
+ rettuple = palloc0(len);
+ rettuple->mt_info = hoff;
+ Assert((rettuple->mt_info & MMIDX_OFFSET_MASK) == hoff);
+
+ /*
+ * The infomask and null bitmap as computed by heap_fill_tuple are useless
+ * to us. However, that function will not accept a null infomask; and we
+ * need to pass a valid null bitmap so that it will correctly skip
+ * outputting null attributes in the data area.
+ */
+ heap_fill_tuple(mmtuple_disk_tupdesc(mmdesc),
+ values,
+ nulls,
+ (char *) rettuple + hoff,
+ data_len,
+ &phony_infomask,
+ phony_nullbitmap);
+
+ /* done with these */
+ pfree(values);
+ pfree(nulls);
+ pfree(phony_nullbitmap);
+
+ /*
+ * Now fill in the real null bitmasks. allnulls first.
+ */
+ if (anynulls)
+ {
+ bits8 *bitP;
+ int bitmask;
+
+ rettuple->mt_info |= MMIDX_NULLS_MASK;
+
+ bitP = ((bits8 *) ((char *) rettuple + SizeOfMinMaxTuple)) - 1;
+ bitmask = HIGHBIT;
+ for (keyno = 0; keyno < mmdesc->tupdesc->natts; keyno++)
+ {
+ if (bitmask != HIGHBIT)
+ bitmask <<= 1;
+ else
+ {
+ bitP += 1;
+ *bitP = 0x0;
+ bitmask = 1;
+ }
+
+ if (tuple->perCol[keyno].allnulls)
+ continue;
+
+ *bitP |= bitmask;
+ }
+ /* hasnulls bits follow */
+ for (keyno = 0; keyno < mmdesc->tupdesc->natts; keyno++)
+ {
+ if (bitmask != HIGHBIT)
+ bitmask <<= 1;
+ else
+ {
+ bitP += 1;
+ *bitP = 0x0;
+ bitmask = 1;
+ }
+
+ if (tuple->perCol[keyno].hasnulls)
+ continue;
+
+ *bitP |= bitmask;
+ }
+ }
+
+ *size = len;
+ return rettuple;
+ }
+
+ /*
+ * Free a tuple created by minmax_form_tuple
+ */
+ void
+ minmax_free_tuple(MMTuple *tuple)
+ {
+ pfree(tuple);
+ }
+
+ DeformedMMTuple *
+ minmax_new_dtuple(MinmaxDesc *mmdesc)
+ {
+ DeformedMMTuple *dtup;
+ char *currdatum;
+ long basesize;
+ int i;
+
+ basesize = MAXALIGN(sizeof(DeformedMMTuple) +
+ sizeof(MMValues) * mmdesc->tupdesc->natts);
+ dtup = palloc0(basesize + sizeof(Datum) * mmdesc->totalstored);
+ currdatum = (char *) dtup + basesize;
+ for (i = 0; i < mmdesc->tupdesc->natts; i++)
+ {
+ dtup->perCol[i].allnulls = true;
+ dtup->perCol[i].hasnulls = false;
+ dtup->perCol[i].values = (Datum *) currdatum;
+ currdatum += sizeof(Datum) * mmdesc->perCol[i].numstored;
+ }
+
+ return dtup;
+ }
+
+ void
+ minmax_dtuple_initialize(DeformedMMTuple *dtuple, MinmaxDesc *mmdesc)
+ {
+ int i;
+
+ for (i = 0; i < mmdesc->tupdesc->natts; i++)
+ {
+ /*
+ * FIXME -- we may need to pfree() some datums here before clobbering
+ * the whole thing
+ */
+ dtuple->perCol[i].allnulls = true;
+ dtuple->perCol[i].hasnulls = false;
+ memset(dtuple->perCol[i].values, 0,
+ sizeof(Datum) * mmdesc->perCol[i].numstored);
+ }
+ }
+
+ /*
+ * Convert a MMTuple back to a DeformedMMTuple. This is the reverse of
+ * minmax_form_tuple.
+ *
+ * Note we don't need the "on disk tupdesc" here; we rely on our own routine to
+ * deconstruct the tuple from the on-disk format.
+ *
+ * XXX some callers might need copies of each datum; if so we need
+ * to apply datumCopy inside the loop. We probably also need a
+ * minmax_free_dtuple() function.
+ */
+ DeformedMMTuple *
+ minmax_deform_tuple(MinmaxDesc *mmdesc, MMTuple *tuple)
+ {
+ DeformedMMTuple *dtup;
+ Datum *values;
+ bool *allnulls;
+ bool *hasnulls;
+ char *tp;
+ bits8 *nullbits;
+ int keyno;
+ int valueno;
+
+ dtup = minmax_new_dtuple(mmdesc);
+
+ values = palloc(sizeof(Datum) * mmdesc->totalstored);
+ allnulls = palloc(sizeof(bool) * mmdesc->tupdesc->natts);
+ hasnulls = palloc(sizeof(bool) * mmdesc->tupdesc->natts);
+
+ tp = (char *) tuple + MMTupleDataOffset(tuple);
+
+ if (MMTupleHasNulls(tuple))
+ nullbits = (bits8 *) ((char *) tuple + SizeOfMinMaxTuple);
+ else
+ nullbits = NULL;
+ mm_deconstruct_tuple(mmdesc,
+ tp, nullbits, MMTupleHasNulls(tuple),
+ values, allnulls, hasnulls);
+
+ /*
+ * Iterate to assign each of the values to the corresponding item
+ * in the values array of each column.
+ */
+ for (valueno = 0, keyno = 0; keyno < mmdesc->tupdesc->natts; keyno++)
+ {
+ int i;
+
+ if (allnulls[keyno])
+ {
+ valueno += mmdesc->perCol[keyno].numstored;
+ continue;
+ }
+
+ dtup->perCol[keyno].values =
+ palloc(sizeof(Datum) * mmdesc->totalstored);
+
+ /* XXX optional datumCopy()? */
+ for (i = 0; i < mmdesc->perCol[keyno].numstored; i++)
+ dtup->perCol[keyno].values[i] = values[valueno++];
+
+ dtup->perCol[keyno].hasnulls = hasnulls[keyno];
+ dtup->perCol[keyno].allnulls = false;
+ }
+
+ pfree(values);
+ pfree(allnulls);
+ pfree(hasnulls);
+
+ return dtup;
+ }
+
+ /*
+ * mm_deconstruct_tuple
+ * Guts of attribute extraction from an on-disk minmax tuple.
+ *
+ * Its arguments are:
+ * mmdesc minmax descriptor for the stored tuple
+ * tp pointer to the tuple data area
+ * nullbits pointer to the tuple nulls bitmask
+ * nulls "has nulls" bit in tuple infomask
+ * values output values, array of size mmdesc->totalstored
+ * allnulls output "allnulls", size mmdesc->tupdesc->natts
+ * hasnulls output "hasnulls", size mmdesc->tupdesc->natts
+ *
+ * Output arrays must have been allocated by caller.
+ */
+ static inline void
+ mm_deconstruct_tuple(MinmaxDesc *mmdesc,
+ char *tp, bits8 *nullbits, bool nulls,
+ Datum *values, bool *allnulls, bool *hasnulls)
+ {
+ int attnum;
+ int stored;
+ TupleDesc diskdsc;
+ long off = 0;
+
+ /*
+ * First, loop over the attributes to obtain both null flags for each one.
+ */
+ for (attnum = 0; attnum < mmdesc->tupdesc->natts; attnum++)
+ {
+ /*
+ * the "all nulls" bit means that all values in the page range for
+ * this column are null. Therefore there are no values in the tuple
+ * data area.
+ */
+ if (nulls && att_isnull(attnum, nullbits))
+ {
+ allnulls[attnum] = true;
+ continue;
+ }
+
+ allnulls[attnum] = false;
+
+ /*
+ * the "has nulls" bit means that some tuples have nulls, but others
+ * have not-null values. Therefore we know the tuple contains data for
+ * this column.
+ *
+ * The hasnulls bits follow the allnulls bits in the same bitmask.
+ */
+ hasnulls[attnum] =
+ nulls && att_isnull(mmdesc->tupdesc->natts + attnum, nullbits);
+ }
+
+ /*
+ * Iterate to obtain each attribute's stored values. Note that since we
+ * may reuse attribute entries for more than one column, we cannot cache
+ * offsets here.
+ */
+ diskdsc = mmtuple_disk_tupdesc(mmdesc);
+ for (stored = 0, attnum = 0; attnum < mmdesc->tupdesc->natts; attnum++)
+ {
+ int datumno;
+
+ if (allnulls[attnum])
+ {
+ stored += mmdesc->perCol[attnum].numstored;
+ continue;
+ }
+
+ for (datumno = 0;
+ datumno < mmdesc->perCol[attnum].numstored;
+ datumno++)
+ {
+ Form_pg_attribute thisatt = diskdsc->attrs[stored];
+
+ if (thisatt->attlen == -1)
+ {
+ off = att_align_pointer(off, thisatt->attalign, -1,
+ tp + off);
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ off = att_align_nominal(off, thisatt->attalign);
+ }
+
+ values[stored++] = fetchatt(thisatt, tp + off);
+
+ off = att_addlength_pointer(off, thisatt->attlen, tp + off);
+ }
+ }
+ }
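
To make the bitmap layout concrete, here is a worked example derived from the
code above: for a three-column index where column 1 has no nulls in the range,
column 2 is all nulls, and column 3 has a mix of nulls and values,
minmax_form_tuple produces a single bitmap byte laid out as follows (bit 0 is
the least significant bit; as with heap tuples, a set bit means "not null"):

	bit 0  allnulls, col 1 = 1   (column has stored values)
	bit 1  allnulls, col 2 = 0   (no values stored for this column)
	bit 2  allnulls, col 3 = 1
	bit 3  hasnulls, col 1 = 1   (no nulls seen)
	bit 4  hasnulls, col 2 = 0
	bit 5  hasnulls, col 3 = 0   (some nulls seen)

i.e. the byte is 0x0D, and the data area contains values only for columns 1
and 3 -- which is exactly what mm_deconstruct_tuple reconstructs on the way
back.
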
*** /dev/null
--- b/src/backend/access/minmax/mmxlog.c
***************
*** 0 ****
--- 1,304 ----
+ /*
+ * mmxlog.c
+ * XLog replay routines for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmxlog.c
+ */
+ #include "postgres.h"
+
+ #include "access/minmax.h"
+ #include "access/minmax_internal.h"
+ #include "access/minmax_page.h"
+ #include "access/minmax_tuple.h"
+ #include "access/minmax_xlog.h"
+ #include "access/xlogutils.h"
+ #include "storage/freespace.h"
+
+
+ /*
+ * xlog replay routines
+ */
+ static void
+ minmax_xlog_createidx(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_createidx *xlrec = (xl_minmax_createidx *) XLogRecGetData(record);
+ Buffer buf;
+ Page page;
+
+ /* Backup blocks are not used in create_index records */
+ Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+
+ /* create the index's metapage */
+ buf = XLogReadBuffer(xlrec->node, MINMAX_METAPAGE_BLKNO, true);
+ Assert(BufferIsValid(buf));
+ page = (Page) BufferGetPage(buf);
+ mm_metapage_init(page, xlrec->pagesPerRange, xlrec->version);
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ UnlockReleaseBuffer(buf);
+
+ /* also initialize its first revmap page */
+ buf = XLogReadBuffer(xlrec->node, 1, true);
+ Assert(BufferIsValid(buf));
+ page = (Page) BufferGetPage(buf);
+ mm_page_init(page, MINMAX_PAGETYPE_REVMAP);
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ UnlockReleaseBuffer(buf);
+ }
+
+ static void
+ minmax_xlog_insert(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_insert *xlrec = (xl_minmax_insert *) XLogRecGetData(record);
+ BlockNumber blkno;
+ Buffer buffer;
+ Page page;
+ OffsetNumber offnum;
+ int tuplen;
+ MMTuple *mmtuple;
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ blkno = ItemPointerGetBlockNumber(&(xlrec->target.tid));
+ if (record->xl_info & XLOG_MINMAX_INIT_PAGE)
+ {
+ buffer = XLogReadBuffer(xlrec->target.node, blkno, true);
+ Assert(BufferIsValid(buffer));
+ page = (Page) BufferGetPage(buffer);
+
+ mm_page_init(page, MINMAX_PAGETYPE_REGULAR);
+ }
+ else
+ {
+ buffer = XLogReadBuffer(xlrec->target.node, blkno, false);
+ if (!BufferIsValid(buffer))
+ return;
+ page = (Page) BufferGetPage(buffer);
+
+ if (lsn <= PageGetLSN(page)) /* changes are applied */
+ {
+ UnlockReleaseBuffer(buffer);
+ return;
+ }
+ }
+ offnum = ItemPointerGetOffsetNumber(&(xlrec->target.tid));
+ if (PageGetMaxOffsetNumber(page) + 1 < offnum)
+ elog(PANIC, "minmax_xlog_insert: invalid max offset number");
+
+ tuplen = record->xl_len - SizeOfMinmaxInsert;
+ mmtuple = (MMTuple *) ((char *) xlrec + SizeOfMinmaxInsert);
+
+ if (xlrec->overwrite)
+ PageOverwriteItemData(page, offnum, (Item) mmtuple, tuplen);
+ else
+ {
+ offnum = PageAddItem(page, (Item) mmtuple, tuplen, offnum, true, false);
+ if (offnum == InvalidOffsetNumber)
+ elog(PANIC, "minmax_xlog_insert: failed to add tuple");
+ }
+
+ PageSetLSN(page, lsn);
+
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+
+ /* XXX no FSM updates here ... */
+ }
+
+ static void
+ minmax_xlog_bulkremove(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_bulkremove *xlrec = (xl_minmax_bulkremove *) XLogRecGetData(record);
+ Buffer buffer;
+ Page page;
+ OffsetNumber *offnos;
+ int noffs;
+ Size freespace;
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ buffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
+ if (!BufferIsValid(buffer))
+ return;
+ page = (Page) BufferGetPage(buffer);
+
+ if (lsn <= PageGetLSN(page)) /* changes are applied */
+ {
+ UnlockReleaseBuffer(buffer);
+ return;
+ }
+
+ offnos = (OffsetNumber *) ((char *) xlrec + SizeOfMinmaxBulkRemove);
+ noffs = (record->xl_len - SizeOfMinmaxBulkRemove) / sizeof(OffsetNumber);
+
+ PageIndexDeleteNoCompact(page, offnos, noffs);
+ freespace = PageGetFreeSpace(page);
+
+ PageSetLSN(page, lsn);
+
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+
+ /* update FSM as well */
+ XLogRecordPageWithFreeSpace(xlrec->node, xlrec->block, freespace);
+ }
+
+ static void
+ minmax_xlog_revmap_set(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_rm_set *xlrec = (xl_minmax_rm_set *) XLogRecGetData(record);
+ bool init;
+ Buffer buffer;
+ Page page;
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ init = (record->xl_info & XLOG_MINMAX_INIT_PAGE) != 0;
+ buffer = XLogReadBuffer(xlrec->node, xlrec->mapBlock, init);
+ Assert(BufferIsValid(buffer));
+ page = BufferGetPage(buffer);
+ if (init)
+ mm_page_init(page, MINMAX_PAGETYPE_REVMAP);
+
+ rm_page_set_iptr(page, xlrec->pagesPerRange, xlrec->heapBlock,
+ ItemPointerGetBlockNumber(&(xlrec->newval)),
+ ItemPointerGetOffsetNumber(&(xlrec->newval)));
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+ }
+
+ static void
+ minmax_xlog_metapg_set(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_metapg_set *xlrec = (xl_minmax_metapg_set *) XLogRecGetData(record);
+ Buffer meta;
+ Page metapg;
+ MinmaxMetaPageData *metadata;
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ meta = XLogReadBuffer(xlrec->node, MINMAX_METAPAGE_BLKNO, false);
+ Assert(BufferIsValid(meta));
+
+ metapg = BufferGetPage(meta);
+ metadata = (MinmaxMetaPageData *) PageGetContents(metapg);
+ metadata->revmapArrayPages[xlrec->blkidx] = xlrec->newpg;
+
+ PageSetLSN(metapg, lsn);
+ MarkBufferDirty(meta);
+ UnlockReleaseBuffer(meta);
+ }
+
+ static void
+ minmax_xlog_init_rmpg(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_init_rmpg *xlrec = (xl_minmax_init_rmpg *) XLogRecGetData(record);
+ Buffer buffer;
+
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ buffer = XLogReadBuffer(xlrec->node, xlrec->blkno, true);
+ Assert(BufferIsValid(buffer));
+
+ if (xlrec->array)
+ initialize_rma_page(buffer);
+ else
+ initialize_rmr_page(buffer, xlrec->logblk);
+
+ PageSetLSN(BufferGetPage(buffer), lsn);
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+ }
+
+ static void
+ minmax_xlog_rmarray_set(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_rmarray_set *xlrec = (xl_minmax_rmarray_set *) XLogRecGetData(record);
+ Buffer buffer;
+ Page page;
+ RevmapArrayContents *contents;
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ buffer = XLogReadBuffer(xlrec->node, xlrec->rmarray, false);
+ Assert(BufferIsValid(buffer));
+
+ page = BufferGetPage(buffer);
+
+ contents = (RevmapArrayContents *) PageGetContents(page);
+ contents->rma_blocks[xlrec->blkidx] = xlrec->newpg;
+ contents->rma_nblocks = xlrec->blkidx + 1; /* XXX is this okay? */
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+ }
+
+ void
+ minmax_redo(XLogRecPtr lsn, XLogRecord *record)
+ {
+ uint8 info = record->xl_info & ~XLR_INFO_MASK;
+
+ switch (info & XLOG_MINMAX_OPMASK)
+ {
+ case XLOG_MINMAX_CREATE_INDEX:
+ minmax_xlog_createidx(lsn, record);
+ break;
+ case XLOG_MINMAX_INSERT:
+ minmax_xlog_insert(lsn, record);
+ break;
+ case XLOG_MINMAX_BULKREMOVE:
+ minmax_xlog_bulkremove(lsn, record);
+ break;
+ case XLOG_MINMAX_REVMAP_SET:
+ minmax_xlog_revmap_set(lsn, record);
+ break;
+ case XLOG_MINMAX_METAPG_SET:
+ minmax_xlog_metapg_set(lsn, record);
+ break;
+ case XLOG_MINMAX_RMARRAY_SET:
+ minmax_xlog_rmarray_set(lsn, record);
+ break;
+ case XLOG_MINMAX_INIT_RMPG:
+ minmax_xlog_init_rmpg(lsn, record);
+ break;
+ default:
+ elog(PANIC, "minmax_redo: unknown op code %u", info);
+ }
+ }
*** a/src/backend/access/rmgrdesc/Makefile
--- b/src/backend/access/rmgrdesc/Makefile
***************
*** 9,15 **** top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = clogdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
! mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o smgrdesc.o spgdesc.o \
standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
include $(top_srcdir)/src/backend/common.mk
--- 9,16 ----
include $(top_builddir)/src/Makefile.global
OBJS = clogdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
! minmaxdesc.o mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o \
! smgrdesc.o spgdesc.o \
standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
include $(top_srcdir)/src/backend/common.mk
*** /dev/null
--- b/src/backend/access/rmgrdesc/minmaxdesc.c
***************
*** 0 ****
--- 1,95 ----
+ /*-------------------------------------------------------------------------
+ *
+ * minmaxdesc.c
+ * rmgr descriptor routines for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/rmgrdesc/minmaxdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+ #include "postgres.h"
+
+ #include "access/minmax_xlog.h"
+
+ static void
+ out_target(StringInfo buf, xl_minmax_tid *target)
+ {
+ appendStringInfo(buf, "rel %u/%u/%u; tid %u/%u",
+ target->node.spcNode, target->node.dbNode, target->node.relNode,
+ ItemPointerGetBlockNumber(&(target->tid)),
+ ItemPointerGetOffsetNumber(&(target->tid)));
+ }
+
+ void
+ minmax_desc(StringInfo buf, XLogRecord *record)
+ {
+ char *rec = XLogRecGetData(record);
+ uint8 info = record->xl_info & ~XLR_INFO_MASK;
+
+ info &= XLOG_MINMAX_OPMASK;
+ if (info == XLOG_MINMAX_CREATE_INDEX)
+ {
+ xl_minmax_createidx *xlrec = (xl_minmax_createidx *) rec;
+
+ appendStringInfo(buf, "create index: %u/%u/%u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode);
+ }
+ else if (info == XLOG_MINMAX_INSERT)
+ {
+ xl_minmax_insert *xlrec = (xl_minmax_insert *) rec;
+
+ if (record->xl_info & XLOG_MINMAX_INIT_PAGE)
+ appendStringInfo(buf, "insert(init): ");
+ else
+ appendStringInfo(buf, "insert: ");
+ out_target(buf, &(xlrec->target));
+ }
+ else if (info == XLOG_MINMAX_BULKREMOVE)
+ {
+ xl_minmax_bulkremove *xlrec = (xl_minmax_bulkremove *) rec;
+
+ appendStringInfo(buf, "bulkremove: rel %u/%u/%u blk %u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode, xlrec->block);
+ }
+ else if (info == XLOG_MINMAX_REVMAP_SET)
+ {
+ xl_minmax_rm_set *xlrec = (xl_minmax_rm_set *) rec;
+
+ appendStringInfo(buf, "revmap set: rel %u/%u/%u mapblk %u pagesPerRange %u item %u value %u/%u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode, xlrec->mapBlock,
+ xlrec->pagesPerRange, xlrec->heapBlock,
+ ItemPointerGetBlockNumber(&(xlrec->newval)),
+ ItemPointerGetOffsetNumber(&(xlrec->newval)));
+ }
+ else if (info == XLOG_MINMAX_METAPG_SET)
+ {
+ xl_minmax_metapg_set *xlrec = (xl_minmax_metapg_set *) rec;
+
+ appendStringInfo(buf, "metapg: rel %u/%u/%u array revmap idx %d block %u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode,
+ xlrec->blkidx, xlrec->newpg);
+ }
+ else if (info == XLOG_MINMAX_RMARRAY_SET)
+ {
+ xl_minmax_rmarray_set *xlrec = (xl_minmax_rmarray_set *) rec;
+
+ appendStringInfoString(buf, "revmap array: ");
+ appendStringInfo(buf, "rel %u/%u/%u array pg %u revmap idx %d block %u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode,
+ xlrec->rmarray,
+ xlrec->blkidx, xlrec->newpg);
+ }
+
+ else
+ appendStringInfo(buf, "UNKNOWN");
+ }
*** a/src/backend/access/transam/rmgr.c
--- b/src/backend/access/transam/rmgr.c
***************
*** 12,17 ****
--- 12,18 ----
#include "access/gist_private.h"
#include "access/hash.h"
#include "access/heapam_xlog.h"
+ #include "access/minmax_xlog.h"
#include "access/multixact.h"
#include "access/nbtree.h"
#include "access/spgist.h"
*** a/src/backend/catalog/index.c
--- b/src/backend/catalog/index.c
***************
*** 2096,2101 **** IndexBuildHeapScan(Relation heapRelation,
--- 2096,2122 ----
IndexBuildCallback callback,
void *callback_state)
{
+ return IndexBuildHeapRangeScan(heapRelation, indexRelation,
+ indexInfo, allow_sync,
+ 0, InvalidBlockNumber,
+ callback, callback_state);
+ }
+
+ /*
+ * As above, except that instead of scanning the complete heap, only the given
+ * range of blocks, starting at start_blockno, is scanned.  A scan to the end
+ * of the relation can be requested by passing InvalidBlockNumber as numblocks.
+ */
+ double
+ IndexBuildHeapRangeScan(Relation heapRelation,
+ Relation indexRelation,
+ IndexInfo *indexInfo,
+ bool allow_sync,
+ BlockNumber start_blockno,
+ BlockNumber numblocks,
+ IndexBuildCallback callback,
+ void *callback_state)
+ {
bool is_system_catalog;
bool checking_uniqueness;
HeapScanDesc scan;
***************
*** 2166,2171 **** IndexBuildHeapScan(Relation heapRelation,
--- 2187,2195 ----
true, /* buffer access strategy OK */
allow_sync); /* syncscan OK? */
+ /* set our endpoints */
+ heap_setscanlimits(scan, start_blockno, numblocks);
+
reltuples = 0;
/*
*** a/src/backend/replication/logical/decode.c
--- b/src/backend/replication/logical/decode.c
***************
*** 132,137 **** LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogRecord *record)
--- 132,138 ----
case RM_GIST_ID:
case RM_SEQ_ID:
case RM_SPGIST_ID:
+ case RM_MINMAX_ID:
break;
case RM_NEXT_ID:
elog(ERROR, "unexpected RM_NEXT_ID rmgr_id: %u", (RmgrIds) buf.record.xl_rmid);
*** a/src/backend/storage/page/bufpage.c
--- b/src/backend/storage/page/bufpage.c
***************
*** 324,329 **** PageAddItem(Page page,
--- 324,364 ----
}
/*
+ * PageOverwriteItemData
+ * Overwrite the data for the item at the given offset.
+ *
+ * The new data must fit in the existing data space for the old tuple.
+ */
+ void
+ PageOverwriteItemData(Page page, OffsetNumber offset, Item item, Size size)
+ {
+ PageHeader phdr = (PageHeader) page;
+ ItemId itemId;
+
+ /*
+ * Be wary about corrupted page pointers
+ */
+ if (phdr->pd_lower < SizeOfPageHeaderData ||
+ phdr->pd_lower > phdr->pd_upper ||
+ phdr->pd_upper > phdr->pd_special ||
+ phdr->pd_special > BLCKSZ)
+ ereport(PANIC,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u",
+ phdr->pd_lower, phdr->pd_upper, phdr->pd_special)));
+
+ itemId = PageGetItemId(phdr, offset);
+ if (!ItemIdIsUsed(itemId) || !ItemIdHasStorage(itemId))
+ elog(ERROR, "existing item to overwrite is not used");
+
+ if (ItemIdGetLength(itemId) < size)
+ elog(ERROR, "existing item is not large enough to be overwritten");
+
+ memcpy((char *) page + ItemIdGetOffset(itemId), item, size);
+ ItemIdSetNormal(itemId, ItemIdGetOffset(itemId), size);
+ }
+
+ /*
* PageGetTempPage
* Get a temporary page in local memory for special processing.
* The returned page is not initialized at all; caller must do that.
***************
*** 399,405 **** PageRestoreTempPage(Page tempPage, Page oldPage)
}
/*
! * sorting support for PageRepairFragmentation and PageIndexMultiDelete
*/
typedef struct itemIdSortData
{
--- 434,441 ----
}
/*
! * sorting support for PageRepairFragmentation, PageIndexMultiDelete,
! * PageIndexDeleteNoCompact
*/
typedef struct itemIdSortData
{
***************
*** 896,901 **** PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
--- 932,1113 ----
phdr->pd_upper = upper;
}
+ /*
+ * PageIndexDeleteNoCompact
+ * Delete the given items from an index page, and defragment the resulting
+ * free space, but do not compact the item pointers array.
+ *
+ * itemnos is the array of offset numbers of the items to delete; nitems is
+ * its size.
+ *
+ * Unused items at the end of the array are removed.
+ *
+ * This is used for index AMs that require that existing TIDs of live tuples
+ * remain unchanged.
+ */
+ void
+ PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos, int nitems)
+ {
+ PageHeader phdr = (PageHeader) page;
+ LocationIndex pd_lower = phdr->pd_lower;
+ LocationIndex pd_upper = phdr->pd_upper;
+ LocationIndex pd_special = phdr->pd_special;
+ int nline;
+ bool empty;
+ OffsetNumber offnum;
+ int nextitm;
+
+ /*
+ * As with PageRepairFragmentation, paranoia seems justified.
+ */
+ if (pd_lower < SizeOfPageHeaderData ||
+ pd_lower > pd_upper ||
+ pd_upper > pd_special ||
+ pd_special > BLCKSZ ||
+ pd_special != MAXALIGN(pd_special))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u",
+ pd_lower, pd_upper, pd_special)));
+
+ /*
+ * Scan the existing item pointer array and mark as unused those that are
+ * in our kill-list; make sure any non-interesting ones are marked unused
+ * as well.
+ */
+ nline = PageGetMaxOffsetNumber(page);
+ empty = true;
+ nextitm = 0;
+ for (offnum = FirstOffsetNumber; offnum <= nline; offnum = OffsetNumberNext(offnum))
+ {
+ ItemId lp;
+ ItemLength itemlen;
+ ItemOffset offset;
+
+ lp = PageGetItemId(page, offnum);
+
+ itemlen = ItemIdGetLength(lp);
+ offset = ItemIdGetOffset(lp);
+
+ if (ItemIdIsUsed(lp))
+ {
+ if (offset < pd_upper ||
+ (offset + itemlen) > pd_special ||
+ offset != MAXALIGN(offset))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted item pointer: offset = %u, length = %u",
+ offset, (unsigned int) itemlen)));
+
+ if (nextitm < nitems && offnum == itemnos[nextitm])
+ {
+ /* this one is on our list to delete, so mark it unused */
+ ItemIdSetUnused(lp);
+ nextitm++;
+ }
+ else if (ItemIdHasStorage(lp))
+ {
+ /* This one's live -- must do the compaction dance */
+ empty = false;
+ }
+ else
+ {
+ /* get rid of this one too */
+ ItemIdSetUnused(lp);
+ }
+ }
+ }
+
+ /* this will catch invalid or out-of-order itemnos[] */
+ if (nextitm != nitems)
+ elog(ERROR, "incorrect index offsets supplied");
+
+ if (empty)
+ {
+ /* Page is completely empty, so just reset it quickly */
+ phdr->pd_lower = SizeOfPageHeaderData;
+ phdr->pd_upper = pd_special;
+ }
+ else
+ {
+ /* There are live items: need to compact the page the hard way */
+ itemIdSortData itemidbase[MaxOffsetNumber];
+ itemIdSort itemidptr;
+ int i;
+ Size totallen;
+ Offset upper;
+
+ /*
+ * Scan the page taking note of each item that we need to preserve.
+ * This includes both live items (those that contain data) and
+ * interspersed unused ones. It's critical to preserve these unused
+ * items, because otherwise the offset numbers for later live items
+ * would change, which is not acceptable. Unused items might get used
+ * again later; that is fine.
+ */
+ itemidptr = itemidbase;
+ totallen = 0;
+ for (i = 0; i < nline; i++, itemidptr++)
+ {
+ ItemId lp;
+
+ itemidptr->offsetindex = i;
+
+ lp = PageGetItemId(page, i + 1);
+ if (ItemIdHasStorage(lp))
+ {
+ itemidptr->itemoff = ItemIdGetOffset(lp);
+ itemidptr->alignedlen = MAXALIGN(ItemIdGetLength(lp));
+ totallen += itemidptr->alignedlen;
+ }
+ else
+ {
+ itemidptr->itemoff = 0;
+ itemidptr->alignedlen = 0;
+ }
+ }
+ /* By here, there are exactly nline elements in itemidbase array */
+
+ if (totallen > (Size) (pd_special - pd_lower))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted item lengths: total %u, available space %u",
+ (unsigned int) totallen, pd_special - pd_lower)));
+
+ /* sort itemIdSortData array into decreasing itemoff order */
+ qsort((char *) itemidbase, nline, sizeof(itemIdSortData),
+ itemoffcompare);
+
+ /*
+ * Defragment the data areas of each tuple, being careful to preserve
+ * each item's position in the linp array.
+ */
+ upper = pd_special;
+ PageClearHasFreeLinePointers(page);
+ for (i = 0, itemidptr = itemidbase; i < nline; i++, itemidptr++)
+ {
+ ItemId lp;
+
+ lp = PageGetItemId(page, itemidptr->offsetindex + 1);
+ if (itemidptr->alignedlen == 0)
+ {
+ PageSetHasFreeLinePointers(page);
+ ItemIdSetUnused(lp);
+ continue;
+ }
+ upper -= itemidptr->alignedlen;
+ memmove((char *) page + upper,
+ (char *) page + itemidptr->itemoff,
+ itemidptr->alignedlen);
+ lp->lp_off = upper;
+ /* lp_flags and lp_len remain the same as originally */
+ }
+
+ /* Set the new page limits */
+ phdr->pd_upper = upper;
+ phdr->pd_lower = SizeOfPageHeaderData + i * sizeof(ItemIdData);
+ }
+ }
/*
* Set checksum for a page in shared buffers.
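
The reason minmax needs this new variant rather than PageIndexMultiDelete is
that the reverse range map stores plain TIDs pointing at index tuples, so the
offset numbers of surviving tuples must not change when other tuples on the
same page are removed. A usage sketch with made-up offsets:

	/* offsets must be in ascending order, as for PageIndexMultiDelete */
	OffsetNumber deletable[2];

	deletable[0] = 2;
	deletable[1] = 5;
	PageIndexDeleteNoCompact(page, deletable, 2);

Afterwards offsets 2 and 5 are marked unused (and may be reused later), while
the data of the remaining tuples has been defragmented without renumbering
them.
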
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
***************
*** 7349,7351 **** gincostestimate(PG_FUNCTION_ARGS)
--- 7349,7375 ----
PG_RETURN_VOID();
}
+
+ Datum
+ mmcostestimate(PG_FUNCTION_ARGS)
+ {
+ PlannerInfo *root = (PlannerInfo *) PG_GETARG_POINTER(0);
+ IndexPath *path = (IndexPath *) PG_GETARG_POINTER(1);
+ double loop_count = PG_GETARG_FLOAT8(2);
+ Cost *indexStartupCost = (Cost *) PG_GETARG_POINTER(3);
+ Cost *indexTotalCost = (Cost *) PG_GETARG_POINTER(4);
+ Selectivity *indexSelectivity = (Selectivity *) PG_GETARG_POINTER(5);
+ double *indexCorrelation = (double *) PG_GETARG_POINTER(6);
+ IndexOptInfo *index = path->indexinfo;
+
+ *indexStartupCost = (Cost) seq_page_cost * index->pages * loop_count;
+ *indexTotalCost = *indexStartupCost;
+
+ *indexSelectivity =
+ clauselist_selectivity(root, path->indexquals,
+ path->indexinfo->rel->relid,
+ JOIN_INNER, NULL);
+ *indexCorrelation = 1;
+
+ PG_RETURN_VOID();
+ }
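
For a sense of scale: with the default seq_page_cost of 1.0, a minmax index
occupying 8 pages in a 2-loop nestloop is costed at 1.0 * 8 * 2 = 16.0 for
both startup and total cost; selectivity comes from clauselist_selectivity()
over the index quals, and correlation is pinned at 1. The cost of visiting the
heap pages themselves is, as usual, charged by the bitmap heap scan node on
top.
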
*** a/src/include/access/heapam.h
--- b/src/include/access/heapam.h
***************
*** 112,117 **** extern HeapScanDesc heap_beginscan_strat(Relation relation, Snapshot snapshot,
--- 112,119 ----
bool allow_strat, bool allow_sync);
extern HeapScanDesc heap_beginscan_bm(Relation relation, Snapshot snapshot,
int nkeys, ScanKey key);
+ extern void heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk,
+ BlockNumber numBlks);
extern void heap_rescan(HeapScanDesc scan, ScanKey key);
extern void heap_endscan(HeapScanDesc scan);
extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
*** /dev/null
--- b/src/include/access/minmax.h
***************
*** 0 ****
--- 1,52 ----
+ /*
+ * AM-callable functions for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax.h
+ */
+ #ifndef MINMAX_H
+ #define MINMAX_H
+
+ #include "fmgr.h"
+ #include "nodes/execnodes.h"
+ #include "utils/relcache.h"
+
+
+ /*
+ * prototypes for functions in minmax.c (external entry points for minmax)
+ */
+ extern Datum mmbuild(PG_FUNCTION_ARGS);
+ extern Datum mmbuildempty(PG_FUNCTION_ARGS);
+ extern Datum mminsert(PG_FUNCTION_ARGS);
+ extern Datum mmbeginscan(PG_FUNCTION_ARGS);
+ extern Datum mmgettuple(PG_FUNCTION_ARGS);
+ extern Datum mmgetbitmap(PG_FUNCTION_ARGS);
+ extern Datum mmrescan(PG_FUNCTION_ARGS);
+ extern Datum mmendscan(PG_FUNCTION_ARGS);
+ extern Datum mmmarkpos(PG_FUNCTION_ARGS);
+ extern Datum mmrestrpos(PG_FUNCTION_ARGS);
+ extern Datum mmbulkdelete(PG_FUNCTION_ARGS);
+ extern Datum mmvacuumcleanup(PG_FUNCTION_ARGS);
+ extern Datum mmcanreturn(PG_FUNCTION_ARGS);
+ extern Datum mmcostestimate(PG_FUNCTION_ARGS);
+ extern Datum mmoptions(PG_FUNCTION_ARGS);
+
+ /*
+ * Storage type for MinMax indexes' reloptions
+ */
+ typedef struct MinmaxOptions
+ {
+ int32 vl_len_; /* varlena header (do not touch directly!) */
+ BlockNumber pagesPerRange;
+ } MinmaxOptions;
+
+ #define MINMAX_DEFAULT_PAGES_PER_RANGE 128
+ #define MinmaxGetPagesPerRange(relation) \
+ ((relation)->rd_options ? \
+ ((MinmaxOptions *) (relation)->rd_options)->pagesPerRange : \
+ MINMAX_DEFAULT_PAGES_PER_RANGE)
+
+ #endif /* MINMAX_H */
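
A small illustration of how the AM picks this setting up (the SQL-level
reloption name that mmoptions ultimately exposes is not shown in this excerpt,
so treat the spelling "pages_per_range" as an assumption):

	/* e.g. at index build time */
	BlockNumber pagesPerRange = MinmaxGetPagesPerRange(indexRelation);

With no reloption set this yields MINMAX_DEFAULT_PAGES_PER_RANGE, i.e. each
index tuple summarizes 128 heap pages, or 1 MB of heap with 8 kB blocks.
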
*** /dev/null
--- b/src/include/access/minmax_internal.h
***************
*** 0 ****
--- 1,104 ----
+ /*
+ * minmax_internal.h
+ * internal declarations for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_internal.h
+ */
+ #ifndef MINMAX_INTERNAL_H
+ #define MINMAX_INTERNAL_H
+
+ #include "fmgr.h"
+ #include "storage/buf.h"
+ #include "storage/bufpage.h"
+ #include "storage/off.h"
+ #include "utils/relcache.h"
+
+ /* returned by "GetOpers" amproc */
+ typedef struct MinmaxOpers
+ {
+ /* Number of values stored per indexed column by this opclass */
+ int nstored;
+
+ /* Operators that this opclass uses */
+ int nopers;
+ Oid opers[FLEXIBLE_ARRAY_MEMBER];
+ } MinmaxOpers;
+
+ /*
+ * A MinmaxDesc is a struct designed to enable decoding a MinMax tuple from the
+ * on-disk format to a DeformedMMTuple. We store all the necessary FmgrInfo
+ * structs for all columns in a single array; each column's MinmaxDescPerCol
+ * entry points at its own operators within that array.
+ *
+ * Note: we assume, for now, that the data stored for each column is the same
+ * datatype as the indexed heap column. This restriction can be lifted by
+ * having an Oid array pointer on the PerCol struct, where each member of the
+ * array indicates the typid of the stored data.
+ */
+ typedef struct MinmaxDescPerCol
+ {
+ uint16 numopers; /* number of operators we need */
+ uint16 numstored; /* number of stored columns */
+ #if 0
+ Oid typid; /* OID of indexed datatype */
+ uint32 typmod; /* typmod of indexed datatype */
+ #endif
+ Oid *operoids; /* array of operator OIDs (size numopers) */
+ FmgrInfo *opers; /* array of operators (same size) */
+ } MinmaxDescPerCol;
+
+ typedef struct MinmaxDesc
+ {
+ /* tuple descriptor of the index relation */
+ TupleDesc tupdesc;
+
+ /* cached copy for on-disk tuples; generated at first use */
+ TupleDesc disktdesc;
+
+ /* total number of Datum entries that are stored on-disk for all columns */
+ int totalstored;
+
+ /*
+ * The "getOpers" opclass-specific procedure returns an array of OIDs for
+ * the operators it uses internally. We copy the OIDs into our own array;
+ * we also have an array of FmgrInfos where we initialize the operators,
+ * so that the opclass can call them from there.
+ */
+ Oid *operoids;
+ FmgrInfo *opers;
+
+ /* per-column info */
+ MinmaxDescPerCol perCol[FLEXIBLE_ARRAY_MEMBER]; /* tupdesc->natts entries long */
+ } MinmaxDesc;
+
+ extern void mm_metapage_init(Page page, BlockNumber pagesPerRange,
+ uint16 version);
+ extern Buffer mm_getnewbuffer(Relation irel);
+ extern void rm_page_set_iptr(Page page, BlockNumber pagesPerRange,
+ BlockNumber heapBlk, BlockNumber blkno, OffsetNumber offno);
+ extern BlockNumber initialize_rmr_page(Buffer newbuf, BlockNumber mapBlk);
+ extern void initialize_rma_page(Buffer buf);
+
+ extern FmgrInfo *minmax_get_operfn(MinmaxDesc *mmdesc, AttrNumber attno,
+ uint16 operno);
+
+ /* Procedure strategy numbers */
+ #define MINMAX_PROCNUM_GETOPERS 1
+ #define MINMAX_PROCNUM_MAYBEUPDATE 2
+ #define MINMAX_PROCNUM_COMPARE 3
+
+ #define MINMAX_DEBUG
+
+ /* we allow debug if using GCC; otherwise don't bother */
+ #if defined(MINMAX_DEBUG) && defined(__GNUC__)
+ #define MINMAX_elog(level, ...) elog(level, __VA_ARGS__)
+ #else
+ #define MINMAX_elog(...) ((void) 0)
+ #endif
+
+
+ #endif /* MINMAX_INTERNAL_H */
*** /dev/null
--- b/src/include/access/minmax_page.h
***************
*** 0 ****
--- 1,88 ----
+ /*
+ * prototypes and definitions for minmax page layouts
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_page.h
+ *
+ * NOTES
+ *
+ * These structs should really be private to specific minmax files, but it's
+ * useful to have them here so that they can be used by pageinspect and similar
+ * tools.
+ */
+ #ifndef MINMAX_PAGE_H
+ #define MINMAX_PAGE_H
+
+
+ /* special space on all minmax pages stores a "type" identifier */
+ #define MINMAX_PAGETYPE_META 0xF091
+ #define MINMAX_PAGETYPE_REVMAP_ARRAY 0xF092
+ #define MINMAX_PAGETYPE_REVMAP 0xF093
+ #define MINMAX_PAGETYPE_REGULAR 0xF094
+
+ typedef struct MinmaxSpecialSpace
+ {
+ uint16 type;
+ } MinmaxSpecialSpace;
+
+ /* Metapage definitions */
+ typedef struct MinmaxMetaPageData
+ {
+ uint32 minmaxVersion;
+ BlockNumber pagesPerRange;
+ BlockNumber revmapArrayPages[1]; /* actually MAX_REVMAP_ARRAYPAGES */
+ } MinmaxMetaPageData;
+
+ /*
+ * Number of revmap array pages listed in the metapage. The computation leaves
+ * room for the page header, the rest of the metapage struct, and the minmax
+ * special space.
+ */
+ #define MAX_REVMAP_ARRAYPAGES \
+ ((BLCKSZ - \
+ MAXALIGN(SizeOfPageHeaderData) - \
+ offsetof(MinmaxMetaPageData, revmapArrayPages) - \
+ MAXALIGN(sizeof(MinmaxSpecialSpace)) ) / \
+ sizeof(BlockNumber))
+
+ #define MINMAX_CURRENT_VERSION 1
+
+ #define MINMAX_METAPAGE_BLKNO 0
+
+ /* Definitions for regular revmap pages */
+ typedef struct RevmapContents
+ {
+ int32 rmr_logblk; /* logical blkno of this revmap page */
+ ItemPointerData rmr_tids[1]; /* really REGULAR_REVMAP_PAGE_MAXITEMS */
+ } RevmapContents;
+
+ #define REGULAR_REVMAP_CONTENT_SIZE \
+ (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - \
+ offsetof(RevmapContents, rmr_tids) - \
+ MAXALIGN(sizeof(MinmaxSpecialSpace)))
+ /* max num of items in the array */
+ #define REGULAR_REVMAP_PAGE_MAXITEMS \
+ (REGULAR_REVMAP_CONTENT_SIZE / sizeof(ItemPointerData))
+
+ /* Definitions for array revmap pages */
+ typedef struct RevmapArrayContents
+ {
+ int32 rma_nblocks;
+ BlockNumber rma_blocks[1]; /* really ARRAY_REVMAP_PAGE_MAXITEMS */
+ } RevmapArrayContents;
+
+ #define REVMAP_ARRAY_CONTENT_SIZE \
+ (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - \
+ offsetof(RevmapArrayContents, rma_blocks) - \
+ MAXALIGN(sizeof(MinmaxSpecialSpace)))
+ /* max num of items in the array */
+ #define ARRAY_REVMAP_PAGE_MAXITEMS \
+ (REVMAP_ARRAY_CONTENT_SIZE / sizeof(BlockNumber))
+
+
+ extern void mm_page_init(Page page, uint16 type);
+
+ #endif /* MINMAX_PAGE_H */
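
Some back-of-the-envelope numbers, assuming the default 8 kB BLCKSZ and 8-byte
MAXALIGN: REGULAR_REVMAP_CONTENT_SIZE comes out to 8192 - 24 - 4 - 8 = 8156
bytes, so a regular revmap page holds 8156 / 6 = 1359 item pointers; with the
default 128 pages per range, one revmap page therefore covers about 174k heap
pages, i.e. roughly 1.3 GB of heap. The metapage has room for about 2038
revmap array page numbers (MAX_REVMAP_ARRAYPAGES), and each array page holds
about 2039 revmap block numbers, so the addressing scheme only becomes a limit
for extremely large tables.
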
*** /dev/null
--- b/src/include/access/minmax_revmap.h
***************
*** 0 ****
--- 1,34 ----
+ /*
+ * prototypes for minmax reverse range maps
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_revmap.h
+ */
+
+ #ifndef MINMAX_REVMAP_H
+ #define MINMAX_REVMAP_H
+
+ #include "storage/block.h"
+ #include "storage/itemptr.h"
+ #include "storage/off.h"
+ #include "utils/relcache.h"
+
+ /* struct definition lives in mmrevmap.c */
+ typedef struct mmRevmapAccess mmRevmapAccess;
+
+ extern mmRevmapAccess *mmRevmapAccessInit(Relation idxrel, BlockNumber *pagesPerRange);
+ extern void mmRevmapAccessTerminate(mmRevmapAccess *rmAccess);
+
+ extern void mmRevmapCreate(Relation idxrel);
+ extern void mmSetHeapBlockItemptr(mmRevmapAccess *rmAccess, BlockNumber blk,
+ BlockNumber blkno, OffsetNumber offno);
+ extern void mmGetHeapBlockItemptr(mmRevmapAccess *rmAccess, BlockNumber blk,
+ ItemPointerData *iptr);
+ extern void mmRevmapTruncate(mmRevmapAccess *rmAccess,
+ BlockNumber heapNumBlocks);
+
+
+ #endif /* MINMAX_REVMAP_H */
*** /dev/null
--- b/src/include/access/minmax_tuple.h
***************
*** 0 ****
--- 1,84 ----
+ /*
+ * Declarations for dealing with MinMax-specific tuples.
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_tuple.h
+ */
+ #ifndef MINMAX_TUPLE_H
+ #define MINMAX_TUPLE_H
+
+ #include "access/minmax_internal.h"
+ #include "access/tupdesc.h"
+
+
+ /*
+ * A minmax index stores one index tuple per page range. Each index tuple
+ * has one MMValues struct for each indexed column; in turn, each MMValues
+ * has (besides the null flags) an array of Datum whose size is determined by
+ * the opclass.
+ */
+ typedef struct MMValues
+ {
+ bool hasnulls; /* are there any nulls in the page range? */
+ bool allnulls; /* are all values null in the page range? */
+ Datum *values; /* current accumulated values */
+ } MMValues;
+
+ /*
+ * This struct represents one index tuple, comprising the minimum and maximum
+ * values for all indexed columns, within one page range. These values can
+ * only be meaningfully decoded with an appropriate MinmaxDesc.
+ */
+ typedef struct DeformedMMTuple
+ {
+ bool unused;
+ MMValues perCol[FLEXIBLE_ARRAY_MEMBER];
+ } DeformedMMTuple;
+
+ /*
+ * An on-disk minmax tuple. This is possibly followed by a nulls bitmask,
+ * which has two bits per indexed column (the allnulls and hasnulls bits); an
+ * opclass-defined number of Datum values for each column follows.
+ */
+ typedef struct MMTuple
+ {
+ /* ---------------
+ * mt_info is laid out in the following fashion:
+ *
+ * 7th (high) bit: has nulls
+ * 6th bit: unused
+ * 5th bit: unused
+ * 4-0 bit: offset of data
+ * ---------------
+ */
+ uint8 mt_info;
+ } MMTuple;
+
+ #define SizeOfMinMaxTuple (offsetof(MMTuple, mt_info) + sizeof(uint8))
+
+ /*
+ * t_info manipulation macros
+ */
+ #define MMIDX_OFFSET_MASK 0x1F
+ /* bit 0x20 is not used at present */
+ /* bit 0x40 is not used at present */
+ #define MMIDX_NULLS_MASK 0x80
+
+ #define MMTupleDataOffset(mmtup) ((Size) (((MMTuple *) (mmtup))->mt_info & MMIDX_OFFSET_MASK))
+ #define MMTupleHasNulls(mmtup) (((((MMTuple *) (mmtup))->mt_info & MMIDX_NULLS_MASK)) != 0)
+
+
+ extern MMTuple *minmax_form_tuple(MinmaxDesc *mmdesc,
+ DeformedMMTuple *tuple, Size *size);
+ extern void minmax_free_tuple(MMTuple *tuple);
+
+ extern DeformedMMTuple *minmax_new_dtuple(MinmaxDesc *mmdesc);
+ extern void minmax_dtuple_initialize(DeformedMMTuple *dtuple,
+ MinmaxDesc *mmdesc);
+ extern DeformedMMTuple *minmax_deform_tuple(MinmaxDesc *mmdesc,
+ MMTuple *tuple);
+
+ #endif /* MINMAX_TUPLE_H */
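
A concrete example of the mt_info encoding: a tuple whose data area starts at
offset 8 (the MAXALIGN'ed header plus null bitmap) and which carries a null
bitmap has

	MMTuple		tup;

	tup.mt_info = MMIDX_NULLS_MASK | 0x08;		/* 0x88 */

	MMTupleDataOffset(&tup);	/* 8 */
	MMTupleHasNulls(&tup);		/* true */

Since the offset field is only 5 bits, the header-plus-bitmap area can be at
most 31 bytes, which is still comfortably enough for a bitmap covering
INDEX_MAX_KEYS (32) columns at two bits each.
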
*** /dev/null
--- b/src/include/access/minmax_xlog.h
***************
*** 0 ****
--- 1,134 ----
+ /*-------------------------------------------------------------------------
+ *
+ * minmax_xlog.h
+ * POSTGRES MinMax access XLOG definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/minmax_xlog.h
+ *
+ *-------------------------------------------------------------------------
+ */
+ #ifndef MINMAX_XLOG_H
+ #define MINMAX_XLOG_H
+
+ #include "access/xlog.h"
+ #include "storage/bufpage.h"
+ #include "storage/itemptr.h"
+ #include "storage/relfilenode.h"
+ #include "utils/relcache.h"
+
+
+ /*
+ * WAL record definitions for minmax's WAL operations
+ *
+ * XLOG allows us to store some information in the high 4 bits of the log
+ * record's xl_info field.
+ */
+ #define XLOG_MINMAX_CREATE_INDEX 0x00
+ #define XLOG_MINMAX_INSERT 0x10
+ #define XLOG_MINMAX_BULKREMOVE 0x20
+ #define XLOG_MINMAX_REVMAP_SET 0x30
+ #define XLOG_MINMAX_METAPG_SET 0x40
+ #define XLOG_MINMAX_RMARRAY_SET 0x50
+ #define XLOG_MINMAX_INIT_RMPG 0x60
+
+ #define XLOG_MINMAX_OPMASK 0x70
+ /*
+ * When we insert the first item on a new page, we restore the entire page in
+ * redo.
+ */
+ #define XLOG_MINMAX_INIT_PAGE 0x80
+
+ /* This is what we need to know about a minmax index create */
+ typedef struct xl_minmax_createidx
+ {
+ BlockNumber pagesPerRange;
+ RelFileNode node;
+ uint16 version;
+ } xl_minmax_createidx;
+ #define SizeOfMinmaxCreateIdx (offsetof(xl_minmax_createidx, version) + sizeof(uint16))
+
+ /* All that we need to find a minmax tuple */
+ typedef struct xl_minmax_tid
+ {
+ RelFileNode node;
+ ItemPointerData tid;
+ } xl_minmax_tid;
+
+ #define SizeOfMinmaxTid (offsetof(xl_minmax_tid, tid) + SizeOfIptrData)
+
+ /* This is what we need to know about a minmax tuple insert */
+ typedef struct xl_minmax_insert
+ {
+ xl_minmax_tid target;
+ bool overwrite;
+ /* tuple data follows at end of struct */
+ } xl_minmax_insert;
+
+ #define SizeOfMinmaxInsert (offsetof(xl_minmax_insert, overwrite) + sizeof(bool))
+
+ /* This is what we need to know about a bulk minmax tuple remove */
+ typedef struct xl_minmax_bulkremove
+ {
+ RelFileNode node;
+ BlockNumber block;
+ /* offset number array follows at end of struct */
+ } xl_minmax_bulkremove;
+
+ #define SizeOfMinmaxBulkRemove (offsetof(xl_minmax_bulkremove, block) + sizeof(BlockNumber))
+
+ /* This is what we need to know about a revmap "set heap ptr" */
+ typedef struct xl_minmax_rm_set
+ {
+ RelFileNode node;
+ BlockNumber mapBlock;
+ int pagesPerRange;
+ BlockNumber heapBlock;
+ ItemPointerData newval;
+ } xl_minmax_rm_set;
+
+ #define SizeOfMinmaxRevmapSet (offsetof(xl_minmax_rm_set, newval) + SizeOfIptrData)
+
+ /* This is what we need to know about a "metapage set" operation */
+ typedef struct xl_minmax_metapg_set
+ {
+ RelFileNode node;
+ uint32 blkidx;
+ BlockNumber newpg;
+ } xl_minmax_metapg_set;
+
+ #define SizeOfMinmaxMetapgSet (offsetof(xl_minmax_metapg_set, newpg) + \
+ sizeof(BlockNumber))
+
+ /* This is what we need to know about a "revmap array set" operation */
+ typedef struct xl_minmax_rmarray_set
+ {
+ RelFileNode node;
+ BlockNumber rmarray;
+ uint32 blkidx;
+ BlockNumber newpg;
+ } xl_minmax_rmarray_set;
+
+ #define SizeOfMinmaxRmarraySet (offsetof(xl_minmax_rmarray_set, newpg) + \
+ sizeof(BlockNumber))
+
+ /* This is what we need to know when we initialize a new revmap page */
+ typedef struct xl_minmax_init_rmpg
+ {
+ RelFileNode node;
+ bool array; /* array revmap page or regular revmap page */
+ BlockNumber blkno;
+ BlockNumber logblk; /* only used by regular revmap pages */
+ } xl_minmax_init_rmpg;
+
+ #define SizeOfMinmaxInitRmpg (offsetof(xl_minmax_init_rmpg, blkno) + \
+ sizeof(BlockNumber))
+
+
+ extern void minmax_desc(StringInfo buf, XLogRecord *record);
+ extern void minmax_redo(XLogRecPtr lsn, XLogRecord *record);
+
+ #endif /* MINMAX_XLOG_H */
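
As a reading aid (this mirrors what minmax_redo() expects; the rdata assembly
on the insert side is elided here): the operation code lives under
XLOG_MINMAX_OPMASK, and XLOG_MINMAX_INIT_PAGE rides along in the top bit of
the same byte, e.g.

	XLogRecData	rdata[2];	/* filling these in is elided */
	XLogRecPtr	recptr;
	uint8		info;

	info = XLOG_MINMAX_INSERT | XLOG_MINMAX_INIT_PAGE;
	recptr = XLogInsert(RM_MINMAX_ID, info, rdata);

On replay, the opcode is recovered with (xl_info & XLOG_MINMAX_OPMASK), while
the INIT_PAGE bit tells the redo routine to initialize the page instead of
reading it.
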
*** a/src/include/access/reloptions.h
--- b/src/include/access/reloptions.h
***************
*** 45,52 **** typedef enum relopt_kind
RELOPT_KIND_TABLESPACE = (1 << 7),
RELOPT_KIND_SPGIST = (1 << 8),
RELOPT_KIND_VIEW = (1 << 9),
/* if you add a new kind, make sure you update "last_default" too */
! RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_VIEW,
/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
RELOPT_KIND_MAX = (1 << 30)
} relopt_kind;
--- 45,53 ----
RELOPT_KIND_TABLESPACE = (1 << 7),
RELOPT_KIND_SPGIST = (1 << 8),
RELOPT_KIND_VIEW = (1 << 9),
+ RELOPT_KIND_MINMAX = (1 << 10),
/* if you add a new kind, make sure you update "last_default" too */
! RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_MINMAX,
/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
RELOPT_KIND_MAX = (1 << 30)
} relopt_kind;
*** a/src/include/access/relscan.h
--- b/src/include/access/relscan.h
***************
*** 35,42 **** typedef struct HeapScanDescData
bool rs_temp_snap; /* unregister snapshot at scan end? */
/* state set up at initscan time */
! BlockNumber rs_nblocks; /* number of blocks to scan */
BlockNumber rs_startblock; /* block # to start at */
BufferAccessStrategy rs_strategy; /* access strategy for reads */
bool rs_syncscan; /* report location to syncscan logic? */
--- 35,44 ----
bool rs_temp_snap; /* unregister snapshot at scan end? */
/* state set up at initscan time */
! BlockNumber rs_nblocks; /* total number of blocks in rel */
BlockNumber rs_startblock; /* block # to start at */
+ BlockNumber rs_initblock; /* block # to treat as start of relation */
+ BlockNumber rs_numblocks; /* number of blocks to scan */
BufferAccessStrategy rs_strategy; /* access strategy for reads */
bool rs_syncscan; /* report location to syncscan logic? */
*** a/src/include/access/rmgrlist.h
--- b/src/include/access/rmgrlist.h
***************
*** 42,44 **** PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_xlog_startup, gin_xlog_cleanup
--- 42,45 ----
PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_xlog_startup, gist_xlog_cleanup)
PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, NULL, NULL)
PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_xlog_startup, spg_xlog_cleanup)
+ PG_RMGR(RM_MINMAX_ID, "MinMax", minmax_redo, minmax_desc, NULL, NULL)
*** a/src/include/catalog/index.h
--- b/src/include/catalog/index.h
***************
*** 97,102 **** extern double IndexBuildHeapScan(Relation heapRelation,
--- 97,110 ----
bool allow_sync,
IndexBuildCallback callback,
void *callback_state);
+ extern double IndexBuildHeapRangeScan(Relation heapRelation,
+ Relation indexRelation,
+ IndexInfo *indexInfo,
+ bool allow_sync,
+ BlockNumber start_blockno,
+ BlockNumber numblocks,
+ IndexBuildCallback callback,
+ void *callback_state);
extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
*** a/src/include/catalog/pg_am.h
--- b/src/include/catalog/pg_am.h
***************
*** 132,136 **** DESCR("GIN index access method");
--- 132,138 ----
DATA(insert OID = 4000 ( spgist 0 5 f f f f f t f t f f f 0 spginsert spgbeginscan spggettuple spggetbitmap spgrescan spgendscan spgmarkpos spgrestrpos spgbuild spgbuildempty spgbulkdelete spgvacuumcleanup spgcanreturn spgcostestimate spgoptions ));
DESCR("SP-GiST index access method");
#define SPGIST_AM_OID 4000
+ DATA(insert OID = 3580 ( minmax 5 3 f f f f t t f t t f f 0 mminsert mmbeginscan - mmgetbitmap mmrescan mmendscan mmmarkpos mmrestrpos mmbuild mmbuildempty mmbulkdelete mmvacuumcleanup - mmcostestimate mmoptions ));
+ #define MINMAX_AM_OID 3580
#endif /* PG_AM_H */
*** a/src/include/catalog/pg_amop.h
--- b/src/include/catalog/pg_amop.h
***************
*** 845,848 **** DATA(insert ( 3550 869 869 25 s 932 783 0 ));
--- 845,929 ----
DATA(insert ( 3550 869 869 26 s 933 783 0 ));
DATA(insert ( 3550 869 869 27 s 934 783 0 ));
+ /*
+ * MinMax int4_ops
+ */
+ DATA(insert ( 4054 23 23 1 s 97 3580 0 ));
+ DATA(insert ( 4054 23 23 2 s 523 3580 0 ));
+ DATA(insert ( 4054 23 23 3 s 96 3580 0 ));
+ DATA(insert ( 4054 23 23 4 s 525 3580 0 ));
+ DATA(insert ( 4054 23 23 5 s 521 3580 0 ));
+
+ /*
+ * MinMax numeric_ops
+ */
+ DATA(insert ( 4055 1700 1700 1 s 1754 3580 0 ));
+ DATA(insert ( 4055 1700 1700 2 s 1755 3580 0 ));
+ DATA(insert ( 4055 1700 1700 3 s 1752 3580 0 ));
+ DATA(insert ( 4055 1700 1700 4 s 1757 3580 0 ));
+ DATA(insert ( 4055 1700 1700 5 s 1756 3580 0 ));
+
+ /*
+ * MinMax text_ops
+ */
+ DATA(insert ( 4056 25 25 1 s 664 3580 0 ));
+ DATA(insert ( 4056 25 25 2 s 665 3580 0 ));
+ DATA(insert ( 4056 25 25 3 s 98 3580 0 ));
+ DATA(insert ( 4056 25 25 4 s 667 3580 0 ));
+ DATA(insert ( 4056 25 25 5 s 666 3580 0 ));
+
+ /*
+ * MinMax time_ops
+ */
+ DATA(insert ( 4057 1083 1083 1 s 1110 3580 0 ));
+ DATA(insert ( 4057 1083 1083 2 s 1111 3580 0 ));
+ DATA(insert ( 4057 1083 1083 3 s 1108 3580 0 ));
+ DATA(insert ( 4057 1083 1083 4 s 1113 3580 0 ));
+ DATA(insert ( 4057 1083 1083 5 s 1112 3580 0 ));
+
+ /*
+ * MinMax timetz_ops
+ */
+ DATA(insert ( 4058 1266 1266 1 s 1552 3580 0 ));
+ DATA(insert ( 4058 1266 1266 2 s 1553 3580 0 ));
+ DATA(insert ( 4058 1266 1266 3 s 1550 3580 0 ));
+ DATA(insert ( 4058 1266 1266 4 s 1555 3580 0 ));
+ DATA(insert ( 4058 1266 1266 5 s 1554 3580 0 ));
+
+ /*
+ * MinMax timestamp_ops
+ */
+ DATA(insert ( 4059 1114 1114 1 s 2062 3580 0 ));
+ DATA(insert ( 4059 1114 1114 2 s 2063 3580 0 ));
+ DATA(insert ( 4059 1114 1114 3 s 2060 3580 0 ));
+ DATA(insert ( 4059 1114 1114 4 s 2065 3580 0 ));
+ DATA(insert ( 4059 1114 1114 5 s 2064 3580 0 ));
+
+ /*
+ * MinMax timestamptz_ops
+ */
+ DATA(insert ( 4060 1184 1184 1 s 1322 3580 0 ));
+ DATA(insert ( 4060 1184 1184 2 s 1323 3580 0 ));
+ DATA(insert ( 4060 1184 1184 3 s 1320 3580 0 ));
+ DATA(insert ( 4060 1184 1184 4 s 1325 3580 0 ));
+ DATA(insert ( 4060 1184 1184 5 s 1324 3580 0 ));
+
+ /*
+ * MinMax date_ops
+ */
+ DATA(insert ( 4061 1082 1082 1 s 1095 3580 0 ));
+ DATA(insert ( 4061 1082 1082 2 s 1096 3580 0 ));
+ DATA(insert ( 4061 1082 1082 3 s 1093 3580 0 ));
+ DATA(insert ( 4061 1082 1082 4 s 1098 3580 0 ));
+ DATA(insert ( 4061 1082 1082 5 s 1097 3580 0 ));
+
+ /*
+ * MinMax char_ops
+ */
+ DATA(insert ( 4062 18 18 1 s 631 3580 0 ));
+ DATA(insert ( 4062 18 18 2 s 632 3580 0 ));
+ DATA(insert ( 4062 18 18 3 s 92 3580 0 ));
+ DATA(insert ( 4062 18 18 4 s 634 3580 0 ));
+ DATA(insert ( 4062 18 18 5 s 633 3580 0 ));
+
#endif /* PG_AMOP_H */
*** a/src/include/catalog/pg_amproc.h
--- b/src/include/catalog/pg_amproc.h
***************
*** 431,434 **** DATA(insert ( 4017 25 25 3 4029 ));
--- 431,472 ----
DATA(insert ( 4017 25 25 4 4030 ));
DATA(insert ( 4017 25 25 5 4031 ));
+ /* minmax */
+ DATA(insert ( 4054 23 23 1 3383 ));
+ DATA(insert ( 4054 23 23 2 3384 ));
+ DATA(insert ( 4054 23 23 3 3385 ));
+
+ DATA(insert ( 4055 1700 1700 1 3383 ));
+ DATA(insert ( 4055 1700 1700 2 3384 ));
+ DATA(insert ( 4055 1700 1700 3 3385 ));
+
+ DATA(insert ( 4056 25 25 1 3383 ));
+ DATA(insert ( 4056 25 25 2 3384 ));
+ DATA(insert ( 4056 25 25 3 3385 ));
+
+ DATA(insert ( 4057 1083 1083 1 3383 ));
+ DATA(insert ( 4057 1083 1083 2 3384 ));
+ DATA(insert ( 4057 1083 1083 3 3385 ));
+
+ DATA(insert ( 4058 1266 1266 1 3383 ));
+ DATA(insert ( 4058 1266 1266 2 3384 ));
+ DATA(insert ( 4058 1266 1266 3 3385 ));
+
+ DATA(insert ( 4059 1114 1114 1 3383 ));
+ DATA(insert ( 4059 1114 1114 2 3384 ));
+ DATA(insert ( 4059 1114 1114 3 3385 ));
+
+ DATA(insert ( 4060 1184 1184 1 3383 ));
+ DATA(insert ( 4060 1184 1184 2 3384 ));
+ DATA(insert ( 4060 1184 1184 3 3385 ));
+
+ DATA(insert ( 4061 1082 1082 1 3383 ));
+ DATA(insert ( 4061 1082 1082 2 3384 ));
+ DATA(insert ( 4061 1082 1082 3 3385 ));
+
+ DATA(insert ( 4062 18 18 1 3383 ));
+ DATA(insert ( 4062 18 18 2 3384 ));
+ DATA(insert ( 4062 18 18 3 3385 ));
+
+
#endif /* PG_AMPROC_H */
*** a/src/include/catalog/pg_opclass.h
--- b/src/include/catalog/pg_opclass.h
***************
*** 235,239 **** DATA(insert ( 403 jsonb_ops PGNSP PGUID 4033 3802 t 0 ));
--- 235,248 ----
DATA(insert ( 405 jsonb_ops PGNSP PGUID 4034 3802 t 0 ));
DATA(insert ( 2742 jsonb_ops PGNSP PGUID 4036 3802 t 25 ));
DATA(insert ( 2742 jsonb_path_ops PGNSP PGUID 4037 3802 f 23 ));
+ DATA(insert ( 3580 int4_ops PGNSP PGUID 4054 23 t 0 ));
+ DATA(insert ( 3580 numeric_ops PGNSP PGUID 4055 1700 t 0 ));
+ DATA(insert ( 3580 text_ops PGNSP PGUID 4056 25 t 0 ));
+ DATA(insert ( 3580 time_ops PGNSP PGUID 4057 1083 t 0 ));
+ DATA(insert ( 3580 timetz_ops PGNSP PGUID 4058 1266 t 0 ));
+ DATA(insert ( 3580 timestamp_ops PGNSP PGUID 4059 1114 t 0 ));
+ DATA(insert ( 3580 timestamptz_ops PGNSP PGUID 4060 1184 t 0 ));
+ DATA(insert ( 3580 date_ops PGNSP PGUID 4061 1082 t 0 ));
+ DATA(insert ( 3580 char_ops PGNSP PGUID 4062 18 t 0 ));
#endif /* PG_OPCLASS_H */
*** a/src/include/catalog/pg_opfamily.h
--- b/src/include/catalog/pg_opfamily.h
***************
*** 157,160 **** DATA(insert OID = 4035 ( 783 jsonb_ops PGNSP PGUID ));
--- 157,170 ----
DATA(insert OID = 4036 ( 2742 jsonb_ops PGNSP PGUID ));
DATA(insert OID = 4037 ( 2742 jsonb_path_ops PGNSP PGUID ));
+ DATA(insert OID = 4054 ( 3580 int4_ops PGNSP PGUID ));
+ DATA(insert OID = 4055 ( 3580 numeric_ops PGNSP PGUID ));
+ DATA(insert OID = 4056 ( 3580 text_ops PGNSP PGUID ));
+ DATA(insert OID = 4057 ( 3580 time_ops PGNSP PGUID ));
+ DATA(insert OID = 4058 ( 3580 timetz_ops PGNSP PGUID ));
+ DATA(insert OID = 4059 ( 3580 timestamp_ops PGNSP PGUID ));
+ DATA(insert OID = 4060 ( 3580 timestamptz_ops PGNSP PGUID ));
+ DATA(insert OID = 4061 ( 3580 date_ops PGNSP PGUID ));
+ DATA(insert OID = 4062 ( 3580 char_ops PGNSP PGUID ));
+
#endif /* PG_OPFAMILY_H */
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
***************
*** 565,570 **** DESCR("btree(internal)");
--- 565,598 ----
DATA(insert OID = 2785 ( btoptions PGNSP PGUID 12 1 0 0 0 f f f f t f s 2 0 17 "1009 16" _null_ _null_ _null_ _null_ btoptions _null_ _null_ _null_ ));
DESCR("btree(internal)");
+ DATA(insert OID = 3789 ( mmgetbitmap PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 20 "2281 2281" _null_ _null_ _null_ _null_ mmgetbitmap _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3790 ( mminsert PGNSP PGUID 12 1 0 0 0 f f f f t f v 6 0 16 "2281 2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mminsert _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3791 ( mmbeginscan PGNSP PGUID 12 1 0 0 0 f f f f t f v 3 0 2281 "2281 2281 2281" _null_ _null_ _null_ _null_ mmbeginscan _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3792 ( mmrescan PGNSP PGUID 12 1 0 0 0 f f f f t f v 5 0 2278 "2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmrescan _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3793 ( mmendscan PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmendscan _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3794 ( mmmarkpos PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmmarkpos _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3795 ( mmrestrpos PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmrestrpos _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3796 ( mmbuild PGNSP PGUID 12 1 0 0 0 f f f f t f v 3 0 2281 "2281 2281 2281" _null_ _null_ _null_ _null_ mmbuild _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3797 ( mmbuildempty PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmbuildempty _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3798 ( mmbulkdelete PGNSP PGUID 12 1 0 0 0 f f f f t f v 4 0 2281 "2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmbulkdelete _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3799 ( mmvacuumcleanup PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ mmvacuumcleanup _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3800 ( mmcostestimate PGNSP PGUID 12 1 0 0 0 f f f f t f v 7 0 2278 "2281 2281 2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmcostestimate _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3801 ( mmoptions PGNSP PGUID 12 1 0 0 0 f f f f t f s 2 0 17 "1009 16" _null_ _null_ _null_ _null_ mmoptions _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+
+
DATA(insert OID = 339 ( poly_same PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "604 604" _null_ _null_ _null_ _null_ poly_same _null_ _null_ _null_ ));
DATA(insert OID = 340 ( poly_contain PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "604 604" _null_ _null_ _null_ _null_ poly_contain _null_ _null_ _null_ ));
DATA(insert OID = 341 ( poly_left PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "604 604" _null_ _null_ _null_ _null_ poly_left _null_ _null_ _null_ ));
***************
*** 4064,4069 **** DATA(insert OID = 2747 ( arrayoverlap PGNSP PGUID 12 1 0 0 0 f f f f t f i
--- 4092,4105 ----
DATA(insert OID = 2748 ( arraycontains PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "2277 2277" _null_ _null_ _null_ _null_ arraycontains _null_ _null_ _null_ ));
DATA(insert OID = 2749 ( arraycontained PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "2277 2277" _null_ _null_ _null_ _null_ arraycontained _null_ _null_ _null_ ));
+ /* Minmax */
+ DATA(insert OID = 3383 ( minmax_sortable_getopers PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ mmSortableGetOpers _null_ _null_ _null_ ));
+ DESCR("MinMax sortable datatype support");
+ DATA(insert OID = 3384 ( minmax_sortable_update_values PGNSP PGUID 12 1 0 0 0 f f f f t f i 5 0 16 "2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmSortableMaybeUpdateValues _null_ _null_ _null_ ));
+ DESCR("MinMax sortable datatype support");
+ DATA(insert OID = 3385 ( minmax_sortable_compare PGNSP PGUID 12 1 0 0 0 f f f f t f i 5 0 16 "2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmSortableCompare _null_ _null_ _null_ ));
+ DESCR("MinMax sortable datatype support");
+
/* userlock replacements */
DATA(insert OID = 2880 ( pg_advisory_lock PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "20" _null_ _null_ _null_ _null_ pg_advisory_lock_int8 _null_ _null_ _null_ ));
DESCR("obtain exclusive advisory lock");
*** a/src/include/storage/bufpage.h
--- b/src/include/storage/bufpage.h
***************
*** 393,398 **** extern void PageInit(Page page, Size pageSize, Size specialSize);
--- 393,400 ----
extern bool PageIsVerified(Page page, BlockNumber blkno);
extern OffsetNumber PageAddItem(Page page, Item item, Size size,
OffsetNumber offsetNumber, bool overwrite, bool is_heap);
+ extern void PageOverwriteItemData(Page page, OffsetNumber offset, Item item,
+ Size size);
extern Page PageGetTempPage(Page page);
extern Page PageGetTempPageCopy(Page page);
extern Page PageGetTempPageCopySpecial(Page page);
***************
*** 403,408 **** extern Size PageGetExactFreeSpace(Page page);
--- 405,412 ----
extern Size PageGetHeapFreeSpace(Page page);
extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
+ extern void PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos,
+ int nitems);
extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
*** a/src/include/utils/selfuncs.h
--- b/src/include/utils/selfuncs.h
***************
*** 195,200 **** extern Datum hashcostestimate(PG_FUNCTION_ARGS);
--- 195,201 ----
extern Datum gistcostestimate(PG_FUNCTION_ARGS);
extern Datum spgcostestimate(PG_FUNCTION_ARGS);
extern Datum gincostestimate(PG_FUNCTION_ARGS);
+ extern Datum mmcostestimate(PG_FUNCTION_ARGS);
/* Functions in array_selfuncs.c */
*** a/src/test/regress/expected/opr_sanity.out
--- b/src/test/regress/expected/opr_sanity.out
***************
*** 1591,1596 **** ORDER BY 1, 2, 3;
--- 1591,1601 ----
2742 | 9 | ?
2742 | 10 | ?|
2742 | 11 | ?&
+ 3580 | 1 | <
+ 3580 | 2 | <=
+ 3580 | 3 | =
+ 3580 | 4 | >=
+ 3580 | 5 | >
4000 | 1 | <<
4000 | 1 | ~<~
4000 | 2 | &<
***************
*** 1613,1619 **** ORDER BY 1, 2, 3;
4000 | 15 | >
4000 | 16 | @>
4000 | 18 | =
! (80 rows)
-- Check that all opclass search operators have selectivity estimators.
-- This is not absolutely required, but it seems a reasonable thing
--- 1618,1624 ----
4000 | 15 | >
4000 | 16 | @>
4000 | 18 | =
! (85 rows)
-- Check that all opclass search operators have selectivity estimators.
-- This is not absolutely required, but it seems a reasonable thing
***************
*** 1775,1785 **** WHERE NOT (
-- GIN has six support functions. 1-3 are mandatory, 5 is optional, and
-- at least one of 4 and 6 must be given.
-- SP-GiST has five support functions, all mandatory
amname = 'btree' AND procnums @> '{1}' OR
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}'
);
amname | opfname | amproclefttype | amprocrighttype | procnums
--------+---------+----------------+-----------------+----------
--- 1780,1792 ----
-- GIN has six support functions. 1-3 are mandatory, 5 is optional, and
-- at least one of 4 and 6 must be given.
-- SP-GiST has five support functions, all mandatory
+ -- MinMax has three support functions, all mandatory
amname = 'btree' AND procnums @> '{1}' OR
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}' OR
! amname = 'minmax' AND procnums = '{1, 2, 3}'
);
amname | opfname | amproclefttype | amprocrighttype | procnums
--------+---------+----------------+-----------------+----------
***************
*** 1800,1806 **** WHERE NOT (
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}'
);
amname | opcname | procnums
--------+---------+----------
--- 1807,1814 ----
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}' OR
! amname = 'minmax' AND procnums = '{1, 2, 3}'
);
amname | opcname | procnums
--------+---------+----------
*** a/src/test/regress/sql/opr_sanity.sql
--- b/src/test/regress/sql/opr_sanity.sql
***************
*** 1178,1188 **** WHERE NOT (
-- GIN has six support functions. 1-3 are mandatory, 5 is optional, and
-- at least one of 4 and 6 must be given.
-- SP-GiST has five support functions, all mandatory
amname = 'btree' AND procnums @> '{1}' OR
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}'
);
-- Also, check if there are any pg_opclass entries that don't seem to have
--- 1178,1190 ----
-- GIN has six support functions. 1-3 are mandatory, 5 is optional, and
-- at least one of 4 and 6 must be given.
-- SP-GiST has five support functions, all mandatory
+ -- MinMax has three support functions, all mandatory
amname = 'btree' AND procnums @> '{1}' OR
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}' OR
! amname = 'minmax' AND procnums = '{1, 2, 3}'
);
-- Also, check if there are any pg_opclass entries that don't seem to have
***************
*** 1201,1207 **** WHERE NOT (
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}'
);
-- Unfortunately, we can't check the amproc link very well because the
--- 1203,1210 ----
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}' OR
! amname = 'minmax' AND procnums = '{1, 2, 3}'
);
-- Unfortunately, we can't check the amproc link very well because the
Heikki Linnakangas wrote:
On 06/23/2014 08:07 PM, Alvaro Herrera wrote:
I feel that the below would nevertheless be simpler:
I wonder if it would be simpler to just always store the revmap
pages in the beginning of the index, before any other pages. Finding
the revmap page would then be just as easy as with a separate fork.
When the table/index is extended so that a new revmap page is
needed, move the existing page at that block out of the way. Locking
needs some consideration, but I think it would be feasible and
simpler than you have now.
Moving index items around is not easy, because you'd have to adjust the
revmap to rewrite the item pointers.
Hmm. Two alternative schemes come to mind:
1. Move each index tuple off the page individually, updating the
revmap while you do it, until the page is empty. Updating the revmap
for a single index tuple isn't difficult; you have to do it anyway
when an index tuple is replaced. (MMTuples don't contain a heap
block number ATM, but IMHO they should, see below)
2. Store the new block number of the page that you moved out of the
way in the revmap page, and leave the revmap pointers unchanged. The
revmap pointers can be updated later, lazily.
Both of those seem pretty straightforward.
The trouble I have with moving blocks around to make space is that it
would cause the index to have periodic hiccups to make room for the new
revmap pages. One nice property that these indexes are supposed to have
is that the effect on insertion times should be pretty minimal. That
would cease to be the case if we have to do your proposed block moves.
ISTM that when the old tuple cannot be updated in-place, the new
index tuple is inserted with mm_doinsert(), but the old tuple is
never deleted.
It's deleted by the next vacuum.
Ah I see. Vacuum reads the whole index, and builds an in-memory hash
table that contains an ItemPointerData for every tuple in the index.
Doesn't that require a lot of memory, for a large index? That might
be acceptable - you ought to have plenty of RAM if you're pushing
around multi-terabyte tables - but it would nevertheless be nice to
not have a hard requirement for something as essential as vacuum.
I guess if you're expecting that pages_per_range=1 is a common case,
yeah it might become an issue eventually. One idea I just had is to
have a bit for each index tuple, which is set whenever the revmap no
longer points to it. That way, vacuuming is much easier: just scan the
index and delete all tuples having that bit set. No need for this hash
table stuff. I am still concerned with adding more overhead whenever a
page range is modified, so that insertions in the table continue to be
fast. If we're going to dirty the index every time, it might not be so
fast anymore. But then maybe I'm worrying about nothing; I will have to
measure how much slower it is.
Wouldn't it be simpler to remove the old tuple atomically with
inserting the new tuple and updating the revmap? Or at least mark
the old tuple as deletable, so that vacuum can just delete it,
without building the large hash table to determine that it's
deletable.
Yes, it might be simpler, but it'd require dirtying more pages on
insertions (and holding more page-level locks, for longer. Not good for
concurrent access).
I'm quite surprised by the use of LockTuple on the index tuples. I
think the main reason for needing that is the fact that MMTuple
doesn't store the heap (range) block number that the tuple points
to: LockTuple is required to ensure that the tuple doesn't go away
while a scan is following a pointer from the revmap to it. If the
MMTuple contained the BlockNumber, a scan could check that and go
back to the revmap if it doesn't match. Alternatively, you could
keep the revmap page locked when you follow a pointer to the regular
index page.
There's the intention that these accesses be kept as concurrent as
possible; this is why we don't want to block the whole page. Locking
individual TIDs is fine in this case (which is not in SELECT FOR UPDATE)
because we can only lock a single tuple in any one index scan, so
there's no unbounded growth of the lock table.
I prefer not to have BlockNumbers in index tuples, because that would
make them larger for not much gain. That data would mostly be
redundant, and would be necessary only for vacuuming.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 07/09/2014 02:16 PM, Alvaro Herrera wrote:
The way it works now, each opclass needs to have three support
procedures; I've called them getOpers, maybeUpdateValues, and compare.
(I realize these names are pretty bad, and will be changing them.)
I kind of like "maybeUpdateValues". Very ... NoSQL-ish. "Maybe update
the values, maybe not." ;-)
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Jul 9, 2014 at 2:16 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
All this being said, I'm sticking to the name "Minmax indexes". There
was a poll in pgsql-advocacy
/messages/by-id/53A0B4F8.8080803@agliodbs.com
about a new name, but there were no suggestions supported by more than
one person. If a brilliant new name comes up, I'm open to changing it.
How about "summarizing indexes"? That seems reasonably descriptive.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Josh Berkus wrote:
On 07/09/2014 02:16 PM, Alvaro Herrera wrote:
The way it works now, each opclass needs to have three support
procedures; I've called them getOpers, maybeUpdateValues, and compare.
(I realize these names are pretty bad, and will be changing them.)
I kind of like "maybeUpdateValues". Very ... NoSQL-ish. "Maybe update
the values, maybe not." ;-)
:-) Well, that's exactly what happens. If we insert a new tuple into
the table, and the existing summarizing tuple (to use Peter's term)
already covers it, then we don't need to update the index tuple at all.
What this name doesn't say is what values are to be maybe-updated.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Jul 9, 2014 at 10:16 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
there is no hardcoded assumption on the number of index
values stored per heap column, so it is possible to build an opclass
that stores a bounding box column for a geometry heap column, for
instance.
I think the more Postgresy thing to do is to store one datum per heap
column. It's up to the opclass to find or make a composite data type
that stores all the necessary state. So you could make a minmax_accum
data type like NumericAggState in numeric.c:numeric_accum() or the
array of floats in float8_accum. For a bounding box a 2d geometric
min/max index could use the "box" data type for example. The way
you've done it seems more convenient but there's something to be said
for using the same style for different areas. A single bounding box
accumulator function would probably suffice for both an aggregate and
index opclass for example.
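For concreteness, a minimal sketch of that idea (the type name and fields
here are made up for illustration, they're not part of the patch):
   -- Hypothetical composite type holding the whole per-range summary for an
   -- int4 column; the AM would then store exactly one datum of this type per
   -- indexed heap column, and only the opclass needs to know its internals.
   CREATE TYPE minmax_accum AS (minval integer, maxval integer);
   -- A 2-D geometric opclass could instead reuse the existing "box" type as
   -- its per-range summary.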
But this sounds pretty great. I think it would let me do the bloom
filter index I had in mind fairly straightforwardly. The result would
be something very similar to a bitmap index. I'm not sure if there's a
generic term that includes bitmap indexes or other summary functions
like bounding boxes (which min/max is basically -- a 1D bounding box).
Thanks a lot for listening and being so open, I think what you
describe is a lot more flexible than what you had before and I can see
some pretty great things coming out of it (including min/max itself of
course).
--
greg
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Jul 9, 2014 at 6:16 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Another thing I noticed is that version 8 of the patch blindly believed
the "pages_per_range" declared in catalogs. This meant that if somebody
did "alter index foo set pages_per_range=123" the index would
immediately break (i.e. return corrupted results when queried). I have
fixed this by storing the pages_per_range value used to construct the
index in the metapage. Now if you do the ALTER INDEX thing, the new
value is only used when the index is recreated by REINDEX.
This seems a lot like parameterizing. So I guess the only thing left
is to issue a NOTICE when said alter takes place (I don't see that on
the patch, but maybe it's there?)
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Claudio Freire wrote:
On Wed, Jul 9, 2014 at 6:16 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Another thing I noticed is that version 8 of the patch blindly believed
the "pages_per_range" declared in catalogs. This meant that if somebody
did "alter index foo set pages_per_range=123" the index would
immediately break (i.e. return corrupted results when queried). I have
fixed this by storing the pages_per_range value used to construct the
index in the metapage. Now if you do the ALTER INDEX thing, the new
value is only used when the index is recreated by REINDEX.
This seems a lot like parameterizing.
I don't understand what that means -- care to elaborate?
So I guess the only thing left is to issue a NOTICE when said alter
takes place (I don't see that on the patch, but maybe it's there?)
That's not in the patch. I don't think we have an appropriate place to
emit such a notice.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 07/10/2014 12:20 PM, Alvaro Herrera wrote:
So I guess the only thing left is to issue a NOTICE when said alter
takes place (I don't see that on the patch, but maybe it's there?)
That's not in the patch. I don't think we have an appropriate place to
emit such a notice.
What do you mean by "don't have an appropriate place"?
The suggestion is that when a user does:
ALTER INDEX foo_minmax SET PAGES_PER_RANGE=100
they should get a NOTICE:
"NOTICE: changes to pages per range will not take effect until the index
is REINDEXed"
otherwise, we're going to get a lot of "I Altered the pages per range,
but performance didn't change" emails.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Josh Berkus wrote:
On 07/10/2014 12:20 PM, Alvaro Herrera wrote:
So I guess the only thing left is to issue a NOTICE when said alter
takes place (I don't see that on the patch, but maybe it's there?)
That's not in the patch. I don't think we have an appropriate place to
emit such a notice.
What do you mean by "don't have an appropriate place"?
What I think should happen is that if the value is changed, the index
should be rebuilt right there. But there is no way to have this occur
from the generic tablecmds.c code. Maybe we should extend the AM
interface so that they are notified of changes and can take action.
Inserting AM-specific code into tablecmds.c seems pretty wrong to me --
existing stuff for WITH CHECK OPTION views notwithstanding.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Jul 10, 2014 at 3:50 PM, Josh Berkus <josh@agliodbs.com> wrote:
On 07/10/2014 12:20 PM, Alvaro Herrera wrote:
So I guess the only thing left is to issue a NOTICE when said alter
takes place (I don't see that on the patch, but maybe it's there?)
That's not in the patch. I don't think we have an appropriate place to
emit such a notice.
What do you mean by "don't have an appropriate place"?
The suggestion is that when a user does:
ALTER INDEX foo_minmax SET PAGES_PER_RANGE=100
they should get a NOTICE:
"NOTICE: changes to pages per range will not take effect until the index
is REINDEXed"otherwise, we're going to get a lot of "I Altered the pages per range,
but performance didn't change" emails.
How is this different from "ALTER TABLE foo SET (FILLFACTOR=80); " or
from "ALTER TABLE foo ALTER bar SET STORAGE EXTERNAL; " ?
we don't get a notice for these cases either
--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566 Cell: +593 987171157
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 07/10/2014 02:30 PM, Jaime Casanova wrote:
How is this different from "ALTER TABLE foo SET (FILLFACTOR=80); " or
from "ALTER TABLE foo ALTER bar SET STORAGE EXTERNAL; " ?we don't get a notice for these cases either
Good idea. We should also emit notices for those. Well, maybe not for
fillfactor.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Jul 10, 2014 at 2:30 PM, Jaime Casanova <jaime@2ndquadrant.com>
wrote:
On Thu, Jul 10, 2014 at 3:50 PM, Josh Berkus <josh@agliodbs.com> wrote:
On 07/10/2014 12:20 PM, Alvaro Herrera wrote:
So I guess the only thing left is to issue a NOTICE when said alter
takes place (I don't see that on the patch, but maybe it's there?)
That's not in the patch. I don't think we have an appropriate place to
emit such a notice.
What do you mean by "don't have an appropriate place"?
The suggestion is that when a user does:
ALTER INDEX foo_minmax SET PAGES_PER_RANGE=100
they should get a NOTICE:
"NOTICE: changes to pages per range will not take effect until the index
is REINDEXed"otherwise, we're going to get a lot of "I Altered the pages per range,
but performance didn't change" emails.
How is this different from "ALTER TABLE foo SET (FILLFACTOR=80); " or
from "ALTER TABLE foo ALTER bar SET STORAGE EXTERNAL; " ?
we don't get a notice for these cases either
I think those are different. They don't rewrite existing data in the
table, but they are applied to new (and updated) data. My understanding is
that changing PAGES_PER_RANGE will have no effect on future data until a
re-index is done, even if the entire table eventually turns over.
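In other words, assuming the reloption name used upthread, the sequence would
be something like this before the new setting has any effect:
   -- Illustrative only: the new value is recorded in the catalogs, but the
   -- index keeps using the pages_per_range stored in its metapage until it
   -- is rebuilt.
   ALTER INDEX foo_minmax SET (pages_per_range = 100);
   REINDEX INDEX foo_minmax;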
Cheers,
Jeff
On Thu, Jul 10, 2014 at 10:29 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
What I think should happen is that if the value is changed, the index
sholud be rebuilt right there.
I disagree. It would be a non-orthogonal interface if ALTER TABLE
sometimes causes the index to be rebuilt and sometimes just makes a
configuration change. I already see a lot of user confusion when some
ALTER TABLE commands rewrite the table and some are quick meta data
changes.
Especially in this case where the type of configuration being changed
is just an internal storage parameter and the user-visible shape of
the index is unchanged, it would be weird to rebuild the index.
IMHO the "right" thing to do is just to say this parameter is
read-only and have the AM throw an error when the user changes it. But
even that would require an AM callback for the AM to even know about
the change.
--
greg
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Jul 10, 2014 at 6:16 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Claudio Freire wrote:
An aggregate to generate a "compressed set" from several values
A function which adds a new value to the "compressed set" and returns
the new "compressed set"
A function which tests if a value is in a "compressed set"
A function which tests if a "compressed set" overlaps another
"compressed set" of equal typeIf you can define different compressed sets, you can use this to
generate both min/max indexes as well as bloom filter indexes. Whether
we'd want to have both is perhaps questionable, but having the ability
to is probably desirable.
Here's a new version of this patch, which is more generic than the original
versions, and similar to what you describe.
I've not read the discussion so far at all, but I found the problem
when I played with this patch. Sorry if this has already been discussed.
=# create table test as select num from generate_series(1,10) num;
SELECT 10
=# create index testidx on test using minmax (num);
CREATE INDEX
=# alter table test alter column num type text;
ERROR: could not determine which collation to use for string comparison
HINT: Use the COLLATE clause to set the collation explicitly.
Regards,
--
Fujii Masao
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 9 July 2014 23:54, Peter Geoghegan <pg@heroku.com> wrote:
On Wed, Jul 9, 2014 at 2:16 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
All this being said, I'm sticking to the name "Minmax indexes". There
was a poll in pgsql-advocacy
/messages/by-id/53A0B4F8.8080803@agliodbs.com
about a new name, but there were no suggestions supported by more than
one person. If a brilliant new name comes up, I'm open to changing it.
How about "summarizing indexes"? That seems reasonably descriptive.
-1 for another name change. That boat sailed some months back.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 10 July 2014 00:13, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Josh Berkus wrote:
On 07/09/2014 02:16 PM, Alvaro Herrera wrote:
The way it works now, each opclass needs to have three support
procedures; I've called them getOpers, maybeUpdateValues, and compare.
(I realize these names are pretty bad, and will be changing them.)
I kind of like "maybeUpdateValues". Very ... NoSQL-ish. "Maybe update
the values, maybe not." ;-)
:-) Well, that's exactly what happens. If we insert a new tuple into
the table, and the existing summarizing tuple (to use Peter's term)
already covers it, then we don't need to update the index tuple at all.
What this name doesn't say is what values are to be maybe-updated.
There are lots of functions that maybe-do-things, that's just modular
programming. Not sure we need to prefix things with maybe to explain
that, otherwise we'd have maybeXXX everywhere.
More descriptive name would be MaintainIndexBounds() or similar.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Fujii Masao wrote:
On Thu, Jul 10, 2014 at 6:16 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Here's a new version of this patch, which is more generic than the original
versions, and similar to what you describe.
I've not read the discussion so far at all, but I found the problem
when I played with this patch. Sorry if this has already been discussed.
=# create table test as select num from generate_series(1,10) num;
SELECT 10
=# create index testidx on test using minmax (num);
CREATE INDEX
=# alter table test alter column num type text;
ERROR: could not determine which collation to use for string comparison
HINT: Use the COLLATE clause to set the collation explicitly.
Ah, yes, I need to pass down collation OIDs to comparison functions.
That's marked as XXX in various places in the code. Sorry I forgot to
mention that.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Jul 10, 2014 at 4:20 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Claudio Freire wrote:
On Wed, Jul 9, 2014 at 6:16 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Another thing I noticed is that version 8 of the patch blindly believed
the "pages_per_range" declared in catalogs. This meant that if somebody
did "alter index foo set pages_per_range=123" the index would
immediately break (i.e. return corrupted results when queried). I have
fixed this by storing the pages_per_range value used to construct the
index in the metapage. Now if you do the ALTER INDEX thing, the new
value is only used when the index is recreated by REINDEX.
This seems a lot like parameterizing.
I don't understand what that means -- care to elaborate?
We've been talking about bloom filters, and how their shape differs
according to the parameters of the bloom filter (number of hashes,
hash type, etc).
But after seeing this case of pages_per_range, I noticed it's an
effective-enough mechanism. Like:
CREATE INDEX ix_blah ON some_table USING bloom (somecol)
WITH (BLOOM_HASHES=15, BLOOM_BUCKETS=1024, PAGES_PER_RANGE=64);
Marking as read-only is ok, or emitting a NOTICE so that if anyone
changes those parameters that change the shape of the index, they know
it needs a rebuild would be OK too. Both mechanisms work for me.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Jul 11, 2014 at 6:00 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
Marking as read-only is ok, or emitting a NOTICE so that if anyone
changes those parameters that change the shape of the index, they know
it needs a rebuild would be OK too. Both mechanisms work for me.
We don't actually have any of these mechanisms. They wouldn't be bad
things to have but I don't think we should gate adding new types of
indexes on adding them. In particular, the index could just hard code
a value for these parameters and having them be parameterized is
clearly better even if that doesn't produce all the warnings or
rebuild things automatically or whatever.
--
greg
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Jul 11, 2014 at 3:47 PM, Greg Stark <stark@mit.edu> wrote:
On Fri, Jul 11, 2014 at 6:00 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
Marking as read-only is ok, or emitting a NOTICE so that if anyone
changes those parameters that change the shape of the index, they know
it needs a rebuild would be OK too. Both mechanisms work for me.We don't actually have any of these mechanisms. They wouldn't be bad
things to have but I don't think we should gate adding new types of
indexes on adding them. In particular, the index could just hard code
a value for these parameters and having them be parameterized is
clearly better even if that doesn't produce all the warnings or
rebuild things automatically or whatever.
No, I agree, it's just a nice to have.
But at least the docs should mention it.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Jul 9, 2014 at 5:16 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
The way it works now, each opclass needs to have three support
procedures; I've called them getOpers, maybeUpdateValues, and compare.
(I realize these names are pretty bad, and will be changing them.)
getOpers is used to obtain information about what is stored for that
data type; it says how many datum values are stored for a column of that
type (two for sortable: min and max), and how many operators it needs
setup. Then, the generic code fills in a MinmaxDesc(riptor) and creates
an initial DeformedMMTuple (which is a rather ugly name for a minmax
tuple held in memory). The maybeUpdateValues amproc can then be called
when there's a new heap tuple, which updates the DeformedMMTuple to
account for the new tuple (in essence, it's a union of the original
values and the new tuple). This can be done repeatedly (when a new
index is being created) or only once (when a new heap tuple is inserted
into an existing index). There is no need for an "aggregate".
This DeformedMMTuple can easily be turned into the on-disk
representation; there is no hardcoded assumption on the number of index
values stored per heap column, so it is possible to build an opclass
that stores a bounding box column for a geometry heap column, for
instance.
Then we have the "compare" amproc. This is used during index scans;
after extracting an index tuple, it is turned into DeformedMMTuple, and
the "compare" amproc for each column is called with the values of scan
keys. (Now that I think about this, it seems pretty much what
"consistent" is for GiST opclasses). A true return value indicates that
the scan key matches the page range boundaries and thus all pages in the
range are added to the output TID bitmap.
This sounds really great. I agree that it needs some renaming. I
think renaming what you are calling "compare" to "consistent" would be
an excellent idea, to match GiST. "maybeUpdateValues" sounds like it
does the equivalent of GIST's "compress" on the new value followed by
a "union" with the existing summary item. I don't think it's
necessary to separate those out, though. You could perhaps call it
something like "add_item".
Also, FWIW, I liked Peter's idea of calling these "summarizing
indexes" or perhaps "summary" would be a bit shorter and mean the same
thing. "minmax" wouldn't be the end of the world, but since you've
gone to the trouble of making this more generic I think giving it a
more generic name would be a very good idea.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 07/10/2014 12:41 AM, Alvaro Herrera wrote:
Heikki Linnakangas wrote:
On 06/23/2014 08:07 PM, Alvaro Herrera wrote:
I feel that the below would nevertheless be simpler:
I wonder if it would be simpler to just always store the revmap
pages in the beginning of the index, before any other pages. Finding
the revmap page would then be just as easy as with a separate fork.
When the table/index is extended so that a new revmap page is
needed, move the existing page at that block out of the way. Locking
needs some consideration, but I think it would be feasible and
simpler than you have now.
Moving index items around is not easy, because you'd have to adjust the
revmap to rewrite the item pointers.
Hmm. Two alternative schemes come to mind:
1. Move each index tuple off the page individually, updating the
revmap while you do it, until the page is empty. Updating the revmap
for a single index tuple isn't difficult; you have to do it anyway
when an index tuple is replaced. (MMTuples don't contain a heap
block number ATM, but IMHO they should, see below)
2. Store the new block number of the page that you moved out of the
way in the revmap page, and leave the revmap pointers unchanged. The
revmap pointers can be updated later, lazily.
Both of those seem pretty straightforward.
The trouble I have with moving blocks around to make space is that it
would cause the index to have periodic hiccups to make room for the new
revmap pages. One nice property that these indexes are supposed to have
is that the effect on insertion times should be pretty minimal. That
would cease to be the case if we have to do your proposed block moves.
Approach 2 above is fairly quick, quick enough that no-one would notice
the "hiccup". Moving the tuples individually (approach 1) would be slower.
ISTM that when the old tuple cannot be updated in-place, the new
index tuple is inserted with mm_doinsert(), but the old tuple is
never deleted.
It's deleted by the next vacuum.
Ah I see. Vacuum reads the whole index, and builds an in-memory hash
table that contains an ItemPointerData for every tuple in the index.
Doesn't that require a lot of memory, for a large index? That might
be acceptable - you ought to have plenty of RAM if you're pushing
around multi-terabyte tables - but it would nevertheless be nice to
not have a hard requirement for something as essential as vacuum.
I guess if you're expecting that pages_per_range=1 is a common case,
yeah it might become an issue eventually.
Not sure, but I find it easier to think of the patch that way. In any
case, it would be nice to avoid the problem, even if it's not common.
One idea I just had is to
have a bit for each index tuple, which is set whenever the revmap no
longer points to it. That way, vacuuming is much easier: just scan the
index and delete all tuples having that bit set.
The bit needs to be set atomically with the insertion of the new tuple,
so why not just remove the old tuple right away?
Wouldn't it be simpler to remove the old tuple atomically with
inserting the new tuple and updating the revmap? Or at least mark
the old tuple as deletable, so that vacuum can just delete it,
without building the large hash table to determine that it's
deletable.
Yes, it might be simpler, but it'd require dirtying more pages on
insertions (and holding more page-level locks, for longer. Not good for
concurrent access).
I wouldn't worry much about the performance and concurrency of this
operation. Remember that the majority of updates are expected to not
have to update the index, otherwise the minmax index will degenerate
quickly and performance will suck anyway. And even when updating the
index is needed, in most cases the new tuple fits on the same page,
after removing the old one. So the case where you have to insert a new
index tuple, remove old one (or mark it dead), and update the revmap to
point to the new tuple, is rare.
I'm quite surprised by the use of LockTuple on the index tuples. I
think the main reason for needing that is the fact that MMTuple
doesn't store the heap (range) block number that the tuple points
to: LockTuple is required to ensure that the tuple doesn't go away
while a scan is following a pointer from the revmap to it. If the
MMTuple contained the BlockNumber, a scan could check that and go
back to the revmap if it doesn't match. Alternatively, you could
keep the revmap page locked when you follow a pointer to the regular
index page.
There's the intention that these accesses be kept as concurrent as
possible; this is why we don't want to block the whole page. Locking
individual TIDs is fine in this case (which is not in SELECT FOR UPDATE)
because we can only lock a single tuple in any one index scan, so
there's no unbounded growth of the lock table.
I prefer not to have BlockNumbers in index tuples, because that would
make them larger for not much gain. That data would mostly be
redundant, and would be necessary only for vacuuming.
Don't underestimate the value of easier debugging. I wouldn't worry much
about shaving four bytes from the tuple, these indexes are tiny in any
case. Keep it simple at first, and optimize later if necessary.
In fact, I'd suggest just using normal IndexTuple instead of the custom
MMTuple struct, store the block number in t_tid and leave offset number
field of that unused. That wastes 2 more bytes per tuple, but that's
insignificant too. I feel that it probably would be worth it just to
keep things simple, and you'd e.g. be able to use index_deform_tuple() as is.
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Thanks for all the feedback on version 9. Here's version 13. (The
intermediate versions are just tags in my private tree which I created
each time I rebased. Please bear with me here.)
I have chosen to keep the name "minmax", even if the opclasses now let
one implement completely different things on top of it such as geometry
bounding boxes and bloom filters (aka bitmap indexes). I don't see a
need for a rename: essentially, in PR we can just say "we have these
neat minmax indexes that other databases also have, but instead of just
being used for integer data, they can also be used for geometry, GIS and
bitmap indexes, so as always we're more powerful than everyone else when
implementing new database features".
This new version includes some changes per feedback. Most notably,
the opclass definition is different now: instead of relying on the
"sortable" opclass implementation extracting the oprcode for each
operator strategy (i.e. the functions that underlie < <= >= >), I chose
to have catalog entries in pg_amproc for the underlying support
functions. The new definition makes a lot of sense to me now, after
thinking long about this stuff and carefully reading the
"Catalog Entries for Indexes" chapter in docs.
The way it works now is that there are five pg_amop entries in an
opclass, just like previously (corresponding to the underlying < <= = >= >
operators). This lets the optimizer choose the index when a query uses
those operators. There are also seven pg_amproc entries. The first
three are identical to all minmax opclasses: "opcinfo" (version 9 called
it "getopers"), "consistent" (v9 name "compare") and "add_value" (v9
name "maybeUpdateValues", not a loved name evidently). A minmax opclass
on top of a sortable datatype has four additional support functions: one
for each function underlying the < <= >= > operators. Other opclasses
would define their own support functions here, which would correspond to
functions used to implement the "consistent" and "compare" functions
internally.
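Spelled as SQL, a sortable-datatype opclass under this layout would look
roughly like the sketch below. This is only illustrative: the built-in
opclasses are defined through catalog entries rather than CREATE OPERATOR
CLASS, and the names and numbering of the three common support procedures
here are placeholders, not the ones in the patch.
   CREATE OPERATOR CLASS int4_minmax_ops
       DEFAULT FOR TYPE int4 USING minmax AS
           OPERATOR 1  < ,
           OPERATOR 2  <= ,
           OPERATOR 3  = ,
           OPERATOR 4  >= ,
           OPERATOR 5  > ,
           FUNCTION 1  mm_int4_opcinfo(internal),     -- "opcinfo" (placeholder name)
           FUNCTION 2  mm_int4_add_value(internal),   -- "add_value" (placeholder name)
           FUNCTION 3  mm_int4_consistent(internal),  -- "consistent" (placeholder name)
           FUNCTION 4  int4lt(int4, int4),            -- underlies <
           FUNCTION 5  int4le(int4, int4),            -- underlies <=
           FUNCTION 6  int4ge(int4, int4),            -- underlies >=
           FUNCTION 7  int4gt(int4, int4);            -- underlies >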
I don't claim this is 100% correct, but in particular I think it's now
possible to implement cross-datatype comparisons, so that a minmax index
defined on an int8 column works when the query uses an int4 operator,
for example. (The current patch doesn't actually add such catalog
entries, though. I think some minor code changes are required for this
to actually work. However with the previous opclass definition it would
have been outright impossible.)
I fixed the bug reported by Masao-kun that collatable datatypes weren't
cleanly supported. Collation OIDs are passed down now, although I don't
claim that it is bulletproof. This could use some more testing.
I haven't yet updated the revmap definition per Heikki's review. I am
not sure I want to do that right away. I think we could live with what
we have now, and see about changing this later on in the 9.5 cycle if we
think a different definition is better. I think what we have is pretty
solid even if there are some theoretical holes.
As a very quick test, I created a 10 million tuples table with an int4
column on my laptop. The table is ~346 MB. Creating a btree index on
it takes 8 seconds. A minmax index takes 1.6 seconds. The btree index
is 214 MB. The minmax index, with pages_per_range=1 is 1 MB. With
pages_per_range=16 (default) it is 48kB.
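The setup was essentially the following, reconstructed from the plans shown
below (the table, column and index names are the ones that appear there):
   CREATE TABLE t AS SELECT generate_series(1, 10000000) AS a;
   VACUUM t;                                -- assumed, so the index-only scan shows no heap fetches
   CREATE INDEX bti2 ON t (a);              -- btree, ~214 MB
   CREATE INDEX ti2 ON t USING minmax (a);  -- pages_per_range=16 (default), ~48 kB
   CREATE INDEX ti ON t USING minmax (a) WITH (pages_per_range = 1);  -- ~1 MB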
Very unscientific results follow. This is the btree doing an index-only
scan:
alvherre=# explain (analyze, buffers) select * from t where a > 991243 and a < 1045762;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Index Only Scan using bti2 on t (cost=0.43..1692.75 rows=54416 width=4) (actual time=0.106..23.329 rows=54518 loops=1)
Index Cond: ((a > 991243) AND (a < 1045762))
Heap Fetches: 0
Buffers: shared hit=1 read=152
Planning time: 0.695 ms
Execution time: 31.565 ms
(6 filas)
Duración: 33,662 ms
Turn off index-only scan, do a regular index scan:
alvherre=# explain (analyze, buffers) select * from t where a > 991243 and a < 1045762;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Index Scan using bti2 on t (cost=0.43..1932.75 rows=54416 width=4) (actual time=0.066..31.027 rows=54518 loops=1)
Index Cond: ((a > 991243) AND (a < 1045762))
Buffers: shared hit=394
Planning time: 0.250 ms
Execution time: 39.218 ms
(5 filas)
Duración: 40,385 ms
Use the 16-pages-per-range minmax index:
alvherre=# explain (analyze, buffers) select * from t where a > 991243 and a < 1045762;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t (cost=16.60..47402.01 rows=54416 width=4) (actual time=4.266..43.948 rows=54518 loops=1)
Recheck Cond: ((a > 991243) AND (a < 1045762))
Rows Removed by Index Recheck: 32266
Heap Blocks: lossy=384
Buffers: shared hit=244 read=142
-> Bitmap Index Scan on ti2 (cost=0.00..3.00 rows=54416 width=0) (actual time=1.061..1.061 rows=3840 loops=1)
Index Cond: ((a > 991243) AND (a < 1045762))
Buffers: shared hit=2
Planning time: 0.215 ms
Execution time: 51.820 ms
(10 filas)
This is the 1-page-per-range minmax index:
alvherre=# explain (analyze, buffers) select * from t where a > 991243 and a < 1045762;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t (cost=157.60..47543.01 rows=54416 width=4) (actual time=82.479..98.642 rows=54518 loops=1)
Recheck Cond: ((a > 991243) AND (a < 1045762))
Rows Removed by Index Recheck: 174
Heap Blocks: lossy=242
Buffers: shared hit=385
-> Bitmap Index Scan on ti (cost=0.00..144.00 rows=54416 width=0) (actual time=82.448..82.448 rows=2420 loops=1)
Index Cond: ((a > 991243) AND (a < 1045762))
Buffers: shared hit=143
Planning time: 0.280 ms
Execution time: 103.542 ms
(10 filas)
Duración: 104,952 ms
This is a seqscan. Notice the high number of buffer accesses:
alvherre=# explain (analyze, buffers) select * from t where a > 991243 and a < 1045762;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Seq Scan on t (cost=0.00..194248.00 rows=54416 width=4) (actual time=161.338..1201.535 rows=54518 loops=1)
Filter: ((a > 991243) AND (a < 1045762))
Rows Removed by Filter: 9945482
Buffers: shared hit=10672 read=33576
Planning time: 0.189 ms
Execution time: 1204.501 ms
(6 filas)
Duración: 1205,304 ms
Of course, this isn't nearly a worst-case scenario for minmax, as the
data is perfectly correlated. The pages_per_range=16 index benefits
particularly from that.
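An easy way to see the other extreme (a hypothetical follow-up, not run here)
is to shuffle the same data, so that every page range ends up covering nearly
the whole value domain and the index can prune almost nothing:
   CREATE TABLE t_random AS SELECT a FROM t ORDER BY random();
   CREATE INDEX ti_random ON t_random USING minmax (a);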
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
minmax-13.patchtext/x-diff; charset=us-asciiDownload
diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile
index f10229d..45b5b6c 100644
--- a/contrib/pageinspect/Makefile
+++ b/contrib/pageinspect/Makefile
@@ -1,7 +1,7 @@
# contrib/pageinspect/Makefile
MODULE_big = pageinspect
-OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o $(WIN32RES)
+OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o mmfuncs.o $(WIN32RES)
EXTENSION = pageinspect
DATA = pageinspect--1.2.sql pageinspect--1.0--1.1.sql \
diff --git a/contrib/pageinspect/mmfuncs.c b/contrib/pageinspect/mmfuncs.c
new file mode 100644
index 0000000..b2584d3
--- /dev/null
+++ b/contrib/pageinspect/mmfuncs.c
@@ -0,0 +1,421 @@
+/*
+ * mmfuncs.c
+ * Functions to investigate MinMax indexes
+ *
+ * Copyright (c) 2014, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/pageinspect/mmfuncs.c
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/minmax.h"
+#include "access/minmax_internal.h"
+#include "access/minmax_page.h"
+#include "access/minmax_revmap.h"
+#include "access/minmax_tuple.h"
+#include "catalog/index.h"
+#include "catalog/pg_type.h"
+#include "funcapi.h"
+#include "utils/array.h"
+#include "utils/builtins.h"
+#include "utils/lsyscache.h"
+#include "utils/rel.h"
+#include "miscadmin.h"
+
+Datum minmax_page_type(PG_FUNCTION_ARGS);
+Datum minmax_page_items(PG_FUNCTION_ARGS);
+Datum minmax_metapage_info(PG_FUNCTION_ARGS);
+Datum minmax_revmap_array_data(PG_FUNCTION_ARGS);
+Datum minmax_revmap_data(PG_FUNCTION_ARGS);
+
+PG_FUNCTION_INFO_V1(minmax_page_type);
+PG_FUNCTION_INFO_V1(minmax_page_items);
+PG_FUNCTION_INFO_V1(minmax_metapage_info);
+PG_FUNCTION_INFO_V1(minmax_revmap_array_data);
+PG_FUNCTION_INFO_V1(minmax_revmap_data);
+
+typedef struct mm_page_state
+{
+ TupleDesc tupdesc;
+ Page page;
+ OffsetNumber offset;
+ bool unusedItem;
+ bool done;
+ AttrNumber attno;
+ DeformedMMTuple *dtup;
+ FmgrInfo outputfn[FLEXIBLE_ARRAY_MEMBER];
+} mm_page_state;
+
+
+static Page verify_minmax_page(bytea *raw_page, uint16 type,
+ const char *strtype);
+
+Datum
+minmax_page_type(PG_FUNCTION_ARGS)
+{
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Page page = VARDATA(raw_page);
+ MinmaxSpecialSpace *special;
+ char *type;
+
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
+
+ switch (special->type)
+ {
+ case MINMAX_PAGETYPE_META:
+ type = "meta";
+ break;
+ case MINMAX_PAGETYPE_REVMAP_ARRAY:
+ type = "revmap array";
+ break;
+ case MINMAX_PAGETYPE_REVMAP:
+ type = "revmap";
+ break;
+ case MINMAX_PAGETYPE_REGULAR:
+ type = "regular";
+ break;
+ default:
+ type = psprintf("unknown (%02x)", special->type);
+ break;
+ }
+
+ PG_RETURN_TEXT_P(cstring_to_text(type));
+}
+
+/*
+ * Verify that the given bytea contains a minmax page of the indicated page
+ * type, or die in the attempt. A pointer to the page is returned.
+ */
+static Page
+verify_minmax_page(bytea *raw_page, uint16 type, const char *strtype)
+{
+ Page page;
+ int raw_page_size;
+ MinmaxSpecialSpace *special;
+
+ raw_page_size = VARSIZE(raw_page) - VARHDRSZ;
+
+ if (raw_page_size < SizeOfPageHeaderData)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("input page too small"),
+ errdetail("Expected size %d, got %d", raw_page_size, BLCKSZ)));
+
+ page = VARDATA(raw_page);
+
+ /* verify the special space says this page is what we want */
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
+ if (special->type != type)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("page is not a Minmax page of type \"%s\"", strtype),
+ errdetail("Expected special type %08x, got %08x.",
+ type, special->type)));
+
+ return page;
+}
+
+
+#ifdef NOT_YET
+/*
+ * Extract all item values from a minmax index page
+ *
+ * Usage: SELECT * FROM minmax_page_items(get_raw_page('idx', 1), 'idx'::regclass);
+ */
+Datum
+minmax_page_items(PG_FUNCTION_ARGS)
+{
+ mm_page_state *state;
+ FuncCallContext *fctx;
+
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ (errmsg("must be superuser to use raw page functions"))));
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Oid indexRelid = PG_GETARG_OID(1);
+ Page page;
+ TupleDesc tupdesc;
+ MemoryContext mctx;
+ Relation indexRel;
+ AttrNumber attno;
+
+ /* minimally verify the page we got */
+ page = verify_minmax_page(raw_page, MINMAX_PAGETYPE_REGULAR, "regular");
+
+ /* create a function context for cross-call persistence */
+ fctx = SRF_FIRSTCALL_INIT();
+
+ /* switch to memory context appropriate for multiple function calls */
+ mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ indexRel = index_open(indexRelid, AccessShareLock);
+
+ state = palloc(offsetof(mm_page_state, outputfn) +
+ sizeof(FmgrInfo) * RelationGetDescr(indexRel)->natts);
+
+ state->tupdesc = CreateTupleDescCopy(RelationGetDescr(indexRel));
+ state->page = page;
+ state->offset = FirstOffsetNumber;
+ state->unusedItem = false;
+ state->done = false;
+ state->dtup = NULL;
+
+ index_close(indexRel, AccessShareLock);
+
+ for (attno = 1; attno <= state->tupdesc->natts; attno++)
+ {
+ Oid output;
+ bool isVarlena;
+
+ getTypeOutputInfo(state->tupdesc->attrs[attno - 1]->atttypid,
+ &output, &isVarlena);
+ fmgr_info(output, &state->outputfn[attno - 1]);
+ }
+
+ fctx->user_fctx = state;
+ fctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+ MemoryContextSwitchTo(mctx);
+ }
+
+ fctx = SRF_PERCALL_SETUP();
+ state = fctx->user_fctx;
+
+ if (!state->done)
+ {
+ HeapTuple result;
+ Datum values[6];
+ bool nulls[6];
+
+ /*
+ * This loop is called once for every attribute of every tuple in the
+ * page. At the start of a tuple, we get a NULL dtup; that's our
+ * signal for obtaining and decoding the next one. If that's not the
+ * case, we output the next attribute.
+ */
+ if (state->dtup == NULL)
+ {
+ MMTuple *tup;
+ MemoryContext mctx;
+ ItemId itemId;
+
+ /* deformed tuple must live across calls */
+ mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx);
+
+ /* verify item status: if there's no data, we can't decode */
+ itemId = PageGetItemId(state->page, state->offset);
+ if (ItemIdIsUsed(itemId))
+ {
+ tup = (MMTuple *) PageGetItem(state->page,
+ PageGetItemId(state->page,
+ state->offset));
+ state->dtup = minmax_deform_tuple(state->tupdesc, tup);
+ state->attno = 1;
+ state->unusedItem = false;
+ }
+ else
+ state->unusedItem = true;
+
+ MemoryContextSwitchTo(mctx);
+ }
+ else
+ state->attno++;
+
+ MemSet(nulls, 0, sizeof(nulls));
+
+ if (state->unusedItem)
+ {
+ values[0] = UInt16GetDatum(state->offset);
+ nulls[1] = true;
+ nulls[2] = true;
+ nulls[3] = true;
+ nulls[4] = true;
+ nulls[5] = true;
+ }
+ else
+ {
+ int att = state->attno - 1;
+
+ values[0] = UInt16GetDatum(state->offset);
+ values[1] = UInt16GetDatum(state->attno);
+ values[2] = BoolGetDatum(state->dtup->values[att].allnulls);
+ values[3] = BoolGetDatum(state->dtup->values[att].hasnulls);
+ if (!state->dtup->values[att].allnulls)
+ {
+ FmgrInfo *outputfn = &state->outputfn[att];
+ MMValues *mmvalues = &state->dtup->values[att];
+
+ values[4] = CStringGetTextDatum(OutputFunctionCall(outputfn,
+ mmvalues->min));
+ values[5] = CStringGetTextDatum(OutputFunctionCall(outputfn,
+ mmvalues->max));
+ }
+ else
+ {
+ nulls[4] = true;
+ nulls[5] = true;
+ }
+ }
+
+ result = heap_form_tuple(fctx->tuple_desc, values, nulls);
+
+ /*
+ * If the item was unused, jump straight to the next one; otherwise,
+ * the only cleanup needed here is to set our signal to go to the next
+ * tuple in the following iteration, by freeing the current one.
+ */
+ if (state->unusedItem)
+ state->offset = OffsetNumberNext(state->offset);
+ else if (state->attno >= state->tupdesc->natts)
+ {
+ pfree(state->dtup);
+ state->dtup = NULL;
+ state->offset = OffsetNumberNext(state->offset);
+ }
+
+ /*
+ * If we're beyond the end of the page, set flag to end the function in
+ * the following iteration.
+ */
+ if (state->offset > PageGetMaxOffsetNumber(state->page))
+ state->done = true;
+
+ SRF_RETURN_NEXT(fctx, HeapTupleGetDatum(result));
+ }
+
+ SRF_RETURN_DONE(fctx);
+}
+#endif
+
+Datum
+minmax_metapage_info(PG_FUNCTION_ARGS)
+{
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Page page;
+ MinmaxMetaPageData *meta;
+ TupleDesc tupdesc;
+ Datum values[3];
+ bool nulls[3];
+ ArrayBuildState *astate = NULL;
+ HeapTuple htup;
+ int i;
+
+ page = verify_minmax_page(raw_page, MINMAX_PAGETYPE_META, "metapage");
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+ tupdesc = BlessTupleDesc(tupdesc);
+
+ /* Extract values from the metapage */
+ meta = (MinmaxMetaPageData *) PageGetContents(page);
+ MemSet(nulls, 0, sizeof(nulls));
+ values[0] = CStringGetTextDatum(psprintf("0x%08X", meta->minmaxMagic));
+ values[1] = Int32GetDatum(meta->minmaxVersion);
+
+ /* Extract (possibly empty) list of revmap array page numbers. */
+ for (i = 0; i < MAX_REVMAP_ARRAYPAGES; i++)
+ {
+ BlockNumber blkno;
+
+ blkno = meta->revmapArrayPages[i];
+ if (blkno == InvalidBlockNumber)
+ break; /* XXX or continue? */
+ astate = accumArrayResult(astate, Int64GetDatum((int64) blkno),
+ false, INT8OID, CurrentMemoryContext);
+ }
+ if (astate == NULL)
+ nulls[2] = true;
+ else
+ values[2] = makeArrayResult(astate, CurrentMemoryContext);
+
+ htup = heap_form_tuple(tupdesc, values, nulls);
+
+ PG_RETURN_DATUM(HeapTupleGetDatum(htup));
+}
+
+/*
+ * Return the BlockNumber array stored in a revmap array page
+ */
+Datum
+minmax_revmap_array_data(PG_FUNCTION_ARGS)
+{
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Page page;
+ ArrayBuildState *astate = NULL;
+ RevmapArrayContents *contents;
+ Datum blkarr;
+ int i;
+
+ page = verify_minmax_page(raw_page, MINMAX_PAGETYPE_REVMAP_ARRAY,
+ "revmap array");
+
+ contents = (RevmapArrayContents *) PageGetContents(page);
+
+ for (i = 0; i < contents->rma_nblocks; i++)
+ astate = accumArrayResult(astate,
+ Int64GetDatum((int64) contents->rma_blocks[i]),
+ false, INT8OID, CurrentMemoryContext);
+ Assert(astate != NULL);
+
+ blkarr = makeArrayResult(astate, CurrentMemoryContext);
+ PG_RETURN_DATUM(blkarr);
+}
+
+/*
+ * Return the TID array stored in a minmax revmap page
+ */
+Datum
+minmax_revmap_data(PG_FUNCTION_ARGS)
+{
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Page page;
+ RevmapContents *contents;
+ TupleDesc tupdesc;
+ Datum values[2];
+ bool nulls[2];
+ HeapTuple htup;
+ ArrayBuildState *astate = NULL;
+ int i;
+
+ page = verify_minmax_page(raw_page, MINMAX_PAGETYPE_REVMAP, "revmap");
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+ tupdesc = BlessTupleDesc(tupdesc);
+
+ /* Extract values from the revmap page */
+ contents = (RevmapContents *) PageGetContents(page);
+ MemSet(nulls, 0, sizeof(nulls));
+ values[0] = Int64GetDatum((int64) contents->rmr_logblk);
+
+ /* Extract (possibly empty) list of TIDs in this page. */
+ for (i = 0; i < REGULAR_REVMAP_PAGE_MAXITEMS; i++)
+ {
+ ItemPointer tid;
+
+ tid = &contents->rmr_tids[i];
+ astate = accumArrayResult(astate,
+ PointerGetDatum(tid),
+ false, TIDOID, CurrentMemoryContext);
+ }
+ if (astate == NULL)
+ nulls[1] = true;
+ else
+ values[1] = makeArrayResult(astate, CurrentMemoryContext);
+
+ htup = heap_form_tuple(tupdesc, values, nulls);
+
+ PG_RETURN_DATUM(HeapTupleGetDatum(htup));
+}
diff --git a/contrib/pageinspect/pageinspect--1.2.sql b/contrib/pageinspect/pageinspect--1.2.sql
index 15e8e1e..b6410aa 100644
--- a/contrib/pageinspect/pageinspect--1.2.sql
+++ b/contrib/pageinspect/pageinspect--1.2.sql
@@ -99,6 +99,52 @@ AS 'MODULE_PATHNAME', 'bt_page_items'
LANGUAGE C STRICT;
--
+-- minmax_page_type()
+--
+CREATE FUNCTION minmax_page_type(IN page bytea)
+RETURNS text
+AS 'MODULE_PATHNAME', 'minmax_page_type'
+LANGUAGE C STRICT;
+
+--
+-- minmax_metapage_info()
+--
+CREATE FUNCTION minmax_metapage_info(IN page bytea, OUT magic text,
+ OUT version integer, OUT revmap_array_pages BIGINT[])
+AS 'MODULE_PATHNAME', 'minmax_metapage_info'
+LANGUAGE C STRICT;
+
+--
+-- minmax_page_items()
+--
+/* needs more work
+CREATE FUNCTION minmax_page_items(IN page bytea, IN index_oid oid,
+ OUT itemoffset int,
+ OUT attnum int,
+ OUT allnulls bool,
+ OUT hasnulls bool,
+ OUT min text,
+ OUT max text)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'minmax_page_items'
+LANGUAGE C STRICT;
+*/
+
+--
+-- minmax_revmap_array_data()
+CREATE FUNCTION minmax_revmap_array_data(IN page bytea,
+ OUT revmap_pages BIGINT[])
+AS 'MODULE_PATHNAME', 'minmax_revmap_array_data'
+LANGUAGE C STRICT;
+
+--
+-- minmax_revmap_data()
+CREATE FUNCTION minmax_revmap_data(IN page bytea,
+ OUT logblk BIGINT, OUT pages tid[])
+AS 'MODULE_PATHNAME', 'minmax_revmap_data'
+LANGUAGE C STRICT;
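+
+-- Example usage (illustrative only; assumes an existing minmax index "ti",
+-- whose block 0 is the metapage):
+--
+--   SELECT minmax_page_type(get_raw_page('ti', 0));
+--   SELECT * FROM minmax_metapage_info(get_raw_page('ti', 0));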
+
+--
-- fsm_page_contents()
--
CREATE FUNCTION fsm_page_contents(IN page bytea)
diff --git a/contrib/pg_xlogdump/rmgrdesc.c b/contrib/pg_xlogdump/rmgrdesc.c
index cbcaaa6..8ffff06 100644
--- a/contrib/pg_xlogdump/rmgrdesc.c
+++ b/contrib/pg_xlogdump/rmgrdesc.c
@@ -13,6 +13,7 @@
#include "access/gist_private.h"
#include "access/hash.h"
#include "access/heapam_xlog.h"
+#include "access/minmax_xlog.h"
#include "access/multixact.h"
#include "access/nbtree.h"
#include "access/rmgr.h"
diff --git a/minmax-proposal b/minmax-proposal
new file mode 100644
index 0000000..ededbcd
--- /dev/null
+++ b/minmax-proposal
@@ -0,0 +1,306 @@
+Minmax Range Indexes
+====================
+
+Minmax indexes are a new access method intended to enable very fast scanning of
+extremely large tables.
+
+The essential idea of a minmax index is to keep track of summarizing values in
+consecutive groups of heap pages (page ranges); for example, the minimum and
+maximum values for datatypes with a btree opclass, or the bounding box for
+geometric types. These values can be used by constraint exclusion to avoid
+scanning such pages, depending on query quals.
+
+The main drawback of this is having to update the stored summary values of each
+page range as tuples are inserted into them.
+
+Other database systems already have similar features. Some examples:
+
+* Oracle Exadata calls this "storage indexes"
+ http://richardfoote.wordpress.com/category/storage-indexes/
+
+* Netezza has "zone maps"
+ http://nztips.com/2010/11/netezza-integer-join-keys/
+
+* Infobright has this automatically within their "data packs" according to a
+ May 3rd, 2009 blog post
+ http://www.infobright.org/index.php/organizing_data_and_more_about_rough_data_contest/
+
+* MonetDB also uses this technique, according to a published paper
+ http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.2662
+ "Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS"
+
+Index creation
+--------------
+
+To create a minmax index, we use the standard wording:
+
+ CREATE INDEX foo_minmax_idx ON foo USING MINMAX (a, b, e);
+
+Partial indexes are not supported currently; since an index is concerned with
+summary values of the involved columns across all the pages in the table, it
+normally doesn't make sense to exclude some tuples. Partial indexes might be
+useful if the index predicate is also used in the query quals, but we exclude
+them for now for conceptual simplicity.
+
+Expressional indexes can probably be supported in the future, but we disallow
+them initially for conceptual simplicity.
+
+Having multiple minmax indexes in the same table is acceptable, though most of
+the time it would make more sense to have a single index covering all the
+interesting columns. Multiple indexes might be useful for columns added later.
+
+Access Method Design
+--------------------
+
+Since item pointers are not stored inside indexes of this type, it is not
+possible to support the amgettuple interface. Instead, we only provide
+amgetbitmap support; scanning a relation using this index requires a recheck
+node on top. The amgetbitmap routine returns a TIDBitmap comprising all pages
+in those page groups that match the query qualifications. The recheck node
+prunes the tuples that do not actually satisfy the query qualifications.
+
+For each supported datatype, we need an operator class with the following
+catalog entries:
+
+- support operators (pg_amop): same as btree (<, <=, =, >=, >)
+- support procedures (pg_amproc):
+ * "opcinfo" (procno 1) initializes a structure for index creation or scanning
+ * "addValue" (procno 2) takes an index tuple and a heap item, and possibly
+ changes the index tuple so that it includes the heap item values
+ * "consistent" (procno 3) takes an index tuple and query quals, and returns
+ whether the index tuple values match the query quals.
+
+These are used pervasively:
+
+- The optimizer requires them to evaluate queries, so that the index is chosen
+ when queries on the indexed table are planned.
+- During index construction (ambuild), they are used to determine the boundary
+ values for each page range.
+- During index updates (aminsert), they are used to determine whether the new
+ heap tuple matches the existing index tuple; and if not, they are used to
+ construct the new index tuple.
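+
+As an illustration of the calling convention (a sketch that mirrors the
+FunctionCall5Coll invocations in minmax.c further down; names such as "attno",
+"value" and "isnull" are placeholders):
+
+    /* aminsert: fold one heap value into the deformed summary tuple */
+    need_insert |= DatumGetBool(FunctionCall5Coll(addValue, collation,
+                                    PointerGetDatum(mmdesc),
+                                    PointerGetDatum(dtup),
+                                    UInt16GetDatum(attno),
+                                    value, BoolGetDatum(isnull)));
+
+    /* amgetbitmap: does this range's summary satisfy one scan key? */
+    addrange = DatumGetBool(FunctionCall5Coll(consistentFn, key->sk_collation,
+                                    PointerGetDatum(mmdesc),
+                                    PointerGetDatum(dtup),
+                                    Int16GetDatum(key->sk_attno),
+                                    UInt16GetDatum(key->sk_strategy),
+                                    key->sk_argument));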
+
+In each index tuple (corresponding to one page range), we store:
+- for each indexed column of a datatype with a btree-opclass:
+ * minimum value across all tuples in the range
+ * maximum value across all tuples in the range
+ * are there nulls present in any tuple?
+ * are all the values null in all tuples in the range?
+
+Different datatypes store other values instead of min/max, for example
+geometric types might store a bounding box. The NULL bits are always present.
+
+These null bits are stored in a single null bitmask of length 2x number of
+columns.
+
+With the default INDEX_MAX_KEYS of 32, and considering columns of 8-byte length
+types such as timestamptz or bigint, each tuple would be 522 bytes in length,
+which seems reasonable. There are 6 extra bytes for padding between the null
+bitmask and the first data item, assuming 64-bit alignment; so the total size
+for such an index would actually be 528 bytes.
+
+This maximum index tuple size is calculated as: mt_info (2 bytes) + null bitmap
+(8 bytes) + data value (8 bytes) * 32 * 2
+
+(Of course, larger columns are possible, such as varchar, but creating minmax
+indexes on such columns seems of little practical usefulness. Also, the
+usefulness of an index containing so many columns is dubious.)
+
+There can be gaps where some pages have no covering index entry.
+
+The Range Reverse Map
+---------------------
+
+To find out the index tuple for a particular page range, we have an internal
+structure we call the range reverse map. This stores one TID per page range,
+which is the address of the index tuple summarizing that range. Since these
+map entries are fixed size, it is possible to compute the address of the range
+map entry for any given heap page by simple arithmetic.
+
+When a new heap tuple is inserted in a summarized page range, we compare the
+existing index tuple with the new heap tuple. If the heap tuple is outside the
+summarization data given by the index tuple for any indexed column (or if the
+new heap tuple contains null values but the index tuple indicates there are no
+nulls), it is necessary to create a new index tuple with the new values. To do
+this, a new index tuple is inserted, and the reverse range map is updated to
+point to it. The old index tuple is left in place, for later garbage
+collection. As an optimization, we sometimes overwrite the old index tuple in
+place with the new data, which avoids the need for later garbage collection.
+
+If the reverse range map points to an invalid TID, the corresponding page range
+is considered to be not summarized.
+
+To scan a table following a minmax index, we scan the reverse range map
+sequentially. This yields index tuples in ascending page range order. Query
+quals are matched to each index tuple; if they match, each page within the page
+range is returned as part of the output TID bitmap. If there's no match, they
+are skipped. Reverse range map entries returning invalid index TIDs, that is
+unsummarized page ranges, are also returned in the TID bitmap.
+
+To store the range reverse map, we map its logical page numbers to physical
+pages. We use a large two-level BlockNumber array for this: The metapage
+contains an array of BlockNumbers; each of these points to a "revmap array
+page". Each revmap array page contains BlockNumbers, which in turn point to
+"revmap regular pages", which are the ones that contain the revmap data itself.
+Therefore, to find a given index tuple, we need to examine the metapage and
+obtain the revmap array page number; then read the array page. From there we
+obtain the revmap regular page number, and that one contains the TID we're
+interested in. As an optimization, regular revmap page number 0 is stored in
+physical page number 1, that is, the page just after the metapage. This means
+that scanning a table of about 1300 page ranges (the number of TIDs that fit in
+a single 8kB page) does not require accessing the metapage at all.
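+
+A sketch of the address arithmetic (illustrative variable names, not the exact
+macros used in the code):
+
+    revmapIdx   = heapBlk / pagesPerRange;        /* which TID slot overall */
+    logicalPage = revmapIdx / REGULAR_REVMAP_PAGE_MAXITEMS;
+    slotInPage  = revmapIdx % REGULAR_REVMAP_PAGE_MAXITEMS;
+
+The logical page number is then translated to a physical block through the
+array pages listed in the metapage, except that logical page 0 always lives in
+physical block 1, as described above.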
+
+When tuples are added to unsummarized pages, nothing needs to happen.
+
+Heap tuples can be removed from anywhere without restriction. It might be
+useful to mark the corresponding index tuple somehow, if the heap tuple is one
+of the constraining values of the summary data (i.e. either min or max in the
+case of a btree-opclass-bearing datatype), so that in the future we are aware
+of the need to re-execute summarization on that range, leading to a possible
+tightening of the summary values.
+
+Index entries that are not referenced from the revmap can be removed from the
+main fork. This currently happens at amvacuumcleanup, though it could be
+carried out separately; no heap scan is necessary to determine which tuples
+are unreachable.
+
+Summarization
+-------------
+
+At index creation time, the whole table is scanned; for each page range the
+summarizing values of each indexed column and nulls bitmap are collected and
+stored in the index.
+
+Once in a while, it is necessary to summarize a bunch of unsummarized pages
+(because the table has grown since the index was created), or re-summarize a
+range that has been marked invalid. This is simple: scan the page range
+calculating the summary values for each indexed column, then insert the new
+index entry at the end of the index.
+
+The easiest way to handle this seems to be to have vacuum do it. That way we can
+simply do re-summarization on the amvacuumcleanup routine. Other answers would
+mean we need a separate AM routine, which appears unwarranted at this stage.
+
+Vacuuming
+---------
+
+Vacuuming a table that has a minmax index does not represent a significant
+challenge. Since no heap TIDs are stored, it's not necessary to scan the index
+when heap tuples are removed. It might be that some min() value can be
+incremented, or some max() value can be decremented; but this would represent
+an optimization opportunity only, not a correctness issue. Perhaps it's
+simpler to represent this as the need to re-run summarization on the affected
+page range.
+
+Note that if there are no indexes on the table other than the minmax index,
+usage of maintenance_work_mem by vacuum can be decreased significantly, because
+no detailed index scan needs to take place (and thus it's not necessary for
+vacuum to save TIDs to remove). This optimization opportunity is best left for
+future improvement.
+
+Locking considerations
+----------------------
+
+To read the TID during an index scan, we follow this protocol:
+
+* read revmap page
+* obtain share lock on the revmap buffer
+* read the TID
+* obtain share lock on buffer of main fork
+* LockTuple the TID (using the index as relation). A shared lock is
+ sufficient. We need the LockTuple to prevent VACUUM from recycling
+ the index tuple; see below.
+* release revmap buffer lock
+* read the index tuple
+* release the tuple lock
+* release main fork buffer lock
+
+
+To update the summary tuple for a page range, we use this protocol:
+
+* insert a new index tuple somewhere in the main fork; note its TID
+* read revmap page
+* obtain exclusive lock on revmap buffer
+* write the TID
+* release lock
+
+This ensures no concurrent reader can obtain a partially-written TID.
+Note we don't need a tuple lock here. Concurrent scans don't have to
+worry about whether they got the old or new index tuple: if they get the
+old one, the tighter values are okay from a correctness standpoint because
+due to MVCC they can't possibly see the just-inserted heap tuples anyway.
+
+
+For vacuuming, we need to figure out which index tuples are no longer
+referenced from the reverse range map. This requires some brute force,
+but is simple:
+
+1) scan the complete index, store each existing TID in a dynahash.
+ Hash key is the TID, hash value is a boolean initially set to false.
+2) scan the complete revmap sequentially, read the TIDs on each page. Share
+ lock on each page is sufficient. For each TID so obtained, grab the
+ element from the hash and update the boolean to true.
+3) Scan the index again; for each tuple found, search the hash table.
+ If the tuple is not present in hash, it must have been added after our
+ initial scan; ignore it. If tuple is present in hash, and the hash flag is
+ true, then the tuple is referenced from the revmap; ignore it. If the hash
+ flag is false, then the index tuple is no longer referenced by the revmap;
+ but it could be about to be accessed by a concurrent scan. Do
+ ConditionalLockTuple. If this fails, ignore the tuple (it's in use); it
+ will be deleted by a future vacuum. If the lock is acquired, then we can safely
+ remove the index tuple.
+4) Index pages with free space can be detected by this second scan. Register
+ those with the FSM.
+
+Note this doesn't require scanning the heap at all, or being involved in
+the heap's cleanup procedure. Also, there is no need to LockBufferForCleanup,
+which is a nice property because index scans keep pages pinned for long
+periods.
+
+
+
+Optimizer
+---------
+
+In order to make this all work, the only thing we need to do is ensure we have a
+good enough opclass and amcostestimate. With this, the optimizer is able to pick
+up the index on its own.
+
+
+Open questions
+--------------
+
+* Same-size page ranges?
+ Current related literature seems to consider that each "index entry" in a
+ minmax index must cover the same number of pages. There doesn't seem to be a
+ hard reason for this to be so; it might make sense to allow the index to
+ self-tune so that some index entries cover smaller page ranges, if this allows
+ the summary values to be more compact. This would incur larger minmax
+ overhead for the index itself, but might allow better pruning of page ranges
+ during scan. In the limit of one index tuple per page, the index itself would
+ occupy too much space, even though we would be able to skip reading most
+ heap pages, because the summary values are tight; in the opposite limit of
+ a single tuple that summarizes the whole table, we wouldn't be able to prune
+ anything even though the index is very small. This can probably be made to work
+ by using the reverse range map as an index in itself.
+
+* More compact representation for TIDBitmap?
+ TIDBitmap is the structure used to represent bitmap scans. The
+ representation of lossy page ranges is not optimal for our purposes, because
+ it uses a Bitmapset to represent pages in the range; since we're going to return
+ all pages in a large range, it might be more convenient to allow for a
+ struct that uses start and end page numbers to represent the range, instead.
+
+
+
+References:
+
+Email thread on pgsql-hackers
+ http://www.postgresql.org/message-id/1199296574.7260.149.camel@ebony.site
+ From: Simon Riggs
+ To: pgsql-hackers
+ Subject: Dynamic Partitioning using Segment Visibility Map
+
+http://wiki.postgresql.org/wiki/Segment_Exclusion
+http://wiki.postgresql.org/wiki/Segment_Visibility_Map
+
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index c32088f..db46539 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/access
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-SUBDIRS = common gin gist hash heap index nbtree rmgrdesc spgist transam
+SUBDIRS = common gin gist hash heap index minmax nbtree rmgrdesc spgist transam
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index c7ad6f9..1bef404 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -209,6 +209,13 @@ static relopt_int intRelOpts[] =
RELOPT_KIND_HEAP | RELOPT_KIND_TOAST
}, -1, 0, 2000000000
},
+ {
+ {
+ "pages_per_range",
+ "Number of pages that each page range covers in a Minmax index",
+ RELOPT_KIND_MINMAX
+ }, 128, 1, 131072
+ },
/* list terminator */
{{NULL}}
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d731f98..78f35b9 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -271,6 +271,8 @@ initscan(HeapScanDesc scan, ScanKey key, bool is_rescan)
scan->rs_startblock = 0;
}
+ scan->rs_initblock = 0;
+ scan->rs_numblocks = InvalidBlockNumber;
scan->rs_inited = false;
scan->rs_ctup.t_data = NULL;
ItemPointerSetInvalid(&scan->rs_ctup.t_self);
@@ -296,6 +298,14 @@ initscan(HeapScanDesc scan, ScanKey key, bool is_rescan)
pgstat_count_heap_scan(scan->rs_rd);
}
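+
+/*
+ * heap_setscanlimits - restrict a heap scan to a contiguous block range
+ *
+ * Must be called after heap_beginscan() and before any tuples are fetched;
+ * the scan then covers numBlks blocks starting at startBlk.  For instance
+ * (variable names are illustrative):
+ *
+ *     scan = heap_beginscan(heapRel, snapshot, 0, NULL);
+ *     heap_setscanlimits(scan, rangeStart, pagesPerRange);
+ */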
+void
+heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk, BlockNumber numBlks)
+{
+ scan->rs_startblock = startBlk;
+ scan->rs_initblock = startBlk;
+ scan->rs_numblocks = numBlks;
+}
+
/*
* heapgetpage - subroutine for heapgettup()
*
@@ -636,7 +646,8 @@ heapgettup(HeapScanDesc scan,
*/
if (backward)
{
- finished = (page == scan->rs_startblock);
+ finished = (page == scan->rs_startblock) ||
+ (scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
if (page == 0)
page = scan->rs_nblocks;
page--;
@@ -646,7 +657,8 @@ heapgettup(HeapScanDesc scan,
page++;
if (page >= scan->rs_nblocks)
page = 0;
- finished = (page == scan->rs_startblock);
+ finished = (page == scan->rs_startblock) ||
+ (scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
/*
* Report our new scan position for synchronization purposes. We
@@ -897,7 +909,8 @@ heapgettup_pagemode(HeapScanDesc scan,
*/
if (backward)
{
- finished = (page == scan->rs_startblock);
+ finished = (page == scan->rs_startblock) ||
+ (scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
if (page == 0)
page = scan->rs_nblocks;
page--;
@@ -907,7 +920,8 @@ heapgettup_pagemode(HeapScanDesc scan,
page++;
if (page >= scan->rs_nblocks)
page = 0;
- finished = (page == scan->rs_startblock);
+ finished = (page == scan->rs_startblock) ||
+ (scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
/*
* Report our new scan position for synchronization purposes. We
diff --git a/src/backend/access/minmax/Makefile b/src/backend/access/minmax/Makefile
new file mode 100644
index 0000000..2c80a20
--- /dev/null
+++ b/src/backend/access/minmax/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for access/minmax
+#
+# IDENTIFICATION
+# src/backend/access/minmax/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/minmax
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = minmax.o mmrevmap.o mmtuple.o mmxlog.o mmsortable.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/minmax/minmax.c b/src/backend/access/minmax/minmax.c
new file mode 100644
index 0000000..b07a90b
--- /dev/null
+++ b/src/backend/access/minmax/minmax.c
@@ -0,0 +1,1568 @@
+/*
+ * minmax.c
+ * Implementation of Minmax indexes for Postgres
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/minmax.c
+ *
+ * TODO
+ * * support collatable datatypes
+ * * ScalarArrayOpExpr
+ * * Make use of the stored NULL bits
+ * * we can support unlogged indexes now
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/minmax.h"
+#include "access/minmax_internal.h"
+#include "access/minmax_page.h"
+#include "access/minmax_revmap.h"
+#include "access/minmax_tuple.h"
+#include "access/minmax_xlog.h"
+#include "access/reloptions.h"
+#include "access/relscan.h"
+#include "access/xlogutils.h"
+#include "catalog/index.h"
+#include "catalog/pg_operator.h"
+#include "commands/vacuum.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/bufmgr.h"
+#include "storage/freespace.h"
+#include "storage/lmgr.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
+#include "utils/memutils.h"
+#include "utils/syscache.h"
+
+
+/*
+ * We use a MMBuildState during initial construction of a Minmax index.
+ * The running state is kept in a DeformedMMTuple.
+ */
+typedef struct MMBuildState
+{
+ Relation irel;
+ int numtuples;
+ Buffer currentInsertBuf;
+ BlockNumber pagesPerRange;
+ BlockNumber currRangeStart;
+ mmRevmapAccess *rmAccess;
+ MinmaxDesc *mmDesc;
+ DeformedMMTuple *dtuple;
+} MMBuildState;
+
+/*
+ * Struct used as "opaque" during index scans
+ */
+typedef struct MinmaxOpaque
+{
+ BlockNumber pagesPerRange;
+ mmRevmapAccess *rmAccess;
+ MinmaxDesc *mmDesc;
+} MinmaxOpaque;
+
+static MinmaxDesc *minmax_build_mmdesc(Relation rel);
+static MMBuildState *initialize_mm_buildstate(Relation idxRel,
+ mmRevmapAccess *rmAccess, BlockNumber pagesPerRange);
+static void remove_deletable_tuples(Relation idxRel, BlockNumber heapNumBlocks,
+ BufferAccessStrategy strategy,
+ BlockNumber **nonsummed, int *numnonsummed);
+static void rerun_summarization(Relation idxRel, Relation heapRel,
+ mmRevmapAccess *rmAccess, BlockNumber pagesPerRange,
+ BlockNumber *nonsummarized, int numnonsummarized);
+static void mm_doinsert(Relation idxrel, mmRevmapAccess *rmAccess,
+ Buffer *buffer, BlockNumber heapblkno, MMTuple *tup, Size itemsz);
+static bool mm_getinsertbuffer(Relation irel, Buffer *buffer, Size itemsz);
+static void form_and_insert_tuple(MMBuildState *mmstate);
+static int qsortCompareItemPointers(const void *a, const void *b);
+
+
+/*
+ * A tuple in the heap is being inserted. To keep a minmax index up to date,
+ * we need to obtain the relevant index tuple, compare its min()/max() stored
+ * values with those of the new tuple; if the tuple values are in range,
+ * there's nothing to do; otherwise we need to update the index (either by
+ * inserting a new index tuple and repointing the revmap, or by overwriting the existing
+ * index tuple).
+ *
+ * If the range is not currently summarized (i.e. the revmap returns InvalidTid
+ * for it), there's nothing to do either.
+ */
+Datum
+mminsert(PG_FUNCTION_ARGS)
+{
+ Relation idxRel = (Relation) PG_GETARG_POINTER(0);
+ Datum *values = (Datum *) PG_GETARG_POINTER(1);
+ bool *nulls = (bool *) PG_GETARG_POINTER(2);
+ ItemPointer heaptid = (ItemPointer) PG_GETARG_POINTER(3);
+
+ /* we ignore the rest of our arguments */
+ MinmaxDesc *mmdesc;
+ mmRevmapAccess *rmAccess;
+ ItemId origlp;
+ MMTuple *mmtup;
+ DeformedMMTuple *dtup;
+ ItemPointerData idxtid;
+ BlockNumber heapBlk;
+ BlockNumber iblk;
+ OffsetNumber ioff;
+ Buffer buf;
+ IndexInfo *indexInfo;
+ Page page;
+ int keyno;
+ bool need_insert = false;
+
+ rmAccess = mmRevmapAccessInit(idxRel, NULL);
+
+ heapBlk = ItemPointerGetBlockNumber(heaptid);
+ mmGetHeapBlockItemptr(rmAccess, heapBlk, &idxtid);
+ /* tuple lock on idxtid is grabbed by mmGetHeapBlockItemptr */
+
+ if (!ItemPointerIsValid(&idxtid))
+ {
+ /* nothing to do, range is unsummarized */
+ mmRevmapAccessTerminate(rmAccess);
+ return BoolGetDatum(false);
+ }
+
+ indexInfo = BuildIndexInfo(idxRel);
+ mmdesc = minmax_build_mmdesc(idxRel);
+
+ iblk = ItemPointerGetBlockNumber(&idxtid);
+ ioff = ItemPointerGetOffsetNumber(&idxtid);
+ Assert(iblk != InvalidBlockNumber);
+ buf = ReadBuffer(idxRel, iblk);
+
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ UnlockTuple(idxRel, &idxtid, ShareLock);
+ page = BufferGetPage(buf);
+ origlp = PageGetItemId(page, ioff);
+ mmtup = (MMTuple *) PageGetItem(page, origlp);
+
+ dtup = minmax_deform_tuple(mmdesc, mmtup);
+
+ /*
+ * Compare the key values of the new tuple to the stored index values; our
+ * deformed tuple will get updated if the new tuple doesn't fit the
+ * original range (note this means we can't break out of the loop early).
+ * Make a note of whether this happens, so that we know to insert the
+ * modified tuple later.
+ */
+ for (keyno = 0; keyno < indexInfo->ii_NumIndexAttrs; keyno++)
+ {
+ Datum result;
+ FmgrInfo *addValue;
+
+ addValue = index_getprocinfo(idxRel, keyno + 1,
+ MINMAX_PROCNUM_ADDVALUE);
+
+ result = FunctionCall5Coll(addValue,
+ PG_GET_COLLATION(),
+ PointerGetDatum(mmdesc),
+ PointerGetDatum(dtup),
+ UInt16GetDatum(keyno + 1),
+ values[keyno],
+ nulls[keyno]);
+ /* if that returned true, we need to insert the updated tuple */
+ need_insert |= DatumGetBool(result);
+ }
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ if (need_insert)
+ {
+ Size tupsz;
+ MMTuple *tup;
+
+ tup = minmax_form_tuple(mmdesc, dtup, &tupsz);
+
+ /*
+ * If the size of the original tuple is greater than or equal to the new
+ * index tuple, we can overwrite. This saves regular page bloat, and
+ * also saves revmap traffic. This might leave some unused space
+ * before the start of the next tuple, but we don't worry about that
+ * here.
+ *
+ * We avoid doing this when the itempointer of the index tuple would
+ * change, because that would require an update to the revmap while
+ * holding exclusive lock on this page, which would reduce concurrency.
+ *
+ * Note that we continue to access 'origlp' here, even though there
+ * was an interval during which the page wasn't locked. Since we hold
+ * pin on the page, this is okay -- the buffer cannot go away from
+ * under us, and also tuples cannot be shuffled around.
+ */
+ if (tupsz <= ItemIdGetLength(origlp))
+ {
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ START_CRIT_SECTION();
+ PageOverwriteItemData(BufferGetPage(buf),
+ ioff,
+ (Item) tup, tupsz);
+ MarkBufferDirty(buf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(idxRel))
+ {
+ xl_minmax_insert xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+ uint8 info = XLOG_MINMAX_INSERT;
+
+ xlrec.target.node = idxRel->rd_node;
+ xlrec.target.tid = idxtid;
+ xlrec.overwrite = true;
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxInsert;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = (char *) tup;
+ rdata[1].len = tupsz;
+ rdata[1].buffer = buf;
+ rdata[1].buffer_std = true;
+ rdata[1].next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ }
+ else
+ {
+ /*
+ * The new tuple is larger than the original one, so we must insert
+ * a new one the slow way.
+ */
+ mm_doinsert(idxRel, rmAccess, &buf, heapBlk, tup, tupsz);
+
+#ifdef NOT_YET
+ /*
+ * Possible optimization: if we can grab an exclusive lock on the
+ * buffer containing the old tuple right away, we can also seize
+ * the opportunity to prune the old tuple and avoid some bloat.
+ * This is not necessary for correctness.
+ */
+ if (ConditionalLockBuffer(buf))
+ {
+ /* prune the old tuple */
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ }
+#endif
+ }
+ }
+
+ ReleaseBuffer(buf);
+
+ mmRevmapAccessTerminate(rmAccess);
+
+ return BoolGetDatum(false);
+}
+
+/*
+ * Initialize state for a Minmax index scan.
+ *
+ * We read the metapage here to determine the pages-per-range number that this
+ * index was built with. Note that since this cannot be changed while we're
+ * holding lock on index, it's not necessary to recompute it during mmrescan.
+ */
+Datum
+mmbeginscan(PG_FUNCTION_ARGS)
+{
+ Relation r = (Relation) PG_GETARG_POINTER(0);
+ int nkeys = PG_GETARG_INT32(1);
+ int norderbys = PG_GETARG_INT32(2);
+ IndexScanDesc scan;
+ MinmaxOpaque *opaque;
+
+ scan = RelationGetIndexScan(r, nkeys, norderbys);
+
+ opaque = (MinmaxOpaque *) palloc(sizeof(MinmaxOpaque));
+ opaque->rmAccess = mmRevmapAccessInit(r, &opaque->pagesPerRange);
+ scan->opaque = opaque;
+
+ PG_RETURN_POINTER(scan);
+}
+
+/*
+ * Execute the index scan.
+ *
+ * This works by reading index TIDs from the revmap, and obtaining the index
+ * tuples pointed to by them; the summary values in the index tuples are
+ * compared to the scan keys. We return into the TID bitmap all the pages in
+ * ranges corresponding to index tuples that match the scan keys.
+ *
+ * If a TID from the revmap is read as InvalidTID, we know that range is
+ * unsummarized. Pages in those ranges need to be returned regardless of scan
+ * keys.
+ *
+ * XXX see _bt_first for more ideas on processing the scan key.
+ */
+Datum
+mmgetbitmap(PG_FUNCTION_ARGS)
+{
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+ TIDBitmap *tbm = (TIDBitmap *) PG_GETARG_POINTER(1);
+ Relation idxRel = scan->indexRelation;
+ Buffer currIdxBuf = InvalidBuffer;
+ MinmaxDesc *mmdesc = minmax_build_mmdesc(idxRel);
+ Oid heapOid;
+ Relation heapRel;
+ MinmaxOpaque *opaque;
+ BlockNumber nblocks;
+ BlockNumber heapBlk;
+ int totalpages = 0;
+ int keyno;
+ FmgrInfo *consistentFn;
+
+ opaque = (MinmaxOpaque *) scan->opaque;
+ pgstat_count_index_scan(idxRel);
+
+ /*
+ * XXX We need to know the size of the table so that we know how long to
+ * iterate on the revmap. There's room for improvement here, in that we
+ * could have the revmap tell us when to stop iterating.
+ */
+ heapOid = IndexGetRelation(RelationGetRelid(idxRel), false);
+ heapRel = heap_open(heapOid, AccessShareLock);
+ nblocks = RelationGetNumberOfBlocks(heapRel);
+ heap_close(heapRel, AccessShareLock);
+
+ /*
+ * Obtain consistent functions for all indexed columns. Maybe it'd be
+ * possible to do this lazily only the first time we see a scan key that
+ * involves each particular attribute.
+ */
+ consistentFn = palloc(sizeof(FmgrInfo) * mmdesc->md_tupdesc->natts);
+ for (keyno = 0; keyno < mmdesc->md_tupdesc->natts; keyno++)
+ {
+ FmgrInfo *tmp;
+
+ tmp = index_getprocinfo(idxRel, keyno + 1, MINMAX_PROCNUM_CONSISTENT);
+ fmgr_info_copy(&consistentFn[keyno], tmp, CurrentMemoryContext);
+ }
+
+ /*
+ * Now scan the revmap. We start by querying for heap page 0,
+ * incrementing by the number of pages per range; this gives us a full
+ * view of the table.
+ */
+ for (heapBlk = 0; heapBlk < nblocks; heapBlk += opaque->pagesPerRange)
+ {
+ ItemPointerData itupptr;
+ bool addrange;
+
+ mmGetHeapBlockItemptr(opaque->rmAccess, heapBlk, &itupptr);
+
+ /*
+ * For revmap items that return InvalidTID, we must return the whole
+ * range; otherwise, fetch the index item and compare it to the scan
+ * keys.
+ */
+ if (!ItemPointerIsValid(&itupptr))
+ {
+ addrange = true;
+ }
+ else
+ {
+ Page page;
+ OffsetNumber idxoffno;
+ BlockNumber idxblkno;
+ MMTuple *tup;
+ DeformedMMTuple *dtup;
+ int keyno;
+
+ /*
+ * Obtain the buffer that contains the tuple. We might already
+ * have it pinned.
+ */
+ idxoffno = ItemPointerGetOffsetNumber(&itupptr);
+ idxblkno = ItemPointerGetBlockNumber(&itupptr);
+ if (currIdxBuf == InvalidBuffer ||
+ idxblkno != BufferGetBlockNumber(currIdxBuf))
+ {
+ if (currIdxBuf != InvalidBuffer)
+ UnlockReleaseBuffer(currIdxBuf);
+
+ Assert(idxblkno != InvalidBlockNumber);
+ currIdxBuf = ReadBuffer(idxRel, idxblkno);
+ LockBuffer(currIdxBuf, BUFFER_LOCK_SHARE);
+ }
+
+ /*
+ * We now have the containing buffer locked, so we can release the
+ * tuple lock.
+ */
+ UnlockTuple(idxRel, &itupptr, ShareLock);
+
+ page = BufferGetPage(currIdxBuf);
+ tup = (MMTuple *) PageGetItem(page, PageGetItemId(page, idxoffno));
+ dtup = minmax_deform_tuple(mmdesc, tup);
+
+ /*
+ * Compare scan keys with summary values stored for the range. If
+ * scan keys are matched, the page range must be added to the
+ * bitmap. We initially assume the range needs to be added; in
+ * particular this serves the case where there are no keys.
+ */
+ addrange = true;
+ for (keyno = 0; keyno < scan->numberOfKeys; keyno++)
+ {
+ ScanKey key = &scan->keyData[keyno];
+ AttrNumber keyattno = key->sk_attno;
+ Datum add;
+
+ /*
+ * The collation of the scan key must match the collation used
+ * in the index column. Otherwise we shouldn't be using this
+ * index ...
+ */
+ Assert(key->sk_collation ==
+ mmdesc->md_tupdesc->attrs[keyattno - 1]->attcollation);
+
+ /*
+ * Check whether the scan key is consistent with the page range
+ * values; if so, have the pages in the range added to the
+ * output bitmap.
+ *
+ * When there are multiple scan keys, failure to meet the
+ * criteria for a single one of them is enough to discard the
+ * range as a whole, so break out of the loop as soon as a
+ * false return value is obtained.
+ */
+ add = FunctionCall5Coll(&consistentFn[keyattno - 1],
+ key->sk_collation,
+ PointerGetDatum(mmdesc),
+ PointerGetDatum(dtup),
+ Int16GetDatum(keyattno),
+ UInt16GetDatum(key->sk_strategy),
+ key->sk_argument);
+ addrange = DatumGetBool(add);
+ if (!addrange)
+ break;
+ }
+
+ pfree(dtup);
+ }
+
+ /* add the pages in the range to the output bitmap, if needed */
+ if (addrange)
+ {
+ BlockNumber pageno;
+
+ for (pageno = heapBlk;
+ pageno <= heapBlk + opaque->pagesPerRange - 1;
+ pageno++)
+ {
+ tbm_add_page(tbm, pageno);
+ totalpages++;
+ }
+ }
+ }
+
+ if (currIdxBuf != InvalidBuffer)
+ UnlockReleaseBuffer(currIdxBuf);
+
+ /*
+ * XXX We have an approximation of the number of *pages* that our scan
+ * returns, but we don't have a precise idea of the number of heap tuples
+ * involved.
+ */
+ PG_RETURN_INT64(totalpages * 10);
+}
+
+/*
+ * Re-initialize state for a minmax index scan
+ */
+Datum
+mmrescan(PG_FUNCTION_ARGS)
+{
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+ ScanKey scankey = (ScanKey) PG_GETARG_POINTER(1);
+ /* other arguments ignored */
+
+ if (scankey && scan->numberOfKeys > 0)
+ memmove(scan->keyData, scankey,
+ scan->numberOfKeys * sizeof(ScanKeyData));
+
+ PG_RETURN_VOID();
+}
+
+/*
+ * Close down a minmax index scan
+ */
+Datum
+mmendscan(PG_FUNCTION_ARGS)
+{
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+ MinmaxOpaque *opaque = (MinmaxOpaque *) scan->opaque;
+
+ mmRevmapAccessTerminate(opaque->rmAccess);
+ pfree(opaque);
+
+ PG_RETURN_VOID();
+}
+
+Datum
+mmmarkpos(PG_FUNCTION_ARGS)
+{
+ elog(ERROR, "MinMax does not support mark/restore");
+ PG_RETURN_VOID();
+}
+
+Datum
+mmrestrpos(PG_FUNCTION_ARGS)
+{
+ elog(ERROR, "MinMax does not support mark/restore");
+ PG_RETURN_VOID();
+}
+
+/*
+ * Per-heap-tuple callback for IndexBuildHeapScan.
+ *
+ * Note we don't worry about the page range at the end of the table here; it is
+ * present in the build state struct after we're called the last time, but not
+ * inserted into the index. The caller must insert it, if appropriate.
+ */
+static void
+mmbuildCallback(Relation index,
+ HeapTuple htup,
+ Datum *values,
+ bool *isnull,
+ bool tupleIsAlive,
+ void *state)
+{
+ MMBuildState *mmstate = (MMBuildState *) state;
+ BlockNumber thisblock;
+ int i;
+
+ thisblock = ItemPointerGetBlockNumber(&htup->t_self);
+
+ /*
+ * If we're in a new block which belongs to the next range, summarize what
+ * we've got and start afresh.
+ */
+ if (thisblock > (mmstate->currRangeStart + mmstate->pagesPerRange - 1))
+ {
+
+ MINMAX_elog(DEBUG2, "mmbuildCallback: completed a range: %u--%u",
+ mmstate->currRangeStart,
+ mmstate->currRangeStart + mmstate->pagesPerRange);
+
+ /* create the index tuple and insert it */
+ form_and_insert_tuple(mmstate);
+
+ /* set state to correspond to the next range */
+ mmstate->currRangeStart += mmstate->pagesPerRange;
+
+ /* re-initialize state for it */
+ minmax_dtuple_initialize(mmstate->dtuple, mmstate->mmDesc);
+ }
+
+ /* Accumulate the current tuple into the running state */
+ mmstate->dtuple->dt_seentup = true;
+ for (i = 0; i < mmstate->mmDesc->md_tupdesc->natts; i++)
+ {
+ FmgrInfo *addValue;
+
+ /* FIXME must be cached somewhere */
+ addValue = index_getprocinfo(index, i + 1,
+ MINMAX_PROCNUM_ADDVALUE);
+
+ /*
+ * Update dtuple state, if and as necessary.
+ */
+ FunctionCall5Coll(addValue,
+ mmstate->mmDesc->md_tupdesc->attrs[i]->attcollation,
+ PointerGetDatum(mmstate->mmDesc),
+ PointerGetDatum(mmstate->dtuple),
+ UInt16GetDatum(i + 1), values[i], isnull[i]);
+ }
+}
+
+/*
+ * mmbuild() -- build a new minmax index.
+ */
+Datum
+mmbuild(PG_FUNCTION_ARGS)
+{
+ Relation heap = (Relation) PG_GETARG_POINTER(0);
+ Relation index = (Relation) PG_GETARG_POINTER(1);
+ IndexInfo *indexInfo = (IndexInfo *) PG_GETARG_POINTER(2);
+ IndexBuildResult *result;
+ double reltuples;
+ mmRevmapAccess *rmAccess;
+ MMBuildState *mmstate;
+ Buffer meta;
+ BlockNumber pagesPerRange;
+
+ /*
+ * We expect to be called exactly once for any index relation.
+ */
+ if (RelationGetNumberOfBlocks(index) != 0)
+ elog(ERROR, "index \"%s\" already contains data",
+ RelationGetRelationName(index));
+
+ /* partial indexes not supported */
+ if (indexInfo->ii_Predicate != NIL)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("partial indexes not supported")));
+ /* expressions not supported (yet?) */
+ if (indexInfo->ii_Expressions != NIL)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("expression indexes not supported")));
+
+ meta = mm_getnewbuffer(index);
+ START_CRIT_SECTION();
+ mm_metapage_init(BufferGetPage(meta), MinmaxGetPagesPerRange(index),
+ MINMAX_CURRENT_VERSION);
+ MarkBufferDirty(meta);
+
+ if (RelationNeedsWAL(index))
+ {
+ xl_minmax_createidx xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata;
+ Page page;
+
+ xlrec.node = index->rd_node;
+ xlrec.version = MINMAX_CURRENT_VERSION;
+ xlrec.pagesPerRange = MinmaxGetPagesPerRange(index);
+
+ rdata.buffer = InvalidBuffer;
+ rdata.data = (char *) &xlrec;
+ rdata.len = SizeOfMinmaxCreateIdx;
+ rdata.next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_CREATE_INDEX, &rdata);
+
+ page = BufferGetPage(meta);
+ PageSetLSN(page, recptr);
+ }
+
+ UnlockReleaseBuffer(meta);
+ END_CRIT_SECTION();
+
+ /*
+ * Set up an empty revmap, and get access to it
+ */
+ mmRevmapCreate(index);
+ rmAccess = mmRevmapAccessInit(index, &pagesPerRange);
+
+ /*
+ * Initialize our state, including the deformed tuple state.
+ */
+ mmstate = initialize_mm_buildstate(index, rmAccess, pagesPerRange);
+
+ /*
+ * Now scan the relation. No syncscan allowed here because we want the
+ * heap blocks in physical order.
+ */
+ reltuples = IndexBuildHeapScan(heap, index, indexInfo, false,
+ mmbuildCallback, (void *) mmstate);
+
+ /* process the final batch */
+ form_and_insert_tuple(mmstate);
+
+ /* release the last index buffer used */
+ if (!BufferIsInvalid(mmstate->currentInsertBuf))
+ {
+ ReleaseBuffer(mmstate->currentInsertBuf);
+ mmstate->currentInsertBuf = InvalidBuffer;
+ }
+
+ mmRevmapAccessTerminate(mmstate->rmAccess);
+
+ /*
+ * Return statistics
+ */
+ result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+
+ result->heap_tuples = reltuples;
+ result->index_tuples = mmstate->numtuples;
+
+ PG_RETURN_POINTER(result);
+}
+
+Datum
+mmbuildempty(PG_FUNCTION_ARGS)
+{
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("unlogged MinMax indexes are not supported")));
+
+ PG_RETURN_VOID();
+}
+
+/*
+ * mmbulkdelete
+ * Since there are no per-heap-tuple index tuples in minmax indexes,
+ * there's not a lot we can do here.
+ *
+ * XXX we could mark item tuples as "dirty" (when a minimum or maximum heap
+ * tuple is deleted), meaning the need to re-run summarization on the affected
+ * range. We would need an extra flag in mmtuples for that.
+ */
+Datum
+mmbulkdelete(PG_FUNCTION_ARGS)
+{
+ /* other arguments are not currently used */
+ IndexBulkDeleteResult *stats = (IndexBulkDeleteResult *) PG_GETARG_POINTER(1);
+
+ /* allocate stats if first time through, else re-use existing struct */
+ if (stats == NULL)
+ stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+ PG_RETURN_POINTER(stats);
+}
+
+/*
+ * This routine is in charge of "vacuuming" a minmax index: 1) remove index
+ * tuples that are no longer referenced from the revmap. 2) summarize ranges
+ * that are currently unsummarized.
+ */
+Datum
+mmvacuumcleanup(PG_FUNCTION_ARGS)
+{
+ IndexVacuumInfo *info = (IndexVacuumInfo *) PG_GETARG_POINTER(0);
+ IndexBulkDeleteResult *stats = (IndexBulkDeleteResult *) PG_GETARG_POINTER(1);
+ mmRevmapAccess *rmAccess;
+ BlockNumber *nonsummarized = NULL;
+ int numnonsummarized;
+ Relation heapRel;
+ BlockNumber heapNumBlocks;
+ BlockNumber pagesPerRange;
+
+ /* No-op in ANALYZE ONLY mode */
+ if (info->analyze_only)
+ PG_RETURN_POINTER(stats);
+
+ rmAccess = mmRevmapAccessInit(info->index, &pagesPerRange);
+
+ heapRel = heap_open(IndexGetRelation(RelationGetRelid(info->index), false),
+ AccessShareLock);
+
+ /*
+ * First scan the index, removing index tuples that are no longer
+ * referenced from the revmap. While at it, collect the page numbers of
+ * ranges that are not summarized.
+ */
+ heapNumBlocks = RelationGetNumberOfBlocks(heapRel);
+ remove_deletable_tuples(info->index, heapNumBlocks, info->strategy,
+ &nonsummarized, &numnonsummarized);
+
+ /* and summarize the ranges collected above */
+ if (nonsummarized)
+ {
+ rerun_summarization(info->index, heapRel, rmAccess, pagesPerRange,
+ nonsummarized, numnonsummarized);
+ pfree(nonsummarized);
+ }
+
+ mmRevmapAccessTerminate(rmAccess);
+ heap_close(heapRel, AccessShareLock);
+
+ PG_RETURN_POINTER(stats);
+}
+
+/*
+ * reloptions processor for minmax indexes
+ */
+Datum
+mmoptions(PG_FUNCTION_ARGS)
+{
+ Datum reloptions = PG_GETARG_DATUM(0);
+ bool validate = PG_GETARG_BOOL(1);
+ relopt_value *options;
+ MinmaxOptions *rdopts;
+ int numoptions;
+ static const relopt_parse_elt tab[] = {
+ {"pages_per_range", RELOPT_TYPE_INT, offsetof(MinmaxOptions, pagesPerRange)}
+ };
+
+ options = parseRelOptions(reloptions, validate, RELOPT_KIND_MINMAX,
+ &numoptions);
+
+ /* if none set, we're done */
+ if (numoptions == 0)
+ PG_RETURN_NULL();
+
+ rdopts = allocateReloptStruct(sizeof(MinmaxOptions), options, numoptions);
+
+ fillRelOptions((void *) rdopts, sizeof(MinmaxOptions), options, numoptions,
+ validate, tab, lengthof(tab));
+
+ pfree(options);
+
+ PG_RETURN_BYTEA_P(rdopts);
+}
+
+/*
+ * Return an exclusively-locked buffer resulting from extending the relation.
+ */
+Buffer
+mm_getnewbuffer(Relation irel)
+{
+ Buffer buffer;
+ bool needLock = !RELATION_IS_LOCAL(irel);
+
+ /* FIXME need to request a MaxFSMRequestSize page from the FSM here */
+
+ if (needLock)
+ LockRelationForExtension(irel, ExclusiveLock);
+
+ buffer = ReadBuffer(irel, P_NEW);
+ LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+ MINMAX_elog(DEBUG2, "mm_getnewbuffer: extending to page %u",
+ BufferGetBlockNumber(buffer));
+
+ if (needLock)
+ UnlockRelationForExtension(irel, ExclusiveLock);
+
+ return buffer;
+}
+
+/*
+ * Initialize a page with the given type.
+ *
+ * Caller is responsible for marking it dirty, as appropriate.
+ */
+void
+mm_page_init(Page page, uint16 type)
+{
+ MinmaxSpecialSpace *special;
+
+ PageInit(page, BLCKSZ, sizeof(MinmaxSpecialSpace));
+
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
+ special->type = type;
+}
+
+/*
+ * Initialize a new minmax index's metapage.
+ */
+void
+mm_metapage_init(Page page, BlockNumber pagesPerRange, uint16 version)
+{
+ MinmaxMetaPageData *metadata;
+ int i;
+
+ mm_page_init(page, MINMAX_PAGETYPE_META);
+
+ metadata = (MinmaxMetaPageData *) PageGetContents(page);
+
+ metadata->minmaxMagic = MINMAX_META_MAGIC;
+ metadata->pagesPerRange = pagesPerRange;
+ metadata->minmaxVersion = version;
+ for (i = 0; i < MAX_REVMAP_ARRAYPAGES; i++)
+ metadata->revmapArrayPages[i] = InvalidBlockNumber;
+}
+
+/*
+ * Build a MinmaxDesc used to create or scan a minmax index
+ */
+static MinmaxDesc *
+minmax_build_mmdesc(Relation rel)
+{
+ MinmaxOpcInfo **opcinfo;
+ MinmaxDesc *mmdesc;
+ TupleDesc tupdesc;
+ int totalstored = 0;
+ int keyno;
+ long totalsize;
+ Datum indclassDatum;
+ oidvector *indclass;
+ bool isnull;
+
+ tupdesc = RelationGetDescr(rel);
+
+ /*
+ * Obtain MinmaxOpcInfo for each indexed column. While at it, accumulate
+ * the number of columns stored, since the number is opclass-defined.
+ */
+ indclassDatum = SysCacheGetAttr(INDEXRELID, rel->rd_indextuple,
+ Anum_pg_index_indclass, &isnull);
+ Assert(!isnull);
+ indclass = (oidvector *) DatumGetPointer(indclassDatum);
+ opcinfo = (MinmaxOpcInfo **) palloc(sizeof(MinmaxOpcInfo *) * tupdesc->natts);
+ for (keyno = 0; keyno < tupdesc->natts; keyno++)
+ {
+ Oid opfam = get_opclass_family(indclass->values[keyno]);
+ Oid idxtypid = tupdesc->attrs[keyno]->atttypid;
+ FmgrInfo *opcInfoFn;
+
+ opcInfoFn = index_getprocinfo(rel, keyno + 1, MINMAX_PROCNUM_OPCINFO);
+
+ opcinfo[keyno] = (MinmaxOpcInfo *)
+ DatumGetPointer(FunctionCall2(opcInfoFn,
+ ObjectIdGetDatum(opfam),
+ ObjectIdGetDatum(idxtypid)));
+ totalstored += opcinfo[keyno]->oi_nstored;
+ }
+
+ /* Allocate our result struct and fill it in */
+ totalsize = offsetof(MinmaxDesc, md_info) +
+ sizeof(MinmaxOpcInfo *) * tupdesc->natts;
+
+ mmdesc = palloc(totalsize);
+ mmdesc->md_index = rel;
+ mmdesc->md_tupdesc = CreateTupleDescCopy(tupdesc);
+ mmdesc->md_disktdesc = NULL; /* generated lazily */
+ mmdesc->md_totalstored = totalstored;
+
+ for (keyno = 0; keyno < tupdesc->natts; keyno++)
+ mmdesc->md_info[keyno] = opcinfo[keyno];
+
+ return mmdesc;
+}
+
+/*
+ * Initialize a MMBuildState appropriate to create tuples on the given index.
+ */
+static MMBuildState *
+initialize_mm_buildstate(Relation idxRel, mmRevmapAccess *rmAccess,
+ BlockNumber pagesPerRange)
+{
+ MMBuildState *mmstate;
+
+ mmstate = palloc(sizeof(MMBuildState));
+
+ mmstate->irel = idxRel;
+ mmstate->numtuples = 0;
+ mmstate->currentInsertBuf = InvalidBuffer;
+ mmstate->pagesPerRange = pagesPerRange;
+ mmstate->currRangeStart = 0;
+ mmstate->rmAccess = rmAccess;
+ mmstate->mmDesc = minmax_build_mmdesc(idxRel);
+ mmstate->dtuple = minmax_new_dtuple(mmstate->mmDesc);
+
+ minmax_dtuple_initialize(mmstate->dtuple, mmstate->mmDesc);
+
+ return mmstate;
+}
+
+/*
+ * Remove index tuples that are no longer useful.
+ *
+ * While at it, return in nonsummed the array (and in numnonsummed its size) of
+ * block numbers for which the revmap returns InvalidTid; this is used in a
+ * later stage to execute re-summarization. (Each block number returned
+ * corresponds to the heap page number with which each unsummarized range
+ * starts.) Space for the array is palloc'ed, and must be freed by caller.
+ *
+ * idxRel is the index relation; heapNumBlocks is the size of the heap
+ * relation; strategy is appropriate for bulk scanning.
+ */
+static void
+remove_deletable_tuples(Relation idxRel, BlockNumber heapNumBlocks,
+ BufferAccessStrategy strategy,
+ BlockNumber **nonsummed, int *numnonsummed)
+{
+ HASHCTL hctl;
+ HTAB *tuples;
+ HASH_SEQ_STATUS status;
+ BlockNumber nblocks;
+ BlockNumber blk;
+ mmRevmapAccess *rmAccess;
+ BlockNumber heapBlk;
+ BlockNumber pagesPerRange;
+ int numitems = 0;
+ int numdeletable = 0;
+ ItemPointerData *deletable;
+ int start;
+ int i;
+ BlockNumber *nonsumm = NULL;
+ int maxnonsumm = 0;
+ int numnonsumm = 0;
+
+ typedef struct DeletableTuple
+ {
+ ItemPointerData tid;
+ bool referenced;
+ } DeletableTuple;
+
+ nblocks = RelationGetNumberOfBlocks(idxRel);
+
+ /* Initialize hash used to track deletable tuples */
+ memset(&hctl, 0, sizeof(hctl));
+ hctl.keysize = sizeof(ItemPointerData);
+ hctl.entrysize = sizeof(DeletableTuple);
+ hctl.hcxt = CurrentMemoryContext;
+ hctl.hash = tag_hash;
+
+ /* assume ten entries per page. No harm in getting this wrong */
+ tuples = hash_create("mmvacuumcleanup", nblocks * 10, &hctl,
+ HASH_CONTEXT | HASH_FUNCTION | HASH_ELEM);
+
+ /*
+ * Scan the index sequentially, entering each item into a hash table.
+ * Initially, the items are marked as not referenced.
+ */
+ for (blk = 0; blk < nblocks; blk++)
+ {
+ Buffer buf;
+ Page page;
+ OffsetNumber offno;
+ MinmaxSpecialSpace *special;
+
+ vacuum_delay_point();
+
+ buf = ReadBufferExtended(idxRel, MAIN_FORKNUM, blk, RBM_NORMAL,
+ strategy);
+ page = BufferGetPage(buf);
+
+ /*
+ * Verify the type of the page we got; if it's not a regular page,
+ * ignore it.
+ */
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
+ if (special->type != MINMAX_PAGETYPE_REGULAR)
+ {
+ ReleaseBuffer(buf);
+ continue;
+ }
+
+ /*
+ * Enter each live tuple into the hash table
+ */
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ for (offno = 1; offno <= PageGetMaxOffsetNumber(page); offno++)
+ {
+ ItemPointerData tid;
+ ItemId itemid;
+ bool found;
+ DeletableTuple *hitem;
+
+ itemid = PageGetItemId(page, offno);
+ if (!ItemIdHasStorage(itemid))
+ continue;
+
+ ItemPointerSet(&tid, blk, offno);
+ hitem = (DeletableTuple *)
+ hash_search(tuples, &tid, HASH_ENTER, &found);
+ Assert(!found);
+ hitem->referenced = false;
+ numitems++;
+ }
+ UnlockReleaseBuffer(buf);
+ }
+
+ /*
+ * Now scan the revmap, and determine which of these TIDs are still
+ * referenced
+ */
+ rmAccess = mmRevmapAccessInit(idxRel, &pagesPerRange);
+ for (heapBlk = 0; heapBlk < heapNumBlocks; heapBlk += pagesPerRange)
+ {
+ ItemPointerData itupptr;
+ DeletableTuple *hitem;
+ bool found;
+
+ mmGetHeapBlockItemptr(rmAccess, heapBlk, &itupptr);
+
+ if (!ItemPointerIsValid(&itupptr))
+ {
+ /*
+ * Ignore revmap entries set to invalid. Before doing so, if the
+ * heap page range is complete but not summarized, store its
+ * initial page number in the unsummarized array, for later
+ * summarization.
+ */
+ if (heapBlk + pagesPerRange < heapNumBlocks)
+ {
+ if (maxnonsumm == 0)
+ {
+ Assert(!nonsumm);
+ maxnonsumm = 8;
+ nonsumm = palloc(sizeof(BlockNumber) * maxnonsumm);
+ }
+ else if (numnonsumm >= maxnonsumm)
+ {
+ maxnonsumm *= 2;
+ nonsumm = repalloc(nonsumm, sizeof(BlockNumber) * maxnonsumm);
+ }
+
+ nonsumm[numnonsumm++] = heapBlk;
+ }
+
+ continue;
+ }
+ else
+ UnlockTuple(idxRel, &itupptr, ShareLock);
+
+ hitem = (DeletableTuple *) hash_search(tuples,
+ &itupptr,
+ HASH_FIND,
+ &found);
+ /*
+ * If the item is not in the hash, it must have been inserted after the
+ * index was scanned, and therefore we should leave things well alone.
+ * (There might be a leftover entry, but it's okay because next vacuum
+ * will remove it.)
+ */
+ if (!found)
+ continue;
+
+ hitem->referenced = true;
+
+ /* discount items set as referenced */
+ numitems--;
+ }
+ Assert(numitems >= 0);
+
+ mmRevmapAccessTerminate(rmAccess);
+
+ /*
+ * Now scan the hash, and keep track of the removable (i.e. not referenced,
+ * not locked) tuples.
+ */
+ deletable = palloc(sizeof(ItemPointerData) * numitems);
+
+ hash_freeze(tuples);
+ hash_seq_init(&status, tuples);
+ for (;;)
+ {
+ DeletableTuple *hitem;
+
+ hitem = hash_seq_search(&status);
+ if (!hitem)
+ break;
+ if (hitem->referenced)
+ continue;
+ if (!ConditionalLockTuple(idxRel, &hitem->tid, ExclusiveLock))
+ continue;
+
+ /*
+ * By here, we know this tuple is not referenced from the revmap.
+ * Also, since we hold the tuple lock, we know that if there is a
+ * concurrent scan that had obtained the tuple before the reference
+ * got removed, either that scan is not looking at the tuple (because
+ * that would have prevented us from getting the tuple lock) or it is
+ * holding the containing buffer's lock. If the former, then there's
+ * no problem with removing the tuple immediately; if the latter, we
+ * will block below trying to acquire that lock, so by the time we are
+ * unblocked, the concurrent scan will no longer be interested in the
+ * tuple contents anymore. Therefore, this tuple can be removed from
+ * the block.
+ */
+ UnlockTuple(idxRel, &hitem->tid, ExclusiveLock);
+
+ deletable[numdeletable++] = hitem->tid;
+ }
+
+ /*
+ * Now sort the array of deletable index tuples, and walk this array by
+ * pages doing bulk deletion of items on each page; the free space map is
+ * updated for pages on which we delete items.
+ */
+ qsort(deletable, numdeletable, sizeof(ItemPointerData),
+ qsortCompareItemPointers);
+
+ for (start = 0, i = 0; i < numdeletable; i++)
+ {
+ /*
+ * Are we at the end of the items that together belong in one
+ * particular page? If so, then it's deletion time.
+ */
+ if (i == numdeletable - 1 ||
+ (ItemPointerGetBlockNumber(&deletable[start]) !=
+ ItemPointerGetBlockNumber(&deletable[i + 1])))
+ {
+ OffsetNumber *offnos;
+ int noffs;
+ Buffer buf;
+ Page page;
+ int j;
+ BlockNumber blk;
+ int freespace;
+
+ vacuum_delay_point();
+
+ blk = ItemPointerGetBlockNumber(&deletable[start]);
+ buf = ReadBufferExtended(idxRel, MAIN_FORKNUM, blk,
+ RBM_NORMAL, strategy);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ page = BufferGetPage(buf);
+
+ noffs = i + 1 - start;
+ offnos = palloc(sizeof(OffsetNumber) * noffs);
+
+ for (j = 0; j < noffs; j++)
+ offnos[j] = ItemPointerGetOffsetNumber(&deletable[start + j]);
+
+ /*
+ * Now remove the target items from the page, without compacting it.
+ */
+ START_CRIT_SECTION();
+
+ PageIndexDeleteNoCompact(page, offnos, noffs);
+ MarkBufferDirty(buf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(idxRel))
+ {
+ xl_minmax_bulkremove xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+
+ xlrec.node = idxRel->rd_node;
+ xlrec.block = blk;
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxBulkRemove;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].buffer_std = false;
+ rdata[0].next = &(rdata[1]);
+
+ /*
+ * The OffsetNumber array is not actually in the buffer, but we
+ * pretend that it is. When XLogInsert stores the whole
+ * buffer, the offset array need not be stored too.
+ */
+ rdata[1].data = (char *) offnos;
+ rdata[1].len = sizeof(OffsetNumber) * noffs;
+ rdata[1].buffer = buf;
+ rdata[1].buffer_std = true;
+ rdata[1].next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_BULKREMOVE,
+ rdata);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /* next iteration starts where this one ended */
+ start = i + 1;
+
+ /* remember free space while we have the buffer locked */
+ freespace = PageGetFreeSpace(page);
+
+ UnlockReleaseBuffer(buf);
+ pfree(offnos);
+
+ RecordPageWithFreeSpace(idxRel, blk, freespace);
+ }
+ }
+
+ pfree(deletable);
+
+ /* Finally, ensure the index's FSM is consistent */
+ FreeSpaceMapVacuum(idxRel);
+
+ *nonsummed = nonsumm;
+ *numnonsummed = numnonsumm;
+
+ hash_destroy(tuples);
+}
+
+/*
+ * Summarize the given page ranges of the given index.
+ */
+static void
+rerun_summarization(Relation idxRel, Relation heapRel,
+ mmRevmapAccess *rmAccess, BlockNumber pagesPerRange,
+ BlockNumber *nonsummarized, int numnonsummarized)
+{
+ int i;
+ IndexInfo *indexInfo;
+ MMBuildState *mmstate;
+
+ indexInfo = BuildIndexInfo(idxRel);
+
+ mmstate = initialize_mm_buildstate(idxRel, rmAccess, pagesPerRange);
+
+ for (i = 0; i < numnonsummarized; i++)
+ {
+ BlockNumber blk = nonsummarized[i];
+ ItemPointerData iptr;
+
+ mmstate->currRangeStart = blk;
+
+ mmGetHeapBlockItemptr(rmAccess, blk, &iptr);
+ /* it can't have been re-summarized concurrently .. */
+ Assert(!ItemPointerIsValid(&iptr));
+
+ /*
+ * Execute the partial heap scan covering the heap blocks in the
+ * specified page range, summarizing the heap tuples in it. This scan
+ * stops just short of mmbuildCallback creating the new index entry.
+ */
+ IndexBuildHeapRangeScan(heapRel, idxRel, indexInfo, false,
+ blk, pagesPerRange,
+ mmbuildCallback, (void *) mmstate);
+
+ /*
+ * Create the index tuple and insert it. Note mmbuildCallback didn't
+ * have the chance to actually insert anything into the index, because
+ * the heapscan should have ended just as it reached the final tuple in
+ * the range.
+ */
+ form_and_insert_tuple(mmstate);
+
+ /* and re-initialize state for the next range */
+ minmax_dtuple_initialize(mmstate->dtuple, mmstate->mmDesc);
+ }
+
+ if (!BufferIsInvalid(mmstate->currentInsertBuf))
+ {
+ ReleaseBuffer(mmstate->currentInsertBuf);
+ mmstate->currentInsertBuf = InvalidBuffer;
+ }
+}
+
+/*
+ * Insert an index tuple into the index relation. The revmap is updated to
+ * mark the range containing the given page as pointing to the inserted entry.
+ * A WAL record is written.
+ *
+ * The buffer, if valid, is checked for free space to insert the new entry;
+ * if there isn't enough, a new buffer is obtained and pinned.
+ *
+ * The buffer is marked dirty.
+ */
+static void
+mm_doinsert(Relation idxrel, mmRevmapAccess *rmAccess, Buffer *buffer,
+ BlockNumber heapblkno, MMTuple *tup, Size itemsz)
+{
+ Page page;
+ BlockNumber blk;
+ OffsetNumber off;
+ bool extended;
+
+ itemsz = MAXALIGN(itemsz);
+
+ /*
+ * Obtain a locked buffer to insert the new tuple. Note mm_getinsertbuffer
+ * ensures the returned buffer has enough space for the tuple, raising a
+ * user-facing error otherwise; so finding insufficient space here is a
+ * program error.
+ */
+ extended = mm_getinsertbuffer(idxrel, buffer, itemsz);
+ page = BufferGetPage(*buffer);
+ if (PageGetFreeSpace(page) < itemsz)
+ elog(ERROR, "index row size %lu exceeds maximum for index \"%s\"",
+ (unsigned long) itemsz, RelationGetRelationName(idxrel));
+
+ START_CRIT_SECTION();
+ off = PageAddItem(page, (Item) tup, itemsz, InvalidOffsetNumber,
+ false, false);
+ MarkBufferDirty(*buffer);
+
+ blk = BufferGetBlockNumber(*buffer);
+ MINMAX_elog(DEBUG2, "inserted tuple (%u,%u) for range starting at %u",
+ blk, off, heapblkno);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(idxrel))
+ {
+ xl_minmax_insert xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+ uint8 info = XLOG_MINMAX_INSERT;
+
+ xlrec.target.node = idxrel->rd_node;
+ ItemPointerSet(&xlrec.target.tid, blk, off);
+ xlrec.overwrite = false;
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxInsert;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].buffer_std = false;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = (char *) tup;
+ rdata[1].len = itemsz;
+ rdata[1].buffer = *buffer;
+ rdata[1].buffer_std = true;
+ rdata[1].next = NULL;
+
+ /*
+ * If this is the first tuple in the page, we can reinit the page
+ * instead of restoring the whole thing. Set flag, and hide buffer
+ * references from XLogInsert.
+ */
+ if (extended)
+ {
+ info |= XLOG_MINMAX_INIT_PAGE;
+ rdata[1].buffer = InvalidBuffer;
+ }
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /*
+ * Note we need to keep the lock on the buffer until after the revmap
+ * has been updated. Otherwise, a concurrent scanner could try to obtain
+ * the index tuple from the revmap before we're done writing it.
+ */
+ mmSetHeapBlockItemptr(rmAccess, heapblkno, blk, off);
+
+ LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+}
+
+/*
+ * Return a pinned and locked buffer which can be used to insert an index item
+ * of size itemsz.
+ *
+ * The passed buffer argument is tested for free space; if it has enough, it is
+ * locked and returned. Otherwise, that buffer (if valid) is unpinned, a new
+ * buffer is obtained, and returned pinned and locked.
+ *
+ * If there's no existing page with enough free space to accommodate the new
+ * item, the relation is extended. This function returns true if that happens,
+ * false otherwise.
+ */
+static bool
+mm_getinsertbuffer(Relation irel, Buffer *buffer, Size itemsz)
+{
+ Buffer buf;
+ Page page;
+ bool extended = false;
+
+gib_restart:
+ buf = *buffer;
+
+ if (BufferIsInvalid(buf) ||
+ (PageGetFreeSpace(BufferGetPage(buf)) < itemsz))
+ {
+ /*
+ * By the time we break out of this loop, buf is a locked and pinned
+ * buffer. It was tested for free space, but in some cases only before
+ * locking it, so a recheck is necessary because a concurrent inserter
+ * might have put items in it.
+ */
+ for (;;)
+ {
+ BlockNumber blk;
+ int freespace;
+
+ blk = GetPageWithFreeSpace(irel, itemsz);
+ if (blk == InvalidBlockNumber)
+ {
+ /*
+ * There's not enough free space in any existing index page,
+ * according to the FSM: extend the relation to obtain a shiny
+ * new page.
+ */
+ buf = mm_getnewbuffer(irel);
+ page = BufferGetPage(buf);
+ mm_page_init(page, MINMAX_PAGETYPE_REGULAR);
+
+ /*
+ * If an entirely new page does not contain enough free space
+ * for the new item, then surely that item is oversized.
+ * Complain loudly; but first make sure we record the page as
+ * free, for next time.
+ */
+ freespace = PageGetFreeSpace(page);
+ RecordPageWithFreeSpace(irel, BufferGetBlockNumber(buf),
+ freespace);
+ if (freespace < itemsz)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("index row size %lu exceeds maximum %lu for index \"%s\"",
+ (unsigned long) itemsz,
+ (unsigned long) freespace,
+ RelationGetRelationName(irel))));
+ extended = true;
+ break;
+ }
+
+ /*
+ * We have a block number from FSM now. Check that it has enough
+ * free space, and break out to return it if it does; otherwise
+ * start over. Note that we allow for the FSM to be out of date
+ * here, and in that case we update it and move on.
+ */
+ Assert(blk != InvalidBlockNumber);
+ buf = ReadBuffer(irel, blk);
+ page = BufferGetPage(buf);
+ freespace = PageGetFreeSpace(page);
+ if (freespace >= itemsz)
+ {
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ break;
+ }
+
+ /* Not really enough space: register reality and start over */
+ ReleaseBuffer(buf);
+ RecordPageWithFreeSpace(irel, blk, freespace);
+ }
+
+ if (!BufferIsInvalid(*buffer))
+ ReleaseBuffer(*buffer);
+ *buffer = buf;
+ }
+ else
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ /*
+ * Now recheck free space with exclusive lock held, and start over if it's
+ * not enough.
+ */
+ Assert(!BufferIsInvalid(*buffer));
+ page = BufferGetPage(*buffer);
+ if (PageGetFreeSpace(page) < itemsz)
+ {
+ UnlockReleaseBuffer(*buffer);
+ *buffer = InvalidBuffer;
+ goto gib_restart;
+ }
+
+ /*
+ * XXX we could perhaps avoid this if we used RelationSetTargetBlock ...
+ */
+ if (extended)
+ FreeSpaceMapVacuum(irel);
+
+ return extended;
+}
+
+/*
+ * Given a deformed tuple in the build state, convert it into the on-disk
+ * format and insert it into the index, making the revmap point to it.
+ */
+static void
+form_and_insert_tuple(MMBuildState *mmstate)
+{
+ MMTuple *tup;
+ Size size;
+
+ /* if this dtuple didn't see any heap tuple at all, don't insert it */
+ if (!mmstate->dtuple->dt_seentup)
+ return;
+
+ tup = minmax_form_tuple(mmstate->mmDesc, mmstate->dtuple, &size);
+ mm_doinsert(mmstate->irel, mmstate->rmAccess,
+ &mmstate->currentInsertBuf, mmstate->currRangeStart, tup,
+ size);
+ mmstate->numtuples++;
+ pfree(tup);
+}
+
+/*
+ * qsort comparator for ItemPointerData items
+ */
+static int
+qsortCompareItemPointers(const void *a, const void *b)
+{
+ return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
+}
diff --git a/src/backend/access/minmax/mmrevmap.c b/src/backend/access/minmax/mmrevmap.c
new file mode 100644
index 0000000..9c269c1
--- /dev/null
+++ b/src/backend/access/minmax/mmrevmap.c
@@ -0,0 +1,683 @@
+/*
+ * mmrevmap.c
+ * Reverse range map for MinMax indexes
+ *
+ * The reverse range map (revmap) is a translation structure for minmax
+ * indexes: for each page range, there is one most-up-to-date summary tuple,
+ * and its location is tracked by the revmap. Whenever a new tuple is inserted
+ * into a table that violates the previously recorded min/max values, a new
+ * tuple is inserted into the index and the revmap is updated to point to it.
+ *
+ * The pages of the revmap are interspersed in the index's main fork. The
+ * first revmap page is always the index's page number one (that is,
+ * immediately after the metapage). Subsequent revmap pages are allocated as
+ * they are needed; their locations are tracked by "array pages". The metapage
+ * contains a large BlockNumber array, whose elements point to array pages. Thus,
+ * to find the second revmap page, we read the metapage and obtain the block
+ * number of the first array page; we then read that page, and the first
+ * element in it is the revmap page we're looking for.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmrevmap.c
+ */
+#include "postgres.h"
+
+#include "access/heapam_xlog.h"
+#include "access/minmax.h"
+#include "access/minmax_internal.h"
+#include "access/minmax_page.h"
+#include "access/minmax_revmap.h"
+#include "access/minmax_xlog.h"
+#include "access/rmgr.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "storage/lmgr.h"
+#include "storage/relfilenode.h"
+#include "storage/smgr.h"
+#include "utils/memutils.h"
+
+
+
+/*
+ * In regular revmap pages, each item stores an ItemPointerData. These defines
+ * let one find the logical revmap page number and index number of the revmap
+ * item for the given heap block number.
+ */
+#define HEAPBLK_TO_REVMAP_BLK(pagesPerRange, heapBlk) \
+ ((heapBlk / pagesPerRange) / REGULAR_REVMAP_PAGE_MAXITEMS)
+#define HEAPBLK_TO_REVMAP_INDEX(pagesPerRange, heapBlk) \
+ ((heapBlk / pagesPerRange) % REGULAR_REVMAP_PAGE_MAXITEMS)
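+
+/*
+ * As a rough illustration (using a deliberately small, made-up
+ * REGULAR_REVMAP_PAGE_MAXITEMS of 3; the real value is much larger): with
+ * pagesPerRange = 2, heap block 10 belongs to range number 10 / 2 = 5, whose
+ * revmap entry is found on logical revmap page 5 / 3 = 1, at index 5 % 3 = 2
+ * within that page.
+ */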
+
+/*
+ * In array revmap pages, each item stores a BlockNumber. These defines let
+ * one find the page and index number of a given revmap block number. Note
+ * that the first revmap page (revmap logical page number 0) is always stored
+ * in physical block number 1, so array pages do not store that one.
+ */
+#define MAPBLK_TO_RMARRAY_BLK(rmBlk) ((rmBlk - 1) / ARRAY_REVMAP_PAGE_MAXITEMS)
+#define MAPBLK_TO_RMARRAY_INDEX(rmBlk) ((rmBlk - 1) % ARRAY_REVMAP_PAGE_MAXITEMS)
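+
+/*
+ * Continuing the illustration above with a made-up ARRAY_REVMAP_PAGE_MAXITEMS
+ * of 4: revmap logical page 0 needs no array entry (it is always physical
+ * block 1), while logical page 5 would be found through array page
+ * (5 - 1) / 4 = 1, at index (5 - 1) % 4 = 0 within that array page.
+ */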
+
+
+struct mmRevmapAccess
+{
+ Relation idxrel;
+ BlockNumber pagesPerRange;
+ Buffer metaBuf;
+ Buffer currBuf;
+ Buffer currArrayBuf;
+ BlockNumber *revmapArrayPages;
+};
+/* typedef appears in minmax_revmap.h */
+
+
+/*
+ * Initialize an access object for a reverse range map, which can be used to
+ * read and update it. This must be freed by mmRevmapAccessTerminate when caller
+ * is done with it.
+ */
+mmRevmapAccess *
+mmRevmapAccessInit(Relation idxrel, BlockNumber *pagesPerRange)
+{
+ mmRevmapAccess *rmAccess;
+ Buffer meta;
+ MinmaxMetaPageData *metadata;
+
+ meta = ReadBuffer(idxrel, MINMAX_METAPAGE_BLKNO);
+ metadata = (MinmaxMetaPageData *) PageGetContents(BufferGetPage(meta));
+
+ rmAccess = palloc(sizeof(mmRevmapAccess));
+ rmAccess->metaBuf = meta;
+ rmAccess->idxrel = idxrel;
+ rmAccess->pagesPerRange = metadata->pagesPerRange;
+ rmAccess->currBuf = InvalidBuffer;
+ rmAccess->currArrayBuf = InvalidBuffer;
+ rmAccess->revmapArrayPages = NULL;
+
+ if (pagesPerRange)
+ *pagesPerRange = metadata->pagesPerRange;
+
+ return rmAccess;
+}
+
+/*
+ * Release resources associated with a revmap access object.
+ */
+void
+mmRevmapAccessTerminate(mmRevmapAccess *rmAccess)
+{
+ if (rmAccess->revmapArrayPages != NULL)
+ pfree(rmAccess->revmapArrayPages);
+ if (rmAccess->metaBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->metaBuf);
+ if (rmAccess->currBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currBuf);
+ if (rmAccess->currArrayBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currArrayBuf);
+ pfree(rmAccess);
+}
+
+/*
+ * In the given revmap page, which is used in a minmax index of pagesPerRange
+ * pages per range, set the element corresponding to heap block number heapBlk
+ * to the value (blkno, offno).
+ *
+ * Caller must have obtained the correct revmap page.
+ *
+ * This is used both in regular operation and during WAL replay.
+ */
+void
+rm_page_set_iptr(Page page, BlockNumber pagesPerRange, BlockNumber heapBlk,
+ BlockNumber blkno, OffsetNumber offno)
+{
+ RevmapContents *contents;
+ ItemPointerData *iptr;
+
+ contents = (RevmapContents *) PageGetContents(page);
+ iptr = (ItemPointerData *) contents->rmr_tids;
+ iptr += HEAPBLK_TO_REVMAP_INDEX(pagesPerRange, heapBlk);
+
+ ItemPointerSet(iptr, blkno, offno);
+}
+
+/*
+ * Initialize a new regular revmap page, which stores the given revmap logical
+ * page number. The physical block number of the given buffer is returned.
+ *
+ * Used both by regular code path as well as during xlog replay.
+ */
+BlockNumber
+initialize_rmr_page(Buffer newbuf, BlockNumber mapBlk)
+{
+ BlockNumber blkno;
+ Page page;
+ RevmapContents *contents;
+
+ page = BufferGetPage(newbuf);
+
+ mm_page_init(page, MINMAX_PAGETYPE_REVMAP);
+ contents = (RevmapContents *) PageGetContents(page);
+ contents->rmr_logblk = mapBlk;
+ /* the rmr_tids array is initialized to all invalid by PageInit */
+
+ blkno = BufferGetBlockNumber(newbuf);
+
+ return blkno;
+}
+
+/*
+ * Lock the metapage as specified by caller, and update the given rmAccess with
+ * the metapage data. The metapage buffer is locked when this function
+ * returns; it's the caller's responsibility to unlock it.
+ */
+static void
+rmaccess_get_metapage(mmRevmapAccess *rmAccess, int lockmode)
+{
+ MinmaxMetaPageData *metadata;
+ MinmaxSpecialSpace *special PG_USED_FOR_ASSERTS_ONLY;
+ Page metapage;
+
+ LockBuffer(rmAccess->metaBuf, lockmode);
+ metapage = BufferGetPage(rmAccess->metaBuf);
+
+#ifdef USE_ASSERT_CHECKING
+ /* ensure we really got the metapage */
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(metapage);
+ Assert(special->type == MINMAX_PAGETYPE_META);
+#endif
+
+ /* first time through? allocate the array */
+ if (rmAccess->revmapArrayPages == NULL)
+ rmAccess->revmapArrayPages =
+ palloc(sizeof(BlockNumber) * MAX_REVMAP_ARRAYPAGES);
+
+ metadata = (MinmaxMetaPageData *) PageGetContents(metapage);
+ memcpy(rmAccess->revmapArrayPages, metadata->revmapArrayPages,
+ sizeof(BlockNumber) * MAX_REVMAP_ARRAYPAGES);
+}
+
+/*
+ * Given a buffer (hopefully containing a blank page), set it up as a revmap
+ * array page.
+ *
+ * Used both by regular code path as well as during xlog replay.
+ */
+void
+initialize_rma_page(Buffer buf)
+{
+ Page arrayPg;
+ RevmapArrayContents *contents;
+
+ arrayPg = BufferGetPage(buf);
+ mm_page_init(arrayPg, MINMAX_PAGETYPE_REVMAP_ARRAY);
+ contents = (RevmapArrayContents *) PageGetContents(arrayPg);
+ contents->rma_nblocks = 0;
+ /* set the whole array to InvalidBlockNumber */
+ memset(contents->rma_blocks, 0xFF,
+ sizeof(BlockNumber) * ARRAY_REVMAP_PAGE_MAXITEMS);
+}
+
+/*
+ * Update the metapage, so that item arrayBlkIdx in the array of revmap array
+ * pages points to block number newPgBlkno.
+ */
+static void
+update_minmax_metapg(Relation idxrel, Buffer meta, uint32 arrayBlkIdx,
+ BlockNumber newPgBlkno)
+{
+ MinmaxMetaPageData *metadata;
+
+ metadata = (MinmaxMetaPageData *) PageGetContents(BufferGetPage(meta));
+
+ START_CRIT_SECTION();
+ metadata->revmapArrayPages[arrayBlkIdx] = newPgBlkno;
+ MarkBufferDirty(meta);
+ if (RelationNeedsWAL(idxrel))
+ {
+ xl_minmax_metapg_set xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata;
+
+ xlrec.node = idxrel->rd_node;
+ xlrec.blkidx = arrayBlkIdx;
+ xlrec.newpg = newPgBlkno;
+
+ rdata.data = (char *) &xlrec;
+ rdata.len = SizeOfMinmaxMetapgSet;
+ rdata.buffer = InvalidBuffer;
+ rdata.buffer_std = false;
+ rdata.next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_METAPG_SET, &rdata);
+ PageSetLSN(BufferGetPage(meta), recptr);
+ }
+ END_CRIT_SECTION();
+}
+
+/*
+ * Given a logical revmap block number, find its physical block number.
+ *
+ * Note this might involve up to two buffer reads, including a possible
+ * update to the metapage.
+ *
+ * If extend is set to true, and the page hasn't been set yet, extend the
+ * array to point to a newly allocated page.
+ */
+static BlockNumber
+rm_get_phys_blkno(mmRevmapAccess *rmAccess, BlockNumber mapBlk, bool extend)
+{
+ int arrayBlkIdx;
+ BlockNumber arrayBlk;
+ RevmapArrayContents *contents;
+ int revmapIdx;
+ BlockNumber targetblk;
+
+ /* the first revmap page is always block number 1 */
+ if (mapBlk == 0)
+ return (BlockNumber) 1;
+
+ /*
+ * For all other cases, take the long route of checking the metapage and
+ * revmap array pages.
+ */
+
+ /*
+ * Copy the revmap array from the metapage into private storage, if not
+ * done already in this scan.
+ */
+ if (rmAccess->revmapArrayPages == NULL)
+ {
+ rmaccess_get_metapage(rmAccess, BUFFER_LOCK_SHARE);
+ LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_UNLOCK);
+ }
+
+ /*
+ * Consult the metapage array; if the array page we need is not set there,
+ * we need to extend the index to allocate the array page, and update the
+ * metapage array.
+ */
+ arrayBlkIdx = MAPBLK_TO_RMARRAY_BLK(mapBlk);
+ if (arrayBlkIdx >= MAX_REVMAP_ARRAYPAGES)
+ elog(ERROR, "non-existent revmap array page requested");
+
+ arrayBlk = rmAccess->revmapArrayPages[arrayBlkIdx];
+ if (arrayBlk == InvalidBlockNumber)
+ {
+ /* if not asked to extend, there's no further work to do here */
+ if (!extend)
+ return InvalidBlockNumber;
+
+ /*
+ * If we need to create a new array page, check the metapage again;
+ * someone might have created it after the last time we read the
+ * metapage. This time we acquire an exclusive lock, since we may need
+ * to extend. Lock before doing the physical relation extension, to
+ * avoid leaving an unused page around in case someone does this
+ * concurrently. Note that, unfortunately, we will be keeping the lock
+ * on the metapage alongside the relation extension lock, while doing a
+ * syscall involving disk I/O. Extending to add a new revmap array page
+ * is fairly infrequent, so it shouldn't be too bad.
+ *
+ * XXX it is possible to extend the relation unconditionally before
+ * locking the metapage, and later if we find that someone else had
+ * already added this page, save the page in FSM as MaxFSMRequestSize.
+ * That would be better for concurrency. Explore someday.
+ */
+ rmaccess_get_metapage(rmAccess, BUFFER_LOCK_EXCLUSIVE);
+
+ if (rmAccess->revmapArrayPages[arrayBlkIdx] == InvalidBlockNumber)
+ {
+ BlockNumber newPgBlkno;
+
+ /*
+ * Ok, definitely need to allocate a new revmap array page;
+ * initialize a new page to the initial (empty) array revmap state
+ * and register it in metapage.
+ */
+ rmAccess->currArrayBuf = mm_getnewbuffer(rmAccess->idxrel);
+ START_CRIT_SECTION();
+ initialize_rma_page(rmAccess->currArrayBuf);
+ MarkBufferDirty(rmAccess->currArrayBuf);
+ if (RelationNeedsWAL(rmAccess->idxrel))
+ {
+ xl_minmax_init_rmpg xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata;
+
+ xlrec.node = rmAccess->idxrel->rd_node;
+ xlrec.blkno = BufferGetBlockNumber(rmAccess->currArrayBuf);
+ xlrec.array = true;
+ xlrec.logblk = InvalidBlockNumber;
+
+ rdata.data = (char *) &xlrec;
+ rdata.len = SizeOfMinmaxInitRmpg;
+ rdata.buffer = InvalidBuffer; /* FIXME */
+ rdata.buffer_std = false;
+ rdata.next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_INIT_RMPG, &rdata);
+ PageSetLSN(BufferGetPage(rmAccess->currArrayBuf), recptr);
+ }
+ END_CRIT_SECTION();
+ LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_UNLOCK);
+ newPgBlkno = BufferGetBlockNumber(rmAccess->currArrayBuf);
+ rmAccess->revmapArrayPages[arrayBlkIdx] = newPgBlkno;
+
+ MINMAX_elog(DEBUG2, "allocated block for revmap array page: %u",
+ BufferGetBlockNumber(rmAccess->currArrayBuf));
+
+ /* Update the metapage to point to the new array page. */
+ update_minmax_metapg(rmAccess->idxrel, rmAccess->metaBuf, arrayBlkIdx,
+ newPgBlkno);
+ }
+
+ LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_UNLOCK);
+ arrayBlk = rmAccess->revmapArrayPages[arrayBlkIdx];
+ }
+
+ /*
+ * By here, we know the array page is set in the metapage array. Read that
+ * page; except that if we just allocated it, or we already hold pin on it,
+ * we don't need to read it again. XXX but we didn't hold lock!
+ */
+ Assert(arrayBlk != InvalidBlockNumber);
+
+ if (rmAccess->currArrayBuf == InvalidBuffer ||
+ BufferGetBlockNumber(rmAccess->currArrayBuf) != arrayBlk)
+ {
+ if (rmAccess->currArrayBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currArrayBuf);
+
+ rmAccess->currArrayBuf =
+ ReadBuffer(rmAccess->idxrel, arrayBlk);
+ }
+
+ LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_SHARE);
+
+ /*
+ * And now we can inspect its contents; if the target page is set, we can
+ * just return. Even if not set, we can also return if caller asked us not
+ * to extend the revmap.
+ */
+ contents = (RevmapArrayContents *)
+ PageGetContents(BufferGetPage(rmAccess->currArrayBuf));
+ revmapIdx = MAPBLK_TO_RMARRAY_INDEX(mapBlk);
+ if (!extend || revmapIdx <= contents->rma_nblocks - 1)
+ {
+ LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_UNLOCK);
+
+ return contents->rma_blocks[revmapIdx];
+ }
+
+ /*
+ * Trade our shared lock in the array page for exclusive, because we now
+ * need to allocate one more revmap page and modify the array page.
+ */
+ LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_UNLOCK);
+ LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_EXCLUSIVE);
+
+ contents = (RevmapArrayContents *)
+ PageGetContents(BufferGetPage(rmAccess->currArrayBuf));
+
+ /*
+ * If someone else already set the value while we were waiting for the
+ * exclusive lock, we're done; otherwise, allocate a new block as the
+ * new revmap page, and update the array page to point to it.
+ *
+ * FIXME -- what if we were asked not to extend?
+ */
+ if (contents->rma_blocks[revmapIdx] != InvalidBlockNumber)
+ {
+ targetblk = contents->rma_blocks[revmapIdx];
+ }
+ else
+ {
+ Buffer newbuf;
+
+ newbuf = mm_getnewbuffer(rmAccess->idxrel);
+ START_CRIT_SECTION();
+ targetblk = initialize_rmr_page(newbuf, mapBlk);
+ MarkBufferDirty(newbuf);
+ if (RelationNeedsWAL(rmAccess->idxrel))
+ {
+ xl_minmax_init_rmpg xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata;
+
+ xlrec.node = rmAccess->idxrel->rd_node;
+ xlrec.blkno = BufferGetBlockNumber(newbuf);
+ xlrec.array = false;
+ xlrec.logblk = mapBlk;
+
+ rdata.data = (char *) &xlrec;
+ rdata.len = SizeOfMinmaxInitRmpg;
+ rdata.buffer = InvalidBuffer;
+ rdata.buffer_std = false;
+ rdata.next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_INIT_RMPG, &rdata);
+ PageSetLSN(BufferGetPage(newbuf), recptr);
+ }
+ END_CRIT_SECTION();
+
+ UnlockReleaseBuffer(newbuf);
+
+ /*
+ * Modify the revmap array page to point to the newly allocated revmap
+ * page.
+ */
+ START_CRIT_SECTION();
+
+ contents->rma_blocks[revmapIdx] = targetblk;
+ /*
+ * XXX this rma_nblocks assignment should probably be conditional on the
+ * current rma_blocks value.
+ */
+ contents->rma_nblocks = revmapIdx + 1;
+ MarkBufferDirty(rmAccess->currArrayBuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rmAccess->idxrel))
+ {
+ xl_minmax_rmarray_set xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+ uint8 info;
+
+ info = XLOG_MINMAX_RMARRAY_SET;
+
+ xlrec.node = rmAccess->idxrel->rd_node;
+ xlrec.rmarray = BufferGetBlockNumber(rmAccess->currArrayBuf);
+ xlrec.blkidx = revmapIdx;
+ xlrec.newpg = targetblk;
+
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxRmarraySet;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].buffer_std = false;
+ rdata[0].next = &rdata[1];
+
+ rdata[1].data = NULL;
+ rdata[1].len = 0;
+ rdata[1].buffer = rmAccess->currArrayBuf;
+ rdata[1].buffer_std = false;
+ rdata[1].next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+ PageSetLSN(BufferGetPage(rmAccess->currArrayBuf), recptr);
+ }
+
+ END_CRIT_SECTION();
+ }
+
+ LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_UNLOCK);
+
+ return targetblk;
+}
+
+/*
+ * Set the TID of the index entry corresponding to the range that includes
+ * the given heap page to the given item pointer.
+ *
+ * The map is extended, if necessary.
+ */
+void
+mmSetHeapBlockItemptr(mmRevmapAccess *rmAccess, BlockNumber heapBlk,
+ BlockNumber blkno, OffsetNumber offno)
+{
+ BlockNumber mapBlk;
+ bool extend = false;
+
+ mapBlk = HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk);
+
+ /* Translate the map block number to physical location */
+ mapBlk = rm_get_phys_blkno(rmAccess, mapBlk, true);
+
+ MINMAX_elog(DEBUG2, "setting %u/%u in logical page %lu (physical %u) for heap %u",
+ blkno, offno,
+ HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk),
+ mapBlk, heapBlk);
+
+ /*
+ * Obtain the buffer we need to modify. If we already have the correct
+ * buffer in our access struct, use that; otherwise, release it (if valid)
+ * and read the one we need.
+ */
+ if (rmAccess->currBuf == InvalidBuffer ||
+ mapBlk != BufferGetBlockNumber(rmAccess->currBuf))
+ {
+ if (rmAccess->currBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currBuf);
+
+ Assert(mapBlk != InvalidBlockNumber);
+ rmAccess->currBuf = ReadBuffer(rmAccess->idxrel, mapBlk);
+ }
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_EXCLUSIVE);
+ START_CRIT_SECTION();
+
+ rm_page_set_iptr(BufferGetPage(rmAccess->currBuf),
+ rmAccess->pagesPerRange,
+ heapBlk,
+ blkno, offno);
+
+ MarkBufferDirty(rmAccess->currBuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rmAccess->idxrel))
+ {
+ xl_minmax_rm_set xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+ uint8 info;
+
+ info = XLOG_MINMAX_REVMAP_SET;
+
+ xlrec.node = rmAccess->idxrel->rd_node;
+ xlrec.mapBlock = mapBlk;
+ xlrec.pagesPerRange = rmAccess->pagesPerRange;
+ xlrec.heapBlock = heapBlk;
+ ItemPointerSet(&(xlrec.newval), blkno, offno);
+
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxRevmapSet;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].buffer_std = false;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = NULL;
+ rdata[1].len = 0;
+ rdata[1].buffer = rmAccess->currBuf;
+ rdata[1].buffer_std = false;
+ rdata[1].next = NULL;
+
+ if (extend)
+ {
+ info |= XLOG_MINMAX_INIT_PAGE;
+ /* If the page is new, there's no need for a full page image */
+ rdata[0].next = NULL;
+ }
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+ PageSetLSN(BufferGetPage(rmAccess->currBuf), recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_UNLOCK);
+}
+
+
+/*
+ * Return the TID of the index entry corresponding to the range that includes
+ * the given heap page. If the TID is valid, the tuple is locked with
+ * LockTuple. It is the caller's responsibility to release that lock.
+ */
+void
+mmGetHeapBlockItemptr(mmRevmapAccess *rmAccess, BlockNumber heapBlk,
+ ItemPointerData *out)
+{
+ BlockNumber mapBlk;
+ RevmapContents *contents;
+ ItemPointerData *iptr;
+
+ mapBlk = HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk);
+ /* Translate the map block number to physical location */
+ mapBlk = rm_get_phys_blkno(rmAccess, mapBlk, false);
+ if (mapBlk == InvalidBlockNumber)
+ {
+ ItemPointerSetInvalid(out);
+ return;
+ }
+
+ if (rmAccess->currBuf == InvalidBuffer ||
+ BufferGetBlockNumber(rmAccess->currBuf) != mapBlk)
+ {
+ if (rmAccess->currBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currBuf);
+
+ Assert(mapBlk != InvalidBlockNumber);
+ rmAccess->currBuf = ReadBuffer(rmAccess->idxrel, mapBlk);
+ }
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_SHARE);
+
+ contents = (RevmapContents *)
+ PageGetContents(BufferGetPage(rmAccess->currBuf));
+ iptr = contents->rmr_tids;
+ iptr += HEAPBLK_TO_REVMAP_INDEX(rmAccess->pagesPerRange, heapBlk);
+
+ ItemPointerCopy(iptr, out);
+
+ if (ItemPointerIsValid(iptr))
+ LockTuple(rmAccess->idxrel, iptr, ShareLock);
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_UNLOCK);
+}
+
+/*
+ * Initialize the revmap of a new minmax index.
+ *
+ * NB -- caller is assumed to WAL-log this operation
+ */
+void
+mmRevmapCreate(Relation idxrel)
+{
+ Buffer buf;
+
+ /*
+ * The first page of the revmap is always stored in block number 1 of the
+ * main fork. Because of this, the only thing we need to do is request
+ * a new page; we assume we are called immediately after the metapage has
+ * been initialized.
+ */
+ buf = mm_getnewbuffer(idxrel);
+ Assert(BufferGetBlockNumber(buf) == 1);
+
+ mm_page_init(BufferGetPage(buf), MINMAX_PAGETYPE_REVMAP);
+ MarkBufferDirty(buf);
+
+ UnlockReleaseBuffer(buf);
+}
diff --git a/src/backend/access/minmax/mmsortable.c b/src/backend/access/minmax/mmsortable.c
new file mode 100644
index 0000000..17095af
--- /dev/null
+++ b/src/backend/access/minmax/mmsortable.c
@@ -0,0 +1,265 @@
+/*
+ * mmsortable.c
+ * Implementation of Minmax indexes for sortable datatypes
+ * (that is, anything with a btree opclass)
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmsortable.c
+ */
+#include "postgres.h"
+
+#include "access/genam.h"
+#include "access/minmax_internal.h"
+#include "access/minmax_tuple.h"
+#include "access/skey.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
+#include "utils/syscache.h"
+
+
+/*
+ * Procedure numbers must not collide with MINMAX_PROCNUM defines in
+ * minmax_internal.h. Note we only need inequality functions.
+ */
+#define SORTABLE_NUM_PROCNUMS 4 /* # support procs we need */
+#define PROCNUM_LESS 4
+#define PROCNUM_LESSEQUAL 5
+#define PROCNUM_GREATER 6
+#define PROCNUM_GREATEREQUAL 7
+
+/* subtract this from procnum to obtain index in SortableOpaque arrays */
+#define PROCNUM_BASE 4
+
+static FmgrInfo *mmsrt_get_procinfo(MinmaxDesc *mmdesc, uint16 attno,
+ uint16 procnum);
+
+PG_FUNCTION_INFO_V1(mmSortableOpcInfo);
+PG_FUNCTION_INFO_V1(mmSortableAddValue);
+PG_FUNCTION_INFO_V1(mmSortableConsistent);
+
+Datum mmSortableOpcInfo(PG_FUNCTION_ARGS);
+Datum mmSortableAddValue(PG_FUNCTION_ARGS);
+Datum mmSortableConsistent(PG_FUNCTION_ARGS);
+
+typedef struct SortableOpaque
+{
+ FmgrInfo operators[SORTABLE_NUM_PROCNUMS];
+ bool inited[SORTABLE_NUM_PROCNUMS];
+} SortableOpaque;
+
+/*
+ * Return opclass information needed to work with a minmax index on a sortable
+ * datatype, as a pointer to a newly palloc'ed MinmaxOpcInfo.
+ */
+Datum
+mmSortableOpcInfo(PG_FUNCTION_ARGS)
+{
+ SortableOpaque *opaque;
+ MinmaxOpcInfo *result;
+
+ opaque = palloc0(sizeof(SortableOpaque));
+ /*
+ * 'operators' is initialized lazily, as indicated by 'inited' which was
+ * initialized to all false by palloc0.
+ */
+
+ result = palloc(sizeof(MinmaxOpcInfo));
+ result->oi_nstored = 2; /* min, max */
+ result->oi_opaque = opaque;
+
+ PG_RETURN_POINTER(result);
+}
+
+/*
+ * Examine the given index tuple (which contains partial status of a certain
+ * page range) by comparing it to the given value that comes from another heap
+ * tuple. If the new value is outside the domain specified by the existing
+ * tuple values, update the index range and return true. Otherwise, return
+ * false and do not modify the tuple.
+ */
+Datum
+mmSortableAddValue(PG_FUNCTION_ARGS)
+{
+ MinmaxDesc *mmdesc = (MinmaxDesc *) PG_GETARG_POINTER(0);
+ DeformedMMTuple *dtuple = (DeformedMMTuple *) PG_GETARG_POINTER(1);
+ AttrNumber attno = PG_GETARG_UINT16(2);
+ Datum newval = PG_GETARG_DATUM(3);
+ bool isnull = PG_GETARG_BOOL(4);
+ Oid colloid = PG_GET_COLLATION();
+ FmgrInfo *cmpFn;
+ Datum compar;
+ bool updated = false;
+
+ /*
+ * If the new value is null, we record that we saw it if it's the first
+ * one; otherwise, there's nothing to do.
+ */
+ if (isnull)
+ {
+ if (dtuple->dt_columns[attno - 1].hasnulls)
+ PG_RETURN_BOOL(false);
+
+ dtuple->dt_columns[attno - 1].hasnulls = true;
+ PG_RETURN_BOOL(true);
+ }
+
+ /*
+ * If no non-null value has been recorded for this column yet, store the new
+ * value (which we know to be not null) as both minimum and maximum, and
+ * we're done.
+ */
+ if (dtuple->dt_columns[attno - 1].allnulls)
+ {
+ dtuple->dt_columns[attno - 1].values[0] =
+ datumCopy(newval, mmdesc->md_tupdesc->attrs[attno - 1]->attbyval,
+ mmdesc->md_tupdesc->attrs[attno - 1]->attlen);
+ dtuple->dt_columns[attno - 1].values[1] =
+ datumCopy(newval, mmdesc->md_tupdesc->attrs[attno - 1]->attbyval,
+ mmdesc->md_tupdesc->attrs[attno - 1]->attlen);
+ dtuple->dt_columns[attno - 1].allnulls = false;
+ PG_RETURN_BOOL(true);
+ }
+
+ /*
+ * Otherwise, need to compare the new value with the existing boundaries
+ * and update them accordingly. First check if it's less than the existing
+ * minimum.
+ */
+ cmpFn = mmsrt_get_procinfo(mmdesc, attno, PROCNUM_LESS);
+ compar = FunctionCall2Coll(cmpFn, colloid, newval,
+ dtuple->dt_columns[attno - 1].values[0]);
+ if (DatumGetBool(compar))
+ {
+ dtuple->dt_columns[attno - 1].values[0] =
+ datumCopy(newval, mmdesc->md_tupdesc->attrs[attno - 1]->attbyval,
+ mmdesc->md_tupdesc->attrs[attno - 1]->attlen);
+ updated = true;
+ }
+
+ /*
+ * And now compare it to the existing maximum.
+ */
+ cmpFn = mmsrt_get_procinfo(mmdesc, attno, PROCNUM_GREATER);
+ compar = FunctionCall2Coll(cmpFn, colloid, newval,
+ dtuple->dt_columns[attno - 1].values[1]);
+ if (DatumGetBool(compar))
+ {
+ dtuple->dt_columns[attno - 1].values[1] =
+ datumCopy(newval, mmdesc->md_tupdesc->attrs[attno - 1]->attbyval,
+ mmdesc->md_tupdesc->attrs[attno - 1]->attlen);
+ updated = true;
+ }
+
+ PG_RETURN_BOOL(updated);
+}
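+
+/*
+ * For example, if the existing summary for a column is [10, 20], adding the
+ * value 7 lowers the stored minimum to 7 and returns true, whereas adding 15
+ * changes nothing and returns false.
+ */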
+
+/*
+ * Given an index tuple corresponding to a certain page range, and a scan key
+ * (represented by its index attribute number, the value and an operator
+ * strategy number), return whether the scan key is consistent with the page
+ * range. Return true if so, false otherwise.
+ *
+ * XXX what do we need to do with NULL values here?
+ *
+ * XXX would it be better to pass the ScanKey as a whole rather than parts of
+ * it?
+ */
+Datum
+mmSortableConsistent(PG_FUNCTION_ARGS)
+{
+ MinmaxDesc *mmdesc = (MinmaxDesc *) PG_GETARG_POINTER(0);
+ DeformedMMTuple *dtup = (DeformedMMTuple *) PG_GETARG_POINTER(1);
+ AttrNumber attno = PG_GETARG_INT16(2);
+ StrategyNumber strat = PG_GETARG_UINT16(3);
+ Datum value = PG_GETARG_DATUM(4);
+ Datum matches;
+ Oid colloid = PG_GET_COLLATION();
+
+ switch (strat)
+ {
+ case BTLessStrategyNumber:
+ matches = FunctionCall2Coll(mmsrt_get_procinfo(mmdesc, attno,
+ PROCNUM_LESS),
+ colloid,
+ dtup->dt_columns[attno - 1].values[0],
+ value);
+ break;
+ case BTLessEqualStrategyNumber:
+ matches = FunctionCall2Coll(mmsrt_get_procinfo(mmdesc, attno,
+ PROCNUM_LESSEQUAL),
+ colloid,
+ dtup->dt_columns[attno - 1].values[0],
+ value);
+ break;
+ case BTEqualStrategyNumber:
+
+ /*
+ * In the equality case (WHERE col = someval), we want to return
+ * the current page range if the minimum value in the range <= scan
+ * key, and the maximum value >= scan key.
+ */
+ matches = FunctionCall2Coll(mmsrt_get_procinfo(mmdesc, attno,
+ PROCNUM_LESSEQUAL),
+ colloid,
+ dtup->dt_columns[attno - 1].values[0],
+ value);
+ if (!DatumGetBool(matches))
+ break;
+ /* max() >= scankey */
+ matches = FunctionCall2Coll(mmsrt_get_procinfo(mmdesc, attno,
+ PROCNUM_GREATEREQUAL),
+ colloid,
+ dtup->dt_columns[attno - 1].values[1],
+ value);
+ break;
+ case BTGreaterEqualStrategyNumber:
+ matches = FunctionCall2Coll(mmsrt_get_procinfo(mmdesc, attno,
+ PROCNUM_GREATEREQUAL),
+ colloid,
+ dtup->dt_columns[attno - 1].values[1],
+ value);
+ break;
+ case BTGreaterStrategyNumber:
+ matches = FunctionCall2Coll(mmsrt_get_procinfo(mmdesc, attno,
+ PROCNUM_GREATER),
+ colloid,
+ dtup->dt_columns[attno - 1].values[1],
+ value);
+ break;
+ default:
+ /* shouldn't happen */
+ elog(ERROR, "invalid strategy number %d", strat);
+ matches = 0;
+ break;
+ }
+
+ PG_RETURN_DATUM(matches);
+}
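+
+/*
+ * To illustrate the mapping above: for a qual such as "col < 42" the scan
+ * passes BTLessStrategyNumber, and the page range can contain matches only
+ * if its stored minimum (values[0]) is less than 42; for "col > 42" only the
+ * stored maximum (values[1]) matters; and for "col = 42" the range qualifies
+ * only when minimum <= 42 and maximum >= 42.
+ */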
+
+/*
+ * Return the procedure corresponding to the given function support number.
+ */
+static FmgrInfo *
+mmsrt_get_procinfo(MinmaxDesc *mmdesc, uint16 attno, uint16 procnum)
+{
+ SortableOpaque *opaque;
+ uint16 basenum = procnum - PROCNUM_BASE;
+
+ opaque = (SortableOpaque *) mmdesc->md_info[attno - 1]->oi_opaque;
+
+ /*
+ * We cache these in the opaque struct, to avoid repetitive syscache
+ * lookups.
+ */
+ if (!opaque->inited[basenum])
+ {
+ fmgr_info_copy(&opaque->operators[basenum],
+ index_getprocinfo(mmdesc->md_index, attno, procnum),
+ CurrentMemoryContext);
+ opaque->inited[basenum] = true;
+ }
+
+ return &opaque->operators[basenum];
+}
diff --git a/src/backend/access/minmax/mmtuple.c b/src/backend/access/minmax/mmtuple.c
new file mode 100644
index 0000000..445f558
--- /dev/null
+++ b/src/backend/access/minmax/mmtuple.c
@@ -0,0 +1,454 @@
+/*
+ * MinMax-specific tuples
+ * Method implementations for tuples in minmax indexes.
+ *
+ * Intended usage is that code outside this file only deals with
+ * DeformedMMTuples, and converts to and from the on-disk representation through
+ * functions in this file.
+ *
+ * NOTES
+ *
+ * A minmax tuple is similar to a heap tuple, with a few key differences. The
+ * first interesting difference is that the tuple header is much simpler, only
+ * containing its total length and a small area for flags. Also, the stored
+ * data does not match the relation tuple descriptor exactly: for each
+ * attribute in the descriptor, the index tuple carries an arbitrary number
+ * of values, depending on the opclass.
+ *
+ * Also, for each column of the index relation there are two null bits: one
+ * (hasnulls) stores whether any tuple within the page range has that column
+ * set to null; the other one (allnulls) stores whether the column values are
+ * all null. If allnulls is true, then the tuple data area does not contain
+ * values for that column at all; whereas it does if only hasnulls is set.
+ * Note the size of the null bitmask may not be the same as that of the
+ * datum array.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmtuple.c
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/minmax_tuple.h"
+#include "access/tupdesc.h"
+#include "access/tupmacs.h"
+
+
+static inline void mm_deconstruct_tuple(MinmaxDesc *mmdesc,
+ char *tp, bits8 *nullbits, bool nulls,
+ Datum *values, bool *allnulls, bool *hasnulls);
+
+
+/*
+ * Return a tuple descriptor used for on-disk storage of minmax tuples.
+ */
+static TupleDesc
+mmtuple_disk_tupdesc(MinmaxDesc *mmdesc)
+{
+ /* We cache these in the MinmaxDesc */
+ if (mmdesc->md_disktdesc == NULL)
+ {
+ int i;
+ int j;
+ AttrNumber attno = 1;
+ TupleDesc tupdesc;
+
+ tupdesc = CreateTemplateTupleDesc(mmdesc->md_totalstored, false);
+
+ for (i = 0; i < mmdesc->md_tupdesc->natts; i++)
+ {
+ for (j = 0; j < mmdesc->md_info[i]->oi_nstored; j++)
+ TupleDescInitEntry(tupdesc, attno++, NULL,
+ mmdesc->md_tupdesc->attrs[i]->atttypid,
+ mmdesc->md_tupdesc->attrs[i]->atttypmod,
+ 0);
+ }
+
+ mmdesc->md_disktdesc = tupdesc;
+ }
+
+ return mmdesc->md_disktdesc;
+}
+
+/*
+ * Generate a new on-disk tuple to be inserted in a minmax index.
+ */
+MMTuple *
+minmax_form_tuple(MinmaxDesc *mmdesc, DeformedMMTuple *tuple, Size *size)
+{
+ Datum *values;
+ bool *nulls;
+ bool anynulls = false;
+ MMTuple *rettuple;
+ int keyno;
+ int idxattno;
+ uint16 phony_infomask;
+ bits8 *phony_nullbitmap;
+ Size len,
+ hoff,
+ data_len;
+
+ Assert(mmdesc->md_totalstored > 0);
+
+ values = palloc(sizeof(Datum) * mmdesc->md_totalstored);
+ nulls = palloc0(sizeof(bool) * mmdesc->md_totalstored);
+ phony_nullbitmap = palloc(sizeof(bits8) * BITMAPLEN(mmdesc->md_totalstored));
+
+ /*
+ * Set up the values/nulls arrays for heap_fill_tuple
+ */
+ for (idxattno = 0, keyno = 0; keyno < mmdesc->md_tupdesc->natts; keyno++)
+ {
+ int datumno;
+
+ /*
+ * "allnulls" is set when there's no nonnull value in any row in
+ * the column; when this happens, there is no data to store. Thus
+ * set the nullable bits for all data elements of this column and
+ * we're done.
+ */
+ if (tuple->dt_columns[keyno].allnulls)
+ {
+ for (datumno = 0;
+ datumno < mmdesc->md_info[keyno]->oi_nstored;
+ datumno++)
+ nulls[idxattno++] = true;
+ anynulls = true;
+ continue;
+ }
+
+ /*
+ * The "hasnulls" bit is set when there are some null values in the
+ * data. We still need to store a real value, but the presence of nulls
+ * means we need a null bitmap.
+ */
+ if (tuple->dt_columns[keyno].hasnulls)
+ anynulls = true;
+
+ for (datumno = 0;
+ datumno < mmdesc->md_info[keyno]->oi_nstored;
+ datumno++)
+ /* XXX datumCopy ?? */
+ values[idxattno++] = tuple->dt_columns[keyno].values[datumno];
+ }
+
+ /* compute total space needed */
+ len = SizeOfMinMaxTuple;
+ if (anynulls)
+ {
+ /*
+ * We need a double-length bitmap on an on-disk minmax index tuple;
+ * the first half stores the "allnulls" bits, the second stores
+ * "hasnulls".
+ */
+ len += BITMAPLEN(mmdesc->md_tupdesc->natts * 2);
+ }
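+
+ /*
+ * For instance, with two indexed columns the null bitmap holds four bits --
+ * allnulls for columns 1 and 2, followed by hasnulls for columns 1 and 2 --
+ * so BITMAPLEN(2 * 2) adds a single byte to the length here.
+ */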
+
+ /*
+ * TODO: we can probably do away with alignment here, and save some
+ * precious disk space. When there's no bitmap we can save 6 bytes. Maybe
+ * we can use the first col's type alignment instead of maxalign.
+ */
+ len = hoff = MAXALIGN(len);
+
+ data_len = heap_compute_data_size(mmtuple_disk_tupdesc(mmdesc),
+ values, nulls);
+
+ len += data_len;
+
+ rettuple = palloc0(len);
+ rettuple->mt_info = hoff;
+ Assert((rettuple->mt_info & MMIDX_OFFSET_MASK) == hoff);
+
+ /*
+ * The infomask and null bitmap as computed by heap_fill_tuple are useless
+ * to us. However, that function will not accept a null infomask; and we
+ * need to pass a valid null bitmap so that it will correctly skip
+ * outputting null attributes in the data area.
+ */
+ heap_fill_tuple(mmtuple_disk_tupdesc(mmdesc),
+ values,
+ nulls,
+ (char *) rettuple + hoff,
+ data_len,
+ &phony_infomask,
+ phony_nullbitmap);
+
+ /* done with these */
+ pfree(values);
+ pfree(nulls);
+ pfree(phony_nullbitmap);
+
+ /*
+ * Now fill in the real null bitmasks. allnulls first.
+ */
+ if (anynulls)
+ {
+ bits8 *bitP;
+ int bitmask;
+
+ rettuple->mt_info |= MMIDX_NULLS_MASK;
+
+ bitP = ((bits8 *) ((char *) rettuple + SizeOfMinMaxTuple)) - 1;
+ bitmask = HIGHBIT;
+ for (keyno = 0; keyno < mmdesc->md_tupdesc->natts; keyno++)
+ {
+ if (bitmask != HIGHBIT)
+ bitmask <<= 1;
+ else
+ {
+ bitP += 1;
+ *bitP = 0x0;
+ bitmask = 1;
+ }
+
+ if (tuple->dt_columns[keyno].allnulls)
+ continue;
+
+ *bitP |= bitmask;
+ }
+ /* hasnulls bits follow */
+ for (keyno = 0; keyno < mmdesc->md_tupdesc->natts; keyno++)
+ {
+ if (bitmask != HIGHBIT)
+ bitmask <<= 1;
+ else
+ {
+ bitP += 1;
+ *bitP = 0x0;
+ bitmask = 1;
+ }
+
+ if (tuple->dt_columns[keyno].hasnulls)
+ continue;
+
+ *bitP |= bitmask;
+ }
+ }
+
+ *size = len;
+ return rettuple;
+}
+
+/*
+ * Free a tuple created by minmax_form_tuple
+ */
+void
+minmax_free_tuple(MMTuple *tuple)
+{
+ pfree(tuple);
+}
+
+DeformedMMTuple *
+minmax_new_dtuple(MinmaxDesc *mmdesc)
+{
+ DeformedMMTuple *dtup;
+ char *currdatum;
+ long basesize;
+ int i;
+
+ basesize = MAXALIGN(sizeof(DeformedMMTuple) +
+ sizeof(MMValues) * mmdesc->md_tupdesc->natts);
+ dtup = palloc0(basesize + sizeof(Datum) * mmdesc->md_totalstored);
+ currdatum = (char *) dtup + basesize;
+ for (i = 0; i < mmdesc->md_tupdesc->natts; i++)
+ {
+ dtup->dt_columns[i].allnulls = true;
+ dtup->dt_columns[i].hasnulls = false;
+ dtup->dt_columns[i].values = (Datum *) currdatum;
+ currdatum += sizeof(Datum) * mmdesc->md_info[i]->oi_nstored;
+ }
+
+ return dtup;
+}
+
+/*
+ * Reset a DeformedMMTuple to initial state
+ */
+void
+minmax_dtuple_initialize(DeformedMMTuple *dtuple, MinmaxDesc *mmdesc)
+{
+ int i;
+
+ dtuple->dt_seentup = false;
+
+ for (i = 0; i < mmdesc->md_tupdesc->natts; i++)
+ {
+ /*
+ * FIXME -- we may need to pfree() some datums here before clobbering
+ * the whole thing
+ */
+ dtuple->dt_columns[i].allnulls = true;
+ dtuple->dt_columns[i].hasnulls = false;
+ memset(dtuple->dt_columns[i].values, 0,
+ sizeof(Datum) * mmdesc->md_info[i]->oi_nstored);
+ }
+}
+
+/*
+ * Convert a MMTuple back to a DeformedMMTuple. This is the reverse of
+ * minmax_form_tuple.
+ *
+ * Note we don't need the "on disk tupdesc" here; we rely on our own routine to
+ * deconstruct the tuple from the on-disk format.
+ *
+ * XXX some callers might need copies of each datum; if so we need to apply
+ * datumCopy inside the loop. We probably also need a minmax_free_dtuple()
+ * function.
+ */
+DeformedMMTuple *
+minmax_deform_tuple(MinmaxDesc *mmdesc, MMTuple *tuple)
+{
+ DeformedMMTuple *dtup;
+ Datum *values;
+ bool *allnulls;
+ bool *hasnulls;
+ char *tp;
+ bits8 *nullbits;
+ int keyno;
+ int valueno;
+
+ dtup = minmax_new_dtuple(mmdesc);
+
+ values = palloc(sizeof(Datum) * mmdesc->md_totalstored);
+ allnulls = palloc(sizeof(bool) * mmdesc->md_tupdesc->natts);
+ hasnulls = palloc(sizeof(bool) * mmdesc->md_tupdesc->natts);
+
+ tp = (char *) tuple + MMTupleDataOffset(tuple);
+
+ if (MMTupleHasNulls(tuple))
+ nullbits = (bits8 *) ((char *) tuple + SizeOfMinMaxTuple);
+ else
+ nullbits = NULL;
+ mm_deconstruct_tuple(mmdesc,
+ tp, nullbits, MMTupleHasNulls(tuple),
+ values, allnulls, hasnulls);
+
+ /*
+ * Iterate to assign each of the values to the corresponding item
+ * in the values array of each column.
+ */
+ for (valueno = 0, keyno = 0; keyno < mmdesc->md_tupdesc->natts; keyno++)
+ {
+ int i;
+
+ if (allnulls[keyno])
+ {
+ valueno += mmdesc->md_info[keyno]->oi_nstored;
+ continue;
+ }
+
+ dtup->dt_columns[keyno].values =
+ palloc(sizeof(Datum) * mmdesc->md_totalstored);
+
+ /* XXX optional datumCopy()? */
+ for (i = 0; i < mmdesc->md_info[keyno]->oi_nstored; i++)
+ dtup->dt_columns[keyno].values[i] = values[valueno++];
+
+ dtup->dt_columns[keyno].hasnulls = hasnulls[keyno];
+ dtup->dt_columns[keyno].allnulls = false;
+ }
+
+ pfree(values);
+ pfree(allnulls);
+ pfree(hasnulls);
+
+ return dtup;
+}
+
+/*
+ * mm_deconstruct_tuple
+ * Guts of attribute extraction from an on-disk minmax tuple.
+ *
+ * Its arguments are:
+ * mmdesc minmax descriptor for the stored tuple
+ * tp pointer to the tuple data area
+ * nullbits pointer to the tuple nulls bitmask
+ * nulls "has nulls" bit in tuple infomask
+ * values output values, array of size mmdesc->md_totalstored
+ * allnulls output "allnulls", size mmdesc->md_tupdesc->natts
+ * hasnulls output "hasnulls", size mmdesc->md_tupdesc->natts
+ *
+ * Output arrays must have been allocated by caller.
+ */
+static inline void
+mm_deconstruct_tuple(MinmaxDesc *mmdesc,
+ char *tp, bits8 *nullbits, bool nulls,
+ Datum *values, bool *allnulls, bool *hasnulls)
+{
+ int attnum;
+ int stored;
+ TupleDesc diskdsc;
+ long off = 0;
+
+ /*
+ * First loop over all attributes to obtain both null flags for each one.
+ */
+ for (attnum = 0; attnum < mmdesc->md_tupdesc->natts; attnum++)
+ {
+ /*
+ * the "all nulls" bit means that all values in the page range for
+ * this column are nulls. Therefore there are no values in the tuple
+ * data area.
+ */
+ if (nulls && att_isnull(attnum, nullbits))
+ {
+ allnulls[attnum] = true;
+ continue;
+ }
+
+ allnulls[attnum] = false;
+
+ /*
+ * the "has nulls" bit means that some tuples have nulls, but others
+ * have not-null values. Therefore we know the tuple contains data for
+ * this column.
+ *
+ * The hasnulls bits follow the allnulls bits in the same bitmask.
+ */
+ hasnulls[attnum] =
+ nulls && att_isnull(mmdesc->md_tupdesc->natts + attnum, nullbits);
+ }
+
+ /*
+ * Iterate to obtain each attribute's stored values. Note that since we
+ * may reuse attribute entries for more than one column, we cannot cache
+ * offsets here.
+ */
+ diskdsc = mmtuple_disk_tupdesc(mmdesc);
+ for (stored = 0, attnum = 0; attnum < mmdesc->md_tupdesc->natts; attnum++)
+ {
+ int datumno;
+
+ if (allnulls[attnum])
+ {
+ stored += mmdesc->md_info[attnum]->oi_nstored;
+ continue;
+ }
+
+ for (datumno = 0;
+ datumno < mmdesc->md_info[attnum]->oi_nstored;
+ datumno++)
+ {
+ Form_pg_attribute thisatt = diskdsc->attrs[stored];
+
+ if (thisatt->attlen == -1)
+ {
+ off = att_align_pointer(off, thisatt->attalign, -1,
+ tp + off);
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ off = att_align_nominal(off, thisatt->attalign);
+ }
+
+ values[stored++] = fetchatt(thisatt, tp + off);
+
+ off = att_addlength_pointer(off, thisatt->attlen, tp + off);
+ }
+ }
+}
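
To make the bitmask layout that mm_deconstruct_tuple reads a bit more concrete:
there is one "allnulls" bit per indexed column followed by one "hasnulls" bit
per indexed column, all in a single bitmask right after the tuple header.
Here's a tiny standalone sketch (not patch code; it only shows the bit
positions, and the set/clear polarity is whatever att_isnull() implements):

/*
 * Standalone sketch of the bitmask addressing used by mm_deconstruct_tuple:
 * one "allnulls" bit per indexed column, then one "hasnulls" bit per column,
 * in a single bitmask following the MMTuple header.  Only the bit positions
 * are shown; the set/clear polarity is whatever att_isnull() implements.
 */
#include <stdio.h>

int
main(void)
{
	int		natts = 3;		/* three indexed columns */
	int		k;

	for (k = 0; k < natts; k++)
		printf("column %d: allnulls bit %d, hasnulls bit %d\n",
			   k, k, natts + k);
	return 0;
}
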
diff --git a/src/backend/access/minmax/mmxlog.c b/src/backend/access/minmax/mmxlog.c
new file mode 100644
index 0000000..c9b1461
--- /dev/null
+++ b/src/backend/access/minmax/mmxlog.c
@@ -0,0 +1,305 @@
+/*
+ * mmxlog.c
+ * XLog replay routines for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmxlog.c
+ */
+#include "postgres.h"
+
+#include "access/minmax.h"
+#include "access/minmax_internal.h"
+#include "access/minmax_page.h"
+#include "access/minmax_revmap.h"
+#include "access/minmax_tuple.h"
+#include "access/minmax_xlog.h"
+#include "access/xlogutils.h"
+#include "storage/freespace.h"
+
+
+/*
+ * xlog replay routines
+ */
+static void
+minmax_xlog_createidx(XLogRecPtr lsn, XLogRecord *record)
+{
+ xl_minmax_createidx *xlrec = (xl_minmax_createidx *) XLogRecGetData(record);
+ Buffer buf;
+ Page page;
+
+ /* Backup blocks are not used in create_index records */
+ Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+
+ /* create the index's metapage */
+ buf = XLogReadBuffer(xlrec->node, MINMAX_METAPAGE_BLKNO, true);
+ Assert(BufferIsValid(buf));
+ page = (Page) BufferGetPage(buf);
+ mm_metapage_init(page, xlrec->pagesPerRange, xlrec->version);
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ UnlockReleaseBuffer(buf);
+
+ /* also initialize its first revmap page */
+ buf = XLogReadBuffer(xlrec->node, 1, true);
+ Assert(BufferIsValid(buf));
+ page = (Page) BufferGetPage(buf);
+ mm_page_init(page, MINMAX_PAGETYPE_REVMAP);
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ UnlockReleaseBuffer(buf);
+}
+
+static void
+minmax_xlog_insert(XLogRecPtr lsn, XLogRecord *record)
+{
+ xl_minmax_insert *xlrec = (xl_minmax_insert *) XLogRecGetData(record);
+ BlockNumber blkno;
+ Buffer buffer;
+ Page page;
+ OffsetNumber offnum;
+ int tuplen;
+ MMTuple *mmtuple;
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ blkno = ItemPointerGetBlockNumber(&(xlrec->target.tid));
+ if (record->xl_info & XLOG_MINMAX_INIT_PAGE)
+ {
+ buffer = XLogReadBuffer(xlrec->target.node, blkno, true);
+ Assert(BufferIsValid(buffer));
+ page = (Page) BufferGetPage(buffer);
+
+ mm_page_init(page, MINMAX_PAGETYPE_REGULAR);
+ }
+ else
+ {
+ buffer = XLogReadBuffer(xlrec->target.node, blkno, false);
+ if (!BufferIsValid(buffer))
+ return;
+ page = (Page) BufferGetPage(buffer);
+
+ if (lsn <= PageGetLSN(page)) /* changes are applied */
+ {
+ UnlockReleaseBuffer(buffer);
+ return;
+ }
+ }
+ offnum = ItemPointerGetOffsetNumber(&(xlrec->target.tid));
+ if (PageGetMaxOffsetNumber(page) + 1 < offnum)
+ elog(PANIC, "minmax_xlog_insert: invalid max offset number");
+
+ tuplen = record->xl_len - SizeOfMinmaxInsert;
+ mmtuple = (MMTuple *) ((char *) xlrec + SizeOfMinmaxInsert);
+
+ if (xlrec->overwrite)
+ PageOverwriteItemData(page, offnum, (Item) mmtuple, tuplen);
+ else
+ {
+ offnum = PageAddItem(page, (Item) mmtuple, tuplen, offnum, true, false);
+ if (offnum == InvalidOffsetNumber)
+ elog(PANIC, "minmax_xlog_insert: failed to add tuple");
+ }
+
+ PageSetLSN(page, lsn);
+
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+
+ /* XXX no FSM updates here ... */
+}
+
+static void
+minmax_xlog_bulkremove(XLogRecPtr lsn, XLogRecord *record)
+{
+ xl_minmax_bulkremove *xlrec = (xl_minmax_bulkremove *) XLogRecGetData(record);
+ Buffer buffer;
+ Page page;
+ OffsetNumber *offnos;
+ int noffs;
+ Size freespace;
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ buffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
+ if (!BufferIsValid(buffer))
+ return;
+ page = (Page) BufferGetPage(buffer);
+
+ if (lsn <= PageGetLSN(page)) /* changes are applied */
+ {
+ UnlockReleaseBuffer(buffer);
+ return;
+ }
+
+ offnos = (OffsetNumber *) ((char *) xlrec + SizeOfMinmaxBulkRemove);
+ noffs = (record->xl_len - SizeOfMinmaxBulkRemove) / sizeof(OffsetNumber);
+
+ PageIndexDeleteNoCompact(page, offnos, noffs);
+ freespace = PageGetFreeSpace(page);
+
+ PageSetLSN(page, lsn);
+
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+
+ /* update FSM as well */
+ XLogRecordPageWithFreeSpace(xlrec->node, xlrec->block, freespace);
+}
+
+static void
+minmax_xlog_revmap_set(XLogRecPtr lsn, XLogRecord *record)
+{
+ xl_minmax_rm_set *xlrec = (xl_minmax_rm_set *) XLogRecGetData(record);
+ bool init;
+ Buffer buffer;
+ Page page;
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ init = (record->xl_info & XLOG_MINMAX_INIT_PAGE) != 0;
+ buffer = XLogReadBuffer(xlrec->node, xlrec->mapBlock, init);
+ Assert(BufferIsValid(buffer));
+ page = BufferGetPage(buffer);
+ if (init)
+ mm_page_init(page, MINMAX_PAGETYPE_REVMAP);
+
+ rm_page_set_iptr(page, xlrec->pagesPerRange, xlrec->heapBlock,
+ ItemPointerGetBlockNumber(&(xlrec->newval)),
+ ItemPointerGetOffsetNumber(&(xlrec->newval)));
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+}
+
+static void
+minmax_xlog_metapg_set(XLogRecPtr lsn, XLogRecord *record)
+{
+ xl_minmax_metapg_set *xlrec = (xl_minmax_metapg_set *) XLogRecGetData(record);
+ Buffer meta;
+ Page metapg;
+ MinmaxMetaPageData *metadata;
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ meta = XLogReadBuffer(xlrec->node, MINMAX_METAPAGE_BLKNO, false);
+ Assert(BufferIsValid(meta));
+
+ metapg = BufferGetPage(meta);
+ metadata = (MinmaxMetaPageData *) PageGetContents(metapg);
+ metadata->revmapArrayPages[xlrec->blkidx] = xlrec->newpg;
+
+ PageSetLSN(metapg, lsn);
+ MarkBufferDirty(meta);
+ UnlockReleaseBuffer(meta);
+}
+
+static void
+minmax_xlog_init_rmpg(XLogRecPtr lsn, XLogRecord *record)
+{
+ xl_minmax_init_rmpg *xlrec = (xl_minmax_init_rmpg *) XLogRecGetData(record);
+ Buffer buffer;
+
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ buffer = XLogReadBuffer(xlrec->node, xlrec->blkno, true);
+ Assert(BufferIsValid(buffer));
+
+ if (xlrec->array)
+ initialize_rma_page(buffer);
+ else
+ initialize_rmr_page(buffer, xlrec->logblk);
+
+ PageSetLSN(BufferGetPage(buffer), lsn);
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+}
+
+static void
+minmax_xlog_rmarray_set(XLogRecPtr lsn, XLogRecord *record)
+{
+ xl_minmax_rmarray_set *xlrec = (xl_minmax_rmarray_set *) XLogRecGetData(record);
+ Buffer buffer;
+ Page page;
+ RevmapArrayContents *contents;
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ buffer = XLogReadBuffer(xlrec->node, xlrec->rmarray, false);
+ Assert(BufferIsValid(buffer));
+
+ page = BufferGetPage(buffer);
+
+ contents = (RevmapArrayContents *) PageGetContents(page);
+ contents->rma_blocks[xlrec->blkidx] = xlrec->newpg;
+ contents->rma_nblocks = xlrec->blkidx + 1; /* XXX is this okay? */
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+}
+
+void
+minmax_redo(XLogRecPtr lsn, XLogRecord *record)
+{
+ uint8 info = record->xl_info & ~XLR_INFO_MASK;
+
+ switch (info & XLOG_MINMAX_OPMASK)
+ {
+ case XLOG_MINMAX_CREATE_INDEX:
+ minmax_xlog_createidx(lsn, record);
+ break;
+ case XLOG_MINMAX_INSERT:
+ minmax_xlog_insert(lsn, record);
+ break;
+ case XLOG_MINMAX_BULKREMOVE:
+ minmax_xlog_bulkremove(lsn, record);
+ break;
+ case XLOG_MINMAX_REVMAP_SET:
+ minmax_xlog_revmap_set(lsn, record);
+ break;
+ case XLOG_MINMAX_METAPG_SET:
+ minmax_xlog_metapg_set(lsn, record);
+ break;
+ case XLOG_MINMAX_RMARRAY_SET:
+ minmax_xlog_rmarray_set(lsn, record);
+ break;
+ case XLOG_MINMAX_INIT_RMPG:
+ minmax_xlog_init_rmpg(lsn, record);
+ break;
+ default:
+ elog(PANIC, "minmax_redo: unknown op code %u", info);
+ }
+}
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index 7d092d2..5575a71 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -9,7 +9,8 @@ top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = clogdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
- mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o smgrdesc.o spgdesc.o \
+ minmaxdesc.o mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o \
+ smgrdesc.o spgdesc.o \
standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/minmaxdesc.c b/src/backend/access/rmgrdesc/minmaxdesc.c
new file mode 100644
index 0000000..efcff67
--- /dev/null
+++ b/src/backend/access/rmgrdesc/minmaxdesc.c
@@ -0,0 +1,95 @@
+/*-------------------------------------------------------------------------
+ *
+ * minmaxdesc.c
+ * rmgr descriptor routines for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/rmgrdesc/minmaxdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/minmax_xlog.h"
+
+static void
+out_target(StringInfo buf, xl_minmax_tid *target)
+{
+ appendStringInfo(buf, "rel %u/%u/%u; tid %u/%u",
+ target->node.spcNode, target->node.dbNode, target->node.relNode,
+ ItemPointerGetBlockNumber(&(target->tid)),
+ ItemPointerGetOffsetNumber(&(target->tid)));
+}
+
+void
+minmax_desc(StringInfo buf, XLogRecord *record)
+{
+ char *rec = XLogRecGetData(record);
+ uint8 info = record->xl_info & ~XLR_INFO_MASK;
+
+ info &= XLOG_MINMAX_OPMASK;
+ if (info == XLOG_MINMAX_CREATE_INDEX)
+ {
+ xl_minmax_createidx *xlrec = (xl_minmax_createidx *) rec;
+
+ appendStringInfo(buf, "create index: %u/%u/%u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode);
+ }
+ else if (info == XLOG_MINMAX_INSERT)
+ {
+ xl_minmax_insert *xlrec = (xl_minmax_insert *) rec;
+
+ if (record->xl_info & XLOG_MINMAX_INIT_PAGE)
+ appendStringInfo(buf, "insert(init): ");
+ else
+ appendStringInfo(buf, "insert: ");
+ out_target(buf, &(xlrec->target));
+ }
+ else if (info == XLOG_MINMAX_BULKREMOVE)
+ {
+ xl_minmax_bulkremove *xlrec = (xl_minmax_bulkremove *) rec;
+
+ appendStringInfo(buf, "bulkremove: rel %u/%u/%u blk %u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode, xlrec->block);
+ }
+ else if (info == XLOG_MINMAX_REVMAP_SET)
+ {
+ xl_minmax_rm_set *xlrec = (xl_minmax_rm_set *) rec;
+
+ appendStringInfo(buf, "revmap set: rel %u/%u/%u mapblk %u pagesPerRange %u item %u value %u/%u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode, xlrec->mapBlock,
+ xlrec->pagesPerRange, xlrec->heapBlock,
+ ItemPointerGetBlockNumber(&(xlrec->newval)),
+ ItemPointerGetOffsetNumber(&(xlrec->newval)));
+ }
+ else if (info == XLOG_MINMAX_METAPG_SET)
+ {
+ xl_minmax_metapg_set *xlrec = (xl_minmax_metapg_set *) rec;
+
+ appendStringInfo(buf, "metapg: rel %u/%u/%u array revmap idx %d block %u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode,
+ xlrec->blkidx, xlrec->newpg);
+ }
+ else if (info == XLOG_MINMAX_RMARRAY_SET)
+ {
+ xl_minmax_rmarray_set *xlrec = (xl_minmax_rmarray_set *) rec;
+
+ appendStringInfoString(buf, "revmap array: ");
+ appendStringInfo(buf, "rel %u/%u/%u array pg %u revmap idx %d block %u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode,
+ xlrec->rmarray,
+ xlrec->blkidx, xlrec->newpg);
+ }
+
+ else
+ appendStringInfo(buf, "UNKNOWN");
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index c0a7a6f..e285e50 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -12,6 +12,7 @@
#include "access/gist_private.h"
#include "access/hash.h"
#include "access/heapam_xlog.h"
+#include "access/minmax_xlog.h"
#include "access/multixact.h"
#include "access/nbtree.h"
#include "access/spgist.h"
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index a5a204e..cbb0ab8 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2096,6 +2096,27 @@ IndexBuildHeapScan(Relation heapRelation,
IndexBuildCallback callback,
void *callback_state)
{
+ return IndexBuildHeapRangeScan(heapRelation, indexRelation,
+ indexInfo, allow_sync,
+ 0, InvalidBlockNumber,
+ callback, callback_state);
+}
+
+/*
+ * As above, except that instead of scanning the complete heap, only numblocks
+ * blocks starting at start_blockno are scanned. A scan to end-of-rel can be
+ * signalled by passing InvalidBlockNumber as numblocks.
+ */
+double
+IndexBuildHeapRangeScan(Relation heapRelation,
+ Relation indexRelation,
+ IndexInfo *indexInfo,
+ bool allow_sync,
+ BlockNumber start_blockno,
+ BlockNumber numblocks,
+ IndexBuildCallback callback,
+ void *callback_state)
+{
bool is_system_catalog;
bool checking_uniqueness;
HeapScanDesc scan;
@@ -2166,6 +2187,9 @@ IndexBuildHeapScan(Relation heapRelation,
true, /* buffer access strategy OK */
allow_sync); /* syncscan OK? */
+ /* set our endpoints */
+ heap_setscanlimits(scan, start_blockno, numblocks);
+
reltuples = 0;
/*
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 9f1b20e..55e375f 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -132,6 +132,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogRecord *record)
case RM_GIST_ID:
case RM_SEQ_ID:
case RM_SPGIST_ID:
+ case RM_MINMAX_ID:
break;
case RM_NEXT_ID:
elog(ERROR, "unexpected RM_NEXT_ID rmgr_id: %u", (RmgrIds) buf.record.xl_rmid);
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 6351a9b..5000f39 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -324,6 +324,41 @@ PageAddItem(Page page,
}
/*
+ * PageOverwriteItemData
+ * Overwrite the data for the item at the given offset.
+ *
+ * The new data must fit in the existing data space for the old tuple.
+ */
+void
+PageOverwriteItemData(Page page, OffsetNumber offset, Item item, Size size)
+{
+ PageHeader phdr = (PageHeader) page;
+ ItemId itemId;
+
+ /*
+ * Be wary about corrupted page pointers
+ */
+ if (phdr->pd_lower < SizeOfPageHeaderData ||
+ phdr->pd_lower > phdr->pd_upper ||
+ phdr->pd_upper > phdr->pd_special ||
+ phdr->pd_special > BLCKSZ)
+ ereport(PANIC,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u",
+ phdr->pd_lower, phdr->pd_upper, phdr->pd_special)));
+
+ itemId = PageGetItemId(phdr, offset);
+ if (!ItemIdIsUsed(itemId) || !ItemIdHasStorage(itemId))
+ elog(ERROR, "existing item to overwrite is not used");
+
+ if (ItemIdGetLength(itemId) < size)
+ elog(ERROR, "existing item is not large enough to be overwritten");
+
+ memcpy((char *) page + ItemIdGetOffset(itemId), item, size);
+ ItemIdSetNormal(itemId, ItemIdGetOffset(itemId), size);
+}
+
+/*
* PageGetTempPage
* Get a temporary page in local memory for special processing.
* The returned page is not initialized at all; caller must do that.
@@ -399,7 +434,8 @@ PageRestoreTempPage(Page tempPage, Page oldPage)
}
/*
- * sorting support for PageRepairFragmentation and PageIndexMultiDelete
+ * sorting support for PageRepairFragmentation, PageIndexMultiDelete,
+ * PageIndexDeleteNoCompact
*/
typedef struct itemIdSortData
{
@@ -896,6 +932,182 @@ PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
phdr->pd_upper = upper;
}
+/*
+ * PageIndexDeleteNoCompact
+ * Delete the given items for an index page, and defragment the resulting
+ * free space, but do not compact the item pointers array.
+ *
+ * itemnos is the array of item offset numbers to delete; nitems is its size.
+ *
+ * Unused items at the end of the array are removed.
+ *
+ * This is used for index AMs that require that existing TIDs of live tuples
+ * remain unchanged.
+ */
+void
+PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos, int nitems)
+{
+ PageHeader phdr = (PageHeader) page;
+ LocationIndex pd_lower = phdr->pd_lower;
+ LocationIndex pd_upper = phdr->pd_upper;
+ LocationIndex pd_special = phdr->pd_special;
+ int nline;
+ bool empty;
+ OffsetNumber offnum;
+ int nextitm;
+
+ /*
+ * As with PageRepairFragmentation, paranoia seems justified.
+ */
+ if (pd_lower < SizeOfPageHeaderData ||
+ pd_lower > pd_upper ||
+ pd_upper > pd_special ||
+ pd_special > BLCKSZ ||
+ pd_special != MAXALIGN(pd_special))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u",
+ pd_lower, pd_upper, pd_special)));
+
+ /*
+ * Scan the existing item pointer array and mark as unused those that are
+ * in our kill-list; make sure any non-interesting ones are marked unused
+ * as well.
+ */
+ nline = PageGetMaxOffsetNumber(page);
+ empty = true;
+ nextitm = 0;
+ for (offnum = FirstOffsetNumber; offnum <= nline; offnum = OffsetNumberNext(offnum))
+ {
+ ItemId lp;
+ ItemLength itemlen;
+ ItemOffset offset;
+
+ lp = PageGetItemId(page, offnum);
+
+ itemlen = ItemIdGetLength(lp);
+ offset = ItemIdGetOffset(lp);
+
+ if (ItemIdIsUsed(lp))
+ {
+ if (offset < pd_upper ||
+ (offset + itemlen) > pd_special ||
+ offset != MAXALIGN(offset))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted item pointer: offset = %u, length = %u",
+ offset, (unsigned int) itemlen)));
+
+ if (nextitm < nitems && offnum == itemnos[nextitm])
+ {
+ /* this one is on our list to delete, so mark it unused */
+ ItemIdSetUnused(lp);
+ nextitm++;
+ }
+ else if (ItemIdHasStorage(lp))
+ {
+ /* This one's live -- must do the compaction dance */
+ empty = false;
+ }
+ else
+ {
+ /* get rid of this one too */
+ ItemIdSetUnused(lp);
+ }
+ }
+ }
+
+ /* this will catch invalid or out-of-order itemnos[] */
+ if (nextitm != nitems)
+ elog(ERROR, "incorrect index offsets supplied");
+
+ if (empty)
+ {
+ /* Page is completely empty, so just reset it quickly */
+ phdr->pd_lower = SizeOfPageHeaderData;
+ phdr->pd_upper = pd_special;
+ }
+ else
+ {
+ /* There are live items: need to compact the page the hard way */
+ itemIdSortData itemidbase[MaxOffsetNumber];
+ itemIdSort itemidptr;
+ int i;
+ Size totallen;
+ Offset upper;
+
+ /*
+ * Scan the page taking note of each item that we need to preserve.
+ * This includes both live items (those that contain data) and
+ * interspersed unused ones. It's critical to preserve these unused
+ * items, because otherwise the offset numbers for later live items
+ * would change, which is not acceptable. Unused items might get used
+ * again later; that is fine.
+ */
+ itemidptr = itemidbase;
+ totallen = 0;
+ for (i = 0; i < nline; i++, itemidptr++)
+ {
+ ItemId lp;
+
+ itemidptr->offsetindex = i;
+
+ lp = PageGetItemId(page, i + 1);
+ if (ItemIdHasStorage(lp))
+ {
+ itemidptr->itemoff = ItemIdGetOffset(lp);
+ itemidptr->alignedlen = MAXALIGN(ItemIdGetLength(lp));
+ totallen += itemidptr->alignedlen;
+ }
+ else
+ {
+ itemidptr->itemoff = 0;
+ itemidptr->alignedlen = 0;
+ }
+ }
+ /* By here, there are exactly nline elements in itemidbase array */
+
+ if (totallen > (Size) (pd_special - pd_lower))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted item lengths: total %u, available space %u",
+ (unsigned int) totallen, pd_special - pd_lower)));
+
+ /* sort itemIdSortData array into decreasing itemoff order */
+ qsort((char *) itemidbase, nline, sizeof(itemIdSortData),
+ itemoffcompare);
+
+ /*
+ * Defragment the data areas of each tuple, being careful to preserve
+ * each item's position in the linp array.
+ */
+ upper = pd_special;
+ PageClearHasFreeLinePointers(page);
+ for (i = 0, itemidptr = itemidbase; i < nline; i++, itemidptr++)
+ {
+ ItemId lp;
+
+ lp = PageGetItemId(page, itemidptr->offsetindex + 1);
+ if (itemidptr->alignedlen == 0)
+ {
+ PageSetHasFreeLinePointers(page);
+ ItemIdSetUnused(lp);
+ continue;
+ }
+ upper -= itemidptr->alignedlen;
+ memmove((char *) page + upper,
+ (char *) page + itemidptr->itemoff,
+ itemidptr->alignedlen);
+ lp->lp_off = upper;
+ /* lp_flags and lp_len remain the same as originally */
+ }
+
+ /* Set the new page limits */
+ phdr->pd_upper = upper;
+ phdr->pd_lower = SizeOfPageHeaderData + i * sizeof(ItemIdData);
+ }
+}
/*
* Set checksum for a page in shared buffers.
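
For context on why PageIndexDeleteNoCompact must not shuffle line pointers:
the revmap addresses index tuples by TID, so a surviving tuple has to keep
its offset number. A toy model (nothing to do with the real bufpage.c data
structures) of compacting vs. non-compacting deletion:

/*
 * Toy model: delete entry "B" (slot 2) from a four-slot array.  With
 * compaction the survivors shift down and get new slot numbers; without
 * compaction the slot is merely marked unused and survivors keep theirs.
 */
#include <stdio.h>

#define NSLOTS 4

int
main(void)
{
	const char *compacted[NSLOTS] = {"A", "C", "D", NULL};
	const char *noncompact[NSLOTS] = {"A", NULL, "C", "D"};
	int		i;

	for (i = 0; i < NSLOTS; i++)
		printf("slot %d: compacted=%-8s non-compacting=%s\n",
			   i + 1,
			   compacted[i] ? compacted[i] : "(unused)",
			   noncompact[i] ? noncompact[i] : "(unused)");
	return 0;
}
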
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index e932ccf..61e1a28 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -7349,3 +7349,27 @@ gincostestimate(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+
+Datum
+mmcostestimate(PG_FUNCTION_ARGS)
+{
+ PlannerInfo *root = (PlannerInfo *) PG_GETARG_POINTER(0);
+ IndexPath *path = (IndexPath *) PG_GETARG_POINTER(1);
+ double loop_count = PG_GETARG_FLOAT8(2);
+ Cost *indexStartupCost = (Cost *) PG_GETARG_POINTER(3);
+ Cost *indexTotalCost = (Cost *) PG_GETARG_POINTER(4);
+ Selectivity *indexSelectivity = (Selectivity *) PG_GETARG_POINTER(5);
+ double *indexCorrelation = (double *) PG_GETARG_POINTER(6);
+ IndexOptInfo *index = path->indexinfo;
+
+ *indexStartupCost = (Cost) seq_page_cost * index->pages * loop_count;
+ *indexTotalCost = *indexStartupCost;
+
+ *indexSelectivity =
+ clauselist_selectivity(root, path->indexquals,
+ path->indexinfo->rel->relid,
+ JOIN_INNER, NULL);
+ *indexCorrelation = 1;
+
+ PG_RETURN_VOID();
+}
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 493839f..5354a3b 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -112,6 +112,8 @@ extern HeapScanDesc heap_beginscan_strat(Relation relation, Snapshot snapshot,
bool allow_strat, bool allow_sync);
extern HeapScanDesc heap_beginscan_bm(Relation relation, Snapshot snapshot,
int nkeys, ScanKey key);
+extern void heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk,
+ BlockNumber endBlk);
extern void heap_rescan(HeapScanDesc scan, ScanKey key);
extern void heap_endscan(HeapScanDesc scan);
extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
diff --git a/src/include/access/minmax.h b/src/include/access/minmax.h
new file mode 100644
index 0000000..edb88ba
--- /dev/null
+++ b/src/include/access/minmax.h
@@ -0,0 +1,52 @@
+/*
+ * AM-callable functions for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax.h
+ */
+#ifndef MINMAX_H
+#define MINMAX_H
+
+#include "fmgr.h"
+#include "nodes/execnodes.h"
+#include "utils/relcache.h"
+
+
+/*
+ * prototypes for functions in minmax.c (external entry points for minmax)
+ */
+extern Datum mmbuild(PG_FUNCTION_ARGS);
+extern Datum mmbuildempty(PG_FUNCTION_ARGS);
+extern Datum mminsert(PG_FUNCTION_ARGS);
+extern Datum mmbeginscan(PG_FUNCTION_ARGS);
+extern Datum mmgettuple(PG_FUNCTION_ARGS);
+extern Datum mmgetbitmap(PG_FUNCTION_ARGS);
+extern Datum mmrescan(PG_FUNCTION_ARGS);
+extern Datum mmendscan(PG_FUNCTION_ARGS);
+extern Datum mmmarkpos(PG_FUNCTION_ARGS);
+extern Datum mmrestrpos(PG_FUNCTION_ARGS);
+extern Datum mmbulkdelete(PG_FUNCTION_ARGS);
+extern Datum mmvacuumcleanup(PG_FUNCTION_ARGS);
+extern Datum mmcanreturn(PG_FUNCTION_ARGS);
+extern Datum mmcostestimate(PG_FUNCTION_ARGS);
+extern Datum mmoptions(PG_FUNCTION_ARGS);
+
+/*
+ * Storage type for MinMax' reloptions
+ */
+typedef struct MinmaxOptions
+{
+ int32 vl_len_; /* varlena header (do not touch directly!) */
+ BlockNumber pagesPerRange;
+} MinmaxOptions;
+
+#define MINMAX_DEFAULT_PAGES_PER_RANGE 128
+#define MinmaxGetPagesPerRange(relation) \
+ ((relation)->rd_options ? \
+ ((MinmaxOptions *) (relation)->rd_options)->pagesPerRange : \
+ MINMAX_DEFAULT_PAGES_PER_RANGE)
+
+#endif /* MINMAX_H */
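
As a reminder of what pagesPerRange means in practice: every index tuple
summarizes a fixed-size range of heap blocks, so mapping a heap block to its
range is plain integer division. A throwaway sketch (not patch code) using
the default of 128 pages per range:

/* Sketch: map a heap block to its summarizing block range and back. */
#include <stdio.h>

typedef unsigned int BlockNumber;

#define PAGES_PER_RANGE 128		/* MINMAX_DEFAULT_PAGES_PER_RANGE */

int
main(void)
{
	BlockNumber heapBlk = 1000;
	BlockNumber rangeNo = heapBlk / PAGES_PER_RANGE;			/* 7 */
	BlockNumber rangeStart = rangeNo * PAGES_PER_RANGE;		/* 896 */
	BlockNumber rangeEnd = rangeStart + PAGES_PER_RANGE - 1;	/* 1023 */

	printf("heap block %u is summarized by range %u (blocks %u..%u)\n",
		   heapBlk, rangeNo, rangeStart, rangeEnd);
	return 0;
}
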
diff --git a/src/include/access/minmax_internal.h b/src/include/access/minmax_internal.h
new file mode 100644
index 0000000..b7c28be
--- /dev/null
+++ b/src/include/access/minmax_internal.h
@@ -0,0 +1,83 @@
+/*
+ * minmax_internal.h
+ * internal declarations for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_internal.h
+ */
+#ifndef MINMAX_INTERNAL_H
+#define MINMAX_INTERNAL_H
+
+#include "fmgr.h"
+#include "storage/buf.h"
+#include "storage/bufpage.h"
+#include "storage/off.h"
+#include "utils/relcache.h"
+
+
+/*
+ * A MinmaxDesc is a struct designed to enable decoding a MinMax tuple from the
+ * on-disk format to a DeformedMMTuple and vice-versa.
+ *
+ * Note: we assume, for now, that the data stored for each column is the same
+ * datatype as the indexed heap column. This restriction can be lifted by
+ * having an Oid array pointer on the PerCol struct, where each member of the
+ * array indicates the typid of the stored data.
+ */
+
+/* struct returned by "OpcInfo" amproc */
+typedef struct MinmaxOpcInfo
+{
+ /* Number of columns stored in an index column of this opclass */
+ uint16 oi_nstored;
+
+ /* Opaque pointer for the opclass' private use */
+ void *oi_opaque;
+} MinmaxOpcInfo;
+
+typedef struct MinmaxDesc
+{
+ /* the index relation itself */
+ Relation md_index;
+
+ /* tuple descriptor of the index relation */
+ TupleDesc md_tupdesc;
+
+ /* cached copy for on-disk tuples; generated at first use */
+ TupleDesc md_disktdesc;
+
+ /* total number of Datum entries that are stored on-disk for all columns */
+ int md_totalstored;
+
+ /* per-column info */
+ MinmaxOpcInfo *md_info[FLEXIBLE_ARRAY_MEMBER]; /* tupdesc->natts entries long */
+} MinmaxDesc;
+
+/*
+ * Globally-known function support numbers for Minmax indexes. Individual
+ * opclasses define their own function support numbers, which must not collide
+ * with the definitions here.
+ */
+#define MINMAX_PROCNUM_OPCINFO 1
+#define MINMAX_PROCNUM_ADDVALUE 2
+#define MINMAX_PROCNUM_CONSISTENT 3
+
+#define MINMAX_DEBUG
+
+/* we allow debug if using GCC; otherwise don't bother */
+#if defined(MINMAX_DEBUG) && defined(__GNUC__)
+#define MINMAX_elog(level, ...) elog(level, __VA_ARGS__)
+#else
+#define MINMAX_elog(a) ((void) 0)
+#endif
+
+/* minmax.c */
+extern Buffer mm_getnewbuffer(Relation irel);
+extern void mm_page_init(Page page, uint16 type);
+extern void mm_metapage_init(Page page, BlockNumber pagesPerRange,
+ uint16 version);
+
+#endif /* MINMAX_INTERNAL_H */
diff --git a/src/include/access/minmax_page.h b/src/include/access/minmax_page.h
new file mode 100644
index 0000000..04f40d8
--- /dev/null
+++ b/src/include/access/minmax_page.h
@@ -0,0 +1,88 @@
+/*
+ * Prototypes and definitions for minmax page layouts
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_page.h
+ *
+ * NOTES
+ *
+ * These structs should really be private to specific minmax files, but it's
+ * useful to have them here so that they can be used by pageinspect and similar
+ * tools.
+ */
+#ifndef MINMAX_PAGE_H
+#define MINMAX_PAGE_H
+
+
+/* special space on all minmax pages stores a "type" identifier */
+#define MINMAX_PAGETYPE_META 0xF091
+#define MINMAX_PAGETYPE_REVMAP_ARRAY 0xF092
+#define MINMAX_PAGETYPE_REVMAP 0xF093
+#define MINMAX_PAGETYPE_REGULAR 0xF094
+
+typedef struct MinmaxSpecialSpace
+{
+ uint16 type;
+} MinmaxSpecialSpace;
+
+/* Metapage definitions */
+typedef struct MinmaxMetaPageData
+{
+ uint32 minmaxMagic;
+ uint32 minmaxVersion;
+ BlockNumber pagesPerRange;
+ BlockNumber revmapArrayPages[1]; /* actually MAX_REVMAP_ARRAYPAGES */
+} MinmaxMetaPageData;
+
+/*
+ * Number of array pages listed in the metapage. We must leave enough space
+ * for the page header, the metapage struct, and the minmax special space.
+ */
+#define MAX_REVMAP_ARRAYPAGES \
+ ((BLCKSZ - \
+ MAXALIGN(SizeOfPageHeaderData) - \
+ offsetof(MinmaxMetaPageData, revmapArrayPages) - \
+ MAXALIGN(sizeof(MinmaxSpecialSpace)) ) / \
+ sizeof(BlockNumber))
+
+#define MINMAX_CURRENT_VERSION 1
+#define MINMAX_META_MAGIC 0xA8109CFA
+
+#define MINMAX_METAPAGE_BLKNO 0
+
+/* Definitions for regular revmap pages */
+typedef struct RevmapContents
+{
+ int32 rmr_logblk; /* logical blkno of this revmap page */
+ ItemPointerData rmr_tids[1]; /* really REGULAR_REVMAP_PAGE_MAXITEMS */
+} RevmapContents;
+
+#define REGULAR_REVMAP_CONTENT_SIZE \
+ (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - \
+ offsetof(RevmapContents, rmr_tids) - \
+ MAXALIGN(sizeof(MinmaxSpecialSpace)))
+/* max num of items in the array */
+#define REGULAR_REVMAP_PAGE_MAXITEMS \
+ (REGULAR_REVMAP_CONTENT_SIZE / sizeof(ItemPointerData))
+
+/* Definitions for array revmap pages */
+typedef struct RevmapArrayContents
+{
+ int32 rma_nblocks;
+ BlockNumber rma_blocks[1]; /* really ARRAY_REVMAP_PAGE_MAXITEMS */
+} RevmapArrayContents;
+
+#define REVMAP_ARRAY_CONTENT_SIZE \
+ (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - \
+ offsetof(RevmapArrayContents, rma_blocks) - \
+ MAXALIGN(sizeof(MinmaxSpecialSpace)))
+/* max num of items in the array */
+#define ARRAY_REVMAP_PAGE_MAXITEMS \
+ (REVMAP_ARRAY_CONTENT_SIZE / sizeof(BlockNumber))
+
+
+#endif /* MINMAX_PAGE_H */
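
Putting these layout constants together, finding the revmap entry for a heap
block is two divisions: heap block to logical entry, then entry to (logical
revmap page, slot). The real lookup lives in the revmap code; this
back-of-envelope sketch uses a made-up items-per-page figure, since the real
REGULAR_REVMAP_PAGE_MAXITEMS depends on BLCKSZ:

/*
 * Sketch of revmap addressing: heap block -> logical entry -> (logical
 * revmap page, slot).  ITEMS_PER_REVMAP_PAGE stands in for
 * REGULAR_REVMAP_PAGE_MAXITEMS; 1300 is only an illustrative figure.
 */
#include <stdio.h>

typedef unsigned int BlockNumber;

#define PAGES_PER_RANGE			128
#define ITEMS_PER_REVMAP_PAGE	1300	/* assumed, not the real constant */

int
main(void)
{
	BlockNumber heapBlk = 500000;
	BlockNumber entry = heapBlk / PAGES_PER_RANGE;			/* 3906 */
	BlockNumber revmapPage = entry / ITEMS_PER_REVMAP_PAGE;	/* 3 */
	BlockNumber slot = entry % ITEMS_PER_REVMAP_PAGE;		/* 6 */

	printf("heap block %u -> entry %u -> logical revmap page %u, slot %u\n",
		   heapBlk, entry, revmapPage, slot);
	return 0;
}
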
diff --git a/src/include/access/minmax_revmap.h b/src/include/access/minmax_revmap.h
new file mode 100644
index 0000000..1c968f3
--- /dev/null
+++ b/src/include/access/minmax_revmap.h
@@ -0,0 +1,40 @@
+/*
+ * prototypes for minmax reverse range maps
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_revmap.h
+ */
+
+#ifndef MINMAX_REVMAP_H
+#define MINMAX_REVMAP_H
+
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/off.h"
+#include "utils/relcache.h"
+
+/* struct definition lives in mmrevmap.c */
+typedef struct mmRevmapAccess mmRevmapAccess;
+
+extern mmRevmapAccess *mmRevmapAccessInit(Relation idxrel, BlockNumber *pagesPerRange);
+extern void mmRevmapAccessTerminate(mmRevmapAccess *rmAccess);
+
+extern void mmRevmapCreate(Relation idxrel);
+extern void mmSetHeapBlockItemptr(mmRevmapAccess *rmAccess, BlockNumber blk,
+ BlockNumber blkno, OffsetNumber offno);
+extern void mmGetHeapBlockItemptr(mmRevmapAccess *rmAccess, BlockNumber blk,
+ ItemPointerData *iptr);
+extern void mmRevmapTruncate(mmRevmapAccess *rmAccess,
+ BlockNumber heapNumBlocks);
+
+/* internal stuff also used by xlog replay */
+extern void rm_page_set_iptr(Page page, BlockNumber pagesPerRange,
+ BlockNumber heapBlk, BlockNumber blkno, OffsetNumber offno);
+extern BlockNumber initialize_rmr_page(Buffer newbuf, BlockNumber mapBlk);
+extern void initialize_rma_page(Buffer buf);
+
+
+#endif /* MINMAX_REVMAP_H */
diff --git a/src/include/access/minmax_tuple.h b/src/include/access/minmax_tuple.h
new file mode 100644
index 0000000..bd57fdd
--- /dev/null
+++ b/src/include/access/minmax_tuple.h
@@ -0,0 +1,84 @@
+/*
+ * Declarations for dealing with MinMax-specific tuples.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_tuple.h
+ */
+#ifndef MINMAX_TUPLE_H
+#define MINMAX_TUPLE_H
+
+#include "access/minmax_internal.h"
+#include "access/tupdesc.h"
+
+
+/*
+ * A minmax index stores one index tuple per page range. Each index tuple
+ * has one MMValues struct for each indexed column; in turn, each MMValues
+ * has (besides the null flags) an array of Datum whose size is determined by
+ * the opclass.
+ */
+typedef struct MMValues
+{
+ bool hasnulls; /* are there any nulls in the page range? */
+ bool allnulls; /* are all values nulls in the page range? */
+ Datum *values; /* current accumulated values */
+} MMValues;
+
+/*
+ * This struct represents one index tuple, comprising the minimum and maximum
+ * values for all indexed columns, within one page range. These values can
+ * only be meaningfully decoded with an appropriate MinmaxDesc.
+ */
+typedef struct DeformedMMTuple
+{
+ int dt_seentup;
+ MMValues dt_columns[FLEXIBLE_ARRAY_MEMBER];
+} DeformedMMTuple;
+
+/*
+ * An on-disk minmax tuple. The header is possibly followed by a nulls bitmask,
+ * with room for two null bits per indexed column (the "allnulls" and
+ * "hasnulls" flags); an opclass-defined number of Datum values for each
+ * column follows.
+ */
+typedef struct MMTuple
+{
+ /* ---------------
+ * mt_info is laid out in the following fashion:
+ *
+ * 7th (high) bit: has nulls
+ * 6th bit: unused
+ * 5th bit: unused
+ * 4-0 bit: offset of data
+ * ---------------
+ */
+ uint8 mt_info;
+} MMTuple;
+
+#define SizeOfMinMaxTuple (offsetof(MMTuple, mt_info) + sizeof(uint8))
+
+/*
+ * t_info manipulation macros
+ */
+#define MMIDX_OFFSET_MASK 0x1F
+/* bit 0x20 is not used at present */
+/* bit 0x40 is not used at present */
+#define MMIDX_NULLS_MASK 0x80
+
+#define MMTupleDataOffset(mmtup) ((Size) (((MMTuple *) (mmtup))->mt_info & MMIDX_OFFSET_MASK))
+#define MMTupleHasNulls(mmtup) (((((MMTuple *) (mmtup))->mt_info & MMIDX_NULLS_MASK)) != 0)
+
+
+extern MMTuple *minmax_form_tuple(MinmaxDesc *mmdesc,
+ DeformedMMTuple *tuple, Size *size);
+extern void minmax_free_tuple(MMTuple *tuple);
+
+extern DeformedMMTuple *minmax_new_dtuple(MinmaxDesc *mmdesc);
+extern void minmax_dtuple_initialize(DeformedMMTuple *dtuple,
+ MinmaxDesc *mmdesc);
+extern DeformedMMTuple *minmax_deform_tuple(MinmaxDesc *mmdesc,
+ MMTuple *tuple);
+
+#endif /* MINMAX_TUPLE_H */
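
Since mt_info packs the data offset and the nulls flag into a single byte,
here's a minimal standalone sketch (not patch code; the example offset value
is arbitrary) of packing and unpacking it with the masks above:

/*
 * Sketch: pack and unpack the MMTuple mt_info byte.  The low five bits hold
 * the offset of the data area; the high bit says whether a nulls bitmask is
 * present.
 */
#include <stdio.h>

typedef unsigned char uint8;

#define MMIDX_OFFSET_MASK 0x1F
#define MMIDX_NULLS_MASK  0x80

int
main(void)
{
	unsigned	dataoff = 3;	/* e.g. header byte plus a short bitmask */
	int			hasnulls = 1;
	uint8		mt_info;

	mt_info = (uint8) ((dataoff & MMIDX_OFFSET_MASK) |
					   (hasnulls ? MMIDX_NULLS_MASK : 0));

	printf("offset=%u hasnulls=%d\n",
		   (unsigned) (mt_info & MMIDX_OFFSET_MASK),
		   (mt_info & MMIDX_NULLS_MASK) != 0);
	return 0;
}
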
diff --git a/src/include/access/minmax_xlog.h b/src/include/access/minmax_xlog.h
new file mode 100644
index 0000000..b13fe2c
--- /dev/null
+++ b/src/include/access/minmax_xlog.h
@@ -0,0 +1,134 @@
+/*-------------------------------------------------------------------------
+ *
+ * minmax_xlog.h
+ * POSTGRES MinMax access XLOG definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/minmax_xlog.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef MINMAX_XLOG_H
+#define MINMAX_XLOG_H
+
+#include "access/xlog.h"
+#include "storage/bufpage.h"
+#include "storage/itemptr.h"
+#include "storage/relfilenode.h"
+#include "utils/relcache.h"
+
+
+/*
+ * WAL record definitions for minmax's WAL operations
+ *
+ * XLOG allows some information to be stored in the high 4 bits of the
+ * log record's xl_info field.
+ */
+#define XLOG_MINMAX_CREATE_INDEX 0x00
+#define XLOG_MINMAX_INSERT 0x10
+#define XLOG_MINMAX_BULKREMOVE 0x20
+#define XLOG_MINMAX_REVMAP_SET 0x30
+#define XLOG_MINMAX_METAPG_SET 0x40
+#define XLOG_MINMAX_RMARRAY_SET 0x50
+#define XLOG_MINMAX_INIT_RMPG 0x60
+
+#define XLOG_MINMAX_OPMASK 0x70
+/*
+ * When we insert the first item on a new page, we restore the entire page in
+ * redo.
+ */
+#define XLOG_MINMAX_INIT_PAGE 0x80
+
+/* This is what we need to know about a minmax index create */
+typedef struct xl_minmax_createidx
+{
+ BlockNumber pagesPerRange;
+ RelFileNode node;
+ uint16 version;
+} xl_minmax_createidx;
+#define SizeOfMinmaxCreateIdx (offsetof(xl_minmax_createidx, version) + sizeof(uint16))
+
+/* All that we need to find a minmax tuple */
+typedef struct xl_minmax_tid
+{
+ RelFileNode node;
+ ItemPointerData tid;
+} xl_minmax_tid;
+
+#define SizeOfMinmaxTid (offsetof(xl_minmax_tid, tid) + SizeOfIptrData)
+
+/* This is what we need to know about a minmax tuple insert */
+typedef struct xl_minmax_insert
+{
+ xl_minmax_tid target;
+ bool overwrite;
+ /* tuple data follows at end of struct */
+} xl_minmax_insert;
+
+#define SizeOfMinmaxInsert (offsetof(xl_minmax_insert, overwrite) + sizeof(bool))
+
+/* This is what we need to know about a bulk minmax tuple remove */
+typedef struct xl_minmax_bulkremove
+{
+ RelFileNode node;
+ BlockNumber block;
+ /* offset number array follows at end of struct */
+} xl_minmax_bulkremove;
+
+#define SizeOfMinmaxBulkRemove (offsetof(xl_minmax_bulkremove, block) + sizeof(BlockNumber))
+
+/* This is what we need to know about a revmap "set heap ptr" */
+typedef struct xl_minmax_rm_set
+{
+ RelFileNode node;
+ BlockNumber mapBlock;
+ int pagesPerRange;
+ BlockNumber heapBlock;
+ ItemPointerData newval;
+} xl_minmax_rm_set;
+
+#define SizeOfMinmaxRevmapSet (offsetof(xl_minmax_rm_set, newval) + SizeOfIptrData)
+
+/* This is what we need to know about a "metapage set" operation */
+typedef struct xl_minmax_metapg_set
+{
+ RelFileNode node;
+ uint32 blkidx;
+ BlockNumber newpg;
+} xl_minmax_metapg_set;
+
+#define SizeOfMinmaxMetapgSet (offsetof(xl_minmax_metapg_set, newpg) + \
+ sizeof(BlockNumber))
+
+/* This is what we need to know about a "revmap array set" operation */
+typedef struct xl_minmax_rmarray_set
+{
+ RelFileNode node;
+ BlockNumber rmarray;
+ uint32 blkidx;
+ BlockNumber newpg;
+} xl_minmax_rmarray_set;
+
+#define SizeOfMinmaxRmarraySet (offsetof(xl_minmax_rmarray_set, newpg) + \
+ sizeof(BlockNumber))
+
+/* This is what we need to know when we initialize a new revmap page */
+typedef struct xl_minmax_init_rmpg
+{
+ RelFileNode node;
+ bool array; /* array revmap page or regular revmap page */
+ BlockNumber blkno;
+ BlockNumber logblk; /* only used by regular revmap pages */
+} xl_minmax_init_rmpg;
+
+#define SizeOfMinmaxInitRmpg (offsetof(xl_minmax_init_rmpg, blkno) + \
+ sizeof(BlockNumber))
+
+
+extern void minmax_desc(StringInfo buf, XLogRecord *record);
+extern void minmax_redo(XLogRecPtr lsn, XLogRecord *record);
+
+#endif /* MINMAX_XLOG_H */
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index c226448..985d435 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -45,8 +45,9 @@ typedef enum relopt_kind
RELOPT_KIND_TABLESPACE = (1 << 7),
RELOPT_KIND_SPGIST = (1 << 8),
RELOPT_KIND_VIEW = (1 << 9),
+ RELOPT_KIND_MINMAX = (1 << 10),
/* if you add a new kind, make sure you update "last_default" too */
- RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_VIEW,
+ RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_MINMAX,
/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
RELOPT_KIND_MAX = (1 << 30)
} relopt_kind;
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 8a57698..8beb1be 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -35,8 +35,10 @@ typedef struct HeapScanDescData
bool rs_temp_snap; /* unregister snapshot at scan end? */
/* state set up at initscan time */
- BlockNumber rs_nblocks; /* number of blocks to scan */
+ BlockNumber rs_nblocks; /* total number of blocks in rel */
BlockNumber rs_startblock; /* block # to start at */
+ BlockNumber rs_initblock; /* first block # of the limited scan */
+ BlockNumber rs_numblocks; /* number of blocks to scan */
BufferAccessStrategy rs_strategy; /* access strategy for reads */
bool rs_syncscan; /* report location to syncscan logic? */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 662fb77..9dc995a 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -42,3 +42,4 @@ PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_xlog_startup, gin_xlog_cleanup
PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_xlog_startup, gist_xlog_cleanup)
PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, NULL, NULL)
PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_xlog_startup, spg_xlog_cleanup)
+PG_RMGR(RM_MINMAX_ID, "MinMax", minmax_redo, minmax_desc, NULL, NULL)
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 006b180..de90178 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -97,6 +97,14 @@ extern double IndexBuildHeapScan(Relation heapRelation,
bool allow_sync,
IndexBuildCallback callback,
void *callback_state);
+extern double IndexBuildHeapRangeScan(Relation heapRelation,
+ Relation indexRelation,
+ IndexInfo *indexInfo,
+ bool allow_sync,
+ BlockNumber start_blockno,
+ BlockNumber numblocks,
+ IndexBuildCallback callback,
+ void *callback_state);
extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
diff --git a/src/include/catalog/pg_am.h b/src/include/catalog/pg_am.h
index 759ea70..3010120 100644
--- a/src/include/catalog/pg_am.h
+++ b/src/include/catalog/pg_am.h
@@ -132,5 +132,7 @@ DESCR("GIN index access method");
DATA(insert OID = 4000 ( spgist 0 5 f f f f f t f t f f f 0 spginsert spgbeginscan spggettuple spggetbitmap spgrescan spgendscan spgmarkpos spgrestrpos spgbuild spgbuildempty spgbulkdelete spgvacuumcleanup spgcanreturn spgcostestimate spgoptions ));
DESCR("SP-GiST index access method");
#define SPGIST_AM_OID 4000
+DATA(insert OID = 3580 ( minmax 5 7 f f f f t t f t t f f 0 mminsert mmbeginscan - mmgetbitmap mmrescan mmendscan mmmarkpos mmrestrpos mmbuild mmbuildempty mmbulkdelete mmvacuumcleanup - mmcostestimate mmoptions ));
+#define MINMAX_AM_OID 3580
#endif /* PG_AM_H */
diff --git a/src/include/catalog/pg_amop.h b/src/include/catalog/pg_amop.h
index 3ef5a49..4df4b7a 100644
--- a/src/include/catalog/pg_amop.h
+++ b/src/include/catalog/pg_amop.h
@@ -845,4 +845,85 @@ DATA(insert ( 3550 869 869 25 s 932 783 0 ));
DATA(insert ( 3550 869 869 26 s 933 783 0 ));
DATA(insert ( 3550 869 869 27 s 934 783 0 ));
+/*
+ * MinMax int4_ops
+ */
+DATA(insert ( 4054 23 23 1 s 97 3580 0 ));
+DATA(insert ( 4054 23 23 2 s 523 3580 0 ));
+DATA(insert ( 4054 23 23 3 s 96 3580 0 ));
+DATA(insert ( 4054 23 23 4 s 525 3580 0 ));
+DATA(insert ( 4054 23 23 5 s 521 3580 0 ));
+
+/*
+ * MinMax numeric_ops
+ */
+DATA(insert ( 4055 1700 1700 1 s 1754 3580 0 ));
+DATA(insert ( 4055 1700 1700 2 s 1755 3580 0 ));
+DATA(insert ( 4055 1700 1700 3 s 1752 3580 0 ));
+DATA(insert ( 4055 1700 1700 4 s 1757 3580 0 ));
+DATA(insert ( 4055 1700 1700 5 s 1756 3580 0 ));
+
+/*
+ * MinMax text_ops
+ */
+DATA(insert ( 4056 25 25 1 s 664 3580 0 ));
+DATA(insert ( 4056 25 25 2 s 665 3580 0 ));
+DATA(insert ( 4056 25 25 3 s 98 3580 0 ));
+DATA(insert ( 4056 25 25 4 s 667 3580 0 ));
+DATA(insert ( 4056 25 25 5 s 666 3580 0 ));
+
+/*
+ * MinMax time_ops
+ */
+DATA(insert ( 4057 1083 1083 1 s 1110 3580 0 ));
+DATA(insert ( 4057 1083 1083 2 s 1111 3580 0 ));
+DATA(insert ( 4057 1083 1083 3 s 1108 3580 0 ));
+DATA(insert ( 4057 1083 1083 4 s 1113 3580 0 ));
+DATA(insert ( 4057 1083 1083 5 s 1112 3580 0 ));
+
+/*
+ * MinMax timetz_ops
+ */
+DATA(insert ( 4058 1266 1266 1 s 1552 3580 0 ));
+DATA(insert ( 4058 1266 1266 2 s 1553 3580 0 ));
+DATA(insert ( 4058 1266 1266 3 s 1550 3580 0 ));
+DATA(insert ( 4058 1266 1266 4 s 1555 3580 0 ));
+DATA(insert ( 4058 1266 1266 5 s 1554 3580 0 ));
+
+/*
+ * MinMax timestamp_ops
+ */
+DATA(insert ( 4059 1114 1114 1 s 2062 3580 0 ));
+DATA(insert ( 4059 1114 1114 2 s 2063 3580 0 ));
+DATA(insert ( 4059 1114 1114 3 s 2060 3580 0 ));
+DATA(insert ( 4059 1114 1114 4 s 2065 3580 0 ));
+DATA(insert ( 4059 1114 1114 5 s 2064 3580 0 ));
+
+/*
+ * MinMax timestamptz_ops
+ */
+DATA(insert ( 4060 1184 1184 1 s 1322 3580 0 ));
+DATA(insert ( 4060 1184 1184 2 s 1323 3580 0 ));
+DATA(insert ( 4060 1184 1184 3 s 1320 3580 0 ));
+DATA(insert ( 4060 1184 1184 4 s 1325 3580 0 ));
+DATA(insert ( 4060 1184 1184 5 s 1324 3580 0 ));
+
+/*
+ * MinMax date_ops
+ */
+DATA(insert ( 4061 1082 1082 1 s 1095 3580 0 ));
+DATA(insert ( 4061 1082 1082 2 s 1096 3580 0 ));
+DATA(insert ( 4061 1082 1082 3 s 1093 3580 0 ));
+DATA(insert ( 4061 1082 1082 4 s 1098 3580 0 ));
+DATA(insert ( 4061 1082 1082 5 s 1097 3580 0 ));
+
+/*
+ * MinMax char_ops
+ */
+DATA(insert ( 4062 18 18 1 s 631 3580 0 ));
+DATA(insert ( 4062 18 18 2 s 632 3580 0 ));
+DATA(insert ( 4062 18 18 3 s 92 3580 0 ));
+DATA(insert ( 4062 18 18 4 s 634 3580 0 ));
+DATA(insert ( 4062 18 18 5 s 633 3580 0 ));
+
#endif /* PG_AMOP_H */
diff --git a/src/include/catalog/pg_amproc.h b/src/include/catalog/pg_amproc.h
index 10a47df..9eb2456 100644
--- a/src/include/catalog/pg_amproc.h
+++ b/src/include/catalog/pg_amproc.h
@@ -431,4 +431,77 @@ DATA(insert ( 4017 25 25 3 4029 ));
DATA(insert ( 4017 25 25 4 4030 ));
DATA(insert ( 4017 25 25 5 4031 ));
+/* minmax */
+DATA(insert ( 4054 23 23 1 3383 ));
+DATA(insert ( 4054 23 23 2 3384 ));
+DATA(insert ( 4054 23 23 3 3385 ));
+DATA(insert ( 4054 23 23 4 66 ));
+DATA(insert ( 4054 23 23 5 149 ));
+DATA(insert ( 4054 23 23 6 150 ));
+DATA(insert ( 4054 23 23 7 147 ));
+
+DATA(insert ( 4055 1700 1700 1 3383 ));
+DATA(insert ( 4055 1700 1700 2 3384 ));
+DATA(insert ( 4055 1700 1700 3 3385 ));
+DATA(insert ( 4055 1700 1700 4 1722 ));
+DATA(insert ( 4055 1700 1700 5 1723 ));
+DATA(insert ( 4055 1700 1700 6 1721 ));
+DATA(insert ( 4055 1700 1700 7 1720 ));
+
+DATA(insert ( 4056 25 25 1 3383 ));
+DATA(insert ( 4056 25 25 2 3384 ));
+DATA(insert ( 4056 25 25 3 3385 ));
+DATA(insert ( 4056 25 25 4 740 ));
+DATA(insert ( 4056 25 25 5 741 ));
+DATA(insert ( 4056 25 25 6 743 ));
+DATA(insert ( 4056 25 25 7 742 ));
+
+DATA(insert ( 4057 1083 1083 1 3383 ));
+DATA(insert ( 4057 1083 1083 2 3384 ));
+DATA(insert ( 4057 1083 1083 3 3385 ));
+DATA(insert ( 4057 1083 1083 4 1102 ));
+DATA(insert ( 4057 1083 1083 5 1103 ));
+DATA(insert ( 4057 1083 1083 6 1105 ));
+DATA(insert ( 4057 1083 1083 7 1104 ));
+
+DATA(insert ( 4058 1266 1266 1 3383 ));
+DATA(insert ( 4058 1266 1266 2 3384 ));
+DATA(insert ( 4058 1266 1266 3 3385 ));
+DATA(insert ( 4058 1266 1266 4 1354 ));
+DATA(insert ( 4058 1266 1266 5 1355 ));
+DATA(insert ( 4058 1266 1266 6 1356 ));
+DATA(insert ( 4058 1266 1266 7 1357 ));
+
+DATA(insert ( 4059 1114 1114 1 3383 ));
+DATA(insert ( 4059 1114 1114 2 3384 ));
+DATA(insert ( 4059 1114 1114 3 3385 ));
+DATA(insert ( 4059 1114 1114 4 2054 ));
+DATA(insert ( 4059 1114 1114 5 2055 ));
+DATA(insert ( 4059 1114 1114 6 2056 ));
+DATA(insert ( 4059 1114 1114 7 2057 ));
+
+DATA(insert ( 4060 1184 1184 1 3383 ));
+DATA(insert ( 4060 1184 1184 2 3384 ));
+DATA(insert ( 4060 1184 1184 3 3385 ));
+DATA(insert ( 4060 1184 1184 4 1154 ));
+DATA(insert ( 4060 1184 1184 5 1155 ));
+DATA(insert ( 4060 1184 1184 6 1156 ));
+DATA(insert ( 4060 1184 1184 7 1157 ));
+
+DATA(insert ( 4061 1082 1082 1 3383 ));
+DATA(insert ( 4061 1082 1082 2 3384 ));
+DATA(insert ( 4061 1082 1082 3 3385 ));
+DATA(insert ( 4061 1082 1082 4 1087 ));
+DATA(insert ( 4061 1082 1082 5 1088 ));
+DATA(insert ( 4061 1082 1082 6 1090 ));
+DATA(insert ( 4061 1082 1082 7 1089 ));
+
+DATA(insert ( 4062 18 18 1 3383 ));
+DATA(insert ( 4062 18 18 2 3384 ));
+DATA(insert ( 4062 18 18 3 3385 ));
+DATA(insert ( 4062 18 18 4 1246 ));
+DATA(insert ( 4062 18 18 5 72 ));
+DATA(insert ( 4062 18 18 6 74 ));
+DATA(insert ( 4062 18 18 7 73 ));
+
#endif /* PG_AMPROC_H */
diff --git a/src/include/catalog/pg_opclass.h b/src/include/catalog/pg_opclass.h
index dc52341..70e21ce 100644
--- a/src/include/catalog/pg_opclass.h
+++ b/src/include/catalog/pg_opclass.h
@@ -235,5 +235,14 @@ DATA(insert ( 403 jsonb_ops PGNSP PGUID 4033 3802 t 0 ));
DATA(insert ( 405 jsonb_ops PGNSP PGUID 4034 3802 t 0 ));
DATA(insert ( 2742 jsonb_ops PGNSP PGUID 4036 3802 t 25 ));
DATA(insert ( 2742 jsonb_path_ops PGNSP PGUID 4037 3802 f 23 ));
+DATA(insert ( 3580 int4_ops PGNSP PGUID 4054 23 t 0 ));
+DATA(insert ( 3580 numeric_ops PGNSP PGUID 4055 1700 t 0 ));
+DATA(insert ( 3580 text_ops PGNSP PGUID 4056 25 t 0 ));
+DATA(insert ( 3580 time_ops PGNSP PGUID 4057 1083 t 0 ));
+DATA(insert ( 3580 timetz_ops PGNSP PGUID 4058 1266 t 0 ));
+DATA(insert ( 3580 timestamp_ops PGNSP PGUID 4059 1114 t 0 ));
+DATA(insert ( 3580 timestamptz_ops PGNSP PGUID 4060 1184 t 0 ));
+DATA(insert ( 3580 date_ops PGNSP PGUID 4061 1082 t 0 ));
+DATA(insert ( 3580 char_ops PGNSP PGUID 4062 18 t 0 ));
#endif /* PG_OPCLASS_H */
diff --git a/src/include/catalog/pg_opfamily.h b/src/include/catalog/pg_opfamily.h
index 26297ce..08ea569 100644
--- a/src/include/catalog/pg_opfamily.h
+++ b/src/include/catalog/pg_opfamily.h
@@ -157,4 +157,14 @@ DATA(insert OID = 4035 ( 783 jsonb_ops PGNSP PGUID ));
DATA(insert OID = 4036 ( 2742 jsonb_ops PGNSP PGUID ));
DATA(insert OID = 4037 ( 2742 jsonb_path_ops PGNSP PGUID ));
+DATA(insert OID = 4054 ( 3580 int4_ops PGNSP PGUID ));
+DATA(insert OID = 4055 ( 3580 numeric_ops PGNSP PGUID ));
+DATA(insert OID = 4056 ( 3580 text_ops PGNSP PGUID ));
+DATA(insert OID = 4057 ( 3580 time_ops PGNSP PGUID ));
+DATA(insert OID = 4058 ( 3580 timetz_ops PGNSP PGUID ));
+DATA(insert OID = 4059 ( 3580 timestamp_ops PGNSP PGUID ));
+DATA(insert OID = 4060 ( 3580 timestamptz_ops PGNSP PGUID ));
+DATA(insert OID = 4061 ( 3580 date_ops PGNSP PGUID ));
+DATA(insert OID = 4062 ( 3580 char_ops PGNSP PGUID ));
+
#endif /* PG_OPFAMILY_H */
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 0af1248..433a442 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -565,6 +565,34 @@ DESCR("btree(internal)");
DATA(insert OID = 2785 ( btoptions PGNSP PGUID 12 1 0 0 0 f f f f t f s 2 0 17 "1009 16" _null_ _null_ _null_ _null_ btoptions _null_ _null_ _null_ ));
DESCR("btree(internal)");
+DATA(insert OID = 3789 ( mmgetbitmap PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 20 "2281 2281" _null_ _null_ _null_ _null_ mmgetbitmap _null_ _null_ _null_ ));
+DESCR("minmax(internal)");
+DATA(insert OID = 3790 ( mminsert PGNSP PGUID 12 1 0 0 0 f f f f t f v 6 0 16 "2281 2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mminsert _null_ _null_ _null_ ));
+DESCR("minmax(internal)");
+DATA(insert OID = 3791 ( mmbeginscan PGNSP PGUID 12 1 0 0 0 f f f f t f v 3 0 2281 "2281 2281 2281" _null_ _null_ _null_ _null_ mmbeginscan _null_ _null_ _null_ ));
+DESCR("minmax(internal)");
+DATA(insert OID = 3792 ( mmrescan PGNSP PGUID 12 1 0 0 0 f f f f t f v 5 0 2278 "2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmrescan _null_ _null_ _null_ ));
+DESCR("minmax(internal)");
+DATA(insert OID = 3793 ( mmendscan PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmendscan _null_ _null_ _null_ ));
+DESCR("minmax(internal)");
+DATA(insert OID = 3794 ( mmmarkpos PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmmarkpos _null_ _null_ _null_ ));
+DESCR("minmax(internal)");
+DATA(insert OID = 3795 ( mmrestrpos PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmrestrpos _null_ _null_ _null_ ));
+DESCR("minmax(internal)");
+DATA(insert OID = 3796 ( mmbuild PGNSP PGUID 12 1 0 0 0 f f f f t f v 3 0 2281 "2281 2281 2281" _null_ _null_ _null_ _null_ mmbuild _null_ _null_ _null_ ));
+DESCR("minmax(internal)");
+DATA(insert OID = 3797 ( mmbuildempty PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmbuildempty _null_ _null_ _null_ ));
+DESCR("minmax(internal)");
+DATA(insert OID = 3798 ( mmbulkdelete PGNSP PGUID 12 1 0 0 0 f f f f t f v 4 0 2281 "2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmbulkdelete _null_ _null_ _null_ ));
+DESCR("minmax(internal)");
+DATA(insert OID = 3799 ( mmvacuumcleanup PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ mmvacuumcleanup _null_ _null_ _null_ ));
+DESCR("minmax(internal)");
+DATA(insert OID = 3800 ( mmcostestimate PGNSP PGUID 12 1 0 0 0 f f f f t f v 7 0 2278 "2281 2281 2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmcostestimate _null_ _null_ _null_ ));
+DESCR("minmax(internal)");
+DATA(insert OID = 3801 ( mmoptions PGNSP PGUID 12 1 0 0 0 f f f f t f s 2 0 17 "1009 16" _null_ _null_ _null_ _null_ mmoptions _null_ _null_ _null_ ));
+DESCR("minmax(internal)");
+
+
DATA(insert OID = 339 ( poly_same PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "604 604" _null_ _null_ _null_ _null_ poly_same _null_ _null_ _null_ ));
DATA(insert OID = 340 ( poly_contain PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "604 604" _null_ _null_ _null_ _null_ poly_contain _null_ _null_ _null_ ));
DATA(insert OID = 341 ( poly_left PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "604 604" _null_ _null_ _null_ _null_ poly_left _null_ _null_ _null_ ));
@@ -4064,6 +4092,14 @@ DATA(insert OID = 2747 ( arrayoverlap PGNSP PGUID 12 1 0 0 0 f f f f t f i
DATA(insert OID = 2748 ( arraycontains PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "2277 2277" _null_ _null_ _null_ _null_ arraycontains _null_ _null_ _null_ ));
DATA(insert OID = 2749 ( arraycontained PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "2277 2277" _null_ _null_ _null_ _null_ arraycontained _null_ _null_ _null_ ));
+/* Minmax */
+DATA(insert OID = 3383 ( minmax_sortable_opcinfo PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ mmSortableOpcInfo _null_ _null_ _null_ ));
+DESCR("MinMax sortable datatype support");
+DATA(insert OID = 3384 ( minmax_sortable_add_value PGNSP PGUID 12 1 0 0 0 f f f f t f i 5 0 16 "2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmSortableAddValue _null_ _null_ _null_ ));
+DESCR("MinMax sortable datatype support");
+DATA(insert OID = 3385 ( minmax_sortable_consistent PGNSP PGUID 12 1 0 0 0 f f f f t f i 5 0 16 "2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmSortableConsistent _null_ _null_ _null_ ));
+DESCR("MinMax sortable datatype support");
+
/* userlock replacements */
DATA(insert OID = 2880 ( pg_advisory_lock PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "20" _null_ _null_ _null_ _null_ pg_advisory_lock_int8 _null_ _null_ _null_ ));
DESCR("obtain exclusive advisory lock");
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index d96e375..7801c85 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -393,6 +393,8 @@ extern void PageInit(Page page, Size pageSize, Size specialSize);
extern bool PageIsVerified(Page page, BlockNumber blkno);
extern OffsetNumber PageAddItem(Page page, Item item, Size size,
OffsetNumber offsetNumber, bool overwrite, bool is_heap);
+extern void PageOverwriteItemData(Page page, OffsetNumber offset, Item item,
+ Size size);
extern Page PageGetTempPage(Page page);
extern Page PageGetTempPageCopy(Page page);
extern Page PageGetTempPageCopySpecial(Page page);
@@ -403,6 +405,8 @@ extern Size PageGetExactFreeSpace(Page page);
extern Size PageGetHeapFreeSpace(Page page);
extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
+extern void PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos,
+ int nitems);
extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index 0f662ec..7482252 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -195,6 +195,7 @@ extern Datum hashcostestimate(PG_FUNCTION_ARGS);
extern Datum gistcostestimate(PG_FUNCTION_ARGS);
extern Datum spgcostestimate(PG_FUNCTION_ARGS);
extern Datum gincostestimate(PG_FUNCTION_ARGS);
+extern Datum mmcostestimate(PG_FUNCTION_ARGS);
/* Functions in array_selfuncs.c */
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index c04cddc..0ce2739 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -1591,6 +1591,11 @@ ORDER BY 1, 2, 3;
2742 | 9 | ?
2742 | 10 | ?|
2742 | 11 | ?&
+ 3580 | 1 | <
+ 3580 | 2 | <=
+ 3580 | 3 | =
+ 3580 | 4 | >=
+ 3580 | 5 | >
4000 | 1 | <<
4000 | 1 | ~<~
4000 | 2 | &<
@@ -1613,7 +1618,7 @@ ORDER BY 1, 2, 3;
4000 | 15 | >
4000 | 16 | @>
4000 | 18 | =
-(80 rows)
+(85 rows)
-- Check that all opclass search operators have selectivity estimators.
-- This is not absolutely required, but it seems a reasonable thing
@@ -1775,11 +1780,13 @@ WHERE NOT (
-- GIN has six support functions. 1-3 are mandatory, 5 is optional, and
-- at least one of 4 and 6 must be given.
-- SP-GiST has five support functions, all mandatory
+ -- MinMax has seven support functions, all mandatory
amname = 'btree' AND procnums @> '{1}' OR
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
- amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}'
+ amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}' OR
+ amname = 'minmax' AND procnums = '{1, 2, 3, 4, 5, 6, 7}'
);
amname | opfname | amproclefttype | amprocrighttype | procnums
--------+---------+----------------+-----------------+----------
@@ -1800,7 +1807,8 @@ WHERE NOT (
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
- amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}'
+ amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}' OR
+ amname = 'minmax' AND procnums = '{1, 2, 3, 4, 5, 6, 7}'
);
amname | opcname | procnums
--------+---------+----------
diff --git a/src/test/regress/sql/opr_sanity.sql b/src/test/regress/sql/opr_sanity.sql
index 213a66d..6670661 100644
--- a/src/test/regress/sql/opr_sanity.sql
+++ b/src/test/regress/sql/opr_sanity.sql
@@ -1178,11 +1178,13 @@ WHERE NOT (
-- GIN has six support functions. 1-3 are mandatory, 5 is optional, and
-- at least one of 4 and 6 must be given.
-- SP-GiST has five support functions, all mandatory
+ -- MinMax has seven support functions, all mandatory
amname = 'btree' AND procnums @> '{1}' OR
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
- amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}'
+ amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}' OR
+ amname = 'minmax' AND procnums = '{1, 2, 3, 4, 5, 6, 7}'
);
-- Also, check if there are any pg_opclass entries that don't seem to have
@@ -1201,7 +1203,8 @@ WHERE NOT (
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
- amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}'
+ amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}' OR
+ amname = 'minmax' AND procnums = '{1, 2, 3, 4, 5, 6, 7}'
);
-- Unfortunately, we can't check the amproc link very well because the
On 08/05/2014 04:41 PM, Alvaro Herrera wrote:
I have chosen to keep the name "minmax", even if the opclasses now let
one implement completely different things on top of it such as geometry
bounding boxes and bloom filters (aka bitmap indexes). I don't see a
need for a rename: essentially, in PR we can just say "we have these
neat minmax indexes that other databases also have, but instead of just
being used for integer data, they can also be used for geometry, GIS and
bitmap indexes, so as always we're more powerful than everyone else when
implementing new database features".
Plus we haven't come up with a better name ...
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
FWIW I think I haven't responded appropriately to the points raised by
Heikki. Basically, as I see it there are three main items:
1. the revmap physical-to-logical mapping is too complex; let's use
something else.
We had revmap originally in a separate fork. The current approach grew
out of the necessity of putting it in the main fork while ensuring that
fast access to individual pages is possible. There are of course many
ways to skin this cat; Heikki's proposal is to have it always occupy the
first few physical pages, rather than require a logical-to-physical
mapping table. To implement this he proposes to move other pages out of
the way as the index grows. I don't really have much love for this
idea. We can change how this is implemented later in the cycle, if we
find that a different approach is better than my proposal. I don't want
to spend endless time meddling with this (and I definitely don't want to
have this delay the eventual commit of the patch.)
2. vacuuming is not optimal
Right now, to eliminate garbage index tuples we need to first scan
the revmap to figure out which tuples are unreferenced. There is a
concern that if there's an excess of dead tuples, the index becomes
unvacuumable because the single palloc() request needed to hold all the dead
TIDs would exceed the allocation size limit. This is
largely theoretical because in order for this to happen there need to be
several million dead index tuples. As a minimal fix to alleviate this
problem without requiring a complete rework of vacuuming, we can cap
that palloc request to maintenance_work_mem and remove dead tuples in a
loop instead of trying to remove all of them in a single pass.
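As a minimal sketch of that batching idea (my illustration, not code from the
patch; mm_collect_dead() and mm_delete_tuples() are hypothetical stand-ins for
the revmap/index scan and the actual deletion):

#include "postgres.h"
#include "miscadmin.h"
#include "storage/itemptr.h"
#include "utils/memutils.h"
#include "utils/rel.h"

/* hypothetical helpers standing in for the real scan and deletion code */
extern Size mm_collect_dead(Relation idxrel, ItemPointer dead, Size max_tids);
extern void mm_delete_tuples(Relation idxrel, ItemPointer dead, Size ndead);

/*
 * Sketch: remove unreferenced index tuples in batches capped by
 * maintenance_work_mem, so no single palloc() has to hold every dead TID.
 */
static void
mm_vacuum_in_batches(Relation idxrel)
{
	Size		max_bytes = Min((Size) maintenance_work_mem * 1024L, MaxAllocSize);
	Size		max_tids = max_bytes / sizeof(ItemPointerData);
	ItemPointer	dead = palloc(max_tids * sizeof(ItemPointerData));
	Size		ndead;

	do
	{
		/* collect at most max_tids TIDs no longer referenced from the revmap */
		ndead = mm_collect_dead(idxrel, dead, max_tids);
		if (ndead > 0)
			mm_delete_tuples(idxrel, dead, ndead);
	} while (ndead == max_tids);	/* a full batch means there may be more */

	pfree(dead);
}

Each pass removes at most one batch, so memory stays bounded at the cost of
possibly scanning the index more than once.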
Another thing proposed was to store range numbers (or just heap page
numbers) within each index tuple. I felt that this would add more bloat
unnecessarily. However, there is some padding space in the index tuple that
we might be able to use to store range numbers. I will think some more about
how we can use this to simplify vacuuming.
3. avoid MMTuple as it is just unnecessary extra complexity.
The main thing that MMTuple adds is not the fact that we save 2 bytes
by storing BlockNumber as is instead of within a TID field. Instead,
it's that we can construct and deconstruct using our own design, which
means we can use however many Datum entries we want and however many
"null" flags. In normal heap and index tuples, there are always the
same number of datum/nulls. In minmax, the number of nulls is twice the
number of indexed columns; the number of datum values is determined by
how many datum values the opclass stores ("sortable" opclasses store
two values per column, but geometry would store only one). If we were to
use regular IndexTuples, we would lose that flexibility, and I have no idea
how it would work. In other words, MMTuples look fine to me.
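For illustration only, the deformed representation the patch works with (see
the mmfuncs.c hunk further down) has roughly this shape; the field names come
from that code, but the declarations here are my reconstruction rather than a
copy of the real headers:

#include "postgres.h"

typedef struct MMValues
{
	bool	allnulls;	/* every value of this column in the range is NULL */
	bool	hasnulls;	/* at least one value in the range is NULL */
	Datum  *values;		/* opclass-defined count: two for "sortable"
						 * opclasses (min, max), one for a bounding box */
} MMValues;

typedef struct DeformedMMTuple
{
	MMValues	dt_columns[FLEXIBLE_ARRAY_MEMBER];	/* one per indexed column */
} DeformedMMTuple;

The on-disk MMTuple carries the same information in packed form: a small
header, a null bitmap with two bits per indexed column, and then however many
datums the opclass stores.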
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Aug 5, 2014 at 7:55 PM, Josh Berkus <josh@agliodbs.com> wrote:
On 08/05/2014 04:41 PM, Alvaro Herrera wrote:
I have chosen to keep the name "minmax", even if the opclasses now let
one implement completely different things on top of it such as geometry
bounding boxes and bloom filters (aka bitmap indexes). I don't see a
need for a rename: essentially, in PR we can just say "we have these
neat minmax indexes that other databases also have, but instead of just
being used for integer data, they can also be used for geometry, GIS and
bitmap indexes, so as always we're more powerful than everyone else when
implementing new database features".
Plus we haven't come up with a better name ...
Several good suggestions have been made, like "summarizing" or
"summary" indexes and "compressed range" indexes. I still really
dislike the present name - you might think this is a type of index
that has something to do with optimizing "min" and "max", but what it
really is is a kind of small index for a big table. The current name
couldn't make that less clear.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Aug 6, 2014 at 1:25 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
"Summary" seems good. If I get enough votes I can change it to that.
CREATE INDEX foo ON t USING summary (cols)
"Summarizing" seems weird on that command. Not sure about "compressed
range", as you would have to use an abbreviation or run the words
together.
Summarizing index sounds better to my ears, but both ideas based on
"summary" are quite succint and to-the-point descriptions of what's
happening, so I vote for those.
On Wed, Aug 6, 2014 at 1:25 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
CREATE INDEX foo ON t USING crange (cols) -- misspelling of "cringe"?
CREATE INDEX foo ON t USING comprange (cols)
CREATE INDEX foo ON t USING compressedrng (cols) -- ugh
-- or use an identifier with whitespace:
CREATE INDEX foo ON t USING "compressed range" (cols)
The word you'd use there is not necessarily the one you use on the
framework, since the framework applies to many such techniques, but
the index type there is one specific one.
The create command can still use minmax, or rangemap if you prefer
that, while the framework's code uses summary or summarizing.
On Wed, Aug 6, 2014 at 01:31:14PM -0300, Claudio Freire wrote:
On Wed, Aug 6, 2014 at 1:25 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
CREATE INDEX foo ON t USING crange (cols) -- misspelling of "cringe"?
CREATE INDEX foo ON t USING comprange (cols)
CREATE INDEX foo ON t USING compressedrng (cols) -- ugh
-- or use an identifier with whitespace:
CREATE INDEX foo ON t USING "compressed range" (cols)The word you'd use there is not necessarily the one you use on the
framework, since the framework applies to many such techniques, but
the index type there is one specific one.
"Block filter" indexes?
The create command can still use minmax, or rangemap if you prefer
that, while the framework's code uses summary or summarizing.
"Summary" sounds like materialized views to me.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Wed, Aug 6, 2014 at 1:35 PM, Bruce Momjian <bruce@momjian.us> wrote:
On Wed, Aug 6, 2014 at 01:31:14PM -0300, Claudio Freire wrote:
On Wed, Aug 6, 2014 at 1:25 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
CREATE INDEX foo ON t USING crange (cols) -- misspelling of "cringe"?
CREATE INDEX foo ON t USING comprange (cols)
CREATE INDEX foo ON t USING compressedrng (cols) -- ugh
-- or use an identifier with whitespace:
CREATE INDEX foo ON t USING "compressed range" (cols)The word you'd use there is not necessarily the one you use on the
framework, since the framework applies to many such techniques, but
the index type there is one specific one."Block filter" indexes?
Nice one
On Wed, Aug 6, 2014 at 1:55 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Claudio Freire wrote:
On Wed, Aug 6, 2014 at 1:25 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
CREATE INDEX foo ON t USING crange (cols) -- misspelling of "cringe"?
CREATE INDEX foo ON t USING comprange (cols)
CREATE INDEX foo ON t USING compressedrng (cols) -- ugh
-- or use an identifier with whitespace:
CREATE INDEX foo ON t USING "compressed range" (cols)The word you'd use there is not necessarily the one you use on the
framework, since the framework applies to many such techniques, but
the index type there is one specific one.The create command can still use minmax, or rangemap if you prefer
that, while the framework's code uses summary or summarizing.I think you're confusing the AM name with the opclass name. The name
you specify in that part of the command is the access method name. You
can specify the opclass together with each column, like so:CREATE INDEX foo ON t USING blockfilter
(order_date date_minmax_ops, geometry gis_bbox_ops);
Oh, uh... no, I'm not confusing them, but now I just realized how one
would implement other classes of block filtering indexes, and yeah...
you do it through the opclasses.
I'm sticking to bloom filters:
CREATE INDEX foo ON t USING blockfilter (order_date date_minmax_ops,
path character_bloom_ops);
Cool. Very cool.
So, I like blockfilter a lot. I change my vote to blockfilter ;)
2014-08-06 Claudio Freire <klaussfreire@gmail.com>:
So, I like blockfilter a lot. I change my vote to blockfilter ;)
+1 for blockfilter, because it stresses the fact that the "physical"
arrangement of rows in blocks matters for this index.
Nicolas
--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?
On Wed, Aug 6, 2014 at 4:06 PM, Nicolas Barbier
<nicolas.barbier@gmail.com> wrote:
2014-08-06 Claudio Freire <klaussfreire@gmail.com>:
So, I like blockfilter a lot. I change my vote to blockfilter ;)
+1 for blockfilter, because it stresses the fact that the "physical"
arrangement of rows in blocks matters for this index.
I don't like that quite as well as summary, but I'd prefer either to
the current naming.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 7 August 2014 14:53, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Aug 6, 2014 at 4:06 PM, Nicolas Barbier
<nicolas.barbier@gmail.com> wrote:
2014-08-06 Claudio Freire <klaussfreire@gmail.com>:
So, I like blockfilter a lot. I change my vote to blockfilter ;)
+1 for blockfilter, because it stresses the fact that the "physical"
arrangement of rows in blocks matters for this index.
I don't like that quite as well as summary, but I'd prefer either to
the current naming.
Yes, "summary index" isn't good. I'm not sure where the block or the
filter part comes in though, so -1 to "block filter", not least
because it doesn't have a good abbreviation (bfin??).
A better description would be "block range index" since we are
indexing a range of blocks (not just one block). Perhaps a better one
would be simply "range index", which we could abbreviate to RIN or
BRIN.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Aug 7, 2014 at 11:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 7 August 2014 14:53, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Aug 6, 2014 at 4:06 PM, Nicolas Barbier
<nicolas.barbier@gmail.com> wrote:
2014-08-06 Claudio Freire <klaussfreire@gmail.com>:
So, I like blockfilter a lot. I change my vote to blockfilter ;)
+1 for blockfilter, because it stresses the fact that the "physical"
arrangement of rows in blocks matters for this index.
I don't like that quite as well as summary, but I'd prefer either to
the current naming.
Yes, "summary index" isn't good. I'm not sure where the block or the
filter part comes in though, so -1 to "block filter", not least
because it doesn't have a good abbreviation (bfin??).
Block filter would refer to the index property that selects blocks,
not tuples, and it does so through a "filter function" (for min-max,
it's a range check, but for other opclasses it could be anything).
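To make that concrete, here is a sketch (mine, not code from the patch) of
what the filter boils down to for a minmax range: given the stored min and max
for a block range, decide whether any row in it could match the scan key. The
FmgrInfo arguments are assumed to be the datatype's inequality operators that
the opclass looks up.

#include "postgres.h"
#include "access/skey.h"
#include "fmgr.h"

static bool
range_might_match(Datum min, Datum max, StrategyNumber strategy, Datum query,
				  FmgrInfo *lt, FmgrInfo *le, FmgrInfo *ge, FmgrInfo *gt)
{
	switch (strategy)
	{
		case BTLessStrategyNumber:			/* col < query */
			return DatumGetBool(FunctionCall2(lt, min, query));
		case BTLessEqualStrategyNumber:		/* col <= query */
			return DatumGetBool(FunctionCall2(le, min, query));
		case BTEqualStrategyNumber:			/* col = query */
			return DatumGetBool(FunctionCall2(le, min, query)) &&
				   DatumGetBool(FunctionCall2(ge, max, query));
		case BTGreaterEqualStrategyNumber:	/* col >= query */
			return DatumGetBool(FunctionCall2(ge, max, query));
		case BTGreaterStrategyNumber:		/* col > query */
			return DatumGetBool(FunctionCall2(gt, max, query));
		default:
			return true;		/* unknown strategy: never prune */
	}
}

If this returns false, the whole block range is skipped; if it returns true,
all of its pages go into the bitmap and the executor rechecks the individual
tuples.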
Simon Riggs wrote:
On 7 August 2014 14:53, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Aug 6, 2014 at 4:06 PM, Nicolas Barbier
<nicolas.barbier@gmail.com> wrote:
2014-08-06 Claudio Freire <klaussfreire@gmail.com>:
So, I like blockfilter a lot. I change my vote to blockfilter ;)
+1 for blockfilter, because it stresses the fact that the "physical"
arrangement of rows in blocks matters for this index.
I don't like that quite as well as summary, but I'd prefer either to
the current naming.
Yes, "summary index" isn't good. I'm not sure where the block or the
filter part comes in though, so -1 to "block filter", not least
because it doesn't have a good abbreviation (bfin??).
I was thinking just "blockfilter" (I did show a sample command).
Claudio explained the name downthread; personally, of all the options
suggested thus far, it's the one I like the most (including minmax).
At this point, the naming issue is what is keeping me from committing
this patch, so the quicker we can solve it, the merrier.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Aug 7, 2014 at 10:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 7 August 2014 14:53, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Aug 6, 2014 at 4:06 PM, Nicolas Barbier
<nicolas.barbier@gmail.com> wrote:
2014-08-06 Claudio Freire <klaussfreire@gmail.com>:
So, I like blockfilter a lot. I change my vote to blockfilter ;)
+1 for blockfilter, because it stresses the fact that the "physical"
arrangement of rows in blocks matters for this index.
I don't like that quite as well as summary, but I'd prefer either to
the current naming.
Yes, "summary index" isn't good. I'm not sure where the block or the
filter part comes in though, so -1 to "block filter", not least
because it doesn't have a good abbreviation (bfin??).
A better description would be "block range index" since we are
indexing a range of blocks (not just one block). Perhaps a better one
would be simply "range index", which we could abbreviate to RIN or
BRIN.
range index might get confused with range types; block range index
seems better. I like summary, but I'm fine with block range index or
block filter index, too.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
+1 for BRIN !
On Thu, Aug 7, 2014 at 6:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 7 August 2014 14:53, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Aug 6, 2014 at 4:06 PM, Nicolas Barbier
<nicolas.barbier@gmail.com> wrote:
2014-08-06 Claudio Freire <klaussfreire@gmail.com>:
So, I like blockfilter a lot. I change my vote to blockfilter ;)
+1 for blockfilter, because it stresses the fact that the "physical"
arrangement of rows in blocks matters for this index.
I don't like that quite as well as summary, but I'd prefer either to
the current naming.
Yes, "summary index" isn't good. I'm not sure where the block or the
filter part comes in though, so -1 to "block filter", not least
because it doesn't have a good abbreviation (bfin??).
A better description would be "block range index" since we are
indexing a range of blocks (not just one block). Perhaps a better one
would be simply "range index", which we could abbreviate to RIN or
BRIN.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
2014-08-07 Oleg Bartunov <obartunov@gmail.com>:
+1 for BRIN !
+1, rolls off the tongue smoothly and captures the essence :-).
Nicolas
--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?
On 07/08/14 16:16, Simon Riggs wrote:
A better description would be "block range index" since we are
indexing a range of blocks (not just one block). Perhaps a better one
would be simply "range index", which we could abbreviate to RIN or
BRIN.
+1 for block range index (BRIN)
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs wrote:
A better description would be "block range index" since we are
indexing a range of blocks (not just one block). Perhaps a better one
would be simply "range index", which we could abbreviate to RIN or
BRIN.
Seems a lot of people liked BRIN. I will be adopting that by renaming
files and directories soon.
Here's v14. I fixed a few bugs; most notably, queries with IS NULL and
IS NOT NULL now work correctly. Also I made the pageinspect extension
be able to display existing index tuples (I had disabled that when
generalizing the opclass stuff). It only works with minmax opclasses
for now; it should be easy to fix if/when we add more stuff though.
I also added some docs. These are not finished by any means. They
talk about the index using the BRIN term.
All existing opclasses were renamed to "<type>_minmax_ops".
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
minmax-14.patchtext/x-diff; charset=us-asciiDownload
*** a/contrib/pageinspect/Makefile
--- b/contrib/pageinspect/Makefile
***************
*** 1,7 ****
# contrib/pageinspect/Makefile
MODULE_big = pageinspect
! OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o $(WIN32RES)
EXTENSION = pageinspect
DATA = pageinspect--1.2.sql pageinspect--1.0--1.1.sql \
--- 1,7 ----
# contrib/pageinspect/Makefile
MODULE_big = pageinspect
! OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o mmfuncs.o $(WIN32RES)
EXTENSION = pageinspect
DATA = pageinspect--1.2.sql pageinspect--1.0--1.1.sql \
*** /dev/null
--- b/contrib/pageinspect/mmfuncs.c
***************
*** 0 ****
--- 1,426 ----
+ /*
+ * mmfuncs.c
+ * Functions to investigate MinMax indexes
+ *
+ * Copyright (c) 2014, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/pageinspect/mmfuncs.c
+ */
+ #include "postgres.h"
+
+ #include "access/htup_details.h"
+ #include "access/minmax.h"
+ #include "access/minmax_internal.h"
+ #include "access/minmax_page.h"
+ #include "access/minmax_revmap.h"
+ #include "access/minmax_tuple.h"
+ #include "catalog/index.h"
+ #include "catalog/pg_type.h"
+ #include "funcapi.h"
+ #include "utils/array.h"
+ #include "utils/builtins.h"
+ #include "utils/lsyscache.h"
+ #include "utils/rel.h"
+ #include "miscadmin.h"
+
+ Datum minmax_page_type(PG_FUNCTION_ARGS);
+ Datum minmax_page_items(PG_FUNCTION_ARGS);
+ Datum minmax_metapage_info(PG_FUNCTION_ARGS);
+ Datum minmax_revmap_array_data(PG_FUNCTION_ARGS);
+ Datum minmax_revmap_data(PG_FUNCTION_ARGS);
+
+ PG_FUNCTION_INFO_V1(minmax_page_type);
+ PG_FUNCTION_INFO_V1(minmax_page_items);
+ PG_FUNCTION_INFO_V1(minmax_metapage_info);
+ PG_FUNCTION_INFO_V1(minmax_revmap_array_data);
+ PG_FUNCTION_INFO_V1(minmax_revmap_data);
+
+ typedef struct mm_page_state
+ {
+ MinmaxDesc *mmdesc;
+ Page page;
+ OffsetNumber offset;
+ bool unusedItem;
+ bool done;
+ AttrNumber attno;
+ DeformedMMTuple *dtup;
+ FmgrInfo outputfn[FLEXIBLE_ARRAY_MEMBER];
+ } mm_page_state;
+
+
+ static Page verify_minmax_page(bytea *raw_page, uint16 type,
+ const char *strtype);
+
+ Datum
+ minmax_page_type(PG_FUNCTION_ARGS)
+ {
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Page page = VARDATA(raw_page);
+ MinmaxSpecialSpace *special;
+ char *type;
+
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
+
+ switch (special->type)
+ {
+ case MINMAX_PAGETYPE_META:
+ type = "meta";
+ break;
+ case MINMAX_PAGETYPE_REVMAP_ARRAY:
+ type = "revmap array";
+ break;
+ case MINMAX_PAGETYPE_REVMAP:
+ type = "revmap";
+ break;
+ case MINMAX_PAGETYPE_REGULAR:
+ type = "regular";
+ break;
+ default:
+ type = psprintf("unknown (%02x)", special->type);
+ break;
+ }
+
+ PG_RETURN_TEXT_P(cstring_to_text(type));
+ }
+
+ /*
+ * Verify that the given bytea contains a minmax page of the indicated page
+ * type, or die in the attempt. A pointer to the page is returned.
+ */
+ static Page
+ verify_minmax_page(bytea *raw_page, uint16 type, const char *strtype)
+ {
+ Page page;
+ int raw_page_size;
+ MinmaxSpecialSpace *special;
+
+ raw_page_size = VARSIZE(raw_page) - VARHDRSZ;
+
+ if (raw_page_size < SizeOfPageHeaderData)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("input page too small"),
+ errdetail("Expected size %d, got %d", raw_page_size, BLCKSZ)));
+
+ page = VARDATA(raw_page);
+
+ /* verify the special space says this page is what we want */
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
+ if (special->type != type)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("page is not a Minmax page of type \"%s\"", strtype),
+ errdetail("Expected special type %08x, got %08x.",
+ type, special->type)));
+
+ return page;
+ }
+
+
+ /*
+ * Extract all item values from a minmax index page
+ *
+ * Usage: SELECT * FROM minmax_page_items(get_raw_page('idx', 1), 'idx'::regclass);
+ */
+ Datum
+ minmax_page_items(PG_FUNCTION_ARGS)
+ {
+ mm_page_state *state;
+ FuncCallContext *fctx;
+
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ (errmsg("must be superuser to use raw page functions"))));
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Oid indexRelid = PG_GETARG_OID(1);
+ Page page;
+ TupleDesc tupdesc;
+ MemoryContext mctx;
+ Relation indexRel;
+ AttrNumber attno;
+
+ /* minimally verify the page we got */
+ page = verify_minmax_page(raw_page, MINMAX_PAGETYPE_REGULAR, "regular");
+
+ /* create a function context for cross-call persistence */
+ fctx = SRF_FIRSTCALL_INIT();
+
+ /* switch to memory context appropriate for multiple function calls */
+ mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ indexRel = index_open(indexRelid, AccessShareLock);
+
+ state = palloc(offsetof(mm_page_state, outputfn) +
+ sizeof(FmgrInfo) * RelationGetDescr(indexRel)->natts);
+
+ state->mmdesc = minmax_build_mmdesc(indexRel);
+ state->page = page;
+ state->offset = FirstOffsetNumber;
+ state->unusedItem = false;
+ state->done = false;
+ state->dtup = NULL;
+
+ index_close(indexRel, AccessShareLock);
+
+ for (attno = 1; attno <= state->mmdesc->md_tupdesc->natts; attno++)
+ {
+ Oid output;
+ bool isVarlena;
+
+ getTypeOutputInfo(state->mmdesc->md_tupdesc->attrs[attno - 1]->atttypid,
+ &output, &isVarlena);
+ fmgr_info(output, &state->outputfn[attno - 1]);
+ }
+
+ fctx->user_fctx = state;
+ fctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+ MemoryContextSwitchTo(mctx);
+ }
+
+ fctx = SRF_PERCALL_SETUP();
+ state = fctx->user_fctx;
+
+ if (!state->done)
+ {
+ HeapTuple result;
+ Datum values[5];
+ bool nulls[5];
+
+ /*
+ * This loop is called once for every attribute of every tuple in the
+ * page. At the start of a tuple, we get a NULL dtup; that's our
+ * signal for obtaining and decoding the next one. If that's not the
+ * case, we output the next attribute.
+ */
+ if (state->dtup == NULL)
+ {
+ MMTuple *tup;
+ MemoryContext mctx;
+ ItemId itemId;
+
+ /* deformed tuple must live across calls */
+ mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx);
+
+ /* verify item status: if there's no data, we can't decode */
+ itemId = PageGetItemId(state->page, state->offset);
+ if (ItemIdIsUsed(itemId))
+ {
+ tup = (MMTuple *) PageGetItem(state->page,
+ PageGetItemId(state->page,
+ state->offset));
+ state->dtup = minmax_deform_tuple(state->mmdesc, tup);
+ state->attno = 1;
+ state->unusedItem = false;
+ }
+ else
+ state->unusedItem = true;
+
+ MemoryContextSwitchTo(mctx);
+ }
+ else
+ state->attno++;
+
+ MemSet(nulls, 0, sizeof(nulls));
+
+ if (state->unusedItem)
+ {
+ values[0] = UInt16GetDatum(state->offset);
+ nulls[1] = true;
+ nulls[2] = true;
+ nulls[3] = true;
+ nulls[4] = true;
+ }
+ else
+ {
+ int att = state->attno - 1;
+
+ values[0] = UInt16GetDatum(state->offset);
+ values[1] = UInt16GetDatum(state->attno);
+ values[2] = BoolGetDatum(state->dtup->dt_columns[att].allnulls);
+ values[3] = BoolGetDatum(state->dtup->dt_columns[att].hasnulls);
+ if (!state->dtup->dt_columns[att].allnulls)
+ {
+ FmgrInfo *outputfn = &state->outputfn[att];
+ MMValues *mmvalues = &state->dtup->dt_columns[att];
+ char *min,
+ *max;
+ char *rangeval;
+
+ /*
+ * XXX -- we assume here that the opclass uses 2 stored
+ * values, which is true for now (only minmax opclasses exist).
+ * Other opclasses might do something different.
+ */
+ min = OutputFunctionCall(outputfn, mmvalues->values[0]);
+ max = OutputFunctionCall(outputfn, mmvalues->values[1]);
+ rangeval = psprintf("%s..%s", min, max);
+ values[4] = CStringGetTextDatum(rangeval);
+
+ }
+ else
+ {
+ nulls[4] = true;
+ }
+ }
+
+ result = heap_form_tuple(fctx->tuple_desc, values, nulls);
+
+ /*
+ * If the item was unused, jump straight to the next one; otherwise,
+ * the only cleanup needed here is to set our signal to go to the next
+ * tuple in the following iteration, by freeing the current one.
+ */
+ if (state->unusedItem)
+ state->offset = OffsetNumberNext(state->offset);
+ else if (state->attno >= state->mmdesc->md_tupdesc->natts)
+ {
+ pfree(state->dtup);
+ state->dtup = NULL;
+ state->offset = OffsetNumberNext(state->offset);
+ }
+
+ /*
+ * If we're beyond the end of the page, set flag to end the function in
+ * the following iteration.
+ */
+ if (state->offset > PageGetMaxOffsetNumber(state->page))
+ state->done = true;
+
+ SRF_RETURN_NEXT(fctx, HeapTupleGetDatum(result));
+ }
+
+ SRF_RETURN_DONE(fctx);
+ }
+
+ Datum
+ minmax_metapage_info(PG_FUNCTION_ARGS)
+ {
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Page page;
+ MinmaxMetaPageData *meta;
+ TupleDesc tupdesc;
+ Datum values[3];
+ bool nulls[3];
+ ArrayBuildState *astate = NULL;
+ HeapTuple htup;
+ int i;
+
+ page = verify_minmax_page(raw_page, MINMAX_PAGETYPE_META, "metapage");
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+ tupdesc = BlessTupleDesc(tupdesc);
+
+ /* Extract values from the metapage */
+ meta = (MinmaxMetaPageData *) PageGetContents(page);
+ MemSet(nulls, 0, sizeof(nulls));
+ values[0] = CStringGetTextDatum(psprintf("0x%08X", meta->minmaxMagic));
+ values[1] = Int32GetDatum(meta->minmaxVersion);
+
+ /* Extract (possibly empty) list of revmap array page numbers. */
+ for (i = 0; i < MAX_REVMAP_ARRAYPAGES; i++)
+ {
+ BlockNumber blkno;
+
+ blkno = meta->revmapArrayPages[i];
+ if (blkno == InvalidBlockNumber)
+ break; /* XXX or continue? */
+ astate = accumArrayResult(astate, Int64GetDatum((int64) blkno),
+ false, INT8OID, CurrentMemoryContext);
+ }
+ if (astate == NULL)
+ nulls[2] = true;
+ else
+ values[2] = makeArrayResult(astate, CurrentMemoryContext);
+
+ htup = heap_form_tuple(tupdesc, values, nulls);
+
+ PG_RETURN_DATUM(HeapTupleGetDatum(htup));
+ }
+
+ /*
+ * Return the BlockNumber array stored in a revmap array page
+ */
+ Datum
+ minmax_revmap_array_data(PG_FUNCTION_ARGS)
+ {
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Page page;
+ ArrayBuildState *astate = NULL;
+ RevmapArrayContents *contents;
+ Datum blkarr;
+ int i;
+
+ page = verify_minmax_page(raw_page, MINMAX_PAGETYPE_REVMAP_ARRAY,
+ "revmap array");
+
+ contents = (RevmapArrayContents *) PageGetContents(page);
+
+ for (i = 0; i < contents->rma_nblocks; i++)
+ astate = accumArrayResult(astate,
+ Int64GetDatum((int64) contents->rma_blocks[i]),
+ false, INT8OID, CurrentMemoryContext);
+ Assert(astate != NULL);
+
+ blkarr = makeArrayResult(astate, CurrentMemoryContext);
+ PG_RETURN_DATUM(blkarr);
+ }
+
+ /*
+ * Return the TID array stored in a minmax revmap page
+ */
+ Datum
+ minmax_revmap_data(PG_FUNCTION_ARGS)
+ {
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Page page;
+ RevmapContents *contents;
+ TupleDesc tupdesc;
+ Datum values[2];
+ bool nulls[2];
+ HeapTuple htup;
+ ArrayBuildState *astate = NULL;
+ int i;
+
+ page = verify_minmax_page(raw_page, MINMAX_PAGETYPE_REVMAP, "revmap");
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+ tupdesc = BlessTupleDesc(tupdesc);
+
+ /* Extract values from the revmap page */
+ contents = (RevmapContents *) PageGetContents(page);
+ MemSet(nulls, 0, sizeof(nulls));
+ values[0] = Int64GetDatum((uint64) contents->rmr_logblk);
+
+ /* Extract (possibly empty) list of TIDs in this page. */
+ for (i = 0; i < REGULAR_REVMAP_PAGE_MAXITEMS; i++)
+ {
+ ItemPointer tid;
+
+ tid = &contents->rmr_tids[i];
+ astate = accumArrayResult(astate,
+ PointerGetDatum(tid),
+ false, TIDOID, CurrentMemoryContext);
+ }
+ if (astate == NULL)
+ nulls[1] = true;
+ else
+ values[1] = makeArrayResult(astate, CurrentMemoryContext);
+
+ htup = heap_form_tuple(tupdesc, values, nulls);
+
+ PG_RETURN_DATUM(HeapTupleGetDatum(htup));
+ }
*** a/contrib/pageinspect/pageinspect--1.2.sql
--- b/contrib/pageinspect/pageinspect--1.2.sql
***************
*** 99,104 **** AS 'MODULE_PATHNAME', 'bt_page_items'
--- 99,147 ----
LANGUAGE C STRICT;
--
+ -- minmax_page_type()
+ --
+ CREATE FUNCTION minmax_page_type(IN page bytea)
+ RETURNS text
+ AS 'MODULE_PATHNAME', 'minmax_page_type'
+ LANGUAGE C STRICT;
+
+ --
+ -- minmax_metapage_info()
+ --
+ CREATE FUNCTION minmax_metapage_info(IN page bytea, OUT magic text,
+ OUT version integer, OUT revmap_array_pages BIGINT[])
+ AS 'MODULE_PATHNAME', 'minmax_metapage_info'
+ LANGUAGE C STRICT;
+
+ --
+ -- minmax_page_items()
+ --
+ CREATE FUNCTION minmax_page_items(IN page bytea, IN index_oid oid,
+ OUT itemoffset int,
+ OUT attnum int,
+ OUT allnulls bool,
+ OUT hasnulls bool,
+ OUT value text)
+ RETURNS SETOF record
+ AS 'MODULE_PATHNAME', 'minmax_page_items'
+ LANGUAGE C STRICT;
+
+ --
+ -- minmax_revmap_array_data()
+ CREATE FUNCTION minmax_revmap_array_data(IN page bytea,
+ OUT revmap_pages BIGINT[])
+ AS 'MODULE_PATHNAME', 'minmax_revmap_array_data'
+ LANGUAGE C STRICT;
+
+ --
+ -- minmax_revmap_data()
+ CREATE FUNCTION minmax_revmap_data(IN page bytea,
+ OUT logblk BIGINT, OUT pages tid[])
+ AS 'MODULE_PATHNAME', 'minmax_revmap_data'
+ LANGUAGE C STRICT;
+
+ --
-- fsm_page_contents()
--
CREATE FUNCTION fsm_page_contents(IN page bytea)
*** a/contrib/pg_xlogdump/rmgrdesc.c
--- b/contrib/pg_xlogdump/rmgrdesc.c
***************
*** 13,18 ****
--- 13,19 ----
#include "access/gist_private.h"
#include "access/hash.h"
#include "access/heapam_xlog.h"
+ #include "access/minmax_xlog.h"
#include "access/multixact.h"
#include "access/nbtree.h"
#include "access/rmgr.h"
*** /dev/null
--- b/doc/src/sgml/brin.sgml
***************
*** 0 ****
--- 1,248 ----
+ <!-- doc/src/sgml/brin.sgml -->
+
+ <chapter id="BRIN">
+ <title>BRIN Indexes</title>
+
+ <indexterm>
+ <primary>index</primary>
+ <secondary>BRIN</secondary>
+ </indexterm>
+
+ <sect1 id="brin-intro">
+ <title>Introduction</title>
+
+ <para>
+ <acronym>BRIN</acronym> stands for Block Range Index.
+ <acronym>BRIN</acronym> is designed for handling very large tables
+ in which certain columns have some natural correlation with their
+ physical position. For example, a table storing orders might have
+ a date column on which each order was placed, and much of the time
+ the earlier entries will appear earlier in the table as well; or a
+ table storing a ZIP code column might have all codes for a city
+ grouped together naturally. For each block range, some summary info
+ is stored in the index.
+ </para>
+
+ <para>
+ <acronym>BRIN</acronym> indexes can satisfy queries via the bitmap
+ scanning facility only, and will return all tuples in all pages within
+ each range if the summary info stored by the index indicates that some
+ tuples in the range might match the given query conditions. The executor
+ is in charge of rechecking these tuples and discarding those that do not
+ match — in other words, these indexes are lossy.
+ This enables them to work as very fast sequential scan helpers to avoid
+ scanning blocks that are known not to contain matching tuples.
+ </para>
+
+ <para>
+ The specific data that a <acronym>BRIN</acronym> index will store
+ depends on the operator class selected for the data type.
+ Datatypes having a linear sort order can have operator classes that
+ store the minimum and maximum value within each block range, for instance;
+ geometrical types might store the common bounding box.
+ </para>
+
+ <para>
+ The size of the block range is determined at index creation time with
+ the pages_per_range storage parameter. The smaller the number, the
+ larger the index becomes (because of the need to store more index entries),
+ but at the same time the summary data stored can be more precise and
+ more data blocks can be skipped.
+ </para>
+
+ <para>
+ The <acronym>BRIN</acronym> implementation in <productname>PostgreSQL</productname>
+ is primarily maintained by Álvaro Herrera.
+ </para>
+ </sect1>
+
+ <sect1 id="brin-builtin-opclasses">
+ <title>Built-in Operator Classes</title>
+
+ <para>
+ The core <productname>PostgreSQL</productname> distribution includes
+ the <acronym>BRIN</acronym> operator classes shown in
+ <xref linkend="brin-builtin-opclasses-table">.
+ </para>
+
+ <table id="brin-builtin-opclasses-table">
+ <title>Built-in <acronym>BRIN</acronym> Operator Classes</title>
+ <tgroup cols="3">
+ <thead>
+ <row>
+ <entry>Name</entry>
+ <entry>Indexed Data Type</entry>
+ <entry>Indexable Operators</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry><literal>char_minmax_ops</literal></entry>
+ <entry><type>"char"</type></entry>
+ <entry>
+ <literal><</literal>
+ <literal><=</literal>
+ <literal>=</literal>
+ <literal>>=</literal>
+ <literal>></literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>date_minmax_ops</literal></entry>
+ <entry><type>date</type></entry>
+ <entry>
+ <literal><</literal>
+ <literal><=</literal>
+ <literal>=</literal>
+ <literal>>=</literal>
+ <literal>></literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>int4_minmax_ops</literal></entry>
+ <entry><type>integer</type></entry>
+ <entry>
+ <literal><</literal>
+ <literal><=</literal>
+ <literal>=</literal>
+ <literal>>=</literal>
+ <literal>></literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>numeric_minmax_ops</literal></entry>
+ <entry><type>numeric</type></entry>
+ <entry>
+ <literal><</literal>
+ <literal><=</literal>
+ <literal>=</literal>
+ <literal>>=</literal>
+ <literal>></literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>text_minmax_ops</literal></entry>
+ <entry><type>text</type></entry>
+ <entry>
+ <literal><</literal>
+ <literal><=</literal>
+ <literal>=</literal>
+ <literal>>=</literal>
+ <literal>></literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>time_minmax_ops</literal></entry>
+ <entry><type>time</type></entry>
+ <entry>
+ <literal><</literal>
+ <literal><=</literal>
+ <literal>=</literal>
+ <literal>>=</literal>
+ <literal>></literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>timetz_minmax_ops</literal></entry>
+ <entry><type>time with time zone</type></entry>
+ <entry>
+ <literal><</literal>
+ <literal><=</literal>
+ <literal>=</literal>
+ <literal>>=</literal>
+ <literal>></literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>timestamp_minmax_ops</literal></entry>
+ <entry><type>timestamp</type></entry>
+ <entry>
+ <literal><</literal>
+ <literal><=</literal>
+ <literal>=</literal>
+ <literal>>=</literal>
+ <literal>></literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>timestamptz_minmax_ops</literal></entry>
+ <entry><type>timestamp with time zone</type></entry>
+ <entry>
+ <literal><</literal>
+ <literal><=</literal>
+ <literal>=</literal>
+ <literal>>=</literal>
+ <literal>></literal>
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+ </sect1>
+
+ <sect1 id="brin-extensibility">
+ <title>Extensibility</title>
+
+ <para>
+ The <acronym>BRIN</acronym> interface has a high level of abstraction,
+ requiring the access method implementer only to implement the semantics
+ of the data type being accessed. The <acronym>BRIN</acronym> layer
+ itself takes care of concurrency, logging and searching the index structure.
+ </para>
+
+ <para>
+ All it takes to get a <acronym>BRIN</acronym> access method working is to
+ implement a few user-defined methods, which define the behavior of
+ summary values stored in the index and the way they interact with
+ scan keys.
+ In short, <acronym>BRIN</acronym> combines
+ extensibility with generality, code reuse, and a clean interface.
+ </para>
+
+ <para>
+ There are three methods that an operator class for <acronym>BRIN</acronym>
+ must provide:
+
+ <variablelist>
+ <varlistentry>
+ <term><function>Datum opcInfo(...)</></term>
+ <listitem>
+ <para>
+ Returns internal information about the summary data stored
+ about indexed columns.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><function>bool consistent(...)</function></term>
+ <listitem>
+ <para>
+ Returns whether the key is consistent with the given index tuple.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><function>bool addValue(...)</function></term>
+ <listitem>
+ <para>
+ Modifies the index tuple to make it consistent with the given
+ indexed data.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+
+ <!-- this needs improvement ... -->
+ To implement these methods in a generic way, normally the opclass
+ defines its own internal support functions. For instance, minmax
+ opclasses add the support functions for the four inequality operators
+ for the datatype.
+ Additionally, the operator class must supply appropriate
+ operator entries,
+ to enable the optimizer to use the index when those operators are
+ used in queries.
+ </para>
+ </sect1>
+ </chapter>
*** a/doc/src/sgml/filelist.sgml
--- b/doc/src/sgml/filelist.sgml
***************
*** 87,92 ****
--- 87,93 ----
<!ENTITY gist SYSTEM "gist.sgml">
<!ENTITY spgist SYSTEM "spgist.sgml">
<!ENTITY gin SYSTEM "gin.sgml">
+ <!ENTITY brin SYSTEM "brin.sgml">
<!ENTITY planstats SYSTEM "planstats.sgml">
<!ENTITY indexam SYSTEM "indexam.sgml">
<!ENTITY nls SYSTEM "nls.sgml">
*** a/doc/src/sgml/indices.sgml
--- b/doc/src/sgml/indices.sgml
***************
*** 116,122 **** CREATE INDEX test1_id_index ON test1 (id);
<para>
<productname>PostgreSQL</productname> provides several index types:
! B-tree, Hash, GiST, SP-GiST and GIN. Each index type uses a different
algorithm that is best suited to different types of queries.
By default, the <command>CREATE INDEX</command> command creates
B-tree indexes, which fit the most common situations.
--- 116,123 ----
<para>
<productname>PostgreSQL</productname> provides several index types:
! B-tree, Hash, GiST, SP-GiST, GIN and BRIN.
! Each index type uses a different
algorithm that is best suited to different types of queries.
By default, the <command>CREATE INDEX</command> command creates
B-tree indexes, which fit the most common situations.
***************
*** 326,331 **** SELECT * FROM places ORDER BY location <-> point '(101,456)' LIMIT 10;
--- 327,365 ----
classes are available in the <literal>contrib</> collection or as separate
projects. For more information see <xref linkend="GIN">.
</para>
+
+ <para>
+ <indexterm>
+ <primary>index</primary>
+ <secondary>BRIN</secondary>
+ </indexterm>
+ <indexterm>
+ <primary>BRIN</primary>
+ <see>index</see>
+ </indexterm>
+ BRIN indexes (a shorthand for Block Range indexes)
+ store summaries of the values stored in consecutive physical block ranges of a table.
+ Like GiST, SP-GiST and GIN,
+ BRIN can support many different indexing strategies,
+ and the particular operators with which a BRIN index can be used
+ vary depending on the indexing strategy.
+ For datatypes that have a linear sort order, the indexed data
+ corresponds to the minimum and maximum values in the
+ column for each block range,
+ which support indexed queries using these operators:
+
+ <simplelist>
+ <member><literal><</literal></member>
+ <member><literal><=</literal></member>
+ <member><literal>=</literal></member>
+ <member><literal>>=</literal></member>
+ <member><literal>></literal></member>
+ </simplelist>
+
+ The BRIN operator classes included in the standard distribution are
+ documented in <xref linkend="brin-builtin-opclasses-table">.
+ For more information see <xref linkend="BRIN">.
+ </para>
</sect1>
*** a/doc/src/sgml/postgres.sgml
--- b/doc/src/sgml/postgres.sgml
***************
*** 247,252 ****
--- 247,253 ----
&gist;
&spgist;
&gin;
+ &brin;
&storage;
&bki;
&planstats;
*** /dev/null
--- b/minmax-proposal
***************
*** 0 ****
--- 1,306 ----
+ Minmax Range Indexes
+ ====================
+
+ Minmax indexes are a new access method intended to enable very fast scanning of
+ extremely large tables.
+
+ The essential idea of a minmax index is to keep track of summarizing values in
+ consecutive groups of heap pages (page ranges); for example, the minimum and
+ maximum values for datatypes with a btree opclass, or the bounding box for
+ geometric types. These values can be used by constraint exclusion to avoid
+ scanning such pages, depending on query quals.
+
+ The main drawback of this is having to update the stored summary values of each
+ page range as tuples are inserted into them.
+
+ Other database systems already have similar features. Some examples:
+
+ * Oracle Exadata calls this "storage indexes"
+ http://richardfoote.wordpress.com/category/storage-indexes/
+
+ * Netezza has "zone maps"
+ http://nztips.com/2010/11/netezza-integer-join-keys/
+
+ * Infobright has this automatically within their "data packs" according to a
+ May 3rd, 2009 blog post
+ http://www.infobright.org/index.php/organizing_data_and_more_about_rough_data_contest/
+
+ * MonetDB also uses this technique, according to a published paper
+ http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.2662
+ "Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS"
+
+ Index creation
+ --------------
+
+ To create a minmax index, we use the standard wording:
+
+ CREATE INDEX foo_minmax_idx ON foo USING MINMAX (a, b, e);
+
+ Partial indexes are not supported currently; since an index is concerned with
+ summary values of the involved columns across all the pages in the table, it
+ normally doesn't make sense to exclude some tuples. These might be useful if
+ the index predicates are also used in queries. We exclude these for now for
+ conceptual simplicity.
+
+ Expressional indexes can probably be supported in the future, but we disallow
+ them initially for conceptual simplicity.
+
+ Having multiple minmax indexes in the same table is acceptable, though most of
+ the time it would make more sense to have a single index covering all the
+ interesting columns. Multiple indexes might be useful for columns added later.
+
+ Access Method Design
+ --------------------
+
+ Since item pointers are not stored inside indexes of this type, it is not
+ possible to support the amgettuple interface. Instead, we only provide
+ amgetbitmap support; scanning a relation using this index requires a recheck
+ node on top. The amgetbitmap routine returns a TIDBitmap comprising all pages
+ in those page groups that match the query qualifications. The recheck node
+ prunes tuples that are not visible according to the query qualifications.
+
+ For each supported datatype, we need an operator class with the following
+ catalog entries:
+
+ - support operators (pg_amop): same as btree (<, <=, =, >=, >)
+ - support procedures (pg_amproc):
+ * "opcinfo" (procno 1) initializes a structure for index creation or scanning
+ * "addValue" (procno 2) takes an index tuple and a heap item, and possibly
+ changes the index tuple so that it includes the heap item values
+ * "consistent" (procno 3) takes an index tuple and query quals, and returns
+ whether the index tuple values match the query quals.
+
+ These are used pervasively:
+
+ - The optimizer requires them to evaluate queries, so that the index is chosen
+ when queries on the indexed table are planned.
+ - During index construction (ambuild), they are used to determine the boundary
+ values for each page range.
+ - During index updates (aminsert), they are used to determine whether the new
+ heap tuple matches the existing index tuple; and if not, they are used to
+ construct the new index tuple.
+
+ In each index tuple (corresponding to one page range), we store:
+ - for each indexed column of a datatype with a btree-opclass:
+ * minimum value across all tuples in the range
+ * maximum value across all tuples in the range
+ * are there nulls present in any tuple?
+ * are all the values in all tuples in the range null?
+
+ Different datatypes store other values instead of min/max, for example
+ geometric types might store a bounding box. The NULL bits are always present.
+
+ These null bits are stored in a single null bitmask of length 2x number of
+ columns.
+
+ With the default INDEX_MAX_KEYS of 32, and considering columns of 8-byte length
+ types such as timestamptz or bigint, each tuple would be 522 bytes in length,
+ which seems reasonable. There are 6 extra bytes for padding between the null
+ bitmask and the first data item, assuming 64-bit alignment; so the total size
+ for such an index would actually be 528 bytes.
+
+ This maximum index tuple size is calculated as: mt_info (2 bytes) + null bitmap
+ (8 bytes) + data value (8 bytes) * 32 * 2
+
+ (Of course, larger columns are possible, such as varchar, but creating minmax
+ indexes on such columns seems of little practical usefulness. Also, the
+ usefulness of an index containing so many columns is dubious.)
+
+ There can be gaps where some pages have no covering index entry.
+
+ The Range Reverse Map
+ ---------------------
+
+ To find out the index tuple for a particular page range, we have an internal
+ structure we call the range reverse map. This stores one TID per page range,
+ which is the address of the index tuple summarizing that range. Since these
+ map entries are fixed size, it is possible to compute the address of the range
+ map entry for any given heap page by simple arithmetic.
+
+ When a new heap tuple is inserted in a summarized page range, we compare the
+ existing index tuple with the new heap tuple. If the heap tuple is outside the
+ summarization data given by the index tuple for any indexed column (or if the
+ new heap tuple contains null values but the index tuple indicates there are no
+ nulls), it is necessary to create a new index tuple with the new values. To do
+ this, a new index tuple is inserted, and the reverse range map is updated to
+ point to it. The old index tuple is left in place, for later garbage
+ collection. As an optimization, we sometimes overwrite the old index tuple in
+ place with the new data, which avoids the need for later garbage collection.
+
+ If the reverse range map points to an invalid TID, the corresponding page range
+ is considered to be not summarized.
+
+ To scan a table following a minmax index, we scan the reverse range map
+ sequentially. This yields index tuples in ascending page range order. Query
+ quals are matched to each index tuple; if they match, each page within the page
+ range is returned as part of the output TID bitmap. If there's no match, they
+ are skipped. Reverse range map entries returning invalid index TIDs, that is
+ unsummarized page ranges, are also returned in the TID bitmap.
+
+ To store the range reverse map, we map its logical page numbers to physical
+ pages. We use a large two-level BlockNumber array for this: The metapage
+ contains an array of BlockNumbers; each of these points to a "revmap array
+ page". Each revmap array page contains BlockNumbers, which in turn point to
+ "revmap regular pages", which are the ones that contain the revmap data itself.
+ Therefore, to find a given index tuple, we need to examine the metapage and
+ obtain the revmap array page number; then read the array page. From there we
+ obtain the revmap regular page number, and that one contains the TID we're
+ interested in. As an optimization, regular revmap page number 0 is stored in
+ physical page number 1, that is, the page just after the metapage. This means
+ that scanning a table of about 1300 page ranges (the number of TIDs that fit in
+ a single 8kB page) does not require accessing the metapage at all.
+
+ When tuples are added to unsummarized pages, nothing needs to happen.
+
+ Heap tuples can be removed from anywhere without restriction. It might be
+ useful to mark the corresponding index tuple somehow, if the heap tuple is one
+ of the constraining values of the summary data (i.e. either min or max in the
+ case of a btree-opclass-bearing datatype), so that in the future we are aware
+ of the need to re-execute summarization on that range, leading to a possible
+ tightening of the summary values.
+
+ Index entries that are not referenced from the revmap can be removed from the
+ main fork. This currently happens at amvacuumcleanup, though it could be
+ carried out separately; no heap scan is necessary to determine which tuples
+ are unreachable.
+
+ Summarization
+ -------------
+
+ At index creation time, the whole table is scanned; for each page range the
+ summarizing values of each indexed column and nulls bitmap are collected and
+ stored in the index.
+
+ Once in a while, it is necessary to summarize a bunch of unsummarized pages
+ (because the table has grown since the index was created), or re-summarize a
+ range that has been marked invalid. This is simple: scan the page range
+ calculating the summary values for each indexed column, then insert the new
+ index entry at the end of the index.
+
+ The easiest way to go about this seems to be to have vacuum do it. That way we can
+ simply do re-summarization on the amvacuumcleanup routine. Other answers would
+ mean we need a separate AM routine, which appears unwarranted at this stage.
+
+ Vacuuming
+ ---------
+
+ Vacuuming a table that has a minmax index does not represent a significant
+ challenge. Since no heap TIDs are stored, it's not necessary to scan the index
+ when heap tuples are removed. It might be that some min() value can be
+ incremented, or some max() value can be decremented; but this would represent
+ an optimization opportunity only, not a correctness issue. Perhaps it's
+ simpler to represent this as the need to re-run summarization on the affected
+ page range.
+
+ Note that if there are no indexes on the table other than the minmax index,
+ usage of maintenance_work_mem by vacuum can be decreased significantly, because
+ no detailed index scan needs to take place (and thus it's not necessary for
+ vacuum to save TIDs to remove). This optimization opportunity is best left for
+ future improvement.
+
+ Locking considerations
+ ----------------------
+
+ To read the TID during an index scan, we follow this protocol:
+
+ * read revmap page
+ * obtain share lock on the revmap buffer
+ * read the TID
+ * obtain share lock on buffer of main fork
+ * LockTuple the TID (using the index as relation). A shared lock is
+ sufficient. We need the LockTuple to prevent VACUUM from recycling
+ the index tuple; see below.
+ * release revmap buffer lock
+ * read the index tuple
+ * release the tuple lock
+ * release main fork buffer lock
+
+
+ To update the summary tuple for a page range, we use this protocol:
+
+ * insert a new index tuple somewhere in the main fork; note its TID
+ * read revmap page
+ * obtain exclusive lock on revmap buffer
+ * write the TID
+ * release lock
+
+ This ensures no concurrent reader can obtain a partially-written TID.
+ Note we don't need a tuple lock here. Concurrent scans don't have to
+ worry about whether they got the old or new index tuple: if they get the
+ old one, the tighter values are okay from a correctness standpoint because
+ due to MVCC they can't possibly see the just-inserted heap tuples anyway.
+
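+ mm_doinsert in this patch follows that sequence; stripped of WAL logging and
+ free space handling, it is essentially:
+ 
+   off = PageAddItem(page, (Item) tup, itemsz, InvalidOffsetNumber,
+                     false, false);        /* new index tuple in the main fork */
+   MarkBufferDirty(*buffer);
+   blk = BufferGetBlockNumber(*buffer);
+   LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);    /* the tuple is now in place */
+   /* repoint the revmap: takes exclusive lock on the revmap page, writes the
+    * new TID, and releases the lock */
+   mmSetHeapBlockItemptr(rmAccess, heapblkno, blk, off);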
+
+ For vacuuming, we need to figure out which index tuples are no longer
+ referenced from the reverse range map. This requires some brute force,
+ but is simple:
+
+ 1) Scan the complete index, storing each existing TID in a dynahash.
+    Hash key is the TID; hash value is a boolean, initially set to false.
+ 2) Scan the complete revmap sequentially, reading the TIDs on each page. A share
+    lock on each page is sufficient. For each TID so obtained, grab the
+    element from the hash and set the boolean to true.
+ 3) Scan the index again; for each tuple found, search the hash table.
+    If the tuple is not present in the hash, it must have been added after our
+    initial scan; ignore it. If the tuple is present in the hash and the flag is
+    true, then the tuple is referenced from the revmap; ignore it. If the flag is
+    false, then the index tuple is no longer referenced by the revmap, but it
+    could be about to be accessed by a concurrent scan. Do
+    ConditionalLockTuple. If this fails, ignore the tuple (it's in use); it will
+    be deleted by a future vacuum. If the lock is acquired, we can safely remove
+    the index tuple.
+ 4) Index pages with free space can be detected by this second scan. Register
+ those with the FSM.
+
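+ A compressed sketch of steps 1 through 3 (the full version, including WAL
+ logging and FSM maintenance, is remove_deletable_tuples in this patch):
+ 
+   /* 1: every index tuple enters the hash marked as unreferenced */
+   hitem = hash_search(tuples, &tid, HASH_ENTER, &found);
+   hitem->referenced = false;
+ 
+   /* 2: whatever the revmap still points at gets marked as referenced */
+   mmGetHeapBlockItemptr(rmAccess, heapBlk, &itupptr);
+   hitem = hash_search(tuples, &itupptr, HASH_FIND, &found);
+   if (found)
+       hitem->referenced = true;
+ 
+   /* 3: unreferenced tuples are removable only if the tuple lock is free */
+   if (!hitem->referenced &&
+       ConditionalLockTuple(idxRel, &hitem->tid, ExclusiveLock))
+       deletable[numdeletable++] = hitem->tid;   /* later bulk-deleted per page */
+ 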
+ Note this doesn't require scanning the heap at all, or being involved in
+ the heap's cleanup procedure. Also, there is no need to LockBufferForCleanup,
+ which is a nice property because index scans keep pages pinned for long
+ periods.
+
+
+
+ Optimizer
+ ---------
+
+ In order to make this all work, the only thing we need to do is ensure we have a
+ good enough opclass and amcostestimate. With this, the optimizer is able to pick
+ up the index on its own.
+
+
+ Open questions
+ --------------
+
+ * Same-size page ranges?
+ Current related literature seems to consider that each "index entry" in a
+ minmax index must cover the same number of pages. There doesn't seem to be a
+ hard reason for this to be so; it might make sense to allow the index to
+ self-tune so that some index entries cover smaller page ranges, if this allows
+ the summary values to be more compact. This would incur larger minmax
+ overhead for the index itself, but might allow better pruning of page ranges
+ during scan. In the limit of one index tuple per page, the index itself would
+ occupy too much space, even though we would be able to skip reading the largest
+ number of heap pages, because the summary values are tight; in the opposite limit of
+ a single tuple that summarizes the whole table, we wouldn't be able to prune
+ anything even though the index is very small. This can probably be made to work
+ by using the reverse range map as an index in itself.
+
+ * More compact representation for TIDBitmap?
+ TIDBitmap is the structure used to represent bitmap scans. The
+ representation of lossy page ranges is not optimal for our purposes, because
+ it uses a Bitmapset to represent pages in the range; since we're going to return
+ all pages in a large range, it might be more convenient to allow for a
+ struct that uses start and end page numbers to represent the range, instead.
+
+
+
+ References:
+
+ Email thread on pgsql-hackers
+ http://www.postgresql.org/message-id/1199296574.7260.149.camel@ebony.site
+ From: Simon Riggs
+ To: pgsql-hackers
+ Subject: Dynamic Partitioning using Segment Visibility Map
+
+ http://wiki.postgresql.org/wiki/Segment_Exclusion
+ http://wiki.postgresql.org/wiki/Segment_Visibility_Map
+
*** a/src/backend/access/Makefile
--- b/src/backend/access/Makefile
***************
*** 8,13 **** subdir = src/backend/access
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
! SUBDIRS = common gin gist hash heap index nbtree rmgrdesc spgist transam
include $(top_srcdir)/src/backend/common.mk
--- 8,13 ----
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
! SUBDIRS = common gin gist hash heap index minmax nbtree rmgrdesc spgist transam
include $(top_srcdir)/src/backend/common.mk
*** a/src/backend/access/common/reloptions.c
--- b/src/backend/access/common/reloptions.c
***************
*** 209,214 **** static relopt_int intRelOpts[] =
--- 209,221 ----
RELOPT_KIND_HEAP | RELOPT_KIND_TOAST
}, -1, 0, 2000000000
},
+ {
+ {
+ "pages_per_range",
+ "Number of pages that each page range covers in a Minmax index",
+ RELOPT_KIND_MINMAX
+ }, 128, 1, 131072
+ },
/* list terminator */
{{NULL}}
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 271,276 **** initscan(HeapScanDesc scan, ScanKey key, bool is_rescan)
--- 271,278 ----
scan->rs_startblock = 0;
}
+ scan->rs_initblock = 0;
+ scan->rs_numblocks = InvalidBlockNumber;
scan->rs_inited = false;
scan->rs_ctup.t_data = NULL;
ItemPointerSetInvalid(&scan->rs_ctup.t_self);
***************
*** 296,301 **** initscan(HeapScanDesc scan, ScanKey key, bool is_rescan)
--- 298,311 ----
pgstat_count_heap_scan(scan->rs_rd);
}
+ void
+ heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk, BlockNumber numBlks)
+ {
+ scan->rs_startblock = startBlk;
+ scan->rs_initblock = startBlk;
+ scan->rs_numblocks = numBlks;
+ }
+
/*
* heapgetpage - subroutine for heapgettup()
*
***************
*** 636,642 **** heapgettup(HeapScanDesc scan,
*/
if (backward)
{
! finished = (page == scan->rs_startblock);
if (page == 0)
page = scan->rs_nblocks;
page--;
--- 646,653 ----
*/
if (backward)
{
! finished = (page == scan->rs_startblock) ||
! (scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
if (page == 0)
page = scan->rs_nblocks;
page--;
***************
*** 646,652 **** heapgettup(HeapScanDesc scan,
page++;
if (page >= scan->rs_nblocks)
page = 0;
! finished = (page == scan->rs_startblock);
/*
* Report our new scan position for synchronization purposes. We
--- 657,664 ----
page++;
if (page >= scan->rs_nblocks)
page = 0;
! finished = (page == scan->rs_startblock) ||
! (scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
/*
* Report our new scan position for synchronization purposes. We
***************
*** 897,903 **** heapgettup_pagemode(HeapScanDesc scan,
*/
if (backward)
{
! finished = (page == scan->rs_startblock);
if (page == 0)
page = scan->rs_nblocks;
page--;
--- 909,916 ----
*/
if (backward)
{
! finished = (page == scan->rs_startblock) ||
! (scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
if (page == 0)
page = scan->rs_nblocks;
page--;
***************
*** 907,913 **** heapgettup_pagemode(HeapScanDesc scan,
page++;
if (page >= scan->rs_nblocks)
page = 0;
! finished = (page == scan->rs_startblock);
/*
* Report our new scan position for synchronization purposes. We
--- 920,927 ----
page++;
if (page >= scan->rs_nblocks)
page = 0;
! finished = (page == scan->rs_startblock) ||
! (scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
/*
* Report our new scan position for synchronization purposes. We
*** /dev/null
--- b/src/backend/access/minmax/Makefile
***************
*** 0 ****
--- 1,17 ----
+ #-------------------------------------------------------------------------
+ #
+ # Makefile--
+ # Makefile for access/minmax
+ #
+ # IDENTIFICATION
+ # src/backend/access/minmax/Makefile
+ #
+ #-------------------------------------------------------------------------
+
+ subdir = src/backend/access/minmax
+ top_builddir = ../../../..
+ include $(top_builddir)/src/Makefile.global
+
+ OBJS = minmax.o mmrevmap.o mmtuple.o mmxlog.o mmsortable.o
+
+ include $(top_srcdir)/src/backend/common.mk
*** /dev/null
--- b/src/backend/access/minmax/minmax.c
***************
*** 0 ****
--- 1,1567 ----
+ /*
+ * minmax.c
+ * Implementation of Minmax indexes for Postgres
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/minmax.c
+ *
+ * TODO
+ * * support collatable datatypes
+ * * ScalarArrayOpExpr
+ * * Make use of the stored NULL bits
+ * * we can support unlogged indexes now
+ */
+ #include "postgres.h"
+
+ #include "access/htup_details.h"
+ #include "access/minmax.h"
+ #include "access/minmax_internal.h"
+ #include "access/minmax_page.h"
+ #include "access/minmax_revmap.h"
+ #include "access/minmax_tuple.h"
+ #include "access/minmax_xlog.h"
+ #include "access/reloptions.h"
+ #include "access/relscan.h"
+ #include "access/xlogutils.h"
+ #include "catalog/index.h"
+ #include "catalog/pg_operator.h"
+ #include "commands/vacuum.h"
+ #include "miscadmin.h"
+ #include "pgstat.h"
+ #include "storage/bufmgr.h"
+ #include "storage/freespace.h"
+ #include "storage/lmgr.h"
+ #include "utils/datum.h"
+ #include "utils/lsyscache.h"
+ #include "utils/memutils.h"
+ #include "utils/syscache.h"
+
+
+ /*
+ * We use a MMBuildState during initial construction of a Minmax index.
+ * The running state is kept in a DeformedMMTuple.
+ */
+ typedef struct MMBuildState
+ {
+ Relation irel;
+ int numtuples;
+ Buffer currentInsertBuf;
+ BlockNumber pagesPerRange;
+ BlockNumber currRangeStart;
+ mmRevmapAccess *rmAccess;
+ MinmaxDesc *mmDesc;
+ DeformedMMTuple *dtuple;
+ } MMBuildState;
+
+ /*
+ * Struct used as "opaque" during index scans
+ */
+ typedef struct MinmaxOpaque
+ {
+ BlockNumber pagesPerRange;
+ mmRevmapAccess *rmAccess;
+ MinmaxDesc *mmDesc;
+ } MinmaxOpaque;
+
+ static MMBuildState *initialize_mm_buildstate(Relation idxRel,
+ mmRevmapAccess *rmAccess, BlockNumber pagesPerRange);
+ static void remove_deletable_tuples(Relation idxRel, BlockNumber heapNumBlocks,
+ BufferAccessStrategy strategy,
+ BlockNumber **nonsummed, int *numnonsummed);
+ static void rerun_summarization(Relation idxRel, Relation heapRel,
+ mmRevmapAccess *rmAccess, BlockNumber pagesPerRange,
+ BlockNumber *nonsummarized, int numnonsummarized);
+ static void mm_doinsert(Relation idxrel, mmRevmapAccess *rmAccess,
+ Buffer *buffer, BlockNumber heapblkno, MMTuple *tup, Size itemsz);
+ static bool mm_getinsertbuffer(Relation irel, Buffer *buffer, Size itemsz);
+ static void form_and_insert_tuple(MMBuildState *mmstate);
+ static int qsortCompareItemPointers(const void *a, const void *b);
+
+
+ /*
+ * A tuple in the heap is being inserted. To keep a minmax index up to date,
+ * we need to obtain the relevant index tuple, compare its min()/max() stored
+ * values with those of the new tuple; if the tuple values are in range,
+  * there's nothing to do; otherwise we need to update the index (either by
+  * inserting a new index tuple and repointing the revmap, or by overwriting
+  * the existing index tuple).
+ *
+ * If the range is not currently summarized (i.e. the revmap returns InvalidTid
+ * for it), there's nothing to do either.
+ */
+ Datum
+ mminsert(PG_FUNCTION_ARGS)
+ {
+ Relation idxRel = (Relation) PG_GETARG_POINTER(0);
+ Datum *values = (Datum *) PG_GETARG_POINTER(1);
+ bool *nulls = (bool *) PG_GETARG_POINTER(2);
+ ItemPointer heaptid = (ItemPointer) PG_GETARG_POINTER(3);
+
+ /* we ignore the rest of our arguments */
+ MinmaxDesc *mmdesc;
+ mmRevmapAccess *rmAccess;
+ ItemId origlp;
+ MMTuple *mmtup;
+ DeformedMMTuple *dtup;
+ ItemPointerData idxtid;
+ BlockNumber heapBlk;
+ BlockNumber iblk;
+ OffsetNumber ioff;
+ Buffer buf;
+ IndexInfo *indexInfo;
+ Page page;
+ int keyno;
+ bool need_insert = false;
+
+ rmAccess = mmRevmapAccessInit(idxRel, NULL);
+
+ heapBlk = ItemPointerGetBlockNumber(heaptid);
+ mmGetHeapBlockItemptr(rmAccess, heapBlk, &idxtid);
+ /* tuple lock on idxtid is grabbed by mmGetHeapBlockItemptr */
+
+ if (!ItemPointerIsValid(&idxtid))
+ {
+ /* nothing to do, range is unsummarized */
+ mmRevmapAccessTerminate(rmAccess);
+ return BoolGetDatum(false);
+ }
+
+ indexInfo = BuildIndexInfo(idxRel);
+ mmdesc = minmax_build_mmdesc(idxRel);
+
+ iblk = ItemPointerGetBlockNumber(&idxtid);
+ ioff = ItemPointerGetOffsetNumber(&idxtid);
+ Assert(iblk != InvalidBlockNumber);
+ buf = ReadBuffer(idxRel, iblk);
+
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ UnlockTuple(idxRel, &idxtid, ShareLock);
+ page = BufferGetPage(buf);
+ origlp = PageGetItemId(page, ioff);
+ mmtup = (MMTuple *) PageGetItem(page, origlp);
+
+ dtup = minmax_deform_tuple(mmdesc, mmtup);
+
+ /*
+ * Compare the key values of the new tuple to the stored index values; our
+ * deformed tuple will get updated if the new tuple doesn't fit the
+ * original range (note this means we can't break out of the loop early).
+ * Make a note of whether this happens, so that we know to insert the
+ * modified tuple later.
+ */
+ for (keyno = 0; keyno < indexInfo->ii_NumIndexAttrs; keyno++)
+ {
+ Datum result;
+ FmgrInfo *addValue;
+
+ addValue = index_getprocinfo(idxRel, keyno + 1,
+ MINMAX_PROCNUM_ADDVALUE);
+
+ result = FunctionCall5Coll(addValue,
+ PG_GET_COLLATION(),
+ PointerGetDatum(mmdesc),
+ PointerGetDatum(dtup),
+ UInt16GetDatum(keyno + 1),
+ values[keyno],
+ nulls[keyno]);
+ /* if that returned true, we need to insert the updated tuple */
+ need_insert |= DatumGetBool(result);
+ }
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ if (need_insert)
+ {
+ Size tupsz;
+ MMTuple *tup;
+
+ tup = minmax_form_tuple(mmdesc, dtup, &tupsz);
+
+ /*
+ 		 * If the size of the original tuple is greater than or equal to that of the new
+ * index tuple, we can overwrite. This saves regular page bloat, and
+ * also saves revmap traffic. This might leave some unused space
+ * before the start of the next tuple, but we don't worry about that
+ * here.
+ *
+ * We avoid doing this when the itempointer of the index tuple would
+ * change, because that would require an update to the revmap while
+ * holding exclusive lock on this page, which would reduce concurrency.
+ *
+ 		 * Note that we continue to access 'origlp' here, even though there
+ * was an interval during which the page wasn't locked. Since we hold
+ * pin on the page, this is okay -- the buffer cannot go away from
+ * under us, and also tuples cannot be shuffled around.
+ */
+ if (tupsz <= ItemIdGetLength(origlp))
+ {
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ START_CRIT_SECTION();
+ PageOverwriteItemData(BufferGetPage(buf),
+ ioff,
+ (Item) tup, tupsz);
+ MarkBufferDirty(buf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(idxRel))
+ {
+ xl_minmax_insert xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+ uint8 info = XLOG_MINMAX_INSERT;
+
+ xlrec.target.node = idxRel->rd_node;
+ xlrec.target.tid = idxtid;
+ xlrec.overwrite = true;
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxInsert;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = (char *) tup;
+ rdata[1].len = tupsz;
+ rdata[1].buffer = buf;
+ rdata[1].buffer_std = true;
+ rdata[1].next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ }
+ else
+ {
+ /*
+ * The new tuple is larger than the original one, so we must insert
+ * a new one the slow way.
+ */
+ mm_doinsert(idxRel, rmAccess, &buf, heapBlk, tup, tupsz);
+
+ #ifdef NOT_YET
+ /*
+ * Possible optimization: if we can grab an exclusive lock on the
+ * buffer containing the old tuple right away, we can also seize
+ * the opportunity to prune the old tuple and avoid some bloat.
+ * This is not necessary for correctness.
+ */
+ if (ConditionalLockBuffer(buf))
+ {
+ /* prune the old tuple */
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ }
+ #endif
+ }
+ }
+
+ ReleaseBuffer(buf);
+
+ mmRevmapAccessTerminate(rmAccess);
+
+ return BoolGetDatum(false);
+ }
+
+ /*
+ * Initialize state for a Minmax index scan.
+ *
+ * We read the metapage here to determine the pages-per-range number that this
+  * index was built with. Note that since this cannot be changed while we're
+  * holding a lock on the index, it's not necessary to recompute it during mmrescan.
+ */
+ Datum
+ mmbeginscan(PG_FUNCTION_ARGS)
+ {
+ Relation r = (Relation) PG_GETARG_POINTER(0);
+ int nkeys = PG_GETARG_INT32(1);
+ int norderbys = PG_GETARG_INT32(2);
+ IndexScanDesc scan;
+ MinmaxOpaque *opaque;
+
+ scan = RelationGetIndexScan(r, nkeys, norderbys);
+
+ opaque = (MinmaxOpaque *) palloc(sizeof(MinmaxOpaque));
+ opaque->rmAccess = mmRevmapAccessInit(r, &opaque->pagesPerRange);
+ scan->opaque = opaque;
+
+ PG_RETURN_POINTER(scan);
+ }
+
+ /*
+ * Execute the index scan.
+ *
+ * This works by reading index TIDs from the revmap, and obtaining the index
+ * tuples pointed to by them; the summary values in the index tuples are
+ * compared to the scan keys. We return into the TID bitmap all the pages in
+ * ranges corresponding to index tuples that match the scan keys.
+ *
+ * If a TID from the revmap is read as InvalidTID, we know that range is
+ * unsummarized. Pages in those ranges need to be returned regardless of scan
+ * keys.
+ *
+ * XXX see _bt_first on what to do about sk_subtype.
+ */
+ Datum
+ mmgetbitmap(PG_FUNCTION_ARGS)
+ {
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+ TIDBitmap *tbm = (TIDBitmap *) PG_GETARG_POINTER(1);
+ Relation idxRel = scan->indexRelation;
+ Buffer currIdxBuf = InvalidBuffer;
+ MinmaxDesc *mmdesc = minmax_build_mmdesc(idxRel);
+ Oid heapOid;
+ Relation heapRel;
+ MinmaxOpaque *opaque;
+ BlockNumber nblocks;
+ BlockNumber heapBlk;
+ int totalpages = 0;
+ int keyno;
+ FmgrInfo *consistentFn;
+
+ opaque = (MinmaxOpaque *) scan->opaque;
+ pgstat_count_index_scan(idxRel);
+
+ /*
+ * XXX We need to know the size of the table so that we know how long to
+ * iterate on the revmap. There's room for improvement here, in that we
+ * could have the revmap tell us when to stop iterating.
+ */
+ heapOid = IndexGetRelation(RelationGetRelid(idxRel), false);
+ heapRel = heap_open(heapOid, AccessShareLock);
+ nblocks = RelationGetNumberOfBlocks(heapRel);
+ heap_close(heapRel, AccessShareLock);
+
+ /*
+ 	 * Obtain the consistent functions for all indexed columns. Maybe it'd be
+ * possible to do this lazily only the first time we see a scan key that
+ * involves each particular attribute.
+ */
+ consistentFn = palloc(sizeof(FmgrInfo) * mmdesc->md_tupdesc->natts);
+ for (keyno = 0; keyno < mmdesc->md_tupdesc->natts; keyno++)
+ {
+ FmgrInfo *tmp;
+
+ tmp = index_getprocinfo(idxRel, keyno + 1, MINMAX_PROCNUM_CONSISTENT);
+ fmgr_info_copy(&consistentFn[keyno], tmp, CurrentMemoryContext);
+ }
+
+ /*
+ * Now scan the revmap. We start by querying for heap page 0,
+ * incrementing by the number of pages per range; this gives us a full
+ * view of the table.
+ */
+ for (heapBlk = 0; heapBlk < nblocks; heapBlk += opaque->pagesPerRange)
+ {
+ ItemPointerData itupptr;
+ bool addrange;
+
+ mmGetHeapBlockItemptr(opaque->rmAccess, heapBlk, &itupptr);
+
+ /*
+ * For revmap items that return InvalidTID, we must return the whole
+ * range; otherwise, fetch the index item and compare it to the scan
+ * keys.
+ */
+ if (!ItemPointerIsValid(&itupptr))
+ {
+ addrange = true;
+ }
+ else
+ {
+ Page page;
+ OffsetNumber idxoffno;
+ BlockNumber idxblkno;
+ MMTuple *tup;
+ DeformedMMTuple *dtup;
+ int keyno;
+
+ /*
+ * Obtain the buffer that contains the tuple. We might already
+ * have it pinned.
+ */
+ idxoffno = ItemPointerGetOffsetNumber(&itupptr);
+ idxblkno = ItemPointerGetBlockNumber(&itupptr);
+ if (currIdxBuf == InvalidBuffer ||
+ idxblkno != BufferGetBlockNumber(currIdxBuf))
+ {
+ if (currIdxBuf != InvalidBuffer)
+ UnlockReleaseBuffer(currIdxBuf);
+
+ Assert(idxblkno != InvalidBlockNumber);
+ currIdxBuf = ReadBuffer(idxRel, idxblkno);
+ LockBuffer(currIdxBuf, BUFFER_LOCK_SHARE);
+ }
+
+ /*
+ * We now have the containing buffer locked, so we can release the
+ * tuple lock.
+ */
+ UnlockTuple(idxRel, &itupptr, ShareLock);
+
+ page = BufferGetPage(currIdxBuf);
+ tup = (MMTuple *) PageGetItem(page, PageGetItemId(page, idxoffno));
+ dtup = minmax_deform_tuple(mmdesc, tup);
+
+ /*
+ * Compare scan keys with summary values stored for the range. If
+ * scan keys are matched, the page range must be added to the
+ * bitmap. We initially assume the range needs to be added; in
+ * particular this serves the case where there are no keys.
+ */
+ addrange = true;
+ for (keyno = 0; keyno < scan->numberOfKeys; keyno++)
+ {
+ ScanKey key = &scan->keyData[keyno];
+ AttrNumber keyattno = key->sk_attno;
+ Datum add;
+
+ /*
+ * The collation of the scan key must match the collation used
+ * in the index column. Otherwise we shouldn't be using this
+ * index ...
+ */
+ Assert(key->sk_collation ==
+ mmdesc->md_tupdesc->attrs[keyattno - 1]->attcollation);
+
+ /*
+ * Check whether the scan key is consistent with the page range
+ * values; if so, have the pages in the range added to the
+ * output bitmap.
+ *
+ * When there are multiple scan keys, failure to meet the
+ * criteria for a single one of them is enough to discard the
+ * range as a whole, so break out of the loop as soon as a
+ * false return value is obtained.
+ */
+ add = FunctionCall3Coll(&consistentFn[keyattno - 1],
+ key->sk_collation,
+ PointerGetDatum(mmdesc),
+ PointerGetDatum(dtup),
+ PointerGetDatum(key));
+ addrange = DatumGetBool(add);
+ if (!addrange)
+ break;
+ }
+
+ pfree(dtup);
+ }
+
+ /* add the pages in the range to the output bitmap, if needed */
+ if (addrange)
+ {
+ BlockNumber pageno;
+
+ for (pageno = heapBlk;
+ pageno <= heapBlk + opaque->pagesPerRange - 1;
+ pageno++)
+ {
+ tbm_add_page(tbm, pageno);
+ totalpages++;
+ }
+ }
+ }
+
+ if (currIdxBuf != InvalidBuffer)
+ UnlockReleaseBuffer(currIdxBuf);
+
+ /*
+ * XXX We have an approximation of the number of *pages* that our scan
+ * returns, but we don't have a precise idea of the number of heap tuples
+ * involved.
+ */
+ PG_RETURN_INT64(totalpages * 10);
+ }
+
+ /*
+ * Re-initialize state for a minmax index scan
+ */
+ Datum
+ mmrescan(PG_FUNCTION_ARGS)
+ {
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+ ScanKey scankey = (ScanKey) PG_GETARG_POINTER(1);
+ /* other arguments ignored */
+
+ if (scankey && scan->numberOfKeys > 0)
+ memmove(scan->keyData, scankey,
+ scan->numberOfKeys * sizeof(ScanKeyData));
+
+ PG_RETURN_VOID();
+ }
+
+ /*
+ * Close down a minmax index scan
+ */
+ Datum
+ mmendscan(PG_FUNCTION_ARGS)
+ {
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+ MinmaxOpaque *opaque = (MinmaxOpaque *) scan->opaque;
+
+ mmRevmapAccessTerminate(opaque->rmAccess);
+ pfree(opaque);
+
+ PG_RETURN_VOID();
+ }
+
+ Datum
+ mmmarkpos(PG_FUNCTION_ARGS)
+ {
+ elog(ERROR, "MinMax does not support mark/restore");
+ PG_RETURN_VOID();
+ }
+
+ Datum
+ mmrestrpos(PG_FUNCTION_ARGS)
+ {
+ elog(ERROR, "MinMax does not support mark/restore");
+ PG_RETURN_VOID();
+ }
+
+ /*
+ * Per-heap-tuple callback for IndexBuildHeapScan.
+ *
+ * Note we don't worry about the page range at the end of the table here; it is
+  * present in the build state struct after we're called the last time, but not
+  * inserted into the index. Caller must take care of inserting it, if appropriate.
+ */
+ static void
+ mmbuildCallback(Relation index,
+ HeapTuple htup,
+ Datum *values,
+ bool *isnull,
+ bool tupleIsAlive,
+ void *state)
+ {
+ MMBuildState *mmstate = (MMBuildState *) state;
+ BlockNumber thisblock;
+ int i;
+
+ thisblock = ItemPointerGetBlockNumber(&htup->t_self);
+
+ /*
+ * If we're in a new block which belongs to the next range, summarize what
+ * we've got and start afresh.
+ */
+ if (thisblock > (mmstate->currRangeStart + mmstate->pagesPerRange - 1))
+ {
+
+ MINMAX_elog(DEBUG2, "mmbuildCallback: completed a range: %u--%u",
+ mmstate->currRangeStart,
+ mmstate->currRangeStart + mmstate->pagesPerRange);
+
+ /* create the index tuple and insert it */
+ form_and_insert_tuple(mmstate);
+
+ /* set state to correspond to the next range */
+ mmstate->currRangeStart += mmstate->pagesPerRange;
+
+ /* re-initialize state for it */
+ minmax_dtuple_initialize(mmstate->dtuple, mmstate->mmDesc);
+ }
+
+ /* Accumulate the current tuple into the running state */
+ mmstate->dtuple->dt_seentup = true;
+ for (i = 0; i < mmstate->mmDesc->md_tupdesc->natts; i++)
+ {
+ FmgrInfo *addValue;
+
+ addValue = index_getprocinfo(index, i + 1,
+ MINMAX_PROCNUM_ADDVALUE);
+
+ /*
+ * Update dtuple state, if and as necessary.
+ */
+ FunctionCall5Coll(addValue,
+ mmstate->mmDesc->md_tupdesc->attrs[i]->attcollation,
+ PointerGetDatum(mmstate->mmDesc),
+ PointerGetDatum(mmstate->dtuple),
+ UInt16GetDatum(i + 1), values[i], isnull[i]);
+ }
+ }
+
+ /*
+ * mmbuild() -- build a new minmax index.
+ */
+ Datum
+ mmbuild(PG_FUNCTION_ARGS)
+ {
+ Relation heap = (Relation) PG_GETARG_POINTER(0);
+ Relation index = (Relation) PG_GETARG_POINTER(1);
+ IndexInfo *indexInfo = (IndexInfo *) PG_GETARG_POINTER(2);
+ IndexBuildResult *result;
+ double reltuples;
+ mmRevmapAccess *rmAccess;
+ MMBuildState *mmstate;
+ Buffer meta;
+ BlockNumber pagesPerRange;
+
+ /*
+ * We expect to be called exactly once for any index relation.
+ */
+ if (RelationGetNumberOfBlocks(index) != 0)
+ elog(ERROR, "index \"%s\" already contains data",
+ RelationGetRelationName(index));
+
+ /* partial indexes not supported */
+ if (indexInfo->ii_Predicate != NIL)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("partial indexes not supported")));
+ /* expressions not supported (yet?) */
+ if (indexInfo->ii_Expressions != NIL)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("expression indexes not supported")));
+
+ meta = mm_getnewbuffer(index);
+ START_CRIT_SECTION();
+ mm_metapage_init(BufferGetPage(meta), MinmaxGetPagesPerRange(index),
+ MINMAX_CURRENT_VERSION);
+ MarkBufferDirty(meta);
+
+ if (RelationNeedsWAL(index))
+ {
+ xl_minmax_createidx xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata;
+ Page page;
+
+ xlrec.node = index->rd_node;
+ xlrec.version = MINMAX_CURRENT_VERSION;
+ xlrec.pagesPerRange = MinmaxGetPagesPerRange(index);
+
+ rdata.buffer = InvalidBuffer;
+ rdata.data = (char *) &xlrec;
+ rdata.len = SizeOfMinmaxCreateIdx;
+ rdata.next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_CREATE_INDEX, &rdata);
+
+ page = BufferGetPage(meta);
+ PageSetLSN(page, recptr);
+ }
+
+ UnlockReleaseBuffer(meta);
+ END_CRIT_SECTION();
+
+ /*
+ * Set up an empty revmap, and get access to it
+ */
+ mmRevmapCreate(index);
+ rmAccess = mmRevmapAccessInit(index, &pagesPerRange);
+
+ /*
+ * Initialize our state, including the deformed tuple state.
+ */
+ mmstate = initialize_mm_buildstate(index, rmAccess, pagesPerRange);
+
+ /*
+ * Now scan the relation. No syncscan allowed here because we want the
+ * heap blocks in physical order.
+ */
+ reltuples = IndexBuildHeapScan(heap, index, indexInfo, false,
+ mmbuildCallback, (void *) mmstate);
+
+ /* process the final batch */
+ form_and_insert_tuple(mmstate);
+
+ /* release the last index buffer used */
+ if (!BufferIsInvalid(mmstate->currentInsertBuf))
+ {
+ ReleaseBuffer(mmstate->currentInsertBuf);
+ mmstate->currentInsertBuf = InvalidBuffer;
+ }
+
+ mmRevmapAccessTerminate(mmstate->rmAccess);
+
+ /*
+ * Return statistics
+ */
+ result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+
+ result->heap_tuples = reltuples;
+ result->index_tuples = mmstate->numtuples;
+
+ PG_RETURN_POINTER(result);
+ }
+
+ Datum
+ mmbuildempty(PG_FUNCTION_ARGS)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("unlogged MinMax indexes are not supported")));
+
+ PG_RETURN_VOID();
+ }
+
+ /*
+ * mmbulkdelete
+ * Since there are no per-heap-tuple index tuples in minmax indexes,
+ * there's not a lot we can do here.
+ *
+ * XXX we could mark item tuples as "dirty" (when a minimum or maximum heap
+  * tuple is deleted), signalling the need to re-run summarization on the affected
+  * range. We would need an extra flag in mmtuples for that.
+ */
+ Datum
+ mmbulkdelete(PG_FUNCTION_ARGS)
+ {
+ /* other arguments are not currently used */
+ IndexBulkDeleteResult *stats = (IndexBulkDeleteResult *) PG_GETARG_POINTER(1);
+
+ /* allocate stats if first time through, else re-use existing struct */
+ if (stats == NULL)
+ stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+ PG_RETURN_POINTER(stats);
+ }
+
+ /*
+ * This routine is in charge of "vacuuming" a minmax index: 1) remove index
+ * tuples that are no longer referenced from the revmap. 2) summarize ranges
+ * that are currently unsummarized.
+ */
+ Datum
+ mmvacuumcleanup(PG_FUNCTION_ARGS)
+ {
+ IndexVacuumInfo *info = (IndexVacuumInfo *) PG_GETARG_POINTER(0);
+ IndexBulkDeleteResult *stats = (IndexBulkDeleteResult *) PG_GETARG_POINTER(1);
+ mmRevmapAccess *rmAccess;
+ BlockNumber *nonsummarized = NULL;
+ int numnonsummarized;
+ Relation heapRel;
+ BlockNumber heapNumBlocks;
+ BlockNumber pagesPerRange;
+
+ /* No-op in ANALYZE ONLY mode */
+ if (info->analyze_only)
+ PG_RETURN_POINTER(stats);
+
+ rmAccess = mmRevmapAccessInit(info->index, &pagesPerRange);
+
+ heapRel = heap_open(IndexGetRelation(RelationGetRelid(info->index), false),
+ AccessShareLock);
+
+ /*
+ * First scan the index, removing index tuples that are no longer
+ * referenced from the revmap. While at it, collect the page numbers of
+ * ranges that are not summarized.
+ */
+ heapNumBlocks = RelationGetNumberOfBlocks(heapRel);
+ remove_deletable_tuples(info->index, heapNumBlocks, info->strategy,
+ &nonsummarized, &numnonsummarized);
+
+ /* and summarize the ranges collected above */
+ if (nonsummarized)
+ {
+ rerun_summarization(info->index, heapRel, rmAccess, pagesPerRange,
+ nonsummarized, numnonsummarized);
+ pfree(nonsummarized);
+ }
+
+ mmRevmapAccessTerminate(rmAccess);
+ heap_close(heapRel, AccessShareLock);
+
+ PG_RETURN_POINTER(stats);
+ }
+
+ /*
+ * reloptions processor for minmax indexes
+ */
+ Datum
+ mmoptions(PG_FUNCTION_ARGS)
+ {
+ Datum reloptions = PG_GETARG_DATUM(0);
+ bool validate = PG_GETARG_BOOL(1);
+ relopt_value *options;
+ MinmaxOptions *rdopts;
+ int numoptions;
+ static const relopt_parse_elt tab[] = {
+ {"pages_per_range", RELOPT_TYPE_INT, offsetof(MinmaxOptions, pagesPerRange)}
+ };
+
+ options = parseRelOptions(reloptions, validate, RELOPT_KIND_MINMAX,
+ &numoptions);
+
+ /* if none set, we're done */
+ if (numoptions == 0)
+ PG_RETURN_NULL();
+
+ rdopts = allocateReloptStruct(sizeof(MinmaxOptions), options, numoptions);
+
+ fillRelOptions((void *) rdopts, sizeof(MinmaxOptions), options, numoptions,
+ validate, tab, lengthof(tab));
+
+ pfree(options);
+
+ PG_RETURN_BYTEA_P(rdopts);
+ }
+
+ /*
+ * Return an exclusively-locked buffer resulting from extending the relation.
+ */
+ Buffer
+ mm_getnewbuffer(Relation irel)
+ {
+ Buffer buffer;
+ bool needLock = !RELATION_IS_LOCAL(irel);
+
+ /*
+ 	 * XXX As a possible improvement, we could request a blank page from the FSM
+ * here. Such pages could get inserted into the FSM if, for instance, two
+ * processes extend the relation concurrently to add one more page to the
+ * revmap and the second one discovers it doesn't actually need the page it
+ * got.
+ */
+
+ if (needLock)
+ LockRelationForExtension(irel, ExclusiveLock);
+
+ buffer = ReadBuffer(irel, P_NEW);
+ LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+ MINMAX_elog(DEBUG2, "mm_getnewbuffer: extending to page %u",
+ BufferGetBlockNumber(buffer));
+
+ if (needLock)
+ UnlockRelationForExtension(irel, ExclusiveLock);
+
+ return buffer;
+ }
+
+ /*
+ * Initialize a page with the given type.
+ *
+ * Caller is responsible for marking it dirty, as appropriate.
+ */
+ void
+ mm_page_init(Page page, uint16 type)
+ {
+ MinmaxSpecialSpace *special;
+
+ PageInit(page, BLCKSZ, sizeof(MinmaxSpecialSpace));
+
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
+ special->type = type;
+ }
+
+ /*
+ * Initialize a new minmax index' metapage.
+ */
+ void
+ mm_metapage_init(Page page, BlockNumber pagesPerRange, uint16 version)
+ {
+ MinmaxMetaPageData *metadata;
+ int i;
+
+ mm_page_init(page, MINMAX_PAGETYPE_META);
+
+ metadata = (MinmaxMetaPageData *) PageGetContents(page);
+
+ metadata->minmaxMagic = MINMAX_META_MAGIC;
+ metadata->pagesPerRange = pagesPerRange;
+ metadata->minmaxVersion = version;
+ for (i = 0; i < MAX_REVMAP_ARRAYPAGES; i++)
+ metadata->revmapArrayPages[i] = InvalidBlockNumber;
+ }
+
+ /*
+ * Build a MinmaxDesc used to create or scan a minmax index
+ */
+ MinmaxDesc *
+ minmax_build_mmdesc(Relation rel)
+ {
+ MinmaxOpcInfo **opcinfo;
+ MinmaxDesc *mmdesc;
+ TupleDesc tupdesc;
+ int totalstored = 0;
+ int keyno;
+ long totalsize;
+ Datum indclassDatum;
+ oidvector *indclass;
+ bool isnull;
+
+ tupdesc = RelationGetDescr(rel);
+
+ /*
+ * Obtain MinmaxOpcInfo for each indexed column. While at it, accumulate
+ * the number of columns stored, since the number is opclass-defined.
+ */
+ indclassDatum = SysCacheGetAttr(INDEXRELID, rel->rd_indextuple,
+ Anum_pg_index_indclass, &isnull);
+ Assert(!isnull);
+ indclass = (oidvector *) DatumGetPointer(indclassDatum);
+ opcinfo = (MinmaxOpcInfo **) palloc(sizeof(MinmaxOpcInfo *) * tupdesc->natts);
+ for (keyno = 0; keyno < tupdesc->natts; keyno++)
+ {
+ Oid opfam = get_opclass_family(indclass->values[keyno]);
+ Oid idxtypid = tupdesc->attrs[keyno]->atttypid;
+ FmgrInfo *opcInfoFn;
+
+ opcInfoFn = index_getprocinfo(rel, keyno + 1, MINMAX_PROCNUM_OPCINFO);
+
+ opcinfo[keyno] = (MinmaxOpcInfo *)
+ DatumGetPointer(FunctionCall2(opcInfoFn,
+ ObjectIdGetDatum(opfam),
+ ObjectIdGetDatum(idxtypid)));
+ totalstored += opcinfo[keyno]->oi_nstored;
+ }
+
+ /* Allocate our result struct and fill it in */
+ totalsize = offsetof(MinmaxDesc, md_info) +
+ sizeof(MinmaxOpcInfo *) * tupdesc->natts;
+
+ mmdesc = palloc(totalsize);
+ mmdesc->md_index = rel;
+ mmdesc->md_tupdesc = CreateTupleDescCopy(tupdesc);
+ mmdesc->md_disktdesc = NULL; /* generated lazily */
+ mmdesc->md_totalstored = totalstored;
+
+ for (keyno = 0; keyno < tupdesc->natts; keyno++)
+ mmdesc->md_info[keyno] = opcinfo[keyno];
+
+ return mmdesc;
+ }
+
+ /*
+ * Initialize a MMBuildState appropriate to create tuples on the given index.
+ */
+ static MMBuildState *
+ initialize_mm_buildstate(Relation idxRel, mmRevmapAccess *rmAccess,
+ BlockNumber pagesPerRange)
+ {
+ MMBuildState *mmstate;
+
+ mmstate = palloc(sizeof(MMBuildState));
+
+ mmstate->irel = idxRel;
+ mmstate->numtuples = 0;
+ mmstate->currentInsertBuf = InvalidBuffer;
+ mmstate->pagesPerRange = pagesPerRange;
+ mmstate->currRangeStart = 0;
+ mmstate->rmAccess = rmAccess;
+ mmstate->mmDesc = minmax_build_mmdesc(idxRel);
+ mmstate->dtuple = minmax_new_dtuple(mmstate->mmDesc);
+
+ minmax_dtuple_initialize(mmstate->dtuple, mmstate->mmDesc);
+
+ return mmstate;
+ }
+
+ /*
+ * Remove index tuples that are no longer useful.
+ *
+ * While at it, return in nonsummed the array (and in numnonsummed its size) of
+ * block numbers for which the revmap returns InvalidTid; this is used in a
+ * later stage to execute re-summarization. (Each block number returned
+ * corresponds to the heap page number with which each unsummarized range
+ * starts.) Space for the array is palloc'ed, and must be freed by caller.
+ *
+ * idxRel is the index relation; heapNumBlocks is the size of the heap
+ * relation; strategy is appropriate for bulk scanning.
+ */
+ static void
+ remove_deletable_tuples(Relation idxRel, BlockNumber heapNumBlocks,
+ BufferAccessStrategy strategy,
+ BlockNumber **nonsummed, int *numnonsummed)
+ {
+ HASHCTL hctl;
+ HTAB *tuples;
+ HASH_SEQ_STATUS status;
+ BlockNumber nblocks;
+ BlockNumber blk;
+ mmRevmapAccess *rmAccess;
+ BlockNumber heapBlk;
+ BlockNumber pagesPerRange;
+ int numitems = 0;
+ int numdeletable = 0;
+ ItemPointerData *deletable;
+ int start;
+ int i;
+ BlockNumber *nonsumm = NULL;
+ int maxnonsumm = 0;
+ int numnonsumm = 0;
+
+ typedef struct DeletableTuple
+ {
+ ItemPointerData tid;
+ bool referenced;
+ } DeletableTuple;
+
+ nblocks = RelationGetNumberOfBlocks(idxRel);
+
+ /* Initialize hash used to track deletable tuples */
+ memset(&hctl, 0, sizeof(hctl));
+ hctl.keysize = sizeof(ItemPointerData);
+ hctl.entrysize = sizeof(DeletableTuple);
+ hctl.hcxt = CurrentMemoryContext;
+ hctl.hash = tag_hash;
+
+ /* assume ten entries per page. No harm in getting this wrong */
+ tuples = hash_create("mmvacuumcleanup", nblocks * 10, &hctl,
+ HASH_CONTEXT | HASH_FUNCTION | HASH_ELEM);
+
+ /*
+ * Scan the index sequentially, entering each item into a hash table.
+ * Initially, the items are marked as not referenced.
+ */
+ for (blk = 0; blk < nblocks; blk++)
+ {
+ Buffer buf;
+ Page page;
+ OffsetNumber offno;
+ MinmaxSpecialSpace *special;
+
+ vacuum_delay_point();
+
+ buf = ReadBufferExtended(idxRel, MAIN_FORKNUM, blk, RBM_NORMAL,
+ strategy);
+ page = BufferGetPage(buf);
+
+ /*
+ * Verify the type of the page we got; if it's not a regular page,
+ * ignore it.
+ */
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
+ if (special->type != MINMAX_PAGETYPE_REGULAR)
+ {
+ ReleaseBuffer(buf);
+ continue;
+ }
+
+ /*
+ * Enter each live tuple into the hash table
+ */
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+ for (offno = 1; offno <= PageGetMaxOffsetNumber(page); offno++)
+ {
+ ItemPointerData tid;
+ ItemId itemid;
+ bool found;
+ DeletableTuple *hitem;
+
+ itemid = PageGetItemId(page, offno);
+ if (!ItemIdHasStorage(itemid))
+ continue;
+
+ ItemPointerSet(&tid, blk, offno);
+ hitem = (DeletableTuple *)
+ hash_search(tuples, &tid, HASH_ENTER, &found);
+ Assert(!found);
+ hitem->referenced = false;
+ numitems++;
+ }
+ UnlockReleaseBuffer(buf);
+ }
+
+ /*
+ * Now scan the revmap, and determine which of these TIDs are still
+ * referenced
+ */
+ rmAccess = mmRevmapAccessInit(idxRel, &pagesPerRange);
+ for (heapBlk = 0; heapBlk < heapNumBlocks; heapBlk += pagesPerRange)
+ {
+ ItemPointerData itupptr;
+ DeletableTuple *hitem;
+ bool found;
+
+ mmGetHeapBlockItemptr(rmAccess, heapBlk, &itupptr);
+
+ if (!ItemPointerIsValid(&itupptr))
+ {
+ /*
+ * Ignore revmap entries set to invalid. Before doing so, if the
+ * heap page range is complete but not summarized, store its
+ * initial page number in the unsummarized array, for later
+ * summarization.
+ */
+ if (heapBlk + pagesPerRange < heapNumBlocks)
+ {
+ if (maxnonsumm == 0)
+ {
+ Assert(!nonsumm);
+ maxnonsumm = 8;
+ nonsumm = palloc(sizeof(BlockNumber) * maxnonsumm);
+ }
+ else if (numnonsumm >= maxnonsumm)
+ {
+ maxnonsumm *= 2;
+ nonsumm = repalloc(nonsumm, sizeof(BlockNumber) * maxnonsumm);
+ }
+
+ nonsumm[numnonsumm++] = heapBlk;
+ }
+
+ continue;
+ }
+ else
+ UnlockTuple(idxRel, &itupptr, ShareLock);
+
+ hitem = (DeletableTuple *) hash_search(tuples,
+ &itupptr,
+ HASH_FIND,
+ &found);
+ /*
+ * If the item is not in the hash, it must have been inserted after the
+ * index was scanned, and therefore we should leave things well alone.
+ * (There might be a leftover entry, but it's okay because next vacuum
+ * will remove it.)
+ */
+ if (!found)
+ continue;
+
+ hitem->referenced = true;
+
+ /* discount items set as referenced */
+ numitems--;
+ }
+ Assert(numitems >= 0);
+
+ mmRevmapAccessTerminate(rmAccess);
+
+ /*
+ * Now scan the hash, and keep track of the removable (i.e. not referenced,
+ * not locked) tuples.
+ */
+ deletable = palloc(sizeof(ItemPointerData) * numitems);
+
+ hash_freeze(tuples);
+ hash_seq_init(&status, tuples);
+ for (;;)
+ {
+ DeletableTuple *hitem;
+
+ hitem = hash_seq_search(&status);
+ if (!hitem)
+ break;
+ if (hitem->referenced)
+ continue;
+ if (!ConditionalLockTuple(idxRel, &hitem->tid, ExclusiveLock))
+ continue;
+
+ /*
+ * By here, we know this tuple is not referenced from the revmap.
+ * Also, since we hold the tuple lock, we know that if there is a
+ * concurrent scan that had obtained the tuple before the reference
+ * got removed, either that scan is not looking at the tuple (because
+ * that would have prevented us from getting the tuple lock) or it is
+ * holding the containing buffer's lock. If the former, then there's
+ * no problem with removing the tuple immediately; if the latter, we
+ * will block below trying to acquire that lock, so by the time we are
+ * unblocked, the concurrent scan will no longer be interested in the
+ * tuple contents anymore. Therefore, this tuple can be removed from
+ * the block.
+ */
+ UnlockTuple(idxRel, &hitem->tid, ExclusiveLock);
+
+ deletable[numdeletable++] = hitem->tid;
+ }
+
+ /*
+ * Now sort the array of deletable index tuples, and walk this array by
+ * pages doing bulk deletion of items on each page; the free space map is
+ 	 * updated for pages on which we delete items.
+ */
+ qsort(deletable, numdeletable, sizeof(ItemPointerData),
+ qsortCompareItemPointers);
+
+ for (start = 0, i = 0; i < numdeletable; i++)
+ {
+ /*
+ * Are we at the end of the items that together belong in one
+ * particular page? If so, then it's deletion time.
+ */
+ if (i == numdeletable - 1 ||
+ (ItemPointerGetBlockNumber(&deletable[start]) !=
+ ItemPointerGetBlockNumber(&deletable[i + 1])))
+ {
+ OffsetNumber *offnos;
+ int noffs;
+ Buffer buf;
+ Page page;
+ int j;
+ BlockNumber blk;
+ int freespace;
+
+ vacuum_delay_point();
+
+ blk = ItemPointerGetBlockNumber(&deletable[start]);
+ buf = ReadBufferExtended(idxRel, MAIN_FORKNUM, blk,
+ RBM_NORMAL, strategy);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ page = BufferGetPage(buf);
+
+ noffs = i + 1 - start;
+ offnos = palloc(sizeof(OffsetNumber) * noffs);
+
+ for (j = 0; j < noffs; j++)
+ offnos[j] = ItemPointerGetOffsetNumber(&deletable[start + j]);
+
+ /*
+ * Now defragment the page.
+ */
+ START_CRIT_SECTION();
+
+ PageIndexDeleteNoCompact(page, offnos, noffs);
+ MarkBufferDirty(buf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(idxRel))
+ {
+ xl_minmax_bulkremove xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+
+ xlrec.node = idxRel->rd_node;
+ xlrec.block = blk;
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxBulkRemove;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].buffer_std = false;
+ rdata[0].next = &(rdata[1]);
+
+ /*
+ * The OffsetNumber array is not actually in the buffer, but we
+ * pretend that it is. When XLogInsert stores the whole
+ * buffer, the offset array need not be stored too.
+ */
+ rdata[1].data = (char *) offnos;
+ rdata[1].len = sizeof(OffsetNumber) * noffs;
+ rdata[1].buffer = buf;
+ rdata[1].buffer_std = true;
+ rdata[1].next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_BULKREMOVE,
+ rdata);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /* next iteration starts where this one ended */
+ start = i + 1;
+
+ /* remember free space while we have the buffer locked */
+ freespace = PageGetFreeSpace(page);
+
+ UnlockReleaseBuffer(buf);
+ pfree(offnos);
+
+ RecordPageWithFreeSpace(idxRel, blk, freespace);
+ }
+ }
+
+ pfree(deletable);
+
+ /* Finally, ensure the index' FSM is consistent */
+ FreeSpaceMapVacuum(idxRel);
+
+ *nonsummed = nonsumm;
+ *numnonsummed = numnonsumm;
+
+ hash_destroy(tuples);
+ }
+
+ /*
+ * Summarize the given page ranges of the given index.
+ */
+ static void
+ rerun_summarization(Relation idxRel, Relation heapRel,
+ mmRevmapAccess *rmAccess, BlockNumber pagesPerRange,
+ BlockNumber *nonsummarized, int numnonsummarized)
+ {
+ int i;
+ IndexInfo *indexInfo;
+ MMBuildState *mmstate;
+
+ indexInfo = BuildIndexInfo(idxRel);
+
+ mmstate = initialize_mm_buildstate(idxRel, rmAccess, pagesPerRange);
+
+ for (i = 0; i < numnonsummarized; i++)
+ {
+ BlockNumber blk = nonsummarized[i];
+ ItemPointerData iptr;
+
+ mmstate->currRangeStart = blk;
+
+ mmGetHeapBlockItemptr(rmAccess, blk, &iptr);
+ /* it can't have been re-summarized concurrently .. */
+ Assert(!ItemPointerIsValid(&iptr));
+
+ /*
+ * Execute the partial heap scan covering the heap blocks in the
+ * specified page range, summarizing the heap tuples in it. This scan
+ * stops just short of mmbuildCallback creating the new index entry.
+ */
+ IndexBuildHeapRangeScan(heapRel, idxRel, indexInfo, false,
+ blk, pagesPerRange,
+ mmbuildCallback, (void *) mmstate);
+
+ /*
+ * Create the index tuple and insert it. Note mmbuildCallback didn't
+ * have the chance to actually insert anything into the index, because
+ * the heapscan should have ended just as it reached the final tuple in
+ * the range.
+ */
+ form_and_insert_tuple(mmstate);
+
+ /* and re-initialize state for the next range */
+ minmax_dtuple_initialize(mmstate->dtuple, mmstate->mmDesc);
+ }
+
+ if (!BufferIsInvalid(mmstate->currentInsertBuf))
+ {
+ ReleaseBuffer(mmstate->currentInsertBuf);
+ mmstate->currentInsertBuf = InvalidBuffer;
+ }
+ }
+
+ /*
+ * Insert an index tuple into the index relation. The revmap is updated to
+ * mark the range containing the given page as pointing to the inserted entry.
+ * A WAL record is written.
+ *
+ * The buffer, if valid, is checked for free space to insert the new entry;
+ * if there isn't enough, a new buffer is obtained and pinned.
+ *
+ * The buffer is marked dirty.
+ */
+ static void
+ mm_doinsert(Relation idxrel, mmRevmapAccess *rmAccess, Buffer *buffer,
+ BlockNumber heapblkno, MMTuple *tup, Size itemsz)
+ {
+ Page page;
+ BlockNumber blk;
+ OffsetNumber off;
+ bool extended;
+
+ itemsz = MAXALIGN(itemsz);
+
+ /*
+ * Obtain a locked buffer to insert the new tuple. Note mm_getinsertbuffer
+ * ensures there's enough space in the returned buffer and should have
+ * thrown a user-facing error message if there isn't, so at this point it's
+ * a program error if that happens.
+ */
+ extended = mm_getinsertbuffer(idxrel, buffer, itemsz);
+ page = BufferGetPage(*buffer);
+ if (PageGetFreeSpace(page) < itemsz)
+ elog(ERROR, "index row size %lu exceeds maximum for index \"%s\"",
+ itemsz, RelationGetRelationName(idxrel));
+
+ START_CRIT_SECTION();
+ off = PageAddItem(page, (Item) tup, itemsz, InvalidOffsetNumber,
+ false, false);
+ MarkBufferDirty(*buffer);
+
+ blk = BufferGetBlockNumber(*buffer);
+ MINMAX_elog(DEBUG2, "inserted tuple (%u,%u) for range starting at %u",
+ blk, off, heapblkno);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(idxrel))
+ {
+ xl_minmax_insert xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+ uint8 info = XLOG_MINMAX_INSERT;
+
+ xlrec.target.node = idxrel->rd_node;
+ ItemPointerSet(&xlrec.target.tid, blk, off);
+ xlrec.overwrite = false;
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxInsert;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].buffer_std = false;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = (char *) tup;
+ rdata[1].len = itemsz;
+ rdata[1].buffer = *buffer;
+ rdata[1].buffer_std = true;
+ rdata[1].next = NULL;
+
+ /*
+ * If this is the first tuple in the page, we can reinit the page
+ * instead of restoring the whole thing. Set flag, and hide buffer
+ * references from XLogInsert.
+ */
+ if (extended)
+ {
+ info |= XLOG_MINMAX_INIT_PAGE;
+ rdata[1].buffer = InvalidBuffer;
+ }
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /* Tuple is firmly on buffer; we can release our lock and update revmap */
+ LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+ mmSetHeapBlockItemptr(rmAccess, heapblkno, blk, off);
+ }
+
+ /*
+ * Return a pinned and locked buffer which can be used to insert an index item
+ * of size itemsz.
+ *
+ * The passed buffer argument is tested for free space; if it has enough, it is
+ * locked and returned. Otherwise, that buffer (if valid) is unpinned, a new
+ * buffer is obtained, and returned pinned and locked.
+ *
+  * If there's no existing page with enough free space to accommodate the new item,
+ * the relation is extended. This function returns true if this happens, false
+ * otherwise.
+ */
+ static bool
+ mm_getinsertbuffer(Relation irel, Buffer *buffer, Size itemsz)
+ {
+ Buffer buf;
+ Page page;
+ bool extended = false;
+
+ gib_restart:
+ buf = *buffer;
+
+ if (BufferIsInvalid(buf) ||
+ (PageGetFreeSpace(BufferGetPage(buf)) < itemsz))
+ {
+ /*
+ * By the time we break out of this loop, buf is a locked and pinned
+ * buffer. It was tested for free space, but in some cases only before
+ * locking it, so a recheck is necessary because a concurrent inserter
+ * might have put items in it.
+ */
+ for (;;)
+ {
+ BlockNumber blk;
+ int freespace;
+
+ blk = GetPageWithFreeSpace(irel, itemsz);
+ if (blk == InvalidBlockNumber)
+ {
+ /*
+ * There's not enough free space in any existing index page,
+ * according to the FSM: extend the relation to obtain a shiny
+ * new page.
+ */
+ buf = mm_getnewbuffer(irel);
+ page = BufferGetPage(buf);
+ mm_page_init(page, MINMAX_PAGETYPE_REGULAR);
+
+ /*
+ * If an entirely new page does not contain enough free space
+ * for the new item, then surely that item is oversized.
+ * Complain loudly; but first make sure we record the page as
+ * free, for next time.
+ */
+ freespace = PageGetFreeSpace(page);
+ RecordPageWithFreeSpace(irel, BufferGetBlockNumber(buf),
+ freespace);
+ if (freespace < itemsz)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("index row size %lu exceeds maximum %lu for index \"%s\"",
+ (unsigned long) itemsz,
+ (unsigned long) freespace,
+ RelationGetRelationName(irel))));
+ extended = true;
+ break;
+ }
+
+ /*
+ * We have a block number from FSM now. Check that it has enough
+ * free space, and break out to return it if it does; otherwise
+ * start over. Note that we allow for the FSM to be out of date
+ * here, and in that case we update it and move on.
+ */
+ Assert(blk != InvalidBlockNumber);
+ buf = ReadBuffer(irel, blk);
+ page = BufferGetPage(buf);
+ freespace = PageGetFreeSpace(page);
+ if (freespace >= itemsz)
+ {
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ break;
+ }
+
+ /* Not really enough space: register reality and start over */
+ ReleaseBuffer(buf);
+ RecordPageWithFreeSpace(irel, blk, freespace);
+ }
+
+ if (!BufferIsInvalid(*buffer))
+ ReleaseBuffer(*buffer);
+ *buffer = buf;
+ }
+ else
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ /*
+ * Now recheck free space with exclusive lock held, and start over if it's
+ * not enough.
+ */
+ Assert(!BufferIsInvalid(*buffer));
+ page = BufferGetPage(*buffer);
+ if (PageGetFreeSpace(page) < itemsz)
+ {
+ UnlockReleaseBuffer(*buffer);
+ *buffer = InvalidBuffer;
+ goto gib_restart;
+ }
+
+ /*
+ * In the case where we extended the relation, make sure we make the FSM
+ * aware of this new page. This is so that other processes can make use of
+ * this new page right away.
+ */
+ if (extended)
+ FreeSpaceMapVacuum(irel);
+
+ return extended;
+ }
+
+ /*
+ * Given a deformed tuple in the build state, convert it into the on-disk
+ * format and insert it into the index, making the revmap point to it.
+ */
+ static void
+ form_and_insert_tuple(MMBuildState *mmstate)
+ {
+ MMTuple *tup;
+ Size size;
+
+ /* if this dtuple didn't see any heap tuple at all, don't insert it */
+ if (!mmstate->dtuple->dt_seentup)
+ return;
+
+ tup = minmax_form_tuple(mmstate->mmDesc, mmstate->dtuple, &size);
+ mm_doinsert(mmstate->irel, mmstate->rmAccess,
+ &mmstate->currentInsertBuf, mmstate->currRangeStart, tup,
+ size);
+ mmstate->numtuples++;
+ pfree(tup);
+ }
+
+ /*
+ * qsort comparator for ItemPointerData items
+ */
+ static int
+ qsortCompareItemPointers(const void *a, const void *b)
+ {
+ return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
+ }
*** /dev/null
--- b/src/backend/access/minmax/mmrevmap.c
***************
*** 0 ****
--- 1,683 ----
+ /*
+ * mmrevmap.c
+ * Reverse range map for MinMax indexes
+ *
+ * The reverse range map (revmap) is a translation structure for minmax
+ * indexes: for each page range, there is one most-up-to-date summary tuple,
+ * and its location is tracked by the revmap. Whenever a new tuple is inserted
+ * into a table that violates the previously recorded min/max values, a new
+ * tuple is inserted into the index and the revmap is updated to point to it.
+ *
+ * The pages of the revmap are interspersed in the index's main fork. The
+ * first revmap page is always the index's page number one (that is,
+ * immediately after the metapage). Subsequent revmap pages are allocated as
+ * they are needed; their locations are tracked by "array pages". The metapage
+  * contains a large BlockNumber array whose entries point to the array pages. Thus,
+ * to find the second revmap page, we read the metapage and obtain the block
+ * number of the first array page; we then read that page, and the first
+ * element in it is the revmap page we're looking for.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmrevmap.c
+ */
+ #include "postgres.h"
+
+ #include "access/heapam_xlog.h"
+ #include "access/minmax.h"
+ #include "access/minmax_internal.h"
+ #include "access/minmax_page.h"
+ #include "access/minmax_revmap.h"
+ #include "access/minmax_xlog.h"
+ #include "access/rmgr.h"
+ #include "miscadmin.h"
+ #include "storage/bufmgr.h"
+ #include "storage/lmgr.h"
+ #include "storage/relfilenode.h"
+ #include "storage/smgr.h"
+ #include "utils/memutils.h"
+
+
+
+ /*
+ * In regular revmap pages, each item stores an ItemPointerData. These defines
+ * let one find the logical revmap page number and index number of the revmap
+ * item for the given heap block number.
+ */
+ #define HEAPBLK_TO_REVMAP_BLK(pagesPerRange, heapBlk) \
+ ((heapBlk / pagesPerRange) / REGULAR_REVMAP_PAGE_MAXITEMS)
+ #define HEAPBLK_TO_REVMAP_INDEX(pagesPerRange, heapBlk) \
+ ((heapBlk / pagesPerRange) % REGULAR_REVMAP_PAGE_MAXITEMS)
+
+ /*
+ * In array revmap pages, each item stores a BlockNumber. These defines let
+ * one find the page and index number of a given revmap block number. Note
+ * that the first revmap page (revmap logical page number 0) is always stored
+ * in physical block number 1, so array pages do not store that one.
+ */
+ #define MAPBLK_TO_RMARRAY_BLK(rmBlk) ((rmBlk - 1) / ARRAY_REVMAP_PAGE_MAXITEMS)
+ #define MAPBLK_TO_RMARRAY_INDEX(rmBlk) ((rmBlk - 1) % ARRAY_REVMAP_PAGE_MAXITEMS)
+
+
+ struct mmRevmapAccess
+ {
+ Relation idxrel;
+ BlockNumber pagesPerRange;
+ Buffer metaBuf;
+ Buffer currBuf;
+ Buffer currArrayBuf;
+ BlockNumber *revmapArrayPages;
+ };
+ /* typedef appears in minmax_revmap.h */
+
+
+ /*
+ * Initialize an access object for a reverse range map, which can be used to
+ * read stuff from it. This must be freed by mmRevmapAccessTerminate when caller
+ * is done with it.
+ */
+ mmRevmapAccess *
+ mmRevmapAccessInit(Relation idxrel, BlockNumber *pagesPerRange)
+ {
+ mmRevmapAccess *rmAccess;
+ Buffer meta;
+ MinmaxMetaPageData *metadata;
+
+ meta = ReadBuffer(idxrel, MINMAX_METAPAGE_BLKNO);
+ metadata = (MinmaxMetaPageData *) PageGetContents(BufferGetPage(meta));
+
+ rmAccess = palloc(sizeof(mmRevmapAccess));
+ rmAccess->metaBuf = meta;
+ rmAccess->idxrel = idxrel;
+ rmAccess->pagesPerRange = metadata->pagesPerRange;
+ rmAccess->currBuf = InvalidBuffer;
+ rmAccess->currArrayBuf = InvalidBuffer;
+ rmAccess->revmapArrayPages = NULL;
+
+ if (pagesPerRange)
+ *pagesPerRange = metadata->pagesPerRange;
+
+ return rmAccess;
+ }
+
+ /*
+ * Release resources associated with a revmap access object.
+ */
+ void
+ mmRevmapAccessTerminate(mmRevmapAccess *rmAccess)
+ {
+ if (rmAccess->revmapArrayPages != NULL)
+ pfree(rmAccess->revmapArrayPages);
+ if (rmAccess->metaBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->metaBuf);
+ if (rmAccess->currBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currBuf);
+ if (rmAccess->currArrayBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currArrayBuf);
+ pfree(rmAccess);
+ }
+
+ /*
+ * In the given revmap page, which is used in a minmax index of pagesPerRange
+ * pages per range, set the element corresponding to heap block number heapBlk
+ * to the value (blkno, offno).
+ *
+ * Caller must have obtained the correct revmap page.
+ *
+ * This is used both in regular operation and during WAL replay.
+ */
+ void
+ rm_page_set_iptr(Page page, BlockNumber pagesPerRange, BlockNumber heapBlk,
+ BlockNumber blkno, OffsetNumber offno)
+ {
+ RevmapContents *contents;
+ ItemPointerData *iptr;
+
+ contents = (RevmapContents *) PageGetContents(page);
+ iptr = (ItemPointerData *) contents->rmr_tids;
+ iptr += HEAPBLK_TO_REVMAP_INDEX(pagesPerRange, heapBlk);
+
+ ItemPointerSet(iptr, blkno, offno);
+ }
+
+ /*
+ * Initialize a new regular revmap page, which stores the given revmap logical
+ * page number. The newly allocated physical block number is returned.
+ *
+ * Used both by regular code path as well as during xlog replay.
+ */
+ BlockNumber
+ initialize_rmr_page(Buffer newbuf, BlockNumber mapBlk)
+ {
+ BlockNumber blkno;
+ Page page;
+ RevmapContents *contents;
+
+ page = BufferGetPage(newbuf);
+
+ mm_page_init(page, MINMAX_PAGETYPE_REVMAP);
+ contents = (RevmapContents *) PageGetContents(page);
+ contents->rmr_logblk = mapBlk;
+ /* the rmr_tids array is initialized to all invalid by PageInit */
+
+ blkno = BufferGetBlockNumber(newbuf);
+
+ return blkno;
+ }
+
+ /*
+ * Lock the metapage in the mode specified by the caller, and update the given rmAccess with
+ * the metapage data. The metapage buffer is locked when this function
+ * returns; it's the caller's responsibility to unlock it.
+ */
+ static void
+ rmaccess_get_metapage(mmRevmapAccess *rmAccess, int lockmode)
+ {
+ MinmaxMetaPageData *metadata;
+ MinmaxSpecialSpace *special PG_USED_FOR_ASSERTS_ONLY;
+ Page metapage;
+
+ LockBuffer(rmAccess->metaBuf, lockmode);
+ metapage = BufferGetPage(rmAccess->metaBuf);
+
+ #ifdef USE_ASSERT_CHECKING
+ /* ensure we really got the metapage */
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(metapage);
+ Assert(special->type == MINMAX_PAGETYPE_META);
+ #endif
+
+ /* first time through? allocate the array */
+ if (rmAccess->revmapArrayPages == NULL)
+ rmAccess->revmapArrayPages =
+ palloc(sizeof(BlockNumber) * MAX_REVMAP_ARRAYPAGES);
+
+ metadata = (MinmaxMetaPageData *) PageGetContents(metapage);
+ memcpy(rmAccess->revmapArrayPages, metadata->revmapArrayPages,
+ sizeof(BlockNumber) * MAX_REVMAP_ARRAYPAGES);
+ }
+
+ /*
+ * Given a buffer (hopefully containing a blank page), set it up as a revmap
+ * array page.
+ *
+ * Used both in the regular code path and during xlog replay.
+ */
+ void
+ initialize_rma_page(Buffer buf)
+ {
+ Page arrayPg;
+ RevmapArrayContents *contents;
+
+ arrayPg = BufferGetPage(buf);
+ mm_page_init(arrayPg, MINMAX_PAGETYPE_REVMAP_ARRAY);
+ contents = (RevmapArrayContents *) PageGetContents(arrayPg);
+ contents->rma_nblocks = 0;
+ /* set the whole array to InvalidBlockNumber */
+ memset(contents->rma_blocks, 0xFF,
+ sizeof(BlockNumber) * ARRAY_REVMAP_PAGE_MAXITEMS);
+ }
+
+ /*
+ * Update the metapage, so that item arrayBlkIdx in the array of revmap array
+ * pages points to block number newPgBlkno.
+ */
+ static void
+ update_minmax_metapg(Relation idxrel, Buffer meta, uint32 arrayBlkIdx,
+ BlockNumber newPgBlkno)
+ {
+ MinmaxMetaPageData *metadata;
+
+ metadata = (MinmaxMetaPageData *) PageGetContents(BufferGetPage(meta));
+
+ START_CRIT_SECTION();
+ metadata->revmapArrayPages[arrayBlkIdx] = newPgBlkno;
+ MarkBufferDirty(meta);
+ if (RelationNeedsWAL(idxrel))
+ {
+ xl_minmax_metapg_set xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata;
+
+ xlrec.node = idxrel->rd_node;
+ xlrec.blkidx = arrayBlkIdx;
+ xlrec.newpg = newPgBlkno;
+
+ rdata.data = (char *) &xlrec;
+ rdata.len = SizeOfMinmaxMetapgSet;
+ rdata.buffer = InvalidBuffer;
+ rdata.buffer_std = false;
+ rdata.next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_METAPG_SET, &rdata);
+ PageSetLSN(BufferGetPage(meta), recptr);
+ }
+ END_CRIT_SECTION();
+ }
+
+ /*
+ * Given a logical revmap block number, find its physical block number.
+ *
+ * Note this might involve up to two buffer reads, including a possible
+ * update to the metapage.
+ *
+ * If extend is set to true, and the page hasn't been set yet, extend the
+ * array to point to a newly allocated page.
+ */
+ static BlockNumber
+ rm_get_phys_blkno(mmRevmapAccess *rmAccess, BlockNumber mapBlk, bool extend)
+ {
+ int arrayBlkIdx;
+ BlockNumber arrayBlk;
+ RevmapArrayContents *contents;
+ int revmapIdx;
+ BlockNumber targetblk;
+
+ /* the first revmap page is always block number 1 */
+ if (mapBlk == 0)
+ return (BlockNumber) 1;
+
+ /*
+ * For all other cases, take the long route of checking the metapage and
+ * revmap array pages.
+ */
+
+ /*
+ * Copy the revmap array from the metapage into private storage, if not
+ * done already in this scan.
+ */
+ if (rmAccess->revmapArrayPages == NULL)
+ {
+ rmaccess_get_metapage(rmAccess, BUFFER_LOCK_SHARE);
+ LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_UNLOCK);
+ }
+
+ /*
+ * Consult the metapage array; if the array page we need is not set there,
+ * we need to extend the index to allocate the array page, and update the
+ * metapage array.
+ */
+ arrayBlkIdx = MAPBLK_TO_RMARRAY_BLK(mapBlk);
+ if (arrayBlkIdx >= MAX_REVMAP_ARRAYPAGES)
+ elog(ERROR, "non-existent revmap array page requested");
+
+ arrayBlk = rmAccess->revmapArrayPages[arrayBlkIdx];
+ if (arrayBlk == InvalidBlockNumber)
+ {
+ /* if not asked to extend, there's no further work to do here */
+ if (!extend)
+ return InvalidBlockNumber;
+
+ /*
+ * If we need to create a new array page, check the metapage again;
+ * someone might have created it after the last time we read the
+ * metapage. This time we acquire an exclusive lock, since we may need
+ * to extend. Lock before doing the physical relation extension, to
+ * avoid leaving an unused page around in case someone does this
+ * concurrently. Note that, unfortunately, we will be keeping the lock
+ * on the metapage alongside the relation extension lock, while doing a
+ * syscall involving disk I/O. Extending to add a new revmap array page
+ * is fairly infrequent, so it shouldn't be too bad.
+ *
+ * XXX it is possible to extend the relation unconditionally before
+ * locking the metapage, and later if we find that someone else had
+ * already added this page, save the page in FSM as MaxFSMRequestSize.
+ * That would be better for concurrency. Explore someday.
+ */
+ rmaccess_get_metapage(rmAccess, BUFFER_LOCK_EXCLUSIVE);
+
+ if (rmAccess->revmapArrayPages[arrayBlkIdx] == InvalidBlockNumber)
+ {
+ BlockNumber newPgBlkno;
+
+ /*
+ * Ok, definitely need to allocate a new revmap array page;
+ * initialize a new page to the initial (empty) array revmap state
+ * and register it in metapage.
+ */
+ rmAccess->currArrayBuf = mm_getnewbuffer(rmAccess->idxrel);
+ START_CRIT_SECTION();
+ initialize_rma_page(rmAccess->currArrayBuf);
+ MarkBufferDirty(rmAccess->currArrayBuf);
+ if (RelationNeedsWAL(rmAccess->idxrel))
+ {
+ xl_minmax_init_rmpg xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata;
+
+ xlrec.node = rmAccess->idxrel->rd_node;
+ xlrec.blkno = BufferGetBlockNumber(rmAccess->currArrayBuf);
+ xlrec.array = true;
+ xlrec.logblk = InvalidBlockNumber;
+
+ rdata.data = (char *) &xlrec;
+ rdata.len = SizeOfMinmaxInitRmpg;
+ rdata.buffer = InvalidBuffer; /* FIXME */
+ rdata.buffer_std = false;
+ rdata.next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_INIT_RMPG, &rdata);
+ PageSetLSN(BufferGetPage(rmAccess->currArrayBuf), recptr);
+ }
+ END_CRIT_SECTION();
+ LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_UNLOCK);
+ newPgBlkno = BufferGetBlockNumber(rmAccess->currArrayBuf);
+ rmAccess->revmapArrayPages[arrayBlkIdx] = newPgBlkno;
+
+ MINMAX_elog(DEBUG2, "allocated block for revmap array page: %u",
+ BufferGetBlockNumber(rmAccess->currArrayBuf));
+
+ /* Update the metapage to point to the new array page. */
+ update_minmax_metapg(rmAccess->idxrel, rmAccess->metaBuf, arrayBlkIdx,
+ newPgBlkno);
+ }
+
+ LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_UNLOCK);
+ arrayBlk = rmAccess->revmapArrayPages[arrayBlkIdx];
+ }
+
+ /*
+ * By here, we know the array page is set in the metapage array. Read that
+ * page, unless we just allocated it or already hold a pin on it, in which
+ * case there's no need to read it again. XXX but we didn't hold lock!
+ */
+ Assert(arrayBlk != InvalidBlockNumber);
+
+ if (rmAccess->currArrayBuf == InvalidBuffer ||
+ BufferGetBlockNumber(rmAccess->currArrayBuf) != arrayBlk)
+ {
+ if (rmAccess->currArrayBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currArrayBuf);
+
+ rmAccess->currArrayBuf =
+ ReadBuffer(rmAccess->idxrel, arrayBlk);
+ }
+
+ LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_SHARE);
+
+ /*
+ * And now we can inspect its contents; if the target page is set, we can
+ * just return. Even if not set, we can also return if caller asked us not
+ * to extend the revmap.
+ */
+ contents = (RevmapArrayContents *)
+ PageGetContents(BufferGetPage(rmAccess->currArrayBuf));
+ revmapIdx = MAPBLK_TO_RMARRAY_INDEX(mapBlk);
+ if (!extend || revmapIdx <= contents->rma_nblocks - 1)
+ {
+ LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_UNLOCK);
+
+ return contents->rma_blocks[revmapIdx];
+ }
+
+ /*
+ * Trade our shared lock on the array page for an exclusive one, because we now
+ * need to allocate one more revmap page and modify the array page.
+ */
+ LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_UNLOCK);
+ LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_EXCLUSIVE);
+
+ contents = (RevmapArrayContents *)
+ PageGetContents(BufferGetPage(rmAccess->currArrayBuf));
+
+ /*
+ * If someone else already set the value while we were waiting for the
+ * exclusive lock, we're done; otherwise, allocate a new block as the
+ * new revmap page, and update the array page to point to it.
+ *
+ * FIXME -- what if we were asked not to extend?
+ */
+ if (contents->rma_blocks[revmapIdx] != InvalidBlockNumber)
+ {
+ targetblk = contents->rma_blocks[revmapIdx];
+ }
+ else
+ {
+ Buffer newbuf;
+
+ newbuf = mm_getnewbuffer(rmAccess->idxrel);
+ START_CRIT_SECTION();
+ targetblk = initialize_rmr_page(newbuf, mapBlk);
+ MarkBufferDirty(newbuf);
+ if (RelationNeedsWAL(rmAccess->idxrel))
+ {
+ xl_minmax_init_rmpg xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata;
+
+ xlrec.node = rmAccess->idxrel->rd_node;
+ xlrec.blkno = BufferGetBlockNumber(newbuf);
+ xlrec.array = false;
+ xlrec.logblk = mapBlk;
+
+ rdata.data = (char *) &xlrec;
+ rdata.len = SizeOfMinmaxInitRmpg;
+ rdata.buffer = InvalidBuffer;
+ rdata.buffer_std = false;
+ rdata.next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_INIT_RMPG, &rdata);
+ PageSetLSN(BufferGetPage(newbuf), recptr);
+ }
+ END_CRIT_SECTION();
+
+ UnlockReleaseBuffer(newbuf);
+
+ /*
+ * Modify the revmap array page to point to the newly allocated revmap
+ * page.
+ */
+ START_CRIT_SECTION();
+
+ contents->rma_blocks[revmapIdx] = targetblk;
+ /*
+ * XXX this rma_nblocks assignment should probably be conditional on the
+ * current rma_blocks value.
+ */
+ contents->rma_nblocks = revmapIdx + 1;
+ MarkBufferDirty(rmAccess->currArrayBuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rmAccess->idxrel))
+ {
+ xl_minmax_rmarray_set xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+ uint8 info;
+
+ info = XLOG_MINMAX_RMARRAY_SET;
+
+ xlrec.node = rmAccess->idxrel->rd_node;
+ xlrec.rmarray = BufferGetBlockNumber(rmAccess->currArrayBuf);
+ xlrec.blkidx = revmapIdx;
+ xlrec.newpg = targetblk;
+
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxRmarraySet;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].buffer_std = false;
+ rdata[0].next = &rdata[1];
+
+ rdata[1].data = NULL;
+ rdata[1].len = 0;
+ rdata[1].buffer = rmAccess->currArrayBuf;
+ rdata[1].buffer_std = false;
+ rdata[1].next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+ PageSetLSN(BufferGetPage(rmAccess->currArrayBuf), recptr);
+ }
+
+ END_CRIT_SECTION();
+ }
+
+ LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_UNLOCK);
+
+ return targetblk;
+ }
+
+ /*
+ * Set the TID of the index entry corresponding to the range that includes
+ * the given heap page to the given item pointer.
+ *
+ * The map is extended, if necessary.
+ */
+ void
+ mmSetHeapBlockItemptr(mmRevmapAccess *rmAccess, BlockNumber heapBlk,
+ BlockNumber blkno, OffsetNumber offno)
+ {
+ BlockNumber mapBlk;
+ bool extend = false;
+
+ mapBlk = HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk);
+
+ /* Translate the map block number to physical location */
+ mapBlk = rm_get_phys_blkno(rmAccess, mapBlk, true);
+
+ MINMAX_elog(DEBUG2, "setting %u/%u in logical page %u (physical %u) for heap %u",
+ blkno, offno,
+ HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk),
+ mapBlk, heapBlk);
+
+ /*
+ * Obtain the buffer we need to modify. If we already have the correct
+ * buffer pinned in our access struct, use it; otherwise release the old
+ * one (if valid) and read the one we need.
+ */
+ if (rmAccess->currBuf == InvalidBuffer ||
+ mapBlk != BufferGetBlockNumber(rmAccess->currBuf))
+ {
+ if (rmAccess->currBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currBuf);
+
+ Assert(mapBlk != InvalidBlockNumber);
+ rmAccess->currBuf = ReadBuffer(rmAccess->idxrel, mapBlk);
+ }
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_EXCLUSIVE);
+ START_CRIT_SECTION();
+
+ rm_page_set_iptr(BufferGetPage(rmAccess->currBuf),
+ rmAccess->pagesPerRange,
+ heapBlk,
+ blkno, offno);
+
+ MarkBufferDirty(rmAccess->currBuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rmAccess->idxrel))
+ {
+ xl_minmax_rm_set xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+ uint8 info;
+
+ info = XLOG_MINMAX_REVMAP_SET;
+
+ xlrec.node = rmAccess->idxrel->rd_node;
+ xlrec.mapBlock = mapBlk;
+ xlrec.pagesPerRange = rmAccess->pagesPerRange;
+ xlrec.heapBlock = heapBlk;
+ ItemPointerSet(&(xlrec.newval), blkno, offno);
+
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxRevmapSet;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].buffer_std = false;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = NULL;
+ rdata[1].len = 0;
+ rdata[1].buffer = rmAccess->currBuf;
+ rdata[1].buffer_std = false;
+ rdata[1].next = NULL;
+
+ if (extend)
+ {
+ info |= XLOG_MINMAX_INIT_PAGE;
+ /* If the page is new, there's no need for a full page image */
+ rdata[0].next = NULL;
+ }
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+ PageSetLSN(BufferGetPage(rmAccess->currBuf), recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_UNLOCK);
+ }
+
+
+ /*
+ * Return the TID of the index entry corresponding to the range that includes
+ * the given heap page. If the TID is valid, the tuple is locked with
+ * LockTuple. It is the caller's responsibility to release that lock.
+ */
+ void
+ mmGetHeapBlockItemptr(mmRevmapAccess *rmAccess, BlockNumber heapBlk,
+ ItemPointerData *out)
+ {
+ BlockNumber mapBlk;
+ RevmapContents *contents;
+ ItemPointerData *iptr;
+
+ mapBlk = HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk);
+ /* Translate the map block number to physical location */
+ mapBlk = rm_get_phys_blkno(rmAccess, mapBlk, false);
+ if (mapBlk == InvalidBlockNumber)
+ {
+ ItemPointerSetInvalid(out);
+ return;
+ }
+
+ if (rmAccess->currBuf == InvalidBuffer ||
+ BufferGetBlockNumber(rmAccess->currBuf) != mapBlk)
+ {
+ if (rmAccess->currBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currBuf);
+
+ Assert(mapBlk != InvalidBlockNumber);
+ rmAccess->currBuf = ReadBuffer(rmAccess->idxrel, mapBlk);
+ }
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_SHARE);
+
+ contents = (RevmapContents *)
+ PageGetContents(BufferGetPage(rmAccess->currBuf));
+ iptr = contents->rmr_tids;
+ iptr += HEAPBLK_TO_REVMAP_INDEX(rmAccess->pagesPerRange, heapBlk);
+
+ ItemPointerCopy(iptr, out);
+
+ if (ItemPointerIsValid(iptr))
+ LockTuple(rmAccess->idxrel, iptr, ShareLock);
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_UNLOCK);
+ }
+
+ /*
+ * Initialize the revmap of a new minmax index.
+ *
+ * NB -- caller is assumed to WAL-log this operation
+ */
+ void
+ mmRevmapCreate(Relation idxrel)
+ {
+ Buffer buf;
+
+ /*
+ * The first page of the revmap is always stored in block number 1 of the
+ * main fork. Because of this, the only thing we need to do is request
+ * a new page; we assume we are called immediately after the metapage has
+ * been initialized.
+ */
+ buf = mm_getnewbuffer(idxrel);
+ Assert(BufferGetBlockNumber(buf) == 1);
+
+ mm_page_init(BufferGetPage(buf), MINMAX_PAGETYPE_REVMAP);
+ MarkBufferDirty(buf);
+
+ UnlockReleaseBuffer(buf);
+ }
*** /dev/null
--- b/src/backend/access/minmax/mmsortable.c
***************
*** 0 ****
--- 1,280 ----
+ /*
+ * mmsortable.c
+ * Implementation of Minmax indexes for sortable datatypes
+ * (that is, anything with a btree opclass)
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmsortable.c
+ */
+ #include "postgres.h"
+
+ #include "access/genam.h"
+ #include "access/minmax_internal.h"
+ #include "access/minmax_tuple.h"
+ #include "access/skey.h"
+ #include "utils/datum.h"
+ #include "utils/lsyscache.h"
+ #include "utils/syscache.h"
+
+
+ /*
+ * Procedure numbers must not collide with MINMAX_PROCNUM defines in
+ * minmax_internal.h. Note we only need inequality functions.
+ */
+ #define SORTABLE_NUM_PROCNUMS 4 /* # support procs we need */
+ #define PROCNUM_LESS 4
+ #define PROCNUM_LESSEQUAL 5
+ #define PROCNUM_GREATEREQUAL 6
+ #define PROCNUM_GREATER 7
+
+ /* subtract this from procnum to obtain index in SortableOpaque arrays */
+ #define PROCNUM_BASE 4
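+ /*
+ * For instance, PROCNUM_LESS (4) maps to slot 0 of the arrays below, and
+ * PROCNUM_GREATER (7) maps to slot 3.
+ */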
+
+ static FmgrInfo *mmsrt_get_procinfo(MinmaxDesc *mmdesc, uint16 attno,
+ uint16 procnum);
+
+ PG_FUNCTION_INFO_V1(mmSortableOpcInfo);
+ PG_FUNCTION_INFO_V1(mmSortableAddValue);
+ PG_FUNCTION_INFO_V1(mmSortableConsistent);
+
+ Datum mmSortableOpcInfo(PG_FUNCTION_ARGS);
+ Datum mmSortableAddValue(PG_FUNCTION_ARGS);
+ Datum mmSortableConsistent(PG_FUNCTION_ARGS);
+
+ typedef struct SortableOpaque
+ {
+ FmgrInfo operators[SORTABLE_NUM_PROCNUMS];
+ bool inited[SORTABLE_NUM_PROCNUMS];
+ } SortableOpaque;
+
+ /*
+ * Return opclass information for a minmax index on a sortable datatype, as a
+ * pointer to a newly palloc'ed MinmaxOpcInfo. We store two values per indexed
+ * column (the minimum and the maximum); the comparison support functions are
+ * looked up lazily, on first use.
+ */
+ Datum
+ mmSortableOpcInfo(PG_FUNCTION_ARGS)
+ {
+ SortableOpaque *opaque;
+ MinmaxOpcInfo *result;
+
+ opaque = palloc0(sizeof(SortableOpaque));
+ /*
+ * 'operators' is initialized lazily, as indicated by 'inited' which was
+ * initialized to all false by palloc0.
+ */
+
+ result = palloc(sizeof(MinmaxOpcInfo));
+ result->oi_nstored = 2; /* min, max */
+ result->oi_opaque = opaque;
+
+ PG_RETURN_POINTER(result);
+ }
+
+ /*
+ * Examine the given index tuple (which contains the partial summary of a
+ * certain page range) and compare it to the given value coming from a heap
+ * tuple. If the new value is outside the range described by the existing
+ * tuple values, update the index tuple and return true. Otherwise return
+ * false and leave the tuple unmodified.
+ */
+ Datum
+ mmSortableAddValue(PG_FUNCTION_ARGS)
+ {
+ MinmaxDesc *mmdesc = (MinmaxDesc *) PG_GETARG_POINTER(0);
+ DeformedMMTuple *dtuple = (DeformedMMTuple *) PG_GETARG_POINTER(1);
+ AttrNumber attno = PG_GETARG_UINT16(2);
+ Datum newval = PG_GETARG_DATUM(3);
+ bool isnull = PG_GETARG_BOOL(4);
+ Oid colloid = PG_GET_COLLATION();
+ FmgrInfo *cmpFn;
+ Datum compar;
+ bool updated = false;
+
+ /*
+ * If the new value is null, we record that we saw it if it's the first
+ * one; otherwise, there's nothing to do.
+ */
+ if (isnull)
+ {
+ if (dtuple->dt_columns[attno - 1].hasnulls)
+ PG_RETURN_BOOL(false);
+
+ dtuple->dt_columns[attno - 1].hasnulls = true;
+ PG_RETURN_BOOL(true);
+ }
+
+ /*
+ * If the recorded value is null, store the new value (which we know to be
+ * not null) as both minimum and maximum, and we're done.
+ */
+ if (dtuple->dt_columns[attno - 1].allnulls)
+ {
+ dtuple->dt_columns[attno - 1].values[0] =
+ datumCopy(newval, mmdesc->md_tupdesc->attrs[attno - 1]->attbyval,
+ mmdesc->md_tupdesc->attrs[attno - 1]->attlen);
+ dtuple->dt_columns[attno - 1].values[1] =
+ datumCopy(newval, mmdesc->md_tupdesc->attrs[attno - 1]->attbyval,
+ mmdesc->md_tupdesc->attrs[attno - 1]->attlen);
+ dtuple->dt_columns[attno - 1].allnulls = false;
+ PG_RETURN_BOOL(true);
+ }
+
+ /*
+ * Otherwise, need to compare the new value with the existing boundaries
+ * and update them accordingly. First check if it's less than the existing
+ * minimum.
+ */
+ cmpFn = mmsrt_get_procinfo(mmdesc, attno, PROCNUM_LESS);
+ compar = FunctionCall2Coll(cmpFn, colloid, newval,
+ dtuple->dt_columns[attno - 1].values[0]);
+ if (DatumGetBool(compar))
+ {
+ dtuple->dt_columns[attno - 1].values[0] =
+ datumCopy(newval, mmdesc->md_tupdesc->attrs[attno - 1]->attbyval,
+ mmdesc->md_tupdesc->attrs[attno - 1]->attlen);
+ updated = true;
+ }
+
+ /*
+ * And now compare it to the existing maximum.
+ */
+ cmpFn = mmsrt_get_procinfo(mmdesc, attno, PROCNUM_GREATER);
+ compar = FunctionCall2Coll(cmpFn, colloid, newval,
+ dtuple->dt_columns[attno - 1].values[1]);
+ if (DatumGetBool(compar))
+ {
+ dtuple->dt_columns[attno - 1].values[1] =
+ datumCopy(newval, mmdesc->md_tupdesc->attrs[attno - 1]->attbyval,
+ mmdesc->md_tupdesc->attrs[attno - 1]->attlen);
+ updated = true;
+ }
+
+ PG_RETURN_BOOL(updated);
+ }
+
+ /*
+ * Given an index tuple corresponding to a certain page range and a scan key,
+ * return whether the scan key is consistent with the index tuple, that is,
+ * whether the page range might contain tuples matching the key.
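+ *
+ * For example, given the scan key "col < 10", a range whose stored minimum
+ * is 20 cannot contain matching rows, so we return false and the scan can
+ * skip that page range entirely.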
+ */
+ Datum
+ mmSortableConsistent(PG_FUNCTION_ARGS)
+ {
+ MinmaxDesc *mmdesc = (MinmaxDesc *) PG_GETARG_POINTER(0);
+ DeformedMMTuple *dtup = (DeformedMMTuple *) PG_GETARG_POINTER(1);
+ ScanKey key = (ScanKey) PG_GETARG_POINTER(2);
+ Oid colloid = PG_GET_COLLATION();
+ AttrNumber attno = key->sk_attno;
+ Datum value;
+ Datum matches;
+
+ /* handle IS NULL/IS NOT NULL tests */
+ if (key->sk_flags & SK_ISNULL)
+ {
+ if (key->sk_flags & SK_SEARCHNULL)
+ {
+ if (dtup->dt_columns[attno - 1].allnulls ||
+ dtup->dt_columns[attno - 1].hasnulls)
+ PG_RETURN_BOOL(true);
+ PG_RETURN_BOOL(false);
+ }
+
+ /*
+ * For IS NOT NULL we can only exclude blocks if all values are nulls.
+ */
+ Assert(key->sk_flags & SK_SEARCHNOTNULL);
+ if (dtup->dt_columns[attno - 1].allnulls)
+ PG_RETURN_BOOL(false);
+ PG_RETURN_BOOL(true);
+ }
+
+ value = key->sk_argument;
+ switch (key->sk_strategy)
+ {
+ case BTLessStrategyNumber:
+ matches = FunctionCall2Coll(mmsrt_get_procinfo(mmdesc, attno,
+ PROCNUM_LESS),
+ colloid,
+ dtup->dt_columns[attno - 1].values[0],
+ value);
+ break;
+ case BTLessEqualStrategyNumber:
+ matches = FunctionCall2Coll(mmsrt_get_procinfo(mmdesc, attno,
+ PROCNUM_LESSEQUAL),
+ colloid,
+ dtup->dt_columns[attno - 1].values[0],
+ value);
+ break;
+ case BTEqualStrategyNumber:
+
+ /*
+ * In the equality case (WHERE col = someval), we want to return
+ * the current page range if the minimum value in the range <= scan
+ * key, and the maximum value >= scan key.
+ */
+ matches = FunctionCall2Coll(mmsrt_get_procinfo(mmdesc, attno,
+ PROCNUM_LESSEQUAL),
+ colloid,
+ dtup->dt_columns[attno - 1].values[0],
+ value);
+ if (!DatumGetBool(matches))
+ break;
+ /* max() >= scankey */
+ matches = FunctionCall2Coll(mmsrt_get_procinfo(mmdesc, attno,
+ PROCNUM_GREATEREQUAL),
+ colloid,
+ dtup->dt_columns[attno - 1].values[1],
+ value);
+ break;
+ case BTGreaterEqualStrategyNumber:
+ matches = FunctionCall2Coll(mmsrt_get_procinfo(mmdesc, attno,
+ PROCNUM_GREATEREQUAL),
+ colloid,
+ dtup->dt_columns[attno - 1].values[1],
+ value);
+ break;
+ case BTGreaterStrategyNumber:
+ matches = FunctionCall2Coll(mmsrt_get_procinfo(mmdesc, attno,
+ PROCNUM_GREATER),
+ colloid,
+ dtup->dt_columns[attno - 1].values[1],
+ value);
+ break;
+ default:
+ /* shouldn't happen */
+ elog(ERROR, "invalid strategy number %d", key->sk_strategy);
+ matches = 0;
+ break;
+ }
+
+ PG_RETURN_DATUM(matches);
+ }
+
+ /*
+ * Return the procedure corresponding to the given function support number.
+ */
+ static FmgrInfo *
+ mmsrt_get_procinfo(MinmaxDesc *mmdesc, uint16 attno, uint16 procnum)
+ {
+ SortableOpaque *opaque;
+ uint16 basenum = procnum - PROCNUM_BASE;
+
+ opaque = (SortableOpaque *) mmdesc->md_info[attno - 1]->oi_opaque;
+
+ /*
+ * We cache these in the opaque struct, to avoid repetitive syscache
+ * lookups.
+ */
+ if (!opaque->inited[basenum])
+ {
+ fmgr_info_copy(&opaque->operators[basenum],
+ index_getprocinfo(mmdesc->md_index, attno, procnum),
+ CurrentMemoryContext);
+ opaque->inited[basenum] = true;
+ }
+
+ return &opaque->operators[basenum];
+ }
*** /dev/null
--- b/src/backend/access/minmax/mmtuple.c
***************
*** 0 ****
--- 1,454 ----
+ /*
+ * MinMax-specific tuples
+ * Method implementations for tuples in minmax indexes.
+ *
+ * Intended usage is that code outside this file only deals with
+ * DeformedMMTuples, and convert to and from the on-disk representation through
+ * functions in this file.
+ *
+ * NOTES
+ *
+ * A minmax tuple is similar to a heap tuple, with a few key differences. The
+ * first interesting difference is that the tuple header is much simpler, only
+ * containing its total length and a small area for flags. Also, the stored
+ * data does not match the relation tuple descriptor exactly: for each
+ * attribute in the descriptor, the index tuple carries an arbitrary number
+ * of values, depending on the opclass.
+ *
+ * Also, for each column of the index relation there are two null bits: one
+ * (hasnulls) stores whether any tuple within the page range has that column
+ * set to null; the other one (allnulls) stores whether the column values are
+ * all null. If allnulls is true, then the tuple data area does not contain
+ * values for that column at all; whereas it does if the hasnulls is set.
+ * Note the size of the null bitmask may not be the same as that of the
+ * datum array.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmtuple.c
+ */
+ #include "postgres.h"
+
+ #include "access/htup_details.h"
+ #include "access/minmax_tuple.h"
+ #include "access/tupdesc.h"
+ #include "access/tupmacs.h"
+
+
+ static inline void mm_deconstruct_tuple(MinmaxDesc *mmdesc,
+ char *tp, bits8 *nullbits, bool nulls,
+ Datum *values, bool *allnulls, bool *hasnulls);
+
+
+ /*
+ * Return a tuple descriptor used for on-disk storage of minmax tuples.
+ */
+ static TupleDesc
+ mmtuple_disk_tupdesc(MinmaxDesc *mmdesc)
+ {
+ /* We cache these in the MinmaxDesc */
+ if (mmdesc->md_disktdesc == NULL)
+ {
+ int i;
+ int j;
+ AttrNumber attno = 1;
+ TupleDesc tupdesc;
+
+ tupdesc = CreateTemplateTupleDesc(mmdesc->md_totalstored, false);
+
+ for (i = 0; i < mmdesc->md_tupdesc->natts; i++)
+ {
+ for (j = 0; j < mmdesc->md_info[i]->oi_nstored; j++)
+ TupleDescInitEntry(tupdesc, attno++, NULL,
+ mmdesc->md_tupdesc->attrs[i]->atttypid,
+ mmdesc->md_tupdesc->attrs[i]->atttypmod,
+ 0);
+ }
+
+ mmdesc->md_disktdesc = tupdesc;
+ }
+
+ return mmdesc->md_disktdesc;
+ }
+
+ /*
+ * Generate a new on-disk tuple to be inserted in a minmax index.
+ */
+ MMTuple *
+ minmax_form_tuple(MinmaxDesc *mmdesc, DeformedMMTuple *tuple, Size *size)
+ {
+ Datum *values;
+ bool *nulls;
+ bool anynulls = false;
+ MMTuple *rettuple;
+ int keyno;
+ int idxattno;
+ uint16 phony_infomask;
+ bits8 *phony_nullbitmap;
+ Size len,
+ hoff,
+ data_len;
+
+ Assert(mmdesc->md_totalstored > 0);
+
+ values = palloc(sizeof(Datum) * mmdesc->md_totalstored);
+ nulls = palloc0(sizeof(bool) * mmdesc->md_totalstored);
+ phony_nullbitmap = palloc(sizeof(bits8) * BITMAPLEN(mmdesc->md_totalstored));
+
+ /*
+ * Set up the values/nulls arrays for heap_fill_tuple
+ */
+ for (idxattno = 0, keyno = 0; keyno < mmdesc->md_tupdesc->natts; keyno++)
+ {
+ int datumno;
+
+ /*
+ * "allnulls" is set when there's no nonnull value in any row in
+ * the column; when this happens, there is no data to store. Thus
+ * set the nullable bits for all data elements of this column and
+ * we're done.
+ */
+ if (tuple->dt_columns[keyno].allnulls)
+ {
+ for (datumno = 0;
+ datumno < mmdesc->md_info[keyno]->oi_nstored;
+ datumno++)
+ nulls[idxattno++] = true;
+ anynulls = true;
+ continue;
+ }
+
+ /*
+ * The "hasnulls" bit is set when there are some null values in the
+ * data. We still need to store a real value, but the presence of this
+ * means we need a null bitmap.
+ */
+ if (tuple->dt_columns[keyno].hasnulls)
+ anynulls = true;
+
+ for (datumno = 0;
+ datumno < mmdesc->md_info[keyno]->oi_nstored;
+ datumno++)
+ /* XXX datumCopy ?? */
+ values[idxattno++] = tuple->dt_columns[keyno].values[datumno];
+ }
+
+ /* compute total space needed */
+ len = SizeOfMinMaxTuple;
+ if (anynulls)
+ {
+ /*
+ * We need a double-length bitmap on an on-disk minmax index tuple;
+ * the first half stores the "allnulls" bits, the second stores
+ * "hasnulls".
+ */
+ len += BITMAPLEN(mmdesc->md_tupdesc->natts * 2);
+ }
+
+ /*
+ * TODO: we can probably do away with alignment here, and save some
+ * precious disk space. When there's no bitmap we can save 6 bytes. Maybe
+ * we can use the first col's type alignment instead of maxalign.
+ */
+ len = hoff = MAXALIGN(len);
+
+ data_len = heap_compute_data_size(mmtuple_disk_tupdesc(mmdesc),
+ values, nulls);
+
+ len += data_len;
+
+ rettuple = palloc0(len);
+ rettuple->mt_info = hoff;
+ Assert((rettuple->mt_info & MMIDX_OFFSET_MASK) == hoff);
+
+ /*
+ * The infomask and null bitmap as computed by heap_fill_tuple are useless
+ * to us. However, that function will not accept a null infomask; and we
+ * need to pass a valid null bitmap so that it will correctly skip
+ * outputting null attributes in the data area.
+ */
+ heap_fill_tuple(mmtuple_disk_tupdesc(mmdesc),
+ values,
+ nulls,
+ (char *) rettuple + hoff,
+ data_len,
+ &phony_infomask,
+ phony_nullbitmap);
+
+ /* done with these */
+ pfree(values);
+ pfree(nulls);
+ pfree(phony_nullbitmap);
+
+ /*
+ * Now fill in the real null bitmasks. allnulls first.
+ */
+ if (anynulls)
+ {
+ bits8 *bitP;
+ int bitmask;
+
+ rettuple->mt_info |= MMIDX_NULLS_MASK;
+
+ bitP = ((bits8 *) ((char *) rettuple + SizeOfMinMaxTuple)) - 1;
+ bitmask = HIGHBIT;
+ for (keyno = 0; keyno < mmdesc->md_tupdesc->natts; keyno++)
+ {
+ if (bitmask != HIGHBIT)
+ bitmask <<= 1;
+ else
+ {
+ bitP += 1;
+ *bitP = 0x0;
+ bitmask = 1;
+ }
+
+ if (tuple->dt_columns[keyno].allnulls)
+ continue;
+
+ *bitP |= bitmask;
+ }
+ /* hasnulls bits follow */
+ for (keyno = 0; keyno < mmdesc->md_tupdesc->natts; keyno++)
+ {
+ if (bitmask != HIGHBIT)
+ bitmask <<= 1;
+ else
+ {
+ bitP += 1;
+ *bitP = 0x0;
+ bitmask = 1;
+ }
+
+ if (tuple->dt_columns[keyno].hasnulls)
+ continue;
+
+ *bitP |= bitmask;
+ }
+ }
+
+ *size = len;
+ return rettuple;
+ }
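+
+ /*
+ * Rough sketch of the intended calling sequence when summarizing a page
+ * range (locking and the actual index insertion happen elsewhere and are
+ * omitted here):
+ *
+ *	dtup = minmax_new_dtuple(mmdesc);
+ *	... call the opclass "add value" support proc for each heap value,
+ *	updating dtup ...
+ *	mmtup = minmax_form_tuple(mmdesc, dtup, &size);
+ *	... insert mmtup into the index and record its TID in the revmap ...
+ *	minmax_free_tuple(mmtup);
+ */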
+
+ /*
+ * Free a tuple created by minmax_form_tuple
+ */
+ void
+ minmax_free_tuple(MMTuple *tuple)
+ {
+ pfree(tuple);
+ }
+
+ DeformedMMTuple *
+ minmax_new_dtuple(MinmaxDesc *mmdesc)
+ {
+ DeformedMMTuple *dtup;
+ char *currdatum;
+ long basesize;
+ int i;
+
+ basesize = MAXALIGN(sizeof(DeformedMMTuple) +
+ sizeof(MMValues) * mmdesc->md_tupdesc->natts);
+ dtup = palloc0(basesize + sizeof(Datum) * mmdesc->md_totalstored);
+ currdatum = (char *) dtup + basesize;
+ for (i = 0; i < mmdesc->md_tupdesc->natts; i++)
+ {
+ dtup->dt_columns[i].allnulls = true;
+ dtup->dt_columns[i].hasnulls = false;
+ dtup->dt_columns[i].values = (Datum *) currdatum;
+ currdatum += sizeof(Datum) * mmdesc->md_info[i]->oi_nstored;
+ }
+
+ return dtup;
+ }
+
+ /*
+ * Reset a DeformedMMTuple to initial state
+ */
+ void
+ minmax_dtuple_initialize(DeformedMMTuple *dtuple, MinmaxDesc *mmdesc)
+ {
+ int i;
+
+ dtuple->dt_seentup = false;
+
+ for (i = 0; i < mmdesc->md_tupdesc->natts; i++)
+ {
+ /*
+ * FIXME -- we may need to pfree() some datums here before clobbering
+ * the whole thing
+ */
+ dtuple->dt_columns[i].allnulls = true;
+ dtuple->dt_columns[i].hasnulls = false;
+ memset(dtuple->dt_columns[i].values, 0,
+ sizeof(Datum) * mmdesc->md_info[i]->oi_nstored);
+ }
+ }
+
+ /*
+ * Convert a MMTuple back to a DeformedMMTuple. This is the reverse of
+ * minmax_form_tuple.
+ *
+ * Note we don't need the "on disk tupdesc" here; we rely on our own routine to
+ * deconstruct the tuple from the on-disk format.
+ *
+ * XXX some callers might need copies of each datum; if so we need to apply
+ * datumCopy inside the loop. We probably also need a minmax_free_dtuple()
+ * function.
+ */
+ DeformedMMTuple *
+ minmax_deform_tuple(MinmaxDesc *mmdesc, MMTuple *tuple)
+ {
+ DeformedMMTuple *dtup;
+ Datum *values;
+ bool *allnulls;
+ bool *hasnulls;
+ char *tp;
+ bits8 *nullbits;
+ int keyno;
+ int valueno;
+
+ dtup = minmax_new_dtuple(mmdesc);
+
+ values = palloc(sizeof(Datum) * mmdesc->md_totalstored);
+ allnulls = palloc(sizeof(bool) * mmdesc->md_tupdesc->natts);
+ hasnulls = palloc(sizeof(bool) * mmdesc->md_tupdesc->natts);
+
+ tp = (char *) tuple + MMTupleDataOffset(tuple);
+
+ if (MMTupleHasNulls(tuple))
+ nullbits = (bits8 *) ((char *) tuple + SizeOfMinMaxTuple);
+ else
+ nullbits = NULL;
+ mm_deconstruct_tuple(mmdesc,
+ tp, nullbits, MMTupleHasNulls(tuple),
+ values, allnulls, hasnulls);
+
+ /*
+ * Iterate to assign each of the values to the corresponding item
+ * in the values array of each column.
+ */
+ for (valueno = 0, keyno = 0; keyno < mmdesc->md_tupdesc->natts; keyno++)
+ {
+ int i;
+
+ if (allnulls[keyno])
+ {
+ valueno += mmdesc->md_info[keyno]->oi_nstored;
+ continue;
+ }
+
+ dtup->dt_columns[keyno].values =
+ palloc(sizeof(Datum) * mmdesc->md_totalstored);
+
+ /* XXX optional datumCopy()? */
+ for (i = 0; i < mmdesc->md_info[keyno]->oi_nstored; i++)
+ dtup->dt_columns[keyno].values[i] = values[valueno++];
+
+ dtup->dt_columns[keyno].hasnulls = hasnulls[keyno];
+ dtup->dt_columns[keyno].allnulls = false;
+ }
+
+ pfree(values);
+ pfree(allnulls);
+ pfree(hasnulls);
+
+ return dtup;
+ }
+
+ /*
+ * mm_deconstruct_tuple
+ * Guts of attribute extraction from an on-disk minmax tuple.
+ *
+ * Its arguments are:
+ * mmdesc minmax descriptor for the stored tuple
+ * tp pointer to the tuple data area
+ * nullbits pointer to the tuple nulls bitmask
+ * nulls "has nulls" bit in tuple infomask
+ * values output values, array of size mmdesc->md_totalstored
+ * allnulls output "allnulls", size mmdesc->md_tupdesc->natts
+ * hasnulls output "hasnulls", size mmdesc->md_tupdesc->natts
+ *
+ * Output arrays must have been allocated by caller.
+ */
+ static inline void
+ mm_deconstruct_tuple(MinmaxDesc *mmdesc,
+ char *tp, bits8 *nullbits, bool nulls,
+ Datum *values, bool *allnulls, bool *hasnulls)
+ {
+ int attnum;
+ int stored;
+ TupleDesc diskdsc;
+ long off = 0;
+
+ /*
+ * First, loop over the attributes to obtain both null flags for each one.
+ */
+ for (attnum = 0; attnum < mmdesc->md_tupdesc->natts; attnum++)
+ {
+ /*
+ * the "all nulls" bit means that all values in the page range for
+ * this column are nulls. Therefore there are no values in the tuple
+ * data area.
+ */
+ if (nulls && att_isnull(attnum, nullbits))
+ {
+ allnulls[attnum] = true;
+ continue;
+ }
+
+ allnulls[attnum] = false;
+
+ /*
+ * the "has nulls" bit means that some tuples have nulls, but others
+ * have not-null values. Therefore we know the tuple contains data for
+ * this column.
+ *
+ * The hasnulls bits follow the allnulls bits in the same bitmask.
+ */
+ hasnulls[attnum] =
+ nulls && att_isnull(mmdesc->md_tupdesc->natts + attnum, nullbits);
+ }
+
+ /*
+ * Iterate to obtain each attribute's stored values. Note that since we
+ * may reuse attribute entries for more than one column, we cannot cache
+ * offsets here.
+ */
+ diskdsc = mmtuple_disk_tupdesc(mmdesc);
+ for (stored = 0, attnum = 0; attnum < mmdesc->md_tupdesc->natts; attnum++)
+ {
+ int datumno;
+
+ if (allnulls[attnum])
+ {
+ stored += mmdesc->md_info[attnum]->oi_nstored;
+ continue;
+ }
+
+ for (datumno = 0;
+ datumno < mmdesc->md_info[attnum]->oi_nstored;
+ datumno++)
+ {
+ Form_pg_attribute thisatt = diskdsc->attrs[stored];
+
+ if (thisatt->attlen == -1)
+ {
+ off = att_align_pointer(off, thisatt->attalign, -1,
+ tp + off);
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ off = att_align_nominal(off, thisatt->attalign);
+ }
+
+ values[stored++] = fetchatt(thisatt, tp + off);
+
+ off = att_addlength_pointer(off, thisatt->attlen, tp + off);
+ }
+ }
+ }
*** /dev/null
--- b/src/backend/access/minmax/mmxlog.c
***************
*** 0 ****
--- 1,305 ----
+ /*
+ * mmxlog.c
+ * XLog replay routines for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmxlog.c
+ */
+ #include "postgres.h"
+
+ #include "access/minmax.h"
+ #include "access/minmax_internal.h"
+ #include "access/minmax_page.h"
+ #include "access/minmax_revmap.h"
+ #include "access/minmax_tuple.h"
+ #include "access/minmax_xlog.h"
+ #include "access/xlogutils.h"
+ #include "storage/freespace.h"
+
+
+ /*
+ * xlog replay routines
+ */
+ static void
+ minmax_xlog_createidx(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_createidx *xlrec = (xl_minmax_createidx *) XLogRecGetData(record);
+ Buffer buf;
+ Page page;
+
+ /* Backup blocks are not used in create_index records */
+ Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+
+ /* create the index' metapage */
+ buf = XLogReadBuffer(xlrec->node, MINMAX_METAPAGE_BLKNO, true);
+ Assert(BufferIsValid(buf));
+ page = (Page) BufferGetPage(buf);
+ mm_metapage_init(page, xlrec->pagesPerRange, xlrec->version);
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ UnlockReleaseBuffer(buf);
+
+ /* also initialize its first revmap page */
+ buf = XLogReadBuffer(xlrec->node, 1, true);
+ Assert(BufferIsValid(buf));
+ page = (Page) BufferGetPage(buf);
+ mm_page_init(page, MINMAX_PAGETYPE_REVMAP);
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ UnlockReleaseBuffer(buf);
+ }
+
+ static void
+ minmax_xlog_insert(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_insert *xlrec = (xl_minmax_insert *) XLogRecGetData(record);
+ BlockNumber blkno;
+ Buffer buffer;
+ Page page;
+ OffsetNumber offnum;
+ int tuplen;
+ MMTuple *mmtuple;
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ blkno = ItemPointerGetBlockNumber(&(xlrec->target.tid));
+ if (record->xl_info & XLOG_MINMAX_INIT_PAGE)
+ {
+ buffer = XLogReadBuffer(xlrec->target.node, blkno, true);
+ Assert(BufferIsValid(buffer));
+ page = (Page) BufferGetPage(buffer);
+
+ mm_page_init(page, MINMAX_PAGETYPE_REGULAR);
+ }
+ else
+ {
+ buffer = XLogReadBuffer(xlrec->target.node, blkno, false);
+ if (!BufferIsValid(buffer))
+ return;
+ page = (Page) BufferGetPage(buffer);
+
+ if (lsn <= PageGetLSN(page)) /* changes are applied */
+ {
+ UnlockReleaseBuffer(buffer);
+ return;
+ }
+ }
+ offnum = ItemPointerGetOffsetNumber(&(xlrec->target.tid));
+ if (PageGetMaxOffsetNumber(page) + 1 < offnum)
+ elog(PANIC, "minmax_xlog_insert: invalid max offset number");
+
+ tuplen = record->xl_len - SizeOfMinmaxInsert;
+ mmtuple = (MMTuple *) ((char *) xlrec + SizeOfMinmaxInsert);
+
+ if (xlrec->overwrite)
+ PageOverwriteItemData(page, offnum, (Item) mmtuple, tuplen);
+ else
+ {
+ offnum = PageAddItem(page, (Item) mmtuple, tuplen, offnum, true, false);
+ if (offnum == InvalidOffsetNumber)
+ elog(PANIC, "minmax_xlog_insert: failed to add tuple");
+ }
+
+ PageSetLSN(page, lsn);
+
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+
+ /* XXX no FSM updates here ... */
+ }
+
+ static void
+ minmax_xlog_bulkremove(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_bulkremove *xlrec = (xl_minmax_bulkremove *) XLogRecGetData(record);
+ Buffer buffer;
+ Page page;
+ OffsetNumber *offnos;
+ int noffs;
+ Size freespace;
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ buffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
+ if (!BufferIsValid(buffer))
+ return;
+ page = (Page) BufferGetPage(buffer);
+
+ if (lsn <= PageGetLSN(page)) /* changes are applied */
+ {
+ UnlockReleaseBuffer(buffer);
+ return;
+ }
+
+ offnos = (OffsetNumber *) ((char *) xlrec + SizeOfMinmaxBulkRemove);
+ noffs = (record->xl_len - SizeOfMinmaxBulkRemove) / sizeof(OffsetNumber);
+
+ PageIndexDeleteNoCompact(page, offnos, noffs);
+ freespace = PageGetFreeSpace(page);
+
+ PageSetLSN(page, lsn);
+
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+
+ /* update FSM as well */
+ XLogRecordPageWithFreeSpace(xlrec->node, xlrec->block, freespace);
+ }
+
+ static void
+ minmax_xlog_revmap_set(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_rm_set *xlrec = (xl_minmax_rm_set *) XLogRecGetData(record);
+ bool init;
+ Buffer buffer;
+ Page page;
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ init = (record->xl_info & XLOG_MINMAX_INIT_PAGE) != 0;
+ buffer = XLogReadBuffer(xlrec->node, xlrec->mapBlock, init);
+ Assert(BufferIsValid(buffer));
+ page = BufferGetPage(buffer);
+ if (init)
+ mm_page_init(page, MINMAX_PAGETYPE_REVMAP);
+
+ rm_page_set_iptr(page, xlrec->pagesPerRange, xlrec->heapBlock,
+ ItemPointerGetBlockNumber(&(xlrec->newval)),
+ ItemPointerGetOffsetNumber(&(xlrec->newval)));
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+ }
+
+ static void
+ minmax_xlog_metapg_set(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_metapg_set *xlrec = (xl_minmax_metapg_set *) XLogRecGetData(record);
+ Buffer meta;
+ Page metapg;
+ MinmaxMetaPageData *metadata;
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ meta = XLogReadBuffer(xlrec->node, MINMAX_METAPAGE_BLKNO, false);
+ Assert(BufferIsValid(meta));
+
+ metapg = BufferGetPage(meta);
+ metadata = (MinmaxMetaPageData *) PageGetContents(metapg);
+ metadata->revmapArrayPages[xlrec->blkidx] = xlrec->newpg;
+
+ PageSetLSN(metapg, lsn);
+ MarkBufferDirty(meta);
+ UnlockReleaseBuffer(meta);
+ }
+
+ static void
+ minmax_xlog_init_rmpg(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_init_rmpg *xlrec = (xl_minmax_init_rmpg *) XLogRecGetData(record);
+ Buffer buffer;
+
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ buffer = XLogReadBuffer(xlrec->node, xlrec->blkno, true);
+ Assert(BufferIsValid(buffer));
+
+ if (xlrec->array)
+ initialize_rma_page(buffer);
+ else
+ initialize_rmr_page(buffer, xlrec->logblk);
+
+ PageSetLSN(BufferGetPage(buffer), lsn);
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+ }
+
+ static void
+ minmax_xlog_rmarray_set(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_rmarray_set *xlrec = (xl_minmax_rmarray_set *) XLogRecGetData(record);
+ Buffer buffer;
+ Page page;
+ RevmapArrayContents *contents;
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ buffer = XLogReadBuffer(xlrec->node, xlrec->rmarray, false);
+ Assert(BufferIsValid(buffer));
+
+ page = BufferGetPage(buffer);
+
+ contents = (RevmapArrayContents *) PageGetContents(page);
+ contents->rma_blocks[xlrec->blkidx] = xlrec->newpg;
+ contents->rma_nblocks = xlrec->blkidx + 1; /* XXX is this okay? */
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+ }
+
+ void
+ minmax_redo(XLogRecPtr lsn, XLogRecord *record)
+ {
+ uint8 info = record->xl_info & ~XLR_INFO_MASK;
+
+ switch (info & XLOG_MINMAX_OPMASK)
+ {
+ case XLOG_MINMAX_CREATE_INDEX:
+ minmax_xlog_createidx(lsn, record);
+ break;
+ case XLOG_MINMAX_INSERT:
+ minmax_xlog_insert(lsn, record);
+ break;
+ case XLOG_MINMAX_BULKREMOVE:
+ minmax_xlog_bulkremove(lsn, record);
+ break;
+ case XLOG_MINMAX_REVMAP_SET:
+ minmax_xlog_revmap_set(lsn, record);
+ break;
+ case XLOG_MINMAX_METAPG_SET:
+ minmax_xlog_metapg_set(lsn, record);
+ break;
+ case XLOG_MINMAX_RMARRAY_SET:
+ minmax_xlog_rmarray_set(lsn, record);
+ break;
+ case XLOG_MINMAX_INIT_RMPG:
+ minmax_xlog_init_rmpg(lsn, record);
+ break;
+ default:
+ elog(PANIC, "minmax_redo: unknown op code %u", info);
+ }
+ }
*** a/src/backend/access/rmgrdesc/Makefile
--- b/src/backend/access/rmgrdesc/Makefile
***************
*** 9,15 **** top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = clogdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
! mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o smgrdesc.o spgdesc.o \
standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
include $(top_srcdir)/src/backend/common.mk
--- 9,16 ----
include $(top_builddir)/src/Makefile.global
OBJS = clogdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
! minmaxdesc.o mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o \
! smgrdesc.o spgdesc.o \
standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
include $(top_srcdir)/src/backend/common.mk
*** /dev/null
--- b/src/backend/access/rmgrdesc/minmaxdesc.c
***************
*** 0 ****
--- 1,95 ----
+ /*-------------------------------------------------------------------------
+ *
+ * minmaxdesc.c
+ * rmgr descriptor routines for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/rmgrdesc/minmaxdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+ #include "postgres.h"
+
+ #include "access/minmax_xlog.h"
+
+ static void
+ out_target(StringInfo buf, xl_minmax_tid *target)
+ {
+ appendStringInfo(buf, "rel %u/%u/%u; tid %u/%u",
+ target->node.spcNode, target->node.dbNode, target->node.relNode,
+ ItemPointerGetBlockNumber(&(target->tid)),
+ ItemPointerGetOffsetNumber(&(target->tid)));
+ }
+
+ void
+ minmax_desc(StringInfo buf, XLogRecord *record)
+ {
+ char *rec = XLogRecGetData(record);
+ uint8 info = record->xl_info & ~XLR_INFO_MASK;
+
+ info &= XLOG_MINMAX_OPMASK;
+ if (info == XLOG_MINMAX_CREATE_INDEX)
+ {
+ xl_minmax_createidx *xlrec = (xl_minmax_createidx *) rec;
+
+ appendStringInfo(buf, "create index: %u/%u/%u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode);
+ }
+ else if (info == XLOG_MINMAX_INSERT)
+ {
+ xl_minmax_insert *xlrec = (xl_minmax_insert *) rec;
+
+ if (record->xl_info & XLOG_MINMAX_INIT_PAGE)
+ appendStringInfo(buf, "insert(init): ");
+ else
+ appendStringInfo(buf, "insert: ");
+ out_target(buf, &(xlrec->target));
+ }
+ else if (info == XLOG_MINMAX_BULKREMOVE)
+ {
+ xl_minmax_bulkremove *xlrec = (xl_minmax_bulkremove *) rec;
+
+ appendStringInfo(buf, "bulkremove: rel %u/%u/%u blk %u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode, xlrec->block);
+ }
+ else if (info == XLOG_MINMAX_REVMAP_SET)
+ {
+ xl_minmax_rm_set *xlrec = (xl_minmax_rm_set *) rec;
+
+ appendStringInfo(buf, "revmap set: rel %u/%u/%u mapblk %u pagesPerRange %u item %u value %u/%u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode, xlrec->mapBlock,
+ xlrec->pagesPerRange, xlrec->heapBlock,
+ ItemPointerGetBlockNumber(&(xlrec->newval)),
+ ItemPointerGetOffsetNumber(&(xlrec->newval)));
+ }
+ else if (info == XLOG_MINMAX_METAPG_SET)
+ {
+ xl_minmax_metapg_set *xlrec = (xl_minmax_metapg_set *) rec;
+
+ appendStringInfo(buf, "metapg: rel %u/%u/%u array revmap idx %d block %u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode,
+ xlrec->blkidx, xlrec->newpg);
+ }
+ else if (info == XLOG_MINMAX_RMARRAY_SET)
+ {
+ xl_minmax_rmarray_set *xlrec = (xl_minmax_rmarray_set *) rec;
+
+ appendStringInfoString(buf, "revmap array: ");
+ appendStringInfo(buf, "rel %u/%u/%u array pg %u revmap idx %d block %u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode,
+ xlrec->rmarray,
+ xlrec->blkidx, xlrec->newpg);
+ }
+
+ else
+ appendStringInfo(buf, "UNKNOWN");
+ }
*** a/src/backend/access/transam/rmgr.c
--- b/src/backend/access/transam/rmgr.c
***************
*** 12,17 ****
--- 12,18 ----
#include "access/gist_private.h"
#include "access/hash.h"
#include "access/heapam_xlog.h"
+ #include "access/minmax_xlog.h"
#include "access/multixact.h"
#include "access/nbtree.h"
#include "access/spgist.h"
*** a/src/backend/catalog/index.c
--- b/src/backend/catalog/index.c
***************
*** 2096,2101 **** IndexBuildHeapScan(Relation heapRelation,
--- 2096,2122 ----
IndexBuildCallback callback,
void *callback_state)
{
+ return IndexBuildHeapRangeScan(heapRelation, indexRelation,
+ indexInfo, allow_sync,
+ 0, InvalidBlockNumber,
+ callback, callback_state);
+ }
+
+ /*
+ * As above, except that instead of scanning the complete heap, only numblocks
+ * blocks are scanned, starting at block start_blockno. Scanning to the end of
+ * the relation can be signalled by passing InvalidBlockNumber as numblocks.
+ */
+ double
+ IndexBuildHeapRangeScan(Relation heapRelation,
+ Relation indexRelation,
+ IndexInfo *indexInfo,
+ bool allow_sync,
+ BlockNumber start_blockno,
+ BlockNumber numblocks,
+ IndexBuildCallback callback,
+ void *callback_state)
+ {
bool is_system_catalog;
bool checking_uniqueness;
HeapScanDesc scan;
***************
*** 2166,2171 **** IndexBuildHeapScan(Relation heapRelation,
--- 2187,2195 ----
true, /* buffer access strategy OK */
allow_sync); /* syncscan OK? */
+ /* set our endpoints */
+ heap_setscanlimits(scan, start_blockno, numblocks);
+
reltuples = 0;
/*
*** a/src/backend/replication/logical/decode.c
--- b/src/backend/replication/logical/decode.c
***************
*** 132,137 **** LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogRecord *record)
--- 132,138 ----
case RM_GIST_ID:
case RM_SEQ_ID:
case RM_SPGIST_ID:
+ case RM_MINMAX_ID:
break;
case RM_NEXT_ID:
elog(ERROR, "unexpected RM_NEXT_ID rmgr_id: %u", (RmgrIds) buf.record.xl_rmid);
*** a/src/backend/storage/page/bufpage.c
--- b/src/backend/storage/page/bufpage.c
***************
*** 324,329 **** PageAddItem(Page page,
--- 324,364 ----
}
/*
+ * PageOverwriteItemData
+ * Overwrite the data for the item at the given offset.
+ *
+ * The new data must fit in the existing data space for the old tuple.
+ */
+ void
+ PageOverwriteItemData(Page page, OffsetNumber offset, Item item, Size size)
+ {
+ PageHeader phdr = (PageHeader) page;
+ ItemId itemId;
+
+ /*
+ * Be wary about corrupted page pointers
+ */
+ if (phdr->pd_lower < SizeOfPageHeaderData ||
+ phdr->pd_lower > phdr->pd_upper ||
+ phdr->pd_upper > phdr->pd_special ||
+ phdr->pd_special > BLCKSZ)
+ ereport(PANIC,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u",
+ phdr->pd_lower, phdr->pd_upper, phdr->pd_special)));
+
+ itemId = PageGetItemId(phdr, offset);
+ if (!ItemIdIsUsed(itemId) || !ItemIdHasStorage(itemId))
+ elog(ERROR, "existing item to overwrite is not used");
+
+ if (ItemIdGetLength(itemId) < size)
+ elog(ERROR, "existing item is not large enough to be overwritten");
+
+ memcpy((char *) page + ItemIdGetOffset(itemId), item, size);
+ ItemIdSetNormal(itemId, ItemIdGetOffset(itemId), size);
+ }
+
+ /*
* PageGetTempPage
* Get a temporary page in local memory for special processing.
* The returned page is not initialized at all; caller must do that.
***************
*** 399,405 **** PageRestoreTempPage(Page tempPage, Page oldPage)
}
/*
! * sorting support for PageRepairFragmentation and PageIndexMultiDelete
*/
typedef struct itemIdSortData
{
--- 434,441 ----
}
/*
! * sorting support for PageRepairFragmentation, PageIndexMultiDelete,
! * PageIndexDeleteNoCompact
*/
typedef struct itemIdSortData
{
***************
*** 896,901 **** PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
--- 932,1113 ----
phdr->pd_upper = upper;
}
+ /*
+ * PageIndexDeleteNoCompact
+ * Delete the given items from an index page, and defragment the resulting
+ * free space, but do not compact the item pointers array.
+ *
+ * itemnos is the array of item numbers to delete; nitems is its size.
+ *
+ * Unused items at the end of the array are removed.
+ *
+ * This is used for index AMs that require that existing TIDs of live tuples
+ * remain unchanged.
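+ * (Minmax indexes need this because the revmap references index tuples by
+ * TID; compacting the line pointer array would invalidate those references.)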
+ */
+ void
+ PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos, int nitems)
+ {
+ PageHeader phdr = (PageHeader) page;
+ LocationIndex pd_lower = phdr->pd_lower;
+ LocationIndex pd_upper = phdr->pd_upper;
+ LocationIndex pd_special = phdr->pd_special;
+ int nline;
+ bool empty;
+ OffsetNumber offnum;
+ int nextitm;
+
+ /*
+ * As with PageRepairFragmentation, paranoia seems justified.
+ */
+ if (pd_lower < SizeOfPageHeaderData ||
+ pd_lower > pd_upper ||
+ pd_upper > pd_special ||
+ pd_special > BLCKSZ ||
+ pd_special != MAXALIGN(pd_special))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u",
+ pd_lower, pd_upper, pd_special)));
+
+ /*
+ * Scan the existing item pointer array and mark as unused those that are
+ * in our kill-list; make sure any non-interesting ones are marked unused
+ * as well.
+ */
+ nline = PageGetMaxOffsetNumber(page);
+ empty = true;
+ nextitm = 0;
+ for (offnum = FirstOffsetNumber; offnum <= nline; offnum = OffsetNumberNext(offnum))
+ {
+ ItemId lp;
+ ItemLength itemlen;
+ ItemOffset offset;
+
+ lp = PageGetItemId(page, offnum);
+
+ itemlen = ItemIdGetLength(lp);
+ offset = ItemIdGetOffset(lp);
+
+ if (ItemIdIsUsed(lp))
+ {
+ if (offset < pd_upper ||
+ (offset + itemlen) > pd_special ||
+ offset != MAXALIGN(offset))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted item pointer: offset = %u, length = %u",
+ offset, (unsigned int) itemlen)));
+
+ if (nextitm < nitems && offnum == itemnos[nextitm])
+ {
+ /* this one is on our list to delete, so mark it unused */
+ ItemIdSetUnused(lp);
+ nextitm++;
+ }
+ else if (ItemIdHasStorage(lp))
+ {
+ /* This one's live -- must do the compaction dance */
+ empty = false;
+ }
+ else
+ {
+ /* get rid of this one too */
+ ItemIdSetUnused(lp);
+ }
+ }
+ }
+
+ /* this will catch invalid or out-of-order itemnos[] */
+ if (nextitm != nitems)
+ elog(ERROR, "incorrect index offsets supplied");
+
+ if (empty)
+ {
+ /* Page is completely empty, so just reset it quickly */
+ phdr->pd_lower = SizeOfPageHeaderData;
+ phdr->pd_upper = pd_special;
+ }
+ else
+ {
+ /* There are live items: need to compact the page the hard way */
+ itemIdSortData itemidbase[MaxOffsetNumber];
+ itemIdSort itemidptr;
+ int i;
+ Size totallen;
+ Offset upper;
+
+ /*
+ * Scan the page taking note of each item that we need to preserve.
+ * This includes both live items (those that contain data) and
+ * interspersed unused ones. It's critical to preserve these unused
+ * items, because otherwise the offset numbers for later live items
+ * would change, which is not acceptable. Unused items might get used
+ * again later; that is fine.
+ */
+ itemidptr = itemidbase;
+ totallen = 0;
+ for (i = 0; i < nline; i++, itemidptr++)
+ {
+ ItemId lp;
+
+ itemidptr->offsetindex = i;
+
+ lp = PageGetItemId(page, i + 1);
+ if (ItemIdHasStorage(lp))
+ {
+ itemidptr->itemoff = ItemIdGetOffset(lp);
+ itemidptr->alignedlen = MAXALIGN(ItemIdGetLength(lp));
+ totallen += itemidptr->alignedlen;
+ }
+ else
+ {
+ itemidptr->itemoff = 0;
+ itemidptr->alignedlen = 0;
+ }
+ }
+ /* By here, there are exactly nline elements in itemidbase array */
+
+ if (totallen > (Size) (pd_special - pd_lower))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted item lengths: total %u, available space %u",
+ (unsigned int) totallen, pd_special - pd_lower)));
+
+ /* sort itemIdSortData array into decreasing itemoff order */
+ qsort((char *) itemidbase, nline, sizeof(itemIdSortData),
+ itemoffcompare);
+
+ /*
+ * Defragment the data areas of each tuple, being careful to preserve
+ * each item's position in the linp array.
+ */
+ upper = pd_special;
+ PageClearHasFreeLinePointers(page);
+ for (i = 0, itemidptr = itemidbase; i < nline; i++, itemidptr++)
+ {
+ ItemId lp;
+
+ lp = PageGetItemId(page, itemidptr->offsetindex + 1);
+ if (itemidptr->alignedlen == 0)
+ {
+ PageSetHasFreeLinePointers(page);
+ ItemIdSetUnused(lp);
+ continue;
+ }
+ upper -= itemidptr->alignedlen;
+ memmove((char *) page + upper,
+ (char *) page + itemidptr->itemoff,
+ itemidptr->alignedlen);
+ lp->lp_off = upper;
+ /* lp_flags and lp_len remain the same as originally */
+ }
+
+ /* Set the new page limits */
+ phdr->pd_upper = upper;
+ phdr->pd_lower = SizeOfPageHeaderData + i * sizeof(ItemIdData);
+ }
+ }
/*
* Set checksum for a page in shared buffers.
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
***************
*** 7349,7351 **** gincostestimate(PG_FUNCTION_ARGS)
--- 7349,7375 ----
PG_RETURN_VOID();
}
+
+ Datum
+ mmcostestimate(PG_FUNCTION_ARGS)
+ {
+ PlannerInfo *root = (PlannerInfo *) PG_GETARG_POINTER(0);
+ IndexPath *path = (IndexPath *) PG_GETARG_POINTER(1);
+ double loop_count = PG_GETARG_FLOAT8(2);
+ Cost *indexStartupCost = (Cost *) PG_GETARG_POINTER(3);
+ Cost *indexTotalCost = (Cost *) PG_GETARG_POINTER(4);
+ Selectivity *indexSelectivity = (Selectivity *) PG_GETARG_POINTER(5);
+ double *indexCorrelation = (double *) PG_GETARG_POINTER(6);
+ IndexOptInfo *index = path->indexinfo;
+
+ *indexStartupCost = (Cost) seq_page_cost * index->pages * loop_count;
+ *indexTotalCost = *indexStartupCost;
+
+ *indexSelectivity =
+ clauselist_selectivity(root, path->indexquals,
+ path->indexinfo->rel->relid,
+ JOIN_INNER, NULL);
+ *indexCorrelation = 1;
+
+ PG_RETURN_VOID();
+ }
*** a/src/include/access/heapam.h
--- b/src/include/access/heapam.h
***************
*** 112,117 **** extern HeapScanDesc heap_beginscan_strat(Relation relation, Snapshot snapshot,
--- 112,119 ----
bool allow_strat, bool allow_sync);
extern HeapScanDesc heap_beginscan_bm(Relation relation, Snapshot snapshot,
int nkeys, ScanKey key);
+ extern void heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk,
+ BlockNumber endBlk);
extern void heap_rescan(HeapScanDesc scan, ScanKey key);
extern void heap_endscan(HeapScanDesc scan);
extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
*** /dev/null
--- b/src/include/access/minmax.h
***************
*** 0 ****
--- 1,52 ----
+ /*
+ * AM-callable functions for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax.h
+ */
+ #ifndef MINMAX_H
+ #define MINMAX_H
+
+ #include "fmgr.h"
+ #include "nodes/execnodes.h"
+ #include "utils/relcache.h"
+
+
+ /*
+ * prototypes for functions in minmax.c (external entry points for minmax)
+ */
+ extern Datum mmbuild(PG_FUNCTION_ARGS);
+ extern Datum mmbuildempty(PG_FUNCTION_ARGS);
+ extern Datum mminsert(PG_FUNCTION_ARGS);
+ extern Datum mmbeginscan(PG_FUNCTION_ARGS);
+ extern Datum mmgettuple(PG_FUNCTION_ARGS);
+ extern Datum mmgetbitmap(PG_FUNCTION_ARGS);
+ extern Datum mmrescan(PG_FUNCTION_ARGS);
+ extern Datum mmendscan(PG_FUNCTION_ARGS);
+ extern Datum mmmarkpos(PG_FUNCTION_ARGS);
+ extern Datum mmrestrpos(PG_FUNCTION_ARGS);
+ extern Datum mmbulkdelete(PG_FUNCTION_ARGS);
+ extern Datum mmvacuumcleanup(PG_FUNCTION_ARGS);
+ extern Datum mmcanreturn(PG_FUNCTION_ARGS);
+ extern Datum mmcostestimate(PG_FUNCTION_ARGS);
+ extern Datum mmoptions(PG_FUNCTION_ARGS);
+
+ /*
+ * Storage type for MinMax' reloptions
+ */
+ typedef struct MinmaxOptions
+ {
+ int32 vl_len_; /* varlena header (do not touch directly!) */
+ BlockNumber pagesPerRange;
+ } MinmaxOptions;
+
+ #define MINMAX_DEFAULT_PAGES_PER_RANGE 128
+ #define MinmaxGetPagesPerRange(relation) \
+ ((relation)->rd_options ? \
+ ((MinmaxOptions *) (relation)->rd_options)->pagesPerRange : \
+ MINMAX_DEFAULT_PAGES_PER_RANGE)
+
+ #endif /* MINMAX_H */
*** /dev/null
--- b/src/include/access/minmax_internal.h
***************
*** 0 ****
--- 1,84 ----
+ /*
+ * minmax_internal.h
+ * internal declarations for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_internal.h
+ */
+ #ifndef MINMAX_INTERNAL_H
+ #define MINMAX_INTERNAL_H
+
+ #include "fmgr.h"
+ #include "storage/buf.h"
+ #include "storage/bufpage.h"
+ #include "storage/off.h"
+ #include "utils/relcache.h"
+
+
+ /*
+ * A MinmaxDesc is a struct designed to enable decoding a MinMax tuple from the
+ * on-disk format to a DeformedMMTuple and vice-versa.
+ *
+ * Note: we assume, for now, that the data stored for each column is the same
+ * datatype as the indexed heap column. This restriction can be lifted by
+ * having an Oid array pointer on the PerCol struct, where each member of the
+ * array indicates the typid of the stored data.
+ */
+
+ /* struct returned by "OpcInfo" amproc */
+ typedef struct MinmaxOpcInfo
+ {
+ /* Number of columns stored in an index column of this opclass */
+ uint16 oi_nstored;
+
+ /* Opaque pointer for the opclass' private use */
+ void *oi_opaque;
+ } MinmaxOpcInfo;
+
+ typedef struct MinmaxDesc
+ {
+ /* the index relation itself */
+ Relation md_index;
+
+ /* tuple descriptor of the index relation */
+ TupleDesc md_tupdesc;
+
+ /* cached copy for on-disk tuples; generated at first use */
+ TupleDesc md_disktdesc;
+
+ /* total number of Datum entries that are stored on-disk for all columns */
+ int md_totalstored;
+
+ /* per-column info */
+ MinmaxOpcInfo *md_info[FLEXIBLE_ARRAY_MEMBER]; /* tupdesc->natts entries long */
+ } MinmaxDesc;
+
+ /*
+ * Globally-known function support numbers for Minmax indexes. Individual
+ * opclasses define their own function support numbers, which must not collide
+ * with the definitions here.
+ */
+ #define MINMAX_PROCNUM_OPCINFO 1
+ #define MINMAX_PROCNUM_ADDVALUE 2
+ #define MINMAX_PROCNUM_CONSISTENT 3
+
+ #define MINMAX_DEBUG
+
+ /* we allow debug if using GCC; otherwise don't bother */
+ #if defined(MINMAX_DEBUG) && defined(__GNUC__)
+ #define MINMAX_elog(level, ...) elog(level, __VA_ARGS__)
+ #else
+ #define MINMAX_elog(...) ((void) 0)
+ #endif
+
+ /* minmax.c */
+ extern MinmaxDesc *minmax_build_mmdesc(Relation rel);
+ extern Buffer mm_getnewbuffer(Relation irel);
+ extern void mm_page_init(Page page, uint16 type);
+ extern void mm_metapage_init(Page page, BlockNumber pagesPerRange,
+ uint16 version);
+
+ #endif /* MINMAX_INTERNAL_H */
*** /dev/null
--- b/src/include/access/minmax_page.h
***************
*** 0 ****
--- 1,88 ----
+ /*
+ * Prototypes and definitions for minmax page layouts
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_page.h
+ *
+ * NOTES
+ *
+ * These structs should really be private to specific minmax files, but it's
+ * useful to have them here so that they can be used by pageinspect and similar
+ * tools.
+ */
+ #ifndef MINMAX_PAGE_H
+ #define MINMAX_PAGE_H
+
+
+ /* special space on all minmax pages stores a "type" identifier */
+ #define MINMAX_PAGETYPE_META 0xF091
+ #define MINMAX_PAGETYPE_REVMAP_ARRAY 0xF092
+ #define MINMAX_PAGETYPE_REVMAP 0xF093
+ #define MINMAX_PAGETYPE_REGULAR 0xF094
+
+ typedef struct MinmaxSpecialSpace
+ {
+ uint16 type;
+ } MinmaxSpecialSpace;
+
+ /* Metapage definitions */
+ typedef struct MinmaxMetaPageData
+ {
+ uint32 minmaxMagic;
+ uint32 minmaxVersion;
+ BlockNumber pagesPerRange;
+ BlockNumber revmapArrayPages[1]; /* actually MAX_REVMAP_ARRAYPAGES */
+ } MinmaxMetaPageData;
+
+ /*
+ * Number of array pages listed in metapage. Need to consider leaving enough
+ * space for the page header, the metapage struct, and the minmax special
+ * space.
+ */
+ #define MAX_REVMAP_ARRAYPAGES \
+ ((BLCKSZ - \
+ MAXALIGN(SizeOfPageHeaderData) - \
+ offsetof(MinmaxMetaPageData, revmapArrayPages) - \
+ MAXALIGN(sizeof(MinmaxSpecialSpace)) ) / \
+ sizeof(BlockNumber))
+
+ #define MINMAX_CURRENT_VERSION 1
+ #define MINMAX_META_MAGIC 0xA8109CFA
+
+ #define MINMAX_METAPAGE_BLKNO 0
+
+ /* Definitions for regular revmap pages */
+ typedef struct RevmapContents
+ {
+ int32 rmr_logblk; /* logical blkno of this revmap page */
+ ItemPointerData rmr_tids[1]; /* really REGULAR_REVMAP_PAGE_MAXITEMS */
+ } RevmapContents;
+
+ #define REGULAR_REVMAP_CONTENT_SIZE \
+ (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - \
+ offsetof(RevmapContents, rmr_tids) - \
+ MAXALIGN(sizeof(MinmaxSpecialSpace)))
+ /* max num of items in the array */
+ #define REGULAR_REVMAP_PAGE_MAXITEMS \
+ (REGULAR_REVMAP_CONTENT_SIZE / sizeof(ItemPointerData))
+
+ /* Definitions for array revmap pages */
+ typedef struct RevmapArrayContents
+ {
+ int32 rma_nblocks;
+ BlockNumber rma_blocks[1]; /* really ARRAY_REVMAP_PAGE_MAXITEMS */
+ } RevmapArrayContents;
+
+ #define REVMAP_ARRAY_CONTENT_SIZE \
+ (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - \
+ offsetof(RevmapArrayContents, rma_blocks) - \
+ MAXALIGN(sizeof(MinmaxSpecialSpace)))
+ /* max num of items in the array */
+ #define ARRAY_REVMAP_PAGE_MAXITEMS \
+ (REVMAP_ARRAY_CONTENT_SIZE / sizeof(BlockNumber))
+
+
+ #endif /* MINMAX_PAGE_H */
*** /dev/null
--- b/src/include/access/minmax_revmap.h
***************
*** 0 ****
--- 1,40 ----
+ /*
+ * prototypes for minmax reverse range maps
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_revmap.h
+ */
+
+ #ifndef MINMAX_REVMAP_H
+ #define MINMAX_REVMAP_H
+
+ #include "storage/block.h"
+ #include "storage/itemptr.h"
+ #include "storage/off.h"
+ #include "utils/relcache.h"
+
+ /* struct definition lives in mmrevmap.c */
+ typedef struct mmRevmapAccess mmRevmapAccess;
+
+ extern mmRevmapAccess *mmRevmapAccessInit(Relation idxrel, BlockNumber *pagesPerRange);
+ extern void mmRevmapAccessTerminate(mmRevmapAccess *rmAccess);
+
+ extern void mmRevmapCreate(Relation idxrel);
+ extern void mmSetHeapBlockItemptr(mmRevmapAccess *rmAccess, BlockNumber blk,
+ BlockNumber blkno, OffsetNumber offno);
+ extern void mmGetHeapBlockItemptr(mmRevmapAccess *rmAccess, BlockNumber blk,
+ ItemPointerData *iptr);
+ extern void mmRevmapTruncate(mmRevmapAccess *rmAccess,
+ BlockNumber heapNumBlocks);
+
+ /* internal stuff also used by xlog replay */
+ extern void rm_page_set_iptr(Page page, BlockNumber pagesPerRange,
+ BlockNumber heapBlk, BlockNumber blkno, OffsetNumber offno);
+ extern BlockNumber initialize_rmr_page(Buffer newbuf, BlockNumber mapBlk);
+ extern void initialize_rma_page(Buffer buf);
+
+
+ #endif /* MINMAX_REVMAP_H */
*** /dev/null
--- b/src/include/access/minmax_tuple.h
***************
*** 0 ****
--- 1,84 ----
+ /*
+ * Declarations for dealing with MinMax-specific tuples.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_tuple.h
+ */
+ #ifndef MINMAX_TUPLE_H
+ #define MINMAX_TUPLE_H
+
+ #include "access/minmax_internal.h"
+ #include "access/tupdesc.h"
+
+
+ /*
+ * A minmax index stores one index tuple per page range. Each index tuple
+ * has one MMValues struct for each indexed column; in turn, each MMValues
+ * has (besides the null flags) an array of Datum whose size is determined by
+ * the opclass.
+ */
+ typedef struct MMValues
+ {
+ bool hasnulls; /* are there any nulls in the page range? */
+ bool allnulls; /* are all values nulls in the page range? */
+ Datum *values; /* current accumulated values */
+ } MMValues;
+
+ /*
+ * This struct represents one index tuple, comprising the minimum and maximum
+ * values for all indexed columns, within one page range. These values can
+ * only be meaningfully decoded with an appropriate MinmaxDesc.
+ */
+ typedef struct DeformedMMTuple
+ {
+ int dt_seentup;
+ MMValues dt_columns[FLEXIBLE_ARRAY_MEMBER];
+ } DeformedMMTuple;
+
+ /*
+ * An on-disk minmax tuple. This is possibly followed by a nulls bitmask, with
+ * room for 2 null bits (two bits for each value stored); an opclass-defined
+ * number of Datum values for each column follow.
+ */
+ typedef struct MMTuple
+ {
+ /* ---------------
+ * mt_info is laid out in the following fashion:
+ *
+ * 7th (high) bit: has nulls
+ * 6th bit: unused
+ * 5th bit: unused
+ * 4-0 bit: offset of data
+ * ---------------
+ */
+ uint8 mt_info;
+ } MMTuple;
+
+ #define SizeOfMinMaxTuple (offsetof(MMTuple, mt_info) + sizeof(uint8))
+
+ /*
+ * t_info manipulation macros
+ */
+ #define MMIDX_OFFSET_MASK 0x1F
+ /* bit 0x20 is not used at present */
+ /* bit 0x40 is not used at present */
+ #define MMIDX_NULLS_MASK 0x80
+
+ #define MMTupleDataOffset(mmtup) ((Size) (((MMTuple *) (mmtup))->mt_info & MMIDX_OFFSET_MASK))
+ #define MMTupleHasNulls(mmtup) (((((MMTuple *) (mmtup))->mt_info & MMIDX_NULLS_MASK)) != 0)
+
+
+ extern MMTuple *minmax_form_tuple(MinmaxDesc *mmdesc,
+ DeformedMMTuple *tuple, Size *size);
+ extern void minmax_free_tuple(MMTuple *tuple);
+
+ extern DeformedMMTuple *minmax_new_dtuple(MinmaxDesc *mmdesc);
+ extern void minmax_dtuple_initialize(DeformedMMTuple *dtuple,
+ MinmaxDesc *mmdesc);
+ extern DeformedMMTuple *minmax_deform_tuple(MinmaxDesc *mmdesc,
+ MMTuple *tuple);
+
+ #endif /* MINMAX_TUPLE_H */
*** /dev/null
--- b/src/include/access/minmax_xlog.h
***************
*** 0 ****
--- 1,134 ----
+ /*-------------------------------------------------------------------------
+ *
+ * minmax_xlog.h
+ * POSTGRES MinMax access XLOG definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/minmax_xlog.h
+ *
+ *-------------------------------------------------------------------------
+ */
+ #ifndef MINMAX_XLOG_H
+ #define MINMAX_XLOG_H
+
+ #include "access/xlog.h"
+ #include "storage/bufpage.h"
+ #include "storage/itemptr.h"
+ #include "storage/relfilenode.h"
+ #include "utils/relcache.h"
+
+
+ /*
+ * WAL record definitions for minmax's WAL operations
+ *
+ * XLOG allows to store some information in high 4 bits of log
+ * record xl_info field.
+ */
+ #define XLOG_MINMAX_CREATE_INDEX 0x00
+ #define XLOG_MINMAX_INSERT 0x10
+ #define XLOG_MINMAX_BULKREMOVE 0x20
+ #define XLOG_MINMAX_REVMAP_SET 0x30
+ #define XLOG_MINMAX_METAPG_SET 0x40
+ #define XLOG_MINMAX_RMARRAY_SET 0x50
+ #define XLOG_MINMAX_INIT_RMPG 0x60
+
+ #define XLOG_MINMAX_OPMASK 0x70
+ /*
+ * When we insert the first item on a new page, we restore the entire page in
+ * redo.
+ */
+ #define XLOG_MINMAX_INIT_PAGE 0x80
+
+ /* This is what we need to know about a minmax index create */
+ typedef struct xl_minmax_createidx
+ {
+ BlockNumber pagesPerRange;
+ RelFileNode node;
+ uint16 version;
+ } xl_minmax_createidx;
+ #define SizeOfMinmaxCreateIdx (offsetof(xl_minmax_createidx, version) + sizeof(uint16))
+
+ /* All that we need to find a minmax tuple */
+ typedef struct xl_minmax_tid
+ {
+ RelFileNode node;
+ ItemPointerData tid;
+ } xl_minmax_tid;
+
+ #define SizeOfMinmaxTid (offsetof(xl_minmax_tid, tid) + SizeOfIptrData)
+
+ /* This is what we need to know about a minmax tuple insert */
+ typedef struct xl_minmax_insert
+ {
+ xl_minmax_tid target;
+ bool overwrite;
+ /* tuple data follows at end of struct */
+ } xl_minmax_insert;
+
+ #define SizeOfMinmaxInsert (offsetof(xl_minmax_insert, overwrite) + sizeof(bool))
+
+ /* This is what we need to know about a bulk minmax tuple remove */
+ typedef struct xl_minmax_bulkremove
+ {
+ RelFileNode node;
+ BlockNumber block;
+ /* offset number array follows at end of struct */
+ } xl_minmax_bulkremove;
+
+ #define SizeOfMinmaxBulkRemove (offsetof(xl_minmax_bulkremove, block) + sizeof(BlockNumber))
+
+ /* This is what we need to know about a revmap "set heap ptr" */
+ typedef struct xl_minmax_rm_set
+ {
+ RelFileNode node;
+ BlockNumber mapBlock;
+ int pagesPerRange;
+ BlockNumber heapBlock;
+ ItemPointerData newval;
+ } xl_minmax_rm_set;
+
+ #define SizeOfMinmaxRevmapSet (offsetof(xl_minmax_rm_set, newval) + SizeOfIptrData)
+
+ /* This is what we need to know about a "metapage set" operation */
+ typedef struct xl_minmax_metapg_set
+ {
+ RelFileNode node;
+ uint32 blkidx;
+ BlockNumber newpg;
+ } xl_minmax_metapg_set;
+
+ #define SizeOfMinmaxMetapgSet (offsetof(xl_minmax_metapg_set, newpg) + \
+ sizeof(BlockNumber))
+
+ /* This is what we need to know about a "revmap array set" operation */
+ typedef struct xl_minmax_rmarray_set
+ {
+ RelFileNode node;
+ BlockNumber rmarray;
+ uint32 blkidx;
+ BlockNumber newpg;
+ } xl_minmax_rmarray_set;
+
+ #define SizeOfMinmaxRmarraySet (offsetof(xl_minmax_rmarray_set, newpg) + \
+ sizeof(BlockNumber))
+
+ /* This is what we need to know when we initialize a new revmap page */
+ typedef struct xl_minmax_init_rmpg
+ {
+ RelFileNode node;
+ bool array; /* array revmap page or regular revmap page */
+ BlockNumber blkno;
+ BlockNumber logblk; /* only used by regular revmap pages */
+ } xl_minmax_init_rmpg;
+
+ #define SizeOfMinmaxInitRmpg (offsetof(xl_minmax_init_rmpg, blkno) + \
+ sizeof(BlockNumber))
+
+
+ extern void minmax_desc(StringInfo buf, XLogRecord *record);
+ extern void minmax_redo(XLogRecPtr lsn, XLogRecord *record);
+
+ #endif /* MINMAX_XLOG_H */
*** a/src/include/access/reloptions.h
--- b/src/include/access/reloptions.h
***************
*** 45,52 **** typedef enum relopt_kind
RELOPT_KIND_TABLESPACE = (1 << 7),
RELOPT_KIND_SPGIST = (1 << 8),
RELOPT_KIND_VIEW = (1 << 9),
/* if you add a new kind, make sure you update "last_default" too */
! RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_VIEW,
/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
RELOPT_KIND_MAX = (1 << 30)
} relopt_kind;
--- 45,53 ----
RELOPT_KIND_TABLESPACE = (1 << 7),
RELOPT_KIND_SPGIST = (1 << 8),
RELOPT_KIND_VIEW = (1 << 9),
+ RELOPT_KIND_MINMAX = (1 << 10),
/* if you add a new kind, make sure you update "last_default" too */
! RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_MINMAX,
/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
RELOPT_KIND_MAX = (1 << 30)
} relopt_kind;
*** a/src/include/access/relscan.h
--- b/src/include/access/relscan.h
***************
*** 35,42 **** typedef struct HeapScanDescData
bool rs_temp_snap; /* unregister snapshot at scan end? */
/* state set up at initscan time */
! BlockNumber rs_nblocks; /* number of blocks to scan */
BlockNumber rs_startblock; /* block # to start at */
BufferAccessStrategy rs_strategy; /* access strategy for reads */
bool rs_syncscan; /* report location to syncscan logic? */
--- 35,44 ----
bool rs_temp_snap; /* unregister snapshot at scan end? */
/* state set up at initscan time */
! BlockNumber rs_nblocks; /* total number of blocks in rel */
BlockNumber rs_startblock; /* block # to start at */
+ BlockNumber rs_initblock; /* block # to consider initial of rel */
+ BlockNumber rs_numblocks; /* number of blocks to scan */
BufferAccessStrategy rs_strategy; /* access strategy for reads */
bool rs_syncscan; /* report location to syncscan logic? */
*** a/src/include/access/rmgrlist.h
--- b/src/include/access/rmgrlist.h
***************
*** 42,44 **** PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_xlog_startup, gin_xlog_cleanup
--- 42,45 ----
PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_xlog_startup, gist_xlog_cleanup)
PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, NULL, NULL)
PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_xlog_startup, spg_xlog_cleanup)
+ PG_RMGR(RM_MINMAX_ID, "MinMax", minmax_redo, minmax_desc, NULL, NULL)
*** a/src/include/catalog/index.h
--- b/src/include/catalog/index.h
***************
*** 97,102 **** extern double IndexBuildHeapScan(Relation heapRelation,
--- 97,110 ----
bool allow_sync,
IndexBuildCallback callback,
void *callback_state);
+ extern double IndexBuildHeapRangeScan(Relation heapRelation,
+ Relation indexRelation,
+ IndexInfo *indexInfo,
+ bool allow_sync,
+ BlockNumber start_blockno,
+ BlockNumber end_blockno,
+ IndexBuildCallback callback,
+ void *callback_state);
extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
*** a/src/include/catalog/pg_am.h
--- b/src/include/catalog/pg_am.h
***************
*** 132,136 **** DESCR("GIN index access method");
--- 132,138 ----
DATA(insert OID = 4000 ( spgist 0 5 f f f f f t f t f f f 0 spginsert spgbeginscan spggettuple spggetbitmap spgrescan spgendscan spgmarkpos spgrestrpos spgbuild spgbuildempty spgbulkdelete spgvacuumcleanup spgcanreturn spgcostestimate spgoptions ));
DESCR("SP-GiST index access method");
#define SPGIST_AM_OID 4000
+ DATA(insert OID = 3580 ( minmax 5 7 f f f f t t f t t f f 0 mminsert mmbeginscan - mmgetbitmap mmrescan mmendscan mmmarkpos mmrestrpos mmbuild mmbuildempty mmbulkdelete mmvacuumcleanup - mmcostestimate mmoptions ));
+ #define MINMAX_AM_OID 3580
#endif /* PG_AM_H */
*** a/src/include/catalog/pg_amop.h
--- b/src/include/catalog/pg_amop.h
***************
*** 845,848 **** DATA(insert ( 3550 869 869 25 s 932 783 0 ));
--- 845,929 ----
DATA(insert ( 3550 869 869 26 s 933 783 0 ));
DATA(insert ( 3550 869 869 27 s 934 783 0 ));
+ /*
+ * int4_minmax_ops
+ */
+ DATA(insert ( 4054 23 23 1 s 97 3580 0 ));
+ DATA(insert ( 4054 23 23 2 s 523 3580 0 ));
+ DATA(insert ( 4054 23 23 3 s 96 3580 0 ));
+ DATA(insert ( 4054 23 23 4 s 525 3580 0 ));
+ DATA(insert ( 4054 23 23 5 s 521 3580 0 ));
+
+ /*
+ * numeric_minmax_ops
+ */
+ DATA(insert ( 4055 1700 1700 1 s 1754 3580 0 ));
+ DATA(insert ( 4055 1700 1700 2 s 1755 3580 0 ));
+ DATA(insert ( 4055 1700 1700 3 s 1752 3580 0 ));
+ DATA(insert ( 4055 1700 1700 4 s 1757 3580 0 ));
+ DATA(insert ( 4055 1700 1700 5 s 1756 3580 0 ));
+
+ /*
+ * text_minmax_ops
+ */
+ DATA(insert ( 4056 25 25 1 s 664 3580 0 ));
+ DATA(insert ( 4056 25 25 2 s 665 3580 0 ));
+ DATA(insert ( 4056 25 25 3 s 98 3580 0 ));
+ DATA(insert ( 4056 25 25 4 s 667 3580 0 ));
+ DATA(insert ( 4056 25 25 5 s 666 3580 0 ));
+
+ /*
+ * time_minmax_ops
+ */
+ DATA(insert ( 4057 1083 1083 1 s 1110 3580 0 ));
+ DATA(insert ( 4057 1083 1083 2 s 1111 3580 0 ));
+ DATA(insert ( 4057 1083 1083 3 s 1108 3580 0 ));
+ DATA(insert ( 4057 1083 1083 4 s 1113 3580 0 ));
+ DATA(insert ( 4057 1083 1083 5 s 1112 3580 0 ));
+
+ /*
+ * timetz_minmax_ops
+ */
+ DATA(insert ( 4058 1266 1266 1 s 1552 3580 0 ));
+ DATA(insert ( 4058 1266 1266 2 s 1553 3580 0 ));
+ DATA(insert ( 4058 1266 1266 3 s 1550 3580 0 ));
+ DATA(insert ( 4058 1266 1266 4 s 1555 3580 0 ));
+ DATA(insert ( 4058 1266 1266 5 s 1554 3580 0 ));
+
+ /*
+ * timestamp_minmax_ops
+ */
+ DATA(insert ( 4059 1114 1114 1 s 2062 3580 0 ));
+ DATA(insert ( 4059 1114 1114 2 s 2063 3580 0 ));
+ DATA(insert ( 4059 1114 1114 3 s 2060 3580 0 ));
+ DATA(insert ( 4059 1114 1114 4 s 2065 3580 0 ));
+ DATA(insert ( 4059 1114 1114 5 s 2064 3580 0 ));
+
+ /*
+ * timestamptz_minmax_ops
+ */
+ DATA(insert ( 4060 1184 1184 1 s 1322 3580 0 ));
+ DATA(insert ( 4060 1184 1184 2 s 1323 3580 0 ));
+ DATA(insert ( 4060 1184 1184 3 s 1320 3580 0 ));
+ DATA(insert ( 4060 1184 1184 4 s 1325 3580 0 ));
+ DATA(insert ( 4060 1184 1184 5 s 1324 3580 0 ));
+
+ /*
+ * date_minmax_ops
+ */
+ DATA(insert ( 4061 1082 1082 1 s 1095 3580 0 ));
+ DATA(insert ( 4061 1082 1082 2 s 1096 3580 0 ));
+ DATA(insert ( 4061 1082 1082 3 s 1093 3580 0 ));
+ DATA(insert ( 4061 1082 1082 4 s 1098 3580 0 ));
+ DATA(insert ( 4061 1082 1082 5 s 1097 3580 0 ));
+
+ /*
+ * char_minmax_ops
+ */
+ DATA(insert ( 4062 18 18 1 s 631 3580 0 ));
+ DATA(insert ( 4062 18 18 2 s 632 3580 0 ));
+ DATA(insert ( 4062 18 18 3 s 92 3580 0 ));
+ DATA(insert ( 4062 18 18 4 s 634 3580 0 ));
+ DATA(insert ( 4062 18 18 5 s 633 3580 0 ));
+
#endif /* PG_AMOP_H */
*** a/src/include/catalog/pg_amproc.h
--- b/src/include/catalog/pg_amproc.h
***************
*** 431,434 **** DATA(insert ( 4017 25 25 3 4029 ));
--- 431,507 ----
DATA(insert ( 4017 25 25 4 4030 ));
DATA(insert ( 4017 25 25 5 4031 ));
+ /* minmax */
+ DATA(insert ( 4054 23 23 1 3383 ));
+ DATA(insert ( 4054 23 23 2 3384 ));
+ DATA(insert ( 4054 23 23 3 3385 ));
+ DATA(insert ( 4054 23 23 4 66 ));
+ DATA(insert ( 4054 23 23 5 149 ));
+ DATA(insert ( 4054 23 23 6 150 ));
+ DATA(insert ( 4054 23 23 7 147 ));
+
+ DATA(insert ( 4055 1700 1700 1 3383 ));
+ DATA(insert ( 4055 1700 1700 2 3384 ));
+ DATA(insert ( 4055 1700 1700 3 3385 ));
+ DATA(insert ( 4055 1700 1700 4 1722 ));
+ DATA(insert ( 4055 1700 1700 5 1723 ));
+ DATA(insert ( 4055 1700 1700 6 1721 ));
+ DATA(insert ( 4055 1700 1700 7 1720 ));
+
+ DATA(insert ( 4056 25 25 1 3383 ));
+ DATA(insert ( 4056 25 25 2 3384 ));
+ DATA(insert ( 4056 25 25 3 3385 ));
+ DATA(insert ( 4056 25 25 4 740 ));
+ DATA(insert ( 4056 25 25 5 741 ));
+ DATA(insert ( 4056 25 25 6 743 ));
+ DATA(insert ( 4056 25 25 7 742 ));
+
+ DATA(insert ( 4057 1083 1083 1 3383 ));
+ DATA(insert ( 4057 1083 1083 2 3384 ));
+ DATA(insert ( 4057 1083 1083 3 3385 ));
+ DATA(insert ( 4057 1083 1083 4 1102 ));
+ DATA(insert ( 4057 1083 1083 5 1103 ));
+ DATA(insert ( 4057 1083 1083 6 1105 ));
+ DATA(insert ( 4057 1083 1083 7 1104 ));
+
+ DATA(insert ( 4058 1266 1266 1 3383 ));
+ DATA(insert ( 4058 1266 1266 2 3384 ));
+ DATA(insert ( 4058 1266 1266 3 3385 ));
+ DATA(insert ( 4058 1266 1266 4 1354 ));
+ DATA(insert ( 4058 1266 1266 5 1355 ));
+ DATA(insert ( 4058 1266 1266 6 1356 ));
+ DATA(insert ( 4058 1266 1266 7 1357 ));
+
+ DATA(insert ( 4059 1114 1114 1 3383 ));
+ DATA(insert ( 4059 1114 1114 2 3384 ));
+ DATA(insert ( 4059 1114 1114 3 3385 ));
+ DATA(insert ( 4059 1114 1114 4 2054 ));
+ DATA(insert ( 4059 1114 1114 5 2055 ));
+ DATA(insert ( 4059 1114 1114 6 2056 ));
+ DATA(insert ( 4059 1114 1114 7 2057 ));
+
+ DATA(insert ( 4060 1184 1184 1 3383 ));
+ DATA(insert ( 4060 1184 1184 2 3384 ));
+ DATA(insert ( 4060 1184 1184 3 3385 ));
+ DATA(insert ( 4060 1184 1184 4 1154 ));
+ DATA(insert ( 4060 1184 1184 5 1155 ));
+ DATA(insert ( 4060 1184 1184 6 1156 ));
+ DATA(insert ( 4060 1184 1184 7 1157 ));
+
+ DATA(insert ( 4061 1082 1082 1 3383 ));
+ DATA(insert ( 4061 1082 1082 2 3384 ));
+ DATA(insert ( 4061 1082 1082 3 3385 ));
+ DATA(insert ( 4061 1082 1082 4 1087 ));
+ DATA(insert ( 4061 1082 1082 5 1088 ));
+ DATA(insert ( 4061 1082 1082 6 1090 ));
+ DATA(insert ( 4061 1082 1082 7 1089 ));
+
+ DATA(insert ( 4062 18 18 1 3383 ));
+ DATA(insert ( 4062 18 18 2 3384 ));
+ DATA(insert ( 4062 18 18 3 3385 ));
+ DATA(insert ( 4062 18 18 4 1246 ));
+ DATA(insert ( 4062 18 18 5 72 ));
+ DATA(insert ( 4062 18 18 6 74 ));
+ DATA(insert ( 4062 18 18 7 73 ));
+
#endif /* PG_AMPROC_H */
*** a/src/include/catalog/pg_opclass.h
--- b/src/include/catalog/pg_opclass.h
***************
*** 235,239 **** DATA(insert ( 403 jsonb_ops PGNSP PGUID 4033 3802 t 0 ));
--- 235,248 ----
DATA(insert ( 405 jsonb_ops PGNSP PGUID 4034 3802 t 0 ));
DATA(insert ( 2742 jsonb_ops PGNSP PGUID 4036 3802 t 25 ));
DATA(insert ( 2742 jsonb_path_ops PGNSP PGUID 4037 3802 f 23 ));
+ DATA(insert ( 3580 int4_minmax_ops PGNSP PGUID 4054 23 t 0 ));
+ DATA(insert ( 3580 numeric_minmax_ops PGNSP PGUID 4055 1700 t 0 ));
+ DATA(insert ( 3580 text_minmax_ops PGNSP PGUID 4056 25 t 0 ));
+ DATA(insert ( 3580 time_minmax_ops PGNSP PGUID 4057 1083 t 0 ));
+ DATA(insert ( 3580 timetz_minmax_ops PGNSP PGUID 4058 1266 t 0 ));
+ DATA(insert ( 3580 timestamp_minmax_ops PGNSP PGUID 4059 1114 t 0 ));
+ DATA(insert ( 3580 timestamptz_minmax_ops PGNSP PGUID 4060 1184 t 0 ));
+ DATA(insert ( 3580 date_minmax_ops PGNSP PGUID 4061 1082 t 0 ));
+ DATA(insert ( 3580 char_minmax_ops PGNSP PGUID 4062 18 t 0 ));
#endif /* PG_OPCLASS_H */
*** a/src/include/catalog/pg_opfamily.h
--- b/src/include/catalog/pg_opfamily.h
***************
*** 157,160 **** DATA(insert OID = 4035 ( 783 jsonb_ops PGNSP PGUID ));
--- 157,170 ----
DATA(insert OID = 4036 ( 2742 jsonb_ops PGNSP PGUID ));
DATA(insert OID = 4037 ( 2742 jsonb_path_ops PGNSP PGUID ));
+ DATA(insert OID = 4054 ( 3580 int4_minmax_ops PGNSP PGUID ));
+ DATA(insert OID = 4055 ( 3580 numeric_minmax_ops PGNSP PGUID ));
+ DATA(insert OID = 4056 ( 3580 text_minmax_ops PGNSP PGUID ));
+ DATA(insert OID = 4057 ( 3580 time_minmax_ops PGNSP PGUID ));
+ DATA(insert OID = 4058 ( 3580 timetz_minmax_ops PGNSP PGUID ));
+ DATA(insert OID = 4059 ( 3580 timestamp_minmax_ops PGNSP PGUID ));
+ DATA(insert OID = 4060 ( 3580 timestamptz_minmax_ops PGNSP PGUID ));
+ DATA(insert OID = 4061 ( 3580 date_minmax_ops PGNSP PGUID ));
+ DATA(insert OID = 4062 ( 3580 char_minmax_ops PGNSP PGUID ));
+
#endif /* PG_OPFAMILY_H */
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
***************
*** 565,570 **** DESCR("btree(internal)");
--- 565,598 ----
DATA(insert OID = 2785 ( btoptions PGNSP PGUID 12 1 0 0 0 f f f f t f s 2 0 17 "1009 16" _null_ _null_ _null_ _null_ btoptions _null_ _null_ _null_ ));
DESCR("btree(internal)");
+ DATA(insert OID = 3789 ( mmgetbitmap PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 20 "2281 2281" _null_ _null_ _null_ _null_ mmgetbitmap _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3790 ( mminsert PGNSP PGUID 12 1 0 0 0 f f f f t f v 6 0 16 "2281 2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mminsert _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3791 ( mmbeginscan PGNSP PGUID 12 1 0 0 0 f f f f t f v 3 0 2281 "2281 2281 2281" _null_ _null_ _null_ _null_ mmbeginscan _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3792 ( mmrescan PGNSP PGUID 12 1 0 0 0 f f f f t f v 5 0 2278 "2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmrescan _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3793 ( mmendscan PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmendscan _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3794 ( mmmarkpos PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmmarkpos _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3795 ( mmrestrpos PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmrestrpos _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3796 ( mmbuild PGNSP PGUID 12 1 0 0 0 f f f f t f v 3 0 2281 "2281 2281 2281" _null_ _null_ _null_ _null_ mmbuild _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3797 ( mmbuildempty PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmbuildempty _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3798 ( mmbulkdelete PGNSP PGUID 12 1 0 0 0 f f f f t f v 4 0 2281 "2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmbulkdelete _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3799 ( mmvacuumcleanup PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ mmvacuumcleanup _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3800 ( mmcostestimate PGNSP PGUID 12 1 0 0 0 f f f f t f v 7 0 2278 "2281 2281 2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmcostestimate _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3801 ( mmoptions PGNSP PGUID 12 1 0 0 0 f f f f t f s 2 0 17 "1009 16" _null_ _null_ _null_ _null_ mmoptions _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+
+
DATA(insert OID = 339 ( poly_same PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "604 604" _null_ _null_ _null_ _null_ poly_same _null_ _null_ _null_ ));
DATA(insert OID = 340 ( poly_contain PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "604 604" _null_ _null_ _null_ _null_ poly_contain _null_ _null_ _null_ ));
DATA(insert OID = 341 ( poly_left PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "604 604" _null_ _null_ _null_ _null_ poly_left _null_ _null_ _null_ ));
***************
*** 4064,4069 **** DATA(insert OID = 2747 ( arrayoverlap PGNSP PGUID 12 1 0 0 0 f f f f t f i
--- 4092,4105 ----
DATA(insert OID = 2748 ( arraycontains PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "2277 2277" _null_ _null_ _null_ _null_ arraycontains _null_ _null_ _null_ ));
DATA(insert OID = 2749 ( arraycontained PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "2277 2277" _null_ _null_ _null_ _null_ arraycontained _null_ _null_ _null_ ));
+ /* Minmax */
+ DATA(insert OID = 3383 ( minmax_sortable_opcinfo PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ mmSortableOpcInfo _null_ _null_ _null_ ));
+ DESCR("MinMax sortable datatype support");
+ DATA(insert OID = 3384 ( minmax_sortable_add_value PGNSP PGUID 12 1 0 0 0 f f f f t f i 5 0 16 "2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmSortableAddValue _null_ _null_ _null_ ));
+ DESCR("MinMax sortable datatype support");
+ DATA(insert OID = 3385 ( minmax_sortable_consistent PGNSP PGUID 12 1 0 0 0 f f f f t f i 3 0 16 "2281 2281 2281" _null_ _null_ _null_ _null_ mmSortableConsistent _null_ _null_ _null_ ));
+ DESCR("MinMax sortable datatype support");
+
/* userlock replacements */
DATA(insert OID = 2880 ( pg_advisory_lock PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "20" _null_ _null_ _null_ _null_ pg_advisory_lock_int8 _null_ _null_ _null_ ));
DESCR("obtain exclusive advisory lock");
*** a/src/include/storage/bufpage.h
--- b/src/include/storage/bufpage.h
***************
*** 393,398 **** extern void PageInit(Page page, Size pageSize, Size specialSize);
--- 393,400 ----
extern bool PageIsVerified(Page page, BlockNumber blkno);
extern OffsetNumber PageAddItem(Page page, Item item, Size size,
OffsetNumber offsetNumber, bool overwrite, bool is_heap);
+ extern void PageOverwriteItemData(Page page, OffsetNumber offset, Item item,
+ Size size);
extern Page PageGetTempPage(Page page);
extern Page PageGetTempPageCopy(Page page);
extern Page PageGetTempPageCopySpecial(Page page);
***************
*** 403,408 **** extern Size PageGetExactFreeSpace(Page page);
--- 405,412 ----
extern Size PageGetHeapFreeSpace(Page page);
extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
+ extern void PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos,
+ int nitems);
extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
*** a/src/include/utils/selfuncs.h
--- b/src/include/utils/selfuncs.h
***************
*** 195,200 **** extern Datum hashcostestimate(PG_FUNCTION_ARGS);
--- 195,201 ----
extern Datum gistcostestimate(PG_FUNCTION_ARGS);
extern Datum spgcostestimate(PG_FUNCTION_ARGS);
extern Datum gincostestimate(PG_FUNCTION_ARGS);
+ extern Datum mmcostestimate(PG_FUNCTION_ARGS);
/* Functions in array_selfuncs.c */
*** a/src/test/regress/expected/opr_sanity.out
--- b/src/test/regress/expected/opr_sanity.out
***************
*** 1591,1596 **** ORDER BY 1, 2, 3;
--- 1591,1601 ----
2742 | 9 | ?
2742 | 10 | ?|
2742 | 11 | ?&
+ 3580 | 1 | <
+ 3580 | 2 | <=
+ 3580 | 3 | =
+ 3580 | 4 | >=
+ 3580 | 5 | >
4000 | 1 | <<
4000 | 1 | ~<~
4000 | 2 | &<
***************
*** 1613,1619 **** ORDER BY 1, 2, 3;
4000 | 15 | >
4000 | 16 | @>
4000 | 18 | =
! (80 rows)
-- Check that all opclass search operators have selectivity estimators.
-- This is not absolutely required, but it seems a reasonable thing
--- 1618,1624 ----
4000 | 15 | >
4000 | 16 | @>
4000 | 18 | =
! (85 rows)
-- Check that all opclass search operators have selectivity estimators.
-- This is not absolutely required, but it seems a reasonable thing
***************
*** 1775,1785 **** WHERE NOT (
-- GIN has six support functions. 1-3 are mandatory, 5 is optional, and
-- at least one of 4 and 6 must be given.
-- SP-GiST has five support functions, all mandatory
amname = 'btree' AND procnums @> '{1}' OR
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}'
);
amname | opfname | amproclefttype | amprocrighttype | procnums
--------+---------+----------------+-----------------+----------
--- 1780,1792 ----
-- GIN has six support functions. 1-3 are mandatory, 5 is optional, and
-- at least one of 4 and 6 must be given.
-- SP-GiST has five support functions, all mandatory
+ -- MinMax has seven support functions, all mandatory
amname = 'btree' AND procnums @> '{1}' OR
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}' OR
! amname = 'minmax' AND procnums = '{1, 2, 3, 4, 5, 6, 7}'
);
amname | opfname | amproclefttype | amprocrighttype | procnums
--------+---------+----------------+-----------------+----------
***************
*** 1800,1806 **** WHERE NOT (
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}'
);
amname | opcname | procnums
--------+---------+----------
--- 1807,1814 ----
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}' OR
! amname = 'minmax' AND procnums = '{1, 2, 3, 4, 5, 6, 7}'
);
amname | opcname | procnums
--------+---------+----------
*** a/src/test/regress/sql/opr_sanity.sql
--- b/src/test/regress/sql/opr_sanity.sql
***************
*** 1178,1188 **** WHERE NOT (
-- GIN has six support functions. 1-3 are mandatory, 5 is optional, and
-- at least one of 4 and 6 must be given.
-- SP-GiST has five support functions, all mandatory
amname = 'btree' AND procnums @> '{1}' OR
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}'
);
-- Also, check if there are any pg_opclass entries that don't seem to have
--- 1178,1190 ----
-- GIN has six support functions. 1-3 are mandatory, 5 is optional, and
-- at least one of 4 and 6 must be given.
-- SP-GiST has five support functions, all mandatory
+ -- MinMax has seven support functions, all mandatory
amname = 'btree' AND procnums @> '{1}' OR
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}' OR
! amname = 'minmax' AND procnums = '{1, 2, 3, 4, 5, 6, 7}'
);
-- Also, check if there are any pg_opclass entries that don't seem to have
***************
*** 1201,1207 **** WHERE NOT (
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}'
);
-- Unfortunately, we can't check the amproc link very well because the
--- 1203,1210 ----
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}' OR
! amname = 'minmax' AND procnums = '{1, 2, 3, 4, 5, 6, 7}'
);
-- Unfortunately, we can't check the amproc link very well because the
On 08/07/2014 08:38 AM, Oleg Bartunov wrote:
+1 for BRIN !
On Thu, Aug 7, 2014 at 6:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 7 August 2014 14:53, Robert Haas <robertmhaas@gmail.com> wrote:
A better description would be "block range index" since we are
indexing a range of blocks (not just one block). Perhaps a better one
would be simply "range index", which we could abbreviate to RIN or
BRIN.
How about Block Range Dynamic indexes?
Or Range Usage Metadata indexes?
You see what I'm getting at:
BRanDy
RUM
... to keep with our "new indexes" naming scheme ...
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Fri, Aug 8, 2014 at 9:47 AM, Josh Berkus <josh@agliodbs.com> wrote:
On 08/07/2014 08:38 AM, Oleg Bartunov wrote:
+1 for BRIN !
On Thu, Aug 7, 2014 at 6:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 7 August 2014 14:53, Robert Haas <robertmhaas@gmail.com> wrote:
A better description would be "block range index" since we are
indexing a range of blocks (not just one block). Perhaps a better one
would be simply "range index", which we could abbreviate to RIN or
BRIN.
How about Block Range Dynamic indexes?
Or Range Usage Metadata indexes?
You see what I'm getting at:
BRanDy
RUM
... to keep with our "new indexes" naming scheme ...
Not the best fit for kids, fine for grad students.
BRIN seems to be a perfect consensus, so +1 for it.
--
Michael
On Thu, Aug 7, 2014 at 7:58 AM, Robert Haas <robertmhaas@gmail.com> wrote:
range index might get confused with range types; block range index
seems better. I like summary, but I'm fine with block range index or
block filter index, too.
+1
--
Peter Geoghegan
On 08/07/2014 05:52 PM, Michael Paquier wrote:
On Fri, Aug 8, 2014 at 9:47 AM, Josh Berkus <josh@agliodbs.com> wrote:
On 08/07/2014 08:38 AM, Oleg Bartunov wrote:
+1 for BRIN !
On Thu, Aug 7, 2014 at 6:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 7 August 2014 14:53, Robert Haas <robertmhaas@gmail.com> wrote:
A better description would be "block range index" since we are
indexing a range of blocks (not just one block). Perhaps a better one
would be simply "range index", which we could abbreviate to RIN or
BRIN.
How about Block Range Dynamic indexes?
Or Range Usage Metadata indexes?
You see what I'm getting at:
BRanDy
RUM
... to keep with our "new indexes" naming scheme ...
Not the best fit for kids, fine for grad students.
But, it goes perfectly with our GIN and VODKA indexes.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 08/06/2014 05:35 AM, Alvaro Herrera wrote:
FWIW I think I haven't responded appropriately to the points raised by
Heikki. Basically, as I see it there are three main items:
1. the revmap physical-to-logical mapping is too complex; let's use
something else.
We had revmap originally in a separate fork. The current approach grew
out of the necessity of putting it in the main fork while ensuring that
fast access to individual pages is possible. There are of course many
ways to skin this cat; Heikki's proposal is to have it always occupy the
first few physical pages, rather than require a logical-to-physical
mapping table. To implement this he proposes to move other pages out of
the way as the index grows. I don't really have much love for this
idea. We can change how this is implemented later in the cycle, if we
find that a different approach is better than my proposal. I don't want
to spend endless time meddling with this (and I definitely don't want to
have this delay the eventual commit of the patch.)
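To make the mapping concrete, here is a small standalone sketch of how a heap
block would find its revmap slot under the current design, going by the
structures in minmax_page.h. This is not patch code: the per-page item counts
are illustrative stand-ins for REGULAR_REVMAP_PAGE_MAXITEMS and
ARRAY_REVMAP_PAGE_MAXITEMS, whose real values depend on BLCKSZ, and the exact
arithmetic is an assumption about how the pieces fit together.

/*
 * Standalone model of the logical-to-physical revmap addressing: a heap
 * block maps to a page range, the range to a slot on a logical revmap
 * page, and the logical revmap page is located through an array page
 * listed in the metapage.
 */
#include <stdio.h>

#define PAGES_PER_RANGE        128    /* MINMAX_DEFAULT_PAGES_PER_RANGE */
#define REVMAP_ITEMS_PER_PAGE  1360   /* stand-in, not the BLCKSZ math */
#define ARRAY_ITEMS_PER_PAGE   2030   /* stand-in, not the BLCKSZ math */

int
main(void)
{
	unsigned heapBlk = 1000000;

	unsigned rangeNo = heapBlk / PAGES_PER_RANGE;
	unsigned revmapLogBlk = rangeNo / REVMAP_ITEMS_PER_PAGE;
	unsigned revmapSlot = rangeNo % REVMAP_ITEMS_PER_PAGE;
	unsigned arrayIdx = revmapLogBlk / ARRAY_ITEMS_PER_PAGE;
	unsigned arraySlot = revmapLogBlk % ARRAY_ITEMS_PER_PAGE;

	printf("heap block %u -> range %u -> revmap logical page %u, slot %u\n",
	       heapBlk, rangeNo, revmapLogBlk, revmapSlot);
	printf("logical page %u is reached via metapage entry %u, array slot %u\n",
	       revmapLogBlk, arrayIdx, arraySlot);
	return 0;
}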
Please also note that LockTuple is pretty expensive, compared to
lightweight locks. Remember how Robert made hash indexes signifcantly
faster a couple of years ago (commit 76837c15) by removing the need for
heavy-weight locks during queries. To demonstrate that, I applied your
patch, and ran a very simple test:
create table numbers as select i*1000+j as n from generate_series(0,
19999) i, generate_series(1, 1000) j;
create index number_minmax on numbers using minmax (n) with
(pages_per_range=1);
I ran "explain analyze select * from numbers where n = 10;" a few times
under "perf" profiler. The full profile is attached, but here's the top 10:
Samples: 3K of event 'cycles', Event count (approx.): 2332550418
+ 24.15% postmaster postgres [.] hash_search_with_hash_value
+ 10.55% postmaster postgres [.] LWLockAcquireCommon
+ 7.12% postmaster postgres [.] hash_any
+ 6.77% postmaster postgres [.] minmax_deform_tuple
+ 6.67% postmaster postgres [.] LWLockRelease
+ 5.55% postmaster postgres [.] AllocSetAlloc
+ 4.37% postmaster postgres [.] SetupLockInTable.isra.2
+ 2.79% postmaster postgres [.] LockRelease
+ 2.67% postmaster postgres [.] LockAcquireExtended
+ 2.54% postmaster postgres [.] mmgetbitmap
If you drill into those functions, you'll see that most of the time
spent in hash_search_with_hash_value, LWLockAcquireCommon and hash_any
are coming from heavy-weight lock handling. At a rough estimate, about
1/3 of the CPU time is spent on LockTuple/UnlockTuple.
Maybe we don't care because it's fast enough anyway, but it just seems
like we're leaving a lot of money on the table. Because of that, and all
the other reasons already discussed, I strongly feel that this design
should be changed.
3. avoid MMTuple as it is just unnecessary extra complexity.
The main thing that MMTuple adds is not the fact that we save 2 bytes
by storing BlockNumber as is instead of within a TID field. Instead,
it's that we can construct and deconstruct using our own design, which
means we can use however many Datum entries we want and however many
"null" flags. In normal heap and index tuples, there are always the
same number of datum/nulls. In minmax, the number of nulls is twice the
number of indexed columns; the number of datum values is determined by
how many datum values are stored per opclass ("sortable" opclasses
store 2 columns, but geometry would store only one).
Hmm. Why is the number of null bits 2x the number of indexed columns? I
would expect there to be one null bit per stored Datum.
(/me looks at the patch):
/*
* We need a double-length bitmap on an on-disk minmax index tuple;
* the first half stores the "allnulls" bits, the second stores
* "hasnulls".
*/
So, one bit means whether there are any heap tuples with a NULL in the
indexed column, and the other bit means if the value stored for that
column is a NULL. Does that mean that it's not possible to store a NULL
minimum, but non-NULL maximum, for a single column? I can't immediately
think of an example where you'd want to do that, but I'm also not
convinced that no opclass would ever want that. Individual bits are
cheap, so I'm inclined to rather have too many of them than regret later.
In any case, it should be documented in minmax_tuple.h what those
null-bits are and how they're laid out in the bitmap. The comment there
currently just says that there are "two null bits for each value stored"
(which isn't actually what the code does, because you're storing two bits per
indexed column, not two bits per value stored; but I just suggested changing
that, after which the comment would be correct).
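For illustration, here is a standalone sketch of that double-length bitmap
layout; it is my own model, not code from the patch. For natts indexed columns
there are 2 * natts bits: the first natts are the "allnulls" bits and the
second natts the "hasnulls" bits, one pair per indexed column rather than per
stored value.

#include <stdint.h>
#include <stdio.h>

/* test bit n of a little-endian bit array, as a model of the null bitmap */
static int
bitmap_test(const uint8_t *bits, int n)
{
	return (bits[n / 8] & (1 << (n % 8))) != 0;
}

int
main(void)
{
	int     natts = 3;          /* indexed columns */
	uint8_t nullbits[1] = {0};  /* 2 * natts = 6 bits fit in one byte */

	nullbits[0] |= 1 << 1;           /* allnulls half: column 1 */
	nullbits[0] |= 1 << (natts + 2); /* hasnulls half: column 2 */

	for (int att = 0; att < natts; att++)
		printf("column %d: allnulls=%d hasnulls=%d\n", att,
		       bitmap_test(nullbits, att),
		       bitmap_test(nullbits, natts + att));
	return 0;
}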
PS. Please add regression tests. It would also be good to implement at
least one other opclass than the b-tree based ones, to make sure that
the code actually works with something else too. I'd suggest
implementing the bounding box opclass for points, that seems simple.
- Heikki
I think there's a race condition in mminsert, if two backends insert a
tuple to the same heap page range concurrently. mminsert does this:
1. Fetch the MMtuple for the page range
2. Check if any of the stored datums need updating
3. Unlock the page.
4. Lock the page again in exclusive mode.
5. Update the tuple.
It's possible that two backends arrive at phase 3 at the same time, with
different values. For example, backend A wants to update the minimum to
contain 10, and backend B wants to update it to 5. Now, if backend B
gets to update the tuple first, to 5, backend A will update the tuple to
10 when it gets the lock, which is wrong.
The simplest solution would be to get the buffer lock in exclusive mode
to begin with, so that you don't need to release it between steps 2 and
5. That might be a significant hit on concurrency, though, when most of
the insertions don't in fact have to update the value. Another idea is
to re-check the updated values after acquiring the lock in exclusive
mode, to see if they match the previous values.
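A standalone model of that second idea follows (pthreads stand in for buffer
locks; this is not backend code): re-check after taking the lock in exclusive
mode and only ever widen the stored bounds, so backend A cannot overwrite the
wider value backend B already wrote.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t range_lock = PTHREAD_MUTEX_INITIALIZER;
static int range_min = 10;
static int range_max = 20;

static void
range_add_value(int newval)
{
	/* models the check done under the shared lock: already covered? */
	if (newval >= range_min && newval <= range_max)
		return;

	pthread_mutex_lock(&range_lock);
	/* re-check under the exclusive lock; only widen, never narrow */
	if (newval < range_min)
		range_min = newval;
	if (newval > range_max)
		range_max = newval;
	pthread_mutex_unlock(&range_lock);
}

int
main(void)
{
	range_add_value(5);     /* backend B inserts 5: min becomes 5 */
	range_add_value(10);    /* backend A inserts 10: re-check keeps min=5 */
	printf("min=%d max=%d\n", range_min, range_max);
	return 0;
}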
- Heikki
Another race condition:
If a new tuple is inserted into the range while summarization runs, it's
possible that the new tuple isn't included in the tuple that the
summarization calculated, nor does the insertion itself update it.
1. There is no index tuple for page range 1-10
2. Summarization begins. It scans pages 1-5.
3. A new insertion inserts a heap tuple to page 1.
4. The insertion sees that there is no index tuple covering range 1-10,
so it doesn't update it.
5. The summarization finishes scanning pages 5-10, and inserts the new
index tuple. The summarization didn't see the newly inserted heap tuple,
and hence it's not included in the calculated index tuple.
One idea is to do the summarization in two stages. First, insert a
placeholder tuple, with no real value in it. A query considers the
placeholder tuple the same as a missing tuple, ie. always considers it a
match. An insertion updates the placeholder tuple with the value
inserted, as if it was a regular mmtuple. After summarization has
finished scanning the page range, it turns the placeholder tuple into a
regular tuple, by unioning the placeholder value with the value formed
by scanning the heap.
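Here is a standalone model of the placeholder idea (my sketch, not patch
code): concurrent inserts accumulate into the placeholder while the heap scan
runs, and the final index tuple is the union of the scanned summary with
whatever the placeholder picked up.

#include <stdbool.h>
#include <stdio.h>

typedef struct Summary
{
	bool empty;   /* no values seen yet; the placeholder starts this way */
	int  min, max;
} Summary;

static void
summary_add(Summary *s, int val)
{
	if (s->empty)
	{
		s->empty = false;
		s->min = s->max = val;
		return;
	}
	if (val < s->min)
		s->min = val;
	if (val > s->max)
		s->max = val;
}

static void
summary_union(Summary *into, const Summary *from)
{
	if (from->empty)
		return;
	summary_add(into, from->min);
	summary_add(into, from->max);
}

int
main(void)
{
	Summary placeholder = {true, 0, 0}; /* inserted before the scan starts */
	Summary scanned = {true, 0, 0};     /* built by scanning the page range */

	summary_add(&placeholder, 42);      /* concurrent insert missed by scan */
	summary_add(&scanned, 10);          /* values seen during the scan */
	summary_add(&scanned, 30);

	summary_union(&scanned, &placeholder);  /* final tuple covers both */
	printf("range: %d..%d\n", scanned.min, scanned.max);
	return 0;
}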
- Heikki
I couldn't resist starting to hack on this, and implemented the scheme
I've been having in mind:
1. MMTuple contains the block number of the heap page (range) that the
tuple represents. Vacuum is no longer needed to clean up old tuples;
when an index tuple is updated, the old tuple is deleted atomically
with the insertion of a new tuple and updating the revmap, so no garbage
is left behind.
2. LockTuple is gone. When following the pointer from revmap to MMTuple,
the block number is used to check that you land on the right tuple. If
not, the search is started over, looking at the revmap again.
I'm sure this still needs some cleanup, but here's the patch, based on
your v14. Now that I know what this approach looks like, I still like it
much better. The insert and update code is somewhat more complicated,
because you have to be careful to lock the old page, new page, and
revmap page in the right order. But it's not too bad, and it gets rid of
all the complexity in vacuum.
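A standalone model of the lookup this implies (hypothetical names and a toy
in-memory "index", not the code in the attached patch): follow the revmap to a
tuple, verify that the tuple's stored block number is the range we asked
about, and retry from the revmap if a concurrent update has moved it.

#include <stdio.h>

typedef struct ModelMMTuple
{
	unsigned heapBlk;   /* first heap block of the range this summarizes */
	int      min, max;
} ModelMMTuple;

static ModelMMTuple index_items[4] = {
	{0, 1, 9}, {256, 10, 19}, {128, 20, 29}, {384, 30, 39}
};
static int revmap[4] = {0, 2, 1, 3};    /* range number -> item slot */

static const ModelMMTuple *
fetch_summary(unsigned heapBlk, unsigned pagesPerRange)
{
	unsigned rangeNo = heapBlk / pagesPerRange;

	for (;;)
	{
		int slot = revmap[rangeNo];                 /* revmap lookup */
		const ModelMMTuple *tup = &index_items[slot];

		if (tup->heapBlk == rangeNo * pagesPerRange)
			return tup;                             /* right tuple found */
		/* otherwise a concurrent update moved it: retry from the revmap
		 * (in this single-threaded model the first try always succeeds) */
	}
}

int
main(void)
{
	const ModelMMTuple *tup = fetch_summary(300, 128);

	printf("range starting at block %u: %d..%d\n",
	       tup->heapBlk, tup->min, tup->max);
	return 0;
}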
- Heikki
Attachments:
minmax-v14-heikki-2.patch (text/x-diff)
diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile
index f10229d..45b5b6c 100644
--- a/contrib/pageinspect/Makefile
+++ b/contrib/pageinspect/Makefile
@@ -1,7 +1,7 @@
# contrib/pageinspect/Makefile
MODULE_big = pageinspect
-OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o $(WIN32RES)
+OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o mmfuncs.o $(WIN32RES)
EXTENSION = pageinspect
DATA = pageinspect--1.2.sql pageinspect--1.0--1.1.sql \
diff --git a/contrib/pageinspect/mmfuncs.c b/contrib/pageinspect/mmfuncs.c
new file mode 100644
index 0000000..17b3d8b
--- /dev/null
+++ b/contrib/pageinspect/mmfuncs.c
@@ -0,0 +1,426 @@
+/*
+ * mmfuncs.c
+ * Functions to investigate MinMax indexes
+ *
+ * Copyright (c) 2014, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/pageinspect/mmfuncs.c
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/minmax.h"
+#include "access/minmax_internal.h"
+#include "access/minmax_page.h"
+#include "access/minmax_revmap.h"
+#include "access/minmax_tuple.h"
+#include "catalog/index.h"
+#include "catalog/pg_type.h"
+#include "funcapi.h"
+#include "utils/array.h"
+#include "utils/builtins.h"
+#include "utils/lsyscache.h"
+#include "utils/rel.h"
+#include "miscadmin.h"
+
+Datum minmax_page_type(PG_FUNCTION_ARGS);
+Datum minmax_page_items(PG_FUNCTION_ARGS);
+Datum minmax_metapage_info(PG_FUNCTION_ARGS);
+Datum minmax_revmap_array_data(PG_FUNCTION_ARGS);
+Datum minmax_revmap_data(PG_FUNCTION_ARGS);
+
+PG_FUNCTION_INFO_V1(minmax_page_type);
+PG_FUNCTION_INFO_V1(minmax_page_items);
+PG_FUNCTION_INFO_V1(minmax_metapage_info);
+PG_FUNCTION_INFO_V1(minmax_revmap_array_data);
+PG_FUNCTION_INFO_V1(minmax_revmap_data);
+
+typedef struct mm_page_state
+{
+ MinmaxDesc *mmdesc;
+ Page page;
+ OffsetNumber offset;
+ bool unusedItem;
+ bool done;
+ AttrNumber attno;
+ DeformedMMTuple *dtup;
+ FmgrInfo outputfn[FLEXIBLE_ARRAY_MEMBER];
+} mm_page_state;
+
+
+static Page verify_minmax_page(bytea *raw_page, uint16 type,
+ const char *strtype);
+
+Datum
+minmax_page_type(PG_FUNCTION_ARGS)
+{
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Page page = VARDATA(raw_page);
+ MinmaxSpecialSpace *special;
+ char *type;
+
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
+
+ switch (special->type)
+ {
+ case MINMAX_PAGETYPE_META:
+ type = "meta";
+ break;
+ case MINMAX_PAGETYPE_REVMAP_ARRAY:
+ type = "revmap array";
+ break;
+ case MINMAX_PAGETYPE_REVMAP:
+ type = "revmap";
+ break;
+ case MINMAX_PAGETYPE_REGULAR:
+ type = "regular";
+ break;
+ default:
+ type = psprintf("unknown (%02x)", special->type);
+ break;
+ }
+
+ PG_RETURN_TEXT_P(cstring_to_text(type));
+}
+
+/*
+ * Verify that the given bytea contains a minmax page of the indicated page
+ * type, or die in the attempt. A pointer to the page is returned.
+ */
+static Page
+verify_minmax_page(bytea *raw_page, uint16 type, const char *strtype)
+{
+ Page page;
+ int raw_page_size;
+ MinmaxSpecialSpace *special;
+
+ raw_page_size = VARSIZE(raw_page) - VARHDRSZ;
+
+ if (raw_page_size < SizeOfPageHeaderData)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("input page too small"),
+ errdetail("Expected size %d, got %d", raw_page_size, BLCKSZ)));
+
+ page = VARDATA(raw_page);
+
+ /* verify the special space says this page is what we want */
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
+ if (special->type != type)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("page is not a Minmax page of type \"%s\"", strtype),
+ errdetail("Expected special type %08x, got %08x.",
+ type, special->type)));
+
+ return page;
+}
+
+
+/*
+ * Extract all item values from a minmax index page
+ *
+ * Usage: SELECT * FROM minmax_page_items(get_raw_page('idx', 1), 'idx'::regclass);
+ */
+Datum
+minmax_page_items(PG_FUNCTION_ARGS)
+{
+ mm_page_state *state;
+ FuncCallContext *fctx;
+
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ (errmsg("must be superuser to use raw page functions"))));
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Oid indexRelid = PG_GETARG_OID(1);
+ Page page;
+ TupleDesc tupdesc;
+ MemoryContext mctx;
+ Relation indexRel;
+ AttrNumber attno;
+
+ /* minimally verify the page we got */
+ page = verify_minmax_page(raw_page, MINMAX_PAGETYPE_REGULAR, "regular");
+
+ /* create a function context for cross-call persistence */
+ fctx = SRF_FIRSTCALL_INIT();
+
+ /* switch to memory context appropriate for multiple function calls */
+ mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ indexRel = index_open(indexRelid, AccessShareLock);
+
+ state = palloc(offsetof(mm_page_state, outputfn) +
+ sizeof(FmgrInfo) * RelationGetDescr(indexRel)->natts);
+
+ state->mmdesc = minmax_build_mmdesc(indexRel);
+ state->page = page;
+ state->offset = FirstOffsetNumber;
+ state->unusedItem = false;
+ state->done = false;
+ state->dtup = NULL;
+
+ index_close(indexRel, AccessShareLock);
+
+ for (attno = 1; attno <= state->mmdesc->md_tupdesc->natts; attno++)
+ {
+ Oid output;
+ bool isVarlena;
+
+ getTypeOutputInfo(state->mmdesc->md_tupdesc->attrs[attno - 1]->atttypid,
+ &output, &isVarlena);
+ fmgr_info(output, &state->outputfn[attno - 1]);
+ }
+
+ fctx->user_fctx = state;
+ fctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+ MemoryContextSwitchTo(mctx);
+ }
+
+ fctx = SRF_PERCALL_SETUP();
+ state = fctx->user_fctx;
+
+ if (!state->done)
+ {
+ HeapTuple result;
+ Datum values[5];
+ bool nulls[5];
+
+ /*
+ * This loop is called once for every attribute of every tuple in the
+ * page. At the start of a tuple, we get a NULL dtup; that's our
+ * signal for obtaining and decoding the next one. If that's not the
+ * case, we output the next attribute.
+ */
+ if (state->dtup == NULL)
+ {
+ MMTuple *tup;
+ MemoryContext mctx;
+ ItemId itemId;
+
+ /* deformed tuple must live across calls */
+ mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx);
+
+ /* verify item status: if there's no data, we can't decode */
+ itemId = PageGetItemId(state->page, state->offset);
+ if (ItemIdIsUsed(itemId))
+ {
+ tup = (MMTuple *) PageGetItem(state->page,
+ PageGetItemId(state->page,
+ state->offset));
+ state->dtup = minmax_deform_tuple(state->mmdesc, tup);
+ state->attno = 1;
+ state->unusedItem = false;
+ }
+ else
+ state->unusedItem = true;
+
+ MemoryContextSwitchTo(mctx);
+ }
+ else
+ state->attno++;
+
+ MemSet(nulls, 0, sizeof(nulls));
+
+ if (state->unusedItem)
+ {
+ values[0] = UInt16GetDatum(state->offset);
+ nulls[1] = true;
+ nulls[2] = true;
+ nulls[3] = true;
+ nulls[4] = true;
+ }
+ else
+ {
+ int att = state->attno - 1;
+
+ values[0] = UInt16GetDatum(state->offset);
+ values[1] = UInt16GetDatum(state->attno);
+ values[2] = BoolGetDatum(state->dtup->dt_columns[att].allnulls);
+ values[3] = BoolGetDatum(state->dtup->dt_columns[att].hasnulls);
+ if (!state->dtup->dt_columns[att].allnulls)
+ {
+ FmgrInfo *outputfn = &state->outputfn[att];
+ MMValues *mmvalues = &state->dtup->dt_columns[att];
+ char *min,
+ *max;
+ char *rangeval;
+
+ /*
+ * XXX -- we assume here that the opclass uses 2 stored
+ * values, which is true for now (only minmax opclasses exist).
+ * Other opclasses might do something different.
+ */
+ min = OutputFunctionCall(outputfn, mmvalues->values[0]);
+ max = OutputFunctionCall(outputfn, mmvalues->values[1]);
+ rangeval = psprintf("%s..%s", min, max);
+ values[4] = CStringGetTextDatum(rangeval);
+
+ }
+ else
+ {
+ nulls[4] = true;
+ }
+ }
+
+ result = heap_form_tuple(fctx->tuple_desc, values, nulls);
+
+ /*
+ * If the item was unused, jump straight to the next one; otherwise,
+ * the only cleanup needed here is to set our signal to go to the next
+ * tuple in the following iteration, by freeing the current one.
+ */
+ if (state->unusedItem)
+ state->offset = OffsetNumberNext(state->offset);
+ else if (state->attno >= state->mmdesc->md_tupdesc->natts)
+ {
+ pfree(state->dtup);
+ state->dtup = NULL;
+ state->offset = OffsetNumberNext(state->offset);
+ }
+
+ /*
+ * If we're beyond the end of the page, set flag to end the function in
+ * the following iteration.
+ */
+ if (state->offset > PageGetMaxOffsetNumber(state->page))
+ state->done = true;
+
+ SRF_RETURN_NEXT(fctx, HeapTupleGetDatum(result));
+ }
+
+ SRF_RETURN_DONE(fctx);
+}
+
+Datum
+minmax_metapage_info(PG_FUNCTION_ARGS)
+{
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Page page;
+ MinmaxMetaPageData *meta;
+ TupleDesc tupdesc;
+ Datum values[3];
+ bool nulls[3];
+ ArrayBuildState *astate = NULL;
+ HeapTuple htup;
+ int i;
+
+ page = verify_minmax_page(raw_page, MINMAX_PAGETYPE_META, "metapage");
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+ tupdesc = BlessTupleDesc(tupdesc);
+
+ /* Extract values from the metapage */
+ meta = (MinmaxMetaPageData *) PageGetContents(page);
+ MemSet(nulls, 0, sizeof(nulls));
+ values[0] = CStringGetTextDatum(psprintf("0x%08X", meta->minmaxMagic));
+ values[1] = Int32GetDatum(meta->minmaxVersion);
+
+ /* Extract (possibly empty) list of revmap array page numbers. */
+ for (i = 0; i < MAX_REVMAP_ARRAYPAGES; i++)
+ {
+ BlockNumber blkno;
+
+ blkno = meta->revmapArrayPages[i];
+ if (blkno == InvalidBlockNumber)
+ break; /* XXX or continue? */
+ astate = accumArrayResult(astate, Int64GetDatum((int64) blkno),
+ false, INT8OID, CurrentMemoryContext);
+ }
+ if (astate == NULL)
+ nulls[2] = true;
+ else
+ values[2] = makeArrayResult(astate, CurrentMemoryContext);
+
+ htup = heap_form_tuple(tupdesc, values, nulls);
+
+ PG_RETURN_DATUM(HeapTupleGetDatum(htup));
+}
+
+/*
+ * Return the BlockNumber array stored in a revmap array page
+ */
+Datum
+minmax_revmap_array_data(PG_FUNCTION_ARGS)
+{
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Page page;
+ ArrayBuildState *astate = NULL;
+ RevmapArrayContents *contents;
+ Datum blkarr;
+ int i;
+
+ page = verify_minmax_page(raw_page, MINMAX_PAGETYPE_REVMAP_ARRAY,
+ "revmap array");
+
+ contents = (RevmapArrayContents *) PageGetContents(page);
+
+ for (i = 0; i < contents->rma_nblocks; i++)
+ astate = accumArrayResult(astate,
+ Int64GetDatum((int64) contents->rma_blocks[i]),
+ false, INT8OID, CurrentMemoryContext);
+ Assert(astate != NULL);
+
+ blkarr = makeArrayResult(astate, CurrentMemoryContext);
+ PG_RETURN_DATUM(blkarr);
+}
+
+/*
+ * Return the TID array stored in a minmax revmap page
+ */
+Datum
+minmax_revmap_data(PG_FUNCTION_ARGS)
+{
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Page page;
+ RevmapContents *contents;
+ TupleDesc tupdesc;
+ Datum values[2];
+ bool nulls[2];
+ HeapTuple htup;
+ ArrayBuildState *astate = NULL;
+ int i;
+
+ page = verify_minmax_page(raw_page, MINMAX_PAGETYPE_REVMAP, "revmap");
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+ tupdesc = BlessTupleDesc(tupdesc);
+
+ /* Extract values from the revmap page */
+ contents = (RevmapContents *) PageGetContents(page);
+ MemSet(nulls, 0, sizeof(nulls));
+ values[0] = Int64GetDatum((uint64) contents->rmr_logblk);
+
+ /* Extract (possibly empty) list of TIDs in this page. */
+ for (i = 0; i < REGULAR_REVMAP_PAGE_MAXITEMS; i++)
+ {
+ ItemPointer tid;
+
+ tid = &contents->rmr_tids[i];
+ astate = accumArrayResult(astate,
+ PointerGetDatum(tid),
+ false, TIDOID, CurrentMemoryContext);
+ }
+ if (astate == NULL)
+ nulls[1] = true;
+ else
+ values[1] = makeArrayResult(astate, CurrentMemoryContext);
+
+ htup = heap_form_tuple(tupdesc, values, nulls);
+
+ PG_RETURN_DATUM(HeapTupleGetDatum(htup));
+}
diff --git a/contrib/pageinspect/pageinspect--1.2.sql b/contrib/pageinspect/pageinspect--1.2.sql
index 15e8e1e..56c9ba8 100644
--- a/contrib/pageinspect/pageinspect--1.2.sql
+++ b/contrib/pageinspect/pageinspect--1.2.sql
@@ -99,6 +99,49 @@ AS 'MODULE_PATHNAME', 'bt_page_items'
LANGUAGE C STRICT;
--
+-- minmax_page_type()
+--
+CREATE FUNCTION minmax_page_type(IN page bytea)
+RETURNS text
+AS 'MODULE_PATHNAME', 'minmax_page_type'
+LANGUAGE C STRICT;
+
+--
+-- minmax_metapage_info()
+--
+CREATE FUNCTION minmax_metapage_info(IN page bytea, OUT magic text,
+ OUT version integer, OUT revmap_array_pages BIGINT[])
+AS 'MODULE_PATHNAME', 'minmax_metapage_info'
+LANGUAGE C STRICT;
+
+--
+-- minmax_page_items()
+--
+CREATE FUNCTION minmax_page_items(IN page bytea, IN index_oid oid,
+ OUT itemoffset int,
+ OUT attnum int,
+ OUT allnulls bool,
+ OUT hasnulls bool,
+ OUT value text)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'minmax_page_items'
+LANGUAGE C STRICT;
+
+--
+-- minmax_revmap_array_data()
+--
+CREATE FUNCTION minmax_revmap_array_data(IN page bytea,
+ OUT revmap_pages BIGINT[])
+AS 'MODULE_PATHNAME', 'minmax_revmap_array_data'
+LANGUAGE C STRICT;
+
+--
+-- minmax_revmap_data()
+--
+CREATE FUNCTION minmax_revmap_data(IN page bytea,
+ OUT logblk BIGINT, OUT pages tid[])
+AS 'MODULE_PATHNAME', 'minmax_revmap_data'
+LANGUAGE C STRICT;
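+
+-- Example usage (index name hypothetical; block numbers depend on the
+-- index layout, but block 0 is always the metapage):
+--   SELECT minmax_page_type(get_raw_page('foo_minmax_idx', 0));
+--   SELECT * FROM minmax_metapage_info(get_raw_page('foo_minmax_idx', 0));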
+
+--
-- fsm_page_contents()
--
CREATE FUNCTION fsm_page_contents(IN page bytea)
diff --git a/contrib/pg_xlogdump/rmgrdesc.c b/contrib/pg_xlogdump/rmgrdesc.c
index cbcaaa6..8ffff06 100644
--- a/contrib/pg_xlogdump/rmgrdesc.c
+++ b/contrib/pg_xlogdump/rmgrdesc.c
@@ -13,6 +13,7 @@
#include "access/gist_private.h"
#include "access/hash.h"
#include "access/heapam_xlog.h"
+#include "access/minmax_xlog.h"
#include "access/multixact.h"
#include "access/nbtree.h"
#include "access/rmgr.h"
diff --git a/doc/src/sgml/brin.sgml b/doc/src/sgml/brin.sgml
new file mode 100644
index 0000000..3fa21a2
--- /dev/null
+++ b/doc/src/sgml/brin.sgml
@@ -0,0 +1,248 @@
+<!-- doc/src/sgml/brin.sgml -->
+
+<chapter id="BRIN">
+<title>BRIN Indexes</title>
+
+ <indexterm>
+ <primary>index</primary>
+ <secondary>BRIN</secondary>
+ </indexterm>
+
+<sect1 id="brin-intro">
+ <title>Introduction</title>
+
+ <para>
+ <acronym>BRIN</acronym> stands for Block Range Index.
+ <acronym>BRIN</acronym> is designed for handling very large tables
+ in which certain columns have some natural correlation with their
+ physical position within the table. For example, a table storing orders
+ might have a date column holding the date on which each order was placed,
+ and much of the time the earlier entries will also appear earlier in the
+ table; or a table storing a ZIP code column might have all codes for a
+ city grouped together naturally. For each block range, some summary
+ information is stored in the index.
+ </para>
+
+ <para>
+ <acronym>BRIN</acronym> indexes can only satisfy queries via the bitmap
+ scanning facility, and will return all tuples in all pages within
+ each range if the summary info stored by the index indicates that some
+ tuples in the range might match the given query conditions. The executor
+ is in charge of rechecking these tuples and discarding those that do not
+ match — in other words, these indexes are lossy.
+ This enables them to work as very fast sequential scan helpers to avoid
+ scanning blocks that are known not to contain matching tuples.
+ </para>
+
+ <para>
+ The specific data that a <acronym>BRIN</acronym> index will store
+ depends on the operator class selected for the data type.
+ Datatypes having a linear sort order can have operator classes that
+ store the minimum and maximum value within each block range, for instance;
+ geometrical types might store the common bounding box.
+ </para>
+
+ <para>
+ The size of the block range is determined at index creation time with
+ the <literal>pages_per_range</literal> storage parameter. The smaller the number, the
+ larger the index becomes (because of the need to store more index entries),
+ but at the same time the summary data stored can be more precise and
+ more data blocks can be skipped.
+ </para>
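+
+ <para>
+ For illustration, an index with a non-default block range size could
+ presumably be created and queried as follows (the table and column names
+ are hypothetical, and the access method name assumes the
+ <literal>minmax</literal> spelling used elsewhere in this patch):
+<programlisting>
+CREATE INDEX orders_order_date_idx ON orders USING minmax (order_date)
+    WITH (pages_per_range = 64);
+
+SELECT * FROM orders
+  WHERE order_date BETWEEN '2013-01-01' AND '2013-01-31';
+</programlisting>
+ Such a query can use the index only through a bitmap scan, with the
+ executor rechecking the returned tuples as described above.
+ </para>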
+
+ <para>
+ The <acronym>BRIN</acronym> implementation in <productname>PostgreSQL</productname>
+ is primarily maintained by Álvaro Herrera.
+ </para>
+</sect1>
+
+<sect1 id="brin-builtin-opclasses">
+ <title>Built-in Operator Classes</title>
+
+ <para>
+ The core <productname>PostgreSQL</productname> distribution
+ includes the <acronym>BRIN</acronym> operator classes shown in
+ <xref linkend="brin-builtin-opclasses-table">.
+ </para>
+
+ <table id="brin-builtin-opclasses-table">
+ <title>Built-in <acronym>BRIN</acronym> Operator Classes</title>
+ <tgroup cols="3">
+ <thead>
+ <row>
+ <entry>Name</entry>
+ <entry>Indexed Data Type</entry>
+ <entry>Indexable Operators</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry><literal>char_minmax_ops</literal></entry>
+ <entry><type>"char"</type></entry>
+ <entry>
+ <literal><</literal>
+ <literal><=</literal>
+ <literal>=</literal>
+ <literal>>=</literal>
+ <literal>></literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>date_minmax_ops</literal></entry>
+ <entry><type>date</type></entry>
+ <entry>
+ <literal><</literal>
+ <literal><=</literal>
+ <literal>=</literal>
+ <literal>>=</literal>
+ <literal>></literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>int4_minmax_ops</literal></entry>
+ <entry><type>integer</type></entry>
+ <entry>
+ <literal><</literal>
+ <literal><=</literal>
+ <literal>=</literal>
+ <literal>>=</literal>
+ <literal>></literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>numeric_minmax_ops</literal></entry>
+ <entry><type>numeric</type></entry>
+ <entry>
+ <literal><</literal>
+ <literal><=</literal>
+ <literal>=</literal>
+ <literal>>=</literal>
+ <literal>></literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>text_minmax_ops</literal></entry>
+ <entry><type>text</type></entry>
+ <entry>
+ <literal><</literal>
+ <literal><=</literal>
+ <literal>=</literal>
+ <literal>>=</literal>
+ <literal>></literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>time_minmax_ops</literal></entry>
+ <entry><type>time</type></entry>
+ <entry>
+ <literal><</literal>
+ <literal><=</literal>
+ <literal>=</literal>
+ <literal>>=</literal>
+ <literal>></literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>timetz_minmax_ops</literal></entry>
+ <entry><type>time with time zone</type></entry>
+ <entry>
+ <literal><</literal>
+ <literal><=</literal>
+ <literal>=</literal>
+ <literal>>=</literal>
+ <literal>></literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>timestamp_minmax_ops</literal></entry>
+ <entry><type>timestamp</type></entry>
+ <entry>
+ <literal><</literal>
+ <literal><=</literal>
+ <literal>=</literal>
+ <literal>>=</literal>
+ <literal>></literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>timestamptz_minmax_ops</literal></entry>
+ <entry><type>timestamp with time zone</type></entry>
+ <entry>
+ <literal><</literal>
+ <literal><=</literal>
+ <literal>=</literal>
+ <literal>>=</literal>
+ <literal>></literal>
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+</sect1>
+
+<sect1 id="brin-extensibility">
+ <title>Extensibility</title>
+
+ <para>
+ The <acronym>BRIN</acronym> interface has a high level of abstraction,
+ requiring the access method implementer only to implement the semantics
+ of the data type being accessed. The <acronym>BRIN</acronym> layer
+ itself takes care of concurrency, logging and searching the index structure.
+ </para>
+
+ <para>
+ All it takes to get a <acronym>BRIN</acronym> access method working is to
+ implement a few user-defined methods, which define the behavior of
+ summary values stored in the index and the way they interact with
+ scan keys.
+ In short, <acronym>BRIN</acronym> combines
+ extensibility with generality, code reuse, and a clean interface.
+ </para>
+
+ <para>
+ There are three methods that an operator class for <acronym>BRIN</acronym>
+ must provide:
+
+ <variablelist>
+ <varlistentry>
+ <term><function>Datum opcInfo(...)</></term>
+ <listitem>
+ <para>
+ Returns internal information describing the summary data stored
+ for the indexed columns.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><function>bool consistent(...)</function></term>
+ <listitem>
+ <para>
+ Returns whether the key is consistent with the given index tuple.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><function>bool addValue(...)</function></term>
+ <listitem>
+ <para>
+ Modifies the index tuple to make it consistent with the given
+ indexed data.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+
+<!-- this needs improvement ... -->
+ To implement these methods in a generic way, the opclass normally
+ defines its own internal support functions. For instance, minmax
+ opclasses add support functions for the four inequality operators
+ of the data type.
+ Additionally, the operator class must supply appropriate
+ operator entries, to enable the optimizer to use the index when those
+ operators appear in queries; a sketch of such an operator class
+ definition is shown below.
+ </para>
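+
+ <para>
+ As a rough sketch only (the support function names and signatures below
+ are hypothetical, not the ones shipped with this patch), an operator
+ class tying these pieces together might be declared like this:
+<programlisting>
+CREATE OPERATOR CLASS my_int4_minmax_ops
+FOR TYPE int4 USING minmax AS
+    -- support procedures: 1 = opcInfo, 2 = addValue, 3 = consistent
+    FUNCTION 1  my_minmax_opcinfo(oid, oid),
+    FUNCTION 2  my_minmax_add_value(internal, internal, int2, anyelement, bool),
+    FUNCTION 3  my_minmax_consistent(internal, internal, internal),
+    -- operator entries, so the optimizer considers the index for these quals
+    OPERATOR 1  &lt;,
+    OPERATOR 2  &lt;=,
+    OPERATOR 3  =,
+    OPERATOR 4  &gt;=,
+    OPERATOR 5  &gt;;
+</programlisting>
+ </para>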
+</sect1>
+</chapter>
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 5902f97..f03b72a 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -87,6 +87,7 @@
<!ENTITY gist SYSTEM "gist.sgml">
<!ENTITY spgist SYSTEM "spgist.sgml">
<!ENTITY gin SYSTEM "gin.sgml">
+<!ENTITY brin SYSTEM "brin.sgml">
<!ENTITY planstats SYSTEM "planstats.sgml">
<!ENTITY indexam SYSTEM "indexam.sgml">
<!ENTITY nls SYSTEM "nls.sgml">
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 64530a1..b73463a 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -116,7 +116,8 @@ CREATE INDEX test1_id_index ON test1 (id);
<para>
<productname>PostgreSQL</productname> provides several index types:
- B-tree, Hash, GiST, SP-GiST and GIN. Each index type uses a different
+ B-tree, Hash, GiST, SP-GiST, GIN and BRIN.
+ Each index type uses a different
algorithm that is best suited to different types of queries.
By default, the <command>CREATE INDEX</command> command creates
B-tree indexes, which fit the most common situations.
@@ -326,6 +327,39 @@ SELECT * FROM places ORDER BY location <-> point '(101,456)' LIMIT 10;
classes are available in the <literal>contrib</> collection or as separate
projects. For more information see <xref linkend="GIN">.
</para>
+
+ <para>
+ <indexterm>
+ <primary>index</primary>
+ <secondary>BRIN</secondary>
+ </indexterm>
+ <indexterm>
+ <primary>BRIN</primary>
+ <see>index</see>
+ </indexterm>
+ BRIN indexes (a shorthand for Block Range indexes)
+ store summaries of the values stored in consecutive physical block ranges of a table.
+ Like GiST, SP-GiST and GIN,
+ BRIN can support many different indexing strategies,
+ and the particular operators with which a BRIN index can be used
+ vary depending on the indexing strategy.
+ For data types that have a linear sort order, the indexed data
+ corresponds to the minimum and maximum values in the column
+ for each block range, which support indexed queries using these
+ operators:
+
+ <simplelist>
+ <member><literal><</literal></member>
+ <member><literal><=</literal></member>
+ <member><literal>=</literal></member>
+ <member><literal>>=</literal></member>
+ <member><literal>></literal></member>
+ </simplelist>
+
+ The BRIN operator classes included in the standard distribution are
+ documented in <xref linkend="brin-builtin-opclasses-table">.
+ For more information see <xref linkend="BRIN">.
+ </para>
</sect1>
diff --git a/doc/src/sgml/postgres.sgml b/doc/src/sgml/postgres.sgml
index 9bde108..a648a4c 100644
--- a/doc/src/sgml/postgres.sgml
+++ b/doc/src/sgml/postgres.sgml
@@ -247,6 +247,7 @@
&gist;
&spgist;
&gin;
+ &brin;
&storage;
&bki;
&planstats;
diff --git a/minmax-proposal b/minmax-proposal
new file mode 100644
index 0000000..ededbcd
--- /dev/null
+++ b/minmax-proposal
@@ -0,0 +1,306 @@
+Minmax Range Indexes
+====================
+
+Minmax indexes are a new access method intended to enable very fast scanning of
+extremely large tables.
+
+The essential idea of a minmax index is to keep track of summarizing values in
+consecutive groups of heap pages (page ranges); for example, the minimum and
+maximum values for datatypes with a btree opclass, or the bounding box for
+geometric types. These values can be used by constraint exclusion to avoid
+scanning such pages, depending on query quals.
+
+The main drawback of this is having to update the stored summary values of each
+page range as tuples are inserted into them.
+
+Other database systems already have similar features. Some examples:
+
+* Oracle Exadata calls this "storage indexes"
+ http://richardfoote.wordpress.com/category/storage-indexes/
+
+* Netezza has "zone maps"
+ http://nztips.com/2010/11/netezza-integer-join-keys/
+
+* Infobright has this automatically within their "data packs" according to a
+ May 3rd, 2009 blog post
+ http://www.infobright.org/index.php/organizing_data_and_more_about_rough_data_contest/
+
+* MonetDB also uses this technique, according to a published paper
+ http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.2662
+ "Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS"
+
+Index creation
+--------------
+
+To create a minmax index, we use the standard syntax:
+
+ CREATE INDEX foo_minmax_idx ON foo USING MINMAX (a, b, e);
+
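+The number of pages covered by each range is configurable through the
+pages_per_range storage parameter (added to the reloptions machinery by this
+patch, with a default of 128); presumably something along these lines is
+accepted:
+
+    CREATE INDEX foo_minmax_idx ON foo USING MINMAX (a, b, e)
+        WITH (pages_per_range = 64);
+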
+Partial indexes are not currently supported; since an index of this kind is
+concerned with summary values of the involved columns across all the pages in
+the table, it normally doesn't make sense to exclude some tuples. Partial
+indexes might still be useful if their predicates are also used in queries,
+but we exclude them for now for conceptual simplicity.
+
+Expressional indexes can probably be supported in the future, but we disallow
+them initially for conceptual simplicity.
+
+Having multiple minmax indexes in the same table is acceptable, though most of
+the time it would make more sense to have a single index covering all the
+interesting columns. Multiple indexes might be useful for columns added later.
+
+Access Method Design
+--------------------
+
+Since item pointers are not stored inside indexes of this type, it is not
+possible to support the amgettuple interface. Instead, we only provide
+amgetbitmap support; scanning a relation using this index requires a recheck
+node on top. The amgetbitmap routine returns a TIDBitmap comprising all pages
+in those page groups that match the query qualifications. The recheck node
+then discards tuples that do not actually match the query qualifications.
+
+For each supported datatype, we need an operator class with the following
+catalog entries:
+
+- support operators (pg_amop): same as btree (<, <=, =, >=, >)
+- support procedures (pg_amproc):
+ * "opcinfo" (procno 1) initializes a structure for index creation or scanning
+ * "addValue" (procno 2) takes an index tuple and a heap item, and possibly
+ changes the index tuple so that it includes the heap item values
+ * "consistent" (procno 3) takes an index tuple and query quals, and returns
+ whether the index tuple values match the query quals.
+
+These are used pervasively:
+
+- The optimizer requires them to evaluate queries, so that the index is chosen
+ when queries on the indexed table are planned.
+- During index construction (ambuild), they are used to determine the boundary
+ values for each page range.
+- During index updates (aminsert), they are used to determine whether the new
+ heap tuple matches the existing index tuple; and if not, they are used to
+ construct the new index tuple.
+
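+To check which support procedures an opclass actually provides, the catalogs
+can be queried directly; for example (the opclass and access method names here
+are assumptions):
+
+    SELECT amprocnum, amproc::regprocedure
+    FROM pg_amproc
+    WHERE amprocfamily = (SELECT opcfamily FROM pg_opclass
+                          WHERE opcname = 'int4_minmax_ops'
+                            AND opcmethod = (SELECT oid FROM pg_am
+                                             WHERE amname = 'minmax'))
+    ORDER BY amprocnum;
+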
+In each index tuple (corresponding to one page range), we store:
+- for each indexed column of a datatype with a btree-opclass:
+ * minimum value across all tuples in the range
+ * maximum value across all tuples in the range
+ * are there nulls present in any tuple?
+ * are all the values in all tuples in the range null?
+
+Different datatypes store other values instead of min/max, for example
+geometric types might store a bounding box. The NULL bits are always present.
+
+These null bits are stored in a single null bitmask of length 2x number of
+columns.
+
+With the default INDEX_MAX_KEYS of 32, and considering columns of 8-byte length
+types such as timestamptz or bigint, each tuple would be 522 bytes in length,
+which seems reasonable. There are 6 extra bytes for padding between the null
+bitmask and the first data item, assuming 64-bit alignment; so the total size
+of such an index tuple would actually be 528 bytes.
+
+This maximum index tuple size is calculated as: mt_info (2 bytes) + null bitmap
+(8 bytes) + data value (8 bytes) * 32 * 2
+
+(Of course, larger columns are possible, such as varchar, but creating minmax
+indexes on such columns seems of little practical usefulness. Also, the
+usefulness of an index containing so many columns is dubious.)
+
+There can be gaps where some pages have no covering index entry.
+
+The Range Reverse Map
+---------------------
+
+To find out the index tuple for a particular page range, we have an internal
+structure we call the range reverse map. This stores one TID per page range,
+which is the address of the index tuple summarizing that range. Since these
+map entries are fixed size, it is possible to compute the address of the range
+map entry for any given heap page by simple arithmetic.
+
+When a new heap tuple is inserted in a summarized page range, we compare the
+existing index tuple with the new heap tuple. If the heap tuple is outside the
+summarization data given by the index tuple for any indexed column (or if the
+new heap tuple contains null values but the index tuple indicates there are no
+nulls), it is necessary to create a new index tuple with the new values. To do
+this, a new index tuple is inserted, and the reverse range map is updated to
+point to it. The old index tuple is left in place, for later garbage
+collection. As an optimization, we sometimes overwrite the old index tuple in
+place with the new data, which avoids the need for later garbage collection.
+
+If the reverse range map points to an invalid TID, the corresponding page range
+is considered to be not summarized.
+
+To scan a table following a minmax index, we scan the reverse range map
+sequentially. This yields index tuples in ascending page range order. Query
+quals are matched to each index tuple; if they match, each page within the page
+range is returned as part of the output TID bitmap. If there's no match, the
+pages in that range are skipped. Reverse range map entries holding invalid
+index TIDs, that is, unsummarized page ranges, are also returned in the TID
+bitmap.
+
+To store the range reverse map, we map its logical page numbers to physical
+pages. We use a large two-level BlockNumber array for this: The metapage
+contains an array of BlockNumbers; each of these points to a "revmap array
+page". Each revmap array page contains BlockNumbers, which in turn point to
+"revmap regular pages", which are the ones that contain the revmap data itself.
+Therefore, to find a given index tuple, we need to examine the metapage and
+obtain the revmap array page number; then read the array page. From there we
+obtain the revmap regular page number, and that one contains the TID we're
+interested in. As an optimization, regular revmap page number 0 is stored in
+physical page number 1, that is, the page just after the metapage. This means
+that scanning a table of about 1300 page ranges (the number of TIDs that fit in
+a single 8kB page) does not require accessing the metapage at all.
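+
+This chain can be followed from SQL using the pageinspect functions added by
+this patch; for example (index name hypothetical, and assuming block 2 happens
+to hold regular index data):
+
+    SELECT * FROM minmax_revmap_data(get_raw_page('foo_minmax_idx', 1));
+    SELECT * FROM minmax_page_items(get_raw_page('foo_minmax_idx', 2),
+                                    'foo_minmax_idx'::regclass);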
+
+When tuples are added to unsummarized pages, nothing needs to happen.
+
+Heap tuples can be removed from anywhere without restriction. It might be
+useful to mark the corresponding index tuple somehow, if the heap tuple is one
+of the constraining values of the summary data (i.e. either min or max in the
+case of a btree-opclass-bearing datatype), so that in the future we are aware
+of the need to re-execute summarization on that range, leading to a possible
+tightening of the summary values.
+
+Index entries that are not referenced from the revmap can be removed from the
+main fork. This currently happens at amvacuumcleanup, though it could be
+carried out separately; no heap scan is necessary to determine which tuples
+are unreachable.
+
+Summarization
+-------------
+
+At index creation time, the whole table is scanned; for each page range the
+summarizing values of each indexed column and nulls bitmap are collected and
+stored in the index.
+
+Once in a while, it is necessary to summarize a bunch of unsummarized pages
+(because the table has grown since the index was created), or re-summarize a
+range that has been marked invalid. This is simple: scan the page range
+calculating the summary values for each indexed column, then insert the new
+index entry at the end of the index.
+
+The easiest way to go about this seems to be to have vacuum do it. That way we
+can simply do re-summarization in the amvacuumcleanup routine. Other approaches
+would require a separate AM routine, which appears unwarranted at this stage.
+
+Vacuuming
+---------
+
+Vacuuming a table that has a minmax index does not represent a significant
+challenge. Since no heap TIDs are stored, it's not necessary to scan the index
+when heap tuples are removed. It might be that some min() value can be
+incremented, or some max() value can be decremented; but this would represent
+an optimization opportunity only, not a correctness issue. Perhaps it's
+simpler to represent this as the need to re-run summarization on the affected
+page range.
+
+Note that if there are no indexes on the table other than the minmax index,
+usage of maintenance_work_mem by vacuum can be decreased significantly, because
+no detailed index scan needs to take place (and thus it's not necessary for
+vacuum to save TIDs to remove). This optimization opportunity is best left for
+future improvement.
+
+Locking considerations
+----------------------
+
+To read the TID during an index scan, we follow this protocol:
+
+* read revmap page
+* obtain share lock on the revmap buffer
+* read the TID
+* obtain share lock on buffer of main fork
+* LockTuple the TID (using the index as relation). A shared lock is
+ sufficient. We need the LockTuple to prevent VACUUM from recycling
+ the index tuple; see below.
+* release revmap buffer lock
+* read the index tuple
+* release the tuple lock
+* release main fork buffer lock
+
+
+To update the summary tuple for a page range, we use this protocol:
+
+* insert a new index tuple somewhere in the main fork; note its TID
+* read revmap page
+* obtain exclusive lock on revmap buffer
+* write the TID
+* release lock
+
+This ensures no concurrent reader can obtain a partially-written TID.
+Note we don't need a tuple lock here. Concurrent scans don't have to
+worry about whether they got the old or new index tuple: if they get the
+old one, the tighter values are okay from a correctness standpoint, because
+under MVCC they can't possibly see the just-inserted heap tuples anyway.
+
+
+For vacuuming, we need to figure out which index tuples are no longer
+referenced from the reverse range map. This requires some brute force,
+but is simple:
+
+1) scan the complete index, store each existing TID in a dynahash.
+ Hash key is the TID, hash value is a boolean initially set to false.
+2) scan the complete revmap sequentially, read the TIDs on each page. Share
+ lock on each page is sufficient. For each TID so obtained, grab the
+ element from the hash and update the boolean to true.
+3) Scan the index again; for each tuple found, search the hash table.
+ If the tuple is not present in the hash, it must have been added after our
+ initial scan; ignore it. If the tuple is present in the hash and the hash flag
+ is true, then the tuple is referenced from the revmap; ignore it. If the hash
+ flag is false, then the index tuple is no longer referenced by the revmap;
+ but it could be about to be accessed by a concurrent scan. Do
+ ConditionalLockTuple. If this fails, ignore the tuple (it's in use); it
+ will be deleted by a future vacuum. If the lock is acquired, then we can safely
+ remove the index tuple.
+4) Index pages with free space can be detected by this second scan. Register
+ those with the FSM.
+
+Note this doesn't require scanning the heap at all, or being involved in
+the heap's cleanup procedure. Also, there is no need to LockBufferForCleanup,
+which is a nice property because index scans keep pages pinned for long
+periods.
+
+
+
+Optimizer
+---------
+
+In order to make this all work, all we need is a suitable opclass and a
+reasonable amcostestimate function; with those in place, the optimizer is able
+to choose the index on its own, as can be verified with the example below.
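+
+For example, with the index created earlier in this document, a query such as
+the following is expected to be answered with a bitmap scan driven by
+foo_minmax_idx:
+
+    EXPLAIN SELECT * FROM foo WHERE a BETWEEN 1000 AND 2000;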
+
+
+Open questions
+--------------
+
+* Same-size page ranges?
+ Current related literature seems to consider that each "index entry" in a
+ minmax index must cover the same number of pages. There doesn't seem to be a
+ hard reason for this to be so; it might make sense to allow the index to
+ self-tune so that some index entries cover smaller page ranges, if this allows
+ the summary values to be more compact. This would incur larger minmax
+ overhead for the index itself, but might allow better pruning of page ranges
+ during scan. In the limit of one index tuple per page, the index itself would
+ occupy too much space, even though the tight summary values would let us skip
+ reading the largest possible number of heap pages; in the opposite limit of
+ a single tuple that summarizes the whole table, we wouldn't be able to prune
+ anything even though the index is very small. This can probably be made to work
+ by using the reverse range map as an index in itself.
+
+* More compact representation for TIDBitmap?
+ TIDBitmap is the structure used to represent bitmap scans. The
+ representation of lossy page ranges is not optimal for our purposes, because
+ it uses a Bitmapset to represent pages in the range; since we're going to return
+ all pages in a large range, it might be more convenient to allow for a
+ struct that uses start and end page numbers to represent the range, instead.
+
+
+
+References:
+
+Email thread on pgsql-hackers
+ http://www.postgresql.org/message-id/1199296574.7260.149.camel@ebony.site
+ From: Simon Riggs
+ To: pgsql-hackers
+ Subject: Dynamic Partitioning using Segment Visibility Map
+
+http://wiki.postgresql.org/wiki/Segment_Exclusion
+http://wiki.postgresql.org/wiki/Segment_Visibility_Map
+
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index c32088f..db46539 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/access
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-SUBDIRS = common gin gist hash heap index nbtree rmgrdesc spgist transam
+SUBDIRS = common gin gist hash heap index minmax nbtree rmgrdesc spgist transam
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index c7ad6f9..1bef404 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -209,6 +209,13 @@ static relopt_int intRelOpts[] =
RELOPT_KIND_HEAP | RELOPT_KIND_TOAST
}, -1, 0, 2000000000
},
+ {
+ {
+ "pages_per_range",
+ "Number of pages that each page range covers in a Minmax index",
+ RELOPT_KIND_MINMAX
+ }, 128, 1, 131072
+ },
/* list terminator */
{{NULL}}
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d731f98..78f35b9 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -271,6 +271,8 @@ initscan(HeapScanDesc scan, ScanKey key, bool is_rescan)
scan->rs_startblock = 0;
}
+ scan->rs_initblock = 0;
+ scan->rs_numblocks = InvalidBlockNumber;
scan->rs_inited = false;
scan->rs_ctup.t_data = NULL;
ItemPointerSetInvalid(&scan->rs_ctup.t_self);
@@ -296,6 +298,14 @@ initscan(HeapScanDesc scan, ScanKey key, bool is_rescan)
pgstat_count_heap_scan(scan->rs_rd);
}
+void
+heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk, BlockNumber numBlks)
+{
+ scan->rs_startblock = startBlk;
+ scan->rs_initblock = startBlk;
+ scan->rs_numblocks = numBlks;
+}
+
/*
* heapgetpage - subroutine for heapgettup()
*
@@ -636,7 +646,8 @@ heapgettup(HeapScanDesc scan,
*/
if (backward)
{
- finished = (page == scan->rs_startblock);
+ finished = (page == scan->rs_startblock) ||
+ (scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
if (page == 0)
page = scan->rs_nblocks;
page--;
@@ -646,7 +657,8 @@ heapgettup(HeapScanDesc scan,
page++;
if (page >= scan->rs_nblocks)
page = 0;
- finished = (page == scan->rs_startblock);
+ finished = (page == scan->rs_startblock) ||
+ (scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
/*
* Report our new scan position for synchronization purposes. We
@@ -897,7 +909,8 @@ heapgettup_pagemode(HeapScanDesc scan,
*/
if (backward)
{
- finished = (page == scan->rs_startblock);
+ finished = (page == scan->rs_startblock) ||
+ (scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
if (page == 0)
page = scan->rs_nblocks;
page--;
@@ -907,7 +920,8 @@ heapgettup_pagemode(HeapScanDesc scan,
page++;
if (page >= scan->rs_nblocks)
page = 0;
- finished = (page == scan->rs_startblock);
+ finished = (page == scan->rs_startblock) ||
+ (scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
/*
* Report our new scan position for synchronization purposes. We
diff --git a/src/backend/access/minmax/Makefile b/src/backend/access/minmax/Makefile
new file mode 100644
index 0000000..2c80a20
--- /dev/null
+++ b/src/backend/access/minmax/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for access/minmax
+#
+# IDENTIFICATION
+# src/backend/access/minmax/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/minmax
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = minmax.o mmrevmap.o mmtuple.o mmxlog.o mmsortable.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/minmax/minmax.c b/src/backend/access/minmax/minmax.c
new file mode 100644
index 0000000..01166e2
--- /dev/null
+++ b/src/backend/access/minmax/minmax.c
@@ -0,0 +1,1318 @@
+/*
+ * minmax.c
+ * Implementation of Minmax indexes for Postgres
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/minmax.c
+ *
+ * TODO
+ * * support collatable datatypes
+ * * ScalarArrayOpExpr
+ * * Make use of the stored NULL bits
+ * * we can support unlogged indexes now
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/minmax.h"
+#include "access/minmax_internal.h"
+#include "access/minmax_page.h"
+#include "access/minmax_revmap.h"
+#include "access/minmax_tuple.h"
+#include "access/minmax_xlog.h"
+#include "access/reloptions.h"
+#include "access/relscan.h"
+#include "access/xlogutils.h"
+#include "catalog/index.h"
+#include "catalog/pg_operator.h"
+#include "commands/vacuum.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/bufmgr.h"
+#include "storage/freespace.h"
+#include "storage/lmgr.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
+#include "utils/memutils.h"
+#include "utils/syscache.h"
+
+
+/*
+ * We use a MMBuildState during initial construction of a Minmax index.
+ * The running state is kept in a DeformedMMTuple.
+ */
+typedef struct MMBuildState
+{
+ Relation irel;
+ int numtuples;
+ Buffer currentInsertBuf;
+ BlockNumber pagesPerRange;
+ BlockNumber currRangeStart;
+ mmRevmapAccess *rmAccess;
+ MinmaxDesc *mmDesc;
+ DeformedMMTuple *dtuple;
+} MMBuildState;
+
+/*
+ * Struct used as "opaque" during index scans
+ */
+typedef struct MinmaxOpaque
+{
+ BlockNumber pagesPerRange;
+ mmRevmapAccess *rmAccess;
+ MinmaxDesc *mmDesc;
+} MinmaxOpaque;
+
+static MMBuildState *initialize_mm_buildstate(Relation idxRel,
+ mmRevmapAccess *rmAccess, BlockNumber pagesPerRange);
+static void summarize_range(Relation idxRel, Relation heapRel, mmRevmapAccess *rmAccess,
+ BlockNumber heapBlk, BlockNumber pagesPerRange);
+static bool mm_doupdate(Relation idxrel, BlockNumber pagesPerRange,
+ mmRevmapAccess *rmAccess, BlockNumber heapBlk,
+ Buffer oldbuf, OffsetNumber oldoff,
+ MMTuple *origtup, Size origsz,
+ MMTuple *newtup, Size newsz, bool samepage);
+static void mm_doinsert(Relation idxrel, BlockNumber pagesPerRange, mmRevmapAccess *rmAccess,
+ Buffer *buffer, BlockNumber heapblkno, MMTuple *tup, Size itemsz);
+static Buffer mm_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz);
+static void form_and_insert_tuple(MMBuildState *mmstate);
+
+
+/*
+ * A tuple in the heap is being inserted. To keep a minmax index up to date,
+ * we need to obtain the relevant index tuple, compare its min()/max() stored
+ * values with those of the new tuple; if the tuple values are in range,
+ * there's nothing to do; otherwise we need to update the index (either by
+ * inserting a new index tuple and repointing the revmap, or by overwriting
+ * the existing index tuple in place).
+ *
+ * If the range is not currently summarized (i.e. the revmap returns InvalidTid
+ * for it), there's nothing to do either.
+ */
+Datum
+mminsert(PG_FUNCTION_ARGS)
+{
+ Relation idxRel = (Relation) PG_GETARG_POINTER(0);
+ Datum *values = (Datum *) PG_GETARG_POINTER(1);
+ bool *nulls = (bool *) PG_GETARG_POINTER(2);
+ ItemPointer heaptid = (ItemPointer) PG_GETARG_POINTER(3);
+
+ /* we ignore the rest of our arguments */
+ BlockNumber pagesPerRange;
+ MinmaxDesc *mmdesc;
+ mmRevmapAccess *rmAccess;
+ OffsetNumber off;
+ MMTuple *mmtup;
+ DeformedMMTuple *dtup;
+ BlockNumber heapBlk;
+ Buffer buf = InvalidBuffer;
+ IndexInfo *indexInfo;
+ int keyno;
+ bool need_insert = false;
+
+ rmAccess = mmRevmapAccessInit(idxRel, &pagesPerRange);
+
+ heapBlk = ItemPointerGetBlockNumber(heaptid);
+ /* normalize the block number to be the first block in the range */
+ heapBlk = (heapBlk / pagesPerRange) * pagesPerRange;
+ mmtup = mmGetMMTupleForHeapBlock(rmAccess, heapBlk, &buf, &off,
+ BUFFER_LOCK_SHARE);
+
+ if (!mmtup)
+ {
+ /* nothing to do, range is unsummarized */
+ mmRevmapAccessTerminate(rmAccess);
+ if (BufferIsValid(buf))
+ ReleaseBuffer(buf);
+ return BoolGetDatum(false);
+ }
+
+ indexInfo = BuildIndexInfo(idxRel);
+ mmdesc = minmax_build_mmdesc(idxRel);
+
+ dtup = minmax_deform_tuple(mmdesc, mmtup);
+
+ /*
+ * Compare the key values of the new tuple to the stored index values; our
+ * deformed tuple will get updated if the new tuple doesn't fit the
+ * original range (note this means we can't break out of the loop early).
+ * Make a note of whether this happens, so that we know to insert the
+ * modified tuple later.
+ */
+ for (keyno = 0; keyno < indexInfo->ii_NumIndexAttrs; keyno++)
+ {
+ Datum result;
+ FmgrInfo *addValue;
+
+ addValue = index_getprocinfo(idxRel, keyno + 1,
+ MINMAX_PROCNUM_ADDVALUE);
+
+ result = FunctionCall5Coll(addValue,
+ PG_GET_COLLATION(),
+ PointerGetDatum(mmdesc),
+ PointerGetDatum(dtup),
+ UInt16GetDatum(keyno + 1),
+ values[keyno],
+ nulls[keyno]);
+ /* if that returned true, we need to insert the updated tuple */
+ need_insert |= DatumGetBool(result);
+ }
+
+ if (need_insert)
+ {
+ Page page = BufferGetPage(buf);
+ ItemId lp = PageGetItemId(page, off);
+ Size origsz;
+ MMTuple *origtup;
+ Size newsz;
+ MMTuple *newtup;
+ bool samepage;
+
+ /*
+ * Make a copy of the old tuple, so that we can compare it after
+ * re-acquiring the lock.
+ */
+ origsz = ItemIdGetLength(lp);
+ origtup = minmax_copy_tuple(mmtup, origsz);
+
+ /*
+ * Form the new tuple before deciding whether a same-page update is
+ * possible; newsz must be known for that check, and the free-space
+ * comparison needs the size increase (newsz - origsz), not the reverse.
+ */
+ newtup = minmax_form_tuple(mmdesc, heapBlk, dtup, &newsz);
+
+ /* before releasing the lock, check if we can do a same-page update */
+ if (newsz <= origsz ||
+ PageGetExactFreeSpace(page) >= (newsz - origsz))
+ samepage = true;
+ else
+ samepage = false;
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ mm_doupdate(idxRel, pagesPerRange, rmAccess, heapBlk, buf, off, origtup, origsz,
+ newtup, newsz, samepage);
+ }
+ else
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ ReleaseBuffer(buf);
+
+ mmRevmapAccessTerminate(rmAccess);
+
+ return BoolGetDatum(false);
+}
+
+/*
+ * Initialize state for a Minmax index scan.
+ *
+ * We read the metapage here to determine the pages-per-range number that this
+ * index was built with. Note that since this cannot be changed while we're
+ * holding lock on index, it's not necessary to recompute it during mmrescan.
+ */
+Datum
+mmbeginscan(PG_FUNCTION_ARGS)
+{
+ Relation r = (Relation) PG_GETARG_POINTER(0);
+ int nkeys = PG_GETARG_INT32(1);
+ int norderbys = PG_GETARG_INT32(2);
+ IndexScanDesc scan;
+ MinmaxOpaque *opaque;
+
+ scan = RelationGetIndexScan(r, nkeys, norderbys);
+
+ opaque = (MinmaxOpaque *) palloc(sizeof(MinmaxOpaque));
+ opaque->rmAccess = mmRevmapAccessInit(r, &opaque->pagesPerRange);
+ scan->opaque = opaque;
+
+ PG_RETURN_POINTER(scan);
+}
+
+/*
+ * Execute the index scan.
+ *
+ * This works by reading index TIDs from the revmap, and obtaining the index
+ * tuples pointed to by them; the summary values in the index tuples are
+ * compared to the scan keys. We return into the TID bitmap all the pages in
+ * ranges corresponding to index tuples that match the scan keys.
+ *
+ * If a TID from the revmap is read as InvalidTID, we know that range is
+ * unsummarized. Pages in those ranges need to be returned regardless of scan
+ * keys.
+ *
+ * XXX see _bt_first on what to do about sk_subtype.
+ */
+Datum
+mmgetbitmap(PG_FUNCTION_ARGS)
+{
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+ TIDBitmap *tbm = (TIDBitmap *) PG_GETARG_POINTER(1);
+ Relation idxRel = scan->indexRelation;
+ Buffer buf = InvalidBuffer;
+ MinmaxDesc *mmdesc = minmax_build_mmdesc(idxRel);
+ Oid heapOid;
+ Relation heapRel;
+ MinmaxOpaque *opaque;
+ BlockNumber nblocks;
+ BlockNumber heapBlk;
+ int totalpages = 0;
+ int keyno;
+ FmgrInfo *consistentFn;
+
+ opaque = (MinmaxOpaque *) scan->opaque;
+ pgstat_count_index_scan(idxRel);
+
+ /*
+ * XXX We need to know the size of the table so that we know how long to
+ * iterate on the revmap. There's room for improvement here, in that we
+ * could have the revmap tell us when to stop iterating.
+ */
+ heapOid = IndexGetRelation(RelationGetRelid(idxRel), false);
+ heapRel = heap_open(heapOid, AccessShareLock);
+ nblocks = RelationGetNumberOfBlocks(heapRel);
+ heap_close(heapRel, AccessShareLock);
+
+ /*
+ * Obtain consistent functions for all indexed columns. Maybe it'd be
+ * possible to do this lazily only the first time we see a scan key that
+ * involves each particular attribute.
+ */
+ consistentFn = palloc(sizeof(FmgrInfo) * mmdesc->md_tupdesc->natts);
+ for (keyno = 0; keyno < mmdesc->md_tupdesc->natts; keyno++)
+ {
+ FmgrInfo *tmp;
+
+ tmp = index_getprocinfo(idxRel, keyno + 1, MINMAX_PROCNUM_CONSISTENT);
+ fmgr_info_copy(&consistentFn[keyno], tmp, CurrentMemoryContext);
+ }
+
+ /*
+ * Now scan the revmap. We start by querying for heap page 0,
+ * incrementing by the number of pages per range; this gives us a full
+ * view of the table.
+ */
+ for (heapBlk = 0; heapBlk < nblocks; heapBlk += opaque->pagesPerRange)
+ {
+ bool addrange;
+ OffsetNumber off;
+ MMTuple *tup;
+
+ tup = mmGetMMTupleForHeapBlock(opaque->rmAccess, heapBlk, &buf, &off,
+ BUFFER_LOCK_SHARE);
+ /*
+ * For page ranges with no indexed tuple, we must return the whole
+ * range; otherwise, compare it to the scan keys.
+ */
+ if (tup == NULL)
+ {
+ addrange = true;
+ }
+ else
+ {
+ DeformedMMTuple *dtup;
+ int keyno;
+
+ dtup = minmax_deform_tuple(mmdesc, tup);
+
+ /*
+ * Compare scan keys with summary values stored for the range. If
+ * scan keys are matched, the page range must be added to the
+ * bitmap. We initially assume the range needs to be added; in
+ * particular this serves the case where there are no keys.
+ */
+ addrange = true;
+ for (keyno = 0; keyno < scan->numberOfKeys; keyno++)
+ {
+ ScanKey key = &scan->keyData[keyno];
+ AttrNumber keyattno = key->sk_attno;
+ Datum add;
+
+ /*
+ * The collation of the scan key must match the collation used
+ * in the index column. Otherwise we shouldn't be using this
+ * index ...
+ */
+ Assert(key->sk_collation ==
+ mmdesc->md_tupdesc->attrs[keyattno - 1]->attcollation);
+
+ /*
+ * Check whether the scan key is consistent with the page range
+ * values; if so, have the pages in the range added to the
+ * output bitmap.
+ *
+ * When there are multiple scan keys, failure to meet the
+ * criteria for a single one of them is enough to discard the
+ * range as a whole, so break out of the loop as soon as a
+ * false return value is obtained.
+ */
+ add = FunctionCall3Coll(&consistentFn[keyattno - 1],
+ key->sk_collation,
+ PointerGetDatum(mmdesc),
+ PointerGetDatum(dtup),
+ PointerGetDatum(key));
+ addrange = DatumGetBool(add);
+ if (!addrange)
+ break;
+ }
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ pfree(dtup);
+ }
+
+ /* add the pages in the range to the output bitmap, if needed */
+ if (addrange)
+ {
+ BlockNumber pageno;
+
+ for (pageno = heapBlk;
+ pageno <= heapBlk + opaque->pagesPerRange - 1;
+ pageno++)
+ {
+ tbm_add_page(tbm, pageno);
+ totalpages++;
+ }
+ }
+ }
+
+ if (buf != InvalidBuffer)
+ ReleaseBuffer(buf);
+
+ /*
+ * XXX We have an approximation of the number of *pages* that our scan
+ * returns, but we don't have a precise idea of the number of heap tuples
+ * involved.
+ */
+ PG_RETURN_INT64(totalpages * 10);
+}
+
+/*
+ * Re-initialize state for a minmax index scan
+ */
+Datum
+mmrescan(PG_FUNCTION_ARGS)
+{
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+ ScanKey scankey = (ScanKey) PG_GETARG_POINTER(1);
+ /* other arguments ignored */
+
+ if (scankey && scan->numberOfKeys > 0)
+ memmove(scan->keyData, scankey,
+ scan->numberOfKeys * sizeof(ScanKeyData));
+
+ PG_RETURN_VOID();
+}
+
+/*
+ * Close down a minmax index scan
+ */
+Datum
+mmendscan(PG_FUNCTION_ARGS)
+{
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+ MinmaxOpaque *opaque = (MinmaxOpaque *) scan->opaque;
+
+ mmRevmapAccessTerminate(opaque->rmAccess);
+ pfree(opaque);
+
+ PG_RETURN_VOID();
+}
+
+Datum
+mmmarkpos(PG_FUNCTION_ARGS)
+{
+ elog(ERROR, "MinMax does not support mark/restore");
+ PG_RETURN_VOID();
+}
+
+Datum
+mmrestrpos(PG_FUNCTION_ARGS)
+{
+ elog(ERROR, "MinMax does not support mark/restore");
+ PG_RETURN_VOID();
+}
+
+/*
+ * Per-heap-tuple callback for IndexBuildHeapScan.
+ *
+ * Note we don't worry about the page range at the end of the table here; it
+ * remains in the build state struct after the last call, but is not inserted
+ * into the index. The caller must insert it, if appropriate.
+ */
+static void
+mmbuildCallback(Relation index,
+ HeapTuple htup,
+ Datum *values,
+ bool *isnull,
+ bool tupleIsAlive,
+ void *state)
+{
+ MMBuildState *mmstate = (MMBuildState *) state;
+ BlockNumber thisblock;
+ int i;
+
+ thisblock = ItemPointerGetBlockNumber(&htup->t_self);
+
+ /*
+ * If we're in a new block which belongs to the next range, summarize what
+ * we've got and start afresh.
+ */
+ if (thisblock > (mmstate->currRangeStart + mmstate->pagesPerRange - 1))
+ {
+
+ MINMAX_elog(DEBUG2, "mmbuildCallback: completed a range: %u--%u",
+ mmstate->currRangeStart,
+ mmstate->currRangeStart + mmstate->pagesPerRange);
+
+ /* create the index tuple and insert it */
+ form_and_insert_tuple(mmstate);
+
+ /* set state to correspond to the next range */
+ mmstate->currRangeStart += mmstate->pagesPerRange;
+
+ /* re-initialize state for it */
+ minmax_dtuple_initialize(mmstate->dtuple, mmstate->mmDesc);
+ }
+
+ /* Accumulate the current tuple into the running state */
+ mmstate->dtuple->dt_seentup = true;
+ for (i = 0; i < mmstate->mmDesc->md_tupdesc->natts; i++)
+ {
+ FmgrInfo *addValue;
+
+ addValue = index_getprocinfo(index, i + 1,
+ MINMAX_PROCNUM_ADDVALUE);
+
+ /*
+ * Update dtuple state, if and as necessary.
+ */
+ FunctionCall5Coll(addValue,
+ mmstate->mmDesc->md_tupdesc->attrs[i]->attcollation,
+ PointerGetDatum(mmstate->mmDesc),
+ PointerGetDatum(mmstate->dtuple),
+ UInt16GetDatum(i + 1), values[i], isnull[i]);
+ }
+}
+
+/*
+ * mmbuild() -- build a new minmax index.
+ */
+Datum
+mmbuild(PG_FUNCTION_ARGS)
+{
+ Relation heap = (Relation) PG_GETARG_POINTER(0);
+ Relation index = (Relation) PG_GETARG_POINTER(1);
+ IndexInfo *indexInfo = (IndexInfo *) PG_GETARG_POINTER(2);
+ IndexBuildResult *result;
+ double reltuples;
+ mmRevmapAccess *rmAccess;
+ MMBuildState *mmstate;
+ Buffer meta;
+ BlockNumber pagesPerRange;
+
+ /*
+ * We expect to be called exactly once for any index relation.
+ */
+ if (RelationGetNumberOfBlocks(index) != 0)
+ elog(ERROR, "index \"%s\" already contains data",
+ RelationGetRelationName(index));
+
+ /* partial indexes not supported */
+ if (indexInfo->ii_Predicate != NIL)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("partial indexes not supported")));
+ /* expressions not supported (yet?) */
+ if (indexInfo->ii_Expressions != NIL)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("expression indexes not supported")));
+
+ meta = ReadBuffer(index, P_NEW);
+ Assert(BufferGetBlockNumber(meta) == MINMAX_METAPAGE_BLKNO);
+ LockBuffer(meta, BUFFER_LOCK_EXCLUSIVE);
+
+ START_CRIT_SECTION();
+ mm_metapage_init(BufferGetPage(meta), MinmaxGetPagesPerRange(index),
+ MINMAX_CURRENT_VERSION);
+ MarkBufferDirty(meta);
+
+ if (RelationNeedsWAL(index))
+ {
+ xl_minmax_createidx xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata;
+ Page page;
+
+ xlrec.node = index->rd_node;
+ xlrec.version = MINMAX_CURRENT_VERSION;
+ xlrec.pagesPerRange = MinmaxGetPagesPerRange(index);
+
+ rdata.buffer = InvalidBuffer;
+ rdata.data = (char *) &xlrec;
+ rdata.len = SizeOfMinmaxCreateIdx;
+ rdata.next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_CREATE_INDEX, &rdata);
+
+ page = BufferGetPage(meta);
+ PageSetLSN(page, recptr);
+ }
+
+ UnlockReleaseBuffer(meta);
+ END_CRIT_SECTION();
+
+ /*
+ * Set up an empty revmap, and get access to it
+ */
+ mmRevmapCreate(index);
+ rmAccess = mmRevmapAccessInit(index, &pagesPerRange);
+
+ /*
+ * Initialize our state, including the deformed tuple state.
+ */
+ mmstate = initialize_mm_buildstate(index, rmAccess, pagesPerRange);
+
+ /*
+ * Now scan the relation. No syncscan allowed here because we want the
+ * heap blocks in physical order.
+ */
+ reltuples = IndexBuildHeapScan(heap, index, indexInfo, false,
+ mmbuildCallback, (void *) mmstate);
+
+ /* process the final batch */
+ form_and_insert_tuple(mmstate);
+
+ /* release the last index buffer used */
+ if (!BufferIsInvalid(mmstate->currentInsertBuf))
+ {
+ ReleaseBuffer(mmstate->currentInsertBuf);
+ mmstate->currentInsertBuf = InvalidBuffer;
+ }
+
+ mmRevmapAccessTerminate(mmstate->rmAccess);
+
+ /*
+ * Return statistics
+ */
+ result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+
+ result->heap_tuples = reltuples;
+ result->index_tuples = mmstate->numtuples;
+
+ PG_RETURN_POINTER(result);
+}
+
+Datum
+mmbuildempty(PG_FUNCTION_ARGS)
+{
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("unlogged MinMax indexes are not supported")));
+
+ PG_RETURN_VOID();
+}
+
+/*
+ * mmbulkdelete
+ * Since there are no per-heap-tuple index tuples in minmax indexes,
+ * there's not a lot we can do here.
+ *
+ * XXX we could mark index tuples as "dirty" when a minimum or maximum heap
+ * tuple is deleted, signalling the need to re-run summarization on the
+ * affected range. That would require an extra flag in mmtuples.
+ */
+Datum
+mmbulkdelete(PG_FUNCTION_ARGS)
+{
+ /* other arguments are not currently used */
+ IndexBulkDeleteResult *stats = (IndexBulkDeleteResult *) PG_GETARG_POINTER(1);
+
+ /* allocate stats if first time through, else re-use existing struct */
+ if (stats == NULL)
+ stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+ PG_RETURN_POINTER(stats);
+}
+
+/*
+ * This routine is in charge of "vacuuming" a minmax index: we just summarize
+ * ranges that are currently unsummarized.
+ */
+Datum
+mmvacuumcleanup(PG_FUNCTION_ARGS)
+{
+ IndexVacuumInfo *info = (IndexVacuumInfo *) PG_GETARG_POINTER(0);
+ IndexBulkDeleteResult *stats = (IndexBulkDeleteResult *) PG_GETARG_POINTER(1);
+ mmRevmapAccess *rmAccess;
+ Relation heapRel;
+ BlockNumber heapNumBlocks;
+ BlockNumber heapBlk;
+ BlockNumber pagesPerRange;
+ Buffer buf;
+
+ /* No-op in ANALYZE ONLY mode */
+ if (info->analyze_only)
+ PG_RETURN_POINTER(stats);
+
+ heapRel = heap_open(IndexGetRelation(RelationGetRelid(info->index), false),
+ AccessShareLock);
+
+ /*
+ * Scan the revmap to find unsummarized items.
+ */
+ rmAccess = mmRevmapAccessInit(info->index, &pagesPerRange);
+ buf = InvalidBuffer;
+ heapNumBlocks = RelationGetNumberOfBlocks(heapRel);
+ for (heapBlk = 0; heapBlk < heapNumBlocks; heapBlk += pagesPerRange)
+ {
+ MMTuple *tup;
+ OffsetNumber off;
+
+ tup = mmGetMMTupleForHeapBlock(rmAccess, heapBlk, &buf, &off,
+ BUFFER_LOCK_SHARE);
+ if (tup == NULL)
+ {
+ /* no revmap entry for this heap range. Summarize it. */
+ summarize_range(info->index, heapRel, rmAccess, heapBlk,
+ pagesPerRange);
+ }
+ else
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ }
+ if (BufferIsValid(buf))
+ ReleaseBuffer(buf);
+
+ mmRevmapAccessTerminate(rmAccess);
+ heap_close(heapRel, AccessShareLock);
+
+ PG_RETURN_POINTER(stats);
+}
+
+/*
+ * reloptions processor for minmax indexes
+ */
+Datum
+mmoptions(PG_FUNCTION_ARGS)
+{
+ Datum reloptions = PG_GETARG_DATUM(0);
+ bool validate = PG_GETARG_BOOL(1);
+ relopt_value *options;
+ MinmaxOptions *rdopts;
+ int numoptions;
+ static const relopt_parse_elt tab[] = {
+ {"pages_per_range", RELOPT_TYPE_INT, offsetof(MinmaxOptions, pagesPerRange)}
+ };
+
+ options = parseRelOptions(reloptions, validate, RELOPT_KIND_MINMAX,
+ &numoptions);
+
+ /* if none set, we're done */
+ if (numoptions == 0)
+ PG_RETURN_NULL();
+
+ rdopts = allocateReloptStruct(sizeof(MinmaxOptions), options, numoptions);
+
+ fillRelOptions((void *) rdopts, sizeof(MinmaxOptions), options, numoptions,
+ validate, tab, lengthof(tab));
+
+ pfree(options);
+
+ PG_RETURN_BYTEA_P(rdopts);
+}
+
+/*
+ * Initialize a page with the given type.
+ *
+ * Caller is responsible for marking it dirty, as appropriate.
+ */
+void
+mm_page_init(Page page, uint16 type)
+{
+ MinmaxSpecialSpace *special;
+
+ PageInit(page, BLCKSZ, sizeof(MinmaxSpecialSpace));
+
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
+ special->type = type;
+}
+
+/*
+ * Initialize a new minmax index' metapage.
+ */
+void
+mm_metapage_init(Page page, BlockNumber pagesPerRange, uint16 version)
+{
+ MinmaxMetaPageData *metadata;
+ int i;
+
+ mm_page_init(page, MINMAX_PAGETYPE_META);
+
+ metadata = (MinmaxMetaPageData *) PageGetContents(page);
+
+ metadata->minmaxMagic = MINMAX_META_MAGIC;
+ metadata->pagesPerRange = pagesPerRange;
+ metadata->minmaxVersion = version;
+ for (i = 0; i < MAX_REVMAP_ARRAYPAGES; i++)
+ metadata->revmapArrayPages[i] = InvalidBlockNumber;
+}
+
+/*
+ * Build a MinmaxDesc used to create or scan a minmax index
+ */
+MinmaxDesc *
+minmax_build_mmdesc(Relation rel)
+{
+ MinmaxOpcInfo **opcinfo;
+ MinmaxDesc *mmdesc;
+ TupleDesc tupdesc;
+ int totalstored = 0;
+ int keyno;
+ long totalsize;
+ Datum indclassDatum;
+ oidvector *indclass;
+ bool isnull;
+
+ tupdesc = RelationGetDescr(rel);
+
+ /*
+ * Obtain MinmaxOpcInfo for each indexed column. While at it, accumulate
+ * the number of columns stored, since the number is opclass-defined.
+ */
+ indclassDatum = SysCacheGetAttr(INDEXRELID, rel->rd_indextuple,
+ Anum_pg_index_indclass, &isnull);
+ Assert(!isnull);
+ indclass = (oidvector *) DatumGetPointer(indclassDatum);
+ opcinfo = (MinmaxOpcInfo **) palloc(sizeof(MinmaxOpcInfo *) * tupdesc->natts);
+ for (keyno = 0; keyno < tupdesc->natts; keyno++)
+ {
+ Oid opfam = get_opclass_family(indclass->values[keyno]);
+ Oid idxtypid = tupdesc->attrs[keyno]->atttypid;
+ FmgrInfo *opcInfoFn;
+
+ opcInfoFn = index_getprocinfo(rel, keyno + 1, MINMAX_PROCNUM_OPCINFO);
+
+ opcinfo[keyno] = (MinmaxOpcInfo *)
+ DatumGetPointer(FunctionCall2(opcInfoFn,
+ ObjectIdGetDatum(opfam),
+ ObjectIdGetDatum(idxtypid)));
+ totalstored += opcinfo[keyno]->oi_nstored;
+ }
+
+ /* Allocate our result struct and fill it in */
+ totalsize = offsetof(MinmaxDesc, md_info) +
+ sizeof(MinmaxOpcInfo *) * tupdesc->natts;
+
+ mmdesc = palloc(totalsize);
+ mmdesc->md_index = rel;
+ mmdesc->md_tupdesc = CreateTupleDescCopy(tupdesc);
+ mmdesc->md_disktdesc = NULL; /* generated lazily */
+ mmdesc->md_totalstored = totalstored;
+
+ for (keyno = 0; keyno < tupdesc->natts; keyno++)
+ mmdesc->md_info[keyno] = opcinfo[keyno];
+
+ return mmdesc;
+}
+
+/*
+ * Initialize a MMBuildState appropriate to create tuples on the given index.
+ */
+static MMBuildState *
+initialize_mm_buildstate(Relation idxRel, mmRevmapAccess *rmAccess,
+ BlockNumber pagesPerRange)
+{
+ MMBuildState *mmstate;
+
+ mmstate = palloc(sizeof(MMBuildState));
+
+ mmstate->irel = idxRel;
+ mmstate->numtuples = 0;
+ mmstate->currentInsertBuf = InvalidBuffer;
+ mmstate->pagesPerRange = pagesPerRange;
+ mmstate->currRangeStart = 0;
+ mmstate->rmAccess = rmAccess;
+ mmstate->mmDesc = minmax_build_mmdesc(idxRel);
+ mmstate->dtuple = minmax_new_dtuple(mmstate->mmDesc);
+
+ minmax_dtuple_initialize(mmstate->dtuple, mmstate->mmDesc);
+
+ return mmstate;
+}
+
+/*
+ * Summarize the given page range of the given index.
+ */
+static void
+summarize_range(Relation idxRel, Relation heapRel, mmRevmapAccess *rmAccess,
+ BlockNumber heapBlk, BlockNumber pagesPerRange)
+{
+ IndexInfo *indexInfo;
+ MMBuildState *mmstate;
+
+ indexInfo = BuildIndexInfo(idxRel);
+
+ mmstate = initialize_mm_buildstate(idxRel, rmAccess, pagesPerRange);
+ mmstate->currRangeStart = heapBlk;
+
+ /*
+ * Execute the partial heap scan covering the heap blocks in the
+ * specified page range, summarizing the heap tuples in it. This scan
+ * stops just short of mmbuildCallback creating the new index entry.
+ */
+ IndexBuildHeapRangeScan(heapRel, idxRel, indexInfo, false,
+ heapBlk, pagesPerRange,
+ mmbuildCallback, (void *) mmstate);
+
+ /*
+ * Create the index tuple and insert it. Note mmbuildCallback didn't
+ * have the chance to actually insert anything into the index, because
+ * the heapscan should have ended just as it reached the final tuple in
+ * the range.
+ */
+ form_and_insert_tuple(mmstate);
+
+ /* and re-initialize state for the next range */
+ minmax_dtuple_initialize(mmstate->dtuple, mmstate->mmDesc);
+
+ if (BufferIsValid(mmstate->currentInsertBuf))
+ {
+ ReleaseBuffer(mmstate->currentInsertBuf);
+ mmstate->currentInsertBuf = InvalidBuffer;
+ }
+}
+
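+/*
+ * Replace the existing index tuple (origtup, of size origsz, stored at offset
+ * oldoff of the page in oldbuf) with newtup, of size newsz, updating the
+ * revmap if the replacement ends up on a different page. "samepage" is the
+ * caller's belief that the new tuple fits on the old page. Returns false if
+ * the old tuple was concurrently modified or the expected free space is no
+ * longer available, in which case the caller must start over; returns true
+ * once the update has been carried out.
+ */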
+static bool
+mm_doupdate(Relation idxrel, BlockNumber pagesPerRange,
+ mmRevmapAccess *rmAccess, BlockNumber heapBlk,
+ Buffer oldbuf,
+ OffsetNumber oldoff,
+ MMTuple *origtup, Size origsz,
+ MMTuple *newtup, Size newsz, bool samepage)
+{
+ Page oldpage;
+ ItemId origlp;
+ MMTuple *oldtup;
+ Size oldsz;
+ Buffer newbuf;
+
+ if (!samepage)
+ {
+ /* need a new page */
+ newbuf = mm_getinsertbuffer(idxrel, oldbuf, newsz);
+ /*
+ * Note: it's possible that the returned newbuf is the same as oldbuf,
+ * if mm_getinsertbuffer determined that the old buffer does in fact
+ * have enough space.
+ */
+ if (newbuf == oldbuf)
+ newbuf = InvalidBuffer;
+ }
+ else
+ {
+ /* same-page update: no new buffer needed */
+ newbuf = InvalidBuffer;
+ LockBuffer(oldbuf, BUFFER_LOCK_EXCLUSIVE);
+ }
+ oldpage = BufferGetPage(oldbuf);
+ origlp = PageGetItemId(oldpage, oldoff);
+
+ /* Check that the old tuple wasn't updated concurrently */
+ if (!ItemIdIsNormal(origlp))
+ {
+ LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+ return false;
+ }
+
+ oldsz = ItemIdGetLength(origlp);
+ oldtup = (MMTuple *) PageGetItem(oldpage, origlp);
+
+ if (!minmax_tuples_equal(oldtup, oldsz, origtup, origsz))
+ {
+ LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+ return false;
+ }
+
+ /* Great, the old tuple is intact. We can proceed with the update. */
+ /*
+ * If there's enough room on the old page for the new tuple, replace it.
+ *
+ * Note that there might now be enough space on the page even though
+ * the caller told us there isn't, if a concurrent update moved a tuple
+ * elsewhere or replaced a tuple with a smaller one.
+ */
+ if (newsz <= origsz || PageGetExactFreeSpace(oldpage) >= (newsz - origsz))
+ {
+ if (BufferIsValid(newbuf))
+ UnlockReleaseBuffer(newbuf);
+
+ START_CRIT_SECTION();
+ PageIndexDeleteNoCompact(oldpage, &oldoff, 1);
+ if (PageAddItem(oldpage, (Item) newtup, newsz, oldoff, true, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add mmtuple");
+ MarkBufferDirty(oldbuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(idxrel))
+ {
+ BlockNumber blk = BufferGetBlockNumber(oldbuf);
+ xl_minmax_samepage_update xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+ uint8 info = XLOG_MINMAX_SAMEPAGE_UPDATE;
+
+ xlrec.node = idxrel->rd_node;
+ ItemPointerSetBlockNumber(&xlrec.tid, blk);
+ ItemPointerSetOffsetNumber(&xlrec.tid, oldoff);
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxSamepageUpdate;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = (char *) newtup;
+ rdata[1].len = newsz;
+ rdata[1].buffer = oldbuf;
+ rdata[1].buffer_std = true;
+ rdata[1].next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+
+ PageSetLSN(oldpage, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+ return true;
+ }
+ else if (newbuf == InvalidBuffer)
+ {
+ /* Not enough space, but the caller thought there was. Have to start over. */
+ LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+ return false;
+ }
+ else
+ {
+ /*
+ * Not enough free space on the oldpage. Put the new tuple on the
+ * new page, and update the revmap.
+ */
+ Page newpage = BufferGetPage(newbuf);
+ Buffer revmapbuf;
+ ItemPointerData newtid;
+ OffsetNumber newoff;
+
+ revmapbuf = mmLockRevmapPageForUpdate(rmAccess, heapBlk);
+
+ START_CRIT_SECTION();
+
+ PageIndexDeleteNoCompact(oldpage, &oldoff, 1);
+ newoff = PageAddItem(newpage, (Item) newtup, newsz, InvalidOffsetNumber, false, false);
+ if (newoff == InvalidOffsetNumber)
+ elog(ERROR, "failed to add mmtuple to new page");
+ MarkBufferDirty(oldbuf);
+ MarkBufferDirty(newbuf);
+
+ ItemPointerSet(&newtid, BufferGetBlockNumber(newbuf), newoff);
+ mmSetHeapBlockItemptr(revmapbuf, pagesPerRange, heapBlk, newtid);
+ MarkBufferDirty(revmapbuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(idxrel))
+ {
+ xl_minmax_update xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[4];
+ uint8 info = XLOG_MINMAX_UPDATE;
+
+ xlrec.new.node = idxrel->rd_node;
+ ItemPointerSet(&xlrec.new.tid, BufferGetBlockNumber(newbuf), newoff);
+ xlrec.new.heapBlk = heapBlk;
+ xlrec.new.revmapBlk = BufferGetBlockNumber(revmapbuf);
+ xlrec.new.pagesPerRange = pagesPerRange;
+ ItemPointerSet(&xlrec.oldtid, BufferGetBlockNumber(oldbuf), oldoff);
+
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxUpdate;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = (char *) newtup;
+ rdata[1].len = newsz;
+ rdata[1].buffer = newbuf;
+ rdata[1].buffer_std = true;
+ rdata[1].next = &(rdata[2]);
+
+ rdata[2].data = (char *) NULL;
+ rdata[2].len = 0;
+ rdata[2].buffer = revmapbuf;
+ rdata[2].buffer_std = true;
+ rdata[2].next = &(rdata[3]);
+
+ rdata[3].data = (char *) NULL;
+ rdata[3].len = 0;
+ rdata[3].buffer = oldbuf;
+ rdata[3].buffer_std = true;
+ rdata[3].next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+
+ PageSetLSN(oldpage, recptr);
+ PageSetLSN(newpage, recptr);
+ PageSetLSN(BufferGetPage(revmapbuf), recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ LockBuffer(revmapbuf, BUFFER_LOCK_UNLOCK);
+ LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+ UnlockReleaseBuffer(newbuf);
+ return true;
+ }
+}
+
+/*
+ * Insert an index tuple into the index relation. The revmap is updated to
+ * mark the range containing the given page as pointing to the inserted entry.
+ * A WAL record is written.
+ *
+ * The buffer, if valid, is first checked for free space to insert the new
+ * entry; if there isn't enough, a new buffer is obtained and pinned.
+ */
+static void
+mm_doinsert(Relation idxrel, BlockNumber pagesPerRange,
+ mmRevmapAccess *rmAccess, Buffer *buffer,
+ BlockNumber heapBlk, MMTuple *tup, Size itemsz)
+{
+ Page page;
+ BlockNumber blk;
+ OffsetNumber off;
+ Buffer revmapbuf;
+ ItemPointerData tid;
+
+ itemsz = MAXALIGN(itemsz);
+
+ if (BufferIsValid(*buffer))
+ {
+ page = BufferGetPage(*buffer);
+ LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+ if (PageGetFreeSpace(page) < itemsz)
+ {
+ UnlockReleaseBuffer(*buffer);
+ *buffer = InvalidBuffer;
+ }
+ }
+
+ /*
+ * Obtain a locked buffer to insert the new tuple. Note mm_getinsertbuffer
+ * ensures there's enough space in the returned buffer.
+ */
+ if (!BufferIsValid(*buffer))
+ {
+ *buffer = mm_getinsertbuffer(idxrel, InvalidBuffer, itemsz);
+ page = BufferGetPage(*buffer);
+ Assert(PageGetFreeSpace(page) >= itemsz);
+ }
+
+ blk = BufferGetBlockNumber(*buffer);
+
+ /* lock the revmap for the update */
+ revmapbuf = mmLockRevmapPageForUpdate(rmAccess, heapBlk);
+
+ START_CRIT_SECTION();
+ off = PageAddItem(page, (Item) tup, itemsz, InvalidOffsetNumber,
+ false, false);
+ if (off == InvalidOffsetNumber)
+ elog(ERROR, "could not insert new index tuple to page");
+ MarkBufferDirty(*buffer);
+
+ MINMAX_elog(DEBUG2, "inserted tuple (%u,%u) for range starting at %u",
+ blk, off, heapBlk);
+
+ ItemPointerSet(&tid, blk, off);
+ mmSetHeapBlockItemptr(revmapbuf, pagesPerRange, heapBlk, tid);
+ MarkBufferDirty(revmapbuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(idxrel))
+ {
+ xl_minmax_insert xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+ uint8 info = XLOG_MINMAX_INSERT;
+
+ xlrec.node = idxrel->rd_node;
+ xlrec.heapBlk = heapBlk;
+ xlrec.pagesPerRange = pagesPerRange;
+ xlrec.revmapBlk = BufferGetBlockNumber(revmapbuf);
+ ItemPointerSet(&xlrec.tid, blk, off);
+
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxInsert;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].buffer_std = false;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = (char *) tup;
+ rdata[1].len = itemsz;
+ rdata[1].buffer = *buffer;
+ rdata[1].buffer_std = true;
+ rdata[1].next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+
+ PageSetLSN(page, recptr);
+ PageSetLSN(BufferGetPage(revmapbuf), recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /* Tuple is firmly on buffer; we can release our locks */
+ LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+ LockBuffer(revmapbuf, BUFFER_LOCK_UNLOCK);
+}
+
+/*
+ * Return a pinned and locked buffer which can be used to insert an index item
+ * of size itemsz.
+ *
+ * The passed buffer argument is tested for free space; if it has enough, it is
+ * locked and returned. Otherwise, that buffer (if valid) is unpinned, a new
+ * buffer is obtained, and returned pinned and locked.
+ *
+ * If there's no existing page with enough free space to accommodate the new
+ * item, the relation is extended; the newly allocated page is initialized and
+ * returned.
+ */
+static Buffer
+mm_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz)
+{
+ BlockNumber oldblk;
+ BlockNumber newblk;
+ Buffer buf;
+ Page page;
+ bool extended = false;
+ int freespace;
+
+ if (BufferIsValid(oldbuf))
+ oldblk = BufferGetBlockNumber(oldbuf);
+ else
+ oldblk = InvalidBlockNumber;
+
+ /*
+ * By the time we break out of this loop, buf is a locked and pinned
+ * buffer. It was tested for free space, but in some cases only before
+ * locking it, so a recheck is necessary because a concurrent inserter
+ * might have put items in it.
+ */
+ newblk = GetPageWithFreeSpace(irel, itemsz);
+ for (;;)
+ {
+ bool extensionLockHeld = false;
+
+ if (newblk == InvalidBlockNumber)
+ {
+ /*
+ * There's not enough free space in any existing index page,
+ * according to the FSM: extend the relation to obtain a shiny
+ * new page.
+ */
+ if (!RELATION_IS_LOCAL(irel))
+ {
+ LockRelationForExtension(irel, ExclusiveLock);
+ extensionLockHeld = true;
+ }
+ buf = ReadBuffer(irel, P_NEW);
+ extended = true;
+
+ MINMAX_elog(DEBUG2, "mm_getnewbuffer: extending to page %u",
+ BufferGetBlockNumber(buf));
+ }
+ else if (BufferIsValid(oldbuf) && newblk == oldblk)
+ {
+ /*
+ * There's an odd corner-case here where the FSM is out-of-date,
+ * and gave us the old page.
+ */
+ buf = oldbuf;
+ }
+ else
+ {
+ buf = ReadBuffer(irel, newblk);
+ }
+
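+ /*
+ * Lock the old buffer and the target buffer in block number order, to
+ * avoid deadlocking against a concurrent backend doing the same thing
+ * in the opposite order.
+ */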
+ if (BufferIsValid(oldbuf) && newblk < oldblk)
+ LockBuffer(oldbuf, BUFFER_LOCK_EXCLUSIVE);
+
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ if (BufferIsValid(oldbuf) && newblk > oldblk)
+ LockBuffer(oldbuf, BUFFER_LOCK_EXCLUSIVE);
+
+ if (extensionLockHeld)
+ UnlockRelationForExtension(irel, ExclusiveLock);
+
+ page = BufferGetPage(buf);
+
+ if (extended)
+ mm_page_init(page, MINMAX_PAGETYPE_REGULAR);
+
+ /*
+ * We now have a locked target buffer (from the FSM or from extending the
+ * relation), and the old buffer, if any, is locked too.
+ * Check that the new page has enough free space, and return it if it
+ * does; otherwise start over. Note that we allow for the FSM to be
+ * out of date here, and in that case we update it and move on.
+ */
+ freespace = PageGetFreeSpace(page);
+
+ if (freespace >= itemsz)
+ return buf;
+
+ /* This page is no good. */
+
+ /*
+ * If an entirely new page does not contain enough free space for
+ * the new item, then surely that item is oversized. Complain
+ * loudly; but first make sure we record the page as free, for
+ * next time.
+ */
+ if (extended)
+ {
+ RecordPageWithFreeSpace(irel, BufferGetBlockNumber(buf),
+ freespace);
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("index row size %lu exceeds maximum %lu for index \"%s\"",
+ (unsigned long) itemsz,
+ (unsigned long) freespace,
+ RelationGetRelationName(irel))));
+ }
+
+ if (!BufferIsValid(oldbuf) || newblk != oldblk)
+ UnlockReleaseBuffer(buf);
+ if (BufferIsValid(oldbuf))
+ LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+
+ newblk = RecordAndGetPageWithFreeSpace(irel, newblk, freespace, itemsz);
+ }
+
+ /* not reached */
+ return InvalidBuffer;
+}
+
+/*
+ * Given a deformed tuple in the build state, convert it into the on-disk
+ * format and insert it into the index, making the revmap point to it.
+ */
+static void
+form_and_insert_tuple(MMBuildState *mmstate)
+{
+ MMTuple *tup;
+ Size size;
+
+ /* if this dtuple didn't see any heap tuple at all, don't insert it */
+ if (!mmstate->dtuple->dt_seentup)
+ return;
+
+ tup = minmax_form_tuple(mmstate->mmDesc, mmstate->currRangeStart,
+ mmstate->dtuple, &size);
+ mm_doinsert(mmstate->irel, mmstate->pagesPerRange, mmstate->rmAccess,
+ &mmstate->currentInsertBuf, mmstate->currRangeStart, tup,
+ size);
+ mmstate->numtuples++;
+ pfree(tup);
+}
diff --git a/src/backend/access/minmax/mmrevmap.c b/src/backend/access/minmax/mmrevmap.c
new file mode 100644
index 0000000..08ab956
--- /dev/null
+++ b/src/backend/access/minmax/mmrevmap.c
@@ -0,0 +1,732 @@
+/*
+ * mmrevmap.c
+ * Reverse range map for MinMax indexes
+ *
+ * The reverse range map (revmap) is a translation structure for minmax
+ * indexes: for each page range, there is one most-up-to-date summary tuple,
+ * and its location is tracked by the revmap. Whenever a new heap tuple is
+ * inserted whose values fall outside the previously recorded min/max values
+ * for its page range, a new summary tuple is inserted into the index and the
+ * revmap is updated to point to it.
+ *
+ * The pages of the revmap are interspersed in the index's main fork. The
+ * first revmap page is always the index's page number one (that is,
+ * immediately after the metapage). Subsequent revmap pages are allocated as
+ * they are needed; their locations are tracked by "array pages". The metapage
+ * contains a large array of BlockNumbers, whose elements point to array pages. Thus,
+ * to find the second revmap page, we read the metapage and obtain the block
+ * number of the first array page; we then read that page, and the first
+ * element in it is the revmap page we're looking for.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmrevmap.c
+ */
+#include "postgres.h"
+
+#include "access/heapam_xlog.h"
+#include "access/minmax.h"
+#include "access/minmax_internal.h"
+#include "access/minmax_page.h"
+#include "access/minmax_revmap.h"
+#include "access/minmax_xlog.h"
+#include "access/rmgr.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "storage/lmgr.h"
+#include "storage/relfilenode.h"
+#include "storage/smgr.h"
+#include "utils/memutils.h"
+
+
+
+/*
+ * In regular revmap pages, each item stores an ItemPointerData. These defines
+ * let one find the logical revmap page number and index number of the revmap
+ * item for the given heap block number.
+ */
+#define HEAPBLK_TO_REVMAP_BLK(pagesPerRange, heapBlk) \
+ ((heapBlk / pagesPerRange) / REGULAR_REVMAP_PAGE_MAXITEMS)
+#define HEAPBLK_TO_REVMAP_INDEX(pagesPerRange, heapBlk) \
+ ((heapBlk / pagesPerRange) % REGULAR_REVMAP_PAGE_MAXITEMS)
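+
+/*
+ * Worked example (the per-page item count used here is hypothetical): with
+ * pagesPerRange = 4 and REGULAR_REVMAP_PAGE_MAXITEMS = 1000, heap block 8042
+ * belongs to range number 8042/4 = 2010, whose revmap entry is item 10 of
+ * logical revmap page 2.
+ */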
+
+/*
+ * In array revmap pages, each item stores a BlockNumber. These defines let
+ * one find the page and index number of a given revmap block number. Note
+ * that the first revmap page (revmap logical page number 0) is always stored
+ * in physical block number 1, so array pages do not store that one.
+ */
+#define MAPBLK_TO_RMARRAY_BLK(rmBlk) ((rmBlk - 1) / ARRAY_REVMAP_PAGE_MAXITEMS)
+#define MAPBLK_TO_RMARRAY_INDEX(rmBlk) ((rmBlk - 1) % ARRAY_REVMAP_PAGE_MAXITEMS)
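+
+/*
+ * Continuing the example above (again assuming, hypothetically, that
+ * ARRAY_REVMAP_PAGE_MAXITEMS = 1000): logical revmap page 2 is not stored in
+ * physical block 1 (that block holds logical page 0), so its location is
+ * found in item (2 - 1) % 1000 = 1 of revmap array page (2 - 1) / 1000 = 0.
+ */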
+
+
+struct mmRevmapAccess
+{
+ Relation idxrel;
+ BlockNumber pagesPerRange;
+ Buffer metaBuf;
+ Buffer currBuf;
+ Buffer currArrayBuf;
+ BlockNumber *revmapArrayPages;
+};
+/* typedef appears in minmax_revmap.h */
+
+
+static Buffer mm_getnewbuffer(Relation irel);
+
+/*
+ * Initialize an access object for a reverse range map, which can be used to
+ * read stuff from it. This must be freed by mmRevmapAccessTerminate when caller
+ * is done with it.
+ */
+mmRevmapAccess *
+mmRevmapAccessInit(Relation idxrel, BlockNumber *pagesPerRange)
+{
+ mmRevmapAccess *rmAccess;
+ Buffer meta;
+ MinmaxMetaPageData *metadata;
+
+ meta = ReadBuffer(idxrel, MINMAX_METAPAGE_BLKNO);
+ metadata = (MinmaxMetaPageData *) PageGetContents(BufferGetPage(meta));
+
+ rmAccess = palloc(sizeof(mmRevmapAccess));
+ rmAccess->metaBuf = meta;
+ rmAccess->idxrel = idxrel;
+ rmAccess->pagesPerRange = metadata->pagesPerRange;
+ rmAccess->currBuf = InvalidBuffer;
+ rmAccess->currArrayBuf = InvalidBuffer;
+ rmAccess->revmapArrayPages = NULL;
+
+ if (pagesPerRange)
+ *pagesPerRange = metadata->pagesPerRange;
+
+ return rmAccess;
+}
+
+/*
+ * Release resources associated with a revmap access object.
+ */
+void
+mmRevmapAccessTerminate(mmRevmapAccess *rmAccess)
+{
+ if (rmAccess->revmapArrayPages != NULL)
+ pfree(rmAccess->revmapArrayPages);
+ if (rmAccess->metaBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->metaBuf);
+ if (rmAccess->currBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currBuf);
+ if (rmAccess->currArrayBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currArrayBuf);
+ pfree(rmAccess);
+}
+
+/*
+ * In the given revmap page, which is used in a minmax index of pagesPerRange
+ * pages per range, set the element corresponding to heap block number heapBlk
+ * to the value (blkno, offno).
+ *
+ * Caller must have obtained the correct revmap page.
+ *
+ * This is used both in regular operation and during WAL replay.
+ */
+static void
+rm_page_set_iptr(Page page, BlockNumber pagesPerRange, BlockNumber heapBlk,
+ BlockNumber blkno, OffsetNumber offno)
+{
+ RevmapContents *contents;
+ ItemPointerData *iptr;
+
+ contents = (RevmapContents *) PageGetContents(page);
+ iptr = (ItemPointerData *) contents->rmr_tids;
+ iptr += HEAPBLK_TO_REVMAP_INDEX(pagesPerRange, heapBlk);
+
+ ItemPointerSet(iptr, blkno, offno);
+}
+
+/*
+ * Initialize a new regular revmap page, which stores the given revmap logical
+ * page number. The newly allocated physical block number is returned.
+ *
+ * Used both by regular code path as well as during xlog replay.
+ */
+BlockNumber
+initialize_rmr_page(Buffer newbuf, BlockNumber mapBlk)
+{
+ BlockNumber blkno;
+ Page page;
+ RevmapContents *contents;
+
+ page = BufferGetPage(newbuf);
+
+ mm_page_init(page, MINMAX_PAGETYPE_REVMAP);
+ contents = (RevmapContents *) PageGetContents(page);
+ contents->rmr_logblk = mapBlk;
+ /* the rmr_tids array is initialized to all invalid by PageInit */
+
+ blkno = BufferGetBlockNumber(newbuf);
+
+ return blkno;
+}
+
+/*
+ * Lock the metapage in the mode specified by the caller, and update the given
+ * rmAccess with the metapage data. The metapage buffer is locked when this function
+ * returns; it's the caller's responsibility to unlock it.
+ */
+static void
+rmaccess_get_metapage(mmRevmapAccess *rmAccess, int lockmode)
+{
+ MinmaxMetaPageData *metadata;
+ MinmaxSpecialSpace *special PG_USED_FOR_ASSERTS_ONLY;
+ Page metapage;
+
+ LockBuffer(rmAccess->metaBuf, lockmode);
+ metapage = BufferGetPage(rmAccess->metaBuf);
+
+#ifdef USE_ASSERT_CHECKING
+ /* ensure we really got the metapage */
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(metapage);
+ Assert(special->type == MINMAX_PAGETYPE_META);
+#endif
+
+ /* first time through? allocate the array */
+ if (rmAccess->revmapArrayPages == NULL)
+ rmAccess->revmapArrayPages =
+ palloc(sizeof(BlockNumber) * MAX_REVMAP_ARRAYPAGES);
+
+ metadata = (MinmaxMetaPageData *) PageGetContents(metapage);
+ memcpy(rmAccess->revmapArrayPages, metadata->revmapArrayPages,
+ sizeof(BlockNumber) * MAX_REVMAP_ARRAYPAGES);
+}
+
+/*
+ * Given a buffer (hopefully containing a blank page), set it up as a revmap
+ * array page.
+ *
+ * Used both by regular code path as well as during xlog replay.
+ */
+void
+initialize_rma_page(Buffer buf)
+{
+ Page arrayPg;
+ RevmapArrayContents *contents;
+
+ arrayPg = BufferGetPage(buf);
+ mm_page_init(arrayPg, MINMAX_PAGETYPE_REVMAP_ARRAY);
+ contents = (RevmapArrayContents *) PageGetContents(arrayPg);
+ contents->rma_nblocks = 0;
+ /* set the whole array to InvalidBlockNumber */
+ memset(contents->rma_blocks, 0xFF,
+ sizeof(BlockNumber) * ARRAY_REVMAP_PAGE_MAXITEMS);
+}
+
+/*
+ * Update the metapage, so that item arrayBlkIdx in the array of revmap array
+ * pages points to block number newPgBlkno.
+ */
+static void
+update_minmax_metapg(Relation idxrel, Buffer meta, uint32 arrayBlkIdx,
+ BlockNumber newPgBlkno)
+{
+ MinmaxMetaPageData *metadata;
+
+ metadata = (MinmaxMetaPageData *) PageGetContents(BufferGetPage(meta));
+
+ START_CRIT_SECTION();
+ metadata->revmapArrayPages[arrayBlkIdx] = newPgBlkno;
+ MarkBufferDirty(meta);
+ if (RelationNeedsWAL(idxrel))
+ {
+ xl_minmax_metapg_set xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata;
+
+ xlrec.node = idxrel->rd_node;
+ xlrec.blkidx = arrayBlkIdx;
+ xlrec.newpg = newPgBlkno;
+
+ rdata.data = (char *) &xlrec;
+ rdata.len = SizeOfMinmaxMetapgSet;
+ rdata.buffer = InvalidBuffer;
+ rdata.buffer_std = false;
+ rdata.next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_METAPG_SET, &rdata);
+ PageSetLSN(BufferGetPage(meta), recptr);
+ }
+ END_CRIT_SECTION();
+}
+
+/*
+ * Given a logical revmap block number, find its physical block number.
+ *
+ * Note this might involve up to two buffer reads, including a possible
+ * update to the metapage.
+ *
+ * If extend is set to true, and the page hasn't been set yet, extend the
+ * array to point to a newly allocated page.
+ */
+static BlockNumber
+rm_get_phys_blkno(mmRevmapAccess *rmAccess, BlockNumber mapBlk, bool extend)
+{
+ int arrayBlkIdx;
+ BlockNumber arrayBlk;
+ RevmapArrayContents *contents;
+ int revmapIdx;
+ BlockNumber targetblk;
+
+ /* the first revmap page is always block number 1 */
+ if (mapBlk == 0)
+ return (BlockNumber) 1;
+
+ /*
+ * For all other cases, take the long route of checking the metapage and
+ * revmap array pages.
+ */
+
+ /*
+ * Copy the revmap array from the metapage into private storage, if not
+ * done already in this scan.
+ */
+ if (rmAccess->revmapArrayPages == NULL)
+ {
+ rmaccess_get_metapage(rmAccess, BUFFER_LOCK_SHARE);
+ LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_UNLOCK);
+ }
+
+ /*
+ * Consult the metapage array; if the array page we need is not set there,
+ * we need to extend the index to allocate the array page, and update the
+ * metapage array.
+ */
+ arrayBlkIdx = MAPBLK_TO_RMARRAY_BLK(mapBlk);
+ if (arrayBlkIdx >= MAX_REVMAP_ARRAYPAGES)
+ elog(ERROR, "nonexistent revmap array page requested");
+
+ arrayBlk = rmAccess->revmapArrayPages[arrayBlkIdx];
+ if (arrayBlk == InvalidBlockNumber)
+ {
+ /* if not asked to extend, there's no further work to do here */
+ if (!extend)
+ return InvalidBlockNumber;
+
+ /*
+ * If we need to create a new array page, check the metapage again;
+ * someone might have created it after the last time we read the
+ * metapage. This time we acquire an exclusive lock, since we may need
+ * to extend. Lock before doing the physical relation extension, to
+ * avoid leaving an unused page around in case someone does this
+ * concurrently. Note that, unfortunately, we will be keeping the lock
+ * on the metapage alongside the relation extension lock, while doing a
+ * syscall involving disk I/O. Extending to add a new revmap array page
+ * is fairly infrequent, so it shouldn't be too bad.
+ *
+ * XXX it is possible to extend the relation unconditionally before
+ * locking the metapage, and later if we find that someone else had
+ * already added this page, save the page in FSM as MaxFSMRequestSize.
+ * That would be better for concurrency. Explore someday.
+ */
+ rmaccess_get_metapage(rmAccess, BUFFER_LOCK_EXCLUSIVE);
+
+ if (rmAccess->revmapArrayPages[arrayBlkIdx] == InvalidBlockNumber)
+ {
+ BlockNumber newPgBlkno;
+
+ /*
+ * Ok, definitely need to allocate a new revmap array page;
+ * initialize a new page to the initial (empty) array revmap state
+ * and register it in metapage.
+ */
+ rmAccess->currArrayBuf = mm_getnewbuffer(rmAccess->idxrel);
+ START_CRIT_SECTION();
+ initialize_rma_page(rmAccess->currArrayBuf);
+ MarkBufferDirty(rmAccess->currArrayBuf);
+ if (RelationNeedsWAL(rmAccess->idxrel))
+ {
+ xl_minmax_init_rmpg xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata;
+
+ xlrec.node = rmAccess->idxrel->rd_node;
+ xlrec.blkno = BufferGetBlockNumber(rmAccess->currArrayBuf);
+ xlrec.array = true;
+ xlrec.logblk = InvalidBlockNumber;
+
+ rdata.data = (char *) &xlrec;
+ rdata.len = SizeOfMinmaxInitRmpg;
+ rdata.buffer = InvalidBuffer; /* FIXME */
+ rdata.buffer_std = false;
+ rdata.next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_INIT_RMPG, &rdata);
+ PageSetLSN(BufferGetPage(rmAccess->currArrayBuf), recptr);
+ }
+ END_CRIT_SECTION();
+ LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_UNLOCK);
+ newPgBlkno = BufferGetBlockNumber(rmAccess->currArrayBuf);
+ rmAccess->revmapArrayPages[arrayBlkIdx] = newPgBlkno;
+
+ MINMAX_elog(DEBUG2, "allocated block for revmap array page: %u",
+ BufferGetBlockNumber(rmAccess->currArrayBuf));
+
+ /* Update the metapage to point to the new array page. */
+ update_minmax_metapg(rmAccess->idxrel, rmAccess->metaBuf, arrayBlkIdx,
+ newPgBlkno);
+ }
+
+ LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_UNLOCK);
+ arrayBlk = rmAccess->revmapArrayPages[arrayBlkIdx];
+ }
+
+ /*
+ * By here, we know the array page is set in the metapage array. Read that
+ * page; except that if we just allocated it, or we already hold pin on it,
+ * we don't need to read it again. XXX but we didn't hold lock!
+ */
+ Assert(arrayBlk != InvalidBlockNumber);
+
+ if (rmAccess->currArrayBuf == InvalidBuffer ||
+ BufferGetBlockNumber(rmAccess->currArrayBuf) != arrayBlk)
+ {
+ if (rmAccess->currArrayBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currArrayBuf);
+
+ rmAccess->currArrayBuf =
+ ReadBuffer(rmAccess->idxrel, arrayBlk);
+ }
+
+ LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_SHARE);
+
+ /*
+ * And now we can inspect its contents; if the target page is set, we can
+ * just return. Even if not set, we can also return if caller asked us not
+ * to extend the revmap.
+ */
+ contents = (RevmapArrayContents *)
+ PageGetContents(BufferGetPage(rmAccess->currArrayBuf));
+ revmapIdx = MAPBLK_TO_RMARRAY_INDEX(mapBlk);
+ if (!extend || revmapIdx <= contents->rma_nblocks - 1)
+ {
+ LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_UNLOCK);
+
+ return contents->rma_blocks[revmapIdx];
+ }
+
+ /*
+ * Trade our shared lock in the array page for exclusive, because we now
+ * need to allocate one more revmap page and modify the array page.
+ */
+ LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_UNLOCK);
+ LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_EXCLUSIVE);
+
+ contents = (RevmapArrayContents *)
+ PageGetContents(BufferGetPage(rmAccess->currArrayBuf));
+
+ /*
+ * If someone else already set the value while we were waiting for the
+ * exclusive lock, we're done; otherwise, allocate a new block as the
+ * new revmap page, and update the array page to point to it.
+ *
+ * FIXME -- what if we were asked not to extend?
+ */
+ if (contents->rma_blocks[revmapIdx] != InvalidBlockNumber)
+ {
+ targetblk = contents->rma_blocks[revmapIdx];
+ }
+ else
+ {
+ Buffer newbuf;
+
+ newbuf = mm_getnewbuffer(rmAccess->idxrel);
+ START_CRIT_SECTION();
+ targetblk = initialize_rmr_page(newbuf, mapBlk);
+ MarkBufferDirty(newbuf);
+ if (RelationNeedsWAL(rmAccess->idxrel))
+ {
+ xl_minmax_init_rmpg xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata;
+
+ xlrec.node = rmAccess->idxrel->rd_node;
+ xlrec.blkno = BufferGetBlockNumber(newbuf);
+ xlrec.array = false;
+ xlrec.logblk = mapBlk;
+
+ rdata.data = (char *) &xlrec;
+ rdata.len = SizeOfMinmaxInitRmpg;
+ rdata.buffer = InvalidBuffer;
+ rdata.buffer_std = false;
+ rdata.next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_INIT_RMPG, &rdata);
+ PageSetLSN(BufferGetPage(newbuf), recptr);
+ }
+ END_CRIT_SECTION();
+
+ UnlockReleaseBuffer(newbuf);
+
+ /*
+ * Modify the revmap array page to point to the newly allocated revmap
+ * page.
+ */
+ START_CRIT_SECTION();
+
+ contents->rma_blocks[revmapIdx] = targetblk;
+ /*
+ * XXX this rma_nblocks assignment should probably be conditional on the
+ * current rma_blocks value.
+ */
+ contents->rma_nblocks = revmapIdx + 1;
+ MarkBufferDirty(rmAccess->currArrayBuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rmAccess->idxrel))
+ {
+ xl_minmax_rmarray_set xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+ uint8 info;
+
+ info = XLOG_MINMAX_RMARRAY_SET;
+
+ xlrec.node = rmAccess->idxrel->rd_node;
+ xlrec.rmarray = BufferGetBlockNumber(rmAccess->currArrayBuf);
+ xlrec.blkidx = revmapIdx;
+ xlrec.newpg = targetblk;
+
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxRmarraySet;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].buffer_std = false;
+ rdata[0].next = &rdata[1];
+
+ rdata[1].data = NULL;
+ rdata[1].len = 0;
+ rdata[1].buffer = rmAccess->currArrayBuf;
+ rdata[1].buffer_std = false;
+ rdata[1].next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+ PageSetLSN(BufferGetPage(rmAccess->currArrayBuf), recptr);
+ }
+
+ END_CRIT_SECTION();
+ }
+
+ LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_UNLOCK);
+
+ return targetblk;
+}
+
+/*
+ * Prepare for updating an entry in the revmap.
+ *
+ * The map is extended, if necessary.
+ */
+Buffer
+mmLockRevmapPageForUpdate(mmRevmapAccess *rmAccess, BlockNumber heapBlk)
+{
+ BlockNumber mapBlk;
+
+ mapBlk = HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk);
+
+ /* Translate the map block number to physical location */
+ mapBlk = rm_get_phys_blkno(rmAccess, mapBlk, true);
+
+ MINMAX_elog(DEBUG2, "locking revmap page for logical page %lu (physical %u) for heap %u",
+ HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk),
+ mapBlk, heapBlk);
+
+ /*
+ * Obtain the buffer from which we need to read. If we already have the
+ * correct buffer in our access struct, use that; otherwise, release that,
+ * (if valid) and read the one we need.
+ */
+ if (rmAccess->currBuf == InvalidBuffer ||
+ mapBlk != BufferGetBlockNumber(rmAccess->currBuf))
+ {
+ if (rmAccess->currBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currBuf);
+
+ Assert(mapBlk != InvalidBlockNumber);
+ rmAccess->currBuf = ReadBuffer(rmAccess->idxrel, mapBlk);
+ }
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_EXCLUSIVE);
+
+ return rmAccess->currBuf;
+}
+
+/*
+ * In the given revmap buffer (pinned and exclusively locked by the caller),
+ * set the element corresponding to heapBlk to the given TID.
+ *
+ * Once the operation is complete, the caller must update the LSN on the
+ * buffer and release the lock.
+ */
+void
+mmSetHeapBlockItemptr(Buffer buf, BlockNumber pagesPerRange, BlockNumber heapBlk,
+ ItemPointerData tid)
+{
+ /* The correct page should already be pinned and locked */
+ rm_page_set_iptr(BufferGetPage(buf),
+ pagesPerRange,
+ heapBlk,
+ ItemPointerGetBlockNumber(&tid),
+ ItemPointerGetOffsetNumber(&tid));
+}
+
+
+/*
+ * Fetch the MMTuple for a given heap block.
+ *
+ * The buffer containing the tuple is locked, and returned in *buf. As an
+ * optimization, the caller can pass a pinned buffer *buf on entry, which will
+ * avoid a pin-unpin cycle when the next tuple is on the same page as the
+ * previous one.
+ *
+ * If no tuple is found for the given heap range, returns NULL. In that case,
+ * *buf might still be updated, but it's not locked.
+ */
+MMTuple *
+mmGetMMTupleForHeapBlock(mmRevmapAccess *rmAccess, BlockNumber heapBlk,
+ Buffer *buf, OffsetNumber *off, int mode)
+{
+ Relation idxRel = rmAccess->idxrel;
+ BlockNumber mapBlk;
+ RevmapContents *contents;
+ ItemPointerData *iptr;
+ BlockNumber blk;
+ Page page;
+ ItemId lp;
+ MMTuple *mmtup;
+
+ /* normalize the heap block number to be the first page in the range */
+ heapBlk = (heapBlk / rmAccess->pagesPerRange) * rmAccess->pagesPerRange;
+
+ mapBlk = HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk);
+ /* Translate the map block number to physical location */
+ mapBlk = rm_get_phys_blkno(rmAccess, mapBlk, false);
+ if (mapBlk == InvalidBlockNumber)
+ {
+ *off = InvalidOffsetNumber;
+ return NULL;
+ }
+
+ for (;;)
+ {
+ if (rmAccess->currBuf == InvalidBuffer ||
+ BufferGetBlockNumber(rmAccess->currBuf) != mapBlk)
+ {
+ if (rmAccess->currBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currBuf);
+
+ Assert(mapBlk != InvalidBlockNumber);
+ rmAccess->currBuf = ReadBuffer(rmAccess->idxrel, mapBlk);
+ }
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_SHARE);
+
+ contents = (RevmapContents *)
+ PageGetContents(BufferGetPage(rmAccess->currBuf));
+ iptr = contents->rmr_tids;
+ iptr += HEAPBLK_TO_REVMAP_INDEX(rmAccess->pagesPerRange, heapBlk);
+
+ if (!ItemPointerIsValid(iptr))
+ {
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_UNLOCK);
+ return NULL;
+ }
+
+ blk = ItemPointerGetBlockNumber(iptr);
+ *off = ItemPointerGetOffsetNumber(iptr);
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_UNLOCK);
+
+ /* Ok, got a pointer to where the MMTuple should be. Fetch it. */
+ if (!BufferIsValid(*buf) || BufferGetBlockNumber(*buf) != blk)
+ {
+ if (BufferIsValid(*buf))
+ ReleaseBuffer(*buf);
+ *buf = ReadBuffer(idxRel, blk);
+ }
+ LockBuffer(*buf, mode);
+ page = BufferGetPage(*buf);
+ lp = PageGetItemId(page, *off);
+ if (ItemIdIsUsed(lp))
+ {
+ mmtup = (MMTuple *) PageGetItem(page, lp);
+
+ if (mmtup->mt_blkno == heapBlk)
+ {
+ /* found it! */
+ return mmtup;
+ }
+ }
+ /*
+ * No luck. Assume that the revmap was updated concurrently.
+ *
+ * XXX: it would be nice to add some kind of a sanity check here to
+ * avoid looping infinitely, if the revmap points to wrong tuple for
+ * some reason.
+ */
+ LockBuffer(*buf, BUFFER_LOCK_UNLOCK);
+ }
+ /* not reached, but keep compiler quiet */
+ return NULL;
+}
+
+/*
+ * Initialize the revmap of a new minmax index.
+ *
+ * NB -- caller is assumed to WAL-log this operation
+ */
+void
+mmRevmapCreate(Relation idxrel)
+{
+ Buffer buf;
+
+ /*
+ * The first page of the revmap is always stored in block number 1 of the
+ * main fork. Because of this, the only thing we need to do is request
+ * a new page; we assume we are called immediately after the metapage has
+ * been initialized.
+ */
+ buf = mm_getnewbuffer(idxrel);
+ Assert(BufferGetBlockNumber(buf) == 1);
+
+ mm_page_init(BufferGetPage(buf), MINMAX_PAGETYPE_REVMAP);
+ MarkBufferDirty(buf);
+
+ UnlockReleaseBuffer(buf);
+}
+
+/*
+ * Return an exclusively-locked buffer resulting from extending the relation.
+ */
+static Buffer
+mm_getnewbuffer(Relation irel)
+{
+ Buffer buffer;
+ bool needLock = !RELATION_IS_LOCAL(irel);
+
+ /*
+ * XXX As a possible improvement, we could request a blank page from the FSM
+ * here. Such pages could get inserted into the FSM if, for instance, two
+ * processes extend the relation concurrently to add one more page to the
+ * revmap and the second one discovers it doesn't actually need the page it
+ * got.
+ */
+
+ if (needLock)
+ LockRelationForExtension(irel, ExclusiveLock);
+
+ buffer = ReadBuffer(irel, P_NEW);
+ LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+ MINMAX_elog(DEBUG2, "mm_getnewbuffer: extending to page %u",
+ BufferGetBlockNumber(buffer));
+
+ if (needLock)
+ UnlockRelationForExtension(irel, ExclusiveLock);
+
+ return buffer;
+}
diff --git a/src/backend/access/minmax/mmsortable.c b/src/backend/access/minmax/mmsortable.c
new file mode 100644
index 0000000..986e645
--- /dev/null
+++ b/src/backend/access/minmax/mmsortable.c
@@ -0,0 +1,280 @@
+/*
+ * mmsortable.c
+ * Implementation of Minmax indexes for sortable datatypes
+ * (that is, anything with a btree opclass)
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmsortable.c
+ */
+#include "postgres.h"
+
+#include "access/genam.h"
+#include "access/minmax_internal.h"
+#include "access/minmax_tuple.h"
+#include "access/skey.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
+#include "utils/syscache.h"
+
+
+/*
+ * Procedure numbers must not collide with MINMAX_PROCNUM defines in
+ * minmax_internal.h. Note we only need inequality functions.
+ */
+#define SORTABLE_NUM_PROCNUMS 4 /* # support procs we need */
+#define PROCNUM_LESS 4
+#define PROCNUM_LESSEQUAL 5
+#define PROCNUM_GREATEREQUAL 6
+#define PROCNUM_GREATER 7
+
+/* subtract this from procnum to obtain index in SortableOpaque arrays */
+#define PROCNUM_BASE 4
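+
+/*
+ * For example, PROCNUM_GREATEREQUAL (6) is cached at index 6 - PROCNUM_BASE
+ * = 2 of the arrays in SortableOpaque below.
+ */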
+
+static FmgrInfo *mmsrt_get_procinfo(MinmaxDesc *mmdesc, uint16 attno,
+ uint16 procnum);
+
+PG_FUNCTION_INFO_V1(mmSortableOpcInfo);
+PG_FUNCTION_INFO_V1(mmSortableAddValue);
+PG_FUNCTION_INFO_V1(mmSortableConsistent);
+
+Datum mmSortableOpcInfo(PG_FUNCTION_ARGS);
+Datum mmSortableAddValue(PG_FUNCTION_ARGS);
+Datum mmSortableConsistent(PG_FUNCTION_ARGS);
+
+typedef struct SortableOpaque
+{
+ FmgrInfo operators[SORTABLE_NUM_PROCNUMS];
+ bool inited[SORTABLE_NUM_PROCNUMS];
+} SortableOpaque;
+
+/*
+ * Return opclass information for a minmax index on a sortable datatype, as a
+ * pointer to a newly palloc'ed MinmaxOpcInfo. The support procedures
+ * themselves are looked up lazily.
+ */
+Datum
+mmSortableOpcInfo(PG_FUNCTION_ARGS)
+{
+ SortableOpaque *opaque;
+ MinmaxOpcInfo *result;
+
+ opaque = palloc0(sizeof(SortableOpaque));
+ /*
+ * 'operators' is initialized lazily, as indicated by 'inited' which was
+ * initialized to all false by palloc0.
+ */
+
+ result = palloc(sizeof(MinmaxOpcInfo));
+ result->oi_nstored = 2; /* min, max */
+ result->oi_opaque = opaque;
+
+ PG_RETURN_POINTER(result);
+}
+
+/*
+ * Examine the given index tuple (which contains partial status of a certain
+ * page range) by comparing it to the given value that comes from another heap
+ * tuple. If the new value is outside the range specified by the existing
+ * tuple values, update the index tuple and return true. Otherwise, return
+ * false and do not modify the tuple.
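+ *
+ * For instance, if the values recorded so far for a column are min = 10 and
+ * max = 20, a new heap value of 25 leaves the minimum alone, raises the
+ * maximum to 25, and makes this function return true; a new value of 15
+ * falls within the recorded range, so nothing changes and false is returned.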
+ */
+Datum
+mmSortableAddValue(PG_FUNCTION_ARGS)
+{
+ MinmaxDesc *mmdesc = (MinmaxDesc *) PG_GETARG_POINTER(0);
+ DeformedMMTuple *dtuple = (DeformedMMTuple *) PG_GETARG_POINTER(1);
+ AttrNumber attno = PG_GETARG_UINT16(2);
+ Datum newval = PG_GETARG_DATUM(3);
+ bool isnull = PG_GETARG_BOOL(4);
+ Oid colloid = PG_GET_COLLATION();
+ FmgrInfo *cmpFn;
+ Datum compar;
+ bool updated = false;
+
+ /*
+ * If the new value is null, we record that we saw it if it's the first
+ * one; otherwise, there's nothing to do.
+ */
+ if (isnull)
+ {
+ if (dtuple->dt_columns[attno - 1].hasnulls)
+ PG_RETURN_BOOL(false);
+
+ dtuple->dt_columns[attno - 1].hasnulls = true;
+ PG_RETURN_BOOL(true);
+ }
+
+ /*
+ * If the recorded value is null, store the new value (which we know to be
+ * not null) as both minimum and maximum, and we're done.
+ */
+ if (dtuple->dt_columns[attno - 1].allnulls)
+ {
+ dtuple->dt_columns[attno - 1].values[0] =
+ datumCopy(newval, mmdesc->md_tupdesc->attrs[attno - 1]->attbyval,
+ mmdesc->md_tupdesc->attrs[attno - 1]->attlen);
+ dtuple->dt_columns[attno - 1].values[1] =
+ datumCopy(newval, mmdesc->md_tupdesc->attrs[attno - 1]->attbyval,
+ mmdesc->md_tupdesc->attrs[attno - 1]->attlen);
+ dtuple->dt_columns[attno - 1].allnulls = false;
+ PG_RETURN_BOOL(true);
+ }
+
+ /*
+ * Otherwise, need to compare the new value with the existing boundaries
+ * and update them accordingly. First check if it's less than the existing
+ * minimum.
+ */
+ cmpFn = mmsrt_get_procinfo(mmdesc, attno, PROCNUM_LESS);
+ compar = FunctionCall2Coll(cmpFn, colloid, newval,
+ dtuple->dt_columns[attno - 1].values[0]);
+ if (DatumGetBool(compar))
+ {
+ dtuple->dt_columns[attno - 1].values[0] =
+ datumCopy(newval, mmdesc->md_tupdesc->attrs[attno - 1]->attbyval,
+ mmdesc->md_tupdesc->attrs[attno - 1]->attlen);
+ updated = true;
+ }
+
+ /*
+ * And now compare it to the existing maximum.
+ */
+ cmpFn = mmsrt_get_procinfo(mmdesc, attno, PROCNUM_GREATER);
+ compar = FunctionCall2Coll(cmpFn, colloid, newval,
+ dtuple->dt_columns[attno - 1].values[1]);
+ if (DatumGetBool(compar))
+ {
+ dtuple->dt_columns[attno - 1].values[1] =
+ datumCopy(newval, mmdesc->md_tupdesc->attrs[attno - 1]->attbyval,
+ mmdesc->md_tupdesc->attrs[attno - 1]->attlen);
+ updated = true;
+ }
+
+ PG_RETURN_BOOL(updated);
+}
+
+/*
+ * Given an index tuple corresponding to a certain page range and a scan key,
+ * return whether the scan key is consistent with the index tuple. Return true
+ * if so, false otherwise.
+ */
+Datum
+mmSortableConsistent(PG_FUNCTION_ARGS)
+{
+ MinmaxDesc *mmdesc = (MinmaxDesc *) PG_GETARG_POINTER(0);
+ DeformedMMTuple *dtup = (DeformedMMTuple *) PG_GETARG_POINTER(1);
+ ScanKey key = (ScanKey) PG_GETARG_POINTER(2);
+ Oid colloid = PG_GET_COLLATION();
+ AttrNumber attno = key->sk_attno;
+ Datum value;
+ Datum matches;
+
+ /* handle IS NULL/IS NOT NULL tests */
+ if (key->sk_flags & SK_ISNULL)
+ {
+ if (key->sk_flags & SK_SEARCHNULL)
+ {
+ if (dtup->dt_columns[attno - 1].allnulls ||
+ dtup->dt_columns[attno - 1].hasnulls)
+ PG_RETURN_BOOL(true);
+ PG_RETURN_BOOL(false);
+ }
+
+ /*
+ * For IS NOT NULL we can only exclude blocks if all values are nulls.
+ */
+ Assert(key->sk_flags & SK_SEARCHNOTNULL);
+ if (dtup->dt_columns[attno - 1].allnulls)
+ PG_RETURN_BOOL(false);
+ PG_RETURN_BOOL(true);
+ }
+
+ value = key->sk_argument;
+ switch (key->sk_strategy)
+ {
+ case BTLessStrategyNumber:
+ matches = FunctionCall2Coll(mmsrt_get_procinfo(mmdesc, attno,
+ PROCNUM_LESS),
+ colloid,
+ dtup->dt_columns[attno - 1].values[0],
+ value);
+ break;
+ case BTLessEqualStrategyNumber:
+ matches = FunctionCall2Coll(mmsrt_get_procinfo(mmdesc, attno,
+ PROCNUM_LESSEQUAL),
+ colloid,
+ dtup->dt_columns[attno - 1].values[0],
+ value);
+ break;
+ case BTEqualStrategyNumber:
+
+ /*
+ * In the equality case (WHERE col = someval), we want to return
+ * the current page range if the minimum value in the range <= scan
+ * key, and the maximum value >= scan key.
+ */
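+ /*
+ * For example, with a scan key of 42, a range summarized as [10, 50]
+ * passes both checks and is returned as a match, whereas [50, 90] fails
+ * the first check and is excluded.
+ */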
+ matches = FunctionCall2Coll(mmsrt_get_procinfo(mmdesc, attno,
+ PROCNUM_LESSEQUAL),
+ colloid,
+ dtup->dt_columns[attno - 1].values[0],
+ value);
+ if (!DatumGetBool(matches))
+ break;
+ /* max() >= scankey */
+ matches = FunctionCall2Coll(mmsrt_get_procinfo(mmdesc, attno,
+ PROCNUM_GREATEREQUAL),
+ colloid,
+ dtup->dt_columns[attno - 1].values[1],
+ value);
+ break;
+ case BTGreaterEqualStrategyNumber:
+ matches = FunctionCall2Coll(mmsrt_get_procinfo(mmdesc, attno,
+ PROCNUM_GREATEREQUAL),
+ colloid,
+ dtup->dt_columns[attno - 1].values[1],
+ value);
+ break;
+ case BTGreaterStrategyNumber:
+ matches = FunctionCall2Coll(mmsrt_get_procinfo(mmdesc, attno,
+ PROCNUM_GREATER),
+ colloid,
+ dtup->dt_columns[attno - 1].values[1],
+ value);
+ break;
+ default:
+ /* shouldn't happen */
+ elog(ERROR, "invalid strategy number %d", key->sk_strategy);
+ matches = 0;
+ break;
+ }
+
+ PG_RETURN_DATUM(matches);
+}
+
+/*
+ * Return the procedure corresponding to the given function support number.
+ */
+static FmgrInfo *
+mmsrt_get_procinfo(MinmaxDesc *mmdesc, uint16 attno, uint16 procnum)
+{
+ SortableOpaque *opaque;
+ uint16 basenum = procnum - PROCNUM_BASE;
+
+ opaque = (SortableOpaque *) mmdesc->md_info[attno - 1]->oi_opaque;
+
+ /*
+ * We cache these in the opaque struct, to avoid repetitive syscache
+ * lookups.
+ */
+ if (!opaque->inited[basenum])
+ {
+ fmgr_info_copy(&opaque->operators[basenum],
+ index_getprocinfo(mmdesc->md_index, attno, procnum),
+ CurrentMemoryContext);
+ opaque->inited[basenum] = true;
+ }
+
+ return &opaque->operators[basenum];
+}
diff --git a/src/backend/access/minmax/mmtuple.c b/src/backend/access/minmax/mmtuple.c
new file mode 100644
index 0000000..b65b979
--- /dev/null
+++ b/src/backend/access/minmax/mmtuple.c
@@ -0,0 +1,476 @@
+/*
+ * mmtuple.c
+ * MinMax-specific tuples
+ * Method implementations for tuples in minmax indexes.
+ *
+ * Intended usage is that code outside this file only deals with
+ * DeformedMMTuples, and convert to and from the on-disk representation through
+ * functions in this file.
+ *
+ * NOTES
+ *
+ * A minmax tuple is similar to a heap tuple, with a few key differences. The
+ * first interesting difference is that the tuple header is much simpler, only
+ * containing its total length and a small area for flags. Also, the stored
+ * data does not match the relation tuple descriptor exactly: for each
+ * attribute in the descriptor, the index tuple carries an arbitrary number
+ * of values, depending on the opclass.
+ *
+ * Also, for each column of the index relation there are two null bits: one
+ * (hasnulls) stores whether any tuple within the page range has that column
+ * set to null; the other one (allnulls) stores whether the column values are
+ * all null. If allnulls is true, then the tuple data area does not contain
+ * values for that column at all, whereas it does if only hasnulls is set.
+ * Note the size of the null bitmask may not be the same as that of the
+ * datum array.
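+ *
+ * As an example, in an index on three columns where only the second column's
+ * page range contains some (but not exclusively) null values, the null
+ * bitmask holds the three "allnulls" bits followed by the three "hasnulls"
+ * bits, and the data area still stores values for all three columns.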
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmtuple.c
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/minmax_tuple.h"
+#include "access/tupdesc.h"
+#include "access/tupmacs.h"
+
+
+static inline void mm_deconstruct_tuple(MinmaxDesc *mmdesc,
+ char *tp, bits8 *nullbits, bool nulls,
+ Datum *values, bool *allnulls, bool *hasnulls);
+
+
+/*
+ * Return a tuple descriptor used for on-disk storage of minmax tuples.
+ */
+static TupleDesc
+mmtuple_disk_tupdesc(MinmaxDesc *mmdesc)
+{
+ /* We cache these in the MinmaxDesc */
+ if (mmdesc->md_disktdesc == NULL)
+ {
+ int i;
+ int j;
+ AttrNumber attno = 1;
+ TupleDesc tupdesc;
+
+ tupdesc = CreateTemplateTupleDesc(mmdesc->md_totalstored, false);
+
+ for (i = 0; i < mmdesc->md_tupdesc->natts; i++)
+ {
+ for (j = 0; j < mmdesc->md_info[i]->oi_nstored; j++)
+ TupleDescInitEntry(tupdesc, attno++, NULL,
+ mmdesc->md_tupdesc->attrs[i]->atttypid,
+ mmdesc->md_tupdesc->attrs[i]->atttypmod,
+ 0);
+ }
+
+ mmdesc->md_disktdesc = tupdesc;
+ }
+
+ return mmdesc->md_disktdesc;
+}
+
+/*
+ * Generate a new on-disk tuple to be inserted in a minmax index.
+ */
+MMTuple *
+minmax_form_tuple(MinmaxDesc *mmdesc, BlockNumber blkno, DeformedMMTuple *tuple, Size *size)
+{
+ Datum *values;
+ bool *nulls;
+ bool anynulls = false;
+ MMTuple *rettuple;
+ int keyno;
+ int idxattno;
+ uint16 phony_infomask;
+ bits8 *phony_nullbitmap;
+ Size len,
+ hoff,
+ data_len;
+
+ Assert(mmdesc->md_totalstored > 0);
+
+ values = palloc(sizeof(Datum) * mmdesc->md_totalstored);
+ nulls = palloc0(sizeof(bool) * mmdesc->md_totalstored);
+ phony_nullbitmap = palloc(sizeof(bits8) * BITMAPLEN(mmdesc->md_totalstored));
+
+ /*
+ * Set up the values/nulls arrays for heap_fill_tuple
+ */
+ for (idxattno = 0, keyno = 0; keyno < mmdesc->md_tupdesc->natts; keyno++)
+ {
+ int datumno;
+
+ /*
+ * "allnulls" is set when there's no nonnull value in any row in
+ * the column; when this happens, there is no data to store. Thus
+ * set the nullable bits for all data elements of this column and
+ * we're done.
+ */
+ if (tuple->dt_columns[keyno].allnulls)
+ {
+ for (datumno = 0;
+ datumno < mmdesc->md_info[keyno]->oi_nstored;
+ datumno++)
+ nulls[idxattno++] = true;
+ anynulls = true;
+ continue;
+ }
+
+ /*
+ * The "hasnulls" bit is set when there are some null values in the
+ * data. We still need to store a real value, but the presence of this
+ * means we need a null bitmap.
+ */
+ if (tuple->dt_columns[keyno].hasnulls)
+ anynulls = true;
+
+ for (datumno = 0;
+ datumno < mmdesc->md_info[keyno]->oi_nstored;
+ datumno++)
+ /* XXX datumCopy ?? */
+ values[idxattno++] = tuple->dt_columns[keyno].values[datumno];
+ }
+
+ /* compute total space needed */
+ len = SizeOfMinMaxTuple;
+ if (anynulls)
+ {
+ /*
+ * We need a double-length bitmap on an on-disk minmax index tuple;
+ * the first half stores the "allnulls" bits, the second stores
+ * "hasnulls".
+ */
+ len += BITMAPLEN(mmdesc->md_tupdesc->natts * 2);
+ }
+
+ /*
+ * TODO: we can probably do away with alignment here, and save some
+ * precious disk space. When there's no bitmap we can save 6 bytes. Maybe
+ * we can use the first col's type alignment instead of maxalign.
+ */
+ len = hoff = MAXALIGN(len);
+
+ data_len = heap_compute_data_size(mmtuple_disk_tupdesc(mmdesc),
+ values, nulls);
+
+ len += data_len;
+
+ rettuple = palloc0(len);
+ rettuple->mt_blkno = blkno;
+ rettuple->mt_info = hoff;
+ Assert((rettuple->mt_info & MMIDX_OFFSET_MASK) == hoff);
+
+ /*
+ * The infomask and null bitmap as computed by heap_fill_tuple are useless
+ * to us. However, that function will not accept a null infomask; and we
+ * need to pass a valid null bitmap so that it will correctly skip
+ * outputting null attributes in the data area.
+ */
+ heap_fill_tuple(mmtuple_disk_tupdesc(mmdesc),
+ values,
+ nulls,
+ (char *) rettuple + hoff,
+ data_len,
+ &phony_infomask,
+ phony_nullbitmap);
+
+ /* done with these */
+ pfree(values);
+ pfree(nulls);
+ pfree(phony_nullbitmap);
+
+ /*
+ * Now fill in the real null bitmasks. allnulls first.
+ */
+ if (anynulls)
+ {
+ bits8 *bitP;
+ int bitmask;
+
+ rettuple->mt_info |= MMIDX_NULLS_MASK;
+
+ bitP = ((bits8 *) ((char *) rettuple + SizeOfMinMaxTuple)) - 1;
+ bitmask = HIGHBIT;
+ for (keyno = 0; keyno < mmdesc->md_tupdesc->natts; keyno++)
+ {
+ if (bitmask != HIGHBIT)
+ bitmask <<= 1;
+ else
+ {
+ bitP += 1;
+ *bitP = 0x0;
+ bitmask = 1;
+ }
+
+ if (tuple->dt_columns[keyno].allnulls)
+ continue;
+
+ *bitP |= bitmask;
+ }
+ /* hasnulls bits follow */
+ for (keyno = 0; keyno < mmdesc->md_tupdesc->natts; keyno++)
+ {
+ if (bitmask != HIGHBIT)
+ bitmask <<= 1;
+ else
+ {
+ bitP += 1;
+ *bitP = 0x0;
+ bitmask = 1;
+ }
+
+ if (tuple->dt_columns[keyno].hasnulls)
+ continue;
+
+ *bitP |= bitmask;
+ }
+ }
+
+ *size = len;
+ return rettuple;
+}
+
+/*
+ * Free a tuple created by minmax_form_tuple
+ */
+void
+minmax_free_tuple(MMTuple *tuple)
+{
+ pfree(tuple);
+}
+
+MMTuple *
+minmax_copy_tuple(MMTuple *tuple, Size len)
+{
+ MMTuple *newtup;
+
+ newtup = palloc(len);
+ memcpy(newtup, tuple, len);
+
+ return newtup;
+}
+
+bool
+minmax_tuples_equal(MMTuple *a, Size alen, MMTuple *b, Size blen)
+{
+ if (alen != blen)
+ return false;
+ if (memcmp(a, b, alen) != 0)
+ return false;
+ return true;
+}
+
+DeformedMMTuple *
+minmax_new_dtuple(MinmaxDesc *mmdesc)
+{
+ DeformedMMTuple *dtup;
+ char *currdatum;
+ long basesize;
+ int i;
+
+ basesize = MAXALIGN(sizeof(DeformedMMTuple) +
+ sizeof(MMValues) * mmdesc->md_tupdesc->natts);
+ dtup = palloc0(basesize + sizeof(Datum) * mmdesc->md_totalstored);
+ currdatum = (char *) dtup + basesize;
+ for (i = 0; i < mmdesc->md_tupdesc->natts; i++)
+ {
+ dtup->dt_columns[i].allnulls = true;
+ dtup->dt_columns[i].hasnulls = false;
+ dtup->dt_columns[i].values = (Datum *) currdatum;
+ currdatum += sizeof(Datum) * mmdesc->md_info[i]->oi_nstored;
+ }
+
+ return dtup;
+}
+
+/*
+ * Reset a DeformedMMTuple to initial state
+ */
+void
+minmax_dtuple_initialize(DeformedMMTuple *dtuple, MinmaxDesc *mmdesc)
+{
+ int i;
+
+ dtuple->dt_seentup = false;
+
+ for (i = 0; i < mmdesc->md_tupdesc->natts; i++)
+ {
+ /*
+ * FIXME -- we may need to pfree() some datums here before clobbering
+ * the whole thing
+ */
+ dtuple->dt_columns[i].allnulls = true;
+ dtuple->dt_columns[i].hasnulls = false;
+ memset(dtuple->dt_columns[i].values, 0,
+ sizeof(Datum) * mmdesc->md_info[i]->oi_nstored);
+ }
+}
+
+/*
+ * Convert a MMTuple back to a DeformedMMTuple. This is the reverse of
+ * minmax_form_tuple.
+ *
+ * Note we don't need the "on disk tupdesc" here; we rely on our own routine to
+ * deconstruct the tuple from the on-disk format.
+ *
+ * XXX some callers might need copies of each datum; if so we need to apply
+ * datumCopy inside the loop. We probably also need a minmax_free_dtuple()
+ * function.
+ */
+DeformedMMTuple *
+minmax_deform_tuple(MinmaxDesc *mmdesc, MMTuple *tuple)
+{
+ DeformedMMTuple *dtup;
+ Datum *values;
+ bool *allnulls;
+ bool *hasnulls;
+ char *tp;
+ bits8 *nullbits;
+ int keyno;
+ int valueno;
+
+ dtup = minmax_new_dtuple(mmdesc);
+
+ values = palloc(sizeof(Datum) * mmdesc->md_totalstored);
+ allnulls = palloc(sizeof(bool) * mmdesc->md_tupdesc->natts);
+ hasnulls = palloc(sizeof(bool) * mmdesc->md_tupdesc->natts);
+
+ tp = (char *) tuple + MMTupleDataOffset(tuple);
+
+ if (MMTupleHasNulls(tuple))
+ nullbits = (bits8 *) ((char *) tuple + SizeOfMinMaxTuple);
+ else
+ nullbits = NULL;
+ mm_deconstruct_tuple(mmdesc,
+ tp, nullbits, MMTupleHasNulls(tuple),
+ values, allnulls, hasnulls);
+
+ /*
+ * Iterate to assign each of the values to the corresponding item
+ * in the values array of each column.
+ */
+ for (valueno = 0, keyno = 0; keyno < mmdesc->md_tupdesc->natts; keyno++)
+ {
+ int i;
+
+ if (allnulls[keyno])
+ {
+ valueno += mmdesc->md_info[keyno]->oi_nstored;
+ continue;
+ }
+
+ dtup->dt_columns[keyno].values =
+ palloc(sizeof(Datum) * mmdesc->md_totalstored);
+
+ /* XXX optional datumCopy()? */
+ for (i = 0; i < mmdesc->md_info[keyno]->oi_nstored; i++)
+ dtup->dt_columns[keyno].values[i] = values[valueno++];
+
+ dtup->dt_columns[keyno].hasnulls = hasnulls[keyno];
+ dtup->dt_columns[keyno].allnulls = false;
+ }
+
+ pfree(values);
+ pfree(allnulls);
+ pfree(hasnulls);
+
+ return dtup;
+}
+
+/*
+ * mm_deconstruct_tuple
+ * Guts of attribute extraction from an on-disk minmax tuple.
+ *
+ * Its arguments are:
+ * mmdesc minmax descriptor for the stored tuple
+ * tp pointer to the tuple data area
+ * nullbits pointer to the tuple nulls bitmask
+ * nulls "has nulls" bit in tuple infomask
+ * values output values, array of size mmdesc->md_totalstored
+ * allnulls output "allnulls", size mmdesc->md_tupdesc->natts
+ * hasnulls output "hasnulls", size mmdesc->md_tupdesc->natts
+ *
+ * Output arrays must have been allocated by caller.
+ */
+static inline void
+mm_deconstruct_tuple(MinmaxDesc *mmdesc,
+ char *tp, bits8 *nullbits, bool nulls,
+ Datum *values, bool *allnulls, bool *hasnulls)
+{
+ int attnum;
+ int stored;
+ TupleDesc diskdsc;
+ long off = 0;
+
+ /*
+ * First, loop over the index attributes to obtain both null flags for each one.
+ */
+ for (attnum = 0; attnum < mmdesc->md_tupdesc->natts; attnum++)
+ {
+ /*
+ * the "all nulls" bit means that all values in the page range for
+ * this column are nulls. Therefore there are no values in the tuple
+ * data area.
+ */
+ if (nulls && att_isnull(attnum, nullbits))
+ {
+ allnulls[attnum] = true;
+ continue;
+ }
+
+ allnulls[attnum] = false;
+
+ /*
+ * the "has nulls" bit means that some tuples have nulls, but others
+ * have not-null values. Therefore we know the tuple contains data for
+ * this column.
+ *
+ * The hasnulls bits follow the allnulls bits in the same bitmask.
+ */
+ hasnulls[attnum] =
+ nulls && att_isnull(mmdesc->md_tupdesc->natts + attnum, nullbits);
+ }
+
+ /*
+ * Iterate to obtain each attribute's stored values. Note that since we
+ * may reuse attribute entries for more than one column, we cannot cache
+ * offsets here.
+ */
+ diskdsc = mmtuple_disk_tupdesc(mmdesc);
+ for (stored = 0, attnum = 0; attnum < mmdesc->md_tupdesc->natts; attnum++)
+ {
+ int datumno;
+
+ if (allnulls[attnum])
+ {
+ stored += mmdesc->md_info[attnum]->oi_nstored;
+ continue;
+ }
+
+ for (datumno = 0;
+ datumno < mmdesc->md_info[attnum]->oi_nstored;
+ datumno++)
+ {
+ Form_pg_attribute thisatt = diskdsc->attrs[stored];
+
+ if (thisatt->attlen == -1)
+ {
+ off = att_align_pointer(off, thisatt->attalign, -1,
+ tp + off);
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ off = att_align_nominal(off, thisatt->attalign);
+ }
+
+ values[stored++] = fetchatt(thisatt, tp + off);
+
+ off = att_addlength_pointer(off, thisatt->attlen, tp + off);
+ }
+ }
+}
diff --git a/src/backend/access/minmax/mmxlog.c b/src/backend/access/minmax/mmxlog.c
new file mode 100644
index 0000000..ab3f9fe
--- /dev/null
+++ b/src/backend/access/minmax/mmxlog.c
@@ -0,0 +1,360 @@
+/*
+ * mmxlog.c
+ * XLog replay routines for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmxlog.c
+ */
+#include "postgres.h"
+
+#include "access/minmax.h"
+#include "access/minmax_internal.h"
+#include "access/minmax_page.h"
+#include "access/minmax_revmap.h"
+#include "access/minmax_tuple.h"
+#include "access/minmax_xlog.h"
+#include "access/xlogutils.h"
+#include "storage/freespace.h"
+
+
+/*
+ * xlog replay routines
+ */
+static void
+minmax_xlog_createidx(XLogRecPtr lsn, XLogRecord *record)
+{
+ xl_minmax_createidx *xlrec = (xl_minmax_createidx *) XLogRecGetData(record);
+ Buffer buf;
+ Page page;
+
+ /* Backup blocks are not used in create_index records */
+ Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+
+ /* create the index's metapage */
+ buf = XLogReadBuffer(xlrec->node, MINMAX_METAPAGE_BLKNO, true);
+ Assert(BufferIsValid(buf));
+ page = (Page) BufferGetPage(buf);
+ mm_metapage_init(page, xlrec->pagesPerRange, xlrec->version);
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ UnlockReleaseBuffer(buf);
+
+ /* also initialize its first revmap page */
+ buf = XLogReadBuffer(xlrec->node, 1, true);
+ Assert(BufferIsValid(buf));
+ page = (Page) BufferGetPage(buf);
+ mm_page_init(page, MINMAX_PAGETYPE_REVMAP);
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ UnlockReleaseBuffer(buf);
+}
+
+/*
+ * Common part of an insert or update. Inserts the new tuple and updates the
+ * revmap.
+ */
+static void
+minmax_xlog_insert_update(XLogRecPtr lsn, XLogRecord *record, xl_minmax_insert *xlrec,
+ MMTuple *mmtuple, int tuplen)
+{
+ BlockNumber blkno;
+ Buffer buffer;
+ Page page;
+ OffsetNumber offnum;
+
+ /* If we have a full-page image, restore it */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ }
+ else
+ {
+ Assert(mmtuple->mt_blkno == xlrec->heapBlk);
+
+ blkno = ItemPointerGetBlockNumber(&(xlrec->tid));
+ if (record->xl_info & XLOG_MINMAX_INIT_PAGE)
+ {
+ buffer = XLogReadBuffer(xlrec->node, blkno, true);
+ Assert(BufferIsValid(buffer));
+ page = (Page) BufferGetPage(buffer);
+
+ mm_page_init(page, MINMAX_PAGETYPE_REGULAR);
+ }
+ else
+ {
+ buffer = XLogReadBuffer(xlrec->node, blkno, false);
+ }
+ if (BufferIsValid(buffer))
+ {
+ page = (Page) BufferGetPage(buffer);
+
+ if (lsn > PageGetLSN(page))
+ {
+ offnum = ItemPointerGetOffsetNumber(&(xlrec->tid));
+ if (PageGetMaxOffsetNumber(page) + 1 < offnum)
+ elog(PANIC, "minmax_xlog_insert: invalid max offset number");
+
+ offnum = PageAddItem(page, (Item) mmtuple, tuplen, offnum, true, false);
+ if (offnum == InvalidOffsetNumber)
+ elog(PANIC, "minmax_xlog_insert: failed to add tuple");
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ }
+ UnlockReleaseBuffer(buffer);
+ }
+ }
+
+ /* update the revmap */
+ if (record->xl_info & XLR_BKP_BLOCK(1))
+ {
+ (void) RestoreBackupBlock(lsn, record, 1, false, false);
+ }
+ else
+ {
+ buffer = XLogReadBuffer(xlrec->node, xlrec->revmapBlk, false);
+ if (BufferIsValid(buffer))
+ {
+ page = (Page) BufferGetPage(buffer);
+
+ if (lsn > PageGetLSN(page))
+ {
+ mmSetHeapBlockItemptr(buffer, xlrec->pagesPerRange, xlrec->heapBlk, xlrec->tid);
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ }
+ UnlockReleaseBuffer(buffer);
+ }
+ }
+
+ /* XXX no FSM updates here ... */
+}
+
+static void
+minmax_xlog_insert(XLogRecPtr lsn, XLogRecord *record)
+{
+ xl_minmax_insert *xlrec = (xl_minmax_insert *) XLogRecGetData(record);
+ MMTuple *newtup;
+ int tuplen;
+
+ tuplen = record->xl_len - SizeOfMinmaxInsert;
+ newtup = (MMTuple *) ((char *) xlrec + SizeOfMinmaxInsert);
+
+ minmax_xlog_insert_update(lsn, record, xlrec, newtup, tuplen);
+}
+
+static void
+minmax_xlog_update(XLogRecPtr lsn, XLogRecord *record)
+{
+ xl_minmax_update *xlrec = (xl_minmax_update *) XLogRecGetData(record);
+ BlockNumber blkno;
+ OffsetNumber offnum;
+ Buffer buffer;
+ Page page;
+ MMTuple *newtup;
+ int tuplen;
+
+ tuplen = record->xl_len - SizeOfMinmaxUpdate;
+ newtup = (MMTuple *) ((char *) xlrec + SizeOfMinmaxUpdate);
+
+ /* First insert the new tuple and update revmap, like in an insertion. */
+ minmax_xlog_insert_update(lsn, record, &xlrec->new, newtup, tuplen);
+
+ /* Then remove the old tuple */
+ if (record->xl_info & XLR_BKP_BLOCK(2))
+ {
+ (void) RestoreBackupBlock(lsn, record, 2, false, false);
+ }
+ else
+ {
+ blkno = ItemPointerGetBlockNumber(&(xlrec->oldtid));
+ buffer = XLogReadBuffer(xlrec->new.node, blkno, false);
+ if (BufferIsValid(buffer))
+ {
+ page = (Page) BufferGetPage(buffer);
+
+ if (lsn > PageGetLSN(page))
+ {
+ offnum = ItemPointerGetOffsetNumber(&(xlrec->oldtid));
+ if (PageGetMaxOffsetNumber(page) + 1 < offnum)
+ elog(PANIC, "minmax_xlog_update: invalid max offset number");
+
+ PageIndexDeleteNoCompact(page, &offnum, 1);
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ }
+ UnlockReleaseBuffer(buffer);
+ }
+ }
+}
+
+/*
+ * Update a tuple on a single page.
+ */
+static void
+minmax_xlog_samepage_update(XLogRecPtr lsn, XLogRecord *record)
+{
+ xl_minmax_samepage_update *xlrec = (xl_minmax_samepage_update *) XLogRecGetData(record);
+ BlockNumber blkno;
+ Buffer buffer;
+ Page page;
+ OffsetNumber offnum;
+
+ /* If we have a full-page image, restore it */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ }
+ else
+ {
+ MMTuple *mmtuple;
+ int tuplen;
+
+ tuplen = record->xl_len - SizeOfMinmaxSamepageUpdate;
+ mmtuple = (MMTuple *) ((char *) xlrec + SizeOfMinmaxSamepageUpdate);
+
+ blkno = ItemPointerGetBlockNumber(&(xlrec->tid));
+ buffer = XLogReadBuffer(xlrec->node, blkno, false);
+ if (BufferIsValid(buffer))
+ {
+ page = (Page) BufferGetPage(buffer);
+
+ if (lsn > PageGetLSN(page))
+ {
+ offnum = ItemPointerGetOffsetNumber(&(xlrec->tid));
+ if (PageGetMaxOffsetNumber(page) + 1 < offnum)
+ elog(PANIC, "minmax_xlog_samepage_update: invalid max offset number");
+
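+ /* replace the old tuple with the new one, keeping the same offset */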
+ PageIndexDeleteNoCompact(page, &offnum, 1);
+ offnum = PageAddItem(page, (Item) mmtuple, tuplen, offnum, true, false);
+ if (offnum == InvalidOffsetNumber)
+ elog(PANIC, "minmax_xlog_samepage_update: failed to add tuple");
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ }
+ UnlockReleaseBuffer(buffer);
+ }
+ }
+
+ /* XXX no FSM updates here ... */
+}
+
+
+static void
+minmax_xlog_metapg_set(XLogRecPtr lsn, XLogRecord *record)
+{
+ xl_minmax_metapg_set *xlrec = (xl_minmax_metapg_set *) XLogRecGetData(record);
+ Buffer meta;
+ Page metapg;
+ MinmaxMetaPageData *metadata;
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ meta = XLogReadBuffer(xlrec->node, MINMAX_METAPAGE_BLKNO, false);
+ Assert(BufferIsValid(meta));
+
+ metapg = BufferGetPage(meta);
+ metadata = (MinmaxMetaPageData *) PageGetContents(metapg);
+ metadata->revmapArrayPages[xlrec->blkidx] = xlrec->newpg;
+
+ PageSetLSN(metapg, lsn);
+ MarkBufferDirty(meta);
+ UnlockReleaseBuffer(meta);
+}
+
+static void
+minmax_xlog_init_rmpg(XLogRecPtr lsn, XLogRecord *record)
+{
+ xl_minmax_init_rmpg *xlrec = (xl_minmax_init_rmpg *) XLogRecGetData(record);
+ Buffer buffer;
+
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ buffer = XLogReadBuffer(xlrec->node, xlrec->blkno, true);
+ Assert(BufferIsValid(buffer));
+
+ if (xlrec->array)
+ initialize_rma_page(buffer);
+ else
+ initialize_rmr_page(buffer, xlrec->logblk);
+
+ PageSetLSN(BufferGetPage(buffer), lsn);
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+}
+
+static void
+minmax_xlog_rmarray_set(XLogRecPtr lsn, XLogRecord *record)
+{
+ xl_minmax_rmarray_set *xlrec = (xl_minmax_rmarray_set *) XLogRecGetData(record);
+ Buffer buffer;
+ Page page;
+ RevmapArrayContents *contents;
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ buffer = XLogReadBuffer(xlrec->node, xlrec->rmarray, false);
+ Assert(BufferIsValid(buffer));
+
+ page = BufferGetPage(buffer);
+
+ contents = (RevmapArrayContents *) PageGetContents(page);
+ contents->rma_blocks[xlrec->blkidx] = xlrec->newpg;
+ contents->rma_nblocks = xlrec->blkidx + 1; /* XXX is this okay? */
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+}
+
+void
+minmax_redo(XLogRecPtr lsn, XLogRecord *record)
+{
+ uint8 info = record->xl_info & ~XLR_INFO_MASK;
+
+ switch (info & XLOG_MINMAX_OPMASK)
+ {
+ case XLOG_MINMAX_CREATE_INDEX:
+ minmax_xlog_createidx(lsn, record);
+ break;
+ case XLOG_MINMAX_INSERT:
+ minmax_xlog_insert(lsn, record);
+ break;
+ case XLOG_MINMAX_UPDATE:
+ minmax_xlog_update(lsn, record);
+ break;
+ case XLOG_MINMAX_SAMEPAGE_UPDATE:
+ minmax_xlog_samepage_update(lsn, record);
+ break;
+ case XLOG_MINMAX_METAPG_SET:
+ minmax_xlog_metapg_set(lsn, record);
+ break;
+ case XLOG_MINMAX_RMARRAY_SET:
+ minmax_xlog_rmarray_set(lsn, record);
+ break;
+ case XLOG_MINMAX_INIT_RMPG:
+ minmax_xlog_init_rmpg(lsn, record);
+ break;
+ default:
+ elog(PANIC, "minmax_redo: unknown op code %u", info);
+ }
+}
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index 7d092d2..5575a71 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -9,7 +9,8 @@ top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = clogdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
- mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o smgrdesc.o spgdesc.o \
+ minmaxdesc.o mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o \
+ smgrdesc.o spgdesc.o \
standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index c0a7a6f..e285e50 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -12,6 +12,7 @@
#include "access/gist_private.h"
#include "access/hash.h"
#include "access/heapam_xlog.h"
+#include "access/minmax_xlog.h"
#include "access/multixact.h"
#include "access/nbtree.h"
#include "access/spgist.h"
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index a5a204e..cbb0ab8 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2096,6 +2096,27 @@ IndexBuildHeapScan(Relation heapRelation,
IndexBuildCallback callback,
void *callback_state)
{
+ return IndexBuildHeapRangeScan(heapRelation, indexRelation,
+ indexInfo, allow_sync,
+ 0, InvalidBlockNumber,
+ callback, callback_state);
+}
+
+/*
+ * As above, except that instead of scanning the complete heap, only the given
+ * number of blocks are scanned. Scan to end-of-rel can be signalled by
+ * passing InvalidBlockNumber as numblocks.
+ */
+double
+IndexBuildHeapRangeScan(Relation heapRelation,
+ Relation indexRelation,
+ IndexInfo *indexInfo,
+ bool allow_sync,
+ BlockNumber start_blockno,
+ BlockNumber numblocks,
+ IndexBuildCallback callback,
+ void *callback_state)
+{
bool is_system_catalog;
bool checking_uniqueness;
HeapScanDesc scan;
@@ -2166,6 +2187,9 @@ IndexBuildHeapScan(Relation heapRelation,
true, /* buffer access strategy OK */
allow_sync); /* syncscan OK? */
+ /* set our endpoints */
+ heap_setscanlimits(scan, start_blockno, numblocks);
+
reltuples = 0;
/*
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 9f1b20e..55e375f 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -132,6 +132,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogRecord *record)
case RM_GIST_ID:
case RM_SEQ_ID:
case RM_SPGIST_ID:
+ case RM_MINMAX_ID:
break;
case RM_NEXT_ID:
elog(ERROR, "unexpected RM_NEXT_ID rmgr_id: %u", (RmgrIds) buf.record.xl_rmid);
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 6351a9b..2b858c8 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -399,7 +399,8 @@ PageRestoreTempPage(Page tempPage, Page oldPage)
}
/*
- * sorting support for PageRepairFragmentation and PageIndexMultiDelete
+ * sorting support for PageRepairFragmentation, PageIndexMultiDelete,
+ * PageIndexDeleteNoCompact
*/
typedef struct itemIdSortData
{
@@ -896,6 +897,182 @@ PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
phdr->pd_upper = upper;
}
+/*
+ * PageIndexDeleteNoCompact
+ * Delete the given items for an index page, and defragment the resulting
+ * free space, but do not compact the item pointers array.
+ *
+ * itemnos is the array of offset numbers of the items to delete; nitems is
+ * its size.
+ *
+ * Unused items at the end of the array are removed.
+ *
+ * This is used for index AMs that require that existing TIDs of live tuples
+ * remain unchanged.
+ */
+void
+PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos, int nitems)
+{
+ PageHeader phdr = (PageHeader) page;
+ LocationIndex pd_lower = phdr->pd_lower;
+ LocationIndex pd_upper = phdr->pd_upper;
+ LocationIndex pd_special = phdr->pd_special;
+ int nline;
+ bool empty;
+ OffsetNumber offnum;
+ int nextitm;
+
+ /*
+ * As with PageRepairFragmentation, paranoia seems justified.
+ */
+ if (pd_lower < SizeOfPageHeaderData ||
+ pd_lower > pd_upper ||
+ pd_upper > pd_special ||
+ pd_special > BLCKSZ ||
+ pd_special != MAXALIGN(pd_special))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u",
+ pd_lower, pd_upper, pd_special)));
+
+ /*
+ * Scan the existing item pointer array and mark as unused those that are
+ * in our kill-list; make sure any non-interesting ones are marked unused
+ * as well.
+ */
+ nline = PageGetMaxOffsetNumber(page);
+ empty = true;
+ nextitm = 0;
+ for (offnum = FirstOffsetNumber; offnum <= nline; offnum = OffsetNumberNext(offnum))
+ {
+ ItemId lp;
+ ItemLength itemlen;
+ ItemOffset offset;
+
+ lp = PageGetItemId(page, offnum);
+
+ itemlen = ItemIdGetLength(lp);
+ offset = ItemIdGetOffset(lp);
+
+ if (ItemIdIsUsed(lp))
+ {
+ if (offset < pd_upper ||
+ (offset + itemlen) > pd_special ||
+ offset != MAXALIGN(offset))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted item pointer: offset = %u, length = %u",
+ offset, (unsigned int) itemlen)));
+
+ if (nextitm < nitems && offnum == itemnos[nextitm])
+ {
+ /* this one is on our list to delete, so mark it unused */
+ ItemIdSetUnused(lp);
+ nextitm++;
+ }
+ else if (ItemIdHasStorage(lp))
+ {
+ /* This one's live -- must do the compaction dance */
+ empty = false;
+ }
+ else
+ {
+ /* get rid of this one too */
+ ItemIdSetUnused(lp);
+ }
+ }
+ }
+
+ /* this will catch invalid or out-of-order itemnos[] */
+ if (nextitm != nitems)
+ elog(ERROR, "incorrect index offsets supplied");
+
+ if (empty)
+ {
+ /* Page is completely empty, so just reset it quickly */
+ phdr->pd_lower = SizeOfPageHeaderData;
+ phdr->pd_upper = pd_special;
+ }
+ else
+ {
+ /* There are live items: need to compact the page the hard way */
+ itemIdSortData itemidbase[MaxOffsetNumber];
+ itemIdSort itemidptr;
+ int i;
+ Size totallen;
+ Offset upper;
+
+ /*
+ * Scan the page taking note of each item that we need to preserve.
+ * This includes both live items (those that contain data) and
+ * interspersed unused ones. It's critical to preserve these unused
+ * items, because otherwise the offset numbers for later live items
+ * would change, which is not acceptable. Unused items might get used
+ * again later; that is fine.
+ */
+ itemidptr = itemidbase;
+ totallen = 0;
+ for (i = 0; i < nline; i++, itemidptr++)
+ {
+ ItemId lp;
+
+ itemidptr->offsetindex = i;
+
+ lp = PageGetItemId(page, i + 1);
+ if (ItemIdHasStorage(lp))
+ {
+ itemidptr->itemoff = ItemIdGetOffset(lp);
+ itemidptr->alignedlen = MAXALIGN(ItemIdGetLength(lp));
+ totallen += itemidptr->alignedlen;
+ }
+ else
+ {
+ itemidptr->itemoff = 0;
+ itemidptr->alignedlen = 0;
+ }
+ }
+ /* By here, there are exactly nline elements in itemidbase array */
+
+ if (totallen > (Size) (pd_special - pd_lower))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted item lengths: total %u, available space %u",
+ (unsigned int) totallen, pd_special - pd_lower)));
+
+ /* sort itemIdSortData array into decreasing itemoff order */
+ qsort((char *) itemidbase, nline, sizeof(itemIdSortData),
+ itemoffcompare);
+
+ /*
+ * Defragment the data areas of each tuple, being careful to preserve
+ * each item's position in the linp array.
+ */
+ upper = pd_special;
+ PageClearHasFreeLinePointers(page);
+ for (i = 0, itemidptr = itemidbase; i < nline; i++, itemidptr++)
+ {
+ ItemId lp;
+
+ lp = PageGetItemId(page, itemidptr->offsetindex + 1);
+ if (itemidptr->alignedlen == 0)
+ {
+ PageSetHasFreeLinePointers(page);
+ ItemIdSetUnused(lp);
+ continue;
+ }
+ upper -= itemidptr->alignedlen;
+ memmove((char *) page + upper,
+ (char *) page + itemidptr->itemoff,
+ itemidptr->alignedlen);
+ lp->lp_off = upper;
+ /* lp_flags and lp_len remain the same as originally */
+ }
+
+ /* Set the new page limits */
+ phdr->pd_upper = upper;
+ phdr->pd_lower = SizeOfPageHeaderData + i * sizeof(ItemIdData);
+ }
+}
/*
* Set checksum for a page in shared buffers.
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index e932ccf..61e1a28 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -7349,3 +7349,27 @@ gincostestimate(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+
+Datum
+mmcostestimate(PG_FUNCTION_ARGS)
+{
+ PlannerInfo *root = (PlannerInfo *) PG_GETARG_POINTER(0);
+ IndexPath *path = (IndexPath *) PG_GETARG_POINTER(1);
+ double loop_count = PG_GETARG_FLOAT8(2);
+ Cost *indexStartupCost = (Cost *) PG_GETARG_POINTER(3);
+ Cost *indexTotalCost = (Cost *) PG_GETARG_POINTER(4);
+ Selectivity *indexSelectivity = (Selectivity *) PG_GETARG_POINTER(5);
+ double *indexCorrelation = (double *) PG_GETARG_POINTER(6);
+ IndexOptInfo *index = path->indexinfo;
+
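+ /* charge for reading every page of the index, once per loop */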
+ *indexStartupCost = (Cost) seq_page_cost * index->pages * loop_count;
+ *indexTotalCost = *indexStartupCost;
+
+ *indexSelectivity =
+ clauselist_selectivity(root, path->indexquals,
+ path->indexinfo->rel->relid,
+ JOIN_INNER, NULL);
+ *indexCorrelation = 1;
+
+ PG_RETURN_VOID();
+}
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 493839f..5354a3b 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -112,6 +112,8 @@ extern HeapScanDesc heap_beginscan_strat(Relation relation, Snapshot snapshot,
bool allow_strat, bool allow_sync);
extern HeapScanDesc heap_beginscan_bm(Relation relation, Snapshot snapshot,
int nkeys, ScanKey key);
+extern void heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk,
+ BlockNumber numBlks);
extern void heap_rescan(HeapScanDesc scan, ScanKey key);
extern void heap_endscan(HeapScanDesc scan);
extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
diff --git a/src/include/access/minmax.h b/src/include/access/minmax.h
new file mode 100644
index 0000000..edb88ba
--- /dev/null
+++ b/src/include/access/minmax.h
@@ -0,0 +1,52 @@
+/*
+ * AM-callable functions for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax.h
+ */
+#ifndef MINMAX_H
+#define MINMAX_H
+
+#include "fmgr.h"
+#include "nodes/execnodes.h"
+#include "utils/relcache.h"
+
+
+/*
+ * prototypes for functions in minmax.c (external entry points for minmax)
+ */
+extern Datum mmbuild(PG_FUNCTION_ARGS);
+extern Datum mmbuildempty(PG_FUNCTION_ARGS);
+extern Datum mminsert(PG_FUNCTION_ARGS);
+extern Datum mmbeginscan(PG_FUNCTION_ARGS);
+extern Datum mmgettuple(PG_FUNCTION_ARGS);
+extern Datum mmgetbitmap(PG_FUNCTION_ARGS);
+extern Datum mmrescan(PG_FUNCTION_ARGS);
+extern Datum mmendscan(PG_FUNCTION_ARGS);
+extern Datum mmmarkpos(PG_FUNCTION_ARGS);
+extern Datum mmrestrpos(PG_FUNCTION_ARGS);
+extern Datum mmbulkdelete(PG_FUNCTION_ARGS);
+extern Datum mmvacuumcleanup(PG_FUNCTION_ARGS);
+extern Datum mmcanreturn(PG_FUNCTION_ARGS);
+extern Datum mmcostestimate(PG_FUNCTION_ARGS);
+extern Datum mmoptions(PG_FUNCTION_ARGS);
+
+/*
+ * Storage type for MinMax' reloptions
+ */
+typedef struct MinmaxOptions
+{
+ int32 vl_len_; /* varlena header (do not touch directly!) */
+ BlockNumber pagesPerRange;
+} MinmaxOptions;
+
+#define MINMAX_DEFAULT_PAGES_PER_RANGE 128
+#define MinmaxGetPagesPerRange(relation) \
+ ((relation)->rd_options ? \
+ ((MinmaxOptions *) (relation)->rd_options)->pagesPerRange : \
+ MINMAX_DEFAULT_PAGES_PER_RANGE)
+
+#endif /* MINMAX_H */
diff --git a/src/include/access/minmax_internal.h b/src/include/access/minmax_internal.h
new file mode 100644
index 0000000..6b5135b
--- /dev/null
+++ b/src/include/access/minmax_internal.h
@@ -0,0 +1,83 @@
+/*
+ * minmax_internal.h
+ * internal declarations for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_internal.h
+ */
+#ifndef MINMAX_INTERNAL_H
+#define MINMAX_INTERNAL_H
+
+#include "fmgr.h"
+#include "storage/buf.h"
+#include "storage/bufpage.h"
+#include "storage/off.h"
+#include "utils/relcache.h"
+
+
+/*
+ * A MinmaxDesc is a struct designed to enable decoding a MinMax tuple from the
+ * on-disk format to a DeformedMMTuple and vice-versa.
+ *
+ * Note: we assume, for now, that the data stored for each column is the same
+ * datatype as the indexed heap column. This restriction can be lifted by
+ * having an Oid array pointer on the PerCol struct, where each member of the
+ * array indicates the typid of the stored data.
+ */
+
+/* struct returned by "OpcInfo" amproc */
+typedef struct MinmaxOpcInfo
+{
+ /* Number of columns stored in an index column of this opclass */
+ uint16 oi_nstored;
+
+ /* Opaque pointer for the opclass' private use */
+ void *oi_opaque;
+} MinmaxOpcInfo;
+
+typedef struct MinmaxDesc
+{
+ /* the index relation itself */
+ Relation md_index;
+
+ /* tuple descriptor of the index relation */
+ TupleDesc md_tupdesc;
+
+ /* cached copy for on-disk tuples; generated at first use */
+ TupleDesc md_disktdesc;
+
+ /* total number of Datum entries that are stored on-disk for all columns */
+ int md_totalstored;
+
+ /* per-column info */
+ MinmaxOpcInfo *md_info[FLEXIBLE_ARRAY_MEMBER]; /* tupdesc->natts entries long */
+} MinmaxDesc;
+
+/*
+ * Globally-known function support numbers for Minmax indexes. Individual
+ * opclasses define their own function support numbers, which must not collide
+ * with the definitions here.
+ */
+#define MINMAX_PROCNUM_OPCINFO 1
+#define MINMAX_PROCNUM_ADDVALUE 2
+#define MINMAX_PROCNUM_CONSISTENT 3
+
+#define MINMAX_DEBUG
+
+/* we allow debug if using GCC; otherwise don't bother */
+#if defined(MINMAX_DEBUG) && defined(__GNUC__)
+#define MINMAX_elog(level, ...) elog(level, __VA_ARGS__)
+#else
+#define MINMAX_elog(a) ((void) 0)
+#endif
+
+/* minmax.c */
+extern MinmaxDesc *minmax_build_mmdesc(Relation rel);
+extern void mm_page_init(Page page, uint16 type);
+extern void mm_metapage_init(Page page, BlockNumber pagesPerRange,
+ uint16 version);
+
+#endif /* MINMAX_INTERNAL_H */
diff --git a/src/include/access/minmax_page.h b/src/include/access/minmax_page.h
new file mode 100644
index 0000000..04f40d8
--- /dev/null
+++ b/src/include/access/minmax_page.h
@@ -0,0 +1,88 @@
+/*
+ * Prototypes and definitions for minmax page layouts
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_page.h
+ *
+ * NOTES
+ *
+ * These structs should really be private to specific minmax files, but it's
+ * useful to have them here so that they can be used by pageinspect and similar
+ * tools.
+ */
+#ifndef MINMAX_PAGE_H
+#define MINMAX_PAGE_H
+
+
+/* special space on all minmax pages stores a "type" identifier */
+#define MINMAX_PAGETYPE_META 0xF091
+#define MINMAX_PAGETYPE_REVMAP_ARRAY 0xF092
+#define MINMAX_PAGETYPE_REVMAP 0xF093
+#define MINMAX_PAGETYPE_REGULAR 0xF094
+
+typedef struct MinmaxSpecialSpace
+{
+ uint16 type;
+} MinmaxSpecialSpace;
+
+/* Metapage definitions */
+typedef struct MinmaxMetaPageData
+{
+ uint32 minmaxMagic;
+ uint32 minmaxVersion;
+ BlockNumber pagesPerRange;
+ BlockNumber revmapArrayPages[1]; /* actually MAX_REVMAP_ARRAYPAGES */
+} MinmaxMetaPageData;
+
+/*
+ * Maximum number of revmap array pages that can be listed in the metapage;
+ * this accounts for the page header, the metapage struct, and the minmax
+ * special space.
+ */
+#define MAX_REVMAP_ARRAYPAGES \
+ ((BLCKSZ - \
+ MAXALIGN(SizeOfPageHeaderData) - \
+ offsetof(MinmaxMetaPageData, revmapArrayPages) - \
+ MAXALIGN(sizeof(MinmaxSpecialSpace)) ) / \
+ sizeof(BlockNumber))
+
+#define MINMAX_CURRENT_VERSION 1
+#define MINMAX_META_MAGIC 0xA8109CFA
+
+#define MINMAX_METAPAGE_BLKNO 0
+
+/* Definitions for regular revmap pages */
+typedef struct RevmapContents
+{
+ int32 rmr_logblk; /* logical blkno of this revmap page */
+ ItemPointerData rmr_tids[1]; /* really REGULAR_REVMAP_PAGE_MAXITEMS */
+} RevmapContents;
+
+#define REGULAR_REVMAP_CONTENT_SIZE \
+ (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - \
+ offsetof(RevmapContents, rmr_tids) - \
+ MAXALIGN(sizeof(MinmaxSpecialSpace)))
+/* max num of items in the array */
+#define REGULAR_REVMAP_PAGE_MAXITEMS \
+ (REGULAR_REVMAP_CONTENT_SIZE / sizeof(ItemPointerData))
+
+/* Definitions for array revmap pages */
+typedef struct RevmapArrayContents
+{
+ int32 rma_nblocks;
+ BlockNumber rma_blocks[1]; /* really ARRAY_REVMAP_PAGE_MAXITEMS */
+} RevmapArrayContents;
+
+#define REVMAP_ARRAY_CONTENT_SIZE \
+ (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - \
+ offsetof(RevmapArrayContents, rma_blocks) - \
+ MAXALIGN(sizeof(MinmaxSpecialSpace)))
+/* max num of items in the array */
+#define ARRAY_REVMAP_PAGE_MAXITEMS \
+ (REVMAP_ARRAY_CONTENT_SIZE / sizeof(BlockNumber))
+
+
+#endif /* MINMAX_PAGE_H */
diff --git a/src/include/access/minmax_revmap.h b/src/include/access/minmax_revmap.h
new file mode 100644
index 0000000..1dbd082
--- /dev/null
+++ b/src/include/access/minmax_revmap.h
@@ -0,0 +1,41 @@
+/*
+ * prototypes for minmax reverse range maps
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_revmap.h
+ */
+
+#ifndef MINMAX_REVMAP_H
+#define MINMAX_REVMAP_H
+
+#include "access/minmax_tuple.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/off.h"
+#include "utils/relcache.h"
+
+/* struct definition lives in mmrevmap.c */
+typedef struct mmRevmapAccess mmRevmapAccess;
+
+extern mmRevmapAccess *mmRevmapAccessInit(Relation idxrel, BlockNumber *pagesPerRange);
+extern void mmRevmapAccessTerminate(mmRevmapAccess *rmAccess);
+
+extern void mmRevmapCreate(Relation idxrel);
+extern Buffer mmLockRevmapPageForUpdate(mmRevmapAccess *rmAccess, BlockNumber heapBlk);
+extern void mmSetHeapBlockItemptr(Buffer rmbuf, BlockNumber pagesPerRange, BlockNumber heapBlk,
+ ItemPointerData tid);
+
+extern MMTuple *mmGetMMTupleForHeapBlock(mmRevmapAccess *rmAccess, BlockNumber heapBlk,
+ Buffer *buf, OffsetNumber *off, int mode);
+extern void mmRevmapTruncate(mmRevmapAccess *rmAccess,
+ BlockNumber heapNumBlocks);
+
+/* internal stuff also used by xlog replay */
+extern BlockNumber initialize_rmr_page(Buffer newbuf, BlockNumber mapBlk);
+extern void initialize_rma_page(Buffer buf);
+
+
+#endif /* MINMAX_REVMAP_H */
diff --git a/src/include/access/minmax_tuple.h b/src/include/access/minmax_tuple.h
new file mode 100644
index 0000000..d46b5f0
--- /dev/null
+++ b/src/include/access/minmax_tuple.h
@@ -0,0 +1,88 @@
+/*
+ * Declarations for dealing with MinMax-specific tuples.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_tuple.h
+ */
+#ifndef MINMAX_TUPLE_H
+#define MINMAX_TUPLE_H
+
+#include "access/minmax_internal.h"
+#include "access/tupdesc.h"
+
+
+/*
+ * A minmax index stores one index tuple per page range. Each index tuple
+ * has one MMValues struct for each indexed column; in turn, each MMValues
+ * has (besides the null flags) an array of Datum whose size is determined by
+ * the opclass.
+ */
+typedef struct MMValues
+{
+ bool hasnulls; /* are there any nulls in the page range? */
+ bool allnulls; /* are all values nulls in the page range? */
+ Datum *values; /* current accumulated values */
+} MMValues;
+
+/*
+ * This struct represents one index tuple, comprising the minimum and maximum
+ * values for all indexed columns, within one page range. These values can
+ * only be meaningfully decoded with an appropriate MinmaxDesc.
+ */
+typedef struct DeformedMMTuple
+{
+ int dt_seentup;
+ MMValues dt_columns[FLEXIBLE_ARRAY_MEMBER];
+} DeformedMMTuple;
+
+/*
+ * An on-disk minmax tuple. This is possibly followed by a nulls bitmask,
+ * with two null bits for each indexed column ("all nulls" and "has nulls");
+ * an opclass-defined number of Datum values for each column follow.
+ */
+typedef struct MMTuple
+{
+ BlockNumber mt_blkno;
+
+ /* ---------------
+ * mt_info is laid out in the following fashion:
+ *
+ * 7th (high) bit: has nulls
+ * 6th bit: unused
+ * 5th bit: unused
+ * 4-0 bit: offset of data
+ * ---------------
+ */
+ uint8 mt_info;
+} MMTuple;
+
+#define SizeOfMinMaxTuple (offsetof(MMTuple, mt_info) + sizeof(uint8))
+
+/*
+ * mt_info manipulation macros
+ */
+#define MMIDX_OFFSET_MASK 0x1F
+/* bit 0x20 is not used at present */
+/* bit 0x40 is not used at present */
+#define MMIDX_NULLS_MASK 0x80
+
+#define MMTupleDataOffset(mmtup) ((Size) (((MMTuple *) (mmtup))->mt_info & MMIDX_OFFSET_MASK))
+#define MMTupleHasNulls(mmtup) (((((MMTuple *) (mmtup))->mt_info & MMIDX_NULLS_MASK)) != 0)
+
+
+extern MMTuple *minmax_form_tuple(MinmaxDesc *mmdesc, BlockNumber blkno,
+ DeformedMMTuple *tuple, Size *size);
+extern void minmax_free_tuple(MMTuple *tuple);
+extern MMTuple *minmax_copy_tuple(MMTuple *tuple, Size len);
+extern bool minmax_tuples_equal(MMTuple *a, Size alen, MMTuple *b, Size blen);
+
+extern DeformedMMTuple *minmax_new_dtuple(MinmaxDesc *mmdesc);
+extern void minmax_dtuple_initialize(DeformedMMTuple *dtuple,
+ MinmaxDesc *mmdesc);
+extern DeformedMMTuple *minmax_deform_tuple(MinmaxDesc *mmdesc,
+ MMTuple *tuple);
+
+#endif /* MINMAX_TUPLE_H */
diff --git a/src/include/access/minmax_xlog.h b/src/include/access/minmax_xlog.h
new file mode 100644
index 0000000..fba9c32
--- /dev/null
+++ b/src/include/access/minmax_xlog.h
@@ -0,0 +1,154 @@
+/*-------------------------------------------------------------------------
+ *
+ * minmax_xlog.h
+ * POSTGRES MinMax access XLOG definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/minmax_xlog.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef MINMAX_XLOG_H
+#define MINMAX_XLOG_H
+
+#include "access/xlog.h"
+#include "storage/bufpage.h"
+#include "storage/itemptr.h"
+#include "storage/relfilenode.h"
+#include "utils/relcache.h"
+
+
+/*
+ * WAL record definitions for minmax's WAL operations
+ *
+ * XLOG allows us to store some information in the high 4 bits of the
+ * xl_info field of each log record.
+ */
+#define XLOG_MINMAX_CREATE_INDEX 0x00
+#define XLOG_MINMAX_INSERT 0x10
+#define XLOG_MINMAX_UPDATE 0x20
+#define XLOG_MINMAX_SAMEPAGE_UPDATE 0x30
+#define XLOG_MINMAX_METAPG_SET 0x40
+#define XLOG_MINMAX_RMARRAY_SET 0x50
+#define XLOG_MINMAX_INIT_RMPG 0x60
+
+#define XLOG_MINMAX_OPMASK 0x70
+/*
+ * When we insert the first item on a new page, we restore the entire page in
+ * redo.
+ */
+#define XLOG_MINMAX_INIT_PAGE 0x80
+
+/* This is what we need to know about a minmax index create */
+typedef struct xl_minmax_createidx
+{
+ BlockNumber pagesPerRange;
+ RelFileNode node;
+ uint16 version;
+} xl_minmax_createidx;
+#define SizeOfMinmaxCreateIdx (offsetof(xl_minmax_createidx, version) + sizeof(uint16))
+
+/*
+ * This is what we need to know about a minmax tuple insert
+ */
+typedef struct xl_minmax_insert
+{
+ RelFileNode node;
+ BlockNumber heapBlk;
+
+ /* extra information needed to update the revmap */
+ BlockNumber revmapBlk;
+ BlockNumber pagesPerRange;
+
+ ItemPointerData tid;
+ /* tuple data follows at end of struct */
+} xl_minmax_insert;
+
+#define SizeOfMinmaxInsert (offsetof(xl_minmax_insert, tid) + sizeof(ItemPointerData))
+
+/*
+ * A cross-page update is the same as an insert, but it also stores the old tid.
+ */
+typedef struct xl_minmax_update
+{
+ xl_minmax_insert new;
+ ItemPointerData oldtid;
+} xl_minmax_update;
+
+#define SizeOfMinmaxUpdate (offsetof(xl_minmax_update, oldtid) + sizeof(ItemPointerData))
+
+/* This is what we need to know about a minmax tuple samepage update */
+typedef struct xl_minmax_samepage_update
+{
+ RelFileNode node;
+ ItemPointerData tid;
+ /* tuple data follows at end of struct */
+} xl_minmax_samepage_update;
+
+#define SizeOfMinmaxSamepageUpdate (offsetof(xl_minmax_samepage_update, tid) + sizeof(ItemPointerData))
+
+/* This is what we need to know about a bulk minmax tuple remove */
+typedef struct xl_minmax_bulkremove
+{
+ RelFileNode node;
+ BlockNumber block;
+ /* offset number array follows at end of struct */
+} xl_minmax_bulkremove;
+
+#define SizeOfMinmaxBulkRemove (offsetof(xl_minmax_bulkremove, block) + sizeof(BlockNumber))
+
+/* This is what we need to know about a revmap "set heap ptr" */
+typedef struct xl_minmax_rm_set
+{
+ RelFileNode node;
+ BlockNumber mapBlock;
+ int pagesPerRange;
+ BlockNumber heapBlock;
+ ItemPointerData newval;
+} xl_minmax_rm_set;
+
+#define SizeOfMinmaxRevmapSet (offsetof(xl_minmax_rm_set, newval) + SizeOfIptrData)
+
+/* This is what we need to know about a "metapage set" operation */
+typedef struct xl_minmax_metapg_set
+{
+ RelFileNode node;
+ uint32 blkidx;
+ BlockNumber newpg;
+} xl_minmax_metapg_set;
+
+#define SizeOfMinmaxMetapgSet (offsetof(xl_minmax_metapg_set, newpg) + \
+ sizeof(BlockNumber))
+
+/* This is what we need to know about a "revmap array set" operation */
+typedef struct xl_minmax_rmarray_set
+{
+ RelFileNode node;
+ BlockNumber rmarray;
+ uint32 blkidx;
+ BlockNumber newpg;
+} xl_minmax_rmarray_set;
+
+#define SizeOfMinmaxRmarraySet (offsetof(xl_minmax_rmarray_set, newpg) + \
+ sizeof(BlockNumber))
+
+/* This is what we need to know when we initialize a new revmap page */
+typedef struct xl_minmax_init_rmpg
+{
+ RelFileNode node;
+ bool array; /* array revmap page or regular revmap page */
+ BlockNumber blkno;
+ BlockNumber logblk; /* only used by regular revmap pages */
+} xl_minmax_init_rmpg;
+
+#define SizeOfMinmaxInitRmpg (offsetof(xl_minmax_init_rmpg, blkno) + \
+ sizeof(BlockNumber))
+
+
+extern void minmax_desc(StringInfo buf, XLogRecord *record);
+extern void minmax_redo(XLogRecPtr lsn, XLogRecord *record);
+
+#endif /* MINMAX_XLOG_H */
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index c226448..985d435 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -45,8 +45,9 @@ typedef enum relopt_kind
RELOPT_KIND_TABLESPACE = (1 << 7),
RELOPT_KIND_SPGIST = (1 << 8),
RELOPT_KIND_VIEW = (1 << 9),
+ RELOPT_KIND_MINMAX = (1 << 10),
/* if you add a new kind, make sure you update "last_default" too */
- RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_VIEW,
+ RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_MINMAX,
/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
RELOPT_KIND_MAX = (1 << 30)
} relopt_kind;
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 8a57698..8beb1be 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -35,8 +35,10 @@ typedef struct HeapScanDescData
bool rs_temp_snap; /* unregister snapshot at scan end? */
/* state set up at initscan time */
- BlockNumber rs_nblocks; /* number of blocks to scan */
+ BlockNumber rs_nblocks; /* total number of blocks in rel */
BlockNumber rs_startblock; /* block # to start at */
+ BlockNumber rs_initblock; /* block # to treat as the initial block of the rel */
+ BlockNumber rs_numblocks; /* number of blocks to scan */
BufferAccessStrategy rs_strategy; /* access strategy for reads */
bool rs_syncscan; /* report location to syncscan logic? */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 662fb77..9dc995a 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -42,3 +42,4 @@ PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_xlog_startup, gin_xlog_cleanup
PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_xlog_startup, gist_xlog_cleanup)
PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, NULL, NULL)
PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_xlog_startup, spg_xlog_cleanup)
+PG_RMGR(RM_MINMAX_ID, "MinMax", minmax_redo, minmax_desc, NULL, NULL)
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 006b180..de90178 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -97,6 +97,14 @@ extern double IndexBuildHeapScan(Relation heapRelation,
bool allow_sync,
IndexBuildCallback callback,
void *callback_state);
+extern double IndexBuildHeapRangeScan(Relation heapRelation,
+ Relation indexRelation,
+ IndexInfo *indexInfo,
+ bool allow_sync,
+ BlockNumber start_blockno,
+ BlockNumber numblocks,
+ IndexBuildCallback callback,
+ void *callback_state);
extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
diff --git a/src/include/catalog/pg_am.h b/src/include/catalog/pg_am.h
index 759ea70..3010120 100644
--- a/src/include/catalog/pg_am.h
+++ b/src/include/catalog/pg_am.h
@@ -132,5 +132,7 @@ DESCR("GIN index access method");
DATA(insert OID = 4000 ( spgist 0 5 f f f f f t f t f f f 0 spginsert spgbeginscan spggettuple spggetbitmap spgrescan spgendscan spgmarkpos spgrestrpos spgbuild spgbuildempty spgbulkdelete spgvacuumcleanup spgcanreturn spgcostestimate spgoptions ));
DESCR("SP-GiST index access method");
#define SPGIST_AM_OID 4000
+DATA(insert OID = 3580 ( minmax 5 7 f f f f t t f t t f f 0 mminsert mmbeginscan - mmgetbitmap mmrescan mmendscan mmmarkpos mmrestrpos mmbuild mmbuildempty mmbulkdelete mmvacuumcleanup - mmcostestimate mmoptions ));
+DESCR("minmax index access method");
+#define MINMAX_AM_OID 3580
#endif /* PG_AM_H */
diff --git a/src/include/catalog/pg_amop.h b/src/include/catalog/pg_amop.h
index 3ef5a49..ecfcd2f 100644
--- a/src/include/catalog/pg_amop.h
+++ b/src/include/catalog/pg_amop.h
@@ -845,4 +845,85 @@ DATA(insert ( 3550 869 869 25 s 932 783 0 ));
DATA(insert ( 3550 869 869 26 s 933 783 0 ));
DATA(insert ( 3550 869 869 27 s 934 783 0 ));
+/*
+ * int4_minmax_ops
+ */
+DATA(insert ( 4054 23 23 1 s 97 3580 0 ));
+DATA(insert ( 4054 23 23 2 s 523 3580 0 ));
+DATA(insert ( 4054 23 23 3 s 96 3580 0 ));
+DATA(insert ( 4054 23 23 4 s 525 3580 0 ));
+DATA(insert ( 4054 23 23 5 s 521 3580 0 ));
+
+/*
+ * numeric_minmax_ops
+ */
+DATA(insert ( 4055 1700 1700 1 s 1754 3580 0 ));
+DATA(insert ( 4055 1700 1700 2 s 1755 3580 0 ));
+DATA(insert ( 4055 1700 1700 3 s 1752 3580 0 ));
+DATA(insert ( 4055 1700 1700 4 s 1757 3580 0 ));
+DATA(insert ( 4055 1700 1700 5 s 1756 3580 0 ));
+
+/*
+ * text_minmax_ops
+ */
+DATA(insert ( 4056 25 25 1 s 664 3580 0 ));
+DATA(insert ( 4056 25 25 2 s 665 3580 0 ));
+DATA(insert ( 4056 25 25 3 s 98 3580 0 ));
+DATA(insert ( 4056 25 25 4 s 667 3580 0 ));
+DATA(insert ( 4056 25 25 5 s 666 3580 0 ));
+
+/*
+ * time_minmax_ops
+ */
+DATA(insert ( 4057 1083 1083 1 s 1110 3580 0 ));
+DATA(insert ( 4057 1083 1083 2 s 1111 3580 0 ));
+DATA(insert ( 4057 1083 1083 3 s 1108 3580 0 ));
+DATA(insert ( 4057 1083 1083 4 s 1113 3580 0 ));
+DATA(insert ( 4057 1083 1083 5 s 1112 3580 0 ));
+
+/*
+ * timetz_minmax_ops
+ */
+DATA(insert ( 4058 1266 1266 1 s 1552 3580 0 ));
+DATA(insert ( 4058 1266 1266 2 s 1553 3580 0 ));
+DATA(insert ( 4058 1266 1266 3 s 1550 3580 0 ));
+DATA(insert ( 4058 1266 1266 4 s 1555 3580 0 ));
+DATA(insert ( 4058 1266 1266 5 s 1554 3580 0 ));
+
+/*
+ * timestamp_minmax_ops
+ */
+DATA(insert ( 4059 1114 1114 1 s 2062 3580 0 ));
+DATA(insert ( 4059 1114 1114 2 s 2063 3580 0 ));
+DATA(insert ( 4059 1114 1114 3 s 2060 3580 0 ));
+DATA(insert ( 4059 1114 1114 4 s 2065 3580 0 ));
+DATA(insert ( 4059 1114 1114 5 s 2064 3580 0 ));
+
+/*
+ * timestamptz_minmax_ops
+ */
+DATA(insert ( 4060 1184 1184 1 s 1322 3580 0 ));
+DATA(insert ( 4060 1184 1184 2 s 1323 3580 0 ));
+DATA(insert ( 4060 1184 1184 3 s 1320 3580 0 ));
+DATA(insert ( 4060 1184 1184 4 s 1325 3580 0 ));
+DATA(insert ( 4060 1184 1184 5 s 1324 3580 0 ));
+
+/*
+ * date_minmax_ops
+ */
+DATA(insert ( 4061 1082 1082 1 s 1095 3580 0 ));
+DATA(insert ( 4061 1082 1082 2 s 1096 3580 0 ));
+DATA(insert ( 4061 1082 1082 3 s 1093 3580 0 ));
+DATA(insert ( 4061 1082 1082 4 s 1098 3580 0 ));
+DATA(insert ( 4061 1082 1082 5 s 1097 3580 0 ));
+
+/*
+ * char_minmax_ops
+ */
+DATA(insert ( 4062 18 18 1 s 631 3580 0 ));
+DATA(insert ( 4062 18 18 2 s 632 3580 0 ));
+DATA(insert ( 4062 18 18 3 s 92 3580 0 ));
+DATA(insert ( 4062 18 18 4 s 634 3580 0 ));
+DATA(insert ( 4062 18 18 5 s 633 3580 0 ));
+
#endif /* PG_AMOP_H */
diff --git a/src/include/catalog/pg_amproc.h b/src/include/catalog/pg_amproc.h
index 10a47df..9eb2456 100644
--- a/src/include/catalog/pg_amproc.h
+++ b/src/include/catalog/pg_amproc.h
@@ -431,4 +431,77 @@ DATA(insert ( 4017 25 25 3 4029 ));
DATA(insert ( 4017 25 25 4 4030 ));
DATA(insert ( 4017 25 25 5 4031 ));
+/* minmax */
+DATA(insert ( 4054 23 23 1 3383 ));
+DATA(insert ( 4054 23 23 2 3384 ));
+DATA(insert ( 4054 23 23 3 3385 ));
+DATA(insert ( 4054 23 23 4 66 ));
+DATA(insert ( 4054 23 23 5 149 ));
+DATA(insert ( 4054 23 23 6 150 ));
+DATA(insert ( 4054 23 23 7 147 ));
+
+DATA(insert ( 4055 1700 1700 1 3383 ));
+DATA(insert ( 4055 1700 1700 2 3384 ));
+DATA(insert ( 4055 1700 1700 3 3385 ));
+DATA(insert ( 4055 1700 1700 4 1722 ));
+DATA(insert ( 4055 1700 1700 5 1723 ));
+DATA(insert ( 4055 1700 1700 6 1721 ));
+DATA(insert ( 4055 1700 1700 7 1720 ));
+
+DATA(insert ( 4056 25 25 1 3383 ));
+DATA(insert ( 4056 25 25 2 3384 ));
+DATA(insert ( 4056 25 25 3 3385 ));
+DATA(insert ( 4056 25 25 4 740 ));
+DATA(insert ( 4056 25 25 5 741 ));
+DATA(insert ( 4056 25 25 6 743 ));
+DATA(insert ( 4056 25 25 7 742 ));
+
+DATA(insert ( 4057 1083 1083 1 3383 ));
+DATA(insert ( 4057 1083 1083 2 3384 ));
+DATA(insert ( 4057 1083 1083 3 3385 ));
+DATA(insert ( 4057 1083 1083 4 1102 ));
+DATA(insert ( 4057 1083 1083 5 1103 ));
+DATA(insert ( 4057 1083 1083 6 1105 ));
+DATA(insert ( 4057 1083 1083 7 1104 ));
+
+DATA(insert ( 4058 1266 1266 1 3383 ));
+DATA(insert ( 4058 1266 1266 2 3384 ));
+DATA(insert ( 4058 1266 1266 3 3385 ));
+DATA(insert ( 4058 1266 1266 4 1354 ));
+DATA(insert ( 4058 1266 1266 5 1355 ));
+DATA(insert ( 4058 1266 1266 6 1356 ));
+DATA(insert ( 4058 1266 1266 7 1357 ));
+
+DATA(insert ( 4059 1114 1114 1 3383 ));
+DATA(insert ( 4059 1114 1114 2 3384 ));
+DATA(insert ( 4059 1114 1114 3 3385 ));
+DATA(insert ( 4059 1114 1114 4 2054 ));
+DATA(insert ( 4059 1114 1114 5 2055 ));
+DATA(insert ( 4059 1114 1114 6 2056 ));
+DATA(insert ( 4059 1114 1114 7 2057 ));
+
+DATA(insert ( 4060 1184 1184 1 3383 ));
+DATA(insert ( 4060 1184 1184 2 3384 ));
+DATA(insert ( 4060 1184 1184 3 3385 ));
+DATA(insert ( 4060 1184 1184 4 1154 ));
+DATA(insert ( 4060 1184 1184 5 1155 ));
+DATA(insert ( 4060 1184 1184 6 1156 ));
+DATA(insert ( 4060 1184 1184 7 1157 ));
+
+DATA(insert ( 4061 1082 1082 1 3383 ));
+DATA(insert ( 4061 1082 1082 2 3384 ));
+DATA(insert ( 4061 1082 1082 3 3385 ));
+DATA(insert ( 4061 1082 1082 4 1087 ));
+DATA(insert ( 4061 1082 1082 5 1088 ));
+DATA(insert ( 4061 1082 1082 6 1090 ));
+DATA(insert ( 4061 1082 1082 7 1089 ));
+
+DATA(insert ( 4062 18 18 1 3383 ));
+DATA(insert ( 4062 18 18 2 3384 ));
+DATA(insert ( 4062 18 18 3 3385 ));
+DATA(insert ( 4062 18 18 4 1246 ));
+DATA(insert ( 4062 18 18 5 72 ));
+DATA(insert ( 4062 18 18 6 74 ));
+DATA(insert ( 4062 18 18 7 73 ));
+
#endif /* PG_AMPROC_H */
diff --git a/src/include/catalog/pg_opclass.h b/src/include/catalog/pg_opclass.h
index dc52341..39af2ad 100644
--- a/src/include/catalog/pg_opclass.h
+++ b/src/include/catalog/pg_opclass.h
@@ -235,5 +235,14 @@ DATA(insert ( 403 jsonb_ops PGNSP PGUID 4033 3802 t 0 ));
DATA(insert ( 405 jsonb_ops PGNSP PGUID 4034 3802 t 0 ));
DATA(insert ( 2742 jsonb_ops PGNSP PGUID 4036 3802 t 25 ));
DATA(insert ( 2742 jsonb_path_ops PGNSP PGUID 4037 3802 f 23 ));
+DATA(insert ( 3580 int4_minmax_ops PGNSP PGUID 4054 23 t 0 ));
+DATA(insert ( 3580 numeric_minmax_ops PGNSP PGUID 4055 1700 t 0 ));
+DATA(insert ( 3580 text_minmax_ops PGNSP PGUID 4056 25 t 0 ));
+DATA(insert ( 3580 time_minmax_ops PGNSP PGUID 4057 1083 t 0 ));
+DATA(insert ( 3580 timetz_minmax_ops PGNSP PGUID 4058 1266 t 0 ));
+DATA(insert ( 3580 timestamp_minmax_ops PGNSP PGUID 4059 1114 t 0 ));
+DATA(insert ( 3580 timestamptz_minmax_ops PGNSP PGUID 4060 1184 t 0 ));
+DATA(insert ( 3580 date_minmax_ops PGNSP PGUID 4061 1082 t 0 ));
+DATA(insert ( 3580 char_minmax_ops PGNSP PGUID 4062 18 t 0 ));
#endif /* PG_OPCLASS_H */
diff --git a/src/include/catalog/pg_opfamily.h b/src/include/catalog/pg_opfamily.h
index 26297ce..2c2103b 100644
--- a/src/include/catalog/pg_opfamily.h
+++ b/src/include/catalog/pg_opfamily.h
@@ -157,4 +157,14 @@ DATA(insert OID = 4035 ( 783 jsonb_ops PGNSP PGUID ));
DATA(insert OID = 4036 ( 2742 jsonb_ops PGNSP PGUID ));
DATA(insert OID = 4037 ( 2742 jsonb_path_ops PGNSP PGUID ));
+DATA(insert OID = 4054 ( 3580 int4_minmax_ops PGNSP PGUID ));
+DATA(insert OID = 4055 ( 3580 numeric_minmax_ops PGNSP PGUID ));
+DATA(insert OID = 4056 ( 3580 text_minmax_ops PGNSP PGUID ));
+DATA(insert OID = 4057 ( 3580 time_minmax_ops PGNSP PGUID ));
+DATA(insert OID = 4058 ( 3580 timetz_minmax_ops PGNSP PGUID ));
+DATA(insert OID = 4059 ( 3580 timestamp_minmax_ops PGNSP PGUID ));
+DATA(insert OID = 4060 ( 3580 timestamptz_minmax_ops PGNSP PGUID ));
+DATA(insert OID = 4061 ( 3580 date_minmax_ops PGNSP PGUID ));
+DATA(insert OID = 4062 ( 3580 char_minmax_ops PGNSP PGUID ));
+
#endif /* PG_OPFAMILY_H */
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 0af1248..76e939a 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -565,6 +565,34 @@ DESCR("btree(internal)");
DATA(insert OID = 2785 ( btoptions PGNSP PGUID 12 1 0 0 0 f f f f t f s 2 0 17 "1009 16" _null_ _null_ _null_ _null_ btoptions _null_ _null_ _null_ ));
DESCR("btree(internal)");
+DATA(insert OID = 3789 ( mmgetbitmap PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 20 "2281 2281" _null_ _null_ _null_ _null_ mmgetbitmap _null_ _null_ _null_ ));
+DESCR("minmax(internal)");
+DATA(insert OID = 3790 ( mminsert PGNSP PGUID 12 1 0 0 0 f f f f t f v 6 0 16 "2281 2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mminsert _null_ _null_ _null_ ));
+DESCR("minmax(internal)");
+DATA(insert OID = 3791 ( mmbeginscan PGNSP PGUID 12 1 0 0 0 f f f f t f v 3 0 2281 "2281 2281 2281" _null_ _null_ _null_ _null_ mmbeginscan _null_ _null_ _null_ ));
+DESCR("minmax(internal)");
+DATA(insert OID = 3792 ( mmrescan PGNSP PGUID 12 1 0 0 0 f f f f t f v 5 0 2278 "2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmrescan _null_ _null_ _null_ ));
+DESCR("minmax(internal)");
+DATA(insert OID = 3793 ( mmendscan PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmendscan _null_ _null_ _null_ ));
+DESCR("minmax(internal)");
+DATA(insert OID = 3794 ( mmmarkpos PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmmarkpos _null_ _null_ _null_ ));
+DESCR("minmax(internal)");
+DATA(insert OID = 3795 ( mmrestrpos PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmrestrpos _null_ _null_ _null_ ));
+DESCR("minmax(internal)");
+DATA(insert OID = 3796 ( mmbuild PGNSP PGUID 12 1 0 0 0 f f f f t f v 3 0 2281 "2281 2281 2281" _null_ _null_ _null_ _null_ mmbuild _null_ _null_ _null_ ));
+DESCR("minmax(internal)");
+DATA(insert OID = 3797 ( mmbuildempty PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmbuildempty _null_ _null_ _null_ ));
+DESCR("minmax(internal)");
+DATA(insert OID = 3798 ( mmbulkdelete PGNSP PGUID 12 1 0 0 0 f f f f t f v 4 0 2281 "2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmbulkdelete _null_ _null_ _null_ ));
+DESCR("minmax(internal)");
+DATA(insert OID = 3799 ( mmvacuumcleanup PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ mmvacuumcleanup _null_ _null_ _null_ ));
+DESCR("minmax(internal)");
+DATA(insert OID = 3800 ( mmcostestimate PGNSP PGUID 12 1 0 0 0 f f f f t f v 7 0 2278 "2281 2281 2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmcostestimate _null_ _null_ _null_ ));
+DESCR("minmax(internal)");
+DATA(insert OID = 3801 ( mmoptions PGNSP PGUID 12 1 0 0 0 f f f f t f s 2 0 17 "1009 16" _null_ _null_ _null_ _null_ mmoptions _null_ _null_ _null_ ));
+DESCR("minmax(internal)");
+
+
DATA(insert OID = 339 ( poly_same PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "604 604" _null_ _null_ _null_ _null_ poly_same _null_ _null_ _null_ ));
DATA(insert OID = 340 ( poly_contain PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "604 604" _null_ _null_ _null_ _null_ poly_contain _null_ _null_ _null_ ));
DATA(insert OID = 341 ( poly_left PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "604 604" _null_ _null_ _null_ _null_ poly_left _null_ _null_ _null_ ));
@@ -4064,6 +4092,14 @@ DATA(insert OID = 2747 ( arrayoverlap PGNSP PGUID 12 1 0 0 0 f f f f t f i
DATA(insert OID = 2748 ( arraycontains PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "2277 2277" _null_ _null_ _null_ _null_ arraycontains _null_ _null_ _null_ ));
DATA(insert OID = 2749 ( arraycontained PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "2277 2277" _null_ _null_ _null_ _null_ arraycontained _null_ _null_ _null_ ));
+/* Minmax */
+DATA(insert OID = 3383 ( minmax_sortable_opcinfo PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ mmSortableOpcInfo _null_ _null_ _null_ ));
+DESCR("MinMax sortable datatype support");
+DATA(insert OID = 3384 ( minmax_sortable_add_value PGNSP PGUID 12 1 0 0 0 f f f f t f i 5 0 16 "2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmSortableAddValue _null_ _null_ _null_ ));
+DESCR("MinMax sortable datatype support");
+DATA(insert OID = 3385 ( minmax_sortable_consistent PGNSP PGUID 12 1 0 0 0 f f f f t f i 3 0 16 "2281 2281 2281" _null_ _null_ _null_ _null_ mmSortableConsistent _null_ _null_ _null_ ));
+DESCR("MinMax sortable datatype support");
+
/* userlock replacements */
DATA(insert OID = 2880 ( pg_advisory_lock PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "20" _null_ _null_ _null_ _null_ pg_advisory_lock_int8 _null_ _null_ _null_ ));
DESCR("obtain exclusive advisory lock");
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index d96e375..db7075f 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -403,6 +403,8 @@ extern Size PageGetExactFreeSpace(Page page);
extern Size PageGetHeapFreeSpace(Page page);
extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
+extern void PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos,
+ int nitems);
extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index 0f662ec..7482252 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -195,6 +195,7 @@ extern Datum hashcostestimate(PG_FUNCTION_ARGS);
extern Datum gistcostestimate(PG_FUNCTION_ARGS);
extern Datum spgcostestimate(PG_FUNCTION_ARGS);
extern Datum gincostestimate(PG_FUNCTION_ARGS);
+extern Datum mmcostestimate(PG_FUNCTION_ARGS);
/* Functions in array_selfuncs.c */
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index c04cddc..0ce2739 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -1591,6 +1591,11 @@ ORDER BY 1, 2, 3;
2742 | 9 | ?
2742 | 10 | ?|
2742 | 11 | ?&
+ 3580 | 1 | <
+ 3580 | 2 | <=
+ 3580 | 3 | =
+ 3580 | 4 | >=
+ 3580 | 5 | >
4000 | 1 | <<
4000 | 1 | ~<~
4000 | 2 | &<
@@ -1613,7 +1618,7 @@ ORDER BY 1, 2, 3;
4000 | 15 | >
4000 | 16 | @>
4000 | 18 | =
-(80 rows)
+(85 rows)
-- Check that all opclass search operators have selectivity estimators.
-- This is not absolutely required, but it seems a reasonable thing
@@ -1775,11 +1780,13 @@ WHERE NOT (
-- GIN has six support functions. 1-3 are mandatory, 5 is optional, and
-- at least one of 4 and 6 must be given.
-- SP-GiST has five support functions, all mandatory
+ -- MinMax has seven support functions, all mandatory
amname = 'btree' AND procnums @> '{1}' OR
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
- amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}'
+ amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}' OR
+ amname = 'minmax' AND procnums = '{1, 2, 3, 4, 5, 6, 7}'
);
amname | opfname | amproclefttype | amprocrighttype | procnums
--------+---------+----------------+-----------------+----------
@@ -1800,7 +1807,8 @@ WHERE NOT (
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
- amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}'
+ amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}' OR
+ amname = 'minmax' AND procnums = '{1, 2, 3, 4, 5, 6, 7}'
);
amname | opcname | procnums
--------+---------+----------
diff --git a/src/test/regress/sql/opr_sanity.sql b/src/test/regress/sql/opr_sanity.sql
index 213a66d..6670661 100644
--- a/src/test/regress/sql/opr_sanity.sql
+++ b/src/test/regress/sql/opr_sanity.sql
@@ -1178,11 +1178,13 @@ WHERE NOT (
-- GIN has six support functions. 1-3 are mandatory, 5 is optional, and
-- at least one of 4 and 6 must be given.
-- SP-GiST has five support functions, all mandatory
+ -- MinMax has seven support functions, all mandatory
amname = 'btree' AND procnums @> '{1}' OR
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
- amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}'
+ amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}' OR
+ amname = 'minmax' AND procnums = '{1, 2, 3, 4, 5, 6, 7}'
);
-- Also, check if there are any pg_opclass entries that don't seem to have
@@ -1201,7 +1203,8 @@ WHERE NOT (
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
- amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}'
+ amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}' OR
+ amname = 'minmax' AND procnums = '{1, 2, 3, 4, 5, 6, 7}'
);
-- Unfortunately, we can't check the amproc link very well because the
On 8 August 2014 16:03, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
1. MMTuple contains the block number of the heap page (range) that the tuple
represents. Vacuum is no longer needed to clean up old tuples; when an index
tuple is updated, the old tuple is deleted atomically with the insertion of
a new tuple and updating the revmap, so no garbage is left behind.
What happens if the transaction that does this aborts? Surely that
means the new value is itself garbage? What cleans up that?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 8 August 2014 10:01, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
It's possible that two backends arrive at phase 3 at the same time, with
different values. For example, backend A wants to update the minimum to
contain 10, and backend B wants to update it to 5. Now, if backend B
gets to update the tuple first, to 5, backend A will update the tuple to 10
when it gets the lock, which is wrong.

The simplest solution would be to get the buffer lock in exclusive mode to
begin with, so that you don't need to release it between steps 2 and 5. That
might be a significant hit on concurrency, though, when most of the
insertions don't in fact have to update the value. Another idea is to
re-check the updated values after acquiring the lock in exclusive mode, to
see if they match the previous values.
Simplest solution is to re-apply the test just before update, so in
the above example, if we think we want to lower the minimum to 10 and
when we get there it is already 5, we just don't update.
We don't need to do the re-check always, though. We can read the page
LSN while holding share lock, then re-read it once we acquire
exclusive lock. If the LSN is the same, there is no need for datatype-specific
re-checks at all.
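
To sketch that in C (illustrative only, not code from the patch; it assumes
the buffer stays pinned between the two lock acquisitions):

    XLogRecPtr  lsn_before;

    /* first pass: examine the summary tuple under share lock */
    LockBuffer(buf, BUFFER_LOCK_SHARE);
    lsn_before = PageGetLSN(BufferGetPage(buf));
    /* ... run the datatype-specific check against the stored values ... */
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);

    /* second pass: apply the update under exclusive lock */
    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
    if (PageGetLSN(BufferGetPage(buf)) != lsn_before)
    {
        /* the page changed while we were unlocked: redo the check */
    }
    /* ... update the index tuple, WAL-log it, and release the lock ... */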
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 8 August 2014 16:03, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
I couldn't resist starting to hack on this, and implemented the scheme I've
been having in mind:

1. MMTuple contains the block number of the heap page (range) that the tuple
represents. Vacuum is no longer needed to clean up old tuples; when an index
tuple is updated, the old tuple is deleted atomically with the insertion of
a new tuple and updating the revmap, so no garbage is left behind.

2. LockTuple is gone. When following the pointer from revmap to MMTuple, the
block number is used to check that you land on the right tuple. If not, the
search is started over, looking at the revmap again.
Part 2 sounds interesting, especially because of the reduction in CPU
that it might allow.
Part 1 doesn't sound good yet.
Are they connected?
More importantly, can't we tweak this after commit? Delaying commit
just means less time for other people to see, test, understand, tune
and fix. I see you (Heikki) doing lots of incremental development,
lots of small commits. Can't we do this one the same?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 08/10/2014 12:22 PM, Simon Riggs wrote:
On 8 August 2014 16:03, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
1. MMTuple contains the block number of the heap page (range) that the tuple
represents. Vacuum is no longer needed to clean up old tuples; when an index
tuple is updated, the old tuple is deleted atomically with the insertion of
a new tuple and updating the revmap, so no garbage is left behind.

What happens if the transaction that does this aborts? Surely that
means the new value is itself garbage? What cleans up that?
It's no different from Alvaro's patch. The updated MMTuple covers the
aborted value, but that's OK from a correctness point of view.
- Heikki
On 08/10/2014 12:42 PM, Simon Riggs wrote:
On 8 August 2014 16:03, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
I couldn't resist starting to hack on this, and implemented the scheme I've
been having in mind:

1. MMTuple contains the block number of the heap page (range) that the tuple
represents. Vacuum is no longer needed to clean up old tuples; when an index
tuple is updated, the old tuple is deleted atomically with the insertion of
a new tuple and updating the revmap, so no garbage is left behind.

2. LockTuple is gone. When following the pointer from revmap to MMTuple, the
block number is used to check that you land on the right tuple. If not, the
search is started over, looking at the revmap again.

Part 2 sounds interesting, especially because of the reduction in CPU
that it might allow.

Part 1 doesn't sound good yet.
Are they connected?
Yes. The optimistic locking in part 2 is based on checking that the
block number on the MMTuple matches what you're searching for, and that
there is never more than one MMTuple in the index with the same block
number.
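
As a rough illustration of that retry loop (the helper names here, such as
revmap_get_tid, fetch_mmtuple and mmtuple_blockno, are invented for the
example rather than taken from the patch):

    for (;;)
    {
        ItemPointerData iptr;

        /* read the TID for this page range from the revmap (share lock) */
        iptr = revmap_get_tid(rmAccess, heapBlk);
        if (!ItemPointerIsValid(&iptr))
            return NULL;                    /* range is not summarized */

        /* fetch and share-lock the MMTuple the TID points at */
        mmtup = fetch_mmtuple(idxRel, &iptr, &buf);

        /* the block number stored in the tuple tells us whether we landed
           on the tuple that currently summarizes heapBlk */
        if (mmtup == NULL || mmtuple_blockno(mmtup) != heapBlk)
        {
            /* stale revmap pointer: drop the page lock and retry */
            release_mmtuple(buf);
            continue;
        }

        return mmtup;       /* caller releases the lock when done */
    }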
More importantly, can't we tweak this after commit? Delaying commit
just means less time for other people to see, test, understand, tune
and fix. I see you (Heikki) doing lots of incremental development,
lots of small commits. Can't we do this one the same?
Well, I wouldn't consider "let's redesign how locking and vacuuming
works and change the on-disk format" as incremental development ;-).
It's more like, well, redesigning the whole thing. Any testing and
tuning would certainly need to be redone after such big changes.
If you agree that these changes make sense, let's do them now and not
waste people's time testing and tuning a dead-end design. If you don't
agree, then let's discuss that.
- Heikki
On Fri, Aug 8, 2014 at 6:01 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
It's possible that two backends arrive at phase 3 at the same time, with
different values. For example, backend A wants to update the minimum to
contain 10, and backend B wants to update it to 5. Now, if backend B
gets to update the tuple first, to 5, backend A will update the tuple to 10
when it gets the lock, which is wrong.

The simplest solution would be to get the buffer lock in exclusive mode to
begin with, so that you don't need to release it between steps 2 and 5. That
might be a significant hit on concurrency, though, when most of the
insertions don't in fact have to update the value. Another idea is to
re-check the updated values after acquiring the lock in exclusive mode, to
see if they match the previous values.
No, the simplest solution is to re-check the bounds after acquiring
the exclusive lock. So instead of doing the addValue with share lock,
do a consistency check first, and if it's not consistent, do the
addValue with exclusive lock.
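
Roughly (again just a sketch, with summary_covers standing in for the
datatype-specific consistency check; it is not a function from the patch):

    LockBuffer(buf, BUFFER_LOCK_SHARE);
    covered = summary_covers(dtup, newval);   /* value already within range? */
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);

    if (!covered)
    {
        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
        /* re-run addValue against the tuple's current contents, update the
           page, then release the lock */
        LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    }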
Heikki Linnakangas wrote:
I couldn't resist starting to hack on this, and implemented the
scheme I've been having in mind:

1. MMTuple contains the block number of the heap page (range) that
the tuple represents. Vacuum is no longer needed to clean up old
tuples; when an index tuple is updated, the old tuple is deleted
atomically with the insertion of a new tuple and updating the
revmap, so no garbage is left behind.

2. LockTuple is gone. When following the pointer from revmap to
MMTuple, the block number is used to check that you land on the
right tuple. If not, the search is started over, looking at the
revmap again.
Thanks, looks good, yeah. Did you just forget to attach the
access/rmgrdesc/minmaxdesc.c file, or did you ignore it altogether?
Anyway I hacked one up, and cleaned up some other things.
I'm sure this still needs some cleanup, but here's the patch, based
on your v14. Now that I know what this approach looks like, I still
like it much better. The insert and update code is somewhat more
complicated, because you have to be careful to lock the old page,
new page, and revmap page in the right order. But it's not too bad,
and it gets rid of all the complexity in vacuum.
It seems there is some issue here, because pageinspect tells me the
index is not growing properly for some reason. minmax_revmap_data gives
me this array of TIDs after a bunch of insert/vacuum/delete/ etc:
"(2,1)","(2,2)","(2,3)","(2,4)","(2,5)","(4,1)","(5,1)","(6,1)","(7,1)","(8,1)","(9,1)","(10,1)","(11,1)","(12,1)","(13,1)","(14,1)","(15,1)","(16,1)","(17,1)","(18,1)","(19,1)","(20,1)","(21,1)","(22,1)","(23,1)","(24,1)","(25,1)","(26,1)","(27,1)","(28,1)","(29,1)","(30,1)","(31,1)","(32,1)","(33,1)","(34,1)","(35,1)","(36,1)","(37,1)","(38,1)","(39,1)","(40,1)","(41,1)","(42,1)","(43,1)","(44,1)","(45,1)","(46,1)","(47,1)","(48,1)","(49,1)","(50,1)","(51,1)","(52,1)","(53,1)","(54,1)","(55,1)","(56,1)","(57,1)","(58,1)","(59,1)","(60,1)","(61,1)","(62,1)","(63,1)","(64,1)","(65,1)","(66,1)","(67,1)","(68,1)","(69,1)","(70,1)","(71,1)","(72,1)","(73,1)","(74,1)","(75,1)","(76,1)","(77,1)","(78,1)","(79,1)","(80,1)","(81,1)","(82,1)","(83,1)","(84,1)","(85,1)","(86,1)","(87,1)","(88,1)","(89,1)","(90,1)","(91,1)","(92,1)","(93,1)","(94,1)","(95,1)","(96,1)","(97,1)","(98,1)","(99,1)","(100,1)","(101,1)","(102,1)","(103,1)","(104,1)","(105,1)","(106,1)","(107,1)","(108,1)","(109,1)","(110,1)","(111,1)","(112,1)","(113,1)","(114,1)","(115,1)","(116,1)","(117,1)","(118,1)","(119,1)","(120,1)","(121,1)","(122,1)","(123,1)","(124,1)","(125,1)","(126,1)","(127,1)","(128,1)","(129,1)","(130,1)","(131,1)","(132,1)","(133,1)","(134,1)"
There are some who would think that getting one item per page is
suboptimal. (Maybe it's just a missing FSM update somewhere.)
I've been hacking away a bit more at this; will post updated patch
probably tomorrow (was about to post but just found a memory stomp in
pageinspect.)
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Alvaro Herrera wrote:
Heikki Linnakangas wrote:
I'm sure this still needs some cleanup, but here's the patch, based
on your v14. Now that I know what this approach looks like, I still
like it much better. The insert and update code is somewhat more
complicated, because you have to be careful to lock the old page,
new page, and revmap page in the right order. But it's not too bad,
and it gets rid of all the complexity in vacuum.

It seems there is some issue here, because pageinspect tells me the
index is not growing properly for some reason. minmax_revmap_data gives
me this array of TIDs after a bunch of insert/vacuum/delete/ etc:
I fixed this issue, and did a lot more rework and bugfixing. Here's
v15, based on v14-heikki2.
I think remaining issues are mostly minimal (pageinspect should output
block number alongside each tuple, now that we have it, for example.)
I haven't tested the new xlog records yet.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
minmax-15.patch (text/x-diff; charset=us-ascii)
*** a/contrib/pageinspect/Makefile
--- b/contrib/pageinspect/Makefile
***************
*** 1,7 ****
# contrib/pageinspect/Makefile
MODULE_big = pageinspect
! OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o $(WIN32RES)
EXTENSION = pageinspect
DATA = pageinspect--1.2.sql pageinspect--1.0--1.1.sql \
--- 1,7 ----
# contrib/pageinspect/Makefile
MODULE_big = pageinspect
! OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o mmfuncs.o $(WIN32RES)
EXTENSION = pageinspect
DATA = pageinspect--1.2.sql pageinspect--1.0--1.1.sql \
*** /dev/null
--- b/contrib/pageinspect/mmfuncs.c
***************
*** 0 ****
--- 1,460 ----
+ /*
+ * mmfuncs.c
+ * Functions to investigate MinMax indexes
+ *
+ * Copyright (c) 2014, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/pageinspect/mmfuncs.c
+ */
+ #include "postgres.h"
+
+ #include "access/htup_details.h"
+ #include "access/minmax.h"
+ #include "access/minmax_internal.h"
+ #include "access/minmax_page.h"
+ #include "access/minmax_revmap.h"
+ #include "access/minmax_tuple.h"
+ #include "catalog/index.h"
+ #include "catalog/pg_type.h"
+ #include "funcapi.h"
+ #include "lib/stringinfo.h"
+ #include "utils/array.h"
+ #include "utils/builtins.h"
+ #include "utils/lsyscache.h"
+ #include "utils/rel.h"
+ #include "miscadmin.h"
+
+ Datum minmax_page_type(PG_FUNCTION_ARGS);
+ Datum minmax_page_items(PG_FUNCTION_ARGS);
+ Datum minmax_metapage_info(PG_FUNCTION_ARGS);
+ Datum minmax_revmap_array_data(PG_FUNCTION_ARGS);
+ Datum minmax_revmap_data(PG_FUNCTION_ARGS);
+
+ PG_FUNCTION_INFO_V1(minmax_page_type);
+ PG_FUNCTION_INFO_V1(minmax_page_items);
+ PG_FUNCTION_INFO_V1(minmax_metapage_info);
+ PG_FUNCTION_INFO_V1(minmax_revmap_array_data);
+ PG_FUNCTION_INFO_V1(minmax_revmap_data);
+
+ typedef struct mm_column_state
+ {
+ int nstored;
+ FmgrInfo outputFn[FLEXIBLE_ARRAY_MEMBER];
+ } mm_column_state;
+
+ typedef struct mm_page_state
+ {
+ MinmaxDesc *mmdesc;
+ Page page;
+ OffsetNumber offset;
+ bool unusedItem;
+ bool done;
+ AttrNumber attno;
+ DeformedMMTuple *dtup;
+ mm_column_state *columns[FLEXIBLE_ARRAY_MEMBER];
+ } mm_page_state;
+
+
+ static Page verify_minmax_page(bytea *raw_page, uint16 type,
+ const char *strtype);
+
+ Datum
+ minmax_page_type(PG_FUNCTION_ARGS)
+ {
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Page page = VARDATA(raw_page);
+ MinmaxSpecialSpace *special;
+ char *type;
+
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
+
+ switch (special->type)
+ {
+ case MINMAX_PAGETYPE_META:
+ type = "meta";
+ break;
+ case MINMAX_PAGETYPE_REVMAP_ARRAY:
+ type = "revmap array";
+ break;
+ case MINMAX_PAGETYPE_REVMAP:
+ type = "revmap";
+ break;
+ case MINMAX_PAGETYPE_REGULAR:
+ type = "regular";
+ break;
+ default:
+ type = psprintf("unknown (%02x)", special->type);
+ break;
+ }
+
+ PG_RETURN_TEXT_P(cstring_to_text(type));
+ }
+
+ /*
+ * Verify that the given bytea contains a minmax page of the indicated page
+ * type, or die in the attempt. A pointer to the page is returned.
+ */
+ static Page
+ verify_minmax_page(bytea *raw_page, uint16 type, const char *strtype)
+ {
+ Page page;
+ int raw_page_size;
+ MinmaxSpecialSpace *special;
+
+ raw_page_size = VARSIZE(raw_page) - VARHDRSZ;
+
+ if (raw_page_size < SizeOfPageHeaderData)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("input page too small"),
+ errdetail("Expected size %d, got %d", raw_page_size, BLCKSZ)));
+
+ page = VARDATA(raw_page);
+
+ /* verify the special space says this page is what we want */
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
+ if (special->type != type)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("page is not a Minmax page of type \"%s\"", strtype),
+ errdetail("Expected special type %08x, got %08x.",
+ type, special->type)));
+
+ return page;
+ }
+
+
+ /*
+ * Extract all item values from a minmax index page
+ *
+ * Usage: SELECT * FROM minmax_page_items(get_raw_page('idx', 1), 'idx'::regclass);
+ */
+ Datum
+ minmax_page_items(PG_FUNCTION_ARGS)
+ {
+ mm_page_state *state;
+ FuncCallContext *fctx;
+
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ (errmsg("must be superuser to use raw page functions"))));
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Oid indexRelid = PG_GETARG_OID(1);
+ Page page;
+ TupleDesc tupdesc;
+ MemoryContext mctx;
+ Relation indexRel;
+ AttrNumber attno;
+
+ /* minimally verify the page we got */
+ page = verify_minmax_page(raw_page, MINMAX_PAGETYPE_REGULAR, "regular");
+
+ /* create a function context for cross-call persistence */
+ fctx = SRF_FIRSTCALL_INIT();
+
+ /* switch to memory context appropriate for multiple function calls */
+ mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ indexRel = index_open(indexRelid, AccessShareLock);
+
+ state = palloc(offsetof(mm_page_state, columns) +
+ sizeof(mm_column_state) * RelationGetDescr(indexRel)->natts);
+
+ state->mmdesc = minmax_build_mmdesc(indexRel);
+ state->page = page;
+ state->offset = FirstOffsetNumber;
+ state->unusedItem = false;
+ state->done = false;
+ state->dtup = NULL;
+
+ for (attno = 1; attno <= state->mmdesc->md_tupdesc->natts; attno++)
+ {
+ Oid output;
+ bool isVarlena;
+ FmgrInfo *opcInfoFn;
+ MinmaxOpcInfo *opcinfo;
+ int i;
+ mm_column_state *column;
+
+ opcInfoFn = index_getprocinfo(indexRel, attno, MINMAX_PROCNUM_OPCINFO);
+ opcinfo = (MinmaxOpcInfo *)
+ DatumGetPointer(FunctionCall1(opcInfoFn, InvalidOid));
+
+ column = palloc(offsetof(mm_column_state, outputFn) +
+ sizeof(FmgrInfo) * opcinfo->oi_nstored);
+
+ column->nstored = opcinfo->oi_nstored;
+ for (i = 0; i < opcinfo->oi_nstored; i++)
+ {
+ getTypeOutputInfo(opcinfo->oi_typids[i], &output, &isVarlena);
+ fmgr_info(output, &column->outputFn[i]);
+ }
+
+ state->columns[attno - 1] = column;
+ }
+
+ index_close(indexRel, AccessShareLock);
+
+ fctx->user_fctx = state;
+ fctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+ MemoryContextSwitchTo(mctx);
+ }
+
+ fctx = SRF_PERCALL_SETUP();
+ state = fctx->user_fctx;
+
+ if (!state->done)
+ {
+ HeapTuple result;
+ Datum values[5];
+ bool nulls[5];
+
+ /*
+ * This loop is called once for every attribute of every tuple in the
+ * page. At the start of a tuple, we get a NULL dtup; that's our
+ * signal for obtaining and decoding the next one. If that's not the
+ * case, we output the next attribute.
+ */
+ if (state->dtup == NULL)
+ {
+ MMTuple *tup;
+ MemoryContext mctx;
+ ItemId itemId;
+
+ /* deformed tuple must live across calls */
+ mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx);
+
+ /* verify item status: if there's no data, we can't decode */
+ itemId = PageGetItemId(state->page, state->offset);
+ if (ItemIdIsUsed(itemId))
+ {
+ tup = (MMTuple *) PageGetItem(state->page,
+ PageGetItemId(state->page,
+ state->offset));
+ state->dtup = minmax_deform_tuple(state->mmdesc, tup);
+ state->attno = 1;
+ state->unusedItem = false;
+ }
+ else
+ state->unusedItem = true;
+
+ MemoryContextSwitchTo(mctx);
+ }
+ else
+ state->attno++;
+
+ MemSet(nulls, 0, sizeof(nulls));
+
+ if (state->unusedItem)
+ {
+ values[0] = UInt16GetDatum(state->offset);
+ nulls[1] = true;
+ nulls[2] = true;
+ nulls[3] = true;
+ nulls[4] = true;
+ }
+ else
+ {
+ int att = state->attno - 1;
+
+ values[0] = UInt16GetDatum(state->offset);
+ values[1] = UInt16GetDatum(state->attno);
+ values[2] = BoolGetDatum(state->dtup->dt_columns[att].allnulls);
+ values[3] = BoolGetDatum(state->dtup->dt_columns[att].hasnulls);
+ if (!state->dtup->dt_columns[att].allnulls)
+ {
+ MMValues *mmvalues = &state->dtup->dt_columns[att];
+ StringInfoData s;
+ bool first;
+ int i;
+
+ initStringInfo(&s);
+ appendStringInfoChar(&s, '{');
+
+ first = true;
+ for (i = 0; i < state->columns[att]->nstored; i++)
+ {
+ char *val;
+
+ if (!first)
+ appendStringInfoString(&s, " .. ");
+ first = false;
+ val = OutputFunctionCall(&state->columns[att]->outputFn[i],
+ mmvalues->values[i]);
+ appendStringInfoString(&s, val);
+ pfree(val);
+ }
+ appendStringInfoChar(&s, '}');
+
+ values[4] = CStringGetTextDatum(s.data);
+ pfree(s.data);
+ }
+ else
+ {
+ nulls[4] = true;
+ }
+ }
+
+ result = heap_form_tuple(fctx->tuple_desc, values, nulls);
+
+ /*
+ * If the item was unused, jump straight to the next one; otherwise,
+ * the only cleanup needed here is to set our signal to go to the next
+ * tuple in the following iteration, by freeing the current one.
+ */
+ if (state->unusedItem)
+ state->offset = OffsetNumberNext(state->offset);
+ else if (state->attno >= state->mmdesc->md_tupdesc->natts)
+ {
+ pfree(state->dtup);
+ state->dtup = NULL;
+ state->offset = OffsetNumberNext(state->offset);
+ }
+
+ /*
+ * If we're beyond the end of the page, set flag to end the function in
+ * the following iteration.
+ */
+ if (state->offset > PageGetMaxOffsetNumber(state->page))
+ state->done = true;
+
+ SRF_RETURN_NEXT(fctx, HeapTupleGetDatum(result));
+ }
+
+ minmax_free_mmdesc(state->mmdesc);
+
+ SRF_RETURN_DONE(fctx);
+ }
+
+ Datum
+ minmax_metapage_info(PG_FUNCTION_ARGS)
+ {
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Page page;
+ MinmaxMetaPageData *meta;
+ TupleDesc tupdesc;
+ Datum values[3];
+ bool nulls[3];
+ ArrayBuildState *astate = NULL;
+ HeapTuple htup;
+ int i;
+
+ page = verify_minmax_page(raw_page, MINMAX_PAGETYPE_META, "metapage");
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+ tupdesc = BlessTupleDesc(tupdesc);
+
+ /* Extract values from the metapage */
+ meta = (MinmaxMetaPageData *) PageGetContents(page);
+ MemSet(nulls, 0, sizeof(nulls));
+ values[0] = CStringGetTextDatum(psprintf("0x%08X", meta->minmaxMagic));
+ values[1] = Int32GetDatum(meta->minmaxVersion);
+
+ /* Extract (possibly empty) list of revmap array page numbers. */
+ for (i = 0; i < MAX_REVMAP_ARRAYPAGES; i++)
+ {
+ BlockNumber blkno;
+
+ blkno = meta->revmapArrayPages[i];
+ if (blkno == InvalidBlockNumber)
+ break; /* XXX or continue? */
+ astate = accumArrayResult(astate, Int64GetDatum((int64) blkno),
+ false, INT8OID, CurrentMemoryContext);
+ }
+ if (astate == NULL)
+ nulls[2] = true;
+ else
+ values[2] = makeArrayResult(astate, CurrentMemoryContext);
+
+ htup = heap_form_tuple(tupdesc, values, nulls);
+
+ PG_RETURN_DATUM(HeapTupleGetDatum(htup));
+ }
+
+ /*
+ * Return the BlockNumber array stored in a revmap array page
+ */
+ Datum
+ minmax_revmap_array_data(PG_FUNCTION_ARGS)
+ {
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Page page;
+ ArrayBuildState *astate = NULL;
+ RevmapArrayContents *contents;
+ Datum blkarr;
+ int i;
+
+ page = verify_minmax_page(raw_page, MINMAX_PAGETYPE_REVMAP_ARRAY,
+ "revmap array");
+
+ contents = (RevmapArrayContents *) PageGetContents(page);
+
+ for (i = 0; i < contents->rma_nblocks; i++)
+ astate = accumArrayResult(astate,
+ Int64GetDatum((int64) contents->rma_blocks[i]),
+ false, INT8OID, CurrentMemoryContext);
+ Assert(astate != NULL);
+
+ blkarr = makeArrayResult(astate, CurrentMemoryContext);
+ PG_RETURN_DATUM(blkarr);
+ }
+
+ /*
+ * Return the TID array stored in a minmax revmap page
+ */
+ Datum
+ minmax_revmap_data(PG_FUNCTION_ARGS)
+ {
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Page page;
+ RevmapContents *contents;
+ TupleDesc tupdesc;
+ Datum values[2];
+ bool nulls[2];
+ HeapTuple htup;
+ ArrayBuildState *astate = NULL;
+ int i;
+
+ page = verify_minmax_page(raw_page, MINMAX_PAGETYPE_REVMAP, "revmap");
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+ tupdesc = BlessTupleDesc(tupdesc);
+
+ /* Extract values from the revmap page */
+ contents = (RevmapContents *) PageGetContents(page);
+ MemSet(nulls, 0, sizeof(nulls));
+ values[0] = Int64GetDatum((uint64) contents->rmr_logblk);
+
+ /* Extract (possibly empty) list of TIDs in this page. */
+ for (i = 0; i < REGULAR_REVMAP_PAGE_MAXITEMS; i++)
+ {
+ ItemPointer tid;
+
+ tid = &contents->rmr_tids[i];
+ astate = accumArrayResult(astate,
+ PointerGetDatum(tid),
+ false, TIDOID, CurrentMemoryContext);
+ }
+ if (astate == NULL)
+ nulls[1] = true;
+ else
+ values[1] = makeArrayResult(astate, CurrentMemoryContext);
+
+ htup = heap_form_tuple(tupdesc, values, nulls);
+
+ PG_RETURN_DATUM(HeapTupleGetDatum(htup));
+ }
*** a/contrib/pageinspect/pageinspect--1.2.sql
--- b/contrib/pageinspect/pageinspect--1.2.sql
***************
*** 99,104 **** AS 'MODULE_PATHNAME', 'bt_page_items'
--- 99,147 ----
LANGUAGE C STRICT;
--
+ -- minmax_page_type()
+ --
+ CREATE FUNCTION minmax_page_type(IN page bytea)
+ RETURNS text
+ AS 'MODULE_PATHNAME', 'minmax_page_type'
+ LANGUAGE C STRICT;
+
+ --
+ -- minmax_metapage_info()
+ --
+ CREATE FUNCTION minmax_metapage_info(IN page bytea, OUT magic text,
+ OUT version integer, OUT revmap_array_pages BIGINT[])
+ AS 'MODULE_PATHNAME', 'minmax_metapage_info'
+ LANGUAGE C STRICT;
+
+ --
+ -- minmax_page_items()
+ --
+ CREATE FUNCTION minmax_page_items(IN page bytea, IN index_oid oid,
+ OUT itemoffset int,
+ OUT attnum int,
+ OUT allnulls bool,
+ OUT hasnulls bool,
+ OUT value text)
+ RETURNS SETOF record
+ AS 'MODULE_PATHNAME', 'minmax_page_items'
+ LANGUAGE C STRICT;
+
+ --
+ -- minmax_revmap_array_data()
+ CREATE FUNCTION minmax_revmap_array_data(IN page bytea,
+ OUT revmap_pages BIGINT[])
+ AS 'MODULE_PATHNAME', 'minmax_revmap_array_data'
+ LANGUAGE C STRICT;
+
+ --
+ -- minmax_revmap_data()
+ CREATE FUNCTION minmax_revmap_data(IN page bytea,
+ OUT logblk BIGINT, OUT pages tid[])
+ AS 'MODULE_PATHNAME', 'minmax_revmap_data'
+ LANGUAGE C STRICT;
+
+ --
-- fsm_page_contents()
--
CREATE FUNCTION fsm_page_contents(IN page bytea)
*** a/contrib/pg_xlogdump/rmgrdesc.c
--- b/contrib/pg_xlogdump/rmgrdesc.c
***************
*** 13,18 ****
--- 13,19 ----
#include "access/gist_private.h"
#include "access/hash.h"
#include "access/heapam_xlog.h"
+ #include "access/minmax_xlog.h"
#include "access/multixact.h"
#include "access/nbtree.h"
#include "access/rmgr.h"
*** /dev/null
--- b/doc/src/sgml/brin.sgml
***************
*** 0 ****
--- 1,248 ----
+ <!-- doc/src/sgml/brin.sgml -->
+
+ <chapter id="BRIN">
+ <title>BRIN Indexes</title>
+
+ <indexterm>
+ <primary>index</primary>
+ <secondary>BRIN</secondary>
+ </indexterm>
+
+ <sect1 id="brin-intro">
+ <title>Introduction</title>
+
+ <para>
+ <acronym>BRIN</acronym> stands for Block Range Index.
+ <acronym>BRIN</acronym> is designed for handling very large tables
+ in which certain columns have some natural correlation with their
+ physical position. For example, a table storing orders might have
+ a date column on which each order was placed, and much of the time
+ the earlier entries will appear earlier in the table as well; or a
+ table storing a ZIP code column might have all codes for a city
+ grouped together naturally. For each block range, some summary info
+ is stored in the index.
+ </para>
+
+ <para>
+ <acronym>BRIN</acronym> indexes can satisfy queries via the bitmap
+ scanning facility only, and will return all tuples in all pages within
+ each range if the summary info stored by the index indicates that some
+ tuples in the range might match the given query conditions. The executor
+ is in charge of rechecking these tuples and discarding those that do not
+ match — in other words, these indexes are lossy.
+ This enables them to work as very fast sequential scan helpers to avoid
+ scanning blocks that are known not to contain matching tuples.
+ </para>
+
+ <para>
+ The specific data that a <acronym>BRIN</acronym> index will store
+ depends on the operator class selected for the data type.
+ Datatypes having a linear sort order can have operator classes that
+ store the minimum and maximum value within each block range, for instance;
+ geometrical types might store the common bounding box.
+ </para>
+
+ <para>
+ The size of the block range is determined at index creation time with
+ the <literal>pages_per_range</literal> storage parameter. The smaller the number, the
+ larger the index becomes (because of the need to store more index entries),
+ but at the same time the summary data stored can be more precise and
+ more data blocks can be skipped.
+ </para>
+
+ <para>
+ The <acronym>BRIN</acronym> implementation in <productname>PostgreSQL</productname>
+ is primarily maintained by Álvaro Herrera.
+ </para>
+ </sect1>
+
+ <sect1 id="brin-builtin-opclasses">
+ <title>Built-in Operator Classes</title>
+
+ <para>
+ The core <productname>PostgreSQL</productname> distribution
+ includes the <acronym>BRIN</acronym> operator classes shown in
+ <xref linkend="brin-builtin-opclasses-table">.
+ </para>
+
+ <table id="brin-builtin-opclasses-table">
+ <title>Built-in <acronym>BRIN</acronym> Operator Classes</title>
+ <tgroup cols="3">
+ <thead>
+ <row>
+ <entry>Name</entry>
+ <entry>Indexed Data Type</entry>
+ <entry>Indexable Operators</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry><literal>char_minmax_ops</literal></entry>
+ <entry><type>"char"</type></entry>
+ <entry>
+ <literal><</literal>
+ <literal><=</literal>
+ <literal>=</literal>
+ <literal>>=</literal>
+ <literal>></literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>date_minmax_ops</literal></entry>
+ <entry><type>date</type></entry>
+ <entry>
+ <literal><</literal>
+ <literal><=</literal>
+ <literal>=</literal>
+ <literal>>=</literal>
+ <literal>></literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>int4_minmax_ops</literal></entry>
+ <entry><type>integer</type></entry>
+ <entry>
+ <literal><</literal>
+ <literal><=</literal>
+ <literal>=</literal>
+ <literal>>=</literal>
+ <literal>></literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>numeric_minmax_ops</literal></entry>
+ <entry><type>numeric</type></entry>
+ <entry>
+ <literal><</literal>
+ <literal><=</literal>
+ <literal>=</literal>
+ <literal>>=</literal>
+ <literal>></literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>text_minmax_ops</literal></entry>
+ <entry><type>text</type></entry>
+ <entry>
+ <literal><</literal>
+ <literal><=</literal>
+ <literal>=</literal>
+ <literal>>=</literal>
+ <literal>></literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>time_minmax_ops</literal></entry>
+ <entry><type>time</type></entry>
+ <entry>
+ <literal><</literal>
+ <literal><=</literal>
+ <literal>=</literal>
+ <literal>>=</literal>
+ <literal>></literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>timetz_minmax_ops</literal></entry>
+ <entry><type>time with time zone</type></entry>
+ <entry>
+ <literal><</literal>
+ <literal><=</literal>
+ <literal>=</literal>
+ <literal>>=</literal>
+ <literal>></literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>timestamp_minmax_ops</literal></entry>
+ <entry><type>timestamp</type></entry>
+ <entry>
+ <literal><</literal>
+ <literal><=</literal>
+ <literal>=</literal>
+ <literal>>=</literal>
+ <literal>></literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>timestamptz_minmax_ops</literal></entry>
+ <entry><type>timestamp with time zone</type></entry>
+ <entry>
+ <literal><</literal>
+ <literal><=</literal>
+ <literal>=</literal>
+ <literal>>=</literal>
+ <literal>></literal>
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+ </sect1>
+
+ <sect1 id="brin-extensibility">
+ <title>Extensibility</title>
+
+ <para>
+ The <acronym>BRIN</acronym> interface has a high level of abstraction,
+ requiring the access method implementer only to implement the semantics
+ of the data type being accessed. The <acronym>BRIN</acronym> layer
+ itself takes care of concurrency, logging and searching the index structure.
+ </para>
+
+ <para>
+ All it takes to get a <acronym>BRIN</acronym> access method working is to
+ implement a few user-defined methods, which define the behavior of
+ summary values stored in the index and the way they interact with
+ scan keys.
+ In short, <acronym>BRIN</acronym> combines
+ extensibility with generality, code reuse, and a clean interface.
+ </para>
+
+ <para>
+ There are three methods that an operator class for <acronym>BRIN</acronym>
+ must provide:
+
+ <variablelist>
+ <varlistentry>
+ <term><function>Datum opcInfo(...)</></term>
+ <listitem>
+ <para>
+ Returns internal information about the summary data stored
+ about indexed columns.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><function>bool consistent(...)</function></term>
+ <listitem>
+ <para>
+ Returns whether the key is consistent with the given index tuple.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><function>bool addValue(...)</function></term>
+ <listitem>
+ <para>
+ Modifies the index tuple to make it consistent with the given
+ indexed data.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+
+ <!-- this needs improvement ... -->
+ To implement these methods in a generic way, normally the opclass
+ defines its own internal support functions. For instance, minmax
+ opclasses add the support functions for the four inequality operators
+ for the datatype.
+ Additionally, the operator class must supply appropriate
+ operator entries,
+ to enable the optimizer to use the index when those operators are
+ used in queries.
+ </para>
+ </sect1>
+ </chapter>
*** a/doc/src/sgml/filelist.sgml
--- b/doc/src/sgml/filelist.sgml
***************
*** 87,92 ****
--- 87,93 ----
<!ENTITY gist SYSTEM "gist.sgml">
<!ENTITY spgist SYSTEM "spgist.sgml">
<!ENTITY gin SYSTEM "gin.sgml">
+ <!ENTITY brin SYSTEM "brin.sgml">
<!ENTITY planstats SYSTEM "planstats.sgml">
<!ENTITY indexam SYSTEM "indexam.sgml">
<!ENTITY nls SYSTEM "nls.sgml">
*** a/doc/src/sgml/indices.sgml
--- b/doc/src/sgml/indices.sgml
***************
*** 116,122 **** CREATE INDEX test1_id_index ON test1 (id);
<para>
<productname>PostgreSQL</productname> provides several index types:
! B-tree, Hash, GiST, SP-GiST and GIN. Each index type uses a different
algorithm that is best suited to different types of queries.
By default, the <command>CREATE INDEX</command> command creates
B-tree indexes, which fit the most common situations.
--- 116,123 ----
<para>
<productname>PostgreSQL</productname> provides several index types:
! B-tree, Hash, GiST, SP-GiST, GIN and BRIN.
! Each index type uses a different
algorithm that is best suited to different types of queries.
By default, the <command>CREATE INDEX</command> command creates
B-tree indexes, which fit the most common situations.
***************
*** 326,331 **** SELECT * FROM places ORDER BY location <-> point '(101,456)' LIMIT 10;
--- 327,365 ----
classes are available in the <literal>contrib</> collection or as separate
projects. For more information see <xref linkend="GIN">.
</para>
+
+ <para>
+ <indexterm>
+ <primary>index</primary>
+ <secondary>BRIN</secondary>
+ </indexterm>
+ <indexterm>
+ <primary>BRIN</primary>
+ <see>index</see>
+ </indexterm>
+ BRIN indexes (a shorthand for Block Range indexes)
+ store summaries about the values stored in consecutive table physical block ranges.
+ Like GiST, SP-GiST and GIN,
+ BRIN can support many different indexing strategies,
+ and the particular operators with which a BRIN index can be used
+ vary depending on the indexing strategy.
+ For datatypes that have a linear sort order, the indexed data
+ corresponds to the minimum and maximum values of the
+ values in the column for each block range,
+ which support indexed queries using these operators:
+
+ <simplelist>
+ <member><literal><</literal></member>
+ <member><literal><=</literal></member>
+ <member><literal>=</literal></member>
+ <member><literal>>=</literal></member>
+ <member><literal>></literal></member>
+ </simplelist>
+
+ The BRIN operator classes included in the standard distribution are
+ documented in <xref linkend="brin-builtin-opclasses-table">.
+ For more information see <xref linkend="BRIN">.
+ </para>
</sect1>
*** a/doc/src/sgml/postgres.sgml
--- b/doc/src/sgml/postgres.sgml
***************
*** 247,252 ****
--- 247,253 ----
&gist;
&spgist;
&gin;
+ &brin;
&storage;
&bki;
&planstats;
*** /dev/null
--- b/minmax-proposal
***************
*** 0 ****
--- 1,306 ----
+ Minmax Range Indexes
+ ====================
+
+ Minmax indexes are a new access method intended to enable very fast scanning of
+ extremely large tables.
+
+ The essential idea of a minmax index is to keep track of summarizing values in
+ consecutive groups of heap pages (page ranges); for example, the minimum and
+ maximum values for datatypes with a btree opclass, or the bounding box for
+ geometric types. These values can be used by constraint exclusion to avoid
+ scanning such pages, depending on query quals.
+
+ The main drawback of this is having to update the stored summary values of each
+ page range as tuples are inserted into them.
+
+ Other database systems already have similar features. Some examples:
+
+ * Oracle Exadata calls this "storage indexes"
+ http://richardfoote.wordpress.com/category/storage-indexes/
+
+ * Netezza has "zone maps"
+ http://nztips.com/2010/11/netezza-integer-join-keys/
+
+ * Infobright has this automatically within their "data packs" according to a
+ May 3rd, 2009 blog post
+ http://www.infobright.org/index.php/organizing_data_and_more_about_rough_data_contest/
+
+ * MonetDB also uses this technique, according to a published paper
+ http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.2662
+ "Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS"
+
+ Index creation
+ --------------
+
+ To create a minmax index, we use the standard wording:
+
+ CREATE INDEX foo_minmax_idx ON foo USING MINMAX (a, b, e);
+
+ Partial indexes are not supported currently; since an index is concerned with
+ summary values of the involved columns across all the pages in the table, it
+ normally doesn't make sense to exclude some tuples. These might be useful if
+ the index predicates are also used in queries. We exclude these for now for
+ conceptual simplicity.
+
+ Expressional indexes can probably be supported in the future, but we disallow
+ them initially for conceptual simplicity.
+
+ Having multiple minmax indexes in the same table is acceptable, though most of
+ the time it would make more sense to have a single index covering all the
+ interesting columns. Multiple indexes might be useful for columns added later.
+
+ Access Method Design
+ --------------------
+
+ Since item pointers are not stored inside indexes of this type, it is not
+ possible to support the amgettuple interface. Instead, we only provide
+ amgetbitmap support; scanning a relation using this index requires a recheck
+ node on top. The amgetbitmap routine returns a TIDBitmap comprising all pages
+ in those page groups that match the query qualifications. The recheck node
+ prunes tuples that are not visible according to the query qualifications.
+
+ For each supported datatype, we need an operator class with the following
+ catalog entries:
+
+ - support operators (pg_amop): same as btree (<, <=, =, >=, >)
+ - support procedures (pg_amproc):
+ * "opcinfo" (procno 1) initializes a structure for index creation or scanning
+ * "addValue" (procno 2) takes an index tuple and a heap item, and possibly
+ changes the index tuple so that it includes the heap item values
+ * "consistent" (procno 3) takes an index tuple and query quals, and returns
+ whether the index tuple values match the query quals.
+
+ These are used pervasively:
+
+ - The optimizer requires them to evaluate queries, so that the index is chosen
+ when queries on the indexed table are planned.
+ - During index construction (ambuild), they are used to determine the boundary
+ values for each page range.
+ - During index updates (aminsert), they are used to determine whether the new
+ heap tuple matches the existing index tuple; and if not, they are used to
+ construct the new index tuple.
+
+ In each index tuple (corresponding to one page range), we store:
+ - for each indexed column of a datatype with a btree-opclass:
+ * minimum value across all tuples in the range
+ * maximum value across all tuples in the range
+ * are there nulls present in any tuple?
+ * are all the values null in all tuples in the range?
+
+ Different datatypes store other values instead of min/max, for example
+ geometric types might store a bounding box. The NULL bits are always present.
+
+ These null bits are stored in a single null bitmask of length 2x number of
+ columns.
+
+ With the default INDEX_MAX_KEYS of 32, and considering columns of 8-byte length
+ types such as timestamptz or bigint, each tuple would be 522 bytes in length,
+ which seems reasonable. There are 6 extra bytes for padding between the null
+ bitmask and the first data item, assuming 64-bit alignment; so the total size
+ for such an index would actually be 528 bytes.
+
+ This maximum index tuple size is calculated as: mt_info (2 bytes) + null bitmap
+ (8 bytes) + data value (8 bytes) * 32 * 2
+
+ (Of course, larger columns are possible, such as varchar, but creating minmax
+ indexes on such columns seems of little practical usefulness. Also, the
+ usefulness of an index containing so many columns is dubious.)
+
+ There can be gaps where some pages have no covering index entry.
+
+ The Range Reverse Map
+ ---------------------
+
+ To find out the index tuple for a particular page range, we have an internal
+ structure we call the range reverse map. This stores one TID per page range,
+ which is the address of the index tuple summarizing that range. Since these
+ map entries are fixed size, it is possible to compute the address of the range
+ map entry for any given heap page by simple arithmetic.
+
+ When a new heap tuple is inserted in a summarized page range, we compare the
+ existing index tuple with the new heap tuple. If the heap tuple is outside the
+ summarization data given by the index tuple for any indexed column (or if the
+ new heap tuple contains null values but the index tuple indicates there are no
+ nulls), it is necessary to create a new index tuple with the new values. To do
+ this, a new index tuple is inserted, and the reverse range map is updated to
+ point to it. The old index tuple is left in place, for later garbage
+ collection. As an optimization, we sometimes overwrite the old index tuple in
+ place with the new data, which avoids the need for later garbage collection.
+
+ If the reverse range map points to an invalid TID, the corresponding page range
+ is considered to be not summarized.
+
+ To scan a table following a minmax index, we scan the reverse range map
+ sequentially. This yields index tuples in ascending page range order. Query
+ quals are matched to each index tuple; if they match, each page within the page
+ range is returned as part of the output TID bitmap. If there's no match, they
+ are skipped. Reverse range map entries returning invalid index TIDs, that is
+ unsummarized page ranges, are also returned in the TID bitmap.
+
+ To store the range reverse map, we map its logical page numbers to physical
+ pages. We use a large two-level BlockNumber array for this: The metapage
+ contains an array of BlockNumbers; each of these points to a "revmap array
+ page". Each revmap array page contains BlockNumbers, which in turn point to
+ "revmap regular pages", which are the ones that contain the revmap data itself.
+ Therefore, to find a given index tuple, we need to examine the metapage and
+ obtain the revmap array page number; then read the array page. From there we
+ obtain the revmap regular page number, and that one contains the TID we're
+ interested in. As an optimization, regular revmap page number 0 is stored in
+ physical page number 1, that is, the page just after the metapage. This means
+ that scanning a table of about 1300 page ranges (the number of TIDs that fit in
+ a single 8kB page) does not require accessing the metapage at all.
+
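+ As an illustrative sketch (the variable names rangeno, logblk and slot are
+ made up here; pagesPerRange and REGULAR_REVMAP_PAGE_MAXITEMS are the ones
+ used in the code), the lookup boils down to simple arithmetic because
+ revmap entries have a fixed size:
+
+   rangeno = heapBlk / pagesPerRange;                  /* which page range? */
+   logblk  = rangeno / REGULAR_REVMAP_PAGE_MAXITEMS;   /* logical revmap page */
+   slot    = rangeno % REGULAR_REVMAP_PAGE_MAXITEMS;   /* entry within it */
+   /* logical revmap page 0 lives in physical block 1, right after the
+      metapage; other logical pages are reached through the revmap array
+      pages listed in the metapage. */
+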
+ When tuples are added to unsummarized pages, nothing needs to happen.
+
+ Heap tuples can be removed from anywhere without restriction. It might be
+ useful to mark the corresponding index tuple somehow, if the heap tuple is one
+ of the constraining values of the summary data (i.e. either min or max in the
+ case of a btree-opclass-bearing datatype), so that in the future we are aware
+ of the need to re-execute summarization on that range, leading to a possible
+ tightening of the summary values.
+
+ Index entries that are not referenced from the revmap can be removed from the
+ main fork. This currently happens at amvacuumcleanup, though it could be
+ carried out separately; no heap scan is necessary to determine which tuples
+ are unreachable.
+
+ Summarization
+ -------------
+
+ At index creation time, the whole table is scanned; for each page range the
+ summarizing values of each indexed column and nulls bitmap are collected and
+ stored in the index.
+
+ Once in a while, it is necessary to summarize a bunch of unsummarized pages
+ (because the table has grown since the index was created), or re-summarize a
+ range that has been marked invalid. This is simple: scan the page range
+ calculating the summary values for each indexed column, then insert the new
+ index entry at the end of the index.
+
+ The easiest way to go about this seems to be to have vacuum do it. That way we can
+ simply do re-summarization in the amvacuumcleanup routine. Other approaches would
+ mean we need a separate AM routine, which appears unwarranted at this stage.
+
+ Vacuuming
+ ---------
+
+ Vacuuming a table that has a minmax index does not represent a significant
+ challenge. Since no heap TIDs are stored, it's not necessary to scan the index
+ when heap tuples are removed. It might be that some min() value can be
+ incremented, or some max() value can be decremented; but this would represent
+ an optimization opportunity only, not a correctness issue. Perhaps it's
+ simpler to represent this as the need to re-run summarization on the affected
+ page range.
+
+ Note that if there are no indexes on the table other than the minmax index,
+ usage of maintenance_work_mem by vacuum can be decreased significantly, because
+ no detailed index scan needs to take place (and thus it's not necessary for
+ vacuum to save TIDs to remove). This optimization opportunity is best left for
+ future improvement.
+
+ Locking considerations
+ ----------------------
+
+ To read the TID during an index scan, we follow this protocol:
+
+ * read revmap page
+ * obtain share lock on the revmap buffer
+ * read the TID
+ * obtain share lock on buffer of main fork
+ * LockTuple the TID (using the index as relation). A shared lock is
+ sufficient. We need the LockTuple to prevent VACUUM from recycling
+ the index tuple; see below.
+ * release revmap buffer lock
+ * read the index tuple
+ * release the tuple lock
+ * release main fork buffer lock
+
+
+ To update the summary tuple for a page range, we use this protocol:
+
+ * insert a new index tuple somewhere in the main fork; note its TID
+ * read revmap page
+ * obtain exclusive lock on revmap buffer
+ * write the TID
+ * release lock
+
+ This ensures no concurrent reader can obtain a partially-written TID.
+ Note we don't need a tuple lock here. Concurrent scans don't have to
+ worry about whether they got the old or new index tuple: if they get the
+ old one, the tighter values are okay from a correctness standpoint because
+ due to MVCC they can't possibly see the just-inserted heap tuples anyway.
+
+
+ For vacuuming, we need to figure out which index tuples are no longer
+ referenced from the reverse range map. This requires some brute force,
+ but is simple:
+
+ 1) scan the complete index, store each existing TID in a dynahash.
+ Hash key is the TID, hash value is a boolean initially set to false.
+ 2) scan the complete revmap sequentially, read the TIDs on each page. Share
+ lock on each page is sufficient. For each TID so obtained, grab the
+ element from the hash and update the boolean to true.
+ 3) Scan the index again; for each tuple found, search the hash table.
+ If the tuple is not present in hash, it must have been added after our
+ initial scan; ignore it. If tuple is present in hash, and the hash flag is
+ true, then the tuple is referenced from the revmap; ignore it. If the hash
+ flag is false, then the index tuple is no longer referenced by the revmap;
+ but it could be about to be accessed by a concurrent scan. Do
+ ConditionalLockTuple. If this fails, ignore the tuple (it's in use), it
+ will be deleted by a future vacuum. If lock is acquired, then we can safely
+ remove the index tuple.
+ 4) Index pages with free space can be detected by this second scan. Register
+ those with the FSM.
+
+ Note this doesn't require scanning the heap at all, or being involved in
+ the heap's cleanup procedure. Also, there is no need to LockBufferForCleanup,
+ which is a nice property because index scans keep pages pinned for long
+ periods.
+
+
+
+ Optimizer
+ ---------
+
+ In order to make this all work, the only thing we need to do is ensure we have a
+ good enough opclass and amcostestimate. With this, the optimizer is able to pick
+ up the index on its own.
+
+
+ Open questions
+ --------------
+
+ * Same-size page ranges?
+ Current related literature seems to consider that each "index entry" in a
+ minmax index must cover the same number of pages. There doesn't seem to be a
+ hard reason for this to be so; it might make sense to allow the index to
+ self-tune so that some index entries cover smaller page ranges, if this allows
+ the summary values to be more compact. This would incur larger minmax
+ overhead for the index itself, but might allow better pruning of page ranges
+ during scan. In the limit of one index tuple per page, the index itself would
+ occupy too much space, even though we would be able to skip reading most
+ heap pages, because the summary values are tight; in the opposite limit of
+ a single tuple that summarizes the whole table, we wouldn't be able to prune
+ anything even though the index is very small. This can probably be made to work
+ by using the reverse range map as an index in itself.
+
+ * More compact representation for TIDBitmap?
+ TIDBitmap is the structure used to represent bitmap scans. The
+ representation of lossy page ranges is not optimal for our purposes, because
+ it uses a Bitmapset to represent pages in the range; since we're going to return
+ all pages in a large range, it might be more convenient to allow for a
+ struct that uses start and end page numbers to represent the range, instead.
+
+
+
+ References:
+
+ Email thread on pgsql-hackers
+ http://www.postgresql.org/message-id/1199296574.7260.149.camel@ebony.site
+ From: Simon Riggs
+ To: pgsql-hackers
+ Subject: Dynamic Partitioning using Segment Visibility Map
+
+ http://wiki.postgresql.org/wiki/Segment_Exclusion
+ http://wiki.postgresql.org/wiki/Segment_Visibility_Map
+
*** a/src/backend/access/Makefile
--- b/src/backend/access/Makefile
***************
*** 8,13 **** subdir = src/backend/access
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
! SUBDIRS = common gin gist hash heap index nbtree rmgrdesc spgist transam
include $(top_srcdir)/src/backend/common.mk
--- 8,13 ----
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
! SUBDIRS = common gin gist hash heap index minmax nbtree rmgrdesc spgist transam
include $(top_srcdir)/src/backend/common.mk
*** a/src/backend/access/common/reloptions.c
--- b/src/backend/access/common/reloptions.c
***************
*** 209,214 **** static relopt_int intRelOpts[] =
--- 209,221 ----
RELOPT_KIND_HEAP | RELOPT_KIND_TOAST
}, -1, 0, 2000000000
},
+ {
+ {
+ "pages_per_range",
+ "Number of pages that each page range covers in a Minmax index",
+ RELOPT_KIND_MINMAX
+ }, 128, 1, 131072
+ },
/* list terminator */
{{NULL}}
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 271,276 **** initscan(HeapScanDesc scan, ScanKey key, bool is_rescan)
--- 271,278 ----
scan->rs_startblock = 0;
}
+ scan->rs_initblock = 0;
+ scan->rs_numblocks = InvalidBlockNumber;
scan->rs_inited = false;
scan->rs_ctup.t_data = NULL;
ItemPointerSetInvalid(&scan->rs_ctup.t_self);
***************
*** 296,301 **** initscan(HeapScanDesc scan, ScanKey key, bool is_rescan)
--- 298,311 ----
pgstat_count_heap_scan(scan->rs_rd);
}
+ void
+ heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk, BlockNumber numBlks)
+ {
+ scan->rs_startblock = startBlk;
+ scan->rs_initblock = startBlk;
+ scan->rs_numblocks = numBlks;
+ }
+
/*
* heapgetpage - subroutine for heapgettup()
*
***************
*** 636,642 **** heapgettup(HeapScanDesc scan,
*/
if (backward)
{
! finished = (page == scan->rs_startblock);
if (page == 0)
page = scan->rs_nblocks;
page--;
--- 646,653 ----
*/
if (backward)
{
! finished = (page == scan->rs_startblock) ||
! (scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
if (page == 0)
page = scan->rs_nblocks;
page--;
***************
*** 646,652 **** heapgettup(HeapScanDesc scan,
page++;
if (page >= scan->rs_nblocks)
page = 0;
! finished = (page == scan->rs_startblock);
/*
* Report our new scan position for synchronization purposes. We
--- 657,664 ----
page++;
if (page >= scan->rs_nblocks)
page = 0;
! finished = (page == scan->rs_startblock) ||
! (scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
/*
* Report our new scan position for synchronization purposes. We
***************
*** 897,903 **** heapgettup_pagemode(HeapScanDesc scan,
*/
if (backward)
{
! finished = (page == scan->rs_startblock);
if (page == 0)
page = scan->rs_nblocks;
page--;
--- 909,916 ----
*/
if (backward)
{
! finished = (page == scan->rs_startblock) ||
! (scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
if (page == 0)
page = scan->rs_nblocks;
page--;
***************
*** 907,913 **** heapgettup_pagemode(HeapScanDesc scan,
page++;
if (page >= scan->rs_nblocks)
page = 0;
! finished = (page == scan->rs_startblock);
/*
* Report our new scan position for synchronization purposes. We
--- 920,927 ----
page++;
if (page >= scan->rs_nblocks)
page = 0;
! finished = (page == scan->rs_startblock) ||
! (scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
/*
* Report our new scan position for synchronization purposes. We
*** /dev/null
--- b/src/backend/access/minmax/Makefile
***************
*** 0 ****
--- 1,17 ----
+ #-------------------------------------------------------------------------
+ #
+ # Makefile--
+ # Makefile for access/minmax
+ #
+ # IDENTIFICATION
+ # src/backend/access/minmax/Makefile
+ #
+ #-------------------------------------------------------------------------
+
+ subdir = src/backend/access/minmax
+ top_builddir = ../../../..
+ include $(top_builddir)/src/Makefile.global
+
+ OBJS = minmax.o mmrevmap.o mmtuple.o mmxlog.o mmsortable.o
+
+ include $(top_srcdir)/src/backend/common.mk
*** /dev/null
--- b/src/backend/access/minmax/minmax.c
***************
*** 0 ****
--- 1,1383 ----
+ /*
+ * minmax.c
+ * Implementation of Minmax indexes for Postgres
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/minmax.c
+ *
+ * TODO
+ * * ScalarArrayOpExpr (amsearcharray -> SK_SEARCHARRAY)
+ * * add support for unlogged indexes
+ * * ditto expressional indexes
+ */
+ #include "postgres.h"
+
+ #include "access/htup_details.h"
+ #include "access/minmax.h"
+ #include "access/minmax_internal.h"
+ #include "access/minmax_page.h"
+ #include "access/minmax_revmap.h"
+ #include "access/minmax_tuple.h"
+ #include "access/minmax_xlog.h"
+ #include "access/reloptions.h"
+ #include "access/relscan.h"
+ #include "access/xlogutils.h"
+ #include "catalog/index.h"
+ #include "catalog/pg_operator.h"
+ #include "commands/vacuum.h"
+ #include "miscadmin.h"
+ #include "pgstat.h"
+ #include "storage/bufmgr.h"
+ #include "storage/freespace.h"
+ #include "storage/indexfsm.h"
+ #include "storage/lmgr.h"
+ #include "utils/datum.h"
+ #include "utils/lsyscache.h"
+ #include "utils/memutils.h"
+ #include "utils/syscache.h"
+
+
+ /*
+ * We use a MMBuildState during initial construction of a Minmax index.
+ * The running state is kept in a DeformedMMTuple.
+ */
+ typedef struct MMBuildState
+ {
+ Relation irel;
+ int numtuples;
+ Buffer currentInsertBuf;
+ BlockNumber pagesPerRange;
+ BlockNumber currRangeStart;
+ mmRevmapAccess *rmAccess;
+ MinmaxDesc *mmDesc;
+ bool seentup;
+ bool extended;
+ DeformedMMTuple *dtuple;
+ } MMBuildState;
+
+ /*
+ * Struct used as "opaque" during index scans
+ */
+ typedef struct MinmaxOpaque
+ {
+ BlockNumber pagesPerRange;
+ mmRevmapAccess *rmAccess;
+ MinmaxDesc *mmDesc;
+ } MinmaxOpaque;
+
+ static MMBuildState *initialize_mm_buildstate(Relation idxRel,
+ mmRevmapAccess *rmAccess, BlockNumber pagesPerRange);
+ static bool terminate_mm_buildstate(MMBuildState *state);
+ static void summarize_range(MMBuildState *mmstate, Relation heapRel,
+ BlockNumber heapBlk);
+ static bool mm_doupdate(Relation idxrel, BlockNumber pagesPerRange,
+ mmRevmapAccess *rmAccess, BlockNumber heapBlk,
+ Buffer oldbuf, OffsetNumber oldoff,
+ MMTuple *origtup, Size origsz,
+ MMTuple *newtup, Size newsz,
+ bool samepage, bool *extended);
+ static void mm_doinsert(Relation idxrel, BlockNumber pagesPerRange,
+ mmRevmapAccess *rmAccess, Buffer *buffer, BlockNumber heapblkno,
+ MMTuple *tup, Size itemsz, bool *extended);
+ static Buffer mm_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
+ bool *extended);
+ static void form_and_insert_tuple(MMBuildState *mmstate);
+
+
+ /*
+ * A tuple in the heap is being inserted. To keep a minmax index up to date,
+ * we need to obtain the relevant index tuple, compare its stored values with
+ * those of the new tuple; if the tuple values are consistent with the summary
+ * tuple, there's nothing to do; otherwise we need to update the index.
+ *
+ * If the range is not currently summarized (i.e. the revmap returns InvalidTid
+ * for it), there's nothing to do either.
+ */
+ Datum
+ mminsert(PG_FUNCTION_ARGS)
+ {
+ Relation idxRel = (Relation) PG_GETARG_POINTER(0);
+ Datum *values = (Datum *) PG_GETARG_POINTER(1);
+ bool *nulls = (bool *) PG_GETARG_POINTER(2);
+ ItemPointer heaptid = (ItemPointer) PG_GETARG_POINTER(3);
+
+ /* we ignore the rest of our arguments */
+ BlockNumber pagesPerRange;
+ MinmaxDesc *mmdesc;
+ mmRevmapAccess *rmAccess;
+ OffsetNumber off;
+ MMTuple *mmtup;
+ DeformedMMTuple *dtup;
+ BlockNumber heapBlk;
+ Buffer buf = InvalidBuffer;
+ int keyno;
+ bool need_insert = false;
+ bool extended = false;
+
+ rmAccess = mmRevmapAccessInit(idxRel, &pagesPerRange);
+
+ heapBlk = ItemPointerGetBlockNumber(heaptid);
+ /* normalize the block number to be the first block in the range */
+ heapBlk = (heapBlk / pagesPerRange) * pagesPerRange;
+ mmtup = mmGetMMTupleForHeapBlock(rmAccess, heapBlk, &buf, &off,
+ BUFFER_LOCK_SHARE);
+
+ if (!mmtup)
+ {
+ /* nothing to do, range is unsummarized */
+ mmRevmapAccessTerminate(rmAccess);
+ if (BufferIsValid(buf))
+ ReleaseBuffer(buf);
+ return BoolGetDatum(false);
+ }
+
+ mmdesc = minmax_build_mmdesc(idxRel);
+ dtup = minmax_deform_tuple(mmdesc, mmtup);
+
+ /*
+ * Compare the key values of the new tuple to the stored index values; our
+ * deformed tuple will get updated if the new tuple doesn't fit the
+ * original range (note this means we can't break out of the loop early).
+ * Make a note of whether this happens, so that we know to insert the
+ * modified tuple later.
+ */
+ for (keyno = 0; keyno < mmdesc->md_tupdesc->natts; keyno++)
+ {
+ Datum result;
+ FmgrInfo *addValue;
+
+ addValue = index_getprocinfo(idxRel, keyno + 1,
+ MINMAX_PROCNUM_ADDVALUE);
+
+ result = FunctionCall5Coll(addValue,
+ idxRel->rd_indcollation[keyno],
+ PointerGetDatum(mmdesc),
+ PointerGetDatum(dtup),
+ UInt16GetDatum(keyno + 1),
+ values[keyno],
+ nulls[keyno]);
+ /* if that returned true, we need to insert the updated tuple */
+ need_insert |= DatumGetBool(result);
+ }
+
+ if (need_insert)
+ {
+ Page page = BufferGetPage(buf);
+ ItemId lp = PageGetItemId(page, off);
+ Size origsz;
+ MMTuple *origtup;
+ Size newsz;
+ MMTuple *newtup;
+ bool samepage;
+
+ /*
+ * Make a copy of the old tuple, so that we can compare it after
+ * re-acquiring the lock.
+ */
+ origsz = ItemIdGetLength(lp);
+ origtup = minmax_copy_tuple(mmtup, origsz);
+
+ /* form the new tuple now, so that we know its size */
+ newtup = minmax_form_tuple(mmdesc, heapBlk, dtup, &newsz);
+
+ /* before releasing the lock, check if we can do a same-page update. */
+ if (newsz <= origsz || PageGetExactFreeSpace(page) >= (newsz - origsz))
+ samepage = true;
+ else
+ samepage = false;
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ mm_doupdate(idxRel, pagesPerRange, rmAccess, heapBlk, buf, off, origtup, origsz,
+ newtup, newsz, samepage, &extended);
+ }
+ else
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ ReleaseBuffer(buf);
+
+ mmRevmapAccessTerminate(rmAccess);
+ minmax_free_mmdesc(mmdesc);
+
+ if (extended)
+ IndexFreeSpaceMapVacuum(idxRel);
+
+ return BoolGetDatum(false);
+ }
+
+ /*
+ * Initialize state for a Minmax index scan.
+ *
+ * We read the metapage here to determine the pages-per-range number that this
+ * index was built with. Note that since this cannot be changed while we're
+ * holding lock on index, it's not necessary to recompute it during mmrescan.
+ */
+ Datum
+ mmbeginscan(PG_FUNCTION_ARGS)
+ {
+ Relation r = (Relation) PG_GETARG_POINTER(0);
+ int nkeys = PG_GETARG_INT32(1);
+ int norderbys = PG_GETARG_INT32(2);
+ IndexScanDesc scan;
+ MinmaxOpaque *opaque;
+
+ scan = RelationGetIndexScan(r, nkeys, norderbys);
+
+ opaque = (MinmaxOpaque *) palloc(sizeof(MinmaxOpaque));
+ opaque->rmAccess = mmRevmapAccessInit(r, &opaque->pagesPerRange);
+ opaque->mmDesc = minmax_build_mmdesc(r);
+ scan->opaque = opaque;
+
+ PG_RETURN_POINTER(scan);
+ }
+
+ /*
+ * Execute the index scan.
+ *
+ * This works by reading index TIDs from the revmap, and obtaining the index
+ * tuples pointed to by them; the summary values in the index tuples are
+ * compared to the scan keys. We return into the TID bitmap all the pages in
+ * ranges corresponding to index tuples that match the scan keys.
+ *
+ * If a TID from the revmap is read as InvalidTID, we know that range is
+ * unsummarized. Pages in those ranges need to be returned regardless of scan
+ * keys.
+ *
+ * XXX see _bt_first on what to do about sk_subtype.
+ */
+ Datum
+ mmgetbitmap(PG_FUNCTION_ARGS)
+ {
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+ TIDBitmap *tbm = (TIDBitmap *) PG_GETARG_POINTER(1);
+ Relation idxRel = scan->indexRelation;
+ Buffer buf = InvalidBuffer;
+ MinmaxDesc *mmdesc;
+ Oid heapOid;
+ Relation heapRel;
+ MinmaxOpaque *opaque;
+ BlockNumber nblocks;
+ BlockNumber heapBlk;
+ int totalpages = 0;
+ int keyno;
+ FmgrInfo *consistentFn;
+
+ opaque = (MinmaxOpaque *) scan->opaque;
+ mmdesc = opaque->mmDesc;
+ pgstat_count_index_scan(idxRel);
+
+ /*
+ * XXX We need to know the size of the table so that we know how long to
+ * iterate on the revmap. There's room for improvement here, in that we
+ * could have the revmap tell us when to stop iterating.
+ */
+ heapOid = IndexGetRelation(RelationGetRelid(idxRel), false);
+ heapRel = heap_open(heapOid, AccessShareLock);
+ nblocks = RelationGetNumberOfBlocks(heapRel);
+ heap_close(heapRel, AccessShareLock);
+
+ /*
+ * Obtain consistent functions for all indexed columns. Maybe it'd be
+ * possible to do this lazily only the first time we see a scan key that
+ * involves each particular attribute.
+ */
+ consistentFn = palloc(sizeof(FmgrInfo) * mmdesc->md_tupdesc->natts);
+ for (keyno = 0; keyno < mmdesc->md_tupdesc->natts; keyno++)
+ {
+ FmgrInfo *tmp;
+
+ tmp = index_getprocinfo(idxRel, keyno + 1, MINMAX_PROCNUM_CONSISTENT);
+ fmgr_info_copy(&consistentFn[keyno], tmp, CurrentMemoryContext);
+ }
+
+ /*
+ * Now scan the revmap. We start by querying for heap page 0,
+ * incrementing by the number of pages per range; this gives us a full
+ * view of the table.
+ */
+ for (heapBlk = 0; heapBlk < nblocks; heapBlk += opaque->pagesPerRange)
+ {
+ bool addrange;
+ OffsetNumber off;
+ MMTuple *tup;
+
+ tup = mmGetMMTupleForHeapBlock(opaque->rmAccess, heapBlk, &buf, &off,
+ BUFFER_LOCK_SHARE);
+ /*
+ * For page ranges with no indexed tuple, we must return the whole
+ * range; otherwise, compare it to the scan keys.
+ */
+ if (tup == NULL)
+ {
+ addrange = true;
+ }
+ else
+ {
+ DeformedMMTuple *dtup;
+ int keyno;
+
+ dtup = minmax_deform_tuple(mmdesc, tup);
+
+ /*
+ * Compare scan keys with summary values stored for the range. If
+ * scan keys are matched, the page range must be added to the
+ * bitmap. We initially assume the range needs to be added; in
+ * particular this serves the case where there are no keys.
+ */
+ addrange = true;
+ for (keyno = 0; keyno < scan->numberOfKeys; keyno++)
+ {
+ ScanKey key = &scan->keyData[keyno];
+ AttrNumber keyattno = key->sk_attno;
+ Datum add;
+
+ /*
+ * The collation of the scan key must match the collation used
+ * in the index column. Otherwise we shouldn't be using this
+ * index ...
+ */
+ Assert(key->sk_collation ==
+ mmdesc->md_tupdesc->attrs[keyattno - 1]->attcollation);
+
+ /*
+ * Check whether the scan key is consistent with the page range
+ * values; if so, have the pages in the range added to the
+ * output bitmap.
+ *
+ * When there are multiple scan keys, failure to meet the
+ * criteria for a single one of them is enough to discard the
+ * range as a whole, so break out of the loop as soon as a
+ * false return value is obtained.
+ */
+ add = FunctionCall3Coll(&consistentFn[keyattno - 1],
+ key->sk_collation,
+ PointerGetDatum(mmdesc),
+ PointerGetDatum(dtup),
+ PointerGetDatum(key));
+ addrange = DatumGetBool(add);
+ if (!addrange)
+ break;
+ }
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ pfree(dtup);
+ }
+
+ /* add the pages in the range to the output bitmap, if needed */
+ if (addrange)
+ {
+ BlockNumber pageno;
+
+ for (pageno = heapBlk;
+ pageno <= heapBlk + opaque->pagesPerRange - 1;
+ pageno++)
+ {
+ tbm_add_page(tbm, pageno);
+ totalpages++;
+ }
+ }
+ }
+
+ if (buf != InvalidBuffer)
+ ReleaseBuffer(buf);
+
+ /*
+ * XXX We have an approximation of the number of *pages* that our scan
+ * returns, but we don't have a precise idea of the number of heap tuples
+ * involved.
+ */
+ PG_RETURN_INT64(totalpages * 10);
+ }
+
+ /*
+ * Re-initialize state for a minmax index scan
+ */
+ Datum
+ mmrescan(PG_FUNCTION_ARGS)
+ {
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+ ScanKey scankey = (ScanKey) PG_GETARG_POINTER(1);
+ /* other arguments ignored */
+
+ if (scankey && scan->numberOfKeys > 0)
+ memmove(scan->keyData, scankey,
+ scan->numberOfKeys * sizeof(ScanKeyData));
+
+ PG_RETURN_VOID();
+ }
+
+ /*
+ * Close down a minmax index scan
+ */
+ Datum
+ mmendscan(PG_FUNCTION_ARGS)
+ {
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+ MinmaxOpaque *opaque = (MinmaxOpaque *) scan->opaque;
+
+ mmRevmapAccessTerminate(opaque->rmAccess);
+ minmax_free_mmdesc(opaque->mmDesc);
+ pfree(opaque);
+
+ PG_RETURN_VOID();
+ }
+
+ Datum
+ mmmarkpos(PG_FUNCTION_ARGS)
+ {
+ elog(ERROR, "MinMax does not support mark/restore");
+ PG_RETURN_VOID();
+ }
+
+ Datum
+ mmrestrpos(PG_FUNCTION_ARGS)
+ {
+ elog(ERROR, "MinMax does not support mark/restore");
+ PG_RETURN_VOID();
+ }
+
+ /*
+ * Per-heap-tuple callback for IndexBuildHeapScan.
+ *
+ * Note we don't worry about the page range at the end of the table here; it is
+ * present in the build state struct after we've been called for the last time,
+ * but is not inserted into the index. The caller must take care of that, if
+ * appropriate.
+ */
+ static void
+ mmbuildCallback(Relation index,
+ HeapTuple htup,
+ Datum *values,
+ bool *isnull,
+ bool tupleIsAlive,
+ void *state)
+ {
+ MMBuildState *mmstate = (MMBuildState *) state;
+ BlockNumber thisblock;
+ int i;
+
+ thisblock = ItemPointerGetBlockNumber(&htup->t_self);
+
+ /*
+ * If we're in a new block which belongs to the next range, summarize what
+ * we've got and start afresh.
+ */
+ if (thisblock > (mmstate->currRangeStart + mmstate->pagesPerRange - 1))
+ {
+
+ MINMAX_elog(DEBUG2, "mmbuildCallback: completed a range: %u--%u",
+ mmstate->currRangeStart,
+ mmstate->currRangeStart + mmstate->pagesPerRange);
+
+ /* create the index tuple and insert it */
+ form_and_insert_tuple(mmstate);
+
+ /* set state to correspond to the next range */
+ mmstate->currRangeStart += mmstate->pagesPerRange;
+
+ /* re-initialize state for it */
+ minmax_dtuple_initialize(mmstate->dtuple, mmstate->mmDesc);
+ mmstate->seentup = false;
+ }
+
+ /* Accumulate the current tuple into the running state */
+ mmstate->seentup = true;
+ for (i = 0; i < mmstate->mmDesc->md_tupdesc->natts; i++)
+ {
+ FmgrInfo *addValue;
+
+ addValue = index_getprocinfo(index, i + 1,
+ MINMAX_PROCNUM_ADDVALUE);
+
+ /*
+ * Update dtuple state, if and as necessary.
+ */
+ FunctionCall5Coll(addValue,
+ mmstate->mmDesc->md_tupdesc->attrs[i]->attcollation,
+ PointerGetDatum(mmstate->mmDesc),
+ PointerGetDatum(mmstate->dtuple),
+ UInt16GetDatum(i + 1), values[i], isnull[i]);
+ }
+ }
+
+ /*
+ * mmbuild() -- build a new minmax index.
+ */
+ Datum
+ mmbuild(PG_FUNCTION_ARGS)
+ {
+ Relation heap = (Relation) PG_GETARG_POINTER(0);
+ Relation index = (Relation) PG_GETARG_POINTER(1);
+ IndexInfo *indexInfo = (IndexInfo *) PG_GETARG_POINTER(2);
+ IndexBuildResult *result;
+ double reltuples;
+ double idxtuples;
+ mmRevmapAccess *rmAccess;
+ MMBuildState *mmstate;
+ Buffer meta;
+ BlockNumber pagesPerRange;
+
+ /*
+ * We expect to be called exactly once for any index relation.
+ */
+ if (RelationGetNumberOfBlocks(index) != 0)
+ elog(ERROR, "index \"%s\" already contains data",
+ RelationGetRelationName(index));
+
+ /* partial indexes not supported */
+ if (indexInfo->ii_Predicate != NIL)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("partial indexes not supported")));
+ /* expressions not supported (yet?) */
+ if (indexInfo->ii_Expressions != NIL)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("expression indexes not supported")));
+
+ meta = ReadBuffer(index, P_NEW);
+ Assert(BufferGetBlockNumber(meta) == MINMAX_METAPAGE_BLKNO);
+ LockBuffer(meta, BUFFER_LOCK_EXCLUSIVE);
+
+ START_CRIT_SECTION();
+ mm_metapage_init(BufferGetPage(meta), MinmaxGetPagesPerRange(index),
+ MINMAX_CURRENT_VERSION);
+ MarkBufferDirty(meta);
+
+ if (RelationNeedsWAL(index))
+ {
+ xl_minmax_createidx xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata;
+ Page page;
+
+ xlrec.node = index->rd_node;
+ xlrec.version = MINMAX_CURRENT_VERSION;
+ xlrec.pagesPerRange = MinmaxGetPagesPerRange(index);
+
+ rdata.buffer = InvalidBuffer;
+ rdata.data = (char *) &xlrec;
+ rdata.len = SizeOfMinmaxCreateIdx;
+ rdata.next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_CREATE_INDEX, &rdata);
+
+ page = BufferGetPage(meta);
+ PageSetLSN(page, recptr);
+ }
+
+ UnlockReleaseBuffer(meta);
+ END_CRIT_SECTION();
+
+ /*
+ * Set up an empty revmap, and get access to it
+ */
+ mmRevmapCreate(index);
+ rmAccess = mmRevmapAccessInit(index, &pagesPerRange);
+
+ /*
+ * Initialize our state, including the deformed tuple state.
+ */
+ mmstate = initialize_mm_buildstate(index, rmAccess, pagesPerRange);
+
+ /*
+ * Now scan the relation. No syncscan allowed here because we want the
+ * heap blocks in physical order.
+ */
+ reltuples = IndexBuildHeapScan(heap, index, indexInfo, false,
+ mmbuildCallback, (void *) mmstate);
+
+ /* process the final batch */
+ form_and_insert_tuple(mmstate);
+
+ /* release resources */
+ idxtuples = mmstate->numtuples;
+ mmRevmapAccessTerminate(mmstate->rmAccess);
+ if (terminate_mm_buildstate(mmstate))
+ IndexFreeSpaceMapVacuum(index);
+
+ /*
+ * Return statistics
+ */
+ result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+
+ result->heap_tuples = reltuples;
+ result->index_tuples = idxtuples;
+
+ PG_RETURN_POINTER(result);
+ }
+
+ Datum
+ mmbuildempty(PG_FUNCTION_ARGS)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("unlogged MinMax indexes are not supported")));
+
+ PG_RETURN_VOID();
+ }
+
+ /*
+ * mmbulkdelete
+ * Since there are no per-heap-tuple index tuples in minmax indexes,
+ * there's not a lot we can do here.
+ *
+ * XXX we could mark item tuples as "dirty" (when a minimum or maximum heap
+ * tuple is deleted), meaning that summarization would need to be re-run on the
+ * affected range. That would require an extra flag in mmtuples.
+ */
+ Datum
+ mmbulkdelete(PG_FUNCTION_ARGS)
+ {
+ /* other arguments are not currently used */
+ IndexBulkDeleteResult *stats = (IndexBulkDeleteResult *) PG_GETARG_POINTER(1);
+
+ /* allocate stats if first time through, else re-use existing struct */
+ if (stats == NULL)
+ stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+ PG_RETURN_POINTER(stats);
+ }
+
+ /*
+ * This routine is in charge of "vacuuming" a minmax index: we just summarize
+ * ranges that are currently unsummarized.
+ */
+ Datum
+ mmvacuumcleanup(PG_FUNCTION_ARGS)
+ {
+ IndexVacuumInfo *info = (IndexVacuumInfo *) PG_GETARG_POINTER(0);
+ IndexBulkDeleteResult *stats = (IndexBulkDeleteResult *) PG_GETARG_POINTER(1);
+ mmRevmapAccess *rmAccess;
+ MMBuildState *mmstate = NULL;
+ Relation heapRel;
+ BlockNumber heapNumBlocks;
+ BlockNumber heapBlk;
+ BlockNumber pagesPerRange;
+ Buffer buf;
+
+ /* No-op in ANALYZE ONLY mode */
+ if (info->analyze_only)
+ PG_RETURN_POINTER(stats);
+
+ heapRel = heap_open(IndexGetRelation(RelationGetRelid(info->index), false),
+ AccessShareLock);
+
+ /*
+ * Scan the revmap to find unsummarized items.
+ */
+ rmAccess = mmRevmapAccessInit(info->index, &pagesPerRange);
+ buf = InvalidBuffer;
+ heapNumBlocks = RelationGetNumberOfBlocks(heapRel);
+ for (heapBlk = 0; heapBlk < heapNumBlocks; heapBlk += pagesPerRange)
+ {
+ MMTuple *tup;
+ OffsetNumber off;
+
+ tup = mmGetMMTupleForHeapBlock(rmAccess, heapBlk, &buf, &off,
+ BUFFER_LOCK_SHARE);
+ if (tup == NULL)
+ {
+ /* no revmap entry for this heap range. Summarize it. */
+ if (mmstate == NULL)
+ mmstate = initialize_mm_buildstate(info->index, rmAccess,
+ pagesPerRange);
+ summarize_range(mmstate, heapRel, heapBlk);
+ }
+ else
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ }
+
+ if (BufferIsValid(buf))
+ ReleaseBuffer(buf);
+
+ /* free resources */
+ mmRevmapAccessTerminate(rmAccess);
+ if (mmstate && terminate_mm_buildstate(mmstate))
+ IndexFreeSpaceMapVacuum(info->index);
+
+ heap_close(heapRel, AccessShareLock);
+
+ PG_RETURN_POINTER(stats);
+ }
+
+ /*
+ * reloptions processor for minmax indexes
+ */
+ Datum
+ mmoptions(PG_FUNCTION_ARGS)
+ {
+ Datum reloptions = PG_GETARG_DATUM(0);
+ bool validate = PG_GETARG_BOOL(1);
+ relopt_value *options;
+ MinmaxOptions *rdopts;
+ int numoptions;
+ static const relopt_parse_elt tab[] = {
+ {"pages_per_range", RELOPT_TYPE_INT, offsetof(MinmaxOptions, pagesPerRange)}
+ };
+
+ options = parseRelOptions(reloptions, validate, RELOPT_KIND_MINMAX,
+ &numoptions);
+
+ /* if none set, we're done */
+ if (numoptions == 0)
+ PG_RETURN_NULL();
+
+ rdopts = allocateReloptStruct(sizeof(MinmaxOptions), options, numoptions);
+
+ fillRelOptions((void *) rdopts, sizeof(MinmaxOptions), options, numoptions,
+ validate, tab, lengthof(tab));
+
+ pfree(options);
+
+ PG_RETURN_BYTEA_P(rdopts);
+ }
+
+ /*
+ * Initialize a page with the given type.
+ *
+ * Caller is responsible for marking it dirty, as appropriate.
+ */
+ void
+ mm_page_init(Page page, uint16 type)
+ {
+ MinmaxSpecialSpace *special;
+
+ PageInit(page, BLCKSZ, sizeof(MinmaxSpecialSpace));
+
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
+ special->type = type;
+ }
+
+ /*
+ * Initialize a new minmax index's metapage.
+ */
+ void
+ mm_metapage_init(Page page, BlockNumber pagesPerRange, uint16 version)
+ {
+ MinmaxMetaPageData *metadata;
+ int i;
+
+ mm_page_init(page, MINMAX_PAGETYPE_META);
+
+ metadata = (MinmaxMetaPageData *) PageGetContents(page);
+
+ metadata->minmaxMagic = MINMAX_META_MAGIC;
+ metadata->pagesPerRange = pagesPerRange;
+ metadata->minmaxVersion = version;
+ for (i = 0; i < MAX_REVMAP_ARRAYPAGES; i++)
+ metadata->revmapArrayPages[i] = InvalidBlockNumber;
+ }
+
+ /*
+ * Build a MinmaxDesc used to create or scan a minmax index
+ */
+ MinmaxDesc *
+ minmax_build_mmdesc(Relation rel)
+ {
+ MinmaxOpcInfo **opcinfo;
+ MinmaxDesc *mmdesc;
+ TupleDesc tupdesc;
+ int totalstored = 0;
+ int keyno;
+ long totalsize;
+
+ tupdesc = RelationGetDescr(rel);
+ IncrTupleDescRefCount(tupdesc);
+
+ /*
+ * Obtain MinmaxOpcInfo for each indexed column. While at it, accumulate
+ * the number of columns stored, since the number is opclass-defined.
+ */
+ opcinfo = (MinmaxOpcInfo **) palloc(sizeof(MinmaxOpcInfo *) * tupdesc->natts);
+ for (keyno = 0; keyno < tupdesc->natts; keyno++)
+ {
+ FmgrInfo *opcInfoFn;
+
+ opcInfoFn = index_getprocinfo(rel, keyno + 1, MINMAX_PROCNUM_OPCINFO);
+
+ /* actually FunctionCall0 but we don't have that */
+ opcinfo[keyno] = (MinmaxOpcInfo *)
+ DatumGetPointer(FunctionCall1(opcInfoFn, InvalidOid));
+ totalstored += opcinfo[keyno]->oi_nstored;
+ }
+
+ /* Allocate our result struct and fill it in */
+ totalsize = offsetof(MinmaxDesc, md_info) +
+ sizeof(MinmaxOpcInfo *) * tupdesc->natts;
+
+ mmdesc = palloc(totalsize);
+ mmdesc->md_index = rel;
+ mmdesc->md_tupdesc = tupdesc;
+ mmdesc->md_disktdesc = NULL; /* generated lazily */
+ mmdesc->md_totalstored = totalstored;
+
+ for (keyno = 0; keyno < tupdesc->natts; keyno++)
+ mmdesc->md_info[keyno] = opcinfo[keyno];
+ pfree(opcinfo);
+
+ return mmdesc;
+ }
+
+ void
+ minmax_free_mmdesc(MinmaxDesc *mmdesc)
+ {
+ int keyno;
+
+ for (keyno = 0; keyno < mmdesc->md_tupdesc->natts; keyno++)
+ pfree(mmdesc->md_info[keyno]);
+ DecrTupleDescRefCount(mmdesc->md_tupdesc);
+ pfree(mmdesc);
+ }
+
+ /*
+ * Initialize a MMBuildState appropriate to create tuples on the given index.
+ */
+ static MMBuildState *
+ initialize_mm_buildstate(Relation idxRel, mmRevmapAccess *rmAccess,
+ BlockNumber pagesPerRange)
+ {
+ MMBuildState *mmstate;
+
+ mmstate = palloc(sizeof(MMBuildState));
+
+ mmstate->irel = idxRel;
+ mmstate->numtuples = 0;
+ mmstate->currentInsertBuf = InvalidBuffer;
+ mmstate->pagesPerRange = pagesPerRange;
+ mmstate->currRangeStart = 0;
+ mmstate->rmAccess = rmAccess;
+ mmstate->mmDesc = minmax_build_mmdesc(idxRel);
+ mmstate->dtuple = minmax_new_dtuple(mmstate->mmDesc);
+ mmstate->extended = false;
+
+ minmax_dtuple_initialize(mmstate->dtuple, mmstate->mmDesc);
+ mmstate->seentup = false;
+
+ return mmstate;
+ }
+
+ /*
+ * Release resources associated with a MMBuildState. Returns whether the FSM
+ * should be vacuumed afterwards.
+ */
+ static bool
+ terminate_mm_buildstate(MMBuildState *mmstate)
+ {
+ bool vacuumfsm;
+
+ /* release the last index buffer used */
+ if (!BufferIsInvalid(mmstate->currentInsertBuf))
+ {
+ Page page;
+
+ page = BufferGetPage(mmstate->currentInsertBuf);
+ RecordPageWithFreeSpace(mmstate->irel,
+ BufferGetBlockNumber(mmstate->currentInsertBuf),
+ PageGetFreeSpace(page));
+ ReleaseBuffer(mmstate->currentInsertBuf);
+ }
+ vacuumfsm = mmstate->extended;
+
+ minmax_free_mmdesc(mmstate->mmDesc);
+ pfree(mmstate->dtuple);
+ pfree(mmstate);
+
+ return vacuumfsm;
+ }
+
+ /*
+ * Summarize the given page range of the given index.
+ */
+ static void
+ summarize_range(MMBuildState *mmstate, Relation heapRel, BlockNumber heapBlk)
+ {
+ IndexInfo *indexInfo;
+
+ indexInfo = BuildIndexInfo(mmstate->irel);
+
+ mmstate->currRangeStart = heapBlk;
+
+ /*
+ * Execute the partial heap scan covering the heap blocks in the
+ * specified page range, summarizing the heap tuples in it. This scan
+ * stops just short of mmbuildCallback creating the new index entry.
+ */
+ IndexBuildHeapRangeScan(heapRel, mmstate->irel, indexInfo, false,
+ heapBlk, mmstate->pagesPerRange,
+ mmbuildCallback, (void *) mmstate);
+
+ /*
+ * Create the index tuple and insert it. Note mmbuildCallback didn't
+ * have the chance to actually insert anything into the index, because
+ * the heapscan should have ended just as it reached the final tuple in
+ * the range.
+ */
+ form_and_insert_tuple(mmstate);
+
+ /* and re-initialize state for the next range */
+ minmax_dtuple_initialize(mmstate->dtuple, mmstate->mmDesc);
+ mmstate->seentup = false;
+ }
+
+ /*
+ * Update tuple origtup (size origsz), located in offset oldoff of buffer
+ * oldbuf, to newtup (size newsz) as summary tuple for the page range starting
+ * at heapBlk. If samepage is true, then attempt to put the new tuple in the same
+ * page, otherwise get a new one.
+ *
+ * If the update is done, return true; the revmap is updated to point to the
+ * new tuple. If the update is not done for whatever reason, return false.
+ * Caller may retry the update if this happens.
+ *
+ * If the index had to be extended in the course of this operation, *extended
+ * is set to true.
+ */
+ static bool
+ mm_doupdate(Relation idxrel, BlockNumber pagesPerRange,
+ mmRevmapAccess *rmAccess, BlockNumber heapBlk,
+ Buffer oldbuf, OffsetNumber oldoff,
+ MMTuple *origtup, Size origsz,
+ MMTuple *newtup, Size newsz,
+ bool samepage, bool *extended)
+ {
+ Page oldpage;
+ ItemId origlp;
+ MMTuple *oldtup;
+ Size oldsz;
+ Buffer newbuf;
+
+ if (!samepage)
+ {
+ /* need a page on which to put the item */
+ newbuf = mm_getinsertbuffer(idxrel, oldbuf, newsz, extended);
+ /*
+ * Note: it's possible (though unlikely) that the returned newbuf is
+ * the same as oldbuf, if mm_getinsertbuffer determined that the old
+ * buffer does in fact have enough space.
+ */
+ if (newbuf == oldbuf)
+ newbuf = InvalidBuffer;
+ }
+ else
+ {
+ LockBuffer(oldbuf, BUFFER_LOCK_EXCLUSIVE);
+ newbuf = InvalidBuffer;
+ }
+ oldpage = BufferGetPage(oldbuf);
+ origlp = PageGetItemId(oldpage, oldoff);
+
+ /* Check that the old tuple wasn't updated concurrently */
+ if (!ItemIdIsNormal(origlp))
+ {
+ LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+ if (BufferIsValid(newbuf))
+ UnlockReleaseBuffer(newbuf);
+ return false;
+ }
+
+ oldsz = ItemIdGetLength(origlp);
+ oldtup = (MMTuple *) PageGetItem(oldpage, origlp);
+
+ /*
+ * If the tuple on the page no longer matches the one we originally read,
+ * somebody updated it concurrently; tell caller to start over.
+ */
+ if (!minmax_tuples_equal(oldtup, oldsz, origtup, origsz))
+ {
+ LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+ if (BufferIsValid(newbuf))
+ UnlockReleaseBuffer(newbuf);
+ return false;
+ }
+
+ /*
+ * Great, the old tuple is intact. We can proceed with the update.
+ *
+ * If there's enough room on the old page for the new tuple, replace it.
+ *
+ * Note that there might now be enough space on the page even though
+ * the caller told us there isn't, if a concurrent update moved a tuple
+ * elsewhere or replaced a tuple with a smaller one.
+ */
+ if (newsz <= origsz || PageGetExactFreeSpace(oldpage) >= (newsz - origsz))
+ {
+ if (BufferIsValid(newbuf))
+ UnlockReleaseBuffer(newbuf);
+
+ START_CRIT_SECTION();
+ PageIndexDeleteNoCompact(oldpage, &oldoff, 1);
+ if (PageAddItem(oldpage, (Item) newtup, newsz, oldoff, true, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add mmtuple");
+ MarkBufferDirty(oldbuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(idxrel))
+ {
+ BlockNumber blk = BufferGetBlockNumber(oldbuf);
+ xl_minmax_samepage_update xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+ uint8 info = XLOG_MINMAX_SAMEPAGE_UPDATE;
+
+ xlrec.node = idxrel->rd_node;
+ ItemPointerSetBlockNumber(&xlrec.tid, blk);
+ ItemPointerSetOffsetNumber(&xlrec.tid, oldoff);
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxSamepageUpdate;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = (char *) newtup;
+ rdata[1].len = newsz;
+ rdata[1].buffer = oldbuf;
+ rdata[1].buffer_std = true;
+ rdata[1].next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+
+ PageSetLSN(oldpage, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+ return true;
+ }
+ else if (newbuf == InvalidBuffer)
+ {
+ /*
+ * Not enough space, but caller said that there was. Tell them to
+ * start over
+ */
+ LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+ return false;
+ }
+ else
+ {
+ /*
+ * Not enough free space on the oldpage. Put the new tuple on the
+ * new page, and update the revmap.
+ */
+ Page newpage = BufferGetPage(newbuf);
+ Buffer revmapbuf;
+ ItemPointerData newtid;
+ OffsetNumber newoff;
+
+ revmapbuf = mmLockRevmapPageForUpdate(rmAccess, heapBlk);
+
+ START_CRIT_SECTION();
+
+ PageIndexDeleteNoCompact(oldpage, &oldoff, 1);
+ newoff = PageAddItem(newpage, (Item) newtup, newsz, InvalidOffsetNumber, false, false);
+ if (newoff == InvalidOffsetNumber)
+ elog(ERROR, "failed to add mmtuple to new page");
+ MarkBufferDirty(oldbuf);
+ MarkBufferDirty(newbuf);
+
+ ItemPointerSet(&newtid, BufferGetBlockNumber(newbuf), newoff);
+ mmSetHeapBlockItemptr(revmapbuf, pagesPerRange, heapBlk, newtid);
+ MarkBufferDirty(revmapbuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(idxrel))
+ {
+ xl_minmax_update xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[4];
+ uint8 info = XLOG_MINMAX_UPDATE;
+
+ xlrec.new.node = idxrel->rd_node;
+ ItemPointerSet(&xlrec.new.tid, BufferGetBlockNumber(newbuf), newoff);
+ xlrec.new.heapBlk = heapBlk;
+ xlrec.new.revmapBlk = BufferGetBlockNumber(revmapbuf);
+ xlrec.new.pagesPerRange = pagesPerRange;
+ ItemPointerSet(&xlrec.oldtid, BufferGetBlockNumber(oldbuf), oldoff);
+
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxUpdate;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = (char *) newtup;
+ rdata[1].len = newsz;
+ rdata[1].buffer = newbuf;
+ rdata[1].buffer_std = true;
+ rdata[1].next = &(rdata[2]);
+
+ rdata[2].data = (char *) NULL;
+ rdata[2].len = 0;
+ rdata[2].buffer = revmapbuf;
+ rdata[2].buffer_std = true;
+ rdata[2].next = &(rdata[3]);
+
+ rdata[3].data = (char *) NULL;
+ rdata[3].len = 0;
+ rdata[3].buffer = oldbuf;
+ rdata[3].buffer_std = true;
+ rdata[3].next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+
+ PageSetLSN(oldpage, recptr);
+ PageSetLSN(newpage, recptr);
+ PageSetLSN(BufferGetPage(revmapbuf), recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ LockBuffer(revmapbuf, BUFFER_LOCK_UNLOCK);
+ LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+ UnlockReleaseBuffer(newbuf);
+ return true;
+ }
+ }
+
+ /*
+ * Insert an index tuple into the index relation. The revmap is updated to
+ * mark the range containing the given page as pointing to the inserted entry.
+ * A WAL record is written.
+ *
+ * The buffer, if valid, is first checked for free space to insert the new
+ * entry; if there isn't enough, a new buffer is obtained and pinned.
+ *
+ * If the relation had to be extended to make room for the new index tuple,
+ * *extended is set to true.
+ */
+ static void
+ mm_doinsert(Relation idxrel, BlockNumber pagesPerRange,
+ mmRevmapAccess *rmAccess, Buffer *buffer,
+ BlockNumber heapBlk, MMTuple *tup, Size itemsz, bool *extended)
+ {
+ Page page;
+ BlockNumber blk;
+ OffsetNumber off;
+ Buffer revmapbuf;
+ ItemPointerData tid;
+
+ itemsz = MAXALIGN(itemsz);
+
+ if (BufferIsValid(*buffer))
+ {
+ page = BufferGetPage(*buffer);
+ LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+ if (PageGetFreeSpace(page) < itemsz)
+ {
+ UnlockReleaseBuffer(*buffer);
+ *buffer = InvalidBuffer;
+ }
+ }
+
+ /*
+ * Obtain a locked buffer to insert the new tuple. Note mm_getinsertbuffer
+ * ensures there's enough space in the returned buffer.
+ */
+ if (!BufferIsValid(*buffer))
+ {
+ *buffer = mm_getinsertbuffer(idxrel, InvalidBuffer, itemsz, extended);
+ page = BufferGetPage(*buffer);
+ Assert(PageGetFreeSpace(page) >= itemsz);
+ }
+
+ page = BufferGetPage(*buffer);
+ blk = BufferGetBlockNumber(*buffer);
+
+ /* lock the revmap for the update */
+ revmapbuf = mmLockRevmapPageForUpdate(rmAccess, heapBlk);
+
+ START_CRIT_SECTION();
+ off = PageAddItem(page, (Item) tup, itemsz, InvalidOffsetNumber,
+ false, false);
+ if (off == InvalidOffsetNumber)
+ elog(ERROR, "could not insert new index tuple to page");
+ MarkBufferDirty(*buffer);
+
+ MINMAX_elog(DEBUG2, "inserted tuple (%u,%u) for range starting at %u",
+ blk, off, heapBlk);
+
+ ItemPointerSet(&tid, blk, off);
+ mmSetHeapBlockItemptr(revmapbuf, pagesPerRange, heapBlk, tid);
+ MarkBufferDirty(revmapbuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(idxrel))
+ {
+ xl_minmax_insert xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+ uint8 info = XLOG_MINMAX_INSERT;
+
+ xlrec.node = idxrel->rd_node;
+ xlrec.heapBlk = heapBlk;
+ xlrec.pagesPerRange = pagesPerRange;
+ xlrec.revmapBlk = BufferGetBlockNumber(revmapbuf);
+ ItemPointerSet(&xlrec.tid, blk, off);
+
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxInsert;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].buffer_std = false;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = (char *) tup;
+ rdata[1].len = itemsz;
+ rdata[1].buffer = *buffer;
+ rdata[1].buffer_std = true;
+ rdata[1].next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+
+ PageSetLSN(page, recptr);
+ PageSetLSN(BufferGetPage(revmapbuf), recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /* Tuple is firmly on buffer; we can release our locks */
+ LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+ LockBuffer(revmapbuf, BUFFER_LOCK_UNLOCK);
+ }
+
+ /*
+ * Return a pinned and locked buffer which can be used to insert an index item
+ * of size itemsz. If oldbuf is a valid buffer, it is also locked (in an
+ * order determined to avoid deadlocks).
+ *
+ * If there's no existing page with enough free space to accommodate the new
+ * item, the relation is extended. If this happens, *extended is set to true.
+ */
+ static Buffer
+ mm_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
+ bool *was_extended)
+ {
+ BlockNumber oldblk;
+ BlockNumber newblk;
+ Page page;
+ int freespace;
+ bool extended = false;
+
+ if (BufferIsValid(oldbuf))
+ oldblk = BufferGetBlockNumber(oldbuf);
+ else
+ oldblk = InvalidBlockNumber;
+
+ /*
+ * Loop until we find a page with sufficient free space. By the time we
+ * return to the caller out of this loop, both buffers are valid and locked;
+ * if we have to restart here, neither buffer is locked and buf is not
+ * a pinned buffer.
+ */
+ newblk = GetPageWithFreeSpace(irel, itemsz);
+ for (;;)
+ {
+ Buffer buf;
+ bool extensionLockHeld = false;
+
+ if (newblk == InvalidBlockNumber)
+ {
+ /*
+ * There's not enough free space in any existing index page,
+ * according to the FSM: extend the relation to obtain a shiny
+ * new page.
+ */
+ if (!RELATION_IS_LOCAL(irel))
+ {
+ LockRelationForExtension(irel, ExclusiveLock);
+ extensionLockHeld = true;
+ }
+ buf = ReadBuffer(irel, P_NEW);
+ extended = true;
+
+ MINMAX_elog(DEBUG2, "mm_getinsertbuffer: extending to page %u",
+ BufferGetBlockNumber(buf));
+ }
+ else if (newblk == oldblk)
+ {
+ /*
+ * There's an odd corner-case here where the FSM is out-of-date,
+ * and gave us the old page.
+ */
+ buf = oldbuf;
+ }
+ else
+ {
+ buf = ReadBuffer(irel, newblk);
+ }
+
+ if (BufferIsValid(oldbuf) && newblk < oldblk)
+ LockBuffer(oldbuf, BUFFER_LOCK_EXCLUSIVE);
+
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ if (BufferIsValid(oldbuf) && newblk > oldblk)
+ LockBuffer(oldbuf, BUFFER_LOCK_EXCLUSIVE);
+
+ if (extensionLockHeld)
+ UnlockRelationForExtension(irel, ExclusiveLock);
+
+ page = BufferGetPage(buf);
+
+ if (extended)
+ mm_page_init(page, MINMAX_PAGETYPE_REGULAR);
+
+ /*
+ * We have a new buffer from FSM now, and both pages are locked.
+ * Check that the new page has enough free space, and return it if it
+ * does; otherwise start over. Note that we allow for the FSM to be
+ * out of date here, and in that case we update it and move on.
+ */
+ freespace = PageGetFreeSpace(page);
+
+ if (freespace >= itemsz)
+ {
+ if (extended)
+ *was_extended = true;
+ return buf;
+ }
+
+ /* This page is no good. */
+
+ /*
+ * If an entirely new page does not contain enough free space for
+ * the new item, then surely that item is oversized. Complain
+ * loudly; but first make sure we record the page as free, for
+ * next time.
+ */
+ if (extended)
+ {
+ RecordPageWithFreeSpace(irel, BufferGetBlockNumber(buf),
+ freespace);
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("index row size %lu exceeds maximum %lu for index \"%s\"",
+ (unsigned long) itemsz,
+ (unsigned long) freespace,
+ RelationGetRelationName(irel))));
+ return InvalidBuffer; /* keep compiler quiet */
+ }
+
+ if (newblk != oldblk)
+ UnlockReleaseBuffer(buf);
+ if (BufferIsValid(oldbuf))
+ LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+
+ newblk = RecordAndGetPageWithFreeSpace(irel, newblk, freespace, itemsz);
+ }
+ }
+
+ /*
+ * Given a deformed tuple in the build state, convert it into the on-disk
+ * format and insert it into the index, making the revmap point to it.
+ */
+ static void
+ form_and_insert_tuple(MMBuildState *mmstate)
+ {
+ MMTuple *tup;
+ Size size;
+
+ /* if we haven't seen any heap tuple yet, don't insert anything */
+ if (!mmstate->seentup)
+ return;
+
+ tup = minmax_form_tuple(mmstate->mmDesc, mmstate->currRangeStart,
+ mmstate->dtuple, &size);
+ mm_doinsert(mmstate->irel, mmstate->pagesPerRange, mmstate->rmAccess,
+ &mmstate->currentInsertBuf, mmstate->currRangeStart,
+ tup, size, &mmstate->extended);
+ mmstate->numtuples++;
+ pfree(tup);
+ }
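Before moving on to the revmap code, a note on the ADDVALUE contract that
mminsert() and mmbuildCallback() rely on above: the support procedure widens
the summary values to cover the incoming heap value and reports whether it
changed anything, which is how mminsert() knows the index tuple must be
rewritten. The following is a deliberately simplified, int4-only sketch of
that idea; the real procedure is the one in mmsortable.c and works through the
opclass' inequality operators.

#include <stdbool.h>
#include <stdint.h>

/* illustration only -- not the actual opclass support procedure */
static bool
demo_int4_add_value(int32_t *min, int32_t *max, bool *hasvalues,
                    int32_t newval)
{
    bool        updated = false;

    if (!*hasvalues)
    {
        /* first value seen for this page range: it is both min and max */
        *min = *max = newval;
        *hasvalues = true;
        return true;
    }
    if (newval < *min)
    {
        *min = newval;          /* new minimum for the range */
        updated = true;
    }
    if (newval > *max)
    {
        *max = newval;          /* new maximum for the range */
        updated = true;
    }
    return updated;
}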
*** /dev/null
--- b/src/backend/access/minmax/mmrevmap.c
***************
*** 0 ****
--- 1,732 ----
+ /*
+ * mmrevmap.c
+ * Reverse range map for MinMax indexes
+ *
+ * The reverse range map (revmap) is a translation structure for minmax
+ * indexes: for each page range, there is one most-up-to-date summary tuple,
+ * and its location is tracked by the revmap. Whenever a new tuple is inserted
+ * into a table that violates the previously recorded min/max values, a new
+ * tuple is inserted into the index and the revmap is updated to point to it.
+ *
+ * The pages of the revmap are interspersed in the index's main fork. The
+ * first revmap page is always the index's page number one (that is,
+ * immediately after the metapage). Subsequent revmap pages are allocated as
+ * they are needed; their locations are tracked by "array pages". The metapage
+ * contains a large BlockNumber array, whose entries point to array pages. Thus,
+ * to find the second revmap page, we read the metapage and obtain the block
+ * number of the first array page; we then read that page, and the first
+ * element in it is the revmap page we're looking for.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmrevmap.c
+ */
+ #include "postgres.h"
+
+ #include "access/heapam_xlog.h"
+ #include "access/minmax.h"
+ #include "access/minmax_internal.h"
+ #include "access/minmax_page.h"
+ #include "access/minmax_revmap.h"
+ #include "access/minmax_xlog.h"
+ #include "access/rmgr.h"
+ #include "miscadmin.h"
+ #include "storage/bufmgr.h"
+ #include "storage/lmgr.h"
+ #include "storage/relfilenode.h"
+ #include "storage/smgr.h"
+ #include "utils/memutils.h"
+
+
+ /*
+ * In regular revmap pages, each item stores an ItemPointerData. These defines
+ * let one find the logical revmap page number and index number of the revmap
+ * item for the given heap block number.
+ */
+ #define HEAPBLK_TO_REVMAP_BLK(pagesPerRange, heapBlk) \
+ ((heapBlk / pagesPerRange) / REGULAR_REVMAP_PAGE_MAXITEMS)
+ #define HEAPBLK_TO_REVMAP_INDEX(pagesPerRange, heapBlk) \
+ ((heapBlk / pagesPerRange) % REGULAR_REVMAP_PAGE_MAXITEMS)
+
+ /*
+ * In array revmap pages, each item stores a BlockNumber. These defines let
+ * one find the page and index number of a given revmap block number. Note
+ * that the first revmap page (revmap logical page number 0) is always stored
+ * in physical block number 1, so array pages do not store that one.
+ */
+ #define MAPBLK_TO_RMARRAY_BLK(rmBlk) ((rmBlk - 1) / ARRAY_REVMAP_PAGE_MAXITEMS)
+ #define MAPBLK_TO_RMARRAY_INDEX(rmBlk) ((rmBlk - 1) % ARRAY_REVMAP_PAGE_MAXITEMS)
+
+
+ struct mmRevmapAccess
+ {
+ Relation idxrel;
+ BlockNumber pagesPerRange;
+ Buffer metaBuf;
+ Buffer currBuf;
+ Buffer currArrayBuf;
+ BlockNumber *revmapArrayPages;
+ };
+ /* typedef appears in minmax_revmap.h */
+
+
+ static Buffer mm_getnewbuffer(Relation irel);
+
+ /*
+ * Initialize an access object for a reverse range map, which can be used to
+ * read from it. This must be freed by mmRevmapAccessTerminate when the caller
+ * is done with it.
+ */
+ mmRevmapAccess *
+ mmRevmapAccessInit(Relation idxrel, BlockNumber *pagesPerRange)
+ {
+ mmRevmapAccess *rmAccess;
+ Buffer meta;
+ MinmaxMetaPageData *metadata;
+
+ meta = ReadBuffer(idxrel, MINMAX_METAPAGE_BLKNO);
+ metadata = (MinmaxMetaPageData *) PageGetContents(BufferGetPage(meta));
+
+ rmAccess = palloc(sizeof(mmRevmapAccess));
+ rmAccess->metaBuf = meta;
+ rmAccess->idxrel = idxrel;
+ rmAccess->pagesPerRange = metadata->pagesPerRange;
+ rmAccess->currBuf = InvalidBuffer;
+ rmAccess->currArrayBuf = InvalidBuffer;
+ rmAccess->revmapArrayPages = NULL;
+
+ if (pagesPerRange)
+ *pagesPerRange = metadata->pagesPerRange;
+
+ return rmAccess;
+ }
+
+ /*
+ * Release resources associated with a revmap access object.
+ */
+ void
+ mmRevmapAccessTerminate(mmRevmapAccess *rmAccess)
+ {
+ if (rmAccess->revmapArrayPages != NULL)
+ pfree(rmAccess->revmapArrayPages);
+ if (rmAccess->metaBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->metaBuf);
+ if (rmAccess->currBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currBuf);
+ if (rmAccess->currArrayBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currArrayBuf);
+ pfree(rmAccess);
+ }
+
+ /*
+ * Lock the metapage as specified by caller, and update the given rmAccess with
+ * the metapage data. The metapage buffer is locked when this function
+ * returns; it's the caller's responsibility to unlock it.
+ */
+ static void
+ rmaccess_get_metapage(mmRevmapAccess *rmAccess, int lockmode)
+ {
+ MinmaxMetaPageData *metadata;
+ MinmaxSpecialSpace *special PG_USED_FOR_ASSERTS_ONLY;
+ Page metapage;
+
+ LockBuffer(rmAccess->metaBuf, lockmode);
+ metapage = BufferGetPage(rmAccess->metaBuf);
+
+ #ifdef USE_ASSERT_CHECKING
+ /* ensure we really got the metapage */
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(metapage);
+ Assert(special->type == MINMAX_PAGETYPE_META);
+ #endif
+
+ /* first time through? allocate the array */
+ if (rmAccess->revmapArrayPages == NULL)
+ rmAccess->revmapArrayPages =
+ palloc(sizeof(BlockNumber) * MAX_REVMAP_ARRAYPAGES);
+
+ metadata = (MinmaxMetaPageData *) PageGetContents(metapage);
+ memcpy(rmAccess->revmapArrayPages, metadata->revmapArrayPages,
+ sizeof(BlockNumber) * MAX_REVMAP_ARRAYPAGES);
+ }
+
+ /*
+ * Update the metapage, so that item arrayBlkIdx in the array of revmap array
+ * pages points to block number newPgBlkno.
+ */
+ static void
+ update_minmax_metapg(Relation idxrel, Buffer meta, uint32 arrayBlkIdx,
+ BlockNumber newPgBlkno)
+ {
+ MinmaxMetaPageData *metadata;
+
+ metadata = (MinmaxMetaPageData *) PageGetContents(BufferGetPage(meta));
+
+ START_CRIT_SECTION();
+ metadata->revmapArrayPages[arrayBlkIdx] = newPgBlkno;
+ MarkBufferDirty(meta);
+ if (RelationNeedsWAL(idxrel))
+ {
+ xl_minmax_metapg_set xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata;
+
+ xlrec.node = idxrel->rd_node;
+ xlrec.blkidx = arrayBlkIdx;
+ xlrec.newpg = newPgBlkno;
+
+ rdata.data = (char *) &xlrec;
+ rdata.len = SizeOfMinmaxMetapgSet;
+ rdata.buffer = InvalidBuffer;
+ rdata.buffer_std = false;
+ rdata.next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_METAPG_SET, &rdata);
+ PageSetLSN(BufferGetPage(meta), recptr);
+ }
+ END_CRIT_SECTION();
+ }
+
+ /*
+ * Given a logical revmap block number, find its physical block number.
+ *
+ * Note this might involve up to two buffer reads, including a possible
+ * update to the metapage.
+ *
+ * If extend is set to true, and the page hasn't been set yet, extend the
+ * array to point to a newly allocated page.
+ */
+ static BlockNumber
+ rm_get_phys_blkno(mmRevmapAccess *rmAccess, BlockNumber mapBlk, bool extend)
+ {
+ int arrayBlkIdx;
+ BlockNumber arrayBlk;
+ RevmapArrayContents *contents;
+ int revmapIdx;
+ BlockNumber targetblk;
+
+ /* the first revmap page is always block number 1 */
+ if (mapBlk == 0)
+ return (BlockNumber) 1;
+
+ /*
+ * For all other cases, take the long route of checking the metapage and
+ * revmap array pages.
+ */
+
+ /*
+ * Copy the revmap array from the metapage into private storage, if not
+ * done already in this scan.
+ */
+ if (rmAccess->revmapArrayPages == NULL)
+ {
+ rmaccess_get_metapage(rmAccess, BUFFER_LOCK_SHARE);
+ LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_UNLOCK);
+ }
+
+ /*
+ * Consult the metapage array; if the array page we need is not set there,
+ * we need to extend the index to allocate the array page, and update the
+ * metapage array.
+ */
+ arrayBlkIdx = MAPBLK_TO_RMARRAY_BLK(mapBlk);
+ if (arrayBlkIdx >= MAX_REVMAP_ARRAYPAGES)
+ elog(ERROR, "non-existent revmap array page requested");
+
+ arrayBlk = rmAccess->revmapArrayPages[arrayBlkIdx];
+ if (arrayBlk == InvalidBlockNumber)
+ {
+ /* if not asked to extend, there's no further work to do here */
+ if (!extend)
+ return InvalidBlockNumber;
+
+ /*
+ * If we need to create a new array page, check the metapage again;
+ * someone might have created it after the last time we read the
+ * metapage. This time we acquire an exclusive lock, since we may need
+ * to extend. Lock before doing the physical relation extension, to
+ * avoid leaving an unused page around in case someone does this
+ * concurrently. Note that, unfortunately, we will be keeping the lock
+ * on the metapage alongside the relation extension lock, while doing a
+ * syscall involving disk I/O. Extending to add a new revmap array page
+ * is fairly infrequent, so it shouldn't be too bad.
+ *
+ * XXX it is possible to extend the relation unconditionally before
+ * locking the metapage, and later if we find that someone else had
+ * already added this page, save the page in FSM as MaxFSMRequestSize.
+ * That would be better for concurrency. Explore someday.
+ */
+ rmaccess_get_metapage(rmAccess, BUFFER_LOCK_EXCLUSIVE);
+
+ if (rmAccess->revmapArrayPages[arrayBlkIdx] == InvalidBlockNumber)
+ {
+ BlockNumber newPgBlkno;
+
+ /*
+ * Ok, definitely need to allocate a new revmap array page;
+ * initialize a new page to the initial (empty) array revmap state
+ * and register it in metapage.
+ */
+ rmAccess->currArrayBuf = mm_getnewbuffer(rmAccess->idxrel);
+ START_CRIT_SECTION();
+ initialize_rma_page(rmAccess->currArrayBuf);
+ MarkBufferDirty(rmAccess->currArrayBuf);
+ if (RelationNeedsWAL(rmAccess->idxrel))
+ {
+ xl_minmax_init_rmpg xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata;
+
+ xlrec.node = rmAccess->idxrel->rd_node;
+ xlrec.blkno = BufferGetBlockNumber(rmAccess->currArrayBuf);
+ xlrec.array = true;
+ xlrec.logblk = InvalidBlockNumber;
+
+ rdata.data = (char *) &xlrec;
+ rdata.len = SizeOfMinmaxInitRmpg;
+ rdata.buffer = InvalidBuffer; /* FIXME */
+ rdata.buffer_std = false;
+ rdata.next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_INIT_RMPG, &rdata);
+ PageSetLSN(BufferGetPage(rmAccess->currArrayBuf), recptr);
+ }
+ END_CRIT_SECTION();
+ LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_UNLOCK);
+ newPgBlkno = BufferGetBlockNumber(rmAccess->currArrayBuf);
+ rmAccess->revmapArrayPages[arrayBlkIdx] = newPgBlkno;
+
+ MINMAX_elog(DEBUG2, "allocated block for revmap array page: %u",
+ BufferGetBlockNumber(rmAccess->currArrayBuf));
+
+ /* Update the metapage to point to the new array page. */
+ update_minmax_metapg(rmAccess->idxrel, rmAccess->metaBuf, arrayBlkIdx,
+ newPgBlkno);
+ }
+
+ LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_UNLOCK);
+ arrayBlk = rmAccess->revmapArrayPages[arrayBlkIdx];
+ }
+
+ /*
+ * By here, we know the array page is set in the metapage array. Read that
+ * page; except that if we just allocated it, or we already hold pin on it,
+ * we don't need to read it again.
+ */
+ Assert(arrayBlk != InvalidBlockNumber);
+
+ if (rmAccess->currArrayBuf == InvalidBuffer ||
+ BufferGetBlockNumber(rmAccess->currArrayBuf) != arrayBlk)
+ {
+ if (rmAccess->currArrayBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currArrayBuf);
+
+ rmAccess->currArrayBuf =
+ ReadBuffer(rmAccess->idxrel, arrayBlk);
+ }
+
+ LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_SHARE);
+
+ /*
+ * And now we can inspect its contents; if the target page is set, we can
+ * just return. Even if not set, we can also return if caller asked us not
+ * to extend the revmap.
+ */
+ contents = (RevmapArrayContents *)
+ PageGetContents(BufferGetPage(rmAccess->currArrayBuf));
+ revmapIdx = MAPBLK_TO_RMARRAY_INDEX(mapBlk);
+ if (!extend || revmapIdx <= contents->rma_nblocks - 1)
+ {
+ LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_UNLOCK);
+
+ return contents->rma_blocks[revmapIdx];
+ }
+
+ /*
+ * Trade our shared lock in the array page for exclusive, because we now
+ * need to allocate one more revmap page and modify the array page.
+ */
+ LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_UNLOCK);
+ LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_EXCLUSIVE);
+
+ contents = (RevmapArrayContents *)
+ PageGetContents(BufferGetPage(rmAccess->currArrayBuf));
+
+ /*
+ * If someone else already set the value while we were waiting for the
+ * exclusive lock, we're done; otherwise, allocate a new block as the
+ * new revmap page, and update the array page to point to it.
+ */
+ if (contents->rma_blocks[revmapIdx] != InvalidBlockNumber)
+ {
+ targetblk = contents->rma_blocks[revmapIdx];
+ }
+ else
+ {
+ Buffer newbuf;
+
+ /* not possible to get here if we weren't asked to extend */
+ Assert(extend);
+ newbuf = mm_getnewbuffer(rmAccess->idxrel);
+ START_CRIT_SECTION();
+ targetblk = initialize_rmr_page(newbuf, mapBlk);
+ MarkBufferDirty(newbuf);
+ if (RelationNeedsWAL(rmAccess->idxrel))
+ {
+ xl_minmax_init_rmpg xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata;
+
+ xlrec.node = rmAccess->idxrel->rd_node;
+ xlrec.blkno = BufferGetBlockNumber(newbuf);
+ xlrec.array = false;
+ xlrec.logblk = mapBlk;
+
+ rdata.data = (char *) &xlrec;
+ rdata.len = SizeOfMinmaxInitRmpg;
+ rdata.buffer = InvalidBuffer;
+ rdata.buffer_std = false;
+ rdata.next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_INIT_RMPG, &rdata);
+ PageSetLSN(BufferGetPage(newbuf), recptr);
+ }
+ END_CRIT_SECTION();
+
+ UnlockReleaseBuffer(newbuf);
+
+ /*
+ * Now make the revmap array page point to the newly allocated page.
+ * If necessary, also update the total number of items in it.
+ */
+ START_CRIT_SECTION();
+
+ contents->rma_blocks[revmapIdx] = targetblk;
+ if (contents->rma_nblocks < revmapIdx + 1)
+ contents->rma_nblocks = revmapIdx + 1;
+ MarkBufferDirty(rmAccess->currArrayBuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rmAccess->idxrel))
+ {
+ xl_minmax_rmarray_set xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+ uint8 info;
+
+ info = XLOG_MINMAX_RMARRAY_SET;
+
+ xlrec.node = rmAccess->idxrel->rd_node;
+ xlrec.rmarray = BufferGetBlockNumber(rmAccess->currArrayBuf);
+ xlrec.blkidx = revmapIdx;
+ xlrec.newpg = targetblk;
+
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxRmarraySet;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].buffer_std = false;
+ rdata[0].next = &rdata[1];
+
+ rdata[1].data = NULL;
+ rdata[1].len = 0;
+ rdata[1].buffer = rmAccess->currArrayBuf;
+ rdata[1].buffer_std = false;
+ rdata[1].next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+ PageSetLSN(BufferGetPage(rmAccess->currArrayBuf), recptr);
+ }
+
+ END_CRIT_SECTION();
+ }
+
+ LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_UNLOCK);
+
+ return targetblk;
+ }
+
+ /*
+ * Prepare for updating an entry in the revmap.
+ *
+ * The map is extended, if necessary.
+ */
+ Buffer
+ mmLockRevmapPageForUpdate(mmRevmapAccess *rmAccess, BlockNumber heapBlk)
+ {
+ BlockNumber mapBlk;
+
+ mapBlk = HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk);
+
+ /* Translate the map block number to physical location */
+ mapBlk = rm_get_phys_blkno(rmAccess, mapBlk, true);
+
+ MINMAX_elog(DEBUG2, "locking revmap page for logical page %lu (physical %u) for heap %u",
+ HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk),
+ mapBlk, heapBlk);
+
+ /*
+ * Obtain the buffer from which we need to read. If we already have the
+ * correct buffer in our access struct, use that; otherwise, release that,
+ * (if valid) and read the one we need.
+ */
+ if (rmAccess->currBuf == InvalidBuffer ||
+ mapBlk != BufferGetBlockNumber(rmAccess->currBuf))
+ {
+ if (rmAccess->currBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currBuf);
+
+ Assert(mapBlk != InvalidBlockNumber);
+ rmAccess->currBuf = ReadBuffer(rmAccess->idxrel, mapBlk);
+ }
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_EXCLUSIVE);
+
+ return rmAccess->currBuf;
+ }
+
+ /*
+ * In the given revmap buffer (locked appropriately by caller), which is used
+ * in a minmax index of pagesPerRange pages per range, set the element
+ * corresponding to heap block number heapBlk to the given TID.
+ *
+ * Once the operation is complete, the caller must update the LSN on the
+ * given buffer.
+ *
+ * This is used both in regular operation and during WAL replay.
+ */
+ void
+ mmSetHeapBlockItemptr(Buffer buf, BlockNumber pagesPerRange, BlockNumber heapBlk,
+ ItemPointerData tid)
+ {
+ RevmapContents *contents;
+ ItemPointerData *iptr;
+ Page page;
+
+ /* The correct page should already be pinned and locked */
+ page = BufferGetPage(buf);
+ contents = (RevmapContents *) PageGetContents(page);
+ iptr = (ItemPointerData *) contents->rmr_tids;
+ iptr += HEAPBLK_TO_REVMAP_INDEX(pagesPerRange, heapBlk);
+
+ ItemPointerSet(iptr,
+ ItemPointerGetBlockNumber(&tid),
+ ItemPointerGetOffsetNumber(&tid));
+ }
+
+ /*
+ * Fetch the MMTuple for a given heap block.
+ *
+ * The buffer containing the tuple is locked, and returned in *buf. As an
+ * optimization, the caller can pass a pinned buffer *buf on entry, which will
+ * avoid a pin-unpin cycle when the next tuple is on the same page as previous
+ * one.
+ *
+ * If no tuple is found for the given heap range, returns NULL. In that case,
+ * *buf might still be updated, but it's not locked.
+ *
+ * The output tuple offset within the buffer is returned in *off.
+ */
+ MMTuple *
+ mmGetMMTupleForHeapBlock(mmRevmapAccess *rmAccess, BlockNumber heapBlk,
+ Buffer *buf, OffsetNumber *off, int mode)
+ {
+ Relation idxRel = rmAccess->idxrel;
+ BlockNumber mapBlk;
+ RevmapContents *contents;
+ ItemPointerData *iptr;
+ BlockNumber blk;
+ Page page;
+ ItemId lp;
+ MMTuple *mmtup;
+ ItemPointerData previptr;
+
+ /* normalize the heap block number to be the first page in the range */
+ heapBlk = (heapBlk / rmAccess->pagesPerRange) * rmAccess->pagesPerRange;
+
+ mapBlk = HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk);
+ /* Translate the map block number to physical location */
+ mapBlk = rm_get_phys_blkno(rmAccess, mapBlk, false);
+ if (mapBlk == InvalidBlockNumber)
+ {
+ *off = InvalidOffsetNumber;
+ return NULL;
+ }
+
+ ItemPointerSetInvalid(&previptr);
+ for (;;)
+ {
+ if (rmAccess->currBuf == InvalidBuffer ||
+ BufferGetBlockNumber(rmAccess->currBuf) != mapBlk)
+ {
+ if (rmAccess->currBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currBuf);
+
+ Assert(mapBlk != InvalidBlockNumber);
+ rmAccess->currBuf = ReadBuffer(rmAccess->idxrel, mapBlk);
+ }
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_SHARE);
+
+ contents = (RevmapContents *)
+ PageGetContents(BufferGetPage(rmAccess->currBuf));
+ iptr = contents->rmr_tids;
+ iptr += HEAPBLK_TO_REVMAP_INDEX(rmAccess->pagesPerRange, heapBlk);
+
+ if (!ItemPointerIsValid(iptr))
+ {
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_UNLOCK);
+ return NULL;
+ }
+
+ /*
+ * Save the current TID we got from the revmap; if we loop we can
+ * sanity-check that the new one is different. Otherwise we might
+ * be stuck looping forever if the revmap is somehow badly broken.
+ */
+ if (ItemPointerIsValid(&previptr) && ItemPointerEquals(&previptr, iptr))
+ ereport(ERROR,
+ /* FIXME improve message */
+ (errmsg("revmap was updated but still contains same TID as before")));
+ previptr = *iptr;
+
+ blk = ItemPointerGetBlockNumber(iptr);
+ *off = ItemPointerGetOffsetNumber(iptr);
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_UNLOCK);
+
+ /* Ok, got a pointer to where the MMTuple should be. Fetch it. */
+ if (!BufferIsValid(*buf) || BufferGetBlockNumber(*buf) != blk)
+ {
+ if (BufferIsValid(*buf))
+ ReleaseBuffer(*buf);
+ *buf = ReadBuffer(idxRel, blk);
+ }
+ LockBuffer(*buf, mode);
+ page = BufferGetPage(*buf);
+ lp = PageGetItemId(page, *off);
+ if (ItemIdIsUsed(lp))
+ {
+ mmtup = (MMTuple *) PageGetItem(page, lp);
+
+ if (mmtup->mt_blkno == heapBlk)
+ {
+ /* found it! */
+ return mmtup;
+ }
+ }
+ /*
+ * No luck. Assume that the revmap was updated concurrently.
+ *
+ * XXX: it would be nice to add some kind of a sanity check here to
+ * avoid looping infinitely, if the revmap points to wrong tuple for
+ * some reason.
+ */
+ LockBuffer(*buf, BUFFER_LOCK_UNLOCK);
+ }
+ /* not reached, but keep compiler quiet */
+ return NULL;
+ }
+
+ /*
+ * Initialize the revmap of a new minmax index.
+ *
+ * NB -- caller is assumed to WAL-log this operation
+ */
+ void
+ mmRevmapCreate(Relation idxrel)
+ {
+ Buffer buf;
+
+ /*
+ * The first page of the revmap is always stored in block number 1 of the
+ * main fork. Because of this, the only thing we need to do is request
+ * a new page; we assume we are called immediately after the metapage has
+ * been initialized.
+ */
+ buf = mm_getnewbuffer(idxrel);
+ Assert(BufferGetBlockNumber(buf) == 1);
+
+ mm_page_init(BufferGetPage(buf), MINMAX_PAGETYPE_REVMAP);
+ MarkBufferDirty(buf);
+
+ UnlockReleaseBuffer(buf);
+ }
+
+ /*
+ * Initialize a new regular revmap page, which stores the given revmap logical
+ * page number. The newly allocated physical block number is returned.
+ *
+ * Used both by the regular code path and during xlog replay.
+ */
+ BlockNumber
+ initialize_rmr_page(Buffer newbuf, BlockNumber mapBlk)
+ {
+ BlockNumber blkno;
+ Page page;
+ RevmapContents *contents;
+
+ page = BufferGetPage(newbuf);
+
+ mm_page_init(page, MINMAX_PAGETYPE_REVMAP);
+ contents = (RevmapContents *) PageGetContents(page);
+ contents->rmr_logblk = mapBlk;
+ /* the rmr_tids array is initialized to all invalid by PageInit */
+
+ blkno = BufferGetBlockNumber(newbuf);
+
+ return blkno;
+ }
+
+ /*
+ * Given a buffer (hopefully containing a blank page), set it up as a revmap
+ * array page.
+ *
+ * Used both by regular code path as well as during xlog replay.
+ */
+ void
+ initialize_rma_page(Buffer buf)
+ {
+ Page arrayPg;
+ RevmapArrayContents *contents;
+
+ arrayPg = BufferGetPage(buf);
+ mm_page_init(arrayPg, MINMAX_PAGETYPE_REVMAP_ARRAY);
+ contents = (RevmapArrayContents *) PageGetContents(arrayPg);
+ contents->rma_nblocks = 0;
+ /* set the whole array to InvalidBlockNumber */
+ memset(contents->rma_blocks, 0xFF,
+ sizeof(BlockNumber) * ARRAY_REVMAP_PAGE_MAXITEMS);
+ }
+
+ /*
+ * Return an exclusively-locked buffer resulting from extending the relation.
+ */
+ static Buffer
+ mm_getnewbuffer(Relation irel)
+ {
+ Buffer buffer;
+ bool needLock = !RELATION_IS_LOCAL(irel);
+
+ /*
+ * XXX As a possible improvement, we could request a blank page from the FSM
+ * here. Such pages could get inserted into the FSM if, for instance, two
+ * processes extend the relation concurrently to add one more page to the
+ * revmap and the second one discovers it doesn't actually need the page it
+ * got.
+ */
+
+ if (needLock)
+ LockRelationForExtension(irel, ExclusiveLock);
+
+ buffer = ReadBuffer(irel, P_NEW);
+ LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+ MINMAX_elog(DEBUG2, "mm_getnewbuffer: extending to page %u",
+ BufferGetBlockNumber(buffer));
+
+ if (needLock)
+ UnlockRelationForExtension(irel, ExclusiveLock);
+
+ return buffer;
+ }
*** /dev/null
--- b/src/backend/access/minmax/mmsortable.c
***************
*** 0 ****
--- 1,287 ----
+ /*
+ * mmsortable.c
+ * Implementation of Minmax indexes for sortable datatypes
+ * (that is, anything with a btree opclass)
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmsortable.c
+ */
+ #include "postgres.h"
+
+ #include "access/genam.h"
+ #include "access/minmax_internal.h"
+ #include "access/minmax_tuple.h"
+ #include "access/skey.h"
+ #include "catalog/pg_type.h"
+ #include "utils/datum.h"
+ #include "utils/lsyscache.h"
+ #include "utils/syscache.h"
+
+
+ /*
+ * Procedure numbers must not collide with MINMAX_PROCNUM defines in
+ * minmax_internal.h. Note we only need inequality functions.
+ */
+ #define SORTABLE_NUM_PROCNUMS 4 /* # support procs we need */
+ #define PROCNUM_LESS 4
+ #define PROCNUM_LESSEQUAL 5
+ #define PROCNUM_GREATEREQUAL 6
+ #define PROCNUM_GREATER 7
+
+ /* subtract this from procnum to obtain index in SortableOpaque arrays */
+ #define PROCNUM_BASE 4
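+ /* e.g. PROCNUM_LESS (4) maps to array slot 0, PROCNUM_GREATER (7) to slot 3 */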
+
+ static FmgrInfo *mmsrt_get_procinfo(MinmaxDesc *mmdesc, uint16 attno,
+ uint16 procnum);
+
+ PG_FUNCTION_INFO_V1(mmSortableAddValue);
+ PG_FUNCTION_INFO_V1(mmSortableConsistent);
+
+
+ typedef struct SortableOpaque
+ {
+ FmgrInfo operators[SORTABLE_NUM_PROCNUMS];
+ bool inited[SORTABLE_NUM_PROCNUMS];
+ } SortableOpaque;
+
+ #define OPCINFO(typname, typoid) \
+ PG_FUNCTION_INFO_V1(mmSortableOpcInfo_##typname); \
+ Datum \
+ mmSortableOpcInfo_##typname(PG_FUNCTION_ARGS) \
+ { \
+ SortableOpaque *opaque; \
+ MinmaxOpcInfo *result; \
+ \
+ opaque = palloc0(sizeof(SortableOpaque)); \
+ /* \
+ * 'operators' is initialized lazily, as indicated by 'inited' which was \
+ * initialized to all false by palloc0. \
+ */ \
+ \
+ result = palloc(SizeofMinmaxOpcInfo(2)); /* min, max */ \
+ result->oi_nstored = 2; \
+ result->oi_opaque = opaque; \
+ result->oi_typids[0] = typoid; \
+ result->oi_typids[1] = typoid; \
+ \
+ PG_RETURN_POINTER(result); \
+ }
+
+ OPCINFO(int4, INT4OID)
+ OPCINFO(numeric, NUMERICOID)
+ OPCINFO(text, TEXTOID)
+ OPCINFO(time, TIMEOID)
+ OPCINFO(timetz, TIMETZOID)
+ OPCINFO(timestamp, TIMESTAMPOID)
+ OPCINFO(timestamptz, TIMESTAMPTZOID)
+ OPCINFO(date, DATEOID)
+ OPCINFO(char, CHAROID)
+
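+ /*
+ * For illustration, the OPCINFO(int4, INT4OID) invocation above expands to a
+ * function mmSortableOpcInfo_int4 returning a MinmaxOpcInfo with
+ * oi_nstored = 2 (one slot for the minimum, one for the maximum) and both
+ * oi_typids entries set to INT4OID.
+ */
+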
+ /*
+ * Examine the given index tuple (which contains partial status of a certain
+ * page range) by comparing it to the given value that comes from another heap
+ * tuple. If the new value is outside the domain specified by the existing
+ * tuple values, update the index tuple and return true. Otherwise, return
+ * false and leave the tuple unmodified.
+ */
+ Datum
+ mmSortableAddValue(PG_FUNCTION_ARGS)
+ {
+ MinmaxDesc *mmdesc = (MinmaxDesc *) PG_GETARG_POINTER(0);
+ DeformedMMTuple *dtuple = (DeformedMMTuple *) PG_GETARG_POINTER(1);
+ AttrNumber attno = PG_GETARG_UINT16(2);
+ Datum newval = PG_GETARG_DATUM(3);
+ bool isnull = PG_GETARG_BOOL(4);
+ Oid colloid = PG_GET_COLLATION();
+ FmgrInfo *cmpFn;
+ Datum compar;
+ bool updated = false;
+
+ /*
+ * If the new value is null, record that fact if it's the first null we
+ * have seen; otherwise, there's nothing to do.
+ */
+ if (isnull)
+ {
+ if (dtuple->dt_columns[attno - 1].hasnulls)
+ PG_RETURN_BOOL(false);
+
+ dtuple->dt_columns[attno - 1].hasnulls = true;
+ PG_RETURN_BOOL(true);
+ }
+
+ /*
+ * If everything we have stored so far is null, store the new value (which
+ * we know to be not null) as both minimum and maximum, and we're done.
+ */
+ if (dtuple->dt_columns[attno - 1].allnulls)
+ {
+ dtuple->dt_columns[attno - 1].values[0] =
+ datumCopy(newval, mmdesc->md_tupdesc->attrs[attno - 1]->attbyval,
+ mmdesc->md_tupdesc->attrs[attno - 1]->attlen);
+ dtuple->dt_columns[attno - 1].values[1] =
+ datumCopy(newval, mmdesc->md_tupdesc->attrs[attno - 1]->attbyval,
+ mmdesc->md_tupdesc->attrs[attno - 1]->attlen);
+ dtuple->dt_columns[attno - 1].allnulls = false;
+ PG_RETURN_BOOL(true);
+ }
+
+ /*
+ * Otherwise, need to compare the new value with the existing boundaries
+ * and update them accordingly. First check if it's less than the existing
+ * minimum.
+ */
+ cmpFn = mmsrt_get_procinfo(mmdesc, attno, PROCNUM_LESS);
+ compar = FunctionCall2Coll(cmpFn, colloid, newval,
+ dtuple->dt_columns[attno - 1].values[0]);
+ if (DatumGetBool(compar))
+ {
+ dtuple->dt_columns[attno - 1].values[0] =
+ datumCopy(newval, mmdesc->md_tupdesc->attrs[attno - 1]->attbyval,
+ mmdesc->md_tupdesc->attrs[attno - 1]->attlen);
+ updated = true;
+ }
+
+ /*
+ * And now compare it to the existing maximum.
+ */
+ cmpFn = mmsrt_get_procinfo(mmdesc, attno, PROCNUM_GREATER);
+ compar = FunctionCall2Coll(cmpFn, colloid, newval,
+ dtuple->dt_columns[attno - 1].values[1]);
+ if (DatumGetBool(compar))
+ {
+ dtuple->dt_columns[attno - 1].values[1] =
+ datumCopy(newval, mmdesc->md_tupdesc->attrs[attno - 1]->attbyval,
+ mmdesc->md_tupdesc->attrs[attno - 1]->attlen);
+ updated = true;
+ }
+
+ PG_RETURN_BOOL(updated);
+ }
+
+ /*
+ * Given an index tuple corresponding to a certain page range and a scan key,
+ * return whether the scan key is consistent with the index tuple. Return true
+ * if so, false otherwise.
+ */
+ Datum
+ mmSortableConsistent(PG_FUNCTION_ARGS)
+ {
+ MinmaxDesc *mmdesc = (MinmaxDesc *) PG_GETARG_POINTER(0);
+ DeformedMMTuple *dtup = (DeformedMMTuple *) PG_GETARG_POINTER(1);
+ ScanKey key = (ScanKey) PG_GETARG_POINTER(2);
+ Oid colloid = PG_GET_COLLATION();
+ AttrNumber attno = key->sk_attno;
+ Datum value;
+ Datum matches;
+
+ /* handle IS NULL/IS NOT NULL tests */
+ if (key->sk_flags & SK_ISNULL)
+ {
+ if (key->sk_flags & SK_SEARCHNULL)
+ {
+ if (dtup->dt_columns[attno - 1].allnulls ||
+ dtup->dt_columns[attno - 1].hasnulls)
+ PG_RETURN_BOOL(true);
+ PG_RETURN_BOOL(false);
+ }
+
+ /*
+ * For IS NOT NULL we can only exclude blocks if all values are nulls.
+ */
+ Assert(key->sk_flags & SK_SEARCHNOTNULL);
+ if (dtup->dt_columns[attno - 1].allnulls)
+ PG_RETURN_BOOL(false);
+ PG_RETURN_BOOL(true);
+ }
+
+ value = key->sk_argument;
+ switch (key->sk_strategy)
+ {
+ case BTLessStrategyNumber:
+ matches = FunctionCall2Coll(mmsrt_get_procinfo(mmdesc, attno,
+ PROCNUM_LESS),
+ colloid,
+ dtup->dt_columns[attno - 1].values[0],
+ value);
+ break;
+ case BTLessEqualStrategyNumber:
+ matches = FunctionCall2Coll(mmsrt_get_procinfo(mmdesc, attno,
+ PROCNUM_LESSEQUAL),
+ colloid,
+ dtup->dt_columns[attno - 1].values[0],
+ value);
+ break;
+ case BTEqualStrategyNumber:
+
+ /*
+ * In the equality case (WHERE col = someval), we want to return
+ * the current page range if the minimum value in the range <= scan
+ * key, and the maximum value >= scan key.
+ */
+ matches = FunctionCall2Coll(mmsrt_get_procinfo(mmdesc, attno,
+ PROCNUM_LESSEQUAL),
+ colloid,
+ dtup->dt_columns[attno - 1].values[0],
+ value);
+ if (!DatumGetBool(matches))
+ break;
+ /* max() >= scankey */
+ matches = FunctionCall2Coll(mmsrt_get_procinfo(mmdesc, attno,
+ PROCNUM_GREATEREQUAL),
+ colloid,
+ dtup->dt_columns[attno - 1].values[1],
+ value);
+ break;
+ case BTGreaterEqualStrategyNumber:
+ matches = FunctionCall2Coll(mmsrt_get_procinfo(mmdesc, attno,
+ PROCNUM_GREATEREQUAL),
+ colloid,
+ dtup->dt_columns[attno - 1].values[1],
+ value);
+ break;
+ case BTGreaterStrategyNumber:
+ matches = FunctionCall2Coll(mmsrt_get_procinfo(mmdesc, attno,
+ PROCNUM_GREATER),
+ colloid,
+ dtup->dt_columns[attno - 1].values[1],
+ value);
+ break;
+ default:
+ /* shouldn't happen */
+ elog(ERROR, "invalid strategy number %d", key->sk_strategy);
+ matches = 0;
+ break;
+ }
+
+ PG_RETURN_DATUM(matches);
+ }
+
+ /*
+ * Return the procedure corresponding to the given function support number.
+ */
+ static FmgrInfo *
+ mmsrt_get_procinfo(MinmaxDesc *mmdesc, uint16 attno, uint16 procnum)
+ {
+ SortableOpaque *opaque;
+ uint16 basenum = procnum - PROCNUM_BASE;
+
+ opaque = (SortableOpaque *) mmdesc->md_info[attno - 1]->oi_opaque;
+
+ /*
+ * We cache these in the opaque struct, to avoid repetitive syscache
+ * lookups.
+ */
+ if (!opaque->inited[basenum])
+ {
+ fmgr_info_copy(&opaque->operators[basenum],
+ index_getprocinfo(mmdesc->md_index, attno, procnum),
+ CurrentMemoryContext);
+ opaque->inited[basenum] = true;
+ }
+
+ return &opaque->operators[basenum];
+ }
*** /dev/null
--- b/src/backend/access/minmax/mmtuple.c
***************
*** 0 ****
--- 1,478 ----
+ /*
+ * mmtuple.c
+ * Method implementations for MinMax-specific index tuples.
+ *
+ * Intended usage is that code outside this file only deals with
+ * DeformedMMTuples, and convert to and from the on-disk representation through
+ * functions in this file.
+ *
+ * NOTES
+ *
+ * A minmax tuple is similar to a heap tuple, with a few key differences. The
+ * first interesting difference is that the tuple header is much simpler, only
+ * containing its total length and a small area for flags. Also, the stored
+ * data does not match the relation tuple descriptor exactly: for each
+ * attribute in the descriptor, the index tuple carries an arbitrary number
+ * of values, depending on the opclass.
+ *
+ * Also, for each column of the index relation there are two null bits: one
+ * (hasnulls) stores whether any tuple within the page range has that column
+ * set to null; the other one (allnulls) stores whether the column values are
+ * all null. If allnulls is true, then the tuple data area does not contain
+ * values for that column at all; whereas it does if only hasnulls is set.
+ * Note that the size of the null bitmask may not be the same as that of the
+ * datum array.
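+ *
+ * As an example, for a single-column int4 minmax index the sortable opclass
+ * stores two datums per column (the minimum followed by the maximum), so the
+ * data area of a tuple with no nulls simply holds two int4 values.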
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmtuple.c
+ */
+ #include "postgres.h"
+
+ #include "access/htup_details.h"
+ #include "access/minmax_tuple.h"
+ #include "access/tupdesc.h"
+ #include "access/tupmacs.h"
+
+
+ static inline void mm_deconstruct_tuple(MinmaxDesc *mmdesc,
+ char *tp, bits8 *nullbits, bool nulls,
+ Datum *values, bool *allnulls, bool *hasnulls);
+
+
+ /*
+ * Return a tuple descriptor used for on-disk storage of minmax tuples.
+ */
+ static TupleDesc
+ mmtuple_disk_tupdesc(MinmaxDesc *mmdesc)
+ {
+ /* We cache these in the MinmaxDesc */
+ if (mmdesc->md_disktdesc == NULL)
+ {
+ int i;
+ int j;
+ AttrNumber attno = 1;
+ TupleDesc tupdesc;
+
+ tupdesc = CreateTemplateTupleDesc(mmdesc->md_totalstored, false);
+
+ for (i = 0; i < mmdesc->md_tupdesc->natts; i++)
+ {
+ for (j = 0; j < mmdesc->md_info[i]->oi_nstored; j++)
+ TupleDescInitEntry(tupdesc, attno++, NULL,
+ mmdesc->md_info[i]->oi_typids[j],
+ -1, 0);
+ }
+
+ mmdesc->md_disktdesc = tupdesc;
+ }
+
+ return mmdesc->md_disktdesc;
+ }
+
+ /*
+ * Generate a new on-disk tuple to be inserted in a minmax index.
+ */
+ MMTuple *
+ minmax_form_tuple(MinmaxDesc *mmdesc, BlockNumber blkno,
+ DeformedMMTuple *tuple, Size *size)
+ {
+ Datum *values;
+ bool *nulls;
+ bool anynulls = false;
+ MMTuple *rettuple;
+ int keyno;
+ int idxattno;
+ uint16 phony_infomask;
+ bits8 *phony_nullbitmap;
+ Size len,
+ hoff,
+ data_len;
+
+ Assert(mmdesc->md_totalstored > 0);
+
+ values = palloc(sizeof(Datum) * mmdesc->md_totalstored);
+ nulls = palloc0(sizeof(bool) * mmdesc->md_totalstored);
+ phony_nullbitmap = palloc(sizeof(bits8) * BITMAPLEN(mmdesc->md_totalstored));
+
+ /*
+ * Set up the values/nulls arrays for heap_fill_tuple
+ */
+ idxattno = 0;
+ for (keyno = 0; keyno < mmdesc->md_tupdesc->natts; keyno++)
+ {
+ int datumno;
+
+ /*
+ * "allnulls" is set when there's no nonnull value in any row in
+ * the column; when this happens, there is no data to store. Thus
+ * set the null bits for all data elements of this column and
+ * we're done.
+ */
+ if (tuple->dt_columns[keyno].allnulls)
+ {
+ for (datumno = 0;
+ datumno < mmdesc->md_info[keyno]->oi_nstored;
+ datumno++)
+ nulls[idxattno++] = true;
+ anynulls = true;
+ continue;
+ }
+
+ /*
+ * The "hasnulls" bit is set when there are some null values in the
+ * data. We still need to store a real value, but the presence of this
+ * means we need a null bitmap.
+ */
+ if (tuple->dt_columns[keyno].hasnulls)
+ anynulls = true;
+
+ for (datumno = 0;
+ datumno < mmdesc->md_info[keyno]->oi_nstored;
+ datumno++)
+ values[idxattno++] = tuple->dt_columns[keyno].values[datumno];
+ }
+
+ /* compute total space needed */
+ len = SizeOfMinMaxTuple;
+ if (anynulls)
+ {
+ /*
+ * We need a double-length bitmap on an on-disk minmax index tuple;
+ * the first half stores the "allnulls" bits, the second stores
+ * "hasnulls".
+ */
+ len += BITMAPLEN(mmdesc->md_tupdesc->natts * 2);
+ }
+
+ len = hoff = MAXALIGN(len);
+
+ data_len = heap_compute_data_size(mmtuple_disk_tupdesc(mmdesc),
+ values, nulls);
+
+ len += data_len;
+
+ rettuple = palloc0(len);
+ rettuple->mt_blkno = blkno;
+ rettuple->mt_info = hoff;
+ Assert((rettuple->mt_info & MMIDX_OFFSET_MASK) == hoff);
+
+ /*
+ * The infomask and null bitmap as computed by heap_fill_tuple are useless
+ * to us. However, that function will not accept a null infomask; and we
+ * need to pass a valid null bitmap so that it will correctly skip
+ * outputting null attributes in the data area.
+ */
+ heap_fill_tuple(mmtuple_disk_tupdesc(mmdesc),
+ values,
+ nulls,
+ (char *) rettuple + hoff,
+ data_len,
+ &phony_infomask,
+ phony_nullbitmap);
+
+ /* done with these */
+ pfree(values);
+ pfree(nulls);
+ pfree(phony_nullbitmap);
+
+ /*
+ * Now fill in the real null bitmasks. allnulls first.
+ */
+ if (anynulls)
+ {
+ bits8 *bitP;
+ int bitmask;
+
+ rettuple->mt_info |= MMIDX_NULLS_MASK;
+
+ /*
+ * Note that we reverse the sense of null bits in this module: we store
+ * a 1 for a null attribute rather than a 0. So we must reverse the
+ * sense of the att_isnull test in mm_deconstruct_tuple as well.
+ */
+ bitP = ((bits8 *) ((char *) rettuple + SizeOfMinMaxTuple)) - 1;
+ bitmask = HIGHBIT;
+ for (keyno = 0; keyno < mmdesc->md_tupdesc->natts; keyno++)
+ {
+ if (bitmask != HIGHBIT)
+ bitmask <<= 1;
+ else
+ {
+ bitP += 1;
+ *bitP = 0x0;
+ bitmask = 1;
+ }
+
+ if (!tuple->dt_columns[keyno].allnulls)
+ continue;
+
+ *bitP |= bitmask;
+ }
+ /* hasnulls bits follow */
+ for (keyno = 0; keyno < mmdesc->md_tupdesc->natts; keyno++)
+ {
+ if (bitmask != HIGHBIT)
+ bitmask <<= 1;
+ else
+ {
+ bitP += 1;
+ *bitP = 0x0;
+ bitmask = 1;
+ }
+
+ if (!tuple->dt_columns[keyno].hasnulls)
+ continue;
+
+ *bitP |= bitmask;
+ }
+ }
+
+ *size = len;
+ return rettuple;
+ }
+
+ /*
+ * Free a tuple created by minmax_form_tuple
+ */
+ void
+ minmax_free_tuple(MMTuple *tuple)
+ {
+ pfree(tuple);
+ }
+
+ MMTuple *
+ minmax_copy_tuple(MMTuple *tuple, Size len)
+ {
+ MMTuple *newtup;
+
+ newtup = palloc(len);
+ memcpy(newtup, tuple, len);
+
+ return newtup;
+ }
+
+ bool
+ minmax_tuples_equal(MMTuple *a, Size alen, MMTuple *b, Size blen)
+ {
+ if (alen != blen)
+ return false;
+ if (memcmp(a, b, alen) != 0)
+ return false;
+ return true;
+ }
+
+ /*
+ * Create a new DeformedMMTuple from scratch, and initialize it to an empty
+ * state.
+ */
+ DeformedMMTuple *
+ minmax_new_dtuple(MinmaxDesc *mmdesc)
+ {
+ DeformedMMTuple *dtup;
+ char *currdatum;
+ long basesize;
+ int i;
+
+ basesize = MAXALIGN(sizeof(DeformedMMTuple) +
+ sizeof(MMValues) * mmdesc->md_tupdesc->natts);
+ dtup = palloc0(basesize + sizeof(Datum) * mmdesc->md_totalstored);
+ currdatum = (char *) dtup + basesize;
+ for (i = 0; i < mmdesc->md_tupdesc->natts; i++)
+ {
+ dtup->dt_columns[i].allnulls = true;
+ dtup->dt_columns[i].hasnulls = false;
+ dtup->dt_columns[i].values = (Datum *) currdatum;
+ currdatum += sizeof(Datum) * mmdesc->md_info[i]->oi_nstored;
+ }
+
+ return dtup;
+ }
+
+ /*
+ * Reset a DeformedMMTuple to initial state
+ */
+ void
+ minmax_dtuple_initialize(DeformedMMTuple *dtuple, MinmaxDesc *mmdesc)
+ {
+ int i;
+
+ for (i = 0; i < mmdesc->md_tupdesc->natts; i++)
+ {
+ /*
+ * FIXME -- we may need to pfree() some datums here before clobbering
+ * the whole thing
+ */
+ dtuple->dt_columns[i].allnulls = true;
+ dtuple->dt_columns[i].hasnulls = false;
+ memset(dtuple->dt_columns[i].values, 0,
+ sizeof(Datum) * mmdesc->md_info[i]->oi_nstored);
+ }
+ }
+
+ /*
+ * Convert a MMTuple back to a DeformedMMTuple. This is the reverse of
+ * minmax_form_tuple.
+ *
+ * Note we don't need the "on disk tupdesc" here; we rely on our own routine to
+ * deconstruct the tuple from the on-disk format.
+ *
+ * XXX some callers might need copies of each datum; if so we need to apply
+ * datumCopy inside the loop. We probably also need a minmax_free_dtuple()
+ * function.
+ */
+ DeformedMMTuple *
+ minmax_deform_tuple(MinmaxDesc *mmdesc, MMTuple *tuple)
+ {
+ DeformedMMTuple *dtup;
+ Datum *values;
+ bool *allnulls;
+ bool *hasnulls;
+ char *tp;
+ bits8 *nullbits;
+ int keyno;
+ int valueno;
+
+ dtup = minmax_new_dtuple(mmdesc);
+
+ values = palloc(sizeof(Datum) * mmdesc->md_totalstored);
+ allnulls = palloc(sizeof(bool) * mmdesc->md_tupdesc->natts);
+ hasnulls = palloc(sizeof(bool) * mmdesc->md_tupdesc->natts);
+
+ tp = (char *) tuple + MMTupleDataOffset(tuple);
+
+ if (MMTupleHasNulls(tuple))
+ nullbits = (bits8 *) ((char *) tuple + SizeOfMinMaxTuple);
+ else
+ nullbits = NULL;
+ mm_deconstruct_tuple(mmdesc,
+ tp, nullbits, MMTupleHasNulls(tuple),
+ values, allnulls, hasnulls);
+
+ /*
+ * Iterate to assign each of the values to the corresponding item
+ * in the values array of each column.
+ */
+ for (valueno = 0, keyno = 0; keyno < mmdesc->md_tupdesc->natts; keyno++)
+ {
+ int i;
+
+ if (allnulls[keyno])
+ {
+ valueno += mmdesc->md_info[keyno]->oi_nstored;
+ continue;
+ }
+
+ dtup->dt_columns[keyno].values =
+ palloc(sizeof(Datum) * mmdesc->md_totalstored);
+
+ /* XXX optional datumCopy()? */
+ for (i = 0; i < mmdesc->md_info[keyno]->oi_nstored; i++)
+ dtup->dt_columns[keyno].values[i] = values[valueno++];
+
+ dtup->dt_columns[keyno].hasnulls = hasnulls[keyno];
+ dtup->dt_columns[keyno].allnulls = false;
+ }
+
+ pfree(values);
+ pfree(allnulls);
+ pfree(hasnulls);
+
+ return dtup;
+ }
+
+ /*
+ * mm_deconstruct_tuple
+ * Guts of attribute extraction from an on-disk minmax tuple.
+ *
+ * Its arguments are:
+ * mmdesc minmax descriptor for the stored tuple
+ * tp pointer to the tuple data area
+ * nullbits pointer to the tuple nulls bitmask
+ * nulls "has nulls" bit in tuple infomask
+ * values output values, array of size mmdesc->md_totalstored
+ * allnulls output "allnulls", size mmdesc->md_tupdesc->natts
+ * hasnulls output "hasnulls", size mmdesc->md_tupdesc->natts
+ *
+ * Output arrays must have been allocated by caller.
+ */
+ static inline void
+ mm_deconstruct_tuple(MinmaxDesc *mmdesc,
+ char *tp, bits8 *nullbits, bool nulls,
+ Datum *values, bool *allnulls, bool *hasnulls)
+ {
+ int attnum;
+ int stored;
+ TupleDesc diskdsc;
+ long off;
+
+ /*
+ * First, loop over the natts attributes to obtain both null flags for each
+ * one. Note that we reverse the sense of the att_isnull test, because we
+ * store a 1 for a null value (rather than a 1 for a not-null value, as is
+ * the att_isnull convention used elsewhere). See minmax_form_tuple.
+ */
+ for (attnum = 0; attnum < mmdesc->md_tupdesc->natts; attnum++)
+ {
+ /*
+ * the "all nulls" bit means that all values in the page range for
+ * this column are null. Therefore there are no values in the tuple
+ * data area.
+ */
+ allnulls[attnum] = nulls && !att_isnull(attnum, nullbits);
+
+ /*
+ * the "has nulls" bit means that some tuples have nulls, but others
+ * have not-null values. Therefore we know the tuple contains data for
+ * this column.
+ *
+ * The hasnulls bits follow the allnulls bits in the same bitmask.
+ */
+ hasnulls[attnum] =
+ nulls && !att_isnull(mmdesc->md_tupdesc->natts + attnum, nullbits);
+ }
+
+ /*
+ * Iterate to obtain each attribute's stored values. Note that since we
+ * may reuse attribute entries for more than one column, we cannot cache
+ * offsets here.
+ */
+ diskdsc = mmtuple_disk_tupdesc(mmdesc);
+ stored = 0;
+ off = 0;
+ for (attnum = 0; attnum < mmdesc->md_tupdesc->natts; attnum++)
+ {
+ int datumno;
+
+ if (allnulls[attnum])
+ {
+ stored += mmdesc->md_info[attnum]->oi_nstored;
+ continue;
+ }
+
+ for (datumno = 0;
+ datumno < mmdesc->md_info[attnum]->oi_nstored;
+ datumno++)
+ {
+ Form_pg_attribute thisatt = diskdsc->attrs[stored];
+
+ if (thisatt->attlen == -1)
+ {
+ off = att_align_pointer(off, thisatt->attalign, -1,
+ tp + off);
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ off = att_align_nominal(off, thisatt->attalign);
+ }
+
+ values[stored++] = fetchatt(thisatt, tp + off);
+
+ off = att_addlength_pointer(off, thisatt->attlen, tp + off);
+ }
+ }
+ }
*** /dev/null
--- b/src/backend/access/minmax/mmxlog.c
***************
*** 0 ****
--- 1,360 ----
+ /*
+ * mmxlog.c
+ * XLog replay routines for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmxlog.c
+ */
+ #include "postgres.h"
+
+ #include "access/minmax.h"
+ #include "access/minmax_internal.h"
+ #include "access/minmax_page.h"
+ #include "access/minmax_revmap.h"
+ #include "access/minmax_tuple.h"
+ #include "access/minmax_xlog.h"
+ #include "access/xlogutils.h"
+ #include "storage/freespace.h"
+
+
+ /*
+ * xlog replay routines
+ */
+ static void
+ minmax_xlog_createidx(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_createidx *xlrec = (xl_minmax_createidx *) XLogRecGetData(record);
+ Buffer buf;
+ Page page;
+
+ /* Backup blocks are not used in create_index records */
+ Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+
+ /* create the index's metapage */
+ buf = XLogReadBuffer(xlrec->node, MINMAX_METAPAGE_BLKNO, true);
+ Assert(BufferIsValid(buf));
+ page = (Page) BufferGetPage(buf);
+ mm_metapage_init(page, xlrec->pagesPerRange, xlrec->version);
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ UnlockReleaseBuffer(buf);
+
+ /* also initialize its first revmap page */
+ buf = XLogReadBuffer(xlrec->node, 1, true);
+ Assert(BufferIsValid(buf));
+ page = (Page) BufferGetPage(buf);
+ mm_page_init(page, MINMAX_PAGETYPE_REVMAP);
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ UnlockReleaseBuffer(buf);
+ }
+
+ /*
+ * Common part of an insert or update. Inserts the new tuple and updates the
+ * revmap.
+ */
+ static void
+ minmax_xlog_insert_update(XLogRecPtr lsn, XLogRecord *record, xl_minmax_insert *xlrec,
+ MMTuple *mmtuple, int tuplen)
+ {
+ BlockNumber blkno;
+ Buffer buffer;
+ Page page;
+ OffsetNumber offnum;
+
+ /* If we have a full-page image, restore it */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ }
+ else
+ {
+ Assert(mmtuple->mt_blkno == xlrec->heapBlk);
+
+ blkno = ItemPointerGetBlockNumber(&(xlrec->tid));
+ if (record->xl_info & XLOG_MINMAX_INIT_PAGE)
+ {
+ buffer = XLogReadBuffer(xlrec->node, blkno, true);
+ Assert(BufferIsValid(buffer));
+ page = (Page) BufferGetPage(buffer);
+
+ mm_page_init(page, MINMAX_PAGETYPE_REGULAR);
+ }
+ else
+ {
+ buffer = XLogReadBuffer(xlrec->node, blkno, false);
+ }
+ if (BufferIsValid(buffer))
+ {
+ page = (Page) BufferGetPage(buffer);
+
+ if (lsn > PageGetLSN(page))
+ {
+ offnum = ItemPointerGetOffsetNumber(&(xlrec->tid));
+ if (PageGetMaxOffsetNumber(page) + 1 < offnum)
+ elog(PANIC, "minmax_xlog_insert: invalid max offset number");
+
+ offnum = PageAddItem(page, (Item) mmtuple, tuplen, offnum, true, false);
+ if (offnum == InvalidOffsetNumber)
+ elog(PANIC, "minmax_xlog_insert: failed to add tuple");
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ }
+ UnlockReleaseBuffer(buffer);
+ }
+ }
+
+ /* update the revmap */
+ if (record->xl_info & XLR_BKP_BLOCK(1))
+ {
+ (void) RestoreBackupBlock(lsn, record, 1, false, false);
+ }
+ else
+ {
+ buffer = XLogReadBuffer(xlrec->node, xlrec->revmapBlk, false);
+ if (BufferIsValid(buffer))
+ {
+ page = (Page) BufferGetPage(buffer);
+
+ if (lsn > PageGetLSN(page))
+ {
+ mmSetHeapBlockItemptr(buffer, xlrec->pagesPerRange, xlrec->heapBlk, xlrec->tid);
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ }
+ UnlockReleaseBuffer(buffer);
+ }
+ }
+
+ /* XXX no FSM updates here ... */
+ }
+
+ static void
+ minmax_xlog_insert(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_insert *xlrec = (xl_minmax_insert *) XLogRecGetData(record);
+ MMTuple *newtup;
+ int tuplen;
+
+ tuplen = record->xl_len - SizeOfMinmaxInsert;
+ newtup = (MMTuple *) ((char *) xlrec + SizeOfMinmaxInsert);
+
+ minmax_xlog_insert_update(lsn, record, xlrec, newtup, tuplen);
+ }
+
+ static void
+ minmax_xlog_update(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_update *xlrec = (xl_minmax_update *) XLogRecGetData(record);
+ BlockNumber blkno;
+ OffsetNumber offnum;
+ Buffer buffer;
+ Page page;
+ MMTuple *newtup;
+ int tuplen;
+
+ tuplen = record->xl_len - SizeOfMinmaxUpdate;
+ newtup = (MMTuple *) ((char *) xlrec + SizeOfMinmaxUpdate);
+
+ /* First insert the new tuple and update revmap, like in an insertion. */
+ minmax_xlog_insert_update(lsn, record, &xlrec->new, newtup, tuplen);
+
+ /* Then remove the old tuple */
+ if (record->xl_info & XLR_BKP_BLOCK(2))
+ {
+ (void) RestoreBackupBlock(lsn, record, 2, false, false);
+ }
+ else
+ {
+ blkno = ItemPointerGetBlockNumber(&(xlrec->oldtid));
+ buffer = XLogReadBuffer(xlrec->new.node, blkno, false);
+ if (BufferIsValid(buffer))
+ {
+ page = (Page) BufferGetPage(buffer);
+
+ if (lsn > PageGetLSN(page))
+ {
+ offnum = ItemPointerGetOffsetNumber(&(xlrec->oldtid));
+ if (PageGetMaxOffsetNumber(page) + 1 < offnum)
+ elog(PANIC, "minmax_xlog_insert: invalid max offset number");
+
+ PageIndexDeleteNoCompact(page, &offnum, 1);
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ }
+ UnlockReleaseBuffer(buffer);
+ }
+ }
+ }
+
+ /*
+ * Update a tuple on a single page.
+ */
+ static void
+ minmax_xlog_samepage_update(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_samepage_update *xlrec = (xl_minmax_samepage_update *) XLogRecGetData(record);
+ BlockNumber blkno;
+ Buffer buffer;
+ Page page;
+ OffsetNumber offnum;
+
+ /* If we have a full-page image, restore it */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ }
+ else
+ {
+ MMTuple *mmtuple;
+ int tuplen;
+
+ tuplen = record->xl_len - SizeOfMinmaxSamepageUpdate;
+ mmtuple = (MMTuple *) ((char *) xlrec + SizeOfMinmaxSamepageUpdate);
+
+ blkno = ItemPointerGetBlockNumber(&(xlrec->tid));
+ buffer = XLogReadBuffer(xlrec->node, blkno, false);
+ if (BufferIsValid(buffer))
+ {
+ page = (Page) BufferGetPage(buffer);
+
+ if (lsn > PageGetLSN(page))
+ {
+ offnum = ItemPointerGetOffsetNumber(&(xlrec->tid));
+ if (PageGetMaxOffsetNumber(page) + 1 < offnum)
+ elog(PANIC, "minmax_xlog_samepage_update: invalid max offset number");
+
+ PageIndexDeleteNoCompact(page, &offnum, 1);
+ offnum = PageAddItem(page, (Item) mmtuple, tuplen, offnum, true, false);
+ if (offnum == InvalidOffsetNumber)
+ elog(PANIC, "minmax_xlog_samepage_update: failed to add tuple");
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ }
+ UnlockReleaseBuffer(buffer);
+ }
+ }
+
+ /* XXX no FSM updates here ... */
+ }
+
+
+ static void
+ minmax_xlog_metapg_set(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_metapg_set *xlrec = (xl_minmax_metapg_set *) XLogRecGetData(record);
+ Buffer meta;
+ Page metapg;
+ MinmaxMetaPageData *metadata;
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ meta = XLogReadBuffer(xlrec->node, MINMAX_METAPAGE_BLKNO, false);
+ Assert(BufferIsValid(meta));
+
+ metapg = BufferGetPage(meta);
+ metadata = (MinmaxMetaPageData *) PageGetContents(metapg);
+ metadata->revmapArrayPages[xlrec->blkidx] = xlrec->newpg;
+
+ PageSetLSN(metapg, lsn);
+ MarkBufferDirty(meta);
+ UnlockReleaseBuffer(meta);
+ }
+
+ static void
+ minmax_xlog_init_rmpg(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_init_rmpg *xlrec = (xl_minmax_init_rmpg *) XLogRecGetData(record);
+ Buffer buffer;
+
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ buffer = XLogReadBuffer(xlrec->node, xlrec->blkno, true);
+ Assert(BufferIsValid(buffer));
+
+ if (xlrec->array)
+ initialize_rma_page(buffer);
+ else
+ initialize_rmr_page(buffer, xlrec->logblk);
+
+ PageSetLSN(BufferGetPage(buffer), lsn);
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+ }
+
+ static void
+ minmax_xlog_rmarray_set(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_rmarray_set *xlrec = (xl_minmax_rmarray_set *) XLogRecGetData(record);
+ Buffer buffer;
+ Page page;
+ RevmapArrayContents *contents;
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ buffer = XLogReadBuffer(xlrec->node, xlrec->rmarray, false);
+ Assert(BufferIsValid(buffer));
+
+ page = BufferGetPage(buffer);
+
+ contents = (RevmapArrayContents *) PageGetContents(page);
+ contents->rma_blocks[xlrec->blkidx] = xlrec->newpg;
+ contents->rma_nblocks = xlrec->blkidx + 1; /* XXX is this okay? */
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+ }
+
+ void
+ minmax_redo(XLogRecPtr lsn, XLogRecord *record)
+ {
+ uint8 info = record->xl_info & ~XLR_INFO_MASK;
+
+ switch (info & XLOG_MINMAX_OPMASK)
+ {
+ case XLOG_MINMAX_CREATE_INDEX:
+ minmax_xlog_createidx(lsn, record);
+ break;
+ case XLOG_MINMAX_INSERT:
+ minmax_xlog_insert(lsn, record);
+ break;
+ case XLOG_MINMAX_UPDATE:
+ minmax_xlog_update(lsn, record);
+ break;
+ case XLOG_MINMAX_SAMEPAGE_UPDATE:
+ minmax_xlog_samepage_update(lsn, record);
+ break;
+ case XLOG_MINMAX_METAPG_SET:
+ minmax_xlog_metapg_set(lsn, record);
+ break;
+ case XLOG_MINMAX_RMARRAY_SET:
+ minmax_xlog_rmarray_set(lsn, record);
+ break;
+ case XLOG_MINMAX_INIT_RMPG:
+ minmax_xlog_init_rmpg(lsn, record);
+ break;
+ default:
+ elog(PANIC, "minmax_redo: unknown op code %u", info);
+ }
+ }
*** a/src/backend/access/rmgrdesc/Makefile
--- b/src/backend/access/rmgrdesc/Makefile
***************
*** 9,15 **** top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = clogdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
! mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o smgrdesc.o spgdesc.o \
standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
include $(top_srcdir)/src/backend/common.mk
--- 9,16 ----
include $(top_builddir)/src/Makefile.global
OBJS = clogdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
! minmaxdesc.o mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o \
! smgrdesc.o spgdesc.o \
standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
include $(top_srcdir)/src/backend/common.mk
*** /dev/null
--- b/src/backend/access/rmgrdesc/minmaxdesc.c
***************
*** 0 ****
--- 1,113 ----
+ /*-------------------------------------------------------------------------
+ *
+ * minmaxdesc.c
+ * rmgr descriptor routines for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/rmgrdesc/minmaxdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+ #include "postgres.h"
+
+ #include "access/minmax_xlog.h"
+
+ void
+ minmax_desc(StringInfo buf, XLogRecord *record)
+ {
+ char *rec = XLogRecGetData(record);
+ uint8 info = record->xl_info & ~XLR_INFO_MASK;
+
+ info &= XLOG_MINMAX_OPMASK;
+ if (info == XLOG_MINMAX_CREATE_INDEX)
+ {
+ xl_minmax_createidx *xlrec = (xl_minmax_createidx *) rec;
+
+ appendStringInfo(buf, "create index: v%d pagesPerRange %u %u/%u/%u",
+ xlrec->version, xlrec->pagesPerRange,
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode);
+ }
+ else if (info == XLOG_MINMAX_INSERT)
+ {
+ xl_minmax_insert *xlrec = (xl_minmax_insert *) rec;
+
+ if (record->xl_info & XLOG_MINMAX_INIT_PAGE)
+ appendStringInfo(buf, "insert(init): ");
+ else
+ appendStringInfo(buf, "insert: ");
+ appendStringInfo(buf, "%u/%u/%u blk %u revmapBlk %u pagesPerRange %u TID (%u,%u)",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode,
+ xlrec->heapBlk, xlrec->revmapBlk,
+ xlrec->pagesPerRange,
+ ItemPointerGetBlockNumber(&xlrec->tid),
+ ItemPointerGetOffsetNumber(&xlrec->tid));
+ }
+ else if (info == XLOG_MINMAX_UPDATE)
+ {
+ xl_minmax_update *xlrec = (xl_minmax_update *) rec;
+
+ if (record->xl_info & XLOG_MINMAX_INIT_PAGE)
+ appendStringInfo(buf, "update(init): ");
+ else
+ appendStringInfo(buf, "update: ");
+ appendStringInfo(buf, "rel %u/%u/%u heapBlk %u revmapBlk %u pagesPerRange %u TID (%u,%u) old TID (%u,%u)",
+ xlrec->new.node.spcNode, xlrec->new.node.dbNode,
+ xlrec->new.node.relNode,
+ xlrec->new.heapBlk, xlrec->new.revmapBlk,
+ xlrec->new.pagesPerRange,
+ ItemPointerGetBlockNumber(&xlrec->new.tid),
+ ItemPointerGetOffsetNumber(&xlrec->new.tid),
+ ItemPointerGetBlockNumber(&xlrec->oldtid),
+ ItemPointerGetOffsetNumber(&xlrec->oldtid));
+ }
+ else if (info == XLOG_MINMAX_SAMEPAGE_UPDATE)
+ {
+ xl_minmax_samepage_update *xlrec = (xl_minmax_samepage_update *) rec;
+
+ appendStringInfo(buf, "samepage_update: rel %u/%u/%u TID (%u,%u)",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode,
+ ItemPointerGetBlockNumber(&xlrec->tid),
+ ItemPointerGetOffsetNumber(&xlrec->tid));
+ }
+ else if (info == XLOG_MINMAX_METAPG_SET)
+ {
+ xl_minmax_metapg_set *xlrec = (xl_minmax_metapg_set *) rec;
+
+ appendStringInfo(buf, "metapg: rel %u/%u/%u array revmap idx %d block %u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode,
+ xlrec->blkidx, xlrec->newpg);
+ }
+ else if (info == XLOG_MINMAX_RMARRAY_SET)
+ {
+ xl_minmax_rmarray_set *xlrec = (xl_minmax_rmarray_set *) rec;
+
+ appendStringInfoString(buf, "revmap array: ");
+ appendStringInfo(buf, "rel %u/%u/%u array pg %u revmap idx %d block %u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode,
+ xlrec->rmarray,
+ xlrec->blkidx, xlrec->newpg);
+ }
+ else if (info == XLOG_MINMAX_INIT_RMPG)
+ {
+ xl_minmax_init_rmpg *xlrec = (xl_minmax_init_rmpg *) rec;
+
+ appendStringInfo(buf, "init_rmpg: rel %u/%u/%u blk %u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode, xlrec->blkno);
+ if (xlrec->array)
+ appendStringInfoString(buf, " (array)");
+ else
+ appendStringInfo(buf, "(regular) logblk %u", xlrec->logblk);
+ }
+ else
+ appendStringInfo(buf, "UNKNOWN");
+ }
*** a/src/backend/access/transam/rmgr.c
--- b/src/backend/access/transam/rmgr.c
***************
*** 12,17 ****
--- 12,18 ----
#include "access/gist_private.h"
#include "access/hash.h"
#include "access/heapam_xlog.h"
+ #include "access/minmax_xlog.h"
#include "access/multixact.h"
#include "access/nbtree.h"
#include "access/spgist.h"
*** a/src/backend/catalog/index.c
--- b/src/backend/catalog/index.c
***************
*** 2096,2101 **** IndexBuildHeapScan(Relation heapRelation,
--- 2096,2122 ----
IndexBuildCallback callback,
void *callback_state)
{
+ return IndexBuildHeapRangeScan(heapRelation, indexRelation,
+ indexInfo, allow_sync,
+ 0, InvalidBlockNumber,
+ callback, callback_state);
+ }
+
+ /*
+ * As above, except that instead of scanning the complete heap, only the given
+ * range of blocks is scanned. Scan to end-of-rel can be signalled by
+ * passing InvalidBlockNumber as numblocks.
+ */
+ double
+ IndexBuildHeapRangeScan(Relation heapRelation,
+ Relation indexRelation,
+ IndexInfo *indexInfo,
+ bool allow_sync,
+ BlockNumber start_blockno,
+ BlockNumber numblocks,
+ IndexBuildCallback callback,
+ void *callback_state)
+ {
bool is_system_catalog;
bool checking_uniqueness;
HeapScanDesc scan;
***************
*** 2166,2171 **** IndexBuildHeapScan(Relation heapRelation,
--- 2187,2195 ----
true, /* buffer access strategy OK */
allow_sync); /* syncscan OK? */
+ /* set our endpoints */
+ heap_setscanlimits(scan, start_blockno, numblocks);
+
reltuples = 0;
/*
*** a/src/backend/replication/logical/decode.c
--- b/src/backend/replication/logical/decode.c
***************
*** 132,137 **** LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogRecord *record)
--- 132,138 ----
case RM_GIST_ID:
case RM_SEQ_ID:
case RM_SPGIST_ID:
+ case RM_MINMAX_ID:
break;
case RM_NEXT_ID:
elog(ERROR, "unexpected RM_NEXT_ID rmgr_id: %u", (RmgrIds) buf.record.xl_rmid);
*** a/src/backend/storage/page/bufpage.c
--- b/src/backend/storage/page/bufpage.c
***************
*** 399,405 **** PageRestoreTempPage(Page tempPage, Page oldPage)
}
/*
! * sorting support for PageRepairFragmentation and PageIndexMultiDelete
*/
typedef struct itemIdSortData
{
--- 399,406 ----
}
/*
! * sorting support for PageRepairFragmentation, PageIndexMultiDelete,
! * PageIndexDeleteNoCompact
*/
typedef struct itemIdSortData
{
***************
*** 896,901 **** PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
--- 897,1078 ----
phdr->pd_upper = upper;
}
+ /*
+ * PageIndexDeleteNoCompact
+ * Delete the given items from an index page, and defragment the resulting
+ * free space, but do not compact the item pointers array.
+ *
+ * itemnos is the array of offset numbers of the items to delete; nitems is
+ * its size.
+ *
+ * Unused items at the end of the array are removed.
+ *
+ * This is used for index AMs that require that existing TIDs of live tuples
+ * remain unchanged.
+ */
+ void
+ PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos, int nitems)
+ {
+ PageHeader phdr = (PageHeader) page;
+ LocationIndex pd_lower = phdr->pd_lower;
+ LocationIndex pd_upper = phdr->pd_upper;
+ LocationIndex pd_special = phdr->pd_special;
+ int nline;
+ bool empty;
+ OffsetNumber offnum;
+ int nextitm;
+
+ /*
+ * As with PageRepairFragmentation, paranoia seems justified.
+ */
+ if (pd_lower < SizeOfPageHeaderData ||
+ pd_lower > pd_upper ||
+ pd_upper > pd_special ||
+ pd_special > BLCKSZ ||
+ pd_special != MAXALIGN(pd_special))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u",
+ pd_lower, pd_upper, pd_special)));
+
+ /*
+ * Scan the existing item pointer array and mark as unused those that are
+ * in our kill-list; make sure any non-interesting ones are marked unused
+ * as well.
+ */
+ nline = PageGetMaxOffsetNumber(page);
+ empty = true;
+ nextitm = 0;
+ for (offnum = FirstOffsetNumber; offnum <= nline; offnum = OffsetNumberNext(offnum))
+ {
+ ItemId lp;
+ ItemLength itemlen;
+ ItemOffset offset;
+
+ lp = PageGetItemId(page, offnum);
+
+ itemlen = ItemIdGetLength(lp);
+ offset = ItemIdGetOffset(lp);
+
+ if (ItemIdIsUsed(lp))
+ {
+ if (offset < pd_upper ||
+ (offset + itemlen) > pd_special ||
+ offset != MAXALIGN(offset))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted item pointer: offset = %u, length = %u",
+ offset, (unsigned int) itemlen)));
+
+ if (nextitm < nitems && offnum == itemnos[nextitm])
+ {
+ /* this one is on our list to delete, so mark it unused */
+ ItemIdSetUnused(lp);
+ nextitm++;
+ }
+ else if (ItemIdHasStorage(lp))
+ {
+ /* This one's live -- must do the compaction dance */
+ empty = false;
+ }
+ else
+ {
+ /* get rid of this one too */
+ ItemIdSetUnused(lp);
+ }
+ }
+ }
+
+ /* this will catch invalid or out-of-order itemnos[] */
+ if (nextitm != nitems)
+ elog(ERROR, "incorrect index offsets supplied");
+
+ if (empty)
+ {
+ /* Page is completely empty, so just reset it quickly */
+ phdr->pd_lower = SizeOfPageHeaderData;
+ phdr->pd_upper = pd_special;
+ }
+ else
+ {
+ /* There are live items: need to compact the page the hard way */
+ itemIdSortData itemidbase[MaxOffsetNumber];
+ itemIdSort itemidptr;
+ int i;
+ Size totallen;
+ Offset upper;
+
+ /*
+ * Scan the page taking note of each item that we need to preserve.
+ * This includes both live items (those that contain data) and
+ * interspersed unused ones. It's critical to preserve these unused
+ * items, because otherwise the offset numbers for later live items
+ * would change, which is not acceptable. Unused items might get used
+ * again later; that is fine.
+ */
+ itemidptr = itemidbase;
+ totallen = 0;
+ for (i = 0; i < nline; i++, itemidptr++)
+ {
+ ItemId lp;
+
+ itemidptr->offsetindex = i;
+
+ lp = PageGetItemId(page, i + 1);
+ if (ItemIdHasStorage(lp))
+ {
+ itemidptr->itemoff = ItemIdGetOffset(lp);
+ itemidptr->alignedlen = MAXALIGN(ItemIdGetLength(lp));
+ totallen += itemidptr->alignedlen;
+ }
+ else
+ {
+ itemidptr->itemoff = 0;
+ itemidptr->alignedlen = 0;
+ }
+ }
+ /* By here, there are exactly nline elements in itemidbase array */
+
+ if (totallen > (Size) (pd_special - pd_lower))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted item lengths: total %u, available space %u",
+ (unsigned int) totallen, pd_special - pd_lower)));
+
+ /* sort itemIdSortData array into decreasing itemoff order */
+ qsort((char *) itemidbase, nline, sizeof(itemIdSortData),
+ itemoffcompare);
+
+ /*
+ * Defragment the data areas of each tuple, being careful to preserve
+ * each item's position in the linp array.
+ */
+ upper = pd_special;
+ PageClearHasFreeLinePointers(page);
+ for (i = 0, itemidptr = itemidbase; i < nline; i++, itemidptr++)
+ {
+ ItemId lp;
+
+ lp = PageGetItemId(page, itemidptr->offsetindex + 1);
+ if (itemidptr->alignedlen == 0)
+ {
+ PageSetHasFreeLinePointers(page);
+ ItemIdSetUnused(lp);
+ continue;
+ }
+ upper -= itemidptr->alignedlen;
+ memmove((char *) page + upper,
+ (char *) page + itemidptr->itemoff,
+ itemidptr->alignedlen);
+ lp->lp_off = upper;
+ /* lp_flags and lp_len remain the same as originally */
+ }
+
+ /* Set the new page limits */
+ phdr->pd_upper = upper;
+ phdr->pd_lower = SizeOfPageHeaderData + i * sizeof(ItemIdData);
+ }
+ }
/*
* Set checksum for a page in shared buffers.
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
***************
*** 7349,7351 **** gincostestimate(PG_FUNCTION_ARGS)
--- 7349,7375 ----
PG_RETURN_VOID();
}
+
+ Datum
+ mmcostestimate(PG_FUNCTION_ARGS)
+ {
+ PlannerInfo *root = (PlannerInfo *) PG_GETARG_POINTER(0);
+ IndexPath *path = (IndexPath *) PG_GETARG_POINTER(1);
+ double loop_count = PG_GETARG_FLOAT8(2);
+ Cost *indexStartupCost = (Cost *) PG_GETARG_POINTER(3);
+ Cost *indexTotalCost = (Cost *) PG_GETARG_POINTER(4);
+ Selectivity *indexSelectivity = (Selectivity *) PG_GETARG_POINTER(5);
+ double *indexCorrelation = (double *) PG_GETARG_POINTER(6);
+ IndexOptInfo *index = path->indexinfo;
+
+ *indexStartupCost = (Cost) seq_page_cost * index->pages * loop_count;
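+ /*
+ * Crude cost model: charge seq_page_cost for every index page on each loop,
+ * take the selectivity straight from the restriction clauses, and report
+ * perfect correlation.
+ */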
+ *indexTotalCost = *indexStartupCost;
+
+ *indexSelectivity =
+ clauselist_selectivity(root, path->indexquals,
+ path->indexinfo->rel->relid,
+ JOIN_INNER, NULL);
+ *indexCorrelation = 1;
+
+ PG_RETURN_VOID();
+ }
*** a/src/include/access/heapam.h
--- b/src/include/access/heapam.h
***************
*** 112,117 **** extern HeapScanDesc heap_beginscan_strat(Relation relation, Snapshot snapshot,
--- 112,119 ----
bool allow_strat, bool allow_sync);
extern HeapScanDesc heap_beginscan_bm(Relation relation, Snapshot snapshot,
int nkeys, ScanKey key);
+ extern void heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk,
+ BlockNumber numBlks);
extern void heap_rescan(HeapScanDesc scan, ScanKey key);
extern void heap_endscan(HeapScanDesc scan);
extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
*** /dev/null
--- b/src/include/access/minmax.h
***************
*** 0 ****
--- 1,52 ----
+ /*
+ * AM-callable functions for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax.h
+ */
+ #ifndef MINMAX_H
+ #define MINMAX_H
+
+ #include "fmgr.h"
+ #include "nodes/execnodes.h"
+ #include "utils/relcache.h"
+
+
+ /*
+ * prototypes for functions in minmax.c (external entry points for minmax)
+ */
+ extern Datum mmbuild(PG_FUNCTION_ARGS);
+ extern Datum mmbuildempty(PG_FUNCTION_ARGS);
+ extern Datum mminsert(PG_FUNCTION_ARGS);
+ extern Datum mmbeginscan(PG_FUNCTION_ARGS);
+ extern Datum mmgettuple(PG_FUNCTION_ARGS);
+ extern Datum mmgetbitmap(PG_FUNCTION_ARGS);
+ extern Datum mmrescan(PG_FUNCTION_ARGS);
+ extern Datum mmendscan(PG_FUNCTION_ARGS);
+ extern Datum mmmarkpos(PG_FUNCTION_ARGS);
+ extern Datum mmrestrpos(PG_FUNCTION_ARGS);
+ extern Datum mmbulkdelete(PG_FUNCTION_ARGS);
+ extern Datum mmvacuumcleanup(PG_FUNCTION_ARGS);
+ extern Datum mmcanreturn(PG_FUNCTION_ARGS);
+ extern Datum mmcostestimate(PG_FUNCTION_ARGS);
+ extern Datum mmoptions(PG_FUNCTION_ARGS);
+
+ /*
+ * Storage type for MinMax index reloptions
+ */
+ typedef struct MinmaxOptions
+ {
+ int32 vl_len_; /* varlena header (do not touch directly!) */
+ BlockNumber pagesPerRange;
+ } MinmaxOptions;
+
+ #define MINMAX_DEFAULT_PAGES_PER_RANGE 128
+ #define MinmaxGetPagesPerRange(relation) \
+ ((relation)->rd_options ? \
+ ((MinmaxOptions *) (relation)->rd_options)->pagesPerRange : \
+ MINMAX_DEFAULT_PAGES_PER_RANGE)
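+
+ /*
+ * With the default of 128 pages per range and the standard 8 kB block size,
+ * each index tuple summarizes 1 MB of heap.
+ */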
+
+ #endif /* MINMAX_H */
*** /dev/null
--- b/src/include/access/minmax_internal.h
***************
*** 0 ****
--- 1,91 ----
+ /*
+ * minmax_internal.h
+ * internal declarations for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_internal.h
+ */
+ #ifndef MINMAX_INTERNAL_H
+ #define MINMAX_INTERNAL_H
+
+ #include "fmgr.h"
+ #include "storage/buf.h"
+ #include "storage/bufpage.h"
+ #include "storage/off.h"
+ #include "utils/relcache.h"
+
+
+ /*
+ * A MinmaxDesc is a struct designed to enable decoding a MinMax tuple from the
+ * on-disk format to a DeformedMMTuple and vice-versa.
+ *
+ * Note: we assume, for now, that the data stored for each column is the same
+ * datatype as the indexed heap column. This restriction can be lifted by
+ * having an Oid array pointer on the PerCol struct, where each member of the
+ * array indicates the typid of the stored data.
+ */
+
+ /* struct returned by "OpcInfo" amproc */
+ typedef struct MinmaxOpcInfo
+ {
+ /* Number of columns stored in an index column of this opclass */
+ uint16 oi_nstored;
+
+ /* Opaque pointer for the opclass' private use */
+ void *oi_opaque;
+
+ /* Type IDs of the stored columns */
+ Oid oi_typids[FLEXIBLE_ARRAY_MEMBER];
+ } MinmaxOpcInfo;
+
+ /* the size of a MinmaxOpcInfo for the given number of columns */
+ #define SizeofMinmaxOpcInfo(ncols) \
+ (offsetof(MinmaxOpcInfo, oi_typids) + sizeof(Oid) * ncols)
+
+ typedef struct MinmaxDesc
+ {
+ /* the index relation itself */
+ Relation md_index;
+
+ /* tuple descriptor of the index relation */
+ TupleDesc md_tupdesc;
+
+ /* cached copy for on-disk tuples; generated at first use */
+ TupleDesc md_disktdesc;
+
+ /* total number of Datum entries that are stored on-disk for all columns */
+ int md_totalstored;
+
+ /* per-column info */
+ MinmaxOpcInfo *md_info[FLEXIBLE_ARRAY_MEMBER]; /* tupdesc->natts entries long */
+ } MinmaxDesc;
+
+ /*
+ * Globally-known function support numbers for Minmax indexes. Individual
+ * opclasses define their own function support numbers, which must not collide
+ * with the definitions here.
+ */
+ #define MINMAX_PROCNUM_OPCINFO 1
+ #define MINMAX_PROCNUM_ADDVALUE 2
+ #define MINMAX_PROCNUM_CONSISTENT 3
+
+ #define MINMAX_DEBUG
+
+ /* we allow debug if using GCC; otherwise don't bother */
+ #if defined(MINMAX_DEBUG) && defined(__GNUC__)
+ #define MINMAX_elog(level, ...) elog(level, __VA_ARGS__)
+ #else
+ #define MINMAX_elog(level, ...) ((void) 0)
+ #endif
+
+ /* minmax.c */
+ extern MinmaxDesc *minmax_build_mmdesc(Relation rel);
+ extern void minmax_free_mmdesc(MinmaxDesc *mmdesc);
+ extern void mm_page_init(Page page, uint16 type);
+ extern void mm_metapage_init(Page page, BlockNumber pagesPerRange,
+ uint16 version);
+
+ #endif /* MINMAX_INTERNAL_H */
*** /dev/null
--- b/src/include/access/minmax_page.h
***************
*** 0 ****
--- 1,88 ----
+ /*
+ * Prototypes and definitions for minmax page layouts
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_page.h
+ *
+ * NOTES
+ *
+ * These structs should really be private to specific minmax files, but it's
+ * useful to have them here so that they can be used by pageinspect and similar
+ * tools.
+ */
+ #ifndef MINMAX_PAGE_H
+ #define MINMAX_PAGE_H
+
+
+ /* special space on all minmax pages stores a "type" identifier */
+ #define MINMAX_PAGETYPE_META 0xF091
+ #define MINMAX_PAGETYPE_REVMAP_ARRAY 0xF092
+ #define MINMAX_PAGETYPE_REVMAP 0xF093
+ #define MINMAX_PAGETYPE_REGULAR 0xF094
+
+ typedef struct MinmaxSpecialSpace
+ {
+ uint16 type;
+ } MinmaxSpecialSpace;
+
+ /* Metapage definitions */
+ typedef struct MinmaxMetaPageData
+ {
+ uint32 minmaxMagic;
+ uint32 minmaxVersion;
+ BlockNumber pagesPerRange;
+ BlockNumber revmapArrayPages[1]; /* actually MAX_REVMAP_ARRAYPAGES */
+ } MinmaxMetaPageData;
+
+ /*
+ * Maximum number of revmap array pages that can be listed in the metapage.
+ * The computation leaves room for the page header, the metapage struct
+ * itself, and the minmax special space.
+ */
+ #define MAX_REVMAP_ARRAYPAGES \
+ ((BLCKSZ - \
+ MAXALIGN(SizeOfPageHeaderData) - \
+ offsetof(MinmaxMetaPageData, revmapArrayPages) - \
+ MAXALIGN(sizeof(MinmaxSpecialSpace)) ) / \
+ sizeof(BlockNumber))
+
+ #define MINMAX_CURRENT_VERSION 1
+ #define MINMAX_META_MAGIC 0xA8109CFA
+
+ #define MINMAX_METAPAGE_BLKNO 0
+
+ /* Definitions for regular revmap pages */
+ typedef struct RevmapContents
+ {
+ int32 rmr_logblk; /* logical blkno of this revmap page */
+ ItemPointerData rmr_tids[1]; /* really REGULAR_REVMAP_PAGE_MAXITEMS */
+ } RevmapContents;
+
+ #define REGULAR_REVMAP_CONTENT_SIZE \
+ (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - \
+ offsetof(RevmapContents, rmr_tids) - \
+ MAXALIGN(sizeof(MinmaxSpecialSpace)))
+ /* max num of items in the array */
+ #define REGULAR_REVMAP_PAGE_MAXITEMS \
+ (REGULAR_REVMAP_CONTENT_SIZE / sizeof(ItemPointerData))
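+ /*
+ * With the standard 8 kB block size this works out to roughly 1350 TIDs per
+ * regular revmap page; at the default 128 pages per range, one revmap page
+ * therefore covers somewhat over 1 GB of heap.
+ */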
+
+ /* Definitions for array revmap pages */
+ typedef struct RevmapArrayContents
+ {
+ int32 rma_nblocks;
+ BlockNumber rma_blocks[1]; /* really ARRAY_REVMAP_PAGE_MAXITEMS */
+ } RevmapArrayContents;
+
+ #define REVMAP_ARRAY_CONTENT_SIZE \
+ (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - \
+ offsetof(RevmapArrayContents, rma_blocks) - \
+ MAXALIGN(sizeof(MinmaxSpecialSpace)))
+ /* max num of items in the array */
+ #define ARRAY_REVMAP_PAGE_MAXITEMS \
+ (REVMAP_ARRAY_CONTENT_SIZE / sizeof(BlockNumber))
+
+
+ #endif /* MINMAX_PAGE_H */
*** /dev/null
--- b/src/include/access/minmax_revmap.h
***************
*** 0 ****
--- 1,41 ----
+ /*
+ * prototypes for minmax reverse range maps
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_revmap.h
+ */
+
+ #ifndef MINMAX_REVMAP_H
+ #define MINMAX_REVMAP_H
+
+ #include "access/minmax_tuple.h"
+ #include "storage/block.h"
+ #include "storage/itemptr.h"
+ #include "storage/off.h"
+ #include "utils/relcache.h"
+
+ /* struct definition lives in mmrevmap.c */
+ typedef struct mmRevmapAccess mmRevmapAccess;
+
+ extern mmRevmapAccess *mmRevmapAccessInit(Relation idxrel,
+ BlockNumber *pagesPerRange);
+ extern void mmRevmapAccessTerminate(mmRevmapAccess *rmAccess);
+
+ extern void mmRevmapCreate(Relation idxrel);
+ extern Buffer mmLockRevmapPageForUpdate(mmRevmapAccess *rmAccess,
+ BlockNumber heapBlk);
+ extern void mmSetHeapBlockItemptr(Buffer rmbuf, BlockNumber pagesPerRange,
+ BlockNumber heapBlk, ItemPointerData tid);
+ extern MMTuple *mmGetMMTupleForHeapBlock(mmRevmapAccess *rmAccess,
+ BlockNumber heapBlk, Buffer *buf, OffsetNumber *off,
+ int mode);
+
+ /* internal stuff also used by xlog replay */
+ extern BlockNumber initialize_rmr_page(Buffer newbuf, BlockNumber mapBlk);
+ extern void initialize_rma_page(Buffer buf);
+
+
+ #endif /* MINMAX_REVMAP_H */
*** /dev/null
--- b/src/include/access/minmax_tuple.h
***************
*** 0 ****
--- 1,89 ----
+ /*
+ * Declarations for dealing with MinMax-specific tuples.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_tuple.h
+ */
+ #ifndef MINMAX_TUPLE_H
+ #define MINMAX_TUPLE_H
+
+ #include "access/minmax_internal.h"
+ #include "access/tupdesc.h"
+
+
+ /*
+ * A minmax index stores one index tuple per page range. Each index tuple
+ * has one MMValues struct for each indexed column; in turn, each MMValues
+ * has (besides the null flags) an array of Datum whose size is determined by
+ * the opclass.
+ */
+ typedef struct MMValues
+ {
+ bool hasnulls; /* are there any nulls in the page range? */
+ bool allnulls; /* are all values nulls in the page range? */
+ Datum *values; /* current accumulated values */
+ } MMValues;
+
+ /*
+ * This struct represents one index tuple, comprising the minimum and maximum
+ * values for all indexed columns, within one page range. These values can
+ * only be meaningfully decoded with an appropriate MinmaxDesc.
+ */
+ typedef struct DeformedMMTuple
+ {
+ BlockNumber dt_blkno; /* heap blkno that the tuple is for */
+ MMValues dt_columns[FLEXIBLE_ARRAY_MEMBER];
+ } DeformedMMTuple;
+
+ /*
+ * An on-disk minmax tuple. The header is possibly followed by a nulls
+ * bitmask, with room for two null bits per indexed column (allnulls and
+ * hasnulls); an opclass-defined number of Datum values for each column
+ * follows.
+ */
+ typedef struct MMTuple
+ {
+ /* heap block number that the tuple is for */
+ BlockNumber mt_blkno;
+
+ /* ---------------
+ * mt_info is laid out in the following fashion:
+ *
+ * 7th (high) bit: has nulls
+ * 6th bit: unused
+ * 5th bit: unused
+ * 4-0 bit: offset of data
+ * ---------------
+ */
+ uint8 mt_info;
+ } MMTuple;
+
+ #define SizeOfMinMaxTuple (offsetof(MMTuple, mt_info) + sizeof(uint8))
+
+ /*
+ * t_info manipulation macros
+ */
+ #define MMIDX_OFFSET_MASK 0x1F
+ /* bit 0x20 is not used at present */
+ /* bit 0x40 is not used at present */
+ #define MMIDX_NULLS_MASK 0x80
+
+ #define MMTupleDataOffset(mmtup) ((Size) (((MMTuple *) (mmtup))->mt_info & MMIDX_OFFSET_MASK))
+ #define MMTupleHasNulls(mmtup) (((((MMTuple *) (mmtup))->mt_info & MMIDX_NULLS_MASK)) != 0)
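+ /*
+ * For instance, the fixed header is SizeOfMinMaxTuple = 5 bytes (a 4-byte
+ * block number plus the 1-byte mt_info); on a platform with 8-byte MAXALIGN,
+ * a tuple with no nulls bitmask thus records a data offset of 8.
+ */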
+
+
+ extern MMTuple *minmax_form_tuple(MinmaxDesc *mmdesc, BlockNumber blkno,
+ DeformedMMTuple *tuple, Size *size);
+ extern void minmax_free_tuple(MMTuple *tuple);
+ extern MMTuple *minmax_copy_tuple(MMTuple *tuple, Size len);
+ extern bool minmax_tuples_equal(MMTuple *a, Size alen, MMTuple *b, Size blen);
+
+ extern DeformedMMTuple *minmax_new_dtuple(MinmaxDesc *mmdesc);
+ extern void minmax_dtuple_initialize(DeformedMMTuple *dtuple,
+ MinmaxDesc *mmdesc);
+ extern DeformedMMTuple *minmax_deform_tuple(MinmaxDesc *mmdesc,
+ MMTuple *tuple);
+
+ #endif /* MINMAX_TUPLE_H */
*** /dev/null
--- b/src/include/access/minmax_xlog.h
***************
*** 0 ****
--- 1,132 ----
+ /*-------------------------------------------------------------------------
+ *
+ * minmax_xlog.h
+ * POSTGRES MinMax access XLOG definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/minmax_xlog.h
+ *
+ *-------------------------------------------------------------------------
+ */
+ #ifndef MINMAX_XLOG_H
+ #define MINMAX_XLOG_H
+
+ #include "access/xlog.h"
+ #include "storage/bufpage.h"
+ #include "storage/itemptr.h"
+ #include "storage/relfilenode.h"
+ #include "utils/relcache.h"
+
+
+ /*
+ * WAL record definitions for minmax's WAL operations
+ *
+ * XLOG allows us to store some information in the high 4 bits of the
+ * xl_info field of a log record.
+ */
+ #define XLOG_MINMAX_CREATE_INDEX 0x00
+ #define XLOG_MINMAX_INSERT 0x10
+ #define XLOG_MINMAX_UPDATE 0x20
+ #define XLOG_MINMAX_SAMEPAGE_UPDATE 0x30
+ #define XLOG_MINMAX_METAPG_SET 0x40
+ #define XLOG_MINMAX_RMARRAY_SET 0x50
+ #define XLOG_MINMAX_INIT_RMPG 0x60
+
+ #define XLOG_MINMAX_OPMASK 0x70
+ /*
+ * When we insert the first item on a new page, we restore the entire page in
+ * redo.
+ */
+ #define XLOG_MINMAX_INIT_PAGE 0x80
+
+ /* This is what we need to know about a minmax index create */
+ typedef struct xl_minmax_createidx
+ {
+ BlockNumber pagesPerRange;
+ RelFileNode node;
+ uint16 version;
+ } xl_minmax_createidx;
+ #define SizeOfMinmaxCreateIdx (offsetof(xl_minmax_createidx, version) + sizeof(uint16))
+
+ /*
+ * This is what we need to know about a minmax tuple insert
+ */
+ typedef struct xl_minmax_insert
+ {
+ RelFileNode node;
+ BlockNumber heapBlk;
+
+ /* extra information needed to update the revmap */
+ BlockNumber revmapBlk;
+ BlockNumber pagesPerRange;
+
+ ItemPointerData tid;
+ /* tuple data follows at end of struct */
+ } xl_minmax_insert;
+
+ #define SizeOfMinmaxInsert (offsetof(xl_minmax_insert, tid) + sizeof(ItemPointerData))
+
+ /*
+ * A cross-page update is the same as an insert, but it also stores the old tid.
+ */
+ typedef struct xl_minmax_update
+ {
+ xl_minmax_insert new;
+ ItemPointerData oldtid;
+ } xl_minmax_update;
+
+ #define SizeOfMinmaxUpdate (offsetof(xl_minmax_update, oldtid) + sizeof(ItemPointerData))
+
+ /* This is what we need to know about a minmax tuple samepage update */
+ typedef struct xl_minmax_samepage_update
+ {
+ RelFileNode node;
+ ItemPointerData tid;
+ /* tuple data follows at end of struct */
+ } xl_minmax_samepage_update;
+
+ #define SizeOfMinmaxSamepageUpdate (offsetof(xl_minmax_samepage_update, tid) + sizeof(ItemPointerData))
+
+ /* This is what we need to know about a "metapage set" operation */
+ typedef struct xl_minmax_metapg_set
+ {
+ RelFileNode node;
+ uint32 blkidx;
+ BlockNumber newpg;
+ } xl_minmax_metapg_set;
+
+ #define SizeOfMinmaxMetapgSet (offsetof(xl_minmax_metapg_set, newpg) + \
+ sizeof(BlockNumber))
+
+ /* This is what we need to know about a "revmap array set" operation */
+ typedef struct xl_minmax_rmarray_set
+ {
+ RelFileNode node;
+ BlockNumber rmarray;
+ uint32 blkidx;
+ BlockNumber newpg;
+ } xl_minmax_rmarray_set;
+
+ #define SizeOfMinmaxRmarraySet (offsetof(xl_minmax_rmarray_set, newpg) + \
+ sizeof(BlockNumber))
+
+ /* This is what we need to know when we initialize a new revmap page */
+ typedef struct xl_minmax_init_rmpg
+ {
+ RelFileNode node;
+ bool array; /* array revmap page or regular revmap page */
+ BlockNumber blkno;
+ BlockNumber logblk; /* only used by regular revmap pages */
+ } xl_minmax_init_rmpg;
+
+ #define SizeOfMinmaxInitRmpg (offsetof(xl_minmax_init_rmpg, blkno) + \
+ sizeof(BlockNumber))
+
+
+ extern void minmax_desc(StringInfo buf, XLogRecord *record);
+ extern void minmax_redo(XLogRecPtr lsn, XLogRecord *record);
+
+ #endif /* MINMAX_XLOG_H */
*** a/src/include/access/reloptions.h
--- b/src/include/access/reloptions.h
***************
*** 45,52 **** typedef enum relopt_kind
RELOPT_KIND_TABLESPACE = (1 << 7),
RELOPT_KIND_SPGIST = (1 << 8),
RELOPT_KIND_VIEW = (1 << 9),
/* if you add a new kind, make sure you update "last_default" too */
! RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_VIEW,
/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
RELOPT_KIND_MAX = (1 << 30)
} relopt_kind;
--- 45,53 ----
RELOPT_KIND_TABLESPACE = (1 << 7),
RELOPT_KIND_SPGIST = (1 << 8),
RELOPT_KIND_VIEW = (1 << 9),
+ RELOPT_KIND_MINMAX = (1 << 10),
/* if you add a new kind, make sure you update "last_default" too */
! RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_MINMAX,
/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
RELOPT_KIND_MAX = (1 << 30)
} relopt_kind;
*** a/src/include/access/relscan.h
--- b/src/include/access/relscan.h
***************
*** 35,42 **** typedef struct HeapScanDescData
bool rs_temp_snap; /* unregister snapshot at scan end? */
/* state set up at initscan time */
! BlockNumber rs_nblocks; /* number of blocks to scan */
BlockNumber rs_startblock; /* block # to start at */
BufferAccessStrategy rs_strategy; /* access strategy for reads */
bool rs_syncscan; /* report location to syncscan logic? */
--- 35,44 ----
bool rs_temp_snap; /* unregister snapshot at scan end? */
/* state set up at initscan time */
! BlockNumber rs_nblocks; /* total number of blocks in rel */
BlockNumber rs_startblock; /* block # to start at */
+ BlockNumber rs_initblock; /* block # to consider initial of rel */
+ BlockNumber rs_numblocks; /* number of blocks to scan */
BufferAccessStrategy rs_strategy; /* access strategy for reads */
bool rs_syncscan; /* report location to syncscan logic? */
*** a/src/include/access/rmgrlist.h
--- b/src/include/access/rmgrlist.h
***************
*** 42,44 **** PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_xlog_startup, gin_xlog_cleanup
--- 42,45 ----
PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_xlog_startup, gist_xlog_cleanup)
PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, NULL, NULL)
PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_xlog_startup, spg_xlog_cleanup)
+ PG_RMGR(RM_MINMAX_ID, "MinMax", minmax_redo, minmax_desc, NULL, NULL)
*** a/src/include/catalog/index.h
--- b/src/include/catalog/index.h
***************
*** 97,102 **** extern double IndexBuildHeapScan(Relation heapRelation,
--- 97,110 ----
bool allow_sync,
IndexBuildCallback callback,
void *callback_state);
+ extern double IndexBuildHeapRangeScan(Relation heapRelation,
+ Relation indexRelation,
+ IndexInfo *indexInfo,
+ bool allow_sync,
+ BlockNumber start_blockno,
+ BlockNumber end_blockno,
+ IndexBuildCallback callback,
+ void *callback_state);
extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
*** a/src/include/catalog/pg_am.h
--- b/src/include/catalog/pg_am.h
***************
*** 132,136 **** DESCR("GIN index access method");
--- 132,138 ----
DATA(insert OID = 4000 ( spgist 0 5 f f f f f t f t f f f 0 spginsert spgbeginscan spggettuple spggetbitmap spgrescan spgendscan spgmarkpos spgrestrpos spgbuild spgbuildempty spgbulkdelete spgvacuumcleanup spgcanreturn spgcostestimate spgoptions ));
DESCR("SP-GiST index access method");
#define SPGIST_AM_OID 4000
+ DATA(insert OID = 3580 ( minmax 5 7 f f f f t t f t t f f 0 mminsert mmbeginscan - mmgetbitmap mmrescan mmendscan mmmarkpos mmrestrpos mmbuild mmbuildempty mmbulkdelete mmvacuumcleanup - mmcostestimate mmoptions ));
+ #define MINMAX_AM_OID 3580
#endif /* PG_AM_H */
*** a/src/include/catalog/pg_amop.h
--- b/src/include/catalog/pg_amop.h
***************
*** 845,848 **** DATA(insert ( 3550 869 869 25 s 932 783 0 ));
--- 845,929 ----
DATA(insert ( 3550 869 869 26 s 933 783 0 ));
DATA(insert ( 3550 869 869 27 s 934 783 0 ));
+ /*
+ * int4_minmax_ops
+ */
+ DATA(insert ( 4054 23 23 1 s 97 3580 0 ));
+ DATA(insert ( 4054 23 23 2 s 523 3580 0 ));
+ DATA(insert ( 4054 23 23 3 s 96 3580 0 ));
+ DATA(insert ( 4054 23 23 4 s 525 3580 0 ));
+ DATA(insert ( 4054 23 23 5 s 521 3580 0 ));
+
+ /*
+ * numeric_minmax_ops
+ */
+ DATA(insert ( 4055 1700 1700 1 s 1754 3580 0 ));
+ DATA(insert ( 4055 1700 1700 2 s 1755 3580 0 ));
+ DATA(insert ( 4055 1700 1700 3 s 1752 3580 0 ));
+ DATA(insert ( 4055 1700 1700 4 s 1757 3580 0 ));
+ DATA(insert ( 4055 1700 1700 5 s 1756 3580 0 ));
+
+ /*
+ * text_minmax_ops
+ */
+ DATA(insert ( 4056 25 25 1 s 664 3580 0 ));
+ DATA(insert ( 4056 25 25 2 s 665 3580 0 ));
+ DATA(insert ( 4056 25 25 3 s 98 3580 0 ));
+ DATA(insert ( 4056 25 25 4 s 667 3580 0 ));
+ DATA(insert ( 4056 25 25 5 s 666 3580 0 ));
+
+ /*
+ * time_minmax_ops
+ */
+ DATA(insert ( 4057 1083 1083 1 s 1110 3580 0 ));
+ DATA(insert ( 4057 1083 1083 2 s 1111 3580 0 ));
+ DATA(insert ( 4057 1083 1083 3 s 1108 3580 0 ));
+ DATA(insert ( 4057 1083 1083 4 s 1113 3580 0 ));
+ DATA(insert ( 4057 1083 1083 5 s 1112 3580 0 ));
+
+ /*
+ * timetz_minmax_ops
+ */
+ DATA(insert ( 4058 1266 1266 1 s 1552 3580 0 ));
+ DATA(insert ( 4058 1266 1266 2 s 1553 3580 0 ));
+ DATA(insert ( 4058 1266 1266 3 s 1550 3580 0 ));
+ DATA(insert ( 4058 1266 1266 4 s 1555 3580 0 ));
+ DATA(insert ( 4058 1266 1266 5 s 1554 3580 0 ));
+
+ /*
+ * timestamp_minmax_ops
+ */
+ DATA(insert ( 4059 1114 1114 1 s 2062 3580 0 ));
+ DATA(insert ( 4059 1114 1114 2 s 2063 3580 0 ));
+ DATA(insert ( 4059 1114 1114 3 s 2060 3580 0 ));
+ DATA(insert ( 4059 1114 1114 4 s 2065 3580 0 ));
+ DATA(insert ( 4059 1114 1114 5 s 2064 3580 0 ));
+
+ /*
+ * timestamptz_minmax_ops
+ */
+ DATA(insert ( 4060 1184 1184 1 s 1322 3580 0 ));
+ DATA(insert ( 4060 1184 1184 2 s 1323 3580 0 ));
+ DATA(insert ( 4060 1184 1184 3 s 1320 3580 0 ));
+ DATA(insert ( 4060 1184 1184 4 s 1325 3580 0 ));
+ DATA(insert ( 4060 1184 1184 5 s 1324 3580 0 ));
+
+ /*
+ * date_minmax_ops
+ */
+ DATA(insert ( 4061 1082 1082 1 s 1095 3580 0 ));
+ DATA(insert ( 4061 1082 1082 2 s 1096 3580 0 ));
+ DATA(insert ( 4061 1082 1082 3 s 1093 3580 0 ));
+ DATA(insert ( 4061 1082 1082 4 s 1098 3580 0 ));
+ DATA(insert ( 4061 1082 1082 5 s 1097 3580 0 ));
+
+ /*
+ * char_minmax_ops
+ */
+ DATA(insert ( 4062 18 18 1 s 631 3580 0 ));
+ DATA(insert ( 4062 18 18 2 s 632 3580 0 ));
+ DATA(insert ( 4062 18 18 3 s 92 3580 0 ));
+ DATA(insert ( 4062 18 18 4 s 634 3580 0 ));
+ DATA(insert ( 4062 18 18 5 s 633 3580 0 ));
+
#endif /* PG_AMOP_H */
*** a/src/include/catalog/pg_amproc.h
--- b/src/include/catalog/pg_amproc.h
***************
*** 432,435 **** DATA(insert ( 4017 25 25 3 4029 ));
--- 432,508 ----
DATA(insert ( 4017 25 25 4 4030 ));
DATA(insert ( 4017 25 25 5 4031 ));
+ /* minmax */
+ DATA(insert ( 4054 23 23 1 3383 ));
+ DATA(insert ( 4054 23 23 2 3384 ));
+ DATA(insert ( 4054 23 23 3 3385 ));
+ DATA(insert ( 4054 23 23 4 66 ));
+ DATA(insert ( 4054 23 23 5 149 ));
+ DATA(insert ( 4054 23 23 6 150 ));
+ DATA(insert ( 4054 23 23 7 147 ));
+
+ DATA(insert ( 4055 1700 1700 1 3386 ));
+ DATA(insert ( 4055 1700 1700 2 3384 ));
+ DATA(insert ( 4055 1700 1700 3 3385 ));
+ DATA(insert ( 4055 1700 1700 4 1722 ));
+ DATA(insert ( 4055 1700 1700 5 1723 ));
+ DATA(insert ( 4055 1700 1700 6 1721 ));
+ DATA(insert ( 4055 1700 1700 7 1720 ));
+
+ DATA(insert ( 4056 25 25 1 3387 ));
+ DATA(insert ( 4056 25 25 2 3384 ));
+ DATA(insert ( 4056 25 25 3 3385 ));
+ DATA(insert ( 4056 25 25 4 740 ));
+ DATA(insert ( 4056 25 25 5 741 ));
+ DATA(insert ( 4056 25 25 6 743 ));
+ DATA(insert ( 4056 25 25 7 742 ));
+
+ DATA(insert ( 4057 1083 1083 1 3388 ));
+ DATA(insert ( 4057 1083 1083 2 3384 ));
+ DATA(insert ( 4057 1083 1083 3 3385 ));
+ DATA(insert ( 4057 1083 1083 4 1102 ));
+ DATA(insert ( 4057 1083 1083 5 1103 ));
+ DATA(insert ( 4057 1083 1083 6 1105 ));
+ DATA(insert ( 4057 1083 1083 7 1104 ));
+
+ DATA(insert ( 4058 1266 1266 1 3389 ));
+ DATA(insert ( 4058 1266 1266 2 3384 ));
+ DATA(insert ( 4058 1266 1266 3 3385 ));
+ DATA(insert ( 4058 1266 1266 4 1354 ));
+ DATA(insert ( 4058 1266 1266 5 1355 ));
+ DATA(insert ( 4058 1266 1266 6 1356 ));
+ DATA(insert ( 4058 1266 1266 7 1357 ));
+
+ DATA(insert ( 4059 1114 1114 1 3390 ));
+ DATA(insert ( 4059 1114 1114 2 3384 ));
+ DATA(insert ( 4059 1114 1114 3 3385 ));
+ DATA(insert ( 4059 1114 1114 4 2054 ));
+ DATA(insert ( 4059 1114 1114 5 2055 ));
+ DATA(insert ( 4059 1114 1114 6 2056 ));
+ DATA(insert ( 4059 1114 1114 7 2057 ));
+
+ DATA(insert ( 4060 1184 1184 1 3391 ));
+ DATA(insert ( 4060 1184 1184 2 3384 ));
+ DATA(insert ( 4060 1184 1184 3 3385 ));
+ DATA(insert ( 4060 1184 1184 4 1154 ));
+ DATA(insert ( 4060 1184 1184 5 1155 ));
+ DATA(insert ( 4060 1184 1184 6 1156 ));
+ DATA(insert ( 4060 1184 1184 7 1157 ));
+
+ DATA(insert ( 4061 1082 1082 1 3392 ));
+ DATA(insert ( 4061 1082 1082 2 3384 ));
+ DATA(insert ( 4061 1082 1082 3 3385 ));
+ DATA(insert ( 4061 1082 1082 4 1087 ));
+ DATA(insert ( 4061 1082 1082 5 1088 ));
+ DATA(insert ( 4061 1082 1082 6 1090 ));
+ DATA(insert ( 4061 1082 1082 7 1089 ));
+
+ DATA(insert ( 4062 18 18 1 3393 ));
+ DATA(insert ( 4062 18 18 2 3384 ));
+ DATA(insert ( 4062 18 18 3 3385 ));
+ DATA(insert ( 4062 18 18 4 1246 ));
+ DATA(insert ( 4062 18 18 5 72 ));
+ DATA(insert ( 4062 18 18 6 74 ));
+ DATA(insert ( 4062 18 18 7 73 ));
+
#endif /* PG_AMPROC_H */
*** a/src/include/catalog/pg_opclass.h
--- b/src/include/catalog/pg_opclass.h
***************
*** 235,239 **** DATA(insert ( 403 jsonb_ops PGNSP PGUID 4033 3802 t 0 ));
--- 235,248 ----
DATA(insert ( 405 jsonb_ops PGNSP PGUID 4034 3802 t 0 ));
DATA(insert ( 2742 jsonb_ops PGNSP PGUID 4036 3802 t 25 ));
DATA(insert ( 2742 jsonb_path_ops PGNSP PGUID 4037 3802 f 23 ));
+ DATA(insert ( 3580 int4_minmax_ops PGNSP PGUID 4054 23 t 0 ));
+ DATA(insert ( 3580 numeric_minmax_ops PGNSP PGUID 4055 1700 t 0 ));
+ DATA(insert ( 3580 text_minmax_ops PGNSP PGUID 4056 25 t 0 ));
+ DATA(insert ( 3580 time_minmax_ops PGNSP PGUID 4057 1083 t 0 ));
+ DATA(insert ( 3580 timetz_minmax_ops PGNSP PGUID 4058 1266 t 0 ));
+ DATA(insert ( 3580 timestamp_minmax_ops PGNSP PGUID 4059 1114 t 0 ));
+ DATA(insert ( 3580 timestamptz_minmax_ops PGNSP PGUID 4060 1184 t 0 ));
+ DATA(insert ( 3580 date_minmax_ops PGNSP PGUID 4061 1082 t 0 ));
+ DATA(insert ( 3580 char_minmax_ops PGNSP PGUID 4062 18 t 0 ));
#endif /* PG_OPCLASS_H */
*** a/src/include/catalog/pg_opfamily.h
--- b/src/include/catalog/pg_opfamily.h
***************
*** 157,160 **** DATA(insert OID = 4035 ( 783 jsonb_ops PGNSP PGUID ));
--- 157,170 ----
DATA(insert OID = 4036 ( 2742 jsonb_ops PGNSP PGUID ));
DATA(insert OID = 4037 ( 2742 jsonb_path_ops PGNSP PGUID ));
+ DATA(insert OID = 4054 ( 3580 int4_minmax_ops PGNSP PGUID ));
+ DATA(insert OID = 4055 ( 3580 numeric_minmax_ops PGNSP PGUID ));
+ DATA(insert OID = 4056 ( 3580 text_minmax_ops PGNSP PGUID ));
+ DATA(insert OID = 4057 ( 3580 time_minmax_ops PGNSP PGUID ));
+ DATA(insert OID = 4058 ( 3580 timetz_minmax_ops PGNSP PGUID ));
+ DATA(insert OID = 4059 ( 3580 timestamp_minmax_ops PGNSP PGUID ));
+ DATA(insert OID = 4060 ( 3580 timestamptz_minmax_ops PGNSP PGUID ));
+ DATA(insert OID = 4061 ( 3580 date_minmax_ops PGNSP PGUID ));
+ DATA(insert OID = 4062 ( 3580 char_minmax_ops PGNSP PGUID ));
+
#endif /* PG_OPFAMILY_H */
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
***************
*** 565,570 **** DESCR("btree(internal)");
--- 565,598 ----
DATA(insert OID = 2785 ( btoptions PGNSP PGUID 12 1 0 0 0 f f f f t f s 2 0 17 "1009 16" _null_ _null_ _null_ _null_ btoptions _null_ _null_ _null_ ));
DESCR("btree(internal)");
+ DATA(insert OID = 3789 ( mmgetbitmap PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 20 "2281 2281" _null_ _null_ _null_ _null_ mmgetbitmap _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3790 ( mminsert PGNSP PGUID 12 1 0 0 0 f f f f t f v 6 0 16 "2281 2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mminsert _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3791 ( mmbeginscan PGNSP PGUID 12 1 0 0 0 f f f f t f v 3 0 2281 "2281 2281 2281" _null_ _null_ _null_ _null_ mmbeginscan _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3792 ( mmrescan PGNSP PGUID 12 1 0 0 0 f f f f t f v 5 0 2278 "2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmrescan _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3793 ( mmendscan PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmendscan _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3794 ( mmmarkpos PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmmarkpos _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3795 ( mmrestrpos PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmrestrpos _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3796 ( mmbuild PGNSP PGUID 12 1 0 0 0 f f f f t f v 3 0 2281 "2281 2281 2281" _null_ _null_ _null_ _null_ mmbuild _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3797 ( mmbuildempty PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmbuildempty _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3798 ( mmbulkdelete PGNSP PGUID 12 1 0 0 0 f f f f t f v 4 0 2281 "2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmbulkdelete _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3799 ( mmvacuumcleanup PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ mmvacuumcleanup _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3800 ( mmcostestimate PGNSP PGUID 12 1 0 0 0 f f f f t f v 7 0 2278 "2281 2281 2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmcostestimate _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3801 ( mmoptions PGNSP PGUID 12 1 0 0 0 f f f f t f s 2 0 17 "1009 16" _null_ _null_ _null_ _null_ mmoptions _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+
+
DATA(insert OID = 339 ( poly_same PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "604 604" _null_ _null_ _null_ _null_ poly_same _null_ _null_ _null_ ));
DATA(insert OID = 340 ( poly_contain PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "604 604" _null_ _null_ _null_ _null_ poly_contain _null_ _null_ _null_ ));
DATA(insert OID = 341 ( poly_left PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "604 604" _null_ _null_ _null_ _null_ poly_left _null_ _null_ _null_ ));
***************
*** 4066,4071 **** DATA(insert OID = 2747 ( arrayoverlap PGNSP PGUID 12 1 0 0 0 f f f f t f i
--- 4094,4123 ----
DATA(insert OID = 2748 ( arraycontains PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "2277 2277" _null_ _null_ _null_ _null_ arraycontains _null_ _null_ _null_ ));
DATA(insert OID = 2749 ( arraycontained PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "2277 2277" _null_ _null_ _null_ _null_ arraycontained _null_ _null_ _null_ ));
+ /* Minmax */
+ DATA(insert OID = 3384 ( minmax_sortable_add_value PGNSP PGUID 12 1 0 0 0 f f f f t f i 5 0 16 "2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmSortableAddValue _null_ _null_ _null_ ));
+ DESCR("MinMax sortable datatype support");
+ DATA(insert OID = 3385 ( minmax_sortable_consistent PGNSP PGUID 12 1 0 0 0 f f f f t f i 3 0 16 "2281 2281 2281" _null_ _null_ _null_ _null_ mmSortableConsistent _null_ _null_ _null_ ));
+ DESCR("MinMax sortable datatype support");
+ DATA(insert OID = 3383 ( minmax_sortable_opcinfo_int4 PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ mmSortableOpcInfo_int4 _null_ _null_ _null_ ));
+ DESCR("MinMax sortable datatype support");
+ DATA(insert OID = 3386 ( minmax_sortable_opcinfo_numeric PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ mmSortableOpcInfo_numeric _null_ _null_ _null_ ));
+ DESCR("MinMax sortable datatype support");
+ DATA(insert OID = 3387 ( minmax_sortable_opcinfo_text PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ mmSortableOpcInfo_text _null_ _null_ _null_ ));
+ DESCR("MinMax sortable datatype support");
+ DATA(insert OID = 3388 ( minmax_sortable_opcinfo_time PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ mmSortableOpcInfo_time _null_ _null_ _null_ ));
+ DESCR("MinMax sortable datatype support");
+ DATA(insert OID = 3389 ( minmax_sortable_opcinfo_timetz PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ mmSortableOpcInfo_timetz _null_ _null_ _null_ ));
+ DESCR("MinMax sortable datatype support");
+ DATA(insert OID = 3390 ( minmax_sortable_opcinfo_timestamp PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ mmSortableOpcInfo_timestamp _null_ _null_ _null_ ));
+ DESCR("MinMax sortable datatype support");
+ DATA(insert OID = 3391 ( minmax_sortable_opcinfo_timestamptz PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ mmSortableOpcInfo_timestamptz _null_ _null_ _null_ ));
+ DESCR("MinMax sortable datatype support");
+ DATA(insert OID = 3392 ( minmax_sortable_opcinfo_date PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ mmSortableOpcInfo_date _null_ _null_ _null_ ));
+ DESCR("MinMax sortable datatype support");
+ DATA(insert OID = 3393 ( minmax_sortable_opcinfo_char PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ mmSortableOpcInfo_char _null_ _null_ _null_ ));
+ DESCR("MinMax sortable datatype support");
+
/* userlock replacements */
DATA(insert OID = 2880 ( pg_advisory_lock PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "20" _null_ _null_ _null_ _null_ pg_advisory_lock_int8 _null_ _null_ _null_ ));
DESCR("obtain exclusive advisory lock");
*** a/src/include/storage/bufpage.h
--- b/src/include/storage/bufpage.h
***************
*** 403,408 **** extern Size PageGetExactFreeSpace(Page page);
--- 403,410 ----
extern Size PageGetHeapFreeSpace(Page page);
extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
+ extern void PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos,
+ int nitems);
extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
*** a/src/include/utils/selfuncs.h
--- b/src/include/utils/selfuncs.h
***************
*** 195,200 **** extern Datum hashcostestimate(PG_FUNCTION_ARGS);
--- 195,201 ----
extern Datum gistcostestimate(PG_FUNCTION_ARGS);
extern Datum spgcostestimate(PG_FUNCTION_ARGS);
extern Datum gincostestimate(PG_FUNCTION_ARGS);
+ extern Datum mmcostestimate(PG_FUNCTION_ARGS);
/* Functions in array_selfuncs.c */
*** a/src/test/regress/expected/opr_sanity.out
--- b/src/test/regress/expected/opr_sanity.out
***************
*** 1591,1596 **** ORDER BY 1, 2, 3;
--- 1591,1601 ----
2742 | 9 | ?
2742 | 10 | ?|
2742 | 11 | ?&
+ 3580 | 1 | <
+ 3580 | 2 | <=
+ 3580 | 3 | =
+ 3580 | 4 | >=
+ 3580 | 5 | >
4000 | 1 | <<
4000 | 1 | ~<~
4000 | 2 | &<
***************
*** 1613,1619 **** ORDER BY 1, 2, 3;
4000 | 15 | >
4000 | 16 | @>
4000 | 18 | =
! (80 rows)
-- Check that all opclass search operators have selectivity estimators.
-- This is not absolutely required, but it seems a reasonable thing
--- 1618,1624 ----
4000 | 15 | >
4000 | 16 | @>
4000 | 18 | =
! (85 rows)
-- Check that all opclass search operators have selectivity estimators.
-- This is not absolutely required, but it seems a reasonable thing
***************
*** 1775,1785 **** WHERE NOT (
-- GIN has six support functions. 1-3 are mandatory, 5 is optional, and
-- at least one of 4 and 6 must be given.
-- SP-GiST has five support functions, all mandatory
amname = 'btree' AND procnums @> '{1}' OR
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}'
);
amname | opfname | amproclefttype | amprocrighttype | procnums
--------+---------+----------------+-----------------+----------
--- 1780,1792 ----
-- GIN has six support functions. 1-3 are mandatory, 5 is optional, and
-- at least one of 4 and 6 must be given.
-- SP-GiST has five support functions, all mandatory
+ -- MinMax has seven support functions, all mandatory
amname = 'btree' AND procnums @> '{1}' OR
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}' OR
! amname = 'minmax' AND procnums = '{1, 2, 3, 4, 5, 6, 7}'
);
amname | opfname | amproclefttype | amprocrighttype | procnums
--------+---------+----------------+-----------------+----------
***************
*** 1800,1806 **** WHERE NOT (
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}'
);
amname | opcname | procnums
--------+---------+----------
--- 1807,1814 ----
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}' OR
! amname = 'minmax' AND procnums = '{1, 2, 3, 4, 5, 6, 7}'
);
amname | opcname | procnums
--------+---------+----------
*** a/src/test/regress/sql/opr_sanity.sql
--- b/src/test/regress/sql/opr_sanity.sql
***************
*** 1178,1188 **** WHERE NOT (
-- GIN has six support functions. 1-3 are mandatory, 5 is optional, and
-- at least one of 4 and 6 must be given.
-- SP-GiST has five support functions, all mandatory
amname = 'btree' AND procnums @> '{1}' OR
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}'
);
-- Also, check if there are any pg_opclass entries that don't seem to have
--- 1178,1190 ----
-- GIN has six support functions. 1-3 are mandatory, 5 is optional, and
-- at least one of 4 and 6 must be given.
-- SP-GiST has five support functions, all mandatory
+ -- MinMax has seven support functions, all mandatory
amname = 'btree' AND procnums @> '{1}' OR
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}' OR
! amname = 'minmax' AND procnums = '{1, 2, 3, 4, 5, 6, 7}'
);
-- Also, check if there are any pg_opclass entries that don't seem to have
***************
*** 1201,1207 **** WHERE NOT (
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}'
);
-- Unfortunately, we can't check the amproc link very well because the
--- 1203,1210 ----
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}' OR
! amname = 'minmax' AND procnums = '{1, 2, 3, 4, 5, 6, 7}'
);
-- Unfortunately, we can't check the amproc link very well because the
On 08/15/2014 02:02 AM, Alvaro Herrera wrote:
Alvaro Herrera wrote:
Heikki Linnakangas wrote:
I'm sure this still needs some cleanup, but here's the patch, based
on your v14. Now that I know what this approach looks like, I still
like it much better. The insert and update code is somewhat more
complicated, because you have to be careful to lock the old page,
new page, and revmap page in the right order. But it's not too bad,
and it gets rid of all the complexity in vacuum.

It seems there is some issue here, because pageinspect tells me the
index is not growing properly for some reason. minmax_revmap_data gives
me this array of TIDs after a bunch of insert/vacuum/delete/ etc:

I fixed this issue, and did a lot more rework and bugfixing. Here's
v15, based on v14-heikki2.
Thanks!
I think remaining issues are mostly minimal (pageinspect should output
block number alongside each tuple, now that we have it, for example.)
There's this one issue I left in my patch version that I think we should
do something about:
+ /*
+ * No luck. Assume that the revmap was updated concurrently.
+ *
+ * XXX: it would be nice to add some kind of a sanity check here to
+ * avoid looping infinitely, if the revmap points to wrong tuple for
+ * some reason.
+ */
This happens when we follow the revmap to a tuple, but find that the
tuple points to a different block than what the revmap claimed.
Currently, we just assume that it's because the tuple was updated
concurrently, but while hacking, I frequently had a broken index where
the revmap pointed to bogus tuples or the tuples had a missing/wrong
block number on them, and ran into infinite loop here. It's clearly a
case of a corrupt index and shouldn't happen, but I would imagine that
it's a fairly typical way this would fail in production too because of
hardware issues or bugs. So I think we need to work a bit harder to stop
the looping and throw an error instead.
Perhaps something as simple as keeping a loop counter and giving up
after 1000 attempts would be good enough. The window between releasing
the lock on the revmap, and acquiring the lock on the page containing
the MMTuple is very narrow, so the chances of losing that race to a
concurrent update more than 1-2 times in a row is vanishingly small.
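For concreteness, the cap could look roughly like the sketch below. This is
only an illustration of the idea, not code from the patch: the retry limit,
the name MM_MAX_REVMAP_RETRIES, and the matching check are invented, and the
actual revmap-following and locking steps are elided.

    /*
     * Sketch of a bounded retry loop for the revmap lookup.  If the index
     * tuple found through the revmap does not cover the heap block we asked
     * about, assume a concurrent update and retry -- but only a limited
     * number of times, so that a corrupt revmap raises an error instead of
     * looping forever.  MM_MAX_REVMAP_RETRIES and tuple_matches_range are
     * invented names, not part of the patch.
     */
    #define MM_MAX_REVMAP_RETRIES	1000

    int		attempts;

    for (attempts = 0;; attempts++)
    {
        /* ... follow the revmap TID and lock the page holding the MMTuple ... */

        if (tuple_matches_range)	/* stand-in for the real sanity check */
            break;

        /* ... unlock, and re-read the revmap on the next iteration ... */

        if (attempts >= MM_MAX_REVMAP_RETRIES)
            ereport(ERROR,
                    (errcode(ERRCODE_INDEX_CORRUPTED),
                     errmsg("revmap does not point to a valid tuple for block %u",
                            heapBlk)));
    }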
- Heikki
On 08/15/2014 10:26 AM, Heikki Linnakangas wrote:
On 08/15/2014 02:02 AM, Alvaro Herrera wrote:
Alvaro Herrera wrote:
Heikki Linnakangas wrote:
I'm sure this still needs some cleanup, but here's the patch, based
on your v14. Now that I know what this approach looks like, I still
like it much better. The insert and update code is somewhat more
complicated, because you have to be careful to lock the old page,
new page, and revmap page in the right order. But it's not too bad,
and it gets rid of all the complexity in vacuum.

It seems there is some issue here, because pageinspect tells me the
index is not growing properly for some reason. minmax_revmap_data gives
me this array of TIDs after a bunch of insert/vacuum/delete/ etc:

I fixed this issue, and did a lot more rework and bugfixing. Here's
v15, based on v14-heikki2.

Thanks!

I think remaining issues are mostly minimal (pageinspect should output
block number alongside each tuple, now that we have it, for example.)

There's this one issue I left in my patch version that I think we should
do something about:

+ /*
+ * No luck. Assume that the revmap was updated concurrently.
+ *
+ * XXX: it would be nice to add some kind of a sanity check here to
+ * avoid looping infinitely, if the revmap points to wrong tuple for
+ * some reason.
+ */

This happens when we follow the revmap to a tuple, but find that the
tuple points to a different block than what the revmap claimed.
Currently, we just assume that it's because the tuple was updated
concurrently, but while hacking, I frequently had a broken index where
the revmap pointed to bogus tuples or the tuples had a missing/wrong
block number on them, and ran into infinite loop here. It's clearly a
case of a corrupt index and shouldn't happen, but I would imagine that
it's a fairly typical way this would fail in production too because of
hardware issues or bugs. So I think we need to work a bit harder to stop
the looping and throw an error instead.Perhaps something as simple as keeping a loop counter and giving up
after 1000 attempts would be good enough. The window between releasing
the lock on the revmap, and acquiring the lock on the page containing
the MMTuple is very narrow, so the chances of losing that race to a
concurrent update more than 1-2 times in a row is vanishingly small.
Reading the patch more closely, I see that you added a check that when
we loop, we throw an error if the new item pointer in the revmap is the
same as before. In theory, it's possible that two concurrent updates
happen: one that moves the tuple we're looking for elsewhere, and
another that moves it back again. The probability of that is also
vanishingly small, so maybe that's OK. Or we could check the LSN; if the
revmap has been updated, its LSN must've changed.
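For illustration, the LSN variant might look something like this (a sketch
only: revmap_page, tid and heapBlk are stand-in names, and the actual revmap
read, tuple lookup and locking are elided):

    /*
     * Sketch of the LSN-based termination check.  Each time an item pointer
     * is read from the revmap page, remember that page's LSN.  If we come
     * around again with the same TID and an unchanged LSN, the revmap cannot
     * have been updated concurrently, so the index must be corrupt.
     */
    XLogRecPtr      last_lsn = InvalidXLogRecPtr;
    ItemPointerData last_tid;
    ItemPointerData tid;

    ItemPointerSetInvalid(&last_tid);
    for (;;)
    {
        XLogRecPtr  lsn;

        /* ... with the revmap page locked, read the TID for heapBlk into tid ... */
        lsn = PageGetLSN(revmap_page);

        /* ... follow tid; if the MMTuple covers heapBlk, we are done ... */

        if (ItemPointerIsValid(&last_tid) &&
            ItemPointerEquals(&tid, &last_tid) &&
            lsn == last_lsn)
            elog(ERROR, "revmap is stale for block %u", heapBlk);

        last_tid = tid;
        last_lsn = lsn;
    }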
- Heikki
On Fri, Aug 15, 2014 at 8:02 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Alvaro Herrera wrote:
Heikki Linnakangas wrote:
I'm sure this still needs some cleanup, but here's the patch, based
on your v14. Now that I know what this approach looks like, I still
like it much better. The insert and update code is somewhat more
complicated, because you have to be careful to lock the old page,
new page, and revmap page in the right order. But it's not too bad,
and it gets rid of all the complexity in vacuum.

It seems there is some issue here, because pageinspect tells me the
index is not growing properly for some reason. minmax_revmap_data gives
me this array of TIDs after a bunch of insert/vacuum/delete/ etc:

I fixed this issue, and did a lot more rework and bugfixing. Here's
v15, based on v14-heikki2.
I've not read the patch yet. But while testing the feature, I found that:
* A Brin index cannot be created on a CHAR(n) column.
Maybe other data types have the same problem.
* FILLFACTOR cannot be set on a Brin index.
Are these intentional?
Regards,
--
Fujii Masao
On 08/15/2014 02:02 AM, Alvaro Herrera wrote:
Alvaro Herrera wrote:
Heikki Linnakangas wrote:
I'm sure this still needs some cleanup, but here's the patch, based
on your v14. Now that I know what this approach looks like, I still
like it much better. The insert and update code is somewhat more
complicated, because you have to be careful to lock the old page,
new page, and revmap page in the right order. But it's not too bad,
and it gets rid of all the complexity in vacuum.

It seems there is some issue here, because pageinspect tells me the
index is not growing properly for some reason. minmax_revmap_data gives
me this array of TIDs after a bunch of insert/vacuum/delete/ etc:

I fixed this issue, and did a lot more rework and bugfixing. Here's
v15, based on v14-heikki2.
So, the other design change I've been advocating is to store the revmap
in the first N blocks, instead of having the two-level structure with
array pages and revmap pages.
Attached is a patch for that, to be applied after v15. When the revmap
needs to be expanded, all the tuples on it are moved elsewhere
one-by-one. That adds some latency to the unfortunate guy who needs to
do that, but as the patch stands, the revmap is only ever extended by
VACUUM or CREATE INDEX, so I think that's fine. Like with my previous
patch, the point is to demonstrate how much simpler the code becomes
this way; I'm sure there are bugs and cleanup still necessary.
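To make the addressing concrete: with the revmap occupying the blocks right
after the metapage, finding the revmap entry for a heap block is plain
arithmetic. The sketch below is illustrative only; ITEMS_PER_REVMAP_PAGE
stands in for REGULAR_REVMAP_PAGE_MAXITEMS and its value here is made up.

    #include <stdint.h>

    #define ITEMS_PER_REVMAP_PAGE 1360      /* illustrative value only */

    typedef uint32_t BlockNumber;           /* mirrors the PostgreSQL typedef */

    /* Physical block holding the revmap entry for heapBlk (block 0 is the metapage). */
    static BlockNumber
    revmap_block_for(BlockNumber heapBlk, BlockNumber pagesPerRange)
    {
        BlockNumber mapBlk = (heapBlk / pagesPerRange) / ITEMS_PER_REVMAP_PAGE;

        return mapBlk + 1;
    }

    /* Slot within that revmap page. */
    static uint32_t
    revmap_index_for(BlockNumber heapBlk, BlockNumber pagesPerRange)
    {
        return (heapBlk / pagesPerRange) % ITEMS_PER_REVMAP_PAGE;
    }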
PS. Spotted one oversight in patch v15: callers of mm_doupdate must
check the return value, and retry the operation if it returns false.
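For what it's worth, a caller following that convention might look roughly
like this (a sketch only, with the re-lookup step elided; the arguments
follow the mm_doupdate signature in v15):

    /*
     * mm_doupdate() returning false means the update could not be applied
     * (for example the old tuple moved, or the old page was repurposed for
     * the revmap); the caller must re-locate the tuple and try again.
     */
    bool        extended;

    while (!mm_doupdate(idxrel, pagesPerRange, rmAccess, heapBlk,
                        oldbuf, oldoff, origtup, origsz, newtup, newsz,
                        samepage, &extended))
    {
        /*
         * ... re-read the revmap for heapBlk, refresh oldbuf/oldoff,
         * origtup/origsz and samepage, then loop around ...
         */
    }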
- Heikki
Attachments:
minmax-revmap-redesign-over-v15-1.patch (text/x-diff)
commit ce4df0e9dbd43f7e3d4fdf3f7920301f81f17d63
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Fri Aug 15 18:32:19 2014 +0300
Get rid of array pages. Instead, move all tuples out of the way
diff --git a/contrib/pageinspect/mmfuncs.c b/contrib/pageinspect/mmfuncs.c
index 6cd559a..51cc9e2 100644
--- a/contrib/pageinspect/mmfuncs.c
+++ b/contrib/pageinspect/mmfuncs.c
@@ -74,9 +74,6 @@ minmax_page_type(PG_FUNCTION_ARGS)
case MINMAX_PAGETYPE_META:
type = "meta";
break;
- case MINMAX_PAGETYPE_REVMAP_ARRAY:
- type = "revmap array";
- break;
case MINMAX_PAGETYPE_REVMAP:
type = "revmap";
break;
@@ -343,11 +340,9 @@ minmax_metapage_info(PG_FUNCTION_ARGS)
Page page;
MinmaxMetaPageData *meta;
TupleDesc tupdesc;
- Datum values[3];
- bool nulls[3];
- ArrayBuildState *astate = NULL;
+ Datum values[4];
+ bool nulls[4];
HeapTuple htup;
- int i;
page = verify_minmax_page(raw_page, MINMAX_PAGETYPE_META, "metapage");
@@ -361,22 +356,8 @@ minmax_metapage_info(PG_FUNCTION_ARGS)
MemSet(nulls, 0, sizeof(nulls));
values[0] = CStringGetTextDatum(psprintf("0x%08X", meta->minmaxMagic));
values[1] = Int32GetDatum(meta->minmaxVersion);
-
- /* Extract (possibly empty) list of revmap array page numbers. */
- for (i = 0; i < MAX_REVMAP_ARRAYPAGES; i++)
- {
- BlockNumber blkno;
-
- blkno = meta->revmapArrayPages[i];
- if (blkno == InvalidBlockNumber)
- break; /* XXX or continue? */
- astate = accumArrayResult(astate, Int64GetDatum((int64) blkno),
- false, INT8OID, CurrentMemoryContext);
- }
- if (astate == NULL)
- nulls[2] = true;
- else
- values[2] = makeArrayResult(astate, CurrentMemoryContext);
+ values[2] = Int32GetDatum(meta->pagesPerRange);
+ values[3] = Int64GetDatum(meta->lastRevmapPage);
htup = heap_form_tuple(tupdesc, values, nulls);
@@ -384,34 +365,6 @@ minmax_metapage_info(PG_FUNCTION_ARGS)
}
/*
- * Return the BlockNumber array stored in a revmap array page
- */
-Datum
-minmax_revmap_array_data(PG_FUNCTION_ARGS)
-{
- bytea *raw_page = PG_GETARG_BYTEA_P(0);
- Page page;
- ArrayBuildState *astate = NULL;
- RevmapArrayContents *contents;
- Datum blkarr;
- int i;
-
- page = verify_minmax_page(raw_page, MINMAX_PAGETYPE_REVMAP_ARRAY,
- "revmap array");
-
- contents = (RevmapArrayContents *) PageGetContents(page);
-
- for (i = 0; i < contents->rma_nblocks; i++)
- astate = accumArrayResult(astate,
- Int64GetDatum((int64) contents->rma_blocks[i]),
- false, INT8OID, CurrentMemoryContext);
- Assert(astate != NULL);
-
- blkarr = makeArrayResult(astate, CurrentMemoryContext);
- PG_RETURN_DATUM(blkarr);
-}
-
-/*
* Return the TID array stored in a minmax revmap page
*/
Datum
@@ -437,7 +390,7 @@ minmax_revmap_data(PG_FUNCTION_ARGS)
/* Extract values from the revmap page */
contents = (RevmapContents *) PageGetContents(page);
MemSet(nulls, 0, sizeof(nulls));
- values[0] = Int64GetDatum((uint64) contents->rmr_logblk);
+ values[0] = Int64GetDatum((uint64) 0);
/* Extract (possibly empty) list of TIDs in this page. */
for (i = 0; i < REGULAR_REVMAP_PAGE_MAXITEMS; i++)
diff --git a/contrib/pageinspect/pageinspect--1.2.sql b/contrib/pageinspect/pageinspect--1.2.sql
index 56c9ba8..cba90ca 100644
--- a/contrib/pageinspect/pageinspect--1.2.sql
+++ b/contrib/pageinspect/pageinspect--1.2.sql
@@ -110,7 +110,7 @@ LANGUAGE C STRICT;
-- minmax_metapage_info()
--
CREATE FUNCTION minmax_metapage_info(IN page bytea, OUT magic text,
- OUT version integer, OUT revmap_array_pages BIGINT[])
+ OUT version integer, OUT pagesperrange integer, OUT lastrevmappage bigint)
AS 'MODULE_PATHNAME', 'minmax_metapage_info'
LANGUAGE C STRICT;
@@ -128,16 +128,9 @@ AS 'MODULE_PATHNAME', 'minmax_page_items'
LANGUAGE C STRICT;
--
--- minmax_revmap_array_data()
-CREATE FUNCTION minmax_revmap_array_data(IN page bytea,
- OUT revmap_pages BIGINT[])
-AS 'MODULE_PATHNAME', 'minmax_revmap_array_data'
-LANGUAGE C STRICT;
-
---
-- minmax_revmap_data()
CREATE FUNCTION minmax_revmap_data(IN page bytea,
- OUT logblk BIGINT, OUT pages tid[])
+ OUT dummy bigint, OUT pages tid[])
AS 'MODULE_PATHNAME', 'minmax_revmap_data'
LANGUAGE C STRICT;
diff --git a/src/backend/access/minmax/minmax.c b/src/backend/access/minmax/minmax.c
index addb3a0..18f85d7 100644
--- a/src/backend/access/minmax/minmax.c
+++ b/src/backend/access/minmax/minmax.c
@@ -34,9 +34,11 @@
#include "storage/freespace.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
+#include "storage/smgr.h"
#include "utils/datum.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
+#include "utils/rel.h"
#include "utils/syscache.h"
@@ -76,8 +78,8 @@ static void summarize_range(MMBuildState *mmstate, Relation heapRel,
static bool mm_doupdate(Relation idxrel, BlockNumber pagesPerRange,
mmRevmapAccess *rmAccess, BlockNumber heapBlk,
Buffer oldbuf, OffsetNumber oldoff,
- MMTuple *origtup, Size origsz,
- MMTuple *newtup, Size newsz,
+ const MMTuple *origtup, Size origsz,
+ const MMTuple *newtup, Size newsz,
bool samepage, bool *extended);
static void mm_doinsert(Relation idxrel, BlockNumber pagesPerRange,
mmRevmapAccess *rmAccess, Buffer *buffer, BlockNumber heapblkno,
@@ -85,6 +87,7 @@ static void mm_doinsert(Relation idxrel, BlockNumber pagesPerRange,
static Buffer mm_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
bool *extended);
static void form_and_insert_tuple(MMBuildState *mmstate);
+static Size mm_page_get_freespace(Page page);
/*
@@ -536,11 +539,15 @@ mmbuild(PG_FUNCTION_ARGS)
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("expression indexes not supported")));
+ /*
+ * Critical section not required, because on error the creation of the
+ * whole relation will be rolled back.
+ */
+
meta = ReadBuffer(index, P_NEW);
Assert(BufferGetBlockNumber(meta) == MINMAX_METAPAGE_BLKNO);
LockBuffer(meta, BUFFER_LOCK_EXCLUSIVE);
- START_CRIT_SECTION();
mm_metapage_init(BufferGetPage(meta), MinmaxGetPagesPerRange(index),
MINMAX_CURRENT_VERSION);
MarkBufferDirty(meta);
@@ -568,17 +575,11 @@ mmbuild(PG_FUNCTION_ARGS)
}
UnlockReleaseBuffer(meta);
- END_CRIT_SECTION();
-
- /*
- * Set up an empty revmap, and get access to it
- */
- mmRevmapCreate(index);
- rmAccess = mmRevmapAccessInit(index, &pagesPerRange);
/*
* Initialize our state, including the deformed tuple state.
*/
+ rmAccess = mmRevmapAccessInit(index, &pagesPerRange);
mmstate = initialize_mm_buildstate(index, rmAccess, pagesPerRange);
/*
@@ -664,10 +665,11 @@ mmvacuumcleanup(PG_FUNCTION_ARGS)
heapRel = heap_open(IndexGetRelation(RelationGetRelid(info->index), false),
AccessShareLock);
+ rmAccess = mmRevmapAccessInit(info->index, &pagesPerRange);
+
/*
* Scan the revmap to find unsummarized items.
*/
- rmAccess = mmRevmapAccessInit(info->index, &pagesPerRange);
buf = InvalidBuffer;
heapNumBlocks = RelationGetNumberOfBlocks(heapRel);
for (heapBlk = 0; heapBlk < heapNumBlocks; heapBlk += pagesPerRange)
@@ -751,13 +753,32 @@ mm_page_init(Page page, uint16 type)
}
/*
+ * Return the amount of free space on a regular minmax index page.
+ *
+ * If the page is not a regular page, or has been marked with the
+ * MINMAX_EVACUATE_PAGE flag, returns 0.
+ */
+static Size
+mm_page_get_freespace(Page page)
+{
+ MinmaxSpecialSpace *special;
+
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
+ if (!MINMAX_IS_REGULAR_PAGE(page) ||
+ (special->flags & MINMAX_EVACUATE_PAGE) != 0)
+ return 0;
+ else
+ return PageGetFreeSpace(page);
+
+}
+
+/*
* Initialize a new minmax index' metapage.
*/
void
mm_metapage_init(Page page, BlockNumber pagesPerRange, uint16 version)
{
MinmaxMetaPageData *metadata;
- int i;
mm_page_init(page, MINMAX_PAGETYPE_META);
@@ -766,8 +787,7 @@ mm_metapage_init(Page page, BlockNumber pagesPerRange, uint16 version)
metadata->minmaxMagic = MINMAX_META_MAGIC;
metadata->pagesPerRange = pagesPerRange;
metadata->minmaxVersion = version;
- for (i = 0; i < MAX_REVMAP_ARRAYPAGES; i++)
- metadata->revmapArrayPages[i] = InvalidBlockNumber;
+ metadata->lastRevmapPage = 0;
}
/*
@@ -875,7 +895,7 @@ terminate_mm_buildstate(MMBuildState *mmstate)
page = BufferGetPage(mmstate->currentInsertBuf);
RecordPageWithFreeSpace(mmstate->irel,
BufferGetBlockNumber(mmstate->currentInsertBuf),
- PageGetFreeSpace(page));
+ mm_page_get_freespace(page));
ReleaseBuffer(mmstate->currentInsertBuf);
}
vacuumfsm = mmstate->extended;
@@ -938,8 +958,8 @@ static bool
mm_doupdate(Relation idxrel, BlockNumber pagesPerRange,
mmRevmapAccess *rmAccess, BlockNumber heapBlk,
Buffer oldbuf, OffsetNumber oldoff,
- MMTuple *origtup, Size origsz,
- MMTuple *newtup, Size newsz,
+ const MMTuple *origtup, Size origsz,
+ const MMTuple *newtup, Size newsz,
bool samepage, bool *extended)
{
Page oldpage;
@@ -947,11 +967,15 @@ mm_doupdate(Relation idxrel, BlockNumber pagesPerRange,
MMTuple *oldtup;
Size oldsz;
Buffer newbuf;
+ MinmaxSpecialSpace *special;
if (!samepage)
{
/* need a page on which to put the item */
newbuf = mm_getinsertbuffer(idxrel, oldbuf, newsz, extended);
+ if (!BufferIsValid(newbuf))
+ return false;
+
/*
* Note: it's possible (though unlikely) that the returned newbuf is
* the same as oldbuf, if mm_getinsertbuffer determined that the old
@@ -985,6 +1009,8 @@ mm_doupdate(Relation idxrel, BlockNumber pagesPerRange,
return false;
}
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(oldpage);
+
/*
* Great, the old tuple is intact. We can proceed with the update.
*
@@ -994,7 +1020,8 @@ mm_doupdate(Relation idxrel, BlockNumber pagesPerRange,
* the caller told us there isn't, if a concurrent updated moved a tuple
* elsewhere or replaced a tuple with a smaller one.
*/
- if (newsz <= origsz || PageGetExactFreeSpace(oldpage) >= (origsz - newsz))
+ if ((special->flags & MINMAX_EVACUATE_PAGE) == 0 &&
+ (newsz <= origsz || PageGetExactFreeSpace(oldpage) >= (origsz - newsz)))
{
if (BufferIsValid(newbuf))
UnlockReleaseBuffer(newbuf);
@@ -1151,34 +1178,44 @@ mm_doinsert(Relation idxrel, BlockNumber pagesPerRange,
itemsz = MAXALIGN(itemsz);
+ /*
+ * Lock the revmap page for the update. Note that this may require
+ * extending the revmap, which in turn may require moving the currently
+ * pinned index block out of the way.
+ */
+ revmapbuf = mmLockRevmapPageForUpdate(rmAccess, heapBlk);
+
+ /*
+ * Obtain a locked buffer to insert the new tuple. Note mm_getinsertbuffer
+ * ensures there's enough space in the returned buffer.
+ */
if (BufferIsValid(*buffer))
{
page = BufferGetPage(*buffer);
LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
- if (PageGetFreeSpace(page) < itemsz)
+
+ /*
+ * It's possible that another backend (or ourselves!) extended the
+ * revmap over the page we held a pin on, so we cannot assume that
+ * it's still a regular page.
+ */
+ if (mm_page_get_freespace(page) < itemsz)
{
UnlockReleaseBuffer(*buffer);
*buffer = InvalidBuffer;
}
}
-
- /*
- * Obtain a locked buffer to insert the new tuple. Note mm_getinsertbuffer
- * ensures there's enough space in the returned buffer.
- */
if (!BufferIsValid(*buffer))
{
*buffer = mm_getinsertbuffer(idxrel, InvalidBuffer, itemsz, extended);
+ Assert(BufferIsValid(*buffer));
page = BufferGetPage(*buffer);
- Assert(PageGetFreeSpace(page) >= itemsz);
+ Assert(mm_page_get_freespace(page) >= itemsz);
}
page = BufferGetPage(*buffer);
blk = BufferGetBlockNumber(*buffer);
- /* lock the revmap for the update */
- revmapbuf = mmLockRevmapPageForUpdate(rmAccess, heapBlk);
-
START_CRIT_SECTION();
off = PageAddItem(page, (Item) tup, itemsz, InvalidOffsetNumber,
false, false);
@@ -1233,12 +1270,116 @@ mm_doinsert(Relation idxrel, BlockNumber pagesPerRange,
}
/*
+ * Checks if a regular minmax index page is empty.
+ *
+ * If it's not, it's marked for "evacuation", meaning that no new tuples will
+ * be added to it.
+ */
+bool
+mm_start_evacuating_page(Relation idxRel, Buffer buf)
+{
+ OffsetNumber off;
+ OffsetNumber maxoff;
+ MinmaxSpecialSpace *special;
+ Page page;
+
+ page = BufferGetPage(buf);
+
+ if (PageIsNew(page))
+ return false;
+
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
+
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (off = FirstOffsetNumber; off <= maxoff; off++)
+ {
+ ItemId lp;
+
+ lp = PageGetItemId(page, off);
+ if (ItemIdIsUsed(lp))
+ {
+ /* prevent other backends from adding more stuff to this page. */
+ special->flags |= MINMAX_EVACUATE_PAGE;
+ MarkBufferDirtyHint(buf, true);
+
+ return true;
+ }
+ }
+ return false;
+}
+
+/*
+ * Move all tuples out of a page.
+ *
+ * The caller must hold an exclusive lock on the page. The lock and pin are
+ * released.
+ */
+void
+mm_evacuate_page(Relation idxRel, Buffer buf)
+{
+ OffsetNumber off;
+ OffsetNumber maxoff;
+ MinmaxSpecialSpace *special;
+ Page page;
+ mmRevmapAccess *rmAccess;
+ BlockNumber pagesPerRange;
+
+ rmAccess = mmRevmapAccessInit(idxRel, &pagesPerRange);
+
+ page = BufferGetPage(buf);
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
+
+ Assert(special->flags & MINMAX_EVACUATE_PAGE);
+
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (off = FirstOffsetNumber; off <= maxoff; off++)
+ {
+ MMTuple *tup;
+ Size sz;
+ ItemId lp;
+ bool extended = false;
+
+ lp = PageGetItemId(page, off);
+ if (ItemIdIsUsed(lp))
+ {
+ tup = (MMTuple *) PageGetItem(page, lp);
+ sz = ItemIdGetLength(lp);
+
+ tup = minmax_copy_tuple(tup, sz);
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ if (!mm_doupdate(idxRel, pagesPerRange, rmAccess, tup->mt_blkno, buf,
+ off, tup, sz, tup, sz, false, &extended))
+ off--; /* retry */
+
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+ if (extended)
+ IndexFreeSpaceMapVacuum(idxRel);
+
+ /* It's possible that someone extended the revmap over this page */
+ if (!MINMAX_IS_REGULAR_PAGE(page))
+ break;
+ }
+ }
+
+ mmRevmapAccessTerminate(rmAccess);
+
+ UnlockReleaseBuffer(buf);
+}
+
+/*
* Return a pinned and locked buffer which can be used to insert an index item
* of size itemsz. If oldbuf is a valid buffer, it is also locked (in a order
* determined to avoid deadlocks.)
*
* If there's no existing page with enough free space to accomodate the new
* item, the relation is extended. If this happens, *extended is set to true.
+ *
+ * If we find that the old page is no longer a regular index page (because
+ * of a revmap extension), the old buffer is unlocked and we return
+ * InvalidBuffer.
*/
static Buffer
mm_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
@@ -1261,7 +1402,9 @@ mm_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
* if we have to restart here, neither buffer is locked and buf is not
* a pinned buffer.
*/
- newblk = GetPageWithFreeSpace(irel, itemsz);
+ newblk = RelationGetTargetBlock(irel);
+ if (newblk == InvalidBlockNumber)
+ newblk = GetPageWithFreeSpace(irel, itemsz);
for (;;)
{
Buffer buf;
@@ -1298,14 +1441,19 @@ mm_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
buf = ReadBuffer(irel, newblk);
}
- if (BufferIsValid(oldbuf) && newblk < oldblk)
+ if (BufferIsValid(oldbuf) && oldblk < newblk)
+ {
LockBuffer(oldbuf, BUFFER_LOCK_EXCLUSIVE);
+ if (!MINMAX_IS_REGULAR_PAGE(BufferGetPage(oldbuf)))
+ {
+ LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+ ReleaseBuffer(buf);
+ return InvalidBuffer;
+ }
+ }
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- if (BufferIsValid(oldbuf) && newblk > oldblk)
- LockBuffer(oldbuf, BUFFER_LOCK_EXCLUSIVE);
-
if (extensionLockHeld)
UnlockRelationForExtension(irel, ExclusiveLock);
@@ -1319,13 +1467,21 @@ mm_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
* Check that the new page has enough free space, and return it if it
* does; otherwise start over. Note that we allow for the FSM to be
* out of date here, and in that case we update it and move on.
+ *
+ * (mm_page_get_freespace also checks that the FSM didn't hand us a
+ * page that has since been repurposed for the revmap.)
*/
- freespace = PageGetFreeSpace(page);
-
+ freespace = mm_page_get_freespace(page);
if (freespace >= itemsz)
{
if (extended)
*was_extended = true;
+ RelationSetTargetBlock(irel, BufferGetBlockNumber(buf));
+
+ /* Lock the old buffer if not locked already */
+ if (BufferIsValid(oldbuf) && newblk < oldblk)
+ LockBuffer(oldbuf, BUFFER_LOCK_EXCLUSIVE);
+
return buf;
}
@@ -1352,7 +1508,7 @@ mm_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
if (newblk != oldblk)
UnlockReleaseBuffer(buf);
- if (BufferIsValid(oldbuf))
+ if (BufferIsValid(oldbuf) && oldblk < newblk)
LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
newblk = RecordAndGetPageWithFreeSpace(irel, newblk, freespace, itemsz);
diff --git a/src/backend/access/minmax/mmrevmap.c b/src/backend/access/minmax/mmrevmap.c
index 923490e..48df2cd 100644
--- a/src/backend/access/minmax/mmrevmap.c
+++ b/src/backend/access/minmax/mmrevmap.c
@@ -8,14 +8,10 @@
* into a table that violates the previously recorded min/max values, a new
* tuple is inserted into the index and the revmap is updated to point to it.
*
- * The pages of the revmap are interspersed in the index's main fork. The
- * first revmap page is always the index's page number one (that is,
- * immediately after the metapage). Subsequent revmap pages are allocated as
- * they are needed; their locations are tracked by "array pages". The metapage
- * contains a large BlockNumber array, which correspond to array pages. Thus,
- * to find the second revmap page, we read the metapage and obtain the block
- * number of the first array page; we then read that page, and the first
- * element in it is the revmap page we're looking for.
+ * The pages of the revmap are in the beginning of the index, starting at
+ * immediately after the metapage at block 1. When the revmap needs to be
+ * expanded, all tuples on the regular minmax page at that block (if any) are
+ * moved out of the way.
*
* Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -41,7 +37,7 @@
/*
- * In regular revmap pages, each item stores an ItemPointerData. These defines
+ * In revmap pages, each item stores an ItemPointerData. These defines
* let one find the logical revmap page number and index number of the revmap
* item for the given heap block number.
*/
@@ -50,29 +46,19 @@
#define HEAPBLK_TO_REVMAP_INDEX(pagesPerRange, heapBlk) \
((heapBlk / pagesPerRange) % REGULAR_REVMAP_PAGE_MAXITEMS)
-/*
- * In array revmap pages, each item stores a BlockNumber. These defines let
- * one find the page and index number of a given revmap block number. Note
- * that the first revmap page (revmap logical page number 0) is always stored
- * in physical block number 1, so array pages do not store that one.
- */
-#define MAPBLK_TO_RMARRAY_BLK(rmBlk) ((rmBlk - 1) / ARRAY_REVMAP_PAGE_MAXITEMS)
-#define MAPBLK_TO_RMARRAY_INDEX(rmBlk) ((rmBlk - 1) % ARRAY_REVMAP_PAGE_MAXITEMS)
-
struct mmRevmapAccess
{
Relation idxrel;
BlockNumber pagesPerRange;
+ BlockNumber lastRevmapPage; /* cached from the metapage */
Buffer metaBuf;
Buffer currBuf;
- Buffer currArrayBuf;
- BlockNumber *revmapArrayPages;
};
/* typedef appears in minmax_revmap.h */
-static Buffer mm_getnewbuffer(Relation irel);
+static void rm_extend(mmRevmapAccess *rmAccess);
/*
* Initialize an access object for a reverse range map, which can be used to
@@ -94,8 +80,7 @@ mmRevmapAccessInit(Relation idxrel, BlockNumber *pagesPerRange)
rmAccess->idxrel = idxrel;
rmAccess->pagesPerRange = metadata->pagesPerRange;
rmAccess->currBuf = InvalidBuffer;
- rmAccess->currArrayBuf = InvalidBuffer;
- rmAccess->revmapArrayPages = NULL;
+ rmAccess->lastRevmapPage = InvalidBlockNumber;
if (pagesPerRange)
*pagesPerRange = metadata->pagesPerRange;
@@ -109,30 +94,24 @@ mmRevmapAccessInit(Relation idxrel, BlockNumber *pagesPerRange)
void
mmRevmapAccessTerminate(mmRevmapAccess *rmAccess)
{
- if (rmAccess->revmapArrayPages != NULL)
- pfree(rmAccess->revmapArrayPages);
if (rmAccess->metaBuf != InvalidBuffer)
ReleaseBuffer(rmAccess->metaBuf);
if (rmAccess->currBuf != InvalidBuffer)
ReleaseBuffer(rmAccess->currBuf);
- if (rmAccess->currArrayBuf != InvalidBuffer)
- ReleaseBuffer(rmAccess->currArrayBuf);
pfree(rmAccess);
}
/*
- * Lock the metapage as specified by called, and update the given rmAccess with
- * the metapage data. The metapage buffer is locked when this function
- * returns; it's the caller's responsibility to unlock it.
+ * Read the metapage and update the given rmAccess with the metapage data.
*/
static void
-rmaccess_get_metapage(mmRevmapAccess *rmAccess, int lockmode)
+rmaccess_read_metapage(mmRevmapAccess *rmAccess)
{
MinmaxMetaPageData *metadata;
MinmaxSpecialSpace *special PG_USED_FOR_ASSERTS_ONLY;
Page metapage;
- LockBuffer(rmAccess->metaBuf, lockmode);
+ LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_SHARE);
metapage = BufferGetPage(rmAccess->metaBuf);
#ifdef USE_ASSERT_CHECKING
@@ -141,51 +120,11 @@ rmaccess_get_metapage(mmRevmapAccess *rmAccess, int lockmode)
Assert(special->type == MINMAX_PAGETYPE_META);
#endif
- /* first time through? allocate the array */
- if (rmAccess->revmapArrayPages == NULL)
- rmAccess->revmapArrayPages =
- palloc(sizeof(BlockNumber) * MAX_REVMAP_ARRAYPAGES);
-
metadata = (MinmaxMetaPageData *) PageGetContents(metapage);
- memcpy(rmAccess->revmapArrayPages, metadata->revmapArrayPages,
- sizeof(BlockNumber) * MAX_REVMAP_ARRAYPAGES);
-}
-
-/*
- * Update the metapage, so that item arrayBlkIdx in the array of revmap array
- * pages points to block number newPgBlkno.
- */
-static void
-update_minmax_metapg(Relation idxrel, Buffer meta, uint32 arrayBlkIdx,
- BlockNumber newPgBlkno)
-{
- MinmaxMetaPageData *metadata;
-
- metadata = (MinmaxMetaPageData *) PageGetContents(BufferGetPage(meta));
-
- START_CRIT_SECTION();
- metadata->revmapArrayPages[arrayBlkIdx] = newPgBlkno;
- MarkBufferDirty(meta);
- if (RelationNeedsWAL(idxrel))
- {
- xl_minmax_metapg_set xlrec;
- XLogRecPtr recptr;
- XLogRecData rdata;
- xlrec.node = idxrel->rd_node;
- xlrec.blkidx = arrayBlkIdx;
- xlrec.newpg = newPgBlkno;
+ rmAccess->lastRevmapPage = metadata->lastRevmapPage;
- rdata.data = (char *) &xlrec;
- rdata.len = SizeOfMinmaxMetapgSet;
- rdata.buffer = InvalidBuffer;
- rdata.buffer_std = false;
- rdata.next = NULL;
-
- recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_METAPG_SET, &rdata);
- PageSetLSN(BufferGetPage(meta), recptr);
- }
- END_CRIT_SECTION();
+ LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_UNLOCK);
}
/*
@@ -200,250 +139,140 @@ update_minmax_metapg(Relation idxrel, Buffer meta, uint32 arrayBlkIdx,
static BlockNumber
rm_get_phys_blkno(mmRevmapAccess *rmAccess, BlockNumber mapBlk, bool extend)
{
- int arrayBlkIdx;
- BlockNumber arrayBlk;
- RevmapArrayContents *contents;
- int revmapIdx;
BlockNumber targetblk;
+ if (rmAccess->lastRevmapPage == InvalidBlockNumber)
+ rmaccess_read_metapage(rmAccess);
+
/* the first revmap page is always block number 1 */
- if (mapBlk == 0)
- return (BlockNumber) 1;
+ targetblk = mapBlk + 1;
- /*
- * For all other cases, take the long route of checking the metapage and
- * revmap array pages.
- */
+ if (targetblk <= rmAccess->lastRevmapPage)
+ return targetblk;
- /*
- * Copy the revmap array from the metapage into private storage, if not
- * done already in this scan.
- */
- if (rmAccess->revmapArrayPages == NULL)
- {
- rmaccess_get_metapage(rmAccess, BUFFER_LOCK_SHARE);
- LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_UNLOCK);
- }
+ if (!extend)
+ return InvalidBlockNumber;
- /*
- * Consult the metapage array; if the array page we need is not set there,
- * we need to extend the index to allocate the array page, and update the
- * metapage array.
- */
- arrayBlkIdx = MAPBLK_TO_RMARRAY_BLK(mapBlk);
- if (arrayBlkIdx > MAX_REVMAP_ARRAYPAGES)
- elog(ERROR, "non-existant revmap array page requested");
+ /* Extend the revmap */
+ while (targetblk > rmAccess->lastRevmapPage)
+ rm_extend(rmAccess);
- arrayBlk = rmAccess->revmapArrayPages[arrayBlkIdx];
- if (arrayBlk == InvalidBlockNumber)
- {
- /* if not asked to extend, there's no further work to do here */
- if (!extend)
- return InvalidBlockNumber;
-
- /*
- * If we need to create a new array page, check the metapage again;
- * someone might have created it after the last time we read the
- * metapage. This time we acquire an exclusive lock, since we may need
- * to extend. Lock before doing the physical relation extension, to
- * avoid leaving an unused page around in case someone does this
- * concurrently. Note that, unfortunately, we will be keeping the lock
- * on the metapage alongside the relation extension lock, while doing a
- * syscall involving disk I/O. Extending to add a new revmap array page
- * is fairly infrequent, so it shouldn't be too bad.
- *
- * XXX it is possible to extend the relation unconditionally before
- * locking the metapage, and later if we find that someone else had
- * already added this page, save the page in FSM as MaxFSMRequestSize.
- * That would be better for concurrency. Explore someday.
- */
- rmaccess_get_metapage(rmAccess, BUFFER_LOCK_EXCLUSIVE);
+ return targetblk;
+}
- if (rmAccess->revmapArrayPages[arrayBlkIdx] == InvalidBlockNumber)
- {
- BlockNumber newPgBlkno;
-
- /*
- * Ok, definitely need to allocate a new revmap array page;
- * initialize a new page to the initial (empty) array revmap state
- * and register it in metapage.
- */
- rmAccess->currArrayBuf = mm_getnewbuffer(rmAccess->idxrel);
- START_CRIT_SECTION();
- initialize_rma_page(rmAccess->currArrayBuf);
- MarkBufferDirty(rmAccess->currArrayBuf);
- if (RelationNeedsWAL(rmAccess->idxrel))
- {
- xl_minmax_init_rmpg xlrec;
- XLogRecPtr recptr;
- XLogRecData rdata;
-
- xlrec.node = rmAccess->idxrel->rd_node;
- xlrec.blkno = BufferGetBlockNumber(rmAccess->currArrayBuf);
- xlrec.array = true;
- xlrec.logblk = InvalidBlockNumber;
-
- rdata.data = (char *) &xlrec;
- rdata.len = SizeOfMinmaxInitRmpg;
- rdata.buffer = InvalidBuffer; /* FIXME */
- rdata.buffer_std = false;
- rdata.next = NULL;
-
- recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_INIT_RMPG, &rdata);
- PageSetLSN(BufferGetPage(rmAccess->currArrayBuf), recptr);
- }
- END_CRIT_SECTION();
- LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_UNLOCK);
- newPgBlkno = BufferGetBlockNumber(rmAccess->currArrayBuf);
- rmAccess->revmapArrayPages[arrayBlkIdx] = newPgBlkno;
+/*
+ * Extend the revmap by one page.
+ *
+ * If there is an existing minmax page at that block, it is atomically moved
+ * out of the way, and the redirect pointer on the new revmap page is set
+ * to point to its new location.
+ *
+ * If rmAccess->lastRevmapPage is out-of-date, it's updated and nothing else
+ * is done.
+ */
+static void
+rm_extend(mmRevmapAccess *rmAccess)
+{
+ Buffer buf;
+ Page page;
+ Page metapage;
+ MinmaxMetaPageData *metadata;
+ BlockNumber mapBlk;
+ BlockNumber nblocks;
+ Relation irel = rmAccess->idxrel;
+ bool needLock = !RELATION_IS_LOCAL(irel);
- MINMAX_elog(DEBUG2, "allocated block for revmap array page: %u",
- BufferGetBlockNumber(rmAccess->currArrayBuf));
+ /*
+ * Lock the metapage. This locks out concurrent extensions of the revmap,
+ * but note that we still need to grab the relation extension lock because
+ * another backend can still extend the index with regular minmax pages.
+ */
+ LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_EXCLUSIVE);
+ metapage = BufferGetPage(rmAccess->metaBuf);
+ metadata = (MinmaxMetaPageData *) PageGetContents(metapage);
- /* Update the metapage to point to the new array page. */
- update_minmax_metapg(rmAccess->idxrel, rmAccess->metaBuf, arrayBlkIdx,
- newPgBlkno);
- }
+ /* Check that our cached lastRevmapPage value was up-to-date */
+ if (metadata->lastRevmapPage != rmAccess->lastRevmapPage)
+ {
+ rmAccess->lastRevmapPage = metadata->lastRevmapPage;
LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_UNLOCK);
- arrayBlk = rmAccess->revmapArrayPages[arrayBlkIdx];
+ return;
}
+ mapBlk = metadata->lastRevmapPage + 1;
- /*
- * By here, we know the array page is set in the metapage array. Read that
- * page; except that if we just allocated it, or we already hold pin on it,
- * we don't need to read it again.
- */
- Assert(arrayBlk != InvalidBlockNumber);
-
- if (rmAccess->currArrayBuf == InvalidBuffer ||
- BufferGetBlockNumber(rmAccess->currArrayBuf) != arrayBlk)
+ nblocks = RelationGetNumberOfBlocks(irel);
+ if (mapBlk < nblocks)
{
- if (rmAccess->currArrayBuf != InvalidBuffer)
- ReleaseBuffer(rmAccess->currArrayBuf);
+ /* Check that the existing index block is sane. */
+ buf = ReadBuffer(rmAccess->idxrel, mapBlk);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ page = BufferGetPage(buf);
+ }
+ else
+ {
+ if (needLock)
+ LockRelationForExtension(irel, ExclusiveLock);
+
+ buf = ReadBuffer(irel, P_NEW);
+ Assert(BufferGetBlockNumber(buf) == mapBlk);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ page = BufferGetPage(buf);
- rmAccess->currArrayBuf =
- ReadBuffer(rmAccess->idxrel, arrayBlk);
+ if (needLock)
+ UnlockRelationForExtension(irel, ExclusiveLock);
}
- LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_SHARE);
+ /* Check that it's a regular block (or an empty page) */
+ if (!PageIsNew(page) && !MINMAX_IS_REGULAR_PAGE(page))
+ elog(ERROR, "unexpected minmax page type: 0x%04X",
+ MINMAX_PAGE_TYPE(page));
- /*
- * And now we can inspect its contents; if the target page is set, we can
- * just return. Even if not set, we can also return if caller asked us not
- * to extend the revmap.
- */
- contents = (RevmapArrayContents *)
- PageGetContents(BufferGetPage(rmAccess->currArrayBuf));
- revmapIdx = MAPBLK_TO_RMARRAY_INDEX(mapBlk);
- if (!extend || revmapIdx <= contents->rma_nblocks - 1)
+ /* If the page is in use, evacuate it and restart */
+ if (mm_start_evacuating_page(rmAccess->idxrel, buf))
{
- LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_UNLOCK);
-
- return contents->rma_blocks[revmapIdx];
+ LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_UNLOCK);
+ mm_evacuate_page(rmAccess->idxrel, buf);
+ return;
}
/*
- * Trade our shared lock in the array page for exclusive, because we now
- * need to allocate one more revmap page and modify the array page.
+ * Ok, we have now locked the metapage and the target block. Re-initialize
+ * it as a revmap page.
*/
- LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_UNLOCK);
- LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_EXCLUSIVE);
-
- contents = (RevmapArrayContents *)
- PageGetContents(BufferGetPage(rmAccess->currArrayBuf));
+ START_CRIT_SECTION();
- /*
- * If someone else already set the value while we were waiting for the
- * exclusive lock, we're done; otherwise, allocate a new block as the
- * new revmap page, and update the array page to point to it.
- */
- if (contents->rma_blocks[revmapIdx] != InvalidBlockNumber)
- {
- targetblk = contents->rma_blocks[revmapIdx];
- }
- else
- {
- Buffer newbuf;
-
- /* not possible to get here if we weren't asked to extend */
- Assert(extend);
- newbuf = mm_getnewbuffer(rmAccess->idxrel);
- START_CRIT_SECTION();
- targetblk = initialize_rmr_page(newbuf, mapBlk);
- MarkBufferDirty(newbuf);
- if (RelationNeedsWAL(rmAccess->idxrel))
- {
- xl_minmax_init_rmpg xlrec;
- XLogRecPtr recptr;
- XLogRecData rdata;
-
- xlrec.node = rmAccess->idxrel->rd_node;
- xlrec.blkno = BufferGetBlockNumber(newbuf);
- xlrec.array = false;
- xlrec.logblk = mapBlk;
-
- rdata.data = (char *) &xlrec;
- rdata.len = SizeOfMinmaxInitRmpg;
- rdata.buffer = InvalidBuffer;
- rdata.buffer_std = false;
- rdata.next = NULL;
-
- recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_INIT_RMPG, &rdata);
- PageSetLSN(BufferGetPage(newbuf), recptr);
- }
- END_CRIT_SECTION();
+ /* the rmr_tids array is initialized to all invalid by PageInit */
+ mm_page_init(page, MINMAX_PAGETYPE_REVMAP);
+ MarkBufferDirty(buf);
- UnlockReleaseBuffer(newbuf);
+ metadata->lastRevmapPage = mapBlk;
+ MarkBufferDirty(rmAccess->metaBuf);
- /*
- * Now make the revmap array page point to the newly allocated page.
- * If necessary, also update the total number of items in it.
- */
- START_CRIT_SECTION();
+ if (RelationNeedsWAL(rmAccess->idxrel))
+ {
+ xl_minmax_revmap_extend xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata;
- contents->rma_blocks[revmapIdx] = targetblk;
- if (contents->rma_nblocks < revmapIdx + 1)
- contents->rma_nblocks = revmapIdx + 1;
- MarkBufferDirty(rmAccess->currArrayBuf);
+ xlrec.node = rmAccess->idxrel->rd_node;
+ xlrec.targetBlk = mapBlk;
- /* XLOG stuff */
- if (RelationNeedsWAL(rmAccess->idxrel))
- {
- xl_minmax_rmarray_set xlrec;
- XLogRecPtr recptr;
- XLogRecData rdata[2];
- uint8 info;
-
- info = XLOG_MINMAX_RMARRAY_SET;
-
- xlrec.node = rmAccess->idxrel->rd_node;
- xlrec.rmarray = BufferGetBlockNumber(rmAccess->currArrayBuf);
- xlrec.blkidx = revmapIdx;
- xlrec.newpg = targetblk;
-
- rdata[0].data = (char *) &xlrec;
- rdata[0].len = SizeOfMinmaxRmarraySet;
- rdata[0].buffer = InvalidBuffer;
- rdata[0].buffer_std = false;
- rdata[0].next = &rdata[1];
-
- rdata[1].data = NULL;
- rdata[1].len = 0;
- rdata[1].buffer = rmAccess->currArrayBuf;
- rdata[1].buffer_std = false;
- rdata[1].next = NULL;
-
- recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
- PageSetLSN(BufferGetPage(rmAccess->currArrayBuf), recptr);
- }
+ rdata.data = (char *) &xlrec;
+ rdata.len = SizeOfMinmaxRevmapExtend;
+ rdata.buffer = InvalidBuffer;
+ rdata.buffer_std = false;
+ rdata.next = NULL;
- END_CRIT_SECTION();
+ recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_REVMAP_EXTEND, &rdata);
+ PageSetLSN(metapage, recptr);
+ PageSetLSN(page, recptr);
}
- LockBuffer(rmAccess->currArrayBuf, BUFFER_LOCK_UNLOCK);
+ END_CRIT_SECTION();
- return targetblk;
+ LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_UNLOCK);
+ UnlockReleaseBuffer(buf);
}
/*
@@ -604,17 +433,23 @@ mmGetMMTupleForHeapBlock(mmRevmapAccess *rmAccess, BlockNumber heapBlk,
}
LockBuffer(*buf, mode);
page = BufferGetPage(*buf);
- lp = PageGetItemId(page, *off);
- if (ItemIdIsUsed(lp))
- {
- mmtup = (MMTuple *) PageGetItem(page, lp);
- if (mmtup->mt_blkno == heapBlk)
+ /* If we land on a revmap page, start over */
+ if (MINMAX_IS_REGULAR_PAGE(page))
+ {
+ lp = PageGetItemId(page, *off);
+ if (ItemIdIsUsed(lp))
{
- /* found it! */
- return mmtup;
+ mmtup = (MMTuple *) PageGetItem(page, lp);
+
+ if (mmtup->mt_blkno == heapBlk)
+ {
+ /* found it! */
+ return mmtup;
+ }
}
}
+
/*
* No luck. Assume that the revmap was updated concurrently.
*
@@ -627,106 +462,3 @@ mmGetMMTupleForHeapBlock(mmRevmapAccess *rmAccess, BlockNumber heapBlk,
/* not reached, but keep compiler quiet */
return NULL;
}
-
-/*
- * Initialize the revmap of a new minmax index.
- *
- * NB -- caller is assumed to WAL-log this operation
- */
-void
-mmRevmapCreate(Relation idxrel)
-{
- Buffer buf;
-
- /*
- * The first page of the revmap is always stored in block number 1 of the
- * main fork. Because of this, the only thing we need to do is request
- * a new page; we assume we are called immediately after the metapage has
- * been initialized.
- */
- buf = mm_getnewbuffer(idxrel);
- Assert(BufferGetBlockNumber(buf) == 1);
-
- mm_page_init(BufferGetPage(buf), MINMAX_PAGETYPE_REVMAP);
- MarkBufferDirty(buf);
-
- UnlockReleaseBuffer(buf);
-}
-
-/*
- * Initialize a new regular revmap page, which stores the given revmap logical
- * page number. The newly allocated physical block number is returned.
- *
- * Used both by regular code path as well as during xlog replay.
- */
-BlockNumber
-initialize_rmr_page(Buffer newbuf, BlockNumber mapBlk)
-{
- BlockNumber blkno;
- Page page;
- RevmapContents *contents;
-
- page = BufferGetPage(newbuf);
-
- mm_page_init(page, MINMAX_PAGETYPE_REVMAP);
- contents = (RevmapContents *) PageGetContents(page);
- contents->rmr_logblk = mapBlk;
- /* the rmr_tids array is initialized to all invalid by PageInit */
-
- blkno = BufferGetBlockNumber(newbuf);
-
- return blkno;
-}
-
-/*
- * Given a buffer (hopefully containing a blank page), set it up as a revmap
- * array page.
- *
- * Used both by regular code path as well as during xlog replay.
- */
-void
-initialize_rma_page(Buffer buf)
-{
- Page arrayPg;
- RevmapArrayContents *contents;
-
- arrayPg = BufferGetPage(buf);
- mm_page_init(arrayPg, MINMAX_PAGETYPE_REVMAP_ARRAY);
- contents = (RevmapArrayContents *) PageGetContents(arrayPg);
- contents->rma_nblocks = 0;
- /* set the whole array to InvalidBlockNumber */
- memset(contents->rma_blocks, 0xFF,
- sizeof(BlockNumber) * ARRAY_REVMAP_PAGE_MAXITEMS);
-}
-
-/*
- * Return an exclusively-locked buffer resulting from extending the relation.
- */
-static Buffer
-mm_getnewbuffer(Relation irel)
-{
- Buffer buffer;
- bool needLock = !RELATION_IS_LOCAL(irel);
-
- /*
- * XXX As a possible improvement, we could request a blank page to the FSM
- * here. Such pages could get inserted into the FSM if, for instance, two
- * processes extend the relation concurrently to add one more page to the
- * revmap and the second one discovers it doesn't actually need the page it
- * got.
- */
-
- if (needLock)
- LockRelationForExtension(irel, ExclusiveLock);
-
- buffer = ReadBuffer(irel, P_NEW);
- LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
-
- MINMAX_elog(DEBUG2, "mm_getnewbuffer: extending to page %u",
- BufferGetBlockNumber(buffer));
-
- if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
-
- return buffer;
-}
diff --git a/src/backend/access/minmax/mmtuple.c b/src/backend/access/minmax/mmtuple.c
index 2e5aac5..b203b3a 100644
--- a/src/backend/access/minmax/mmtuple.c
+++ b/src/backend/access/minmax/mmtuple.c
@@ -256,7 +256,7 @@ minmax_copy_tuple(MMTuple *tuple, Size len)
}
bool
-minmax_tuples_equal(MMTuple *a, Size alen, MMTuple *b, Size blen)
+minmax_tuples_equal(const MMTuple *a, Size alen, const MMTuple *b, Size blen)
{
if (alen != blen)
return false;
diff --git a/src/backend/access/minmax/mmxlog.c b/src/backend/access/minmax/mmxlog.c
index ab3f9fe..5690ceb 100644
--- a/src/backend/access/minmax/mmxlog.c
+++ b/src/backend/access/minmax/mmxlog.c
@@ -246,84 +246,54 @@ minmax_xlog_samepage_update(XLogRecPtr lsn, XLogRecord *record)
static void
-minmax_xlog_metapg_set(XLogRecPtr lsn, XLogRecord *record)
+minmax_xlog_revmap_extend(XLogRecPtr lsn, XLogRecord *record)
{
- xl_minmax_metapg_set *xlrec = (xl_minmax_metapg_set *) XLogRecGetData(record);
- Buffer meta;
+ xl_minmax_revmap_extend *xlrec = (xl_minmax_revmap_extend *) XLogRecGetData(record);
+ Buffer metabuf;
Page metapg;
MinmaxMetaPageData *metadata;
+ Buffer buf;
+ Page page;
- /* If we have a full-page image, restore it and we're done */
- if (record->xl_info & XLR_BKP_BLOCK(0))
- {
- (void) RestoreBackupBlock(lsn, record, 0, false, false);
- return;
- }
-
- meta = XLogReadBuffer(xlrec->node, MINMAX_METAPAGE_BLKNO, false);
- Assert(BufferIsValid(meta));
-
- metapg = BufferGetPage(meta);
- metadata = (MinmaxMetaPageData *) PageGetContents(metapg);
- metadata->revmapArrayPages[xlrec->blkidx] = xlrec->newpg;
-
- PageSetLSN(metapg, lsn);
- MarkBufferDirty(meta);
- UnlockReleaseBuffer(meta);
-}
-
-static void
-minmax_xlog_init_rmpg(XLogRecPtr lsn, XLogRecord *record)
-{
- xl_minmax_init_rmpg *xlrec = (xl_minmax_init_rmpg *) XLogRecGetData(record);
- Buffer buffer;
-
+ /* Update the metapage */
if (record->xl_info & XLR_BKP_BLOCK(0))
{
- (void) RestoreBackupBlock(lsn, record, 0, false, false);
- return;
+ metabuf = RestoreBackupBlock(lsn, record, 0, false, true);
}
-
- buffer = XLogReadBuffer(xlrec->node, xlrec->blkno, true);
- Assert(BufferIsValid(buffer));
-
- if (xlrec->array)
- initialize_rma_page(buffer);
else
- initialize_rmr_page(buffer, xlrec->logblk);
-
- PageSetLSN(BufferGetPage(buffer), lsn);
- MarkBufferDirty(buffer);
- UnlockReleaseBuffer(buffer);
-}
+ {
+ metabuf = XLogReadBuffer(xlrec->node, MINMAX_METAPAGE_BLKNO, false);
+ if (BufferIsValid(metabuf))
+ {
+ metapg = BufferGetPage(metabuf);
+ if (lsn > PageGetLSN(metapg))
+ {
+ metadata = (MinmaxMetaPageData *) PageGetContents(metapg);
-static void
-minmax_xlog_rmarray_set(XLogRecPtr lsn, XLogRecord *record)
-{
- xl_minmax_rmarray_set *xlrec = (xl_minmax_rmarray_set *) XLogRecGetData(record);
- Buffer buffer;
- Page page;
- RevmapArrayContents *contents;
+ Assert(metadata->lastRevmapPage == xlrec->targetBlk - 1);
+ metadata->lastRevmapPage = xlrec->targetBlk;
- /* If we have a full-page image, restore it and we're done */
- if (record->xl_info & XLR_BKP_BLOCK(0))
- {
- (void) RestoreBackupBlock(lsn, record, 0, false, false);
- return;
+ PageSetLSN(metapg, lsn);
+ MarkBufferDirty(metabuf);
+ }
+ }
}
- buffer = XLogReadBuffer(xlrec->node, xlrec->rmarray, false);
- Assert(BufferIsValid(buffer));
+ /* Re-init the target block as a revmap page */
- page = BufferGetPage(buffer);
-
- contents = (RevmapArrayContents *) PageGetContents(page);
- contents->rma_blocks[xlrec->blkidx] = xlrec->newpg;
- contents->rma_nblocks = xlrec->blkidx + 1; /* XXX is this okay? */
+ buf = XLogReadBuffer(xlrec->node, xlrec->targetBlk, true);
+ page = (Page) BufferGetPage(buf);
+ mm_page_init(page, MINMAX_PAGETYPE_REVMAP);
PageSetLSN(page, lsn);
- MarkBufferDirty(buffer);
- UnlockReleaseBuffer(buffer);
+ MarkBufferDirty(buf);
+
+ metadata->lastRevmapPage = xlrec->targetBlk;
+ PageSetLSN(metapg, lsn);
+ MarkBufferDirty(metabuf);
+
+ UnlockReleaseBuffer(buf);
+ UnlockReleaseBuffer(metabuf);
}
void
@@ -345,14 +315,8 @@ minmax_redo(XLogRecPtr lsn, XLogRecord *record)
case XLOG_MINMAX_SAMEPAGE_UPDATE:
minmax_xlog_samepage_update(lsn, record);
break;
- case XLOG_MINMAX_METAPG_SET:
- minmax_xlog_metapg_set(lsn, record);
- break;
- case XLOG_MINMAX_RMARRAY_SET:
- minmax_xlog_rmarray_set(lsn, record);
- break;
- case XLOG_MINMAX_INIT_RMPG:
- minmax_xlog_init_rmpg(lsn, record);
+ case XLOG_MINMAX_REVMAP_EXTEND:
+ minmax_xlog_revmap_extend(lsn, record);
break;
default:
elog(PANIC, "minmax_redo: unknown op code %u", info);
diff --git a/src/include/access/minmax_internal.h b/src/include/access/minmax_internal.h
index 47ed279..c206168 100644
--- a/src/include/access/minmax_internal.h
+++ b/src/include/access/minmax_internal.h
@@ -87,5 +87,7 @@ extern void minmax_free_mmdesc(MinmaxDesc *mmdesc);
extern void mm_page_init(Page page, uint16 type);
extern void mm_metapage_init(Page page, BlockNumber pagesPerRange,
uint16 version);
+extern bool mm_start_evacuating_page(Relation idxRel, Buffer buf);
+extern void mm_evacuate_page(Relation idxRel, Buffer buf);
#endif /* MINMAX_INTERNAL_H */
diff --git a/src/include/access/minmax_page.h b/src/include/access/minmax_page.h
index 04f40d8..df7f940 100644
--- a/src/include/access/minmax_page.h
+++ b/src/include/access/minmax_page.h
@@ -19,13 +19,21 @@
/* special space on all minmax pages stores a "type" identifier */
#define MINMAX_PAGETYPE_META 0xF091
-#define MINMAX_PAGETYPE_REVMAP_ARRAY 0xF092
-#define MINMAX_PAGETYPE_REVMAP 0xF093
-#define MINMAX_PAGETYPE_REGULAR 0xF094
+#define MINMAX_PAGETYPE_REVMAP 0xF092
+#define MINMAX_PAGETYPE_REGULAR 0xF093
+
+#define MINMAX_PAGE_TYPE(page) \
+ (((MinmaxSpecialSpace *) PageGetSpecialPointer(page))->type)
+#define MINMAX_IS_REVMAP_PAGE(page) (MINMAX_PAGE_TYPE(page) == MINMAX_PAGETYPE_REVMAP)
+#define MINMAX_IS_REGULAR_PAGE(page) (MINMAX_PAGE_TYPE(page) == MINMAX_PAGETYPE_REGULAR)
+
+/* flags */
+#define MINMAX_EVACUATE_PAGE 1
typedef struct MinmaxSpecialSpace
{
- uint16 type;
+ uint16 flags;
+ uint16 type;
} MinmaxSpecialSpace;
/* Metapage definitions */
@@ -34,30 +42,18 @@ typedef struct MinmaxMetaPageData
uint32 minmaxMagic;
uint32 minmaxVersion;
BlockNumber pagesPerRange;
- BlockNumber revmapArrayPages[1]; /* actually MAX_REVMAP_ARRAYPAGES */
+ BlockNumber lastRevmapPage;
} MinmaxMetaPageData;
-/*
- * Number of array pages listed in metapage. Need to consider leaving enough
- * space for the page header, the metapage struct, and the minmax special
- * space.
- */
-#define MAX_REVMAP_ARRAYPAGES \
- ((BLCKSZ - \
- MAXALIGN(SizeOfPageHeaderData) - \
- offsetof(MinmaxMetaPageData, revmapArrayPages) - \
- MAXALIGN(sizeof(MinmaxSpecialSpace)) ) / \
- sizeof(BlockNumber))
-
#define MINMAX_CURRENT_VERSION 1
#define MINMAX_META_MAGIC 0xA8109CFA
-#define MINMAX_METAPAGE_BLKNO 0
+#define MINMAX_METAPAGE_BLKNO 0
+#define MINMAX_REVMAP_FIRST_BLKNO 1
/* Definitions for regular revmap pages */
typedef struct RevmapContents
{
- int32 rmr_logblk; /* logical blkno of this revmap page */
ItemPointerData rmr_tids[1]; /* really REGULAR_REVMAP_PAGE_MAXITEMS */
} RevmapContents;
@@ -69,20 +65,4 @@ typedef struct RevmapContents
#define REGULAR_REVMAP_PAGE_MAXITEMS \
(REGULAR_REVMAP_CONTENT_SIZE / sizeof(ItemPointerData))
-/* Definitions for array revmap pages */
-typedef struct RevmapArrayContents
-{
- int32 rma_nblocks;
- BlockNumber rma_blocks[1]; /* really ARRAY_REVMAP_PAGE_MAXITEMS */
-} RevmapArrayContents;
-
-#define REVMAP_ARRAY_CONTENT_SIZE \
- (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - \
- offsetof(RevmapArrayContents, rma_blocks) - \
- MAXALIGN(sizeof(MinmaxSpecialSpace)))
-/* max num of items in the array */
-#define ARRAY_REVMAP_PAGE_MAXITEMS \
- (REVMAP_ARRAY_CONTENT_SIZE / sizeof(BlockNumber))
-
-
#endif /* MINMAX_PAGE_H */
diff --git a/src/include/access/minmax_revmap.h b/src/include/access/minmax_revmap.h
index 68729d8..73c6cd4 100644
--- a/src/include/access/minmax_revmap.h
+++ b/src/include/access/minmax_revmap.h
@@ -33,9 +33,4 @@ extern MMTuple *mmGetMMTupleForHeapBlock(mmRevmapAccess *rmAccess,
BlockNumber heapBlk, Buffer *buf, OffsetNumber *off,
int mode);
-/* internal stuff also used by xlog replay */
-extern BlockNumber initialize_rmr_page(Buffer newbuf, BlockNumber mapBlk);
-extern void initialize_rma_page(Buffer buf);
-
-
#endif /* MINMAX_REVMAP_H */
diff --git a/src/include/access/minmax_tuple.h b/src/include/access/minmax_tuple.h
index 989a179..eff4d52 100644
--- a/src/include/access/minmax_tuple.h
+++ b/src/include/access/minmax_tuple.h
@@ -77,8 +77,9 @@ typedef struct MMTuple
extern MMTuple *minmax_form_tuple(MinmaxDesc *mmdesc, BlockNumber blkno,
DeformedMMTuple *tuple, Size *size);
extern void minmax_free_tuple(MMTuple *tuple);
-MMTuple *minmax_copy_tuple(MMTuple *tuple, Size len);
-extern bool minmax_tuples_equal(MMTuple *a, Size alen, MMTuple *b, Size blen);
+extern MMTuple *minmax_copy_tuple(MMTuple *tuple, Size len);
+extern bool minmax_tuples_equal(const MMTuple *a, Size alen,
+ const MMTuple *b, Size blen);
extern DeformedMMTuple *minmax_new_dtuple(MinmaxDesc *mmdesc);
extern void minmax_dtuple_initialize(DeformedMMTuple *dtuple,
diff --git a/src/include/access/minmax_xlog.h b/src/include/access/minmax_xlog.h
index 00d3425..01bb065 100644
--- a/src/include/access/minmax_xlog.h
+++ b/src/include/access/minmax_xlog.h
@@ -31,9 +31,8 @@
#define XLOG_MINMAX_INSERT 0x10
#define XLOG_MINMAX_UPDATE 0x20
#define XLOG_MINMAX_SAMEPAGE_UPDATE 0x30
-#define XLOG_MINMAX_METAPG_SET 0x40
-#define XLOG_MINMAX_RMARRAY_SET 0x50
-#define XLOG_MINMAX_INIT_RMPG 0x60
+#define XLOG_MINMAX_REVMAP_EXTEND 0x40
+#define XLOG_MINMAX_REVMAP_VACUUM 0x50
#define XLOG_MINMAX_OPMASK 0x70
/*
@@ -90,39 +89,14 @@ typedef struct xl_minmax_samepage_update
#define SizeOfMinmaxSamepageUpdate (offsetof(xl_minmax_samepage_update, tid) + sizeof(ItemPointerData))
-/* This is what we need to know about a "metapage set" operation */
-typedef struct xl_minmax_metapg_set
+/* This is what we need to know about a revmap extension */
+typedef struct xl_minmax_revmap_extend
{
RelFileNode node;
- uint32 blkidx;
- BlockNumber newpg;
-} xl_minmax_metapg_set;
+ BlockNumber targetBlk;
+} xl_minmax_revmap_extend;
-#define SizeOfMinmaxMetapgSet (offsetof(xl_minmax_metapg_set, newpg) + \
- sizeof(BlockNumber))
-
-/* This is what we need to know about a "revmap array set" operation */
-typedef struct xl_minmax_rmarray_set
-{
- RelFileNode node;
- BlockNumber rmarray;
- uint32 blkidx;
- BlockNumber newpg;
-} xl_minmax_rmarray_set;
-
-#define SizeOfMinmaxRmarraySet (offsetof(xl_minmax_rmarray_set, newpg) + \
- sizeof(BlockNumber))
-
-/* This is what we need to know when we initialize a new revmap page */
-typedef struct xl_minmax_init_rmpg
-{
- RelFileNode node;
- bool array; /* array revmap page or regular revmap page */
- BlockNumber blkno;
- BlockNumber logblk; /* only used by regular revmap pages */
-} xl_minmax_init_rmpg;
-
-#define SizeOfMinmaxInitRmpg (offsetof(xl_minmax_init_rmpg, blkno) + \
+#define SizeOfMinmaxRevmapExtend (offsetof(xl_minmax_revmap_extend, targetBlk) + \
sizeof(BlockNumber))
Fujii Masao wrote:
I've not read the patch yet. But while testing the feature, I found that
* Brin index cannot be created on CHAR(n) column.
Maybe other data types have the same problem.
Yeah, it's just a matter of adding an opclass for it -- pretty simple
stuff really, because you don't need to write any code, just add a bunch
of catalog entries and an OPCINFO line in mmsortable.c.
Right now there are opclasses for the following types:
int4
numeric
text
date
timestamp with time zone
timestamp
time with time zone
time
"char"
We can eventually extend to cover all types that have btree opclasses,
but we can do that in a separate commit. I'm also considering removing
the opclass for time with time zone, as it's a pretty useless type. I
mostly added the ones that are there as a way to test that it behaved
reasonably in the various cases (pass by val vs. not, variable width vs.
fixed, different alignment requirements)
Of course, the real interesting part is adding a completely different
opclass, such as one that stores bounding boxes.
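For anyone wanting to try this out, here's a rough sketch (table and index
names made up); note that the only plan shape these indexes can produce is a
bitmap scan with a recheck on top, as described in the proposal:

    CREATE TABLE orders (id int4, order_date date, total numeric);
    CREATE INDEX orders_date_mm ON orders USING minmax (order_date);

    -- the index only drives bitmap scans; rows in matching ranges are rechecked
    EXPLAIN SELECT * FROM orders WHERE order_date >= '2014-01-01';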
* FILLFACTOR cannot be set in brin index.
I hadn't added this one because I didn't think there was much point
previously, but I think it might now be useful to allow same-page
updates.
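Concretely, the syntax would look something like this (sketch only;
pages_per_range is the storage parameter the patch already accepts, while
accepting fillfactor alongside it would be the new part):

    CREATE INDEX orders_date_mm ON orders USING minmax (order_date)
        WITH (pages_per_range = 64, fillfactor = 70);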
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Heikki Linnakangas wrote:
So, the other design change I've been advocating is to store the
revmap in the first N blocks, instead of having the two-level
structure with array pages and revmap pages.
Attached is a patch for that, to be applied after v15. When the
revmap needs to be expanded, all the tuples on it are moved
elsewhere one-by-one. That adds some latency to the unfortunate guy
who needs to do that, but as the patch stands, the revmap is only
ever extended by VACUUM or CREATE INDEX, so I think that's fine.
Like with my previous patch, the point is to demonstrate how much
simpler the code becomes this way; I'm sure there are bugs and
cleanup still necessary.
Thanks for the prodding. I didn't like this too much initially, but
after going over it a few times I agree that having less code and a less
complex physical representation is better. Your proposed approach is to
just call the update routine on every tuple in the page we're
evacuating. There are optimizations possible (such as doing bulk
updates; and instead of updating the revmap, keep a redirection pointer
in the page we just evacuated, so that the revmap can be updated lazily
later), but I have spent way too long on this already, so I am fine
with keeping what we have here. If somebody later wants to contribute
improvements to this, it'd be welcome. But on the other hand the
operation is not that frequent and as you say it's not executed by
user-facing queries, so perhaps it's okay.
I cleaned it up some: mainly I created a separate file (mmpageops.c)
that now hosts the routines related to page operations: mm_doinsert,
mm_doupdate, mm_start_evacuating_page, mm_evacuate_page. There are
other rather minor changes here and there; also added
CHECK_FOR_INTERRUPTS in all relevant loops.
This bit in mm_doupdate I just couldn't understand:
/* If both tuples are in fact equal, there is nothing to do */
if (!minmax_tuples_equal(oldtup, oldsz, origtup, origsz))
{
LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
return false;
}
Isn't the test exactly reversed? I don't see how this would work.
I updated it to
/*
* If both tuples are identical, there is nothing to do; except that if we
* were requested to move the tuple across pages, we do it even if they are
* equal.
*/
if (samepage && minmax_tuples_equal(oldtup, oldsz, origtup, origsz))
{
LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
return false;
}
PS. Spotted one oversight in patch v15: callers of mm_doupdate must
check the return value, and retry the operation if it returns false.
Right, thanks. Fixed.
So here's v16, rebased on top of 9bac66020. As far as I am concerned,
this is the last version before I start renaming everything to BRIN and
then commit.
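For reference, here's roughly how the pageinspect additions can be poked at
(sketch; the function signatures are the ones declared in pageinspect--1.2.sql
below, the index name is made up, block 0 is the metapage and block 1 the
first revmap page):

    SELECT minmax_page_type(get_raw_page('orders_date_mm', 0));  -- 'meta'
    SELECT * FROM minmax_metapage_info(get_raw_page('orders_date_mm', 0));
    SELECT * FROM minmax_revmap_data(get_raw_page('orders_date_mm', 1));
    SELECT * FROM minmax_page_items(get_raw_page('orders_date_mm', 2),
                                    'orders_date_mm'::regclass);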
contrib/pageinspect/Makefile | 2 +-
contrib/pageinspect/mmfuncs.c | 407 +++++++++++++
contrib/pageinspect/pageinspect--1.2.sql | 36 ++
contrib/pg_xlogdump/rmgrdesc.c | 1 +
doc/src/sgml/brin.sgml | 248 ++++++++
doc/src/sgml/filelist.sgml | 1 +
doc/src/sgml/indices.sgml | 36 +-
doc/src/sgml/postgres.sgml | 1 +
minmax-proposal | 306 ++++++++++
src/backend/access/Makefile | 2 +-
src/backend/access/common/reloptions.c | 7 +
src/backend/access/heap/heapam.c | 22 +-
src/backend/access/minmax/Makefile | 17 +
src/backend/access/minmax/minmax.c | 942 +++++++++++++++++++++++++++++++
src/backend/access/minmax/mmpageops.c | 638 +++++++++++++++++++++
src/backend/access/minmax/mmrevmap.c | 451 +++++++++++++++
src/backend/access/minmax/mmsortable.c | 287 ++++++++++
src/backend/access/minmax/mmtuple.c | 478 ++++++++++++++++
src/backend/access/minmax/mmxlog.c | 323 +++++++++++
src/backend/access/rmgrdesc/Makefile | 3 +-
src/backend/access/rmgrdesc/minmaxdesc.c | 89 +++
src/backend/access/transam/rmgr.c | 1 +
src/backend/catalog/index.c | 24 +
src/backend/replication/logical/decode.c | 1 +
src/backend/storage/page/bufpage.c | 179 +++++-
src/backend/utils/adt/selfuncs.c | 24 +
src/include/access/heapam.h | 2 +
src/include/access/minmax.h | 52 ++
src/include/access/minmax_internal.h | 86 +++
src/include/access/minmax_page.h | 70 +++
src/include/access/minmax_pageops.h | 29 +
src/include/access/minmax_revmap.h | 36 ++
src/include/access/minmax_tuple.h | 90 +++
src/include/access/minmax_xlog.h | 106 ++++
src/include/access/reloptions.h | 3 +-
src/include/access/relscan.h | 4 +-
src/include/access/rmgrlist.h | 1 +
src/include/catalog/index.h | 8 +
src/include/catalog/pg_am.h | 2 +
src/include/catalog/pg_amop.h | 81 +++
src/include/catalog/pg_amproc.h | 73 +++
src/include/catalog/pg_opclass.h | 9 +
src/include/catalog/pg_opfamily.h | 10 +
src/include/catalog/pg_proc.h | 52 ++
src/include/storage/bufpage.h | 2 +
src/include/utils/selfuncs.h | 1 +
src/test/regress/expected/opr_sanity.out | 14 +-
src/test/regress/sql/opr_sanity.sql | 7 +-
48 files changed, 5248 insertions(+), 16 deletions(-)
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
minmax-16.patch (text/x-diff; charset=us-ascii)
*** a/contrib/pageinspect/Makefile
--- b/contrib/pageinspect/Makefile
***************
*** 1,7 ****
# contrib/pageinspect/Makefile
MODULE_big = pageinspect
! OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o $(WIN32RES)
EXTENSION = pageinspect
DATA = pageinspect--1.2.sql pageinspect--1.0--1.1.sql \
--- 1,7 ----
# contrib/pageinspect/Makefile
MODULE_big = pageinspect
! OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o mmfuncs.o $(WIN32RES)
EXTENSION = pageinspect
DATA = pageinspect--1.2.sql pageinspect--1.0--1.1.sql \
*** /dev/null
--- b/contrib/pageinspect/mmfuncs.c
***************
*** 0 ****
--- 1,407 ----
+ /*
+ * mmfuncs.c
+ * Functions to investigate MinMax indexes
+ *
+ * Copyright (c) 2014, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/pageinspect/mmfuncs.c
+ */
+ #include "postgres.h"
+
+ #include "access/htup_details.h"
+ #include "access/minmax.h"
+ #include "access/minmax_internal.h"
+ #include "access/minmax_page.h"
+ #include "access/minmax_revmap.h"
+ #include "access/minmax_tuple.h"
+ #include "catalog/index.h"
+ #include "catalog/pg_type.h"
+ #include "funcapi.h"
+ #include "lib/stringinfo.h"
+ #include "utils/array.h"
+ #include "utils/builtins.h"
+ #include "utils/lsyscache.h"
+ #include "utils/rel.h"
+ #include "miscadmin.h"
+
+
+ PG_FUNCTION_INFO_V1(minmax_page_type);
+ PG_FUNCTION_INFO_V1(minmax_page_items);
+ PG_FUNCTION_INFO_V1(minmax_metapage_info);
+ PG_FUNCTION_INFO_V1(minmax_revmap_data);
+
+ typedef struct mm_column_state
+ {
+ int nstored;
+ FmgrInfo outputFn[FLEXIBLE_ARRAY_MEMBER];
+ } mm_column_state;
+
+ typedef struct mm_page_state
+ {
+ MinmaxDesc *mmdesc;
+ Page page;
+ OffsetNumber offset;
+ bool unusedItem;
+ bool done;
+ AttrNumber attno;
+ DeformedMMTuple *dtup;
+ mm_column_state *columns[FLEXIBLE_ARRAY_MEMBER];
+ } mm_page_state;
+
+
+ static Page verify_minmax_page(bytea *raw_page, uint16 type,
+ const char *strtype);
+
+ Datum
+ minmax_page_type(PG_FUNCTION_ARGS)
+ {
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Page page = VARDATA(raw_page);
+ MinmaxSpecialSpace *special;
+ char *type;
+
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
+
+ switch (special->type)
+ {
+ case MINMAX_PAGETYPE_META:
+ type = "meta";
+ break;
+ case MINMAX_PAGETYPE_REVMAP:
+ type = "revmap";
+ break;
+ case MINMAX_PAGETYPE_REGULAR:
+ type = "regular";
+ break;
+ default:
+ type = psprintf("unknown (%02x)", special->type);
+ break;
+ }
+
+ PG_RETURN_TEXT_P(cstring_to_text(type));
+ }
+
+ /*
+ * Verify that the given bytea contains a minmax page of the indicated page
+ * type, or die in the attempt. A pointer to the page is returned.
+ */
+ static Page
+ verify_minmax_page(bytea *raw_page, uint16 type, const char *strtype)
+ {
+ Page page;
+ int raw_page_size;
+ MinmaxSpecialSpace *special;
+
+ raw_page_size = VARSIZE(raw_page) - VARHDRSZ;
+
+ if (raw_page_size < SizeOfPageHeaderData)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("input page too small"),
+ errdetail("Expected size %d, got %d", raw_page_size, BLCKSZ)));
+
+ page = VARDATA(raw_page);
+
+ /* verify the special space says this page is what we want */
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
+ if (special->type != type)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("page is not a Minmax page of type \"%s\"", strtype),
+ errdetail("Expected special type %08x, got %08x.",
+ type, special->type)));
+
+ return page;
+ }
+
+
+ /*
+ * Extract all item values from a minmax index page
+ *
+ * Usage: SELECT * FROM minmax_page_items(get_raw_page('idx', 1), 'idx'::regclass);
+ */
+ Datum
+ minmax_page_items(PG_FUNCTION_ARGS)
+ {
+ mm_page_state *state;
+ FuncCallContext *fctx;
+
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ (errmsg("must be superuser to use raw page functions"))));
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Oid indexRelid = PG_GETARG_OID(1);
+ Page page;
+ TupleDesc tupdesc;
+ MemoryContext mctx;
+ Relation indexRel;
+ AttrNumber attno;
+
+ /* minimally verify the page we got */
+ page = verify_minmax_page(raw_page, MINMAX_PAGETYPE_REGULAR, "regular");
+
+ /* create a function context for cross-call persistence */
+ fctx = SRF_FIRSTCALL_INIT();
+
+ /* switch to memory context appropriate for multiple function calls */
+ mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ indexRel = index_open(indexRelid, AccessShareLock);
+
+ state = palloc(offsetof(mm_page_state, columns) +
+ sizeof(mm_column_state) * RelationGetDescr(indexRel)->natts);
+
+ state->mmdesc = minmax_build_mmdesc(indexRel);
+ state->page = page;
+ state->offset = FirstOffsetNumber;
+ state->unusedItem = false;
+ state->done = false;
+ state->dtup = NULL;
+
+ for (attno = 1; attno <= state->mmdesc->md_tupdesc->natts; attno++)
+ {
+ Oid output;
+ bool isVarlena;
+ FmgrInfo *opcInfoFn;
+ MinmaxOpcInfo *opcinfo;
+ int i;
+ mm_column_state *column;
+
+ opcInfoFn = index_getprocinfo(indexRel, attno, MINMAX_PROCNUM_OPCINFO);
+ opcinfo = (MinmaxOpcInfo *)
+ DatumGetPointer(FunctionCall1(opcInfoFn, InvalidOid));
+
+ column = palloc(offsetof(mm_column_state, outputFn) +
+ sizeof(FmgrInfo) * opcinfo->oi_nstored);
+
+ column->nstored = opcinfo->oi_nstored;
+ for (i = 0; i < opcinfo->oi_nstored; i++)
+ {
+ getTypeOutputInfo(opcinfo->oi_typids[i], &output, &isVarlena);
+ fmgr_info(output, &column->outputFn[i]);
+ }
+
+ state->columns[attno - 1] = column;
+ }
+
+ index_close(indexRel, AccessShareLock);
+
+ fctx->user_fctx = state;
+ fctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+ MemoryContextSwitchTo(mctx);
+ }
+
+ fctx = SRF_PERCALL_SETUP();
+ state = fctx->user_fctx;
+
+ if (!state->done)
+ {
+ HeapTuple result;
+ Datum values[5];
+ bool nulls[5];
+
+ /*
+ * This loop is called once for every attribute of every tuple in the
+ * page. At the start of a tuple, we get a NULL dtup; that's our
+ * signal for obtaining and decoding the next one. If that's not the
+ * case, we output the next attribute.
+ */
+ if (state->dtup == NULL)
+ {
+ MMTuple *tup;
+ MemoryContext mctx;
+ ItemId itemId;
+
+ /* deformed tuple must live across calls */
+ mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx);
+
+ /* verify item status: if there's no data, we can't decode */
+ itemId = PageGetItemId(state->page, state->offset);
+ if (ItemIdIsUsed(itemId))
+ {
+ tup = (MMTuple *) PageGetItem(state->page,
+ PageGetItemId(state->page,
+ state->offset));
+ state->dtup = minmax_deform_tuple(state->mmdesc, tup);
+ state->attno = 1;
+ state->unusedItem = false;
+ }
+ else
+ state->unusedItem = true;
+
+ MemoryContextSwitchTo(mctx);
+ }
+ else
+ state->attno++;
+
+ MemSet(nulls, 0, sizeof(nulls));
+
+ if (state->unusedItem)
+ {
+ values[0] = UInt16GetDatum(state->offset);
+ nulls[1] = true;
+ nulls[2] = true;
+ nulls[3] = true;
+ nulls[4] = true;
+ }
+ else
+ {
+ int att = state->attno - 1;
+
+ values[0] = UInt16GetDatum(state->offset);
+ values[1] = UInt16GetDatum(state->attno);
+ values[2] = BoolGetDatum(state->dtup->dt_columns[att].allnulls);
+ values[3] = BoolGetDatum(state->dtup->dt_columns[att].hasnulls);
+ if (!state->dtup->dt_columns[att].allnulls)
+ {
+ MMValues *mmvalues = &state->dtup->dt_columns[att];
+ StringInfoData s;
+ bool first;
+ int i;
+
+ initStringInfo(&s);
+ appendStringInfoChar(&s, '{');
+
+ first = true;
+ for (i = 0; i < state->columns[att]->nstored; i++)
+ {
+ char *val;
+
+ if (!first)
+ appendStringInfoString(&s, " .. ");
+ first = false;
+ val = OutputFunctionCall(&state->columns[att]->outputFn[i],
+ mmvalues->values[i]);
+ appendStringInfoString(&s, val);
+ pfree(val);
+ }
+ appendStringInfoChar(&s, '}');
+
+ values[4] = CStringGetTextDatum(s.data);
+ pfree(s.data);
+ }
+ else
+ {
+ nulls[4] = true;
+ }
+ }
+
+ result = heap_form_tuple(fctx->tuple_desc, values, nulls);
+
+ /*
+ * If the item was unused, jump straight to the next one; otherwise,
+ * the only cleanup needed here is to set our signal to go to the next
+ * tuple in the following iteration, by freeing the current one.
+ */
+ if (state->unusedItem)
+ state->offset = OffsetNumberNext(state->offset);
+ else if (state->attno >= state->mmdesc->md_tupdesc->natts)
+ {
+ pfree(state->dtup);
+ state->dtup = NULL;
+ state->offset = OffsetNumberNext(state->offset);
+ }
+
+ /*
+ * If we're beyond the end of the page, set flag to end the function in
+ * the following iteration.
+ */
+ if (state->offset > PageGetMaxOffsetNumber(state->page))
+ state->done = true;
+
+ SRF_RETURN_NEXT(fctx, HeapTupleGetDatum(result));
+ }
+
+ minmax_free_mmdesc(state->mmdesc);
+
+ SRF_RETURN_DONE(fctx);
+ }
+
+ Datum
+ minmax_metapage_info(PG_FUNCTION_ARGS)
+ {
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Page page;
+ MinmaxMetaPageData *meta;
+ TupleDesc tupdesc;
+ Datum values[4];
+ bool nulls[4];
+ HeapTuple htup;
+
+ page = verify_minmax_page(raw_page, MINMAX_PAGETYPE_META, "metapage");
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+ tupdesc = BlessTupleDesc(tupdesc);
+
+ /* Extract values from the metapage */
+ meta = (MinmaxMetaPageData *) PageGetContents(page);
+ MemSet(nulls, 0, sizeof(nulls));
+ values[0] = CStringGetTextDatum(psprintf("0x%08X", meta->minmaxMagic));
+ values[1] = Int32GetDatum(meta->minmaxVersion);
+ values[2] = Int32GetDatum(meta->pagesPerRange);
+ values[3] = Int64GetDatum(meta->lastRevmapPage);
+
+ htup = heap_form_tuple(tupdesc, values, nulls);
+
+ PG_RETURN_DATUM(HeapTupleGetDatum(htup));
+ }
+
+ /*
+ * Return the TID array stored in a minmax revmap page
+ */
+ Datum
+ minmax_revmap_data(PG_FUNCTION_ARGS)
+ {
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Page page;
+ RevmapContents *contents;
+ TupleDesc tupdesc;
+ Datum values[2];
+ bool nulls[2];
+ HeapTuple htup;
+ ArrayBuildState *astate = NULL;
+ int i;
+
+ page = verify_minmax_page(raw_page, MINMAX_PAGETYPE_REVMAP, "revmap");
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+ tupdesc = BlessTupleDesc(tupdesc);
+
+ /* Extract values from the revmap page */
+ contents = (RevmapContents *) PageGetContents(page);
+ MemSet(nulls, 0, sizeof(nulls));
+ values[0] = Int64GetDatum((uint64) 0);
+
+ /* Extract (possibly empty) list of TIDs in this page. */
+ for (i = 0; i < REVMAP_PAGE_MAXITEMS; i++)
+ {
+ ItemPointer tid;
+
+ tid = &contents->rmr_tids[i];
+ astate = accumArrayResult(astate,
+ PointerGetDatum(tid),
+ false, TIDOID, CurrentMemoryContext);
+ }
+ if (astate == NULL)
+ nulls[1] = true;
+ else
+ values[1] = makeArrayResult(astate, CurrentMemoryContext);
+
+ htup = heap_form_tuple(tupdesc, values, nulls);
+
+ PG_RETURN_DATUM(HeapTupleGetDatum(htup));
+ }
*** a/contrib/pageinspect/pageinspect--1.2.sql
--- b/contrib/pageinspect/pageinspect--1.2.sql
***************
*** 99,104 **** AS 'MODULE_PATHNAME', 'bt_page_items'
--- 99,140 ----
LANGUAGE C STRICT;
--
+ -- minmax_page_type()
+ --
+ CREATE FUNCTION minmax_page_type(IN page bytea)
+ RETURNS text
+ AS 'MODULE_PATHNAME', 'minmax_page_type'
+ LANGUAGE C STRICT;
+
+ --
+ -- minmax_metapage_info()
+ --
+ CREATE FUNCTION minmax_metapage_info(IN page bytea, OUT magic text,
+ OUT version integer, OUT pagesperrange integer, OUT lastrevmappage bigint)
+ AS 'MODULE_PATHNAME', 'minmax_metapage_info'
+ LANGUAGE C STRICT;
+
+ --
+ -- minmax_page_items()
+ --
+ CREATE FUNCTION minmax_page_items(IN page bytea, IN index_oid oid,
+ OUT itemoffset int,
+ OUT attnum int,
+ OUT allnulls bool,
+ OUT hasnulls bool,
+ OUT value text)
+ RETURNS SETOF record
+ AS 'MODULE_PATHNAME', 'minmax_page_items'
+ LANGUAGE C STRICT;
+
+ --
+ -- minmax_revmap_data()
+ CREATE FUNCTION minmax_revmap_data(IN page bytea,
+ OUT pages tid[])
+ AS 'MODULE_PATHNAME', 'minmax_revmap_data'
+ LANGUAGE C STRICT;
+
+ --
-- fsm_page_contents()
--
CREATE FUNCTION fsm_page_contents(IN page bytea)
*** a/contrib/pg_xlogdump/rmgrdesc.c
--- b/contrib/pg_xlogdump/rmgrdesc.c
***************
*** 13,18 ****
--- 13,19 ----
#include "access/gist_private.h"
#include "access/hash.h"
#include "access/heapam_xlog.h"
+ #include "access/minmax_xlog.h"
#include "access/multixact.h"
#include "access/nbtree.h"
#include "access/rmgr.h"
*** /dev/null
--- b/doc/src/sgml/brin.sgml
***************
*** 0 ****
--- 1,248 ----
+ <!-- doc/src/sgml/brin.sgml -->
+
+ <chapter id="BRIN">
+ <title>BRIN Indexes</title>
+
+ <indexterm>
+ <primary>index</primary>
+ <secondary>BRIN</secondary>
+ </indexterm>
+
+ <sect1 id="brin-intro">
+ <title>Introduction</title>
+
+ <para>
+ <acronym>BRIN</acronym> stands for Block Range Index.
+ <acronym>BRIN</acronym> is designed for handling very large tables
+ in which certain columns have some natural correlation with its
+ physical position. For example, a table storing orders might have
+ a date column on which each order was placed, and much of the time
+ the earlier entries will appear earlier in the table as well; or a
+ table storing a ZIP code column might have all codes for a city
+ grouped together naturally. For each block range, some summary info
+ is stored in the index.
+ </para>
+
+ <para>
+ <acronym>BRIN</acronym> indexes can satisfy queries via the bitmap
+ scanning facility only, and will return all tuples in all pages within
+ each range if the summary info stored by the index indicates that some
+ tuples in the range might match the given query conditions. The executor
+ is in charge of rechecking these tuples and discarding those that do not
+ match — in other words, these indexes are lossy.
+ This enables them to work as very fast sequential scan helpers to avoid
+ scanning blocks that are known not to contain matching tuples.
+ </para>
+
+ <para>
+ The specific data that a <acronym>BRIN</acronym> index will store
+ depends on the operator class selected for the data type.
+ Datatypes having a linear sort order can have operator classes that
+ store the minimum and maximum value within each block range, for instance;
+ geometrical types might store the common bounding box.
+ </para>
+
+ <para>
+ The size of the block range is determined at index creation time with
+ the pages_per_range storage parameter. The smaller the number, the
+ larger the index becomes (because of the need to store more index entries),
+ but at the same time the summary data stored can be more precise and
+ more data blocks can be skipped.
+ </para>
+
+ <para>
+ The <acronym>BRIN</acronym> implementation in <productname>PostgreSQL</productname>
+ is primarily maintained by Álvaro Herrera.
+ </para>
+ </sect1>
+
+ <sect1 id="brin-builtin-opclasses">
+ <title>Built-in Operator Classes</title>
+
+ <para>
+ The core <productname>PostgreSQL</productname> distribution includes
+ the <acronym>BRIN</acronym> operator classes shown in
+ <xref linkend="brin-builtin-opclasses-table">.
+ </para>
+
+ <table id="brin-builtin-opclasses-table">
+ <title>Built-in <acronym>BRIN</acronym> Operator Classes</title>
+ <tgroup cols="3">
+ <thead>
+ <row>
+ <entry>Name</entry>
+ <entry>Indexed Data Type</entry>
+ <entry>Indexable Operators</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry><literal>char_minmax_ops</literal></entry>
+ <entry><type>"char"</type></entry>
+ <entry>
+ <literal><</literal>
+ <literal><=</literal>
+ <literal>=</literal>
+ <literal>>=</literal>
+ <literal>></literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>date_minmax_ops</literal></entry>
+ <entry><type>date</type></entry>
+ <entry>
+ <literal><</literal>
+ <literal><=</literal>
+ <literal>=</literal>
+ <literal>>=</literal>
+ <literal>></literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>int4_minmax_ops</literal></entry>
+ <entry><type>integer</type></entry>
+ <entry>
+ <literal><</literal>
+ <literal><=</literal>
+ <literal>=</literal>
+ <literal>>=</literal>
+ <literal>></literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>numeric_minmax_ops</literal></entry>
+ <entry><type>numeric</type></entry>
+ <entry>
+ <literal><</literal>
+ <literal><=</literal>
+ <literal>=</literal>
+ <literal>>=</literal>
+ <literal>></literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>text_minmax_ops</literal></entry>
+ <entry><type>text</type></entry>
+ <entry>
+ <literal><</literal>
+ <literal><=</literal>
+ <literal>=</literal>
+ <literal>>=</literal>
+ <literal>></literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>time_minmax_ops</literal></entry>
+ <entry><type>time</type></entry>
+ <entry>
+ <literal><</literal>
+ <literal><=</literal>
+ <literal>=</literal>
+ <literal>>=</literal>
+ <literal>></literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>timetz_minmax_ops</literal></entry>
+ <entry><type>time with time zone</type></entry>
+ <entry>
+ <literal><</literal>
+ <literal><=</literal>
+ <literal>=</literal>
+ <literal>>=</literal>
+ <literal>></literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>timestamp_minmax_ops</literal></entry>
+ <entry><type>timestamp</type></entry>
+ <entry>
+ <literal><</literal>
+ <literal><=</literal>
+ <literal>=</literal>
+ <literal>>=</literal>
+ <literal>></literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>timestamptz_minmax_ops</literal></entry>
+ <entry><type>timestamp with time zone</type></entry>
+ <entry>
+ <literal><</literal>
+ <literal><=</literal>
+ <literal>=</literal>
+ <literal>>=</literal>
+ <literal>></literal>
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+ </sect1>
+
+ <sect1 id="brin-extensibility">
+ <title>Extensibility</title>
+
+ <para>
+ The <acronym>BRIN</acronym> interface has a high level of abstraction,
+ requiring the access method implementer only to implement the semantics
+ of the data type being accessed. The <acronym>BRIN</acronym> layer
+ itself takes care of concurrency, logging and searching the index structure.
+ </para>
+
+ <para>
+ All it takes to get a <acronym>BRIN</acronym> access method working is to
+ implement a few user-defined methods, which define the behavior of
+ summary values stored in the index and the way they interact with
+ scan keys.
+ In short, <acronym>BRIN</acronym> combines
+ extensibility with generality, code reuse, and a clean interface.
+ </para>
+
+ <para>
+ There are three methods that an operator class for <acronym>BRIN</acronym>
+ must provide:
+
+ <variablelist>
+ <varlistentry>
+ <term><function>Datum opcInfo(...)</></term>
+ <listitem>
+ <para>
+ Returns internal information about the summary data stored
+ about indexed columns.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><function>bool consistent(...)</function></term>
+ <listitem>
+ <para>
+ Returns whether the key is consistent with the given index tuple.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><function>bool addValue(...)</function></term>
+ <listitem>
+ <para>
+ Modifies the index tuple to make it consistent with the given
+ indexed data.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+
+ <!-- this needs improvement ... -->
+ To implement these methods in a generic way, normally the opclass
+ defines its own internal support functions. For instance, minmax
+ opclasses add the support functions for the four inequality operators
+ for the datatype.
+ Additionally, the operator class must supply appropriate
+ operator entries,
+ to enable the optimizer to use the index when those operators are
+ used in queries.
+ </para>
+ </sect1>
+ </chapter>
*** a/doc/src/sgml/filelist.sgml
--- b/doc/src/sgml/filelist.sgml
***************
*** 87,92 ****
--- 87,93 ----
<!ENTITY gist SYSTEM "gist.sgml">
<!ENTITY spgist SYSTEM "spgist.sgml">
<!ENTITY gin SYSTEM "gin.sgml">
+ <!ENTITY brin SYSTEM "brin.sgml">
<!ENTITY planstats SYSTEM "planstats.sgml">
<!ENTITY indexam SYSTEM "indexam.sgml">
<!ENTITY nls SYSTEM "nls.sgml">
*** a/doc/src/sgml/indices.sgml
--- b/doc/src/sgml/indices.sgml
***************
*** 116,122 **** CREATE INDEX test1_id_index ON test1 (id);
<para>
<productname>PostgreSQL</productname> provides several index types:
! B-tree, Hash, GiST, SP-GiST and GIN. Each index type uses a different
algorithm that is best suited to different types of queries.
By default, the <command>CREATE INDEX</command> command creates
B-tree indexes, which fit the most common situations.
--- 116,123 ----
<para>
<productname>PostgreSQL</productname> provides several index types:
! B-tree, Hash, GiST, SP-GiST, GIN and BRIN.
! Each index type uses a different
algorithm that is best suited to different types of queries.
By default, the <command>CREATE INDEX</command> command creates
B-tree indexes, which fit the most common situations.
***************
*** 326,331 **** SELECT * FROM places ORDER BY location <-> point '(101,456)' LIMIT 10;
--- 327,365 ----
classes are available in the <literal>contrib</> collection or as separate
projects. For more information see <xref linkend="GIN">.
</para>
+
+ <para>
+ <indexterm>
+ <primary>index</primary>
+ <secondary>BRIN</secondary>
+ </indexterm>
+ <indexterm>
+ <primary>BRIN</primary>
+ <see>index</see>
+ </indexterm>
+ BRIN indexes (a shorthand for Block Range indexes)
+ store summaries about the values stored in consecutive table physical block ranges.
+ Like GiST, SP-GiST and GIN,
+ BRIN can support many different indexing strategies,
+ and the particular operators with which a BRIN index can be used
+ vary depending on the indexing strategy.
+ For datatypes that have a linear sort order, the indexed data
+ corresponds to the minimum and maximum values of the
+ values in the column for each block range,
+ which support indexed queries using these operators:
+
+ <simplelist>
+ <member><literal><</literal></member>
+ <member><literal><=</literal></member>
+ <member><literal>=</literal></member>
+ <member><literal>>=</literal></member>
+ <member><literal>></literal></member>
+ </simplelist>
+
+ The BRIN operator classes included in the standard distribution are
+ documented in <xref linkend="brin-builtin-opclasses-table">.
+ For more information see <xref linkend="BRIN">.
+ </para>
</sect1>
*** a/doc/src/sgml/postgres.sgml
--- b/doc/src/sgml/postgres.sgml
***************
*** 247,252 ****
--- 247,253 ----
&gist;
&spgist;
&gin;
+ &brin;
&storage;
&bki;
&planstats;
*** /dev/null
--- b/minmax-proposal
***************
*** 0 ****
--- 1,306 ----
+ Minmax Range Indexes
+ ====================
+
+ Minmax indexes are a new access method intended to enable very fast scanning of
+ extremely large tables.
+
+ The essential idea of a minmax index is to keep track of summarizing values in
+ consecutive groups of heap pages (page ranges); for example, the minimum and
+ maximum values for datatypes with a btree opclass, or the bounding box for
+ geometric types. These values can be used by constraint exclusion to avoid
+ scanning such pages, depending on query quals.
+
+ The main drawback of this is having to update the stored summary values of each
+ page range as tuples are inserted into them.
+
+ Other database systems already have similar features. Some examples:
+
+ * Oracle Exadata calls this "storage indexes"
+ http://richardfoote.wordpress.com/category/storage-indexes/
+
+ * Netezza has "zone maps"
+ http://nztips.com/2010/11/netezza-integer-join-keys/
+
+ * Infobright has this automatically within their "data packs" according to a
+ May 3rd, 2009 blog post
+ http://www.infobright.org/index.php/organizing_data_and_more_about_rough_data_contest/
+
+ * MonetDB also uses this technique, according to a published paper
+ http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.2662
+ "Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS"
+
+ Index creation
+ --------------
+
+ To create a minmax index, we use the standard syntax:
+
+ CREATE INDEX foo_minmax_idx ON foo USING MINMAX (a, b, e);
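+ 
+ The number of heap pages covered by each index entry is controlled by the
+ pages_per_range storage parameter (default 128), which can be given with the
+ usual WITH clause, e.g. WITH (pages_per_range = 64).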
+
+ Partial indexes are not currently supported; since an index of this kind is
+ concerned with summary values of the involved columns across all the pages in
+ the table, it normally doesn't make sense to exclude some tuples. A partial
+ index might become useful if its predicate were also used in queries, but we
+ exclude them for now for conceptual simplicity.
+
+ Expressional indexes can probably be supported in the future, but we disallow
+ them initially for conceptual simplicity.
+
+ Having multiple minmax indexes in the same table is acceptable, though most of
+ the time it would make more sense to have a single index covering all the
+ interesting columns. Multiple indexes might be useful for columns added later.
+
+ Access Method Design
+ --------------------
+
+ Since item pointers are not stored inside indexes of this type, it is not
+ possible to support the amgettuple interface. Instead, we only provide
+ amgetbitmap support; scanning a relation using this index requires a recheck
+ node on top. The amgetbitmap routine returns a TIDBitmap comprising all pages
+ in those page groups that match the query qualifications. The recheck node
+ prunes tuples that do not actually match the query qualifications.
+
+ For each supported datatype, we need an operator class with the following
+ catalog entries:
+
+ - support operators (pg_amop): same as btree (<, <=, =, >=, >)
+ - support procedures (pg_amproc):
+ * "opcinfo" (procno 1) initializes a structure for index creation or scanning
+ * "addValue" (procno 2) takes an index tuple and a heap item, and possibly
+ changes the index tuple so that it includes the heap item values
+ * "consistent" (procno 3) takes an index tuple and query quals, and returns
+ whether the index tuple values match the query quals.
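+ 
+ At the C level, the call convention for the latter two procedures (as they are
+ invoked from minmax.c) is sketched below. The function names are made up; the
+ real opclass procedures live in mmsortable.c, and the bodies here are only
+ placeholders describing what an implementation must do.
+ 
+   Datum
+   mm_sortable_add_value(PG_FUNCTION_ARGS)      /* procno 2 */
+   {
+       MinmaxDesc      *mmdesc = (MinmaxDesc *) PG_GETARG_POINTER(0);
+       DeformedMMTuple *dtup = (DeformedMMTuple *) PG_GETARG_POINTER(1);
+       uint16           attno = PG_GETARG_UINT16(2);
+       Datum            newval = PG_GETARG_DATUM(3);
+       bool             isnull = PG_GETARG_BOOL(4);
+ 
+       /*
+        * Widen the summary values stored in dtup for column attno so that
+        * they cover newval (or set the null bits if isnull is true); return
+        * true if and only if dtup was modified.
+        */
+       PG_RETURN_BOOL(false);
+   }
+ 
+   Datum
+   mm_sortable_consistent(PG_FUNCTION_ARGS)     /* procno 3 */
+   {
+       MinmaxDesc      *mmdesc = (MinmaxDesc *) PG_GETARG_POINTER(0);
+       DeformedMMTuple *dtup = (DeformedMMTuple *) PG_GETARG_POINTER(1);
+       ScanKey          key = (ScanKey) PG_GETARG_POINTER(2);
+ 
+       /*
+        * Compare the scan key against the summary values in dtup and return
+        * whether the summarized page range could contain matching tuples.
+        */
+       PG_RETURN_BOOL(true);
+   }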
+
+ These are used pervasively:
+
+ - The optimizer requires them to evaluate queries, so that the index is chosen
+ when queries on the indexed table are planned.
+ - During index construction (ambuild), they are used to determine the boundary
+ values for each page range.
+ - During index updates (aminsert), they are used to determine whether the new
+ heap tuple matches the existing index tuple; and if not, they are used to
+ construct the new index tuple.
+
+ In each index tuple (corresponding to one page range), we store:
+ - for each indexed column of a datatype with a btree-opclass:
+ * minimum value across all tuples in the range
+ * maximum value across all tuples in the range
+ * are there nulls present in any tuple?
+ * are all the values in all tuples in the range null?
+
+ Different datatypes store other values instead of min/max, for example
+ geometric types might store a bounding box. The NULL bits are always present.
+
+ These null bits are stored in a single null bitmask of length 2x number of
+ columns.
+
+ With the default INDEX_MAX_KEYS of 32, and considering columns of 8-byte length
+ types such as timestamptz or bigint, each tuple would be 522 bytes in length,
+ which seems reasonable. There are 6 extra bytes for padding between the null
+ bitmask and the first data item, assuming 64-bit alignment; so the total size
+ for such an index tuple would actually be 528 bytes.
+
+ This maximum index tuple size is calculated as: mt_info (2 bytes) + null bitmap
+ (8 bytes) + data value (8 bytes) * 32 * 2
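+ 
+ Spelled out as C, purely to illustrate the arithmetic above:
+ 
+   Size unaligned = 2 + 8 + 8 * 32 * 2;             /* = 522 bytes */
+   Size aligned   = MAXALIGN(2 + 8) + 8 * 32 * 2;   /* = 16 + 512 = 528 bytes */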
+
+ (Of course, larger columns are possible, such as varchar, but creating minmax
+ indexes on such columns seems of little practical usefulness. Also, the
+ usefulness of an index containing so many columns is dubious.)
+
+ There can be gaps where some pages have no covering index entry.
+
+ The Range Reverse Map
+ ---------------------
+
+ To find out the index tuple for a particular page range, we have an internal
+ structure we call the range reverse map. This stores one TID per page range,
+ which is the address of the index tuple summarizing that range. Since these
+ map entries are fixed size, it is possible to compute the address of the range
+ map entry for any given heap page by simple arithmetic.
+
+ When a new heap tuple is inserted in a summarized page range, we compare the
+ existing index tuple with the new heap tuple. If the heap tuple is outside the
+ summarization data given by the index tuple for any indexed column (or if the
+ new heap tuple contains null values but the index tuple indicates there are no
+ nulls), it is necessary to create a new index tuple with the new values. To do
+ this, a new index tuple is inserted, and the reverse range map is updated to
+ point to it. The old index tuple is left in place, for later garbage
+ collection. As an optimization, we sometimes overwrite the old index tuple in
+ place with the new data, which avoids the need for later garbage collection.
+
+ If the reverse range map points to an invalid TID, the corresponding page range
+ is considered to be not summarized.
+
+ To scan a table following a minmax index, we scan the reverse range map
+ sequentially. This yields index tuples in ascending page range order. Query
+ quals are matched to each index tuple; if they match, each page within the page
+ range is returned as part of the output TID bitmap. If there's no match, they
+ are skipped. Reverse range map entries returning invalid index TIDs, that is
+ unsummarized page ranges, are also returned in the TID bitmap.
+
+ To store the range reverse map, we map its logical page numbers to physical
+ pages. We use a large two-level BlockNumber array for this: The metapage
+ contains an array of BlockNumbers; each of these points to a "revmap array
+ page". Each revmap array page contains BlockNumbers, which in turn point to
+ "revmap regular pages", which are the ones that contain the revmap data itself.
+ Therefore, to find a given index tuple, we need to examine the metapage and
+ obtain the revmap array page number; then read the array page. From there we
+ obtain the revmap regular page number, and that one contains the TID we're
+ interested in. As an optimization, regular revmap page number 0 is stored in
+ physical page number 1, that is, the page just after the metapage. This means
+ that scanning a table of about 1300 page ranges (the number of TIDs that fit in
+ a single 8kB page) does not require accessing the metapage at all.
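+ 
+ In rough C terms, locating the revmap entry for a heap block boils down to the
+ following arithmetic (a sketch only; the names are illustrative and the page's
+ special space is ignored, so the real constant in mmrevmap.c will differ
+ slightly):
+ 
+   /* how many fixed-size TIDs fit in one 8kB revmap page */
+   #define TIDS_PER_REVMAP_PAGE \
+       ((BLCKSZ - MAXALIGN(SizeOfPageHeaderData)) / sizeof(ItemPointerData))
+ 
+   rangeNo     = heapBlk / pagesPerRange;           /* which page range */
+   logicalPage = rangeNo / TIDS_PER_REVMAP_PAGE;    /* which regular revmap page */
+   entryIdx    = rangeNo % TIDS_PER_REVMAP_PAGE;    /* which TID within that page */
+ 
+   /* logical revmap page 0 is physical block 1; other logical pages are found
+    * by following metapage -> revmap array page -> revmap regular page */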
+
+ When tuples are added to unsummarized pages, nothing needs to happen.
+
+ Heap tuples can be removed from anywhere without restriction. It might be
+ useful to mark the corresponding index tuple somehow, if the heap tuple is one
+ of the constraining values of the summary data (i.e. either min or max in the
+ case of a btree-opclass-bearing datatype), so that in the future we are aware
+ of the need to re-execute summarization on that range, leading to a possible
+ tightening of the summary values.
+
+ Index entries that are not referenced from the revmap can be removed from the
+ main fork. This currently happens at amvacuumcleanup, though it could be
+ carried out separately; no heap scan is necessary to determine which tuples
+ are unreachable.
+
+ Summarization
+ -------------
+
+ At index creation time, the whole table is scanned; for each page range the
+ summarizing values of each indexed column and nulls bitmap are collected and
+ stored in the index.
+
+ Once in a while, it is necessary to summarize a bunch of unsummarized pages
+ (because the table has grown since the index was created), or re-summarize a
+ range that has been marked invalid. This is simple: scan the page range
+ calculating the summary values for each indexed column, then insert the new
+ index entry at the end of the index.
+
+ The easiest way to go about this seems to be to have vacuum do it. That way we
+ can simply do re-summarization in the amvacuumcleanup routine. Other approaches
+ would require a separate AM routine, which appears unwarranted at this stage.
+
+ Vacuuming
+ ---------
+
+ Vacuuming a table that has a minmax index does not represent a significant
+ challenge. Since no heap TIDs are stored, it's not necessary to scan the index
+ when heap tuples are removed. It might be that some min() value can be
+ incremented, or some max() value can be decremented; but this would represent
+ an optimization opportunity only, not a correctness issue. Perhaps it's
+ simpler to represent this as the need to re-run summarization on the affected
+ page range.
+
+ Note that if there are no indexes on the table other than the minmax index,
+ usage of maintenance_work_mem by vacuum can be decreased significantly, because
+ no detailed index scan needs to take place (and thus it's not necessary for
+ vacuum to save TIDs to remove). This optimization opportunity is best left for
+ future improvement.
+
+ Locking considerations
+ ----------------------
+
+ To read the TID during an index scan, we follow this protocol:
+
+ * read revmap page
+ * obtain share lock on the revmap buffer
+ * read the TID
+ * obtain share lock on buffer of main fork
+ * LockTuple the TID (using the index as relation). A shared lock is
+ sufficient. We need the LockTuple to prevent VACUUM from recycling
+ the index tuple; see below.
+ * release revmap buffer lock
+ * read the index tuple
+ * release the tuple lock
+ * release main fork buffer lock
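+ 
+ In code form the same protocol is roughly the following (a sketch only; the
+ revmap page layout and the variables idxrel, revmapBlk and entryIdx are
+ assumed, and in the patch most of this is wrapped inside
+ mmGetMMTupleForHeapBlock):
+ 
+   Buffer          revmapbuf;
+   Buffer          buf;
+   ItemPointerData tid;
+ 
+   revmapbuf = ReadBuffer(idxrel, revmapBlk);
+   LockBuffer(revmapbuf, BUFFER_LOCK_SHARE);
+   /* assumes the revmap page body is a plain array of TIDs */
+   tid = ((ItemPointer) PageGetContents(BufferGetPage(revmapbuf)))[entryIdx];
+   buf = ReadBuffer(idxrel, ItemPointerGetBlockNumber(&tid));
+   LockBuffer(buf, BUFFER_LOCK_SHARE);
+   LockTuple(idxrel, &tid, ShareLock);     /* keeps VACUUM from recycling it */
+   LockBuffer(revmapbuf, BUFFER_LOCK_UNLOCK);
+   /* ... read or copy the index tuple from buf at the TID's offset ... */
+   UnlockTuple(idxrel, &tid, ShareLock);
+   LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+   ReleaseBuffer(buf);
+   ReleaseBuffer(revmapbuf);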
+
+
+ To update the summary tuple for a page range, we use this protocol:
+
+ * insert a new index tuple somewhere in the main fork; note its TID
+ * read revmap page
+ * obtain exclusive lock on revmap buffer
+ * write the TID
+ * release lock
+
+ This ensures no concurrent reader can obtain a partially-written TID.
+ Note we don't need a tuple lock here. Concurrent scans don't have to
+ worry about whether they got the old or new index tuple: if they get the
+ old one, the tighter values are okay from a correctness standpoint because
+ due to MVCC they can't possibly see the just-inserted heap tuples anyway.
+
+
+ For vacuuming, we need to figure out which index tuples are no longer
+ referenced from the reverse range map. This requires some brute force,
+ but is simple:
+
+ 1) scan the complete index, store each existing TID in a dynahash.
+ Hash key is the TID, hash value is a boolean initially set to false.
+ 2) scan the complete revmap sequentially, read the TIDs on each page. Share
+ lock on each page is sufficient. For each TID so obtained, grab the
+ element from the hash and update the boolean to true.
+ 3) Scan the index again; for each tuple found, search the hash table.
+ If the tuple is not present in hash, it must have been added after our
+ initial scan; ignore it. If tuple is present in hash, and the hash flag is
+ true, then the tuple is referenced from the revmap; ignore it. If the hash
+ flag is false, then the index tuple is no longer referenced by the revmap;
+ but it could be about to be accessed by a concurrent scan. Do
+ ConditionalLockTuple. If this fails, ignore the tuple (it's in use); it
+ will be deleted by a future vacuum. If the lock is acquired, then we can safely
+ remove the index tuple.
+ 4) Index pages with free space can be detected by this second scan. Register
+ those with the FSM.
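+ 
+ A minimal sketch of how steps 1-3 could look using a dynahash table (nothing
+ of this is implemented yet; all names and details here are illustrative only,
+ and the surrounding loops over the index and the revmap, as well as the index
+ relation variable, are elided):
+ 
+   typedef struct VacTidEntry
+   {
+       ItemPointerData tid;            /* hash key */
+       bool            referenced;     /* seen while scanning the revmap? */
+   } VacTidEntry;
+ 
+   HASHCTL         ctl;
+   HTAB           *tidhash;
+   VacTidEntry    *entry;
+   ItemPointerData tid;
+   bool            found;
+ 
+   MemSet(&ctl, 0, sizeof(ctl));
+   ctl.keysize = sizeof(ItemPointerData);
+   ctl.entrysize = sizeof(VacTidEntry);
+   ctl.hash = tag_hash;
+   tidhash = hash_create("minmax vacuum TIDs", 1024, &ctl,
+                         HASH_ELEM | HASH_FUNCTION);
+ 
+   /* pass 1: each index tuple's TID enters the hash, marked unreferenced */
+   entry = (VacTidEntry *) hash_search(tidhash, &tid, HASH_ENTER, &found);
+   entry->referenced = false;
+ 
+   /* pass 2: each TID read from the revmap flags its entry, if present */
+   entry = (VacTidEntry *) hash_search(tidhash, &tid, HASH_FIND, &found);
+   if (found)
+       entry->referenced = true;
+ 
+   /* pass 3: unreferenced tuples can be removed if no scan holds them;
+    * ExclusiveLock conflicts with the ShareLock taken by readers */
+   if (entry != NULL && !entry->referenced &&
+       ConditionalLockTuple(idxrel, &tid, ExclusiveLock))
+   {
+       /* delete the index tuple here, then UnlockTuple() */
+   }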
+
+ Note this doesn't require scanning the heap at all, or being involved in
+ the heap's cleanup procedure. Also, there is no need to LockBufferForCleanup,
+ which is a nice property because index scans keep pages pinned for long
+ periods.
+
+
+
+ Optimizer
+ ---------
+
+ In order to make this all work, the only thing we need to do is ensure we have a
+ good enough opclass and amcostestimate. With this, the optimizer is able to pick
+ up the index on its own.
+
+
+ Open questions
+ --------------
+
+ * Same-size page ranges?
+ Current related literature seems to consider that each "index entry" in a
+ minmax index must cover the same number of pages. There doesn't seem to be a
+ hard reason for this to be so; it might make sense to allow the index to
+ self-tune so that some index entries cover smaller page ranges, if this allows
+ the summary values to be more compact. This would incur larger minmax
+ overhead for the index itself, but might allow better pruning of page ranges
+ during scan. In the limit of one index tuple per page, the index itself would
+ occupy too much space, even though we would be able to skip reading most
+ heap pages, because the summary values are tight; in the opposite limit of
+ a single tuple that summarizes the whole table, we wouldn't be able to prune
+ anything even though the index is very small. This can probably be made to work
+ by using the reverse range map as an index in itself.
+
+ * More compact representation for TIDBitmap?
+ TIDBitmap is the structure used to represent bitmap scans. The
+ representation of lossy page ranges is not optimal for our purposes, because
+ it uses a Bitmapset to represent pages in the range; since we're going to return
+ all pages in a large range, it might be more convenient to allow for a
+ struct that uses start and end page numbers to represent the range, instead.
+
+
+
+ References:
+
+ Email thread on pgsql-hackers
+ http://www.postgresql.org/message-id/1199296574.7260.149.camel@ebony.site
+ From: Simon Riggs
+ To: pgsql-hackers
+ Subject: Dynamic Partitioning using Segment Visibility Map
+
+ http://wiki.postgresql.org/wiki/Segment_Exclusion
+ http://wiki.postgresql.org/wiki/Segment_Visibility_Map
+
*** a/src/backend/access/Makefile
--- b/src/backend/access/Makefile
***************
*** 8,13 **** subdir = src/backend/access
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
! SUBDIRS = common gin gist hash heap index nbtree rmgrdesc spgist transam
include $(top_srcdir)/src/backend/common.mk
--- 8,13 ----
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
! SUBDIRS = common gin gist hash heap index minmax nbtree rmgrdesc spgist transam
include $(top_srcdir)/src/backend/common.mk
*** a/src/backend/access/common/reloptions.c
--- b/src/backend/access/common/reloptions.c
***************
*** 209,214 **** static relopt_int intRelOpts[] =
--- 209,221 ----
RELOPT_KIND_HEAP | RELOPT_KIND_TOAST
}, -1, 0, 2000000000
},
+ {
+ {
+ "pages_per_range",
+ "Number of pages that each page range covers in a Minmax index",
+ RELOPT_KIND_MINMAX
+ }, 128, 1, 131072
+ },
/* list terminator */
{{NULL}}
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 271,276 **** initscan(HeapScanDesc scan, ScanKey key, bool is_rescan)
--- 271,278 ----
scan->rs_startblock = 0;
}
+ scan->rs_initblock = 0;
+ scan->rs_numblocks = InvalidBlockNumber;
scan->rs_inited = false;
scan->rs_ctup.t_data = NULL;
ItemPointerSetInvalid(&scan->rs_ctup.t_self);
***************
*** 296,301 **** initscan(HeapScanDesc scan, ScanKey key, bool is_rescan)
--- 298,311 ----
pgstat_count_heap_scan(scan->rs_rd);
}
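+ 
+ /*
+  * heap_setscanlimits - restrict range of a heapscan
+  *
+  * startBlk is the page to start at; numBlks is the number of pages to scan
+  * from there on.  This must be called before the scan is started, i.e.
+  * before the first heapgettup call.
+  */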
+ void
+ heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk, BlockNumber numBlks)
+ {
+ scan->rs_startblock = startBlk;
+ scan->rs_initblock = startBlk;
+ scan->rs_numblocks = numBlks;
+ }
+
/*
* heapgetpage - subroutine for heapgettup()
*
***************
*** 636,642 **** heapgettup(HeapScanDesc scan,
*/
if (backward)
{
! finished = (page == scan->rs_startblock);
if (page == 0)
page = scan->rs_nblocks;
page--;
--- 646,653 ----
*/
if (backward)
{
! finished = (page == scan->rs_startblock) ||
! (scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
if (page == 0)
page = scan->rs_nblocks;
page--;
***************
*** 646,652 **** heapgettup(HeapScanDesc scan,
page++;
if (page >= scan->rs_nblocks)
page = 0;
! finished = (page == scan->rs_startblock);
/*
* Report our new scan position for synchronization purposes. We
--- 657,664 ----
page++;
if (page >= scan->rs_nblocks)
page = 0;
! finished = (page == scan->rs_startblock) ||
! (scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
/*
* Report our new scan position for synchronization purposes. We
***************
*** 897,903 **** heapgettup_pagemode(HeapScanDesc scan,
*/
if (backward)
{
! finished = (page == scan->rs_startblock);
if (page == 0)
page = scan->rs_nblocks;
page--;
--- 909,916 ----
*/
if (backward)
{
! finished = (page == scan->rs_startblock) ||
! (scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
if (page == 0)
page = scan->rs_nblocks;
page--;
***************
*** 907,913 **** heapgettup_pagemode(HeapScanDesc scan,
page++;
if (page >= scan->rs_nblocks)
page = 0;
! finished = (page == scan->rs_startblock);
/*
* Report our new scan position for synchronization purposes. We
--- 920,927 ----
page++;
if (page >= scan->rs_nblocks)
page = 0;
! finished = (page == scan->rs_startblock) ||
! (scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
/*
* Report our new scan position for synchronization purposes. We
*** /dev/null
--- b/src/backend/access/minmax/Makefile
***************
*** 0 ****
--- 1,17 ----
+ #-------------------------------------------------------------------------
+ #
+ # Makefile--
+ # Makefile for access/minmax
+ #
+ # IDENTIFICATION
+ # src/backend/access/minmax/Makefile
+ #
+ #-------------------------------------------------------------------------
+
+ subdir = src/backend/access/minmax
+ top_builddir = ../../../..
+ include $(top_builddir)/src/Makefile.global
+
+ OBJS = minmax.o mmpageops.o mmrevmap.o mmtuple.o mmxlog.o mmsortable.o
+
+ include $(top_srcdir)/src/backend/common.mk
*** /dev/null
--- b/src/backend/access/minmax/minmax.c
***************
*** 0 ****
--- 1,942 ----
+ /*
+ * minmax.c
+ * Implementation of Minmax indexes for Postgres
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/minmax.c
+ *
+ * TODO
+ * * ScalarArrayOpExpr (amsearcharray -> SK_SEARCHARRAY)
+ * * add support for unlogged indexes
+ * * ditto expressional indexes
+ */
+ #include "postgres.h"
+
+ #include "access/minmax.h"
+ #include "access/minmax_internal.h"
+ #include "access/minmax_page.h"
+ #include "access/minmax_pageops.h"
+ #include "access/minmax_xlog.h"
+ #include "access/reloptions.h"
+ #include "access/relscan.h"
+ #include "catalog/index.h"
+ #include "miscadmin.h"
+ #include "pgstat.h"
+ #include "storage/bufmgr.h"
+ #include "storage/freespace.h"
+ #include "utils/rel.h"
+
+
+ /*
+ * We use a MMBuildState during initial construction of a Minmax index.
+ * The running state is kept in a DeformedMMTuple.
+ */
+ typedef struct MMBuildState
+ {
+ Relation irel;
+ int numtuples;
+ Buffer currentInsertBuf;
+ BlockNumber pagesPerRange;
+ BlockNumber currRangeStart;
+ mmRevmapAccess *rmAccess;
+ MinmaxDesc *mmDesc;
+ bool seentup;
+ bool extended;
+ DeformedMMTuple *dtuple;
+ } MMBuildState;
+
+ /*
+ * Struct used as "opaque" during index scans
+ */
+ typedef struct MinmaxOpaque
+ {
+ BlockNumber pagesPerRange;
+ mmRevmapAccess *rmAccess;
+ MinmaxDesc *mmDesc;
+ } MinmaxOpaque;
+
+ static MMBuildState *initialize_mm_buildstate(Relation idxRel,
+ mmRevmapAccess *rmAccess, BlockNumber pagesPerRange);
+ static bool terminate_mm_buildstate(MMBuildState *state);
+ static void summarize_range(MMBuildState *mmstate, Relation heapRel,
+ BlockNumber heapBlk);
+ static void form_and_insert_tuple(MMBuildState *mmstate);
+
+
+ /*
+ * A tuple in the heap is being inserted. To keep a minmax index up to date,
+ * we need to obtain the relevant index tuple, compare its stored values with
+ * those of the new tuple; if the tuple values are consistent with the summary
+ * tuple, there's nothing to do; otherwise we need to update the index.
+ *
+ * If the range is not currently summarized (i.e. the revmap returns InvalidTid
+ * for it), there's nothing to do either.
+ */
+ Datum
+ mminsert(PG_FUNCTION_ARGS)
+ {
+ Relation idxRel = (Relation) PG_GETARG_POINTER(0);
+ Datum *values = (Datum *) PG_GETARG_POINTER(1);
+ bool *nulls = (bool *) PG_GETARG_POINTER(2);
+ ItemPointer heaptid = (ItemPointer) PG_GETARG_POINTER(3);
+
+ /* we ignore the rest of our arguments */
+ BlockNumber pagesPerRange;
+ MinmaxDesc *mmdesc;
+ mmRevmapAccess *rmAccess;
+ OffsetNumber off;
+ MMTuple *mmtup;
+ DeformedMMTuple *dtup;
+ BlockNumber heapBlk;
+ Buffer buf = InvalidBuffer;
+ int keyno;
+ bool need_insert = false;
+ bool extended = false;
+
+ rmAccess = mmRevmapAccessInit(idxRel, &pagesPerRange);
+
+ restart:
+ CHECK_FOR_INTERRUPTS();
+ heapBlk = ItemPointerGetBlockNumber(heaptid);
+ /* normalize the block number to be the first block in the range */
+ heapBlk = (heapBlk / pagesPerRange) * pagesPerRange;
+ mmtup = mmGetMMTupleForHeapBlock(rmAccess, heapBlk, &buf, &off,
+ BUFFER_LOCK_SHARE);
+
+ if (!mmtup)
+ {
+ /* nothing to do, range is unsummarized */
+ mmRevmapAccessTerminate(rmAccess);
+ if (BufferIsValid(buf))
+ ReleaseBuffer(buf);
+ return BoolGetDatum(false);
+ }
+
+ mmdesc = minmax_build_mmdesc(idxRel);
+ dtup = minmax_deform_tuple(mmdesc, mmtup);
+
+ /*
+ * Compare the key values of the new tuple to the stored index values; our
+ * deformed tuple will get updated if the new tuple doesn't fit the
+ * original range (note this means we can't break out of the loop early).
+ * Make a note of whether this happens, so that we know to insert the
+ * modified tuple later.
+ */
+ for (keyno = 0; keyno < mmdesc->md_tupdesc->natts; keyno++)
+ {
+ Datum result;
+ FmgrInfo *addValue;
+
+ addValue = index_getprocinfo(idxRel, keyno + 1,
+ MINMAX_PROCNUM_ADDVALUE);
+ result = FunctionCall5Coll(addValue,
+ idxRel->rd_indcollation[keyno],
+ PointerGetDatum(mmdesc),
+ PointerGetDatum(dtup),
+ UInt16GetDatum(keyno + 1),
+ values[keyno],
+ nulls[keyno]);
+ /* if that returned true, we need to insert the updated tuple */
+ need_insert |= DatumGetBool(result);
+ }
+
+ if (need_insert)
+ {
+ Page page = BufferGetPage(buf);
+ ItemId lp = PageGetItemId(page, off);
+ Size origsz;
+ MMTuple *origtup;
+ Size newsz;
+ MMTuple *newtup;
+ bool samepage;
+
+ /*
+ * Make a copy of the old tuple, so that we can compare it after
+ * re-acquiring the lock.
+ */
+ origsz = ItemIdGetLength(lp);
+ origtup = minmax_copy_tuple(mmtup, origsz);
+
+ /* form the new tuple now, so that its size is known */
+ newtup = minmax_form_tuple(mmdesc, heapBlk, dtup, &newsz);
+ 
+ /* before releasing the lock, check if we can do a same-page update. */
+ if (newsz <= origsz || PageGetExactFreeSpace(page) >= (newsz - origsz))
+ samepage = true;
+ else
+ samepage = false;
+ 
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ /*
+ * Try to update the tuple. If this doesn't work for whatever reason,
+ * we need to restart from the top; the revmap might be pointing at a
+ * different tuple for this block now, so we need to recompute
+ * to ensure both our new heap tuple and the other inserter's are
+ * covered by the combined tuple. It might be that we don't need to
+ * update at all.
+ */
+ if (!mm_doupdate(idxRel, pagesPerRange, rmAccess, heapBlk, buf, off,
+ origtup, origsz, newtup, newsz, samepage, &extended))
+ goto restart;
+ }
+ else
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ ReleaseBuffer(buf);
+
+ mmRevmapAccessTerminate(rmAccess);
+ minmax_free_mmdesc(mmdesc);
+
+ if (extended)
+ FreeSpaceMapVacuum(idxRel);
+
+ return BoolGetDatum(false);
+ }
+
+ /*
+ * Initialize state for a Minmax index scan.
+ *
+ * We read the metapage here to determine the pages-per-range number that this
+ * index was built with. Note that since this cannot be changed while we're
+ * holding lock on index, it's not necessary to recompute it during mmrescan.
+ */
+ Datum
+ mmbeginscan(PG_FUNCTION_ARGS)
+ {
+ Relation r = (Relation) PG_GETARG_POINTER(0);
+ int nkeys = PG_GETARG_INT32(1);
+ int norderbys = PG_GETARG_INT32(2);
+ IndexScanDesc scan;
+ MinmaxOpaque *opaque;
+
+ scan = RelationGetIndexScan(r, nkeys, norderbys);
+
+ opaque = (MinmaxOpaque *) palloc(sizeof(MinmaxOpaque));
+ opaque->rmAccess = mmRevmapAccessInit(r, &opaque->pagesPerRange);
+ opaque->mmDesc = minmax_build_mmdesc(r);
+ scan->opaque = opaque;
+
+ PG_RETURN_POINTER(scan);
+ }
+
+ /*
+ * Execute the index scan.
+ *
+ * This works by reading index TIDs from the revmap, and obtaining the index
+ * tuples pointed to by them; the summary values in the index tuples are
+ * compared to the scan keys. We return into the TID bitmap all the pages in
+ * ranges corresponding to index tuples that match the scan keys.
+ *
+ * If a TID from the revmap is read as InvalidTID, we know that range is
+ * unsummarized. Pages in those ranges need to be returned regardless of scan
+ * keys.
+ *
+ * XXX see _bt_first on what to do about sk_subtype.
+ */
+ Datum
+ mmgetbitmap(PG_FUNCTION_ARGS)
+ {
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+ TIDBitmap *tbm = (TIDBitmap *) PG_GETARG_POINTER(1);
+ Relation idxRel = scan->indexRelation;
+ Buffer buf = InvalidBuffer;
+ MinmaxDesc *mmdesc;
+ Oid heapOid;
+ Relation heapRel;
+ MinmaxOpaque *opaque;
+ BlockNumber nblocks;
+ BlockNumber heapBlk;
+ int totalpages = 0;
+ int keyno;
+ FmgrInfo *consistentFn;
+
+ opaque = (MinmaxOpaque *) scan->opaque;
+ mmdesc = opaque->mmDesc;
+ pgstat_count_index_scan(idxRel);
+
+ /*
+ * XXX We need to know the size of the table so that we know how long to
+ * iterate on the revmap. There's room for improvement here, in that we
+ * could have the revmap tell us when to stop iterating.
+ */
+ heapOid = IndexGetRelation(RelationGetRelid(idxRel), false);
+ heapRel = heap_open(heapOid, AccessShareLock);
+ nblocks = RelationGetNumberOfBlocks(heapRel);
+ heap_close(heapRel, AccessShareLock);
+
+ /*
+ * Obtain consistent functions for all indexed columns. Maybe it'd be
+ * possible to do this lazily only the first time we see a scan key that
+ * involves each particular attribute.
+ */
+ consistentFn = palloc(sizeof(FmgrInfo) * mmdesc->md_tupdesc->natts);
+ for (keyno = 0; keyno < mmdesc->md_tupdesc->natts; keyno++)
+ {
+ FmgrInfo *tmp;
+
+ tmp = index_getprocinfo(idxRel, keyno + 1, MINMAX_PROCNUM_CONSISTENT);
+ fmgr_info_copy(&consistentFn[keyno], tmp, CurrentMemoryContext);
+ }
+
+ /*
+ * Now scan the revmap. We start by querying for heap page 0,
+ * incrementing by the number of pages per range; this gives us a full
+ * view of the table.
+ */
+ for (heapBlk = 0; heapBlk < nblocks; heapBlk += opaque->pagesPerRange)
+ {
+ bool addrange;
+ OffsetNumber off;
+ MMTuple *tup;
+
+ CHECK_FOR_INTERRUPTS();
+
+ tup = mmGetMMTupleForHeapBlock(opaque->rmAccess, heapBlk, &buf, &off,
+ BUFFER_LOCK_SHARE);
+ /*
+ * For page ranges with no indexed tuple, we must return the whole
+ * range; otherwise, compare it to the scan keys.
+ */
+ if (tup == NULL)
+ {
+ addrange = true;
+ }
+ else
+ {
+ DeformedMMTuple *dtup;
+ int keyno;
+
+ dtup = minmax_deform_tuple(mmdesc, tup);
+
+ /*
+ * Compare scan keys with summary values stored for the range. If
+ * scan keys are matched, the page range must be added to the
+ * bitmap. We initially assume the range needs to be added; in
+ * particular this serves the case where there are no keys.
+ */
+ addrange = true;
+ for (keyno = 0; keyno < scan->numberOfKeys; keyno++)
+ {
+ ScanKey key = &scan->keyData[keyno];
+ AttrNumber keyattno = key->sk_attno;
+ Datum add;
+
+ /*
+ * The collation of the scan key must match the collation used
+ * in the index column. Otherwise we shouldn't be using this
+ * index ...
+ */
+ Assert(key->sk_collation ==
+ mmdesc->md_tupdesc->attrs[keyattno - 1]->attcollation);
+
+ /*
+ * Check whether the scan key is consistent with the page range
+ * values; if so, have the pages in the range added to the
+ * output bitmap.
+ *
+ * When there are multiple scan keys, failure to meet the
+ * criteria for a single one of them is enough to discard the
+ * range as a whole, so break out of the loop as soon as a
+ * false return value is obtained.
+ */
+ add = FunctionCall3Coll(&consistentFn[keyattno - 1],
+ key->sk_collation,
+ PointerGetDatum(mmdesc),
+ PointerGetDatum(dtup),
+ PointerGetDatum(key));
+ addrange = DatumGetBool(add);
+ if (!addrange)
+ break;
+ }
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ pfree(dtup);
+ }
+
+ /* add the pages in the range to the output bitmap, if needed */
+ if (addrange)
+ {
+ BlockNumber pageno;
+
+ for (pageno = heapBlk;
+ pageno <= Min(nblocks, heapBlk + opaque->pagesPerRange) - 1;
+ pageno++)
+ {
+ tbm_add_page(tbm, pageno);
+ totalpages++;
+ }
+ }
+ }
+
+ if (buf != InvalidBuffer)
+ ReleaseBuffer(buf);
+
+ /*
+ * XXX We have an approximation of the number of *pages* that our scan
+ * returns, but we don't have a precise idea of the number of heap tuples
+ * involved.
+ */
+ PG_RETURN_INT64(totalpages * 10);
+ }
+
+ /*
+ * Re-initialize state for a minmax index scan
+ */
+ Datum
+ mmrescan(PG_FUNCTION_ARGS)
+ {
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+ ScanKey scankey = (ScanKey) PG_GETARG_POINTER(1);
+ /* other arguments ignored */
+
+ if (scankey && scan->numberOfKeys > 0)
+ memmove(scan->keyData, scankey,
+ scan->numberOfKeys * sizeof(ScanKeyData));
+
+ PG_RETURN_VOID();
+ }
+
+ /*
+ * Close down a minmax index scan
+ */
+ Datum
+ mmendscan(PG_FUNCTION_ARGS)
+ {
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+ MinmaxOpaque *opaque = (MinmaxOpaque *) scan->opaque;
+
+ mmRevmapAccessTerminate(opaque->rmAccess);
+ minmax_free_mmdesc(opaque->mmDesc);
+ pfree(opaque);
+
+ PG_RETURN_VOID();
+ }
+
+ Datum
+ mmmarkpos(PG_FUNCTION_ARGS)
+ {
+ elog(ERROR, "MinMax does not support mark/restore");
+ PG_RETURN_VOID();
+ }
+
+ Datum
+ mmrestrpos(PG_FUNCTION_ARGS)
+ {
+ elog(ERROR, "MinMax does not support mark/restore");
+ PG_RETURN_VOID();
+ }
+
+ /*
+ * Per-heap-tuple callback for IndexBuildHeapScan.
+ *
+ * Note we don't worry about the page range at the end of the table here; it is
+ * present in the build state struct after we're called the last time, but not
+ * inserted into the index. The caller must take care of doing that, if appropriate.
+ */
+ static void
+ mmbuildCallback(Relation index,
+ HeapTuple htup,
+ Datum *values,
+ bool *isnull,
+ bool tupleIsAlive,
+ void *state)
+ {
+ MMBuildState *mmstate = (MMBuildState *) state;
+ BlockNumber thisblock;
+ int i;
+
+ thisblock = ItemPointerGetBlockNumber(&htup->t_self);
+
+ /*
+ * If we're in a new block which belongs to the next range, summarize what
+ * we've got and start afresh.
+ */
+ if (thisblock > (mmstate->currRangeStart + mmstate->pagesPerRange - 1))
+ {
+
+ MINMAX_elog(DEBUG2, "mmbuildCallback: completed a range: %u--%u",
+ mmstate->currRangeStart,
+ mmstate->currRangeStart + mmstate->pagesPerRange);
+
+ /* create the index tuple and insert it */
+ form_and_insert_tuple(mmstate);
+
+ /* set state to correspond to the next range */
+ mmstate->currRangeStart += mmstate->pagesPerRange;
+
+ /* re-initialize state for it */
+ minmax_dtuple_initialize(mmstate->dtuple, mmstate->mmDesc);
+ }
+
+ /* Accumulate the current tuple into the running state */
+ mmstate->seentup = true;
+ for (i = 0; i < mmstate->mmDesc->md_tupdesc->natts; i++)
+ {
+ FmgrInfo *addValue;
+
+ addValue = index_getprocinfo(index, i + 1,
+ MINMAX_PROCNUM_ADDVALUE);
+
+ /*
+ * Update dtuple state, if and as necessary.
+ */
+ FunctionCall5Coll(addValue,
+ mmstate->mmDesc->md_tupdesc->attrs[i]->attcollation,
+ PointerGetDatum(mmstate->mmDesc),
+ PointerGetDatum(mmstate->dtuple),
+ UInt16GetDatum(i + 1), values[i], isnull[i]);
+ }
+ }
+
+ /*
+ * mmbuild() -- build a new minmax index.
+ */
+ Datum
+ mmbuild(PG_FUNCTION_ARGS)
+ {
+ Relation heap = (Relation) PG_GETARG_POINTER(0);
+ Relation index = (Relation) PG_GETARG_POINTER(1);
+ IndexInfo *indexInfo = (IndexInfo *) PG_GETARG_POINTER(2);
+ IndexBuildResult *result;
+ double reltuples;
+ double idxtuples;
+ mmRevmapAccess *rmAccess;
+ MMBuildState *mmstate;
+ Buffer meta;
+ BlockNumber pagesPerRange;
+
+ /*
+ * We expect to be called exactly once for any index relation.
+ */
+ if (RelationGetNumberOfBlocks(index) != 0)
+ elog(ERROR, "index \"%s\" already contains data",
+ RelationGetRelationName(index));
+
+ /* partial indexes not supported */
+ if (indexInfo->ii_Predicate != NIL)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("partial indexes not supported")));
+ /* expressions not supported (yet?) */
+ if (indexInfo->ii_Expressions != NIL)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("expression indexes not supported")));
+
+ /*
+ * Critical section not required, because on error the creation of the
+ * whole relation will be rolled back.
+ */
+
+ meta = ReadBuffer(index, P_NEW);
+ Assert(BufferGetBlockNumber(meta) == MINMAX_METAPAGE_BLKNO);
+ LockBuffer(meta, BUFFER_LOCK_EXCLUSIVE);
+
+ mm_metapage_init(BufferGetPage(meta), MinmaxGetPagesPerRange(index),
+ MINMAX_CURRENT_VERSION);
+ MarkBufferDirty(meta);
+
+ if (RelationNeedsWAL(index))
+ {
+ xl_minmax_createidx xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata;
+ Page page;
+
+ xlrec.node = index->rd_node;
+ xlrec.version = MINMAX_CURRENT_VERSION;
+ xlrec.pagesPerRange = MinmaxGetPagesPerRange(index);
+
+ rdata.buffer = InvalidBuffer;
+ rdata.data = (char *) &xlrec;
+ rdata.len = SizeOfMinmaxCreateIdx;
+ rdata.next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_CREATE_INDEX, &rdata);
+
+ page = BufferGetPage(meta);
+ PageSetLSN(page, recptr);
+ }
+
+ UnlockReleaseBuffer(meta);
+
+ /*
+ * Initialize our state, including the deformed tuple state.
+ */
+ rmAccess = mmRevmapAccessInit(index, &pagesPerRange);
+ mmstate = initialize_mm_buildstate(index, rmAccess, pagesPerRange);
+
+ /*
+ * Now scan the relation. No syncscan allowed here because we want the
+ * heap blocks in physical order.
+ */
+ reltuples = IndexBuildHeapScan(heap, index, indexInfo, false,
+ mmbuildCallback, (void *) mmstate);
+
+ /* process the final batch */
+ form_and_insert_tuple(mmstate);
+
+ /* release resources */
+ idxtuples = mmstate->numtuples;
+ mmRevmapAccessTerminate(mmstate->rmAccess);
+ if (terminate_mm_buildstate(mmstate))
+ FreeSpaceMapVacuum(index);
+
+ /*
+ * Return statistics
+ */
+ result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+
+ result->heap_tuples = reltuples;
+ result->index_tuples = idxtuples;
+
+ PG_RETURN_POINTER(result);
+ }
+
+ Datum
+ mmbuildempty(PG_FUNCTION_ARGS)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("unlogged MinMax indexes are not supported")));
+
+ PG_RETURN_VOID();
+ }
+
+ /*
+ * mmbulkdelete
+ * Since there are no per-heap-tuple index tuples in minmax indexes,
+ * there's not a lot we can do here.
+ *
+ * XXX we could mark item tuples as "dirty" (when a minimum or maximum heap
+ * tuple is deleted), signalling the need to re-run summarization on the
+ * affected range. That would need an extra flag in mmtuples.
+ */
+ Datum
+ mmbulkdelete(PG_FUNCTION_ARGS)
+ {
+ /* other arguments are not currently used */
+ IndexBulkDeleteResult *stats = (IndexBulkDeleteResult *) PG_GETARG_POINTER(1);
+
+ /* allocate stats if first time through, else re-use existing struct */
+ if (stats == NULL)
+ stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+ PG_RETURN_POINTER(stats);
+ }
+
+ /*
+ * This routine is in charge of "vacuuming" a minmax index: we just summarize
+ * ranges that are currently unsummarized.
+ */
+ Datum
+ mmvacuumcleanup(PG_FUNCTION_ARGS)
+ {
+ IndexVacuumInfo *info = (IndexVacuumInfo *) PG_GETARG_POINTER(0);
+ IndexBulkDeleteResult *stats = (IndexBulkDeleteResult *) PG_GETARG_POINTER(1);
+ mmRevmapAccess *rmAccess;
+ MMBuildState *mmstate = NULL;
+ Relation heapRel;
+ BlockNumber heapNumBlocks;
+ BlockNumber heapBlk;
+ BlockNumber pagesPerRange;
+ Buffer buf;
+
+ /* No-op in ANALYZE ONLY mode */
+ if (info->analyze_only)
+ PG_RETURN_POINTER(stats);
+
+ heapRel = heap_open(IndexGetRelation(RelationGetRelid(info->index), false),
+ AccessShareLock);
+
+ rmAccess = mmRevmapAccessInit(info->index, &pagesPerRange);
+
+ /*
+ * Scan the revmap to find unsummarized items.
+ */
+ buf = InvalidBuffer;
+ heapNumBlocks = RelationGetNumberOfBlocks(heapRel);
+ for (heapBlk = 0; heapBlk < heapNumBlocks; heapBlk += pagesPerRange)
+ {
+ MMTuple *tup;
+ OffsetNumber off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ tup = mmGetMMTupleForHeapBlock(rmAccess, heapBlk, &buf, &off,
+ BUFFER_LOCK_SHARE);
+ if (tup == NULL)
+ {
+ /* no revmap entry for this heap range. Summarize it. */
+ if (mmstate == NULL)
+ mmstate = initialize_mm_buildstate(info->index, rmAccess,
+ pagesPerRange);
+ summarize_range(mmstate, heapRel, heapBlk);
+ }
+ else
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ }
+
+ if (BufferIsValid(buf))
+ ReleaseBuffer(buf);
+
+ /* free resources */
+ mmRevmapAccessTerminate(rmAccess);
+ if (mmstate && terminate_mm_buildstate(mmstate))
+ FreeSpaceMapVacuum(info->index);
+
+ heap_close(heapRel, AccessShareLock);
+
+ PG_RETURN_POINTER(stats);
+ }
+
+ /*
+ * reloptions processor for minmax indexes
+ */
+ Datum
+ mmoptions(PG_FUNCTION_ARGS)
+ {
+ Datum reloptions = PG_GETARG_DATUM(0);
+ bool validate = PG_GETARG_BOOL(1);
+ relopt_value *options;
+ MinmaxOptions *rdopts;
+ int numoptions;
+ static const relopt_parse_elt tab[] = {
+ {"pages_per_range", RELOPT_TYPE_INT, offsetof(MinmaxOptions, pagesPerRange)}
+ };
+
+ options = parseRelOptions(reloptions, validate, RELOPT_KIND_MINMAX,
+ &numoptions);
+
+ /* if none set, we're done */
+ if (numoptions == 0)
+ PG_RETURN_NULL();
+
+ rdopts = allocateReloptStruct(sizeof(MinmaxOptions), options, numoptions);
+
+ fillRelOptions((void *) rdopts, sizeof(MinmaxOptions), options, numoptions,
+ validate, tab, lengthof(tab));
+
+ pfree(options);
+
+ PG_RETURN_BYTEA_P(rdopts);
+ }
+
+ /*
+ * Initialize a page with the given type.
+ *
+ * Caller is responsible for marking it dirty, as appropriate.
+ */
+ void
+ mm_page_init(Page page, uint16 type)
+ {
+ MinmaxSpecialSpace *special;
+
+ PageInit(page, BLCKSZ, sizeof(MinmaxSpecialSpace));
+
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
+ special->type = type;
+ }
+
+
+ /*
+ * Initialize a new minmax index's metapage.
+ */
+ void
+ mm_metapage_init(Page page, BlockNumber pagesPerRange, uint16 version)
+ {
+ MinmaxMetaPageData *metadata;
+
+ mm_page_init(page, MINMAX_PAGETYPE_META);
+
+ metadata = (MinmaxMetaPageData *) PageGetContents(page);
+
+ metadata->minmaxMagic = MINMAX_META_MAGIC;
+ metadata->minmaxVersion = version;
+ metadata->pagesPerRange = pagesPerRange;
+
+ /*
+ * Note we cheat here a little. 0 is not a valid revmap block number
+ * (because it's the metapage buffer), but doing this enables the first
+ * revmap page to be created when the index is.
+ */
+ metadata->lastRevmapPage = 0;
+ }
+
+ /*
+ * Build a MinmaxDesc used to create or scan a minmax index
+ */
+ MinmaxDesc *
+ minmax_build_mmdesc(Relation rel)
+ {
+ MinmaxOpcInfo **opcinfo;
+ MinmaxDesc *mmdesc;
+ TupleDesc tupdesc;
+ int totalstored = 0;
+ int keyno;
+ long totalsize;
+
+ tupdesc = RelationGetDescr(rel);
+ IncrTupleDescRefCount(tupdesc);
+
+ /*
+ * Obtain MinmaxOpcInfo for each indexed column. While at it, accumulate
+ * the number of columns stored, since the number is opclass-defined.
+ */
+ opcinfo = (MinmaxOpcInfo **) palloc(sizeof(MinmaxOpcInfo *) * tupdesc->natts);
+ for (keyno = 0; keyno < tupdesc->natts; keyno++)
+ {
+ FmgrInfo *opcInfoFn;
+
+ opcInfoFn = index_getprocinfo(rel, keyno + 1, MINMAX_PROCNUM_OPCINFO);
+
+ /* actually FunctionCall0 but we don't have that */
+ opcinfo[keyno] = (MinmaxOpcInfo *)
+ DatumGetPointer(FunctionCall1(opcInfoFn, InvalidOid));
+ totalstored += opcinfo[keyno]->oi_nstored;
+ }
+
+ /* Allocate our result struct and fill it in */
+ totalsize = offsetof(MinmaxDesc, md_info) +
+ sizeof(MinmaxOpcInfo *) * tupdesc->natts;
+
+ mmdesc = palloc(totalsize);
+ mmdesc->md_index = rel;
+ mmdesc->md_tupdesc = tupdesc;
+ mmdesc->md_disktdesc = NULL; /* generated lazily */
+ mmdesc->md_totalstored = totalstored;
+
+ for (keyno = 0; keyno < tupdesc->natts; keyno++)
+ mmdesc->md_info[keyno] = opcinfo[keyno];
+ pfree(opcinfo);
+
+ return mmdesc;
+ }
+
+ void
+ minmax_free_mmdesc(MinmaxDesc *mmdesc)
+ {
+ int keyno;
+
+ for (keyno = 0; keyno < mmdesc->md_tupdesc->natts; keyno++)
+ pfree(mmdesc->md_info[keyno]);
+ DecrTupleDescRefCount(mmdesc->md_tupdesc);
+ pfree(mmdesc);
+ }
+
+ /*
+ * Initialize a MMBuildState appropriate to create tuples on the given index.
+ */
+ static MMBuildState *
+ initialize_mm_buildstate(Relation idxRel, mmRevmapAccess *rmAccess,
+ BlockNumber pagesPerRange)
+ {
+ MMBuildState *mmstate;
+
+ mmstate = palloc(sizeof(MMBuildState));
+
+ mmstate->irel = idxRel;
+ mmstate->numtuples = 0;
+ mmstate->currentInsertBuf = InvalidBuffer;
+ mmstate->pagesPerRange = pagesPerRange;
+ mmstate->currRangeStart = 0;
+ mmstate->rmAccess = rmAccess;
+ mmstate->mmDesc = minmax_build_mmdesc(idxRel);
+ mmstate->seentup = false;
+ mmstate->extended = false;
+ mmstate->dtuple = minmax_new_dtuple(mmstate->mmDesc);
+
+ minmax_dtuple_initialize(mmstate->dtuple, mmstate->mmDesc);
+
+ return mmstate;
+ }
+
+ /*
+ * Release resources associated with a MMBuildState. Returns whether the FSM
+ * should be vacuumed afterwards.
+ */
+ static bool
+ terminate_mm_buildstate(MMBuildState *mmstate)
+ {
+ bool vacuumfsm;
+
+ /* release the last index buffer used */
+ if (!BufferIsInvalid(mmstate->currentInsertBuf))
+ {
+ Page page;
+
+ page = BufferGetPage(mmstate->currentInsertBuf);
+ RecordPageWithFreeSpace(mmstate->irel,
+ BufferGetBlockNumber(mmstate->currentInsertBuf),
+ PageGetFreeSpace(page));
+ ReleaseBuffer(mmstate->currentInsertBuf);
+ }
+ vacuumfsm = mmstate->extended;
+
+ minmax_free_mmdesc(mmstate->mmDesc);
+ pfree(mmstate->dtuple);
+ pfree(mmstate);
+
+ return vacuumfsm;
+ }
+
+ /*
+ * Summarize the given page range of the given index.
+ */
+ static void
+ summarize_range(MMBuildState *mmstate, Relation heapRel, BlockNumber heapBlk)
+ {
+ IndexInfo *indexInfo;
+
+ indexInfo = BuildIndexInfo(mmstate->irel);
+
+ mmstate->currRangeStart = heapBlk;
+
+ /*
+ * Execute the partial heap scan covering the heap blocks in the
+ * specified page range, summarizing the heap tuples in it. This scan
+ * stops just short of mmbuildCallback creating the new index entry.
+ */
+ IndexBuildHeapRangeScan(heapRel, mmstate->irel, indexInfo, false,
+ heapBlk, mmstate->pagesPerRange,
+ mmbuildCallback, (void *) mmstate);
+
+ /*
+ * Create the index tuple and insert it. Note mmbuildCallback didn't
+ * have the chance to actually insert anything into the index, because
+ * the heapscan should have ended just as it reached the final tuple in
+ * the range.
+ */
+ form_and_insert_tuple(mmstate);
+
+ /* and re-initialize state for the next range */
+ minmax_dtuple_initialize(mmstate->dtuple, mmstate->mmDesc);
+ }
+
+ /*
+ * Given a deformed tuple in the build state, convert it into the on-disk
+ * format and insert it into the index, making the revmap point to it.
+ */
+ static void
+ form_and_insert_tuple(MMBuildState *mmstate)
+ {
+ MMTuple *tup;
+ Size size;
+
+ /* if we haven't seen any heap tuple yet, don't insert anything */
+ if (!mmstate->seentup)
+ return;
+
+ tup = minmax_form_tuple(mmstate->mmDesc, mmstate->currRangeStart,
+ mmstate->dtuple, &size);
+ mm_doinsert(mmstate->irel, mmstate->pagesPerRange, mmstate->rmAccess,
+ &mmstate->currentInsertBuf, mmstate->currRangeStart,
+ tup, size, &mmstate->extended);
+ mmstate->numtuples++;
+ pfree(tup);
+
+ mmstate->seentup = false;
+ }
*** /dev/null
--- b/src/backend/access/minmax/mmpageops.c
***************
*** 0 ****
--- 1,638 ----
+ /*
+ * mmpageops.c
+ * Page-handling routines for Minmax indexes
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmpageops.c
+ */
+ #include "postgres.h"
+
+ #include "access/minmax_pageops.h"
+ #include "access/minmax_page.h"
+ #include "access/minmax_revmap.h"
+ #include "access/minmax_xlog.h"
+ #include "miscadmin.h"
+ #include "storage/bufmgr.h"
+ #include "storage/freespace.h"
+ #include "storage/lmgr.h"
+ #include "storage/smgr.h"
+ #include "utils/rel.h"
+
+
+ static Buffer mm_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
+ bool *was_extended);
+ static Size mm_page_get_freespace(Page page);
+
+
+ /*
+ * Update tuple origtup (size origsz), located in offset oldoff of buffer
+ * oldbuf, to newtup (size newsz) as summary tuple for the page range starting
+ * at heapBlk. If samepage is true, then attempt to put the new tuple in the same
+ * page, otherwise use some other one.
+ *
+ * If the update is done, return true; the revmap is updated to point to the
+ * new tuple. If the update is not done for whatever reason, return false.
+ * Caller may retry the update if this happens.
+ *
+ * If the index had to be extended in the course of this operation, *extended
+ * is set to true.
+ */
+ bool
+ mm_doupdate(Relation idxrel, BlockNumber pagesPerRange,
+ mmRevmapAccess *rmAccess, BlockNumber heapBlk,
+ Buffer oldbuf, OffsetNumber oldoff,
+ const MMTuple *origtup, Size origsz,
+ const MMTuple *newtup, Size newsz,
+ bool samepage, bool *extended)
+ {
+ Page oldpage;
+ ItemId origlp;
+ MMTuple *oldtup;
+ Size oldsz;
+ Buffer newbuf;
+ MinmaxSpecialSpace *special;
+
+ if (!samepage)
+ {
+ /* need a page on which to put the item */
+ newbuf = mm_getinsertbuffer(idxrel, oldbuf, newsz, extended);
+ if (!BufferIsValid(newbuf))
+ return false;
+
+ /*
+ * Note: it's possible (though unlikely) that the returned newbuf is
+ * the same as oldbuf, if mm_getinsertbuffer determined that the old
+ * buffer does in fact have enough space.
+ */
+ if (newbuf == oldbuf)
+ newbuf = InvalidBuffer;
+ }
+ else
+ {
+ LockBuffer(oldbuf, BUFFER_LOCK_EXCLUSIVE);
+ newbuf = InvalidBuffer;
+ }
+ oldpage = BufferGetPage(oldbuf);
+ origlp = PageGetItemId(oldpage, oldoff);
+
+ /* Check that the old tuple wasn't updated concurrently */
+ if (!ItemIdIsNormal(origlp))
+ {
+ LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+ return false;
+ }
+
+ oldsz = ItemIdGetLength(origlp);
+ oldtup = (MMTuple *) PageGetItem(oldpage, origlp);
+
+ /*
+ * If both tuples are identical, there is nothing to do; except that if we
+ * were requested to move the tuple across pages, we do it even if they are
+ * equal.
+ */
+ if (samepage && minmax_tuples_equal(oldtup, oldsz, origtup, origsz))
+ {
+ LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+ return false;
+ }
+
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(oldpage);
+
+ /*
+ * Great, the old tuple is intact. We can proceed with the update.
+ *
+ * If there's enough room on the old page for the new tuple, replace it.
+ *
+ * Note that there might now be enough space on the page even though
+ * the caller told us there isn't, if a concurrent update moved a tuple
+ * elsewhere or replaced a tuple with a smaller one.
+ */
+ if ((special->flags & MINMAX_EVACUATE_PAGE) == 0 &&
+ (newsz <= origsz || PageGetExactFreeSpace(oldpage) >= (newsz - origsz)))
+ {
+ if (BufferIsValid(newbuf))
+ UnlockReleaseBuffer(newbuf);
+
+ START_CRIT_SECTION();
+ PageIndexDeleteNoCompact(oldpage, &oldoff, 1);
+ if (PageAddItem(oldpage, (Item) newtup, newsz, oldoff, true, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add mmtuple");
+ MarkBufferDirty(oldbuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(idxrel))
+ {
+ BlockNumber blk = BufferGetBlockNumber(oldbuf);
+ xl_minmax_samepage_update xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+ uint8 info = XLOG_MINMAX_SAMEPAGE_UPDATE;
+
+ xlrec.node = idxrel->rd_node;
+ ItemPointerSetBlockNumber(&xlrec.tid, blk);
+ ItemPointerSetOffsetNumber(&xlrec.tid, oldoff);
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxSamepageUpdate;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = (char *) newtup;
+ rdata[1].len = newsz;
+ rdata[1].buffer = oldbuf;
+ rdata[1].buffer_std = true;
+ rdata[1].next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+
+ PageSetLSN(oldpage, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+ return true;
+ }
+ else if (newbuf == InvalidBuffer)
+ {
+ /*
+ * Not enough space, but caller said that there was. Tell them to
+ * start over.
+ */
+ LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+ return false;
+ }
+ else
+ {
+ /*
+ * Not enough free space on the oldpage. Put the new tuple on the
+ * new page, and update the revmap.
+ */
+ Page newpage = BufferGetPage(newbuf);
+ Buffer revmapbuf;
+ ItemPointerData newtid;
+ OffsetNumber newoff;
+
+ revmapbuf = mmLockRevmapPageForUpdate(rmAccess, heapBlk);
+
+ START_CRIT_SECTION();
+
+ PageIndexDeleteNoCompact(oldpage, &oldoff, 1);
+ newoff = PageAddItem(newpage, (Item) newtup, newsz, InvalidOffsetNumber, false, false);
+ if (newoff == InvalidOffsetNumber)
+ elog(ERROR, "failed to add mmtuple to new page");
+ MarkBufferDirty(oldbuf);
+ MarkBufferDirty(newbuf);
+
+ ItemPointerSet(&newtid, BufferGetBlockNumber(newbuf), newoff);
+ mmSetHeapBlockItemptr(revmapbuf, pagesPerRange, heapBlk, newtid);
+ MarkBufferDirty(revmapbuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(idxrel))
+ {
+ xl_minmax_update xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[4];
+ uint8 info = XLOG_MINMAX_UPDATE;
+
+ xlrec.new.node = idxrel->rd_node;
+ ItemPointerSet(&xlrec.new.tid, BufferGetBlockNumber(newbuf), newoff);
+ xlrec.new.heapBlk = heapBlk;
+ xlrec.new.revmapBlk = BufferGetBlockNumber(revmapbuf);
+ xlrec.new.pagesPerRange = pagesPerRange;
+ ItemPointerSet(&xlrec.oldtid, BufferGetBlockNumber(oldbuf), oldoff);
+
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxUpdate;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = (char *) newtup;
+ rdata[1].len = newsz;
+ rdata[1].buffer = newbuf;
+ rdata[1].buffer_std = true;
+ rdata[1].next = &(rdata[2]);
+
+ rdata[2].data = (char *) NULL;
+ rdata[2].len = 0;
+ rdata[2].buffer = revmapbuf;
+ rdata[2].buffer_std = true;
+ rdata[2].next = &(rdata[3]);
+
+ rdata[3].data = (char *) NULL;
+ rdata[3].len = 0;
+ rdata[3].buffer = oldbuf;
+ rdata[3].buffer_std = true;
+ rdata[3].next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+
+ PageSetLSN(oldpage, recptr);
+ PageSetLSN(newpage, recptr);
+ PageSetLSN(BufferGetPage(revmapbuf), recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ LockBuffer(revmapbuf, BUFFER_LOCK_UNLOCK);
+ LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+ UnlockReleaseBuffer(newbuf);
+ return true;
+ }
+ }
+
+ /*
+ * Insert an index tuple into the index relation. The revmap is updated to
+ * mark the range containing the given page as pointing to the inserted entry.
+ * A WAL record is written.
+ *
+ * The buffer, if valid, is first checked for free space to insert the new
+ * entry; if there isn't enough, a new buffer is obtained and pinned.
+ *
+ * If the relation had to be extended to make room for the new index tuple,
+ * *extended is set to true.
+ */
+ void
+ mm_doinsert(Relation idxrel, BlockNumber pagesPerRange,
+ mmRevmapAccess *rmAccess, Buffer *buffer, BlockNumber heapBlk,
+ MMTuple *tup, Size itemsz, bool *extended)
+ {
+ Page page;
+ BlockNumber blk;
+ OffsetNumber off;
+ Buffer revmapbuf;
+ ItemPointerData tid;
+
+ itemsz = MAXALIGN(itemsz);
+
+ /*
+ * Lock the revmap page for the update. Note that this may require
+ * extending the revmap, which in turn may require moving the currently
+ * pinned index block out of the way.
+ */
+ revmapbuf = mmLockRevmapPageForUpdate(rmAccess, heapBlk);
+
+ /*
+ * Obtain a locked buffer to insert the new tuple. Note mm_getinsertbuffer
+ * ensures there's enough space in the returned buffer.
+ */
+ if (BufferIsValid(*buffer))
+ {
+ /*
+ * It's possible that another backend (or ourselves!) extended the
+ * revmap over the page we held a pin on, so we cannot assume that
+ * it's still a regular page.
+ */
+ LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+ if (mm_page_get_freespace(BufferGetPage(*buffer)) < itemsz)
+ {
+ UnlockReleaseBuffer(*buffer);
+ *buffer = InvalidBuffer;
+ }
+ }
+
+ if (!BufferIsValid(*buffer))
+ {
+ *buffer = mm_getinsertbuffer(idxrel, InvalidBuffer, itemsz, extended);
+ Assert(BufferIsValid(*buffer));
+ Assert(mm_page_get_freespace(BufferGetPage(*buffer)) >= itemsz);
+ }
+
+ page = BufferGetPage(*buffer);
+ blk = BufferGetBlockNumber(*buffer);
+
+ START_CRIT_SECTION();
+ off = PageAddItem(page, (Item) tup, itemsz, InvalidOffsetNumber,
+ false, false);
+ if (off == InvalidOffsetNumber)
+ elog(ERROR, "could not insert new index tuple to page");
+ MarkBufferDirty(*buffer);
+
+ MINMAX_elog(DEBUG2, "inserted tuple (%u,%u) for range starting at %u",
+ blk, off, heapBlk);
+
+ ItemPointerSet(&tid, blk, off);
+ mmSetHeapBlockItemptr(revmapbuf, pagesPerRange, heapBlk, tid);
+ MarkBufferDirty(revmapbuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(idxrel))
+ {
+ xl_minmax_insert xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+ uint8 info = XLOG_MINMAX_INSERT;
+
+ xlrec.node = idxrel->rd_node;
+ xlrec.heapBlk = heapBlk;
+ xlrec.pagesPerRange = pagesPerRange;
+ xlrec.revmapBlk = BufferGetBlockNumber(revmapbuf);
+ ItemPointerSet(&xlrec.tid, blk, off);
+
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxInsert;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].buffer_std = false;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = (char *) tup;
+ rdata[1].len = itemsz;
+ rdata[1].buffer = *buffer;
+ rdata[1].buffer_std = true;
+ rdata[1].next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+
+ PageSetLSN(page, recptr);
+ PageSetLSN(BufferGetPage(revmapbuf), recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /* Tuple is firmly on buffer; we can release our locks */
+ LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+ LockBuffer(revmapbuf, BUFFER_LOCK_UNLOCK);
+ }
+
+ /*
+ * Initiate page evacuation protocol.
+ *
+ * The page must be locked in exclusive mode by the caller.
+ *
+ * If the page is not yet initialized or empty, return false without doing
+ * anything; it can be used for revmap without any further changes. If it
+ * contains tuples, mark it for evacuation and return true.
+ */
+ bool
+ mm_start_evacuating_page(Relation idxRel, Buffer buf)
+ {
+ OffsetNumber off;
+ OffsetNumber maxoff;
+ MinmaxSpecialSpace *special;
+ Page page;
+
+ page = BufferGetPage(buf);
+
+ if (PageIsNew(page))
+ return false;
+
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
+
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (off = FirstOffsetNumber; off <= maxoff; off++)
+ {
+ ItemId lp;
+
+ lp = PageGetItemId(page, off);
+ if (ItemIdIsUsed(lp))
+ {
+ /* prevent other backends from adding more stuff to this page */
+ special->flags |= MINMAX_EVACUATE_PAGE;
+ MarkBufferDirtyHint(buf, true);
+
+ return true;
+ }
+ }
+ return false;
+ }
+
+ /*
+ * Move all tuples out of a page.
+ *
+ * The caller must hold lock on the page. The lock and pin are released.
+ */
+ void
+ mm_evacuate_page(Relation idxRel, BlockNumber pagesPerRange, mmRevmapAccess *rmAccess, Buffer buf)
+ {
+ OffsetNumber off;
+ OffsetNumber maxoff;
+ MinmaxSpecialSpace *special;
+ Page page;
+ bool extended = false;
+
+ page = BufferGetPage(buf);
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
+
+ Assert(special->flags & MINMAX_EVACUATE_PAGE);
+
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (off = FirstOffsetNumber; off <= maxoff; off++)
+ {
+ MMTuple *tup;
+ Size sz;
+ ItemId lp;
+
+ CHECK_FOR_INTERRUPTS();
+
+ lp = PageGetItemId(page, off);
+ if (ItemIdIsUsed(lp))
+ {
+ sz = ItemIdGetLength(lp);
+ tup = (MMTuple *) PageGetItem(page, lp);
+ tup = minmax_copy_tuple(tup, sz);
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ if (!mm_doupdate(idxRel, pagesPerRange, rmAccess, tup->mt_blkno, buf,
+ off, tup, sz, tup, sz, false, &extended))
+ off--; /* retry */
+
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+ /* It's possible that someone extended the revmap over this page */
+ if (!MINMAX_IS_REGULAR_PAGE(page))
+ break;
+ }
+ }
+
+ UnlockReleaseBuffer(buf);
+
+ if (extended)
+ FreeSpaceMapVacuum(idxRel);
+ }
+
+ /*
+ * Return a pinned and locked buffer which can be used to insert an index item
+ * of size itemsz. If oldbuf is a valid buffer, it is also locked (in an order
+ * determined to avoid deadlocks.)
+ *
+ * If there's no existing page with enough free space to accommodate the new
+ * item, the relation is extended. If this happens, *extended is set to true.
+ *
+ * If we find that the old page is no longer a regular index page (because
+ * of a revmap extension), the old buffer is unlocked and we return
+ * InvalidBuffer.
+ */
+ static Buffer
+ mm_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
+ bool *was_extended)
+ {
+ BlockNumber oldblk;
+ BlockNumber newblk;
+ Page page;
+ int freespace;
+ bool extended = false;
+
+ if (BufferIsValid(oldbuf))
+ oldblk = BufferGetBlockNumber(oldbuf);
+ else
+ oldblk = InvalidBlockNumber;
+
+ /*
+ * Loop until we find a page with sufficient free space. By the time we
+ * return to caller out of this loop, both buffers are valid and locked;
+ * if we have to restart here, neither buffer is locked and buf is not
+ * a pinned buffer.
+ */
+ newblk = RelationGetTargetBlock(irel);
+ if (newblk == InvalidBlockNumber)
+ newblk = GetPageWithFreeSpace(irel, itemsz);
+ for (;;)
+ {
+ Buffer buf;
+ bool extensionLockHeld = false;
+
+ CHECK_FOR_INTERRUPTS();
+
+ if (newblk == InvalidBlockNumber)
+ {
+ /*
+ * There's not enough free space in any existing index page,
+ * according to the FSM: extend the relation to obtain a shiny
+ * new page.
+ */
+ if (!RELATION_IS_LOCAL(irel))
+ {
+ LockRelationForExtension(irel, ExclusiveLock);
+ extensionLockHeld = true;
+ }
+ buf = ReadBuffer(irel, P_NEW);
+ extended = true;
+
+ MINMAX_elog(DEBUG2, "mm_getinsertbuffer: extending to page %u",
+ BufferGetBlockNumber(buf));
+ }
+ else if (newblk == oldblk)
+ {
+ /*
+ * There's an odd corner-case here where the FSM is out-of-date,
+ * and gave us the old page.
+ */
+ buf = oldbuf;
+ }
+ else
+ {
+ buf = ReadBuffer(irel, newblk);
+ }
+
+ /*
+ * We lock the old buffer first, if it's earlier than the new one.
+ * We also need to check that it hasn't been turned into a revmap
+ * page concurrently; if we detect that it happened, give up and
+ * tell caller to start over.
+ */
+ if (BufferIsValid(oldbuf) && oldblk < newblk)
+ {
+ LockBuffer(oldbuf, BUFFER_LOCK_EXCLUSIVE);
+ if (!MINMAX_IS_REGULAR_PAGE(BufferGetPage(oldbuf)))
+ {
+ LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+ ReleaseBuffer(buf);
+ return InvalidBuffer;
+ }
+ }
+
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ if (extensionLockHeld)
+ UnlockRelationForExtension(irel, ExclusiveLock);
+
+ page = BufferGetPage(buf);
+
+ if (extended)
+ mm_page_init(page, MINMAX_PAGETYPE_REGULAR);
+
+ /*
+ * We now have a buffer (from the FSM, or a freshly extended page). Check
+ * that it has enough free space, and return it if it does; otherwise start over.
+ * Note that we allow for the FSM to be out of date here, and in that
+ * case we update it and move on.
+ *
+ * (mm_page_get_freespace also checks that the FSM didn't hand us a
+ * page that has since been repurposed for the revmap.)
+ */
+ freespace = mm_page_get_freespace(page);
+ if (freespace >= itemsz)
+ {
+ if (extended)
+ *was_extended = true;
+
+ RelationSetTargetBlock(irel, BufferGetBlockNumber(buf));
+
+ /*
+ * Lock the old buffer if not locked already. Note that in this
+ * case we know for sure it's a regular page: it's later than the
+ * new page we just got, which is not a revmap page, and revmap
+ * pages are always consecutive.
+ */
+ if (BufferIsValid(oldbuf) && oldblk > newblk)
+ {
+ LockBuffer(oldbuf, BUFFER_LOCK_EXCLUSIVE);
+ Assert(MINMAX_IS_REGULAR_PAGE(BufferGetPage(oldbuf)));
+ }
+
+ return buf;
+ }
+
+ /* This page is no good. */
+
+ /*
+ * If an entirely new page does not contain enough free space for
+ * the new item, then surely that item is oversized. Complain
+ * loudly; but first make sure we record the page as free, for
+ * next time.
+ */
+ if (extended)
+ {
+ RecordPageWithFreeSpace(irel, BufferGetBlockNumber(buf),
+ freespace);
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("index row size %lu exceeds maximum %lu for index \"%s\"",
+ (unsigned long) itemsz,
+ (unsigned long) freespace,
+ RelationGetRelationName(irel))));
+ return InvalidBuffer; /* keep compiler quiet */
+ }
+
+ if (newblk != oldblk)
+ UnlockReleaseBuffer(buf);
+ if (BufferIsValid(oldbuf) && oldblk < newblk)
+ LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+
+ newblk = RecordAndGetPageWithFreeSpace(irel, newblk, freespace, itemsz);
+ }
+ }
+
+ /*
+ * Return the amount of free space on a regular minmax index page.
+ *
+ * If the page is not a regular page, or has been marked with the
+ * MINMAX_EVACUATE_PAGE flag, returns 0.
+ */
+ static Size
+ mm_page_get_freespace(Page page)
+ {
+ MinmaxSpecialSpace *special;
+
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
+ if (!MINMAX_IS_REGULAR_PAGE(page) ||
+ (special->flags & MINMAX_EVACUATE_PAGE) != 0)
+ return 0;
+ else
+ return PageGetFreeSpace(page);
+
+ }
*** /dev/null
--- b/src/backend/access/minmax/mmrevmap.c
***************
*** 0 ****
--- 1,451 ----
+ /*
+ * mmrevmap.c
+ * Reverse range map for MinMax indexes
+ *
+ * The reverse range map (revmap) is a translation structure for minmax
+ * indexes: for each page range there is one summary tuple, and its location is
+ * tracked by the revmap. Whenever a new tuple is inserted into a table that
+ * violates the previously recorded summary values, a new tuple is inserted
+ * into the index and the revmap is updated to point to it.
+ *
+ * The revmap is stored in the first pages of the index, immediately following
+ * the metapage. When the revmap needs to be expanded, all tuples on the
+ * regular minmax page at that block (if any) are moved out of the way.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmrevmap.c
+ */
+ #include "postgres.h"
+
+ #include "access/xlog.h"
+ #include "access/minmax_page.h"
+ #include "access/minmax_pageops.h"
+ #include "access/minmax_revmap.h"
+ #include "access/minmax_tuple.h"
+ #include "access/minmax_xlog.h"
+ #include "access/rmgr.h"
+ #include "miscadmin.h"
+ #include "storage/bufmgr.h"
+ #include "storage/lmgr.h"
+ #include "utils/rel.h"
+
+
+ /*
+ * In revmap pages, each item stores an ItemPointerData. These defines let one
+ * find the logical revmap page number and index number of the revmap item for
+ * the given heap block number.
+ */
+ #define HEAPBLK_TO_REVMAP_BLK(pagesPerRange, heapBlk) \
+ ((heapBlk / pagesPerRange) / REVMAP_PAGE_MAXITEMS)
+ #define HEAPBLK_TO_REVMAP_INDEX(pagesPerRange, heapBlk) \
+ ((heapBlk / pagesPerRange) % REVMAP_PAGE_MAXITEMS)
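+ /*
+  * For illustration: with pagesPerRange = 128, heap block 1000 belongs to
+  * range number 7 (1000 / 128), so its revmap entry is item
+  * (7 % REVMAP_PAGE_MAXITEMS) on logical revmap page (7 / REVMAP_PAGE_MAXITEMS).
+  */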
+
+
+ struct mmRevmapAccess
+ {
+ Relation idxrel;
+ BlockNumber pagesPerRange;
+ BlockNumber lastRevmapPage; /* cached from the metapage */
+ Buffer metaBuf;
+ Buffer currBuf;
+ };
+ /* typedef appears in minmax_revmap.h */
+
+
+ static BlockNumber rm_get_phys_blkno(mmRevmapAccess *rmAccess,
+ BlockNumber mapBlk, bool extend);
+ static void rm_extend(mmRevmapAccess *rmAccess);
+
+ /*
+ * Initialize an access object for a reverse range map, which can be used to
+ * read entries from it. The object must be freed with mmRevmapAccessTerminate
+ * when the caller is done with it.
+ */
+ mmRevmapAccess *
+ mmRevmapAccessInit(Relation idxrel, BlockNumber *pagesPerRange)
+ {
+ mmRevmapAccess *rmAccess;
+ Buffer meta;
+ MinmaxMetaPageData *metadata;
+
+ meta = ReadBuffer(idxrel, MINMAX_METAPAGE_BLKNO);
+ LockBuffer(meta, BUFFER_LOCK_SHARE);
+ metadata = (MinmaxMetaPageData *) PageGetContents(BufferGetPage(meta));
+
+ rmAccess = palloc(sizeof(mmRevmapAccess));
+ rmAccess->idxrel = idxrel;
+ rmAccess->pagesPerRange = metadata->pagesPerRange;
+ rmAccess->lastRevmapPage = metadata->lastRevmapPage;
+ rmAccess->metaBuf = meta;
+ rmAccess->currBuf = InvalidBuffer;
+
+ *pagesPerRange = metadata->pagesPerRange;
+
+ LockBuffer(meta, BUFFER_LOCK_UNLOCK);
+
+ return rmAccess;
+ }
+
+ /*
+ * Release resources associated with a revmap access object.
+ */
+ void
+ mmRevmapAccessTerminate(mmRevmapAccess *rmAccess)
+ {
+ ReleaseBuffer(rmAccess->metaBuf);
+ if (rmAccess->currBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currBuf);
+ pfree(rmAccess);
+ }
+
+ /*
+ * Prepare for updating an entry in the revmap.
+ *
+ * The map is extended, if necessary.
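+ *
+ * The returned revmap buffer comes back pinned (by the access object) and
+ * exclusively locked; the caller is expected to update the entry (see
+ * mmSetHeapBlockItemptr) and to release the lock when done.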
+ */
+ Buffer
+ mmLockRevmapPageForUpdate(mmRevmapAccess *rmAccess, BlockNumber heapBlk)
+ {
+ BlockNumber mapBlk;
+
+ /*
+ * Translate the map block number to physical location. Note this extends
+ * the revmap, if necessary.
+ */
+ mapBlk = HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk);
+ mapBlk = rm_get_phys_blkno(rmAccess, mapBlk, true);
+ Assert(mapBlk != InvalidBlockNumber);
+
+ MINMAX_elog(DEBUG2, "locking revmap page for logical page %u (physical %u) for heap %u",
+ HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk),
+ mapBlk, heapBlk);
+
+ /*
+ * Obtain the buffer from which we need to read. If we already have the
+ * correct buffer in our access struct, use that; otherwise, release the
+ * current one (if valid) and read the one we need.
+ */
+ if (rmAccess->currBuf == InvalidBuffer ||
+ mapBlk != BufferGetBlockNumber(rmAccess->currBuf))
+ {
+ if (rmAccess->currBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currBuf);
+
+ rmAccess->currBuf = ReadBuffer(rmAccess->idxrel, mapBlk);
+ }
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_EXCLUSIVE);
+
+ return rmAccess->currBuf;
+ }
+
+ /*
+ * In the given revmap buffer (locked appropriately by caller), which is used
+ * in a minmax index of pagesPerRange pages per range, set the element
+ * corresponding to heap block number heapBlk to the given TID.
+ *
+ * Once the operation is complete, the caller must update the LSN on the
+ * passed buffer.
+ *
+ * This is used both in regular operation and during WAL replay.
+ */
+ void
+ mmSetHeapBlockItemptr(Buffer buf, BlockNumber pagesPerRange, BlockNumber heapBlk,
+ ItemPointerData tid)
+ {
+ RevmapContents *contents;
+ ItemPointerData *iptr;
+ Page page;
+
+ /* The correct page should already be pinned and locked */
+ page = BufferGetPage(buf);
+ contents = (RevmapContents *) PageGetContents(page);
+ iptr = (ItemPointerData *) contents->rmr_tids;
+ iptr += HEAPBLK_TO_REVMAP_INDEX(pagesPerRange, heapBlk);
+
+ ItemPointerSet(iptr,
+ ItemPointerGetBlockNumber(&tid),
+ ItemPointerGetOffsetNumber(&tid));
+ }
+
+ /*
+ * Fetch the MMTuple for a given heap block.
+ *
+ * The buffer containing the tuple is locked, and returned in *buf. As an
+ * optimization, the caller can pass a pinned buffer *buf on entry, which will
+ * avoid a pin-unpin cycle when the next tuple is on the same page as the
+ * previous one.
+ *
+ * If no tuple is found for the given heap range, returns NULL. In that case,
+ * *buf might still be updated, but it's not locked.
+ *
+ * The output tuple offset within the buffer is returned in *off.
+ */
+ MMTuple *
+ mmGetMMTupleForHeapBlock(mmRevmapAccess *rmAccess, BlockNumber heapBlk,
+ Buffer *buf, OffsetNumber *off, int mode)
+ {
+ Relation idxRel = rmAccess->idxrel;
+ BlockNumber mapBlk;
+ RevmapContents *contents;
+ ItemPointerData *iptr;
+ BlockNumber blk;
+ Page page;
+ ItemId lp;
+ MMTuple *mmtup;
+ ItemPointerData previptr;
+
+ /* normalize the heap block number to be the first page in the range */
+ heapBlk = (heapBlk / rmAccess->pagesPerRange) * rmAccess->pagesPerRange;
+
+ /* Compute the revmap page number we need */
+ mapBlk = HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk);
+ mapBlk = rm_get_phys_blkno(rmAccess, mapBlk, false);
+ if (mapBlk == InvalidBlockNumber)
+ {
+ *off = InvalidOffsetNumber;
+ return NULL;
+ }
+
+ ItemPointerSetInvalid(&previptr);
+ for (;;)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ if (rmAccess->currBuf == InvalidBuffer ||
+ BufferGetBlockNumber(rmAccess->currBuf) != mapBlk)
+ {
+ if (rmAccess->currBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currBuf);
+
+ Assert(mapBlk != InvalidBlockNumber);
+ rmAccess->currBuf = ReadBuffer(rmAccess->idxrel, mapBlk);
+ }
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_SHARE);
+
+ contents = (RevmapContents *)
+ PageGetContents(BufferGetPage(rmAccess->currBuf));
+ iptr = contents->rmr_tids;
+ iptr += HEAPBLK_TO_REVMAP_INDEX(rmAccess->pagesPerRange, heapBlk);
+
+ if (!ItemPointerIsValid(iptr))
+ {
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_UNLOCK);
+ return NULL;
+ }
+
+ /*
+ * Save the current TID we got from the revmap; if we loop we can
+ * sanity-check that the new one is different. Otherwise we might
+ * be stuck looping forever if the revmap is somehow badly broken.
+ */
+ if (ItemPointerIsValid(&previptr) && ItemPointerEquals(&previptr, iptr))
+ ereport(ERROR,
+ /* FIXME improve message */
+ (errmsg("revmap was updated but still contains same TID as before")));
+ previptr = *iptr;
+
+ blk = ItemPointerGetBlockNumber(iptr);
+ *off = ItemPointerGetOffsetNumber(iptr);
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_UNLOCK);
+
+ /* Ok, got a pointer to where the MMTuple should be. Fetch it. */
+ if (!BufferIsValid(*buf) || BufferGetBlockNumber(*buf) != blk)
+ {
+ if (BufferIsValid(*buf))
+ ReleaseBuffer(*buf);
+ *buf = ReadBuffer(idxRel, blk);
+ }
+ LockBuffer(*buf, mode);
+ page = BufferGetPage(*buf);
+
+ /* If the page is still a regular index page, look for the tuple there */
+ if (MINMAX_IS_REGULAR_PAGE(page))
+ {
+ lp = PageGetItemId(page, *off);
+ if (ItemIdIsUsed(lp))
+ {
+ mmtup = (MMTuple *) PageGetItem(page, lp);
+
+ if (mmtup->mt_blkno == heapBlk)
+ {
+ /* found it! */
+ return mmtup;
+ }
+ }
+ }
+
+ /*
+ * No luck. Assume that the revmap was updated concurrently.
+ */
+ LockBuffer(*buf, BUFFER_LOCK_UNLOCK);
+ }
+ /* not reached, but keep compiler quiet */
+ return NULL;
+ }
+
+ /*
+ * Given a logical revmap block number, find its physical block number.
+ *
+ * If extend is true and the revmap does not yet cover that logical page,
+ * extend the revmap until it does.
+ */
+ static BlockNumber
+ rm_get_phys_blkno(mmRevmapAccess *rmAccess, BlockNumber mapBlk, bool extend)
+ {
+ BlockNumber targetblk;
+
+ /* skip the metapage to obtain physical block numbers of revmap pages */
+ targetblk = mapBlk + 1;
+
+ /* Normal case: the revmap page is already allocated */
+ if (targetblk <= rmAccess->lastRevmapPage)
+ return targetblk;
+
+ if (!extend)
+ return InvalidBlockNumber;
+
+ /* Extend the revmap */
+ while (targetblk > rmAccess->lastRevmapPage)
+ rm_extend(rmAccess);
+
+ return targetblk;
+ }
+
+ /*
+ * Extend the revmap by one page.
+ *
+ * However, if the revmap was extended by someone else concurrently, we might
+ * return without actually doing anything.
+ *
+ * If a regular minmax page already exists at the target block, its tuples are
+ * first moved out of the way (see mm_evacuate_page) and we return without
+ * extending; the caller is expected to retry.
+ */
+ static void
+ rm_extend(mmRevmapAccess *rmAccess)
+ {
+ Buffer buf;
+ Page page;
+ Page metapage;
+ MinmaxMetaPageData *metadata;
+ BlockNumber mapBlk;
+ BlockNumber nblocks;
+ Relation irel = rmAccess->idxrel;
+ bool needLock = !RELATION_IS_LOCAL(irel);
+
+ /*
+ * Lock the metapage. This locks out concurrent extensions of the revmap,
+ * but note that we still need to grab the relation extension lock because
+ * another backend can extend the index with regular minmax pages.
+ */
+ LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_EXCLUSIVE);
+ metapage = BufferGetPage(rmAccess->metaBuf);
+ metadata = (MinmaxMetaPageData *) PageGetContents(metapage);
+
+ /*
+ * Check that our cached lastRevmapPage value was up-to-date; if it wasn't,
+ * update the cached copy and have caller start over.
+ */
+ if (metadata->lastRevmapPage != rmAccess->lastRevmapPage)
+ {
+ rmAccess->lastRevmapPage = metadata->lastRevmapPage;
+ LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_UNLOCK);
+ return;
+ }
+ mapBlk = metadata->lastRevmapPage + 1;
+
+ nblocks = RelationGetNumberOfBlocks(irel);
+ if (mapBlk < nblocks)
+ {
+ buf = ReadBuffer(irel, mapBlk);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ page = BufferGetPage(buf);
+ }
+ else
+ {
+ if (needLock)
+ LockRelationForExtension(irel, ExclusiveLock);
+
+ buf = ReadBuffer(irel, P_NEW);
+ if (BufferGetBlockNumber(buf) != mapBlk)
+ {
+ /*
+ * Very rare corner case: somebody extended the relation
+ * concurrently after we read its length. If this happens, give up
+ * and have caller start over. We will have to evacuate that page
+ * from under whoever is using it.
+ */
+ if (needLock)
+ UnlockRelationForExtension(irel, ExclusiveLock);
+ LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_UNLOCK);
+ return;
+ }
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ page = BufferGetPage(buf);
+
+ if (needLock)
+ UnlockRelationForExtension(irel, ExclusiveLock);
+ }
+
+ /* Check that it's a regular block (or an empty page) */
+ if (!PageIsNew(page) && !MINMAX_IS_REGULAR_PAGE(page))
+ elog(ERROR, "unexpected minmax page type: 0x%04X",
+ MINMAX_PAGE_TYPE(page));
+
+ /* If the page is in use, evacuate it and restart */
+ if (mm_start_evacuating_page(irel, buf))
+ {
+ LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_UNLOCK);
+ mm_evacuate_page(irel, rmAccess->pagesPerRange, rmAccess, buf);
+
+ /* have caller start over */
+ return;
+ }
+
+ /*
+ * Ok, we have now locked the metapage and the target block. Re-initialize
+ * it as a revmap page.
+ */
+ START_CRIT_SECTION();
+
+ /* the rmr_tids array is initialized to all invalid by PageInit */
+ mm_page_init(page, MINMAX_PAGETYPE_REVMAP);
+ MarkBufferDirty(buf);
+
+ metadata->lastRevmapPage = mapBlk;
+ MarkBufferDirty(rmAccess->metaBuf);
+
+ if (RelationNeedsWAL(rmAccess->idxrel))
+ {
+ xl_minmax_revmap_extend xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata;
+
+ xlrec.node = rmAccess->idxrel->rd_node;
+ xlrec.targetBlk = mapBlk;
+
+ rdata.data = (char *) &xlrec;
+ rdata.len = SizeOfMinmaxRevmapExtend;
+ rdata.buffer = InvalidBuffer;
+ rdata.buffer_std = false;
+ rdata.next = NULL;
+
+ /* FIXME don't we need to log the metapage buffer also? */
+
+ recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_REVMAP_EXTEND, &rdata);
+ PageSetLSN(metapage, recptr);
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_UNLOCK);
+
+ UnlockReleaseBuffer(buf);
+ }
*** /dev/null
--- b/src/backend/access/minmax/mmsortable.c
***************
*** 0 ****
--- 1,287 ----
+ /*
+ * mmsortable.c
+ * Implementation of Minmax indexes for sortable datatypes
+ * (that is, anything with a btree opclass)
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmsortable.c
+ */
+ #include "postgres.h"
+
+ #include "access/genam.h"
+ #include "access/minmax_internal.h"
+ #include "access/minmax_tuple.h"
+ #include "access/skey.h"
+ #include "catalog/pg_type.h"
+ #include "utils/datum.h"
+ #include "utils/lsyscache.h"
+ #include "utils/syscache.h"
+
+
+ /*
+ * Procedure numbers must not collide with MINMAX_PROCNUM defines in
+ * minmax_internal.h. Note we only need inequality functions.
+ */
+ #define SORTABLE_NUM_PROCNUMS 4 /* # support procs we need */
+ #define PROCNUM_LESS 4
+ #define PROCNUM_LESSEQUAL 5
+ #define PROCNUM_GREATEREQUAL 6
+ #define PROCNUM_GREATER 7
+
+ /* subtract this from procnum to obtain index in SortableOpaque arrays */
+ #define PROCNUM_BASE 4
+
+ static FmgrInfo *mmsrt_get_procinfo(MinmaxDesc *mmdesc, uint16 attno,
+ uint16 procnum);
+
+ PG_FUNCTION_INFO_V1(mmSortableAddValue);
+ PG_FUNCTION_INFO_V1(mmSortableConsistent);
+
+
+ typedef struct SortableOpaque
+ {
+ FmgrInfo operators[SORTABLE_NUM_PROCNUMS];
+ bool inited[SORTABLE_NUM_PROCNUMS];
+ } SortableOpaque;
+
+ #define OPCINFO(typname, typoid) \
+ PG_FUNCTION_INFO_V1(mmSortableOpcInfo_##typname); \
+ Datum \
+ mmSortableOpcInfo_##typname(PG_FUNCTION_ARGS) \
+ { \
+ SortableOpaque *opaque; \
+ MinmaxOpcInfo *result; \
+ \
+ opaque = palloc0(sizeof(SortableOpaque)); \
+ /* \
+ * 'operators' is initialized lazily, as indicated by 'inited' which was \
+ * initialized to all false by palloc0. \
+ */ \
+ \
+ result = palloc(SizeofMinmaxOpcInfo(2)); /* min, max */ \
+ result->oi_nstored = 2; \
+ result->oi_opaque = opaque; \
+ result->oi_typids[0] = typoid; \
+ result->oi_typids[1] = typoid; \
+ \
+ PG_RETURN_POINTER(result); \
+ }
+
+ OPCINFO(int4, INT4OID)
+ OPCINFO(numeric, NUMERICOID)
+ OPCINFO(text, TEXTOID)
+ OPCINFO(time, TIMEOID)
+ OPCINFO(timetz, TIMETZOID)
+ OPCINFO(timestamp, TIMESTAMPOID)
+ OPCINFO(timestamptz, TIMESTAMPTZOID)
+ OPCINFO(date, DATEOID)
+ OPCINFO(char, CHAROID)
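+ /*
+  * For instance, the OPCINFO(int4, INT4OID) invocation above expands to an
+  * opcinfo function named mmSortableOpcInfo_int4, which reports two stored
+  * datums of type int4 per indexed column: the running minimum and maximum.
+  */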
+
+ /*
+ * Examine the given index tuple (which contains partial status of a certain
+ * page range) by comparing it to the given value that comes from another heap
+ * tuple. If the new value is outside the domain specified by the existing
+ * tuple values, update the index tuple and return true. Otherwise, return
+ * false and do not modify the tuple.
+ */
+ Datum
+ mmSortableAddValue(PG_FUNCTION_ARGS)
+ {
+ MinmaxDesc *mmdesc = (MinmaxDesc *) PG_GETARG_POINTER(0);
+ DeformedMMTuple *dtuple = (DeformedMMTuple *) PG_GETARG_POINTER(1);
+ AttrNumber attno = PG_GETARG_UINT16(2);
+ Datum newval = PG_GETARG_DATUM(3);
+ bool isnull = PG_GETARG_BOOL(4);
+ Oid colloid = PG_GET_COLLATION();
+ FmgrInfo *cmpFn;
+ Datum compar;
+ bool updated = false;
+
+ /*
+ * If the new value is null, record that fact if it's not recorded already;
+ * otherwise there's nothing to do.
+ */
+ if (isnull)
+ {
+ if (dtuple->dt_columns[attno - 1].hasnulls)
+ PG_RETURN_BOOL(false);
+
+ dtuple->dt_columns[attno - 1].hasnulls = true;
+ PG_RETURN_BOOL(true);
+ }
+
+ /*
+ * If the existing summary is all nulls (no values stored yet), store the new
+ * value (which we know to be not null) as both minimum and maximum, and we're
+ * done.
+ */
+ if (dtuple->dt_columns[attno - 1].allnulls)
+ {
+ dtuple->dt_columns[attno - 1].values[0] =
+ datumCopy(newval, mmdesc->md_tupdesc->attrs[attno - 1]->attbyval,
+ mmdesc->md_tupdesc->attrs[attno - 1]->attlen);
+ dtuple->dt_columns[attno - 1].values[1] =
+ datumCopy(newval, mmdesc->md_tupdesc->attrs[attno - 1]->attbyval,
+ mmdesc->md_tupdesc->attrs[attno - 1]->attlen);
+ dtuple->dt_columns[attno - 1].allnulls = false;
+ PG_RETURN_BOOL(true);
+ }
+
+ /*
+ * Otherwise, we need to compare the new value with the existing boundaries
+ * and update them accordingly. First check if it's less than the existing
+ * minimum.
+ */
+ cmpFn = mmsrt_get_procinfo(mmdesc, attno, PROCNUM_LESS);
+ compar = FunctionCall2Coll(cmpFn, colloid, newval,
+ dtuple->dt_columns[attno - 1].values[0]);
+ if (DatumGetBool(compar))
+ {
+ dtuple->dt_columns[attno - 1].values[0] =
+ datumCopy(newval, mmdesc->md_tupdesc->attrs[attno - 1]->attbyval,
+ mmdesc->md_tupdesc->attrs[attno - 1]->attlen);
+ updated = true;
+ }
+
+ /*
+ * And now compare it to the existing maximum.
+ */
+ cmpFn = mmsrt_get_procinfo(mmdesc, attno, PROCNUM_GREATER);
+ compar = FunctionCall2Coll(cmpFn, colloid, newval,
+ dtuple->dt_columns[attno - 1].values[1]);
+ if (DatumGetBool(compar))
+ {
+ dtuple->dt_columns[attno - 1].values[1] =
+ datumCopy(newval, mmdesc->md_tupdesc->attrs[attno - 1]->attbyval,
+ mmdesc->md_tupdesc->attrs[attno - 1]->attlen);
+ updated = true;
+ }
+
+ PG_RETURN_BOOL(updated);
+ }
+
+ /*
+ * Given an index tuple corresponding to a certain page range and a scan key,
+ * return whether the scan key is consistent with the index tuple; that is,
+ * whether the page range summarized by the tuple might contain tuples
+ * matching the key.
+ */
+ Datum
+ mmSortableConsistent(PG_FUNCTION_ARGS)
+ {
+ MinmaxDesc *mmdesc = (MinmaxDesc *) PG_GETARG_POINTER(0);
+ DeformedMMTuple *dtup = (DeformedMMTuple *) PG_GETARG_POINTER(1);
+ ScanKey key = (ScanKey) PG_GETARG_POINTER(2);
+ Oid colloid = PG_GET_COLLATION();
+ AttrNumber attno = key->sk_attno;
+ Datum value;
+ Datum matches;
+
+ /* handle IS NULL/IS NOT NULL tests */
+ if (key->sk_flags & SK_ISNULL)
+ {
+ if (key->sk_flags & SK_SEARCHNULL)
+ {
+ if (dtup->dt_columns[attno - 1].allnulls ||
+ dtup->dt_columns[attno - 1].hasnulls)
+ PG_RETURN_BOOL(true);
+ PG_RETURN_BOOL(false);
+ }
+
+ /*
+ * For IS NOT NULL we can only exclude blocks if all values are nulls.
+ */
+ Assert(key->sk_flags & SK_SEARCHNOTNULL);
+ if (dtup->dt_columns[attno - 1].allnulls)
+ PG_RETURN_BOOL(false);
+ PG_RETURN_BOOL(true);
+ }
+
+ value = key->sk_argument;
+ switch (key->sk_strategy)
+ {
+ case BTLessStrategyNumber:
+ matches = FunctionCall2Coll(mmsrt_get_procinfo(mmdesc, attno,
+ PROCNUM_LESS),
+ colloid,
+ dtup->dt_columns[attno - 1].values[0],
+ value);
+ break;
+ case BTLessEqualStrategyNumber:
+ matches = FunctionCall2Coll(mmsrt_get_procinfo(mmdesc, attno,
+ PROCNUM_LESSEQUAL),
+ colloid,
+ dtup->dt_columns[attno - 1].values[0],
+ value);
+ break;
+ case BTEqualStrategyNumber:
+
+ /*
+ * In the equality case (WHERE col = someval), we want to return
+ * the current page range if the minimum value in the range <= scan
+ * key, and the maximum value >= scan key.
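+ * For example, a range summarized as [10, 20] is returned for
+ * WHERE col = 15 but can be skipped for WHERE col = 25.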
+ */
+ matches = FunctionCall2Coll(mmsrt_get_procinfo(mmdesc, attno,
+ PROCNUM_LESSEQUAL),
+ colloid,
+ dtup->dt_columns[attno - 1].values[0],
+ value);
+ if (!DatumGetBool(matches))
+ break;
+ /* max() >= scankey */
+ matches = FunctionCall2Coll(mmsrt_get_procinfo(mmdesc, attno,
+ PROCNUM_GREATEREQUAL),
+ colloid,
+ dtup->dt_columns[attno - 1].values[1],
+ value);
+ break;
+ case BTGreaterEqualStrategyNumber:
+ matches = FunctionCall2Coll(mmsrt_get_procinfo(mmdesc, attno,
+ PROCNUM_GREATEREQUAL),
+ colloid,
+ dtup->dt_columns[attno - 1].values[1],
+ value);
+ break;
+ case BTGreaterStrategyNumber:
+ matches = FunctionCall2Coll(mmsrt_get_procinfo(mmdesc, attno,
+ PROCNUM_GREATER),
+ colloid,
+ dtup->dt_columns[attno - 1].values[1],
+ value);
+ break;
+ default:
+ /* shouldn't happen */
+ elog(ERROR, "invalid strategy number %d", key->sk_strategy);
+ matches = 0;
+ break;
+ }
+
+ PG_RETURN_DATUM(matches);
+ }
+
+ /*
+ * Return the procedure corresponding to the given function support number.
+ */
+ static FmgrInfo *
+ mmsrt_get_procinfo(MinmaxDesc *mmdesc, uint16 attno, uint16 procnum)
+ {
+ SortableOpaque *opaque;
+ uint16 basenum = procnum - PROCNUM_BASE;
+
+ opaque = (SortableOpaque *) mmdesc->md_info[attno - 1]->oi_opaque;
+
+ /*
+ * We cache these in the opaque struct, to avoid repetitive syscache
+ * lookups.
+ */
+ if (!opaque->inited[basenum])
+ {
+ fmgr_info_copy(&opaque->operators[basenum],
+ index_getprocinfo(mmdesc->md_index, attno, procnum),
+ CurrentMemoryContext);
+ opaque->inited[basenum] = true;
+ }
+
+ return &opaque->operators[basenum];
+ }
*** /dev/null
--- b/src/backend/access/minmax/mmtuple.c
***************
*** 0 ****
--- 1,478 ----
+ /*
+ * MinMax-specific tuples
+ * Method implementations for tuples in minmax indexes.
+ *
+ * Intended usage is that code outside this file only deals with
+ * DeformedMMTuples, and converts to and from the on-disk representation through
+ * functions in this file.
+ *
+ * NOTES
+ *
+ * A minmax tuple is similar to a heap tuple, with a few key differences. The
+ * first interesting difference is that the tuple header is much simpler, only
+ * containing its total length and a small area for flags. Also, the stored
+ * data does not match the relation tuple descriptor exactly: for each
+ * attribute in the descriptor, the index tuple carries an arbitrary number
+ * of values, depending on the opclass.
+ *
+ * Also, for each column of the index relation there are two null bits: one
+ * (hasnulls) stores whether any tuple within the page range has that column
+ * set to null; the other one (allnulls) stores whether the column values are
+ * all null. If allnulls is true, the tuple data area does not contain values
+ * for that column at all; it does if only hasnulls is set.
+ * Note the size of the null bitmask may not be the same as that of the
+ * datum array.
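+ *
+ * As an illustration, for an index over two columns the on-disk bitmap carries
+ * four bits: allnulls for columns 1 and 2, followed by hasnulls for columns 1
+ * and 2. Note also that, unlike ordinary heap tuples, a set bit here means
+ * "null" (see minmax_form_tuple).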
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmtuple.c
+ */
+ #include "postgres.h"
+
+ #include "access/htup_details.h"
+ #include "access/minmax_tuple.h"
+ #include "access/tupdesc.h"
+ #include "access/tupmacs.h"
+
+
+ static inline void mm_deconstruct_tuple(MinmaxDesc *mmdesc,
+ char *tp, bits8 *nullbits, bool nulls,
+ Datum *values, bool *allnulls, bool *hasnulls);
+
+
+ /*
+ * Return a tuple descriptor used for on-disk storage of minmax tuples.
+ */
+ static TupleDesc
+ mmtuple_disk_tupdesc(MinmaxDesc *mmdesc)
+ {
+ /* We cache these in the MinmaxDesc */
+ if (mmdesc->md_disktdesc == NULL)
+ {
+ int i;
+ int j;
+ AttrNumber attno = 1;
+ TupleDesc tupdesc;
+
+ tupdesc = CreateTemplateTupleDesc(mmdesc->md_totalstored, false);
+
+ for (i = 0; i < mmdesc->md_tupdesc->natts; i++)
+ {
+ for (j = 0; j < mmdesc->md_info[i]->oi_nstored; j++)
+ TupleDescInitEntry(tupdesc, attno++, NULL,
+ mmdesc->md_info[i]->oi_typids[j],
+ -1, 0);
+ }
+
+ mmdesc->md_disktdesc = tupdesc;
+ }
+
+ return mmdesc->md_disktdesc;
+ }
+
+ /*
+ * Generate a new on-disk tuple to be inserted in a minmax index.
+ */
+ MMTuple *
+ minmax_form_tuple(MinmaxDesc *mmdesc, BlockNumber blkno,
+ DeformedMMTuple *tuple, Size *size)
+ {
+ Datum *values;
+ bool *nulls;
+ bool anynulls = false;
+ MMTuple *rettuple;
+ int keyno;
+ int idxattno;
+ uint16 phony_infomask;
+ bits8 *phony_nullbitmap;
+ Size len,
+ hoff,
+ data_len;
+
+ Assert(mmdesc->md_totalstored > 0);
+
+ values = palloc(sizeof(Datum) * mmdesc->md_totalstored);
+ nulls = palloc0(sizeof(bool) * mmdesc->md_totalstored);
+ phony_nullbitmap = palloc(sizeof(bits8) * BITMAPLEN(mmdesc->md_totalstored));
+
+ /*
+ * Set up the values/nulls arrays for heap_fill_tuple
+ */
+ idxattno = 0;
+ for (keyno = 0; keyno < mmdesc->md_tupdesc->natts; keyno++)
+ {
+ int datumno;
+
+ /*
+ * "allnulls" is set when there's no nonnull value in any row in
+ * the column; when this happens, there is no data to store. Thus
+ * set the null bits for all data elements of this column and
+ * we're done.
+ */
+ if (tuple->dt_columns[keyno].allnulls)
+ {
+ for (datumno = 0;
+ datumno < mmdesc->md_info[keyno]->oi_nstored;
+ datumno++)
+ nulls[idxattno++] = true;
+ anynulls = true;
+ continue;
+ }
+
+ /*
+ * The "hasnulls" bit is set when there are some null values in the
+ * data. We still need to store a real value, but the presence of this
+ * means we need a null bitmap.
+ */
+ if (tuple->dt_columns[keyno].hasnulls)
+ anynulls = true;
+
+ for (datumno = 0;
+ datumno < mmdesc->md_info[keyno]->oi_nstored;
+ datumno++)
+ values[idxattno++] = tuple->dt_columns[keyno].values[datumno];
+ }
+
+ /* compute total space needed */
+ len = SizeOfMinMaxTuple;
+ if (anynulls)
+ {
+ /*
+ * We need a double-length bitmap on an on-disk minmax index tuple;
+ * the first half stores the "allnulls" bits, the second stores
+ * "hasnulls".
+ */
+ len += BITMAPLEN(mmdesc->md_tupdesc->natts * 2);
+ }
+
+ len = hoff = MAXALIGN(len);
+
+ data_len = heap_compute_data_size(mmtuple_disk_tupdesc(mmdesc),
+ values, nulls);
+
+ len += data_len;
+
+ rettuple = palloc0(len);
+ rettuple->mt_blkno = blkno;
+ rettuple->mt_info = hoff;
+ Assert((rettuple->mt_info & MMIDX_OFFSET_MASK) == hoff);
+
+ /*
+ * The infomask and null bitmap as computed by heap_fill_tuple are useless
+ * to us. However, that function will not accept a null infomask; and we
+ * need to pass a valid null bitmap so that it will correctly skip
+ * outputting null attributes in the data area.
+ */
+ heap_fill_tuple(mmtuple_disk_tupdesc(mmdesc),
+ values,
+ nulls,
+ (char *) rettuple + hoff,
+ data_len,
+ &phony_infomask,
+ phony_nullbitmap);
+
+ /* done with these */
+ pfree(values);
+ pfree(nulls);
+ pfree(phony_nullbitmap);
+
+ /*
+ * Now fill in the real null bitmasks. allnulls first.
+ */
+ if (anynulls)
+ {
+ bits8 *bitP;
+ int bitmask;
+
+ rettuple->mt_info |= MMIDX_NULLS_MASK;
+
+ /*
+ * Note that we reverse the sense of null bits in this module: we store
+ * a 1 for a null attribute rather than a 0. So we must reverse the
+ * sense of the att_isnull test in mm_deconstruct_tuple as well.
+ */
+ bitP = ((bits8 *) ((char *) rettuple + SizeOfMinMaxTuple)) - 1;
+ bitmask = HIGHBIT;
+ for (keyno = 0; keyno < mmdesc->md_tupdesc->natts; keyno++)
+ {
+ if (bitmask != HIGHBIT)
+ bitmask <<= 1;
+ else
+ {
+ bitP += 1;
+ *bitP = 0x0;
+ bitmask = 1;
+ }
+
+ if (!tuple->dt_columns[keyno].allnulls)
+ continue;
+
+ *bitP |= bitmask;
+ }
+ /* hasnulls bits follow */
+ for (keyno = 0; keyno < mmdesc->md_tupdesc->natts; keyno++)
+ {
+ if (bitmask != HIGHBIT)
+ bitmask <<= 1;
+ else
+ {
+ bitP += 1;
+ *bitP = 0x0;
+ bitmask = 1;
+ }
+
+ if (!tuple->dt_columns[keyno].hasnulls)
+ continue;
+
+ *bitP |= bitmask;
+ }
+ }
+
+ *size = len;
+ return rettuple;
+ }
+
+ /*
+ * Free a tuple created by minmax_form_tuple
+ */
+ void
+ minmax_free_tuple(MMTuple *tuple)
+ {
+ pfree(tuple);
+ }
+
+ MMTuple *
+ minmax_copy_tuple(MMTuple *tuple, Size len)
+ {
+ MMTuple *newtup;
+
+ newtup = palloc(len);
+ memcpy(newtup, tuple, len);
+
+ return newtup;
+ }
+
+ bool
+ minmax_tuples_equal(const MMTuple *a, Size alen, const MMTuple *b, Size blen)
+ {
+ if (alen != blen)
+ return false;
+ if (memcmp(a, b, alen) != 0)
+ return false;
+ return true;
+ }
+
+ /*
+ * Create a new DeformedMMTuple from scratch, and initialize it to an empty
+ * state.
+ */
+ DeformedMMTuple *
+ minmax_new_dtuple(MinmaxDesc *mmdesc)
+ {
+ DeformedMMTuple *dtup;
+ char *currdatum;
+ long basesize;
+ int i;
+
+ basesize = MAXALIGN(sizeof(DeformedMMTuple) +
+ sizeof(MMValues) * mmdesc->md_tupdesc->natts);
+ dtup = palloc0(basesize + sizeof(Datum) * mmdesc->md_totalstored);
+ currdatum = (char *) dtup + basesize;
+ for (i = 0; i < mmdesc->md_tupdesc->natts; i++)
+ {
+ dtup->dt_columns[i].allnulls = true;
+ dtup->dt_columns[i].hasnulls = false;
+ dtup->dt_columns[i].values = (Datum *) currdatum;
+ currdatum += sizeof(Datum) * mmdesc->md_info[i]->oi_nstored;
+ }
+
+ return dtup;
+ }
+
+ /*
+ * Reset a DeformedMMTuple to initial state
+ */
+ void
+ minmax_dtuple_initialize(DeformedMMTuple *dtuple, MinmaxDesc *mmdesc)
+ {
+ int i;
+
+ for (i = 0; i < mmdesc->md_tupdesc->natts; i++)
+ {
+ /*
+ * FIXME -- we may need to pfree() some datums here before clobbering
+ * the whole thing
+ */
+ dtuple->dt_columns[i].allnulls = true;
+ dtuple->dt_columns[i].hasnulls = false;
+ memset(dtuple->dt_columns[i].values, 0,
+ sizeof(Datum) * mmdesc->md_info[i]->oi_nstored);
+ }
+ }
+
+ /*
+ * Convert a MMTuple back to a DeformedMMTuple. This is the reverse of
+ * minmax_form_tuple.
+ *
+ * Note we don't need the "on disk tupdesc" here; we rely on our own routine to
+ * deconstruct the tuple from the on-disk format.
+ *
+ * XXX some callers might need copies of each datum; if so we need to apply
+ * datumCopy inside the loop. We probably also need a minmax_free_dtuple()
+ * function.
+ */
+ DeformedMMTuple *
+ minmax_deform_tuple(MinmaxDesc *mmdesc, MMTuple *tuple)
+ {
+ DeformedMMTuple *dtup;
+ Datum *values;
+ bool *allnulls;
+ bool *hasnulls;
+ char *tp;
+ bits8 *nullbits;
+ int keyno;
+ int valueno;
+
+ dtup = minmax_new_dtuple(mmdesc);
+
+ values = palloc(sizeof(Datum) * mmdesc->md_totalstored);
+ allnulls = palloc(sizeof(bool) * mmdesc->md_tupdesc->natts);
+ hasnulls = palloc(sizeof(bool) * mmdesc->md_tupdesc->natts);
+
+ tp = (char *) tuple + MMTupleDataOffset(tuple);
+
+ if (MMTupleHasNulls(tuple))
+ nullbits = (bits8 *) ((char *) tuple + SizeOfMinMaxTuple);
+ else
+ nullbits = NULL;
+ mm_deconstruct_tuple(mmdesc,
+ tp, nullbits, MMTupleHasNulls(tuple),
+ values, allnulls, hasnulls);
+
+ /*
+ * Iterate to assign each of the values to the corresponding item
+ * in the values array of each column.
+ */
+ for (valueno = 0, keyno = 0; keyno < mmdesc->md_tupdesc->natts; keyno++)
+ {
+ int i;
+
+ if (allnulls[keyno])
+ {
+ valueno += mmdesc->md_info[keyno]->oi_nstored;
+ continue;
+ }
+
+ dtup->dt_columns[keyno].values =
+ palloc(sizeof(Datum) * mmdesc->md_totalstored);
+
+ /* XXX optional datumCopy()? */
+ for (i = 0; i < mmdesc->md_info[keyno]->oi_nstored; i++)
+ dtup->dt_columns[keyno].values[i] = values[valueno++];
+
+ dtup->dt_columns[keyno].hasnulls = hasnulls[keyno];
+ dtup->dt_columns[keyno].allnulls = false;
+ }
+
+ pfree(values);
+ pfree(allnulls);
+ pfree(hasnulls);
+
+ return dtup;
+ }
+
+ /*
+ * mm_deconstruct_tuple
+ * Guts of attribute extraction from an on-disk minmax tuple.
+ *
+ * Its arguments are:
+ * mmdesc minmax descriptor for the stored tuple
+ * tp pointer to the tuple data area
+ * nullbits pointer to the tuple nulls bitmask
+ * nulls "has nulls" bit in tuple infomask
+ * values output values, array of size mmdesc->md_totalstored
+ * allnulls output "allnulls", size mmdesc->md_tupdesc->natts
+ * hasnulls output "hasnulls", size mmdesc->md_tupdesc->natts
+ *
+ * Output arrays must have been allocated by caller.
+ */
+ static inline void
+ mm_deconstruct_tuple(MinmaxDesc *mmdesc,
+ char *tp, bits8 *nullbits, bool nulls,
+ Datum *values, bool *allnulls, bool *hasnulls)
+ {
+ int attnum;
+ int stored;
+ TupleDesc diskdsc;
+ long off;
+
+ /*
+ * First iterate over the attributes to obtain both null flags for each one.
+ * Note that we reverse the sense of the att_isnull test, because we store
+ * 1 for a null value (rather than 1 for a not-null value, which is the
+ * att_isnull convention used elsewhere). See minmax_form_tuple.
+ */
+ for (attnum = 0; attnum < mmdesc->md_tupdesc->natts; attnum++)
+ {
+ /*
+ * the "all nulls" bit means that all values in the page range for
+ * this column are nulls. Therefore there are no values in the tuple
+ * data area.
+ */
+ allnulls[attnum] = nulls && !att_isnull(attnum, nullbits);
+
+ /*
+ * the "has nulls" bit means that some tuples have nulls, but others
+ * have not-null values. Therefore we know the tuple contains data for
+ * this column.
+ *
+ * The hasnulls bits follow the allnulls bits in the same bitmask.
+ */
+ hasnulls[attnum] =
+ nulls && !att_isnull(mmdesc->md_tupdesc->natts + attnum, nullbits);
+ }
+
+ /*
+ * Iterate to obtain each attribute's stored values. Note that since we
+ * may reuse attribute entries for more than one column, we cannot cache
+ * offsets here.
+ */
+ diskdsc = mmtuple_disk_tupdesc(mmdesc);
+ stored = 0;
+ off = 0;
+ for (attnum = 0; attnum < mmdesc->md_tupdesc->natts; attnum++)
+ {
+ int datumno;
+
+ if (allnulls[attnum])
+ {
+ stored += mmdesc->md_info[attnum]->oi_nstored;
+ continue;
+ }
+
+ for (datumno = 0;
+ datumno < mmdesc->md_info[attnum]->oi_nstored;
+ datumno++)
+ {
+ Form_pg_attribute thisatt = diskdsc->attrs[stored];
+
+ if (thisatt->attlen == -1)
+ {
+ off = att_align_pointer(off, thisatt->attalign, -1,
+ tp + off);
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ off = att_align_nominal(off, thisatt->attalign);
+ }
+
+ values[stored++] = fetchatt(thisatt, tp + off);
+
+ off = att_addlength_pointer(off, thisatt->attlen, tp + off);
+ }
+ }
+ }
*** /dev/null
--- b/src/backend/access/minmax/mmxlog.c
***************
*** 0 ****
--- 1,323 ----
+ /*
+ * mmxlog.c
+ * XLog replay routines for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmxlog.c
+ */
+ #include "postgres.h"
+
+ #include "access/minmax.h"
+ #include "access/minmax_internal.h"
+ #include "access/minmax_page.h"
+ #include "access/minmax_revmap.h"
+ #include "access/minmax_tuple.h"
+ #include "access/minmax_xlog.h"
+ #include "access/xlogutils.h"
+ #include "storage/freespace.h"
+
+
+ /*
+ * xlog replay routines
+ */
+ static void
+ minmax_xlog_createidx(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_createidx *xlrec = (xl_minmax_createidx *) XLogRecGetData(record);
+ Buffer buf;
+ Page page;
+
+ /* Backup blocks are not used in create_index records */
+ Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+
+ /* create the index's metapage */
+ buf = XLogReadBuffer(xlrec->node, MINMAX_METAPAGE_BLKNO, true);
+ Assert(BufferIsValid(buf));
+ page = (Page) BufferGetPage(buf);
+ mm_metapage_init(page, xlrec->pagesPerRange, xlrec->version);
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ UnlockReleaseBuffer(buf);
+
+ /* also initialize its first revmap page */
+ buf = XLogReadBuffer(xlrec->node, 1, true);
+ Assert(BufferIsValid(buf));
+ page = (Page) BufferGetPage(buf);
+ mm_page_init(page, MINMAX_PAGETYPE_REVMAP);
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ UnlockReleaseBuffer(buf);
+ }
+
+ /*
+ * Common part of an insert or update. Inserts the new tuple and updates the
+ * revmap.
+ */
+ static void
+ minmax_xlog_insert_update(XLogRecPtr lsn, XLogRecord *record, xl_minmax_insert *xlrec,
+ MMTuple *mmtuple, int tuplen)
+ {
+ BlockNumber blkno;
+ Buffer buffer;
+ Page page;
+ OffsetNumber offnum;
+
+ /* If we have a full-page image, restore it */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ }
+ else
+ {
+ Assert(mmtuple->mt_blkno == xlrec->heapBlk);
+
+ blkno = ItemPointerGetBlockNumber(&(xlrec->tid));
+ if (record->xl_info & XLOG_MINMAX_INIT_PAGE)
+ {
+ buffer = XLogReadBuffer(xlrec->node, blkno, true);
+ Assert(BufferIsValid(buffer));
+ page = (Page) BufferGetPage(buffer);
+
+ mm_page_init(page, MINMAX_PAGETYPE_REGULAR);
+ }
+ else
+ {
+ buffer = XLogReadBuffer(xlrec->node, blkno, false);
+ }
+ if (BufferIsValid(buffer))
+ {
+ page = (Page) BufferGetPage(buffer);
+
+ if (lsn > PageGetLSN(page))
+ {
+ offnum = ItemPointerGetOffsetNumber(&(xlrec->tid));
+ if (PageGetMaxOffsetNumber(page) + 1 < offnum)
+ elog(PANIC, "minmax_xlog_insert: invalid max offset number");
+
+ offnum = PageAddItem(page, (Item) mmtuple, tuplen, offnum, true, false);
+ if (offnum == InvalidOffsetNumber)
+ elog(PANIC, "minmax_xlog_insert: failed to add tuple");
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ }
+ UnlockReleaseBuffer(buffer);
+ }
+ }
+
+ /* update the revmap */
+ if (record->xl_info & XLR_BKP_BLOCK(1))
+ {
+ (void) RestoreBackupBlock(lsn, record, 1, false, false);
+ }
+ else
+ {
+ buffer = XLogReadBuffer(xlrec->node, xlrec->revmapBlk, false);
+ if (BufferIsValid(buffer))
+ {
+ page = (Page) BufferGetPage(buffer);
+
+ if (lsn > PageGetLSN(page))
+ {
+ mmSetHeapBlockItemptr(buffer, xlrec->pagesPerRange, xlrec->heapBlk, xlrec->tid);
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ }
+ UnlockReleaseBuffer(buffer);
+ }
+ }
+
+ /* XXX no FSM updates here ... */
+ }
+
+ static void
+ minmax_xlog_insert(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_insert *xlrec = (xl_minmax_insert *) XLogRecGetData(record);
+ MMTuple *newtup;
+ int tuplen;
+
+ tuplen = record->xl_len - SizeOfMinmaxInsert;
+ newtup = (MMTuple *) ((char *) xlrec + SizeOfMinmaxInsert);
+
+ minmax_xlog_insert_update(lsn, record, xlrec, newtup, tuplen);
+ }
+
+ static void
+ minmax_xlog_update(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_update *xlrec = (xl_minmax_update *) XLogRecGetData(record);
+ BlockNumber blkno;
+ OffsetNumber offnum;
+ Buffer buffer;
+ Page page;
+ MMTuple *newtup;
+ int tuplen;
+
+ tuplen = record->xl_len - SizeOfMinmaxUpdate;
+ newtup = (MMTuple *) ((char *) xlrec + SizeOfMinmaxUpdate);
+
+ /* First insert the new tuple and update revmap, like in an insertion. */
+ minmax_xlog_insert_update(lsn, record, &xlrec->new, newtup, tuplen);
+
+ /* Then remove the old tuple */
+ if (record->xl_info & XLR_BKP_BLOCK(2))
+ {
+ (void) RestoreBackupBlock(lsn, record, 2, false, false);
+ }
+ else
+ {
+ blkno = ItemPointerGetBlockNumber(&(xlrec->oldtid));
+ buffer = XLogReadBuffer(xlrec->new.node, blkno, false);
+ if (BufferIsValid(buffer))
+ {
+ page = (Page) BufferGetPage(buffer);
+
+ if (lsn > PageGetLSN(page))
+ {
+ offnum = ItemPointerGetOffsetNumber(&(xlrec->oldtid));
+ if (PageGetMaxOffsetNumber(page) + 1 < offnum)
+ elog(PANIC, "minmax_xlog_update: invalid max offset number");
+
+ PageIndexDeleteNoCompact(page, &offnum, 1);
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ }
+ UnlockReleaseBuffer(buffer);
+ }
+ }
+ }
+
+ /*
+ * Update a tuple on a single page.
+ */
+ static void
+ minmax_xlog_samepage_update(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_samepage_update *xlrec = (xl_minmax_samepage_update *) XLogRecGetData(record);
+ BlockNumber blkno;
+ Buffer buffer;
+ Page page;
+ OffsetNumber offnum;
+
+ /* If we have a full-page image, restore it */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ }
+ else
+ {
+ MMTuple *mmtuple;
+ int tuplen;
+
+ tuplen = record->xl_len - SizeOfMinmaxSamepageUpdate;
+ mmtuple = (MMTuple *) ((char *) xlrec + SizeOfMinmaxSamepageUpdate);
+
+ blkno = ItemPointerGetBlockNumber(&(xlrec->tid));
+ buffer = XLogReadBuffer(xlrec->node, blkno, false);
+ if (BufferIsValid(buffer))
+ {
+ page = (Page) BufferGetPage(buffer);
+
+ if (lsn > PageGetLSN(page))
+ {
+ offnum = ItemPointerGetOffsetNumber(&(xlrec->tid));
+ if (PageGetMaxOffsetNumber(page) + 1 < offnum)
+ elog(PANIC, "minmax_xlog_samepage_update: invalid max offset number");
+
+ PageIndexDeleteNoCompact(page, &offnum, 1);
+ offnum = PageAddItem(page, (Item) mmtuple, tuplen, offnum, true, false);
+ if (offnum == InvalidOffsetNumber)
+ elog(PANIC, "minmax_xlog_samepage_update: failed to add tuple");
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ }
+ UnlockReleaseBuffer(buffer);
+ }
+ }
+
+ /* XXX no FSM updates here ... */
+ }
+
+
+ static void
+ minmax_xlog_revmap_extend(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_minmax_revmap_extend *xlrec = (xl_minmax_revmap_extend *) XLogRecGetData(record);
+ Buffer metabuf;
+ Page metapg;
+ MinmaxMetaPageData *metadata;
+ Buffer buf;
+ Page page;
+
+ /* Update the metapage */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ metabuf = RestoreBackupBlock(lsn, record, 0, false, true);
+ }
+ else
+ {
+ metabuf = XLogReadBuffer(xlrec->node, MINMAX_METAPAGE_BLKNO, false);
+ if (BufferIsValid(metabuf))
+ {
+ metapg = BufferGetPage(metabuf);
+ if (lsn > PageGetLSN(metapg))
+ {
+ metadata = (MinmaxMetaPageData *) PageGetContents(metapg);
+
+ Assert(metadata->lastRevmapPage == xlrec->targetBlk - 1);
+ metadata->lastRevmapPage = xlrec->targetBlk;
+
+ PageSetLSN(metapg, lsn);
+ MarkBufferDirty(metabuf);
+ }
+ }
+ }
+
+ /*
+ * Re-init the target block as a revmap page. There's never a full-page
+ * image here.
+ */
+
+ buf = XLogReadBuffer(xlrec->node, xlrec->targetBlk, true);
+ page = (Page) BufferGetPage(buf);
+ mm_page_init(page, MINMAX_PAGETYPE_REVMAP);
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+
+ UnlockReleaseBuffer(buf);
+ UnlockReleaseBuffer(metabuf);
+ }
+
+ void
+ minmax_redo(XLogRecPtr lsn, XLogRecord *record)
+ {
+ uint8 info = record->xl_info & ~XLR_INFO_MASK;
+
+ switch (info & XLOG_MINMAX_OPMASK)
+ {
+ case XLOG_MINMAX_CREATE_INDEX:
+ minmax_xlog_createidx(lsn, record);
+ break;
+ case XLOG_MINMAX_INSERT:
+ minmax_xlog_insert(lsn, record);
+ break;
+ case XLOG_MINMAX_UPDATE:
+ minmax_xlog_update(lsn, record);
+ break;
+ case XLOG_MINMAX_SAMEPAGE_UPDATE:
+ minmax_xlog_samepage_update(lsn, record);
+ break;
+ case XLOG_MINMAX_REVMAP_EXTEND:
+ minmax_xlog_revmap_extend(lsn, record);
+ break;
+ default:
+ elog(PANIC, "minmax_redo: unknown op code %u", info);
+ }
+ }
*** a/src/backend/access/rmgrdesc/Makefile
--- b/src/backend/access/rmgrdesc/Makefile
***************
*** 9,15 **** top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
OBJS = clogdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
! mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o smgrdesc.o spgdesc.o \
standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
include $(top_srcdir)/src/backend/common.mk
--- 9,16 ----
include $(top_builddir)/src/Makefile.global
OBJS = clogdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
! minmaxdesc.o mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o \
! smgrdesc.o spgdesc.o \
standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
include $(top_srcdir)/src/backend/common.mk
*** /dev/null
--- b/src/backend/access/rmgrdesc/minmaxdesc.c
***************
*** 0 ****
--- 1,89 ----
+ /*-------------------------------------------------------------------------
+ *
+ * minmaxdesc.c
+ * rmgr descriptor routines for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/rmgrdesc/minmaxdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+ #include "postgres.h"
+
+ #include "access/minmax_xlog.h"
+
+ void
+ minmax_desc(StringInfo buf, XLogRecord *record)
+ {
+ char *rec = XLogRecGetData(record);
+ uint8 info = record->xl_info & ~XLR_INFO_MASK;
+
+ info &= XLOG_MINMAX_OPMASK;
+ if (info == XLOG_MINMAX_CREATE_INDEX)
+ {
+ xl_minmax_createidx *xlrec = (xl_minmax_createidx *) rec;
+
+ appendStringInfo(buf, "create index: v%d pagesPerRange %u %u/%u/%u",
+ xlrec->version, xlrec->pagesPerRange,
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode);
+ }
+ else if (info == XLOG_MINMAX_INSERT)
+ {
+ xl_minmax_insert *xlrec = (xl_minmax_insert *) rec;
+
+ if (record->xl_info & XLOG_MINMAX_INIT_PAGE)
+ appendStringInfo(buf, "insert(init): ");
+ else
+ appendStringInfo(buf, "insert: ");
+ appendStringInfo(buf, "%u/%u/%u heapBlk %u revmapBlk %u pagesPerRange %u TID (%u,%u)",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode,
+ xlrec->heapBlk, xlrec->revmapBlk,
+ xlrec->pagesPerRange,
+ ItemPointerGetBlockNumber(&xlrec->tid),
+ ItemPointerGetOffsetNumber(&xlrec->tid));
+ }
+ else if (info == XLOG_MINMAX_UPDATE)
+ {
+ xl_minmax_update *xlrec = (xl_minmax_update *) rec;
+
+ if (record->xl_info & XLOG_MINMAX_INIT_PAGE)
+ appendStringInfo(buf, "update(init): ");
+ else
+ appendStringInfo(buf, "update: ");
+ appendStringInfo(buf, "rel %u/%u/%u heapBlk %u revmapBlk %u pagesPerRange %u old TID (%u,%u) TID (%u,%u)",
+ xlrec->new.node.spcNode, xlrec->new.node.dbNode,
+ xlrec->new.node.relNode,
+ xlrec->new.heapBlk, xlrec->new.revmapBlk,
+ xlrec->new.pagesPerRange,
+ ItemPointerGetBlockNumber(&xlrec->oldtid),
+ ItemPointerGetOffsetNumber(&xlrec->oldtid),
+ ItemPointerGetBlockNumber(&xlrec->new.tid),
+ ItemPointerGetOffsetNumber(&xlrec->new.tid));
+ }
+ else if (info == XLOG_MINMAX_SAMEPAGE_UPDATE)
+ {
+ xl_minmax_samepage_update *xlrec = (xl_minmax_samepage_update *) rec;
+
+ appendStringInfo(buf, "samepage_update: rel %u/%u/%u TID (%u,%u)",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode,
+ ItemPointerGetBlockNumber(&xlrec->tid),
+ ItemPointerGetOffsetNumber(&xlrec->tid));
+ }
+ else if (info == XLOG_MINMAX_REVMAP_EXTEND)
+ {
+ xl_minmax_revmap_extend *xlrec = (xl_minmax_revmap_extend *) rec;
+
+ appendStringInfo(buf, "revmap extend: rel %u/%u/%u targetBlk %u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode, xlrec->targetBlk);
+ }
+ else
+ appendStringInfo(buf, "UNKNOWN");
+ }
*** a/src/backend/access/transam/rmgr.c
--- b/src/backend/access/transam/rmgr.c
***************
*** 12,17 ****
--- 12,18 ----
#include "access/gist_private.h"
#include "access/hash.h"
#include "access/heapam_xlog.h"
+ #include "access/minmax_xlog.h"
#include "access/multixact.h"
#include "access/nbtree.h"
#include "access/spgist.h"
*** a/src/backend/catalog/index.c
--- b/src/backend/catalog/index.c
***************
*** 2096,2101 **** IndexBuildHeapScan(Relation heapRelation,
--- 2096,2122 ----
IndexBuildCallback callback,
void *callback_state)
{
+ return IndexBuildHeapRangeScan(heapRelation, indexRelation,
+ indexInfo, allow_sync,
+ 0, InvalidBlockNumber,
+ callback, callback_state);
+ }
+
+ /*
+ * As above, except that instead of scanning the complete heap, only the given
+ * number of blocks, starting at start_blockno, is scanned. Scanning to the
+ * end of the relation can be signalled by passing InvalidBlockNumber as
+ * numblocks.
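+ * For example, a caller that wants to (re)summarize a single page range could
+ * pass that range's first block as start_blockno and the number of pages in
+ * the range as numblocks.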
+ */
+ double
+ IndexBuildHeapRangeScan(Relation heapRelation,
+ Relation indexRelation,
+ IndexInfo *indexInfo,
+ bool allow_sync,
+ BlockNumber start_blockno,
+ BlockNumber numblocks,
+ IndexBuildCallback callback,
+ void *callback_state)
+ {
bool is_system_catalog;
bool checking_uniqueness;
HeapScanDesc scan;
***************
*** 2166,2171 **** IndexBuildHeapScan(Relation heapRelation,
--- 2187,2195 ----
true, /* buffer access strategy OK */
allow_sync); /* syncscan OK? */
+ /* set our endpoints */
+ heap_setscanlimits(scan, start_blockno, numblocks);
+
reltuples = 0;
/*
*** a/src/backend/replication/logical/decode.c
--- b/src/backend/replication/logical/decode.c
***************
*** 132,137 **** LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogRecord *record)
--- 132,138 ----
case RM_GIST_ID:
case RM_SEQ_ID:
case RM_SPGIST_ID:
+ case RM_MINMAX_ID:
break;
case RM_NEXT_ID:
elog(ERROR, "unexpected RM_NEXT_ID rmgr_id: %u", (RmgrIds) buf.record.xl_rmid);
*** a/src/backend/storage/page/bufpage.c
--- b/src/backend/storage/page/bufpage.c
***************
*** 399,405 **** PageRestoreTempPage(Page tempPage, Page oldPage)
}
/*
! * sorting support for PageRepairFragmentation and PageIndexMultiDelete
*/
typedef struct itemIdSortData
{
--- 399,406 ----
}
/*
! * sorting support for PageRepairFragmentation, PageIndexMultiDelete,
! * PageIndexDeleteNoCompact
*/
typedef struct itemIdSortData
{
***************
*** 896,901 **** PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
--- 897,1078 ----
phdr->pd_upper = upper;
}
+ /*
+ * PageIndexDeleteNoCompact
+ * Delete the given items from an index page, and defragment the resulting
+ * free space, but do not compact the item pointers array.
+ *
+ * itemnos is the array of offset numbers of the items to delete; nitems is
+ * its size.
+ *
+ * Unused items at the end of the array are removed.
+ *
+ * This is used for index AMs that require that existing TIDs of live tuples
+ * remain unchanged.
+ */
+ void
+ PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos, int nitems)
+ {
+ PageHeader phdr = (PageHeader) page;
+ LocationIndex pd_lower = phdr->pd_lower;
+ LocationIndex pd_upper = phdr->pd_upper;
+ LocationIndex pd_special = phdr->pd_special;
+ int nline;
+ bool empty;
+ OffsetNumber offnum;
+ int nextitm;
+
+ /*
+ * As with PageRepairFragmentation, paranoia seems justified.
+ */
+ if (pd_lower < SizeOfPageHeaderData ||
+ pd_lower > pd_upper ||
+ pd_upper > pd_special ||
+ pd_special > BLCKSZ ||
+ pd_special != MAXALIGN(pd_special))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u",
+ pd_lower, pd_upper, pd_special)));
+
+ /*
+ * Scan the existing item pointer array and mark as unused those that are
+ * in our kill-list; make sure any non-interesting ones are marked unused
+ * as well.
+ */
+ nline = PageGetMaxOffsetNumber(page);
+ empty = true;
+ nextitm = 0;
+ for (offnum = FirstOffsetNumber; offnum <= nline; offnum = OffsetNumberNext(offnum))
+ {
+ ItemId lp;
+ ItemLength itemlen;
+ ItemOffset offset;
+
+ lp = PageGetItemId(page, offnum);
+
+ itemlen = ItemIdGetLength(lp);
+ offset = ItemIdGetOffset(lp);
+
+ if (ItemIdIsUsed(lp))
+ {
+ if (offset < pd_upper ||
+ (offset + itemlen) > pd_special ||
+ offset != MAXALIGN(offset))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted item pointer: offset = %u, length = %u",
+ offset, (unsigned int) itemlen)));
+
+ if (nextitm < nitems && offnum == itemnos[nextitm])
+ {
+ /* this one is on our list to delete, so mark it unused */
+ ItemIdSetUnused(lp);
+ nextitm++;
+ }
+ else if (ItemIdHasStorage(lp))
+ {
+ /* This one's live -- must do the compaction dance */
+ empty = false;
+ }
+ else
+ {
+ /* get rid of this one too */
+ ItemIdSetUnused(lp);
+ }
+ }
+ }
+
+ /* this will catch invalid or out-of-order itemnos[] */
+ if (nextitm != nitems)
+ elog(ERROR, "incorrect index offsets supplied");
+
+ if (empty)
+ {
+ /* Page is completely empty, so just reset it quickly */
+ phdr->pd_lower = SizeOfPageHeaderData;
+ phdr->pd_upper = pd_special;
+ }
+ else
+ {
+ /* There are live items: need to compact the page the hard way */
+ itemIdSortData itemidbase[MaxOffsetNumber];
+ itemIdSort itemidptr;
+ int i;
+ Size totallen;
+ Offset upper;
+
+ /*
+ * Scan the page taking note of each item that we need to preserve.
+ * This includes both live items (those that contain data) and
+ * interspersed unused ones. It's critical to preserve these unused
+ * items, because otherwise the offset numbers for later live items
+ * would change, which is not acceptable. Unused items might get used
+ * again later; that is fine.
+ */
+ itemidptr = itemidbase;
+ totallen = 0;
+ for (i = 0; i < nline; i++, itemidptr++)
+ {
+ ItemId lp;
+
+ itemidptr->offsetindex = i;
+
+ lp = PageGetItemId(page, i + 1);
+ if (ItemIdHasStorage(lp))
+ {
+ itemidptr->itemoff = ItemIdGetOffset(lp);
+ itemidptr->alignedlen = MAXALIGN(ItemIdGetLength(lp));
+ totallen += itemidptr->alignedlen;
+ }
+ else
+ {
+ itemidptr->itemoff = 0;
+ itemidptr->alignedlen = 0;
+ }
+ }
+ /* By here, there are exactly nline elements in itemidbase array */
+
+ if (totallen > (Size) (pd_special - pd_lower))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted item lengths: total %u, available space %u",
+ (unsigned int) totallen, pd_special - pd_lower)));
+
+ /* sort itemIdSortData array into decreasing itemoff order */
+ qsort((char *) itemidbase, nline, sizeof(itemIdSortData),
+ itemoffcompare);
+
+ /*
+ * Defragment the data areas of each tuple, being careful to preserve
+ * each item's position in the linp array.
+ */
+ upper = pd_special;
+ PageClearHasFreeLinePointers(page);
+ for (i = 0, itemidptr = itemidbase; i < nline; i++, itemidptr++)
+ {
+ ItemId lp;
+
+ lp = PageGetItemId(page, itemidptr->offsetindex + 1);
+ if (itemidptr->alignedlen == 0)
+ {
+ PageSetHasFreeLinePointers(page);
+ ItemIdSetUnused(lp);
+ continue;
+ }
+ upper -= itemidptr->alignedlen;
+ memmove((char *) page + upper,
+ (char *) page + itemidptr->itemoff,
+ itemidptr->alignedlen);
+ lp->lp_off = upper;
+ /* lp_flags and lp_len remain the same as originally */
+ }
+
+ /* Set the new page limits */
+ phdr->pd_upper = upper;
+ phdr->pd_lower = SizeOfPageHeaderData + i * sizeof(ItemIdData);
+ }
+ }
/*
* Set checksum for a page in shared buffers.
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
***************
*** 7349,7351 **** gincostestimate(PG_FUNCTION_ARGS)
--- 7349,7375 ----
PG_RETURN_VOID();
}
+
+ Datum
+ mmcostestimate(PG_FUNCTION_ARGS)
+ {
+ PlannerInfo *root = (PlannerInfo *) PG_GETARG_POINTER(0);
+ IndexPath *path = (IndexPath *) PG_GETARG_POINTER(1);
+ double loop_count = PG_GETARG_FLOAT8(2);
+ Cost *indexStartupCost = (Cost *) PG_GETARG_POINTER(3);
+ Cost *indexTotalCost = (Cost *) PG_GETARG_POINTER(4);
+ Selectivity *indexSelectivity = (Selectivity *) PG_GETARG_POINTER(5);
+ double *indexCorrelation = (double *) PG_GETARG_POINTER(6);
+ IndexOptInfo *index = path->indexinfo;
+
+ *indexStartupCost = (Cost) seq_page_cost * index->pages * loop_count;
+ *indexTotalCost = *indexStartupCost;
+
+ *indexSelectivity =
+ clauselist_selectivity(root, path->indexquals,
+ path->indexinfo->rel->relid,
+ JOIN_INNER, NULL);
+ *indexCorrelation = 1;
+
+ PG_RETURN_VOID();
+ }
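
For a sense of scale (my numbers, not output from the patch): with the default
seq_page_cost of 1.0 and a single loop, a minmax index spanning, say, 1000
pages is charged 1000 for both startup and total cost, and the selectivity
comes straight from clauselist_selectivity on the index quals. So the index is
costed like a sequential read of the index itself, which is tiny next to a
scan of the full heap.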
*** a/src/include/access/heapam.h
--- b/src/include/access/heapam.h
***************
*** 112,117 **** extern HeapScanDesc heap_beginscan_strat(Relation relation, Snapshot snapshot,
--- 112,119 ----
bool allow_strat, bool allow_sync);
extern HeapScanDesc heap_beginscan_bm(Relation relation, Snapshot snapshot,
int nkeys, ScanKey key);
+ extern void heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk,
+ BlockNumber endBlk);
extern void heap_rescan(HeapScanDesc scan, ScanKey key);
extern void heap_endscan(HeapScanDesc scan);
extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
*** /dev/null
--- b/src/include/access/minmax.h
***************
*** 0 ****
--- 1,52 ----
+ /*
+ * AM-callable functions for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax.h
+ */
+ #ifndef MINMAX_H
+ #define MINMAX_H
+
+ #include "fmgr.h"
+ #include "nodes/execnodes.h"
+ #include "utils/relcache.h"
+
+
+ /*
+ * prototypes for functions in minmax.c (external entry points for minmax)
+ */
+ extern Datum mmbuild(PG_FUNCTION_ARGS);
+ extern Datum mmbuildempty(PG_FUNCTION_ARGS);
+ extern Datum mminsert(PG_FUNCTION_ARGS);
+ extern Datum mmbeginscan(PG_FUNCTION_ARGS);
+ extern Datum mmgettuple(PG_FUNCTION_ARGS);
+ extern Datum mmgetbitmap(PG_FUNCTION_ARGS);
+ extern Datum mmrescan(PG_FUNCTION_ARGS);
+ extern Datum mmendscan(PG_FUNCTION_ARGS);
+ extern Datum mmmarkpos(PG_FUNCTION_ARGS);
+ extern Datum mmrestrpos(PG_FUNCTION_ARGS);
+ extern Datum mmbulkdelete(PG_FUNCTION_ARGS);
+ extern Datum mmvacuumcleanup(PG_FUNCTION_ARGS);
+ extern Datum mmcanreturn(PG_FUNCTION_ARGS);
+ extern Datum mmcostestimate(PG_FUNCTION_ARGS);
+ extern Datum mmoptions(PG_FUNCTION_ARGS);
+
+ /*
+ * Storage type for MinMax' reloptions
+ */
+ typedef struct MinmaxOptions
+ {
+ int32 vl_len_; /* varlena header (do not touch directly!) */
+ BlockNumber pagesPerRange;
+ } MinmaxOptions;
+
+ #define MINMAX_DEFAULT_PAGES_PER_RANGE 128
+ #define MinmaxGetPagesPerRange(relation) \
+ ((relation)->rd_options ? \
+ ((MinmaxOptions *) (relation)->rd_options)->pagesPerRange : \
+ MINMAX_DEFAULT_PAGES_PER_RANGE)
+
+ #endif /* MINMAX_H */
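
For illustration only (not part of the patch), here is how the pages_per_range
setting above turns a heap block into the first block of its range -- the same
normalization the AM code applies before consulting the revmap; the function
name is made up:

    static BlockNumber
    range_start_sketch(Relation idxrel, BlockNumber heapBlk)
    {
        BlockNumber pagesPerRange = MinmaxGetPagesPerRange(idxrel);

        /* first heap block covered by the range that contains heapBlk */
        return (heapBlk / pagesPerRange) * pagesPerRange;
    }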
*** /dev/null
--- b/src/include/access/minmax_internal.h
***************
*** 0 ****
--- 1,86 ----
+ /*
+ * minmax_internal.h
+ * internal declarations for MinMax indexes
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_internal.h
+ */
+ #ifndef MINMAX_INTERNAL_H
+ #define MINMAX_INTERNAL_H
+
+ #include "fmgr.h"
+ #include "storage/buf.h"
+ #include "storage/bufpage.h"
+ #include "storage/off.h"
+ #include "utils/relcache.h"
+
+
+ /*
+ * A MinmaxDesc is a struct designed to enable decoding a MinMax tuple from the
+ * on-disk format to a DeformedMMTuple and vice-versa.
+ */
+
+ /* struct returned by "OpcInfo" amproc */
+ typedef struct MinmaxOpcInfo
+ {
+ /* Number of columns stored in an index column of this opclass */
+ uint16 oi_nstored;
+
+ /* Opaque pointer for the opclass' private use */
+ void *oi_opaque;
+
+ /* Type IDs of the stored columns */
+ Oid oi_typids[FLEXIBLE_ARRAY_MEMBER];
+ } MinmaxOpcInfo;
+
+ /* the size of a MinmaxOpcInfo for the given number of columns */
+ #define SizeofMinmaxOpcInfo(ncols) \
+ (offsetof(MinmaxOpcInfo, oi_typids) + sizeof(Oid) * ncols)
+
+ typedef struct MinmaxDesc
+ {
+ /* the index relation itself */
+ Relation md_index;
+
+ /* tuple descriptor of the index relation */
+ TupleDesc md_tupdesc;
+
+ /* cached copy for on-disk tuples; generated at first use */
+ TupleDesc md_disktdesc;
+
+ /* total number of Datum entries that are stored on-disk for all columns */
+ int md_totalstored;
+
+ /* per-column info */
+ MinmaxOpcInfo *md_info[FLEXIBLE_ARRAY_MEMBER]; /* md_tupdesc->natts entries long */
+ } MinmaxDesc;
+
+ /*
+ * Globally-known function support numbers for Minmax indexes. Individual
+ * opclasses define their own function support numbers, which must not collide
+ * with the definitions here.
+ */
+ #define MINMAX_PROCNUM_OPCINFO 1
+ #define MINMAX_PROCNUM_ADDVALUE 2
+ #define MINMAX_PROCNUM_CONSISTENT 3
+
+ #define MINMAX_DEBUG
+
+ /* we allow debug if using GCC; otherwise don't bother */
+ #if defined(MINMAX_DEBUG) && defined(__GNUC__)
+ #define MINMAX_elog(level, ...) elog(level, __VA_ARGS__)
+ #else
+ #define MINMAX_elog(a) ((void) 0)
+ #endif
+
+ /* minmax.c */
+ extern MinmaxDesc *minmax_build_mmdesc(Relation rel);
+ extern void minmax_free_mmdesc(MinmaxDesc *mmdesc);
+ extern void mm_page_init(Page page, uint16 type);
+ extern void mm_metapage_init(Page page, BlockNumber pagesPerRange,
+ uint16 version);
+
+ #endif /* MINMAX_INTERNAL_H */
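
As a sketch only (mine, not patch code), an OpcInfo support procedure for a
"sortable" opclass would hand back something like the struct below: two stored
datums per indexed column, the minimum and the maximum. The function name and
the bare palloc are assumptions; the real support procedures
(mmSortableOpcInfo_int4 and friends) live in mmsortable.c, which is not shown
here.

    static MinmaxOpcInfo *
    sortable_opcinfo_sketch(Oid typid)
    {
        MinmaxOpcInfo *oi = palloc(SizeofMinmaxOpcInfo(2));

        oi->oi_nstored = 2;        /* one stored Datum for min, one for max */
        oi->oi_opaque = NULL;      /* no opclass-private state needed */
        oi->oi_typids[0] = typid;  /* stored min has the column's own type */
        oi->oi_typids[1] = typid;  /* stored max likewise */

        return oi;
    }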
*** /dev/null
--- b/src/include/access/minmax_page.h
***************
*** 0 ****
--- 1,70 ----
+ /*
+ * Prototypes and definitions for minmax page layouts
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_page.h
+ *
+ * NOTES
+ *
+ * These structs should really be private to specific minmax files, but it's
+ * useful to have them here so that they can be used by pageinspect and similar
+ * tools.
+ */
+ #ifndef MINMAX_PAGE_H
+ #define MINMAX_PAGE_H
+
+ #include "storage/block.h"
+ #include "storage/itemptr.h"
+
+ /* special space on all minmax pages stores a "type" identifier */
+ #define MINMAX_PAGETYPE_META 0xF091
+ #define MINMAX_PAGETYPE_REVMAP 0xF092
+ #define MINMAX_PAGETYPE_REGULAR 0xF093
+
+ #define MINMAX_PAGE_TYPE(page) \
+ (((MinmaxSpecialSpace *) PageGetSpecialPointer(page))->type)
+ #define MINMAX_IS_REVMAP_PAGE(page) (MINMAX_PAGE_TYPE(page) == MINMAX_PAGETYPE_REVMAP)
+ #define MINMAX_IS_REGULAR_PAGE(page) (MINMAX_PAGE_TYPE(page) == MINMAX_PAGETYPE_REGULAR)
+
+ /* flags */
+ #define MINMAX_EVACUATE_PAGE 1
+
+ typedef struct MinmaxSpecialSpace
+ {
+ uint16 flags;
+ uint16 type;
+ } MinmaxSpecialSpace;
+
+ /* Metapage definitions */
+ typedef struct MinmaxMetaPageData
+ {
+ uint32 minmaxMagic;
+ uint32 minmaxVersion;
+ BlockNumber pagesPerRange;
+ BlockNumber lastRevmapPage;
+ } MinmaxMetaPageData;
+
+ #define MINMAX_CURRENT_VERSION 1
+ #define MINMAX_META_MAGIC 0xA8109CFA
+
+ #define MINMAX_METAPAGE_BLKNO 0
+ #define MINMAX_REVMAP_FIRST_BLKNO 1
+
+ /* Definitions for regular revmap pages */
+ typedef struct RevmapContents
+ {
+ ItemPointerData rmr_tids[1]; /* really REVMAP_PAGE_MAXITEMS */
+ } RevmapContents;
+
+ #define REVMAP_CONTENT_SIZE \
+ (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - \
+ offsetof(RevmapContents, rmr_tids) - \
+ MAXALIGN(sizeof(MinmaxSpecialSpace)))
+ /* max num of items in the array */
+ #define REVMAP_PAGE_MAXITEMS \
+ (REVMAP_CONTENT_SIZE / sizeof(ItemPointerData))
+
+ #endif /* MINMAX_PAGE_H */
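
To put REVMAP_PAGE_MAXITEMS in concrete terms (my arithmetic, assuming the
default 8 kB BLCKSZ and 8-byte MAXALIGN): the page header takes 24 bytes and
the MAXALIGN'd special space 8, leaving 8160 bytes; at 6 bytes per
ItemPointerData that is 1360 entries per revmap page. With the default 128
pages per range, one revmap page therefore maps 1360 * 128 = 174080 heap
pages, roughly 1.3 GB of heap.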
*** /dev/null
--- b/src/include/access/minmax_pageops.h
***************
*** 0 ****
--- 1,29 ----
+ /*
+ * Prototypes for operating on minmax pages.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_pageops.h
+ */
+ #ifndef MINMAX_PAGEOPS_H
+ #define MINMAX_PAGEOPS_H
+
+ #include "access/minmax_revmap.h"
+
+ extern bool mm_doupdate(Relation idxrel, BlockNumber pagesPerRange,
+ mmRevmapAccess *rmAccess, BlockNumber heapBlk,
+ Buffer oldbuf, OffsetNumber oldoff,
+ const MMTuple *origtup, Size origsz,
+ const MMTuple *newtup, Size newsz,
+ bool samepage, bool *extended);
+ extern void mm_doinsert(Relation idxrel, BlockNumber pagesPerRange,
+ mmRevmapAccess *rmAccess, Buffer *buffer, BlockNumber heapBlk,
+ MMTuple *tup, Size itemsz, bool *extended);
+
+ extern bool mm_start_evacuating_page(Relation idxRel, Buffer buf);
+ extern void mm_evacuate_page(Relation idxRel, BlockNumber pagesPerRange,
+ mmRevmapAccess *rmAccess, Buffer buf);
+
+ #endif /* MINMAX_PAGEOPS_H */
*** /dev/null
--- b/src/include/access/minmax_revmap.h
***************
*** 0 ****
--- 1,36 ----
+ /*
+ * prototypes for minmax reverse range maps
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_revmap.h
+ */
+
+ #ifndef MINMAX_REVMAP_H
+ #define MINMAX_REVMAP_H
+
+ #include "access/minmax_tuple.h"
+ #include "storage/block.h"
+ #include "storage/buf.h"
+ #include "storage/itemptr.h"
+ #include "storage/off.h"
+ #include "utils/relcache.h"
+
+ /* struct definition lives in mmrevmap.c */
+ typedef struct mmRevmapAccess mmRevmapAccess;
+
+ extern mmRevmapAccess *mmRevmapAccessInit(Relation idxrel,
+ BlockNumber *pagesPerRange);
+ extern void mmRevmapAccessTerminate(mmRevmapAccess *rmAccess);
+
+ extern Buffer mmLockRevmapPageForUpdate(mmRevmapAccess *rmAccess,
+ BlockNumber heapBlk);
+ extern void mmSetHeapBlockItemptr(Buffer rmbuf, BlockNumber pagesPerRange,
+ BlockNumber heapBlk, ItemPointerData tid);
+ extern MMTuple *mmGetMMTupleForHeapBlock(mmRevmapAccess *rmAccess,
+ BlockNumber heapBlk, Buffer *buf, OffsetNumber *off,
+ int mode);
+
+ #endif /* MINMAX_REVMAP_H */
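
A minimal sketch (mine, not from the patch) of the logical mapping such a
lookup needs, assuming revmap pages are kept physically consecutive starting
at MINMAX_REVMAP_FIRST_BLKNO; how mmrevmap.c actually caches and extends this
is private to that file:

    static void
    revmap_locate_sketch(BlockNumber heapBlk, BlockNumber pagesPerRange,
                         BlockNumber *revmapBlk, int *indexInPage)
    {
        BlockNumber rangeNo = heapBlk / pagesPerRange;

        /* which revmap page holds this range's TID, and where within it */
        *revmapBlk = MINMAX_REVMAP_FIRST_BLKNO + rangeNo / REVMAP_PAGE_MAXITEMS;
        *indexInPage = rangeNo % REVMAP_PAGE_MAXITEMS;
    }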
*** /dev/null
--- b/src/include/access/minmax_tuple.h
***************
*** 0 ****
--- 1,90 ----
+ /*
+ * Declarations for dealing with MinMax-specific tuples.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_tuple.h
+ */
+ #ifndef MINMAX_TUPLE_H
+ #define MINMAX_TUPLE_H
+
+ #include "access/minmax_internal.h"
+ #include "access/tupdesc.h"
+
+
+ /*
+ * A minmax index stores one index tuple per page range. Each index tuple
+ * has one MMValues struct for each indexed column; in turn, each MMValues
+ * has (besides the null flags) an array of Datum whose size is determined by
+ * the opclass.
+ */
+ typedef struct MMValues
+ {
+ bool hasnulls; /* are there any nulls in the page range? */
+ bool allnulls; /* are all values null in the page range? */
+ Datum *values; /* current accumulated values */
+ } MMValues;
+
+ /*
+ * This struct represents one index tuple, comprising the minimum and maximum
+ * values for all indexed columns, within one page range. These values can
+ * only be meaningfully decoded with an appropriate MinmaxDesc.
+ */
+ typedef struct DeformedMMTuple
+ {
+ BlockNumber dt_blkno; /* heap blkno that the tuple is for */
+ MMValues dt_columns[FLEXIBLE_ARRAY_MEMBER];
+ } DeformedMMTuple;
+
+ /*
+ * An on-disk minmax tuple. The struct is possibly followed by a nulls
+ * bitmask, with room for two null bits per indexed column; an opclass-defined
+ * number of Datum values for each column follows.
+ */
+ typedef struct MMTuple
+ {
+ /* heap block number that the tuple is for */
+ BlockNumber mt_blkno;
+
+ /* ---------------
+ * mt_info is laid out in the following fashion:
+ *
+ * 7th (high) bit: has nulls
+ * 6th bit: unused
+ * 5th bit: unused
+ * 4-0 bit: offset of data
+ * ---------------
+ */
+ uint8 mt_info;
+ } MMTuple;
+
+ #define SizeOfMinMaxTuple (offsetof(MMTuple, mt_info) + sizeof(uint8))
+
+ /*
+ * mt_info manipulation macros
+ */
+ #define MMIDX_OFFSET_MASK 0x1F
+ /* bit 0x20 is not used at present */
+ /* bit 0x40 is not used at present */
+ #define MMIDX_NULLS_MASK 0x80
+
+ #define MMTupleDataOffset(mmtup) ((Size) (((MMTuple *) (mmtup))->mt_info & MMIDX_OFFSET_MASK))
+ #define MMTupleHasNulls(mmtup) (((((MMTuple *) (mmtup))->mt_info & MMIDX_NULLS_MASK)) != 0)
+
+
+ extern MMTuple *minmax_form_tuple(MinmaxDesc *mmdesc, BlockNumber blkno,
+ DeformedMMTuple *tuple, Size *size);
+ extern void minmax_free_tuple(MMTuple *tuple);
+ extern MMTuple *minmax_copy_tuple(MMTuple *tuple, Size len);
+ extern bool minmax_tuples_equal(const MMTuple *a, Size alen,
+ const MMTuple *b, Size blen);
+
+ extern DeformedMMTuple *minmax_new_dtuple(MinmaxDesc *mmdesc);
+ extern void minmax_dtuple_initialize(DeformedMMTuple *dtuple,
+ MinmaxDesc *mmdesc);
+ extern DeformedMMTuple *minmax_deform_tuple(MinmaxDesc *mmdesc,
+ MMTuple *tuple);
+
+ #endif /* MINMAX_TUPLE_H */
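
For illustration only, decoding the header byte of an on-disk tuple with the
macros above; the function and variable names are made up. The nulls bitmask,
if present, sits between the fixed header and the Datum area, which begins at
the stored data offset.

    static void
    mmtuple_header_sketch(MMTuple *mtup)
    {
        Size dataoff  = MMTupleDataOffset(mtup);   /* mt_info & MMIDX_OFFSET_MASK */
        bool hasnulls = MMTupleHasNulls(mtup);     /* mt_info & MMIDX_NULLS_MASK */

        elog(DEBUG2, "data at offset %u, %s nulls bitmask",
             (unsigned int) dataoff, hasnulls ? "with" : "without");
    }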
*** /dev/null
--- b/src/include/access/minmax_xlog.h
***************
*** 0 ****
--- 1,106 ----
+ /*-------------------------------------------------------------------------
+ *
+ * minmax_xlog.h
+ * POSTGRES MinMax access XLOG definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/minmax_xlog.h
+ *
+ *-------------------------------------------------------------------------
+ */
+ #ifndef MINMAX_XLOG_H
+ #define MINMAX_XLOG_H
+
+ #include "access/xlog.h"
+ #include "storage/bufpage.h"
+ #include "storage/itemptr.h"
+ #include "storage/relfilenode.h"
+ #include "utils/relcache.h"
+
+
+ /*
+ * WAL record definitions for minmax's WAL operations
+ *
+ * XLOG allows us to store some information in the high 4 bits of the
+ * log record's xl_info field.
+ */
+ #define XLOG_MINMAX_CREATE_INDEX 0x00
+ #define XLOG_MINMAX_INSERT 0x10
+ #define XLOG_MINMAX_UPDATE 0x20
+ #define XLOG_MINMAX_SAMEPAGE_UPDATE 0x30
+ #define XLOG_MINMAX_REVMAP_EXTEND 0x40
+ #define XLOG_MINMAX_REVMAP_VACUUM 0x50
+
+ #define XLOG_MINMAX_OPMASK 0x70
+ /*
+ * When we insert the first item on a new page, we restore the entire page in
+ * redo.
+ */
+ #define XLOG_MINMAX_INIT_PAGE 0x80
+
+ /* This is what we need to know about a minmax index create */
+ typedef struct xl_minmax_createidx
+ {
+ BlockNumber pagesPerRange;
+ RelFileNode node;
+ uint16 version;
+ } xl_minmax_createidx;
+ #define SizeOfMinmaxCreateIdx (offsetof(xl_minmax_createidx, version) + sizeof(uint16))
+
+ /*
+ * This is what we need to know about a minmax tuple insert
+ */
+ typedef struct xl_minmax_insert
+ {
+ RelFileNode node;
+ BlockNumber heapBlk;
+
+ /* extra information needed to update the revmap */
+ BlockNumber revmapBlk;
+ BlockNumber pagesPerRange;
+
+ ItemPointerData tid;
+ /* tuple data follows at end of struct */
+ } xl_minmax_insert;
+
+ #define SizeOfMinmaxInsert (offsetof(xl_minmax_insert, tid) + sizeof(ItemPointerData))
+
+ /*
+ * A cross-page update is the same as an insert, but we also store the old tid.
+ */
+ typedef struct xl_minmax_update
+ {
+ xl_minmax_insert new;
+ ItemPointerData oldtid;
+ } xl_minmax_update;
+
+ #define SizeOfMinmaxUpdate (offsetof(xl_minmax_update, oldtid) + sizeof(ItemPointerData))
+
+ /* This is what we need to know about a minmax tuple samepage update */
+ typedef struct xl_minmax_samepage_update
+ {
+ RelFileNode node;
+ ItemPointerData tid;
+ /* tuple data follows at end of struct */
+ } xl_minmax_samepage_update;
+
+ #define SizeOfMinmaxSamepageUpdate (offsetof(xl_minmax_samepage_update, tid) + sizeof(ItemPointerData))
+
+ /* This is what we need to know about a revmap extension */
+ typedef struct xl_minmax_revmap_extend
+ {
+ RelFileNode node;
+ BlockNumber targetBlk;
+ } xl_minmax_revmap_extend;
+
+ #define SizeOfMinmaxRevmapExtend (offsetof(xl_minmax_revmap_extend, targetBlk) + \
+ sizeof(BlockNumber))
+
+
+ extern void minmax_desc(StringInfo buf, XLogRecord *record);
+ extern void minmax_redo(XLogRecPtr lsn, XLogRecord *record);
+
+ #endif /* MINMAX_XLOG_H */
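
Illustrative sketch only (the usual xlog convention, not literal patch code)
of how the info bits above combine: the operation lives in the masked bits and
XLOG_MINMAX_INIT_PAGE rides on top when redo must re-initialize the target
page.

    static uint8
    minmax_insert_info_sketch(bool init_page)
    {
        uint8 info = XLOG_MINMAX_INSERT;

        if (init_page)
            info |= XLOG_MINMAX_INIT_PAGE;  /* redo restores the whole page */

        return info;
    }

    static bool
    minmax_is_insert_sketch(uint8 xl_info)
    {
        /* strip the INIT_PAGE bit before comparing the operation */
        return (xl_info & XLOG_MINMAX_OPMASK) == XLOG_MINMAX_INSERT;
    }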
*** a/src/include/access/reloptions.h
--- b/src/include/access/reloptions.h
***************
*** 45,52 **** typedef enum relopt_kind
RELOPT_KIND_TABLESPACE = (1 << 7),
RELOPT_KIND_SPGIST = (1 << 8),
RELOPT_KIND_VIEW = (1 << 9),
/* if you add a new kind, make sure you update "last_default" too */
! RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_VIEW,
/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
RELOPT_KIND_MAX = (1 << 30)
} relopt_kind;
--- 45,53 ----
RELOPT_KIND_TABLESPACE = (1 << 7),
RELOPT_KIND_SPGIST = (1 << 8),
RELOPT_KIND_VIEW = (1 << 9),
+ RELOPT_KIND_MINMAX = (1 << 10),
/* if you add a new kind, make sure you update "last_default" too */
! RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_MINMAX,
/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
RELOPT_KIND_MAX = (1 << 30)
} relopt_kind;
*** a/src/include/access/relscan.h
--- b/src/include/access/relscan.h
***************
*** 35,42 **** typedef struct HeapScanDescData
bool rs_temp_snap; /* unregister snapshot at scan end? */
/* state set up at initscan time */
! BlockNumber rs_nblocks; /* number of blocks to scan */
BlockNumber rs_startblock; /* block # to start at */
BufferAccessStrategy rs_strategy; /* access strategy for reads */
bool rs_syncscan; /* report location to syncscan logic? */
--- 35,44 ----
bool rs_temp_snap; /* unregister snapshot at scan end? */
/* state set up at initscan time */
! BlockNumber rs_nblocks; /* total number of blocks in rel */
BlockNumber rs_startblock; /* block # to start at */
+ BlockNumber rs_initblock; /* block # to consider initial of rel */
+ BlockNumber rs_numblocks; /* number of blocks to scan */
BufferAccessStrategy rs_strategy; /* access strategy for reads */
bool rs_syncscan; /* report location to syncscan logic? */
*** a/src/include/access/rmgrlist.h
--- b/src/include/access/rmgrlist.h
***************
*** 42,44 **** PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_xlog_startup, gin_xlog_cleanup
--- 42,45 ----
PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_xlog_startup, gist_xlog_cleanup)
PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, NULL, NULL)
PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_xlog_startup, spg_xlog_cleanup)
+ PG_RMGR(RM_MINMAX_ID, "MinMax", minmax_redo, minmax_desc, NULL, NULL)
*** a/src/include/catalog/index.h
--- b/src/include/catalog/index.h
***************
*** 97,102 **** extern double IndexBuildHeapScan(Relation heapRelation,
--- 97,110 ----
bool allow_sync,
IndexBuildCallback callback,
void *callback_state);
+ extern double IndexBuildHeapRangeScan(Relation heapRelation,
+ Relation indexRelation,
+ IndexInfo *indexInfo,
+ bool allow_sync,
+ BlockNumber start_blockno,
+ BlockNumber end_blockno,
+ IndexBuildCallback callback,
+ void *callback_state);
extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
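
A sketch (mine, with an assumed inclusive end_blockno and made-up names) of
how re-summarizing one range reduces to the new entry point above: scan only
that range's heap blocks and feed each tuple to the build callback.

    static void
    summarize_one_range_sketch(Relation heapRel, Relation idxRel,
                               IndexInfo *indexInfo,
                               BlockNumber rangeStart, BlockNumber pagesPerRange,
                               IndexBuildCallback callback, void *state)
    {
        (void) IndexBuildHeapRangeScan(heapRel, idxRel, indexInfo,
                                       false,       /* allow_sync */
                                       rangeStart,
                                       rangeStart + pagesPerRange - 1,
                                       callback, state);
    }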
*** a/src/include/catalog/pg_am.h
--- b/src/include/catalog/pg_am.h
***************
*** 132,136 **** DESCR("GIN index access method");
--- 132,138 ----
DATA(insert OID = 4000 ( spgist 0 5 f f f f f t f t f f f 0 spginsert spgbeginscan spggettuple spggetbitmap spgrescan spgendscan spgmarkpos spgrestrpos spgbuild spgbuildempty spgbulkdelete spgvacuumcleanup spgcanreturn spgcostestimate spgoptions ));
DESCR("SP-GiST index access method");
#define SPGIST_AM_OID 4000
+ DATA(insert OID = 3580 ( minmax 5 7 f f f f t t f t t f f 0 mminsert mmbeginscan - mmgetbitmap mmrescan mmendscan mmmarkpos mmrestrpos mmbuild mmbuildempty mmbulkdelete mmvacuumcleanup - mmcostestimate mmoptions ));
+ #define MINMAX_AM_OID 3580
#endif /* PG_AM_H */
*** a/src/include/catalog/pg_amop.h
--- b/src/include/catalog/pg_amop.h
***************
*** 845,848 **** DATA(insert ( 3550 869 869 25 s 932 783 0 ));
--- 845,929 ----
DATA(insert ( 3550 869 869 26 s 933 783 0 ));
DATA(insert ( 3550 869 869 27 s 934 783 0 ));
+ /*
+ * int4_minmax_ops
+ */
+ DATA(insert ( 4054 23 23 1 s 97 3580 0 ));
+ DATA(insert ( 4054 23 23 2 s 523 3580 0 ));
+ DATA(insert ( 4054 23 23 3 s 96 3580 0 ));
+ DATA(insert ( 4054 23 23 4 s 525 3580 0 ));
+ DATA(insert ( 4054 23 23 5 s 521 3580 0 ));
+
+ /*
+ * numeric_minmax_ops
+ */
+ DATA(insert ( 4055 1700 1700 1 s 1754 3580 0 ));
+ DATA(insert ( 4055 1700 1700 2 s 1755 3580 0 ));
+ DATA(insert ( 4055 1700 1700 3 s 1752 3580 0 ));
+ DATA(insert ( 4055 1700 1700 4 s 1757 3580 0 ));
+ DATA(insert ( 4055 1700 1700 5 s 1756 3580 0 ));
+
+ /*
+ * text_minmax_ops
+ */
+ DATA(insert ( 4056 25 25 1 s 664 3580 0 ));
+ DATA(insert ( 4056 25 25 2 s 665 3580 0 ));
+ DATA(insert ( 4056 25 25 3 s 98 3580 0 ));
+ DATA(insert ( 4056 25 25 4 s 667 3580 0 ));
+ DATA(insert ( 4056 25 25 5 s 666 3580 0 ));
+
+ /*
+ * time_minmax_ops
+ */
+ DATA(insert ( 4057 1083 1083 1 s 1110 3580 0 ));
+ DATA(insert ( 4057 1083 1083 2 s 1111 3580 0 ));
+ DATA(insert ( 4057 1083 1083 3 s 1108 3580 0 ));
+ DATA(insert ( 4057 1083 1083 4 s 1113 3580 0 ));
+ DATA(insert ( 4057 1083 1083 5 s 1112 3580 0 ));
+
+ /*
+ * timetz_minmax_ops
+ */
+ DATA(insert ( 4058 1266 1266 1 s 1552 3580 0 ));
+ DATA(insert ( 4058 1266 1266 2 s 1553 3580 0 ));
+ DATA(insert ( 4058 1266 1266 3 s 1550 3580 0 ));
+ DATA(insert ( 4058 1266 1266 4 s 1555 3580 0 ));
+ DATA(insert ( 4058 1266 1266 5 s 1554 3580 0 ));
+
+ /*
+ * timestamp_minmax_ops
+ */
+ DATA(insert ( 4059 1114 1114 1 s 2062 3580 0 ));
+ DATA(insert ( 4059 1114 1114 2 s 2063 3580 0 ));
+ DATA(insert ( 4059 1114 1114 3 s 2060 3580 0 ));
+ DATA(insert ( 4059 1114 1114 4 s 2065 3580 0 ));
+ DATA(insert ( 4059 1114 1114 5 s 2064 3580 0 ));
+
+ /*
+ * timestamptz_minmax_ops
+ */
+ DATA(insert ( 4060 1184 1184 1 s 1322 3580 0 ));
+ DATA(insert ( 4060 1184 1184 2 s 1323 3580 0 ));
+ DATA(insert ( 4060 1184 1184 3 s 1320 3580 0 ));
+ DATA(insert ( 4060 1184 1184 4 s 1325 3580 0 ));
+ DATA(insert ( 4060 1184 1184 5 s 1324 3580 0 ));
+
+ /*
+ * date_minmax_ops
+ */
+ DATA(insert ( 4061 1082 1082 1 s 1095 3580 0 ));
+ DATA(insert ( 4061 1082 1082 2 s 1096 3580 0 ));
+ DATA(insert ( 4061 1082 1082 3 s 1093 3580 0 ));
+ DATA(insert ( 4061 1082 1082 4 s 1098 3580 0 ));
+ DATA(insert ( 4061 1082 1082 5 s 1097 3580 0 ));
+
+ /*
+ * char_minmax_ops
+ */
+ DATA(insert ( 4062 18 18 1 s 631 3580 0 ));
+ DATA(insert ( 4062 18 18 2 s 632 3580 0 ));
+ DATA(insert ( 4062 18 18 3 s 92 3580 0 ));
+ DATA(insert ( 4062 18 18 4 s 634 3580 0 ));
+ DATA(insert ( 4062 18 18 5 s 633 3580 0 ));
+
#endif /* PG_AMOP_H */
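
To make the operator strategies above concrete, here is the gist of the
range-versus-scan-key check for an already-summarized range, written with
plain int32 comparisons for brevity; the patch does this through the opclass's
"consistent" support procedure and the catalogued operators, not like this.

    static bool
    range_may_match(int32 min, int32 max, uint16 strategy, int32 query)
    {
        switch (strategy)
        {
            case 1: return min < query;                    /* <  */
            case 2: return min <= query;                   /* <= */
            case 3: return min <= query && query <= max;   /* =  */
            case 4: return max >= query;                   /* >= */
            case 5: return max > query;                    /* >  */
            default: return true;   /* unknown strategy: cannot skip range */
        }
    }

A range summarized as [10, 20] may match WHERE col = 15 but can be skipped
entirely for WHERE col > 30.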
*** a/src/include/catalog/pg_amproc.h
--- b/src/include/catalog/pg_amproc.h
***************
*** 432,435 **** DATA(insert ( 4017 25 25 3 4029 ));
--- 432,508 ----
DATA(insert ( 4017 25 25 4 4030 ));
DATA(insert ( 4017 25 25 5 4031 ));
+ /* minmax */
+ DATA(insert ( 4054 23 23 1 3383 ));
+ DATA(insert ( 4054 23 23 2 3384 ));
+ DATA(insert ( 4054 23 23 3 3385 ));
+ DATA(insert ( 4054 23 23 4 66 ));
+ DATA(insert ( 4054 23 23 5 149 ));
+ DATA(insert ( 4054 23 23 6 150 ));
+ DATA(insert ( 4054 23 23 7 147 ));
+
+ DATA(insert ( 4055 1700 1700 1 3386 ));
+ DATA(insert ( 4055 1700 1700 2 3384 ));
+ DATA(insert ( 4055 1700 1700 3 3385 ));
+ DATA(insert ( 4055 1700 1700 4 1722 ));
+ DATA(insert ( 4055 1700 1700 5 1723 ));
+ DATA(insert ( 4055 1700 1700 6 1721 ));
+ DATA(insert ( 4055 1700 1700 7 1720 ));
+
+ DATA(insert ( 4056 25 25 1 3387 ));
+ DATA(insert ( 4056 25 25 2 3384 ));
+ DATA(insert ( 4056 25 25 3 3385 ));
+ DATA(insert ( 4056 25 25 4 740 ));
+ DATA(insert ( 4056 25 25 5 741 ));
+ DATA(insert ( 4056 25 25 6 743 ));
+ DATA(insert ( 4056 25 25 7 742 ));
+
+ DATA(insert ( 4057 1083 1083 1 3388 ));
+ DATA(insert ( 4057 1083 1083 2 3384 ));
+ DATA(insert ( 4057 1083 1083 3 3385 ));
+ DATA(insert ( 4057 1083 1083 4 1102 ));
+ DATA(insert ( 4057 1083 1083 5 1103 ));
+ DATA(insert ( 4057 1083 1083 6 1105 ));
+ DATA(insert ( 4057 1083 1083 7 1104 ));
+
+ DATA(insert ( 4058 1266 1266 1 3389 ));
+ DATA(insert ( 4058 1266 1266 2 3384 ));
+ DATA(insert ( 4058 1266 1266 3 3385 ));
+ DATA(insert ( 4058 1266 1266 4 1354 ));
+ DATA(insert ( 4058 1266 1266 5 1355 ));
+ DATA(insert ( 4058 1266 1266 6 1356 ));
+ DATA(insert ( 4058 1266 1266 7 1357 ));
+
+ DATA(insert ( 4059 1114 1114 1 3390 ));
+ DATA(insert ( 4059 1114 1114 2 3384 ));
+ DATA(insert ( 4059 1114 1114 3 3385 ));
+ DATA(insert ( 4059 1114 1114 4 2054 ));
+ DATA(insert ( 4059 1114 1114 5 2055 ));
+ DATA(insert ( 4059 1114 1114 6 2056 ));
+ DATA(insert ( 4059 1114 1114 7 2057 ));
+
+ DATA(insert ( 4060 1184 1184 1 3391 ));
+ DATA(insert ( 4060 1184 1184 2 3384 ));
+ DATA(insert ( 4060 1184 1184 3 3385 ));
+ DATA(insert ( 4060 1184 1184 4 1154 ));
+ DATA(insert ( 4060 1184 1184 5 1155 ));
+ DATA(insert ( 4060 1184 1184 6 1156 ));
+ DATA(insert ( 4060 1184 1184 7 1157 ));
+
+ DATA(insert ( 4061 1082 1082 1 3392 ));
+ DATA(insert ( 4061 1082 1082 2 3384 ));
+ DATA(insert ( 4061 1082 1082 3 3385 ));
+ DATA(insert ( 4061 1082 1082 4 1087 ));
+ DATA(insert ( 4061 1082 1082 5 1088 ));
+ DATA(insert ( 4061 1082 1082 6 1090 ));
+ DATA(insert ( 4061 1082 1082 7 1089 ));
+
+ DATA(insert ( 4062 18 18 1 3393 ));
+ DATA(insert ( 4062 18 18 2 3384 ));
+ DATA(insert ( 4062 18 18 3 3385 ));
+ DATA(insert ( 4062 18 18 4 1246 ));
+ DATA(insert ( 4062 18 18 5 72 ));
+ DATA(insert ( 4062 18 18 6 74 ));
+ DATA(insert ( 4062 18 18 7 73 ));
+
#endif /* PG_AMPROC_H */
*** a/src/include/catalog/pg_opclass.h
--- b/src/include/catalog/pg_opclass.h
***************
*** 235,239 **** DATA(insert ( 403 jsonb_ops PGNSP PGUID 4033 3802 t 0 ));
--- 235,248 ----
DATA(insert ( 405 jsonb_ops PGNSP PGUID 4034 3802 t 0 ));
DATA(insert ( 2742 jsonb_ops PGNSP PGUID 4036 3802 t 25 ));
DATA(insert ( 2742 jsonb_path_ops PGNSP PGUID 4037 3802 f 23 ));
+ DATA(insert ( 3580 int4_minmax_ops PGNSP PGUID 4054 23 t 0 ));
+ DATA(insert ( 3580 numeric_minmax_ops PGNSP PGUID 4055 1700 t 0 ));
+ DATA(insert ( 3580 text_minmax_ops PGNSP PGUID 4056 25 t 0 ));
+ DATA(insert ( 3580 time_minmax_ops PGNSP PGUID 4057 1083 t 0 ));
+ DATA(insert ( 3580 timetz_minmax_ops PGNSP PGUID 4058 1266 t 0 ));
+ DATA(insert ( 3580 timestamp_minmax_ops PGNSP PGUID 4059 1114 t 0 ));
+ DATA(insert ( 3580 timestamptz_minmax_ops PGNSP PGUID 4060 1184 t 0 ));
+ DATA(insert ( 3580 date_minmax_ops PGNSP PGUID 4061 1082 t 0 ));
+ DATA(insert ( 3580 char_minmax_ops PGNSP PGUID 4062 18 t 0 ));
#endif /* PG_OPCLASS_H */
*** a/src/include/catalog/pg_opfamily.h
--- b/src/include/catalog/pg_opfamily.h
***************
*** 157,160 **** DATA(insert OID = 4035 ( 783 jsonb_ops PGNSP PGUID ));
--- 157,170 ----
DATA(insert OID = 4036 ( 2742 jsonb_ops PGNSP PGUID ));
DATA(insert OID = 4037 ( 2742 jsonb_path_ops PGNSP PGUID ));
+ DATA(insert OID = 4054 ( 3580 int4_minmax_ops PGNSP PGUID ));
+ DATA(insert OID = 4055 ( 3580 numeric_minmax_ops PGNSP PGUID ));
+ DATA(insert OID = 4056 ( 3580 text_minmax_ops PGNSP PGUID ));
+ DATA(insert OID = 4057 ( 3580 time_minmax_ops PGNSP PGUID ));
+ DATA(insert OID = 4058 ( 3580 timetz_minmax_ops PGNSP PGUID ));
+ DATA(insert OID = 4059 ( 3580 timestamp_minmax_ops PGNSP PGUID ));
+ DATA(insert OID = 4060 ( 3580 timestamptz_minmax_ops PGNSP PGUID ));
+ DATA(insert OID = 4061 ( 3580 date_minmax_ops PGNSP PGUID ));
+ DATA(insert OID = 4062 ( 3580 char_minmax_ops PGNSP PGUID ));
+
#endif /* PG_OPFAMILY_H */
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
***************
*** 565,570 **** DESCR("btree(internal)");
--- 565,598 ----
DATA(insert OID = 2785 ( btoptions PGNSP PGUID 12 1 0 0 0 f f f f t f s 2 0 17 "1009 16" _null_ _null_ _null_ _null_ btoptions _null_ _null_ _null_ ));
DESCR("btree(internal)");
+ DATA(insert OID = 3789 ( mmgetbitmap PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 20 "2281 2281" _null_ _null_ _null_ _null_ mmgetbitmap _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3790 ( mminsert PGNSP PGUID 12 1 0 0 0 f f f f t f v 6 0 16 "2281 2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mminsert _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3791 ( mmbeginscan PGNSP PGUID 12 1 0 0 0 f f f f t f v 3 0 2281 "2281 2281 2281" _null_ _null_ _null_ _null_ mmbeginscan _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3792 ( mmrescan PGNSP PGUID 12 1 0 0 0 f f f f t f v 5 0 2278 "2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmrescan _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3793 ( mmendscan PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmendscan _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3794 ( mmmarkpos PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmmarkpos _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3795 ( mmrestrpos PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmrestrpos _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3796 ( mmbuild PGNSP PGUID 12 1 0 0 0 f f f f t f v 3 0 2281 "2281 2281 2281" _null_ _null_ _null_ _null_ mmbuild _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3797 ( mmbuildempty PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ mmbuildempty _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3798 ( mmbulkdelete PGNSP PGUID 12 1 0 0 0 f f f f t f v 4 0 2281 "2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmbulkdelete _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3799 ( mmvacuumcleanup PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ mmvacuumcleanup _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3800 ( mmcostestimate PGNSP PGUID 12 1 0 0 0 f f f f t f v 7 0 2278 "2281 2281 2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmcostestimate _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+ DATA(insert OID = 3801 ( mmoptions PGNSP PGUID 12 1 0 0 0 f f f f t f s 2 0 17 "1009 16" _null_ _null_ _null_ _null_ mmoptions _null_ _null_ _null_ ));
+ DESCR("minmax(internal)");
+
+
DATA(insert OID = 339 ( poly_same PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "604 604" _null_ _null_ _null_ _null_ poly_same _null_ _null_ _null_ ));
DATA(insert OID = 340 ( poly_contain PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "604 604" _null_ _null_ _null_ _null_ poly_contain _null_ _null_ _null_ ));
DATA(insert OID = 341 ( poly_left PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "604 604" _null_ _null_ _null_ _null_ poly_left _null_ _null_ _null_ ));
***************
*** 4066,4071 **** DATA(insert OID = 2747 ( arrayoverlap PGNSP PGUID 12 1 0 0 0 f f f f t f i
--- 4094,4123 ----
DATA(insert OID = 2748 ( arraycontains PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "2277 2277" _null_ _null_ _null_ _null_ arraycontains _null_ _null_ _null_ ));
DATA(insert OID = 2749 ( arraycontained PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "2277 2277" _null_ _null_ _null_ _null_ arraycontained _null_ _null_ _null_ ));
+ /* Minmax */
+ DATA(insert OID = 3384 ( minmax_sortable_add_value PGNSP PGUID 12 1 0 0 0 f f f f t f i 5 0 16 "2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ mmSortableAddValue _null_ _null_ _null_ ));
+ DESCR("MinMax sortable datatype support");
+ DATA(insert OID = 3385 ( minmax_sortable_consistent PGNSP PGUID 12 1 0 0 0 f f f f t f i 3 0 16 "2281 2281 2281" _null_ _null_ _null_ _null_ mmSortableConsistent _null_ _null_ _null_ ));
+ DESCR("MinMax sortable datatype support");
+ DATA(insert OID = 3383 ( minmax_sortable_opcinfo_int4 PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ mmSortableOpcInfo_int4 _null_ _null_ _null_ ));
+ DESCR("MinMax sortable datatype support");
+ DATA(insert OID = 3386 ( minmax_sortable_opcinfo_numeric PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ mmSortableOpcInfo_numeric _null_ _null_ _null_ ));
+ DESCR("MinMax sortable datatype support");
+ DATA(insert OID = 3387 ( minmax_sortable_opcinfo_text PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ mmSortableOpcInfo_text _null_ _null_ _null_ ));
+ DESCR("MinMax sortable datatype support");
+ DATA(insert OID = 3388 ( minmax_sortable_opcinfo_time PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ mmSortableOpcInfo_time _null_ _null_ _null_ ));
+ DESCR("MinMax sortable datatype support");
+ DATA(insert OID = 3389 ( minmax_sortable_opcinfo_timetz PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ mmSortableOpcInfo_timetz _null_ _null_ _null_ ));
+ DESCR("MinMax sortable datatype support");
+ DATA(insert OID = 3390 ( minmax_sortable_opcinfo_timestamp PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ mmSortableOpcInfo_timestamp _null_ _null_ _null_ ));
+ DESCR("MinMax sortable datatype support");
+ DATA(insert OID = 3391 ( minmax_sortable_opcinfo_timestamptz PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ mmSortableOpcInfo_timestamptz _null_ _null_ _null_ ));
+ DESCR("MinMax sortable datatype support");
+ DATA(insert OID = 3392 ( minmax_sortable_opcinfo_date PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ mmSortableOpcInfo_date _null_ _null_ _null_ ));
+ DESCR("MinMax sortable datatype support");
+ DATA(insert OID = 3393 ( minmax_sortable_opcinfo_char PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ mmSortableOpcInfo_char _null_ _null_ _null_ ));
+ DESCR("MinMax sortable datatype support");
+
/* userlock replacements */
DATA(insert OID = 2880 ( pg_advisory_lock PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "20" _null_ _null_ _null_ _null_ pg_advisory_lock_int8 _null_ _null_ _null_ ));
DESCR("obtain exclusive advisory lock");
*** a/src/include/storage/bufpage.h
--- b/src/include/storage/bufpage.h
***************
*** 403,408 **** extern Size PageGetExactFreeSpace(Page page);
--- 403,410 ----
extern Size PageGetHeapFreeSpace(Page page);
extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
+ extern void PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos,
+ int nitems);
extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
*** a/src/include/utils/selfuncs.h
--- b/src/include/utils/selfuncs.h
***************
*** 195,200 **** extern Datum hashcostestimate(PG_FUNCTION_ARGS);
--- 195,201 ----
extern Datum gistcostestimate(PG_FUNCTION_ARGS);
extern Datum spgcostestimate(PG_FUNCTION_ARGS);
extern Datum gincostestimate(PG_FUNCTION_ARGS);
+ extern Datum mmcostestimate(PG_FUNCTION_ARGS);
/* Functions in array_selfuncs.c */
*** a/src/test/regress/expected/opr_sanity.out
--- b/src/test/regress/expected/opr_sanity.out
***************
*** 1658,1663 **** ORDER BY 1, 2, 3;
--- 1658,1668 ----
2742 | 9 | ?
2742 | 10 | ?|
2742 | 11 | ?&
+ 3580 | 1 | <
+ 3580 | 2 | <=
+ 3580 | 3 | =
+ 3580 | 4 | >=
+ 3580 | 5 | >
4000 | 1 | <<
4000 | 1 | ~<~
4000 | 2 | &<
***************
*** 1680,1686 **** ORDER BY 1, 2, 3;
4000 | 15 | >
4000 | 16 | @>
4000 | 18 | =
! (80 rows)
-- Check that all opclass search operators have selectivity estimators.
-- This is not absolutely required, but it seems a reasonable thing
--- 1685,1691 ----
4000 | 15 | >
4000 | 16 | @>
4000 | 18 | =
! (85 rows)
-- Check that all opclass search operators have selectivity estimators.
-- This is not absolutely required, but it seems a reasonable thing
***************
*** 1842,1852 **** WHERE NOT (
-- GIN has six support functions. 1-3 are mandatory, 5 is optional, and
-- at least one of 4 and 6 must be given.
-- SP-GiST has five support functions, all mandatory
amname = 'btree' AND procnums @> '{1}' OR
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}'
);
amname | opfname | amproclefttype | amprocrighttype | procnums
--------+---------+----------------+-----------------+----------
--- 1847,1859 ----
-- GIN has six support functions. 1-3 are mandatory, 5 is optional, and
-- at least one of 4 and 6 must be given.
-- SP-GiST has five support functions, all mandatory
+ -- MinMax has seven support functions, all mandatory
amname = 'btree' AND procnums @> '{1}' OR
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}' OR
! amname = 'minmax' AND procnums = '{1, 2, 3, 4, 5, 6, 7}'
);
amname | opfname | amproclefttype | amprocrighttype | procnums
--------+---------+----------------+-----------------+----------
***************
*** 1867,1873 **** WHERE NOT (
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}'
);
amname | opcname | procnums
--------+---------+----------
--- 1874,1881 ----
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}' OR
! amname = 'minmax' AND procnums = '{1, 2, 3, 4, 5, 6, 7}'
);
amname | opcname | procnums
--------+---------+----------
*** a/src/test/regress/sql/opr_sanity.sql
--- b/src/test/regress/sql/opr_sanity.sql
***************
*** 1195,1205 **** WHERE NOT (
-- GIN has six support functions. 1-3 are mandatory, 5 is optional, and
-- at least one of 4 and 6 must be given.
-- SP-GiST has five support functions, all mandatory
amname = 'btree' AND procnums @> '{1}' OR
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}'
);
-- Also, check if there are any pg_opclass entries that don't seem to have
--- 1195,1207 ----
-- GIN has six support functions. 1-3 are mandatory, 5 is optional, and
-- at least one of 4 and 6 must be given.
-- SP-GiST has five support functions, all mandatory
+ -- MinMax has seven support functions, all mandatory
amname = 'btree' AND procnums @> '{1}' OR
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}' OR
! amname = 'minmax' AND procnums = '{1, 2, 3, 4, 5, 6, 7}'
);
-- Also, check if there are any pg_opclass entries that don't seem to have
***************
*** 1218,1224 **** WHERE NOT (
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}'
);
-- Unfortunately, we can't check the amproc link very well because the
--- 1220,1227 ----
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}' OR
! amname = 'minmax' AND procnums = '{1, 2, 3, 4, 5, 6, 7}'
);
-- Unfortunately, we can't check the amproc link very well because the
Alvaro Herrera wrote:
So here's v16, rebased on top of 9bac66020. As far as I am concerned,
this is the last version before I start renaming everything to BRIN and
then commit.
FWIW in case you or others have interest, here's the diff between your
patch and v16. Also, for illustrative purposes, the diff between your
version and mine of the code that got moved to mmpageops.c, because it's
difficult to see it from the partial patch. (There's nothing to do with
that partial diff other than read it directly.)
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
minmax-16-partial.patch (text/x-diff; charset=us-ascii)
*** a/contrib/pageinspect/mmfuncs.c
--- b/contrib/pageinspect/mmfuncs.c
***************
*** 29,35 ****
PG_FUNCTION_INFO_V1(minmax_page_type);
PG_FUNCTION_INFO_V1(minmax_page_items);
PG_FUNCTION_INFO_V1(minmax_metapage_info);
- PG_FUNCTION_INFO_V1(minmax_revmap_array_data);
PG_FUNCTION_INFO_V1(minmax_revmap_data);
typedef struct mm_column_state
--- 29,34 ----
***************
*** 388,394 **** minmax_revmap_data(PG_FUNCTION_ARGS)
values[0] = Int64GetDatum((uint64) 0);
/* Extract (possibly empty) list of TIDs in this page. */
! for (i = 0; i < REGULAR_REVMAP_PAGE_MAXITEMS; i++)
{
ItemPointer tid;
--- 387,393 ----
values[0] = Int64GetDatum((uint64) 0);
/* Extract (possibly empty) list of TIDs in this page. */
! for (i = 0; i < REVMAP_PAGE_MAXITEMS; i++)
{
ItemPointer tid;
*** a/src/backend/access/minmax/Makefile
--- b/src/backend/access/minmax/Makefile
***************
*** 12,17 **** subdir = src/backend/access/minmax
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
! OBJS = minmax.o mmrevmap.o mmtuple.o mmxlog.o mmsortable.o
include $(top_srcdir)/src/backend/common.mk
--- 12,17 ----
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
! OBJS = minmax.o mmpageops.o mmrevmap.o mmtuple.o mmxlog.o mmsortable.o
include $(top_srcdir)/src/backend/common.mk
*** a/src/backend/access/minmax/minmax.c
--- b/src/backend/access/minmax/minmax.c
***************
*** 15,45 ****
*/
#include "postgres.h"
- #include "access/htup_details.h"
#include "access/minmax.h"
#include "access/minmax_internal.h"
#include "access/minmax_page.h"
! #include "access/minmax_revmap.h"
! #include "access/minmax_tuple.h"
#include "access/minmax_xlog.h"
#include "access/reloptions.h"
#include "access/relscan.h"
- #include "access/xlogutils.h"
#include "catalog/index.h"
- #include "catalog/pg_operator.h"
- #include "commands/vacuum.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
#include "storage/freespace.h"
- #include "storage/indexfsm.h"
- #include "storage/lmgr.h"
- #include "storage/smgr.h"
- #include "utils/datum.h"
- #include "utils/lsyscache.h"
- #include "utils/memutils.h"
#include "utils/rel.h"
- #include "utils/syscache.h"
/*
--- 15,33 ----
*/
#include "postgres.h"
#include "access/minmax.h"
#include "access/minmax_internal.h"
#include "access/minmax_page.h"
! #include "access/minmax_pageops.h"
#include "access/minmax_xlog.h"
#include "access/reloptions.h"
#include "access/relscan.h"
#include "catalog/index.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
#include "storage/freespace.h"
#include "utils/rel.h"
/*
***************
*** 75,93 **** static MMBuildState *initialize_mm_buildstate(Relation idxRel,
static bool terminate_mm_buildstate(MMBuildState *state);
static void summarize_range(MMBuildState *mmstate, Relation heapRel,
BlockNumber heapBlk);
- static bool mm_doupdate(Relation idxrel, BlockNumber pagesPerRange,
- mmRevmapAccess *rmAccess, BlockNumber heapBlk,
- Buffer oldbuf, OffsetNumber oldoff,
- const MMTuple *origtup, Size origsz,
- const MMTuple *newtup, Size newsz,
- bool samepage, bool *extended);
- static void mm_doinsert(Relation idxrel, BlockNumber pagesPerRange,
- mmRevmapAccess *rmAccess, Buffer *buffer, BlockNumber heapblkno,
- MMTuple *tup, Size itemsz, bool *extended);
- static Buffer mm_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
- bool *extended);
static void form_and_insert_tuple(MMBuildState *mmstate);
- static Size mm_page_get_freespace(Page page);
/*
--- 63,69 ----
***************
*** 123,128 **** mminsert(PG_FUNCTION_ARGS)
--- 99,105 ----
rmAccess = mmRevmapAccessInit(idxRel, &pagesPerRange);
restart:
+ CHECK_FOR_INTERRUPTS();
heapBlk = ItemPointerGetBlockNumber(heaptid);
/* normalize the block number to be the first block in the range */
heapBlk = (heapBlk / pagesPerRange) * pagesPerRange;
***************
*** 155,161 **** restart:
addValue = index_getprocinfo(idxRel, keyno + 1,
MINMAX_PROCNUM_ADDVALUE);
-
result = FunctionCall5Coll(addValue,
idxRel->rd_indcollation[keyno],
PointerGetDatum(mmdesc),
--- 132,137 ----
***************
*** 197,203 **** restart:
/*
* Try to update the tuple. If this doesn't work for whatever reason,
* we need to restart from the top; the revmap might be pointing at a
! * different tuple for this block now.
*/
if (!mm_doupdate(idxRel, pagesPerRange, rmAccess, heapBlk, buf, off,
origtup, origsz, newtup, newsz, samepage, &extended))
--- 173,182 ----
/*
* Try to update the tuple. If this doesn't work for whatever reason,
* we need to restart from the top; the revmap might be pointing at a
! * different tuple for this block now, so we need to recompute
! * to ensure both our new heap tuple and the other inserter's are
! * covered by the combined tuple. It might be that we don't need to
! * update at all.
*/
if (!mm_doupdate(idxRel, pagesPerRange, rmAccess, heapBlk, buf, off,
origtup, origsz, newtup, newsz, samepage, &extended))
***************
*** 212,218 **** restart:
minmax_free_mmdesc(mmdesc);
if (extended)
! IndexFreeSpaceMapVacuum(idxRel);
return BoolGetDatum(false);
}
--- 191,197 ----
minmax_free_mmdesc(mmdesc);
if (extended)
! FreeSpaceMapVacuum(idxRel);
return BoolGetDatum(false);
}
***************
*** 313,318 **** mmgetbitmap(PG_FUNCTION_ARGS)
--- 292,299 ----
OffsetNumber off;
MMTuple *tup;
+ CHECK_FOR_INTERRUPTS();
+
tup = mmGetMMTupleForHeapBlock(opaque->rmAccess, heapBlk, &buf, &off,
BUFFER_LOCK_SHARE);
/*
***************
*** 488,494 **** mmbuildCallback(Relation index,
/* re-initialize state for it */
minmax_dtuple_initialize(mmstate->dtuple, mmstate->mmDesc);
- mmstate->seentup = false;
}
/* Accumulate the current tuple into the running state */
--- 469,474 ----
***************
*** 603,609 **** mmbuild(PG_FUNCTION_ARGS)
idxtuples = mmstate->numtuples;
mmRevmapAccessTerminate(mmstate->rmAccess);
if (terminate_mm_buildstate(mmstate))
! IndexFreeSpaceMapVacuum(index);
/*
* Return statistics
--- 583,589 ----
idxtuples = mmstate->numtuples;
mmRevmapAccessTerminate(mmstate->rmAccess);
if (terminate_mm_buildstate(mmstate))
! FreeSpaceMapVacuum(index);
/*
* Return statistics
***************
*** 684,689 **** mmvacuumcleanup(PG_FUNCTION_ARGS)
--- 664,671 ----
MMTuple *tup;
OffsetNumber off;
+ CHECK_FOR_INTERRUPTS();
+
tup = mmGetMMTupleForHeapBlock(rmAccess, heapBlk, &buf, &off,
BUFFER_LOCK_SHARE);
if (tup == NULL)
***************
*** 704,710 **** mmvacuumcleanup(PG_FUNCTION_ARGS)
/* free resources */
mmRevmapAccessTerminate(rmAccess);
if (mmstate && terminate_mm_buildstate(mmstate))
! IndexFreeSpaceMapVacuum(info->index);
heap_close(heapRel, AccessShareLock);
--- 686,692 ----
/* free resources */
mmRevmapAccessTerminate(rmAccess);
if (mmstate && terminate_mm_buildstate(mmstate))
! FreeSpaceMapVacuum(info->index);
heap_close(heapRel, AccessShareLock);
***************
*** 759,783 **** mm_page_init(Page page, uint16 type)
special->type = type;
}
- /*
- * Return the amount of free space on a regular minmax index page.
- *
- * If the page is not a regular page, or has been marked with the
- * MINMAX_EVACUATE_PAGE flag, returns 0.
- */
- static Size
- mm_page_get_freespace(Page page)
- {
- MinmaxSpecialSpace *special;
-
- special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
- if (!MINMAX_IS_REGULAR_PAGE(page) ||
- (special->flags & MINMAX_EVACUATE_PAGE) != 0)
- return 0;
- else
- return PageGetFreeSpace(page);
-
- }
/*
* Initialize a new minmax index' metapage.
--- 741,746 ----
***************
*** 792,799 **** mm_metapage_init(Page page, BlockNumber pagesPerRange, uint16 version)
metadata = (MinmaxMetaPageData *) PageGetContents(page);
metadata->minmaxMagic = MINMAX_META_MAGIC;
- metadata->pagesPerRange = pagesPerRange;
metadata->minmaxVersion = version;
metadata->lastRevmapPage = 0;
}
--- 755,768 ----
metadata = (MinmaxMetaPageData *) PageGetContents(page);
metadata->minmaxMagic = MINMAX_META_MAGIC;
metadata->minmaxVersion = version;
+ metadata->pagesPerRange = pagesPerRange;
+
+ /*
+ * Note we cheat here a little. 0 is not a valid revmap block number
+ * (because it's the metapage buffer), but doing this enables the first
+ * revmap page to be created when the index is.
+ */
metadata->lastRevmapPage = 0;
}
***************
*** 876,886 **** initialize_mm_buildstate(Relation idxRel, mmRevmapAccess *rmAccess,
mmstate->currRangeStart = 0;
mmstate->rmAccess = rmAccess;
mmstate->mmDesc = minmax_build_mmdesc(idxRel);
! mmstate->dtuple = minmax_new_dtuple(mmstate->mmDesc);
mmstate->extended = false;
minmax_dtuple_initialize(mmstate->dtuple, mmstate->mmDesc);
- mmstate->seentup = false;
return mmstate;
}
--- 845,855 ----
mmstate->currRangeStart = 0;
mmstate->rmAccess = rmAccess;
mmstate->mmDesc = minmax_build_mmdesc(idxRel);
! mmstate->seentup = false;
mmstate->extended = false;
+ mmstate->dtuple = minmax_new_dtuple(mmstate->mmDesc);
minmax_dtuple_initialize(mmstate->dtuple, mmstate->mmDesc);
return mmstate;
}
***************
*** 902,908 **** terminate_mm_buildstate(MMBuildState *mmstate)
page = BufferGetPage(mmstate->currentInsertBuf);
RecordPageWithFreeSpace(mmstate->irel,
BufferGetBlockNumber(mmstate->currentInsertBuf),
! mm_page_get_freespace(page));
ReleaseBuffer(mmstate->currentInsertBuf);
}
vacuumfsm = mmstate->extended;
--- 871,877 ----
page = BufferGetPage(mmstate->currentInsertBuf);
RecordPageWithFreeSpace(mmstate->irel,
BufferGetBlockNumber(mmstate->currentInsertBuf),
! PageGetFreeSpace(page));
ReleaseBuffer(mmstate->currentInsertBuf);
}
vacuumfsm = mmstate->extended;
***************
*** 945,1525 **** summarize_range(MMBuildState *mmstate, Relation heapRel, BlockNumber heapBlk)
/* and re-initialize state for the next range */
minmax_dtuple_initialize(mmstate->dtuple, mmstate->mmDesc);
- mmstate->seentup = false;
- }
-
- /*
- * Update tuple origtup (size origsz), located in offset oldoff of buffer
- * oldbuf, to newtup (size newsz) as summary tuple for the page range starting
- * at heapBlk. If samepage is true, then attempt to put the new tuple in the same
- * page, otherwise get a new one.
- *
- * If the update is done, return true; the revmap is updated to point to the
- * new tuple. If the update is not done for whatever reason, return false.
- * Caller may retry the update if this happens.
- *
- * If the index had to be extended in the course of this operation, *extended
- * is set to true.
- */
- static bool
- mm_doupdate(Relation idxrel, BlockNumber pagesPerRange,
- mmRevmapAccess *rmAccess, BlockNumber heapBlk,
- Buffer oldbuf, OffsetNumber oldoff,
- const MMTuple *origtup, Size origsz,
- const MMTuple *newtup, Size newsz,
- bool samepage, bool *extended)
- {
- Page oldpage;
- ItemId origlp;
- MMTuple *oldtup;
- Size oldsz;
- Buffer newbuf;
- MinmaxSpecialSpace *special;
-
- if (!samepage)
- {
- /* need a page on which to put the item */
- newbuf = mm_getinsertbuffer(idxrel, oldbuf, newsz, extended);
- if (!BufferIsValid(newbuf))
- return false;
-
- /*
- * Note: it's possible (though unlikely) that the returned newbuf is
- * the same as oldbuf, if mm_getinsertbuffer determined that the old
- * buffer does in fact have enough space.
- */
- if (newbuf == oldbuf)
- newbuf = InvalidBuffer;
- }
- else
- {
- LockBuffer(oldbuf, BUFFER_LOCK_EXCLUSIVE);
- newbuf = InvalidBuffer;
- }
- oldpage = BufferGetPage(oldbuf);
- origlp = PageGetItemId(oldpage, oldoff);
-
- /* Check that the old tuple wasn't updated concurrently */
- if (!ItemIdIsNormal(origlp))
- {
- LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
- return false;
- }
-
- oldsz = ItemIdGetLength(origlp);
- oldtup = (MMTuple *) PageGetItem(oldpage, origlp);
-
- /* If both tuples are in fact equal, there is nothing to do */
- if (!minmax_tuples_equal(oldtup, oldsz, origtup, origsz))
- {
- LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
- return false;
- }
-
- special = (MinmaxSpecialSpace *) PageGetSpecialPointer(oldpage);
-
- /*
- * Great, the old tuple is intact. We can proceed with the update.
- *
- * If there's enough room on the old page for the new tuple, replace it.
- *
- * Note that there might now be enough space on the page even though
- * the caller told us there isn't, if a concurrent updated moved a tuple
- * elsewhere or replaced a tuple with a smaller one.
- */
- if ((special->flags & MINMAX_EVACUATE_PAGE) == 0 &&
- (newsz <= origsz || PageGetExactFreeSpace(oldpage) >= (origsz - newsz)))
- {
- if (BufferIsValid(newbuf))
- UnlockReleaseBuffer(newbuf);
-
- START_CRIT_SECTION();
- PageIndexDeleteNoCompact(oldpage, &oldoff, 1);
- if (PageAddItem(oldpage, (Item) newtup, newsz, oldoff, true, false) == InvalidOffsetNumber)
- elog(ERROR, "failed to add mmtuple");
- MarkBufferDirty(oldbuf);
-
- /* XLOG stuff */
- if (RelationNeedsWAL(idxrel))
- {
- BlockNumber blk = BufferGetBlockNumber(oldbuf);
- xl_minmax_samepage_update xlrec;
- XLogRecPtr recptr;
- XLogRecData rdata[2];
- uint8 info = XLOG_MINMAX_SAMEPAGE_UPDATE;
-
- xlrec.node = idxrel->rd_node;
- ItemPointerSetBlockNumber(&xlrec.tid, blk);
- ItemPointerSetOffsetNumber(&xlrec.tid, oldoff);
- rdata[0].data = (char *) &xlrec;
- rdata[0].len = SizeOfMinmaxSamepageUpdate;
- rdata[0].buffer = InvalidBuffer;
- rdata[0].next = &(rdata[1]);
-
- rdata[1].data = (char *) newtup;
- rdata[1].len = newsz;
- rdata[1].buffer = oldbuf;
- rdata[1].buffer_std = true;
- rdata[1].next = NULL;
-
- recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
-
- PageSetLSN(oldpage, recptr);
- }
-
- END_CRIT_SECTION();
-
- LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
- return true;
- }
- else if (newbuf == InvalidBuffer)
- {
- /*
- * Not enough space, but caller said that there was. Tell them to
- * start over
- */
- LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
- return false;
- }
- else
- {
- /*
- * Not enough free space on the oldpage. Put the new tuple on the
- * new page, and update the revmap.
- */
- Page newpage = BufferGetPage(newbuf);
- Buffer revmapbuf;
- ItemPointerData newtid;
- OffsetNumber newoff;
-
- revmapbuf = mmLockRevmapPageForUpdate(rmAccess, heapBlk);
-
- START_CRIT_SECTION();
-
- PageIndexDeleteNoCompact(oldpage, &oldoff, 1);
- newoff = PageAddItem(newpage, (Item) newtup, newsz, InvalidOffsetNumber, false, false);
- if (newoff == InvalidOffsetNumber)
- elog(ERROR, "failed to add mmtuple to new page");
- MarkBufferDirty(oldbuf);
- MarkBufferDirty(newbuf);
-
- ItemPointerSet(&newtid, BufferGetBlockNumber(newbuf), newoff);
- mmSetHeapBlockItemptr(revmapbuf, pagesPerRange, heapBlk, newtid);
- MarkBufferDirty(revmapbuf);
-
- /* XLOG stuff */
- if (RelationNeedsWAL(idxrel))
- {
- xl_minmax_update xlrec;
- XLogRecPtr recptr;
- XLogRecData rdata[4];
- uint8 info = XLOG_MINMAX_UPDATE;
-
- xlrec.new.node = idxrel->rd_node;
- ItemPointerSet(&xlrec.new.tid, BufferGetBlockNumber(newbuf), newoff);
- xlrec.new.heapBlk = heapBlk;
- xlrec.new.revmapBlk = BufferGetBlockNumber(revmapbuf);
- xlrec.new.pagesPerRange = pagesPerRange;
- ItemPointerSet(&xlrec.oldtid, BufferGetBlockNumber(oldbuf), oldoff);
-
- rdata[0].data = (char *) &xlrec;
- rdata[0].len = SizeOfMinmaxUpdate;
- rdata[0].buffer = InvalidBuffer;
- rdata[0].next = &(rdata[1]);
-
- rdata[1].data = (char *) newtup;
- rdata[1].len = newsz;
- rdata[1].buffer = newbuf;
- rdata[1].buffer_std = true;
- rdata[1].next = &(rdata[2]);
-
- rdata[2].data = (char *) NULL;
- rdata[2].len = 0;
- rdata[2].buffer = revmapbuf;
- rdata[2].buffer_std = true;
- rdata[2].next = &(rdata[3]);
-
- rdata[3].data = (char *) NULL;
- rdata[3].len = 0;
- rdata[3].buffer = oldbuf;
- rdata[3].buffer_std = true;
- rdata[3].next = NULL;
-
- recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
-
- PageSetLSN(oldpage, recptr);
- PageSetLSN(newpage, recptr);
- PageSetLSN(BufferGetPage(revmapbuf), recptr);
- }
-
- END_CRIT_SECTION();
-
- LockBuffer(revmapbuf, BUFFER_LOCK_UNLOCK);
- LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
- UnlockReleaseBuffer(newbuf);
- return true;
- }
- }
-
- /*
- * Insert an index tuple into the index relation. The revmap is updated to
- * mark the range containing the given page as pointing to the inserted entry.
- * A WAL record is written.
- *
- * The buffer, if valid, is first checked for free space to insert the new
- * entry; if there isn't enough, a new buffer is obtained and pinned.
- *
- * If the relation had to be extended to make room for the new index tuple,
- * *extended is set to true.
- */
- static void
- mm_doinsert(Relation idxrel, BlockNumber pagesPerRange,
- mmRevmapAccess *rmAccess, Buffer *buffer,
- BlockNumber heapBlk, MMTuple *tup, Size itemsz, bool *extended)
- {
- Page page;
- BlockNumber blk;
- OffsetNumber off;
- Buffer revmapbuf;
- ItemPointerData tid;
-
- itemsz = MAXALIGN(itemsz);
-
- /*
- * Lock the revmap page for the update. Note that this may require
- * extending the revmap, which in turn may require moving the currently
- * pinned index block out of the way.
- */
- revmapbuf = mmLockRevmapPageForUpdate(rmAccess, heapBlk);
-
- /*
- * Obtain a locked buffer to insert the new tuple. Note mm_getinsertbuffer
- * ensures there's enough space in the returned buffer.
- */
- if (BufferIsValid(*buffer))
- {
- page = BufferGetPage(*buffer);
- LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
-
- /*
- * It's possible that another backend (or ourselves!) extended the
- * revmap over the page we held a pin on, so we cannot assume that
- * it's still a regular page.
- */
- if (mm_page_get_freespace(page) < itemsz)
- {
- UnlockReleaseBuffer(*buffer);
- *buffer = InvalidBuffer;
- }
- }
- if (!BufferIsValid(*buffer))
- {
- *buffer = mm_getinsertbuffer(idxrel, InvalidBuffer, itemsz, extended);
- Assert(BufferIsValid(*buffer));
- page = BufferGetPage(*buffer);
- Assert(mm_page_get_freespace(page) >= itemsz);
- }
-
- page = BufferGetPage(*buffer);
- blk = BufferGetBlockNumber(*buffer);
-
- START_CRIT_SECTION();
- off = PageAddItem(page, (Item) tup, itemsz, InvalidOffsetNumber,
- false, false);
- if (off == InvalidOffsetNumber)
- elog(ERROR, "could not insert new index tuple to page");
- MarkBufferDirty(*buffer);
-
- MINMAX_elog(DEBUG2, "inserted tuple (%u,%u) for range starting at %u",
- blk, off, heapBlk);
-
- ItemPointerSet(&tid, blk, off);
- mmSetHeapBlockItemptr(revmapbuf, pagesPerRange, heapBlk, tid);
- MarkBufferDirty(revmapbuf);
-
- /* XLOG stuff */
- if (RelationNeedsWAL(idxrel))
- {
- xl_minmax_insert xlrec;
- XLogRecPtr recptr;
- XLogRecData rdata[2];
- uint8 info = XLOG_MINMAX_INSERT;
-
- xlrec.node = idxrel->rd_node;
- xlrec.heapBlk = heapBlk;
- xlrec.pagesPerRange = pagesPerRange;
- xlrec.revmapBlk = BufferGetBlockNumber(revmapbuf);
- ItemPointerSet(&xlrec.tid, blk, off);
-
- rdata[0].data = (char *) &xlrec;
- rdata[0].len = SizeOfMinmaxInsert;
- rdata[0].buffer = InvalidBuffer;
- rdata[0].buffer_std = false;
- rdata[0].next = &(rdata[1]);
-
- rdata[1].data = (char *) tup;
- rdata[1].len = itemsz;
- rdata[1].buffer = *buffer;
- rdata[1].buffer_std = true;
- rdata[1].next = NULL;
-
- recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
-
- PageSetLSN(page, recptr);
- PageSetLSN(BufferGetPage(revmapbuf), recptr);
- }
-
- END_CRIT_SECTION();
-
- /* Tuple is firmly on buffer; we can release our locks */
- LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
- LockBuffer(revmapbuf, BUFFER_LOCK_UNLOCK);
- }
-
- /*
- * Checks if a regular minmax index page is empty.
- *
- * If it's not, it's marked for "evacuation", meaning that no new tuples will
- * be added to it.
- */
- bool
- mm_start_evacuating_page(Relation idxRel, Buffer buf)
- {
- OffsetNumber off;
- OffsetNumber maxoff;
- MinmaxSpecialSpace *special;
- Page page;
-
- page = BufferGetPage(buf);
-
- if (PageIsNew(page))
- return false;
-
- special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
-
- maxoff = PageGetMaxOffsetNumber(page);
- for (off = FirstOffsetNumber; off <= maxoff; off++)
- {
- ItemId lp;
-
- lp = PageGetItemId(page, off);
- if (ItemIdIsUsed(lp))
- {
- /* prevent other backends from adding more stuff to this page. */
- special->flags |= MINMAX_EVACUATE_PAGE;
- MarkBufferDirtyHint(buf, true);
-
- return true;
- }
- }
- return false;
- }
-
- /*
- * Move all tuples out of a page.
- *
- * The caller must hold an exclusive lock on the page. The lock and pin are
- * released.
- */
- void
- mm_evacuate_page(Relation idxRel, Buffer buf)
- {
- OffsetNumber off;
- OffsetNumber maxoff;
- MinmaxSpecialSpace *special;
- Page page;
- mmRevmapAccess *rmAccess;
- BlockNumber pagesPerRange;
-
- rmAccess = mmRevmapAccessInit(idxRel, &pagesPerRange);
-
- page = BufferGetPage(buf);
- special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
-
- Assert(special->flags & MINMAX_EVACUATE_PAGE);
-
- maxoff = PageGetMaxOffsetNumber(page);
- for (off = FirstOffsetNumber; off <= maxoff; off++)
- {
- MMTuple *tup;
- Size sz;
- ItemId lp;
- bool extended = false;
-
- lp = PageGetItemId(page, off);
- if (ItemIdIsUsed(lp))
- {
- tup = (MMTuple *) PageGetItem(page, lp);
- sz = ItemIdGetLength(lp);
-
- tup = minmax_copy_tuple(tup, sz);
-
- LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-
- if (!mm_doupdate(idxRel, pagesPerRange, rmAccess, tup->mt_blkno, buf,
- off, tup, sz, tup, sz, false, &extended))
- off--; /* retry */
-
- LockBuffer(buf, BUFFER_LOCK_SHARE);
-
- if (extended)
- IndexFreeSpaceMapVacuum(idxRel);
-
- /* It's possible that someone extended the revmap over this page */
- if (!MINMAX_IS_REGULAR_PAGE(page))
- break;
- }
- }
-
- mmRevmapAccessTerminate(rmAccess);
-
- UnlockReleaseBuffer(buf);
- }
-
- /*
- * Return a pinned and locked buffer which can be used to insert an index item
- * of size itemsz. If oldbuf is a valid buffer, it is also locked (in an order
- * determined to avoid deadlocks).
- *
- * If there's no existing page with enough free space to accommodate the new
- * item, the relation is extended. If this happens, *extended is set to true.
- *
- * If we find that the old page is no longer a regular index page (because
- * of a revmap extension), the old buffer is unlocked and we return
- * InvalidBuffer.
- */
- static Buffer
- mm_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
- bool *was_extended)
- {
- BlockNumber oldblk;
- BlockNumber newblk;
- Page page;
- int freespace;
- bool extended = false;
-
- if (BufferIsValid(oldbuf))
- oldblk = BufferGetBlockNumber(oldbuf);
- else
- oldblk = InvalidBlockNumber;
-
- /*
- * Loop until we find a page with sufficient free space. By the time we
- * return to caller out of this loop, both buffers are valid and locked;
- * if we have to restart here, neither buffer is locked and buf is not
- * a pinned buffer.
- */
- newblk = RelationGetTargetBlock(irel);
- if (newblk == InvalidBlockNumber)
- newblk = GetPageWithFreeSpace(irel, itemsz);
- for (;;)
- {
- Buffer buf;
- bool extensionLockHeld = false;
-
- if (newblk == InvalidBlockNumber)
- {
- /*
- * There's not enough free space in any existing index page,
- * according to the FSM: extend the relation to obtain a shiny
- * new page.
- */
- if (!RELATION_IS_LOCAL(irel))
- {
- LockRelationForExtension(irel, ExclusiveLock);
- extensionLockHeld = true;
- }
- buf = ReadBuffer(irel, P_NEW);
- extended = true;
-
- MINMAX_elog(DEBUG2, "mm_getinsertbuffer: extending to page %u",
- BufferGetBlockNumber(buf));
- }
- else if (newblk == oldblk)
- {
- /*
- * There's an odd corner-case here where the FSM is out-of-date,
- * and gave us the old page.
- */
- buf = oldbuf;
- }
- else
- {
- buf = ReadBuffer(irel, newblk);
- }
-
- if (BufferIsValid(oldbuf) && oldblk < newblk)
- {
- LockBuffer(oldbuf, BUFFER_LOCK_EXCLUSIVE);
- if (!MINMAX_IS_REGULAR_PAGE(BufferGetPage(oldbuf)))
- {
- LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
- ReleaseBuffer(buf);
- return InvalidBuffer;
- }
- }
-
- LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
-
- if (extensionLockHeld)
- UnlockRelationForExtension(irel, ExclusiveLock);
-
- page = BufferGetPage(buf);
-
- if (extended)
- mm_page_init(page, MINMAX_PAGETYPE_REGULAR);
-
- /*
- * We have a new buffer from FSM now, and both pages are locked.
- * Check that the new page has enough free space, and return it if it
- * does; otherwise start over. Note that we allow for the FSM to be
- * out of date here, and in that case we update it and move on.
- *
- * (mm_page_get_freespace also checks that the FSM didn't hand us a
- * page that has since been repurposed for the revmap.)
- */
- freespace = mm_page_get_freespace(page);
- if (freespace >= itemsz)
- {
- if (extended)
- *was_extended = true;
- RelationSetTargetBlock(irel, BufferGetBlockNumber(buf));
-
- /* Lock the old buffer if not locked already */
- if (BufferIsValid(oldbuf) && newblk < oldblk)
- LockBuffer(oldbuf, BUFFER_LOCK_EXCLUSIVE);
-
- return buf;
- }
-
- /* This page is no good. */
-
- /*
- * If an entirely new page does not contain enough free space for
- * the new item, then surely that item is oversized. Complain
- * loudly; but first make sure we record the page as free, for
- * next time.
- */
- if (extended)
- {
- RecordPageWithFreeSpace(irel, BufferGetBlockNumber(buf),
- freespace);
- ereport(ERROR,
- (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
- errmsg("index row size %lu exceeds maximum %lu for index \"%s\"",
- (unsigned long) itemsz,
- (unsigned long) freespace,
- RelationGetRelationName(irel))));
- return InvalidBuffer; /* keep compiler quiet */
- }
-
- if (newblk != oldblk)
- UnlockReleaseBuffer(buf);
- if (BufferIsValid(oldbuf) && oldblk < newblk)
- LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
-
- newblk = RecordAndGetPageWithFreeSpace(irel, newblk, freespace, itemsz);
- }
}
/*
--- 914,919 ----
***************
*** 1543,1546 **** form_and_insert_tuple(MMBuildState *mmstate)
--- 937,942 ----
tup, size, &mmstate->extended);
mmstate->numtuples++;
pfree(tup);
+
+ mmstate->seentup = false;
}
*** /dev/null
--- b/src/backend/access/minmax/mmpageops.c
***************
*** 0 ****
--- 1,638 ----
+ /*
+ * mmpageops.c
+ * Page-handling routines for Minmax indexes
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmpageops.c
+ */
+ #include "postgres.h"
+
+ #include "access/minmax_pageops.h"
+ #include "access/minmax_page.h"
+ #include "access/minmax_revmap.h"
+ #include "access/minmax_xlog.h"
+ #include "miscadmin.h"
+ #include "storage/bufmgr.h"
+ #include "storage/freespace.h"
+ #include "storage/lmgr.h"
+ #include "storage/smgr.h"
+ #include "utils/rel.h"
+
+
+ static Buffer mm_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
+ bool *was_extended);
+ static Size mm_page_get_freespace(Page page);
+
+
+ /*
+ * Update tuple origtup (size origsz), located in offset oldoff of buffer
+ * oldbuf, to newtup (size newsz) as summary tuple for the page range starting
+ * at heapBlk. If samepage is true, then attempt to put the new tuple in the same
+ * page, otherwise use some other one.
+ *
+ * If the update is done, return true; the revmap is updated to point to the
+ * new tuple. If the update is not done for whatever reason, return false.
+ * Caller may retry the update if this happens.
+ *
+ * If the index had to be extended in the course of this operation, *extended
+ * is set to true.
+ */
+ bool
+ mm_doupdate(Relation idxrel, BlockNumber pagesPerRange,
+ mmRevmapAccess *rmAccess, BlockNumber heapBlk,
+ Buffer oldbuf, OffsetNumber oldoff,
+ const MMTuple *origtup, Size origsz,
+ const MMTuple *newtup, Size newsz,
+ bool samepage, bool *extended)
+ {
+ Page oldpage;
+ ItemId origlp;
+ MMTuple *oldtup;
+ Size oldsz;
+ Buffer newbuf;
+ MinmaxSpecialSpace *special;
+
+ if (!samepage)
+ {
+ /* need a page on which to put the item */
+ newbuf = mm_getinsertbuffer(idxrel, oldbuf, newsz, extended);
+ if (!BufferIsValid(newbuf))
+ return false;
+
+ /*
+ * Note: it's possible (though unlikely) that the returned newbuf is
+ * the same as oldbuf, if mm_getinsertbuffer determined that the old
+ * buffer does in fact have enough space.
+ */
+ if (newbuf == oldbuf)
+ newbuf = InvalidBuffer;
+ }
+ else
+ {
+ LockBuffer(oldbuf, BUFFER_LOCK_EXCLUSIVE);
+ newbuf = InvalidBuffer;
+ }
+ oldpage = BufferGetPage(oldbuf);
+ origlp = PageGetItemId(oldpage, oldoff);
+
+ /* Check that the old tuple wasn't updated concurrently */
+ if (!ItemIdIsNormal(origlp))
+ {
+ LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+ return false;
+ }
+
+ oldsz = ItemIdGetLength(origlp);
+ oldtup = (MMTuple *) PageGetItem(oldpage, origlp);
+
+ /*
+ * If both tuples are identical, there is nothing to do; except that if we
+ * were requested to move the tuple across pages, we do it even if they are
+ * equal.
+ */
+ if (samepage && minmax_tuples_equal(oldtup, oldsz, origtup, origsz))
+ {
+ LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+ return false;
+ }
+
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(oldpage);
+
+ /*
+ * Great, the old tuple is intact. We can proceed with the update.
+ *
+ * If there's enough room on the old page for the new tuple, replace it.
+ *
+ * Note that there might now be enough space on the page even though
+ * the caller told us there isn't, if a concurrent update moved a tuple
+ * elsewhere or replaced a tuple with a smaller one.
+ */
+ if ((special->flags & MINMAX_EVACUATE_PAGE) == 0 &&
+ (newsz <= origsz || PageGetExactFreeSpace(oldpage) >= (origsz - newsz)))
+ {
+ if (BufferIsValid(newbuf))
+ UnlockReleaseBuffer(newbuf);
+
+ START_CRIT_SECTION();
+ PageIndexDeleteNoCompact(oldpage, &oldoff, 1);
+ if (PageAddItem(oldpage, (Item) newtup, newsz, oldoff, true, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add mmtuple");
+ MarkBufferDirty(oldbuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(idxrel))
+ {
+ BlockNumber blk = BufferGetBlockNumber(oldbuf);
+ xl_minmax_samepage_update xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+ uint8 info = XLOG_MINMAX_SAMEPAGE_UPDATE;
+
+ xlrec.node = idxrel->rd_node;
+ ItemPointerSetBlockNumber(&xlrec.tid, blk);
+ ItemPointerSetOffsetNumber(&xlrec.tid, oldoff);
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxSamepageUpdate;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = (char *) newtup;
+ rdata[1].len = newsz;
+ rdata[1].buffer = oldbuf;
+ rdata[1].buffer_std = true;
+ rdata[1].next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+
+ PageSetLSN(oldpage, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+ return true;
+ }
+ else if (newbuf == InvalidBuffer)
+ {
+ /*
+ * Not enough space, but caller said that there was. Tell them to
+ * start over.
+ */
+ LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+ return false;
+ }
+ else
+ {
+ /*
+ * Not enough free space on the oldpage. Put the new tuple on the
+ * new page, and update the revmap.
+ */
+ Page newpage = BufferGetPage(newbuf);
+ Buffer revmapbuf;
+ ItemPointerData newtid;
+ OffsetNumber newoff;
+
+ revmapbuf = mmLockRevmapPageForUpdate(rmAccess, heapBlk);
+
+ START_CRIT_SECTION();
+
+ PageIndexDeleteNoCompact(oldpage, &oldoff, 1);
+ newoff = PageAddItem(newpage, (Item) newtup, newsz, InvalidOffsetNumber, false, false);
+ if (newoff == InvalidOffsetNumber)
+ elog(ERROR, "failed to add mmtuple to new page");
+ MarkBufferDirty(oldbuf);
+ MarkBufferDirty(newbuf);
+
+ ItemPointerSet(&newtid, BufferGetBlockNumber(newbuf), newoff);
+ mmSetHeapBlockItemptr(revmapbuf, pagesPerRange, heapBlk, newtid);
+ MarkBufferDirty(revmapbuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(idxrel))
+ {
+ xl_minmax_update xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[4];
+ uint8 info = XLOG_MINMAX_UPDATE;
+
+ xlrec.new.node = idxrel->rd_node;
+ ItemPointerSet(&xlrec.new.tid, BufferGetBlockNumber(newbuf), newoff);
+ xlrec.new.heapBlk = heapBlk;
+ xlrec.new.revmapBlk = BufferGetBlockNumber(revmapbuf);
+ xlrec.new.pagesPerRange = pagesPerRange;
+ ItemPointerSet(&xlrec.oldtid, BufferGetBlockNumber(oldbuf), oldoff);
+
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxUpdate;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = (char *) newtup;
+ rdata[1].len = newsz;
+ rdata[1].buffer = newbuf;
+ rdata[1].buffer_std = true;
+ rdata[1].next = &(rdata[2]);
+
+ rdata[2].data = (char *) NULL;
+ rdata[2].len = 0;
+ rdata[2].buffer = revmapbuf;
+ rdata[2].buffer_std = true;
+ rdata[2].next = &(rdata[3]);
+
+ rdata[3].data = (char *) NULL;
+ rdata[3].len = 0;
+ rdata[3].buffer = oldbuf;
+ rdata[3].buffer_std = true;
+ rdata[3].next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+
+ PageSetLSN(oldpage, recptr);
+ PageSetLSN(newpage, recptr);
+ PageSetLSN(BufferGetPage(revmapbuf), recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ LockBuffer(revmapbuf, BUFFER_LOCK_UNLOCK);
+ LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+ UnlockReleaseBuffer(newbuf);
+ return true;
+ }
+ }
+
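+ /*
+ * A rough sketch of the retry pattern callers are expected to use (names
+ * and loop shape are illustrative, not taken from a real call site): a
+ * false return only means "nothing was changed", so a caller that must
+ * make progress re-reads the summary tuple and tries again:
+ *
+ *	while (!mm_doupdate(idxrel, pagesPerRange, rmAccess, heapBlk,
+ *						oldbuf, oldoff, origtup, origsz,
+ *						newtup, newsz, samepage, &extended))
+ *	{
+ *		... re-fetch the summary tuple for heapBlk and rebuild newtup ...
+ *	}
+ *
+ * mm_evacuate_page below retries in a similar way, decrementing the loop
+ * offset so the same tuple is attempted again.
+ */
+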
+ /*
+ * Insert an index tuple into the index relation. The revmap is updated to
+ * mark the range containing the given page as pointing to the inserted entry.
+ * A WAL record is written.
+ *
+ * The buffer, if valid, is first checked for free space to insert the new
+ * entry; if there isn't enough, a new buffer is obtained and pinned.
+ *
+ * If the relation had to be extended to make room for the new index tuple,
+ * *extended is set to true.
+ */
+ void
+ mm_doinsert(Relation idxrel, BlockNumber pagesPerRange,
+ mmRevmapAccess *rmAccess, Buffer *buffer, BlockNumber heapBlk,
+ MMTuple *tup, Size itemsz, bool *extended)
+ {
+ Page page;
+ BlockNumber blk;
+ OffsetNumber off;
+ Buffer revmapbuf;
+ ItemPointerData tid;
+
+ itemsz = MAXALIGN(itemsz);
+
+ /*
+ * Lock the revmap page for the update. Note that this may require
+ * extending the revmap, which in turn may require moving the currently
+ * pinned index block out of the way.
+ */
+ revmapbuf = mmLockRevmapPageForUpdate(rmAccess, heapBlk);
+
+ /*
+ * Obtain a locked buffer to insert the new tuple. Note mm_getinsertbuffer
+ * ensures there's enough space in the returned buffer.
+ */
+ if (BufferIsValid(*buffer))
+ {
+ /*
+ * It's possible that another backend (or ourselves!) extended the
+ * revmap over the page we held a pin on, so we cannot assume that
+ * it's still a regular page.
+ */
+ LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+ if (mm_page_get_freespace(BufferGetPage(*buffer)) < itemsz)
+ {
+ UnlockReleaseBuffer(*buffer);
+ *buffer = InvalidBuffer;
+ }
+ }
+
+ if (!BufferIsValid(*buffer))
+ {
+ *buffer = mm_getinsertbuffer(idxrel, InvalidBuffer, itemsz, extended);
+ Assert(BufferIsValid(*buffer));
+ Assert(mm_page_get_freespace(BufferGetPage(*buffer)) >= itemsz);
+ }
+
+ page = BufferGetPage(*buffer);
+ blk = BufferGetBlockNumber(*buffer);
+
+ START_CRIT_SECTION();
+ off = PageAddItem(page, (Item) tup, itemsz, InvalidOffsetNumber,
+ false, false);
+ if (off == InvalidOffsetNumber)
+ elog(ERROR, "could not insert new index tuple to page");
+ MarkBufferDirty(*buffer);
+
+ MINMAX_elog(DEBUG2, "inserted tuple (%u,%u) for range starting at %u",
+ blk, off, heapBlk);
+
+ ItemPointerSet(&tid, blk, off);
+ mmSetHeapBlockItemptr(revmapbuf, pagesPerRange, heapBlk, tid);
+ MarkBufferDirty(revmapbuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(idxrel))
+ {
+ xl_minmax_insert xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+ uint8 info = XLOG_MINMAX_INSERT;
+
+ xlrec.node = idxrel->rd_node;
+ xlrec.heapBlk = heapBlk;
+ xlrec.pagesPerRange = pagesPerRange;
+ xlrec.revmapBlk = BufferGetBlockNumber(revmapbuf);
+ ItemPointerSet(&xlrec.tid, blk, off);
+
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfMinmaxInsert;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].buffer_std = false;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = (char *) tup;
+ rdata[1].len = itemsz;
+ rdata[1].buffer = *buffer;
+ rdata[1].buffer_std = true;
+ rdata[1].next = NULL;
+
+ recptr = XLogInsert(RM_MINMAX_ID, info, rdata);
+
+ PageSetLSN(page, recptr);
+ PageSetLSN(BufferGetPage(revmapbuf), recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /* Tuple is firmly on buffer; we can release our locks */
+ LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+ LockBuffer(revmapbuf, BUFFER_LOCK_UNLOCK);
+ }
+
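+ /*
+ * Caller notes, as a sketch with illustrative names (not a real call
+ * site): *buffer comes back pinned but unlocked, so a caller inserting
+ * many tuples can pass the same Buffer variable on every call and usually
+ * skip the FSM lookup; releasing the pin when finished is the caller's
+ * job:
+ *
+ *	Buffer	buf = InvalidBuffer;
+ *	bool	extended = false;
+ *
+ *	... for each page range being summarized ...
+ *		mm_doinsert(idxrel, pagesPerRange, rmAccess, &buf,
+ *					heapBlk, tup, size, &extended);
+ *
+ *	if (BufferIsValid(buf))
+ *		ReleaseBuffer(buf);
+ */
+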
+ /*
+ * Initiate page evacuation protocol.
+ *
+ * The page must be locked in exclusive mode by the caller.
+ *
+ * If the page is not yet initialized or empty, return false without doing
+ * anything; it can be used for revmap without any further changes. If it
+ * contains tuples, mark it for evacuation and return true.
+ */
+ bool
+ mm_start_evacuating_page(Relation idxRel, Buffer buf)
+ {
+ OffsetNumber off;
+ OffsetNumber maxoff;
+ MinmaxSpecialSpace *special;
+ Page page;
+
+ page = BufferGetPage(buf);
+
+ if (PageIsNew(page))
+ return false;
+
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
+
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (off = FirstOffsetNumber; off <= maxoff; off++)
+ {
+ ItemId lp;
+
+ lp = PageGetItemId(page, off);
+ if (ItemIdIsUsed(lp))
+ {
+ /* prevent other backends from adding more stuff to this page */
+ special->flags |= MINMAX_EVACUATE_PAGE;
+ MarkBufferDirtyHint(buf, true);
+
+ return true;
+ }
+ }
+ return false;
+ }
+
+ /*
+ * Move all tuples out of a page.
+ *
+ * The caller must hold a lock on the page. The lock and pin are released.
+ */
+ void
+ mm_evacuate_page(Relation idxRel, BlockNumber pagesPerRange, mmRevmapAccess *rmAccess, Buffer buf)
+ {
+ OffsetNumber off;
+ OffsetNumber maxoff;
+ MinmaxSpecialSpace *special;
+ Page page;
+ bool extended = false;
+
+ page = BufferGetPage(buf);
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
+
+ Assert(special->flags & MINMAX_EVACUATE_PAGE);
+
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (off = FirstOffsetNumber; off <= maxoff; off++)
+ {
+ MMTuple *tup;
+ Size sz;
+ ItemId lp;
+
+ CHECK_FOR_INTERRUPTS();
+
+ lp = PageGetItemId(page, off);
+ if (ItemIdIsUsed(lp))
+ {
+ sz = ItemIdGetLength(lp);
+ tup = (MMTuple *) PageGetItem(page, lp);
+ tup = minmax_copy_tuple(tup, sz);
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ if (!mm_doupdate(idxRel, pagesPerRange, rmAccess, tup->mt_blkno, buf,
+ off, tup, sz, tup, sz, false, &extended))
+ off--; /* retry */
+
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+ /* It's possible that someone extended the revmap over this page */
+ if (!MINMAX_IS_REGULAR_PAGE(page))
+ break;
+ }
+ }
+
+ UnlockReleaseBuffer(buf);
+
+ if (extended)
+ FreeSpaceMapVacuum(idxRel);
+ }
+
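+ /*
+ * Taken together, mm_start_evacuating_page and mm_evacuate_page implement
+ * the protocol for repurposing a regular page as a revmap page.  rm_extend
+ * in mmrevmap.c calls them roughly like this (paraphrased, not literal
+ * code):
+ *
+ *	(target block and metapage locked)
+ *	if (mm_start_evacuating_page(idxRel, buf))
+ *	{
+ *		(release the metapage lock)
+ *		mm_evacuate_page(idxRel, pagesPerRange, rmAccess, buf);
+ *		(return, and have the caller start over)
+ *	}
+ *
+ * Setting MINMAX_EVACUATE_PAGE first makes mm_page_get_freespace report
+ * zero free space for the page, so concurrent backends stop putting new
+ * tuples on it while the existing ones are being moved away.
+ */
+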
+ /*
+ * Return a pinned and locked buffer which can be used to insert an index item
+ * of size itemsz. If oldbuf is a valid buffer, it is also locked (in an order
+ * determined to avoid deadlocks).
+ *
+ * If there's no existing page with enough free space to accommodate the new
+ * item, the relation is extended. If this happens, *extended is set to true.
+ *
+ * If we find that the old page is no longer a regular index page (because
+ * of a revmap extension), the old buffer is unlocked and we return
+ * InvalidBuffer.
+ */
+ static Buffer
+ mm_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
+ bool *was_extended)
+ {
+ BlockNumber oldblk;
+ BlockNumber newblk;
+ Page page;
+ int freespace;
+ bool extended = false;
+
+ if (BufferIsValid(oldbuf))
+ oldblk = BufferGetBlockNumber(oldbuf);
+ else
+ oldblk = InvalidBlockNumber;
+
+ /*
+ * Loop until we find a page with sufficient free space. By the time we
+ * return to the caller from this loop, both buffers are valid and locked;
+ * if we have to restart here, neither buffer is locked and buf is not
+ * a pinned buffer.
+ */
+ newblk = RelationGetTargetBlock(irel);
+ if (newblk == InvalidBlockNumber)
+ newblk = GetPageWithFreeSpace(irel, itemsz);
+ for (;;)
+ {
+ Buffer buf;
+ bool extensionLockHeld = false;
+
+ CHECK_FOR_INTERRUPTS();
+
+ if (newblk == InvalidBlockNumber)
+ {
+ /*
+ * There's not enough free space in any existing index page,
+ * according to the FSM: extend the relation to obtain a shiny
+ * new page.
+ */
+ if (!RELATION_IS_LOCAL(irel))
+ {
+ LockRelationForExtension(irel, ExclusiveLock);
+ extensionLockHeld = true;
+ }
+ buf = ReadBuffer(irel, P_NEW);
+ extended = true;
+
+ MINMAX_elog(DEBUG2, "mm_getinsertbuffer: extending to page %u",
+ BufferGetBlockNumber(buf));
+ }
+ else if (newblk == oldblk)
+ {
+ /*
+ * There's an odd corner-case here where the FSM is out-of-date,
+ * and gave us the old page.
+ */
+ buf = oldbuf;
+ }
+ else
+ {
+ buf = ReadBuffer(irel, newblk);
+ }
+
+ /*
+ * We lock the old buffer first, if it's earlier than the new one.
+ * We also need to check that it hasn't been turned into a revmap
+ * page concurrently; if we detect that it happened, give up and
+ * tell caller to start over.
+ */
+ if (BufferIsValid(oldbuf) && oldblk < newblk)
+ {
+ LockBuffer(oldbuf, BUFFER_LOCK_EXCLUSIVE);
+ if (!MINMAX_IS_REGULAR_PAGE(BufferGetPage(oldbuf)))
+ {
+ LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+ ReleaseBuffer(buf);
+ return InvalidBuffer;
+ }
+ }
+
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ if (extensionLockHeld)
+ UnlockRelationForExtension(irel, ExclusiveLock);
+
+ page = BufferGetPage(buf);
+
+ if (extended)
+ mm_page_init(page, MINMAX_PAGETYPE_REGULAR);
+
+ /*
+ * We have a new buffer from FSM now. Check that the new page has
+ * enough free space, and return it if it does; otherwise start over.
+ * Note that we allow for the FSM to be out of date here, and in that
+ * case we update it and move on.
+ *
+ * (mm_page_get_freespace also checks that the FSM didn't hand us a
+ * page that has since been repurposed for the revmap.)
+ */
+ freespace = mm_page_get_freespace(page);
+ if (freespace >= itemsz)
+ {
+ if (extended)
+ *was_extended = true;
+
+ RelationSetTargetBlock(irel, BufferGetBlockNumber(buf));
+
+ /*
+ * Lock the old buffer if not locked already. Note that in this
+ * case we know for sure it's a regular page: it's later than the
+ * new page we just got, which is not a revmap page, and revmap
+ * pages are always consecutive.
+ */
+ if (BufferIsValid(oldbuf) && oldblk > newblk)
+ {
+ LockBuffer(oldbuf, BUFFER_LOCK_EXCLUSIVE);
+ Assert(MINMAX_IS_REGULAR_PAGE(BufferGetPage(oldbuf)));
+ }
+
+ return buf;
+ }
+
+ /* This page is no good. */
+
+ /*
+ * If an entirely new page does not contain enough free space for
+ * the new item, then surely that item is oversized. Complain
+ * loudly; but first make sure we record the page as free, for
+ * next time.
+ */
+ if (extended)
+ {
+ RecordPageWithFreeSpace(irel, BufferGetBlockNumber(buf),
+ freespace);
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("index row size %lu exceeds maximum %lu for index \"%s\"",
+ (unsigned long) itemsz,
+ (unsigned long) freespace,
+ RelationGetRelationName(irel))));
+ return InvalidBuffer; /* keep compiler quiet */
+ }
+
+ if (newblk != oldblk)
+ UnlockReleaseBuffer(buf);
+ if (BufferIsValid(oldbuf) && oldblk < newblk)
+ LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+
+ newblk = RecordAndGetPageWithFreeSpace(irel, newblk, freespace, itemsz);
+ }
+ }
+
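+ /*
+ * The lock-ordering rule used above, summarized: whenever both the old and
+ * the new buffer must be locked, the buffer with the lower block number is
+ * locked first (oldbuf before buf when oldblk < newblk, and the other way
+ * around otherwise).  Sticking to a single global ordering is what keeps
+ * concurrent backends from deadlocking against each other here.
+ */
+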
+ /*
+ * Return the amount of free space on a regular minmax index page.
+ *
+ * If the page is not a regular page, or has been marked with the
+ * MINMAX_EVACUATE_PAGE flag, returns 0.
+ */
+ static Size
+ mm_page_get_freespace(Page page)
+ {
+ MinmaxSpecialSpace *special;
+
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
+ if (!MINMAX_IS_REGULAR_PAGE(page) ||
+ (special->flags & MINMAX_EVACUATE_PAGE) != 0)
+ return 0;
+ else
+ return PageGetFreeSpace(page);
+ }
*** a/src/backend/access/minmax/mmrevmap.c
--- b/src/backend/access/minmax/mmrevmap.c
***************
*** 3,17 ****
* Reverse range map for MinMax indexes
*
* The reverse range map (revmap) is a translation structure for minmax
! * indexes: for each page range, there is one most-up-to-date summary tuple,
! * and its location is tracked by the revmap. Whenever a new tuple is inserted
! * into a table that violates the previously recorded min/max values, a new
! * tuple is inserted into the index and the revmap is updated to point to it.
*
! * The pages of the revmap are in the beginning of the index, starting at
! * immediately after the metapage at block 1. When the revmap needs to be
! * expanded, all tuples on the regular minmax page at that block (if any) are
! * moved out of the way.
*
* Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
--- 3,16 ----
* Reverse range map for MinMax indexes
*
* The reverse range map (revmap) is a translation structure for minmax
! * indexes: for each page range there is one summary tuple, and its location is
! * tracked by the revmap. Whenever a new tuple is inserted into a table that
! * violates the previously recorded summary values, a new tuple is inserted
! * into the index and the revmap is updated to point to it.
*
! * The revmap is stored in the first pages of the index, immediately following
! * the metapage. When the revmap needs to be expanded, all tuples on the
! * regular minmax page at that block (if any) are moved out of the way.
*
* Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
***************
*** 21,50 ****
*/
#include "postgres.h"
! #include "access/heapam_xlog.h"
! #include "access/minmax.h"
! #include "access/minmax_internal.h"
#include "access/minmax_page.h"
#include "access/minmax_revmap.h"
#include "access/minmax_xlog.h"
#include "access/rmgr.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
! #include "storage/relfilenode.h"
! #include "storage/smgr.h"
! #include "utils/memutils.h"
/*
! * In revmap pages, each item stores an ItemPointerData. These defines
! * let one find the logical revmap page number and index number of the revmap
! * item for the given heap block number.
*/
#define HEAPBLK_TO_REVMAP_BLK(pagesPerRange, heapBlk) \
! ((heapBlk / pagesPerRange) / REGULAR_REVMAP_PAGE_MAXITEMS)
#define HEAPBLK_TO_REVMAP_INDEX(pagesPerRange, heapBlk) \
! ((heapBlk / pagesPerRange) % REGULAR_REVMAP_PAGE_MAXITEMS)
struct mmRevmapAccess
--- 20,47 ----
*/
#include "postgres.h"
! #include "access/xlog.h"
#include "access/minmax_page.h"
+ #include "access/minmax_pageops.h"
#include "access/minmax_revmap.h"
+ #include "access/minmax_tuple.h"
#include "access/minmax_xlog.h"
#include "access/rmgr.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
! #include "utils/rel.h"
/*
! * In revmap pages, each item stores an ItemPointerData. These defines let one
! * find the logical revmap page number and index number of the revmap item for
! * the given heap block number.
*/
#define HEAPBLK_TO_REVMAP_BLK(pagesPerRange, heapBlk) \
! ((heapBlk / pagesPerRange) / REVMAP_PAGE_MAXITEMS)
#define HEAPBLK_TO_REVMAP_INDEX(pagesPerRange, heapBlk) \
! ((heapBlk / pagesPerRange) % REVMAP_PAGE_MAXITEMS)
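+
+ /*
+ * Worked example, with made-up round numbers (the real value of
+ * REVMAP_PAGE_MAXITEMS depends on the block size): with pagesPerRange =
+ * 128 and REVMAP_PAGE_MAXITEMS = 1000, heap block 256000 belongs to range
+ * number 256000 / 128 = 2000, so its revmap item lives on logical revmap
+ * page 2000 / 1000 = 2, at index 2000 % 1000 = 0 within that page;
+ * rm_get_phys_blkno then adds one to skip the metapage, yielding physical
+ * block 3.
+ */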
struct mmRevmapAccess
***************
*** 58,63 **** struct mmRevmapAccess
--- 55,62 ----
/* typedef appears in minmax_revmap.h */
+ static BlockNumber rm_get_phys_blkno(mmRevmapAccess *rmAccess,
+ BlockNumber mapBlk, bool extend);
static void rm_extend(mmRevmapAccess *rmAccess);
/*
***************
*** 73,89 **** mmRevmapAccessInit(Relation idxrel, BlockNumber *pagesPerRange)
MinmaxMetaPageData *metadata;
meta = ReadBuffer(idxrel, MINMAX_METAPAGE_BLKNO);
metadata = (MinmaxMetaPageData *) PageGetContents(BufferGetPage(meta));
rmAccess = palloc(sizeof(mmRevmapAccess));
- rmAccess->metaBuf = meta;
rmAccess->idxrel = idxrel;
rmAccess->pagesPerRange = metadata->pagesPerRange;
rmAccess->currBuf = InvalidBuffer;
- rmAccess->lastRevmapPage = InvalidBlockNumber;
! if (pagesPerRange)
! *pagesPerRange = metadata->pagesPerRange;
return rmAccess;
}
--- 72,90 ----
MinmaxMetaPageData *metadata;
meta = ReadBuffer(idxrel, MINMAX_METAPAGE_BLKNO);
+ LockBuffer(meta, BUFFER_LOCK_SHARE);
metadata = (MinmaxMetaPageData *) PageGetContents(BufferGetPage(meta));
rmAccess = palloc(sizeof(mmRevmapAccess));
rmAccess->idxrel = idxrel;
rmAccess->pagesPerRange = metadata->pagesPerRange;
+ rmAccess->lastRevmapPage = metadata->lastRevmapPage;
+ rmAccess->metaBuf = meta;
rmAccess->currBuf = InvalidBuffer;
! *pagesPerRange = metadata->pagesPerRange;
!
! LockBuffer(meta, BUFFER_LOCK_UNLOCK);
return rmAccess;
}
***************
*** 94,281 **** mmRevmapAccessInit(Relation idxrel, BlockNumber *pagesPerRange)
void
mmRevmapAccessTerminate(mmRevmapAccess *rmAccess)
{
! if (rmAccess->metaBuf != InvalidBuffer)
! ReleaseBuffer(rmAccess->metaBuf);
if (rmAccess->currBuf != InvalidBuffer)
ReleaseBuffer(rmAccess->currBuf);
pfree(rmAccess);
}
/*
- * Read the metapage and update the given rmAccess with the metapage data.
- */
- static void
- rmaccess_read_metapage(mmRevmapAccess *rmAccess)
- {
- MinmaxMetaPageData *metadata;
- MinmaxSpecialSpace *special PG_USED_FOR_ASSERTS_ONLY;
- Page metapage;
-
- LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_SHARE);
- metapage = BufferGetPage(rmAccess->metaBuf);
-
- #ifdef USE_ASSERT_CHECKING
- /* ensure we really got the metapage */
- special = (MinmaxSpecialSpace *) PageGetSpecialPointer(metapage);
- Assert(special->type == MINMAX_PAGETYPE_META);
- #endif
-
- metadata = (MinmaxMetaPageData *) PageGetContents(metapage);
-
- rmAccess->lastRevmapPage = metadata->lastRevmapPage;
-
- LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_UNLOCK);
- }
-
- /*
- * Given a logical revmap block number, find its physical block number.
- *
- * Note this might involve up to two buffer reads, including a possible
- * update to the metapage.
- *
- * If extend is set to true, and the page hasn't been set yet, extend the
- * array to point to a newly allocated page.
- */
- static BlockNumber
- rm_get_phys_blkno(mmRevmapAccess *rmAccess, BlockNumber mapBlk, bool extend)
- {
- BlockNumber targetblk;
-
- if (rmAccess->lastRevmapPage == InvalidBlockNumber)
- rmaccess_read_metapage(rmAccess);
-
- /* the first revmap page is always block number 1 */
- targetblk = mapBlk + 1;
-
- if (targetblk <= rmAccess->lastRevmapPage)
- return targetblk;
-
- if (!extend)
- return InvalidBlockNumber;
-
- /* Extend the revmap */
- while (targetblk > rmAccess->lastRevmapPage)
- rm_extend(rmAccess);
-
- return targetblk;
- }
-
- /*
- * Extend the revmap by one page.
- *
- * If there is an existing minmax page at that block, it is atomically moved
- * out of the way, and the redirect pointer on the new revmap page is set
- * to point to its new location.
- *
- * If rmAccess->lastRevmapPage is out-of-date, it's updated and nothing else
- * is done.
- */
- static void
- rm_extend(mmRevmapAccess *rmAccess)
- {
- Buffer buf;
- Page page;
- Page metapage;
- MinmaxMetaPageData *metadata;
- BlockNumber mapBlk;
- BlockNumber nblocks;
- Relation irel = rmAccess->idxrel;
- bool needLock = !RELATION_IS_LOCAL(irel);
-
- /*
- * Lock the metapage. This locks out concurrent extensions of the revmap,
- * but note that we still need to grab the relation extension lock because
- * another backend can still extend the index with regular minmax pages.
- */
- LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_EXCLUSIVE);
- metapage = BufferGetPage(rmAccess->metaBuf);
- metadata = (MinmaxMetaPageData *) PageGetContents(metapage);
-
- /* Check that our cached lastRevmapPage value was up-to-date */
- if (metadata->lastRevmapPage != rmAccess->lastRevmapPage)
- {
- rmAccess->lastRevmapPage = metadata->lastRevmapPage;
-
- LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_UNLOCK);
- return;
- }
- mapBlk = metadata->lastRevmapPage + 1;
-
- nblocks = RelationGetNumberOfBlocks(irel);
- if (mapBlk < nblocks)
- {
- /* Check that the existing index block is sane. */
- buf = ReadBuffer(rmAccess->idxrel, mapBlk);
- LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- page = BufferGetPage(buf);
- }
- else
- {
- if (needLock)
- LockRelationForExtension(irel, ExclusiveLock);
-
- buf = ReadBuffer(irel, P_NEW);
- Assert(BufferGetBlockNumber(buf) == mapBlk);
- LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- page = BufferGetPage(buf);
-
- if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
- }
-
- /* Check that it's a regular block (or an empty page) */
- if (!PageIsNew(page) && !MINMAX_IS_REGULAR_PAGE(page))
- elog(ERROR, "unexpected minmax page type: 0x%04X",
- MINMAX_PAGE_TYPE(page));
-
- /* If the page is in use, evacuate it and restart */
- if (mm_start_evacuating_page(rmAccess->idxrel, buf))
- {
- LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_UNLOCK);
- mm_evacuate_page(rmAccess->idxrel, buf);
- return;
- }
-
- /*
- * Ok, we have now locked the metapage and the target block. Re-initialize
- * it as a revmap page.
- */
- START_CRIT_SECTION();
-
- /* the rmr_tids array is initialized to all invalid by PageInit */
- mm_page_init(page, MINMAX_PAGETYPE_REVMAP);
- MarkBufferDirty(buf);
-
- metadata->lastRevmapPage = mapBlk;
- MarkBufferDirty(rmAccess->metaBuf);
-
- if (RelationNeedsWAL(rmAccess->idxrel))
- {
- xl_minmax_revmap_extend xlrec;
- XLogRecPtr recptr;
- XLogRecData rdata;
-
- xlrec.node = rmAccess->idxrel->rd_node;
- xlrec.targetBlk = mapBlk;
-
- rdata.data = (char *) &xlrec;
- rdata.len = SizeOfMinmaxRevmapExtend;
- rdata.buffer = InvalidBuffer;
- rdata.buffer_std = false;
- rdata.next = NULL;
-
- recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_REVMAP_EXTEND, &rdata);
- PageSetLSN(metapage, recptr);
- PageSetLSN(page, recptr);
- }
-
- END_CRIT_SECTION();
-
- LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_UNLOCK);
- UnlockReleaseBuffer(buf);
- }
-
- /*
* Prepare for updating an entry in the revmap.
*
* The map is extended, if necessary.
--- 95,107 ----
void
mmRevmapAccessTerminate(mmRevmapAccess *rmAccess)
{
! ReleaseBuffer(rmAccess->metaBuf);
if (rmAccess->currBuf != InvalidBuffer)
ReleaseBuffer(rmAccess->currBuf);
pfree(rmAccess);
}
/*
* Prepare for updating an entry in the revmap.
*
* The map is extended, if necessary.
***************
*** 285,294 **** mmLockRevmapPageForUpdate(mmRevmapAccess *rmAccess, BlockNumber heapBlk)
{
BlockNumber mapBlk;
mapBlk = HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk);
-
- /* Translate the map block number to physical location */
mapBlk = rm_get_phys_blkno(rmAccess, mapBlk, true);
MINMAX_elog(DEBUG2, "locking revmap page for logical page %lu (physical %u) for heap %u",
HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk),
--- 111,123 ----
{
BlockNumber mapBlk;
+ /*
+ * Translate the map block number to physical location. Note this extends
+ * the revmap, if necessary.
+ */
mapBlk = HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk);
mapBlk = rm_get_phys_blkno(rmAccess, mapBlk, true);
+ Assert(mapBlk != InvalidBlockNumber);
MINMAX_elog(DEBUG2, "locking revmap page for logical page %lu (physical %u) for heap %u",
HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk),
***************
*** 305,311 **** mmLockRevmapPageForUpdate(mmRevmapAccess *rmAccess, BlockNumber heapBlk)
if (rmAccess->currBuf != InvalidBuffer)
ReleaseBuffer(rmAccess->currBuf);
- Assert(mapBlk != InvalidBlockNumber);
rmAccess->currBuf = ReadBuffer(rmAccess->idxrel, mapBlk);
}
--- 134,139 ----
***************
*** 373,380 **** mmGetMMTupleForHeapBlock(mmRevmapAccess *rmAccess, BlockNumber heapBlk,
/* normalize the heap block number to be the first page in the range */
heapBlk = (heapBlk / rmAccess->pagesPerRange) * rmAccess->pagesPerRange;
mapBlk = HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk);
- /* Translate the map block number to physical location */
mapBlk = rm_get_phys_blkno(rmAccess, mapBlk, false);
if (mapBlk == InvalidBlockNumber)
{
--- 201,208 ----
/* normalize the heap block number to be the first page in the range */
heapBlk = (heapBlk / rmAccess->pagesPerRange) * rmAccess->pagesPerRange;
+ /* Compute the revmap page number we need */
mapBlk = HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk);
mapBlk = rm_get_phys_blkno(rmAccess, mapBlk, false);
if (mapBlk == InvalidBlockNumber)
{
***************
*** 385,390 **** mmGetMMTupleForHeapBlock(mmRevmapAccess *rmAccess, BlockNumber heapBlk,
--- 213,220 ----
ItemPointerSetInvalid(&previptr);
for (;;)
{
+ CHECK_FOR_INTERRUPTS();
+
if (rmAccess->currBuf == InvalidBuffer ||
BufferGetBlockNumber(rmAccess->currBuf) != mapBlk)
{
***************
*** 452,464 **** mmGetMMTupleForHeapBlock(mmRevmapAccess *rmAccess, BlockNumber heapBlk,
/*
* No luck. Assume that the revmap was updated concurrently.
- *
- * XXX: it would be nice to add some kind of a sanity check here to
- * avoid looping infinitely, if the revmap points to wrong tuple for
- * some reason.
*/
LockBuffer(*buf, BUFFER_LOCK_UNLOCK);
}
/* not reached, but keep compiler quiet */
return NULL;
}
--- 282,451 ----
/*
* No luck. Assume that the revmap was updated concurrently.
*/
LockBuffer(*buf, BUFFER_LOCK_UNLOCK);
}
/* not reached, but keep compiler quiet */
return NULL;
}
+
+ /*
+ * Given a logical revmap block number, find its physical block number.
+ *
+ * If extend is true and the revmap does not yet cover the given logical
+ * page, the revmap is extended so that it does.
+ */
+ static BlockNumber
+ rm_get_phys_blkno(mmRevmapAccess *rmAccess, BlockNumber mapBlk, bool extend)
+ {
+ BlockNumber targetblk;
+
+ /* skip the metapage to obtain physical block numbers of revmap pages */
+ targetblk = mapBlk + 1;
+
+ /* Normal case: the revmap page is already allocated */
+ if (targetblk <= rmAccess->lastRevmapPage)
+ return targetblk;
+
+ if (!extend)
+ return InvalidBlockNumber;
+
+ /* Extend the revmap */
+ while (targetblk > rmAccess->lastRevmapPage)
+ rm_extend(rmAccess);
+
+ return targetblk;
+ }
+
+ /*
+ * Extend the revmap by one page.
+ *
+ * However, if the revmap was extended by someone else concurrently, we might
+ * return without actually doing anything.
+ *
+ * If there is an existing minmax page at that block, it is atomically moved
+ * out of the way, and the redirect pointer on the new revmap page is set
+ * to point to its new location.
+ */
+ static void
+ rm_extend(mmRevmapAccess *rmAccess)
+ {
+ Buffer buf;
+ Page page;
+ Page metapage;
+ MinmaxMetaPageData *metadata;
+ BlockNumber mapBlk;
+ BlockNumber nblocks;
+ Relation irel = rmAccess->idxrel;
+ bool needLock = !RELATION_IS_LOCAL(irel);
+
+ /*
+ * Lock the metapage. This locks out concurrent extensions of the revmap,
+ * but note that we still need to grab the relation extension lock because
+ * another backend can extend the index with regular minmax pages.
+ */
+ LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_EXCLUSIVE);
+ metapage = BufferGetPage(rmAccess->metaBuf);
+ metadata = (MinmaxMetaPageData *) PageGetContents(metapage);
+
+ /*
+ * Check that our cached lastRevmapPage value was up-to-date; if it wasn't,
+ * update the cached copy and have caller start over.
+ */
+ if (metadata->lastRevmapPage != rmAccess->lastRevmapPage)
+ {
+ rmAccess->lastRevmapPage = metadata->lastRevmapPage;
+ LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_UNLOCK);
+ return;
+ }
+ mapBlk = metadata->lastRevmapPage + 1;
+
+ nblocks = RelationGetNumberOfBlocks(irel);
+ if (mapBlk < nblocks)
+ {
+ buf = ReadBuffer(irel, mapBlk);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ page = BufferGetPage(buf);
+ }
+ else
+ {
+ if (needLock)
+ LockRelationForExtension(irel, ExclusiveLock);
+
+ buf = ReadBuffer(irel, P_NEW);
+ if (BufferGetBlockNumber(buf) != mapBlk)
+ {
+ /*
+ * Very rare corner case: somebody extended the relation
+ * concurrently after we read its length. If this happens, give up
+ * and have caller start over. We will have to evacuate that page
+ * from under whoever is using it.
+ */
+ if (needLock)
+ UnlockRelationForExtension(irel, ExclusiveLock);
+ LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_UNLOCK);
+ return;
+ }
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ page = BufferGetPage(buf);
+
+ if (needLock)
+ UnlockRelationForExtension(irel, ExclusiveLock);
+ }
+
+ /* Check that it's a regular block (or an empty page) */
+ if (!PageIsNew(page) && !MINMAX_IS_REGULAR_PAGE(page))
+ elog(ERROR, "unexpected minmax page type: 0x%04X",
+ MINMAX_PAGE_TYPE(page));
+
+ /* If the page is in use, evacuate it and restart */
+ if (mm_start_evacuating_page(irel, buf))
+ {
+ LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_UNLOCK);
+ mm_evacuate_page(irel, rmAccess->pagesPerRange, rmAccess, buf);
+
+ /* have caller start over */
+ return;
+ }
+
+ /*
+ * Ok, we have now locked the metapage and the target block. Re-initialize
+ * it as a revmap page.
+ */
+ START_CRIT_SECTION();
+
+ /* the rmr_tids array is initialized to all invalid by PageInit */
+ mm_page_init(page, MINMAX_PAGETYPE_REVMAP);
+ MarkBufferDirty(buf);
+
+ metadata->lastRevmapPage = mapBlk;
+ MarkBufferDirty(rmAccess->metaBuf);
+
+ if (RelationNeedsWAL(rmAccess->idxrel))
+ {
+ xl_minmax_revmap_extend xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata;
+
+ xlrec.node = rmAccess->idxrel->rd_node;
+ xlrec.targetBlk = mapBlk;
+
+ rdata.data = (char *) &xlrec;
+ rdata.len = SizeOfMinmaxRevmapExtend;
+ rdata.buffer = InvalidBuffer;
+ rdata.buffer_std = false;
+ rdata.next = NULL;
+
+ /* FIXME don't we need to log the metapage buffer also? */
+
+ recptr = XLogInsert(RM_MINMAX_ID, XLOG_MINMAX_REVMAP_EXTEND, &rdata);
+ PageSetLSN(metapage, recptr);
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_UNLOCK);
+
+ UnlockReleaseBuffer(buf);
+ }
*** a/src/backend/access/minmax/mmxlog.c
--- b/src/backend/access/minmax/mmxlog.c
***************
*** 279,285 **** minmax_xlog_revmap_extend(XLogRecPtr lsn, XLogRecord *record)
}
}
! /* Re-init the target block as a revmap page */
buf = XLogReadBuffer(xlrec->node, xlrec->targetBlk, true);
page = (Page) BufferGetPage(buf);
--- 279,288 ----
}
}
! /*
! * Re-init the target block as a revmap page. There's never a full-
! * page image here.
! */
buf = XLogReadBuffer(xlrec->node, xlrec->targetBlk, true);
page = (Page) BufferGetPage(buf);
***************
*** 288,297 **** minmax_xlog_revmap_extend(XLogRecPtr lsn, XLogRecord *record)
PageSetLSN(page, lsn);
MarkBufferDirty(buf);
- metadata->lastRevmapPage = xlrec->targetBlk;
- PageSetLSN(metapg, lsn);
- MarkBufferDirty(metabuf);
-
UnlockReleaseBuffer(buf);
UnlockReleaseBuffer(metabuf);
}
--- 291,296 ----
*** a/src/backend/access/rmgrdesc/minmaxdesc.c
--- b/src/backend/access/rmgrdesc/minmaxdesc.c
***************
*** 40,46 **** minmax_desc(StringInfo buf, XLogRecord *record)
appendStringInfo(buf, "insert(init): ");
else
appendStringInfo(buf, "insert: ");
! appendStringInfo(buf, "%u/%u/%u blk %u revmapBlk %u pagesPerRange %u TID (%u,%u)",
xlrec->node.spcNode, xlrec->node.dbNode,
xlrec->node.relNode,
xlrec->heapBlk, xlrec->revmapBlk,
--- 40,46 ----
appendStringInfo(buf, "insert(init): ");
else
appendStringInfo(buf, "insert: ");
! appendStringInfo(buf, "%u/%u/%u heapBlk %u revmapBlk %u pagesPerRange %u TID (%u,%u)",
xlrec->node.spcNode, xlrec->node.dbNode,
xlrec->node.relNode,
xlrec->heapBlk, xlrec->revmapBlk,
***************
*** 56,70 **** minmax_desc(StringInfo buf, XLogRecord *record)
appendStringInfo(buf, "update(init): ");
else
appendStringInfo(buf, "update: ");
! appendStringInfo(buf, "rel %u/%u/%u heapBlk %u revmapBlk %u pagesPerRange %u TID (%u,%u) old TID (%u,%u)",
xlrec->new.node.spcNode, xlrec->new.node.dbNode,
xlrec->new.node.relNode,
xlrec->new.heapBlk, xlrec->new.revmapBlk,
xlrec->new.pagesPerRange,
- ItemPointerGetBlockNumber(&xlrec->new.tid),
- ItemPointerGetOffsetNumber(&xlrec->new.tid),
ItemPointerGetBlockNumber(&xlrec->oldtid),
! ItemPointerGetOffsetNumber(&xlrec->oldtid));
}
else if (info == XLOG_MINMAX_SAMEPAGE_UPDATE)
{
--- 56,70 ----
appendStringInfo(buf, "update(init): ");
else
appendStringInfo(buf, "update: ");
! appendStringInfo(buf, "rel %u/%u/%u heapBlk %u revmapBlk %u pagesPerRange %u old TID (%u,%u) TID (%u,%u)",
xlrec->new.node.spcNode, xlrec->new.node.dbNode,
xlrec->new.node.relNode,
xlrec->new.heapBlk, xlrec->new.revmapBlk,
xlrec->new.pagesPerRange,
ItemPointerGetBlockNumber(&xlrec->oldtid),
! ItemPointerGetOffsetNumber(&xlrec->oldtid),
! ItemPointerGetBlockNumber(&xlrec->new.tid),
! ItemPointerGetOffsetNumber(&xlrec->new.tid));
}
else if (info == XLOG_MINMAX_SAMEPAGE_UPDATE)
{
***************
*** 76,112 **** minmax_desc(StringInfo buf, XLogRecord *record)
ItemPointerGetBlockNumber(&xlrec->tid),
ItemPointerGetOffsetNumber(&xlrec->tid));
}
! else if (info == XLOG_MINMAX_METAPG_SET)
! {
! xl_minmax_metapg_set *xlrec = (xl_minmax_metapg_set *) rec;
!
! appendStringInfo(buf, "metapg: rel %u/%u/%u array revmap idx %d block %u",
! xlrec->node.spcNode, xlrec->node.dbNode,
! xlrec->node.relNode,
! xlrec->blkidx, xlrec->newpg);
! }
! else if (info == XLOG_MINMAX_RMARRAY_SET)
{
! xl_minmax_rmarray_set *xlrec = (xl_minmax_rmarray_set *) rec;
! appendStringInfoString(buf, "revmap array: ");
! appendStringInfo(buf, "rel %u/%u/%u array pg %u revmap idx %d block %u",
xlrec->node.spcNode, xlrec->node.dbNode,
! xlrec->node.relNode,
! xlrec->rmarray,
! xlrec->blkidx, xlrec->newpg);
! }
! else if (info == XLOG_MINMAX_INIT_RMPG)
! {
! xl_minmax_init_rmpg *xlrec = (xl_minmax_init_rmpg *) rec;
!
! appendStringInfo(buf, "init_rmpg: rel %u/%u/%u blk %u",
! xlrec->node.spcNode, xlrec->node.dbNode,
! xlrec->node.relNode, xlrec->blkno);
! if (xlrec->array)
! appendStringInfoString(buf, " (array)");
! else
! appendStringInfo(buf, "(regular) logblk %u", xlrec->logblk);
}
else
appendStringInfo(buf, "UNKNOWN");
--- 76,88 ----
ItemPointerGetBlockNumber(&xlrec->tid),
ItemPointerGetOffsetNumber(&xlrec->tid));
}
! else if (info == XLOG_MINMAX_REVMAP_EXTEND)
{
! xl_minmax_revmap_extend *xlrec = (xl_minmax_revmap_extend *) rec;
! appendStringInfo(buf, "revmap extend: rel %u/%u/%u targetBlk %u",
xlrec->node.spcNode, xlrec->node.dbNode,
! xlrec->node.relNode, xlrec->targetBlk);
}
else
appendStringInfo(buf, "UNKNOWN");
*** a/src/include/access/minmax_internal.h
--- b/src/include/access/minmax_internal.h
***************
*** 21,31 ****
/*
* A MinmaxDesc is a struct designed to enable decoding a MinMax tuple from the
* on-disk format to a DeformedMMTuple and vice-versa.
- *
- * Note: we assume, for now, that the data stored for each column is the same
- * datatype as the indexed heap column. This restriction can be lifted by
- * having an Oid array pointer on the PerCol struct, where each member of the
- * array indicates the typid of the stored data.
*/
/* struct returned by "OpcInfo" amproc */
--- 21,26 ----
***************
*** 60,66 **** typedef struct MinmaxDesc
int md_totalstored;
/* per-column info */
! MinmaxOpcInfo *md_info[FLEXIBLE_ARRAY_MEMBER]; /* tupdesc->natts entries long */
} MinmaxDesc;
/*
--- 55,61 ----
int md_totalstored;
/* per-column info */
! MinmaxOpcInfo *md_info[FLEXIBLE_ARRAY_MEMBER]; /* md_tupdesc->natts entries long */
} MinmaxDesc;
/*
***************
*** 87,93 **** extern void minmax_free_mmdesc(MinmaxDesc *mmdesc);
extern void mm_page_init(Page page, uint16 type);
extern void mm_metapage_init(Page page, BlockNumber pagesPerRange,
uint16 version);
- extern bool mm_start_evacuating_page(Relation idxRel, Buffer buf);
- extern void mm_evacuate_page(Relation idxRel, Buffer buf);
#endif /* MINMAX_INTERNAL_H */
--- 82,86 ----
*** a/src/include/access/minmax_page.h
--- b/src/include/access/minmax_page.h
***************
*** 16,21 ****
--- 16,23 ----
#ifndef MINMAX_PAGE_H
#define MINMAX_PAGE_H
+ #include "storage/block.h"
+ #include "storage/itemptr.h"
/* special space on all minmax pages stores a "type" identifier */
#define MINMAX_PAGETYPE_META 0xF091
***************
*** 54,68 **** typedef struct MinmaxMetaPageData
/* Definitions for regular revmap pages */
typedef struct RevmapContents
{
! ItemPointerData rmr_tids[1]; /* really REGULAR_REVMAP_PAGE_MAXITEMS */
} RevmapContents;
! #define REGULAR_REVMAP_CONTENT_SIZE \
(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - \
offsetof(RevmapContents, rmr_tids) - \
MAXALIGN(sizeof(MinmaxSpecialSpace)))
/* max num of items in the array */
! #define REGULAR_REVMAP_PAGE_MAXITEMS \
! (REGULAR_REVMAP_CONTENT_SIZE / sizeof(ItemPointerData))
#endif /* MINMAX_PAGE_H */
--- 56,70 ----
/* Definitions for regular revmap pages */
typedef struct RevmapContents
{
! ItemPointerData rmr_tids[1]; /* really REVMAP_PAGE_MAXITEMS */
} RevmapContents;
! #define REVMAP_CONTENT_SIZE \
(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - \
offsetof(RevmapContents, rmr_tids) - \
MAXALIGN(sizeof(MinmaxSpecialSpace)))
/* max num of items in the array */
! #define REVMAP_PAGE_MAXITEMS \
! (REVMAP_CONTENT_SIZE / sizeof(ItemPointerData))
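+
+ /*
+ * With the default 8 kB block size this works out to on the order of a
+ * thousand item pointers per revmap page (ItemPointerData is 6 bytes; the
+ * exact count also depends on the page header and special space sizes), so
+ * each revmap page maps a correspondingly large number of page ranges.
+ */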
#endif /* MINMAX_PAGE_H */
*** /dev/null
--- b/src/include/access/minmax_pageops.h
***************
*** 0 ****
--- 1,29 ----
+ /*
+ * Prototypes for operating on minmax pages.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/minmax_pageops.h
+ */
+ #ifndef MINMAX_PAGEOPS_H
+ #define MINMAX_PAGEOPS_H
+
+ #include "access/minmax_revmap.h"
+
+ extern bool mm_doupdate(Relation idxrel, BlockNumber pagesPerRange,
+ mmRevmapAccess *rmAccess, BlockNumber heapBlk,
+ Buffer oldbuf, OffsetNumber oldoff,
+ const MMTuple *origtup, Size origsz,
+ const MMTuple *newtup, Size newsz,
+ bool samepage, bool *extended);
+ extern void mm_doinsert(Relation idxrel, BlockNumber pagesPerRange,
+ mmRevmapAccess *rmAccess, Buffer *buffer, BlockNumber heapBlk,
+ MMTuple *tup, Size itemsz, bool *extended);
+
+ extern bool mm_start_evacuating_page(Relation idxRel, Buffer buf);
+ extern void mm_evacuate_page(Relation idxRel, BlockNumber pagesPerRange,
+ mmRevmapAccess *rmAccess, Buffer buf);
+
+ #endif /* MINMAX_PAGEOPS_H */
*** a/src/include/access/minmax_revmap.h
--- b/src/include/access/minmax_revmap.h
***************
*** 13,18 ****
--- 13,19 ----
#include "access/minmax_tuple.h"
#include "storage/block.h"
+ #include "storage/buf.h"
#include "storage/itemptr.h"
#include "storage/off.h"
#include "utils/relcache.h"
***************
*** 24,30 **** extern mmRevmapAccess *mmRevmapAccessInit(Relation idxrel,
BlockNumber *pagesPerRange);
extern void mmRevmapAccessTerminate(mmRevmapAccess *rmAccess);
- extern void mmRevmapCreate(Relation idxrel);
extern Buffer mmLockRevmapPageForUpdate(mmRevmapAccess *rmAccess,
BlockNumber heapBlk);
extern void mmSetHeapBlockItemptr(Buffer rmbuf, BlockNumber pagesPerRange,
--- 25,30 ----
minmax-pageops.diff (text/x-diff; charset=us-ascii)
*** minmax.c.heikki 2014-08-20 19:06:27.000000000 -0400
--- src/backend/access/minmax/mmpageops.c 2014-08-20 17:10:55.000000000 -0400
***************
*** 1,8 ****
/*
* Update tuple origtup (size origsz), located in offset oldoff of buffer
* oldbuf, to newtup (size newsz) as summary tuple for the page range starting
* at heapBlk. If samepage is true, then attempt to put the new tuple in the same
! * page, otherwise get a new one.
*
* If the update is done, return true; the revmap is updated to point to the
* new tuple. If the update is not done for whatever reason, return false.
--- 1,37 ----
/*
+ * mmpageops.c
+ * Page-handling routines for Minmax indexes
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/minmax/mmpageops.c
+ */
+ #include "postgres.h"
+
+ #include "access/minmax_pageops.h"
+ #include "access/minmax_page.h"
+ #include "access/minmax_revmap.h"
+ #include "access/minmax_xlog.h"
+ #include "miscadmin.h"
+ #include "storage/bufmgr.h"
+ #include "storage/freespace.h"
+ #include "storage/lmgr.h"
+ #include "storage/smgr.h"
+ #include "utils/rel.h"
+
+
+ static Buffer mm_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
+ bool *was_extended);
+ static Size mm_page_get_freespace(Page page);
+
+
+ /*
* Update tuple origtup (size origsz), located in offset oldoff of buffer
* oldbuf, to newtup (size newsz) as summary tuple for the page range starting
* at heapBlk. If samepage is true, then attempt to put the new tuple in the same
! * page, otherwise use some other one.
*
* If the update is done, return true; the revmap is updated to point to the
* new tuple. If the update is not done for whatever reason, return false.
***************
*** 11,17 ****
* If the index had to be extended in the course of this operation, *extended
* is set to true.
*/
! static bool
mm_doupdate(Relation idxrel, BlockNumber pagesPerRange,
mmRevmapAccess *rmAccess, BlockNumber heapBlk,
Buffer oldbuf, OffsetNumber oldoff,
--- 40,46 ----
* If the index had to be extended in the course of this operation, *extended
* is set to true.
*/
! bool
mm_doupdate(Relation idxrel, BlockNumber pagesPerRange,
mmRevmapAccess *rmAccess, BlockNumber heapBlk,
Buffer oldbuf, OffsetNumber oldoff,
***************
*** 59,66 ****
oldsz = ItemIdGetLength(origlp);
oldtup = (MMTuple *) PageGetItem(oldpage, origlp);
! /* If both tuples are in fact equal, there is nothing to do */
! if (!minmax_tuples_equal(oldtup, oldsz, origtup, origsz))
{
LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
return false;
--- 88,99 ----
oldsz = ItemIdGetLength(origlp);
oldtup = (MMTuple *) PageGetItem(oldpage, origlp);
! /*
! * If both tuples are identical, there is nothing to do; except that if we
! * were requested to move the tuple across pages, we do it even if they are
! * equal.
! */
! if (samepage && minmax_tuples_equal(oldtup, oldsz, origtup, origsz))
{
LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
return false;
***************
*** 126,132 ****
{
/*
* Not enough space, but caller said that there was. Tell them to
! * start over
*/
LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
return false;
--- 159,165 ----
{
/*
* Not enough space, but caller said that there was. Tell them to
! * start over.
*/
LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
return false;
***************
*** 222,231 ****
* If the relation had to be extended to make room for the new index tuple,
* *extended is set to true.
*/
! static void
mm_doinsert(Relation idxrel, BlockNumber pagesPerRange,
! mmRevmapAccess *rmAccess, Buffer *buffer,
! BlockNumber heapBlk, MMTuple *tup, Size itemsz, bool *extended)
{
Page page;
BlockNumber blk;
--- 255,264 ----
* If the relation had to be extended to make room for the new index tuple,
* *extended is set to true.
*/
! void
mm_doinsert(Relation idxrel, BlockNumber pagesPerRange,
! mmRevmapAccess *rmAccess, Buffer *buffer, BlockNumber heapBlk,
! MMTuple *tup, Size itemsz, bool *extended)
{
Page page;
BlockNumber blk;
***************
*** 248,273 ****
*/
if (BufferIsValid(*buffer))
{
- page = BufferGetPage(*buffer);
- LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
-
/*
* It's possible that another backend (or ourselves!) extended the
* revmap over the page we held a pin on, so we cannot assume that
* it's still a regular page.
*/
! if (mm_page_get_freespace(page) < itemsz)
{
UnlockReleaseBuffer(*buffer);
*buffer = InvalidBuffer;
}
}
if (!BufferIsValid(*buffer))
{
*buffer = mm_getinsertbuffer(idxrel, InvalidBuffer, itemsz, extended);
Assert(BufferIsValid(*buffer));
! page = BufferGetPage(*buffer);
! Assert(mm_page_get_freespace(page) >= itemsz);
}
page = BufferGetPage(*buffer);
--- 281,304 ----
*/
if (BufferIsValid(*buffer))
{
/*
* It's possible that another backend (or ourselves!) extended the
* revmap over the page we held a pin on, so we cannot assume that
* it's still a regular page.
*/
! LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
! if (mm_page_get_freespace(BufferGetPage(*buffer)) < itemsz)
{
UnlockReleaseBuffer(*buffer);
*buffer = InvalidBuffer;
}
}
+
if (!BufferIsValid(*buffer))
{
*buffer = mm_getinsertbuffer(idxrel, InvalidBuffer, itemsz, extended);
Assert(BufferIsValid(*buffer));
! Assert(mm_page_get_freespace(BufferGetPage(*buffer)) >= itemsz);
}
page = BufferGetPage(*buffer);
***************
*** 327,336 ****
}
/*
! * Checks if a regular minmax index page is empty.
*
! * If it's not, it's marked for "evacuation", meaning that no new tuples will
! * be added to it.
*/
bool
mm_start_evacuating_page(Relation idxRel, Buffer buf)
--- 358,370 ----
}
/*
! * Initiate page evacuation protocol.
*
! * The page must be locked in exclusive mode by the caller.
! *
! * If the page is not yet initialized or empty, return false without doing
! * anything; it can be used for revmap without any further changes. If it
! * contains tuples, mark it for evacuation and return true.
*/
bool
mm_start_evacuating_page(Relation idxRel, Buffer buf)
***************
*** 355,361 ****
lp = PageGetItemId(page, off);
if (ItemIdIsUsed(lp))
{
! /* prevent other backends from adding more stuff to this page. */
special->flags |= MINMAX_EVACUATE_PAGE;
MarkBufferDirtyHint(buf, true);
--- 389,395 ----
lp = PageGetItemId(page, off);
if (ItemIdIsUsed(lp))
{
! /* prevent other backends from adding more stuff to this page */
special->flags |= MINMAX_EVACUATE_PAGE;
MarkBufferDirtyHint(buf, true);
***************
*** 368,387 ****
/*
* Move all tuples out of a page.
*
! * The caller must hold an exclusive lock on the page. The lock and pin are
! * released.
*/
void
! mm_evacuate_page(Relation idxRel, Buffer buf)
{
OffsetNumber off;
OffsetNumber maxoff;
MinmaxSpecialSpace *special;
Page page;
! mmRevmapAccess *rmAccess;
! BlockNumber pagesPerRange;
!
! rmAccess = mmRevmapAccessInit(idxRel, &pagesPerRange);
page = BufferGetPage(buf);
special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
--- 402,417 ----
/*
* Move all tuples out of a page.
*
! * The caller must hold lock on the page. The lock and pin are released.
*/
void
! mm_evacuate_page(Relation idxRel, BlockNumber pagesPerRange, mmRevmapAccess *rmAccess, Buffer buf)
{
OffsetNumber off;
OffsetNumber maxoff;
MinmaxSpecialSpace *special;
Page page;
! bool extended = false;
page = BufferGetPage(buf);
special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
***************
*** 394,407 ****
MMTuple *tup;
Size sz;
ItemId lp;
! bool extended = false;
lp = PageGetItemId(page, off);
if (ItemIdIsUsed(lp))
{
- tup = (MMTuple *) PageGetItem(page, lp);
sz = ItemIdGetLength(lp);
!
tup = minmax_copy_tuple(tup, sz);
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
--- 424,437 ----
MMTuple *tup;
Size sz;
ItemId lp;
!
! CHECK_FOR_INTERRUPTS();
lp = PageGetItemId(page, off);
if (ItemIdIsUsed(lp))
{
sz = ItemIdGetLength(lp);
! tup = (MMTuple *) PageGetItem(page, lp);
tup = minmax_copy_tuple(tup, sz);
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
***************
*** 412,429 ****
LockBuffer(buf, BUFFER_LOCK_SHARE);
- if (extended)
- IndexFreeSpaceMapVacuum(idxRel);
-
/* It's possible that someone extended the revmap over this page */
if (!MINMAX_IS_REGULAR_PAGE(page))
break;
}
}
- mmRevmapAccessTerminate(rmAccess);
-
UnlockReleaseBuffer(buf);
}
/*
--- 442,457 ----
LockBuffer(buf, BUFFER_LOCK_SHARE);
/* It's possible that someone extended the revmap over this page */
if (!MINMAX_IS_REGULAR_PAGE(page))
break;
}
}
UnlockReleaseBuffer(buf);
+
+ if (extended)
+ FreeSpaceMapVacuum(idxRel);
}
/*
***************
*** 467,472 ****
--- 495,502 ----
Buffer buf;
bool extensionLockHeld = false;
+ CHECK_FOR_INTERRUPTS();
+
if (newblk == InvalidBlockNumber)
{
/*
***************
*** 498,503 ****
--- 528,539 ----
buf = ReadBuffer(irel, newblk);
}
+ /*
+ * We lock the old buffer first, if it's earlier than the new one.
+ * We also need to check that it hasn't been turned into a revmap
+ * page concurrently; if we detect that it happened, give up and
+ * tell caller to start over.
+ */
if (BufferIsValid(oldbuf) && oldblk < newblk)
{
LockBuffer(oldbuf, BUFFER_LOCK_EXCLUSIVE);
***************
*** 520,529 ****
mm_page_init(page, MINMAX_PAGETYPE_REGULAR);
/*
! * We have a new buffer from FSM now, and both pages are locked.
! * Check that the new page has enough free space, and return it if it
! * does; otherwise start over. Note that we allow for the FSM to be
! * out of date here, and in that case we update it and move on.
*
* (mm_page_get_freespace also checks that the FSM didn't hand us a
* page that has since been repurposed for the revmap.)
--- 556,565 ----
mm_page_init(page, MINMAX_PAGETYPE_REGULAR);
/*
! * We have a new buffer from FSM now. Check that the new page has
! * enough free space, and return it if it does; otherwise start over.
! * Note that we allow for the FSM to be out of date here, and in that
! * case we update it and move on.
*
* (mm_page_get_freespace also checks that the FSM didn't hand us a
* page that has since been repurposed for the revmap.)
***************
*** 533,543 ****
{
if (extended)
*was_extended = true;
RelationSetTargetBlock(irel, BufferGetBlockNumber(buf));
! /* Lock the old buffer if not locked already */
! if (BufferIsValid(oldbuf) && newblk < oldblk)
LockBuffer(oldbuf, BUFFER_LOCK_EXCLUSIVE);
return buf;
}
--- 569,588 ----
{
if (extended)
*was_extended = true;
+
RelationSetTargetBlock(irel, BufferGetBlockNumber(buf));
! /*
! * Lock the old buffer if not locked already. Note that in this
! * case we know for sure it's a regular page: it's later than the
! * new page we just got, which is not a revmap page, and revmap
! * pages are always consecutive.
! */
! if (BufferIsValid(oldbuf) && oldblk > newblk)
! {
LockBuffer(oldbuf, BUFFER_LOCK_EXCLUSIVE);
+ Assert(MINMAX_IS_REGULAR_PAGE(BufferGetPage(oldbuf)));
+ }
return buf;
}
***************
*** 571,573 ****
--- 616,638 ----
newblk = RecordAndGetPageWithFreeSpace(irel, newblk, freespace, itemsz);
}
}
+
+ /*
+ * Return the amount of free space on a regular minmax index page.
+ *
+ * If the page is not a regular page, or has been marked with the
+ * MINMAX_EVACUATE_PAGE flag, returns 0.
+ */
+ static Size
+ mm_page_get_freespace(Page page)
+ {
+ MinmaxSpecialSpace *special;
+
+ special = (MinmaxSpecialSpace *) PageGetSpecialPointer(page);
+ if (!MINMAX_IS_REGULAR_PAGE(page) ||
+ (special->flags & MINMAX_EVACUATE_PAGE) != 0)
+ return 0;
+ else
+ return PageGetFreeSpace(page);
+
+ }
Here's version 18. I have renamed it: These are now BRIN indexes.
I have fixed numerous race conditions and deadlocks. In particular I
fixed this problem you noted:
Heikki Linnakangas wrote:
Another race condition:
If a new tuple is inserted to the range while summarization runs,
it's possible that the new tuple isn't included in the tuple that
the summarization calculated, nor does the insertion itself update
it.
I did it mostly in the way you outlined, i.e. by way of a placeholder
tuple that gets updated by concurrent inserters and then the tuple
resulting from the scan is unioned with the values in the updated
placeholder tuple. This required the introduction of one extra support
proc for opclasses (pretty simple stuff anyhow).
There should be only minor items left now, such as silencing the
WARNING: concurrent insert in progress within table "sales"
which is emitted by IndexBuildHeapScan (possibly thousands of times)
when doing a summarization of a range being inserted into or otherwise
modified. Basically the issue here is that IBHS assumes it's being run
with ShareLock on the heap (which blocks inserts), but here we're using
it with ShareUpdateExclusive only, which lets inserts in. There is no
harm AFAICS because of the placeholder tuple stuff I describe above.
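For the record, a setup along these lines should show that warning when the
INSERT and the VACUUM run concurrently in two sessions (the table and column
names are only illustrative; summarization of the new ranges happens during
vacuum):

  CREATE TABLE sales (sale_date date, amount numeric);
  CREATE INDEX sales_date_brin ON sales USING brin (sale_date);
  -- session 1: keep adding rows to the not-yet-summarized tail of the table
  INSERT INTO sales
    SELECT current_date, random() * 100 FROM generate_series(1, 1000000);
  -- session 2, while session 1 is still running:
  VACUUM sales;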
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
minmax-18.patch (text/x-diff)
*** a/contrib/pageinspect/Makefile
--- b/contrib/pageinspect/Makefile
***************
*** 1,7 ****
# contrib/pageinspect/Makefile
MODULE_big = pageinspect
! OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o $(WIN32RES)
EXTENSION = pageinspect
DATA = pageinspect--1.2.sql pageinspect--1.0--1.1.sql \
--- 1,7 ----
# contrib/pageinspect/Makefile
MODULE_big = pageinspect
! OBJS = rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o brinfuncs.o $(WIN32RES)
EXTENSION = pageinspect
DATA = pageinspect--1.2.sql pageinspect--1.0--1.1.sql \
*** /dev/null
--- b/contrib/pageinspect/brinfuncs.c
***************
*** 0 ****
--- 1,410 ----
+ /*
+ * brinfuncs.c
+ * Functions to investigate BRIN indexes
+ *
+ * Copyright (c) 2014, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/pageinspect/brinfuncs.c
+ */
+ #include "postgres.h"
+
+ #include "access/htup_details.h"
+ #include "access/brin.h"
+ #include "access/brin_internal.h"
+ #include "access/brin_page.h"
+ #include "access/brin_revmap.h"
+ #include "access/brin_tuple.h"
+ #include "catalog/index.h"
+ #include "catalog/pg_type.h"
+ #include "funcapi.h"
+ #include "lib/stringinfo.h"
+ #include "utils/array.h"
+ #include "utils/builtins.h"
+ #include "utils/lsyscache.h"
+ #include "utils/rel.h"
+ #include "miscadmin.h"
+
+
+ PG_FUNCTION_INFO_V1(brin_page_type);
+ PG_FUNCTION_INFO_V1(brin_page_items);
+ PG_FUNCTION_INFO_V1(brin_metapage_info);
+ PG_FUNCTION_INFO_V1(brin_revmap_data);
+
+ typedef struct brin_column_state
+ {
+ int nstored;
+ FmgrInfo outputFn[FLEXIBLE_ARRAY_MEMBER];
+ } brin_column_state;
+
+ typedef struct brin_page_state
+ {
+ BrinDesc *bdesc;
+ Page page;
+ OffsetNumber offset;
+ bool unusedItem;
+ bool done;
+ AttrNumber attno;
+ DeformedBrTuple *dtup;
+ brin_column_state *columns[FLEXIBLE_ARRAY_MEMBER];
+ } brin_page_state;
+
+
+ static Page verify_brin_page(bytea *raw_page, uint16 type,
+ const char *strtype);
+
+ Datum
+ brin_page_type(PG_FUNCTION_ARGS)
+ {
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Page page = VARDATA(raw_page);
+ BrinSpecialSpace *special;
+ char *type;
+
+ special = (BrinSpecialSpace *) PageGetSpecialPointer(page);
+
+ switch (special->type)
+ {
+ case BRIN_PAGETYPE_META:
+ type = "meta";
+ break;
+ case BRIN_PAGETYPE_REVMAP:
+ type = "revmap";
+ break;
+ case BRIN_PAGETYPE_REGULAR:
+ type = "regular";
+ break;
+ default:
+ type = psprintf("unknown (%02x)", special->type);
+ break;
+ }
+
+ PG_RETURN_TEXT_P(cstring_to_text(type));
+ }
+
+ /*
+ * Verify that the given bytea contains a BRIN page of the indicated page
+ * type, or die in the attempt. A pointer to the page is returned.
+ */
+ static Page
+ verify_brin_page(bytea *raw_page, uint16 type, const char *strtype)
+ {
+ Page page;
+ int raw_page_size;
+ BrinSpecialSpace *special;
+
+ raw_page_size = VARSIZE(raw_page) - VARHDRSZ;
+
+ if (raw_page_size < SizeOfPageHeaderData)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("input page too small"),
+ errdetail("Expected size %d, got %d", raw_page_size, BLCKSZ)));
+
+ page = VARDATA(raw_page);
+
+ /* verify the special space says this page is what we want */
+ special = (BrinSpecialSpace *) PageGetSpecialPointer(page);
+ if (special->type != type)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("page is not a BRIN page of type \"%s\"", strtype),
+ errdetail("Expected special type %08x, got %08x.",
+ type, special->type)));
+
+ return page;
+ }
+
+
+ /*
+ * Extract all item values from a BRIN index page
+ *
+ * Usage: SELECT * FROM brin_page_items(get_raw_page('idx', 1), 'idx'::regclass);
+ */
+ Datum
+ brin_page_items(PG_FUNCTION_ARGS)
+ {
+ brin_page_state *state;
+ FuncCallContext *fctx;
+
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ (errmsg("must be superuser to use raw page functions"))));
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Oid indexRelid = PG_GETARG_OID(1);
+ Page page;
+ TupleDesc tupdesc;
+ MemoryContext mctx;
+ Relation indexRel;
+ AttrNumber attno;
+
+ /* minimally verify the page we got */
+ page = verify_brin_page(raw_page, BRIN_PAGETYPE_REGULAR, "regular");
+
+ /* create a function context for cross-call persistence */
+ fctx = SRF_FIRSTCALL_INIT();
+
+ /* switch to memory context appropriate for multiple function calls */
+ mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ indexRel = index_open(indexRelid, AccessShareLock);
+
+ state = palloc(offsetof(brin_page_state, columns) +
+ sizeof(brin_column_state) * RelationGetDescr(indexRel)->natts);
+
+ state->bdesc = brin_build_desc(indexRel);
+ state->page = page;
+ state->offset = FirstOffsetNumber;
+ state->unusedItem = false;
+ state->done = false;
+ state->dtup = NULL;
+
+ for (attno = 1; attno <= state->bdesc->bd_tupdesc->natts; attno++)
+ {
+ Oid output;
+ bool isVarlena;
+ FmgrInfo *opcInfoFn;
+ BrinOpcInfo *opcinfo;
+ int i;
+ brin_column_state *column;
+
+ opcInfoFn = index_getprocinfo(indexRel, attno, BRIN_PROCNUM_OPCINFO);
+ opcinfo = (BrinOpcInfo *)
+ DatumGetPointer(FunctionCall1(opcInfoFn, InvalidOid));
+
+ column = palloc(offsetof(brin_column_state, outputFn) +
+ sizeof(FmgrInfo) * opcinfo->oi_nstored);
+
+ column->nstored = opcinfo->oi_nstored;
+ for (i = 0; i < opcinfo->oi_nstored; i++)
+ {
+ getTypeOutputInfo(opcinfo->oi_typids[i], &output, &isVarlena);
+ fmgr_info(output, &column->outputFn[i]);
+ }
+
+ state->columns[attno - 1] = column;
+ }
+
+ index_close(indexRel, AccessShareLock);
+
+ fctx->user_fctx = state;
+ fctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+ MemoryContextSwitchTo(mctx);
+ }
+
+ fctx = SRF_PERCALL_SETUP();
+ state = fctx->user_fctx;
+
+ if (!state->done)
+ {
+ HeapTuple result;
+ Datum values[5];
+ bool nulls[5];
+
+ /*
+ * This loop is called once for every attribute of every tuple in the
+ * page. At the start of a tuple, we get a NULL dtup; that's our
+ * signal for obtaining and decoding the next one. If that's not the
+ * case, we output the next attribute.
+ */
+ if (state->dtup == NULL)
+ {
+ BrTuple *tup;
+ MemoryContext mctx;
+ ItemId itemId;
+
+ /* deformed tuple must live across calls */
+ mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx);
+
+ /* verify item status: if there's no data, we can't decode */
+ itemId = PageGetItemId(state->page, state->offset);
+ if (ItemIdIsUsed(itemId))
+ {
+ tup = (BrTuple *) PageGetItem(state->page,
+ PageGetItemId(state->page,
+ state->offset));
+ state->dtup = brin_deform_tuple(state->bdesc, tup);
+ state->attno = 1;
+ state->unusedItem = false;
+ }
+ else
+ state->unusedItem = true;
+
+ MemoryContextSwitchTo(mctx);
+ }
+ else
+ state->attno++;
+
+ MemSet(nulls, 0, sizeof(nulls));
+
+ if (state->unusedItem)
+ {
+ values[0] = UInt16GetDatum(state->offset);
+ nulls[1] = true;
+ nulls[2] = true;
+ nulls[3] = true;
+ nulls[4] = true;
+ }
+ else
+ {
+ int att = state->attno - 1;
+
+ values[0] = UInt16GetDatum(state->offset);
+ values[1] = UInt16GetDatum(state->attno);
+ values[2] = BoolGetDatum(state->dtup->dt_columns[att].allnulls);
+ values[3] = BoolGetDatum(state->dtup->dt_columns[att].hasnulls);
+ if (!state->dtup->dt_columns[att].allnulls)
+ {
+ BrinValues *bvalues = &state->dtup->dt_columns[att];
+ StringInfoData s;
+ bool first;
+ int i;
+
+ initStringInfo(&s);
+ appendStringInfoChar(&s, '{');
+
+ first = true;
+ for (i = 0; i < state->columns[att]->nstored; i++)
+ {
+ char *val;
+
+ if (!first)
+ appendStringInfoString(&s, " .. ");
+ first = false;
+ val = OutputFunctionCall(&state->columns[att]->outputFn[i],
+ bvalues->values[i]);
+ appendStringInfoString(&s, val);
+ pfree(val);
+ }
+ appendStringInfoChar(&s, '}');
+
+ values[4] = CStringGetTextDatum(s.data);
+ pfree(s.data);
+ }
+ else
+ {
+ nulls[4] = true;
+ }
+ }
+
+ result = heap_form_tuple(fctx->tuple_desc, values, nulls);
+
+ /*
+ * If the item was unused, jump straight to the next one; otherwise,
+ * the only cleanup needed here is to set our signal to go to the next
+ * tuple in the following iteration, by freeing the current one.
+ */
+ if (state->unusedItem)
+ state->offset = OffsetNumberNext(state->offset);
+ else if (state->attno >= state->bdesc->bd_tupdesc->natts)
+ {
+ pfree(state->dtup);
+ state->dtup = NULL;
+ state->offset = OffsetNumberNext(state->offset);
+ }
+
+ /*
+ * If we're beyond the end of the page, set flag to end the function in
+ * the following iteration.
+ */
+ if (state->offset > PageGetMaxOffsetNumber(state->page))
+ state->done = true;
+
+ SRF_RETURN_NEXT(fctx, HeapTupleGetDatum(result));
+ }
+
+ brin_free_desc(state->bdesc);
+
+ SRF_RETURN_DONE(fctx);
+ }
+
+ Datum
+ brin_metapage_info(PG_FUNCTION_ARGS)
+ {
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ Page page;
+ BrinMetaPageData *meta;
+ TupleDesc tupdesc;
+ Datum values[4];
+ bool nulls[4];
+ HeapTuple htup;
+
+ page = verify_brin_page(raw_page, BRIN_PAGETYPE_META, "metapage");
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+ tupdesc = BlessTupleDesc(tupdesc);
+
+ /* Extract values from the metapage */
+ meta = (BrinMetaPageData *) PageGetContents(page);
+ MemSet(nulls, 0, sizeof(nulls));
+ values[0] = CStringGetTextDatum(psprintf("0x%08X", meta->brinMagic));
+ values[1] = Int32GetDatum(meta->brinVersion);
+ values[2] = Int32GetDatum(meta->pagesPerRange);
+ values[3] = Int64GetDatum(meta->lastRevmapPage);
+
+ htup = heap_form_tuple(tupdesc, values, nulls);
+
+ PG_RETURN_DATUM(HeapTupleGetDatum(htup));
+ }
+
+ /*
+ * Return the TID array stored in a BRIN revmap page
+ */
+ Datum
+ brin_revmap_data(PG_FUNCTION_ARGS)
+ {
+ struct
+ {
+ ItemPointerData *tids;
+ int idx;
+ } *state;
+ FuncCallContext *fctx;
+
+ if (!superuser())
+ ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ (errmsg("must be superuser to use raw page functions"))));
+
+ if (SRF_IS_FIRSTCALL())
+ {
+ bytea *raw_page = PG_GETARG_BYTEA_P(0);
+ MemoryContext mctx;
+ Page page;
+
+ /* minimally verify the page we got */
+ page = verify_brin_page(raw_page, BRIN_PAGETYPE_REVMAP, "revmap");
+
+ /* create a function context for cross-call persistence */
+ fctx = SRF_FIRSTCALL_INIT();
+
+ /* switch to memory context appropriate for multiple function calls */
+ mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx);
+
+ state = palloc(sizeof(*state));
+ state->tids = ((RevmapContents *) PageGetContents(page))->rm_tids;
+ state->idx = 0;
+
+ fctx->user_fctx = state;
+
+ MemoryContextSwitchTo(mctx);
+ }
+
+ fctx = SRF_PERCALL_SETUP();
+ state = fctx->user_fctx;
+
+ if (state->idx < REVMAP_PAGE_MAXITEMS)
+ SRF_RETURN_NEXT(fctx, PointerGetDatum(&state->tids[state->idx++]));
+
+ SRF_RETURN_DONE(fctx);
+ }
*** a/contrib/pageinspect/pageinspect--1.2.sql
--- b/contrib/pageinspect/pageinspect--1.2.sql
***************
*** 99,104 **** AS 'MODULE_PATHNAME', 'bt_page_items'
--- 99,141 ----
LANGUAGE C STRICT;
--
+ -- brin_page_type()
+ --
+ CREATE FUNCTION brin_page_type(IN page bytea)
+ RETURNS text
+ AS 'MODULE_PATHNAME', 'brin_page_type'
+ LANGUAGE C STRICT;
+
+ --
+ -- brin_metapage_info()
+ --
+ CREATE FUNCTION brin_metapage_info(IN page bytea, OUT magic text,
+ OUT version integer, OUT pagesperrange integer, OUT lastrevmappage bigint)
+ AS 'MODULE_PATHNAME', 'brin_metapage_info'
+ LANGUAGE C STRICT;
+
+ --
+ -- brin_page_items()
+ --
+ CREATE FUNCTION brin_page_items(IN page bytea, IN index_oid oid,
+ OUT itemoffset int,
+ OUT attnum int,
+ OUT allnulls bool,
+ OUT hasnulls bool,
+ OUT value text)
+ RETURNS SETOF record
+ AS 'MODULE_PATHNAME', 'brin_page_items'
+ LANGUAGE C STRICT;
+
+ --
+ -- brin_revmap_data()
+ CREATE FUNCTION brin_revmap_data(IN page bytea,
+ OUT pages tid)
+ RETURNS SETOF tid
+ AS 'MODULE_PATHNAME', 'brin_revmap_data'
+ LANGUAGE C STRICT;
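+
+ -- Example usage of the BRIN inspection functions above (the index name is
+ -- illustrative; block 0 is the metapage, block 1 the first revmap page):
+ --   SELECT brin_page_type(get_raw_page('brinidx', 0));
+ --   SELECT * FROM brin_metapage_info(get_raw_page('brinidx', 0));
+ --   SELECT * FROM brin_revmap_data(get_raw_page('brinidx', 1)) LIMIT 5;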
+
+ --
-- fsm_page_contents()
--
CREATE FUNCTION fsm_page_contents(IN page bytea)
*** a/contrib/pg_xlogdump/rmgrdesc.c
--- b/contrib/pg_xlogdump/rmgrdesc.c
***************
*** 8,13 ****
--- 8,14 ----
#define FRONTEND 1
#include "postgres.h"
+ #include "access/brin_xlog.h"
#include "access/clog.h"
#include "access/gin.h"
#include "access/gist_private.h"
*** /dev/null
--- b/doc/src/sgml/brin.sgml
***************
*** 0 ****
--- 1,264 ----
+ <!-- doc/src/sgml/brin.sgml -->
+
+ <chapter id="BRIN">
+ <title>BRIN Indexes</title>
+
+ <indexterm>
+ <primary>index</primary>
+ <secondary>BRIN</secondary>
+ </indexterm>
+
+ <sect1 id="brin-intro">
+ <title>Introduction</title>
+
+ <para>
+ <acronym>BRIN</acronym> stands for Block Range Index.
+ <acronym>BRIN</acronym> is designed for handling very large tables
+ in which certain columns have some natural correlation with their
+ physical position within the table. For example, a table storing orders
+ might have a column containing the date on which each order was placed,
+ and much of the time the earlier entries will appear earlier in the
+ table as well; or a table with a ZIP code column might have all codes
+ for a city grouped together naturally. For each block range, some
+ summary info is stored in the index.
+ </para>
+
+ <para>
+ <acronym>BRIN</acronym> indexes can satisfy queries via the bitmap
+ scanning facility only, and will return all tuples in all pages within
+ each range if the summary info stored by the index indicates that some
+ tuples in the range might match the given query conditions. The executor
+ is in charge of rechecking these tuples and discarding those that do not
+ match — in other words, these indexes are lossy.
+ This enables them to work as very fast sequential scan helpers to avoid
+ scanning blocks that are known not to contain matching tuples.
+ </para>
+
+ <para>
+ The specific data that a <acronym>BRIN</acronym> index will store
+ depends on the operator class selected for the data type.
+ Datatypes having a linear sort order can have operator classes that
+ store the minimum and maximum value within each block range, for instance;
+ geometrical types might store the common bounding box.
+ </para>
+
+ <para>
+ The size of the block range is determined at index creation time with
+ the <literal>pages_per_range</literal> storage parameter. The smaller the number, the
+ larger the index becomes (because of the need to store more index entries),
+ but at the same time the summary data stored can be more precise and
+ more data blocks can be skipped.
+ </para>
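+
+ <para>
+ For example, an index on the date column of the orders table mentioned
+ above might be created like this (the table and column names are
+ hypothetical):
+ <programlisting>
+ CREATE INDEX orders_order_date_idx ON orders USING brin (order_date)
+     WITH (pages_per_range = 32);
+
+ SELECT * FROM orders WHERE order_date BETWEEN '2014-01-01' AND '2014-02-01';
+ </programlisting>
+ Only block ranges whose stored minimum and maximum dates overlap the
+ requested interval need to be read by the query.
+ </para>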
+
+ <para>
+ The <acronym>BRIN</acronym> implementation in <productname>PostgreSQL</productname>
+ is primarily maintained by Álvaro Herrera.
+ </para>
+ </sect1>
+
+ <sect1 id="brin-builtin-opclasses">
+ <title>Built-in Operator Classes</title>
+
+ <para>
+ The core <productname>PostgreSQL</productname> distribution includes
+ the <acronym>BRIN</acronym> operator classes shown in
+ <xref linkend="brin-builtin-opclasses-table">.
+ </para>
+
+ <table id="brin-builtin-opclasses-table">
+ <title>Built-in <acronym>BRIN</acronym> Operator Classes</title>
+ <tgroup cols="3">
+ <thead>
+ <row>
+ <entry>Name</entry>
+ <entry>Indexed Data Type</entry>
+ <entry>Indexable Operators</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry><literal>char_minmax_ops</literal></entry>
+ <entry><type>"char"</type></entry>
+ <entry>
+ <literal>&lt;</literal>
+ <literal>&lt;=</literal>
+ <literal>=</literal>
+ <literal>&gt;=</literal>
+ <literal>&gt;</literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>date_minmax_ops</literal></entry>
+ <entry><type>date</type></entry>
+ <entry>
+ <literal>&lt;</literal>
+ <literal>&lt;=</literal>
+ <literal>=</literal>
+ <literal>&gt;=</literal>
+ <literal>&gt;</literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>int4_minmax_ops</literal></entry>
+ <entry><type>integer</type></entry>
+ <entry>
+ <literal>&lt;</literal>
+ <literal>&lt;=</literal>
+ <literal>=</literal>
+ <literal>&gt;=</literal>
+ <literal>&gt;</literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>numeric_minmax_ops</literal></entry>
+ <entry><type>numeric</type></entry>
+ <entry>
+ <literal>&lt;</literal>
+ <literal>&lt;=</literal>
+ <literal>=</literal>
+ <literal>&gt;=</literal>
+ <literal>&gt;</literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>text_minmax_ops</literal></entry>
+ <entry><type>text</type></entry>
+ <entry>
+ <literal>&lt;</literal>
+ <literal>&lt;=</literal>
+ <literal>=</literal>
+ <literal>&gt;=</literal>
+ <literal>&gt;</literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>time_minmax_ops</literal></entry>
+ <entry><type>time</type></entry>
+ <entry>
+ <literal>&lt;</literal>
+ <literal>&lt;=</literal>
+ <literal>=</literal>
+ <literal>&gt;=</literal>
+ <literal>&gt;</literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>timetz_minmax_ops</literal></entry>
+ <entry><type>time with time zone</type></entry>
+ <entry>
+ <literal>&lt;</literal>
+ <literal>&lt;=</literal>
+ <literal>=</literal>
+ <literal>&gt;=</literal>
+ <literal>&gt;</literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>timestamp_minmax_ops</literal></entry>
+ <entry><type>timestamp</type></entry>
+ <entry>
+ <literal>&lt;</literal>
+ <literal>&lt;=</literal>
+ <literal>=</literal>
+ <literal>&gt;=</literal>
+ <literal>&gt;</literal>
+ </entry>
+ </row>
+ <row>
+ <entry><literal>timestamptz_minmax_ops</literal></entry>
+ <entry><type>timestamp with time zone</type></entry>
+ <entry>
+ <literal>&lt;</literal>
+ <literal>&lt;=</literal>
+ <literal>=</literal>
+ <literal>&gt;=</literal>
+ <literal>&gt;</literal>
+ </entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+ </sect1>
+
+ <sect1 id="brin-extensibility">
+ <title>Extensibility</title>
+
+ <para>
+ The <acronym>BRIN</acronym> interface has a high level of abstraction,
+ requiring the access method implementer only to implement the semantics
+ of the data type being accessed. The <acronym>BRIN</acronym> layer
+ itself takes care of concurrency, logging and searching the index structure.
+ </para>
+
+ <para>
+ All it takes to get a <acronym>BRIN</acronym> access method working is to
+ implement a few user-defined methods, which define the behavior of
+ summary values stored in the index and the way they interact with
+ scan keys.
+ In short, <acronym>BRIN</acronym> combines
+ extensibility with generality, code reuse, and a clean interface.
+ </para>
+
+ <para>
+ There are three methods that an operator class for <acronym>BRIN</acronym>
+ must provide:
+
+ <variablelist>
+ <varlistentry>
+ <term><function>BrinOpcInfo *opcInfo(void)</></term>
+ <listitem>
+ <para>
+ Returns internal information about the indexed columns' summary data.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><function>bool consistent(BrinDesc *bdesc, DeformedBrTuple *dtuple,
+ ScanKey key)</function></term>
+ <listitem>
+ <para>
+ Returns whether the ScanKey is consistent with the given index tuple.
+ The attribute number to use is passed as part of the scan key.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><function>bool addValue(BrinDesc *bdesc, DeformedBrTuple *dtuple,
+ AttrNumber attno, Datum newval, bool isnull)</function></term>
+ <listitem>
+ <para>
+ Given an index tuple and an indexed value, modifies the indicated
+ attribute of the tuple so that it additionally represents the new value.
+ If any modification was done to the tuple, <literal>true</literal> is
+ returned.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><function>bool unionTuples(BrinDesc *bdesc, DeformedBrTuple *a,
+ DeformedBrTuple *b, AttrNumber attno)</function></term>
+ <listitem>
+ <para>
+ Consolidates two index tuples. Given two index tuples, modifies the
+ indicated attribute of the first of them so that it represents both tuples.
+ The second tuple is not modified.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+
+ <!-- this needs improvement ... -->
+ To implement these methods in a generic way, normally the opclass
+ defines its own internal support functions. For instance, minmax
+ opclasses add the support functions for the four inequality operators
+ for the datatype.
+ Additionally, the operator class must supply appropriate
+ operator entries,
+ to enable the optimizer to use the index when those operators are
+ used in queries.
+ </para>
+ </sect1>
+ </chapter>
*** a/doc/src/sgml/filelist.sgml
--- b/doc/src/sgml/filelist.sgml
***************
*** 87,92 ****
--- 87,93 ----
<!ENTITY gist SYSTEM "gist.sgml">
<!ENTITY spgist SYSTEM "spgist.sgml">
<!ENTITY gin SYSTEM "gin.sgml">
+ <!ENTITY brin SYSTEM "brin.sgml">
<!ENTITY planstats SYSTEM "planstats.sgml">
<!ENTITY indexam SYSTEM "indexam.sgml">
<!ENTITY nls SYSTEM "nls.sgml">
*** a/doc/src/sgml/indices.sgml
--- b/doc/src/sgml/indices.sgml
***************
*** 116,122 **** CREATE INDEX test1_id_index ON test1 (id);
<para>
<productname>PostgreSQL</productname> provides several index types:
! B-tree, Hash, GiST, SP-GiST and GIN. Each index type uses a different
algorithm that is best suited to different types of queries.
By default, the <command>CREATE INDEX</command> command creates
B-tree indexes, which fit the most common situations.
--- 116,123 ----
<para>
<productname>PostgreSQL</productname> provides several index types:
! B-tree, Hash, GiST, SP-GiST, GIN and BRIN.
! Each index type uses a different
algorithm that is best suited to different types of queries.
By default, the <command>CREATE INDEX</command> command creates
B-tree indexes, which fit the most common situations.
***************
*** 326,331 **** SELECT * FROM places ORDER BY location <-> point '(101,456)' LIMIT 10;
--- 327,365 ----
classes are available in the <literal>contrib</> collection or as separate
projects. For more information see <xref linkend="GIN">.
</para>
+
+ <para>
+ <indexterm>
+ <primary>index</primary>
+ <secondary>BRIN</secondary>
+ </indexterm>
+ <indexterm>
+ <primary>BRIN</primary>
+ <see>index</see>
+ </indexterm>
+ BRIN indexes (a shorthand for Block Range indexes)
+ store summaries about the values stored in consecutive physical block ranges of a table.
+ Like GiST, SP-GiST and GIN,
+ BRIN can support many different indexing strategies,
+ and the particular operators with which a BRIN index can be used
+ vary depending on the indexing strategy.
+ For datatypes that have a linear sort order, the indexed data
+ corresponds to the minimum and maximum of the
+ values in the column for each block range,
+ which supports indexed queries using these operators:
+
+ <simplelist>
+ <member><literal>&lt;</literal></member>
+ <member><literal>&lt;=</literal></member>
+ <member><literal>=</literal></member>
+ <member><literal>&gt;=</literal></member>
+ <member><literal>&gt;</literal></member>
+ </simplelist>
+
+ The BRIN operator classes included in the standard distribution are
+ documented in <xref linkend="brin-builtin-opclasses-table">.
+ For more information see <xref linkend="BRIN">.
+ </para>
</sect1>
*** a/doc/src/sgml/postgres.sgml
--- b/doc/src/sgml/postgres.sgml
***************
*** 247,252 ****
--- 247,253 ----
&gist;
&spgist;
&gin;
+ &brin;
&storage;
&bki;
&planstats;
*** a/src/backend/access/Makefile
--- b/src/backend/access/Makefile
***************
*** 8,13 **** subdir = src/backend/access
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
! SUBDIRS = common gin gist hash heap index nbtree rmgrdesc spgist transam
include $(top_srcdir)/src/backend/common.mk
--- 8,13 ----
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
! SUBDIRS = brin common gin gist hash heap index nbtree rmgrdesc spgist transam
include $(top_srcdir)/src/backend/common.mk
*** /dev/null
--- b/src/backend/access/brin/Makefile
***************
*** 0 ****
--- 1,18 ----
+ #-------------------------------------------------------------------------
+ #
+ # Makefile--
+ # Makefile for access/brin
+ #
+ # IDENTIFICATION
+ # src/backend/access/brin/Makefile
+ #
+ #-------------------------------------------------------------------------
+
+ subdir = src/backend/access/brin
+ top_builddir = ../../../..
+ include $(top_builddir)/src/Makefile.global
+
+ OBJS = brin.o brpageops.o brrevmap.o brtuple.o brxlog.o \
+ brin_minmax.o
+
+ include $(top_srcdir)/src/backend/common.mk
*** /dev/null
--- b/src/backend/access/brin/README
***************
*** 0 ****
--- 1,179 ----
+ Block Range Indexes (BRIN)
+ ==========================
+
+ BRIN indexes are intended to enable very fast scanning of extremely large tables.
+
+ The essential idea of a BRIN index is to keep track of summarizing values in
+ consecutive groups of heap pages (page ranges); for example, the minimum and
+ maximum values for datatypes with a btree opclass, or the bounding box for
+ geometric types. These values can be used to avoid scanning such pages,
+ depending on query quals.
+
+ The cost of this is having to update the stored summary values of each
+ page range as tuples are inserted into them.
+
+ Access Method Design
+ --------------------
+
+ Since item pointers are not stored inside indexes of this type, it is not
+ possible to support the amgettuple interface. Instead, we only provide
+ amgetbitmap support; scanning a relation using this index always requires a
+ recheck node on top. The amgetbitmap routine returns a TIDBitmap comprising
+ all pages in those page groups that match the query qualifications. The
+ recheck node prunes tuples that are not visible according to the query
+ qualifications.
+
+ For each supported datatype, we need an operator class with the following
+ catalog entries:
+
+ - support procedures (pg_amproc):
+ * "opcinfo" (procno 1) initializes a structure for index creation or scanning
+ * "addValue" (procno 2) takes an index tuple and a heap item, and possibly
+ changes the index tuple so that it includes the heap item values
+ * "consistent" (procno 3) takes an index tuple and query quals, and returns
+ whether the index tuple values match the query quals.
+ * "union" (procno 4) takes two index tuples and modifies the first one so that
+ it represents the union of the two.
+ * For minmax, proc numbers 5-8 are used for the functions implementing
+ inequality operators for the type, in this order: less than, less or equal,
+ greater or equal, greater than. Opclasses using a different design will
+ require different additional procedure numbers.
+ - support operators (pg_amop): for minmax, the same operators as btree (<=, <,
+ =, >=, >) so that the index is chosen by the optimizer on queries.
+
+ In each index tuple (corresponding to one page range), we store:
+ - for each indexed column of a datatype with a btree-opclass:
+ * minimum value across all tuples in the range
+ * maximum value across all tuples in the range
+ * are there nulls present in any tuple?
+ * are all the values null in all tuples in the range?
+
+ Different datatypes store other values instead of min/max, for example
+ geometric types might store a bounding box. The NULL bits are always present.
+
+ These null bits are stored in a single null bitmask of length 2x number of
+ columns.
+
+ With the default INDEX_MAX_KEYS of 32, and considering columns of 8-byte length
+ types such as timestamptz or bigint, each tuple would be 522 bytes in length,
+ which seems reasonable. There are 6 extra bytes for padding between the null
+ bitmask and the first data item, assuming 64-bit alignment; so the total size
+ for such an index tuple would actually be 528 bytes.
+
+ This maximum index tuple size is calculated as: mt_info (2 bytes) + null bitmap
+ (8 bytes) + data value (8 bytes) * 32 * 2
+
+ (Of course, larger columns are possible, such as varchar, but creating BRIN
+ indexes on such columns seems of little practical usefulness. Also, the
+ usefulness of an index containing so many columns is dubious.)
+
+ There can be gaps where some pages have no covering index entry.
+
+ The Range Reverse Map
+ ---------------------
+
+ To find out the index tuple for a particular page range, we have an internal
+ structure we call the range reverse map. This stores one TID per page range,
+ which is the address of the index tuple summarizing that range. Since these
+ map entries are fixed size, it is possible to compute the address of the range
+ map entry for any given heap page by simple arithmetic.
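+
+ (As a sketch of that arithmetic, not necessarily the exact macros used in the
+ code: the entry for heap block N lives at revmap position N / pages_per_range;
+ with the metapage at block 0 and the revmap immediately following it, that is
+ revmap page 1 + (N / pages_per_range) / REVMAP_PAGE_MAXITEMS, at offset
+ (N / pages_per_range) % REVMAP_PAGE_MAXITEMS within that page.)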
+
+ When a new heap tuple is inserted in a summarized page range, we compare the
+ existing index tuple with the new heap tuple. If the heap tuple is outside the
+ summarization data given by the index tuple for any indexed column (or if the
+ new heap tuple contains null values but the index tuple indicate there are no
+ nulls), it is necessary to create a new index tuple with the new values. To do
+ this, a new index tuple is inserted, and the reverse range map is updated to
+ point to it; the old index tuple is removed.
+
+ If the reverse range map points to an invalid TID, the corresponding page range
+ is considered to be not summarized. When tuples are added to unsummarized
+ pages, nothing needs to happen.
+
+ To scan a table using a BRIN index, we scan the reverse range map
+ sequentially. This yields index tuples in ascending page range order. Query
+ quals are matched to each index tuple; if they match, each page within the page
+ range is returned as part of the output TID bitmap. If there's no match, they
+ are skipped. Reverse range map entries returning invalid index TIDs, that is
+ unsummarized page ranges, are also returned in the TID bitmap.
+
+ The revmap is stored in the first few blocks of the index main fork, immediately
+ following the metapage. Whenever the revmap needs to be extended by another
+ page, existing tuples in that page are moved to some other page.
+
+ Heap tuples can be removed from anywhere without restriction. It might be
+ useful to mark the corresponding index tuple somehow, if the heap tuple is one
+ of the constraining values of the summary data (i.e. either min or max in the
+ case of a btree-opclass-bearing datatype), so that in the future we are aware
+ of the need to re-execute summarization on that range, leading to a possible
+ tightening of the summary values.
+
+ Summarization
+ -------------
+
+ At index creation time, the whole table is scanned; for each page range the
+ summarizing values of each indexed column and nulls bitmap are collected and
+ stored in the index.
+
+ Once in a while, it is necessary to summarize a bunch of unsummarized pages
+ (because the table has grown since the index was created), or re-summarize a
+ range that has been marked invalid. This is simple: scan the page range
+ calculating the summary values for each indexed column, then insert the new
+ index entry at the end of the index. We do this during vacuum.
+
+ Vacuuming
+ ---------
+
+ Vacuuming a table that has a BRIN index does not represent a significant
+ challenge. Since no heap TIDs are stored, it's not necessary to scan the index
+ when heap tuples are removed. It might be that some summary values can be
+ tightened if heap tuples have been deleted; but this would represent an
+ optimization opportunity only, not a correctness issue. It's simpler to
+ represent this as the need to re-run summarization on the affected page
+ range rather than "subtracting" values from the existing one.
+
+ Note that if there are no indexes on the table other than the BRIN index,
+ usage of maintenance_work_mem by vacuum can be decreased significantly, because
+ no detailed index scan needs to take place (and thus it's not necessary for
+ vacuum to save TIDs to remove). It's unlikely that BRIN would be the only
+ index on a table, though, because primary keys can only be btrees.
+
+
+ Optimizer
+ ---------
+
+ In order to make this all work, the only thing we need to do is ensure we have a
+ good enough opclass and amcostestimate. With this, the optimizer is able to
+ choose the index on its own.
+
+
+ Open questions
+ --------------
+
+ * Same-size page ranges?
+ Current related literature seems to consider that each "index entry" in a
+ BRIN index must cover the same number of pages. There doesn't seem to be a
+ hard reason for this to be so; it might make sense to allow the index to
+ self-tune so that some index entries cover smaller page ranges, if this allows
+ the summary values to be more compact. This would incur larger BRIN
+ overhead for the index itself, but might allow better pruning of page ranges
+ during scan. In the limit of one index tuple per page, the index itself would
+ occupy too much space, even though we would be able to skip reading most
+ heap pages, because the summary values are tight; in the opposite limit of
+ a single tuple that summarizes the whole table, we wouldn't be able to prune
+ anything even though the index is very small. This can probably be made to work
+ by using the reverse range map as an index in itself.
+
+ * More compact representation for TIDBitmap?
+ TIDBitmap is the structure used to represent bitmap scans. The
+ representation of lossy page ranges is not optimal for our purposes, because
+ it uses a Bitmapset to represent pages in the range; since we're going to return
+ all pages in a large range, it might be more convenient to allow for a
+ struct that uses start and end page numbers to represent the range, instead.
+
+ * Better vacuuming?
+ It might be useful to enable passing more useful info to BRIN indexes during
+ vacuuming about tuples that are deleted, i.e. do not require the callback to
+ pass each tuple's TID. For instance we might need a callback that passes a
+ block number instead. That would help determine when to re-run summarization
+ on blocks that have seen lots of tuple deletions.
*** /dev/null
--- b/src/backend/access/brin/brin.c
***************
*** 0 ****
--- 1,1095 ----
+ /*
+ * brin.c
+ * Implementation of BRIN indexes for Postgres
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/brin/brin.c
+ *
+ * TODO
+ * * ScalarArrayOpExpr (amsearcharray -> SK_SEARCHARRAY)
+ * * add support for unlogged indexes
+ * * ditto expressional indexes
+ */
+ #include "postgres.h"
+
+ #include "access/brin.h"
+ #include "access/brin_internal.h"
+ #include "access/brin_page.h"
+ #include "access/brin_pageops.h"
+ #include "access/brin_xlog.h"
+ #include "access/reloptions.h"
+ #include "access/relscan.h"
+ #include "catalog/index.h"
+ #include "miscadmin.h"
+ #include "pgstat.h"
+ #include "storage/bufmgr.h"
+ #include "storage/freespace.h"
+ #include "utils/memutils.h"
+ #include "utils/rel.h"
+
+
+ /*
+ * We use a BrinBuildState during initial construction of a BRIN index.
+ * The running state is kept in a DeformedBrTuple.
+ */
+ typedef struct BrinBuildState
+ {
+ Relation irel;
+ int numtuples;
+ Buffer currentInsertBuf;
+ BlockNumber pagesPerRange;
+ BlockNumber currRangeStart;
+ brinRmAccess *bs_rmAccess;
+ BrinDesc *bs_bdesc;
+ bool seentup;
+ bool extended;
+ DeformedBrTuple *dtuple;
+ } BrinBuildState;
+
+ /*
+ * Struct used as "opaque" during index scans
+ */
+ typedef struct BrinOpaque
+ {
+ BlockNumber bo_pagesPerRange;
+ brinRmAccess *bo_rmAccess;
+ BrinDesc *bo_bdesc;
+ } BrinOpaque;
+
+ static BrinBuildState *initialize_brin_buildstate(Relation idxRel,
+ brinRmAccess *rmAccess, BlockNumber pagesPerRange);
+ static bool terminate_brin_buildstate(BrinBuildState *state);
+ static void summarize_range(IndexInfo *indexInfo, BrinBuildState *state, Relation heapRel,
+ BlockNumber heapBlk);
+ static void form_and_insert_tuple(BrinBuildState *state);
+ static void union_tuples(BrinDesc *bdesc, DeformedBrTuple *a,
+ BrTuple *b);
+
+
+ /*
+ * A tuple in the heap is being inserted. To keep a brin index up to date,
+ * we need to obtain the relevant index tuple, compare its stored values with
+ * those of the new tuple; if the tuple values are consistent with the summary
+ * tuple, there's nothing to do; otherwise we need to update the index.
+ *
+ * If the range is not currently summarized (i.e. the revmap returns InvalidTid
+ * for it), there's nothing to do either.
+ */
+ Datum
+ brininsert(PG_FUNCTION_ARGS)
+ {
+ Relation idxRel = (Relation) PG_GETARG_POINTER(0);
+ Datum *values = (Datum *) PG_GETARG_POINTER(1);
+ bool *nulls = (bool *) PG_GETARG_POINTER(2);
+ ItemPointer heaptid = (ItemPointer) PG_GETARG_POINTER(3);
+
+ /* we ignore the rest of our arguments */
+ BlockNumber pagesPerRange;
+ BrinDesc *bdesc = NULL;
+ brinRmAccess *rmAccess;
+ Buffer buf = InvalidBuffer;
+ bool extended = false;
+
+ rmAccess = brinRevmapAccessInit(idxRel, &pagesPerRange);
+
+ for (;;)
+ {
+ bool need_insert = false;
+ OffsetNumber off;
+ BrTuple *brtup;
+ DeformedBrTuple *dtup;
+ BlockNumber heapBlk;
+ int keyno;
+
+ CHECK_FOR_INTERRUPTS();
+
+ heapBlk = ItemPointerGetBlockNumber(heaptid);
+ /* normalize the block number to be the first block in the range */
+ heapBlk = (heapBlk / pagesPerRange) * pagesPerRange;
+ brtup = brinGetTupleForHeapBlock(rmAccess, heapBlk, &buf, &off, NULL,
+ BUFFER_LOCK_SHARE);
+
+ /* if range is unsummarized, there's nothing to do */
+ if (!brtup)
+ break;
+
+ if (bdesc == NULL)
+ bdesc = brin_build_desc(idxRel);
+ dtup = brin_deform_tuple(bdesc, brtup);
+
+ /*
+ * Compare the key values of the new tuple to the stored index values;
+ * our deformed tuple will get updated if the new tuple doesn't fit
+ * the original range (note this means we can't break out of the loop
+ * early). Make a note of whether this happens, so that we know to
+ * insert the modified tuple later.
+ */
+ for (keyno = 0; keyno < bdesc->bd_tupdesc->natts; keyno++)
+ {
+ Datum result;
+ FmgrInfo *addValue;
+
+ addValue = index_getprocinfo(idxRel, keyno + 1,
+ BRIN_PROCNUM_ADDVALUE);
+ result = FunctionCall5Coll(addValue,
+ idxRel->rd_indcollation[keyno],
+ PointerGetDatum(bdesc),
+ PointerGetDatum(dtup),
+ UInt16GetDatum(keyno + 1),
+ values[keyno],
+ nulls[keyno]);
+ /* if that returned true, we need to insert the updated tuple */
+ need_insert |= DatumGetBool(result);
+ }
+
+ if (!need_insert)
+ {
+ /*
+ * The tuple is consistent with the new values, so there's nothing
+ * to do.
+ */
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ brin_free_dtuple(bdesc, dtup);
+ }
+ else
+ {
+ Page page = BufferGetPage(buf);
+ ItemId lp = PageGetItemId(page, off);
+ Size origsz;
+ BrTuple *origtup;
+ Size newsz;
+ BrTuple *newtup;
+ bool samepage;
+
+ /*
+ * Make a copy of the old tuple, so that we can compare it after
+ * re-acquiring the lock.
+ */
+ origsz = ItemIdGetLength(lp);
+ origtup = brin_copy_tuple(brtup, origsz);
+
+ /*
+ * Before releasing the lock, check if we can attempt a same-page
+ * update. Another process could insert a tuple concurrently in
+ * the same page though, so downstream we must be prepared to cope
+ * if this turns out to not be possible after all.
+ */
+ newtup = brin_form_tuple(bdesc, heapBlk, dtup, &newsz);
+ samepage = brin_can_do_samepage_update(buf, origsz, newsz);
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ brin_free_dtuple(bdesc, dtup);
+
+ /*
+ * Try to update the tuple. If this doesn't work for whatever
+ * reason, we need to restart from the top; the revmap might be
+ * pointing at a different tuple for this block now, so we need to
+ * recompute to ensure both our new heap tuple and the other
+ * inserter's are covered by the combined tuple. It might be that
+ * we don't need to update at all.
+ */
+ if (!brin_doupdate(idxRel, pagesPerRange, rmAccess, heapBlk,
+ buf, off, origtup, origsz, newtup, newsz,
+ samepage, &extended))
+ {
+ brin_free_tuple(newtup);
+ brin_free_tuple(origtup);
+ /* start over */
+ continue;
+ }
+ }
+
+ /* success! */
+ break;
+ }
+
+ brinRevmapAccessTerminate(rmAccess);
+ if (BufferIsValid(buf))
+ ReleaseBuffer(buf);
+ if (bdesc != NULL)
+ brin_free_desc(bdesc);
+ if (extended)
+ FreeSpaceMapVacuum(idxRel);
+
+ return BoolGetDatum(false);
+ }
+
+ /*
+ * Initialize state for a BRIN index scan.
+ *
+ * We read the metapage here to determine the pages-per-range number that this
+ * index was built with. Note that since this cannot be changed while we're
+ * holding lock on index, it's not necessary to recompute it during brinrescan.
+ */
+ Datum
+ brinbeginscan(PG_FUNCTION_ARGS)
+ {
+ Relation r = (Relation) PG_GETARG_POINTER(0);
+ int nkeys = PG_GETARG_INT32(1);
+ int norderbys = PG_GETARG_INT32(2);
+ IndexScanDesc scan;
+ BrinOpaque *opaque;
+
+ scan = RelationGetIndexScan(r, nkeys, norderbys);
+
+ opaque = (BrinOpaque *) palloc(sizeof(BrinOpaque));
+ opaque->bo_rmAccess = brinRevmapAccessInit(r, &opaque->bo_pagesPerRange);
+ opaque->bo_bdesc = brin_build_desc(r);
+ scan->opaque = opaque;
+
+ PG_RETURN_POINTER(scan);
+ }
+
+ /*
+ * Execute the index scan.
+ *
+ * This works by reading index TIDs from the revmap, and obtaining the index
+ * tuples pointed to by them; the summary values in the index tuples are
+ * compared to the scan keys. We return into the TID bitmap all the pages in
+ * ranges corresponding to index tuples that match the scan keys.
+ *
+ * If a TID from the revmap is read as InvalidTID, we know that range is
+ * unsummarized. Pages in those ranges need to be returned regardless of scan
+ * keys.
+ *
+ * XXX see _bt_first on what to do about sk_subtype.
+ */
+ Datum
+ bringetbitmap(PG_FUNCTION_ARGS)
+ {
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+ TIDBitmap *tbm = (TIDBitmap *) PG_GETARG_POINTER(1);
+ Relation idxRel = scan->indexRelation;
+ Buffer buf = InvalidBuffer;
+ BrinDesc *bdesc;
+ Oid heapOid;
+ Relation heapRel;
+ BrinOpaque *opaque;
+ BlockNumber nblocks;
+ BlockNumber heapBlk;
+ int totalpages = 0;
+ int keyno;
+ FmgrInfo *consistentFn;
+
+ opaque = (BrinOpaque *) scan->opaque;
+ bdesc = opaque->bo_bdesc;
+ pgstat_count_index_scan(idxRel);
+
+ /*
+ * XXX We need to know the size of the table so that we know how long to
+ * iterate on the revmap. There's room for improvement here, in that we
+ * could have the revmap tell us when to stop iterating.
+ */
+ heapOid = IndexGetRelation(RelationGetRelid(idxRel), false);
+ heapRel = heap_open(heapOid, AccessShareLock);
+ nblocks = RelationGetNumberOfBlocks(heapRel);
+ heap_close(heapRel, AccessShareLock);
+
+ /*
+ * Obtain consistent functions for all indexed columns. Maybe it'd be
+ * possible to do this lazily only the first time we see a scan key that
+ * involves each particular attribute.
+ */
+ consistentFn = palloc(sizeof(FmgrInfo) * bdesc->bd_tupdesc->natts);
+ for (keyno = 0; keyno < bdesc->bd_tupdesc->natts; keyno++)
+ {
+ FmgrInfo *tmp;
+
+ tmp = index_getprocinfo(idxRel, keyno + 1, BRIN_PROCNUM_CONSISTENT);
+ fmgr_info_copy(&consistentFn[keyno], tmp, CurrentMemoryContext);
+ }
+
+ /*
+ * Now scan the revmap. We start by querying for heap page 0,
+ * incrementing by the number of pages per range; this gives us a full
+ * view of the table.
+ */
+ for (heapBlk = 0; heapBlk < nblocks; heapBlk += opaque->bo_pagesPerRange)
+ {
+ bool addrange;
+ BrTuple *tup;
+ OffsetNumber off;
+ Size size;
+
+ CHECK_FOR_INTERRUPTS();
+
+ tup = brinGetTupleForHeapBlock(opaque->bo_rmAccess, heapBlk, &buf,
+ &off, &size, BUFFER_LOCK_SHARE);
+ if (tup)
+ {
+ tup = brin_copy_tuple(tup, size);
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ }
+
+ /*
+ * For page ranges with no indexed tuple, we must return the whole
+ * range; otherwise, compare it to the scan keys.
+ */
+ if (tup == NULL)
+ {
+ addrange = true;
+ }
+ else
+ {
+ DeformedBrTuple *dtup;
+ int keyno;
+
+ dtup = brin_deform_tuple(bdesc, tup);
+ if (dtup->dt_placeholder)
+ {
+ /*
+ * Placeholder tuples are always returned, regardless of the
+ * values stored in them.
+ */
+ addrange = true;
+ }
+ else
+ {
+ /*
+ * Compare scan keys with summary values stored for the range.
+ * If scan keys are matched, the page range must be added to
+ * the bitmap. We initially assume the range needs to be
+ * added; in particular this serves the case where there are
+ * no keys.
+ */
+ addrange = true;
+ for (keyno = 0; keyno < scan->numberOfKeys; keyno++)
+ {
+ ScanKey key = &scan->keyData[keyno];
+ AttrNumber keyattno = key->sk_attno;
+ Datum add;
+
+ /*
+ * The collation of the scan key must match the collation
+ * used in the index column. Otherwise we shouldn't be
+ * using this index ...
+ */
+ Assert(key->sk_collation ==
+ bdesc->bd_tupdesc->attrs[keyattno - 1]->attcollation);
+
+ /*
+ * Check whether the scan key is consistent with the page
+ * range values; if so, have the pages in the range added
+ * to the output bitmap.
+ *
+ * When there are multiple scan keys, failure to meet the
+ * criteria for a single one of them is enough to discard
+ * the range as a whole, so break out of the loop as soon
+ * as a false return value is obtained.
+ */
+ add = FunctionCall3Coll(&consistentFn[keyattno - 1],
+ key->sk_collation,
+ PointerGetDatum(bdesc),
+ PointerGetDatum(dtup),
+ PointerGetDatum(key));
+ addrange = DatumGetBool(add);
+ if (!addrange)
+ break;
+ }
+ }
+
+ brin_free_tuple(tup);
+ brin_free_dtuple(bdesc, dtup);
+ }
+
+ /* add the pages in the range to the output bitmap, if needed */
+ if (addrange)
+ {
+ BlockNumber pageno;
+
+ for (pageno = heapBlk;
+ pageno <= heapBlk + opaque->bo_pagesPerRange - 1;
+ pageno++)
+ {
+ tbm_add_page(tbm, pageno);
+ totalpages++;
+ }
+ }
+ }
+
+ if (buf != InvalidBuffer)
+ ReleaseBuffer(buf);
+
+ /*
+ * XXX We have an approximation of the number of *pages* that our scan
+ * returns, but we don't have a precise idea of the number of heap tuples
+ * involved.
+ */
+ PG_RETURN_INT64(totalpages * 10);
+ }
+
+ /*
+ * Re-initialize state for a BRIN index scan
+ */
+ Datum
+ brinrescan(PG_FUNCTION_ARGS)
+ {
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+ ScanKey scankey = (ScanKey) PG_GETARG_POINTER(1);
+
+ /* other arguments ignored */
+
+ if (scankey && scan->numberOfKeys > 0)
+ memmove(scan->keyData, scankey,
+ scan->numberOfKeys * sizeof(ScanKeyData));
+
+ PG_RETURN_VOID();
+ }
+
+ /*
+ * Close down a BRIN index scan
+ */
+ Datum
+ brinendscan(PG_FUNCTION_ARGS)
+ {
+ IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
+ BrinOpaque *opaque = (BrinOpaque *) scan->opaque;
+
+ brinRevmapAccessTerminate(opaque->bo_rmAccess);
+ brin_free_desc(opaque->bo_bdesc);
+ pfree(opaque);
+
+ PG_RETURN_VOID();
+ }
+
+ Datum
+ brinmarkpos(PG_FUNCTION_ARGS)
+ {
+ elog(ERROR, "BRIN does not support mark/restore");
+ PG_RETURN_VOID();
+ }
+
+ Datum
+ brinrestrpos(PG_FUNCTION_ARGS)
+ {
+ elog(ERROR, "BRIN does not support mark/restore");
+ PG_RETURN_VOID();
+ }
+
+ /*
+ * Per-heap-tuple callback for IndexBuildHeapScan.
+ *
+ * Note we don't worry about the page range at the end of the table here; it is
+ * present in the build state struct after we're called the last time, but not
+ * inserted into the index. The caller must insert it, if appropriate.
+ */
+ static void
+ brinbuildCallback(Relation index,
+ HeapTuple htup,
+ Datum *values,
+ bool *isnull,
+ bool tupleIsAlive,
+ void *brstate)
+ {
+ BrinBuildState *state = (BrinBuildState *) brstate;
+ BlockNumber thisblock;
+ int i;
+
+ thisblock = ItemPointerGetBlockNumber(&htup->t_self);
+
+ /*
+ * If we're in a new block which belongs to the next range, summarize what
+ * we've got and start afresh.
+ */
+ if (thisblock > (state->currRangeStart + state->pagesPerRange - 1))
+ {
+ BRIN_elog(DEBUG2, "brinbuildCallback: completed a range: %u--%u",
+ state->currRangeStart,
+ state->currRangeStart + state->pagesPerRange);
+
+ /* create the index tuple and insert it */
+ form_and_insert_tuple(state);
+
+ /* set state to correspond to the next range */
+ state->currRangeStart += state->pagesPerRange;
+
+ /* re-initialize state for it */
+ brin_dtuple_initialize(state->dtuple, state->bs_bdesc);
+ }
+
+ /* Accumulate the current tuple into the running state */
+ state->seentup = true;
+ for (i = 0; i < state->bs_bdesc->bd_tupdesc->natts; i++)
+ {
+ FmgrInfo *addValue;
+
+ addValue = index_getprocinfo(index, i + 1,
+ BRIN_PROCNUM_ADDVALUE);
+
+ /*
+ * Update dtuple state, if and as necessary.
+ */
+ FunctionCall5Coll(addValue,
+ state->bs_bdesc->bd_tupdesc->attrs[i]->attcollation,
+ PointerGetDatum(state->bs_bdesc),
+ PointerGetDatum(state->dtuple),
+ UInt16GetDatum(i + 1), values[i], isnull[i]);
+ }
+ }
+
+ /*
+ * brinbuild() -- build a new BRIN index.
+ */
+ Datum
+ brinbuild(PG_FUNCTION_ARGS)
+ {
+ Relation heap = (Relation) PG_GETARG_POINTER(0);
+ Relation index = (Relation) PG_GETARG_POINTER(1);
+ IndexInfo *indexInfo = (IndexInfo *) PG_GETARG_POINTER(2);
+ IndexBuildResult *result;
+ double reltuples;
+ double idxtuples;
+ brinRmAccess *rmAccess;
+ BrinBuildState *state;
+ Buffer meta;
+ BlockNumber pagesPerRange;
+
+ /*
+ * We expect to be called exactly once for any index relation.
+ */
+ if (RelationGetNumberOfBlocks(index) != 0)
+ elog(ERROR, "index \"%s\" already contains data",
+ RelationGetRelationName(index));
+
+ /* partial indexes not supported */
+ if (indexInfo->ii_Predicate != NIL)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("partial indexes not supported")));
+ /* expressions not supported (yet?) */
+ if (indexInfo->ii_Expressions != NIL)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("expression indexes not supported")));
+
+ /*
+ * Critical section not required, because on error the creation of the
+ * whole relation will be rolled back.
+ */
+
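+ /*
+ * The index is empty at this point (checked above), so the P_NEW buffer
+ * obtained here is block 0, which BRIN reserves for the metapage; the
+ * Assert below verifies that.
+ */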
+ meta = ReadBuffer(index, P_NEW);
+ Assert(BufferGetBlockNumber(meta) == BRIN_METAPAGE_BLKNO);
+ LockBuffer(meta, BUFFER_LOCK_EXCLUSIVE);
+
+ brin_metapage_init(BufferGetPage(meta), BrinGetPagesPerRange(index),
+ BRIN_CURRENT_VERSION);
+ MarkBufferDirty(meta);
+
+ if (RelationNeedsWAL(index))
+ {
+ xl_brin_createidx xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata;
+ Page page;
+
+ xlrec.node = index->rd_node;
+ xlrec.version = BRIN_CURRENT_VERSION;
+ xlrec.pagesPerRange = BrinGetPagesPerRange(index);
+
+ rdata.buffer = InvalidBuffer;
+ rdata.data = (char *) &xlrec;
+ rdata.len = SizeOfBrinCreateIdx;
+ rdata.next = NULL;
+
+ recptr = XLogInsert(RM_BRIN_ID, XLOG_BRIN_CREATE_INDEX, &rdata);
+
+ page = BufferGetPage(meta);
+ PageSetLSN(page, recptr);
+ }
+
+ UnlockReleaseBuffer(meta);
+
+ /*
+ * Initialize our state, including the deformed tuple state.
+ */
+ rmAccess = brinRevmapAccessInit(index, &pagesPerRange);
+ state = initialize_brin_buildstate(index, rmAccess, pagesPerRange);
+
+ /*
+ * Now scan the relation. No syncscan allowed here because we want the
+ * heap blocks in physical order.
+ */
+ reltuples = IndexBuildHeapScan(heap, index, indexInfo, false,
+ brinbuildCallback, (void *) state);
+
+ /* process the final batch */
+ form_and_insert_tuple(state);
+
+ /* release resources */
+ idxtuples = state->numtuples;
+ brinRevmapAccessTerminate(state->bs_rmAccess);
+ if (terminate_brin_buildstate(state))
+ FreeSpaceMapVacuum(index);
+
+ /*
+ * Return statistics
+ */
+ result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+
+ result->heap_tuples = reltuples;
+ result->index_tuples = idxtuples;
+
+ PG_RETURN_POINTER(result);
+ }
+
+ Datum
+ brinbuildempty(PG_FUNCTION_ARGS)
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("unlogged BRIN indexes are not supported")));
+
+ PG_RETURN_VOID();
+ }
+
+ /*
+ * brinbulkdelete
+ * Since there are no per-heap-tuple index tuples in BRIN indexes,
+ * there's not a lot we can do here.
+ *
+ * XXX we could mark item tuples as "dirty" (when a minimum or maximum heap
+ * tuple is deleted), meaning that summarization needs to be re-run on the
+ * affected range. That would require an extra flag in the index tuples.
+ */
+ Datum
+ brinbulkdelete(PG_FUNCTION_ARGS)
+ {
+ /* other arguments are not currently used */
+ IndexBulkDeleteResult *stats = (IndexBulkDeleteResult *) PG_GETARG_POINTER(1);
+
+ /* allocate stats if first time through, else re-use existing struct */
+ if (stats == NULL)
+ stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+ PG_RETURN_POINTER(stats);
+ }
+
+ /*
+ * This routine is in charge of "vacuuming" a BRIN index: we just summarize
+ * ranges that are currently unsummarized.
+ */
+ Datum
+ brinvacuumcleanup(PG_FUNCTION_ARGS)
+ {
+ IndexVacuumInfo *info = (IndexVacuumInfo *) PG_GETARG_POINTER(0);
+ IndexBulkDeleteResult *stats = (IndexBulkDeleteResult *) PG_GETARG_POINTER(1);
+ brinRmAccess *rmAccess;
+ BrinBuildState *state = NULL;
+ IndexInfo *indexInfo = NULL;
+ Relation heapRel;
+ BlockNumber heapNumBlocks;
+ BlockNumber heapBlk;
+ BlockNumber pagesPerRange;
+ Buffer buf;
+
+ /* No-op in ANALYZE ONLY mode */
+ if (info->analyze_only)
+ PG_RETURN_POINTER(stats);
+
+ if (!stats)
+ stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+ stats->num_pages = RelationGetNumberOfBlocks(info->index);
+ /* rest of stats is initialized by zeroing */
+
+ heapRel = heap_open(IndexGetRelation(RelationGetRelid(info->index), false),
+ AccessShareLock);
+
+ rmAccess = brinRevmapAccessInit(info->index, &pagesPerRange);
+
+ /*
+ * Scan the revmap to find unsummarized items.
+ */
+ buf = InvalidBuffer;
+ heapNumBlocks = RelationGetNumberOfBlocks(heapRel);
+ for (heapBlk = 0; heapBlk < heapNumBlocks; heapBlk += pagesPerRange)
+ {
+ BrTuple *tup;
+ OffsetNumber off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ tup = brinGetTupleForHeapBlock(rmAccess, heapBlk, &buf, &off, NULL,
+ BUFFER_LOCK_SHARE);
+ if (tup == NULL)
+ {
+ /* no revmap entry for this heap range. Summarize it. */
+ if (state == NULL)
+ {
+ /* first time through */
+ Assert(!indexInfo);
+ state = initialize_brin_buildstate(info->index, rmAccess,
+ pagesPerRange);
+ indexInfo = BuildIndexInfo(info->index);
+ }
+ summarize_range(indexInfo, state, heapRel, heapBlk);
+
+ /* and re-initialize state for the next range */
+ brin_dtuple_initialize(state->dtuple, state->bs_bdesc);
+
+ stats->num_index_tuples++;
+ }
+ else
+ {
+ stats->num_index_tuples++;
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ }
+ }
+
+ if (BufferIsValid(buf))
+ ReleaseBuffer(buf);
+
+ /* free resources */
+ brinRevmapAccessTerminate(rmAccess);
+ if (state && terminate_brin_buildstate(state))
+ FreeSpaceMapVacuum(info->index);
+
+ heap_close(heapRel, AccessShareLock);
+
+ PG_RETURN_POINTER(stats);
+ }
+
+ /*
+ * reloptions processor for BRIN indexes
+ */
+ Datum
+ brinoptions(PG_FUNCTION_ARGS)
+ {
+ Datum reloptions = PG_GETARG_DATUM(0);
+ bool validate = PG_GETARG_BOOL(1);
+ relopt_value *options;
+ BrinOptions *rdopts;
+ int numoptions;
+ static const relopt_parse_elt tab[] = {
+ {"pages_per_range", RELOPT_TYPE_INT, offsetof(BrinOptions, pagesPerRange)}
+ };
+
+ options = parseRelOptions(reloptions, validate, RELOPT_KIND_BRIN,
+ &numoptions);
+
+ /* if none set, we're done */
+ if (numoptions == 0)
+ PG_RETURN_NULL();
+
+ rdopts = allocateReloptStruct(sizeof(BrinOptions), options, numoptions);
+
+ fillRelOptions((void *) rdopts, sizeof(BrinOptions), options, numoptions,
+ validate, tab, lengthof(tab));
+
+ pfree(options);
+
+ PG_RETURN_BYTEA_P(rdopts);
+ }
+
+ /*
+ * Initialize a page with the given type.
+ *
+ * Caller is responsible for marking it dirty, as appropriate.
+ */
+ void
+ brin_page_init(Page page, uint16 type)
+ {
+ BrinSpecialSpace *special;
+
+ PageInit(page, BLCKSZ, sizeof(BrinSpecialSpace));
+
+ special = (BrinSpecialSpace *) PageGetSpecialPointer(page);
+ special->type = type;
+ }
+
+ /*
+ * Initialize a new BRIN index's metapage.
+ */
+ void
+ brin_metapage_init(Page page, BlockNumber pagesPerRange, uint16 version)
+ {
+ BrinMetaPageData *metadata;
+
+ brin_page_init(page, BRIN_PAGETYPE_META);
+
+ metadata = (BrinMetaPageData *) PageGetContents(page);
+
+ metadata->brinMagic = BRIN_META_MAGIC;
+ metadata->brinVersion = version;
+ metadata->pagesPerRange = pagesPerRange;
+
+ /*
+ * Note we cheat here a little. 0 is not a valid revmap block number
+ * (because it's the metapage buffer), but doing this enables the first
+ * revmap page to be created when the index is.
+ */
+ metadata->lastRevmapPage = 0;
+ }
+
+ /*
+ * Build a BrinDesc used to create or scan a BRIN index
+ */
+ BrinDesc *
+ brin_build_desc(Relation rel)
+ {
+ BrinOpcInfo **opcinfo;
+ BrinDesc *bdesc;
+ TupleDesc tupdesc;
+ int totalstored = 0;
+ int keyno;
+ long totalsize;
+ MemoryContext cxt;
+ MemoryContext oldcxt;
+
+ cxt = AllocSetContextCreate(CurrentMemoryContext,
+ "brin desc cxt",
+ ALLOCSET_SMALL_INITSIZE,
+ ALLOCSET_SMALL_MINSIZE,
+ ALLOCSET_SMALL_MAXSIZE);
+ oldcxt = MemoryContextSwitchTo(cxt);
+
+ tupdesc = RelationGetDescr(rel);
+ IncrTupleDescRefCount(tupdesc);
+
+ /*
+ * Obtain BrinOpcInfo for each indexed column. While at it, accumulate
+ * the number of columns stored, since the number is opclass-defined.
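+ * (For example, the minmax opclass stores two values per indexed column,
+ * the minimum and the maximum, so it reports oi_nstored = 2.)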
+ */
+ opcinfo = (BrinOpcInfo **) palloc(sizeof(BrinOpcInfo *) * tupdesc->natts);
+ for (keyno = 0; keyno < tupdesc->natts; keyno++)
+ {
+ FmgrInfo *opcInfoFn;
+
+ opcInfoFn = index_getprocinfo(rel, keyno + 1, BRIN_PROCNUM_OPCINFO);
+
+ /* actually FunctionCall0 but we don't have that */
+ opcinfo[keyno] = (BrinOpcInfo *)
+ DatumGetPointer(FunctionCall1(opcInfoFn, InvalidOid));
+ totalstored += opcinfo[keyno]->oi_nstored;
+ }
+
+ /* Allocate our result struct and fill it in */
+ totalsize = offsetof(BrinDesc, bd_info) +
+ sizeof(BrinOpcInfo *) * tupdesc->natts;
+
+ bdesc = palloc(totalsize);
+ bdesc->bd_context = cxt;
+ bdesc->bd_index = rel;
+ bdesc->bd_tupdesc = tupdesc;
+ bdesc->bd_disktdesc = NULL; /* generated lazily */
+ bdesc->bd_totalstored = totalstored;
+
+ for (keyno = 0; keyno < tupdesc->natts; keyno++)
+ bdesc->bd_info[keyno] = opcinfo[keyno];
+ pfree(opcinfo);
+
+ MemoryContextSwitchTo(oldcxt);
+
+ return bdesc;
+ }
+
+ void
+ brin_free_desc(BrinDesc *bdesc)
+ {
+ DecrTupleDescRefCount(bdesc->bd_tupdesc);
+ /* no need for retail pfree */
+ MemoryContextDelete(bdesc->bd_context);
+ }
+
+ /*
+ * Initialize a BrinBuildState appropriate to create tuples on the given index.
+ */
+ static BrinBuildState *
+ initialize_brin_buildstate(Relation idxRel, brinRmAccess *rmAccess,
+ BlockNumber pagesPerRange)
+ {
+ BrinBuildState *state;
+
+ state = palloc(sizeof(BrinBuildState));
+
+ state->irel = idxRel;
+ state->numtuples = 0;
+ state->currentInsertBuf = InvalidBuffer;
+ state->pagesPerRange = pagesPerRange;
+ state->currRangeStart = 0;
+ state->bs_rmAccess = rmAccess;
+ state->bs_bdesc = brin_build_desc(idxRel);
+ state->seentup = false;
+ state->extended = false;
+ state->dtuple = brin_new_dtuple(state->bs_bdesc);
+
+ brin_dtuple_initialize(state->dtuple, state->bs_bdesc);
+
+ return state;
+ }
+
+ /*
+ * Release resources associated with a BrinBuildState. Returns whether the FSM
+ * should be vacuumed afterwards.
+ */
+ static bool
+ terminate_brin_buildstate(BrinBuildState *state)
+ {
+ bool vacuumfsm;
+
+ /* release the last index buffer used */
+ if (!BufferIsInvalid(state->currentInsertBuf))
+ {
+ Page page;
+
+ page = BufferGetPage(state->currentInsertBuf);
+ RecordPageWithFreeSpace(state->irel,
+ BufferGetBlockNumber(state->currentInsertBuf),
+ PageGetFreeSpace(page));
+ ReleaseBuffer(state->currentInsertBuf);
+ }
+ vacuumfsm = state->extended;
+
+ brin_free_desc(state->bs_bdesc);
+ pfree(state->dtuple);
+ pfree(state);
+
+ return vacuumfsm;
+ }
+
+ /*
+ * Summarize the given page range of the given index.
+ *
+ * This routine takes proper precautions to deal with concurrent insertions to
+ * heap pages already scanned.
+ */
+ static void
+ summarize_range(IndexInfo *indexInfo, BrinBuildState *state, Relation heapRel,
+ BlockNumber heapBlk)
+ {
+ Buffer phbuf;
+ BrTuple *phtup;
+ Size phsz;
+ OffsetNumber offset;
+
+ /*
+ * Insert the placeholder tuple
+ */
+ phbuf = InvalidBuffer;
+ phtup = brin_form_placeholder_tuple(state->bs_bdesc, heapBlk, &phsz);
+ offset = brin_doinsert(state->irel, state->pagesPerRange,
+ state->bs_rmAccess, &phbuf,
+ heapBlk, phtup, phsz, &state->extended);
+
+ /*
+ * Execute the partial heap scan covering the heap blocks in the specified
+ * page range, summarizing the heap tuples in it. This scan stops just
+ * short of brinbuildCallback creating the new index entry.
+ */
+ state->currRangeStart = heapBlk;
+ IndexBuildHeapRangeScan(heapRel, state->irel, indexInfo, false,
+ heapBlk, state->pagesPerRange,
+ brinbuildCallback, (void *) state);
+
+ for (;;)
+ {
+ BrTuple *newtup;
+ Size newsize;
+ bool didupdate;
+ bool samepage;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /*
+ * Update the summary tuple, being prepared to start over if this
+ * fails. In particular, this covers the case that some concurrent
+ * inserter changed the placeholder tuple.
+ */
+ newtup = brin_form_tuple(state->bs_bdesc,
+ heapBlk, state->dtuple, &newsize);
+ samepage = brin_can_do_samepage_update(phbuf, phsz, newsize);
+ didupdate =
+ brin_doupdate(state->irel, state->pagesPerRange, state->bs_rmAccess,
+ heapBlk, phbuf, offset,
+ phtup, phsz, newtup, newsize, samepage,
+ &state->extended);
+ brin_free_tuple(phtup);
+ brin_free_tuple(newtup);
+
+ /* If the update succeeded, we're done. */
+ if (didupdate)
+ break;
+
+ /*
+ * If the update didn't work, it might be because somebody updated the
+ * placeholder tuple concurrently. Extract the new version, union it
+ * with the values we have from the scan, and start over. (There are
+ * other reasons for the update to fail, but it's simple to treat them
+ * the same.)
+ */
+ phtup = brinGetTupleForHeapBlock(state->bs_rmAccess, heapBlk, &phbuf,
+ &offset, &phsz, BUFFER_LOCK_SHARE);
+ /* the placeholder tuple must exist */
+ if (phtup == NULL)
+ elog(ERROR, "missing placeholder tuple");
+ phtup = brin_copy_tuple(phtup, phsz);
+ LockBuffer(phbuf, BUFFER_LOCK_UNLOCK);
+
+ /* merge it into the tuple from the heap scan */
+ union_tuples(state->bs_bdesc, state->dtuple, phtup);
+ }
+
+ ReleaseBuffer(phbuf);
+ }
+
+ /*
+ * Given a deformed tuple in the build state, convert it into the on-disk
+ * format and insert it into the index, making the revmap point to it.
+ */
+ static void
+ form_and_insert_tuple(BrinBuildState *state)
+ {
+ BrTuple *tup;
+ Size size;
+
+ /* if we haven't seen any heap tuple yet, don't insert anything */
+ if (!state->seentup)
+ return;
+
+ tup = brin_form_tuple(state->bs_bdesc, state->currRangeStart,
+ state->dtuple, &size);
+ brin_doinsert(state->irel, state->pagesPerRange, state->bs_rmAccess,
+ &state->currentInsertBuf, state->currRangeStart,
+ tup, size, &state->extended);
+ state->numtuples++;
+
+ pfree(tup);
+
+ state->seentup = false;
+ }
+
+ /*
+ * Given two deformed tuples, adjust the first one so that it's consistent
+ * with the summary values in both.
+ */
+ static void
+ union_tuples(BrinDesc *bdesc, DeformedBrTuple *a, BrTuple *b)
+ {
+ int keyno;
+ DeformedBrTuple *db;
+
+ db = brin_deform_tuple(bdesc, b);
+
+ for (keyno = 0; keyno < bdesc->bd_tupdesc->natts; keyno++)
+ {
+ FmgrInfo *unionFn;
+
+ unionFn = index_getprocinfo(bdesc->bd_index, keyno + 1,
+ BRIN_PROCNUM_UNION);
+ FunctionCall4Coll(unionFn,
+ bdesc->bd_index->rd_indcollation[keyno],
+ PointerGetDatum(bdesc),
+ PointerGetDatum(a),
+ PointerGetDatum(db),
+ UInt16GetDatum(keyno + 1));
+ }
+
+ brin_free_dtuple(bdesc, db);
+ }
*** /dev/null
--- b/src/backend/access/brin/brin_minmax.c
***************
*** 0 ****
--- 1,326 ----
+ /*
+ * brin_minmax.c
+ * Implementation of Min/Max opclass for BRIN
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/brin/brin_minmax.c
+ */
+ #include "postgres.h"
+
+ #include "access/genam.h"
+ #include "access/brin_internal.h"
+ #include "access/brin_tuple.h"
+ #include "access/skey.h"
+ #include "catalog/pg_type.h"
+ #include "utils/datum.h"
+ #include "utils/lsyscache.h"
+ #include "utils/syscache.h"
+
+
+ /*
+ * Procedure numbers must not collide with BRIN_PROCNUM defines in
+ * brin_internal.h. Note we only need inequality functions.
+ */
+ #define MINMAX_NUM_PROCNUMS 4 /* # support procs we need */
+ #define PROCNUM_LESS 5
+ #define PROCNUM_LESSEQUAL 6
+ #define PROCNUM_GREATEREQUAL 7
+ #define PROCNUM_GREATER 8
+
+ /*
+ * Subtract this from procnum to obtain index in MinmaxOpaque arrays
+ * (Must be equal to minimum of private procnums)
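+ * For example, PROCNUM_LESSEQUAL (6) is cached at operators[6 - PROCNUM_BASE],
+ * i.e. operators[1], in the MinmaxOpaque struct below.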
+ */
+ #define PROCNUM_BASE 5
+
+ static FmgrInfo *minmax_get_procinfo(BrinDesc *bdesc, uint16 attno,
+ uint16 procnum);
+
+ PG_FUNCTION_INFO_V1(minmaxAddValue);
+ PG_FUNCTION_INFO_V1(minmaxConsistent);
+ PG_FUNCTION_INFO_V1(minmaxUnion);
+
+
+ typedef struct MinmaxOpaque
+ {
+ FmgrInfo operators[MINMAX_NUM_PROCNUMS];
+ bool inited[MINMAX_NUM_PROCNUMS];
+ } MinmaxOpaque;
+
+ #define OPCINFO(typname, typoid) \
+ PG_FUNCTION_INFO_V1(minmaxOpcInfo_##typname); \
+ Datum \
+ minmaxOpcInfo_##typname(PG_FUNCTION_ARGS) \
+ { \
+ BrinOpcInfo *result; \
+ \
+ /* \
+ * opaque->operators is initialized lazily, as indicated by 'inited' \
+ * which is initialized to all false by palloc0. \
+ */ \
+ \
+ result = palloc0(MAXALIGN(SizeofBrinOpcInfo(2)) + \
+ sizeof(MinmaxOpaque)); \
+ result->oi_nstored = 2; \
+ result->oi_opaque = (MinmaxOpaque *) \
+ MAXALIGN((char *) result + SizeofBrinOpcInfo(2)); \
+ result->oi_typids[0] = typoid; \
+ result->oi_typids[1] = typoid; \
+ \
+ PG_RETURN_POINTER(result); \
+ }
+
+ OPCINFO(int4, INT4OID)
+ OPCINFO(numeric, NUMERICOID)
+ OPCINFO(text, TEXTOID)
+ OPCINFO(time, TIMEOID)
+ OPCINFO(timetz, TIMETZOID)
+ OPCINFO(timestamp, TIMESTAMPOID)
+ OPCINFO(timestamptz, TIMESTAMPTZOID)
+ OPCINFO(date, DATEOID)
+ OPCINFO(char, CHAROID)
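+
+ /*
+ * Each OPCINFO() expansion above defines a V1-callable opcinfo function
+ * (e.g. minmaxOpcInfo_int4) reporting that two values of the indexed
+ * column's type are stored per page range: the running minimum and maximum.
+ */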
+
+ /*
+ * Examine the given index tuple (which contains partial status of a certain
+ * page range) by comparing it to the given value that comes from another heap
+ * tuple. If the new value is outside the min/max range specified by the
+ * existing tuple values, update the index tuple and return true. Otherwise,
+ * return false and leave the index tuple unmodified.
+ */
+ Datum
+ minmaxAddValue(PG_FUNCTION_ARGS)
+ {
+ BrinDesc *bdesc = (BrinDesc *) PG_GETARG_POINTER(0);
+ DeformedBrTuple *dtuple = (DeformedBrTuple *) PG_GETARG_POINTER(1);
+ AttrNumber attno = PG_GETARG_UINT16(2);
+ Datum newval = PG_GETARG_DATUM(3);
+ bool isnull = PG_GETARG_BOOL(4);
+ Oid colloid = PG_GET_COLLATION();
+ FmgrInfo *cmpFn;
+ Datum compar;
+ bool updated = false;
+ Form_pg_attribute attr;
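+ /* values[0] is the stored minimum for this column; values[1] the maximum */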
+ BrinValues *column = &dtuple->dt_columns[attno - 1];
+
+ /*
+ * If the new value is null, we record that we saw it if it's the first
+ * one; otherwise, there's nothing to do.
+ */
+ if (isnull)
+ {
+ if (column->hasnulls)
+ PG_RETURN_BOOL(false);
+
+ column->hasnulls = true;
+ PG_RETURN_BOOL(true);
+ }
+
+ attr = bdesc->bd_tupdesc->attrs[attno - 1];
+
+ /*
+ * If the recorded value is null, store the new value (which we know to be
+ * not null) as both minimum and maximum, and we're done.
+ */
+ if (column->allnulls)
+ {
+ column->values[0] = datumCopy(newval, attr->attbyval, attr->attlen);
+ column->values[1] = datumCopy(newval, attr->attbyval, attr->attlen);
+ column->allnulls = false;
+ PG_RETURN_BOOL(true);
+ }
+
+ /*
+ * Otherwise, need to compare the new value with the existing boundaries
+ * and update them accordingly. First check if it's less than the
+ * existing minimum.
+ */
+ cmpFn = minmax_get_procinfo(bdesc, attno, PROCNUM_LESS);
+ compar = FunctionCall2Coll(cmpFn, colloid, newval, column->values[0]);
+ if (DatumGetBool(compar))
+ {
+ if (!attr->attbyval)
+ pfree(DatumGetPointer(column->values[0]));
+ column->values[0] = datumCopy(newval, attr->attbyval, attr->attlen);
+ updated = true;
+ }
+
+ /*
+ * And now compare it to the existing maximum.
+ */
+ cmpFn = minmax_get_procinfo(bdesc, attno, PROCNUM_GREATER);
+ compar = FunctionCall2Coll(cmpFn, colloid, newval, column->values[1]);
+ if (DatumGetBool(compar))
+ {
+ if (!attr->attbyval)
+ pfree(DatumGetPointer(column->values[1]));
+ column->values[1] = datumCopy(newval, attr->attbyval, attr->attlen);
+ updated = true;
+ }
+
+ PG_RETURN_BOOL(updated);
+ }
+
+ /*
+ * Given an index tuple corresponding to a certain page range and a scan key,
+ * return whether the scan key is consistent with the index tuple's min/max
+ * values. Return true if so, false otherwise.
+ */
+ Datum
+ minmaxConsistent(PG_FUNCTION_ARGS)
+ {
+ BrinDesc *bdesc = (BrinDesc *) PG_GETARG_POINTER(0);
+ DeformedBrTuple *dtup = (DeformedBrTuple *) PG_GETARG_POINTER(1);
+ ScanKey key = (ScanKey) PG_GETARG_POINTER(2);
+ Oid colloid = PG_GET_COLLATION();
+ AttrNumber attno = key->sk_attno;
+ Datum value;
+ Datum matches;
+ BrinValues *column = &dtup->dt_columns[attno - 1];
+
+ /* handle IS NULL/IS NOT NULL tests */
+ if (key->sk_flags & SK_ISNULL)
+ {
+ if (key->sk_flags & SK_SEARCHNULL)
+ {
+ if (column->allnulls || column->hasnulls)
+ PG_RETURN_BOOL(true);
+ PG_RETURN_BOOL(false);
+ }
+
+ /*
+ * For IS NOT NULL, we can only skip ranges that are known to have
+ * only nulls.
+ */
+ Assert(key->sk_flags & SK_SEARCHNOTNULL);
+ PG_RETURN_BOOL(!column->allnulls);
+ }
+
+ value = key->sk_argument;
+ switch (key->sk_strategy)
+ {
+ case BTLessStrategyNumber:
+ matches = FunctionCall2Coll(minmax_get_procinfo(bdesc, attno,
+ PROCNUM_LESS),
+ colloid, column->values[0], value);
+ break;
+ case BTLessEqualStrategyNumber:
+ matches = FunctionCall2Coll(minmax_get_procinfo(bdesc, attno,
+ PROCNUM_LESSEQUAL),
+ colloid, column->values[0], value);
+ break;
+ case BTEqualStrategyNumber:
+
+ /*
+ * In the equality case (WHERE col = someval), we want to return
+ * the current page range if the minimum value in the range <=
+ * scan key, and the maximum value >= scan key.
+ */
+ matches = FunctionCall2Coll(minmax_get_procinfo(bdesc, attno,
+ PROCNUM_LESSEQUAL),
+ colloid, column->values[0], value);
+ if (!DatumGetBool(matches))
+ break;
+ /* max() >= scankey */
+ matches = FunctionCall2Coll(minmax_get_procinfo(bdesc, attno,
+ PROCNUM_GREATEREQUAL),
+ colloid, column->values[1], value);
+ break;
+ case BTGreaterEqualStrategyNumber:
+ matches = FunctionCall2Coll(minmax_get_procinfo(bdesc, attno,
+ PROCNUM_GREATEREQUAL),
+ colloid, column->values[1], value);
+ break;
+ case BTGreaterStrategyNumber:
+ matches = FunctionCall2Coll(minmax_get_procinfo(bdesc, attno,
+ PROCNUM_GREATER),
+ colloid, column->values[1], value);
+ break;
+ default:
+ /* shouldn't happen */
+ elog(ERROR, "invalid strategy number %d", key->sk_strategy);
+ matches = 0;
+ break;
+ }
+
+ PG_RETURN_DATUM(matches);
+ }
+
+ /*
+ * Given two index tuples, update the first of them with a union of the summary
+ * values contained in both, for the given attribute. The second tuple is
+ * untouched. Return value is true if there was any adjustment applied,
+ * otherwise false.
+ */
+ Datum
+ minmaxUnion(PG_FUNCTION_ARGS)
+ {
+ BrinDesc *bdesc = (BrinDesc *) PG_GETARG_POINTER(0);
+ DeformedBrTuple *a = (DeformedBrTuple *) PG_GETARG_POINTER(1);
+ DeformedBrTuple *b = (DeformedBrTuple *) PG_GETARG_POINTER(2);
+ AttrNumber attno = PG_GETARG_INT16(3);
+ Oid colloid = PG_GET_COLLATION();
+ BrinValues *col_a = &a->dt_columns[attno - 1];
+ BrinValues *col_b = &b->dt_columns[attno - 1];
+ Form_pg_attribute attr;
+ bool result = false;
+ bool needsadj;
+
+ attr = bdesc->bd_tupdesc->attrs[attno - 1];
+
+ /* Adjust minimum, if b's min is less than a's min */
+ needsadj = DatumGetBool(FunctionCall2Coll(minmax_get_procinfo(bdesc, attno,
+ PROCNUM_LESS),
+ colloid, col_b->values[0], col_a->values[0]));
+ if (needsadj)
+ {
+ result = true;
+ if (!attr->attbyval)
+ pfree(DatumGetPointer(col_a->values[0]));
+ col_a->values[0] = datumCopy(col_b->values[0],
+ attr->attbyval, attr->attlen);
+ }
+
+ /* Adjust maximum, if b's max is greater than a's max */
+ needsadj = DatumGetBool(FunctionCall2Coll(minmax_get_procinfo(bdesc, attno,
+ PROCNUM_GREATER),
+ colloid, col_b->values[1], col_a->values[1]));
+ if (needsadj)
+ {
+ result = true;
+ if (!attr->attbyval)
+ pfree(DatumGetPointer(col_a->values[1]));
+ col_a->values[1] = datumCopy(col_b->values[1],
+ attr->attbyval, attr->attlen);
+ }
+
+ PG_RETURN_BOOL(result);
+ }
+
+ /*
+ * Return the procedure corresponding to the given function support number.
+ */
+ static FmgrInfo *
+ minmax_get_procinfo(BrinDesc *bdesc, uint16 attno, uint16 procnum)
+ {
+ MinmaxOpaque *opaque;
+ uint16 basenum = procnum - PROCNUM_BASE;
+
+ opaque = (MinmaxOpaque *) bdesc->bd_info[attno - 1]->oi_opaque;
+
+ /*
+ * We cache these in the opaque struct, to avoid repetitive syscache
+ * lookups.
+ */
+ if (!opaque->inited[basenum])
+ {
+ fmgr_info_copy(&opaque->operators[basenum],
+ index_getprocinfo(bdesc->bd_index, attno, procnum),
+ bdesc->bd_context);
+ opaque->inited[basenum] = true;
+ }
+
+ return &opaque->operators[basenum];
+ }
*** /dev/null
--- b/src/backend/access/brin/brpageops.c
***************
*** 0 ****
--- 1,665 ----
+ /*
+ * brpageops.c
+ * Page-handling routines for BRIN indexes
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/brin/brpageops.c
+ */
+ #include "postgres.h"
+
+ #include "access/brin_pageops.h"
+ #include "access/brin_page.h"
+ #include "access/brin_revmap.h"
+ #include "access/brin_xlog.h"
+ #include "miscadmin.h"
+ #include "storage/bufmgr.h"
+ #include "storage/freespace.h"
+ #include "storage/lmgr.h"
+ #include "storage/smgr.h"
+ #include "utils/rel.h"
+
+
+ static Buffer brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
+ bool *was_extended);
+ static Size br_page_get_freespace(Page page);
+
+
+ /*
+ * Update tuple origtup (size origsz), located in offset oldoff of buffer
+ * oldbuf, to newtup (size newsz) as summary tuple for the page range starting
+ * at heapBlk. oldbuf must not be locked on entry, and is not locked at exit.
+ *
+ * If samepage is true, attempt to put the new tuple in the same page, but if
+ * there's no room, use some other one.
+ *
+ * If the update is successful, return true; the revmap is updated to point to
+ * the new tuple. If the update is not done for whatever reason, return false.
+ * Caller may retry the update if this happens.
+ *
+ * If the index had to be extended in the course of this operation, *extended
+ * is set to true.
+ */
+ bool
+ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
+ brinRmAccess *rmAccess, BlockNumber heapBlk,
+ Buffer oldbuf, OffsetNumber oldoff,
+ const BrTuple *origtup, Size origsz,
+ const BrTuple *newtup, Size newsz,
+ bool samepage, bool *extended)
+ {
+ Page oldpage;
+ ItemId oldlp;
+ BrTuple *oldtup;
+ Size oldsz;
+ Buffer newbuf;
+ BrinSpecialSpace *special;
+
+ /* make sure the revmap is long enough to contain the entry we need */
+ brinRevmapExtend(rmAccess, heapBlk);
+
+ if (!samepage)
+ {
+ /* need a page on which to put the item */
+ newbuf = brin_getinsertbuffer(idxrel, oldbuf, newsz, extended);
+ if (!BufferIsValid(newbuf))
+ return false;
+
+ /*
+ * Note: it's possible (though unlikely) that the returned newbuf is
+ * the same as oldbuf, if brin_getinsertbuffer determined that the old
+ * buffer does in fact have enough space.
+ */
+ if (newbuf == oldbuf)
+ newbuf = InvalidBuffer;
+ }
+ else
+ {
+ LockBuffer(oldbuf, BUFFER_LOCK_EXCLUSIVE);
+ newbuf = InvalidBuffer;
+ }
+ oldpage = BufferGetPage(oldbuf);
+ oldlp = PageGetItemId(oldpage, oldoff);
+
+ /*
+ * Check that the old tuple wasn't updated concurrently: it might have
+ * moved someplace else entirely ...
+ */
+ if (!ItemIdIsNormal(oldlp))
+ {
+ LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+ if (BufferIsValid(newbuf))
+ UnlockReleaseBuffer(newbuf);
+ return false;
+ }
+
+ oldsz = ItemIdGetLength(oldlp);
+ oldtup = (BrTuple *) PageGetItem(oldpage, oldlp);
+
+ /*
+ * ... or it might have been updated in place to different contents.
+ */
+ if (!brin_tuples_equal(oldtup, oldsz, origtup, origsz))
+ {
+ LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+ if (BufferIsValid(newbuf))
+ UnlockReleaseBuffer(newbuf);
+ return false;
+ }
+
+ special = (BrinSpecialSpace *) PageGetSpecialPointer(oldpage);
+
+ /*
+ * Great, the old tuple is intact. We can proceed with the update.
+ *
+ * If there's enough room in the old page for the new tuple, replace it.
+ *
+ * Note that there might now be enough space on the page even though the
+ * caller told us there isn't, if a concurrent update moved another tuple
+ * elsewhere or replaced a tuple with a smaller one.
+ */
+ if (((special->flags & BRIN_EVACUATE_PAGE) == 0) &&
+ brin_can_do_samepage_update(oldbuf, origsz, newsz))
+ {
+ if (BufferIsValid(newbuf))
+ UnlockReleaseBuffer(newbuf);
+
+ START_CRIT_SECTION();
+ PageIndexDeleteNoCompact(oldpage, &oldoff, 1);
+ if (PageAddItem(oldpage, (Item) newtup, newsz, oldoff, true,
+ false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add BRIN tuple");
+ MarkBufferDirty(oldbuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(idxrel))
+ {
+ BlockNumber blk = BufferGetBlockNumber(oldbuf);
+ xl_brin_samepage_update xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+ uint8 info = XLOG_BRIN_SAMEPAGE_UPDATE;
+
+ xlrec.node = idxrel->rd_node;
+ ItemPointerSetBlockNumber(&xlrec.tid, blk);
+ ItemPointerSetOffsetNumber(&xlrec.tid, oldoff);
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfBrinSamepageUpdate;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = (char *) newtup;
+ rdata[1].len = newsz;
+ rdata[1].buffer = oldbuf;
+ rdata[1].buffer_std = true;
+ rdata[1].next = NULL;
+
+ recptr = XLogInsert(RM_BRIN_ID, info, rdata);
+
+ PageSetLSN(oldpage, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+ return true;
+ }
+ else if (newbuf == InvalidBuffer)
+ {
+ /*
+ * Not enough space, but caller said that there was. Tell them to
+ * start over.
+ */
+ LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+ return false;
+ }
+ else
+ {
+ /*
+ * Not enough free space on the oldpage. Put the new tuple on the new
+ * page, and update the revmap.
+ */
+ Page newpage = BufferGetPage(newbuf);
+ Buffer revmapbuf;
+ ItemPointerData newtid;
+ OffsetNumber newoff;
+
+ revmapbuf = brinLockRevmapPageForUpdate(rmAccess, heapBlk);
+
+ START_CRIT_SECTION();
+
+ PageIndexDeleteNoCompact(oldpage, &oldoff, 1);
+ newoff = PageAddItem(newpage, (Item) newtup, newsz, InvalidOffsetNumber,
+ false, false);
+ if (newoff == InvalidOffsetNumber)
+ elog(ERROR, "failed to add BRIN tuple to new page");
+ MarkBufferDirty(oldbuf);
+ MarkBufferDirty(newbuf);
+
+ ItemPointerSet(&newtid, BufferGetBlockNumber(newbuf), newoff);
+ brinSetHeapBlockItemptr(revmapbuf, pagesPerRange, heapBlk, newtid);
+ MarkBufferDirty(revmapbuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(idxrel))
+ {
+ xl_brin_update xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[4];
+ uint8 info = XLOG_BRIN_UPDATE;
+
+ xlrec.new.node = idxrel->rd_node;
+ ItemPointerSet(&xlrec.new.tid, BufferGetBlockNumber(newbuf), newoff);
+ xlrec.new.heapBlk = heapBlk;
+ xlrec.new.revmapBlk = BufferGetBlockNumber(revmapbuf);
+ xlrec.new.pagesPerRange = pagesPerRange;
+ ItemPointerSet(&xlrec.oldtid, BufferGetBlockNumber(oldbuf), oldoff);
+
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfBrinUpdate;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = (char *) newtup;
+ rdata[1].len = newsz;
+ rdata[1].buffer = newbuf;
+ rdata[1].buffer_std = true;
+ rdata[1].next = &(rdata[2]);
+
+ rdata[2].data = (char *) NULL;
+ rdata[2].len = 0;
+ rdata[2].buffer = revmapbuf;
+ rdata[2].buffer_std = true;
+ rdata[2].next = &(rdata[3]);
+
+ rdata[3].data = (char *) NULL;
+ rdata[3].len = 0;
+ rdata[3].buffer = oldbuf;
+ rdata[3].buffer_std = true;
+ rdata[3].next = NULL;
+
+ recptr = XLogInsert(RM_BRIN_ID, info, rdata);
+
+ PageSetLSN(oldpage, recptr);
+ PageSetLSN(newpage, recptr);
+ PageSetLSN(BufferGetPage(revmapbuf), recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ LockBuffer(revmapbuf, BUFFER_LOCK_UNLOCK);
+ LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+ UnlockReleaseBuffer(newbuf);
+ return true;
+ }
+ }
+
+ /*
+ * Return whether brin_doupdate can do a samepage update.
+ */
+ bool
+ brin_can_do_samepage_update(Buffer buffer, Size origsz, Size newsz)
+ {
+ return
+ ((newsz <= origsz) ||
+ PageGetExactFreeSpace(BufferGetPage(buffer)) >= (newsz - origsz));
+ }
+
+ /*
+ * Insert an index tuple into the index relation. The revmap is updated to
+ * mark the range containing the given page as pointing to the inserted entry.
+ * A WAL record is written.
+ *
+ * The buffer, if valid, is first checked for free space to insert the new
+ * entry; if there isn't enough, a new buffer is obtained and pinned. No
+ * buffer lock must be held on entry, no buffer lock is held on exit.
+ *
+ * If the relation had to be extended to make room for the new index tuple,
+ * *extended is set to true.
+ *
+ * Return value is the offset number where the tuple was inserted.
+ */
+ OffsetNumber
+ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
+ brinRmAccess *rmAccess, Buffer *buffer, BlockNumber heapBlk,
+ BrTuple *tup, Size itemsz, bool *extended)
+ {
+ Page page;
+ BlockNumber blk;
+ OffsetNumber off;
+ Buffer revmapbuf;
+ ItemPointerData tid;
+
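+ /*
+ * MAXALIGN the tuple size: PageAddItem consumes maxaligned space on the
+ * page, so the free-space checks below must use the aligned size.
+ */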
+ itemsz = MAXALIGN(itemsz);
+
+ /*
+ * Make sure the revmap is long enough to contain the entry we need.
+ */
+ brinRevmapExtend(rmAccess, heapBlk);
+
+ /*
+ * Obtain a locked buffer to insert the new tuple. Note
+ * brin_getinsertbuffer ensures there's enough space in the returned
+ * buffer.
+ */
+ if (BufferIsValid(*buffer))
+ {
+ /*
+ * It's possible that another backend (or ourselves!) extended the
+ * revmap over the page we held a pin on, so we cannot assume that
+ * it's still a regular page.
+ */
+ LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+ if (br_page_get_freespace(BufferGetPage(*buffer)) < itemsz)
+ {
+ UnlockReleaseBuffer(*buffer);
+ *buffer = InvalidBuffer;
+ }
+ }
+
+ if (!BufferIsValid(*buffer))
+ {
+ *buffer = brin_getinsertbuffer(idxrel, InvalidBuffer, itemsz, extended);
+ Assert(BufferIsValid(*buffer));
+ Assert(br_page_get_freespace(BufferGetPage(*buffer)) >= itemsz);
+ }
+
+ /* Now obtain lock on revmap buffer */
+ revmapbuf = brinLockRevmapPageForUpdate(rmAccess, heapBlk);
+
+ page = BufferGetPage(*buffer);
+ blk = BufferGetBlockNumber(*buffer);
+
+ START_CRIT_SECTION();
+ off = PageAddItem(page, (Item) tup, itemsz, InvalidOffsetNumber,
+ false, false);
+ if (off == InvalidOffsetNumber)
+ elog(ERROR, "could not insert new index tuple to page");
+ MarkBufferDirty(*buffer);
+
+ BRIN_elog(DEBUG2, "inserted tuple (%u,%u) for range starting at %u",
+ blk, off, heapBlk);
+
+ ItemPointerSet(&tid, blk, off);
+ brinSetHeapBlockItemptr(revmapbuf, pagesPerRange, heapBlk, tid);
+ MarkBufferDirty(revmapbuf);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(idxrel))
+ {
+ xl_brin_insert xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata[2];
+ uint8 info = XLOG_BRIN_INSERT;
+
+ xlrec.node = idxrel->rd_node;
+ xlrec.heapBlk = heapBlk;
+ xlrec.pagesPerRange = pagesPerRange;
+ xlrec.revmapBlk = BufferGetBlockNumber(revmapbuf);
+ ItemPointerSet(&xlrec.tid, blk, off);
+
+ rdata[0].data = (char *) &xlrec;
+ rdata[0].len = SizeOfBrinInsert;
+ rdata[0].buffer = InvalidBuffer;
+ rdata[0].buffer_std = false;
+ rdata[0].next = &(rdata[1]);
+
+ rdata[1].data = (char *) tup;
+ rdata[1].len = itemsz;
+ rdata[1].buffer = *buffer;
+ rdata[1].buffer_std = true;
+ rdata[1].next = NULL;
+
+ recptr = XLogInsert(RM_BRIN_ID, info, rdata);
+
+ PageSetLSN(page, recptr);
+ PageSetLSN(BufferGetPage(revmapbuf), recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ /* Tuple is firmly on buffer; we can release our locks */
+ LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+ LockBuffer(revmapbuf, BUFFER_LOCK_UNLOCK);
+
+ return off;
+ }
+
+ /*
+ * Initiate page evacuation protocol.
+ *
+ * The page must be locked in exclusive mode by the caller.
+ *
+ * If the page is not yet initialized or empty, return false without doing
+ * anything; it can be used for revmap without any further changes. If it
+ * contains tuples, mark it for evacuation and return true.
+ */
+ bool
+ brin_start_evacuating_page(Relation idxRel, Buffer buf)
+ {
+ OffsetNumber off;
+ OffsetNumber maxoff;
+ BrinSpecialSpace *special;
+ Page page;
+
+ page = BufferGetPage(buf);
+
+ if (PageIsNew(page))
+ return false;
+
+ special = (BrinSpecialSpace *) PageGetSpecialPointer(page);
+
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (off = FirstOffsetNumber; off <= maxoff; off++)
+ {
+ ItemId lp;
+
+ lp = PageGetItemId(page, off);
+ if (ItemIdIsUsed(lp))
+ {
+ /* prevent other backends from adding more stuff to this page */
+ special->flags |= BRIN_EVACUATE_PAGE;
+ MarkBufferDirtyHint(buf, true);
+
+ return true;
+ }
+ }
+ return false;
+ }
+
+ /*
+ * Move all tuples out of a page.
+ *
+ * The caller must hold lock on the page. The lock and pin are released.
+ */
+ void
+ brin_evacuate_page(Relation idxRel, BlockNumber pagesPerRange,
+ brinRmAccess *rmAccess, Buffer buf)
+ {
+ OffsetNumber off;
+ OffsetNumber maxoff;
+ BrinSpecialSpace *special;
+ Page page;
+ bool extended = false;
+
+ page = BufferGetPage(buf);
+ special = (BrinSpecialSpace *) PageGetSpecialPointer(page);
+
+ Assert(special->flags & BRIN_EVACUATE_PAGE);
+
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (off = FirstOffsetNumber; off <= maxoff; off++)
+ {
+ BrTuple *tup;
+ Size sz;
+ ItemId lp;
+
+ CHECK_FOR_INTERRUPTS();
+
+ lp = PageGetItemId(page, off);
+ if (ItemIdIsUsed(lp))
+ {
+ sz = ItemIdGetLength(lp);
+ tup = (BrTuple *) PageGetItem(page, lp);
+ tup = brin_copy_tuple(tup, sz);
+
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+ if (!brin_doupdate(idxRel, pagesPerRange, rmAccess, tup->bt_blkno,
+ buf, off, tup, sz, tup, sz, false, &extended))
+ off--; /* retry */
+
+ LockBuffer(buf, BUFFER_LOCK_SHARE);
+
+ /* It's possible that someone extended the revmap over this page */
+ if (!BRIN_IS_REGULAR_PAGE(page))
+ break;
+ }
+ }
+
+ UnlockReleaseBuffer(buf);
+
+ if (extended)
+ FreeSpaceMapVacuum(idxRel);
+ }
+
+ /*
+ * Return a pinned and exclusively locked buffer which can be used to insert an
+ * index item of size itemsz. If oldbuf is a valid buffer, it is also locked
+ * (in an order chosen to avoid deadlocks.)
+ *
+ * If there's no existing page with enough free space to accommodate the new
+ * item, the relation is extended. If this happens, *extended is set to true.
+ *
+ * If we find that the old page is no longer a regular index page (because
+ * of a revmap extension), the old buffer is unlocked and we return
+ * InvalidBuffer.
+ */
+ static Buffer
+ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
+ bool *was_extended)
+ {
+ BlockNumber oldblk;
+ BlockNumber newblk;
+ Page page;
+ int freespace;
+
+ if (BufferIsValid(oldbuf))
+ oldblk = BufferGetBlockNumber(oldbuf);
+ else
+ oldblk = InvalidBlockNumber;
+
+ /*
+ * Loop until we find a page with sufficient free space. By the time we
+ * return to caller out of this loop, both buffers are valid and locked;
+ * if we have to restart here, neither buffer is locked and buf is not a
+ * pinned buffer.
+ */
+ newblk = RelationGetTargetBlock(irel);
+ if (newblk == InvalidBlockNumber)
+ newblk = GetPageWithFreeSpace(irel, itemsz);
+ for (;;)
+ {
+ Buffer buf;
+ bool extensionLockHeld = false;
+ bool extended = false;
+
+ CHECK_FOR_INTERRUPTS();
+
+ if (newblk == InvalidBlockNumber)
+ {
+ /*
+ * There's not enough free space in any existing index page,
+ * according to the FSM: extend the relation to obtain a shiny new
+ * page.
+ */
+ if (!RELATION_IS_LOCAL(irel))
+ {
+ LockRelationForExtension(irel, ExclusiveLock);
+ extensionLockHeld = true;
+ }
+ buf = ReadBuffer(irel, P_NEW);
+ newblk = BufferGetBlockNumber(buf);
+ *was_extended = extended = true;
+
+ BRIN_elog(DEBUG2, "brin_getinsertbuffer: extending to page %u",
+ BufferGetBlockNumber(buf));
+ }
+ else if (newblk == oldblk)
+ {
+ /*
+ * There's an odd corner-case here where the FSM is out-of-date,
+ * and gave us the old page.
+ */
+ buf = oldbuf;
+ }
+ else
+ {
+ buf = ReadBuffer(irel, newblk);
+ }
+
+ /*
+ * We lock the old buffer first, if it's earlier than the new one; but
+ * before we do, we need to check that it hasn't been turned into a
+ * revmap page concurrently; if we detect that it happened, give up and
+ * tell caller to start over.
+ */
+ if (BufferIsValid(oldbuf) && oldblk < newblk)
+ {
+ LockBuffer(oldbuf, BUFFER_LOCK_EXCLUSIVE);
+ if (!BRIN_IS_REGULAR_PAGE(BufferGetPage(oldbuf)))
+ {
+ LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+ ReleaseBuffer(buf);
+ return InvalidBuffer;
+ }
+ }
+
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+ if (extensionLockHeld)
+ UnlockRelationForExtension(irel, ExclusiveLock);
+
+ page = BufferGetPage(buf);
+
+ if (extended)
+ brin_page_init(page, BRIN_PAGETYPE_REGULAR);
+
+ /*
+ * We have a new buffer to insert into. Check that the new page has
+ * enough free space, and return it if it does; otherwise start over.
+ * Note that we allow for the FSM to be out of date here, and in that
+ * case we update it and move on.
+ *
+ * (br_page_get_freespace also checks that the FSM didn't hand us a
+ * page that has since been repurposed for the revmap.)
+ */
+ freespace = br_page_get_freespace(page);
+ if (freespace >= itemsz)
+ {
+ RelationSetTargetBlock(irel, BufferGetBlockNumber(buf));
+
+ /*
+ * Lock the old buffer if not locked already. Note that in this
+ * case we know for sure it's a regular page: it's later than the
+ * new page we just got, which is not a revmap page, and revmap
+ * pages are always consecutive.
+ */
+ if (BufferIsValid(oldbuf) && oldblk > newblk)
+ {
+ LockBuffer(oldbuf, BUFFER_LOCK_EXCLUSIVE);
+ Assert(BRIN_IS_REGULAR_PAGE(BufferGetPage(oldbuf)));
+ }
+
+ return buf;
+ }
+
+ /* This page is no good. */
+
+ /*
+ * If an entirely new page does not contain enough free space for the
+ * new item, then surely that item is oversized. Complain loudly; but
+ * first make sure we record the page as free, for next time.
+ */
+ if (extended)
+ {
+ RecordPageWithFreeSpace(irel, BufferGetBlockNumber(buf),
+ freespace);
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("index row size %lu exceeds maximum %lu for index \"%s\"",
+ (unsigned long) itemsz,
+ (unsigned long) freespace,
+ RelationGetRelationName(irel))));
+ return InvalidBuffer; /* keep compiler quiet */
+ }
+
+ if (newblk != oldblk)
+ UnlockReleaseBuffer(buf);
+ if (BufferIsValid(oldbuf) && oldblk <= newblk)
+ LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
+
+ newblk = RecordAndGetPageWithFreeSpace(irel, newblk, freespace, itemsz);
+ }
+ }
+
+ /*
+ * Return the amount of free space on a regular BRIN index page.
+ *
+ * If the page is not a regular page, or has been marked with the
+ * BRIN_EVACUATE_PAGE flag, returns 0.
+ */
+ static Size
+ br_page_get_freespace(Page page)
+ {
+ BrinSpecialSpace *special;
+
+ special = (BrinSpecialSpace *) PageGetSpecialPointer(page);
+ if (!BRIN_IS_REGULAR_PAGE(page) ||
+ (special->flags & BRIN_EVACUATE_PAGE) != 0)
+ return 0;
+ else
+ return PageGetFreeSpace(page);
+ }
*** /dev/null
--- b/src/backend/access/brin/brrevmap.c
***************
*** 0 ****
--- 1,473 ----
+ /*
+ * brrevmap.c
+ * Reverse range map for BRIN indexes
+ *
+ * The reverse range map (revmap) is a translation structure for BRIN indexes:
+ * for each page range there is one summary tuple, and its location is tracked
+ * by the revmap. Whenever a new tuple is inserted into a table that violates
+ * the previously recorded summary values, a new tuple is inserted into the
+ * index and the revmap is updated to point to it.
+ *
+ * The revmap is stored in the first pages of the index, immediately following
+ * the metapage. When the revmap needs to be expanded, all tuples on the
+ * regular BRIN page at that block (if any) are moved out of the way.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/brin/brrevmap.c
+ */
+ #include "postgres.h"
+
+ #include "access/xlog.h"
+ #include "access/brin_page.h"
+ #include "access/brin_pageops.h"
+ #include "access/brin_revmap.h"
+ #include "access/brin_tuple.h"
+ #include "access/brin_xlog.h"
+ #include "access/rmgr.h"
+ #include "miscadmin.h"
+ #include "storage/bufmgr.h"
+ #include "storage/lmgr.h"
+ #include "utils/rel.h"
+
+
+ /*
+ * In revmap pages, each item stores an ItemPointerData. These defines let one
+ * find the logical revmap page number and index number of the revmap item for
+ * the given heap block number.
+ */
+ #define HEAPBLK_TO_REVMAP_BLK(pagesPerRange, heapBlk) \
+ ((heapBlk / pagesPerRange) / REVMAP_PAGE_MAXITEMS)
+ #define HEAPBLK_TO_REVMAP_INDEX(pagesPerRange, heapBlk) \
+ ((heapBlk / pagesPerRange) % REVMAP_PAGE_MAXITEMS)
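+
+ /*
+ * For example, with pagesPerRange = 128, heap block 1000 belongs to range
+ * number 1000 / 128 = 7; its revmap entry is item 7 % REVMAP_PAGE_MAXITEMS
+ * on logical revmap page 7 / REVMAP_PAGE_MAXITEMS.
+ */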
+
+
+ struct brinRmAccess
+ {
+ Relation idxrel;
+ BlockNumber pagesPerRange;
+ BlockNumber lastRevmapPage; /* cached from the metapage */
+ Buffer metaBuf;
+ Buffer currBuf;
+ };
+
+ /* typedef appears in brin_revmap.h */
+
+
+ static BlockNumber rm_get_phys_blkno(brinRmAccess *rmAccess,
+ BlockNumber mapBlk, bool extend);
+ static void revmap_physical_extend(brinRmAccess *rmAccess);
+
+ /*
+ * Initialize an access object for a reverse range map, which can be used to
+ * read entries from it. The returned object must be freed via
+ * brinRevmapAccessTerminate when the caller is done with it.
+ */
+ brinRmAccess *
+ brinRevmapAccessInit(Relation idxrel, BlockNumber *pagesPerRange)
+ {
+ brinRmAccess *rmAccess;
+ Buffer meta;
+ BrinMetaPageData *metadata;
+
+ meta = ReadBuffer(idxrel, BRIN_METAPAGE_BLKNO);
+ LockBuffer(meta, BUFFER_LOCK_SHARE);
+ metadata = (BrinMetaPageData *) PageGetContents(BufferGetPage(meta));
+
+ rmAccess = palloc(sizeof(brinRmAccess));
+ rmAccess->idxrel = idxrel;
+ rmAccess->pagesPerRange = metadata->pagesPerRange;
+ rmAccess->lastRevmapPage = metadata->lastRevmapPage;
+ rmAccess->metaBuf = meta;
+ rmAccess->currBuf = InvalidBuffer;
+
+ *pagesPerRange = metadata->pagesPerRange;
+
+ LockBuffer(meta, BUFFER_LOCK_UNLOCK);
+
+ return rmAccess;
+ }
+
+ /*
+ * Release resources associated with a revmap access object.
+ */
+ void
+ brinRevmapAccessTerminate(brinRmAccess *rmAccess)
+ {
+ ReleaseBuffer(rmAccess->metaBuf);
+ if (rmAccess->currBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currBuf);
+ pfree(rmAccess);
+ }
+
+ /*
+ * Ensure the revmap is long enough to contain the entry for the given heap
+ * page; return the buffer containing that page. This buffer is also recorded
+ * in the rmAccess; finishing that releases the buffer, therefore the caller
+ * needn't do it explicitely.
+ */
+ Buffer
+ brinRevmapExtend(brinRmAccess *rmAccess, BlockNumber heapBlk)
+ {
+ BlockNumber mapBlk;
+
+ /*
+ * Translate the map block number to physical location. Note this extends
+ * the revmap, if necessary.
+ */
+ mapBlk = HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk);
+ mapBlk = rm_get_phys_blkno(rmAccess, mapBlk, true);
+ Assert(mapBlk != InvalidBlockNumber);
+
+ BRIN_elog(DEBUG2, "getting revmap page for logical page %lu (physical %u) for heap %u",
+ HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk),
+ mapBlk, heapBlk);
+
+ /*
+ * Obtain the buffer from which we need to read. If we already have the
+ * correct buffer in our access struct, use that; otherwise, release it
+ * (if valid) and read the one we need.
+ */
+ if (rmAccess->currBuf == InvalidBuffer ||
+ mapBlk != BufferGetBlockNumber(rmAccess->currBuf))
+ {
+ if (rmAccess->currBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currBuf);
+
+ rmAccess->currBuf = ReadBuffer(rmAccess->idxrel, mapBlk);
+ }
+
+ return rmAccess->currBuf;
+ }
+
+ /*
+ * Prepare to insert an entry into the revmap; the revmap buffer in which the
+ * entry is to reside is locked and returned. Most callers should call
+ * brinRevmapExtend first without holding any locks.
+ *
+ * The returned buffer is also recorded in rmAccess; terminating the access
+ * object releases the buffer, so the caller needn't do it explicitly.
+ */
+ Buffer
+ brinLockRevmapPageForUpdate(brinRmAccess *rmAccess, BlockNumber heapBlk)
+ {
+ Buffer rmBuf;
+
+ rmBuf = brinRevmapExtend(rmAccess, heapBlk);
+ LockBuffer(rmBuf, BUFFER_LOCK_EXCLUSIVE);
+
+ return rmBuf;
+ }
+
+ /*
+ * In the given revmap buffer (locked appropriately by caller), which is used
+ * in a BRIN index of pagesPerRange pages per range, set the element
+ * corresponding to heap block number heapBlk to the given TID.
+ *
+ * Once the operation is complete, the caller must update the LSN on the
+ * given buffer.
+ *
+ * This is used both in regular operation and during WAL replay.
+ */
+ void
+ brinSetHeapBlockItemptr(Buffer buf, BlockNumber pagesPerRange,
+ BlockNumber heapBlk, ItemPointerData tid)
+ {
+ RevmapContents *contents;
+ ItemPointerData *iptr;
+ Page page;
+
+ /* The correct page should already be pinned and locked */
+ page = BufferGetPage(buf);
+ contents = (RevmapContents *) PageGetContents(page);
+ iptr = (ItemPointerData *) contents->rm_tids;
+ iptr += HEAPBLK_TO_REVMAP_INDEX(pagesPerRange, heapBlk);
+
+ ItemPointerSet(iptr,
+ ItemPointerGetBlockNumber(&tid),
+ ItemPointerGetOffsetNumber(&tid));
+ }
+
+ /*
+ * Fetch the BrTuple for a given heap block.
+ *
+ * The buffer containing the tuple is locked, and returned in *buf. As an
+ * optimization, the caller can pass a pinned buffer *buf on entry, which will
+ * avoid a pin-unpin cycle when the next tuple is on the same page as previous
+ * one.
+ *
+ * If no tuple is found for the given heap range, returns NULL. In that case,
+ * *buf might still be updated, but it's not locked.
+ *
+ * The output tuple offset within the buffer is returned in *off, and its size
+ * is returned in *size.
+ */
+ BrTuple *
+ brinGetTupleForHeapBlock(brinRmAccess *rmAccess, BlockNumber heapBlk,
+ Buffer *buf, OffsetNumber *off, Size *size, int mode)
+ {
+ Relation idxRel = rmAccess->idxrel;
+ BlockNumber mapBlk;
+ RevmapContents *contents;
+ ItemPointerData *iptr;
+ BlockNumber blk;
+ Page page;
+ ItemId lp;
+ BrTuple *tup;
+ ItemPointerData previptr;
+
+ /* normalize the heap block number to be the first page in the range */
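+ /* (e.g. with pagesPerRange = 128, heap block 1000 normalizes to block 896) */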
+ heapBlk = (heapBlk / rmAccess->pagesPerRange) * rmAccess->pagesPerRange;
+
+ /* Compute the revmap page number we need */
+ mapBlk = HEAPBLK_TO_REVMAP_BLK(rmAccess->pagesPerRange, heapBlk);
+ mapBlk = rm_get_phys_blkno(rmAccess, mapBlk, false);
+ if (mapBlk == InvalidBlockNumber)
+ {
+ *off = InvalidOffsetNumber;
+ return NULL;
+ }
+
+ ItemPointerSetInvalid(&previptr);
+ for (;;)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ if (rmAccess->currBuf == InvalidBuffer ||
+ BufferGetBlockNumber(rmAccess->currBuf) != mapBlk)
+ {
+ if (rmAccess->currBuf != InvalidBuffer)
+ ReleaseBuffer(rmAccess->currBuf);
+
+ Assert(mapBlk != InvalidBlockNumber);
+ rmAccess->currBuf = ReadBuffer(rmAccess->idxrel, mapBlk);
+ }
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_SHARE);
+
+ contents = (RevmapContents *)
+ PageGetContents(BufferGetPage(rmAccess->currBuf));
+ iptr = contents->rm_tids;
+ iptr += HEAPBLK_TO_REVMAP_INDEX(rmAccess->pagesPerRange, heapBlk);
+
+ if (!ItemPointerIsValid(iptr))
+ {
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_UNLOCK);
+ return NULL;
+ }
+
+ /*
+ * Save the current TID we got from the revmap; if we loop we can
+ * sanity-check that the new one is different. Otherwise we might be
+ * stuck looping forever if the revmap is somehow badly broken.
+ */
+ if (ItemPointerIsValid(&previptr) && ItemPointerEquals(&previptr, iptr))
+ ereport(ERROR,
+ /* FIXME improve message */
+ (errmsg("revmap was updated but still contains same TID as before")));
+ previptr = *iptr;
+
+ blk = ItemPointerGetBlockNumber(iptr);
+ *off = ItemPointerGetOffsetNumber(iptr);
+
+ LockBuffer(rmAccess->currBuf, BUFFER_LOCK_UNLOCK);
+
+ /* Ok, got a pointer to where the BrTuple should be. Fetch it. */
+ if (!BufferIsValid(*buf) || BufferGetBlockNumber(*buf) != blk)
+ {
+ if (BufferIsValid(*buf))
+ ReleaseBuffer(*buf);
+ *buf = ReadBuffer(idxRel, blk);
+ }
+ LockBuffer(*buf, mode);
+ page = BufferGetPage(*buf);
+
+ /* If we land on a revmap page, start over */
+ if (BRIN_IS_REGULAR_PAGE(page))
+ {
+ lp = PageGetItemId(page, *off);
+ if (ItemIdIsUsed(lp))
+ {
+ tup = (BrTuple *) PageGetItem(page, lp);
+
+ if (tup->bt_blkno == heapBlk)
+ {
+ if (size)
+ *size = ItemIdGetLength(lp);
+ /* found it! */
+ return tup;
+ }
+ }
+ }
+
+ /*
+ * No luck. Assume that the revmap was updated concurrently.
+ */
+ LockBuffer(*buf, BUFFER_LOCK_UNLOCK);
+ }
+ /* not reached, but keep compiler quiet */
+ return NULL;
+ }
+
+ /*
+ * Given a logical revmap block number, find its physical block number.
+ *
+ * If extend is set to true, and the revmap page hasn't been allocated yet,
+ * extend the revmap until it is.
+ */
+ static BlockNumber
+ rm_get_phys_blkno(brinRmAccess *rmAccess, BlockNumber mapBlk, bool extend)
+ {
+ BlockNumber targetblk;
+
+ /* skip the metapage to obtain physical block numbers of revmap pages */
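+ /* (so logical revmap page 0 lives in physical block 1, and so on) */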
+ targetblk = mapBlk + 1;
+
+ /* Normal case: the revmap page is already allocated */
+ if (targetblk <= rmAccess->lastRevmapPage)
+ return targetblk;
+
+ if (!extend)
+ return InvalidBlockNumber;
+
+ /* Extend the revmap */
+ while (targetblk > rmAccess->lastRevmapPage)
+ revmap_physical_extend(rmAccess);
+
+ return targetblk;
+ }
+
+ /*
+ * Extend the revmap by one page.
+ *
+ * However, if the revmap was extended by someone else concurrently, we might
+ * return without actually doing anything.
+ *
+ * If there is an existing regular BRIN page at that block, its tuples are
+ * first moved out of the way (and the revmap entries updated to point to
+ * their new locations); the caller then retries the extension.
+ */
+ static void
+ revmap_physical_extend(brinRmAccess *rmAccess)
+ {
+ Buffer buf;
+ Page page;
+ Page metapage;
+ BrinMetaPageData *metadata;
+ BlockNumber mapBlk;
+ BlockNumber nblocks;
+ Relation irel = rmAccess->idxrel;
+ bool needLock = !RELATION_IS_LOCAL(irel);
+
+ /*
+ * Lock the metapage. This locks out concurrent extensions of the revmap,
+ * but note that we still need to grab the relation extension lock because
+ * another backend can extend the index with regular BRIN pages.
+ */
+ LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_EXCLUSIVE);
+ metapage = BufferGetPage(rmAccess->metaBuf);
+ metadata = (BrinMetaPageData *) PageGetContents(metapage);
+
+ /*
+ * Check that our cached lastRevmapPage value was up-to-date; if it
+ * wasn't, update the cached copy and have caller start over.
+ */
+ if (metadata->lastRevmapPage != rmAccess->lastRevmapPage)
+ {
+ rmAccess->lastRevmapPage = metadata->lastRevmapPage;
+ LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_UNLOCK);
+ return;
+ }
+ mapBlk = metadata->lastRevmapPage + 1;
+
+ nblocks = RelationGetNumberOfBlocks(irel);
+ if (mapBlk < nblocks)
+ {
+ buf = ReadBuffer(irel, mapBlk);
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ page = BufferGetPage(buf);
+ }
+ else
+ {
+ if (needLock)
+ LockRelationForExtension(irel, ExclusiveLock);
+
+ buf = ReadBuffer(irel, P_NEW);
+ if (BufferGetBlockNumber(buf) != mapBlk)
+ {
+ /*
+ * Very rare corner case: somebody extended the relation
+ * concurrently after we read its length. If this happens, give
+ * up and have caller start over. We will have to evacuate that
+ * page from under whoever is using it.
+ */
+ if (needLock)
+ UnlockRelationForExtension(irel, ExclusiveLock);
+ LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_UNLOCK);
+ return;
+ }
+ LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+ page = BufferGetPage(buf);
+
+ if (needLock)
+ UnlockRelationForExtension(irel, ExclusiveLock);
+ }
+
+ /* Check that it's a regular block (or an empty page) */
+ if (!PageIsNew(page) && !BRIN_IS_REGULAR_PAGE(page))
+ elog(ERROR, "unexpected BRIN page type: 0x%04X",
+ BRIN_PAGE_TYPE(page));
+
+ /* If the page is in use, evacuate it and restart */
+ if (brin_start_evacuating_page(irel, buf))
+ {
+ LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_UNLOCK);
+ brin_evacuate_page(irel, rmAccess->pagesPerRange, rmAccess, buf);
+
+ /* have caller start over */
+ return;
+ }
+
+ /*
+ * Ok, we have now locked the metapage and the target block. Re-initialize
+ * it as a revmap page.
+ */
+ START_CRIT_SECTION();
+
+ /* the rm_tids array is initialized to all invalid by PageInit */
+ brin_page_init(page, BRIN_PAGETYPE_REVMAP);
+ MarkBufferDirty(buf);
+
+ metadata->lastRevmapPage = mapBlk;
+ MarkBufferDirty(rmAccess->metaBuf);
+
+ if (RelationNeedsWAL(rmAccess->idxrel))
+ {
+ xl_brin_revmap_extend xlrec;
+ XLogRecPtr recptr;
+ XLogRecData rdata;
+
+ xlrec.node = rmAccess->idxrel->rd_node;
+ xlrec.targetBlk = mapBlk;
+
+ rdata.data = (char *) &xlrec;
+ rdata.len = SizeOfBrinRevmapExtend;
+ rdata.buffer = InvalidBuffer;
+ rdata.buffer_std = false;
+ rdata.next = NULL;
+
+ /* FIXME don't we need to log the metapage buffer also? */
+
+ recptr = XLogInsert(RM_BRIN_ID, XLOG_BRIN_REVMAP_EXTEND, &rdata);
+ PageSetLSN(metapage, recptr);
+ PageSetLSN(page, recptr);
+ }
+
+ END_CRIT_SECTION();
+
+ LockBuffer(rmAccess->metaBuf, BUFFER_LOCK_UNLOCK);
+
+ UnlockReleaseBuffer(buf);
+ }
*** /dev/null
--- b/src/backend/access/brin/brtuple.c
***************
*** 0 ****
--- 1,552 ----
+ /*
+ * BRIN-specific tuples
+ * Method implementations for tuples in BRIN indexes.
+ *
+ * Intended usage is that code outside this file only deals with
+ * DeformedBrTuples, and converts to and from the on-disk representation
+ * through functions in this file.
+ *
+ * NOTES
+ *
+ * A BRIN tuple is similar to a heap tuple, with a few key differences. The
+ * first interesting difference is that the tuple header is much simpler, only
+ * containing its total length and a small area for flags. Also, the stored
+ * data does not match the relation tuple descriptor exactly: for each
+ * attribute in the descriptor, the index tuple carries an arbitrary number
+ * of values, depending on the opclass.
+ *
+ * Also, for each column of the index relation there are two null bits: one
+ * (hasnulls) stores whether any tuple within the page range has that column
+ * set to null; the other one (allnulls) stores whether the column values are
+ * all null. If allnulls is true, then the tuple data area does not contain
+ * values for that column at all; if only hasnulls is set, the data area does
+ * contain values. Note that the size of the null bitmask may not be the same
+ * as that of the datum array.
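+ *
+ * For example, in a two-column index the bitmask holds four bits, laid out
+ * as allnulls(col1), allnulls(col2), hasnulls(col1), hasnulls(col2); the
+ * hasnulls bits always follow all of the allnulls bits.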
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/brin/brtuple.c
+ */
+ #include "postgres.h"
+
+ #include "access/htup_details.h"
+ #include "access/brin_tuple.h"
+ #include "access/tupdesc.h"
+ #include "access/tupmacs.h"
+ #include "utils/datum.h"
+
+
+ static inline void br_deconstruct_tuple(BrinDesc *brdesc,
+ char *tp, bits8 *nullbits, bool nulls,
+ Datum *values, bool *allnulls, bool *hasnulls);
+
+
+ /*
+ * Return a tuple descriptor used for on-disk storage of BRIN tuples.
+ */
+ static TupleDesc
+ brtuple_disk_tupdesc(BrinDesc *brdesc)
+ {
+ /* We cache these in the BrinDesc */
+ if (brdesc->bd_disktdesc == NULL)
+ {
+ int i;
+ int j;
+ AttrNumber attno = 1;
+ TupleDesc tupdesc;
+
+ tupdesc = CreateTemplateTupleDesc(brdesc->bd_totalstored, false);
+
+ for (i = 0; i < brdesc->bd_tupdesc->natts; i++)
+ {
+ for (j = 0; j < brdesc->bd_info[i]->oi_nstored; j++)
+ TupleDescInitEntry(tupdesc, attno++, NULL,
+ brdesc->bd_info[i]->oi_typids[j],
+ -1, 0);
+ }
+
+ brdesc->bd_disktdesc = tupdesc;
+ }
+
+ return brdesc->bd_disktdesc;
+ }
+
+ /*
+ * Generate a new on-disk tuple to be inserted in a BRIN index.
+ *
+ * See brin_form_placeholder_tuple if you touch this.
+ */
+ BrTuple *
+ brin_form_tuple(BrinDesc *brdesc, BlockNumber blkno,
+ DeformedBrTuple *tuple, Size *size)
+ {
+ Datum *values;
+ bool *nulls;
+ bool anynulls = false;
+ BrTuple *rettuple;
+ int keyno;
+ int idxattno;
+ uint16 phony_infomask;
+ bits8 *phony_nullbitmap;
+ Size len,
+ hoff,
+ data_len;
+
+ Assert(brdesc->bd_totalstored > 0);
+
+ values = palloc(sizeof(Datum) * brdesc->bd_totalstored);
+ nulls = palloc0(sizeof(bool) * brdesc->bd_totalstored);
+ phony_nullbitmap = palloc(sizeof(bits8) * BITMAPLEN(brdesc->bd_totalstored));
+
+ /*
+ * Set up the values/nulls arrays for heap_fill_tuple
+ */
+ idxattno = 0;
+ for (keyno = 0; keyno < brdesc->bd_tupdesc->natts; keyno++)
+ {
+ int datumno;
+
+ /*
+ * "allnulls" is set when there's no nonnull value in any row in the
+ * column; when this happens, there is no data to store. Thus set the
+ * nullable bits for all data elements of this column and we're done.
+ */
+ if (tuple->dt_columns[keyno].allnulls)
+ {
+ for (datumno = 0;
+ datumno < brdesc->bd_info[keyno]->oi_nstored;
+ datumno++)
+ nulls[idxattno++] = true;
+ anynulls = true;
+ continue;
+ }
+
+ /*
+ * The "hasnulls" bit is set when there are some null values in the
+ * data. We still need to store a real value, but the presence of
+ * this means we need a null bitmap.
+ */
+ if (tuple->dt_columns[keyno].hasnulls)
+ anynulls = true;
+
+ for (datumno = 0;
+ datumno < brdesc->bd_info[keyno]->oi_nstored;
+ datumno++)
+ values[idxattno++] = tuple->dt_columns[keyno].values[datumno];
+ }
+
+ /* compute total space needed */
+ len = SizeOfBrinTuple;
+ if (anynulls)
+ {
+ /*
+ * We need a double-length bitmap on an on-disk BRIN index tuple; the
+ * first half stores the "allnulls" bits, the second stores
+ * "hasnulls".
+ */
+ len += BITMAPLEN(brdesc->bd_tupdesc->natts * 2);
+ }
+
+ len = hoff = MAXALIGN(len);
+
+ data_len = heap_compute_data_size(brtuple_disk_tupdesc(brdesc),
+ values, nulls);
+
+ len += data_len;
+
+ rettuple = palloc0(len);
+ rettuple->bt_blkno = blkno;
+ rettuple->bt_info = hoff;
+ Assert((rettuple->bt_info & BRIN_OFFSET_MASK) == hoff);
+
+ /*
+ * The infomask and null bitmap as computed by heap_fill_tuple are useless
+ * to us. However, that function will not accept a null infomask; and we
+ * need to pass a valid null bitmap so that it will correctly skip
+ * outputting null attributes in the data area.
+ */
+ heap_fill_tuple(brtuple_disk_tupdesc(brdesc),
+ values,
+ nulls,
+ (char *) rettuple + hoff,
+ data_len,
+ &phony_infomask,
+ phony_nullbitmap);
+
+ /* done with these */
+ pfree(values);
+ pfree(nulls);
+ pfree(phony_nullbitmap);
+
+ /*
+ * Now fill in the real null bitmasks. allnulls first.
+ */
+ if (anynulls)
+ {
+ bits8 *bitP;
+ int bitmask;
+
+ rettuple->bt_info |= BRIN_NULLS_MASK;
+
+ /*
+ * Note that we reverse the sense of null bits in this module: we
+ * store a 1 for a null attribute rather than a 0. So we must reverse
+ * the sense of the att_isnull test in br_deconstruct_tuple as well.
+ */
+ bitP = ((bits8 *) ((char *) rettuple + SizeOfBrinTuple)) - 1;
+ bitmask = HIGHBIT;
+ for (keyno = 0; keyno < brdesc->bd_tupdesc->natts; keyno++)
+ {
+ if (bitmask != HIGHBIT)
+ bitmask <<= 1;
+ else
+ {
+ bitP += 1;
+ *bitP = 0x0;
+ bitmask = 1;
+ }
+
+ if (!tuple->dt_columns[keyno].allnulls)
+ continue;
+
+ *bitP |= bitmask;
+ }
+ /* hasnulls bits follow */
+ for (keyno = 0; keyno < brdesc->bd_tupdesc->natts; keyno++)
+ {
+ if (bitmask != HIGHBIT)
+ bitmask <<= 1;
+ else
+ {
+ bitP += 1;
+ *bitP = 0x0;
+ bitmask = 1;
+ }
+
+ if (!tuple->dt_columns[keyno].hasnulls)
+ continue;
+
+ *bitP |= bitmask;
+ }
+ }
+
+ if (tuple->dt_placeholder)
+ rettuple->bt_info |= BRIN_PLACEHOLDER_MASK;
+
+ *size = len;
+ return rettuple;
+ }
+
+ /*
+ * Generate a new on-disk tuple with no data values, marked as placeholder.
+ *
+ * This is a cut-down version of brin_form_tuple.
+ */
+ BrTuple *
+ brin_form_placeholder_tuple(BrinDesc *brdesc, BlockNumber blkno, Size *size)
+ {
+ Size len;
+ Size hoff;
+ BrTuple *rettuple;
+ int keyno;
+ bits8 *bitP;
+ int bitmask;
+
+ /* compute total space needed: always add nulls */
+ len = SizeOfBrinTuple;
+ len += BITMAPLEN(brdesc->bd_tupdesc->natts * 2);
+ len = hoff = MAXALIGN(len);
+
+ rettuple = palloc0(len);
+ rettuple->bt_blkno = blkno;
+ rettuple->bt_info = hoff;
+ rettuple->bt_info |= BRIN_NULLS_MASK | BRIN_PLACEHOLDER_MASK;
+
+ bitP = ((bits8 *) ((char *) rettuple + SizeOfBrinTuple)) - 1;
+ bitmask = HIGHBIT;
+ /* set allnulls true for all attributes */
+ for (keyno = 0; keyno < brdesc->bd_tupdesc->natts; keyno++)
+ {
+ if (bitmask != HIGHBIT)
+ bitmask <<= 1;
+ else
+ {
+ bitP += 1;
+ *bitP = 0x0;
+ bitmask = 1;
+ }
+
+ *bitP |= bitmask;
+ }
+ /* no need to set hasnulls */
+
+ *size = len;
+ return rettuple;
+ }
+
+ /*
+ * Free a tuple created by brin_form_tuple
+ */
+ void
+ brin_free_tuple(BrTuple *tuple)
+ {
+ pfree(tuple);
+ }
+
+ BrTuple *
+ brin_copy_tuple(BrTuple *tuple, Size len)
+ {
+ BrTuple *newtup;
+
+ newtup = palloc(len);
+ memcpy(newtup, tuple, len);
+
+ return newtup;
+ }
+
+ bool
+ brin_tuples_equal(const BrTuple *a, Size alen, const BrTuple *b, Size blen)
+ {
+ if (alen != blen)
+ return false;
+ if (memcmp(a, b, alen) != 0)
+ return false;
+ return true;
+ }
+
+ /*
+ * Create a new DeformedBrTuple from scratch, and initialize it to an empty
+ * state.
+ */
+ DeformedBrTuple *
+ brin_new_dtuple(BrinDesc *brdesc)
+ {
+ DeformedBrTuple *dtup;
+ char *currdatum;
+ long basesize;
+ int i;
+
+ basesize = MAXALIGN(sizeof(DeformedBrTuple) +
+ sizeof(BrinValues) * brdesc->bd_tupdesc->natts);
+ dtup = palloc0(basesize + sizeof(Datum) * brdesc->bd_totalstored);
+ currdatum = (char *) dtup + basesize;
+ for (i = 0; i < brdesc->bd_tupdesc->natts; i++)
+ {
+ dtup->dt_columns[i].allnulls = true;
+ dtup->dt_columns[i].hasnulls = false;
+ dtup->dt_columns[i].values = (Datum *) currdatum;
+ currdatum += sizeof(Datum) * brdesc->bd_info[i]->oi_nstored;
+ }
+
+ return dtup;
+ }
+
+ /*
+ * Reset a DeformedBrTuple to initial state
+ */
+ void
+ brin_dtuple_initialize(DeformedBrTuple *dtuple, BrinDesc *brdesc)
+ {
+ int i;
+ int j;
+
+ for (i = 0; i < brdesc->bd_tupdesc->natts; i++)
+ {
+ if (!brdesc->bd_tupdesc->attrs[i]->attbyval &&
+ !dtuple->dt_columns[i].allnulls)
+ for (j = 0; j < brdesc->bd_info[i]->oi_nstored; j++)
+ pfree(DatumGetPointer(dtuple->dt_columns[i].values[j]));
+ dtuple->dt_columns[i].allnulls = true;
+ dtuple->dt_columns[i].hasnulls = false;
+ memset(dtuple->dt_columns[i].values, 0,
+ sizeof(Datum) * brdesc->bd_info[i]->oi_nstored);
+ }
+ }
+
+ /*
+ * Convert a BrTuple back to a DeformedBrTuple. This is the reverse of
+ * brin_form_tuple.
+ *
+ * Note we don't need the "on disk tupdesc" here; we rely on our own routine to
+ * deconstruct the tuple from the on-disk format.
+ */
+ DeformedBrTuple *
+ brin_deform_tuple(BrinDesc *brdesc, BrTuple *tuple)
+ {
+ DeformedBrTuple *dtup;
+ Datum *values;
+ bool *allnulls;
+ bool *hasnulls;
+ char *tp;
+ bits8 *nullbits;
+ int keyno;
+ int valueno;
+
+ dtup = brin_new_dtuple(brdesc);
+
+ if (BrinTupleIsPlaceholder(tuple))
+ dtup->dt_placeholder = true;
+
+ values = palloc(sizeof(Datum) * brdesc->bd_totalstored);
+ allnulls = palloc(sizeof(bool) * brdesc->bd_tupdesc->natts);
+ hasnulls = palloc(sizeof(bool) * brdesc->bd_tupdesc->natts);
+
+ tp = (char *) tuple + BrinTupleDataOffset(tuple);
+
+ if (BrinTupleHasNulls(tuple))
+ nullbits = (bits8 *) ((char *) tuple + SizeOfBrinTuple);
+ else
+ nullbits = NULL;
+ br_deconstruct_tuple(brdesc,
+ tp, nullbits, BrinTupleHasNulls(tuple),
+ values, allnulls, hasnulls);
+
+ /*
+ * Iterate to assign each of the values to the corresponding item in the
+ * values array of each column.
+ */
+ for (valueno = 0, keyno = 0; keyno < brdesc->bd_tupdesc->natts; keyno++)
+ {
+ int i;
+
+ if (allnulls[keyno])
+ {
+ valueno += brdesc->bd_info[keyno]->oi_nstored;
+ continue;
+ }
+
+ /*
+ * We would like to skip datumCopy'ing the values datum in some cases,
+ * caller permitting, but this would make life harder for
+ * brin_free_dtuple and brin_dtuple_initialize, so refrain.
+ */
+ for (i = 0; i < brdesc->bd_info[keyno]->oi_nstored; i++)
+ dtup->dt_columns[keyno].values[i] =
+ datumCopy(values[valueno++],
+ brdesc->bd_tupdesc->attrs[keyno]->attbyval,
+ brdesc->bd_tupdesc->attrs[keyno]->attlen);
+
+ dtup->dt_columns[keyno].hasnulls = hasnulls[keyno];
+ dtup->dt_columns[keyno].allnulls = false;
+ }
+
+ pfree(values);
+ pfree(allnulls);
+ pfree(hasnulls);
+
+ return dtup;
+ }
+
+ /* free resources allocated in a deformed tuple */
+ void
+ brin_free_dtuple(BrinDesc *bdesc, DeformedBrTuple *dtup)
+ {
+ int i;
+ int j;
+
+ /* if we had a mcxt to reset here .. */
+ for (i = 0; i < bdesc->bd_tupdesc->natts; i++)
+ {
+ if (!bdesc->bd_tupdesc->attrs[i]->attbyval &&
+ !dtup->dt_columns[i].allnulls)
+ for (j = 0; j < bdesc->bd_info[i]->oi_nstored; j++)
+ pfree(DatumGetPointer(dtup->dt_columns[i].values[j]));
+ }
+ pfree(dtup);
+ }
+
+ /*
+ * br_deconstruct_tuple
+ * Guts of attribute extraction from an on-disk BRIN tuple.
+ *
+ * Its arguments are:
+ * brdesc BRIN descriptor for the stored tuple
+ * tp pointer to the tuple data area
+ * nullbits pointer to the tuple nulls bitmask
+ * nulls "has nulls" bit in tuple infomask
+ * values output values, array of size brdesc->bd_totalstored
+ * allnulls output "allnulls", size brdesc->bd_tupdesc->natts
+ * hasnulls output "hasnulls", size brdesc->bd_tupdesc->natts
+ *
+ * Output arrays must have been allocated by caller.
+ */
+ static inline void
+ br_deconstruct_tuple(BrinDesc *brdesc,
+ char *tp, bits8 *nullbits, bool nulls,
+ Datum *values, bool *allnulls, bool *hasnulls)
+ {
+ int attnum;
+ int stored;
+ TupleDesc diskdsc;
+ long off;
+
+ /*
+ * First iterate over the attributes to obtain both null flags for each one.
+ * Note that we reverse the sense of the att_isnull test, because we store
+ * a 1 for a null value (whereas the att_isnull convention used elsewhere is
+ * that a set bit means not null). See brin_form_tuple.
+ */
+ for (attnum = 0; attnum < brdesc->bd_tupdesc->natts; attnum++)
+ {
+ /*
+ * the "all nulls" bit means that all values in the page range for
+ * this column are nulls. Therefore there are no values in the tuple
+ * data area.
+ */
+ allnulls[attnum] = nulls && !att_isnull(attnum, nullbits);
+
+ /*
+ * the "has nulls" bit means that some tuples have nulls, but others
+ * have not-null values. Therefore we know the tuple contains data
+ * for this column.
+ *
+ * The hasnulls bits follow the allnulls bits in the same bitmask.
+ */
+ hasnulls[attnum] =
+ nulls && !att_isnull(brdesc->bd_tupdesc->natts + attnum, nullbits);
+ }
+
+ /*
+ * Iterate to obtain each attribute's stored values. Note that since we
+ * may reuse attribute entries for more than one column, we cannot cache
+ * offsets here.
+ */
+ diskdsc = brtuple_disk_tupdesc(brdesc);
+ stored = 0;
+ off = 0;
+ for (attnum = 0; attnum < brdesc->bd_tupdesc->natts; attnum++)
+ {
+ int datumno;
+
+ if (allnulls[attnum])
+ {
+ stored += brdesc->bd_info[attnum]->oi_nstored;
+ continue;
+ }
+
+ for (datumno = 0;
+ datumno < brdesc->bd_info[attnum]->oi_nstored;
+ datumno++)
+ {
+ Form_pg_attribute thisatt = diskdsc->attrs[stored];
+
+ if (thisatt->attlen == -1)
+ {
+ off = att_align_pointer(off, thisatt->attalign, -1,
+ tp + off);
+ }
+ else
+ {
+ /* not varlena, so safe to use att_align_nominal */
+ off = att_align_nominal(off, thisatt->attalign);
+ }
+
+ values[stored++] = fetchatt(thisatt, tp + off);
+
+ off = att_addlength_pointer(off, thisatt->attlen, tp + off);
+ }
+ }
+ }
*** /dev/null
--- b/src/backend/access/brin/brxlog.c
***************
*** 0 ****
--- 1,323 ----
+ /*
+ * brxlog.c
+ * XLog replay routines for BRIN indexes
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/brin/brxlog.c
+ */
+ #include "postgres.h"
+
+ #include "access/brin.h"
+ #include "access/brin_internal.h"
+ #include "access/brin_page.h"
+ #include "access/brin_revmap.h"
+ #include "access/brin_tuple.h"
+ #include "access/brin_xlog.h"
+ #include "access/xlogutils.h"
+ #include "storage/freespace.h"
+
+
+ /*
+ * xlog replay routines
+ */
+ static void
+ brin_xlog_createidx(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_brin_createidx *xlrec = (xl_brin_createidx *) XLogRecGetData(record);
+ Buffer buf;
+ Page page;
+
+ /* Backup blocks are not used in create_index records */
+ Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+
+ /* create the index's metapage */
+ buf = XLogReadBuffer(xlrec->node, BRIN_METAPAGE_BLKNO, true);
+ Assert(BufferIsValid(buf));
+ page = (Page) BufferGetPage(buf);
+ brin_metapage_init(page, xlrec->pagesPerRange, xlrec->version);
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ UnlockReleaseBuffer(buf);
+
+ /* also initialize its first revmap page */
+ buf = XLogReadBuffer(xlrec->node, 1, true);
+ Assert(BufferIsValid(buf));
+ page = (Page) BufferGetPage(buf);
+ brin_page_init(page, BRIN_PAGETYPE_REVMAP);
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ UnlockReleaseBuffer(buf);
+ }
+
+ /*
+ * Common part of an insert or update. Inserts the new tuple and updates the
+ * revmap.
+ */
+ static void
+ brin_xlog_insert_update(XLogRecPtr lsn, XLogRecord *record, xl_brin_insert *xlrec,
+ BrTuple *tuple, int tuplen)
+ {
+ BlockNumber blkno;
+ Buffer buffer;
+ Page page;
+ OffsetNumber offnum;
+
+ /* If we have a full-page image, restore it */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ }
+ else
+ {
+ Assert(tuple->bt_blkno == xlrec->heapBlk);
+
+ blkno = ItemPointerGetBlockNumber(&(xlrec->tid));
+ if (record->xl_info & XLOG_BRIN_INIT_PAGE)
+ {
+ buffer = XLogReadBuffer(xlrec->node, blkno, true);
+ Assert(BufferIsValid(buffer));
+ page = (Page) BufferGetPage(buffer);
+
+ brin_page_init(page, BRIN_PAGETYPE_REGULAR);
+ }
+ else
+ {
+ buffer = XLogReadBuffer(xlrec->node, blkno, false);
+ }
+ if (BufferIsValid(buffer))
+ {
+ page = (Page) BufferGetPage(buffer);
+
+ if (lsn > PageGetLSN(page))
+ {
+ offnum = ItemPointerGetOffsetNumber(&(xlrec->tid));
+ if (PageGetMaxOffsetNumber(page) + 1 < offnum)
+ elog(PANIC, "brin_xlog_insert_update: invalid max offset number");
+
+ offnum = PageAddItem(page, (Item) tuple, tuplen, offnum, true, false);
+ if (offnum == InvalidOffsetNumber)
+ elog(PANIC, "brin_xlog_insert_update: failed to add tuple");
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ }
+ UnlockReleaseBuffer(buffer);
+ }
+ }
+
+ /* update the revmap */
+ if (record->xl_info & XLR_BKP_BLOCK(1))
+ {
+ (void) RestoreBackupBlock(lsn, record, 1, false, false);
+ }
+ else
+ {
+ buffer = XLogReadBuffer(xlrec->node, xlrec->revmapBlk, false);
+ if (BufferIsValid(buffer))
+ {
+ page = (Page) BufferGetPage(buffer);
+
+ if (lsn > PageGetLSN(page))
+ {
+ brinSetHeapBlockItemptr(buffer, xlrec->pagesPerRange, xlrec->heapBlk, xlrec->tid);
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ }
+ UnlockReleaseBuffer(buffer);
+ }
+ }
+
+ /* XXX no FSM updates here ... */
+ }
+
+ static void
+ brin_xlog_insert(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_brin_insert *xlrec = (xl_brin_insert *) XLogRecGetData(record);
+ BrTuple *newtup;
+ int tuplen;
+
+ tuplen = record->xl_len - SizeOfBrinInsert;
+ newtup = (BrTuple *) ((char *) xlrec + SizeOfBrinInsert);
+
+ brin_xlog_insert_update(lsn, record, xlrec, newtup, tuplen);
+ }
+
+ static void
+ brin_xlog_update(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_brin_update *xlrec = (xl_brin_update *) XLogRecGetData(record);
+ BlockNumber blkno;
+ OffsetNumber offnum;
+ Buffer buffer;
+ Page page;
+ BrTuple *newtup;
+ int tuplen;
+
+ tuplen = record->xl_len - SizeOfBrinUpdate;
+ newtup = (BrTuple *) ((char *) xlrec + SizeOfBrinUpdate);
+
+ /* First insert the new tuple and update revmap, like in an insertion. */
+ brin_xlog_insert_update(lsn, record, &xlrec->new, newtup, tuplen);
+
+ /* Then remove the old tuple */
+ if (record->xl_info & XLR_BKP_BLOCK(2))
+ {
+ (void) RestoreBackupBlock(lsn, record, 2, false, false);
+ }
+ else
+ {
+ blkno = ItemPointerGetBlockNumber(&(xlrec->oldtid));
+ buffer = XLogReadBuffer(xlrec->new.node, blkno, false);
+ if (BufferIsValid(buffer))
+ {
+ page = (Page) BufferGetPage(buffer);
+
+ if (lsn > PageGetLSN(page))
+ {
+ offnum = ItemPointerGetOffsetNumber(&(xlrec->oldtid));
+ if (PageGetMaxOffsetNumber(page) + 1 < offnum)
+ elog(PANIC, "brin_xlog_update: invalid max offset number");
+
+ PageIndexDeleteNoCompact(page, &offnum, 1);
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ }
+ UnlockReleaseBuffer(buffer);
+ }
+ }
+ }
+
+ /*
+ * Update a tuple on a single page.
+ */
+ static void
+ brin_xlog_samepage_update(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_brin_samepage_update *xlrec = (xl_brin_samepage_update *) XLogRecGetData(record);
+ BlockNumber blkno;
+ Buffer buffer;
+ Page page;
+ OffsetNumber offnum;
+
+ /* If we have a full-page image, restore it */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ }
+ else
+ {
+ BrTuple *brintuple;
+ int tuplen;
+
+ tuplen = record->xl_len - SizeOfBrinSamepageUpdate;
+ brintuple = (BrTuple *) ((char *) xlrec + SizeOfBrinSamepageUpdate);
+
+ blkno = ItemPointerGetBlockNumber(&(xlrec->tid));
+ buffer = XLogReadBuffer(xlrec->node, blkno, false);
+ if (BufferIsValid(buffer))
+ {
+ page = (Page) BufferGetPage(buffer);
+
+ if (lsn > PageGetLSN(page))
+ {
+ offnum = ItemPointerGetOffsetNumber(&(xlrec->tid));
+ if (PageGetMaxOffsetNumber(page) + 1 < offnum)
+ elog(PANIC, "brin_xlog_samepage_update: invalid max offset number");
+
+ PageIndexDeleteNoCompact(page, &offnum, 1);
+ offnum = PageAddItem(page, (Item) brintuple, tuplen, offnum, true, false);
+ if (offnum == InvalidOffsetNumber)
+ elog(PANIC, "brin_xlog_samepage_update: failed to add tuple");
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ }
+ UnlockReleaseBuffer(buffer);
+ }
+ }
+
+ /* XXX no FSM updates here ... */
+ }
+
+
+ static void
+ brin_xlog_revmap_extend(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_brin_revmap_extend *xlrec = (xl_brin_revmap_extend *) XLogRecGetData(record);
+ Buffer metabuf;
+ Page metapg;
+ BrinMetaPageData *metadata;
+ Buffer buf;
+ Page page;
+
+ /* Update the metapage */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ metabuf = RestoreBackupBlock(lsn, record, 0, false, true);
+ }
+ else
+ {
+ metabuf = XLogReadBuffer(xlrec->node, BRIN_METAPAGE_BLKNO, false);
+ if (BufferIsValid(metabuf))
+ {
+ metapg = BufferGetPage(metabuf);
+ if (lsn > PageGetLSN(metapg))
+ {
+ metadata = (BrinMetaPageData *) PageGetContents(metapg);
+
+ Assert(metadata->lastRevmapPage == xlrec->targetBlk - 1);
+ metadata->lastRevmapPage = xlrec->targetBlk;
+
+ PageSetLSN(metapg, lsn);
+ MarkBufferDirty(metabuf);
+ }
+ }
+ }
+
+ /*
+ * Re-init the target block as a revmap page. There's never a full-page
+ * image here.
+ */
+
+ buf = XLogReadBuffer(xlrec->node, xlrec->targetBlk, true);
+ page = (Page) BufferGetPage(buf);
+ brin_page_init(page, BRIN_PAGETYPE_REVMAP);
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+
+ UnlockReleaseBuffer(buf);
+ UnlockReleaseBuffer(metabuf);
+ }
+
+ void
+ brin_redo(XLogRecPtr lsn, XLogRecord *record)
+ {
+ uint8 info = record->xl_info & ~XLR_INFO_MASK;
+
+ switch (info & XLOG_BRIN_OPMASK)
+ {
+ case XLOG_BRIN_CREATE_INDEX:
+ brin_xlog_createidx(lsn, record);
+ break;
+ case XLOG_BRIN_INSERT:
+ brin_xlog_insert(lsn, record);
+ break;
+ case XLOG_BRIN_UPDATE:
+ brin_xlog_update(lsn, record);
+ break;
+ case XLOG_BRIN_SAMEPAGE_UPDATE:
+ brin_xlog_samepage_update(lsn, record);
+ break;
+ case XLOG_BRIN_REVMAP_EXTEND:
+ brin_xlog_revmap_extend(lsn, record);
+ break;
+ default:
+ elog(PANIC, "brin_redo: unknown op code %u", info);
+ }
+ }
*** a/src/backend/access/common/reloptions.c
--- b/src/backend/access/common/reloptions.c
***************
*** 209,214 **** static relopt_int intRelOpts[] =
--- 209,221 ----
RELOPT_KIND_HEAP | RELOPT_KIND_TOAST
}, -1, 0, 2000000000
},
+ {
+ {
+ "pages_per_range",
+ "Number of pages that each page range covers in a BRIN index",
+ RELOPT_KIND_BRIN
+ }, 128, 1, 131072
+ },
/* list terminator */
{{NULL}}
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 270,275 **** initscan(HeapScanDesc scan, ScanKey key, bool is_rescan)
--- 270,277 ----
scan->rs_startblock = 0;
}
+ scan->rs_initblock = 0;
+ scan->rs_numblocks = InvalidBlockNumber;
scan->rs_inited = false;
scan->rs_ctup.t_data = NULL;
ItemPointerSetInvalid(&scan->rs_ctup.t_self);
***************
*** 295,300 **** initscan(HeapScanDesc scan, ScanKey key, bool is_rescan)
--- 297,310 ----
pgstat_count_heap_scan(scan->rs_rd);
}
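+ /*
+ * heap_setscanlimits - restrict range of a heapscan
+ *
+ * startBlk is the page to start at; numBlks is the number of pages to scan
+ * (initscan leaves it as InvalidBlockNumber, meaning scan to the end of the
+ * relation).
+ */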
+ void
+ heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk, BlockNumber numBlks)
+ {
+ scan->rs_startblock = startBlk;
+ scan->rs_initblock = startBlk;
+ scan->rs_numblocks = numBlks;
+ }
+
/*
* heapgetpage - subroutine for heapgettup()
*
***************
*** 635,641 **** heapgettup(HeapScanDesc scan,
*/
if (backward)
{
! finished = (page == scan->rs_startblock);
if (page == 0)
page = scan->rs_nblocks;
page--;
--- 645,652 ----
*/
if (backward)
{
! finished = (page == scan->rs_startblock) ||
! (scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
if (page == 0)
page = scan->rs_nblocks;
page--;
***************
*** 645,651 **** heapgettup(HeapScanDesc scan,
page++;
if (page >= scan->rs_nblocks)
page = 0;
! finished = (page == scan->rs_startblock);
/*
* Report our new scan position for synchronization purposes. We
--- 656,663 ----
page++;
if (page >= scan->rs_nblocks)
page = 0;
! finished = (page == scan->rs_startblock) ||
! (scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
/*
* Report our new scan position for synchronization purposes. We
***************
*** 896,902 **** heapgettup_pagemode(HeapScanDesc scan,
*/
if (backward)
{
! finished = (page == scan->rs_startblock);
if (page == 0)
page = scan->rs_nblocks;
page--;
--- 908,915 ----
*/
if (backward)
{
! finished = (page == scan->rs_startblock) ||
! (scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
if (page == 0)
page = scan->rs_nblocks;
page--;
***************
*** 906,912 **** heapgettup_pagemode(HeapScanDesc scan,
page++;
if (page >= scan->rs_nblocks)
page = 0;
! finished = (page == scan->rs_startblock);
/*
* Report our new scan position for synchronization purposes. We
--- 919,926 ----
page++;
if (page >= scan->rs_nblocks)
page = 0;
! finished = (page == scan->rs_startblock) ||
! (scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks <= 0 : false);
/*
* Report our new scan position for synchronization purposes. We
*** a/src/backend/access/rmgrdesc/Makefile
--- b/src/backend/access/rmgrdesc/Makefile
***************
*** 8,14 **** subdir = src/backend/access/rmgrdesc
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
! OBJS = clogdesc.o dbasedesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o smgrdesc.o spgdesc.o \
standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
--- 8,15 ----
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
! OBJS = brindesc.o clogdesc.o dbasedesc.o gindesc.o gistdesc.o \
! hashdesc.o heapdesc.o \
mxactdesc.o nbtdesc.o relmapdesc.o seqdesc.o smgrdesc.o spgdesc.o \
standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
*** /dev/null
--- b/src/backend/access/rmgrdesc/brindesc.c
***************
*** 0 ****
--- 1,89 ----
+ /*-------------------------------------------------------------------------
+ *
+ * brindesc.c
+ * rmgr descriptor routines for BRIN indexes
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/access/rmgrdesc/brindesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+ #include "postgres.h"
+
+ #include "access/brin_xlog.h"
+
+ void
+ brin_desc(StringInfo buf, XLogRecord *record)
+ {
+ char *rec = XLogRecGetData(record);
+ uint8 info = record->xl_info & ~XLR_INFO_MASK;
+
+ info &= XLOG_BRIN_OPMASK;
+ if (info == XLOG_BRIN_CREATE_INDEX)
+ {
+ xl_brin_createidx *xlrec = (xl_brin_createidx *) rec;
+
+ appendStringInfo(buf, "create index: v%d pagesPerRange %u %u/%u/%u",
+ xlrec->version, xlrec->pagesPerRange,
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode);
+ }
+ else if (info == XLOG_BRIN_INSERT)
+ {
+ xl_brin_insert *xlrec = (xl_brin_insert *) rec;
+
+ if (record->xl_info & XLOG_BRIN_INIT_PAGE)
+ appendStringInfo(buf, "insert(init): ");
+ else
+ appendStringInfo(buf, "insert: ");
+ appendStringInfo(buf, "%u/%u/%u heapBlk %u revmapBlk %u pagesPerRange %u TID (%u,%u)",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode,
+ xlrec->heapBlk, xlrec->revmapBlk,
+ xlrec->pagesPerRange,
+ ItemPointerGetBlockNumber(&xlrec->tid),
+ ItemPointerGetOffsetNumber(&xlrec->tid));
+ }
+ else if (info == XLOG_BRIN_UPDATE)
+ {
+ xl_brin_update *xlrec = (xl_brin_update *) rec;
+
+ if (record->xl_info & XLOG_BRIN_INIT_PAGE)
+ appendStringInfo(buf, "update(init): ");
+ else
+ appendStringInfo(buf, "update: ");
+ appendStringInfo(buf, "rel %u/%u/%u heapBlk %u revmapBlk %u pagesPerRange %u old TID (%u,%u) TID (%u,%u)",
+ xlrec->new.node.spcNode, xlrec->new.node.dbNode,
+ xlrec->new.node.relNode,
+ xlrec->new.heapBlk, xlrec->new.revmapBlk,
+ xlrec->new.pagesPerRange,
+ ItemPointerGetBlockNumber(&xlrec->oldtid),
+ ItemPointerGetOffsetNumber(&xlrec->oldtid),
+ ItemPointerGetBlockNumber(&xlrec->new.tid),
+ ItemPointerGetOffsetNumber(&xlrec->new.tid));
+ }
+ else if (info == XLOG_BRIN_SAMEPAGE_UPDATE)
+ {
+ xl_brin_samepage_update *xlrec = (xl_brin_samepage_update *) rec;
+
+ appendStringInfo(buf, "samepage_update: rel %u/%u/%u TID (%u,%u)",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode,
+ ItemPointerGetBlockNumber(&xlrec->tid),
+ ItemPointerGetOffsetNumber(&xlrec->tid));
+ }
+ else if (info == XLOG_BRIN_REVMAP_EXTEND)
+ {
+ xl_brin_revmap_extend *xlrec = (xl_brin_revmap_extend *) rec;
+
+ appendStringInfo(buf, "revmap extend: rel %u/%u/%u targetBlk %u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode, xlrec->targetBlk);
+ }
+ else
+ appendStringInfo(buf, "UNKNOWN");
+ }
*** a/src/backend/access/transam/rmgr.c
--- b/src/backend/access/transam/rmgr.c
***************
*** 12,17 ****
--- 12,18 ----
#include "access/gist_private.h"
#include "access/hash.h"
#include "access/heapam_xlog.h"
+ #include "access/brin_xlog.h"
#include "access/multixact.h"
#include "access/nbtree.h"
#include "access/spgist.h"
*** a/src/backend/catalog/index.c
--- b/src/backend/catalog/index.c
***************
*** 2097,2102 **** IndexBuildHeapScan(Relation heapRelation,
--- 2097,2123 ----
IndexBuildCallback callback,
void *callback_state)
{
+ return IndexBuildHeapRangeScan(heapRelation, indexRelation,
+ indexInfo, allow_sync,
+ 0, InvalidBlockNumber,
+ callback, callback_state);
+ }
+
+ /*
+ * As above, except that instead of scanning the complete heap, only the given
+ * range of blocks is scanned, starting at start_blockno. A scan to the end of
+ * the relation can be signalled by passing InvalidBlockNumber as numblocks.
+ */
+ double
+ IndexBuildHeapRangeScan(Relation heapRelation,
+ Relation indexRelation,
+ IndexInfo *indexInfo,
+ bool allow_sync,
+ BlockNumber start_blockno,
+ BlockNumber numblocks,
+ IndexBuildCallback callback,
+ void *callback_state)
+ {
bool is_system_catalog;
bool checking_uniqueness;
HeapScanDesc scan;
***************
*** 2167,2172 **** IndexBuildHeapScan(Relation heapRelation,
--- 2188,2196 ----
true, /* buffer access strategy OK */
allow_sync); /* syncscan OK? */
+ /* set our endpoints */
+ heap_setscanlimits(scan, start_blockno, numblocks);
+
reltuples = 0;
/*
*** a/src/backend/replication/logical/decode.c
--- b/src/backend/replication/logical/decode.c
***************
*** 132,137 **** LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogRecord *record)
--- 132,138 ----
case RM_GIST_ID:
case RM_SEQ_ID:
case RM_SPGIST_ID:
+ case RM_BRIN_ID:
break;
case RM_NEXT_ID:
elog(ERROR, "unexpected RM_NEXT_ID rmgr_id: %u", (RmgrIds) buf.record.xl_rmid);
*** a/src/backend/storage/page/bufpage.c
--- b/src/backend/storage/page/bufpage.c
***************
*** 399,405 **** PageRestoreTempPage(Page tempPage, Page oldPage)
}
/*
! * sorting support for PageRepairFragmentation and PageIndexMultiDelete
*/
typedef struct itemIdSortData
{
--- 399,406 ----
}
/*
! * sorting support for PageRepairFragmentation, PageIndexMultiDelete,
! * PageIndexDeleteNoCompact
*/
typedef struct itemIdSortData
{
***************
*** 896,901 **** PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems)
--- 897,1078 ----
phdr->pd_upper = upper;
}
+ /*
+ * PageIndexDeleteNoCompact
+ * Delete the given items for an index page, and defragment the resulting
+ * free space, but do not compact the item pointers array.
+ *
+ * itemnos is the array of tuples to delete; nitems is its size.
+ *
+ * Unused items at the end of the array are removed.
+ *
+ * This is used for index AMs that require that existing TIDs of live tuples
+ * remain unchanged.
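+ * (BRIN is one such AM: its revmap pages store the TIDs of the regular index
+ * tuples, so those TIDs must stay stable.)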
+ */
+ void
+ PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos, int nitems)
+ {
+ PageHeader phdr = (PageHeader) page;
+ LocationIndex pd_lower = phdr->pd_lower;
+ LocationIndex pd_upper = phdr->pd_upper;
+ LocationIndex pd_special = phdr->pd_special;
+ int nline;
+ bool empty;
+ OffsetNumber offnum;
+ int nextitm;
+
+ /*
+ * As with PageRepairFragmentation, paranoia seems justified.
+ */
+ if (pd_lower < SizeOfPageHeaderData ||
+ pd_lower > pd_upper ||
+ pd_upper > pd_special ||
+ pd_special > BLCKSZ ||
+ pd_special != MAXALIGN(pd_special))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u",
+ pd_lower, pd_upper, pd_special)));
+
+ /*
+ * Scan the existing item pointer array and mark as unused those that are
+ * in our kill-list; make sure any non-interesting ones are marked unused
+ * as well.
+ */
+ nline = PageGetMaxOffsetNumber(page);
+ empty = true;
+ nextitm = 0;
+ for (offnum = FirstOffsetNumber; offnum <= nline; offnum = OffsetNumberNext(offnum))
+ {
+ ItemId lp;
+ ItemLength itemlen;
+ ItemOffset offset;
+
+ lp = PageGetItemId(page, offnum);
+
+ itemlen = ItemIdGetLength(lp);
+ offset = ItemIdGetOffset(lp);
+
+ if (ItemIdIsUsed(lp))
+ {
+ if (offset < pd_upper ||
+ (offset + itemlen) > pd_special ||
+ offset != MAXALIGN(offset))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted item pointer: offset = %u, length = %u",
+ offset, (unsigned int) itemlen)));
+
+ if (nextitm < nitems && offnum == itemnos[nextitm])
+ {
+ /* this one is on our list to delete, so mark it unused */
+ ItemIdSetUnused(lp);
+ nextitm++;
+ }
+ else if (ItemIdHasStorage(lp))
+ {
+ /* This one's live -- must do the compaction dance */
+ empty = false;
+ }
+ else
+ {
+ /* get rid of this one too */
+ ItemIdSetUnused(lp);
+ }
+ }
+ }
+
+ /* this will catch invalid or out-of-order itemnos[] */
+ if (nextitm != nitems)
+ elog(ERROR, "incorrect index offsets supplied");
+
+ if (empty)
+ {
+ /* Page is completely empty, so just reset it quickly */
+ phdr->pd_lower = SizeOfPageHeaderData;
+ phdr->pd_upper = pd_special;
+ }
+ else
+ {
+ /* There are live items: need to compact the page the hard way */
+ itemIdSortData itemidbase[MaxOffsetNumber];
+ itemIdSort itemidptr;
+ int i;
+ Size totallen;
+ Offset upper;
+
+ /*
+ * Scan the page taking note of each item that we need to preserve.
+ * This includes both live items (those that contain data) and
+ * interspersed unused ones. It's critical to preserve these unused
+ * items, because otherwise the offset numbers for later live items
+ * would change, which is not acceptable. Unused items might get used
+ * again later; that is fine.
+ */
+ itemidptr = itemidbase;
+ totallen = 0;
+ for (i = 0; i < nline; i++, itemidptr++)
+ {
+ ItemId lp;
+
+ itemidptr->offsetindex = i;
+
+ lp = PageGetItemId(page, i + 1);
+ if (ItemIdHasStorage(lp))
+ {
+ itemidptr->itemoff = ItemIdGetOffset(lp);
+ itemidptr->alignedlen = MAXALIGN(ItemIdGetLength(lp));
+ totallen += itemidptr->alignedlen;
+ }
+ else
+ {
+ itemidptr->itemoff = 0;
+ itemidptr->alignedlen = 0;
+ }
+ }
+ /* By here, there are exactly nline elements in itemidbase array */
+
+ if (totallen > (Size) (pd_special - pd_lower))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted item lengths: total %u, available space %u",
+ (unsigned int) totallen, pd_special - pd_lower)));
+
+ /* sort itemIdSortData array into decreasing itemoff order */
+ qsort((char *) itemidbase, nline, sizeof(itemIdSortData),
+ itemoffcompare);
+
+ /*
+ * Defragment the data areas of each tuple, being careful to preserve
+ * each item's position in the linp array.
+ */
+ upper = pd_special;
+ PageClearHasFreeLinePointers(page);
+ for (i = 0, itemidptr = itemidbase; i < nline; i++, itemidptr++)
+ {
+ ItemId lp;
+
+ lp = PageGetItemId(page, itemidptr->offsetindex + 1);
+ if (itemidptr->alignedlen == 0)
+ {
+ PageSetHasFreeLinePointers(page);
+ ItemIdSetUnused(lp);
+ continue;
+ }
+ upper -= itemidptr->alignedlen;
+ memmove((char *) page + upper,
+ (char *) page + itemidptr->itemoff,
+ itemidptr->alignedlen);
+ lp->lp_off = upper;
+ /* lp_flags and lp_len remain the same as originally */
+ }
+
+ /* Set the new page limits */
+ phdr->pd_upper = upper;
+ phdr->pd_lower = SizeOfPageHeaderData + i * sizeof(ItemIdData);
+ }
+ }
/*
* Set checksum for a page in shared buffers.
*** a/src/backend/utils/adt/selfuncs.c
--- b/src/backend/utils/adt/selfuncs.c
***************
*** 6081,6087 **** genericcostestimate(PlannerInfo *root,
else
numIndexPages = 1.0;
! /* fetch estimated page cost for schema containing index */
get_tablespace_page_costs(index->reltablespace,
&spc_random_page_cost,
NULL);
--- 6081,6087 ----
else
numIndexPages = 1.0;
! /* fetch estimated page cost for tablespace containing index */
get_tablespace_page_costs(index->reltablespace,
&spc_random_page_cost,
NULL);
***************
*** 7162,7168 **** gincostestimate(PG_FUNCTION_ARGS)
JOIN_INNER,
NULL);
! /* fetch estimated page cost for schema containing index */
get_tablespace_page_costs(index->reltablespace,
&spc_random_page_cost,
NULL);
--- 7162,7168 ----
JOIN_INNER,
NULL);
! /* fetch estimated page cost for tablespace containing index */
get_tablespace_page_costs(index->reltablespace,
&spc_random_page_cost,
NULL);
***************
*** 7349,7351 **** gincostestimate(PG_FUNCTION_ARGS)
--- 7349,7421 ----
PG_RETURN_VOID();
}
+
+ /*
+ * BRIN has search behavior completely different from other index types
+ */
+ Datum
+ brincostestimate(PG_FUNCTION_ARGS)
+ {
+ PlannerInfo *root = (PlannerInfo *) PG_GETARG_POINTER(0);
+ IndexPath *path = (IndexPath *) PG_GETARG_POINTER(1);
+ double loop_count = PG_GETARG_FLOAT8(2);
+ Cost *indexStartupCost = (Cost *) PG_GETARG_POINTER(3);
+ Cost *indexTotalCost = (Cost *) PG_GETARG_POINTER(4);
+ Selectivity *indexSelectivity = (Selectivity *) PG_GETARG_POINTER(5);
+ double *indexCorrelation = (double *) PG_GETARG_POINTER(6);
+ IndexOptInfo *index = path->indexinfo;
+ List *indexQuals = path->indexquals;
+ List *indexOrderBys = path->indexorderbys;
+ double numPages = index->pages;
+ double numTuples = index->tuples;
+ Cost spc_seq_page_cost;
+ Cost spc_random_page_cost;
+ QualCost index_qual_cost;
+ double qual_op_cost;
+ double qual_arg_cost;
+
+ /* fetch estimated page cost for tablespace containing index */
+ get_tablespace_page_costs(index->reltablespace,
+ &spc_random_page_cost,
+ &spc_seq_page_cost);
+
+ /*
+ * BRIN indexes are always read in full; use that as startup cost.
+ * XXX maybe only include revmap pages here?
+ */
+ *indexStartupCost = spc_seq_page_cost * numPages * loop_count;
+
+ /*
+ * To read a BRIN index there might be a bit of back and forth over regular
+ * pages, as the revmap might point to them out of sequential order; account
+ * for this by costing it as reading the whole index in random order.
+ */
+ *indexTotalCost = spc_random_page_cost * numPages * loop_count;
+
+ *indexSelectivity =
+ clauselist_selectivity(root, path->indexquals,
+ path->indexinfo->rel->relid,
+ JOIN_INNER, NULL);
+ *indexCorrelation = 1;
+
+ /*
+ * Add on index qual eval costs, much as in genericcostestimate.
+ */
+ cost_qual_eval(&index_qual_cost, indexQuals, root);
+ qual_arg_cost = index_qual_cost.startup + index_qual_cost.per_tuple;
+ cost_qual_eval(&index_qual_cost, indexOrderBys, root);
+ qual_arg_cost += index_qual_cost.startup + index_qual_cost.per_tuple;
+ qual_op_cost = cpu_operator_cost *
+ (list_length(indexQuals) + list_length(indexOrderBys));
+ qual_arg_cost -= qual_op_cost;
+ if (qual_arg_cost < 0) /* just in case... */
+ qual_arg_cost = 0;
+
+ *indexStartupCost += qual_arg_cost;
+ *indexTotalCost += qual_arg_cost;
+ *indexTotalCost += (numTuples * *indexSelectivity) * (cpu_index_tuple_cost + qual_op_cost);
+
+ /* XXX what about pages_per_range? */
+
+ PG_RETURN_VOID();
+ }
*** /dev/null
--- b/src/include/access/brin.h
***************
*** 0 ****
--- 1,52 ----
+ /*
+ * AM-callable functions for BRIN indexes
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/brin.h
+ */
+ #ifndef BRIN_H
+ #define BRIN_H
+
+ #include "fmgr.h"
+ #include "nodes/execnodes.h"
+ #include "utils/relcache.h"
+
+
+ /*
+ * prototypes for functions in brin.c (external entry points for BRIN)
+ */
+ extern Datum brinbuild(PG_FUNCTION_ARGS);
+ extern Datum brinbuildempty(PG_FUNCTION_ARGS);
+ extern Datum brininsert(PG_FUNCTION_ARGS);
+ extern Datum brinbeginscan(PG_FUNCTION_ARGS);
+ extern Datum bringettuple(PG_FUNCTION_ARGS);
+ extern Datum bringetbitmap(PG_FUNCTION_ARGS);
+ extern Datum brinrescan(PG_FUNCTION_ARGS);
+ extern Datum brinendscan(PG_FUNCTION_ARGS);
+ extern Datum brinmarkpos(PG_FUNCTION_ARGS);
+ extern Datum brinrestrpos(PG_FUNCTION_ARGS);
+ extern Datum brinbulkdelete(PG_FUNCTION_ARGS);
+ extern Datum brinvacuumcleanup(PG_FUNCTION_ARGS);
+ extern Datum brincanreturn(PG_FUNCTION_ARGS);
+ extern Datum brincostestimate(PG_FUNCTION_ARGS);
+ extern Datum brinoptions(PG_FUNCTION_ARGS);
+
+ /*
+ * Storage type for BRIN's reloptions
+ */
+ typedef struct BrinOptions
+ {
+ int32 vl_len_; /* varlena header (do not touch directly!) */
+ BlockNumber pagesPerRange;
+ } BrinOptions;
+
+ #define BRIN_DEFAULT_PAGES_PER_RANGE 128
+ #define BrinGetPagesPerRange(relation) \
+ ((relation)->rd_options ? \
+ ((BrinOptions *) (relation)->rd_options)->pagesPerRange : \
+ BRIN_DEFAULT_PAGES_PER_RANGE)
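+
+ /*
+ * The default can be overridden per index through the pages_per_range
+ * reloption; a hypothetical usage sketch (not taken from this patch):
+ *   CREATE INDEX brinidx ON tab USING brin (col) WITH (pages_per_range = 32);
+ */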
+
+ #endif /* BRIN_H */
*** /dev/null
--- b/src/include/access/brin_internal.h
***************
*** 0 ****
--- 1,90 ----
+ /*
+ * brin_internal.h
+ * internal declarations for BRIN indexes
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/brin_internal.h
+ */
+ #ifndef BRIN_INTERNAL_H
+ #define BRIN_INTERNAL_H
+
+ #include "fmgr.h"
+ #include "storage/buf.h"
+ #include "storage/bufpage.h"
+ #include "storage/off.h"
+ #include "utils/relcache.h"
+
+
+ /*
+ * A BrinDesc is a struct designed to enable decoding a BRIN tuple from the
+ * on-disk format to a DeformedBrTuple and vice-versa.
+ */
+
+ /* struct returned by "OpcInfo" amproc */
+ typedef struct BrinOpcInfo
+ {
+ /* Number of columns stored in an index column of this opclass */
+ uint16 oi_nstored;
+
+ /* Opaque pointer for the opclass' private use */
+ void *oi_opaque;
+
+ /* Type IDs of the stored columns */
+ Oid oi_typids[FLEXIBLE_ARRAY_MEMBER];
+ } BrinOpcInfo;
+
+ /* the size of a BrinOpcInfo for the given number of columns */
+ #define SizeofBrinOpcInfo(ncols) \
+ (offsetof(BrinOpcInfo, oi_typids) + sizeof(Oid) * ncols)
+
+ typedef struct BrinDesc
+ {
+ /* Containing memory context */
+ MemoryContext bd_context;
+
+ /* the index relation itself */
+ Relation bd_index;
+
+ /* tuple descriptor of the index relation */
+ TupleDesc bd_tupdesc;
+
+ /* cached copy for on-disk tuples; generated at first use */
+ TupleDesc bd_disktdesc;
+
+ /* total number of Datum entries that are stored on-disk for all columns */
+ int bd_totalstored;
+
+ /* per-column info; bd_tupdesc->natts entries long */
+ BrinOpcInfo *bd_info[FLEXIBLE_ARRAY_MEMBER];
+ } BrinDesc;
+
+ /*
+ * Globally-known function support numbers for BRIN indexes. Individual
+ * opclasses define their own function support numbers, which must not collide
+ * with the definitions here.
+ */
+ #define BRIN_PROCNUM_OPCINFO 1
+ #define BRIN_PROCNUM_ADDVALUE 2
+ #define BRIN_PROCNUM_CONSISTENT 3
+ #define BRIN_PROCNUM_UNION 4
+
+ #define BRIN_DEBUG
+
+ /* we allow debug if using GCC; otherwise don't bother */
+ #if defined(BRIN_DEBUG) && defined(__GNUC__)
+ #define BRIN_elog(level, ...) elog(level, __VA_ARGS__)
+ #else
+ #define BRIN_elog(a) ((void) 0)
+ #endif
+
+ /* brin.c */
+ extern BrinDesc *brin_build_desc(Relation rel);
+ extern void brin_free_desc(BrinDesc *bdesc);
+ extern void brin_page_init(Page page, uint16 type);
+ extern void brin_metapage_init(Page page, BlockNumber pagesPerRange,
+ uint16 version);
+
+ #endif /* BRIN_INTERNAL_H */
*** /dev/null
--- b/src/include/access/brin_page.h
***************
*** 0 ****
--- 1,69 ----
+ /*
+ * Prototypes and definitions for BRIN page layouts
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/brin_page.h
+ *
+ * NOTES
+ *
+ * These structs should really be private to specific BRIN files, but it's
+ * useful to have them here so that they can be used by pageinspect and similar
+ * tools.
+ */
+ #ifndef BRIN_PAGE_H
+ #define BRIN_PAGE_H
+
+ #include "storage/block.h"
+ #include "storage/itemptr.h"
+
+ /* special space on all BRIN pages stores a "type" identifier */
+ #define BRIN_PAGETYPE_META 0xF091
+ #define BRIN_PAGETYPE_REVMAP 0xF092
+ #define BRIN_PAGETYPE_REGULAR 0xF093
+
+ #define BRIN_PAGE_TYPE(page) \
+ (((BrinSpecialSpace *) PageGetSpecialPointer(page))->type)
+ #define BRIN_IS_REVMAP_PAGE(page) (BRIN_PAGE_TYPE(page) == BRIN_PAGETYPE_REVMAP)
+ #define BRIN_IS_REGULAR_PAGE(page) (BRIN_PAGE_TYPE(page) == BRIN_PAGETYPE_REGULAR)
+
+ /* flags for BrinSpecialSpace */
+ #define BRIN_EVACUATE_PAGE (1 << 0)
+
+ typedef struct BrinSpecialSpace
+ {
+ uint16 flags;
+ uint16 type;
+ } BrinSpecialSpace;
+
+ /* Metapage definitions */
+ typedef struct BrinMetaPageData
+ {
+ uint32 brinMagic;
+ uint32 brinVersion;
+ BlockNumber pagesPerRange;
+ BlockNumber lastRevmapPage;
+ } BrinMetaPageData;
+
+ #define BRIN_CURRENT_VERSION 1
+ #define BRIN_META_MAGIC 0xA8109CFA
+
+ #define BRIN_METAPAGE_BLKNO 0
+
+ /* Definitions for revmap pages */
+ typedef struct RevmapContents
+ {
+ ItemPointerData rm_tids[1]; /* really REVMAP_PAGE_MAXITEMS */
+ } RevmapContents;
+
+ #define REVMAP_CONTENT_SIZE \
+ (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - \
+ offsetof(RevmapContents, rm_tids) - \
+ MAXALIGN(sizeof(BrinSpecialSpace)))
+ /* max num of items in the array */
+ #define REVMAP_PAGE_MAXITEMS \
+ (REVMAP_CONTENT_SIZE / sizeof(ItemPointerData))
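+
+ /*
+ * With the default 8 kB block size this works out to roughly 1360 TIDs per
+ * revmap page (assuming 6-byte item pointers and an 8-byte-aligned special
+ * space).
+ */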
+
+ #endif /* BRIN_PAGE_H */
*** /dev/null
--- b/src/include/access/brin_pageops.h
***************
*** 0 ****
--- 1,31 ----
+ /*
+ * Prototypes for operating on BRIN pages.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/brin_pageops.h
+ */
+ #ifndef BRIN_PAGEOPS_H
+ #define BRIN_PAGEOPS_H
+
+ #include "access/brin_revmap.h"
+
+ extern bool brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
+ brinRmAccess *rmAccess, BlockNumber heapBlk,
+ Buffer oldbuf, OffsetNumber oldoff,
+ const BrTuple *origtup, Size origsz,
+ const BrTuple *newtup, Size newsz,
+ bool samepage, bool *extended);
+ extern bool brin_can_do_samepage_update(Buffer buffer, Size origsz,
+ Size newsz);
+ extern OffsetNumber brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
+ brinRmAccess *rmAccess, Buffer *buffer, BlockNumber heapBlk,
+ BrTuple *tup, Size itemsz, bool *extended);
+
+ extern bool brin_start_evacuating_page(Relation idxRel, Buffer buf);
+ extern void brin_evacuate_page(Relation idxRel, BlockNumber pagesPerRange,
+ brinRmAccess *rmAccess, Buffer buf);
+
+ #endif /* BRIN_PAGEOPS_H */
*** /dev/null
--- b/src/include/access/brin_revmap.h
***************
*** 0 ****
--- 1,38 ----
+ /*
+ * prototypes for BRIN reverse range maps
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/brin_revmap.h
+ */
+
+ #ifndef BRIN_REVMAP_H
+ #define BRIN_REVMAP_H
+
+ #include "access/brin_tuple.h"
+ #include "storage/block.h"
+ #include "storage/buf.h"
+ #include "storage/itemptr.h"
+ #include "storage/off.h"
+ #include "utils/relcache.h"
+
+ /* struct definition lives in brrevmap.c */
+ typedef struct brinRmAccess brinRmAccess;
+
+ extern brinRmAccess *brinRevmapAccessInit(Relation idxrel,
+ BlockNumber *pagesPerRange);
+ extern void brinRevmapAccessTerminate(brinRmAccess *rmAccess);
+
+ extern Buffer brinRevmapExtend(brinRmAccess *rmAccess,
+ BlockNumber heapBlk);
+ extern Buffer brinLockRevmapPageForUpdate(brinRmAccess *rmAccess,
+ BlockNumber heapBlk);
+ extern void brinSetHeapBlockItemptr(Buffer rmbuf, BlockNumber pagesPerRange,
+ BlockNumber heapBlk, ItemPointerData tid);
+ extern BrTuple *brinGetTupleForHeapBlock(brinRmAccess *rmAccess,
+ BlockNumber heapBlk, Buffer *buf, OffsetNumber *off,
+ Size *size, int mode);
+
+ #endif /* BRIN_REVMAP_H */
*** /dev/null
--- b/src/include/access/brin_tuple.h
***************
*** 0 ****
--- 1,95 ----
+ /*
+ * Declarations for dealing with BRIN-specific tuples.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/access/brin_tuple.h
+ */
+ #ifndef BRIN_TUPLE_H
+ #define BRIN_TUPLE_H
+
+ #include "access/brin_internal.h"
+ #include "access/tupdesc.h"
+
+
+ /*
+ * A BRIN index stores one index tuple per page range. Each index tuple
+ * has one BrinValues struct for each indexed column; in turn, each BrinValues
+ * has (besides the null flags) an array of Datum whose size is determined by
+ * the opclass.
+ */
+ typedef struct BrinValues
+ {
+ bool hasnulls; /* are there any nulls in the page range? */
+ bool allnulls; /* are all values null in the page range? */
+ Datum *values; /* current accumulated values */
+ } BrinValues;
+
+ /*
+ * This struct is used to represent an in-memory index tuple. The values can
+ * only be meaningfully decoded with an appropriate BrinDesc.
+ */
+ typedef struct DeformedBrTuple
+ {
+ bool dt_placeholder; /* this is a placeholder tuple */
+ BlockNumber dt_blkno; /* heap blkno that the tuple is for */
+ BrinValues dt_columns[FLEXIBLE_ARRAY_MEMBER];
+ } DeformedBrTuple;
+
+ /*
+ * An on-disk BRIN tuple. It is possibly followed by a nulls bitmask with two
+ * bits per indexed column (allnulls and hasnulls), and then by an
+ * opclass-defined number of Datum values for each column.
+ */
+ typedef struct BrTuple
+ {
+ /* heap block number that the tuple is for */
+ BlockNumber bt_blkno;
+
+ /* ---------------
+ * bt_info is laid out in the following fashion:
+ *
+ * 7th (high) bit: has nulls
+ * 6th bit: is placeholder tuple
+ * 5th bit: unused
+ * 4-0 bit: offset of data
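+ *
+ * For example, bt_info == 0x88 describes a tuple that has a nulls bitmask
+ * (BRIN_NULLS_MASK set), is not a placeholder, and whose data area starts
+ * at offset 8.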
+ * ---------------
+ */
+ uint8 bt_info;
+ } BrTuple;
+
+ #define SizeOfBrinTuple (offsetof(BrTuple, bt_info) + sizeof(uint8))
+
+ /*
+ * t_info manipulation macros
+ */
+ #define BRIN_OFFSET_MASK 0x1F
+ /* bit 0x20 is not used at present */
+ #define BRIN_PLACEHOLDER_MASK 0x40
+ #define BRIN_NULLS_MASK 0x80
+
+ #define BrinTupleDataOffset(tup) ((Size) (((BrTuple *) (tup))->bt_info & BRIN_OFFSET_MASK))
+ #define BrinTupleHasNulls(tup) (((((BrTuple *) (tup))->bt_info & BRIN_NULLS_MASK)) != 0)
+ #define BrinTupleIsPlaceholder(tup) (((((BrTuple *) (tup))->bt_info & BRIN_PLACEHOLDER_MASK)) != 0)
+
+
+ extern BrTuple *brin_form_tuple(BrinDesc *brdesc, BlockNumber blkno,
+ DeformedBrTuple *tuple, Size *size);
+ extern BrTuple *brin_form_placeholder_tuple(BrinDesc *brdesc,
+ BlockNumber blkno, Size *size);
+ extern void brin_free_tuple(BrTuple *tuple);
+ extern BrTuple *brin_copy_tuple(BrTuple *tuple, Size len);
+ extern bool brin_tuples_equal(const BrTuple *a, Size alen,
+ const BrTuple *b, Size blen);
+
+ extern DeformedBrTuple *brin_new_dtuple(BrinDesc *brdesc);
+ extern void brin_dtuple_initialize(DeformedBrTuple *dtuple,
+ BrinDesc *brdesc);
+ extern DeformedBrTuple *brin_deform_tuple(BrinDesc *brdesc,
+ BrTuple *tuple);
+ extern void brin_free_dtuple(BrinDesc *brdesc,
+ DeformedBrTuple *dtup);
+
+ #endif /* BRIN_TUPLE_H */
*** /dev/null
--- b/src/include/access/brin_xlog.h
***************
*** 0 ****
--- 1,106 ----
+ /*-------------------------------------------------------------------------
+ *
+ * brin_xlog.h
+ * POSTGRES BRIN access XLOG definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/brin_xlog.h
+ *
+ *-------------------------------------------------------------------------
+ */
+ #ifndef BRIN_XLOG_H
+ #define BRIN_XLOG_H
+
+ #include "access/xlog.h"
+ #include "storage/bufpage.h"
+ #include "storage/itemptr.h"
+ #include "storage/relfilenode.h"
+ #include "utils/relcache.h"
+
+
+ /*
+ * WAL record definitions for BRIN's WAL operations
+ *
+ * XLOG allows us to store some information in the high 4 bits of the log
+ * record's xl_info field.
+ */
+ #define XLOG_BRIN_CREATE_INDEX 0x00
+ #define XLOG_BRIN_INSERT 0x10
+ #define XLOG_BRIN_UPDATE 0x20
+ #define XLOG_BRIN_SAMEPAGE_UPDATE 0x30
+ #define XLOG_BRIN_REVMAP_EXTEND 0x40
+ #define XLOG_BRIN_REVMAP_VACUUM 0x50
+
+ #define XLOG_BRIN_OPMASK 0x70
+ /*
+ * When we insert the first item on a new page, we restore the entire page in
+ * redo.
+ */
+ #define XLOG_BRIN_INIT_PAGE 0x80
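+
+ /*
+ * For example, an insert record that also initializes its target page carries
+ * (XLOG_BRIN_INSERT | XLOG_BRIN_INIT_PAGE), i.e. 0x90, in xl_info.
+ */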
+
+ /* This is what we need to know about a BRIN index create */
+ typedef struct xl_brin_createidx
+ {
+ BlockNumber pagesPerRange;
+ RelFileNode node;
+ uint16 version;
+ } xl_brin_createidx;
+ #define SizeOfBrinCreateIdx (offsetof(xl_brin_createidx, version) + sizeof(uint16))
+
+ /*
+ * This is what we need to know about a BRIN tuple insert
+ */
+ typedef struct xl_brin_insert
+ {
+ RelFileNode node;
+ BlockNumber heapBlk;
+
+ /* extra information needed to update the revmap */
+ BlockNumber revmapBlk;
+ BlockNumber pagesPerRange;
+
+ ItemPointerData tid;
+ /* tuple data follows at end of struct */
+ } xl_brin_insert;
+
+ #define SizeOfBrinInsert (offsetof(xl_brin_insert, tid) + sizeof(ItemPointerData))
+
+ /*
+ * A cross-page update is the same as an insert, but we also store the old TID.
+ */
+ typedef struct xl_brin_update
+ {
+ xl_brin_insert new;
+ ItemPointerData oldtid;
+ } xl_brin_update;
+
+ #define SizeOfBrinUpdate (offsetof(xl_brin_update, oldtid) + sizeof(ItemPointerData))
+
+ /* This is what we need to know about a BRIN tuple samepage update */
+ typedef struct xl_brin_samepage_update
+ {
+ RelFileNode node;
+ ItemPointerData tid;
+ /* tuple data follows at end of struct */
+ } xl_brin_samepage_update;
+
+ #define SizeOfBrinSamepageUpdate (offsetof(xl_brin_samepage_update, tid) + sizeof(ItemPointerData))
+
+ /* This is what we need to know about a revmap extension */
+ typedef struct xl_brin_revmap_extend
+ {
+ RelFileNode node;
+ BlockNumber targetBlk;
+ } xl_brin_revmap_extend;
+
+ #define SizeOfBrinRevmapExtend (offsetof(xl_brin_revmap_extend, targetBlk) + \
+ sizeof(BlockNumber))
+
+
+ extern void brin_desc(StringInfo buf, XLogRecord *record);
+ extern void brin_redo(XLogRecPtr lsn, XLogRecord *record);
+
+ #endif /* BRIN_XLOG_H */
*** a/src/include/access/heapam.h
--- b/src/include/access/heapam.h
***************
*** 112,117 **** extern HeapScanDesc heap_beginscan_strat(Relation relation, Snapshot snapshot,
--- 112,119 ----
bool allow_strat, bool allow_sync);
extern HeapScanDesc heap_beginscan_bm(Relation relation, Snapshot snapshot,
int nkeys, ScanKey key);
+ extern void heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk,
+ BlockNumber numBlks);
extern void heap_rescan(HeapScanDesc scan, ScanKey key);
extern void heap_endscan(HeapScanDesc scan);
extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
*** a/src/include/access/reloptions.h
--- b/src/include/access/reloptions.h
***************
*** 45,52 **** typedef enum relopt_kind
RELOPT_KIND_TABLESPACE = (1 << 7),
RELOPT_KIND_SPGIST = (1 << 8),
RELOPT_KIND_VIEW = (1 << 9),
/* if you add a new kind, make sure you update "last_default" too */
! RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_VIEW,
/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
RELOPT_KIND_MAX = (1 << 30)
} relopt_kind;
--- 45,53 ----
RELOPT_KIND_TABLESPACE = (1 << 7),
RELOPT_KIND_SPGIST = (1 << 8),
RELOPT_KIND_VIEW = (1 << 9),
+ RELOPT_KIND_BRIN = (1 << 10),
/* if you add a new kind, make sure you update "last_default" too */
! RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_BRIN,
/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
RELOPT_KIND_MAX = (1 << 30)
} relopt_kind;
*** a/src/include/access/relscan.h
--- b/src/include/access/relscan.h
***************
*** 35,42 **** typedef struct HeapScanDescData
bool rs_temp_snap; /* unregister snapshot at scan end? */
/* state set up at initscan time */
! BlockNumber rs_nblocks; /* number of blocks to scan */
BlockNumber rs_startblock; /* block # to start at */
BufferAccessStrategy rs_strategy; /* access strategy for reads */
bool rs_syncscan; /* report location to syncscan logic? */
--- 35,44 ----
bool rs_temp_snap; /* unregister snapshot at scan end? */
/* state set up at initscan time */
! BlockNumber rs_nblocks; /* total number of blocks in rel */
BlockNumber rs_startblock; /* block # to start at */
+ BlockNumber rs_initblock; /* first block # to consider in scan */
+ BlockNumber rs_numblocks; /* number of blocks to scan */
BufferAccessStrategy rs_strategy; /* access strategy for reads */
bool rs_syncscan; /* report location to syncscan logic? */
*** a/src/include/access/rmgrlist.h
--- b/src/include/access/rmgrlist.h
***************
*** 42,44 **** PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_xlog_startup, gin_xlog_cleanup
--- 42,45 ----
PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_xlog_startup, gist_xlog_cleanup)
PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, NULL, NULL)
PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_xlog_startup, spg_xlog_cleanup)
+ PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, NULL, NULL)
*** a/src/include/catalog/index.h
--- b/src/include/catalog/index.h
***************
*** 97,102 **** extern double IndexBuildHeapScan(Relation heapRelation,
--- 97,110 ----
bool allow_sync,
IndexBuildCallback callback,
void *callback_state);
+ extern double IndexBuildHeapRangeScan(Relation heapRelation,
+ Relation indexRelation,
+ IndexInfo *indexInfo,
+ bool allow_sync,
+ BlockNumber start_blockno,
+ BlockNumber end_blockno,
+ IndexBuildCallback callback,
+ void *callback_state);
extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
*** a/src/include/catalog/pg_am.h
--- b/src/include/catalog/pg_am.h
***************
*** 132,136 **** DESCR("GIN index access method");
--- 132,138 ----
DATA(insert OID = 4000 ( spgist 0 5 f f f f f t f t f f f 0 spginsert spgbeginscan spggettuple spggetbitmap spgrescan spgendscan spgmarkpos spgrestrpos spgbuild spgbuildempty spgbulkdelete spgvacuumcleanup spgcanreturn spgcostestimate spgoptions ));
DESCR("SP-GiST index access method");
#define SPGIST_AM_OID 4000
+ DATA(insert OID = 3580 ( brin 5 8 f f f f t t f t t f f 0 brininsert brinbeginscan - bringetbitmap brinrescan brinendscan brinmarkpos brinrestrpos brinbuild brinbuildempty brinbulkdelete brinvacuumcleanup - brincostestimate brinoptions ));
+ #define BRIN_AM_OID 3580
#endif /* PG_AM_H */
*** a/src/include/catalog/pg_amop.h
--- b/src/include/catalog/pg_amop.h
***************
*** 845,848 **** DATA(insert ( 3550 869 869 25 s 932 783 0 ));
--- 845,929 ----
DATA(insert ( 3550 869 869 26 s 933 783 0 ));
DATA(insert ( 3550 869 869 27 s 934 783 0 ));
+ /*
+ * int4_minmax_ops
+ */
+ DATA(insert ( 4054 23 23 1 s 97 3580 0 ));
+ DATA(insert ( 4054 23 23 2 s 523 3580 0 ));
+ DATA(insert ( 4054 23 23 3 s 96 3580 0 ));
+ DATA(insert ( 4054 23 23 4 s 525 3580 0 ));
+ DATA(insert ( 4054 23 23 5 s 521 3580 0 ));
+
+ /*
+ * numeric_minmax_ops
+ */
+ DATA(insert ( 4055 1700 1700 1 s 1754 3580 0 ));
+ DATA(insert ( 4055 1700 1700 2 s 1755 3580 0 ));
+ DATA(insert ( 4055 1700 1700 3 s 1752 3580 0 ));
+ DATA(insert ( 4055 1700 1700 4 s 1757 3580 0 ));
+ DATA(insert ( 4055 1700 1700 5 s 1756 3580 0 ));
+
+ /*
+ * text_minmax_ops
+ */
+ DATA(insert ( 4056 25 25 1 s 664 3580 0 ));
+ DATA(insert ( 4056 25 25 2 s 665 3580 0 ));
+ DATA(insert ( 4056 25 25 3 s 98 3580 0 ));
+ DATA(insert ( 4056 25 25 4 s 667 3580 0 ));
+ DATA(insert ( 4056 25 25 5 s 666 3580 0 ));
+
+ /*
+ * time_minmax_ops
+ */
+ DATA(insert ( 4057 1083 1083 1 s 1110 3580 0 ));
+ DATA(insert ( 4057 1083 1083 2 s 1111 3580 0 ));
+ DATA(insert ( 4057 1083 1083 3 s 1108 3580 0 ));
+ DATA(insert ( 4057 1083 1083 4 s 1113 3580 0 ));
+ DATA(insert ( 4057 1083 1083 5 s 1112 3580 0 ));
+
+ /*
+ * timetz_minmax_ops
+ */
+ DATA(insert ( 4058 1266 1266 1 s 1552 3580 0 ));
+ DATA(insert ( 4058 1266 1266 2 s 1553 3580 0 ));
+ DATA(insert ( 4058 1266 1266 3 s 1550 3580 0 ));
+ DATA(insert ( 4058 1266 1266 4 s 1555 3580 0 ));
+ DATA(insert ( 4058 1266 1266 5 s 1554 3580 0 ));
+
+ /*
+ * timestamp_minmax_ops
+ */
+ DATA(insert ( 4059 1114 1114 1 s 2062 3580 0 ));
+ DATA(insert ( 4059 1114 1114 2 s 2063 3580 0 ));
+ DATA(insert ( 4059 1114 1114 3 s 2060 3580 0 ));
+ DATA(insert ( 4059 1114 1114 4 s 2065 3580 0 ));
+ DATA(insert ( 4059 1114 1114 5 s 2064 3580 0 ));
+
+ /*
+ * timestamptz_minmax_ops
+ */
+ DATA(insert ( 4060 1184 1184 1 s 1322 3580 0 ));
+ DATA(insert ( 4060 1184 1184 2 s 1323 3580 0 ));
+ DATA(insert ( 4060 1184 1184 3 s 1320 3580 0 ));
+ DATA(insert ( 4060 1184 1184 4 s 1325 3580 0 ));
+ DATA(insert ( 4060 1184 1184 5 s 1324 3580 0 ));
+
+ /*
+ * date_minmax_ops
+ */
+ DATA(insert ( 4061 1082 1082 1 s 1095 3580 0 ));
+ DATA(insert ( 4061 1082 1082 2 s 1096 3580 0 ));
+ DATA(insert ( 4061 1082 1082 3 s 1093 3580 0 ));
+ DATA(insert ( 4061 1082 1082 4 s 1098 3580 0 ));
+ DATA(insert ( 4061 1082 1082 5 s 1097 3580 0 ));
+
+ /*
+ * char_minmax_ops
+ */
+ DATA(insert ( 4062 18 18 1 s 631 3580 0 ));
+ DATA(insert ( 4062 18 18 2 s 632 3580 0 ));
+ DATA(insert ( 4062 18 18 3 s 92 3580 0 ));
+ DATA(insert ( 4062 18 18 4 s 634 3580 0 ));
+ DATA(insert ( 4062 18 18 5 s 633 3580 0 ));
+
#endif /* PG_AMOP_H */
*** a/src/include/catalog/pg_amproc.h
--- b/src/include/catalog/pg_amproc.h
***************
*** 432,435 **** DATA(insert ( 4017 25 25 3 4029 ));
--- 432,517 ----
DATA(insert ( 4017 25 25 4 4030 ));
DATA(insert ( 4017 25 25 5 4031 ));
+ /* minmax */
+ DATA(insert ( 4054 23 23 1 3383 ));
+ DATA(insert ( 4054 23 23 2 3384 ));
+ DATA(insert ( 4054 23 23 3 3385 ));
+ DATA(insert ( 4054 23 23 4 3394 ));
+ DATA(insert ( 4054 23 23 5 66 ));
+ DATA(insert ( 4054 23 23 6 149 ));
+ DATA(insert ( 4054 23 23 7 150 ));
+ DATA(insert ( 4054 23 23 8 147 ));
+
+ DATA(insert ( 4055 1700 1700 1 3386 ));
+ DATA(insert ( 4055 1700 1700 2 3384 ));
+ DATA(insert ( 4055 1700 1700 3 3385 ));
+ DATA(insert ( 4055 1700 1700 4 3394 ));
+ DATA(insert ( 4055 1700 1700 5 1722 ));
+ DATA(insert ( 4055 1700 1700 6 1723 ));
+ DATA(insert ( 4055 1700 1700 7 1721 ));
+ DATA(insert ( 4055 1700 1700 8 1720 ));
+
+ DATA(insert ( 4056 25 25 1 3387 ));
+ DATA(insert ( 4056 25 25 2 3384 ));
+ DATA(insert ( 4056 25 25 3 3385 ));
+ DATA(insert ( 4056 25 25 4 3394 ));
+ DATA(insert ( 4056 25 25 5 740 ));
+ DATA(insert ( 4056 25 25 6 741 ));
+ DATA(insert ( 4056 25 25 7 743 ));
+ DATA(insert ( 4056 25 25 8 742 ));
+
+ DATA(insert ( 4057 1083 1083 1 3388 ));
+ DATA(insert ( 4057 1083 1083 2 3384 ));
+ DATA(insert ( 4057 1083 1083 3 3385 ));
+ DATA(insert ( 4057 1083 1083 4 3394 ));
+ DATA(insert ( 4057 1083 1083 5 1102 ));
+ DATA(insert ( 4057 1083 1083 6 1103 ));
+ DATA(insert ( 4057 1083 1083 7 1105 ));
+ DATA(insert ( 4057 1083 1083 8 1104 ));
+
+ DATA(insert ( 4058 1266 1266 1 3389 ));
+ DATA(insert ( 4058 1266 1266 2 3384 ));
+ DATA(insert ( 4058 1266 1266 3 3385 ));
+ DATA(insert ( 4058 1266 1266 4 3394 ));
+ DATA(insert ( 4058 1266 1266 5 1354 ));
+ DATA(insert ( 4058 1266 1266 6 1355 ));
+ DATA(insert ( 4058 1266 1266 7 1356 ));
+ DATA(insert ( 4058 1266 1266 8 1357 ));
+
+ DATA(insert ( 4059 1114 1114 1 3390 ));
+ DATA(insert ( 4059 1114 1114 2 3384 ));
+ DATA(insert ( 4059 1114 1114 3 3385 ));
+ DATA(insert ( 4059 1114 1114 4 3394 ));
+ DATA(insert ( 4059 1114 1114 5 2054 ));
+ DATA(insert ( 4059 1114 1114 6 2055 ));
+ DATA(insert ( 4059 1114 1114 7 2056 ));
+ DATA(insert ( 4059 1114 1114 8 2057 ));
+
+ DATA(insert ( 4060 1184 1184 1 3391 ));
+ DATA(insert ( 4060 1184 1184 2 3384 ));
+ DATA(insert ( 4060 1184 1184 3 3385 ));
+ DATA(insert ( 4060 1184 1184 4 3394 ));
+ DATA(insert ( 4060 1184 1184 5 1154 ));
+ DATA(insert ( 4060 1184 1184 6 1155 ));
+ DATA(insert ( 4060 1184 1184 7 1156 ));
+ DATA(insert ( 4060 1184 1184 8 1157 ));
+
+ DATA(insert ( 4061 1082 1082 1 3392 ));
+ DATA(insert ( 4061 1082 1082 2 3384 ));
+ DATA(insert ( 4061 1082 1082 3 3385 ));
+ DATA(insert ( 4061 1082 1082 4 3394 ));
+ DATA(insert ( 4061 1082 1082 5 1087 ));
+ DATA(insert ( 4061 1082 1082 6 1088 ));
+ DATA(insert ( 4061 1082 1082 7 1090 ));
+ DATA(insert ( 4061 1082 1082 8 1089 ));
+
+ DATA(insert ( 4062 18 18 1 3393 ));
+ DATA(insert ( 4062 18 18 2 3384 ));
+ DATA(insert ( 4062 18 18 3 3385 ));
+ DATA(insert ( 4062 18 18 4 3394 ));
+ DATA(insert ( 4062 18 18 5 1246 ));
+ DATA(insert ( 4062 18 18 6 72 ));
+ DATA(insert ( 4062 18 18 7 74 ));
+ DATA(insert ( 4062 18 18 8 73 ));
+
#endif /* PG_AMPROC_H */
*** a/src/include/catalog/pg_opclass.h
--- b/src/include/catalog/pg_opclass.h
***************
*** 235,239 **** DATA(insert ( 403 jsonb_ops PGNSP PGUID 4033 3802 t 0 ));
--- 235,248 ----
DATA(insert ( 405 jsonb_ops PGNSP PGUID 4034 3802 t 0 ));
DATA(insert ( 2742 jsonb_ops PGNSP PGUID 4036 3802 t 25 ));
DATA(insert ( 2742 jsonb_path_ops PGNSP PGUID 4037 3802 f 23 ));
+ DATA(insert ( 3580 int4_minmax_ops PGNSP PGUID 4054 23 t 0 ));
+ DATA(insert ( 3580 numeric_minmax_ops PGNSP PGUID 4055 1700 t 0 ));
+ DATA(insert ( 3580 text_minmax_ops PGNSP PGUID 4056 25 t 0 ));
+ DATA(insert ( 3580 time_minmax_ops PGNSP PGUID 4057 1083 t 0 ));
+ DATA(insert ( 3580 timetz_minmax_ops PGNSP PGUID 4058 1266 t 0 ));
+ DATA(insert ( 3580 timestamp_minmax_ops PGNSP PGUID 4059 1114 t 0 ));
+ DATA(insert ( 3580 timestamptz_minmax_ops PGNSP PGUID 4060 1184 t 0 ));
+ DATA(insert ( 3580 date_minmax_ops PGNSP PGUID 4061 1082 t 0 ));
+ DATA(insert ( 3580 char_minmax_ops PGNSP PGUID 4062 18 t 0 ));
#endif /* PG_OPCLASS_H */
*** a/src/include/catalog/pg_opfamily.h
--- b/src/include/catalog/pg_opfamily.h
***************
*** 157,160 **** DATA(insert OID = 4035 ( 783 jsonb_ops PGNSP PGUID ));
--- 157,170 ----
DATA(insert OID = 4036 ( 2742 jsonb_ops PGNSP PGUID ));
DATA(insert OID = 4037 ( 2742 jsonb_path_ops PGNSP PGUID ));
+ DATA(insert OID = 4054 ( 3580 int4_minmax_ops PGNSP PGUID ));
+ DATA(insert OID = 4055 ( 3580 numeric_minmax_ops PGNSP PGUID ));
+ DATA(insert OID = 4056 ( 3580 text_minmax_ops PGNSP PGUID ));
+ DATA(insert OID = 4057 ( 3580 time_minmax_ops PGNSP PGUID ));
+ DATA(insert OID = 4058 ( 3580 timetz_minmax_ops PGNSP PGUID ));
+ DATA(insert OID = 4059 ( 3580 timestamp_minmax_ops PGNSP PGUID ));
+ DATA(insert OID = 4060 ( 3580 timestamptz_minmax_ops PGNSP PGUID ));
+ DATA(insert OID = 4061 ( 3580 date_minmax_ops PGNSP PGUID ));
+ DATA(insert OID = 4062 ( 3580 char_minmax_ops PGNSP PGUID ));
+
#endif /* PG_OPFAMILY_H */
*** a/src/include/catalog/pg_proc.h
--- b/src/include/catalog/pg_proc.h
***************
*** 565,570 **** DESCR("btree(internal)");
--- 565,598 ----
DATA(insert OID = 2785 ( btoptions PGNSP PGUID 12 1 0 0 0 f f f f t f s 2 0 17 "1009 16" _null_ _null_ _null_ _null_ btoptions _null_ _null_ _null_ ));
DESCR("btree(internal)");
+ DATA(insert OID = 3789 ( bringetbitmap PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 20 "2281 2281" _null_ _null_ _null_ _null_ bringetbitmap _null_ _null_ _null_ ));
+ DESCR("brin(internal)");
+ DATA(insert OID = 3790 ( brininsert PGNSP PGUID 12 1 0 0 0 f f f f t f v 6 0 16 "2281 2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ brininsert _null_ _null_ _null_ ));
+ DESCR("brin(internal)");
+ DATA(insert OID = 3791 ( brinbeginscan PGNSP PGUID 12 1 0 0 0 f f f f t f v 3 0 2281 "2281 2281 2281" _null_ _null_ _null_ _null_ brinbeginscan _null_ _null_ _null_ ));
+ DESCR("brin(internal)");
+ DATA(insert OID = 3792 ( brinrescan PGNSP PGUID 12 1 0 0 0 f f f f t f v 5 0 2278 "2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ brinrescan _null_ _null_ _null_ ));
+ DESCR("brin(internal)");
+ DATA(insert OID = 3793 ( brinendscan PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ brinendscan _null_ _null_ _null_ ));
+ DESCR("brin(internal)");
+ DATA(insert OID = 3794 ( brinmarkpos PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ brinmarkpos _null_ _null_ _null_ ));
+ DESCR("brin(internal)");
+ DATA(insert OID = 3795 ( brinrestrpos PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ brinrestrpos _null_ _null_ _null_ ));
+ DESCR("brin(internal)");
+ DATA(insert OID = 3796 ( brinbuild PGNSP PGUID 12 1 0 0 0 f f f f t f v 3 0 2281 "2281 2281 2281" _null_ _null_ _null_ _null_ brinbuild _null_ _null_ _null_ ));
+ DESCR("brin(internal)");
+ DATA(insert OID = 3797 ( brinbuildempty PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "2281" _null_ _null_ _null_ _null_ brinbuildempty _null_ _null_ _null_ ));
+ DESCR("brin(internal)");
+ DATA(insert OID = 3798 ( brinbulkdelete PGNSP PGUID 12 1 0 0 0 f f f f t f v 4 0 2281 "2281 2281 2281 2281" _null_ _null_ _null_ _null_ brinbulkdelete _null_ _null_ _null_ ));
+ DESCR("brin(internal)");
+ DATA(insert OID = 3799 ( brinvacuumcleanup PGNSP PGUID 12 1 0 0 0 f f f f t f v 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ brinvacuumcleanup _null_ _null_ _null_ ));
+ DESCR("brin(internal)");
+ DATA(insert OID = 3800 ( brincostestimate PGNSP PGUID 12 1 0 0 0 f f f f t f v 7 0 2278 "2281 2281 2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ brincostestimate _null_ _null_ _null_ ));
+ DESCR("brin(internal)");
+ DATA(insert OID = 3801 ( brinoptions PGNSP PGUID 12 1 0 0 0 f f f f t f s 2 0 17 "1009 16" _null_ _null_ _null_ _null_ brinoptions _null_ _null_ _null_ ));
+ DESCR("brin(internal)");
+
+
DATA(insert OID = 339 ( poly_same PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "604 604" _null_ _null_ _null_ _null_ poly_same _null_ _null_ _null_ ));
DATA(insert OID = 340 ( poly_contain PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "604 604" _null_ _null_ _null_ _null_ poly_contain _null_ _null_ _null_ ));
DATA(insert OID = 341 ( poly_left PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "604 604" _null_ _null_ _null_ _null_ poly_left _null_ _null_ _null_ ));
***************
*** 4074,4079 **** DATA(insert OID = 2747 ( arrayoverlap PGNSP PGUID 12 1 0 0 0 f f f f t f i
--- 4102,4133 ----
DATA(insert OID = 2748 ( arraycontains PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "2277 2277" _null_ _null_ _null_ _null_ arraycontains _null_ _null_ _null_ ));
DATA(insert OID = 2749 ( arraycontained PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 16 "2277 2277" _null_ _null_ _null_ _null_ arraycontained _null_ _null_ _null_ ));
+ /* Minmax */
+ DATA(insert OID = 3384 ( minmax_add_value PGNSP PGUID 12 1 0 0 0 f f f f t f i 5 0 16 "2281 2281 2281 2281 2281" _null_ _null_ _null_ _null_ minmaxAddValue _null_ _null_ _null_ ));
+ DESCR("BRIN minmax support");
+ DATA(insert OID = 3385 ( minmax_consistent PGNSP PGUID 12 1 0 0 0 f f f f t f i 3 0 16 "2281 2281 2281" _null_ _null_ _null_ _null_ minmaxConsistent _null_ _null_ _null_ ));
+ DESCR("BRIN minmax support");
+ DATA(insert OID = 3394 ( minmax_union PGNSP PGUID 12 1 0 0 0 f f f f t f i 4 0 16 "2281 2281 2281 2281" _null_ _null_ _null_ _null_ minmaxUnion _null_ _null_ _null_ ));
+ DESCR("BRIN minmax support");
+ DATA(insert OID = 3383 ( minmax_sortable_opcinfo_int4 PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ minmaxOpcInfo_int4 _null_ _null_ _null_ ));
+ DESCR("BRIN minmax support");
+ DATA(insert OID = 3386 ( minmax_sortable_opcinfo_numeric PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ minmaxOpcInfo_numeric _null_ _null_ _null_ ));
+ DESCR("BRIN minmax support");
+ DATA(insert OID = 3387 ( minmax_sortable_opcinfo_text PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ minmaxOpcInfo_text _null_ _null_ _null_ ));
+ DESCR("BRIN minmax support");
+ DATA(insert OID = 3388 ( minmax_sortable_opcinfo_time PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ minmaxOpcInfo_time _null_ _null_ _null_ ));
+ DESCR("BRIN minmax support");
+ DATA(insert OID = 3389 ( minmax_sortable_opcinfo_timetz PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ minmaxOpcInfo_timetz _null_ _null_ _null_ ));
+ DESCR("BRIN minmax support");
+ DATA(insert OID = 3390 ( minmax_sortable_opcinfo_timestamp PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ minmaxOpcInfo_timestamp _null_ _null_ _null_ ));
+ DESCR("BRIN minmax support");
+ DATA(insert OID = 3391 ( minmax_sortable_opcinfo_timestamptz PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ minmaxOpcInfo_timestamptz _null_ _null_ _null_ ));
+ DESCR("BRIN minmax support");
+ DATA(insert OID = 3392 ( minmax_sortable_opcinfo_date PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ minmaxOpcInfo_date _null_ _null_ _null_ ));
+ DESCR("BRIN minmax support");
+ DATA(insert OID = 3393 ( minmax_sortable_opcinfo_char PGNSP PGUID 12 1 0 0 0 f f f f t f i 2 0 2281 "2281 2281" _null_ _null_ _null_ _null_ minmaxOpcInfo_char _null_ _null_ _null_ ));
+ DESCR("BRIN minmax support");
+
/* userlock replacements */
DATA(insert OID = 2880 ( pg_advisory_lock PGNSP PGUID 12 1 0 0 0 f f f f t f v 1 0 2278 "20" _null_ _null_ _null_ _null_ pg_advisory_lock_int8 _null_ _null_ _null_ ));
DESCR("obtain exclusive advisory lock");
*** a/src/include/storage/bufpage.h
--- b/src/include/storage/bufpage.h
***************
*** 403,408 **** extern Size PageGetExactFreeSpace(Page page);
--- 403,410 ----
extern Size PageGetHeapFreeSpace(Page page);
extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
+ extern void PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos,
+ int nitems);
extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
*** a/src/include/utils/selfuncs.h
--- b/src/include/utils/selfuncs.h
***************
*** 190,195 **** extern double estimate_num_groups(PlannerInfo *root, List *groupExprs,
--- 190,196 ----
extern Selectivity estimate_hash_bucketsize(PlannerInfo *root, Node *hashkey,
double nbuckets);
+ extern Datum brincostestimate(PG_FUNCTION_ARGS);
extern Datum btcostestimate(PG_FUNCTION_ARGS);
extern Datum hashcostestimate(PG_FUNCTION_ARGS);
extern Datum gistcostestimate(PG_FUNCTION_ARGS);
*** a/src/test/regress/expected/opr_sanity.out
--- b/src/test/regress/expected/opr_sanity.out
***************
*** 1658,1663 **** ORDER BY 1, 2, 3;
--- 1658,1668 ----
2742 | 9 | ?
2742 | 10 | ?|
2742 | 11 | ?&
+ 3580 | 1 | <
+ 3580 | 2 | <=
+ 3580 | 3 | =
+ 3580 | 4 | >=
+ 3580 | 5 | >
4000 | 1 | <<
4000 | 1 | ~<~
4000 | 2 | &<
***************
*** 1680,1686 **** ORDER BY 1, 2, 3;
4000 | 15 | >
4000 | 16 | @>
4000 | 18 | =
! (80 rows)
-- Check that all opclass search operators have selectivity estimators.
-- This is not absolutely required, but it seems a reasonable thing
--- 1685,1691 ----
4000 | 15 | >
4000 | 16 | @>
4000 | 18 | =
! (85 rows)
-- Check that all opclass search operators have selectivity estimators.
-- This is not absolutely required, but it seems a reasonable thing
***************
*** 1842,1852 **** WHERE NOT (
-- GIN has six support functions. 1-3 are mandatory, 5 is optional, and
-- at least one of 4 and 6 must be given.
-- SP-GiST has five support functions, all mandatory
amname = 'btree' AND procnums @> '{1}' OR
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}'
);
amname | opfname | amproclefttype | amprocrighttype | procnums
--------+---------+----------------+-----------------+----------
--- 1847,1859 ----
-- GIN has six support functions. 1-3 are mandatory, 5 is optional, and
-- at least one of 4 and 6 must be given.
-- SP-GiST has five support functions, all mandatory
+ -- BRIN has eight support functions, all mandatory
amname = 'btree' AND procnums @> '{1}' OR
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}' OR
! amname = 'brin' AND procnums = '{1, 2, 3, 4, 5, 6, 7, 8}'
);
amname | opfname | amproclefttype | amprocrighttype | procnums
--------+---------+----------------+-----------------+----------
***************
*** 1867,1873 **** WHERE NOT (
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}'
);
amname | opcname | procnums
--------+---------+----------
--- 1874,1881 ----
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}' OR
! amname = 'brin' AND procnums = '{1, 2, 3, 4, 5, 6, 7, 8}'
);
amname | opcname | procnums
--------+---------+----------
*** a/src/test/regress/sql/opr_sanity.sql
--- b/src/test/regress/sql/opr_sanity.sql
***************
*** 1195,1205 **** WHERE NOT (
-- GIN has six support functions. 1-3 are mandatory, 5 is optional, and
-- at least one of 4 and 6 must be given.
-- SP-GiST has five support functions, all mandatory
amname = 'btree' AND procnums @> '{1}' OR
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}'
);
-- Also, check if there are any pg_opclass entries that don't seem to have
--- 1195,1207 ----
-- GIN has six support functions. 1-3 are mandatory, 5 is optional, and
-- at least one of 4 and 6 must be given.
-- SP-GiST has five support functions, all mandatory
+ -- BRIN has eight support functions, all mandatory
amname = 'btree' AND procnums @> '{1}' OR
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}' OR
! amname = 'brin' AND procnums = '{1, 2, 3, 4, 5, 6, 7, 8}'
);
-- Also, check if there are any pg_opclass entries that don't seem to have
***************
*** 1218,1224 **** WHERE NOT (
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}'
);
-- Unfortunately, we can't check the amproc link very well because the
--- 1220,1227 ----
amname = 'hash' AND procnums = '{1}' OR
amname = 'gist' AND procnums @> '{1, 2, 3, 4, 5, 6, 7}' OR
amname = 'gin' AND (procnums @> '{1, 2, 3}' AND (procnums && '{4, 6}')) OR
! amname = 'spgist' AND procnums = '{1, 2, 3, 4, 5}' OR
! amname = 'brin' AND procnums = '{1, 2, 3, 4, 5, 6, 7, 8}'
);
-- Unfortunately, we can't check the amproc link very well because the
On Mon, September 8, 2014 18:02, Alvaro Herrera wrote:
Here's version 18. I have renamed it: These are now BRIN indexes.
I run into a BadArgument trap after running:
$ cat crash.sql
-- drop table if exists t_100_000_000 cascade;
create table t_100_000_000 as select cast(i as integer) from generate_series(1, 100000000) as f(i) ;
-- drop index if exists t_100_000_000_i_brin_idx;
create index t_100_000_000_i_brin_idx on t_100_000_000 using brin(i);
select pg_size_pretty(pg_relation_size('t_100_000_000_i_brin_idx'));
select i from t_100_000_000 where i between 10000 and 1009999; -- ( + 999999 )
Log file says:
TRAP: BadArgument("!(((context) != ((void *)0) && (((((const Node*)((context)))->type) == T_AllocSetContext))))", File:
"mcxt.c", Line: 752)
2014-09-08 19:54:46.071 CEST 30151 LOG: server process (PID 30336) was terminated by signal 6: Aborted
2014-09-08 19:54:46.071 CEST 30151 DETAIL: Failed process was running: select i from t_100_000_000 where i between 10000
and 1009999;
The crash is caused by the last select statement; the table and index creation are OK.
It only happens with a largish table; small tables are OK.
Linux / Centos / 32 GB.
PostgreSQL 9.5devel_minmax_20140908_1809_0640c1bfc091 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.9.1, 64-bit
setting | current_setting
--------------------------+--------------------------------------------
autovacuum | off
port | 6444
shared_buffers | 100MB
effective_cache_size | 4GB
work_mem | 10MB
maintenance_work_mem | 1GB
checkpoint_segments | 20
data_checksums | on
server_version | 9.5devel_minmax_20140908_1809_0640c1bfc091
pg_postmaster_start_time | 2014-09-08 19:53 (uptime: 0d 0h 6m 54s)
'--prefix=/var/data1/pg_stuff/pg_installations/pgsql.minmax' '--with-pgport=6444'
'--bindir=/var/data1/pg_stuff/pg_installations/pgsql.minmax/bin'
'--libdir=/var/data1/pg_stuff/pg_installations/pgsql.minmax/lib' '--enable-depend' '--enable-cassert' '--enable-debug'
'--with-perl' '--with-openssl' '--with-libxml' '--with-extra-version=_minmax_20140908_1809_0640c1bfc091'
pgpatches/0095/minmax/20140908/minmax-18.patch
thanks,
Erik Rijkers
Erik Rijkers wrote:
Log file says:
TRAP: BadArgument("!(((context) != ((void *)0) && (((((const Node*)((context)))->type) == T_AllocSetContext))))", File:
"mcxt.c", Line: 752)
2014-09-08 19:54:46.071 CEST 30151 LOG: server process (PID 30336) was terminated by signal 6: Aborted
2014-09-08 19:54:46.071 CEST 30151 DETAIL: Failed process was running: select i from t_100_000_000 where i between 10000
and 1009999;
A double-free mistake -- here's a patch. Thanks.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
minmax-18a.patch (text/x-diff; charset=us-ascii)
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index c89a167..6ac012c 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -388,10 +388,7 @@ bringetbitmap(PG_FUNCTION_ARGS)
PointerGetDatum(key));
addrange = DatumGetBool(add);
if (!addrange)
- {
- brin_free_tuple(tup);
break;
- }
}
}
On 09/08/2014 07:02 PM, Alvaro Herrera wrote:
Here's version 18. I have renamed it: These are now BRIN indexes.
I have fixed numerous race conditions and deadlocks. In particular I
fixed this problem you noted:
Heikki Linnakangas wrote:
Another race condition:
If a new tuple is inserted to the range while summarization runs,
it's possible that the new tuple isn't included in the tuple that
the summarization calculated, nor does the insertion itself update
it.
I did it mostly in the way you outlined, i.e. by way of a placeholder
tuple that gets updated by concurrent inserters and then the tuple
resulting from the scan is unioned with the values in the updated
placeholder tuple. This required the introduction of one extra support
proc for opclasses (pretty simple stuff anyhow).
Hmm. So the union support proc is only called if there is a race
condition? That makes it very difficult to test, I'm afraid.
It would make more sense to pass BrinValues to the support functions,
rather than DeformedBrTuple. An opclass's support function should never
need to access the values for other columns.
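For illustration, the per-column state could be packaged roughly like this (a
sketch only; the struct layout and field names below are my assumptions, not
the patch's actual definitions):

typedef struct BrinValues
{
    AttrNumber  bv_attno;       /* index attribute number */
    bool        bv_hasnulls;    /* any nulls seen in the page range? */
    bool        bv_allnulls;    /* only nulls seen in the page range? */
    Datum      *bv_values;      /* accumulated values, e.g. {min, max} */
} BrinValues;

/*
 * The union support proc would then see only the two BrinValues for the
 * column being merged, e.g.
 *
 *     minmax_union(BrinDesc *bdesc, BrinValues *accum, BrinValues *newvals)
 *
 * rather than two whole deformed index tuples.
 */

That keeps each opclass blind to the other indexed columns, which also makes
it harder to introduce cross-column bugs.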
Does minmaxUnion handle NULLs correctly?
minmaxUnion pfrees the old values. Is that necessary? What memory
context does the function run in? If the code runs in a short-lived
memory context, you might as well just let them leak. If it runs in a
long-lived context, well, perhaps it shouldn't. It's nicer to write
functions that can leak freely. IIRC, GiST and GIN runs the support
functions in a temporary context. In any case, it might be worth noting
explicitly in the docs which functions may leak and which may not.
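For example, the core code could wrap each support-proc call in a dedicated
short-lived context, the way GIN does. A minimal sketch, assuming placeholder
names (unionFn, bdesc, a, b are illustrative, not the patch's identifiers):

    MemoryContext tmpcxt;
    MemoryContext oldcxt;

    tmpcxt = AllocSetContextCreate(CurrentMemoryContext,
                                   "brin union temporary context",
                                   ALLOCSET_DEFAULT_MINSIZE,
                                   ALLOCSET_DEFAULT_INITSIZE,
                                   ALLOCSET_DEFAULT_MAXSIZE);
    oldcxt = MemoryContextSwitchTo(tmpcxt);

    /*
     * Call the opclass union proc; anything it pallocs lands in tmpcxt and
     * is reclaimed wholesale below, so the proc is free to leak.
     */
    FunctionCall3(unionFn,
                  PointerGetDatum(bdesc),
                  PointerGetDatum(a),
                  PointerGetDatum(b));

    MemoryContextSwitchTo(oldcxt);
    MemoryContextDelete(tmpcxt);

Anything the caller needs to keep (the unioned datums) would of course have to
be datumCopy'd into a longer-lived context before the temporary one goes away.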
If you add a new datatype, and define b-tree operators for it, what is
required to create a minmax opclass for it? Would it be possible to
generalize the functions in brin_minmax.c so that they can be reused for
any datatype (with b-tree operators) without writing any new C code? I
think we're almost there; the only thing that differs between each data
type is the opcinfo function. Let's pass the type OID as argument to the
opcinfo function. You could then have just a single minmax_opcinfo
function, instead of the macro to generate a separate function for each
built-in datatype.
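Something along these lines, perhaps (a rough sketch; BrinOpcInfo,
SizeofBrinOpcInfo and the oi_* fields are stand-ins for whatever the patch
actually defines):

Datum
minmax_opcinfo(PG_FUNCTION_ARGS)
{
    Oid         typoid = PG_GETARG_OID(0);
    BrinOpcInfo *result;

    /*
     * Two stored columns per indexed attribute: the minimum and the maximum,
     * both of the indexed column's own type.
     */
    result = palloc0(MAXALIGN(SizeofBrinOpcInfo(2)));
    result->oi_nstored = 2;
    result->oi_typids[0] = typoid;
    result->oi_typids[1] = typoid;

    PG_RETURN_POINTER(result);
}

The pg_amproc entries for every minmax opclass could then all point at the
same pg_proc row, and adding minmax support for a new datatype would need only
catalog entries, no new C code.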
In general, this patch is in pretty good shape now, thanks!
- Heikki
On 08/09/14 13:02, Alvaro Herrera wrote:
Here's version 18. I have renamed it: These are now BRIN indexes.
I have fixed numerous race conditions and deadlocks. In particular I
fixed this problem you noted:
Heikki Linnakangas wrote:
Another race condition:
If a new tuple is inserted to the range while summarization runs,
it's possible that the new tuple isn't included in the tuple that
the summarization calculated, nor does the insertion itself update
it.
I did it mostly in the way you outlined, i.e. by way of a placeholder
tuple that gets updated by concurrent inserters and then the tuple
resulting from the scan is unioned with the values in the updated
placeholder tuple. This required the introduction of one extra support
proc for opclasses (pretty simple stuff anyhow).
There should be only minor items left now, such as silencing the
WARNING: concurrent insert in progress within table "sales"
which is emitted by IndexBuildHeapScan (possibly thousands of times)
when doing a summarization of a range being inserted into or otherwise
modified. Basically the issue here is that IBHS assumes it's being run
with ShareLock in the heap (which blocks inserts), but here we're using
it with ShareUpdateExclusive only, which lets inserts in. There is no
harm AFAICS because of the placeholder tuple stuff I describe above.
Debugging VACUUM VERBOSE ANALYZE on a table with concurrent
updates/inserts.
(gdb)
Breakpoint 1, errfinish (dummy=0) at elog.c:411
411 ErrorData *edata = &errordata[errordata_stack_depth];
The complete backtrace is at http://pastebin.com/gkigSNm7
Also, I found pages with an unknown type (using default parameters for
the index creation):
brin_page_type | array_agg
----------------+-----------
unknown (00) | {3,4}
revmap | {1}
regular | {2}
meta | {0}
(4 rows)
--
Emanuel Calvo
@3manuek