[PATCH] reduce page overlap of GiST indexes built using sorted method
Hey!
Postgres 14 introduces an option to create a GiST index using a sort
method. It allows to create indexes much faster but as it had been
mentioned in sort support patch discussion the faster build performance
comes at cost of higher degree of overlap between pages than for indexes
built with regular method.
Sort support was implemented for GiST opclass in PostGIS but eventually got
removed as default behaviour in latest 3.2 release because as it had been
discovered by Paul Ramsey
https://lists.osgeo.org/pipermail/postgis-devel/2021-November/029225.html
performance of queries might degrade by 50%.
Together with Darafei Praliaskouski, Andrey Borodin and me we tried several
approaches to solve query performance degrade:
- The first attempt was try to decide whether to make a split depending
on direction of curve (Z-curve for Postgres geometry type, Hilbert curve
for PostGIS). It was implemented by filling page until fillfactor / 2 and
then checking penalty for every next item and keep inserting in current
page if penalty is 0 or start new page if penalty is not 0. It turned out
that with this approach index becomes significantly larger whereas pages
overlap still remains high.
- Andrey Borodin implemented LRU + split: a fixed amount of pages are
kept in memory and the best candidate page to insert the next item in is
selected by minimum penalty among these pages. If the best page for
insertion is full, it gets splited into multiple pages, and if the amount
of candidate pages after split exceeds the limit, the pages insertion to
which has not happened recently are flushed.
https://github.com/x4m/postgres_g/commit/0f2ed5f32a00f6c3019048e0c145b7ebda629e73.
We made some tests and while query performance speed using index built with
this approach is fine a size of index is extremely large.
Eventually we made implementation of an idea outlined in sort support patch
discussion here
/messages/by-id/08173bd0-488d-da76-a904-912c35da446b@iki.fi
that several pages can be collected and then divided into actual index
pages by calling picksplit. My benchmarks with data provided in
postgis-devel show that query performance using index built with patched
sort support is comparable with performance of query using index built with
regular method. The size of index is also matches size of index built with
non-sorting method.
It should be noted that with the current implementation of the sorting
build method, pages are always filled up to fillfactor. This patch changes
this behavior to what it would be if using a non-sorting method, and pages
are not always filled to fillfactor for the sake of query performance. I'm
interested in improving it and I wonder if there are any ideas on this.
Benchmark summary:
create index roads_rdr_idx on roads_rdr using gist (geom);
with sort support before patch / CREATE INDEX 76.709 ms
with sort support after patch / CREATE INDEX 225.238 ms
without sort support / CREATE INDEX 446.071 ms
select count(*) from roads_rdr a, roads_rdr b where a.geom && b.geom;
with sort support before patch / SELECT 5766.526 ms
with sort support after patch / SELECT 2646.554 ms
without sort support / SELECT 2721.718 ms
index size
with sort support before patch / IDXSIZE 2940928 bytes
with sort support after patch / IDXSIZE 4956160 bytes
without sort support / IDXSIZE 5447680 bytes
More detailed:
Before patch using sorted method:
postgres=# create index roads_rdr_geom_idx_sortsupport on roads_rdr using
gist(geom);
CREATE INDEX
Time: 76.709 ms
postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom &&
b.geom;
count
--------
505806
(1 row)
Time: 5766.526 ms (00:05.767)
postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom &&
b.geom;
count
--------
505806
(1 row)
Time: 5880.142 ms (00:05.880)
postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom &&
b.geom;
count
--------
505806
(1 row)
Time: 5778.437 ms (00:05.778)
postgres=# select gist_stat('roads_rdr_geom_idx_sortsupport');
gist_stat
------------------------------------------
Number of levels: 3 +
Number of pages: 359 +
Number of leaf pages: 356 +
Number of tuples: 93034 +
Number of invalid tuples: 0 +
Number of leaf tuples: 92676 +
Total size of tuples: 2609260 bytes+
Total size of leaf tuples: 2599200 bytes+
Total size of index: 2940928 bytes+
(1 row)
After patch using sorted method:
postgres=# create index roads_rdr_geom_idx_sortsupport on roads_rdr using
gist(geom);
CREATE INDEX
Time: 225.238 ms
postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom &&
b.geom;
count
--------
505806
(1 row)
Time: 2646.554 ms (00:02.647)
postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom &&
b.geom;
count
--------
505806
(1 row)
Time: 2499.107 ms (00:02.499)
postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom &&
b.geom;
count
--------
505806
(1 row)
Time: 2519.815 ms (00:02.520)
postgres=# select gist_stat('roads_rdr_geom_idx_sortsupport');
gist_stat
------------------------------------------
Number of levels: 3 +
Number of pages: 605 +
Number of leaf pages: 600 +
Number of tuples: 93280 +
Number of invalid tuples: 0 +
Number of leaf tuples: 92676 +
Total size of tuples: 2619100 bytes+
Total size of leaf tuples: 2602128 bytes+
Total size of index: 4956160 bytes+
(1 row)
With index built using default method:
postgres=# create index roads_rdr_geom_idx_no_sortsupport on roads_rdr
using gist(geom);
CREATE INDEX
Time: 446.071 ms
postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom &&
b.geom;
count
--------
505806
(1 row)
Time: 2721.718 ms (00:02.722)
postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom &&
b.geom;
count
--------
505806
(1 row)
Time: 3549.549 ms (00:03.550)
postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom &&
b.geom;
count
--------
505806
(1 row)
postgres=# select gist_stat('roads_rdr_geom_idx_no_sortsupport');
gist_stat
------------------------------------------
Number of levels: 3 +
Number of pages: 665 +
Number of leaf pages: 660 +
Number of tuples: 93340 +
Number of invalid tuples: 0 +
Number of leaf tuples: 92676 +
Total size of tuples: 2621500 bytes+
Total size of leaf tuples: 2602848 bytes+
Total size of index: 5447680 bytes+
(1 row)
Attachments:
reduce_page_overlap_of_gist_indexes_built_using_sorted_method.patchapplication/octet-stream; name=reduce_page_overlap_of_gist_indexes_built_using_sorted_method.patchDownload
diff --git a/contrib/pageinspect/expected/gist.out b/contrib/pageinspect/expected/gist.out
index 86c9e9caa9..8cb390feed 100644
--- a/contrib/pageinspect/expected/gist.out
+++ b/contrib/pageinspect/expected/gist.out
@@ -33,14 +33,15 @@ COMMIT;
SELECT * FROM gist_page_items(get_raw_page('test_gist_idx', 0), 'test_gist_idx');
itemoffset | ctid | itemlen | dead | keys
------------+-----------+---------+------+-------------------
- 1 | (1,65535) | 40 | f | (p)=((166,166))
- 2 | (2,65535) | 40 | f | (p)=((332,332))
- 3 | (3,65535) | 40 | f | (p)=((498,498))
- 4 | (4,65535) | 40 | f | (p)=((664,664))
- 5 | (5,65535) | 40 | f | (p)=((830,830))
- 6 | (6,65535) | 40 | f | (p)=((996,996))
- 7 | (7,65535) | 40 | f | (p)=((1000,1000))
-(7 rows)
+ 1 | (1,65535) | 40 | f | (p)=((125,125))
+ 2 | (2,65535) | 40 | f | (p)=((250,250))
+ 3 | (3,65535) | 40 | f | (p)=((375,375))
+ 4 | (4,65535) | 40 | f | (p)=((500,500))
+ 5 | (5,65535) | 40 | f | (p)=((625,625))
+ 6 | (6,65535) | 40 | f | (p)=((750,750))
+ 7 | (7,65535) | 40 | f | (p)=((875,875))
+ 8 | (8,65535) | 40 | f | (p)=((1000,1000))
+(8 rows)
SELECT * FROM gist_page_items(get_raw_page('test_gist_idx', 1), 'test_gist_idx') LIMIT 5;
itemoffset | ctid | itemlen | dead | keys
@@ -64,6 +65,7 @@ SELECT itemoffset, ctid, itemlen FROM gist_page_items_bytea(get_raw_page('test_g
5 | (5,65535) | 40
6 | (6,65535) | 40
7 | (7,65535) | 40
-(7 rows)
+ 8 | (8,65535) | 40
+(8 rows)
DROP TABLE test_gist;
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index baad28c09f..d686278476 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -120,7 +120,9 @@ typedef struct
*/
typedef struct GistSortedBuildPageState
{
- Page page;
+ Page pages[8];
+ int current_page;
+ BlockNumber prevpage_blkno;
struct GistSortedBuildPageState *parent; /* Upper level, if any */
} GistSortedBuildPageState;
@@ -415,8 +417,8 @@ gist_indexsortbuild(GISTBuildState *state)
state->pages_written++;
/* Allocate a temporary buffer for the first leaf page. */
- leafstate = palloc(sizeof(GistSortedBuildPageState));
- leafstate->page = page;
+ leafstate = palloc0(sizeof(GistSortedBuildPageState));
+ leafstate->pages[0] = page;
leafstate->parent = NULL;
gistinitpage(page, F_LEAF);
@@ -435,13 +437,15 @@ gist_indexsortbuild(GISTBuildState *state)
* Keep in mind that flush can build a new root.
*/
pagestate = leafstate;
- while (pagestate->parent != NULL)
+ while (pagestate->parent != NULL || pagestate->current_page != 0)
{
GistSortedBuildPageState *parent;
gist_indexsortbuild_pagestate_flush(state, pagestate);
parent = pagestate->parent;
- pfree(pagestate->page);
+ for (int i = 0; i < 8; i++)
+ if (pagestate->pages[i])
+ pfree(pagestate->pages[i]);
pfree(pagestate);
pagestate = parent;
}
@@ -449,15 +453,15 @@ gist_indexsortbuild(GISTBuildState *state)
gist_indexsortbuild_flush_ready_pages(state);
/* Write out the root */
- PageSetLSN(pagestate->page, GistBuildLSN);
- PageSetChecksumInplace(pagestate->page, GIST_ROOT_BLKNO);
+ PageSetLSN(pagestate->pages[0], GistBuildLSN);
+ PageSetChecksumInplace(pagestate->pages[0], GIST_ROOT_BLKNO);
smgrwrite(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
- pagestate->page, true);
+ pagestate->pages[0], true);
if (RelationNeedsWAL(state->indexrel))
log_newpage(&state->indexrel->rd_node, MAIN_FORKNUM, GIST_ROOT_BLKNO,
- pagestate->page, true);
+ pagestate->pages[0], true);
- pfree(pagestate->page);
+ pfree(pagestate->pages[0]);
pfree(pagestate);
}
@@ -474,10 +478,24 @@ gist_indexsortbuild_pagestate_add(GISTBuildState *state,
/* Does the tuple fit? If not, flush */
sizeNeeded = IndexTupleSize(itup) + sizeof(ItemIdData) + state->freespace;
- if (PageGetFreeSpace(pagestate->page) < sizeNeeded)
- gist_indexsortbuild_pagestate_flush(state, pagestate);
+ if (PageGetFreeSpace(pagestate->pages[pagestate->current_page]) < sizeNeeded)
+ {
+ Page newPage;
+ Page old_page = pagestate->pages[pagestate->current_page];
+ uint16 old_page_flags = GistPageGetOpaque(old_page)->flags;
+ if (pagestate->current_page + 1 == 8)
+ gist_indexsortbuild_pagestate_flush(state, pagestate);
+ else
+ pagestate->current_page++;
+
+ if (pagestate->pages[pagestate->current_page] == NULL)
+ pagestate->pages[pagestate->current_page] = palloc(BLCKSZ);
+
+ newPage = pagestate->pages[pagestate->current_page];
+ gistinitpage(newPage, old_page_flags);
+ }
- gistfillbuffer(pagestate->page, &itup, 1, InvalidOffsetNumber);
+ gistfillbuffer(pagestate->pages[pagestate->current_page], &itup, 1, InvalidOffsetNumber);
}
static void
@@ -485,73 +503,106 @@ gist_indexsortbuild_pagestate_flush(GISTBuildState *state,
GistSortedBuildPageState *pagestate)
{
GistSortedBuildPageState *parent;
- IndexTuple *itvec;
- IndexTuple union_tuple;
- int vect_len;
- bool isleaf;
- BlockNumber blkno;
+ BlockNumber blkno;
MemoryContext oldCtx;
+ IndexTuple union_tuple;
+ SplitedPageLayout *dist;
+ IndexTuple *itvec;
+ int vect_len;
+ bool isleaf = GistPageIsLeaf(pagestate->pages[0]);
- /* check once per page */
+ /* check once per whatever */
CHECK_FOR_INTERRUPTS();
- if (state->ready_num_pages == XLR_MAX_BLOCK_ID)
- gist_indexsortbuild_flush_ready_pages(state);
-
- /*
- * The page is now complete. Assign a block number to it, and add it to
- * the list of finished pages. (We don't write it out immediately, because
- * we want to WAL-log the pages in batches.)
- */
- blkno = state->pages_allocated++;
- state->ready_blknos[state->ready_num_pages] = blkno;
- state->ready_pages[state->ready_num_pages] = pagestate->page;
- state->ready_num_pages++;
+ oldCtx = MemoryContextSwitchTo(state->giststate->tempCxt);
- isleaf = GistPageIsLeaf(pagestate->page);
+ itvec = gistextractpage(pagestate->pages[0], &vect_len);
+ if (pagestate->current_page > 0)
+ {
+ for (int i = 1; i < pagestate->current_page + 1; i++)
+ {
+ int len_local;
+ IndexTuple *itvec_local = gistextractpage(pagestate->pages[i], &len_local);
+ itvec = gistjoinvector(itvec, &vect_len, itvec_local, len_local);
+ }
- /*
- * Form a downlink tuple to represent all the tuples on the page.
- */
- oldCtx = MemoryContextSwitchTo(state->giststate->tempCxt);
- itvec = gistextractpage(pagestate->page, &vect_len);
- union_tuple = gistunion(state->indexrel, itvec, vect_len,
+ dist = gistSplit(state->indexrel, pagestate->pages[0], itvec, vect_len, state->giststate);
+ }
+ else
+ {
+ dist = (SplitedPageLayout*) palloc0(sizeof(SplitedPageLayout));
+ union_tuple = gistunion(state->indexrel, itvec, vect_len,
state->giststate);
- ItemPointerSetBlockNumber(&(union_tuple->t_tid), blkno);
+ dist->itup = union_tuple;
+ dist->list = gistfillitupvec(itvec, vect_len, &(dist->lenlist));
+ dist->block.num = vect_len;
+ }
+
MemoryContextSwitchTo(oldCtx);
- /*
- * Insert the downlink to the parent page. If this was the root, create a
- * new page as the parent, which becomes the new root.
- */
- parent = pagestate->parent;
- if (parent == NULL)
+ pagestate->current_page = 0;
+
+ for (;dist != NULL; dist = dist->next)
{
- parent = palloc(sizeof(GistSortedBuildPageState));
- parent->page = (Page) palloc(BLCKSZ);
- parent->parent = NULL;
- gistinitpage(parent->page, 0);
+ char *data = (char *) (dist->list);
+ Page target = palloc0(BLCKSZ);
+ gistinitpage(target, isleaf? F_LEAF:0);
+ for (int i = 0; i < dist->block.num; i++)
+ {
+ IndexTuple thistup = (IndexTuple) data;
- pagestate->parent = parent;
- }
- gist_indexsortbuild_pagestate_add(state, parent, union_tuple);
+ if (PageAddItem(target, (Item) data, IndexTupleSize(thistup), i + FirstOffsetNumber, false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add item to index page in \"%s\"", RelationGetRelationName(state->indexrel));
- /* Re-initialize the page buffer for next page on this level. */
- pagestate->page = palloc(BLCKSZ);
- gistinitpage(pagestate->page, isleaf ? F_LEAF : 0);
+ data += IndexTupleSize(thistup);
+ }
+ union_tuple = dist->itup;
- /*
- * Set the right link to point to the previous page. This is just for
- * debugging purposes: GiST only follows the right link if a page is split
- * concurrently to a scan, and that cannot happen during index build.
- *
- * It's a bit counterintuitive that we set the right link on the new page
- * to point to the previous page, and not the other way round. But GiST
- * pages are not ordered like B-tree pages are, so as long as the
- * right-links form a chain through all the pages in the same level, the
- * order doesn't matter.
- */
- GistPageGetOpaque(pagestate->page)->rightlink = blkno;
+ if (state->ready_num_pages == XLR_MAX_BLOCK_ID)
+ gist_indexsortbuild_flush_ready_pages(state);
+
+ /*
+ * The page is now complete. Assign a block number to it, and add it to
+ * the list of finished pages. (We don't write it out immediately, because
+ * we want to WAL-log the pages in batches.)
+ */
+ blkno = state->pages_allocated++;
+ state->ready_blknos[state->ready_num_pages] = blkno;
+ state->ready_pages[state->ready_num_pages] = target;
+ state->ready_num_pages++;
+ ItemPointerSetBlockNumber(&(union_tuple->t_tid), blkno);
+
+ /*
+ * Insert the downlink to the parent page. If this was the root, create a
+ * new page as the parent, which becomes the new root.
+ */
+ parent = pagestate->parent;
+ if (parent == NULL)
+ {
+ parent = palloc0(sizeof(GistSortedBuildPageState));
+ parent->pages[0] = (Page) palloc(BLCKSZ);
+ parent->parent = NULL;
+ gistinitpage(parent->pages[0], 0);
+
+ pagestate->parent = parent;
+ }
+ gist_indexsortbuild_pagestate_add(state, parent, union_tuple);
+
+ /*
+ * Set the right link to point to the previous page. This is just for
+ * debugging purposes: GiST only follows the right link if a page is split
+ * concurrently to a scan, and that cannot happen during index build.
+ *
+ * It's a bit counterintuitive that we set the right link on the new page
+ * to point to the previous page, and not the other way round. But GiST
+ * pages are not ordered like B-tree pages are, so as long as the
+ * right-links form a chain through all the pages in the same level, the
+ * order doesn't matter.
+ */
+ if (pagestate->prevpage_blkno)
+ GistPageGetOpaque(target)->rightlink = pagestate->prevpage_blkno;
+ pagestate->prevpage_blkno = blkno;
+ }
}
static void
Hi Aliaksandr!
Thanks for working on this!
Benchmark summary:
create index roads_rdr_idx on roads_rdr using gist (geom);
with sort support before patch / CREATE INDEX 76.709 ms
with sort support after patch / CREATE INDEX 225.238 ms
without sort support / CREATE INDEX 446.071 ms
select count(*) from roads_rdr a, roads_rdr b where a.geom && b.geom;
with sort support before patch / SELECT 5766.526 ms
with sort support after patch / SELECT 2646.554 ms
without sort support / SELECT 2721.718 ms
index size
with sort support before patch / IDXSIZE 2940928 bytes
with sort support after patch / IDXSIZE 4956160 bytes
without sort support / IDXSIZE 5447680 bytes
The numbers are impressive, newly build index is actually performing better!
I've conducted some tests over points, not PostGIS geometry. For points build is 2x slower now, but this is the cost of faster scans.
Some strong points of this index building technology.
The proposed algorithm is not randomly chosen as anything that performs better than the original sorting build. We actually researched every idea we knew from literature and intuition. Although we consciously limited the search area by existing GiST API.
Stuff like penalty-based choose-subtree and split in equal halves seriously limit possible solutions. If anyone knows an any better way to build GiST faster or with better scan performance - please let us know.
The proposed algorithm contains the current algorithm as a special case: there is a parameter - the number of buffers accumulated before calling Split. If this parameter is 1 proposed algorithm will produce exactly the same output.
At this stage, we would like to hear some feedback from Postgres and PostGIS community. What other performance aspects should we test?
Current patch implementation has some known deficiencies:
1. We need a GUC instead of the hard-coded buffer of 8 pages.
2. Is GiST sorting build still deterministic? If not - we should add a fixed random seed into pageinspect tests.
3. It would make sense to check the resulting indexes with amcheck [0]/messages/by-id/59D0DA6B-1652-4D44-B0EF-A582D5824F83@yandex-team.ru, although it's not committed.
4. We cannot make an exact fillfactor due to the behavior of picksplit. But can we improve anything here? I think if not - it's still OK.
5. GistSortedBuildPageState is no more page state. It's Level state or something like that.
6. The patch desperately needs comments.
Thanks!
Best regards, Andrey Borodin.
[0]: /messages/by-id/59D0DA6B-1652-4D44-B0EF-A582D5824F83@yandex-team.ru
After further testing, here is v2 where the issue that rightlink can be set
when an index page is already flushed is fixed.
On Sat, Dec 25, 2021 at 4:35 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
Show quoted text
Hi Aliaksandr!
Thanks for working on this!
Benchmark summary:
create index roads_rdr_idx on roads_rdr using gist (geom);
with sort support before patch / CREATE INDEX 76.709 ms
with sort support after patch / CREATE INDEX 225.238 ms
without sort support / CREATE INDEX 446.071 ms
select count(*) from roads_rdr a, roads_rdr b where a.geom && b.geom;
with sort support before patch / SELECT 5766.526 ms
with sort support after patch / SELECT 2646.554 ms
without sort support / SELECT 2721.718 ms
index size
with sort support before patch / IDXSIZE 2940928 bytes
with sort support after patch / IDXSIZE 4956160 bytes
without sort support / IDXSIZE 5447680 bytes
The numbers are impressive, newly build index is actually performing
better!
I've conducted some tests over points, not PostGIS geometry. For points
build is 2x slower now, but this is the cost of faster scans.Some strong points of this index building technology.
The proposed algorithm is not randomly chosen as anything that performs
better than the original sorting build. We actually researched every idea
we knew from literature and intuition. Although we consciously limited the
search area by existing GiST API.
Stuff like penalty-based choose-subtree and split in equal halves
seriously limit possible solutions. If anyone knows an any better way to
build GiST faster or with better scan performance - please let us know.
The proposed algorithm contains the current algorithm as a special case:
there is a parameter - the number of buffers accumulated before calling
Split. If this parameter is 1 proposed algorithm will produce exactly the
same output.At this stage, we would like to hear some feedback from Postgres and
PostGIS community. What other performance aspects should we test?Current patch implementation has some known deficiencies:
1. We need a GUC instead of the hard-coded buffer of 8 pages.
2. Is GiST sorting build still deterministic? If not - we should add a
fixed random seed into pageinspect tests.
3. It would make sense to check the resulting indexes with amcheck [0],
although it's not committed.
4. We cannot make an exact fillfactor due to the behavior of picksplit.
But can we improve anything here? I think if not - it's still OK.
5. GistSortedBuildPageState is no more page state. It's Level state or
something like that.
6. The patch desperately needs comments.Thanks!
Best regards, Andrey Borodin.
[0]
/messages/by-id/59D0DA6B-1652-4D44-B0EF-A582D5824F83@yandex-team.ru
Attachments:
02_reduce_page_overlap_of_gist_indexes_built_using_sorted_method.patchapplication/octet-stream; name=02_reduce_page_overlap_of_gist_indexes_built_using_sorted_method.patchDownload
diff --git a/contrib/pageinspect/expected/gist.out b/contrib/pageinspect/expected/gist.out
index 86c9e9caa9..8cb390feed 100644
--- a/contrib/pageinspect/expected/gist.out
+++ b/contrib/pageinspect/expected/gist.out
@@ -33,14 +33,15 @@ COMMIT;
SELECT * FROM gist_page_items(get_raw_page('test_gist_idx', 0), 'test_gist_idx');
itemoffset | ctid | itemlen | dead | keys
------------+-----------+---------+------+-------------------
- 1 | (1,65535) | 40 | f | (p)=((166,166))
- 2 | (2,65535) | 40 | f | (p)=((332,332))
- 3 | (3,65535) | 40 | f | (p)=((498,498))
- 4 | (4,65535) | 40 | f | (p)=((664,664))
- 5 | (5,65535) | 40 | f | (p)=((830,830))
- 6 | (6,65535) | 40 | f | (p)=((996,996))
- 7 | (7,65535) | 40 | f | (p)=((1000,1000))
-(7 rows)
+ 1 | (1,65535) | 40 | f | (p)=((125,125))
+ 2 | (2,65535) | 40 | f | (p)=((250,250))
+ 3 | (3,65535) | 40 | f | (p)=((375,375))
+ 4 | (4,65535) | 40 | f | (p)=((500,500))
+ 5 | (5,65535) | 40 | f | (p)=((625,625))
+ 6 | (6,65535) | 40 | f | (p)=((750,750))
+ 7 | (7,65535) | 40 | f | (p)=((875,875))
+ 8 | (8,65535) | 40 | f | (p)=((1000,1000))
+(8 rows)
SELECT * FROM gist_page_items(get_raw_page('test_gist_idx', 1), 'test_gist_idx') LIMIT 5;
itemoffset | ctid | itemlen | dead | keys
@@ -64,6 +65,7 @@ SELECT itemoffset, ctid, itemlen FROM gist_page_items_bytea(get_raw_page('test_g
5 | (5,65535) | 40
6 | (6,65535) | 40
7 | (7,65535) | 40
-(7 rows)
+ 8 | (8,65535) | 40
+(8 rows)
DROP TABLE test_gist;
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index baad28c09f..069695f6e9 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -120,7 +120,9 @@ typedef struct
*/
typedef struct GistSortedBuildPageState
{
- Page page;
+ Page pages[8];
+ int current_page;
+ BlockNumber prevpage_blkno;
struct GistSortedBuildPageState *parent; /* Upper level, if any */
} GistSortedBuildPageState;
@@ -415,8 +417,8 @@ gist_indexsortbuild(GISTBuildState *state)
state->pages_written++;
/* Allocate a temporary buffer for the first leaf page. */
- leafstate = palloc(sizeof(GistSortedBuildPageState));
- leafstate->page = page;
+ leafstate = palloc0(sizeof(GistSortedBuildPageState));
+ leafstate->pages[0] = page;
leafstate->parent = NULL;
gistinitpage(page, F_LEAF);
@@ -435,13 +437,15 @@ gist_indexsortbuild(GISTBuildState *state)
* Keep in mind that flush can build a new root.
*/
pagestate = leafstate;
- while (pagestate->parent != NULL)
+ while (pagestate->parent != NULL || pagestate->current_page != 0)
{
GistSortedBuildPageState *parent;
gist_indexsortbuild_pagestate_flush(state, pagestate);
parent = pagestate->parent;
- pfree(pagestate->page);
+ for (int i = 0; i < 8; i++)
+ if (pagestate->pages[i])
+ pfree(pagestate->pages[i]);
pfree(pagestate);
pagestate = parent;
}
@@ -449,15 +453,15 @@ gist_indexsortbuild(GISTBuildState *state)
gist_indexsortbuild_flush_ready_pages(state);
/* Write out the root */
- PageSetLSN(pagestate->page, GistBuildLSN);
- PageSetChecksumInplace(pagestate->page, GIST_ROOT_BLKNO);
+ PageSetLSN(pagestate->pages[0], GistBuildLSN);
+ PageSetChecksumInplace(pagestate->pages[0], GIST_ROOT_BLKNO);
smgrwrite(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
- pagestate->page, true);
+ pagestate->pages[0], true);
if (RelationNeedsWAL(state->indexrel))
log_newpage(&state->indexrel->rd_node, MAIN_FORKNUM, GIST_ROOT_BLKNO,
- pagestate->page, true);
+ pagestate->pages[0], true);
- pfree(pagestate->page);
+ pfree(pagestate->pages[0]);
pfree(pagestate);
}
@@ -474,10 +478,24 @@ gist_indexsortbuild_pagestate_add(GISTBuildState *state,
/* Does the tuple fit? If not, flush */
sizeNeeded = IndexTupleSize(itup) + sizeof(ItemIdData) + state->freespace;
- if (PageGetFreeSpace(pagestate->page) < sizeNeeded)
- gist_indexsortbuild_pagestate_flush(state, pagestate);
+ if (PageGetFreeSpace(pagestate->pages[pagestate->current_page]) < sizeNeeded)
+ {
+ Page newPage;
+ Page old_page = pagestate->pages[pagestate->current_page];
+ uint16 old_page_flags = GistPageGetOpaque(old_page)->flags;
+ if (pagestate->current_page + 1 == 8)
+ gist_indexsortbuild_pagestate_flush(state, pagestate);
+ else
+ pagestate->current_page++;
+
+ if (pagestate->pages[pagestate->current_page] == NULL)
+ pagestate->pages[pagestate->current_page] = palloc(BLCKSZ);
+
+ newPage = pagestate->pages[pagestate->current_page];
+ gistinitpage(newPage, old_page_flags);
+ }
- gistfillbuffer(pagestate->page, &itup, 1, InvalidOffsetNumber);
+ gistfillbuffer(pagestate->pages[pagestate->current_page], &itup, 1, InvalidOffsetNumber);
}
static void
@@ -485,73 +503,106 @@ gist_indexsortbuild_pagestate_flush(GISTBuildState *state,
GistSortedBuildPageState *pagestate)
{
GistSortedBuildPageState *parent;
- IndexTuple *itvec;
- IndexTuple union_tuple;
- int vect_len;
- bool isleaf;
- BlockNumber blkno;
+ BlockNumber blkno;
MemoryContext oldCtx;
+ IndexTuple union_tuple;
+ SplitedPageLayout *dist;
+ IndexTuple *itvec;
+ int vect_len;
+ bool isleaf = GistPageIsLeaf(pagestate->pages[0]);
- /* check once per page */
+ /* check once per whatever */
CHECK_FOR_INTERRUPTS();
- if (state->ready_num_pages == XLR_MAX_BLOCK_ID)
- gist_indexsortbuild_flush_ready_pages(state);
-
- /*
- * The page is now complete. Assign a block number to it, and add it to
- * the list of finished pages. (We don't write it out immediately, because
- * we want to WAL-log the pages in batches.)
- */
- blkno = state->pages_allocated++;
- state->ready_blknos[state->ready_num_pages] = blkno;
- state->ready_pages[state->ready_num_pages] = pagestate->page;
- state->ready_num_pages++;
+ oldCtx = MemoryContextSwitchTo(state->giststate->tempCxt);
- isleaf = GistPageIsLeaf(pagestate->page);
+ itvec = gistextractpage(pagestate->pages[0], &vect_len);
+ if (pagestate->current_page > 0)
+ {
+ for (int i = 1; i < pagestate->current_page + 1; i++)
+ {
+ int len_local;
+ IndexTuple *itvec_local = gistextractpage(pagestate->pages[i], &len_local);
+ itvec = gistjoinvector(itvec, &vect_len, itvec_local, len_local);
+ }
- /*
- * Form a downlink tuple to represent all the tuples on the page.
- */
- oldCtx = MemoryContextSwitchTo(state->giststate->tempCxt);
- itvec = gistextractpage(pagestate->page, &vect_len);
- union_tuple = gistunion(state->indexrel, itvec, vect_len,
+ dist = gistSplit(state->indexrel, pagestate->pages[0], itvec, vect_len, state->giststate);
+ }
+ else
+ {
+ dist = (SplitedPageLayout*) palloc0(sizeof(SplitedPageLayout));
+ union_tuple = gistunion(state->indexrel, itvec, vect_len,
state->giststate);
- ItemPointerSetBlockNumber(&(union_tuple->t_tid), blkno);
+ dist->itup = union_tuple;
+ dist->list = gistfillitupvec(itvec, vect_len, &(dist->lenlist));
+ dist->block.num = vect_len;
+ }
+
MemoryContextSwitchTo(oldCtx);
- /*
- * Insert the downlink to the parent page. If this was the root, create a
- * new page as the parent, which becomes the new root.
- */
- parent = pagestate->parent;
- if (parent == NULL)
+ pagestate->current_page = 0;
+
+ for (;dist != NULL; dist = dist->next)
{
- parent = palloc(sizeof(GistSortedBuildPageState));
- parent->page = (Page) palloc(BLCKSZ);
- parent->parent = NULL;
- gistinitpage(parent->page, 0);
+ char *data = (char *) (dist->list);
+ Page target = palloc0(BLCKSZ);
+ gistinitpage(target, isleaf? F_LEAF:0);
+ for (int i = 0; i < dist->block.num; i++)
+ {
+ IndexTuple thistup = (IndexTuple) data;
- pagestate->parent = parent;
- }
- gist_indexsortbuild_pagestate_add(state, parent, union_tuple);
+ if (PageAddItem(target, (Item) data, IndexTupleSize(thistup), i + FirstOffsetNumber, false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add item to index page in \"%s\"", RelationGetRelationName(state->indexrel));
- /* Re-initialize the page buffer for next page on this level. */
- pagestate->page = palloc(BLCKSZ);
- gistinitpage(pagestate->page, isleaf ? F_LEAF : 0);
+ data += IndexTupleSize(thistup);
+ }
+ union_tuple = dist->itup;
- /*
- * Set the right link to point to the previous page. This is just for
- * debugging purposes: GiST only follows the right link if a page is split
- * concurrently to a scan, and that cannot happen during index build.
- *
- * It's a bit counterintuitive that we set the right link on the new page
- * to point to the previous page, and not the other way round. But GiST
- * pages are not ordered like B-tree pages are, so as long as the
- * right-links form a chain through all the pages in the same level, the
- * order doesn't matter.
- */
- GistPageGetOpaque(pagestate->page)->rightlink = blkno;
+ if (state->ready_num_pages == XLR_MAX_BLOCK_ID)
+ gist_indexsortbuild_flush_ready_pages(state);
+
+ /*
+ * The page is now complete. Assign a block number to it, and add it to
+ * the list of finished pages. (We don't write it out immediately, because
+ * we want to WAL-log the pages in batches.)
+ */
+ blkno = state->pages_allocated++;
+ state->ready_blknos[state->ready_num_pages] = blkno;
+ state->ready_pages[state->ready_num_pages] = target;
+ state->ready_num_pages++;
+ ItemPointerSetBlockNumber(&(union_tuple->t_tid), blkno);
+
+ /*
+ * Set the right link to point to the previous page. This is just for
+ * debugging purposes: GiST only follows the right link if a page is split
+ * concurrently to a scan, and that cannot happen during index build.
+ *
+ * It's a bit counterintuitive that we set the right link on the new page
+ * to point to the previous page, and not the other way round. But GiST
+ * pages are not ordered like B-tree pages are, so as long as the
+ * right-links form a chain through all the pages in the same level, the
+ * order doesn't matter.
+ */
+ if (pagestate->prevpage_blkno)
+ GistPageGetOpaque(target)->rightlink = pagestate->prevpage_blkno;
+ pagestate->prevpage_blkno = blkno;
+
+ /*
+ * Insert the downlink to the parent page. If this was the root, create a
+ * new page as the parent, which becomes the new root.
+ */
+ parent = pagestate->parent;
+ if (parent == NULL)
+ {
+ parent = palloc0(sizeof(GistSortedBuildPageState));
+ parent->pages[0] = (Page) palloc(BLCKSZ);
+ parent->parent = NULL;
+ gistinitpage(parent->pages[0], 0);
+
+ pagestate->parent = parent;
+ }
+ gist_indexsortbuild_pagestate_add(state, parent, union_tuple);
+ }
}
static void
Hi,
here are some benchmark results for GiST patch:
https://gist.github.com/mngr777/88ae200c9c30ba5656583d92e8d2cf9e
Code used for benchmarking: https://github.com/mngr777/pg_index_bm2,
see README for the list of test queries.
Results show query performance close to indexes built with no
sortsupport function, with index creation time reduced approx. by half.
Show quoted text
On 1/8/22 10:20 PM, Aliaksandr Kalenik wrote:
After further testing, here is v2 where the issue that rightlink can be
set when an index page is already flushed is fixed.On Sat, Dec 25, 2021 at 4:35 PM Andrey Borodin <x4mmm@yandex-team.ru
<mailto:x4mmm@yandex-team.ru>> wrote:Hi Aliaksandr!
Thanks for working on this!
Benchmark summary:
create index roads_rdr_idx on roads_rdr using gist (geom);
with sort support before patch / CREATE INDEX 76.709 ms
with sort support after patch / CREATE INDEX 225.238 ms
without sort support / CREATE INDEX 446.071 ms
select count(*) from roads_rdr a, roads_rdr b where a.geom && b.geom;
with sort support before patch / SELECT 5766.526 ms
with sort support after patch / SELECT 2646.554 ms
without sort support / SELECT 2721.718 ms
index size
with sort support before patch / IDXSIZE 2940928 bytes
with sort support after patch / IDXSIZE 4956160 bytes
without sort support / IDXSIZE 5447680 bytes
The numbers are impressive, newly build index is actually performing
better!
I've conducted some tests over points, not PostGIS geometry. For
points build is 2x slower now, but this is the cost of faster scans.Some strong points of this index building technology.
The proposed algorithm is not randomly chosen as anything that
performs better than the original sorting build. We actually
researched every idea we knew from literature and intuition.
Although we consciously limited the search area by existing GiST API.
Stuff like penalty-based choose-subtree and split in equal halves
seriously limit possible solutions. If anyone knows an any better
way to build GiST faster or with better scan performance - please
let us know.
The proposed algorithm contains the current algorithm as a special
case: there is a parameter - the number of buffers accumulated
before calling Split. If this parameter is 1 proposed algorithm will
produce exactly the same output.At this stage, we would like to hear some feedback from Postgres and
PostGIS community. What other performance aspects should we test?Current patch implementation has some known deficiencies:
1. We need a GUC instead of the hard-coded buffer of 8 pages.
2. Is GiST sorting build still deterministic? If not - we should add
a fixed random seed into pageinspect tests.
3. It would make sense to check the resulting indexes with amcheck
[0], although it's not committed.
4. We cannot make an exact fillfactor due to the behavior of
picksplit. But can we improve anything here? I think if not - it's
still OK.
5. GistSortedBuildPageState is no more page state. It's Level state
or something like that.
6. The patch desperately needs comments.Thanks!
Best regards, Andrey Borodin.
[0]
/messages/by-id/59D0DA6B-1652-4D44-B0EF-A582D5824F83@yandex-team.ru
</messages/by-id/59D0DA6B-1652-4D44-B0EF-A582D5824F83@yandex-team.ru>
Hi Aliaksandr,
Nice work on this. I've been following it a bit since the regression when
it was noted and it sparked renewed interest in R-tree structure and
optimization for me.
As for ideas. I'm not deep into details of postgresql and gist, but I've
learned that the node size for gist indexes are equal to the page size, and
as such they are quite large in this context. This is what caused the
overly flat tree when increasing the fill factor, if I'm not mistaken, and
why it helps to decrease the fill factor.
I think there is a large potential if tree node size could be tweaked to
fit the use case and combine higher fill factor with optimal tree depth.
It's data dependent so it would even make sense to make it an index
parameter, if possible.
There might be some deep reason in the architecture that I'm unaware of
that could make it difficult to affect the node size but regardless, I
believe there could be a substantial win if node size could be controlled.
Regards,
Björn
Den mån 17 jan. 2022 kl 23:46 skrev Aliaksandr Kalenik <akalenik@kontur.io>:
Show quoted text
Hey!
Postgres 14 introduces an option to create a GiST index using a sort
method. It allows to create indexes much faster but as it had been
mentioned in sort support patch discussion the faster build performance
comes at cost of higher degree of overlap between pages than for indexes
built with regular method.Sort support was implemented for GiST opclass in PostGIS but eventually
got removed as default behaviour in latest 3.2 release because as it had
been discovered by Paul Ramsey
https://lists.osgeo.org/pipermail/postgis-devel/2021-November/029225.html
performance of queries might degrade by 50%.Together with Darafei Praliaskouski, Andrey Borodin and me we tried
several approaches to solve query performance degrade:- The first attempt was try to decide whether to make a split
depending on direction of curve (Z-curve for Postgres geometry type,
Hilbert curve for PostGIS). It was implemented by filling page until
fillfactor / 2 and then checking penalty for every next item and keep
inserting in current page if penalty is 0 or start new page if penalty is
not 0. It turned out that with this approach index becomes significantly
larger whereas pages overlap still remains high.
- Andrey Borodin implemented LRU + split: a fixed amount of pages are
kept in memory and the best candidate page to insert the next item in is
selected by minimum penalty among these pages. If the best page for
insertion is full, it gets splited into multiple pages, and if the amount
of candidate pages after split exceeds the limit, the pages insertion to
which has not happened recently are flushed.
https://github.com/x4m/postgres_g/commit/0f2ed5f32a00f6c3019048e0c145b7ebda629e73.
We made some tests and while query performance speed using index built with
this approach is fine a size of index is extremely large.Eventually we made implementation of an idea outlined in sort support
patch discussion here
/messages/by-id/08173bd0-488d-da76-a904-912c35da446b@iki.fi
that several pages can be collected and then divided into actual index
pages by calling picksplit. My benchmarks with data provided in
postgis-devel show that query performance using index built with patched
sort support is comparable with performance of query using index built with
regular method. The size of index is also matches size of index built with
non-sorting method.It should be noted that with the current implementation of the sorting
build method, pages are always filled up to fillfactor. This patch changes
this behavior to what it would be if using a non-sorting method, and pages
are not always filled to fillfactor for the sake of query performance. I'm
interested in improving it and I wonder if there are any ideas on this.Benchmark summary:
create index roads_rdr_idx on roads_rdr using gist (geom);
with sort support before patch / CREATE INDEX 76.709 ms
with sort support after patch / CREATE INDEX 225.238 ms
without sort support / CREATE INDEX 446.071 ms
select count(*) from roads_rdr a, roads_rdr b where a.geom && b.geom;
with sort support before patch / SELECT 5766.526 ms
with sort support after patch / SELECT 2646.554 ms
without sort support / SELECT 2721.718 ms
index size
with sort support before patch / IDXSIZE 2940928 bytes
with sort support after patch / IDXSIZE 4956160 bytes
without sort support / IDXSIZE 5447680 bytes
More detailed:
Before patch using sorted method:
postgres=# create index roads_rdr_geom_idx_sortsupport on roads_rdr using
gist(geom);CREATE INDEX
Time: 76.709 ms
postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom &&
b.geom;count
--------
505806
(1 row)
Time: 5766.526 ms (00:05.767)
postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom &&
b.geom;count
--------
505806
(1 row)
Time: 5880.142 ms (00:05.880)
postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom &&
b.geom;count
--------
505806
(1 row)
Time: 5778.437 ms (00:05.778)
postgres=# select gist_stat('roads_rdr_geom_idx_sortsupport');
gist_stat
------------------------------------------
Number of levels: 3 +
Number of pages: 359 +
Number of leaf pages: 356 +
Number of tuples: 93034 +
Number of invalid tuples: 0 +
Number of leaf tuples: 92676 +
Total size of tuples: 2609260 bytes+
Total size of leaf tuples: 2599200 bytes+
Total size of index: 2940928 bytes+
(1 row)
After patch using sorted method:
postgres=# create index roads_rdr_geom_idx_sortsupport on roads_rdr using
gist(geom);CREATE INDEX
Time: 225.238 ms
postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom &&
b.geom;count
--------
505806
(1 row)
Time: 2646.554 ms (00:02.647)
postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom &&
b.geom;count
--------
505806
(1 row)
Time: 2499.107 ms (00:02.499)
postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom &&
b.geom;count
--------
505806
(1 row)
Time: 2519.815 ms (00:02.520)
postgres=# select gist_stat('roads_rdr_geom_idx_sortsupport');
gist_stat
------------------------------------------
Number of levels: 3 +
Number of pages: 605 +
Number of leaf pages: 600 +
Number of tuples: 93280 +
Number of invalid tuples: 0 +
Number of leaf tuples: 92676 +
Total size of tuples: 2619100 bytes+
Total size of leaf tuples: 2602128 bytes+
Total size of index: 4956160 bytes+
(1 row)
With index built using default method:
postgres=# create index roads_rdr_geom_idx_no_sortsupport on roads_rdr
using gist(geom);CREATE INDEX
Time: 446.071 ms
postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom &&
b.geom;count
--------
505806
(1 row)
Time: 2721.718 ms (00:02.722)
postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom &&
b.geom;count
--------
505806
(1 row)
Time: 3549.549 ms (00:03.550)
postgres=# select count(*) from roads_rdr a, roads_rdr b where a.geom &&
b.geom;count
--------
505806
(1 row)
postgres=# select gist_stat('roads_rdr_geom_idx_no_sortsupport');
gist_stat
------------------------------------------
Number of levels: 3 +
Number of pages: 665 +
Number of leaf pages: 660 +
Number of tuples: 93340 +
Number of invalid tuples: 0 +
Number of leaf tuples: 92676 +
Total size of tuples: 2621500 bytes+
Total size of leaf tuples: 2602848 bytes+
Total size of index: 5447680 bytes+
(1 row)
18 янв. 2022 г., в 03:54, Björn Harrtell <bjorn.harrtell@gmail.com> написал(а):
There might be some deep reason in the architecture that I'm unaware of that could make it difficult to affect the node size but regardless, I believe there could be a substantial win if node size could be controlled.
That's kind of orthogonal development path. Some years ago I had posted "GiST intrapage indexing" patch [0]/messages/by-id/7780A07B-4D04-41E2-B228-166B41D07EEE@yandex-team.ru, that was aiming to make a tree with fanout that is Sqrt(Items on page). But for now decreasing fillfactor == wasting a lot of space, both in shared_buffers and on disk...
Thank you for raising this topic, I think I should rebase and refresh that patch too...
Best regards, Andrey Borodin.
[0]: /messages/by-id/7780A07B-4D04-41E2-B228-166B41D07EEE@yandex-team.ru
Hi,
I've addressed Andrey Borodin's concerns about v2 of this patch by
Aliaksandr
Kalenik in attached version. Change list:
* Number of pages to collect moved to GUC parameter
"gist_sorted_build_page_buffer_size".
* GistSortedBuildPageState type renamed to GistSortedBuildLevelState.
* Comments added.
Sorted build remaind deterministic as long as picksplit implementation
for given
opclass is, which seem to be true for builtin types, so setting random
seed is
not required for testing.
Andrey Borodin's GiST support patch for amcheck was used to verify built
indexes:
https://commitfest.postgresql.org/25/1800/
PSA modified version working with current Postgres code (btree functions
removed).
Attachments:
03_reduce_page_overlap_of_gist_indexes_built_using_sorted_method.patchtext/x-patch; charset=UTF-8; name=03_reduce_page_overlap_of_gist_indexes_built_using_sorted_method.patchDownload
diff --git a/contrib/pageinspect/expected/gist.out b/contrib/pageinspect/expected/gist.out
index 86c9e9caa9..8cb390feed 100644
--- a/contrib/pageinspect/expected/gist.out
+++ b/contrib/pageinspect/expected/gist.out
@@ -33,14 +33,15 @@ COMMIT;
SELECT * FROM gist_page_items(get_raw_page('test_gist_idx', 0), 'test_gist_idx');
itemoffset | ctid | itemlen | dead | keys
------------+-----------+---------+------+-------------------
- 1 | (1,65535) | 40 | f | (p)=((166,166))
- 2 | (2,65535) | 40 | f | (p)=((332,332))
- 3 | (3,65535) | 40 | f | (p)=((498,498))
- 4 | (4,65535) | 40 | f | (p)=((664,664))
- 5 | (5,65535) | 40 | f | (p)=((830,830))
- 6 | (6,65535) | 40 | f | (p)=((996,996))
- 7 | (7,65535) | 40 | f | (p)=((1000,1000))
-(7 rows)
+ 1 | (1,65535) | 40 | f | (p)=((125,125))
+ 2 | (2,65535) | 40 | f | (p)=((250,250))
+ 3 | (3,65535) | 40 | f | (p)=((375,375))
+ 4 | (4,65535) | 40 | f | (p)=((500,500))
+ 5 | (5,65535) | 40 | f | (p)=((625,625))
+ 6 | (6,65535) | 40 | f | (p)=((750,750))
+ 7 | (7,65535) | 40 | f | (p)=((875,875))
+ 8 | (8,65535) | 40 | f | (p)=((1000,1000))
+(8 rows)
SELECT * FROM gist_page_items(get_raw_page('test_gist_idx', 1), 'test_gist_idx') LIMIT 5;
itemoffset | ctid | itemlen | dead | keys
@@ -64,6 +65,7 @@ SELECT itemoffset, ctid, itemlen FROM gist_page_items_bytea(get_raw_page('test_g
5 | (5,65535) | 40
6 | (6,65535) | 40
7 | (7,65535) | 40
-(7 rows)
+ 8 | (8,65535) | 40
+(8 rows)
DROP TABLE test_gist;
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 9854116fca..239fd08495 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -59,6 +59,11 @@
*/
#define BUFFERING_MODE_TUPLE_SIZE_STATS_TARGET 4096
+/**
+ * Number of pages to collect during sorted build before applying split operation
+ */
+int GistSortedBuildPageBufferSize = 8;
+
/*
* Strategy used to build the index. It can change between the
* GIST_BUFFERING_* modes on the fly, but if the Sorted method is used,
@@ -115,14 +120,34 @@ typedef struct
/*
* In sorted build, we use a stack of these structs, one for each level,
- * to hold an in-memory buffer of the rightmost page at the level. When the
- * page fills up, it is written out and a new page is allocated.
+ * to hold an in-memory buffer of last pages at the level.
+ *
+ * Sorting GiST build requires good linearization of the sort opclass. This is
+ * not always the case in multidimensional data. To fight the anomalies, we
+ * buffer index tuples and apply picksplit that can be multidimension-aware.
*/
-typedef struct GistSortedBuildPageState
+typedef struct GistSortedBuildLevelState
{
- Page page;
- struct GistSortedBuildPageState *parent; /* Upper level, if any */
-} GistSortedBuildPageState;
+ int current_page;
+ BlockNumber last_blkno;
+ struct GistSortedBuildLevelState *parent; /* Upper level, if any */
+ uint16 item_num_total;
+ Size page_max_num;
+ Page pages[FLEXIBLE_ARRAY_MEMBER];
+} GistSortedBuildLevelState;
+
+/*
+ * Max. number of items to apply gistSplit to is limited by OffsetNumber type
+ * used in GIST_SPLITVEC.
+ */
+#define GistSortedBuildLevelStateIsMaxItemNum(levelstate) \
+ (levelstate->item_num_total + 1 == PG_UINT16_MAX)
+
+#define GistSortedBuildLevelStateRequiredSize(page_max_num) \
+( \
+ AssertMacro(page_max_num > 0), \
+ (offsetof(GistSortedBuildLevelState, pages) + page_max_num * sizeof(Page)) \
+)
/* prototypes for private functions */
@@ -130,11 +155,11 @@ static void gistSortedBuildCallback(Relation index, ItemPointer tid,
Datum *values, bool *isnull,
bool tupleIsAlive, void *state);
static void gist_indexsortbuild(GISTBuildState *state);
-static void gist_indexsortbuild_pagestate_add(GISTBuildState *state,
- GistSortedBuildPageState *pagestate,
+static void gist_indexsortbuild_levelstate_add(GISTBuildState *state,
+ GistSortedBuildLevelState *levelstate,
IndexTuple itup);
-static void gist_indexsortbuild_pagestate_flush(GISTBuildState *state,
- GistSortedBuildPageState *pagestate);
+static void gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
+ GistSortedBuildLevelState *levelstate);
static void gist_indexsortbuild_flush_ready_pages(GISTBuildState *state);
static void gistInitBuffering(GISTBuildState *buildstate);
@@ -396,9 +421,14 @@ static void
gist_indexsortbuild(GISTBuildState *state)
{
IndexTuple itup;
- GistSortedBuildPageState *leafstate;
- GistSortedBuildPageState *pagestate;
+ GistSortedBuildLevelState *levelstate;
Page page;
+ Size page_max_num;
+
+ /* Number of pages to collect before splitting */
+ page_max_num = GistSortedBuildPageBufferSize;
+ if (page_max_num > 1)
+ state->freespace = 0; /* collected tuples will all be split between new pages */
state->pages_allocated = 0;
state->pages_written = 0;
@@ -414,10 +444,12 @@ gist_indexsortbuild(GISTBuildState *state)
state->pages_allocated++;
state->pages_written++;
- /* Allocate a temporary buffer for the first leaf page. */
- leafstate = palloc(sizeof(GistSortedBuildPageState));
- leafstate->page = page;
- leafstate->parent = NULL;
+ /* Allocate a temporary buffer for the first leaf page batch. */
+ levelstate = palloc0(GistSortedBuildLevelStateRequiredSize(page_max_num));
+ levelstate->item_num_total = 0;
+ levelstate->page_max_num = page_max_num;
+ levelstate->pages[0] = page;
+ levelstate->parent = NULL;
gistinitpage(page, F_LEAF);
/*
@@ -425,7 +457,7 @@ gist_indexsortbuild(GISTBuildState *state)
*/
while ((itup = tuplesort_getindextuple(state->sortstate, true)) != NULL)
{
- gist_indexsortbuild_pagestate_add(state, leafstate, itup);
+ gist_indexsortbuild_levelstate_add(state, levelstate, itup);
MemoryContextReset(state->giststate->tempCxt);
}
@@ -433,32 +465,34 @@ gist_indexsortbuild(GISTBuildState *state)
* Write out the partially full non-root pages.
*
* Keep in mind that flush can build a new root.
+ * If number of pages is > 1 then new root is required.
*/
- pagestate = leafstate;
- while (pagestate->parent != NULL)
+ while (levelstate->parent != NULL || levelstate->current_page != 0)
{
- GistSortedBuildPageState *parent;
-
- gist_indexsortbuild_pagestate_flush(state, pagestate);
- parent = pagestate->parent;
- pfree(pagestate->page);
- pfree(pagestate);
- pagestate = parent;
+ GistSortedBuildLevelState *parent;
+
+ gist_indexsortbuild_levelstate_flush(state, levelstate);
+ parent = levelstate->parent;
+ for (Size i = 0; i < levelstate->page_max_num; i++)
+ if (levelstate->pages[i])
+ pfree(levelstate->pages[i]);
+ pfree(levelstate);
+ levelstate = parent;
}
gist_indexsortbuild_flush_ready_pages(state);
/* Write out the root */
- PageSetLSN(pagestate->page, GistBuildLSN);
- PageSetChecksumInplace(pagestate->page, GIST_ROOT_BLKNO);
+ PageSetLSN(levelstate->pages[0], GistBuildLSN);
+ PageSetChecksumInplace(levelstate->pages[0], GIST_ROOT_BLKNO);
smgrwrite(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
- pagestate->page, true);
+ levelstate->pages[0], true);
if (RelationNeedsWAL(state->indexrel))
log_newpage(&state->indexrel->rd_node, MAIN_FORKNUM, GIST_ROOT_BLKNO,
- pagestate->page, true);
+ levelstate->pages[0], true);
- pfree(pagestate->page);
- pfree(pagestate);
+ pfree(levelstate->pages[0]);
+ pfree(levelstate);
}
/*
@@ -466,92 +500,159 @@ gist_indexsortbuild(GISTBuildState *state)
* a new page first.
*/
static void
-gist_indexsortbuild_pagestate_add(GISTBuildState *state,
- GistSortedBuildPageState *pagestate,
+gist_indexsortbuild_levelstate_add(GISTBuildState *state,
+ GistSortedBuildLevelState *levelstate,
IndexTuple itup)
{
Size sizeNeeded;
- /* Does the tuple fit? If not, flush */
+ /* Does the tuple fit?
+ * Tuple fits if total number of tuples is less than GistSortedBuildLevelStateMaxItemNum
+ * and tuple fits into current page or new page can be added.
+ * If not, flush */
sizeNeeded = IndexTupleSize(itup) + sizeof(ItemIdData) + state->freespace;
- if (PageGetFreeSpace(pagestate->page) < sizeNeeded)
- gist_indexsortbuild_pagestate_flush(state, pagestate);
+ if (GistSortedBuildLevelStateIsMaxItemNum(levelstate)
+ || PageGetFreeSpace(levelstate->pages[levelstate->current_page]) < sizeNeeded)
+ {
+ Page newPage;
+ Page old_page = levelstate->pages[levelstate->current_page];
+ uint16 old_page_flags = GistPageGetOpaque(old_page)->flags;
+ if (GistSortedBuildLevelStateIsMaxItemNum(levelstate)
+ || levelstate->current_page + 1 == levelstate->page_max_num)
+ {
+ gist_indexsortbuild_levelstate_flush(state, levelstate);
+ }
+ else
+ levelstate->current_page++;
- gistfillbuffer(pagestate->page, &itup, 1, InvalidOffsetNumber);
+ if (levelstate->pages[levelstate->current_page] == NULL)
+ levelstate->pages[levelstate->current_page] = palloc(BLCKSZ);
+
+ newPage = levelstate->pages[levelstate->current_page];
+ gistinitpage(newPage, old_page_flags);
+ }
+
+ Assert(!GistSortedBuildLevelStateIsMaxItemNum(levelstate));
+ levelstate->item_num_total++;
+ gistfillbuffer(levelstate->pages[levelstate->current_page], &itup, 1, InvalidOffsetNumber);
}
static void
-gist_indexsortbuild_pagestate_flush(GISTBuildState *state,
- GistSortedBuildPageState *pagestate)
+gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
+ GistSortedBuildLevelState *levelstate)
{
- GistSortedBuildPageState *parent;
- IndexTuple *itvec;
- IndexTuple union_tuple;
- int vect_len;
- bool isleaf;
- BlockNumber blkno;
+ GistSortedBuildLevelState *parent;
+ BlockNumber blkno;
MemoryContext oldCtx;
+ IndexTuple union_tuple;
+ SplitedPageLayout *dist;
+ IndexTuple *itvec;
+ int vect_len;
+ bool isleaf = GistPageIsLeaf(levelstate->pages[0]);
- /* check once per page */
+ /* check once per whatever */
CHECK_FOR_INTERRUPTS();
- if (state->ready_num_pages == XLR_MAX_BLOCK_ID)
- gist_indexsortbuild_flush_ready_pages(state);
-
- /*
- * The page is now complete. Assign a block number to it, and add it to
- * the list of finished pages. (We don't write it out immediately, because
- * we want to WAL-log the pages in batches.)
- */
- blkno = state->pages_allocated++;
- state->ready_blknos[state->ready_num_pages] = blkno;
- state->ready_pages[state->ready_num_pages] = pagestate->page;
- state->ready_num_pages++;
+ oldCtx = MemoryContextSwitchTo(state->giststate->tempCxt);
- isleaf = GistPageIsLeaf(pagestate->page);
+ /* Get index tuples from first page */
+ itvec = gistextractpage(levelstate->pages[0], &vect_len);
+ if (levelstate->current_page > 0)
+ {
+ /* Append tuples from each page */
+ for (int i = 1; i < levelstate->current_page + 1; i++)
+ {
+ int len_local;
+ IndexTuple *itvec_local = gistextractpage(levelstate->pages[i], &len_local);
+ itvec = gistjoinvector(itvec, &vect_len, itvec_local, len_local);
+ pfree(itvec_local);
+ }
- /*
- * Form a downlink tuple to represent all the tuples on the page.
- */
- oldCtx = MemoryContextSwitchTo(state->giststate->tempCxt);
- itvec = gistextractpage(pagestate->page, &vect_len);
- union_tuple = gistunion(state->indexrel, itvec, vect_len,
+ /* Apply picksplit to list of all collected tuples */
+ dist = gistSplit(state->indexrel, levelstate->pages[0], itvec, vect_len, state->giststate);
+ }
+ else
+ {
+ /* Create splitted layout from single page */
+ dist = (SplitedPageLayout*) palloc0(sizeof(SplitedPageLayout));
+ union_tuple = gistunion(state->indexrel, itvec, vect_len,
state->giststate);
- ItemPointerSetBlockNumber(&(union_tuple->t_tid), blkno);
+ dist->itup = union_tuple;
+ dist->list = gistfillitupvec(itvec, vect_len, &(dist->lenlist));
+ dist->block.num = vect_len;
+ }
+
MemoryContextSwitchTo(oldCtx);
- /*
- * Insert the downlink to the parent page. If this was the root, create a
- * new page as the parent, which becomes the new root.
- */
- parent = pagestate->parent;
- if (parent == NULL)
+ /* Reset page and item counters */
+ levelstate->current_page = 0;
+ levelstate->item_num_total = 0;
+
+ /* Create pages for all partitions in split result */
+ for (;dist != NULL; dist = dist->next)
{
- parent = palloc(sizeof(GistSortedBuildPageState));
- parent->page = (Page) palloc(BLCKSZ);
- parent->parent = NULL;
- gistinitpage(parent->page, 0);
+ /* Create page and copy data */
+ char *data = (char *) (dist->list);
+ Page target = palloc0(BLCKSZ);
+ gistinitpage(target, isleaf? F_LEAF:0);
+ for (int i = 0; i < dist->block.num; i++)
+ {
+ IndexTuple thistup = (IndexTuple) data;
- pagestate->parent = parent;
- }
- gist_indexsortbuild_pagestate_add(state, parent, union_tuple);
+ if (PageAddItem(target, (Item) data, IndexTupleSize(thistup), i + FirstOffsetNumber, false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add item to index page in \"%s\"", RelationGetRelationName(state->indexrel));
- /* Re-initialize the page buffer for next page on this level. */
- pagestate->page = palloc(BLCKSZ);
- gistinitpage(pagestate->page, isleaf ? F_LEAF : 0);
+ data += IndexTupleSize(thistup);
+ }
+ union_tuple = dist->itup;
- /*
- * Set the right link to point to the previous page. This is just for
- * debugging purposes: GiST only follows the right link if a page is split
- * concurrently to a scan, and that cannot happen during index build.
- *
- * It's a bit counterintuitive that we set the right link on the new page
- * to point to the previous page, and not the other way round. But GiST
- * pages are not ordered like B-tree pages are, so as long as the
- * right-links form a chain through all the pages in the same level, the
- * order doesn't matter.
- */
- GistPageGetOpaque(pagestate->page)->rightlink = blkno;
+ if (state->ready_num_pages == XLR_MAX_BLOCK_ID)
+ gist_indexsortbuild_flush_ready_pages(state);
+
+ /*
+ * The page is now complete. Assign a block number to it, and add it to
+ * the list of finished pages. (We don't write it out immediately, because
+ * we want to WAL-log the pages in batches.)
+ */
+ blkno = state->pages_allocated++;
+ state->ready_blknos[state->ready_num_pages] = blkno;
+ state->ready_pages[state->ready_num_pages] = target;
+ state->ready_num_pages++;
+ ItemPointerSetBlockNumber(&(union_tuple->t_tid), blkno);
+
+ /*
+ * Set the right link to point to the previous page. This is just for
+ * debugging purposes: GiST only follows the right link if a page is split
+ * concurrently to a scan, and that cannot happen during index build.
+ *
+ * It's a bit counterintuitive that we set the right link on the new page
+ * to point to the previous page, and not the other way round. But GiST
+ * pages are not ordered like B-tree pages are, so as long as the
+ * right-links form a chain through all the pages in the same level, the
+ * order doesn't matter.
+ */
+ if (levelstate->last_blkno)
+ GistPageGetOpaque(target)->rightlink = levelstate->last_blkno;
+ levelstate->last_blkno = blkno;
+
+ /*
+ * Insert the downlink to the parent page. If this was the root, create a
+ * new page as the parent, which becomes the new root.
+ */
+ parent = levelstate->parent;
+ if (parent == NULL)
+ {
+ parent = palloc0(GistSortedBuildLevelStateRequiredSize(levelstate->page_max_num));
+ parent->item_num_total = 0;
+ parent->page_max_num = levelstate->page_max_num;
+ parent->pages[0] = (Page) palloc(BLCKSZ);
+ parent->parent = NULL;
+ gistinitpage(parent->pages[0], 0);
+
+ levelstate->parent = parent;
+ }
+ gist_indexsortbuild_levelstate_add(state, parent, union_tuple);
+ }
}
static void
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 4c94f09c64..4755113737 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -34,6 +34,7 @@
#include "access/commit_ts.h"
#include "access/gin.h"
+#include "access/gist.h"
#include "access/rmgr.h"
#include "access/tableam.h"
#include "access/toast_compression.h"
@@ -3600,6 +3601,17 @@ static struct config_int ConfigureNamesInt[] =
NULL, NULL, NULL
},
+ {
+ {"gist_sorted_build_page_buffer_size", PGC_USERSET, UNGROUPED,
+ gettext_noop("GiST sorted build: # of pages to collect before applying split"),
+ NULL,
+ 0
+ },
+ &GistSortedBuildPageBufferSize,
+ 8, 1, INT_MAX,
+ NULL, NULL, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/include/access/gist.h b/src/include/access/gist.h
index a3337627b8..eb66b9e548 100644
--- a/src/include/access/gist.h
+++ b/src/include/access/gist.h
@@ -186,6 +186,7 @@ typedef struct GISTENTRY
#define GistPageGetNSN(page) ( PageXLogRecPtrGet(GistPageGetOpaque(page)->nsn))
#define GistPageSetNSN(page, val) ( PageXLogRecPtrSet(GistPageGetOpaque(page)->nsn, val))
+extern int GistSortedBuildPageBufferSize;
/*
* On a deleted page, we store this struct. A deleted page doesn't contain any
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 89249ecc97..bfb7802f2d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -969,7 +969,7 @@ GistHstoreOptions
GistInetKey
GistNSN
GistOptBufferingMode
-GistSortedBuildPageState
+GistSortedBuildLevelState
GistSplitUnion
GistSplitVector
GistTsVectorOptions
v1-amcheck-gist-no-btree.patchtext/x-patch; charset=UTF-8; name=v1-amcheck-gist-no-btree.patchDownload
diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
index b82f221e50..51e5af221a 100644
--- a/contrib/amcheck/Makefile
+++ b/contrib/amcheck/Makefile
@@ -3,14 +3,21 @@
MODULE_big = amcheck
OBJS = \
$(WIN32RES) \
+ amcheck.o \
verify_heapam.o \
- verify_nbtree.o
+ verify_gist.o
+# verify_nbtree.o \
EXTENSION = amcheck
-DATA = amcheck--1.2--1.3.sql amcheck--1.1--1.2.sql amcheck--1.0--1.1.sql amcheck--1.0.sql
+DATA = amcheck--1.3--1.4.sql \
+ amcheck--1.2--1.3.sql \
+ amcheck--1.1--1.2.sql \
+ amcheck--1.0--1.1.sql \
+ amcheck--1.0.sql
PGFILEDESC = "amcheck - function for verifying relation integrity"
-REGRESS = check check_btree check_heap
+REGRESS = check check_heap check_gist
+# check_btree
TAP_TESTS = 1
diff --git a/contrib/amcheck/amcheck--1.0--1.1.sql b/contrib/amcheck/amcheck--1.0--1.1.sql
index 088416e7c2..af09ac1714 100644
--- a/contrib/amcheck/amcheck--1.0--1.1.sql
+++ b/contrib/amcheck/amcheck--1.0--1.1.sql
@@ -6,24 +6,24 @@
-- In order to avoid issues with dependencies when updating amcheck to 1.1,
-- create new, overloaded versions of the 1.0 functions
---
--- bt_index_check()
---
-CREATE FUNCTION bt_index_check(index regclass,
- heapallindexed boolean)
-RETURNS VOID
-AS 'MODULE_PATHNAME', 'bt_index_check'
-LANGUAGE C STRICT PARALLEL RESTRICTED;
+-- --
+-- -- bt_index_check()
+-- --
+-- CREATE FUNCTION bt_index_check(index regclass,
+-- heapallindexed boolean)
+-- RETURNS VOID
+-- AS 'MODULE_PATHNAME', 'bt_index_check'
+-- LANGUAGE C STRICT PARALLEL RESTRICTED;
---
--- bt_index_parent_check()
---
-CREATE FUNCTION bt_index_parent_check(index regclass,
- heapallindexed boolean)
-RETURNS VOID
-AS 'MODULE_PATHNAME', 'bt_index_parent_check'
-LANGUAGE C STRICT PARALLEL RESTRICTED;
+-- --
+-- -- bt_index_parent_check()
+-- --
+-- CREATE FUNCTION bt_index_parent_check(index regclass,
+-- heapallindexed boolean)
+-- RETURNS VOID
+-- AS 'MODULE_PATHNAME', 'bt_index_parent_check'
+-- LANGUAGE C STRICT PARALLEL RESTRICTED;
--- Don't want these to be available to public
-REVOKE ALL ON FUNCTION bt_index_check(regclass, boolean) FROM PUBLIC;
-REVOKE ALL ON FUNCTION bt_index_parent_check(regclass, boolean) FROM PUBLIC;
+-- -- Don't want these to be available to public
+-- REVOKE ALL ON FUNCTION bt_index_check(regclass, boolean) FROM PUBLIC;
+-- REVOKE ALL ON FUNCTION bt_index_parent_check(regclass, boolean) FROM PUBLIC;
diff --git a/contrib/amcheck/amcheck--1.0.sql b/contrib/amcheck/amcheck--1.0.sql
index a6612d130c..c0aebe7bea 100644
--- a/contrib/amcheck/amcheck--1.0.sql
+++ b/contrib/amcheck/amcheck--1.0.sql
@@ -3,22 +3,22 @@
-- complain if script is sourced in psql, rather than via CREATE EXTENSION
\echo Use "CREATE EXTENSION amcheck" to load this file. \quit
---
--- bt_index_check()
---
-CREATE FUNCTION bt_index_check(index regclass)
-RETURNS VOID
-AS 'MODULE_PATHNAME', 'bt_index_check'
-LANGUAGE C STRICT PARALLEL RESTRICTED;
+-- --
+-- -- bt_index_check()
+-- --
+-- CREATE FUNCTION bt_index_check(index regclass)
+-- RETURNS VOID
+-- AS 'MODULE_PATHNAME', 'bt_index_check'
+-- LANGUAGE C STRICT PARALLEL RESTRICTED;
---
--- bt_index_parent_check()
---
-CREATE FUNCTION bt_index_parent_check(index regclass)
-RETURNS VOID
-AS 'MODULE_PATHNAME', 'bt_index_parent_check'
-LANGUAGE C STRICT PARALLEL RESTRICTED;
+-- --
+-- -- bt_index_parent_check()
+-- --
+-- CREATE FUNCTION bt_index_parent_check(index regclass)
+-- RETURNS VOID
+-- AS 'MODULE_PATHNAME', 'bt_index_parent_check'
+-- LANGUAGE C STRICT PARALLEL RESTRICTED;
--- Don't want these to be available to public
-REVOKE ALL ON FUNCTION bt_index_check(regclass) FROM PUBLIC;
-REVOKE ALL ON FUNCTION bt_index_parent_check(regclass) FROM PUBLIC;
+-- -- Don't want these to be available to public
+-- REVOKE ALL ON FUNCTION bt_index_check(regclass) FROM PUBLIC;
+-- REVOKE ALL ON FUNCTION bt_index_parent_check(regclass) FROM PUBLIC;
diff --git a/contrib/amcheck/amcheck--1.1--1.2.sql b/contrib/amcheck/amcheck--1.1--1.2.sql
index 883530dec7..224510f0e5 100644
--- a/contrib/amcheck/amcheck--1.1--1.2.sql
+++ b/contrib/amcheck/amcheck--1.1--1.2.sql
@@ -3,17 +3,17 @@
-- complain if script is sourced in psql, rather than via CREATE EXTENSION
\echo Use "ALTER EXTENSION amcheck UPDATE TO '1.2'" to load this file. \quit
--- In order to avoid issues with dependencies when updating amcheck to 1.2,
--- create new, overloaded version of the 1.1 function signature
+-- -- In order to avoid issues with dependencies when updating amcheck to 1.2,
+-- -- create new, overloaded version of the 1.1 function signature
---
--- bt_index_parent_check()
---
-CREATE FUNCTION bt_index_parent_check(index regclass,
- heapallindexed boolean, rootdescend boolean)
-RETURNS VOID
-AS 'MODULE_PATHNAME', 'bt_index_parent_check'
-LANGUAGE C STRICT PARALLEL RESTRICTED;
+-- --
+-- -- bt_index_parent_check()
+-- --
+-- CREATE FUNCTION bt_index_parent_check(index regclass,
+-- heapallindexed boolean, rootdescend boolean)
+-- RETURNS VOID
+-- AS 'MODULE_PATHNAME', 'bt_index_parent_check'
+-- LANGUAGE C STRICT PARALLEL RESTRICTED;
--- Don't want this to be available to public
-REVOKE ALL ON FUNCTION bt_index_parent_check(regclass, boolean, boolean) FROM PUBLIC;
+-- -- Don't want this to be available to public
+-- REVOKE ALL ON FUNCTION bt_index_parent_check(regclass, boolean, boolean) FROM PUBLIC;
diff --git a/contrib/amcheck/amcheck--1.3--1.4.sql b/contrib/amcheck/amcheck--1.3--1.4.sql
new file mode 100644
index 0000000000..1631dfa79b
--- /dev/null
+++ b/contrib/amcheck/amcheck--1.3--1.4.sql
@@ -0,0 +1,14 @@
+/* contrib/amcheck/amcheck--1.2--1.3.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION amcheck UPDATE TO '1.4'" to load this file. \quit
+
+--
+-- gist_index_parent_check()
+--
+CREATE FUNCTION gist_index_parent_check(index regclass)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'gist_index_parent_check'
+LANGUAGE C STRICT;
+
+REVOKE ALL ON FUNCTION gist_index_parent_check(regclass) FROM PUBLIC;
diff --git a/contrib/amcheck/amcheck.c b/contrib/amcheck/amcheck.c
new file mode 100644
index 0000000000..006a54ab83
--- /dev/null
+++ b/contrib/amcheck/amcheck.c
@@ -0,0 +1,162 @@
+/*-------------------------------------------------------------------------
+ *
+ * amcheck.c
+ * Utility functions common to all access methods.
+ *
+ * Copyright (c) 2017-2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/amcheck/amcheck.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/genam.h"
+#include "access/table.h"
+#include "access/tableam.h"
+#include "amcheck.h"
+#include "catalog/index.h"
+#include "commands/tablecmds.h"
+
+
+PG_MODULE_MAGIC;
+
+/*
+ * Lock acquisition reused across different am types
+ */
+void
+amcheck_lock_relation(Oid indrelid, Relation *indrel,
+ Relation *heaprel, LOCKMODE lockmode)
+{
+ Oid heapid;
+
+ /*
+ * We must lock table before index to avoid deadlocks. However, if the
+ * passed indrelid isn't an index then IndexGetRelation() will fail.
+ * Rather than emitting a not-very-helpful error message, postpone
+ * complaining, expecting that the is-it-an-index test below will fail.
+ *
+ * In hot standby mode this will raise an error when lockmode is
+ * AccessExclusiveLock.
+ */
+ heapid = IndexGetRelation(indrelid, true);
+ if (OidIsValid(heapid))
+ *heaprel = table_open(heapid, lockmode);
+ else
+ *heaprel = NULL;
+
+ /*
+ * Open the target index relations separately (like relation_openrv(), but
+ * with heap relation locked first to prevent deadlocking). In hot
+ * standby mode this will raise an error when lockmode is
+ * AccessExclusiveLock.
+ *
+ * There is no need for the usual indcheckxmin usability horizon test
+ * here, even in the nbtree heapallindexed case, because index undergoing
+ * verification only needs to have entries for a new transaction snapshot.
+ * (If caller is about to do nbtree parentcheck verification, there is no
+ * question about committed or recently dead heap tuples lacking index
+ * entries due to concurrent activity.)
+ */
+ *indrel = index_open(indrelid, lockmode);
+
+ /*
+ * Since we did the IndexGetRelation call above without any lock, it's
+ * barely possible that a race against an index drop/recreation could have
+ * netted us the wrong table.
+ */
+ if (*heaprel == NULL || heapid != IndexGetRelation(indrelid, false))
+ ereport(ERROR,
+ (errcode(ERRCODE_UNDEFINED_TABLE),
+ errmsg("could not open parent table of index %s",
+ RelationGetRelationName(*indrel))));
+}
+
+/*
+ * Unlock index and heap relations early for amcheck_lock_relation() caller.
+ *
+ * This is ok because nothing in the called routines will trigger shared cache
+ * invalidations to be sent, so we can relax the usual pattern of only
+ * releasing locks after commit.
+ */
+void
+amcheck_unlock_relation(Oid indrelid, Relation indrel, Relation heaprel,
+ LOCKMODE lockmode)
+{
+ index_close(indrel, lockmode);
+ if (heaprel)
+ table_close(heaprel, lockmode);
+}
+
+/*
+ * Check if index relation should have a file for its main relation
+ * fork. Verification uses this to skip unlogged indexes when in hot standby
+ * mode, where there is simply nothing to verify.
+ *
+ * NB: Caller should call btree_index_checkable() or gist_index_checkable()
+ * before calling here.
+ */
+bool
+amcheck_index_mainfork_expected(Relation rel)
+{
+ if (rel->rd_rel->relpersistence != RELPERSISTENCE_UNLOGGED ||
+ !RecoveryInProgress())
+ return true;
+
+ ereport(NOTICE,
+ (errcode(ERRCODE_READ_ONLY_SQL_TRANSACTION),
+ errmsg("cannot verify unlogged index \"%s\" during recovery, skipping",
+ RelationGetRelationName(rel))));
+
+ return false;
+}
+
+/*
+ * PageGetItemId() wrapper that validates returned line pointer.
+ *
+ * Buffer page/page item access macros generally trust that line pointers are
+ * not corrupt, which might cause problems for verification itself. For
+ * example, there is no bounds checking in PageGetItem(). Passing it a
+ * corrupt line pointer can cause it to return a tuple/pointer that is unsafe
+ * to dereference.
+ *
+ * Validating line pointers before tuples avoids undefined behavior and
+ * assertion failures with corrupt indexes, making the verification process
+ * more robust and predictable.
+ */
+ItemId
+PageGetItemIdCareful(Relation rel, BlockNumber block, Page page,
+ OffsetNumber offset, size_t opaquesize)
+{
+ ItemId itemid = PageGetItemId(page, offset);
+
+ if (ItemIdGetOffset(itemid) + ItemIdGetLength(itemid) >
+ BLCKSZ - opaquesize)
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("line pointer points past end of tuple space in index \"%s\"",
+ RelationGetRelationName(rel)),
+ errdetail_internal("Index tid=(%u,%u) lp_off=%u, lp_len=%u lp_flags=%u.",
+ block, offset, ItemIdGetOffset(itemid),
+ ItemIdGetLength(itemid),
+ ItemIdGetFlags(itemid))));
+
+ /*
+ * Verify that line pointer isn't LP_REDIRECT or LP_UNUSED, since nbtree and gist
+ * never uses either. Verify that line pointer has storage, too, since
+ * even LP_DEAD items should.
+ */
+ if (ItemIdIsRedirected(itemid) || !ItemIdIsUsed(itemid) ||
+ ItemIdGetLength(itemid) == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("invalid line pointer storage in index \"%s\"",
+ RelationGetRelationName(rel)),
+ errdetail_internal("Index tid=(%u,%u) lp_off=%u, lp_len=%u lp_flags=%u.",
+ block, offset, ItemIdGetOffset(itemid),
+ ItemIdGetLength(itemid),
+ ItemIdGetFlags(itemid))));
+
+ return itemid;
+}
diff --git a/contrib/amcheck/amcheck.control b/contrib/amcheck/amcheck.control
index ab50931f75..e67ace01c9 100644
--- a/contrib/amcheck/amcheck.control
+++ b/contrib/amcheck/amcheck.control
@@ -1,5 +1,5 @@
# amcheck extension
comment = 'functions for verifying relation integrity'
-default_version = '1.3'
+default_version = '1.4'
module_pathname = '$libdir/amcheck'
relocatable = true
diff --git a/contrib/amcheck/amcheck.h b/contrib/amcheck/amcheck.h
new file mode 100644
index 0000000000..426f2ba15d
--- /dev/null
+++ b/contrib/amcheck/amcheck.h
@@ -0,0 +1,23 @@
+/*-------------------------------------------------------------------------
+ *
+ * amcheck.h
+ * Shared routines for amcheck verifications.
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/amcheck/amcheck.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "storage/lockdefs.h"
+#include "utils/relcache.h"
+
+extern void amcheck_lock_relation(Oid indrelid, Relation *indrel,
+ Relation *heaprel, LOCKMODE lockmode);
+extern void amcheck_unlock_relation(Oid indrelid, Relation indrel,
+ Relation heaprel, LOCKMODE lockmode);
+extern bool amcheck_index_mainfork_expected(Relation rel);
+
+extern ItemId PageGetItemIdCareful(Relation rel, BlockNumber block,
+ Page page, OffsetNumber offset, size_t opaquesize);
diff --git a/contrib/amcheck/expected/check_gist.out b/contrib/amcheck/expected/check_gist.out
new file mode 100644
index 0000000000..8f3ec20946
--- /dev/null
+++ b/contrib/amcheck/expected/check_gist.out
@@ -0,0 +1,18 @@
+-- minimal test, basically just verifying that amcheck works with GiST
+SELECT setseed(1);
+ setseed
+---------
+
+(1 row)
+
+CREATE TABLE gist_check AS SELECT point(random(),s) c FROM generate_series(1,10000) s;
+INSERT INTO gist_check SELECT point(random(),s) c FROM generate_series(1,100000) s;
+CREATE INDEX gist_check_idx ON gist_check USING gist(c);
+SELECT gist_index_parent_check('gist_check_idx');
+ gist_index_parent_check
+-------------------------
+
+(1 row)
+
+-- cleanup
+DROP TABLE gist_check;
diff --git a/contrib/amcheck/sql/check_gist.sql b/contrib/amcheck/sql/check_gist.sql
new file mode 100644
index 0000000000..0ee2e943c9
--- /dev/null
+++ b/contrib/amcheck/sql/check_gist.sql
@@ -0,0 +1,9 @@
+-- minimal test, basically just verifying that amcheck works with GiST
+SELECT setseed(1);
+CREATE TABLE gist_check AS SELECT point(random(),s) c FROM generate_series(1,10000) s;
+INSERT INTO gist_check SELECT point(random(),s) c FROM generate_series(1,100000) s;
+CREATE INDEX gist_check_idx ON gist_check USING gist(c);
+SELECT gist_index_parent_check('gist_check_idx');
+
+-- cleanup
+DROP TABLE gist_check;
diff --git a/contrib/amcheck/verify_gist.c b/contrib/amcheck/verify_gist.c
new file mode 100644
index 0000000000..3c741071f0
--- /dev/null
+++ b/contrib/amcheck/verify_gist.c
@@ -0,0 +1,374 @@
+/*-------------------------------------------------------------------------
+ *
+ * verify_gist.c
+ * Verifies the integrity of GiST indexes based on invariants.
+ *
+ * Verification checks that all paths in GiST graph contain
+ * consistent keys: tuples on parent pages consistently include tuples
+ * from children pages. Also, verification checks graph invariants:
+ * internal page must have at least one downlinks, internal page can
+ * reference either only leaf pages or only internal pages.
+ *
+ *
+ * Copyright (c) 2017-2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/amcheck/verify_gist.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/gist_private.h"
+#include "amcheck.h"
+#include "catalog/pg_am.h"
+#include "miscadmin.h"
+#include "utils/memutils.h"
+#include "utils/rel.h"
+
+/*
+ * GistScanItem represents one item of depth-first scan of GiST index.
+ */
+typedef struct GistScanItem
+{
+ int depth;
+ IndexTuple parenttup;
+ BlockNumber parentblk;
+ XLogRecPtr parentlsn;
+ BlockNumber blkno;
+ struct GistScanItem *next;
+} GistScanItem;
+
+PG_FUNCTION_INFO_V1(gist_index_parent_check);
+
+static void gist_index_checkable(Relation rel);
+static void gist_check_parent_keys_consistency(Relation rel);
+static void check_index_page(Relation rel, Buffer buffer, BlockNumber blockNo);
+static IndexTuple gist_refind_parent(Relation rel, BlockNumber parentblkno,
+ BlockNumber childblkno,
+ BufferAccessStrategy strategy);
+
+/*
+ * gist_index_parent_check(index regclass)
+ *
+ * Verify integrity of GiST index.
+ *
+ * Acquires AccessShareLock on heap & index relations.
+ */
+Datum gist_index_parent_check(PG_FUNCTION_ARGS)
+{
+ Oid indrelid = PG_GETARG_OID(0);
+ Relation indrel;
+ Relation heaprel;
+ LOCKMODE lockmode = AccessShareLock;
+
+ /* lock table and index with neccesary level */
+ amcheck_lock_relation(indrelid, &indrel, &heaprel, lockmode);
+
+ /* verify that this is GiST eligible for check */
+ gist_index_checkable(indrel);
+
+ if (amcheck_index_mainfork_expected(indrel))
+ gist_check_parent_keys_consistency(indrel);
+
+ /* Unlock index and table */
+ amcheck_unlock_relation(indrelid, indrel, heaprel, lockmode);
+
+ PG_RETURN_VOID();
+}
+
+/*
+ * Check that relation is eligible for GiST verification
+ */
+static void
+gist_index_checkable(Relation rel)
+{
+ if (rel->rd_rel->relkind != RELKIND_INDEX ||
+ rel->rd_rel->relam != GIST_AM_OID)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("only GiST indexes are supported as targets for this verification"),
+ errdetail("Relation \"%s\" is not a GiST index.",
+ RelationGetRelationName(rel))));
+
+ if (RELATION_IS_OTHER_TEMP(rel))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot access temporary tables of other sessions"),
+ errdetail("Index \"%s\" is associated with temporary relation.",
+ RelationGetRelationName(rel))));
+
+ if (!rel->rd_index->indisvalid)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot check index \"%s\"",
+ RelationGetRelationName(rel)),
+ errdetail("Index is not valid")));
+}
+
+/*
+ * Main entry point for GiST check. Allocates memory context and scans through
+ * GiST graph. This function verifies that tuples of internal pages cover all
+ * the key space of each tuple on leaf page. To do this we invoke
+ * gist_check_internal_page() for every internal page.
+ *
+ * gist_check_internal_page() in it's turn takes every tuple and tries to
+ * adjust it by tuples on referenced child page. Parent gist tuple should
+ * never require any adjustments.
+ */
+static void
+gist_check_parent_keys_consistency(Relation rel)
+{
+ BufferAccessStrategy strategy = GetAccessStrategy(BAS_BULKREAD);
+ GistScanItem *stack;
+ MemoryContext mctx;
+ MemoryContext oldcontext;
+ GISTSTATE *state;
+ int leafdepth;
+
+ mctx = AllocSetContextCreate(CurrentMemoryContext,
+ "amcheck context",
+ ALLOCSET_DEFAULT_SIZES);
+ oldcontext = MemoryContextSwitchTo(mctx);
+
+ state = initGISTstate(rel);
+
+ /*
+ * We don't know the height of the tree yet, but as soon as we encounter a
+ * leaf page, we will set 'leafdepth' to its depth.
+ */
+ leafdepth = -1;
+
+ /* Start the scan at the root page */
+ stack = (GistScanItem *) palloc0(sizeof(GistScanItem));
+ stack->depth = 0;
+ stack->parenttup = NULL;
+ stack->parentblk = InvalidBlockNumber;
+ stack->parentlsn = InvalidXLogRecPtr;
+ stack->blkno = GIST_ROOT_BLKNO;
+
+ while (stack)
+ {
+ GistScanItem *stack_next;
+ Buffer buffer;
+ Page page;
+ OffsetNumber i, maxoff;
+ XLogRecPtr lsn;
+
+ CHECK_FOR_INTERRUPTS();
+
+ buffer = ReadBufferExtended(rel, MAIN_FORKNUM, stack->blkno,
+ RBM_NORMAL, strategy);
+ LockBuffer(buffer, GIST_SHARE);
+ page = (Page) BufferGetPage(buffer);
+ lsn = BufferGetLSNAtomic(buffer);
+
+ /* Do basic sanity checks on the page headers */
+ check_index_page(rel, buffer, stack->blkno);
+
+ /*
+ * It's possible that the page was split since we looked at the
+ * parent, so that we didn't missed the downlink of the right sibling
+ * when we scanned the parent. If so, add the right sibling to the
+ * stack now.
+ */
+ if (GistFollowRight(page) || stack->parentlsn < GistPageGetNSN(page))
+ {
+ /* split page detected, install right link to the stack */
+ GistScanItem *ptr = (GistScanItem *) palloc(sizeof(GistScanItem));
+
+ ptr->depth = stack->depth;
+ ptr->parenttup = CopyIndexTuple(stack->parenttup);
+ ptr->parentblk = stack->parentblk;
+ ptr->parentlsn = stack->parentlsn;
+ ptr->blkno = GistPageGetOpaque(page)->rightlink;
+ ptr->next = stack->next;
+ stack->next = ptr;
+ }
+
+ /* Check that the tree has the same height in all branches */
+ if (GistPageIsLeaf(page))
+ {
+ if (leafdepth == -1)
+ leafdepth = stack->depth;
+ else if (stack->depth != leafdepth)
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("index \"%s\": internal pages traversal encountered leaf page unexpectedly on block %u",
+ RelationGetRelationName(rel), stack->blkno)));
+ }
+
+ /*
+ * Check that each tuple looks valid, and is consistent with the
+ * downlink we followed when we stepped on this page.
+ */
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+ {
+ ItemId iid = PageGetItemIdCareful(rel, stack->blkno, page, i, sizeof(GISTPageOpaqueData));
+ IndexTuple idxtuple = (IndexTuple) PageGetItem(page, iid);
+
+ /*
+ * Check that it's not a leftover invalid tuple from pre-9.1 See
+ * also gistdoinsert() and gistbulkdelete() handling of such
+ * tuples. We do consider it error here.
+ */
+ if (GistTupleIsInvalid(idxtuple))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("index \"%s\" contains an inner tuple marked as invalid, block %u, offset %u",
+ RelationGetRelationName(rel), stack->blkno, i),
+ errdetail("This is caused by an incomplete page split at crash recovery before upgrading to PostgreSQL 9.1."),
+ errhint("Please REINDEX it.")));
+
+ if (MAXALIGN(ItemIdGetLength(iid)) != MAXALIGN(IndexTupleSize(idxtuple)))
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("index \"%s\" has inconsistent tuple sizes, block %u, offset %u",
+ RelationGetRelationName(rel), stack->blkno, i)));
+
+ /*
+ * Check if this tuple is consistent with the downlink in the
+ * parent.
+ */
+ if (stack->parenttup &&
+ gistgetadjusted(rel, stack->parenttup, idxtuple, state))
+ {
+ /*
+ * There was a discrepancy between parent and child tuples.
+ * We need to verify it is not a result of concurrent call of
+ * gistplacetopage(). So, lock parent and try to find downlink
+ * for current page. It may be missing due to concurrent page
+ * split, this is OK.
+ */
+ pfree(stack->parenttup);
+ stack->parenttup = gist_refind_parent(rel, stack->parentblk,
+ stack->blkno, strategy);
+
+ /* We found it - make a final check before failing */
+ if (!stack->parenttup)
+ elog(NOTICE, "Unable to find parent tuple for block %u on block %u due to concurrent split",
+ stack->blkno, stack->parentblk);
+ else if (gistgetadjusted(rel, stack->parenttup, idxtuple, state))
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("index \"%s\" has inconsistent records on page %u offset %u",
+ RelationGetRelationName(rel), stack->blkno, i)));
+ else
+ {
+ /*
+ * But now it is properly adjusted - nothing to do here.
+ */
+ }
+ }
+
+ /* If this is an internal page, recurse into the child */
+ if (!GistPageIsLeaf(page))
+ {
+ GistScanItem *ptr;
+
+ ptr = (GistScanItem *) palloc(sizeof(GistScanItem));
+ ptr->depth = stack->depth + 1;
+ ptr->parenttup = CopyIndexTuple(idxtuple);
+ ptr->parentblk = stack->blkno;
+ ptr->blkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
+ ptr->parentlsn = lsn;
+ ptr->next = stack->next;
+ stack->next = ptr;
+ }
+ }
+
+ LockBuffer(buffer, GIST_UNLOCK);
+ ReleaseBuffer(buffer);
+
+ /* Step to next item in the queue */
+ stack_next = stack->next;
+ if (stack->parenttup)
+ pfree(stack->parenttup);
+ pfree(stack);
+ stack = stack_next;
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+ MemoryContextDelete(mctx);
+}
+
+static void
+check_index_page(Relation rel, Buffer buffer, BlockNumber blockNo)
+{
+ Page page = BufferGetPage(buffer);
+
+ gistcheckpage(rel, buffer);
+
+ if (GistPageGetOpaque(page)->gist_page_id != GIST_PAGE_ID)
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("index \"%s\" has corrupted page %d",
+ RelationGetRelationName(rel), blockNo)));
+
+ if (GistPageIsDeleted(page))
+ {
+ if (!GistPageIsLeaf(page))
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("index \"%s\" has deleted internal page %d",
+ RelationGetRelationName(rel), blockNo)));
+ if (PageGetMaxOffsetNumber(page) > InvalidOffsetNumber)
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("index \"%s\" has deleted page %d with tuples",
+ RelationGetRelationName(rel), blockNo)));
+ }
+ else if (PageGetMaxOffsetNumber(page) > MaxIndexTuplesPerPage)
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("index \"%s\" has page %d with exceeding count of tuples",
+ RelationGetRelationName(rel), blockNo)));
+}
+
+/*
+ * Try to re-find downlink pointing to 'blkno', in 'parentblkno'.
+ *
+ * If found, returns a palloc'd copy of the downlink tuple. Otherwise,
+ * returns NULL.
+ */
+static IndexTuple
+gist_refind_parent(Relation rel, BlockNumber parentblkno,
+ BlockNumber childblkno, BufferAccessStrategy strategy)
+{
+ Buffer parentbuf;
+ Page parentpage;
+ OffsetNumber o,
+ parent_maxoff;
+ IndexTuple result = NULL;
+
+ parentbuf = ReadBufferExtended(rel, MAIN_FORKNUM, parentblkno, RBM_NORMAL,
+ strategy);
+
+ LockBuffer(parentbuf, GIST_SHARE);
+ parentpage = BufferGetPage(parentbuf);
+
+ if (GistPageIsLeaf(parentpage))
+ {
+ UnlockReleaseBuffer(parentbuf);
+ return result;
+ }
+
+ parent_maxoff = PageGetMaxOffsetNumber(parentpage);
+ for (o = FirstOffsetNumber; o <= parent_maxoff; o = OffsetNumberNext(o))
+ {
+ ItemId p_iid = PageGetItemIdCareful(rel, parentblkno, parentpage, o, sizeof(GISTPageOpaqueData));
+ IndexTuple itup = (IndexTuple) PageGetItem(parentpage, p_iid);
+
+ if (ItemPointerGetBlockNumber(&(itup->t_tid)) == childblkno)
+ {
+ /* Found it! Make copy and return it */
+ result = CopyIndexTuple(itup);
+ break;
+ }
+ }
+
+ UnlockReleaseBuffer(parentbuf);
+
+ return result;
+}
Hello hackers,
On Tue, Jan 18, 2022 at 11:26 PM sergei sh. <sshoulbakov@kontur.io> wrote:
Hi,
I've addressed Andrey Borodin's concerns about v2 of this patch by
Aliaksandr
Kalenik in attached version.
[snip]
This patchset got some attention in the PostGIS development channel, as it
is important to really enable the fast GiST build there for the end user.
The reviews are positive, it saves build time and performs as well as
original non-sorting build on tested workloads.
Test by Giuseppe Broccolo:
https://lists.osgeo.org/pipermail/postgis-devel/2022-January/029330.html
Test by Regina Obe:
https://lists.osgeo.org/pipermail/postgis-devel/2022-January/029335.html
I believe this patch is an important addition to Postgres 15 and will make
a lot of GIS users happier.
Thanks,
Darafei.
19 янв. 2022 г., в 01:26, sergei sh. <sshoulbakov@kontur.io> написал(а):
Hi,
I've addressed Andrey Borodin's concerns about v2 of this patch by Aliaksandr
Kalenik in attached version.
Thank you! I'll make a new iteration of review. From a first glance everything looks good, but gist_sorted_build_page_buffer_size haven't any documentation....
Best regards, Andrey Borodin.
19 янв. 2022 г., в 09:31, Andrey Borodin <x4mmm@yandex-team.ru> написал(а):
I've addressed Andrey Borodin's concerns about v2 of this patch by Aliaksandr
Kalenik in attached version.Thank you! I'll make a new iteration of review. From a first glance everything looks good, but gist_sorted_build_page_buffer_size haven't any documentation....
I've made one more iteration. The code generally looks OK to me.
Some nitpicking:
1. gist_sorted_build_page_buffer_size is not documented yet
2. Comments correctly state that check for interrupts is done once per whatever. Let's make "whatever" == "1 page flush" again.
3. There is "Size i" in a loop. I haven't found usage of Size, but many size_t-s. For the same purpose in the same file mostly "int i" is used.
4. Many multiline comments are formatted in an unusual manner.
Besides this I think the patch is ready for committer.
Thanks!
Best regards, Andrey Borodin.
Hi,
On 2022-01-18 23:26:05 +0300, sergei sh. wrote:
I've addressed Andrey Borodin's concerns about v2 of this patch by
Aliaksandr
Kalenik in attached version. Change list:
* Number of pages to collect moved to GUC parameter
"gist_sorted_build_page_buffer_size".
* GistSortedBuildPageState type renamed to GistSortedBuildLevelState.
* Comments added.
Unfortunately the tests don't seem to pass on any platform, starting with this
version:
https://cirrus-ci.com/build/4808414281859072
Presumably the fault of v1-amcheck-gist-no-btree.patch, which comments out
some functions. I assume that patch isn't intended to be part of the
submission?
https://wiki.postgresql.org/wiki/Cfbot#Which_attachments_are_considered_to_be_patches.3F
Greetings,
Andres Freund
On 1/23/22 12:33, Andrey Borodin wrote:
19 янв. 2022 г., в 09:31, Andrey Borodin <x4mmm@yandex-team.ru> написал(а):
I've addressed Andrey Borodin's concerns about v2 of this patch by Aliaksandr
Kalenik in attached version.Thank you! I'll make a new iteration of review. From a first glance everything looks good, but gist_sorted_build_page_buffer_size haven't any documentation....
I've made one more iteration. The code generally looks OK to me.
Some nitpicking:
1. gist_sorted_build_page_buffer_size is not documented yet
2. Comments correctly state that check for interrupts is done once per whatever. Let's make "whatever" == "1 page flush" again.
3. There is "Size i" in a loop. I haven't found usage of Size, but many size_t-s. For the same purpose in the same file mostly "int i" is used.
4. Many multiline comments are formatted in an unusual manner.Besides this I think the patch is ready for committer.
Thanks!
Best regards, Andrey Borodin.
Hi,
I've fixed issues 2 -- 4 in attached version.
GUC parameter has been removed for the number of pages to collect
before splitting and fixed value of 4 is used instead, as discussed
off-list with Andrey Borodin, Aliaksandr Kalenik, Darafei
Praliaskouski. Benchmarking shows that using higher values has almost
no effect on query efficiency while increasing index building time.
PSA graphs for index creation and query time, "tiling" and "self-join"
refer to queries used in previous benchmarks:
https://github.com/mngr777/pg_index_bm2
Sorted build method description has been added in GiST README.
Attachments:
04_reduce_page_overlap_of_gist_indexes_built_using_sorted_method.patchtext/x-patch; charset=UTF-8; name=04_reduce_page_overlap_of_gist_indexes_built_using_sorted_method.patchDownload
diff --git a/contrib/pageinspect/expected/gist.out b/contrib/pageinspect/expected/gist.out
index 86c9e9caa9..b26b56ba15 100644
--- a/contrib/pageinspect/expected/gist.out
+++ b/contrib/pageinspect/expected/gist.out
@@ -33,14 +33,13 @@ COMMIT;
SELECT * FROM gist_page_items(get_raw_page('test_gist_idx', 0), 'test_gist_idx');
itemoffset | ctid | itemlen | dead | keys
------------+-----------+---------+------+-------------------
- 1 | (1,65535) | 40 | f | (p)=((166,166))
- 2 | (2,65535) | 40 | f | (p)=((332,332))
- 3 | (3,65535) | 40 | f | (p)=((498,498))
- 4 | (4,65535) | 40 | f | (p)=((664,664))
- 5 | (5,65535) | 40 | f | (p)=((830,830))
- 6 | (6,65535) | 40 | f | (p)=((996,996))
- 7 | (7,65535) | 40 | f | (p)=((1000,1000))
-(7 rows)
+ 1 | (1,65535) | 40 | f | (p)=((185,185))
+ 2 | (2,65535) | 40 | f | (p)=((370,370))
+ 3 | (3,65535) | 40 | f | (p)=((555,555))
+ 4 | (4,65535) | 40 | f | (p)=((740,740))
+ 5 | (5,65535) | 40 | f | (p)=((870,870))
+ 6 | (6,65535) | 40 | f | (p)=((1000,1000))
+(6 rows)
SELECT * FROM gist_page_items(get_raw_page('test_gist_idx', 1), 'test_gist_idx') LIMIT 5;
itemoffset | ctid | itemlen | dead | keys
@@ -63,7 +62,6 @@ SELECT itemoffset, ctid, itemlen FROM gist_page_items_bytea(get_raw_page('test_g
4 | (4,65535) | 40
5 | (5,65535) | 40
6 | (6,65535) | 40
- 7 | (7,65535) | 40
-(7 rows)
+(6 rows)
DROP TABLE test_gist;
diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index 25cab0047b..b4feed9a22 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -26,6 +26,7 @@ The current implementation of GiST supports:
* Concurrency
* Recovery support via WAL logging
* Buffering build algorithm
+ * Sorted build method
The support for concurrency implemented in PostgreSQL was developed based on
the paper "Access Methods for Next-Generation Database Systems" by
@@ -414,6 +415,21 @@ emptied yet; tuples never move upwards in the tree. The final emptying loops
through buffers at a given level until all buffers at that level have been
emptied, and then moves down to the next level.
+Sorted build method
+-------------------
+
+Sort all input tuples, pack them into GiST leaf pages in the sorted order, and
+create downlinks and internal pages as we go. This builds the index from the
+bottom up, similar to how B-tree index build works.
+
+The sorted method is used if the operator classes for all columns have a
+"sortsupport" defined. Otherwise, we fall back on inserting tuples one by one
+with optional buffering.
+
+Sorting GiST build requires good linearization of the sort opclass. This is not
+always the case in multidimensional data. To tackle the anomalies, we buffer
+index tuples and apply picksplit function that can be multidimension-aware.
+
Bulk delete algorithm (VACUUM)
------------------------------
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 9854116fca..d7fd11f75c 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -113,16 +113,23 @@ typedef struct
Page ready_pages[XLR_MAX_BLOCK_ID];
} GISTBuildState;
+#define GIST_SORTED_BUILD_PAGE_NUM 4
+
/*
* In sorted build, we use a stack of these structs, one for each level,
- * to hold an in-memory buffer of the rightmost page at the level. When the
- * page fills up, it is written out and a new page is allocated.
+ * to hold an in-memory buffer of last pages at the level.
+ *
+ * Sorting GiST build requires good linearization of the sort opclass. This is
+ * not always the case in multidimensional data. To tackle the anomalies, we
+ * buffer index tuples and apply picksplit that can be multidimension-aware.
*/
-typedef struct GistSortedBuildPageState
+typedef struct GistSortedBuildLevelState
{
- Page page;
- struct GistSortedBuildPageState *parent; /* Upper level, if any */
-} GistSortedBuildPageState;
+ int current_page;
+ BlockNumber last_blkno;
+ struct GistSortedBuildLevelState *parent; /* Upper level, if any */
+ Page pages[GIST_SORTED_BUILD_PAGE_NUM];
+} GistSortedBuildLevelState;
/* prototypes for private functions */
@@ -130,11 +137,11 @@ static void gistSortedBuildCallback(Relation index, ItemPointer tid,
Datum *values, bool *isnull,
bool tupleIsAlive, void *state);
static void gist_indexsortbuild(GISTBuildState *state);
-static void gist_indexsortbuild_pagestate_add(GISTBuildState *state,
- GistSortedBuildPageState *pagestate,
+static void gist_indexsortbuild_levelstate_add(GISTBuildState *state,
+ GistSortedBuildLevelState *levelstate,
IndexTuple itup);
-static void gist_indexsortbuild_pagestate_flush(GISTBuildState *state,
- GistSortedBuildPageState *pagestate);
+static void gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
+ GistSortedBuildLevelState *levelstate);
static void gist_indexsortbuild_flush_ready_pages(GISTBuildState *state);
static void gistInitBuffering(GISTBuildState *buildstate);
@@ -396,8 +403,7 @@ static void
gist_indexsortbuild(GISTBuildState *state)
{
IndexTuple itup;
- GistSortedBuildPageState *leafstate;
- GistSortedBuildPageState *pagestate;
+ GistSortedBuildLevelState *levelstate;
Page page;
state->pages_allocated = 0;
@@ -414,10 +420,10 @@ gist_indexsortbuild(GISTBuildState *state)
state->pages_allocated++;
state->pages_written++;
- /* Allocate a temporary buffer for the first leaf page. */
- leafstate = palloc(sizeof(GistSortedBuildPageState));
- leafstate->page = page;
- leafstate->parent = NULL;
+ /* Allocate a temporary buffer for the first leaf page batch. */
+ levelstate = palloc0(sizeof(GistSortedBuildLevelState));
+ levelstate->pages[0] = page;
+ levelstate->parent = NULL;
gistinitpage(page, F_LEAF);
/*
@@ -425,7 +431,7 @@ gist_indexsortbuild(GISTBuildState *state)
*/
while ((itup = tuplesort_getindextuple(state->sortstate, true)) != NULL)
{
- gist_indexsortbuild_pagestate_add(state, leafstate, itup);
+ gist_indexsortbuild_levelstate_add(state, levelstate, itup);
MemoryContextReset(state->giststate->tempCxt);
}
@@ -433,125 +439,189 @@ gist_indexsortbuild(GISTBuildState *state)
* Write out the partially full non-root pages.
*
* Keep in mind that flush can build a new root.
+ * If number of pages is > 1 then new root is required.
*/
- pagestate = leafstate;
- while (pagestate->parent != NULL)
+ while (levelstate->parent != NULL || levelstate->current_page != 0)
{
- GistSortedBuildPageState *parent;
-
- gist_indexsortbuild_pagestate_flush(state, pagestate);
- parent = pagestate->parent;
- pfree(pagestate->page);
- pfree(pagestate);
- pagestate = parent;
+ GistSortedBuildLevelState *parent;
+
+ gist_indexsortbuild_levelstate_flush(state, levelstate);
+ parent = levelstate->parent;
+ for (int i = 0; i < GIST_SORTED_BUILD_PAGE_NUM; i++)
+ if (levelstate->pages[i])
+ pfree(levelstate->pages[i]);
+ pfree(levelstate);
+ levelstate = parent;
}
gist_indexsortbuild_flush_ready_pages(state);
/* Write out the root */
- PageSetLSN(pagestate->page, GistBuildLSN);
- PageSetChecksumInplace(pagestate->page, GIST_ROOT_BLKNO);
+ PageSetLSN(levelstate->pages[0], GistBuildLSN);
+ PageSetChecksumInplace(levelstate->pages[0], GIST_ROOT_BLKNO);
smgrwrite(RelationGetSmgr(state->indexrel), MAIN_FORKNUM, GIST_ROOT_BLKNO,
- pagestate->page, true);
+ levelstate->pages[0], true);
if (RelationNeedsWAL(state->indexrel))
log_newpage(&state->indexrel->rd_node, MAIN_FORKNUM, GIST_ROOT_BLKNO,
- pagestate->page, true);
+ levelstate->pages[0], true);
- pfree(pagestate->page);
- pfree(pagestate);
+ pfree(levelstate->pages[0]);
+ pfree(levelstate);
}
/*
- * Add tuple to a page. If the pages is full, write it out and re-initialize
+ * Add tuple to a page. If the pages are full, write them out and re-initialize
* a new page first.
*/
static void
-gist_indexsortbuild_pagestate_add(GISTBuildState *state,
- GistSortedBuildPageState *pagestate,
+gist_indexsortbuild_levelstate_add(GISTBuildState *state,
+ GistSortedBuildLevelState *levelstate,
IndexTuple itup)
{
Size sizeNeeded;
- /* Does the tuple fit? If not, flush */
- sizeNeeded = IndexTupleSize(itup) + sizeof(ItemIdData) + state->freespace;
- if (PageGetFreeSpace(pagestate->page) < sizeNeeded)
- gist_indexsortbuild_pagestate_flush(state, pagestate);
+ /* Check if tuple can be added to the current page */
+ sizeNeeded = IndexTupleSize(itup) + sizeof(ItemIdData); /* fillfactor ignored */
+ if (PageGetFreeSpace(levelstate->pages[levelstate->current_page]) < sizeNeeded)
+ {
+ Page newPage;
+ Page old_page = levelstate->pages[levelstate->current_page];
+ uint16 old_page_flags = GistPageGetOpaque(old_page)->flags;
+ if (levelstate->current_page + 1 == GIST_SORTED_BUILD_PAGE_NUM)
+ {
+ gist_indexsortbuild_levelstate_flush(state, levelstate);
+ }
+ else
+ levelstate->current_page++;
+
+ if (levelstate->pages[levelstate->current_page] == NULL)
+ levelstate->pages[levelstate->current_page] = palloc(BLCKSZ);
+
+ newPage = levelstate->pages[levelstate->current_page];
+ gistinitpage(newPage, old_page_flags);
+ }
- gistfillbuffer(pagestate->page, &itup, 1, InvalidOffsetNumber);
+ gistfillbuffer(levelstate->pages[levelstate->current_page], &itup, 1, InvalidOffsetNumber);
}
static void
-gist_indexsortbuild_pagestate_flush(GISTBuildState *state,
- GistSortedBuildPageState *pagestate)
+gist_indexsortbuild_levelstate_flush(GISTBuildState *state,
+ GistSortedBuildLevelState *levelstate)
{
- GistSortedBuildPageState *parent;
- IndexTuple *itvec;
- IndexTuple union_tuple;
- int vect_len;
- bool isleaf;
- BlockNumber blkno;
+ GistSortedBuildLevelState *parent;
+ BlockNumber blkno;
MemoryContext oldCtx;
+ IndexTuple union_tuple;
+ SplitedPageLayout *dist;
+ IndexTuple *itvec;
+ int vect_len;
+ bool isleaf = GistPageIsLeaf(levelstate->pages[0]);
- /* check once per page */
CHECK_FOR_INTERRUPTS();
- if (state->ready_num_pages == XLR_MAX_BLOCK_ID)
- gist_indexsortbuild_flush_ready_pages(state);
-
- /*
- * The page is now complete. Assign a block number to it, and add it to
- * the list of finished pages. (We don't write it out immediately, because
- * we want to WAL-log the pages in batches.)
- */
- blkno = state->pages_allocated++;
- state->ready_blknos[state->ready_num_pages] = blkno;
- state->ready_pages[state->ready_num_pages] = pagestate->page;
- state->ready_num_pages++;
+ oldCtx = MemoryContextSwitchTo(state->giststate->tempCxt);
- isleaf = GistPageIsLeaf(pagestate->page);
+ /* Get index tuples from first page */
+ itvec = gistextractpage(levelstate->pages[0], &vect_len);
+ if (levelstate->current_page > 0)
+ {
+ /* Append tuples from each page */
+ for (int i = 1; i < levelstate->current_page + 1; i++)
+ {
+ int len_local;
+ IndexTuple *itvec_local = gistextractpage(levelstate->pages[i], &len_local);
+ itvec = gistjoinvector(itvec, &vect_len, itvec_local, len_local);
+ pfree(itvec_local);
+ }
- /*
- * Form a downlink tuple to represent all the tuples on the page.
- */
- oldCtx = MemoryContextSwitchTo(state->giststate->tempCxt);
- itvec = gistextractpage(pagestate->page, &vect_len);
- union_tuple = gistunion(state->indexrel, itvec, vect_len,
+ /* Apply picksplit to list of all collected tuples */
+ dist = gistSplit(state->indexrel, levelstate->pages[0], itvec, vect_len, state->giststate);
+ }
+ else
+ {
+ /* Create splitted layout from single page */
+ dist = (SplitedPageLayout*) palloc0(sizeof(SplitedPageLayout));
+ union_tuple = gistunion(state->indexrel, itvec, vect_len,
state->giststate);
- ItemPointerSetBlockNumber(&(union_tuple->t_tid), blkno);
+ dist->itup = union_tuple;
+ dist->list = gistfillitupvec(itvec, vect_len, &(dist->lenlist));
+ dist->block.num = vect_len;
+ }
+
MemoryContextSwitchTo(oldCtx);
- /*
- * Insert the downlink to the parent page. If this was the root, create a
- * new page as the parent, which becomes the new root.
- */
- parent = pagestate->parent;
- if (parent == NULL)
+ /* Reset page counter */
+ levelstate->current_page = 0;
+
+ /* Create pages for all partitions in split result */
+ for (;dist != NULL; dist = dist->next)
{
- parent = palloc(sizeof(GistSortedBuildPageState));
- parent->page = (Page) palloc(BLCKSZ);
- parent->parent = NULL;
- gistinitpage(parent->page, 0);
+ char *data;
+ Page target;
- pagestate->parent = parent;
- }
- gist_indexsortbuild_pagestate_add(state, parent, union_tuple);
+ /* check once per whatever */
+ CHECK_FOR_INTERRUPTS();
- /* Re-initialize the page buffer for next page on this level. */
- pagestate->page = palloc(BLCKSZ);
- gistinitpage(pagestate->page, isleaf ? F_LEAF : 0);
+ /* Create page and copy data */
+ data = (char *) (dist->list);
+ target = palloc0(BLCKSZ);
+ gistinitpage(target, isleaf? F_LEAF:0);
+ for (int i = 0; i < dist->block.num; i++)
+ {
+ IndexTuple thistup = (IndexTuple) data;
- /*
- * Set the right link to point to the previous page. This is just for
- * debugging purposes: GiST only follows the right link if a page is split
- * concurrently to a scan, and that cannot happen during index build.
- *
- * It's a bit counterintuitive that we set the right link on the new page
- * to point to the previous page, and not the other way round. But GiST
- * pages are not ordered like B-tree pages are, so as long as the
- * right-links form a chain through all the pages in the same level, the
- * order doesn't matter.
- */
- GistPageGetOpaque(pagestate->page)->rightlink = blkno;
+ if (PageAddItem(target, (Item) data, IndexTupleSize(thistup), i + FirstOffsetNumber, false, false) == InvalidOffsetNumber)
+ elog(ERROR, "failed to add item to index page in \"%s\"", RelationGetRelationName(state->indexrel));
+
+ data += IndexTupleSize(thistup);
+ }
+ union_tuple = dist->itup;
+
+ if (state->ready_num_pages == XLR_MAX_BLOCK_ID)
+ gist_indexsortbuild_flush_ready_pages(state);
+
+ /*
+ * The page is now complete. Assign a block number to it, and add it to
+ * the list of finished pages. (We don't write it out immediately, because
+ * we want to WAL-log the pages in batches.)
+ */
+ blkno = state->pages_allocated++;
+ state->ready_blknos[state->ready_num_pages] = blkno;
+ state->ready_pages[state->ready_num_pages] = target;
+ state->ready_num_pages++;
+ ItemPointerSetBlockNumber(&(union_tuple->t_tid), blkno);
+
+ /*
+ * Set the right link to point to the previous page. This is just for
+ * debugging purposes: GiST only follows the right link if a page is split
+ * concurrently to a scan, and that cannot happen during index build.
+ *
+ * It's a bit counterintuitive that we set the right link on the new page
+ * to point to the previous page, and not the other way round. But GiST
+ * pages are not ordered like B-tree pages are, so as long as the
+ * right-links form a chain through all the pages in the same level, the
+ * order doesn't matter.
+ */
+ if (levelstate->last_blkno)
+ GistPageGetOpaque(target)->rightlink = levelstate->last_blkno;
+ levelstate->last_blkno = blkno;
+
+ /*
+ * Insert the downlink to the parent page. If this was the root, create a
+ * new page as the parent, which becomes the new root.
+ */
+ parent = levelstate->parent;
+ if (parent == NULL)
+ {
+ parent = palloc0(sizeof(GistSortedBuildLevelState));
+ parent->pages[0] = (Page) palloc(BLCKSZ);
+ parent->parent = NULL;
+ gistinitpage(parent->pages[0], 0);
+
+ levelstate->parent = parent;
+ }
+ gist_indexsortbuild_levelstate_add(state, parent, union_tuple);
+ }
}
static void
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 89249ecc97..bfb7802f2d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -969,7 +969,7 @@ GistHstoreOptions
GistInetKey
GistNSN
GistOptBufferingMode
-GistSortedBuildPageState
+GistSortedBuildLevelState
GistSplitUnion
GistSplitVector
GistTsVectorOptions
tiling-1-9.pngimage/png; name=tiling-1-9.pngDownload
�PNG
IHDR � � ?�� 9tEXtSoftware Matplotlib version3.4.3, https://matplotlib.org/��D� pHYs a a�?�i ��IDATx���y���a����kfv�QeP�hL\P��@��i�s��7�Mcr'��������&M�cn�X���MR������
��&" �(WF�f���44�~f�y<���a��u�9�9��z_�J�R ]^�� 0 @�� �� P%` �*! T �J� UB �0 @�� �� P%` �*! T �J� UB �0 @�� �� P%` �*! T �J� UB �0 @�� �� P%` �*! T �J� UB �0 @�� �� P%` �*! T �J� UB �0 @�� �� P%` �*! T �J� UB �0 @�� �� P%` �*! T �J� UB �0 @�� �� P%` �*! T �J� UB �0 @�� �� P%` �*! T �J� UB �0 @�� �� P%` �*! T �J� UB �0 @�� �� P%` �*! T �J� UB �0 @�� �� P%` �*! T ��2z��|�C����;�H�T�w��q��>���=��o ��M �K���{��/9/��R�S ���-z �{��7�g�}�C0`@��k��I�������nH{{�a^ G� @U����������X G�#� �r���/����|�d��1)�J)�JY�n�^��/������K�T�_��_�o��o2v������MozS-Z�������9�&MJ}}}����������OOccc�/_�s�=7
7n\~���&I.\�3�8#�z�����3��=���/��+�����SWW��C���.��%K�k ��'� �r~��~/?�p���!�W���'I�rP�����$/��r>����T*��_�z~��~/�=�X�������+�y�{2e��\}��y����8G}����/��9s����}o��� �����������+��"�������?���7�����Y�~}����$���?�������?��L�4)�?�|����<��C�6m�A�� �k� �r�N��i�������w�s�?y�Z�x��<��#8p`�d���y�;��[n�%s��I�\y��9���s�=��O�>I�3fd���9�����5�~����'?�����$�\� &��������q�I��'�mo{[��_������_���|�#��7���|��|�� ���� �����=�7I�>��$�c�=�dw�mjj�?�����$��{n�L�r@���O�������y���0`@&N���t���-I2`����W���O?}@�
@�� ��{�?��_|��$����$��q��z���{#�9���J�=����F���u��%I����g��5jTN?��|��_�# ��� �����}^_�T��k��-�~����c��;��NF��o|��<yr~��_�� t~0 ]�oj�Hx�;~}���n��uG��#��O~2��o���<�
������ �� � tI�{�N����KG�5G���������]6o��q������t�v$I[[[6m���uC����#���zD� �y�= �����$��?������M�=2w�����_�����w�#o}�[s�e���_���?�'���{D�����_�1�������I'��>}�d���Y�hQ���o� t.0 ]�����|�+_������|��ioo���?~�_w�������|��_���_�+'�pB~����?�aV�\y�_�U
��'?�[o�5?�������q������n>��O� t.�J�R)z tu'�|r��y��= �n�w �~��sgv����uw�qG�-[����3
^�� ���[��3g��@F����W�{��^����+Vd��Ay���c���|����2�� ��� `?l��)��Gs�=���g�M���3c��\s�5;vl�d���Y�p�k>�q��u���� t'0 b�/��/�������+o}�[��"