GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]
Hi, hackers!
I think there is some room for improving GiST inserts. Below is a
description of what I believe is the performance problem.
The function gistplacetopage() in src/backend/access/gist/gist.c is
responsible for updating or appending a new tuple on a GiST page.
Currently, after checking whether a page split is necessary due to
overflow, it essentially executes the following:
if (OffsetNumberIsValid(oldoffnum))
PageIndexTupleDelete(page, oldoffnum);
gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
That is: remove the old tuple if it's there, then place the updated tuple
at the end. On average, half of the old data has to be shifted by memmove()
inside the PageIndexTupleDelete() call, and half of the linps have to be
corrected. If the updated tuple has the same size as the one already
residing on the page, we can simply overwrite it in place. The attached
patch demonstrates that concept. The attached test.sql inserts a million
rows into a GiST index based on the cube extension.
My machine is a Hyper-V VM running Ubuntu on an i5-2500 CPU with SSD
storage. Before the patch, the insert part of the test took 159 seconds on
average; after applying the patch, it takes 77 seconds on average. That is
almost twice as fast (for CPU/memory-bound inserts; a disk-constrained test
will show no improvement). But it works only for fixed-size tuple inserts.
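The essence of the concept fits in a few lines. Here is a sketch of the
same-size fast path, simplified from the attached patch (sanity checks and
the multi-tuple fallback are elided); it sits inside gistplacetopage():

    if (OffsetNumberIsValid(oldoffnum) && ntup == 1)
    {
        ItemId      itemId = PageGetItemId(page, oldoffnum);
        Item        pageItem = PageGetItem(page, itemId);
        Size        newsz = IndexTupleSize(*itup);

        if (IndexTupleSize(pageItem) == newsz)
        {
            /* same size: overwrite in place; nothing shifts, no linp fixup */
            memmove(pageItem, *itup, newsz);
        }
        else
        {
            /* size changed: fall back to delete-then-append */
            PageIndexTupleDelete(page, oldoffnum);
            gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
        }
    }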
I know the code in the patch is far from beautiful: it operates at three
different levels of abstraction within five lines of code. Those are the
low-level memmove(), the system-wide PageAddItem(), and the GiST-private
gistfillbuffer().
By the way, PageAddItem() has an overwrite flag, but it only works with
unused ItemIds; marking the old ItemId as unused before PageAddItem()
didn't work for me. Unfortunately, the bufpage.c routines do not include
one for updating (replacing with a new) tuple on a page. This is important
to me because I'm working on an advanced GiST page layout (
/messages/by-id/CAJEAwVE0rrr+OBT-P0gDCtXbVDkBBG_WcXwCBK=GHo4fewu3Yg@mail.gmail.com
), where the current approach is to use skip-tuples, which can skip many
following tuples with one key check. Obviously, this design depends on
tuple order, and an update in the "place the updated tuple at the end"
fashion won't work for me.
So, I think it would be better to implement PageReplaceItem()
functionality: it would make the code cleaner, make existing GiST inserts
faster, and enable new advanced page layouts in indexes.
Current work is here https://github.com/x4m/pggistopt/tree/simplefix
What do you think about the performance problem I found and about the
attached patch? Am I missing something?
Best regards, Andrey Borodin, Octonica & Ural Federal University.
Attachments:
GiST memmove.patch (application/octet-stream)
From 5da3089fa72ac87e8e9ccaf0fc9ccfbbfb2c22d5 Mon Sep 17 00:00:00 2001
From: x4m <amborodin@acm.org>
Date: Sun, 3 Jul 2016 12:51:55 +0500
Subject: [PATCH] fix
---
src/backend/access/gist/gist.c | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 996363c..49c4560 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -503,9 +503,44 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
* PageIndexTupleDelete() here and PageIndexMultiDelete() in
* gistRedoPageUpdateRecord()
*/
+
+#ifdef OLDBEHAVIOR
if (OffsetNumberIsValid(oldoffnum))
PageIndexTupleDelete(page, oldoffnum);
gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
+#else
+ if (OffsetNumberIsValid(oldoffnum))
+ {
+ if(ntup==1)
+ {
+ ItemId itemId;
+ Size newsz;
+ Item pageItem;
+ newsz = IndexTupleSize(*itup);
+ itemId = PageGetItemId(page,oldoffnum);
+ pageItem = PageGetItem(page,itemId);
+ if(IndexTupleSize(pageItem)==newsz)
+ {
+ memmove(pageItem,*itup,newsz);
+ }
+ else
+ {
+ PageIndexTupleDelete(page, oldoffnum);
+ gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
+ }
+ }
+ else
+ {
+ PageIndexTupleDelete(page, oldoffnum);
+ gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
+ }
+ }
+ else
+ {
+ gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
+ }
+#endif
+
MarkBufferDirty(buffer);
Hi!
On Sun, Jul 3, 2016 at 12:24 PM, Andrew Borodin <borodin@octonica.com>
wrote:
Before the patch, the insert part of the test took 159 seconds on
average; after applying the patch, it takes 77 seconds on average. That
is almost twice as fast.
Very promising results!
So, I think it would be better to implement PageReplaceItem()
functionality: it would make the code cleaner, make existing GiST
inserts faster, and enable new advanced page layouts in indexes.
+1 for PageReplaceItem()
Even if the item is not the same size, we can move the tail of the page
once instead of twice.
I think you should implement a PageReplaceItem() version and add it to
the commitfest.
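For concreteness, a rough sketch of what such a routine might look like
(hypothetical code, not a final implementation: free-space and sanity
checks are omitted, and it relies on the compacted line pointers that
index pages maintain):

    /*
     * Hypothetical PageReplaceItem sketch: replace the tuple at offnum with
     * newtup, moving the tail of the tuple space once even when sizes differ.
     */
    static void
    PageReplaceItemSketch(Page page, OffsetNumber offnum, Item newtup, Size newsize)
    {
        PageHeader  phdr = (PageHeader) page;
        ItemId      itemid = PageGetItemId(page, offnum);
        /* space the old tuple actually occupies on the page */
        Size        oldsize = MAXALIGN(ItemIdGetLength(itemid));
        unsigned    offset = ItemIdGetOffset(itemid);
        int         size_diff = (int) oldsize - (int) MAXALIGN(newsize);
        char       *tupspace = (char *) page + phdr->pd_upper;
        OffsetNumber i;

        if (size_diff != 0)
        {
            /* one memmove shifts everything between pd_upper and the old tuple */
            memmove(tupspace + size_diff, tupspace, offset - phdr->pd_upper);
            phdr->pd_upper += size_diff;

            /* line pointers into the moved region must be adjusted too */
            for (i = FirstOffsetNumber; i <= PageGetMaxOffsetNumber(page); i++)
            {
                ItemId      ii = PageGetItemId(page, i);

                if (ItemIdGetOffset(ii) <= offset)
                    ii->lp_off += size_diff;
            }
        }

        /* point the line pointer at the (possibly moved) slot, then copy */
        ItemIdSetNormal(itemid, offset + size_diff, newsize);
        memmove(PageGetItem(page, itemid), newtup, newsize);
    }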
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi!
I think you should implement a PageReplaceItem() version and add it to the commitfest.
Here is the patch.
I've called the function PageIndexTupleOverwrite() because it's suitable
only for indexes. It works in my tests, and performance is the same as in
the proof of concept (even with some sanity checks copied from
PageIndexTupleDelete()).
Next weekend I'll do more testing, and then add it to the commitfest.
Best regards, Andrey Borodin, Octonica & Ural Federal University.
2016-07-03 15:21 GMT+05:00 Alexander Korotkov <a.korotkov@postgrespro.ru>:
Attachments:
PageIndexTupleOverwrite.patch (application/octet-stream)
From ab9a21492991e0f331ac830862ed557ca2574c44 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Mon, 4 Jul 2016 10:30:37 +0500
Subject: [PATCH] PageIndexTupleOverwrite
---
src/backend/access/gist/gist.c | 20 ++++++-
src/backend/storage/page/bufpage.c | 104 +++++++++++++++++++++++++++++++++++++
src/include/storage/bufpage.h | 1 +
3 files changed, 123 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index fdf0c5a..239971b 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -504,8 +504,24 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
* gistRedoPageUpdateRecord()
*/
if (OffsetNumberIsValid(oldoffnum))
- PageIndexTupleDelete(page, oldoffnum);
- gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
+ {
+ /*if we have just one tuple to update we replace it on-place on page*/
+ if(ntup==1)
+ {
+ PageIndexTupleOverwrite(page,oldoffnum,*itup);
+ }
+ else
+ {
+ /*this corner case is here to support mix calls case (see comment above)*/
+ PageIndexTupleDelete(page, oldoffnum);
+ gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
+ }
+ }
+ else
+ {
+ /*just append all tuples at the end of a page*/
+ gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
+ }
MarkBufferDirty(buffer);
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index f2a07f2..e75f42d 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -707,6 +707,110 @@ PageGetHeapFreeSpace(Page page)
/*
+ * PageIndexTupleOverwrite
+ *
+ * This routine does the work of overwriting a tuple on an index page.
+ *
+ * Unlike heap pages, we keep compacted line pointers.
+ */
+void
+PageIndexTupleOverwrite(Page page, OffsetNumber offnum, Item newtup)
+{
+ PageHeader phdr = (PageHeader) page;
+ char *addr;
+ ItemId tup;
+ int size_diff;
+ int oldsize;
+ unsigned offset;
+ int newsize;
+ int nline;
+
+ /*
+ * As with PageIndexTupleDelete, paranoia is told to be justified.
+ */
+ if (phdr->pd_lower < SizeOfPageHeaderData ||
+ phdr->pd_lower > phdr->pd_upper ||
+ phdr->pd_upper > phdr->pd_special ||
+ phdr->pd_special > BLCKSZ)
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u",
+ phdr->pd_lower, phdr->pd_upper, phdr->pd_special)));
+
+
+
+ nline = PageGetMaxOffsetNumber(page);
+ if ((int) offnum <= 0 || (int) offnum > nline)
+ elog(ERROR, "invalid index offnum: %u", offnum);
+
+ tup = PageGetItemId(page, offnum);
+
+
+ Assert(ItemIdHasStorage(tup));
+
+ newsize = IndexTupleSize(newtup);
+ oldsize = ItemIdGetLength(tup);
+ /*may have negative size here if new tuple is larger*/
+ size_diff = oldsize-newsize;
+ offset = ItemIdGetOffset(tup);
+
+
+ if (offset < phdr->pd_upper || (offset + size_diff) > phdr->pd_special ||
+ offset != MAXALIGN(offset) || size_diff != MAXALIGN(size_diff))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted item pointer: offset = %u, size = %u",
+ offset, (unsigned int) size_diff)));
+
+
+ /*
+ * Now move everything between the old upper bound (beginning of tuple
+ * space) and the end of the overwritten tuple forward, so that space in
+ * the middle of the page is left free. If we've just deleted the tuple
+ * at the beginning of tuple space, then there's no need to do the copy
+ * (and bcopy on some architectures SEGV's if asked to move zero bytes).
+ */
+
+ /* beginning of tuple space */
+ addr = (char *) page + phdr->pd_upper;
+
+ //elog(WARNING,"old %u new %u diff %d offset %u offnum %u movesize %d",oldsize,newsize,size_diff,offset, offnum,(int) (offset - phdr->pd_upper));
+
+ if (size_diff!=0)
+ {
+ int i;
+ memmove(addr + size_diff, addr, (int) (offset - phdr->pd_upper));
+
+ /* adjust free space boundary pointers */
+ phdr->pd_upper += size_diff;
+
+ /*
+ * we need to adjust the linp entries that remain.
+ *
+ * Anything that used to be before the deleted tuple's data was moved
+ * forward by the size of the deleted tuple.
+ */
+
+ for (i = 1; i <= nline; i++)
+ {
+ ItemId ii = PageGetItemId(phdr, i);
+
+ Assert(ItemIdHasStorage(ii));
+ if (ItemIdGetOffset(ii) <= offset)
+ ii->lp_off += size_diff;
+ }
+ }
+
+ /*Fix updated tuple length*/
+ tup = PageGetItemId(page, offnum);
+ tup->lp_len = newsize;
+
+ /*now place new tuple on page*/
+ memmove((char *) page + offset + size_diff, newtup, newsize);
+}
+
+
+/*
* PageIndexTupleDelete
*
* This routine does the work of removing a tuple from an index page.
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 15cebfc..780e982 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -426,6 +426,7 @@ extern Size PageGetFreeSpace(Page page);
extern Size PageGetExactFreeSpace(Page page);
extern Size PageGetHeapFreeSpace(Page page);
extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
+extern void PageIndexTupleOverwrite(Page page, OffsetNumber offnum, Item newtup);
extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
extern void PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos,
int nitems);
--
2.7.0.windows.1
Here is a patch with corrections from Alexander Korotkov.
In my tests, make check and make check-world show no problems on an
Ubuntu 14.04 LTS x64 VM.
Best regards, Andrey Borodin, Octonica & Ural Federal University.
2016-07-04 10:41 GMT+05:00 Andrew Borodin <borodin@octonica.com>:
Attachments:
PageIndexTupleOverwrite v2.patch (application/octet-stream)
From 5d6b227aed186d56d02ed3ec75f9a8ef01661f72 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Mon, 4 Jul 2016 16:08:41 +0500
Subject: [PATCH] temp3
---
src/backend/access/gist/gist.c | 20 ++++++-
src/backend/storage/page/bufpage.c | 104 +++++++++++++++++++++++++++++++++++++
src/include/storage/bufpage.h | 1 +
3 files changed, 123 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index fdf0c5a..ba4dfde 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -504,8 +504,24 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
* gistRedoPageUpdateRecord()
*/
if (OffsetNumberIsValid(oldoffnum))
- PageIndexTupleDelete(page, oldoffnum);
- gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
+ {
+ /*if we have just one tuple to update we replace it on-place on page*/
+ if(ntup == 1)
+ {
+ PageIndexTupleOverwrite(page,oldoffnum,*itup);
+ }
+ else
+ {
+ /*this corner case is here to support mix calls case (see comment above)*/
+ PageIndexTupleDelete(page, oldoffnum);
+ gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
+ }
+ }
+ else
+ {
+ /*just append all tuples at the end of a page*/
+ gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
+ }
MarkBufferDirty(buffer);
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index f2a07f2..d41a233 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -707,6 +707,110 @@ PageGetHeapFreeSpace(Page page)
/*
+ * PageIndexTupleOverwrite
+ *
+ * This routine does the work of overwriting a tuple on an index page.
+ *
+ * Unlike heap pages, we keep compacted line pointers.
+ */
+void
+PageIndexTupleOverwrite(Page page, OffsetNumber offnum, Item newtup)
+{
+ PageHeader phdr = (PageHeader) page;
+ char *addr;
+ ItemId tup;
+ int size_diff;
+ int oldsize;
+ unsigned offset;
+ int newsize;
+ int nline;
+
+ /*
+ * As with PageIndexTupleDelete, paranoia is told to be justified.
+ */
+ if (phdr->pd_lower < SizeOfPageHeaderData ||
+ phdr->pd_lower > phdr->pd_upper ||
+ phdr->pd_upper > phdr->pd_special ||
+ phdr->pd_special > BLCKSZ)
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u",
+ phdr->pd_lower, phdr->pd_upper, phdr->pd_special)));
+
+
+
+ nline = PageGetMaxOffsetNumber(page);
+ if ((int) offnum <= 0 || (int) offnum > nline)
+ elog(ERROR, "invalid index offnum: %u", offnum);
+
+ tup = PageGetItemId(page, offnum);
+
+
+ Assert(ItemIdHasStorage(tup));
+
+ newsize = IndexTupleSize(newtup);
+ oldsize = ItemIdGetLength(tup);
+ /*may have negative size here if new tuple is larger*/
+ size_diff = oldsize - newsize;
+ offset = ItemIdGetOffset(tup);
+
+
+ if (offset < phdr->pd_upper || (offset + size_diff) > phdr->pd_special ||
+ offset != MAXALIGN(offset) || size_diff != MAXALIGN(size_diff))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted item offset: offset = %u, size = %u",
+ offset, (unsigned int) size_diff)));
+
+
+ /*
+ * Now move everything between the old upper bound (beginning of tuple
+ * space) and the end of the overwritten tuple forward, so that space in
+ * the middle of the page is left free. If we've just deleted the tuple
+ * at the beginning of tuple space, then there's no need to do the copy
+ * (and bcopy on some architectures SEGV's if asked to move zero bytes).
+ */
+
+ /* beginning of tuple space */
+ addr = (char *) page + phdr->pd_upper;
+
+ /*skip modifications if nothing has to be changed actually*/
+ if (size_diff != 0)
+ {
+ int i;
+ memmove(addr + size_diff, addr, (int) (offset - phdr->pd_upper));
+
+ /* adjust free space boundary pointers */
+ phdr->pd_upper += size_diff;
+
+ /*
+ * we need to adjust the linp entries that remain.
+ *
+ * Anything that used to be before the deleted tuple's data was moved
+ * forward by the size of the deleted tuple.
+ */
+
+ for (i = 1; i <= nline; i++)
+ {
+ ItemId ii = PageGetItemId(phdr, i);
+
+ Assert(ItemIdHasStorage(ii));
+ if (ItemIdGetOffset(ii) <= offset)
+ ii->lp_off += size_diff;
+ }
+
+ /*Fix updated tuple length*/
+ tup = PageGetItemId(page, offnum);
+ tup->lp_len = newsize;
+ }
+
+
+ /*now place new tuple on page*/
+ memmove((char *) page + offset + size_diff, newtup, newsize);
+}
+
+
+/*
* PageIndexTupleDelete
*
* This routine does the work of removing a tuple from an index page.
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 15cebfc..780e982 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -426,6 +426,7 @@ extern Size PageGetFreeSpace(Page page);
extern Size PageGetExactFreeSpace(Page page);
extern Size PageGetHeapFreeSpace(Page page);
extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
+extern void PageIndexTupleOverwrite(Page page, OffsetNumber offnum, Item newtup);
extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
extern void PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos,
int nitems);
--
2.7.0.windows.1
On Mon, Jul 4, 2016 at 4:52 PM, Andrew Borodin <borodin@octonica.com> wrote:
Here is a patch with corrections from Alexander Korotkov.
In my tests, make check and make check-world show no problems on an Ubuntu 14.04 LTS x64 VM.
- PageIndexTupleDelete(page, oldoffnum);
- gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
+ {
+ /*if we have just one tuple to update we replace it on-place on page*/
+ if(ntup == 1)
+ {
+ PageIndexTupleOverwrite(page,oldoffnum,*itup);
+ }
+ else
+ {
+ /*this corner case is here to support mix calls case (see comment above)*/
+ PageIndexTupleDelete(page, oldoffnum);
+ gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
+ }
Here, you have changed the way a tuple is added on a page, but the WAL
record is still the same as before. I think WAL replay should perform the
action in the same way as the original operation did. If not, it can
change the organization of the page, which might lead to failures when
replaying subsequent WAL records.
+ /*if we have just one tuple to update we replace it on-place on page*/
In comments, we always leave a space both at the beginning and at the end
of the comment. See the comments in nearby code.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Thank you, Amit.
Currently, if WAL replay reconstructs the page in a different order, it
won't be a problem. How can I check that it works? Would it be sufficient
if I kill -9 the backend and postmaster after a commit, and the server
starts up normally and executes SELECT queries after restart?
I'll make a patch with the missing spaces later. I also found some more
missing spaces in the PageIndexTupleOverwrite() call, and a typo in the
comments ("on-place").
Best regards, Andrey Borodin, Octonica & Ural Federal University.
2016-07-04 18:58 GMT+05:00 Amit Kapila <amit.kapila16@gmail.com>:
Andrew Borodin wrote:
Thank you, Amit.
Currently, if WAL replay reconstructs the page in a different order, it won't be a problem.
We require that replay of WAL produce bytes identical to the original.
Among other things, this enables a tool that verifies WAL is working
correctly simply by comparing page images. So please fix it, rather than
just verifying that this works for GiST.
By the way, BRIN indexes need this operation too. The current approach is
to call PageIndexDeleteNoCompact followed by PageAddItem.
I suppose it would be beneficial to use your new routine there too.
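For illustration, the samepage-update path in brin_doupdate() could then
go from two page-rearranging calls to one (a sketch: off, newtup and newsz
stand in for its locals, and note that the proposed routine would need to
take an explicit size, or infer it, for non-IndexTuple callers like BRIN):

    /* current approach: two steps, each rearranging the page */
    PageIndexDeleteNoCompact(page, &off, 1);
    /* overwrite flag lets PageAddItem reuse the now-unused slot */
    PageAddItem(page, (Item) newtup, newsz, off, true, false);

    /* with the proposed routine: one step, one move of the page tail */
    PageIndexTupleOverwrite(page, off, (Item) newtup);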
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Thank you, Alvaro.
Yes, now I see. I'll update gistRedoPageUpdateRecord() to work in
accordance with the patched gistplacetopage().
Also, I've noticed that PageAddItem() uses the MAXALIGN() macro to
calculate tuple size, so PageIndexTupleOverwrite() should behave the same.
About PageIndexDeleteNoCompact() in BRIN: I do not see why
PageIndexDeleteNoCompact() wouldn't be a good solution compared to
PageIndexTupleOverwrite(). Will it mark the tuple as unused until the
next PageAddItem() uses its place?
Best regards, Andrey Borodin, Octonica & Ural Federal University.
2016-07-04 22:16 GMT+05:00 Alvaro Herrera <alvherre@2ndquadrant.com>:
Here is a patch with updated WAL behavior.
I'm not certain about MAXALIGN for the size of the appended tuple.
Currently, if anyone calls PageIndexTupleOverwrite() with an unaligned
tuple, it will ereport(ERROR). I suspect it should align the size instead.
I also suspect that PageIndexTupleOverwrite() should call
ItemIdSetNormal(), so that it is a truly generic function and not just a
private part of GiST.
Best regards, Andrey Borodin, Octonica & Ural Federal University.
2016-07-04 22:59 GMT+05:00 Andrew Borodin <borodin@octonica.com>:
Attachments:
PageIndexTupleOverwrite v3.patch (application/octet-stream)
From 0c882dc983204ed92af874b617c693da81978cce Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Tue, 5 Jul 2016 17:34:24 +0500
Subject: [PATCH] temp4
---
src/backend/access/gist/gist.c | 20 ++++++-
src/backend/access/gist/gistxlog.c | 18 ++++++-
src/backend/storage/page/bufpage.c | 104 +++++++++++++++++++++++++++++++++++++
src/include/storage/bufpage.h | 1 +
4 files changed, 140 insertions(+), 3 deletions(-)
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index fdf0c5a..169566c 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -504,8 +504,24 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
* gistRedoPageUpdateRecord()
*/
if (OffsetNumberIsValid(oldoffnum))
- PageIndexTupleDelete(page, oldoffnum);
- gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
+ {
+ /* if we have just one tuple to update we replace it in-place on page */
+ if(ntup == 1)
+ {
+ PageIndexTupleOverwrite(page, oldoffnum, *itup);
+ }
+ else
+ {
+ /* this corner case is here to support mix calls case (see comment above) */
+ PageIndexTupleDelete(page, oldoffnum);
+ gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
+ }
+ }
+ else
+ {
+ /* just append all tuples at the end of a page */
+ gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
+ }
MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index b48e97c..a71a465 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -80,8 +80,24 @@ gistRedoPageUpdateRecord(XLogReaderState *record)
page = (Page) BufferGetPage(buffer);
+ /* In case of single tuple update GiST calls overwrite
+ * In all other cases function gistplacetopage deletes
+ * old tuples and place updated at the end
+ * */
+ if ( xldata->ntodelete == 1 && xldata->ntoinsert == 1 )
+ {
+ IndexTuple itup;
+ OffsetNumber offnum = *((OffsetNumber *) data);
+ data += sizeof(OffsetNumber);
+ itup = (IndexTuple) data;
+ PageIndexTupleOverwrite(page,offnum,itup);
+ data +=IndexTupleSize(itup);
+ Assert(data - begin == datalen);
+ ninserted++;
+ }
+
/* Delete old tuples */
- if (xldata->ntodelete > 0)
+ else if (xldata->ntodelete > 0)
{
OffsetNumber *todelete = (OffsetNumber *) data;
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index f2a07f2..857ba5a 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -707,6 +707,110 @@ PageGetHeapFreeSpace(Page page)
/*
+ * PageIndexTupleOverwrite
+ *
+ * This routine does the work of overwriting a tuple on an index page.
+ *
+ * Unlike heap pages, we keep compacted line pointers.
+ */
+void
+PageIndexTupleOverwrite(Page page, OffsetNumber offnum, Item newtup)
+{
+ PageHeader phdr = (PageHeader) page;
+ char *addr;
+ ItemId tup;
+ int size_diff;
+ int oldsize;
+ unsigned offset;
+ int newsize;
+ int nline;
+
+ /*
+ * As with PageIndexTupleDelete, paranoia is told to be justified.
+ */
+ if (phdr->pd_lower < SizeOfPageHeaderData ||
+ phdr->pd_lower > phdr->pd_upper ||
+ phdr->pd_upper > phdr->pd_special ||
+ phdr->pd_special > BLCKSZ)
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u",
+ phdr->pd_lower, phdr->pd_upper, phdr->pd_special)));
+
+
+
+ nline = PageGetMaxOffsetNumber(page);
+ if ((int) offnum <= 0 || (int) offnum > nline)
+ elog(ERROR, "invalid index offnum: %u", offnum);
+
+ tup = PageGetItemId(page, offnum);
+
+
+ Assert(ItemIdHasStorage(tup));
+
+ newsize = IndexTupleSize(newtup);
+ oldsize = ItemIdGetLength(tup);
+ /* may have negative size here if new tuple is larger */
+ size_diff = oldsize - newsize;
+ offset = ItemIdGetOffset(tup);
+
+
+ if (offset < phdr->pd_upper || (offset + size_diff) > phdr->pd_special ||
+ offset != MAXALIGN(offset) || size_diff != MAXALIGN(size_diff))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted item offset: offset = %u, size = %u",
+ offset, (unsigned int) size_diff)));
+
+
+ /*
+ * Now move everything between the old upper bound (beginning of tuple
+ * space) and the end of the overwritten tuple forward, so that space in
+ * the middle of the page is left free. If we've just deleted the tuple
+ * at the beginning of tuple space, then there's no need to do the copy
+ * (and bcopy on some architectures SEGV's if asked to move zero bytes).
+ */
+
+ /* beginning of tuple space */
+ addr = (char *) page + phdr->pd_upper;
+
+ /* skip modifications if nothing has to be changed actually */
+ if (size_diff != 0)
+ {
+ int i;
+ memmove(addr + size_diff, addr, (int) (offset - phdr->pd_upper));
+
+ /* adjust free space boundary pointers */
+ phdr->pd_upper += size_diff;
+
+ /*
+ * we need to adjust the linp entries that remain.
+ *
+ * Anything that used to be before the deleted tuple's data was moved
+ * forward by the size of the deleted tuple.
+ */
+
+ for (i = 1; i <= nline; i++)
+ {
+ ItemId ii = PageGetItemId(phdr, i);
+
+ Assert(ItemIdHasStorage(ii));
+ if (ItemIdGetOffset(ii) <= offset)
+ ii->lp_off += size_diff;
+ }
+
+ /* Fix updated tuple length */
+ tup = PageGetItemId(page, offnum);
+ tup->lp_len = newsize;
+ }
+
+
+ /* now place new tuple on page */
+ memmove((char *) page + offset + size_diff, newtup, newsize);
+}
+
+
+/*
* PageIndexTupleDelete
*
* This routine does the work of removing a tuple from an index page.
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 15cebfc..780e982 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -426,6 +426,7 @@ extern Size PageGetFreeSpace(Page page);
extern Size PageGetExactFreeSpace(Page page);
extern Size PageGetHeapFreeSpace(Page page);
extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
+extern void PageIndexTupleOverwrite(Page page, OffsetNumber offnum, Item newtup);
extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
extern void PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos,
int nitems);
--
2.7.0.windows.1
Here is a new patch. Updates:
1. Uses MAXALIGN to allocate space on the page.
2. Uses ItemIdSetNormal() in case the ItemId was not normal for some
reason before the call.
3. Improved comments and variable names.
Best regards, Andrey Borodin, Octonica & Ural Federal University.
2016-07-05 17:54 GMT+05:00 Andrew Borodin <borodin@octonica.com>:
Attachments:
PageIndexTupleOverwrite v4.patch (application/octet-stream)
From b631ead5be1035131ef1a2c9e95b540bb4af5ac5 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Wed, 6 Jul 2016 22:04:41 +0500
Subject: [PATCH] temp7
---
src/backend/access/gist/gist.c | 20 ++++++-
src/backend/access/gist/gistxlog.c | 22 +++++++-
src/backend/storage/page/bufpage.c | 106 +++++++++++++++++++++++++++++++++++++
src/include/storage/bufpage.h | 1 +
4 files changed, 145 insertions(+), 4 deletions(-)
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index fdf0c5a..169566c 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -504,8 +504,24 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
* gistRedoPageUpdateRecord()
*/
if (OffsetNumberIsValid(oldoffnum))
- PageIndexTupleDelete(page, oldoffnum);
- gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
+ {
+ /* if we have just one tuple to update we replace it in-place on page */
+ if(ntup == 1)
+ {
+ PageIndexTupleOverwrite(page, oldoffnum, *itup);
+ }
+ else
+ {
+ /* this corner case is here to support mix calls case (see comment above) */
+ PageIndexTupleDelete(page, oldoffnum);
+ gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
+ }
+ }
+ else
+ {
+ /* just append all tuples at the end of a page */
+ gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
+ }
MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index b48e97c..934d50d 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -80,9 +80,26 @@ gistRedoPageUpdateRecord(XLogReaderState *record)
page = (Page) BufferGetPage(buffer);
- /* Delete old tuples */
- if (xldata->ntodelete > 0)
+ /* In case of single tuple update GiST calls overwrite
+ * In all other cases function gistplacetopage deletes
+ * old tuples and place updated at the end
+ * */
+ if ( xldata->ntodelete == 1 && xldata->ntoinsert == 1 )
{
+ IndexTuple itup;
+ OffsetNumber offnum = *((OffsetNumber *) data);
+ data += sizeof(OffsetNumber);
+ itup = (IndexTuple) data;
+ PageIndexTupleOverwrite(page,offnum,itup);
+ /* set up data pointer to skip PageAddItem loop */
+ data +=IndexTupleSize(itup);
+ Assert(data - begin == datalen);
+ /* correct insertion count for following assert check */
+ ninserted++;
+ }
+ else if (xldata->ntodelete > 0)
+ {
+ /* if it's not in-place update we proceed with deleting old tuples */
OffsetNumber *todelete = (OffsetNumber *) data;
data += sizeof(OffsetNumber) * xldata->ntodelete;
@@ -115,6 +132,7 @@ gistRedoPageUpdateRecord(XLogReaderState *record)
}
}
+ /* check that buffer had exactly same count of tuples */
Assert(ninserted == xldata->ntoinsert);
PageSetLSN(page, lsn);
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index f2a07f2..b196a05 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -707,6 +707,112 @@ PageGetHeapFreeSpace(Page page)
/*
+ * PageIndexTupleOverwrite
+ *
+ * This routine does the work of overwriting a tuple on an index page.
+ *
+ * Unlike heap pages, we keep compacted line pointers.
+ */
+void
+PageIndexTupleOverwrite(Page page, OffsetNumber offnum, Item newtup)
+{
+ PageHeader phdr = (PageHeader) page;
+ char *addr;
+ ItemId tupid;
+ int size_diff;
+ int oldsize;
+ unsigned offset;
+ int newsize;
+ unsigned alignednewsize;
+ int itemcount;
+
+ /*
+ * As with PageIndexTupleDelete, paranoia is told to be justified.
+ */
+ if (phdr->pd_lower < SizeOfPageHeaderData ||
+ phdr->pd_lower > phdr->pd_upper ||
+ phdr->pd_upper > phdr->pd_special ||
+ phdr->pd_special > BLCKSZ)
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u",
+ phdr->pd_lower, phdr->pd_upper, phdr->pd_special)));
+
+
+
+ itemcount = PageGetMaxOffsetNumber(page);
+ if ((int) offnum <= 0 || (int) offnum > itemcount)
+ elog(ERROR, "invalid index offnum: %u", offnum);
+
+ tupid = PageGetItemId(page, offnum);
+ /* set the item pointer */
+
+
+ Assert(ItemIdHasStorage(tupid));
+
+ newsize = IndexTupleSize(newtup);
+ alignednewsize = MAXALIGN(newsize);
+ oldsize = ItemIdGetLength(tupid);
+ /* may have negative size here if new tuple is larger */
+ size_diff = oldsize - alignednewsize;
+ offset = ItemIdGetOffset(tupid);
+
+
+ if (offset < phdr->pd_upper || (offset + size_diff) > phdr->pd_special ||
+ offset != MAXALIGN(offset) || size_diff != MAXALIGN(size_diff))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted item offset: offset = %u, size = %u",
+ offset, (unsigned int) size_diff)));
+
+
+ /*
+ * Now move everything between the old upper bound (beginning of tuple
+ * space) and the end of the overwritten tuple forward, so that space in
+ * the middle of the page is left free. If we've just deleted the tuple
+ * at the beginning of tuple space, then there's no need to do the copy
+ * (and bcopy on some architectures SEGV's if asked to move zero bytes).
+ */
+
+ /* beginning of tuple space */
+ addr = (char *) page + phdr->pd_upper;
+
+ /* skip modifications if nothing has to be changed actually */
+ if (size_diff != 0)
+ {
+ int i;
+ memmove(addr + size_diff, addr, (int) (offset - phdr->pd_upper));
+
+ /* adjust free space boundary pointers */
+ phdr->pd_upper += size_diff;
+
+ /*
+ * we need to adjust the linp entries that remain.
+ *
+ * Anything that used to be before the deleted tuple's data was moved
+ * forward by the size of the deleted tuple.
+ */
+
+ for (i = FirstOffsetNumber; i <= itemcount; i++)
+ {
+ ItemId ii = PageGetItemId(phdr, i);
+
+ Assert(ItemIdHasStorage(ii));
+ if (ItemIdGetOffset(ii) <= offset)
+ ii->lp_off += size_diff;
+ }
+ }
+
+ /* Fix updated tuple length */
+ ItemIdSetNormal(tupid, offset + size_diff, newsize);
+
+
+ /* now place new tuple on page */
+ memmove(PageGetItem(page,tupid), newtup, newsize);
+}
+
+
+/*
* PageIndexTupleDelete
*
* This routine does the work of removing a tuple from an index page.
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 15cebfc..780e982 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -426,6 +426,7 @@ extern Size PageGetFreeSpace(Page page);
extern Size PageGetExactFreeSpace(Page page);
extern Size PageGetHeapFreeSpace(Page page);
extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
+extern void PageIndexTupleOverwrite(Page page, OffsetNumber offnum, Item newtup);
extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
extern void PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos,
int nitems);
--
2.7.0.windows.1
One more thing came to my mind:
WAL records written by GiST before the patch will corrupt a GiST index
after the patch if replayed. Is that a problem? Maybe it's prevented at
some higher level?
Best regards, Andrey Borodin, Octonica & Ural Federal University.
2016-07-06 22:11 GMT+05:00 Andrew Borodin <borodin@octonica.com>:
On Fri, Jul 8, 2016 at 2:00 PM, Andrew Borodin <borodin@octonica.com> wrote:
(Please avoid top-posting on threads, it is annoying)
One more thing came to my mind:
WAL records written by GiST before the patch will corrupt a GiST index
after the patch if replayed. Is that a problem? Maybe it's prevented at
some higher level?
If a feature changes the shape of WAL records, XLOG_PAGE_MAGIC is
bumped to prevent any problems.
--
Michael
If a feature changes the shape of WAL records, XLOG_PAGE_MAGIC is bumped to prevent any problems.
I've attached a patch with the bump but, obviously, it will become
irrelevant after any other commit that changes the WAL shape.
Here is also a patch with an updated GiST README.
I haven't found an appropriate place to describe the
PageIndexTupleOverwrite() function in the docs, since all of its siblings
are described only in the code.
The code comments describing this function are consistent with the others
(PageAddItem, PageIndexTupleDelete).
Best regards, Andrey Borodin, Octonica & Ural Federal University.
Attachments:
XLog-magic-bump.patch (application/octet-stream)
From f96e2a7b4130cc8c97405a3e332d7600fdf5fa1b Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sun, 24 Jul 2016 14:54:13 +0500
Subject: [PATCH] XLog magic bump
---
src/include/access/xlog_internal.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 2627519..0a595cc 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -31,7 +31,7 @@
/*
* Each page of XLOG file has a header like this:
*/
-#define XLOG_PAGE_MAGIC 0xD092 /* can be used as WAL version indicator */
+#define XLOG_PAGE_MAGIC 0xD093 /* can be used as WAL version indicator */
typedef struct XLogPageHeaderData
{
--
2.7.0.windows.1
PageIndexTupleOverwrite v5.patch (application/octet-stream)
From 3cffa2e1f9bf4efd0d3546bf279a20d18836dd43 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sun, 24 Jul 2016 14:42:21 +0500
Subject: [PATCH] temp8
---
src/backend/access/gist/README | 8 +++
src/backend/access/gist/gist.c | 20 ++++++-
src/backend/access/gist/gistxlog.c | 22 +++++++-
src/backend/storage/page/bufpage.c | 106 +++++++++++++++++++++++++++++++++++++
src/include/storage/bufpage.h | 1 +
5 files changed, 153 insertions(+), 4 deletions(-)
diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index dd4c9fa..b02247f 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -171,6 +171,14 @@ pages. The caller must then call gistplacetopage() on the parent page to
insert the downlink tuples. The parent page that holds the downlink to
the child might have migrated as a result of concurrent splits of the
parent, gistfindCorrectParent() is used to find the parent page.
+gistplacetopage gets a list of tuples and an offnum as parameters. If the
+passed offnum is invalid, all tuples are appended at the end of the page. If
+the list contains many tuples and offnum is valid, the tuple at offnum is
+deleted from the page and all tuples from the list are appended at the end of
+the page. The append is performed by the PageAddItem routine. If the list
+contains exactly one tuple and offnum is valid, the tuple on the page is
+replaced using the PageIndexTupleOverwrite routine, which shows a significant
+performance improvement when the old tuple's size matches the new tuple's size.
Splitting the root page works slightly differently. At root split,
gistplacetopage() allocates the new child pages and replaces the old root
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index fdf0c5a..169566c 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -504,8 +504,24 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
* gistRedoPageUpdateRecord()
*/
if (OffsetNumberIsValid(oldoffnum))
- PageIndexTupleDelete(page, oldoffnum);
- gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
+ {
+ /* if we have just one tuple to update we replace it in-place on page */
+ if(ntup == 1)
+ {
+ PageIndexTupleOverwrite(page, oldoffnum, *itup);
+ }
+ else
+ {
+ /* this corner case is here to support mix calls case (see comment above) */
+ PageIndexTupleDelete(page, oldoffnum);
+ gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
+ }
+ }
+ else
+ {
+ /* just append all tuples at the end of a page */
+ gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
+ }
MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index b48e97c..934d50d 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -80,9 +80,26 @@ gistRedoPageUpdateRecord(XLogReaderState *record)
page = (Page) BufferGetPage(buffer);
- /* Delete old tuples */
- if (xldata->ntodelete > 0)
+ /* In case of single tuple update GiST calls overwrite
+ * In all other cases function gistplacetopage deletes
+ * old tuples and place updated at the end
+ * */
+ if ( xldata->ntodelete == 1 && xldata->ntoinsert == 1 )
{
+ IndexTuple itup;
+ OffsetNumber offnum = *((OffsetNumber *) data);
+ data += sizeof(OffsetNumber);
+ itup = (IndexTuple) data;
+ PageIndexTupleOverwrite(page,offnum,itup);
+ /* set up data pointer to skip PageAddItem loop */
+ data +=IndexTupleSize(itup);
+ Assert(data - begin == datalen);
+ /* correct insertion count for following assert check */
+ ninserted++;
+ }
+ else if (xldata->ntodelete > 0)
+ {
+ /* if it's not in-place update we proceed with deleting old tuples */
OffsetNumber *todelete = (OffsetNumber *) data;
data += sizeof(OffsetNumber) * xldata->ntodelete;
@@ -115,6 +132,7 @@ gistRedoPageUpdateRecord(XLogReaderState *record)
}
}
+ /* check that buffer had exactly same count of tuples */
Assert(ninserted == xldata->ntoinsert);
PageSetLSN(page, lsn);
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index f2a07f2..b196a05 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -707,6 +707,112 @@ PageGetHeapFreeSpace(Page page)
/*
+ * PageIndexTupleOverwrite
+ *
+ * This routine does the work of overwriting a tuple on an index page.
+ *
+ * Unlike heap pages, we keep compacted line pointers.
+ */
+void
+PageIndexTupleOverwrite(Page page, OffsetNumber offnum, Item newtup)
+{
+ PageHeader phdr = (PageHeader) page;
+ char *addr;
+ ItemId tupid;
+ int size_diff;
+ int oldsize;
+ unsigned offset;
+ int newsize;
+ unsigned alignednewsize;
+ int itemcount;
+
+ /*
+ * As with PageIndexTupleDelete, paranoia is told to be justified.
+ */
+ if (phdr->pd_lower < SizeOfPageHeaderData ||
+ phdr->pd_lower > phdr->pd_upper ||
+ phdr->pd_upper > phdr->pd_special ||
+ phdr->pd_special > BLCKSZ)
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u",
+ phdr->pd_lower, phdr->pd_upper, phdr->pd_special)));
+
+
+
+ itemcount = PageGetMaxOffsetNumber(page);
+ if ((int) offnum <= 0 || (int) offnum > itemcount)
+ elog(ERROR, "invalid index offnum: %u", offnum);
+
+ tupid = PageGetItemId(page, offnum);
+ /* set the item pointer */
+
+
+ Assert(ItemIdHasStorage(tupid));
+
+ newsize = IndexTupleSize(newtup);
+ alignednewsize = MAXALIGN(newsize);
+ oldsize = ItemIdGetLength(tupid);
+ /* may have negative size here if new tuple is larger */
+ size_diff = oldsize - alignednewsize;
+ offset = ItemIdGetOffset(tupid);
+
+
+ if (offset < phdr->pd_upper || (offset + size_diff) > phdr->pd_special ||
+ offset != MAXALIGN(offset) || size_diff != MAXALIGN(size_diff))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted item offset: offset = %u, size = %u",
+ offset, (unsigned int) size_diff)));
+
+
+ /*
+ * Now move everything between the old upper bound (beginning of tuple
+ * space) and the end of the overwritten tuple forward, so that space in
+ * the middle of the page is left free. If we've just deleted the tuple
+ * at the beginning of tuple space, then there's no need to do the copy
+ * (and bcopy on some architectures SEGV's if asked to move zero bytes).
+ */
+
+ /* beginning of tuple space */
+ addr = (char *) page + phdr->pd_upper;
+
+ /* skip modifications if nothing has to be changed actually */
+ if (size_diff != 0)
+ {
+ int i;
+ memmove(addr + size_diff, addr, (int) (offset - phdr->pd_upper));
+
+ /* adjust free space boundary pointers */
+ phdr->pd_upper += size_diff;
+
+ /*
+ * we need to adjust the linp entries that remain.
+ *
+ * Anything that used to be before the deleted tuple's data was moved
+ * forward by the size of the deleted tuple.
+ */
+
+ for (i = FirstOffsetNumber; i <= itemcount; i++)
+ {
+ ItemId ii = PageGetItemId(phdr, i);
+
+ Assert(ItemIdHasStorage(ii));
+ if (ItemIdGetOffset(ii) <= offset)
+ ii->lp_off += size_diff;
+ }
+ }
+
+ /* Fix updated tuple length */
+ ItemIdSetNormal(tupid, offset + size_diff, newsize);
+
+
+ /* now place new tuple on page */
+ memmove(PageGetItem(page,tupid), newtup, newsize);
+}
+
+
+/*
* PageIndexTupleDelete
*
* This routine does the work of removing a tuple from an index page.
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 15cebfc..780e982 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -426,6 +426,7 @@ extern Size PageGetFreeSpace(Page page);
extern Size PageGetExactFreeSpace(Page page);
extern Size PageGetHeapFreeSpace(Page page);
extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
+extern void PageIndexTupleOverwrite(Page page, OffsetNumber offnum, Item newtup);
extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
extern void PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos,
int nitems);
--
2.7.0.windows.1
On Sun, Jul 24, 2016 at 7:18 PM, Andrew Borodin <borodin@octonica.com> wrote:
If a feature changes the shape of WAL records, XLOG_PAGE_MAGIC is bumped to prevent any problems.
I've attached a patch with a bump, but, obviously, it'll be irrelevant
after any other commit changing the WAL shape.
Usually the committer in charge of reviewing such a patch would bump
it. There is no need for the patch submitter to do so. I should have
been more precise previously, sorry for my twisted words.
--
Michael
Michael Paquier <michael.paquier@gmail.com> writes:
On Sun, Jul 24, 2016 at 7:18 PM, Andrew Borodin <borodin@octonica.com> wrote:
I've attached a patch with a bump, but, obviously, it'll be irrelevant
after any other commit changing the WAL shape.
Usually the committer in charge of reviewing such a patch would bump
it. There is no need for the patch submitter to do so. I should have
been more precise previously, sorry for my twisted words.
It's good to remind the committer that such a bump is needed, of course.
But yeah, casting the reminder in the form of a hunk of the patch is
more likely to cause trouble than be helpful.
regards, tom lane
Andrew Borodin wrote:
If a feature changes the shape of WAL records, XLOG_PAGE_MAGIC is bumped to prevent any problems.
I've attached a patch with a bump, but, obviously, it'll be irrelevant
after any other commit changing the WAL shape.
Here is a patch with an updated GiST README.
I haven't found an appropriate place to describe the PageIndexTupleOverwrite
function in the docs, since all its siblings are described only in the code.
Code comments describing this function are coherent with the others
(PageAddItem, PageIndexTupleDelete).
Can you please patch BRIN to use this new function too?
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Can you please patch BRIN to use this new function too?
AFAIK Sergey Mirvoda was going to implement and test it.
It is easy to do this optimization for brin_samepage_update (see the patch
draft in the attachment; make check passes), but the WAL handling of a
regular BRIN update is not clear to me.
Best regards, Andrey Borodin, Octonica & Ural Federal University.
Attachments:
Brin_PageIndexTupleOverwrite.patch
From 4fb35e379c884b85614ba0b07a8f0f745b8a19fe Mon Sep 17 00:00:00 2001
From: x4m <amborodin@acm.org>
Date: Mon, 25 Jul 2016 13:52:55 +0500
Subject: [PATCH] brinpatch
mend
---
src/backend/access/brin/brin_pageops.c | 7 +++----
src/backend/access/brin/brin_xlog.c | 5 +----
src/backend/access/gist/gist.c | 2 +-
src/backend/access/gist/gistxlog.c | 2 +-
src/backend/storage/page/bufpage.c | 6 ++----
src/include/storage/bufpage.h | 2 +-
6 files changed, 9 insertions(+), 15 deletions(-)
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index d0ca485..63c252a 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -178,10 +178,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
}
START_CRIT_SECTION();
- PageIndexDeleteNoCompact(oldpage, &oldoff, 1);
- if (PageAddItem(oldpage, (Item) newtup, newsz, oldoff, true,
- false) == InvalidOffsetNumber)
- elog(ERROR, "failed to add BRIN tuple");
+
+ PageIndexTupleOverwrite(oldpage,oldoff,(Item)newtup,newsz);
+
MarkBufferDirty(oldbuf);
/* XLOG stuff */
diff --git a/src/backend/access/brin/brin_xlog.c b/src/backend/access/brin/brin_xlog.c
index deb7af4..87c7894 100644
--- a/src/backend/access/brin/brin_xlog.c
+++ b/src/backend/access/brin/brin_xlog.c
@@ -192,10 +192,7 @@ brin_xlog_samepage_update(XLogReaderState *record)
if (PageGetMaxOffsetNumber(page) + 1 < offnum)
elog(PANIC, "brin_xlog_samepage_update: invalid max offset number");
- PageIndexDeleteNoCompact(page, &offnum, 1);
- offnum = PageAddItem(page, (Item) brintuple, tuplen, offnum, true, false);
- if (offnum == InvalidOffsetNumber)
- elog(PANIC, "brin_xlog_samepage_update: failed to add tuple");
+ PageIndexTupleOverwrite(page,offnum,(Item)brintuple, tuplen);
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 5ac1211..6e2ac2a 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -508,7 +508,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
/* if we have just one tuple to update we replace it in-place on page */
if(ntup == 1)
{
- PageIndexTupleOverwrite(page, oldoffnum, *itup);
+ PageIndexTupleOverwrite(page, oldoffnum, *itup,IndexTupleSize(*itup));
}
else
{
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 934d50d..67461d9 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -90,7 +90,7 @@ gistRedoPageUpdateRecord(XLogReaderState *record)
OffsetNumber offnum = *((OffsetNumber *) data);
data += sizeof(OffsetNumber);
itup = (IndexTuple) data;
- PageIndexTupleOverwrite(page,offnum,itup);
+ PageIndexTupleOverwrite(page,offnum,itup,IndexTupleSize(itup));
/* set up data pointer to skip PageAddItem loop */
data +=IndexTupleSize(itup);
Assert(data - begin == datalen);
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index cfbd7f0..842f050 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -682,7 +682,7 @@ PageGetHeapFreeSpace(Page page)
* Unlike heap pages, we keep compacted line pointers.
*/
void
-PageIndexTupleOverwrite(Page page, OffsetNumber offnum, Item newtup)
+PageIndexTupleOverwrite(Page page, OffsetNumber offnum, Item newtup, Size newsize)
{
PageHeader phdr = (PageHeader) page;
char *addr;
@@ -690,7 +690,6 @@ PageIndexTupleOverwrite(Page page, OffsetNumber offnum, Item newtup)
int size_diff;
int oldsize;
unsigned offset;
- int newsize;
unsigned alignednewsize;
int itemcount;
@@ -718,7 +717,6 @@ PageIndexTupleOverwrite(Page page, OffsetNumber offnum, Item newtup)
Assert(ItemIdHasStorage(tupid));
- newsize = IndexTupleSize(newtup);
alignednewsize = MAXALIGN(newsize);
oldsize = ItemIdGetLength(tupid);
/* may have negative size here if new tuple is larger */
@@ -765,7 +763,7 @@ PageIndexTupleOverwrite(Page page, OffsetNumber offnum, Item newtup)
{
ItemId ii = PageGetItemId(phdr, i);
- Assert(ItemIdHasStorage(ii));
+ //Assert(ItemIdHasStorage(ii));
if (ItemIdGetOffset(ii) <= offset)
ii->lp_off += size_diff;
}
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index eee708b..ef08e7f 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -404,7 +404,7 @@ extern Size PageGetFreeSpace(Page page);
extern Size PageGetExactFreeSpace(Page page);
extern Size PageGetHeapFreeSpace(Page page);
extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
-extern void PageIndexTupleOverwrite(Page page, OffsetNumber offnum, Item newtup);
+extern void PageIndexTupleOverwrite(Page page, OffsetNumber offnum, Item newtup, Size newsize);
extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
extern void PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos,
int nitems);
Can you please patch BRIN to use this new function too?
On my machine, replacing both BRIN update cases (see
https://github.com/x4m/pggistopt/commit/a6d301ff79104b977619339d53aebf748045418a
) showed no performance change on the following update query (about 6
seconds of updates on average):
create table dataTable(x int, y int);
insert into dataTable(x,y) select x,y from generate_series(1,1e3,1)
x,generate_series(1,1e3,1) y;
create index idx on dataTable using brin(x,y);
update datatable set x = random()*1024, y = random()*1024;
https://gist.github.com/x4m/7e69fd924b9ffd2fdc9c6100e741171d
Probably I was looking in the wrong place. I do not see other cases where
PageIndexTupleOverwrite can improve the performance of BRIN. Still, I'll
make PageIndexTupleOverwrite BRIN-compatible in the forthcoming patch
version: BRIN tuples have no length in the header, so it must be taken as
a parameter, just as PageAddItem does.
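The difference is visible in the two call sites from the draft above; the
length argument is derived for GiST tuples but passed through for BRIN:

/* a GiST index tuple carries its length in the tuple header: */
PageIndexTupleOverwrite(page, offnum, (Item) itup, IndexTupleSize(itup));
/* a BRIN tuple does not, so the caller supplies the length it knows: */
PageIndexTupleOverwrite(page, offnum, (Item) brintuple, tuplen);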
Best regards, Andrey Borodin, Octonica & Ural Federal University.
Here is a BRIN-compatible version of the patch. The function
PageIndexTupleOverwrite now takes the tuple size as a parameter; this
behavior is coherent with PageAddItem.
Also, I had to remove the assertion that an item has storage in the loop
that adjusts line pointers. It would be great if someone could check
that place (the commented-out Assert(ItemIdHasStorage(ii)); in
PageIndexTupleOverwrite). I suspect that there might be a case when
linp's should not be adjusted.
Best regards, Andrey Borodin, Octonica & Ural Federal University.
Attachments:
PageIndexTupleOverwrite v6.patch
From 78f8994dd9138ed0c7ed17e4ed89b3303d7c472e Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sat, 30 Jul 2016 15:54:04 +0500
Subject: [PATCH] PageIndexTupleOverwrite GiST Optimization
---
src/backend/access/gist/README | 8 +++
src/backend/access/gist/gist.c | 20 ++++++-
src/backend/access/gist/gistxlog.c | 24 +++++++-
src/backend/storage/page/bufpage.c | 110 +++++++++++++++++++++++++++++++++++++
src/include/storage/bufpage.h | 1 +
5 files changed, 158 insertions(+), 5 deletions(-)
diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index dd4c9fa..b02247f 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -171,6 +171,14 @@ pages. The caller must then call gistplacetopage() on the parent page to
insert the downlink tuples. The parent page that holds the downlink to
the child might have migrated as a result of concurrent splits of the
parent, gistfindCorrectParent() is used to find the parent page.
+gistplacetopage gets a list of tuples and an offnum as parameters. If the
+passed offnum is invalid, all tuples will be appended at the end of the page.
+If the list contains many tuples and offnum is valid, the tuple at offnum will
+be deleted from the page and all tuples from the list appended at the end of
+the page. The append is performed by the PageAddItem routine. If the list
+ contains exactly one tuple and offnum is valid, the tuple on the page is
+replaced with the PageIndexTupleOverwrite routine, which shows a significant
+performance improvement when the old tuple's size matches the new tuple's size.
Splitting the root page works slightly differently. At root split,
gistplacetopage() allocates the new child pages and replaces the old root
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index fdf0c5a..a60884e 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -504,8 +504,24 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
* gistRedoPageUpdateRecord()
*/
if (OffsetNumberIsValid(oldoffnum))
- PageIndexTupleDelete(page, oldoffnum);
- gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
+ {
+ /* if we have just one tuple to update we replace it in-place on page */
+ if(ntup == 1)
+ {
+ PageIndexTupleOverwrite(page, oldoffnum, (Item)*itup,IndexTupleSize(*itup));
+ }
+ else
+ {
+ /* this corner case is here to support mix calls case (see comment above) */
+ PageIndexTupleDelete(page, oldoffnum);
+ gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
+ }
+ }
+ else
+ {
+ /* just append all tuples at the end of a page */
+ gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
+ }
MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index b48e97c..72e8c61 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -80,12 +80,29 @@ gistRedoPageUpdateRecord(XLogReaderState *record)
page = (Page) BufferGetPage(buffer);
- /* Delete old tuples */
- if (xldata->ntodelete > 0)
+ /* In case of single tuple update GiST calls overwrite
+ * In all other cases function gistplacetopage deletes
+ * old tuples and place updated at the end
+ * */
+ if (xldata->ntodelete == 1 && xldata->ntoinsert == 1)
{
+ IndexTuple itup;
+ OffsetNumber offnum = *((OffsetNumber *) data);
+ data += sizeof(OffsetNumber);
+ itup = (IndexTuple) data;
+ PageIndexTupleOverwrite(page, offnum, (Item)itup, IndexTupleSize(itup));
+ /* set up data pointer to skip PageAddItem loop */
+ data += IndexTupleSize(itup);
+ Assert(data - begin == datalen);
+ /* correct insertion count for following assert check */
+ ninserted++;
+ }
+ else if (xldata->ntodelete > 0)
+ {
+ /* if it's not in-place update we proceed with deleting old tuples */
OffsetNumber *todelete = (OffsetNumber *) data;
- data += sizeof(OffsetNumber) * xldata->ntodelete;
+ data += sizeof(OffsetNumber) *xldata->ntodelete;
PageIndexMultiDelete(page, todelete, xldata->ntodelete);
if (GistPageIsLeaf(page))
@@ -115,6 +132,7 @@ gistRedoPageUpdateRecord(XLogReaderState *record)
}
}
+ /* check that buffer had exactly same count of tuples */
Assert(ninserted == xldata->ntoinsert);
PageSetLSN(page, lsn);
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index f2a07f2..22e422c 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -707,6 +707,116 @@ PageGetHeapFreeSpace(Page page)
/*
+ * PageIndexTupleOverwrite
+ *
+ * This routine does the work of overwriting a tuple on an index page.
+ *
+ * Unlike heap pages, we keep compacted line pointers.
+ */
+void
+PageIndexTupleOverwrite(Page page, OffsetNumber offnum, Item newtup, Size newsize)
+{
+ PageHeader phdr = (PageHeader) page;
+ char *addr;
+ ItemId tupid;
+ int size_diff;
+ int oldsize;
+ unsigned offset;
+ unsigned alignednewsize;
+ int itemcount;
+
+ /*
+ * As with PageIndexTupleDelete, paranoia is told to be justified.
+ */
+ if (phdr->pd_lower < SizeOfPageHeaderData ||
+ phdr->pd_lower > phdr->pd_upper ||
+ phdr->pd_upper > phdr->pd_special ||
+ phdr->pd_special > BLCKSZ)
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u",
+ phdr->pd_lower, phdr->pd_upper, phdr->pd_special)));
+
+
+
+ itemcount = PageGetMaxOffsetNumber(page);
+ if ((int) offnum <= 0 || (int) offnum > itemcount)
+ elog(ERROR, "invalid index offnum: %u", offnum);
+
+ tupid = PageGetItemId(page, offnum);
+ /* set the item pointer */
+
+
+ Assert(ItemIdHasStorage(tupid));
+
+ alignednewsize = MAXALIGN(newsize);
+ oldsize = ItemIdGetLength(tupid);
+ /* may have negative size here if new tuple is larger */
+ size_diff = oldsize - alignednewsize;
+ offset = ItemIdGetOffset(tupid);
+
+
+ if (offset < phdr->pd_upper || (offset + size_diff) > phdr->pd_special ||
+ offset != MAXALIGN(offset) || size_diff != MAXALIGN(size_diff))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted item offset: offset = %u, size = %u",
+ offset, (unsigned int) size_diff)));
+
+
+ /*
+ * Now move everything between the old upper bound (beginning of tuple
+ * space) and the end of the overwritten tuple forward, so that space in
+ * the middle of the page is left free. If we've just deleted the tuple
+ * at the beginning of tuple space, then there's no need to do the copy
+ * (and bcopy on some architectures SEGV's if asked to move zero bytes).
+ */
+
+ /* beginning of tuple space */
+ addr = (char *) page + phdr->pd_upper;
+
+ /* skip modifications if nothing has to be changed actually */
+ if (size_diff != 0)
+ {
+ int i;
+ memmove(addr + size_diff, addr, (int) (offset - phdr->pd_upper));
+
+ /* adjust free space boundary pointers */
+ phdr->pd_upper += size_diff;
+
+ /*
+ * we need to adjust the linp entries that remain.
+ *
+ * Anything that used to be before the deleted tuple's data was moved
+ * forward by the size of the deleted tuple.
+ */
+
+ for (i = FirstOffsetNumber; i <= itemcount; i++)
+ {
+ ItemId ii = PageGetItemId(phdr, i);
+
+ /* Normally here was following assertion
+ * Assert(ItemIdHasStorage(ii));
+ * This assertion was introduced in PageIndexTupleDelete
+ * But if this function will be used from BRIN index
+ * this assertion will fail. Thus, here we do not check that
+ * item has the storage.
+ * */
+ if (ItemIdGetOffset(ii) <= offset)
+ ii->lp_off += size_diff;
+ }
+ }
+
+ /* Fix updated tuple length */
+ ItemIdSetNormal(tupid, offset + size_diff, newsize);
+
+
+ /* now place new tuple on page */
+ memmove(PageGetItem(page,tupid), newtup, newsize);
+}
+
+
+/*
* PageIndexTupleDelete
*
* This routine does the work of removing a tuple from an index page.
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 15cebfc..d587c6d 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -426,6 +426,7 @@ extern Size PageGetFreeSpace(Page page);
extern Size PageGetExactFreeSpace(Page page);
extern Size PageGetHeapFreeSpace(Page page);
extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
+extern void PageIndexTupleOverwrite(Page page, OffsetNumber offnum, Item newtup, Size newsize);
extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
extern void PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos,
int nitems);
--
2.7.0.windows.1
30.07.2016 14:00, Andrew Borodin:
Here is a BRIN-compatible version of the patch. The function
PageIndexTupleOverwrite now takes the tuple size as a parameter; this
behavior is coherent with PageAddItem.
Also, I had to remove the assertion that an item has storage in the loop
that adjusts line pointers. It would be great if someone could check
that place (the commented-out Assert(ItemIdHasStorage(ii)); in
PageIndexTupleOverwrite). I suspect that there might be a case when
linp's should not be adjusted.
Hi, I reviewed the code and I have a couple of questions about the
following code:
/* may have negative size here if new tuple is larger */
size_diff = oldsize - alignednewsize;
offset = ItemIdGetOffset(tupid);
if (offset < phdr->pd_upper || (offset + size_diff) >
phdr->pd_special ||
offset != MAXALIGN(offset) || size_diff != MAXALIGN(size_diff))
ereport(ERROR,
(errcode(ERRCODE_DATA_CORRUPTED),
errmsg("corrupted item offset: offset = %u, size = %u",
offset, (unsigned int) size_diff)));
First of all, shouldn't we use MAXALIGN(oldsize) instead of oldsize?
Although, I'm quite sure that it was already aligned somewhere before.
I doubt that the check (size_diff != MAXALIGN(size_diff)) is necessary.
I'd rather add the check: (offset+size_diff < pd_lower).
Besides that, the patch seems pretty good.
BTW, I'm very surprised that it improves performance so much.
Also, the index size is reduced significantly: 89 MB against 289 MB without
the patch. Could you explain in detail why that happens?
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Anastasia, thank you for your attentive code review.
2016-08-05 21:19 GMT+05:00 Anastasia Lubennikova <a.lubennikova@postgrespro.ru>:
First of all, shouldn't we use MAXALIGN(oldsize) instead of oldsize?
Although, I'm quite sure that it was already aligned somewhere before.
I doubt that the check (size_diff != MAXALIGN(size_diff)) is necessary.
I'd rather add the check: (offset+size_diff < pd_lower).
Actually, that's a tricky question. There may be different code expectations.
1. If we expect that a not-maxaligned tuple may be placed by some other
routine, we should remove the check (size_diff != MAXALIGN(size_diff)),
since it will fail for a not-maxaligned tuple.
2. If we expect that no one should use PageIndexTupleOverwrite with
not-maxaligned tuples, then the current checks are OK: we will break
execution if we find a non-maxaligned tuple, with an almost-correct message.
I suggest that this check could be a debug-only assertion: in production
code the routine will work with not-maxaligned tuples if they already
reside on the page, while in a dev build it will inform the developer that
something is going wrong. Is this behavior Postgres-style compliant?
I agree that the pd_lower check makes sense.
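A minimal sketch of what such a debug-only check could look like (my reading
of the suggestion, not code from the patch):

/* tolerate not-maxaligned lp_len values in production builds, but
 * surface them in assert-enabled builds */
Assert(oldsize == MAXALIGN(oldsize));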
BTW, I'm very surprised that it improves performance so much.
Also, the index size is reduced significantly: 89 MB against 289 MB without
the patch. Could you explain in detail why that happens?
Size reduction is unexpected for me.
There might be 4 plausible explanations. I'll list them in descending
order of probability:
1. Before this patch, every update threw the recently updated tuple
to the end of a page. Thus, in somewhat ordered data, the recent best fit
will be discovered last. This can increase MBB overlap in a
spatial index and make the tree slightly taller. But a threefold size
decrease is unlikely.
How did you obtain those results? Can I look at a test case?
2. A bug in PageIndexTupleDelete causing unused space to appear. I've
searched for it, unsuccessfully.
3. A bug in PageIndexTupleOverwrite. I cannot imagine the nature of such a
bug. Maybe we are not placing something big but not very important on
a page?
4. Magic.
I do not see what one could do to the R-tree to change its size by
a factor of 3. The first three explanations are not much better than the
fourth, actually.
Those 89 MB, they do not include WAL, right?
Thank you for the review.
Best regards, Andrey Borodin, Octonica & Ural Federal University.
06.08.2016 19:51, Andrew Borodin:
Anastasia, thank you for your attentive code review.
2016-08-05 21:19 GMT+05:00 Anastasia Lubennikova <a.lubennikova@postgrespro.ru>:
First of all, shouldn't we use MAXALIGN(oldsize) instead of oldsize?
Although, I'm quite sure that it was already aligned somewhere before.
I doubt that the check (size_diff != MAXALIGN(size_diff)) is necessary.
I'd rather add the check: (offset+size_diff < pd_lower).
Actually, that's a tricky question. There may be different code expectations.
1. If we expect that a not-maxaligned tuple may be placed by some other
routine, we should remove the check (size_diff != MAXALIGN(size_diff)),
since it will fail for a not-maxaligned tuple.
2. If we expect that no one should use PageIndexTupleOverwrite with
not-maxaligned tuples, then the current checks are OK: we will break
execution if we find a non-maxaligned tuple, with an almost-correct message.
I suggest that this check could be a debug-only assertion: in production
code the routine will work with not-maxaligned tuples if they already
reside on the page, while in a dev build it will inform the developer that
something is going wrong. Is this behavior Postgres-style compliant?
I agree that the pd_lower check makes sense.
In pointing out this comparison I was worried about a possible call of
MAXALIGN on a negative size_diff value, although I'm not sure whether it
can cause any problem.
Tuples on a page are always maxaligned.
But, as far as I recall, itemid->lp_len can contain a non-maxaligned value.
So I'd suggest performing MAXALIGN(oldsize) before computing size_diff:
size_diff = oldsize - alignednewsize;
Since both arguments are then maxaligned, the check of size_diff is not needed.
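Concretely, the computation would then read (a sketch matching the names
used in the patch):

oldsize = ItemIdGetLength(tupid);
oldsize = MAXALIGN(oldsize);		/* lp_len may be stored not-maxaligned */
alignednewsize = MAXALIGN(newsize);
size_diff = oldsize - alignednewsize;	/* both operands are now maxaligned */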
BTW, I'm very surprised that it improves performance so much.
Also, the index size is reduced significantly: 89 MB against 289 MB without
the patch. Could you explain in detail why that happens?
Size reduction is unexpected for me.
There might be 4 plausible explanations. I'll list them in descending
order of probability:
1. Before this patch, every update threw the recently updated tuple
to the end of a page. Thus, in somewhat ordered data, the recent best fit
will be discovered last. This can increase MBB overlap in a
spatial index and make the tree slightly taller. But a threefold size
decrease is unlikely.
How did you obtain those results? Can I look at a test case?
Nothing special, I've just run the test you provided in this thread
and checked the index size via select pg_size_pretty(pg_relation_size('idx'));
2. A bug in PageIndexTupleDelete causing unused space to appear. I've
searched for it, unsuccessfully.
3. A bug in PageIndexTupleOverwrite. I cannot imagine the nature of such a
bug. Maybe we are not placing something big but not very important on
a page?
4. Magic.
I do not see what one could do to the R-tree to change its size by
a factor of 3. The first three explanations are not much better than the
fourth, actually.
Those 89 MB, they do not include WAL, right?
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Here is a new version of the patch; it now includes recommendations from
Anastasia Lubennikova.
I've investigated the anomalous index size decrease. The most probable
explanation turned out to be true.
The cube extension, like some others, uses Guttman's polynomial-time split
algorithm. It is known to generate "needle-like" MBBs (MBRs) for
sorted data due to imbalanced splits (splitting 100 tuples as 98 to
2). Due to the previous throw-to-the-end behavior of GiST this imbalance
was further amplified (most inserts were going to the bigger part
after a split). So GiST inserts were extremely slow for sorted data.
There is no need to do exact sorting to trigger this behavior.
This behavior can be defused by implementing the small-m restriction in
the split (AFAIR this is described in the original R-tree paper from '84),
which prescribes splitting into parts no smaller than m, where m
is around 20% of a page's capacity (in number of tuples). After applying
this patch, performance for sorted data is roughly the same as
performance for randomized data.
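For illustration, the small-m restriction amounts to something like this
(a hedged sketch with hypothetical names; it is not part of the attached patch):

/* m is ~20% of page capacity in tuples; never produce a side smaller than m */
int	m = Max(1, maxoff / 5);
while (nleft < m)
	move_entry(&rightside, &leftside);	/* move_entry is a hypothetical helper */
while (nright < m)
	move_entry(&leftside, &rightside);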
If the data is randomized, this effect of the Guttman poly-time split does
not arise; the test from the start of the thread will show no statistically
significant improvement from the patch.
To prove the performance increase in the randomized case I've composed a
modified test: https://gist.github.com/x4m/856b2e1a5db745f8265c76b9c195f2e1
This test, over five runs, shows the following:
Before patch: Insert Time AVG 78 seconds, STD 9.5
After patch: Insert Time AVG 68 seconds, STD 7.7
This is a marginal but statistically significant performance improvement.
For sorted data performance is improved by a factor of 3.
Best regards, Andrey Borodin, Octonica & Ural Federal University.
Attachments:
PageIndexTupleOverwrite v7.patch
From 8c6dcb8d434eb0d7e3a3adf176224951f34a6fad Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Tue, 9 Aug 2016 21:18:08 +0500
Subject: [PATCH] GiST inserts optimization with new function
PageIndexTupleOverwrite Allows to avoid jigging tuples on GiST page.
---
src/backend/access/gist/README | 8 +++
src/backend/access/gist/gist.c | 20 ++++++-
src/backend/access/gist/gistxlog.c | 24 +++++++-
src/backend/storage/page/bufpage.c | 112 +++++++++++++++++++++++++++++++++++++
src/include/storage/bufpage.h | 1 +
5 files changed, 160 insertions(+), 5 deletions(-)
diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index dd4c9fa..b02247f 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -171,6 +171,14 @@ pages. The caller must then call gistplacetopage() on the parent page to
insert the downlink tuples. The parent page that holds the downlink to
the child might have migrated as a result of concurrent splits of the
parent, gistfindCorrectParent() is used to find the parent page.
+gistplacetopage gets a list of tuples and an offnum as parameters. If the
+passed offnum is invalid, all tuples will be appended at the end of the page.
+If the list contains many tuples and offnum is valid, the tuple at offnum will
+be deleted from the page and all tuples from the list appended at the end of
+the page. The append is performed by the PageAddItem routine. If the list
+ contains exactly one tuple and offnum is valid, the tuple on the page is
+replaced with the PageIndexTupleOverwrite routine, which shows a significant
+performance improvement when the old tuple's size matches the new tuple's size.
Splitting the root page works slightly differently. At root split,
gistplacetopage() allocates the new child pages and replaces the old root
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index fdf0c5a..a60884e 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -504,8 +504,24 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
* gistRedoPageUpdateRecord()
*/
if (OffsetNumberIsValid(oldoffnum))
- PageIndexTupleDelete(page, oldoffnum);
- gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
+ {
+ /* if we have just one tuple to update we replace it in-place on page */
+ if(ntup == 1)
+ {
+ PageIndexTupleOverwrite(page, oldoffnum, (Item)*itup,IndexTupleSize(*itup));
+ }
+ else
+ {
+ /* this corner case is here to support mix calls case (see comment above) */
+ PageIndexTupleDelete(page, oldoffnum);
+ gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
+ }
+ }
+ else
+ {
+ /* just append all tuples at the end of a page */
+ gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
+ }
MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index b48e97c..72e8c61 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -80,12 +80,29 @@ gistRedoPageUpdateRecord(XLogReaderState *record)
page = (Page) BufferGetPage(buffer);
- /* Delete old tuples */
- if (xldata->ntodelete > 0)
+ /* In case of single tuple update GiST calls overwrite
+ * In all other cases function gistplacetopage deletes
+ * old tuples and place updated at the end
+ * */
+ if (xldata->ntodelete == 1 && xldata->ntoinsert == 1)
{
+ IndexTuple itup;
+ OffsetNumber offnum = *((OffsetNumber *) data);
+ data += sizeof(OffsetNumber);
+ itup = (IndexTuple) data;
+ PageIndexTupleOverwrite(page, offnum, (Item)itup, IndexTupleSize(itup));
+ /* set up data pointer to skip PageAddItem loop */
+ data += IndexTupleSize(itup);
+ Assert(data - begin == datalen);
+ /* correct insertion count for following assert check */
+ ninserted++;
+ }
+ else if (xldata->ntodelete > 0)
+ {
+ /* if it's not in-place update we proceed with deleting old tuples */
OffsetNumber *todelete = (OffsetNumber *) data;
- data += sizeof(OffsetNumber) * xldata->ntodelete;
+ data += sizeof(OffsetNumber) *xldata->ntodelete;
PageIndexMultiDelete(page, todelete, xldata->ntodelete);
if (GistPageIsLeaf(page))
@@ -115,6 +132,7 @@ gistRedoPageUpdateRecord(XLogReaderState *record)
}
}
+ /* check that buffer had exactly same count of tuples */
Assert(ninserted == xldata->ntoinsert);
PageSetLSN(page, lsn);
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index f2a07f2..62dccc6 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -707,6 +707,118 @@ PageGetHeapFreeSpace(Page page)
/*
+ * PageIndexTupleOverwrite
+ *
+ * This routine does the work of overwriting a tuple on an index page.
+ *
+ * Unlike heap pages, we keep compacted line pointers.
+ */
+void
+PageIndexTupleOverwrite(Page page, OffsetNumber offnum, Item newtup, Size newsize)
+{
+ PageHeader phdr = (PageHeader) page;
+ char *addr;
+ ItemId tupid;
+ int size_diff;
+ int oldsize;
+ unsigned offset;
+ unsigned alignednewsize;
+ int itemcount;
+
+ /*
+ * As with PageIndexTupleDelete, paranoia is told to be justified.
+ */
+ if (phdr->pd_lower < SizeOfPageHeaderData ||
+ phdr->pd_lower > phdr->pd_upper ||
+ phdr->pd_upper > phdr->pd_special ||
+ phdr->pd_special > BLCKSZ)
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u",
+ phdr->pd_lower, phdr->pd_upper, phdr->pd_special)));
+
+
+
+ itemcount = PageGetMaxOffsetNumber(page);
+ if ((int) offnum <= 0 || (int) offnum > itemcount)
+ elog(ERROR, "invalid index offnum: %u", offnum);
+
+ tupid = PageGetItemId(page, offnum);
+ /* set the item pointer */
+
+
+ Assert(ItemIdHasStorage(tupid));
+
+ alignednewsize = MAXALIGN(newsize);
+ oldsize = ItemIdGetLength(tupid);
+ /* tuples on a page are always maxaligned */
+ oldsize = MAXALIGN(oldsize);
+ /* may have negative size here if new tuple is larger */
+ size_diff = oldsize - alignednewsize;
+ offset = ItemIdGetOffset(tupid);
+
+
+ if (offset < phdr->pd_upper || (offset + size_diff) > phdr->pd_special ||
+ offset != MAXALIGN(offset) || (offset + size_diff) < phdr->pd_lower)
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted item offset: offset = %u, size = %u",
+ offset, (unsigned int) size_diff)));
+
+
+ /*
+ * Now move everything between the old upper bound (beginning of tuple
+ * space) and the end of the overwritten tuple forward, so that space in
+ * the middle of the page is left free. If we've just deleted the tuple
+ * at the beginning of tuple space, then there's no need to do the copy
+ * (and bcopy on some architectures SEGV's if asked to move zero bytes).
+ */
+
+ /* beginning of tuple space */
+ addr = (char *) page + phdr->pd_upper;
+
+ /* skip modifications if nothing has to be changed actually */
+ if (size_diff != 0)
+ {
+ int i;
+ memmove(addr + size_diff, addr, (int) (offset - phdr->pd_upper));
+
+ /* adjust free space boundary pointers */
+ phdr->pd_upper += size_diff;
+
+ /*
+ * we need to adjust the linp entries that remain.
+ *
+ * Anything that used to be before the deleted tuple's data was moved
+ * forward by the size of the deleted tuple.
+ */
+
+ for (i = FirstOffsetNumber; i <= itemcount; i++)
+ {
+ ItemId ii = PageGetItemId(phdr, i);
+
+ /* Normally here was following assertion
+ * Assert(ItemIdHasStorage(ii));
+ * This assertion was introduced in PageIndexTupleDelete
+ * But if this function will be used from BRIN index
+ * this assertion will fail. Thus, here we do not check that
+ * item has the storage.
+ * */
+ if (ItemIdGetOffset(ii) <= offset)
+ ii->lp_off += size_diff;
+ }
+ }
+
+ /* Fix updated tuple length */
+ ItemIdSetNormal(tupid, offset + size_diff, newsize);
+
+
+ /* now place new tuple on page */
+ memmove(PageGetItem(page,tupid), newtup, newsize);
+}
+
+
+/*
* PageIndexTupleDelete
*
* This routine does the work of removing a tuple from an index page.
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 15cebfc..d587c6d 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -426,6 +426,7 @@ extern Size PageGetFreeSpace(Page page);
extern Size PageGetExactFreeSpace(Page page);
extern Size PageGetHeapFreeSpace(Page page);
extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
+extern void PageIndexTupleOverwrite(Page page, OffsetNumber offnum, Item newtup, Size newsize);
extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
extern void PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos,
int nitems);
--
2.7.0.windows.1
09.08.2016 19:45, Andrew Borodin:
Here is a new version of the patch; it now includes recommendations from
Anastasia Lubennikova.
I've investigated the anomalous index size decrease. The most probable
explanation turned out to be true.
The cube extension, like some others, uses Guttman's polynomial-time split
algorithm. It is known to generate "needle-like" MBBs (MBRs) for
sorted data due to imbalanced splits (splitting 100 tuples as 98 to
2). Due to the previous throw-to-the-end behavior of GiST this imbalance
was further amplified (most inserts were going to the bigger part
after a split). So GiST inserts were extremely slow for sorted data.
There is no need to do exact sorting to trigger this behavior.
This behavior can be defused by implementing the small-m restriction in
the split (AFAIR this is described in the original R-tree paper from '84),
which prescribes splitting into parts no smaller than m, where m
is around 20% of a page's capacity (in number of tuples). After applying
this patch, performance for sorted data is roughly the same as
performance for randomized data.
Thank you for the explanation. Now it's clear to me. I did some more testing
and found nothing special. The declared feature is implemented correctly.
It passes all regression tests and improves performance.
I still have a few minor nitpicks about the patch style.
You can find the style guide at
https://www.postgresql.org/docs/9.6/static/source.html
1) remove extra whitespace in README
2) add whitespace: if(ntup == 1)
3) fix comments in accordance with coding conventions (see the reworded
sketch after this list):
/* In case of single tuple update GiST calls overwrite
* In all other cases function gistplacetopage deletes
* old tuples and place updated at the end
* */
+ /* Normally here was following assertion
+ * Assert(ItemIdHasStorage(ii));
+ * This assertion was introduced in PageIndexTupleDelete
+ * But if this function will be used from BRIN index
+ * this assertion will fail. Thus, here we do not check that
+ * item has the storage.
+ * */
4) remove unrelated changes
- data += sizeof(OffsetNumber) * xldata->ntodelete;
+ data += sizeof(OffsetNumber) *xldata->ntodelete;
5) If the comment is correct, maxalignment is not necessary.
+ /* tuples on a page are always maxaligned */
+ oldsize = MAXALIGN(oldsize);
6) I'd rather use alignednewsize here.
+ ItemIdSetNormal(tupid, offset + size_diff, newsize);
After the cleanup you can change status to "Ready for Committer"
without waiting for the response.
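As an example for nitpick 3, the first of those comments reworded in the
usual PostgreSQL block-comment style (a suggestion, not text from the patch):

/*
 * In case of a single-tuple update, GiST overwrites the tuple in place.
 * In all other cases gistplacetopage() deletes the old tuples and places
 * the updated ones at the end of the page.
 */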
If the data is randomized, this effect of the Guttman poly-time split does
not arise; the test from the start of the thread will show no statistically
significant improvement from the patch.
To prove the performance increase in the randomized case I've composed a
modified test: https://gist.github.com/x4m/856b2e1a5db745f8265c76b9c195f2e1
This test, over five runs, shows the following:
Before patch: Insert Time AVG 78 seconds, STD 9.5
After patch: Insert Time AVG 68 seconds, STD 7.7
This is a marginal but statistically significant performance improvement.
For sorted data performance is improved by a factor of 3.
Best regards, Andrey Borodin, Octonica & Ural Federal University.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Thank you for your corrections.
Here is the patch with the suggestions taken into account, except the 6th.
6) I'd rather use alignednewsize here.
+ ItemIdSetNormal(tupid, offset + size_diff, newsize);
This behavior accords with the ubiquitous PageAddItem.
Best regards, Andrey Borodin, Octonica & Ural Federal University.
Attachments:
PageIndexTupleOverwrite v8.patch
From 4d1d88fd8088dfa13befb2995092852c13a6d96b Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Thu, 18 Aug 2016 12:57:03 +0500
Subject: [PATCH] GiST inserts optimization with new function
PageIndexTupleOverwrite Allows to avoid jigging tuples on GiST page.
---
src/backend/access/gist/README | 8 +++
src/backend/access/gist/gist.c | 20 ++++++-
src/backend/access/gist/gistxlog.c | 22 ++++++-
src/backend/storage/page/bufpage.c | 115 +++++++++++++++++++++++++++++++++++++
src/include/storage/bufpage.h | 1 +
5 files changed, 162 insertions(+), 4 deletions(-)
diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index dd4c9fa..09cc40b 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -171,6 +171,14 @@ pages. The caller must then call gistplacetopage() on the parent page to
insert the downlink tuples. The parent page that holds the downlink to
the child might have migrated as a result of concurrent splits of the
parent, gistfindCorrectParent() is used to find the parent page.
+gistplacetopage gets a list of tuples and an offnum as parameters. If the
+passed offnum is invalid, all tuples will be appended at the end of the page.
+If the list contains many tuples and offnum is valid, the tuple at offnum will
+be deleted from the page and all tuples from the list appended at the end of
+the page. The append is performed by the PageAddItem routine. If the list
+contains exactly one tuple and offnum is valid, the tuple on the page is
+replaced with the PageIndexTupleOverwrite routine, which shows a significant
+performance improvement when the old tuple's size matches the new tuple's size.
Splitting the root page works slightly differently. At root split,
gistplacetopage() allocates the new child pages and replaces the old root
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index e8034b9..d1c0b7c 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -504,8 +504,24 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
* gistRedoPageUpdateRecord()
*/
if (OffsetNumberIsValid(oldoffnum))
- PageIndexTupleDelete(page, oldoffnum);
- gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
+ {
+ /* if we have just one tuple to update we replace it in-place on page */
+ if( ntup == 1 )
+ {
+ PageIndexTupleOverwrite(page, oldoffnum, (Item)*itup,IndexTupleSize(*itup));
+ }
+ else
+ {
+ /* this corner case is here to support mix calls case (see comment above) */
+ PageIndexTupleDelete(page, oldoffnum);
+ gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
+ }
+ }
+ else
+ {
+ /* just append all tuples at the end of a page */
+ gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
+ }
MarkBufferDirty(buffer);
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 01c7ef7..b441349 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -80,9 +80,26 @@ gistRedoPageUpdateRecord(XLogReaderState *record)
page = (Page) BufferGetPage(buffer);
- /* Delete old tuples */
- if (xldata->ntodelete > 0)
+ /* In case of single tuple update GiST calls overwrite
+ * In all other cases function gistplacetopage deletes
+ * old tuples and place updated at the end
+ */
+ if (xldata->ntodelete == 1 && xldata->ntoinsert == 1)
{
+ IndexTuple itup;
+ OffsetNumber offnum = *((OffsetNumber *) data);
+ data += sizeof(OffsetNumber);
+ itup = (IndexTuple) data;
+ PageIndexTupleOverwrite(page, offnum, (Item)itup, IndexTupleSize(itup));
+ /* set up data pointer to skip PageAddItem loop */
+ data += IndexTupleSize(itup);
+ Assert(data - begin == datalen);
+ /* correct insertion count for following assert check */
+ ninserted++;
+ }
+ else if (xldata->ntodelete > 0)
+ {
+ /* if it's not in-place update we proceed with deleting old tuples */
OffsetNumber *todelete = (OffsetNumber *) data;
data += sizeof(OffsetNumber) * xldata->ntodelete;
@@ -115,6 +132,7 @@ gistRedoPageUpdateRecord(XLogReaderState *record)
}
}
+ /* check that buffer had exactly same count of tuples */
Assert(ninserted == xldata->ntoinsert);
PageSetLSN(page, lsn);
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index f2a07f2..c803865 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -707,6 +707,121 @@ PageGetHeapFreeSpace(Page page)
/*
+ * PageIndexTupleOverwrite
+ *
+ * This routine does the work of overwriting a tuple on an index page.
+ *
+ * Unlike heap pages, we keep compacted line pointers.
+ */
+void
+PageIndexTupleOverwrite(Page page, OffsetNumber offnum, Item newtup, Size newsize)
+{
+ PageHeader phdr = (PageHeader) page;
+ char *addr;
+ ItemId tupid;
+ int size_diff;
+ int oldsize;
+ unsigned offset;
+ unsigned alignednewsize;
+ int itemcount;
+
+ /*
+ * As with PageIndexTupleDelete, paranoia is told to be justified.
+ */
+ if (phdr->pd_lower < SizeOfPageHeaderData ||
+ phdr->pd_lower > phdr->pd_upper ||
+ phdr->pd_upper > phdr->pd_special ||
+ phdr->pd_special > BLCKSZ)
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u",
+ phdr->pd_lower, phdr->pd_upper, phdr->pd_special)));
+
+
+
+ itemcount = PageGetMaxOffsetNumber(page);
+ if ((int) offnum <= 0 || (int) offnum > itemcount)
+ elog(ERROR, "invalid index offnum: %u", offnum);
+
+ tupid = PageGetItemId(page, offnum);
+ /* set the item pointer */
+
+
+ Assert(ItemIdHasStorage(tupid));
+
+ alignednewsize = MAXALIGN(newsize);
+ oldsize = ItemIdGetLength(tupid);
+ /*
+ * Tuples on a page are always maxaligned, but PageAddItem puts
+ * to ItemId size before doing MAXALIGN.
+ */
+ oldsize = MAXALIGN(oldsize);
+ /* may have negative size here if new tuple is larger */
+ size_diff = oldsize - alignednewsize;
+ offset = ItemIdGetOffset(tupid);
+
+
+ if (offset < phdr->pd_upper || (offset + size_diff) > phdr->pd_special ||
+ offset != MAXALIGN(offset) || (offset + size_diff) < phdr->pd_lower)
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted item offset: offset = %u, size = %u",
+ offset, (unsigned int) size_diff)));
+
+
+ /*
+ * Now move everything between the old upper bound (beginning of tuple
+ * space) and the end of the overwritten tuple forward, so that space in
+ * the middle of the page is left free. If we've just deleted the tuple
+ * at the beginning of tuple space, then there's no need to do the copy
+ * (and bcopy on some architectures SEGV's if asked to move zero bytes).
+ */
+
+ /* beginning of tuple space */
+ addr = (char *) page + phdr->pd_upper;
+
+ /* skip modifications if nothing has to be changed actually */
+ if (size_diff != 0)
+ {
+ int i;
+ memmove(addr + size_diff, addr, (int) (offset - phdr->pd_upper));
+
+ /* adjust free space boundary pointers */
+ phdr->pd_upper += size_diff;
+
+ /*
+ * we need to adjust the linp entries that remain.
+ *
+ * Anything that used to be before the deleted tuple's data was moved
+ * forward by the size of the deleted tuple.
+ */
+
+ for (i = FirstOffsetNumber; i <= itemcount; i++)
+ {
+ ItemId ii = PageGetItemId(phdr, i);
+
+ /* Normally here was following assertion
+ * Assert(ItemIdHasStorage(ii));
+ * This assertion was introduced in PageIndexTupleDelete
+ * But if this function will be used from BRIN index
+ * this assertion will fail. Thus, here we do not check that
+ * item has the storage.
+ */
+ if (ItemIdGetOffset(ii) <= offset)
+ ii->lp_off += size_diff;
+ }
+ }
+
+ /* Fix updated tuple length */
+ ItemIdSetNormal(tupid, offset + size_diff, newsize);
+
+
+ /* now place new tuple on page */
+ memmove(PageGetItem(page,tupid), newtup, newsize);
+}
+
+
+/*
* PageIndexTupleDelete
*
* This routine does the work of removing a tuple from an index page.
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 15cebfc..d587c6d 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -426,6 +426,7 @@ extern Size PageGetFreeSpace(Page page);
extern Size PageGetExactFreeSpace(Page page);
extern Size PageGetHeapFreeSpace(Page page);
extern void PageIndexTupleDelete(Page page, OffsetNumber offset);
+extern void PageIndexTupleOverwrite(Page page, OffsetNumber offnum, Item newtup, Size newsize);
extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
extern void PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos,
int nitems);
--
2.7.0.windows.1
Andrew Borodin <borodin@octonica.com> writes:
Thank you for your corrections.
Here is the patch with the suggestions taken into account, except the 6th.
I'll pick this up, unless some other committer is already on it.
I think that we should buy back the addition of PageIndexTupleOverwrite
to bufpage.c by getting rid of PageIndexDeleteNoCompact, as suggested
earlier by Alvaro. The latter seems a lot messier in concept, and it's
more general than BRIN needs. It also kinda looks like we could get
rid of PAI_ALLOW_FAR_OFFSET, which is a messy kluge in itself.
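The shape of that swap is already visible in the BRIN draft from earlier in
the thread (a sketch taken from that draft, not a finished change):

- PageIndexDeleteNoCompact(oldpage, &oldoff, 1);
- if (PageAddItem(oldpage, (Item) newtup, newsz, oldoff, true,
-                 false) == InvalidOffsetNumber)
-     elog(ERROR, "failed to add BRIN tuple");
+ PageIndexTupleOverwrite(oldpage, oldoff, (Item) newtup, newsz);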
regards, tom lane
Tom Lane wrote:
Andrew Borodin <borodin@octonica.com> writes:
Thank you for your corrections.
Here is the patch with the suggestions taken into account, except the 6th.
I'll pick this up, unless some other committer is already on it.
I think that we should buy back the addition of PageIndexTupleOverwrite
to bufpage.c by getting rid of PageIndexDeleteNoCompact, as suggested
earlier by Alvaro. The latter seems a lot messier in concept, and it's
more general than BRIN needs.
Yeah, BRIN only wants to remove one item nowadays; that function was
written when the BRIN code wanted to remove multiple items in one go.
It also kinda looks like we could get rid of PAI_ALLOW_FAR_OFFSET,
which is a messy kluge in itself.
That'd be nice.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hey Alvaro, can you comment on this bit in the proposed patch?
+ for (i = FirstOffsetNumber; i <= itemcount; i++)
+ {
+ ItemId ii = PageGetItemId(phdr, i);
+
+ /* Normally here was following assertion
+ * Assert(ItemIdHasStorage(ii));
+ * This assertion was introduced in PageIndexTupleDelete
+ * But if this function will be used from BRIN index
+ * this assertion will fail. Thus, here we do not check that
+ * item has the storage.
+ */
+ if (ItemIdGetOffset(ii) <= offset)
+ ii->lp_off += size_diff;
+ }
+ }
That comment seems utterly wrong to me, because both PageIndexTupleDelete
and PageIndexMultiDelete certainly contain assertions that every item on
the page has storage. Are you expecting that any BRIN index items
wouldn't? If they wouldn't, is adjusting their lp_off as if they did
have storage sensible?
regards, tom lane
Tom Lane wrote:
Hey Alvaro, can you comment on this bit in the proposed patch?
+ for (i = FirstOffsetNumber; i <= itemcount; i++)
+ {
+ ItemId ii = PageGetItemId(phdr, i);
+
+ /* Normally here was following assertion
+ * Assert(ItemIdHasStorage(ii));
+ * This assertion was introduced in PageIndexTupleDelete
+ * But if this function will be used from BRIN index
+ * this assertion will fail. Thus, here we do not check that
+ * item has the storage.
+ */
+ if (ItemIdGetOffset(ii) <= offset)
+ ii->lp_off += size_diff;
+ }
+ }
That comment seems utterly wrong to me, because both PageIndexTupleDelete
and PageIndexMultiDelete certainly contain assertions that every item on
the page has storage. Are you expecting that any BRIN index items
wouldn't? If they wouldn't, is adjusting their lp_off as if they did
have storage sensible?
It is possible in BRIN to have empty intermediate tuples; for example it
is possible for lp 1 and 3 to contain index tuples, while lp 2 does not.
This is a tricky case to reproduce, but I think this should do it:
consider a table with pages_per_range=1. Page 1 is summarized by index
tuple 1, page 2 is summarized by itup 2, page 3 is summarized by index
tuple 3. On heap page 2 a new tuple is inserted and the summary data is
now much larger, such that the summary tuple no longer fits in the index
page, so it is moved to a new index page. Then index tuple 2 in the
first index page becomes unused. Revmap still points to lp 3, so it
can't be moved to lp 2. The lp_offs can be adjusted, as long as the lp
itself is not moved from its original position.
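(For illustration only, a minimal sketch of reading such a page; this is
not from either patch, just the stock bufpage.h/itemid.h macros:)

OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
OffsetNumber off;

for (off = FirstOffsetNumber; off <= maxoff; off++)
{
    ItemId      lp = PageGetItemId(page, off);

    if (!ItemIdHasStorage(lp))
        continue;               /* e.g. lp 2 in the scenario above */

    /* lp 1 and lp 3 get here; revmap still addresses them by offnum */
}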
Now if this loop is concerned only with live lps and does not move lps,
then it should be fine to add the assertion.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
Tom Lane wrote:
That comment seems utterly wrong to me, because both PageIndexTupleDelete
and PageIndexMultiDelete certainly contain assertions that every item on
the page has storage. Are you expecting that any BRIN index items
wouldn't? If they wouldn't, is adjusting their lp_off as if they did
have storage sensible?
It is possible in BRIN to have empty intermediate tuples; for example it
is possible for lp 1 and 3 to contain index tuples, while lp 2 does not.
Hm. So apparently, the only reason this stuff works at all is that
BRIN isn't using either PageIndexTupleDelete or PageIndexMultiDelete.
Now if this loop is concerned only with live lps and does not move lps,
then it should be fine to add the assertion.
No, it iterates over all lps on the page. I'm inclined to think it
should be written like
if (ItemIdHasStorage(ii) && ItemIdGetOffset(ii) <= offset)
ii->lp_off += size_diff;
because futzing with the lp_off field in an item that isn't really
pointing at storage feels wrong. We might need to do that to
PageIndexTupleDelete and/or PageIndexMultiDelete someday, too.
I notice that PageIndexDeleteNoCompact, which seems to express what
BRIN is expecting in a rather underdocumented way, forces any
items without storage into "unused" state. I don't really think
it's bufpage.c's place to do that, though. Do you think that code
is actually supposed to fire, or is it just there for lack of a
better idea?
regards, tom lane
Tom Lane wrote:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
Tom Lane wrote:
That comment seems utterly wrong to me, because both PageIndexTupleDelete
and PageIndexMultiDelete certainly contain assertions that every item on
the page has storage. Are you expecting that any BRIN index items
wouldn't? If they wouldn't, is adjusting their lp_off as if they did
have storage sensible?
It is possible in BRIN to have empty intermediate tuples; for example it
is possible for lp 1 and 3 to contain index tuples, while lp 2 does not.
Hm. So apparently, the only reason this stuff works at all is that
BRIN isn't using either PageIndexTupleDelete or PageIndexMultiDelete.
Yes, this is why the NoCompact variant was introduced in the first
place.
Now if this loop is concerned only with live lps and does not move lps,
then it should be fine to add the assertion.
No, it iterates over all lps on the page. I'm inclined to think it
should be written like
if (ItemIdHasStorage(ii) && ItemIdGetOffset(ii) <= offset)
ii->lp_off += size_diff;
because futzing with the lp_off field in an item that isn't really
pointing at storage feels wrong. We might need to do that to
PageIndexTupleDelete and/or PageIndexMultiDelete someday, too.
I suppose it is conceivable that we start using lp_off for other
purposes in the future, so I don't disagree. I don't think index pages
currently do any funny business with it.
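For reference, lp_off is the 15-bit offset field of the line pointer
struct, as defined in src/include/storage/itemid.h:

typedef struct ItemIdData
{
	unsigned	lp_off:15,		/* offset to tuple (from start of page) */
				lp_flags:2,		/* state of item pointer, see below */
				lp_len:15;		/* byte length of tuple */
} ItemIdData;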
I notice that PageIndexDeleteNoCompact, which seems to express what
BRIN is expecting in a rather underdocumented way, forces any
items without storage into "unused" state. I don't really think
it's bufpage.c's place to do that, though. Do you think that code
is actually supposed to fire, or is it just there for lack of a
better idea?
I just put it there only because I didn't see any reason not to, really.
I don't think BRIN relies on it.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
That storage assertion fired during a plain "update table set x = random()" without any conditions. Also, make check fails without it (for BRIN usage; GiST is OK with it).
Cannot quote properly right now; I'm on a phone for a while.
Regards, Andrey Borodin.
Andrey Borodin <borodin@octonica.com> writes:
That storage assertion fired during a plain "update table set x = random()" without any conditions. Also, make check fails without it (for BRIN usage; GiST is OK with it).
I'm confused by that assertion, because the patch-as-submitted didn't
touch BRIN. Based on Alvaro's comments, yes it would fail if we'd
modified BRIN to use this function ... but we hadn't yet.
regards, tom lane
Here's a draft patch that replaces BRIN's use of PageIndexDeleteNoCompact
and an immediately following PageAddItemExtended with
PageIndexTupleOverwrite. It turns out that there are two use-cases in
BRIN for PageIndexDeleteNoCompact, and the other one can't be replaced:
while there is a PageAddItem call beside it, that call operates on a
different page. But this is still enough to let us get rid of
PAI_ALLOW_FAR_OFFSET, which seems like a good thing.
Actually, with this patch, PageAddItemExtended isn't directly referenced
anywhere, so we could revert that API and fold it back into PageAddItem,
saving a nanosecond or two per call. I'm inclined to leave that alone,
though, since I agree with the underlying thought that any future
extensions to PageAddItem's functionality would be better handled by more
bits in a flags bitmask. Maybe a better idea is to get rid of the call
overhead by turning PageAddItem into a macro. In any case, doing
something about that could be a separate patch.
It's also kind of tempting to rewrite PageIndexDeleteNoCompact into a
function that only deletes one tuple, and presumably is simpler and faster
than what's there. But maybe there's something coming down the pike that
needs the extra generality? That should also be a separate patch, anyway.
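(For concreteness, a sketch of what that narrower routine's declaration
might look like; the name and signature are hypothetical, nothing in
this thread defines them:)

/* hypothetical single-tuple variant, sketched for discussion only */
extern void PageIndexTupleDeleteNoCompact(Page page, OffsetNumber offnum);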
This passes make check-world but I'm not sure how heavily that stresses
BRIN ... are there any other tests you'd want to see run before
committing?
regards, tom lane
Attachments:
use-PageIndexTupleOverwrite-in-BRIN.patchtext/x-diff; charset=us-ascii; name=use-PageIndexTupleOverwrite-in-BRIN.patchDownload
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 6ebfedd..24ae38b 100644
*** a/src/backend/access/brin/brin_pageops.c
--- b/src/backend/access/brin/brin_pageops.c
*************** brin_doupdate(Relation idxrel, BlockNumb
*** 178,187 ****
}
START_CRIT_SECTION();
! PageIndexDeleteNoCompact(oldpage, &oldoff, 1);
! if (PageAddItemExtended(oldpage, (Item) newtup, newsz, oldoff,
! PAI_OVERWRITE | PAI_ALLOW_FAR_OFFSET) == InvalidOffsetNumber)
! elog(ERROR, "failed to add BRIN tuple");
MarkBufferDirty(oldbuf);
/* XLOG stuff */
--- 178,185 ----
}
START_CRIT_SECTION();
! if (!PageIndexTupleOverwrite(oldpage, oldoff, (Item) newtup, newsz))
! elog(ERROR, "failed to replace BRIN tuple");
MarkBufferDirty(oldbuf);
/* XLOG stuff */
diff --git a/src/backend/access/brin/brin_xlog.c b/src/backend/access/brin/brin_xlog.c
index 27ba0a9..ed14729 100644
*** a/src/backend/access/brin/brin_xlog.c
--- b/src/backend/access/brin/brin_xlog.c
*************** brin_xlog_samepage_update(XLogReaderStat
*** 189,202 ****
page = (Page) BufferGetPage(buffer);
offnum = xlrec->offnum;
- if (PageGetMaxOffsetNumber(page) + 1 < offnum)
- elog(PANIC, "brin_xlog_samepage_update: invalid max offset number");
! PageIndexDeleteNoCompact(page, &offnum, 1);
! offnum = PageAddItemExtended(page, (Item) brintuple, tuplen, offnum,
! PAI_OVERWRITE | PAI_ALLOW_FAR_OFFSET);
! if (offnum == InvalidOffsetNumber)
! elog(PANIC, "brin_xlog_samepage_update: failed to add tuple");
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
--- 189,197 ----
page = (Page) BufferGetPage(buffer);
offnum = xlrec->offnum;
! if (!PageIndexTupleOverwrite(page, offnum, (Item) brintuple, tuplen))
! elog(PANIC, "brin_xlog_samepage_update: failed to replace tuple");
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index bce0d53..08e222e 100644
*** a/src/backend/storage/page/bufpage.c
--- b/src/backend/storage/page/bufpage.c
*************** PageIsVerified(Page page, BlockNumber bl
*** 166,186 ****
* inserted, or InvalidOffsetNumber if the item is not inserted for any
* reason. A WARNING is issued indicating the reason for the refusal.
*
! * If flag PAI_OVERWRITE is set, we just store the item at the specified
! * offsetNumber (which must be either a currently-unused item pointer,
! * or one past the last existing item). Otherwise,
! * if offsetNumber is valid and <= current max offset in the page,
! * insert item into the array at that position by shuffling ItemId's
! * down to make room.
! * If offsetNumber is not valid, then assign one by finding the first
* one that is both unused and deallocated.
*
* If flag PAI_IS_HEAP is set, we enforce that there can't be more than
* MaxHeapTuplesPerPage line pointers on the page.
*
- * If flag PAI_ALLOW_FAR_OFFSET is not set, we disallow placing items
- * beyond one past the last existing item.
- *
* !!! EREPORT(ERROR) IS DISALLOWED HERE !!!
*/
OffsetNumber
--- 166,189 ----
* inserted, or InvalidOffsetNumber if the item is not inserted for any
* reason. A WARNING is issued indicating the reason for the refusal.
*
! * offsetNumber must be either InvalidOffsetNumber to specify finding a
! * free item pointer, or a value between FirstOffsetNumber and one past
! * the last existing item, to specify using that particular item pointer.
! *
! * If offsetNumber is valid and flag PAI_OVERWRITE is set, we just store
! * the item at the specified offsetNumber, which must be either a
! * currently-unused item pointer, or one past the last existing item.
! *
! * If offsetNumber is valid and flag PAI_OVERWRITE is not set, insert
! * the item at the specified offsetNumber, moving existing items later
! * in the array to make room.
! *
! * If offsetNumber is not valid, then assign a slot by finding the first
* one that is both unused and deallocated.
*
* If flag PAI_IS_HEAP is set, we enforce that there can't be more than
* MaxHeapTuplesPerPage line pointers on the page.
*
* !!! EREPORT(ERROR) IS DISALLOWED HERE !!!
*/
OffsetNumber
*************** PageAddItemExtended(Page page,
*** 267,277 ****
}
}
! /*
! * Reject placing items beyond the first unused line pointer, unless
! * caller asked for that behavior specifically.
! */
! if ((flags & PAI_ALLOW_FAR_OFFSET) == 0 && offsetNumber > limit)
{
elog(WARNING, "specified item offset is too large");
return InvalidOffsetNumber;
--- 270,277 ----
}
}
! /* Reject placing items beyond the first unused line pointer */
! if (offsetNumber > limit)
{
elog(WARNING, "specified item offset is too large");
return InvalidOffsetNumber;
*************** PageAddItemExtended(Page page,
*** 290,299 ****
* Note: do arithmetic as signed ints, to avoid mistakes if, say,
* alignedSize > pd_upper.
*/
! if ((flags & PAI_ALLOW_FAR_OFFSET) != 0)
! lower = Max(phdr->pd_lower,
! SizeOfPageHeaderData + sizeof(ItemIdData) * offsetNumber);
! else if (offsetNumber == limit || needshuffle)
lower = phdr->pd_lower + sizeof(ItemIdData);
else
lower = phdr->pd_lower;
--- 290,296 ----
* Note: do arithmetic as signed ints, to avoid mistakes if, say,
* alignedSize > pd_upper.
*/
! if (offsetNumber == limit || needshuffle)
lower = phdr->pd_lower + sizeof(ItemIdData);
else
lower = phdr->pd_lower;
*************** PageIndexDeleteNoCompact(Page page, Offs
*** 1093,1098 ****
--- 1090,1202 ----
}
}
+
+ /*
+ * PageIndexTupleOverwrite
+ *
+ * Replace a specified tuple on an index page.
+ *
+ * The new tuple is placed exactly where the old one had been, shifting
+ * other tuples' data up or down as needed to keep the page compacted.
+ * This is better than deleting and reinserting the tuple, because it
+ * avoids any data shifting when the tuple size doesn't change; and
+ * even when it does, we avoid moving the item pointers around.
+ * Conceivably this could also be of use to an index AM that cares about
+ * the physical order of tuples as well as their ItemId order.
+ *
+ * If there's insufficient space for the new tuple, return false. Other
+ * errors represent data-corruption problems, so we just elog.
+ */
+ bool
+ PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
+ Item newtup, Size newsize)
+ {
+ PageHeader phdr = (PageHeader) page;
+ ItemId tupid;
+ int oldsize;
+ unsigned offset;
+ Size alignednewsize;
+ int size_diff;
+ int itemcount;
+
+ /*
+ * As with PageRepairFragmentation, paranoia seems justified.
+ */
+ if (phdr->pd_lower < SizeOfPageHeaderData ||
+ phdr->pd_lower > phdr->pd_upper ||
+ phdr->pd_upper > phdr->pd_special ||
+ phdr->pd_special > BLCKSZ ||
+ phdr->pd_special != MAXALIGN(phdr->pd_special))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted page pointers: lower = %u, upper = %u, special = %u",
+ phdr->pd_lower, phdr->pd_upper, phdr->pd_special)));
+
+ itemcount = PageGetMaxOffsetNumber(page);
+ if ((int) offnum <= 0 || (int) offnum > itemcount)
+ elog(ERROR, "invalid index offnum: %u", offnum);
+
+ tupid = PageGetItemId(page, offnum);
+ Assert(ItemIdHasStorage(tupid));
+ oldsize = ItemIdGetLength(tupid);
+ offset = ItemIdGetOffset(tupid);
+
+ if (offset < phdr->pd_upper || (offset + oldsize) > phdr->pd_special ||
+ offset != MAXALIGN(offset))
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("corrupted item pointer: offset = %u, size = %u",
+ offset, (unsigned int) oldsize)));
+
+ /*
+ * Determine actual change in space requirement, check for page overflow.
+ */
+ oldsize = MAXALIGN(oldsize);
+ alignednewsize = MAXALIGN(newsize);
+ if (alignednewsize > oldsize + (phdr->pd_upper - phdr->pd_lower))
+ return false;
+
+ /*
+ * Relocate existing data and update line pointers, unless the new tuple
+ * is the same size as the old (after alignment), in which case there's
+ * nothing to do. Notice that what we have to relocate is data before the
+ * target tuple, not data after, so it's convenient to express size_diff
+ * as the amount by which the tuple's size is decreasing, making it the
+ * delta to add to pd_upper and affected line pointers.
+ */
+ size_diff = oldsize - (int) alignednewsize;
+ if (size_diff != 0)
+ {
+ char *addr = (char *) page + phdr->pd_upper;
+ int i;
+
+ /* relocate all tuple data before the target tuple */
+ memmove(addr + size_diff, addr, offset - phdr->pd_upper);
+
+ /* adjust free space boundary pointer */
+ phdr->pd_upper += size_diff;
+
+ /* adjust affected line pointers too */
+ for (i = FirstOffsetNumber; i <= itemcount; i++)
+ {
+ ItemId ii = PageGetItemId(phdr, i);
+
+ /* Allow items without storage; currently only BRIN needs that */
+ if (ItemIdHasStorage(ii) && ItemIdGetOffset(ii) <= offset)
+ ii->lp_off += size_diff;
+ }
+ }
+
+ /* Update the item's tuple length (other fields shouldn't change) */
+ ItemIdSetNormal(tupid, offset + size_diff, newsize);
+
+ /* Copy new tuple data onto page */
+ memcpy(PageGetItem(page, tupid), newtup, newsize);
+
+ return true;
+ }
+
+
/*
* Set checksum for a page in shared buffers.
*
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 15cebfc..0ea47f5 100644
*** a/src/include/storage/bufpage.h
--- b/src/include/storage/bufpage.h
*************** do { \
*** 409,415 ****
*/
#define PAI_OVERWRITE (1 << 0)
#define PAI_IS_HEAP (1 << 1)
- #define PAI_ALLOW_FAR_OFFSET (1 << 2)
extern void PageInit(Page page, Size pageSize, Size specialSize);
extern bool PageIsVerified(Page page, BlockNumber blkno);
--- 409,414 ----
*************** extern void PageIndexTupleDelete(Page pa
*** 429,434 ****
--- 428,435 ----
extern void PageIndexMultiDelete(Page page, OffsetNumber *itemnos, int nitems);
extern void PageIndexDeleteNoCompact(Page page, OffsetNumber *itemnos,
int nitems);
+ extern bool PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
+ Item newtup, Size newsize);
extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
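As a usage illustration, here is a hypothetical caller sketch. It is not
part of the attached patch; the variable names (page, itup, ntup,
oldoffnum) are borrowed from the original gistplacetopage proposal:

/*
 * Sketch only: replace a single tuple in place when possible, else
 * fall back to the old delete-then-append path.
 */
if (OffsetNumberIsValid(oldoffnum) && ntup == 1)
{
    if (!PageIndexTupleOverwrite(page, oldoffnum, (Item) itup[0],
                                 IndexTupleSize(itup[0])))
        elog(ERROR, "failed to overwrite index tuple");
}
else
{
    if (OffsetNumberIsValid(oldoffnum))
        PageIndexTupleDelete(page, oldoffnum);
    gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
}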
... BTW, a quick grep for other calls of PageIndexTupleDelete suggests
that SPGiST could make very good use of PageIndexTupleOverwrite, and
there may be use for it in GIN too.
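The pattern such a grep turns up looks roughly like this in generic form
(hypothetical variable names; not quoted from SPGiST or GIN):

/* before: delete, then re-add at the same offset */
PageIndexTupleDelete(page, offnum);
if (PageAddItem(page, (Item) newtup, newsize, offnum,
                false, false) != offnum)
    elog(ERROR, "failed to add tuple");

/* after: a single call that avoids needless data movement */
if (!PageIndexTupleOverwrite(page, offnum, (Item) newtup, newsize))
    elog(ERROR, "failed to overwrite tuple");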
I see one place in btree where there's a PageIndexTupleDelete and then
an insert, but it's unlikely to be worth touching because the page is
expected to be empty after the PageIndexTupleDelete; so there's no
data movement we could avoid in that case.
I don't plan to investigate these cases myself, but somebody should.
regards, tom lane
I wrote:
... BTW, a quick grep for other calls of PageIndexTupleDelete suggests
that SPGiST could make very good use of PageIndexTupleOverwrite, and
there may be use for it in GIN too.
I've pushed this patch with some editorializing and some followup
work. The above remains as possible material for a future patch,
but I'm going to mark this commitfest entry as closed.
regards, tom lane