[PATCH] Microvacuum for gist.
Hi,
I have written microvacuum support for the gist access method.
Briefly, microvacuum consists of two steps:
1. When a search finds that a tuple is invisible to all transactions, it is
marked LP_DEAD and the page is marked as "has dead tuples".
2. Then, when an insert touches a full page that has dead tuples, it calls
microvacuum instead of splitting the page.
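In code terms, step 2 is a small hook in the insert path; here is a condensed
sketch of the corresponding hunk from the attached patch:

    is_split = gistnospace(page, itup, ntup, oldoffnum, freespace);

    /* Leaf page is full: first try to reclaim LP_DEAD tuples, then recheck. */
    if (is_split && GistPageIsLeaf(page) && GistPageHasGarbage(page))
    {
        gistvacuumpage(rel, page, buffer);
        is_split = gistnospace(page, itup, ntup, oldoffnum, freespace);
    }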
You can find a kind of review here [1].
[1] http://www.google-melange.com/gsoc/proposal/public/google/gsoc2015/ivanitskiy_ilya/5629499534213120
Patch is attached. Please review it.
--
Best regards,
Lubennikova Anastasia
Attachments:
microvacuum_for_gist.patch (application/octet-stream)
*** a/src/backend/access/gist/gist.c
--- b/src/backend/access/gist/gist.c
***************
*** 36,42 **** static bool gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
bool unlockbuf, bool unlockleftchild);
static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
GISTSTATE *giststate, List *splitinfo, bool releasebuf);
!
#define ROTATEDIST(d) do { \
SplitedPageLayout *tmp=(SplitedPageLayout*)palloc(sizeof(SplitedPageLayout)); \
--- 36,42 ----
bool unlockbuf, bool unlockleftchild);
static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
GISTSTATE *giststate, List *splitinfo, bool releasebuf);
! static void gistvacuumpage(Relation rel, Page page, Buffer buffer);
#define ROTATEDIST(d) do { \
SplitedPageLayout *tmp=(SplitedPageLayout*)palloc(sizeof(SplitedPageLayout)); \
***************
*** 209,214 **** gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
--- 209,223 ----
* because the tuple vector passed to gistSplit won't include this tuple.
*/
is_split = gistnospace(page, itup, ntup, oldoffnum, freespace);
+
+ /* If leaf page is full, try at first to delete dead tuples.
+ * And then check again.
+ */
+ if ((is_split)&& GistPageIsLeaf(page) && GistPageHasGarbage(page)) {
+ gistvacuumpage(rel, page, buffer);
+ is_split = gistnospace(page, itup, ntup, oldoffnum, freespace);
+ }
+
if (is_split)
{
/* no space for insertion */
***************
*** 1440,1442 **** freeGISTstate(GISTSTATE *giststate)
--- 1449,1517 ----
/* It's sufficient to delete the scanCxt */
MemoryContextDelete(giststate->scanCxt);
}
+
+ static void
+ gistvacuumpage(Relation rel, Page page, Buffer buffer) {
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+
+ /*
+ * Scan over all items to see which ones need to be deleted according to
+ * LP_DEAD flags.
+ */
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (offnum = FirstOffsetNumber;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemId = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemId)) {
+ elog(NOTICE, "gistvacuumpage. Dead offnum %hd, blkno %d ndeletable %d", offnum, BufferGetBlockNumber(buffer), ndeletable);
+ deletable[ndeletable++] = offnum;
+ }
+ }
+
+ if (ndeletable > 0) {
+ START_CRIT_SECTION();
+
+ PageIndexMultiDelete(page, deletable, ndeletable);
+
+ /*
+ * Mark the page as not containing any LP_DEAD items. This is not
+ * certainly true (there might be some that have recently been marked, but
+ * weren't included in our target-item list), but it will almost always be
+ * true and it doesn't seem worth an additional page scan to check it.
+ * Remember that F_HAS_GARBAGE is only a hint anyway.
+ */
+ GistClearPageHasGarbage(page);
+
+ MarkBufferDirty(buffer);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+
+ recptr = gistXLogUpdate(rel->rd_node, buffer,
+ deletable, ndeletable,
+ NULL, 0, InvalidBuffer);
+
+ PageSetLSN(page, recptr);
+ }
+ else
+ PageSetLSN(page, gistGetFakeLSN(rel));
+
+ END_CRIT_SECTION();
+ }
+
+ /*
+ * Note: if we didn't find any LP_DEAD items, then the page's
+ * F_HAS_GARBAGE hint bit is falsely set. We do not bother expending a
+ * separate write to clear it, however. We will clear it when we split
+ * the page.
+ */
+ }
*** a/src/backend/access/gist/gistget.c
--- b/src/backend/access/gist/gistget.c
***************
*** 25,30 ****
--- 25,90 ----
#include "utils/rel.h"
+ static void
+ gistkillitems(IndexScanDesc scan) {
+ GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+ Buffer buffer;
+ Page page;
+ OffsetNumber minoff;
+ OffsetNumber maxoff;
+ int i;
+ bool killedsomething = false;
+
+ Assert(so->curBlkno != InvalidBlockNumber);
+
+ buffer = ReadBuffer(scan->indexRelation, so->curBlkno);
+ if (!BufferIsValid(buffer))
+ return;
+
+ LockBuffer(buffer, GIST_SHARE);
+ gistcheckpage(scan->indexRelation, buffer);
+ page = BufferGetPage(buffer);
+
+ minoff = FirstOffsetNumber;
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (i = 0; i < so->numKilled; i++)
+ {
+ if (so->killedItems != NULL) {
+ OffsetNumber offnum = FirstOffsetNumber;
+ while (offnum <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
+ if (ItemPointerEquals(&ituple->t_tid, &(so->killedItems[i])))
+ {
+ elog(NOTICE, "gistkillitems. Mark Item Dead offnum %hd, blkno %d", offnum, BufferGetBlockNumber(buffer));
+ /* found the item */
+ ItemIdMarkDead(iid);
+ killedsomething = true;
+ break; /* out of inner search loop */
+ }
+ offnum = OffsetNumberNext(offnum);
+ }
+ }
+ }
+
+ if (killedsomething)
+ {
+ GistMarkPageHasGarbage(page);
+ MarkBufferDirtyHint(buffer, true);
+ }
+
+ UnlockReleaseBuffer(buffer);
+
+ /*
+ * Always reset the scan state, so we don't look for same items on other
+ * pages.
+ */
+ so->numKilled = 0;
+ }
+
/*
* gistindex_keytest() -- does this index tuple satisfy the scan key(s)?
*
***************
*** 331,337 **** gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances,
/* Ignore tuple if it doesn't match */
if (!match)
continue;
-
if (tbm && GistPageIsLeaf(page))
{
/*
--- 391,396 ----
***************
*** 451,456 **** getNextNearest(IndexScanDesc scan)
--- 510,527 ----
if (scan->xs_itup)
{
+ /* If previously returned index tuple is not visible
+ * save it into so->killedItems. And at the end of the page scan mark all saved tuples as dead.
+ */
+ if (scan->kill_prior_tuple) {
+ if(so->killedItems == NULL) {
+ MemoryContext oldCxt2 = MemoryContextSwitchTo(so->giststate->scanCxt);
+ so->killedItems = (ItemPointerData*) palloc(MaxIndexTuplesPerPage*sizeof(ItemPointerData));
+ MemoryContextSwitchTo(oldCxt2);
+ }
+ if ((so->numKilled < MaxIndexTuplesPerPage))
+ so->killedItems[so->numKilled++] = scan->xs_ctup.t_self;
+ }
/* free previously returned tuple */
pfree(scan->xs_itup);
scan->xs_itup = NULL;
***************
*** 515,520 **** getNextNearest(IndexScanDesc scan)
--- 586,597 ----
}
else
{
+ if ((so->curBlkno != InvalidBlockNumber)&&(so->numKilled > 0))
+ gistkillitems(scan);
+
+ /* save current item BlockNumber for next gistkillitems() call */
+ so->curBlkno = item->blkno;
+
/* visit an index page, extract its items into queue */
CHECK_FOR_INTERRUPTS();
***************
*** 572,578 **** gistgettuple(PG_FUNCTION_ARGS)
--- 649,664 ----
{
if (so->curPageData < so->nPageData)
{
+ if ((scan->kill_prior_tuple)&&(so->curPageData > 0)) {
+ if(so->killedItems == NULL) {
+ MemoryContext oldCxt = MemoryContextSwitchTo(so->giststate->scanCxt);
+ so->killedItems = (ItemPointerData*) palloc(MaxIndexTuplesPerPage*sizeof(ItemPointerData));
+ MemoryContextSwitchTo(oldCxt);
+ }
+ if (so->numKilled < MaxIndexTuplesPerPage)
+ so->killedItems[so->numKilled++] = so->pageData[so->curPageData-1].heapPtr;
+ }
/* continuing to return tuples from a leaf page */
scan->xs_ctup.t_self = so->pageData[so->curPageData].heapPtr;
scan->xs_recheck = so->pageData[so->curPageData].recheck;
***************
*** 585,594 **** gistgettuple(PG_FUNCTION_ARGS)
--- 671,693 ----
PG_RETURN_BOOL(true);
}
+ /* Check the last returned tuple and add it to killitems if necessary */
+ if ((scan->kill_prior_tuple)&&(so->curPageData > 0)&&(so->curPageData == so->nPageData)) {
+ if(so->killedItems == NULL) {
+ MemoryContext oldCxt = MemoryContextSwitchTo(so->giststate->scanCxt);
+ so->killedItems = (ItemPointerData*) palloc(MaxIndexTuplesPerPage*sizeof(ItemPointerData));
+ MemoryContextSwitchTo(oldCxt);
+ }
+ if ((so->numKilled < MaxIndexTuplesPerPage))
+ so->killedItems[so->numKilled++] = so->pageData[so->curPageData-1].heapPtr;
+ }
/* find and process the next index page */
do
{
+ if ((so->curBlkno != InvalidBlockNumber)&&(so->numKilled > 0))
+ gistkillitems(scan);
+
GISTSearchItem *item = getNextGISTSearchItem(so);
if (!item)
***************
*** 596,601 **** gistgettuple(PG_FUNCTION_ARGS)
--- 695,703 ----
CHECK_FOR_INTERRUPTS();
+ /* save current item BlockNumber for next gistkillitems() call */
+ so->curBlkno = item->blkno;
+
/*
* While scanning a leaf page, ItemPointers of matching heap
* tuples are stored in so->pageData. If there are any on
*** a/src/backend/access/gist/gistscan.c
--- b/src/backend/access/gist/gistscan.c
***************
*** 93,98 **** gistbeginscan(PG_FUNCTION_ARGS)
--- 93,102 ----
memset(scan->xs_orderbynulls, true, sizeof(bool) * scan->numberOfOrderBys);
}
+ so->killedItems = NULL; /* until needed */
+ so->numKilled = 0;
+ so->curBlkno = InvalidBlockNumber;
+
scan->opaque = so;
/*
*** a/src/include/access/gist.h
--- b/src/include/access/gist.h
***************
*** 41,48 ****
*/
#define F_LEAF (1 << 0) /* leaf page */
#define F_DELETED (1 << 1) /* the page has been deleted */
! #define F_TUPLES_DELETED (1 << 2) /* some tuples on the page are dead */
#define F_FOLLOW_RIGHT (1 << 3) /* page to the right has no downlink */
typedef XLogRecPtr GistNSN;
--- 41,49 ----
*/
#define F_LEAF (1 << 0) /* leaf page */
#define F_DELETED (1 << 1) /* the page has been deleted */
! #define F_TUPLES_DELETED (1 << 2) /* some tuples on the page were deleted */
#define F_FOLLOW_RIGHT (1 << 3) /* page to the right has no downlink */
+ #define F_HAS_GARBAGE (1 << 4) /* some tuples on the page are dead, but not deleted yet */
typedef XLogRecPtr GistNSN;
***************
*** 137,142 **** typedef struct GISTENTRY
--- 138,147 ----
#define GistMarkTuplesDeleted(page) ( GistPageGetOpaque(page)->flags |= F_TUPLES_DELETED)
#define GistClearTuplesDeleted(page) ( GistPageGetOpaque(page)->flags &= ~F_TUPLES_DELETED)
+ #define GistPageHasGarbage(page) ( GistPageGetOpaque(page)->flags & F_HAS_GARBAGE)
+ #define GistMarkPageHasGarbage(page) ( GistPageGetOpaque(page)->flags |= F_HAS_GARBAGE)
+ #define GistClearPageHasGarbage(page) ( GistPageGetOpaque(page)->flags &= ~F_HAS_GARBAGE)
+
#define GistFollowRight(page) ( GistPageGetOpaque(page)->flags & F_FOLLOW_RIGHT)
#define GistMarkFollowRight(page) ( GistPageGetOpaque(page)->flags |= F_FOLLOW_RIGHT)
#define GistClearFollowRight(page) ( GistPageGetOpaque(page)->flags &= ~F_FOLLOW_RIGHT)
*** a/src/include/access/gist_private.h
--- b/src/include/access/gist_private.h
***************
*** 22,27 ****
--- 22,28 ----
#include "storage/bufmgr.h"
#include "storage/buffile.h"
#include "utils/hsearch.h"
+ #include "access/genam.h"
/*
* Maximum number of "halves" a page can be split into in one operation.
***************
*** 161,166 **** typedef struct GISTScanOpaqueData
--- 162,172 ----
/* pre-allocated workspace arrays */
double *distances; /* output area for gistindex_keytest */
+ /* info about killed items if any (killedItems is NULL if never used) */
+ ItemPointerData *killedItems; /* heap pointers of killed items */
+ int numKilled; /* number of currently stored items */
+ BlockNumber curBlkno; /* current number of block */
+
/* In a non-ordered search, returnable heap items are stored here: */
GISTSearchHeapItem pageData[BLCKSZ / sizeof(IndexTupleData)];
OffsetNumber nPageData; /* number of valid items in array */
Hi!
On Thu, Jul 30, 2015 at 2:51 PM, Anastasia Lubennikova
<lubennikovaav@gmail.com> wrote:
I have written microvacuum support for the gist access method.
Briefly, microvacuum consists of two steps:
1. When a search finds that a tuple is invisible to all transactions, it is
marked LP_DEAD and the page is marked as "has dead tuples".
2. Then, when an insert touches a full page that has dead tuples, it calls
microvacuum instead of splitting the page.
You can find a kind of review here [1].
[1] http://www.google-melange.com/gsoc/proposal/public/google/gsoc2015/ivanitskiy_ilya/5629499534213120
Patch is attached. Please review it.
Nice!
Some notes about this patch.
1) Could you give some test case demonstrating that microvacuum really works
with the patch? Ultimately, we should see the index growing less with
microvacuum.
2) Generating notices for every dead tuple would be too noisy. I suggest
replacing the notice with one of the debug levels.
+ elog(NOTICE, "gistkillitems. Mark Item Dead offnum %hd, blkno %d",
offnum, BufferGetBlockNumber(buffer));
3) Please recheck the coding style. For instance, this line needs more spaces
and the open brace should be on the next line.
+ if ((scan->kill_prior_tuple)&&(so->curPageData > 0)&&(so->curPageData ==
so->nPageData)) {
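In PostgreSQL style that condition would be laid out roughly like this (this
is how the later versions of the patch format it):

    if (scan->kill_prior_tuple
        && so->curPageData > 0
        && so->curPageData == so->nPageData)
    {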
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 7/30/15 7:33 AM, Alexander Korotkov wrote:
2) Generating notices for every dead tuple would be too noisy. I suggest
replacing the notice with one of the debug levels.
+ elog(NOTICE, "gistkillitems. Mark Item Dead offnum %hd, blkno %d",
offnum, BufferGetBlockNumber(buffer));
Even that seems like pretty serious overkill. vacuumlazy.c doesn't have
anything like that, and I don't think the BTree code does either. If you
were debugging something and actually needed it I'd say drop in a
temporary printf().
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com
On 30.07.2015 16:33, Alexander Korotkov wrote:
Hi!
On Thu, Jul 30, 2015 at 2:51 PM, Anastasia Lubennikova
<lubennikovaav@gmail.com> wrote:
I have written microvacuum support for the gist access method.
Briefly, microvacuum consists of two steps:
1. When a search finds that a tuple is invisible to all transactions, it is
marked LP_DEAD and the page is marked as "has dead tuples".
2. Then, when an insert touches a full page that has dead tuples, it calls
microvacuum instead of splitting the page.
You can find a kind of review here [1].
[1] http://www.google-melange.com/gsoc/proposal/public/google/gsoc2015/ivanitskiy_ilya/5629499534213120
Patch is attached. Please review it.
Nice!
Some notes about this patch.
1) Could you give some test case demonstrating that microvacuum really works
with the patch? Ultimately, we should see the index growing less with
microvacuum.
2) Generating notices for every dead tuple would be too noisy. I suggest
replacing the notice with one of the debug levels.
+ elog(NOTICE, "gistkillitems. Mark Item Dead offnum %hd, blkno %d",
offnum, BufferGetBlockNumber(buffer));
3) Please recheck the coding style. For instance, this line needs more spaces
and the open brace should be on the next line.
+ if ((scan->kill_prior_tuple)&&(so->curPageData > 0)&&(so->curPageData == so->nPageData)) {
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
1) Test and results are in attachments. Everything seems to work as
expected.
2) I dropped these notices. It was done only for debug purposes. Updated
patch is attached.
3) fixed
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
microvacuum_for_gist_2.patch (text/plain; charset=windows-1251)
*** a/src/backend/access/gist/gist.c
--- b/src/backend/access/gist/gist.c
***************
*** 36,42 **** static bool gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
bool unlockbuf, bool unlockleftchild);
static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
GISTSTATE *giststate, List *splitinfo, bool releasebuf);
!
#define ROTATEDIST(d) do { \
SplitedPageLayout *tmp=(SplitedPageLayout*)palloc(sizeof(SplitedPageLayout)); \
--- 36,42 ----
bool unlockbuf, bool unlockleftchild);
static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
GISTSTATE *giststate, List *splitinfo, bool releasebuf);
! static void gistvacuumpage(Relation rel, Page page, Buffer buffer);
#define ROTATEDIST(d) do { \
SplitedPageLayout *tmp=(SplitedPageLayout*)palloc(sizeof(SplitedPageLayout)); \
***************
*** 209,214 **** gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
--- 209,225 ----
* because the tuple vector passed to gistSplit won't include this tuple.
*/
is_split = gistnospace(page, itup, ntup, oldoffnum, freespace);
+
+ /*
+ * If leaf page is full, try at first to delete dead tuples. And then
+ * check again.
+ */
+ if ((is_split) && GistPageIsLeaf(page) && GistPageHasGarbage(page))
+ {
+ gistvacuumpage(rel, page, buffer);
+ is_split = gistnospace(page, itup, ntup, oldoffnum, freespace);
+ }
+
if (is_split)
{
/* no space for insertion */
***************
*** 1440,1442 **** freeGISTstate(GISTSTATE *giststate)
--- 1451,1519 ----
/* It's sufficient to delete the scanCxt */
MemoryContextDelete(giststate->scanCxt);
}
+
+ static void
+ gistvacuumpage(Relation rel, Page page, Buffer buffer)
+ {
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+
+ /*
+ * Scan over all items to see which ones need to be deleted according to
+ * LP_DEAD flags.
+ */
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (offnum = FirstOffsetNumber;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemId = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemId))
+ deletable[ndeletable++] = offnum;
+ }
+
+ if (ndeletable > 0)
+ {
+ START_CRIT_SECTION();
+
+ PageIndexMultiDelete(page, deletable, ndeletable);
+
+ /*
+ * Mark the page as not containing any LP_DEAD items. This is not
+ * certainly true (there might be some that have recently been marked,
+ * but weren't included in our target-item list), but it will almost
+ * always be true and it doesn't seem worth an additional page scan to
+ * check it. Remember that F_HAS_GARBAGE is only a hint anyway.
+ */
+ GistClearPageHasGarbage(page);
+
+ MarkBufferDirty(buffer);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+
+ recptr = gistXLogUpdate(rel->rd_node, buffer,
+ deletable, ndeletable,
+ NULL, 0, InvalidBuffer);
+
+ PageSetLSN(page, recptr);
+ }
+ else
+ PageSetLSN(page, gistGetFakeLSN(rel));
+
+ END_CRIT_SECTION();
+ }
+
+ /*
+ * Note: if we didn't find any LP_DEAD items, then the page's
+ * F_HAS_GARBAGE hint bit is falsely set. We do not bother expending a
+ * separate write to clear it, however. We will clear it when we split
+ * the page.
+ */
+ }
*** a/src/backend/access/gist/gistget.c
--- b/src/backend/access/gist/gistget.c
***************
*** 25,30 ****
--- 25,93 ----
#include "utils/rel.h"
+ static void
+ gistkillitems(IndexScanDesc scan)
+ {
+ GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+ Buffer buffer;
+ Page page;
+ OffsetNumber minoff;
+ OffsetNumber maxoff;
+ int i;
+ bool killedsomething = false;
+
+ Assert(so->curBlkno != InvalidBlockNumber);
+
+ buffer = ReadBuffer(scan->indexRelation, so->curBlkno);
+ if (!BufferIsValid(buffer))
+ return;
+
+ LockBuffer(buffer, GIST_SHARE);
+ gistcheckpage(scan->indexRelation, buffer);
+ page = BufferGetPage(buffer);
+
+ minoff = FirstOffsetNumber;
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (i = 0; i < so->numKilled; i++)
+ {
+ if (so->killedItems != NULL)
+ {
+ OffsetNumber offnum = FirstOffsetNumber;
+
+ while (offnum <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
+
+ if (ItemPointerEquals(&ituple->t_tid, &(so->killedItems[i])))
+ {
+ /* found the item */
+ ItemIdMarkDead(iid);
+ killedsomething = true;
+ break; /* out of inner search loop */
+ }
+ offnum = OffsetNumberNext(offnum);
+ }
+ }
+ }
+
+ if (killedsomething)
+ {
+ GistMarkPageHasGarbage(page);
+ MarkBufferDirtyHint(buffer, true);
+ }
+
+ UnlockReleaseBuffer(buffer);
+
+ /*
+ * Always reset the scan state, so we don't look for same items on other
+ * pages.
+ */
+ so->numKilled = 0;
+ }
+
/*
* gistindex_keytest() -- does this index tuple satisfy the scan key(s)?
*
***************
*** 331,337 **** gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances,
/* Ignore tuple if it doesn't match */
if (!match)
continue;
-
if (tbm && GistPageIsLeaf(page))
{
/*
--- 394,399 ----
***************
*** 451,456 **** getNextNearest(IndexScanDesc scan)
--- 513,535 ----
if (scan->xs_itup)
{
+ /*
+ * If previously returned index tuple is not visible save it into
+ * so->killedItems. And at the end of the page scan mark all saved
+ * tuples as dead.
+ */
+ if (scan->kill_prior_tuple)
+ {
+ if (so->killedItems == NULL)
+ {
+ MemoryContext oldCxt2 = MemoryContextSwitchTo(so->giststate->scanCxt);
+
+ so->killedItems = (ItemPointerData *) palloc(MaxIndexTuplesPerPage * sizeof(ItemPointerData));
+ MemoryContextSwitchTo(oldCxt2);
+ }
+ if ((so->numKilled < MaxIndexTuplesPerPage))
+ so->killedItems[so->numKilled++] = scan->xs_ctup.t_self;
+ }
/* free previously returned tuple */
pfree(scan->xs_itup);
scan->xs_itup = NULL;
***************
*** 515,520 **** getNextNearest(IndexScanDesc scan)
--- 594,605 ----
}
else
{
+ if ((so->curBlkno != InvalidBlockNumber) && (so->numKilled > 0))
+ gistkillitems(scan);
+
+ /* save current item BlockNumber for next gistkillitems() call */
+ so->curBlkno = item->blkno;
+
/* visit an index page, extract its items into queue */
CHECK_FOR_INTERRUPTS();
***************
*** 572,578 **** gistgettuple(PG_FUNCTION_ARGS)
--- 657,675 ----
{
if (so->curPageData < so->nPageData)
{
+ if ((scan->kill_prior_tuple) && (so->curPageData > 0))
+ {
+ if (so->killedItems == NULL)
+ {
+ MemoryContext oldCxt = MemoryContextSwitchTo(so->giststate->scanCxt);
+
+ so->killedItems = (ItemPointerData *) palloc(MaxIndexTuplesPerPage * sizeof(ItemPointerData));
+ MemoryContextSwitchTo(oldCxt);
+ }
+ if (so->numKilled < MaxIndexTuplesPerPage)
+ so->killedItems[so->numKilled++] = so->pageData[so->curPageData - 1].heapPtr;
+ }
/* continuing to return tuples from a leaf page */
scan->xs_ctup.t_self = so->pageData[so->curPageData].heapPtr;
scan->xs_recheck = so->pageData[so->curPageData].recheck;
***************
*** 586,594 **** gistgettuple(PG_FUNCTION_ARGS)
--- 683,711 ----
PG_RETURN_BOOL(true);
}
+ /*
+ * Check the last returned tuple and add it to killitems if
+ * necessary
+ */
+ if ((scan->kill_prior_tuple) && (so->curPageData > 0) && (so->curPageData == so->nPageData))
+ {
+
+ if (so->killedItems == NULL)
+ {
+ MemoryContext oldCxt = MemoryContextSwitchTo(so->giststate->scanCxt);
+
+ so->killedItems = (ItemPointerData *) palloc(MaxIndexTuplesPerPage * sizeof(ItemPointerData));
+ MemoryContextSwitchTo(oldCxt);
+ }
+ if ((so->numKilled < MaxIndexTuplesPerPage))
+ so->killedItems[so->numKilled++] = so->pageData[so->curPageData - 1].heapPtr;
+ }
/* find and process the next index page */
do
{
+ if ((so->curBlkno != InvalidBlockNumber) && (so->numKilled > 0))
+ gistkillitems(scan);
+
GISTSearchItem *item = getNextGISTSearchItem(so);
if (!item)
***************
*** 596,601 **** gistgettuple(PG_FUNCTION_ARGS)
--- 713,721 ----
CHECK_FOR_INTERRUPTS();
+ /* save current item BlockNumber for next gistkillitems() call */
+ so->curBlkno = item->blkno;
+
/*
* While scanning a leaf page, ItemPointers of matching heap
* tuples are stored in so->pageData. If there are any on
*** a/src/backend/access/gist/gistscan.c
--- b/src/backend/access/gist/gistscan.c
***************
*** 93,98 **** gistbeginscan(PG_FUNCTION_ARGS)
--- 93,102 ----
memset(scan->xs_orderbynulls, true, sizeof(bool) * scan->numberOfOrderBys);
}
+ so->killedItems = NULL; /* until needed */
+ so->numKilled = 0;
+ so->curBlkno = InvalidBlockNumber;
+
scan->opaque = so;
/*
*** a/src/include/access/gist.h
--- b/src/include/access/gist.h
***************
*** 41,48 ****
*/
#define F_LEAF (1 << 0) /* leaf page */
#define F_DELETED (1 << 1) /* the page has been deleted */
! #define F_TUPLES_DELETED (1 << 2) /* some tuples on the page are dead */
#define F_FOLLOW_RIGHT (1 << 3) /* page to the right has no downlink */
typedef XLogRecPtr GistNSN;
--- 41,51 ----
*/
#define F_LEAF (1 << 0) /* leaf page */
#define F_DELETED (1 << 1) /* the page has been deleted */
! #define F_TUPLES_DELETED (1 << 2) /* some tuples on the page were
! * deleted */
#define F_FOLLOW_RIGHT (1 << 3) /* page to the right has no downlink */
+ #define F_HAS_GARBAGE (1 << 4) /* some tuples on the page are dead,
+ * but not deleted yet */
typedef XLogRecPtr GistNSN;
***************
*** 137,142 **** typedef struct GISTENTRY
--- 140,149 ----
#define GistMarkTuplesDeleted(page) ( GistPageGetOpaque(page)->flags |= F_TUPLES_DELETED)
#define GistClearTuplesDeleted(page) ( GistPageGetOpaque(page)->flags &= ~F_TUPLES_DELETED)
+ #define GistPageHasGarbage(page) ( GistPageGetOpaque(page)->flags & F_HAS_GARBAGE)
+ #define GistMarkPageHasGarbage(page) ( GistPageGetOpaque(page)->flags |= F_HAS_GARBAGE)
+ #define GistClearPageHasGarbage(page) ( GistPageGetOpaque(page)->flags &= ~F_HAS_GARBAGE)
+
#define GistFollowRight(page) ( GistPageGetOpaque(page)->flags & F_FOLLOW_RIGHT)
#define GistMarkFollowRight(page) ( GistPageGetOpaque(page)->flags |= F_FOLLOW_RIGHT)
#define GistClearFollowRight(page) ( GistPageGetOpaque(page)->flags &= ~F_FOLLOW_RIGHT)
*** a/src/include/access/gist_private.h
--- b/src/include/access/gist_private.h
***************
*** 22,27 ****
--- 22,28 ----
#include "storage/bufmgr.h"
#include "storage/buffile.h"
#include "utils/hsearch.h"
+ #include "access/genam.h"
/*
* Maximum number of "halves" a page can be split into in one operation.
***************
*** 161,166 **** typedef struct GISTScanOpaqueData
--- 162,172 ----
/* pre-allocated workspace arrays */
double *distances; /* output area for gistindex_keytest */
+ /* info about killed items if any (killedItems is NULL if never used) */
+ ItemPointerData *killedItems; /* heap pointers of killed items */
+ int numKilled; /* number of currently stored items */
+ BlockNumber curBlkno; /* current number of block */
+
/* In a non-ordered search, returnable heap items are stored here: */
GISTSearchHeapItem pageData[BLCKSZ / sizeof(IndexTupleData)];
OffsetNumber nPageData; /* number of valid items in array */
On Mon, Aug 3, 2015 at 12:27 PM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
1) Test and results are in attachments. Everything seems to work as
expected.
2) I dropped these notices. It was done only for debug purposes. Updated
patch is attached.
3) fixed
Good! Another couple of notes from me:
1) I think gistvacuumpage() and gistkillitems() need function-level
comments.
2) ItemIdIsDead() can be used in the index scan, as is done
in _bt_checkkeys().
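For reference, the nbtree-style treatment in the scan loop would look roughly
like this (a sketch; the next version of the patch does essentially this in
gistScanPage()):

    ItemId      iid = PageGetItemId(page, i);

    /*
     * If the scan specifies not to return killed tuples, treat a killed
     * tuple as not passing the quals, the same way _bt_checkkeys() does.
     */
    if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
        continue;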
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Mon, Aug 3, 2015 at 12:27 PM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
1) Test and results are in attachments. Everything seems to work as expected.
2) I dropped these notices. It was done only for debug purposes.
Updated patch is attached.
3) fixed
Good! Another couple of notes from me:
1) I think gistvacuumpage() and gistkillitems() need function-level
comments.
2) ItemIdIsDead() can be used in the index scan, as is done
in _bt_checkkeys().
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
I've added some comments.
ItemIdIsDead() is now used (dead tuples are simply skipped as not matching
the quals).
There is also an additional check of the page LSN in gistkillitems() to make
sure the page was not changed between reads; see the sketch below.
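Roughly, that check at the top of gistkillitems() looks like this (condensed
from the attached patch):

    /*
     * If the page LSN differs, the page was modified since we read it, so the
     * saved killedItems may no longer point at the same index tuples and
     * setting LP_DEAD hints would not be safe.
     */
    if (PageGetLSN(page) != so->curPageLSN)
    {
        UnlockReleaseBuffer(buffer);
        so->numKilled = 0;      /* reset counter */
        return;
    }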
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
microvacuum_for_gist_3.patch (text/plain; charset=windows-1251)
*** a/src/backend/access/gist/gist.c
--- b/src/backend/access/gist/gist.c
***************
*** 36,42 **** static bool gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
bool unlockbuf, bool unlockleftchild);
static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
GISTSTATE *giststate, List *splitinfo, bool releasebuf);
!
#define ROTATEDIST(d) do { \
SplitedPageLayout *tmp=(SplitedPageLayout*)palloc(sizeof(SplitedPageLayout)); \
--- 36,42 ----
bool unlockbuf, bool unlockleftchild);
static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
GISTSTATE *giststate, List *splitinfo, bool releasebuf);
! static void gistvacuumpage(Relation rel, Page page, Buffer buffer);
#define ROTATEDIST(d) do { \
SplitedPageLayout *tmp=(SplitedPageLayout*)palloc(sizeof(SplitedPageLayout)); \
***************
*** 209,214 **** gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
--- 209,225 ----
* because the tuple vector passed to gistSplit won't include this tuple.
*/
is_split = gistnospace(page, itup, ntup, oldoffnum, freespace);
+
+ /*
+ * If leaf page is full, try at first to delete dead tuples. And then
+ * check again.
+ */
+ if ((is_split) && GistPageIsLeaf(page) && GistPageHasGarbage(page))
+ {
+ gistvacuumpage(rel, page, buffer);
+ is_split = gistnospace(page, itup, ntup, oldoffnum, freespace);
+ }
+
if (is_split)
{
/* no space for insertion */
***************
*** 1440,1442 **** freeGISTstate(GISTSTATE *giststate)
--- 1451,1522 ----
/* It's sufficient to delete the scanCxt */
MemoryContextDelete(giststate->scanCxt);
}
+
+ /*
+ * gistvacuumpage() -- try to remove LP_DEAD items from the given page.
+ */
+ static void
+ gistvacuumpage(Relation rel, Page page, Buffer buffer)
+ {
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+
+ /*
+ * Scan over all items to see which ones need to be deleted according to
+ * LP_DEAD flags.
+ */
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (offnum = FirstOffsetNumber;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemId = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemId))
+ deletable[ndeletable++] = offnum;
+ }
+
+ if (ndeletable > 0)
+ {
+ START_CRIT_SECTION();
+
+ PageIndexMultiDelete(page, deletable, ndeletable);
+
+ /*
+ * Mark the page as not containing any LP_DEAD items. This is not
+ * certainly true (there might be some that have recently been marked,
+ * but weren't included in our target-item list), but it will almost
+ * always be true and it doesn't seem worth an additional page scan to
+ * check it. Remember that F_HAS_GARBAGE is only a hint anyway.
+ */
+ GistClearPageHasGarbage(page);
+
+ MarkBufferDirty(buffer);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+
+ recptr = gistXLogUpdate(rel->rd_node, buffer,
+ deletable, ndeletable,
+ NULL, 0, InvalidBuffer);
+
+ PageSetLSN(page, recptr);
+ }
+ else
+ PageSetLSN(page, gistGetFakeLSN(rel));
+
+ END_CRIT_SECTION();
+ }
+
+ /*
+ * Note: if we didn't find any LP_DEAD items, then the page's
+ * F_HAS_GARBAGE hint bit is falsely set. We do not bother expending a
+ * separate write to clear it, however. We will clear it when we split
+ * the page.
+ */
+ }
*** a/src/backend/access/gist/gistget.c
--- b/src/backend/access/gist/gistget.c
***************
*** 24,29 ****
--- 24,116 ----
#include "utils/memutils.h"
#include "utils/rel.h"
+ /*
+ * gistkillitems() -- set LP_DEAD state for items an indexscan caller has
+ * told us were killed.
+ *
+ * We match items by heap TID before mark them. If an item has moved off
+ * the current page due to a split, we'll fail to find it and do nothing
+ * (this is not an error case --- we assume the item will eventually get
+ * marked in a future indexscan).
+ *
+ * We re-read page here, so it's significant to check page LSN. If page
+ * has been modified since the last read (as determined by LSN), we dare not
+ * flag any antries because it is possible that the old entry was vacuumed
+ * away and the TID was re-used by a completely different heap tuple.
+ */
+ static void
+ gistkillitems(IndexScanDesc scan)
+ {
+ GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+ Buffer buffer;
+ Page page;
+ OffsetNumber minoff;
+ OffsetNumber maxoff;
+ int i;
+ bool killedsomething = false;
+
+ Assert(so->curBlkno != InvalidBlockNumber);
+
+ buffer = ReadBuffer(scan->indexRelation, so->curBlkno);
+ if (!BufferIsValid(buffer))
+ return;
+
+ LockBuffer(buffer, GIST_SHARE);
+ gistcheckpage(scan->indexRelation, buffer);
+ page = BufferGetPage(buffer);
+
+ /*
+ * If page LSN differs it means that the page was modified since the last read.
+ * killedItemes could be not valid so LP_DEAD hints applying is not safe.
+ */
+ if(PageGetLSN(page) != so->curPageLSN)
+ {
+ UnlockReleaseBuffer(buffer);
+ so->numKilled = 0; /* reset counter */
+ return;
+ }
+
+ minoff = FirstOffsetNumber;
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (i = 0; i < so->numKilled; i++)
+ {
+ if (so->killedItems != NULL)
+ {
+ OffsetNumber offnum = FirstOffsetNumber;
+
+ while (offnum <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
+
+ if (ItemPointerEquals(&ituple->t_tid, &(so->killedItems[i])))
+ {
+ /* found the item */
+ ItemIdMarkDead(iid);
+ killedsomething = true;
+ break; /* out of inner search loop */
+ }
+ offnum = OffsetNumberNext(offnum);
+ }
+ }
+ }
+
+ if (killedsomething)
+ {
+ GistMarkPageHasGarbage(page);
+ MarkBufferDirtyHint(buffer, true);
+ }
+
+ UnlockReleaseBuffer(buffer);
+
+ /*
+ * Always reset the scan state, so we don't look for same items on other
+ * pages.
+ */
+ so->numKilled = 0;
+ }
/*
* gistindex_keytest() -- does this index tuple satisfy the scan key(s)?
***************
*** 306,322 **** gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances,
MemoryContextReset(so->pageDataCxt);
/*
* check all tuples on page
*/
maxoff = PageGetMaxOffsetNumber(page);
for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
{
! IndexTuple it = (IndexTuple) PageGetItem(page, PageGetItemId(page, i));
bool match;
bool recheck;
bool recheck_distances;
/*
* Must call gistindex_keytest in tempCxt, and clean up any leftover
* junk afterward.
*/
--- 393,425 ----
MemoryContextReset(so->pageDataCxt);
/*
+ * We save the LSN of the page as we read it, so that we know whether it
+ * safe to apply LP_DEAD hints to the page later. This allows us to drop
+ * the pin for MVCC scans, which allows vacuum to avoid blocking.
+ */
+ so->curPageLSN = PageGetLSN(page);
+
+ /*
* check all tuples on page
*/
maxoff = PageGetMaxOffsetNumber(page);
for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
{
! ItemId iid = PageGetItemId(page, i);
! IndexTuple it;
bool match;
bool recheck;
bool recheck_distances;
/*
+ * If the scan specifies not to return killed tuples, then we treat a
+ * killed tuple as not passing the qual.
+ */
+ if(scan->ignore_killed_tuples && ItemIdIsDead(iid))
+ continue;
+
+ it = (IndexTuple) PageGetItem(page, iid);
+ /*
* Must call gistindex_keytest in tempCxt, and clean up any leftover
* junk afterward.
*/
***************
*** 331,337 **** gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances,
/* Ignore tuple if it doesn't match */
if (!match)
continue;
-
if (tbm && GistPageIsLeaf(page))
{
/*
--- 434,439 ----
***************
*** 451,456 **** getNextNearest(IndexScanDesc scan)
--- 553,575 ----
if (scan->xs_itup)
{
+ /*
+ * If previously returned index tuple is not visible save it into
+ * so->killedItems. And at the end of the page scan mark all saved
+ * tuples as dead.
+ */
+ if (scan->kill_prior_tuple)
+ {
+ if (so->killedItems == NULL)
+ {
+ MemoryContext oldCxt2 = MemoryContextSwitchTo(so->giststate->scanCxt);
+
+ so->killedItems = (ItemPointerData *) palloc(MaxIndexTuplesPerPage * sizeof(ItemPointerData));
+ MemoryContextSwitchTo(oldCxt2);
+ }
+ if ((so->numKilled < MaxIndexTuplesPerPage))
+ so->killedItems[so->numKilled++] = scan->xs_ctup.t_self;
+ }
/* free previously returned tuple */
pfree(scan->xs_itup);
scan->xs_itup = NULL;
***************
*** 515,520 **** getNextNearest(IndexScanDesc scan)
--- 634,645 ----
}
else
{
+ if ((so->curBlkno != InvalidBlockNumber) && (so->numKilled > 0))
+ gistkillitems(scan);
+
+ /* save current item BlockNumber for next gistkillitems() call */
+ so->curBlkno = item->blkno;
+
/* visit an index page, extract its items into queue */
CHECK_FOR_INTERRUPTS();
***************
*** 572,578 **** gistgettuple(PG_FUNCTION_ARGS)
--- 697,715 ----
{
if (so->curPageData < so->nPageData)
{
+ if ((scan->kill_prior_tuple) && (so->curPageData > 0))
+ {
+ if (so->killedItems == NULL)
+ {
+ MemoryContext oldCxt = MemoryContextSwitchTo(so->giststate->scanCxt);
+
+ so->killedItems = (ItemPointerData *) palloc(MaxIndexTuplesPerPage * sizeof(ItemPointerData));
+ MemoryContextSwitchTo(oldCxt);
+ }
+ if (so->numKilled < MaxIndexTuplesPerPage)
+ so->killedItems[so->numKilled++] = so->pageData[so->curPageData - 1].heapPtr;
+ }
/* continuing to return tuples from a leaf page */
scan->xs_ctup.t_self = so->pageData[so->curPageData].heapPtr;
scan->xs_recheck = so->pageData[so->curPageData].recheck;
***************
*** 586,594 **** gistgettuple(PG_FUNCTION_ARGS)
--- 723,751 ----
PG_RETURN_BOOL(true);
}
+ /*
+ * Check the last returned tuple and add it to killitems if
+ * necessary
+ */
+ if ((scan->kill_prior_tuple) && (so->curPageData > 0) && (so->curPageData == so->nPageData))
+ {
+
+ if (so->killedItems == NULL)
+ {
+ MemoryContext oldCxt = MemoryContextSwitchTo(so->giststate->scanCxt);
+
+ so->killedItems = (ItemPointerData *) palloc(MaxIndexTuplesPerPage * sizeof(ItemPointerData));
+ MemoryContextSwitchTo(oldCxt);
+ }
+ if ((so->numKilled < MaxIndexTuplesPerPage))
+ so->killedItems[so->numKilled++] = so->pageData[so->curPageData - 1].heapPtr;
+ }
/* find and process the next index page */
do
{
+ if ((so->curBlkno != InvalidBlockNumber) && (so->numKilled > 0))
+ gistkillitems(scan);
+
GISTSearchItem *item = getNextGISTSearchItem(so);
if (!item)
***************
*** 596,601 **** gistgettuple(PG_FUNCTION_ARGS)
--- 753,761 ----
CHECK_FOR_INTERRUPTS();
+ /* save current item BlockNumber for next gistkillitems() call */
+ so->curBlkno = item->blkno;
+
/*
* While scanning a leaf page, ItemPointers of matching heap
* tuples are stored in so->pageData. If there are any on
*** a/src/backend/access/gist/gistscan.c
--- b/src/backend/access/gist/gistscan.c
***************
*** 93,98 **** gistbeginscan(PG_FUNCTION_ARGS)
--- 93,103 ----
memset(scan->xs_orderbynulls, true, sizeof(bool) * scan->numberOfOrderBys);
}
+ so->killedItems = NULL; /* until needed */
+ so->numKilled = 0;
+ so->curBlkno = InvalidBlockNumber;
+ so->curPageLSN = InvalidXLogRecPtr;
+
scan->opaque = so;
/*
*** a/src/include/access/gist.h
--- b/src/include/access/gist.h
***************
*** 41,48 ****
*/
#define F_LEAF (1 << 0) /* leaf page */
#define F_DELETED (1 << 1) /* the page has been deleted */
! #define F_TUPLES_DELETED (1 << 2) /* some tuples on the page are dead */
#define F_FOLLOW_RIGHT (1 << 3) /* page to the right has no downlink */
typedef XLogRecPtr GistNSN;
--- 41,51 ----
*/
#define F_LEAF (1 << 0) /* leaf page */
#define F_DELETED (1 << 1) /* the page has been deleted */
! #define F_TUPLES_DELETED (1 << 2) /* some tuples on the page were
! * deleted */
#define F_FOLLOW_RIGHT (1 << 3) /* page to the right has no downlink */
+ #define F_HAS_GARBAGE (1 << 4) /* some tuples on the page are dead,
+ * but not deleted yet */
typedef XLogRecPtr GistNSN;
***************
*** 137,142 **** typedef struct GISTENTRY
--- 140,149 ----
#define GistMarkTuplesDeleted(page) ( GistPageGetOpaque(page)->flags |= F_TUPLES_DELETED)
#define GistClearTuplesDeleted(page) ( GistPageGetOpaque(page)->flags &= ~F_TUPLES_DELETED)
+ #define GistPageHasGarbage(page) ( GistPageGetOpaque(page)->flags & F_HAS_GARBAGE)
+ #define GistMarkPageHasGarbage(page) ( GistPageGetOpaque(page)->flags |= F_HAS_GARBAGE)
+ #define GistClearPageHasGarbage(page) ( GistPageGetOpaque(page)->flags &= ~F_HAS_GARBAGE)
+
#define GistFollowRight(page) ( GistPageGetOpaque(page)->flags & F_FOLLOW_RIGHT)
#define GistMarkFollowRight(page) ( GistPageGetOpaque(page)->flags |= F_FOLLOW_RIGHT)
#define GistClearFollowRight(page) ( GistPageGetOpaque(page)->flags &= ~F_FOLLOW_RIGHT)
*** a/src/include/access/gist_private.h
--- b/src/include/access/gist_private.h
***************
*** 22,27 ****
--- 22,28 ----
#include "storage/bufmgr.h"
#include "storage/buffile.h"
#include "utils/hsearch.h"
+ #include "access/genam.h"
/*
* Maximum number of "halves" a page can be split into in one operation.
***************
*** 161,166 **** typedef struct GISTScanOpaqueData
--- 162,173 ----
/* pre-allocated workspace arrays */
double *distances; /* output area for gistindex_keytest */
+ /* info about killed items if any (killedItems is NULL if never used) */
+ ItemPointerData *killedItems; /* heap pointers of killed items */
+ int numKilled; /* number of currently stored items */
+ BlockNumber curBlkno; /* current number of block */
+ GistNSN curPageLSN; /* pos in the WAL stream when page was read */
+
/* In a non-ordered search, returnable heap items are stored here: */
GISTSearchHeapItem pageData[BLCKSZ / sizeof(IndexTupleData)];
OffsetNumber nPageData; /* number of valid items in array */
Hi,
I don't know too much about gist, but did a quick read through. Mostly
spotting some stylistic issues. Please fix those making it easier for
the next reviewer.
*** a/src/backend/access/gist/gist.c
--- b/src/backend/access/gist/gist.c
***************
*** 36,42 **** static bool gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
bool unlockbuf, bool unlockleftchild);
static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
GISTSTATE *giststate, List *splitinfo, bool releasebuf);
!
#define ROTATEDIST(d) do { \
SplitedPageLayout *tmp=(SplitedPageLayout*)palloc(sizeof(SplitedPageLayout)); \
--- 36,42 ----
bool unlockbuf, bool unlockleftchild);
static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
GISTSTATE *giststate, List *splitinfo, bool releasebuf);
! static void gistvacuumpage(Relation rel, Page page, Buffer buffer);
#define ROTATEDIST(d) do { \
SplitedPageLayout *tmp=(SplitedPageLayout*)palloc(sizeof(SplitedPageLayout)); \
Newline removed.
+ /*
+ * If leaf page is full, try at first to delete dead tuples. And then
+ * check again.
+ */
+ if ((is_split) && GistPageIsLeaf(page) && GistPageHasGarbage(page))
superfluous parens around is_split
+ /*
+ * gistkillitems() -- set LP_DEAD state for items an indexscan caller has
+ * told us were killed.
+ *
+ * We match items by heap TID before mark them. If an item has moved off
+ * the current page due to a split, we'll fail to find it and do nothing
+ * (this is not an error case --- we assume the item will eventually get
+ * marked in a future indexscan).
+ *
+ * We re-read page here, so it's significant to check page LSN. If page
+ * has been modified since the last read (as determined by LSN), we dare not
+ * flag any antries because it is possible that the old entry was vacuumed
+ * away and the TID was re-used by a completely different heap tuple.
s/significant/important/?.
s/If page/If the page/
s/dare not/cannot/
+ */
+ static void
+ gistkillitems(IndexScanDesc scan)
+ {
+ GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+ Buffer buffer;
+ Page page;
+ OffsetNumber minoff;
+ OffsetNumber maxoff;
+ int i;
+ bool killedsomething = false;
+
+ Assert(so->curBlkno != InvalidBlockNumber);
+
+ buffer = ReadBuffer(scan->indexRelation, so->curBlkno);
+ if (!BufferIsValid(buffer))
+ return;
+
+ LockBuffer(buffer, GIST_SHARE);
+ gistcheckpage(scan->indexRelation, buffer);
+ page = BufferGetPage(buffer);
+
+ /*
+ * If page LSN differs it means that the page was modified since the last read.
+ * killedItemes could be not valid so LP_DEAD hints applying is not safe.
+ */
+ if(PageGetLSN(page) != so->curPageLSN)
+ {
+ UnlockReleaseBuffer(buffer);
+ so->numKilled = 0; /* reset counter */
+ return;
+ }
+
+ minoff = FirstOffsetNumber;
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ maxoff = PageGetMaxOffsetNumber(page);
duplicated line.
+ for (i = 0; i < so->numKilled; i++)
+ {
+ if (so->killedItems != NULL)
+ {
+ OffsetNumber offnum = FirstOffsetNumber;
+
+ while (offnum <= maxoff)
+ {
This nested loop deserves a comment.
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
+
+ if (ItemPointerEquals(&ituple->t_tid, &(so->killedItems[i])))
+ {
+ /* found the item */
+ ItemIdMarkDead(iid);
+ killedsomething = true;
+ break; /* out of inner search loop */
+ }
+ offnum = OffsetNumberNext(offnum);
+ }
+ }
+ }
I know it's the same approach nbtree takes, but if there's a significant
number of deleted items this strikes me as a rather suboptimal approach.
The constants are small, but this still essentially is O(n^2).
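For illustration only (this is not what the patch does): the nested loop could
be avoided by sorting the collected TIDs once and probing them with bsearch()
during a single pass over the page, along these lines:

    /* hypothetical alternative, using the existing ItemPointerCompare() */
    static int
    tidcmp(const void *a, const void *b)
    {
        return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
    }

    /* ... inside gistkillitems(), after the page has been read and checked: */
    qsort(so->killedItems, so->numKilled, sizeof(ItemPointerData), tidcmp);
    for (offnum = FirstOffsetNumber; offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        ItemId      iid = PageGetItemId(page, offnum);
        IndexTuple  ituple = (IndexTuple) PageGetItem(page, iid);

        if (bsearch(&ituple->t_tid, so->killedItems, so->numKilled,
                    sizeof(ItemPointerData), tidcmp) != NULL)
        {
            ItemIdMarkDead(iid);
            killedsomething = true;
        }
    }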
***************
*** 451,456 **** getNextNearest(IndexScanDesc scan)
--- 553,575 ----
if (scan->xs_itup)
{
+ /*
+ * If previously returned index tuple is not visible save it into
+ * so->killedItems. And at the end of the page scan mark all saved
+ * tuples as dead.
+ */
+ if (scan->kill_prior_tuple)
+ {
+ if (so->killedItems == NULL)
+ {
+ MemoryContext oldCxt2 = MemoryContextSwitchTo(so->giststate->scanCxt);
+
+ so->killedItems = (ItemPointerData *) palloc(MaxIndexTuplesPerPage * sizeof(ItemPointerData));
+ MemoryContextSwitchTo(oldCxt2);
+ }
oldCxt2?
+ if ((so->numKilled < MaxIndexTuplesPerPage))
+ so->killedItems[so->numKilled++] = scan->xs_ctup.t_self;
+ }
superfluous parens.
+ if ((so->curBlkno != InvalidBlockNumber) && (so->numKilled > 0))
+ gistkillitems(scan);
superfluous parens.
+ if ((scan->kill_prior_tuple) && (so->curPageData > 0))
+ {
superfluous parens.
+ if (so->killedItems == NULL)
+ {
+ MemoryContext oldCxt = MemoryContextSwitchTo(so->giststate->scanCxt);
+
+ so->killedItems = (ItemPointerData *) palloc(MaxIndexTuplesPerPage * sizeof(ItemPointerData));
+ MemoryContextSwitchTo(oldCxt);
+ }
+ if (so->numKilled < MaxIndexTuplesPerPage)
+ so->killedItems[so->numKilled++] = so->pageData[so->curPageData - 1].heapPtr;
+ }
/* continuing to return tuples from a leaf page */
scan->xs_ctup.t_self = so->pageData[so->curPageData].heapPtr;
scan->xs_recheck = so->pageData[so->curPageData].recheck;
***************
overlong lines.
*** 586,594 **** gistgettuple(PG_FUNCTION_ARGS)
--- 723,751 ----
PG_RETURN_BOOL(true);
}
+ /*
+ * Check the last returned tuple and add it to killitems if
+ * necessary
+ */
+ if ((scan->kill_prior_tuple) && (so->curPageData > 0) && (so->curPageData == so->nPageData))
+ {
superfluous parens galore.
+ if (so->killedItems == NULL)
+ {
+ MemoryContext oldCxt = MemoryContextSwitchTo(so->giststate->scanCxt);
+
+ so->killedItems = (ItemPointerData *) palloc(MaxIndexTuplesPerPage * sizeof(ItemPointerData));
+ MemoryContextSwitchTo(oldCxt);
+ }
+ if ((so->numKilled < MaxIndexTuplesPerPage))
+ so->killedItems[so->numKilled++] = so->pageData[so->curPageData - 1].heapPtr;
+ }
/* find and process the next index page */
do
{
+ if ((so->curBlkno != InvalidBlockNumber) && (so->numKilled > 0))
+ gistkillitems(scan);
+
GISTSearchItem *item = getNextGISTSearchItem(so);
if (!item)
Too long lines.
Greetings,
Andres Freund
Hi,
I don't know too much about gist, but did a quick read through. Mostly
spotting some stylistic issues. Please fix those making it easier for
the next reviewer.
Thank you for the review! All mentioned issues are fixed.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
microvacuum_for_gist_4.patch (text/x-patch)
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 0e49959..f658831 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -36,6 +36,7 @@ static bool gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
bool unlockbuf, bool unlockleftchild);
static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
GISTSTATE *giststate, List *splitinfo, bool releasebuf);
+static void gistvacuumpage(Relation rel, Page page, Buffer buffer);
#define ROTATEDIST(d) do { \
@@ -209,6 +210,17 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
* because the tuple vector passed to gistSplit won't include this tuple.
*/
is_split = gistnospace(page, itup, ntup, oldoffnum, freespace);
+
+ /*
+ * If leaf page is full, try at first to delete dead tuples. And then
+ * check again.
+ */
+ if (is_split && GistPageIsLeaf(page) && GistPageHasGarbage(page))
+ {
+ gistvacuumpage(rel, page, buffer);
+ is_split = gistnospace(page, itup, ntup, oldoffnum, freespace);
+ }
+
if (is_split)
{
/* no space for insertion */
@@ -1440,3 +1452,72 @@ freeGISTstate(GISTSTATE *giststate)
/* It's sufficient to delete the scanCxt */
MemoryContextDelete(giststate->scanCxt);
}
+
+/*
+ * gistvacuumpage() -- try to remove LP_DEAD items from the given page.
+ */
+static void
+gistvacuumpage(Relation rel, Page page, Buffer buffer)
+{
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+
+ /*
+ * Scan over all items to see which ones need to be deleted according to
+ * LP_DEAD flags.
+ */
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (offnum = FirstOffsetNumber;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemId = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemId))
+ deletable[ndeletable++] = offnum;
+ }
+
+ if (ndeletable > 0)
+ {
+ START_CRIT_SECTION();
+
+ PageIndexMultiDelete(page, deletable, ndeletable);
+
+ /*
+ * Mark the page as not containing any LP_DEAD items. This is not
+ * certainly true (there might be some that have recently been marked,
+ * but weren't included in our target-item list), but it will almost
+ * always be true and it doesn't seem worth an additional page scan to
+ * check it. Remember that F_HAS_GARBAGE is only a hint anyway.
+ */
+ GistClearPageHasGarbage(page);
+
+ MarkBufferDirty(buffer);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+
+ recptr = gistXLogUpdate(rel->rd_node, buffer,
+ deletable, ndeletable,
+ NULL, 0, InvalidBuffer);
+
+ PageSetLSN(page, recptr);
+ }
+ else
+ PageSetLSN(page, gistGetFakeLSN(rel));
+
+ END_CRIT_SECTION();
+ }
+
+ /*
+ * Note: if we didn't find any LP_DEAD items, then the page's
+ * F_HAS_GARBAGE hint bit is falsely set. We do not bother expending a
+ * separate write to clear it, however. We will clear it when we split
+ * the page.
+ */
+}
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index 20f695c..e655a05 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -24,6 +24,97 @@
#include "utils/memutils.h"
#include "utils/rel.h"
+/*
+ * gistkillitems() -- set LP_DEAD state for items an indexscan caller has
+ * told us were killed.
+ *
+ * We match items by heap TID before marking them. If an item has moved off
+ * the current page due to a split, we'll fail to find it and do nothing
+ * (this is not an error case --- we assume the item will eventually get
+ * marked in a future indexscan).
+ *
+ * We re-read page here, so it's important to check page LSN. If the page
+ * has been modified since the last read (as determined by LSN), we cannot
+ * flag any entries because it is possible that the old entry was vacuumed
+ * away and the TID was re-used by a completely different heap tuple.
+ */
+static void
+gistkillitems(IndexScanDesc scan)
+{
+ GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+ Buffer buffer;
+ Page page;
+ OffsetNumber minoff;
+ OffsetNumber maxoff;
+ int i;
+ bool killedsomething = false;
+
+ Assert(so->curBlkno != InvalidBlockNumber);
+
+ buffer = ReadBuffer(scan->indexRelation, so->curBlkno);
+ if (!BufferIsValid(buffer))
+ return;
+
+ LockBuffer(buffer, GIST_SHARE);
+ gistcheckpage(scan->indexRelation, buffer);
+ page = BufferGetPage(buffer);
+
+ /*
+ * If the page LSN differs, it means that the page was modified since the last read.
+ * killedItems may no longer be valid, so applying LP_DEAD hints is not safe.
+ */
+ if (PageGetLSN(page) != so->curPageLSN)
+ {
+ UnlockReleaseBuffer(buffer);
+ so->numKilled = 0; /* reset counter */
+ return;
+ }
+
+ minoff = FirstOffsetNumber;
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ for (i = 0; i < so->numKilled; i++)
+ {
+ if (so->killedItems != NULL)
+ {
+ OffsetNumber offnum = FirstOffsetNumber;
+
+ while (offnum <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, offnum);
+ IndexTuple ituple = (IndexTuple) PageGetItem(page, iid);
+
+ /*
+ * We match items by heap TID before marking them. We cope with cases
+ * where items have moved right due to insertions. If the found item
+ * is not equal to the killed item, just skip it.
+ */
+ if (ItemPointerEquals(&ituple->t_tid, &(so->killedItems[i])))
+ {
+ /* found the item */
+ ItemIdMarkDead(iid);
+ killedsomething = true;
+ break; /* out of inner search loop */
+ }
+ offnum = OffsetNumberNext(offnum);
+ }
+ }
+ }
+
+ if (killedsomething)
+ {
+ GistMarkPageHasGarbage(page);
+ MarkBufferDirtyHint(buffer, true);
+ }
+
+ UnlockReleaseBuffer(buffer);
+
+ /*
+ * Always reset the scan state, so we don't look for same items on other
+ * pages.
+ */
+ so->numKilled = 0;
+}
/*
* gistindex_keytest() -- does this index tuple satisfy the scan key(s)?
@@ -306,17 +397,33 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances,
MemoryContextReset(so->pageDataCxt);
/*
+ * We save the LSN of the page as we read it, so that we know whether it
+ * is safe to apply LP_DEAD hints to the page later. This allows us to drop
+ * the pin for MVCC scans, which allows vacuum to avoid blocking.
+ */
+ so->curPageLSN = PageGetLSN(page);
+
+ /*
* check all tuples on page
*/
maxoff = PageGetMaxOffsetNumber(page);
for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
{
- IndexTuple it = (IndexTuple) PageGetItem(page, PageGetItemId(page, i));
+ ItemId iid = PageGetItemId(page, i);
+ IndexTuple it;
bool match;
bool recheck;
bool recheck_distances;
/*
+ * If the scan specifies not to return killed tuples, then we treat a
+ * killed tuple as not passing the qual.
+ */
+ if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+ continue;
+
+ it = (IndexTuple) PageGetItem(page, iid);
+ /*
* Must call gistindex_keytest in tempCxt, and clean up any leftover
* junk afterward.
*/
@@ -451,6 +557,27 @@ getNextNearest(IndexScanDesc scan)
if (scan->xs_itup)
{
+ /*
+ * If previously returned index tuple is not visible save it into
+ * so->killedItems. And at the end of the page scan mark all saved
+ * tuples as dead.
+ */
+ if (scan->kill_prior_tuple)
+ {
+ if (so->killedItems == NULL)
+ {
+ MemoryContext oldCxt =
+ MemoryContextSwitchTo(so->giststate->scanCxt);
+
+ so->killedItems =
+ (ItemPointerData *) palloc(MaxIndexTuplesPerPage
+ * sizeof(ItemPointerData));
+
+ MemoryContextSwitchTo(oldCxt);
+ }
+ if (so->numKilled < MaxIndexTuplesPerPage)
+ so->killedItems[so->numKilled++] = scan->xs_ctup.t_self;
+ }
/* free previously returned tuple */
pfree(scan->xs_itup);
scan->xs_itup = NULL;
@@ -515,6 +642,12 @@ getNextNearest(IndexScanDesc scan)
}
else
{
+ if (so->curBlkno != InvalidBlockNumber && so->numKilled > 0)
+ gistkillitems(scan);
+
+ /* save current item BlockNumber for next gistkillitems() call */
+ so->curBlkno = item->blkno;
+
/* visit an index page, extract its items into queue */
CHECK_FOR_INTERRUPTS();
@@ -572,7 +705,24 @@ gistgettuple(PG_FUNCTION_ARGS)
{
if (so->curPageData < so->nPageData)
{
+ if (scan->kill_prior_tuple && so->curPageData > 0)
+ {
+ if (so->killedItems == NULL)
+ {
+ MemoryContext oldCxt =
+ MemoryContextSwitchTo(so->giststate->scanCxt);
+
+ so->killedItems =
+ (ItemPointerData *) palloc(MaxIndexTuplesPerPage
+ * sizeof(ItemPointerData));
+
+ MemoryContextSwitchTo(oldCxt);
+ }
+ if (so->numKilled < MaxIndexTuplesPerPage)
+ so->killedItems[so->numKilled++] =
+ so->pageData[so->curPageData - 1].heapPtr;
+ }
/* continuing to return tuples from a leaf page */
scan->xs_ctup.t_self = so->pageData[so->curPageData].heapPtr;
scan->xs_recheck = so->pageData[so->curPageData].recheck;
@@ -586,9 +736,36 @@ gistgettuple(PG_FUNCTION_ARGS)
PG_RETURN_BOOL(true);
}
+ /*
+ * Check the last returned tuple and add it to killedItems if
+ * necessary
+ */
+ if (scan->kill_prior_tuple
+ && so->curPageData > 0
+ && so->curPageData == so->nPageData)
+ {
+
+ if (so->killedItems == NULL)
+ {
+ MemoryContext oldCxt =
+ MemoryContextSwitchTo(so->giststate->scanCxt);
+
+ so->killedItems =
+ (ItemPointerData *) palloc(MaxIndexTuplesPerPage
+ * sizeof(ItemPointerData));
+
+ MemoryContextSwitchTo(oldCxt);
+ }
+ if ((so->numKilled < MaxIndexTuplesPerPage))
+ so->killedItems[so->numKilled++] =
+ so->pageData[so->curPageData - 1].heapPtr;
+ }
/* find and process the next index page */
do
{
+ if (so->curBlkno != InvalidBlockNumber && so->numKilled > 0)
+ gistkillitems(scan);
+
GISTSearchItem *item = getNextGISTSearchItem(so);
if (!item)
@@ -596,6 +773,9 @@ gistgettuple(PG_FUNCTION_ARGS)
CHECK_FOR_INTERRUPTS();
+ /* save current item BlockNumber for next gistkillitems() call */
+ so->curBlkno = item->blkno;
+
/*
* While scanning a leaf page, ItemPointers of matching heap
* tuples are stored in so->pageData. If there are any on
diff --git a/src/backend/access/gist/gistscan.c b/src/backend/access/gist/gistscan.c
index ad39294..a17c5bc 100644
--- a/src/backend/access/gist/gistscan.c
+++ b/src/backend/access/gist/gistscan.c
@@ -93,6 +93,11 @@ gistbeginscan(PG_FUNCTION_ARGS)
memset(scan->xs_orderbynulls, true, sizeof(bool) * scan->numberOfOrderBys);
}
+ so->killedItems = NULL; /* until needed */
+ so->numKilled = 0;
+ so->curBlkno = InvalidBlockNumber;
+ so->curPageLSN = InvalidXLogRecPtr;
+
scan->opaque = so;
/*
diff --git a/src/include/access/gist.h b/src/include/access/gist.h
index 81e559b..ea3a3b0 100644
--- a/src/include/access/gist.h
+++ b/src/include/access/gist.h
@@ -41,8 +41,11 @@
*/
#define F_LEAF (1 << 0) /* leaf page */
#define F_DELETED (1 << 1) /* the page has been deleted */
-#define F_TUPLES_DELETED (1 << 2) /* some tuples on the page are dead */
+#define F_TUPLES_DELETED (1 << 2) /* some tuples on the page were
+ * deleted */
#define F_FOLLOW_RIGHT (1 << 3) /* page to the right has no downlink */
+#define F_HAS_GARBAGE (1 << 4) /* some tuples on the page are dead,
+ * but not deleted yet */
typedef XLogRecPtr GistNSN;
@@ -137,6 +140,10 @@ typedef struct GISTENTRY
#define GistMarkTuplesDeleted(page) ( GistPageGetOpaque(page)->flags |= F_TUPLES_DELETED)
#define GistClearTuplesDeleted(page) ( GistPageGetOpaque(page)->flags &= ~F_TUPLES_DELETED)
+#define GistPageHasGarbage(page) ( GistPageGetOpaque(page)->flags & F_HAS_GARBAGE)
+#define GistMarkPageHasGarbage(page) ( GistPageGetOpaque(page)->flags |= F_HAS_GARBAGE)
+#define GistClearPageHasGarbage(page) ( GistPageGetOpaque(page)->flags &= ~F_HAS_GARBAGE)
+
#define GistFollowRight(page) ( GistPageGetOpaque(page)->flags & F_FOLLOW_RIGHT)
#define GistMarkFollowRight(page) ( GistPageGetOpaque(page)->flags |= F_FOLLOW_RIGHT)
#define GistClearFollowRight(page) ( GistPageGetOpaque(page)->flags &= ~F_FOLLOW_RIGHT)
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 4f1a5c3..5f70b76 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -22,6 +22,7 @@
#include "storage/bufmgr.h"
#include "storage/buffile.h"
#include "utils/hsearch.h"
+#include "access/genam.h"
/*
* Maximum number of "halves" a page can be split into in one operation.
@@ -161,6 +162,12 @@ typedef struct GISTScanOpaqueData
/* pre-allocated workspace arrays */
double *distances; /* output area for gistindex_keytest */
+ /* info about killed items if any (killedItems is NULL if never used) */
+ ItemPointerData *killedItems; /* heap pointers of killed items */
+ int numKilled; /* number of currently stored items */
+ BlockNumber curBlkno; /* current block number */
+ GistNSN curPageLSN; /* pos in the WAL stream when page was read */
+
/* In a non-ordered search, returnable heap items are stored here: */
GISTSearchHeapItem pageData[BLCKSZ / sizeof(IndexTupleData)];
OffsetNumber nPageData; /* number of valid items in array */
Some notes:
1. gistvacuumpage():
OffsetNumber deletable[MaxOffsetNumber];
MaxOffsetNumber seems excessive here; MaxIndexTuplesPerPage is enough.
2. The loop in gistkillitems() that searches for the heap pointer looks like it
could be a performance problem. To fix it, remember each index tuple's offset
number somewhere near GISTScanOpaqueData->killedItems. Then gistkillitems()
can loop over the offsets and compare the heap pointer from killedItems with
the index tuple, and simply skip the index tuple if they don't match (see the
sketch after this list).
3. Related to the previous point: could you show some performance tests?
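Roughly, something like this inside gistkillitems() (the killedOffsets array is
only an illustration of the idea, it is not in the posted patch):

    /* sketch only: assumes killedOffsets[] is kept in sync with killedItems[] */
    for (i = 0; i < so->numKilled; i++)
    {
        OffsetNumber offnum = so->killedOffsets[i];
        ItemId       iid    = PageGetItemId(page, offnum);
        IndexTuple   ituple = (IndexTuple) PageGetItem(page, iid);

        /* the tuple may have moved; skip it if the heap pointer doesn't match */
        if (!ItemPointerEquals(&ituple->t_tid, &so->killedItems[i]))
            continue;

        ItemIdMarkDead(iid);
        killedsomething = true;
    }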
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
04.09.2015 15:11, Teodor Sigaev:
Some notes:
1. gistvacuumpage():
OffsetNumber deletable[MaxOffsetNumber];
MaxOffsetNumber seems excessive here; MaxIndexTuplesPerPage is enough.
Fixed.
2. The loop in gistkillitems() that searches for the heap pointer looks like it
could be a performance problem. To fix it, remember each index tuple's offset
number somewhere near GISTScanOpaqueData->killedItems. Then gistkillitems()
can loop over the offsets and compare the heap pointer from killedItems with
the index tuple, and simply skip the index tuple if they don't match.
Thanks for the suggestion. I've rewritten this function. Now killedItems[]
contains only the OffsetNumbers of the tuples we are going to delete.
The PageLSN check is enough to ensure that nothing has changed on the page,
so rechecking the heap pointer is unnecessary. (The recheck really matters
for btree, where a tuple can be inserted in the middle of a page, but that
situation cannot arise on a GiST index page.)
It now works 50% faster than before.
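Condensed, the scheme in the attached patch is: remember the page LSN when the
leaf page is read, collect the offsets of killed tuples, and apply the LP_DEAD
hints only if the page has not changed. In sketch form (code fragments, not a
standalone program; names as in the patch):

    /* in gistScanPage(): remember where the page was when we read it */
    so->curPageLSN = PageGetLSN(page);

    /* in gistgettuple(): remember the offset of the tuple the executor
     * killed (allocation of killedItems[] elided here) */
    if (scan->kill_prior_tuple && so->numKilled < MaxIndexTuplesPerPage)
        so->killedItems[so->numKilled++] =
            so->pageData[so->curPageData - 1].offnum;

    /* in gistkillitems(): re-lock the page and apply the hints only if the
     * page has not been modified since we read it */
    if (PageGetLSN(page) == so->curPageLSN)
    {
        int i;

        for (i = 0; i < so->numKilled; i++)
            ItemIdMarkDead(PageGetItemId(page, so->killedItems[i]));
        GistMarkPageHasGarbage(page);
        MarkBufferDirtyHint(buffer, true);
    }
    so->numKilled = 0;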
3. Related to the previous point: could you show some performance tests?
A performance test is attached.
The test is as follows: a portion of the tuples is deleted, and then the same
data is selected several times.
Without microvacuum, all 'select' queries take about the same time:
Time: 360,468 ms
Time: 243,197 ms
Time: 238,310 ms
Time: 238,018 ms
With microvacuum, the first 'select' invokes gistkillitems() and is a bit
slower than before, but the following queries are significantly faster than
without microvacuum:
Time: 368,780 ms
Time: 69,769 ms
Time: 9,545 ms
Time: 12,427 ms
Please review the patch again; I could have missed something.
P.S. Do I need to write any documentation update?
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
microvacuum_for_gist_5.patchtext/x-patch; name=microvacuum_for_gist_5.patchDownload
*** a/src/backend/access/gist/gist.c
--- b/src/backend/access/gist/gist.c
***************
*** 36,41 **** static bool gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
--- 36,42 ----
bool unlockbuf, bool unlockleftchild);
static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
GISTSTATE *giststate, List *splitinfo, bool releasebuf);
+ static void gistvacuumpage(Relation rel, Page page, Buffer buffer);
#define ROTATEDIST(d) do { \
***************
*** 209,214 **** gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
--- 210,226 ----
* because the tuple vector passed to gistSplit won't include this tuple.
*/
is_split = gistnospace(page, itup, ntup, oldoffnum, freespace);
+
+ /*
+ * If the leaf page is full, first try to delete dead tuples, and then
+ * check again.
+ */
+ if (is_split && GistPageIsLeaf(page) && GistPageHasGarbage(page))
+ {
+ gistvacuumpage(rel, page, buffer);
+ is_split = gistnospace(page, itup, ntup, oldoffnum, freespace);
+ }
+
if (is_split)
{
/* no space for insertion */
***************
*** 1440,1442 **** freeGISTstate(GISTSTATE *giststate)
--- 1452,1523 ----
/* It's sufficient to delete the scanCxt */
MemoryContextDelete(giststate->scanCxt);
}
+
+ /*
+ * gistvacuumpage() -- try to remove LP_DEAD items from the given page.
+ */
+ static void
+ gistvacuumpage(Relation rel, Page page, Buffer buffer)
+ {
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+
+ /*
+ * Scan over all items to see which ones need to be deleted according to
+ * LP_DEAD flags.
+ */
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (offnum = FirstOffsetNumber;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemId = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemId))
+ deletable[ndeletable++] = offnum;
+ }
+
+ if (ndeletable > 0)
+ {
+ START_CRIT_SECTION();
+
+ PageIndexMultiDelete(page, deletable, ndeletable);
+
+ /*
+ * Mark the page as not containing any LP_DEAD items. This is not
+ * certainly true (there might be some that have recently been marked,
+ * but weren't included in our target-item list), but it will almost
+ * always be true and it doesn't seem worth an additional page scan to
+ * check it. Remember that F_HAS_GARBAGE is only a hint anyway.
+ */
+ GistClearPageHasGarbage(page);
+
+ MarkBufferDirty(buffer);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+
+ recptr = gistXLogUpdate(rel->rd_node, buffer,
+ deletable, ndeletable,
+ NULL, 0, InvalidBuffer);
+
+ PageSetLSN(page, recptr);
+ }
+ else
+ PageSetLSN(page, gistGetFakeLSN(rel));
+
+ END_CRIT_SECTION();
+ }
+
+ /*
+ * Note: if we didn't find any LP_DEAD items, then the page's
+ * F_HAS_GARBAGE hint bit is falsely set. We do not bother expending a
+ * separate write to clear it, however. We will clear it when we split
+ * the page.
+ */
+ }
*** a/src/backend/access/gist/gistget.c
--- b/src/backend/access/gist/gistget.c
***************
*** 24,29 ****
--- 24,97 ----
#include "utils/memutils.h"
#include "utils/rel.h"
+ /*
+ * gistkillitems() -- set LP_DEAD state for items an indexscan caller has
+ * told us were killed.
+ *
+ * We re-read page here, so it's important to check page LSN. If the page
+ * has been modified since the last read (as determined by LSN), we cannot
+ * flag any entries because it is possible that the old entry was vacuumed
+ * away and the TID was re-used by a completely different heap tuple.
+ */
+ static void
+ gistkillitems(IndexScanDesc scan)
+ {
+ GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+ Buffer buffer;
+ Page page;
+ OffsetNumber offnum;
+ ItemId iid;
+ int i;
+ bool killedsomething = false;
+
+ Assert(so->curBlkno != InvalidBlockNumber);
+ Assert(so->killedItems != NULL);
+
+ buffer = ReadBuffer(scan->indexRelation, so->curBlkno);
+ if (!BufferIsValid(buffer))
+ return;
+
+ LockBuffer(buffer, GIST_SHARE);
+ gistcheckpage(scan->indexRelation, buffer);
+ page = BufferGetPage(buffer);
+
+ /*
+ * If the page LSN differs, the page was modified since the last read.
+ * The saved offsets in killedItems may no longer be valid, so it is not
+ * safe to apply the LP_DEAD hints.
+ */
+ if (PageGetLSN(page) != so->curPageLSN)
+ {
+ UnlockReleaseBuffer(buffer);
+ so->numKilled = 0; /* reset counter */
+ return;
+ }
+
+ /*
+ * Mark all killedItems as dead. No additional recheck is needed: if the
+ * page had been modified, its LSN would have changed.
+ */
+ for (i = 0; i < so->numKilled; i++)
+ {
+ offnum = so->killedItems[i];
+ iid = PageGetItemId(page, offnum);
+ ItemIdMarkDead(iid);
+ killedsomething = true;
+ }
+
+ if (killedsomething)
+ {
+ GistMarkPageHasGarbage(page);
+ MarkBufferDirtyHint(buffer, true);
+ }
+
+ UnlockReleaseBuffer(buffer);
+
+ /*
+ * Always reset the scan state, so we don't look for the same items on
+ * other pages.
+ */
+ so->numKilled = 0;
+ }
/*
* gistindex_keytest() -- does this index tuple satisfy the scan key(s)?
***************
*** 306,322 **** gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances,
MemoryContextReset(so->pageDataCxt);
/*
* check all tuples on page
*/
maxoff = PageGetMaxOffsetNumber(page);
for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
{
! IndexTuple it = (IndexTuple) PageGetItem(page, PageGetItemId(page, i));
bool match;
bool recheck;
bool recheck_distances;
/*
* Must call gistindex_keytest in tempCxt, and clean up any leftover
* junk afterward.
*/
--- 374,406 ----
MemoryContextReset(so->pageDataCxt);
/*
+ * We save the LSN of the page as we read it, so that we know whether it
+ * is safe to apply LP_DEAD hints to the page later. This allows us to
+ * drop the pin for MVCC scans, which allows vacuum to avoid blocking.
+ */
+ so->curPageLSN = PageGetLSN(page);
+
+ /*
* check all tuples on page
*/
maxoff = PageGetMaxOffsetNumber(page);
for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
{
! ItemId iid = PageGetItemId(page, i);
! IndexTuple it;
bool match;
bool recheck;
bool recheck_distances;
/*
+ * If the scan specifies not to return killed tuples, then we treat a
+ * killed tuple as not passing the qual.
+ */
+ if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+ continue;
+
+ it = (IndexTuple) PageGetItem(page, iid);
+ /*
* Must call gistindex_keytest in tempCxt, and clean up any leftover
* junk afterward.
*/
***************
*** 331,337 **** gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances,
/* Ignore tuple if it doesn't match */
if (!match)
continue;
-
if (tbm && GistPageIsLeaf(page))
{
/*
--- 415,420 ----
*** a/src/backend/access/gist/gistscan.c
--- b/src/backend/access/gist/gistscan.c
***************
*** 93,98 **** gistbeginscan(PG_FUNCTION_ARGS)
--- 93,103 ----
memset(scan->xs_orderbynulls, true, sizeof(bool) * scan->numberOfOrderBys);
}
+ so->killedItems = NULL; /* until needed */
+ so->numKilled = 0;
+ so->curBlkno = InvalidBlockNumber;
+ so->curPageLSN = InvalidXLogRecPtr;
+
scan->opaque = so;
/*
*** a/src/include/access/gist.h
--- b/src/include/access/gist.h
***************
*** 41,48 ****
*/
#define F_LEAF (1 << 0) /* leaf page */
#define F_DELETED (1 << 1) /* the page has been deleted */
! #define F_TUPLES_DELETED (1 << 2) /* some tuples on the page are dead */
#define F_FOLLOW_RIGHT (1 << 3) /* page to the right has no downlink */
typedef XLogRecPtr GistNSN;
--- 41,51 ----
*/
#define F_LEAF (1 << 0) /* leaf page */
#define F_DELETED (1 << 1) /* the page has been deleted */
! #define F_TUPLES_DELETED (1 << 2) /* some tuples on the page were
! * deleted */
#define F_FOLLOW_RIGHT (1 << 3) /* page to the right has no downlink */
+ #define F_HAS_GARBAGE (1 << 4) /* some tuples on the page are dead,
+ * but not deleted yet */
typedef XLogRecPtr GistNSN;
***************
*** 137,142 **** typedef struct GISTENTRY
--- 140,149 ----
#define GistMarkTuplesDeleted(page) ( GistPageGetOpaque(page)->flags |= F_TUPLES_DELETED)
#define GistClearTuplesDeleted(page) ( GistPageGetOpaque(page)->flags &= ~F_TUPLES_DELETED)
+ #define GistPageHasGarbage(page) ( GistPageGetOpaque(page)->flags & F_HAS_GARBAGE)
+ #define GistMarkPageHasGarbage(page) ( GistPageGetOpaque(page)->flags |= F_HAS_GARBAGE)
+ #define GistClearPageHasGarbage(page) ( GistPageGetOpaque(page)->flags &= ~F_HAS_GARBAGE)
+
#define GistFollowRight(page) ( GistPageGetOpaque(page)->flags & F_FOLLOW_RIGHT)
#define GistMarkFollowRight(page) ( GistPageGetOpaque(page)->flags |= F_FOLLOW_RIGHT)
#define GistClearFollowRight(page) ( GistPageGetOpaque(page)->flags &= ~F_FOLLOW_RIGHT)
*** a/src/include/access/gist_private.h
--- b/src/include/access/gist_private.h
***************
*** 22,27 ****
--- 22,28 ----
#include "storage/bufmgr.h"
#include "storage/buffile.h"
#include "utils/hsearch.h"
+ #include "access/genam.h"
/*
* Maximum number of "halves" a page can be split into in one operation.
***************
*** 124,129 **** typedef struct GISTSearchHeapItem
--- 125,131 ----
bool recheckDistances; /* T if distances must be rechecked */
IndexTuple ftup; /* data fetched back from the index, used in
* index-only scans */
+ OffsetNumber offnum; /* offset of the index tuple on its page */
} GISTSearchHeapItem;
/* Unvisited item, either index page or heap tuple */
***************
*** 161,166 **** typedef struct GISTScanOpaqueData
--- 163,174 ----
/* pre-allocated workspace arrays */
double *distances; /* output area for gistindex_keytest */
+ /* info about killed items if any (killedItems is NULL if never used) */
+ OffsetNumber *killedItems; /* offset numbers of killed items */
+ int numKilled; /* number of currently stored items */
+ BlockNumber curBlkno; /* current block number */
+ GistNSN curPageLSN; /* pos in the WAL stream when page was read */
+
/* In a non-ordered search, returnable heap items are stored here: */
GISTSearchHeapItem pageData[BLCKSZ / sizeof(IndexTupleData)];
OffsetNumber nPageData; /* number of valid items in array */
Something has gone wrong:
gist.c:1465:5: warning: unused variable 'minoff' [-Wunused-variable]
minoff,
gistget.c:37:1: warning: unused function 'gistkillitems' [-Wunused-function]
gistkillitems(IndexScanDesc scan)
Without microvacuum, all 'select' queries take about the same time:
Time: 360,468 ms
Time: 243,197 ms
Time: 238,310 ms
Time: 238,018 ms
With microvacuum, the first 'select' invokes gistkillitems() and is a bit
slower than before, but the following queries are significantly faster than
without microvacuum:
Time: 368,780 ms
Time: 69,769 ms
Time: 9,545 ms
Time: 12,427 ms
That's perfect, but I can't reproduce it, presumably because of the "unused
function 'gistkillitems'" warning.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
BTW, I slightly modified your test to provide more stable results.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Attachments:
Fixed patch is attached.
08.09.2015 13:47, Teodor Sigaev:
BTW, I slightly modified your test to provide more stable results.
Thank you, I tried it. The results are nearly the same.
Without microvacuum:
Time: 312,955 ms
Time: 264,597 ms
Time: 243,286 ms
Time: 243,679 ms
With microvacuum:
Time: 354,097 ms
Time: 82,206 ms
Time: 11,714 ms
Time: 11,277 ms
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
microvacuum_for_gist_6.patchtext/x-patch; name=microvacuum_for_gist_6.patchDownload
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 0e49959..bb48782 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -36,6 +36,7 @@ static bool gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
bool unlockbuf, bool unlockleftchild);
static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
GISTSTATE *giststate, List *splitinfo, bool releasebuf);
+static void gistvacuumpage(Relation rel, Page page, Buffer buffer);
#define ROTATEDIST(d) do { \
@@ -209,6 +210,17 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
* because the tuple vector passed to gistSplit won't include this tuple.
*/
is_split = gistnospace(page, itup, ntup, oldoffnum, freespace);
+
+ /*
+ * If the leaf page is full, first try to delete dead tuples, and then
+ * check again.
+ */
+ if (is_split && GistPageIsLeaf(page) && GistPageHasGarbage(page))
+ {
+ gistvacuumpage(rel, page, buffer);
+ is_split = gistnospace(page, itup, ntup, oldoffnum, freespace);
+ }
+
if (is_split)
{
/* no space for insertion */
@@ -1440,3 +1452,70 @@ freeGISTstate(GISTSTATE *giststate)
/* It's sufficient to delete the scanCxt */
MemoryContextDelete(giststate->scanCxt);
}
+
+/*
+ * gistvacuumpage() -- try to remove LP_DEAD items from the given page.
+ */
+static void
+gistvacuumpage(Relation rel, Page page, Buffer buffer)
+{
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+ OffsetNumber offnum, maxoff;
+
+ /*
+ * Scan over all items to see which ones need to be deleted according to
+ * LP_DEAD flags.
+ */
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (offnum = FirstOffsetNumber;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemId = PageGetItemId(page, offnum);
+
+ if (ItemIdIsDead(itemId))
+ deletable[ndeletable++] = offnum;
+ }
+
+ if (ndeletable > 0)
+ {
+ START_CRIT_SECTION();
+
+ PageIndexMultiDelete(page, deletable, ndeletable);
+
+ /*
+ * Mark the page as not containing any LP_DEAD items. This is not
+ * certainly true (there might be some that have recently been marked,
+ * but weren't included in our target-item list), but it will almost
+ * always be true and it doesn't seem worth an additional page scan to
+ * check it. Remember that F_HAS_GARBAGE is only a hint anyway.
+ */
+ GistClearPageHasGarbage(page);
+
+ MarkBufferDirty(buffer);
+
+ /* XLOG stuff */
+ if (RelationNeedsWAL(rel))
+ {
+ XLogRecPtr recptr;
+
+ recptr = gistXLogUpdate(rel->rd_node, buffer,
+ deletable, ndeletable,
+ NULL, 0, InvalidBuffer);
+
+ PageSetLSN(page, recptr);
+ }
+ else
+ PageSetLSN(page, gistGetFakeLSN(rel));
+
+ END_CRIT_SECTION();
+ }
+
+ /*
+ * Note: if we didn't find any LP_DEAD items, then the page's
+ * F_HAS_GARBAGE hint bit is falsely set. We do not bother expending a
+ * separate write to clear it, however. We will clear it when we split
+ * the page.
+ */
+}
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index 20f695c..8a8d121 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -24,6 +24,74 @@
#include "utils/memutils.h"
#include "utils/rel.h"
+/*
+ * gistkillitems() -- set LP_DEAD state for items an indexscan caller has
+ * told us were killed.
+ *
+ * We re-read page here, so it's important to check page LSN. If the page
+ * has been modified since the last read (as determined by LSN), we cannot
+ * flag any entries because it is possible that the old entry was vacuumed
+ * away and the TID was re-used by a completely different heap tuple.
+ */
+static void
+gistkillitems(IndexScanDesc scan)
+{
+ GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+ Buffer buffer;
+ Page page;
+ OffsetNumber offnum;
+ ItemId iid;
+ int i;
+ bool killedsomething = false;
+
+ Assert(so->curBlkno != InvalidBlockNumber);
+ Assert(so->killedItems != NULL);
+
+ buffer = ReadBuffer(scan->indexRelation, so->curBlkno);
+ if (!BufferIsValid(buffer))
+ return;
+
+ LockBuffer(buffer, GIST_SHARE);
+ gistcheckpage(scan->indexRelation, buffer);
+ page = BufferGetPage(buffer);
+
+ /*
+ * If the page LSN differs, the page was modified since the last read.
+ * The saved offsets in killedItems may no longer be valid, so it is not
+ * safe to apply the LP_DEAD hints.
+ */
+ if (PageGetLSN(page) != so->curPageLSN)
+ {
+ UnlockReleaseBuffer(buffer);
+ so->numKilled = 0; /* reset counter */
+ return;
+ }
+
+ /*
+ * Mark all killedItems as dead. No additional recheck is needed: if the
+ * page had been modified, its LSN would have changed.
+ */
+ for (i = 0; i < so->numKilled; i++)
+ {
+ offnum = so->killedItems[i];
+ iid = PageGetItemId(page, offnum);
+ ItemIdMarkDead(iid);
+ killedsomething = true;
+ }
+
+ if (killedsomething)
+ {
+ GistMarkPageHasGarbage(page);
+ MarkBufferDirtyHint(buffer, true);
+ }
+
+ UnlockReleaseBuffer(buffer);
+
+ /*
+ * Always reset the scan state, so we don't look for the same items on
+ * other pages.
+ */
+ so->numKilled = 0;
+}
/*
* gistindex_keytest() -- does this index tuple satisfy the scan key(s)?
@@ -306,17 +374,33 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances,
MemoryContextReset(so->pageDataCxt);
/*
+ * We save the LSN of the page as we read it, so that we know whether it
+ * is safe to apply LP_DEAD hints to the page later. This allows us to
+ * drop the pin for MVCC scans, which allows vacuum to avoid blocking.
+ */
+ so->curPageLSN = PageGetLSN(page);
+
+ /*
* check all tuples on page
*/
maxoff = PageGetMaxOffsetNumber(page);
for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
{
- IndexTuple it = (IndexTuple) PageGetItem(page, PageGetItemId(page, i));
+ ItemId iid = PageGetItemId(page, i);
+ IndexTuple it;
bool match;
bool recheck;
bool recheck_distances;
/*
+ * If the scan specifies not to return killed tuples, then we treat a
+ * killed tuple as not passing the qual.
+ */
+ if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+ continue;
+
+ it = (IndexTuple) PageGetItem(page, iid);
+ /*
* Must call gistindex_keytest in tempCxt, and clean up any leftover
* junk afterward.
*/
@@ -331,7 +415,6 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances,
/* Ignore tuple if it doesn't match */
if (!match)
continue;
-
if (tbm && GistPageIsLeaf(page))
{
/*
@@ -348,6 +431,7 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances,
*/
so->pageData[so->nPageData].heapPtr = it->t_tid;
so->pageData[so->nPageData].recheck = recheck;
+ so->pageData[so->nPageData].offnum = i;
/*
* In an index-only scan, also fetch the data from the tuple.
@@ -572,7 +656,24 @@ gistgettuple(PG_FUNCTION_ARGS)
{
if (so->curPageData < so->nPageData)
{
+ if (scan->kill_prior_tuple && so->curPageData > 0)
+ {
+ if (so->killedItems == NULL)
+ {
+ MemoryContext oldCxt =
+ MemoryContextSwitchTo(so->giststate->scanCxt);
+
+ so->killedItems =
+ (OffsetNumber *) palloc(MaxIndexTuplesPerPage
+ * sizeof(OffsetNumber));
+
+ MemoryContextSwitchTo(oldCxt);
+ }
+ if (so->numKilled < MaxIndexTuplesPerPage)
+ so->killedItems[so->numKilled++] =
+ so->pageData[so->curPageData - 1].offnum;
+ }
/* continuing to return tuples from a leaf page */
scan->xs_ctup.t_self = so->pageData[so->curPageData].heapPtr;
scan->xs_recheck = so->pageData[so->curPageData].recheck;
@@ -586,9 +687,36 @@ gistgettuple(PG_FUNCTION_ARGS)
PG_RETURN_BOOL(true);
}
+ /*
+ * Check the last returned tuple and add it to killedItems if
+ * necessary
+ */
+ if (scan->kill_prior_tuple
+ && so->curPageData > 0
+ && so->curPageData == so->nPageData)
+ {
+
+ if (so->killedItems == NULL)
+ {
+ MemoryContext oldCxt =
+ MemoryContextSwitchTo(so->giststate->scanCxt);
+
+ so->killedItems =
+ (OffsetNumber *) palloc(MaxIndexTuplesPerPage
+ * sizeof(OffsetNumber));
+
+ MemoryContextSwitchTo(oldCxt);
+ }
+ if (so->numKilled < MaxIndexTuplesPerPage)
+ so->killedItems[so->numKilled++] =
+ so->pageData[so->curPageData - 1].offnum;
+ }
/* find and process the next index page */
do
{
+ if ((so->curBlkno != InvalidBlockNumber) && (so->numKilled > 0))
+ gistkillitems(scan);
+
GISTSearchItem *item = getNextGISTSearchItem(so);
if (!item)
@@ -596,6 +724,9 @@ gistgettuple(PG_FUNCTION_ARGS)
CHECK_FOR_INTERRUPTS();
+ /* save current item BlockNumber for next gistkillitems() call */
+ so->curBlkno = item->blkno;
+
/*
* While scanning a leaf page, ItemPointers of matching heap
* tuples are stored in so->pageData. If there are any on
diff --git a/src/backend/access/gist/gistscan.c b/src/backend/access/gist/gistscan.c
index ad39294..a17c5bc 100644
--- a/src/backend/access/gist/gistscan.c
+++ b/src/backend/access/gist/gistscan.c
@@ -93,6 +93,11 @@ gistbeginscan(PG_FUNCTION_ARGS)
memset(scan->xs_orderbynulls, true, sizeof(bool) * scan->numberOfOrderBys);
}
+ so->killedItems = NULL; /* until needed */
+ so->numKilled = 0;
+ so->curBlkno = InvalidBlockNumber;
+ so->curPageLSN = InvalidXLogRecPtr;
+
scan->opaque = so;
/*
diff --git a/src/include/access/gist.h b/src/include/access/gist.h
index 81e559b..ea3a3b0 100644
--- a/src/include/access/gist.h
+++ b/src/include/access/gist.h
@@ -41,8 +41,11 @@
*/
#define F_LEAF (1 << 0) /* leaf page */
#define F_DELETED (1 << 1) /* the page has been deleted */
-#define F_TUPLES_DELETED (1 << 2) /* some tuples on the page are dead */
+#define F_TUPLES_DELETED (1 << 2) /* some tuples on the page were
+ * deleted */
#define F_FOLLOW_RIGHT (1 << 3) /* page to the right has no downlink */
+#define F_HAS_GARBAGE (1 << 4) /* some tuples on the page are dead,
+ * but not deleted yet */
typedef XLogRecPtr GistNSN;
@@ -137,6 +140,10 @@ typedef struct GISTENTRY
#define GistMarkTuplesDeleted(page) ( GistPageGetOpaque(page)->flags |= F_TUPLES_DELETED)
#define GistClearTuplesDeleted(page) ( GistPageGetOpaque(page)->flags &= ~F_TUPLES_DELETED)
+#define GistPageHasGarbage(page) ( GistPageGetOpaque(page)->flags & F_HAS_GARBAGE)
+#define GistMarkPageHasGarbage(page) ( GistPageGetOpaque(page)->flags |= F_HAS_GARBAGE)
+#define GistClearPageHasGarbage(page) ( GistPageGetOpaque(page)->flags &= ~F_HAS_GARBAGE)
+
#define GistFollowRight(page) ( GistPageGetOpaque(page)->flags & F_FOLLOW_RIGHT)
#define GistMarkFollowRight(page) ( GistPageGetOpaque(page)->flags |= F_FOLLOW_RIGHT)
#define GistClearFollowRight(page) ( GistPageGetOpaque(page)->flags &= ~F_FOLLOW_RIGHT)
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 4f1a5c3..86e972e 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -22,6 +22,7 @@
#include "storage/bufmgr.h"
#include "storage/buffile.h"
#include "utils/hsearch.h"
+#include "access/genam.h"
/*
* Maximum number of "halves" a page can be split into in one operation.
@@ -124,6 +125,7 @@ typedef struct GISTSearchHeapItem
bool recheckDistances; /* T if distances must be rechecked */
IndexTuple ftup; /* data fetched back from the index, used in
* index-only scans */
+ OffsetNumber offnum; /* offset of the index tuple on its page */
} GISTSearchHeapItem;
/* Unvisited item, either index page or heap tuple */
@@ -161,6 +163,12 @@ typedef struct GISTScanOpaqueData
/* pre-allocated workspace arrays */
double *distances; /* output area for gistindex_keytest */
+ /* info about killed items if any (killedItems is NULL if never used) */
+ OffsetNumber *killedItems; /* offset numbers of killed items */
+ int numKilled; /* number of currently stored items */
+ BlockNumber curBlkno; /* current block number */
+ GistNSN curPageLSN; /* pos in the WAL stream when page was read */
+
/* In a non-ordered search, returnable heap items are stored here: */
GISTSearchHeapItem pageData[BLCKSZ / sizeof(IndexTupleData)];
OffsetNumber nPageData; /* number of valid items in array */
On 8 September 2015 at 22:35, Anastasia Lubennikova <
a.lubennikova@postgrespro.ru> wrote:
Fixed patch is attached.
08.09.2015 13:47, Teodor Sigaev:
BTW, I slightly modify your test to provide more stable results.
Thank you, I tried it, Results are nearly the same.
Without microvacuum:
Time: 312,955 ms
Time: 264,597 ms
Time: 243,286 ms
Time: 243,679 ms
With microvacuum:
Time: 354,097 ms
Time: 82,206 ms
Time: 11,714 ms
Time: 11,277 ms
Looks good to me (except for the initial hit):
Without microvacuum: 1st run | 2nd run
Time: 259.996 ms | 246.831 ms
Time: 169.820 ms | 169.501 ms
Time: 176.045 ms | 166.845 ms
Time: 169.230 ms | 167.637 ms
With microvacuum: 1st run | 2nd run
Time: 625.883 ms | 425.231 ms
Time: 10.891 ms | 10.603 ms
Time: 10.002 ms | 10.971 ms
Time: 11.613 ms | 11.643 ms
--
Thom
On Tue, Sep 8, 2015 at 2:35 PM, Anastasia Lubennikova <
a.lubennikova@postgrespro.ru> wrote:
Fixed patch is attached.
The commit of this patch seems to have created a bug in which updated
tuples can disappear from the index, while remaining in the table.
It looks like the bug depends on going through a crash-recovery cycle, but
I am not sure of that yet.
I've looked through the commit diff and don't see anything obviously
wrong. I notice index tuples are marked dead with only a buffer content
share lock, and the page is defragmented with only a buffer exclusive lock
(as opposed to a super-exclusive buffer clean up lock). But as far as I
can tell, both of those should be safe on an index. Also, if that was the
bug, it should happen without crash-recovery.
The test is pretty simple. I create a 10,000 row table with a
unique-by-construction id column with a btree_gist index on it and a
counter column, and fire single-row updates of the counter for random ids
in high concurrency (8 processes running flat out). I force the server to
crash frequently with simulated torn-page writes in which md.c writes a
partial page and then PANICs. Eventually (1 to 3 hours) the updates start
indicating they updated 0 rows. At that point, a forced table scan will
find the row, but the index doesn't.
Any hints on how to proceed with debugging this? If I can't reproduce the
problem in the absence of crash-recovery cycles with an overnight run, then I
think my next step will be to run it over hot-standby and see if WAL replay in
the absence of crashes might be broken as well.
Cheers,
Jeff
16.09.2015 07:30, Jeff Janes:
The commit of this patch seems to have created a bug in which updated
tuples can disappear from the index, while remaining in the table.
It looks like the bug depends on going through a crash-recovery cycle, but
I am not sure of that yet.
I've looked through the commit diff and don't see anything obviously
wrong. I notice index tuples are marked dead with only a buffer content
share lock, and the page is defragmented with only a buffer exclusive lock
(as opposed to a super-exclusive buffer clean up lock). But as far as I
can tell, both of those should be safe on an index. Also, if that was the
bug, it should happen without crash-recovery.
The test is pretty simple. I create a 10,000 row table with a
unique-by-construction id column with a btree_gist index on it and a
counter column, and fire single-row updates of the counter for random ids
in high concurrency (8 processes running flat out). I force the server to
crash frequently with simulated torn-page writes in which md.c writes a
partial page and then PANICs. Eventually (1 to 3 hours) the updates start
indicating they updated 0 rows. At that point, a forced table scan will
find the row, but the index doesn't.
Any hints on how to proceed with debugging this? If I can't reproduce the
problem in the absence of crash-recovery cycles with an overnight run, then I
think my next step will be to run it over hot-standby and see if WAL replay in
the absence of crashes might be broken as well.
I've found the bug. It's caused by mixing calls to
PageIndexMultiDelete() in gistvacuumpage() with
PageIndexTupleDelete() in gistRedoPageUpdateRecord():
the offsets the two functions expect are not the same once tuples start moving.
I've fixed my patch by changing MultiDelete to TupleDelete in
gistvacuumpage(). The patch is attached.
But it seems to me that it would be better to rewrite all uses of
TupleDelete to MultiDelete in the GiST code.
I'm working on it.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
micro_fix.patchtext/x-patch; name=micro_fix.patchDownload
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 4edc5a7..1d02e1b 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -1478,14 +1478,20 @@ gistvacuumpage(Relation rel, Page page, Buffer buffer)
ItemId itemId = PageGetItemId(page, offnum);
if (ItemIdIsDead(itemId))
- deletable[ndeletable++] = offnum;
+ {
+ deletable[ndeletable] = offnum - ndeletable;
+ ndeletable++;
+ }
}
if (ndeletable > 0)
{
+ int i;
+
START_CRIT_SECTION();
- PageIndexMultiDelete(page, deletable, ndeletable);
+ for (i = 0; i < ndeletable; i++)
+ PageIndexTupleDelete(page, deletable[i]);
/*
* Mark the page as not containing any LP_DEAD items. This is not
But it seems to me that it would be better to rewrite all uses of
TupleDelete to MultiDelete in the GiST code.
Sure. The patch is attached, and it changes the WAL format, so be careful with
testing.
Please have a look.
Also attached are scripts that reproduce the bug from Jeff's report:
g.pl - creates and fills the test table
w.pl - worker, can be run in several sessions
Usage:
perl g.pl | psql contrib_regression
perl w.pl | psql contrib_regression | grep 'UPDATE 0'
then killall -9 postgres while w.pl is running. Recovery will fail with high
probability.
Thank you, Jeff, for the report.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Attachments:
gist_delete.patchtext/plain; charset=UTF-8; name=gist_delete.patchDownload
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 4edc5a7..53bccf6 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -466,6 +466,11 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
*/
START_CRIT_SECTION();
+ /*
+ * While we delete only one tuple at once we could mix calls
+ * PageIndexTupleDelete() here and PageIndexMultiDelete() in
+ * gistRedoPageUpdateRecord()
+ */
if (OffsetNumberIsValid(oldoffnum))
PageIndexTupleDelete(page, oldoffnum);
gistfillbuffer(page, itup, ntup, InvalidOffsetNumber);
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 2337dbd..a0b0eeb 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -208,23 +208,20 @@ gistbulkdelete(PG_FUNCTION_ARGS)
idxtuple = (IndexTuple) PageGetItem(page, iid);
if (callback(&(idxtuple->t_tid), callback_state))
- {
- todelete[ntodelete] = i - ntodelete;
- ntodelete++;
- stats->tuples_removed += 1;
- }
+ todelete[ntodelete++] = i;
else
stats->num_index_tuples += 1;
}
+ stats->tuples_removed += ntodelete;
+
if (ntodelete)
{
START_CRIT_SECTION();
MarkBufferDirty(buffer);
- for (i = 0; i < ntodelete; i++)
- PageIndexTupleDelete(page, todelete[i]);
+ PageIndexMultiDelete(page, todelete, ntodelete);
GistMarkTuplesDeleted(page);
if (RelationNeedsWAL(rel))
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index fbdbb3c..c63cc81 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -83,13 +83,11 @@ gistRedoPageUpdateRecord(XLogReaderState *record)
/* Delete old tuples */
if (xldata->ntodelete > 0)
{
- int i;
OffsetNumber *todelete = (OffsetNumber *) data;
data += sizeof(OffsetNumber) * xldata->ntodelete;
- for (i = 0; i < xldata->ntodelete; i++)
- PageIndexTupleDelete(page, todelete[i]);
+ PageIndexMultiDelete(page, todelete, xldata->ntodelete);
if (GistPageIsLeaf(page))
GistMarkTuplesDeleted(page);
}
On Wed, Sep 16, 2015 at 8:36 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
But it seems to me that it would be better to rewrite all uses of
TupleDelete to MultiDelete in the GiST code.
Sure. The patch is attached, and it changes the WAL format, so be careful with
testing.
Please have a look.
Also attached are scripts that reproduce the bug from Jeff's report:
g.pl - creates and fills the test table
w.pl - worker, can be run in several sessions
Usage:
perl g.pl | psql contrib_regression
perl w.pl | psql contrib_regression | grep 'UPDATE 0'
then killall -9 postgres while w.pl is running. Recovery will fail with high
probability.
Thank you, Jeff, for the report.
Thanks, that seems to have fixed it.
But I don't understand this comment:
+ /*
+ * While we delete only one tuple at once we could mix calls
+ * PageIndexTupleDelete() here and PageIndexMultiDelete() in
+ * gistRedoPageUpdateRecord()
+ */
Does this mean:
Since we delete only one tuple per WAL record here, we can call
PageIndexTupleDelete() here and re-play it with PageIndexMultiDelete() in
gistRedoPageUpdateRecord()?
Thanks,
Jeff
But I don't understand this comment:
+ /*
+ * While we delete only one tuple at once we could mix calls
+ * PageIndexTupleDelete() here and PageIndexMultiDelete() in
+ * gistRedoPageUpdateRecord()
+ */
Does this mean:
Since we delete only one tuple per WAL record here, we can call
PageIndexTupleDelete() here and re-play it with PageIndexMultiDelete() in
gistRedoPageUpdateRecord()?
Yes. The problem is that when we delete the tuples at offsets (2,4,6) with
PageIndexMultiDelete(), exactly those tuples are removed. But if we delete a
tuple with PageIndexTupleDelete() at offset 2, then the tuple at offset 4 moves
to offset 3 and the one at offset 6 becomes 5. So the next tuple to delete is
now at offset 3, and we should call PageIndexTupleDelete(3), and so on. The bug
was that we deleted tuples in gistvacuumpage() with PageIndexMultiDelete() and
wrote its argument to WAL, while the recovery process used
PageIndexTupleDelete() without correcting the offsets.
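To illustrate the arithmetic, here is a tiny standalone example (plain C, not
PostgreSQL code): simulating single-tuple deletion shows that the i-th call has
to use the original offset minus i.

    #include <stdio.h>

    /* Remove the tuple at 1-based offset 'offnum', shifting higher offsets
     * down by one, the way PageIndexTupleDelete() compacts a page. */
    static void
    delete_at(int *tuples, int *n, int offnum)
    {
        int i;

        for (i = offnum - 1; i < *n - 1; i++)
            tuples[i] = tuples[i + 1];
        (*n)--;
    }

    int
    main(void)
    {
        /* tuples identified by their original offsets 1..7 */
        int page[] = {1, 2, 3, 4, 5, 6, 7};
        int n = 7;
        int todelete[] = {2, 4, 6};  /* offsets as seen before any deletion */
        int i;

        /* correct: the i-th single deletion uses (original offset - i) */
        for (i = 0; i < 3; i++)
            delete_at(page, &n, todelete[i] - i);

        /* prints "1 3 5 7": exactly the tuples at 2, 4 and 6 are gone */
        for (i = 0; i < n; i++)
            printf("%d ", page[i]);
        printf("\n");

        return 0;
    }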
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/