Optimize single tuple fetch from nbtree index
Hi hackers,
While I was reviewing some code in another patch, I stumbled upon a possible optimization in the btree index code in nbtsearch.c for queries using 'LIMIT 1'. I have written a small patch that implements this optimization, but I'd like some feedback on the design of the patch, whether it is correct at all to use this optimization, and whether the performance tradeoffs are deemed worth it by the community.
Basically, an example of the case I'd like to optimize is the following. Given a table 'tbl' with an index on columns (k, ts DESC):
SELECT * FROM tbl WHERE k=:val AND ts<=:timestamp ORDER BY k, ts DESC LIMIT 1;
And, even more importantly, when this query gets called in an inner loop like:
SELECT * FROM generate_series(:start_ts, :end_ts, :interval) ts -- perhaps thousands of iterations; could also be a loop over values of 'k' rather than timestamps, this is just an example
CROSS JOIN LATERAL (
SELECT * FROM tbl WHERE k=:val AND ts<=:timestamp ORDER BY k, ts DESC LIMIT 1
) _;
With time-series data, this case often arises because you have a certain natural key for which you store updates as they occur. Getting the state of k at a specific time then boils down to the query given above, which is almost always the fastest way to get this information, since an index scan with LIMIT 1 is very fast already. However, there seems to be a possibility to make this even faster (up to nearly 3x faster in test cases that use this nested loop of index scans).
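For concreteness, the kind of schema I have in mind looks roughly like this (a minimal sketch; the column names and types here are illustrative, the attached init_test.sql has the actual definitions used for the benchmarks):

CREATE TABLE tbl (
    k       integer NOT NULL,      -- natural key of the time series
    ts      timestamptz NOT NULL,  -- time at which this update occurred
    payload double precision       -- stands in for whatever value columns are stored
);

CREATE INDEX ON tbl (k, ts DESC);  -- the index the LIMIT 1 scan descends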
Every time the index scan is done, all tuples from the leaf page are read in nbtsearch.c:_bt_readpage. The idea of this patch is to make an exception to this, but *only* the first time amgettuple gets called. That call goes through _bt_first in nbtsearch.c, which will, if there are scankeys, descend the tree to a leaf page and read just the first tuple (or possibly two). It won't touch the rest of the page yet. If indeed just one tuple was required, there won't be a call to _bt_next and we're done. If we do need more than one tuple, _bt_next will resume reading tuples from the index page at the point where we left off.
There are a few caveats:
- Possible performance decrease for queries that need a small number of tuples (but more than one), because they now need to lock the same page twice. This can happen in several cases, for example: LIMIT 3; LIMIT 1 where the first tuple returned does not match other scan conditions; LIMIT 1 where the tuple returned is not visible; no LIMIT at all but only a few matching rows.
- We need to take into account page splits, insertions and vacuums while we do not have the read-lock in between _bt_first and the first call to _bt_next. This made my patch quite a bit more complicated than my initial implementation.
I did performance tests for some best-case and worst-case test scenarios. TPS results were stable and reproducible in re-runs on my otherwise idle server. Attached are the full results and how to reproduce them. I picked test cases that show the best as well as the worst performance compared to master. Summary: the greatest performance improvement can be seen for the cases with the subquery in a nested loop. For a nested loop of 100 iterations, performance is roughly two times better; for 10000 iterations it is roughly three times better. For most test cases that don't use LIMIT 1, I couldn't find a noticeable difference, except for the nested loop with a LIMIT 3 (or similarly, a nested loop without any LIMIT clause that returns just three tuples). This is also theoretically the worst-case scenario, because it has to lock the page again and then read it, just for one tuple. In this case, I saw TPS decrease by 2-3% in a few cases (details in the attached file), due to having to lock/unlock the same page in both _bt_first and _bt_next.
A few concrete questions to the community:
- Does the community also see this as a useful optimization?
- Is the way it is currently implemented safe? I struggled quite a bit to get everything working with respect to page splits and insertions. In particular, I don't know why, in my patch, _bt_find_offset_for_tid needs to consider searching for items with an offset *before* the passed offset. As far as my understanding goes, this could only happen when the index gets vacuumed in the meantime. However, we hold a pin on the buffer the whole time (we even assert this), so vacuum should not have been possible. Still, this code gets triggered sometimes, so it seems to be necessary. Perhaps someone in the community who's more of an expert on this can shed some light on it.
- What are considered acceptable performance tradeoffs in this case? Is a performance degradation in any part generally not acceptable at all?
I'd also welcome any feedback on the process - this is my first patch and while I tried to follow the guidelines, I may have missed something along the way.
Attachments:
- out_limit.txt: pgbench results for patched branch
- out_master.txt: pgbench results for the master branch (can be diffed with out_limit.txt to efficiently see the differences)
- init_test.sql: creates a simple table for the test cases and fills it with data
- test_fwd.sql: the nested loop example, with parameters :nlimit and :nitems to specify how many rows per inner loop to limit to and how many iterations of the loop need to be done (a simplified sketch of this script follows the attachment list). We're returning a sum() over a column to make sure that transferring result data over the socket is not the bottleneck - this is to show the worst-case behavior of this patch. When selecting the full row instead of the sum, the performance difference is negligible for the 'bad' cases, while still providing great (up to 2.5x) improvements for the 'good' cases. This can be tested by changing the sum() to select a column per row instead.
- test_fwd_eq.sql: nested loop with simple unique equality select
- test_fwd_single.sql: single query with LIMIT without nested loop
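Roughly, the interesting part of test_fwd.sql looks like the following (a simplified sketch; :start_ts/:end_ts and the 'payload' column are placeholders, and in the real script the number of iterations is driven by :nitems - the attached file is authoritative):

SELECT sum(sub.payload)  -- sum() so that result transfer is not the bottleneck
FROM generate_series(:start_ts, :end_ts, :interval) s(t)
CROSS JOIN LATERAL (
    SELECT payload
    FROM tbl
    WHERE k = :val AND ts <= s.t
    ORDER BY k, ts DESC
    LIMIT :nlimit
) sub;

The parameters can be supplied at run time, e.g. via pgbench's -D option.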
-Floris
Attachments:
0001-Optimize-single-tuple-fetch-from-nbtree-index-v01.patch
From f1f8607c7492a5c52180226b4fd9734cdc6780c7 Mon Sep 17 00:00:00 2001
From: Floris <florisvannee@optiver.com>
Date: Wed, 31 Jul 2019 08:58:24 +0200
Subject: [PATCH] Optimize single tuple fetch from nbtree index
Before this patch, calling amgettuple on an nbtree index always reads all
tuples on the page before returning the first tuple. Consecutive calls do
not need to lock the buffer and this helps performance in general. However,
when only one tuple is required, there is a lot of extra processing,
which hurts performance for common queries with a LIMIT 1 clause.
This patch changes _bt_first, which gets called for the first index
tuple to be fetched, to read only the first tuple from a page.
Consecutive calls to _bt_next will read all tuples as before. This
greatly improves performance for statements that have LIMIT 1 and fetch
from an index.
Author: Floris van Nee
Reviewed-by: <empty so far>
---
src/backend/access/nbtree/nbtree.c | 4 +-
src/backend/access/nbtree/nbtsearch.c | 407 ++++++++++++++++++++++++++++++----
src/include/access/nbtree.h | 10 +-
3 files changed, 380 insertions(+), 41 deletions(-)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd528..3362104 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -244,7 +244,7 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
* _bt_first() to get the first item in the scan.
*/
if (!BTScanPosIsValid(so->currPos))
- res = _bt_first(scan, dir);
+ res = _bt_first(scan, dir, false);
else
{
/*
@@ -309,7 +309,7 @@ btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
do
{
/* Fetch the first page & tuple */
- if (_bt_first(scan, ForwardScanDirection))
+ if (_bt_first(scan, ForwardScanDirection, true))
{
/* Save tuple ID, and continue scanning */
heapTid = &scan->xs_heaptid;
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index c655dad..09cc176 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -23,11 +23,24 @@
#include "utils/lsyscache.h"
#include "utils/rel.h"
+/*
+ * Enum used in _bt_readpage to determine how many tuples
+ * on the page need to be read.
+ * ReadPageModeFirst reads at least the first matching tuple,
+ * but possibly reads two as an optimization, in order to properly
+ * determine continuescan in _bt_readpage.
+ */
+typedef enum ReadPageMode
+{
+ ReadPageModeFirst, /* read only the first tuple */
+ ReadPageModeContinue, /* read the rest of the tuples on the page */
+ ReadPageModeAll /* read all tuples on the page */
+} ReadPageMode;
static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
- OffsetNumber offnum);
+ OffsetNumber offnum, ReadPageMode readMode);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
@@ -37,6 +50,14 @@ static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
+static OffsetNumber _bt_find_offset_for_tid(
+ IndexScanDesc scan,
+ OffsetNumber offnum, ItemPointer tid);
+static OffsetNumber _bt_readnextpage_find_offset(IndexScanDesc scan,
+ BlockNumber blkno, ItemPointer tid);
+static OffsetNumber _bt_steppage_find_offset(IndexScanDesc scan, ItemPointer tid);
+static inline bool _bt_continue_scan_from_tid(
+ IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum, ItemPointer heapTid);
/*
@@ -743,7 +764,7 @@ _bt_compare(Relation rel,
* in locating the scan start position.
*/
bool
-_bt_first(IndexScanDesc scan, ScanDirection dir)
+_bt_first(IndexScanDesc scan, ScanDirection dir, bool readFullPage)
{
Relation rel = scan->indexRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
@@ -1304,7 +1325,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
/*
* Now load data from the first page of the scan.
*/
- if (!_bt_readpage(scan, dir, offnum))
+ if (!_bt_readpage(scan, dir, offnum, readFullPage ? ReadPageModeAll : ReadPageModeFirst))
{
/*
* There's no actually-matching data on this page. Try to advance to
@@ -1316,8 +1337,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
}
else
{
- /* Drop the lock, and maybe the pin, on the current page */
- _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ /* Drop the lock on the page. Do not drop the pin here because we could have
+ * read just one item in ReadPageModeFirst mode so we may need to read it again.
+ */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
}
readcomplete:
@@ -1331,6 +1354,216 @@ readcomplete:
}
/*
+ * Keep reading pages to the right of blkno until we find an item with heap tid as specified.
+ * It returns the offset number on the index page of the tuple with that tid.
+ * On entry, buffer is either pinned or unpinned, but not locked.
+ * On exit, buffer is pinned and locked.
+ */
+OffsetNumber _bt_readnextpage_find_offset(IndexScanDesc scan, BlockNumber blkno, ItemPointer tid)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Relation rel;
+ Page page;
+ BTPageOpaque opaque;
+
+ rel = scan->indexRelation;
+
+ if (BTScanPosIsPinned(so->currPos))
+ LockBuffer(so->currPos.buf, BT_READ);
+ else
+ so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
+
+ page = BufferGetPage(so->currPos.buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ /* We need the real page to the right here.
+ * In case a split occurred, we can't follow the pre-stored pointer.
+ */
+ blkno = opaque->btpo_next;
+ _bt_relbuf(rel, so->currPos.buf);
+
+ for (;;)
+ {
+ if (blkno == P_NONE)
+ {
+ return InvalidOffsetNumber;
+ }
+ /* check for interrupts while we're not holding any buffer lock */
+ CHECK_FOR_INTERRUPTS();
+
+ so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
+
+ page = BufferGetPage(so->currPos.buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ /* check for deleted page */
+ if (!P_IGNORE(opaque))
+ {
+ /* reads current page for specified tid and return it if found */
+ OffsetNumber offnum = _bt_find_offset_for_tid(scan, P_FIRSTDATAKEY(opaque), tid);
+ if (offnum != InvalidOffsetNumber)
+ return offnum;
+ }
+
+ blkno = opaque->btpo_next;
+ _bt_relbuf(rel, so->currPos.buf);
+ }
+ /* We didn't find it anywhere. This should never happen, but return InvalidOffsetNumber so
+ * caller can raise an error if required.
+ */
+ return InvalidOffsetNumber;
+}
+
+
+/*
+ * Steps pages to the right until it finds a tuple with the specified heap tid.
+ * It returns the offset number on the index page of the tuple with that tid.
+ * This function just cleans up work on the current page and calls _bt_readnextpage_find_offset
+ * which does the rest of the work.
+ * On entry, buffer is either pinned or unpinned, but not locked.
+ * On exit, buffer is pinned and locked.
+ */
+OffsetNumber _bt_steppage_find_offset(IndexScanDesc scan, ItemPointer tid)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BlockNumber blkno = InvalidBlockNumber;
+
+ Assert(BTScanPosIsValid(so->currPos));
+
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ _bt_killitems(scan);
+
+ /*
+ * Before we modify currPos, make a copy of the page data if there was a
+ * mark position that needs it.
+ */
+ if (so->markItemIndex >= 0)
+ {
+ /* bump pin on current buffer for assignment to mark buffer */
+ if (BTScanPosIsPinned(so->currPos))
+ IncrBufferRefCount(so->currPos.buf);
+ memcpy(&so->markPos, &so->currPos,
+ offsetof(BTScanPosData, items[1]) +
+ so->currPos.lastItem * sizeof(BTScanPosItem));
+ if (so->markTuples)
+ memcpy(so->markTuples, so->currTuples,
+ so->currPos.nextTupleOffset);
+ so->markPos.itemIndex = so->markItemIndex;
+ so->markItemIndex = -1;
+ }
+
+ blkno = so->currPos.currPage;
+ return _bt_readnextpage_find_offset(scan, blkno, tid);
+}
+
+/*
+ * Looks on the current page for an item with heap tid as specified.
+ * On entry, buffer is pinned and locked.
+ * On exit, buffer is still pinned and locked.
+ */
+OffsetNumber _bt_find_offset_for_tid(IndexScanDesc scan, OffsetNumber offnum, ItemPointer tid)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber minoff;
+ OffsetNumber maxoff;
+ OffsetNumber curoff;
+
+ Assert(BufferIsValid(so->currPos.buf));
+
+ page = BufferGetPage(so->currPos.buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /* Start searching at offnum, because this is the most likely place
+ * where we will find the tuple.
+ * However, it may have been moved due to insertions or page splits.
+ * So if it's not at offnum, we need to look at the rest of the page.
+ * An insertion is most likely, so first start looking at offsets
+ * higher than offnum.
+ */
+ curoff = Max(minoff, offnum);
+ while (curoff <= maxoff)
+ {
+ ItemId iid = PageGetItemId(page, curoff);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
+
+ if (ItemPointerEquals(&itup->t_tid, tid))
+ {
+ return curoff;
+ }
+ curoff = OffsetNumberNext(curoff);
+ }
+ /*
+ * It seems to be required to also scan back from offnum to minoff, but it is
+ * unclear to me why this is necessary. Vacuum shouldn't be happening if we have
+ * the buffer pinned continuously (which is the case since we explicitly assert this
+ * in _bt_continue_scan_from_tid).
+ */
+ curoff = OffsetNumberPrev(Min(maxoff, offnum));
+ while (curoff >= minoff)
+ {
+ ItemId iid = PageGetItemId(page, curoff);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
+
+ if (ItemPointerEquals(&itup->t_tid, tid))
+ {
+ return curoff;
+ }
+ curoff = OffsetNumberPrev(curoff);
+ }
+ return InvalidOffsetNumber;
+}
+
+/*
+ * Continues scanning on the current page at the specified heapTid.
+ * The offset number is given as a hint as to where this tid is most likely found in the index,
+ * although it could have been moved due to e.g. insertions in the meantime.
+ * Buffer must be pinned but not locked.
+ * On exit, no lock is held while pin may or may not be released.
+ */
+bool _bt_continue_scan_from_tid(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum, ItemPointer heapTid)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ ReadPageMode readMode = ReadPageModeContinue;
+
+ Assert(BTScanPosIsPinned(so->currPos));
+ LockBuffer(so->currPos.buf, BT_READ);
+
+ offnum = _bt_find_offset_for_tid(scan, offnum, heapTid);
+ if (offnum == InvalidOffsetNumber)
+ {
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ offnum = _bt_steppage_find_offset(scan, heapTid);
+ if (offnum == InvalidOffsetNumber)
+ {
+ elog(ERROR, "fell off the end of index while searching for heap tid \"%s\"",
+ RelationGetRelationName(scan->indexRelation));
+ }
+ readMode = ReadPageModeAll;
+ }
+ offnum = ScanDirectionIsForward(dir) ? OffsetNumberNext(offnum) : OffsetNumberPrev(offnum);
+ if (!_bt_readpage(scan, dir, offnum, readMode))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to advance to
+ * the next page. Return false if there's no matching data at all.
+ */
+ LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+ if (!_bt_steppage(scan, dir))
+ return false;
+ }
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
+ return true;
+}
+
+/*
* _bt_next() -- Get the next item in a scan.
*
* On entry, so->currPos describes the current page, which may be pinned
@@ -1358,7 +1591,14 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
{
if (++so->currPos.itemIndex > so->currPos.lastItem)
{
- if (!_bt_steppage(scan, dir))
+ if (so->currPos.moreRightOnCurPage)
+ {
+ OffsetNumber offnum = so->currPos.items[so->currPos.itemIndex - 1].indexOffset;
+ ItemPointer heapTid = &so->currPos.items[so->currPos.itemIndex - 1].heapTid;
+ if (!_bt_continue_scan_from_tid(scan, dir, offnum, heapTid))
+ return false;
+ }
+ else if (!_bt_steppage(scan, dir))
return false;
}
}
@@ -1366,7 +1606,14 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
{
if (--so->currPos.itemIndex < so->currPos.firstItem)
{
- if (!_bt_steppage(scan, dir))
+ if (so->currPos.moreLeftOnCurPage)
+ {
+ OffsetNumber offnum = so->currPos.items[so->currPos.itemIndex + 1].indexOffset;
+ ItemPointer heapTid = &so->currPos.items[so->currPos.itemIndex + 1].heapTid;
+ if (!_bt_continue_scan_from_tid(scan, dir, offnum, heapTid))
+ return false;
+ }
+ else if (!_bt_steppage(scan, dir))
return false;
}
}
@@ -1399,8 +1646,9 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
*
* Returns true if any matching items found on the page, false if none.
*/
+
static bool
-_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
+_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum, ReadPageMode readMode)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Page page;
@@ -1408,6 +1656,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
OffsetNumber minoff;
OffsetNumber maxoff;
int itemIndex;
+ int startItemIndex;
bool continuescan;
int indnatts;
@@ -1434,28 +1683,38 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
minoff = P_FIRSTDATAKEY(opaque);
maxoff = PageGetMaxOffsetNumber(page);
- /*
- * We note the buffer's block number so that we can release the pin later.
- * This allows us to re-read the buffer if it is needed again for hinting.
- */
- so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+ if (readMode != ReadPageModeContinue)
+ {
+ /*
+ * We note the buffer's block number so that we can release the pin later.
+ * This allows us to re-read the buffer if it is needed again for hinting.
+ */
+ so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
- /*
- * We save the LSN of the page as we read it, so that we know whether it
- * safe to apply LP_DEAD hints to the page later. This allows us to drop
- * the pin for MVCC scans, which allows vacuum to avoid blocking.
- */
- so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
+ /*
+ * We save the LSN of the page as we read it, so that we know whether it
+ * safe to apply LP_DEAD hints to the page later. This allows us to drop
+ * the pin for MVCC scans, which allows vacuum to avoid blocking.
+ */
+ so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
- /*
- * we must save the page's right-link while scanning it; this tells us
- * where to step right to after we're done with these items. There is no
- * corresponding need for the left-link, since splits always go right.
- */
- so->currPos.nextPage = opaque->btpo_next;
+ /*
+ * we must save the page's right-link while scanning it; this tells us
+ * where to step right to after we're done with these items. There is no
+ * corresponding need for the left-link, since splits always go right.
+ */
+ so->currPos.nextPage = opaque->btpo_next;
- /* initialize tuple workspace to empty */
- so->currPos.nextTupleOffset = 0;
+ /* initialize tuple workspace to empty */
+ so->currPos.nextTupleOffset = 0;
+ }
+ else
+ {
+ /* we must save the next page link, because it may have changed while we
+ * didn't hold the lock
+ */
+ so->currPos.nextPage = opaque->btpo_next;
+ }
/*
* Now that the current page has been made consistent, the macro should be
@@ -1465,8 +1724,15 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
if (ScanDirectionIsForward(dir))
{
+ so->currPos.moreLeftOnCurPage = false;
+ so->currPos.moreRightOnCurPage = (readMode == ReadPageModeFirst);
+
/* load items[] in ascending order */
- itemIndex = 0;
+
+ if (readMode != ReadPageModeContinue)
+ itemIndex = startItemIndex = 0;
+ else
+ itemIndex = startItemIndex = so->currPos.itemIndex;
offnum = Max(offnum, minoff);
@@ -1492,6 +1758,29 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* tuple passes all scan key conditions, so remember it */
_bt_saveitem(so, itemIndex, offnum, itup);
itemIndex++;
+
+ if (readMode == ReadPageModeFirst)
+ {
+ /* we do compare one item further if possible
+ * in order to check if we can set continuescan to false
+ * this is useful for a PK search where the next item
+ * will never match
+ */
+ offnum = OffsetNumberNext(offnum);
+ if (offnum <= maxoff)
+ {
+ iid = PageGetItemId(page, offnum);
+ itup = (IndexTuple) PageGetItem(page, iid);
+
+ if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan) && (!scan->ignore_killed_tuples || !ItemIdIsDead(iid)))
+ {
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ itemIndex++;
+ }
+ offnum = OffsetNumberNext(offnum);
+ }
+ break;
+ }
}
/* When !continuescan, there can't be any more matches, so stop */
if (!continuescan)
@@ -1511,7 +1800,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
* only appear on non-pivot tuples on the right sibling page are
* common.
*/
- if (continuescan && !P_RIGHTMOST(opaque))
+ if (continuescan && readMode != ReadPageModeFirst && !P_RIGHTMOST(opaque))
{
ItemId iid = PageGetItemId(page, P_HIKEY);
IndexTuple itup = (IndexTuple) PageGetItem(page, iid);
@@ -1521,18 +1810,31 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
_bt_checkkeys(scan, itup, truncatt, dir, &continuescan);
}
+ if (offnum > maxoff)
+ so->currPos.moreRightOnCurPage = false;
if (!continuescan)
+ {
so->currPos.moreRight = false;
+ so->currPos.moreRightOnCurPage = false;
+ }
Assert(itemIndex <= MaxIndexTuplesPerPage);
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
- so->currPos.itemIndex = 0;
+ so->currPos.itemIndex = startItemIndex;
+
+ return (startItemIndex <= so->currPos.lastItem);
}
else
{
+ so->currPos.moreLeftOnCurPage = (readMode == ReadPageModeFirst);
+ so->currPos.moreRightOnCurPage = false;
+
/* load items[] in descending order */
- itemIndex = MaxIndexTuplesPerPage;
+ if (readMode != ReadPageModeContinue)
+ itemIndex = startItemIndex = MaxIndexTuplesPerPage;
+ else
+ itemIndex = startItemIndex = so->currPos.itemIndex + 1;
offnum = Min(offnum, maxoff);
@@ -1576,24 +1878,53 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* tuple passes all scan key conditions, so remember it */
itemIndex--;
_bt_saveitem(so, itemIndex, offnum, itup);
+
+ if (readMode == ReadPageModeFirst)
+ {
+ /* we do compare one item further if possible
+ * in order to check if we can set continuescan to false
+ * this is useful for a PK search where the next item
+ * will never match
+ */
+ offnum = OffsetNumberPrev(offnum);
+ if (offnum >= minoff)
+ {
+ iid = PageGetItemId(page, offnum);
+ itup = (IndexTuple) PageGetItem(page, iid);
+
+ if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan) && (!scan->ignore_killed_tuples || !ItemIdIsDead(iid)))
+ {
+ itemIndex--;
+ _bt_saveitem(so, itemIndex, offnum, itup);
+ }
+ offnum = OffsetNumberPrev(offnum);
+ }
+ break;
+ }
}
if (!continuescan)
{
/* there can't be any more matches, so stop */
- so->currPos.moreLeft = false;
break;
}
offnum = OffsetNumberPrev(offnum);
}
+ if (!continuescan)
+ {
+ so->currPos.moreLeft = false;
+ so->currPos.moreLeftOnCurPage = false;
+ }
+ if (offnum < minoff)
+ so->currPos.moreLeftOnCurPage = false;
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
- so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
- }
+ so->currPos.itemIndex = startItemIndex - 1;
- return (so->currPos.firstItem <= so->currPos.lastItem);
+ return (so->currPos.firstItem <= startItemIndex - 1);
+ }
}
/* Save an index item into so->currPos.items[itemIndex] */
@@ -1771,7 +2102,7 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
PredicateLockPage(rel, blkno, scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreRight if we can stop */
- if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque)))
+ if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque), ReadPageModeAll))
break;
}
else if (scan->parallel_scan != NULL)
@@ -1873,7 +2204,7 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
PredicateLockPage(rel, BufferGetBlockNumber(so->currPos.buf), scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreLeft if we can stop */
- if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page)))
+ if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page), ReadPageModeAll))
break;
}
else if (scan->parallel_scan != NULL)
@@ -2203,7 +2534,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
/*
* Now load data from the first page of the scan.
*/
- if (!_bt_readpage(scan, dir, start))
+ if (!_bt_readpage(scan, dir, start, ReadPageModeAll))
{
/*
* There's no actually-matching data on this page. Try to advance to
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 83e0e6c..9a3103b 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -554,6 +554,14 @@ typedef struct BTScanPosData
BlockNumber nextPage; /* page's right link when we scanned it */
/*
+ * These are used when _bt_first is told not to scan the full page
+ * but only the initial matching tuple. They indicate whether or not
+ * there is more data to be scanned on the current page.
+ */
+ bool moreLeftOnCurPage;
+ bool moreRightOnCurPage;
+
+ /*
* moreLeft and moreRight track whether we think there may be matching
* index entries to the left and right of the current page, respectively.
* We can clear the appropriate one of these flags when _bt_checkkeys()
@@ -775,7 +783,7 @@ extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
bool forupdate, BTStack stack, int access, Snapshot snapshot);
extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
-extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_first(IndexScanDesc scan, ScanDirection dir, bool readFullPage);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
--
2.4.6
Floris Van Nee <florisvannee@Optiver.com> writes:
Every time the index scan is done, all tuples from the leaf page are
read in nbtsearch.c:_bt_readpage. The idea of this patch is to make an
exception for this *only* the first time amgettuple gets called.
Regardless of whether there's actually a LIMIT 1? That seems disastrous
for every other case than the narrow one where the optimization wins.
Because every other case is now going to have to touch the index page
twice. That's more CPU and about double the contention --- if you could
not measure any degradation from that, you're not measuring the right
thing.
In principle, you could pass down knowledge of whether there's a LIMIT,
using the same mechanism used to enable top-N sorting. But it'd have to
also go through the AM interface layer, so I'm not sure how messy this
would be.
This calls _bt_first in nbtsearch.c, which will, if there are scankeys, descend the tree to a leaf page and read just the first (or possibly two) tuples. It won't touch the rest of the page yet. If indeed just one tuple was required, there won't be a call to _bt_next and we're done. If we do need more than one tuple, _bt_next will resume reading tuples from the index page at the point where we left off.
How do you know how many index entries you have to fetch to get a tuple
that's live/visible to the query?
- We need to take into account page splits, insertions and vacuums while we do not have the read-lock in between _bt_first and the first call to _bt_next. This made my patch quite a bit more complicated than my initial implementation.
Meh. I think the odds that you got this 100% right are small, and the
odds that it would be maintainable are smaller. There's too much that
can happen if you're not holding any lock --- and there's a lot of active
work on btree indexes, which could break whatever assumptions you might
make today.
I'm not unalterably opposed to doing something like this, but my sense
is that the complexity and probable negative performance impact on other
cases are not going to look like a good trade-off for optimizing the
case at hand.
BTW, you haven't even really made the case that optimizing a query that
behaves this way is the right thing to be doing ... maybe some other
plan shape that isn't a nestloop around a LIMIT query would be a better
solution.
regards, tom lane
Hi Tom,
Thanks for your quick reply!
Regardless of whether there's actually a LIMIT 1? That seems disastrous
for every other case than the narrow one where the optimization wins.
Because every other case is now going to have to touch the index page
twice. That's more CPU and about double the contention --- if you could
not measure any degradation from that, you're not measuring the right
thing.
I thought the same as well at first. Note that I did measure a degradation of 2-3% in some cases, as mentioned, but initially I also expected worse. Do you have any ideas on cases that would suffer the most? I thought the tight inner nested loop that I posted in my performance tests would have this index lookup as the bottleneck. I know these lookups are the bottleneck for the LIMIT 1 query (because those queries improve by a factor of 2-3 with the patch). And my theory is that for a LIMIT 3, the price paid for this optimization is highest, because it would touch the page twice and read all items from it, while only returning three of them.
In principle, you could pass down knowledge of whether there's a LIMIT,
using the same mechanism used to enable top-N sorting. But it'd have to
also go through the AM interface layer, so I'm not sure how messy this
would be.
This was an idea I had as well, and I would be willing to implement such a thing if this is deemed interesting enough by the community. However, I didn't want to do this for the first version of this patch, as it would be quite some extra work, which would be useless if the idea of the patch itself gets rejected anyway. :-) I'd appreciate any pointers in the right direction - I can take a look at how top-N sorting pushes the LIMIT down. Given enough interest in the basic idea of this patch, I will implement it.
This calls _bt_first in nbtsearch.c, which will, if there are scankeys, descend the tree to a leaf page and read just the first (or possibly two) tuples. It won't touch the rest of the page yet. If indeed just one tuple was required, there won't be a call to _bt_next and we're done. If we do need more than one tuple, _bt_next will resume reading tuples from the index page at the point where we left off.
How do you know how many index entries you have to fetch to get a tuple
that's live/visible to the query?
Indeed we don't know that - that's why this initial patch does not try to estimate this and just assumes the good-weather scenario that everything is visible. I'm not sure if it's possible to give an estimate of this and whether or not that would be useful. Currently, if it turns out that the tuple is not visible, there'll just be another call to _bt_next, which will resume reading the page as normal. I'm open to implementing any suggestions that may improve this.
- We need to take into account page splits, insertions and vacuums while we do not have the read-lock in between _bt_first and the first call to _bt_next. This made my patch quite a bit more complicated than my initial implementation.
Meh. I think the odds that you got this 100% right are small, and the
odds that it would be maintainable are smaller. There's too much that
can happen if you're not holding any lock --- and there's a lot of active
work on btree indexes, which could break whatever assumptions you might
make today.
Agreed, which is also why I posted this initial version of the patch here already, to get some input from the experts on this topic what assumptions can be made now and in the future. If it turns out that it's completely not feasible to do an optimization like this, because of other constraints in the btree implementation, then we're done pretty quickly here. :-) For what it's worth: the patch at least passes make check consistently - I caught a lot of these edge cases related to page splits and insertions while running the regression tests, which runs the modified bits of code quite often and in parallel. There may be plenty of edge cases left however...
I'm not unalterably opposed to doing something like this, but my sense
is that the complexity and probable negative performance impact on other
cases are not going to look like a good trade-off for optimizing the
case at hand.
I do think it could be a big win if we could get something like this working. Cases with a LIMIT seem common enough to me to justify some extra optimization, especially if that could lead to 2-3x the TPS for these kinds of queries. However, it indeed needs to stay within reasonable complexity. If it turns out that optimizing this requires a lot of extra complexity, it may not be worth adding.
BTW, you haven't even really made the case that optimizing a query that
behaves this way is the right thing to be doing ... maybe some other
plan shape that isn't a nestloop around a LIMIT query would be a better
solution.
It is pretty difficult to come up with any faster plans than this, unfortunately. We have a large database with many tables of timeseries data, and when tables get large, when there's an efficient multi-column index, and when you want to do these kinds of time-based or item-based lookups, the nested loop is generally the fastest option. I can elaborate more on this, but I'm not sure if this thread is the best place for that.
I appreciate you taking the time to take a look at this. I'd be happy to look more into any suggestions you come up with. Working on this has taught me a lot about the internals of Postgres already - I find it really interesting!
-Floris
On Fri, Aug 2, 2019 at 1:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Meh. I think the odds that you got this 100% right are small, and the
odds that it would be maintainable are smaller. There's too much that
can happen if you're not holding any lock --- and there's a lot of active
work on btree indexes, which could break whatever assumptions you might
make today.
I agree that this sounds very scary.
BTW, you haven't even really made the case that optimizing a query that
behaves this way is the right thing to be doing ... maybe some other
plan shape that isn't a nestloop around a LIMIT query would be a better
solution.
I wonder if some variety of block nested loop join would be helpful
here. I'm not aware of any specific design that would help with
Floris' case, but the idea of reducing the number of scans required on
the inner side by buffering outer side tuples (say based on the
"k=:val" constant) seems like it might generalize well enough. I
suggest Floris look into that possibility. This paper might be worth a
read:
https://dl.acm.org/citation.cfm?id=582278
(Though it also might not be worth a read -- I haven't actually read it myself.)
--
Peter Geoghegan
On Fri, Aug 2, 2019 at 5:34 PM Peter Geoghegan <pg@bowt.ie> wrote:
I wonder if some variety of block nested loop join would be helpful
here. I'm not aware of any specific design that would help with
Floris' case, but the idea of reducing the number of scans required on
the inner side by buffering outer side tuples (say based on the
"k=:val" constant) seems like it might generalize well enough. I
suggest Floris look into that possibility. This paper might be worth a
read:
Actually, having looked at the test case in more detail, that now
seems less likely. The test case seems designed to reward making it
cheaper to access one specific tuple among a fairly large group of
related tuples -- reducing the number of inner scans is not going to
be possible there.
If this really is totally representative of the case that Floris cares
about, I suppose that the approach taken more or less makes sense.
Unfortunately, it doesn't seem like an optimization that many other
users would find compelling, partly because it's only concerned with
fixed overheads, and partly because most queries don't actually look
like this.
--
Peter Geoghegan
Hello everyone.
I was also looking into the possibility of such an optimisation a few days ago
(in an attempt to reduce memcpy overhead on IndexOnlyScan).
One thing I noticed here - the whole page is scanned only if the index quals are
"open" on some side.
So, in case of
SELECT * FROM tbl WHERE k=:val AND ts<=:timestamp ORDER BY k, ts DESC
LIMIT 1;
the whole index page will be read.
But
SELECT * FROM tbl WHERE k=:val AND ts<=:timestamp AND ts>=:timestamp -
:interval ORDER BY k, ts DESC LIMIT 1;
is semantically the same, but only the few records within :interval will be processed.
So, you could try to compare such a query in your benchmarks.
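For example, a bounded variant of the nested-loop test sketched earlier in the thread could look like this (just a sketch, reusing the placeholder names from that sketch; the extra lower bound on ts is the only change):

SELECT sum(sub.payload)
FROM generate_series(:start_ts, :end_ts, :interval) s(t)
CROSS JOIN LATERAL (
    SELECT payload
    FROM tbl
    WHERE k = :val
      AND ts <= s.t
      AND ts >= s.t - :interval  -- closes the qual on both sides
      -- note: unlike the unbounded form, returns nothing if the latest
      -- matching update is older than :interval ago
    ORDER BY k, ts DESC
    LIMIT 1
) sub;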
Also, some info about the current design is contained in
src/backend/access/nbtree/README ("To minimize lock/unlock traffic, an
index scan always searches a leaf page
to identify all the matching items at once").
Thanks,
Michail.
Hi Peter,
Actually, having looked at the test case in more detail, that now
seems less likely. The test case seems designed to reward making it
cheaper to access one specific tuple among a fairly large group of
related tuples -- reducing the number of inner scans is not going to
be possible there.
If this really is totally representative of the case that Floris cares
about, I suppose that the approach taken more or less makes sense.
Unfortunately, it doesn't seem like an optimization that many other
users would find compelling, partly because it's only concerned with
fixed overheads, and partly because most queries don't actually look
like this.
Thanks for taking a look. Unfortunately this is exactly the case I care about. I'm a bit puzzled as to why this case wouldn't come up more often for other users though. We have many large tables with timeseries data, and it seems to me that with timeseries data, two of the most common queries are:
(1) What is the state of { a1,a2, a3 ...} at timepoint t (but you don't know that there's an update *exactly* at timepoint t - so you're left with trying to find the latest update smaller than t)
(2) What is the state of { a } at timepoints { t1, t2, t3 ... }
Given that a1, a2, a3, ... are independently updating, but similar, time series (e.g. sensor a1 and sensor a2 both provide a temperature value but update independently from each other).
Both of these can also be done with some kind of DISTINCT ON clause, but this is often already much slower than just doing a nested loop of fast index lookups with LIMIT 1 (this depends on the frequency of the timeseries data itself versus the sampling rate of your query though; for high-frequency time series and/or low-frequency sampling, the LIMIT 1 approach is much faster).
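To illustrate, for case (1) the two formulations look roughly like this (a sketch only; :a1..:a3, :t and the column names are placeholders):

-- nested loop of LIMIT 1 index lookups
SELECT sub.*
FROM unnest(ARRAY[:a1, :a2, :a3]) AS keys(k)
CROSS JOIN LATERAL (
    SELECT *
    FROM tbl
    WHERE tbl.k = keys.k AND tbl.ts <= :t
    ORDER BY tbl.k, tbl.ts DESC
    LIMIT 1
) sub;

-- DISTINCT ON formulation of the same question
SELECT DISTINCT ON (k) *
FROM tbl
WHERE k IN (:a1, :a2, :a3) AND ts <= :t
ORDER BY k, ts DESC;

Roughly speaking, the DISTINCT ON form has to read every index entry with ts <= :t for each key, while the lateral form does one index descent per key, which is why the latter wins for high-frequency series.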
Note that there is actually some related work to this - in the Index Skip Scan thread [1] a function called _bt_read_closest was developed which also partially reads the page. A Skip Scan has a very similar access pattern to the use case I describe here, because it's also very likely to require just one tuple from the page. Even though the implementation in that patch is currently incorrect, performance of the Skip Scan would likely also be quite a bit faster if it had a correct implementation of this partial page read and it wouldn't have to read the full page every time.
I have one further question about these index offsets. There are several comments in master that indicate that it's impossible that an item moves 'left' on a page, if we continuously hold a pin on the page. For example, _bt_killitems has a comment like this:
* Note that if we hold a pin on the target page continuously from initially
* reading the items until applying this function, VACUUM cannot have deleted
* any items from the page, and so there is no need to search left from the
* recorded offset. (This observation also guarantees that the item is still
* the right one to delete, which might otherwise be questionable since heap
* TIDs can get recycled.) This holds true even if the page has been modified
* by inserts and page splits, so there is no need to consult the LSN.
Still, exactly this case happens in practice. In my tests I was able to get behavior like:
1) pin + lock a page in _bt_first
2) read a tuple, record indexOffset (for example offset=100) and heap tid
3) unlock page, but *keep* the pin (end of _bt_first of my patch)
4) lock page again in _bt_next (we still hold the pin, so vacuum shouldn't have occurred)
5) look inside the current page for the heap Tid that we registered earlier
6) we find this tuple at indexOffset=98 now, i.e. it moved left. This should not be possible.
This case sometimes randomly happens when running 'make check', which is why I added code in my patch to also look left on the page from the previous indexOffset.
However, this is in contradiction with the comments (and code) of _bt_killitems.
Is the comment incorrect/outdated or is there a bug in vacuum or any other part of Postgres that might move index items left even though there are others holding a pin?
-Floris
[1]: /messages/by-id/CA+hUKGKW4dXTP9G+WBskjT09tzD+9aMWEm=Fpeb6RS5SXfPyKw@mail.gmail.com
05.08.2019 14:34, Floris Van Nee wrote:
I have one further question about these index offsets. There are several comments in master that indicate that it's impossible that an item moves 'left' on a page, if we continuously hold a pin on the page. For example, _bt_killitems has a comment like this:
* Note that if we hold a pin on the target page continuously from initially
* reading the items until applying this function, VACUUM cannot have deleted
* any items from the page, and so there is no need to search left from the
* recorded offset. (This observation also guarantees that the item is still
* the right one to delete, which might otherwise be questionable since heap
* TIDs can get recycled.) This holds true even if the page has been modified
* by inserts and page splits, so there is no need to consult the LSN.
Still, exactly this case happens in practice. In my tests I was able to get behavior like:
1) pin + lock a page in _bt_first
2) read a tuple, record indexOffset (for example offset=100) and heap tid
3) unlock page, but *keep* the pin (end of _bt_first of my patch)
4) lock page again in _bt_next (we still hold the pin, so vacuum shouldn't have occurred)
5) look inside the current page for the heap Tid that we registered earlier
6) we find this tuple at indexOffset=98 now, i.e. it moved left. This should not be possible.
This case sometimes randomly happens when running 'make check', which is why I added code in my patch to also look left on the page from the previous indexOffset.
However, this is in contradiction with the comments (and code) of _bt_killitems.
Is the comment incorrect/outdated or is there a bug in vacuum or any other part of Postgres that might move index items left even though there are others holding a pin?
Hello,
welcome to hackers with your first patch)
As far as I understood from the thread above, the design of this
optimization is under discussion, so I didn't review the proposed patch
itself.
Though, I got interested in the comment inconsistency you have found.
I added a debug message into this code branch of the patch and was able to
see it in regression.diffs after 'make check':
Speaking of your patch, it seems that the buffer was unpinned and pinned
again between two reads,
and the condition of holding it continuously has not been met.
I didn't dig into the code, but this line looks suspicious (see my
findings about BTScanPosIsPinned below):
/* bump pin on current buffer for assignment to mark buffer */
if (BTScanPosIsPinned(so->currPos))
IncrBufferRefCount(so->currPos.buf);
While reading the code to answer your question, I noticed that
BTScanPosIsPinned macro name is misleading.
It calls BufferIsValid(), not BufferIsPinned() as one could expect.
And BufferIsValid in bufmgr.h comment explicitly states that it
shouldn't be confused with BufferIsPinned.
The same goes for BTScanPosUnpinIfPinned().
I propose that we update BTScanPosIsPinned macro. Or, at least write a
comment, why its current behavior is fine.
There are a few existing callers, that are definitely expecting that
this macro checks a pin, which it doesn't do.
I don't quite understand if that already causes any subtle bug, or the
current algorithm is fine.
Peter, Tom, what do you think?
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Fri, Aug 23, 2019 at 10:14 AM Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
Though, I got interested in the comment inconsistency you have found.
I added debug message into this code branch of the patch and was able to
see it in regression.diffs after 'make check':
Speaking of your patch, it seems that the buffer was unpinned and pinned
again between two reads,
and the condition of holding it continuously has not been met.
See commit 2ed5b87f. The code is supposed to do that, but it might do
it more often than is truly necessary. We don't want to block VACUUM
by holding a buffer pin for a very long time, which is theoretically
possible here. Do you think that it is actually unnecessary here? In
other words, do you think that we can fix this without breaking cases
that commit 2ed5b87f cares about?
I have been suspicious of this commit all along. For example, I
noticed that it can cause the kill_prior_tuple mechanism to be
ineffective in a way that didn't happen prior to Postgres 9.5:
/messages/by-id/CAH2-Wz=SfAKVMv1x9Jh19EJ8am8TZn9f-yECipS9HrrRqSswnA@mail.gmail.com
That particular complaint was never addressed. I meant to do more on
commit 2ed5b87f.
I didn't dig into the code, but this line looks suspicious (see my
findings about BTScanPosIsPinned below):
/* bump pin on current buffer for assignment to mark buffer */
if (BTScanPosIsPinned(so->currPos))
IncrBufferRefCount(so->currPos.buf);
While reading the code to answer your question, I noticed that
BTScanPosIsPinned macro name is misleading.
It calls BufferIsValid(), not BufferIsPinned() as one could expect.
And BufferIsValid in bufmgr.h comment explicitly states that it
shouldn't be confused with BufferIsPinned.
The same goes for BTScanPosUnpinIfPinned().
I have always hated this macro. I think that code like the specific
code you quoted might be correct, kind of, but it looks like the
author was trying to change as little as possible about the code as it
existed in 2015, rather than changing things so that everything made
sense. It looks like a messy accretion.
Let me see if I can get it straight:
We're incrementing the ref count on the buffer if and only if it is
pinned (by which we mean valid), though only when the scan is valid
(which is not the same as pinned). Whether or not we increment the
count of a valid scan doesn't affect anything else we do (i.e. we
still restore a marked position either way).
This is just awful.
I propose that we update BTScanPosIsPinned macro. Or, at least write a
comment, why its current behavior is fine.
There are a few existing callers, that are definitely expecting that
this macro checks a pin, which it doesn't do.
I don't quite understand if that already causes any subtle bug, or the
current algorithm is fine.
I think that you're right -- at a minimum, this requires more
documentation. This code is a few years old, but I still wouldn't be
surprised if it turned out to be slightly wrong in a way that was
important. We still have no way of detecting if a buffer is accessed
without a pin. There have been numerous bugs like that before. (We
have talked about teaching Valgrind to detect the case, but that never
actually happened.)
--
Peter Geoghegan
Hello,
welcome to hackers with your first patch)
Thank you.
Though, I got interested in the comment inconsistency you have found.
I added debug message into this code branch of the patch and was able to
see it in regression.diffs after 'make check':
Speaking of your patch, it seems that the buffer was unpinned and pinned
again between two reads,
and the condition of holding it continuously has not been met.
May I ask what makes you conclude that the condition of holding the pin continuously has not been met?
Your reply encouraged me to dig a little bit more into this today. First, I wanted to check if indeed the pin was continuously held by the backend or not. I added some debug info to ReleaseBuffer for this: it turned out that the pin on the buffer was definitely never released by the backend between the calls to _bt_first and _bt_next. So the buffer got compacted while the backend held a pin on it.
After some more searching I found the following code: _bt_vacuum_one_page in nbtinsert.c
This function compacts one single page without taking a super-exclusive lock. It is used during inserts to make room on a page. I verified that if I comment out the calls to this function, the compacting never happens while I have a pin on the buffer.
So I guess that answers my own question: cleaning up garbage during inserts is one of the cases where compacting may happen even while other backends hold a pin to the buffer. Perhaps this should also be more clearly phrased in the comments in eg. _bt_killitems? Because currently those comments make it look like this case never occurs.
While reading the code to answer your question, I noticed that
BTScanPosIsPinned macro name is misleading.
It calls BufferIsValid(), not BufferIsPinned() as one could expect.
And BufferIsValid in bufmgr.h comment explicitly states that it
shouldn't be confused with BufferIsPinned.
The same goes for BTScanPosUnpinIfPinned().
I agree the name is misleading. It clearly does something other than what its name suggests. However, I don't believe this introduces problems in these particular pieces of code, as long as the macros are always used. BTScanPosIsPinned actually checks whether the buffer is valid and not necessarily whether it's pinned, as you mentioned. However, any time the buffer gets unpinned using the macro BTScanPosUnpin, the buffer gets set to Invalid by the macro as well. Therefore, any consecutive call to BTScanPosIsPinned should indeed return false. It'd definitely be nice if this gets clarified in comments though.
-Floris
25.08.2019 0:59, Floris Van Nee wrote:
Though, I got interested in the comment inconsistency you have found.
I added debug message into this code branch of the patch and was able to
see it in regression.diffs after 'make check':
Speaking of your patch, it seems that the buffer was unpinned and pinned
again between two reads,
and the condition of holding it continuously has not been met.
May I ask what makes you conclude that the condition of holding the pin continuously has not been met?
Your reply encouraged me to dig a little bit more into this today. First, I wanted to check if indeed the pin was continuously held by the backend or not. I added some debug info to ReleaseBuffer for this: it turned out that the pin on the buffer was definitely never released by the backend between the calls to _bt_first and _bt_next. So the buffer got compacted while the backend held a pin on it.
After some more searching I found the following code: _bt_vacuum_one_page in nbtinsert.c
This function compacts one single page without taking a super-exclusive lock. It is used during inserts to make room on a page. I verified that if I comment out the calls to this function, the compacting never happens while I have a pin on the buffer.
So I guess that answers my own question: cleaning up garbage during inserts is one of the cases where compacting may happen even while other backends hold a pin to the buffer. Perhaps this should also be more clearly phrased in the comments in eg. _bt_killitems? Because currently those comments make it look like this case never occurs.
You're right, the pin was not released between page reads.
I also added debug to UnpinBuffer, but now I see that I had interpreted
it wrongly.
As far as I understand, the issue with your patch is that it breaks the
*scan stops "between" pages* assumption
and thus it unsafely interacts with _bt_vacuum_one_page() cleanup.
See README:
Page read locks are held only for as long as a scan is examining a page.
To minimize lock/unlock traffic, an index scan always searches a leaf page
to identify all the matching items at once, copying their heap tuple IDs
into backend-local storage. The heap tuple IDs are then processed while
not holding any page lock within the index. We do continue to hold a pin
on the leaf page in some circumstances, to protect against concurrent
deletions (see below). In this state the scan is effectively stopped
"between" pages, either before or after the page it has pinned. This is
safe in the presence of concurrent insertions and even page splits, because
items are never moved across pre-existing page boundaries --- so the scan
cannot miss any items it should have seen, nor accidentally return the same
item twice.
and
Once an index tuple has been marked LP_DEAD it can actually be removed
from the index immediately; since index scans only stop "between" pages,
no scan can lose its place from such a deletion.
It seems that it contradicts the very idea of your patch, so probably we
should look for other ways to optimize this use-case.
Maybe this restriction can be relaxed for write-only tables that never
have to reread the page because of visibility, or something like that.
Also, we could probably add info about the expected number of tuples to
IndexScanDescData, to allow the index scan to work more optimally
and avoid the overhead for other workloads.
While reading the code to answer your question, I noticed that
BTScanPosIsPinned macro name is misleading.
It calls BufferIsValid(), not BufferIsPinned() as one could expect.
And BufferIsValid in bufmgr.h comment explicitly states that it
shouldn't be confused with BufferIsPinned.
The same goes for BTScanPosUnpinIfPinned().
I agree the name is misleading. It clearly does something other than what its name suggests. However, I don't believe this introduces problems in these particular pieces of code, as long as the macros are always used. BTScanPosIsPinned actually checks whether the buffer is valid and not necessarily whether it's pinned, as you mentioned. However, any time the buffer gets unpinned using the macro BTScanPosUnpin, the buffer gets set to Invalid by the macro as well. Therefore, any consecutive call to BTScanPosIsPinned should indeed return false. It'd definitely be nice if this gets clarified in comments though.
That's true. It took me quite some time to understand that the existing code is correct.
There is a comment on the structure's field that claims that BufferIsValid is the same as BufferIsPinned in the ScanPos context.
The attached patch contains some comment updates. Any suggestions on how to improve them are welcome.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
improve_BTScanPos_comments_v1.patch (text/x-patch)
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 9b172c1..316a42d 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1751,6 +1751,8 @@ _bt_killitems(IndexScanDesc scan)
Buffer buf;
/* Attempt to re-read the buffer, getting pin and lock. */
+ /* TODO Is it possible that currPage is not valid anymore? */
+ Assert(BTScanPosIsValid(so->currPos));
buf = _bt_getbuf(scan->indexRelation, so->currPos.currPage, BT_READ);
/* It might not exist anymore; in which case we can't hint it. */
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 52eafe6..9abca7d 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -512,14 +512,6 @@ typedef struct BTInsertStateData
typedef BTInsertStateData *BTInsertState;
/*
- * BTScanOpaqueData is the btree-private state needed for an indexscan.
- * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
- * details of the preprocessing), information about the current location
- * of the scan, and information about the marked location, if any. (We use
- * BTScanPosData to represent the data needed for each of current and marked
- * locations.) In addition we can remember some known-killed index entries
- * that must be marked before we can move off the current page.
- *
* Index scans work a page at a time: we pin and read-lock the page, identify
* all the matching items on the page and save them in BTScanPosData, then
* release the read-lock while returning the items to the caller for
@@ -583,6 +575,14 @@ typedef struct BTScanPosData
typedef BTScanPosData *BTScanPos;
+/*
+ * It's enough to check BufferIsValid inside this macro,
+ * since in context of BTScanPosData buf is always pinned if valid,
+ * see comment to struct's field above.
+ *
+ * NOTE To keep this assumption correct all users of scanpos must unpin buffer
+ * using macros below instead of directly calling ReleaseBuffer.
+ */
#define BTScanPosIsPinned(scanpos) \
( \
AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
@@ -600,12 +600,24 @@ typedef BTScanPosData *BTScanPos;
BTScanPosUnpin(scanpos); \
} while (0)
+/*
+ * Check if scanpos is initialized.
+ * TODO It is not clear to me
+ * why to check scanpos validity based on currPage value.
+ * I wonder, if we need currPage at all? Is there any codepath that
+ * assumes that currPage is not the same as BufferGetBlockNumber(buf)?
+ */
#define BTScanPosIsValid(scanpos) \
( \
AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
!BufferIsValid((scanpos).buf)), \
BlockNumberIsValid((scanpos).currPage) \
)
+
+/*
+ * Reset BTScanPos fields.
+ * If scanpos was valid, caller must call BTScanPosUnpinIfPinned in advance.
+ */
#define BTScanPosInvalidate(scanpos) \
do { \
(scanpos).currPage = InvalidBlockNumber; \
@@ -625,6 +637,15 @@ typedef struct BTArrayKeyInfo
Datum *elem_values; /* array of num_elems Datums */
} BTArrayKeyInfo;
+/*
+ * BTScanOpaqueData is the btree-private state needed for an indexscan.
+ * This consists of preprocessed scan keys (see _bt_preprocess_keys() for
+ * details of the preprocessing), information about the current location
+ * of the scan, and information about the marked location, if any. (We use
+ * BTScanPosData to represent the data needed for each of current and marked
+ * locations.) In addition we can remember some known-killed index entries
+ * that must be marked before we can move off the current page.
+ */
typedef struct BTScanOpaqueData
{
/* these fields are set by _bt_preprocess_keys(): */
It seems that this contradicts the very idea of your patch, so we should probably look for other ways to optimize this use case.
Maybe this restriction can be relaxed for write-only tables, which never have to reread the page for visibility reasons, or something like that.
We could probably also add information about the expected number of tuples to IndexScanDescData, to allow the index to work more optimally for this case and avoid the overhead for other workloads.
The idea of the patch is exactly to relax this limitation; I forgot to update the README file, though. The current implementation of the patch should still be correct under that relaxation: that's why I added the look-back code, which searches the page again if the tuple can no longer be found at the same location, and similarly looks at the page to the right if it detects a page split (a rough sketch of that logic is below). These two measures combined should give a correct implementation of the "a scan may stop in the middle of a page" relaxation. However, as Peter and Tom pointed out earlier, they feel that the performance advantage of this approach does not outweigh the extra complexity at this time. I'd be open to other suggestions, though.
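To make that slightly more concrete, the resume logic boils down to roughly the following (a heavily simplified sketch; the helper names and the savedOffset field are made up for illustration and do not match the patch exactly):

    /*
     * Heavily simplified sketch of resuming a scan that stopped in the
     * middle of a leaf page.  The helpers marked "hypothetical" do not
     * exist in the actual sources.
     */
    static void
    _bt_resume_mid_page(IndexScanDesc scan, ScanDirection dir)
    {
        BTScanOpaque so = (BTScanOpaque) scan->opaque;

        /* The pin was kept all along; just take the read lock again. */
        LockBuffer(so->currPos.buf, BT_READ);

        if (BufferGetLSNAtomic(so->currPos.buf) == so->currPos.lsn)
        {
            /* Page unchanged since _bt_first: continue from the saved offset. */
            _bt_readpage_from(scan, dir, so->savedOffset);      /* hypothetical */
        }
        else if (_bt_lookback_for_item(scan, &so->savedOffset)) /* hypothetical */
        {
            /* Item was merely shifted on the same page by concurrent inserts. */
            _bt_readpage_from(scan, dir, so->savedOffset);
        }
        else
        {
            /* Item no longer on this page: assume a page split and move right. */
            _bt_step_right_and_refind(scan, dir);               /* hypothetical */
        }
    }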
That's true. It took me quite some time to understand that the existing code is correct.
There is a comment on the structure's field that claims that BufferIsValid is the same as BufferIsPinned in the ScanPos context.
The attached patch contains some comment updates. Any suggestions on how to improve them are welcome.
I'll have a look tomorrow. Thanks a lot for writing this up!
-Floris
Although, now that I think of it: do you mean the case where the tuple that we returned to the caller after _bt_first actually gets deleted (not just moved) from the page? I guess that can theoretically happen if _bt_first returns a non-visible tuple (one that is not yet marked dead in the index at the time of _bt_first). For my understanding, would a situation like the following lead to this (in my patch)?
1) Backend 1 does an index scan and returns the first tuple from _bt_first. This tuple has actually already been deleted in the heap, but it is not marked dead in the index yet.
2) Backend 1 does a heap fetch to check actual visibility and determines the tuple is in fact dead.
3) While backend 1 is busy doing the heap fetch (so in between _bt_first and _bt_next), backend 2 comes in and manages to both (a) run _bt_killitems on the page, marking tuples dead, and (b) compact the items on the page, thereby actually removing this item from the page.
4) Now backend 1 tries to find the next tuple in _bt_next: it first tries to locate the tuple where it left off, but cannot find it anymore because it was removed completely by backend 2.
If this is indeed possible, then it's unfortunately a bad issue, and quite hard to reproduce, as a lot of things need to happen concurrently while doing a visibility check.
As for your patch, I've had some time to take a look at it. For the two TODOs:
+ /* TODO Is it possible that currPage is not valid anymore? */
+ Assert(BTScanPosIsValid(so->currPos))
This Assert already exists a couple of lines earlier, at the start of this function.
+ * TODO It is not clear to me
+ * why to check scanpos validity based on currPage value.
+ * I wonder, if we need currPage at all? Is there any codepath that
+ * assumes that currPage is not the same as BufferGetBlockNumber(buf)?
+ */
The comments in the source mention the following about this:
* We note the buffer's block number so that we can release the pin later.
* This allows us to re-read the buffer if it is needed again for hinting.
*/
so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
As we figured out earlier, so->currPos.buf gets set to InvalidBuffer when we release the pin via the unpin macro. So, if we didn't store the block number in currPage as well, we couldn't obtain the pin again when we need it during killitems. I think that's the reason currPage is stored.
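So the pattern there is roughly the following (paraphrasing the existing code in _bt_killitems from memory, so this sketch may not match it exactly):

    /* sketch of how _bt_killitems() gets back to the page */
    if (BTScanPosIsPinned(so->currPos))
    {
        /* We still hold the pin, so just take the read lock again. */
        LockBuffer(so->currPos.buf, BT_READ);
    }
    else
    {
        Buffer      buf;

        /* buf was set to InvalidBuffer by BTScanPosUnpin, so the saved
         * block number is the only way back to the page. */
        buf = _bt_getbuf(scan->indexRelation, so->currPos.currPage, BT_READ);

        /* If the page changed while we were not holding the pin, it is
         * no longer safe to set any kill bits. */
        if (BufferGetLSNAtomic(buf) != so->currPos.lsn)
        {
            _bt_relbuf(scan->indexRelation, buf);
            return;
        }
        so->currPos.buf = buf;
    }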
Other than the two TODOs, I think the comments really help clarify what's going on in the code. I'd be happy to see this get added.
-Floris