Skip all-visible pages during second HeapScan of CIC

Started by Pavan Deolasee almost 9 years ago · 18 messages
#1 Pavan Deolasee
pavan.deolasee@gmail.com
1 attachment(s)

Hello All,

During the second heap scan of CREATE INDEX CONCURRENTLY, we're only
interested in the tuples which were inserted after the first scan was
started. All such tuples can only exist in pages which have their VM bit
unset. So I propose the attached patch, which consults the VM during the
second scan and skips all-visible pages. We do the same trick of skipping
pages only if a certain threshold of pages can be skipped, to ensure the
OS's read-ahead is not disturbed.
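
In condensed form, the skip logic looks roughly like this (a simplified,
self-contained model of what the attached patch does; the patch itself
consults the real visibility map via VM_ALL_VISIBLE rather than a boolean
array):

	#include <stdbool.h>

	#define SKIP_PAGES_THRESHOLD 32		/* same threshold vacuumlazy.c uses */

	typedef unsigned int BlockNumber;

	/*
	 * Return the next page the scan should actually read.  Jump over a run
	 * of all-visible pages only when the run is long enough; for short runs
	 * the OS read-ahead makes plain sequential reading cheaper than seeking.
	 */
	static BlockNumber
	next_page_to_scan(const bool *all_visible, BlockNumber page,
					  BlockNumber nblocks)
	{
		BlockNumber next = page;

		while (next < nblocks && all_visible[next])
			next++;

		return (next - page > SKIP_PAGES_THRESHOLD) ? next : page;
	}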

The patch shows a significant reduction in time for building an index
concurrently on very large tables which are not being updated frequently
and which were vacuumed recently (so that VM bits are set). I can post
performance numbers if there is interest. For tables that are being updated
heavily, the threshold skipping was indeed useful; without it we saw a
slight regression.

Since VM bits are only set during VACUUM which conflicts with CIC on the
relation lock, I don't see any risk of incorrectly skipping pages that the
second scan should have scanned.

Comments?

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

cic_skip_all_visible_v3.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5fd7f1e..f977db8 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -87,7 +87,8 @@ static HeapScanDesc heap_beginscan_internal(Relation relation,
 						bool allow_pagemode,
 						bool is_bitmapscan,
 						bool is_samplescan,
-						bool temp_snap);
+						bool temp_snap,
+						bool skip_all_visible);
 static BlockNumber heap_parallelscan_nextpage(HeapScanDesc scan);
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
 					TransactionId xid, CommandId cid, int options);
@@ -1033,6 +1034,45 @@ heapgettup_pagemode(HeapScanDesc scan,
 			return;
 		}
 
+		/*
+		 * Skip all-visible pages, if we are told to.
+		 *
+		 * We skip pages only if we can skip a bunch of them. Otherwise the
+		 * OS's prefetch for sequential scans will do a much better job. This
+		 * is similar to what we do in vacuumlazy.c.
+		 */
+		if (scan->rs_skip_all_visible && scan->rs_next_unskippable_block < page)
+		{
+			scan->rs_next_unskippable_block = page;
+
+			Assert(!backward);
+			Assert(scan->rs_startblock == 0);
+			Assert(scan->rs_nblocks == scan->rs_numblocks);
+
+			while (scan->rs_next_unskippable_block < scan->rs_nblocks)
+			{
+				scan->rs_all_visible_checked++;
+				if (VM_ALL_VISIBLE(scan->rs_rd, scan->rs_next_unskippable_block, &scan->rs_vmbuf))
+					scan->rs_next_unskippable_block++;
+				else
+					break;
+			}
+#define SKIP_PAGES_THRESHOLD 32
+			if (scan->rs_next_unskippable_block - page > SKIP_PAGES_THRESHOLD)
+			{
+				scan->rs_skipped_total += (scan->rs_next_unskippable_block - page);
+				scan->rs_skipped_times++;
+				scan->rs_numblocks -= (scan->rs_next_unskippable_block - page);
+				page = scan->rs_next_unskippable_block;
+			}
+
+			/*
+			 * If we reached the end, just return.
+			 */
+			if (page >= scan->rs_nblocks)
+				return;
+		}
+		scan->rs_scanned_total++;
 		heapgetpage(scan, page);
 
 		dp = BufferGetPage(scan->rs_cbuf);
@@ -1394,7 +1434,7 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 			   int nkeys, ScanKey key)
 {
 	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
-								   true, true, true, false, false, false);
+								   true, true, true, false, false, false, false);
 }
 
 HeapScanDesc
@@ -1404,7 +1444,7 @@ heap_beginscan_catalog(Relation relation, int nkeys, ScanKey key)
 	Snapshot	snapshot = RegisterSnapshot(GetCatalogSnapshot(relid));
 
 	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
-								   true, true, true, false, false, true);
+								   true, true, true, false, false, true, false);
 }
 
 HeapScanDesc
@@ -1414,7 +1454,7 @@ heap_beginscan_strat(Relation relation, Snapshot snapshot,
 {
 	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   allow_strat, allow_sync, true,
-								   false, false, false);
+								   false, false, false, false);
 }
 
 HeapScanDesc
@@ -1422,7 +1462,8 @@ heap_beginscan_bm(Relation relation, Snapshot snapshot,
 				  int nkeys, ScanKey key)
 {
 	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
-								   false, false, true, true, false, false);
+								   false, false, true, true, false, false,
+								   false);
 }
 
 HeapScanDesc
@@ -1432,7 +1473,16 @@ heap_beginscan_sampling(Relation relation, Snapshot snapshot,
 {
 	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   allow_strat, allow_sync, allow_pagemode,
-								   false, true, false);
+								   false, true, false, false);
+}
+
+HeapScanDesc
+heap_beginscan_skip_all_visible(Relation relation, Snapshot snapshot,
+					 int nkeys, ScanKey key)
+{
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
+								   true, false, true,
+								   false, false, false, true);
 }
 
 static HeapScanDesc
@@ -1444,11 +1494,18 @@ heap_beginscan_internal(Relation relation, Snapshot snapshot,
 						bool allow_pagemode,
 						bool is_bitmapscan,
 						bool is_samplescan,
-						bool temp_snap)
+						bool temp_snap,
+						bool skip_all_visible)
 {
 	HeapScanDesc scan;
 
 	/*
+	 * No support for parallel or synchronous scans when skip_all_visible is
+	 * true.
+	 */
+	Assert(!skip_all_visible || (parallel_scan == NULL && !allow_sync));
+
+	/*
 	 * increment relation ref count while scanning relation
 	 *
 	 * This is just to make really sure the relcache entry won't go away while
@@ -1472,6 +1529,12 @@ heap_beginscan_internal(Relation relation, Snapshot snapshot,
 	scan->rs_allow_sync = allow_sync;
 	scan->rs_temp_snap = temp_snap;
 	scan->rs_parallel = parallel_scan;
+	scan->rs_skip_all_visible = skip_all_visible;
+	scan->rs_next_unskippable_block = InvalidBlockNumber;
+	scan->rs_all_visible_checked = InvalidBlockNumber;
+	scan->rs_skipped_total = scan->rs_scanned_total = 0;
+	scan->rs_skipped_times = 0;
+	scan->rs_vmbuf = InvalidBuffer;
 
 	/*
 	 * we can use page-at-a-time mode if it's an MVCC-safe snapshot
@@ -1587,6 +1650,10 @@ heap_endscan(HeapScanDesc scan)
 	if (BufferIsValid(scan->rs_cbuf))
 		ReleaseBuffer(scan->rs_cbuf);
 
+	/* unpin visibility map buffer too */
+	if (BufferIsValid(scan->rs_vmbuf))
+		ReleaseBuffer(scan->rs_vmbuf);
+
 	/*
 	 * decrement relation reference count and free scan descriptor storage
 	 */
@@ -1601,6 +1668,13 @@ heap_endscan(HeapScanDesc scan)
 	if (scan->rs_temp_snap)
 		UnregisterSnapshot(scan->rs_snapshot);
 
+	if (scan->rs_skip_all_visible)
+	{
+		elog(LOG, "heap_endscan: scanned (%.0f), checked (%u), skipped (%.0f), avg skipped (%.0f)",
+			scan->rs_scanned_total, scan->rs_all_visible_checked,
+			scan->rs_skipped_total,
+			scan->rs_skipped_times > 0 ? scan->rs_skipped_total / scan->rs_skipped_times : 0);
+	}
 	pfree(scan);
 }
 
@@ -1658,7 +1732,7 @@ heap_beginscan_parallel(Relation relation, ParallelHeapScanDesc parallel_scan)
 	RegisterSnapshot(snapshot);
 
 	return heap_beginscan_internal(relation, snapshot, 0, NULL, parallel_scan,
-								   true, true, true, false, false, true);
+								   true, true, true, false, false, true, false);
 }
 
 /* ----------------
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 815a694..35c378d 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2806,6 +2806,9 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	PGRUsage    ru0;
+
+	pg_rusage_init(&ru0);
 
 	/* Open and lock the parent heap relation */
 	heapRelation = heap_open(heapId, ShareUpdateExclusiveLock);
@@ -2873,8 +2876,9 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	tuplesort_end(state.tuplesort);
 
 	elog(DEBUG2,
-		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
-		 state.htups, state.itups, state.tups_inserted);
+		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples: %s",
+		 state.htups, state.itups, state.tups_inserted,
+		 pg_rusage_show(&ru0));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -2994,16 +2998,26 @@ validate_index_heapscan(Relation heapRelation,
 
 	/*
 	 * Prepare for scan of the base relation.  We need just those tuples
-	 * satisfying the passed-in reference snapshot.  We must disable syncscan
-	 * here, because it's critical that we read from block zero forward to
-	 * match the sorted TIDs.
+	 * satisfying the passed-in reference snapshot. But we are only interested
+	 * in tuples which were either inserted or updated after we took the
+	 * reference snapshot for the initial build. Such tuples can only exist in
+	 * pages which are not marked as "all-visible". So we can easily skip
+	 * "all-visible" pages. This can be a huge win for very large tables which
+	 * do not fit in shared buffers or OS cache and are less frequently
+	 * updated.
+	 *
+	 * Note that since VACUUM conflicts with CIC on the relation lock, there
+	 * is no way new pages will be marked "all-visible" after CIC has started.
+	 * We don't care if the "all-visible" flag is cleared on pages after we
+	 * have scanned them during the upcoming heap scan.
+	 *
+	 * We must disable syncscan here, because it's critical that we read from
+	 * block zero forward to match the sorted TIDs.
 	 */
-	scan = heap_beginscan_strat(heapRelation,	/* relation */
+	scan = heap_beginscan_skip_all_visible(heapRelation,	/* relation */
 								snapshot,		/* snapshot */
 								0,		/* number of keys */
-								NULL,	/* scan key */
-								true,	/* buffer access strategy OK */
-								false); /* syncscan not OK */
+								NULL);	/* scan key */
 
 	/*
 	 * Scan all tuples matching the snapshot.
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index a864f78..aa10cd9 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -118,6 +118,8 @@ extern HeapScanDesc heap_beginscan_bm(Relation relation, Snapshot snapshot,
 extern HeapScanDesc heap_beginscan_sampling(Relation relation,
 						Snapshot snapshot, int nkeys, ScanKey key,
 					 bool allow_strat, bool allow_sync, bool allow_pagemode);
+extern HeapScanDesc heap_beginscan_skip_all_visible(Relation relation,
+						Snapshot snapshot, int nkeys, ScanKey key);
 extern void heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk,
 				   BlockNumber endBlk);
 extern void heapgetpage(HeapScanDesc scan, BlockNumber page);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index ce3ca8d..e7b073c 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -54,6 +54,13 @@ typedef struct HeapScanDescData
 	bool		rs_allow_strat; /* allow or disallow use of access strategy */
 	bool		rs_allow_sync;	/* allow or disallow use of syncscan */
 	bool		rs_temp_snap;	/* unregister snapshot at scan end? */
+	bool		rs_skip_all_visible;	/* skip all-visible pages ? */
+	double		rs_skipped_total;
+	int		rs_skipped_times;
+	double		rs_scanned_total;
+	BlockNumber	rs_next_unskippable_block;
+	BlockNumber	rs_all_visible_checked;
+	Buffer		rs_vmbuf;		/* visibility map buffer */
 
 	/* state set up at initscan time */
 	BlockNumber rs_nblocks;		/* total number of blocks in rel */
#2 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Pavan Deolasee (#1)
Re: Skip all-visible pages during second HeapScan of CIC

On Tue, Feb 28, 2017 at 10:42 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:

Hello All,

During the second heap scan of CREATE INDEX CONCURRENTLY, we're only
interested in the tuples which were inserted after the first scan was
started. All such tuples can only exist in pages which have their VM bit
unset. So I propose the attached patch, which consults the VM during the
second scan and skips all-visible pages. We do the same trick of skipping
pages only if a certain threshold of pages can be skipped, to ensure the
OS's read-ahead is not disturbed.

Great.

The patch shows a significant reduction in time for building an index
concurrently on very large tables which are not being updated frequently
and which were vacuumed recently (so that VM bits are set). I can post
performance numbers if there is interest.

Please share it. I'm curious.

For tables that are being updated
heavily, the threshold skipping was indeed useful; without it we saw a
slight regression.

Since VM bits are only set during VACUUM which conflicts with CIC on the
relation lock, I don't see any risk of incorrectly skipping pages that the
second scan should have scanned.

Agreed.

And the patch looks good to me.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#3 Andres Freund
andres@anarazel.de
In reply to: Pavan Deolasee (#1)
Re: Skip all-visible pages during second HeapScan of CIC

Hi,

On 2017-02-28 19:12:03 +0530, Pavan Deolasee wrote:

Since VM bits are only set during VACUUM which conflicts with CIC on the
relation lock, I don't see any risk of incorrectly skipping pages that the
second scan should have scanned.

I think that's true currently, but it'd also prevent us from doing that
in additional places. Which, in my opinion, we really should (and I
believe that's realistically achievable). Thus I really don't want to
base the correctness of CIC - a relatively infrequent operation - on the
assumption that no VM bits can be set concurrently due to the SUE lock.

Regards,

Andres


#4 Peter Geoghegan
In reply to: Andres Freund (#3)
Re: Skip all-visible pages during second HeapScan of CIC

On Fri, Mar 3, 2017 at 2:54 PM, Andres Freund <andres@anarazel.de> wrote:

On 2017-02-28 19:12:03 +0530, Pavan Deolasee wrote:

Since VM bits are only set during VACUUM which conflicts with CIC on the
relation lock, I don't see any risk of incorrectly skipping pages that the
second scan should have scanned.

I think that's true currently, but it'd also prevent us from doing that
in additional places. Which, in my opinion, we really should (and I
believe that's realistically achievable). Thus I really don't want to
base the correctness of CIC - a relatively infrequent operation - on the
assumption that no VM bits can be set concurrently due to the SUE lock.

I agree.

FWIW, the extra time that CIC takes over a plain CI is much reduced these days.

--
Peter Geoghegan


#5 Stephen Frost
sfrost@snowman.net
In reply to: Andres Freund (#3)
Re: Skip all-visible pages during second HeapScan of CIC

Andres,

* Andres Freund (andres@anarazel.de) wrote:

On 2017-02-28 19:12:03 +0530, Pavan Deolasee wrote:

Since VM bits are only set during VACUUM which conflicts with CIC on the
relation lock, I don't see any risk of incorrectly skipping pages that the
second scan should have scanned.

I think that's true currently, but it'd also prevent us from doing that
in additional places. Which, in my opinion, we really should (and I
believe that's realistically achievable). Thus I really don't want to
base the correctness of CIC - a relatively infrequent operation - on the
assumption that no VM bits can be set concurrently due to the SUE lock.

That sounds like we need a lock or similar mechanism to indicate that
CIC depends on the VM not being changed, rather than saying it shouldn't
depend on that because it might not always be true that the only other
operation which sets those bits (VACUUM) already acquires a conflicting lock.

Thanks!

Stephen

#6 Peter Geoghegan
In reply to: Pavan Deolasee (#1)
Re: Skip all-visible pages during second HeapScan of CIC

On Tue, Feb 28, 2017 at 5:42 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:

During the second heap scan of CREATE INDEX CONCURRENTLY, we're only
interested in the tuples which were inserted after the first scan was
started. All such tuples can only exist in pages which have their VM bit
unset. So I propose the attached patch, which consults the VM during the
second scan and skips all-visible pages. We do the same trick of skipping
pages only if a certain threshold of pages can be skipped, to ensure the
OS's read-ahead is not disturbed.

BTW, is there any danger of VACUUM acquiring a lock on the heap
relation (i.e. vacuuming it) after the first CIC transaction ends, but
before the second CIC transaction begins?

--
Peter Geoghegan


#7 Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#6)
Re: Skip all-visible pages during second HeapScan of CIC

On 2017-03-03 15:12:04 -0800, Peter Geoghegan wrote:

On Tue, Feb 28, 2017 at 5:42 AM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:

During the second heap scan of CREATE INDEX CONCURRENTLY, we're only
interested in the tuples which were inserted after the first scan was
started. All such tuples can only exist in pages which have their VM bit
unset. So I propose the attached patch, which consults the VM during the
second scan and skips all-visible pages. We do the same trick of skipping
pages only if a certain threshold of pages can be skipped, to ensure the
OS's read-ahead is not disturbed.

BTW, is there any danger of VACUUM acquiring a lock on the heap
relation (i.e. vacuuming it) after the first CIC transaction ends, but
before the second CIC transaction begins?

We hold session-level locks to prevent such danger.
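
For reference, the locking in DefineIndex() for the concurrent case has
roughly this shape (a paraphrased sketch, not the actual code; the real
implementation in src/backend/commands/indexcmds.c does far more between
the steps):

	LockRelId	heaprelid = rel->rd_lockInfo.lockRelId;

	/* A session-level lock survives the transaction commits in between. */
	LockRelationIdForSession(&heaprelid, ShareUpdateExclusiveLock);

	/* ... phase 1: build the index, commit the transaction ... */
	/* ... phase 2: validate_index()'s second heap scan, commit ... */

	UnlockRelationIdForSession(&heaprelid, ShareUpdateExclusiveLock);

Since ShareUpdateExclusiveLock conflicts with itself, VACUUM (which takes
the same lock mode) cannot run against the heap anywhere inside that window.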


#8 Robert Haas
robertmhaas@gmail.com
In reply to: Stephen Frost (#5)
Re: Skip all-visible pages during second HeapScan of CIC

On Fri, Mar 3, 2017 at 6:06 PM, Stephen Frost <sfrost@snowman.net> wrote:

* Andres Freund (andres@anarazel.de) wrote:

On 2017-02-28 19:12:03 +0530, Pavan Deolasee wrote:

Since VM bits are only set during VACUUM which conflicts with CIC on the
relation lock, I don't see any risk of incorrectly skipping pages that the
second scan should have scanned.

I think that's true currently, but it'd also prevent us from doing that
in additional places. Which, in my opinion, we really should (and I
believe that's realistically achievable). Thus I really don't want to
base the correctness of CIC - a relatively infrequent operation - on the
assumption that no VM bits can be set concurrently due to the SUE lock.

That sounds like we need a lock or similar mechanism to indicate that
CIC depends on the VM not being changed, rather than saying it shouldn't
depend on that because it might not always be true that the only other
operation which sets those bits (VACUUM) already acquires a conflicting lock.

I don't really think that would be a useful approach. I think what
Andres is thinking about -- or at least what comes to mind for me --
is the possibility that heap_page_prune() might someday try to mark
pages all-visible. Then it would become something that happens from
time to time during foreground processing, rather than (as now)
something that only happens from within a maintenance command.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#9 Stephen Frost
sfrost@snowman.net
In reply to: Robert Haas (#8)
Re: Skip all-visible pages during second HeapScan of CIC

Robert,

* Robert Haas (robertmhaas@gmail.com) wrote:

On Fri, Mar 3, 2017 at 6:06 PM, Stephen Frost <sfrost@snowman.net> wrote:

* Andres Freund (andres@anarazel.de) wrote:

On 2017-02-28 19:12:03 +0530, Pavan Deolasee wrote:

Since VM bits are only set during VACUUM which conflicts with CIC on the
relation lock, I don't see any risk of incorrectly skipping pages that the
second scan should have scanned.

I think that's true currently, but it'd also prevent us from doing that
in additional places. Which, in my opinion, we really should (and I
believe that's realistically achievable). Thus I really don't want to
base the correctness of CIC - a relatively infrequent operation - on the
assumption that no VM bits can be set concurrently due to the SUE lock.

That sounds like we need a lock or similar mechanism to indicate that
CIC depends on the VM not being changed, rather than saying it shouldn't
depend on that because it might not always be true that the only other
operation which sets those bits (VACUUM) already acquires a conflicting lock.

I don't really think that would be a useful approach. I think what
Andres is thinking about -- or at least what comes to mind for me --
is the possibility that heap_page_prune() might someday try to mark
pages all-visible. Then it would become something that happens from
time to time during foreground processing, rather than (as now)
something that only happens from within a maintenance command.

Right, that's what I thought he was getting at and my general thinking
was that we would need a way to discover if a CIC is ongoing on the
relation and therefore heap_page_prune(), or anything else, would know
that it can't twiddle the bits in the VM due to the ongoing CIC.
Perhaps a lock isn't the right answer there, but it would have to be
some kind of cross-process communication that operates at a relation
level..

Just to be clear, I wasn't thinking that heap_page_prune() or a
foreground process would actually block on the lock, just that it would
try to acquire it if it decided that it could twiddle the VM, and if it
wasn't able to immediately, then it would just leave it alone and move
on. I'll admit that I'm not super familiar with the CIC or VM
internals, frankly, so I may be all wet with this entire line of
thinking, but figured I'd at least explain what the idea was.
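
As a toy model of that probe (purely illustrative; a process-local trylock
stands in here for whatever relation-level, cross-process mechanism we'd
actually need):

	#include <pthread.h>
	#include <stdbool.h>

	/* A CIC-style operation would hold cic_guard for its entire run. */
	static pthread_mutex_t cic_guard = PTHREAD_MUTEX_INITIALIZER;

	static bool
	try_set_all_visible(bool *vm_bit)
	{
		if (pthread_mutex_trylock(&cic_guard) != 0)
			return false;	/* a CIC is running: leave the VM bit alone */
		*vm_bit = true;		/* nobody depends on VM stability right now */
		pthread_mutex_unlock(&cic_guard);
		return true;
	}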

Thanks!

Stephen

#10 Robert Haas
robertmhaas@gmail.com
In reply to: Stephen Frost (#9)
Re: Skip all-visible pages during second HeapScan of CIC

On Tue, Mar 7, 2017 at 4:26 PM, Stephen Frost <sfrost@snowman.net> wrote:

Right, that's what I thought he was getting at and my general thinking
was that we would need a way to discover if a CIC is ongoing on the
relation and therefore heap_page_prune(), or anything else, would know
that it can't twiddle the bits in the VM due to the ongoing CIC.
Perhaps a lock isn't the right answer there, but it would have to be
some kind of cross-process communication that operates at a relation
level..

Well, I guess that's one option. I lean toward the position already
taken by Andres and Peter, namely, that it's probably not a great idea
to pursue this optimization. I'm not totally dead-set on that
position, but it doesn't seem likely that we can add material
synchronization overhead to heap_page_prune(), and I'd rather maintain
the option to set visibility map bits in more places in the future
than give up that opportunity by optimizing CIC now. It's nice for
CIC to be fast, but setting all visible bits more aggressively sounds
nicer.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#11 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#10)
Re: Skip all-visible pages during second HeapScan of CIC

On 2017-03-07 21:03:53 -0500, Robert Haas wrote:

On Tue, Mar 7, 2017 at 4:26 PM, Stephen Frost <sfrost@snowman.net> wrote:

Right, that's what I thought he was getting at and my general thinking
was that we would need a way to discover if a CIC is ongoing on the
relation and therefore heap_page_prune(), or anything else, would know
that it can't twiddle the bits in the VM due to the ongoing CIC.
Perhaps a lock isn't the right answer there, but it would have to be
some kind of cross-process communication that operates at a relation
level..

Well, I guess that's one option. I lean toward the position already
taken by Andres and Peter, namely, that it's probably not a great idea
to pursue this optimization. I'm not totally dead-set on that
position, but it doesn't seem likely that we can add material
synchronization overhead to heap_page_prune(), and I'd rather maintain
the option to set visibility map bits in more places in the future
than give up that opportunity by optimizing CIC now. It's nice for
CIC to be fast, but setting all visible bits more aggressively sounds
nicer.

Indeed.

I wonder, however, if careful snapshot management couldn't solve this as
well. I have *not* thought a lot about this, but afaics we can easily
prevent all-visible from being set in cases it'd bother us by having an
"appropriate" xmin / registered snapshot.

- Andres


#12 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#11)
Re: Skip all-visible pages during second HeapScan of CIC

On Tue, Mar 7, 2017 at 9:12 PM, Andres Freund <andres@anarazel.de> wrote:

On 2017-03-07 21:03:53 -0500, Robert Haas wrote:

On Tue, Mar 7, 2017 at 4:26 PM, Stephen Frost <sfrost@snowman.net> wrote:

Right, that's what I thought he was getting at and my general thinking
was that we would need a way to discover if a CIC is ongoing on the
relation and therefore heap_page_prune(), or anything else, would know
that it can't twiddle the bits in the VM due to the ongoing CIC.
Perhaps a lock isn't the right answer there, but it would have to be
some kind of cross-process communication that operates at a relation
level..

Well, I guess that's one option. I lean toward the position already
taken by Andres and Peter, namely, that it's probably not a great idea
to pursue this optimization. I'm not totally dead-set on that
position, but it doesn't seem likely that we can add material
synchronization overhead to heap_page_prune(), and I'd rather maintain
the option to set visibility map bits in more places in the future
than give up that opportunity by optimizing CIC now. It's nice for
CIC to be fast, but setting all visible bits more aggressively sounds
nicer.

Indeed.

I wonder, however, if careful snapshot management couldn't solve this as
well. I have *not* thought a lot about this, but afaics we can easily
prevent all-visible from being set in cases it'd bother us by having an
"appropriate" xmin / registered snapshot.

Yeah, but that's a tax on the whole system.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#13 Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Robert Haas (#10)
1 attachment(s)
Re: Skip all-visible pages during second HeapScan of CIC

On Wed, Mar 8, 2017 at 7:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Mar 7, 2017 at 4:26 PM, Stephen Frost <sfrost@snowman.net> wrote:

Right, that's what I thought he was getting at and my general thinking
was that we would need a way to discover if a CIC is ongoing on the
relation and therefore heap_page_prune(), or anything else, would know
that it can't twiddle the bits in the VM due to the ongoing CIC.
Perhaps a lock isn't the right answer there, but it would have to be
some kind of cross-process communication that operates at a relation
level..

Well, I guess that's one option. I lean toward the position already
taken by Andres and Peter, namely, that it's probably not a great idea
to pursue this optimization.

Fair point. I'm not going to "persist" with the idea too long. It seemed
like a good, low-risk feature to me which can benefit certain use cases
quite reasonably. It's not uncommon to create indexes (or reindex existing
indexes to remove index bloats) on extremely large tables and avoiding a
second heap scan can hugely benefit such cases. Holding up the patch for
something for which we don't even have a proposal yet seemed a bit strange
at first, but I see the point.

Anyways, for a recently vacuumed table of twice the size of RAM and on a
machine with SSDs, the patched CIC version runs about 25% faster. That's
probably the best case scenario.

I also realised that I'd attached a slightly older version of the patch. So
new version attached, but I'll remove the patch from CF if I don't see any
more votes in favour of the patch.

Thanks,
Pavan

Attachments:

cic_skip_all_visible_v4.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 5fd7f1e..acc04be 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -87,7 +87,8 @@ static HeapScanDesc heap_beginscan_internal(Relation relation,
 						bool allow_pagemode,
 						bool is_bitmapscan,
 						bool is_samplescan,
-						bool temp_snap);
+						bool temp_snap,
+						bool skip_all_visible);
 static BlockNumber heap_parallelscan_nextpage(HeapScanDesc scan);
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
 					TransactionId xid, CommandId cid, int options);
@@ -1033,6 +1034,52 @@ heapgettup_pagemode(HeapScanDesc scan,
 			return;
 		}
 
+		/*
+		 * Skip all-visible pages, if we are told to.
+		 *
+		 * We skip pages only if we can skip a bunch of them. Otherwise the
+		 * OS's prefetch for sequential scans will do a much better job. This
+		 * is similar to what we do in vacuumlazy.c.
+		 */
+		if (scan->rs_skip_all_visible && scan->rs_next_unskippable_block < page)
+		{
+			scan->rs_next_unskippable_block = page;
+
+			Assert(!backward);
+			Assert(scan->rs_startblock == 0);
+
+			while (scan->rs_next_unskippable_block < scan->rs_nblocks)
+			{
+				scan->rs_all_visible_checked++;
+				if (VM_ALL_VISIBLE(scan->rs_rd, scan->rs_next_unskippable_block, &scan->rs_vmbuf))
+					scan->rs_next_unskippable_block++;
+				else
+					break;
+			}
+#define SKIP_PAGES_THRESHOLD 32
+			if (scan->rs_next_unskippable_block - page > SKIP_PAGES_THRESHOLD)
+			{
+				scan->rs_skipped_total += (scan->rs_next_unskippable_block - page);
+				scan->rs_skipped_times++;
+				scan->rs_numblocks -= (scan->rs_next_unskippable_block - page);
+				page = scan->rs_next_unskippable_block;
+			}
+
+			/*
+			 * If we reached the end, just return.
+			 */
+			if (page >= scan->rs_nblocks)
+			{
+				if (BufferIsValid(scan->rs_cbuf))
+					ReleaseBuffer(scan->rs_cbuf);
+				scan->rs_cbuf = InvalidBuffer;
+				scan->rs_cblock = InvalidBlockNumber;
+				tuple->t_data = NULL;
+				scan->rs_inited = false;
+				return;
+			}
+		}
+		scan->rs_scanned_total++;
 		heapgetpage(scan, page);
 
 		dp = BufferGetPage(scan->rs_cbuf);
@@ -1394,7 +1441,7 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 			   int nkeys, ScanKey key)
 {
 	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
-								   true, true, true, false, false, false);
+								   true, true, true, false, false, false, false);
 }
 
 HeapScanDesc
@@ -1404,7 +1451,7 @@ heap_beginscan_catalog(Relation relation, int nkeys, ScanKey key)
 	Snapshot	snapshot = RegisterSnapshot(GetCatalogSnapshot(relid));
 
 	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
-								   true, true, true, false, false, true);
+								   true, true, true, false, false, true, false);
 }
 
 HeapScanDesc
@@ -1414,7 +1461,7 @@ heap_beginscan_strat(Relation relation, Snapshot snapshot,
 {
 	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   allow_strat, allow_sync, true,
-								   false, false, false);
+								   false, false, false, false);
 }
 
 HeapScanDesc
@@ -1422,7 +1469,8 @@ heap_beginscan_bm(Relation relation, Snapshot snapshot,
 				  int nkeys, ScanKey key)
 {
 	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
-								   false, false, true, true, false, false);
+								   false, false, true, true, false, false,
+								   false);
 }
 
 HeapScanDesc
@@ -1432,7 +1480,16 @@ heap_beginscan_sampling(Relation relation, Snapshot snapshot,
 {
 	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
 								   allow_strat, allow_sync, allow_pagemode,
-								   false, true, false);
+								   false, true, false, false);
+}
+
+HeapScanDesc
+heap_beginscan_skip_all_visible(Relation relation, Snapshot snapshot,
+					 int nkeys, ScanKey key)
+{
+	return heap_beginscan_internal(relation, snapshot, nkeys, key, NULL,
+								   true, false, true,
+								   false, false, false, true);
 }
 
 static HeapScanDesc
@@ -1444,11 +1501,18 @@ heap_beginscan_internal(Relation relation, Snapshot snapshot,
 						bool allow_pagemode,
 						bool is_bitmapscan,
 						bool is_samplescan,
-						bool temp_snap)
+						bool temp_snap,
+						bool skip_all_visible)
 {
 	HeapScanDesc scan;
 
 	/*
+	 * No support for parallel or synchronous scans when skip_all_visible is
+	 * true.
+	 */
+	Assert(!skip_all_visible || (parallel_scan == NULL && !allow_sync));
+
+	/*
 	 * increment relation ref count while scanning relation
 	 *
 	 * This is just to make really sure the relcache entry won't go away while
@@ -1472,6 +1536,12 @@ heap_beginscan_internal(Relation relation, Snapshot snapshot,
 	scan->rs_allow_sync = allow_sync;
 	scan->rs_temp_snap = temp_snap;
 	scan->rs_parallel = parallel_scan;
+	scan->rs_skip_all_visible = skip_all_visible;
+	scan->rs_next_unskippable_block = 0;
+	scan->rs_all_visible_checked = 0;
+	scan->rs_skipped_total = scan->rs_scanned_total = 0;
+	scan->rs_skipped_times = 0;
+	scan->rs_vmbuf = InvalidBuffer;
 
 	/*
 	 * we can use page-at-a-time mode if it's an MVCC-safe snapshot
@@ -1587,6 +1657,10 @@ heap_endscan(HeapScanDesc scan)
 	if (BufferIsValid(scan->rs_cbuf))
 		ReleaseBuffer(scan->rs_cbuf);
 
+	/* unpin visibility map buffer too */
+	if (BufferIsValid(scan->rs_vmbuf))
+		ReleaseBuffer(scan->rs_vmbuf);
+
 	/*
 	 * decrement relation reference count and free scan descriptor storage
 	 */
@@ -1601,6 +1675,13 @@ heap_endscan(HeapScanDesc scan)
 	if (scan->rs_temp_snap)
 		UnregisterSnapshot(scan->rs_snapshot);
 
+	if (scan->rs_skip_all_visible)
+	{
+		elog(LOG, "heap_endscan: scanned (%.0f), checked (%u), skipped (%.0f), avg skipped (%.0f)",
+			scan->rs_scanned_total, scan->rs_all_visible_checked,
+			scan->rs_skipped_total,
+			scan->rs_skipped_times > 0 ? scan->rs_skipped_total / scan->rs_skipped_times : 0);
+	}
 	pfree(scan);
 }
 
@@ -1658,7 +1739,7 @@ heap_beginscan_parallel(Relation relation, ParallelHeapScanDesc parallel_scan)
 	RegisterSnapshot(snapshot);
 
 	return heap_beginscan_internal(relation, snapshot, 0, NULL, parallel_scan,
-								   true, true, true, false, false, true);
+								   true, true, true, false, false, true, false);
 }
 
 /* ----------------
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 815a694..35c378d 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2806,6 +2806,9 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	PGRUsage    ru0;
+
+	pg_rusage_init(&ru0);
 
 	/* Open and lock the parent heap relation */
 	heapRelation = heap_open(heapId, ShareUpdateExclusiveLock);
@@ -2873,8 +2876,9 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	tuplesort_end(state.tuplesort);
 
 	elog(DEBUG2,
-		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
-		 state.htups, state.itups, state.tups_inserted);
+		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples: %s",
+		 state.htups, state.itups, state.tups_inserted,
+		 pg_rusage_show(&ru0));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -2994,16 +2998,26 @@ validate_index_heapscan(Relation heapRelation,
 
 	/*
 	 * Prepare for scan of the base relation.  We need just those tuples
-	 * satisfying the passed-in reference snapshot.  We must disable syncscan
-	 * here, because it's critical that we read from block zero forward to
-	 * match the sorted TIDs.
+	 * satisfying the passed-in reference snapshot. But we are only interested
+	 * in tuples which were either inserted or updated after we took the
+	 * reference snapshot for the initial build. Such tuples can only exist in
+	 * pages which are not marked as "all-visible". So we can easily skip
+	 * "all-visible" pages. This can be a huge win for very large tables which
+	 * do not fit in shared buffers or OS cache and are less frequently
+	 * updated.
+	 *
+	 * Note that since VACUUM conflicts with CIC on the relation lock, there
+	 * is no way new pages will be marked "all-visible" after CIC has started.
+	 * We don't care if the "all-visible" flag is cleared on pages after we
+	 * have scanned them during the upcoming heap scan.
+	 *
+	 * We must disable syncscan here, because it's critical that we read from
+	 * block zero forward to match the sorted TIDs.
 	 */
-	scan = heap_beginscan_strat(heapRelation,	/* relation */
+	scan = heap_beginscan_skip_all_visible(heapRelation,	/* relation */
 								snapshot,		/* snapshot */
 								0,		/* number of keys */
-								NULL,	/* scan key */
-								true,	/* buffer access strategy OK */
-								false); /* syncscan not OK */
+								NULL);	/* scan key */
 
 	/*
 	 * Scan all tuples matching the snapshot.
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index a864f78..aa10cd9 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -118,6 +118,8 @@ extern HeapScanDesc heap_beginscan_bm(Relation relation, Snapshot snapshot,
 extern HeapScanDesc heap_beginscan_sampling(Relation relation,
 						Snapshot snapshot, int nkeys, ScanKey key,
 					 bool allow_strat, bool allow_sync, bool allow_pagemode);
+extern HeapScanDesc heap_beginscan_skip_all_visible(Relation relation,
+						Snapshot snapshot, int nkeys, ScanKey key);
 extern void heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk,
 				   BlockNumber endBlk);
 extern void heapgetpage(HeapScanDesc scan, BlockNumber page);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index ce3ca8d..e7b073c 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -54,6 +54,13 @@ typedef struct HeapScanDescData
 	bool		rs_allow_strat; /* allow or disallow use of access strategy */
 	bool		rs_allow_sync;	/* allow or disallow use of syncscan */
 	bool		rs_temp_snap;	/* unregister snapshot at scan end? */
+	bool		rs_skip_all_visible;	/* skip all-visible pages ? */
+	double		rs_skipped_total;
+	int		rs_skipped_times;
+	double		rs_scanned_total;
+	BlockNumber	rs_next_unskippable_block;
+	BlockNumber	rs_all_visible_checked;
+	Buffer		rs_vmbuf;		/* visibility map buffer */
 
 	/* state set up at initscan time */
 	BlockNumber rs_nblocks;		/* total number of blocks in rel */
#14 Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Robert Haas (#12)
Re: Skip all-visible pages during second HeapScan of CIC

On Wed, Mar 8, 2017 at 8:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Mar 7, 2017 at 9:12 PM, Andres Freund <andres@anarazel.de> wrote:

I wonder, however, if careful snapshot management couldn't solve this as
well. I have *not* thought a lot about this, but afaics we can easily
prevent all-visible from being set in cases it'd bother us by having an
"appropriate" xmin / registered snapshot.

Yeah, but that's a tax on the whole system.

Maybe we can do that only when a patched CIC (or any other operation which
may not like concurrent VM-bit setting) is in progress. We could use what
Stephen suggested upthread to detect that state. But right now it's hard to
reason about, because there is nothing concrete on either side yet, so we
don't know what would be impacted by more aggressive VM-bit setting, and how.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#15 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#12)
Re: Skip all-visible pages during second HeapScan of CIC

On 2017-03-07 21:38:40 -0500, Robert Haas wrote:

I wonder, however, if careful snapshot management couldn't solve this as
well. I have *not* thought a lot about this, but afaics we can easily
prevent all-visible from being set in cases it'd bother us by having an
"appropriate" xmin / registered snapshot.

Yeah, but that's a tax on the whole system.

I'm not sure I can buy that argument. CIC *already* holds a snapshot
during each of the two scans. By reducing the amount of time that's held
in the second scan, the systemwide impact is reduced, because it's held
for a shorter amount of time. We need to hold a slightly "more
aggressive" snapshot, that's true, but if you have high xid throughput
those effects should roughly balance each other.

Greetings,

Andres Freund


#16 Stephen Frost
sfrost@snowman.net
In reply to: Pavan Deolasee (#13)
Re: Skip all-visible pages during second HeapScan of CIC

* Pavan Deolasee (pavan.deolasee@gmail.com) wrote:

On Wed, Mar 8, 2017 at 7:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Mar 7, 2017 at 4:26 PM, Stephen Frost <sfrost@snowman.net> wrote:

Right, that's what I thought he was getting at and my general thinking
was that we would need a way to discover if a CIC is ongoing on the
relation and therefore heap_page_prune(), or anything else, would know
that it can't twiddle the bits in the VM due to the ongoing CIC.
Perhaps a lock isn't the right answer there, but it would have to be
some kind of cross-process communication that operates at a relation
level..

Well, I guess that's one option. I lean toward the position already
taken by Andres and Peter, namely, that it's probably not a great idea
to pursue this optimization.

Fair point. I'm not going to "persist" with the idea too long. It seemed
like a good, low-risk feature to me which can benefit certain use cases
quite reasonably. It's not uncommon to create indexes (or reindex existing
indexes to remove index bloat) on extremely large tables and avoiding a
second heap scan can hugely benefit such cases. Holding up the patch for
something for which we don't even have a proposal yet seemed a bit strange
at first, but I see the point.

I'm not really sure that I do. CICs will not be so frequent that not
allowing VM updates during their operation should really result in a
huge drag on the system, imv. Of course, if the only way to realize a
CIC is happening on the table is to perform some expensive operation and
we have to do that every time then it's not going to be worth it, but
I'm not sure that's really the case here.

The issue here, as I understand it at least, is to come up with a way
to make sure we don't do anything which would screw up the CIC while
it's running, one that we can detect very cheaply so we don't slow
things down during normal non-CIC-running periods.
lock, in part because anything that's updating the VM is going to have
to have some kind of lock anyway, and perhaps going for the "heavier"
lock that would conflict with the CIC, in an optimistic manner, would
make the normal non-CIC-running case essentially the same speed. If a
CIC was running then the attempt to acquire the lock would fail and
extra time would be spent acquiring a lower-weight lock, of course, but
that's going to happen relatively rarely.

Anyways, for a recently vacuumed table of twice the size of RAM and on a
machine with SSDs, the patched CIC version runs about 25% faster. That's
probably the best case scenario.

I agree, that certainly seems like a very nice performance improvement.

Thanks!

Stephen

#17 Stephen Frost
sfrost@snowman.net
In reply to: Andres Freund (#15)
Re: Skip all-visible pages during second HeapScan of CIC

* Andres Freund (andres@anarazel.de) wrote:

On 2017-03-07 21:38:40 -0500, Robert Haas wrote:

I wonder, however, if careful snapshot management couldn't solve this as
well. I have *not* thought a lot about this, but afaics we can easily
prevent all-visible from being set in cases it'd bother us by having an
"appropriate" xmin / registered snapshot.

Yeah, but that's a tax on the whole system.

I'm not sure I can buy that argument. CIC *already* holds a snapshot
during each of the two scans. By reducing the amount of time that's held
in the second scan, the systemwide impact is reduced, because it's held
for a shorter amount of time. We need to hold a slightly "more
aggressive" snapshot, that's true, but if you have high xid throughput
those effects should roughly balance each other.

For my 2c, at least, this certainly sounds like a potentially promising
approach too and I tend to agree with Andres on it not really being an
issue to have a more aggressive snapshot for the duration of the CIC.

Thanks!

Stephen

#18 David Steele
david@pgmasters.net
In reply to: Pavan Deolasee (#13)
Re: Skip all-visible pages during second HeapScan of CIC

On 3/7/17 9:42 PM, Pavan Deolasee wrote:

Fair point. I'm not going to "persist" with the idea too long. It seemed
like a good, low-risk feature to me which can benefit certain use cases
quite reasonably. It's not uncommon to create indexes (or reindex
existing indexes to remove index bloat) on extremely large tables and
avoiding a second heap scan can hugely benefit such cases. Holding up
the patch for something for which we don't even have a proposal yet
seemed a bit strange at first, but I see the point.

Anyways, for a recently vacuumed table of twice the size of RAM and on a
machine with SSDs, the patched CIC version runs about 25% faster. That's
probably the best case scenario.

I also realised that I'd attached a slightly older version of the patch.
So new version attached, but I'll remove the patch from CF if I don't
see any more votes in favour of the patch.

I have marked this submission "Returned with Feedback". Please feel
free to resubmit to a future commitfest or open a discussion on hackers
if you develop a new approach.

Thanks,
--
-David
david@pgmasters.net
