idea for concurrent seqscans

Started by Jeff Davis in February 2005 · 28 messages
#1 Jeff Davis
jdavis-pgsql@empires.org
2 attachment(s)

I had an idea that might improve parallel seqscans on the same relation.

If you have lots of concurrent seqscans going on a large relation, the
cache hit ratio is very low. But since those scans are all reading the
same relation, there may be something to gain by starting a new seqscan
near the page being accessed by an already-in-progress seqscan, and
wrapping back around to that start location. That would make some use of
the shared buffers, which would otherwise just be cache pollution.

I made a proof-of-concept implementation, which is entirely in heapam.c,
except for one addition to the HeapScanDesc struct in relscan.h. It is
not at all up to production quality; there are things I know that need
to be addressed. Basically, I just modified heapam.c to be able to start
at any page in the relation. Then, every time it reads a new page, I
have it mark the relation's oid and the page number in a shared mem
segment. Every time a new scan is started, it reads the shared mem
segment, and if the relation's oid matches, it starts the scan at the
page number it found in the shared memory. Otherwise, it starts the scan
at 0.
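
In outline, the traversal becomes (a conceptual sketch, not the actual
patch code; scan_page() here stands in for the real per-page work):

    /* "start" is the position read from the shared mem segment,
     * or 0 if no in-progress scan was found there */
    BlockNumber page = start;

    do
    {
        scan_page(page);                /* process one heap page */
        page = (page + 1) % nblocks;    /* wrap past the last block */
    } while (page != start);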

There are a couple of obvious issues. One is that my implementation
doesn't account for reverse scans at all (since initscan doesn't know
which direction the scan will move in), but that shouldn't be a major
problem, since at worst we get the current behavior (aside: can someone
tell me how to force reverse scans so I can test that better?). Another
is that there's a race condition on the shared memory, and that's out of
pure laziness on my part.

This method is really only effective if there is a significant amount of
disk I/O. If the data is coming from the OS buffer cache, the various
scans diverge too much and stop benefiting from each other's shared
buffers.

I tested with shared_buffers=500 and all stats on. I used 60 threads
performing 30 seqscans each in my script ssf.rb (I refer to my
modification as "sequential scan follower" or ssf).
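
The script itself isn't posted here, but a minimal C/libpq sketch of
what such a driver does might look like the following (the connection
string, table name, and query are assumptions; each thread uses its own
connection, which is safe with libpq):

    #include <libpq-fe.h>
    #include <pthread.h>
    #include <stdio.h>

    #define THREADS          60
    #define SCANS_PER_THREAD 30

    static void *
    worker(void *arg)
    {
        PGconn     *conn = PQconnectdb("dbname=test");
        int         i;

        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "%s", PQerrorMessage(conn));
            PQfinish(conn);
            return NULL;
        }
        for (i = 0; i < SCANS_PER_THREAD; i++)
        {
            /* force a full seqscan of the test table */
            PGresult   *res = PQexec(conn, "SELECT count(*) FROM test_ssf");

            PQclear(res);
        }
        PQfinish(conn);
        return NULL;
    }

    int
    main(void)
    {
        pthread_t   threads[THREADS];
        int         i;

        for (i = 0; i < THREADS; i++)
            pthread_create(&threads[i], NULL, worker, NULL);
        for (i = 0; i < THREADS; i++)
            pthread_join(threads[i], NULL);
        return 0;
    }

Build with something like "cc ssf.c -lpq -lpthread".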

Here are some results with my modifications:
$ time ./ssf.rb # my script

real 4m22.476s
user 0m0.389s
sys 0m0.186s

test=# select relpages from pg_class where relname='test_ssf';
relpages
----------
1667
(1 row)

test=# select count(*) from test_ssf;
count
--------
200000
(1 row)

test=# select pg_stat_get_blocks_hit(17232) as hit,
pg_stat_get_blocks_fetched(17232) as total;
hit | total
--------+---------
971503 | 3353963
(1 row)

Or, approx. 29% cache hit (971503 / 3353963 ≈ 0.29).

Here are the results without my modifications:

test=# select relpages from pg_class where relname='test_ssf';
relpages
----------
1667
(1 row)

test=# select count(*) from test_ssf;
count
--------
200000
(1 row)

test=# select pg_stat_get_blocks_hit(17231) as hit,
pg_stat_get_blocks_fetched(17231) as total;
hit | total
--------+---------
199999 | 3353963
(1 row)

Or, approx. 6% cache hit (199999 / 3353963 ≈ 0.06). Note: the oid is
different, because I have two separately initdb'd data directories, one
for the modified version, one for the unmodified 8.0.0.

This is the first time I've really modified the PG source code to do
anything that looked promising, so this is more of a question than
anything else. Is it promising? Is this a potentially good approach? I'm
happy to post more test data and more documentation, and I'd also be
happy to bring the code to production quality. However, before I spend
too much more time on that, I'd like to get a general response from a
3rd party to let me know if I'm off base.

Regards,
Jeff Davis

Attachments:

heapam.c.diff (text/x-patch)
--- postgresql-8.0.0/src/backend/access/heap/heapam.c	2004-12-31 13:59:16.000000000 -0800
+++ postgresql-8.0.0-ssf/src/backend/access/heap/heapam.c	2005-02-24 20:37:24.596626668 -0800
@@ -65,6 +65,50 @@
  * ----------------------------------------------------------------
  */
 
+
+#include <sys/shm.h>
+#include <sys/types.h>
+#include <sys/ipc.h>
+/*
+ Retrieves a location of an in-progress seqscan on relid
+ and returns the page location
+*/
+static BlockNumber get_page_loc(Oid relid) {
+  int shmid;
+  char *shm;
+  Oid shmrelid;
+  BlockNumber loc;
+  shmid = shmget(0x11aa55cc,(sizeof(Oid)+sizeof(BlockNumber)),
+		 0666|IPC_CREAT);
+  shm = shmat(shmid,NULL,0);
+  shmrelid = *((Oid*)shm);
+  loc = *((BlockNumber*)(shm+sizeof(Oid)));
+  shmdt(shm);
+  /*elog(NOTICE,"Getting shm: %u %u",shmrelid,loc);*/
+  if (shmrelid == relid)
+    return loc;
+  else
+    return 0;
+}
+
+/*
+ Places (relid,loc) in a shared mem structure
+*/
+static int report_page_loc(Oid relid, BlockNumber loc) 
+{
+  int shmid;
+  char *shm;
+  shmid = shmget(0x11aa55cc,(sizeof(Oid)+sizeof(BlockNumber)),
+		 0666|IPC_CREAT);
+  shm = shmat(shmid,NULL,0);
+
+  /*elog(NOTICE,"Setting shm: %u %u",relid,loc);*/
+  *((Oid*)shm) = relid;
+  *((BlockNumber*)(shm+sizeof(Oid))) = loc;
+  shmdt(shm);
+  return 0;
+}
+
 /* ----------------
  *		initscan - scan code common to heap_beginscan and heap_rescan
  * ----------------
@@ -85,6 +129,18 @@
 	scan->rs_ctup.t_data = NULL;
 	scan->rs_cbuf = InvalidBuffer;
 
+	/* initialize start page */
+	scan->rs_start_page = get_page_loc(RelationGetRelid(scan->rs_rd));
+	  /*
+	    if(scan->rs_nblocks)
+	  scan->rs_start_page = 5 % scan->rs_nblocks;
+	else
+	  scan->rs_start_page = 0;
+	  */
+	/*elog(NOTICE,"Scan started on relid: %u, pages: %d, page: %d",
+	     RelationGetRelid(scan->rs_rd),
+	     scan->rs_nblocks,
+	     scan->rs_start_page);*/
 	/* we don't have a marked position... */
 	ItemPointerSetInvalid(&(scan->rs_mctid));
 
@@ -115,261 +171,280 @@
 		   Snapshot snapshot,
 		   int nkeys,
 		   ScanKey key,
-		   BlockNumber pages)
+		   BlockNumber pages,
+	   HeapScanDesc scan)
 {
-	ItemId		lpp;
-	Page		dp;
-	BlockNumber page;
-	int			lines;
-	OffsetNumber lineoff;
-	int			linesleft;
-	ItemPointer tid;
-
-	tid = (tuple->t_data == NULL) ? NULL : &(tuple->t_self);
-
-	/*
-	 * debugging stuff
-	 *
-	 * check validity of arguments, here and for other functions too Note: no
-	 * locking manipulations needed--this is a local function
-	 */
+  ItemId		lpp;
+  Page		dp;
+  BlockNumber page;
+  int			lines;
+  OffsetNumber lineoff;
+  int			linesleft;
+  ItemPointer tid;
+
+  tid = (tuple->t_data == NULL) ? NULL : &(tuple->t_self);
+
+  /*
+   * debugging stuff
+   *
+   * check validity of arguments, here and for other functions too Note: no
+   * locking manipulations needed--this is a local function
+   */
 #ifdef	HEAPDEBUGALL
-	if (ItemPointerIsValid(tid))
-		elog(DEBUG2, "heapgettup(%s, tid=0x%x[%d,%d], dir=%d, ...)",
-			 RelationGetRelationName(relation), tid, tid->ip_blkid,
-			 tid->ip_posid, dir);
-	else
-		elog(DEBUG2, "heapgettup(%s, tid=0x%x, dir=%d, ...)",
-			 RelationGetRelationName(relation), tid, dir);
-
-	elog(DEBUG2, "heapgettup(..., b=0x%x, nkeys=%d, key=0x%x", buffer, nkeys, key);
-
-	elog(DEBUG2, "heapgettup: relation(%c)=`%s', %p",
-		 relation->rd_rel->relkind, RelationGetRelationName(relation),
-		 snapshot);
+  if (ItemPointerIsValid(tid))
+    elog(DEBUG2, "heapgettup(%s, tid=0x%x[%d,%d], dir=%d, ...)",
+	 RelationGetRelationName(relation), tid, tid->ip_blkid,
+	 tid->ip_posid, dir);
+  else
+    elog(DEBUG2, "heapgettup(%s, tid=0x%x, dir=%d, ...)",
+	 RelationGetRelationName(relation), tid, dir);
+
+  elog(DEBUG2, "heapgettup(..., b=0x%x, nkeys=%d, key=0x%x", buffer, nkeys, key);
+
+  elog(DEBUG2, "heapgettup: relation(%c)=`%s', %p",
+       relation->rd_rel->relkind, RelationGetRelationName(relation),
+       snapshot);
 #endif   /* HEAPDEBUGALL */
 
-	if (!ItemPointerIsValid(tid))
-	{
-		Assert(!PointerIsValid(tid));
-		tid = NULL;
-	}
-
-	tuple->t_tableOid = relation->rd_id;
-
-	/*
-	 * return null immediately if relation is empty
-	 */
-	if (pages == 0)
-	{
-		if (BufferIsValid(*buffer))
-			ReleaseBuffer(*buffer);
-		*buffer = InvalidBuffer;
-		tuple->t_datamcxt = NULL;
-		tuple->t_data = NULL;
-		return;
-	}
-
-	/*
-	 * calculate next starting lineoff, given scan direction
-	 */
-	if (dir == 0)
-	{
-		/*
-		 * ``no movement'' scan direction: refetch same tuple
-		 */
-		if (tid == NULL)
-		{
-			if (BufferIsValid(*buffer))
-				ReleaseBuffer(*buffer);
-			*buffer = InvalidBuffer;
-			tuple->t_datamcxt = NULL;
-			tuple->t_data = NULL;
-			return;
-		}
-
-		*buffer = ReleaseAndReadBuffer(*buffer,
-									   relation,
-									   ItemPointerGetBlockNumber(tid));
-
-		LockBuffer(*buffer, BUFFER_LOCK_SHARE);
-
-		dp = (Page) BufferGetPage(*buffer);
-		lineoff = ItemPointerGetOffsetNumber(tid);
-		lpp = PageGetItemId(dp, lineoff);
-
-		tuple->t_datamcxt = NULL;
-		tuple->t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
-		tuple->t_len = ItemIdGetLength(lpp);
-		LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
-
-		return;
-	}
-	else if (dir < 0)
-	{
-		/*
-		 * reverse scan direction
-		 */
-		if (tid == NULL)
-		{
-			page = pages - 1;	/* final page */
-		}
-		else
-		{
-			page = ItemPointerGetBlockNumber(tid);		/* current page */
-		}
-
-		Assert(page < pages);
-
-		*buffer = ReleaseAndReadBuffer(*buffer,
-									   relation,
-									   page);
-
-		LockBuffer(*buffer, BUFFER_LOCK_SHARE);
-
-		dp = (Page) BufferGetPage(*buffer);
-		lines = PageGetMaxOffsetNumber(dp);
-		if (tid == NULL)
-		{
-			lineoff = lines;	/* final offnum */
-		}
-		else
-		{
-			lineoff =			/* previous offnum */
-				OffsetNumberPrev(ItemPointerGetOffsetNumber(tid));
-		}
-		/* page and lineoff now reference the physically previous tid */
-	}
+  if (!ItemPointerIsValid(tid))
+    {
+      Assert(!PointerIsValid(tid));
+      tid = NULL;
+    }
+
+  tuple->t_tableOid = relation->rd_id;
+
+  /*
+   * return null immediately if relation is empty
+   */
+  if (pages == 0)
+    {
+      if (BufferIsValid(*buffer))
+	ReleaseBuffer(*buffer);
+      *buffer = InvalidBuffer;
+      tuple->t_datamcxt = NULL;
+      tuple->t_data = NULL;
+      return;
+    }
+
+  /*
+   * calculate next starting lineoff, given scan direction
+   */
+  if (dir == 0)
+    {
+      /*elog(NOTICE,"NoMovement Scan started\n");*/
+      /*
+       * ``no movement'' scan direction: refetch same tuple
+       */
+      if (tid == NULL)
+	{
+	  if (BufferIsValid(*buffer))
+	    ReleaseBuffer(*buffer);
+	  *buffer = InvalidBuffer;
+	  tuple->t_datamcxt = NULL;
+	  tuple->t_data = NULL;
+	  return;
+	}
+
+      *buffer = ReleaseAndReadBuffer(*buffer,
+				     relation,
+				     ItemPointerGetBlockNumber(tid));
+
+      LockBuffer(*buffer, BUFFER_LOCK_SHARE);
+
+      dp = (Page) BufferGetPage(*buffer);
+      lineoff = ItemPointerGetOffsetNumber(tid);
+      lpp = PageGetItemId(dp, lineoff);
+
+      tuple->t_datamcxt = NULL;
+      tuple->t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
+      tuple->t_len = ItemIdGetLength(lpp);
+      LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+
+      return;
+    }
+  else if (dir < 0)
+    {
+      /*
+       * reverse scan direction
+       */
+      if (tid == NULL)
+	{
+	  /*page = pages-1;*/	/* final page */
+	  page = scan->rs_start_page;
+	}
+      else
+	{
+	  page = ItemPointerGetBlockNumber(tid);  /* current page */
+	}
+
+      Assert(page < pages);
+
+      *buffer = ReleaseAndReadBuffer(*buffer,
+				     relation,
+				     page);
+
+      LockBuffer(*buffer, BUFFER_LOCK_SHARE);
+
+      dp = (Page) BufferGetPage(*buffer);
+      lines = PageGetMaxOffsetNumber(dp);
+      if (tid == NULL)
+	{
+	  lineoff = lines;	/* final offnum */
+	}
+      else
+	{
+	  lineoff =			/* previous offnum */
+	    OffsetNumberPrev(ItemPointerGetOffsetNumber(tid));
+	}
+      /* page and lineoff now reference the physically previous tid */
+    }
+  else
+    {
+      /*
+       * forward scan direction
+       */
+      if (tid == NULL)
+	{
+	  page = scan->rs_start_page;     /* first page */
+	  lineoff = FirstOffsetNumber;		/* first offnum */
+	}
+      else
+	{
+	  page = ItemPointerGetBlockNumber(tid);	  /* current page */
+	  lineoff =			/* next offnum */
+	    OffsetNumberNext(ItemPointerGetOffsetNumber(tid));
+	}
+
+      Assert(page < pages);
+
+      *buffer = ReleaseAndReadBuffer(*buffer,
+				     relation,
+				     page);
+
+      LockBuffer(*buffer, BUFFER_LOCK_SHARE);
+
+      dp = (Page) BufferGetPage(*buffer);
+      lines = PageGetMaxOffsetNumber(dp);
+      /* page and lineoff now reference the physically next tid */
+    }
+
+  /* 'dir' is now non-zero */
+
+  /*
+   * calculate line pointer and number of remaining items to check on
+   * this page.
+   */
+  lpp = PageGetItemId(dp, lineoff);
+  if (dir < 0)
+    linesleft = lineoff - 1;
+  else
+    linesleft = lines - lineoff;
+
+  /*
+   * advance the scan until we find a qualifying tuple or run out of
+   * stuff to scan
+   */
+  for (;;)
+    {
+      while (linesleft >= 0)
+	{
+	  if (ItemIdIsUsed(lpp))
+	    {
+	      bool		valid;
+
+	      tuple->t_datamcxt = NULL;
+	      tuple->t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
+	      tuple->t_len = ItemIdGetLength(lpp);
+	      ItemPointerSet(&(tuple->t_self), page, lineoff);
+
+	      /*
+	       * if current tuple qualifies, return it.
+	       */
+	      HeapTupleSatisfies(tuple, relation, *buffer, (PageHeader) dp,
+				 snapshot, nkeys, key, valid);
+	      if (valid)
+		{
+		  LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+		  return;
+		}
+	    }
+
+	  /*
+	   * otherwise move to the next item on the page
+	   */
+	  --linesleft;
+	  if (dir < 0)
+	    {
+	      --lpp;			/* move back in this page's ItemId array */
+	      --lineoff;
+	    }
+	  else
+	    {
+	      ++lpp;			/* move forward in this page's ItemId
+					 * array */
+	      ++lineoff;
+	    }
+	}
+
+      /*
+       * if we get here, it means we've exhausted the items on this page
+       * and it's time to move to the next.
+       */
+      LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+
+		
+      /* advance page, or wrap around */
+      if(dir < 0) {
+	if(page > 0)
+	  page = page - 1;
 	else
-	{
-		/*
-		 * forward scan direction
-		 */
-		if (tid == NULL)
-		{
-			page = 0;			/* first page */
-			lineoff = FirstOffsetNumber;		/* first offnum */
-		}
-		else
-		{
-			page = ItemPointerGetBlockNumber(tid);		/* current page */
-			lineoff =			/* next offnum */
-				OffsetNumberNext(ItemPointerGetOffsetNumber(tid));
-		}
-
-		Assert(page < pages);
-
-		*buffer = ReleaseAndReadBuffer(*buffer,
-									   relation,
-									   page);
-
-		LockBuffer(*buffer, BUFFER_LOCK_SHARE);
-
-		dp = (Page) BufferGetPage(*buffer);
-		lines = PageGetMaxOffsetNumber(dp);
-		/* page and lineoff now reference the physically next tid */
-	}
-
-	/* 'dir' is now non-zero */
-
-	/*
-	 * calculate line pointer and number of remaining items to check on
-	 * this page.
-	 */
-	lpp = PageGetItemId(dp, lineoff);
-	if (dir < 0)
-		linesleft = lineoff - 1;
+	  page = pages;
+      }
+      else {
+	if(page < pages-1)
+	  page = page + 1;
 	else
-		linesleft = lines - lineoff;
+	  page = 0;
+      }
 
-	/*
-	 * advance the scan until we find a qualifying tuple or run out of
-	 * stuff to scan
-	 */
-	for (;;)
+      /*
+       * return NULL if we've exhausted all the pages
+       */
+      if(page == scan->rs_start_page)
+	{
+	  if (BufferIsValid(*buffer))
+	    ReleaseBuffer(*buffer);
+	  *buffer = InvalidBuffer;
+	  tuple->t_datamcxt = NULL;
+	  tuple->t_data = NULL;
+	  return;
+	}
+
+      Assert(page < pages);
+
+      /*elog(NOTICE,"Page on relid %u advanced to %d",
+	RelationGetRelid(scan->rs_rd),page);*/
+      report_page_loc(RelationGetRelid(scan->rs_rd),page);
+      *buffer = ReleaseAndReadBuffer(*buffer,
+				     relation,
+				     page);
+
+      LockBuffer(*buffer, BUFFER_LOCK_SHARE);
+      dp = (Page) BufferGetPage(*buffer);
+      lines = PageGetMaxOffsetNumber((Page) dp);
+      linesleft = lines - 1;
+      if (dir < 0)
+	{
+	  lineoff = lines;
+	  lpp = PageGetItemId(dp, lines);
+	}
+      else
 	{
-		while (linesleft >= 0)
-		{
-			if (ItemIdIsUsed(lpp))
-			{
-				bool		valid;
-
-				tuple->t_datamcxt = NULL;
-				tuple->t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
-				tuple->t_len = ItemIdGetLength(lpp);
-				ItemPointerSet(&(tuple->t_self), page, lineoff);
-
-				/*
-				 * if current tuple qualifies, return it.
-				 */
-				HeapTupleSatisfies(tuple, relation, *buffer, (PageHeader) dp,
-								   snapshot, nkeys, key, valid);
-				if (valid)
-				{
-					LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
-					return;
-				}
-			}
-
-			/*
-			 * otherwise move to the next item on the page
-			 */
-			--linesleft;
-			if (dir < 0)
-			{
-				--lpp;			/* move back in this page's ItemId array */
-				--lineoff;
-			}
-			else
-			{
-				++lpp;			/* move forward in this page's ItemId
-								 * array */
-				++lineoff;
-			}
-		}
-
-		/*
-		 * if we get here, it means we've exhausted the items on this page
-		 * and it's time to move to the next.
-		 */
-		LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
-
-		/*
-		 * return NULL if we've exhausted all the pages
-		 */
-		if ((dir < 0) ? (page == 0) : (page + 1 >= pages))
-		{
-			if (BufferIsValid(*buffer))
-				ReleaseBuffer(*buffer);
-			*buffer = InvalidBuffer;
-			tuple->t_datamcxt = NULL;
-			tuple->t_data = NULL;
-			return;
-		}
-
-		page = (dir < 0) ? (page - 1) : (page + 1);
-
-		Assert(page < pages);
-
-		*buffer = ReleaseAndReadBuffer(*buffer,
-									   relation,
-									   page);
-
-		LockBuffer(*buffer, BUFFER_LOCK_SHARE);
-		dp = (Page) BufferGetPage(*buffer);
-		lines = PageGetMaxOffsetNumber((Page) dp);
-		linesleft = lines - 1;
-		if (dir < 0)
-		{
-			lineoff = lines;
-			lpp = PageGetItemId(dp, lines);
-		}
-		else
-		{
-			lineoff = FirstOffsetNumber;
-			lpp = PageGetItemId(dp, FirstOffsetNumber);
-		}
+	  lineoff = FirstOffsetNumber;
+	  lpp = PageGetItemId(dp, FirstOffsetNumber);
 	}
+    }
 }
 
 
@@ -829,6 +904,7 @@
 	/*
 	 * Note: we depend here on the -1/0/1 encoding of ScanDirection.
 	 */
+
 	heapgettup(scan->rs_rd,
 			   (int) direction,
 			   &(scan->rs_ctup),
@@ -836,7 +912,7 @@
 			   scan->rs_snapshot,
 			   scan->rs_nkeys,
 			   scan->rs_key,
-			   scan->rs_nblocks);
+			   scan->rs_nblocks,scan);
 
 	if (scan->rs_ctup.t_data == NULL && !BufferIsValid(scan->rs_cbuf))
 	{
@@ -1989,7 +2065,7 @@
 				   scan->rs_snapshot,
 				   0,
 				   NULL,
-				   scan->rs_nblocks);
+				   scan->rs_nblocks,scan);
 	}
 }
 
relscan.h.diff (text/x-patch)
--- postgresql-8.0.0/src/include/access/relscan.h	2004-12-31 14:03:21.000000000 -0800
+++ postgresql-8.0.0-ssf/src/include/access/relscan.h	2005-02-22 17:57:27.343559992 -0800
@@ -34,6 +34,7 @@
 	ItemPointerData rs_mctid;	/* marked scan position, if any */
 
 	PgStat_Info rs_pgstat_info; /* statistics collector hook */
+  BlockNumber rs_start_page; /* page where scan started */
 } HeapScanDescData;
 
 typedef HeapScanDescData *HeapScanDesc;
#2 Simon Riggs
simon@2ndquadrant.com
In reply to: Jeff Davis (#1)
Re: idea for concurrent seqscans

On Fri, 2005-02-25 at 00:34 -0800, Jeff Davis wrote:

I had an idea that might improve parallel seqscans on the same relation.

If you have lots of concurrent seqscans going on a large relation, the
cache hit ratio is very low. But since those scans are all reading the
same relation, there may be something to gain by starting a new seqscan
near the page being accessed by an already-in-progress seqscan, and
wrapping back around to that start location. That would make some use of
the shared buffers, which would otherwise just be cache pollution.

This is cool and was on my list of would-like-to-implement features.

It's usually known as Synchronised Scanning. AFAIK it is free of any
patent restriction: it has already been implemented by both Teradata and
RedBrick.

This is the first time I've really modified the PG source code to do
anything that looked promising, so this is more of a question than
anything else. Is it promising? Is this a potentially good approach? I'm
happy to post more test data and more documentation, and I'd also be
happy to bring the code to production quality.

I'll be happy to help you do this, at least for design and code review.

I'll come back later with more detailed comments on your thoughts so
far.

However, before I spend
too much more time on that, I'd like to get a general response from a
3rd party to let me know if I'm off base.

Third party?

Best Regards, Simon Riggs

#3 Jeff Davis
jdavis-pgsql@empires.org
In reply to: Simon Riggs (#2)
Re: idea for concurrent seqscans

On Fri, 2005-02-25 at 13:38 +0000, Simon Riggs wrote:

On Fri, 2005-02-25 at 00:34 -0800, Jeff Davis wrote:

I had an idea that might improve parallel seqscans on the same relation.

If you have lots of concurrent seqscans going on a large relation, the
cache hit ratio is very low. But since those scans are all reading the
same relation, there may be something to gain by starting a new seqscan
near the page being accessed by an already-in-progress seqscan, and
wrapping back around to that start location. That would make some use of
the shared buffers, which would otherwise just be cache pollution.

This is cool and was on my list of would-like-to-implement features.

It's usually known as Synchronised Scanning. AFAIK it is free of any
patent restriction: it has already been implemented by both Teradata and
RedBrick.

This is the first time I've really modified the PG source code to do
anything that looked promising, so this is more of a question than
anything else. Is it promising? Is this a potentially good approach? I'm
happy to post more test data and more documentation, and I'd also be
happy to bring the code to production quality.

I'll be happy to help you do this, at least for design and code review.

I'll come back later with more detailed comments on your thoughts so
far.

Good to hear. I'll clean up the code and document some more tests. Two
questions come to mind right now:
(1) Do we care about reverse scans being done with synchronized
scanning? If so, is there a good way to know in advance whether it is
going to be a forward or reverse scan (i.e. before heap_getnext())?
(2) Where is the appropriate place to put the page location of an
in-progress scan? Are there other pieces of shared memory that aren't
disk buffers that I should be making use of?

However, before I spend
too much more time on that, I'd like to get a general response from a
3rd party to let me know if I'm off base.

Third party?

A 2nd party? Anyone else? That was a typo :)

Regards,
Jeff Davis

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Davis (#3)
Re: idea for concurrent seqscans

Jeff Davis <jdavis-pgsql@empires.org> writes:

(1) Do we care about reverse scans being done with synchronized
scanning? If so, is there a good way to know in advance whether it is
going to be a forward or reverse scan (i.e. before heap_getnext())?

There are no reverse heapscans --- the only case where you'll see
direction = backwards is while backing up a cursor with FETCH BACKWARD.
I don't think you need to optimize that case.

What I'm more concerned about is your use of shared memory. I didn't
have time to look at the patch, but how are you determining an upper
bound on the amount of memory you need? What sort of locking and
contention issues will there be?

Another point is that this will render the results from heapscans
unstable, since different executions of the same query might start
at different points. This would for example probably break many
of the regression tests. We can deal with that if we have to, but
it raises the bar of how much benefit I'd want to see from the patch.

One detail that might or might not be significant: different scans are
very likely to have slightly different ideas about where the end of the
table is, since they determine this with an lseek(SEEK_END) at the
instant they start the scan. I don't think this invalidates your idea
but you need to watch out for corner-case bugs in the coding.

regards, tom lane

#5 Jeff Davis
jdavis-pgsql@empires.org
In reply to: Tom Lane (#4)
Re: idea for concurrent seqscans

On Fri, 2005-02-25 at 12:54 -0500, Tom Lane wrote:

Jeff Davis <jdavis-pgsql@empires.org> writes:

(1) Do we care about reverse scans being done with synchronized
scanning? If so, is there a good way to know in advance whether it is
going to be a forward or reverse scan (i.e. before heap_getnext())?

There are no reverse heapscans --- the only case where you'll see
direction = backwards is while backing up a cursor with FETCH BACKWARD.
I don't think you need to optimize that case.

Ok, I was wondering about that.

What I'm more concerned about is your use of shared memory. I didn't
have time to look at the patch, but how are you determining an upper
bound on the amount of memory you need? What sort of locking and
contention issues will there be?

Right now a scanning backend puts the page it's scanning into shared
memory when it gets a new page (so it's not every tuple). I haven't
determined whether this will be a major point of locking contention.
However, one possible implementation seems to solve both problems at
once:
Let's say we just had a static hash table of size
100*(sizeof(Oid)+sizeof(BlockNumber)) (to hold the relation's oid and
the page number it's currently scanning). The relid would predetermine
the placement in the table. If there's a collision, overwrite. I don't
think much is lost in that case, unless, for example, two tables in an
important join have oids that hash to the same value. In that case the
effectiveness of synchronized scanning is lost, but it's no worse than
the current behavior.
Let's say we didn't use any locks at all. Are there any real dangers
there? If there's a race, and one backend gets some garbage data, it can
just say "this is out of bounds, start the scan at 0". Since it's a
static hash table, we don't have to worry about following a bad pointer,
etc. If that looks like it will be a problem, I can test with locking
also to see what kind of contention there is.

The current patch I sent was very much a proof of concept, but all it
did was have a shared mem segment of size 8 bytes (only holds info for
one relid at a time). That would probably be somewhat effective in many
cases, but of course we want it larger than that (800? 8KB?).

In short, I tried to overcome these problems with simplicity. Where
simplicity doesn't work I default to starting the scan at 0. Hopefully
those non-simple cases (like hash collisions and shared memory races)
are rare enough that we don't lose all that we gain.
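
A minimal sketch of that table (illustrative names, not the patch's
actual code; Oid and BlockNumber are the usual PG typedefs):

    #define SS_TABLE_SIZE 100

    typedef struct
    {
        Oid         relid;
        BlockNumber loc;
    } ss_scan_loc_t;

    /* lives in shared memory, zeroed at startup */
    static ss_scan_loc_t ss_table[SS_TABLE_SIZE];

    static void
    ss_report(Oid relid, BlockNumber loc)
    {
        ss_scan_loc_t *slot = &ss_table[relid % SS_TABLE_SIZE];

        slot->relid = relid;    /* on a collision, just overwrite */
        slot->loc = loc;
    }

    static BlockNumber
    ss_lookup(Oid relid, BlockNumber nblocks)
    {
        ss_scan_loc_t slot = ss_table[relid % SS_TABLE_SIZE];

        /* wrong relation, or a torn/garbage value: start at page 0 */
        if (slot.relid != relid || slot.loc >= nblocks)
            return 0;
        return slot.loc;
    }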

Another point is that this will render the results from heapscans
unstable, since different executions of the same query might start
at different points. This would for example probably break many
of the regression tests. We can deal with that if we have to, but
it raises the bar of how much benefit I'd want to see from the patch.

I didn't consider that. Is there a reason the regression tests assume
the results will be returned in a certain order (or a consistent order)?

One detail that might or might not be significant: different scans are
very likely to have slightly different ideas about where the end of the
table is, since they determine this with an lseek(SEEK_END) at the
instant they start the scan. I don't think this invalidates your idea
but you need to watch out for corner-case bugs in the coding.

I only see that as an issue in initscan(), where it sets the start page.
A simple bounds check would cure that, no? If it was out of bounds, set
the start page to zero, and we didn't lose much. I need a bounds check
there anyway, since the data we get from shared memory needs to be
validated. That bounds check would be comparing against the current
backend's scan->rs_nblocks, which should be the correct number for that
backend.
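
In code that check is tiny; a sketch of what the next revision of the
patch does in initscan():

    /* clamp the suggested start page to this backend's view of the
     * table; anything out of range falls back to a scan from block 0 */
    if (scan->rs_start_page >= scan->rs_nblocks)
        scan->rs_start_page = 0;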

Regards,
Jeff Davis

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Davis (#5)
Re: idea for concurrent seqscans

Jeff Davis <jdavis-pgsql@empires.org> writes:

I didn't consider that. Is there a reason the regression tests assume
the results will be returned in a certain order (or a consistent order)?

We use diff as the checking tool.

regards, tom lane

#7 Jim C. Nasby
decibel@decibel.org
In reply to: Tom Lane (#6)
Re: idea for concurrent seqscans

On Fri, Feb 25, 2005 at 01:30:57PM -0500, Tom Lane wrote:

Jeff Davis <jdavis-pgsql@empires.org> writes:

I didn't consider that. Is there a reason the regression tests assume
the results will be returned in a certain order (or a consistent order)?

We use diff as the checking tool.

Doesn't the SQL spec specifically state that the only time you'll get
results in a deterministic order is if you use ORDER BY? Assuming
otherwise seems a bad idea (though at least in the case of testing it
makes the test more strenuous rather than less...)
--
Jim C. Nasby, Database Consultant decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

#8 Jeff Davis
jdavis-pgsql@empires.org
In reply to: Tom Lane (#6)
Re: idea for concurrent seqscans

On Fri, 2005-02-25 at 13:30 -0500, Tom Lane wrote:

Jeff Davis <jdavis-pgsql@empires.org> writes:

I didn't consider that. Is there a reason the regression tests assume
the results will be returned in a certain order (or a consistent order)?

We use diff as the checking tool.

Well, that does make testing more difficult, or at least requires extra
work to make the regression tests understand the results better.

I'll submit a better patch, and then if everyone decides it's worth the
hassle with the regression tests, we can use it in 8.1. Some more
testing is required to see if the results are really as good as we hope.

Regards,
Jeff Davis

#9 Jeff Davis
jdavis-pgsql@empires.org
In reply to: Jim C. Nasby (#7)
Re: idea for concurrent seqscans

On Fri, 2005-02-25 at 18:03 -0600, Jim C. Nasby wrote:

On Fri, Feb 25, 2005 at 01:30:57PM -0500, Tom Lane wrote:

Jeff Davis <jdavis-pgsql@empires.org> writes:

I didn't consider that. Is there a reason the regression tests assume
the results will be returned in a certain order (or a consistent order)?

We use diff as the checking tool.

Doesn't the SQL spec specifically state that the only time you'll get
results in a deterministic order is if you use ORDER BY? Assuming
otherwise seems a bad idea (though at least in the case of testing it
makes the test more strenuous rather than less...)

True, that was my reasoning when I proposed synchronized scanning.

Keep in mind that this is a criticism of only the regression tests, not
the RDBMS itself.

I don't know much about the regression tests, so maybe it's impractical
to not assume consistent order. I'm sure the developers will vote one
way or the other. I hate to throw away a potential performance boost,
but I also hate to burden the developers with rewriting a lot of
regression tests when their time could be better spent elsewhere.

Regards,
Jeff Davis

#10 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Jim C. Nasby (#7)
Re: idea for concurrent seqscans

Jim C. Nasby wrote:

On Fri, Feb 25, 2005 at 01:30:57PM -0500, Tom Lane wrote:

Jeff Davis <jdavis-pgsql@empires.org> writes:

I didn't consider that. Is there a reason the regression tests assume
the results will be returned in a certain order (or a consistent order)?

We use diff as the checking tool.

Doesn't the SQL spec specifically state that the only time you'll get
results in a deterministic order is if you use ORDER BY? Assuming
otherwise seems a bad idea (though at least in the case of testing it
makes the test more strenuous rather than less...)

The only trick I can think of is to use SELECT ... INTO TEMPORARY tab
... ORDER BY and then use COPY to dump the table. It will then dump in
the order of the ORDER BY.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#11 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Bruce Momjian (#10)
Re: idea for concurrent seqscans

Sorry, please disregard my ramblings. I thought it was a different
question.

---------------------------------------------------------------------------

pgman wrote:

Jim C. Nasby wrote:

On Fri, Feb 25, 2005 at 01:30:57PM -0500, Tom Lane wrote:

Jeff Davis <jdavis-pgsql@empires.org> writes:

I didn't consider that. Is there a reason the regression tests assume
the results will be returned in a certain order (or a consistent order)?

We use diff as the checking tool.

Doesn't the SQL spec specifically state that the only time you'll get
results in a deterministic order is if you use ORDER BY? Assuming
otherwise seems a bad idea (though at least in the case of testing it
makes the test more strenuous rather than less...)

The only trick I can think of is to use SELECT ... INTO TEMPORARY tab
... ORDER BY and then use COPY to dump the table. It will then dump in
the order of the ORDER BY.

-- 
Bruce Momjian                        |  http://candle.pha.pa.us
pgman@candle.pha.pa.us               |  (610) 359-1001
+  If your life is a hard drive,     |  13 Roberts Road
+  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#12 Jim C. Nasby
decibel@decibel.org
In reply to: Jeff Davis (#9)
Re: idea for concurrent seqscans

On Fri, Feb 25, 2005 at 04:30:17PM -0800, Jeff Davis wrote:

On Fri, 2005-02-25 at 18:03 -0600, Jim C. Nasby wrote:

On Fri, Feb 25, 2005 at 01:30:57PM -0500, Tom Lane wrote:

Jeff Davis <jdavis-pgsql@empires.org> writes:

I didn't consider that. Is there a reason the regression tests assume
the results will be returned in a certain order (or a consistent order)?

We use diff as the checking tool.

Doesn't the SQL spec specifically state that the only time you'll get
results in a deterministic order is if you use ORDER BY? Assuming
otherwise seems a bad idea (though at least in the case of testing it
makes the test more strenuous rather than less...)

True, that was my reasoning when I proposed synchronized scanning.

Keep in mind that this is a criticism of only the regression tests, not
the RDBMS itself.

I don't know much about the regression tests, so maybe it's impractical
to not assume consistent order. I'm sure the developers will vote one
way or the other. I hate to throw away a potential performance boost,
but I also hate to burden the developers with rewriting a lot of
regression tests when their time could be better spent elsewhere.

Certainly, but I suspect it's just a matter of adding ORDER BY to
everything, which just about anyone (even myself!) should be able to do.
--
Jim C. Nasby, Database Consultant decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim C. Nasby (#12)
Re: idea for concurrent seqscans

"Jim C. Nasby" <decibel@decibel.org> writes:

but I also hate to burden the developers with rewriting a lot of
regression tests when their time could be better spent elsewhere.

Certainly, but I suspect it's just a matter of adding ORDER BY to
everything, which just about anyone (even myself!) should be able to do.

Performance is not the issue; test coverage, however, is an issue.
See the comment at the end of
http://developer.postgresql.org/docs/postgres/regress-evaluation.html#AEN22383

regards, tom lane

#14 Jim C. Nasby
decibel@decibel.org
In reply to: Tom Lane (#13)
Re: idea for concurrent seqscans

On Fri, Feb 25, 2005 at 11:51:40PM -0500, Tom Lane wrote:

"Jim C. Nasby" <decibel@decibel.org> writes:

but I also hate to burden the developers with rewriting a lot of
regression tests when their time could be better spent elsewhere.

Certainly, but I suspect it's just a matter of adding ORDER BY to
everything, which just about anyone (even myself!) should be able to do.

Performance is not the issue; test coverage, however, is an issue.
See the comment at the end of
http://developer.postgresql.org/docs/postgres/regress-evaluation.html#AEN22383

Assuming you're talking about "You might wonder why we don't order all
the regression test queries explicitly to get rid of this issue once and
for all. The reason is that that would make the regression tests less
useful, not more, since they'd tend to exercise query plan types that
produce ordered results to the exclusion of those that don't.", good
point. I can think of 2 ways around this:

1) Select into a temptable, then select out of it with an order by

2) Run the output through sort before doing the diff

Is there any reason one of these wouldn't work?
--
Jim C. Nasby, Database Consultant decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

#15 Jeff Davis
jdavis-pgsql@empires.org
In reply to: Jeff Davis (#1)
3 attachment(s)
Re: idea for concurrent seqscans

I have a newer version of my Synchronized Scanning patch which hopefully
makes it closer to a real patch, the first one was more of a proof of
concept.

DONE:
* I added a proper bounds check for the result it gets from shared
memory.
* I expanded the shared memory to be a static hash table (currently set
to a size of 8KB, but that is a one line change if that's too big). Now
it can keep track of many scans on many tables at once.

TODO:
* More testing. I plan to get some more tests up on Monday, and when I
start to see the best uses for this patch, I will also try to benchmark
against MySQL or something else. I'm seeing some good preliminary
results in cache hit rate (which is much higher in some cases), but I
need to demonstrate an actual decrease in runtime.
* Right now the shared mem isn't destroyed when the postmaster shuts
down.
* This patch still makes no use of locks at all. If someone thinks locks
are required, please let me know. Currently, if there is inconsistent
data in the shared memory area the worst that can happen is no worse
than the current behavior.

Regards,
Jeff Davis

Attachments:

relscan.h.diff (text/x-patch)
--- postgresql-8.0.0/src/include/access/relscan.h	2004-12-31 14:03:21.000000000 -0800
+++ postgresql-8.0.0-ss/src/include/access/relscan.h	2005-02-25 17:37:54.398197789 -0800
@@ -34,6 +34,7 @@
 	ItemPointerData rs_mctid;	/* marked scan position, if any */
 
 	PgStat_Info rs_pgstat_info; /* statistics collector hook */
+        BlockNumber rs_start_page; /* page where this scan started */
 } HeapScanDescData;
 
 typedef HeapScanDescData *HeapScanDesc;
heapam.c.diff (text/x-patch)
--- postgresql-8.0.0/src/backend/access/heap/heapam.c	2004-12-31 13:59:16.000000000 -0800
+++ postgresql-8.0.0-ss/src/backend/access/heap/heapam.c	2005-02-25 23:49:15.049187562 -0800
@@ -65,6 +65,89 @@
  * ----------------------------------------------------------------
  */
 
+
+
+/* ss_get_startloc
+ *   returns the current location of an
+ *   in-progress scan. 
+ *   WARNING: this function is subject to a shared
+ *            memory race which could result in
+ *            returning a page location which doesn't
+ *            exist. In addition, another backend may
+ *            see a different number of relation pages and
+ *            may report a page which does not exist
+ *            in this backend. Therefore, the caller
+ *            should ALWAYS bounds-check this result
+ *            and have a default start page if this
+ *            function's return value is out of bounds.
+ */
+static BlockNumber ss_get_startloc(Oid relid) {
+  int shmid;
+  char *shm;
+  ss_scan_loc_t scanloc;
+  int offset;
+
+  offset = ss_hash_relid(relid)*sizeof(ss_scan_loc_t);
+  if((shmid = shmget(SS_SHM_KEY,
+		     SS_HASH_TABLE_SIZE*(sizeof(ss_scan_loc_t)),
+		     0666)) < 0)
+    return -1;
+  if((shm = shmat(shmid,NULL,SHM_RDONLY)) < 0)
+    return -1;
+
+  scanloc = *((ss_scan_loc_t*)(shm+offset));
+
+  if(shmdt(shm) < 0)
+    return -1;
+
+  return (scanloc.relid == relid) ? scanloc.loc : 0;
+}
+
+/* ss_report_loc
+ *   stores the relid and location in a static hash
+ *   table. Collisions simply overwrite.
+ *   WARNING: This function is subject to a shared
+ *            memory race. The data stored in this
+ *            table should not be relied upon! Rather,
+ *            it should be used as a suggested start
+ *            location for future scans and those scans
+ *            should start at a default location if this
+ *            data is out of bounds.
+ */
+static int ss_report_loc(Oid relid, BlockNumber loc) 
+{
+  int shmid;
+  char *shm;
+  ss_scan_loc_t scanloc;
+  int offset;
+
+  scanloc.relid = relid;
+  scanloc.loc = loc;
+  offset = ss_hash_relid(relid);
+  if((shmid = shmget(SS_SHM_KEY,
+		     SS_HASH_TABLE_SIZE*(sizeof(ss_scan_loc_t)),
+		     0666|IPC_CREAT)) < 0)
+    return -1;
+  if((shm = shmat(shmid,NULL,0)) < 0)
+    return -1;
+
+  *((ss_scan_loc_t*)(shm+offset)) = scanloc;
+
+  if(shmdt(shm) < 0)
+    return -1;
+
+  return 0;
+}
+
+/* ss_hash_relid
+ *   Simple hash function for use in ss_get_startloc()
+ *   and ss_report_loc()
+ */
+static int ss_hash_relid(Oid relid)
+{
+  return relid % SS_HASH_TABLE_SIZE;
+}
+
 /* ----------------
  *		initscan - scan code common to heap_beginscan and heap_rescan
  * ----------------
@@ -85,6 +168,22 @@
 	scan->rs_ctup.t_data = NULL;
 	scan->rs_cbuf = InvalidBuffer;
 
+	/* Initialize start page. 
+	 *  This sets the scan to start near where an
+	 *  already-in-progress scan is taking place to 
+	 *  make better use of shared buffers. If it
+	 *  can't find a valid place to start, it starts
+	 *  at 0.
+	 *  This is called "Synchronized Scanning"
+	 */
+	scan->rs_start_page = ss_get_startloc(RelationGetRelid(scan->rs_rd));
+	/* Bounds checking for this return value is required
+	 * because the data in shared mem cannot be relied upon.
+	 */
+	if(scan->rs_start_page >= scan->rs_nblocks ||
+	   scan->rs_start_page <  0)
+	  scan->rs_start_page = 0;
+
 	/* we don't have a marked position... */
 	ItemPointerSetInvalid(&(scan->rs_mctid));
 
@@ -115,261 +214,283 @@
 		   Snapshot snapshot,
 		   int nkeys,
 		   ScanKey key,
-		   BlockNumber pages)
+		   BlockNumber pages,
+	   HeapScanDesc scan)
 {
-	ItemId		lpp;
-	Page		dp;
-	BlockNumber page;
-	int			lines;
-	OffsetNumber lineoff;
-	int			linesleft;
-	ItemPointer tid;
-
-	tid = (tuple->t_data == NULL) ? NULL : &(tuple->t_self);
-
-	/*
-	 * debugging stuff
-	 *
-	 * check validity of arguments, here and for other functions too Note: no
-	 * locking manipulations needed--this is a local function
-	 */
+  ItemId		lpp;
+  Page		dp;
+  BlockNumber page;
+  int			lines;
+  OffsetNumber lineoff;
+  int			linesleft;
+  ItemPointer tid;
+
+  tid = (tuple->t_data == NULL) ? NULL : &(tuple->t_self);
+
+  /*
+   * debugging stuff
+   *
+   * check validity of arguments, here and for other functions too Note: no
+   * locking manipulations needed--this is a local function
+   */
 #ifdef	HEAPDEBUGALL
-	if (ItemPointerIsValid(tid))
-		elog(DEBUG2, "heapgettup(%s, tid=0x%x[%d,%d], dir=%d, ...)",
-			 RelationGetRelationName(relation), tid, tid->ip_blkid,
-			 tid->ip_posid, dir);
-	else
-		elog(DEBUG2, "heapgettup(%s, tid=0x%x, dir=%d, ...)",
-			 RelationGetRelationName(relation), tid, dir);
-
-	elog(DEBUG2, "heapgettup(..., b=0x%x, nkeys=%d, key=0x%x", buffer, nkeys, key);
-
-	elog(DEBUG2, "heapgettup: relation(%c)=`%s', %p",
-		 relation->rd_rel->relkind, RelationGetRelationName(relation),
-		 snapshot);
+  if (ItemPointerIsValid(tid))
+    elog(DEBUG2, "heapgettup(%s, tid=0x%x[%d,%d], dir=%d, ...)",
+	 RelationGetRelationName(relation), tid, tid->ip_blkid,
+	 tid->ip_posid, dir);
+  else
+    elog(DEBUG2, "heapgettup(%s, tid=0x%x, dir=%d, ...)",
+	 RelationGetRelationName(relation), tid, dir);
+
+  elog(DEBUG2, "heapgettup(..., b=0x%x, nkeys=%d, key=0x%x", buffer, nkeys, key);
+
+  elog(DEBUG2, "heapgettup: relation(%c)=`%s', %p",
+       relation->rd_rel->relkind, RelationGetRelationName(relation),
+       snapshot);
 #endif   /* HEAPDEBUGALL */
 
-	if (!ItemPointerIsValid(tid))
-	{
-		Assert(!PointerIsValid(tid));
-		tid = NULL;
-	}
-
-	tuple->t_tableOid = relation->rd_id;
-
-	/*
-	 * return null immediately if relation is empty
-	 */
-	if (pages == 0)
-	{
-		if (BufferIsValid(*buffer))
-			ReleaseBuffer(*buffer);
-		*buffer = InvalidBuffer;
-		tuple->t_datamcxt = NULL;
-		tuple->t_data = NULL;
-		return;
-	}
-
-	/*
-	 * calculate next starting lineoff, given scan direction
-	 */
-	if (dir == 0)
-	{
-		/*
-		 * ``no movement'' scan direction: refetch same tuple
-		 */
-		if (tid == NULL)
-		{
-			if (BufferIsValid(*buffer))
-				ReleaseBuffer(*buffer);
-			*buffer = InvalidBuffer;
-			tuple->t_datamcxt = NULL;
-			tuple->t_data = NULL;
-			return;
-		}
-
-		*buffer = ReleaseAndReadBuffer(*buffer,
-									   relation,
-									   ItemPointerGetBlockNumber(tid));
-
-		LockBuffer(*buffer, BUFFER_LOCK_SHARE);
-
-		dp = (Page) BufferGetPage(*buffer);
-		lineoff = ItemPointerGetOffsetNumber(tid);
-		lpp = PageGetItemId(dp, lineoff);
-
-		tuple->t_datamcxt = NULL;
-		tuple->t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
-		tuple->t_len = ItemIdGetLength(lpp);
-		LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
-
-		return;
-	}
-	else if (dir < 0)
-	{
-		/*
-		 * reverse scan direction
-		 */
-		if (tid == NULL)
-		{
-			page = pages - 1;	/* final page */
-		}
-		else
-		{
-			page = ItemPointerGetBlockNumber(tid);		/* current page */
-		}
-
-		Assert(page < pages);
-
-		*buffer = ReleaseAndReadBuffer(*buffer,
-									   relation,
-									   page);
-
-		LockBuffer(*buffer, BUFFER_LOCK_SHARE);
-
-		dp = (Page) BufferGetPage(*buffer);
-		lines = PageGetMaxOffsetNumber(dp);
-		if (tid == NULL)
-		{
-			lineoff = lines;	/* final offnum */
-		}
-		else
-		{
-			lineoff =			/* previous offnum */
-				OffsetNumberPrev(ItemPointerGetOffsetNumber(tid));
-		}
-		/* page and lineoff now reference the physically previous tid */
-	}
+  if (!ItemPointerIsValid(tid))
+    {
+      Assert(!PointerIsValid(tid));
+      tid = NULL;
+    }
+
+  tuple->t_tableOid = relation->rd_id;
+
+  /*
+   * return null immediately if relation is empty
+   */
+  if (pages == 0)
+    {
+      if (BufferIsValid(*buffer))
+	ReleaseBuffer(*buffer);
+      *buffer = InvalidBuffer;
+      tuple->t_datamcxt = NULL;
+      tuple->t_data = NULL;
+      return;
+    }
+
+  /*
+   * calculate next starting lineoff, given scan direction
+   */
+  if (dir == 0)
+    {
+      /*
+       * ``no movement'' scan direction: refetch same tuple
+       */
+      if (tid == NULL)
+	{
+	  if (BufferIsValid(*buffer))
+	    ReleaseBuffer(*buffer);
+	  *buffer = InvalidBuffer;
+	  tuple->t_datamcxt = NULL;
+	  tuple->t_data = NULL;
+	  return;
+	}
+
+      *buffer = ReleaseAndReadBuffer(*buffer,
+				     relation,
+				     ItemPointerGetBlockNumber(tid));
+
+      LockBuffer(*buffer, BUFFER_LOCK_SHARE);
+
+      dp = (Page) BufferGetPage(*buffer);
+      lineoff = ItemPointerGetOffsetNumber(tid);
+      lpp = PageGetItemId(dp, lineoff);
+
+      tuple->t_datamcxt = NULL;
+      tuple->t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
+      tuple->t_len = ItemIdGetLength(lpp);
+      LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+
+      return;
+    }
+  else if (dir < 0)
+    {
+      /*
+       * reverse scan direction
+       */
+      if (tid == NULL)
+	{
+	  /*page = pages-1;*/	/* final page */
+	  page = scan->rs_start_page;
+	}
+      else
+	{
+	  page = ItemPointerGetBlockNumber(tid);  /* current page */
+	}
+
+      Assert(page < pages);
+
+      *buffer = ReleaseAndReadBuffer(*buffer,
+				     relation,
+				     page);
+
+      LockBuffer(*buffer, BUFFER_LOCK_SHARE);
+
+      dp = (Page) BufferGetPage(*buffer);
+      lines = PageGetMaxOffsetNumber(dp);
+      if (tid == NULL)
+	{
+	  lineoff = lines;	/* final offnum */
+	}
+      else
+	{
+	  lineoff =			/* previous offnum */
+	    OffsetNumberPrev(ItemPointerGetOffsetNumber(tid));
+	}
+      /* page and lineoff now reference the physically previous tid */
+    }
+  else
+    {
+      /*
+       * forward scan direction
+       */
+      if (tid == NULL)
+	{
+	  page = scan->rs_start_page;     /* first page */
+	  lineoff = FirstOffsetNumber;		/* first offnum */
+	}
+      else
+	{
+	  page = ItemPointerGetBlockNumber(tid);	  /* current page */
+	  lineoff =			/* next offnum */
+	    OffsetNumberNext(ItemPointerGetOffsetNumber(tid));
+	}
+
+      Assert(page < pages);
+
+      *buffer = ReleaseAndReadBuffer(*buffer,
+				     relation,
+				     page);
+
+      LockBuffer(*buffer, BUFFER_LOCK_SHARE);
+
+      dp = (Page) BufferGetPage(*buffer);
+      lines = PageGetMaxOffsetNumber(dp);
+      /* page and lineoff now reference the physically next tid */
+    }
+
+  /* 'dir' is now non-zero */
+
+  /*
+   * calculate line pointer and number of remaining items to check on
+   * this page.
+   */
+  lpp = PageGetItemId(dp, lineoff);
+  if (dir < 0)
+    linesleft = lineoff - 1;
+  else
+    linesleft = lines - lineoff;
+
+  /*
+   * advance the scan until we find a qualifying tuple or run out of
+   * stuff to scan
+   */
+  for (;;)
+    {
+      while (linesleft >= 0)
+	{
+	  if (ItemIdIsUsed(lpp))
+	    {
+	      bool		valid;
+
+	      tuple->t_datamcxt = NULL;
+	      tuple->t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
+	      tuple->t_len = ItemIdGetLength(lpp);
+	      ItemPointerSet(&(tuple->t_self), page, lineoff);
+
+	      /*
+	       * if current tuple qualifies, return it.
+	       */
+	      HeapTupleSatisfies(tuple, relation, *buffer, (PageHeader) dp,
+				 snapshot, nkeys, key, valid);
+	      if (valid)
+		{
+		  LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+		  return;
+		}
+	    }
+
+	  /*
+	   * otherwise move to the next item on the page
+	   */
+	  --linesleft;
+	  if (dir < 0)
+	    {
+	      --lpp;			/* move back in this page's ItemId array */
+	      --lineoff;
+	    }
+	  else
+	    {
+	      ++lpp;			/* move forward in this page's ItemId
+					 * array */
+	      ++lineoff;
+	    }
+	}
+
+      /*
+       * if we get here, it means we've exhausted the items on this page
+       * and it's time to move to the next.
+       */
+      LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+
+		
+      /* advance page, or wrap around */
+      if(dir < 0) {
+	if(page > 0)
+	  page = page - 1;
 	else
-	{
-		/*
-		 * forward scan direction
-		 */
-		if (tid == NULL)
-		{
-			page = 0;			/* first page */
-			lineoff = FirstOffsetNumber;		/* first offnum */
-		}
-		else
-		{
-			page = ItemPointerGetBlockNumber(tid);		/* current page */
-			lineoff =			/* next offnum */
-				OffsetNumberNext(ItemPointerGetOffsetNumber(tid));
-		}
-
-		Assert(page < pages);
-
-		*buffer = ReleaseAndReadBuffer(*buffer,
-									   relation,
-									   page);
-
-		LockBuffer(*buffer, BUFFER_LOCK_SHARE);
-
-		dp = (Page) BufferGetPage(*buffer);
-		lines = PageGetMaxOffsetNumber(dp);
-		/* page and lineoff now reference the physically next tid */
-	}
-
-	/* 'dir' is now non-zero */
-
-	/*
-	 * calculate line pointer and number of remaining items to check on
-	 * this page.
-	 */
-	lpp = PageGetItemId(dp, lineoff);
-	if (dir < 0)
-		linesleft = lineoff - 1;
+	  page = pages;
+      }
+      else {
+	if(page < pages-1)
+	  page = page + 1;
 	else
-		linesleft = lines - lineoff;
+	  page = 0;
+      }
 
-	/*
-	 * advance the scan until we find a qualifying tuple or run out of
-	 * stuff to scan
-	 */
-	for (;;)
+      /*
+       * return NULL if we've exhausted all the pages
+       */
+      if(page == scan->rs_start_page)
+	{
+	  if (BufferIsValid(*buffer))
+	    ReleaseBuffer(*buffer);
+	  *buffer = InvalidBuffer;
+	  tuple->t_datamcxt = NULL;
+	  tuple->t_data = NULL;
+	  return;
+	}
+
+      Assert(page < pages);
+
+      /*
+       * This reports the current page being accessed,
+       *  so that other backends can start scanning where
+       *  shared buffers are more likely to be hit.
+       *  This is called "Synchronized Scanning".
+       */
+      ss_report_loc(RelationGetRelid(scan->rs_rd),page);
+      *buffer = ReleaseAndReadBuffer(*buffer,
+				     relation,
+				     page);
+
+      LockBuffer(*buffer, BUFFER_LOCK_SHARE);
+      dp = (Page) BufferGetPage(*buffer);
+      lines = PageGetMaxOffsetNumber((Page) dp);
+      linesleft = lines - 1;
+      if (dir < 0)
+	{
+	  lineoff = lines;
+	  lpp = PageGetItemId(dp, lines);
+	}
+      else
 	{
-		while (linesleft >= 0)
-		{
-			if (ItemIdIsUsed(lpp))
-			{
-				bool		valid;
-
-				tuple->t_datamcxt = NULL;
-				tuple->t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
-				tuple->t_len = ItemIdGetLength(lpp);
-				ItemPointerSet(&(tuple->t_self), page, lineoff);
-
-				/*
-				 * if current tuple qualifies, return it.
-				 */
-				HeapTupleSatisfies(tuple, relation, *buffer, (PageHeader) dp,
-								   snapshot, nkeys, key, valid);
-				if (valid)
-				{
-					LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
-					return;
-				}
-			}
-
-			/*
-			 * otherwise move to the next item on the page
-			 */
-			--linesleft;
-			if (dir < 0)
-			{
-				--lpp;			/* move back in this page's ItemId array */
-				--lineoff;
-			}
-			else
-			{
-				++lpp;			/* move forward in this page's ItemId
-								 * array */
-				++lineoff;
-			}
-		}
-
-		/*
-		 * if we get here, it means we've exhausted the items on this page
-		 * and it's time to move to the next.
-		 */
-		LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
-
-		/*
-		 * return NULL if we've exhausted all the pages
-		 */
-		if ((dir < 0) ? (page == 0) : (page + 1 >= pages))
-		{
-			if (BufferIsValid(*buffer))
-				ReleaseBuffer(*buffer);
-			*buffer = InvalidBuffer;
-			tuple->t_datamcxt = NULL;
-			tuple->t_data = NULL;
-			return;
-		}
-
-		page = (dir < 0) ? (page - 1) : (page + 1);
-
-		Assert(page < pages);
-
-		*buffer = ReleaseAndReadBuffer(*buffer,
-									   relation,
-									   page);
-
-		LockBuffer(*buffer, BUFFER_LOCK_SHARE);
-		dp = (Page) BufferGetPage(*buffer);
-		lines = PageGetMaxOffsetNumber((Page) dp);
-		linesleft = lines - 1;
-		if (dir < 0)
-		{
-			lineoff = lines;
-			lpp = PageGetItemId(dp, lines);
-		}
-		else
-		{
-			lineoff = FirstOffsetNumber;
-			lpp = PageGetItemId(dp, FirstOffsetNumber);
-		}
+	  lineoff = FirstOffsetNumber;
+	  lpp = PageGetItemId(dp, FirstOffsetNumber);
 	}
+    }
 }
 
 
@@ -829,6 +950,7 @@
 	/*
 	 * Note: we depend here on the -1/0/1 encoding of ScanDirection.
 	 */
+
 	heapgettup(scan->rs_rd,
 			   (int) direction,
 			   &(scan->rs_ctup),
@@ -836,7 +958,7 @@
 			   scan->rs_snapshot,
 			   scan->rs_nkeys,
 			   scan->rs_key,
-			   scan->rs_nblocks);
+			   scan->rs_nblocks,scan);
 
 	if (scan->rs_ctup.t_data == NULL && !BufferIsValid(scan->rs_cbuf))
 	{
@@ -1989,7 +2111,7 @@
 				   scan->rs_snapshot,
 				   0,
 				   NULL,
-				   scan->rs_nblocks);
+				   scan->rs_nblocks,scan);
 	}
 }
 
heapam.h.diff (text/x-patch)
--- postgresql-8.0.0/src/include/access/heapam.h	2004-12-31 14:03:21.000000000 -0800
+++ postgresql-8.0.0-ss/src/include/access/heapam.h	2005-02-25 17:37:24.256359103 -0800
@@ -25,6 +25,28 @@
 #include "utils/rel.h"
 #include "utils/tqual.h"
 
+
+/* Synchronized Scanning:
+ *  These definitions and declarations are used
+ *  by the Synchronized Scanning support functions
+ *  in heapam.c
+ */
+
+#include <sys/shm.h>
+#define SS_HASH_TABLE_SIZE 1024
+#define SS_SHM_KEY 0x11aa55cc
+
+typedef struct {
+  Oid relid;
+  BlockNumber loc;
+} ss_scan_loc_t;
+
+static BlockNumber ss_get_startloc(Oid);
+static int         ss_report_loc(Oid,BlockNumber);
+static int         ss_hash_relid(Oid);
+
+/* end Synchronized Scanning section */
+
 /* ----------------
  *		fastgetattr
  *
#16 William Volkman
wkvpsql@netshark.com
In reply to: Jim C. Nasby (#14)
Re: idea for concurrent seqscans

On Fri, 2005-02-25 at 22:49, Jim C. Nasby wrote:

On Fri, Feb 25, 2005 at 11:51:40PM -0500, Tom Lane wrote:

"Jim C. Nasby" <decibel@decibel.org> writes:

but I also hate to burden the developers with rewriting a lot of
regression tests when their time could be better spent elsewhere.

The patch is for *concurrent* seqscans; would the regression tests
even be affected by this (IIRC they are single-user, single-process)?

#17Neil Conway
neilc@samurai.com
In reply to: William Volkman (#16)
Re: idea for concurrent seqscans

William Volkman wrote:

The patch is for *concurrent* seqscans; would the regression tests
even be affected by this (IIRC they are single user, single process)?

No; `make installcheck' runs the tests serially, but `make check'
executes them in parallel across multiple backends.

-Neil

#18Neil Conway
neilc@samurai.com
In reply to: Jeff Davis (#15)
Re: idea for concurrent seqscans

Jeff Davis wrote:

I have a newer version of my Synchronized Scanning patch which hopefully
makes it closer to a real patch, the first one was more of a proof of
concept.

A few minor comments:

- context diffs (diff -c) are the preferred format. Also, folks usually
send patches as a single diff; an easy way to generate that is via `cvs
diff', or `diff -r old_dir new_dir'.

- needlessly reindenting code makes it difficult to understand what
you've changed. You should probably follow the PG coding conventions WRT
indentation, brace placement, and so forth, although this will be fixed
by a script later in any case. See Chapter 43 of the docs.

- you don't need to (and should not) declare `static' functions in
header files. If your additions to heapam.c aren't used outside of
heapam.c, they needn't be declared in the header file.
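
For example, the three declarations in the heapam.h hunk above would simply
move to the top of heapam.c:

/* file-local helpers for synchronized scanning (heapam.c, not heapam.h) */
static BlockNumber ss_get_startloc(Oid relid);
static int         ss_report_loc(Oid relid, BlockNumber loc);
static int         ss_hash_relid(Oid relid);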

- PG has an abstraction layer for using shared memory that you should
take advantage of. You should do something like: (1) create a function
that returns the amount of shared memory space you require (2) invoke
the function from CreateSharedMemoryAndSemaphores (3) create/attach to
and initialize the shared memory during startup, via ShmemInitStruct().
See how InitProcGlobal() works, for example.
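
A minimal sketch of that pattern (the SyncScan* names are invented here;
ShmemInitStruct() and the startup hook are the real interfaces):

/* sketch: size function, consulted when shared memory needs are added up */
Size
SyncScanShmemSize(void)
{
	return SS_HASH_TABLE_SIZE * sizeof(ss_scan_loc_t);
}

/* sketch: create-or-attach at startup, via ShmemInitStruct() */
void
SyncScanShmemInit(void)
{
	bool		found;
	ss_scan_loc_t *table;

	table = (ss_scan_loc_t *)
		ShmemInitStruct("Synchronized Scan Table",
						SyncScanShmemSize(),
						&found);
	if (!found)
		MemSet(table, 0, SyncScanShmemSize());
}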

- it makes me quite nervous to be reading and writing to shared data
without using locks. If there is too much of a performance hit to
acquire and release a lock for each page traversed, what about only
updating the shared memory stats every K pages?
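
For instance, the per-page report in the scan loop could be throttled like
this (SyncScanLock is a hypothetical lwlock; ss_report_loc() is from the
patch, and K is an arbitrary choice):

#define SS_REPORT_INTERVAL 16	/* the "K" -- tune by measurement */

/* report our position only every K pages to bound the locking cost */
if ((page % SS_REPORT_INTERVAL) == 0)
{
	LWLockAcquire(SyncScanLock, LW_EXCLUSIVE);
	ss_report_loc(RelationGetRelid(relation), page);
	LWLockRelease(SyncScanLock);
}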

-Neil

#19Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#13)
Re: idea for concurrent seqscans

Tom Lane said:

"Jim C. Nasby" <decibel@decibel.org> writes:

but I also hate to burden the developers with rewriting a lot of
regression tests when their time could be better spent elsewhere.

Certainly, but I suspect it's just a matter of adding ORDER BY to
everything, which just about anyone (even myself!) should be able to
do.

Performance is not the issue; test coverage, however, is an issue. See
the comment at the end of
http://developer.postgresql.org/docs/postgres/regress-evaluation.html#AEN22383

Is it not possible to wrap the original query in an outer query that applies
the ordering, leaving the original query to run unordered? Would that cause
any loss of test coverage?

cheers

andrew

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim C. Nasby (#14)
Re: idea for concurrent seqscans

"Jim C. Nasby" <decibel@decibel.org> writes:

Assuming you're talking about "You might wonder why we don't order all
the regression test queries explicitly to get rid of this issue once and
for all. The reason is that that would make the regression tests less
useful, not more, since they'd tend to exercise query plan types that
produce ordered results to the exclusion of those that don't.", good
point. I can think of 2 ways around this:

1) Select into a temptable, then select out of it with an order by

2) Run the output through sort before doing the diff

Is there any reason one of these wouldn't work?

Like I said originally, we could certainly devise a solution if we
needed to. I was just pointing out that this is a nontrivial
consideration, and I don't want to buy into it if the patch proves
to offer only marginal performance improvements.

regards, tom lane

#21Jim C. Nasby
decibel@decibel.org
In reply to: Jeff Davis (#15)
Re: idea for concurrent seqscans

On Sat, Feb 26, 2005 at 12:22:45AM -0800, Jeff Davis wrote:

* I expanded the shared memory to be a static hash table (currently set
to a size of 8KB, but that is a one line change if that's too big). Now
it can keep track of many scans on many tables at once.

I know you're still early in this, but should this value be a GUC?

* This patch still makes no use of locks at all. If someone thinks locks
are required, please let me know. Currently, if there is inconsistent
data in the shared memory area the worst that can happen is no worse
than the current behavior.

It would be useful to have the backend put something in the logfile any
time it has to revert to the default behavior. Without that, we have no
way to know how often it's happening, and if it's worth trying to make
it happen less.
--
Jim C. Nasby, Database Consultant decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

#22Jeff Davis
jdavis-pgsql@empires.org
In reply to: Neil Conway (#18)
Re: idea for concurrent seqscans

Thanks for the information.

On Sat, 2005-02-26 at 23:39 +1100, Neil Conway wrote:

Jeff Davis wrote:

I have a newer version of my Synchronized Scanning patch which hopefully
makes it closer to a real patch, the first one was more of a proof of
concept.

A few minor comments:

- context diffs (diff -c) are the preferred format. Also, folks usually
send patches as a single diff; an easy way to generate that is via `cvs
diff', or `diff -r old_dir new_dir'.

Ok.

- needlessly reindenting code makes it difficult to understand what
you've changed. You should probably follow the PG coding conventions WRT
indentation, brace placement, and so forth, although this will be fixed
by a script later in any case. See Chapter 43 of the docs.

The only reason I did that was because the original source was difficult
to read and work on. Is there a reason that code and comments wrap
around to the next line throughout the file?

- PG has an abstraction layer for using shared memory that you should
take advantage of. You should do something like: (1) create a function
that returns the amount of shared memory space you require (2) invoke
the function from CreateSharedMemoryAndSemaphores (3) create/attach to
and initialize the shared memory during startup, via ShmemInitStruct().
See how InitProcGlobal() works, for example.

Will do.

- it makes me quite nervous to be reading and writing to shared data
without using locks. If there is too much of a performance hit to
acquire and release a lock for each page traversed, what about only
updating the shared memory stats every K pages?

I'll try that and test it; hopefully any performance problems will appear
sooner rather than later. If something does appear, updating every K pages
sounds like a plan to me.

I will get a new patch ready soon with your suggestions.

Regards,
Jeff Davis

#23Jeff Davis
jdavis-pgsql@empires.org
In reply to: Jim C. Nasby (#21)
Re: idea for concurrent seqscans

On Sat, 2005-02-26 at 10:16 -0600, Jim C. Nasby wrote:

On Sat, Feb 26, 2005 at 12:22:45AM -0800, Jeff Davis wrote:

* I expanded the shared memory to be a static hash table (currently set
to a size of 8KB, but that is a one line change if that's too big). Now
it can keep track of many scans on many tables at once.

I know you're still early in this, but should this value be a GUC?

I don't know if we want to clutter the GUC variables with a bunch of
minor things like this. Most users won't understand what they're
actually getting by allocating more memory here. The only improvement
would be if there are two tables that are very often scanned
simultaneously that happen to hash to the same value, in which case
decreasing the shared mem has almost as much chance of solving the
problem as increasing it :)
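
(For illustration: with simple modulo hashing into the 1024-entry table,
OIDs 17232 and 18256 would both land in slot 848, so only a pair of hot
tables colliding like that would ever notice the table size.)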

However, if people think it's worthwhile, I'll be happy to do it.

* This patch still makes no use of locks at all. If someone thinks locks
are required, please let me know. Currently, if there is inconsistent
data in the shared memory area the worst that can happen is no worse
than the current behavior.

It would be useful to have the backend put something in the logfile any
time it has to revert to the default behavior. Without that, we have no
way to know how often it's happening, and if it's worth trying to make
it happen less.

Good idea. However, the first scan on a table after a server start will
always fall back to the default behavior, so I'll need to avoid excessive
logging.
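
Something like this in initscan() should work -- a sketch only, with assumed
variable names:

/* log the fallback at a debug level, so the inevitable first-scan-after-
 * startup case doesn't flood the server log */
start_page = ss_get_startloc(RelationGetRelid(relation));
if (start_page == 0)
	elog(DEBUG1, "synchronized scan: no hint for \"%s\", starting at block 0",
		 RelationGetRelationName(relation));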

Regards,
Jeff Davis

#24Neil Conway
neilc@samurai.com
In reply to: Jeff Davis (#22)
Re: idea for concurrent seqscans

Jeff Davis wrote:

The only reason I did that was because the original source was difficult
to read and work on. Is there a reason that code and comments wrap
around to the next line throughout the file?

I'm not sure what you mean. Assuming your editor expands tabs to 4 spaces
(which is the convention in the PostgreSQL source), lines should be
about 80 characters at most. Expressions that would take more characters
horizontally if left on one line are divided into multiple lines, and
indented appropriately. At any rate, the source is perfectly readable to
me :) Perhaps you need to tweak your editor's configuration?

-Neil

#25Gaetano Mendola
mendola@bigfoot.com
In reply to: Tom Lane (#20)
Re: idea for concurrent seqscans

Tom Lane wrote:

"Jim C. Nasby" <decibel@decibel.org> writes:

Assuming you're talking about "You might wonder why we don't order all
the regression test queries explicitly to get rid of this issue once and
for all. The reason is that that would make the regression tests less
useful, not more, since they'd tend to exercise query plan types that
produce ordered results to the exclusion of those that don't.", good
point. I can think of 2 ways around this:

1) Select into a temptable, then select out of it with an order by

2) Run the output through sort before doing the diff

Is there any reason one of these wouldn't work?

Like I said originally, we could certainly devise a solution if we
needed to. I was just pointing out that this is a nontrivial
consideration, and I don't want to buy into it if the patch proves
to offer only marginal performance improvements.

I'll bet it will not offer only marginal performance improvements. I
sometimes see my 4-CPU server with three CPUs idle while the other CPU
works on a long sequential scan. I hope that this patch, if it works
correctly, will be used in a future PostgreSQL version.

Regards
Gaetano Mendola

#26Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#20)
Re: idea for concurrent seqscans

On Sat, 2005-02-26 at 10:47 -0500, Tom Lane wrote:

"Jim C. Nasby" <decibel@decibel.org> writes:

Assuming you're talking about "You might wonder why we don't order all
the regression test queries explicitly to get rid of this issue once and
for all. The reason is that that would make the regression tests less
useful, not more, since they'd tend to exercise query plan types that
produce ordered results to the exclusion of those that don't.", good
point. I can think of 2 ways around this:

1) Select into a temptable, then select out of it with an order by

2) Run the output through sort before doing the diff

Is there any reason one of these wouldn't work?

Like I said originally, we could certainly devise a solution if we
needed to. I was just pointing out that this is a nontrivial
consideration, and I don't want to buy into it if the patch proves
to offer only marginal performance improvements.

Yes, the starting place is performance prototyping. Jeff has sensibly
started with that thought in his initial post.

I would suggest that we use a new GUC, enable_synch_scans = off by
default, to control whether this new behaviour is used.
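
As a sketch, that would be one new entry in guc.c's ConfigureNamesBool
array (the variable name and description are assumptions):

/* sketch of the proposed GUC entry; off by default as suggested */
bool		enable_synch_scans = false;	/* declared near the other flags */

{
	{"enable_synch_scans", PGC_USERSET, QUERY_TUNING_OTHER,
		gettext_noop("Enables synchronized sequential scans."),
		NULL
	},
	&enable_synch_scans,
	false, NULL, NULL
},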

That way, all of the current tests would stand as-is. My feeling is that
in general, we should only add tests, not alter existing ones. It would
be straightforward, even if laborious, to add some additional tests that
prove that this type of system feature returns correct SQL results.
However, the key seems to be that the results of SQL runs without an
ORDER BY would be timing-dependent, so this would actually be a wholly new
type of test - we would need to run one scan on its own, then compare with
two run together to check that the same answer was still returned, possibly
with a post-execution external sort.

Best Regards, Simon Riggs

#27Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Jeff Davis (#1)
Re: idea for concurrent seqscans

Added to TODO list:

* Allow sequential scans to take advantage of other concurrent
sequential scans, also called "Synchronised Scanning"

---------------------------------------------------------------------------

Jeff Davis wrote:

[ Original message quoted in full, trimmed... see message #1. ]

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#28Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Jeff Davis (#1)
Re: idea for concurrent seqscans

TODO description added:

* Allow sequential scans to take advantage of other concurrent
sequential scans, also called "Synchronised Scanning"

One possible implementation is to start sequential scans from the lowest
numbered buffer in the shared cache, and when reaching the end wrap
around to the beginning, rather than always starting sequential scans
at the start of the table.
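
In outline, the wrap-around traversal amounts to the following (an
illustrative sketch, not the posted patch; process_page() stands for the
per-page work):

/* scan all nblocks pages starting from a hinted block, wrapping past
 * the end of the relation back to block 0 */
BlockNumber start = ss_get_startloc(RelationGetRelid(relation));
BlockNumber page = start;
BlockNumber scanned;

for (scanned = 0; scanned < nblocks; scanned++)
{
	process_page(relation, page);	/* hypothetical per-page work */
	if (++page >= nblocks)
		page = 0;					/* wrap around at the end */
}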

---------------------------------------------------------------------------

Jeff Davis wrote:

[ Original message quoted in full, trimmed... see message #1. ]

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073