[PATCH]Tablesample Submission

Started by Qi Huangover 13 years ago14 messages

huangqiyx@outlook.com

over 13 years ago

1 attachment(s)

Hi, hackers
I made the final version tablesample patch. It is implementing SYSTEM and BERNOULLI sample method, which is basically "feature-complete". The regression test is also included in this patch.
There is an wiki documentation on https://wiki.postgresql.org/wiki/TABLESAMPLE_Implementation. The detail about this patch and this project is all included in this documentation.
The pgbench test result below:
Without patch:
transaction
type: SELECT only

scaling
factor: 10

query
mode: simple

duration:
30 s

number of
clients: 2

number of
threads: 2

number of
transactions actually processed: 304857

tps =
10161.803463 (including connections establishing)

tps =
10162.963554 (excluding connections establishing)

number of
clients: 4

number of
threads: 2

number of
transactions actually processed: 245072

tps =
8168.796552 (including connections establishing)

tps =
8173.947876 (excluding connections establishing)

number of
clients: 8

number of
threads: 2

number of
transactions actually processed: 218426

tps =
7280.624465 (including connections establishing)

tps =
7284.863386 (excluding connections establishing)

scaling
factor: 10

number of
clients: 16

number of
threads: 2

number of
transactions actually processed: 199204

tps =
6636.427331 (including connections establishing)

tps =
6650.783233 (excluding connections establishing)

scaling
factor: 10

number of
clients: 32

number of
threads: 2

number of
transactions actually processed: 186793

tps =
6221.025810 (including connections establishing)

tps =
6238.904071 (excluding connections establishing)

With Patch:
number of clients: 2
number of threads: 2
number of transactions actually processed: 329926
tps = 10997.232742 (including connections establishing)
tps = 10998.712762 (excluding connections establishing)
number of clients: 4
number of threads: 2
number of transactions actually processed: 261993
tps = 8732.875565 (including connections establishing)
tps = 8735.013550 (excluding connections establishing)
number of clients: 8
number of threads: 2
number of transactions actually processed: 203579
tps = 6785.601601 (including connections establishing)
tps = 6788.043016 (excluding connections establishing)
number of clients: 16
number of threads: 2
number of transactions actually processed: 190773
tps = 6354.824262 (including connections establishing)
tps = 6361.348206 (excluding connections establishing)
number of clients: 32
number of threads: 2
number of transactions actually processed: 190801
tps = 6353.821626 (including connections establishing)
tps = 6380.813409 (excluding connections establishing)

Thanks and Best RegardsHuang Qi VictorComputer Science of National University of Singapore

Attachments:

tablesample_context.patchapplication/octet-streamDownload

*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 69,74 ****
--- 69,75 ----
  #include "utils/snapmgr.h"
  #include "utils/syscache.h"
  #include "utils/tqual.h"
+ #include "utils/sampleutils.h"
  
  
  /* GUC variable */
***************
*** 87,93 **** static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
  				bool all_visible_cleared, bool new_all_visible_cleared);
  static bool HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs,
  					   HeapTuple oldtup, HeapTuple newtup);
! 
  
  /* ----------------------------------------------------------------
   *						 heap support routines
--- 88,96 ----
  				bool all_visible_cleared, bool new_all_visible_cleared);
  static bool HeapSatisfiesHOTUpdate(Relation relation, Bitmapset *hot_attrs,
  					   HeapTuple oldtup, HeapTuple newtup);
! static void acquire_next_sampletup(HeapScanDesc scan);
! static void acquire_block_sample(HeapScanDesc scan);
! static void	heapnullreturn(HeapScanDesc scan);
  
  /* ----------------------------------------------------------------
   *						 heap support routines
***************
*** 588,593 **** heapgettup(HeapScanDesc scan,
--- 591,599 ----
   *
   *		Same API as heapgettup, but used in page-at-a-time mode
   *
+  * SampleScan System algo is also implemented. Direction of Backward is especially used
+  * under CURSOR query.
+  *
   * The internal logic is much the same as heapgettup's too, but there are some
   * differences: we do not take the buffer content lock (that only needs to
   * happen inside heapgetpage), and we iterate through just the tuples listed
***************
*** 612,617 **** heapgettup_pagemode(HeapScanDesc scan,
--- 618,626 ----
  	OffsetNumber lineoff;
  	int			linesleft;
  	ItemId		lpp;
+ 	int			rand_percent;
+ 	int			sample_percent = scan->sample_percent;
+ 	void		*rand_state = scan->rs_randstate; /* random generator state */
  
  	/*
  	 * calculate next starting lineindex, given scan direction
***************
*** 630,635 **** heapgettup_pagemode(HeapScanDesc scan,
--- 639,664 ----
  				return;
  			}
  			page = scan->rs_startblock; /* first page */
+ 
+ 			/* SampleScan, implement SYSTEM page selection */
+ 			if(scan->is_sample_scan)
+ 			{
+ 				while(page < scan->rs_nblocks)
+ 				{
+ 					rand_percent = get_rand_in_range(rand_state, 0, 100);
+ 					if(rand_percent >= sample_percent)
+ 						page++;
+ 					else break;
+ 				}
+ 				finished = (page >= scan->rs_nblocks);
+ 				/* All pages just skiped, clear up and return */
+ 				if(finished)
+ 				{
+ 					heapnullreturn(scan);
+ 					return;
+ 				}
+ 			}
+ 
  			heapgetpage(scan, page);
  			lineindex = 0;
  			scan->rs_inited = true;
***************
*** 673,678 **** heapgettup_pagemode(HeapScanDesc scan,
--- 702,727 ----
  				page = scan->rs_startblock - 1;
  			else
  				page = scan->rs_nblocks - 1;
+ 
+ 			/* SampleScan, implement SYSTEM page selection */
+ 			if(scan->is_sample_scan)
+ 			{
+ 				while(page > 0)
+ 				{
+ 					rand_percent = get_rand_in_range(rand_state, 0, 100);
+ 					if(rand_percent >= sample_percent)
+ 						page--;
+ 					else break;
+ 				}
+ 				finished = (page == scan->rs_startblock);
+ 				/* All pages just skiped, clear up and return */
+ 				if(finished)
+ 				{
+ 					heapnullreturn(scan);
+ 					return;
+ 				}
+ 			}
+ 
  			heapgetpage(scan, page);
  		}
  		else
***************
*** 701,706 **** heapgettup_pagemode(HeapScanDesc scan,
--- 750,756 ----
  	{
  		/*
  		 * ``no movement'' scan direction: refetch prior tuple
+ 		 * Same behavior for SampleScan, so no change
  		 */
  		if (!scan->rs_inited)
  		{
***************
*** 782,797 **** heapgettup_pagemode(HeapScanDesc scan,
  		 */
  		if (backward)
  		{
  			finished = (page == scan->rs_startblock);
! 			if (page == 0)
  				page = scan->rs_nblocks;
  			page--;
  		}
  		else
  		{
  			page++;
  			if (page >= scan->rs_nblocks)
! 				page = 0;
  			finished = (page == scan->rs_startblock);
  
  			/*
--- 832,878 ----
  		 */
  		if (backward)
  		{
+ 			/* SampleScan, implement SYSTEM page selection */
+ 			if(scan->is_sample_scan)
+ 			{
+ 				while(page > scan->rs_startblock)
+ 				{
+ 					rand_percent = get_rand_in_range(rand_state, 0, 100);
+ 					if(rand_percent >= sample_percent)
+ 						page--;
+ 					else break;
+ 				}
+ 			}
+ 
  			finished = (page == scan->rs_startblock);
! 			/*
! 			 * It was deciding page == 0, now change to 
! 			 * page == scan->rs_startblock, otherwise doesn't make sense.
! 			 * Is this reasoning correct?
! 			 */
! 			if (page == scan->rs_startblock)
  				page = scan->rs_nblocks;
  			page--;
  		}
  		else
  		{
  			page++;
+ 
+ 			/* SampleScan, implement SYSTEM page selection */
+ 			while(page < scan->rs_nblocks)
+ 			{
+ 				rand_percent = get_rand_in_range(rand_state, 0, 100);
+ 				if(rand_percent >= sample_percent)
+ 					page++;
+ 				else break;
+ 			}
+ 			/*
+ 			 * It was setting page = 0, now change to 
+ 			 * page = scan->rs_startblock, otherwise doesn't make sense.
+ 			 * Is this reasoning correct?
+ 			 */
  			if (page >= scan->rs_nblocks)
! 				page = scan->rs_startblock;
  			finished = (page == scan->rs_startblock);
  
  			/*
***************
*** 815,826 **** heapgettup_pagemode(HeapScanDesc scan,
  		 */
  		if (finished)
  		{
! 			if (BufferIsValid(scan->rs_cbuf))
! 				ReleaseBuffer(scan->rs_cbuf);
! 			scan->rs_cbuf = InvalidBuffer;
! 			scan->rs_cblock = InvalidBlockNumber;
! 			tuple->t_data = NULL;
! 			scan->rs_inited = false;
  			return;
  		}
  
--- 896,902 ----
  		 */
  		if (finished)
  		{
! 			heapnullreturn(scan);
  			return;
  		}
  
***************
*** 836,841 **** heapgettup_pagemode(HeapScanDesc scan,
--- 912,930 ----
  	}
  }
  
+ static void
+ heapnullreturn(HeapScanDesc scan)
+ {
+ 	HeapTuple	tuple = &(scan->rs_ctup);
+ 
+ 	if (BufferIsValid(scan->rs_cbuf))
+ 		ReleaseBuffer(scan->rs_cbuf);
+ 	scan->rs_cbuf = InvalidBuffer;
+ 	scan->rs_cblock = InvalidBlockNumber;
+ 	tuple->t_data = NULL;
+ 	scan->rs_inited = false;
+ 	return;
+ }
  
  #if defined(DISABLE_COMPLEX_MACRO)
  /*
***************
*** 1236,1241 **** heap_beginscan_internal(Relation relation, Snapshot snapshot,
--- 1325,1333 ----
  
  	initscan(scan, key, false);
  
+ 	/* By default, it is not samplescan */
+ 	scan->is_sample_scan = false;
+ 
  	return scan;
  }
  
***************
*** 1315,1332 **** heap_endscan(HeapScanDesc scan)
  #endif   /* !defined(HEAPDEBUGALL) */
  
  
  HeapTuple
  heap_getnext(HeapScanDesc scan, ScanDirection direction)
  {
  	/* Note: no locking manipulations needed */
  
  	HEAPDEBUG_1;				/* heap_getnext( info ) */
  
! 	if (scan->rs_pageatatime)
! 		heapgettup_pagemode(scan, direction,
! 							scan->rs_nkeys, scan->rs_key);
! 	else
! 		heapgettup(scan, direction, scan->rs_nkeys, scan->rs_key);
  
  	if (scan->rs_ctup.t_data == NULL)
  	{
--- 1407,1446 ----
  #endif   /* !defined(HEAPDEBUGALL) */
  
  
+ /*
+  * SampleScan is implemented from this function.
+  */
  HeapTuple
  heap_getnext(HeapScanDesc scan, ScanDirection direction)
  {
  	/* Note: no locking manipulations needed */
  
+ 	TableSampleMethod sample_method = scan->sample_method;
+ 
  	HEAPDEBUG_1;				/* heap_getnext( info ) */
  
! 	/* if sample percent is 0, just return NULL */
! 	if (scan->is_sample_scan && scan->sample_percent == 0)
! 		return NULL;
! 
! 	/* SampleScan or SeqScan */
! 	if (scan->is_sample_scan)
! 	{
! 		/* Which sample method */
! 		if(sample_method == SAMPLE_SYSTEM)
! 			/* This function will check scan type inside */
! 			heapgettup_pagemode(scan, direction,
! 								scan->rs_nkeys, scan->rs_key);
! 		else if(sample_method == SAMPLE_BERNOULLI)
! 			acquire_next_sampletup(scan);
! 	} else 
! 	{
! 		if (scan->rs_pageatatime)
! 			heapgettup_pagemode(scan, direction,
! 								scan->rs_nkeys, scan->rs_key);
! 		else
! 			heapgettup(scan, direction, scan->rs_nkeys, scan->rs_key);
! 	}
  
  	if (scan->rs_ctup.t_data == NULL)
  	{
***************
*** 5780,5782 **** heap_sync(Relation rel)
--- 5894,6071 ----
  		heap_close(toastrel, AccessShareLock);
  	}
  }
+ 
+ /*
+  * acquire_next_sampletup -- algo modified from acquire_sample_rows in analyze.c.
+  * The core algo is inside acquire_block_sample.
+  */
+ static void
+ acquire_next_sampletup(HeapScanDesc scan)
+ {
+ 	HeapTuple		tuple = &(scan->rs_ctup);
+ 	HeapTuple		pass_tuple;
+ 
+ 	/* Collect the sample first if not yet */
+ 	if(!scan->rs_sampleinited)
+ 		acquire_block_sample(scan);
+ 
+ 	/* Just retrieve tuple one by one from the sample collection */
+ 	if(scan->rs_curindex < scan->rs_samplesize)
+ 	{
+ 		/* Get the head tuple from sampled tuple array */
+ 		pass_tuple = scan->rs_samplerows[scan->rs_curindex];
+ 		tuple->t_len = pass_tuple->t_len;
+ 		tuple->t_self = pass_tuple->t_self;
+ 		tuple->t_tableOid = pass_tuple->t_tableOid;
+ 		tuple->t_data = pass_tuple->t_data;
+ 		scan->rs_curindex++;
+ 	}
+ 	else
+ 	{
+ 		/* No more tuple to return */
+ 		scan->rs_cbuf = InvalidBuffer;
+ 		scan->rs_cblock = InvalidBlockNumber;
+ 		tuple->t_data = NULL;
+ 	}
+ 
+ 	return;
+ }
+ 
+ /*
+  * See acquire_sample_rows in analyze.c for detail description of the algo.
+  * Selected rows are returned in the caller-allocated array scan->rs_samplerows, which
+  * must have at least targrows entries.
+  *
+  */
+ static void
+ acquire_block_sample(HeapScanDesc scan)
+ {
+ 	int			numrows = 0;	/* # rows now in reservoir */
+ 	double		samplerows = 0; /* total # rows collected */
+ 	double		rowstoskip = -1;	/* -1 means not set yet */
+ 	int			targrows = scan->targrows;
+ 	HeapTuple	*rows = scan->rs_samplerows;
+ 	BlockNumber	totalblocks = scan->rs_nblocks;
+ 	BlockSamplerData bs;
+ 	Relation onerel = scan->rs_rd;
+ 	void	*rand_state = scan->rs_randstate;
+ 	double		rstate;
+ 
+ 	/* If the required sample size is 0, we just return */
+ 	if(targrows == 0)
+ 	{
+ 		scan->rs_samplesize = 0;
+ 		scan->rs_sampleinited = true;
+ 		return;
+ 	}
+ 
+ 	Assert(targrows > 0);
+ 
+ 	/* Prepare for sampling block numbers */
+ 	BlockSampler_Init(&bs, totalblocks, targrows);
+ 	/* Prepare for sampling rows */
+ 	rstate = anl_init_selection_state(rand_state, targrows);
+ 
+ 	/* Outer loop over blocks to sample */
+ 	while (BlockSampler_HasMore(&bs))
+ 	{
+ 		BlockNumber targblock = BlockSampler_Next(rand_state, &bs);
+ 		Buffer		targbuffer;
+ 		Page		targpage;
+ 		OffsetNumber targoffset,
+ 					maxoffset;
+ 
+ 		/*
+ 		 * We must maintain a pin on the target page's buffer to ensure that
+ 		 * the maxoffset value stays good (else concurrent VACUUM might delete
+ 		 * tuples out from under us).  Hence, pin the page until we are done
+ 		 * looking at it.  We also choose to hold sharelock on the buffer
+ 		 * throughout --- we could release and re-acquire sharelock for each
+ 		 * tuple, but since we aren't doing much work per tuple, the extra
+ 		 * lock traffic is probably better avoided.
+ 		 */
+ 		targbuffer = ReadBufferExtended(onerel, MAIN_FORKNUM, targblock,
+ 										RBM_NORMAL, scan->rs_strategy);
+ 		LockBuffer(targbuffer, BUFFER_LOCK_SHARE);
+ 		targpage = BufferGetPage(targbuffer);
+ 		maxoffset = PageGetMaxOffsetNumber(targpage);
+ 
+ 		/* Inner loop over all tuples on the selected page */
+ 		for (targoffset = FirstOffsetNumber; targoffset <= maxoffset; targoffset++)
+ 		{
+ 			ItemId		itemid;
+ 			HeapTupleData targtuple;
+ 
+ 			itemid = PageGetItemId(targpage, targoffset);
+ 
+ 			/*
+ 			 * We ignore unused and redirect line pointers.  DEAD line
+ 			 * pointers should be counted as dead, because we need vacuum to
+ 			 * run to get rid of them.	Note that this rule agrees with the
+ 			 * way that heap_page_prune() counts things.
+ 			 */
+ 			if (!ItemIdIsNormal(itemid))
+ 			{
+ 				continue;
+ 			}
+ 
+ 			ItemPointerSet(&targtuple.t_self, targblock, targoffset);
+ 
+ 			targtuple.t_data = (HeapTupleHeader) PageGetItem(targpage, itemid);
+ 			targtuple.t_len = ItemIdGetLength(itemid);
+ 
+ 			/*
+ 			 * The first targrows sample rows are simply copied into the
+ 			 * reservoir. Then we start replacing tuples in the sample
+ 			 * until we reach the end of the relation.	This algorithm is
+ 			 * from Jeff Vitter's paper (see full citation below). It
+ 			 * works by repeatedly computing the number of tuples to skip
+ 			 * before selecting a tuple, which replaces a randomly chosen
+ 			 * element of the reservoir (current set of tuples).  At all
+ 			 * times the reservoir is a true random sample of the tuples
+ 			 * we've passed over so far, so when we fall off the end of
+ 			 * the relation we're done.
+ 			 */
+ 			if (numrows < targrows)
+ 				rows[numrows++] = heap_copytuple(&targtuple);
+ 			else
+ 			{
+ 				/*
+ 				 * t in Vitter's paper is the number of records already
+ 				 * processed.  If we need to compute a new S value, we
+ 				 * must use the not-yet-incremented value of samplerows as
+ 				 * t.
+ 				 */
+ 				if (rowstoskip < 0)
+ 					rowstoskip = anl_get_next_S(rand_state, samplerows, targrows,
+ 												&rstate);
+ 
+ 				if (rowstoskip <= 0)
+ 				{
+ 					/*
+ 					 * Found a suitable tuple, so save it, replacing one
+ 					 * old tuple at random
+ 					 */
+ 					int			k = (int) (targrows * anl_random_fract(rand_state));
+ 
+ 					Assert(k >= 0 && k < targrows);
+ 					heap_freetuple(rows[k]);
+ 					rows[k] = heap_copytuple(&targtuple);
+ 				}
+ 
+ 				rowstoskip -= 1;
+ 			}
+ 
+ 			samplerows += 1;
+ 		}
+ 
+ 		/* Now release the lock and pin on the page */
+ 		UnlockReleaseBuffer(targbuffer);
+ 	}
+ 
+ 	scan->rs_samplesize = numrows;
+ 	scan->rs_sampleinited = true;
+ 
+ 	return;
+ }
+ 
*** a/src/backend/commands/analyze.c
--- b/src/backend/commands/analyze.c
***************
*** 52,70 ****
  #include "utils/syscache.h"
  #include "utils/timestamp.h"
  #include "utils/tqual.h"
  
  
- /* Data structure for Algorithm S from Knuth 3.4.2 */
- typedef struct
- {
- 	BlockNumber N;				/* number of blocks, known in advance */
- 	int			n;				/* desired sample size */
- 	BlockNumber t;				/* current block number */
- 	int			m;				/* blocks selected so far */
- } BlockSamplerData;
- 
- typedef BlockSamplerData *BlockSampler;
- 
  /* Per-index data for ANALYZE */
  typedef struct AnlIndexData
  {
--- 52,60 ----
  #include "utils/syscache.h"
  #include "utils/timestamp.h"
  #include "utils/tqual.h"
+ #include "utils/sampleutils.h"
  
  
  /* Per-index data for ANALYZE */
  typedef struct AnlIndexData
  {
***************
*** 86,95 **** static BufferAccessStrategy vac_strategy;
  static void do_analyze_rel(Relation onerel, VacuumStmt *vacstmt,
  			   AcquireSampleRowsFunc acquirefunc, BlockNumber relpages,
  			   bool inh, int elevel);
- static void BlockSampler_Init(BlockSampler bs, BlockNumber nblocks,
- 				  int samplesize);
- static bool BlockSampler_HasMore(BlockSampler bs);
- static BlockNumber BlockSampler_Next(BlockSampler bs);
  static void compute_index_stats(Relation onerel, double totalrows,
  					AnlIndexData *indexdata, int nindexes,
  					HeapTuple *rows, int numrows,
--- 76,81 ----
***************
*** 108,114 **** static void update_attstats(Oid relid, bool inh,
  static Datum std_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
  static Datum ind_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
  
- 
  /*
   *	analyze_rel() -- analyze one relation
   */
--- 94,99 ----
***************
*** 937,1030 **** examine_attribute(Relation onerel, int attnum, Node *index_expr)
  }
  
  /*
-  * BlockSampler_Init -- prepare for random sampling of blocknumbers
-  *
-  * BlockSampler is used for stage one of our new two-stage tuple
-  * sampling mechanism as discussed on pgsql-hackers 2004-04-02 (subject
-  * "Large DB").  It selects a random sample of samplesize blocks out of
-  * the nblocks blocks in the table.  If the table has less than
-  * samplesize blocks, all blocks are selected.
-  *
-  * Since we know the total number of blocks in advance, we can use the
-  * straightforward Algorithm S from Knuth 3.4.2, rather than Vitter's
-  * algorithm.
-  */
- static void
- BlockSampler_Init(BlockSampler bs, BlockNumber nblocks, int samplesize)
- {
- 	bs->N = nblocks;			/* measured table size */
- 
- 	/*
- 	 * If we decide to reduce samplesize for tables that have less or not much
- 	 * more than samplesize blocks, here is the place to do it.
- 	 */
- 	bs->n = samplesize;
- 	bs->t = 0;					/* blocks scanned so far */
- 	bs->m = 0;					/* blocks selected so far */
- }
- 
- static bool
- BlockSampler_HasMore(BlockSampler bs)
- {
- 	return (bs->t < bs->N) && (bs->m < bs->n);
- }
- 
- static BlockNumber
- BlockSampler_Next(BlockSampler bs)
- {
- 	BlockNumber K = bs->N - bs->t;		/* remaining blocks */
- 	int			k = bs->n - bs->m;		/* blocks still to sample */
- 	double		p;				/* probability to skip block */
- 	double		V;				/* random */
- 
- 	Assert(BlockSampler_HasMore(bs));	/* hence K > 0 and k > 0 */
- 
- 	if ((BlockNumber) k >= K)
- 	{
- 		/* need all the rest */
- 		bs->m++;
- 		return bs->t++;
- 	}
- 
- 	/*----------
- 	 * It is not obvious that this code matches Knuth's Algorithm S.
- 	 * Knuth says to skip the current block with probability 1 - k/K.
- 	 * If we are to skip, we should advance t (hence decrease K), and
- 	 * repeat the same probabilistic test for the next block.  The naive
- 	 * implementation thus requires an anl_random_fract() call for each block
- 	 * number.	But we can reduce this to one anl_random_fract() call per
- 	 * selected block, by noting that each time the while-test succeeds,
- 	 * we can reinterpret V as a uniform random number in the range 0 to p.
- 	 * Therefore, instead of choosing a new V, we just adjust p to be
- 	 * the appropriate fraction of its former value, and our next loop
- 	 * makes the appropriate probabilistic test.
- 	 *
- 	 * We have initially K > k > 0.  If the loop reduces K to equal k,
- 	 * the next while-test must fail since p will become exactly zero
- 	 * (we assume there will not be roundoff error in the division).
- 	 * (Note: Knuth suggests a "<=" loop condition, but we use "<" just
- 	 * to be doubly sure about roundoff error.)  Therefore K cannot become
- 	 * less than k, which means that we cannot fail to select enough blocks.
- 	 *----------
- 	 */
- 	V = anl_random_fract();
- 	p = 1.0 - (double) k / (double) K;
- 	while (V < p)
- 	{
- 		/* skip */
- 		bs->t++;
- 		K--;					/* keep K == N - t */
- 
- 		/* adjust p to be new cutoff point in reduced range */
- 		p *= 1.0 - (double) k / (double) K;
- 	}
- 
- 	/* select */
- 	bs->m++;
- 	return bs->t++;
- }
- 
- /*
   * acquire_sample_rows -- acquire a random sample of rows from the table
   *
   * Selected rows are returned in the caller-allocated array rows[], which
--- 922,927 ----
***************
*** 1082,1093 **** acquire_sample_rows(Relation onerel, int elevel,
  	/* Prepare for sampling block numbers */
  	BlockSampler_Init(&bs, totalblocks, targrows);
  	/* Prepare for sampling rows */
! 	rstate = anl_init_selection_state(targrows);
  
  	/* Outer loop over blocks to sample */
  	while (BlockSampler_HasMore(&bs))
  	{
! 		BlockNumber targblock = BlockSampler_Next(&bs);
  		Buffer		targbuffer;
  		Page		targpage;
  		OffsetNumber targoffset,
--- 979,990 ----
  	/* Prepare for sampling block numbers */
  	BlockSampler_Init(&bs, totalblocks, targrows);
  	/* Prepare for sampling rows */
! 	rstate = anl_init_selection_state(NULL, targrows);
  
  	/* Outer loop over blocks to sample */
  	while (BlockSampler_HasMore(&bs))
  	{
! 		BlockNumber targblock = BlockSampler_Next(NULL, &bs);
  		Buffer		targbuffer;
  		Page		targpage;
  		OffsetNumber targoffset,
***************
*** 1229,1235 **** acquire_sample_rows(Relation onerel, int elevel,
  					 * t.
  					 */
  					if (rowstoskip < 0)
! 						rowstoskip = anl_get_next_S(samplerows, targrows,
  													&rstate);
  
  					if (rowstoskip <= 0)
--- 1126,1132 ----
  					 * t.
  					 */
  					if (rowstoskip < 0)
! 						rowstoskip = anl_get_next_S(NULL, samplerows, targrows,
  													&rstate);
  
  					if (rowstoskip <= 0)
***************
*** 1238,1244 **** acquire_sample_rows(Relation onerel, int elevel,
  						 * Found a suitable tuple, so save it, replacing one
  						 * old tuple at random
  						 */
! 						int			k = (int) (targrows * anl_random_fract());
  
  						Assert(k >= 0 && k < targrows);
  						heap_freetuple(rows[k]);
--- 1135,1141 ----
  						 * Found a suitable tuple, so save it, replacing one
  						 * old tuple at random
  						 */
! 						int			k = (int) (targrows * anl_random_fract(NULL));
  
  						Assert(k >= 0 && k < targrows);
  						heap_freetuple(rows[k]);
***************
*** 1297,1412 **** acquire_sample_rows(Relation onerel, int elevel,
  	return numrows;
  }
  
- /* Select a random value R uniformly distributed in (0 - 1) */
- double
- anl_random_fract(void)
- {
- 	return ((double) random() + 1) / ((double) MAX_RANDOM_VALUE + 2);
- }
- 
- /*
-  * These two routines embody Algorithm Z from "Random sampling with a
-  * reservoir" by Jeffrey S. Vitter, in ACM Trans. Math. Softw. 11, 1
-  * (Mar. 1985), Pages 37-57.  Vitter describes his algorithm in terms
-  * of the count S of records to skip before processing another record.
-  * It is computed primarily based on t, the number of records already read.
-  * The only extra state needed between calls is W, a random state variable.
-  *
-  * anl_init_selection_state computes the initial W value.
-  *
-  * Given that we've already read t records (t >= n), anl_get_next_S
-  * determines the number of records to skip before the next record is
-  * processed.
-  */
- double
- anl_init_selection_state(int n)
- {
- 	/* Initial value of W (for use when Algorithm Z is first applied) */
- 	return exp(-log(anl_random_fract()) / n);
- }
- 
- double
- anl_get_next_S(double t, int n, double *stateptr)
- {
- 	double		S;
- 
- 	/* The magic constant here is T from Vitter's paper */
- 	if (t <= (22.0 * n))
- 	{
- 		/* Process records using Algorithm X until t is large enough */
- 		double		V,
- 					quot;
- 
- 		V = anl_random_fract(); /* Generate V */
- 		S = 0;
- 		t += 1;
- 		/* Note: "num" in Vitter's code is always equal to t - n */
- 		quot = (t - (double) n) / t;
- 		/* Find min S satisfying (4.1) */
- 		while (quot > V)
- 		{
- 			S += 1;
- 			t += 1;
- 			quot *= (t - (double) n) / t;
- 		}
- 	}
- 	else
- 	{
- 		/* Now apply Algorithm Z */
- 		double		W = *stateptr;
- 		double		term = t - (double) n + 1;
- 
- 		for (;;)
- 		{
- 			double		numer,
- 						numer_lim,
- 						denom;
- 			double		U,
- 						X,
- 						lhs,
- 						rhs,
- 						y,
- 						tmp;
- 
- 			/* Generate U and X */
- 			U = anl_random_fract();
- 			X = t * (W - 1.0);
- 			S = floor(X);		/* S is tentatively set to floor(X) */
- 			/* Test if U <= h(S)/cg(X) in the manner of (6.3) */
- 			tmp = (t + 1) / term;
- 			lhs = exp(log(((U * tmp * tmp) * (term + S)) / (t + X)) / n);
- 			rhs = (((t + X) / (term + S)) * term) / t;
- 			if (lhs <= rhs)
- 			{
- 				W = rhs / lhs;
- 				break;
- 			}
- 			/* Test if U <= f(S)/cg(X) */
- 			y = (((U * (t + 1)) / term) * (t + S + 1)) / (t + X);
- 			if ((double) n < S)
- 			{
- 				denom = t;
- 				numer_lim = term + S;
- 			}
- 			else
- 			{
- 				denom = t - (double) n + S;
- 				numer_lim = t + 1;
- 			}
- 			for (numer = t + S; numer >= numer_lim; numer -= 1)
- 			{
- 				y *= numer / denom;
- 				denom -= 1;
- 			}
- 			W = exp(-log(anl_random_fract()) / n);		/* Generate W in advance */
- 			if (exp(log(y) / n) <= (t + X) / t)
- 				break;
- 		}
- 		*stateptr = W;
- 	}
- 	return S;
- }
- 
  /*
   * qsort comparator for sorting rows[] array
   */
--- 1194,1199 ----
***************
*** 1431,1437 **** compare_rows(const void *a, const void *b)
  	return 0;
  }
  
- 
  /*
   * acquire_inherited_sample_rows -- acquire sample rows from inheritance tree
   *
--- 1218,1223 ----
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
***************
*** 725,730 **** ExplainNode(PlanState *planstate, List *ancestors,
--- 725,733 ----
  		case T_SeqScan:
  			pname = sname = "Seq Scan";
  			break;
+ 		case T_SampleScan:
+ 			pname = sname = "Sample Scan";
+ 			break;
  		case T_IndexScan:
  			pname = sname = "Index Scan";
  			break;
***************
*** 864,869 **** ExplainNode(PlanState *planstate, List *ancestors,
--- 867,873 ----
  	switch (nodeTag(plan))
  	{
  		case T_SeqScan:
+ 		case T_SampleScan:
  		case T_BitmapHeapScan:
  		case T_TidScan:
  		case T_SubqueryScan:
***************
*** 1109,1114 **** ExplainNode(PlanState *planstate, List *ancestors,
--- 1113,1119 ----
  										   planstate, es);
  			/* FALL THRU */
  		case T_SeqScan:
+ 		case T_SampleScan:
  		case T_ValuesScan:
  		case T_CteScan:
  		case T_WorkTableScan:
***************
*** 1819,1824 **** ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
--- 1824,1830 ----
  	switch (nodeTag(plan))
  	{
  		case T_SeqScan:
+ 		case T_SampleScan:
  		case T_IndexScan:
  		case T_IndexOnlyScan:
  		case T_BitmapHeapScan:
*** a/src/backend/executor/Makefile
--- b/src/backend/executor/Makefile
***************
*** 21,27 **** OBJS = execAmi.o execCurrent.o execGrouping.o execJunk.o execMain.o \
         nodeLimit.o nodeLockRows.o \
         nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
         nodeNestloop.o nodeFunctionscan.o nodeRecursiveunion.o nodeResult.o \
!        nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \
         nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
         nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
         nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o spi.o
--- 21,27 ----
         nodeLimit.o nodeLockRows.o \
         nodeMaterial.o nodeMergeAppend.o nodeMergejoin.o nodeModifyTable.o \
         nodeNestloop.o nodeFunctionscan.o nodeRecursiveunion.o nodeResult.o \
!        nodeSeqscan.o nodeSamplescan.o nodeSetOp.o nodeSort.o nodeUnique.o \
         nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \
         nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
         nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o spi.o
*** a/src/backend/executor/execAmi.c
--- b/src/backend/executor/execAmi.c
***************
*** 38,43 ****
--- 38,44 ----
  #include "executor/nodeRecursiveunion.h"
  #include "executor/nodeResult.h"
  #include "executor/nodeSeqscan.h"
+ #include "executor/nodeSamplescan.h"
  #include "executor/nodeSetOp.h"
  #include "executor/nodeSort.h"
  #include "executor/nodeSubplan.h"
***************
*** 152,157 **** ExecReScan(PlanState *node)
--- 153,162 ----
  			ExecReScanSeqScan((SeqScanState *) node);
  			break;
  
+ 		case T_SampleScanState:
+ 			ExecReScanSampleScan((SampleScanState *) node);
+ 			break;
+ 
  		case T_IndexScanState:
  			ExecReScanIndexScan((IndexScanState *) node);
  			break;
***************
*** 274,279 **** ExecMarkPos(PlanState *node)
--- 279,288 ----
  			ExecSeqMarkPos((SeqScanState *) node);
  			break;
  
+ 		case T_SampleScanState:
+ 			ExecSampleMarkPos((SampleScanState *) node);
+ 			break;
+ 
  		case T_IndexScanState:
  			ExecIndexMarkPos((IndexScanState *) node);
  			break;
***************
*** 331,336 **** ExecRestrPos(PlanState *node)
--- 340,349 ----
  			ExecSeqRestrPos((SeqScanState *) node);
  			break;
  
+ 		case T_SampleScanState:
+ 			ExecSampleRestrPos((SampleScanState *) node);
+ 			break;
+ 
  		case T_IndexScanState:
  			ExecIndexRestrPos((IndexScanState *) node);
  			break;
***************
*** 383,388 **** ExecSupportsMarkRestore(NodeTag plantype)
--- 396,402 ----
  	switch (plantype)
  	{
  		case T_SeqScan:
+ 		case T_SampleScan:
  		case T_IndexScan:
  		case T_IndexOnlyScan:
  		case T_TidScan:
***************
*** 452,457 **** ExecSupportsBackwardScan(Plan *node)
--- 466,475 ----
  		case T_CteScan:
  			return TargetListSupportsBackwardScan(node->targetlist);
  
+ 		/* For now, TableSample should not do backward */
+ 		case T_SampleScan:
+ 			return false;
+ 
  		case T_IndexScan:
  			return IndexSupportsBackwardScan(((IndexScan *) node)->indexid) &&
  				TargetListSupportsBackwardScan(node->targetlist);
*** a/src/backend/executor/execCurrent.c
--- b/src/backend/executor/execCurrent.c
***************
*** 261,266 **** search_plan_tree(PlanState *node, Oid table_oid)
--- 261,267 ----
  			 * scan nodes can all be treated alike
  			 */
  		case T_SeqScanState:
+ 		case T_SampleScanState:
  		case T_IndexScanState:
  		case T_IndexOnlyScanState:
  		case T_BitmapHeapScanState:
*** a/src/backend/executor/execProcnode.c
--- b/src/backend/executor/execProcnode.c
***************
*** 101,106 ****
--- 101,107 ----
  #include "executor/nodeRecursiveunion.h"
  #include "executor/nodeResult.h"
  #include "executor/nodeSeqscan.h"
+ #include "executor/nodeSamplescan.h"
  #include "executor/nodeSetOp.h"
  #include "executor/nodeSort.h"
  #include "executor/nodeSubplan.h"
***************
*** 188,193 **** ExecInitNode(Plan *node, EState *estate, int eflags)
--- 189,199 ----
  												   estate, eflags);
  			break;
  
+ 		case T_SampleScan:
+ 			result = (PlanState *) ExecInitSampleScan((SampleScan *) node,
+ 													   estate, eflags);
+ 			break;
+ 
  		case T_IndexScan:
  			result = (PlanState *) ExecInitIndexScan((IndexScan *) node,
  													 estate, eflags);
***************
*** 399,404 **** ExecProcNode(PlanState *node)
--- 405,414 ----
  			result = ExecSeqScan((SeqScanState *) node);
  			break;
  
+ 		case T_SampleScanState:
+ 			result = ExecSampleScan((SampleScanState *) node);
+ 			break;
+ 
  		case T_IndexScanState:
  			result = ExecIndexScan((IndexScanState *) node);
  			break;
***************
*** 633,638 **** ExecEndNode(PlanState *node)
--- 643,652 ----
  			ExecEndSeqScan((SeqScanState *) node);
  			break;
  
+ 		case T_SampleScanState:
+ 			ExecEndSampleScan((SampleScanState *) node);
+ 			break;
+ 
  		case T_IndexScanState:
  			ExecEndIndexScan((IndexScanState *) node);
  			break;
*** /dev/null
--- b/src/backend/executor/nodeSamplescan.c
***************
*** 0 ****
--- 1,345 ----
+ /*-------------------------------------------------------------------------
+  *
+  * nodeSamplescan.c
+  *	  Support routines for sample scans of relations.
+  *
+  * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  *
+  * IDENTIFICATION
+  *	  src/backend/executor/nodeSamplescan.c
+  *
+  *-------------------------------------------------------------------------
+  */
+ /*
+  * INTERFACE ROUTINES
+  *		ExecSampleScan				randomly scans a relation.
+  *		ExecSampleNext				retrieve next tuple in the order depending on sampling method System/Bernoulli.
+  *		ExecInitSampleScan			creates and initializes a samplescan node.
+  *		ExecEndSampleScan			releases any storage allocated.
+  *		ExecReScanSampleScan		rescans the relation
+  *		ExecSampleMarkPos			marks scan position
+  *		ExecSampleRestrPos			restores scan position
+  */
+ #include "postgres.h"
+ 
+ #include <time.h>
+ 
+ #include "access/relscan.h"
+ #include "executor/execdebug.h"
+ #include "executor/nodeSamplescan.h"
+ #include "utils/rel.h"
+ #include "access/heapam.h"
+ #include "executor/executor.h"
+ #include "parser/parsetree.h"
+ #include "storage/bufmgr.h"
+ 
+ 
+ static void InitScanRelation(SampleScanState *node, EState *estate);
+ static TupleTableSlot *SampleNext(SampleScanState *node);
+ 
+ 
+ 
+ /* ----------------------------------------------------------------
+  *						Scan Support
+  * ----------------------------------------------------------------
+  */
+ 
+ /* ----------------------------------------------------------------
+  *		SampleNext
+  *
+  *		This is a workhorse for ExecSampleScan
+  * ----------------------------------------------------------------
+  */
+ static TupleTableSlot *
+ SampleNext(SampleScanState *node)
+ {
+ 	HeapTuple	tuple;
+ 	HeapScanDesc scandesc;
+ 	EState		 *estate;
+ 	ScanDirection	direction;
+ 	TupleTableSlot *slot;
+ 
+ 	/*
+ 	 * get information from the estate and scan state.
+ 	 */
+ 	scandesc = node->ss.ss_currentScanDesc;
+ 	estate = node->ss.ps.state;
+ 	direction = estate->es_direction;
+ 	slot = node->ss.ss_ScanTupleSlot;
+ 
+ 	tuple = heap_getnext(scandesc, direction); 
+ 
+ 	/*              
+ 	 * save the tuple and the buffer returned to us by the access methods in
+ 	 * our scan tuple slot and return the slot.  Note: we pass 'false' because
+ 	 * tuples returned by heap_getnext() are pointers onto disk pages and were
+ 	 * not created with palloc() and so should not be pfree()'d.  Note also
+ 	 * that ExecStoreTuple will increment the refcount of the buffer; the
+ 	 * refcount will not be dropped until the tuple table slot is cleared.
+ 	 */
+ 	if (tuple)
+ 		ExecStoreTuple(tuple,	/* tuple to store */
+ 					   slot,	/* slot to store in */
+ 					   scandesc->rs_cbuf,		/* buffer associated with this
+ 												 * tuple */
+ 					   false);	/* don't pfree this pointer */
+ 	else
+ 		ExecClearTuple(slot);
+ 
+ 	return slot;
+ }
+ 
+ /*
+  * SampleRecheck -- access method routine to recheck a tuple in EvalPlanQual.
+  * Same as SeqScan, should modify?
+  */
+ static bool
+ SampleRecheck(SampleScanState *node, TupleTableSlot *slot)
+ {
+ 	return true;
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecSampleScan(node)
+  *
+  *		Return the next tuple in the sample scan's result set
+  *		We call the ExecScan() routine and pass it the appropriate
+  *		access method functions.
+  * ----------------------------------------------------------------
+  */
+ TupleTableSlot *
+ ExecSampleScan(SampleScanState *node)
+ {
+ 	return ExecScan((ScanState *) node,
+ 				(ExecScanAccessMtd) SampleNext,
+ 				(ExecScanRecheckMtd) SampleRecheck);
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		InitScanRelation
+  *
+  *		This does the initialization for scan relations and
+  *		subplans of scans.
+  * ----------------------------------------------------------------
+  */
+ static void
+ InitScanRelation(SampleScanState *node, EState *estate)
+ {
+ 	Relation	currentRelation;
+ 	HeapScanDesc currentScanDesc;
+ 
+ 	/*
+ 	 * get the relation object id from the relid'th entry in the range table,
+ 	 * open that relation and acquire appropriate lock on it.
+ 	 */
+ 	currentRelation = ExecOpenScanRelation(estate,
+ 									 ((SampleScan *) node->ss.ps.plan)->scan.scanrelid);
+ 
+ 	currentScanDesc = heap_beginscan(currentRelation,
+ 									 estate->es_snapshot,
+ 									 0,
+ 									 NULL);
+ 
+ 	node->ss.ss_currentRelation = currentRelation;
+ 	node->ss.ss_currentScanDesc = currentScanDesc;
+ 
+ 	ExecAssignScanType(&node->ss, RelationGetDescr(currentRelation));
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecInitSampleScan
+  * ----------------------------------------------------------------
+  */
+ SampleScanState *
+ ExecInitSampleScan(SampleScan *node, EState *estate, int eflags)
+ {
+ 	SampleScanState *scanstate;
+ 	TableSampleMethod sample_method = node->sample_info->sample_method;
+ 	int				  sample_percent = node->sample_info->sample_percent;
+ 	double				  seed;
+ 
+ 	/*
+ 	 * We don't expect to have any child plan node
+ 	 */
+ 	Assert(outerPlan(node) == NULL);
+ 	Assert(innerPlan(node) == NULL);
+ 
+ 	/*
+ 	 * create state structure
+ 	 */
+ 	scanstate = makeNode(SampleScanState);
+ 	scanstate->ss.ps.plan = (Plan *) node;
+ 	scanstate->ss.ps.state = estate;
+ 
+ 	/*
+ 	 * Miscellaneous initialization
+ 	 *
+ 	 * create expression context for node
+ 	 */
+ 	ExecAssignExprContext(estate, &scanstate->ss.ps);
+ 
+ 	/*
+ 	 * initialize child expressions
+ 	 */
+ 	scanstate->ss.ps.targetlist = (List *)
+ 		ExecInitExpr((Expr *) node->scan.plan.targetlist,
+ 					 (PlanState *) scanstate);
+ 	scanstate->ss.ps.qual = (List *)
+ 		ExecInitExpr((Expr *) node->scan.plan.qual,
+ 					 (PlanState *) scanstate);
+ 
+ 	/*
+ 	 * tuple table initialization
+ 	 */
+ 	ExecInitResultTupleSlot(estate, &scanstate->ss.ps);
+ 	ExecInitScanTupleSlot(estate, &scanstate->ss);
+ 
+ 	/*
+ 	 * initialize scan relation
+ 	 */
+ 	InitScanRelation(scanstate, estate);
+ 
+ 	/*
+ 	 * InitScanRelation cannot see sampleinfo, initialize below
+ 	 */
+ 	HeapScanDesc scan = scanstate->ss.ss_currentScanDesc;
+ 
+ 	scan->is_sample_scan = true;
+ 	scan->sample_percent = sample_percent;
+ 	scan->sample_method = sample_method;
+ 
+ 	scanstate->ss.ps.ps_TupFromTlist = false;
+ 
+ 	/*
+ 	 * Initialize result tuple type and projection info.
+ 	 */
+ 	ExecAssignResultTypeFromTL(&scanstate->ss.ps);
+ 	ExecAssignScanProjectionInfo(&scanstate->ss);
+ 
+ 	/*
+ 	 * Initialize fields of SampleScanState for BERNOULLI
+ 	 */
+ 	if(sample_method == SAMPLE_BERNOULLI)
+ 	{
+ 		double numtuples = scan->rs_rd->rd_rel->reltuples;
+ 		int targrows = (int)floor(numtuples*sample_percent/100 + 0.5);
+ 
+ 		scan->targrows = targrows;
+ 		scan->rs_samplerows = (HeapTuple *)palloc(targrows * sizeof(HeapTuple));
+ 		scan->rs_curindex = 0;
+ 		scan->rs_samplesize = 0;
+ 		scan->rs_sampleinited = false;
+ 	}
+ 
+ 	/*
+ 	 * Setup PRNG state; seed with the REPEATABLE clause, if any. We
+ 	 * can't just use srandom(), since there could be multiple
+ 	 * concurrent sample scans.
+ 	 *
+ 	 * XXX: using time() to seed the PRNG in the non-repeatable case
+ 	 * could probably be improved. Different state array sizes could
+ 	 * also be tried: do we need high-quality random numbers?
+ 	 */
+ 	if(node->sample_info->is_repeatable)
+ 		seed = node->sample_info->repeat_seed;
+ 	else
+ 		seed = time(NULL);
+ 
+ 	sample_set_seed(scan->rs_randstate, seed);
+ 
+ 	return scanstate;
+ }
+ 
+ /* ----------------------------------------------------------------
+  *						Join Support
+  * XXX: This part needs more work.
+  * 1. To understand , read at execAmi.c, ExecSupportsMarkRestore
+  * 2. The below part is still seqscan code, not changed yet.
+  * 3. ReScan is needed by join methods, Others used by MergeJoin
+  * 4. In Optimizer, the Join support also needs update, like row 
+  *    estimate, cost estimate, and join cost estimate, needs to 
+  *    go better.
+  * ----------------------------------------------------------------
+  */
+ 
+ /* ----------------------------------------------------------------
+  *		ExecReScanSampleScan
+  *
+  *		Rescans the relation.
+  * ----------------------------------------------------------------
+  */
+ void
+ ExecReScanSampleScan(SampleScanState *node)
+ {
+ 	HeapScanDesc scan;
+ 
+ 	scan = node->ss.ss_currentScanDesc;
+ 
+ 	heap_rescan(scan,			/* scan desc */
+ 				NULL);			/* new scan keys */
+ 
+ 	ExecScanReScan((ScanState *) node);
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecSampleMarkPos(node)
+  *
+  *		Marks scan position.
+  * ----------------------------------------------------------------
+  */
+ void
+ ExecSampleMarkPos(SampleScanState *node)
+ {
+ 	HeapScanDesc scan = node->ss.ss_currentScanDesc;
+ 
+ 	heap_markpos(scan);
+ }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecSampleRestrPos
+  *
+  *		Restores scan position.
+  * ----------------------------------------------------------------
+  */
+ void
+ ExecSampleRestrPos(SampleScanState *node)
+ {
+ 	HeapScanDesc scan = node->ss.ss_currentScanDesc;
+ 
+ 	/*
+ 	 * Clear any reference to the previously returned tuple.  This is needed
+ 	 * because the slot is simply pointing at scan->rs_cbuf, which
+ 	 * heap_restrpos will change; we'd have an internally inconsistent slot if
+ 	 * we didn't do this.
+ 	 */
+ 	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+ 
+ 	heap_restrpos(scan);
+ }
+ 
+ /*
+  * Shutdown this scan. This function should generally be symmetric with
+  * ExecInitSampleScan(): we ought to clean up after ourselves.
+  */
+ void
+ ExecEndSampleScan(SampleScanState *node)
+ {
+ 	ExecFreeExprContext(&node->ss.ps);
+ 
+ 	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
+ 	ExecClearTuple(node->ss.ss_ScanTupleSlot);
+ 
+ 	/*
+ 	 * Close heap scan
+ 	 */
+ 	heap_endscan(node->ss.ss_currentScanDesc);
+ 
+ 	/*
+ 	 * Note that ExecCloseScanRelation() does NOT release the lock we
+ 	 * acquired on the scan relation: it is held until the end of the
+ 	 * transaction.
+ 	 */
+ 	ExecCloseScanRelation(node->ss.ss_currentRelation);
+ }
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
***************
*** 350,355 **** _copySeqScan(const SeqScan *from)
--- 350,373 ----
  }
  
  /*
+  * _copySampleScan
+  */
+ static SampleScan *
+ _copySampleScan(const SampleScan *from)
+ {
+ 	SampleScan *newnode = makeNode(SampleScan);
+ 
+ 	/*
+ 	 * copy node superclass fields
+ 	 */
+ 	CopyScanFields((const Scan *) from, (Scan *) newnode);
+ 
+ 	COPY_NODE_FIELD(sample_info);
+ 
+ 	return newnode;
+ }
+ 
+ /*
   * _copyIndexScan
   */
  static IndexScan *
***************
*** 1013,1018 **** _copyRangeVar(const RangeVar *from)
--- 1031,1037 ----
  	COPY_SCALAR_FIELD(inhOpt);
  	COPY_SCALAR_FIELD(relpersistence);
  	COPY_NODE_FIELD(alias);
+ 	COPY_NODE_FIELD(sample_info);
  	COPY_LOCATION_FIELD(location);
  
  	return newnode;
***************
*** 1804,1809 **** _copyFromExpr(const FromExpr *from)
--- 1823,1846 ----
  	return newnode;
  }
  
+ 
+ /*
+  * _CopyTableSampleInfo
+  */
+ static TableSampleInfo *
+ _copyTableSampleInfo(const TableSampleInfo *from)
+ {
+ 	TableSampleInfo *newnode = makeNode(TableSampleInfo);
+ 
+ 	COPY_SCALAR_FIELD(sample_percent);
+ 	COPY_SCALAR_FIELD(sample_method);
+ 	COPY_SCALAR_FIELD(is_repeatable);
+ 	COPY_SCALAR_FIELD(repeat_seed);
+ 
+ 	return newnode;
+ }
+ 
+ 
  /* ****************************************************************
   *						relation.h copy functions
   *
***************
*** 1955,1960 **** _copyRangeTblEntry(const RangeTblEntry *from)
--- 1992,1998 ----
  	COPY_SCALAR_FIELD(rtekind);
  	COPY_SCALAR_FIELD(relid);
  	COPY_SCALAR_FIELD(relkind);
+ 	COPY_NODE_FIELD(sample_info);
  	COPY_NODE_FIELD(subquery);
  	COPY_SCALAR_FIELD(security_barrier);
  	COPY_SCALAR_FIELD(jointype);
***************
*** 3867,3872 **** copyObject(const void *from)
--- 3905,3913 ----
  		case T_ValuesScan:
  			retval = _copyValuesScan(from);
  			break;
+ 		case T_SampleScan:
+ 			retval = _copySampleScan(from);
+ 			break;
  		case T_CteScan:
  			retval = _copyCteScan(from);
  			break;
***************
*** 4066,4071 **** copyObject(const void *from)
--- 4107,4115 ----
  		case T_FromExpr:
  			retval = _copyFromExpr(from);
  			break;
+ 		case T_TableSampleInfo:
+ 			retval = _copyTableSampleInfo(from);
+ 			break;
  
  			/*
  			 * RELATION NODES
*** a/src/backend/nodes/equalfuncs.c
--- b/src/backend/nodes/equalfuncs.c
***************
*** 107,112 **** _equalRangeVar(const RangeVar *a, const RangeVar *b)
--- 107,113 ----
  	COMPARE_SCALAR_FIELD(relpersistence);
  	COMPARE_NODE_FIELD(alias);
  	COMPARE_LOCATION_FIELD(location);
+ 	COMPARE_NODE_FIELD(sample_info);
  
  	return true;
  }
***************
*** 777,782 **** _equalFromExpr(const FromExpr *a, const FromExpr *b)
--- 778,794 ----
  	return true;
  }
  
+ static bool
+ _equalTableSampleInfo(TableSampleInfo *a, TableSampleInfo *b)
+ {
+ 	COMPARE_SCALAR_FIELD(sample_percent);
+ 	COMPARE_SCALAR_FIELD(sample_method);
+ 	COMPARE_SCALAR_FIELD(is_repeatable);
+ 	COMPARE_SCALAR_FIELD(repeat_seed);
+ 
+ 	return true;
+ }
+ 
  
  /*
   * Stuff from relation.h
***************
*** 2271,2276 **** _equalRangeTblEntry(const RangeTblEntry *a, const RangeTblEntry *b)
--- 2283,2289 ----
  	COMPARE_SCALAR_FIELD(rtekind);
  	COMPARE_SCALAR_FIELD(relid);
  	COMPARE_SCALAR_FIELD(relkind);
+ 	COMPARE_NODE_FIELD(sample_info);
  	COMPARE_NODE_FIELD(subquery);
  	COMPARE_SCALAR_FIELD(security_barrier);
  	COMPARE_SCALAR_FIELD(jointype);
***************
*** 2630,2635 **** equal(const void *a, const void *b)
--- 2643,2651 ----
  		case T_JoinExpr:
  			retval = _equalJoinExpr(a, b);
  			break;
+ 		case T_TableSampleInfo:
+ 			retval = _equalTableSampleInfo(a, b);
+ 			break;
  
  			/*
  			 * RELATION NODES
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
***************
*** 437,442 **** _outSeqScan(StringInfo str, const SeqScan *node)
--- 437,452 ----
  }
  
  static void
+ _outSampleScan(StringInfo str, const SampleScan *node)
+ {
+ 	WRITE_NODE_TYPE("SAMPLESCAN");
+ 
+ 	_outScanInfo(str, (const Scan *) node);
+ 
+ 	WRITE_NODE_FIELD(sample_info);
+ }
+ 
+ static void
  _outIndexScan(StringInfo str, const IndexScan *node)
  {
  	WRITE_NODE_TYPE("INDEXSCAN");
***************
*** 880,885 **** _outRangeVar(StringInfo str, const RangeVar *node)
--- 890,896 ----
  	WRITE_CHAR_FIELD(relpersistence);
  	WRITE_NODE_FIELD(alias);
  	WRITE_LOCATION_FIELD(location);
+ 	WRITE_NODE_FIELD(sample_info);
  }
  
  static void
***************
*** 1749,1754 **** _outRelOptInfo(StringInfo str, const RelOptInfo *node)
--- 1760,1766 ----
  	WRITE_FLOAT_FIELD(allvisfrac, "%.6f");
  	WRITE_NODE_FIELD(subplan);
  	WRITE_NODE_FIELD(subroot);
+ 	WRITE_BOOL_FIELD(has_table_sample);
  	/* we don't try to print fdwroutine or fdw_private */
  	WRITE_NODE_FIELD(baserestrictinfo);
  	WRITE_NODE_FIELD(joininfo);
***************
*** 2330,2335 **** _outRangeTblEntry(StringInfo str, const RangeTblEntry *node)
--- 2342,2348 ----
  		case RTE_RELATION:
  			WRITE_OID_FIELD(relid);
  			WRITE_CHAR_FIELD(relkind);
+ 			WRITE_NODE_FIELD(sample_info);
  			break;
  		case RTE_SUBQUERY:
  			WRITE_NODE_FIELD(subquery);
***************
*** 2372,2377 **** _outRangeTblEntry(StringInfo str, const RangeTblEntry *node)
--- 2385,2401 ----
  }
  
  static void
+ _outTableSampleInfo(StringInfo str, TableSampleInfo *node)
+ {
+ 	WRITE_NODE_TYPE("TABLESAMPLE");
+ 
+ 	WRITE_INT_FIELD(sample_percent);
+ 	WRITE_ENUM_FIELD(sample_method, TableSampleMethod);
+ 	WRITE_INT_FIELD(repeat_seed);
+ 	WRITE_BOOL_FIELD(is_repeatable);
+ }
+ 
+ static void
  _outAExpr(StringInfo str, const A_Expr *node)
  {
  	WRITE_NODE_TYPE("AEXPR");
***************
*** 2762,2767 **** _outNode(StringInfo str, const void *obj)
--- 2786,2794 ----
  			case T_ValuesScan:
  				_outValuesScan(str, obj);
  				break;
+ 			case T_SampleScan:
+ 				_outSampleScan(str, obj);
+ 				break;
  			case T_CteScan:
  				_outCteScan(str, obj);
  				break;
***************
*** 3148,3153 **** _outNode(StringInfo str, const void *obj)
--- 3175,3183 ----
  			case T_Constraint:
  				_outConstraint(str, obj);
  				break;
+ 			case T_TableSampleInfo:
+ 				_outTableSampleInfo(str, obj);
+ 				break;
  			case T_FuncCall:
  				_outFuncCall(str, obj);
  				break;
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
***************
*** 380,385 **** _readRangeVar(void)
--- 380,399 ----
  	READ_CHAR_FIELD(relpersistence);
  	READ_NODE_FIELD(alias);
  	READ_LOCATION_FIELD(location);
+ 	READ_NODE_FIELD(sample_info);
+ 
+ 	READ_DONE();
+ }
+ 
+ static TableSampleInfo *
+ _readTableSampleInfo(void)
+ {
+ 	READ_LOCALS(TableSampleInfo);
+ 
+ 	READ_INT_FIELD(sample_percent);
+ 	READ_ENUM_FIELD(sample_method, TableSampleMethod);
+ 	READ_BOOL_FIELD(is_repeatable);
+ 	READ_INT_FIELD(repeat_seed);
  
  	READ_DONE();
  }
***************
*** 1189,1194 **** _readRangeTblEntry(void)
--- 1203,1209 ----
  		case RTE_RELATION:
  			READ_OID_FIELD(relid);
  			READ_CHAR_FIELD(relkind);
+ 			READ_NODE_FIELD(sample_info);
  			break;
  		case RTE_SUBQUERY:
  			READ_NODE_FIELD(subquery);
***************
*** 1270,1275 **** parseNodeString(void)
--- 1285,1292 ----
  		return_value = _readAlias();
  	else if (MATCH("RANGEVAR", 8))
  		return_value = _readRangeVar();
+ 	else if (MATCH("TABLESAMPLE", 11))
+ 		return_value = _readTableSampleInfo();
  	else if (MATCH("INTOCLAUSE", 10))
  		return_value = _readIntoClause();
  	else if (MATCH("VAR", 3))
*** a/src/backend/optimizer/path/allpaths.c
--- b/src/backend/optimizer/path/allpaths.c
***************
*** 376,389 **** set_plain_rel_size(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
  static void
  set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
  {
! 	/* Consider sequential scan */
! 	add_path(rel, create_seqscan_path(root, rel, NULL));
  
! 	/* Consider index scans */
! 	create_index_paths(root, rel);
  
! 	/* Consider TID scans */
! 	create_tidscan_paths(root, rel);
  
  	/* Now find the cheapest of the paths for this rel */
  	set_cheapest(rel);
--- 376,403 ----
  static void
  set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
  {
! 	/*
! 	 * If there's a TABLESAMPLE clause, we ONLY consider using a SampleScan.
! 	 * This could be improved: in some cicumstances it might make sense to do
! 	 * an IndexScan and then sample from the index scan's result set, for instance.
! 	 */
! 	if(rel->has_table_sample){
! 		/* We decide we are inside a cursor stmt. If yes, abort */
! 		if(root->glob->isCursorStmt)
! 			elog(ERROR,
! 					"Unsupported Cursor: No Tablesample should be queried under cursor!");
! 		add_path(rel, create_samplescan_path(root, rel));
! 	}else
! 	{
! 		/* Consider sequential scan */
! 		add_path(rel, create_seqscan_path(root, rel, NULL));
  
! 		/* Consider index scans */
! 		create_index_paths(root, rel);
  
! 		/* Consider TID scans */
! 		create_tidscan_paths(root, rel);
! 	}
  
  	/* Now find the cheapest of the paths for this rel */
  	set_cheapest(rel);
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
***************
*** 103,108 **** int			effective_cache_size = DEFAULT_EFFECTIVE_CACHE_SIZE;
--- 103,109 ----
  Cost		disable_cost = 1.0e10;
  
  bool		enable_seqscan = true;
+ bool		enable_samplescan = true;
  bool		enable_indexscan = true;
  bool		enable_indexonlyscan = true;
  bool		enable_bitmapscan = true;
***************
*** 214,219 **** cost_seqscan(Path *path, PlannerInfo *root,
--- 215,263 ----
  	path->total_cost = startup_cost + run_cost;
  }
  
+ 
+ /*
+  * cost_samplescan
+  *	  Determines and returns the cost of scanning a relation according to sample scan.
+  */
+ void
+ cost_samplescan(Path *path, PlannerInfo *root,
+ 			 RelOptInfo *baserel)
+ {
+ 	Cost			startup_cost = 0;
+ 	Cost			run_cost = 0;
+ 	Cost			cpu_per_tuple;
+ 	RangeTblEntry	*rte;
+ 	int				sample_percent;
+ 
+ 	/* Should only be applied to base relations */
+ 	Assert(baserel->relid > 0);
+ 	Assert(baserel->rtekind == RTE_RELATION);
+ 	Assert(path->pathtype == T_SampleScan);
+ 
+ 	rte = planner_rt_fetch(baserel->relid, root);
+ 	sample_percent = rte->sample_info->sample_percent;
+ 
+ 	/*
+ 	 * disk costs
+ 	 * When the sample percentage is close to 100, we're likely to
+ 	 * be doing purely sequantial I/O. Conversely, for small percentage
+ 	 * samples, we're doing random I/O. Fow now, just be conservative
+ 	 * and always assume that we need to do a random I/O for each
+ 	 * sampled block. Of course, this is quite bogus.
+ 	 */
+ 	run_cost += random_page_cost * baserel->pages * sample_percent/100;
+ 
+ 	/* CPU costs */
+ 	startup_cost += baserel->baserestrictcost.startup;
+ 	cpu_per_tuple = cpu_tuple_cost + baserel->baserestrictcost.per_tuple;
+ 	run_cost += cpu_per_tuple * baserel->tuples;
+ 
+ 	path->startup_cost = startup_cost;
+ 	path->total_cost = startup_cost + run_cost;
+ }
+ 
+ 
  /*
   * cost_index
   *	  Determines and returns the cost of scanning a relation using an index.
***************
*** 3416,3422 **** approx_tuple_count(PlannerInfo *root, JoinPath *path, List *quals)
   *
   * We set the following fields of the rel node:
   *	rows: the estimated number of output tuples (after applying
!  *		  restriction clauses).
   *	width: the estimated average output tuple width in bytes.
   *	baserestrictcost: estimated cost of evaluating baserestrictinfo clauses.
   */
--- 3460,3466 ----
   *
   * We set the following fields of the rel node:
   *	rows: the estimated number of output tuples (after applying
!  *		  restriction clauses and considering the effect of TABLESAMPLE).
   *	width: the estimated average output tuple width in bytes.
   *	baserestrictcost: estimated cost of evaluating baserestrictinfo clauses.
   */
***************
*** 3424,3429 **** void
--- 3468,3474 ----
  set_baserel_size_estimates(PlannerInfo *root, RelOptInfo *rel)
  {
  	double		nrows;
+ 	RangeTblEntry	*rte;
  
  	/* Should only be applied to base relations */
  	Assert(rel->relid > 0);
***************
*** 3435,3440 **** set_baserel_size_estimates(PlannerInfo *root, RelOptInfo *rel)
--- 3480,3496 ----
  							   JOIN_INNER,
  							   NULL);
  
+ 	/* Consider TABLESAMPLE, if any. We assume the live heap rows are
+ 	 * uniformly distributed over the heap: this is a bogus simplifying
+ 	 * assumption. Note that the executor will apply the TABLESAMPLE
+ 	 * clause before applying any restrictions, we assume that the 
+ 	 * restrictions have the same selectivity for the sampled sub-relation
+ 	 * as they do for the entire relation (which is somewhat reasonable).
+ 	 */
+ 	rte = planner_rt_fetch(rel->relid, root);
+ 	if(rte->sample_info)
+ 		nrows = nrows*(rte->sample_info->sample_percent/100);
+ 
  	rel->rows = clamp_row_est(nrows);
  
  	cost_qual_eval(&rel->baserestrictcost, rel->baserestrictinfo, root);
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
***************
*** 55,60 **** static Material *create_material_plan(PlannerInfo *root, MaterialPath *best_path
--- 55,62 ----
  static Plan *create_unique_plan(PlannerInfo *root, UniquePath *best_path);
  static SeqScan *create_seqscan_plan(PlannerInfo *root, Path *best_path,
  					List *tlist, List *scan_clauses);
+ static SampleScan *create_samplescan_plan(PlannerInfo *root, Path *best_path,
+ 						List *tlist, List *scan_clauses);
  static Scan *create_indexscan_plan(PlannerInfo *root, IndexPath *best_path,
  					  List *tlist, List *scan_clauses, bool indexonly);
  static BitmapHeapScan *create_bitmap_scan_plan(PlannerInfo *root,
***************
*** 93,98 **** static List *order_qual_clauses(PlannerInfo *root, List *clauses);
--- 95,101 ----
  static void copy_path_costsize(Plan *dest, Path *src);
  static void copy_plan_costsize(Plan *dest, Plan *src);
  static SeqScan *make_seqscan(List *qptlist, List *qpqual, Index scanrelid);
+ static SampleScan *make_samplescan(PlannerInfo *root, List *qptlist, List *qpqual, Index scanrelid);
  static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
  			   Oid indexid, List *indexqual, List *indexqualorig,
  			   List *indexorderby, List *indexorderbyorig,
***************
*** 214,219 **** create_plan_recurse(PlannerInfo *root, Path *best_path)
--- 217,223 ----
  	switch (best_path->pathtype)
  	{
  		case T_SeqScan:
+ 		case T_SampleScan:
  		case T_IndexScan:
  		case T_IndexOnlyScan:
  		case T_BitmapHeapScan:
***************
*** 326,331 **** create_scan_plan(PlannerInfo *root, Path *best_path)
--- 330,342 ----
  												scan_clauses);
  			break;
  
+ 		case T_SampleScan:
+ 			plan = (Plan *) create_samplescan_plan(root,
+ 													best_path,
+ 													tlist,
+ 													scan_clauses);
+ 			break;
+ 
  		case T_IndexScan:
  			plan = (Plan *) create_indexscan_plan(root,
  												  (IndexPath *) best_path,
***************
*** 512,517 **** disuse_physical_tlist(Plan *plan, Path *path)
--- 523,529 ----
  	switch (path->pathtype)
  	{
  		case T_SeqScan:
+ 		case T_SampleScan:
  		case T_IndexScan:
  		case T_IndexOnlyScan:
  		case T_BitmapHeapScan:
***************
*** 1089,1094 **** create_seqscan_plan(PlannerInfo *root, Path *best_path,
--- 1101,1146 ----
  }
  
  /*
+  * create_samplescan_plan
+  *	 Returns a samplescan plan for the base relation scanned by 'best_path'
+  *	 with restriction clauses 'scan_clauses' and targetlist 'tlist'.
+  */
+ static SampleScan *
+ create_samplescan_plan(PlannerInfo *root, Path *best_path,
+ 					List *tlist, List *scan_clauses)
+ {
+ 	SampleScan    *scan_plan;
+ 	Index		scan_relid = best_path->parent->relid;
+ 
+ 	/* it should be a base rel... */
+ 	Assert(scan_relid > 0);
+ 	Assert(best_path->parent->rtekind == RTE_RELATION);
+ 	Assert(best_path->pathtype == T_SampleScan);
+ 
+ 	/* Sort clauses into best execution order */
+ 	scan_clauses = order_qual_clauses(root, scan_clauses);
+ 
+ 	/* Reduce RestrictInfo list to bare expressions; ignore pseudoconstants */
+ 	scan_clauses = extract_actual_clauses(scan_clauses, false);
+ 
+ 	/* Replace any outer-relation variables with nestloop params */
+ 	if (best_path->param_info)
+ 	{
+ 		scan_clauses = (List *)
+ 			replace_nestloop_params(root, (Node *) scan_clauses);
+ 	}
+ 
+ 	scan_plan = make_samplescan(root, tlist,
+ 							 scan_clauses,
+ 							 scan_relid);
+ 
+ 	copy_path_costsize(&scan_plan->scan.plan, best_path);
+ 
+ 	return scan_plan;
+ }
+ 
+ 
+ /*
   * create_indexscan_plan
   *	  Returns an indexscan plan for the base relation scanned by 'best_path'
   *	  with restriction clauses 'scan_clauses' and targetlist 'tlist'.
***************
*** 3167,3172 **** make_seqscan(List *qptlist,
--- 3219,3253 ----
  	return node;
  }
  
+ 
+ static SampleScan *
+ make_samplescan(PlannerInfo *root,
+ 			 List *qptlist,
+ 			 List *qpqual,
+ 			 Index scanrelid)
+ {
+ 	SampleScan    *node = makeNode(SampleScan);
+ 	Plan	   *plan = &node->scan.plan;
+ 	RangeTblEntry *rte;
+ 
+ 	/* cost should be inserted by caller */
+ 	plan->targetlist = qptlist;
+ 	plan->qual = qpqual;
+ 	plan->lefttree = NULL;
+ 	plan->righttree = NULL;
+ 	node->scan.scanrelid = scanrelid;
+ 
+ 	/*
+ 	 * For the convenience of nodeSamplescan.c, we stash a pointer to
+ 	 * the TableSampleInfo for this SampleScan in the SampleScan's
+ 	 * plan node.
+ 	 */
+ 	rte = planner_rt_fetch(scanrelid, root);
+ 	node->sample_info = rte->sample_info;
+ 
+ 	return node;
+ }
+ 
  static IndexScan *
  make_indexscan(List *qptlist,
  			   List *qpqual,
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
***************
*** 163,168 **** standard_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
--- 163,171 ----
  	glob->lastPHId = 0;
  	glob->lastRowMarkId = 0;
  	glob->transientPlan = false;
+ 	if(cursorOptions)
+ 		glob->isCursorStmt = true;
+ 	else glob->isCursorStmt = false;
  
  	/* Determine what fraction of the plan is likely to be scanned */
  	if (cursorOptions & CURSOR_OPT_FAST_PLAN)
*** a/src/backend/optimizer/plan/setrefs.c
--- b/src/backend/optimizer/plan/setrefs.c
***************
*** 306,311 **** set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
--- 306,322 ----
  					fix_scan_list(root, splan->plan.qual, rtoffset);
  			}
  			break;
+ 		case T_SampleScan:
+ 			{
+ 				SampleScan     *splan = (SampleScan *) plan;
+ 
+ 				splan->scan.scanrelid += rtoffset;
+ 				splan->scan.plan.targetlist -
+ 					fix_scan_list(root, splan->scan.plan.targetlist, rtoffset);
+ 				splan->scan.plan.qual =
+ 					fix_scan_list(root, splan->scan.plan.qual, rtoffset);
+ 			}
+ 			break;
  		case T_IndexScan:
  			{
  				IndexScan  *splan = (IndexScan *) plan;
*** a/src/backend/optimizer/plan/subselect.c
--- b/src/backend/optimizer/plan/subselect.c
***************
*** 2082,2087 **** finalize_plan(PlannerInfo *root, Plan *plan, Bitmapset *valid_params,
--- 2082,2091 ----
  			context.paramids = bms_add_members(context.paramids, scan_params);
  			break;
  
+ 		case T_SampleScan:
+ 			context.paramids = bms_add_members(context.paramids, scan_params);
+ 			break;
+ 
  		case T_IndexScan:
  			finalize_primnode((Node *) ((IndexScan *) plan)->indexqual,
  							  &context);
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
***************
*** 747,752 **** create_seqscan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer)
--- 747,774 ----
  	return pathnode;
  }
  
+ 
+ /*
+  * create_samplescan_path
+  *	  Creates a path corresponding to a sample scan, returning the
+  *	  pathnode.
+  *	  Forked from create_seqscan_path, maybe there are other attributes needed.
+  */
+ Path *
+ create_samplescan_path(PlannerInfo *root, RelOptInfo *rel)
+ {
+ 	Path	   *pathnode = makeNode(Path);
+ 
+ 	pathnode->pathtype = T_SampleScan;
+ 	pathnode->parent = rel;
+ 	pathnode->pathkeys = NIL;
+ 
+ 	cost_samplescan(pathnode, root, rel);
+ 
+ 	return pathnode;
+ }
+ 
+ 
  /*
   * create_index_path
   *	  Creates a path node for an index scan.
***************
*** 2072,2077 **** reparameterize_path(PlannerInfo *root, Path *path,
--- 2094,2101 ----
  	{
  		case T_SeqScan:
  			return create_seqscan_path(root, rel, required_outer);
+ 		case T_SampleScan:
+ 			return create_samplescan_path(root, rel);
  		case T_IndexScan:
  		case T_IndexOnlyScan:
  			{
*** a/src/backend/optimizer/util/relnode.c
--- b/src/backend/optimizer/util/relnode.c
***************
*** 117,122 **** build_simple_rel(PlannerInfo *root, int relid, RelOptKind reloptkind)
--- 117,123 ----
  	rel->subroot = NULL;
  	rel->fdwroutine = NULL;
  	rel->fdw_private = NULL;
+ 	rel->has_table_sample = (rte->sample_info != NULL);
  	rel->baserestrictinfo = NIL;
  	rel->baserestrictcost.startup = 0;
  	rel->baserestrictcost.per_tuple = 0;
***************
*** 373,378 **** build_join_rel(PlannerInfo *root,
--- 374,380 ----
  	joinrel->subroot = NULL;
  	joinrel->fdwroutine = NULL;
  	joinrel->fdw_private = NULL;
+ 	joinrel->has_table_sample = false;
  	joinrel->baserestrictinfo = NIL;
  	joinrel->baserestrictcost.startup = 0;
  	joinrel->baserestrictcost.per_tuple = 0;
*** a/src/backend/parser/analyze.c
--- b/src/backend/parser/analyze.c
***************
*** 412,417 **** transformInsertStmt(ParseState *pstate, InsertStmt *stmt)
--- 412,426 ----
  	/* There can't be any outer WITH to worry about */
  	Assert(pstate->p_ctenamespace == NIL);
  
+ 	/*
+ 	 * This should be disallowed by the parser, but check anyway for
+ 	 * the sake of paranoia.
+ 	 */
+ 	if (stmt->relation->sample_info)
+ 		ereport(ERROR,
+ 			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ 				errmsg("TABLESAMPLE cannot be specified for INSERT")));
+ 
  	qry->commandType = CMD_INSERT;
  	pstate->p_is_insert = true;
  
*** a/src/backend/parser/gram.y
--- b/src/backend/parser/gram.y
***************
*** 402,407 **** static void processCASbits(int cas_bits, int location, const char *constrType,
--- 402,411 ----
  %type <ielem>	index_elem
  %type <node>	table_ref
  %type <jexpr>	joined_table
+ %type <node>	opt_table_sample
+ %type <range>	relation_expr_opt_sample
+ %type <value>	sample_method
+ %type <value>	opt_repeatable_clause
  %type <range>	relation_expr
  %type <range>	relation_expr_opt_alias
  %type <target>	target_el single_set_clause set_target insert_column_item
***************
*** 497,503 **** static void processCASbits(int cas_bits, int location, const char *constrType,
  	AGGREGATE ALL ALSO ALTER ALWAYS ANALYSE ANALYZE AND ANY ARRAY AS ASC
  	ASSERTION ASSIGNMENT ASYMMETRIC AT ATTRIBUTE AUTHORIZATION
  
! 	BACKWARD BEFORE BEGIN_P BETWEEN BIGINT BINARY BIT
  	BOOLEAN_P BOTH BY
  
  	CACHE CALLED CASCADE CASCADED CASE CAST CATALOG_P CHAIN CHAR_P
--- 501,507 ----
  	AGGREGATE ALL ALSO ALTER ALWAYS ANALYSE ANALYZE AND ANY ARRAY AS ASC
  	ASSERTION ASSIGNMENT ASYMMETRIC AT ATTRIBUTE AUTHORIZATION
  
! 	BACKWARD BEFORE BEGIN_P BERNOULLI BETWEEN BIGINT BINARY BIT
  	BOOLEAN_P BOTH BY
  
  	CACHE CALLED CASCADE CASCADED CASE CAST CATALOG_P CHAIN CHAR_P
***************
*** 563,569 **** static void processCASbits(int cas_bits, int location, const char *constrType,
  	STATEMENT STATISTICS STDIN STDOUT STORAGE STRICT_P STRIP_P SUBSTRING
  	SYMMETRIC SYSID SYSTEM_P
  
! 	TABLE TABLES TABLESPACE TEMP TEMPLATE TEMPORARY TEXT_P THEN TIME TIMESTAMP
  	TO TRAILING TRANSACTION TREAT TRIGGER TRIM TRUE_P
  	TRUNCATE TRUSTED TYPE_P TYPES_P
  
--- 567,573 ----
  	STATEMENT STATISTICS STDIN STDOUT STORAGE STRICT_P STRIP_P SUBSTRING
  	SYMMETRIC SYSID SYSTEM_P
  
! 	TABLE TABLES TABLESAMPLE TABLESPACE TEMP TEMPLATE TEMPORARY TEXT_P THEN TIME TIMESTAMP
  	TO TRAILING TRANSACTION TREAT TRIGGER TRIM TRUE_P
  	TRUNCATE TRUSTED TYPE_P TYPES_P
  
***************
*** 9312,9318 **** from_list:
  /*
   * table_ref is where an alias clause can be attached.
   */
! table_ref:	relation_expr opt_alias_clause
  				{
  					$1->alias = $2;
  					$$ = (Node *) $1;
--- 9316,9322 ----
  /*
   * table_ref is where an alias clause can be attached.
   */
! table_ref:  relation_expr_opt_sample opt_alias_clause
  				{
  					$1->alias = $2;
  					$$ = (Node *) $1;
***************
*** 9577,9582 **** join_qual:	USING '(' name_list ')'					{ $$ = (Node *) $3; }
--- 9581,9647 ----
  			| ON a_expr								{ $$ = $2; }
  		;
  
+ /*
+  * We want to allow the TABLESAMPLE clause to be specified for 
+  * SELECT, DELETE, and UPDATE, but not for DDL commands. Therefore,
+  * we add a new production that is "relation_expr + optional TABLESAMPLE",
+  * and use that anywhere we'd like to allow a TABLESAMPLE clause to be specified.
+  */
+ 
+ relation_expr_opt_sample:
+ 			relation_expr opt_table_sample
+ 			{
+ 				$$ = $1;
+ 				$$->sample_info = (TableSampleInfo *) $2;
+ 			}
+ 		;
+ 
+ opt_table_sample:
+ 			TABLESAMPLE sample_method '('Iconst')' opt_repeatable_clause
+ 			{
+ 				TableSampleInfo *n = makeNode(TableSampleInfo);
+ 
+ 				if( $2 == 1 )
+ 					n->sample_method = SAMPLE_BERNOULLI;
+ 				else if( $2 == 2 )
+ 					n->sample_method = SAMPLE_SYSTEM;
+ 				else
+ 					ereport(ERROR,
+ 							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ 							 errmsg("Sampling method not supported")));
+ 
+ 				n->sample_percent = $4;
+ 				if($4 > 100)
+ 					ereport(ERROR,
+ 							(errcode(ERRCODE_INVALID_SAMPLE_SIZE),
+ 							 errmsg("TABLESAMPLE percentage"
+ 									"be greater than 100")));
+ 				if($4 < 0)
+ 					ereport(ERROR,
+ 							(errcode(ERRCODE_INVALID_SAMPLE_SIZE),
+ 							 errmsg("TABLESAMPLE percentage must"
+ 									"be greater than 0")));
+ 
+ 				if($6 != NULL)
+ 				{
+ 					n->is_repeatable = true;
+ 					n->repeat_seed = intVal($6);
+ 				}
+ 
+ 				$$ = (Node *)n;
+ 			}
+ 			| /*EMPTY*/				{ $$ = NULL; }
+ 		;
+ 
+ sample_method:
+ 			BERNOULLI				{ $$ = 1; }
+ 			| SYSTEM_P 				{ $$ = 2; }
+ 		;
+ 
+ opt_repeatable_clause:
+ 			REPEATABLE '('Iconst')' { $$ = makeInteger($3); }
+ 			| /*EMPTY*/				{ $$ = NULL; }
+ 		;
  
  relation_expr:
  			qualified_name
***************
*** 9625,9642 **** relation_expr_list:
   * has, causing the parser to prefer to reduce, in effect assuming that the
   * SET is not an alias.
   */
! relation_expr_opt_alias: relation_expr					%prec UMINUS
  				{
  					$$ = $1;
  				}
! 			| relation_expr ColId
  				{
  					Alias *alias = makeNode(Alias);
  					alias->aliasname = $2;
  					$1->alias = alias;
  					$$ = $1;
  				}
! 			| relation_expr AS ColId
  				{
  					Alias *alias = makeNode(Alias);
  					alias->aliasname = $3;
--- 9690,9707 ----
   * has, causing the parser to prefer to reduce, in effect assuming that the
   * SET is not an alias.
   */
! relation_expr_opt_alias: relation_expr_opt_sample					%prec UMINUS
  				{
  					$$ = $1;
  				}
! 			| relation_expr_opt_sample ColId
  				{
  					Alias *alias = makeNode(Alias);
  					alias->aliasname = $2;
  					$1->alias = alias;
  					$$ = $1;
  				}
! 			| relation_expr_opt_sample AS ColId
  				{
  					Alias *alias = makeNode(Alias);
  					alias->aliasname = $3;
***************
*** 12385,12390 **** unreserved_keyword:
--- 12450,12456 ----
  			| BACKWARD
  			| BEFORE
  			| BEGIN_P
+ 			| BERNOULLI
  			| BY
  			| CACHE
  			| CALLED
***************
*** 12539,12545 **** unreserved_keyword:
  			| RELATIVE_P
  			| RELEASE
  			| RENAME
- 			| REPEATABLE
  			| REPLACE
  			| REPLICA
  			| RESET
--- 12605,12610 ----
***************
*** 12712,12717 **** type_func_name_keyword:
--- 12777,12783 ----
  			| OVERLAPS
  			| RIGHT
  			| SIMILAR
+ 			| TABLESAMPLE
  			| VERBOSE
  		;
  
***************
*** 12780,12785 **** reserved_keyword:
--- 12846,12852 ----
  			| PLACING
  			| PRIMARY
  			| REFERENCES
+ 			| REPEATABLE
  			| RETURNING
  			| SELECT
  			| SESSION_USER
*** a/src/backend/parser/parse_clause.c
--- b/src/backend/parser/parse_clause.c
***************
*** 190,195 **** setTargetTable(ParseState *pstate, RangeVar *relation,
--- 190,201 ----
  										relation->alias, inh, false);
  	pstate->p_target_rangetblentry = rte;
  
+ 	/*
+ 	 * Minor kludge: addRangeTableEntryForRelation() can't see the
+ 	 * TABLESAMPLE clause, so attach it to the new RTE manually.
+ 	 */
+ 	rte->sample_info = relation->sample_info;
+ 
  	/* assume new rte is at end */
  	rtindex = list_length(pstate->p_rtable);
  	Assert(rte == rt_fetch(rtindex, pstate->p_rtable));
*** a/src/backend/parser/parse_relation.c
--- b/src/backend/parser/parse_relation.c
***************
*** 330,336 **** searchRangeTableForRel(ParseState *pstate, RangeVar *relation)
  
  			if (rte->rtekind == RTE_RELATION &&
  				OidIsValid(relId) &&
! 				rte->relid == relId)
  				return rte;
  			if (rte->rtekind == RTE_CTE &&
  				cte != NULL &&
--- 330,337 ----
  
  			if (rte->rtekind == RTE_RELATION &&
  				OidIsValid(relId) &&
! 				rte->relid == relId &&
! 				equal(rte->sample_info, relation->sample_info))
  				return rte;
  			if (rte->rtekind == RTE_CTE &&
  				cte != NULL &&
***************
*** 995,1000 **** addRangeTableEntry(ParseState *pstate,
--- 996,1002 ----
  	rel = parserOpenTable(pstate, relation, lockmode);
  	rte->relid = RelationGetRelid(rel);
  	rte->relkind = rel->rd_rel->relkind;
+ 	rte->sample_info = relation->sample_info;
  
  	/*
  	 * Build the list of effective column names using user-supplied aliases
***************
*** 1055,1060 **** addRangeTableEntryForRelation(ParseState *pstate,
--- 1057,1063 ----
  	rte->alias = alias;
  	rte->relid = RelationGetRelid(rel);
  	rte->relkind = rel->rd_rel->relkind;
+ 	rte->sample_info = NULL;
  
  	/*
  	 * Build the list of effective column names using user-supplied aliases
***************
*** 1110,1115 **** addRangeTableEntryForSubquery(ParseState *pstate,
--- 1113,1119 ----
  
  	rte->rtekind = RTE_SUBQUERY;
  	rte->relid = InvalidOid;
+ 	rte->sample_info = NULL;
  	rte->subquery = subquery;
  	rte->alias = alias;
  
***************
*** 1189,1194 **** addRangeTableEntryForFunction(ParseState *pstate,
--- 1193,1199 ----
  
  	rte->rtekind = RTE_FUNCTION;
  	rte->relid = InvalidOid;
+ 	rte->sample_info = NULL;
  	rte->subquery = NULL;
  	rte->funcexpr = funcexpr;
  	rte->funccoltypes = NIL;
***************
*** 1323,1328 **** addRangeTableEntryForValues(ParseState *pstate,
--- 1328,1334 ----
  
  	rte->rtekind = RTE_VALUES;
  	rte->relid = InvalidOid;
+ 	rte->sample_info = NULL;
  	rte->subquery = NULL;
  	rte->values_lists = exprs;
  	rte->values_collations = collations;
***************
*** 1403,1408 **** addRangeTableEntryForJoin(ParseState *pstate,
--- 1409,1415 ----
  
  	rte->rtekind = RTE_JOIN;
  	rte->relid = InvalidOid;
+ 	rte->sample_info = NULL;
  	rte->subquery = NULL;
  	rte->jointype = jointype;
  	rte->joinaliasvars = aliasvars;
*** a/src/backend/utils/adt/ruleutils.c
--- b/src/backend/utils/adt/ruleutils.c
***************
*** 2174,2179 **** deparse_context_for(const char *aliasname, Oid relid)
--- 2174,2180 ----
  	rte = makeNode(RangeTblEntry);
  	rte->rtekind = RTE_RELATION;
  	rte->relid = relid;
+ 	rte->sample_info = NULL;
  	rte->relkind = RELKIND_RELATION;	/* no need for exactness here */
  	rte->eref = makeAlias(aliasname, NIL);
  	rte->lateral = false;
***************
*** 6719,6724 **** get_from_clause_item(Node *jtnode, Query *query, deparse_context *context)
--- 6720,6750 ----
  			 */
  			get_from_clause_alias(rte->alias, rte, context);
  		}
+ 
+ 		if (rte->sample_info)
+ 		{
+ 			TableSampleInfo		*sample_info = rte->sample_info;
+ 			const char			*method_name;
+ 
+ 			Assert(rte->rtekind == RTE_RELATION);
+ 
+ 			switch(sample_info->sample_method)
+ 			{
+ 				case SAMPLE_BERNOULLI:
+ 					method_name = "BERNOULLI";
+ 					break;
+ 				case SAMPLE_SYSTEM:
+ 					method_name = "SYSTEM";
+ 					break;
+ 				default:
+ 					elog(ERROR, "unrecognized sample method: %s", 
+ 						 (int) sample_info->sample_method);
+ 					method_name = NULL;
+ 			}
+ 
+ 			appendStringInfo(buf, " TABLESAMPLE %s (%d)",
+ 							 method_name, sample_info->sample_percent);
+ 		}
  	}
  	else if (IsA(jtnode, JoinExpr))
  	{
*** a/src/backend/utils/errcodes.txt
--- b/src/backend/utils/errcodes.txt
***************
*** 175,180 **** Section: Class 22 - Data Exception
--- 175,181 ----
  22010    E    ERRCODE_INVALID_INDICATOR_PARAMETER_VALUE                      invalid_indicator_parameter_value
  22023    E    ERRCODE_INVALID_PARAMETER_VALUE                                invalid_parameter_value
  2201B    E    ERRCODE_INVALID_REGULAR_EXPRESSION                             invalid_regular_expression
+ 2202H    E    ERRCODE_INVALID_SAMPLE_SIZE                                    invalid_sample_size
  2201W    E    ERRCODE_INVALID_ROW_COUNT_IN_LIMIT_CLAUSE                      invalid_row_count_in_limit_clause
  2201X    E    ERRCODE_INVALID_ROW_COUNT_IN_RESULT_OFFSET_CLAUSE              invalid_row_count_in_result_offset_clause
  22009    E    ERRCODE_INVALID_TIME_ZONE_DISPLACEMENT_VALUE                   invalid_time_zone_displacement_value
*** a/src/backend/utils/misc/Makefile
--- b/src/backend/utils/misc/Makefile
***************
*** 15,21 **** include $(top_builddir)/src/Makefile.global
  override CPPFLAGS := -I. -I$(srcdir) $(CPPFLAGS)
  
  OBJS = guc.o help_config.o pg_rusage.o ps_status.o rbtree.o \
!        superuser.o timeout.o tzparser.o
  
  # This location might depend on the installation directories. Therefore
  # we can't subsitute it into pg_config.h.
--- 15,21 ----
  override CPPFLAGS := -I. -I$(srcdir) $(CPPFLAGS)
  
  OBJS = guc.o help_config.o pg_rusage.o ps_status.o rbtree.o \
!        superuser.o timeout.o tzparser.o sampleutils.o
  
  # This location might depend on the installation directories. Therefore
  # we can't subsitute it into pg_config.h.
*** /dev/null
--- b/src/backend/utils/misc/sampleutils.c
***************
*** 0 ****
--- 1,272 ----
+ /*-------------------------------------------------------------------------
+  *
+  * sampleutils.c
+  *	  Routines designed for TableSample Clause, containing Vitter's Reservoir
+  *	  Sampling Algo routines. Tablesample System, Bernoulli sampling and analyze
+  *	  is using these apis.
+  *
+  * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  *
+  * IDENTIFICATION
+  *	  src/backend/utils/misc/sampleutils.c
+  *
+  *-------------------------------------------------------------------------
+  */
+ #include "postgres.h"
+ 
+ #include "utils/sampleutils.h"
+ #include "storage/procarray.h"
+ #include "storage/bufmgr.h"
+ #include "utils/tqual.h"
+ #include "utils/rel.h"
+ 
+ #define MEM_OVERHEAD 100000000
+ 
+ 
+ /*
+  * BlockSampler_Init -- prepare for random sampling of blocknumbers
+  *
+  * BlockSampler is used for stage one of our new two-stage tuple
+  * sampling mechanism as discussed on pgsql-hackers 2004-04-02 (subject
+  * "Large DB").  It selects a random sample of samplesize blocks out of
+  * the nblocks blocks in the table.  If the table has less than
+  * samplesize blocks, all blocks are selected.
+  *
+  * Since we know the total number of blocks in advance, we can use the
+  * straightforward Algorithm S from Knuth 3.4.2, rather than Vitter's
+  * algorithm.
+  */
+ void
+ BlockSampler_Init(BlockSampler bs, BlockNumber nblocks, int samplesize)
+ {
+ 	bs->N = nblocks;			/* measured table size */
+ 
+ 	/*
+ 	 * If we decide to reduce samplesize for tables that have less or not much
+ 	 * more than samplesize blocks, here is the place to do it.
+ 	 */
+ 	bs->n = samplesize;
+ 	bs->t = 0;					/* blocks scanned so far */
+ 	bs->m = 0;					/* blocks selected so far */
+ }
+ 
+ bool
+ BlockSampler_HasMore(BlockSampler bs)
+ {
+ 	return (bs->t < bs->N) && (bs->m < bs->n);
+ }
+ 
+ BlockNumber
+ BlockSampler_Next(void *state, BlockSampler bs)
+ {
+ 	BlockNumber K = bs->N - bs->t;		/* remaining blocks */
+ 	int			k = bs->n - bs->m;		/* blocks still to sample */
+ 	double		p;				/* probability to skip block */
+ 	double		V;				/* random */
+ 
+ 	Assert(BlockSampler_HasMore(bs));	/* hence K > 0 and k > 0 */
+ 
+ 	if ((BlockNumber) k >= K)
+ 	{
+ 		/* need all the rest */
+ 		bs->m++;
+ 		return bs->t++;
+ 	}
+ 
+ 	/*----------
+ 	 * It is not obvious that this code matches Knuth's Algorithm S.
+ 	 * Knuth says to skip the current block with probability 1 - k/K.
+ 	 * If we are to skip, we should advance t (hence decrease K), and
+ 	 * repeat the same probabilistic test for the next block.  The naive
+ 	 * implementation thus requires an anl_random_fract() call for each block
+ 	 * number.	But we can reduce this to one anl_random_fract() call per
+ 	 * selected block, by noting that each time the while-test succeeds,
+ 	 * we can reinterpret V as a uniform random number in the range 0 to p.
+ 	 * Therefore, instead of choosing a new V, we just adjust p to be
+ 	 * the appropriate fraction of its former value, and our next loop
+ 	 * makes the appropriate probabilistic test.
+ 	 *
+ 	 * We have initially K > k > 0.  If the loop reduces K to equal k,
+ 	 * the next while-test must fail since p will become exactly zero
+ 	 * (we assume there will not be roundoff error in the division).
+ 	 * (Note: Knuth suggests a "<=" loop condition, but we use "<" just
+ 	 * to be doubly sure about roundoff error.)  Therefore K cannot become
+ 	 * less than k, which means that we cannot fail to select enough blocks.
+ 	 *----------
+ 	 */
+ 	V = anl_random_fract(state);
+ 	p = 1.0 - (double) k / (double) K;
+ 	while (V < p)
+ 	{
+ 		/* skip */
+ 		bs->t++;
+ 		K--;					/* keep K == N - t */
+ 
+ 		/* adjust p to be new cutoff point in reduced range */
+ 		p *= 1.0 - (double) k / (double) K;
+ 	}
+ 
+ 	/* select */
+ 	bs->m++;
+ 	return bs->t++;
+ }
+ 
+ 
+ 
+ /*
+  * These two routines embody Algorithm Z from "Random sampling with a
+  * reservoir" by Jeffrey S. Vitter, in ACM Trans. Math. Softw. 11, 1
+  * (Mar. 1985), Pages 37-57.  Vitter describes his algorithm in terms
+  * of the count S of records to skip before processing another record.
+  * It is computed primarily based on t, the number of records already read.
+  * The only extra state needed between calls is W, a random state variable.
+  *
+  * anl_init_selection_state computes the initial W value.
+  *
+  * Given that we've already read t records (t >= n), anl_get_next_S
+  * determines the number of records to skip before the next record is
+  * processed.
+  */
+ double
+ anl_init_selection_state(void *state, int n)
+ {
+ 	/* Initial value of W (for use when Algorithm Z is first applied) */
+ 	return exp(-log(anl_random_fract(state)) / n);
+ }
+ 
+ double
+ anl_get_next_S(void *state, double t, int n, double *stateptr)
+ {
+ 	double		S;
+ 
+ 	/* The magic constant here is T from Vitter's paper */
+ 	if (t <= (22.0 * n))
+ 	{
+ 		/* Process records using Algorithm X until t is large enough */
+ 		double		V,
+ 					quot;
+ 
+ 		V = anl_random_fract(state); /* Generate V */
+ 		S = 0;
+ 		t += 1;
+ 		/* Note: "num" in Vitter's code is always equal to t - n */
+ 		quot = (t - (double) n) / t;
+ 		/* Find min S satisfying (4.1) */
+ 		while (quot > V)
+ 		{
+ 			S += 1;
+ 			t += 1;
+ 			quot *= (t - (double) n) / t;
+ 		}
+ 	}
+ 	else
+ 	{
+ 		/* Now apply Algorithm Z */
+ 		double		W = *stateptr;
+ 		double		term = t - (double) n + 1;
+ 
+ 		for (;;)
+ 		{
+ 			double		numer,
+ 						numer_lim,
+ 						denom;
+ 			double		U,
+ 						X,
+ 						lhs,
+ 						rhs,
+ 						y,
+ 						tmp;
+ 
+ 			
+ 			U = anl_random_fract(state); /* Generate U and X */
+ 			X = t * (W - 1.0);
+ 			S = floor(X);		/* S is tentatively set to floor(X) */
+ 			/* Test if U <= h(S)/cg(X) in the manner of (6.3) */
+ 			tmp = (t + 1) / term;
+ 			lhs = exp(log(((U * tmp * tmp) * (term + S)) / (t + X)) / n);
+ 			rhs = (((t + X) / (term + S)) * term) / t;
+ 			if (lhs <= rhs)
+ 			{
+ 				W = rhs / lhs;
+ 				break;
+ 			}
+ 			/* Test if U <= f(S)/cg(X) */
+ 			y = (((U * (t + 1)) / term) * (t + S + 1)) / (t + X);
+ 			if ((double) n < S)
+ 			{
+ 				denom = t;
+ 				numer_lim = term + S;
+ 			}
+ 			else
+ 			{
+ 				denom = t - (double) n + S;
+ 				numer_lim = t + 1;
+ 			}
+ 			for (numer = t + S; numer >= numer_lim; numer -= 1)
+ 			{
+ 				y *= numer / denom;
+ 				denom -= 1;
+ 			}
+ 			W = exp(-log(anl_random_fract(state)) / n);		/* Generate W in advance */
+ 			if (exp(log(y) / n) <= (t + X) / t)
+ 				break;
+ 		}
+ 		*stateptr = W;
+ 	}
+ 	return S;
+ }
+ 
+ /* ---------------------------------------------------
+  * Random number generator
+  * ---------------------------------------------------
+  */
+ 
+ /* Select a random value R uniformly distributed in (0 - 1), for tablsample use */
+ double
+ anl_random_fract(void *state)
+ {
+ 	/* XXX: If the sample_random can work as good as random(), may merge,
+ 	 * but now, as sample_random surely needs improvement, we leave 
+ 	 * analzye.c and tablesample using different random generator.
+ 	 */
+ 	if(state == NULL)
+ 	{
+ 		return ((double) random() + 1) / ((double) MAX_RANDOM_VALUE + 2);
+ 	}else{
+ 		return sample_random(state);
+ 	}
+ }
+ 
+ /*
+  * Returns a randomly-generated trigger x, such that a <= x < b
+  */
+ int
+ get_rand_in_range(void *rand_state, int a, int b)
+ {
+ 	double rand = sample_random(rand_state);
+ 	return (rand * b) + a;
+ }
+ 
+ /*
+  * Random Number Seeding
+  */
+ void sample_set_seed(void *random_state, double seed)
+ {
+ 	/*     
+ 	 * XXX. This seeding algorithm could certainly be improved.
+ 	 * May need some effort on this.
+ 	 */    
+ 	seed += MEM_OVERHEAD; //seed must be large enough to make difference
+ 
+ 	memset(random_state, 0, sizeof(random_state));
+ 	memcpy(random_state,
+ 			&seed,
+ 			Min(sizeof(random_state), sizeof(seed)));
+ }
+ 
+ double sample_random(void *random_state)
+ {
+ 	return pg_erand48(random_state);
+ }
*** a/src/include/access/relscan.h
--- b/src/include/access/relscan.h
***************
*** 51,56 **** typedef struct HeapScanDescData
--- 51,69 ----
  	int			rs_mindex;		/* marked tuple's saved index */
  	int			rs_ntuples;		/* number of visible tuples on page */
  	OffsetNumber rs_vistuples[MaxHeapTuplesPerPage];	/* their offsets */
+ 
+ 	/* The random number generator state */
+ 	unsigned short rs_randstate[3];
+ 
+ 	/* TableSample Bernoulli Vitter's Algo samples */
+ 	bool		is_sample_scan;
+ 	int			sample_percent;
+ 	TableSampleMethod sample_method;
+ 	HeapTuple	*rs_samplerows;
+ 	int			rs_samplesize;
+ 	int			rs_curindex;
+ 	bool		rs_sampleinited;
+ 	int			targrows;
  }	HeapScanDescData;
  
  /*
*** a/src/include/commands/vacuum.h
--- b/src/include/commands/vacuum.h
***************
*** 170,177 **** extern void lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt,
  extern void analyze_rel(Oid relid, VacuumStmt *vacstmt,
  			BufferAccessStrategy bstrategy);
  extern bool std_typanalyze(VacAttrStats *stats);
- extern double anl_random_fract(void);
- extern double anl_init_selection_state(int n);
- extern double anl_get_next_S(double t, int n, double *stateptr);
  
  #endif   /* VACUUM_H */
--- 170,174 ----
*** /dev/null
--- b/src/include/executor/nodeSamplescan.h
***************
*** 0 ****
--- 1,28 ----
+ /*-------------------------------------------------------------------------
+  *
+  * nodeSamplescan.h
+  *
+  *
+  *
+  * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  * src/include/executor/nodeSamplescan.h
+  *
+  *-------------------------------------------------------------------------
+  */
+ #ifndef NODESAMPLESCAN_H
+ #define NODESAMPLESCAN_H
+ 
+ #include "nodes/execnodes.h"
+ 
+ #define RAND_STATE_SIZE 128
+ 
+ extern SampleScanState *ExecInitSampleScan(SampleScan *node, EState *estate, int eflags);
+ extern TupleTableSlot *ExecSampleScan(SampleScanState *node);
+ extern void ExecEndSampleScan(SampleScanState *node);
+ extern void ExecSampleMarkPos(SampleScanState *node);
+ extern void ExecSampleRestrPos(SampleScanState *node);
+ extern void ExecReScanSampleScan(SampleScanState *node);
+ 
+ #endif   /* NODESAMPLESCAN_H */
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
***************
*** 1430,1435 **** typedef struct ValuesScanState
--- 1430,1455 ----
  	int			marked_idx;
  } ValuesScanState;
  
+ /* --------------
+  * This State struct should be decided by the algorithm used in Sampling
+  * method.
+  *
+  * SampleScanState: the run-time state associated with a single sample
+  * scan. This is the run-time dual of the SampleScan plan node: for
+  * each SampleScan in the Plan tree, we create a SampleScanState in
+  * the corresponding PlanState tree. A PlanState's associated Plan can
+  * be found via ss.ps.plan.
+  *
+  * --------------
+  */
+ 
+ typedef struct SampleScanState
+ {
+ 	/* parent class; first field is NodeTag */
+ 	ScanState			 ss;
+ } SampleScanState;
+ 
+ 
  /* ----------------
   *	 CteScanState information
   *
*** a/src/include/nodes/nodes.h
--- b/src/include/nodes/nodes.h
***************
*** 51,56 **** typedef enum NodeTag
--- 51,57 ----
  	T_BitmapOr,
  	T_Scan,
  	T_SeqScan,
+ 	T_SampleScan,
  	T_IndexScan,
  	T_IndexOnlyScan,
  	T_BitmapIndexScan,
***************
*** 96,101 **** typedef enum NodeTag
--- 97,103 ----
  	T_BitmapOrState,
  	T_ScanState,
  	T_SeqScanState,
+ 	T_SampleScanState,
  	T_IndexScanState,
  	T_IndexOnlyScanState,
  	T_BitmapIndexScanState,
***************
*** 171,176 **** typedef enum NodeTag
--- 173,179 ----
  	T_JoinExpr,
  	T_FromExpr,
  	T_IntoClause,
+ 	T_TableSampleInfo,
  
  	/*
  	 * TAGS FOR EXPRESSION STATE NODES (execnodes.h)
*** a/src/include/nodes/parsenodes.h
--- b/src/include/nodes/parsenodes.h
***************
*** 701,708 **** typedef struct RangeTblEntry
  	/*
  	 * Fields valid for a plain relation RTE (else zero):
  	 */
! 	Oid			relid;			/* OID of the relation */
! 	char		relkind;		/* relation kind (see pg_class.relkind) */
  
  	/*
  	 * Fields valid for a subquery RTE (else NULL):
--- 701,709 ----
  	/*
  	 * Fields valid for a plain relation RTE (else zero):
  	 */
! 	Oid			relid;				/* OID of the relation */
! 	TableSampleInfo *sample_info;	/* Fields for tablesampleinfo */
! 	char		relkind;			/* relation kind (see pg_class.relkind) */
  
  	/*
  	 * Fields valid for a subquery RTE (else NULL):
***************
*** 765,770 **** typedef struct RangeTblEntry
--- 766,772 ----
  	Oid			checkAsUser;	/* if valid, check access as this role */
  	Bitmapset  *selectedCols;	/* columns needing SELECT permission */
  	Bitmapset  *modifiedCols;	/* columns needing INSERT/UPDATE permission */
+ 
  } RangeTblEntry;
  
  /*
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
***************
*** 271,276 **** typedef struct Scan
--- 271,294 ----
  typedef Scan SeqScan;
  
  /* ----------------
+  * SampleScan node
+  *
+  * This is the information about a SampleScan that is fixed for a given Plan.
+  * SampleScanState holds the run-time(executor-time) state associated with a
+  * given ScanScan node.
+  *
+  * In addition to our parent class, we need only a single additional piece of
+  * information: the information contained in the TABLESAMPLE clause that corresponds
+  * to this SampleScan.
+  * ----------------
+  */
+ typedef struct SampleScan
+ {
+ 	Scan				scan;
+ 	TableSampleInfo	   *sample_info;
+ } SampleScan;
+ 
+ /* ----------------
   *		index scan node
   *
   * indexqualorig is an implicitly-ANDed list of index qual expressions, each
*** a/src/include/nodes/primnodes.h
--- b/src/include/nodes/primnodes.h
***************
*** 58,63 **** typedef enum OnCommitAction
--- 58,86 ----
  	ONCOMMIT_DROP				/* ON COMMIT DROP */
  } OnCommitAction;
  
+ /*---------
+  * TableSampleMethod - defines the methods for tablesample query
+  *---------
+  */
+ typedef enum TableSampleMethod
+ {
+ 	SAMPLE_BERNOULLI,
+ 	SAMPLE_SYSTEM
+ } TableSampleMethod;
+ 
+ /*--------
+  * TableSampleInfo - information related to a tablesample query
+  *-------
+  */
+ typedef struct TableSampleInfo
+ {
+ 	NodeTag				type;
+ 	int					sample_percent;
+ 	TableSampleMethod	sample_method;
+ 	bool				is_repeatable;
+ 	int					repeat_seed;
+ } TableSampleInfo;
+ 
  /*
   * RangeVar - range variable, used in FROM clauses
   *
***************
*** 65,70 **** typedef enum OnCommitAction
--- 88,94 ----
   * field is not used, and inhOpt shows whether to apply the operation
   * recursively to child tables.  In some contexts it is also useful to carry
   * a TEMP table indication here.
+  * Also support for TableSample.
   */
  typedef struct RangeVar
  {
***************
*** 77,82 **** typedef struct RangeVar
--- 101,108 ----
  	char		relpersistence; /* see RELPERSISTENCE_* in pg_class.h */
  	Alias	   *alias;			/* table alias & optional column aliases */
  	int			location;		/* token location, or -1 if unknown */
+ 
+ 	TableSampleInfo *sample_info; /* TABLESAMPLE clause, if any */
  } RangeVar;
  
  /*
*** a/src/include/nodes/relation.h
--- b/src/include/nodes/relation.h
***************
*** 98,103 **** typedef struct PlannerGlobal
--- 98,105 ----
  	Index		lastRowMarkId;	/* highest PlanRowMark ID assigned */
  
  	bool		transientPlan;	/* redo plan when TransactionXmin changes? */
+ 
+ 	bool		isCursorStmt;	/* Is this a cursor stmt? Only for Cursor Tablesample Check */
  } PlannerGlobal;
  
  /* macro for fetching the Plan associated with a SubPlan node */
***************
*** 362,367 **** typedef struct PlannerInfo
--- 364,373 ----
   *					EquivalenceClasses)
   *		has_eclass_joins - flag that EquivalenceClass joins are possible
   *
+  * The next field will be set true if there is tablesample clause presense in
+  * the query, otherwise false;
+  *		has_table_sample - flag for tablesample clause
+  *
   * Note: Keeping a restrictinfo list in the RelOptInfo is useful only for
   * base rels, because for a join rel the set of clauses that are treated as
   * restrict clauses varies depending on which sub-relations we choose to join.
***************
*** 435,440 **** typedef struct RelOptInfo
--- 441,449 ----
  	List	   *joininfo;		/* RestrictInfo structures for join clauses
  								 * involving this rel */
  	bool		has_eclass_joins;		/* T means joininfo is incomplete */
+ 
+ 	/* used if there is tablesample clause */
+ 	bool		has_table_sample;
  } RelOptInfo;
  
  /*
***************
*** 703,708 **** typedef struct Path
--- 712,719 ----
  #define PATH_REQ_OUTER(path)  \
  	((path)->param_info ? (path)->param_info->ppi_req_outer : (Relids) NULL)
  
+ 
+ 
  /*----------
   * IndexPath represents an index scan over a single index.
   *
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
***************
*** 51,56 **** extern PGDLLIMPORT double cpu_operator_cost;
--- 51,57 ----
  extern PGDLLIMPORT int effective_cache_size;
  extern Cost disable_cost;
  extern bool enable_seqscan;
+ extern bool enable_samplescan;
  extern bool enable_indexscan;
  extern bool enable_indexonlyscan;
  extern bool enable_bitmapscan;
***************
*** 84,89 **** extern void cost_functionscan(Path *path, PlannerInfo *root,
--- 85,91 ----
  				  RelOptInfo *baserel, ParamPathInfo *param_info);
  extern void cost_valuesscan(Path *path, PlannerInfo *root,
  				RelOptInfo *baserel, ParamPathInfo *param_info);
+ extern void cost_samplescan(Path *path, PlannerInfo *root, RelOptInfo *baserel);
  extern void cost_ctescan(Path *path, PlannerInfo *root, RelOptInfo *baserel);
  extern void cost_recursive_union(Plan *runion, Plan *nrterm, Plan *rterm);
  extern void cost_sort(Path *path, PlannerInfo *root,
*** a/src/include/optimizer/pathnode.h
--- b/src/include/optimizer/pathnode.h
***************
*** 32,37 **** extern bool add_path_precheck(RelOptInfo *parent_rel,
--- 32,38 ----
  
  extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel,
  					Relids required_outer);
+ extern Path *create_samplescan_path(PlannerInfo *root, RelOptInfo *rel);
  extern IndexPath *create_index_path(PlannerInfo *root,
  				  IndexOptInfo *index,
  				  List *indexclauses,
*** a/src/include/parser/kwlist.h
--- b/src/include/parser/kwlist.h
***************
*** 54,59 **** PG_KEYWORD("authorization", AUTHORIZATION, TYPE_FUNC_NAME_KEYWORD)
--- 54,60 ----
  PG_KEYWORD("backward", BACKWARD, UNRESERVED_KEYWORD)
  PG_KEYWORD("before", BEFORE, UNRESERVED_KEYWORD)
  PG_KEYWORD("begin", BEGIN_P, UNRESERVED_KEYWORD)
+ PG_KEYWORD("bernoulli", BERNOULLI, UNRESERVED_KEYWORD)
  PG_KEYWORD("between", BETWEEN, COL_NAME_KEYWORD)
  PG_KEYWORD("bigint", BIGINT, COL_NAME_KEYWORD)
  PG_KEYWORD("binary", BINARY, TYPE_FUNC_NAME_KEYWORD)
***************
*** 358,363 **** PG_KEYWORD("sysid", SYSID, UNRESERVED_KEYWORD)
--- 359,365 ----
  PG_KEYWORD("system", SYSTEM_P, UNRESERVED_KEYWORD)
  PG_KEYWORD("table", TABLE, RESERVED_KEYWORD)
  PG_KEYWORD("tables", TABLES, UNRESERVED_KEYWORD)
+ PG_KEYWORD("tablesample", TABLESAMPLE, TYPE_FUNC_NAME_KEYWORD)
  PG_KEYWORD("tablespace", TABLESPACE, UNRESERVED_KEYWORD)
  PG_KEYWORD("temp", TEMP, UNRESERVED_KEYWORD)
  PG_KEYWORD("template", TEMPLATE, UNRESERVED_KEYWORD)
*** /dev/null
--- b/src/include/utils/sampleutils.h
***************
*** 0 ****
--- 1,49 ----
+ /*-------------------------------------------------------------------------
+  *
+  * sampleutils.h
+  *	  APIs for Tablesample Support, containing Vitter's Algo APIs, analyze.c
+  *	  is also using the APIs.
+  *
+  *
+  * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  * src/include/utils/sampleutils.h
+  *
+  *-------------------------------------------------------------------------
+  */
+ #ifndef SAMPLEUTILS_H
+ #define SAMPLEUTILS_H
+ 
+ #include "storage/block.h"
+ #include "utils/relcache.h"
+ #include "access/htup.h"
+ #include "storage/buf.h"
+ 
+ /* Data structure for Algorithm S from Knuth 3.4.2 */
+ typedef struct
+ {
+ 	BlockNumber N;				/* number of blocks, known in advance */
+ 	int			n;				/* desired sample size */
+ 	BlockNumber t;				/* current block number */
+ 	int			m;				/* blocks selected so far */
+ } BlockSamplerData;
+ 
+ typedef BlockSamplerData *BlockSampler;
+ 
+ 
+ extern void BlockSampler_Init(BlockSampler bs, BlockNumber nblocks,
+ 				  int samplesize);
+ extern bool BlockSampler_HasMore(BlockSampler bs);
+ extern BlockNumber BlockSampler_Next(void *state, BlockSampler bs);
+ extern double anl_init_selection_state(void *state, int n);
+ extern double anl_get_next_S(void *state, double t, int n, double *stateptr);
+ extern double anl_random_fract(void *state);
+ extern int get_rand_in_range(void *state, int a, int b);
+ extern void sample_set_seed(void *random_state, double seed);
+ extern double sample_random(void *random_state);
+ extern int acquire_vitter_rows(Relation onerel, void *state, HeapTuple *rows, int targrows,
+ 							BlockNumber totalblocks, BufferAccessStrategy vac_strategy,
+ 							double *totalrows, double *totaldeadrows, bool isanalyze, int elevel);
+ 
+ #endif /* SAMPLEUTILS_H */
*** /dev/null
--- b/src/test/regress/expected/tablesample.out
***************
*** 0 ****
--- 1,188 ----
+ --
+ -- TABLESAMPLE TEST
+ --
+ CREATE TABLE sample (id INT, name text);
+ INSERT INTO sample VALUES (1, 'victor');
+ INSERT INTO sample VALUES (2, 'peter');
+ INSERT INTO sample VALUES (3, 'james');
+ INSERT INTO sample VALUES (4, 'paul');
+ INSERT INTO sample VALUES (5, 'kate');
+ INSERT INTO sample VALUES (6, 'jane');
+ INSERT INTO sample VALUES (7, 'samual');
+ INSERT INTO sample VALUES (8, 'david');
+ INSERT INTO sample VALUES (9, 'job');
+ INSERT INTO sample VALUES (10, 'neil');
+ INSERT INTO sample VALUES (11, 'victor');
+ INSERT INTO sample VALUES (12, 'peter');
+ INSERT INTO sample VALUES (13, 'james');
+ INSERT INTO sample VALUES (14, 'paul');
+ INSERT INTO sample VALUES (15, 'kate');
+ INSERT INTO sample VALUES (16, 'jane');
+ INSERT INTO sample VALUES (17, 'samual');
+ INSERT INTO sample VALUES (18, 'david');
+ INSERT INTO sample VALUES (19, 'job');
+ INSERT INTO sample VALUES (20, 'neil');
+ ANALYZE sample;
+ SELECT * FROM sample TABLESAMPLE SYSTEM (60) REPEATABLE (10);
+  id |  name  
+ ----+--------
+   1 | victor
+   2 | peter
+   3 | james
+   4 | paul
+   5 | kate
+   6 | jane
+   7 | samual
+   8 | david
+   9 | job
+  10 | neil
+  11 | victor
+  12 | peter
+  13 | james
+  14 | paul
+  15 | kate
+  16 | jane
+  17 | samual
+  18 | david
+  19 | job
+  20 | neil
+ (20 rows)
+ 
+ SELECT * FROM sample TABLESAMPLE BERNOULLI (70) REPEATABLE (11);
+  id |  name  
+ ----+--------
+  18 | david
+   2 | peter
+   3 | james
+  15 | kate
+   5 | kate
+   6 | jane
+   7 | samual
+  20 | neil
+   9 | job
+  10 | neil
+  11 | victor
+  12 | peter
+  16 | jane
+  17 | samual
+ (14 rows)
+ 
+ SELECT * FROM sample TABLESAMPLE BERNOULLI (70) REPEATABLE (11);
+  id |  name  
+ ----+--------
+  18 | david
+   2 | peter
+   3 | james
+  15 | kate
+   5 | kate
+   6 | jane
+   7 | samual
+  20 | neil
+   9 | job
+  10 | neil
+  11 | victor
+  12 | peter
+  16 | jane
+  17 | samual
+ (14 rows)
+ 
+ SELECT * FROM sample TABLESAMPLE BERNOULLI (70) REPEATABLE (19);
+  id |  name  
+ ----+--------
+   1 | victor
+  18 | david
+   3 | james
+  17 | samual
+   5 | kate
+   6 | jane
+   7 | samual
+   8 | david
+  15 | kate
+  19 | job
+  11 | victor
+  12 | peter
+  13 | james
+  14 | paul
+ (14 rows)
+ 
+ SELECT * FROM sample TABLESAMPLE BERNOULLI (7) REPEATABLE (19);
+  id |  name  
+ ----+--------
+   7 | samual
+ (1 row)
+ 
+ SELECT * FROM sample TABLESAMPLE BERNOULLI (1) REPEATABLE (19);
+  id | name 
+ ----+------
+ (0 rows)
+ 
+ SELECT * FROM sample TABLESAMPLE SYSTEM (65) REPEATABLE (20);
+  id |  name  
+ ----+--------
+   1 | victor
+   2 | peter
+   3 | james
+   4 | paul
+   5 | kate
+   6 | jane
+   7 | samual
+   8 | david
+   9 | job
+  10 | neil
+  11 | victor
+  12 | peter
+  13 | james
+  14 | paul
+  15 | kate
+  16 | jane
+  17 | samual
+  18 | david
+  19 | job
+  20 | neil
+ (20 rows)
+ 
+ SELECT * FROM sample TABLESAMPLE SYSTEM (65) REPEATABLE (2);
+  id | name 
+ ----+------
+ (0 rows)
+ 
+ SELECT * FROM sample TABLESAMPLE SYSTEM (65) REPEATABLE (20);
+  id |  name  
+ ----+--------
+   1 | victor
+   2 | peter
+   3 | james
+   4 | paul
+   5 | kate
+   6 | jane
+   7 | samual
+   8 | david
+   9 | job
+  10 | neil
+  11 | victor
+  12 | peter
+  13 | james
+  14 | paul
+  15 | kate
+  16 | jane
+  17 | samual
+  18 | david
+  19 | job
+  20 | neil
+ (20 rows)
+ 
+ SELECT * FROM sample TABLESAMPLE BERNOULLI (70) REPEATABLE (11) order by id limit 5;
+  id |  name  
+ ----+--------
+   2 | peter
+   3 | james
+   5 | kate
+   6 | jane
+   7 | samual
+ (5 rows)
+ 
+ SELECT * INTO tmp FROM sample TABLESAMPLE BERNOULLI (4);
+ DROP TABLE tmp;
+ SELECT * INTO tmp FROM sample TABLESAMPLE SYSTEM (20);
+ DROP TABLE tmp;
+ DROP TABLE sample;
*** a/src/test/regress/parallel_schedule
--- b/src/test/regress/parallel_schedule
***************
*** 59,65 **** test: create_index create_view
  # ----------
  # Another group of parallel tests
  # ----------
! test: create_aggregate create_function_3 create_cast constraints triggers inherit create_table_like typed_table vacuum drop_if_exists
  
  # ----------
  # sanity_check does a vacuum, affecting the sort order of SELECT *
--- 59,65 ----
  # ----------
  # Another group of parallel tests
  # ----------
! test: create_aggregate create_function_3 create_cast constraints triggers inherit create_table_like typed_table vacuum drop_if_exists tablesample
  
  # ----------
  # sanity_check does a vacuum, affecting the sort order of SELECT *
*** a/src/test/regress/serial_schedule
--- b/src/test/regress/serial_schedule
***************
*** 67,72 **** test: create_table_like
--- 67,73 ----
  test: typed_table
  test: vacuum
  test: drop_if_exists
+ test: tablesample
  test: sanity_check
  test: errors
  test: select
*** /dev/null
--- b/src/test/regress/sql/tablesample.sql
***************
*** 0 ****
--- 1,58 ----
+ --
+ -- TABLESAMPLE TEST
+ --
+ 
+ CREATE TABLE sample (id INT, name text);
+ 
+ INSERT INTO sample VALUES (1, 'victor');
+ INSERT INTO sample VALUES (2, 'peter');
+ INSERT INTO sample VALUES (3, 'james');
+ INSERT INTO sample VALUES (4, 'paul');
+ INSERT INTO sample VALUES (5, 'kate');
+ INSERT INTO sample VALUES (6, 'jane');
+ INSERT INTO sample VALUES (7, 'samual');
+ INSERT INTO sample VALUES (8, 'david');
+ INSERT INTO sample VALUES (9, 'job');
+ INSERT INTO sample VALUES (10, 'neil');
+ INSERT INTO sample VALUES (11, 'victor');
+ INSERT INTO sample VALUES (12, 'peter');
+ INSERT INTO sample VALUES (13, 'james');
+ INSERT INTO sample VALUES (14, 'paul');
+ INSERT INTO sample VALUES (15, 'kate');
+ INSERT INTO sample VALUES (16, 'jane');
+ INSERT INTO sample VALUES (17, 'samual');
+ INSERT INTO sample VALUES (18, 'david');
+ INSERT INTO sample VALUES (19, 'job');
+ INSERT INTO sample VALUES (20, 'neil');
+ 
+ ANALYZE sample;
+ 
+ SELECT * FROM sample TABLESAMPLE SYSTEM (60) REPEATABLE (10);
+ 
+ SELECT * FROM sample TABLESAMPLE BERNOULLI (70) REPEATABLE (11);
+ 
+ SELECT * FROM sample TABLESAMPLE BERNOULLI (70) REPEATABLE (11);
+ 
+ SELECT * FROM sample TABLESAMPLE BERNOULLI (70) REPEATABLE (19);
+ 
+ SELECT * FROM sample TABLESAMPLE BERNOULLI (7) REPEATABLE (19);
+ 
+ SELECT * FROM sample TABLESAMPLE BERNOULLI (1) REPEATABLE (19);
+ 
+ SELECT * FROM sample TABLESAMPLE SYSTEM (65) REPEATABLE (20);
+ 
+ SELECT * FROM sample TABLESAMPLE SYSTEM (65) REPEATABLE (2);
+ 
+ SELECT * FROM sample TABLESAMPLE SYSTEM (65) REPEATABLE (20);
+ 
+ SELECT * FROM sample TABLESAMPLE BERNOULLI (70) REPEATABLE (11) order by id limit 5;
+ 
+ SELECT * INTO tmp FROM sample TABLESAMPLE BERNOULLI (4);
+ 
+ DROP TABLE tmp;
+ 
+ SELECT * INTO tmp FROM sample TABLESAMPLE SYSTEM (20);
+ 
+ DROP TABLE tmp;
+ 
+ DROP TABLE sample;

Robert Haas

robertmhaas@gmail.com

over 13 years ago

In reply to: Qi Huang (#1)

Re: [PATCH]Tablesample Submission

On Mon, Aug 20, 2012 at 9:52 PM, Qi Huang <huangqiyx@outlook.com> wrote:

Hi, hackers
I made the final version tablesample patch. It is implementing SYSTEM
and BERNOULLI sample method, which is basically "feature-complete". The
regression test is also included in this patch.
There is an wiki documentation on
https://wiki.postgresql.org/wiki/TABLESAMPLE_Implementation. The detail
about this patch and this project is all included in this documentation.

Please add your patch here:

https://commitfest.postgresql.org/action/commitfest_view/open

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Qi Huang

huangqiyx@outlook.com

over 13 years ago

In reply to: Robert Haas (#2)

Re: [PATCH]Tablesample Submission

Please add your patch here:

https://commitfest.postgresql.org/action/commitfest_view/open

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Hi, Robert
I added it under "Miscellaneous".
https://commitfest.postgresql.org/action/patch_view?id=918

Best RegardsHuang Qi VictorComputer Science of National University of Singapore

Hitoshi Harada

umi.tanuki@gmail.com

over 13 years ago

In reply to: Qi Huang (#3)

Re: [PATCH]Tablesample Submission

On Tue, Aug 21, 2012 at 8:08 AM, Qi Huang <huangqiyx@outlook.com> wrote:

Please add your patch here:

https://commitfest.postgresql.org/action/commitfest_view/open

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Hi, Robert
I added it under "Miscellaneous".
https://commitfest.postgresql.org/action/patch_view?id=918

Patch does not apply cleanly against latest master. outfuncs.c,
allpath.c and cost.h have rejected parts. The make check failed in a
lot of cases up to 26 out of 133. I didn't look into each issue but I
suggest rebasing on the latest master and making sure the regression
test passes.

Some of the patch don't follow our coding standard. Please adjust
brace positions, for example. For the header include list and
Makefile, place a new files in alphabetical order.

The patch doesn't include any documentation. Consider add some doc
patch for such a big feature like this.

You should update kwlist.h for REPEATABLE but I'm not sure if
REPEATABLE should become a reserved keyword yet.

I don't see why you created T_TableSampleInfo. TableSampleInfo looks
fine with a simple struct rather than a Node.

If we want to disable a cursor over a sampling table, we should check
it in the parser not the planner. In the wiki page, one of the TODOs
says about cursor support, but how much difficult is it? How does the
standard say?

s/skiped/skipped/

Don't we need to reset seed on ExecReScanSampleScan? Should we add a
new executor node SampleScan? It seems everything about random
sampling is happening under heapam.

It looks like BERNOULLI allocates heap tuple array beforehand, and
copy all the tuples into it. This doesn't look acceptable when you
are dealing with a large number of rows in the table.

As wiki says, BERNOULLI relies on the statistics of the table, which
doesn't sound good to me. Of course we could say this is our
restriction and say good-bye to users who hadn't run ANALYZE first,
but it is too hard for a normal users to use it. We may need
quick-and-rough count(*) for this.

That is pretty much I have so far. I haven't read all the code nor
the standard, so I might be wrong somewhere.

Thanks,
--
Hitoshi Harada

Alvaro Herrera

alvherre@2ndquadrant.com

about 13 years ago

In reply to: Hitoshi Harada (#4)

Re: [PATCH]Tablesample Submission

Hitoshi Harada escribió:

Patch does not apply cleanly against latest master. outfuncs.c,
allpath.c and cost.h have rejected parts. The make check failed in a
lot of cases up to 26 out of 133. I didn't look into each issue but I
suggest rebasing on the latest master and making sure the regression
test passes.

We've been waiting for a rebase for long enough, so I've marked this
patch as Returned with Feedback (for which we thank Hitoshi Harada).
Since this is said to be useful functionality, please make sure you
update the patch and resubmit to the next commitfest. Thanks.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Qi Huang

huangqiyx@outlook.com

about 13 years ago

In reply to: Qi Huang (#3)

Re: [PATCH]Tablesample Submission

Dear hackers Sorry for not replying the patch review. I didn't see the review until recently as my mail box is full of Postgres mails and I didn't notice the one for me, my mail box configuration problem. I am still kind of busy with my university final year project. I shall not have time to work on updating the patch until this semester finishes which is December. I will work on then. Thanks.
Best RegardsHuang Qi VictorComputer Science of National University of Singapore

From: huangqiyx@outlook.com
To: robertmhaas@gmail.com
CC: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [PATCH]Tablesample Submission
Date: Tue, 21 Aug 2012 23:08:41 +0800

Please add your patch here:

https://commitfest.postgresql.org/action/commitfest_view/open

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Hi, Robert
I added it under "Miscellaneous".
https://commitfest.postgresql.org/action/patch_view?id=918

Best RegardsHuang Qi VictorComputer Science of National University of Singapore

Jaime Casanova

jaime@2ndquadrant.com

about 13 years ago

In reply to: Qi Huang (#6)

Re: [PATCH]Tablesample Submission

On Sun, Nov 4, 2012 at 10:22 PM, Qi Huang <huangqiyx@outlook.com> wrote:

Dear hackers
Sorry for not replying the patch review. I didn't see the review until
recently as my mail box is full of Postgres mails and I didn't notice the
one for me, my mail box configuration problem.
I am still kind of busy with my university final year project. I shall
not have time to work on updating the patch until this semester finishes
which is December. I will work on then.

While we are still in the middle of a commitfest, i'm curious...
should we still wait for an update of this patch?

i know, any update on this should go to the next commitfest but i
wanted to ask before i forget about it

--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566 Cell: +593 987171157

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Josh Berkus

josh@agliodbs.com

almost 13 years ago

In reply to: Qi Huang (#6)

Re: [PATCH]Tablesample Submission

On 11/04/2012 07:22 PM, Qi Huang wrote:

Dear hackers Sorry for not replying the patch review. I didn't see the review until recently as my mail box is full of Postgres mails and I didn't notice the one for me, my mail box configuration problem. I am still kind of busy with my university final year project. I shall not have time to work on updating the patch until this semester finishes which is December. I will work on then. Thanks.
Best RegardsHuang Qi VictorComputer Science of National University of Singapore

Did you ever do the update of the patch?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Simon Riggs

simon@2ndQuadrant.com

almost 13 years ago

In reply to: Josh Berkus (#8)

Re: [PATCH]Tablesample Submission

On 17 January 2013 18:32, Josh Berkus <josh@agliodbs.com> wrote:

On 11/04/2012 07:22 PM, Qi Huang wrote:

Dear hackers Sorry for not replying the patch review. I didn't see the review until recently as my mail box is full of Postgres mails and I didn't notice the one for me, my mail box configuration problem. I am still kind of busy with my university final year project. I shall not have time to work on updating the patch until this semester finishes which is December. I will work on then. Thanks.
Best RegardsHuang Qi VictorComputer Science of National University of Singapore

Did you ever do the update of the patch?

We aren't just waiting for a rebase, we're waiting for Hitoshi's
comments to be addressed.

I would add to them by saying I am very uncomfortable with the idea of
allowing a TABLESAMPLE clause on an UPDATE or a DELETE. If you really
want that you can use a sub-select.

Plus the patch contains zero documentation.

So I can't see this going anywhere for 9.3. I've moved it to CF1 of
9.4 marked Waiting on Author

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10

Josh Berkus

josh@agliodbs.com

almost 13 years ago

In reply to: Simon Riggs (#9)

Re: [PATCH]Tablesample Submission

So I can't see this going anywhere for 9.3. I've moved it to CF1 of
9.4 marked Waiting on Author

Agreed. I wish I'd noticed that it got lost earlier.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11

Jaime Casanova

jaime@2ndquadrant.com

almost 13 years ago

In reply to: Simon Riggs (#9)

Re: [PATCH]Tablesample Submission

On Thu, Jan 17, 2013 at 2:04 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 17 January 2013 18:32, Josh Berkus <josh@agliodbs.com> wrote:

On 11/04/2012 07:22 PM, Qi Huang wrote:

Dear hackers Sorry for not replying the patch review. I didn't see the review until recently as my mail box is full of Postgres mails and I didn't notice the one for me, my mail box configuration problem. I am still kind of busy with my university final year project. I shall not have time to work on updating the patch until this semester finishes which is December. I will work on then. Thanks.
Best RegardsHuang Qi VictorComputer Science of National University of Singapore

Did you ever do the update of the patch?

We aren't just waiting for a rebase, we're waiting for Hitoshi's
comments to be addressed.

I would add to them by saying I am very uncomfortable with the idea of
allowing a TABLESAMPLE clause on an UPDATE or a DELETE. If you really
want that you can use a sub-select.

also, i don't think that the REPEATABLE clause should be included in a
first revision.
and if we ever want to add more sample methods we can't just put
BERNOULLI nor SYSTEM in gram.y, a new catalog is probably needed
there.

--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566 Cell: +593 987171157

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12

Jaime Casanova

jaime@2ndquadrant.com

over 12 years ago

In reply to: Qi Huang (#6)

Re: [PATCH]Tablesample Submission

On Sun, Nov 4, 2012 at 10:22 PM, Qi Huang <huangqiyx@outlook.com> wrote:

Dear hackers
Sorry for not replying the patch review. I didn't see the review until
recently as my mail box is full of Postgres mails and I didn't notice the
one for me, my mail box configuration problem.
I am still kind of busy with my university final year project. I shall
not have time to work on updating the patch until this semester finishes
which is December. I will work on then.

Hi,

Should we expect an updated patch for next commitfest?

--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566 Cell: +593 987171157

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13

Simon Riggs

simon@2ndQuadrant.com

over 12 years ago

In reply to: Jaime Casanova (#12)

Re: [PATCH]Tablesample Submission

On 17 May 2013 21:46, Jaime Casanova <jaime@2ndquadrant.com> wrote:

On Sun, Nov 4, 2012 at 10:22 PM, Qi Huang <huangqiyx@outlook.com> wrote:

Dear hackers
Sorry for not replying the patch review. I didn't see the review until
recently as my mail box is full of Postgres mails and I didn't notice the
one for me, my mail box configuration problem.
I am still kind of busy with my university final year project. I shall
not have time to work on updating the patch until this semester finishes
which is December. I will work on then.

Should we expect an updated patch for next commitfest?

This was added to CF1 of 9.4. The patch is nowhere near committable
and hasn't been worked on at all since last time it was submitted.

It's important that we have an efficient implementation of TABLESAMPLE
in Postgres.

I'm going to remove it from CF, for now, but I'll also take
responsibility for this for 9.4, barring objections.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14

Simon Riggs

simon@2ndQuadrant.com

over 12 years ago

In reply to: Hitoshi Harada (#4)

Re: [PATCH]Tablesample Submission

On 18 September 2012 10:32, Hitoshi Harada <umi.tanuki@gmail.com> wrote:

As wiki says, BERNOULLI relies on the statistics of the table, which
doesn't sound good to me. Of course we could say this is our
restriction and say good-bye to users who hadn't run ANALYZE first,
but it is too hard for a normal users to use it. We may need
quick-and-rough count(*) for this.

For Bernoulli sampling, SQL Standard says "Further, whether a given
row of RT is included in result of TF is independent of whether other
rows of RT are included in result of TF."

Which means BERNOULLI sampling looks essentially identical to using

WHERE random() <= ($percent/100)

So my proposed implementation route for bernoulli sampling is to
literally add an AND-ed qual that does a random() test (and
repeatability also). That looks fairly simple and it is still
accurate, because it doesn't matter whether we do the indpendent test
to include the tuple before or after any other quals. I realise that
isn't a cool and hip approach, but it works and is exactly accurate.
Which would change the patch quite a bit.

Taking the random() approach would mean we don't rely on statistics either.

Thoughts?

SYSTEM sampling uses a completely different approach and is the really
interesting part of this feature.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers