Tuplesort merge pre-reading

Started by Heikki Linnakangas over 9 years ago, 61 messages
#1Heikki Linnakangas
hlinnaka@iki.fi
1 attachment(s)

While reviewing Peter's latest round of sorting patches, and trying to
understand the new "batch allocation" mechanism, I started to wonder how
useful the pre-reading in the merge stage is in the first place.

I'm talking about the code that reads a bunch of tuples from each tape,
loading them into the memtuples array. That code was added by Tom Lane,
back in 1999:

commit cf627ab41ab9f6038a29ddd04dd0ff0ccdca714e
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sat Oct 30 17:27:15 1999 +0000

Further performance improvements in sorting: reduce number of comparisons
during initial run formation by keeping both current run and next-run
tuples in the same heap (yup, Knuth is smarter than I am). And, during
merge passes, make use of available sort memory to load multiple tuples
from any one input 'tape' at a time, thereby improving locality of
access to the temp file.

So apparently there was a benefit back then, but is it still worthwhile?
LogicalTape buffers one block at a time anyway, so how much gain are
we getting from parsing the tuples into SortTuple format in batches?
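
To make the question concrete, here is a minimal sketch (not tuplesort.c
code) of a k-way merge that keeps only one value per input in memory and
reads the next value from a tape only when that tape's current value has
been output. IntTape, HeapEntry, sift_down and merge_tapes are hypothetical
names, plain ints stand in for tuples, and in-memory arrays stand in for
tapes. It also replaces the heap top in place, which is a simplification of
the siftup-then-insert sequence in the real code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a tape holding one sorted run of ints. */
typedef struct
{
	const int  *data;
	size_t		len;
	size_t		pos;
} IntTape;

/* One heap entry: a tape's current value, plus the tape it came from. */
typedef struct
{
	int			value;
	int			tape;
} HeapEntry;

/* Restore the min-heap property downward from index i. */
static void
sift_down(HeapEntry *heap, int n, int i)
{
	for (;;)
	{
		int			smallest = i;
		int			l = 2 * i + 1;
		int			r = 2 * i + 2;
		HeapEntry	tmp;

		if (l < n && heap[l].value < heap[smallest].value)
			smallest = l;
		if (r < n && heap[r].value < heap[smallest].value)
			smallest = r;
		if (smallest == i)
			return;
		tmp = heap[i];
		heap[i] = heap[smallest];
		heap[smallest] = tmp;
		i = smallest;
	}
}

/*
 * Merge ntapes sorted tapes into out[], holding one value per tape in
 * memory.  heap[] must have room for ntapes entries and out[] for all
 * input values.  Returns the number of values written.
 */
static size_t
merge_tapes(IntTape *tapes, int ntapes, HeapEntry *heap, int *out)
{
	int			n = 0;
	size_t		nout = 0;
	int			i;

	assert(ntapes > 0);

	/* Load the first value from each non-empty tape. */
	for (i = 0; i < ntapes; i++)
	{
		if (tapes[i].pos < tapes[i].len)
		{
			heap[n].value = tapes[i].data[tapes[i].pos++];
			heap[n].tape = i;
			n++;
		}
	}
	for (i = n / 2 - 1; i >= 0; i--)
		sift_down(heap, n, i);

	while (n > 0)
	{
		IntTape    *src = &tapes[heap[0].tape];

		out[nout++] = heap[0].value;
		if (src->pos < src->len)
			heap[0].value = src->data[src->pos++];	/* next from same tape */
		else
			heap[0] = heap[--n];	/* tape exhausted: shrink the heap */
		sift_down(heap, n, 0);
	}
	return nout;
}
```

The number of reads is the same as with batch pre-reading; the only
difference is how many parsed values are resident at once.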

I wrote a quick patch to test that, attached. It seems to improve
performance, at least in this small test case:

create table lotsofints(i integer);
insert into lotsofints select random() * 1000000000.0 from
generate_series(1, 10000000);
vacuum freeze;

select count(*) FROM (select * from lotsofints order by i) t;

On my laptop, with default work_mem=4MB, that select takes 7.8 s on
unpatched master, and 6.2 s with the attached patch.

So, at least in some cases, the pre-loading hurts. I think we should get
rid of it. This patch probably needs some polishing: I replaced the
batch allocations with a simpler scheme with a buffer to hold just a
single tuple for each tape, and that might need some more work to allow
downsizing those buffers if you have a few very large tuples in an
otherwise narrow table. And perhaps we should free and reallocate a
smaller memtuples array for the merging, now that we're not making use
of the whole of it. And perhaps we should teach LogicalTape to use
larger buffers, if we can't rely on the OS to do the buffering for us.
But overall, this seems to make the code both simpler and faster.
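
For illustration, the single-tuple-per-tape buffer scheme described above
can be sketched roughly as follows. This is a simplified model under stated
assumptions, not the patch itself: MergeBuffers, merge_buffers_init and
next_tuple_buffer are hypothetical names, the tape count is fixed at
compile time, and malloc/realloc stand in for palloc/repalloc. The idea is
that there is one buffer per tape plus one spare. When the next tuple is
read from a tape, it goes into the spare buffer, and the tape's old buffer
(which still holds the tuple just handed to the caller) becomes the new
spare.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NTAPES 4				/* hypothetical fixed tape count */

typedef struct
{
	char	   *tapebuf[NTAPES];	/* holds each tape's current tuple */
	size_t		tapebufsize[NTAPES];
	char	   *spare;			/* holds the tuple last returned to caller */
	size_t		sparesize;
} MergeBuffers;

/*
 * Allocate all buffers with the same initial size.  Returns 0 on success,
 * -1 on allocation failure (this sketch does not clean up in that case).
 */
static int
merge_buffers_init(MergeBuffers *mb, size_t initsize)
{
	int			i;

	assert(initsize > 0);
	memset(mb, 0, sizeof(*mb));
	for (i = 0; i < NTAPES; i++)
	{
		mb->tapebuf[i] = malloc(initsize);
		if (mb->tapebuf[i] == NULL)
			return -1;
		mb->tapebufsize[i] = initsize;
	}
	mb->spare = malloc(initsize);
	if (mb->spare == NULL)
		return -1;
	mb->sparesize = initsize;
	return 0;
}

/*
 * Return a buffer of at least len bytes for the next tuple read from
 * 'tape'.  The tape's previous buffer is kept intact as the new spare, so
 * the tuple most recently returned from that tape stays valid until the
 * next call.  Returns NULL on allocation failure.
 */
static char *
next_tuple_buffer(MergeBuffers *mb, int tape, size_t len)
{
	char	   *buf;
	size_t		bufsize;

	assert(tape >= 0 && tape < NTAPES);

	/* Enlarge the spare buffer if the new tuple does not fit. */
	if (len > mb->sparesize)
	{
		char	   *bigger = realloc(mb->spare, len);

		if (bigger == NULL)
			return NULL;
		mb->spare = bigger;
		mb->sparesize = len;
	}
	buf = mb->spare;
	bufsize = mb->sparesize;

	/* The tape's old buffer becomes the spare. */
	mb->spare = mb->tapebuf[tape];
	mb->sparesize = mb->tapebufsize[tape];

	mb->tapebuf[tape] = buf;
	mb->tapebufsize[tape] = bufsize;
	return buf;
}
```

Note that buffers only ever grow in this sketch, which is the downsizing
concern mentioned above for a few very large tuples in an otherwise narrow
table.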

Am I missing something?

- Heikki

Attachments:

0001-Don-t-bother-to-pre-read-tuples-into-slots-during-me.patch (application/x-patch)
From ea4ce25a33d0dec370a1b5e45cbc6f794e377a90 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 6 Sep 2016 14:38:54 +0300
Subject: [PATCH 1/1] Don't bother to pre-read tuples into slots during merge.

That only seems to add overhead. We're doing the same number of READTUP()
calls either way, but we're spreading the memory usage over a larger area
if we try to pre-read, so it doesn't seem worth it. Although, we're not
using all the available memory this way. Are we now doing too short reads
from the underlying files? Perhaps we should increase the buffer size in
LogicalTape instead, if that would help?
---
 src/backend/utils/sort/tuplesort.c | 487 ++++++-------------------------------
 1 file changed, 80 insertions(+), 407 deletions(-)

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index c8fbcf8..1fc1b5e 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -358,42 +358,27 @@ struct Tuplesortstate
 	 */
 
 	/*
-	 * These variables are only used during merge passes.  mergeactive[i] is
+	 * This variable is only used during merge passes.  mergeactive[i] is
 	 * true if we are reading an input run from (actual) tape number i and
-	 * have not yet exhausted that run.  mergenext[i] is the memtuples index
-	 * of the next pre-read tuple (next to be loaded into the heap) for tape
-	 * i, or 0 if we are out of pre-read tuples.  mergelast[i] similarly
-	 * points to the last pre-read tuple from each tape.  mergeavailslots[i]
-	 * is the number of unused memtuples[] slots reserved for tape i, and
-	 * mergeavailmem[i] is the amount of unused space allocated for tape i.
-	 * mergefreelist and mergefirstfree keep track of unused locations in the
-	 * memtuples[] array.  The memtuples[].tupindex fields link together
-	 * pre-read tuples for each tape as well as recycled locations in
-	 * mergefreelist. It is OK to use 0 as a null link in these lists, because
-	 * memtuples[0] is part of the merge heap and is never a pre-read tuple.
+	 * have not yet exhausted that run.
 	 */
 	bool	   *mergeactive;	/* active input run source? */
-	int		   *mergenext;		/* first preread tuple for each source */
-	int		   *mergelast;		/* last preread tuple for each source */
-	int		   *mergeavailslots;	/* slots left for prereading each tape */
-	int64	   *mergeavailmem;	/* availMem for prereading each tape */
-	int			mergefreelist;	/* head of freelist of recycled slots */
-	int			mergefirstfree; /* first slot never used in this merge */
 
 	/*
-	 * Per-tape batch state, when final on-the-fly merge consumes memory from
-	 * just a few large allocations.
+	 * Per-tape batch state, when final on-the-fly merge uses pre-allocated
+	 * buffers to hold just the latest tuple, instead of using palloc() for
+	 * each tuple. We have one buffer to hold the next tuple from each tape,
+	 * plus one buffer to hold the tuple we last returned to the caller.
 	 *
 	 * Aside from the general benefits of performing fewer individual retail
 	 * palloc() calls, this also helps make merging more cache efficient,
-	 * since each tape's tuples must naturally be accessed sequentially (in
-	 * sorted order).
+	 * since we reuse the same memory quickly.
 	 */
-	int64		spacePerTape;	/* Space (memory) for tuples (not slots) */
-	char	  **mergetuples;	/* Each tape's memory allocation */
-	char	  **mergecurrent;	/* Current offset into each tape's memory */
-	char	  **mergetail;		/* Last item's start point for each tape */
-	char	  **mergeoverflow;	/* Retail palloc() "overflow" for each tape */
+	char	  **mergetuples;		/* Each tape's memory allocation */
+	int		   *mergetuplesizes;	/* size of each allocation */
+
+	char	   *mergelasttuple;
+	int			mergelasttuplesize;	/* allocated size */
 
 	/*
 	 * Variables for Algorithm D.  Note that destTape is a "logical" tape
@@ -555,14 +540,8 @@ static void mergeruns(Tuplesortstate *state);
 static void mergeonerun(Tuplesortstate *state);
 static void beginmerge(Tuplesortstate *state, bool finalMergeBatch);
 static void batchmemtuples(Tuplesortstate *state);
-static void mergebatch(Tuplesortstate *state, int64 spacePerTape);
-static void mergebatchone(Tuplesortstate *state, int srcTape,
-			  SortTuple *stup, bool *should_free);
-static void mergebatchfreetape(Tuplesortstate *state, int srcTape,
-				   SortTuple *rtup, bool *should_free);
-static void *mergebatchalloc(Tuplesortstate *state, int tapenum, Size tuplen);
-static void mergepreread(Tuplesortstate *state);
-static void mergeprereadone(Tuplesortstate *state, int srcTape);
+static void mergebatch(Tuplesortstate *state);
+static bool mergereadnext(Tuplesortstate *state, int srcTape, SortTuple *stup);
 static void dumptuples(Tuplesortstate *state, bool alltuples);
 static void dumpbatch(Tuplesortstate *state, bool alltuples);
 static void make_bounded_heap(Tuplesortstate *state);
@@ -1976,8 +1955,7 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 			if (state->memtupcount > 0)
 			{
 				int			srcTape = state->memtuples[0].tupindex;
-				int			tupIndex;
-				SortTuple  *newtup;
+				SortTuple	newtup;
 
 				/*
 				 * Returned tuple is still counted in our memory space most of
@@ -1988,42 +1966,15 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 				 */
 				*stup = state->memtuples[0];
 				tuplesort_heap_siftup(state, false);
-				if ((tupIndex = state->mergenext[srcTape]) == 0)
-				{
-					/*
-					 * out of preloaded data on this tape, try to read more
-					 *
-					 * Unlike mergeonerun(), we only preload from the single
-					 * tape that's run dry, though not before preparing its
-					 * batch memory for a new round of sequential consumption.
-					 * See mergepreread() comments.
-					 */
-					if (state->batchUsed)
-						mergebatchone(state, srcTape, stup, should_free);
-
-					mergeprereadone(state, srcTape);
 
-					/*
-					 * if still no data, we've reached end of run on this tape
-					 */
-					if ((tupIndex = state->mergenext[srcTape]) == 0)
-					{
-						/* Free tape's buffer, avoiding dangling pointer */
-						if (state->batchUsed)
-							mergebatchfreetape(state, srcTape, stup, should_free);
-						return true;
-					}
+				/* pull next tuple from tape, insert in heap */
+				if (!mergereadnext(state, srcTape, &newtup))
+				{
+					/* we've reached end of run on this tape */
+					return true;
 				}
-				/* pull next preread tuple from list, insert in heap */
-				newtup = &state->memtuples[tupIndex];
-				state->mergenext[srcTape] = newtup->tupindex;
-				if (state->mergenext[srcTape] == 0)
-					state->mergelast[srcTape] = 0;
-				tuplesort_heap_insert(state, newtup, srcTape, false);
-				/* put the now-unused memtuples entry on the freelist */
-				newtup->tupindex = state->mergefreelist;
-				state->mergefreelist = tupIndex;
-				state->mergeavailslots[srcTape]++;
+
+				tuplesort_heap_insert(state, &newtup, srcTape, false);
 				return true;
 			}
 			return false;
@@ -2350,14 +2301,8 @@ inittapes(Tuplesortstate *state)
 	state->tapeset = LogicalTapeSetCreate(maxTapes);
 
 	state->mergeactive = (bool *) palloc0(maxTapes * sizeof(bool));
-	state->mergenext = (int *) palloc0(maxTapes * sizeof(int));
-	state->mergelast = (int *) palloc0(maxTapes * sizeof(int));
-	state->mergeavailslots = (int *) palloc0(maxTapes * sizeof(int));
-	state->mergeavailmem = (int64 *) palloc0(maxTapes * sizeof(int64));
 	state->mergetuples = (char **) palloc0(maxTapes * sizeof(char *));
-	state->mergecurrent = (char **) palloc0(maxTapes * sizeof(char *));
-	state->mergetail = (char **) palloc0(maxTapes * sizeof(char *));
-	state->mergeoverflow = (char **) palloc0(maxTapes * sizeof(char *));
+	state->mergetuplesizes = (int *) palloc0(maxTapes * sizeof(int));
 	state->tp_fib = (int *) palloc0(maxTapes * sizeof(int));
 	state->tp_runs = (int *) palloc0(maxTapes * sizeof(int));
 	state->tp_dummy = (int *) palloc0(maxTapes * sizeof(int));
@@ -2617,10 +2562,6 @@ mergeonerun(Tuplesortstate *state)
 {
 	int			destTape = state->tp_tapenum[state->tapeRange];
 	int			srcTape;
-	int			tupIndex;
-	SortTuple  *tup;
-	int64		priorAvail,
-				spaceFreed;
 
 	/*
 	 * Start the merge by loading one tuple from each active source tape into
@@ -2635,33 +2576,21 @@ mergeonerun(Tuplesortstate *state)
 	 */
 	while (state->memtupcount > 0)
 	{
+		SortTuple stup;
+
 		/* write the tuple to destTape */
-		priorAvail = state->availMem;
 		srcTape = state->memtuples[0].tupindex;
 		WRITETUP(state, destTape, &state->memtuples[0]);
-		/* writetup adjusted total free space, now fix per-tape space */
-		spaceFreed = state->availMem - priorAvail;
-		state->mergeavailmem[srcTape] += spaceFreed;
 		/* compact the heap */
 		tuplesort_heap_siftup(state, false);
-		if ((tupIndex = state->mergenext[srcTape]) == 0)
+
+		/* pull next tuple from tape, insert in heap */
+		if (!mergereadnext(state, srcTape, &stup))
 		{
-			/* out of preloaded data on this tape, try to read more */
-			mergepreread(state);
-			/* if still no data, we've reached end of run on this tape */
-			if ((tupIndex = state->mergenext[srcTape]) == 0)
-				continue;
+			/* we've reached end of run on this tape */
+			continue;
 		}
-		/* pull next preread tuple from list, insert in heap */
-		tup = &state->memtuples[tupIndex];
-		state->mergenext[srcTape] = tup->tupindex;
-		if (state->mergenext[srcTape] == 0)
-			state->mergelast[srcTape] = 0;
-		tuplesort_heap_insert(state, tup, srcTape, false);
-		/* put the now-unused memtuples entry on the freelist */
-		tup->tupindex = state->mergefreelist;
-		state->mergefreelist = tupIndex;
-		state->mergeavailslots[srcTape]++;
+		tuplesort_heap_insert(state, &stup, srcTape, false);
 	}
 
 	/*
@@ -2704,8 +2633,6 @@ beginmerge(Tuplesortstate *state, bool finalMergeBatch)
 	int			activeTapes;
 	int			tapenum;
 	int			srcTape;
-	int			slotsPerTape;
-	int64		spacePerTape;
 
 	/* Heap should be empty here */
 	Assert(state->memtupcount == 0);
@@ -2729,14 +2656,6 @@ beginmerge(Tuplesortstate *state, bool finalMergeBatch)
 	}
 	state->activeTapes = activeTapes;
 
-	/* Clear merge-pass state variables */
-	memset(state->mergenext, 0,
-		   state->maxTapes * sizeof(*state->mergenext));
-	memset(state->mergelast, 0,
-		   state->maxTapes * sizeof(*state->mergelast));
-	state->mergefreelist = 0;	/* nothing in the freelist */
-	state->mergefirstfree = activeTapes;		/* 1st slot avail for preread */
-
 	if (finalMergeBatch)
 	{
 		/* Free outright buffers for tape never actually allocated */
@@ -2749,22 +2668,7 @@ beginmerge(Tuplesortstate *state, bool finalMergeBatch)
 		batchmemtuples(state);
 	}
 
-	/*
-	 * Initialize space allocation to let each active input tape have an equal
-	 * share of preread space.
-	 */
 	Assert(activeTapes > 0);
-	slotsPerTape = (state->memtupsize - state->mergefirstfree) / activeTapes;
-	Assert(slotsPerTape > 0);
-	spacePerTape = MAXALIGN_DOWN(state->availMem / activeTapes);
-	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
-	{
-		if (state->mergeactive[srcTape])
-		{
-			state->mergeavailslots[srcTape] = slotsPerTape;
-			state->mergeavailmem[srcTape] = spacePerTape;
-		}
-	}
 
 	/*
 	 * Preallocate tuple batch memory for each tape.  This is the memory used
@@ -2773,35 +2677,21 @@ beginmerge(Tuplesortstate *state, bool finalMergeBatch)
 	 * once per sort, just in advance of the final on-the-fly merge step.
 	 */
 	if (finalMergeBatch)
-		mergebatch(state, spacePerTape);
-
-	/*
-	 * Preread as many tuples as possible (and at least one) from each active
-	 * tape
-	 */
-	mergepreread(state);
+		mergebatch(state);
 
 	/* Load the merge heap with the first tuple from each input tape */
 	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
 	{
-		int			tupIndex = state->mergenext[srcTape];
-		SortTuple  *tup;
+		SortTuple	tup;
 
-		if (tupIndex)
+		if (mergereadnext(state, srcTape, &tup))
 		{
-			tup = &state->memtuples[tupIndex];
-			state->mergenext[srcTape] = tup->tupindex;
-			if (state->mergenext[srcTape] == 0)
-				state->mergelast[srcTape] = 0;
-			tuplesort_heap_insert(state, tup, srcTape, false);
-			/* put the now-unused memtuples entry on the freelist */
-			tup->tupindex = state->mergefreelist;
-			state->mergefreelist = tupIndex;
-			state->mergeavailslots[srcTape]++;
+			tuplesort_heap_insert(state, &tup, srcTape, false);
 
 #ifdef TRACE_SORT
 			if (trace_sort && finalMergeBatch)
 			{
+#if 0
 				int64		perTapeKB = (spacePerTape + 1023) / 1024;
 				int64		usedSpaceKB;
 				int			usedSlots;
@@ -2828,6 +2718,7 @@ beginmerge(Tuplesortstate *state, bool finalMergeBatch)
 					 (double) usedSpaceKB / (double) perTapeKB,
 					 usedSlots, slotsPerTape,
 					 (double) usedSlots / (double) slotsPerTape);
+#endif
 			}
 #endif
 		}
@@ -2923,7 +2814,7 @@ batchmemtuples(Tuplesortstate *state)
  * goal.
  */
 static void
-mergebatch(Tuplesortstate *state, int64 spacePerTape)
+mergebatch(Tuplesortstate *state)
 {
 	int			srcTape;
 
@@ -2943,283 +2834,46 @@ mergebatch(Tuplesortstate *state, int64 spacePerTape)
 			continue;
 
 		/* Allocate buffer for each active tape */
-		mergetuples = MemoryContextAllocHuge(state->tuplecontext,
-											 spacePerTape);
+		mergetuples = MemoryContextAlloc(state->tuplecontext, BLCKSZ);
 
 		/* Initialize state for tape */
 		state->mergetuples[srcTape] = mergetuples;
-		state->mergecurrent[srcTape] = mergetuples;
-		state->mergetail[srcTape] = mergetuples;
-		state->mergeoverflow[srcTape] = NULL;
+		state->mergetuplesizes[srcTape] = BLCKSZ;
 	}
 
-	state->batchUsed = true;
-	state->spacePerTape = spacePerTape;
-}
+	/* and one more buffer that's not associated with any tape initially */
+	state->mergelasttuple = MemoryContextAlloc(state->tuplecontext, BLCKSZ);
+	state->mergelasttuplesize = BLCKSZ;
 
-/*
- * mergebatchone - prepare batch memory for one merge input tape
- *
- * This is called following the exhaustion of preread tuples for one input
- * tape.  All that actually occurs is that the state for the source tape is
- * reset to indicate that all memory may be reused.
- *
- * This routine must deal with fixing up the tuple that is about to be returned
- * to the client, due to "overflow" allocations.
- */
-static void
-mergebatchone(Tuplesortstate *state, int srcTape, SortTuple *rtup,
-			  bool *should_free)
-{
-	Assert(state->batchUsed);
-
-	/*
-	 * Tuple about to be returned to caller ("stup") is final preread tuple
-	 * from tape, just removed from the top of the heap.  Special steps around
-	 * memory management must be performed for that tuple, to make sure it
-	 * isn't overwritten early.
-	 */
-	if (!state->mergeoverflow[srcTape])
-	{
-		Size		tupLen;
-
-		/*
-		 * Mark tuple buffer range for reuse, but be careful to move final,
-		 * tail tuple to start of space for next run so that it's available to
-		 * caller when stup is returned, and remains available at least until
-		 * the next tuple is requested.
-		 */
-		tupLen = state->mergecurrent[srcTape] - state->mergetail[srcTape];
-		state->mergecurrent[srcTape] = state->mergetuples[srcTape];
-		MOVETUP(state->mergecurrent[srcTape], state->mergetail[srcTape],
-				tupLen);
-
-		/* Make SortTuple at top of the merge heap point to new tuple */
-		rtup->tuple = (void *) state->mergecurrent[srcTape];
-
-		state->mergetail[srcTape] = state->mergecurrent[srcTape];
-		state->mergecurrent[srcTape] += tupLen;
-	}
-	else
-	{
-		/*
-		 * Handle an "overflow" retail palloc.
-		 *
-		 * This is needed when we run out of tuple memory for the tape.
-		 */
-		state->mergecurrent[srcTape] = state->mergetuples[srcTape];
-		state->mergetail[srcTape] = state->mergetuples[srcTape];
-
-		if (rtup->tuple)
-		{
-			Assert(rtup->tuple == (void *) state->mergeoverflow[srcTape]);
-			/* Caller should free palloc'd tuple */
-			*should_free = true;
-		}
-		state->mergeoverflow[srcTape] = NULL;
-	}
-}
-
-/*
- * mergebatchfreetape - handle final clean-up for batch memory once tape is
- * about to become exhausted
- *
- * All tuples are returned from tape, but a single final tuple, *rtup, is to be
- * passed back to caller.  Free tape's batch allocation buffer while ensuring
- * that the final tuple is managed appropriately.
- */
-static void
-mergebatchfreetape(Tuplesortstate *state, int srcTape, SortTuple *rtup,
-				   bool *should_free)
-{
-	Assert(state->batchUsed);
-	Assert(state->status == TSS_FINALMERGE);
-
-	/*
-	 * Tuple may or may not already be an overflow allocation from
-	 * mergebatchone()
-	 */
-	if (!*should_free && rtup->tuple)
-	{
-		/*
-		 * Final tuple still in tape's batch allocation.
-		 *
-		 * Return palloc()'d copy to caller, and have it freed in a similar
-		 * manner to overflow allocation.  Otherwise, we'd free batch memory
-		 * and pass back a pointer to garbage.  Note that we deliberately
-		 * allocate this in the parent tuplesort context, to be on the safe
-		 * side.
-		 */
-		Size		tuplen;
-		void	   *oldTuple = rtup->tuple;
-
-		tuplen = state->mergecurrent[srcTape] - state->mergetail[srcTape];
-		rtup->tuple = MemoryContextAlloc(state->sortcontext, tuplen);
-		MOVETUP(rtup->tuple, oldTuple, tuplen);
-		*should_free = true;
-	}
-
-	/* Free spacePerTape-sized buffer */
-	pfree(state->mergetuples[srcTape]);
-}
-
-/*
- * mergebatchalloc - allocate memory for one tuple using a batch memory
- * "logical allocation".
- *
- * This is used for the final on-the-fly merge phase only.  READTUP() routines
- * receive memory from here in place of palloc() and USEMEM() calls.
- *
- * Tuple tapenum is passed, ensuring each tape's tuples are stored in sorted,
- * contiguous order (while allowing safe reuse of memory made available to
- * each tape).  This maximizes locality of access as tuples are returned by
- * final merge.
- *
- * Caller must not subsequently attempt to free memory returned here.  In
- * general, only mergebatch* functions know about how memory returned from
- * here should be freed, and this function's caller must ensure that batch
- * memory management code will definitely have the opportunity to do the right
- * thing during the final on-the-fly merge.
- */
-static void *
-mergebatchalloc(Tuplesortstate *state, int tapenum, Size tuplen)
-{
-	Size		reserve_tuplen = MAXALIGN(tuplen);
-	char	   *ret;
-
-	/* Should overflow at most once before mergebatchone() call: */
-	Assert(state->mergeoverflow[tapenum] == NULL);
-	Assert(state->batchUsed);
-
-	/* It should be possible to use precisely spacePerTape memory at once */
-	if (state->mergecurrent[tapenum] + reserve_tuplen <=
-		state->mergetuples[tapenum] + state->spacePerTape)
-	{
-		/*
-		 * Usual case -- caller is returned pointer into its tape's buffer,
-		 * and an offset from that point is recorded as where tape has
-		 * consumed up to for current round of preloading.
-		 */
-		ret = state->mergetail[tapenum] = state->mergecurrent[tapenum];
-		state->mergecurrent[tapenum] += reserve_tuplen;
-	}
-	else
-	{
-		/*
-		 * Allocate memory, and record as tape's overflow allocation.  This
-		 * will be detected quickly, in a similar fashion to a LACKMEM()
-		 * condition, and should not happen again before a new round of
-		 * preloading for caller's tape.  Note that we deliberately allocate
-		 * this in the parent tuplesort context, to be on the safe side.
-		 *
-		 * Sometimes, this does not happen because merging runs out of slots
-		 * before running out of memory.
-		 */
-		ret = state->mergeoverflow[tapenum] =
-			MemoryContextAlloc(state->sortcontext, tuplen);
-	}
-
-	return ret;
+	state->batchUsed = true;
 }
 
 /*
- * mergepreread - load tuples from merge input tapes
+ * mergereadnext - load tuple from one merge input tape
  *
- * This routine exists to improve sequentiality of reads during a merge pass,
- * as explained in the header comments of this file.  Load tuples from each
- * active source tape until the tape's run is exhausted or it has used up
- * its fair share of available memory.  In any case, we guarantee that there
- * is at least one preread tuple available from each unexhausted input tape.
- *
- * We invoke this routine at the start of a merge pass for initial load,
- * and then whenever any tape's preread data runs out.  Note that we load
- * as much data as possible from all tapes, not just the one that ran out.
- * This is because logtape.c works best with a usage pattern that alternates
- * between reading a lot of data and writing a lot of data, so whenever we
- * are forced to read, we should fill working memory completely.
- *
- * In FINALMERGE state, we *don't* use this routine, but instead just preread
- * from the single tape that ran dry.  There's no read/write alternation in
- * that state and so no point in scanning through all the tapes to fix one.
- * (Moreover, there may be quite a lot of inactive tapes in that state, since
- * we might have had many fewer runs than tapes.  In a regular tape-to-tape
- * merge we can expect most of the tapes to be active.  Plus, only
- * FINALMERGE state has to consider memory management for a batch
- * allocation.)
- */
-static void
-mergepreread(Tuplesortstate *state)
-{
-	int			srcTape;
-
-	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
-		mergeprereadone(state, srcTape);
-}
-
-/*
- * mergeprereadone - load tuples from one merge input tape
+ * Returns false on EOF.
  *
  * Read tuples from the specified tape until it has used up its free memory
  * or array slots; but ensure that we have at least one tuple, if any are
  * to be had.
  */
-static void
-mergeprereadone(Tuplesortstate *state, int srcTape)
+static bool
+mergereadnext(Tuplesortstate *state, int srcTape, SortTuple *stup)
 {
 	unsigned int tuplen;
-	SortTuple	stup;
-	int			tupIndex;
-	int64		priorAvail,
-				spaceUsed;
 
 	if (!state->mergeactive[srcTape])
-		return;					/* tape's run is already exhausted */
-
-	/*
-	 * Manage per-tape availMem.  Only actually matters when batch memory not
-	 * in use.
-	 */
-	priorAvail = state->availMem;
-	state->availMem = state->mergeavailmem[srcTape];
+		return false;					/* tape's run is already exhausted */
 
-	/*
-	 * When batch memory is used if final on-the-fly merge, only mergeoverflow
-	 * test is relevant; otherwise, only LACKMEM() test is relevant.
-	 */
-	while ((state->mergeavailslots[srcTape] > 0 &&
-			state->mergeoverflow[srcTape] == NULL && !LACKMEM(state)) ||
-		   state->mergenext[srcTape] == 0)
+	/* read next tuple, if any */
+	if ((tuplen = getlen(state, srcTape, true)) == 0)
 	{
-		/* read next tuple, if any */
-		if ((tuplen = getlen(state, srcTape, true)) == 0)
-		{
-			state->mergeactive[srcTape] = false;
-			break;
-		}
-		READTUP(state, &stup, srcTape, tuplen);
-		/* find a free slot in memtuples[] for it */
-		tupIndex = state->mergefreelist;
-		if (tupIndex)
-			state->mergefreelist = state->memtuples[tupIndex].tupindex;
-		else
-		{
-			tupIndex = state->mergefirstfree++;
-			Assert(tupIndex < state->memtupsize);
-		}
-		state->mergeavailslots[srcTape]--;
-		/* store tuple, append to list for its tape */
-		stup.tupindex = 0;
-		state->memtuples[tupIndex] = stup;
-		if (state->mergelast[srcTape])
-			state->memtuples[state->mergelast[srcTape]].tupindex = tupIndex;
-		else
-			state->mergenext[srcTape] = tupIndex;
-		state->mergelast[srcTape] = tupIndex;
+		state->mergeactive[srcTape] = false;
+		return false;
 	}
-	/* update per-tape and global availmem counts */
-	spaceUsed = state->mergeavailmem[srcTape] - state->availMem;
-	state->mergeavailmem[srcTape] = state->availMem;
-	state->availMem = priorAvail - spaceUsed;
+	READTUP(state, stup, srcTape, tuplen);
+
+	return true;
 }
 
 /*
@@ -3861,14 +3515,33 @@ readtup_alloc(Tuplesortstate *state, int tapenum, Size tuplen)
 {
 	if (state->batchUsed)
 	{
+		char	   *buf;
+		int			bufsize;
+
 		/*
+		 * Recycle the buffer that held the previous tuple returned from
+		 * the sort. Enlarge it if it's not large enough to hold the new
+		 * tuple.
+		 *
 		 * No USEMEM() call, because during final on-the-fly merge accounting
-		 * is based on tape-private state. ("Overflow" allocations are
-		 * detected as an indication that a new round or preloading is
-		 * required. Preloading marks existing contents of tape's batch buffer
-		 * for reuse.)
+		 * is based on tape-private state.
 		 */
-		return mergebatchalloc(state, tapenum, tuplen);
+		if (tuplen > state->mergelasttuplesize)
+		{
+			state->mergelasttuple = repalloc(state->mergelasttuple, tuplen);
+			state->mergelasttuplesize = tuplen;
+		}
+		buf = state->mergelasttuple;
+		bufsize = state->mergelasttuplesize;
+
+		/* we will return the previous tuple from this tape next. */
+		state->mergelasttuple = state->mergetuples[tapenum];
+		state->mergelasttuplesize = state->mergetuplesizes[tapenum];
+
+		state->mergetuples[tapenum] = buf;
+		state->mergetuplesizes[tapenum] = bufsize;
+
+		return buf;
 	}
 	else
 	{
-- 
2.9.3

#2Peter Geoghegan
pg@heroku.com
In reply to: Heikki Linnakangas (#1)
Re: Tuplesort merge pre-reading

On Tue, Sep 6, 2016 at 5:20 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

> I wrote a quick patch to test that, attached. It seems to improve
> performance, at least in this small test case:
>
> create table lotsofints(i integer);
> insert into lotsofints select random() * 1000000000.0 from
> generate_series(1, 10000000);
> vacuum freeze;
>
> select count(*) FROM (select * from lotsofints order by i) t;
>
> On my laptop, with default work_mem=4MB, that select takes 7.8 s on
> unpatched master, and 6.2 s with the attached patch.

The benefits have a lot to do with OS read-ahead, and efficient use of
memory bandwidth during the merge, where we want to access the caller
tuples sequentially per tape (i.e. that's what the batch memory stuff
added -- it also made much better use of available memory). Note that
I've been benchmarking the parallel CREATE INDEX patch on a server
with many HDDs, since sequential performance is mostly all that
matters. I think that in 1999, the preloading had a lot more to do
with logtape.c's ability to aggressively recycle blocks during merges,
such that the total storage overhead does not exceed the original size
of the caller tuples (plus what it calls "trivial bookkeeping
overhead" IIRC). That's less important these days, but still matters
some (it's more of an issue when you can't complete the sort in one
pass, which is rare these days).
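
As a rough illustration of the per-tape batch memory idea (tuples from one
tape laid out sequentially in that tape's own buffer), here is a minimal
bump-allocator sketch. It is not the tuplesort.c implementation: TapeArena,
arena_alloc and arena_reset are hypothetical names, and the 8-byte rounding
is an assumed stand-in for MAXALIGN.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-tape buffer: tuples are carved out of it in order. */
typedef struct
{
	char	   *base;			/* start of this tape's buffer */
	size_t		size;			/* total bytes in the buffer */
	size_t		used;			/* bytes handed out so far */
} TapeArena;

/*
 * Hand out the next len bytes (rounded up to 8) from the tape's buffer.
 * Returns NULL when the buffer is full, which the caller would treat as
 * the signal to stop preloading from this tape.
 */
static void *
arena_alloc(TapeArena *a, size_t len)
{
	size_t		aligned = (len + 7) & ~(size_t) 7;
	void	   *ret;

	assert(a->used <= a->size);
	if (aligned > a->size - a->used)
		return NULL;
	ret = a->base + a->used;
	a->used += aligned;
	return ret;
}

/* Mark the whole buffer reusable before the next round of preloading. */
static void
arena_reset(TapeArena *a)
{
	a->used = 0;
}
```

Because consecutive tuples from a tape sit next to each other in memory,
consuming them in sorted order walks the buffer sequentially.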

Offhand, I would think that taken together this is very important. I'd
certainly want to see cases in the hundreds of megabytes or gigabytes
of work_mem alongside your 4MB case, even just to be able to talk
informally about this. As you know, the default work_mem value is very
conservative.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Peter Geoghegan
pg@heroku.com
In reply to: Peter Geoghegan (#2)
Re: Tuplesort merge pre-reading

On Tue, Sep 6, 2016 at 12:08 PM, Peter Geoghegan <pg@heroku.com> wrote:

> Offhand, I would think that taken together this is very important. I'd
> certainly want to see cases in the hundreds of megabytes or gigabytes
> of work_mem alongside your 4MB case, even just to be able to talk
> informally about this. As you know, the default work_mem value is very
> conservative.

It looks like your benchmark relies on multiple passes, which can be
misleading. I bet it suffers some amount of problems from palloc()
fragmentation. When very memory constrained, that can get really bad.

Non-final merge passes (merges with more than one run -- real or dummy
-- on any given tape) can have uneven numbers of runs on each tape.
So, tuplesort.c needs to be prepared to dole out memory among tapes
*unevenly* there (same applies to memtuples "slots"). This is why
batch memory support is so hard for those cases (the fact that they're
so rare anyway also puts me off it). As you know, I wrote a patch that
adds batch memory support to cases that require randomAccess (final
output on a materialized tape), for their final merge. These final
merges happen to not be a final on-the-fly merge only due to this
randomAccess requirement from caller. It's possible to support these
cases in the future, with that patch, only because I am safe to assume
that each run/tape is the same size there (well, the assumption is
exactly as safe as it was for the 9.6 final on-the-fly merge, at
least).

My point about non-final merges is that you have to be very careful
that you're comparing apples to apples, memory accounting wise, when
looking into something like this. I'm not saying that you didn't, but
it's worth considering.

FWIW, I did try an increase in the buffer size in LogicalTape at one
time several months back, and saw no benefit there (at least, with no
other changes).

--
Peter Geoghegan


#4Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Peter Geoghegan (#3)
5 attachment(s)
Re: Tuplesort merge pre-reading

On 09/06/2016 10:26 PM, Peter Geoghegan wrote:

> On Tue, Sep 6, 2016 at 12:08 PM, Peter Geoghegan <pg@heroku.com> wrote:
>
>> Offhand, I would think that taken together this is very important. I'd
>> certainly want to see cases in the hundreds of megabytes or gigabytes
>> of work_mem alongside your 4MB case, even just to be able to talk
>> informally about this. As you know, the default work_mem value is very
>> conservative.

I spent some more time polishing this up, and also added some code to
logtape.c to use larger read buffers, to compensate for the fact that
we don't do pre-reading from tuplesort.c anymore. That should trigger
the OS read-ahead and make the I/O more sequential, which was the
purpose of the old pre-reading code, but in a simpler way. I haven't
tested that part much yet, but I plan to run some tests on larger data
sets that don't fit in RAM, to make the I/O effects visible.

I wrote a little testing toolkit, see third patch. I'm not proposing to
commit that, but that's what I used for testing. It creates four tables,
about 1GB in size each (it also creates smaller and larger tables, but I
used the "medium" sized ones for these tests). Two of the tables contain
integers, and two contain text strings. Two of the tables are completely
ordered, two are in random order. To measure, it runs ORDER BY queries
on the tables, with different work_mem settings.

Attached are the full results. In summary, these patches improve
performance in some of the tests, and are a wash on others. The patches
help in particular in the randomly ordered cases, with up to 512 MB of
work_mem.

For example, with work_mem=256MB, which is enough to get a single merge
pass:

with patches:

ordered_ints: 7078 ms, 6910 ms, 6849 ms
random_ints: 15639 ms, 15575 ms, 15625 ms
ordered_text: 11121 ms, 12318 ms, 10824 ms
random_text: 53462 ms, 53420 ms, 52949 ms

unpatched master:

ordered_ints: 6961 ms, 7040 ms, 7044 ms
random_ints: 19205 ms, 18797 ms, 18955 ms
ordered_text: 11045 ms, 11377 ms, 11203 ms
random_text: 57117 ms, 54489 ms, 54806 ms

(Each query was run three times in a row; the three numbers on each row
are those runs, so the differences between numbers on the same row are
noise.)
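To make the comparison above easier to reproduce, here is a minimal sketch of the arithmetic: average the three runs per test and compute the relative change against unpatched master. The helper names mean_ms and relative_change are made up for this illustration; they are not part of the testing toolkit.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n timings, in milliseconds. */
double
mean_ms(const double *runs, size_t n)
{
	double		sum = 0.0;
	size_t		i;

	assert(n > 0);
	for (i = 0; i < n; i++)
		sum += runs[i];
	return sum / (double) n;
}

/*
 * Relative change of the patched mean against the unpatched mean.
 * A negative result means the patched build was faster.
 */
double
relative_change(const double *patched, const double *master, size_t n)
{
	double		m = mean_ms(master, n);

	return (mean_ms(patched, n) - m) / m;
}
```

Fed the random_ints rows above (15639, 15575, 15625 ms patched; 19205, 18797, 18955 ms unpatched), this gives means of 15613 ms and about 18986 ms, i.e. roughly 18% less time with the patches. The same calculation for ordered_ints comes out at about 1%, which is in the same range as the run-to-run spread.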

It looks like your benchmark relies on multiple passes, which can be
misleading. I bet it suffers some amount of problems from palloc()
fragmentation. When very memory constrained, that can get really bad.

Non-final merge passes (merges with more than one run -- real or dummy
-- on any given tape) can have uneven numbers of runs on each tape.
So, tuplesort.c needs to be prepared to dole out memory among tapes
*unevenly* there (same applies to memtuples "slots"). This is why
batch memory support is so hard for those cases (the fact that they're
so rare anyway also puts me off it). As you know, I wrote a patch that
adds batch memory support to cases that require randomAccess (final
output on a materialized tape), for their final merge. These final
merges happen to not be a final on-the-fly merge only due to this
randomAccess requirement from caller. It's possible to support these
cases in the future, with that patch, only because I am safe to assume
that each run/tape is the same size there (well, the assumption is
exactly as safe as it was for the 9.6 final on-the-fly merge, at
least).

My point about non-final merges is that you have to be very careful
that you're comparing apples to apples, memory accounting wise, when
looking into something like this. I'm not saying that you didn't, but
it's worth considering.

I'm not 100% sure I'm accounting for all the memory correctly. But I
didn't touch the way the initial quicksort works, nor the way the runs
are built. And the merge passes don't actually need or benefit from a
lot of memory, so I doubt it's very sensitive to that.

In this patch, the memory available for the read buffers is just divided
evenly across maxTapes. The buffers for the tapes that are not currently
active are wasted. It could be made smarter, by freeing the buffers of
tapes that are not active at the moment. I might do that later, but this
is what I'm going to benchmark for now. I don't think adding buffers is
helpful beyond a certain point, so this is probably good enough in
practice. It would still be nice to free the memory we don't need
earlier, in case there are other processes that could make use of it.
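A minimal sketch of the two policies discussed above: an even split across all maxTapes, and a possible refinement that splits only among active tapes. The function names, the BLOCK_SIZE constant, and the rounding to whole blocks with a one-block floor are assumptions made for illustration; this is not the code in the attached logtape.c patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed block size; PostgreSQL's default BLCKSZ is 8192 bytes. */
#define BLOCK_SIZE 8192

/*
 * Even split: every tape gets the same share, whether or not it is
 * active in the current merge pass.  The share is rounded down to a
 * whole number of blocks, but is never less than one block.
 */
size_t
buffer_per_tape_even(size_t avail_mem, int ntapes)
{
	size_t		share;

	assert(ntapes > 0);
	share = (avail_mem / (size_t) ntapes) / BLOCK_SIZE * BLOCK_SIZE;
	return share < BLOCK_SIZE ? BLOCK_SIZE : share;
}

/*
 * Hypothetical refinement: split only among tapes marked active, so
 * that inactive tapes do not tie up buffer memory.  Returns 0 when no
 * tape is active.
 */
size_t
buffer_per_tape_active(size_t avail_mem, const bool *mergeactive, int ntapes)
{
	int			active = 0;
	int			i;

	for (i = 0; i < ntapes; i++)
	{
		if (mergeactive[i])
			active++;
	}
	if (active == 0)
		return 0;
	return buffer_per_tape_even(avail_mem, active);
}
```

With a budget of 64 blocks and 8 tapes, the even split gives each tape 8 blocks; if only 2 of those tapes are active, the refinement would give each of them 32 blocks instead.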

FWIW, I did try an increase in the buffer size in LogicalTape at one
time several months back, and saw no benefit there (at least, with no
other changes).

Yeah, unless you get rid of the pre-reading in tuplesort.c, you're just
double-buffering.

- Heikki

Attachments:

0001-Don-t-bother-to-pre-read-tuples-into-SortTuple-slots.patch (text/x-diff)
From d4d89c88c5e26be70c976a756e874af65ad6ec55 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Thu, 8 Sep 2016 14:31:31 +0300
Subject: [PATCH 1/3] Don't bother to pre-read tuples into SortTuple slots
 during merge.

That only seems to add overhead. We're doing the same number of READTUP()
calls either way, but we're spreading the memory usage over a larger area
if we try to pre-read, so it doesn't seem worth it.

The pre-reading can be helpful, to trigger the OS readahead of the
underlying tape, because it will make the read pattern appear more
sequential. But we'll fix that in the next patch, by teaching logtape.c to
read in larger chunks.
---
 src/backend/utils/sort/tuplesort.c | 889 ++++++++++---------------------------
 1 file changed, 223 insertions(+), 666 deletions(-)

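For readers skimming the diff, the following standalone sketch shows the idea behind the MergeTupleBuffer scheme it introduces: a single allocation cut into fixed-size chunks, a free list threaded through the unused chunks, and a pointer-range check to tell arena chunks from individually allocated overflow tuples. It uses malloc/free in place of palloc/pfree, and the names Arena, arena_init, arena_alloc, arena_owns and arena_release are hypothetical. Falling back to malloc when the free list is empty is also an assumption of this sketch, not something taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CHUNK_SIZE 1024			/* mirrors MERGETUPLEBUFFER_SIZE */

typedef union Chunk
{
	union Chunk *nextfree;		/* valid only while on the free list */
	char		buffer[CHUNK_SIZE];	/* tuple storage while in use */
} Chunk;

typedef struct Arena
{
	Chunk	   *begin;			/* first chunk */
	Chunk	   *end;			/* one past the last chunk */
	Chunk	   *freehead;		/* head of the free list */
} Arena;

/* Allocate nchunks chunks and link them all into the free list. */
int
arena_init(Arena *a, size_t nchunks)
{
	size_t		i;

	assert(nchunks > 0);
	a->begin = malloc(nchunks * sizeof(Chunk));
	if (a->begin == NULL)
		return -1;
	a->end = a->begin + nchunks;
	for (i = 0; i + 1 < nchunks; i++)
		a->begin[i].nextfree = &a->begin[i + 1];
	a->begin[nchunks - 1].nextfree = NULL;
	a->freehead = a->begin;
	return 0;
}

/* Does p point into the arena?  Compared as integers to stay portable. */
int
arena_owns(const Arena *a, const void *p)
{
	return (uintptr_t) p >= (uintptr_t) a->begin &&
		(uintptr_t) p < (uintptr_t) a->end;
}

/* Pop a chunk if the request fits and one is free; otherwise malloc. */
void *
arena_alloc(Arena *a, size_t len)
{
	Chunk	   *c;

	if (len > CHUNK_SIZE || a->freehead == NULL)
		return malloc(len);
	c = a->freehead;
	a->freehead = c->nextfree;
	return c;
}

/* Push an arena chunk back on the free list, or free an overflow block. */
void
arena_release(Arena *a, void *p)
{
	if (arena_owns(a, p))
	{
		Chunk	   *c = p;

		c->nextfree = a->freehead;
		a->freehead = c;
	}
	else
		free(p);
}
```

Because releasing pushes onto the head of the list, the chunk that was just written out is the first one reused for the next read, which keeps the working set small.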
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index c8fbcf8..b9fb99c 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -162,7 +162,7 @@ bool		optimize_bounded_sort = true;
  * The objects we actually sort are SortTuple structs.  These contain
  * a pointer to the tuple proper (might be a MinimalTuple or IndexTuple),
  * which is a separate palloc chunk --- we assume it is just one chunk and
- * can be freed by a simple pfree() (except during final on-the-fly merge,
+ * can be freed by a simple pfree() (except during merge,
  * when memory is used in batch).  SortTuples also contain the tuple's
  * first key column in Datum/nullflag format, and an index integer.
  *
@@ -203,6 +203,20 @@ typedef struct
 	int			tupindex;		/* see notes above */
 } SortTuple;
 
+/*
+ * During merge, we use a pre-allocated set of fixed-size buffers to store
+ * tuples in, to avoid palloc/pfree overhead.
+ *
+ * 'nextfree' is valid when this chunk is in the free list. When in use, the
+ * buffer holds a tuple.
+ */
+#define MERGETUPLEBUFFER_SIZE 1024
+
+typedef union MergeTupleBuffer
+{
+	union MergeTupleBuffer *nextfree;
+	char		buffer[MERGETUPLEBUFFER_SIZE];
+} MergeTupleBuffer;
 
 /*
  * Possible states of a Tuplesort object.  These denote the states that
@@ -307,14 +321,6 @@ struct Tuplesortstate
 										int tapenum, unsigned int len);
 
 	/*
-	 * Function to move a caller tuple.  This is usually implemented as a
-	 * memmove() shim, but function may also perform additional fix-up of
-	 * caller tuple where needed.  Batch memory support requires the movement
-	 * of caller tuples from one location in memory to another.
-	 */
-	void		(*movetup) (void *dest, void *src, unsigned int len);
-
-	/*
 	 * This array holds the tuples now in sort memory.  If we are in state
 	 * INITIAL, the tuples are in no particular order; if we are in state
 	 * SORTEDINMEM, the tuples are in final sorted order; in states BUILDRUNS
@@ -332,12 +338,40 @@ struct Tuplesortstate
 	/*
 	 * Memory for tuples is sometimes allocated in batch, rather than
 	 * incrementally.  This implies that incremental memory accounting has
-	 * been abandoned.  Currently, this only happens for the final on-the-fly
-	 * merge step.  Large batch allocations can store tuples (e.g.
-	 * IndexTuples) without palloc() fragmentation and other overhead.
+	 * been abandoned.  Currently, this happens when we start merging.
+	 * Large batch allocations can store tuples (e.g. IndexTuples) without
+	 * palloc() fragmentation and other overhead.
+	 *
+	 * For the batch memory, we use one large allocation, divided into
+	 * MERGETUPLEBUFFER_SIZE chunks. The allocation is sized to hold
+	 * one chunk per tape, plus one additional chunk. We need that many
+	 * chunks to hold all the tuples kept in the heap during merge, plus
+	 * the one we have last returned from the sort.
+	 *
+	 * Initially, all the chunks are kept in a linked list, in freeBufferHead.
+	 * When a tuple is read from a tape, it is put to the next available
+	 * chunk, if it fits. If the tuple is larger than MERGETUPLEBUFFER_SIZE,
+	 * it is palloc'd instead.
+	 *
+	 * When we're done processing a tuple, we return the chunk back to the
+	 * free list, or pfree() if it was palloc'd. We know that a tuple was
+	 * allocated from the batch memory arena, if its pointer value is between
+	 * batchMemoryBegin and batchMemoryEnd.
 	 */
 	bool		batchUsed;
 
+	char	   *batchMemoryBegin;	/* beginning of batch memory arena */
+	char	   *batchMemoryEnd;		/* end of batch memory arena */
+	MergeTupleBuffer *freeBufferHead;	/* head of free list */
+
+	/*
+	 * When we return a tuple to the caller that came from a tape (that is,
+	 * in TSS_SORTEDONTAPE or TSS_FINALMERGE modes), we remember the tuple
+	 * in 'readlasttuple', so that we can recycle the memory on next
+	 * gettuple call.
+	 */
+	void	   *readlasttuple;
+
 	/*
 	 * While building initial runs, this indicates if the replacement
 	 * selection strategy is in use.  When it isn't, then a simple hybrid
@@ -358,42 +392,11 @@ struct Tuplesortstate
 	 */
 
 	/*
-	 * These variables are only used during merge passes.  mergeactive[i] is
+	 * This variable is only used during merge passes.  mergeactive[i] is
 	 * true if we are reading an input run from (actual) tape number i and
-	 * have not yet exhausted that run.  mergenext[i] is the memtuples index
-	 * of the next pre-read tuple (next to be loaded into the heap) for tape
-	 * i, or 0 if we are out of pre-read tuples.  mergelast[i] similarly
-	 * points to the last pre-read tuple from each tape.  mergeavailslots[i]
-	 * is the number of unused memtuples[] slots reserved for tape i, and
-	 * mergeavailmem[i] is the amount of unused space allocated for tape i.
-	 * mergefreelist and mergefirstfree keep track of unused locations in the
-	 * memtuples[] array.  The memtuples[].tupindex fields link together
-	 * pre-read tuples for each tape as well as recycled locations in
-	 * mergefreelist. It is OK to use 0 as a null link in these lists, because
-	 * memtuples[0] is part of the merge heap and is never a pre-read tuple.
+	 * have not yet exhausted that run.
 	 */
 	bool	   *mergeactive;	/* active input run source? */
-	int		   *mergenext;		/* first preread tuple for each source */
-	int		   *mergelast;		/* last preread tuple for each source */
-	int		   *mergeavailslots;	/* slots left for prereading each tape */
-	int64	   *mergeavailmem;	/* availMem for prereading each tape */
-	int			mergefreelist;	/* head of freelist of recycled slots */
-	int			mergefirstfree; /* first slot never used in this merge */
-
-	/*
-	 * Per-tape batch state, when final on-the-fly merge consumes memory from
-	 * just a few large allocations.
-	 *
-	 * Aside from the general benefits of performing fewer individual retail
-	 * palloc() calls, this also helps make merging more cache efficient,
-	 * since each tape's tuples must naturally be accessed sequentially (in
-	 * sorted order).
-	 */
-	int64		spacePerTape;	/* Space (memory) for tuples (not slots) */
-	char	  **mergetuples;	/* Each tape's memory allocation */
-	char	  **mergecurrent;	/* Current offset into each tape's memory */
-	char	  **mergetail;		/* Last item's start point for each tape */
-	char	  **mergeoverflow;	/* Retail palloc() "overflow" for each tape */
 
 	/*
 	 * Variables for Algorithm D.  Note that destTape is a "logical" tape
@@ -481,11 +484,33 @@ struct Tuplesortstate
 #endif
 };
 
+/*
+ * Is the given tuple allocated from the batch memory arena?
+ */
+#define IS_MERGETUPLE_BUFFER(state, tuple) \
+	((char *) tuple >= state->batchMemoryBegin && \
+	 (char *) tuple < state->batchMemoryEnd)
+
+/*
+ * Return the given tuple to the batch memory free list, or free it
+ * if it was palloc'd.
+ */
+#define RELEASE_MERGETUPLE_BUFFER(state, tuple) \
+	do { \
+		MergeTupleBuffer *buf = (MergeTupleBuffer *) tuple; \
+		\
+		if (IS_MERGETUPLE_BUFFER(state, tuple)) \
+		{ \
+			buf->nextfree = state->freeBufferHead; \
+			state->freeBufferHead = buf; \
+		} else \
+			pfree(tuple); \
+	} while(0)
+
 #define COMPARETUP(state,a,b)	((*(state)->comparetup) (a, b, state))
 #define COPYTUP(state,stup,tup) ((*(state)->copytup) (state, stup, tup))
 #define WRITETUP(state,tape,stup)	((*(state)->writetup) (state, tape, stup))
 #define READTUP(state,stup,tape,len) ((*(state)->readtup) (state, stup, tape, len))
-#define MOVETUP(dest,src,len) ((*(state)->movetup) (dest, src, len))
 #define LACKMEM(state)		((state)->availMem < 0 && !(state)->batchUsed)
 #define USEMEM(state,amt)	((state)->availMem -= (amt))
 #define FREEMEM(state,amt)	((state)->availMem += (amt))
@@ -553,16 +578,8 @@ static void inittapes(Tuplesortstate *state);
 static void selectnewtape(Tuplesortstate *state);
 static void mergeruns(Tuplesortstate *state);
 static void mergeonerun(Tuplesortstate *state);
-static void beginmerge(Tuplesortstate *state, bool finalMergeBatch);
-static void batchmemtuples(Tuplesortstate *state);
-static void mergebatch(Tuplesortstate *state, int64 spacePerTape);
-static void mergebatchone(Tuplesortstate *state, int srcTape,
-			  SortTuple *stup, bool *should_free);
-static void mergebatchfreetape(Tuplesortstate *state, int srcTape,
-				   SortTuple *rtup, bool *should_free);
-static void *mergebatchalloc(Tuplesortstate *state, int tapenum, Size tuplen);
-static void mergepreread(Tuplesortstate *state);
-static void mergeprereadone(Tuplesortstate *state, int srcTape);
+static void beginmerge(Tuplesortstate *state);
+static bool mergereadnext(Tuplesortstate *state, int srcTape, SortTuple *stup);
 static void dumptuples(Tuplesortstate *state, bool alltuples);
 static void dumpbatch(Tuplesortstate *state, bool alltuples);
 static void make_bounded_heap(Tuplesortstate *state);
@@ -574,7 +591,7 @@ static void tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex);
 static void reversedirection(Tuplesortstate *state);
 static unsigned int getlen(Tuplesortstate *state, int tapenum, bool eofOK);
 static void markrunend(Tuplesortstate *state, int tapenum);
-static void *readtup_alloc(Tuplesortstate *state, int tapenum, Size tuplen);
+static void *readtup_alloc(Tuplesortstate *state, Size tuplen);
 static int comparetup_heap(const SortTuple *a, const SortTuple *b,
 				Tuplesortstate *state);
 static void copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup);
@@ -582,7 +599,6 @@ static void writetup_heap(Tuplesortstate *state, int tapenum,
 			  SortTuple *stup);
 static void readtup_heap(Tuplesortstate *state, SortTuple *stup,
 			 int tapenum, unsigned int len);
-static void movetup_heap(void *dest, void *src, unsigned int len);
 static int comparetup_cluster(const SortTuple *a, const SortTuple *b,
 				   Tuplesortstate *state);
 static void copytup_cluster(Tuplesortstate *state, SortTuple *stup, void *tup);
@@ -590,7 +606,6 @@ static void writetup_cluster(Tuplesortstate *state, int tapenum,
 				 SortTuple *stup);
 static void readtup_cluster(Tuplesortstate *state, SortTuple *stup,
 				int tapenum, unsigned int len);
-static void movetup_cluster(void *dest, void *src, unsigned int len);
 static int comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 					   Tuplesortstate *state);
 static int comparetup_index_hash(const SortTuple *a, const SortTuple *b,
@@ -600,7 +615,6 @@ static void writetup_index(Tuplesortstate *state, int tapenum,
 			   SortTuple *stup);
 static void readtup_index(Tuplesortstate *state, SortTuple *stup,
 			  int tapenum, unsigned int len);
-static void movetup_index(void *dest, void *src, unsigned int len);
 static int comparetup_datum(const SortTuple *a, const SortTuple *b,
 				 Tuplesortstate *state);
 static void copytup_datum(Tuplesortstate *state, SortTuple *stup, void *tup);
@@ -608,7 +622,6 @@ static void writetup_datum(Tuplesortstate *state, int tapenum,
 			   SortTuple *stup);
 static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
 			  int tapenum, unsigned int len);
-static void movetup_datum(void *dest, void *src, unsigned int len);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
 
 /*
@@ -760,7 +773,6 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
 	state->readtup = readtup_heap;
-	state->movetup = movetup_heap;
 
 	state->tupDesc = tupDesc;	/* assume we need not copy tupDesc */
 	state->abbrevNext = 10;
@@ -833,7 +845,6 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 	state->copytup = copytup_cluster;
 	state->writetup = writetup_cluster;
 	state->readtup = readtup_cluster;
-	state->movetup = movetup_cluster;
 	state->abbrevNext = 10;
 
 	state->indexInfo = BuildIndexInfo(indexRel);
@@ -925,7 +936,6 @@ tuplesort_begin_index_btree(Relation heapRel,
 	state->copytup = copytup_index;
 	state->writetup = writetup_index;
 	state->readtup = readtup_index;
-	state->movetup = movetup_index;
 	state->abbrevNext = 10;
 
 	state->heapRel = heapRel;
@@ -993,7 +1003,6 @@ tuplesort_begin_index_hash(Relation heapRel,
 	state->copytup = copytup_index;
 	state->writetup = writetup_index;
 	state->readtup = readtup_index;
-	state->movetup = movetup_index;
 
 	state->heapRel = heapRel;
 	state->indexRel = indexRel;
@@ -1036,7 +1045,6 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	state->copytup = copytup_datum;
 	state->writetup = writetup_datum;
 	state->readtup = readtup_datum;
-	state->movetup = movetup_datum;
 	state->abbrevNext = 10;
 
 	state->datumType = datumType;
@@ -1881,14 +1889,33 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 		case TSS_SORTEDONTAPE:
 			Assert(forward || state->randomAccess);
 			Assert(!state->batchUsed);
-			*should_free = true;
+
+			/*
+			 * The buffer holding the tuple that we returned in previous
+			 * gettuple call can now be reused.
+			 */
+			if (state->readlasttuple)
+			{
+				RELEASE_MERGETUPLE_BUFFER(state, state->readlasttuple);
+				state->readlasttuple = NULL;
+			}
+
 			if (forward)
 			{
 				if (state->eof_reached)
 					return false;
+
 				if ((tuplen = getlen(state, state->result_tape, true)) != 0)
 				{
 					READTUP(state, stup, state->result_tape, tuplen);
+
+					/*
+					 * Remember the tuple we return, so that we can recycle its
+					 * memory on next call. (This can be NULL, in the Datum case).
+					 */
+					state->readlasttuple = stup->tuple;
+
+					*should_free = false;
 					return true;
 				}
 				else
@@ -1962,6 +1989,14 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 									  tuplen))
 				elog(ERROR, "bogus tuple length in backward scan");
 			READTUP(state, stup, state->result_tape, tuplen);
+
+			/*
+			 * Remember the tuple we return, so that we can recycle its
+			 * memory on next call. (This can be NULL, in the Datum case).
+			 */
+			state->readlasttuple = stup->tuple;
+
+			*should_free = false;
 			return true;
 
 		case TSS_FINALMERGE:
@@ -1971,13 +2006,22 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 			*should_free = false;
 
 			/*
+			 * The buffer holding the tuple that we returned in previous
+			 * gettuple call can now be reused.
+			 */
+			if (state->readlasttuple)
+			{
+				RELEASE_MERGETUPLE_BUFFER(state, state->readlasttuple);
+				state->readlasttuple = NULL;
+			}
+
+			/*
 			 * This code should match the inner loop of mergeonerun().
 			 */
 			if (state->memtupcount > 0)
 			{
 				int			srcTape = state->memtuples[0].tupindex;
-				int			tupIndex;
-				SortTuple  *newtup;
+				SortTuple	newtup;
 
 				/*
 				 * Returned tuple is still counted in our memory space most of
@@ -1988,42 +2032,22 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 				 */
 				*stup = state->memtuples[0];
 				tuplesort_heap_siftup(state, false);
-				if ((tupIndex = state->mergenext[srcTape]) == 0)
-				{
-					/*
-					 * out of preloaded data on this tape, try to read more
-					 *
-					 * Unlike mergeonerun(), we only preload from the single
-					 * tape that's run dry, though not before preparing its
-					 * batch memory for a new round of sequential consumption.
-					 * See mergepreread() comments.
-					 */
-					if (state->batchUsed)
-						mergebatchone(state, srcTape, stup, should_free);
 
-					mergeprereadone(state, srcTape);
+				/*
+				 * Remember the tuple we return, so that we can recycle its
+				 * memory on next call. (This can be NULL, in the Datum case).
+				 */
+				state->readlasttuple = stup->tuple;
 
-					/*
-					 * if still no data, we've reached end of run on this tape
-					 */
-					if ((tupIndex = state->mergenext[srcTape]) == 0)
-					{
-						/* Free tape's buffer, avoiding dangling pointer */
-						if (state->batchUsed)
-							mergebatchfreetape(state, srcTape, stup, should_free);
-						return true;
-					}
+				/* pull next tuple from tape, insert in heap */
+				if (!mergereadnext(state, srcTape, &newtup))
+				{
+					/* we've reached end of run on this tape */
+					return true;
 				}
-				/* pull next preread tuple from list, insert in heap */
-				newtup = &state->memtuples[tupIndex];
-				state->mergenext[srcTape] = newtup->tupindex;
-				if (state->mergenext[srcTape] == 0)
-					state->mergelast[srcTape] = 0;
-				tuplesort_heap_insert(state, newtup, srcTape, false);
-				/* put the now-unused memtuples entry on the freelist */
-				newtup->tupindex = state->mergefreelist;
-				state->mergefreelist = tupIndex;
-				state->mergeavailslots[srcTape]++;
+
+				tuplesort_heap_insert(state, &newtup, srcTape, false);
+
 				return true;
 			}
 			return false;
@@ -2325,7 +2349,8 @@ inittapes(Tuplesortstate *state)
 #endif
 
 	/*
-	 * Decrease availMem to reflect the space needed for tape buffers; but
+	 * Decrease availMem to reflect the space needed for tape buffers, when
+	 * writing the initial runs; but
 	 * don't decrease it to the point that we have no room for tuples. (That
 	 * case is only likely to occur if sorting pass-by-value Datums; in all
 	 * other scenarios the memtuples[] array is unlikely to occupy more than
@@ -2350,14 +2375,6 @@ inittapes(Tuplesortstate *state)
 	state->tapeset = LogicalTapeSetCreate(maxTapes);
 
 	state->mergeactive = (bool *) palloc0(maxTapes * sizeof(bool));
-	state->mergenext = (int *) palloc0(maxTapes * sizeof(int));
-	state->mergelast = (int *) palloc0(maxTapes * sizeof(int));
-	state->mergeavailslots = (int *) palloc0(maxTapes * sizeof(int));
-	state->mergeavailmem = (int64 *) palloc0(maxTapes * sizeof(int64));
-	state->mergetuples = (char **) palloc0(maxTapes * sizeof(char *));
-	state->mergecurrent = (char **) palloc0(maxTapes * sizeof(char *));
-	state->mergetail = (char **) palloc0(maxTapes * sizeof(char *));
-	state->mergeoverflow = (char **) palloc0(maxTapes * sizeof(char *));
 	state->tp_fib = (int *) palloc0(maxTapes * sizeof(int));
 	state->tp_runs = (int *) palloc0(maxTapes * sizeof(int));
 	state->tp_dummy = (int *) palloc0(maxTapes * sizeof(int));
@@ -2468,6 +2485,8 @@ mergeruns(Tuplesortstate *state)
 				svTape,
 				svRuns,
 				svDummy;
+	char	   *p;
+	int			i;
 
 	Assert(state->status == TSS_BUILDRUNS);
 	Assert(state->memtupcount == 0);
@@ -2504,6 +2523,36 @@ mergeruns(Tuplesortstate *state)
 		return;
 	}
 
+	/*
+	 * We no longer need a large memtuples array: during merge, the heap
+	 * holds at most one tuple per tape. Shrink the array to one slot per
+	 * tape, to make the memory available for other use.
+	 */
+	pfree(state->memtuples);
+	FREEMEM(state, state->memtupsize * sizeof(SortTuple));
+	state->memtupsize = state->maxTapes;
+	state->memtuples = (SortTuple *) palloc(state->maxTapes * sizeof(SortTuple));
+	USEMEM(state, state->memtupsize * sizeof(SortTuple));
+
+	/*
+	 * From this point on, we no longer use the USEMEM()/LACKMEM() mechanism to
+	 * track memory usage.
+	 */
+	state->batchUsed = true;
+
+	/* Initialize the merge tuple buffer arena.  */
+	state->batchMemoryBegin = palloc((state->maxTapes + 1) * MERGETUPLEBUFFER_SIZE);
+	state->batchMemoryEnd = state->batchMemoryBegin + (state->maxTapes + 1) * MERGETUPLEBUFFER_SIZE;
+	state->freeBufferHead = (MergeTupleBuffer *) state->batchMemoryBegin;
+
+	p = state->batchMemoryBegin;
+	for (i = 0; i < state->maxTapes; i++)
+	{
+		((MergeTupleBuffer *) p)->nextfree = (MergeTupleBuffer *) (p + MERGETUPLEBUFFER_SIZE);
+		p += MERGETUPLEBUFFER_SIZE;
+	}
+	((MergeTupleBuffer *) p)->nextfree = NULL;
+
 	/* End of step D2: rewind all output tapes to prepare for merging */
 	for (tapenum = 0; tapenum < state->tapeRange; tapenum++)
 		LogicalTapeRewind(state->tapeset, tapenum, false);
@@ -2534,7 +2583,7 @@ mergeruns(Tuplesortstate *state)
 				/* Tell logtape.c we won't be writing anymore */
 				LogicalTapeSetForgetFreeSpace(state->tapeset);
 				/* Initialize for the final merge pass */
-				beginmerge(state, state->tuples);
+				beginmerge(state);
 				state->status = TSS_FINALMERGE;
 				return;
 			}
@@ -2617,16 +2666,12 @@ mergeonerun(Tuplesortstate *state)
 {
 	int			destTape = state->tp_tapenum[state->tapeRange];
 	int			srcTape;
-	int			tupIndex;
-	SortTuple  *tup;
-	int64		priorAvail,
-				spaceFreed;
 
 	/*
 	 * Start the merge by loading one tuple from each active source tape into
 	 * the heap.  We can also decrease the input run/dummy run counts.
 	 */
-	beginmerge(state, false);
+	beginmerge(state);
 
 	/*
 	 * Execute merge by repeatedly extracting lowest tuple in heap, writing it
@@ -2635,33 +2680,25 @@ mergeonerun(Tuplesortstate *state)
 	 */
 	while (state->memtupcount > 0)
 	{
+		SortTuple stup;
+
 		/* write the tuple to destTape */
-		priorAvail = state->availMem;
 		srcTape = state->memtuples[0].tupindex;
 		WRITETUP(state, destTape, &state->memtuples[0]);
-		/* writetup adjusted total free space, now fix per-tape space */
-		spaceFreed = state->availMem - priorAvail;
-		state->mergeavailmem[srcTape] += spaceFreed;
+
+		/* Recycle the buffer we just wrote out, for the next read */
+		RELEASE_MERGETUPLE_BUFFER(state, state->memtuples[0].tuple);
+
 		/* compact the heap */
 		tuplesort_heap_siftup(state, false);
-		if ((tupIndex = state->mergenext[srcTape]) == 0)
+
+		/* pull next tuple from tape, insert in heap */
+		if (!mergereadnext(state, srcTape, &stup))
 		{
-			/* out of preloaded data on this tape, try to read more */
-			mergepreread(state);
-			/* if still no data, we've reached end of run on this tape */
-			if ((tupIndex = state->mergenext[srcTape]) == 0)
-				continue;
+			/* we've reached end of run on this tape */
+			continue;
 		}
-		/* pull next preread tuple from list, insert in heap */
-		tup = &state->memtuples[tupIndex];
-		state->mergenext[srcTape] = tup->tupindex;
-		if (state->mergenext[srcTape] == 0)
-			state->mergelast[srcTape] = 0;
-		tuplesort_heap_insert(state, tup, srcTape, false);
-		/* put the now-unused memtuples entry on the freelist */
-		tup->tupindex = state->mergefreelist;
-		state->mergefreelist = tupIndex;
-		state->mergeavailslots[srcTape]++;
+		tuplesort_heap_insert(state, &stup, srcTape, false);
 	}
 
 	/*
@@ -2694,18 +2731,13 @@ mergeonerun(Tuplesortstate *state)
  * which tapes contain active input runs in mergeactive[].  Then, load
  * as many tuples as we can from each active input tape, and finally
  * fill the merge heap with the first tuple from each active tape.
- *
- * finalMergeBatch indicates if this is the beginning of a final on-the-fly
- * merge where a batched allocation of tuple memory is required.
  */
 static void
-beginmerge(Tuplesortstate *state, bool finalMergeBatch)
+beginmerge(Tuplesortstate *state)
 {
 	int			activeTapes;
 	int			tapenum;
 	int			srcTape;
-	int			slotsPerTape;
-	int64		spacePerTape;
 
 	/* Heap should be empty here */
 	Assert(state->memtupcount == 0);
@@ -2729,497 +2761,48 @@ beginmerge(Tuplesortstate *state, bool finalMergeBatch)
 	}
 	state->activeTapes = activeTapes;
 
-	/* Clear merge-pass state variables */
-	memset(state->mergenext, 0,
-		   state->maxTapes * sizeof(*state->mergenext));
-	memset(state->mergelast, 0,
-		   state->maxTapes * sizeof(*state->mergelast));
-	state->mergefreelist = 0;	/* nothing in the freelist */
-	state->mergefirstfree = activeTapes;		/* 1st slot avail for preread */
-
-	if (finalMergeBatch)
-	{
-		/* Free outright buffers for tape never actually allocated */
-		FREEMEM(state, (state->maxTapes - activeTapes) * TAPE_BUFFER_OVERHEAD);
-
-		/*
-		 * Grow memtuples one last time, since the palloc() overhead no longer
-		 * incurred can make a big difference
-		 */
-		batchmemtuples(state);
-	}
-
 	/*
 	 * Initialize space allocation to let each active input tape have an equal
 	 * share of preread space.
 	 */
 	Assert(activeTapes > 0);
-	slotsPerTape = (state->memtupsize - state->mergefirstfree) / activeTapes;
-	Assert(slotsPerTape > 0);
-	spacePerTape = MAXALIGN_DOWN(state->availMem / activeTapes);
-	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
-	{
-		if (state->mergeactive[srcTape])
-		{
-			state->mergeavailslots[srcTape] = slotsPerTape;
-			state->mergeavailmem[srcTape] = spacePerTape;
-		}
-	}
-
-	/*
-	 * Preallocate tuple batch memory for each tape.  This is the memory used
-	 * for tuples themselves (not SortTuples), so it's never used by
-	 * pass-by-value datum sorts.  Memory allocation is performed here at most
-	 * once per sort, just in advance of the final on-the-fly merge step.
-	 */
-	if (finalMergeBatch)
-		mergebatch(state, spacePerTape);
-
-	/*
-	 * Preread as many tuples as possible (and at least one) from each active
-	 * tape
-	 */
-	mergepreread(state);
 
 	/* Load the merge heap with the first tuple from each input tape */
 	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
 	{
-		int			tupIndex = state->mergenext[srcTape];
-		SortTuple  *tup;
-
-		if (tupIndex)
-		{
-			tup = &state->memtuples[tupIndex];
-			state->mergenext[srcTape] = tup->tupindex;
-			if (state->mergenext[srcTape] == 0)
-				state->mergelast[srcTape] = 0;
-			tuplesort_heap_insert(state, tup, srcTape, false);
-			/* put the now-unused memtuples entry on the freelist */
-			tup->tupindex = state->mergefreelist;
-			state->mergefreelist = tupIndex;
-			state->mergeavailslots[srcTape]++;
-
-#ifdef TRACE_SORT
-			if (trace_sort && finalMergeBatch)
-			{
-				int64		perTapeKB = (spacePerTape + 1023) / 1024;
-				int64		usedSpaceKB;
-				int			usedSlots;
+		SortTuple	tup;
 
-				/*
-				 * Report how effective batchmemtuples() was in balancing the
-				 * number of slots against the need for memory for the
-				 * underlying tuples (e.g. IndexTuples).  The big preread of
-				 * all tapes when switching to FINALMERGE state should be
-				 * fairly representative of memory utilization during the
-				 * final merge step, and in any case is the only point at
-				 * which all tapes are guaranteed to have depleted either
-				 * their batch memory allowance or slot allowance.  Ideally,
-				 * both will be completely depleted for every tape by now.
-				 */
-				usedSpaceKB = (state->mergecurrent[srcTape] -
-							   state->mergetuples[srcTape] + 1023) / 1024;
-				usedSlots = slotsPerTape - state->mergeavailslots[srcTape];
-
-				elog(LOG, "tape %d initially used " INT64_FORMAT " KB of "
-					 INT64_FORMAT " KB batch (%2.3f) and %d out of %d slots "
-					 "(%2.3f)", srcTape,
-					 usedSpaceKB, perTapeKB,
-					 (double) usedSpaceKB / (double) perTapeKB,
-					 usedSlots, slotsPerTape,
-					 (double) usedSlots / (double) slotsPerTape);
-			}
-#endif
-		}
+		if (mergereadnext(state, srcTape, &tup))
+			tuplesort_heap_insert(state, &tup, srcTape, false);
 	}
 }
 
 /*
- * batchmemtuples - grow memtuples without palloc overhead
+ * mergereadnext - load tuple from one merge input tape
  *
- * When called, availMem should be approximately the amount of memory we'd
- * require to allocate memtupsize - memtupcount tuples (not SortTuples/slots)
- * that were allocated with palloc() overhead, and in doing so use up all
- * allocated slots.  However, though slots and tuple memory is in balance
- * following the last grow_memtuples() call, that's predicated on the observed
- * average tuple size for the "final" grow_memtuples() call, which includes
- * palloc overhead.  During the final merge pass, where we will arrange to
- * squeeze out the palloc overhead, we might need more slots in the memtuples
- * array.
- *
- * To make that happen, arrange for the amount of remaining memory to be
- * exactly equal to the palloc overhead multiplied by the current size of
- * the memtuples array, force the grow_memtuples flag back to true (it's
- * probably but not necessarily false on entry to this routine), and then
- * call grow_memtuples.  This simulates loading enough tuples to fill the
- * whole memtuples array and then having some space left over because of the
- * elided palloc overhead.  We expect that grow_memtuples() will conclude that
- * it can't double the size of the memtuples array but that it can increase
- * it by some percentage; but if it does decide to double it, that just means
- * that we've never managed to use many slots in the memtuples array, in which
- * case doubling it shouldn't hurt anything anyway.
- */
-static void
-batchmemtuples(Tuplesortstate *state)
-{
-	int64		refund;
-	int64		availMemLessRefund;
-	int			memtupsize = state->memtupsize;
-
-	/* For simplicity, assume no memtuples are actually currently counted */
-	Assert(state->memtupcount == 0);
-
-	/*
-	 * Refund STANDARDCHUNKHEADERSIZE per tuple.
-	 *
-	 * This sometimes fails to make memory use perfectly balanced, but it
-	 * should never make the situation worse.  Note that Assert-enabled builds
-	 * get a larger refund, due to a varying STANDARDCHUNKHEADERSIZE.
-	 */
-	refund = memtupsize * STANDARDCHUNKHEADERSIZE;
-	availMemLessRefund = state->availMem - refund;
-
-	/*
-	 * To establish balanced memory use after refunding palloc overhead,
-	 * temporarily have our accounting indicate that we've allocated all
-	 * memory we're allowed to less that refund, and call grow_memtuples() to
-	 * have it increase the number of slots.
-	 */
-	state->growmemtuples = true;
-	USEMEM(state, availMemLessRefund);
-	(void) grow_memtuples(state);
-	/* Should not matter, but be tidy */
-	FREEMEM(state, availMemLessRefund);
-	state->growmemtuples = false;
-
-#ifdef TRACE_SORT
-	if (trace_sort)
-	{
-		Size		OldKb = (memtupsize * sizeof(SortTuple) + 1023) / 1024;
-		Size		NewKb = (state->memtupsize * sizeof(SortTuple) + 1023) / 1024;
-
-		elog(LOG, "grew memtuples %1.2fx from %d (%zu KB) to %d (%zu KB) for final merge",
-			 (double) NewKb / (double) OldKb,
-			 memtupsize, OldKb,
-			 state->memtupsize, NewKb);
-	}
-#endif
-}
-
-/*
- * mergebatch - initialize tuple memory in batch
- *
- * This allows sequential access to sorted tuples buffered in memory from
- * tapes/runs on disk during a final on-the-fly merge step.  Note that the
- * memory is not used for SortTuples, but for the underlying tuples (e.g.
- * MinimalTuples).
- *
- * Note that when batch memory is used, there is a simple division of space
- * into large buffers (one per active tape).  The conventional incremental
- * memory accounting (calling USEMEM() and FREEMEM()) is abandoned.  Instead,
- * when each tape's memory budget is exceeded, a retail palloc() "overflow" is
- * performed, which is then immediately detected in a way that is analogous to
- * LACKMEM().  This keeps each tape's use of memory fair, which is always a
- * goal.
- */
-static void
-mergebatch(Tuplesortstate *state, int64 spacePerTape)
-{
-	int			srcTape;
-
-	Assert(state->activeTapes > 0);
-	Assert(state->tuples);
-
-	/*
-	 * For the purposes of tuplesort's memory accounting, the batch allocation
-	 * is special, and regular memory accounting through USEMEM() calls is
-	 * abandoned (see mergeprereadone()).
-	 */
-	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
-	{
-		char	   *mergetuples;
-
-		if (!state->mergeactive[srcTape])
-			continue;
-
-		/* Allocate buffer for each active tape */
-		mergetuples = MemoryContextAllocHuge(state->tuplecontext,
-											 spacePerTape);
-
-		/* Initialize state for tape */
-		state->mergetuples[srcTape] = mergetuples;
-		state->mergecurrent[srcTape] = mergetuples;
-		state->mergetail[srcTape] = mergetuples;
-		state->mergeoverflow[srcTape] = NULL;
-	}
-
-	state->batchUsed = true;
-	state->spacePerTape = spacePerTape;
-}
-
-/*
- * mergebatchone - prepare batch memory for one merge input tape
- *
- * This is called following the exhaustion of preread tuples for one input
- * tape.  All that actually occurs is that the state for the source tape is
- * reset to indicate that all memory may be reused.
- *
- * This routine must deal with fixing up the tuple that is about to be returned
- * to the client, due to "overflow" allocations.
- */
-static void
-mergebatchone(Tuplesortstate *state, int srcTape, SortTuple *rtup,
-			  bool *should_free)
-{
-	Assert(state->batchUsed);
-
-	/*
-	 * Tuple about to be returned to caller ("stup") is final preread tuple
-	 * from tape, just removed from the top of the heap.  Special steps around
-	 * memory management must be performed for that tuple, to make sure it
-	 * isn't overwritten early.
-	 */
-	if (!state->mergeoverflow[srcTape])
-	{
-		Size		tupLen;
-
-		/*
-		 * Mark tuple buffer range for reuse, but be careful to move final,
-		 * tail tuple to start of space for next run so that it's available to
-		 * caller when stup is returned, and remains available at least until
-		 * the next tuple is requested.
-		 */
-		tupLen = state->mergecurrent[srcTape] - state->mergetail[srcTape];
-		state->mergecurrent[srcTape] = state->mergetuples[srcTape];
-		MOVETUP(state->mergecurrent[srcTape], state->mergetail[srcTape],
-				tupLen);
-
-		/* Make SortTuple at top of the merge heap point to new tuple */
-		rtup->tuple = (void *) state->mergecurrent[srcTape];
-
-		state->mergetail[srcTape] = state->mergecurrent[srcTape];
-		state->mergecurrent[srcTape] += tupLen;
-	}
-	else
-	{
-		/*
-		 * Handle an "overflow" retail palloc.
-		 *
-		 * This is needed when we run out of tuple memory for the tape.
-		 */
-		state->mergecurrent[srcTape] = state->mergetuples[srcTape];
-		state->mergetail[srcTape] = state->mergetuples[srcTape];
-
-		if (rtup->tuple)
-		{
-			Assert(rtup->tuple == (void *) state->mergeoverflow[srcTape]);
-			/* Caller should free palloc'd tuple */
-			*should_free = true;
-		}
-		state->mergeoverflow[srcTape] = NULL;
-	}
-}
-
-/*
- * mergebatchfreetape - handle final clean-up for batch memory once tape is
- * about to become exhausted
- *
- * All tuples are returned from tape, but a single final tuple, *rtup, is to be
- * passed back to caller.  Free tape's batch allocation buffer while ensuring
- * that the final tuple is managed appropriately.
- */
-static void
-mergebatchfreetape(Tuplesortstate *state, int srcTape, SortTuple *rtup,
-				   bool *should_free)
-{
-	Assert(state->batchUsed);
-	Assert(state->status == TSS_FINALMERGE);
-
-	/*
-	 * Tuple may or may not already be an overflow allocation from
-	 * mergebatchone()
-	 */
-	if (!*should_free && rtup->tuple)
-	{
-		/*
-		 * Final tuple still in tape's batch allocation.
-		 *
-		 * Return palloc()'d copy to caller, and have it freed in a similar
-		 * manner to overflow allocation.  Otherwise, we'd free batch memory
-		 * and pass back a pointer to garbage.  Note that we deliberately
-		 * allocate this in the parent tuplesort context, to be on the safe
-		 * side.
-		 */
-		Size		tuplen;
-		void	   *oldTuple = rtup->tuple;
-
-		tuplen = state->mergecurrent[srcTape] - state->mergetail[srcTape];
-		rtup->tuple = MemoryContextAlloc(state->sortcontext, tuplen);
-		MOVETUP(rtup->tuple, oldTuple, tuplen);
-		*should_free = true;
-	}
-
-	/* Free spacePerTape-sized buffer */
-	pfree(state->mergetuples[srcTape]);
-}
-
-/*
- * mergebatchalloc - allocate memory for one tuple using a batch memory
- * "logical allocation".
- *
- * This is used for the final on-the-fly merge phase only.  READTUP() routines
- * receive memory from here in place of palloc() and USEMEM() calls.
- *
- * Tuple tapenum is passed, ensuring each tape's tuples are stored in sorted,
- * contiguous order (while allowing safe reuse of memory made available to
- * each tape).  This maximizes locality of access as tuples are returned by
- * final merge.
- *
- * Caller must not subsequently attempt to free memory returned here.  In
- * general, only mergebatch* functions know about how memory returned from
- * here should be freed, and this function's caller must ensure that batch
- * memory management code will definitely have the opportunity to do the right
- * thing during the final on-the-fly merge.
- */
-static void *
-mergebatchalloc(Tuplesortstate *state, int tapenum, Size tuplen)
-{
-	Size		reserve_tuplen = MAXALIGN(tuplen);
-	char	   *ret;
-
-	/* Should overflow at most once before mergebatchone() call: */
-	Assert(state->mergeoverflow[tapenum] == NULL);
-	Assert(state->batchUsed);
-
-	/* It should be possible to use precisely spacePerTape memory at once */
-	if (state->mergecurrent[tapenum] + reserve_tuplen <=
-		state->mergetuples[tapenum] + state->spacePerTape)
-	{
-		/*
-		 * Usual case -- caller is returned pointer into its tape's buffer,
-		 * and an offset from that point is recorded as where tape has
-		 * consumed up to for current round of preloading.
-		 */
-		ret = state->mergetail[tapenum] = state->mergecurrent[tapenum];
-		state->mergecurrent[tapenum] += reserve_tuplen;
-	}
-	else
-	{
-		/*
-		 * Allocate memory, and record as tape's overflow allocation.  This
-		 * will be detected quickly, in a similar fashion to a LACKMEM()
-		 * condition, and should not happen again before a new round of
-		 * preloading for caller's tape.  Note that we deliberately allocate
-		 * this in the parent tuplesort context, to be on the safe side.
-		 *
-		 * Sometimes, this does not happen because merging runs out of slots
-		 * before running out of memory.
-		 */
-		ret = state->mergeoverflow[tapenum] =
-			MemoryContextAlloc(state->sortcontext, tuplen);
-	}
-
-	return ret;
-}
-
-/*
- * mergepreread - load tuples from merge input tapes
- *
- * This routine exists to improve sequentiality of reads during a merge pass,
- * as explained in the header comments of this file.  Load tuples from each
- * active source tape until the tape's run is exhausted or it has used up
- * its fair share of available memory.  In any case, we guarantee that there
- * is at least one preread tuple available from each unexhausted input tape.
- *
- * We invoke this routine at the start of a merge pass for initial load,
- * and then whenever any tape's preread data runs out.  Note that we load
- * as much data as possible from all tapes, not just the one that ran out.
- * This is because logtape.c works best with a usage pattern that alternates
- * between reading a lot of data and writing a lot of data, so whenever we
- * are forced to read, we should fill working memory completely.
- *
- * In FINALMERGE state, we *don't* use this routine, but instead just preread
- * from the single tape that ran dry.  There's no read/write alternation in
- * that state and so no point in scanning through all the tapes to fix one.
- * (Moreover, there may be quite a lot of inactive tapes in that state, since
- * we might have had many fewer runs than tapes.  In a regular tape-to-tape
- * merge we can expect most of the tapes to be active.  Plus, only
- * FINALMERGE state has to consider memory management for a batch
- * allocation.)
- */
-static void
-mergepreread(Tuplesortstate *state)
-{
-	int			srcTape;
-
-	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
-		mergeprereadone(state, srcTape);
-}
-
-/*
- * mergeprereadone - load tuples from one merge input tape
+ * Returns false on EOF.
  *
  * Read tuples from the specified tape until it has used up its free memory
  * or array slots; but ensure that we have at least one tuple, if any are
  * to be had.
  */
-static void
-mergeprereadone(Tuplesortstate *state, int srcTape)
+static bool
+mergereadnext(Tuplesortstate *state, int srcTape, SortTuple *stup)
 {
 	unsigned int tuplen;
-	SortTuple	stup;
-	int			tupIndex;
-	int64		priorAvail,
-				spaceUsed;
 
 	if (!state->mergeactive[srcTape])
-		return;					/* tape's run is already exhausted */
+		return false;					/* tape's run is already exhausted */
 
-	/*
-	 * Manage per-tape availMem.  Only actually matters when batch memory not
-	 * in use.
-	 */
-	priorAvail = state->availMem;
-	state->availMem = state->mergeavailmem[srcTape];
-
-	/*
-	 * When batch memory is used if final on-the-fly merge, only mergeoverflow
-	 * test is relevant; otherwise, only LACKMEM() test is relevant.
-	 */
-	while ((state->mergeavailslots[srcTape] > 0 &&
-			state->mergeoverflow[srcTape] == NULL && !LACKMEM(state)) ||
-		   state->mergenext[srcTape] == 0)
+	/* read next tuple, if any */
+	if ((tuplen = getlen(state, srcTape, true)) == 0)
 	{
-		/* read next tuple, if any */
-		if ((tuplen = getlen(state, srcTape, true)) == 0)
-		{
-			state->mergeactive[srcTape] = false;
-			break;
-		}
-		READTUP(state, &stup, srcTape, tuplen);
-		/* find a free slot in memtuples[] for it */
-		tupIndex = state->mergefreelist;
-		if (tupIndex)
-			state->mergefreelist = state->memtuples[tupIndex].tupindex;
-		else
-		{
-			tupIndex = state->mergefirstfree++;
-			Assert(tupIndex < state->memtupsize);
-		}
-		state->mergeavailslots[srcTape]--;
-		/* store tuple, append to list for its tape */
-		stup.tupindex = 0;
-		state->memtuples[tupIndex] = stup;
-		if (state->mergelast[srcTape])
-			state->memtuples[state->mergelast[srcTape]].tupindex = tupIndex;
-		else
-			state->mergenext[srcTape] = tupIndex;
-		state->mergelast[srcTape] = tupIndex;
+		state->mergeactive[srcTape] = false;
+		return false;
 	}
-	/* update per-tape and global availmem counts */
-	spaceUsed = state->mergeavailmem[srcTape] - state->availMem;
-	state->mergeavailmem[srcTape] = state->availMem;
-	state->availMem = priorAvail - spaceUsed;
+	READTUP(state, stup, srcTape, tuplen);
+
+	return true;
 }
 
 /*
@@ -3857,27 +3440,24 @@ markrunend(Tuplesortstate *state, int tapenum)
  * routines.
  */
 static void *
-readtup_alloc(Tuplesortstate *state, int tapenum, Size tuplen)
+readtup_alloc(Tuplesortstate *state, Size tuplen)
 {
-	if (state->batchUsed)
-	{
-		/*
-		 * No USEMEM() call, because during final on-the-fly merge accounting
-		 * is based on tape-private state. ("Overflow" allocations are
-		 * detected as an indication that a new round or preloading is
-		 * required. Preloading marks existing contents of tape's batch buffer
-		 * for reuse.)
-		 */
-		return mergebatchalloc(state, tapenum, tuplen);
-	}
+	MergeTupleBuffer *buf;
+
+	/*
+	 * We pre-allocate enough buffers in the arena that we should never run out.
+	 */
+	Assert(state->freeBufferHead);
+
+	if (tuplen > MERGETUPLEBUFFER_SIZE || !state->freeBufferHead)
+		return MemoryContextAlloc(state->sortcontext, tuplen);
 	else
 	{
-		char	   *ret;
+		buf = state->freeBufferHead;
+		/* Reuse this buffer */
+		state->freeBufferHead = buf->nextfree;
 
-		/* Batch allocation yet to be performed */
-		ret = MemoryContextAlloc(state->tuplecontext, tuplen);
-		USEMEM(state, GetMemoryChunkSpace(ret));
-		return ret;
+		return buf;
 	}
 }
 
@@ -4046,8 +3626,11 @@ writetup_heap(Tuplesortstate *state, int tapenum, SortTuple *stup)
 		LogicalTapeWrite(state->tapeset, tapenum,
 						 (void *) &tuplen, sizeof(tuplen));
 
-	FREEMEM(state, GetMemoryChunkSpace(tuple));
-	heap_free_minimal_tuple(tuple);
+	if (!state->batchUsed)
+	{
+		FREEMEM(state, GetMemoryChunkSpace(tuple));
+		heap_free_minimal_tuple(tuple);
+	}
 }
 
 static void
@@ -4056,7 +3639,7 @@ readtup_heap(Tuplesortstate *state, SortTuple *stup,
 {
 	unsigned int tupbodylen = len - sizeof(int);
 	unsigned int tuplen = tupbodylen + MINIMAL_TUPLE_DATA_OFFSET;
-	MinimalTuple tuple = (MinimalTuple) readtup_alloc(state, tapenum, tuplen);
+	MinimalTuple tuple = (MinimalTuple) readtup_alloc(state, tuplen);
 	char	   *tupbody = (char *) tuple + MINIMAL_TUPLE_DATA_OFFSET;
 	HeapTupleData htup;
 
@@ -4077,12 +3660,6 @@ readtup_heap(Tuplesortstate *state, SortTuple *stup,
 								&stup->isnull1);
 }
 
-static void
-movetup_heap(void *dest, void *src, unsigned int len)
-{
-	memmove(dest, src, len);
-}
-
 /*
  * Routines specialized for the CLUSTER case (HeapTuple data, with
  * comparisons per a btree index definition)
@@ -4289,8 +3866,11 @@ writetup_cluster(Tuplesortstate *state, int tapenum, SortTuple *stup)
 		LogicalTapeWrite(state->tapeset, tapenum,
 						 &tuplen, sizeof(tuplen));
 
-	FREEMEM(state, GetMemoryChunkSpace(tuple));
-	heap_freetuple(tuple);
+	if (!state->batchUsed)
+	{
+		FREEMEM(state, GetMemoryChunkSpace(tuple));
+		heap_freetuple(tuple);
+	}
 }
 
 static void
@@ -4299,7 +3879,6 @@ readtup_cluster(Tuplesortstate *state, SortTuple *stup,
 {
 	unsigned int t_len = tuplen - sizeof(ItemPointerData) - sizeof(int);
 	HeapTuple	tuple = (HeapTuple) readtup_alloc(state,
-												  tapenum,
 												  t_len + HEAPTUPLESIZE);
 
 	/* Reconstruct the HeapTupleData header */
@@ -4324,19 +3903,6 @@ readtup_cluster(Tuplesortstate *state, SortTuple *stup,
 									&stup->isnull1);
 }
 
-static void
-movetup_cluster(void *dest, void *src, unsigned int len)
-{
-	HeapTuple	tuple;
-
-	memmove(dest, src, len);
-
-	/* Repoint the HeapTupleData header */
-	tuple = (HeapTuple) dest;
-	tuple->t_data = (HeapTupleHeader) ((char *) tuple + HEAPTUPLESIZE);
-}
-
-
 /*
  * Routines specialized for IndexTuple case
  *
@@ -4604,8 +4170,11 @@ writetup_index(Tuplesortstate *state, int tapenum, SortTuple *stup)
 		LogicalTapeWrite(state->tapeset, tapenum,
 						 (void *) &tuplen, sizeof(tuplen));
 
-	FREEMEM(state, GetMemoryChunkSpace(tuple));
-	pfree(tuple);
+	if (!state->batchUsed)
+	{
+		FREEMEM(state, GetMemoryChunkSpace(tuple));
+		pfree(tuple);
+	}
 }
 
 static void
@@ -4613,7 +4182,7 @@ readtup_index(Tuplesortstate *state, SortTuple *stup,
 			  int tapenum, unsigned int len)
 {
 	unsigned int tuplen = len - sizeof(unsigned int);
-	IndexTuple	tuple = (IndexTuple) readtup_alloc(state, tapenum, tuplen);
+	IndexTuple	tuple = (IndexTuple) readtup_alloc(state, tuplen);
 
 	LogicalTapeReadExact(state->tapeset, tapenum,
 						 tuple, tuplen);
@@ -4628,12 +4197,6 @@ readtup_index(Tuplesortstate *state, SortTuple *stup,
 								 &stup->isnull1);
 }
 
-static void
-movetup_index(void *dest, void *src, unsigned int len)
-{
-	memmove(dest, src, len);
-}
-
 /*
  * Routines specialized for DatumTuple case
  */
@@ -4700,7 +4263,7 @@ writetup_datum(Tuplesortstate *state, int tapenum, SortTuple *stup)
 		LogicalTapeWrite(state->tapeset, tapenum,
 						 (void *) &writtenlen, sizeof(writtenlen));
 
-	if (stup->tuple)
+	if (!state->batchUsed && stup->tuple)
 	{
 		FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
 		pfree(stup->tuple);
@@ -4730,7 +4293,7 @@ readtup_datum(Tuplesortstate *state, SortTuple *stup,
 	}
 	else
 	{
-		void	   *raddr = readtup_alloc(state, tapenum, tuplen);
+		void	   *raddr = readtup_alloc(state, tuplen);
 
 		LogicalTapeReadExact(state->tapeset, tapenum,
 							 raddr, tuplen);
@@ -4744,12 +4307,6 @@ readtup_datum(Tuplesortstate *state, SortTuple *stup,
 							 &tuplen, sizeof(tuplen));
 }
 
-static void
-movetup_datum(void *dest, void *src, unsigned int len)
-{
-	memmove(dest, src, len);
-}
-
 /*
  * Convenience routine to free a tuple previously loaded into sort memory
  */
-- 
2.9.3
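
To make the new readtup_alloc() scheme in the patch above easier to follow, here is a minimal standalone sketch of a fixed-size buffer arena with a free list. It is not the patch's code: the names Arena, Slot, arena_init, arena_alloc, arena_release and in_arena are made up for illustration, SLOT_SIZE is an assumed stand-in for MERGETUPLEBUFFER_SIZE, and malloc()/free() stand in for the memory-context calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SLOT_SIZE 1024			/* assumed stand-in for MERGETUPLEBUFFER_SIZE */

/* A slot is either on the free list (nextfree) or holds tuple bytes (data). */
typedef union Slot
{
	union Slot *nextfree;
	char		data[SLOT_SIZE];
} Slot;

/* Hypothetical arena: one contiguous allocation carved into equal slots. */
typedef struct Arena
{
	Slot	   *begin;
	Slot	   *end;
	Slot	   *freehead;
} Arena;

/* Allocate nslots slots (nslots >= 1) and chain them all into the free list. */
static void
arena_init(Arena *a, int nslots)
{
	int			i;

	a->begin = malloc((size_t) nslots * sizeof(Slot));
	if (a->begin == NULL)
		abort();
	a->end = a->begin + nslots;
	for (i = 0; i < nslots - 1; i++)
		a->begin[i].nextfree = &a->begin[i + 1];
	a->begin[nslots - 1].nextfree = NULL;
	a->freehead = a->begin;
}

/* Does p point into the arena?  Compared as integers to stay portable. */
static int
in_arena(const Arena *a, const void *p)
{
	uintptr_t	addr = (uintptr_t) p;

	return addr >= (uintptr_t) a->begin && addr < (uintptr_t) a->end;
}

/* Hand out a free slot; oversized requests or an empty list fall back to malloc. */
static void *
arena_alloc(Arena *a, size_t len)
{
	Slot	   *s;

	if (len > SLOT_SIZE || a->freehead == NULL)
		return malloc(len);
	s = a->freehead;
	a->freehead = s->nextfree;
	return s;
}

/* Return a slot to the free list, or free() a fallback allocation. */
static void
arena_release(Arena *a, void *p)
{
	if (in_arena(a, p))
	{
		Slot	   *s = p;

		s->nextfree = a->freehead;
		a->freehead = s;
	}
	else
		free(p);
}
```

The point of the design is that, with one slot per tape plus one spare, the merge should normally never touch the general-purpose allocator per tuple; only tuples larger than a slot take the slow path.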

Attachment: 0002-Use-larger-read-buffers-in-logtape.patch (text/x-diff)
From 379580dc600c079b8de3cc2f392376ad46429758 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Thu, 8 Sep 2016 20:34:06 +0300
Subject: [PATCH 2/3] Use larger read buffers in logtape.

This makes the access pattern appear more sequential to the OS, making it
more likely that the OS will do read-ahead for us. It will also ensure that
there are more sequential blocks available when writing, because we can
free more blocks in the underlying file at once. Sequential I/O is much
cheaper than random I/O.

We used to do pre-reading from each tape, in tuplesort.c, for the same
reasons. But it seems simpler to do it in logtape.c, reading the raw data
into a larger buffer, than converting every tuple to SortTuple format when
pre-reading, like tuplesort.c used to do.
---
 src/backend/utils/sort/logtape.c   | 134 +++++++++++++++++++++++++++++++------
 src/backend/utils/sort/tuplesort.c |  35 +++++++++-
 src/include/utils/logtape.h        |   1 +
 3 files changed, 147 insertions(+), 23 deletions(-)

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 7745207..05d7697 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -131,9 +131,12 @@ typedef struct LogicalTape
 	 * reading.
 	 */
 	char	   *buffer;			/* physical buffer (separately palloc'd) */
+	int			buffer_size;	/* allocated size of the buffer */
 	long		curBlockNumber; /* this block's logical blk# within tape */
 	int			pos;			/* next read/write position in buffer */
 	int			nbytes;			/* total # of valid bytes in buffer */
+
+	int			read_buffer_size;	/* buffer size to use when reading */
 } LogicalTape;
 
 /*
@@ -228,6 +231,53 @@ ltsReadBlock(LogicalTapeSet *lts, long blocknum, void *buffer)
 }
 
 /*
+ * Read as many blocks as we can into the per-tape buffer.
+ *
+ * The caller can specify the next physical block number to read, in
+ * datablocknum, or -1 to fetch the next block number from the indirect block.
+ * If datablocknum == -1, the caller must've already set curBlockNumber.
+ *
+ * Returns true if anything was read, 'false' on EOF.
+ */
+static bool
+ltsReadFillBuffer(LogicalTapeSet *lts, LogicalTape *lt, long datablocknum)
+{
+	lt->pos = 0;
+	lt->nbytes = 0;
+
+	do
+	{
+		/* Fetch next block number (unless provided by caller) */
+		if (datablocknum == -1)
+		{
+			datablocknum = ltsRecallNextBlockNum(lts, lt->indirect, lt->frozen);
+			if (datablocknum == -1L)
+				break;			/* EOF */
+			lt->curBlockNumber++;
+		}
+
+		/* Read the block */
+		ltsReadBlock(lts, datablocknum, (void *) (lt->buffer + lt->nbytes));
+		if (!lt->frozen)
+			ltsReleaseBlock(lts, datablocknum);
+
+		if (lt->curBlockNumber < lt->numFullBlocks)
+			lt->nbytes += BLCKSZ;
+		else
+		{
+			/* EOF */
+			lt->nbytes += lt->lastBlockBytes;
+			break;
+		}
+
+		/* Advance to next block, if we have buffer space left */
+		datablocknum = -1;
+	} while (lt->nbytes < lt->buffer_size);
+
+	return (lt->nbytes > 0);
+}
+
+/*
  * qsort comparator for sorting freeBlocks[] into decreasing order.
  */
 static int
@@ -546,6 +596,8 @@ LogicalTapeSetCreate(int ntapes)
 		lt->numFullBlocks = 0L;
 		lt->lastBlockBytes = 0;
 		lt->buffer = NULL;
+		lt->buffer_size = 0;
+		lt->read_buffer_size = BLCKSZ;
 		lt->curBlockNumber = 0L;
 		lt->pos = 0;
 		lt->nbytes = 0;
@@ -628,7 +680,10 @@ LogicalTapeWrite(LogicalTapeSet *lts, int tapenum,
 
 	/* Allocate data buffer and first indirect block on first write */
 	if (lt->buffer == NULL)
+	{
 		lt->buffer = (char *) palloc(BLCKSZ);
+		lt->buffer_size = BLCKSZ;
+	}
 	if (lt->indirect == NULL)
 	{
 		lt->indirect = (IndirectBlock *) palloc(sizeof(IndirectBlock));
@@ -636,6 +691,7 @@ LogicalTapeWrite(LogicalTapeSet *lts, int tapenum,
 		lt->indirect->nextup = NULL;
 	}
 
+	Assert(lt->buffer_size == BLCKSZ);
 	while (size > 0)
 	{
 		if (lt->pos >= BLCKSZ)
@@ -709,18 +765,19 @@ LogicalTapeRewind(LogicalTapeSet *lts, int tapenum, bool forWrite)
 			Assert(lt->frozen);
 			datablocknum = ltsRewindFrozenIndirectBlock(lts, lt->indirect);
 		}
+
+		/* Allocate a read buffer */
+		if (lt->buffer)
+			pfree(lt->buffer);
+		lt->buffer = palloc(lt->read_buffer_size);
+		lt->buffer_size = lt->read_buffer_size;
+
 		/* Read the first block, or reset if tape is empty */
 		lt->curBlockNumber = 0L;
 		lt->pos = 0;
 		lt->nbytes = 0;
 		if (datablocknum != -1L)
-		{
-			ltsReadBlock(lts, datablocknum, (void *) lt->buffer);
-			if (!lt->frozen)
-				ltsReleaseBlock(lts, datablocknum);
-			lt->nbytes = (lt->curBlockNumber < lt->numFullBlocks) ?
-				BLCKSZ : lt->lastBlockBytes;
-		}
+			ltsReadFillBuffer(lts, lt, datablocknum);
 	}
 	else
 	{
@@ -754,6 +811,13 @@ LogicalTapeRewind(LogicalTapeSet *lts, int tapenum, bool forWrite)
 		lt->curBlockNumber = 0L;
 		lt->pos = 0;
 		lt->nbytes = 0;
+
+		if (lt->buffer)
+		{
+			pfree(lt->buffer);
+			lt->buffer = NULL;
+			lt->buffer_size = 0;
+		}
 	}
 }
 
@@ -779,20 +843,8 @@ LogicalTapeRead(LogicalTapeSet *lts, int tapenum,
 		if (lt->pos >= lt->nbytes)
 		{
 			/* Try to load more data into buffer. */
-			long		datablocknum = ltsRecallNextBlockNum(lts, lt->indirect,
-															 lt->frozen);
-
-			if (datablocknum == -1L)
+			if (!ltsReadFillBuffer(lts, lt, -1))
 				break;			/* EOF */
-			lt->curBlockNumber++;
-			lt->pos = 0;
-			ltsReadBlock(lts, datablocknum, (void *) lt->buffer);
-			if (!lt->frozen)
-				ltsReleaseBlock(lts, datablocknum);
-			lt->nbytes = (lt->curBlockNumber < lt->numFullBlocks) ?
-				BLCKSZ : lt->lastBlockBytes;
-			if (lt->nbytes <= 0)
-				break;			/* EOF (possible here?) */
 		}
 
 		nthistime = lt->nbytes - lt->pos;
@@ -842,6 +894,22 @@ LogicalTapeFreeze(LogicalTapeSet *lts, int tapenum)
 	lt->writing = false;
 	lt->frozen = true;
 	datablocknum = ltsRewindIndirectBlock(lts, lt->indirect, true);
+
+	/*
+	 * The seek and backspace functions assume a single block read buffer.
+	 * That's OK with current usage. A larger buffer is helpful to make the
+	 * read pattern of the backing file look more sequential to the OS, when
+	 * we're reading from multiple tapes. But at the end of a sort, when a
+	 * tape is frozen, we only read from a single tape anyway.
+	 */
+	if (!lt->buffer || lt->buffer_size != BLCKSZ)
+	{
+		if (lt->buffer)
+			pfree(lt->buffer);
+		lt->buffer = palloc(BLCKSZ);
+		lt->buffer_size = BLCKSZ;
+	}
+
 	/* Read the first block, or reset if tape is empty */
 	lt->curBlockNumber = 0L;
 	lt->pos = 0;
@@ -875,6 +943,7 @@ LogicalTapeBackspace(LogicalTapeSet *lts, int tapenum, size_t size)
 	Assert(tapenum >= 0 && tapenum < lts->nTapes);
 	lt = &lts->tapes[tapenum];
 	Assert(lt->frozen);
+	Assert(lt->buffer_size == BLCKSZ);
 
 	/*
 	 * Easy case for seek within current block.
@@ -941,6 +1010,7 @@ LogicalTapeSeek(LogicalTapeSet *lts, int tapenum,
 	lt = &lts->tapes[tapenum];
 	Assert(lt->frozen);
 	Assert(offset >= 0 && offset <= BLCKSZ);
+	Assert(lt->buffer_size == BLCKSZ);
 
 	/*
 	 * Easy case for seek within current block.
@@ -1000,6 +1070,9 @@ LogicalTapeTell(LogicalTapeSet *lts, int tapenum,
 {
 	LogicalTape *lt;
 
 	Assert(tapenum >= 0 && tapenum < lts->nTapes);
 	lt = &lts->tapes[tapenum];
+
+	/* With a larger buffer, 'pos' wouldn't be the same as offset within page */
+	Assert(lt->buffer_size == BLCKSZ);
 	*blocknum = lt->curBlockNumber;
@@ -1014,3 +1087,24 @@ LogicalTapeSetBlocks(LogicalTapeSet *lts)
 {
 	return lts->nFileBlocks;
 }
+
+/*
+ * Set buffer size to use, when reading from given tape.
+ */
+void
+LogicalTapeAssignReadBufferSize(LogicalTapeSet *lts, int tapenum, size_t avail_mem)
+{
+	LogicalTape *lt;
+
+	Assert(tapenum >= 0 && tapenum < lts->nTapes);
+	lt = &lts->tapes[tapenum];
+
+	/*
+	 * The buffer size must be a multiple of BLCKSZ, so round the given
+	 * value down to the nearest multiple. Make sure we have at least one page.
+	 */
+	if (avail_mem < BLCKSZ)
+		avail_mem = BLCKSZ;
+	avail_mem -= avail_mem % BLCKSZ;
+	lt->read_buffer_size = avail_mem;
+}
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index b9fb99c..dc35fcf 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -2487,6 +2487,8 @@ mergeruns(Tuplesortstate *state)
 				svDummy;
 	char	   *p;
 	int			i;
+	int			per_tape, cutoff;
+	long		avail_blocks;
 
 	Assert(state->status == TSS_BUILDRUNS);
 	Assert(state->memtupcount == 0);
@@ -2535,15 +2537,17 @@ mergeruns(Tuplesortstate *state)
 	USEMEM(state, state->memtupsize * sizeof(SortTuple));
 
 	/*
-	 * From this point on, we no longer use the USEMEM()/LACKMEM() mechanism to
-	 * track memory usage.
+	 * If we had fewer runs than tapes, refund buffers for tapes that were never
+	 * allocated.
 	 */
-	state->batchUsed = true;
+	if (state->currentRun < state->maxTapes)
+		FREEMEM(state, (state->maxTapes - state->currentRun) * TAPE_BUFFER_OVERHEAD);
 
 	/* Initialize the merge tuple buffer arena.  */
 	state->batchMemoryBegin = palloc((state->maxTapes + 1) * MERGETUPLEBUFFER_SIZE);
 	state->batchMemoryEnd = state->batchMemoryBegin + (state->maxTapes + 1) * MERGETUPLEBUFFER_SIZE;
 	state->freeBufferHead = (MergeTupleBuffer *) state->batchMemoryBegin;
+	USEMEM(state, (state->maxTapes + 1) * MERGETUPLEBUFFER_SIZE);
 
 	p = state->batchMemoryBegin;
 	for (i = 0; i < state->maxTapes; i++)
@@ -2553,6 +2557,31 @@ mergeruns(Tuplesortstate *state)
 	}
 	((MergeTupleBuffer *) p)->nextfree = NULL;
 
+	/*
+	 * Use all the spare memory we have available for read buffers. Divide it
+	 * evenly among all the tapes.
+	 */
+	avail_blocks = state->availMem / BLCKSZ;
+	per_tape = avail_blocks / state->maxTapes;
+	cutoff = avail_blocks % state->maxTapes;
+	if (per_tape == 0)
+	{
+		per_tape = 1;
+		cutoff = 0;
+	}
+	for (tapenum = 0; tapenum < state->maxTapes; tapenum++)
+	{
+		LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
+										(per_tape + (tapenum < cutoff ? 1 : 0)) * BLCKSZ);
+	}
+	USEMEM(state, avail_blocks * BLCKSZ);
+
+	/*
+	 * From this point on, we no longer use the USEMEM()/LACKMEM() mechanism to
+	 * track memory usage of individual tuples.
+	 */
+	state->batchUsed = true;
+
 	/* End of step D2: rewind all output tapes to prepare for merging */
 	for (tapenum = 0; tapenum < state->tapeRange; tapenum++)
 		LogicalTapeRewind(state->tapeset, tapenum, false);
diff --git a/src/include/utils/logtape.h b/src/include/utils/logtape.h
index fa1e992..03d0a6f 100644
--- a/src/include/utils/logtape.h
+++ b/src/include/utils/logtape.h
@@ -39,6 +39,7 @@ extern bool LogicalTapeSeek(LogicalTapeSet *lts, int tapenum,
 				long blocknum, int offset);
 extern void LogicalTapeTell(LogicalTapeSet *lts, int tapenum,
 				long *blocknum, int *offset);
+extern void LogicalTapeAssignReadBufferSize(LogicalTapeSet *lts, int tapenum, size_t bufsize);
 extern long LogicalTapeSetBlocks(LogicalTapeSet *lts);
 
 #endif   /* LOGTAPE_H */
-- 
2.9.3
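
For readers who want to check the arithmetic in the mergeruns() hunk above, the split of spare memory into per-tape read buffers can be sketched as a standalone function. This is only an illustration: round_read_buffer_size and tape_read_buffer_size are made-up names, BLOCK_SIZE is a stand-in for BLCKSZ (8192 is the default build value), and the real code stores the result in the tape struct instead of returning it.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE 8192			/* stand-in for BLCKSZ (default build value) */

/*
 * Mirror of the rounding in LogicalTapeAssignReadBufferSize(): at least one
 * block, and always a whole number of blocks.
 */
static size_t
round_read_buffer_size(size_t avail_mem)
{
	if (avail_mem < BLOCK_SIZE)
		avail_mem = BLOCK_SIZE;
	return avail_mem - avail_mem % BLOCK_SIZE;
}

/*
 * Hypothetical helper: read-buffer size for tape 'tapenum' when 'avail_mem'
 * bytes are divided among 'ntapes' tapes.  The first 'cutoff' tapes each get
 * one extra block, so the remainder is not wasted.
 */
static size_t
tape_read_buffer_size(long avail_mem, int ntapes, int tapenum)
{
	long		avail_blocks = avail_mem / BLOCK_SIZE;
	long		per_tape = avail_blocks / ntapes;
	long		cutoff = avail_blocks % ntapes;

	if (per_tape == 0)
	{
		/* Not even one block per tape: give every tape exactly one. */
		per_tape = 1;
		cutoff = 0;
	}
	return round_read_buffer_size(
		(size_t) (per_tape + (tapenum < cutoff ? 1 : 0)) * BLOCK_SIZE);
}
```

With 10 spare blocks and 3 tapes, tape 0 gets 4 blocks and tapes 1 and 2 get 3 each. Note that in the per_tape == 0 case the tapes together receive more than the spare memory, while the hunk still charges only avail_blocks * BLCKSZ through USEMEM().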

Attachment: 0003-Add-sorting-test-suite.patch (text/x-diff)
From c63cd34aa51941d5851dfd6d3d273415ad02a7fb Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Thu, 8 Sep 2016 21:42:55 +0300
Subject: [PATCH 3/3] Add sorting test suite

---
 src/test/sorttestsuite/Makefile      |  31 ++++++
 src/test/sorttestsuite/correctness.c | 153 +++++++++++++++++++++++++++
 src/test/sorttestsuite/generate.c    | 198 +++++++++++++++++++++++++++++++++++
 src/test/sorttestsuite/speed.c       | 139 ++++++++++++++++++++++++
 4 files changed, 521 insertions(+)
 create mode 100644 src/test/sorttestsuite/Makefile
 create mode 100644 src/test/sorttestsuite/correctness.c
 create mode 100644 src/test/sorttestsuite/generate.c
 create mode 100644 src/test/sorttestsuite/speed.c

diff --git a/src/test/sorttestsuite/Makefile b/src/test/sorttestsuite/Makefile
new file mode 100644
index 0000000..91c8ccd
--- /dev/null
+++ b/src/test/sorttestsuite/Makefile
@@ -0,0 +1,31 @@
+CFLAGS=-g -I/home/heikki/pgsql.master/include
+
+LDFLAGS=-L/home/heikki/pgsql.master/lib -lpq -lm
+
+TESTDB=sorttest
+
+# For testing quicksort.
+SCALE_SMALL=1024	# 1 MB
+
+# For testing external sort, while the dataset still fits in OS cache.
+SCALE_MEDIUM=1048576	# 1 GB
+
+# Does not fit in memory.
+SCALE_LARGE=20971520	# 20 GB
+#SCALE_LARGE=1500000	# 20 GB
+
+all: generate speed correctness
+
+generate: generate.c
+
+speed: speed.c
+
+correctness: correctness.c
+
+generate_testdata:
+	dropdb --if-exists $(TESTDB)
+	createdb $(TESTDB)
+	psql $(TESTDB) -c "CREATE SCHEMA small; CREATE SCHEMA medium; CREATE SCHEMA large;"
+	(echo "set search_path=small;"; ./generate all $(SCALE_SMALL)) | psql $(TESTDB)
+	(echo "set search_path=medium;"; ./generate all $(SCALE_MEDIUM)) | psql $(TESTDB)
+	(echo "set search_path=large;"; ./generate all $(SCALE_LARGE)) | psql $(TESTDB)
diff --git a/src/test/sorttestsuite/correctness.c b/src/test/sorttestsuite/correctness.c
new file mode 100644
index 0000000..b41aa2e
--- /dev/null
+++ b/src/test/sorttestsuite/correctness.c
@@ -0,0 +1,153 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <sys/time.h>
+
+#include <libpq-fe.h>
+
+static PGconn *conn;
+
+static void
+execute(const char *sql)
+{
+	int			i;
+	PGresult   *res;
+
+	fprintf(stderr, "%s\n", sql);
+	
+	res = PQexec(conn, sql);
+	if (PQresultStatus(res) != PGRES_COMMAND_OK && PQresultStatus(res) != PGRES_TUPLES_OK)
+	{
+		fprintf(stderr,"command failed: %s\n%s", sql, PQerrorMessage(conn));
+		PQclear(res);
+		exit(1);
+	}
+
+	PQclear(res);
+}
+
+static void
+check_sorted(const char *sql, int (*cmp)(const char *a, const char *b))
+{
+	int			i;
+	PGresult   *res = NULL;
+	PGresult   *prevres = NULL;
+	int			rowno;
+
+	fprintf(stderr, "running query: %s\n", sql);
+	if (!PQsendQuery(conn, sql))
+	{
+		fprintf(stderr,"query failed: %s\n%s", sql, PQerrorMessage(conn));
+		PQclear(res);
+		exit(1);
+	}
+	if (!PQsetSingleRowMode(conn))
+	{
+		fprintf(stderr,"setting single-row mode failed: %s", PQerrorMessage(conn));
+		PQclear(res);
+		exit(1);
+	}
+
+	rowno = 1;
+	while (res = PQgetResult(conn))
+	{
+		if (PQresultStatus(res) == PGRES_TUPLES_OK)
+			continue;
+		if (PQresultStatus(res) != PGRES_SINGLE_TUPLE)
+		{
+			fprintf(stderr,"error while fetching: %d, %s\n%s", PQresultStatus(res), sql, PQerrorMessage(conn));
+			PQclear(res);
+			exit(1);
+		}
+
+		if (prevres)
+		{
+			if (!cmp(PQgetvalue(prevres, 0, 0), PQgetvalue(res, 0, 0)))
+			{
+				fprintf(stderr,"FAIL: result not sorted, row %d: %s, prev %s\n", rowno,
+						PQgetvalue(res, 0, 0), PQgetvalue(prevres, 0, 0));
+				PQclear(res);
+				exit(1);
+			}
+			PQclear(prevres);
+		}
+		prevres = res;
+
+		rowno++;
+	}
+
+	if (prevres)
+		PQclear(prevres);
+}
+
+
+static int
+compare_strings(const char *a, const char *b)
+{
+	return strcmp(a, b) <= 0;
+}
+
+static int
+compare_ints(const char *a, const char *b)
+{
+	return atoi(a) <= atoi(b);
+}
+
+int
+main(int argc, char **argv)
+{
+	double duration;
+	char		buf[1000];
+
+	/* Make a connection to the database */
+	conn = PQconnectdb("");
+
+	/* Check to see that the backend connection was successfully made */
+	if (PQstatus(conn) != CONNECTION_OK)
+	{
+		fprintf(stderr, "Connection to database failed: %s",
+				PQerrorMessage(conn));
+		exit(1);
+	}
+	execute("set trace_sort=on");
+
+	execute("set work_mem = '4MB'");
+
+	check_sorted("SELECT * FROM small.ordered_ints ORDER BY i", compare_ints);
+	check_sorted("SELECT * FROM small.random_ints ORDER BY i", compare_ints);
+	check_sorted("SELECT * FROM small.ordered_text ORDER BY t", compare_strings);
+	check_sorted("SELECT * FROM small.random_text ORDER BY t", compare_strings);
+
+	execute("set work_mem = '16MB'");
+
+	check_sorted("SELECT * FROM medium.ordered_ints ORDER BY i", compare_ints);
+	check_sorted("SELECT * FROM medium.random_ints ORDER BY i", compare_ints);
+	check_sorted("SELECT * FROM medium.ordered_text ORDER BY t", compare_strings);
+	check_sorted("SELECT * FROM medium.random_text ORDER BY t", compare_strings);
+
+	execute("set work_mem = '256MB'");
+
+	check_sorted("SELECT * FROM medium.ordered_ints ORDER BY i", compare_ints);
+	check_sorted("SELECT * FROM medium.random_ints ORDER BY i", compare_ints);
+	check_sorted("SELECT * FROM medium.ordered_text ORDER BY t", compare_strings);
+	check_sorted("SELECT * FROM medium.random_text ORDER BY t", compare_strings);
+
+	execute("set work_mem = '512MB'");
+
+	check_sorted("SELECT * FROM medium.ordered_ints ORDER BY i", compare_ints);
+	check_sorted("SELECT * FROM medium.random_ints ORDER BY i", compare_ints);
+	check_sorted("SELECT * FROM medium.ordered_text ORDER BY t", compare_strings);
+	check_sorted("SELECT * FROM medium.random_text ORDER BY t", compare_strings);
+
+	execute("set work_mem = '2048MB'");
+
+	check_sorted("SELECT * FROM medium.ordered_ints ORDER BY i", compare_ints);
+	check_sorted("SELECT * FROM medium.random_ints ORDER BY i", compare_ints);
+	check_sorted("SELECT * FROM medium.ordered_text ORDER BY t", compare_strings);
+	check_sorted("SELECT * FROM medium.random_text ORDER BY t", compare_strings);
+
+	PQfinish(conn);
+
+	return 0;
+}
diff --git a/src/test/sorttestsuite/generate.c b/src/test/sorttestsuite/generate.c
new file mode 100644
index 0000000..f481189
--- /dev/null
+++ b/src/test/sorttestsuite/generate.c
@@ -0,0 +1,198 @@
+#include <math.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+static void
+generate_ordered_integers(int scale)
+{
+	int			rows = ((double) scale) * 28.75;
+	int			i;
+
+	printf("DROP TABLE IF EXISTS ordered_ints;\n");
+	printf("BEGIN;");
+	printf("CREATE TABLE ordered_ints (i int4);\n");
+	printf("COPY ordered_ints FROM STDIN WITH (FREEZE);\n");
+
+	for (i = 0; i < rows; i++)
+		printf("%d\n", i);
+
+	printf("\\.\n");
+	printf("COMMIT;\n");
+}
+
+static void
+generate_random_integers(int scale)
+{
+	int			rows = ((double) scale) * 28.75;
+	int			i;
+
+	printf("DROP TABLE IF EXISTS random_ints;\n");
+	printf("BEGIN;");
+	printf("CREATE TABLE random_ints (i int4);\n");
+	printf("COPY random_ints FROM STDIN WITH (FREEZE);\n");
+
+	for (i = 0; i < rows; i++)
+		printf("%ld\n", random());
+
+	printf("\\.\n");
+	printf("COMMIT;\n");
+}
+
+#define ALPHABET_SIZE 26
+static const char alphabet[ALPHABET_SIZE + 1] = "abcdefghijklmnopqrstuvwxyz";
+
+#define TEXT_LEN 50
+
+static void
+random_string(char *buf, int len)
+{
+	int			i;
+	long		r;
+	long		m;
+
+	m = 0;
+	for (i = 0; i < len; i++)
+	{
+		if (m / ALPHABET_SIZE < ALPHABET_SIZE)
+		{
+			m = RAND_MAX;
+			r = random();
+		}
+
+		*buf = alphabet[r % ALPHABET_SIZE];
+		m = m / ALPHABET_SIZE;
+		r = r / ALPHABET_SIZE;
+		buf++;
+	}
+	*buf = '\0';
+	return;
+}
+
+static void
+generate_random_text(int scale)
+{
+	int			rows = ((double) scale) * 12.7;
+	int			i;
+	char		buf[TEXT_LEN + 1] = { 0 };
+
+	printf("DROP TABLE IF EXISTS random_text;\n");
+	printf("BEGIN;");
+	printf("CREATE TABLE random_text (t text);\n");
+	printf("COPY random_text FROM STDIN WITH (FREEZE);\n");
+
+	for (i = 0; i < rows; i++)
+	{
+		random_string(buf, TEXT_LEN);
+		printf("%s\n", buf);
+	}
+
+	printf("\\.\n");
+	printf("COMMIT;\n");
+}
+
+static void
+generate_ordered_text(int scale)
+{
+	int			rows = ((double) scale) * 12.7;
+	int			i;
+	int			j;
+	char		indexes[TEXT_LEN] = {0};
+	char		buf[TEXT_LEN + 1];
+	double			digits;
+
+	printf("DROP TABLE IF EXISTS ordered_text;\n");
+	printf("BEGIN;");
+	printf("CREATE TABLE ordered_text (t text);\n");
+	printf("COPY ordered_text FROM STDIN WITH (FREEZE);\n");
+
+	/*
+	 * We don't want all the strings to have the same prefix.
+	 * That makes the comparisons very expensive. That might be an
+	 * interesting test case too, but not what we want here. To avoid
+	 * that, figure out how many characters will change, with the #
+	 * of rows we chose.
+	 */
+	digits = ceil(log(rows) / log((double) ALPHABET_SIZE));
+
+	if (digits > TEXT_LEN)
+		digits = TEXT_LEN;
+
+	for (i = 0; i < rows; i++)
+	{
+		for (j = 0; j < TEXT_LEN; j++)
+		{
+			buf[j] = alphabet[indexes[j]];
+		}
+		buf[j] = '\0';
+		printf("%s\n", buf);
+
+		/* increment last character, carrying if needed */
+		for (j = digits - 1; j >= 0; j--)
+		{
+			indexes[j]++;
+			if (indexes[j] == ALPHABET_SIZE)
+				indexes[j] = 0;
+			else
+				break;
+		}
+	}
+
+	printf("\\.\n");
+	printf("COMMIT;\n");
+}
+
+
+struct
+{
+	char *name;
+	void (*generate_func)(int scale);
+} datasets[] =
+{
+ 	{ "ordered_integers", generate_ordered_integers },
+	{ "random_integers", generate_random_integers },
+	{ "ordered_text", generate_ordered_text },
+	{ "random_text", generate_random_text },
+	{ NULL, NULL }
+};
+
+void
+usage()
+{
+	printf("Usage: generate <dataset name> [scale]\n");
+	exit(1);
+}
+
+int
+main(int argc, char **argv)
+{
+	int			scale;
+	int			i;
+	int			found = 0;
+
+	if (argc < 2)
+		usage();
+
+	if (argc >= 3)
+		scale = atoi(argv[2]);
+	else
+		scale = 1024; /* 1 MB */
+
+	for (i = 0; datasets[i].name != NULL; i++)
+	{
+		if (strcmp(argv[1], datasets[i].name) == 0 ||
+			strcmp(argv[1], "all") == 0)
+		{
+			fprintf (stderr, "Generating %s for %d kB...\n", datasets[i].name, scale);
+			datasets[i].generate_func(scale);
+			found = 1;
+		}
+	}
+
+	if (!found)
+	{
+		fprintf(stderr, "unrecognized test name %s\n", argv[1]);
+		exit(1);
+	}
+	exit(0);
+}
diff --git a/src/test/sorttestsuite/speed.c b/src/test/sorttestsuite/speed.c
new file mode 100644
index 0000000..3ebc57c
--- /dev/null
+++ b/src/test/sorttestsuite/speed.c
@@ -0,0 +1,139 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <sys/time.h>
+
+#include <libpq-fe.h>
+
+#define REPETITIONS 3
+
+static PGconn *conn;
+
+/* returns duration in ms */
+static double
+execute(const char *sql)
+{
+	struct timeval before, after;
+	PGresult   *res;
+
+	gettimeofday(&before, NULL);
+	res = PQexec(conn, sql);
+	gettimeofday(&after, NULL);
+	if (PQresultStatus(res) != PGRES_COMMAND_OK && PQresultStatus(res) != PGRES_TUPLES_OK)
+	{
+		fprintf(stderr,"command failed: %s\n%s", sql, PQerrorMessage(conn));
+		PQclear(res);
+		exit(1);
+	}
+	PQclear(res);
+
+	return (((double) (after.tv_sec - before.tv_sec)) * 1000.0 + ((double) (after.tv_usec - before.tv_usec) / 1000.0));
+}
+
+static void
+execute_test(const char *testname, const char *query)
+{
+	double		duration;
+	char		buf[100];
+	int			i;
+
+	printf ("%s: ", testname);
+	fflush(stdout);
+	for (i = 0; i < REPETITIONS; i++)
+	{
+		duration = execute(query);
+
+		if (i > 0)
+			printf(", ");
+		printf("%.0f ms", duration);
+		fflush(stdout);
+	}
+	printf("\n");
+}
+
+int
+main(int argc, char **argv)
+{
+	double duration;
+	char		buf[1000];
+
+	/* Make a connection to the database */
+	conn = PQconnectdb("");
+
+	/* Check to see that the backend connection was successfully made */
+	if (PQstatus(conn) != CONNECTION_OK)
+	{
+		fprintf(stderr, "Connection to database failed: %s",
+				PQerrorMessage(conn));
+		exit(1);
+	}
+
+	execute("set trace_sort=on");
+
+	printf("# Tests on small tables (1 MB), 4MB work_mem\n");
+	printf("# Performs a quicksort\n");
+	printf("-----\n");
+	execute("set work_mem='4MB'");
+	execute_test("ordered_ints", "SELECT COUNT(*) FROM (SELECT * FROM small.ordered_ints ORDER BY i) t");
+	execute_test("random_ints",  "SELECT COUNT(*) FROM (SELECT * FROM small.random_ints ORDER BY i) t");
+	execute_test("ordered_text", "SELECT COUNT(*) FROM (SELECT * FROM small.ordered_text ORDER BY t) t");
+	execute_test("random_text",  "SELECT COUNT(*) FROM (SELECT * FROM small.random_text ORDER BY t) t");
+	printf("\n");
+
+	printf("# Tests on medium-sized tables (1 GB), 4MB work_mem\n");
+	printf("# Performs an external sort, but the table still fits in OS cache\n");
+	printf("# Needs a multi-stage merge\n");
+	printf("-----\n");
+	execute("set work_mem='4MB'");
+	execute_test("ordered_ints", "SELECT COUNT(*) FROM (SELECT * FROM medium.ordered_ints ORDER BY i) t");
+	execute_test("random_ints",  "SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER BY i) t");
+	execute_test("ordered_text", "SELECT COUNT(*) FROM (SELECT * FROM medium.ordered_text ORDER BY t) t");
+	execute_test("random_text",  "SELECT COUNT(*) FROM (SELECT * FROM medium.random_text ORDER BY t) t");
+	printf("\n");
+
+	printf("# Tests on medium-sized tables (1 GB), 16MB work_mem\n");
+	printf("# Same as previous test, but with larger work_mem\n");
+	printf("-----\n");
+	execute("set work_mem='16MB'");
+	execute_test("ordered_ints", "SELECT COUNT(*) FROM (SELECT * FROM medium.ordered_ints ORDER BY i) t");
+	execute_test("random_ints",  "SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER BY i) t");
+	execute_test("ordered_text", "SELECT COUNT(*) FROM (SELECT * FROM medium.ordered_text ORDER BY t) t");
+	execute_test("random_text",  "SELECT COUNT(*) FROM (SELECT * FROM medium.random_text ORDER BY t) t");
+	printf("\n");
+
+	printf("# Tests on medium-sized tables (1 GB), 256MB work_mem\n");
+	printf("# This works with a single merge pass\n");
+	printf("-----\n");
+	execute("set work_mem='256MB'");
+	execute_test("ordered_ints", "SELECT COUNT(*) FROM (SELECT * FROM medium.ordered_ints ORDER BY i) t");
+	execute_test("random_ints",  "SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER BY i) t");
+	execute_test("ordered_text", "SELECT COUNT(*) FROM (SELECT * FROM medium.ordered_text ORDER BY t) t");
+	execute_test("random_text",  "SELECT COUNT(*) FROM (SELECT * FROM medium.random_text ORDER BY t) t");
+	printf("\n");
+
+	printf("# Tests on medium-sized tables (1 GB), 512MB work_mem\n");
+	printf("# This works with a single merge pass\n");
+	printf("-----\n");
+	execute("set work_mem='512MB'");
+	execute_test("ordered_ints", "SELECT COUNT(*) FROM (SELECT * FROM medium.ordered_ints ORDER BY i) t");
+	execute_test("random_ints",  "SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER BY i) t");
+	execute_test("ordered_text", "SELECT COUNT(*) FROM (SELECT * FROM medium.ordered_text ORDER BY t) t");
+	execute_test("random_text",  "SELECT COUNT(*) FROM (SELECT * FROM medium.random_text ORDER BY t) t");
+	printf("\n");
+	
+	printf("# Tests on medium-sized tables (1 GB), 2GB work_mem\n");
+	printf("# I thought 2GB would be enough to do a quicksort, but because of\n");
+	printf("# SortTuple overhead (?), it doesn't fit. Performs an external sort with two runs\n");
+	printf("-----\n");
+	execute("set work_mem='2048MB'");
+	execute_test("ordered_ints", "SELECT COUNT(*) FROM (SELECT * FROM medium.ordered_ints ORDER BY i) t");
+	execute_test("random_ints",  "SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER BY i) t");
+	execute_test("ordered_text", "SELECT COUNT(*) FROM (SELECT * FROM medium.ordered_text ORDER BY t) t");
+	execute_test("random_text",  "SELECT COUNT(*) FROM (SELECT * FROM medium.random_text ORDER BY t) t");
+	printf("\n");
+
+	PQfinish(conn);
+
+	return 0;
+}
-- 
2.9.3

results-master.txt (text/plain; charset=UTF-8)
results-patched.txt (text/plain; charset=UTF-8)
#5Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#4)
4 attachment(s)
Re: Tuplesort merge pre-reading

On 09/08/2016 09:59 PM, Heikki Linnakangas wrote:

On 09/06/2016 10:26 PM, Peter Geoghegan wrote:

On Tue, Sep 6, 2016 at 12:08 PM, Peter Geoghegan <pg@heroku.com> wrote:

Offhand, I would think that taken together this is very important. I'd
certainly want to see cases in the hundreds of megabytes or gigabytes
of work_mem alongside your 4MB case, even just to be able to talk
informally about this. As you know, the default work_mem value is very
conservative.

I spent some more time polishing this up, and also added some code to
logtape.c to use larger read buffers, to compensate for the fact that
we don't do pre-reading from tuplesort.c anymore. That should trigger
the OS read-ahead and make the I/O more sequential, which was the
purpose of the old pre-reading code, but in a simpler way. I haven't
tested that part much yet, but I plan to run some tests on larger data
sets that don't fit in RAM, to make the I/O effects visible.
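To make the read-buffer idea concrete, here is a minimal standalone sketch, not the logtape.c code from the patch. It assumes an 8 kB block size and models a tape as an in-memory byte array; SketchTape, sketch_tape_init and sketch_tape_read are hypothetical names used only for illustration. The point it shows is that a buffer spanning several blocks is refilled with one large contiguous read instead of many single-block reads.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_BLCKSZ 8192		/* assumed block size, for illustration */

/* Hypothetical stand-in for one logical tape backed by a byte array. */
typedef struct SketchTape
{
	const char *data;			/* backing store (stands in for the temp file) */
	size_t		datalen;		/* total bytes in the backing store */
	size_t		filepos;		/* next byte to fetch from the backing store */
	char	   *buf;			/* read buffer */
	size_t		bufsize;		/* buffer size, a multiple of SKETCH_BLCKSZ */
	size_t		buflen;			/* valid bytes currently in buf */
	size_t		bufpos;			/* next unread byte in buf */
	long		nrefills;		/* number of refills, i.e. "I/O requests" */
} SketchTape;

void
sketch_tape_init(SketchTape *tape, const char *data, size_t datalen,
				 char *buf, size_t bufsize)
{
	assert(bufsize > 0 && bufsize % SKETCH_BLCKSZ == 0);
	tape->data = data;
	tape->datalen = datalen;
	tape->filepos = 0;
	tape->buf = buf;
	tape->bufsize = bufsize;
	tape->buflen = 0;
	tape->bufpos = 0;
	tape->nrefills = 0;
}

/* Copy up to len bytes into dst; returns the number of bytes copied. */
size_t
sketch_tape_read(SketchTape *tape, char *dst, size_t len)
{
	size_t		nread = 0;

	while (nread < len)
	{
		size_t		avail;

		if (tape->bufpos >= tape->buflen)
		{
			/* Buffer exhausted: refill it with one contiguous chunk. */
			size_t		remaining = tape->datalen - tape->filepos;
			size_t		chunk = remaining < tape->bufsize ? remaining : tape->bufsize;

			if (chunk == 0)
				break;			/* end of tape */
			memcpy(tape->buf, tape->data + tape->filepos, chunk);
			tape->filepos += chunk;
			tape->buflen = chunk;
			tape->bufpos = 0;
			tape->nrefills++;
		}

		avail = tape->buflen - tape->bufpos;
		if (avail > len - nread)
			avail = len - nread;
		memcpy(dst + nread, tape->buf + tape->bufpos, avail);
		tape->bufpos += avail;
		nread += avail;
	}
	return nread;
}
```

With a ten-block tape, a one-block buffer needs ten refills while a four-block buffer needs three. In a real file-backed tape, each refill would be a single larger sequential read, which is the access pattern that OS read-ahead tends to reward.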

Ok, I ran a few tests with 20 GB tables. I thought this would show any
differences in I/O behaviour, but in fact it was still completely CPU
bound, like the tests on smaller tables I posted yesterday. I guess I
need to point temp_tablespaces to a USB drive or something. But here we go.

It looks like there was a regression when sorting random text, with 256
MB work_mem. I suspect that was a fluke - I only ran these tests once
because they took so long. But I don't know for sure.

Claudio, if you could also repeat the tests you ran on Peter's patch set
on the other thread, with these patches, that'd be nice. These patches
are effectively a replacement for
0002-Use-tuplesort-batch-memory-for-randomAccess-sorts.patch. And review
would be much appreciated too, of course.
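For reviewers who want the gist of the memory scheme in the first patch before reading the diff: it keeps fixed-size buffers on a free list and falls back to an ordinary allocation for oversized tuples. The following is a standalone sketch of that idea using malloc/free rather than palloc/pfree. All Sketch* names are hypothetical, and falling back to malloc when the free list is empty is an assumption of this sketch, not something stated in the patch comments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_BUFFER_SIZE 1024	/* mirrors MERGETUPLEBUFFER_SIZE in the patch */

/* A buffer is either on the free list (nextfree valid) or holds a tuple. */
typedef union SketchBuffer
{
	union SketchBuffer *nextfree;
	char		buffer[SKETCH_BUFFER_SIZE];
} SketchBuffer;

typedef struct SketchArena
{
	char	   *begin;			/* start of the single large allocation */
	char	   *end;			/* one past its last byte */
	SketchBuffer *freehead;		/* head of the free list */
} SketchArena;

/* Allocate nbuffers buffers in one block and chain them into a free list. */
int
sketch_arena_init(SketchArena *arena, int nbuffers)
{
	SketchBuffer *bufs;
	int			i;

	assert(nbuffers > 0);
	bufs = malloc((size_t) nbuffers * sizeof(SketchBuffer));
	if (bufs == NULL)
		return -1;

	arena->begin = (char *) bufs;
	arena->end = arena->begin + (size_t) nbuffers * sizeof(SketchBuffer);
	for (i = 0; i < nbuffers - 1; i++)
		bufs[i].nextfree = &bufs[i + 1];
	bufs[nbuffers - 1].nextfree = NULL;
	arena->freehead = bufs;
	return 0;
}

/* Does p point into the arena?  Decided purely by address range. */
int
sketch_in_arena(const SketchArena *arena, const void *p)
{
	return (uintptr_t) p >= (uintptr_t) arena->begin &&
		(uintptr_t) p < (uintptr_t) arena->end;
}

/* Hand out a buffer; oversized requests (or an empty list) use malloc. */
void *
sketch_arena_alloc(SketchArena *arena, size_t len)
{
	SketchBuffer *buf;

	if (len > SKETCH_BUFFER_SIZE || arena->freehead == NULL)
		return malloc(len);

	buf = arena->freehead;
	arena->freehead = buf->nextfree;
	return buf;
}

/* Return a buffer to the free list, or free it if it was malloc'd. */
void
sketch_arena_release(SketchArena *arena, void *p)
{
	if (sketch_in_arena(arena, p))
	{
		SketchBuffer *buf = (SketchBuffer *) p;

		buf->nextfree = arena->freehead;
		arena->freehead = buf;
	}
	else
		free(p);
}

void
sketch_arena_destroy(SketchArena *arena)
{
	free(arena->begin);
	arena->begin = arena->end = NULL;
	arena->freehead = NULL;
}
```

Because a released buffer is pushed onto the head of the list, the next allocation reuses the most recently released buffer. No per-tuple header is needed to tell arena buffers from ordinary allocations, since the address range check is enough.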

Attached are new versions. Compared to the last set, they contain a few
comment fixes, and a change to the 2nd patch to not allocate tape
buffers for tapes that were completely unused.
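As a rough illustration of skipping unused tapes, the sketch below divides a read-buffer budget only among tapes that actually hold runs. It is an assumed policy written for this note, not the logic of the second patch: the equal split, the rounding down to whole blocks, the one-block minimum, and the name sketch_assign_read_buffers are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_BLCKSZ 8192		/* assumed block size, for illustration */

/*
 * Fill bufsizes[i] with the read-buffer size for tape i.  Tapes with no
 * runs get 0, so no buffer is allocated for them.  Returns the number of
 * tapes that received a buffer.
 */
int
sketch_assign_read_buffers(const int *nruns, int ntapes,
						   size_t budget, size_t *bufsizes)
{
	int			nused = 0;
	int			i;
	size_t		per_tape;

	assert(ntapes > 0);

	for (i = 0; i < ntapes; i++)
	{
		bufsizes[i] = 0;
		if (nruns[i] > 0)
			nused++;
	}
	if (nused == 0)
		return 0;

	/* Split the budget evenly, rounded down to a whole number of blocks. */
	per_tape = budget / (size_t) nused;
	per_tape -= per_tape % SKETCH_BLCKSZ;
	if (per_tape < SKETCH_BLCKSZ)
		per_tape = SKETCH_BLCKSZ;	/* every used tape needs one block */

	for (i = 0; i < ntapes; i++)
	{
		if (nruns[i] > 0)
			bufsizes[i] = per_tape;
	}
	return nused;
}
```

With six tapes of which three hold runs and a 1 MB budget, each used tape gets 42 blocks (344064 bytes) instead of the 21 blocks it would get if the budget were split six ways.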

- Heikki

Attachments:

results-large-master.txt (text/plain; charset=UTF-8)
results-large-patched.txt (text/plain; charset=UTF-8)
0001-Don-t-bother-to-pre-read-tuples-into-SortTuple-slots.patch (text/x-diff)
From 90137ebfac0d5f2e80e2fb24cd12bfb664367f5d Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Fri, 9 Sep 2016 14:10:05 +0300
Subject: [PATCH 1/2] Don't bother to pre-read tuples into SortTuple slots
 during merge.

That only seems to add overhead. We're doing the same number of READTUP()
calls either way, but we're spreading the memory usage over a larger area
if we try to pre-read, so it doesn't seem worth it.

The pre-reading can be helpful, to trigger the OS readahead of the
underlying tape, because it will make the read pattern appear more
sequential. But we'll fix that in the next patch, by teaching logtape.c to
read in larger chunks.
---
 src/backend/utils/sort/tuplesort.c | 903 ++++++++++---------------------------
 1 file changed, 226 insertions(+), 677 deletions(-)

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index c8fbcf8..a6d167a 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -162,7 +162,7 @@ bool		optimize_bounded_sort = true;
  * The objects we actually sort are SortTuple structs.  These contain
  * a pointer to the tuple proper (might be a MinimalTuple or IndexTuple),
  * which is a separate palloc chunk --- we assume it is just one chunk and
- * can be freed by a simple pfree() (except during final on-the-fly merge,
+ * can be freed by a simple pfree() (except during merge,
  * when memory is used in batch).  SortTuples also contain the tuple's
  * first key column in Datum/nullflag format, and an index integer.
  *
@@ -191,9 +191,8 @@ bool		optimize_bounded_sort = true;
  * it now only distinguishes RUN_FIRST and HEAP_RUN_NEXT, since replacement
  * selection is always abandoned after the first run; no other run number
  * should be represented here.  During merge passes, we re-use it to hold the
- * input tape number that each tuple in the heap was read from, or to hold the
- * index of the next tuple pre-read from the same tape in the case of pre-read
- * entries.  tupindex goes unused if the sort occurs entirely in memory.
+ * input tape number that each tuple in the heap was read from.  tupindex goes
+ * unused if the sort occurs entirely in memory.
  */
 typedef struct
 {
@@ -203,6 +202,20 @@ typedef struct
 	int			tupindex;		/* see notes above */
 } SortTuple;
 
+/*
+ * During merge, we use a pre-allocated set of fixed-size buffers to store
+ * tuples in, to avoid palloc/pfree overhead.
+ *
+ * 'nextfree' is valid when this chunk is in the free list. When in use, the
+ * buffer holds a tuple.
+ */
+#define MERGETUPLEBUFFER_SIZE 1024
+
+typedef union MergeTupleBuffer
+{
+	union MergeTupleBuffer *nextfree;
+	char		buffer[MERGETUPLEBUFFER_SIZE];
+} MergeTupleBuffer;
 
 /*
  * Possible states of a Tuplesort object.  These denote the states that
@@ -307,14 +320,6 @@ struct Tuplesortstate
 										int tapenum, unsigned int len);
 
 	/*
-	 * Function to move a caller tuple.  This is usually implemented as a
-	 * memmove() shim, but function may also perform additional fix-up of
-	 * caller tuple where needed.  Batch memory support requires the movement
-	 * of caller tuples from one location in memory to another.
-	 */
-	void		(*movetup) (void *dest, void *src, unsigned int len);
-
-	/*
 	 * This array holds the tuples now in sort memory.  If we are in state
 	 * INITIAL, the tuples are in no particular order; if we are in state
 	 * SORTEDINMEM, the tuples are in final sorted order; in states BUILDRUNS
@@ -332,12 +337,40 @@ struct Tuplesortstate
 	/*
 	 * Memory for tuples is sometimes allocated in batch, rather than
 	 * incrementally.  This implies that incremental memory accounting has
-	 * been abandoned.  Currently, this only happens for the final on-the-fly
-	 * merge step.  Large batch allocations can store tuples (e.g.
-	 * IndexTuples) without palloc() fragmentation and other overhead.
+	 * been abandoned.  Currently, this happens when we start merging.
+	 * Large batch allocations can store tuples (e.g. IndexTuples) without
+	 * palloc() fragmentation and other overhead.
+	 *
+	 * For the batch memory, we use one large allocation, divided into
+	 * MERGETUPLEBUFFER_SIZE chunks. The allocation is sized to hold
+	 * one chunk per tape, plus one additional chunk. We need that many
+	 * chunks to hold all the tuples kept in the heap during merge, plus
+	 * the one we have last returned from the sort.
+	 *
+	 * Initially, all the chunks are kept in a linked list, in freeBufferHead.
+	 * When a tuple is read from a tape, it is put to the next available
+	 * chunk, if it fits. If the tuple is larger than MERGETUPLEBUFFER_SIZE,
+	 * it is palloc'd instead.
+	 *
+	 * When we're done processing a tuple, we return the chunk back to the
+	 * free list, or pfree() if it was palloc'd. We know that a tuple was
+	 * allocated from the batch memory arena, if its pointer value is between
+	 * batchMemoryBegin and -End.
 	 */
 	bool		batchUsed;
 
+	char	   *batchMemoryBegin;	/* beginning of batch memory arena */
+	char	   *batchMemoryEnd;		/* end of batch memory arena */
+	MergeTupleBuffer *freeBufferHead;	/* head of free list */
+
+	/*
+	 * When we return a tuple to the caller that came from a tape (that is,
+	 * in TSS_SORTEDONTAPE or TSS_FINALMERGE modes), we remember the tuple
+	 * in 'readlasttuple', so that we can recycle the memory on next
+	 * gettuple call.
+	 */
+	void	   *readlasttuple;
+
 	/*
 	 * While building initial runs, this indicates if the replacement
 	 * selection strategy is in use.  When it isn't, then a simple hybrid
@@ -358,42 +391,11 @@ struct Tuplesortstate
 	 */
 
 	/*
-	 * These variables are only used during merge passes.  mergeactive[i] is
+	 * This variable is only used during merge passes.  mergeactive[i] is
 	 * true if we are reading an input run from (actual) tape number i and
-	 * have not yet exhausted that run.  mergenext[i] is the memtuples index
-	 * of the next pre-read tuple (next to be loaded into the heap) for tape
-	 * i, or 0 if we are out of pre-read tuples.  mergelast[i] similarly
-	 * points to the last pre-read tuple from each tape.  mergeavailslots[i]
-	 * is the number of unused memtuples[] slots reserved for tape i, and
-	 * mergeavailmem[i] is the amount of unused space allocated for tape i.
-	 * mergefreelist and mergefirstfree keep track of unused locations in the
-	 * memtuples[] array.  The memtuples[].tupindex fields link together
-	 * pre-read tuples for each tape as well as recycled locations in
-	 * mergefreelist. It is OK to use 0 as a null link in these lists, because
-	 * memtuples[0] is part of the merge heap and is never a pre-read tuple.
+	 * have not yet exhausted that run.
 	 */
 	bool	   *mergeactive;	/* active input run source? */
-	int		   *mergenext;		/* first preread tuple for each source */
-	int		   *mergelast;		/* last preread tuple for each source */
-	int		   *mergeavailslots;	/* slots left for prereading each tape */
-	int64	   *mergeavailmem;	/* availMem for prereading each tape */
-	int			mergefreelist;	/* head of freelist of recycled slots */
-	int			mergefirstfree; /* first slot never used in this merge */
-
-	/*
-	 * Per-tape batch state, when final on-the-fly merge consumes memory from
-	 * just a few large allocations.
-	 *
-	 * Aside from the general benefits of performing fewer individual retail
-	 * palloc() calls, this also helps make merging more cache efficient,
-	 * since each tape's tuples must naturally be accessed sequentially (in
-	 * sorted order).
-	 */
-	int64		spacePerTape;	/* Space (memory) for tuples (not slots) */
-	char	  **mergetuples;	/* Each tape's memory allocation */
-	char	  **mergecurrent;	/* Current offset into each tape's memory */
-	char	  **mergetail;		/* Last item's start point for each tape */
-	char	  **mergeoverflow;	/* Retail palloc() "overflow" for each tape */
 
 	/*
 	 * Variables for Algorithm D.  Note that destTape is a "logical" tape
@@ -481,11 +483,33 @@ struct Tuplesortstate
 #endif
 };
 
+/*
+ * Is the given tuple allocated from the batch memory arena?
+ */
+#define IS_MERGETUPLE_BUFFER(state, tuple) \
+	((char *) tuple >= state->batchMemoryBegin && \
+	 (char *) tuple < state->batchMemoryEnd)
+
+/*
+ * Return the given tuple to the batch memory free list, or free it
+ * if it was palloc'd.
+ */
+#define RELEASE_MERGETUPLE_BUFFER(state, tuple) \
+	do { \
+		MergeTupleBuffer *buf = (MergeTupleBuffer *) tuple; \
+		\
+		if (IS_MERGETUPLE_BUFFER(state, tuple)) \
+		{ \
+			buf->nextfree = state->freeBufferHead; \
+			state->freeBufferHead = buf; \
+		} else \
+			pfree(tuple); \
+	} while(0)
+
 #define COMPARETUP(state,a,b)	((*(state)->comparetup) (a, b, state))
 #define COPYTUP(state,stup,tup) ((*(state)->copytup) (state, stup, tup))
 #define WRITETUP(state,tape,stup)	((*(state)->writetup) (state, tape, stup))
 #define READTUP(state,stup,tape,len) ((*(state)->readtup) (state, stup, tape, len))
-#define MOVETUP(dest,src,len) ((*(state)->movetup) (dest, src, len))
 #define LACKMEM(state)		((state)->availMem < 0 && !(state)->batchUsed)
 #define USEMEM(state,amt)	((state)->availMem -= (amt))
 #define FREEMEM(state,amt)	((state)->availMem += (amt))
@@ -553,16 +577,8 @@ static void inittapes(Tuplesortstate *state);
 static void selectnewtape(Tuplesortstate *state);
 static void mergeruns(Tuplesortstate *state);
 static void mergeonerun(Tuplesortstate *state);
-static void beginmerge(Tuplesortstate *state, bool finalMergeBatch);
-static void batchmemtuples(Tuplesortstate *state);
-static void mergebatch(Tuplesortstate *state, int64 spacePerTape);
-static void mergebatchone(Tuplesortstate *state, int srcTape,
-			  SortTuple *stup, bool *should_free);
-static void mergebatchfreetape(Tuplesortstate *state, int srcTape,
-				   SortTuple *rtup, bool *should_free);
-static void *mergebatchalloc(Tuplesortstate *state, int tapenum, Size tuplen);
-static void mergepreread(Tuplesortstate *state);
-static void mergeprereadone(Tuplesortstate *state, int srcTape);
+static void beginmerge(Tuplesortstate *state);
+static bool mergereadnext(Tuplesortstate *state, int srcTape, SortTuple *stup);
 static void dumptuples(Tuplesortstate *state, bool alltuples);
 static void dumpbatch(Tuplesortstate *state, bool alltuples);
 static void make_bounded_heap(Tuplesortstate *state);
@@ -574,7 +590,7 @@ static void tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex);
 static void reversedirection(Tuplesortstate *state);
 static unsigned int getlen(Tuplesortstate *state, int tapenum, bool eofOK);
 static void markrunend(Tuplesortstate *state, int tapenum);
-static void *readtup_alloc(Tuplesortstate *state, int tapenum, Size tuplen);
+static void *readtup_alloc(Tuplesortstate *state, Size tuplen);
 static int comparetup_heap(const SortTuple *a, const SortTuple *b,
 				Tuplesortstate *state);
 static void copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup);
@@ -582,7 +598,6 @@ static void writetup_heap(Tuplesortstate *state, int tapenum,
 			  SortTuple *stup);
 static void readtup_heap(Tuplesortstate *state, SortTuple *stup,
 			 int tapenum, unsigned int len);
-static void movetup_heap(void *dest, void *src, unsigned int len);
 static int comparetup_cluster(const SortTuple *a, const SortTuple *b,
 				   Tuplesortstate *state);
 static void copytup_cluster(Tuplesortstate *state, SortTuple *stup, void *tup);
@@ -590,7 +605,6 @@ static void writetup_cluster(Tuplesortstate *state, int tapenum,
 				 SortTuple *stup);
 static void readtup_cluster(Tuplesortstate *state, SortTuple *stup,
 				int tapenum, unsigned int len);
-static void movetup_cluster(void *dest, void *src, unsigned int len);
 static int comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 					   Tuplesortstate *state);
 static int comparetup_index_hash(const SortTuple *a, const SortTuple *b,
@@ -600,7 +614,6 @@ static void writetup_index(Tuplesortstate *state, int tapenum,
 			   SortTuple *stup);
 static void readtup_index(Tuplesortstate *state, SortTuple *stup,
 			  int tapenum, unsigned int len);
-static void movetup_index(void *dest, void *src, unsigned int len);
 static int comparetup_datum(const SortTuple *a, const SortTuple *b,
 				 Tuplesortstate *state);
 static void copytup_datum(Tuplesortstate *state, SortTuple *stup, void *tup);
@@ -608,7 +621,6 @@ static void writetup_datum(Tuplesortstate *state, int tapenum,
 			   SortTuple *stup);
 static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
 			  int tapenum, unsigned int len);
-static void movetup_datum(void *dest, void *src, unsigned int len);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
 
 /*
@@ -760,7 +772,6 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
 	state->readtup = readtup_heap;
-	state->movetup = movetup_heap;
 
 	state->tupDesc = tupDesc;	/* assume we need not copy tupDesc */
 	state->abbrevNext = 10;
@@ -833,7 +844,6 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 	state->copytup = copytup_cluster;
 	state->writetup = writetup_cluster;
 	state->readtup = readtup_cluster;
-	state->movetup = movetup_cluster;
 	state->abbrevNext = 10;
 
 	state->indexInfo = BuildIndexInfo(indexRel);
@@ -925,7 +935,6 @@ tuplesort_begin_index_btree(Relation heapRel,
 	state->copytup = copytup_index;
 	state->writetup = writetup_index;
 	state->readtup = readtup_index;
-	state->movetup = movetup_index;
 	state->abbrevNext = 10;
 
 	state->heapRel = heapRel;
@@ -993,7 +1002,6 @@ tuplesort_begin_index_hash(Relation heapRel,
 	state->copytup = copytup_index;
 	state->writetup = writetup_index;
 	state->readtup = readtup_index;
-	state->movetup = movetup_index;
 
 	state->heapRel = heapRel;
 	state->indexRel = indexRel;
@@ -1036,7 +1044,6 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	state->copytup = copytup_datum;
 	state->writetup = writetup_datum;
 	state->readtup = readtup_datum;
-	state->movetup = movetup_datum;
 	state->abbrevNext = 10;
 
 	state->datumType = datumType;
@@ -1881,14 +1888,33 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 		case TSS_SORTEDONTAPE:
 			Assert(forward || state->randomAccess);
 			Assert(!state->batchUsed);
-			*should_free = true;
+
+			/*
+			 * The buffer holding the tuple that we returned in previous
+			 * gettuple call can now be reused.
+			 */
+			if (state->readlasttuple)
+			{
+				RELEASE_MERGETUPLE_BUFFER(state, state->readlasttuple);
+				state->readlasttuple = NULL;
+			}
+
 			if (forward)
 			{
 				if (state->eof_reached)
 					return false;
+
 				if ((tuplen = getlen(state, state->result_tape, true)) != 0)
 				{
 					READTUP(state, stup, state->result_tape, tuplen);
+
+					/*
+					 * Remember the tuple we return, so that we can recycle its
+					 * memory on next call. (This can be NULL, in the Datum case).
+					 */
+					state->readlasttuple = stup->tuple;
+
+					*should_free = false;
 					return true;
 				}
 				else
@@ -1962,68 +1988,58 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 									  tuplen))
 				elog(ERROR, "bogus tuple length in backward scan");
 			READTUP(state, stup, state->result_tape, tuplen);
+
+			/*
+			 * Remember the tuple we return, so that we can recycle its
+			 * memory on next call. (This can be NULL, in the Datum case).
+			 */
+			state->readlasttuple = stup->tuple;
+
+			*should_free = false;
 			return true;
 
 		case TSS_FINALMERGE:
 			Assert(forward);
 			Assert(state->batchUsed || !state->tuples);
-			/* For now, assume tuple is stored in tape's batch memory */
+			/* We are managing memory ourselves, with the batch memory arena. */
 			*should_free = false;
 
 			/*
+			 * The buffer holding the tuple that we returned in previous
+			 * gettuple call can now be reused.
+			 */
+			if (state->readlasttuple)
+			{
+				RELEASE_MERGETUPLE_BUFFER(state, state->readlasttuple);
+				state->readlasttuple = NULL;
+			}
+
+			/*
 			 * This code should match the inner loop of mergeonerun().
 			 */
 			if (state->memtupcount > 0)
 			{
 				int			srcTape = state->memtuples[0].tupindex;
-				int			tupIndex;
-				SortTuple  *newtup;
+				SortTuple	newtup;
 
-				/*
-				 * Returned tuple is still counted in our memory space most of
-				 * the time.  See mergebatchone() for discussion of why caller
-				 * may occasionally be required to free returned tuple, and
-				 * how preread memory is managed with regard to edge cases
-				 * more generally.
-				 */
 				*stup = state->memtuples[0];
 				tuplesort_heap_siftup(state, false);
-				if ((tupIndex = state->mergenext[srcTape]) == 0)
-				{
-					/*
-					 * out of preloaded data on this tape, try to read more
-					 *
-					 * Unlike mergeonerun(), we only preload from the single
-					 * tape that's run dry, though not before preparing its
-					 * batch memory for a new round of sequential consumption.
-					 * See mergepreread() comments.
-					 */
-					if (state->batchUsed)
-						mergebatchone(state, srcTape, stup, should_free);
 
-					mergeprereadone(state, srcTape);
+				/*
+				 * Remember the tuple we return, so that we can recycle its
+				 * memory on next call. (This can be NULL, in the Datum case).
+				 */
+				state->readlasttuple = stup->tuple;
 
-					/*
-					 * if still no data, we've reached end of run on this tape
-					 */
-					if ((tupIndex = state->mergenext[srcTape]) == 0)
-					{
-						/* Free tape's buffer, avoiding dangling pointer */
-						if (state->batchUsed)
-							mergebatchfreetape(state, srcTape, stup, should_free);
-						return true;
-					}
+				/* pull next tuple from tape, insert in heap */
+				if (!mergereadnext(state, srcTape, &newtup))
+				{
+					/* if no more data, we've reached end of run on this tape */
+					return true;
 				}
-				/* pull next preread tuple from list, insert in heap */
-				newtup = &state->memtuples[tupIndex];
-				state->mergenext[srcTape] = newtup->tupindex;
-				if (state->mergenext[srcTape] == 0)
-					state->mergelast[srcTape] = 0;
-				tuplesort_heap_insert(state, newtup, srcTape, false);
-				/* put the now-unused memtuples entry on the freelist */
-				newtup->tupindex = state->mergefreelist;
-				state->mergefreelist = tupIndex;
-				state->mergeavailslots[srcTape]++;
+
+				tuplesort_heap_insert(state, &newtup, srcTape, false);
+
 				return true;
 			}
 			return false;
@@ -2325,7 +2341,8 @@ inittapes(Tuplesortstate *state)
 #endif
 
 	/*
-	 * Decrease availMem to reflect the space needed for tape buffers; but
+	 * Decrease availMem to reflect the space needed for tape buffers, when
+	 * writing the initial runs; but
 	 * don't decrease it to the point that we have no room for tuples. (That
 	 * case is only likely to occur if sorting pass-by-value Datums; in all
 	 * other scenarios the memtuples[] array is unlikely to occupy more than
@@ -2350,14 +2367,6 @@ inittapes(Tuplesortstate *state)
 	state->tapeset = LogicalTapeSetCreate(maxTapes);
 
 	state->mergeactive = (bool *) palloc0(maxTapes * sizeof(bool));
-	state->mergenext = (int *) palloc0(maxTapes * sizeof(int));
-	state->mergelast = (int *) palloc0(maxTapes * sizeof(int));
-	state->mergeavailslots = (int *) palloc0(maxTapes * sizeof(int));
-	state->mergeavailmem = (int64 *) palloc0(maxTapes * sizeof(int64));
-	state->mergetuples = (char **) palloc0(maxTapes * sizeof(char *));
-	state->mergecurrent = (char **) palloc0(maxTapes * sizeof(char *));
-	state->mergetail = (char **) palloc0(maxTapes * sizeof(char *));
-	state->mergeoverflow = (char **) palloc0(maxTapes * sizeof(char *));
 	state->tp_fib = (int *) palloc0(maxTapes * sizeof(int));
 	state->tp_runs = (int *) palloc0(maxTapes * sizeof(int));
 	state->tp_dummy = (int *) palloc0(maxTapes * sizeof(int));
@@ -2468,6 +2477,8 @@ mergeruns(Tuplesortstate *state)
 				svTape,
 				svRuns,
 				svDummy;
+	char	   *p;
+	int			i;
 
 	Assert(state->status == TSS_BUILDRUNS);
 	Assert(state->memtupcount == 0);
@@ -2504,6 +2515,36 @@ mergeruns(Tuplesortstate *state)
 		return;
 	}
 
+	/*
+	 * We no longer need a large memtuples array, because the merge heap
+	 * holds only one tuple per tape. Shrink the array to one slot per tape,
+	 * to make the memory available for other use.
+	 */
+	pfree(state->memtuples);
+	FREEMEM(state, state->memtupsize * sizeof(SortTuple));
+	state->memtupsize = state->maxTapes;
+	state->memtuples = (SortTuple *) palloc(state->maxTapes * sizeof(SortTuple));
+	USEMEM(state, state->memtupsize * sizeof(SortTuple));
+
+	/*
+	 * From this point on, we no longer use the USEMEM()/LACKMEM() mechanism to
+	 * track memory usage.
+	 */
+	state->batchUsed = true;
+
+	/* Initialize the merge tuple buffer arena.  */
+	state->batchMemoryBegin = palloc((state->maxTapes + 1) * MERGETUPLEBUFFER_SIZE);
+	state->batchMemoryEnd = state->batchMemoryBegin + (state->maxTapes + 1) * MERGETUPLEBUFFER_SIZE;
+	state->freeBufferHead = (MergeTupleBuffer *) state->batchMemoryBegin;
+
+	p = state->batchMemoryBegin;
+	for (i = 0; i < state->maxTapes; i++)
+	{
+		((MergeTupleBuffer *) p)->nextfree = (MergeTupleBuffer *) (p + MERGETUPLEBUFFER_SIZE);
+		p += MERGETUPLEBUFFER_SIZE;
+	}
+	((MergeTupleBuffer *) p)->nextfree = NULL;
+
 	/* End of step D2: rewind all output tapes to prepare for merging */
 	for (tapenum = 0; tapenum < state->tapeRange; tapenum++)
 		LogicalTapeRewind(state->tapeset, tapenum, false);
@@ -2534,7 +2575,7 @@ mergeruns(Tuplesortstate *state)
 				/* Tell logtape.c we won't be writing anymore */
 				LogicalTapeSetForgetFreeSpace(state->tapeset);
 				/* Initialize for the final merge pass */
-				beginmerge(state, state->tuples);
+				beginmerge(state);
 				state->status = TSS_FINALMERGE;
 				return;
 			}
@@ -2617,16 +2658,12 @@ mergeonerun(Tuplesortstate *state)
 {
 	int			destTape = state->tp_tapenum[state->tapeRange];
 	int			srcTape;
-	int			tupIndex;
-	SortTuple  *tup;
-	int64		priorAvail,
-				spaceFreed;
 
 	/*
 	 * Start the merge by loading one tuple from each active source tape into
 	 * the heap.  We can also decrease the input run/dummy run counts.
 	 */
-	beginmerge(state, false);
+	beginmerge(state);
 
 	/*
 	 * Execute merge by repeatedly extracting lowest tuple in heap, writing it
@@ -2635,33 +2672,25 @@ mergeonerun(Tuplesortstate *state)
 	 */
 	while (state->memtupcount > 0)
 	{
+		SortTuple stup;
+
 		/* write the tuple to destTape */
-		priorAvail = state->availMem;
 		srcTape = state->memtuples[0].tupindex;
 		WRITETUP(state, destTape, &state->memtuples[0]);
-		/* writetup adjusted total free space, now fix per-tape space */
-		spaceFreed = state->availMem - priorAvail;
-		state->mergeavailmem[srcTape] += spaceFreed;
+
+		/* Recycle the buffer we just wrote out, for the next read */
+		RELEASE_MERGETUPLE_BUFFER(state, state->memtuples[0].tuple);
+
 		/* compact the heap */
 		tuplesort_heap_siftup(state, false);
-		if ((tupIndex = state->mergenext[srcTape]) == 0)
+
+		/* pull next tuple from tape, insert in heap */
+		if (!mergereadnext(state, srcTape, &stup))
 		{
-			/* out of preloaded data on this tape, try to read more */
-			mergepreread(state);
-			/* if still no data, we've reached end of run on this tape */
-			if ((tupIndex = state->mergenext[srcTape]) == 0)
-				continue;
+			/* we've reached end of run on this tape */
+			continue;
 		}
-		/* pull next preread tuple from list, insert in heap */
-		tup = &state->memtuples[tupIndex];
-		state->mergenext[srcTape] = tup->tupindex;
-		if (state->mergenext[srcTape] == 0)
-			state->mergelast[srcTape] = 0;
-		tuplesort_heap_insert(state, tup, srcTape, false);
-		/* put the now-unused memtuples entry on the freelist */
-		tup->tupindex = state->mergefreelist;
-		state->mergefreelist = tupIndex;
-		state->mergeavailslots[srcTape]++;
+		tuplesort_heap_insert(state, &stup, srcTape, false);
 	}
 
 	/*
@@ -2694,18 +2723,13 @@ mergeonerun(Tuplesortstate *state)
  * which tapes contain active input runs in mergeactive[].  Then, load
  * as many tuples as we can from each active input tape, and finally
  * fill the merge heap with the first tuple from each active tape.
- *
- * finalMergeBatch indicates if this is the beginning of a final on-the-fly
- * merge where a batched allocation of tuple memory is required.
  */
 static void
-beginmerge(Tuplesortstate *state, bool finalMergeBatch)
+beginmerge(Tuplesortstate *state)
 {
 	int			activeTapes;
 	int			tapenum;
 	int			srcTape;
-	int			slotsPerTape;
-	int64		spacePerTape;
 
 	/* Heap should be empty here */
 	Assert(state->memtupcount == 0);
@@ -2729,497 +2753,48 @@ beginmerge(Tuplesortstate *state, bool finalMergeBatch)
 	}
 	state->activeTapes = activeTapes;
 
-	/* Clear merge-pass state variables */
-	memset(state->mergenext, 0,
-		   state->maxTapes * sizeof(*state->mergenext));
-	memset(state->mergelast, 0,
-		   state->maxTapes * sizeof(*state->mergelast));
-	state->mergefreelist = 0;	/* nothing in the freelist */
-	state->mergefirstfree = activeTapes;		/* 1st slot avail for preread */
-
-	if (finalMergeBatch)
-	{
-		/* Free outright buffers for tape never actually allocated */
-		FREEMEM(state, (state->maxTapes - activeTapes) * TAPE_BUFFER_OVERHEAD);
-
-		/*
-		 * Grow memtuples one last time, since the palloc() overhead no longer
-		 * incurred can make a big difference
-		 */
-		batchmemtuples(state);
-	}
-
 	/*
 	 * Initialize space allocation to let each active input tape have an equal
 	 * share of preread space.
 	 */
 	Assert(activeTapes > 0);
-	slotsPerTape = (state->memtupsize - state->mergefirstfree) / activeTapes;
-	Assert(slotsPerTape > 0);
-	spacePerTape = MAXALIGN_DOWN(state->availMem / activeTapes);
-	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
-	{
-		if (state->mergeactive[srcTape])
-		{
-			state->mergeavailslots[srcTape] = slotsPerTape;
-			state->mergeavailmem[srcTape] = spacePerTape;
-		}
-	}
-
-	/*
-	 * Preallocate tuple batch memory for each tape.  This is the memory used
-	 * for tuples themselves (not SortTuples), so it's never used by
-	 * pass-by-value datum sorts.  Memory allocation is performed here at most
-	 * once per sort, just in advance of the final on-the-fly merge step.
-	 */
-	if (finalMergeBatch)
-		mergebatch(state, spacePerTape);
-
-	/*
-	 * Preread as many tuples as possible (and at least one) from each active
-	 * tape
-	 */
-	mergepreread(state);
 
 	/* Load the merge heap with the first tuple from each input tape */
 	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
 	{
-		int			tupIndex = state->mergenext[srcTape];
-		SortTuple  *tup;
-
-		if (tupIndex)
-		{
-			tup = &state->memtuples[tupIndex];
-			state->mergenext[srcTape] = tup->tupindex;
-			if (state->mergenext[srcTape] == 0)
-				state->mergelast[srcTape] = 0;
-			tuplesort_heap_insert(state, tup, srcTape, false);
-			/* put the now-unused memtuples entry on the freelist */
-			tup->tupindex = state->mergefreelist;
-			state->mergefreelist = tupIndex;
-			state->mergeavailslots[srcTape]++;
+		SortTuple	tup;
 
-#ifdef TRACE_SORT
-			if (trace_sort && finalMergeBatch)
-			{
-				int64		perTapeKB = (spacePerTape + 1023) / 1024;
-				int64		usedSpaceKB;
-				int			usedSlots;
-
-				/*
-				 * Report how effective batchmemtuples() was in balancing the
-				 * number of slots against the need for memory for the
-				 * underlying tuples (e.g. IndexTuples).  The big preread of
-				 * all tapes when switching to FINALMERGE state should be
-				 * fairly representative of memory utilization during the
-				 * final merge step, and in any case is the only point at
-				 * which all tapes are guaranteed to have depleted either
-				 * their batch memory allowance or slot allowance.  Ideally,
-				 * both will be completely depleted for every tape by now.
-				 */
-				usedSpaceKB = (state->mergecurrent[srcTape] -
-							   state->mergetuples[srcTape] + 1023) / 1024;
-				usedSlots = slotsPerTape - state->mergeavailslots[srcTape];
-
-				elog(LOG, "tape %d initially used " INT64_FORMAT " KB of "
-					 INT64_FORMAT " KB batch (%2.3f) and %d out of %d slots "
-					 "(%2.3f)", srcTape,
-					 usedSpaceKB, perTapeKB,
-					 (double) usedSpaceKB / (double) perTapeKB,
-					 usedSlots, slotsPerTape,
-					 (double) usedSlots / (double) slotsPerTape);
-			}
-#endif
-		}
+		if (mergereadnext(state, srcTape, &tup))
+			tuplesort_heap_insert(state, &tup, srcTape, false);
 	}
 }
 
 /*
- * batchmemtuples - grow memtuples without palloc overhead
+ * mergereadnext - load tuple from one merge input tape
  *
- * When called, availMem should be approximately the amount of memory we'd
- * require to allocate memtupsize - memtupcount tuples (not SortTuples/slots)
- * that were allocated with palloc() overhead, and in doing so use up all
- * allocated slots.  However, though slots and tuple memory is in balance
- * following the last grow_memtuples() call, that's predicated on the observed
- * average tuple size for the "final" grow_memtuples() call, which includes
- * palloc overhead.  During the final merge pass, where we will arrange to
- * squeeze out the palloc overhead, we might need more slots in the memtuples
- * array.
- *
- * To make that happen, arrange for the amount of remaining memory to be
- * exactly equal to the palloc overhead multiplied by the current size of
- * the memtuples array, force the grow_memtuples flag back to true (it's
- * probably but not necessarily false on entry to this routine), and then
- * call grow_memtuples.  This simulates loading enough tuples to fill the
- * whole memtuples array and then having some space left over because of the
- * elided palloc overhead.  We expect that grow_memtuples() will conclude that
- * it can't double the size of the memtuples array but that it can increase
- * it by some percentage; but if it does decide to double it, that just means
- * that we've never managed to use many slots in the memtuples array, in which
- * case doubling it shouldn't hurt anything anyway.
- */
-static void
-batchmemtuples(Tuplesortstate *state)
-{
-	int64		refund;
-	int64		availMemLessRefund;
-	int			memtupsize = state->memtupsize;
-
-	/* For simplicity, assume no memtuples are actually currently counted */
-	Assert(state->memtupcount == 0);
-
-	/*
-	 * Refund STANDARDCHUNKHEADERSIZE per tuple.
-	 *
-	 * This sometimes fails to make memory use perfectly balanced, but it
-	 * should never make the situation worse.  Note that Assert-enabled builds
-	 * get a larger refund, due to a varying STANDARDCHUNKHEADERSIZE.
-	 */
-	refund = memtupsize * STANDARDCHUNKHEADERSIZE;
-	availMemLessRefund = state->availMem - refund;
-
-	/*
-	 * To establish balanced memory use after refunding palloc overhead,
-	 * temporarily have our accounting indicate that we've allocated all
-	 * memory we're allowed to less that refund, and call grow_memtuples() to
-	 * have it increase the number of slots.
-	 */
-	state->growmemtuples = true;
-	USEMEM(state, availMemLessRefund);
-	(void) grow_memtuples(state);
-	/* Should not matter, but be tidy */
-	FREEMEM(state, availMemLessRefund);
-	state->growmemtuples = false;
-
-#ifdef TRACE_SORT
-	if (trace_sort)
-	{
-		Size		OldKb = (memtupsize * sizeof(SortTuple) + 1023) / 1024;
-		Size		NewKb = (state->memtupsize * sizeof(SortTuple) + 1023) / 1024;
-
-		elog(LOG, "grew memtuples %1.2fx from %d (%zu KB) to %d (%zu KB) for final merge",
-			 (double) NewKb / (double) OldKb,
-			 memtupsize, OldKb,
-			 state->memtupsize, NewKb);
-	}
-#endif
-}
-
-/*
- * mergebatch - initialize tuple memory in batch
- *
- * This allows sequential access to sorted tuples buffered in memory from
- * tapes/runs on disk during a final on-the-fly merge step.  Note that the
- * memory is not used for SortTuples, but for the underlying tuples (e.g.
- * MinimalTuples).
- *
- * Note that when batch memory is used, there is a simple division of space
- * into large buffers (one per active tape).  The conventional incremental
- * memory accounting (calling USEMEM() and FREEMEM()) is abandoned.  Instead,
- * when each tape's memory budget is exceeded, a retail palloc() "overflow" is
- * performed, which is then immediately detected in a way that is analogous to
- * LACKMEM().  This keeps each tape's use of memory fair, which is always a
- * goal.
- */
-static void
-mergebatch(Tuplesortstate *state, int64 spacePerTape)
-{
-	int			srcTape;
-
-	Assert(state->activeTapes > 0);
-	Assert(state->tuples);
-
-	/*
-	 * For the purposes of tuplesort's memory accounting, the batch allocation
-	 * is special, and regular memory accounting through USEMEM() calls is
-	 * abandoned (see mergeprereadone()).
-	 */
-	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
-	{
-		char	   *mergetuples;
-
-		if (!state->mergeactive[srcTape])
-			continue;
-
-		/* Allocate buffer for each active tape */
-		mergetuples = MemoryContextAllocHuge(state->tuplecontext,
-											 spacePerTape);
-
-		/* Initialize state for tape */
-		state->mergetuples[srcTape] = mergetuples;
-		state->mergecurrent[srcTape] = mergetuples;
-		state->mergetail[srcTape] = mergetuples;
-		state->mergeoverflow[srcTape] = NULL;
-	}
-
-	state->batchUsed = true;
-	state->spacePerTape = spacePerTape;
-}
-
-/*
- * mergebatchone - prepare batch memory for one merge input tape
- *
- * This is called following the exhaustion of preread tuples for one input
- * tape.  All that actually occurs is that the state for the source tape is
- * reset to indicate that all memory may be reused.
- *
- * This routine must deal with fixing up the tuple that is about to be returned
- * to the client, due to "overflow" allocations.
- */
-static void
-mergebatchone(Tuplesortstate *state, int srcTape, SortTuple *rtup,
-			  bool *should_free)
-{
-	Assert(state->batchUsed);
-
-	/*
-	 * Tuple about to be returned to caller ("stup") is final preread tuple
-	 * from tape, just removed from the top of the heap.  Special steps around
-	 * memory management must be performed for that tuple, to make sure it
-	 * isn't overwritten early.
-	 */
-	if (!state->mergeoverflow[srcTape])
-	{
-		Size		tupLen;
-
-		/*
-		 * Mark tuple buffer range for reuse, but be careful to move final,
-		 * tail tuple to start of space for next run so that it's available to
-		 * caller when stup is returned, and remains available at least until
-		 * the next tuple is requested.
-		 */
-		tupLen = state->mergecurrent[srcTape] - state->mergetail[srcTape];
-		state->mergecurrent[srcTape] = state->mergetuples[srcTape];
-		MOVETUP(state->mergecurrent[srcTape], state->mergetail[srcTape],
-				tupLen);
-
-		/* Make SortTuple at top of the merge heap point to new tuple */
-		rtup->tuple = (void *) state->mergecurrent[srcTape];
-
-		state->mergetail[srcTape] = state->mergecurrent[srcTape];
-		state->mergecurrent[srcTape] += tupLen;
-	}
-	else
-	{
-		/*
-		 * Handle an "overflow" retail palloc.
-		 *
-		 * This is needed when we run out of tuple memory for the tape.
-		 */
-		state->mergecurrent[srcTape] = state->mergetuples[srcTape];
-		state->mergetail[srcTape] = state->mergetuples[srcTape];
-
-		if (rtup->tuple)
-		{
-			Assert(rtup->tuple == (void *) state->mergeoverflow[srcTape]);
-			/* Caller should free palloc'd tuple */
-			*should_free = true;
-		}
-		state->mergeoverflow[srcTape] = NULL;
-	}
-}
-
-/*
- * mergebatchfreetape - handle final clean-up for batch memory once tape is
- * about to become exhausted
- *
- * All tuples are returned from tape, but a single final tuple, *rtup, is to be
- * passed back to caller.  Free tape's batch allocation buffer while ensuring
- * that the final tuple is managed appropriately.
- */
-static void
-mergebatchfreetape(Tuplesortstate *state, int srcTape, SortTuple *rtup,
-				   bool *should_free)
-{
-	Assert(state->batchUsed);
-	Assert(state->status == TSS_FINALMERGE);
-
-	/*
-	 * Tuple may or may not already be an overflow allocation from
-	 * mergebatchone()
-	 */
-	if (!*should_free && rtup->tuple)
-	{
-		/*
-		 * Final tuple still in tape's batch allocation.
-		 *
-		 * Return palloc()'d copy to caller, and have it freed in a similar
-		 * manner to overflow allocation.  Otherwise, we'd free batch memory
-		 * and pass back a pointer to garbage.  Note that we deliberately
-		 * allocate this in the parent tuplesort context, to be on the safe
-		 * side.
-		 */
-		Size		tuplen;
-		void	   *oldTuple = rtup->tuple;
-
-		tuplen = state->mergecurrent[srcTape] - state->mergetail[srcTape];
-		rtup->tuple = MemoryContextAlloc(state->sortcontext, tuplen);
-		MOVETUP(rtup->tuple, oldTuple, tuplen);
-		*should_free = true;
-	}
-
-	/* Free spacePerTape-sized buffer */
-	pfree(state->mergetuples[srcTape]);
-}
-
-/*
- * mergebatchalloc - allocate memory for one tuple using a batch memory
- * "logical allocation".
- *
- * This is used for the final on-the-fly merge phase only.  READTUP() routines
- * receive memory from here in place of palloc() and USEMEM() calls.
- *
- * Tuple tapenum is passed, ensuring each tape's tuples are stored in sorted,
- * contiguous order (while allowing safe reuse of memory made available to
- * each tape).  This maximizes locality of access as tuples are returned by
- * final merge.
- *
- * Caller must not subsequently attempt to free memory returned here.  In
- * general, only mergebatch* functions know about how memory returned from
- * here should be freed, and this function's caller must ensure that batch
- * memory management code will definitely have the opportunity to do the right
- * thing during the final on-the-fly merge.
- */
-static void *
-mergebatchalloc(Tuplesortstate *state, int tapenum, Size tuplen)
-{
-	Size		reserve_tuplen = MAXALIGN(tuplen);
-	char	   *ret;
-
-	/* Should overflow at most once before mergebatchone() call: */
-	Assert(state->mergeoverflow[tapenum] == NULL);
-	Assert(state->batchUsed);
-
-	/* It should be possible to use precisely spacePerTape memory at once */
-	if (state->mergecurrent[tapenum] + reserve_tuplen <=
-		state->mergetuples[tapenum] + state->spacePerTape)
-	{
-		/*
-		 * Usual case -- caller is returned pointer into its tape's buffer,
-		 * and an offset from that point is recorded as where tape has
-		 * consumed up to for current round of preloading.
-		 */
-		ret = state->mergetail[tapenum] = state->mergecurrent[tapenum];
-		state->mergecurrent[tapenum] += reserve_tuplen;
-	}
-	else
-	{
-		/*
-		 * Allocate memory, and record as tape's overflow allocation.  This
-		 * will be detected quickly, in a similar fashion to a LACKMEM()
-		 * condition, and should not happen again before a new round of
-		 * preloading for caller's tape.  Note that we deliberately allocate
-		 * this in the parent tuplesort context, to be on the safe side.
-		 *
-		 * Sometimes, this does not happen because merging runs out of slots
-		 * before running out of memory.
-		 */
-		ret = state->mergeoverflow[tapenum] =
-			MemoryContextAlloc(state->sortcontext, tuplen);
-	}
-
-	return ret;
-}
-
-/*
- * mergepreread - load tuples from merge input tapes
- *
- * This routine exists to improve sequentiality of reads during a merge pass,
- * as explained in the header comments of this file.  Load tuples from each
- * active source tape until the tape's run is exhausted or it has used up
- * its fair share of available memory.  In any case, we guarantee that there
- * is at least one preread tuple available from each unexhausted input tape.
- *
- * We invoke this routine at the start of a merge pass for initial load,
- * and then whenever any tape's preread data runs out.  Note that we load
- * as much data as possible from all tapes, not just the one that ran out.
- * This is because logtape.c works best with a usage pattern that alternates
- * between reading a lot of data and writing a lot of data, so whenever we
- * are forced to read, we should fill working memory completely.
- *
- * In FINALMERGE state, we *don't* use this routine, but instead just preread
- * from the single tape that ran dry.  There's no read/write alternation in
- * that state and so no point in scanning through all the tapes to fix one.
- * (Moreover, there may be quite a lot of inactive tapes in that state, since
- * we might have had many fewer runs than tapes.  In a regular tape-to-tape
- * merge we can expect most of the tapes to be active.  Plus, only
- * FINALMERGE state has to consider memory management for a batch
- * allocation.)
- */
-static void
-mergepreread(Tuplesortstate *state)
-{
-	int			srcTape;
-
-	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
-		mergeprereadone(state, srcTape);
-}
-
-/*
- * mergeprereadone - load tuples from one merge input tape
+ * Returns false on EOF.
  *
  * Read tuples from the specified tape until it has used up its free memory
  * or array slots; but ensure that we have at least one tuple, if any are
  * to be had.
  */
-static void
-mergeprereadone(Tuplesortstate *state, int srcTape)
+static bool
+mergereadnext(Tuplesortstate *state, int srcTape, SortTuple *stup)
 {
 	unsigned int tuplen;
-	SortTuple	stup;
-	int			tupIndex;
-	int64		priorAvail,
-				spaceUsed;
 
 	if (!state->mergeactive[srcTape])
-		return;					/* tape's run is already exhausted */
+		return false;					/* tape's run is already exhausted */
 
-	/*
-	 * Manage per-tape availMem.  Only actually matters when batch memory not
-	 * in use.
-	 */
-	priorAvail = state->availMem;
-	state->availMem = state->mergeavailmem[srcTape];
-
-	/*
-	 * When batch memory is used if final on-the-fly merge, only mergeoverflow
-	 * test is relevant; otherwise, only LACKMEM() test is relevant.
-	 */
-	while ((state->mergeavailslots[srcTape] > 0 &&
-			state->mergeoverflow[srcTape] == NULL && !LACKMEM(state)) ||
-		   state->mergenext[srcTape] == 0)
+	/* read next tuple, if any */
+	if ((tuplen = getlen(state, srcTape, true)) == 0)
 	{
-		/* read next tuple, if any */
-		if ((tuplen = getlen(state, srcTape, true)) == 0)
-		{
-			state->mergeactive[srcTape] = false;
-			break;
-		}
-		READTUP(state, &stup, srcTape, tuplen);
-		/* find a free slot in memtuples[] for it */
-		tupIndex = state->mergefreelist;
-		if (tupIndex)
-			state->mergefreelist = state->memtuples[tupIndex].tupindex;
-		else
-		{
-			tupIndex = state->mergefirstfree++;
-			Assert(tupIndex < state->memtupsize);
-		}
-		state->mergeavailslots[srcTape]--;
-		/* store tuple, append to list for its tape */
-		stup.tupindex = 0;
-		state->memtuples[tupIndex] = stup;
-		if (state->mergelast[srcTape])
-			state->memtuples[state->mergelast[srcTape]].tupindex = tupIndex;
-		else
-			state->mergenext[srcTape] = tupIndex;
-		state->mergelast[srcTape] = tupIndex;
+		state->mergeactive[srcTape] = false;
+		return false;
 	}
-	/* update per-tape and global availmem counts */
-	spaceUsed = state->mergeavailmem[srcTape] - state->availMem;
-	state->mergeavailmem[srcTape] = state->availMem;
-	state->availMem = priorAvail - spaceUsed;
+	READTUP(state, stup, srcTape, tuplen);
+
+	return true;
 }
 
 /*
@@ -3857,27 +3432,24 @@ markrunend(Tuplesortstate *state, int tapenum)
  * routines.
  */
 static void *
-readtup_alloc(Tuplesortstate *state, int tapenum, Size tuplen)
+readtup_alloc(Tuplesortstate *state, Size tuplen)
 {
-	if (state->batchUsed)
-	{
-		/*
-		 * No USEMEM() call, because during final on-the-fly merge accounting
-		 * is based on tape-private state. ("Overflow" allocations are
-		 * detected as an indication that a new round or preloading is
-		 * required. Preloading marks existing contents of tape's batch buffer
-		 * for reuse.)
-		 */
-		return mergebatchalloc(state, tapenum, tuplen);
-	}
+	MergeTupleBuffer *buf;
+
+	/*
+	 * We pre-allocate enough buffers in the arena that we should never run out.
+	 */
+	Assert(state->freeBufferHead);
+
+	if (tuplen > MERGETUPLEBUFFER_SIZE || !state->freeBufferHead)
+		return MemoryContextAlloc(state->sortcontext, tuplen);
 	else
 	{
-		char	   *ret;
+		buf = state->freeBufferHead;
+		/* Reuse this buffer */
+		state->freeBufferHead = buf->nextfree;
 
-		/* Batch allocation yet to be performed */
-		ret = MemoryContextAlloc(state->tuplecontext, tuplen);
-		USEMEM(state, GetMemoryChunkSpace(ret));
-		return ret;
+		return buf;
 	}
 }
 
@@ -4046,8 +3618,11 @@ writetup_heap(Tuplesortstate *state, int tapenum, SortTuple *stup)
 		LogicalTapeWrite(state->tapeset, tapenum,
 						 (void *) &tuplen, sizeof(tuplen));
 
-	FREEMEM(state, GetMemoryChunkSpace(tuple));
-	heap_free_minimal_tuple(tuple);
+	if (!state->batchUsed)
+	{
+		FREEMEM(state, GetMemoryChunkSpace(tuple));
+		heap_free_minimal_tuple(tuple);
+	}
 }
 
 static void
@@ -4056,7 +3631,7 @@ readtup_heap(Tuplesortstate *state, SortTuple *stup,
 {
 	unsigned int tupbodylen = len - sizeof(int);
 	unsigned int tuplen = tupbodylen + MINIMAL_TUPLE_DATA_OFFSET;
-	MinimalTuple tuple = (MinimalTuple) readtup_alloc(state, tapenum, tuplen);
+	MinimalTuple tuple = (MinimalTuple) readtup_alloc(state, tuplen);
 	char	   *tupbody = (char *) tuple + MINIMAL_TUPLE_DATA_OFFSET;
 	HeapTupleData htup;
 
@@ -4077,12 +3652,6 @@ readtup_heap(Tuplesortstate *state, SortTuple *stup,
 								&stup->isnull1);
 }
 
-static void
-movetup_heap(void *dest, void *src, unsigned int len)
-{
-	memmove(dest, src, len);
-}
-
 /*
  * Routines specialized for the CLUSTER case (HeapTuple data, with
  * comparisons per a btree index definition)
@@ -4289,8 +3858,11 @@ writetup_cluster(Tuplesortstate *state, int tapenum, SortTuple *stup)
 		LogicalTapeWrite(state->tapeset, tapenum,
 						 &tuplen, sizeof(tuplen));
 
-	FREEMEM(state, GetMemoryChunkSpace(tuple));
-	heap_freetuple(tuple);
+	if (!state->batchUsed)
+	{
+		FREEMEM(state, GetMemoryChunkSpace(tuple));
+		heap_freetuple(tuple);
+	}
 }
 
 static void
@@ -4299,7 +3871,6 @@ readtup_cluster(Tuplesortstate *state, SortTuple *stup,
 {
 	unsigned int t_len = tuplen - sizeof(ItemPointerData) - sizeof(int);
 	HeapTuple	tuple = (HeapTuple) readtup_alloc(state,
-												  tapenum,
 												  t_len + HEAPTUPLESIZE);
 
 	/* Reconstruct the HeapTupleData header */
@@ -4324,19 +3895,6 @@ readtup_cluster(Tuplesortstate *state, SortTuple *stup,
 									&stup->isnull1);
 }
 
-static void
-movetup_cluster(void *dest, void *src, unsigned int len)
-{
-	HeapTuple	tuple;
-
-	memmove(dest, src, len);
-
-	/* Repoint the HeapTupleData header */
-	tuple = (HeapTuple) dest;
-	tuple->t_data = (HeapTupleHeader) ((char *) tuple + HEAPTUPLESIZE);
-}
-
-
 /*
  * Routines specialized for IndexTuple case
  *
@@ -4604,8 +4162,11 @@ writetup_index(Tuplesortstate *state, int tapenum, SortTuple *stup)
 		LogicalTapeWrite(state->tapeset, tapenum,
 						 (void *) &tuplen, sizeof(tuplen));
 
-	FREEMEM(state, GetMemoryChunkSpace(tuple));
-	pfree(tuple);
+	if (!state->batchUsed)
+	{
+		FREEMEM(state, GetMemoryChunkSpace(tuple));
+		pfree(tuple);
+	}
 }
 
 static void
@@ -4613,7 +4174,7 @@ readtup_index(Tuplesortstate *state, SortTuple *stup,
 			  int tapenum, unsigned int len)
 {
 	unsigned int tuplen = len - sizeof(unsigned int);
-	IndexTuple	tuple = (IndexTuple) readtup_alloc(state, tapenum, tuplen);
+	IndexTuple	tuple = (IndexTuple) readtup_alloc(state, tuplen);
 
 	LogicalTapeReadExact(state->tapeset, tapenum,
 						 tuple, tuplen);
@@ -4628,12 +4189,6 @@ readtup_index(Tuplesortstate *state, SortTuple *stup,
 								 &stup->isnull1);
 }
 
-static void
-movetup_index(void *dest, void *src, unsigned int len)
-{
-	memmove(dest, src, len);
-}
-
 /*
  * Routines specialized for DatumTuple case
  */
@@ -4700,7 +4255,7 @@ writetup_datum(Tuplesortstate *state, int tapenum, SortTuple *stup)
 		LogicalTapeWrite(state->tapeset, tapenum,
 						 (void *) &writtenlen, sizeof(writtenlen));
 
-	if (stup->tuple)
+	if (!state->batchUsed && stup->tuple)
 	{
 		FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
 		pfree(stup->tuple);
@@ -4730,7 +4285,7 @@ readtup_datum(Tuplesortstate *state, SortTuple *stup,
 	}
 	else
 	{
-		void	   *raddr = readtup_alloc(state, tapenum, tuplen);
+		void	   *raddr = readtup_alloc(state, tuplen);
 
 		LogicalTapeReadExact(state->tapeset, tapenum,
 							 raddr, tuplen);
@@ -4744,12 +4299,6 @@ readtup_datum(Tuplesortstate *state, SortTuple *stup,
 							 &tuplen, sizeof(tuplen));
 }
 
-static void
-movetup_datum(void *dest, void *src, unsigned int len)
-{
-	memmove(dest, src, len);
-}
-
 /*
  * Convenience routine to free a tuple previously loaded into sort memory
  */
-- 
2.9.3

0002-Use-larger-read-buffers-in-logtape.patchtext/x-diff; name=0002-Use-larger-read-buffers-in-logtape.patchDownload
From d28de3cab15ceae31ba1e8d469dc41302470df88 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Thu, 8 Sep 2016 20:34:06 +0300
Subject: [PATCH 2/2] Use larger read buffers in logtape.

This makes the access pattern appear more sequential to the OS, making it
more likely that the OS will do read-ahead for us. It will also ensure that
there are more sequential blocks available when writing, because we can
free more blocks in the underlying file at once. Sequential I/O is much
cheaper than random I/O.

We used to do pre-reading from each tape, in tuplesort.c, for the same
reasons. But it seems simpler to do it in logtape.c, reading the raw data
into a larger buffer, than to convert every tuple to SortTuple format when
pre-reading, like tuplesort.c used to do.
---
 src/backend/utils/sort/logtape.c   | 134 +++++++++++++++++++++++++++++++------
 src/backend/utils/sort/tuplesort.c |  53 +++++++++++++--
 src/include/utils/logtape.h        |   1 +
 3 files changed, 162 insertions(+), 26 deletions(-)

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 7745207..05d7697 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -131,9 +131,12 @@ typedef struct LogicalTape
 	 * reading.
 	 */
 	char	   *buffer;			/* physical buffer (separately palloc'd) */
+	int			buffer_size;	/* allocated size of the buffer */
 	long		curBlockNumber; /* this block's logical blk# within tape */
 	int			pos;			/* next read/write position in buffer */
 	int			nbytes;			/* total # of valid bytes in buffer */
+
+	int			read_buffer_size;	/* buffer size to use when reading */
 } LogicalTape;
 
 /*
@@ -228,6 +231,53 @@ ltsReadBlock(LogicalTapeSet *lts, long blocknum, void *buffer)
 }
 
 /*
+ * Read as many blocks as we can into the per-tape buffer.
+ *
+ * The caller can specify the next physical block number to read, in
+ * datablocknum, or -1 to fetch the next block number from the internal block.
+ * If datablocknum == -1, the caller must've already set curBlockNumber.
+ *
+ * Returns true if anything was read, 'false' on EOF.
+ */
+static bool
+ltsReadFillBuffer(LogicalTapeSet *lts, LogicalTape *lt, long datablocknum)
+{
+	lt->pos = 0;
+	lt->nbytes = 0;
+
+	do
+	{
+		/* Fetch next block number (unless provided by caller) */
+		if (datablocknum == -1)
+		{
+			datablocknum = ltsRecallNextBlockNum(lts, lt->indirect, lt->frozen);
+			if (datablocknum == -1L)
+				break;			/* EOF */
+			lt->curBlockNumber++;
+		}
+
+		/* Read the block */
+		ltsReadBlock(lts, datablocknum, (void *) (lt->buffer + lt->nbytes));
+		if (!lt->frozen)
+			ltsReleaseBlock(lts, datablocknum);
+
+		if (lt->curBlockNumber < lt->numFullBlocks)
+			lt->nbytes += BLCKSZ;
+		else
+		{
+			/* EOF */
+			lt->nbytes += lt->lastBlockBytes;
+			break;
+		}
+
+		/* Advance to next block, if we have buffer space left */
+		datablocknum = -1;
+	} while (lt->nbytes < lt->buffer_size);
+
+	return (lt->nbytes > 0);
+}
+
+/*
  * qsort comparator for sorting freeBlocks[] into decreasing order.
  */
 static int
@@ -546,6 +596,8 @@ LogicalTapeSetCreate(int ntapes)
 		lt->numFullBlocks = 0L;
 		lt->lastBlockBytes = 0;
 		lt->buffer = NULL;
+		lt->buffer_size = 0;
+		lt->read_buffer_size = BLCKSZ;
 		lt->curBlockNumber = 0L;
 		lt->pos = 0;
 		lt->nbytes = 0;
@@ -628,7 +680,10 @@ LogicalTapeWrite(LogicalTapeSet *lts, int tapenum,
 
 	/* Allocate data buffer and first indirect block on first write */
 	if (lt->buffer == NULL)
+	{
 		lt->buffer = (char *) palloc(BLCKSZ);
+		lt->buffer_size = BLCKSZ;
+	}
 	if (lt->indirect == NULL)
 	{
 		lt->indirect = (IndirectBlock *) palloc(sizeof(IndirectBlock));
@@ -636,6 +691,7 @@ LogicalTapeWrite(LogicalTapeSet *lts, int tapenum,
 		lt->indirect->nextup = NULL;
 	}
 
+	Assert(lt->buffer_size == BLCKSZ);
 	while (size > 0)
 	{
 		if (lt->pos >= BLCKSZ)
@@ -709,18 +765,19 @@ LogicalTapeRewind(LogicalTapeSet *lts, int tapenum, bool forWrite)
 			Assert(lt->frozen);
 			datablocknum = ltsRewindFrozenIndirectBlock(lts, lt->indirect);
 		}
+
+		/* Allocate a read buffer */
+		if (lt->buffer)
+			pfree(lt->buffer);
+		lt->buffer = palloc(lt->read_buffer_size);
+		lt->buffer_size = lt->read_buffer_size;
+
 		/* Read the first block, or reset if tape is empty */
 		lt->curBlockNumber = 0L;
 		lt->pos = 0;
 		lt->nbytes = 0;
 		if (datablocknum != -1L)
-		{
-			ltsReadBlock(lts, datablocknum, (void *) lt->buffer);
-			if (!lt->frozen)
-				ltsReleaseBlock(lts, datablocknum);
-			lt->nbytes = (lt->curBlockNumber < lt->numFullBlocks) ?
-				BLCKSZ : lt->lastBlockBytes;
-		}
+			ltsReadFillBuffer(lts, lt, datablocknum);
 	}
 	else
 	{
@@ -754,6 +811,13 @@ LogicalTapeRewind(LogicalTapeSet *lts, int tapenum, bool forWrite)
 		lt->curBlockNumber = 0L;
 		lt->pos = 0;
 		lt->nbytes = 0;
+
+		if (lt->buffer)
+		{
+			pfree(lt->buffer);
+			lt->buffer = NULL;
+			lt->buffer_size = 0;
+		}
 	}
 }
 
@@ -779,20 +843,8 @@ LogicalTapeRead(LogicalTapeSet *lts, int tapenum,
 		if (lt->pos >= lt->nbytes)
 		{
 			/* Try to load more data into buffer. */
-			long		datablocknum = ltsRecallNextBlockNum(lts, lt->indirect,
-															 lt->frozen);
-
-			if (datablocknum == -1L)
+			if (!ltsReadFillBuffer(lts, lt, -1))
 				break;			/* EOF */
-			lt->curBlockNumber++;
-			lt->pos = 0;
-			ltsReadBlock(lts, datablocknum, (void *) lt->buffer);
-			if (!lt->frozen)
-				ltsReleaseBlock(lts, datablocknum);
-			lt->nbytes = (lt->curBlockNumber < lt->numFullBlocks) ?
-				BLCKSZ : lt->lastBlockBytes;
-			if (lt->nbytes <= 0)
-				break;			/* EOF (possible here?) */
 		}
 
 		nthistime = lt->nbytes - lt->pos;
@@ -842,6 +894,22 @@ LogicalTapeFreeze(LogicalTapeSet *lts, int tapenum)
 	lt->writing = false;
 	lt->frozen = true;
 	datablocknum = ltsRewindIndirectBlock(lts, lt->indirect, true);
+
+	/*
+	 * The seek and backspace functions assume a single block read buffer.
+	 * That's OK with current usage. A larger buffer is helpful to make the
+	 * read pattern of the backing file look more sequential to the OS, when
+	 * we're reading from multiple tapes. But at the end of a sort, when a
+	 * tape is frozen, we only read from a single tape anyway.
+	 */
+	if (!lt->buffer || lt->buffer_size != BLCKSZ)
+	{
+		if (lt->buffer)
+			pfree(lt->buffer);
+		lt->buffer = palloc(BLCKSZ);
+		lt->buffer_size = BLCKSZ;
+	}
+
 	/* Read the first block, or reset if tape is empty */
 	lt->curBlockNumber = 0L;
 	lt->pos = 0;
@@ -875,6 +943,7 @@ LogicalTapeBackspace(LogicalTapeSet *lts, int tapenum, size_t size)
 	Assert(tapenum >= 0 && tapenum < lts->nTapes);
 	lt = &lts->tapes[tapenum];
 	Assert(lt->frozen);
+	Assert(lt->buffer_size == BLCKSZ);
 
 	/*
 	 * Easy case for seek within current block.
@@ -941,6 +1010,7 @@ LogicalTapeSeek(LogicalTapeSet *lts, int tapenum,
 	lt = &lts->tapes[tapenum];
 	Assert(lt->frozen);
 	Assert(offset >= 0 && offset <= BLCKSZ);
+	Assert(lt->buffer_size == BLCKSZ);
 
 	/*
 	 * Easy case for seek within current block.
@@ -1000,6 +1070,9 @@ LogicalTapeTell(LogicalTapeSet *lts, int tapenum,
 {
 	LogicalTape *lt;
 
 	Assert(tapenum >= 0 && tapenum < lts->nTapes);
 	lt = &lts->tapes[tapenum];
+
+	/* With a larger buffer, 'pos' wouldn't be the same as offset within page */
+	Assert(lt->buffer_size == BLCKSZ);
 	*blocknum = lt->curBlockNumber;
@@ -1014,3 +1087,24 @@ LogicalTapeSetBlocks(LogicalTapeSet *lts)
 {
 	return lts->nFileBlocks;
 }
+
+/*
+ * Set buffer size to use, when reading from given tape.
+ */
+void
+LogicalTapeAssignReadBufferSize(LogicalTapeSet *lts, int tapenum, size_t avail_mem)
+{
+	LogicalTape *lt;
+
+	Assert(tapenum >= 0 && tapenum < lts->nTapes);
+	lt = &lts->tapes[tapenum];
+
+	/*
+	 * The buffer size must be a multiple of BLCKSZ, so round the given value
+	 * down to the nearest multiple. Make sure we have at least one page.
+	 */
+	if (avail_mem < BLCKSZ)
+		avail_mem = BLCKSZ;
+	avail_mem -= avail_mem % BLCKSZ;
+	lt->read_buffer_size = avail_mem;
+}
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index a6d167a..7f5e165 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -2479,6 +2479,9 @@ mergeruns(Tuplesortstate *state)
 				svDummy;
 	char	   *p;
 	int			i;
+	int			per_tape, cutoff;
+	long		avail_blocks;
+	int			maxTapes;
 
 	Assert(state->status == TSS_BUILDRUNS);
 	Assert(state->memtupcount == 0);
@@ -2527,24 +2530,62 @@ mergeruns(Tuplesortstate *state)
 	USEMEM(state, state->memtupsize * sizeof(SortTuple));
 
 	/*
-	 * From this point on, we no longer use the USEMEM()/LACKMEM() mechanism to
-	 * track memory usage.
+	 * If we had fewer runs than tapes, refund buffers for tapes that were never
+	 * allocated.
 	 */
-	state->batchUsed = true;
+	maxTapes = state->maxTapes;
+	if (state->currentRun < maxTapes)
+	{
+		FREEMEM(state, (maxTapes - state->currentRun) * TAPE_BUFFER_OVERHEAD);
+		maxTapes = state->currentRun;
+	}
 
 	/* Initialize the merge tuple buffer arena.  */
-	state->batchMemoryBegin = palloc((state->maxTapes + 1) * MERGETUPLEBUFFER_SIZE);
-	state->batchMemoryEnd = state->batchMemoryBegin + (state->maxTapes + 1) * MERGETUPLEBUFFER_SIZE;
+	state->batchMemoryBegin = palloc((maxTapes + 1) * MERGETUPLEBUFFER_SIZE);
+	state->batchMemoryEnd = state->batchMemoryBegin + (maxTapes + 1) * MERGETUPLEBUFFER_SIZE;
 	state->freeBufferHead = (MergeTupleBuffer *) state->batchMemoryBegin;
+	USEMEM(state, (maxTapes + 1) * MERGETUPLEBUFFER_SIZE);
 
 	p = state->batchMemoryBegin;
-	for (i = 0; i < state->maxTapes; i++)
+	for (i = 0; i < maxTapes; i++)
 	{
 		((MergeTupleBuffer *) p)->nextfree = (MergeTupleBuffer *) (p + MERGETUPLEBUFFER_SIZE);
 		p += MERGETUPLEBUFFER_SIZE;
 	}
 	((MergeTupleBuffer *) p)->nextfree = NULL;
 
+	/*
+	 * Use all the spare memory we have available for read buffers. Divide it
+	 * evenly among all the tapes.
+	 */
+	avail_blocks = state->availMem / BLCKSZ;
+	per_tape = avail_blocks / maxTapes;
+	cutoff = avail_blocks % maxTapes;
+	if (per_tape == 0)
+	{
+		per_tape = 1;
+		cutoff = 0;
+	}
+	for (tapenum = 0; tapenum < maxTapes; tapenum++)
+	{
+		LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
+										(per_tape + (tapenum < cutoff ? 1 : 0)) * BLCKSZ);
+	}
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG, "using %d kB of memory for read buffers in %d tapes, %d kB per tape",
+			 (int) (state->availMem / 1024), maxTapes, (int) (per_tape * BLCKSZ) / 1024);
+#endif
+
+	USEMEM(state, avail_blocks * BLCKSZ);
+
+	/*
+	 * From this point on, we no longer use the USEMEM()/LACKMEM() mechanism to
+	 * track memory usage of individual tuples.
+	 */
+	state->batchUsed = true;
+
 	/* End of step D2: rewind all output tapes to prepare for merging */
 	for (tapenum = 0; tapenum < state->tapeRange; tapenum++)
 		LogicalTapeRewind(state->tapeset, tapenum, false);
diff --git a/src/include/utils/logtape.h b/src/include/utils/logtape.h
index fa1e992..03d0a6f 100644
--- a/src/include/utils/logtape.h
+++ b/src/include/utils/logtape.h
@@ -39,6 +39,7 @@ extern bool LogicalTapeSeek(LogicalTapeSet *lts, int tapenum,
 				long blocknum, int offset);
 extern void LogicalTapeTell(LogicalTapeSet *lts, int tapenum,
 				long *blocknum, int *offset);
+extern void LogicalTapeAssignReadBufferSize(LogicalTapeSet *lts, int tapenum, size_t bufsize);
 extern long LogicalTapeSetBlocks(LogicalTapeSet *lts);
 
 #endif   /* LOGTAPE_H */
-- 
2.9.3

#6Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#5)
8 attachment(s)
Re: Tuplesort merge pre-reading

On 09/09/2016 02:13 PM, Heikki Linnakangas wrote:

On 09/08/2016 09:59 PM, Heikki Linnakangas wrote:

On 09/06/2016 10:26 PM, Peter Geoghegan wrote:

On Tue, Sep 6, 2016 at 12:08 PM, Peter Geoghegan <pg@heroku.com> wrote:

Offhand, I would think that taken together this is very important. I'd
certainly want to see cases in the hundreds of megabytes or gigabytes
of work_mem alongside your 4MB case, even just to be able to talk
informally about this. As you know, the default work_mem value is very
conservative.

I spent some more time polishing this up, and also added some code to
logtape.c, to use larger read buffers, to compensate for the fact that
we don't do pre-reading from tuplesort.c anymore. That should trigger
the OS read-ahead, and make the I/O more sequential, which was the
purpose of the old pre-reading code, but in a simpler way. I haven't
tested that part much yet, but I plan to run some tests on larger data
sets that don't fit in RAM, to make the I/O effects visible.
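
As a rough illustration of the arithmetic involved (a sketch only, not the patch itself): the spare memory is divided among the tapes in whole blocks, and each tape's share is rounded down to a multiple of the block size, with a minimum of one block. The function names below are made up for this example, and BLCKSZ is hard-coded to 8192, the default build setting.

```c
#include <assert.h>
#include <stddef.h>

#define BLCKSZ 8192				/* assumption: default PostgreSQL block size */

/*
 * Round a byte budget down to a whole number of blocks, but never below
 * one block.  (Hypothetical helper, named for this example only.)
 */
size_t
round_read_buffer_size(size_t avail_mem)
{
	if (avail_mem < BLCKSZ)
		avail_mem = BLCKSZ;
	return avail_mem - avail_mem % BLCKSZ;
}

/*
 * Split avail_mem bytes evenly across ntapes tapes, in whole blocks.
 * The first (avail_blocks % ntapes) tapes get one extra block.  If there
 * is less than one block per tape, every tape still gets one block.
 */
size_t
tape_read_buffer_size(size_t avail_mem, int ntapes, int tapenum)
{
	size_t		avail_blocks;
	size_t		per_tape;
	size_t		cutoff;

	assert(ntapes > 0 && tapenum >= 0 && tapenum < ntapes);

	avail_blocks = avail_mem / BLCKSZ;
	per_tape = avail_blocks / (size_t) ntapes;
	cutoff = avail_blocks % (size_t) ntapes;
	if (per_tape == 0)
	{
		per_tape = 1;
		cutoff = 0;
	}

	return round_read_buffer_size(
		(per_tape + ((size_t) tapenum < cutoff ? 1 : 0)) * BLCKSZ);
}
```

With ten spare blocks and four tapes, for instance, the first two tapes would get three blocks each and the other two would get two blocks each.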

Ok, I ran a few tests with 20 GB tables. I thought this would show any
differences in I/O behaviour, but in fact it was still completely CPU
bound, like the tests on smaller tables I posted yesterday. I guess I
need to point temp_tablespaces to a USB drive or something. But here we go.

I took a different tack to demonstrate the I/O pattern effects. I
added some instrumentation code to logtape.c that prints a line to a
file, with the block number, whenever it reads a block. I ran the same
query with master and with these patches, and plotted the access pattern
with gnuplot.
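
For a quick numeric check without gnuplot, something along these lines could summarize such a trace. It is only a sketch: it assumes the format written by the instrumentation patch attached to this message ("0 <blocknum>" for a read, "1 <blocknum>" for a write, '#' starting a comment line), and analyze_trace and TraceStats are hypothetical names, not part of PostgreSQL.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef struct TraceStats
{
	long		reads;			/* number of block reads seen */
	long		sequential;		/* reads of the block right after the
								 * previously read block */
} TraceStats;

/*
 * Scan an in-memory trace, one line at a time.  Only read lines
 * ("0 <blocknum>") are counted; write lines and comments are skipped.
 */
TraceStats
analyze_trace(const char *trace)
{
	TraceStats	stats = {0, 0};
	long		prev = -2;		/* no previous read yet */
	const char *line = trace;

	assert(trace != NULL);

	while (*line != '\0')
	{
		const char *eol = strchr(line, '\n');
		size_t		len = (eol != NULL) ? (size_t) (eol - line) : strlen(line);
		char		buf[64];
		int			kind;
		long		blocknum;

		/* Copy one line into a bounded buffer, so sscanf can't run past it */
		if (len >= sizeof(buf))
			len = sizeof(buf) - 1;
		memcpy(buf, line, len);
		buf[len] = '\0';

		if (buf[0] != '#' &&
			sscanf(buf, "%d %ld", &kind, &blocknum) == 2 &&
			kind == 0)
		{
			stats.reads++;
			if (blocknum == prev + 1)
				stats.sequential++;
			prev = blocknum;
		}

		if (eol == NULL)
			break;
		line = eol + 1;
	}

	return stats;
}
```

The ratio sequential / reads gives a crude measure of how sequential the read pattern is. Reads from all tapes are interleaved in the trace, so it says nothing about any single tape.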

I'm happy with what it looks like. We are in fact getting a more
sequential access pattern with these patches, because we're not
expanding the pre-read tuples into SortTuples. Keeping densely-packed
blocks in memory, instead of SortTuples, allows caching more data overall.

Attached is the patch I used to generate these traces, the gnuplot
script, and the traces I got from sorting a 1 GB table of random
integers, with work_mem=16MB.

Note that in the big picture, what appear to be individual dots are
actually clusters of dots. So the access pattern is a lot
more sequential than it looks at first glance, with or without
these patches. The zoomed-in pictures show that. If you want to inspect
these in more detail, I recommend running gnuplot in interactive mode,
so that you can zoom in and out easily.

I'm happy with the amount of testing I've done now, and the results.
Does anyone want to throw out any more test cases where there might be a
regression? If not, let's get these reviewed and committed.

- Heikki

Attachments:

logtape-trace-patched-16MBtext/plain; charset=UTF-8; name=logtape-trace-patched-16MBDownload
trace-logtape.patchtext/x-diff; name=trace-logtape.patchDownload
commit 2d7524e2fa2810fee5c63cb84cae70b8317bf1d5
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date:   Fri Sep 9 14:08:29 2016 +0300

    temp file access tracing

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 05d7697..cededac 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -77,8 +77,20 @@
 
 #include "postgres.h"
 
+/* #define TRACE_BUFFER_WRITES */
+#define TRACE_BUFFER_READS
+
+#if defined(TRACE_BUFFER_WRITES) || defined(TRACE_BUFFER_READS)
+#define TRACE_BUFFER_ACCESS
+#endif
+
+
 #include "storage/buffile.h"
 #include "utils/logtape.h"
+#ifdef TRACE_BUFFER_ACCESS
+#include "storage/fd.h"
+#include "tcop/tcopprot.h"
+#endif
 
 /*
  * Block indexes are "long"s, so we can fit this many per indirect block.
@@ -169,6 +181,10 @@ struct LogicalTapeSet
 	int			nFreeBlocks;	/* # of currently free blocks */
 	int			freeBlocksLen;	/* current allocated length of freeBlocks[] */
 
+#ifdef TRACE_BUFFER_ACCESS
+	FILE *tracefile;
+#endif
+
 	/* The array of logical tapes. */
 	int			nTapes;			/* # of logical tapes in set */
 	LogicalTape tapes[FLEXIBLE_ARRAY_MEMBER];	/* has nTapes nentries */
@@ -211,6 +227,9 @@ ltsWriteBlock(LogicalTapeSet *lts, long blocknum, void *buffer)
 				(errcode_for_file_access(),
 				 errmsg("could not write block %ld of temporary file: %m",
 						blocknum)));
+#ifdef TRACE_BUFFER_WRITES
+	fprintf(lts->tracefile, "1 %ld\n", blocknum);
+#endif
 }
 
 /*
@@ -228,6 +247,10 @@ ltsReadBlock(LogicalTapeSet *lts, long blocknum, void *buffer)
 				(errcode_for_file_access(),
 				 errmsg("could not read block %ld of temporary file: %m",
 						blocknum)));
+
+#ifdef TRACE_BUFFER_READS
+	fprintf(lts->tracefile, "0 %ld\n", blocknum);
+#endif
 }
 
 /*
@@ -602,6 +625,16 @@ LogicalTapeSetCreate(int ntapes)
 		lt->pos = 0;
 		lt->nbytes = 0;
 	}
+
+#ifdef TRACE_BUFFER_ACCESS
+	lts->tracefile = AllocateFile("logtape-trace", "w+");
+	if (lts->tracefile == NULL)
+		elog(ERROR, "could not open file \"logtape-trace\": %m");
+
+	fprintf(lts->tracefile, "# LogTapeSet with %d tapes allocated\n", ntapes);
+	fprintf(lts->tracefile, "# Query: %s\n", debug_query_string);
+#endif
+
 	return lts;
 }
 
@@ -630,6 +663,10 @@ LogicalTapeSetClose(LogicalTapeSet *lts)
 	}
 	pfree(lts->freeBlocks);
+#ifdef TRACE_BUFFER_ACCESS
+	(void) FreeFile(lts->tracefile);
+#endif
+
 	pfree(lts);
 }
 
 /*
logtape-master.pngimage/png; name=logtape-master.pngDownload
���w���8/��I����y	���K�i�;��	��o@J��	m
@�;��E����0'}Pf��y��<k�V��y�XL������
%Pbe����$����Z��NR2c��L-+��1p�r�����9�|Q���yS��@�!I�]�4��5�R�Fj�%�����|������U����a":/���GE�L����u�m,d+�L�]bq�B��5Y$qD���,�
��]��D�Sl�P�3���XJ��2��Ih>�`jJJf�PI%����!�`�T���#���/�@[kV��!C*��Rb&&��	C65�z!���[�l,pt;�q�e�JYm*b��$\.sg���wy��k����������
3	2��R�7A���\T
*!P��������~�JX�n���X<��K�B�y����#
A��D���pW�n�X@����p���_z�J�@��RP}�(2cb=A*�
��q[�Q����I���J�Q�2�z�*����8����N`j"���t�'v�����(X����S��85DTH��FC���F�#��F�\��F��h���$�d�>�g+�D(��`P)z��MG�LL[�Q%�d�+Awe�f���!������K�)V��v�%<RU�}(tg�b�w����
Q&!���\c�G�JA��S[6o���]�@���-����E,��]�S���Y��pI!{+6�/a��Ol���;�!�Wo�&jn��mi��8���$�q�Hw�hi����-r.�*���RL��S�S(*_���[h�I����;.H��	8	�!U%����@
#�BuP�udI���q����25�e�b����`���JV��j��2n	@��C�"kvR�{X�{��X<kit��	�\q����@���sH>��<[���t���+�l��Gj�h���@R,O(K5ci��[�R���HK��l�]�3�#k�'��B��#�{����|hq�%�X�-�e8@����PA����q����z�A�O�Q�k��(e�P�#��(����/:����d_E�)��!�a'6O�ZPC��!��4��r�XJ��w���8�	��d�<��%c�4������?y�*:g����"d�	g���36����Kg�5���X�$�ui$�'C�7���w){��RB8%#��T�4��,��y���k60�E���F��e�i[�`�3z���z�,@�+9��v���bF����p �����P����D/B��X*��>@Mo��T
�������@�����6�Kf�H9��� �����epx�xg\!3�r�X��	="C�)�c ���R���_9�?.������T/��AP$���R��'�V.�� d8���'`Ef��1"1h�gpm����tp��������qAv(BpA@F@~1���m(b���'��c4&��H�.K��sY��`=�]~�@�`�@%��K>����d�0-8�E���	@���)�\(R�b��B�peU*�w.�B"��b�U����@Fo>���7�8��Y�QDRe�JF(�:��	��g��:�L����&����G��$F/�x2�����h���3P"p�������9����q,]�[����#������z����\�u�w:��^���%��s�DD�����t��.��� ` Z�����,�5�]����\#���-������MD��@���p��@������Lg�� ��<.9!��'�����_
����p������Q���1>��7��HJ���B>}�SQ���Q}�����������tcA�8��H��	cR|��22JR������}���r1��7��\gq-)�� ���;h6}W��O��X.���C2CZ�>����N&�*9���iw�Bj�r���
\'����_K<�����)J�&��E�a}���W�]�k��,�G�@@f��m��hLH�p�kA�$X,>D��@���,�\N�SN��@^�<ua��3�I//����W���)�b>}������6���W6�*�����IG��
G��M
fU���������>S�+�A����������y�������������E
~J��\qFJ���&/<�1���������mb��a�"69��d�����`�,�Ju������_�'�x�-q���1o��� '�%$����h���e�d~�m��=^��@��B�����2N�������l��6|�c~+>���-R�����,�rH�j������w�d�
@FT����[TI�qD��CI0�$2�Q����������)x@���8@-���Q�G	U!�[���q�;�~������T���f�C�F��C�q�������u?��(XT���kA
zK���+
���s�L���h6���o�"�["]���)��\��BOD�����#���z���FW�A����"z��3��g=����U��]
������	R�F(z*�!
�'OSl�,��C�
��
;b�4*�w�A`��#(�����u�
$'�trEK9����|��/$�T$��H�h���gJ�%��
�����T���T �]��	�H�*Bz�dXF"#k=���W/��	�� �Z
��+���������p�\���M(x��/�tAi�f��q���(�y��~8���S2w(Y��M�p��@
�����bQ���������VYNC���v-M��f�u���@F�����3p�*�N�z�oM�|�����E4s%7�������&���?.�'�D7��+SJ�O�'��uG�� �rj���#���	�����)�*�u#	�d�E��j2H8�H�W��I����3��\%�����)c{������%A�	���(��\�cV���PH/�X���(����_F�f:B9���"]�(��*����Od/��r�EP����������jWq��Xg�&$�(�*W\��p���F���]�Jb�GP���,2g����e���v��^N E>0����,����Lo	��G[A�5L���;�����)xc� ��$#&^�Eo=�L
	���R]@�����U/�]D1�@��NG��U����Odrf�������E#<`�����>/#&�
#�L�3�6@,����\X\���e8(�w��	�#25;N����e��#H��I�P����	@Cx�� p�}@T���~P�������]�3eQI�@��^b��������s ���'|�E]�^.}T�Ta�~���r5�[� �s�2����c���hW����lx��� `���_�"��e���6eSE���+���&�\�iO_�f���XDd�������s��@C%:S��ww(������X����E�
RP���k�H��p"n�j��(�
5�)�����K4�`�>7�����4����N�`�,
"���Xd�wC�O���!�P�j�&?��X)a�$g��8<�$kdMG�.����U��E2>� !���
�~����k�!\K|q�5�8�<d�T�V�G(��DCL2��S����Y�hoA�W�&oH�4XB[����JJ���M������Hf�$�:T+��EM��_���,��E��	��6cQ���Jt.
��F29��_�{<d@���f�@1�3�l����g�U ��4�G<�p#CF�j��@��0HJ� aD�Z
��7��
�uf�L���Bm�?�x��g�\|xH�����z)�=���f$���A &K~�"K@�_z;h�S\�G�J
�����>�=��������lN�0�����uUB�z9��u&Bz�K�	@��s/`S����)�b����h�m�+>{ ��B��A�8j����a���T���\�C�o�~5�.(�qFD�����	�`���F����u�����z3;�HA(Fx
0���rMF����d~P�x>���3(�w�7Md���J0�"j�J	����W�9 ����vj ��@!�+���{+�(�)E���G6G]q��%g{��7U������@�9���7�V�H&�I�l�,��,�#�����	��E\�U	*�y���o��}�(�Kd� �P�	&=��9���.�u�%�"U�N��@��yp�E���I�"��[�y}�S�0�fKm��]�T5��/�FNH��V������e��.��x$��l$S��0�)�#W$e���/���%����U�,��x��S�d�������k�_�y+�I���N���K�~��j�<5�����E�2�JJ��(�f����>�X��G�>@�r�/��Z�}P���O����qz��1��PkhI���nd��������qA��{
=�r-�����m���<��;
���%U����O�r?h�� �l4x*m�=��`f��@
@NU����N�`%����_��n ��9��f,���
�D��P��J7�s-������J�WN�q#'�"z�jHO���s���kQ�������I
�	�+=�+R)m�JQ,'�]X����$�*FH�e���w1�]�'�e�����hJ
l��~CJDE?�G� ��� ��:�S���p()h=�n��D$"����lM�����zw��e���B�	��%d�4_�9�hPDz����<���<���hPN����D�Q�qDU�L8��c��_|\��ao�7�O������U����H;���y?G�������Wo' ����x53a+U��T<�\��!Hy��HG����"2i���cX��.��R���o����<�B�������>�mPL�p�t��O6�J�!
��c$H����"	\"R�b���0Z����~��x` �P(��&�l����` ���e�#�#2]M���LqW�Q��;��"mv&r~g����2��8����L^�p�x�	@	|�`fo�<_&I#]������t��b��8����d_J���� Gi������C�������~���������Lb9�E�Y��F�����_G�O�p
T.ws�V����\>�
@1���L�d��(!a��T�SGT>H{������	O�`^e��k��X��YH�����@��d(���y?A`�H���5������$$�P��U��':#�#��Bz�+����Jk��W��^/�)���|M7��*`L2���y�IDAT��)��9;5���0Y'�����|T�f���W�HE����Q�Bb[,�E����HW�YJ�`��ZI���K1�K�38]F�P�N�c�#��@C�^AZ��J!j�l(�w�k��P j��.�3�*��<�"�,<vP�g�(�)]�"�2�p�p������oX�(�N���X�?�
�q\$@E����	��s2�e���3��u���@�2�0�d�
�\�Ly�8m+D�T���@G�<�MfC�sM�(��*�x���|�*�..c��O�<x9�%�r��p��Q�\���S^�C��*]*.�b1���	�:K�����Q��Q`eT�"��M}�p�RI0p2(��@4L���@K���OG�E]g�v���&A�2z3/�X�0OI�J*w��\��4�����p�������<�� P��[��K �������������&w/�$VV�w�����4������O!(�h8��Pl#>L��%(�8m"*��m�S<��t��J%��_�����+h4'2~�q�����J�3U;_#�y�I/]	f��>�����v�?�?��g�?
&�L���N`X@�:s���3"���`P��W���o��`J��y�K�E��#���+��]�k�{�ig����h
�X������
\�����~HH��F3�R|����U����D	��t=M�FH�#�+W��@�� �i'E\cK�$i����[�'z��b�3<0�`Z�]��������Fx��R�NO��#�� hdS ����p2���������o����[�'~��

#�_�8g��|�J�"�i�������J`�~�d8�c���|�!�]��D���a/�`0��`0��`0��`0��'j�w:�������s�<q=m����1��M�p��8�>j�a#��'f������������:����,������25�i#����my������<��O�����l���lM��O��j��~�m>��s�_J�����%au�\�[�LB���?�w}#���x;�On�F����m����[��{��-9	|�i����tq�~IX���fu�<y�}#O�����y�E
[62���J#&Fl��,[6�vp��
�����j�fp��$��y�����^Q���\�����e#����P��@N���=����U}�F���;�����p��&��T��,��DU+���l�����g����g[6r}L���M������u�F&�B�m�{_�n��;�8�7�	�#���d�F�}�%�����Gr>����w�md��ff�F�K�M9O|�_�����k�}�q�v����G����1�e�F�<O��d��|:0��p�K��{Z6m�U����l��������AdV�myN�c�����?��{�����%a��|p��$�����a#������l�6�2��-q������[6�q��l���+xZ\��o���v�%au�=C���=��$���>��q#���@[52�u������>��������/[72o.�����m#�E����������
��;}
m��8��������>
����p������F�+�������1��`0��`0��`0��`0��`0��`0��`0���5��z��=��IEND�B`�
logtape-master-zoomed.png (image/png)
logtape-patched.png (image/png)
logtape-patched-zoomed.png (image/png)
logtape.plot (text/plain; charset=UTF-8)
logtape-trace-master-16MB (text/plain; charset=UTF-8)
#7Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#5)
8 attachment(s)
Re: Tuplesort merge pre-reading

(Resending with compressed files, to get below the message size limit of
the list)

On 09/09/2016 02:13 PM, Heikki Linnakangas wrote:

On 09/08/2016 09:59 PM, Heikki Linnakangas wrote:

On 09/06/2016 10:26 PM, Peter Geoghegan wrote:

On Tue, Sep 6, 2016 at 12:08 PM, Peter Geoghegan <pg@heroku.com> wrote:

Offhand, I would think that taken together this is very important. I'd
certainly want to see cases in the hundreds of megabytes or gigabytes
of work_mem alongside your 4MB case, even just to be able to talk
informally about this. As you know, the default work_mem value is very
conservative.

I spent some more time polishing this up, and also added some code to
logtape.c, to use larger read buffers, to compensate for the fact that
we don't do pre-reading from tuplesort.c anymore. That should trigger
the OS read-ahead, and make the I/O more sequential, which was the
purpose of the old pre-reading code, but in a simpler way. I haven't tested that
part much yet, but I plan to run some tests on larger data sets that
don't fit in RAM, to make the I/O effects visible.
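
To illustrate the idea of larger read buffers, here is a minimal standalone
sketch. It is not the logtape.c change itself: the names BlockReader,
block_reader_init, block_reader_next and block_reader_free are made up for
this example, it uses stdio instead of PostgreSQL's BufFile layer, and it
assumes the blocks of interest are contiguous in the file, which blocks of a
logical tape need not be. BLOCK_SIZE mirrors the default BLCKSZ of 8192.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCK_SIZE 8192         /* same as PostgreSQL's default BLCKSZ */

/* Hypothetical reader that fetches several blocks per request. */
typedef struct BlockReader
{
    FILE   *fp;
    char   *buf;                /* room for nblocks blocks */
    size_t  nblocks;            /* buffer capacity, in blocks */
    size_t  valid;              /* blocks currently held in buf */
    size_t  next;               /* index of next block to hand out */
    long    refills;            /* number of fread() requests issued */
} BlockReader;

/* Returns 0 on success, -1 if the buffer cannot be allocated. */
int
block_reader_init(BlockReader *r, FILE *fp, size_t nblocks)
{
    assert(r != NULL && fp != NULL && nblocks > 0);
    r->buf = malloc(nblocks * BLOCK_SIZE);
    if (r->buf == NULL)
        return -1;
    r->fp = fp;
    r->nblocks = nblocks;
    r->valid = 0;
    r->next = 0;
    r->refills = 0;
    return 0;
}

/* Returns a pointer to the next block, or NULL at end of file. */
const char *
block_reader_next(BlockReader *r)
{
    if (r->next == r->valid)
    {
        /* Buffer exhausted: ask for up to nblocks blocks in one request. */
        r->valid = fread(r->buf, BLOCK_SIZE, r->nblocks, r->fp);
        r->next = 0;
        r->refills++;
        if (r->valid == 0)
            return NULL;
    }
    return r->buf + BLOCK_SIZE * r->next++;
}

void
block_reader_free(BlockReader *r)
{
    free(r->buf);
    r->buf = NULL;
}
```

With a buffer of N blocks the reader issues roughly one request per N blocks
rather than one per block, which is the kind of access the OS read-ahead can
recognize as sequential.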

Ok, I ran a few tests with 20 GB tables. I thought this would show any
differences in I/O behaviour, but in fact it was still completely CPU
bound, like the tests on smaller tables I posted yesterday. I guess I
need to point temp_tablespaces to a USB drive or something. But here we go.

I took a different tack to demonstrate the I/O pattern effects. I
added some instrumentation code to logtape.c that prints a line with the
block number to a file whenever it reads a block. I ran the same
query with master and with these patches, and plotted the access pattern
with gnuplot.

I'm happy with what it looks like. We are in fact getting a more
sequential access pattern with these patches, because we're not
expanding the pre-read tuples into SortTuples. Keeping densely-packed
blocks in memory, instead of SortTuples, allows caching more data overall.

Attached are the patch I used to generate these traces, the gnuplot
script, and the traces I got from sorting a 1 GB table of random
integers with work_mem=16MB.

Note that in the big picture, what appear to be individual dots are
actually clusters of dots. So the access pattern is a lot
more sequential than it looks like at first glance, with or without
these patches. The zoomed-in pictures show that. If you want to inspect
these in more detail, I recommend running gnuplot in interactive mode,
so that you can zoom in and out easily.
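
Besides plotting, a trace can also be summarized numerically. The following
is a rough sketch, not part of the attached patch or gnuplot script. It
assumes the trace format written by the tracing patch ("0 <blocknum>" for a
read, "1 <blocknum>" for a write, "#" for comment lines). The TraceStats type,
the function names and the TRACE_SUMMARY_MAIN macro are invented for this
example, and counting a read as sequential when it fetches the block right
after the previously read one is only one crude way to measure locality,
since reads from different tapes interleave.

```c
#include <assert.h>
#include <stdio.h>

/* Running summary of a logtape trace (hypothetical helper). */
typedef struct TraceStats
{
    long    reads;              /* lines of the form "0 <blocknum>" */
    long    writes;             /* lines of the form "1 <blocknum>" */
    long    seq_reads;          /* reads of block N directly after block N-1 */
    long    last_read;          /* block number of the previous read, or -1 */
} TraceStats;

void
trace_stats_init(TraceStats *st)
{
    st->reads = 0;
    st->writes = 0;
    st->seq_reads = 0;
    st->last_read = -1;
}

/* Feed one trace line; comment lines and unparsable lines are skipped. */
void
trace_stats_add_line(TraceStats *st, const char *line)
{
    int     kind;
    long    blocknum;

    assert(st != NULL && line != NULL);
    if (line[0] == '#' || sscanf(line, "%d %ld", &kind, &blocknum) != 2)
        return;
    if (kind == 1)
    {
        st->writes++;
        return;
    }
    if (kind != 0)
        return;
    st->reads++;
    if (st->last_read >= 0 && blocknum == st->last_read + 1)
        st->seq_reads++;
    st->last_read = blocknum;
}

/* Summarize a whole trace file; returns 0 on success, -1 if it can't be opened. */
int
trace_stats_from_file(TraceStats *st, const char *path)
{
    char    line[1024];
    FILE   *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;
    trace_stats_init(st);
    while (fgets(line, sizeof(line), fp) != NULL)
        trace_stats_add_line(st, line);
    fclose(fp);
    return 0;
}

#ifdef TRACE_SUMMARY_MAIN
int
main(int argc, char **argv)
{
    TraceStats  st;

    if (argc != 2 || trace_stats_from_file(&st, argv[1]) != 0)
    {
        fprintf(stderr, "usage: %s <trace file>\n", argv[0]);
        return 1;
    }
    printf("reads=%ld writes=%ld sequential reads=%ld\n",
           st.reads, st.writes, st.seq_reads);
    return 0;
}
#endif
```

Compiled with -DTRACE_SUMMARY_MAIN and run on a trace from master and on one
from the patched build, the ratio of sequential reads to total reads gives a
single number to compare alongside the plots.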

I'm happy with the amount of testing I've done now, and the results.
Does anyone want to throw out any more test cases where there might be a
regression? If not, let's get these reviewed and committed.

- Heikki

Attachments:

trace-logtape.patch (text/x-diff)
commit 2d7524e2fa2810fee5c63cb84cae70b8317bf1d5
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date:   Fri Sep 9 14:08:29 2016 +0300

    temp file access tracing

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 05d7697..cededac 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -77,8 +77,20 @@
 
 #include "postgres.h"
 
+/* #define TRACE_BUFFER_WRITES */
+#define TRACE_BUFFER_READS
+
+#if defined(TRACE_BUFFER_WRITES) || defined(TRACE_BUFFER_READS)
+#define TRACE_BUFFER_ACCESS
+#endif
+
+
 #include "storage/buffile.h"
 #include "utils/logtape.h"
+#ifdef TRACE_BUFFER_ACCESS
+#include "storage/fd.h"
+#include "tcop/tcopprot.h"
+#endif
 
 /*
  * Block indexes are "long"s, so we can fit this many per indirect block.
@@ -169,6 +181,10 @@ struct LogicalTapeSet
 	int			nFreeBlocks;	/* # of currently free blocks */
 	int			freeBlocksLen;	/* current allocated length of freeBlocks[] */
 
+#ifdef TRACE_BUFFER_ACCESS
+	FILE *tracefile;
+#endif
+
 	/* The array of logical tapes. */
 	int			nTapes;			/* # of logical tapes in set */
 	LogicalTape tapes[FLEXIBLE_ARRAY_MEMBER];	/* has nTapes nentries */
@@ -211,6 +227,9 @@ ltsWriteBlock(LogicalTapeSet *lts, long blocknum, void *buffer)
 				(errcode_for_file_access(),
 				 errmsg("could not write block %ld of temporary file: %m",
 						blocknum)));
+#ifdef TRACE_BUFFER_WRITES
+	fprintf(lts->tracefile, "1 %ld\n", blocknum);
+#endif
 }
 
 /*
@@ -228,6 +247,10 @@ ltsReadBlock(LogicalTapeSet *lts, long blocknum, void *buffer)
 				(errcode_for_file_access(),
 				 errmsg("could not read block %ld of temporary file: %m",
 						blocknum)));
+
+#ifdef TRACE_BUFFER_READS
+	fprintf(lts->tracefile, "0 %ld\n", blocknum);
+#endif
 }
 
 /*
@@ -602,6 +625,16 @@ LogicalTapeSetCreate(int ntapes)
 		lt->pos = 0;
 		lt->nbytes = 0;
 	}
+
+#ifdef TRACE_BUFFER_ACCESS
+	lts->tracefile = AllocateFile("logtape-trace", "w+");
+	if (lts->tracefile == NULL)
+		elog(ERROR, "could not open file \"logtape-trace\": %m");
+
+	fprintf(lts->tracefile, "# LogTapeSet with %d tapes allocated\n", ntapes);
+	fprintf(lts->tracefile, "# Query: %s\n", debug_query_string);
+#endif
+
 	return lts;
 }
 
@@ -630,6 +663,11 @@ LogicalTapeSetClose(LogicalTapeSet *lts)
 	}
 	pfree(lts->freeBlocks);
+
+#ifdef TRACE_BUFFER_ACCESS
+	/* close the trace file while lts is still valid */
+	(void) FreeFile(lts->tracefile);
+#endif
 	pfree(lts);
 }
 
 /*

logtape-master.png (image/png)
��c$H����"	\"R�b���0Z����~��x` �P(��&�l����` ���e�#�#2]M���LqW�Q��;��"mv&r~g����2��8����L^�p�x�	@	|�`fo�<_&I#]������t��b��8����d_J���� Gi������C�������~���������Lb9�E�Y��F�����_G�O�p
T.ws�V����\>�
@1���L�d��(!a��T�SGT>H{������	O�`^e��k��X��YH�����@��d(���y?A`�H���5������$$�P��U��':#�#��Bz�+����Jk��W��^/�)���|M7��*`L2���y�IDAT��)��9;5���0Y'�����|T�f���W�HE����Q�Bb[,�E����HW�YJ�`��ZI���K1�K�38]F�P�N�c�#��@C�^AZ��J!j�l(�w�k��P j��.�3�*��<�"�,<vP�g�(�)]�"�2�p�p������oX�(�N���X�?�
�q\$@E����	��s2�e���3��u���@�2�0�d�
�\�Ly�8m+D�T���@G�<�MfC�sM�(��*�x���|�*�..c��O�<x9�%�r��p��Q�\���S^�C��*]*.�b1���	�:K�����Q��Q`eT�"��M}�p�RI0p2(��@4L���@K���OG�E]g�v���&A�2z3/�X�0OI�J*w��\��4�����p�������<�� P��[��K �������������&w/�$VV�w�����4������O!(�h8��Pl#>L��%(�8m"*��m�S<��t��J%��_�����+h4'2~�q�����J�3U;_#�y�I/]	f��>�����v�?�?��g�?
&�L���N`X@�:s���3"���`P��W���o��`J��y�K�E��#���+��]�k�{�ig����h
�X������
\�����~HH��F3�R|����U����D	��t=M�FH�#�+W��@�� �i'E\cK�$i����[�'z��b�3<0�`Z�]��������Fx��R�NO��#�� hdS ����p2���������o����[�'~��

#�_�8g��|�J�"�i�������J`�~�d8�c���|�!�]��D���a/�`0��`0��`0��`0��'j�w:�������s�<q=m����1��M�p��8�>j�a#��'f������������:����,������25�i#����my������<��O�����l���lM��O��j��~�m>��s�_J�����%au�\�[�LB���?�w}#���x;�On�F����m����[��{��-9	|�i����tq�~IX���fu�<y�}#O�����y�E
[62���J#&Fl��,[6�vp��
�����j�fp��$��y�����^Q���\�����e#����P��@N���=����U}�F���;�����p��&��T��,��DU+���l�����g����g[6r}L���M������u�F&�B�m�{_�n��;�8�7�	�#���d�F�}�%�����Gr>����w�md��ff�F�K�M9O|�_�����k�}�q�v����G����1�e�F�<O��d��|:0��p�K��{Z6m�U����l��������AdV�myN�c�����?��{�����%a��|p��$�����a#������l�6�2��-q������[6�q��l���+xZ\��o���v�%au�=C���=��$���>��q#���@[52�u������>��������/[72o.�����m#�E����������
��;}
m��8��������>
����p������F�+�������1��`0��`0��`0��`0��`0��`0��`0���5��z��=��IEND�B`�
logtape-master-zoomed.png (image/png)
logtape-patched.png (image/png)
logtape-patched-zoomed.png (image/png)
logtape.plot (text/plain; charset=UTF-8)
logtape-trace-master-16MB.xz (application/x-xz)
���&��'��h?����t���Rrv�R�b��@r���y�9��������0O�`USi��!H
������-~�H�x?r�[��|�6k���S{ �=V���4���M�o`'d>7X�*�����=�m��u-���jr��2�VY:iYc=�>�A������	���-������{=^x�:��H|C�|�5n�+f���h��56d�K��1W��>2������*�E������������uI������\���C|y\����'��J=��O�*��vH2�����RP��[�0��S��:�'U���46[0�(cB!}��+E4��3�8�;���%�c�5��c�1� +{�D����7)N�
�E	I�.�g��`{a���`� �,����E&c�Y7����q�Esy$�U�(�8�CU3x�+N���(fhEL ������oQ�&E�1+�&J|F$�W�m����+<�=�8\�m�_�?���4��s�vz�U�GCt�T�y�b�~�A�in�E:~���#���{����&Iz�����9�����F��a��,T�R�f��O�fy	G�LYbm`�\0�.���nL') �����$��$j�V�����h*�W3�:��	�\g�[�!���4��(>�=B]Mu��e�Z��f�V_sZ�=>�.~����\���^[UJ�J���/(�	�7g�ZGG^X��'1��>v!{�Y����(�e�O����
�a]pk�(���m
0�4�6(�37zN�����%������f
�0��uT!�4qw�c>�-�6rY�*�)�HzJ_�� �����������{���y����\q�������3����y�d����|SN-z�C�[��r��R��)������j-B��N����]����]�I���i_���6uo�?�gQ��YW^����3jv$qOvD���Fyt#���0W�k�6��w6�$��d�e��Q�^��>g�Q��c����t�ZiUo@��.��i�uS��T-*H�JZmi��%U�h.{�-�@�AZ�XMK0���K������dAFy���9d��s�����(��+���H�[���*}���A3/�O;B��C+�V����`v:�H�$c����l���@��z:�������se����za*Y�=�3~Q�(�Tvj5��Y�U��!O��?k�����v�j��h����,<�-�����M��
��G����bY�b�,�~��N����0�`���o7���.����K��R�0b�g���;�xby��U*���H��O�z��Q���M�ebl%���n6YBm��+�v���<eY�$�O/��z7sN�Y�hE�����+{]��+����4�C�/���B��U����������yG������s���`,��~lpG7�S]�����E�H�q�B���(1������P�~D ��C?rl��ft� �<z��]���WC}��|���	k��WJV��A:��:e+���>�!��%�{�3Y�Y7��L��z�["���"����[D���BG�b��257W40I.����1�������(�����%�*%����*�����E�1#I����s�)�����b��9��i���� �Xe�o�P�vj��[����5�vA�Pfj`a���A)O���5L{�����Q}|!l�m�s�Y~��A�l6*�V��oA�pjC�8���z�'�H\�@����^jq�`���N$�}s�%AYc��{�{|��>WW����}�'���J��x�q��p* �5���.���>�g���Q��i�x�EP���7�KR�Z.C�:��A/�|)��z�T�����-����b��3�~�,��K� ����� �����^�����-�S�|"dA1��l�9"\���P�8�H��Tk��%��L��\�e�F>��
;�q9d���)K�*^��]���tr����}���{���1�#M�_������W�{[#����57Q�$���.}B�Y��D]rU�a8��4���	Lpd[/�N:N]s0O�C1������X���'J�G���%1���P=R�G�e�����!>�Tr�
�|c�E��Nn,����.��C������A�]��ADLx��<����{�@�����>dd��B���>�K�?����	rS���TA�:PU��$�9~.O���6���m~��w����8n�f�,|��e�"O����(�����0?�Y���G��`����D��iY$�)��3-��d�G���D[��Vv�� C�Z�</���.��u�Y���o��N],)'�d�Qk��+���G'�ZT3��#�F���U�������02*C$ip������������;���6:8��4��=,��B 	���IWg��5��
��Z����*�.?�wRr1��'��`��p>w�N�(b��Y�q~�������y�D����9o��?� �@�6X�4Q������i�3@���
�ws�Dc5��w��QH{3��


su,�f����5���RV<{������������4
���FO����*�|��'���6��)6�1@~����)w�5��V"(�i�7�G��z��A�]����\S���)X��x1pHDf��8��%�v���i4�H}��B�Wo Y"�L��W�gR�1�Nl�[
_��]�D�����������sM�|�����C�������P8���w�M�*Y������������SF�������X��2P��@�������������Z�U/�.DTO�:zs��UckJ
�G��w��]�Y�97����%�4�s\b���5&
��H����
gn�����Rs!�^����>1�PSqzAv�cK��K��?���D7��4\��;��:V��{�~^�	$�k��v�����8�_�		H$�5]��)~$�W�����/�h���6���'3,,�1��$T���m��Ai@/]�����Y������*x)��hM����>>���
���yP�H��`���0]v�8o�@�@�u1��&�m3?Z9u��x*�X7T$��^J�8Ve��<���"�1\n�5-;9���Rb6��4���{�Ve��y���Y��Cy����������@O�cG�-(����t�y�q�RG�Rc1~e����SJA�{9�>�k�9����K+����1u��:i�] ��]!/?�7�JFT��v����%�jMJ��ckiWsG����-[��~��
M�w�����Z8���D�@�S\�q0�<b�5s6��p��=F{(�S����l�����0F�}��p=���7��S<��+�GX��T�Y��F�bU�����!S+W��z1d������k^�����C������I��,�5����9~��)xDq�8�����R/��M��������)��?-��M��j�:�C�Z�'R|v��0�/#�B�F�p#C��>�NoNXZh�9f"f��#�._��%Q��#f����j[�)��������1����[�c8A����F�V'���@UP+�X�u��v"=y��:�6��.�5��.���R5�S��3r���X��9���~�8GK~��j�s�����=�����%��(��
��GkdA���,QS��Z����h���&���c�����	�#����=��
�x����>�U���N5���9�HU��U�����7�Q���D���G/`o���7�m������x�����zl��Ij8-���sP���9f_v���7�����G2(�qN~)A
$���&��8d,0��)t������=�[x >��<�N�v���w��>��S'�zi:�f��X����zm8����V�p����(J�V8�� �?�����$�(G����u�7��~��V�i�z��$�����F��	�#y��tRL�.(��@��NL��$�Ui�p�7nC�i�'/�,��W5����Q^"�qa�E?:R���}	w��|`@����'�t� "���{��e�����j
(�NVJ�d���L5�3���I�$�s�_�3E�&�MJ��hn�FP��#�����st�Q�KJ��,s���u&���� �B1����u��S�����
�����XZ���3���������~c�d2�B�8�*�5x�&��`O����R��?���N,Dm�NE��r#:=u�Z�C���S1*�v$������"_x*4���-���"������1$�\��I���2����|��RPw�	c��_�d��BHU� _���v4A"��+��,c)
�q n*(��n�R�A\�����U��'�^T�I�*��V�-��/K7
t\S�3zSd5=��*\�����S{d*�b
�(�o.�B�.Oy�f+�!$�I4��T������@�v`���z�c��?���t,c���k���
�V
����f�g�j���eL<O�������-"����W�1z���(<�R�����}���O�;O��6EU�#���:�4o2Y�h�t�wE����vsNT��R�����swO�R����qdX/���r\@}|X��|�0����I;��'!���0-;K����wB�V:\�Qp����b�!AX)crm&���)d��a�j����O�Ox�oz}��#�����N��z-����%?����R�N�I$���C�1i����ul��IWn,V����>����CB���%)5K�s��z+���A5�|�����������2�]��&��aw���(���;���R]�,R�������C d�Y��K��|�a�p�v�M�����N>�Cl��eM
jw������,�.�B�wIRFv!���}[�T��#.dE���7��/u6"����=���R�vJ���m��?sL%<���&��V�����4e�5.G%��
�G��n������d�.�&�L�^���U=��7�<o�}W�|N�.c��";�*A�U���@����9LI&��oMV�������D�1c�:�,#�)j��12`���|��9��mv�����Elxw�"����J`k�\Ny� ��y ��������%��M�f����-� V��?��R
�r��3WO;/��%��'�*��J�]K�x���a��+ �LC��:�9�)��rI����(��)�BW���3��������9�������<|���IL����w6�|�q��{kS�a!��I�R�^A��e���^jt���P��m[{��}89���`��^�G��^�O%�b�I�)����3_�����](m(��G��_k����3���v�_��Rn���9�m�S�&E����c+f��C)lX��,��2c6�zQ����C
���Ap�*�VR�W�[����]���pX�@%�SA'��0����2,'���eK�R�y������)�F#����!�c���J���`VNW8�=5=���V�3F("��~�p=�
�l\��H����D�������ON�����Vw�T�R]��ZO����L�v����� ��_&I���OZ{�� �WW�d��m��4�����B
���mA<��J?D�����(�
[u\���uT�mOpO:v	W�����8���xy���`\-�-5����`= A���j�3~�7<�������^?�����qf_��0�����/$bo��he�������.���#��:�|�}'�@�����g4�p�8ll���]���_���h�!d�H4&����-ih�@���g�6.`���x���]����K9���Sg����/$V������b��iHS8��
�`���� ~0�(������l!��p��TART
*~��R�6��_}��_vzLnF+�)�r�����V+[�L5M��>�QT;���������;����58��tU��M��)���G�Q9����7\��Q��k+Q`�1���B�\p�M�,rp����1�U�o+q��?L�lA��e��6 Y�J��Y����~M�Q��\yw7��4i����PJ�=1�s	��{/����7����/2&�U]���/��J�����/\UD�����8y�@�'k�0�oC�*�
��(�~ ��e|W;���F��j`�x)#���Q-:���!v�Qs��m��$M�!R��)���c��'p5�����ZC��sSd��R_�Ut�/�	��@
��Pv}����.�^�P�J�hm�K>�Jp�g:S�i#M��?aZW��Q'��J�f���W�i��1�"��s�e�0������Qx������Pe|���y��r�1p�����yUJ�A�������d��jcgL	�"���@M/�Q�R��� F3+�?")��r�X�{��R���}9����%����h����tt�w��w������^qG	���7��=�)}s�^~>T�;.a�����E�+���y���)�j����5�� ��2�9%Wc��F *3�X��X������k[�.�����g1�����k�C�#���H�M5�K�3!������$���8������X���`�l���>��a^y��buk�:�
��e5<���9�r�d!�X��#��y �����lq�����'wxSF�g�����h"�����,W	u�.���BS�A�w�q�3^���>>%�t�L�6$�*����	�z��Dz7$80��	M�m9�v��5�a�T���Q&i���Y��!���
��5b�<�v�(�O�~[{`^h��l�������;�=�H�:�F�����,�U.�(���g������]�C�d���=V/H�/�qJr�}Q�$#����&���������*����m:/�W�i�J+O�A���0�.d}�x8�>N��+f���%<�	<%|N'�����i�'�"�T��u�������T�*�A��_*�#N�|�0\��M�7g��?f<���<8�f��\(���`�T�_��	BW�D�yd��k��5K�tA�]�
�O������?����a��8�HF[��\��wDU�&�
�����R0�-��"s-���]���_��G����3Jm���{V$���;��9����'F�g�XZ�4�5BA!M���
�rCq�
���ekvs]0*DP:#�XR�1��.~oM���y2��8����zF�?���C���_��m|w�5�_PO�~s�*
~-����^�0|�vK#���t�j��:�e�.=�3,�5���?�I��#R���/�������a�f����[�'m���<\�����:�=--t�L�.qyy$��q3�����<�?�x���{�3�B��MpI+���y�.b�c5�g<TaA��	t�5`�y�JJf?��]�
�� sa�W��F����S	?Cz������9��[��kQ~i��l���e+/aj����IS���U���XU�V���+�~�'�
�}���8�P��cM��ZC��s-MKq�M����G��v`�J������-!���mE`��>���	���1zm)�\b�E3���*zU�2�f)5�nq���z?$���Ilye�y�F%hH��;aM���4&�7�bM�lA��>�������;0�l�]_�U��eNj�X=�9��NG��.}��M��S�p�1�(�\�5R����,�_X���8��m��������k{��6S?��Z}�R+��z���^�n����
�?�n�($����v�D����`��)Dv�?��E�$��".{����6���d����!����H���p�O��k���gm���1��f�#.x���X�$[E�����l��h�'3?�e����`%wS��.z�)���" ����C�u�~n2[���������89I����nn7u�T�������g>�������9?B������}V�����H����n;k|�����&X{����@.F������HN$}���P�|M�%�{�4R7�1=���-��k���]�xAIo����]��U�r$G�jQU��!l�	�������T>�ro�����{����8�q�r	�h�I39}c�`c�TE@>mq��-K��0�~����/�A�$a�'��p�h����Zc'�rJY��b�����`j5SB���Z��#G�6��m��@7j7k�#0��#-��
��)D��*�0���Fb`0wL����J�`�T��������3
�/4����*o=A�����9�/�6��F*��h��V�
�_
(�\�t)��9W=b�{�E��Xh���xw ����x�k
�.�:�� ZJ�(�6+'�-��>�'�E����h����Vl�.B��d���i��{X�5J��9E'A%�0,��2
�6f�������;Q�p���;4qk��g�!�+�',� .	+�5�nbi���E�>N?�@�_�O��R�J�[q"�����x�A����8E�	�5ga2{_q��9UjSv>�
����LT<V�L�[-�-�(������N��~?s��_��n��5��u��S
�2	s
�
$Iq�����Z�7��ws��D�a�5�����My���%���5�/���D��.���-�h+{�74���>��s
�����0A������^ix�{�:m�|����uY�0a��p�^B�(��~���l��+�/R7*��
��=4��p�Tt`����$�f���
���d�c����<�G\�������\K�p<*��&�K�8��#�O�D��U���������X��l�����s��,�c�9�Q�<�Pf�����5;��e�V���W�������u��T�����]W��0������U��u_����0�9��QP�?�h��a���pND���s�h���/H�~�S#����^4&H6�'��+����Uo��Q#�%�4�w_�t���%�;"B��zy=�&��\{�KU����h�C����F("�����E,��(8A�!G�o����7��/�`�N���#$��^�bq����U�Y��%{Rr���2�K���}���P(��(�^��TS�k�QW//������:���J�-T~O���['�!��"&x���v��C����\���tL����g��s��r������XO������c�V���_V���M�J��,�R���r�0^�v�����ie�����=�D���m��
��j��E��ui1������j����j�� *�]���,�g�?J�m!�C�k����kM\���rl��'������mbf�L�-��p�3�Uau���g��Q<��D���v���d�`��HHp��l��� ���b����2W���O�`[B��j��kW�}��C������kjT0~�:CS;g��rC�PCwR�[C��]�|�57�1�(��3�R�Q�����{@u����Jc=��6�&�D8LHKc�bi��Y9�i���Z����&��b�d2�t��������i��t��	�A��hk�@�x��\F�Wy�1�rb�q�WX>�_aC��/;��o��W8��6Io������=�D���7�+#��)N�S�D9��m�D�3sr�]��A�+z�hO��h��=�0���B^�If���E	�;��j�#���c����tY�T���4�q'��H{��0���t��.���qb����|LV�Xg�\�E��S�(�Xt����0�Po���������IYl�f���n��/&#Y��G�����bnN�Q�oGpG�N�����[��^��df(�$4�ef.�s�F�X�^���c�y���0��J��O�ka�g�$K�'�W|}/�X�n r�3��)�D��Y���XUq�p�9���U��^R���d���6y��PR��^|I�Bba=��eJ����{[��$��P�CL!	Ee�t&�����y�5��qh�W=�\����kW�wQ`H�ZG����6���q �����`�4�p{Es���wRn�T~vI#J-0��/2!t�H�]��<<m����w����]H�Xz=�WF�+���.�J�'���N�N�������\���A��Ts��������6�wt�AW�"��b��W��%���4�j�aF�G�b��@t�~~c��=�V&���g�nzZ�UO�}������o�$�Kb����(h/G�������I����J(���>
#01q�^��+�����:��A���S����������m�a�i�������'I������?y�e�
������!� <y�C�'t�
u����/e�$`cEt��L�����e��_X�g��e�=��x8��J23��=,>�������C��N
�N�"[��f�����[�q��rs��d�60����3T�������<�?r�[9P�J	a������N�2��`��f�Y�E{g��+�g��w��1�
�{9�Z'dS���8�������PA�5�c���}���^��X��5s���"�6��.��$���������I�����a��M*E����*s��~�@��k�j������W[P#���7�~�-���}Lo������0����V�����
q���>�HS�E�
BL=1�vp8����D��e�1?K����_���R���� ����9I�[�J��U�U�sr`?��;�t3E�TR���x���U���DHS:[��7B;!������PD���d-�?vZ�����/�0��jJS�Yc���+�F�1�-BY������"�ID���C������ox9�k�h�(B����k��x%v�@mJ�5�/�������o����I�
��&f����B]81�)�<:����0������������M�@G�#�{)f��N��F��o�� *z��k�k����x)�v���~/e�����Md�����k��W�������f���Yx���.�;�2���4`G/���|��+��!f�j4��R$I�|�/�52��;�jK���������w�������������B�|�w�r�����@��K/mQ[����_�pp����!��v�+l,?Lzv�&G���@O�`�c�S���Mh�h*� ����1��w������'�~���?�"���+����'�Ioo��g�YZ
logtape-trace-patched-16MB.xz (application/x-xz)
#8Greg Stark
stark@mit.edu
In reply to: Heikki Linnakangas (#7)
Re: Tuplesort merge pre-reading

On Fri, Sep 9, 2016 at 1:01 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I'm happy with what it looks like. We are in fact getting a more sequential
access pattern with these patches, because we're not expanding the pre-read
tuples into SortTuples. Keeping densely-packed blocks in memory, instead of
SortTuples, allows caching more data overall.

Wow, this is really cool. We should do something like this for query
execution too.

I still didn't follow exactly why removing the prefetching allows more
sequential i/o. I thought the whole point of prefetching was to reduce
the random i/o from switching tapes.

--
greg

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Greg Stark (#8)
Re: Tuplesort merge pre-reading

On 09/09/2016 03:25 PM, Greg Stark wrote:

On Fri, Sep 9, 2016 at 1:01 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I'm happy with what it looks like. We are in fact getting a more sequential
access pattern with these patches, because we're not expanding the pre-read
tuples into SortTuples. Keeping densely-packed blocks in memory, instead of
SortTuples, allows caching more data overall.

Wow, this is really cool. We should do something like this for query
execution too.

I still didn't follow exactly why removing the prefetching allows more
sequential i/o. I thought the whole point of prefetching was to reduce
the random i/o from switching tapes.

The first patch removed prefetching, but the second patch re-introduced
it, in a different form. The prefetching is now done in logtape.c, by
reading multiple pages at a time. The on-tape representation of tuples
is more compact than having them in memory as SortTuples, so you can fit
more data in memory overall, which makes the access pattern more sequential.
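
As a rough illustration of the space argument (a sketch, not code from the patch): DemoSortTuple below only mirrors the fields of SortTuple in tuplesort.c, DEMO_BLCKSZ assumes the default 8 kB block size, and the per-tuple allocator overhead and the size of the on-tape length word are assumptions passed in by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_BLCKSZ 8192		/* assumed block size (PostgreSQL's default) */

/* Hypothetical stand-in for tuplesort.c's SortTuple; same field layout idea */
typedef struct DemoSortTuple
{
	void	   *tuple;			/* pointer to the separately allocated tuple */
	uintptr_t	datum1;			/* first key column */
	bool		isnull1;		/* is the first key column NULL? */
	int			tupindex;		/* source tape number during merge */
} DemoSortTuple;

/* Round a memory budget down to whole blocks, but never below one block. */
size_t
demo_round_to_blocks(size_t avail_mem)
{
	if (avail_mem < DEMO_BLCKSZ)
		avail_mem = DEMO_BLCKSZ;
	return avail_mem - avail_mem % DEMO_BLCKSZ;
}

/*
 * Tuples that fit when each one is expanded into a SortTuple slot plus a
 * separately allocated copy (alloc_overhead is an assumed per-chunk cost).
 */
size_t
demo_tuples_as_sorttuples(size_t mem, size_t tuple_len, size_t alloc_overhead)
{
	return mem / (sizeof(DemoSortTuple) + alloc_overhead + tuple_len);
}

/*
 * Tuples that fit when the same memory is used as a raw tape read buffer,
 * where each tuple costs only its length word plus its payload.
 */
size_t
demo_tuples_as_tape_blocks(size_t mem, size_t tuple_len, size_t len_word)
{
	return demo_round_to_blocks(mem) / (len_word + tuple_len);
}
```

With a 64 kB budget, 32-byte tuples, an assumed 16 bytes of allocator overhead and a 4-byte length word, the raw-block form holds more tuples, because it pays only for the length word per tuple while the expanded form also pays for the slot and the allocator overhead.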

There's one difference between these approaches that I didn't point out
earlier: We used to prefetch tuples from each *run*, and stopped
pre-reading when we reached the end of the run. Now that we're doing the
prefetching as raw tape blocks, we don't stop at run boundaries. I don't
think that makes any big difference one way or another, but I thought
I'd mention it.
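
To show what "not stopping at run boundaries" means, here is a toy sketch in C. It is not the logtape.c code: DemoTape, DEMO_BLOCK and demo_fill_buffer are made-up names, the tape is just a flat byte array, and the real code finds the next block through the tape's indirect blocks instead of a simple offset.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_BLOCK 4			/* tiny block size, to keep the example small */

/* Hypothetical flat "tape": runs are simply laid end to end in data[] */
typedef struct DemoTape
{
	const char *data;			/* all bytes on the tape */
	size_t		len;			/* total number of bytes on the tape */
	size_t		next;			/* offset of the next unread byte */
} DemoTape;

/*
 * Fill buf with up to nblocks blocks from the tape. The loop only looks at
 * buffer space and end of tape, so a single fill can span two runs.
 * Returns the number of bytes read, or 0 at end of tape.
 */
size_t
demo_fill_buffer(DemoTape *tape, char *buf, size_t nblocks)
{
	size_t		nbytes = 0;

	while (nblocks-- > 0 && tape->next < tape->len)
	{
		size_t		chunk = tape->len - tape->next;

		if (chunk > DEMO_BLOCK)
			chunk = DEMO_BLOCK;	/* the last block may be partial */
		memcpy(buf + nbytes, tape->data + tape->next, chunk);
		tape->next += chunk;
		nbytes += chunk;
	}
	return nbytes;
}
```

If the tape holds a 5-byte run of 'a' followed by a 5-byte run of 'b', a two-block fill returns bytes from both runs in one go; the merge code above it is what decides where one run ends and the next begins.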

- Heikki


#10Peter Geoghegan
pg@heroku.com
In reply to: Heikki Linnakangas (#6)
Re: Tuplesort merge pre-reading

On Fri, Sep 9, 2016 at 4:55 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I'm happy with the amount of testing I've done now, and the results. Does
anyone want to throw out any more test cases where there might be a
regression? If not, let's get these reviewed and committed.

I'll try to look at this properly tomorrow. Currently still working
away at creating a new revision of my sorting patchset. Obviously this
is interesting, but it raises certain questions for the parallel
CREATE INDEX patch in particular that I'd like to get straight, aside
from everything else.

I've been using an AWS d2.4xlarge instance for testing all my recent
sort patches, with 16 vCPUs, 122 GiB RAM, 12 x 2 TB disks. It worked
well to emphasize I/O throughput and parallelism over latency. I'd
like to investigate how this pre-reading stuff does there. I recall
that for one very large case, it took a full minute to do just the
first round of preloading during the leader's final merge (this was
with something like 50GB of maintenance_work_mem). So, it will be
interesting.

BTW, noticed a typo here:

+ * track memory usage of indivitual tuples.

--
Peter Geoghegan


#11Peter Geoghegan
pg@heroku.com
In reply to: Greg Stark (#8)
Re: Tuplesort merge pre-reading

On Fri, Sep 9, 2016 at 5:25 AM, Greg Stark <stark@mit.edu> wrote:

Wow, this is really cool. We should do something like this for query
execution too.

We should certainly do this for tuplestore.c, too. I've been meaning
to adapt it to use batch memory. I did look at it briefly, and recall
that it was awkward because a surprisingly large number of callers
want to have memory that they can manage independently of the
lifetime of their tuplestore.

--
Peter Geoghegan


#12Claudio Freire
klaussfreire@gmail.com
In reply to: Heikki Linnakangas (#5)
Re: Tuplesort merge pre-reading

On Fri, Sep 9, 2016 at 8:13 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Claudio, if you could also repeat the tests you ran on Peter's patch set on
the other thread, with these patches, that'd be nice. These patches are
effectively a replacement for
0002-Use-tuplesort-batch-memory-for-randomAccess-sorts.patch. And review
would be much appreciated too, of course.

Attached are new versions. Compared to last set, they contain a few comment
fixes, and a change to the 2nd patch to not allocate tape buffers for tapes
that were completely unused.

Will do so


#13Claudio Freire
klaussfreire@gmail.com
In reply to: Claudio Freire (#12)
Re: Tuplesort merge pre-reading

On Fri, Sep 9, 2016 at 9:51 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

On Fri, Sep 9, 2016 at 8:13 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Claudio, if you could also repeat the tests you ran on Peter's patch set on
the other thread, with these patches, that'd be nice. These patches are
effectively a replacement for
0002-Use-tuplesort-batch-memory-for-randomAccess-sorts.patch. And review
would be much appreciated too, of course.

Attached are new versions. Compared to last set, they contain a few comment
fixes, and a change to the 2nd patch to not allocate tape buffers for tapes
that were completely unused.

Will do so

It seems both 1 and 1+2 break make check.

Did I misunderstand something? I'm applying the first patch of Peter's
series (cap number of tapes), then your first one (remove prefetch)
and second one (use larger read buffers).

Peter's patch needs some rebasing on top of those but nothing major.


#14Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Claudio Freire (#13)
Re: Tuplesort merge pre-reading

On 09/10/2016 04:21 AM, Claudio Freire wrote:

On Fri, Sep 9, 2016 at 9:51 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

On Fri, Sep 9, 2016 at 8:13 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Claudio, if you could also repeat the tests you ran on Peter's patch set on
the other thread, with these patches, that'd be nice. These patches are
effectively a replacement for
0002-Use-tuplesort-batch-memory-for-randomAccess-sorts.patch. And review
would be much appreciated too, of course.

Attached are new versions. Compared to last set, they contain a few comment
fixes, and a change to the 2nd patch to not allocate tape buffers for tapes
that were completely unused.

Will do so

Thanks!

It seems both 1 and 1+2 break make check.

Oh. Works for me. What's the failure you're getting?

Did I misunderstand something? I'm applying the first patch of Peter's
series (cap number of tapes), then your first one (remove prefetch)
and second one (use larger read buffers).

Yes. I have been testing without Peter's first patch, with just the two
patches I posted. But it should work together (and does, I just tested)
with that one as well.

- Heikki


#15Peter Geoghegan
pg@heroku.com
In reply to: Heikki Linnakangas (#14)
Re: Tuplesort merge pre-reading

On Sat, Sep 10, 2016 at 12:04 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Did I misunderstand something? I'm applying the first patch of Peter's
series (cap number of tapes), then your first one (remove prefetch)
and second one (use larger read buffers).

Yes. I have been testing without Peter's first patch, with just the two
patches I posted. But it should work together (and does, I just tested) with
that one as well.

You're going to need to rebase, since the root displace patch is based
on top of my patch 0002-*, not Heikki's alternative. But, that should
be very straightforward.

--
Peter Geoghegan


#16Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#14)
1 attachment(s)
Re: Tuplesort merge pre-reading

Here's a new version of these patches, rebased over current master. I
squashed the two patches into one, since there's not much point in
keeping them separate.

- Heikki

Attachments:

0001-Change-the-way-pre-reading-in-external-sort-s-merge-.patch (text/x-diff)
From 6e3813d876cf3efbe5f1b80c45f44ed5494304ab Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Sun, 11 Sep 2016 18:41:44 +0300
Subject: [PATCH 1/1] Change the way pre-reading in external sort's merge phase
 works.

Don't pre-read tuples into SortTuple slots during merge. Instead, use the
memory for larger read buffers in logtape.c. We're doing the same number
of READTUP() calls either way, but managing the pre-read SortTuple slots
is much more complicated. Also, the on-tape representation is more compact
than SortTuples, so we can fit more pre-read tuples into the same amount
of memory this way. And we have better cache-locality, when we use just a
small number of SortTuple slots.

Now that we only hold one tuple from each tape in the SortTuple slots, we
can greatly simplify the "batch memory" management. We now maintain a
small set of fixed-sized buffers, to hold the tuples, and fall back to
palloc() for larger tuples. We use this method during all merge phases,
not just the final merge.
---
 src/backend/utils/sort/logtape.c   | 134 ++++-
 src/backend/utils/sort/tuplesort.c | 984 +++++++++++--------------------------
 src/include/utils/logtape.h        |   1 +
 3 files changed, 389 insertions(+), 730 deletions(-)

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 7745207..05d7697 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -131,9 +131,12 @@ typedef struct LogicalTape
 	 * reading.
 	 */
 	char	   *buffer;			/* physical buffer (separately palloc'd) */
+	int			buffer_size;	/* allocated size of the buffer */
 	long		curBlockNumber; /* this block's logical blk# within tape */
 	int			pos;			/* next read/write position in buffer */
 	int			nbytes;			/* total # of valid bytes in buffer */
+
+	int			read_buffer_size;	/* buffer size to use when reading */
 } LogicalTape;
 
 /*
@@ -228,6 +231,53 @@ ltsReadBlock(LogicalTapeSet *lts, long blocknum, void *buffer)
 }
 
 /*
+ * Read as many blocks as we can into the per-tape buffer.
+ *
+ * The caller can specify the next physical block number to read, in
+ * datablocknum, or -1 to fetch the next block number from the internal block.
+ * If datablocknum == -1, the caller must've already set curBlockNumber.
+ *
+ * Returns true if anything was read, 'false' on EOF.
+ */
+static bool
+ltsReadFillBuffer(LogicalTapeSet *lts, LogicalTape *lt, long datablocknum)
+{
+	lt->pos = 0;
+	lt->nbytes = 0;
+
+	do
+	{
+		/* Fetch next block number (unless provided by caller) */
+		if (datablocknum == -1)
+		{
+			datablocknum = ltsRecallNextBlockNum(lts, lt->indirect, lt->frozen);
+			if (datablocknum == -1L)
+				break;			/* EOF */
+			lt->curBlockNumber++;
+		}
+
+		/* Read the block */
+		ltsReadBlock(lts, datablocknum, (void *) (lt->buffer + lt->nbytes));
+		if (!lt->frozen)
+			ltsReleaseBlock(lts, datablocknum);
+
+		if (lt->curBlockNumber < lt->numFullBlocks)
+			lt->nbytes += BLCKSZ;
+		else
+		{
+			/* EOF */
+			lt->nbytes += lt->lastBlockBytes;
+			break;
+		}
+
+		/* Advance to next block, if we have buffer space left */
+		datablocknum = -1;
+	} while (lt->nbytes < lt->buffer_size);
+
+	return (lt->nbytes > 0);
+}
+
+/*
  * qsort comparator for sorting freeBlocks[] into decreasing order.
  */
 static int
@@ -546,6 +596,8 @@ LogicalTapeSetCreate(int ntapes)
 		lt->numFullBlocks = 0L;
 		lt->lastBlockBytes = 0;
 		lt->buffer = NULL;
+		lt->buffer_size = 0;
+		lt->read_buffer_size = BLCKSZ;
 		lt->curBlockNumber = 0L;
 		lt->pos = 0;
 		lt->nbytes = 0;
@@ -628,7 +680,10 @@ LogicalTapeWrite(LogicalTapeSet *lts, int tapenum,
 
 	/* Allocate data buffer and first indirect block on first write */
 	if (lt->buffer == NULL)
+	{
 		lt->buffer = (char *) palloc(BLCKSZ);
+		lt->buffer_size = BLCKSZ;
+	}
 	if (lt->indirect == NULL)
 	{
 		lt->indirect = (IndirectBlock *) palloc(sizeof(IndirectBlock));
@@ -636,6 +691,7 @@ LogicalTapeWrite(LogicalTapeSet *lts, int tapenum,
 		lt->indirect->nextup = NULL;
 	}
 
+	Assert(lt->buffer_size == BLCKSZ);
 	while (size > 0)
 	{
 		if (lt->pos >= BLCKSZ)
@@ -709,18 +765,19 @@ LogicalTapeRewind(LogicalTapeSet *lts, int tapenum, bool forWrite)
 			Assert(lt->frozen);
 			datablocknum = ltsRewindFrozenIndirectBlock(lts, lt->indirect);
 		}
+
+		/* Allocate a read buffer */
+		if (lt->buffer)
+			pfree(lt->buffer);
+		lt->buffer = palloc(lt->read_buffer_size);
+		lt->buffer_size = lt->read_buffer_size;
+
 		/* Read the first block, or reset if tape is empty */
 		lt->curBlockNumber = 0L;
 		lt->pos = 0;
 		lt->nbytes = 0;
 		if (datablocknum != -1L)
-		{
-			ltsReadBlock(lts, datablocknum, (void *) lt->buffer);
-			if (!lt->frozen)
-				ltsReleaseBlock(lts, datablocknum);
-			lt->nbytes = (lt->curBlockNumber < lt->numFullBlocks) ?
-				BLCKSZ : lt->lastBlockBytes;
-		}
+			ltsReadFillBuffer(lts, lt, datablocknum);
 	}
 	else
 	{
@@ -754,6 +811,13 @@ LogicalTapeRewind(LogicalTapeSet *lts, int tapenum, bool forWrite)
 		lt->curBlockNumber = 0L;
 		lt->pos = 0;
 		lt->nbytes = 0;
+
+		if (lt->buffer)
+		{
+			pfree(lt->buffer);
+			lt->buffer = NULL;
+			lt->buffer_size = 0;
+		}
 	}
 }
 
@@ -779,20 +843,8 @@ LogicalTapeRead(LogicalTapeSet *lts, int tapenum,
 		if (lt->pos >= lt->nbytes)
 		{
 			/* Try to load more data into buffer. */
-			long		datablocknum = ltsRecallNextBlockNum(lts, lt->indirect,
-															 lt->frozen);
-
-			if (datablocknum == -1L)
+			if (!ltsReadFillBuffer(lts, lt, -1))
 				break;			/* EOF */
-			lt->curBlockNumber++;
-			lt->pos = 0;
-			ltsReadBlock(lts, datablocknum, (void *) lt->buffer);
-			if (!lt->frozen)
-				ltsReleaseBlock(lts, datablocknum);
-			lt->nbytes = (lt->curBlockNumber < lt->numFullBlocks) ?
-				BLCKSZ : lt->lastBlockBytes;
-			if (lt->nbytes <= 0)
-				break;			/* EOF (possible here?) */
 		}
 
 		nthistime = lt->nbytes - lt->pos;
@@ -842,6 +894,22 @@ LogicalTapeFreeze(LogicalTapeSet *lts, int tapenum)
 	lt->writing = false;
 	lt->frozen = true;
 	datablocknum = ltsRewindIndirectBlock(lts, lt->indirect, true);
+
+	/*
+	 * The seek and backspace functions assume a single block read buffer.
+	 * That's OK with current usage. A larger buffer is helpful to make the
+	 * read pattern of the backing file look more sequential to the OS, when
+	 * we're reading from multiple tapes. But at the end of a sort, when a
+	 * tape is frozen, we only read from a single tape anyway.
+	 */
+	if (!lt->buffer || lt->buffer_size != BLCKSZ)
+	{
+		if (lt->buffer)
+			pfree(lt->buffer);
+		lt->buffer = palloc(BLCKSZ);
+		lt->buffer_size = BLCKSZ;
+	}
+
 	/* Read the first block, or reset if tape is empty */
 	lt->curBlockNumber = 0L;
 	lt->pos = 0;
@@ -875,6 +943,7 @@ LogicalTapeBackspace(LogicalTapeSet *lts, int tapenum, size_t size)
 	Assert(tapenum >= 0 && tapenum < lts->nTapes);
 	lt = &lts->tapes[tapenum];
 	Assert(lt->frozen);
+	Assert(lt->buffer_size == BLCKSZ);
 
 	/*
 	 * Easy case for seek within current block.
@@ -941,6 +1010,7 @@ LogicalTapeSeek(LogicalTapeSet *lts, int tapenum,
 	lt = &lts->tapes[tapenum];
 	Assert(lt->frozen);
 	Assert(offset >= 0 && offset <= BLCKSZ);
+	Assert(lt->buffer_size == BLCKSZ);
 
 	/*
 	 * Easy case for seek within current block.
@@ -1000,6 +1070,9 @@ LogicalTapeTell(LogicalTapeSet *lts, int tapenum,
 {
 	LogicalTape *lt;
 
 	Assert(tapenum >= 0 && tapenum < lts->nTapes);
 	lt = &lts->tapes[tapenum];
+
+	/* With a larger buffer, 'pos' wouldn't be the same as offset within page */
+	Assert(lt->buffer_size == BLCKSZ);
 	*blocknum = lt->curBlockNumber;
@@ -1014,3 +1087,24 @@ LogicalTapeSetBlocks(LogicalTapeSet *lts)
 {
 	return lts->nFileBlocks;
 }
+
+/*
+ * Set buffer size to use, when reading from given tape.
+ */
+void
+LogicalTapeAssignReadBufferSize(LogicalTapeSet *lts, int tapenum, size_t avail_mem)
+{
+	LogicalTape *lt;
+
+	Assert(tapenum >= 0 && tapenum < lts->nTapes);
+	lt = &lts->tapes[tapenum];
+
+	/*
+	 * The buffer size must be a multiple of BLCKSZ, so round the given value
+	 * down to the nearest multiple. Make sure we have at least one page.
+	 */
+	if (avail_mem < BLCKSZ)
+		avail_mem = BLCKSZ;
+	avail_mem -= avail_mem % BLCKSZ;
+	lt->read_buffer_size = avail_mem;
+}
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index d600670..24f141e 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -162,7 +162,7 @@ bool		optimize_bounded_sort = true;
  * The objects we actually sort are SortTuple structs.  These contain
  * a pointer to the tuple proper (might be a MinimalTuple or IndexTuple),
  * which is a separate palloc chunk --- we assume it is just one chunk and
- * can be freed by a simple pfree() (except during final on-the-fly merge,
+ * can be freed by a simple pfree() (except during merge,
  * when memory is used in batch).  SortTuples also contain the tuple's
  * first key column in Datum/nullflag format, and an index integer.
  *
@@ -191,9 +191,8 @@ bool		optimize_bounded_sort = true;
  * it now only distinguishes RUN_FIRST and HEAP_RUN_NEXT, since replacement
  * selection is always abandoned after the first run; no other run number
  * should be represented here.  During merge passes, we re-use it to hold the
- * input tape number that each tuple in the heap was read from, or to hold the
- * index of the next tuple pre-read from the same tape in the case of pre-read
- * entries.  tupindex goes unused if the sort occurs entirely in memory.
+ * input tape number that each tuple in the heap was read from.  tupindex goes
+ * unused if the sort occurs entirely in memory.
  */
 typedef struct
 {
@@ -203,6 +202,20 @@ typedef struct
 	int			tupindex;		/* see notes above */
 } SortTuple;
 
+/*
+ * During merge, we use a pre-allocated set of fixed-size buffers to store
+ * tuples in, to avoid palloc/pfree overhead.
+ *
+ * 'nextfree' is valid when this chunk is in the free list. When in use, the
+ * buffer holds a tuple.
+ */
+#define MERGETUPLEBUFFER_SIZE 1024
+
+typedef union MergeTupleBuffer
+{
+	union MergeTupleBuffer *nextfree;
+	char		buffer[MERGETUPLEBUFFER_SIZE];
+} MergeTupleBuffer;
 
 /*
  * Possible states of a Tuplesort object.  These denote the states that
@@ -307,14 +320,6 @@ struct Tuplesortstate
 										int tapenum, unsigned int len);
 
 	/*
-	 * Function to move a caller tuple.  This is usually implemented as a
-	 * memmove() shim, but function may also perform additional fix-up of
-	 * caller tuple where needed.  Batch memory support requires the movement
-	 * of caller tuples from one location in memory to another.
-	 */
-	void		(*movetup) (void *dest, void *src, unsigned int len);
-
-	/*
 	 * This array holds the tuples now in sort memory.  If we are in state
 	 * INITIAL, the tuples are in no particular order; if we are in state
 	 * SORTEDINMEM, the tuples are in final sorted order; in states BUILDRUNS
@@ -332,12 +337,40 @@ struct Tuplesortstate
 	/*
 	 * Memory for tuples is sometimes allocated in batch, rather than
 	 * incrementally.  This implies that incremental memory accounting has
-	 * been abandoned.  Currently, this only happens for the final on-the-fly
-	 * merge step.  Large batch allocations can store tuples (e.g.
-	 * IndexTuples) without palloc() fragmentation and other overhead.
+	 * been abandoned.  Currently, this happens when we start merging.
+	 * Large batch allocations can store tuples (e.g. IndexTuples) without
+	 * palloc() fragmentation and other overhead.
+	 *
+	 * For the batch memory, we use one large allocation, divided into
+	 * MERGETUPLEBUFFER_SIZE chunks. The allocation is sized to hold
+	 * one chunk per tape, plus one additional chunk. We need that many
+	 * chunks to hold all the tuples kept in the heap during merge, plus
+	 * the one we have last returned from the sort.
+	 *
+	 * Initially, all the chunks are kept in a linked list, in freeBufferHead.
+	 * When a tuple is read from a tape, it is put in the next available
+	 * chunk, if it fits. If the tuple is larger than MERGETUPLEBUFFER_SIZE,
+	 * it is palloc'd instead.
+	 *
+	 * When we're done processing a tuple, we return the chunk back to the
+	 * free list, or pfree() if it was palloc'd. We know that a tuple was
+	 * allocated from the batch memory arena, if its pointer value is between
+	 * batchMemoryBegin and -End.
 	 */
 	bool		batchUsed;
 
+	char	   *batchMemoryBegin;	/* beginning of batch memory arena */
+	char	   *batchMemoryEnd;		/* end of batch memory arena */
+	MergeTupleBuffer *freeBufferHead;	/* head of free list */
+
+	/*
+	 * When we return a tuple to the caller that came from a tape (that is,
+	 * in TSS_SORTEDONTAPE or TSS_FINALMERGE modes), we remember the tuple
+	 * in 'readlasttuple', so that we can recycle the memory on next
+	 * gettuple call.
+	 */
+	void	   *readlasttuple;
+
 	/*
 	 * While building initial runs, this indicates if the replacement
 	 * selection strategy is in use.  When it isn't, then a simple hybrid
@@ -358,42 +391,11 @@ struct Tuplesortstate
 	 */
 
 	/*
-	 * These variables are only used during merge passes.  mergeactive[i] is
+	 * This variable is only used during merge passes.  mergeactive[i] is
 	 * true if we are reading an input run from (actual) tape number i and
-	 * have not yet exhausted that run.  mergenext[i] is the memtuples index
-	 * of the next pre-read tuple (next to be loaded into the heap) for tape
-	 * i, or 0 if we are out of pre-read tuples.  mergelast[i] similarly
-	 * points to the last pre-read tuple from each tape.  mergeavailslots[i]
-	 * is the number of unused memtuples[] slots reserved for tape i, and
-	 * mergeavailmem[i] is the amount of unused space allocated for tape i.
-	 * mergefreelist and mergefirstfree keep track of unused locations in the
-	 * memtuples[] array.  The memtuples[].tupindex fields link together
-	 * pre-read tuples for each tape as well as recycled locations in
-	 * mergefreelist. It is OK to use 0 as a null link in these lists, because
-	 * memtuples[0] is part of the merge heap and is never a pre-read tuple.
+	 * have not yet exhausted that run.
 	 */
 	bool	   *mergeactive;	/* active input run source? */
-	int		   *mergenext;		/* first preread tuple for each source */
-	int		   *mergelast;		/* last preread tuple for each source */
-	int		   *mergeavailslots;	/* slots left for prereading each tape */
-	int64	   *mergeavailmem;	/* availMem for prereading each tape */
-	int			mergefreelist;	/* head of freelist of recycled slots */
-	int			mergefirstfree; /* first slot never used in this merge */
-
-	/*
-	 * Per-tape batch state, when final on-the-fly merge consumes memory from
-	 * just a few large allocations.
-	 *
-	 * Aside from the general benefits of performing fewer individual retail
-	 * palloc() calls, this also helps make merging more cache efficient,
-	 * since each tape's tuples must naturally be accessed sequentially (in
-	 * sorted order).
-	 */
-	int64		spacePerTape;	/* Space (memory) for tuples (not slots) */
-	char	  **mergetuples;	/* Each tape's memory allocation */
-	char	  **mergecurrent;	/* Current offset into each tape's memory */
-	char	  **mergetail;		/* Last item's start point for each tape */
-	char	  **mergeoverflow;	/* Retail palloc() "overflow" for each tape */
 
 	/*
 	 * Variables for Algorithm D.  Note that destTape is a "logical" tape
@@ -481,11 +483,33 @@ struct Tuplesortstate
 #endif
 };
 
+/*
+ * Is the given tuple allocated from the batch memory arena?
+ */
+#define IS_MERGETUPLE_BUFFER(state, tuple) \
+	((char *) (tuple) >= (state)->batchMemoryBegin && \
+	 (char *) (tuple) < (state)->batchMemoryEnd)
+
+/*
+ * Return the given tuple to the batch memory free list, or free it
+ * if it was palloc'd.
+ */
+#define RELEASE_MERGETUPLE_BUFFER(state, tuple) \
+	do { \
+		MergeTupleBuffer *buf = (MergeTupleBuffer *) (tuple); \
+		\
+		if (IS_MERGETUPLE_BUFFER(state, tuple)) \
+		{ \
+			buf->nextfree = (state)->freeBufferHead; \
+			(state)->freeBufferHead = buf; \
+		} else \
+			pfree(tuple); \
+	} while(0)
+
 #define COMPARETUP(state,a,b)	((*(state)->comparetup) (a, b, state))
 #define COPYTUP(state,stup,tup) ((*(state)->copytup) (state, stup, tup))
 #define WRITETUP(state,tape,stup)	((*(state)->writetup) (state, tape, stup))
 #define READTUP(state,stup,tape,len) ((*(state)->readtup) (state, stup, tape, len))
-#define MOVETUP(dest,src,len) ((*(state)->movetup) (dest, src, len))
 #define LACKMEM(state)		((state)->availMem < 0 && !(state)->batchUsed)
 #define USEMEM(state,amt)	((state)->availMem -= (amt))
 #define FREEMEM(state,amt)	((state)->availMem += (amt))
@@ -553,16 +577,8 @@ static void inittapes(Tuplesortstate *state);
 static void selectnewtape(Tuplesortstate *state);
 static void mergeruns(Tuplesortstate *state);
 static void mergeonerun(Tuplesortstate *state);
-static void beginmerge(Tuplesortstate *state, bool finalMergeBatch);
-static void batchmemtuples(Tuplesortstate *state);
-static void mergebatch(Tuplesortstate *state, int64 spacePerTape);
-static void mergebatchone(Tuplesortstate *state, int srcTape,
-			  SortTuple *stup, bool *should_free);
-static void mergebatchfreetape(Tuplesortstate *state, int srcTape,
-				   SortTuple *rtup, bool *should_free);
-static void *mergebatchalloc(Tuplesortstate *state, int tapenum, Size tuplen);
-static void mergepreread(Tuplesortstate *state);
-static void mergeprereadone(Tuplesortstate *state, int srcTape);
+static void beginmerge(Tuplesortstate *state);
+static bool mergereadnext(Tuplesortstate *state, int srcTape, SortTuple *stup);
 static void dumptuples(Tuplesortstate *state, bool alltuples);
 static void dumpbatch(Tuplesortstate *state, bool alltuples);
 static void make_bounded_heap(Tuplesortstate *state);
@@ -576,7 +592,7 @@ static void tuplesort_heap_delete_top(Tuplesortstate *state, bool checkIndex);
 static void reversedirection(Tuplesortstate *state);
 static unsigned int getlen(Tuplesortstate *state, int tapenum, bool eofOK);
 static void markrunend(Tuplesortstate *state, int tapenum);
-static void *readtup_alloc(Tuplesortstate *state, int tapenum, Size tuplen);
+static void *readtup_alloc(Tuplesortstate *state, Size tuplen);
 static int comparetup_heap(const SortTuple *a, const SortTuple *b,
 				Tuplesortstate *state);
 static void copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup);
@@ -584,7 +600,6 @@ static void writetup_heap(Tuplesortstate *state, int tapenum,
 			  SortTuple *stup);
 static void readtup_heap(Tuplesortstate *state, SortTuple *stup,
 			 int tapenum, unsigned int len);
-static void movetup_heap(void *dest, void *src, unsigned int len);
 static int comparetup_cluster(const SortTuple *a, const SortTuple *b,
 				   Tuplesortstate *state);
 static void copytup_cluster(Tuplesortstate *state, SortTuple *stup, void *tup);
@@ -592,7 +607,6 @@ static void writetup_cluster(Tuplesortstate *state, int tapenum,
 				 SortTuple *stup);
 static void readtup_cluster(Tuplesortstate *state, SortTuple *stup,
 				int tapenum, unsigned int len);
-static void movetup_cluster(void *dest, void *src, unsigned int len);
 static int comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 					   Tuplesortstate *state);
 static int comparetup_index_hash(const SortTuple *a, const SortTuple *b,
@@ -602,7 +616,6 @@ static void writetup_index(Tuplesortstate *state, int tapenum,
 			   SortTuple *stup);
 static void readtup_index(Tuplesortstate *state, SortTuple *stup,
 			  int tapenum, unsigned int len);
-static void movetup_index(void *dest, void *src, unsigned int len);
 static int comparetup_datum(const SortTuple *a, const SortTuple *b,
 				 Tuplesortstate *state);
 static void copytup_datum(Tuplesortstate *state, SortTuple *stup, void *tup);
@@ -610,7 +623,6 @@ static void writetup_datum(Tuplesortstate *state, int tapenum,
 			   SortTuple *stup);
 static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
 			  int tapenum, unsigned int len);
-static void movetup_datum(void *dest, void *src, unsigned int len);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
 
 /*
@@ -762,7 +774,6 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
 	state->readtup = readtup_heap;
-	state->movetup = movetup_heap;
 
 	state->tupDesc = tupDesc;	/* assume we need not copy tupDesc */
 	state->abbrevNext = 10;
@@ -835,7 +846,6 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 	state->copytup = copytup_cluster;
 	state->writetup = writetup_cluster;
 	state->readtup = readtup_cluster;
-	state->movetup = movetup_cluster;
 	state->abbrevNext = 10;
 
 	state->indexInfo = BuildIndexInfo(indexRel);
@@ -927,7 +937,6 @@ tuplesort_begin_index_btree(Relation heapRel,
 	state->copytup = copytup_index;
 	state->writetup = writetup_index;
 	state->readtup = readtup_index;
-	state->movetup = movetup_index;
 	state->abbrevNext = 10;
 
 	state->heapRel = heapRel;
@@ -995,7 +1004,6 @@ tuplesort_begin_index_hash(Relation heapRel,
 	state->copytup = copytup_index;
 	state->writetup = writetup_index;
 	state->readtup = readtup_index;
-	state->movetup = movetup_index;
 
 	state->heapRel = heapRel;
 	state->indexRel = indexRel;
@@ -1038,7 +1046,6 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	state->copytup = copytup_datum;
 	state->writetup = writetup_datum;
 	state->readtup = readtup_datum;
-	state->movetup = movetup_datum;
 	state->abbrevNext = 10;
 
 	state->datumType = datumType;
@@ -1884,14 +1891,33 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 		case TSS_SORTEDONTAPE:
 			Assert(forward || state->randomAccess);
 			Assert(!state->batchUsed);
-			*should_free = true;
+
+			/*
+			 * The buffer holding the tuple that we returned in previous
+			 * gettuple call can now be reused.
+			 */
+			if (state->readlasttuple)
+			{
+				RELEASE_MERGETUPLE_BUFFER(state, state->readlasttuple);
+				state->readlasttuple = NULL;
+			}
+
 			if (forward)
 			{
 				if (state->eof_reached)
 					return false;
+
 				if ((tuplen = getlen(state, state->result_tape, true)) != 0)
 				{
 					READTUP(state, stup, state->result_tape, tuplen);
+
+					/*
+					 * Remember the tuple we return, so that we can recycle its
+					 * memory on next call. (This can be NULL, in the Datum case).
+					 */
+					state->readlasttuple = stup->tuple;
+
+					*should_free = false;
 					return true;
 				}
 				else
@@ -1965,74 +1991,63 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 									  tuplen))
 				elog(ERROR, "bogus tuple length in backward scan");
 			READTUP(state, stup, state->result_tape, tuplen);
+
+			/*
+			 * Remember the tuple we return, so that we can recycle its
+			 * memory on next call. (This can be NULL, in the Datum case).
+			 */
+			state->readlasttuple = stup->tuple;
+
+			*should_free = false;
 			return true;
 
 		case TSS_FINALMERGE:
 			Assert(forward);
 			Assert(state->batchUsed || !state->tuples);
-			/* For now, assume tuple is stored in tape's batch memory */
+			/* We are managing memory ourselves, with the batch memory arena. */
 			*should_free = false;
 
 			/*
+			 * The buffer holding the tuple that we returned in previous
+			 * gettuple call can now be reused.
+			 */
+			if (state->readlasttuple)
+			{
+				RELEASE_MERGETUPLE_BUFFER(state, state->readlasttuple);
+				state->readlasttuple = NULL;
+			}
+
+			/*
 			 * This code should match the inner loop of mergeonerun().
 			 */
 			if (state->memtupcount > 0)
 			{
 				int			srcTape = state->memtuples[0].tupindex;
-				int			tupIndex;
-				SortTuple  *newtup;
+				SortTuple	newtup;
 
-				/*
-				 * Returned tuple is still counted in our memory space most of
-				 * the time.  See mergebatchone() for discussion of why caller
-				 * may occasionally be required to free returned tuple, and
-				 * how preread memory is managed with regard to edge cases
-				 * more generally.
-				 */
 				*stup = state->memtuples[0];
-				if ((tupIndex = state->mergenext[srcTape]) == 0)
-				{
-					/*
-					 * out of preloaded data on this tape, try to read more
-					 *
-					 * Unlike mergeonerun(), we only preload from the single
-					 * tape that's run dry, though not before preparing its
-					 * batch memory for a new round of sequential consumption.
-					 * See mergepreread() comments.
-					 */
-					if (state->batchUsed)
-						mergebatchone(state, srcTape, stup, should_free);
 
-					mergeprereadone(state, srcTape);
+				/*
+				 * Remember the tuple we return, so that we can recycle its
+				 * memory on next call. (This can be NULL, in the Datum case).
+				 */
+				state->readlasttuple = stup->tuple;
 
+				/*
+				 * Pull next tuple from tape, and replace the returned tuple
+				 * at top of the heap with it.
+				 */
+				if (!mergereadnext(state, srcTape, &newtup))
+				{
 					/*
-					 * if still no data, we've reached end of run on this tape
+					 * If no more data, we've reached end of run on this tape.
+					 * Remove the top node from the heap.
 					 */
-					if ((tupIndex = state->mergenext[srcTape]) == 0)
-					{
-						/* Remove the top node from the heap */
-						tuplesort_heap_delete_top(state, false);
-						/* Free tape's buffer, avoiding dangling pointer */
-						if (state->batchUsed)
-							mergebatchfreetape(state, srcTape, stup, should_free);
-						return true;
-					}
+					tuplesort_heap_delete_top(state, false);
+					return true;
 				}
-
-				/*
-				 * pull next preread tuple from list, and replace the returned
-				 * tuple at top of the heap with it.
-				 */
-				newtup = &state->memtuples[tupIndex];
-				state->mergenext[srcTape] = newtup->tupindex;
-				if (state->mergenext[srcTape] == 0)
-					state->mergelast[srcTape] = 0;
-				newtup->tupindex = srcTape;
-				tuplesort_heap_replace_top(state, newtup, false);
-				/* put the now-unused memtuples entry on the freelist */
-				newtup->tupindex = state->mergefreelist;
-				state->mergefreelist = tupIndex;
-				state->mergeavailslots[srcTape]++;
+				newtup.tupindex = srcTape;
+				tuplesort_heap_replace_top(state, &newtup, false);
 				return true;
 			}
 			return false;
@@ -2334,7 +2349,8 @@ inittapes(Tuplesortstate *state)
 #endif
 
 	/*
-	 * Decrease availMem to reflect the space needed for tape buffers; but
+	 * Decrease availMem to reflect the space needed for tape buffers, when
+	 * writing the initial runs; but
 	 * don't decrease it to the point that we have no room for tuples. (That
 	 * case is only likely to occur if sorting pass-by-value Datums; in all
 	 * other scenarios the memtuples[] array is unlikely to occupy more than
@@ -2359,14 +2375,6 @@ inittapes(Tuplesortstate *state)
 	state->tapeset = LogicalTapeSetCreate(maxTapes);
 
 	state->mergeactive = (bool *) palloc0(maxTapes * sizeof(bool));
-	state->mergenext = (int *) palloc0(maxTapes * sizeof(int));
-	state->mergelast = (int *) palloc0(maxTapes * sizeof(int));
-	state->mergeavailslots = (int *) palloc0(maxTapes * sizeof(int));
-	state->mergeavailmem = (int64 *) palloc0(maxTapes * sizeof(int64));
-	state->mergetuples = (char **) palloc0(maxTapes * sizeof(char *));
-	state->mergecurrent = (char **) palloc0(maxTapes * sizeof(char *));
-	state->mergetail = (char **) palloc0(maxTapes * sizeof(char *));
-	state->mergeoverflow = (char **) palloc0(maxTapes * sizeof(char *));
 	state->tp_fib = (int *) palloc0(maxTapes * sizeof(int));
 	state->tp_runs = (int *) palloc0(maxTapes * sizeof(int));
 	state->tp_dummy = (int *) palloc0(maxTapes * sizeof(int));
@@ -2478,6 +2486,11 @@ mergeruns(Tuplesortstate *state)
 				svTape,
 				svRuns,
 				svDummy;
+	char	   *p;
+	int			i;
+	int			per_tape, cutoff;
+	long		avail_blocks;
+	int			maxTapes;
 
 	Assert(state->status == TSS_BUILDRUNS);
 	Assert(state->memtupcount == 0);
@@ -2514,6 +2527,74 @@ mergeruns(Tuplesortstate *state)
 		return;
 	}
 
+	/*
+	 * We no longer need a large memtuples array; during merge, the heap holds
+	 * at most one tuple per tape. Shrink the array to one slot per tape, to
+	 * make the memory available for other use.
+	 */
+	pfree(state->memtuples);
+	FREEMEM(state, state->memtupsize * sizeof(SortTuple));
+	state->memtupsize = state->maxTapes;
+	state->memtuples = (SortTuple *) palloc(state->maxTapes * sizeof(SortTuple));
+	USEMEM(state, state->memtupsize * sizeof(SortTuple));
+
+	/*
+	 * If we had fewer runs than tapes, refund buffers for tapes that were never
+	 * allocated.
+	 */
+	maxTapes = state->maxTapes;
+	if (state->currentRun < maxTapes)
+	{
+		FREEMEM(state, (maxTapes - state->currentRun) * TAPE_BUFFER_OVERHEAD);
+		maxTapes = state->currentRun;
+	}
+
+	/* Initialize the merge tuple buffer arena.  */
+	state->batchMemoryBegin = palloc((maxTapes + 1) * MERGETUPLEBUFFER_SIZE);
+	state->batchMemoryEnd = state->batchMemoryBegin + (maxTapes + 1) * MERGETUPLEBUFFER_SIZE;
+	state->freeBufferHead = (MergeTupleBuffer *) state->batchMemoryBegin;
+	USEMEM(state, (maxTapes + 1) * MERGETUPLEBUFFER_SIZE);
+
+	p = state->batchMemoryBegin;
+	for (i = 0; i < maxTapes; i++)
+	{
+		((MergeTupleBuffer *) p)->nextfree = (MergeTupleBuffer *) (p + MERGETUPLEBUFFER_SIZE);
+		p += MERGETUPLEBUFFER_SIZE;
+	}
+	((MergeTupleBuffer *) p)->nextfree = NULL;
+
+	/*
+	 * Use all the spare memory we have available for read buffers. Divide it
+	 * evenly among all the tapes.
+	 */
+	avail_blocks = state->availMem / BLCKSZ;
+	per_tape = avail_blocks / maxTapes;
+	cutoff = avail_blocks % maxTapes;
+	if (per_tape == 0)
+	{
+		per_tape = 1;
+		cutoff = 0;
+	}
+	for (tapenum = 0; tapenum < maxTapes; tapenum++)
+	{
+		LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
+										(per_tape + (tapenum < cutoff ? 1 : 0)) * BLCKSZ);
+	}
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG, "using %d kB of memory for read buffers in %d tapes, %d kB per tape",
+			 (int) (state->availMem / 1024), maxTapes, (int) (per_tape * BLCKSZ) / 1024);
+#endif
+
+	USEMEM(state, avail_blocks * BLCKSZ);
+
+	/*
+	 * From this point on, we no longer use the USEMEM()/LACKMEM() mechanism to
+	 * track memory usage of individual tuples.
+	 */
+	state->batchUsed = true;
+
 	/* End of step D2: rewind all output tapes to prepare for merging */
 	for (tapenum = 0; tapenum < state->tapeRange; tapenum++)
 		LogicalTapeRewind(state->tapeset, tapenum, false);
@@ -2544,7 +2625,7 @@ mergeruns(Tuplesortstate *state)
 				/* Tell logtape.c we won't be writing anymore */
 				LogicalTapeSetForgetFreeSpace(state->tapeset);
 				/* Initialize for the final merge pass */
-				beginmerge(state, state->tuples);
+				beginmerge(state);
 				state->status = TSS_FINALMERGE;
 				return;
 			}
@@ -2627,16 +2708,12 @@ mergeonerun(Tuplesortstate *state)
 {
 	int			destTape = state->tp_tapenum[state->tapeRange];
 	int			srcTape;
-	int			tupIndex;
-	SortTuple  *tup;
-	int64		priorAvail,
-				spaceFreed;
 
 	/*
 	 * Start the merge by loading one tuple from each active source tape into
 	 * the heap.  We can also decrease the input run/dummy run counts.
 	 */
-	beginmerge(state, false);
+	beginmerge(state);
 
 	/*
 	 * Execute merge by repeatedly extracting lowest tuple in heap, writing it
@@ -2645,40 +2722,28 @@ mergeonerun(Tuplesortstate *state)
 	 */
 	while (state->memtupcount > 0)
 	{
+		SortTuple stup;
+
 		/* write the tuple to destTape */
-		priorAvail = state->availMem;
 		srcTape = state->memtuples[0].tupindex;
 		WRITETUP(state, destTape, &state->memtuples[0]);
-		/* writetup adjusted total free space, now fix per-tape space */
-		spaceFreed = state->availMem - priorAvail;
-		state->mergeavailmem[srcTape] += spaceFreed;
-		if ((tupIndex = state->mergenext[srcTape]) == 0)
-		{
-			/* out of preloaded data on this tape, try to read more */
-			mergepreread(state);
-			/* if still no data, we've reached end of run on this tape */
-			if ((tupIndex = state->mergenext[srcTape]) == 0)
-			{
-				/* remove the written-out tuple from the heap */
-				tuplesort_heap_delete_top(state, false);
-				continue;
-			}
-		}
+
+		/* recycle the buffer of the tuple we just wrote out, for the next read */
+		RELEASE_MERGETUPLE_BUFFER(state, state->memtuples[0].tuple);
 
 		/*
 		 * pull next preread tuple from list, and replace the written-out
 		 * tuple in the heap with it.
 		 */
-		tup = &state->memtuples[tupIndex];
-		state->mergenext[srcTape] = tup->tupindex;
-		if (state->mergenext[srcTape] == 0)
-			state->mergelast[srcTape] = 0;
-		tup->tupindex = srcTape;
-		tuplesort_heap_replace_top(state, tup, false);
-		/* put the now-unused memtuples entry on the freelist */
-		tup->tupindex = state->mergefreelist;
-		state->mergefreelist = tupIndex;
-		state->mergeavailslots[srcTape]++;
+		if (!mergereadnext(state, srcTape, &stup))
+		{
+			/* we've reached end of run on this tape */
+			/* remove the written-out tuple from the heap */
+			tuplesort_heap_delete_top(state, false);
+			continue;
+		}
+		stup.tupindex = srcTape;
+		tuplesort_heap_replace_top(state, &stup, false);
 	}
 
 	/*
@@ -2711,18 +2776,13 @@ mergeonerun(Tuplesortstate *state)
  * which tapes contain active input runs in mergeactive[].  Then, load
  * as many tuples as we can from each active input tape, and finally
  * fill the merge heap with the first tuple from each active tape.
- *
- * finalMergeBatch indicates if this is the beginning of a final on-the-fly
- * merge where a batched allocation of tuple memory is required.
  */
 static void
-beginmerge(Tuplesortstate *state, bool finalMergeBatch)
+beginmerge(Tuplesortstate *state)
 {
 	int			activeTapes;
 	int			tapenum;
 	int			srcTape;
-	int			slotsPerTape;
-	int64		spacePerTape;
 
 	/* Heap should be empty here */
 	Assert(state->memtupcount == 0);
@@ -2746,517 +2806,47 @@ beginmerge(Tuplesortstate *state, bool finalMergeBatch)
 	}
 	state->activeTapes = activeTapes;
 
-	/* Clear merge-pass state variables */
-	memset(state->mergenext, 0,
-		   state->maxTapes * sizeof(*state->mergenext));
-	memset(state->mergelast, 0,
-		   state->maxTapes * sizeof(*state->mergelast));
-	state->mergefreelist = 0;	/* nothing in the freelist */
-	state->mergefirstfree = activeTapes;		/* 1st slot avail for preread */
-
-	if (finalMergeBatch)
-	{
-		/* Free outright buffers for tape never actually allocated */
-		FREEMEM(state, (state->maxTapes - activeTapes) * TAPE_BUFFER_OVERHEAD);
-
-		/*
-		 * Grow memtuples one last time, since the palloc() overhead no longer
-		 * incurred can make a big difference
-		 */
-		batchmemtuples(state);
-	}
-
 	/*
 	 * Initialize space allocation to let each active input tape have an equal
 	 * share of preread space.
 	 */
 	Assert(activeTapes > 0);
-	slotsPerTape = (state->memtupsize - state->mergefirstfree) / activeTapes;
-	Assert(slotsPerTape > 0);
-	spacePerTape = MAXALIGN_DOWN(state->availMem / activeTapes);
-	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
-	{
-		if (state->mergeactive[srcTape])
-		{
-			state->mergeavailslots[srcTape] = slotsPerTape;
-			state->mergeavailmem[srcTape] = spacePerTape;
-		}
-	}
-
-	/*
-	 * Preallocate tuple batch memory for each tape.  This is the memory used
-	 * for tuples themselves (not SortTuples), so it's never used by
-	 * pass-by-value datum sorts.  Memory allocation is performed here at most
-	 * once per sort, just in advance of the final on-the-fly merge step.
-	 */
-	if (finalMergeBatch)
-		mergebatch(state, spacePerTape);
-
-	/*
-	 * Preread as many tuples as possible (and at least one) from each active
-	 * tape
-	 */
-	mergepreread(state);
 
 	/* Load the merge heap with the first tuple from each input tape */
 	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
 	{
-		int			tupIndex = state->mergenext[srcTape];
-		SortTuple  *tup;
-
-		if (tupIndex)
-		{
-			tup = &state->memtuples[tupIndex];
-			state->mergenext[srcTape] = tup->tupindex;
-			if (state->mergenext[srcTape] == 0)
-				state->mergelast[srcTape] = 0;
-			tup->tupindex = srcTape;
-			tuplesort_heap_insert(state, tup, false);
-			/* put the now-unused memtuples entry on the freelist */
-			tup->tupindex = state->mergefreelist;
-			state->mergefreelist = tupIndex;
-			state->mergeavailslots[srcTape]++;
-
-#ifdef TRACE_SORT
-			if (trace_sort && finalMergeBatch)
-			{
-				int64		perTapeKB = (spacePerTape + 1023) / 1024;
-				int64		usedSpaceKB;
-				int			usedSlots;
-
-				/*
-				 * Report how effective batchmemtuples() was in balancing the
-				 * number of slots against the need for memory for the
-				 * underlying tuples (e.g. IndexTuples).  The big preread of
-				 * all tapes when switching to FINALMERGE state should be
-				 * fairly representative of memory utilization during the
-				 * final merge step, and in any case is the only point at
-				 * which all tapes are guaranteed to have depleted either
-				 * their batch memory allowance or slot allowance.  Ideally,
-				 * both will be completely depleted for every tape by now.
-				 */
-				usedSpaceKB = (state->mergecurrent[srcTape] -
-							   state->mergetuples[srcTape] + 1023) / 1024;
-				usedSlots = slotsPerTape - state->mergeavailslots[srcTape];
-
-				elog(LOG, "tape %d initially used " INT64_FORMAT " KB of "
-					 INT64_FORMAT " KB batch (%2.3f) and %d out of %d slots "
-					 "(%2.3f)", srcTape,
-					 usedSpaceKB, perTapeKB,
-					 (double) usedSpaceKB / (double) perTapeKB,
-					 usedSlots, slotsPerTape,
-					 (double) usedSlots / (double) slotsPerTape);
-			}
-#endif
-		}
-	}
-}
-
-/*
- * batchmemtuples - grow memtuples without palloc overhead
- *
- * When called, availMem should be approximately the amount of memory we'd
- * require to allocate memtupsize - memtupcount tuples (not SortTuples/slots)
- * that were allocated with palloc() overhead, and in doing so use up all
- * allocated slots.  However, though slots and tuple memory is in balance
- * following the last grow_memtuples() call, that's predicated on the observed
- * average tuple size for the "final" grow_memtuples() call, which includes
- * palloc overhead.  During the final merge pass, where we will arrange to
- * squeeze out the palloc overhead, we might need more slots in the memtuples
- * array.
- *
- * To make that happen, arrange for the amount of remaining memory to be
- * exactly equal to the palloc overhead multiplied by the current size of
- * the memtuples array, force the grow_memtuples flag back to true (it's
- * probably but not necessarily false on entry to this routine), and then
- * call grow_memtuples.  This simulates loading enough tuples to fill the
- * whole memtuples array and then having some space left over because of the
- * elided palloc overhead.  We expect that grow_memtuples() will conclude that
- * it can't double the size of the memtuples array but that it can increase
- * it by some percentage; but if it does decide to double it, that just means
- * that we've never managed to use many slots in the memtuples array, in which
- * case doubling it shouldn't hurt anything anyway.
- */
-static void
-batchmemtuples(Tuplesortstate *state)
-{
-	int64		refund;
-	int64		availMemLessRefund;
-	int			memtupsize = state->memtupsize;
-
-	/* Caller error if we have no tapes */
-	Assert(state->activeTapes > 0);
-
-	/* For simplicity, assume no memtuples are actually currently counted */
-	Assert(state->memtupcount == 0);
-
-	/*
-	 * Refund STANDARDCHUNKHEADERSIZE per tuple.
-	 *
-	 * This sometimes fails to make memory use perfectly balanced, but it
-	 * should never make the situation worse.  Note that Assert-enabled builds
-	 * get a larger refund, due to a varying STANDARDCHUNKHEADERSIZE.
-	 */
-	refund = memtupsize * STANDARDCHUNKHEADERSIZE;
-	availMemLessRefund = state->availMem - refund;
-
-	/*
-	 * We need to be sure that we do not cause LACKMEM to become true, else
-	 * the batch allocation size could be calculated as negative, causing
-	 * havoc.  Hence, if availMemLessRefund is negative at this point, we must
-	 * do nothing.  Moreover, if it's positive but rather small, there's
-	 * little point in proceeding because we could only increase memtuples by
-	 * a small amount, not worth the cost of the repalloc's.  We somewhat
-	 * arbitrarily set the threshold at ALLOCSET_DEFAULT_INITSIZE per tape.
-	 * (Note that this does not represent any assumption about tuple sizes.)
-	 */
-	if (availMemLessRefund <=
-		(int64) state->activeTapes * ALLOCSET_DEFAULT_INITSIZE)
-		return;
-
-	/*
-	 * To establish balanced memory use after refunding palloc overhead,
-	 * temporarily have our accounting indicate that we've allocated all
-	 * memory we're allowed to less that refund, and call grow_memtuples() to
-	 * have it increase the number of slots.
-	 */
-	state->growmemtuples = true;
-	USEMEM(state, availMemLessRefund);
-	(void) grow_memtuples(state);
-	state->growmemtuples = false;
-	/* availMem must stay accurate for spacePerTape calculation */
-	FREEMEM(state, availMemLessRefund);
-	if (LACKMEM(state))
-		elog(ERROR, "unexpected out-of-memory situation in tuplesort");
-
-#ifdef TRACE_SORT
-	if (trace_sort)
-	{
-		Size		OldKb = (memtupsize * sizeof(SortTuple) + 1023) / 1024;
-		Size		NewKb = (state->memtupsize * sizeof(SortTuple) + 1023) / 1024;
-
-		elog(LOG, "grew memtuples %1.2fx from %d (%zu KB) to %d (%zu KB) for final merge",
-			 (double) NewKb / (double) OldKb,
-			 memtupsize, OldKb,
-			 state->memtupsize, NewKb);
-	}
-#endif
-}
-
-/*
- * mergebatch - initialize tuple memory in batch
- *
- * This allows sequential access to sorted tuples buffered in memory from
- * tapes/runs on disk during a final on-the-fly merge step.  Note that the
- * memory is not used for SortTuples, but for the underlying tuples (e.g.
- * MinimalTuples).
- *
- * Note that when batch memory is used, there is a simple division of space
- * into large buffers (one per active tape).  The conventional incremental
- * memory accounting (calling USEMEM() and FREEMEM()) is abandoned.  Instead,
- * when each tape's memory budget is exceeded, a retail palloc() "overflow" is
- * performed, which is then immediately detected in a way that is analogous to
- * LACKMEM().  This keeps each tape's use of memory fair, which is always a
- * goal.
- */
-static void
-mergebatch(Tuplesortstate *state, int64 spacePerTape)
-{
-	int			srcTape;
-
-	Assert(state->activeTapes > 0);
-	Assert(state->tuples);
-
-	/*
-	 * For the purposes of tuplesort's memory accounting, the batch allocation
-	 * is special, and regular memory accounting through USEMEM() calls is
-	 * abandoned (see mergeprereadone()).
-	 */
-	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
-	{
-		char	   *mergetuples;
-
-		if (!state->mergeactive[srcTape])
-			continue;
-
-		/* Allocate buffer for each active tape */
-		mergetuples = MemoryContextAllocHuge(state->tuplecontext,
-											 spacePerTape);
-
-		/* Initialize state for tape */
-		state->mergetuples[srcTape] = mergetuples;
-		state->mergecurrent[srcTape] = mergetuples;
-		state->mergetail[srcTape] = mergetuples;
-		state->mergeoverflow[srcTape] = NULL;
-	}
-
-	state->batchUsed = true;
-	state->spacePerTape = spacePerTape;
-}
-
-/*
- * mergebatchone - prepare batch memory for one merge input tape
- *
- * This is called following the exhaustion of preread tuples for one input
- * tape.  All that actually occurs is that the state for the source tape is
- * reset to indicate that all memory may be reused.
- *
- * This routine must deal with fixing up the tuple that is about to be returned
- * to the client, due to "overflow" allocations.
- */
-static void
-mergebatchone(Tuplesortstate *state, int srcTape, SortTuple *rtup,
-			  bool *should_free)
-{
-	Assert(state->batchUsed);
+		SortTuple	tup;
 
-	/*
-	 * Tuple about to be returned to caller ("stup") is final preread tuple
-	 * from tape, just removed from the top of the heap.  Special steps around
-	 * memory management must be performed for that tuple, to make sure it
-	 * isn't overwritten early.
-	 */
-	if (!state->mergeoverflow[srcTape])
-	{
-		Size		tupLen;
-
-		/*
-		 * Mark tuple buffer range for reuse, but be careful to move final,
-		 * tail tuple to start of space for next run so that it's available to
-		 * caller when stup is returned, and remains available at least until
-		 * the next tuple is requested.
-		 */
-		tupLen = state->mergecurrent[srcTape] - state->mergetail[srcTape];
-		state->mergecurrent[srcTape] = state->mergetuples[srcTape];
-		MOVETUP(state->mergecurrent[srcTape], state->mergetail[srcTape],
-				tupLen);
-
-		/* Make SortTuple at top of the merge heap point to new tuple */
-		rtup->tuple = (void *) state->mergecurrent[srcTape];
-
-		state->mergetail[srcTape] = state->mergecurrent[srcTape];
-		state->mergecurrent[srcTape] += tupLen;
-	}
-	else
-	{
-		/*
-		 * Handle an "overflow" retail palloc.
-		 *
-		 * This is needed when we run out of tuple memory for the tape.
-		 */
-		state->mergecurrent[srcTape] = state->mergetuples[srcTape];
-		state->mergetail[srcTape] = state->mergetuples[srcTape];
-
-		if (rtup->tuple)
+		if (mergereadnext(state, srcTape, &tup))
 		{
-			Assert(rtup->tuple == (void *) state->mergeoverflow[srcTape]);
-			/* Caller should free palloc'd tuple */
-			*should_free = true;
+			tup.tupindex = srcTape;
+			tuplesort_heap_insert(state, &tup, false);
 		}
-		state->mergeoverflow[srcTape] = NULL;
 	}
 }
 
 /*
- * mergebatchfreetape - handle final clean-up for batch memory once tape is
- * about to become exhausted
+ * mergereadnext - read next tuple from one merge input tape
  *
- * All tuples are returned from tape, but a single final tuple, *rtup, is to be
- * passed back to caller.  Free tape's batch allocation buffer while ensuring
- * that the final tuple is managed appropriately.
+ * Returns false on EOF.
  */
-static void
-mergebatchfreetape(Tuplesortstate *state, int srcTape, SortTuple *rtup,
-				   bool *should_free)
-{
-	Assert(state->batchUsed);
-	Assert(state->status == TSS_FINALMERGE);
-
-	/*
-	 * Tuple may or may not already be an overflow allocation from
-	 * mergebatchone()
-	 */
-	if (!*should_free && rtup->tuple)
-	{
-		/*
-		 * Final tuple still in tape's batch allocation.
-		 *
-		 * Return palloc()'d copy to caller, and have it freed in a similar
-		 * manner to overflow allocation.  Otherwise, we'd free batch memory
-		 * and pass back a pointer to garbage.  Note that we deliberately
-		 * allocate this in the parent tuplesort context, to be on the safe
-		 * side.
-		 */
-		Size		tuplen;
-		void	   *oldTuple = rtup->tuple;
-
-		tuplen = state->mergecurrent[srcTape] - state->mergetail[srcTape];
-		rtup->tuple = MemoryContextAlloc(state->sortcontext, tuplen);
-		MOVETUP(rtup->tuple, oldTuple, tuplen);
-		*should_free = true;
-	}
-
-	/* Free spacePerTape-sized buffer */
-	pfree(state->mergetuples[srcTape]);
-}
-
-/*
- * mergebatchalloc - allocate memory for one tuple using a batch memory
- * "logical allocation".
- *
- * This is used for the final on-the-fly merge phase only.  READTUP() routines
- * receive memory from here in place of palloc() and USEMEM() calls.
- *
- * Tuple tapenum is passed, ensuring each tape's tuples are stored in sorted,
- * contiguous order (while allowing safe reuse of memory made available to
- * each tape).  This maximizes locality of access as tuples are returned by
- * final merge.
- *
- * Caller must not subsequently attempt to free memory returned here.  In
- * general, only mergebatch* functions know about how memory returned from
- * here should be freed, and this function's caller must ensure that batch
- * memory management code will definitely have the opportunity to do the right
- * thing during the final on-the-fly merge.
- */
-static void *
-mergebatchalloc(Tuplesortstate *state, int tapenum, Size tuplen)
-{
-	Size		reserve_tuplen = MAXALIGN(tuplen);
-	char	   *ret;
-
-	/* Should overflow at most once before mergebatchone() call: */
-	Assert(state->mergeoverflow[tapenum] == NULL);
-	Assert(state->batchUsed);
-
-	/* It should be possible to use precisely spacePerTape memory at once */
-	if (state->mergecurrent[tapenum] + reserve_tuplen <=
-		state->mergetuples[tapenum] + state->spacePerTape)
-	{
-		/*
-		 * Usual case -- caller is returned pointer into its tape's buffer,
-		 * and an offset from that point is recorded as where tape has
-		 * consumed up to for current round of preloading.
-		 */
-		ret = state->mergetail[tapenum] = state->mergecurrent[tapenum];
-		state->mergecurrent[tapenum] += reserve_tuplen;
-	}
-	else
-	{
-		/*
-		 * Allocate memory, and record as tape's overflow allocation.  This
-		 * will be detected quickly, in a similar fashion to a LACKMEM()
-		 * condition, and should not happen again before a new round of
-		 * preloading for caller's tape.  Note that we deliberately allocate
-		 * this in the parent tuplesort context, to be on the safe side.
-		 *
-		 * Sometimes, this does not happen because merging runs out of slots
-		 * before running out of memory.
-		 */
-		ret = state->mergeoverflow[tapenum] =
-			MemoryContextAlloc(state->sortcontext, tuplen);
-	}
-
-	return ret;
-}
-
-/*
- * mergepreread - load tuples from merge input tapes
- *
- * This routine exists to improve sequentiality of reads during a merge pass,
- * as explained in the header comments of this file.  Load tuples from each
- * active source tape until the tape's run is exhausted or it has used up
- * its fair share of available memory.  In any case, we guarantee that there
- * is at least one preread tuple available from each unexhausted input tape.
- *
- * We invoke this routine at the start of a merge pass for initial load,
- * and then whenever any tape's preread data runs out.  Note that we load
- * as much data as possible from all tapes, not just the one that ran out.
- * This is because logtape.c works best with a usage pattern that alternates
- * between reading a lot of data and writing a lot of data, so whenever we
- * are forced to read, we should fill working memory completely.
- *
- * In FINALMERGE state, we *don't* use this routine, but instead just preread
- * from the single tape that ran dry.  There's no read/write alternation in
- * that state and so no point in scanning through all the tapes to fix one.
- * (Moreover, there may be quite a lot of inactive tapes in that state, since
- * we might have had many fewer runs than tapes.  In a regular tape-to-tape
- * merge we can expect most of the tapes to be active.  Plus, only
- * FINALMERGE state has to consider memory management for a batch
- * allocation.)
- */
-static void
-mergepreread(Tuplesortstate *state)
-{
-	int			srcTape;
-
-	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
-		mergeprereadone(state, srcTape);
-}
-
-/*
- * mergeprereadone - load tuples from one merge input tape
- *
- * Read tuples from the specified tape until it has used up its free memory
- * or array slots; but ensure that we have at least one tuple, if any are
- * to be had.
- */
-static void
-mergeprereadone(Tuplesortstate *state, int srcTape)
+static bool
+mergereadnext(Tuplesortstate *state, int srcTape, SortTuple *stup)
 {
 	unsigned int tuplen;
-	SortTuple	stup;
-	int			tupIndex;
-	int64		priorAvail,
-				spaceUsed;
 
 	if (!state->mergeactive[srcTape])
-		return;					/* tape's run is already exhausted */
+		return false;					/* tape's run is already exhausted */
 
-	/*
-	 * Manage per-tape availMem.  Only actually matters when batch memory not
-	 * in use.
-	 */
-	priorAvail = state->availMem;
-	state->availMem = state->mergeavailmem[srcTape];
-
-	/*
-	 * When batch memory is used if final on-the-fly merge, only mergeoverflow
-	 * test is relevant; otherwise, only LACKMEM() test is relevant.
-	 */
-	while ((state->mergeavailslots[srcTape] > 0 &&
-			state->mergeoverflow[srcTape] == NULL && !LACKMEM(state)) ||
-		   state->mergenext[srcTape] == 0)
+	/* read next tuple, if any */
+	if ((tuplen = getlen(state, srcTape, true)) == 0)
 	{
-		/* read next tuple, if any */
-		if ((tuplen = getlen(state, srcTape, true)) == 0)
-		{
-			state->mergeactive[srcTape] = false;
-			break;
-		}
-		READTUP(state, &stup, srcTape, tuplen);
-		/* find a free slot in memtuples[] for it */
-		tupIndex = state->mergefreelist;
-		if (tupIndex)
-			state->mergefreelist = state->memtuples[tupIndex].tupindex;
-		else
-		{
-			tupIndex = state->mergefirstfree++;
-			Assert(tupIndex < state->memtupsize);
-		}
-		state->mergeavailslots[srcTape]--;
-		/* store tuple, append to list for its tape */
-		stup.tupindex = 0;
-		state->memtuples[tupIndex] = stup;
-		if (state->mergelast[srcTape])
-			state->memtuples[state->mergelast[srcTape]].tupindex = tupIndex;
-		else
-			state->mergenext[srcTape] = tupIndex;
-		state->mergelast[srcTape] = tupIndex;
+		state->mergeactive[srcTape] = false;
+		return false;
 	}
-	/* update per-tape and global availmem counts */
-	spaceUsed = state->mergeavailmem[srcTape] - state->availMem;
-	state->mergeavailmem[srcTape] = state->availMem;
-	state->availMem = priorAvail - spaceUsed;
+	READTUP(state, stup, srcTape, tuplen);
+
+	return true;
 }
 
 /*
@@ -3912,27 +3502,24 @@ markrunend(Tuplesortstate *state, int tapenum)
  * routines.
  */
 static void *
-readtup_alloc(Tuplesortstate *state, int tapenum, Size tuplen)
+readtup_alloc(Tuplesortstate *state, Size tuplen)
 {
-	if (state->batchUsed)
-	{
-		/*
-		 * No USEMEM() call, because during final on-the-fly merge accounting
-		 * is based on tape-private state. ("Overflow" allocations are
-		 * detected as an indication that a new round or preloading is
-		 * required. Preloading marks existing contents of tape's batch buffer
-		 * for reuse.)
-		 */
-		return mergebatchalloc(state, tapenum, tuplen);
-	}
+	MergeTupleBuffer *buf;
+
+	/*
+	 * We pre-allocate enough buffers in the arena that we should never run out.
+	 */
+	Assert(state->freeBufferHead);
+
+	if (tuplen > MERGETUPLEBUFFER_SIZE || !state->freeBufferHead)
+		return MemoryContextAlloc(state->sortcontext, tuplen);
 	else
 	{
-		char	   *ret;
+		buf = state->freeBufferHead;
+		/* Reuse this buffer */
+		state->freeBufferHead = buf->nextfree;
 
-		/* Batch allocation yet to be performed */
-		ret = MemoryContextAlloc(state->tuplecontext, tuplen);
-		USEMEM(state, GetMemoryChunkSpace(ret));
-		return ret;
+		return buf;
 	}
 }
 
@@ -4101,8 +3688,11 @@ writetup_heap(Tuplesortstate *state, int tapenum, SortTuple *stup)
 		LogicalTapeWrite(state->tapeset, tapenum,
 						 (void *) &tuplen, sizeof(tuplen));
 
-	FREEMEM(state, GetMemoryChunkSpace(tuple));
-	heap_free_minimal_tuple(tuple);
+	if (!state->batchUsed)
+	{
+		FREEMEM(state, GetMemoryChunkSpace(tuple));
+		heap_free_minimal_tuple(tuple);
+	}
 }
 
 static void
@@ -4111,7 +3701,7 @@ readtup_heap(Tuplesortstate *state, SortTuple *stup,
 {
 	unsigned int tupbodylen = len - sizeof(int);
 	unsigned int tuplen = tupbodylen + MINIMAL_TUPLE_DATA_OFFSET;
-	MinimalTuple tuple = (MinimalTuple) readtup_alloc(state, tapenum, tuplen);
+	MinimalTuple tuple = (MinimalTuple) readtup_alloc(state, tuplen);
 	char	   *tupbody = (char *) tuple + MINIMAL_TUPLE_DATA_OFFSET;
 	HeapTupleData htup;
 
@@ -4132,12 +3722,6 @@ readtup_heap(Tuplesortstate *state, SortTuple *stup,
 								&stup->isnull1);
 }
 
-static void
-movetup_heap(void *dest, void *src, unsigned int len)
-{
-	memmove(dest, src, len);
-}
-
 /*
  * Routines specialized for the CLUSTER case (HeapTuple data, with
  * comparisons per a btree index definition)
@@ -4344,8 +3928,11 @@ writetup_cluster(Tuplesortstate *state, int tapenum, SortTuple *stup)
 		LogicalTapeWrite(state->tapeset, tapenum,
 						 &tuplen, sizeof(tuplen));
 
-	FREEMEM(state, GetMemoryChunkSpace(tuple));
-	heap_freetuple(tuple);
+	if (!state->batchUsed)
+	{
+		FREEMEM(state, GetMemoryChunkSpace(tuple));
+		heap_freetuple(tuple);
+	}
 }
 
 static void
@@ -4354,7 +3941,6 @@ readtup_cluster(Tuplesortstate *state, SortTuple *stup,
 {
 	unsigned int t_len = tuplen - sizeof(ItemPointerData) - sizeof(int);
 	HeapTuple	tuple = (HeapTuple) readtup_alloc(state,
-												  tapenum,
 												  t_len + HEAPTUPLESIZE);
 
 	/* Reconstruct the HeapTupleData header */
@@ -4379,19 +3965,6 @@ readtup_cluster(Tuplesortstate *state, SortTuple *stup,
 									&stup->isnull1);
 }
 
-static void
-movetup_cluster(void *dest, void *src, unsigned int len)
-{
-	HeapTuple	tuple;
-
-	memmove(dest, src, len);
-
-	/* Repoint the HeapTupleData header */
-	tuple = (HeapTuple) dest;
-	tuple->t_data = (HeapTupleHeader) ((char *) tuple + HEAPTUPLESIZE);
-}
-
-
 /*
  * Routines specialized for IndexTuple case
  *
@@ -4659,8 +4232,11 @@ writetup_index(Tuplesortstate *state, int tapenum, SortTuple *stup)
 		LogicalTapeWrite(state->tapeset, tapenum,
 						 (void *) &tuplen, sizeof(tuplen));
 
-	FREEMEM(state, GetMemoryChunkSpace(tuple));
-	pfree(tuple);
+	if (!state->batchUsed)
+	{
+		FREEMEM(state, GetMemoryChunkSpace(tuple));
+		pfree(tuple);
+	}
 }
 
 static void
@@ -4668,7 +4244,7 @@ readtup_index(Tuplesortstate *state, SortTuple *stup,
 			  int tapenum, unsigned int len)
 {
 	unsigned int tuplen = len - sizeof(unsigned int);
-	IndexTuple	tuple = (IndexTuple) readtup_alloc(state, tapenum, tuplen);
+	IndexTuple	tuple = (IndexTuple) readtup_alloc(state, tuplen);
 
 	LogicalTapeReadExact(state->tapeset, tapenum,
 						 tuple, tuplen);
@@ -4683,12 +4259,6 @@ readtup_index(Tuplesortstate *state, SortTuple *stup,
 								 &stup->isnull1);
 }
 
-static void
-movetup_index(void *dest, void *src, unsigned int len)
-{
-	memmove(dest, src, len);
-}
-
 /*
  * Routines specialized for DatumTuple case
  */
@@ -4755,7 +4325,7 @@ writetup_datum(Tuplesortstate *state, int tapenum, SortTuple *stup)
 		LogicalTapeWrite(state->tapeset, tapenum,
 						 (void *) &writtenlen, sizeof(writtenlen));
 
-	if (stup->tuple)
+	if (!state->batchUsed && stup->tuple)
 	{
 		FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
 		pfree(stup->tuple);
@@ -4785,7 +4355,7 @@ readtup_datum(Tuplesortstate *state, SortTuple *stup,
 	}
 	else
 	{
-		void	   *raddr = readtup_alloc(state, tapenum, tuplen);
+		void	   *raddr = readtup_alloc(state, tuplen);
 
 		LogicalTapeReadExact(state->tapeset, tapenum,
 							 raddr, tuplen);
@@ -4799,12 +4369,6 @@ readtup_datum(Tuplesortstate *state, SortTuple *stup,
 							 &tuplen, sizeof(tuplen));
 }
 
-static void
-movetup_datum(void *dest, void *src, unsigned int len)
-{
-	memmove(dest, src, len);
-}
-
 /*
  * Convenience routine to free a tuple previously loaded into sort memory
  */
diff --git a/src/include/utils/logtape.h b/src/include/utils/logtape.h
index fa1e992..03d0a6f 100644
--- a/src/include/utils/logtape.h
+++ b/src/include/utils/logtape.h
@@ -39,6 +39,7 @@ extern bool LogicalTapeSeek(LogicalTapeSet *lts, int tapenum,
 				long blocknum, int offset);
 extern void LogicalTapeTell(LogicalTapeSet *lts, int tapenum,
 				long *blocknum, int *offset);
+extern void LogicalTapeAssignReadBufferSize(LogicalTapeSet *lts, int tapenum, size_t bufsize);
 extern long LogicalTapeSetBlocks(LogicalTapeSet *lts);
 
 #endif   /* LOGTAPE_H */
-- 
2.9.3

#17Peter Geoghegan
pg@heroku.com
In reply to: Heikki Linnakangas (#16)
Re: Tuplesort merge pre-reading

On Sun, Sep 11, 2016 at 8:47 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Here's a new version of these patches, rebased over current master. I
squashed the two patches into one, there's not much point to keep them
separate.

I think I have my head fully around this now. For some reason, I
initially thought that this patch was a great deal more radical than
it actually is. (Like Greg, I somehow initially thought that you were
rejecting the idea of batch memory in general, and somehow (over)
leveraging the filesystem cache. I think I misunderstood your remarks
when we talked on IM about it early on.)

I don't know what the difference is between accessing 10 pages
randomly, and accessing a random set of 10 single pages sequentially,
in close succession. As Tom would say, that's above my pay grade. I
suppose it comes down to how close "close" actually is (but in any
case, it's all very fudged).

I mention this because I think that cost_sort() should be updated to
consider sequential I/O the norm, alongside this patch of yours (your
patch strengthens the argument [1] for that general idea). The reason
that this new approach produces more sequential I/O, apart from the
simple fact that you effectively have much more memory available and
so fewer rounds of preloading, is that the indirection block stuff can
make I/O less sequential in order to support eager reclamation of
space. For example, maybe there is interleaving of blocks as logtape.c
manages to reclaim blocks in the event of multiple merge steps. I care
about that second factor a lot more now than I would have a year ago,
when a final on-the-fly merge generally avoids multiple passes (and
associated logtape.c block fragmentation), because parallel CREATE
INDEX is usually affected by that (workers will often want to do their
own merge ahead of the leader's final merge), and because I want to
cap the number of tapes used, which will make multiple passes a bit
more common in practice.

I was always suspicious of the fact that memtuples is so large during
merging, and had a vague plan to fix that (although I was the one
responsible for growing memtuples even more for the merge in 9.6, that
was just because under the status quo of having many memtuples during
the merge, the size of memtuples should at least be in balance with
remaining memory for caller tuples -- it wasn't an endorsement of the
status quo). However, it never occurred to me to do that by pushing
batch memory into the head of logtape.c, which now seems like an
excellent idea.

To summarize my understanding of this patch: If it wasn't for my work
on parallel CREATE INDEX, I would consider this patch to give us only
a moderate improvement to user-visible performance, since it doesn't
really help memory rich cases very much (cases that are not likely to
have much random I/O anyway). In that universe, I'd be more
appreciative of the patch as a simplifying refactoring exercise, since
while it's nice to get a boost for more memory constrained cases, it's
not a huge win. However, that's not the universe we live
in -- I *am* working on parallel CREATE INDEX, of course. And so, I
really am very excited about this patch, because it really helps with
the particular challenges I have there, even though it's fair to
assume that we have a reasonable amount of memory available when
parallelism is used. If workers are going to do their own merges, as
they often will, then multiple merge pass cases become far more
important, and any additional I/O is a real concern, *especially*
additional random I/O (say from logtape.c fragmentation). The patch
directly addresses that, which is great. Your patch, alongside my
just-committed patch concerning how we maintain the heap invariant,
together attack the merge bottleneck from two different directions:
they address I/O costs, and CPU costs, respectively.

Other things I noticed:

* You should probably point out that typically, access to batch memory
will still be sequential, despite your block-based scheme. The
preloading will now more or less make that the normal case. Any
fragmentation will now be essentially in memory, not on disk, which is
a big win.

* I think that logtape.c header comments are needed for this. Maybe
that's where you should point out that memory access is largely
sequential. But it's surely true that logtape.c needs to take
responsibility for being the place where the vast majority of memory
is allocated during merging.

* I think you should move "bool *mergeactive; /* active input run
source? */" within Tuplesortstate to be next to the other batch memory
stuff. No point in having separate merge and batch "sections" there
anymore.

* You didn't carry over my 0002-* batch memory patch modifications to
comments, even though you should have in a few cases. There remain
some references in comments to batch memory, as a thing exclusively
usable by final on-the-fly merges. That's not true anymore -- it's
usable by final merges, too. For example, readtup_alloc() still
references the final on-the-fly merge.

* You also fail to take credit in the commit message for making batch
memory usable when returning caller tuples to callers that happen to
request "randomAccess" (So, I guess the aforementioned comments above
routines like readtup_alloc() shouldn't even refer to merging, unless
it's to say that non-final merges are not supported due to their
unusual requirements). My patch didn't go that far (I only addressed
the final merge itself, not the subsequent access to tuples when
reading from that materialized final output tape by TSS_SORTEDONTAPE
case). But, that's actually really useful for randomAccess callers,
above and beyond what I proposed (which in any case was mostly written
with parallel workers in mind, which never do TSS_SORTEDONTAPE
processing).

* Furthermore, readtup_alloc() will not just be called in WRITETUP()
routines -- please update comments.

* There is a very subtle issue here:

+   /*
+    * We no longer need a large memtuples array, only one slot per tape. Shrink
+    * it, to make the memory available for other use. We only need one slot per
+    * tape.
+    */
+   pfree(state->memtuples);
+   FREEMEM(state, state->memtupsize * sizeof(SortTuple));
+   state->memtupsize = state->maxTapes;
+   state->memtuples = (SortTuple *) palloc(state->maxTapes * sizeof(SortTuple));
+   USEMEM(state, state->memtupsize * sizeof(SortTuple));

The FREEMEM() call needs to count the chunk overhead in both cases. In
short, I think you need to copy the GetMemoryChunkSpace() stuff that
you see within grow_memtuples().

* Whitespace issue here:

@@ -2334,7 +2349,8 @@ inittapes(Tuplesortstate *state)
#endif

/*
-    * Decrease availMem to reflect the space needed for tape buffers; but
+    * Decrease availMem to reflect the space needed for tape buffers, when
+    * writing the initial runs; but
* don't decrease it to the point that we have no room for tuples. (That
* case is only likely to occur if sorting pass-by-value Datums; in all
* other scenarios the memtuples[] array is unlikely to occupy more than
@@ -2359,14 +2375,6 @@ inittapes(Tuplesortstate *state)
state->tapeset = LogicalTapeSetCreate(maxTapes);

* I think that you need to comment on why state->tuplecontext is not
used for batch memory now. It is still useful, for multiple merge
passes, but the situation has notably changed for it.

* Doesn't this code need to call MemoryContextAllocHuge() rather than palloc()?:

@@ -709,18 +765,19 @@ LogicalTapeRewind(LogicalTapeSet *lts, int tapenum, bool forWrite)
Assert(lt->frozen);
datablocknum = ltsRewindFrozenIndirectBlock(lts, lt->indirect);
}
+
+       /* Allocate a read buffer */
+       if (lt->buffer)
+           pfree(lt->buffer);
+       lt->buffer = palloc(lt->read_buffer_size);
+       lt->buffer_size = lt->read_buffer_size;

* Typo:

+
+   /*
+    * from this point on, we no longer use the usemem()/lackmem() mechanism to
+    * track memory usage of indivitual tuples.
+    */
+   state->batchused = true;

* Please make this use the ".., + 1023" natural rounding trick that is
used in the similar traces that are removed:

+#ifdef TRACE_SORT
+   if (trace_sort)
+       elog(LOG, "using %d kB of memory for read buffers in %d tapes, %d kB per tape",
+            (int) (state->availMem / 1024), maxTapes, (int) (per_tape * BLCKSZ) / 1024);
+#endif

* It couldn't hurt to make this code paranoid about LACKMEM() becoming
true, which will cause havoc (we saw this recently in 9.6; a patch of
mine to fix that just went in):

+   /*
+    * Use all the spare memory we have available for read buffers. Divide it
+    * memory evenly among all the tapes.
+    */
+   avail_blocks = state->availMem / BLCKSZ;
+   per_tape = avail_blocks / maxTapes;
+   cutoff = avail_blocks % maxTapes;
+   if (per_tape == 0)
+   {
+       per_tape = 1;
+       cutoff = 0;
+   }
+   for (tapenum = 0; tapenum < maxTapes; tapenum++)
+   {
+       LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
+                                       (per_tape + (tapenum < cutoff ? 1 : 0)) * BLCKSZ);
+   }

In other words, we really don't want availMem to become < 0, since
it's int64, but a derived value is passed to
LogicalTapeAssignReadBufferSize() as an argument of type "Size". Now,
if LACKMEM() did happen it would be a bug anyway, but I recommend
defensive code also be added. Per grow_memtuples(), "We need to be
sure that we do not cause LACKMEM to become true, else the space
management algorithm will go nuts". Let's be sure that we get that
right, given what we saw recently in 9.6, especially since
grow_memtuples() will not actually have the chance to save us here (if
there is a bug along these lines, let's at least make the right "can't
happen error" complaint to the user when it pops up).

* It looks like your patch makes us less eager about freeing per-tape
batch memory, now held as preload buffer space within logtape.c.

ISTM that there should be some way to have the "tape exhausted" code
path within tuplesort_gettuple_common() (as well as the similar spot
within mergeonerun()) instruct logtape.c that we're done with that
tape. In other words, when mergeprereadone() (now renamed to
mergereadnext()) detects the tape is exhausted, it should have
logtape.c free its huge tape buffer immediately. Think about cases
where input is presorted, and earlier tapes can be freed quite early
on. It's worth keeping that behavior around (you removed the old way
that this happened, through mergebatchfreetape()).
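
The eager-release idea in the last item can be illustrated outside of
PostgreSQL. What follows is only a standalone C sketch under stated
assumptions: SketchInputTape, sketch_input_init() and sketch_read_next()
are made-up names, malloc()/free() stand in for the real memory context
calls, and a plain counter stands in for the tape's contents. It shows
nothing more than the read routine dropping the tape's large buffer at
the moment it reports EOF.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for one merge input tape. */
typedef struct SketchInputTape
{
	char	   *buffer;			/* large preload buffer, NULL once released */
	size_t		buffer_size;	/* size of that buffer in bytes */
	int			tuples_left;	/* tuples not yet returned from this tape */
} SketchInputTape;

/* Give the tape a read buffer and pretend it holds ntuples tuples. */
static bool
sketch_input_init(SketchInputTape *t, size_t bufsize, int ntuples)
{
	t->buffer = malloc(bufsize);
	if (t->buffer == NULL)
		return false;
	t->buffer_size = bufsize;
	t->tuples_left = ntuples;
	return true;
}

/*
 * Return true if a tuple was "read".  On EOF, release the buffer right
 * away instead of holding it until the whole merge is finished.
 */
static bool
sketch_read_next(SketchInputTape *t)
{
	if (t->tuples_left == 0)
	{
		free(t->buffer);		/* free(NULL) is harmless on repeated calls */
		t->buffer = NULL;
		t->buffer_size = 0;
		return false;
	}
	t->tuples_left--;
	return true;
}
```

With presorted input, the first tapes would hit the EOF branch early, and
their buffer space would become available long before the merge ends.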

That's all I have right now. I like the direction this is going in,
but I think this needs more polishing.

[1]: /messages/by-id/CAM3SWZQLP6e=1si1NcQjYft7R-VYpprrf_i59tZOZX5m7VFK-w@mail.gmail.com
--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18Peter Geoghegan
pg@heroku.com
In reply to: Peter Geoghegan (#17)
Re: Tuplesort merge pre-reading

On Sun, Sep 11, 2016 at 3:13 PM, Peter Geoghegan <pg@heroku.com> wrote:

* Please make this use the ".., + 1023" natural rounding trick that is
used in the similar traces that are removed:

+#ifdef TRACE_SORT
+   if (trace_sort)
+       elog(LOG, "using %d kB of memory for read buffers in %d tapes, %d kB per tape",
+            (int) (state->availMem / 1024), maxTapes, (int) (per_tape * BLCKSZ) / 1024);
+#endif

Also, please remove the int cast, and use INT64_FORMAT. Again, this
should match existing trace_sort traces concerning batch memory.
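
For illustration only, the two requests (round up with "+ 1023", and
print 64-bit values without narrowing casts) might look like this in
standalone C. This is a sketch, not the patch: bytes_to_kb_rounded_up()
and format_read_buffer_trace() are hypothetical helpers, and PRId64 from
<inttypes.h> stands in for PostgreSQL's INT64_FORMAT.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Round a byte count up to whole kilobytes, so that a partial kilobyte
 * is reported as 1 kB rather than 0 kB.
 */
static int64_t
bytes_to_kb_rounded_up(int64_t bytes)
{
	return (bytes + 1023) / 1024;
}

/*
 * Format a trace line into buf, keeping the memory totals 64 bits wide.
 * Returns whatever snprintf() returns.
 */
static int
format_read_buffer_trace(char *buf, size_t buflen,
						 int64_t avail_bytes, int ntapes,
						 int64_t per_tape_bytes)
{
	return snprintf(buf, buflen,
					"using %" PRId64 " kB of memory for read buffers in "
					"%d tapes, %" PRId64 " kB per tape",
					bytes_to_kb_rounded_up(avail_bytes), ntapes,
					bytes_to_kb_rounded_up(per_tape_bytes));
}
```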

--
Peter Geoghegan


#19Peter Geoghegan
pg@heroku.com
In reply to: Peter Geoghegan (#17)
Re: Tuplesort merge pre-reading

On Sun, Sep 11, 2016 at 3:13 PM, Peter Geoghegan <pg@heroku.com> wrote:

* Doesn't this code need to call MemoryContextAllocHuge() rather than palloc()?:

@@ -709,18 +765,19 @@ LogicalTapeRewind(LogicalTapeSet *lts, int tapenum, bool forWrite)
Assert(lt->frozen);
datablocknum = ltsRewindFrozenIndirectBlock(lts, lt->indirect);
}
+
+       /* Allocate a read buffer */
+       if (lt->buffer)
+           pfree(lt->buffer);
+       lt->buffer = palloc(lt->read_buffer_size);
+       lt->buffer_size = lt->read_buffer_size;

Of course, when you do that you're going to have to make the new
"buffer_size" and "read_buffer_size" fields of type "Size" (or,
possibly, "int64", to match tuplesort.c's own buffer sizing variables
ever since Noah added MaxAllocSizeHuge). Ditto for the existing "pos"
and "nbytes" fields next to "buffer_size" within the struct
LogicalTape, I think. ISTM that logtape.c blocknums can remain of type
"long", though, since that reflects an existing hardly-relevant
limitation that you're not making any worse.

More generally, I think you also need to explain in comments why there
is a "read_buffer_size" field in addition to the "buffer_size" field.
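
A rough standalone sketch of the two points above, with the caveat that
none of it is logtape.c code: SketchTape is a made-up stand-in for the
buffer-related fields of struct LogicalTape, using a type wide enough
for sizes past the regular allocation limit, and malloc() stands in for
palloc(). The only PostgreSQL fact relied on is that MaxAllocSize is
0x3fffffff. The roles given to the two size fields follow the quoted
hunk, where the buffer is (re)allocated at rewind time from
read_buffer_size.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Same value as PostgreSQL's MaxAllocSize (just under 1 GB). */
#define SKETCH_MAX_ALLOC_SIZE ((size_t) 0x3fffffff)

/* Hypothetical stand-in for the buffer fields of struct LogicalTape. */
typedef struct SketchTape
{
	char	   *buffer;
	size_t		buffer_size;		/* size of the buffer allocated right now */
	size_t		read_buffer_size;	/* size to use when next rewound for reading */
	size_t		pos;				/* next read/write position in buffer */
	size_t		nbytes;				/* number of valid bytes in buffer */
} SketchTape;

/* Would a request of this size need the "huge" allocation entry point? */
static bool
sketch_needs_huge_alloc(size_t request)
{
	return request > SKETCH_MAX_ALLOC_SIZE;
}

/* At rewind, swap the current buffer for one of the assigned read size. */
static bool
sketch_rewind_for_read(SketchTape *lt)
{
	free(lt->buffer);
	lt->buffer = malloc(lt->read_buffer_size);
	if (lt->buffer == NULL)
		return false;
	lt->buffer_size = lt->read_buffer_size;
	lt->pos = 0;
	lt->nbytes = 0;
	return true;
}
```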
--
Peter Geoghegan


#20Peter Geoghegan
pg@heroku.com
In reply to: Peter Geoghegan (#17)
Re: Tuplesort merge pre-reading

On Sun, Sep 11, 2016 at 3:13 PM, Peter Geoghegan <pg@heroku.com> wrote:

+   for (tapenum = 0; tapenum < maxTapes; tapenum++)
+   {
+       LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
+                                       (per_tape + (tapenum < cutoff ? 1 : 0)) * BLCKSZ);
+   }

Spotted another issue with this code just now. Shouldn't it be based
on state->tapeRange? You don't want the destTape to get memory, since
you don't use batch memory for tapes that are written to (and final
on-the-fly merges don't use their destTape at all).

(Looks again...)

Wait, you're using a local variable maxTapes here, which potentially
differs from state->maxTapes:

+   /*
+    * If we had fewer runs than tapes, refund buffers for tapes that were never
+    * allocated.
+    */
+   maxTapes = state->maxTapes;
+   if (state->currentRun < maxTapes)
+   {
+       FREEMEM(state, (maxTapes - state->currentRun) * TAPE_BUFFER_OVERHEAD);
+       maxTapes = state->currentRun;
+   }

I find this confusing, and think it's probably buggy. I don't think
you should have a local variable called maxTapes that you modify at
all, since state->maxTapes is supposed to not change once established.
You can't use state->currentRun like that, either, I think, because
it's the high watermark number of runs (quicksorted runs), not runs
after any particular merge phase, where we end up with fewer runs as
they're merged (we must also consider dummy runs to get this) -- we
want something like activeTapes. cf. the code you removed for the
beginmerge() finalMergeBatch case. Of course, activeTapes will vary if
there are multiple merge passes, which suggests all this code really
has no business being in mergeruns() (it should instead be in
beginmerge(), or code that beginmerge() reliably calls).
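
The division under discussion can be looked at in isolation. The
following is a minimal standalone C sketch, not the patch's code: it
splits the spare memory, in whole blocks, among the input tapes only,
and it refuses a negative balance instead of letting it be converted to
an unsigned size. sketch_divide_read_buffers() and SKETCH_BLCKSZ are
made-up names, and the number of input tapes is assumed to be supplied
by the caller (something like activeTapes, rather than a locally
modified maxTapes).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BLCKSZ 8192		/* PostgreSQL's default block size */

/*
 * Divide avail_mem bytes among ninputs input tapes, in whole blocks.
 * sizes[] must have room for ninputs entries and receives one byte
 * count per tape.  Returns false if the request makes no sense
 * (negative memory, or no input tapes), so that the caller can raise a
 * "can't happen" error.
 */
static bool
sketch_divide_read_buffers(int64_t avail_mem, int ninputs, size_t *sizes)
{
	int64_t		avail_blocks;
	int64_t		per_tape;
	int64_t		cutoff;
	int			i;

	if (avail_mem < 0 || ninputs <= 0)
		return false;

	avail_blocks = avail_mem / SKETCH_BLCKSZ;
	per_tape = avail_blocks / ninputs;
	cutoff = avail_blocks % ninputs;	/* this many tapes get one extra block */
	if (per_tape == 0)
	{
		/* Each tape needs at least one block, even if that overshoots. */
		per_tape = 1;
		cutoff = 0;
	}

	for (i = 0; i < ninputs; i++)
		sizes[i] = (size_t) (per_tape + (i < cutoff ? 1 : 0)) * SKETCH_BLCKSZ;

	return true;
}
```

Because the output tape is simply not counted in ninputs, it never
receives a share of the read buffer memory in this sketch.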

Immediately afterwards, you do this:

+   /* Initialize the merge tuple buffer arena.  */
+   state->batchMemoryBegin = palloc((maxTapes + 1) * MERGETUPLEBUFFER_SIZE);
+   state->batchMemoryEnd = state->batchMemoryBegin + (maxTapes + 1) * MERGETUPLEBUFFER_SIZE;
+   state->freeBufferHead = (MergeTupleBuffer *) state->batchMemoryBegin;
+   USEMEM(state, (maxTapes + 1) * MERGETUPLEBUFFER_SIZE);

The fact that you size the buffer based on "maxTapes + 1" also
suggests a problem. I think that the code looks like this because it
must instruct logtape.c that the destTape requires some buffer
(iff there is to be a non-final merge). Is that right? I hope that you
don't give the typically unused destTape a full share of batch
memory all the time (the same applies to any other
inactive-at-final-merge tapes).
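
To make the sizing concern concrete, here is a minimal standalone sketch. It is not PostgreSQL code: the function names and SKETCH_TAPE_BUFFER_SIZE are hypothetical, and it assumes one particular reading of the point above, namely that buffers are sized per merge pass from the number of active input tapes, with the destination tape counted only when the merge is not the final on-the-fly one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-tape buffer size; not the real value. */
#define SKETCH_TAPE_BUFFER_SIZE 8192

/*
 * Number of tapes that need buffer memory for one merge pass: every active
 * input tape, plus the destination tape only if the merged output is written
 * back to tape (that is, the merge is not the final on-the-fly merge).
 */
int
sketch_tapes_needing_buffers(int active_tapes, bool final_merge)
{
    assert(active_tapes >= 0);
    return active_tapes + (final_merge ? 0 : 1);
}

/* Total bytes to reserve for that pass under the assumptions above. */
size_t
sketch_merge_buffer_bytes(int active_tapes, bool final_merge)
{
    return (size_t) sketch_tapes_needing_buffers(active_tapes, final_merge)
        * SKETCH_TAPE_BUFFER_SIZE;
}
```

Under this reading the count is derived fresh for each pass, so nothing like state->maxTapes has to be copied into a local variable and modified.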

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21Gavin Flower
GavinFlower@archidevsys.co.nz
In reply to: Peter Geoghegan (#17)
Re: Tuplesort merge pre-reading

On 12/09/16 10:13, Peter Geoghegan wrote:

On Sun, Sep 11, 2016 at 8:47 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

[...]

I don't know what the difference is between accessing 10 pages
randomly, and accessing a random set of 10 single pages sequentially,
in close succession. As Tom would say, that's above my pay grade. I
suppose it comes down to how close "close" actually is (but in any
case, it's all very fudged).

If you select ten pages at random and sort them, then consecutive reads
of the sorted list are more likely to access pages in the same block or
closely adjacent (is my understanding).

e.g.

blocks: XXXX XXXX XXXX XXXX XXXX
pages: 0 1 2 3 4 5 6 7 8 9

if the ten 'random pages' were selected in the random order:
6 1 7 8 4 2 9 3 0
Consecutive reads would always read new blocks, but the sorted list
would have blocks read sequentially.

In practice, it would rarely be this simple. However, if any of the
random pages were in the same block, then that block would only need to
be fetched once. Similarly, if 2 of the random pages were in
consecutive blocks, then the two blocks would be logically adjacent
(which means they are likely to be physically close together, but not
guaranteed!).
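
The effect can be sketched in a few lines of C. This is only a toy model: it assumes two pages per block (as the diagram suggests) and that only the most recently read block stays cached. The function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort() comparator: ascending page numbers. */
int
cmp_page(const void *a, const void *b)
{
    int pa = *(const int *) a;
    int pb = *(const int *) b;

    return (pa > pb) - (pa < pb);
}

/*
 * Count how many block fetches are needed to read the pages in the given
 * order, assuming only the most recently read block stays cached.
 */
int
count_block_fetches(const int *pages, size_t npages, int pages_per_block)
{
    int    fetches = 0;
    int    last_block = -1;     /* nothing cached yet */
    size_t i;

    assert(pages_per_block > 0);
    for (i = 0; i < npages; i++)
    {
        int block = pages[i] / pages_per_block;

        if (block != last_block)
        {
            fetches++;
            last_block = block;
        }
    }
    return fetches;
}
```

For the nine page numbers listed above, read in that order, count_block_fetches() reports 9 fetches. After sorting the array with qsort() and cmp_page(), it reports 5.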

[...]

Cheers,
Gavin

#22Gavin Flower
GavinFlower@archidevsys.co.nz
In reply to: Gavin Flower (#21)
Re: Tuplesort merge pre-reading

On 12/09/16 12:16, Gavin Flower wrote:
[...]

two blocks would be logically adjacent (which means they are likely
to be physically close together, but not guaranteed!).

[...]

Actual disk layouts are quite complicated; the above is an
oversimplification, but the message is still valid.

There are various tricks of disc layout (and low-level handling) that can
be used to minimise the time taken to read 2 blocks that are logically
adjacent. I had to know this stuff for the discs that mainframe computers
used in the 1980s. Modern disk systems are way more complicated, but
the conclusions are still valid.

I am extremely glad that I no longer have to concern myself with
understanding the precise low-level details of discs these days: there is no
longer a one-to-one correspondence between what the O/S thinks is a disk
block and how the data is physically recorded on the disc.

Cheers,
Gavin

#23Claudio Freire
klaussfreire@gmail.com
In reply to: Heikki Linnakangas (#16)
Re: Tuplesort merge pre-reading

On Sun, Sep 11, 2016 at 12:47 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Here's a new version of these patches, rebased over current master. I
squashed the two patches into one, there's not much point to keep them
separate.

I don't know what was up with the other ones, but this one works fine.

Benchmarking now.

#24Claudio Freire
klaussfreire@gmail.com
In reply to: Claudio Freire (#23)
2 attachment(s)
Re: Tuplesort merge pre-reading

On Mon, Sep 12, 2016 at 12:02 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

On Sun, Sep 11, 2016 at 12:47 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Here's a new version of these patches, rebased over current master. I
squashed the two patches into one, there's not much point to keep them
separate.

I don't know what was up with the other ones, but this one works fine.

Benchmarking now.

I spoke too soon: git am had failed and I didn't notice.

regression.diffs attached

Built with

./configure --enable-debug --enable-cassert && make clean && make -j7 && make check

Attachments:

regression.diffs (application/octet-stream)
*** /home/claudiofreire/src/postgresql.work/src/test/regress/expected/cluster.out	2016-09-05 20:45:48.604032169 -0300
--- /home/claudiofreire/src/postgresql.work/src/test/regress/results/cluster.out	2016-09-12 12:14:51.831413902 -0300
***************
*** 461,482 ****
  -- run, due to the fact that input is found to be presorted:
  set replacement_sort_tuples = 150000;
  cluster clstr_4 using cluster_sort;
! select * from
! (select hundred, lag(hundred) over () as lhundred,
!         thousand, lag(thousand) over () as lthousand,
!         tenthous, lag(tenthous) over () as ltenthous from clstr_4) ss
! where row(hundred, thousand, tenthous) <= row(lhundred, lthousand, ltenthous);
!  hundred | lhundred | thousand | lthousand | tenthous | ltenthous 
! ---------+----------+----------+-----------+----------+-----------
! (0 rows)
! 
! reset enable_indexscan;
! reset maintenance_work_mem;
! reset replacement_sort_tuples;
! -- clean up
! DROP TABLE clustertest;
! DROP TABLE clstr_1;
! DROP TABLE clstr_2;
! DROP TABLE clstr_3;
! DROP TABLE clstr_4;
! DROP USER regress_clstr_user;
--- 461,467 ----
  -- run, due to the fact that input is found to be presorted:
  set replacement_sort_tuples = 150000;
  cluster clstr_4 using cluster_sort;
! server closed the connection unexpectedly
! 	This probably means the server terminated abnormally
! 	before or while processing the request.
! connection to server was lost

======================================================================

*** /home/claudiofreire/src/postgresql.work/src/test/regress/expected/plancache.out	2016-09-05 20:45:48.872032991 -0300
--- /home/claudiofreire/src/postgresql.work/src/test/regress/results/plancache.out	2016-09-12 12:14:51.887413916 -0300
***************
*** 1,254 ****
! --
! -- Tests to exercise the plan caching/invalidation mechanism
! --
! CREATE TEMP TABLE pcachetest AS SELECT * FROM int8_tbl;
! -- create and use a cached plan
! PREPARE prepstmt AS SELECT * FROM pcachetest;
! EXECUTE prepstmt;
!         q1        |        q2         
! ------------------+-------------------
!               123 |               456
!               123 |  4567890123456789
!  4567890123456789 |               123
!  4567890123456789 |  4567890123456789
!  4567890123456789 | -4567890123456789
! (5 rows)
! 
! -- and one with parameters
! PREPARE prepstmt2(bigint) AS SELECT * FROM pcachetest WHERE q1 = $1;
! EXECUTE prepstmt2(123);
!  q1  |        q2        
! -----+------------------
!  123 |              456
!  123 | 4567890123456789
! (2 rows)
! 
! -- invalidate the plans and see what happens
! DROP TABLE pcachetest;
! EXECUTE prepstmt;
! ERROR:  relation "pcachetest" does not exist
! EXECUTE prepstmt2(123);
! ERROR:  relation "pcachetest" does not exist
! -- recreate the temp table (this demonstrates that the raw plan is
! -- purely textual and doesn't depend on OIDs, for instance)
! CREATE TEMP TABLE pcachetest AS SELECT * FROM int8_tbl ORDER BY 2;
! EXECUTE prepstmt;
!         q1        |        q2         
! ------------------+-------------------
!  4567890123456789 | -4567890123456789
!  4567890123456789 |               123
!               123 |               456
!               123 |  4567890123456789
!  4567890123456789 |  4567890123456789
! (5 rows)
! 
! EXECUTE prepstmt2(123);
!  q1  |        q2        
! -----+------------------
!  123 |              456
!  123 | 4567890123456789
! (2 rows)
! 
! -- prepared statements should prevent change in output tupdesc,
! -- since clients probably aren't expecting that to change on the fly
! ALTER TABLE pcachetest ADD COLUMN q3 bigint;
! EXECUTE prepstmt;
! ERROR:  cached plan must not change result type
! EXECUTE prepstmt2(123);
! ERROR:  cached plan must not change result type
! -- but we're nice guys and will let you undo your mistake
! ALTER TABLE pcachetest DROP COLUMN q3;
! EXECUTE prepstmt;
!         q1        |        q2         
! ------------------+-------------------
!  4567890123456789 | -4567890123456789
!  4567890123456789 |               123
!               123 |               456
!               123 |  4567890123456789
!  4567890123456789 |  4567890123456789
! (5 rows)
! 
! EXECUTE prepstmt2(123);
!  q1  |        q2        
! -----+------------------
!  123 |              456
!  123 | 4567890123456789
! (2 rows)
! 
! -- Try it with a view, which isn't directly used in the resulting plan
! -- but should trigger invalidation anyway
! CREATE TEMP VIEW pcacheview AS
!   SELECT * FROM pcachetest;
! PREPARE vprep AS SELECT * FROM pcacheview;
! EXECUTE vprep;
!         q1        |        q2         
! ------------------+-------------------
!  4567890123456789 | -4567890123456789
!  4567890123456789 |               123
!               123 |               456
!               123 |  4567890123456789
!  4567890123456789 |  4567890123456789
! (5 rows)
! 
! CREATE OR REPLACE TEMP VIEW pcacheview AS
!   SELECT q1, q2/2 AS q2 FROM pcachetest;
! EXECUTE vprep;
!         q1        |        q2         
! ------------------+-------------------
!  4567890123456789 | -2283945061728394
!  4567890123456789 |                61
!               123 |               228
!               123 |  2283945061728394
!  4567890123456789 |  2283945061728394
! (5 rows)
! 
! -- Check basic SPI plan invalidation
! create function cache_test(int) returns int as $$
! declare total int;
! begin
! 	create temp table t1(f1 int);
! 	insert into t1 values($1);
! 	insert into t1 values(11);
! 	insert into t1 values(12);
! 	insert into t1 values(13);
! 	select sum(f1) into total from t1;
! 	drop table t1;
! 	return total;
! end
! $$ language plpgsql;
! select cache_test(1);
!  cache_test 
! ------------
!          37
! (1 row)
! 
! select cache_test(2);
!  cache_test 
! ------------
!          38
! (1 row)
! 
! select cache_test(3);
!  cache_test 
! ------------
!          39
! (1 row)
! 
! -- Check invalidation of plpgsql "simple expression"
! create temp view v1 as
!   select 2+2 as f1;
! create function cache_test_2() returns int as $$
! begin
! 	return f1 from v1;
! end$$ language plpgsql;
! select cache_test_2();
!  cache_test_2 
! --------------
!             4
! (1 row)
! 
! create or replace temp view v1 as
!   select 2+2+4 as f1;
! select cache_test_2();
!  cache_test_2 
! --------------
!             8
! (1 row)
! 
! create or replace temp view v1 as
!   select 2+2+4+(select max(unique1) from tenk1) as f1;
! select cache_test_2();
!  cache_test_2 
! --------------
!         10007
! (1 row)
! 
! --- Check that change of search_path is honored when re-using cached plan
! create schema s1
!   create table abc (f1 int);
! create schema s2
!   create table abc (f1 int);
! insert into s1.abc values(123);
! insert into s2.abc values(456);
! set search_path = s1;
! prepare p1 as select f1 from abc;
! execute p1;
!  f1  
! -----
!  123
! (1 row)
! 
! set search_path = s2;
! select f1 from abc;
!  f1  
! -----
!  456
! (1 row)
! 
! execute p1;
!  f1  
! -----
!  456
! (1 row)
! 
! alter table s1.abc add column f2 float8;   -- force replan
! execute p1;
!  f1  
! -----
!  456
! (1 row)
! 
! drop schema s1 cascade;
! NOTICE:  drop cascades to table s1.abc
! drop schema s2 cascade;
! NOTICE:  drop cascades to table abc
! reset search_path;
! -- Check that invalidation deals with regclass constants
! create temp sequence seq;
! prepare p2 as select nextval('seq');
! execute p2;
!  nextval 
! ---------
!        1
! (1 row)
! 
! drop sequence seq;
! create temp sequence seq;
! execute p2;
!  nextval 
! ---------
!        1
! (1 row)
! 
! -- Check DDL via SPI, immediately followed by SPI plan re-use
! -- (bug in original coding)
! create function cachebug() returns void as $$
! declare r int;
! begin
!   drop table if exists temptable cascade;
!   create temp table temptable as select * from generate_series(1,3) as f1;
!   create temp view vv as select * from temptable;
!   for r in select * from vv loop
!     raise notice '%', r;
!   end loop;
! end$$ language plpgsql;
! select cachebug();
! NOTICE:  table "temptable" does not exist, skipping
! NOTICE:  1
! NOTICE:  2
! NOTICE:  3
!  cachebug 
! ----------
!  
! (1 row)
! 
! select cachebug();
! NOTICE:  drop cascades to view vv
! NOTICE:  1
! NOTICE:  2
! NOTICE:  3
!  cachebug 
! ----------
!  
! (1 row)
! 
--- 1 ----
! psql: FATAL:  the database system is in recovery mode

======================================================================

*** /home/claudiofreire/src/postgresql.work/src/test/regress/expected/limit.out	2016-09-05 20:45:48.780032710 -0300
--- /home/claudiofreire/src/postgresql.work/src/test/regress/results/limit.out	2016-09-12 12:14:51.883413916 -0300
***************
*** 1,322 ****
! --
! -- LIMIT
! -- Check the LIMIT/OFFSET feature of SELECT
! --
! SELECT ''::text AS two, unique1, unique2, stringu1
! 		FROM onek WHERE unique1 > 50
! 		ORDER BY unique1 LIMIT 2;
!  two | unique1 | unique2 | stringu1 
! -----+---------+---------+----------
!      |      51 |      76 | ZBAAAA
!      |      52 |     985 | ACAAAA
! (2 rows)
! 
! SELECT ''::text AS five, unique1, unique2, stringu1
! 		FROM onek WHERE unique1 > 60
! 		ORDER BY unique1 LIMIT 5;
!  five | unique1 | unique2 | stringu1 
! ------+---------+---------+----------
!       |      61 |     560 | JCAAAA
!       |      62 |     633 | KCAAAA
!       |      63 |     296 | LCAAAA
!       |      64 |     479 | MCAAAA
!       |      65 |      64 | NCAAAA
! (5 rows)
! 
! SELECT ''::text AS two, unique1, unique2, stringu1
! 		FROM onek WHERE unique1 > 60 AND unique1 < 63
! 		ORDER BY unique1 LIMIT 5;
!  two | unique1 | unique2 | stringu1 
! -----+---------+---------+----------
!      |      61 |     560 | JCAAAA
!      |      62 |     633 | KCAAAA
! (2 rows)
! 
! SELECT ''::text AS three, unique1, unique2, stringu1
! 		FROM onek WHERE unique1 > 100
! 		ORDER BY unique1 LIMIT 3 OFFSET 20;
!  three | unique1 | unique2 | stringu1 
! -------+---------+---------+----------
!        |     121 |     700 | REAAAA
!        |     122 |     519 | SEAAAA
!        |     123 |     777 | TEAAAA
! (3 rows)
! 
! SELECT ''::text AS zero, unique1, unique2, stringu1
! 		FROM onek WHERE unique1 < 50
! 		ORDER BY unique1 DESC LIMIT 8 OFFSET 99;
!  zero | unique1 | unique2 | stringu1 
! ------+---------+---------+----------
! (0 rows)
! 
! SELECT ''::text AS eleven, unique1, unique2, stringu1
! 		FROM onek WHERE unique1 < 50
! 		ORDER BY unique1 DESC LIMIT 20 OFFSET 39;
!  eleven | unique1 | unique2 | stringu1 
! --------+---------+---------+----------
!         |      10 |     520 | KAAAAA
!         |       9 |      49 | JAAAAA
!         |       8 |     653 | IAAAAA
!         |       7 |     647 | HAAAAA
!         |       6 |     978 | GAAAAA
!         |       5 |     541 | FAAAAA
!         |       4 |     833 | EAAAAA
!         |       3 |     431 | DAAAAA
!         |       2 |     326 | CAAAAA
!         |       1 |     214 | BAAAAA
!         |       0 |     998 | AAAAAA
! (11 rows)
! 
! SELECT ''::text AS ten, unique1, unique2, stringu1
! 		FROM onek
! 		ORDER BY unique1 OFFSET 990;
!  ten | unique1 | unique2 | stringu1 
! -----+---------+---------+----------
!      |     990 |     369 | CMAAAA
!      |     991 |     426 | DMAAAA
!      |     992 |     363 | EMAAAA
!      |     993 |     661 | FMAAAA
!      |     994 |     695 | GMAAAA
!      |     995 |     144 | HMAAAA
!      |     996 |     258 | IMAAAA
!      |     997 |      21 | JMAAAA
!      |     998 |     549 | KMAAAA
!      |     999 |     152 | LMAAAA
! (10 rows)
! 
! SELECT ''::text AS five, unique1, unique2, stringu1
! 		FROM onek
! 		ORDER BY unique1 OFFSET 990 LIMIT 5;
!  five | unique1 | unique2 | stringu1 
! ------+---------+---------+----------
!       |     990 |     369 | CMAAAA
!       |     991 |     426 | DMAAAA
!       |     992 |     363 | EMAAAA
!       |     993 |     661 | FMAAAA
!       |     994 |     695 | GMAAAA
! (5 rows)
! 
! SELECT ''::text AS five, unique1, unique2, stringu1
! 		FROM onek
! 		ORDER BY unique1 LIMIT 5 OFFSET 900;
!  five | unique1 | unique2 | stringu1 
! ------+---------+---------+----------
!       |     900 |     913 | QIAAAA
!       |     901 |     931 | RIAAAA
!       |     902 |     702 | SIAAAA
!       |     903 |     641 | TIAAAA
!       |     904 |     793 | UIAAAA
! (5 rows)
! 
! -- Stress test for variable LIMIT in conjunction with bounded-heap sorting
! SELECT
!   (SELECT n
!      FROM (VALUES (1)) AS x,
!           (SELECT n FROM generate_series(1,10) AS n
!              ORDER BY n LIMIT 1 OFFSET s-1) AS y) AS z
!   FROM generate_series(1,10) AS s;
!  z  
! ----
!   1
!   2
!   3
!   4
!   5
!   6
!   7
!   8
!   9
!  10
! (10 rows)
! 
! --
! -- Test behavior of volatile and set-returning functions in conjunction
! -- with ORDER BY and LIMIT.
! --
! create temp sequence testseq;
! explain (verbose, costs off)
! select unique1, unique2, nextval('testseq')
!   from tenk1 order by unique2 limit 10;
!                            QUERY PLAN                           
! ----------------------------------------------------------------
!  Limit
!    Output: unique1, unique2, (nextval('testseq'::regclass))
!    ->  Index Scan using tenk1_unique2 on public.tenk1
!          Output: unique1, unique2, nextval('testseq'::regclass)
! (4 rows)
! 
! select unique1, unique2, nextval('testseq')
!   from tenk1 order by unique2 limit 10;
!  unique1 | unique2 | nextval 
! ---------+---------+---------
!     8800 |       0 |       1
!     1891 |       1 |       2
!     3420 |       2 |       3
!     9850 |       3 |       4
!     7164 |       4 |       5
!     8009 |       5 |       6
!     5057 |       6 |       7
!     6701 |       7 |       8
!     4321 |       8 |       9
!     3043 |       9 |      10
! (10 rows)
! 
! select currval('testseq');
!  currval 
! ---------
!       10
! (1 row)
! 
! explain (verbose, costs off)
! select unique1, unique2, nextval('testseq')
!   from tenk1 order by tenthous limit 10;
!                                 QUERY PLAN                                
! --------------------------------------------------------------------------
!  Limit
!    Output: unique1, unique2, (nextval('testseq'::regclass)), tenthous
!    ->  Result
!          Output: unique1, unique2, nextval('testseq'::regclass), tenthous
!          ->  Sort
!                Output: unique1, unique2, tenthous
!                Sort Key: tenk1.tenthous
!                ->  Seq Scan on public.tenk1
!                      Output: unique1, unique2, tenthous
! (9 rows)
! 
! select unique1, unique2, nextval('testseq')
!   from tenk1 order by tenthous limit 10;
!  unique1 | unique2 | nextval 
! ---------+---------+---------
!        0 |    9998 |      11
!        1 |    2838 |      12
!        2 |    2716 |      13
!        3 |    5679 |      14
!        4 |    1621 |      15
!        5 |    5557 |      16
!        6 |    2855 |      17
!        7 |    8518 |      18
!        8 |    5435 |      19
!        9 |    4463 |      20
! (10 rows)
! 
! select currval('testseq');
!  currval 
! ---------
!       20
! (1 row)
! 
! explain (verbose, costs off)
! select unique1, unique2, generate_series(1,10)
!   from tenk1 order by unique2 limit 7;
!                         QUERY PLAN                        
! ----------------------------------------------------------
!  Limit
!    Output: unique1, unique2, (generate_series(1, 10))
!    ->  Index Scan using tenk1_unique2 on public.tenk1
!          Output: unique1, unique2, generate_series(1, 10)
! (4 rows)
! 
! select unique1, unique2, generate_series(1,10)
!   from tenk1 order by unique2 limit 7;
!  unique1 | unique2 | generate_series 
! ---------+---------+-----------------
!     8800 |       0 |               1
!     8800 |       0 |               2
!     8800 |       0 |               3
!     8800 |       0 |               4
!     8800 |       0 |               5
!     8800 |       0 |               6
!     8800 |       0 |               7
! (7 rows)
! 
! explain (verbose, costs off)
! select unique1, unique2, generate_series(1,10)
!   from tenk1 order by tenthous limit 7;
!                              QUERY PLAN                             
! --------------------------------------------------------------------
!  Limit
!    Output: unique1, unique2, (generate_series(1, 10)), tenthous
!    ->  Result
!          Output: unique1, unique2, generate_series(1, 10), tenthous
!          ->  Sort
!                Output: unique1, unique2, tenthous
!                Sort Key: tenk1.tenthous
!                ->  Seq Scan on public.tenk1
!                      Output: unique1, unique2, tenthous
! (9 rows)
! 
! select unique1, unique2, generate_series(1,10)
!   from tenk1 order by tenthous limit 7;
!  unique1 | unique2 | generate_series 
! ---------+---------+-----------------
!        0 |    9998 |               1
!        0 |    9998 |               2
!        0 |    9998 |               3
!        0 |    9998 |               4
!        0 |    9998 |               5
!        0 |    9998 |               6
!        0 |    9998 |               7
! (7 rows)
! 
! -- use of random() is to keep planner from folding the expressions together
! explain (verbose, costs off)
! select generate_series(0,2) as s1, generate_series((random()*.1)::int,2) as s2;
!                                               QUERY PLAN                                              
! ------------------------------------------------------------------------------------------------------
!  Result
!    Output: generate_series(0, 2), generate_series(((random() * '0.1'::double precision))::integer, 2)
! (2 rows)
! 
! select generate_series(0,2) as s1, generate_series((random()*.1)::int,2) as s2;
!  s1 | s2 
! ----+----
!   0 |  0
!   1 |  1
!   2 |  2
! (3 rows)
! 
! explain (verbose, costs off)
! select generate_series(0,2) as s1, generate_series((random()*.1)::int,2) as s2
! order by s2 desc;
!                                                  QUERY PLAN                                                 
! ------------------------------------------------------------------------------------------------------------
!  Sort
!    Output: (generate_series(0, 2)), (generate_series(((random() * '0.1'::double precision))::integer, 2))
!    Sort Key: (generate_series(((random() * '0.1'::double precision))::integer, 2)) DESC
!    ->  Result
!          Output: generate_series(0, 2), generate_series(((random() * '0.1'::double precision))::integer, 2)
! (5 rows)
! 
! select generate_series(0,2) as s1, generate_series((random()*.1)::int,2) as s2
! order by s2 desc;
!  s1 | s2 
! ----+----
!   2 |  2
!   1 |  1
!   0 |  0
! (3 rows)
! 
! -- test for failure to set all aggregates' aggtranstype
! explain (verbose, costs off)
! select sum(tenthous) as s1, sum(tenthous) + random()*0 as s2
!   from tenk1 group by thousand order by thousand limit 3;
!                                                     QUERY PLAN                                                     
! -------------------------------------------------------------------------------------------------------------------
!  Limit
!    Output: (sum(tenthous)), (((sum(tenthous))::double precision + (random() * '0'::double precision))), thousand
!    ->  GroupAggregate
!          Output: sum(tenthous), ((sum(tenthous))::double precision + (random() * '0'::double precision)), thousand
!          Group Key: tenk1.thousand
!          ->  Index Only Scan using tenk1_thous_tenthous on public.tenk1
!                Output: thousand, tenthous
! (7 rows)
! 
! select sum(tenthous) as s1, sum(tenthous) + random()*0 as s2
!   from tenk1 group by thousand order by thousand limit 3;
!   s1   |  s2   
! -------+-------
!  45000 | 45000
!  45010 | 45010
!  45020 | 45020
! (3 rows)
! 
--- 1 ----
! psql: FATAL:  the database system is in recovery mode

======================================================================

*** /home/claudiofreire/src/postgresql.work/src/test/regress/expected/plpgsql.out	2016-09-05 20:45:48.892033053 -0300
--- /home/claudiofreire/src/postgresql.work/src/test/regress/results/plpgsql.out	2016-09-12 12:14:51.887413916 -0300
***************
*** 1,5645 ****
! --
! -- PLPGSQL
! --
! -- Scenario:
! --
! --     A building with a modern TP cable installation where any
! --     of the wall connectors can be used to plug in phones,
! --     ethernet interfaces or local office hubs. The backside
! --     of the wall connectors is wired to one of several patch-
! --     fields in the building.
! --
! --     In the patchfields, there are hubs and all the slots
! --     representing the wall connectors. In addition there are
! --     slots that can represent a phone line from the central
! --     phone system.
! --
! --     Triggers ensure consistency of the patching information.
! --
! --     Functions are used to build up powerful views that let
! --     you look behind the wall when looking at a patchfield
! --     or into a room.
! --
! create table Room (
!     roomno	char(8),
!     comment	text
! );
! create unique index Room_rno on Room using btree (roomno bpchar_ops);
! create table WSlot (
!     slotname	char(20),
!     roomno	char(8),
!     slotlink	char(20),
!     backlink	char(20)
! );
! create unique index WSlot_name on WSlot using btree (slotname bpchar_ops);
! create table PField (
!     name	text,
!     comment	text
! );
! create unique index PField_name on PField using btree (name text_ops);
! create table PSlot (
!     slotname	char(20),
!     pfname	text,
!     slotlink	char(20),
!     backlink	char(20)
! );
! create unique index PSlot_name on PSlot using btree (slotname bpchar_ops);
! create table PLine (
!     slotname	char(20),
!     phonenumber	char(20),
!     comment	text,
!     backlink	char(20)
! );
! create unique index PLine_name on PLine using btree (slotname bpchar_ops);
! create table Hub (
!     name	char(14),
!     comment	text,
!     nslots	integer
! );
! create unique index Hub_name on Hub using btree (name bpchar_ops);
! create table HSlot (
!     slotname	char(20),
!     hubname	char(14),
!     slotno	integer,
!     slotlink	char(20)
! );
! create unique index HSlot_name on HSlot using btree (slotname bpchar_ops);
! create index HSlot_hubname on HSlot using btree (hubname bpchar_ops);
! create table System (
!     name	text,
!     comment	text
! );
! create unique index System_name on System using btree (name text_ops);
! create table IFace (
!     slotname	char(20),
!     sysname	text,
!     ifname	text,
!     slotlink	char(20)
! );
! create unique index IFace_name on IFace using btree (slotname bpchar_ops);
! create table PHone (
!     slotname	char(20),
!     comment	text,
!     slotlink	char(20)
! );
! create unique index PHone_name on PHone using btree (slotname bpchar_ops);
! -- ************************************************************
! -- *
! -- * Trigger procedures and functions for the patchfield
! -- * test of PL/pgSQL
! -- *
! -- ************************************************************
! -- ************************************************************
! -- * AFTER UPDATE on Room
! -- *	- If room no changes let wall slots follow
! -- ************************************************************
! create function tg_room_au() returns trigger as '
! begin
!     if new.roomno != old.roomno then
!         update WSlot set roomno = new.roomno where roomno = old.roomno;
!     end if;
!     return new;
! end;
! ' language plpgsql;
! create trigger tg_room_au after update
!     on Room for each row execute procedure tg_room_au();
! -- ************************************************************
! -- * AFTER DELETE on Room
! -- *	- delete wall slots in this room
! -- ************************************************************
! create function tg_room_ad() returns trigger as '
! begin
!     delete from WSlot where roomno = old.roomno;
!     return old;
! end;
! ' language plpgsql;
! create trigger tg_room_ad after delete
!     on Room for each row execute procedure tg_room_ad();
! -- ************************************************************
! -- * BEFORE INSERT or UPDATE on WSlot
! -- *	- Check that room exists
! -- ************************************************************
! create function tg_wslot_biu() returns trigger as $$
! begin
!     if count(*) = 0 from Room where roomno = new.roomno then
!         raise exception 'Room % does not exist', new.roomno;
!     end if;
!     return new;
! end;
! $$ language plpgsql;
! create trigger tg_wslot_biu before insert or update
!     on WSlot for each row execute procedure tg_wslot_biu();
! -- ************************************************************
! -- * AFTER UPDATE on PField
! -- *	- Let PSlots of this field follow
! -- ************************************************************
! create function tg_pfield_au() returns trigger as '
! begin
!     if new.name != old.name then
!         update PSlot set pfname = new.name where pfname = old.name;
!     end if;
!     return new;
! end;
! ' language plpgsql;
! create trigger tg_pfield_au after update
!     on PField for each row execute procedure tg_pfield_au();
! -- ************************************************************
! -- * AFTER DELETE on PField
! -- *	- Remove all slots of this patchfield
! -- ************************************************************
! create function tg_pfield_ad() returns trigger as '
! begin
!     delete from PSlot where pfname = old.name;
!     return old;
! end;
! ' language plpgsql;
! create trigger tg_pfield_ad after delete
!     on PField for each row execute procedure tg_pfield_ad();
! -- ************************************************************
! -- * BEFORE INSERT or UPDATE on PSlot
! -- *	- Ensure that our patchfield does exist
! -- ************************************************************
! create function tg_pslot_biu() returns trigger as $proc$
! declare
!     pfrec	record;
!     ps          alias for new;
! begin
!     select into pfrec * from PField where name = ps.pfname;
!     if not found then
!         raise exception $$Patchfield "%" does not exist$$, ps.pfname;
!     end if;
!     return ps;
! end;
! $proc$ language plpgsql;
! create trigger tg_pslot_biu before insert or update
!     on PSlot for each row execute procedure tg_pslot_biu();
! -- ************************************************************
! -- * AFTER UPDATE on System
! -- *	- If system name changes let interfaces follow
! -- ************************************************************
! create function tg_system_au() returns trigger as '
! begin
!     if new.name != old.name then
!         update IFace set sysname = new.name where sysname = old.name;
!     end if;
!     return new;
! end;
! ' language plpgsql;
! create trigger tg_system_au after update
!     on System for each row execute procedure tg_system_au();
! -- ************************************************************
! -- * BEFORE INSERT or UPDATE on IFace
! -- *	- set the slotname to IF.sysname.ifname
! -- ************************************************************
! create function tg_iface_biu() returns trigger as $$
! declare
!     sname	text;
!     sysrec	record;
! begin
!     select into sysrec * from system where name = new.sysname;
!     if not found then
!         raise exception $q$system "%" does not exist$q$, new.sysname;
!     end if;
!     sname := 'IF.' || new.sysname;
!     sname := sname || '.';
!     sname := sname || new.ifname;
!     if length(sname) > 20 then
!         raise exception 'IFace slotname "%" too long (20 char max)', sname;
!     end if;
!     new.slotname := sname;
!     return new;
! end;
! $$ language plpgsql;
! create trigger tg_iface_biu before insert or update
!     on IFace for each row execute procedure tg_iface_biu();
! -- ************************************************************
! -- * AFTER INSERT or UPDATE or DELETE on Hub
! -- *	- insert/delete/rename slots as required
! -- ************************************************************
! create function tg_hub_a() returns trigger as '
! declare
!     hname	text;
!     dummy	integer;
! begin
!     if tg_op = ''INSERT'' then
! 	dummy := tg_hub_adjustslots(new.name, 0, new.nslots);
! 	return new;
!     end if;
!     if tg_op = ''UPDATE'' then
! 	if new.name != old.name then
! 	    update HSlot set hubname = new.name where hubname = old.name;
! 	end if;
! 	dummy := tg_hub_adjustslots(new.name, old.nslots, new.nslots);
! 	return new;
!     end if;
!     if tg_op = ''DELETE'' then
! 	dummy := tg_hub_adjustslots(old.name, old.nslots, 0);
! 	return old;
!     end if;
! end;
! ' language plpgsql;
! create trigger tg_hub_a after insert or update or delete
!     on Hub for each row execute procedure tg_hub_a();
! -- ************************************************************
! -- * Support function to add/remove slots of Hub
! -- ************************************************************
! create function tg_hub_adjustslots(hname bpchar,
!                                    oldnslots integer,
!                                    newnslots integer)
! returns integer as '
! begin
!     if newnslots = oldnslots then
!         return 0;
!     end if;
!     if newnslots < oldnslots then
!         delete from HSlot where hubname = hname and slotno > newnslots;
! 	return 0;
!     end if;
!     for i in oldnslots + 1 .. newnslots loop
!         insert into HSlot (slotname, hubname, slotno, slotlink)
! 		values (''HS.dummy'', hname, i, '''');
!     end loop;
!     return 0;
! end
! ' language plpgsql;
! -- Test comments
! COMMENT ON FUNCTION tg_hub_adjustslots_wrong(bpchar, integer, integer) IS 'function with args';
! ERROR:  function tg_hub_adjustslots_wrong(character, integer, integer) does not exist
! COMMENT ON FUNCTION tg_hub_adjustslots(bpchar, integer, integer) IS 'function with args';
! COMMENT ON FUNCTION tg_hub_adjustslots(bpchar, integer, integer) IS NULL;
! -- ************************************************************
! -- * BEFORE INSERT or UPDATE on HSlot
! -- *	- prevent from manual manipulation
! -- *	- set the slotname to HS.hubname.slotno
! -- ************************************************************
! create function tg_hslot_biu() returns trigger as '
! declare
!     sname	text;
!     xname	HSlot.slotname%TYPE;
!     hubrec	record;
! begin
!     select into hubrec * from Hub where name = new.hubname;
!     if not found then
!         raise exception ''no manual manipulation of HSlot'';
!     end if;
!     if new.slotno < 1 or new.slotno > hubrec.nslots then
!         raise exception ''no manual manipulation of HSlot'';
!     end if;
!     if tg_op = ''UPDATE'' and new.hubname != old.hubname then
! 	if count(*) > 0 from Hub where name = old.hubname then
! 	    raise exception ''no manual manipulation of HSlot'';
! 	end if;
!     end if;
!     sname := ''HS.'' || trim(new.hubname);
!     sname := sname || ''.'';
!     sname := sname || new.slotno::text;
!     if length(sname) > 20 then
!         raise exception ''HSlot slotname "%" too long (20 char max)'', sname;
!     end if;
!     new.slotname := sname;
!     return new;
! end;
! ' language plpgsql;
! create trigger tg_hslot_biu before insert or update
!     on HSlot for each row execute procedure tg_hslot_biu();
! -- ************************************************************
! -- * BEFORE DELETE on HSlot
! -- *	- prevent from manual manipulation
! -- ************************************************************
! create function tg_hslot_bd() returns trigger as '
! declare
!     hubrec	record;
! begin
!     select into hubrec * from Hub where name = old.hubname;
!     if not found then
!         return old;
!     end if;
!     if old.slotno > hubrec.nslots then
!         return old;
!     end if;
!     raise exception ''no manual manipulation of HSlot'';
! end;
! ' language plpgsql;
! create trigger tg_hslot_bd before delete
!     on HSlot for each row execute procedure tg_hslot_bd();
! -- ************************************************************
! -- * BEFORE INSERT on all slots
! -- *	- Check name prefix
! -- ************************************************************
! create function tg_chkslotname() returns trigger as '
! begin
!     if substr(new.slotname, 1, 2) != tg_argv[0] then
!         raise exception ''slotname must begin with %'', tg_argv[0];
!     end if;
!     return new;
! end;
! ' language plpgsql;
! create trigger tg_chkslotname before insert
!     on PSlot for each row execute procedure tg_chkslotname('PS');
! create trigger tg_chkslotname before insert
!     on WSlot for each row execute procedure tg_chkslotname('WS');
! create trigger tg_chkslotname before insert
!     on PLine for each row execute procedure tg_chkslotname('PL');
! create trigger tg_chkslotname before insert
!     on IFace for each row execute procedure tg_chkslotname('IF');
! create trigger tg_chkslotname before insert
!     on PHone for each row execute procedure tg_chkslotname('PH');
! -- ************************************************************
! -- * BEFORE INSERT or UPDATE on all slots with slotlink
! -- *	- Set slotlink to empty string if NULL value given
! -- ************************************************************
! create function tg_chkslotlink() returns trigger as '
! begin
!     if new.slotlink isnull then
!         new.slotlink := '''';
!     end if;
!     return new;
! end;
! ' language plpgsql;
! create trigger tg_chkslotlink before insert or update
!     on PSlot for each row execute procedure tg_chkslotlink();
! create trigger tg_chkslotlink before insert or update
!     on WSlot for each row execute procedure tg_chkslotlink();
! create trigger tg_chkslotlink before insert or update
!     on IFace for each row execute procedure tg_chkslotlink();
! create trigger tg_chkslotlink before insert or update
!     on HSlot for each row execute procedure tg_chkslotlink();
! create trigger tg_chkslotlink before insert or update
!     on PHone for each row execute procedure tg_chkslotlink();
! -- ************************************************************
! -- * BEFORE INSERT or UPDATE on all slots with backlink
! -- *	- Set backlink to empty string if NULL value given
! -- ************************************************************
! create function tg_chkbacklink() returns trigger as '
! begin
!     if new.backlink isnull then
!         new.backlink := '''';
!     end if;
!     return new;
! end;
! ' language plpgsql;
! create trigger tg_chkbacklink before insert or update
!     on PSlot for each row execute procedure tg_chkbacklink();
! create trigger tg_chkbacklink before insert or update
!     on WSlot for each row execute procedure tg_chkbacklink();
! create trigger tg_chkbacklink before insert or update
!     on PLine for each row execute procedure tg_chkbacklink();
! -- ************************************************************
! -- * BEFORE UPDATE on PSlot
! -- *	- do delete/insert instead of update if name changes
! -- ************************************************************
! create function tg_pslot_bu() returns trigger as '
! begin
!     if new.slotname != old.slotname then
!         delete from PSlot where slotname = old.slotname;
! 	insert into PSlot (
! 		    slotname,
! 		    pfname,
! 		    slotlink,
! 		    backlink
! 		) values (
! 		    new.slotname,
! 		    new.pfname,
! 		    new.slotlink,
! 		    new.backlink
! 		);
!         return null;
!     end if;
!     return new;
! end;
! ' language plpgsql;
! create trigger tg_pslot_bu before update
!     on PSlot for each row execute procedure tg_pslot_bu();
! -- ************************************************************
! -- * BEFORE UPDATE on WSlot
! -- *	- do delete/insert instead of update if name changes
! -- ************************************************************
! create function tg_wslot_bu() returns trigger as '
! begin
!     if new.slotname != old.slotname then
!         delete from WSlot where slotname = old.slotname;
! 	insert into WSlot (
! 		    slotname,
! 		    roomno,
! 		    slotlink,
! 		    backlink
! 		) values (
! 		    new.slotname,
! 		    new.roomno,
! 		    new.slotlink,
! 		    new.backlink
! 		);
!         return null;
!     end if;
!     return new;
! end;
! ' language plpgsql;
! create trigger tg_wslot_bu before update
!     on WSlot for each row execute procedure tg_Wslot_bu();
! -- ************************************************************
! -- * BEFORE UPDATE on PLine
! -- *	- do delete/insert instead of update if name changes
! -- ************************************************************
! create function tg_pline_bu() returns trigger as '
! begin
!     if new.slotname != old.slotname then
!         delete from PLine where slotname = old.slotname;
! 	insert into PLine (
! 		    slotname,
! 		    phonenumber,
! 		    comment,
! 		    backlink
! 		) values (
! 		    new.slotname,
! 		    new.phonenumber,
! 		    new.comment,
! 		    new.backlink
! 		);
!         return null;
!     end if;
!     return new;
! end;
! ' language plpgsql;
! create trigger tg_pline_bu before update
!     on PLine for each row execute procedure tg_pline_bu();
! -- ************************************************************
! -- * BEFORE UPDATE on IFace
! -- *	- do delete/insert instead of update if name changes
! -- ************************************************************
! create function tg_iface_bu() returns trigger as '
! begin
!     if new.slotname != old.slotname then
!         delete from IFace where slotname = old.slotname;
! 	insert into IFace (
! 		    slotname,
! 		    sysname,
! 		    ifname,
! 		    slotlink
! 		) values (
! 		    new.slotname,
! 		    new.sysname,
! 		    new.ifname,
! 		    new.slotlink
! 		);
!         return null;
!     end if;
!     return new;
! end;
! ' language plpgsql;
! create trigger tg_iface_bu before update
!     on IFace for each row execute procedure tg_iface_bu();
! -- ************************************************************
! -- * BEFORE UPDATE on HSlot
! -- *	- do delete/insert instead of update if name changes
! -- ************************************************************
! create function tg_hslot_bu() returns trigger as '
! begin
!     if new.slotname != old.slotname or new.hubname != old.hubname then
!         delete from HSlot where slotname = old.slotname;
! 	insert into HSlot (
! 		    slotname,
! 		    hubname,
! 		    slotno,
! 		    slotlink
! 		) values (
! 		    new.slotname,
! 		    new.hubname,
! 		    new.slotno,
! 		    new.slotlink
! 		);
!         return null;
!     end if;
!     return new;
! end;
! ' language plpgsql;
! create trigger tg_hslot_bu before update
!     on HSlot for each row execute procedure tg_hslot_bu();
! -- ************************************************************
! -- * BEFORE UPDATE on PHone
! -- *	- do delete/insert instead of update if name changes
! -- ************************************************************
! create function tg_phone_bu() returns trigger as '
! begin
!     if new.slotname != old.slotname then
!         delete from PHone where slotname = old.slotname;
! 	insert into PHone (
! 		    slotname,
! 		    comment,
! 		    slotlink
! 		) values (
! 		    new.slotname,
! 		    new.comment,
! 		    new.slotlink
! 		);
!         return null;
!     end if;
!     return new;
! end;
! ' language plpgsql;
! create trigger tg_phone_bu before update
!     on PHone for each row execute procedure tg_phone_bu();
! -- ************************************************************
! -- * AFTER INSERT or UPDATE or DELETE on slot with backlink
! -- *	- Ensure that the opponent correctly points back to us
! -- ************************************************************
! create function tg_backlink_a() returns trigger as '
! declare
!     dummy	integer;
! begin
!     if tg_op = ''INSERT'' then
!         if new.backlink != '''' then
! 	    dummy := tg_backlink_set(new.backlink, new.slotname);
! 	end if;
! 	return new;
!     end if;
!     if tg_op = ''UPDATE'' then
!         if new.backlink != old.backlink then
! 	    if old.backlink != '''' then
! 	        dummy := tg_backlink_unset(old.backlink, old.slotname);
! 	    end if;
! 	    if new.backlink != '''' then
! 	        dummy := tg_backlink_set(new.backlink, new.slotname);
! 	    end if;
! 	else
! 	    if new.slotname != old.slotname and new.backlink != '''' then
! 	        dummy := tg_slotlink_set(new.backlink, new.slotname);
! 	    end if;
! 	end if;
! 	return new;
!     end if;
!     if tg_op = ''DELETE'' then
!         if old.backlink != '''' then
! 	    dummy := tg_backlink_unset(old.backlink, old.slotname);
! 	end if;
! 	return old;
!     end if;
! end;
! ' language plpgsql;
! create trigger tg_backlink_a after insert or update or delete
!     on PSlot for each row execute procedure tg_backlink_a('PS');
! create trigger tg_backlink_a after insert or update or delete
!     on WSlot for each row execute procedure tg_backlink_a('WS');
! create trigger tg_backlink_a after insert or update or delete
!     on PLine for each row execute procedure tg_backlink_a('PL');
! -- ************************************************************
! -- * Support function to set the opponents backlink field
! -- * if it does not already point to the requested slot
! -- ************************************************************
! create function tg_backlink_set(myname bpchar, blname bpchar)
! returns integer as '
! declare
!     mytype	char(2);
!     link	char(4);
!     rec		record;
! begin
!     mytype := substr(myname, 1, 2);
!     link := mytype || substr(blname, 1, 2);
!     if link = ''PLPL'' then
!         raise exception
! 		''backlink between two phone lines does not make sense'';
!     end if;
!     if link in (''PLWS'', ''WSPL'') then
!         raise exception
! 		''direct link of phone line to wall slot not permitted'';
!     end if;
!     if mytype = ''PS'' then
!         select into rec * from PSlot where slotname = myname;
! 	if not found then
! 	    raise exception ''% does not exist'', myname;
! 	end if;
! 	if rec.backlink != blname then
! 	    update PSlot set backlink = blname where slotname = myname;
! 	end if;
! 	return 0;
!     end if;
!     if mytype = ''WS'' then
!         select into rec * from WSlot where slotname = myname;
! 	if not found then
! 	    raise exception ''% does not exist'', myname;
! 	end if;
! 	if rec.backlink != blname then
! 	    update WSlot set backlink = blname where slotname = myname;
! 	end if;
! 	return 0;
!     end if;
!     if mytype = ''PL'' then
!         select into rec * from PLine where slotname = myname;
! 	if not found then
! 	    raise exception ''% does not exist'', myname;
! 	end if;
! 	if rec.backlink != blname then
! 	    update PLine set backlink = blname where slotname = myname;
! 	end if;
! 	return 0;
!     end if;
!     raise exception ''illegal backlink beginning with %'', mytype;
! end;
! ' language plpgsql;
! -- ************************************************************
! -- * Support function to clear out the backlink field if
! -- * it still points to specific slot
! -- ************************************************************
! create function tg_backlink_unset(bpchar, bpchar)
! returns integer as '
! declare
!     myname	alias for $1;
!     blname	alias for $2;
!     mytype	char(2);
!     rec		record;
! begin
!     mytype := substr(myname, 1, 2);
!     if mytype = ''PS'' then
!         select into rec * from PSlot where slotname = myname;
! 	if not found then
! 	    return 0;
! 	end if;
! 	if rec.backlink = blname then
! 	    update PSlot set backlink = '''' where slotname = myname;
! 	end if;
! 	return 0;
!     end if;
!     if mytype = ''WS'' then
!         select into rec * from WSlot where slotname = myname;
! 	if not found then
! 	    return 0;
! 	end if;
! 	if rec.backlink = blname then
! 	    update WSlot set backlink = '''' where slotname = myname;
! 	end if;
! 	return 0;
!     end if;
!     if mytype = ''PL'' then
!         select into rec * from PLine where slotname = myname;
! 	if not found then
! 	    return 0;
! 	end if;
! 	if rec.backlink = blname then
! 	    update PLine set backlink = '''' where slotname = myname;
! 	end if;
! 	return 0;
!     end if;
! end
! ' language plpgsql;
! -- ************************************************************
! -- * AFTER INSERT or UPDATE or DELETE on slot with slotlink
! -- *	- Ensure that the opponent correctly points back to us
! -- ************************************************************
! create function tg_slotlink_a() returns trigger as '
! declare
!     dummy	integer;
! begin
!     if tg_op = ''INSERT'' then
!         if new.slotlink != '''' then
! 	    dummy := tg_slotlink_set(new.slotlink, new.slotname);
! 	end if;
! 	return new;
!     end if;
!     if tg_op = ''UPDATE'' then
!         if new.slotlink != old.slotlink then
! 	    if old.slotlink != '''' then
! 	        dummy := tg_slotlink_unset(old.slotlink, old.slotname);
! 	    end if;
! 	    if new.slotlink != '''' then
! 	        dummy := tg_slotlink_set(new.slotlink, new.slotname);
! 	    end if;
! 	else
! 	    if new.slotname != old.slotname and new.slotlink != '''' then
! 	        dummy := tg_slotlink_set(new.slotlink, new.slotname);
! 	    end if;
! 	end if;
! 	return new;
!     end if;
!     if tg_op = ''DELETE'' then
!         if old.slotlink != '''' then
! 	    dummy := tg_slotlink_unset(old.slotlink, old.slotname);
! 	end if;
! 	return old;
!     end if;
! end;
! ' language plpgsql;
! create trigger tg_slotlink_a after insert or update or delete
!     on PSlot for each row execute procedure tg_slotlink_a('PS');
! create trigger tg_slotlink_a after insert or update or delete
!     on WSlot for each row execute procedure tg_slotlink_a('WS');
! create trigger tg_slotlink_a after insert or update or delete
!     on IFace for each row execute procedure tg_slotlink_a('IF');
! create trigger tg_slotlink_a after insert or update or delete
!     on HSlot for each row execute procedure tg_slotlink_a('HS');
! create trigger tg_slotlink_a after insert or update or delete
!     on PHone for each row execute procedure tg_slotlink_a('PH');
! -- ************************************************************
! -- * Support function to set the opponents slotlink field
! -- * if it does not already point to the requested slot
! -- ************************************************************
! create function tg_slotlink_set(bpchar, bpchar)
! returns integer as '
! declare
!     myname	alias for $1;
!     blname	alias for $2;
!     mytype	char(2);
!     link	char(4);
!     rec		record;
! begin
!     mytype := substr(myname, 1, 2);
!     link := mytype || substr(blname, 1, 2);
!     if link = ''PHPH'' then
!         raise exception
! 		''slotlink between two phones does not make sense'';
!     end if;
!     if link in (''PHHS'', ''HSPH'') then
!         raise exception
! 		''link of phone to hub does not make sense'';
!     end if;
!     if link in (''PHIF'', ''IFPH'') then
!         raise exception
! 		''link of phone to hub does not make sense'';
!     end if;
!     if link in (''PSWS'', ''WSPS'') then
!         raise exception
! 		''slotlink from patchslot to wallslot not permitted'';
!     end if;
!     if mytype = ''PS'' then
!         select into rec * from PSlot where slotname = myname;
! 	if not found then
! 	    raise exception ''% does not exist'', myname;
! 	end if;
! 	if rec.slotlink != blname then
! 	    update PSlot set slotlink = blname where slotname = myname;
! 	end if;
! 	return 0;
!     end if;
!     if mytype = ''WS'' then
!         select into rec * from WSlot where slotname = myname;
! 	if not found then
! 	    raise exception ''% does not exist'', myname;
! 	end if;
! 	if rec.slotlink != blname then
! 	    update WSlot set slotlink = blname where slotname = myname;
! 	end if;
! 	return 0;
!     end if;
!     if mytype = ''IF'' then
!         select into rec * from IFace where slotname = myname;
! 	if not found then
! 	    raise exception ''% does not exist'', myname;
! 	end if;
! 	if rec.slotlink != blname then
! 	    update IFace set slotlink = blname where slotname = myname;
! 	end if;
! 	return 0;
!     end if;
!     if mytype = ''HS'' then
!         select into rec * from HSlot where slotname = myname;
! 	if not found then
! 	    raise exception ''% does not exist'', myname;
! 	end if;
! 	if rec.slotlink != blname then
! 	    update HSlot set slotlink = blname where slotname = myname;
! 	end if;
! 	return 0;
!     end if;
!     if mytype = ''PH'' then
!         select into rec * from PHone where slotname = myname;
! 	if not found then
! 	    raise exception ''% does not exist'', myname;
! 	end if;
! 	if rec.slotlink != blname then
! 	    update PHone set slotlink = blname where slotname = myname;
! 	end if;
! 	return 0;
!     end if;
!     raise exception ''illegal slotlink beginning with %'', mytype;
! end;
! ' language plpgsql;
! -- ************************************************************
! -- * Support function to clear out the slotlink field if
! -- * it still points to specific slot
! -- ************************************************************
! create function tg_slotlink_unset(bpchar, bpchar)
! returns integer as '
! declare
!     myname	alias for $1;
!     blname	alias for $2;
!     mytype	char(2);
!     rec		record;
! begin
!     mytype := substr(myname, 1, 2);
!     if mytype = ''PS'' then
!         select into rec * from PSlot where slotname = myname;
! 	if not found then
! 	    return 0;
! 	end if;
! 	if rec.slotlink = blname then
! 	    update PSlot set slotlink = '''' where slotname = myname;
! 	end if;
! 	return 0;
!     end if;
!     if mytype = ''WS'' then
!         select into rec * from WSlot where slotname = myname;
! 	if not found then
! 	    return 0;
! 	end if;
! 	if rec.slotlink = blname then
! 	    update WSlot set slotlink = '''' where slotname = myname;
! 	end if;
! 	return 0;
!     end if;
!     if mytype = ''IF'' then
!         select into rec * from IFace where slotname = myname;
! 	if not found then
! 	    return 0;
! 	end if;
! 	if rec.slotlink = blname then
! 	    update IFace set slotlink = '''' where slotname = myname;
! 	end if;
! 	return 0;
!     end if;
!     if mytype = ''HS'' then
!         select into rec * from HSlot where slotname = myname;
! 	if not found then
! 	    return 0;
! 	end if;
! 	if rec.slotlink = blname then
! 	    update HSlot set slotlink = '''' where slotname = myname;
! 	end if;
! 	return 0;
!     end if;
!     if mytype = ''PH'' then
!         select into rec * from PHone where slotname = myname;
! 	if not found then
! 	    return 0;
! 	end if;
! 	if rec.slotlink = blname then
! 	    update PHone set slotlink = '''' where slotname = myname;
! 	end if;
! 	return 0;
!     end if;
! end;
! ' language plpgsql;
! -- ************************************************************
! -- * Describe the backside of a patchfield slot
! -- ************************************************************
! create function pslot_backlink_view(bpchar)
! returns text as '
! <<outer>>
! declare
!     rec		record;
!     bltype	char(2);
!     retval	text;
! begin
!     select into rec * from PSlot where slotname = $1;
!     if not found then
!         return '''';
!     end if;
!     if rec.backlink = '''' then
!         return ''-'';
!     end if;
!     bltype := substr(rec.backlink, 1, 2);
!     if bltype = ''PL'' then
!         declare
! 	    rec		record;
! 	begin
! 	    select into rec * from PLine where slotname = "outer".rec.backlink;
! 	    retval := ''Phone line '' || trim(rec.phonenumber);
! 	    if rec.comment != '''' then
! 	        retval := retval || '' ('';
! 		retval := retval || rec.comment;
! 		retval := retval || '')'';
! 	    end if;
! 	    return retval;
! 	end;
!     end if;
!     if bltype = ''WS'' then
!         select into rec * from WSlot where slotname = rec.backlink;
! 	retval := trim(rec.slotname) || '' in room '';
! 	retval := retval || trim(rec.roomno);
! 	retval := retval || '' -> '';
! 	return retval || wslot_slotlink_view(rec.slotname);
!     end if;
!     return rec.backlink;
! end;
! ' language plpgsql;
! -- ************************************************************
! -- * Describe the front of a patchfield slot
! -- ************************************************************
! create function pslot_slotlink_view(bpchar)
! returns text as '
! declare
!     psrec	record;
!     sltype	char(2);
!     retval	text;
! begin
!     select into psrec * from PSlot where slotname = $1;
!     if not found then
!         return '''';
!     end if;
!     if psrec.slotlink = '''' then
!         return ''-'';
!     end if;
!     sltype := substr(psrec.slotlink, 1, 2);
!     if sltype = ''PS'' then
! 	retval := trim(psrec.slotlink) || '' -> '';
! 	return retval || pslot_backlink_view(psrec.slotlink);
!     end if;
!     if sltype = ''HS'' then
!         retval := comment from Hub H, HSlot HS
! 			where HS.slotname = psrec.slotlink
! 			  and H.name = HS.hubname;
!         retval := retval || '' slot '';
! 	retval := retval || slotno::text from HSlot
! 			where slotname = psrec.slotlink;
! 	return retval;
!     end if;
!     return psrec.slotlink;
! end;
! ' language plpgsql;
! -- ************************************************************
! -- * Describe the front of a wall connector slot
! -- ************************************************************
! create function wslot_slotlink_view(bpchar)
! returns text as '
! declare
!     rec		record;
!     sltype	char(2);
!     retval	text;
! begin
!     select into rec * from WSlot where slotname = $1;
!     if not found then
!         return '''';
!     end if;
!     if rec.slotlink = '''' then
!         return ''-'';
!     end if;
!     sltype := substr(rec.slotlink, 1, 2);
!     if sltype = ''PH'' then
!         select into rec * from PHone where slotname = rec.slotlink;
! 	retval := ''Phone '' || trim(rec.slotname);
! 	if rec.comment != '''' then
! 	    retval := retval || '' ('';
! 	    retval := retval || rec.comment;
! 	    retval := retval || '')'';
! 	end if;
! 	return retval;
!     end if;
!     if sltype = ''IF'' then
! 	declare
! 	    syrow	System%RowType;
! 	    ifrow	IFace%ROWTYPE;
!         begin
! 	    select into ifrow * from IFace where slotname = rec.slotlink;
! 	    select into syrow * from System where name = ifrow.sysname;
! 	    retval := syrow.name || '' IF '';
! 	    retval := retval || ifrow.ifname;
! 	    if syrow.comment != '''' then
! 	        retval := retval || '' ('';
! 		retval := retval || syrow.comment;
! 		retval := retval || '')'';
! 	    end if;
! 	    return retval;
! 	end;
!     end if;
!     return rec.slotlink;
! end;
! ' language plpgsql;
! -- ************************************************************
! -- * View of a patchfield describing backside and patches
! -- ************************************************************
! create view Pfield_v1 as select PF.pfname, PF.slotname,
! 	pslot_backlink_view(PF.slotname) as backside,
! 	pslot_slotlink_view(PF.slotname) as patch
!     from PSlot PF;
! --
! -- First we build the house - so we create the rooms
! --
! insert into Room values ('001', 'Entrance');
! insert into Room values ('002', 'Office');
! insert into Room values ('003', 'Office');
! insert into Room values ('004', 'Technical');
! insert into Room values ('101', 'Office');
! insert into Room values ('102', 'Conference');
! insert into Room values ('103', 'Restroom');
! insert into Room values ('104', 'Technical');
! insert into Room values ('105', 'Office');
! insert into Room values ('106', 'Office');
! --
! -- Second we install the wall connectors
! --
! insert into WSlot values ('WS.001.1a', '001', '', '');
! insert into WSlot values ('WS.001.1b', '001', '', '');
! insert into WSlot values ('WS.001.2a', '001', '', '');
! insert into WSlot values ('WS.001.2b', '001', '', '');
! insert into WSlot values ('WS.001.3a', '001', '', '');
! insert into WSlot values ('WS.001.3b', '001', '', '');
! insert into WSlot values ('WS.002.1a', '002', '', '');
! insert into WSlot values ('WS.002.1b', '002', '', '');
! insert into WSlot values ('WS.002.2a', '002', '', '');
! insert into WSlot values ('WS.002.2b', '002', '', '');
! insert into WSlot values ('WS.002.3a', '002', '', '');
! insert into WSlot values ('WS.002.3b', '002', '', '');
! insert into WSlot values ('WS.003.1a', '003', '', '');
! insert into WSlot values ('WS.003.1b', '003', '', '');
! insert into WSlot values ('WS.003.2a', '003', '', '');
! insert into WSlot values ('WS.003.2b', '003', '', '');
! insert into WSlot values ('WS.003.3a', '003', '', '');
! insert into WSlot values ('WS.003.3b', '003', '', '');
! insert into WSlot values ('WS.101.1a', '101', '', '');
! insert into WSlot values ('WS.101.1b', '101', '', '');
! insert into WSlot values ('WS.101.2a', '101', '', '');
! insert into WSlot values ('WS.101.2b', '101', '', '');
! insert into WSlot values ('WS.101.3a', '101', '', '');
! insert into WSlot values ('WS.101.3b', '101', '', '');
! insert into WSlot values ('WS.102.1a', '102', '', '');
! insert into WSlot values ('WS.102.1b', '102', '', '');
! insert into WSlot values ('WS.102.2a', '102', '', '');
! insert into WSlot values ('WS.102.2b', '102', '', '');
! insert into WSlot values ('WS.102.3a', '102', '', '');
! insert into WSlot values ('WS.102.3b', '102', '', '');
! insert into WSlot values ('WS.105.1a', '105', '', '');
! insert into WSlot values ('WS.105.1b', '105', '', '');
! insert into WSlot values ('WS.105.2a', '105', '', '');
! insert into WSlot values ('WS.105.2b', '105', '', '');
! insert into WSlot values ('WS.105.3a', '105', '', '');
! insert into WSlot values ('WS.105.3b', '105', '', '');
! insert into WSlot values ('WS.106.1a', '106', '', '');
! insert into WSlot values ('WS.106.1b', '106', '', '');
! insert into WSlot values ('WS.106.2a', '106', '', '');
! insert into WSlot values ('WS.106.2b', '106', '', '');
! insert into WSlot values ('WS.106.3a', '106', '', '');
! insert into WSlot values ('WS.106.3b', '106', '', '');
! --
! -- Now create the patch fields and their slots
! --
! insert into PField values ('PF0_1', 'Wallslots basement');
! --
! -- The cables for these will be made later, so they are unconnected for now
! --
! insert into PSlot values ('PS.base.a1', 'PF0_1', '', '');
! insert into PSlot values ('PS.base.a2', 'PF0_1', '', '');
! insert into PSlot values ('PS.base.a3', 'PF0_1', '', '');
! insert into PSlot values ('PS.base.a4', 'PF0_1', '', '');
! insert into PSlot values ('PS.base.a5', 'PF0_1', '', '');
! insert into PSlot values ('PS.base.a6', 'PF0_1', '', '');
! --
! -- These are already wired to the wall connectors
! --
! insert into PSlot values ('PS.base.b1', 'PF0_1', '', 'WS.002.1a');
! insert into PSlot values ('PS.base.b2', 'PF0_1', '', 'WS.002.1b');
! insert into PSlot values ('PS.base.b3', 'PF0_1', '', 'WS.002.2a');
! insert into PSlot values ('PS.base.b4', 'PF0_1', '', 'WS.002.2b');
! insert into PSlot values ('PS.base.b5', 'PF0_1', '', 'WS.002.3a');
! insert into PSlot values ('PS.base.b6', 'PF0_1', '', 'WS.002.3b');
! insert into PSlot values ('PS.base.c1', 'PF0_1', '', 'WS.003.1a');
! insert into PSlot values ('PS.base.c2', 'PF0_1', '', 'WS.003.1b');
! insert into PSlot values ('PS.base.c3', 'PF0_1', '', 'WS.003.2a');
! insert into PSlot values ('PS.base.c4', 'PF0_1', '', 'WS.003.2b');
! insert into PSlot values ('PS.base.c5', 'PF0_1', '', 'WS.003.3a');
! insert into PSlot values ('PS.base.c6', 'PF0_1', '', 'WS.003.3b');
! --
! -- This patchfield will be renamed later into PF0_2 - so its
! -- slots references in pfname should follow
! --
! insert into PField values ('PF0_X', 'Phonelines basement');
! insert into PSlot values ('PS.base.ta1', 'PF0_X', '', '');
! insert into PSlot values ('PS.base.ta2', 'PF0_X', '', '');
! insert into PSlot values ('PS.base.ta3', 'PF0_X', '', '');
! insert into PSlot values ('PS.base.ta4', 'PF0_X', '', '');
! insert into PSlot values ('PS.base.ta5', 'PF0_X', '', '');
! insert into PSlot values ('PS.base.ta6', 'PF0_X', '', '');
! insert into PSlot values ('PS.base.tb1', 'PF0_X', '', '');
! insert into PSlot values ('PS.base.tb2', 'PF0_X', '', '');
! insert into PSlot values ('PS.base.tb3', 'PF0_X', '', '');
! insert into PSlot values ('PS.base.tb4', 'PF0_X', '', '');
! insert into PSlot values ('PS.base.tb5', 'PF0_X', '', '');
! insert into PSlot values ('PS.base.tb6', 'PF0_X', '', '');
! insert into PField values ('PF1_1', 'Wallslots first floor');
! insert into PSlot values ('PS.first.a1', 'PF1_1', '', 'WS.101.1a');
! insert into PSlot values ('PS.first.a2', 'PF1_1', '', 'WS.101.1b');
! insert into PSlot values ('PS.first.a3', 'PF1_1', '', 'WS.101.2a');
! insert into PSlot values ('PS.first.a4', 'PF1_1', '', 'WS.101.2b');
! insert into PSlot values ('PS.first.a5', 'PF1_1', '', 'WS.101.3a');
! insert into PSlot values ('PS.first.a6', 'PF1_1', '', 'WS.101.3b');
! insert into PSlot values ('PS.first.b1', 'PF1_1', '', 'WS.102.1a');
! insert into PSlot values ('PS.first.b2', 'PF1_1', '', 'WS.102.1b');
! insert into PSlot values ('PS.first.b3', 'PF1_1', '', 'WS.102.2a');
! insert into PSlot values ('PS.first.b4', 'PF1_1', '', 'WS.102.2b');
! insert into PSlot values ('PS.first.b5', 'PF1_1', '', 'WS.102.3a');
! insert into PSlot values ('PS.first.b6', 'PF1_1', '', 'WS.102.3b');
! insert into PSlot values ('PS.first.c1', 'PF1_1', '', 'WS.105.1a');
! insert into PSlot values ('PS.first.c2', 'PF1_1', '', 'WS.105.1b');
! insert into PSlot values ('PS.first.c3', 'PF1_1', '', 'WS.105.2a');
! insert into PSlot values ('PS.first.c4', 'PF1_1', '', 'WS.105.2b');
! insert into PSlot values ('PS.first.c5', 'PF1_1', '', 'WS.105.3a');
! insert into PSlot values ('PS.first.c6', 'PF1_1', '', 'WS.105.3b');
! insert into PSlot values ('PS.first.d1', 'PF1_1', '', 'WS.106.1a');
! insert into PSlot values ('PS.first.d2', 'PF1_1', '', 'WS.106.1b');
! insert into PSlot values ('PS.first.d3', 'PF1_1', '', 'WS.106.2a');
! insert into PSlot values ('PS.first.d4', 'PF1_1', '', 'WS.106.2b');
! insert into PSlot values ('PS.first.d5', 'PF1_1', '', 'WS.106.3a');
! insert into PSlot values ('PS.first.d6', 'PF1_1', '', 'WS.106.3b');
! --
! -- Now we wire the wall connectors 1a-2a in room 001 to the
! -- patchfield. In the second update we make an error, and
! -- correct it after
! --
! update PSlot set backlink = 'WS.001.1a' where slotname = 'PS.base.a1';
! update PSlot set backlink = 'WS.001.1b' where slotname = 'PS.base.a3';
! select * from WSlot where roomno = '001' order by slotname;
!        slotname       |  roomno  |       slotlink       |       backlink       
! ----------------------+----------+----------------------+----------------------
!  WS.001.1a            | 001      |                      | PS.base.a1          
!  WS.001.1b            | 001      |                      | PS.base.a3          
!  WS.001.2a            | 001      |                      |                     
!  WS.001.2b            | 001      |                      |                     
!  WS.001.3a            | 001      |                      |                     
!  WS.001.3b            | 001      |                      |                     
! (6 rows)
! 
! select * from PSlot where slotname ~ 'PS.base.a' order by slotname;
!        slotname       | pfname |       slotlink       |       backlink       
! ----------------------+--------+----------------------+----------------------
!  PS.base.a1           | PF0_1  |                      | WS.001.1a           
!  PS.base.a2           | PF0_1  |                      |                     
!  PS.base.a3           | PF0_1  |                      | WS.001.1b           
!  PS.base.a4           | PF0_1  |                      |                     
!  PS.base.a5           | PF0_1  |                      |                     
!  PS.base.a6           | PF0_1  |                      |                     
! (6 rows)
! 
! update PSlot set backlink = 'WS.001.2a' where slotname = 'PS.base.a3';
! select * from WSlot where roomno = '001' order by slotname;
!        slotname       |  roomno  |       slotlink       |       backlink       
! ----------------------+----------+----------------------+----------------------
!  WS.001.1a            | 001      |                      | PS.base.a1          
!  WS.001.1b            | 001      |                      |                     
!  WS.001.2a            | 001      |                      | PS.base.a3          
!  WS.001.2b            | 001      |                      |                     
!  WS.001.3a            | 001      |                      |                     
!  WS.001.3b            | 001      |                      |                     
! (6 rows)
! 
! select * from PSlot where slotname ~ 'PS.base.a' order by slotname;
!        slotname       | pfname |       slotlink       |       backlink       
! ----------------------+--------+----------------------+----------------------
!  PS.base.a1           | PF0_1  |                      | WS.001.1a           
!  PS.base.a2           | PF0_1  |                      |                     
!  PS.base.a3           | PF0_1  |                      | WS.001.2a           
!  PS.base.a4           | PF0_1  |                      |                     
!  PS.base.a5           | PF0_1  |                      |                     
!  PS.base.a6           | PF0_1  |                      |                     
! (6 rows)
! 
! update PSlot set backlink = 'WS.001.1b' where slotname = 'PS.base.a2';
! select * from WSlot where roomno = '001' order by slotname;
!        slotname       |  roomno  |       slotlink       |       backlink       
! ----------------------+----------+----------------------+----------------------
!  WS.001.1a            | 001      |                      | PS.base.a1          
!  WS.001.1b            | 001      |                      | PS.base.a2          
!  WS.001.2a            | 001      |                      | PS.base.a3          
!  WS.001.2b            | 001      |                      |                     
!  WS.001.3a            | 001      |                      |                     
!  WS.001.3b            | 001      |                      |                     
! (6 rows)
! 
! select * from PSlot where slotname ~ 'PS.base.a' order by slotname;
!        slotname       | pfname |       slotlink       |       backlink       
! ----------------------+--------+----------------------+----------------------
!  PS.base.a1           | PF0_1  |                      | WS.001.1a           
!  PS.base.a2           | PF0_1  |                      | WS.001.1b           
!  PS.base.a3           | PF0_1  |                      | WS.001.2a           
!  PS.base.a4           | PF0_1  |                      |                     
!  PS.base.a5           | PF0_1  |                      |                     
!  PS.base.a6           | PF0_1  |                      |                     
! (6 rows)
! 
! --
! -- Same procedure for 2b-3b but this time updating the WSlot instead
! -- of the PSlot. Due to the triggers the result is the same:
! -- WSlot and corresponding PSlot point to each other.
! --
! update WSlot set backlink = 'PS.base.a4' where slotname = 'WS.001.2b';
! update WSlot set backlink = 'PS.base.a6' where slotname = 'WS.001.3a';
! select * from WSlot where roomno = '001' order by slotname;
!        slotname       |  roomno  |       slotlink       |       backlink       
! ----------------------+----------+----------------------+----------------------
!  WS.001.1a            | 001      |                      | PS.base.a1          
!  WS.001.1b            | 001      |                      | PS.base.a2          
!  WS.001.2a            | 001      |                      | PS.base.a3          
!  WS.001.2b            | 001      |                      | PS.base.a4          
!  WS.001.3a            | 001      |                      | PS.base.a6          
!  WS.001.3b            | 001      |                      |                     
! (6 rows)
! 
! select * from PSlot where slotname ~ 'PS.base.a' order by slotname;
!        slotname       | pfname |       slotlink       |       backlink       
! ----------------------+--------+----------------------+----------------------
!  PS.base.a1           | PF0_1  |                      | WS.001.1a           
!  PS.base.a2           | PF0_1  |                      | WS.001.1b           
!  PS.base.a3           | PF0_1  |                      | WS.001.2a           
!  PS.base.a4           | PF0_1  |                      | WS.001.2b           
!  PS.base.a5           | PF0_1  |                      |                     
!  PS.base.a6           | PF0_1  |                      | WS.001.3a           
! (6 rows)
! 
! update WSlot set backlink = 'PS.base.a6' where slotname = 'WS.001.3b';
! select * from WSlot where roomno = '001' order by slotname;
!        slotname       |  roomno  |       slotlink       |       backlink       
! ----------------------+----------+----------------------+----------------------
!  WS.001.1a            | 001      |                      | PS.base.a1          
!  WS.001.1b            | 001      |                      | PS.base.a2          
!  WS.001.2a            | 001      |                      | PS.base.a3          
!  WS.001.2b            | 001      |                      | PS.base.a4          
!  WS.001.3a            | 001      |                      |                     
!  WS.001.3b            | 001      |                      | PS.base.a6          
! (6 rows)
! 
! select * from PSlot where slotname ~ 'PS.base.a' order by slotname;
!        slotname       | pfname |       slotlink       |       backlink       
! ----------------------+--------+----------------------+----------------------
!  PS.base.a1           | PF0_1  |                      | WS.001.1a           
!  PS.base.a2           | PF0_1  |                      | WS.001.1b           
!  PS.base.a3           | PF0_1  |                      | WS.001.2a           
!  PS.base.a4           | PF0_1  |                      | WS.001.2b           
!  PS.base.a5           | PF0_1  |                      |                     
!  PS.base.a6           | PF0_1  |                      | WS.001.3b           
! (6 rows)
! 
! update WSlot set backlink = 'PS.base.a5' where slotname = 'WS.001.3a';
! select * from WSlot where roomno = '001' order by slotname;
!        slotname       |  roomno  |       slotlink       |       backlink       
! ----------------------+----------+----------------------+----------------------
!  WS.001.1a            | 001      |                      | PS.base.a1          
!  WS.001.1b            | 001      |                      | PS.base.a2          
!  WS.001.2a            | 001      |                      | PS.base.a3          
!  WS.001.2b            | 001      |                      | PS.base.a4          
!  WS.001.3a            | 001      |                      | PS.base.a5          
!  WS.001.3b            | 001      |                      | PS.base.a6          
! (6 rows)
! 
! select * from PSlot where slotname ~ 'PS.base.a' order by slotname;
!        slotname       | pfname |       slotlink       |       backlink       
! ----------------------+--------+----------------------+----------------------
!  PS.base.a1           | PF0_1  |                      | WS.001.1a           
!  PS.base.a2           | PF0_1  |                      | WS.001.1b           
!  PS.base.a3           | PF0_1  |                      | WS.001.2a           
!  PS.base.a4           | PF0_1  |                      | WS.001.2b           
!  PS.base.a5           | PF0_1  |                      | WS.001.3a           
!  PS.base.a6           | PF0_1  |                      | WS.001.3b           
! (6 rows)
! 
! insert into PField values ('PF1_2', 'Phonelines first floor');
! insert into PSlot values ('PS.first.ta1', 'PF1_2', '', '');
! insert into PSlot values ('PS.first.ta2', 'PF1_2', '', '');
! insert into PSlot values ('PS.first.ta3', 'PF1_2', '', '');
! insert into PSlot values ('PS.first.ta4', 'PF1_2', '', '');
! insert into PSlot values ('PS.first.ta5', 'PF1_2', '', '');
! insert into PSlot values ('PS.first.ta6', 'PF1_2', '', '');
! insert into PSlot values ('PS.first.tb1', 'PF1_2', '', '');
! insert into PSlot values ('PS.first.tb2', 'PF1_2', '', '');
! insert into PSlot values ('PS.first.tb3', 'PF1_2', '', '');
! insert into PSlot values ('PS.first.tb4', 'PF1_2', '', '');
! insert into PSlot values ('PS.first.tb5', 'PF1_2', '', '');
! insert into PSlot values ('PS.first.tb6', 'PF1_2', '', '');
! --
! -- Fix the wrong name for patchfield PF0_2
! --
! update PField set name = 'PF0_2' where name = 'PF0_X';
! select * from PSlot order by slotname;
!        slotname       | pfname |       slotlink       |       backlink       
! ----------------------+--------+----------------------+----------------------
!  PS.base.a1           | PF0_1  |                      | WS.001.1a           
!  PS.base.a2           | PF0_1  |                      | WS.001.1b           
!  PS.base.a3           | PF0_1  |                      | WS.001.2a           
!  PS.base.a4           | PF0_1  |                      | WS.001.2b           
!  PS.base.a5           | PF0_1  |                      | WS.001.3a           
!  PS.base.a6           | PF0_1  |                      | WS.001.3b           
!  PS.base.b1           | PF0_1  |                      | WS.002.1a           
!  PS.base.b2           | PF0_1  |                      | WS.002.1b           
!  PS.base.b3           | PF0_1  |                      | WS.002.2a           
!  PS.base.b4           | PF0_1  |                      | WS.002.2b           
!  PS.base.b5           | PF0_1  |                      | WS.002.3a           
!  PS.base.b6           | PF0_1  |                      | WS.002.3b           
!  PS.base.c1           | PF0_1  |                      | WS.003.1a           
!  PS.base.c2           | PF0_1  |                      | WS.003.1b           
!  PS.base.c3           | PF0_1  |                      | WS.003.2a           
!  PS.base.c4           | PF0_1  |                      | WS.003.2b           
!  PS.base.c5           | PF0_1  |                      | WS.003.3a           
!  PS.base.c6           | PF0_1  |                      | WS.003.3b           
!  PS.base.ta1          | PF0_2  |                      |                     
!  PS.base.ta2          | PF0_2  |                      |                     
!  PS.base.ta3          | PF0_2  |                      |                     
!  PS.base.ta4          | PF0_2  |                      |                     
!  PS.base.ta5          | PF0_2  |                      |                     
!  PS.base.ta6          | PF0_2  |                      |                     
!  PS.base.tb1          | PF0_2  |                      |                     
!  PS.base.tb2          | PF0_2  |                      |                     
!  PS.base.tb3          | PF0_2  |                      |                     
!  PS.base.tb4          | PF0_2  |                      |                     
!  PS.base.tb5          | PF0_2  |                      |                     
!  PS.base.tb6          | PF0_2  |                      |                     
!  PS.first.a1          | PF1_1  |                      | WS.101.1a           
!  PS.first.a2          | PF1_1  |                      | WS.101.1b           
!  PS.first.a3          | PF1_1  |                      | WS.101.2a           
!  PS.first.a4          | PF1_1  |                      | WS.101.2b           
!  PS.first.a5          | PF1_1  |                      | WS.101.3a           
!  PS.first.a6          | PF1_1  |                      | WS.101.3b           
!  PS.first.b1          | PF1_1  |                      | WS.102.1a           
!  PS.first.b2          | PF1_1  |                      | WS.102.1b           
!  PS.first.b3          | PF1_1  |                      | WS.102.2a           
!  PS.first.b4          | PF1_1  |                      | WS.102.2b           
!  PS.first.b5          | PF1_1  |                      | WS.102.3a           
!  PS.first.b6          | PF1_1  |                      | WS.102.3b           
!  PS.first.c1          | PF1_1  |                      | WS.105.1a           
!  PS.first.c2          | PF1_1  |                      | WS.105.1b           
!  PS.first.c3          | PF1_1  |                      | WS.105.2a           
!  PS.first.c4          | PF1_1  |                      | WS.105.2b           
!  PS.first.c5          | PF1_1  |                      | WS.105.3a           
!  PS.first.c6          | PF1_1  |                      | WS.105.3b           
!  PS.first.d1          | PF1_1  |                      | WS.106.1a           
!  PS.first.d2          | PF1_1  |                      | WS.106.1b           
!  PS.first.d3          | PF1_1  |                      | WS.106.2a           
!  PS.first.d4          | PF1_1  |                      | WS.106.2b           
!  PS.first.d5          | PF1_1  |                      | WS.106.3a           
!  PS.first.d6          | PF1_1  |                      | WS.106.3b           
!  PS.first.ta1         | PF1_2  |                      |                     
!  PS.first.ta2         | PF1_2  |                      |                     
!  PS.first.ta3         | PF1_2  |                      |                     
!  PS.first.ta4         | PF1_2  |                      |                     
!  PS.first.ta5         | PF1_2  |                      |                     
!  PS.first.ta6         | PF1_2  |                      |                     
!  PS.first.tb1         | PF1_2  |                      |                     
!  PS.first.tb2         | PF1_2  |                      |                     
!  PS.first.tb3         | PF1_2  |                      |                     
!  PS.first.tb4         | PF1_2  |                      |                     
!  PS.first.tb5         | PF1_2  |                      |                     
!  PS.first.tb6         | PF1_2  |                      |                     
! (66 rows)
! 
! select * from WSlot order by slotname;
!        slotname       |  roomno  |       slotlink       |       backlink       
! ----------------------+----------+----------------------+----------------------
!  WS.001.1a            | 001      |                      | PS.base.a1          
!  WS.001.1b            | 001      |                      | PS.base.a2          
!  WS.001.2a            | 001      |                      | PS.base.a3          
!  WS.001.2b            | 001      |                      | PS.base.a4          
!  WS.001.3a            | 001      |                      | PS.base.a5          
!  WS.001.3b            | 001      |                      | PS.base.a6          
!  WS.002.1a            | 002      |                      | PS.base.b1          
!  WS.002.1b            | 002      |                      | PS.base.b2          
!  WS.002.2a            | 002      |                      | PS.base.b3          
!  WS.002.2b            | 002      |                      | PS.base.b4          
!  WS.002.3a            | 002      |                      | PS.base.b5          
!  WS.002.3b            | 002      |                      | PS.base.b6          
!  WS.003.1a            | 003      |                      | PS.base.c1          
!  WS.003.1b            | 003      |                      | PS.base.c2          
!  WS.003.2a            | 003      |                      | PS.base.c3          
!  WS.003.2b            | 003      |                      | PS.base.c4          
!  WS.003.3a            | 003      |                      | PS.base.c5          
!  WS.003.3b            | 003      |                      | PS.base.c6          
!  WS.101.1a            | 101      |                      | PS.first.a1         
!  WS.101.1b            | 101      |                      | PS.first.a2         
!  WS.101.2a            | 101      |                      | PS.first.a3         
!  WS.101.2b            | 101      |                      | PS.first.a4         
!  WS.101.3a            | 101      |                      | PS.first.a5         
!  WS.101.3b            | 101      |                      | PS.first.a6         
!  WS.102.1a            | 102      |                      | PS.first.b1         
!  WS.102.1b            | 102      |                      | PS.first.b2         
!  WS.102.2a            | 102      |                      | PS.first.b3         
!  WS.102.2b            | 102      |                      | PS.first.b4         
!  WS.102.3a            | 102      |                      | PS.first.b5         
!  WS.102.3b            | 102      |                      | PS.first.b6         
!  WS.105.1a            | 105      |                      | PS.first.c1         
!  WS.105.1b            | 105      |                      | PS.first.c2         
!  WS.105.2a            | 105      |                      | PS.first.c3         
!  WS.105.2b            | 105      |                      | PS.first.c4         
!  WS.105.3a            | 105      |                      | PS.first.c5         
!  WS.105.3b            | 105      |                      | PS.first.c6         
!  WS.106.1a            | 106      |                      | PS.first.d1         
!  WS.106.1b            | 106      |                      | PS.first.d2         
!  WS.106.2a            | 106      |                      | PS.first.d3         
!  WS.106.2b            | 106      |                      | PS.first.d4         
!  WS.106.3a            | 106      |                      | PS.first.d5         
!  WS.106.3b            | 106      |                      | PS.first.d6         
! (42 rows)
! 
! --
! -- Install the central phone system and create the phone numbers.
! -- They are weired on insert to the patchfields. Again the
! -- triggers automatically tell the PSlots to update their
! -- backlink field.
! --
! insert into PLine values ('PL.001', '-0', 'Central call', 'PS.base.ta1');
! insert into PLine values ('PL.002', '-101', '', 'PS.base.ta2');
! insert into PLine values ('PL.003', '-102', '', 'PS.base.ta3');
! insert into PLine values ('PL.004', '-103', '', 'PS.base.ta5');
! insert into PLine values ('PL.005', '-104', '', 'PS.base.ta6');
! insert into PLine values ('PL.006', '-106', '', 'PS.base.tb2');
! insert into PLine values ('PL.007', '-108', '', 'PS.base.tb3');
! insert into PLine values ('PL.008', '-109', '', 'PS.base.tb4');
! insert into PLine values ('PL.009', '-121', '', 'PS.base.tb5');
! insert into PLine values ('PL.010', '-122', '', 'PS.base.tb6');
! insert into PLine values ('PL.015', '-134', '', 'PS.first.ta1');
! insert into PLine values ('PL.016', '-137', '', 'PS.first.ta3');
! insert into PLine values ('PL.017', '-139', '', 'PS.first.ta4');
! insert into PLine values ('PL.018', '-362', '', 'PS.first.tb1');
! insert into PLine values ('PL.019', '-363', '', 'PS.first.tb2');
! insert into PLine values ('PL.020', '-364', '', 'PS.first.tb3');
! insert into PLine values ('PL.021', '-365', '', 'PS.first.tb5');
! insert into PLine values ('PL.022', '-367', '', 'PS.first.tb6');
! insert into PLine values ('PL.028', '-501', 'Fax entrance', 'PS.base.ta2');
! insert into PLine values ('PL.029', '-502', 'Fax first floor', 'PS.first.ta1');
! --
! -- Buy some phones, plug them into the wall and patch the
! -- phone lines to the corresponding patchfield slots.
! --
! insert into PHone values ('PH.hc001', 'Hicom standard', 'WS.001.1a');
! update PSlot set slotlink = 'PS.base.ta1' where slotname = 'PS.base.a1';
! insert into PHone values ('PH.hc002', 'Hicom standard', 'WS.002.1a');
! update PSlot set slotlink = 'PS.base.ta5' where slotname = 'PS.base.b1';
! insert into PHone values ('PH.hc003', 'Hicom standard', 'WS.002.2a');
! update PSlot set slotlink = 'PS.base.tb2' where slotname = 'PS.base.b3';
! insert into PHone values ('PH.fax001', 'Canon fax', 'WS.001.2a');
! update PSlot set slotlink = 'PS.base.ta2' where slotname = 'PS.base.a3';
! --
! -- Install a hub at one of the patchfields, plug a computers
! -- ethernet interface into the wall and patch it to the hub.
! --
! insert into Hub values ('base.hub1', 'Patchfield PF0_1 hub', 16);
! insert into System values ('orion', 'PC');
! insert into IFace values ('IF', 'orion', 'eth0', 'WS.002.1b');
! update PSlot set slotlink = 'HS.base.hub1.1' where slotname = 'PS.base.b2';
! --
! -- Now we take a look at the patchfield
! --
! select * from PField_v1 where pfname = 'PF0_1' order by slotname;
!  pfname |       slotname       |                         backside                         |                     patch                     
! --------+----------------------+----------------------------------------------------------+-----------------------------------------------
!  PF0_1  | PS.base.a1           | WS.001.1a in room 001 -> Phone PH.hc001 (Hicom standard) | PS.base.ta1 -> Phone line -0 (Central call)
!  PF0_1  | PS.base.a2           | WS.001.1b in room 001 -> -                               | -
!  PF0_1  | PS.base.a3           | WS.001.2a in room 001 -> Phone PH.fax001 (Canon fax)     | PS.base.ta2 -> Phone line -501 (Fax entrance)
!  PF0_1  | PS.base.a4           | WS.001.2b in room 001 -> -                               | -
!  PF0_1  | PS.base.a5           | WS.001.3a in room 001 -> -                               | -
!  PF0_1  | PS.base.a6           | WS.001.3b in room 001 -> -                               | -
!  PF0_1  | PS.base.b1           | WS.002.1a in room 002 -> Phone PH.hc002 (Hicom standard) | PS.base.ta5 -> Phone line -103
!  PF0_1  | PS.base.b2           | WS.002.1b in room 002 -> orion IF eth0 (PC)              | Patchfield PF0_1 hub slot 1
!  PF0_1  | PS.base.b3           | WS.002.2a in room 002 -> Phone PH.hc003 (Hicom standard) | PS.base.tb2 -> Phone line -106
!  PF0_1  | PS.base.b4           | WS.002.2b in room 002 -> -                               | -
!  PF0_1  | PS.base.b5           | WS.002.3a in room 002 -> -                               | -
!  PF0_1  | PS.base.b6           | WS.002.3b in room 002 -> -                               | -
!  PF0_1  | PS.base.c1           | WS.003.1a in room 003 -> -                               | -
!  PF0_1  | PS.base.c2           | WS.003.1b in room 003 -> -                               | -
!  PF0_1  | PS.base.c3           | WS.003.2a in room 003 -> -                               | -
!  PF0_1  | PS.base.c4           | WS.003.2b in room 003 -> -                               | -
!  PF0_1  | PS.base.c5           | WS.003.3a in room 003 -> -                               | -
!  PF0_1  | PS.base.c6           | WS.003.3b in room 003 -> -                               | -
! (18 rows)
! 
! select * from PField_v1 where pfname = 'PF0_2' order by slotname;
!  pfname |       slotname       |            backside            |                                 patch                                  
! --------+----------------------+--------------------------------+------------------------------------------------------------------------
!  PF0_2  | PS.base.ta1          | Phone line -0 (Central call)   | PS.base.a1 -> WS.001.1a in room 001 -> Phone PH.hc001 (Hicom standard)
!  PF0_2  | PS.base.ta2          | Phone line -501 (Fax entrance) | PS.base.a3 -> WS.001.2a in room 001 -> Phone PH.fax001 (Canon fax)
!  PF0_2  | PS.base.ta3          | Phone line -102                | -
!  PF0_2  | PS.base.ta4          | -                              | -
!  PF0_2  | PS.base.ta5          | Phone line -103                | PS.base.b1 -> WS.002.1a in room 002 -> Phone PH.hc002 (Hicom standard)
!  PF0_2  | PS.base.ta6          | Phone line -104                | -
!  PF0_2  | PS.base.tb1          | -                              | -
!  PF0_2  | PS.base.tb2          | Phone line -106                | PS.base.b3 -> WS.002.2a in room 002 -> Phone PH.hc003 (Hicom standard)
!  PF0_2  | PS.base.tb3          | Phone line -108                | -
!  PF0_2  | PS.base.tb4          | Phone line -109                | -
!  PF0_2  | PS.base.tb5          | Phone line -121                | -
!  PF0_2  | PS.base.tb6          | Phone line -122                | -
! (12 rows)
! 
! --
! -- Finally we want errors
! --
! insert into PField values ('PF1_1', 'should fail due to unique index');
! ERROR:  duplicate key value violates unique constraint "pfield_name"
! DETAIL:  Key (name)=(PF1_1) already exists.
! update PSlot set backlink = 'WS.not.there' where slotname = 'PS.base.a1';
! ERROR:  WS.not.there         does not exist
! CONTEXT:  PL/pgSQL function tg_backlink_set(character,character) line 30 at RAISE
! PL/pgSQL function tg_backlink_a() line 17 at assignment
! update PSlot set backlink = 'XX.illegal' where slotname = 'PS.base.a1';
! ERROR:  illegal backlink beginning with XX
! CONTEXT:  PL/pgSQL function tg_backlink_set(character,character) line 47 at RAISE
! PL/pgSQL function tg_backlink_a() line 17 at assignment
! update PSlot set slotlink = 'PS.not.there' where slotname = 'PS.base.a1';
! ERROR:  PS.not.there         does not exist
! CONTEXT:  PL/pgSQL function tg_slotlink_set(character,character) line 30 at RAISE
! PL/pgSQL function tg_slotlink_a() line 17 at assignment
! update PSlot set slotlink = 'XX.illegal' where slotname = 'PS.base.a1';
! ERROR:  illegal slotlink beginning with XX
! CONTEXT:  PL/pgSQL function tg_slotlink_set(character,character) line 77 at RAISE
! PL/pgSQL function tg_slotlink_a() line 17 at assignment
! insert into HSlot values ('HS', 'base.hub1', 1, '');
! ERROR:  duplicate key value violates unique constraint "hslot_name"
! DETAIL:  Key (slotname)=(HS.base.hub1.1      ) already exists.
! insert into HSlot values ('HS', 'base.hub1', 20, '');
! ERROR:  no manual manipulation of HSlot
! CONTEXT:  PL/pgSQL function tg_hslot_biu() line 12 at RAISE
! delete from HSlot;
! ERROR:  no manual manipulation of HSlot
! CONTEXT:  PL/pgSQL function tg_hslot_bd() line 12 at RAISE
! insert into IFace values ('IF', 'notthere', 'eth0', '');
! ERROR:  system "notthere" does not exist
! CONTEXT:  PL/pgSQL function tg_iface_biu() line 8 at RAISE
! insert into IFace values ('IF', 'orion', 'ethernet_interface_name_too_long', '');
! ERROR:  IFace slotname "IF.orion.ethernet_interface_name_too_long" too long (20 char max)
! CONTEXT:  PL/pgSQL function tg_iface_biu() line 14 at RAISE
! --
! -- The following tests are unrelated to the scenario outlined above;
! -- they merely exercise specific parts of PL/pgSQL
! --
! --
! -- Test recursion, per bug report 7-Sep-01
! --
! CREATE FUNCTION recursion_test(int,int) RETURNS text AS '
! DECLARE rslt text;
! BEGIN
!     IF $1 <= 0 THEN
!         rslt = CAST($2 AS TEXT);
!     ELSE
!         rslt = CAST($1 AS TEXT) || '','' || recursion_test($1 - 1, $2);
!     END IF;
!     RETURN rslt;
! END;' LANGUAGE plpgsql;
! SELECT recursion_test(4,3);
!  recursion_test 
! ----------------
!  4,3,2,1,3
! (1 row)
! 
! --
! -- Test the FOUND magic variable
! --
! CREATE TABLE found_test_tbl (a int);
! create function test_found()
!   returns boolean as '
!   declare
!   begin
!   insert into found_test_tbl values (1);
!   if FOUND then
!      insert into found_test_tbl values (2);
!   end if;
! 
!   update found_test_tbl set a = 100 where a = 1;
!   if FOUND then
!     insert into found_test_tbl values (3);
!   end if;
! 
!   delete from found_test_tbl where a = 9999; -- matches no rows
!   if not FOUND then
!     insert into found_test_tbl values (4);
!   end if;
! 
!   for i in 1 .. 10 loop
!     -- no need to do anything
!   end loop;
!   if FOUND then
!     insert into found_test_tbl values (5);
!   end if;
! 
!   -- never executes the loop
!   for i in 2 .. 1 loop
!     -- no need to do anything
!   end loop;
!   if not FOUND then
!     insert into found_test_tbl values (6);
!   end if;
!   return true;
!   end;' language plpgsql;
! select test_found();
!  test_found 
! ------------
!  t
! (1 row)
! 
! select * from found_test_tbl;
!   a  
! -----
!    2
!  100
!    3
!    4
!    5
!    6
! (6 rows)
! 
! --
! -- Test set-returning functions for PL/pgSQL
! --
! create function test_table_func_rec() returns setof found_test_tbl as '
! DECLARE
! 	rec RECORD;
! BEGIN
! 	FOR rec IN select * from found_test_tbl LOOP
! 		RETURN NEXT rec;
! 	END LOOP;
! 	RETURN;
! END;' language plpgsql;
! select * from test_table_func_rec();
!   a  
! -----
!    2
!  100
!    3
!    4
!    5
!    6
! (6 rows)
! 
! create function test_table_func_row() returns setof found_test_tbl as '
! DECLARE
! 	row found_test_tbl%ROWTYPE;
! BEGIN
! 	FOR row IN select * from found_test_tbl LOOP
! 		RETURN NEXT row;
! 	END LOOP;
! 	RETURN;
! END;' language plpgsql;
! select * from test_table_func_row();
!   a  
! -----
!    2
!  100
!    3
!    4
!    5
!    6
! (6 rows)
! 
! create function test_ret_set_scalar(int,int) returns setof int as '
! DECLARE
! 	i int;
! BEGIN
! 	FOR i IN $1 .. $2 LOOP
! 		RETURN NEXT i + 1;
! 	END LOOP;
! 	RETURN;
! END;' language plpgsql;
! select * from test_ret_set_scalar(1,10);
!  test_ret_set_scalar 
! ---------------------
!                    2
!                    3
!                    4
!                    5
!                    6
!                    7
!                    8
!                    9
!                   10
!                   11
! (10 rows)
! 
! create function test_ret_set_rec_dyn(int) returns setof record as '
! DECLARE
! 	retval RECORD;
! BEGIN
! 	IF $1 > 10 THEN
! 		SELECT INTO retval 5, 10, 15;
! 		RETURN NEXT retval;
! 		RETURN NEXT retval;
! 	ELSE
! 		SELECT INTO retval 50, 5::numeric, ''xxx''::text;
! 		RETURN NEXT retval;
! 		RETURN NEXT retval;
! 	END IF;
! 	RETURN;
! END;' language plpgsql;
! SELECT * FROM test_ret_set_rec_dyn(1500) AS (a int, b int, c int);
!  a | b  | c  
! ---+----+----
!  5 | 10 | 15
!  5 | 10 | 15
! (2 rows)
! 
! SELECT * FROM test_ret_set_rec_dyn(5) AS (a int, b numeric, c text);
!  a  | b |  c  
! ----+---+-----
!  50 | 5 | xxx
!  50 | 5 | xxx
! (2 rows)
! 
! create function test_ret_rec_dyn(int) returns record as '
! DECLARE
! 	retval RECORD;
! BEGIN
! 	IF $1 > 10 THEN
! 		SELECT INTO retval 5, 10, 15;
! 		RETURN retval;
! 	ELSE
! 		SELECT INTO retval 50, 5::numeric, ''xxx''::text;
! 		RETURN retval;
! 	END IF;
! END;' language plpgsql;
! SELECT * FROM test_ret_rec_dyn(1500) AS (a int, b int, c int);
!  a | b  | c  
! ---+----+----
!  5 | 10 | 15
! (1 row)
! 
! SELECT * FROM test_ret_rec_dyn(5) AS (a int, b numeric, c text);
!  a  | b |  c  
! ----+---+-----
!  50 | 5 | xxx
! (1 row)
! 
! --
! -- Test handling of OUT parameters, including polymorphic cases.
! -- Note that RETURN is optional with OUT params; we try both ways.
! --
! -- wrong way to do it:
! create function f1(in i int, out j int) returns int as $$
! begin
!   return i+1;
! end$$ language plpgsql;
! ERROR:  RETURN cannot have a parameter in function with OUT parameters
! LINE 3:   return i+1;
!                  ^
! create function f1(in i int, out j int) as $$
! begin
!   j := i+1;
!   return;
! end$$ language plpgsql;
! select f1(42);
!  f1 
! ----
!  43
! (1 row)
! 
! select * from f1(42);
!  j  
! ----
!  43
! (1 row)
! 
! create or replace function f1(inout i int) as $$
! begin
!   i := i+1;
! end$$ language plpgsql;
! select f1(42);
!  f1 
! ----
!  43
! (1 row)
! 
! select * from f1(42);
!  i  
! ----
!  43
! (1 row)
! 
! drop function f1(int);
! create function f1(in i int, out j int) returns setof int as $$
! begin
!   j := i+1;
!   return next;
!   j := i+2;
!   return next;
!   return;
! end$$ language plpgsql;
! select * from f1(42);
!  j  
! ----
!  43
!  44
! (2 rows)
! 
! drop function f1(int);
! create function f1(in i int, out j int, out k text) as $$
! begin
!   j := i;
!   j := j+1;
!   k := 'foo';
! end$$ language plpgsql;
! select f1(42);
!     f1    
! ----------
!  (43,foo)
! (1 row)
! 
! select * from f1(42);
!  j  |  k  
! ----+-----
!  43 | foo
! (1 row)
! 
! drop function f1(int);
! create function f1(in i int, out j int, out k text) returns setof record as $$
! begin
!   j := i+1;
!   k := 'foo';
!   return next;
!   j := j+1;
!   k := 'foot';
!   return next;
! end$$ language plpgsql;
! select * from f1(42);
!  j  |  k   
! ----+------
!  43 | foo
!  44 | foot
! (2 rows)
! 
! drop function f1(int);
! create function duplic(in i anyelement, out j anyelement, out k anyarray) as $$
! begin
!   j := i;
!   k := array[j,j];
!   return;
! end$$ language plpgsql;
! select * from duplic(42);
!  j  |    k    
! ----+---------
!  42 | {42,42}
! (1 row)
! 
! select * from duplic('foo'::text);
!   j  |     k     
! -----+-----------
!  foo | {foo,foo}
! (1 row)
! 
! drop function duplic(anyelement);
! --
! -- test PERFORM
! --
! create table perform_test (
! 	a	INT,
! 	b	INT
! );
! create function simple_func(int) returns boolean as '
! BEGIN
! 	IF $1 < 20 THEN
! 		INSERT INTO perform_test VALUES ($1, $1 + 10);
! 		RETURN TRUE;
! 	ELSE
! 		RETURN FALSE;
! 	END IF;
! END;' language plpgsql;
! create function perform_test_func() returns void as '
! BEGIN
! 	IF FOUND then
! 		INSERT INTO perform_test VALUES (100, 100);
! 	END IF;
! 
! 	PERFORM simple_func(5);
! 
! 	IF FOUND then
! 		INSERT INTO perform_test VALUES (100, 100);
! 	END IF;
! 
! 	PERFORM simple_func(50);
! 
! 	IF FOUND then
! 		INSERT INTO perform_test VALUES (100, 100);
! 	END IF;
! 
! 	RETURN;
! END;' language plpgsql;
! SELECT perform_test_func();
!  perform_test_func 
! -------------------
!  
! (1 row)
! 
! SELECT * FROM perform_test;
!   a  |  b  
! -----+-----
!    5 |  15
!  100 | 100
!  100 | 100
! (3 rows)
! 
! drop table perform_test;
! --
! -- Test error trapping
! --
! create function trap_zero_divide(int) returns int as $$
! declare x int;
! 	sx smallint;
! begin
! 	begin	-- start a subtransaction
! 		raise notice 'should see this';
! 		x := 100 / $1;
! 		raise notice 'should see this only if % <> 0', $1;
! 		sx := $1;
! 		raise notice 'should see this only if % fits in smallint', $1;
! 		if $1 < 0 then
! 			raise exception '% is less than zero', $1;
! 		end if;
! 	exception
! 		when division_by_zero then
! 			raise notice 'caught division_by_zero';
! 			x := -1;
! 		when NUMERIC_VALUE_OUT_OF_RANGE then
! 			raise notice 'caught numeric_value_out_of_range';
! 			x := -2;
! 	end;
! 	return x;
! end$$ language plpgsql;
! select trap_zero_divide(50);
! NOTICE:  should see this
! NOTICE:  should see this only if 50 <> 0
! NOTICE:  should see this only if 50 fits in smallint
!  trap_zero_divide 
! ------------------
!                 2
! (1 row)
! 
! select trap_zero_divide(0);
! NOTICE:  should see this
! NOTICE:  caught division_by_zero
!  trap_zero_divide 
! ------------------
!                -1
! (1 row)
! 
! select trap_zero_divide(100000);
! NOTICE:  should see this
! NOTICE:  should see this only if 100000 <> 0
! NOTICE:  caught numeric_value_out_of_range
!  trap_zero_divide 
! ------------------
!                -2
! (1 row)
! 
! select trap_zero_divide(-100);
! NOTICE:  should see this
! NOTICE:  should see this only if -100 <> 0
! NOTICE:  should see this only if -100 fits in smallint
! ERROR:  -100 is less than zero
! CONTEXT:  PL/pgSQL function trap_zero_divide(integer) line 12 at RAISE
! create function trap_matching_test(int) returns int as $$
! declare x int;
! 	sx smallint;
! 	y int;
! begin
! 	begin	-- start a subtransaction
! 		x := 100 / $1;
! 		sx := $1;
! 		select into y unique1 from tenk1 where unique2 =
! 			(select unique2 from tenk1 b where ten = $1);
! 	exception
! 		when data_exception then  -- category match
! 			raise notice 'caught data_exception';
! 			x := -1;
! 		when NUMERIC_VALUE_OUT_OF_RANGE OR CARDINALITY_VIOLATION then
! 			raise notice 'caught numeric_value_out_of_range or cardinality_violation';
! 			x := -2;
! 	end;
! 	return x;
! end$$ language plpgsql;
! select trap_matching_test(50);
!  trap_matching_test 
! --------------------
!                   2
! (1 row)
! 
! select trap_matching_test(0);
! NOTICE:  caught data_exception
!  trap_matching_test 
! --------------------
!                  -1
! (1 row)
! 
! select trap_matching_test(100000);
! NOTICE:  caught data_exception
!  trap_matching_test 
! --------------------
!                  -1
! (1 row)
! 
! select trap_matching_test(1);
! NOTICE:  caught numeric_value_out_of_range or cardinality_violation
!  trap_matching_test 
! --------------------
!                  -2
! (1 row)
! 
! create temp table foo (f1 int);
! create function subxact_rollback_semantics() returns int as $$
! declare x int;
! begin
!   x := 1;
!   insert into foo values(x);
!   begin
!     x := x + 1;
!     insert into foo values(x);
!     raise exception 'inner';
!   exception
!     when others then
!       x := x * 10;
!   end;
!   insert into foo values(x);
!   return x;
! end$$ language plpgsql;
! select subxact_rollback_semantics();
!  subxact_rollback_semantics 
! ----------------------------
!                          20
! (1 row)
! 
! select * from foo;
!  f1 
! ----
!   1
!  20
! (2 rows)
! 
! drop table foo;
! create function trap_timeout() returns void as $$
! begin
!   declare x int;
!   begin
!     -- we assume this will take longer than 2 seconds:
!     select count(*) into x from tenk1 a, tenk1 b, tenk1 c;
!   exception
!     when others then
!       raise notice 'caught others?';
!     when query_canceled then
!       raise notice 'nyeah nyeah, can''t stop me';
!   end;
!   -- Abort transaction to abandon the statement_timeout setting.  Otherwise,
!   -- the next top-level statement would be vulnerable to the timeout.
!   raise exception 'end of function';
! end$$ language plpgsql;
! begin;
! set statement_timeout to 2000;
! select trap_timeout();
! NOTICE:  nyeah nyeah, can't stop me
! ERROR:  end of function
! CONTEXT:  PL/pgSQL function trap_timeout() line 15 at RAISE
! rollback;
! -- Test for pass-by-ref values being stored in proper context
! create function test_variable_storage() returns text as $$
! declare x text;
! begin
!   x := '1234';
!   begin
!     x := x || '5678';
!     -- force error inside subtransaction SPI context
!     perform trap_zero_divide(-100);
!   exception
!     when others then
!       x := x || '9012';
!   end;
!   return x;
! end$$ language plpgsql;
! select test_variable_storage();
! NOTICE:  should see this
! NOTICE:  should see this only if -100 <> 0
! NOTICE:  should see this only if -100 fits in smallint
!  test_variable_storage 
! -----------------------
!  123456789012
! (1 row)
! 
! --
! -- test foreign key error trapping
! --
! create temp table master(f1 int primary key);
! create temp table slave(f1 int references master deferrable);
! insert into master values(1);
! insert into slave values(1);
! insert into slave values(2);	-- fails
! ERROR:  insert or update on table "slave" violates foreign key constraint "slave_f1_fkey"
! DETAIL:  Key (f1)=(2) is not present in table "master".
! create function trap_foreign_key(int) returns int as $$
! begin
! 	begin	-- start a subtransaction
! 		insert into slave values($1);
! 	exception
! 		when foreign_key_violation then
! 			raise notice 'caught foreign_key_violation';
! 			return 0;
! 	end;
! 	return 1;
! end$$ language plpgsql;
! create function trap_foreign_key_2() returns int as $$
! begin
! 	begin	-- start a subtransaction
! 		set constraints all immediate;
! 	exception
! 		when foreign_key_violation then
! 			raise notice 'caught foreign_key_violation';
! 			return 0;
! 	end;
! 	return 1;
! end$$ language plpgsql;
! select trap_foreign_key(1);
!  trap_foreign_key 
! ------------------
!                 1
! (1 row)
! 
! select trap_foreign_key(2);	-- detects FK violation
! NOTICE:  caught foreign_key_violation
!  trap_foreign_key 
! ------------------
!                 0
! (1 row)
! 
! begin;
!   set constraints all deferred;
!   select trap_foreign_key(2);	-- should not detect FK violation
!  trap_foreign_key 
! ------------------
!                 1
! (1 row)
! 
!   savepoint x;
!     set constraints all immediate; -- fails
! ERROR:  insert or update on table "slave" violates foreign key constraint "slave_f1_fkey"
! DETAIL:  Key (f1)=(2) is not present in table "master".
!   rollback to x;
!   select trap_foreign_key_2();  -- detects FK violation
! NOTICE:  caught foreign_key_violation
!  trap_foreign_key_2 
! --------------------
!                   0
! (1 row)
! 
! commit;				-- still fails
! ERROR:  insert or update on table "slave" violates foreign key constraint "slave_f1_fkey"
! DETAIL:  Key (f1)=(2) is not present in table "master".
! drop function trap_foreign_key(int);
! drop function trap_foreign_key_2();
! --
! -- Test proper snapshot handling in simple expressions
! --
! create temp table users(login text, id serial);
! create function sp_id_user(a_login text) returns int as $$
! declare x int;
! begin
!   select into x id from users where login = a_login;
!   if found then return x; end if;
!   return 0;
! end$$ language plpgsql stable;
! insert into users values('user1');
! select sp_id_user('user1');
!  sp_id_user 
! ------------
!           1
! (1 row)
! 
! select sp_id_user('userx');
!  sp_id_user 
! ------------
!           0
! (1 row)
! 
! create function sp_add_user(a_login text) returns int as $$
! declare my_id_user int;
! begin
!   my_id_user = sp_id_user( a_login );
!   IF  my_id_user > 0 THEN
!     RETURN -1;  -- error code for existing user
!   END IF;
!   INSERT INTO users ( login ) VALUES ( a_login );
!   my_id_user = sp_id_user( a_login );
!   IF  my_id_user = 0 THEN
!     RETURN -2;  -- error code for insertion failure
!   END IF;
!   RETURN my_id_user;
! end$$ language plpgsql;
! select sp_add_user('user1');
!  sp_add_user 
! -------------
!           -1
! (1 row)
! 
! select sp_add_user('user2');
!  sp_add_user 
! -------------
!            2
! (1 row)
! 
! select sp_add_user('user2');
!  sp_add_user 
! -------------
!           -1
! (1 row)
! 
! select sp_add_user('user3');
!  sp_add_user 
! -------------
!            3
! (1 row)
! 
! select sp_add_user('user3');
!  sp_add_user 
! -------------
!           -1
! (1 row)
! 
! drop function sp_add_user(text);
! drop function sp_id_user(text);
! --
! -- tests for refcursors
! --
! create table rc_test (a int, b int);
! copy rc_test from stdin;
! create function return_refcursor(rc refcursor) returns refcursor as $$
! begin
!     open rc for select a from rc_test;
!     return rc;
! end
! $$ language plpgsql;
! create function refcursor_test1(refcursor) returns refcursor as $$
! begin
!     perform return_refcursor($1);
!     return $1;
! end
! $$ language plpgsql;
! begin;
! select refcursor_test1('test1');
!  refcursor_test1 
! -----------------
!  test1
! (1 row)
! 
! fetch next in test1;
!  a 
! ---
!  5
! (1 row)
! 
! select refcursor_test1('test2');
!  refcursor_test1 
! -----------------
!  test2
! (1 row)
! 
! fetch all from test2;
!   a  
! -----
!    5
!   50
!  500
! (3 rows)
! 
! commit;
! -- should fail
! fetch next from test1;
! ERROR:  cursor "test1" does not exist
! create function refcursor_test2(int, int) returns boolean as $$
! declare
!     c1 cursor (param1 int, param2 int) for select * from rc_test where a > param1 and b > param2;
!     nonsense record;
! begin
!     open c1($1, $2);
!     fetch c1 into nonsense;
!     close c1;
!     if found then
!         return true;
!     else
!         return false;
!     end if;
! end
! $$ language plpgsql;
! select refcursor_test2(20000, 20000) as "Should be false",
!        refcursor_test2(20, 20) as "Should be true";
!  Should be false | Should be true 
! -----------------+----------------
!  f               | t
! (1 row)
! 
! --
! -- tests for cursors with named parameter arguments
! --
! create function namedparmcursor_test1(int, int) returns boolean as $$
! declare
!     c1 cursor (param1 int, param12 int) for select * from rc_test where a > param1 and b > param12;
!     nonsense record;
! begin
!     open c1(param12 := $2, param1 := $1);
!     fetch c1 into nonsense;
!     close c1;
!     if found then
!         return true;
!     else
!         return false;
!     end if;
! end
! $$ language plpgsql;
! select namedparmcursor_test1(20000, 20000) as "Should be false",
!        namedparmcursor_test1(20, 20) as "Should be true";
!  Should be false | Should be true 
! -----------------+----------------
!  f               | t
! (1 row)
! 
! -- mixing named and positional argument notations
! create function namedparmcursor_test2(int, int) returns boolean as $$
! declare
!     c1 cursor (param1 int, param2 int) for select * from rc_test where a > param1 and b > param2;
!     nonsense record;
! begin
!     open c1(param1 := $1, $2);
!     fetch c1 into nonsense;
!     close c1;
!     if found then
!         return true;
!     else
!         return false;
!     end if;
! end
! $$ language plpgsql;
! select namedparmcursor_test2(20, 20);
!  namedparmcursor_test2 
! -----------------------
!  t
! (1 row)
! 
! -- mixing named and positional: param2 is given twice, once in named notation
! -- and second time in positional notation. Should throw an error at parse time
! create function namedparmcursor_test3() returns void as $$
! declare
!     c1 cursor (param1 int, param2 int) for select * from rc_test where a > param1 and b > param2;
! begin
!     open c1(param2 := 20, 21);
! end
! $$ language plpgsql;
! ERROR:  value for parameter "param2" of cursor "c1" specified more than once
! LINE 5:     open c1(param2 := 20, 21);
!                                   ^
! -- mixing named and positional: same as previous test, but param1 is duplicated
! create function namedparmcursor_test4() returns void as $$
! declare
!     c1 cursor (param1 int, param2 int) for select * from rc_test where a > param1 and b > param2;
! begin
!     open c1(20, param1 := 21);
! end
! $$ language plpgsql;
! ERROR:  value for parameter "param1" of cursor "c1" specified more than once
! LINE 5:     open c1(20, param1 := 21);
!                         ^
! -- duplicate named parameter, should throw an error at parse time
! create function namedparmcursor_test5() returns void as $$
! declare
!   c1 cursor (p1 int, p2 int) for
!     select * from tenk1 where thousand = p1 and tenthous = p2;
! begin
!   open c1 (p2 := 77, p2 := 42);
! end
! $$ language plpgsql;
! ERROR:  value for parameter "p2" of cursor "c1" specified more than once
! LINE 6:   open c1 (p2 := 77, p2 := 42);
!                              ^
! -- not enough parameters, should throw an error at parse time
! create function namedparmcursor_test6() returns void as $$
! declare
!   c1 cursor (p1 int, p2 int) for
!     select * from tenk1 where thousand = p1 and tenthous = p2;
! begin
!   open c1 (p2 := 77);
! end
! $$ language plpgsql;
! ERROR:  not enough arguments for cursor "c1"
! LINE 6:   open c1 (p2 := 77);
!                            ^
! -- division by zero runtime error, the context given in the error message
! -- should be sensible
! create function namedparmcursor_test7() returns void as $$
! declare
!   c1 cursor (p1 int, p2 int) for
!     select * from tenk1 where thousand = p1 and tenthous = p2;
! begin
!   open c1 (p2 := 77, p1 := 42/0);
! end $$ language plpgsql;
! select namedparmcursor_test7();
! ERROR:  division by zero
! CONTEXT:  SQL statement "SELECT 42/0 AS p1, 77 AS p2;"
! PL/pgSQL function namedparmcursor_test7() line 6 at OPEN
! -- check that line comments work correctly within the argument list (there
! -- is some special handling of this case in the code: the newline after the
! -- comment must be preserved when the argument-evaluating query is
! -- constructed, otherwise the comment effectively comments out the next
! -- argument, too)
! create function namedparmcursor_test8() returns int4 as $$
! declare
!   c1 cursor (p1 int, p2 int) for
!     select count(*) from tenk1 where thousand = p1 and tenthous = p2;
!   n int4;
! begin
!   open c1 (77 -- test
!   , 42);
!   fetch c1 into n;
!   return n;
! end $$ language plpgsql;
! select namedparmcursor_test8();
!  namedparmcursor_test8 
! -----------------------
!                      0
! (1 row)
! 
! -- cursor parameter name can match plpgsql variable or unreserved keyword
! create function namedparmcursor_test9(p1 int) returns int4 as $$
! declare
!   c1 cursor (p1 int, p2 int, debug int) for
!     select count(*) from tenk1 where thousand = p1 and tenthous = p2
!       and four = debug;
!   p2 int4 := 1006;
!   n int4;
! begin
!   open c1 (p1 := p1, p2 := p2, debug := 2);
!   fetch c1 into n;
!   return n;
! end $$ language plpgsql;
! select namedparmcursor_test9(6);
!  namedparmcursor_test9 
! -----------------------
!                      1
! (1 row)
! 
! --
! -- tests for "raise" processing
! --
! create function raise_test1(int) returns int as $$
! begin
!     raise notice 'This message has too many parameters!', $1;
!     return $1;
! end;
! $$ language plpgsql;
! ERROR:  too many parameters specified for RAISE
! CONTEXT:  compilation of PL/pgSQL function "raise_test1" near line 3
! create function raise_test2(int) returns int as $$
! begin
!     raise notice 'This message has too few parameters: %, %, %', $1, $1;
!     return $1;
! end;
! $$ language plpgsql;
! ERROR:  too few parameters specified for RAISE
! CONTEXT:  compilation of PL/pgSQL function "raise_test2" near line 3
! create function raise_test3(int) returns int as $$
! begin
!     raise notice 'This message has no parameters (despite having %% signs in it)!';
!     return $1;
! end;
! $$ language plpgsql;
! select raise_test3(1);
! NOTICE:  This message has no parameters (despite having % signs in it)!
!  raise_test3 
! -------------
!            1
! (1 row)
! 
! -- Test re-RAISE inside a nested exception block.  This case is allowed
! -- by Oracle's PL/SQL but was handled differently by PG before 9.1.
! CREATE FUNCTION reraise_test() RETURNS void AS $$
! BEGIN
!    BEGIN
!        RAISE syntax_error;
!    EXCEPTION
!        WHEN syntax_error THEN
!            BEGIN
!                raise notice 'exception % thrown in inner block, reraising', sqlerrm;
!                RAISE;
!            EXCEPTION
!                WHEN OTHERS THEN
!                    raise notice 'RIGHT - exception % caught in inner block', sqlerrm;
!            END;
!    END;
! EXCEPTION
!    WHEN OTHERS THEN
!        raise notice 'WRONG - exception % caught in outer block', sqlerrm;
! END;
! $$ LANGUAGE plpgsql;
! SELECT reraise_test();
! NOTICE:  exception syntax_error thrown in inner block, reraising
! NOTICE:  RIGHT - exception syntax_error caught in inner block
!  reraise_test 
! --------------
!  
! (1 row)
! 
! --
! -- reject function definitions that contain malformed SQL queries at
! -- compile-time, where possible
! --
! create function bad_sql1() returns int as $$
! declare a int;
! begin
!     a := 5;
!     Johnny Yuma;
!     a := 10;
!     return a;
! end$$ language plpgsql;
! ERROR:  syntax error at or near "Johnny"
! LINE 5:     Johnny Yuma;
!             ^
! create function bad_sql2() returns int as $$
! declare r record;
! begin
!     for r in select I fought the law, the law won LOOP
!         raise notice 'in loop';
!     end loop;
!     return 5;
! end;$$ language plpgsql;
! ERROR:  syntax error at or near "the"
! LINE 4:     for r in select I fought the law, the law won LOOP
!                                      ^
! -- a RETURN expression is mandatory, except for void-returning
! -- functions, where it is not allowed
! create function missing_return_expr() returns int as $$
! begin
!     return ;
! end;$$ language plpgsql;
! ERROR:  missing expression at or near ";"
! LINE 3:     return ;
!                    ^
! create function void_return_expr() returns void as $$
! begin
!     return 5;
! end;$$ language plpgsql;
! ERROR:  RETURN cannot have a parameter in function returning void
! LINE 3:     return 5;
!                    ^
! -- VOID functions are allowed to omit RETURN
! create function void_return_expr() returns void as $$
! begin
!     perform 2+2;
! end;$$ language plpgsql;
! select void_return_expr();
!  void_return_expr 
! ------------------
!  
! (1 row)
! 
! -- but ordinary functions are not
! create function missing_return_expr() returns int as $$
! begin
!     perform 2+2;
! end;$$ language plpgsql;
! select missing_return_expr();
! ERROR:  control reached end of function without RETURN
! CONTEXT:  PL/pgSQL function missing_return_expr()
! drop function void_return_expr();
! drop function missing_return_expr();
! --
! -- EXECUTE ... INTO test
! --
! create table eifoo (i integer, y integer);
! create type eitype as (i integer, y integer);
! create or replace function execute_into_test(varchar) returns record as $$
! declare
!     _r record;
!     _rt eifoo%rowtype;
!     _v eitype;
!     i int;
!     j int;
!     k int;
! begin
!     execute 'insert into '||$1||' values(10,15)';
!     execute 'select (row).* from (select row(10,1)::eifoo) s' into _r;
!     raise notice '% %', _r.i, _r.y;
!     execute 'select * from '||$1||' limit 1' into _rt;
!     raise notice '% %', _rt.i, _rt.y;
!     execute 'select *, 20 from '||$1||' limit 1' into i, j, k;
!     raise notice '% % %', i, j, k;
!     execute 'select 1,2' into _v;
!     return _v;
! end; $$ language plpgsql;
! select execute_into_test('eifoo');
! NOTICE:  10 1
! NOTICE:  10 15
! NOTICE:  10 15 20
!  execute_into_test 
! -------------------
!  (1,2)
! (1 row)
! 
! drop table eifoo cascade;
! drop type eitype cascade;
! --
! -- SQLSTATE and SQLERRM test
! --
! create function excpt_test1() returns void as $$
! begin
!     raise notice '% %', sqlstate, sqlerrm;
! end; $$ language plpgsql;
! -- should fail: SQLSTATE and SQLERRM are only in defined EXCEPTION
! -- blocks
! select excpt_test1();
! ERROR:  column "sqlstate" does not exist
! LINE 1: SELECT sqlstate
!                ^
! QUERY:  SELECT sqlstate
! CONTEXT:  PL/pgSQL function excpt_test1() line 3 at RAISE
! create function excpt_test2() returns void as $$
! begin
!     begin
!         begin
!             raise notice '% %', sqlstate, sqlerrm;
!         end;
!     end;
! end; $$ language plpgsql;
! -- should fail
! select excpt_test2();
! ERROR:  column "sqlstate" does not exist
! LINE 1: SELECT sqlstate
!                ^
! QUERY:  SELECT sqlstate
! CONTEXT:  PL/pgSQL function excpt_test2() line 5 at RAISE
! create function excpt_test3() returns void as $$
! begin
!     begin
!         raise exception 'user exception';
!     exception when others then
! 	    raise notice 'caught exception % %', sqlstate, sqlerrm;
! 	    begin
! 	        raise notice '% %', sqlstate, sqlerrm;
! 	        perform 10/0;
!         exception
!             when substring_error then
!                 -- this exception handler shouldn't be invoked
!                 raise notice 'unexpected exception: % %', sqlstate, sqlerrm;
! 	        when division_by_zero then
! 	            raise notice 'caught exception % %', sqlstate, sqlerrm;
! 	    end;
! 	    raise notice '% %', sqlstate, sqlerrm;
!     end;
! end; $$ language plpgsql;
! select excpt_test3();
! NOTICE:  caught exception P0001 user exception
! NOTICE:  P0001 user exception
! NOTICE:  caught exception 22012 division by zero
! NOTICE:  P0001 user exception
!  excpt_test3 
! -------------
!  
! (1 row)
! 
! create function excpt_test4() returns text as $$
! begin
! 	begin perform 1/0;
! 	exception when others then return sqlerrm; end;
! end; $$ language plpgsql;
! select excpt_test4();
!    excpt_test4    
! ------------------
!  division by zero
! (1 row)
! 
! drop function excpt_test1();
! drop function excpt_test2();
! drop function excpt_test3();
! drop function excpt_test4();
! -- parameters of raise stmt can be expressions
! create function raise_exprs() returns void as $$
! declare
!     a integer[] = '{10,20,30}';
!     c varchar = 'xyz';
!     i integer;
! begin
!     i := 2;
!     raise notice '%; %; %; %; %; %', a, a[i], c, (select c || 'abc'), row(10,'aaa',NULL,30), NULL;
! end;$$ language plpgsql;
! select raise_exprs();
! NOTICE:  {10,20,30}; 20; xyz; xyzabc; (10,aaa,,30); <NULL>
!  raise_exprs 
! -------------
!  
! (1 row)
! 
! drop function raise_exprs();
! -- continue statement
! create table conttesttbl(idx serial, v integer);
! insert into conttesttbl(v) values(10);
! insert into conttesttbl(v) values(20);
! insert into conttesttbl(v) values(30);
! insert into conttesttbl(v) values(40);
! create function continue_test1() returns void as $$
! declare _i integer = 0; _r record;
! begin
!   raise notice '---1---';
!   loop
!     _i := _i + 1;
!     raise notice '%', _i;
!     continue when _i < 10;
!     exit;
!   end loop;
! 
!   raise notice '---2---';
!   <<lbl>>
!   loop
!     _i := _i - 1;
!     loop
!       raise notice '%', _i;
!       continue lbl when _i > 0;
!       exit lbl;
!     end loop;
!   end loop;
! 
!   raise notice '---3---';
!   <<the_loop>>
!   while _i < 10 loop
!     _i := _i + 1;
!     continue the_loop when _i % 2 = 0;
!     raise notice '%', _i;
!   end loop;
! 
!   raise notice '---4---';
!   for _i in 1..10 loop
!     begin
!       -- applies to outer loop, not the nested begin block
!       continue when _i < 5;
!       raise notice '%', _i;
!     end;
!   end loop;
! 
!   raise notice '---5---';
!   for _r in select * from conttesttbl loop
!     continue when _r.v <= 20;
!     raise notice '%', _r.v;
!   end loop;
! 
!   raise notice '---6---';
!   for _r in execute 'select * from conttesttbl' loop
!     continue when _r.v <= 20;
!     raise notice '%', _r.v;
!   end loop;
! 
!   raise notice '---7---';
!   for _i in 1..3 loop
!     raise notice '%', _i;
!     continue when _i = 3;
!   end loop;
! 
!   raise notice '---8---';
!   _i := 1;
!   while _i <= 3 loop
!     raise notice '%', _i;
!     _i := _i + 1;
!     continue when _i = 3;
!   end loop;
! 
!   raise notice '---9---';
!   for _r in select * from conttesttbl order by v limit 1 loop
!     raise notice '%', _r.v;
!     continue;
!   end loop;
! 
!   raise notice '---10---';
!   for _r in execute 'select * from conttesttbl order by v limit 1' loop
!     raise notice '%', _r.v;
!     continue;
!   end loop;
! end; $$ language plpgsql;
! select continue_test1();
! NOTICE:  ---1---
! NOTICE:  1
! NOTICE:  2
! NOTICE:  3
! NOTICE:  4
! NOTICE:  5
! NOTICE:  6
! NOTICE:  7
! NOTICE:  8
! NOTICE:  9
! NOTICE:  10
! NOTICE:  ---2---
! NOTICE:  9
! NOTICE:  8
! NOTICE:  7
! NOTICE:  6
! NOTICE:  5
! NOTICE:  4
! NOTICE:  3
! NOTICE:  2
! NOTICE:  1
! NOTICE:  0
! NOTICE:  ---3---
! NOTICE:  1
! NOTICE:  3
! NOTICE:  5
! NOTICE:  7
! NOTICE:  9
! NOTICE:  ---4---
! NOTICE:  5
! NOTICE:  6
! NOTICE:  7
! NOTICE:  8
! NOTICE:  9
! NOTICE:  10
! NOTICE:  ---5---
! NOTICE:  30
! NOTICE:  40
! NOTICE:  ---6---
! NOTICE:  30
! NOTICE:  40
! NOTICE:  ---7---
! NOTICE:  1
! NOTICE:  2
! NOTICE:  3
! NOTICE:  ---8---
! NOTICE:  1
! NOTICE:  2
! NOTICE:  3
! NOTICE:  ---9---
! NOTICE:  10
! NOTICE:  ---10---
! NOTICE:  10
!  continue_test1 
! ----------------
!  
! (1 row)
! 
! drop function continue_test1();
! drop table conttesttbl;
! -- should fail: CONTINUE is only legal inside a loop
! create function continue_error1() returns void as $$
! begin
!     begin
!         continue;
!     end;
! end;
! $$ language plpgsql;
! ERROR:  CONTINUE cannot be used outside a loop
! LINE 4:         continue;
!                 ^
! -- should fail: unlabeled EXIT is only legal inside a loop
! create function exit_error1() returns void as $$
! begin
!     begin
!         exit;
!     end;
! end;
! $$ language plpgsql;
! ERROR:  EXIT cannot be used outside a loop, unless it has a label
! LINE 4:         exit;
!                 ^
! -- should fail: no such label
! create function continue_error2() returns void as $$
! begin
!     begin
!         loop
!             continue no_such_label;
!         end loop;
!     end;
! end;
! $$ language plpgsql;
! ERROR:  there is no label "no_such_label" attached to any block or loop enclosing this statement
! LINE 5:             continue no_such_label;
!                              ^
! -- should fail: no such label
! create function exit_error2() returns void as $$
! begin
!     begin
!         loop
!             exit no_such_label;
!         end loop;
!     end;
! end;
! $$ language plpgsql;
! ERROR:  there is no label "no_such_label" attached to any block or loop enclosing this statement
! LINE 5:             exit no_such_label;
!                          ^
! -- should fail: CONTINUE can't reference the label of a named block
! create function continue_error3() returns void as $$
! begin
!     <<begin_block1>>
!     begin
!         loop
!             continue begin_block1;
!         end loop;
!     end;
! end;
! $$ language plpgsql;
! ERROR:  block label "begin_block1" cannot be used in CONTINUE
! LINE 6:             continue begin_block1;
!                              ^
! -- On the other hand, EXIT *can* reference the label of a named block
! create function exit_block1() returns void as $$
! begin
!     <<begin_block1>>
!     begin
!         loop
!             exit begin_block1;
!             raise exception 'should not get here';
!         end loop;
!     end;
! end;
! $$ language plpgsql;
! select exit_block1();
!  exit_block1 
! -------------
!  
! (1 row)
! 
! drop function exit_block1();
! -- verbose end block and end loop
! create function end_label1() returns void as $$
! <<blbl>>
! begin
!   <<flbl1>>
!   for _i in 1 .. 10 loop
!     exit flbl1;
!   end loop flbl1;
!   <<flbl2>>
!   for _i in 1 .. 10 loop
!     exit flbl2;
!   end loop;
! end blbl;
! $$ language plpgsql;
! select end_label1();
!  end_label1 
! ------------
!  
! (1 row)
! 
! drop function end_label1();
! -- should fail: undefined end label
! create function end_label2() returns void as $$
! begin
!   for _i in 1 .. 10 loop
!     exit;
!   end loop flbl1;
! end;
! $$ language plpgsql;
! ERROR:  end label "flbl1" specified for unlabelled block
! LINE 5:   end loop flbl1;
!                    ^
! -- should fail: end label does not match start label
! create function end_label3() returns void as $$
! <<outer_label>>
! begin
!   <<inner_label>>
!   for _i in 1 .. 10 loop
!     exit;
!   end loop outer_label;
! end;
! $$ language plpgsql;
! ERROR:  end label "outer_label" differs from block's label "inner_label"
! LINE 7:   end loop outer_label;
!                    ^
! -- should fail: end label on a block without a start label
! create function end_label4() returns void as $$
! <<outer_label>>
! begin
!   for _i in 1 .. 10 loop
!     exit;
!   end loop outer_label;
! end;
! $$ language plpgsql;
! ERROR:  end label "outer_label" specified for unlabelled block
! LINE 6:   end loop outer_label;
!                    ^
! -- using list of scalars in fori and fore stmts
! create function for_vect() returns void as $proc$
! <<lbl>>declare a integer; b varchar; c varchar; r record;
! begin
!   -- fori
!   for i in 1 .. 3 loop
!     raise notice '%', i;
!   end loop;
!   -- fore with record var
!   for r in select gs as aa, 'BB' as bb, 'CC' as cc from generate_series(1,4) gs loop
!     raise notice '% % %', r.aa, r.bb, r.cc;
!   end loop;
!   -- fore with single scalar
!   for a in select gs from generate_series(1,4) gs loop
!     raise notice '%', a;
!   end loop;
!   -- fore with multiple scalars
!   for a,b,c in select gs, 'BB','CC' from generate_series(1,4) gs loop
!     raise notice '% % %', a, b, c;
!   end loop;
!   -- using qualified names in fors, fore is enabled, disabled only for fori
!   for lbl.a, lbl.b, lbl.c in execute $$select gs, 'bb','cc' from generate_series(1,4) gs$$ loop
!     raise notice '% % %', a, b, c;
!   end loop;
! end;
! $proc$ language plpgsql;
! select for_vect();
! NOTICE:  1
! NOTICE:  2
! NOTICE:  3
! NOTICE:  1 BB CC
! NOTICE:  2 BB CC
! NOTICE:  3 BB CC
! NOTICE:  4 BB CC
! NOTICE:  1
! NOTICE:  2
! NOTICE:  3
! NOTICE:  4
! NOTICE:  1 BB CC
! NOTICE:  2 BB CC
! NOTICE:  3 BB CC
! NOTICE:  4 BB CC
! NOTICE:  1 bb cc
! NOTICE:  2 bb cc
! NOTICE:  3 bb cc
! NOTICE:  4 bb cc
!  for_vect 
! ----------
!  
! (1 row)
! 
! -- regression test: verify that multiple uses of same plpgsql datum within
! -- a SQL command all get mapped to the same $n parameter.  The return value
! -- of the SELECT is not important, we only care that it doesn't fail with
! -- a complaint about an ungrouped column reference.
! create function multi_datum_use(p1 int) returns bool as $$
! declare
!   x int;
!   y int;
! begin
!   select into x,y unique1/p1, unique1/$1 from tenk1 group by unique1/p1;
!   return x = y;
! end$$ language plpgsql;
! select multi_datum_use(42);
!  multi_datum_use 
! -----------------
!  t
! (1 row)
! 
! --
! -- Test STRICT limiter in both planned and EXECUTE invocations.
! -- Note that a data-modifying query is quasi strict (disallow multi rows)
! -- by default in the planned case, but not in EXECUTE.
! --
! create temp table foo (f1 int, f2 int);
! insert into foo values (1,2), (3,4);
! create or replace function footest() returns void as $$
! declare x record;
! begin
!   -- should work
!   insert into foo values(5,6) returning * into x;
!   raise notice 'x.f1 = %, x.f2 = %', x.f1, x.f2;
! end$$ language plpgsql;
! select footest();
! NOTICE:  x.f1 = 5, x.f2 = 6
!  footest 
! ---------
!  
! (1 row)
! 
! create or replace function footest() returns void as $$
! declare x record;
! begin
!   -- should fail due to implicit strict
!   insert into foo values(7,8),(9,10) returning * into x;
!   raise notice 'x.f1 = %, x.f2 = %', x.f1, x.f2;
! end$$ language plpgsql;
! select footest();
! ERROR:  query returned more than one row
! CONTEXT:  PL/pgSQL function footest() line 5 at SQL statement
! create or replace function footest() returns void as $$
! declare x record;
! begin
!   -- should work
!   execute 'insert into foo values(5,6) returning *' into x;
!   raise notice 'x.f1 = %, x.f2 = %', x.f1, x.f2;
! end$$ language plpgsql;
! select footest();
! NOTICE:  x.f1 = 5, x.f2 = 6
!  footest 
! ---------
!  
! (1 row)
! 
! create or replace function footest() returns void as $$
! declare x record;
! begin
!   -- this should work since EXECUTE isn't as picky
!   execute 'insert into foo values(7,8),(9,10) returning *' into x;
!   raise notice 'x.f1 = %, x.f2 = %', x.f1, x.f2;
! end$$ language plpgsql;
! select footest();
! NOTICE:  x.f1 = 7, x.f2 = 8
!  footest 
! ---------
!  
! (1 row)
! 
! select * from foo;
!  f1 | f2 
! ----+----
!   1 |  2
!   3 |  4
!   5 |  6
!   5 |  6
!   7 |  8
!   9 | 10
! (6 rows)
! 
! create or replace function footest() returns void as $$
! declare x record;
! begin
!   -- should work
!   select * from foo where f1 = 3 into strict x;
!   raise notice 'x.f1 = %, x.f2 = %', x.f1, x.f2;
! end$$ language plpgsql;
! select footest();
! NOTICE:  x.f1 = 3, x.f2 = 4
!  footest 
! ---------
!  
! (1 row)
! 
! create or replace function footest() returns void as $$
! declare x record;
! begin
!   -- should fail, no rows
!   select * from foo where f1 = 0 into strict x;
!   raise notice 'x.f1 = %, x.f2 = %', x.f1, x.f2;
! end$$ language plpgsql;
! select footest();
! ERROR:  query returned no rows
! CONTEXT:  PL/pgSQL function footest() line 5 at SQL statement
! create or replace function footest() returns void as $$
! declare x record;
! begin
!   -- should fail, too many rows
!   select * from foo where f1 > 3 into strict x;
!   raise notice 'x.f1 = %, x.f2 = %', x.f1, x.f2;
! end$$ language plpgsql;
! select footest();
! ERROR:  query returned more than one row
! CONTEXT:  PL/pgSQL function footest() line 5 at SQL statement
! create or replace function footest() returns void as $$
! declare x record;
! begin
!   -- should work
!   execute 'select * from foo where f1 = 3' into strict x;
!   raise notice 'x.f1 = %, x.f2 = %', x.f1, x.f2;
! end$$ language plpgsql;
! select footest();
! NOTICE:  x.f1 = 3, x.f2 = 4
!  footest 
! ---------
!  
! (1 row)
! 
! create or replace function footest() returns void as $$
! declare x record;
! begin
!   -- should fail, no rows
!   execute 'select * from foo where f1 = 0' into strict x;
!   raise notice 'x.f1 = %, x.f2 = %', x.f1, x.f2;
! end$$ language plpgsql;
! select footest();
! ERROR:  query returned no rows
! CONTEXT:  PL/pgSQL function footest() line 5 at EXECUTE
! create or replace function footest() returns void as $$
! declare x record;
! begin
!   -- should fail, too many rows
!   execute 'select * from foo where f1 > 3' into strict x;
!   raise notice 'x.f1 = %, x.f2 = %', x.f1, x.f2;
! end$$ language plpgsql;
! select footest();
! ERROR:  query returned more than one row
! CONTEXT:  PL/pgSQL function footest() line 5 at EXECUTE
! drop function footest();
! -- test printing parameters after failure due to STRICT
! set plpgsql.print_strict_params to true;
! create or replace function footest() returns void as $$
! declare
! x record;
! p1 int := 2;
! p3 text := 'foo';
! begin
!   -- no rows
!   select * from foo where f1 = p1 and f1::text = p3 into strict x;
!   raise notice 'x.f1 = %, x.f2 = %', x.f1, x.f2;
! end$$ language plpgsql;
! select footest();
! ERROR:  query returned no rows
! DETAIL:  parameters: p1 = '2', p3 = 'foo'
! CONTEXT:  PL/pgSQL function footest() line 8 at SQL statement
! create or replace function footest() returns void as $$
! declare
! x record;
! p1 int := 2;
! p3 text := 'foo';
! begin
!   -- too many rows
!   select * from foo where f1 > p1 or f1::text = p3  into strict x;
!   raise notice 'x.f1 = %, x.f2 = %', x.f1, x.f2;
! end$$ language plpgsql;
! select footest();
! ERROR:  query returned more than one row
! DETAIL:  parameters: p1 = '2', p3 = 'foo'
! CONTEXT:  PL/pgSQL function footest() line 8 at SQL statement
! create or replace function footest() returns void as $$
! declare x record;
! begin
!   -- too many rows, no params
!   select * from foo where f1 > 3 into strict x;
!   raise notice 'x.f1 = %, x.f2 = %', x.f1, x.f2;
! end$$ language plpgsql;
! select footest();
! ERROR:  query returned more than one row
! CONTEXT:  PL/pgSQL function footest() line 5 at SQL statement
! create or replace function footest() returns void as $$
! declare x record;
! begin
!   -- no rows
!   execute 'select * from foo where f1 = $1 or f1::text = $2' using 0, 'foo' into strict x;
!   raise notice 'x.f1 = %, x.f2 = %', x.f1, x.f2;
! end$$ language plpgsql;
! select footest();
! ERROR:  query returned no rows
! DETAIL:  parameters: $1 = '0', $2 = 'foo'
! CONTEXT:  PL/pgSQL function footest() line 5 at EXECUTE
! create or replace function footest() returns void as $$
! declare x record;
! begin
!   -- too many rows
!   execute 'select * from foo where f1 > $1' using 1 into strict x;
!   raise notice 'x.f1 = %, x.f2 = %', x.f1, x.f2;
! end$$ language plpgsql;
! select footest();
! ERROR:  query returned more than one row
! DETAIL:  parameters: $1 = '1'
! CONTEXT:  PL/pgSQL function footest() line 5 at EXECUTE
! create or replace function footest() returns void as $$
! declare x record;
! begin
!   -- too many rows, no parameters
!   execute 'select * from foo where f1 > 3' into strict x;
!   raise notice 'x.f1 = %, x.f2 = %', x.f1, x.f2;
! end$$ language plpgsql;
! select footest();
! ERROR:  query returned more than one row
! CONTEXT:  PL/pgSQL function footest() line 5 at EXECUTE
! create or replace function footest() returns void as $$
! -- override the global
! #print_strict_params off
! declare
! x record;
! p1 int := 2;
! p3 text := 'foo';
! begin
!   -- too many rows
!   select * from foo where f1 > p1 or f1::text = p3  into strict x;
!   raise notice 'x.f1 = %, x.f2 = %', x.f1, x.f2;
! end$$ language plpgsql;
! select footest();
! ERROR:  query returned more than one row
! CONTEXT:  PL/pgSQL function footest() line 10 at SQL statement
! reset plpgsql.print_strict_params;
! create or replace function footest() returns void as $$
! -- override the global
! #print_strict_params on
! declare
! x record;
! p1 int := 2;
! p3 text := 'foo';
! begin
!   -- too many rows
!   select * from foo where f1 > p1 or f1::text = p3  into strict x;
!   raise notice 'x.f1 = %, x.f2 = %', x.f1, x.f2;
! end$$ language plpgsql;
! select footest();
! ERROR:  query returned more than one row
! DETAIL:  parameters: p1 = '2', p3 = 'foo'
! CONTEXT:  PL/pgSQL function footest() line 10 at SQL statement
! -- test warnings and errors
! set plpgsql.extra_warnings to 'all';
! set plpgsql.extra_warnings to 'none';
! set plpgsql.extra_errors to 'all';
! set plpgsql.extra_errors to 'none';
! -- test warnings when shadowing a variable
! set plpgsql.extra_warnings to 'shadowed_variables';
! -- simple shadowing of input and output parameters
! create or replace function shadowtest(in1 int)
! 	returns table (out1 int) as $$
! declare
! in1 int;
! out1 int;
! begin
! end
! $$ language plpgsql;
! WARNING:  variable "in1" shadows a previously defined variable
! LINE 4: in1 int;
!         ^
! WARNING:  variable "out1" shadows a previously defined variable
! LINE 5: out1 int;
!         ^
! select shadowtest(1);
!  shadowtest 
! ------------
! (0 rows)
! 
! set plpgsql.extra_warnings to 'shadowed_variables';
! select shadowtest(1);
!  shadowtest 
! ------------
! (0 rows)
! 
! create or replace function shadowtest(in1 int)
! 	returns table (out1 int) as $$
! declare
! in1 int;
! out1 int;
! begin
! end
! $$ language plpgsql;
! WARNING:  variable "in1" shadows a previously defined variable
! LINE 4: in1 int;
!         ^
! WARNING:  variable "out1" shadows a previously defined variable
! LINE 5: out1 int;
!         ^
! select shadowtest(1);
!  shadowtest 
! ------------
! (0 rows)
! 
! drop function shadowtest(int);
! -- shadowing in a second DECLARE block
! create or replace function shadowtest()
! 	returns void as $$
! declare
! f1 int;
! begin
! 	declare
! 	f1 int;
! 	begin
! 	end;
! end$$ language plpgsql;
! WARNING:  variable "f1" shadows a previously defined variable
! LINE 7:  f1 int;
!          ^
! drop function shadowtest();
! -- several levels of shadowing
! create or replace function shadowtest(in1 int)
! 	returns void as $$
! declare
! in1 int;
! begin
! 	declare
! 	in1 int;
! 	begin
! 	end;
! end$$ language plpgsql;
! WARNING:  variable "in1" shadows a previously defined variable
! LINE 4: in1 int;
!         ^
! WARNING:  variable "in1" shadows a previously defined variable
! LINE 7:  in1 int;
!          ^
! drop function shadowtest(int);
! -- shadowing in cursor definitions
! create or replace function shadowtest()
! 	returns void as $$
! declare
! f1 int;
! c1 cursor (f1 int) for select 1;
! begin
! end$$ language plpgsql;
! WARNING:  variable "f1" shadows a previously defined variable
! LINE 5: c1 cursor (f1 int) for select 1;
!                    ^
! drop function shadowtest();
! -- test errors when shadowing a variable
! set plpgsql.extra_errors to 'shadowed_variables';
! create or replace function shadowtest(f1 int)
! 	returns boolean as $$
! declare f1 int; begin return 1; end $$ language plpgsql;
! ERROR:  variable "f1" shadows a previously defined variable
! LINE 3: declare f1 int; begin return 1; end $$ language plpgsql;
!                 ^
! select shadowtest(1);
! ERROR:  function shadowtest(integer) does not exist
! LINE 1: select shadowtest(1);
!                ^
! HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
! reset plpgsql.extra_errors;
! reset plpgsql.extra_warnings;
! create or replace function shadowtest(f1 int)
! 	returns boolean as $$
! declare f1 int; begin return 1; end $$ language plpgsql;
! select shadowtest(1);
!  shadowtest 
! ------------
!  t
! (1 row)
! 
! -- test scrollable cursor support
! create function sc_test() returns setof integer as $$
! declare
!   c scroll cursor for select f1 from int4_tbl;
!   x integer;
! begin
!   open c;
!   fetch last from c into x;
!   while found loop
!     return next x;
!     fetch prior from c into x;
!   end loop;
!   close c;
! end;
! $$ language plpgsql;
! select * from sc_test();
!    sc_test   
! -------------
!  -2147483647
!   2147483647
!      -123456
!       123456
!            0
! (5 rows)
! 
! create or replace function sc_test() returns setof integer as $$
! declare
!   c no scroll cursor for select f1 from int4_tbl;
!   x integer;
! begin
!   open c;
!   fetch last from c into x;
!   while found loop
!     return next x;
!     fetch prior from c into x;
!   end loop;
!   close c;
! end;
! $$ language plpgsql;
! select * from sc_test();  -- fails because of NO SCROLL specification
! ERROR:  cursor can only scan forward
! HINT:  Declare it with SCROLL option to enable backward scan.
! CONTEXT:  PL/pgSQL function sc_test() line 7 at FETCH
! create or replace function sc_test() returns setof integer as $$
! declare
!   c refcursor;
!   x integer;
! begin
!   open c scroll for select f1 from int4_tbl;
!   fetch last from c into x;
!   while found loop
!     return next x;
!     fetch prior from c into x;
!   end loop;
!   close c;
! end;
! $$ language plpgsql;
! select * from sc_test();
!    sc_test   
! -------------
!  -2147483647
!   2147483647
!      -123456
!       123456
!            0
! (5 rows)
! 
! create or replace function sc_test() returns setof integer as $$
! declare
!   c refcursor;
!   x integer;
! begin
!   open c scroll for execute 'select f1 from int4_tbl';
!   fetch last from c into x;
!   while found loop
!     return next x;
!     fetch relative -2 from c into x;
!   end loop;
!   close c;
! end;
! $$ language plpgsql;
! select * from sc_test();
!    sc_test   
! -------------
!  -2147483647
!      -123456
!            0
! (3 rows)
! 
! create or replace function sc_test() returns setof integer as $$
! declare
!   c refcursor;
!   x integer;
! begin
!   open c scroll for execute 'select f1 from int4_tbl';
!   fetch last from c into x;
!   while found loop
!     return next x;
!     move backward 2 from c;
!     fetch relative -1 from c into x;
!   end loop;
!   close c;
! end;
! $$ language plpgsql;
! select * from sc_test();
!    sc_test   
! -------------
!  -2147483647
!       123456
! (2 rows)
! 
! create or replace function sc_test() returns setof integer as $$
! declare
!   c cursor for select * from generate_series(1, 10);
!   x integer;
! begin
!   open c;
!   loop
!       move relative 2 in c;
!       if not found then
!           exit;
!       end if;
!       fetch next from c into x;
!       if found then
!           return next x;
!       end if;
!   end loop;
!   close c;
! end;
! $$ language plpgsql;
! select * from sc_test();
!  sc_test 
! ---------
!        3
!        6
!        9
! (3 rows)
! 
! create or replace function sc_test() returns setof integer as $$
! declare
!   c cursor for select * from generate_series(1, 10);
!   x integer;
! begin
!   open c;
!   move forward all in c;
!   fetch backward from c into x;
!   if found then
!     return next x;
!   end if;
!   close c;
! end;
! $$ language plpgsql;
! select * from sc_test();
!  sc_test 
! ---------
!       10
! (1 row)
! 
! drop function sc_test();
! -- test qualified variable names
! create function pl_qual_names (param1 int) returns void as $$
! <<outerblock>>
! declare
!   param1 int := 1;
! begin
!   <<innerblock>>
!   declare
!     param1 int := 2;
!   begin
!     raise notice 'param1 = %', param1;
!     raise notice 'pl_qual_names.param1 = %', pl_qual_names.param1;
!     raise notice 'outerblock.param1 = %', outerblock.param1;
!     raise notice 'innerblock.param1 = %', innerblock.param1;
!   end;
! end;
! $$ language plpgsql;
! select pl_qual_names(42);
! NOTICE:  param1 = 2
! NOTICE:  pl_qual_names.param1 = 42
! NOTICE:  outerblock.param1 = 1
! NOTICE:  innerblock.param1 = 2
!  pl_qual_names 
! ---------------
!  
! (1 row)
! 
! drop function pl_qual_names(int);
! -- tests for RETURN QUERY
! create function ret_query1(out int, out int) returns setof record as $$
! begin
!     $1 := -1;
!     $2 := -2;
!     return next;
!     return query select x + 1, x * 10 from generate_series(0, 10) s (x);
!     return next;
! end;
! $$ language plpgsql;
! select * from ret_query1();
!  column1 | column2 
! ---------+---------
!       -1 |      -2
!        1 |       0
!        2 |      10
!        3 |      20
!        4 |      30
!        5 |      40
!        6 |      50
!        7 |      60
!        8 |      70
!        9 |      80
!       10 |      90
!       11 |     100
!       -1 |      -2
! (13 rows)
! 
! create type record_type as (x text, y int, z boolean);
! create or replace function ret_query2(lim int) returns setof record_type as $$
! begin
!     return query select md5(s.x::text), s.x, s.x > 0
!                  from generate_series(-8, lim) s (x) where s.x % 2 = 0;
! end;
! $$ language plpgsql;
! select * from ret_query2(8);
!                 x                 | y  | z 
! ----------------------------------+----+---
!  a8d2ec85eaf98407310b72eb73dda247 | -8 | f
!  596a3d04481816330f07e4f97510c28f | -6 | f
!  0267aaf632e87a63288a08331f22c7c3 | -4 | f
!  5d7b9adcbe1c629ec722529dd12e5129 | -2 | f
!  cfcd208495d565ef66e7dff9f98764da |  0 | f
!  c81e728d9d4c2f636f067f89cc14862c |  2 | t
!  a87ff679a2f3e71d9181a67b7542122c |  4 | t
!  1679091c5a880faf6fb5e6087eb1b2dc |  6 | t
!  c9f0f895fb98ab9159f51fd0297e236d |  8 | t
! (9 rows)
! 
! -- test EXECUTE USING
! create function exc_using(int, text) returns int as $$
! declare i int;
! begin
!   for i in execute 'select * from generate_series(1,$1)' using $1+1 loop
!     raise notice '%', i;
!   end loop;
!   execute 'select $2 + $2*3 + length($1)' into i using $2,$1;
!   return i;
! end
! $$ language plpgsql;
! select exc_using(5, 'foobar');
! NOTICE:  1
! NOTICE:  2
! NOTICE:  3
! NOTICE:  4
! NOTICE:  5
! NOTICE:  6
!  exc_using 
! -----------
!         26
! (1 row)
! 
! drop function exc_using(int, text);
! create or replace function exc_using(int) returns void as $$
! declare
!   c refcursor;
!   i int;
! begin
!   open c for execute 'select * from generate_series(1,$1)' using $1+1;
!   loop
!     fetch c into i;
!     exit when not found;
!     raise notice '%', i;
!   end loop;
!   close c;
!   return;
! end;
! $$ language plpgsql;
! select exc_using(5);
! NOTICE:  1
! NOTICE:  2
! NOTICE:  3
! NOTICE:  4
! NOTICE:  5
! NOTICE:  6
!  exc_using 
! -----------
!  
! (1 row)
! 
! drop function exc_using(int);
! -- test FOR-over-cursor
! create or replace function forc01() returns void as $$
! declare
!   c cursor(r1 integer, r2 integer)
!        for select * from generate_series(r1,r2) i;
!   c2 cursor
!        for select * from generate_series(41,43) i;
! begin
!   for r in c(5,7) loop
!     raise notice '% from %', r.i, c;
!   end loop;
!   -- again, to test if cursor was closed properly
!   for r in c(9,10) loop
!     raise notice '% from %', r.i, c;
!   end loop;
!   -- and test a parameterless cursor
!   for r in c2 loop
!     raise notice '% from %', r.i, c2;
!   end loop;
!   -- and try it with a hand-assigned name
!   raise notice 'after loop, c2 = %', c2;
!   c2 := 'special_name';
!   for r in c2 loop
!     raise notice '% from %', r.i, c2;
!   end loop;
!   raise notice 'after loop, c2 = %', c2;
!   -- and try it with a generated name
!   -- (which we can't show in the output because it's variable)
!   c2 := null;
!   for r in c2 loop
!     raise notice '%', r.i;
!   end loop;
!   raise notice 'after loop, c2 = %', c2;
!   return;
! end;
! $$ language plpgsql;
! select forc01();
! NOTICE:  5 from c
! NOTICE:  6 from c
! NOTICE:  7 from c
! NOTICE:  9 from c
! NOTICE:  10 from c
! NOTICE:  41 from c2
! NOTICE:  42 from c2
! NOTICE:  43 from c2
! NOTICE:  after loop, c2 = c2
! NOTICE:  41 from special_name
! NOTICE:  42 from special_name
! NOTICE:  43 from special_name
! NOTICE:  after loop, c2 = special_name
! NOTICE:  41
! NOTICE:  42
! NOTICE:  43
! NOTICE:  after loop, c2 = <NULL>
!  forc01 
! --------
!  
! (1 row)
! 
! -- try updating the cursor's current row
! create temp table forc_test as
!   select n as i, n as j from generate_series(1,10) n;
! create or replace function forc01() returns void as $$
! declare
!   c cursor for select * from forc_test;
! begin
!   for r in c loop
!     raise notice '%, %', r.i, r.j;
!     update forc_test set i = i * 100, j = r.j * 2 where current of c;
!   end loop;
! end;
! $$ language plpgsql;
! select forc01();
! NOTICE:  1, 1
! NOTICE:  2, 2
! NOTICE:  3, 3
! NOTICE:  4, 4
! NOTICE:  5, 5
! NOTICE:  6, 6
! NOTICE:  7, 7
! NOTICE:  8, 8
! NOTICE:  9, 9
! NOTICE:  10, 10
!  forc01 
! --------
!  
! (1 row)
! 
! select * from forc_test;
!   i   | j  
! ------+----
!   100 |  2
!   200 |  4
!   300 |  6
!   400 |  8
!   500 | 10
!   600 | 12
!   700 | 14
!   800 | 16
!   900 | 18
!  1000 | 20
! (10 rows)
! 
! -- same, with a cursor whose portal name doesn't match variable name
! create or replace function forc01() returns void as $$
! declare
!   c refcursor := 'fooled_ya';
!   r record;
! begin
!   open c for select * from forc_test;
!   loop
!     fetch c into r;
!     exit when not found;
!     raise notice '%, %', r.i, r.j;
!     update forc_test set i = i * 100, j = r.j * 2 where current of c;
!   end loop;
! end;
! $$ language plpgsql;
! select forc01();
! NOTICE:  100, 2
! NOTICE:  200, 4
! NOTICE:  300, 6
! NOTICE:  400, 8
! NOTICE:  500, 10
! NOTICE:  600, 12
! NOTICE:  700, 14
! NOTICE:  800, 16
! NOTICE:  900, 18
! NOTICE:  1000, 20
!  forc01 
! --------
!  
! (1 row)
! 
! select * from forc_test;
!    i    | j  
! --------+----
!   10000 |  4
!   20000 |  8
!   30000 | 12
!   40000 | 16
!   50000 | 20
!   60000 | 24
!   70000 | 28
!   80000 | 32
!   90000 | 36
!  100000 | 40
! (10 rows)
! 
! drop function forc01();
! -- fail because cursor has no query bound to it
! create or replace function forc_bad() returns void as $$
! declare
!   c refcursor;
! begin
!   for r in c loop
!     raise notice '%', r.i;
!   end loop;
! end;
! $$ language plpgsql;
! ERROR:  cursor FOR loop must use a bound cursor variable
! LINE 5:   for r in c loop
!                    ^
! -- test RETURN QUERY EXECUTE
! create or replace function return_dquery()
! returns setof int as $$
! begin
!   return query execute 'select * from (values(10),(20)) f';
!   return query execute 'select * from (values($1),($2)) f' using 40,50;
! end;
! $$ language plpgsql;
! select * from return_dquery();
!  return_dquery 
! ---------------
!             10
!             20
!             40
!             50
! (4 rows)
! 
! drop function return_dquery();
! -- test RETURN QUERY with dropped columns
! create table tabwithcols(a int, b int, c int, d int);
! insert into tabwithcols values(10,20,30,40),(50,60,70,80);
! create or replace function returnqueryf()
! returns setof tabwithcols as $$
! begin
!   return query select * from tabwithcols;
!   return query execute 'select * from tabwithcols';
! end;
! $$ language plpgsql;
! select * from returnqueryf();
!  a  | b  | c  | d  
! ----+----+----+----
!  10 | 20 | 30 | 40
!  50 | 60 | 70 | 80
!  10 | 20 | 30 | 40
!  50 | 60 | 70 | 80
! (4 rows)
! 
! alter table tabwithcols drop column b;
! select * from returnqueryf();
!  a  | c  | d  
! ----+----+----
!  10 | 30 | 40
!  50 | 70 | 80
!  10 | 30 | 40
!  50 | 70 | 80
! (4 rows)
! 
! alter table tabwithcols drop column d;
! select * from returnqueryf();
!  a  | c  
! ----+----
!  10 | 30
!  50 | 70
!  10 | 30
!  50 | 70
! (4 rows)
! 
! alter table tabwithcols add column d int;
! select * from returnqueryf();
!  a  | c  | d 
! ----+----+---
!  10 | 30 |  
!  50 | 70 |  
!  10 | 30 |  
!  50 | 70 |  
! (4 rows)
! 
! drop function returnqueryf();
! drop table tabwithcols;
! --
! -- Tests for composite-type results
! --
! create type compostype as (x int, y varchar);
! -- test: use of variable of composite type in return statement
! create or replace function compos() returns compostype as $$
! declare
!   v compostype;
! begin
!   v := (1, 'hello');
!   return v;
! end;
! $$ language plpgsql;
! select compos();
!   compos   
! -----------
!  (1,hello)
! (1 row)
! 
! -- test: use of variable of record type in return statement
! create or replace function compos() returns compostype as $$
! declare
!   v record;
! begin
!   v := (1, 'hello'::varchar);
!   return v;
! end;
! $$ language plpgsql;
! select compos();
!   compos   
! -----------
!  (1,hello)
! (1 row)
! 
! -- test: use of row expr in return statement
! create or replace function compos() returns compostype as $$
! begin
!   return (1, 'hello'::varchar);
! end;
! $$ language plpgsql;
! select compos();
!   compos   
! -----------
!  (1,hello)
! (1 row)
! 
! -- this does not work currently (no implicit casting)
! create or replace function compos() returns compostype as $$
! begin
!   return (1, 'hello');
! end;
! $$ language plpgsql;
! select compos();
! ERROR:  returned record type does not match expected record type
! DETAIL:  Returned type unknown does not match expected type character varying in column 2.
! CONTEXT:  PL/pgSQL function compos() while casting return value to function's return type
! -- ... but this does
! create or replace function compos() returns compostype as $$
! begin
!   return (1, 'hello')::compostype;
! end;
! $$ language plpgsql;
! select compos();
!   compos   
! -----------
!  (1,hello)
! (1 row)
! 
! drop function compos();
! -- test: return a row expr as record.
! create or replace function composrec() returns record as $$
! declare
!   v record;
! begin
!   v := (1, 'hello');
!   return v;
! end;
! $$ language plpgsql;
! select composrec();
!  composrec 
! -----------
!  (1,hello)
! (1 row)
! 
! -- test: return row expr in return statement.
! create or replace function composrec() returns record as $$
! begin
!   return (1, 'hello');
! end;
! $$ language plpgsql;
! select composrec();
!  composrec 
! -----------
!  (1,hello)
! (1 row)
! 
! drop function composrec();
! -- test: row expr in RETURN NEXT statement.
! create or replace function compos() returns setof compostype as $$
! begin
!   for i in 1..3
!   loop
!     return next (1, 'hello'::varchar);
!   end loop;
!   return next null::compostype;
!   return next (2, 'goodbye')::compostype;
! end;
! $$ language plpgsql;
! select * from compos();
!  x |    y    
! ---+---------
!  1 | hello
!  1 | hello
!  1 | hello
!    | 
!  2 | goodbye
! (5 rows)
! 
! drop function compos();
! -- test: use invalid expr in return statement.
! create or replace function compos() returns compostype as $$
! begin
!   return 1 + 1;
! end;
! $$ language plpgsql;
! select compos();
! ERROR:  cannot return non-composite value from function returning composite type
! CONTEXT:  PL/pgSQL function compos() line 3 at RETURN
! -- RETURN variable is a different code path ...
! create or replace function compos() returns compostype as $$
! declare x int := 42;
! begin
!   return x;
! end;
! $$ language plpgsql;
! select * from compos();
! ERROR:  cannot return non-composite value from function returning composite type
! CONTEXT:  PL/pgSQL function compos() line 4 at RETURN
! drop function compos();
! -- test: invalid use of composite variable in scalar-returning function
! create or replace function compos() returns int as $$
! declare
!   v compostype;
! begin
!   v := (1, 'hello');
!   return v;
! end;
! $$ language plpgsql;
! select compos();
! ERROR:  invalid input syntax for integer: "(1,hello)"
! CONTEXT:  PL/pgSQL function compos() while casting return value to function's return type
! -- test: invalid use of composite expression in scalar-returning function
! create or replace function compos() returns int as $$
! begin
!   return (1, 'hello')::compostype;
! end;
! $$ language plpgsql;
! select compos();
! ERROR:  invalid input syntax for integer: "(1,hello)"
! CONTEXT:  PL/pgSQL function compos() while casting return value to function's return type
! drop function compos();
! drop type compostype;
! --
! -- Tests for 8.4's new RAISE features
! --
! create or replace function raise_test() returns void as $$
! begin
!   raise notice '% % %', 1, 2, 3
!      using errcode = '55001', detail = 'some detail info', hint = 'some hint';
!   raise '% % %', 1, 2, 3
!      using errcode = 'division_by_zero', detail = 'some detail info';
! end;
! $$ language plpgsql;
! select raise_test();
! NOTICE:  1 2 3
! DETAIL:  some detail info
! HINT:  some hint
! ERROR:  1 2 3
! DETAIL:  some detail info
! CONTEXT:  PL/pgSQL function raise_test() line 5 at RAISE
! -- Since we can't actually see the thrown SQLSTATE in default psql output,
! -- test it like this; this also tests re-RAISE
! create or replace function raise_test() returns void as $$
! begin
!   raise 'check me'
!      using errcode = 'division_by_zero', detail = 'some detail info';
!   exception
!     when others then
!       raise notice 'SQLSTATE: % SQLERRM: %', sqlstate, sqlerrm;
!       raise;
! end;
! $$ language plpgsql;
! select raise_test();
! NOTICE:  SQLSTATE: 22012 SQLERRM: check me
! ERROR:  check me
! DETAIL:  some detail info
! CONTEXT:  PL/pgSQL function raise_test() line 3 at RAISE
! create or replace function raise_test() returns void as $$
! begin
!   raise 'check me'
!      using errcode = '1234F', detail = 'some detail info';
!   exception
!     when others then
!       raise notice 'SQLSTATE: % SQLERRM: %', sqlstate, sqlerrm;
!       raise;
! end;
! $$ language plpgsql;
! select raise_test();
! NOTICE:  SQLSTATE: 1234F SQLERRM: check me
! ERROR:  check me
! DETAIL:  some detail info
! CONTEXT:  PL/pgSQL function raise_test() line 3 at RAISE
! -- SQLSTATE specification in WHEN
! create or replace function raise_test() returns void as $$
! begin
!   raise 'check me'
!      using errcode = '1234F', detail = 'some detail info';
!   exception
!     when sqlstate '1234F' then
!       raise notice 'SQLSTATE: % SQLERRM: %', sqlstate, sqlerrm;
!       raise;
! end;
! $$ language plpgsql;
! select raise_test();
! NOTICE:  SQLSTATE: 1234F SQLERRM: check me
! ERROR:  check me
! DETAIL:  some detail info
! CONTEXT:  PL/pgSQL function raise_test() line 3 at RAISE
! create or replace function raise_test() returns void as $$
! begin
!   raise division_by_zero using detail = 'some detail info';
!   exception
!     when others then
!       raise notice 'SQLSTATE: % SQLERRM: %', sqlstate, sqlerrm;
!       raise;
! end;
! $$ language plpgsql;
! select raise_test();
! NOTICE:  SQLSTATE: 22012 SQLERRM: division_by_zero
! ERROR:  division_by_zero
! DETAIL:  some detail info
! CONTEXT:  PL/pgSQL function raise_test() line 3 at RAISE
! create or replace function raise_test() returns void as $$
! begin
!   raise division_by_zero;
! end;
! $$ language plpgsql;
! select raise_test();
! ERROR:  division_by_zero
! CONTEXT:  PL/pgSQL function raise_test() line 3 at RAISE
! create or replace function raise_test() returns void as $$
! begin
!   raise sqlstate '1234F';
! end;
! $$ language plpgsql;
! select raise_test();
! ERROR:  1234F
! CONTEXT:  PL/pgSQL function raise_test() line 3 at RAISE
! create or replace function raise_test() returns void as $$
! begin
!   raise division_by_zero using message = 'custom' || ' message';
! end;
! $$ language plpgsql;
! select raise_test();
! ERROR:  custom message
! CONTEXT:  PL/pgSQL function raise_test() line 3 at RAISE
! create or replace function raise_test() returns void as $$
! begin
!   raise using message = 'custom' || ' message', errcode = '22012';
! end;
! $$ language plpgsql;
! select raise_test();
! ERROR:  custom message
! CONTEXT:  PL/pgSQL function raise_test() line 3 at RAISE
! -- conflict on message
! create or replace function raise_test() returns void as $$
! begin
!   raise notice 'some message' using message = 'custom' || ' message', errcode = '22012';
! end;
! $$ language plpgsql;
! select raise_test();
! ERROR:  RAISE option already specified: MESSAGE
! CONTEXT:  PL/pgSQL function raise_test() line 3 at RAISE
! -- conflict on errcode
! create or replace function raise_test() returns void as $$
! begin
!   raise division_by_zero using message = 'custom' || ' message', errcode = '22012';
! end;
! $$ language plpgsql;
! select raise_test();
! ERROR:  RAISE option already specified: ERRCODE
! CONTEXT:  PL/pgSQL function raise_test() line 3 at RAISE
! -- nothing to re-RAISE
! create or replace function raise_test() returns void as $$
! begin
!   raise;
! end;
! $$ language plpgsql;
! select raise_test();
! ERROR:  RAISE without parameters cannot be used outside an exception handler
! CONTEXT:  PL/pgSQL function raise_test() line 3 at RAISE
! -- test access to exception data
! create function zero_divide() returns int as $$
! declare v int := 0;
! begin
!   return 10 / v;
! end;
! $$ language plpgsql;
! create or replace function raise_test() returns void as $$
! begin
!   raise exception 'custom exception'
!      using detail = 'some detail of custom exception',
!            hint = 'some hint related to custom exception';
! end;
! $$ language plpgsql;
! create function stacked_diagnostics_test() returns void as $$
! declare _sqlstate text;
!         _message text;
!         _context text;
! begin
!   perform zero_divide();
! exception when others then
!   get stacked diagnostics
!         _sqlstate = returned_sqlstate,
!         _message = message_text,
!         _context = pg_exception_context;
!   raise notice 'sqlstate: %, message: %, context: [%]',
!     _sqlstate, _message, replace(_context, E'\n', ' <- ');
! end;
! $$ language plpgsql;
! select stacked_diagnostics_test();
! NOTICE:  sqlstate: 22012, message: division by zero, context: [PL/pgSQL function zero_divide() line 4 at RETURN <- SQL statement "SELECT zero_divide()" <- PL/pgSQL function stacked_diagnostics_test() line 6 at PERFORM]
!  stacked_diagnostics_test 
! --------------------------
!  
! (1 row)
! 
! create or replace function stacked_diagnostics_test() returns void as $$
! declare _detail text;
!         _hint text;
!         _message text;
! begin
!   perform raise_test();
! exception when others then
!   get stacked diagnostics
!         _message = message_text,
!         _detail = pg_exception_detail,
!         _hint = pg_exception_hint;
!   raise notice 'message: %, detail: %, hint: %', _message, _detail, _hint;
! end;
! $$ language plpgsql;
! select stacked_diagnostics_test();
! NOTICE:  message: custom exception, detail: some detail of custom exception, hint: some hint related to custom exception
!  stacked_diagnostics_test 
! --------------------------
!  
! (1 row)
! 
! -- fail, cannot use stacked diagnostics statement outside handler
! create or replace function stacked_diagnostics_test() returns void as $$
! declare _detail text;
!         _hint text;
!         _message text;
! begin
!   get stacked diagnostics
!         _message = message_text,
!         _detail = pg_exception_detail,
!         _hint = pg_exception_hint;
!   raise notice 'message: %, detail: %, hint: %', _message, _detail, _hint;
! end;
! $$ language plpgsql;
! select stacked_diagnostics_test();
! ERROR:  GET STACKED DIAGNOSTICS cannot be used outside an exception handler
! CONTEXT:  PL/pgSQL function stacked_diagnostics_test() line 6 at GET STACKED DIAGNOSTICS
! drop function zero_divide();
! drop function stacked_diagnostics_test();
! -- check cases where implicit SQLSTATE variable could be confused with
! -- SQLSTATE as a keyword, cf bug #5524
! create or replace function raise_test() returns void as $$
! begin
!   perform 1/0;
! exception
!   when sqlstate '22012' then
!     raise notice using message = sqlstate;
!     raise sqlstate '22012' using message = 'substitute message';
! end;
! $$ language plpgsql;
! select raise_test();
! NOTICE:  22012
! ERROR:  substitute message
! CONTEXT:  PL/pgSQL function raise_test() line 7 at RAISE
! drop function raise_test();
! -- test passing column_name, constraint_name, datatype_name, table_name
! -- and schema_name error fields
! create or replace function stacked_diagnostics_test() returns void as $$
! declare _column_name text;
!         _constraint_name text;
!         _datatype_name text;
!         _table_name text;
!         _schema_name text;
! begin
!   raise exception using
!     column = '>>some column name<<',
!     constraint = '>>some constraint name<<',
!     datatype = '>>some datatype name<<',
!     table = '>>some table name<<',
!     schema = '>>some schema name<<';
! exception when others then
!   get stacked diagnostics
!         _column_name = column_name,
!         _constraint_name = constraint_name,
!         _datatype_name = pg_datatype_name,
!         _table_name = table_name,
!         _schema_name = schema_name;
!   raise notice 'column %, constraint %, type %, table %, schema %',
!     _column_name, _constraint_name, _datatype_name, _table_name, _schema_name;
! end;
! $$ language plpgsql;
! select stacked_diagnostics_test();
! NOTICE:  column >>some column name<<, constraint >>some constraint name<<, type >>some datatype name<<, table >>some table name<<, schema >>some schema name<<
!  stacked_diagnostics_test 
! --------------------------
!  
! (1 row)
! 
! drop function stacked_diagnostics_test();
! -- test CASE statement
! create or replace function case_test(bigint) returns text as $$
! declare a int = 10;
!         b int = 1;
! begin
!   case $1
!     when 1 then
!       return 'one';
!     when 2 then
!       return 'two';
!     when 3,4,3+5 then
!       return 'three, four or eight';
!     when a then
!       return 'ten';
!     when a+b, a+b+1 then
!       return 'eleven, twelve';
!   end case;
! end;
! $$ language plpgsql immutable;
! select case_test(1);
!  case_test 
! -----------
!  one
! (1 row)
! 
! select case_test(2);
!  case_test 
! -----------
!  two
! (1 row)
! 
! select case_test(3);
!       case_test       
! ----------------------
!  three, four or eight
! (1 row)
! 
! select case_test(4);
!       case_test       
! ----------------------
!  three, four or eight
! (1 row)
! 
! select case_test(5); -- fails
! ERROR:  case not found
! HINT:  CASE statement is missing ELSE part.
! CONTEXT:  PL/pgSQL function case_test(bigint) line 5 at CASE
! select case_test(8);
!       case_test       
! ----------------------
!  three, four or eight
! (1 row)
! 
! select case_test(10);
!  case_test 
! -----------
!  ten
! (1 row)
! 
! select case_test(11);
!    case_test    
! ----------------
!  eleven, twelve
! (1 row)
! 
! select case_test(12);
!    case_test    
! ----------------
!  eleven, twelve
! (1 row)
! 
! select case_test(13); -- fails
! ERROR:  case not found
! HINT:  CASE statement is missing ELSE part.
! CONTEXT:  PL/pgSQL function case_test(bigint) line 5 at CASE
! create or replace function catch() returns void as $$
! begin
!   raise notice '%', case_test(6);
! exception
!   when case_not_found then
!     raise notice 'caught case_not_found % %', SQLSTATE, SQLERRM;
! end
! $$ language plpgsql;
! select catch();
! NOTICE:  caught case_not_found 20000 case not found
!  catch 
! -------
!  
! (1 row)
! 
! -- test the searched variant too, as well as ELSE
! create or replace function case_test(bigint) returns text as $$
! declare a int = 10;
! begin
!   case
!     when $1 = 1 then
!       return 'one';
!     when $1 = a + 2 then
!       return 'twelve';
!     else
!       return 'other';
!   end case;
! end;
! $$ language plpgsql immutable;
! select case_test(1);
!  case_test 
! -----------
!  one
! (1 row)
! 
! select case_test(2);
!  case_test 
! -----------
!  other
! (1 row)
! 
! select case_test(12);
!  case_test 
! -----------
!  twelve
! (1 row)
! 
! select case_test(13);
!  case_test 
! -----------
!  other
! (1 row)
! 
! drop function catch();
! drop function case_test(bigint);
! -- test variadic functions
! create or replace function vari(variadic int[])
! returns void as $$
! begin
!   for i in array_lower($1,1)..array_upper($1,1) loop
!     raise notice '%', $1[i];
!   end loop; end;
! $$ language plpgsql;
! select vari(1,2,3,4,5);
! NOTICE:  1
! NOTICE:  2
! NOTICE:  3
! NOTICE:  4
! NOTICE:  5
!  vari 
! ------
!  
! (1 row)
! 
! select vari(3,4,5);
! NOTICE:  3
! NOTICE:  4
! NOTICE:  5
!  vari 
! ------
!  
! (1 row)
! 
! select vari(variadic array[5,6,7]);
! NOTICE:  5
! NOTICE:  6
! NOTICE:  7
!  vari 
! ------
!  
! (1 row)
! 
! drop function vari(int[]);
! -- coercion test
! create or replace function pleast(variadic numeric[])
! returns numeric as $$
! declare aux numeric = $1[array_lower($1,1)];
! begin
!   for i in array_lower($1,1)+1..array_upper($1,1) loop
!     if $1[i] < aux then aux := $1[i]; end if;
!   end loop;
!   return aux;
! end;
! $$ language plpgsql immutable strict;
! select pleast(10,1,2,3,-16);
!  pleast 
! --------
!     -16
! (1 row)
! 
! select pleast(10.2,2.2,-1.1);
!  pleast 
! --------
!    -1.1
! (1 row)
! 
! select pleast(10.2,10, -20);
!  pleast 
! --------
!     -20
! (1 row)
! 
! select pleast(10,20, -1.0);
!  pleast 
! --------
!    -1.0
! (1 row)
! 
! -- in case of conflict, non-variadic version is preferred
! create or replace function pleast(numeric)
! returns numeric as $$
! begin
!   raise notice 'non-variadic function called';
!   return $1;
! end;
! $$ language plpgsql immutable strict;
! select pleast(10);
! NOTICE:  non-variadic function called
!  pleast 
! --------
!      10
! (1 row)
! 
! drop function pleast(numeric[]);
! drop function pleast(numeric);
! -- test table functions
! create function tftest(int) returns table(a int, b int) as $$
! begin
!   return query select $1, $1+i from generate_series(1,5) g(i);
! end;
! $$ language plpgsql immutable strict;
! select * from tftest(10);
!  a  | b  
! ----+----
!  10 | 11
!  10 | 12
!  10 | 13
!  10 | 14
!  10 | 15
! (5 rows)
! 
! create or replace function tftest(a1 int) returns table(a int, b int) as $$
! begin
!   a := a1; b := a1 + 1;
!   return next;
!   a := a1 * 10; b := a1 * 10 + 1;
!   return next;
! end;
! $$ language plpgsql immutable strict;
! select * from tftest(10);
!   a  |  b  
! -----+-----
!   10 |  11
!  100 | 101
! (2 rows)
! 
! drop function tftest(int);
! create or replace function rttest()
! returns setof int as $$
! declare rc int;
! begin
!   return query values(10),(20);
!   get diagnostics rc = row_count;
!   raise notice '% %', found, rc;
!   return query select * from (values(10),(20)) f(a) where false;
!   get diagnostics rc = row_count;
!   raise notice '% %', found, rc;
!   return query execute 'values(10),(20)';
!   get diagnostics rc = row_count;
!   raise notice '% %', found, rc;
!   return query execute 'select * from (values(10),(20)) f(a) where false';
!   get diagnostics rc = row_count;
!   raise notice '% %', found, rc;
! end;
! $$ language plpgsql;
! select * from rttest();
! NOTICE:  t 2
! NOTICE:  f 0
! NOTICE:  t 2
! NOTICE:  f 0
!  rttest 
! --------
!      10
!      20
!      10
!      20
! (4 rows)
! 
! drop function rttest();
! -- Test for proper cleanup at subtransaction exit.  This example
! -- exposed a bug in PG 8.2.
! CREATE FUNCTION leaker_1(fail BOOL) RETURNS INTEGER AS $$
! DECLARE
!   v_var INTEGER;
! BEGIN
!   BEGIN
!     v_var := (leaker_2(fail)).error_code;
!   EXCEPTION
!     WHEN others THEN RETURN 0;
!   END;
!   RETURN 1;
! END;
! $$ LANGUAGE plpgsql;
! CREATE FUNCTION leaker_2(fail BOOL, OUT error_code INTEGER, OUT new_id INTEGER)
!   RETURNS RECORD AS $$
! BEGIN
!   IF fail THEN
!     RAISE EXCEPTION 'fail ...';
!   END IF;
!   error_code := 1;
!   new_id := 1;
!   RETURN;
! END;
! $$ LANGUAGE plpgsql;
! SELECT * FROM leaker_1(false);
!  leaker_1 
! ----------
!         1
! (1 row)
! 
! SELECT * FROM leaker_1(true);
!  leaker_1 
! ----------
!         0
! (1 row)
! 
! DROP FUNCTION leaker_1(bool);
! DROP FUNCTION leaker_2(bool);
! -- Test for appropriate cleanup of non-simple expression evaluations
! -- (bug in all versions prior to August 2010)
! CREATE FUNCTION nonsimple_expr_test() RETURNS text[] AS $$
! DECLARE
!   arr text[];
!   lr text;
!   i integer;
! BEGIN
!   arr := array[array['foo','bar'], array['baz', 'quux']];
!   lr := 'fool';
!   i := 1;
!   -- use sub-SELECTs to make expressions non-simple
!   arr[(SELECT i)][(SELECT i+1)] := (SELECT lr);
!   RETURN arr;
! END;
! $$ LANGUAGE plpgsql;
! SELECT nonsimple_expr_test();
!    nonsimple_expr_test   
! -------------------------
!  {{foo,fool},{baz,quux}}
! (1 row)
! 
! DROP FUNCTION nonsimple_expr_test();
! CREATE FUNCTION nonsimple_expr_test() RETURNS integer AS $$
! declare
!    i integer NOT NULL := 0;
! begin
!   begin
!     i := (SELECT NULL::integer);  -- should throw error
!   exception
!     WHEN OTHERS THEN
!       i := (SELECT 1::integer);
!   end;
!   return i;
! end;
! $$ LANGUAGE plpgsql;
! SELECT nonsimple_expr_test();
!  nonsimple_expr_test 
! ---------------------
!                    1
! (1 row)
! 
! DROP FUNCTION nonsimple_expr_test();
! --
! -- Test cases involving recursion and error recovery in simple expressions
! -- (bugs in all versions before October 2010).  The problems are most
! -- easily exposed by mutual recursion between plpgsql and sql functions.
! --
! create function recurse(float8) returns float8 as
! $$
! begin
!   if ($1 > 0) then
!     return sql_recurse($1 - 1);
!   else
!     return $1;
!   end if;
! end;
! $$ language plpgsql;
! -- "limit" is to prevent this from being inlined
! create function sql_recurse(float8) returns float8 as
! $$ select recurse($1) limit 1; $$ language sql;
! select recurse(10);
!  recurse 
! ---------
!        0
! (1 row)
! 
! create function error1(text) returns text language sql as
! $$ SELECT relname::text FROM pg_class c WHERE c.oid = $1::regclass $$;
! create function error2(p_name_table text) returns text language plpgsql as $$
! begin
!   return error1(p_name_table);
! end$$;
! BEGIN;
! create table public.stuffs (stuff text);
! SAVEPOINT a;
! select error2('nonexistent.stuffs');
! ERROR:  schema "nonexistent" does not exist
! CONTEXT:  SQL function "error1" statement 1
! PL/pgSQL function error2(text) line 3 at RETURN
! ROLLBACK TO a;
! select error2('public.stuffs');
!  error2 
! --------
!  stuffs
! (1 row)
! 
! rollback;
! drop function error2(p_name_table text);
! drop function error1(text);
! -- Test for proper handling of cast-expression caching
! create function sql_to_date(integer) returns date as $$
! select $1::text::date
! $$ language sql immutable strict;
! create cast (integer as date) with function sql_to_date(integer) as assignment;
! create function cast_invoker(integer) returns date as $$
! begin
!   return $1;
! end$$ language plpgsql;
! select cast_invoker(20150717);
!  cast_invoker 
! --------------
!  07-17-2015
! (1 row)
! 
! select cast_invoker(20150718);  -- second call crashed in pre-release 9.5
!  cast_invoker 
! --------------
!  07-18-2015
! (1 row)
! 
! begin;
! select cast_invoker(20150717);
!  cast_invoker 
! --------------
!  07-17-2015
! (1 row)
! 
! select cast_invoker(20150718);
!  cast_invoker 
! --------------
!  07-18-2015
! (1 row)
! 
! savepoint s1;
! select cast_invoker(20150718);
!  cast_invoker 
! --------------
!  07-18-2015
! (1 row)
! 
! select cast_invoker(-1); -- fails
! ERROR:  invalid input syntax for type date: "-1"
! CONTEXT:  SQL function "sql_to_date" statement 1
! PL/pgSQL function cast_invoker(integer) while casting return value to function's return type
! rollback to savepoint s1;
! select cast_invoker(20150719);
!  cast_invoker 
! --------------
!  07-19-2015
! (1 row)
! 
! select cast_invoker(20150720);
!  cast_invoker 
! --------------
!  07-20-2015
! (1 row)
! 
! commit;
! drop function cast_invoker(integer);
! drop function sql_to_date(integer) cascade;
! NOTICE:  drop cascades to cast from integer to date
! -- Test handling of cast cache inside DO blocks
! -- (to check the original crash case, this must be a cast not previously
! -- used in this session)
! begin;
! do $$ declare x text[]; begin x := '{1.23, 4.56}'::numeric[]; end $$;
! do $$ declare x text[]; begin x := '{1.23, 4.56}'::numeric[]; end $$;
! end;
! -- Test for consistent reporting of error context
! create function fail() returns int language plpgsql as $$
! begin
!   return 1/0;
! end
! $$;
! select fail();
! ERROR:  division by zero
! CONTEXT:  SQL statement "SELECT 1/0"
! PL/pgSQL function fail() line 3 at RETURN
! select fail();
! ERROR:  division by zero
! CONTEXT:  SQL statement "SELECT 1/0"
! PL/pgSQL function fail() line 3 at RETURN
! drop function fail();
! -- Test handling of string literals.
! set standard_conforming_strings = off;
! create or replace function strtest() returns text as $$
! begin
!   raise notice 'foo\\bar\041baz';
!   return 'foo\\bar\041baz';
! end
! $$ language plpgsql;
! WARNING:  nonstandard use of \\ in a string literal
! LINE 3:   raise notice 'foo\\bar\041baz';
!                        ^
! HINT:  Use the escape string syntax for backslashes, e.g., E'\\'.
! WARNING:  nonstandard use of \\ in a string literal
! LINE 4:   return 'foo\\bar\041baz';
!                  ^
! HINT:  Use the escape string syntax for backslashes, e.g., E'\\'.
! WARNING:  nonstandard use of \\ in a string literal
! LINE 4:   return 'foo\\bar\041baz';
!                  ^
! HINT:  Use the escape string syntax for backslashes, e.g., E'\\'.
! select strtest();
! NOTICE:  foo\bar!baz
! WARNING:  nonstandard use of \\ in a string literal
! LINE 1: SELECT 'foo\\bar\041baz'
!                ^
! HINT:  Use the escape string syntax for backslashes, e.g., E'\\'.
! QUERY:  SELECT 'foo\\bar\041baz'
!    strtest   
! -------------
!  foo\bar!baz
! (1 row)
! 
! create or replace function strtest() returns text as $$
! begin
!   raise notice E'foo\\bar\041baz';
!   return E'foo\\bar\041baz';
! end
! $$ language plpgsql;
! select strtest();
! NOTICE:  foo\bar!baz
!    strtest   
! -------------
!  foo\bar!baz
! (1 row)
! 
! set standard_conforming_strings = on;
! create or replace function strtest() returns text as $$
! begin
!   raise notice 'foo\\bar\041baz\';
!   return 'foo\\bar\041baz\';
! end
! $$ language plpgsql;
! select strtest();
! NOTICE:  foo\\bar\041baz\
!      strtest      
! ------------------
!  foo\\bar\041baz\
! (1 row)
! 
! create or replace function strtest() returns text as $$
! begin
!   raise notice E'foo\\bar\041baz';
!   return E'foo\\bar\041baz';
! end
! $$ language plpgsql;
! select strtest();
! NOTICE:  foo\bar!baz
!    strtest   
! -------------
!  foo\bar!baz
! (1 row)
! 
! drop function strtest();
! -- Test anonymous code blocks.
! DO $$
! DECLARE r record;
! BEGIN
!     FOR r IN SELECT rtrim(roomno) AS roomno, comment FROM Room ORDER BY roomno
!     LOOP
!         RAISE NOTICE '%, %', r.roomno, r.comment;
!     END LOOP;
! END$$;
! NOTICE:  001, Entrance
! NOTICE:  002, Office
! NOTICE:  003, Office
! NOTICE:  004, Technical
! NOTICE:  101, Office
! NOTICE:  102, Conference
! NOTICE:  103, Restroom
! NOTICE:  104, Technical
! NOTICE:  105, Office
! NOTICE:  106, Office
! -- these are to check syntax error reporting
! DO LANGUAGE plpgsql $$begin return 1; end$$;
! ERROR:  RETURN cannot have a parameter in function returning void
! LINE 1: DO LANGUAGE plpgsql $$begin return 1; end$$;
!                                            ^
! DO $$
! DECLARE r record;
! BEGIN
!     FOR r IN SELECT rtrim(roomno) AS roomno, foo FROM Room ORDER BY roomno
!     LOOP
!         RAISE NOTICE '%, %', r.roomno, r.comment;
!     END LOOP;
! END$$;
! ERROR:  column "foo" does not exist
! LINE 1: SELECT rtrim(roomno) AS roomno, foo FROM Room ORDER BY roomn...
!                                         ^
! QUERY:  SELECT rtrim(roomno) AS roomno, foo FROM Room ORDER BY roomno
! CONTEXT:  PL/pgSQL function inline_code_block line 4 at FOR over SELECT rows
! -- Check handling of errors thrown from/into anonymous code blocks.
! do $outer$
! begin
!   for i in 1..10 loop
!    begin
!     execute $ex$
!       do $$
!       declare x int = 0;
!       begin
!         x := 1 / x;
!       end;
!       $$;
!     $ex$;
!   exception when division_by_zero then
!     raise notice 'caught division by zero';
!   end;
!   end loop;
! end;
! $outer$;
! NOTICE:  caught division by zero
! NOTICE:  caught division by zero
! NOTICE:  caught division by zero
! NOTICE:  caught division by zero
! NOTICE:  caught division by zero
! NOTICE:  caught division by zero
! NOTICE:  caught division by zero
! NOTICE:  caught division by zero
! NOTICE:  caught division by zero
! NOTICE:  caught division by zero
! -- Check variable scoping -- a var is not available in its own or prior
! -- default expressions.
! create function scope_test() returns int as $$
! declare x int := 42;
! begin
!   declare y int := x + 1;
!           x int := x + 2;
!   begin
!     return x * 100 + y;
!   end;
! end;
! $$ language plpgsql;
! select scope_test();
!  scope_test 
! ------------
!        4443
! (1 row)
! 
! drop function scope_test();
! -- Check handling of conflicts between plpgsql vars and table columns.
! set plpgsql.variable_conflict = error;
! create function conflict_test() returns setof int8_tbl as $$
! declare r record;
!   q1 bigint := 42;
! begin
!   for r in select q1,q2 from int8_tbl loop
!     return next r;
!   end loop;
! end;
! $$ language plpgsql;
! select * from conflict_test();
! ERROR:  column reference "q1" is ambiguous
! LINE 1: select q1,q2 from int8_tbl
!                ^
! DETAIL:  It could refer to either a PL/pgSQL variable or a table column.
! QUERY:  select q1,q2 from int8_tbl
! CONTEXT:  PL/pgSQL function conflict_test() line 5 at FOR over SELECT rows
! create or replace function conflict_test() returns setof int8_tbl as $$
! #variable_conflict use_variable
! declare r record;
!   q1 bigint := 42;
! begin
!   for r in select q1,q2 from int8_tbl loop
!     return next r;
!   end loop;
! end;
! $$ language plpgsql;
! select * from conflict_test();
!  q1 |        q2         
! ----+-------------------
!  42 |               456
!  42 |  4567890123456789
!  42 |               123
!  42 |  4567890123456789
!  42 | -4567890123456789
! (5 rows)
! 
! create or replace function conflict_test() returns setof int8_tbl as $$
! #variable_conflict use_column
! declare r record;
!   q1 bigint := 42;
! begin
!   for r in select q1,q2 from int8_tbl loop
!     return next r;
!   end loop;
! end;
! $$ language plpgsql;
! select * from conflict_test();
!         q1        |        q2         
! ------------------+-------------------
!               123 |               456
!               123 |  4567890123456789
!  4567890123456789 |               123
!  4567890123456789 |  4567890123456789
!  4567890123456789 | -4567890123456789
! (5 rows)
! 
! drop function conflict_test();
! -- Check that an unreserved keyword can be used as a variable name
! create function unreserved_test() returns int as $$
! declare
!   forward int := 21;
! begin
!   forward := forward * 2;
!   return forward;
! end
! $$ language plpgsql;
! select unreserved_test();
!  unreserved_test 
! -----------------
!               42
! (1 row)
! 
! create or replace function unreserved_test() returns int as $$
! declare
!   return int := 42;
! begin
!   return := return + 1;
!   return return;
! end
! $$ language plpgsql;
! select unreserved_test();
!  unreserved_test 
! -----------------
!               43
! (1 row)
! 
! drop function unreserved_test();
! --
! -- Test FOREACH over arrays
! --
! create function foreach_test(anyarray)
! returns void as $$
! declare x int;
! begin
!   foreach x in array $1
!   loop
!     raise notice '%', x;
!   end loop;
!   end;
! $$ language plpgsql;
! select foreach_test(ARRAY[1,2,3,4]);
! NOTICE:  1
! NOTICE:  2
! NOTICE:  3
! NOTICE:  4
!  foreach_test 
! --------------
!  
! (1 row)
! 
! select foreach_test(ARRAY[[1,2],[3,4]]);
! NOTICE:  1
! NOTICE:  2
! NOTICE:  3
! NOTICE:  4
!  foreach_test 
! --------------
!  
! (1 row)
! 
! create or replace function foreach_test(anyarray)
! returns void as $$
! declare x int;
! begin
!   foreach x slice 1 in array $1
!   loop
!     raise notice '%', x;
!   end loop;
!   end;
! $$ language plpgsql;
! -- should fail
! select foreach_test(ARRAY[1,2,3,4]);
! ERROR:  FOREACH ... SLICE loop variable must be of an array type
! CONTEXT:  PL/pgSQL function foreach_test(anyarray) line 4 at FOREACH over array
! select foreach_test(ARRAY[[1,2],[3,4]]);
! ERROR:  FOREACH ... SLICE loop variable must be of an array type
! CONTEXT:  PL/pgSQL function foreach_test(anyarray) line 4 at FOREACH over array
! create or replace function foreach_test(anyarray)
! returns void as $$
! declare x int[];
! begin
!   foreach x slice 1 in array $1
!   loop
!     raise notice '%', x;
!   end loop;
!   end;
! $$ language plpgsql;
! select foreach_test(ARRAY[1,2,3,4]);
! NOTICE:  {1,2,3,4}
!  foreach_test 
! --------------
!  
! (1 row)
! 
! select foreach_test(ARRAY[[1,2],[3,4]]);
! NOTICE:  {1,2}
! NOTICE:  {3,4}
!  foreach_test 
! --------------
!  
! (1 row)
! 
! -- higher level of slicing
! create or replace function foreach_test(anyarray)
! returns void as $$
! declare x int[];
! begin
!   foreach x slice 2 in array $1
!   loop
!     raise notice '%', x;
!   end loop;
!   end;
! $$ language plpgsql;
! -- should fail
! select foreach_test(ARRAY[1,2,3,4]);
! ERROR:  slice dimension (2) is out of the valid range 0..1
! CONTEXT:  PL/pgSQL function foreach_test(anyarray) line 4 at FOREACH over array
! -- ok
! select foreach_test(ARRAY[[1,2],[3,4]]);
! NOTICE:  {{1,2},{3,4}}
!  foreach_test 
! --------------
!  
! (1 row)
! 
! select foreach_test(ARRAY[[[1,2]],[[3,4]]]);
! NOTICE:  {{1,2}}
! NOTICE:  {{3,4}}
!  foreach_test 
! --------------
!  
! (1 row)
! 
! create type xy_tuple AS (x int, y int);
! -- iteration over array of records
! create or replace function foreach_test(anyarray)
! returns void as $$
! declare r record;
! begin
!   foreach r in array $1
!   loop
!     raise notice '%', r;
!   end loop;
!   end;
! $$ language plpgsql;
! select foreach_test(ARRAY[(10,20),(40,69),(35,78)]::xy_tuple[]);
! NOTICE:  (10,20)
! NOTICE:  (40,69)
! NOTICE:  (35,78)
!  foreach_test 
! --------------
!  
! (1 row)
! 
! select foreach_test(ARRAY[[(10,20),(40,69)],[(35,78),(88,76)]]::xy_tuple[]);
! NOTICE:  (10,20)
! NOTICE:  (40,69)
! NOTICE:  (35,78)
! NOTICE:  (88,76)
!  foreach_test 
! --------------
!  
! (1 row)
! 
! create or replace function foreach_test(anyarray)
! returns void as $$
! declare x int; y int;
! begin
!   foreach x, y in array $1
!   loop
!     raise notice 'x = %, y = %', x, y;
!   end loop;
!   end;
! $$ language plpgsql;
! select foreach_test(ARRAY[(10,20),(40,69),(35,78)]::xy_tuple[]);
! NOTICE:  x = 10, y = 20
! NOTICE:  x = 40, y = 69
! NOTICE:  x = 35, y = 78
!  foreach_test 
! --------------
!  
! (1 row)
! 
! select foreach_test(ARRAY[[(10,20),(40,69)],[(35,78),(88,76)]]::xy_tuple[]);
! NOTICE:  x = 10, y = 20
! NOTICE:  x = 40, y = 69
! NOTICE:  x = 35, y = 78
! NOTICE:  x = 88, y = 76
!  foreach_test 
! --------------
!  
! (1 row)
! 
! -- slicing over array of composite types
! create or replace function foreach_test(anyarray)
! returns void as $$
! declare x xy_tuple[];
! begin
!   foreach x slice 1 in array $1
!   loop
!     raise notice '%', x;
!   end loop;
!   end;
! $$ language plpgsql;
! select foreach_test(ARRAY[(10,20),(40,69),(35,78)]::xy_tuple[]);
! NOTICE:  {"(10,20)","(40,69)","(35,78)"}
!  foreach_test 
! --------------
!  
! (1 row)
! 
! select foreach_test(ARRAY[[(10,20),(40,69)],[(35,78),(88,76)]]::xy_tuple[]);
! NOTICE:  {"(10,20)","(40,69)"}
! NOTICE:  {"(35,78)","(88,76)"}
!  foreach_test 
! --------------
!  
! (1 row)
! 
! drop function foreach_test(anyarray);
! drop type xy_tuple;
! --
! -- Assorted tests for array subscript assignment
! --
! create temp table rtype (id int, ar text[]);
! create function arrayassign1() returns text[] language plpgsql as $$
! declare
!  r record;
! begin
!   r := row(12, '{foo,bar,baz}')::rtype;
!   r.ar[2] := 'replace';
!   return r.ar;
! end$$;
! select arrayassign1();
!    arrayassign1    
! -------------------
!  {foo,replace,baz}
! (1 row)
! 
! select arrayassign1(); -- try again to exercise internal caching
!    arrayassign1    
! -------------------
!  {foo,replace,baz}
! (1 row)
! 
! create domain orderedarray as int[2]
!   constraint sorted check (value[1] < value[2]);
! select '{1,2}'::orderedarray;
!  orderedarray 
! --------------
!  {1,2}
! (1 row)
! 
! select '{2,1}'::orderedarray;  -- fail
! ERROR:  value for domain orderedarray violates check constraint "sorted"
! create function testoa(x1 int, x2 int, x3 int) returns orderedarray
! language plpgsql as $$
! declare res orderedarray;
! begin
!   res := array[x1, x2];
!   res[2] := x3;
!   return res;
! end$$;
! select testoa(1,2,3);
!  testoa 
! --------
!  {1,3}
! (1 row)
! 
! select testoa(1,2,3); -- try again to exercise internal caching
!  testoa 
! --------
!  {1,3}
! (1 row)
! 
! select testoa(2,1,3); -- fail at initial assign
! ERROR:  value for domain orderedarray violates check constraint "sorted"
! CONTEXT:  PL/pgSQL function testoa(integer,integer,integer) line 4 at assignment
! select testoa(1,2,1); -- fail at update
! ERROR:  value for domain orderedarray violates check constraint "sorted"
! CONTEXT:  PL/pgSQL function testoa(integer,integer,integer) line 5 at assignment
! drop function arrayassign1();
! drop function testoa(x1 int, x2 int, x3 int);
! --
! -- Test handling of expanded arrays
! --
! create function returns_rw_array(int) returns int[]
! language plpgsql as $$
!   declare r int[];
!   begin r := array[$1, $1]; return r; end;
! $$ stable;
! create function consumes_rw_array(int[]) returns int
! language plpgsql as $$
!   begin return $1[1]; end;
! $$ stable;
! -- bug #14174
! explain (verbose, costs off)
! select i, a from
!   (select returns_rw_array(1) as a offset 0) ss,
!   lateral consumes_rw_array(a) i;
!                            QUERY PLAN                            
! -----------------------------------------------------------------
!  Nested Loop
!    Output: i.i, (returns_rw_array(1))
!    ->  Result
!          Output: returns_rw_array(1)
!    ->  Function Scan on public.consumes_rw_array i
!          Output: i.i
!          Function Call: consumes_rw_array((returns_rw_array(1)))
! (7 rows)
! 
! select i, a from
!   (select returns_rw_array(1) as a offset 0) ss,
!   lateral consumes_rw_array(a) i;
!  i |   a   
! ---+-------
!  1 | {1,1}
! (1 row)
! 
! explain (verbose, costs off)
! select consumes_rw_array(a), a from returns_rw_array(1) a;
!                  QUERY PLAN                 
! --------------------------------------------
!  Function Scan on public.returns_rw_array a
!    Output: consumes_rw_array(a), a
!    Function Call: returns_rw_array(1)
! (3 rows)
! 
! select consumes_rw_array(a), a from returns_rw_array(1) a;
!  consumes_rw_array |   a   
! -------------------+-------
!                  1 | {1,1}
! (1 row)
! 
! explain (verbose, costs off)
! select consumes_rw_array(a), a from
!   (values (returns_rw_array(1)), (returns_rw_array(2))) v(a);
!                              QUERY PLAN                              
! ---------------------------------------------------------------------
!  Values Scan on "*VALUES*"
!    Output: consumes_rw_array("*VALUES*".column1), "*VALUES*".column1
! (2 rows)
! 
! select consumes_rw_array(a), a from
!   (values (returns_rw_array(1)), (returns_rw_array(2))) v(a);
!  consumes_rw_array |   a   
! -------------------+-------
!                  1 | {1,1}
!                  2 | {2,2}
! (2 rows)
! 
! --
! -- Test access to call stack
! --
! create function inner_func(int)
! returns int as $$
! declare _context text;
! begin
!   get diagnostics _context = pg_context;
!   raise notice '***%***', _context;
!   -- lets do it again, just for fun..
!   get diagnostics _context = pg_context;
!   raise notice '***%***', _context;
!   raise notice 'lets make sure we didnt break anything';
!   return 2 * $1;
! end;
! $$ language plpgsql;
! create or replace function outer_func(int)
! returns int as $$
! declare
!   myresult int;
! begin
!   raise notice 'calling down into inner_func()';
!   myresult := inner_func($1);
!   raise notice 'inner_func() done';
!   return myresult;
! end;
! $$ language plpgsql;
! create or replace function outer_outer_func(int)
! returns int as $$
! declare
!   myresult int;
! begin
!   raise notice 'calling down into outer_func()';
!   myresult := outer_func($1);
!   raise notice 'outer_func() done';
!   return myresult;
! end;
! $$ language plpgsql;
! select outer_outer_func(10);
! NOTICE:  calling down into outer_func()
! NOTICE:  calling down into inner_func()
! NOTICE:  ***PL/pgSQL function inner_func(integer) line 4 at GET DIAGNOSTICS
! PL/pgSQL function outer_func(integer) line 6 at assignment
! PL/pgSQL function outer_outer_func(integer) line 6 at assignment***
! NOTICE:  ***PL/pgSQL function inner_func(integer) line 7 at GET DIAGNOSTICS
! PL/pgSQL function outer_func(integer) line 6 at assignment
! PL/pgSQL function outer_outer_func(integer) line 6 at assignment***
! NOTICE:  lets make sure we didnt break anything
! NOTICE:  inner_func() done
! NOTICE:  outer_func() done
!  outer_outer_func 
! ------------------
!                20
! (1 row)
! 
! -- repeated call should to work
! select outer_outer_func(20);
! NOTICE:  calling down into outer_func()
! NOTICE:  calling down into inner_func()
! NOTICE:  ***PL/pgSQL function inner_func(integer) line 4 at GET DIAGNOSTICS
! PL/pgSQL function outer_func(integer) line 6 at assignment
! PL/pgSQL function outer_outer_func(integer) line 6 at assignment***
! NOTICE:  ***PL/pgSQL function inner_func(integer) line 7 at GET DIAGNOSTICS
! PL/pgSQL function outer_func(integer) line 6 at assignment
! PL/pgSQL function outer_outer_func(integer) line 6 at assignment***
! NOTICE:  lets make sure we didnt break anything
! NOTICE:  inner_func() done
! NOTICE:  outer_func() done
!  outer_outer_func 
! ------------------
!                40
! (1 row)
! 
! drop function outer_outer_func(int);
! drop function outer_func(int);
! drop function inner_func(int);
! -- access to call stack from exception
! create function inner_func(int)
! returns int as $$
! declare
!   _context text;
!   sx int := 5;
! begin
!   begin
!     perform sx / 0;
!   exception
!     when division_by_zero then
!       get diagnostics _context = pg_context;
!       raise notice '***%***', _context;
!   end;
! 
!   -- lets do it again, just for fun..
!   get diagnostics _context = pg_context;
!   raise notice '***%***', _context;
!   raise notice 'lets make sure we didnt break anything';
!   return 2 * $1;
! end;
! $$ language plpgsql;
! create or replace function outer_func(int)
! returns int as $$
! declare
!   myresult int;
! begin
!   raise notice 'calling down into inner_func()';
!   myresult := inner_func($1);
!   raise notice 'inner_func() done';
!   return myresult;
! end;
! $$ language plpgsql;
! create or replace function outer_outer_func(int)
! returns int as $$
! declare
!   myresult int;
! begin
!   raise notice 'calling down into outer_func()';
!   myresult := outer_func($1);
!   raise notice 'outer_func() done';
!   return myresult;
! end;
! $$ language plpgsql;
! select outer_outer_func(10);
! NOTICE:  calling down into outer_func()
! NOTICE:  calling down into inner_func()
! NOTICE:  ***PL/pgSQL function inner_func(integer) line 10 at GET DIAGNOSTICS
! PL/pgSQL function outer_func(integer) line 6 at assignment
! PL/pgSQL function outer_outer_func(integer) line 6 at assignment***
! NOTICE:  ***PL/pgSQL function inner_func(integer) line 15 at GET DIAGNOSTICS
! PL/pgSQL function outer_func(integer) line 6 at assignment
! PL/pgSQL function outer_outer_func(integer) line 6 at assignment***
! NOTICE:  lets make sure we didnt break anything
! NOTICE:  inner_func() done
! NOTICE:  outer_func() done
!  outer_outer_func 
! ------------------
!                20
! (1 row)
! 
! -- repeated call should to work
! select outer_outer_func(20);
! NOTICE:  calling down into outer_func()
! NOTICE:  calling down into inner_func()
! NOTICE:  ***PL/pgSQL function inner_func(integer) line 10 at GET DIAGNOSTICS
! PL/pgSQL function outer_func(integer) line 6 at assignment
! PL/pgSQL function outer_outer_func(integer) line 6 at assignment***
! NOTICE:  ***PL/pgSQL function inner_func(integer) line 15 at GET DIAGNOSTICS
! PL/pgSQL function outer_func(integer) line 6 at assignment
! PL/pgSQL function outer_outer_func(integer) line 6 at assignment***
! NOTICE:  lets make sure we didnt break anything
! NOTICE:  inner_func() done
! NOTICE:  outer_func() done
!  outer_outer_func 
! ------------------
!                40
! (1 row)
! 
! drop function outer_outer_func(int);
! drop function outer_func(int);
! drop function inner_func(int);
! --
! -- Test ASSERT
! --
! do $$
! begin
!   assert 1=1;  -- should succeed
! end;
! $$;
! do $$
! begin
!   assert 1=0;  -- should fail
! end;
! $$;
! ERROR:  assertion failed
! CONTEXT:  PL/pgSQL function inline_code_block line 3 at ASSERT
! do $$
! begin
!   assert NULL;  -- should fail
! end;
! $$;
! ERROR:  assertion failed
! CONTEXT:  PL/pgSQL function inline_code_block line 3 at ASSERT
! -- check controlling GUC
! set plpgsql.check_asserts = off;
! do $$
! begin
!   assert 1=0;  -- won't be tested
! end;
! $$;
! reset plpgsql.check_asserts;
! -- test custom message
! do $$
! declare var text := 'some value';
! begin
!   assert 1=0, format('assertion failed, var = "%s"', var);
! end;
! $$;
! ERROR:  assertion failed, var = "some value"
! CONTEXT:  PL/pgSQL function inline_code_block line 4 at ASSERT
! -- ensure assertions are not trapped by 'others'
! do $$
! begin
!   assert 1=0, 'unhandled assertion';
! exception when others then
!   null; -- do nothing
! end;
! $$;
! ERROR:  unhandled assertion
! CONTEXT:  PL/pgSQL function inline_code_block line 3 at ASSERT
--- 1 ----
! psql: FATAL:  the database system is in recovery mode

======================================================================

*** /home/claudiofreire/src/postgresql.work/src/test/regress/expected/copy2.out	2016-09-05 20:45:48.604032169 -0300
--- /home/claudiofreire/src/postgresql.work/src/test/regress/results/copy2.out	2016-09-12 12:14:51.883413916 -0300
***************
*** 1,468 ****
! CREATE TEMP TABLE x (
! 	a serial,
! 	b int,
! 	c text not null default 'stuff',
! 	d text,
! 	e text
! ) WITH OIDS;
! CREATE FUNCTION fn_x_before () RETURNS TRIGGER AS '
!   BEGIN
! 		NEW.e := ''before trigger fired''::text;
! 		return NEW;
! 	END;
! ' LANGUAGE plpgsql;
! CREATE FUNCTION fn_x_after () RETURNS TRIGGER AS '
!   BEGIN
! 		UPDATE x set e=''after trigger fired'' where c=''stuff'';
! 		return NULL;
! 	END;
! ' LANGUAGE plpgsql;
! CREATE TRIGGER trg_x_after AFTER INSERT ON x
! FOR EACH ROW EXECUTE PROCEDURE fn_x_after();
! CREATE TRIGGER trg_x_before BEFORE INSERT ON x
! FOR EACH ROW EXECUTE PROCEDURE fn_x_before();
! COPY x (a, b, c, d, e) from stdin;
! COPY x (b, d) from stdin;
! COPY x (b, d) from stdin;
! COPY x (a, b, c, d, e) from stdin;
! -- non-existent column in column list: should fail
! COPY x (xyz) from stdin;
! ERROR:  column "xyz" of relation "x" does not exist
! -- too many columns in column list: should fail
! COPY x (a, b, c, d, e, d, c) from stdin;
! ERROR:  column "d" specified more than once
! -- missing data: should fail
! COPY x from stdin;
! ERROR:  invalid input syntax for integer: ""
! CONTEXT:  COPY x, line 1, column a: ""
! COPY x from stdin;
! ERROR:  missing data for column "e"
! CONTEXT:  COPY x, line 1: "2000	230	23	23"
! COPY x from stdin;
! ERROR:  missing data for column "e"
! CONTEXT:  COPY x, line 1: "2001	231	\N	\N"
! -- extra data: should fail
! COPY x from stdin;
! ERROR:  extra data after last expected column
! CONTEXT:  COPY x, line 1: "2002	232	40	50	60	70	80"
! -- various COPY options: delimiters, oids, NULL string, encoding
! COPY x (b, c, d, e) from stdin with oids delimiter ',' null 'x';
! COPY x from stdin WITH DELIMITER AS ';' NULL AS '';
! COPY x from stdin WITH DELIMITER AS ':' NULL AS E'\\X' ENCODING 'sql_ascii';
! -- check results of copy in
! SELECT * FROM x;
!    a   | b  |     c      |   d    |          e           
! -------+----+------------+--------+----------------------
!   9999 |    | \N         | NN     | before trigger fired
!  10000 | 21 | 31         | 41     | before trigger fired
!  10001 | 22 | 32         | 42     | before trigger fired
!  10002 | 23 | 33         | 43     | before trigger fired
!  10003 | 24 | 34         | 44     | before trigger fired
!  10004 | 25 | 35         | 45     | before trigger fired
!  10005 | 26 | 36         | 46     | before trigger fired
!      6 |    | 45         | 80     | before trigger fired
!      7 |    | x          | \x     | before trigger fired
!      8 |    | ,          | \,     | before trigger fired
!   3000 |    | c          |        | before trigger fired
!   4000 |    | C          |        | before trigger fired
!   4001 |  1 | empty      |        | before trigger fired
!   4002 |  2 | null       |        | before trigger fired
!   4003 |  3 | Backslash  | \      | before trigger fired
!   4004 |  4 | BackslashX | \X     | before trigger fired
!   4005 |  5 | N          | N      | before trigger fired
!   4006 |  6 | BackslashN | \N     | before trigger fired
!   4007 |  7 | XX         | XX     | before trigger fired
!   4008 |  8 | Delimiter  | :      | before trigger fired
!      1 |  1 | stuff      | test_1 | after trigger fired
!      2 |  2 | stuff      | test_2 | after trigger fired
!      3 |  3 | stuff      | test_3 | after trigger fired
!      4 |  4 | stuff      | test_4 | after trigger fired
!      5 |  5 | stuff      | test_5 | after trigger fired
! (25 rows)
! 
! -- COPY w/ oids on a table w/o oids should fail
! CREATE TABLE no_oids (
! 	a	int,
! 	b	int
! ) WITHOUT OIDS;
! INSERT INTO no_oids (a, b) VALUES (5, 10);
! INSERT INTO no_oids (a, b) VALUES (20, 30);
! -- should fail
! COPY no_oids FROM stdin WITH OIDS;
! ERROR:  table "no_oids" does not have OIDs
! COPY no_oids TO stdout WITH OIDS;
! ERROR:  table "no_oids" does not have OIDs
! -- check copy out
! COPY x TO stdout;
! 9999	\N	\\N	NN	before trigger fired
! 10000	21	31	41	before trigger fired
! 10001	22	32	42	before trigger fired
! 10002	23	33	43	before trigger fired
! 10003	24	34	44	before trigger fired
! 10004	25	35	45	before trigger fired
! 10005	26	36	46	before trigger fired
! 6	\N	45	80	before trigger fired
! 7	\N	x	\\x	before trigger fired
! 8	\N	,	\\,	before trigger fired
! 3000	\N	c	\N	before trigger fired
! 4000	\N	C	\N	before trigger fired
! 4001	1	empty		before trigger fired
! 4002	2	null	\N	before trigger fired
! 4003	3	Backslash	\\	before trigger fired
! 4004	4	BackslashX	\\X	before trigger fired
! 4005	5	N	N	before trigger fired
! 4006	6	BackslashN	\\N	before trigger fired
! 4007	7	XX	XX	before trigger fired
! 4008	8	Delimiter	:	before trigger fired
! 1	1	stuff	test_1	after trigger fired
! 2	2	stuff	test_2	after trigger fired
! 3	3	stuff	test_3	after trigger fired
! 4	4	stuff	test_4	after trigger fired
! 5	5	stuff	test_5	after trigger fired
! COPY x (c, e) TO stdout;
! \\N	before trigger fired
! 31	before trigger fired
! 32	before trigger fired
! 33	before trigger fired
! 34	before trigger fired
! 35	before trigger fired
! 36	before trigger fired
! 45	before trigger fired
! x	before trigger fired
! ,	before trigger fired
! c	before trigger fired
! C	before trigger fired
! empty	before trigger fired
! null	before trigger fired
! Backslash	before trigger fired
! BackslashX	before trigger fired
! N	before trigger fired
! BackslashN	before trigger fired
! XX	before trigger fired
! Delimiter	before trigger fired
! stuff	after trigger fired
! stuff	after trigger fired
! stuff	after trigger fired
! stuff	after trigger fired
! stuff	after trigger fired
! COPY x (b, e) TO stdout WITH NULL 'I''m null';
! I'm null	before trigger fired
! 21	before trigger fired
! 22	before trigger fired
! 23	before trigger fired
! 24	before trigger fired
! 25	before trigger fired
! 26	before trigger fired
! I'm null	before trigger fired
! I'm null	before trigger fired
! I'm null	before trigger fired
! I'm null	before trigger fired
! I'm null	before trigger fired
! 1	before trigger fired
! 2	before trigger fired
! 3	before trigger fired
! 4	before trigger fired
! 5	before trigger fired
! 6	before trigger fired
! 7	before trigger fired
! 8	before trigger fired
! 1	after trigger fired
! 2	after trigger fired
! 3	after trigger fired
! 4	after trigger fired
! 5	after trigger fired
! CREATE TEMP TABLE y (
! 	col1 text,
! 	col2 text
! );
! INSERT INTO y VALUES ('Jackson, Sam', E'\\h');
! INSERT INTO y VALUES ('It is "perfect".',E'\t');
! INSERT INTO y VALUES ('', NULL);
! COPY y TO stdout WITH CSV;
! "Jackson, Sam",\h
! "It is ""perfect"".",	
! "",
! COPY y TO stdout WITH CSV QUOTE '''' DELIMITER '|';
! Jackson, Sam|\h
! It is "perfect".|	
! ''|
! COPY y TO stdout WITH CSV FORCE QUOTE col2 ESCAPE E'\\' ENCODING 'sql_ascii';
! "Jackson, Sam","\\h"
! "It is \"perfect\".","	"
! "",
! COPY y TO stdout WITH CSV FORCE QUOTE *;
! "Jackson, Sam","\h"
! "It is ""perfect"".","	"
! "",
! -- Repeat above tests with new 9.0 option syntax
! COPY y TO stdout (FORMAT CSV);
! "Jackson, Sam",\h
! "It is ""perfect"".",	
! "",
! COPY y TO stdout (FORMAT CSV, QUOTE '''', DELIMITER '|');
! Jackson, Sam|\h
! It is "perfect".|	
! ''|
! COPY y TO stdout (FORMAT CSV, FORCE_QUOTE (col2), ESCAPE E'\\');
! "Jackson, Sam","\\h"
! "It is \"perfect\".","	"
! "",
! COPY y TO stdout (FORMAT CSV, FORCE_QUOTE *);
! "Jackson, Sam","\h"
! "It is ""perfect"".","	"
! "",
! \copy y TO stdout (FORMAT CSV)
! "Jackson, Sam",\h
! "It is ""perfect"".",	
! "",
! \copy y TO stdout (FORMAT CSV, QUOTE '''', DELIMITER '|')
! Jackson, Sam|\h
! It is "perfect".|	
! ''|
! \copy y TO stdout (FORMAT CSV, FORCE_QUOTE (col2), ESCAPE E'\\')
! "Jackson, Sam","\\h"
! "It is \"perfect\".","	"
! "",
! \copy y TO stdout (FORMAT CSV, FORCE_QUOTE *)
! "Jackson, Sam","\h"
! "It is ""perfect"".","	"
! "",
! --test that we read consecutive LFs properly
! CREATE TEMP TABLE testnl (a int, b text, c int);
! COPY testnl FROM stdin CSV;
! -- test end of copy marker
! CREATE TEMP TABLE testeoc (a text);
! COPY testeoc FROM stdin CSV;
! COPY testeoc TO stdout CSV;
! a\.
! \.b
! c\.d
! "\."
! -- test handling of nonstandard null marker that violates escaping rules
! CREATE TEMP TABLE testnull(a int, b text);
! INSERT INTO testnull VALUES (1, E'\\0'), (NULL, NULL);
! COPY testnull TO stdout WITH NULL AS E'\\0';
! 1	\\0
! \0	\0
! COPY testnull FROM stdin WITH NULL AS E'\\0';
! SELECT * FROM testnull;
!  a  | b  
! ----+----
!   1 | \0
!     | 
!  42 | \0
!     | 
! (4 rows)
! 
! BEGIN;
! CREATE TABLE vistest (LIKE testeoc);
! COPY vistest FROM stdin CSV;
! COMMIT;
! SELECT * FROM vistest;
!  a  
! ----
!  a0
!  b
! (2 rows)
! 
! BEGIN;
! TRUNCATE vistest;
! COPY vistest FROM stdin CSV;
! SELECT * FROM vistest;
!  a  
! ----
!  a1
!  b
! (2 rows)
! 
! SAVEPOINT s1;
! TRUNCATE vistest;
! COPY vistest FROM stdin CSV;
! SELECT * FROM vistest;
!  a  
! ----
!  d1
!  e
! (2 rows)
! 
! COMMIT;
! SELECT * FROM vistest;
!  a  
! ----
!  d1
!  e
! (2 rows)
! 
! BEGIN;
! TRUNCATE vistest;
! COPY vistest FROM stdin CSV FREEZE;
! SELECT * FROM vistest;
!  a  
! ----
!  a2
!  b
! (2 rows)
! 
! SAVEPOINT s1;
! TRUNCATE vistest;
! COPY vistest FROM stdin CSV FREEZE;
! SELECT * FROM vistest;
!  a  
! ----
!  d2
!  e
! (2 rows)
! 
! COMMIT;
! SELECT * FROM vistest;
!  a  
! ----
!  d2
!  e
! (2 rows)
! 
! BEGIN;
! TRUNCATE vistest;
! COPY vistest FROM stdin CSV FREEZE;
! SELECT * FROM vistest;
!  a 
! ---
!  x
!  y
! (2 rows)
! 
! COMMIT;
! TRUNCATE vistest;
! COPY vistest FROM stdin CSV FREEZE;
! ERROR:  cannot perform FREEZE because the table was not created or truncated in the current subtransaction
! BEGIN;
! TRUNCATE vistest;
! SAVEPOINT s1;
! COPY vistest FROM stdin CSV FREEZE;
! ERROR:  cannot perform FREEZE because the table was not created or truncated in the current subtransaction
! COMMIT;
! BEGIN;
! INSERT INTO vistest VALUES ('z');
! SAVEPOINT s1;
! TRUNCATE vistest;
! ROLLBACK TO SAVEPOINT s1;
! COPY vistest FROM stdin CSV FREEZE;
! ERROR:  cannot perform FREEZE because the table was not created or truncated in the current subtransaction
! COMMIT;
! CREATE FUNCTION truncate_in_subxact() RETURNS VOID AS
! $$
! BEGIN
! 	TRUNCATE vistest;
! EXCEPTION
!   WHEN OTHERS THEN
! 	INSERT INTO vistest VALUES ('subxact failure');
! END;
! $$ language plpgsql;
! BEGIN;
! INSERT INTO vistest VALUES ('z');
! SELECT truncate_in_subxact();
!  truncate_in_subxact 
! ---------------------
!  
! (1 row)
! 
! COPY vistest FROM stdin CSV FREEZE;
! SELECT * FROM vistest;
!  a  
! ----
!  d4
!  e
! (2 rows)
! 
! COMMIT;
! SELECT * FROM vistest;
!  a  
! ----
!  d4
!  e
! (2 rows)
! 
! -- Test FORCE_NOT_NULL and FORCE_NULL options
! CREATE TEMP TABLE forcetest (
!     a INT NOT NULL,
!     b TEXT NOT NULL,
!     c TEXT,
!     d TEXT,
!     e TEXT
! );
! \pset null NULL
! -- should succeed with no effect ("b" remains an empty string, "c" remains NULL)
! BEGIN;
! COPY forcetest (a, b, c) FROM STDIN WITH (FORMAT csv, FORCE_NOT_NULL(b), FORCE_NULL(c));
! COMMIT;
! SELECT b, c FROM forcetest WHERE a = 1;
!  b |  c   
! ---+------
!    | NULL
! (1 row)
! 
! -- should succeed, FORCE_NULL and FORCE_NOT_NULL can be both specified
! BEGIN;
! COPY forcetest (a, b, c, d) FROM STDIN WITH (FORMAT csv, FORCE_NOT_NULL(c,d), FORCE_NULL(c,d));
! COMMIT;
! SELECT c, d FROM forcetest WHERE a = 2;
!  c |  d   
! ---+------
!    | NULL
! (1 row)
! 
! -- should fail with not-null constraint violation
! BEGIN;
! COPY forcetest (a, b, c) FROM STDIN WITH (FORMAT csv, FORCE_NULL(b), FORCE_NOT_NULL(c));
! ERROR:  null value in column "b" violates not-null constraint
! DETAIL:  Failing row contains (3, null, , null, null).
! CONTEXT:  COPY forcetest, line 1: "3,,"""
! ROLLBACK;
! -- should fail with "not referenced by COPY" error
! BEGIN;
! COPY forcetest (d, e) FROM STDIN WITH (FORMAT csv, FORCE_NOT_NULL(b));
! ERROR:  FORCE_NOT_NULL column "b" not referenced by COPY
! ROLLBACK;
! -- should fail with "not referenced by COPY" error
! BEGIN;
! COPY forcetest (d, e) FROM STDIN WITH (FORMAT csv, FORCE_NULL(b));
! ERROR:  FORCE_NULL column "b" not referenced by COPY
! ROLLBACK;
! \pset null ''
! -- test case with whole-row Var in a check constraint
! create table check_con_tbl (f1 int);
! create function check_con_function(check_con_tbl) returns bool as $$
! begin
!   raise notice 'input = %', row_to_json($1);
!   return $1.f1 > 0;
! end $$ language plpgsql immutable;
! alter table check_con_tbl add check (check_con_function(check_con_tbl.*));
! \d+ check_con_tbl
!                     Table "public.check_con_tbl"
!  Column |  Type   | Modifiers | Storage | Stats target | Description 
! --------+---------+-----------+---------+--------------+-------------
!  f1     | integer |           | plain   |              | 
! Check constraints:
!     "check_con_tbl_check" CHECK (check_con_function(check_con_tbl.*))
! 
! copy check_con_tbl from stdin;
! NOTICE:  input = {"f1":1}
! NOTICE:  input = {"f1":null}
! copy check_con_tbl from stdin;
! NOTICE:  input = {"f1":0}
! ERROR:  new row for relation "check_con_tbl" violates check constraint "check_con_tbl_check"
! DETAIL:  Failing row contains (0).
! CONTEXT:  COPY check_con_tbl, line 1: "0"
! select * from check_con_tbl;
!  f1 
! ----
!   1
!    
! (2 rows)
! 
! DROP TABLE forcetest;
! DROP TABLE vistest;
! DROP FUNCTION truncate_in_subxact();
! DROP TABLE x, y;
! DROP FUNCTION fn_x_before();
! DROP FUNCTION fn_x_after();
--- 1 ----
! psql: FATAL:  the database system is in recovery mode

======================================================================

*** /home/claudiofreire/src/postgresql.work/src/test/regress/expected/temp.out	2016-09-05 20:45:49.072033605 -0300
--- /home/claudiofreire/src/postgresql.work/src/test/regress/results/temp.out	2016-09-12 12:14:51.887413916 -0300
***************
*** 1,201 ****
! --
! -- TEMP
! -- Test temp relations and indexes
! --
! -- test temp table/index masking
! CREATE TABLE temptest(col int);
! CREATE INDEX i_temptest ON temptest(col);
! CREATE TEMP TABLE temptest(tcol int);
! CREATE INDEX i_temptest ON temptest(tcol);
! SELECT * FROM temptest;
!  tcol 
! ------
! (0 rows)
! 
! DROP INDEX i_temptest;
! DROP TABLE temptest;
! SELECT * FROM temptest;
!  col 
! -----
! (0 rows)
! 
! DROP INDEX i_temptest;
! DROP TABLE temptest;
! -- test temp table selects
! CREATE TABLE temptest(col int);
! INSERT INTO temptest VALUES (1);
! CREATE TEMP TABLE temptest(tcol float);
! INSERT INTO temptest VALUES (2.1);
! SELECT * FROM temptest;
!  tcol 
! ------
!   2.1
! (1 row)
! 
! DROP TABLE temptest;
! SELECT * FROM temptest;
!  col 
! -----
!    1
! (1 row)
! 
! DROP TABLE temptest;
! -- test temp table deletion
! CREATE TEMP TABLE temptest(col int);
! \c
! SELECT * FROM temptest;
! ERROR:  relation "temptest" does not exist
! LINE 1: SELECT * FROM temptest;
!                       ^
! -- Test ON COMMIT DELETE ROWS
! CREATE TEMP TABLE temptest(col int) ON COMMIT DELETE ROWS;
! BEGIN;
! INSERT INTO temptest VALUES (1);
! INSERT INTO temptest VALUES (2);
! SELECT * FROM temptest;
!  col 
! -----
!    1
!    2
! (2 rows)
! 
! COMMIT;
! SELECT * FROM temptest;
!  col 
! -----
! (0 rows)
! 
! DROP TABLE temptest;
! BEGIN;
! CREATE TEMP TABLE temptest(col) ON COMMIT DELETE ROWS AS SELECT 1;
! SELECT * FROM temptest;
!  col 
! -----
!    1
! (1 row)
! 
! COMMIT;
! SELECT * FROM temptest;
!  col 
! -----
! (0 rows)
! 
! DROP TABLE temptest;
! -- Test ON COMMIT DROP
! BEGIN;
! CREATE TEMP TABLE temptest(col int) ON COMMIT DROP;
! INSERT INTO temptest VALUES (1);
! INSERT INTO temptest VALUES (2);
! SELECT * FROM temptest;
!  col 
! -----
!    1
!    2
! (2 rows)
! 
! COMMIT;
! SELECT * FROM temptest;
! ERROR:  relation "temptest" does not exist
! LINE 1: SELECT * FROM temptest;
!                       ^
! BEGIN;
! CREATE TEMP TABLE temptest(col) ON COMMIT DROP AS SELECT 1;
! SELECT * FROM temptest;
!  col 
! -----
!    1
! (1 row)
! 
! COMMIT;
! SELECT * FROM temptest;
! ERROR:  relation "temptest" does not exist
! LINE 1: SELECT * FROM temptest;
!                       ^
! -- ON COMMIT is only allowed for TEMP
! CREATE TABLE temptest(col int) ON COMMIT DELETE ROWS;
! ERROR:  ON COMMIT can only be used on temporary tables
! CREATE TABLE temptest(col) ON COMMIT DELETE ROWS AS SELECT 1;
! ERROR:  ON COMMIT can only be used on temporary tables
! -- Test foreign keys
! BEGIN;
! CREATE TEMP TABLE temptest1(col int PRIMARY KEY);
! CREATE TEMP TABLE temptest2(col int REFERENCES temptest1)
!   ON COMMIT DELETE ROWS;
! INSERT INTO temptest1 VALUES (1);
! INSERT INTO temptest2 VALUES (1);
! COMMIT;
! SELECT * FROM temptest1;
!  col 
! -----
!    1
! (1 row)
! 
! SELECT * FROM temptest2;
!  col 
! -----
! (0 rows)
! 
! BEGIN;
! CREATE TEMP TABLE temptest3(col int PRIMARY KEY) ON COMMIT DELETE ROWS;
! CREATE TEMP TABLE temptest4(col int REFERENCES temptest3);
! COMMIT;
! ERROR:  unsupported ON COMMIT and foreign key combination
! DETAIL:  Table "temptest4" references "temptest3", but they do not have the same ON COMMIT setting.
! -- Test manipulation of temp schema's placement in search path
! create table public.whereami (f1 text);
! insert into public.whereami values ('public');
! create temp table whereami (f1 text);
! insert into whereami values ('temp');
! create function public.whoami() returns text
!   as $$select 'public'::text$$ language sql;
! create function pg_temp.whoami() returns text
!   as $$select 'temp'::text$$ language sql;
! -- default should have pg_temp implicitly first, but only for tables
! select * from whereami;
!   f1  
! ------
!  temp
! (1 row)
! 
! select whoami();
!  whoami 
! --------
!  public
! (1 row)
! 
! -- can list temp first explicitly, but it still doesn't affect functions
! set search_path = pg_temp, public;
! select * from whereami;
!   f1  
! ------
!  temp
! (1 row)
! 
! select whoami();
!  whoami 
! --------
!  public
! (1 row)
! 
! -- or put it last for security
! set search_path = public, pg_temp;
! select * from whereami;
!    f1   
! --------
!  public
! (1 row)
! 
! select whoami();
!  whoami 
! --------
!  public
! (1 row)
! 
! -- you can invoke a temp function explicitly, though
! select pg_temp.whoami();
!  whoami 
! --------
!  temp
! (1 row)
! 
! drop table public.whereami;
--- 1 ----
! psql: FATAL:  the database system is in recovery mode

======================================================================

*** /home/claudiofreire/src/postgresql.work/src/test/regress/expected/domain.out	2016-09-05 20:45:48.636032268 -0300
--- /home/claudiofreire/src/postgresql.work/src/test/regress/results/domain.out	2016-09-12 12:14:51.891413917 -0300
***************
*** 1,722 ****
! --
! -- Test domains.
! --
! -- Test Comment / Drop
! create domain domaindroptest int4;
! comment on domain domaindroptest is 'About to drop this..';
! create domain dependenttypetest domaindroptest;
! -- fail because of dependent type
! drop domain domaindroptest;
! ERROR:  cannot drop type domaindroptest because other objects depend on it
! DETAIL:  type dependenttypetest depends on type domaindroptest
! HINT:  Use DROP ... CASCADE to drop the dependent objects too.
! drop domain domaindroptest cascade;
! NOTICE:  drop cascades to type dependenttypetest
! -- this should fail because already gone
! drop domain domaindroptest cascade;
! ERROR:  type "domaindroptest" does not exist
! -- Test domain input.
! -- Note: the point of checking both INSERT and COPY FROM is that INSERT
! -- exercises CoerceToDomain while COPY exercises domain_in.
! create domain domainvarchar varchar(5);
! create domain domainnumeric numeric(8,2);
! create domain domainint4 int4;
! create domain domaintext text;
! -- Test explicit coercions --- these should succeed (and truncate)
! SELECT cast('123456' as domainvarchar);
!  domainvarchar 
! ---------------
!  12345
! (1 row)
! 
! SELECT cast('12345' as domainvarchar);
!  domainvarchar 
! ---------------
!  12345
! (1 row)
! 
! -- Test tables using domains
! create table basictest
!            ( testint4 domainint4
!            , testtext domaintext
!            , testvarchar domainvarchar
!            , testnumeric domainnumeric
!            );
! INSERT INTO basictest values ('88', 'haha', 'short', '123.12');      -- Good
! INSERT INTO basictest values ('88', 'haha', 'short text', '123.12'); -- Bad varchar
! ERROR:  value too long for type character varying(5)
! INSERT INTO basictest values ('88', 'haha', 'short', '123.1212');    -- Truncate numeric
! -- Test copy
! COPY basictest (testvarchar) FROM stdin; -- fail
! ERROR:  value too long for type character varying(5)
! CONTEXT:  COPY basictest, line 1, column testvarchar: "notsoshorttext"
! COPY basictest (testvarchar) FROM stdin;
! select * from basictest;
!  testint4 | testtext | testvarchar | testnumeric 
! ----------+----------+-------------+-------------
!        88 | haha     | short       |      123.12
!        88 | haha     | short       |      123.12
!           |          | short       |            
! (3 rows)
! 
! -- check that domains inherit operations from base types
! select testtext || testvarchar as concat, testnumeric + 42 as sum
! from basictest;
!   concat   |  sum   
! -----------+--------
!  hahashort | 165.12
!  hahashort | 165.12
!            |       
! (3 rows)
! 
! -- check that union/case/coalesce type resolution handles domains properly
! select coalesce(4::domainint4, 7) is of (int4) as t;
!  t 
! ---
!  t
! (1 row)
! 
! select coalesce(4::domainint4, 7) is of (domainint4) as f;
!  f 
! ---
!  f
! (1 row)
! 
! select coalesce(4::domainint4, 7::domainint4) is of (domainint4) as t;
!  t 
! ---
!  t
! (1 row)
! 
! drop table basictest;
! drop domain domainvarchar restrict;
! drop domain domainnumeric restrict;
! drop domain domainint4 restrict;
! drop domain domaintext;
! -- Test domains over array types
! create domain domainint4arr int4[1];
! create domain domainchar4arr varchar(4)[2][3];
! create table domarrtest
!            ( testint4arr domainint4arr
!            , testchar4arr domainchar4arr
!             );
! INSERT INTO domarrtest values ('{2,2}', '{{"a","b"},{"c","d"}}');
! INSERT INTO domarrtest values ('{{2,2},{2,2}}', '{{"a","b"}}');
! INSERT INTO domarrtest values ('{2,2}', '{{"a","b"},{"c","d"},{"e","f"}}');
! INSERT INTO domarrtest values ('{2,2}', '{{"a"},{"c"}}');
! INSERT INTO domarrtest values (NULL, '{{"a","b","c"},{"d","e","f"}}');
! INSERT INTO domarrtest values (NULL, '{{"toolong","b","c"},{"d","e","f"}}');
! ERROR:  value too long for type character varying(4)
! select * from domarrtest;
!   testint4arr  |    testchar4arr     
! ---------------+---------------------
!  {2,2}         | {{a,b},{c,d}}
!  {{2,2},{2,2}} | {{a,b}}
!  {2,2}         | {{a,b},{c,d},{e,f}}
!  {2,2}         | {{a},{c}}
!                | {{a,b,c},{d,e,f}}
! (5 rows)
! 
! select testint4arr[1], testchar4arr[2:2] from domarrtest;
!  testint4arr | testchar4arr 
! -------------+--------------
!            2 | {{c,d}}
!              | {}
!            2 | {{c,d}}
!            2 | {{c}}
!              | {{d,e,f}}
! (5 rows)
! 
! select array_dims(testint4arr), array_dims(testchar4arr) from domarrtest;
!  array_dims | array_dims 
! ------------+------------
!  [1:2]      | [1:2][1:2]
!  [1:2][1:2] | [1:1][1:2]
!  [1:2]      | [1:3][1:2]
!  [1:2]      | [1:2][1:1]
!             | [1:2][1:3]
! (5 rows)
! 
! COPY domarrtest FROM stdin;
! COPY domarrtest FROM stdin;	-- fail
! ERROR:  value too long for type character varying(4)
! CONTEXT:  COPY domarrtest, line 1, column testchar4arr: "{qwerty,w,e}"
! select * from domarrtest;
!   testint4arr  |    testchar4arr     
! ---------------+---------------------
!  {2,2}         | {{a,b},{c,d}}
!  {{2,2},{2,2}} | {{a,b}}
!  {2,2}         | {{a,b},{c,d},{e,f}}
!  {2,2}         | {{a},{c}}
!                | {{a,b,c},{d,e,f}}
!  {3,4}         | {q,w,e}
!                | 
! (7 rows)
! 
! drop table domarrtest;
! drop domain domainint4arr restrict;
! drop domain domainchar4arr restrict;
! create domain dia as int[];
! select '{1,2,3}'::dia;
!    dia   
! ---------
!  {1,2,3}
! (1 row)
! 
! select array_dims('{1,2,3}'::dia);
!  array_dims 
! ------------
!  [1:3]
! (1 row)
! 
! select pg_typeof('{1,2,3}'::dia);
!  pg_typeof 
! -----------
!  dia
! (1 row)
! 
! select pg_typeof('{1,2,3}'::dia || 42); -- should be int[] not dia
!  pg_typeof 
! -----------
!  integer[]
! (1 row)
! 
! drop domain dia;
! create domain dnotnull varchar(15) NOT NULL;
! create domain dnull    varchar(15);
! create domain dcheck   varchar(15) NOT NULL CHECK (VALUE = 'a' OR VALUE = 'c' OR VALUE = 'd');
! create table nulltest
!            ( col1 dnotnull
!            , col2 dnotnull NULL  -- NOT NULL in the domain cannot be overridden
!            , col3 dnull    NOT NULL
!            , col4 dnull
!            , col5 dcheck CHECK (col5 IN ('c', 'd'))
!            );
! INSERT INTO nulltest DEFAULT VALUES;
! ERROR:  domain dnotnull does not allow null values
! INSERT INTO nulltest values ('a', 'b', 'c', 'd', 'c');  -- Good
! insert into nulltest values ('a', 'b', 'c', 'd', NULL);
! ERROR:  domain dcheck does not allow null values
! insert into nulltest values ('a', 'b', 'c', 'd', 'a');
! ERROR:  new row for relation "nulltest" violates check constraint "nulltest_col5_check"
! DETAIL:  Failing row contains (a, b, c, d, a).
! INSERT INTO nulltest values (NULL, 'b', 'c', 'd', 'd');
! ERROR:  domain dnotnull does not allow null values
! INSERT INTO nulltest values ('a', NULL, 'c', 'd', 'c');
! ERROR:  domain dnotnull does not allow null values
! INSERT INTO nulltest values ('a', 'b', NULL, 'd', 'c');
! ERROR:  null value in column "col3" violates not-null constraint
! DETAIL:  Failing row contains (a, b, null, d, c).
! INSERT INTO nulltest values ('a', 'b', 'c', NULL, 'd'); -- Good
! -- Test copy
! COPY nulltest FROM stdin; --fail
! ERROR:  null value in column "col3" violates not-null constraint
! DETAIL:  Failing row contains (a, b, null, d, d).
! CONTEXT:  COPY nulltest, line 1: "a	b	\N	d	d"
! COPY nulltest FROM stdin; --fail
! ERROR:  domain dcheck does not allow null values
! CONTEXT:  COPY nulltest, line 1, column col5: null input
! -- Last row is bad
! COPY nulltest FROM stdin;
! ERROR:  new row for relation "nulltest" violates check constraint "nulltest_col5_check"
! DETAIL:  Failing row contains (a, b, c, null, a).
! CONTEXT:  COPY nulltest, line 3: "a	b	c	\N	a"
! select * from nulltest;
!  col1 | col2 | col3 | col4 | col5 
! ------+------+------+------+------
!  a    | b    | c    | d    | c
!  a    | b    | c    |      | d
! (2 rows)
! 
! -- Test out coerced (casted) constraints
! SELECT cast('1' as dnotnull);
!  dnotnull 
! ----------
!  1
! (1 row)
! 
! SELECT cast(NULL as dnotnull); -- fail
! ERROR:  domain dnotnull does not allow null values
! SELECT cast(cast(NULL as dnull) as dnotnull); -- fail
! ERROR:  domain dnotnull does not allow null values
! SELECT cast(col4 as dnotnull) from nulltest; -- fail
! ERROR:  domain dnotnull does not allow null values
! -- cleanup
! drop table nulltest;
! drop domain dnotnull restrict;
! drop domain dnull restrict;
! drop domain dcheck restrict;
! create domain ddef1 int4 DEFAULT 3;
! create domain ddef2 oid DEFAULT '12';
! -- Type mixing, function returns int8
! create domain ddef3 text DEFAULT 5;
! create sequence ddef4_seq;
! create domain ddef4 int4 DEFAULT nextval('ddef4_seq');
! create domain ddef5 numeric(8,2) NOT NULL DEFAULT '12.12';
! create table defaulttest
!             ( col1 ddef1
!             , col2 ddef2
!             , col3 ddef3
!             , col4 ddef4 PRIMARY KEY
!             , col5 ddef1 NOT NULL DEFAULT NULL
!             , col6 ddef2 DEFAULT '88'
!             , col7 ddef4 DEFAULT 8000
!             , col8 ddef5
!             );
! insert into defaulttest(col4) values(0); -- fails, col5 defaults to null
! ERROR:  null value in column "col5" violates not-null constraint
! DETAIL:  Failing row contains (3, 12, 5, 0, null, 88, 8000, 12.12).
! alter table defaulttest alter column col5 drop default;
! insert into defaulttest default values; -- succeeds, inserts domain default
! -- We used to treat SET DEFAULT NULL as equivalent to DROP DEFAULT; wrong
! alter table defaulttest alter column col5 set default null;
! insert into defaulttest(col4) values(0); -- fails
! ERROR:  null value in column "col5" violates not-null constraint
! DETAIL:  Failing row contains (3, 12, 5, 0, null, 88, 8000, 12.12).
! alter table defaulttest alter column col5 drop default;
! insert into defaulttest default values;
! insert into defaulttest default values;
! -- Test defaults with copy
! COPY defaulttest(col5) FROM stdin;
! select * from defaulttest;
!  col1 | col2 | col3 | col4 | col5 | col6 | col7 | col8  
! ------+------+------+------+------+------+------+-------
!     3 |   12 | 5    |    1 |    3 |   88 | 8000 | 12.12
!     3 |   12 | 5    |    2 |    3 |   88 | 8000 | 12.12
!     3 |   12 | 5    |    3 |    3 |   88 | 8000 | 12.12
!     3 |   12 | 5    |    4 |   42 |   88 | 8000 | 12.12
! (4 rows)
! 
! drop table defaulttest cascade;
! -- Test ALTER DOMAIN .. NOT NULL
! create domain dnotnulltest integer;
! create table domnotnull
! ( col1 dnotnulltest
! , col2 dnotnulltest
! );
! insert into domnotnull default values;
! alter domain dnotnulltest set not null; -- fails
! ERROR:  column "col1" of table "domnotnull" contains null values
! update domnotnull set col1 = 5;
! alter domain dnotnulltest set not null; -- fails
! ERROR:  column "col2" of table "domnotnull" contains null values
! update domnotnull set col2 = 6;
! alter domain dnotnulltest set not null;
! update domnotnull set col1 = null; -- fails
! ERROR:  domain dnotnulltest does not allow null values
! alter domain dnotnulltest drop not null;
! update domnotnull set col1 = null;
! drop domain dnotnulltest cascade;
! NOTICE:  drop cascades to 2 other objects
! DETAIL:  drop cascades to table domnotnull column col1
! drop cascades to table domnotnull column col2
! -- Test ALTER DOMAIN .. DEFAULT ..
! create table domdeftest (col1 ddef1);
! insert into domdeftest default values;
! select * from domdeftest;
!  col1 
! ------
!     3
! (1 row)
! 
! alter domain ddef1 set default '42';
! insert into domdeftest default values;
! select * from domdeftest;
!  col1 
! ------
!     3
!    42
! (2 rows)
! 
! alter domain ddef1 drop default;
! insert into domdeftest default values;
! select * from domdeftest;
!  col1 
! ------
!     3
!    42
!      
! (3 rows)
! 
! drop table domdeftest;
! -- Test ALTER DOMAIN .. CONSTRAINT ..
! create domain con as integer;
! create table domcontest (col1 con);
! insert into domcontest values (1);
! insert into domcontest values (2);
! alter domain con add constraint t check (VALUE < 1); -- fails
! ERROR:  column "col1" of table "domcontest" contains values that violate the new constraint
! alter domain con add constraint t check (VALUE < 34);
! alter domain con add check (VALUE > 0);
! insert into domcontest values (-5); -- fails
! ERROR:  value for domain con violates check constraint "con_check"
! insert into domcontest values (42); -- fails
! ERROR:  value for domain con violates check constraint "t"
! insert into domcontest values (5);
! alter domain con drop constraint t;
! insert into domcontest values (-5); --fails
! ERROR:  value for domain con violates check constraint "con_check"
! insert into domcontest values (42);
! alter domain con drop constraint nonexistent;
! ERROR:  constraint "nonexistent" of domain "con" does not exist
! alter domain con drop constraint if exists nonexistent;
! NOTICE:  constraint "nonexistent" of domain "con" does not exist, skipping
! -- Test ALTER DOMAIN .. CONSTRAINT .. NOT VALID
! create domain things AS INT;
! CREATE TABLE thethings (stuff things);
! INSERT INTO thethings (stuff) VALUES (55);
! ALTER DOMAIN things ADD CONSTRAINT meow CHECK (VALUE < 11);
! ERROR:  column "stuff" of table "thethings" contains values that violate the new constraint
! ALTER DOMAIN things ADD CONSTRAINT meow CHECK (VALUE < 11) NOT VALID;
! ALTER DOMAIN things VALIDATE CONSTRAINT meow;
! ERROR:  column "stuff" of table "thethings" contains values that violate the new constraint
! UPDATE thethings SET stuff = 10;
! ALTER DOMAIN things VALIDATE CONSTRAINT meow;
! -- Confirm ALTER DOMAIN with RULES.
! create table domtab (col1 integer);
! create domain dom as integer;
! create view domview as select cast(col1 as dom) from domtab;
! insert into domtab (col1) values (null);
! insert into domtab (col1) values (5);
! select * from domview;
!  col1 
! ------
!      
!     5
! (2 rows)
! 
! alter domain dom set not null;
! select * from domview; -- fail
! ERROR:  domain dom does not allow null values
! alter domain dom drop not null;
! select * from domview;
!  col1 
! ------
!      
!     5
! (2 rows)
! 
! alter domain dom add constraint domchkgt6 check(value > 6);
! select * from domview; --fail
! ERROR:  value for domain dom violates check constraint "domchkgt6"
! alter domain dom drop constraint domchkgt6 restrict;
! select * from domview;
!  col1 
! ------
!      
!     5
! (2 rows)
! 
! -- cleanup
! drop domain ddef1 restrict;
! drop domain ddef2 restrict;
! drop domain ddef3 restrict;
! drop domain ddef4 restrict;
! drop domain ddef5 restrict;
! drop sequence ddef4_seq;
! -- Test domains over domains
! create domain vchar4 varchar(4);
! create domain dinter vchar4 check (substring(VALUE, 1, 1) = 'x');
! create domain dtop dinter check (substring(VALUE, 2, 1) = '1');
! select 'x123'::dtop;
!  dtop 
! ------
!  x123
! (1 row)
! 
! select 'x1234'::dtop; -- explicit coercion should truncate
!  dtop 
! ------
!  x123
! (1 row)
! 
! select 'y1234'::dtop; -- fail
! ERROR:  value for domain dtop violates check constraint "dinter_check"
! select 'y123'::dtop; -- fail
! ERROR:  value for domain dtop violates check constraint "dinter_check"
! select 'yz23'::dtop; -- fail
! ERROR:  value for domain dtop violates check constraint "dinter_check"
! select 'xz23'::dtop; -- fail
! ERROR:  value for domain dtop violates check constraint "dtop_check"
! create temp table dtest(f1 dtop);
! insert into dtest values('x123');
! insert into dtest values('x1234'); -- fail, implicit coercion
! ERROR:  value too long for type character varying(4)
! insert into dtest values('y1234'); -- fail, implicit coercion
! ERROR:  value too long for type character varying(4)
! insert into dtest values('y123'); -- fail
! ERROR:  value for domain dtop violates check constraint "dinter_check"
! insert into dtest values('yz23'); -- fail
! ERROR:  value for domain dtop violates check constraint "dinter_check"
! insert into dtest values('xz23'); -- fail
! ERROR:  value for domain dtop violates check constraint "dtop_check"
! drop table dtest;
! drop domain vchar4 cascade;
! NOTICE:  drop cascades to 2 other objects
! DETAIL:  drop cascades to type dinter
! drop cascades to type dtop
! -- Make sure that constraints of newly-added domain columns are
! -- enforced correctly, even if there's no default value for the new
! -- column. Per bug #1433
! create domain str_domain as text not null;
! create table domain_test (a int, b int);
! insert into domain_test values (1, 2);
! insert into domain_test values (1, 2);
! -- should fail
! alter table domain_test add column c str_domain;
! ERROR:  domain str_domain does not allow null values
! create domain str_domain2 as text check (value <> 'foo') default 'foo';
! -- should fail
! alter table domain_test add column d str_domain2;
! ERROR:  value for domain str_domain2 violates check constraint "str_domain2_check"
! -- Check that domain constraints on prepared statement parameters of
! -- unknown type are enforced correctly.
! create domain pos_int as int4 check (value > 0) not null;
! prepare s1 as select $1::pos_int = 10 as "is_ten";
! execute s1(10);
!  is_ten 
! --------
!  t
! (1 row)
! 
! execute s1(0); -- should fail
! ERROR:  value for domain pos_int violates check constraint "pos_int_check"
! execute s1(NULL); -- should fail
! ERROR:  domain pos_int does not allow null values
! -- Check that domain constraints on plpgsql function parameters, results,
! -- and local variables are enforced correctly.
! create function doubledecrement(p1 pos_int) returns pos_int as $$
! declare v pos_int;
! begin
!     return p1;
! end$$ language plpgsql;
! select doubledecrement(3); -- fail because of implicit null assignment
! ERROR:  domain pos_int does not allow null values
! CONTEXT:  PL/pgSQL function doubledecrement(pos_int) line 3 during statement block local variable initialization
! create or replace function doubledecrement(p1 pos_int) returns pos_int as $$
! declare v pos_int := 0;
! begin
!     return p1;
! end$$ language plpgsql;
! select doubledecrement(3); -- fail at initialization assignment
! ERROR:  value for domain pos_int violates check constraint "pos_int_check"
! CONTEXT:  PL/pgSQL function doubledecrement(pos_int) line 3 during statement block local variable initialization
! create or replace function doubledecrement(p1 pos_int) returns pos_int as $$
! declare v pos_int := 1;
! begin
!     v := p1 - 1;
!     return v - 1;
! end$$ language plpgsql;
! select doubledecrement(null); -- fail before call
! ERROR:  domain pos_int does not allow null values
! select doubledecrement(0); -- fail before call
! ERROR:  value for domain pos_int violates check constraint "pos_int_check"
! select doubledecrement(1); -- fail at assignment to v
! ERROR:  value for domain pos_int violates check constraint "pos_int_check"
! CONTEXT:  PL/pgSQL function doubledecrement(pos_int) line 4 at assignment
! select doubledecrement(2); -- fail at return
! ERROR:  value for domain pos_int violates check constraint "pos_int_check"
! CONTEXT:  PL/pgSQL function doubledecrement(pos_int) while casting return value to function's return type
! select doubledecrement(3); -- good
!  doubledecrement 
! -----------------
!                1
! (1 row)
! 
! -- Check that ALTER DOMAIN tests columns of derived types
! create domain posint as int4;
! -- Currently, this doesn't work for composite types, but verify it complains
! create type ddtest1 as (f1 posint);
! create table ddtest2(f1 ddtest1);
! insert into ddtest2 values(row(-1));
! alter domain posint add constraint c1 check(value >= 0);
! ERROR:  cannot alter type "posint" because column "ddtest2.f1" uses it
! drop table ddtest2;
! create table ddtest2(f1 ddtest1[]);
! insert into ddtest2 values('{(-1)}');
! alter domain posint add constraint c1 check(value >= 0);
! ERROR:  cannot alter type "posint" because column "ddtest2.f1" uses it
! drop table ddtest2;
! alter domain posint add constraint c1 check(value >= 0);
! create domain posint2 as posint check (value % 2 = 0);
! create table ddtest2(f1 posint2);
! insert into ddtest2 values(11); -- fail
! ERROR:  value for domain posint2 violates check constraint "posint2_check"
! insert into ddtest2 values(-2); -- fail
! ERROR:  value for domain posint2 violates check constraint "c1"
! insert into ddtest2 values(2);
! alter domain posint add constraint c2 check(value >= 10); -- fail
! ERROR:  column "f1" of table "ddtest2" contains values that violate the new constraint
! alter domain posint add constraint c2 check(value > 0); -- OK
! drop table ddtest2;
! drop type ddtest1;
! drop domain posint cascade;
! NOTICE:  drop cascades to type posint2
! --
! -- Check enforcement of domain-related typmod in plpgsql (bug #5717)
! --
! create or replace function array_elem_check(numeric) returns numeric as $$
! declare
!   x numeric(4,2)[1];
! begin
!   x[1] := $1;
!   return x[1];
! end$$ language plpgsql;
! select array_elem_check(121.00);
! ERROR:  numeric field overflow
! DETAIL:  A field with precision 4, scale 2 must round to an absolute value less than 10^2.
! CONTEXT:  PL/pgSQL function array_elem_check(numeric) line 5 at assignment
! select array_elem_check(1.23456);
!  array_elem_check 
! ------------------
!              1.23
! (1 row)
! 
! create domain mynums as numeric(4,2)[1];
! create or replace function array_elem_check(numeric) returns numeric as $$
! declare
!   x mynums;
! begin
!   x[1] := $1;
!   return x[1];
! end$$ language plpgsql;
! select array_elem_check(121.00);
! ERROR:  numeric field overflow
! DETAIL:  A field with precision 4, scale 2 must round to an absolute value less than 10^2.
! CONTEXT:  PL/pgSQL function array_elem_check(numeric) line 5 at assignment
! select array_elem_check(1.23456);
!  array_elem_check 
! ------------------
!              1.23
! (1 row)
! 
! create domain mynums2 as mynums;
! create or replace function array_elem_check(numeric) returns numeric as $$
! declare
!   x mynums2;
! begin
!   x[1] := $1;
!   return x[1];
! end$$ language plpgsql;
! select array_elem_check(121.00);
! ERROR:  numeric field overflow
! DETAIL:  A field with precision 4, scale 2 must round to an absolute value less than 10^2.
! CONTEXT:  PL/pgSQL function array_elem_check(numeric) line 5 at assignment
! select array_elem_check(1.23456);
!  array_elem_check 
! ------------------
!              1.23
! (1 row)
! 
! drop function array_elem_check(numeric);
! --
! -- Check enforcement of array-level domain constraints
! --
! create domain orderedpair as int[2] check (value[1] < value[2]);
! select array[1,2]::orderedpair;
!  array 
! -------
!  {1,2}
! (1 row)
! 
! select array[2,1]::orderedpair;  -- fail
! ERROR:  value for domain orderedpair violates check constraint "orderedpair_check"
! create temp table op (f1 orderedpair);
! insert into op values (array[1,2]);
! insert into op values (array[2,1]);  -- fail
! ERROR:  value for domain orderedpair violates check constraint "orderedpair_check"
! update op set f1[2] = 3;
! update op set f1[2] = 0;  -- fail
! ERROR:  value for domain orderedpair violates check constraint "orderedpair_check"
! select * from op;
!   f1   
! -------
!  {1,3}
! (1 row)
! 
! create or replace function array_elem_check(int) returns int as $$
! declare
!   x orderedpair := '{1,2}';
! begin
!   x[2] := $1;
!   return x[2];
! end$$ language plpgsql;
! select array_elem_check(3);
!  array_elem_check 
! ------------------
!                 3
! (1 row)
! 
! select array_elem_check(-1);
! ERROR:  value for domain orderedpair violates check constraint "orderedpair_check"
! CONTEXT:  PL/pgSQL function array_elem_check(integer) line 5 at assignment
! drop function array_elem_check(int);
! --
! -- Check enforcement of changing constraints in plpgsql
! --
! create domain di as int;
! create function dom_check(int) returns di as $$
! declare d di;
! begin
!   d := $1;
!   return d;
! end
! $$ language plpgsql immutable;
! select dom_check(0);
!  dom_check 
! -----------
!          0
! (1 row)
! 
! alter domain di add constraint pos check (value > 0);
! select dom_check(0); -- fail
! ERROR:  value for domain di violates check constraint "pos"
! CONTEXT:  PL/pgSQL function dom_check(integer) line 4 at assignment
! alter domain di drop constraint pos;
! select dom_check(0);
!  dom_check 
! -----------
!          0
! (1 row)
! 
! drop function dom_check(int);
! drop domain di;
! --
! -- Check use of a (non-inline-able) SQL function in a domain constraint;
! -- this has caused issues in the past
! --
! create function sql_is_distinct_from(anyelement, anyelement)
! returns boolean language sql
! as 'select $1 is distinct from $2 limit 1';
! create domain inotnull int
!   check (sql_is_distinct_from(value, null));
! select 1::inotnull;
!  inotnull 
! ----------
!         1
! (1 row)
! 
! select null::inotnull;
! ERROR:  value for domain inotnull violates check constraint "inotnull_check"
! create table dom_table (x inotnull);
! insert into dom_table values ('1');
! insert into dom_table values (1);
! insert into dom_table values (null);
! ERROR:  value for domain inotnull violates check constraint "inotnull_check"
! drop table dom_table;
! drop domain inotnull;
! drop function sql_is_distinct_from(anyelement, anyelement);
! --
! -- Renaming
! --
! create domain testdomain1 as int;
! alter domain testdomain1 rename to testdomain2;
! alter type testdomain2 rename to testdomain3;  -- alter type also works
! drop domain testdomain3;
! --
! -- Renaming domain constraints
! --
! create domain testdomain1 as int constraint unsigned check (value > 0);
! alter domain testdomain1 rename constraint unsigned to unsigned_foo;
! alter domain testdomain1 drop constraint unsigned_foo;
! drop domain testdomain1;
--- 1 ----
! psql: FATAL:  the database system is in recovery mode

======================================================================

*** /home/claudiofreire/src/postgresql.work/src/test/regress/expected/rangefuncs.out	2016-09-05 20:45:48.932033175 -0300
--- /home/claudiofreire/src/postgresql.work/src/test/regress/results/rangefuncs.out	2016-09-12 12:14:51.891413917 -0300
***************
*** 1,2108 ****
! SELECT name, setting FROM pg_settings WHERE name LIKE 'enable%';
!          name         | setting 
! ----------------------+---------
!  enable_bitmapscan    | on
!  enable_hashagg       | on
!  enable_hashjoin      | on
!  enable_indexonlyscan | on
!  enable_indexscan     | on
!  enable_material      | on
!  enable_mergejoin     | on
!  enable_nestloop      | on
!  enable_seqscan       | on
!  enable_sort          | on
!  enable_tidscan       | on
! (11 rows)
! 
! CREATE TABLE foo2(fooid int, f2 int);
! INSERT INTO foo2 VALUES(1, 11);
! INSERT INTO foo2 VALUES(2, 22);
! INSERT INTO foo2 VALUES(1, 111);
! CREATE FUNCTION foot(int) returns setof foo2 as 'SELECT * FROM foo2 WHERE fooid = $1 ORDER BY f2;' LANGUAGE SQL;
! -- function with ORDINALITY
! select * from foot(1) with ordinality as z(a,b,ord);
!  a |  b  | ord 
! ---+-----+-----
!  1 |  11 |   1
!  1 | 111 |   2
! (2 rows)
! 
! select * from foot(1) with ordinality as z(a,b,ord) where b > 100;   -- ordinal 2, not 1
!  a |  b  | ord 
! ---+-----+-----
!  1 | 111 |   2
! (1 row)
! 
! -- ordinality vs. column names and types
! select a,b,ord from foot(1) with ordinality as z(a,b,ord);
!  a |  b  | ord 
! ---+-----+-----
!  1 |  11 |   1
!  1 | 111 |   2
! (2 rows)
! 
! select a,ord from unnest(array['a','b']) with ordinality as z(a,ord);
!  a | ord 
! ---+-----
!  a |   1
!  b |   2
! (2 rows)
! 
! select * from unnest(array['a','b']) with ordinality as z(a,ord);
!  a | ord 
! ---+-----
!  a |   1
!  b |   2
! (2 rows)
! 
! select a,ord from unnest(array[1.0::float8]) with ordinality as z(a,ord);
!  a | ord 
! ---+-----
!  1 |   1
! (1 row)
! 
! select * from unnest(array[1.0::float8]) with ordinality as z(a,ord);
!  a | ord 
! ---+-----
!  1 |   1
! (1 row)
! 
! select row_to_json(s.*) from generate_series(11,14) with ordinality s;
!        row_to_json       
! -------------------------
!  {"s":11,"ordinality":1}
!  {"s":12,"ordinality":2}
!  {"s":13,"ordinality":3}
!  {"s":14,"ordinality":4}
! (4 rows)
! 
! -- ordinality vs. views
! create temporary view vw_ord as select * from (values (1)) v(n) join foot(1) with ordinality as z(a,b,ord) on (n=ord);
! select * from vw_ord;
!  n | a | b  | ord 
! ---+---+----+-----
!  1 | 1 | 11 |   1
! (1 row)
! 
! select definition from pg_views where viewname='vw_ord';
!                              definition                              
! ---------------------------------------------------------------------
!   SELECT v.n,                                                       +
!      z.a,                                                           +
!      z.b,                                                           +
!      z.ord                                                          +
!     FROM (( VALUES (1)) v(n)                                        +
!       JOIN foot(1) WITH ORDINALITY z(a, b, ord) ON ((v.n = z.ord)));
! (1 row)
! 
! drop view vw_ord;
! -- multiple functions
! select * from rows from(foot(1),foot(2)) with ordinality as z(a,b,c,d,ord);
!  a |  b  | c | d  | ord 
! ---+-----+---+----+-----
!  1 |  11 | 2 | 22 |   1
!  1 | 111 |   |    |   2
! (2 rows)
! 
! create temporary view vw_ord as select * from (values (1)) v(n) join rows from(foot(1),foot(2)) with ordinality as z(a,b,c,d,ord) on (n=ord);
! select * from vw_ord;
!  n | a | b  | c | d  | ord 
! ---+---+----+---+----+-----
!  1 | 1 | 11 | 2 | 22 |   1
! (1 row)
! 
! select definition from pg_views where viewname='vw_ord';
!                                           definition                                           
! -----------------------------------------------------------------------------------------------
!   SELECT v.n,                                                                                 +
!      z.a,                                                                                     +
!      z.b,                                                                                     +
!      z.c,                                                                                     +
!      z.d,                                                                                     +
!      z.ord                                                                                    +
!     FROM (( VALUES (1)) v(n)                                                                  +
!       JOIN ROWS FROM(foot(1), foot(2)) WITH ORDINALITY z(a, b, c, d, ord) ON ((v.n = z.ord)));
! (1 row)
! 
! drop view vw_ord;
! -- expansions of unnest()
! select * from unnest(array[10,20],array['foo','bar'],array[1.0]);
!  unnest | unnest | unnest 
! --------+--------+--------
!      10 | foo    |    1.0
!      20 | bar    |       
! (2 rows)
! 
! select * from unnest(array[10,20],array['foo','bar'],array[1.0]) with ordinality as z(a,b,c,ord);
!  a  |  b  |  c  | ord 
! ----+-----+-----+-----
!  10 | foo | 1.0 |   1
!  20 | bar |     |   2
! (2 rows)
! 
! select * from rows from(unnest(array[10,20],array['foo','bar'],array[1.0])) with ordinality as z(a,b,c,ord);
!  a  |  b  |  c  | ord 
! ----+-----+-----+-----
!  10 | foo | 1.0 |   1
!  20 | bar |     |   2
! (2 rows)
! 
! select * from rows from(unnest(array[10,20],array['foo','bar']), generate_series(101,102)) with ordinality as z(a,b,c,ord);
!  a  |  b  |  c  | ord 
! ----+-----+-----+-----
!  10 | foo | 101 |   1
!  20 | bar | 102 |   2
! (2 rows)
! 
! create temporary view vw_ord as select * from unnest(array[10,20],array['foo','bar'],array[1.0]) as z(a,b,c);
! select * from vw_ord;
!  a  |  b  |  c  
! ----+-----+-----
!  10 | foo | 1.0
!  20 | bar |    
! (2 rows)
! 
! select definition from pg_views where viewname='vw_ord';
!                                        definition                                       
! ----------------------------------------------------------------------------------------
!   SELECT z.a,                                                                          +
!      z.b,                                                                              +
!      z.c                                                                               +
!     FROM UNNEST(ARRAY[10, 20], ARRAY['foo'::text, 'bar'::text], ARRAY[1.0]) z(a, b, c);
! (1 row)
! 
! drop view vw_ord;
! create temporary view vw_ord as select * from rows from(unnest(array[10,20],array['foo','bar'],array[1.0])) as z(a,b,c);
! select * from vw_ord;
!  a  |  b  |  c  
! ----+-----+-----
!  10 | foo | 1.0
!  20 | bar |    
! (2 rows)
! 
! select definition from pg_views where viewname='vw_ord';
!                                        definition                                       
! ----------------------------------------------------------------------------------------
!   SELECT z.a,                                                                          +
!      z.b,                                                                              +
!      z.c                                                                               +
!     FROM UNNEST(ARRAY[10, 20], ARRAY['foo'::text, 'bar'::text], ARRAY[1.0]) z(a, b, c);
! (1 row)
! 
! drop view vw_ord;
! create temporary view vw_ord as select * from rows from(unnest(array[10,20],array['foo','bar']), generate_series(1,2)) as z(a,b,c);
! select * from vw_ord;
!  a  |  b  | c 
! ----+-----+---
!  10 | foo | 1
!  20 | bar | 2
! (2 rows)
! 
! select definition from pg_views where viewname='vw_ord';
!                                                       definition                                                      
! ----------------------------------------------------------------------------------------------------------------------
!   SELECT z.a,                                                                                                        +
!      z.b,                                                                                                            +
!      z.c                                                                                                             +
!     FROM ROWS FROM(unnest(ARRAY[10, 20]), unnest(ARRAY['foo'::text, 'bar'::text]), generate_series(1, 2)) z(a, b, c);
! (1 row)
! 
! drop view vw_ord;
! -- ordinality and multiple functions vs. rewind and reverse scan
! begin;
! declare foo scroll cursor for select * from rows from(generate_series(1,5),generate_series(1,2)) with ordinality as g(i,j,o);
! fetch all from foo;
!  i | j | o 
! ---+---+---
!  1 | 1 | 1
!  2 | 2 | 2
!  3 |   | 3
!  4 |   | 4
!  5 |   | 5
! (5 rows)
! 
! fetch backward all from foo;
!  i | j | o 
! ---+---+---
!  5 |   | 5
!  4 |   | 4
!  3 |   | 3
!  2 | 2 | 2
!  1 | 1 | 1
! (5 rows)
! 
! fetch all from foo;
!  i | j | o 
! ---+---+---
!  1 | 1 | 1
!  2 | 2 | 2
!  3 |   | 3
!  4 |   | 4
!  5 |   | 5
! (5 rows)
! 
! fetch next from foo;
!  i | j | o 
! ---+---+---
! (0 rows)
! 
! fetch next from foo;
!  i | j | o 
! ---+---+---
! (0 rows)
! 
! fetch prior from foo;
!  i | j | o 
! ---+---+---
!  5 |   | 5
! (1 row)
! 
! fetch absolute 1 from foo;
!  i | j | o 
! ---+---+---
!  1 | 1 | 1
! (1 row)
! 
! fetch next from foo;
!  i | j | o 
! ---+---+---
!  2 | 2 | 2
! (1 row)
! 
! fetch next from foo;
!  i | j | o 
! ---+---+---
!  3 |   | 3
! (1 row)
! 
! fetch next from foo;
!  i | j | o 
! ---+---+---
!  4 |   | 4
! (1 row)
! 
! fetch prior from foo;
!  i | j | o 
! ---+---+---
!  3 |   | 3
! (1 row)
! 
! fetch prior from foo;
!  i | j | o 
! ---+---+---
!  2 | 2 | 2
! (1 row)
! 
! fetch prior from foo;
!  i | j | o 
! ---+---+---
!  1 | 1 | 1
! (1 row)
! 
! commit;
! -- function with implicit LATERAL
! select * from foo2, foot(foo2.fooid) z where foo2.f2 = z.f2;
!  fooid | f2  | fooid | f2  
! -------+-----+-------+-----
!      1 |  11 |     1 |  11
!      2 |  22 |     2 |  22
!      1 | 111 |     1 | 111
! (3 rows)
! 
! -- function with implicit LATERAL and explicit ORDINALITY
! select * from foo2, foot(foo2.fooid) with ordinality as z(fooid,f2,ord) where foo2.f2 = z.f2;
!  fooid | f2  | fooid | f2  | ord 
! -------+-----+-------+-----+-----
!      1 |  11 |     1 |  11 |   1
!      2 |  22 |     2 |  22 |   1
!      1 | 111 |     1 | 111 |   2
! (3 rows)
! 
! -- function in subselect
! select * from foo2 where f2 in (select f2 from foot(foo2.fooid) z where z.fooid = foo2.fooid) ORDER BY 1,2;
!  fooid | f2  
! -------+-----
!      1 |  11
!      1 | 111
!      2 |  22
! (3 rows)
! 
! -- function in subselect
! select * from foo2 where f2 in (select f2 from foot(1) z where z.fooid = foo2.fooid) ORDER BY 1,2;
!  fooid | f2  
! -------+-----
!      1 |  11
!      1 | 111
! (2 rows)
! 
! -- function in subselect
! select * from foo2 where f2 in (select f2 from foot(foo2.fooid) z where z.fooid = 1) ORDER BY 1,2;
!  fooid | f2  
! -------+-----
!      1 |  11
!      1 | 111
! (2 rows)
! 
! -- nested functions
! select foot.fooid, foot.f2 from foot(sin(pi()/2)::int) ORDER BY 1,2;
!  fooid | f2  
! -------+-----
!      1 |  11
!      1 | 111
! (2 rows)
! 
! CREATE TABLE foo (fooid int, foosubid int, fooname text, primary key(fooid,foosubid));
! INSERT INTO foo VALUES(1,1,'Joe');
! INSERT INTO foo VALUES(1,2,'Ed');
! INSERT INTO foo VALUES(2,1,'Mary');
! -- sql, proretset = f, prorettype = b
! CREATE FUNCTION getfoo1(int) RETURNS int AS 'SELECT $1;' LANGUAGE SQL;
! SELECT * FROM getfoo1(1) AS t1;
!  t1 
! ----
!   1
! (1 row)
! 
! SELECT * FROM getfoo1(1) WITH ORDINALITY AS t1(v,o);
!  v | o 
! ---+---
!  1 | 1
! (1 row)
! 
! CREATE VIEW vw_getfoo AS SELECT * FROM getfoo1(1);
! SELECT * FROM vw_getfoo;
!  getfoo1 
! ---------
!        1
! (1 row)
! 
! DROP VIEW vw_getfoo;
! CREATE VIEW vw_getfoo AS SELECT * FROM getfoo1(1) WITH ORDINALITY as t1(v,o);
! SELECT * FROM vw_getfoo;
!  v | o 
! ---+---
!  1 | 1
! (1 row)
! 
! DROP VIEW vw_getfoo;
! -- sql, proretset = t, prorettype = b
! CREATE FUNCTION getfoo2(int) RETURNS setof int AS 'SELECT fooid FROM foo WHERE fooid = $1;' LANGUAGE SQL;
! SELECT * FROM getfoo2(1) AS t1;
!  t1 
! ----
!   1
!   1
! (2 rows)
! 
! SELECT * FROM getfoo2(1) WITH ORDINALITY AS t1(v,o);
!  v | o 
! ---+---
!  1 | 1
!  1 | 2
! (2 rows)
! 
! CREATE VIEW vw_getfoo AS SELECT * FROM getfoo2(1);
! SELECT * FROM vw_getfoo;
!  getfoo2 
! ---------
!        1
!        1
! (2 rows)
! 
! DROP VIEW vw_getfoo;
! CREATE VIEW vw_getfoo AS SELECT * FROM getfoo2(1) WITH ORDINALITY AS t1(v,o);
! SELECT * FROM vw_getfoo;
!  v | o 
! ---+---
!  1 | 1
!  1 | 2
! (2 rows)
! 
! DROP VIEW vw_getfoo;
! -- sql, proretset = t, prorettype = b
! CREATE FUNCTION getfoo3(int) RETURNS setof text AS 'SELECT fooname FROM foo WHERE fooid = $1;' LANGUAGE SQL;
! SELECT * FROM getfoo3(1) AS t1;
!  t1  
! -----
!  Joe
!  Ed
! (2 rows)
! 
! SELECT * FROM getfoo3(1) WITH ORDINALITY AS t1(v,o);
!   v  | o 
! -----+---
!  Joe | 1
!  Ed  | 2
! (2 rows)
! 
! CREATE VIEW vw_getfoo AS SELECT * FROM getfoo3(1);
! SELECT * FROM vw_getfoo;
!  getfoo3 
! ---------
!  Joe
!  Ed
! (2 rows)
! 
! DROP VIEW vw_getfoo;
! CREATE VIEW vw_getfoo AS SELECT * FROM getfoo3(1) WITH ORDINALITY AS t1(v,o);
! SELECT * FROM vw_getfoo;
!   v  | o 
! -----+---
!  Joe | 1
!  Ed  | 2
! (2 rows)
! 
! DROP VIEW vw_getfoo;
! -- sql, proretset = f, prorettype = c
! CREATE FUNCTION getfoo4(int) RETURNS foo AS 'SELECT * FROM foo WHERE fooid = $1;' LANGUAGE SQL;
! SELECT * FROM getfoo4(1) AS t1;
!  fooid | foosubid | fooname 
! -------+----------+---------
!      1 |        1 | Joe
! (1 row)
! 
! SELECT * FROM getfoo4(1) WITH ORDINALITY AS t1(a,b,c,o);
!  a | b |  c  | o 
! ---+---+-----+---
!  1 | 1 | Joe | 1
! (1 row)
! 
! CREATE VIEW vw_getfoo AS SELECT * FROM getfoo4(1);
! SELECT * FROM vw_getfoo;
!  fooid | foosubid | fooname 
! -------+----------+---------
!      1 |        1 | Joe
! (1 row)
! 
! DROP VIEW vw_getfoo;
! CREATE VIEW vw_getfoo AS SELECT * FROM getfoo4(1) WITH ORDINALITY AS t1(a,b,c,o);
! SELECT * FROM vw_getfoo;
!  a | b |  c  | o 
! ---+---+-----+---
!  1 | 1 | Joe | 1
! (1 row)
! 
! DROP VIEW vw_getfoo;
! -- sql, proretset = t, prorettype = c
! CREATE FUNCTION getfoo5(int) RETURNS setof foo AS 'SELECT * FROM foo WHERE fooid = $1;' LANGUAGE SQL;
! SELECT * FROM getfoo5(1) AS t1;
!  fooid | foosubid | fooname 
! -------+----------+---------
!      1 |        1 | Joe
!      1 |        2 | Ed
! (2 rows)
! 
! SELECT * FROM getfoo5(1) WITH ORDINALITY AS t1(a,b,c,o);
!  a | b |  c  | o 
! ---+---+-----+---
!  1 | 1 | Joe | 1
!  1 | 2 | Ed  | 2
! (2 rows)
! 
! CREATE VIEW vw_getfoo AS SELECT * FROM getfoo5(1);
! SELECT * FROM vw_getfoo;
!  fooid | foosubid | fooname 
! -------+----------+---------
!      1 |        1 | Joe
!      1 |        2 | Ed
! (2 rows)
! 
! DROP VIEW vw_getfoo;
! CREATE VIEW vw_getfoo AS SELECT * FROM getfoo5(1) WITH ORDINALITY AS t1(a,b,c,o);
! SELECT * FROM vw_getfoo;
!  a | b |  c  | o 
! ---+---+-----+---
!  1 | 1 | Joe | 1
!  1 | 2 | Ed  | 2
! (2 rows)
! 
! DROP VIEW vw_getfoo;
! -- sql, proretset = f, prorettype = record
! CREATE FUNCTION getfoo6(int) RETURNS RECORD AS 'SELECT * FROM foo WHERE fooid = $1;' LANGUAGE SQL;
! SELECT * FROM getfoo6(1) AS t1(fooid int, foosubid int, fooname text);
!  fooid | foosubid | fooname 
! -------+----------+---------
!      1 |        1 | Joe
! (1 row)
! 
! SELECT * FROM ROWS FROM( getfoo6(1) AS (fooid int, foosubid int, fooname text) ) WITH ORDINALITY;
!  fooid | foosubid | fooname | ordinality 
! -------+----------+---------+------------
!      1 |        1 | Joe     |          1
! (1 row)
! 
! CREATE VIEW vw_getfoo AS SELECT * FROM getfoo6(1) AS
! (fooid int, foosubid int, fooname text);
! SELECT * FROM vw_getfoo;
!  fooid | foosubid | fooname 
! -------+----------+---------
!      1 |        1 | Joe
! (1 row)
! 
! DROP VIEW vw_getfoo;
! CREATE VIEW vw_getfoo AS
!   SELECT * FROM ROWS FROM( getfoo6(1) AS (fooid int, foosubid int, fooname text) )
!                 WITH ORDINALITY;
! SELECT * FROM vw_getfoo;
!  fooid | foosubid | fooname | ordinality 
! -------+----------+---------+------------
!      1 |        1 | Joe     |          1
! (1 row)
! 
! DROP VIEW vw_getfoo;
! -- sql, proretset = t, prorettype = record
! CREATE FUNCTION getfoo7(int) RETURNS setof record AS 'SELECT * FROM foo WHERE fooid = $1;' LANGUAGE SQL;
! SELECT * FROM getfoo7(1) AS t1(fooid int, foosubid int, fooname text);
!  fooid | foosubid | fooname 
! -------+----------+---------
!      1 |        1 | Joe
!      1 |        2 | Ed
! (2 rows)
! 
! SELECT * FROM ROWS FROM( getfoo7(1) AS (fooid int, foosubid int, fooname text) ) WITH ORDINALITY;
!  fooid | foosubid | fooname | ordinality 
! -------+----------+---------+------------
!      1 |        1 | Joe     |          1
!      1 |        2 | Ed      |          2
! (2 rows)
! 
! CREATE VIEW vw_getfoo AS SELECT * FROM getfoo7(1) AS
! (fooid int, foosubid int, fooname text);
! SELECT * FROM vw_getfoo;
!  fooid | foosubid | fooname 
! -------+----------+---------
!      1 |        1 | Joe
!      1 |        2 | Ed
! (2 rows)
! 
! DROP VIEW vw_getfoo;
! CREATE VIEW vw_getfoo AS
!   SELECT * FROM ROWS FROM( getfoo7(1) AS (fooid int, foosubid int, fooname text) )
!                 WITH ORDINALITY;
! SELECT * FROM vw_getfoo;
!  fooid | foosubid | fooname | ordinality 
! -------+----------+---------+------------
!      1 |        1 | Joe     |          1
!      1 |        2 | Ed      |          2
! (2 rows)
! 
! DROP VIEW vw_getfoo;
! -- plpgsql, proretset = f, prorettype = b
! CREATE FUNCTION getfoo8(int) RETURNS int AS 'DECLARE fooint int; BEGIN SELECT fooid into fooint FROM foo WHERE fooid = $1; RETURN fooint; END;' LANGUAGE plpgsql;
! SELECT * FROM getfoo8(1) AS t1;
!  t1 
! ----
!   1
! (1 row)
! 
! SELECT * FROM getfoo8(1) WITH ORDINALITY AS t1(v,o);
!  v | o 
! ---+---
!  1 | 1
! (1 row)
! 
! CREATE VIEW vw_getfoo AS SELECT * FROM getfoo8(1);
! SELECT * FROM vw_getfoo;
!  getfoo8 
! ---------
!        1
! (1 row)
! 
! DROP VIEW vw_getfoo;
! CREATE VIEW vw_getfoo AS SELECT * FROM getfoo8(1) WITH ORDINALITY AS t1(v,o);
! SELECT * FROM vw_getfoo;
!  v | o 
! ---+---
!  1 | 1
! (1 row)
! 
! DROP VIEW vw_getfoo;
! -- plpgsql, proretset = f, prorettype = c
! CREATE FUNCTION getfoo9(int) RETURNS foo AS 'DECLARE footup foo%ROWTYPE; BEGIN SELECT * into footup FROM foo WHERE fooid = $1; RETURN footup; END;' LANGUAGE plpgsql;
! SELECT * FROM getfoo9(1) AS t1;
!  fooid | foosubid | fooname 
! -------+----------+---------
!      1 |        1 | Joe
! (1 row)
! 
! SELECT * FROM getfoo9(1) WITH ORDINALITY AS t1(a,b,c,o);
!  a | b |  c  | o 
! ---+---+-----+---
!  1 | 1 | Joe | 1
! (1 row)
! 
! CREATE VIEW vw_getfoo AS SELECT * FROM getfoo9(1);
! SELECT * FROM vw_getfoo;
!  fooid | foosubid | fooname 
! -------+----------+---------
!      1 |        1 | Joe
! (1 row)
! 
! DROP VIEW vw_getfoo;
! CREATE VIEW vw_getfoo AS SELECT * FROM getfoo9(1) WITH ORDINALITY AS t1(a,b,c,o);
! SELECT * FROM vw_getfoo;
!  a | b |  c  | o 
! ---+---+-----+---
!  1 | 1 | Joe | 1
! (1 row)
! 
! DROP VIEW vw_getfoo;
! -- mix 'n match kinds, to exercise expandRTE and related logic
! select * from rows from(getfoo1(1),getfoo2(1),getfoo3(1),getfoo4(1),getfoo5(1),
!                     getfoo6(1) AS (fooid int, foosubid int, fooname text),
!                     getfoo7(1) AS (fooid int, foosubid int, fooname text),
!                     getfoo8(1),getfoo9(1))
!               with ordinality as t1(a,b,c,d,e,f,g,h,i,j,k,l,m,o,p,q,r,s,t,u);
!  a | b |  c  | d | e |  f  | g | h |  i  | j | k |  l  | m | o |  p  | q | r | s |  t  | u 
! ---+---+-----+---+---+-----+---+---+-----+---+---+-----+---+---+-----+---+---+---+-----+---
!  1 | 1 | Joe | 1 | 1 | Joe | 1 | 1 | Joe | 1 | 1 | Joe | 1 | 1 | Joe | 1 | 1 | 1 | Joe | 1
!    | 1 | Ed  |   |   |     | 1 | 2 | Ed  |   |   |     | 1 | 2 | Ed  |   |   |   |     | 2
! (2 rows)
! 
! select * from rows from(getfoo9(1),getfoo8(1),
!                     getfoo7(1) AS (fooid int, foosubid int, fooname text),
!                     getfoo6(1) AS (fooid int, foosubid int, fooname text),
!                     getfoo5(1),getfoo4(1),getfoo3(1),getfoo2(1),getfoo1(1))
!               with ordinality as t1(a,b,c,d,e,f,g,h,i,j,k,l,m,o,p,q,r,s,t,u);
!  a | b |  c  | d | e | f |  g  | h | i |  j  | k | l |  m  | o | p |  q  |  r  | s | t | u 
! ---+---+-----+---+---+---+-----+---+---+-----+---+---+-----+---+---+-----+-----+---+---+---
!  1 | 1 | Joe | 1 | 1 | 1 | Joe | 1 | 1 | Joe | 1 | 1 | Joe | 1 | 1 | Joe | Joe | 1 | 1 | 1
!    |   |     |   | 1 | 2 | Ed  |   |   |     | 1 | 2 | Ed  |   |   |     | Ed  | 1 |   | 2
! (2 rows)
! 
! create temporary view vw_foo as
!   select * from rows from(getfoo9(1),
!                       getfoo7(1) AS (fooid int, foosubid int, fooname text),
!                       getfoo1(1))
!                 with ordinality as t1(a,b,c,d,e,f,g,n);
! select * from vw_foo;
!  a | b |  c  | d | e |  f  | g | n 
! ---+---+-----+---+---+-----+---+---
!  1 | 1 | Joe | 1 | 1 | Joe | 1 | 1
!    |   |     | 1 | 2 | Ed  |   | 2
! (2 rows)
! 
! select pg_get_viewdef('vw_foo');
!                                                                     pg_get_viewdef                                                                    
! ------------------------------------------------------------------------------------------------------------------------------------------------------
!   SELECT t1.a,                                                                                                                                       +
!      t1.b,                                                                                                                                           +
!      t1.c,                                                                                                                                           +
!      t1.d,                                                                                                                                           +
!      t1.e,                                                                                                                                           +
!      t1.f,                                                                                                                                           +
!      t1.g,                                                                                                                                           +
!      t1.n                                                                                                                                            +
!     FROM ROWS FROM(getfoo9(1), getfoo7(1) AS (fooid integer, foosubid integer, fooname text), getfoo1(1)) WITH ORDINALITY t1(a, b, c, d, e, f, g, n);
! (1 row)
! 
! drop view vw_foo;
! DROP FUNCTION getfoo1(int);
! DROP FUNCTION getfoo2(int);
! DROP FUNCTION getfoo3(int);
! DROP FUNCTION getfoo4(int);
! DROP FUNCTION getfoo5(int);
! DROP FUNCTION getfoo6(int);
! DROP FUNCTION getfoo7(int);
! DROP FUNCTION getfoo8(int);
! DROP FUNCTION getfoo9(int);
! DROP FUNCTION foot(int);
! DROP TABLE foo2;
! DROP TABLE foo;
! -- Rescan tests --
! CREATE TEMPORARY SEQUENCE foo_rescan_seq1;
! CREATE TEMPORARY SEQUENCE foo_rescan_seq2;
! CREATE TYPE foo_rescan_t AS (i integer, s bigint);
! CREATE FUNCTION foo_sql(int,int) RETURNS setof foo_rescan_t AS 'SELECT i, nextval(''foo_rescan_seq1'') FROM generate_series($1,$2) i;' LANGUAGE SQL;
! -- plpgsql functions use materialize mode
! CREATE FUNCTION foo_mat(int,int) RETURNS setof foo_rescan_t AS 'begin for i in $1..$2 loop return next (i, nextval(''foo_rescan_seq2'')); end loop; end;' LANGUAGE plpgsql;
! --invokes ExecReScanFunctionScan - all these cases should materialize the function only once
! -- LEFT JOIN on a condition that the planner can't prove to be true is used to ensure the function
! -- is on the inner path of a nestloop join
! SELECT setval('foo_rescan_seq1',1,false),setval('foo_rescan_seq2',1,false);
!  setval | setval 
! --------+--------
!       1 |      1
! (1 row)
! 
! SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN foo_sql(11,13) ON (r+i)<100;
!  r | i  | s 
! ---+----+---
!  1 | 11 | 1
!  1 | 12 | 2
!  1 | 13 | 3
!  2 | 11 | 1
!  2 | 12 | 2
!  2 | 13 | 3
!  3 | 11 | 1
!  3 | 12 | 2
!  3 | 13 | 3
! (9 rows)
! 
! SELECT setval('foo_rescan_seq1',1,false),setval('foo_rescan_seq2',1,false);
!  setval | setval 
! --------+--------
!       1 |      1
! (1 row)
! 
! SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN foo_sql(11,13) WITH ORDINALITY AS f(i,s,o) ON (r+i)<100;
!  r | i  | s | o 
! ---+----+---+---
!  1 | 11 | 1 | 1
!  1 | 12 | 2 | 2
!  1 | 13 | 3 | 3
!  2 | 11 | 1 | 1
!  2 | 12 | 2 | 2
!  2 | 13 | 3 | 3
!  3 | 11 | 1 | 1
!  3 | 12 | 2 | 2
!  3 | 13 | 3 | 3
! (9 rows)
! 
! SELECT setval('foo_rescan_seq1',1,false),setval('foo_rescan_seq2',1,false);
!  setval | setval 
! --------+--------
!       1 |      1
! (1 row)
! 
! SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN foo_mat(11,13) ON (r+i)<100;
!  r | i  | s 
! ---+----+---
!  1 | 11 | 1
!  1 | 12 | 2
!  1 | 13 | 3
!  2 | 11 | 1
!  2 | 12 | 2
!  2 | 13 | 3
!  3 | 11 | 1
!  3 | 12 | 2
!  3 | 13 | 3
! (9 rows)
! 
! SELECT setval('foo_rescan_seq1',1,false),setval('foo_rescan_seq2',1,false);
!  setval | setval 
! --------+--------
!       1 |      1
! (1 row)
! 
! SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN foo_mat(11,13) WITH ORDINALITY AS f(i,s,o) ON (r+i)<100;
!  r | i  | s | o 
! ---+----+---+---
!  1 | 11 | 1 | 1
!  1 | 12 | 2 | 2
!  1 | 13 | 3 | 3
!  2 | 11 | 1 | 1
!  2 | 12 | 2 | 2
!  2 | 13 | 3 | 3
!  3 | 11 | 1 | 1
!  3 | 12 | 2 | 2
!  3 | 13 | 3 | 3
! (9 rows)
! 
! SELECT setval('foo_rescan_seq1',1,false),setval('foo_rescan_seq2',1,false);
!  setval | setval 
! --------+--------
!       1 |      1
! (1 row)
! 
! SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN ROWS FROM( foo_sql(11,13), foo_mat(11,13) ) WITH ORDINALITY AS f(i1,s1,i2,s2,o) ON (r+i1+i2)<100;
!  r | i1 | s1 | i2 | s2 | o 
! ---+----+----+----+----+---
!  1 | 11 |  1 | 11 |  1 | 1
!  1 | 12 |  2 | 12 |  2 | 2
!  1 | 13 |  3 | 13 |  3 | 3
!  2 | 11 |  1 | 11 |  1 | 1
!  2 | 12 |  2 | 12 |  2 | 2
!  2 | 13 |  3 | 13 |  3 | 3
!  3 | 11 |  1 | 11 |  1 | 1
!  3 | 12 |  2 | 12 |  2 | 2
!  3 | 13 |  3 | 13 |  3 | 3
! (9 rows)
! 
! SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN generate_series(11,13) f(i) ON (r+i)<100;
!  r | i  
! ---+----
!  1 | 11
!  1 | 12
!  1 | 13
!  2 | 11
!  2 | 12
!  2 | 13
!  3 | 11
!  3 | 12
!  3 | 13
! (9 rows)
! 
! SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN generate_series(11,13) WITH ORDINALITY AS f(i,o) ON (r+i)<100;
!  r | i  | o 
! ---+----+---
!  1 | 11 | 1
!  1 | 12 | 2
!  1 | 13 | 3
!  2 | 11 | 1
!  2 | 12 | 2
!  2 | 13 | 3
!  3 | 11 | 1
!  3 | 12 | 2
!  3 | 13 | 3
! (9 rows)
! 
! SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN unnest(array[10,20,30]) f(i) ON (r+i)<100;
!  r | i  
! ---+----
!  1 | 10
!  1 | 20
!  1 | 30
!  2 | 10
!  2 | 20
!  2 | 30
!  3 | 10
!  3 | 20
!  3 | 30
! (9 rows)
! 
! SELECT * FROM (VALUES (1),(2),(3)) v(r) LEFT JOIN unnest(array[10,20,30]) WITH ORDINALITY AS f(i,o) ON (r+i)<100;
!  r | i  | o 
! ---+----+---
!  1 | 10 | 1
!  1 | 20 | 2
!  1 | 30 | 3
!  2 | 10 | 1
!  2 | 20 | 2
!  2 | 30 | 3
!  3 | 10 | 1
!  3 | 20 | 2
!  3 | 30 | 3
! (9 rows)
! 
! --invokes ExecReScanFunctionScan with chgParam != NULL (using implied LATERAL)
! SELECT setval('foo_rescan_seq1',1,false),setval('foo_rescan_seq2',1,false);
!  setval | setval 
! --------+--------
!       1 |      1
! (1 row)
! 
! SELECT * FROM (VALUES (1),(2),(3)) v(r), foo_sql(10+r,13);
!  r | i  | s 
! ---+----+---
!  1 | 11 | 1
!  1 | 12 | 2
!  1 | 13 | 3
!  2 | 12 | 4
!  2 | 13 | 5
!  3 | 13 | 6
! (6 rows)
! 
! SELECT setval('foo_rescan_seq1',1,false),setval('foo_rescan_seq2',1,false);
!  setval | setval 
! --------+--------
!       1 |      1
! (1 row)
! 
! SELECT * FROM (VALUES (1),(2),(3)) v(r), foo_sql(10+r,13) WITH ORDINALITY AS f(i,s,o);
!  r | i  | s | o 
! ---+----+---+---
!  1 | 11 | 1 | 1
!  1 | 12 | 2 | 2
!  1 | 13 | 3 | 3
!  2 | 12 | 4 | 1
!  2 | 13 | 5 | 2
!  3 | 13 | 6 | 1
! (6 rows)
! 
! SELECT setval('foo_rescan_seq1',1,false),setval('foo_rescan_seq2',1,false);
!  setval | setval 
! --------+--------
!       1 |      1
! (1 row)
! 
! SELECT * FROM (VALUES (1),(2),(3)) v(r), foo_sql(11,10+r);
!  r | i  | s 
! ---+----+---
!  1 | 11 | 1
!  2 | 11 | 2
!  2 | 12 | 3
!  3 | 11 | 4
!  3 | 12 | 5
!  3 | 13 | 6
! (6 rows)
! 
! SELECT setval('foo_rescan_seq1',1,false),setval('foo_rescan_seq2',1,false);
!  setval | setval 
! --------+--------
!       1 |      1
! (1 row)
! 
! SELECT * FROM (VALUES (1),(2),(3)) v(r), foo_sql(11,10+r) WITH ORDINALITY AS f(i,s,o);
!  r | i  | s | o 
! ---+----+---+---
!  1 | 11 | 1 | 1
!  2 | 11 | 2 | 1
!  2 | 12 | 3 | 2
!  3 | 11 | 4 | 1
!  3 | 12 | 5 | 2
!  3 | 13 | 6 | 3
! (6 rows)
! 
! SELECT setval('foo_rescan_seq1',1,false),setval('foo_rescan_seq2',1,false);
!  setval | setval 
! --------+--------
!       1 |      1
! (1 row)
! 
! SELECT * FROM (VALUES (11,12),(13,15),(16,20)) v(r1,r2), foo_sql(r1,r2);
!  r1 | r2 | i  | s  
! ----+----+----+----
!  11 | 12 | 11 |  1
!  11 | 12 | 12 |  2
!  13 | 15 | 13 |  3
!  13 | 15 | 14 |  4
!  13 | 15 | 15 |  5
!  16 | 20 | 16 |  6
!  16 | 20 | 17 |  7
!  16 | 20 | 18 |  8
!  16 | 20 | 19 |  9
!  16 | 20 | 20 | 10
! (10 rows)
! 
! SELECT setval('foo_rescan_seq1',1,false),setval('foo_rescan_seq2',1,false);
!  setval | setval 
! --------+--------
!       1 |      1
! (1 row)
! 
! SELECT * FROM (VALUES (11,12),(13,15),(16,20)) v(r1,r2), foo_sql(r1,r2) WITH ORDINALITY AS f(i,s,o);
!  r1 | r2 | i  | s  | o 
! ----+----+----+----+---
!  11 | 12 | 11 |  1 | 1
!  11 | 12 | 12 |  2 | 2
!  13 | 15 | 13 |  3 | 1
!  13 | 15 | 14 |  4 | 2
!  13 | 15 | 15 |  5 | 3
!  16 | 20 | 16 |  6 | 1
!  16 | 20 | 17 |  7 | 2
!  16 | 20 | 18 |  8 | 3
!  16 | 20 | 19 |  9 | 4
!  16 | 20 | 20 | 10 | 5
! (10 rows)
! 
! SELECT setval('foo_rescan_seq1',1,false),setval('foo_rescan_seq2',1,false);
!  setval | setval 
! --------+--------
!       1 |      1
! (1 row)
! 
! SELECT * FROM (VALUES (1),(2),(3)) v(r), foo_mat(10+r,13);
!  r | i  | s 
! ---+----+---
!  1 | 11 | 1
!  1 | 12 | 2
!  1 | 13 | 3
!  2 | 12 | 4
!  2 | 13 | 5
!  3 | 13 | 6
! (6 rows)
! 
! SELECT setval('foo_rescan_seq1',1,false),setval('foo_rescan_seq2',1,false);
!  setval | setval 
! --------+--------
!       1 |      1
! (1 row)
! 
! SELECT * FROM (VALUES (1),(2),(3)) v(r), foo_mat(10+r,13) WITH ORDINALITY AS f(i,s,o);
!  r | i  | s | o 
! ---+----+---+---
!  1 | 11 | 1 | 1
!  1 | 12 | 2 | 2
!  1 | 13 | 3 | 3
!  2 | 12 | 4 | 1
!  2 | 13 | 5 | 2
!  3 | 13 | 6 | 1
! (6 rows)
! 
! SELECT setval('foo_rescan_seq1',1,false),setval('foo_rescan_seq2',1,false);
!  setval | setval 
! --------+--------
!       1 |      1
! (1 row)
! 
! SELECT * FROM (VALUES (1),(2),(3)) v(r), foo_mat(11,10+r);
!  r | i  | s 
! ---+----+---
!  1 | 11 | 1
!  2 | 11 | 2
!  2 | 12 | 3
!  3 | 11 | 4
!  3 | 12 | 5
!  3 | 13 | 6
! (6 rows)
! 
! SELECT setval('foo_rescan_seq1',1,false),setval('foo_rescan_seq2',1,false);
!  setval | setval 
! --------+--------
!       1 |      1
! (1 row)
! 
! SELECT * FROM (VALUES (1),(2),(3)) v(r), foo_mat(11,10+r) WITH ORDINALITY AS f(i,s,o);
!  r | i  | s | o 
! ---+----+---+---
!  1 | 11 | 1 | 1
!  2 | 11 | 2 | 1
!  2 | 12 | 3 | 2
!  3 | 11 | 4 | 1
!  3 | 12 | 5 | 2
!  3 | 13 | 6 | 3
! (6 rows)
! 
! SELECT setval('foo_rescan_seq1',1,false),setval('foo_rescan_seq2',1,false);
!  setval | setval 
! --------+--------
!       1 |      1
! (1 row)
! 
! SELECT * FROM (VALUES (11,12),(13,15),(16,20)) v(r1,r2), foo_mat(r1,r2);
!  r1 | r2 | i  | s  
! ----+----+----+----
!  11 | 12 | 11 |  1
!  11 | 12 | 12 |  2
!  13 | 15 | 13 |  3
!  13 | 15 | 14 |  4
!  13 | 15 | 15 |  5
!  16 | 20 | 16 |  6
!  16 | 20 | 17 |  7
!  16 | 20 | 18 |  8
!  16 | 20 | 19 |  9
!  16 | 20 | 20 | 10
! (10 rows)
! 
! SELECT setval('foo_rescan_seq1',1,false),setval('foo_rescan_seq2',1,false);
!  setval | setval 
! --------+--------
!       1 |      1
! (1 row)
! 
! SELECT * FROM (VALUES (11,12),(13,15),(16,20)) v(r1,r2), foo_mat(r1,r2) WITH ORDINALITY AS f(i,s,o);
!  r1 | r2 | i  | s  | o 
! ----+----+----+----+---
!  11 | 12 | 11 |  1 | 1
!  11 | 12 | 12 |  2 | 2
!  13 | 15 | 13 |  3 | 1
!  13 | 15 | 14 |  4 | 2
!  13 | 15 | 15 |  5 | 3
!  16 | 20 | 16 |  6 | 1
!  16 | 20 | 17 |  7 | 2
!  16 | 20 | 18 |  8 | 3
!  16 | 20 | 19 |  9 | 4
!  16 | 20 | 20 | 10 | 5
! (10 rows)
! 
! -- selective rescan of multiple functions:
! SELECT setval('foo_rescan_seq1',1,false),setval('foo_rescan_seq2',1,false);
!  setval | setval 
! --------+--------
!       1 |      1
! (1 row)
! 
! SELECT * FROM (VALUES (1),(2),(3)) v(r), ROWS FROM( foo_sql(11,11), foo_mat(10+r,13) );
!  r | i  | s | i  | s 
! ---+----+---+----+---
!  1 | 11 | 1 | 11 | 1
!  1 |    |   | 12 | 2
!  1 |    |   | 13 | 3
!  2 | 11 | 1 | 12 | 4
!  2 |    |   | 13 | 5
!  3 | 11 | 1 | 13 | 6
! (6 rows)
! 
! SELECT setval('foo_rescan_seq1',1,false),setval('foo_rescan_seq2',1,false);
!  setval | setval 
! --------+--------
!       1 |      1
! (1 row)
! 
! SELECT * FROM (VALUES (1),(2),(3)) v(r), ROWS FROM( foo_sql(10+r,13), foo_mat(11,11) );
!  r | i  | s | i  | s 
! ---+----+---+----+---
!  1 | 11 | 1 | 11 | 1
!  1 | 12 | 2 |    |  
!  1 | 13 | 3 |    |  
!  2 | 12 | 4 | 11 | 1
!  2 | 13 | 5 |    |  
!  3 | 13 | 6 | 11 | 1
! (6 rows)
! 
! SELECT setval('foo_rescan_seq1',1,false),setval('foo_rescan_seq2',1,false);
!  setval | setval 
! --------+--------
!       1 |      1
! (1 row)
! 
! SELECT * FROM (VALUES (1),(2),(3)) v(r), ROWS FROM( foo_sql(10+r,13), foo_mat(10+r,13) );
!  r | i  | s | i  | s 
! ---+----+---+----+---
!  1 | 11 | 1 | 11 | 1
!  1 | 12 | 2 | 12 | 2
!  1 | 13 | 3 | 13 | 3
!  2 | 12 | 4 | 12 | 4
!  2 | 13 | 5 | 13 | 5
!  3 | 13 | 6 | 13 | 6
! (6 rows)
! 
! SELECT setval('foo_rescan_seq1',1,false),setval('foo_rescan_seq2',1,false);
!  setval | setval 
! --------+--------
!       1 |      1
! (1 row)
! 
! SELECT * FROM generate_series(1,2) r1, generate_series(r1,3) r2, ROWS FROM( foo_sql(10+r1,13), foo_mat(10+r2,13) );
!  r1 | r2 | i  | s  | i  | s 
! ----+----+----+----+----+---
!   1 |  1 | 11 |  1 | 11 | 1
!   1 |  1 | 12 |  2 | 12 | 2
!   1 |  1 | 13 |  3 | 13 | 3
!   1 |  2 | 11 |  4 | 12 | 4
!   1 |  2 | 12 |  5 | 13 | 5
!   1 |  2 | 13 |  6 |    |  
!   1 |  3 | 11 |  7 | 13 | 6
!   1 |  3 | 12 |  8 |    |  
!   1 |  3 | 13 |  9 |    |  
!   2 |  2 | 12 | 10 | 12 | 7
!   2 |  2 | 13 | 11 | 13 | 8
!   2 |  3 | 12 | 12 | 13 | 9
!   2 |  3 | 13 | 13 |    |  
! (13 rows)
! 
! SELECT * FROM (VALUES (1),(2),(3)) v(r), generate_series(10+r,20-r) f(i);
!  r | i  
! ---+----
!  1 | 11
!  1 | 12
!  1 | 13
!  1 | 14
!  1 | 15
!  1 | 16
!  1 | 17
!  1 | 18
!  1 | 19
!  2 | 12
!  2 | 13
!  2 | 14
!  2 | 15
!  2 | 16
!  2 | 17
!  2 | 18
!  3 | 13
!  3 | 14
!  3 | 15
!  3 | 16
!  3 | 17
! (21 rows)
! 
! SELECT * FROM (VALUES (1),(2),(3)) v(r), generate_series(10+r,20-r) WITH ORDINALITY AS f(i,o);
!  r | i  | o 
! ---+----+---
!  1 | 11 | 1
!  1 | 12 | 2
!  1 | 13 | 3
!  1 | 14 | 4
!  1 | 15 | 5
!  1 | 16 | 6
!  1 | 17 | 7
!  1 | 18 | 8
!  1 | 19 | 9
!  2 | 12 | 1
!  2 | 13 | 2
!  2 | 14 | 3
!  2 | 15 | 4
!  2 | 16 | 5
!  2 | 17 | 6
!  2 | 18 | 7
!  3 | 13 | 1
!  3 | 14 | 2
!  3 | 15 | 3
!  3 | 16 | 4
!  3 | 17 | 5
! (21 rows)
! 
! SELECT * FROM (VALUES (1),(2),(3)) v(r), unnest(array[r*10,r*20,r*30]) f(i);
!  r | i  
! ---+----
!  1 | 10
!  1 | 20
!  1 | 30
!  2 | 20
!  2 | 40
!  2 | 60
!  3 | 30
!  3 | 60
!  3 | 90
! (9 rows)
! 
! SELECT * FROM (VALUES (1),(2),(3)) v(r), unnest(array[r*10,r*20,r*30]) WITH ORDINALITY AS f(i,o);
!  r | i  | o 
! ---+----+---
!  1 | 10 | 1
!  1 | 20 | 2
!  1 | 30 | 3
!  2 | 20 | 1
!  2 | 40 | 2
!  2 | 60 | 3
!  3 | 30 | 1
!  3 | 60 | 2
!  3 | 90 | 3
! (9 rows)
! 
! -- deep nesting
! SELECT * FROM (VALUES (1),(2),(3)) v1(r1),
!               LATERAL (SELECT r1, * FROM (VALUES (10),(20),(30)) v2(r2)
!                                          LEFT JOIN generate_series(21,23) f(i) ON ((r2+i)<100) OFFSET 0) s1;
!  r1 | r1 | r2 | i  
! ----+----+----+----
!   1 |  1 | 10 | 21
!   1 |  1 | 10 | 22
!   1 |  1 | 10 | 23
!   1 |  1 | 20 | 21
!   1 |  1 | 20 | 22
!   1 |  1 | 20 | 23
!   1 |  1 | 30 | 21
!   1 |  1 | 30 | 22
!   1 |  1 | 30 | 23
!   2 |  2 | 10 | 21
!   2 |  2 | 10 | 22
!   2 |  2 | 10 | 23
!   2 |  2 | 20 | 21
!   2 |  2 | 20 | 22
!   2 |  2 | 20 | 23
!   2 |  2 | 30 | 21
!   2 |  2 | 30 | 22
!   2 |  2 | 30 | 23
!   3 |  3 | 10 | 21
!   3 |  3 | 10 | 22
!   3 |  3 | 10 | 23
!   3 |  3 | 20 | 21
!   3 |  3 | 20 | 22
!   3 |  3 | 20 | 23
!   3 |  3 | 30 | 21
!   3 |  3 | 30 | 22
!   3 |  3 | 30 | 23
! (27 rows)
! 
! SELECT * FROM (VALUES (1),(2),(3)) v1(r1),
!               LATERAL (SELECT r1, * FROM (VALUES (10),(20),(30)) v2(r2)
!                                          LEFT JOIN generate_series(20+r1,23) f(i) ON ((r2+i)<100) OFFSET 0) s1;
!  r1 | r1 | r2 | i  
! ----+----+----+----
!   1 |  1 | 10 | 21
!   1 |  1 | 10 | 22
!   1 |  1 | 10 | 23
!   1 |  1 | 20 | 21
!   1 |  1 | 20 | 22
!   1 |  1 | 20 | 23
!   1 |  1 | 30 | 21
!   1 |  1 | 30 | 22
!   1 |  1 | 30 | 23
!   2 |  2 | 10 | 22
!   2 |  2 | 10 | 23
!   2 |  2 | 20 | 22
!   2 |  2 | 20 | 23
!   2 |  2 | 30 | 22
!   2 |  2 | 30 | 23
!   3 |  3 | 10 | 23
!   3 |  3 | 20 | 23
!   3 |  3 | 30 | 23
! (18 rows)
! 
! SELECT * FROM (VALUES (1),(2),(3)) v1(r1),
!               LATERAL (SELECT r1, * FROM (VALUES (10),(20),(30)) v2(r2)
!                                          LEFT JOIN generate_series(r2,r2+3) f(i) ON ((r2+i)<100) OFFSET 0) s1;
!  r1 | r1 | r2 | i  
! ----+----+----+----
!   1 |  1 | 10 | 10
!   1 |  1 | 10 | 11
!   1 |  1 | 10 | 12
!   1 |  1 | 10 | 13
!   1 |  1 | 20 | 20
!   1 |  1 | 20 | 21
!   1 |  1 | 20 | 22
!   1 |  1 | 20 | 23
!   1 |  1 | 30 | 30
!   1 |  1 | 30 | 31
!   1 |  1 | 30 | 32
!   1 |  1 | 30 | 33
!   2 |  2 | 10 | 10
!   2 |  2 | 10 | 11
!   2 |  2 | 10 | 12
!   2 |  2 | 10 | 13
!   2 |  2 | 20 | 20
!   2 |  2 | 20 | 21
!   2 |  2 | 20 | 22
!   2 |  2 | 20 | 23
!   2 |  2 | 30 | 30
!   2 |  2 | 30 | 31
!   2 |  2 | 30 | 32
!   2 |  2 | 30 | 33
!   3 |  3 | 10 | 10
!   3 |  3 | 10 | 11
!   3 |  3 | 10 | 12
!   3 |  3 | 10 | 13
!   3 |  3 | 20 | 20
!   3 |  3 | 20 | 21
!   3 |  3 | 20 | 22
!   3 |  3 | 20 | 23
!   3 |  3 | 30 | 30
!   3 |  3 | 30 | 31
!   3 |  3 | 30 | 32
!   3 |  3 | 30 | 33
! (36 rows)
! 
! SELECT * FROM (VALUES (1),(2),(3)) v1(r1),
!               LATERAL (SELECT r1, * FROM (VALUES (10),(20),(30)) v2(r2)
!                                          LEFT JOIN generate_series(r1,2+r2/5) f(i) ON ((r2+i)<100) OFFSET 0) s1;
!  r1 | r1 | r2 | i 
! ----+----+----+---
!   1 |  1 | 10 | 1
!   1 |  1 | 10 | 2
!   1 |  1 | 10 | 3
!   1 |  1 | 10 | 4
!   1 |  1 | 20 | 1
!   1 |  1 | 20 | 2
!   1 |  1 | 20 | 3
!   1 |  1 | 20 | 4
!   1 |  1 | 20 | 5
!   1 |  1 | 20 | 6
!   1 |  1 | 30 | 1
!   1 |  1 | 30 | 2
!   1 |  1 | 30 | 3
!   1 |  1 | 30 | 4
!   1 |  1 | 30 | 5
!   1 |  1 | 30 | 6
!   1 |  1 | 30 | 7
!   1 |  1 | 30 | 8
!   2 |  2 | 10 | 2
!   2 |  2 | 10 | 3
!   2 |  2 | 10 | 4
!   2 |  2 | 20 | 2
!   2 |  2 | 20 | 3
!   2 |  2 | 20 | 4
!   2 |  2 | 20 | 5
!   2 |  2 | 20 | 6
!   2 |  2 | 30 | 2
!   2 |  2 | 30 | 3
!   2 |  2 | 30 | 4
!   2 |  2 | 30 | 5
!   2 |  2 | 30 | 6
!   2 |  2 | 30 | 7
!   2 |  2 | 30 | 8
!   3 |  3 | 10 | 3
!   3 |  3 | 10 | 4
!   3 |  3 | 20 | 3
!   3 |  3 | 20 | 4
!   3 |  3 | 20 | 5
!   3 |  3 | 20 | 6
!   3 |  3 | 30 | 3
!   3 |  3 | 30 | 4
!   3 |  3 | 30 | 5
!   3 |  3 | 30 | 6
!   3 |  3 | 30 | 7
!   3 |  3 | 30 | 8
! (45 rows)
! 
! DROP FUNCTION foo_sql(int,int);
! DROP FUNCTION foo_mat(int,int);
! DROP SEQUENCE foo_rescan_seq1;
! DROP SEQUENCE foo_rescan_seq2;
! --
! -- Test cases involving OUT parameters
! --
! CREATE FUNCTION foo(in f1 int, out f2 int)
! AS 'select $1+1' LANGUAGE sql;
! SELECT foo(42);
!  foo 
! -----
!   43
! (1 row)
! 
! SELECT * FROM foo(42);
!  f2 
! ----
!  43
! (1 row)
! 
! SELECT * FROM foo(42) AS p(x);
!  x  
! ----
!  43
! (1 row)
! 
! -- explicit spec of return type is OK
! CREATE OR REPLACE FUNCTION foo(in f1 int, out f2 int) RETURNS int
! AS 'select $1+1' LANGUAGE sql;
! -- error, wrong result type
! CREATE OR REPLACE FUNCTION foo(in f1 int, out f2 int) RETURNS float
! AS 'select $1+1' LANGUAGE sql;
! ERROR:  function result type must be integer because of OUT parameters
! -- with multiple OUT params you must get a RECORD result
! CREATE OR REPLACE FUNCTION foo(in f1 int, out f2 int, out f3 text) RETURNS int
! AS 'select $1+1' LANGUAGE sql;
! ERROR:  function result type must be record because of OUT parameters
! CREATE OR REPLACE FUNCTION foo(in f1 int, out f2 int, out f3 text)
! RETURNS record
! AS 'select $1+1' LANGUAGE sql;
! ERROR:  cannot change return type of existing function
! HINT:  Use DROP FUNCTION foo(integer) first.
! CREATE OR REPLACE FUNCTION foor(in f1 int, out f2 int, out text)
! AS $$select $1-1, $1::text || 'z'$$ LANGUAGE sql;
! SELECT f1, foor(f1) FROM int4_tbl;
!      f1      |            foor            
! -------------+----------------------------
!            0 | (-1,0z)
!       123456 | (123455,123456z)
!      -123456 | (-123457,-123456z)
!   2147483647 | (2147483646,2147483647z)
!  -2147483647 | (-2147483648,-2147483647z)
! (5 rows)
! 
! SELECT * FROM foor(42);
!  f2 | column2 
! ----+---------
!  41 | 42z
! (1 row)
! 
! SELECT * FROM foor(42) AS p(a,b);
!  a  |  b  
! ----+-----
!  41 | 42z
! (1 row)
! 
! CREATE OR REPLACE FUNCTION foob(in f1 int, inout f2 int, out text)
! AS $$select $2-1, $1::text || 'z'$$ LANGUAGE sql;
! SELECT f1, foob(f1, f1/2) FROM int4_tbl;
!      f1      |            foob            
! -------------+----------------------------
!            0 | (-1,0z)
!       123456 | (61727,123456z)
!      -123456 | (-61729,-123456z)
!   2147483647 | (1073741822,2147483647z)
!  -2147483647 | (-1073741824,-2147483647z)
! (5 rows)
! 
! SELECT * FROM foob(42, 99);
!  f2 | column2 
! ----+---------
!  98 | 42z
! (1 row)
! 
! SELECT * FROM foob(42, 99) AS p(a,b);
!  a  |  b  
! ----+-----
!  98 | 42z
! (1 row)
! 
! -- Can reference function with or without OUT params for DROP, etc
! DROP FUNCTION foo(int);
! DROP FUNCTION foor(in f2 int, out f1 int, out text);
! DROP FUNCTION foob(in f1 int, inout f2 int);
! --
! -- For my next trick, polymorphic OUT parameters
! --
! CREATE FUNCTION dup (f1 anyelement, f2 out anyelement, f3 out anyarray)
! AS 'select $1, array[$1,$1]' LANGUAGE sql;
! SELECT dup(22);
!       dup       
! ----------------
!  (22,"{22,22}")
! (1 row)
! 
! SELECT dup('xyz');	-- fails
! ERROR:  could not determine polymorphic type because input has type "unknown"
! SELECT dup('xyz'::text);
!         dup        
! -------------------
!  (xyz,"{xyz,xyz}")
! (1 row)
! 
! SELECT * FROM dup('xyz'::text);
!  f2  |    f3     
! -----+-----------
!  xyz | {xyz,xyz}
! (1 row)
! 
! -- fails, as we are attempting to rename first argument
! CREATE OR REPLACE FUNCTION dup (inout f2 anyelement, out f3 anyarray)
! AS 'select $1, array[$1,$1]' LANGUAGE sql;
! ERROR:  cannot change name of input parameter "f1"
! HINT:  Use DROP FUNCTION dup(anyelement) first.
! DROP FUNCTION dup(anyelement);
! -- equivalent behavior, though different name exposed for input arg
! CREATE OR REPLACE FUNCTION dup (inout f2 anyelement, out f3 anyarray)
! AS 'select $1, array[$1,$1]' LANGUAGE sql;
! SELECT dup(22);
!       dup       
! ----------------
!  (22,"{22,22}")
! (1 row)
! 
! DROP FUNCTION dup(anyelement);
! -- fails, no way to deduce outputs
! CREATE FUNCTION bad (f1 int, out f2 anyelement, out f3 anyarray)
! AS 'select $1, array[$1,$1]' LANGUAGE sql;
! ERROR:  cannot determine result data type
! DETAIL:  A function returning a polymorphic type must have at least one polymorphic argument.
! --
! -- table functions
! --
! CREATE OR REPLACE FUNCTION foo()
! RETURNS TABLE(a int)
! AS $$ SELECT a FROM generate_series(1,5) a(a) $$ LANGUAGE sql;
! SELECT * FROM foo();
!  a 
! ---
!  1
!  2
!  3
!  4
!  5
! (5 rows)
! 
! DROP FUNCTION foo();
! CREATE OR REPLACE FUNCTION foo(int)
! RETURNS TABLE(a int, b int)
! AS $$ SELECT a, b
!          FROM generate_series(1,$1) a(a),
!               generate_series(1,$1) b(b) $$ LANGUAGE sql;
! SELECT * FROM foo(3);
!  a | b 
! ---+---
!  1 | 1
!  1 | 2
!  1 | 3
!  2 | 1
!  2 | 2
!  2 | 3
!  3 | 1
!  3 | 2
!  3 | 3
! (9 rows)
! 
! DROP FUNCTION foo(int);
! -- case that causes change of typmod knowledge during inlining
! CREATE OR REPLACE FUNCTION foo()
! RETURNS TABLE(a varchar(5))
! AS $$ SELECT 'hello'::varchar(5) $$ LANGUAGE sql STABLE;
! SELECT * FROM foo() GROUP BY 1;
!    a   
! -------
!  hello
! (1 row)
! 
! DROP FUNCTION foo();
! --
! -- some tests on SQL functions with RETURNING
! --
! create temp table tt(f1 serial, data text);
! create function insert_tt(text) returns int as
! $$ insert into tt(data) values($1) returning f1 $$
! language sql;
! select insert_tt('foo');
!  insert_tt 
! -----------
!          1
! (1 row)
! 
! select insert_tt('bar');
!  insert_tt 
! -----------
!          2
! (1 row)
! 
! select * from tt;
!  f1 | data 
! ----+------
!   1 | foo
!   2 | bar
! (2 rows)
! 
! -- insert will execute to completion even if function needs just 1 row
! create or replace function insert_tt(text) returns int as
! $$ insert into tt(data) values($1),($1||$1) returning f1 $$
! language sql;
! select insert_tt('fool');
!  insert_tt 
! -----------
!          3
! (1 row)
! 
! select * from tt;
!  f1 |   data   
! ----+----------
!   1 | foo
!   2 | bar
!   3 | fool
!   4 | foolfool
! (4 rows)
! 
! -- setof does what's expected
! create or replace function insert_tt2(text,text) returns setof int as
! $$ insert into tt(data) values($1),($2) returning f1 $$
! language sql;
! select insert_tt2('foolish','barrish');
!  insert_tt2 
! ------------
!           5
!           6
! (2 rows)
! 
! select * from insert_tt2('baz','quux');
!  insert_tt2 
! ------------
!           7
!           8
! (2 rows)
! 
! select * from tt;
!  f1 |   data   
! ----+----------
!   1 | foo
!   2 | bar
!   3 | fool
!   4 | foolfool
!   5 | foolish
!   6 | barrish
!   7 | baz
!   8 | quux
! (8 rows)
! 
! -- limit doesn't prevent execution to completion
! select insert_tt2('foolish','barrish') limit 1;
!  insert_tt2 
! ------------
!           9
! (1 row)
! 
! select * from tt;
!  f1 |   data   
! ----+----------
!   1 | foo
!   2 | bar
!   3 | fool
!   4 | foolfool
!   5 | foolish
!   6 | barrish
!   7 | baz
!   8 | quux
!   9 | foolish
!  10 | barrish
! (10 rows)
! 
! -- triggers will fire, too
! create function noticetrigger() returns trigger as $$
! begin
!   raise notice 'noticetrigger % %', new.f1, new.data;
!   return null;
! end $$ language plpgsql;
! create trigger tnoticetrigger after insert on tt for each row
! execute procedure noticetrigger();
! select insert_tt2('foolme','barme') limit 1;
! NOTICE:  noticetrigger 11 foolme
! NOTICE:  noticetrigger 12 barme
!  insert_tt2 
! ------------
!          11
! (1 row)
! 
! select * from tt;
!  f1 |   data   
! ----+----------
!   1 | foo
!   2 | bar
!   3 | fool
!   4 | foolfool
!   5 | foolish
!   6 | barrish
!   7 | baz
!   8 | quux
!   9 | foolish
!  10 | barrish
!  11 | foolme
!  12 | barme
! (12 rows)
! 
! -- and rules work
! create temp table tt_log(f1 int, data text);
! create rule insert_tt_rule as on insert to tt do also
!   insert into tt_log values(new.*);
! select insert_tt2('foollog','barlog') limit 1;
! NOTICE:  noticetrigger 13 foollog
! NOTICE:  noticetrigger 14 barlog
!  insert_tt2 
! ------------
!          13
! (1 row)
! 
! select * from tt;
!  f1 |   data   
! ----+----------
!   1 | foo
!   2 | bar
!   3 | fool
!   4 | foolfool
!   5 | foolish
!   6 | barrish
!   7 | baz
!   8 | quux
!   9 | foolish
!  10 | barrish
!  11 | foolme
!  12 | barme
!  13 | foollog
!  14 | barlog
! (14 rows)
! 
! -- note that nextval() gets executed a second time in the rule expansion,
! -- which is expected.
! select * from tt_log;
!  f1 |  data   
! ----+---------
!  15 | foollog
!  16 | barlog
! (2 rows)
! 
! -- test case for a whole-row-variable bug
! create function foo1(n integer, out a text, out b text)
!   returns setof record
!   language sql
!   as $$ select 'foo ' || i, 'bar ' || i from generate_series(1,$1) i $$;
! set work_mem='64kB';
! select t.a, t, t.a from foo1(10000) t limit 1;
!    a   |         t         |   a   
! -------+-------------------+-------
!  foo 1 | ("foo 1","bar 1") | foo 1
! (1 row)
! 
! reset work_mem;
! select t.a, t, t.a from foo1(10000) t limit 1;
!    a   |         t         |   a   
! -------+-------------------+-------
!  foo 1 | ("foo 1","bar 1") | foo 1
! (1 row)
! 
! drop function foo1(n integer);
! -- test use of SQL functions returning record
! -- this is supported in some cases where the query doesn't specify
! -- the actual record type ...
! create function array_to_set(anyarray) returns setof record as $$
!   select i AS "index", $1[i] AS "value" from generate_subscripts($1, 1) i
! $$ language sql strict immutable;
! select array_to_set(array['one', 'two']);
!  array_to_set 
! --------------
!  (1,one)
!  (2,two)
! (2 rows)
! 
! select * from array_to_set(array['one', 'two']) as t(f1 int,f2 text);
!  f1 | f2  
! ----+-----
!   1 | one
!   2 | two
! (2 rows)
! 
! select * from array_to_set(array['one', 'two']); -- fail
! ERROR:  a column definition list is required for functions returning "record"
! LINE 1: select * from array_to_set(array['one', 'two']);
!                       ^
! create temp table foo(f1 int8, f2 int8);
! create function testfoo() returns record as $$
!   insert into foo values (1,2) returning *;
! $$ language sql;
! select testfoo();
!  testfoo 
! ---------
!  (1,2)
! (1 row)
! 
! select * from testfoo() as t(f1 int8,f2 int8);
!  f1 | f2 
! ----+----
!   1 |  2
! (1 row)
! 
! select * from testfoo(); -- fail
! ERROR:  a column definition list is required for functions returning "record"
! LINE 1: select * from testfoo();
!                       ^
! drop function testfoo();
! create function testfoo() returns setof record as $$
!   insert into foo values (1,2), (3,4) returning *;
! $$ language sql;
! select testfoo();
!  testfoo 
! ---------
!  (1,2)
!  (3,4)
! (2 rows)
! 
! select * from testfoo() as t(f1 int8,f2 int8);
!  f1 | f2 
! ----+----
!   1 |  2
!   3 |  4
! (2 rows)
! 
! select * from testfoo(); -- fail
! ERROR:  a column definition list is required for functions returning "record"
! LINE 1: select * from testfoo();
!                       ^
! drop function testfoo();
! --
! -- Check some cases involving added/dropped columns in a rowtype result
! --
! create temp table users (userid text, seq int, email text, todrop bool, moredrop int, enabled bool);
! insert into users values ('id',1,'email',true,11,true);
! insert into users values ('id2',2,'email2',true,12,true);
! alter table users drop column todrop;
! create or replace function get_first_user() returns users as
! $$ SELECT * FROM users ORDER BY userid LIMIT 1; $$
! language sql stable;
! SELECT get_first_user();
!   get_first_user   
! -------------------
!  (id,1,email,11,t)
! (1 row)
! 
! SELECT * FROM get_first_user();
!  userid | seq | email | moredrop | enabled 
! --------+-----+-------+----------+---------
!  id     |   1 | email |       11 | t
! (1 row)
! 
! create or replace function get_users() returns setof users as
! $$ SELECT * FROM users ORDER BY userid; $$
! language sql stable;
! SELECT get_users();
!       get_users      
! ---------------------
!  (id,1,email,11,t)
!  (id2,2,email2,12,t)
! (2 rows)
! 
! SELECT * FROM get_users();
!  userid | seq | email  | moredrop | enabled 
! --------+-----+--------+----------+---------
!  id     |   1 | email  |       11 | t
!  id2    |   2 | email2 |       12 | t
! (2 rows)
! 
! SELECT * FROM get_users() WITH ORDINALITY;   -- make sure ordinality copes
!  userid | seq | email  | moredrop | enabled | ordinality 
! --------+-----+--------+----------+---------+------------
!  id     |   1 | email  |       11 | t       |          1
!  id2    |   2 | email2 |       12 | t       |          2
! (2 rows)
! 
! -- multiple functions vs. dropped columns
! SELECT * FROM ROWS FROM(generate_series(10,11), get_users()) WITH ORDINALITY;
!  generate_series | userid | seq | email  | moredrop | enabled | ordinality 
! -----------------+--------+-----+--------+----------+---------+------------
!               10 | id     |   1 | email  |       11 | t       |          1
!               11 | id2    |   2 | email2 |       12 | t       |          2
! (2 rows)
! 
! SELECT * FROM ROWS FROM(get_users(), generate_series(10,11)) WITH ORDINALITY;
!  userid | seq | email  | moredrop | enabled | generate_series | ordinality 
! --------+-----+--------+----------+---------+-----------------+------------
!  id     |   1 | email  |       11 | t       |              10 |          1
!  id2    |   2 | email2 |       12 | t       |              11 |          2
! (2 rows)
! 
! -- check that we can cope with post-parsing changes in rowtypes
! create temp view usersview as
! SELECT * FROM ROWS FROM(get_users(), generate_series(10,11)) WITH ORDINALITY;
! select * from usersview;
!  userid | seq | email  | moredrop | enabled | generate_series | ordinality 
! --------+-----+--------+----------+---------+-----------------+------------
!  id     |   1 | email  |       11 | t       |              10 |          1
!  id2    |   2 | email2 |       12 | t       |              11 |          2
! (2 rows)
! 
! alter table users drop column moredrop;
! select * from usersview;
!  userid | seq | email  | moredrop | enabled | generate_series | ordinality 
! --------+-----+--------+----------+---------+-----------------+------------
!  id     |   1 | email  |          | t       |              10 |          1
!  id2    |   2 | email2 |          | t       |              11 |          2
! (2 rows)
! 
! alter table users add column junk text;
! select * from usersview;
!  userid | seq | email  | moredrop | enabled | generate_series | ordinality 
! --------+-----+--------+----------+---------+-----------------+------------
!  id     |   1 | email  |          | t       |              10 |          1
!  id2    |   2 | email2 |          | t       |              11 |          2
! (2 rows)
! 
! alter table users alter column seq type numeric;
! select * from usersview;  -- expect clean failure
! ERROR:  attribute 2 has wrong type
! DETAIL:  Table has type numeric, but query expects integer.
! drop view usersview;
! drop function get_first_user();
! drop function get_users();
! drop table users;
! -- this won't get inlined because of type coercion, but it shouldn't fail
! create or replace function foobar() returns setof text as
! $$ select 'foo'::varchar union all select 'bar'::varchar ; $$
! language sql stable;
! select foobar();
!  foobar 
! --------
!  foo
!  bar
! (2 rows)
! 
! select * from foobar();
!  foobar 
! --------
!  foo
!  bar
! (2 rows)
! 
! drop function foobar();
! -- check handling of a SQL function with multiple OUT params (bug #5777)
! create or replace function foobar(out integer, out numeric) as
! $$ select (1, 2.1) $$ language sql;
! select * from foobar();
!  column1 | column2 
! ---------+---------
!        1 |     2.1
! (1 row)
! 
! create or replace function foobar(out integer, out numeric) as
! $$ select (1, 2) $$ language sql;
! select * from foobar();  -- fail
! ERROR:  function return row and query-specified return row do not match
! DETAIL:  Returned type integer at ordinal position 2, but query expects numeric.
! create or replace function foobar(out integer, out numeric) as
! $$ select (1, 2.1, 3) $$ language sql;
! select * from foobar();  -- fail
! ERROR:  function return row and query-specified return row do not match
! DETAIL:  Returned row contains 3 attributes, but query expects 2.
! drop function foobar();
! -- check behavior when a function's input sometimes returns a set (bug #8228)
! SELECT *,
!   lower(CASE WHEN id = 2 THEN (regexp_matches(str, '^0*([1-9]\d+)$'))[1]
!         ELSE str
!         END)
! FROM
!   (VALUES (1,''), (2,'0000000049404'), (3,'FROM 10000000876')) v(id, str);
!  id |       str        |      lower       
! ----+------------------+------------------
!   1 |                  | 
!   2 | 0000000049404    | 49404
!   3 | FROM 10000000876 | from 10000000876
! (3 rows)
! 
! -- check whole-row-Var handling in nested lateral functions (bug #11703)
! create function extractq2(t int8_tbl) returns int8 as $$
!   select t.q2
! $$ language sql immutable;
! explain (verbose, costs off)
! select x from int8_tbl, extractq2(int8_tbl) f(x);
!                 QUERY PLAN                
! ------------------------------------------
!  Nested Loop
!    Output: f.x
!    ->  Seq Scan on public.int8_tbl
!          Output: int8_tbl.q1, int8_tbl.q2
!    ->  Function Scan on f
!          Output: f.x
!          Function Call: int8_tbl.q2
! (7 rows)
! 
! select x from int8_tbl, extractq2(int8_tbl) f(x);
!          x         
! -------------------
!                456
!   4567890123456789
!                123
!   4567890123456789
!  -4567890123456789
! (5 rows)
! 
! create function extractq2_2(t int8_tbl) returns table(ret1 int8) as $$
!   select extractq2(t) offset 0
! $$ language sql immutable;
! explain (verbose, costs off)
! select x from int8_tbl, extractq2_2(int8_tbl) f(x);
!             QUERY PLAN             
! -----------------------------------
!  Nested Loop
!    Output: ((int8_tbl.*).q2)
!    ->  Seq Scan on public.int8_tbl
!          Output: int8_tbl.*
!    ->  Result
!          Output: (int8_tbl.*).q2
! (6 rows)
! 
! select x from int8_tbl, extractq2_2(int8_tbl) f(x);
!          x         
! -------------------
!                456
!   4567890123456789
!                123
!   4567890123456789
!  -4567890123456789
! (5 rows)
! 
! -- without the "offset 0", this function gets optimized quite differently
! create function extractq2_2_opt(t int8_tbl) returns table(ret1 int8) as $$
!   select extractq2(t)
! $$ language sql immutable;
! explain (verbose, costs off)
! select x from int8_tbl, extractq2_2_opt(int8_tbl) f(x);
!          QUERY PLAN          
! -----------------------------
!  Seq Scan on public.int8_tbl
!    Output: int8_tbl.q2
! (2 rows)
! 
! select x from int8_tbl, extractq2_2_opt(int8_tbl) f(x);
!          x         
! -------------------
!                456
!   4567890123456789
!                123
!   4567890123456789
!  -4567890123456789
! (5 rows)
! 
! -- check handling of nulls in SRF results (bug #7808)
! create type foo2 as (a integer, b text);
! select *, row_to_json(u) from unnest(array[(1,'foo')::foo2, null::foo2]) u;
!  a |  b  |     row_to_json     
! ---+-----+---------------------
!  1 | foo | {"a":1,"b":"foo"}
!    |     | {"a":null,"b":null}
! (2 rows)
! 
! select *, row_to_json(u) from unnest(array[null::foo2, null::foo2]) u;
!  a | b |     row_to_json     
! ---+---+---------------------
!    |   | {"a":null,"b":null}
!    |   | {"a":null,"b":null}
! (2 rows)
! 
! select *, row_to_json(u) from unnest(array[null::foo2, (1,'foo')::foo2, null::foo2]) u;
!  a |  b  |     row_to_json     
! ---+-----+---------------------
!    |     | {"a":null,"b":null}
!  1 | foo | {"a":1,"b":"foo"}
!    |     | {"a":null,"b":null}
! (3 rows)
! 
! select *, row_to_json(u) from unnest(array[]::foo2[]) u;
!  a | b | row_to_json 
! ---+---+-------------
! (0 rows)
! 
! drop type foo2;
--- 1 ----
! psql: FATAL:  the database system is in recovery mode

======================================================================

*** /home/claudiofreire/src/postgresql.work/src/test/regress/expected/prepare.out	2016-09-05 20:45:48.912033114 -0300
--- /home/claudiofreire/src/postgresql.work/src/test/regress/results/prepare.out	2016-09-12 12:14:51.887413916 -0300
***************
*** 1,180 ****
! -- Regression tests for prepareable statements. We query the content
! -- of the pg_prepared_statements view as prepared statements are
! -- created and removed.
! SELECT name, statement, parameter_types FROM pg_prepared_statements;
!  name | statement | parameter_types 
! ------+-----------+-----------------
! (0 rows)
! 
! PREPARE q1 AS SELECT 1 AS a;
! EXECUTE q1;
!  a 
! ---
!  1
! (1 row)
! 
! SELECT name, statement, parameter_types FROM pg_prepared_statements;
!  name |          statement           | parameter_types 
! ------+------------------------------+-----------------
!  q1   | PREPARE q1 AS SELECT 1 AS a; | {}
! (1 row)
! 
! -- should fail
! PREPARE q1 AS SELECT 2;
! ERROR:  prepared statement "q1" already exists
! -- should succeed
! DEALLOCATE q1;
! PREPARE q1 AS SELECT 2;
! EXECUTE q1;
!  ?column? 
! ----------
!         2
! (1 row)
! 
! PREPARE q2 AS SELECT 2 AS b;
! SELECT name, statement, parameter_types FROM pg_prepared_statements;
!  name |          statement           | parameter_types 
! ------+------------------------------+-----------------
!  q1   | PREPARE q1 AS SELECT 2;      | {}
!  q2   | PREPARE q2 AS SELECT 2 AS b; | {}
! (2 rows)
! 
! -- sql92 syntax
! DEALLOCATE PREPARE q1;
! SELECT name, statement, parameter_types FROM pg_prepared_statements;
!  name |          statement           | parameter_types 
! ------+------------------------------+-----------------
!  q2   | PREPARE q2 AS SELECT 2 AS b; | {}
! (1 row)
! 
! DEALLOCATE PREPARE q2;
! -- the view should return the empty set again
! SELECT name, statement, parameter_types FROM pg_prepared_statements;
!  name | statement | parameter_types 
! ------+-----------+-----------------
! (0 rows)
! 
! -- parameterized queries
! PREPARE q2(text) AS
! 	SELECT datname, datistemplate, datallowconn
! 	FROM pg_database WHERE datname = $1;
! EXECUTE q2('postgres');
!  datname  | datistemplate | datallowconn 
! ----------+---------------+--------------
!  postgres | f             | t
! (1 row)
! 
! PREPARE q3(text, int, float, boolean, oid, smallint) AS
! 	SELECT * FROM tenk1 WHERE string4 = $1 AND (four = $2 OR
! 	ten = $3::bigint OR true = $4 OR oid = $5 OR odd = $6::int)
! 	ORDER BY unique1;
! EXECUTE q3('AAAAxx', 5::smallint, 10.5::float, false, 500::oid, 4::bigint);
!  unique1 | unique2 | two | four | ten | twenty | hundred | thousand | twothousand | fivethous | tenthous | odd | even | stringu1 | stringu2 | string4 
! ---------+---------+-----+------+-----+--------+---------+----------+-------------+-----------+----------+-----+------+----------+----------+---------
!        2 |    2716 |   0 |    2 |   2 |      2 |       2 |        2 |           2 |         2 |        2 |   4 |    5 | CAAAAA   | MAEAAA   | AAAAxx
!      102 |     612 |   0 |    2 |   2 |      2 |       2 |      102 |         102 |       102 |      102 |   4 |    5 | YDAAAA   | OXAAAA   | AAAAxx
!      802 |    2908 |   0 |    2 |   2 |      2 |       2 |      802 |         802 |       802 |      802 |   4 |    5 | WEAAAA   | WHEAAA   | AAAAxx
!      902 |    1104 |   0 |    2 |   2 |      2 |       2 |      902 |         902 |       902 |      902 |   4 |    5 | SIAAAA   | MQBAAA   | AAAAxx
!     1002 |    2580 |   0 |    2 |   2 |      2 |       2 |        2 |        1002 |      1002 |     1002 |   4 |    5 | OMAAAA   | GVDAAA   | AAAAxx
!     1602 |    8148 |   0 |    2 |   2 |      2 |       2 |      602 |        1602 |      1602 |     1602 |   4 |    5 | QJAAAA   | KBMAAA   | AAAAxx
!     1702 |    7940 |   0 |    2 |   2 |      2 |       2 |      702 |        1702 |      1702 |     1702 |   4 |    5 | MNAAAA   | KTLAAA   | AAAAxx
!     2102 |    6184 |   0 |    2 |   2 |      2 |       2 |      102 |         102 |      2102 |     2102 |   4 |    5 | WCAAAA   | WDJAAA   | AAAAxx
!     2202 |    8028 |   0 |    2 |   2 |      2 |       2 |      202 |         202 |      2202 |     2202 |   4 |    5 | SGAAAA   | UWLAAA   | AAAAxx
!     2302 |    7112 |   0 |    2 |   2 |      2 |       2 |      302 |         302 |      2302 |     2302 |   4 |    5 | OKAAAA   | ONKAAA   | AAAAxx
!     2902 |    6816 |   0 |    2 |   2 |      2 |       2 |      902 |         902 |      2902 |     2902 |   4 |    5 | QHAAAA   | ECKAAA   | AAAAxx
!     3202 |    7128 |   0 |    2 |   2 |      2 |       2 |      202 |        1202 |      3202 |     3202 |   4 |    5 | ETAAAA   | EOKAAA   | AAAAxx
!     3902 |    9224 |   0 |    2 |   2 |      2 |       2 |      902 |        1902 |      3902 |     3902 |   4 |    5 | CUAAAA   | UQNAAA   | AAAAxx
!     4102 |    7676 |   0 |    2 |   2 |      2 |       2 |      102 |         102 |      4102 |     4102 |   4 |    5 | UBAAAA   | GJLAAA   | AAAAxx
!     4202 |    6628 |   0 |    2 |   2 |      2 |       2 |      202 |         202 |      4202 |     4202 |   4 |    5 | QFAAAA   | YUJAAA   | AAAAxx
!     4502 |     412 |   0 |    2 |   2 |      2 |       2 |      502 |         502 |      4502 |     4502 |   4 |    5 | ERAAAA   | WPAAAA   | AAAAxx
!     4702 |    2520 |   0 |    2 |   2 |      2 |       2 |      702 |         702 |      4702 |     4702 |   4 |    5 | WYAAAA   | YSDAAA   | AAAAxx
!     4902 |    1600 |   0 |    2 |   2 |      2 |       2 |      902 |         902 |      4902 |     4902 |   4 |    5 | OGAAAA   | OJCAAA   | AAAAxx
!     5602 |    8796 |   0 |    2 |   2 |      2 |       2 |      602 |        1602 |       602 |     5602 |   4 |    5 | MHAAAA   | IANAAA   | AAAAxx
!     6002 |    8932 |   0 |    2 |   2 |      2 |       2 |        2 |           2 |      1002 |     6002 |   4 |    5 | WWAAAA   | OFNAAA   | AAAAxx
!     6402 |    3808 |   0 |    2 |   2 |      2 |       2 |      402 |         402 |      1402 |     6402 |   4 |    5 | GMAAAA   | MQFAAA   | AAAAxx
!     7602 |    1040 |   0 |    2 |   2 |      2 |       2 |      602 |        1602 |      2602 |     7602 |   4 |    5 | KGAAAA   | AOBAAA   | AAAAxx
!     7802 |    7508 |   0 |    2 |   2 |      2 |       2 |      802 |        1802 |      2802 |     7802 |   4 |    5 | COAAAA   | UCLAAA   | AAAAxx
!     8002 |    9980 |   0 |    2 |   2 |      2 |       2 |        2 |           2 |      3002 |     8002 |   4 |    5 | UVAAAA   | WTOAAA   | AAAAxx
!     8302 |    7800 |   0 |    2 |   2 |      2 |       2 |      302 |         302 |      3302 |     8302 |   4 |    5 | IHAAAA   | AOLAAA   | AAAAxx
!     8402 |    5708 |   0 |    2 |   2 |      2 |       2 |      402 |         402 |      3402 |     8402 |   4 |    5 | ELAAAA   | OLIAAA   | AAAAxx
!     8602 |    5440 |   0 |    2 |   2 |      2 |       2 |      602 |         602 |      3602 |     8602 |   4 |    5 | WSAAAA   | GBIAAA   | AAAAxx
!     9502 |    1812 |   0 |    2 |   2 |      2 |       2 |      502 |        1502 |      4502 |     9502 |   4 |    5 | MBAAAA   | SRCAAA   | AAAAxx
!     9602 |    9972 |   0 |    2 |   2 |      2 |       2 |      602 |        1602 |      4602 |     9602 |   4 |    5 | IFAAAA   | OTOAAA   | AAAAxx
! (29 rows)
! 
! -- too few params
! EXECUTE q3('bool');
! ERROR:  wrong number of parameters for prepared statement "q3"
! DETAIL:  Expected 6 parameters but got 1.
! -- too many params
! EXECUTE q3('bytea', 5::smallint, 10.5::float, false, 500::oid, 4::bigint, true);
! ERROR:  wrong number of parameters for prepared statement "q3"
! DETAIL:  Expected 6 parameters but got 7.
! -- wrong param types
! EXECUTE q3(5::smallint, 10.5::float, false, 500::oid, 4::bigint, 'bytea');
! ERROR:  parameter $3 of type boolean cannot be coerced to the expected type double precision
! HINT:  You will need to rewrite or cast the expression.
! -- invalid type
! PREPARE q4(nonexistenttype) AS SELECT $1;
! ERROR:  type "nonexistenttype" does not exist
! LINE 1: PREPARE q4(nonexistenttype) AS SELECT $1;
!                    ^
! -- create table as execute
! PREPARE q5(int, text) AS
! 	SELECT * FROM tenk1 WHERE unique1 = $1 OR stringu1 = $2
! 	ORDER BY unique1;
! CREATE TEMPORARY TABLE q5_prep_results AS EXECUTE q5(200, 'DTAAAA');
! SELECT * FROM q5_prep_results;
!  unique1 | unique2 | two | four | ten | twenty | hundred | thousand | twothousand | fivethous | tenthous | odd | even | stringu1 | stringu2 | string4 
! ---------+---------+-----+------+-----+--------+---------+----------+-------------+-----------+----------+-----+------+----------+----------+---------
!      200 |    9441 |   0 |    0 |   0 |      0 |       0 |      200 |         200 |       200 |      200 |   0 |    1 | SHAAAA   | DZNAAA   | HHHHxx
!      497 |    9092 |   1 |    1 |   7 |     17 |      97 |      497 |         497 |       497 |      497 | 194 |  195 | DTAAAA   | SLNAAA   | AAAAxx
!     1173 |    6699 |   1 |    1 |   3 |     13 |      73 |      173 |        1173 |      1173 |     1173 | 146 |  147 | DTAAAA   | RXJAAA   | VVVVxx
!     1849 |    8143 |   1 |    1 |   9 |      9 |      49 |      849 |        1849 |      1849 |     1849 |  98 |   99 | DTAAAA   | FBMAAA   | VVVVxx
!     2525 |      64 |   1 |    1 |   5 |      5 |      25 |      525 |         525 |      2525 |     2525 |  50 |   51 | DTAAAA   | MCAAAA   | AAAAxx
!     3201 |    7309 |   1 |    1 |   1 |      1 |       1 |      201 |        1201 |      3201 |     3201 |   2 |    3 | DTAAAA   | DVKAAA   | HHHHxx
!     3877 |    4060 |   1 |    1 |   7 |     17 |      77 |      877 |        1877 |      3877 |     3877 | 154 |  155 | DTAAAA   | EAGAAA   | AAAAxx
!     4553 |    4113 |   1 |    1 |   3 |     13 |      53 |      553 |         553 |      4553 |     4553 | 106 |  107 | DTAAAA   | FCGAAA   | HHHHxx
!     5229 |    6407 |   1 |    1 |   9 |      9 |      29 |      229 |        1229 |       229 |     5229 |  58 |   59 | DTAAAA   | LMJAAA   | VVVVxx
!     5905 |    9537 |   1 |    1 |   5 |      5 |       5 |      905 |        1905 |       905 |     5905 |  10 |   11 | DTAAAA   | VCOAAA   | HHHHxx
!     6581 |    4686 |   1 |    1 |   1 |      1 |      81 |      581 |         581 |      1581 |     6581 | 162 |  163 | DTAAAA   | GYGAAA   | OOOOxx
!     7257 |    1895 |   1 |    1 |   7 |     17 |      57 |      257 |        1257 |      2257 |     7257 | 114 |  115 | DTAAAA   | XUCAAA   | VVVVxx
!     7933 |    4514 |   1 |    1 |   3 |     13 |      33 |      933 |        1933 |      2933 |     7933 |  66 |   67 | DTAAAA   | QRGAAA   | OOOOxx
!     8609 |    5918 |   1 |    1 |   9 |      9 |       9 |      609 |         609 |      3609 |     8609 |  18 |   19 | DTAAAA   | QTIAAA   | OOOOxx
!     9285 |    8469 |   1 |    1 |   5 |      5 |      85 |      285 |        1285 |      4285 |     9285 | 170 |  171 | DTAAAA   | TNMAAA   | HHHHxx
!     9961 |    2058 |   1 |    1 |   1 |      1 |      61 |      961 |        1961 |      4961 |     9961 | 122 |  123 | DTAAAA   | EBDAAA   | OOOOxx
! (16 rows)
! 
! -- unknown or unspecified parameter types: should succeed
! PREPARE q6 AS
!     SELECT * FROM tenk1 WHERE unique1 = $1 AND stringu1 = $2;
! PREPARE q7(unknown) AS
!     SELECT * FROM road WHERE thepath = $1;
! SELECT name, statement, parameter_types FROM pg_prepared_statements
!     ORDER BY name;
!  name |                              statement                              |                    parameter_types                     
! ------+---------------------------------------------------------------------+--------------------------------------------------------
!  q2   | PREPARE q2(text) AS                                                +| {text}
!       |         SELECT datname, datistemplate, datallowconn                +| 
!       |         FROM pg_database WHERE datname = $1;                        | 
!  q3   | PREPARE q3(text, int, float, boolean, oid, smallint) AS            +| {text,integer,"double precision",boolean,oid,smallint}
!       |         SELECT * FROM tenk1 WHERE string4 = $1 AND (four = $2 OR   +| 
!       |         ten = $3::bigint OR true = $4 OR oid = $5 OR odd = $6::int)+| 
!       |         ORDER BY unique1;                                           | 
!  q5   | PREPARE q5(int, text) AS                                           +| {integer,text}
!       |         SELECT * FROM tenk1 WHERE unique1 = $1 OR stringu1 = $2    +| 
!       |         ORDER BY unique1;                                           | 
!  q6   | PREPARE q6 AS                                                      +| {integer,name}
!       |     SELECT * FROM tenk1 WHERE unique1 = $1 AND stringu1 = $2;       | 
!  q7   | PREPARE q7(unknown) AS                                             +| {path}
!       |     SELECT * FROM road WHERE thepath = $1;                          | 
! (5 rows)
! 
! -- test DEALLOCATE ALL;
! DEALLOCATE ALL;
! SELECT name, statement, parameter_types FROM pg_prepared_statements
!     ORDER BY name;
!  name | statement | parameter_types 
! ------+-----------+-----------------
! (0 rows)
! 
--- 1 ----
! psql: FATAL:  the database system is in recovery mode

======================================================================

*** /home/claudiofreire/src/postgresql.work/src/test/regress/expected/without_oid.out	2016-09-05 20:45:49.140033814 -0300
--- /home/claudiofreire/src/postgresql.work/src/test/regress/results/without_oid.out	2016-09-12 12:14:51.891413917 -0300
***************
*** 1,103 ****
! --
! -- WITHOUT OID
! --
! --
! -- This test tries to verify that WITHOUT OIDS actually saves space.
! -- On machines where MAXALIGN is 8, WITHOUT OIDS may or may not save any
! -- space, depending on the size of the tuple header + null bitmap.
! -- As of 8.3 we need a null bitmap of 8 or less bits for the difference
! -- to appear.
! --
! CREATE TABLE wi (i INT,
!                  n1 int, n2 int, n3 int, n4 int,
!                  n5 int, n6 int, n7 int) WITH OIDS;
! CREATE TABLE wo (i INT,
!                  n1 int, n2 int, n3 int, n4 int,
!                  n5 int, n6 int, n7 int) WITHOUT OIDS;
! INSERT INTO wi VALUES (1);  -- 1
! INSERT INTO wo SELECT i FROM wi;  -- 1
! INSERT INTO wo SELECT i+1 FROM wi;  -- 1+1=2
! INSERT INTO wi SELECT i+1 FROM wo;  -- 1+2=3
! INSERT INTO wi SELECT i+3 FROM wi;  -- 3+3=6
! INSERT INTO wo SELECT i+2 FROM wi;  -- 2+6=8
! INSERT INTO wo SELECT i+8 FROM wo;  -- 8+8=16
! INSERT INTO wi SELECT i+6 FROM wo;  -- 6+16=22
! INSERT INTO wi SELECT i+22 FROM wi;  -- 22+22=44
! INSERT INTO wo SELECT i+16 FROM wi;  -- 16+44=60
! INSERT INTO wo SELECT i+60 FROM wo;  -- 60+60=120
! INSERT INTO wi SELECT i+44 FROM wo;  -- 44+120=164
! INSERT INTO wi SELECT i+164 FROM wi;  -- 164+164=328
! INSERT INTO wo SELECT i+120 FROM wi;  -- 120+328=448
! INSERT INTO wo SELECT i+448 FROM wo;  -- 448+448=896
! INSERT INTO wi SELECT i+328 FROM wo;  -- 328+896=1224
! INSERT INTO wi SELECT i+1224 FROM wi;  -- 1224+1224=2448
! INSERT INTO wo SELECT i+896 FROM wi;  -- 896+2448=3344
! INSERT INTO wo SELECT i+3344 FROM wo;  -- 3344+3344=6688
! INSERT INTO wi SELECT i+2448 FROM wo;  -- 2448+6688=9136
! INSERT INTO wo SELECT i+6688 FROM wi WHERE i<=2448;  -- 6688+2448=9136
! SELECT count(oid) FROM wi;
!  count 
! -------
!   9136
! (1 row)
! 
! -- should fail
! SELECT count(oid) FROM wo;
! ERROR:  column "oid" does not exist
! LINE 1: SELECT count(oid) FROM wo;
!                      ^
! VACUUM ANALYZE wi;
! VACUUM ANALYZE wo;
! SELECT min(relpages) < max(relpages), min(reltuples) - max(reltuples)
!   FROM pg_class
!  WHERE relname IN ('wi', 'wo');
!  ?column? | ?column? 
! ----------+----------
!  t        |        0
! (1 row)
! 
! DROP TABLE wi;
! DROP TABLE wo;
! --
! -- WITH / WITHOUT OIDS in CREATE TABLE AS
! --
! CREATE TABLE create_table_test (
!     a int,
!     b int
! );
! COPY create_table_test FROM stdin;
! CREATE TABLE create_table_test2 WITH OIDS AS
!     SELECT a + b AS c1, a - b AS c2 FROM create_table_test;
! CREATE TABLE create_table_test3 WITHOUT OIDS AS
!     SELECT a + b AS c1, a - b AS c2 FROM create_table_test;
! SELECT count(oid) FROM create_table_test2;
!  count 
! -------
!      2
! (1 row)
! 
! -- should fail
! SELECT count(oid) FROM create_table_test3;
! ERROR:  column "oid" does not exist
! LINE 1: SELECT count(oid) FROM create_table_test3;
!                      ^
! PREPARE table_source(int) AS
!     SELECT a + b AS c1, a - b AS c2, $1 AS c3 FROM create_table_test;
! CREATE TABLE execute_with WITH OIDS AS EXECUTE table_source(1);
! CREATE TABLE execute_without WITHOUT OIDS AS EXECUTE table_source(2);
! SELECT count(oid) FROM execute_with;
!  count 
! -------
!      2
! (1 row)
! 
! -- should fail
! SELECT count(oid) FROM execute_without;
! ERROR:  column "oid" does not exist
! LINE 1: SELECT count(oid) FROM execute_without;
!                      ^
! DROP TABLE create_table_test;
! DROP TABLE create_table_test2;
! DROP TABLE create_table_test3;
! DROP TABLE execute_with;
! DROP TABLE execute_without;
--- 1 ----
! psql: FATAL:  the database system is in recovery mode

======================================================================

*** /home/claudiofreire/src/postgresql.work/src/test/regress/expected/conversion.out	2016-09-05 20:45:48.604032169 -0300
--- /home/claudiofreire/src/postgresql.work/src/test/regress/results/conversion.out	2016-09-12 12:14:51.891413917 -0300
***************
*** 1,39 ****
! --
! -- create user defined conversion
! --
! CREATE USER regress_conversion_user WITH NOCREATEDB NOCREATEROLE;
! SET SESSION AUTHORIZATION regress_conversion_user;
! CREATE CONVERSION myconv FOR 'LATIN1' TO 'UTF8' FROM iso8859_1_to_utf8;
! --
! -- cannot make same name conversion in same schema
! --
! CREATE CONVERSION myconv FOR 'LATIN1' TO 'UTF8' FROM iso8859_1_to_utf8;
! ERROR:  conversion "myconv" already exists
! --
! -- create default conversion with qualified name
! --
! CREATE DEFAULT CONVERSION public.mydef FOR 'LATIN1' TO 'UTF8' FROM iso8859_1_to_utf8;
! --
! -- cannot make default conversion with same schema/for_encoding/to_encoding
! --
! CREATE DEFAULT CONVERSION public.mydef2 FOR 'LATIN1' TO 'UTF8' FROM iso8859_1_to_utf8;
! ERROR:  default conversion for LATIN1 to UTF8 already exists
! -- test comments
! COMMENT ON CONVERSION myconv_bad IS 'foo';
! ERROR:  conversion "myconv_bad" does not exist
! COMMENT ON CONVERSION myconv IS 'bar';
! COMMENT ON CONVERSION myconv IS NULL;
! --
! -- drop user defined conversion
! --
! DROP CONVERSION myconv;
! DROP CONVERSION mydef;
! --
! -- Note: the built-in conversions are exercised in opr_sanity.sql,
! -- so there's no need to do that here.
! --
! --
! -- return to the super user
! --
! RESET SESSION AUTHORIZATION;
! DROP USER regress_conversion_user;
--- 1 ----
! psql: FATAL:  the database system is in recovery mode

======================================================================

*** /home/claudiofreire/src/postgresql.work/src/test/regress/expected/truncate.out	2016-09-05 20:45:49.108033715 -0300
--- /home/claudiofreire/src/postgresql.work/src/test/regress/results/truncate.out	2016-09-12 12:14:51.887413916 -0300
***************
*** 1,422 ****
! -- Test basic TRUNCATE functionality.
! CREATE TABLE truncate_a (col1 integer primary key);
! INSERT INTO truncate_a VALUES (1);
! INSERT INTO truncate_a VALUES (2);
! SELECT * FROM truncate_a;
!  col1 
! ------
!     1
!     2
! (2 rows)
! 
! -- Roll truncate back
! BEGIN;
! TRUNCATE truncate_a;
! ROLLBACK;
! SELECT * FROM truncate_a;
!  col1 
! ------
!     1
!     2
! (2 rows)
! 
! -- Commit the truncate this time
! BEGIN;
! TRUNCATE truncate_a;
! COMMIT;
! SELECT * FROM truncate_a;
!  col1 
! ------
! (0 rows)
! 
! -- Test foreign-key checks
! CREATE TABLE trunc_b (a int REFERENCES truncate_a);
! CREATE TABLE trunc_c (a serial PRIMARY KEY);
! CREATE TABLE trunc_d (a int REFERENCES trunc_c);
! CREATE TABLE trunc_e (a int REFERENCES truncate_a, b int REFERENCES trunc_c);
! TRUNCATE TABLE truncate_a;		-- fail
! ERROR:  cannot truncate a table referenced in a foreign key constraint
! DETAIL:  Table "trunc_b" references "truncate_a".
! HINT:  Truncate table "trunc_b" at the same time, or use TRUNCATE ... CASCADE.
! TRUNCATE TABLE truncate_a,trunc_b;		-- fail
! ERROR:  cannot truncate a table referenced in a foreign key constraint
! DETAIL:  Table "trunc_e" references "truncate_a".
! HINT:  Truncate table "trunc_e" at the same time, or use TRUNCATE ... CASCADE.
! TRUNCATE TABLE truncate_a,trunc_b,trunc_e;	-- ok
! TRUNCATE TABLE truncate_a,trunc_e;		-- fail
! ERROR:  cannot truncate a table referenced in a foreign key constraint
! DETAIL:  Table "trunc_b" references "truncate_a".
! HINT:  Truncate table "trunc_b" at the same time, or use TRUNCATE ... CASCADE.
! TRUNCATE TABLE trunc_c;		-- fail
! ERROR:  cannot truncate a table referenced in a foreign key constraint
! DETAIL:  Table "trunc_d" references "trunc_c".
! HINT:  Truncate table "trunc_d" at the same time, or use TRUNCATE ... CASCADE.
! TRUNCATE TABLE trunc_c,trunc_d;		-- fail
! ERROR:  cannot truncate a table referenced in a foreign key constraint
! DETAIL:  Table "trunc_e" references "trunc_c".
! HINT:  Truncate table "trunc_e" at the same time, or use TRUNCATE ... CASCADE.
! TRUNCATE TABLE trunc_c,trunc_d,trunc_e;	-- ok
! TRUNCATE TABLE trunc_c,trunc_d,trunc_e,truncate_a;	-- fail
! ERROR:  cannot truncate a table referenced in a foreign key constraint
! DETAIL:  Table "trunc_b" references "truncate_a".
! HINT:  Truncate table "trunc_b" at the same time, or use TRUNCATE ... CASCADE.
! TRUNCATE TABLE trunc_c,trunc_d,trunc_e,truncate_a,trunc_b;	-- ok
! TRUNCATE TABLE truncate_a RESTRICT; -- fail
! ERROR:  cannot truncate a table referenced in a foreign key constraint
! DETAIL:  Table "trunc_b" references "truncate_a".
! HINT:  Truncate table "trunc_b" at the same time, or use TRUNCATE ... CASCADE.
! TRUNCATE TABLE truncate_a CASCADE;  -- ok
! NOTICE:  truncate cascades to table "trunc_b"
! NOTICE:  truncate cascades to table "trunc_e"
! -- circular references
! ALTER TABLE truncate_a ADD FOREIGN KEY (col1) REFERENCES trunc_c;
! -- Add some data to verify that truncating actually works ...
! INSERT INTO trunc_c VALUES (1);
! INSERT INTO truncate_a VALUES (1);
! INSERT INTO trunc_b VALUES (1);
! INSERT INTO trunc_d VALUES (1);
! INSERT INTO trunc_e VALUES (1,1);
! TRUNCATE TABLE trunc_c;
! ERROR:  cannot truncate a table referenced in a foreign key constraint
! DETAIL:  Table "truncate_a" references "trunc_c".
! HINT:  Truncate table "truncate_a" at the same time, or use TRUNCATE ... CASCADE.
! TRUNCATE TABLE trunc_c,truncate_a;
! ERROR:  cannot truncate a table referenced in a foreign key constraint
! DETAIL:  Table "trunc_d" references "trunc_c".
! HINT:  Truncate table "trunc_d" at the same time, or use TRUNCATE ... CASCADE.
! TRUNCATE TABLE trunc_c,truncate_a,trunc_d;
! ERROR:  cannot truncate a table referenced in a foreign key constraint
! DETAIL:  Table "trunc_e" references "trunc_c".
! HINT:  Truncate table "trunc_e" at the same time, or use TRUNCATE ... CASCADE.
! TRUNCATE TABLE trunc_c,truncate_a,trunc_d,trunc_e;
! ERROR:  cannot truncate a table referenced in a foreign key constraint
! DETAIL:  Table "trunc_b" references "truncate_a".
! HINT:  Truncate table "trunc_b" at the same time, or use TRUNCATE ... CASCADE.
! TRUNCATE TABLE trunc_c,truncate_a,trunc_d,trunc_e,trunc_b;
! -- Verify that truncating did actually work
! SELECT * FROM truncate_a
!    UNION ALL
!  SELECT * FROM trunc_c
!    UNION ALL
!  SELECT * FROM trunc_b
!    UNION ALL
!  SELECT * FROM trunc_d;
!  col1 
! ------
! (0 rows)
! 
! SELECT * FROM trunc_e;
!  a | b 
! ---+---
! (0 rows)
! 
! -- Add data again to test TRUNCATE ... CASCADE
! INSERT INTO trunc_c VALUES (1);
! INSERT INTO truncate_a VALUES (1);
! INSERT INTO trunc_b VALUES (1);
! INSERT INTO trunc_d VALUES (1);
! INSERT INTO trunc_e VALUES (1,1);
! TRUNCATE TABLE trunc_c CASCADE;  -- ok
! NOTICE:  truncate cascades to table "truncate_a"
! NOTICE:  truncate cascades to table "trunc_d"
! NOTICE:  truncate cascades to table "trunc_e"
! NOTICE:  truncate cascades to table "trunc_b"
! SELECT * FROM truncate_a
!    UNION ALL
!  SELECT * FROM trunc_c
!    UNION ALL
!  SELECT * FROM trunc_b
!    UNION ALL
!  SELECT * FROM trunc_d;
!  col1 
! ------
! (0 rows)
! 
! SELECT * FROM trunc_e;
!  a | b 
! ---+---
! (0 rows)
! 
! DROP TABLE truncate_a,trunc_c,trunc_b,trunc_d,trunc_e CASCADE;
! -- Test TRUNCATE with inheritance
! CREATE TABLE trunc_f (col1 integer primary key);
! INSERT INTO trunc_f VALUES (1);
! INSERT INTO trunc_f VALUES (2);
! CREATE TABLE trunc_fa (col2a text) INHERITS (trunc_f);
! INSERT INTO trunc_fa VALUES (3, 'three');
! CREATE TABLE trunc_fb (col2b int) INHERITS (trunc_f);
! INSERT INTO trunc_fb VALUES (4, 444);
! CREATE TABLE trunc_faa (col3 text) INHERITS (trunc_fa);
! INSERT INTO trunc_faa VALUES (5, 'five', 'FIVE');
! BEGIN;
! SELECT * FROM trunc_f;
!  col1 
! ------
!     1
!     2
!     3
!     4
!     5
! (5 rows)
! 
! TRUNCATE trunc_f;
! SELECT * FROM trunc_f;
!  col1 
! ------
! (0 rows)
! 
! ROLLBACK;
! BEGIN;
! SELECT * FROM trunc_f;
!  col1 
! ------
!     1
!     2
!     3
!     4
!     5
! (5 rows)
! 
! TRUNCATE ONLY trunc_f;
! SELECT * FROM trunc_f;
!  col1 
! ------
!     3
!     4
!     5
! (3 rows)
! 
! ROLLBACK;
! BEGIN;
! SELECT * FROM trunc_f;
!  col1 
! ------
!     1
!     2
!     3
!     4
!     5
! (5 rows)
! 
! SELECT * FROM trunc_fa;
!  col1 | col2a 
! ------+-------
!     3 | three
!     5 | five
! (2 rows)
! 
! SELECT * FROM trunc_faa;
!  col1 | col2a | col3 
! ------+-------+------
!     5 | five  | FIVE
! (1 row)
! 
! TRUNCATE ONLY trunc_fb, ONLY trunc_fa;
! SELECT * FROM trunc_f;
!  col1 
! ------
!     1
!     2
!     5
! (3 rows)
! 
! SELECT * FROM trunc_fa;
!  col1 | col2a 
! ------+-------
!     5 | five
! (1 row)
! 
! SELECT * FROM trunc_faa;
!  col1 | col2a | col3 
! ------+-------+------
!     5 | five  | FIVE
! (1 row)
! 
! ROLLBACK;
! BEGIN;
! SELECT * FROM trunc_f;
!  col1 
! ------
!     1
!     2
!     3
!     4
!     5
! (5 rows)
! 
! SELECT * FROM trunc_fa;
!  col1 | col2a 
! ------+-------
!     3 | three
!     5 | five
! (2 rows)
! 
! SELECT * FROM trunc_faa;
!  col1 | col2a | col3 
! ------+-------+------
!     5 | five  | FIVE
! (1 row)
! 
! TRUNCATE ONLY trunc_fb, trunc_fa;
! SELECT * FROM trunc_f;
!  col1 
! ------
!     1
!     2
! (2 rows)
! 
! SELECT * FROM trunc_fa;
!  col1 | col2a 
! ------+-------
! (0 rows)
! 
! SELECT * FROM trunc_faa;
!  col1 | col2a | col3 
! ------+-------+------
! (0 rows)
! 
! ROLLBACK;
! DROP TABLE trunc_f CASCADE;
! NOTICE:  drop cascades to 3 other objects
! DETAIL:  drop cascades to table trunc_fa
! drop cascades to table trunc_faa
! drop cascades to table trunc_fb
! -- Test ON TRUNCATE triggers
! CREATE TABLE trunc_trigger_test (f1 int, f2 text, f3 text);
! CREATE TABLE trunc_trigger_log (tgop text, tglevel text, tgwhen text,
!         tgargv text, tgtable name, rowcount bigint);
! CREATE FUNCTION trunctrigger() RETURNS trigger as $$
! declare c bigint;
! begin
!     execute 'select count(*) from ' || quote_ident(tg_table_name) into c;
!     insert into trunc_trigger_log values
!       (TG_OP, TG_LEVEL, TG_WHEN, TG_ARGV[0], tg_table_name, c);
!     return null;
! end;
! $$ LANGUAGE plpgsql;
! -- basic before trigger
! INSERT INTO trunc_trigger_test VALUES(1, 'foo', 'bar'), (2, 'baz', 'quux');
! CREATE TRIGGER t
! BEFORE TRUNCATE ON trunc_trigger_test
! FOR EACH STATEMENT
! EXECUTE PROCEDURE trunctrigger('before trigger truncate');
! SELECT count(*) as "Row count in test table" FROM trunc_trigger_test;
!  Row count in test table 
! -------------------------
!                        2
! (1 row)
! 
! SELECT * FROM trunc_trigger_log;
!  tgop | tglevel | tgwhen | tgargv | tgtable | rowcount 
! ------+---------+--------+--------+---------+----------
! (0 rows)
! 
! TRUNCATE trunc_trigger_test;
! SELECT count(*) as "Row count in test table" FROM trunc_trigger_test;
!  Row count in test table 
! -------------------------
!                        0
! (1 row)
! 
! SELECT * FROM trunc_trigger_log;
!    tgop   |  tglevel  | tgwhen |         tgargv          |      tgtable       | rowcount 
! ----------+-----------+--------+-------------------------+--------------------+----------
!  TRUNCATE | STATEMENT | BEFORE | before trigger truncate | trunc_trigger_test |        2
! (1 row)
! 
! DROP TRIGGER t ON trunc_trigger_test;
! truncate trunc_trigger_log;
! -- same test with an after trigger
! INSERT INTO trunc_trigger_test VALUES(1, 'foo', 'bar'), (2, 'baz', 'quux');
! CREATE TRIGGER tt
! AFTER TRUNCATE ON trunc_trigger_test
! FOR EACH STATEMENT
! EXECUTE PROCEDURE trunctrigger('after trigger truncate');
! SELECT count(*) as "Row count in test table" FROM trunc_trigger_test;
!  Row count in test table 
! -------------------------
!                        2
! (1 row)
! 
! SELECT * FROM trunc_trigger_log;
!  tgop | tglevel | tgwhen | tgargv | tgtable | rowcount 
! ------+---------+--------+--------+---------+----------
! (0 rows)
! 
! TRUNCATE trunc_trigger_test;
! SELECT count(*) as "Row count in test table" FROM trunc_trigger_test;
!  Row count in test table 
! -------------------------
!                        0
! (1 row)
! 
! SELECT * FROM trunc_trigger_log;
!    tgop   |  tglevel  | tgwhen |         tgargv         |      tgtable       | rowcount 
! ----------+-----------+--------+------------------------+--------------------+----------
!  TRUNCATE | STATEMENT | AFTER  | after trigger truncate | trunc_trigger_test |        0
! (1 row)
! 
! DROP TABLE trunc_trigger_test;
! DROP TABLE trunc_trigger_log;
! DROP FUNCTION trunctrigger();
! -- test TRUNCATE ... RESTART IDENTITY
! CREATE SEQUENCE truncate_a_id1 START WITH 33;
! CREATE TABLE truncate_a (id serial,
!                          id1 integer default nextval('truncate_a_id1'));
! ALTER SEQUENCE truncate_a_id1 OWNED BY truncate_a.id1;
! INSERT INTO truncate_a DEFAULT VALUES;
! INSERT INTO truncate_a DEFAULT VALUES;
! SELECT * FROM truncate_a;
!  id | id1 
! ----+-----
!   1 |  33
!   2 |  34
! (2 rows)
! 
! TRUNCATE truncate_a;
! INSERT INTO truncate_a DEFAULT VALUES;
! INSERT INTO truncate_a DEFAULT VALUES;
! SELECT * FROM truncate_a;
!  id | id1 
! ----+-----
!   3 |  35
!   4 |  36
! (2 rows)
! 
! TRUNCATE truncate_a RESTART IDENTITY;
! INSERT INTO truncate_a DEFAULT VALUES;
! INSERT INTO truncate_a DEFAULT VALUES;
! SELECT * FROM truncate_a;
!  id | id1 
! ----+-----
!   1 |  33
!   2 |  34
! (2 rows)
! 
! -- check rollback of a RESTART IDENTITY operation
! BEGIN;
! TRUNCATE truncate_a RESTART IDENTITY;
! INSERT INTO truncate_a DEFAULT VALUES;
! SELECT * FROM truncate_a;
!  id | id1 
! ----+-----
!   1 |  33
! (1 row)
! 
! ROLLBACK;
! INSERT INTO truncate_a DEFAULT VALUES;
! INSERT INTO truncate_a DEFAULT VALUES;
! SELECT * FROM truncate_a;
!  id | id1 
! ----+-----
!   1 |  33
!   2 |  34
!   3 |  35
!   4 |  36
! (4 rows)
! 
! DROP TABLE truncate_a;
! SELECT nextval('truncate_a_id1'); -- fail, seq should have been dropped
! ERROR:  relation "truncate_a_id1" does not exist
! LINE 1: SELECT nextval('truncate_a_id1');
!                        ^
--- 1 ----
! psql: FATAL:  the database system is in recovery mode

======================================================================

*** /home/claudiofreire/src/postgresql.work/src/test/regress/expected/alter_table.out	2016-09-05 20:45:48.568032058 -0300
--- /home/claudiofreire/src/postgresql.work/src/test/regress/results/alter_table.out	2016-09-12 12:14:51.891413917 -0300
***************
*** 1,2916 ****
! --
! -- ALTER_TABLE
! -- add attribute
! --
! CREATE TABLE tmp (initial int4);
! COMMENT ON TABLE tmp_wrong IS 'table comment';
! ERROR:  relation "tmp_wrong" does not exist
! COMMENT ON TABLE tmp IS 'table comment';
! COMMENT ON TABLE tmp IS NULL;
! ALTER TABLE tmp ADD COLUMN xmin integer; -- fails
! ERROR:  column name "xmin" conflicts with a system column name
! ALTER TABLE tmp ADD COLUMN a int4 default 3;
! ALTER TABLE tmp ADD COLUMN b name;
! ALTER TABLE tmp ADD COLUMN c text;
! ALTER TABLE tmp ADD COLUMN d float8;
! ALTER TABLE tmp ADD COLUMN e float4;
! ALTER TABLE tmp ADD COLUMN f int2;
! ALTER TABLE tmp ADD COLUMN g polygon;
! ALTER TABLE tmp ADD COLUMN h abstime;
! ALTER TABLE tmp ADD COLUMN i char;
! ALTER TABLE tmp ADD COLUMN j abstime[];
! ALTER TABLE tmp ADD COLUMN k int4;
! ALTER TABLE tmp ADD COLUMN l tid;
! ALTER TABLE tmp ADD COLUMN m xid;
! ALTER TABLE tmp ADD COLUMN n oidvector;
! --ALTER TABLE tmp ADD COLUMN o lock;
! ALTER TABLE tmp ADD COLUMN p smgr;
! ALTER TABLE tmp ADD COLUMN q point;
! ALTER TABLE tmp ADD COLUMN r lseg;
! ALTER TABLE tmp ADD COLUMN s path;
! ALTER TABLE tmp ADD COLUMN t box;
! ALTER TABLE tmp ADD COLUMN u tinterval;
! ALTER TABLE tmp ADD COLUMN v timestamp;
! ALTER TABLE tmp ADD COLUMN w interval;
! ALTER TABLE tmp ADD COLUMN x float8[];
! ALTER TABLE tmp ADD COLUMN y float4[];
! ALTER TABLE tmp ADD COLUMN z int2[];
! INSERT INTO tmp (a, b, c, d, e, f, g, h, i, j, k, l, m, n, p, q, r, s, t, u,
! 	v, w, x, y, z)
!    VALUES (4, 'name', 'text', 4.1, 4.1, 2, '(4.1,4.1,3.1,3.1)',
!         'Mon May  1 00:30:30 1995', 'c', '{Mon May  1 00:30:30 1995, Monday Aug 24 14:43:07 1992, epoch}',
! 	314159, '(1,1)', '512',
! 	'1 2 3 4 5 6 7 8', 'magnetic disk', '(1.1,1.1)', '(4.1,4.1,3.1,3.1)',
! 	'(0,2,4.1,4.1,3.1,3.1)', '(4.1,4.1,3.1,3.1)', '["epoch" "infinity"]',
! 	'epoch', '01:00:10', '{1.0,2.0,3.0,4.0}', '{1.0,2.0,3.0,4.0}', '{1,2,3,4}');
! SELECT * FROM tmp;
!  initial | a |  b   |  c   |  d  |  e  | f |           g           |              h               | i |                                               j                                                |   k    |   l   |  m  |        n        |       p       |     q     |           r           |              s              |          t          |                      u                      |            v             |        w         |     x     |     y     |     z     
! ---------+---+------+------+-----+-----+---+-----------------------+------------------------------+---+------------------------------------------------------------------------------------------------+--------+-------+-----+-----------------+---------------+-----------+-----------------------+-----------------------------+---------------------+---------------------------------------------+--------------------------+------------------+-----------+-----------+-----------
!          | 4 | name | text | 4.1 | 4.1 | 2 | ((4.1,4.1),(3.1,3.1)) | Mon May 01 00:30:30 1995 PDT | c | {"Mon May 01 00:30:30 1995 PDT","Mon Aug 24 14:43:07 1992 PDT","Wed Dec 31 16:00:00 1969 PST"} | 314159 | (1,1) | 512 | 1 2 3 4 5 6 7 8 | magnetic disk | (1.1,1.1) | [(4.1,4.1),(3.1,3.1)] | ((0,2),(4.1,4.1),(3.1,3.1)) | (4.1,4.1),(3.1,3.1) | ["Wed Dec 31 16:00:00 1969 PST" "infinity"] | Thu Jan 01 00:00:00 1970 | @ 1 hour 10 secs | {1,2,3,4} | {1,2,3,4} | {1,2,3,4}
! (1 row)
! 
! DROP TABLE tmp;
! -- the wolf bug - schema mods caused inconsistent row descriptors
! CREATE TABLE tmp (
! 	initial 	int4
! );
! ALTER TABLE tmp ADD COLUMN a int4;
! ALTER TABLE tmp ADD COLUMN b name;
! ALTER TABLE tmp ADD COLUMN c text;
! ALTER TABLE tmp ADD COLUMN d float8;
! ALTER TABLE tmp ADD COLUMN e float4;
! ALTER TABLE tmp ADD COLUMN f int2;
! ALTER TABLE tmp ADD COLUMN g polygon;
! ALTER TABLE tmp ADD COLUMN h abstime;
! ALTER TABLE tmp ADD COLUMN i char;
! ALTER TABLE tmp ADD COLUMN j abstime[];
! ALTER TABLE tmp ADD COLUMN k int4;
! ALTER TABLE tmp ADD COLUMN l tid;
! ALTER TABLE tmp ADD COLUMN m xid;
! ALTER TABLE tmp ADD COLUMN n oidvector;
! --ALTER TABLE tmp ADD COLUMN o lock;
! ALTER TABLE tmp ADD COLUMN p smgr;
! ALTER TABLE tmp ADD COLUMN q point;
! ALTER TABLE tmp ADD COLUMN r lseg;
! ALTER TABLE tmp ADD COLUMN s path;
! ALTER TABLE tmp ADD COLUMN t box;
! ALTER TABLE tmp ADD COLUMN u tinterval;
! ALTER TABLE tmp ADD COLUMN v timestamp;
! ALTER TABLE tmp ADD COLUMN w interval;
! ALTER TABLE tmp ADD COLUMN x float8[];
! ALTER TABLE tmp ADD COLUMN y float4[];
! ALTER TABLE tmp ADD COLUMN z int2[];
! INSERT INTO tmp (a, b, c, d, e, f, g, h, i, j, k, l, m, n, p, q, r, s, t, u,
! 	v, w, x, y, z)
!    VALUES (4, 'name', 'text', 4.1, 4.1, 2, '(4.1,4.1,3.1,3.1)',
!         'Mon May  1 00:30:30 1995', 'c', '{Mon May  1 00:30:30 1995, Monday Aug 24 14:43:07 1992, epoch}',
! 	314159, '(1,1)', '512',
! 	'1 2 3 4 5 6 7 8', 'magnetic disk', '(1.1,1.1)', '(4.1,4.1,3.1,3.1)',
! 	'(0,2,4.1,4.1,3.1,3.1)', '(4.1,4.1,3.1,3.1)', '["epoch" "infinity"]',
! 	'epoch', '01:00:10', '{1.0,2.0,3.0,4.0}', '{1.0,2.0,3.0,4.0}', '{1,2,3,4}');
! SELECT * FROM tmp;
!  initial | a |  b   |  c   |  d  |  e  | f |           g           |              h               | i |                                               j                                                |   k    |   l   |  m  |        n        |       p       |     q     |           r           |              s              |          t          |                      u                      |            v             |        w         |     x     |     y     |     z     
! ---------+---+------+------+-----+-----+---+-----------------------+------------------------------+---+------------------------------------------------------------------------------------------------+--------+-------+-----+-----------------+---------------+-----------+-----------------------+-----------------------------+---------------------+---------------------------------------------+--------------------------+------------------+-----------+-----------+-----------
!          | 4 | name | text | 4.1 | 4.1 | 2 | ((4.1,4.1),(3.1,3.1)) | Mon May 01 00:30:30 1995 PDT | c | {"Mon May 01 00:30:30 1995 PDT","Mon Aug 24 14:43:07 1992 PDT","Wed Dec 31 16:00:00 1969 PST"} | 314159 | (1,1) | 512 | 1 2 3 4 5 6 7 8 | magnetic disk | (1.1,1.1) | [(4.1,4.1),(3.1,3.1)] | ((0,2),(4.1,4.1),(3.1,3.1)) | (4.1,4.1),(3.1,3.1) | ["Wed Dec 31 16:00:00 1969 PST" "infinity"] | Thu Jan 01 00:00:00 1970 | @ 1 hour 10 secs | {1,2,3,4} | {1,2,3,4} | {1,2,3,4}
! (1 row)
! 
! DROP TABLE tmp;
! --
! -- rename - check on both non-temp and temp tables
! --
! CREATE TABLE tmp (regtable int);
! CREATE TEMP TABLE tmp (tmptable int);
! ALTER TABLE tmp RENAME TO tmp_new;
! SELECT * FROM tmp;
!  regtable 
! ----------
! (0 rows)
! 
! SELECT * FROM tmp_new;
!  tmptable 
! ----------
! (0 rows)
! 
! ALTER TABLE tmp RENAME TO tmp_new2;
! SELECT * FROM tmp;		-- should fail
! ERROR:  relation "tmp" does not exist
! LINE 1: SELECT * FROM tmp;
!                       ^
! SELECT * FROM tmp_new;
!  tmptable 
! ----------
! (0 rows)
! 
! SELECT * FROM tmp_new2;
!  regtable 
! ----------
! (0 rows)
! 
! DROP TABLE tmp_new;
! DROP TABLE tmp_new2;
! -- ALTER TABLE ... RENAME on non-table relations
! -- renaming indexes (FIXME: this should probably test the index's functionality)
! ALTER INDEX IF EXISTS __onek_unique1 RENAME TO tmp_onek_unique1;
! NOTICE:  relation "__onek_unique1" does not exist, skipping
! ALTER INDEX IF EXISTS __tmp_onek_unique1 RENAME TO onek_unique1;
! NOTICE:  relation "__tmp_onek_unique1" does not exist, skipping
! ALTER INDEX onek_unique1 RENAME TO tmp_onek_unique1;
! ALTER INDEX tmp_onek_unique1 RENAME TO onek_unique1;
! -- renaming views
! CREATE VIEW tmp_view (unique1) AS SELECT unique1 FROM tenk1;
! ALTER TABLE tmp_view RENAME TO tmp_view_new;
! -- hack to ensure we get an indexscan here
! set enable_seqscan to off;
! set enable_bitmapscan to off;
! -- 5 values, sorted
! SELECT unique1 FROM tenk1 WHERE unique1 < 5;
!  unique1 
! ---------
!        0
!        1
!        2
!        3
!        4
! (5 rows)
! 
! reset enable_seqscan;
! reset enable_bitmapscan;
! DROP VIEW tmp_view_new;
! -- toast-like relation name
! alter table stud_emp rename to pg_toast_stud_emp;
! alter table pg_toast_stud_emp rename to stud_emp;
! -- renaming index should rename constraint as well
! ALTER TABLE onek ADD CONSTRAINT onek_unique1_constraint UNIQUE (unique1);
! ALTER INDEX onek_unique1_constraint RENAME TO onek_unique1_constraint_foo;
! ALTER TABLE onek DROP CONSTRAINT onek_unique1_constraint_foo;
! -- renaming constraint
! ALTER TABLE onek ADD CONSTRAINT onek_check_constraint CHECK (unique1 >= 0);
! ALTER TABLE onek RENAME CONSTRAINT onek_check_constraint TO onek_check_constraint_foo;
! ALTER TABLE onek DROP CONSTRAINT onek_check_constraint_foo;
! -- renaming constraint should rename index as well
! ALTER TABLE onek ADD CONSTRAINT onek_unique1_constraint UNIQUE (unique1);
! DROP INDEX onek_unique1_constraint;  -- to see whether it's there
! ERROR:  cannot drop index onek_unique1_constraint because constraint onek_unique1_constraint on table onek requires it
! HINT:  You can drop constraint onek_unique1_constraint on table onek instead.
! ALTER TABLE onek RENAME CONSTRAINT onek_unique1_constraint TO onek_unique1_constraint_foo;
! DROP INDEX onek_unique1_constraint_foo;  -- to see whether it's there
! ERROR:  cannot drop index onek_unique1_constraint_foo because constraint onek_unique1_constraint_foo on table onek requires it
! HINT:  You can drop constraint onek_unique1_constraint_foo on table onek instead.
! ALTER TABLE onek DROP CONSTRAINT onek_unique1_constraint_foo;
! -- renaming constraints vs. inheritance
! CREATE TABLE constraint_rename_test (a int CONSTRAINT con1 CHECK (a > 0), b int, c int);
! \d constraint_rename_test
! Table "public.constraint_rename_test"
!  Column |  Type   | Modifiers 
! --------+---------+-----------
!  a      | integer | 
!  b      | integer | 
!  c      | integer | 
! Check constraints:
!     "con1" CHECK (a > 0)
! 
! CREATE TABLE constraint_rename_test2 (a int CONSTRAINT con1 CHECK (a > 0), d int) INHERITS (constraint_rename_test);
! NOTICE:  merging column "a" with inherited definition
! NOTICE:  merging constraint "con1" with inherited definition
! \d constraint_rename_test2
! Table "public.constraint_rename_test2"
!  Column |  Type   | Modifiers 
! --------+---------+-----------
!  a      | integer | 
!  b      | integer | 
!  c      | integer | 
!  d      | integer | 
! Check constraints:
!     "con1" CHECK (a > 0)
! Inherits: constraint_rename_test
! 
! ALTER TABLE constraint_rename_test2 RENAME CONSTRAINT con1 TO con1foo; -- fail
! ERROR:  cannot rename inherited constraint "con1"
! ALTER TABLE ONLY constraint_rename_test RENAME CONSTRAINT con1 TO con1foo; -- fail
! ERROR:  inherited constraint "con1" must be renamed in child tables too
! ALTER TABLE constraint_rename_test RENAME CONSTRAINT con1 TO con1foo; -- ok
! \d constraint_rename_test
! Table "public.constraint_rename_test"
!  Column |  Type   | Modifiers 
! --------+---------+-----------
!  a      | integer | 
!  b      | integer | 
!  c      | integer | 
! Check constraints:
!     "con1foo" CHECK (a > 0)
! Number of child tables: 1 (Use \d+ to list them.)
! 
! \d constraint_rename_test2
! Table "public.constraint_rename_test2"
!  Column |  Type   | Modifiers 
! --------+---------+-----------
!  a      | integer | 
!  b      | integer | 
!  c      | integer | 
!  d      | integer | 
! Check constraints:
!     "con1foo" CHECK (a > 0)
! Inherits: constraint_rename_test
! 
! ALTER TABLE constraint_rename_test ADD CONSTRAINT con2 CHECK (b > 0) NO INHERIT;
! ALTER TABLE ONLY constraint_rename_test RENAME CONSTRAINT con2 TO con2foo; -- ok
! ALTER TABLE constraint_rename_test RENAME CONSTRAINT con2foo TO con2bar; -- ok
! \d constraint_rename_test
! Table "public.constraint_rename_test"
!  Column |  Type   | Modifiers 
! --------+---------+-----------
!  a      | integer | 
!  b      | integer | 
!  c      | integer | 
! Check constraints:
!     "con1foo" CHECK (a > 0)
!     "con2bar" CHECK (b > 0) NO INHERIT
! Number of child tables: 1 (Use \d+ to list them.)
! 
! \d constraint_rename_test2
! Table "public.constraint_rename_test2"
!  Column |  Type   | Modifiers 
! --------+---------+-----------
!  a      | integer | 
!  b      | integer | 
!  c      | integer | 
!  d      | integer | 
! Check constraints:
!     "con1foo" CHECK (a > 0)
! Inherits: constraint_rename_test
! 
! ALTER TABLE constraint_rename_test ADD CONSTRAINT con3 PRIMARY KEY (a);
! ALTER TABLE constraint_rename_test RENAME CONSTRAINT con3 TO con3foo; -- ok
! \d constraint_rename_test
! Table "public.constraint_rename_test"
!  Column |  Type   | Modifiers 
! --------+---------+-----------
!  a      | integer | not null
!  b      | integer | 
!  c      | integer | 
! Indexes:
!     "con3foo" PRIMARY KEY, btree (a)
! Check constraints:
!     "con1foo" CHECK (a > 0)
!     "con2bar" CHECK (b > 0) NO INHERIT
! Number of child tables: 1 (Use \d+ to list them.)
! 
! \d constraint_rename_test2
! Table "public.constraint_rename_test2"
!  Column |  Type   | Modifiers 
! --------+---------+-----------
!  a      | integer | 
!  b      | integer | 
!  c      | integer | 
!  d      | integer | 
! Check constraints:
!     "con1foo" CHECK (a > 0)
! Inherits: constraint_rename_test
! 
! DROP TABLE constraint_rename_test2;
! DROP TABLE constraint_rename_test;
! ALTER TABLE IF EXISTS constraint_not_exist RENAME CONSTRAINT con3 TO con3foo; -- ok
! NOTICE:  relation "constraint_not_exist" does not exist, skipping
! ALTER TABLE IF EXISTS constraint_rename_test ADD CONSTRAINT con4 UNIQUE (a);
! NOTICE:  relation "constraint_rename_test" does not exist, skipping
! -- FOREIGN KEY CONSTRAINT adding TEST
! CREATE TABLE tmp2 (a int primary key);
! CREATE TABLE tmp3 (a int, b int);
! CREATE TABLE tmp4 (a int, b int, unique(a,b));
! CREATE TABLE tmp5 (a int, b int);
! -- Insert rows into tmp2 (pktable)
! INSERT INTO tmp2 values (1);
! INSERT INTO tmp2 values (2);
! INSERT INTO tmp2 values (3);
! INSERT INTO tmp2 values (4);
! -- Insert rows into tmp3
! INSERT INTO tmp3 values (1,10);
! INSERT INTO tmp3 values (1,20);
! INSERT INTO tmp3 values (5,50);
! -- Try (and fail) to add constraint due to invalid source columns
! ALTER TABLE tmp3 add constraint tmpconstr foreign key(c) references tmp2 match full;
! ERROR:  column "c" referenced in foreign key constraint does not exist
! -- Try (and fail) to add constraint due to invalide destination columns explicitly given
! ALTER TABLE tmp3 add constraint tmpconstr foreign key(a) references tmp2(b) match full;
! ERROR:  column "b" referenced in foreign key constraint does not exist
! -- Try (and fail) to add constraint due to invalid data
! ALTER TABLE tmp3 add constraint tmpconstr foreign key (a) references tmp2 match full;
! ERROR:  insert or update on table "tmp3" violates foreign key constraint "tmpconstr"
! DETAIL:  Key (a)=(5) is not present in table "tmp2".
! -- Delete failing row
! DELETE FROM tmp3 where a=5;
! -- Try (and succeed)
! ALTER TABLE tmp3 add constraint tmpconstr foreign key (a) references tmp2 match full;
! ALTER TABLE tmp3 drop constraint tmpconstr;
! INSERT INTO tmp3 values (5,50);
! -- Try NOT VALID and then VALIDATE CONSTRAINT, but fails. Delete failure then re-validate
! ALTER TABLE tmp3 add constraint tmpconstr foreign key (a) references tmp2 match full NOT VALID;
! ALTER TABLE tmp3 validate constraint tmpconstr;
! ERROR:  insert or update on table "tmp3" violates foreign key constraint "tmpconstr"
! DETAIL:  Key (a)=(5) is not present in table "tmp2".
! -- Delete failing row
! DELETE FROM tmp3 where a=5;
! -- Try (and succeed) and repeat to show it works on already valid constraint
! ALTER TABLE tmp3 validate constraint tmpconstr;
! ALTER TABLE tmp3 validate constraint tmpconstr;
! -- Try a non-verified CHECK constraint
! ALTER TABLE tmp3 ADD CONSTRAINT b_greater_than_ten CHECK (b > 10); -- fail
! ERROR:  check constraint "b_greater_than_ten" is violated by some row
! ALTER TABLE tmp3 ADD CONSTRAINT b_greater_than_ten CHECK (b > 10) NOT VALID; -- succeeds
! ALTER TABLE tmp3 VALIDATE CONSTRAINT b_greater_than_ten; -- fails
! ERROR:  check constraint "b_greater_than_ten" is violated by some row
! DELETE FROM tmp3 WHERE NOT b > 10;
! ALTER TABLE tmp3 VALIDATE CONSTRAINT b_greater_than_ten; -- succeeds
! ALTER TABLE tmp3 VALIDATE CONSTRAINT b_greater_than_ten; -- succeeds
! -- Test inherited NOT VALID CHECK constraints
! select * from tmp3;
!  a | b  
! ---+----
!  1 | 20
! (1 row)
! 
! CREATE TABLE tmp6 () INHERITS (tmp3);
! CREATE TABLE tmp7 () INHERITS (tmp3);
! INSERT INTO tmp6 VALUES (6, 30), (7, 16);
! ALTER TABLE tmp3 ADD CONSTRAINT b_le_20 CHECK (b <= 20) NOT VALID;
! ALTER TABLE tmp3 VALIDATE CONSTRAINT b_le_20;	-- fails
! ERROR:  check constraint "b_le_20" is violated by some row
! DELETE FROM tmp6 WHERE b > 20;
! ALTER TABLE tmp3 VALIDATE CONSTRAINT b_le_20;	-- succeeds
! -- An already validated constraint must not be revalidated
! CREATE FUNCTION boo(int) RETURNS int IMMUTABLE STRICT LANGUAGE plpgsql AS $$ BEGIN RAISE NOTICE 'boo: %', $1; RETURN $1; END; $$;
! INSERT INTO tmp7 VALUES (8, 18);
! ALTER TABLE tmp7 ADD CONSTRAINT identity CHECK (b = boo(b));
! NOTICE:  boo: 18
! ALTER TABLE tmp3 ADD CONSTRAINT IDENTITY check (b = boo(b)) NOT VALID;
! NOTICE:  merging constraint "identity" with inherited definition
! ALTER TABLE tmp3 VALIDATE CONSTRAINT identity;
! NOTICE:  boo: 16
! NOTICE:  boo: 20
! -- Try (and fail) to create constraint from tmp5(a) to tmp4(a) - unique constraint on
! -- tmp4 is a,b
! ALTER TABLE tmp5 add constraint tmpconstr foreign key(a) references tmp4(a) match full;
! ERROR:  there is no unique constraint matching given keys for referenced table "tmp4"
! DROP TABLE tmp7;
! DROP TABLE tmp6;
! DROP TABLE tmp5;
! DROP TABLE tmp4;
! DROP TABLE tmp3;
! DROP TABLE tmp2;
! -- NOT VALID with plan invalidation -- ensure we don't use a constraint for
! -- exclusion until validated
! set constraint_exclusion TO 'partition';
! create table nv_parent (d date, check (false) no inherit not valid);
! -- not valid constraint added at creation time should automatically become valid
! \d nv_parent
!  Table "public.nv_parent"
!  Column | Type | Modifiers 
! --------+------+-----------
!  d      | date | 
! Check constraints:
!     "nv_parent_check" CHECK (false) NO INHERIT
! 
! create table nv_child_2010 () inherits (nv_parent);
! create table nv_child_2011 () inherits (nv_parent);
! alter table nv_child_2010 add check (d between '2010-01-01'::date and '2010-12-31'::date) not valid;
! alter table nv_child_2011 add check (d between '2011-01-01'::date and '2011-12-31'::date) not valid;
! explain (costs off) select * from nv_parent where d between '2011-08-01' and '2011-08-31';
!                                 QUERY PLAN                                 
! ---------------------------------------------------------------------------
!  Append
!    ->  Seq Scan on nv_parent
!          Filter: ((d >= '08-01-2011'::date) AND (d <= '08-31-2011'::date))
!    ->  Seq Scan on nv_child_2010
!          Filter: ((d >= '08-01-2011'::date) AND (d <= '08-31-2011'::date))
!    ->  Seq Scan on nv_child_2011
!          Filter: ((d >= '08-01-2011'::date) AND (d <= '08-31-2011'::date))
! (7 rows)
! 
! create table nv_child_2009 (check (d between '2009-01-01'::date and '2009-12-31'::date)) inherits (nv_parent);
! explain (costs off) select * from nv_parent where d between '2011-08-01'::date and '2011-08-31'::date;
!                                 QUERY PLAN                                 
! ---------------------------------------------------------------------------
!  Append
!    ->  Seq Scan on nv_parent
!          Filter: ((d >= '08-01-2011'::date) AND (d <= '08-31-2011'::date))
!    ->  Seq Scan on nv_child_2010
!          Filter: ((d >= '08-01-2011'::date) AND (d <= '08-31-2011'::date))
!    ->  Seq Scan on nv_child_2011
!          Filter: ((d >= '08-01-2011'::date) AND (d <= '08-31-2011'::date))
! (7 rows)
! 
! explain (costs off) select * from nv_parent where d between '2009-08-01'::date and '2009-08-31'::date;
!                                 QUERY PLAN                                 
! ---------------------------------------------------------------------------
!  Append
!    ->  Seq Scan on nv_parent
!          Filter: ((d >= '08-01-2009'::date) AND (d <= '08-31-2009'::date))
!    ->  Seq Scan on nv_child_2010
!          Filter: ((d >= '08-01-2009'::date) AND (d <= '08-31-2009'::date))
!    ->  Seq Scan on nv_child_2011
!          Filter: ((d >= '08-01-2009'::date) AND (d <= '08-31-2009'::date))
!    ->  Seq Scan on nv_child_2009
!          Filter: ((d >= '08-01-2009'::date) AND (d <= '08-31-2009'::date))
! (9 rows)
! 
! -- after validation, the constraint should be used
! alter table nv_child_2011 VALIDATE CONSTRAINT nv_child_2011_d_check;
! explain (costs off) select * from nv_parent where d between '2009-08-01'::date and '2009-08-31'::date;
!                                 QUERY PLAN                                 
! ---------------------------------------------------------------------------
!  Append
!    ->  Seq Scan on nv_parent
!          Filter: ((d >= '08-01-2009'::date) AND (d <= '08-31-2009'::date))
!    ->  Seq Scan on nv_child_2010
!          Filter: ((d >= '08-01-2009'::date) AND (d <= '08-31-2009'::date))
!    ->  Seq Scan on nv_child_2009
!          Filter: ((d >= '08-01-2009'::date) AND (d <= '08-31-2009'::date))
! (7 rows)
! 
! -- add an inherited NOT VALID constraint
! alter table nv_parent add check (d between '2001-01-01'::date and '2099-12-31'::date) not valid;
! \d nv_child_2009
! Table "public.nv_child_2009"
!  Column | Type | Modifiers 
! --------+------+-----------
!  d      | date | 
! Check constraints:
!     "nv_child_2009_d_check" CHECK (d >= '01-01-2009'::date AND d <= '12-31-2009'::date)
!     "nv_parent_d_check" CHECK (d >= '01-01-2001'::date AND d <= '12-31-2099'::date) NOT VALID
! Inherits: nv_parent
! 
! -- we leave nv_parent and children around to help test pg_dump logic
! -- Foreign key adding test with mixed types
! -- Note: these tables are TEMP to avoid name conflicts when this test
! -- is run in parallel with foreign_key.sql.
! CREATE TEMP TABLE PKTABLE (ptest1 int PRIMARY KEY);
! INSERT INTO PKTABLE VALUES(42);
! CREATE TEMP TABLE FKTABLE (ftest1 inet);
! -- This next should fail, because int=inet does not exist
! ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1) references pktable;
! ERROR:  foreign key constraint "fktable_ftest1_fkey" cannot be implemented
! DETAIL:  Key columns "ftest1" and "ptest1" are of incompatible types: inet and integer.
! -- This should also fail for the same reason, but here we
! -- give the column name
! ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1) references pktable(ptest1);
! ERROR:  foreign key constraint "fktable_ftest1_fkey" cannot be implemented
! DETAIL:  Key columns "ftest1" and "ptest1" are of incompatible types: inet and integer.
! DROP TABLE FKTABLE;
! -- This should succeed, even though they are different types,
! -- because int=int8 exists and is a member of the integer opfamily
! CREATE TEMP TABLE FKTABLE (ftest1 int8);
! ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1) references pktable;
! -- Check it actually works
! INSERT INTO FKTABLE VALUES(42);		-- should succeed
! INSERT INTO FKTABLE VALUES(43);		-- should fail
! ERROR:  insert or update on table "fktable" violates foreign key constraint "fktable_ftest1_fkey"
! DETAIL:  Key (ftest1)=(43) is not present in table "pktable".
! DROP TABLE FKTABLE;
! -- This should fail, because we'd have to cast numeric to int which is
! -- not an implicit coercion (or use numeric=numeric, but that's not part
! -- of the integer opfamily)
! CREATE TEMP TABLE FKTABLE (ftest1 numeric);
! ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1) references pktable;
! ERROR:  foreign key constraint "fktable_ftest1_fkey" cannot be implemented
! DETAIL:  Key columns "ftest1" and "ptest1" are of incompatible types: numeric and integer.
! DROP TABLE FKTABLE;
! DROP TABLE PKTABLE;
! -- On the other hand, this should work because int implicitly promotes to
! -- numeric, and we allow promotion on the FK side
! CREATE TEMP TABLE PKTABLE (ptest1 numeric PRIMARY KEY);
! INSERT INTO PKTABLE VALUES(42);
! CREATE TEMP TABLE FKTABLE (ftest1 int);
! ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1) references pktable;
! -- Check it actually works
! INSERT INTO FKTABLE VALUES(42);		-- should succeed
! INSERT INTO FKTABLE VALUES(43);		-- should fail
! ERROR:  insert or update on table "fktable" violates foreign key constraint "fktable_ftest1_fkey"
! DETAIL:  Key (ftest1)=(43) is not present in table "pktable".
! DROP TABLE FKTABLE;
! DROP TABLE PKTABLE;
! CREATE TEMP TABLE PKTABLE (ptest1 int, ptest2 inet,
!                            PRIMARY KEY(ptest1, ptest2));
! -- This should fail, because we just chose really odd types
! CREATE TEMP TABLE FKTABLE (ftest1 cidr, ftest2 timestamp);
! ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1, ftest2) references pktable;
! ERROR:  foreign key constraint "fktable_ftest1_fkey" cannot be implemented
! DETAIL:  Key columns "ftest1" and "ptest1" are of incompatible types: cidr and integer.
! DROP TABLE FKTABLE;
! -- Again, so should this...
! CREATE TEMP TABLE FKTABLE (ftest1 cidr, ftest2 timestamp);
! ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1, ftest2)
!      references pktable(ptest1, ptest2);
! ERROR:  foreign key constraint "fktable_ftest1_fkey" cannot be implemented
! DETAIL:  Key columns "ftest1" and "ptest1" are of incompatible types: cidr and integer.
! DROP TABLE FKTABLE;
! -- This fails because we mixed up the column ordering
! CREATE TEMP TABLE FKTABLE (ftest1 int, ftest2 inet);
! ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest1, ftest2)
!      references pktable(ptest2, ptest1);
! ERROR:  foreign key constraint "fktable_ftest1_fkey" cannot be implemented
! DETAIL:  Key columns "ftest1" and "ptest2" are of incompatible types: integer and inet.
! -- As does this...
! ALTER TABLE FKTABLE ADD FOREIGN KEY(ftest2, ftest1)
!      references pktable(ptest1, ptest2);
! ERROR:  foreign key constraint "fktable_ftest2_fkey" cannot be implemented
! DETAIL:  Key columns "ftest2" and "ptest1" are of incompatible types: inet and integer.
! -- temp tables should go away by themselves, need not drop them.
! -- test check constraint adding
! create table atacc1 ( test int );
! -- add a check constraint
! alter table atacc1 add constraint atacc_test1 check (test>3);
! -- should fail
! insert into atacc1 (test) values (2);
! ERROR:  new row for relation "atacc1" violates check constraint "atacc_test1"
! DETAIL:  Failing row contains (2).
! -- should succeed
! insert into atacc1 (test) values (4);
! drop table atacc1;
! -- let's do one where the check fails when added
! create table atacc1 ( test int );
! -- insert a soon to be failing row
! insert into atacc1 (test) values (2);
! -- add a check constraint (fails)
! alter table atacc1 add constraint atacc_test1 check (test>3);
! ERROR:  check constraint "atacc_test1" is violated by some row
! insert into atacc1 (test) values (4);
! drop table atacc1;
! -- let's do one where the check fails because the column doesn't exist
! create table atacc1 ( test int );
! -- add a check constraint (fails)
! alter table atacc1 add constraint atacc_test1 check (test1>3);
! ERROR:  column "test1" does not exist
! HINT:  Perhaps you meant to reference the column "atacc1.test".
! drop table atacc1;
! -- something a little more complicated
! create table atacc1 ( test int, test2 int, test3 int);
! -- add a check constraint (fails)
! alter table atacc1 add constraint atacc_test1 check (test+test2<test3*4);
! -- should fail
! insert into atacc1 (test,test2,test3) values (4,4,2);
! ERROR:  new row for relation "atacc1" violates check constraint "atacc_test1"
! DETAIL:  Failing row contains (4, 4, 2).
! -- should succeed
! insert into atacc1 (test,test2,test3) values (4,4,5);
! drop table atacc1;
! -- lets do some naming tests
! create table atacc1 (test int check (test>3), test2 int);
! alter table atacc1 add check (test2>test);
! -- should fail for $2
! insert into atacc1 (test2, test) values (3, 4);
! ERROR:  new row for relation "atacc1" violates check constraint "atacc1_check"
! DETAIL:  Failing row contains (4, 3).
! drop table atacc1;
! -- inheritance related tests
! create table atacc1 (test int);
! create table atacc2 (test2 int);
! create table atacc3 (test3 int) inherits (atacc1, atacc2);
! alter table atacc2 add constraint foo check (test2>0);
! -- fail and then succeed on atacc2
! insert into atacc2 (test2) values (-3);
! ERROR:  new row for relation "atacc2" violates check constraint "foo"
! DETAIL:  Failing row contains (-3).
! insert into atacc2 (test2) values (3);
! -- fail and then succeed on atacc3
! insert into atacc3 (test2) values (-3);
! ERROR:  new row for relation "atacc3" violates check constraint "foo"
! DETAIL:  Failing row contains (null, -3, null).
! insert into atacc3 (test2) values (3);
! drop table atacc3;
! drop table atacc2;
! drop table atacc1;
! -- same things with one created with INHERIT
! create table atacc1 (test int);
! create table atacc2 (test2 int);
! create table atacc3 (test3 int) inherits (atacc1, atacc2);
! alter table atacc3 no inherit atacc2;
! -- fail
! alter table atacc3 no inherit atacc2;
! ERROR:  relation "atacc2" is not a parent of relation "atacc3"
! -- make sure it really isn't a child
! insert into atacc3 (test2) values (3);
! select test2 from atacc2;
!  test2 
! -------
! (0 rows)
! 
! -- fail due to missing constraint
! alter table atacc2 add constraint foo check (test2>0);
! alter table atacc3 inherit atacc2;
! ERROR:  child table is missing constraint "foo"
! -- fail due to missing column
! alter table atacc3 rename test2 to testx;
! alter table atacc3 inherit atacc2;
! ERROR:  child table is missing column "test2"
! -- fail due to mismatched data type
! alter table atacc3 add test2 bool;
! alter table atacc3 inherit atacc2;
! ERROR:  child table "atacc3" has different type for column "test2"
! alter table atacc3 drop test2;
! -- succeed
! alter table atacc3 add test2 int;
! update atacc3 set test2 = 4 where test2 is null;
! alter table atacc3 add constraint foo check (test2>0);
! alter table atacc3 inherit atacc2;
! -- fail due to duplicates and circular inheritance
! alter table atacc3 inherit atacc2;
! ERROR:  relation "atacc2" would be inherited from more than once
! alter table atacc2 inherit atacc3;
! ERROR:  circular inheritance not allowed
! DETAIL:  "atacc3" is already a child of "atacc2".
! alter table atacc2 inherit atacc2;
! ERROR:  circular inheritance not allowed
! DETAIL:  "atacc2" is already a child of "atacc2".
! -- test that we really are a child now (should see 4 not 3 and cascade should go through)
! select test2 from atacc2;
!  test2 
! -------
!      4
! (1 row)
! 
! drop table atacc2 cascade;
! NOTICE:  drop cascades to table atacc3
! drop table atacc1;
! -- adding only to a parent is allowed as of 9.2
! create table atacc1 (test int);
! create table atacc2 (test2 int) inherits (atacc1);
! -- ok:
! alter table atacc1 add constraint foo check (test>0) no inherit;
! -- check constraint is not there on child
! insert into atacc2 (test) values (-3);
! -- check constraint is there on parent
! insert into atacc1 (test) values (-3);
! ERROR:  new row for relation "atacc1" violates check constraint "foo"
! DETAIL:  Failing row contains (-3).
! insert into atacc1 (test) values (3);
! -- fail, violating row:
! alter table atacc2 add constraint foo check (test>0) no inherit;
! ERROR:  check constraint "foo" is violated by some row
! drop table atacc2;
! drop table atacc1;
! -- test unique constraint adding
! create table atacc1 ( test int ) with oids;
! -- add a unique constraint
! alter table atacc1 add constraint atacc_test1 unique (test);
! -- insert first value
! insert into atacc1 (test) values (2);
! -- should fail
! insert into atacc1 (test) values (2);
! ERROR:  duplicate key value violates unique constraint "atacc_test1"
! DETAIL:  Key (test)=(2) already exists.
! -- should succeed
! insert into atacc1 (test) values (4);
! -- try adding a unique oid constraint
! alter table atacc1 add constraint atacc_oid1 unique(oid);
! -- try to create duplicates via alter table using - should fail
! alter table atacc1 alter column test type integer using 0;
! ERROR:  could not create unique index "atacc_test1"
! DETAIL:  Key (test)=(0) is duplicated.
! drop table atacc1;
! -- let's do one where the unique constraint fails when added
! create table atacc1 ( test int );
! -- insert soon to be failing rows
! insert into atacc1 (test) values (2);
! insert into atacc1 (test) values (2);
! -- add a unique constraint (fails)
! alter table atacc1 add constraint atacc_test1 unique (test);
! ERROR:  could not create unique index "atacc_test1"
! DETAIL:  Key (test)=(2) is duplicated.
! insert into atacc1 (test) values (3);
! drop table atacc1;
! -- let's do one where the unique constraint fails
! -- because the column doesn't exist
! create table atacc1 ( test int );
! -- add a unique constraint (fails)
! alter table atacc1 add constraint atacc_test1 unique (test1);
! ERROR:  column "test1" named in key does not exist
! drop table atacc1;
! -- something a little more complicated
! create table atacc1 ( test int, test2 int);
! -- add a unique constraint
! alter table atacc1 add constraint atacc_test1 unique (test, test2);
! -- insert initial value
! insert into atacc1 (test,test2) values (4,4);
! -- should fail
! insert into atacc1 (test,test2) values (4,4);
! ERROR:  duplicate key value violates unique constraint "atacc_test1"
! DETAIL:  Key (test, test2)=(4, 4) already exists.
! -- should all succeed
! insert into atacc1 (test,test2) values (4,5);
! insert into atacc1 (test,test2) values (5,4);
! insert into atacc1 (test,test2) values (5,5);
! drop table atacc1;
! -- lets do some naming tests
! create table atacc1 (test int, test2 int, unique(test));
! alter table atacc1 add unique (test2);
! -- should fail for @@ second one @@
! insert into atacc1 (test2, test) values (3, 3);
! insert into atacc1 (test2, test) values (2, 3);
! ERROR:  duplicate key value violates unique constraint "atacc1_test_key"
! DETAIL:  Key (test)=(3) already exists.
! drop table atacc1;
! -- test primary key constraint adding
! create table atacc1 ( test int ) with oids;
! -- add a primary key constraint
! alter table atacc1 add constraint atacc_test1 primary key (test);
! -- insert first value
! insert into atacc1 (test) values (2);
! -- should fail
! insert into atacc1 (test) values (2);
! ERROR:  duplicate key value violates unique constraint "atacc_test1"
! DETAIL:  Key (test)=(2) already exists.
! -- should succeed
! insert into atacc1 (test) values (4);
! -- inserting NULL should fail
! insert into atacc1 (test) values(NULL);
! ERROR:  null value in column "test" violates not-null constraint
! DETAIL:  Failing row contains (null).
! -- try adding a second primary key (should fail)
! alter table atacc1 add constraint atacc_oid1 primary key(oid);
! ERROR:  multiple primary keys for table "atacc1" are not allowed
! -- drop first primary key constraint
! alter table atacc1 drop constraint atacc_test1 restrict;
! -- try adding a primary key on oid (should succeed)
! alter table atacc1 add constraint atacc_oid1 primary key(oid);
! drop table atacc1;
! -- let's do one where the primary key constraint fails when added
! create table atacc1 ( test int );
! -- insert soon to be failing rows
! insert into atacc1 (test) values (2);
! insert into atacc1 (test) values (2);
! -- add a primary key (fails)
! alter table atacc1 add constraint atacc_test1 primary key (test);
! ERROR:  could not create unique index "atacc_test1"
! DETAIL:  Key (test)=(2) is duplicated.
! insert into atacc1 (test) values (3);
! drop table atacc1;
! -- let's do another one where the primary key constraint fails when added
! create table atacc1 ( test int );
! -- insert soon to be failing row
! insert into atacc1 (test) values (NULL);
! -- add a primary key (fails)
! alter table atacc1 add constraint atacc_test1 primary key (test);
! ERROR:  column "test" contains null values
! insert into atacc1 (test) values (3);
! drop table atacc1;
! -- let's do one where the primary key constraint fails
! -- because the column doesn't exist
! create table atacc1 ( test int );
! -- add a primary key constraint (fails)
! alter table atacc1 add constraint atacc_test1 primary key (test1);
! ERROR:  column "test1" named in key does not exist
! drop table atacc1;
! -- adding a new column as primary key to a non-empty table.
! -- should fail unless the column has a non-null default value.
! create table atacc1 ( test int );
! insert into atacc1 (test) values (0);
! -- add a primary key column without a default (fails).
! alter table atacc1 add column test2 int primary key;
! ERROR:  column "test2" contains null values
! -- now add a primary key column with a default (succeeds).
! alter table atacc1 add column test2 int default 0 primary key;
! drop table atacc1;
! -- something a little more complicated
! create table atacc1 ( test int, test2 int);
! -- add a primary key constraint
! alter table atacc1 add constraint atacc_test1 primary key (test, test2);
! -- try adding a second primary key - should fail
! alter table atacc1 add constraint atacc_test2 primary key (test);
! ERROR:  multiple primary keys for table "atacc1" are not allowed
! -- insert initial value
! insert into atacc1 (test,test2) values (4,4);
! -- should fail
! insert into atacc1 (test,test2) values (4,4);
! ERROR:  duplicate key value violates unique constraint "atacc_test1"
! DETAIL:  Key (test, test2)=(4, 4) already exists.
! insert into atacc1 (test,test2) values (NULL,3);
! ERROR:  null value in column "test" violates not-null constraint
! DETAIL:  Failing row contains (null, 3).
! insert into atacc1 (test,test2) values (3, NULL);
! ERROR:  null value in column "test2" violates not-null constraint
! DETAIL:  Failing row contains (3, null).
! insert into atacc1 (test,test2) values (NULL,NULL);
! ERROR:  null value in column "test" violates not-null constraint
! DETAIL:  Failing row contains (null, null).
! -- should all succeed
! insert into atacc1 (test,test2) values (4,5);
! insert into atacc1 (test,test2) values (5,4);
! insert into atacc1 (test,test2) values (5,5);
! drop table atacc1;
! -- lets do some naming tests
! create table atacc1 (test int, test2 int, primary key(test));
! -- only first should succeed
! insert into atacc1 (test2, test) values (3, 3);
! insert into atacc1 (test2, test) values (2, 3);
! ERROR:  duplicate key value violates unique constraint "atacc1_pkey"
! DETAIL:  Key (test)=(3) already exists.
! insert into atacc1 (test2, test) values (1, NULL);
! ERROR:  null value in column "test" violates not-null constraint
! DETAIL:  Failing row contains (null, 1).
! drop table atacc1;
! -- alter table / alter column [set/drop] not null tests
! -- try altering system catalogs, should fail
! alter table pg_class alter column relname drop not null;
! ERROR:  permission denied: "pg_class" is a system catalog
! alter table pg_class alter relname set not null;
! ERROR:  permission denied: "pg_class" is a system catalog
! -- try altering non-existent table, should fail
! alter table non_existent alter column bar set not null;
! ERROR:  relation "non_existent" does not exist
! alter table non_existent alter column bar drop not null;
! ERROR:  relation "non_existent" does not exist
! -- test setting columns to null and not null and vice versa
! -- test checking for null values and primary key
! create table atacc1 (test int not null) with oids;
! alter table atacc1 add constraint "atacc1_pkey" primary key (test);
! alter table atacc1 alter column test drop not null;
! ERROR:  column "test" is in a primary key
! alter table atacc1 drop constraint "atacc1_pkey";
! alter table atacc1 alter column test drop not null;
! insert into atacc1 values (null);
! alter table atacc1 alter test set not null;
! ERROR:  column "test" contains null values
! delete from atacc1;
! alter table atacc1 alter test set not null;
! -- try altering a non-existent column, should fail
! alter table atacc1 alter bar set not null;
! ERROR:  column "bar" of relation "atacc1" does not exist
! alter table atacc1 alter bar drop not null;
! ERROR:  column "bar" of relation "atacc1" does not exist
! -- try altering the oid column, should fail
! alter table atacc1 alter oid set not null;
! ERROR:  cannot alter system column "oid"
! alter table atacc1 alter oid drop not null;
! ERROR:  cannot alter system column "oid"
! -- try creating a view and altering that, should fail
! create view myview as select * from atacc1;
! alter table myview alter column test drop not null;
! ERROR:  "myview" is not a table or foreign table
! alter table myview alter column test set not null;
! ERROR:  "myview" is not a table or foreign table
! drop view myview;
! drop table atacc1;
! -- test inheritance
! create table parent (a int);
! create table child (b varchar(255)) inherits (parent);
! alter table parent alter a set not null;
! insert into parent values (NULL);
! ERROR:  null value in column "a" violates not-null constraint
! DETAIL:  Failing row contains (null).
! insert into child (a, b) values (NULL, 'foo');
! ERROR:  null value in column "a" violates not-null constraint
! DETAIL:  Failing row contains (null, foo).
! alter table parent alter a drop not null;
! insert into parent values (NULL);
! insert into child (a, b) values (NULL, 'foo');
! alter table only parent alter a set not null;
! ERROR:  column "a" contains null values
! alter table child alter a set not null;
! ERROR:  column "a" contains null values
! delete from parent;
! alter table only parent alter a set not null;
! insert into parent values (NULL);
! ERROR:  null value in column "a" violates not-null constraint
! DETAIL:  Failing row contains (null).
! alter table child alter a set not null;
! insert into child (a, b) values (NULL, 'foo');
! ERROR:  null value in column "a" violates not-null constraint
! DETAIL:  Failing row contains (null, foo).
! delete from child;
! alter table child alter a set not null;
! insert into child (a, b) values (NULL, 'foo');
! ERROR:  null value in column "a" violates not-null constraint
! DETAIL:  Failing row contains (null, foo).
! drop table child;
! drop table parent;
! -- test setting and removing default values
! create table def_test (
! 	c1	int4 default 5,
! 	c2	text default 'initial_default'
! );
! insert into def_test default values;
! alter table def_test alter column c1 drop default;
! insert into def_test default values;
! alter table def_test alter column c2 drop default;
! insert into def_test default values;
! alter table def_test alter column c1 set default 10;
! alter table def_test alter column c2 set default 'new_default';
! insert into def_test default values;
! select * from def_test;
!  c1 |       c2        
! ----+-----------------
!   5 | initial_default
!     | initial_default
!     | 
!  10 | new_default
! (4 rows)
! 
! -- set defaults to an incorrect type: this should fail
! alter table def_test alter column c1 set default 'wrong_datatype';
! ERROR:  invalid input syntax for integer: "wrong_datatype"
! alter table def_test alter column c2 set default 20;
! -- set defaults on a non-existent column: this should fail
! alter table def_test alter column c3 set default 30;
! ERROR:  column "c3" of relation "def_test" does not exist
! -- set defaults on views: we need to create a view, add a rule
! -- to allow insertions into it, and then alter the view to add
! -- a default
! create view def_view_test as select * from def_test;
! create rule def_view_test_ins as
! 	on insert to def_view_test
! 	do instead insert into def_test select new.*;
! insert into def_view_test default values;
! alter table def_view_test alter column c1 set default 45;
! insert into def_view_test default values;
! alter table def_view_test alter column c2 set default 'view_default';
! insert into def_view_test default values;
! select * from def_view_test;
!  c1 |       c2        
! ----+-----------------
!   5 | initial_default
!     | initial_default
!     | 
!  10 | new_default
!     | 
!  45 | 
!  45 | view_default
! (7 rows)
! 
! drop rule def_view_test_ins on def_view_test;
! drop view def_view_test;
! drop table def_test;
! -- alter table / drop column tests
! -- try altering system catalogs, should fail
! alter table pg_class drop column relname;
! ERROR:  permission denied: "pg_class" is a system catalog
! -- try altering non-existent table, should fail
! alter table nosuchtable drop column bar;
! ERROR:  relation "nosuchtable" does not exist
! -- test dropping columns
! create table atacc1 (a int4 not null, b int4, c int4 not null, d int4) with oids;
! insert into atacc1 values (1, 2, 3, 4);
! alter table atacc1 drop a;
! alter table atacc1 drop a;
! ERROR:  column "a" of relation "atacc1" does not exist
! -- SELECTs
! select * from atacc1;
!  b | c | d 
! ---+---+---
!  2 | 3 | 4
! (1 row)
! 
! select * from atacc1 order by a;
! ERROR:  column "a" does not exist
! LINE 1: select * from atacc1 order by a;
!                                       ^
! select * from atacc1 order by "........pg.dropped.1........";
! ERROR:  column "........pg.dropped.1........" does not exist
! LINE 1: select * from atacc1 order by "........pg.dropped.1........"...
!                                       ^
! select * from atacc1 group by a;
! ERROR:  column "a" does not exist
! LINE 1: select * from atacc1 group by a;
!                                       ^
! select * from atacc1 group by "........pg.dropped.1........";
! ERROR:  column "........pg.dropped.1........" does not exist
! LINE 1: select * from atacc1 group by "........pg.dropped.1........"...
!                                       ^
! select atacc1.* from atacc1;
!  b | c | d 
! ---+---+---
!  2 | 3 | 4
! (1 row)
! 
! select a from atacc1;
! ERROR:  column "a" does not exist
! LINE 1: select a from atacc1;
!                ^
! select atacc1.a from atacc1;
! ERROR:  column atacc1.a does not exist
! LINE 1: select atacc1.a from atacc1;
!                ^
! select b,c,d from atacc1;
!  b | c | d 
! ---+---+---
!  2 | 3 | 4
! (1 row)
! 
! select a,b,c,d from atacc1;
! ERROR:  column "a" does not exist
! LINE 1: select a,b,c,d from atacc1;
!                ^
! select * from atacc1 where a = 1;
! ERROR:  column "a" does not exist
! LINE 1: select * from atacc1 where a = 1;
!                                    ^
! select "........pg.dropped.1........" from atacc1;
! ERROR:  column "........pg.dropped.1........" does not exist
! LINE 1: select "........pg.dropped.1........" from atacc1;
!                ^
! select atacc1."........pg.dropped.1........" from atacc1;
! ERROR:  column atacc1.........pg.dropped.1........ does not exist
! LINE 1: select atacc1."........pg.dropped.1........" from atacc1;
!                ^
! select "........pg.dropped.1........",b,c,d from atacc1;
! ERROR:  column "........pg.dropped.1........" does not exist
! LINE 1: select "........pg.dropped.1........",b,c,d from atacc1;
!                ^
! select * from atacc1 where "........pg.dropped.1........" = 1;
! ERROR:  column "........pg.dropped.1........" does not exist
! LINE 1: select * from atacc1 where "........pg.dropped.1........" = ...
!                                    ^
! -- UPDATEs
! update atacc1 set a = 3;
! ERROR:  column "a" of relation "atacc1" does not exist
! LINE 1: update atacc1 set a = 3;
!                           ^
! update atacc1 set b = 2 where a = 3;
! ERROR:  column "a" does not exist
! LINE 1: update atacc1 set b = 2 where a = 3;
!                                       ^
! update atacc1 set "........pg.dropped.1........" = 3;
! ERROR:  column "........pg.dropped.1........" of relation "atacc1" does not exist
! LINE 1: update atacc1 set "........pg.dropped.1........" = 3;
!                           ^
! update atacc1 set b = 2 where "........pg.dropped.1........" = 3;
! ERROR:  column "........pg.dropped.1........" does not exist
! LINE 1: update atacc1 set b = 2 where "........pg.dropped.1........"...
!                                       ^
! -- INSERTs
! insert into atacc1 values (10, 11, 12, 13);
! ERROR:  INSERT has more expressions than target columns
! LINE 1: insert into atacc1 values (10, 11, 12, 13);
!                                                ^
! insert into atacc1 values (default, 11, 12, 13);
! ERROR:  INSERT has more expressions than target columns
! LINE 1: insert into atacc1 values (default, 11, 12, 13);
!                                                     ^
! insert into atacc1 values (11, 12, 13);
! insert into atacc1 (a) values (10);
! ERROR:  column "a" of relation "atacc1" does not exist
! LINE 1: insert into atacc1 (a) values (10);
!                             ^
! insert into atacc1 (a) values (default);
! ERROR:  column "a" of relation "atacc1" does not exist
! LINE 1: insert into atacc1 (a) values (default);
!                             ^
! insert into atacc1 (a,b,c,d) values (10,11,12,13);
! ERROR:  column "a" of relation "atacc1" does not exist
! LINE 1: insert into atacc1 (a,b,c,d) values (10,11,12,13);
!                             ^
! insert into atacc1 (a,b,c,d) values (default,11,12,13);
! ERROR:  column "a" of relation "atacc1" does not exist
! LINE 1: insert into atacc1 (a,b,c,d) values (default,11,12,13);
!                             ^
! insert into atacc1 (b,c,d) values (11,12,13);
! insert into atacc1 ("........pg.dropped.1........") values (10);
! ERROR:  column "........pg.dropped.1........" of relation "atacc1" does not exist
! LINE 1: insert into atacc1 ("........pg.dropped.1........") values (...
!                             ^
! insert into atacc1 ("........pg.dropped.1........") values (default);
! ERROR:  column "........pg.dropped.1........" of relation "atacc1" does not exist
! LINE 1: insert into atacc1 ("........pg.dropped.1........") values (...
!                             ^
! insert into atacc1 ("........pg.dropped.1........",b,c,d) values (10,11,12,13);
! ERROR:  column "........pg.dropped.1........" of relation "atacc1" does not exist
! LINE 1: insert into atacc1 ("........pg.dropped.1........",b,c,d) va...
!                             ^
! insert into atacc1 ("........pg.dropped.1........",b,c,d) values (default,11,12,13);
! ERROR:  column "........pg.dropped.1........" of relation "atacc1" does not exist
! LINE 1: insert into atacc1 ("........pg.dropped.1........",b,c,d) va...
!                             ^
! -- DELETEs
! delete from atacc1 where a = 3;
! ERROR:  column "a" does not exist
! LINE 1: delete from atacc1 where a = 3;
!                                  ^
! delete from atacc1 where "........pg.dropped.1........" = 3;
! ERROR:  column "........pg.dropped.1........" does not exist
! LINE 1: delete from atacc1 where "........pg.dropped.1........" = 3;
!                                  ^
! delete from atacc1;
! -- try dropping a non-existent column, should fail
! alter table atacc1 drop bar;
! ERROR:  column "bar" of relation "atacc1" does not exist
! -- try dropping the oid column, should succeed
! alter table atacc1 drop oid;
! -- try dropping the xmin column, should fail
! alter table atacc1 drop xmin;
! ERROR:  cannot drop system column "xmin"
! -- try creating a view and altering that, should fail
! create view myview as select * from atacc1;
! select * from myview;
!  b | c | d 
! ---+---+---
! (0 rows)
! 
! alter table myview drop d;
! ERROR:  "myview" is not a table, composite type, or foreign table
! drop view myview;
! -- test some commands to make sure they fail on the dropped column
! analyze atacc1(a);
! ERROR:  column "a" of relation "atacc1" does not exist
! analyze atacc1("........pg.dropped.1........");
! ERROR:  column "........pg.dropped.1........" of relation "atacc1" does not exist
! vacuum analyze atacc1(a);
! ERROR:  column "a" of relation "atacc1" does not exist
! vacuum analyze atacc1("........pg.dropped.1........");
! ERROR:  column "........pg.dropped.1........" of relation "atacc1" does not exist
! comment on column atacc1.a is 'testing';
! ERROR:  column "a" of relation "atacc1" does not exist
! comment on column atacc1."........pg.dropped.1........" is 'testing';
! ERROR:  column "........pg.dropped.1........" of relation "atacc1" does not exist
! alter table atacc1 alter a set storage plain;
! ERROR:  column "a" of relation "atacc1" does not exist
! alter table atacc1 alter "........pg.dropped.1........" set storage plain;
! ERROR:  column "........pg.dropped.1........" of relation "atacc1" does not exist
! alter table atacc1 alter a set statistics 0;
! ERROR:  column "a" of relation "atacc1" does not exist
! alter table atacc1 alter "........pg.dropped.1........" set statistics 0;
! ERROR:  column "........pg.dropped.1........" of relation "atacc1" does not exist
! alter table atacc1 alter a set default 3;
! ERROR:  column "a" of relation "atacc1" does not exist
! alter table atacc1 alter "........pg.dropped.1........" set default 3;
! ERROR:  column "........pg.dropped.1........" of relation "atacc1" does not exist
! alter table atacc1 alter a drop default;
! ERROR:  column "a" of relation "atacc1" does not exist
! alter table atacc1 alter "........pg.dropped.1........" drop default;
! ERROR:  column "........pg.dropped.1........" of relation "atacc1" does not exist
! alter table atacc1 alter a set not null;
! ERROR:  column "a" of relation "atacc1" does not exist
! alter table atacc1 alter "........pg.dropped.1........" set not null;
! ERROR:  column "........pg.dropped.1........" of relation "atacc1" does not exist
! alter table atacc1 alter a drop not null;
! ERROR:  column "a" of relation "atacc1" does not exist
! alter table atacc1 alter "........pg.dropped.1........" drop not null;
! ERROR:  column "........pg.dropped.1........" of relation "atacc1" does not exist
! alter table atacc1 rename a to x;
! ERROR:  column "a" does not exist
! alter table atacc1 rename "........pg.dropped.1........" to x;
! ERROR:  column "........pg.dropped.1........" does not exist
! alter table atacc1 add primary key(a);
! ERROR:  column "a" named in key does not exist
! alter table atacc1 add primary key("........pg.dropped.1........");
! ERROR:  column "........pg.dropped.1........" named in key does not exist
! alter table atacc1 add unique(a);
! ERROR:  column "a" named in key does not exist
! alter table atacc1 add unique("........pg.dropped.1........");
! ERROR:  column "........pg.dropped.1........" named in key does not exist
! alter table atacc1 add check (a > 3);
! ERROR:  column "a" does not exist
! alter table atacc1 add check ("........pg.dropped.1........" > 3);
! ERROR:  column "........pg.dropped.1........" does not exist
! create table atacc2 (id int4 unique);
! alter table atacc1 add foreign key (a) references atacc2(id);
! ERROR:  column "a" referenced in foreign key constraint does not exist
! alter table atacc1 add foreign key ("........pg.dropped.1........") references atacc2(id);
! ERROR:  column "........pg.dropped.1........" referenced in foreign key constraint does not exist
! alter table atacc2 add foreign key (id) references atacc1(a);
! ERROR:  column "a" referenced in foreign key constraint does not exist
! alter table atacc2 add foreign key (id) references atacc1("........pg.dropped.1........");
! ERROR:  column "........pg.dropped.1........" referenced in foreign key constraint does not exist
! drop table atacc2;
! create index "testing_idx" on atacc1(a);
! ERROR:  column "a" does not exist
! create index "testing_idx" on atacc1("........pg.dropped.1........");
! ERROR:  column "........pg.dropped.1........" does not exist
! -- test create as and select into
! insert into atacc1 values (21, 22, 23);
! create table test1 as select * from atacc1;
! select * from test1;
!  b  | c  | d  
! ----+----+----
!  21 | 22 | 23
! (1 row)
! 
! drop table test1;
! select * into test2 from atacc1;
! select * from test2;
!  b  | c  | d  
! ----+----+----
!  21 | 22 | 23
! (1 row)
! 
! drop table test2;
! -- try dropping all columns
! alter table atacc1 drop c;
! alter table atacc1 drop d;
! alter table atacc1 drop b;
! select * from atacc1;
! --
! (1 row)
! 
! drop table atacc1;
! -- test constraint error reporting in presence of dropped columns
! create table atacc1 (id serial primary key, value int check (value < 10));
! insert into atacc1(value) values (100);
! ERROR:  new row for relation "atacc1" violates check constraint "atacc1_value_check"
! DETAIL:  Failing row contains (1, 100).
! alter table atacc1 drop column value;
! alter table atacc1 add column value int check (value < 10);
! insert into atacc1(value) values (100);
! ERROR:  new row for relation "atacc1" violates check constraint "atacc1_value_check"
! DETAIL:  Failing row contains (2, 100).
! insert into atacc1(id, value) values (null, 0);
! ERROR:  null value in column "id" violates not-null constraint
! DETAIL:  Failing row contains (null, 0).
! drop table atacc1;
! -- test inheritance
! create table parent (a int, b int, c int);
! insert into parent values (1, 2, 3);
! alter table parent drop a;
! create table child (d varchar(255)) inherits (parent);
! insert into child values (12, 13, 'testing');
! select * from parent;
!  b  | c  
! ----+----
!   2 |  3
!  12 | 13
! (2 rows)
! 
! select * from child;
!  b  | c  |    d    
! ----+----+---------
!  12 | 13 | testing
! (1 row)
! 
! alter table parent drop c;
! select * from parent;
!  b  
! ----
!   2
!  12
! (2 rows)
! 
! select * from child;
!  b  |    d    
! ----+---------
!  12 | testing
! (1 row)
! 
! drop table child;
! drop table parent;
! -- check error cases for inheritance column merging
! create table parent (a float8, b numeric(10,4), c text collate "C");
! create table child (a float4) inherits (parent); -- fail
! NOTICE:  merging column "a" with inherited definition
! ERROR:  column "a" has a type conflict
! DETAIL:  double precision versus real
! create table child (b decimal(10,7)) inherits (parent); -- fail
! NOTICE:  moving and merging column "b" with inherited definition
! DETAIL:  User-specified column moved to the position of the inherited column.
! ERROR:  column "b" has a type conflict
! DETAIL:  numeric(10,4) versus numeric(10,7)
! create table child (c text collate "POSIX") inherits (parent); -- fail
! NOTICE:  moving and merging column "c" with inherited definition
! DETAIL:  User-specified column moved to the position of the inherited column.
! ERROR:  column "c" has a collation conflict
! DETAIL:  "C" versus "POSIX"
! create table child (a double precision, b decimal(10,4)) inherits (parent);
! NOTICE:  merging column "a" with inherited definition
! NOTICE:  merging column "b" with inherited definition
! drop table child;
! drop table parent;
! -- test copy in/out
! create table test (a int4, b int4, c int4);
! insert into test values (1,2,3);
! alter table test drop a;
! copy test to stdout;
! 2	3
! copy test(a) to stdout;
! ERROR:  column "a" of relation "test" does not exist
! copy test("........pg.dropped.1........") to stdout;
! ERROR:  column "........pg.dropped.1........" of relation "test" does not exist
! copy test from stdin;
! ERROR:  extra data after last expected column
! CONTEXT:  COPY test, line 1: "10	11	12"
! select * from test;
!  b | c 
! ---+---
!  2 | 3
! (1 row)
! 
! copy test from stdin;
! select * from test;
!  b  | c  
! ----+----
!   2 |  3
!  21 | 22
! (2 rows)
! 
! copy test(a) from stdin;
! ERROR:  column "a" of relation "test" does not exist
! copy test("........pg.dropped.1........") from stdin;
! ERROR:  column "........pg.dropped.1........" of relation "test" does not exist
! copy test(b,c) from stdin;
! select * from test;
!  b  | c  
! ----+----
!   2 |  3
!  21 | 22
!  31 | 32
! (3 rows)
! 
! drop table test;
! -- test inheritance
! create table dropColumn (a int, b int, e int);
! create table dropColumnChild (c int) inherits (dropColumn);
! create table dropColumnAnother (d int) inherits (dropColumnChild);
! -- these two should fail
! alter table dropColumnchild drop column a;
! ERROR:  cannot drop inherited column "a"
! alter table only dropColumnChild drop column b;
! ERROR:  cannot drop inherited column "b"
! -- these three should work
! alter table only dropColumn drop column e;
! alter table dropColumnChild drop column c;
! alter table dropColumn drop column a;
! create table renameColumn (a int);
! create table renameColumnChild (b int) inherits (renameColumn);
! create table renameColumnAnother (c int) inherits (renameColumnChild);
! -- these three should fail
! alter table renameColumnChild rename column a to d;
! ERROR:  cannot rename inherited column "a"
! alter table only renameColumnChild rename column a to d;
! ERROR:  inherited column "a" must be renamed in child tables too
! alter table only renameColumn rename column a to d;
! ERROR:  inherited column "a" must be renamed in child tables too
! -- these should work
! alter table renameColumn rename column a to d;
! alter table renameColumnChild rename column b to a;
! -- these should work
! alter table if exists doesnt_exist_tab rename column a to d;
! NOTICE:  relation "doesnt_exist_tab" does not exist, skipping
! alter table if exists doesnt_exist_tab rename column b to a;
! NOTICE:  relation "doesnt_exist_tab" does not exist, skipping
! -- this should work
! alter table renameColumn add column w int;
! -- this should fail
! alter table only renameColumn add column x int;
! ERROR:  column must be added to child tables too
! -- Test corner cases in dropping of inherited columns
! create table p1 (f1 int, f2 int);
! create table c1 (f1 int not null) inherits(p1);
! NOTICE:  merging column "f1" with inherited definition
! -- should be rejected since c1.f1 is inherited
! alter table c1 drop column f1;
! ERROR:  cannot drop inherited column "f1"
! -- should work
! alter table p1 drop column f1;
! -- c1.f1 is still there, but no longer inherited
! select f1 from c1;
!  f1 
! ----
! (0 rows)
! 
! alter table c1 drop column f1;
! select f1 from c1;
! ERROR:  column "f1" does not exist
! LINE 1: select f1 from c1;
!                ^
! HINT:  Perhaps you meant to reference the column "c1.f2".
! drop table p1 cascade;
! NOTICE:  drop cascades to table c1
! create table p1 (f1 int, f2 int);
! create table c1 () inherits(p1);
! -- should be rejected since c1.f1 is inherited
! alter table c1 drop column f1;
! ERROR:  cannot drop inherited column "f1"
! alter table p1 drop column f1;
! -- c1.f1 is dropped now, since there is no local definition for it
! select f1 from c1;
! ERROR:  column "f1" does not exist
! LINE 1: select f1 from c1;
!                ^
! HINT:  Perhaps you meant to reference the column "c1.f2".
! drop table p1 cascade;
! NOTICE:  drop cascades to table c1
! create table p1 (f1 int, f2 int);
! create table c1 () inherits(p1);
! -- should be rejected since c1.f1 is inherited
! alter table c1 drop column f1;
! ERROR:  cannot drop inherited column "f1"
! alter table only p1 drop column f1;
! -- c1.f1 is NOT dropped, but must now be considered non-inherited
! alter table c1 drop column f1;
! drop table p1 cascade;
! NOTICE:  drop cascades to table c1
! create table p1 (f1 int, f2 int);
! create table c1 (f1 int not null) inherits(p1);
! NOTICE:  merging column "f1" with inherited definition
! -- should be rejected since c1.f1 is inherited
! alter table c1 drop column f1;
! ERROR:  cannot drop inherited column "f1"
! alter table only p1 drop column f1;
! -- c1.f1 is still there, but no longer inherited
! alter table c1 drop column f1;
! drop table p1 cascade;
! NOTICE:  drop cascades to table c1
! create table p1(id int, name text);
! create table p2(id2 int, name text, height int);
! create table c1(age int) inherits(p1,p2);
! NOTICE:  merging multiple inherited definitions of column "name"
! create table gc1() inherits (c1);
! select relname, attname, attinhcount, attislocal
! from pg_class join pg_attribute on (pg_class.oid = pg_attribute.attrelid)
! where relname in ('p1','p2','c1','gc1') and attnum > 0 and not attisdropped
! order by relname, attnum;
!  relname | attname | attinhcount | attislocal 
! ---------+---------+-------------+------------
!  c1      | id      |           1 | f
!  c1      | name    |           2 | f
!  c1      | id2     |           1 | f
!  c1      | height  |           1 | f
!  c1      | age     |           0 | t
!  gc1     | id      |           1 | f
!  gc1     | name    |           1 | f
!  gc1     | id2     |           1 | f
!  gc1     | height  |           1 | f
!  gc1     | age     |           1 | f
!  p1      | id      |           0 | t
!  p1      | name    |           0 | t
!  p2      | id2     |           0 | t
!  p2      | name    |           0 | t
!  p2      | height  |           0 | t
! (15 rows)
! 
! -- should work
! alter table only p1 drop column name;
! -- should work. Now c1.name is local and inhcount is 0.
! alter table p2 drop column name;
! -- should be rejected since its inherited
! alter table gc1 drop column name;
! ERROR:  cannot drop inherited column "name"
! -- should work, and drop gc1.name along
! alter table c1 drop column name;
! -- should fail: column does not exist
! alter table gc1 drop column name;
! ERROR:  column "name" of relation "gc1" does not exist
! -- should work and drop the attribute in all tables
! alter table p2 drop column height;
! -- IF EXISTS test
! create table dropColumnExists ();
! alter table dropColumnExists drop column non_existing; --fail
! ERROR:  column "non_existing" of relation "dropcolumnexists" does not exist
! alter table dropColumnExists drop column if exists non_existing; --succeed
! NOTICE:  column "non_existing" of relation "dropcolumnexists" does not exist, skipping
! select relname, attname, attinhcount, attislocal
! from pg_class join pg_attribute on (pg_class.oid = pg_attribute.attrelid)
! where relname in ('p1','p2','c1','gc1') and attnum > 0 and not attisdropped
! order by relname, attnum;
!  relname | attname | attinhcount | attislocal 
! ---------+---------+-------------+------------
!  c1      | id      |           1 | f
!  c1      | id2     |           1 | f
!  c1      | age     |           0 | t
!  gc1     | id      |           1 | f
!  gc1     | id2     |           1 | f
!  gc1     | age     |           1 | f
!  p1      | id      |           0 | t
!  p2      | id2     |           0 | t
! (8 rows)
! 
! drop table p1, p2 cascade;
! NOTICE:  drop cascades to 2 other objects
! DETAIL:  drop cascades to table c1
! drop cascades to table gc1
! -- test attinhcount tracking with merged columns
! create table depth0();
! create table depth1(c text) inherits (depth0);
! create table depth2() inherits (depth1);
! alter table depth0 add c text;
! NOTICE:  merging definition of column "c" for child "depth1"
! select attrelid::regclass, attname, attinhcount, attislocal
! from pg_attribute
! where attnum > 0 and attrelid::regclass in ('depth0', 'depth1', 'depth2')
! order by attrelid::regclass::text, attnum;
!  attrelid | attname | attinhcount | attislocal 
! ----------+---------+-------------+------------
!  depth0   | c       |           0 | t
!  depth1   | c       |           1 | t
!  depth2   | c       |           1 | f
! (3 rows)
! 
! --
! -- Test the ALTER TABLE SET WITH/WITHOUT OIDS command
! --
! create table altstartwith (col integer) with oids;
! insert into altstartwith values (1);
! select oid > 0, * from altstartwith;
!  ?column? | col 
! ----------+-----
!  t        |   1
! (1 row)
! 
! alter table altstartwith set without oids;
! select oid > 0, * from altstartwith; -- fails
! ERROR:  column "oid" does not exist
! LINE 1: select oid > 0, * from altstartwith;
!                ^
! select * from altstartwith;
!  col 
! -----
!    1
! (1 row)
! 
! alter table altstartwith set with oids;
! select oid > 0, * from altstartwith;
!  ?column? | col 
! ----------+-----
!  t        |   1
! (1 row)
! 
! drop table altstartwith;
! -- Check inheritance cases
! create table altwithoid (col integer) with oids;
! -- Inherits parents oid column anyway
! create table altinhoid () inherits (altwithoid) without oids;
! insert into altinhoid values (1);
! select oid > 0, * from altwithoid;
!  ?column? | col 
! ----------+-----
!  t        |   1
! (1 row)
! 
! select oid > 0, * from altinhoid;
!  ?column? | col 
! ----------+-----
!  t        |   1
! (1 row)
! 
! alter table altwithoid set without oids;
! select oid > 0, * from altwithoid; -- fails
! ERROR:  column "oid" does not exist
! LINE 1: select oid > 0, * from altwithoid;
!                ^
! select oid > 0, * from altinhoid; -- fails
! ERROR:  column "oid" does not exist
! LINE 1: select oid > 0, * from altinhoid;
!                ^
! select * from altwithoid;
!  col 
! -----
!    1
! (1 row)
! 
! select * from altinhoid;
!  col 
! -----
!    1
! (1 row)
! 
! alter table altwithoid set with oids;
! select oid > 0, * from altwithoid;
!  ?column? | col 
! ----------+-----
!  t        |   1
! (1 row)
! 
! select oid > 0, * from altinhoid;
!  ?column? | col 
! ----------+-----
!  t        |   1
! (1 row)
! 
! drop table altwithoid cascade;
! NOTICE:  drop cascades to table altinhoid
! create table altwithoid (col integer) without oids;
! -- child can have local oid column
! create table altinhoid () inherits (altwithoid) with oids;
! insert into altinhoid values (1);
! select oid > 0, * from altwithoid; -- fails
! ERROR:  column "oid" does not exist
! LINE 1: select oid > 0, * from altwithoid;
!                ^
! select oid > 0, * from altinhoid;
!  ?column? | col 
! ----------+-----
!  t        |   1
! (1 row)
! 
! alter table altwithoid set with oids;
! NOTICE:  merging definition of column "oid" for child "altinhoid"
! select oid > 0, * from altwithoid;
!  ?column? | col 
! ----------+-----
!  t        |   1
! (1 row)
! 
! select oid > 0, * from altinhoid;
!  ?column? | col 
! ----------+-----
!  t        |   1
! (1 row)
! 
! -- the child's local definition should remain
! alter table altwithoid set without oids;
! select oid > 0, * from altwithoid; -- fails
! ERROR:  column "oid" does not exist
! LINE 1: select oid > 0, * from altwithoid;
!                ^
! select oid > 0, * from altinhoid;
!  ?column? | col 
! ----------+-----
!  t        |   1
! (1 row)
! 
! drop table altwithoid cascade;
! NOTICE:  drop cascades to table altinhoid
! -- test renumbering of child-table columns in inherited operations
! create table p1 (f1 int);
! create table c1 (f2 text, f3 int) inherits (p1);
! alter table p1 add column a1 int check (a1 > 0);
! alter table p1 add column f2 text;
! NOTICE:  merging definition of column "f2" for child "c1"
! insert into p1 values (1,2,'abc');
! insert into c1 values(11,'xyz',33,0); -- should fail
! ERROR:  new row for relation "c1" violates check constraint "p1_a1_check"
! DETAIL:  Failing row contains (11, xyz, 33, 0).
! insert into c1 values(11,'xyz',33,22);
! select * from p1;
!  f1 | a1 | f2  
! ----+----+-----
!   1 |  2 | abc
!  11 | 22 | xyz
! (2 rows)
! 
! update p1 set a1 = a1 + 1, f2 = upper(f2);
! select * from p1;
!  f1 | a1 | f2  
! ----+----+-----
!   1 |  3 | ABC
!  11 | 23 | XYZ
! (2 rows)
! 
! drop table p1 cascade;
! NOTICE:  drop cascades to table c1
! -- test that operations with a dropped column do not try to reference
! -- its datatype
! create domain mytype as text;
! create temp table foo (f1 text, f2 mytype, f3 text);
! insert into foo values('bb','cc','dd');
! select * from foo;
!  f1 | f2 | f3 
! ----+----+----
!  bb | cc | dd
! (1 row)
! 
! drop domain mytype cascade;
! NOTICE:  drop cascades to table foo column f2
! select * from foo;
!  f1 | f3 
! ----+----
!  bb | dd
! (1 row)
! 
! insert into foo values('qq','rr');
! select * from foo;
!  f1 | f3 
! ----+----
!  bb | dd
!  qq | rr
! (2 rows)
! 
! update foo set f3 = 'zz';
! select * from foo;
!  f1 | f3 
! ----+----
!  bb | zz
!  qq | zz
! (2 rows)
! 
! select f3,max(f1) from foo group by f3;
!  f3 | max 
! ----+-----
!  zz | qq
! (1 row)
! 
! -- Simple tests for alter table column type
! alter table foo alter f1 TYPE integer; -- fails
! ERROR:  column "f1" cannot be cast automatically to type integer
! HINT:  You might need to specify "USING f1::integer".
! alter table foo alter f1 TYPE varchar(10);
! create table anothertab (atcol1 serial8, atcol2 boolean,
! 	constraint anothertab_chk check (atcol1 <= 3));
! insert into anothertab (atcol1, atcol2) values (default, true);
! insert into anothertab (atcol1, atcol2) values (default, false);
! select * from anothertab;
!  atcol1 | atcol2 
! --------+--------
!       1 | t
!       2 | f
! (2 rows)
! 
! alter table anothertab alter column atcol1 type boolean; -- fails
! ERROR:  column "atcol1" cannot be cast automatically to type boolean
! HINT:  You might need to specify "USING atcol1::boolean".
! alter table anothertab alter column atcol1 type boolean using atcol1::int; -- fails
! ERROR:  result of USING clause for column "atcol1" cannot be cast automatically to type boolean
! HINT:  You might need to add an explicit cast.
! alter table anothertab alter column atcol1 type integer;
! select * from anothertab;
!  atcol1 | atcol2 
! --------+--------
!       1 | t
!       2 | f
! (2 rows)
! 
! insert into anothertab (atcol1, atcol2) values (45, null); -- fails
! ERROR:  new row for relation "anothertab" violates check constraint "anothertab_chk"
! DETAIL:  Failing row contains (45, null).
! insert into anothertab (atcol1, atcol2) values (default, null);
! select * from anothertab;
!  atcol1 | atcol2 
! --------+--------
!       1 | t
!       2 | f
!       3 | 
! (3 rows)
! 
! alter table anothertab alter column atcol2 type text
!       using case when atcol2 is true then 'IT WAS TRUE'
!                  when atcol2 is false then 'IT WAS FALSE'
!                  else 'IT WAS NULL!' end;
! select * from anothertab;
!  atcol1 |    atcol2    
! --------+--------------
!       1 | IT WAS TRUE
!       2 | IT WAS FALSE
!       3 | IT WAS NULL!
! (3 rows)
! 
! alter table anothertab alter column atcol1 type boolean
!         using case when atcol1 % 2 = 0 then true else false end; -- fails
! ERROR:  default for column "atcol1" cannot be cast automatically to type boolean
! alter table anothertab alter column atcol1 drop default;
! alter table anothertab alter column atcol1 type boolean
!         using case when atcol1 % 2 = 0 then true else false end; -- fails
! ERROR:  operator does not exist: boolean <= integer
! HINT:  No operator matches the given name and argument type(s). You might need to add explicit type casts.
! alter table anothertab drop constraint anothertab_chk;
! alter table anothertab drop constraint anothertab_chk; -- fails
! ERROR:  constraint "anothertab_chk" of relation "anothertab" does not exist
! alter table anothertab drop constraint IF EXISTS anothertab_chk; -- succeeds
! NOTICE:  constraint "anothertab_chk" of relation "anothertab" does not exist, skipping
! alter table anothertab alter column atcol1 type boolean
!         using case when atcol1 % 2 = 0 then true else false end;
! select * from anothertab;
!  atcol1 |    atcol2    
! --------+--------------
!  f      | IT WAS TRUE
!  t      | IT WAS FALSE
!  f      | IT WAS NULL!
! (3 rows)
! 
! drop table anothertab;
! create table another (f1 int, f2 text);
! insert into another values(1, 'one');
! insert into another values(2, 'two');
! insert into another values(3, 'three');
! select * from another;
!  f1 |  f2   
! ----+-------
!   1 | one
!   2 | two
!   3 | three
! (3 rows)
! 
! alter table another
!   alter f1 type text using f2 || ' more',
!   alter f2 type bigint using f1 * 10;
! select * from another;
!      f1     | f2 
! ------------+----
!  one more   | 10
!  two more   | 20
!  three more | 30
! (3 rows)
! 
! drop table another;
! -- table's row type
! create table tab1 (a int, b text);
! create table tab2 (x int, y tab1);
! alter table tab1 alter column b type varchar; -- fails
! ERROR:  cannot alter table "tab1" because column "tab2.y" uses its row type
! -- disallow recursive containment of row types
! create temp table recur1 (f1 int);
! alter table recur1 add column f2 recur1; -- fails
! ERROR:  composite type recur1 cannot be made a member of itself
! alter table recur1 add column f2 recur1[]; -- fails
! ERROR:  composite type recur1 cannot be made a member of itself
! create domain array_of_recur1 as recur1[];
! alter table recur1 add column f2 array_of_recur1; -- fails
! ERROR:  composite type recur1 cannot be made a member of itself
! create temp table recur2 (f1 int, f2 recur1);
! alter table recur1 add column f2 recur2; -- fails
! ERROR:  composite type recur1 cannot be made a member of itself
! alter table recur1 add column f2 int;
! alter table recur1 alter column f2 type recur2; -- fails
! ERROR:  composite type recur1 cannot be made a member of itself
! -- SET STORAGE may need to add a TOAST table
! create table test_storage (a text);
! alter table test_storage alter a set storage plain;
! alter table test_storage add b int default 0; -- rewrite table to remove its TOAST table
! alter table test_storage alter a set storage extended; -- re-add TOAST table
! select reltoastrelid <> 0 as has_toast_table
! from pg_class
! where oid = 'test_storage'::regclass;
!  has_toast_table 
! -----------------
!  t
! (1 row)
! 
! -- ALTER COLUMN TYPE with a check constraint and a child table (bug #13779)
! CREATE TABLE test_inh_check (a float check (a > 10.2), b float);
! CREATE TABLE test_inh_check_child() INHERITS(test_inh_check);
! \d test_inh_check
!      Table "public.test_inh_check"
!  Column |       Type       | Modifiers 
! --------+------------------+-----------
!  a      | double precision | 
!  b      | double precision | 
! Check constraints:
!     "test_inh_check_a_check" CHECK (a > 10.2::double precision)
! Number of child tables: 1 (Use \d+ to list them.)
! 
! \d test_inh_check_child
!   Table "public.test_inh_check_child"
!  Column |       Type       | Modifiers 
! --------+------------------+-----------
!  a      | double precision | 
!  b      | double precision | 
! Check constraints:
!     "test_inh_check_a_check" CHECK (a > 10.2::double precision)
! Inherits: test_inh_check
! 
! select relname, conname, coninhcount, conislocal, connoinherit
!   from pg_constraint c, pg_class r
!   where relname like 'test_inh_check%' and c.conrelid = r.oid
!   order by 1, 2;
!        relname        |        conname         | coninhcount | conislocal | connoinherit 
! ----------------------+------------------------+-------------+------------+--------------
!  test_inh_check       | test_inh_check_a_check |           0 | t          | f
!  test_inh_check_child | test_inh_check_a_check |           1 | f          | f
! (2 rows)
! 
! ALTER TABLE test_inh_check ALTER COLUMN a TYPE numeric;
! \d test_inh_check
!      Table "public.test_inh_check"
!  Column |       Type       | Modifiers 
! --------+------------------+-----------
!  a      | numeric          | 
!  b      | double precision | 
! Check constraints:
!     "test_inh_check_a_check" CHECK (a::double precision > 10.2::double precision)
! Number of child tables: 1 (Use \d+ to list them.)
! 
! \d test_inh_check_child
!   Table "public.test_inh_check_child"
!  Column |       Type       | Modifiers 
! --------+------------------+-----------
!  a      | numeric          | 
!  b      | double precision | 
! Check constraints:
!     "test_inh_check_a_check" CHECK (a::double precision > 10.2::double precision)
! Inherits: test_inh_check
! 
! select relname, conname, coninhcount, conislocal, connoinherit
!   from pg_constraint c, pg_class r
!   where relname like 'test_inh_check%' and c.conrelid = r.oid
!   order by 1, 2;
!        relname        |        conname         | coninhcount | conislocal | connoinherit 
! ----------------------+------------------------+-------------+------------+--------------
!  test_inh_check       | test_inh_check_a_check |           0 | t          | f
!  test_inh_check_child | test_inh_check_a_check |           1 | f          | f
! (2 rows)
! 
! -- also try noinherit, local, and local+inherited cases
! ALTER TABLE test_inh_check ADD CONSTRAINT bnoinherit CHECK (b > 100) NO INHERIT;
! ALTER TABLE test_inh_check_child ADD CONSTRAINT blocal CHECK (b < 1000);
! ALTER TABLE test_inh_check_child ADD CONSTRAINT bmerged CHECK (b > 1);
! ALTER TABLE test_inh_check ADD CONSTRAINT bmerged CHECK (b > 1);
! NOTICE:  merging constraint "bmerged" with inherited definition
! \d test_inh_check
!      Table "public.test_inh_check"
!  Column |       Type       | Modifiers 
! --------+------------------+-----------
!  a      | numeric          | 
!  b      | double precision | 
! Check constraints:
!     "bmerged" CHECK (b > 1::double precision)
!     "bnoinherit" CHECK (b > 100::double precision) NO INHERIT
!     "test_inh_check_a_check" CHECK (a::double precision > 10.2::double precision)
! Number of child tables: 1 (Use \d+ to list them.)
! 
! \d test_inh_check_child
!   Table "public.test_inh_check_child"
!  Column |       Type       | Modifiers 
! --------+------------------+-----------
!  a      | numeric          | 
!  b      | double precision | 
! Check constraints:
!     "blocal" CHECK (b < 1000::double precision)
!     "bmerged" CHECK (b > 1::double precision)
!     "test_inh_check_a_check" CHECK (a::double precision > 10.2::double precision)
! Inherits: test_inh_check
! 
! select relname, conname, coninhcount, conislocal, connoinherit
!   from pg_constraint c, pg_class r
!   where relname like 'test_inh_check%' and c.conrelid = r.oid
!   order by 1, 2;
!        relname        |        conname         | coninhcount | conislocal | connoinherit 
! ----------------------+------------------------+-------------+------------+--------------
!  test_inh_check       | bmerged                |           0 | t          | f
!  test_inh_check       | bnoinherit             |           0 | t          | t
!  test_inh_check       | test_inh_check_a_check |           0 | t          | f
!  test_inh_check_child | blocal                 |           0 | t          | f
!  test_inh_check_child | bmerged                |           1 | t          | f
!  test_inh_check_child | test_inh_check_a_check |           1 | f          | f
! (6 rows)
! 
! ALTER TABLE test_inh_check ALTER COLUMN b TYPE numeric;
! NOTICE:  merging constraint "bmerged" with inherited definition
! \d test_inh_check
! Table "public.test_inh_check"
!  Column |  Type   | Modifiers 
! --------+---------+-----------
!  a      | numeric | 
!  b      | numeric | 
! Check constraints:
!     "bmerged" CHECK (b::double precision > 1::double precision)
!     "bnoinherit" CHECK (b::double precision > 100::double precision) NO INHERIT
!     "test_inh_check_a_check" CHECK (a::double precision > 10.2::double precision)
! Number of child tables: 1 (Use \d+ to list them.)
! 
! \d test_inh_check_child
! Table "public.test_inh_check_child"
!  Column |  Type   | Modifiers 
! --------+---------+-----------
!  a      | numeric | 
!  b      | numeric | 
! Check constraints:
!     "blocal" CHECK (b::double precision < 1000::double precision)
!     "bmerged" CHECK (b::double precision > 1::double precision)
!     "test_inh_check_a_check" CHECK (a::double precision > 10.2::double precision)
! Inherits: test_inh_check
! 
! select relname, conname, coninhcount, conislocal, connoinherit
!   from pg_constraint c, pg_class r
!   where relname like 'test_inh_check%' and c.conrelid = r.oid
!   order by 1, 2;
!        relname        |        conname         | coninhcount | conislocal | connoinherit 
! ----------------------+------------------------+-------------+------------+--------------
!  test_inh_check       | bmerged                |           0 | t          | f
!  test_inh_check       | bnoinherit             |           0 | t          | t
!  test_inh_check       | test_inh_check_a_check |           0 | t          | f
!  test_inh_check_child | blocal                 |           0 | t          | f
!  test_inh_check_child | bmerged                |           1 | t          | f
!  test_inh_check_child | test_inh_check_a_check |           1 | f          | f
! (6 rows)
! 
! -- check for rollback of ANALYZE corrupting table property flags (bug #11638)
! CREATE TABLE check_fk_presence_1 (id int PRIMARY KEY, t text);
! CREATE TABLE check_fk_presence_2 (id int REFERENCES check_fk_presence_1, t text);
! BEGIN;
! ALTER TABLE check_fk_presence_2 DROP CONSTRAINT check_fk_presence_2_id_fkey;
! ANALYZE check_fk_presence_2;
! ROLLBACK;
! \d check_fk_presence_2
! Table "public.check_fk_presence_2"
!  Column |  Type   | Modifiers 
! --------+---------+-----------
!  id     | integer | 
!  t      | text    | 
! Foreign-key constraints:
!     "check_fk_presence_2_id_fkey" FOREIGN KEY (id) REFERENCES check_fk_presence_1(id)
! 
! DROP TABLE check_fk_presence_1, check_fk_presence_2;
! --
! -- lock levels
! --
! drop type lockmodes;
! ERROR:  type "lockmodes" does not exist
! create type lockmodes as enum (
!  'SIReadLock'
! ,'AccessShareLock'
! ,'RowShareLock'
! ,'RowExclusiveLock'
! ,'ShareUpdateExclusiveLock'
! ,'ShareLock'
! ,'ShareRowExclusiveLock'
! ,'ExclusiveLock'
! ,'AccessExclusiveLock'
! );
! drop view my_locks;
! ERROR:  view "my_locks" does not exist
! create or replace view my_locks as
! select case when c.relname like 'pg_toast%' then 'pg_toast' else c.relname end, max(mode::lockmodes) as max_lockmode
! from pg_locks l join pg_class c on l.relation = c.oid
! where virtualtransaction = (
!         select virtualtransaction
!         from pg_locks
!         where transactionid = txid_current()::integer)
! and locktype = 'relation'
! and relnamespace != (select oid from pg_namespace where nspname = 'pg_catalog')
! and c.relname != 'my_locks'
! group by c.relname;
! create table alterlock (f1 int primary key, f2 text);
! insert into alterlock values (1, 'foo');
! create table alterlock2 (f3 int primary key, f1 int);
! insert into alterlock2 values (1, 1);
! begin; alter table alterlock alter column f2 set statistics 150;
! select * from my_locks order by 1;
!   relname  |       max_lockmode       
! -----------+--------------------------
!  alterlock | ShareUpdateExclusiveLock
! (1 row)
! 
! rollback;
! begin; alter table alterlock cluster on alterlock_pkey;
! select * from my_locks order by 1;
!     relname     |       max_lockmode       
! ----------------+--------------------------
!  alterlock      | ShareUpdateExclusiveLock
!  alterlock_pkey | ShareUpdateExclusiveLock
! (2 rows)
! 
! commit;
! begin; alter table alterlock set without cluster;
! select * from my_locks order by 1;
!   relname  |       max_lockmode       
! -----------+--------------------------
!  alterlock | ShareUpdateExclusiveLock
! (1 row)
! 
! commit;
! begin; alter table alterlock set (fillfactor = 100);
! select * from my_locks order by 1;
!   relname  |       max_lockmode       
! -----------+--------------------------
!  alterlock | ShareUpdateExclusiveLock
!  pg_toast  | ShareUpdateExclusiveLock
! (2 rows)
! 
! commit;
! begin; alter table alterlock reset (fillfactor);
! select * from my_locks order by 1;
!   relname  |       max_lockmode       
! -----------+--------------------------
!  alterlock | ShareUpdateExclusiveLock
!  pg_toast  | ShareUpdateExclusiveLock
! (2 rows)
! 
! commit;
! begin; alter table alterlock set (toast.autovacuum_enabled = off);
! select * from my_locks order by 1;
!   relname  |       max_lockmode       
! -----------+--------------------------
!  alterlock | ShareUpdateExclusiveLock
!  pg_toast  | ShareUpdateExclusiveLock
! (2 rows)
! 
! commit;
! begin; alter table alterlock set (autovacuum_enabled = off);
! select * from my_locks order by 1;
!   relname  |       max_lockmode       
! -----------+--------------------------
!  alterlock | ShareUpdateExclusiveLock
!  pg_toast  | ShareUpdateExclusiveLock
! (2 rows)
! 
! commit;
! begin; alter table alterlock alter column f2 set (n_distinct = 1);
! select * from my_locks order by 1;
!   relname  |       max_lockmode       
! -----------+--------------------------
!  alterlock | ShareUpdateExclusiveLock
! (1 row)
! 
! rollback;
! -- test that mixing options with different lock levels works as expected
! begin; alter table alterlock set (autovacuum_enabled = off, fillfactor = 80);
! select * from my_locks order by 1;
!   relname  |       max_lockmode       
! -----------+--------------------------
!  alterlock | ShareUpdateExclusiveLock
!  pg_toast  | ShareUpdateExclusiveLock
! (2 rows)
! 
! commit;
! begin; alter table alterlock alter column f2 set storage extended;
! select * from my_locks order by 1;
!   relname  |    max_lockmode     
! -----------+---------------------
!  alterlock | AccessExclusiveLock
! (1 row)
! 
! rollback;
! begin; alter table alterlock alter column f2 set default 'x';
! select * from my_locks order by 1;
!   relname  |    max_lockmode     
! -----------+---------------------
!  alterlock | AccessExclusiveLock
! (1 row)
! 
! rollback;
! begin;
! create trigger ttdummy
! 	before delete or update on alterlock
! 	for each row
! 	execute procedure
! 	ttdummy (1, 1);
! select * from my_locks order by 1;
!   relname  |     max_lockmode      
! -----------+-----------------------
!  alterlock | ShareRowExclusiveLock
! (1 row)
! 
! rollback;
! begin;
! select * from my_locks order by 1;
!  relname | max_lockmode 
! ---------+--------------
! (0 rows)
! 
! alter table alterlock2 add foreign key (f1) references alterlock (f1);
! select * from my_locks order by 1;
!      relname     |     max_lockmode      
! -----------------+-----------------------
!  alterlock       | ShareRowExclusiveLock
!  alterlock2      | ShareRowExclusiveLock
!  alterlock2_pkey | AccessShareLock
!  alterlock_pkey  | AccessShareLock
! (4 rows)
! 
! rollback;
! begin;
! alter table alterlock2
! add constraint alterlock2nv foreign key (f1) references alterlock (f1) NOT VALID;
! select * from my_locks order by 1;
!   relname   |     max_lockmode      
! ------------+-----------------------
!  alterlock  | ShareRowExclusiveLock
!  alterlock2 | ShareRowExclusiveLock
! (2 rows)
! 
! commit;
! begin;
! alter table alterlock2 validate constraint alterlock2nv;
! select * from my_locks order by 1;
!      relname     |       max_lockmode       
! -----------------+--------------------------
!  alterlock       | RowShareLock
!  alterlock2      | ShareUpdateExclusiveLock
!  alterlock2_pkey | AccessShareLock
!  alterlock_pkey  | AccessShareLock
! (4 rows)
! 
! rollback;
! create or replace view my_locks as
! select case when c.relname like 'pg_toast%' then 'pg_toast' else c.relname end, max(mode::lockmodes) as max_lockmode
! from pg_locks l join pg_class c on l.relation = c.oid
! where virtualtransaction = (
!         select virtualtransaction
!         from pg_locks
!         where transactionid = txid_current()::integer)
! and locktype = 'relation'
! and relnamespace != (select oid from pg_namespace where nspname = 'pg_catalog')
! and c.relname = 'my_locks'
! group by c.relname;
! -- raise exception
! alter table my_locks set (autovacuum_enabled = false);
! ERROR:  unrecognized parameter "autovacuum_enabled"
! alter view my_locks set (autovacuum_enabled = false);
! ERROR:  unrecognized parameter "autovacuum_enabled"
! alter table my_locks reset (autovacuum_enabled);
! alter view my_locks reset (autovacuum_enabled);
! begin;
! alter view my_locks set (security_barrier=off);
! select * from my_locks order by 1;
!  relname  |    max_lockmode     
! ----------+---------------------
!  my_locks | AccessExclusiveLock
! (1 row)
! 
! alter view my_locks reset (security_barrier);
! rollback;
! -- this test intentionally applies the ALTER TABLE command against a view, but
! -- uses a view option so we expect this to succeed. This form of SQL is
! -- accepted for historical reasons, as shown in the docs for ALTER VIEW
! begin;
! alter table my_locks set (security_barrier=off);
! select * from my_locks order by 1;
!  relname  |    max_lockmode     
! ----------+---------------------
!  my_locks | AccessExclusiveLock
! (1 row)
! 
! alter table my_locks reset (security_barrier);
! rollback;
! -- cleanup
! drop table alterlock2;
! drop table alterlock;
! drop view my_locks;
! drop type lockmodes;
! --
! -- alter function
! --
! create function test_strict(text) returns text as
!     'select coalesce($1, ''got passed a null'');'
!     language sql returns null on null input;
! select test_strict(NULL);
!  test_strict 
! -------------
!  
! (1 row)
! 
! alter function test_strict(text) called on null input;
! select test_strict(NULL);
!     test_strict    
! -------------------
!  got passed a null
! (1 row)
! 
! create function non_strict(text) returns text as
!     'select coalesce($1, ''got passed a null'');'
!     language sql called on null input;
! select non_strict(NULL);
!     non_strict     
! -------------------
!  got passed a null
! (1 row)
! 
! alter function non_strict(text) returns null on null input;
! select non_strict(NULL);
!  non_strict 
! ------------
!  
! (1 row)
! 
! --
! -- alter object set schema
! --
! create schema alter1;
! create schema alter2;
! create table alter1.t1(f1 serial primary key, f2 int check (f2 > 0));
! create view alter1.v1 as select * from alter1.t1;
! create function alter1.plus1(int) returns int as 'select $1+1' language sql;
! create domain alter1.posint integer check (value > 0);
! create type alter1.ctype as (f1 int, f2 text);
! create function alter1.same(alter1.ctype, alter1.ctype) returns boolean language sql
! as 'select $1.f1 is not distinct from $2.f1 and $1.f2 is not distinct from $2.f2';
! create operator alter1.=(procedure = alter1.same, leftarg  = alter1.ctype, rightarg = alter1.ctype);
! create operator class alter1.ctype_hash_ops default for type alter1.ctype using hash as
!   operator 1 alter1.=(alter1.ctype, alter1.ctype);
! create conversion alter1.ascii_to_utf8 for 'sql_ascii' to 'utf8' from ascii_to_utf8;
! create text search parser alter1.prs(start = prsd_start, gettoken = prsd_nexttoken, end = prsd_end, lextypes = prsd_lextype);
! create text search configuration alter1.cfg(parser = alter1.prs);
! create text search template alter1.tmpl(init = dsimple_init, lexize = dsimple_lexize);
! create text search dictionary alter1.dict(template = alter1.tmpl);
! insert into alter1.t1(f2) values(11);
! insert into alter1.t1(f2) values(12);
! alter table alter1.t1 set schema alter1; -- no-op, same schema
! alter table alter1.t1 set schema alter2;
! alter table alter1.v1 set schema alter2;
! alter function alter1.plus1(int) set schema alter2;
! alter domain alter1.posint set schema alter2;
! alter operator class alter1.ctype_hash_ops using hash set schema alter2;
! alter operator family alter1.ctype_hash_ops using hash set schema alter2;
! alter operator alter1.=(alter1.ctype, alter1.ctype) set schema alter2;
! alter function alter1.same(alter1.ctype, alter1.ctype) set schema alter2;
! alter type alter1.ctype set schema alter1; -- no-op, same schema
! alter type alter1.ctype set schema alter2;
! alter conversion alter1.ascii_to_utf8 set schema alter2;
! alter text search parser alter1.prs set schema alter2;
! alter text search configuration alter1.cfg set schema alter2;
! alter text search template alter1.tmpl set schema alter2;
! alter text search dictionary alter1.dict set schema alter2;
! -- this should succeed because nothing is left in alter1
! drop schema alter1;
! insert into alter2.t1(f2) values(13);
! insert into alter2.t1(f2) values(14);
! select * from alter2.t1;
!  f1 | f2 
! ----+----
!   1 | 11
!   2 | 12
!   3 | 13
!   4 | 14
! (4 rows)
! 
! select * from alter2.v1;
!  f1 | f2 
! ----+----
!   1 | 11
!   2 | 12
!   3 | 13
!   4 | 14
! (4 rows)
! 
! select alter2.plus1(41);
!  plus1 
! -------
!     42
! (1 row)
! 
! -- clean up
! drop schema alter2 cascade;
! NOTICE:  drop cascades to 13 other objects
! DETAIL:  drop cascades to table alter2.t1
! drop cascades to view alter2.v1
! drop cascades to function alter2.plus1(integer)
! drop cascades to type alter2.posint
! drop cascades to operator family alter2.ctype_hash_ops for access method hash
! drop cascades to type alter2.ctype
! drop cascades to function alter2.same(alter2.ctype,alter2.ctype)
! drop cascades to operator alter2.=(alter2.ctype,alter2.ctype)
! drop cascades to conversion ascii_to_utf8
! drop cascades to text search parser prs
! drop cascades to text search configuration cfg
! drop cascades to text search template tmpl
! drop cascades to text search dictionary dict
! --
! -- composite types
! --
! CREATE TYPE test_type AS (a int);
! \d test_type
! Composite type "public.test_type"
!  Column |  Type   | Modifiers 
! --------+---------+-----------
!  a      | integer | 
! 
! ALTER TYPE nosuchtype ADD ATTRIBUTE b text; -- fails
! ERROR:  relation "nosuchtype" does not exist
! ALTER TYPE test_type ADD ATTRIBUTE b text;
! \d test_type
! Composite type "public.test_type"
!  Column |  Type   | Modifiers 
! --------+---------+-----------
!  a      | integer | 
!  b      | text    | 
! 
! ALTER TYPE test_type ADD ATTRIBUTE b text; -- fails
! ERROR:  column "b" of relation "test_type" already exists
! ALTER TYPE test_type ALTER ATTRIBUTE b SET DATA TYPE varchar;
! \d test_type
!    Composite type "public.test_type"
!  Column |       Type        | Modifiers 
! --------+-------------------+-----------
!  a      | integer           | 
!  b      | character varying | 
! 
! ALTER TYPE test_type ALTER ATTRIBUTE b SET DATA TYPE integer;
! \d test_type
! Composite type "public.test_type"
!  Column |  Type   | Modifiers 
! --------+---------+-----------
!  a      | integer | 
!  b      | integer | 
! 
! ALTER TYPE test_type DROP ATTRIBUTE b;
! \d test_type
! Composite type "public.test_type"
!  Column |  Type   | Modifiers 
! --------+---------+-----------
!  a      | integer | 
! 
! ALTER TYPE test_type DROP ATTRIBUTE c; -- fails
! ERROR:  column "c" of relation "test_type" does not exist
! ALTER TYPE test_type DROP ATTRIBUTE IF EXISTS c;
! NOTICE:  column "c" of relation "test_type" does not exist, skipping
! ALTER TYPE test_type DROP ATTRIBUTE a, ADD ATTRIBUTE d boolean;
! \d test_type
! Composite type "public.test_type"
!  Column |  Type   | Modifiers 
! --------+---------+-----------
!  d      | boolean | 
! 
! ALTER TYPE test_type RENAME ATTRIBUTE a TO aa;
! ERROR:  column "a" does not exist
! ALTER TYPE test_type RENAME ATTRIBUTE d TO dd;
! \d test_type
! Composite type "public.test_type"
!  Column |  Type   | Modifiers 
! --------+---------+-----------
!  dd     | boolean | 
! 
! DROP TYPE test_type;
! CREATE TYPE test_type1 AS (a int, b text);
! CREATE TABLE test_tbl1 (x int, y test_type1);
! ALTER TYPE test_type1 ALTER ATTRIBUTE b TYPE varchar; -- fails
! ERROR:  cannot alter type "test_type1" because column "test_tbl1.y" uses it
! CREATE TYPE test_type2 AS (a int, b text);
! CREATE TABLE test_tbl2 OF test_type2;
! CREATE TABLE test_tbl2_subclass () INHERITS (test_tbl2);
! \d test_type2
! Composite type "public.test_type2"
!  Column |  Type   | Modifiers 
! --------+---------+-----------
!  a      | integer | 
!  b      | text    | 
! 
! \d test_tbl2
!    Table "public.test_tbl2"
!  Column |  Type   | Modifiers 
! --------+---------+-----------
!  a      | integer | 
!  b      | text    | 
! Number of child tables: 1 (Use \d+ to list them.)
! Typed table of type: test_type2
! 
! ALTER TYPE test_type2 ADD ATTRIBUTE c text; -- fails
! ERROR:  cannot alter type "test_type2" because it is the type of a typed table
! HINT:  Use ALTER ... CASCADE to alter the typed tables too.
! ALTER TYPE test_type2 ADD ATTRIBUTE c text CASCADE;
! \d test_type2
! Composite type "public.test_type2"
!  Column |  Type   | Modifiers 
! --------+---------+-----------
!  a      | integer | 
!  b      | text    | 
!  c      | text    | 
! 
! \d test_tbl2
!    Table "public.test_tbl2"
!  Column |  Type   | Modifiers 
! --------+---------+-----------
!  a      | integer | 
!  b      | text    | 
!  c      | text    | 
! Number of child tables: 1 (Use \d+ to list them.)
! Typed table of type: test_type2
! 
! ALTER TYPE test_type2 ALTER ATTRIBUTE b TYPE varchar; -- fails
! ERROR:  cannot alter type "test_type2" because it is the type of a typed table
! HINT:  Use ALTER ... CASCADE to alter the typed tables too.
! ALTER TYPE test_type2 ALTER ATTRIBUTE b TYPE varchar CASCADE;
! \d test_type2
!    Composite type "public.test_type2"
!  Column |       Type        | Modifiers 
! --------+-------------------+-----------
!  a      | integer           | 
!  b      | character varying | 
!  c      | text              | 
! 
! \d test_tbl2
!         Table "public.test_tbl2"
!  Column |       Type        | Modifiers 
! --------+-------------------+-----------
!  a      | integer           | 
!  b      | character varying | 
!  c      | text              | 
! Number of child tables: 1 (Use \d+ to list them.)
! Typed table of type: test_type2
! 
! ALTER TYPE test_type2 DROP ATTRIBUTE b; -- fails
! ERROR:  cannot alter type "test_type2" because it is the type of a typed table
! HINT:  Use ALTER ... CASCADE to alter the typed tables too.
! ALTER TYPE test_type2 DROP ATTRIBUTE b CASCADE;
! \d test_type2
! Composite type "public.test_type2"
!  Column |  Type   | Modifiers 
! --------+---------+-----------
!  a      | integer | 
!  c      | text    | 
! 
! \d test_tbl2
!    Table "public.test_tbl2"
!  Column |  Type   | Modifiers 
! --------+---------+-----------
!  a      | integer | 
!  c      | text    | 
! Number of child tables: 1 (Use \d+ to list them.)
! Typed table of type: test_type2
! 
! ALTER TYPE test_type2 RENAME ATTRIBUTE a TO aa; -- fails
! ERROR:  cannot alter type "test_type2" because it is the type of a typed table
! HINT:  Use ALTER ... CASCADE to alter the typed tables too.
! ALTER TYPE test_type2 RENAME ATTRIBUTE a TO aa CASCADE;
! \d test_type2
! Composite type "public.test_type2"
!  Column |  Type   | Modifiers 
! --------+---------+-----------
!  aa     | integer | 
!  c      | text    | 
! 
! \d test_tbl2
!    Table "public.test_tbl2"
!  Column |  Type   | Modifiers 
! --------+---------+-----------
!  aa     | integer | 
!  c      | text    | 
! Number of child tables: 1 (Use \d+ to list them.)
! Typed table of type: test_type2
! 
! \d test_tbl2_subclass
! Table "public.test_tbl2_subclass"
!  Column |  Type   | Modifiers 
! --------+---------+-----------
!  aa     | integer | 
!  c      | text    | 
! Inherits: test_tbl2
! 
! DROP TABLE test_tbl2_subclass;
! -- This test isn't that interesting on its own, but the purpose is to leave
! -- behind a table to test pg_upgrade with. The table has a composite type
! -- column in it, and the composite type has a dropped attribute.
! CREATE TYPE test_type3 AS (a int);
! CREATE TABLE test_tbl3 (c) AS SELECT '(1)'::test_type3;
! ALTER TYPE test_type3 DROP ATTRIBUTE a, ADD ATTRIBUTE b int;
! CREATE TYPE test_type_empty AS ();
! DROP TYPE test_type_empty;
! --
! -- typed tables: OF / NOT OF
! --
! CREATE TYPE tt_t0 AS (z inet, x int, y numeric(8,2));
! ALTER TYPE tt_t0 DROP ATTRIBUTE z;
! CREATE TABLE tt0 (x int NOT NULL, y numeric(8,2));	-- OK
! CREATE TABLE tt1 (x int, y bigint);					-- wrong base type
! CREATE TABLE tt2 (x int, y numeric(9,2));			-- wrong typmod
! CREATE TABLE tt3 (y numeric(8,2), x int);			-- wrong column order
! CREATE TABLE tt4 (x int);							-- too few columns
! CREATE TABLE tt5 (x int, y numeric(8,2), z int);	-- too many columns
! CREATE TABLE tt6 () INHERITS (tt0);					-- can't have a parent
! CREATE TABLE tt7 (x int, q text, y numeric(8,2)) WITH OIDS;
! ALTER TABLE tt7 DROP q;								-- OK
! ALTER TABLE tt0 OF tt_t0;
! ALTER TABLE tt1 OF tt_t0;
! ERROR:  table "tt1" has different type for column "y"
! ALTER TABLE tt2 OF tt_t0;
! ERROR:  table "tt2" has different type for column "y"
! ALTER TABLE tt3 OF tt_t0;
! ERROR:  table has column "y" where type requires "x"
! ALTER TABLE tt4 OF tt_t0;
! ERROR:  table is missing column "y"
! ALTER TABLE tt5 OF tt_t0;
! ERROR:  table has extra column "z"
! ALTER TABLE tt6 OF tt_t0;
! ERROR:  typed tables cannot inherit
! ALTER TABLE tt7 OF tt_t0;
! CREATE TYPE tt_t1 AS (x int, y numeric(8,2));
! ALTER TABLE tt7 OF tt_t1;			-- reassign an already-typed table
! ALTER TABLE tt7 NOT OF;
! \d tt7
!         Table "public.tt7"
!  Column |     Type     | Modifiers 
! --------+--------------+-----------
!  x      | integer      | 
!  y      | numeric(8,2) | 
! 
! -- make sure we can drop a constraint on the parent but it remains on the child
! CREATE TABLE test_drop_constr_parent (c text CHECK (c IS NOT NULL));
! CREATE TABLE test_drop_constr_child () INHERITS (test_drop_constr_parent);
! ALTER TABLE ONLY test_drop_constr_parent DROP CONSTRAINT "test_drop_constr_parent_c_check";
! -- should fail
! INSERT INTO test_drop_constr_child (c) VALUES (NULL);
! ERROR:  new row for relation "test_drop_constr_child" violates check constraint "test_drop_constr_parent_c_check"
! DETAIL:  Failing row contains (null).
! DROP TABLE test_drop_constr_parent CASCADE;
! NOTICE:  drop cascades to table test_drop_constr_child
! --
! -- IF EXISTS test
! --
! ALTER TABLE IF EXISTS tt8 ADD COLUMN f int;
! NOTICE:  relation "tt8" does not exist, skipping
! ALTER TABLE IF EXISTS tt8 ADD CONSTRAINT xxx PRIMARY KEY(f);
! NOTICE:  relation "tt8" does not exist, skipping
! ALTER TABLE IF EXISTS tt8 ADD CHECK (f BETWEEN 0 AND 10);
! NOTICE:  relation "tt8" does not exist, skipping
! ALTER TABLE IF EXISTS tt8 ALTER COLUMN f SET DEFAULT 0;
! NOTICE:  relation "tt8" does not exist, skipping
! ALTER TABLE IF EXISTS tt8 RENAME COLUMN f TO f1;
! NOTICE:  relation "tt8" does not exist, skipping
! ALTER TABLE IF EXISTS tt8 SET SCHEMA alter2;
! NOTICE:  relation "tt8" does not exist, skipping
! CREATE TABLE tt8(a int);
! CREATE SCHEMA alter2;
! ALTER TABLE IF EXISTS tt8 ADD COLUMN f int;
! ALTER TABLE IF EXISTS tt8 ADD CONSTRAINT xxx PRIMARY KEY(f);
! ALTER TABLE IF EXISTS tt8 ADD CHECK (f BETWEEN 0 AND 10);
! ALTER TABLE IF EXISTS tt8 ALTER COLUMN f SET DEFAULT 0;
! ALTER TABLE IF EXISTS tt8 RENAME COLUMN f TO f1;
! ALTER TABLE IF EXISTS tt8 SET SCHEMA alter2;
! \d alter2.tt8
!           Table "alter2.tt8"
!  Column |  Type   |     Modifiers      
! --------+---------+--------------------
!  a      | integer | 
!  f1     | integer | not null default 0
! Indexes:
!     "xxx" PRIMARY KEY, btree (f1)
! Check constraints:
!     "tt8_f_check" CHECK (f1 >= 0 AND f1 <= 10)
! 
! DROP TABLE alter2.tt8;
! DROP SCHEMA alter2;
! -- Check that comments on constraints and indexes are not lost at ALTER TABLE.
! CREATE TABLE comment_test (
!   id int,
!   positive_col int CHECK (positive_col > 0),
!   indexed_col int,
!   CONSTRAINT comment_test_pk PRIMARY KEY (id));
! CREATE INDEX comment_test_index ON comment_test(indexed_col);
! COMMENT ON COLUMN comment_test.id IS 'Column ''id'' on comment_test';
! COMMENT ON INDEX comment_test_index IS 'Simple index on comment_test';
! COMMENT ON CONSTRAINT comment_test_positive_col_check ON comment_test IS 'CHECK constraint on comment_test.positive_col';
! COMMENT ON CONSTRAINT comment_test_pk ON comment_test IS 'PRIMARY KEY constraint of comment_test';
! COMMENT ON INDEX comment_test_pk IS 'Index backing the PRIMARY KEY of comment_test';
! SELECT col_description('comment_test'::regclass, 1) as comment;
!            comment           
! -----------------------------
!  Column 'id' on comment_test
! (1 row)
! 
! SELECT indexrelid::regclass::text as index, obj_description(indexrelid, 'pg_class') as comment FROM pg_index where indrelid = 'comment_test'::regclass ORDER BY 1, 2;
!        index        |                    comment                    
! --------------------+-----------------------------------------------
!  comment_test_index | Simple index on comment_test
!  comment_test_pk    | Index backing the PRIMARY KEY of comment_test
! (2 rows)
! 
! SELECT conname as constraint, obj_description(oid, 'pg_constraint') as comment FROM pg_constraint where conrelid = 'comment_test'::regclass ORDER BY 1, 2;
!            constraint            |                    comment                    
! ---------------------------------+-----------------------------------------------
!  comment_test_pk                 | PRIMARY KEY constraint of comment_test
!  comment_test_positive_col_check | CHECK constraint on comment_test.positive_col
! (2 rows)
! 
! -- Change the datatype of all the columns. ALTER TABLE is optimized to not
! -- rebuild an index if the new data type is binary compatible with the old
! -- one. First do a dummy ALTER TABLE that doesn't change the datatype, to
! -- test that no-op codepath, and then another one that does.
! ALTER TABLE comment_test ALTER COLUMN indexed_col SET DATA TYPE int;
! ALTER TABLE comment_test ALTER COLUMN indexed_col SET DATA TYPE text;
! ALTER TABLE comment_test ALTER COLUMN id SET DATA TYPE int;
! ALTER TABLE comment_test ALTER COLUMN id SET DATA TYPE text;
! ALTER TABLE comment_test ALTER COLUMN positive_col SET DATA TYPE int;
! ALTER TABLE comment_test ALTER COLUMN positive_col SET DATA TYPE bigint;
! -- Check that the comments are intact.
! SELECT col_description('comment_test'::regclass, 1) as comment;
!            comment           
! -----------------------------
!  Column 'id' on comment_test
! (1 row)
! 
! SELECT indexrelid::regclass::text as index, obj_description(indexrelid, 'pg_class') as comment FROM pg_index where indrelid = 'comment_test'::regclass ORDER BY 1, 2;
!        index        |                    comment                    
! --------------------+-----------------------------------------------
!  comment_test_index | Simple index on comment_test
!  comment_test_pk    | Index backing the PRIMARY KEY of comment_test
! (2 rows)
! 
! SELECT conname as constraint, obj_description(oid, 'pg_constraint') as comment FROM pg_constraint where conrelid = 'comment_test'::regclass ORDER BY 1, 2;
!            constraint            |                    comment                    
! ---------------------------------+-----------------------------------------------
!  comment_test_pk                 | PRIMARY KEY constraint of comment_test
!  comment_test_positive_col_check | CHECK constraint on comment_test.positive_col
! (2 rows)
! 
! -- Check that we map relation oids to filenodes and back correctly.  Only
! -- display bad mappings so the test output doesn't change all the time.  A
! -- filenode function call can return NULL for a relation dropped concurrently
! -- with the call's surrounding query, so ignore a NULL mapped_oid for
! -- relations that no longer exist after all calls finish.
! CREATE TEMP TABLE filenode_mapping AS
! SELECT
!     oid, mapped_oid, reltablespace, relfilenode, relname
! FROM pg_class,
!     pg_filenode_relation(reltablespace, pg_relation_filenode(oid)) AS mapped_oid
! WHERE relkind IN ('r', 'i', 'S', 't', 'm') AND mapped_oid IS DISTINCT FROM oid;
! SELECT m.* FROM filenode_mapping m LEFT JOIN pg_class c ON c.oid = m.oid
! WHERE c.oid IS NOT NULL OR m.mapped_oid IS NOT NULL;
!  oid | mapped_oid | reltablespace | relfilenode | relname 
! -----+------------+---------------+-------------+---------
! (0 rows)
! 
! -- Checks on creating and manipulation of user defined relations in
! -- pg_catalog.
! --
! -- XXX: It would be useful to add checks around trying to manipulate
! -- catalog tables, but that might have ugly consequences when run
! -- against an existing server with allow_system_table_mods = on.
! SHOW allow_system_table_mods;
!  allow_system_table_mods 
! -------------------------
!  off
! (1 row)
! 
! -- disallowed because of search_path issues with pg_dump
! CREATE TABLE pg_catalog.new_system_table();
! ERROR:  permission denied to create "pg_catalog.new_system_table"
! DETAIL:  System catalog modifications are currently disallowed.
! -- instead create in public first, move to catalog
! CREATE TABLE new_system_table(id serial primary key, othercol text);
! ALTER TABLE new_system_table SET SCHEMA pg_catalog;
! -- XXX: it's currently impossible to move relations out of pg_catalog
! ALTER TABLE new_system_table SET SCHEMA public;
! ERROR:  cannot remove dependency on schema pg_catalog because it is a system object
! -- move back, will be ignored -- already there
! ALTER TABLE new_system_table SET SCHEMA pg_catalog;
! ALTER TABLE new_system_table RENAME TO old_system_table;
! CREATE INDEX old_system_table__othercol ON old_system_table (othercol);
! INSERT INTO old_system_table(othercol) VALUES ('somedata'), ('otherdata');
! UPDATE old_system_table SET id = -id;
! DELETE FROM old_system_table WHERE othercol = 'somedata';
! TRUNCATE old_system_table;
! ALTER TABLE old_system_table DROP CONSTRAINT new_system_table_pkey;
! ALTER TABLE old_system_table DROP COLUMN othercol;
! DROP TABLE old_system_table;
! -- set logged
! CREATE UNLOGGED TABLE unlogged1(f1 SERIAL PRIMARY KEY, f2 TEXT);
! -- check relpersistence of an unlogged table
! SELECT relname, relkind, relpersistence FROM pg_class WHERE relname ~ '^unlogged1'
! UNION ALL
! SELECT 'toast table', t.relkind, t.relpersistence FROM pg_class r JOIN pg_class t ON t.oid = r.reltoastrelid WHERE r.relname ~ '^unlogged1'
! UNION ALL
! SELECT 'toast index', ri.relkind, ri.relpersistence FROM pg_class r join pg_class t ON t.oid = r.reltoastrelid JOIN pg_index i ON i.indrelid = t.oid JOIN pg_class ri ON ri.oid = i.indexrelid WHERE r.relname ~ '^unlogged1'
! ORDER BY relname;
!      relname      | relkind | relpersistence 
! ------------------+---------+----------------
!  toast index      | i       | u
!  toast table      | t       | u
!  unlogged1        | r       | u
!  unlogged1_f1_seq | S       | p
!  unlogged1_pkey   | i       | u
! (5 rows)
! 
! CREATE UNLOGGED TABLE unlogged2(f1 SERIAL PRIMARY KEY, f2 INTEGER REFERENCES unlogged1); -- foreign key
! CREATE UNLOGGED TABLE unlogged3(f1 SERIAL PRIMARY KEY, f2 INTEGER REFERENCES unlogged3); -- self-referencing foreign key
! ALTER TABLE unlogged3 SET LOGGED; -- skip self-referencing foreign key
! ALTER TABLE unlogged2 SET LOGGED; -- fails because a foreign key to an unlogged table exists
! ERROR:  could not change table "unlogged2" to logged because it references unlogged table "unlogged1"
! ALTER TABLE unlogged1 SET LOGGED;
! -- check relpersistence of an unlogged table after changing to permament
! SELECT relname, relkind, relpersistence FROM pg_class WHERE relname ~ '^unlogged1'
! UNION ALL
! SELECT 'toast table', t.relkind, t.relpersistence FROM pg_class r JOIN pg_class t ON t.oid = r.reltoastrelid WHERE r.relname ~ '^unlogged1'
! UNION ALL
! SELECT 'toast index', ri.relkind, ri.relpersistence FROM pg_class r join pg_class t ON t.oid = r.reltoastrelid JOIN pg_index i ON i.indrelid = t.oid JOIN pg_class ri ON ri.oid = i.indexrelid WHERE r.relname ~ '^unlogged1'
! ORDER BY relname;
!      relname      | relkind | relpersistence 
! ------------------+---------+----------------
!  toast index      | i       | p
!  toast table      | t       | p
!  unlogged1        | r       | p
!  unlogged1_f1_seq | S       | p
!  unlogged1_pkey   | i       | p
! (5 rows)
! 
! ALTER TABLE unlogged1 SET LOGGED; -- silently do nothing
! DROP TABLE unlogged3;
! DROP TABLE unlogged2;
! DROP TABLE unlogged1;
! -- set unlogged
! CREATE TABLE logged1(f1 SERIAL PRIMARY KEY, f2 TEXT);
! -- check relpersistence of a permanent table
! SELECT relname, relkind, relpersistence FROM pg_class WHERE relname ~ '^logged1'
! UNION ALL
! SELECT 'toast table', t.relkind, t.relpersistence FROM pg_class r JOIN pg_class t ON t.oid = r.reltoastrelid WHERE r.relname ~ '^logged1'
! UNION ALL
! SELECT 'toast index', ri.relkind, ri.relpersistence FROM pg_class r join pg_class t ON t.oid = r.reltoastrelid JOIN pg_index i ON i.indrelid = t.oid JOIN pg_class ri ON ri.oid = i.indexrelid WHERE r.relname ~ '^logged1'
! ORDER BY relname;
!     relname     | relkind | relpersistence 
! ----------------+---------+----------------
!  logged1        | r       | p
!  logged1_f1_seq | S       | p
!  logged1_pkey   | i       | p
!  toast index    | i       | p
!  toast table    | t       | p
! (5 rows)
! 
! CREATE TABLE logged2(f1 SERIAL PRIMARY KEY, f2 INTEGER REFERENCES logged1); -- foreign key
! CREATE TABLE logged3(f1 SERIAL PRIMARY KEY, f2 INTEGER REFERENCES logged3); -- self-referencing foreign key
! ALTER TABLE logged1 SET UNLOGGED; -- fails because a foreign key from a permanent table exists
! ERROR:  could not change table "logged1" to unlogged because it references logged table "logged2"
! ALTER TABLE logged3 SET UNLOGGED; -- skip self-referencing foreign key
! ALTER TABLE logged2 SET UNLOGGED;
! ALTER TABLE logged1 SET UNLOGGED;
! -- check relpersistence of a permanent table after changing to unlogged
! SELECT relname, relkind, relpersistence FROM pg_class WHERE relname ~ '^logged1'
! UNION ALL
! SELECT 'toast table', t.relkind, t.relpersistence FROM pg_class r JOIN pg_class t ON t.oid = r.reltoastrelid WHERE r.relname ~ '^logged1'
! UNION ALL
! SELECT 'toast index', ri.relkind, ri.relpersistence FROM pg_class r join pg_class t ON t.oid = r.reltoastrelid JOIN pg_index i ON i.indrelid = t.oid JOIN pg_class ri ON ri.oid = i.indexrelid WHERE r.relname ~ '^logged1'
! ORDER BY relname;
!     relname     | relkind | relpersistence 
! ----------------+---------+----------------
!  logged1        | r       | u
!  logged1_f1_seq | S       | p
!  logged1_pkey   | i       | u
!  toast index    | i       | u
!  toast table    | t       | u
! (5 rows)
! 
! ALTER TABLE logged1 SET UNLOGGED; -- silently do nothing
! DROP TABLE logged3;
! DROP TABLE logged2;
! DROP TABLE logged1;
! -- test ADD COLUMN IF NOT EXISTS
! CREATE TABLE test_add_column(c1 integer);
! \d test_add_column
! Table "public.test_add_column"
!  Column |  Type   | Modifiers 
! --------+---------+-----------
!  c1     | integer | 
! 
! ALTER TABLE test_add_column
! 	ADD COLUMN c2 integer;
! \d test_add_column
! Table "public.test_add_column"
!  Column |  Type   | Modifiers 
! --------+---------+-----------
!  c1     | integer | 
!  c2     | integer | 
! 
! ALTER TABLE test_add_column
! 	ADD COLUMN c2 integer; -- fail because c2 already exists
! ERROR:  column "c2" of relation "test_add_column" already exists
! \d test_add_column
! Table "public.test_add_column"
!  Column |  Type   | Modifiers 
! --------+---------+-----------
!  c1     | integer | 
!  c2     | integer | 
! 
! ALTER TABLE test_add_column
! 	ADD COLUMN IF NOT EXISTS c2 integer; -- skipping because c2 already exists
! NOTICE:  column "c2" of relation "test_add_column" already exists, skipping
! \d test_add_column
! Table "public.test_add_column"
!  Column |  Type   | Modifiers 
! --------+---------+-----------
!  c1     | integer | 
!  c2     | integer | 
! 
! ALTER TABLE test_add_column
! 	ADD COLUMN c2 integer, -- fail because c2 already exists
! 	ADD COLUMN c3 integer;
! ERROR:  column "c2" of relation "test_add_column" already exists
! \d test_add_column
! Table "public.test_add_column"
!  Column |  Type   | Modifiers 
! --------+---------+-----------
!  c1     | integer | 
!  c2     | integer | 
! 
! ALTER TABLE test_add_column
! 	ADD COLUMN IF NOT EXISTS c2 integer, -- skipping because c2 already exists
! 	ADD COLUMN c3 integer; -- fail because c3 already exists
! NOTICE:  column "c2" of relation "test_add_column" already exists, skipping
! \d test_add_column
! Table "public.test_add_column"
!  Column |  Type   | Modifiers 
! --------+---------+-----------
!  c1     | integer | 
!  c2     | integer | 
!  c3     | integer | 
! 
! ALTER TABLE test_add_column
! 	ADD COLUMN IF NOT EXISTS c2 integer, -- skipping because c2 already exists
! 	ADD COLUMN IF NOT EXISTS c3 integer; -- skipping because c3 already exists
! NOTICE:  column "c2" of relation "test_add_column" already exists, skipping
! NOTICE:  column "c3" of relation "test_add_column" already exists, skipping
! \d test_add_column
! Table "public.test_add_column"
!  Column |  Type   | Modifiers 
! --------+---------+-----------
!  c1     | integer | 
!  c2     | integer | 
!  c3     | integer | 
! 
! ALTER TABLE test_add_column
! 	ADD COLUMN IF NOT EXISTS c2 integer, -- skipping because c2 already exists
! 	ADD COLUMN IF NOT EXISTS c3 integer, -- skipping because c3 already exists
! 	ADD COLUMN c4 integer;
! NOTICE:  column "c2" of relation "test_add_column" already exists, skipping
! NOTICE:  column "c3" of relation "test_add_column" already exists, skipping
! \d test_add_column
! Table "public.test_add_column"
!  Column |  Type   | Modifiers 
! --------+---------+-----------
!  c1     | integer | 
!  c2     | integer | 
!  c3     | integer | 
!  c4     | integer | 
! 
! DROP TABLE test_add_column;
--- 1 ----
! psql: FATAL:  the database system is in recovery mode

======================================================================

*** /home/claudiofreire/src/postgresql.work/src/test/regress/expected/sequence.out	2016-09-05 20:45:49.072033605 -0300
--- /home/claudiofreire/src/postgresql.work/src/test/regress/results/sequence.out	2016-09-12 12:14:51.887413916 -0300
***************
*** 1,519 ****
! ---
! --- test creation of SERIAL column
! ---
! CREATE TABLE serialTest (f1 text, f2 serial);
! INSERT INTO serialTest VALUES ('foo');
! INSERT INTO serialTest VALUES ('bar');
! INSERT INTO serialTest VALUES ('force', 100);
! INSERT INTO serialTest VALUES ('wrong', NULL);
! ERROR:  null value in column "f2" violates not-null constraint
! DETAIL:  Failing row contains (wrong, null).
! SELECT * FROM serialTest;
!   f1   | f2  
! -------+-----
!  foo   |   1
!  bar   |   2
!  force | 100
! (3 rows)
! 
! -- test smallserial / bigserial
! CREATE TABLE serialTest2 (f1 text, f2 serial, f3 smallserial, f4 serial2,
!   f5 bigserial, f6 serial8);
! INSERT INTO serialTest2 (f1)
!   VALUES ('test_defaults');
! INSERT INTO serialTest2 (f1, f2, f3, f4, f5, f6)
!   VALUES ('test_max_vals', 2147483647, 32767, 32767, 9223372036854775807,
!           9223372036854775807),
!          ('test_min_vals', -2147483648, -32768, -32768, -9223372036854775808,
!           -9223372036854775808);
! -- All these INSERTs should fail:
! INSERT INTO serialTest2 (f1, f3)
!   VALUES ('bogus', -32769);
! ERROR:  smallint out of range
! INSERT INTO serialTest2 (f1, f4)
!   VALUES ('bogus', -32769);
! ERROR:  smallint out of range
! INSERT INTO serialTest2 (f1, f3)
!   VALUES ('bogus', 32768);
! ERROR:  smallint out of range
! INSERT INTO serialTest2 (f1, f4)
!   VALUES ('bogus', 32768);
! ERROR:  smallint out of range
! INSERT INTO serialTest2 (f1, f5)
!   VALUES ('bogus', -9223372036854775809);
! ERROR:  bigint out of range
! INSERT INTO serialTest2 (f1, f6)
!   VALUES ('bogus', -9223372036854775809);
! ERROR:  bigint out of range
! INSERT INTO serialTest2 (f1, f5)
!   VALUES ('bogus', 9223372036854775808);
! ERROR:  bigint out of range
! INSERT INTO serialTest2 (f1, f6)
!   VALUES ('bogus', 9223372036854775808);
! ERROR:  bigint out of range
! SELECT * FROM serialTest2 ORDER BY f2 ASC;
!       f1       |     f2      |   f3   |   f4   |          f5          |          f6          
! ---------------+-------------+--------+--------+----------------------+----------------------
!  test_min_vals | -2147483648 | -32768 | -32768 | -9223372036854775808 | -9223372036854775808
!  test_defaults |           1 |      1 |      1 |                    1 |                    1
!  test_max_vals |  2147483647 |  32767 |  32767 |  9223372036854775807 |  9223372036854775807
! (3 rows)
! 
! SELECT nextval('serialTest2_f2_seq');
!  nextval 
! ---------
!        2
! (1 row)
! 
! SELECT nextval('serialTest2_f3_seq');
!  nextval 
! ---------
!        2
! (1 row)
! 
! SELECT nextval('serialTest2_f4_seq');
!  nextval 
! ---------
!        2
! (1 row)
! 
! SELECT nextval('serialTest2_f5_seq');
!  nextval 
! ---------
!        2
! (1 row)
! 
! SELECT nextval('serialTest2_f6_seq');
!  nextval 
! ---------
!        2
! (1 row)
! 
! -- basic sequence operations using both text and oid references
! CREATE SEQUENCE sequence_test;
! CREATE SEQUENCE IF NOT EXISTS sequence_test;
! NOTICE:  relation "sequence_test" already exists, skipping
! SELECT nextval('sequence_test'::text);
!  nextval 
! ---------
!        1
! (1 row)
! 
! SELECT nextval('sequence_test'::regclass);
!  nextval 
! ---------
!        2
! (1 row)
! 
! SELECT currval('sequence_test'::text);
!  currval 
! ---------
!        2
! (1 row)
! 
! SELECT currval('sequence_test'::regclass);
!  currval 
! ---------
!        2
! (1 row)
! 
! SELECT setval('sequence_test'::text, 32);
!  setval 
! --------
!      32
! (1 row)
! 
! SELECT nextval('sequence_test'::regclass);
!  nextval 
! ---------
!       33
! (1 row)
! 
! SELECT setval('sequence_test'::text, 99, false);
!  setval 
! --------
!      99
! (1 row)
! 
! SELECT nextval('sequence_test'::regclass);
!  nextval 
! ---------
!       99
! (1 row)
! 
! SELECT setval('sequence_test'::regclass, 32);
!  setval 
! --------
!      32
! (1 row)
! 
! SELECT nextval('sequence_test'::text);
!  nextval 
! ---------
!       33
! (1 row)
! 
! SELECT setval('sequence_test'::regclass, 99, false);
!  setval 
! --------
!      99
! (1 row)
! 
! SELECT nextval('sequence_test'::text);
!  nextval 
! ---------
!       99
! (1 row)
! 
! DISCARD SEQUENCES;
! SELECT currval('sequence_test'::regclass);
! ERROR:  currval of sequence "sequence_test" is not yet defined in this session
! DROP SEQUENCE sequence_test;
! -- renaming sequences
! CREATE SEQUENCE foo_seq;
! ALTER TABLE foo_seq RENAME TO foo_seq_new;
! SELECT * FROM foo_seq_new;
!  sequence_name | last_value | start_value | increment_by |      max_value      | min_value | cache_value | log_cnt | is_cycled | is_called 
! ---------------+------------+-------------+--------------+---------------------+-----------+-------------+---------+-----------+-----------
!  foo_seq       |          1 |           1 |            1 | 9223372036854775807 |         1 |           1 |       0 | f         | f
! (1 row)
! 
! SELECT nextval('foo_seq_new');
!  nextval 
! ---------
!        1
! (1 row)
! 
! SELECT nextval('foo_seq_new');
!  nextval 
! ---------
!        2
! (1 row)
! 
! SELECT * FROM foo_seq_new;
!  sequence_name | last_value | start_value | increment_by |      max_value      | min_value | cache_value | log_cnt | is_cycled | is_called 
! ---------------+------------+-------------+--------------+---------------------+-----------+-------------+---------+-----------+-----------
!  foo_seq       |          2 |           1 |            1 | 9223372036854775807 |         1 |           1 |      31 | f         | t
! (1 row)
! 
! DROP SEQUENCE foo_seq_new;
! -- renaming serial sequences
! ALTER TABLE serialtest_f2_seq RENAME TO serialtest_f2_foo;
! INSERT INTO serialTest VALUES ('more');
! SELECT * FROM serialTest;
!   f1   | f2  
! -------+-----
!  foo   |   1
!  bar   |   2
!  force | 100
!  more  |   3
! (4 rows)
! 
! --
! -- Check dependencies of serial and ordinary sequences
! --
! CREATE TEMP SEQUENCE myseq2;
! CREATE TEMP SEQUENCE myseq3;
! CREATE TEMP TABLE t1 (
!   f1 serial,
!   f2 int DEFAULT nextval('myseq2'),
!   f3 int DEFAULT nextval('myseq3'::text)
! );
! -- Both drops should fail, but with different error messages:
! DROP SEQUENCE t1_f1_seq;
! ERROR:  cannot drop sequence t1_f1_seq because other objects depend on it
! DETAIL:  default for table t1 column f1 depends on sequence t1_f1_seq
! HINT:  Use DROP ... CASCADE to drop the dependent objects too.
! DROP SEQUENCE myseq2;
! ERROR:  cannot drop sequence myseq2 because other objects depend on it
! DETAIL:  default for table t1 column f2 depends on sequence myseq2
! HINT:  Use DROP ... CASCADE to drop the dependent objects too.
! -- This however will work:
! DROP SEQUENCE myseq3;
! DROP TABLE t1;
! -- Fails because no longer existent:
! DROP SEQUENCE t1_f1_seq;
! ERROR:  sequence "t1_f1_seq" does not exist
! -- Now OK:
! DROP SEQUENCE myseq2;
! --
! -- Alter sequence
! --
! ALTER SEQUENCE IF EXISTS sequence_test2 RESTART WITH 24
! 	 INCREMENT BY 4 MAXVALUE 36 MINVALUE 5 CYCLE;
! NOTICE:  relation "sequence_test2" does not exist, skipping
! CREATE SEQUENCE sequence_test2 START WITH 32;
! SELECT nextval('sequence_test2');
!  nextval 
! ---------
!       32
! (1 row)
! 
! ALTER SEQUENCE sequence_test2 RESTART WITH 24
! 	 INCREMENT BY 4 MAXVALUE 36 MINVALUE 5 CYCLE;
! SELECT nextval('sequence_test2');
!  nextval 
! ---------
!       24
! (1 row)
! 
! SELECT nextval('sequence_test2');
!  nextval 
! ---------
!       28
! (1 row)
! 
! SELECT nextval('sequence_test2');
!  nextval 
! ---------
!       32
! (1 row)
! 
! SELECT nextval('sequence_test2');
!  nextval 
! ---------
!       36
! (1 row)
! 
! SELECT nextval('sequence_test2');
!  nextval 
! ---------
!        5
! (1 row)
! 
! ALTER SEQUENCE sequence_test2 RESTART;
! SELECT nextval('sequence_test2');
!  nextval 
! ---------
!       32
! (1 row)
! 
! SELECT nextval('sequence_test2');
!  nextval 
! ---------
!       36
! (1 row)
! 
! SELECT nextval('sequence_test2');
!  nextval 
! ---------
!        5
! (1 row)
! 
! -- Information schema
! SELECT * FROM information_schema.sequences WHERE sequence_name IN
!   ('sequence_test2', 'serialtest2_f2_seq', 'serialtest2_f3_seq',
!    'serialtest2_f4_seq', 'serialtest2_f5_seq', 'serialtest2_f6_seq')
!   ORDER BY sequence_name ASC;
!  sequence_catalog | sequence_schema |   sequence_name    | data_type | numeric_precision | numeric_precision_radix | numeric_scale | start_value | minimum_value |    maximum_value    | increment | cycle_option 
! ------------------+-----------------+--------------------+-----------+-------------------+-------------------------+---------------+-------------+---------------+---------------------+-----------+--------------
!  regression       | public          | sequence_test2     | bigint    |                64 |                       2 |             0 | 32          | 5             | 36                  | 4         | YES
!  regression       | public          | serialtest2_f2_seq | bigint    |                64 |                       2 |             0 | 1           | 1             | 9223372036854775807 | 1         | NO
!  regression       | public          | serialtest2_f3_seq | bigint    |                64 |                       2 |             0 | 1           | 1             | 9223372036854775807 | 1         | NO
!  regression       | public          | serialtest2_f4_seq | bigint    |                64 |                       2 |             0 | 1           | 1             | 9223372036854775807 | 1         | NO
!  regression       | public          | serialtest2_f5_seq | bigint    |                64 |                       2 |             0 | 1           | 1             | 9223372036854775807 | 1         | NO
!  regression       | public          | serialtest2_f6_seq | bigint    |                64 |                       2 |             0 | 1           | 1             | 9223372036854775807 | 1         | NO
! (6 rows)
! 
! -- Test comments
! COMMENT ON SEQUENCE asdf IS 'won''t work';
! ERROR:  relation "asdf" does not exist
! COMMENT ON SEQUENCE sequence_test2 IS 'will work';
! COMMENT ON SEQUENCE sequence_test2 IS NULL;
! -- Test lastval()
! CREATE SEQUENCE seq;
! SELECT nextval('seq');
!  nextval 
! ---------
!        1
! (1 row)
! 
! SELECT lastval();
!  lastval 
! ---------
!        1
! (1 row)
! 
! SELECT setval('seq', 99);
!  setval 
! --------
!      99
! (1 row)
! 
! SELECT lastval();
!  lastval 
! ---------
!       99
! (1 row)
! 
! DISCARD SEQUENCES;
! SELECT lastval();
! ERROR:  lastval is not yet defined in this session
! CREATE SEQUENCE seq2;
! SELECT nextval('seq2');
!  nextval 
! ---------
!        1
! (1 row)
! 
! SELECT lastval();
!  lastval 
! ---------
!        1
! (1 row)
! 
! DROP SEQUENCE seq2;
! -- should fail
! SELECT lastval();
! ERROR:  lastval is not yet defined in this session
! CREATE USER regress_seq_user;
! -- privileges tests
! -- nextval
! BEGIN;
! SET LOCAL SESSION AUTHORIZATION regress_seq_user;
! CREATE SEQUENCE seq3;
! REVOKE ALL ON seq3 FROM regress_seq_user;
! GRANT SELECT ON seq3 TO regress_seq_user;
! SELECT nextval('seq3');
! ERROR:  permission denied for sequence seq3
! ROLLBACK;
! BEGIN;
! SET LOCAL SESSION AUTHORIZATION regress_seq_user;
! CREATE SEQUENCE seq3;
! REVOKE ALL ON seq3 FROM regress_seq_user;
! GRANT UPDATE ON seq3 TO regress_seq_user;
! SELECT nextval('seq3');
!  nextval 
! ---------
!        1
! (1 row)
! 
! ROLLBACK;
! BEGIN;
! SET LOCAL SESSION AUTHORIZATION regress_seq_user;
! CREATE SEQUENCE seq3;
! REVOKE ALL ON seq3 FROM regress_seq_user;
! GRANT USAGE ON seq3 TO regress_seq_user;
! SELECT nextval('seq3');
!  nextval 
! ---------
!        1
! (1 row)
! 
! ROLLBACK;
! -- currval
! BEGIN;
! SET LOCAL SESSION AUTHORIZATION regress_seq_user;
! CREATE SEQUENCE seq3;
! SELECT nextval('seq3');
!  nextval 
! ---------
!        1
! (1 row)
! 
! REVOKE ALL ON seq3 FROM regress_seq_user;
! GRANT SELECT ON seq3 TO regress_seq_user;
! SELECT currval('seq3');
!  currval 
! ---------
!        1
! (1 row)
! 
! ROLLBACK;
! BEGIN;
! SET LOCAL SESSION AUTHORIZATION regress_seq_user;
! CREATE SEQUENCE seq3;
! SELECT nextval('seq3');
!  nextval 
! ---------
!        1
! (1 row)
! 
! REVOKE ALL ON seq3 FROM regress_seq_user;
! GRANT UPDATE ON seq3 TO regress_seq_user;
! SELECT currval('seq3');
! ERROR:  permission denied for sequence seq3
! ROLLBACK;
! BEGIN;
! SET LOCAL SESSION AUTHORIZATION regress_seq_user;
! CREATE SEQUENCE seq3;
! SELECT nextval('seq3');
!  nextval 
! ---------
!        1
! (1 row)
! 
! REVOKE ALL ON seq3 FROM regress_seq_user;
! GRANT USAGE ON seq3 TO regress_seq_user;
! SELECT currval('seq3');
!  currval 
! ---------
!        1
! (1 row)
! 
! ROLLBACK;
! -- lastval
! BEGIN;
! SET LOCAL SESSION AUTHORIZATION regress_seq_user;
! CREATE SEQUENCE seq3;
! SELECT nextval('seq3');
!  nextval 
! ---------
!        1
! (1 row)
! 
! REVOKE ALL ON seq3 FROM regress_seq_user;
! GRANT SELECT ON seq3 TO regress_seq_user;
! SELECT lastval();
!  lastval 
! ---------
!        1
! (1 row)
! 
! ROLLBACK;
! BEGIN;
! SET LOCAL SESSION AUTHORIZATION regress_seq_user;
! CREATE SEQUENCE seq3;
! SELECT nextval('seq3');
!  nextval 
! ---------
!        1
! (1 row)
! 
! REVOKE ALL ON seq3 FROM regress_seq_user;
! GRANT UPDATE ON seq3 TO regress_seq_user;
! SELECT lastval();
! ERROR:  permission denied for sequence seq3
! ROLLBACK;
! BEGIN;
! SET LOCAL SESSION AUTHORIZATION regress_seq_user;
! CREATE SEQUENCE seq3;
! SELECT nextval('seq3');
!  nextval 
! ---------
!        1
! (1 row)
! 
! REVOKE ALL ON seq3 FROM regress_seq_user;
! GRANT USAGE ON seq3 TO regress_seq_user;
! SELECT lastval();
!  lastval 
! ---------
!        1
! (1 row)
! 
! ROLLBACK;
! -- Sequences should get wiped out as well:
! DROP TABLE serialTest, serialTest2;
! -- Make sure sequences are gone:
! SELECT * FROM information_schema.sequences WHERE sequence_name IN
!   ('sequence_test2', 'serialtest2_f2_seq', 'serialtest2_f3_seq',
!    'serialtest2_f4_seq', 'serialtest2_f5_seq', 'serialtest2_f6_seq')
!   ORDER BY sequence_name ASC;
!  sequence_catalog | sequence_schema | sequence_name  | data_type | numeric_precision | numeric_precision_radix | numeric_scale | start_value | minimum_value | maximum_value | increment | cycle_option 
! ------------------+-----------------+----------------+-----------+-------------------+-------------------------+---------------+-------------+---------------+---------------+-----------+--------------
!  regression       | public          | sequence_test2 | bigint    |                64 |                       2 |             0 | 32          | 5             | 36            | 4         | YES
! (1 row)
! 
! DROP USER regress_seq_user;
! DROP SEQUENCE seq;
--- 1 ----
! psql: FATAL:  the database system is in recovery mode

======================================================================

*** /home/claudiofreire/src/postgresql.work/src/test/regress/expected/polymorphism.out	2016-09-05 20:45:48.892033053 -0300
--- /home/claudiofreire/src/postgresql.work/src/test/regress/results/polymorphism.out	2016-09-12 12:14:51.891413917 -0300
***************
*** 1,1458 ****
! -- Currently this tests polymorphic aggregates and indirectly does some
! -- testing of polymorphic SQL functions.  It ought to be extended.
! -- Tests for other features related to function-calling have snuck in, too.
! -- Legend:
! -----------
! -- A = type is ANY
! -- P = type is polymorphic
! -- N = type is non-polymorphic
! -- B = aggregate base type
! -- S = aggregate state type
! -- R = aggregate return type
! -- 1 = arg1 of a function
! -- 2 = arg2 of a function
! -- ag = aggregate
! -- tf = trans (state) function
! -- ff = final function
! -- rt = return type of a function
! -- -> = implies
! -- => = allowed
! -- !> = not allowed
! -- E  = exists
! -- NE = not-exists
! --
! -- Possible states:
! -- ----------------
! -- B = (A || P || N)
! --   when (B = A) -> (tf2 = NE)
! -- S = (P || N)
! -- ff = (E || NE)
! -- tf1 = (P || N)
! -- tf2 = (NE || P || N)
! -- R = (P || N)
! -- create functions for use as tf and ff with the needed combinations of
! -- argument polymorphism, but within the constraints of valid aggregate
! -- functions, i.e. tf arg1 and tf return type must match
! -- polymorphic single arg transfn
! CREATE FUNCTION stfp(anyarray) RETURNS anyarray AS
! 'select $1' LANGUAGE SQL;
! -- non-polymorphic single arg transfn
! CREATE FUNCTION stfnp(int[]) RETURNS int[] AS
! 'select $1' LANGUAGE SQL;
! -- dual polymorphic transfn
! CREATE FUNCTION tfp(anyarray,anyelement) RETURNS anyarray AS
! 'select $1 || $2' LANGUAGE SQL;
! -- dual non-polymorphic transfn
! CREATE FUNCTION tfnp(int[],int) RETURNS int[] AS
! 'select $1 || $2' LANGUAGE SQL;
! -- arg1 only polymorphic transfn
! CREATE FUNCTION tf1p(anyarray,int) RETURNS anyarray AS
! 'select $1' LANGUAGE SQL;
! -- arg2 only polymorphic transfn
! CREATE FUNCTION tf2p(int[],anyelement) RETURNS int[] AS
! 'select $1' LANGUAGE SQL;
! -- multi-arg polymorphic
! CREATE FUNCTION sum3(anyelement,anyelement,anyelement) returns anyelement AS
! 'select $1+$2+$3' language sql strict;
! -- finalfn polymorphic
! CREATE FUNCTION ffp(anyarray) RETURNS anyarray AS
! 'select $1' LANGUAGE SQL;
! -- finalfn non-polymorphic
! CREATE FUNCTION ffnp(int[]) returns int[] as
! 'select $1' LANGUAGE SQL;
! -- Try to cover all the possible states:
! --
! -- Note: in Cases 1 & 2, we are trying to return P. Therefore, if the transfn
! -- is stfnp, tfnp, or tf2p, we must use ffp as finalfn, because stfnp, tfnp,
! -- and tf2p do not return P. Conversely, in Cases 3 & 4, we are trying to
! -- return N. Therefore, if the transfn is stfp, tfp, or tf1p, we must use ffnp
! -- as finalfn, because stfp, tfp, and tf1p do not return N.
! --
! --     Case1 (R = P) && (B = A)
! --     ------------------------
! --     S    tf1
! --     -------
! --     N    N
! -- should CREATE
! CREATE AGGREGATE myaggp01a(*) (SFUNC = stfnp, STYPE = int4[],
!   FINALFUNC = ffp, INITCOND = '{}');
! --     P    N
! -- should ERROR: stfnp(anyarray) not matched by stfnp(int[])
! CREATE AGGREGATE myaggp02a(*) (SFUNC = stfnp, STYPE = anyarray,
!   FINALFUNC = ffp, INITCOND = '{}');
! ERROR:  cannot determine transition data type
! DETAIL:  An aggregate using a polymorphic transition type must have at least one polymorphic argument.
! --     N    P
! -- should CREATE
! CREATE AGGREGATE myaggp03a(*) (SFUNC = stfp, STYPE = int4[],
!   FINALFUNC = ffp, INITCOND = '{}');
! CREATE AGGREGATE myaggp03b(*) (SFUNC = stfp, STYPE = int4[],
!   INITCOND = '{}');
! --     P    P
! -- should ERROR: we have no way to resolve S
! CREATE AGGREGATE myaggp04a(*) (SFUNC = stfp, STYPE = anyarray,
!   FINALFUNC = ffp, INITCOND = '{}');
! ERROR:  cannot determine transition data type
! DETAIL:  An aggregate using a polymorphic transition type must have at least one polymorphic argument.
! CREATE AGGREGATE myaggp04b(*) (SFUNC = stfp, STYPE = anyarray,
!   INITCOND = '{}');
! ERROR:  cannot determine transition data type
! DETAIL:  An aggregate using a polymorphic transition type must have at least one polymorphic argument.
! --    Case2 (R = P) && ((B = P) || (B = N))
! --    -------------------------------------
! --    S    tf1      B    tf2
! --    -----------------------
! --    N    N        N    N
! -- should CREATE
! CREATE AGGREGATE myaggp05a(BASETYPE = int, SFUNC = tfnp, STYPE = int[],
!   FINALFUNC = ffp, INITCOND = '{}');
! --    N    N        N    P
! -- should CREATE
! CREATE AGGREGATE myaggp06a(BASETYPE = int, SFUNC = tf2p, STYPE = int[],
!   FINALFUNC = ffp, INITCOND = '{}');
! --    N    N        P    N
! -- should ERROR: tfnp(int[], anyelement) not matched by tfnp(int[], int)
! CREATE AGGREGATE myaggp07a(BASETYPE = anyelement, SFUNC = tfnp, STYPE = int[],
!   FINALFUNC = ffp, INITCOND = '{}');
! ERROR:  function tfnp(integer[], anyelement) does not exist
! --    N    N        P    P
! -- should CREATE
! CREATE AGGREGATE myaggp08a(BASETYPE = anyelement, SFUNC = tf2p, STYPE = int[],
!   FINALFUNC = ffp, INITCOND = '{}');
! --    N    P        N    N
! -- should CREATE
! CREATE AGGREGATE myaggp09a(BASETYPE = int, SFUNC = tf1p, STYPE = int[],
!   FINALFUNC = ffp, INITCOND = '{}');
! CREATE AGGREGATE myaggp09b(BASETYPE = int, SFUNC = tf1p, STYPE = int[],
!   INITCOND = '{}');
! --    N    P        N    P
! -- should CREATE
! CREATE AGGREGATE myaggp10a(BASETYPE = int, SFUNC = tfp, STYPE = int[],
!   FINALFUNC = ffp, INITCOND = '{}');
! CREATE AGGREGATE myaggp10b(BASETYPE = int, SFUNC = tfp, STYPE = int[],
!   INITCOND = '{}');
! --    N    P        P    N
! -- should ERROR: tf1p(int[],anyelement) not matched by tf1p(anyarray,int)
! CREATE AGGREGATE myaggp11a(BASETYPE = anyelement, SFUNC = tf1p, STYPE = int[],
!   FINALFUNC = ffp, INITCOND = '{}');
! ERROR:  function tf1p(integer[], anyelement) does not exist
! CREATE AGGREGATE myaggp11b(BASETYPE = anyelement, SFUNC = tf1p, STYPE = int[],
!   INITCOND = '{}');
! ERROR:  function tf1p(integer[], anyelement) does not exist
! --    N    P        P    P
! -- should ERROR: tfp(int[],anyelement) not matched by tfp(anyarray,anyelement)
! CREATE AGGREGATE myaggp12a(BASETYPE = anyelement, SFUNC = tfp, STYPE = int[],
!   FINALFUNC = ffp, INITCOND = '{}');
! ERROR:  function tfp(integer[], anyelement) does not exist
! CREATE AGGREGATE myaggp12b(BASETYPE = anyelement, SFUNC = tfp, STYPE = int[],
!   INITCOND = '{}');
! ERROR:  function tfp(integer[], anyelement) does not exist
! --    P    N        N    N
! -- should ERROR: tfnp(anyarray, int) not matched by tfnp(int[],int)
! CREATE AGGREGATE myaggp13a(BASETYPE = int, SFUNC = tfnp, STYPE = anyarray,
!   FINALFUNC = ffp, INITCOND = '{}');
! ERROR:  cannot determine transition data type
! DETAIL:  An aggregate using a polymorphic transition type must have at least one polymorphic argument.
! --    P    N        N    P
! -- should ERROR: tf2p(anyarray, int) not matched by tf2p(int[],anyelement)
! CREATE AGGREGATE myaggp14a(BASETYPE = int, SFUNC = tf2p, STYPE = anyarray,
!   FINALFUNC = ffp, INITCOND = '{}');
! ERROR:  cannot determine transition data type
! DETAIL:  An aggregate using a polymorphic transition type must have at least one polymorphic argument.
! --    P    N        P    N
! -- should ERROR: tfnp(anyarray, anyelement) not matched by tfnp(int[],int)
! CREATE AGGREGATE myaggp15a(BASETYPE = anyelement, SFUNC = tfnp,
!   STYPE = anyarray, FINALFUNC = ffp, INITCOND = '{}');
! ERROR:  function tfnp(anyarray, anyelement) does not exist
! --    P    N        P    P
! -- should ERROR: tf2p(anyarray, anyelement) not matched by tf2p(int[],anyelement)
! CREATE AGGREGATE myaggp16a(BASETYPE = anyelement, SFUNC = tf2p,
!   STYPE = anyarray, FINALFUNC = ffp, INITCOND = '{}');
! ERROR:  function tf2p(anyarray, anyelement) does not exist
! --    P    P        N    N
! -- should ERROR: we have no way to resolve S
! CREATE AGGREGATE myaggp17a(BASETYPE = int, SFUNC = tf1p, STYPE = anyarray,
!   FINALFUNC = ffp, INITCOND = '{}');
! ERROR:  cannot determine transition data type
! DETAIL:  An aggregate using a polymorphic transition type must have at least one polymorphic argument.
! CREATE AGGREGATE myaggp17b(BASETYPE = int, SFUNC = tf1p, STYPE = anyarray,
!   INITCOND = '{}');
! ERROR:  cannot determine transition data type
! DETAIL:  An aggregate using a polymorphic transition type must have at least one polymorphic argument.
! --    P    P        N    P
! -- should ERROR: tfp(anyarray, int) not matched by tfp(anyarray, anyelement)
! CREATE AGGREGATE myaggp18a(BASETYPE = int, SFUNC = tfp, STYPE = anyarray,
!   FINALFUNC = ffp, INITCOND = '{}');
! ERROR:  cannot determine transition data type
! DETAIL:  An aggregate using a polymorphic transition type must have at least one polymorphic argument.
! CREATE AGGREGATE myaggp18b(BASETYPE = int, SFUNC = tfp, STYPE = anyarray,
!   INITCOND = '{}');
! ERROR:  cannot determine transition data type
! DETAIL:  An aggregate using a polymorphic transition type must have at least one polymorphic argument.
! --    P    P        P    N
! -- should ERROR: tf1p(anyarray, anyelement) not matched by tf1p(anyarray, int)
! CREATE AGGREGATE myaggp19a(BASETYPE = anyelement, SFUNC = tf1p,
!   STYPE = anyarray, FINALFUNC = ffp, INITCOND = '{}');
! ERROR:  function tf1p(anyarray, anyelement) does not exist
! CREATE AGGREGATE myaggp19b(BASETYPE = anyelement, SFUNC = tf1p,
!   STYPE = anyarray, INITCOND = '{}');
! ERROR:  function tf1p(anyarray, anyelement) does not exist
! --    P    P        P    P
! -- should CREATE
! CREATE AGGREGATE myaggp20a(BASETYPE = anyelement, SFUNC = tfp,
!   STYPE = anyarray, FINALFUNC = ffp, INITCOND = '{}');
! CREATE AGGREGATE myaggp20b(BASETYPE = anyelement, SFUNC = tfp,
!   STYPE = anyarray, INITCOND = '{}');
! --     Case3 (R = N) && (B = A)
! --     ------------------------
! --     S    tf1
! --     -------
! --     N    N
! -- should CREATE
! CREATE AGGREGATE myaggn01a(*) (SFUNC = stfnp, STYPE = int4[],
!   FINALFUNC = ffnp, INITCOND = '{}');
! CREATE AGGREGATE myaggn01b(*) (SFUNC = stfnp, STYPE = int4[],
!   INITCOND = '{}');
! --     P    N
! -- should ERROR: stfnp(anyarray) not matched by stfnp(int[])
! CREATE AGGREGATE myaggn02a(*) (SFUNC = stfnp, STYPE = anyarray,
!   FINALFUNC = ffnp, INITCOND = '{}');
! ERROR:  cannot determine transition data type
! DETAIL:  An aggregate using a polymorphic transition type must have at least one polymorphic argument.
! CREATE AGGREGATE myaggn02b(*) (SFUNC = stfnp, STYPE = anyarray,
!   INITCOND = '{}');
! ERROR:  cannot determine transition data type
! DETAIL:  An aggregate using a polymorphic transition type must have at least one polymorphic argument.
! --     N    P
! -- should CREATE
! CREATE AGGREGATE myaggn03a(*) (SFUNC = stfp, STYPE = int4[],
!   FINALFUNC = ffnp, INITCOND = '{}');
! --     P    P
! -- should ERROR: ffnp(anyarray) not matched by ffnp(int[])
! CREATE AGGREGATE myaggn04a(*) (SFUNC = stfp, STYPE = anyarray,
!   FINALFUNC = ffnp, INITCOND = '{}');
! ERROR:  cannot determine transition data type
! DETAIL:  An aggregate using a polymorphic transition type must have at least one polymorphic argument.
! --    Case4 (R = N) && ((B = P) || (B = N))
! --    -------------------------------------
! --    S    tf1      B    tf2
! --    -----------------------
! --    N    N        N    N
! -- should CREATE
! CREATE AGGREGATE myaggn05a(BASETYPE = int, SFUNC = tfnp, STYPE = int[],
!   FINALFUNC = ffnp, INITCOND = '{}');
! CREATE AGGREGATE myaggn05b(BASETYPE = int, SFUNC = tfnp, STYPE = int[],
!   INITCOND = '{}');
! --    N    N        N    P
! -- should CREATE
! CREATE AGGREGATE myaggn06a(BASETYPE = int, SFUNC = tf2p, STYPE = int[],
!   FINALFUNC = ffnp, INITCOND = '{}');
! CREATE AGGREGATE myaggn06b(BASETYPE = int, SFUNC = tf2p, STYPE = int[],
!   INITCOND = '{}');
! --    N    N        P    N
! -- should ERROR: tfnp(int[], anyelement) not matched by tfnp(int[], int)
! CREATE AGGREGATE myaggn07a(BASETYPE = anyelement, SFUNC = tfnp, STYPE = int[],
!   FINALFUNC = ffnp, INITCOND = '{}');
! ERROR:  function tfnp(integer[], anyelement) does not exist
! CREATE AGGREGATE myaggn07b(BASETYPE = anyelement, SFUNC = tfnp, STYPE = int[],
!   INITCOND = '{}');
! ERROR:  function tfnp(integer[], anyelement) does not exist
! --    N    N        P    P
! -- should CREATE
! CREATE AGGREGATE myaggn08a(BASETYPE = anyelement, SFUNC = tf2p, STYPE = int[],
!   FINALFUNC = ffnp, INITCOND = '{}');
! CREATE AGGREGATE myaggn08b(BASETYPE = anyelement, SFUNC = tf2p, STYPE = int[],
!   INITCOND = '{}');
! --    N    P        N    N
! -- should CREATE
! CREATE AGGREGATE myaggn09a(BASETYPE = int, SFUNC = tf1p, STYPE = int[],
!   FINALFUNC = ffnp, INITCOND = '{}');
! --    N    P        N    P
! -- should CREATE
! CREATE AGGREGATE myaggn10a(BASETYPE = int, SFUNC = tfp, STYPE = int[],
!   FINALFUNC = ffnp, INITCOND = '{}');
! --    N    P        P    N
! -- should ERROR: tf1p(int[],anyelement) not matched by tf1p(anyarray,int)
! CREATE AGGREGATE myaggn11a(BASETYPE = anyelement, SFUNC = tf1p, STYPE = int[],
!   FINALFUNC = ffnp, INITCOND = '{}');
! ERROR:  function tf1p(integer[], anyelement) does not exist
! --    N    P        P    P
! -- should ERROR: tfp(int[],anyelement) not matched by tfp(anyarray,anyelement)
! CREATE AGGREGATE myaggn12a(BASETYPE = anyelement, SFUNC = tfp, STYPE = int[],
!   FINALFUNC = ffnp, INITCOND = '{}');
! ERROR:  function tfp(integer[], anyelement) does not exist
! --    P    N        N    N
! -- should ERROR: tfnp(anyarray, int) not matched by tfnp(int[],int)
! CREATE AGGREGATE myaggn13a(BASETYPE = int, SFUNC = tfnp, STYPE = anyarray,
!   FINALFUNC = ffnp, INITCOND = '{}');
! ERROR:  cannot determine transition data type
! DETAIL:  An aggregate using a polymorphic transition type must have at least one polymorphic argument.
! CREATE AGGREGATE myaggn13b(BASETYPE = int, SFUNC = tfnp, STYPE = anyarray,
!   INITCOND = '{}');
! ERROR:  cannot determine transition data type
! DETAIL:  An aggregate using a polymorphic transition type must have at least one polymorphic argument.
! --    P    N        N    P
! -- should ERROR: tf2p(anyarray, int) not matched by tf2p(int[],anyelement)
! CREATE AGGREGATE myaggn14a(BASETYPE = int, SFUNC = tf2p, STYPE = anyarray,
!   FINALFUNC = ffnp, INITCOND = '{}');
! ERROR:  cannot determine transition data type
! DETAIL:  An aggregate using a polymorphic transition type must have at least one polymorphic argument.
! CREATE AGGREGATE myaggn14b(BASETYPE = int, SFUNC = tf2p, STYPE = anyarray,
!   INITCOND = '{}');
! ERROR:  cannot determine transition data type
! DETAIL:  An aggregate using a polymorphic transition type must have at least one polymorphic argument.
! --    P    N        P    N
! -- should ERROR: tfnp(anyarray, anyelement) not matched by tfnp(int[],int)
! CREATE AGGREGATE myaggn15a(BASETYPE = anyelement, SFUNC = tfnp,
!   STYPE = anyarray, FINALFUNC = ffnp, INITCOND = '{}');
! ERROR:  function tfnp(anyarray, anyelement) does not exist
! CREATE AGGREGATE myaggn15b(BASETYPE = anyelement, SFUNC = tfnp,
!   STYPE = anyarray, INITCOND = '{}');
! ERROR:  function tfnp(anyarray, anyelement) does not exist
! --    P    N        P    P
! -- should ERROR: tf2p(anyarray, anyelement) not matched by tf2p(int[],anyelement)
! CREATE AGGREGATE myaggn16a(BASETYPE = anyelement, SFUNC = tf2p,
!   STYPE = anyarray, FINALFUNC = ffnp, INITCOND = '{}');
! ERROR:  function tf2p(anyarray, anyelement) does not exist
! CREATE AGGREGATE myaggn16b(BASETYPE = anyelement, SFUNC = tf2p,
!   STYPE = anyarray, INITCOND = '{}');
! ERROR:  function tf2p(anyarray, anyelement) does not exist
! --    P    P        N    N
! -- should ERROR: ffnp(anyarray) not matched by ffnp(int[])
! CREATE AGGREGATE myaggn17a(BASETYPE = int, SFUNC = tf1p, STYPE = anyarray,
!   FINALFUNC = ffnp, INITCOND = '{}');
! ERROR:  cannot determine transition data type
! DETAIL:  An aggregate using a polymorphic transition type must have at least one polymorphic argument.
! --    P    P        N    P
! -- should ERROR: tfp(anyarray, int) not matched by tfp(anyarray, anyelement)
! CREATE AGGREGATE myaggn18a(BASETYPE = int, SFUNC = tfp, STYPE = anyarray,
!   FINALFUNC = ffnp, INITCOND = '{}');
! ERROR:  cannot determine transition data type
! DETAIL:  An aggregate using a polymorphic transition type must have at least one polymorphic argument.
! --    P    P        P    N
! -- should ERROR: tf1p(anyarray, anyelement) not matched by tf1p(anyarray, int)
! CREATE AGGREGATE myaggn19a(BASETYPE = anyelement, SFUNC = tf1p,
!   STYPE = anyarray, FINALFUNC = ffnp, INITCOND = '{}');
! ERROR:  function tf1p(anyarray, anyelement) does not exist
! --    P    P        P    P
! -- should ERROR: ffnp(anyarray) not matched by ffnp(int[])
! CREATE AGGREGATE myaggn20a(BASETYPE = anyelement, SFUNC = tfp,
!   STYPE = anyarray, FINALFUNC = ffnp, INITCOND = '{}');
! ERROR:  function ffnp(anyarray) does not exist
! -- multi-arg polymorphic
! CREATE AGGREGATE mysum2(anyelement,anyelement) (SFUNC = sum3,
!   STYPE = anyelement, INITCOND = '0');
! -- create test data for polymorphic aggregates
! create temp table t(f1 int, f2 int[], f3 text);
! insert into t values(1,array[1],'a');
! insert into t values(1,array[11],'b');
! insert into t values(1,array[111],'c');
! insert into t values(2,array[2],'a');
! insert into t values(2,array[22],'b');
! insert into t values(2,array[222],'c');
! insert into t values(3,array[3],'a');
! insert into t values(3,array[3],'b');
! -- test the successfully created polymorphic aggregates
! select f3, myaggp01a(*) from t group by f3 order by f3;
!  f3 | myaggp01a 
! ----+-----------
!  a  | {}
!  b  | {}
!  c  | {}
! (3 rows)
! 
! select f3, myaggp03a(*) from t group by f3 order by f3;
!  f3 | myaggp03a 
! ----+-----------
!  a  | {}
!  b  | {}
!  c  | {}
! (3 rows)
! 
! select f3, myaggp03b(*) from t group by f3 order by f3;
!  f3 | myaggp03b 
! ----+-----------
!  a  | {}
!  b  | {}
!  c  | {}
! (3 rows)
! 
! select f3, myaggp05a(f1) from t group by f3 order by f3;
!  f3 | myaggp05a 
! ----+-----------
!  a  | {1,2,3}
!  b  | {1,2,3}
!  c  | {1,2}
! (3 rows)
! 
! select f3, myaggp06a(f1) from t group by f3 order by f3;
!  f3 | myaggp06a 
! ----+-----------
!  a  | {}
!  b  | {}
!  c  | {}
! (3 rows)
! 
! select f3, myaggp08a(f1) from t group by f3 order by f3;
!  f3 | myaggp08a 
! ----+-----------
!  a  | {}
!  b  | {}
!  c  | {}
! (3 rows)
! 
! select f3, myaggp09a(f1) from t group by f3 order by f3;
!  f3 | myaggp09a 
! ----+-----------
!  a  | {}
!  b  | {}
!  c  | {}
! (3 rows)
! 
! select f3, myaggp09b(f1) from t group by f3 order by f3;
!  f3 | myaggp09b 
! ----+-----------
!  a  | {}
!  b  | {}
!  c  | {}
! (3 rows)
! 
! select f3, myaggp10a(f1) from t group by f3 order by f3;
!  f3 | myaggp10a 
! ----+-----------
!  a  | {1,2,3}
!  b  | {1,2,3}
!  c  | {1,2}
! (3 rows)
! 
! select f3, myaggp10b(f1) from t group by f3 order by f3;
!  f3 | myaggp10b 
! ----+-----------
!  a  | {1,2,3}
!  b  | {1,2,3}
!  c  | {1,2}
! (3 rows)
! 
! select f3, myaggp20a(f1) from t group by f3 order by f3;
!  f3 | myaggp20a 
! ----+-----------
!  a  | {1,2,3}
!  b  | {1,2,3}
!  c  | {1,2}
! (3 rows)
! 
! select f3, myaggp20b(f1) from t group by f3 order by f3;
!  f3 | myaggp20b 
! ----+-----------
!  a  | {1,2,3}
!  b  | {1,2,3}
!  c  | {1,2}
! (3 rows)
! 
! select f3, myaggn01a(*) from t group by f3 order by f3;
!  f3 | myaggn01a 
! ----+-----------
!  a  | {}
!  b  | {}
!  c  | {}
! (3 rows)
! 
! select f3, myaggn01b(*) from t group by f3 order by f3;
!  f3 | myaggn01b 
! ----+-----------
!  a  | {}
!  b  | {}
!  c  | {}
! (3 rows)
! 
! select f3, myaggn03a(*) from t group by f3 order by f3;
!  f3 | myaggn03a 
! ----+-----------
!  a  | {}
!  b  | {}
!  c  | {}
! (3 rows)
! 
! select f3, myaggn05a(f1) from t group by f3 order by f3;
!  f3 | myaggn05a 
! ----+-----------
!  a  | {1,2,3}
!  b  | {1,2,3}
!  c  | {1,2}
! (3 rows)
! 
! select f3, myaggn05b(f1) from t group by f3 order by f3;
!  f3 | myaggn05b 
! ----+-----------
!  a  | {1,2,3}
!  b  | {1,2,3}
!  c  | {1,2}
! (3 rows)
! 
! select f3, myaggn06a(f1) from t group by f3 order by f3;
!  f3 | myaggn06a 
! ----+-----------
!  a  | {}
!  b  | {}
!  c  | {}
! (3 rows)
! 
! select f3, myaggn06b(f1) from t group by f3 order by f3;
!  f3 | myaggn06b 
! ----+-----------
!  a  | {}
!  b  | {}
!  c  | {}
! (3 rows)
! 
! select f3, myaggn08a(f1) from t group by f3 order by f3;
!  f3 | myaggn08a 
! ----+-----------
!  a  | {}
!  b  | {}
!  c  | {}
! (3 rows)
! 
! select f3, myaggn08b(f1) from t group by f3 order by f3;
!  f3 | myaggn08b 
! ----+-----------
!  a  | {}
!  b  | {}
!  c  | {}
! (3 rows)
! 
! select f3, myaggn09a(f1) from t group by f3 order by f3;
!  f3 | myaggn09a 
! ----+-----------
!  a  | {}
!  b  | {}
!  c  | {}
! (3 rows)
! 
! select f3, myaggn10a(f1) from t group by f3 order by f3;
!  f3 | myaggn10a 
! ----+-----------
!  a  | {1,2,3}
!  b  | {1,2,3}
!  c  | {1,2}
! (3 rows)
! 
! select mysum2(f1, f1 + 1) from t;
!  mysum2 
! --------
!      38
! (1 row)
! 
! -- test inlining of polymorphic SQL functions
! create function bleat(int) returns int as $$
! begin
!   raise notice 'bleat %', $1;
!   return $1;
! end$$ language plpgsql;
! create function sql_if(bool, anyelement, anyelement) returns anyelement as $$
! select case when $1 then $2 else $3 end $$ language sql;
! -- Note this would fail with integer overflow, never mind wrong bleat() output,
! -- if the CASE expression were not successfully inlined
! select f1, sql_if(f1 > 0, bleat(f1), bleat(f1 + 1)) from int4_tbl;
! NOTICE:  bleat 1
! NOTICE:  bleat 123456
! NOTICE:  bleat -123455
! NOTICE:  bleat 2147483647
! NOTICE:  bleat -2147483646
!      f1      |   sql_if    
! -------------+-------------
!            0 |           1
!       123456 |      123456
!      -123456 |     -123455
!   2147483647 |  2147483647
!  -2147483647 | -2147483646
! (5 rows)
! 
! select q2, sql_if(q2 > 0, q2, q2 + 1) from int8_tbl;
!         q2         |      sql_if       
! -------------------+-------------------
!                456 |               456
!   4567890123456789 |  4567890123456789
!                123 |               123
!   4567890123456789 |  4567890123456789
!  -4567890123456789 | -4567890123456788
! (5 rows)
! 
! -- another sort of polymorphic aggregate
! CREATE AGGREGATE array_cat_accum (anyarray)
! (
!     sfunc = array_cat,
!     stype = anyarray,
!     initcond = '{}'
! );
! SELECT array_cat_accum(i)
! FROM (VALUES (ARRAY[1,2]), (ARRAY[3,4])) as t(i);
!  array_cat_accum 
! -----------------
!  {1,2,3,4}
! (1 row)
! 
! SELECT array_cat_accum(i)
! FROM (VALUES (ARRAY[row(1,2),row(3,4)]), (ARRAY[row(5,6),row(7,8)])) as t(i);
!           array_cat_accum          
! -----------------------------------
!  {"(1,2)","(3,4)","(5,6)","(7,8)"}
! (1 row)
! 
! -- another kind of polymorphic aggregate
! create function add_group(grp anyarray, ad anyelement, size integer)
!   returns anyarray
!   as $$
! begin
!   if grp is null then
!     return array[ad];
!   end if;
!   if array_upper(grp, 1) < size then
!     return grp || ad;
!   end if;
!   return grp;
! end;
! $$
!   language plpgsql immutable;
! create aggregate build_group(anyelement, integer) (
!   SFUNC = add_group,
!   STYPE = anyarray
! );
! select build_group(q1,3) from int8_tbl;
!         build_group         
! ----------------------------
!  {123,123,4567890123456789}
! (1 row)
! 
! -- this should fail because stype isn't compatible with arg
! create aggregate build_group(int8, integer) (
!   SFUNC = add_group,
!   STYPE = int2[]
! );
! ERROR:  function add_group(smallint[], bigint, integer) does not exist
! -- but we can make a non-poly agg from a poly sfunc if types are OK
! create aggregate build_group(int8, integer) (
!   SFUNC = add_group,
!   STYPE = int8[]
! );
! -- check that we can apply functions taking ANYARRAY to pg_stats
! select distinct array_ndims(histogram_bounds) from pg_stats
! where histogram_bounds is not null;
!  array_ndims 
! -------------
!            1
! (1 row)
! 
! -- such functions must protect themselves if varying element type isn't OK
! -- (WHERE clause here is to avoid possibly getting a collation error instead)
! select max(histogram_bounds) from pg_stats where tablename = 'pg_am';
! ERROR:  cannot compare arrays of different element types
! -- test variadic polymorphic functions
! create function myleast(variadic anyarray) returns anyelement as $$
!   select min($1[i]) from generate_subscripts($1,1) g(i)
! $$ language sql immutable strict;
! select myleast(10, 1, 20, 33);
!  myleast 
! ---------
!        1
! (1 row)
! 
! select myleast(1.1, 0.22, 0.55);
!  myleast 
! ---------
!     0.22
! (1 row)
! 
! select myleast('z'::text);
!  myleast 
! ---------
!  z
! (1 row)
! 
! select myleast(); -- fail
! ERROR:  function myleast() does not exist
! LINE 1: select myleast();
!                ^
! HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
! -- test with variadic call parameter
! select myleast(variadic array[1,2,3,4,-1]);
!  myleast 
! ---------
!       -1
! (1 row)
! 
! select myleast(variadic array[1.1, -5.5]);
!  myleast 
! ---------
!     -5.5
! (1 row)
! 
! --test with empty variadic call parameter
! select myleast(variadic array[]::int[]);
!  myleast 
! ---------
!         
! (1 row)
! 
! -- an example with some ordinary arguments too
! create function concat(text, variadic anyarray) returns text as $$
!   select array_to_string($2, $1);
! $$ language sql immutable strict;
! select concat('%', 1, 2, 3, 4, 5);
!   concat   
! -----------
!  1%2%3%4%5
! (1 row)
! 
! select concat('|', 'a'::text, 'b', 'c');
!  concat 
! --------
!  a|b|c
! (1 row)
! 
! select concat('|', variadic array[1,2,33]);
!  concat 
! --------
!  1|2|33
! (1 row)
! 
! select concat('|', variadic array[]::int[]);
!  concat 
! --------
!  
! (1 row)
! 
! drop function concat(text, anyarray);
! -- mix variadic with anyelement
! create function formarray(anyelement, variadic anyarray) returns anyarray as $$
!   select array_prepend($1, $2);
! $$ language sql immutable strict;
! select formarray(1,2,3,4,5);
!   formarray  
! -------------
!  {1,2,3,4,5}
! (1 row)
! 
! select formarray(1.1, variadic array[1.2,55.5]);
!    formarray    
! ----------------
!  {1.1,1.2,55.5}
! (1 row)
! 
! select formarray(1.1, array[1.2,55.5]); -- fail without variadic
! ERROR:  function formarray(numeric, numeric[]) does not exist
! LINE 1: select formarray(1.1, array[1.2,55.5]);
!                ^
! HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
! select formarray(1, 'x'::text); -- fail, type mismatch
! ERROR:  function formarray(integer, text) does not exist
! LINE 1: select formarray(1, 'x'::text);
!                ^
! HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
! select formarray(1, variadic array['x'::text]); -- fail, type mismatch
! ERROR:  function formarray(integer, text[]) does not exist
! LINE 1: select formarray(1, variadic array['x'::text]);
!                ^
! HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
! drop function formarray(anyelement, variadic anyarray);
! -- test pg_typeof() function
! select pg_typeof(null);           -- unknown
!  pg_typeof 
! -----------
!  unknown
! (1 row)
! 
! select pg_typeof(0);              -- integer
!  pg_typeof 
! -----------
!  integer
! (1 row)
! 
! select pg_typeof(0.0);            -- numeric
!  pg_typeof 
! -----------
!  numeric
! (1 row)
! 
! select pg_typeof(1+1 = 2);        -- boolean
!  pg_typeof 
! -----------
!  boolean
! (1 row)
! 
! select pg_typeof('x');            -- unknown
!  pg_typeof 
! -----------
!  unknown
! (1 row)
! 
! select pg_typeof('' || '');       -- text
!  pg_typeof 
! -----------
!  text
! (1 row)
! 
! select pg_typeof(pg_typeof(0));   -- regtype
!  pg_typeof 
! -----------
!  regtype
! (1 row)
! 
! select pg_typeof(array[1.2,55.5]); -- numeric[]
!  pg_typeof 
! -----------
!  numeric[]
! (1 row)
! 
! select pg_typeof(myleast(10, 1, 20, 33));  -- polymorphic input
!  pg_typeof 
! -----------
!  integer
! (1 row)
! 
! -- test functions with default parameters
! -- test basic functionality
! create function dfunc(a int = 1, int = 2) returns int as $$
!   select $1 + $2;
! $$ language sql;
! select dfunc();
!  dfunc 
! -------
!      3
! (1 row)
! 
! select dfunc(10);
!  dfunc 
! -------
!     12
! (1 row)
! 
! select dfunc(10, 20);
!  dfunc 
! -------
!     30
! (1 row)
! 
! select dfunc(10, 20, 30);  -- fail
! ERROR:  function dfunc(integer, integer, integer) does not exist
! LINE 1: select dfunc(10, 20, 30);
!                ^
! HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
! drop function dfunc();  -- fail
! ERROR:  function dfunc() does not exist
! drop function dfunc(int);  -- fail
! ERROR:  function dfunc(integer) does not exist
! drop function dfunc(int, int);  -- ok
! -- fail: defaults must be at end of argument list
! create function dfunc(a int = 1, b int) returns int as $$
!   select $1 + $2;
! $$ language sql;
! ERROR:  input parameters after one with a default value must also have defaults
! -- however, this should work:
! create function dfunc(a int = 1, out sum int, b int = 2) as $$
!   select $1 + $2;
! $$ language sql;
! select dfunc();
!  dfunc 
! -------
!      3
! (1 row)
! 
! -- verify it lists properly
! \df dfunc
!                                            List of functions
!  Schema | Name  | Result data type |                    Argument data types                    |  Type  
! --------+-------+------------------+-----------------------------------------------------------+--------
!  public | dfunc | integer          | a integer DEFAULT 1, OUT sum integer, b integer DEFAULT 2 | normal
! (1 row)
! 
! drop function dfunc(int, int);
! -- check implicit coercion
! create function dfunc(a int DEFAULT 1.0, int DEFAULT '-1') returns int as $$
!   select $1 + $2;
! $$ language sql;
! select dfunc();
!  dfunc 
! -------
!      0
! (1 row)
! 
! create function dfunc(a text DEFAULT 'Hello', b text DEFAULT 'World') returns text as $$
!   select $1 || ', ' || $2;
! $$ language sql;
! select dfunc();  -- fail: which dfunc should be called? int or text
! ERROR:  function dfunc() is not unique
! LINE 1: select dfunc();
!                ^
! HINT:  Could not choose a best candidate function. You might need to add explicit type casts.
! select dfunc('Hi');  -- ok
!    dfunc   
! -----------
!  Hi, World
! (1 row)
! 
! select dfunc('Hi', 'City');  -- ok
!   dfunc   
! ----------
!  Hi, City
! (1 row)
! 
! select dfunc(0);  -- ok
!  dfunc 
! -------
!     -1
! (1 row)
! 
! select dfunc(10, 20);  -- ok
!  dfunc 
! -------
!     30
! (1 row)
! 
! drop function dfunc(int, int);
! drop function dfunc(text, text);
! create function dfunc(int = 1, int = 2) returns int as $$
!   select 2;
! $$ language sql;
! create function dfunc(int = 1, int = 2, int = 3, int = 4) returns int as $$
!   select 4;
! $$ language sql;
! -- Now, dfunc(nargs = 2) and dfunc(nargs = 4) are ambiguous when called
! -- with 0 to 2 arguments.
! select dfunc();  -- fail
! ERROR:  function dfunc() is not unique
! LINE 1: select dfunc();
!                ^
! HINT:  Could not choose a best candidate function. You might need to add explicit type casts.
! select dfunc(1);  -- fail
! ERROR:  function dfunc(integer) is not unique
! LINE 1: select dfunc(1);
!                ^
! HINT:  Could not choose a best candidate function. You might need to add explicit type casts.
! select dfunc(1, 2);  -- fail
! ERROR:  function dfunc(integer, integer) is not unique
! LINE 1: select dfunc(1, 2);
!                ^
! HINT:  Could not choose a best candidate function. You might need to add explicit type casts.
! select dfunc(1, 2, 3);  -- ok
!  dfunc 
! -------
!      4
! (1 row)
! 
! select dfunc(1, 2, 3, 4);  -- ok
!  dfunc 
! -------
!      4
! (1 row)
! 
! drop function dfunc(int, int);
! drop function dfunc(int, int, int, int);
! -- default values are not allowed for output parameters
! create function dfunc(out int = 20) returns int as $$
!   select 1;
! $$ language sql;
! ERROR:  only input parameters can have default values
! -- polymorphic parameter test
! create function dfunc(anyelement = 'World'::text) returns text as $$
!   select 'Hello, ' || $1::text;
! $$ language sql;
! select dfunc();
!     dfunc     
! --------------
!  Hello, World
! (1 row)
! 
! select dfunc(0);
!   dfunc   
! ----------
!  Hello, 0
! (1 row)
! 
! select dfunc(to_date('20081215','YYYYMMDD'));
!        dfunc       
! -------------------
!  Hello, 12-15-2008
! (1 row)
! 
! select dfunc('City'::text);
!     dfunc    
! -------------
!  Hello, City
! (1 row)
! 
! drop function dfunc(anyelement);
! -- check defaults for variadics
! create function dfunc(a variadic int[]) returns int as
! $$ select array_upper($1, 1) $$ language sql;
! select dfunc();  -- fail
! ERROR:  function dfunc() does not exist
! LINE 1: select dfunc();
!                ^
! HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
! select dfunc(10);
!  dfunc 
! -------
!      1
! (1 row)
! 
! select dfunc(10,20);
!  dfunc 
! -------
!      2
! (1 row)
! 
! create or replace function dfunc(a variadic int[] default array[]::int[]) returns int as
! $$ select array_upper($1, 1) $$ language sql;
! select dfunc();  -- now ok
!  dfunc 
! -------
!       
! (1 row)
! 
! select dfunc(10);
!  dfunc 
! -------
!      1
! (1 row)
! 
! select dfunc(10,20);
!  dfunc 
! -------
!      2
! (1 row)
! 
! -- can't remove the default once it exists
! create or replace function dfunc(a variadic int[]) returns int as
! $$ select array_upper($1, 1) $$ language sql;
! ERROR:  cannot remove parameter defaults from existing function
! HINT:  Use DROP FUNCTION dfunc(integer[]) first.
! \df dfunc
!                                       List of functions
!  Schema | Name  | Result data type |               Argument data types               |  Type  
! --------+-------+------------------+-------------------------------------------------+--------
!  public | dfunc | integer          | VARIADIC a integer[] DEFAULT ARRAY[]::integer[] | normal
! (1 row)
! 
! drop function dfunc(a variadic int[]);
! -- Ambiguity should be reported only if there's not a better match available
! create function dfunc(int = 1, int = 2, int = 3) returns int as $$
!   select 3;
! $$ language sql;
! create function dfunc(int = 1, int = 2) returns int as $$
!   select 2;
! $$ language sql;
! create function dfunc(text) returns text as $$
!   select $1;
! $$ language sql;
! -- dfunc(narg=2) and dfunc(narg=3) are ambiguous
! select dfunc(1);  -- fail
! ERROR:  function dfunc(integer) is not unique
! LINE 1: select dfunc(1);
!                ^
! HINT:  Could not choose a best candidate function. You might need to add explicit type casts.
! -- but this works since the ambiguous functions aren't preferred anyway
! select dfunc('Hi');
!  dfunc 
! -------
!  Hi
! (1 row)
! 
! drop function dfunc(int, int, int);
! drop function dfunc(int, int);
! drop function dfunc(text);
! --
! -- Tests for named- and mixed-notation function calling
! --
! create function dfunc(a int, b int, c int = 0, d int = 0)
!   returns table (a int, b int, c int, d int) as $$
!   select $1, $2, $3, $4;
! $$ language sql;
! select (dfunc(10,20,30)).*;
!  a  | b  | c  | d 
! ----+----+----+---
!  10 | 20 | 30 | 0
! (1 row)
! 
! select (dfunc(a := 10, b := 20, c := 30)).*;
!  a  | b  | c  | d 
! ----+----+----+---
!  10 | 20 | 30 | 0
! (1 row)
! 
! select * from dfunc(a := 10, b := 20);
!  a  | b  | c | d 
! ----+----+---+---
!  10 | 20 | 0 | 0
! (1 row)
! 
! select * from dfunc(b := 10, a := 20);
!  a  | b  | c | d 
! ----+----+---+---
!  20 | 10 | 0 | 0
! (1 row)
! 
! select * from dfunc(0);  -- fail
! ERROR:  function dfunc(integer) does not exist
! LINE 1: select * from dfunc(0);
!                       ^
! HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
! select * from dfunc(1,2);
!  a | b | c | d 
! ---+---+---+---
!  1 | 2 | 0 | 0
! (1 row)
! 
! select * from dfunc(1,2,c := 3);
!  a | b | c | d 
! ---+---+---+---
!  1 | 2 | 3 | 0
! (1 row)
! 
! select * from dfunc(1,2,d := 3);
!  a | b | c | d 
! ---+---+---+---
!  1 | 2 | 0 | 3
! (1 row)
! 
! select * from dfunc(x := 20, b := 10, x := 30);  -- fail, duplicate name
! ERROR:  argument name "x" used more than once
! LINE 1: select * from dfunc(x := 20, b := 10, x := 30);
!                                               ^
! select * from dfunc(10, b := 20, 30);  -- fail, named args must be last
! ERROR:  positional argument cannot follow named argument
! LINE 1: select * from dfunc(10, b := 20, 30);
!                                          ^
! select * from dfunc(x := 10, b := 20, c := 30);  -- fail, unknown param
! ERROR:  function dfunc(x => integer, b => integer, c => integer) does not exist
! LINE 1: select * from dfunc(x := 10, b := 20, c := 30);
!                       ^
! HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
! select * from dfunc(10, 10, a := 20);  -- fail, a overlaps positional parameter
! ERROR:  function dfunc(integer, integer, a => integer) does not exist
! LINE 1: select * from dfunc(10, 10, a := 20);
!                       ^
! HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
! select * from dfunc(1,c := 2,d := 3); -- fail, no value for b
! ERROR:  function dfunc(integer, c => integer, d => integer) does not exist
! LINE 1: select * from dfunc(1,c := 2,d := 3);
!                       ^
! HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
! drop function dfunc(int, int, int, int);
! -- test with different parameter types
! create function dfunc(a varchar, b numeric, c date = current_date)
!   returns table (a varchar, b numeric, c date) as $$
!   select $1, $2, $3;
! $$ language sql;
! select (dfunc('Hello World', 20, '2009-07-25'::date)).*;
!       a      | b  |     c      
! -------------+----+------------
!  Hello World | 20 | 07-25-2009
! (1 row)
! 
! select * from dfunc('Hello World', 20, '2009-07-25'::date);
!       a      | b  |     c      
! -------------+----+------------
!  Hello World | 20 | 07-25-2009
! (1 row)
! 
! select * from dfunc(c := '2009-07-25'::date, a := 'Hello World', b := 20);
!       a      | b  |     c      
! -------------+----+------------
!  Hello World | 20 | 07-25-2009
! (1 row)
! 
! select * from dfunc('Hello World', b := 20, c := '2009-07-25'::date);
!       a      | b  |     c      
! -------------+----+------------
!  Hello World | 20 | 07-25-2009
! (1 row)
! 
! select * from dfunc('Hello World', c := '2009-07-25'::date, b := 20);
!       a      | b  |     c      
! -------------+----+------------
!  Hello World | 20 | 07-25-2009
! (1 row)
! 
! select * from dfunc('Hello World', c := 20, b := '2009-07-25'::date);  -- fail
! ERROR:  function dfunc(unknown, c => integer, b => date) does not exist
! LINE 1: select * from dfunc('Hello World', c := 20, b := '2009-07-25...
!                       ^
! HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
! drop function dfunc(varchar, numeric, date);
! -- test out parameters with named params
! create function dfunc(a varchar = 'def a', out _a varchar, c numeric = NULL, out _c numeric)
! returns record as $$
!   select $1, $2;
! $$ language sql;
! select (dfunc()).*;
!   _a   | _c 
! -------+----
!  def a |   
! (1 row)
! 
! select * from dfunc();
!   _a   | _c 
! -------+----
!  def a |   
! (1 row)
! 
! select * from dfunc('Hello', 100);
!   _a   | _c  
! -------+-----
!  Hello | 100
! (1 row)
! 
! select * from dfunc(a := 'Hello', c := 100);
!   _a   | _c  
! -------+-----
!  Hello | 100
! (1 row)
! 
! select * from dfunc(c := 100, a := 'Hello');
!   _a   | _c  
! -------+-----
!  Hello | 100
! (1 row)
! 
! select * from dfunc('Hello');
!   _a   | _c 
! -------+----
!  Hello |   
! (1 row)
! 
! select * from dfunc('Hello', c := 100);
!   _a   | _c  
! -------+-----
!  Hello | 100
! (1 row)
! 
! select * from dfunc(c := 100);
!   _a   | _c  
! -------+-----
!  def a | 100
! (1 row)
! 
! -- fail, can no longer change an input parameter's name
! create or replace function dfunc(a varchar = 'def a', out _a varchar, x numeric = NULL, out _c numeric)
! returns record as $$
!   select $1, $2;
! $$ language sql;
! ERROR:  cannot change name of input parameter "c"
! HINT:  Use DROP FUNCTION dfunc(character varying,numeric) first.
! create or replace function dfunc(a varchar = 'def a', out _a varchar, numeric = NULL, out _c numeric)
! returns record as $$
!   select $1, $2;
! $$ language sql;
! ERROR:  cannot change name of input parameter "c"
! HINT:  Use DROP FUNCTION dfunc(character varying,numeric) first.
! drop function dfunc(varchar, numeric);
! --fail, named parameters are not unique
! create function testfoo(a int, a int) returns int as $$ select 1;$$ language sql;
! ERROR:  parameter name "a" used more than once
! create function testfoo(int, out a int, out a int) returns int as $$ select 1;$$ language sql;
! ERROR:  parameter name "a" used more than once
! create function testfoo(out a int, inout a int) returns int as $$ select 1;$$ language sql;
! ERROR:  parameter name "a" used more than once
! create function testfoo(a int, inout a int) returns int as $$ select 1;$$ language sql;
! ERROR:  parameter name "a" used more than once
! -- valid
! create function testfoo(a int, out a int) returns int as $$ select $1;$$ language sql;
! select testfoo(37);
!  testfoo 
! ---------
!       37
! (1 row)
! 
! drop function testfoo(int);
! create function testfoo(a int) returns table(a int) as $$ select $1;$$ language sql;
! select * from testfoo(37);
!  a  
! ----
!  37
! (1 row)
! 
! drop function testfoo(int);
! -- test polymorphic params and defaults
! create function dfunc(a anyelement, b anyelement = null, flag bool = true)
! returns anyelement as $$
!   select case when $3 then $1 else $2 end;
! $$ language sql;
! select dfunc(1,2);
!  dfunc 
! -------
!      1
! (1 row)
! 
! select dfunc('a'::text, 'b'); -- positional notation with default
!  dfunc 
! -------
!  a
! (1 row)
! 
! select dfunc(a := 1, b := 2);
!  dfunc 
! -------
!      1
! (1 row)
! 
! select dfunc(a := 'a'::text, b := 'b');
!  dfunc 
! -------
!  a
! (1 row)
! 
! select dfunc(a := 'a'::text, b := 'b', flag := false); -- named notation
!  dfunc 
! -------
!  b
! (1 row)
! 
! select dfunc(b := 'b'::text, a := 'a'); -- named notation with default
!  dfunc 
! -------
!  a
! (1 row)
! 
! select dfunc(a := 'a'::text, flag := true); -- named notation with default
!  dfunc 
! -------
!  a
! (1 row)
! 
! select dfunc(a := 'a'::text, flag := false); -- named notation with default
!  dfunc 
! -------
!  
! (1 row)
! 
! select dfunc(b := 'b'::text, a := 'a', flag := true); -- named notation
!  dfunc 
! -------
!  a
! (1 row)
! 
! select dfunc('a'::text, 'b', false); -- full positional notation
!  dfunc 
! -------
!  b
! (1 row)
! 
! select dfunc('a'::text, 'b', flag := false); -- mixed notation
!  dfunc 
! -------
!  b
! (1 row)
! 
! select dfunc('a'::text, 'b', true); -- full positional notation
!  dfunc 
! -------
!  a
! (1 row)
! 
! select dfunc('a'::text, 'b', flag := true); -- mixed notation
!  dfunc 
! -------
!  a
! (1 row)
! 
! -- ansi/sql syntax
! select dfunc(a => 1, b => 2);
!  dfunc 
! -------
!      1
! (1 row)
! 
! select dfunc(a => 'a'::text, b => 'b');
!  dfunc 
! -------
!  a
! (1 row)
! 
! select dfunc(a => 'a'::text, b => 'b', flag => false); -- named notation
!  dfunc 
! -------
!  b
! (1 row)
! 
! select dfunc(b => 'b'::text, a => 'a'); -- named notation with default
!  dfunc 
! -------
!  a
! (1 row)
! 
! select dfunc(a => 'a'::text, flag => true); -- named notation with default
!  dfunc 
! -------
!  a
! (1 row)
! 
! select dfunc(a => 'a'::text, flag => false); -- named notation with default
!  dfunc 
! -------
!  
! (1 row)
! 
! select dfunc(b => 'b'::text, a => 'a', flag => true); -- named notation
!  dfunc 
! -------
!  a
! (1 row)
! 
! select dfunc('a'::text, 'b', false); -- full positional notation
!  dfunc 
! -------
!  b
! (1 row)
! 
! select dfunc('a'::text, 'b', flag => false); -- mixed notation
!  dfunc 
! -------
!  b
! (1 row)
! 
! select dfunc('a'::text, 'b', true); -- full positional notation
!  dfunc 
! -------
!  a
! (1 row)
! 
! select dfunc('a'::text, 'b', flag => true); -- mixed notation
!  dfunc 
! -------
!  a
! (1 row)
! 
! -- check reverse-listing of named-arg calls
! CREATE VIEW dfview AS
!    SELECT q1, q2,
!      dfunc(q1,q2, flag := q1>q2) as c3,
!      dfunc(q1, flag := q1<q2, b := q2) as c4
!      FROM int8_tbl;
! select * from dfview;
!         q1        |        q2         |        c3        |        c4         
! ------------------+-------------------+------------------+-------------------
!               123 |               456 |              456 |               123
!               123 |  4567890123456789 | 4567890123456789 |               123
!  4567890123456789 |               123 | 4567890123456789 |               123
!  4567890123456789 |  4567890123456789 | 4567890123456789 |  4567890123456789
!  4567890123456789 | -4567890123456789 | 4567890123456789 | -4567890123456789
! (5 rows)
! 
! \d+ dfview
!                 View "public.dfview"
!  Column |  Type  | Modifiers | Storage | Description 
! --------+--------+-----------+---------+-------------
!  q1     | bigint |           | plain   | 
!  q2     | bigint |           | plain   | 
!  c3     | bigint |           | plain   | 
!  c4     | bigint |           | plain   | 
! View definition:
!  SELECT int8_tbl.q1,
!     int8_tbl.q2,
!     dfunc(int8_tbl.q1, int8_tbl.q2, flag => int8_tbl.q1 > int8_tbl.q2) AS c3,
!     dfunc(int8_tbl.q1, flag => int8_tbl.q1 < int8_tbl.q2, b => int8_tbl.q2) AS c4
!    FROM int8_tbl;
! 
! drop view dfview;
! drop function dfunc(anyelement, anyelement, bool);
--- 1 ----
! psql: FATAL:  the database system is in recovery mode

======================================================================

*** /home/claudiofreire/src/postgresql.work/src/test/regress/expected/rowtypes.out	2016-09-05 20:45:48.972033299 -0300
--- /home/claudiofreire/src/postgresql.work/src/test/regress/results/rowtypes.out	2016-09-12 12:14:51.887413916 -0300
***************
*** 1,713 ****
! --
! -- ROWTYPES
! --
! -- Make both a standalone composite type and a table rowtype
! create type complex as (r float8, i float8);
! create temp table fullname (first text, last text);
! -- Nested composite
! create type quad as (c1 complex, c2 complex);
! -- Some simple tests of I/O conversions and row construction
! select (1.1,2.2)::complex, row((3.3,4.4),(5.5,null))::quad;
!     row    |          row           
! -----------+------------------------
!  (1.1,2.2) | ("(3.3,4.4)","(5.5,)")
! (1 row)
! 
! select row('Joe', 'Blow')::fullname, '(Joe,Blow)'::fullname;
!     row     |  fullname  
! ------------+------------
!  (Joe,Blow) | (Joe,Blow)
! (1 row)
! 
! select '(Joe,von Blow)'::fullname, '(Joe,d''Blow)'::fullname;
!      fullname     |   fullname   
! ------------------+--------------
!  (Joe,"von Blow") | (Joe,d'Blow)
! (1 row)
! 
! select '(Joe,"von""Blow")'::fullname, E'(Joe,d\\\\Blow)'::fullname;
!      fullname      |    fullname     
! -------------------+-----------------
!  (Joe,"von""Blow") | (Joe,"d\\Blow")
! (1 row)
! 
! select '(Joe,"Blow,Jr")'::fullname;
!     fullname     
! -----------------
!  (Joe,"Blow,Jr")
! (1 row)
! 
! select '(Joe,)'::fullname;	-- ok, null 2nd column
!  fullname 
! ----------
!  (Joe,)
! (1 row)
! 
! select '(Joe)'::fullname;	-- bad
! ERROR:  malformed record literal: "(Joe)"
! LINE 1: select '(Joe)'::fullname;
!                ^
! DETAIL:  Too few columns.
! select '(Joe,,)'::fullname;	-- bad
! ERROR:  malformed record literal: "(Joe,,)"
! LINE 1: select '(Joe,,)'::fullname;
!                ^
! DETAIL:  Too many columns.
! create temp table quadtable(f1 int, q quad);
! insert into quadtable values (1, ((3.3,4.4),(5.5,6.6)));
! insert into quadtable values (2, ((null,4.4),(5.5,6.6)));
! select * from quadtable;
!  f1 |             q             
! ----+---------------------------
!   1 | ("(3.3,4.4)","(5.5,6.6)")
!   2 | ("(,4.4)","(5.5,6.6)")
! (2 rows)
! 
! select f1, q.c1 from quadtable;		-- fails, q is a table reference
! ERROR:  missing FROM-clause entry for table "q"
! LINE 1: select f1, q.c1 from quadtable;
!                    ^
! select f1, (q).c1, (qq.q).c1.i from quadtable qq;
!  f1 |    c1     |  i  
! ----+-----------+-----
!   1 | (3.3,4.4) | 4.4
!   2 | (,4.4)    | 4.4
! (2 rows)
! 
! create temp table people (fn fullname, bd date);
! insert into people values ('(Joe,Blow)', '1984-01-10');
! select * from people;
!      fn     |     bd     
! ------------+------------
!  (Joe,Blow) | 01-10-1984
! (1 row)
! 
! -- at the moment this will not work due to ALTER TABLE inadequacy:
! alter table fullname add column suffix text default '';
! ERROR:  cannot alter table "fullname" because column "people.fn" uses its row type
! -- but this should work:
! alter table fullname add column suffix text default null;
! select * from people;
!      fn      |     bd     
! -------------+------------
!  (Joe,Blow,) | 01-10-1984
! (1 row)
! 
! -- test insertion/updating of subfields
! update people set fn.suffix = 'Jr';
! select * from people;
!       fn       |     bd     
! ---------------+------------
!  (Joe,Blow,Jr) | 01-10-1984
! (1 row)
! 
! insert into quadtable (f1, q.c1.r, q.c2.i) values(44,55,66);
! select * from quadtable;
!  f1 |             q             
! ----+---------------------------
!   1 | ("(3.3,4.4)","(5.5,6.6)")
!   2 | ("(,4.4)","(5.5,6.6)")
!  44 | ("(55,)","(,66)")
! (3 rows)
! 
! -- The object here is to ensure that toasted references inside
! -- composite values don't cause problems.  The large f1 value will
! -- be toasted inside pp, it must still work after being copied to people.
! create temp table pp (f1 text);
! insert into pp values (repeat('abcdefghijkl', 100000));
! insert into people select ('Jim', f1, null)::fullname, current_date from pp;
! select (fn).first, substr((fn).last, 1, 20), length((fn).last) from people;
!  first |        substr        | length  
! -------+----------------------+---------
!  Joe   | Blow                 |       4
!  Jim   | abcdefghijklabcdefgh | 1200000
! (2 rows)
! 
! -- Test row comparison semantics.  Prior to PG 8.2 we did this in a totally
! -- non-spec-compliant way.
! select ROW(1,2) < ROW(1,3) as true;
!  true 
! ------
!  t
! (1 row)
! 
! select ROW(1,2) < ROW(1,1) as false;
!  false 
! -------
!  f
! (1 row)
! 
! select ROW(1,2) < ROW(1,NULL) as null;
!  null 
! ------
!  
! (1 row)
! 
! select ROW(1,2,3) < ROW(1,3,NULL) as true; -- the NULL is not examined
!  true 
! ------
!  t
! (1 row)
! 
! select ROW(11,'ABC') < ROW(11,'DEF') as true;
!  true 
! ------
!  t
! (1 row)
! 
! select ROW(11,'ABC') > ROW(11,'DEF') as false;
!  false 
! -------
!  f
! (1 row)
! 
! select ROW(12,'ABC') > ROW(11,'DEF') as true;
!  true 
! ------
!  t
! (1 row)
! 
! -- = and <> have different NULL-behavior than < etc
! select ROW(1,2,3) < ROW(1,NULL,4) as null;
!  null 
! ------
!  
! (1 row)
! 
! select ROW(1,2,3) = ROW(1,NULL,4) as false;
!  false 
! -------
!  f
! (1 row)
! 
! select ROW(1,2,3) <> ROW(1,NULL,4) as true;
!  true 
! ------
!  t
! (1 row)
! 
! -- We allow operators beyond the six standard ones, if they have btree
! -- operator classes.
! select ROW('ABC','DEF') ~<=~ ROW('DEF','ABC') as true;
!  true 
! ------
!  t
! (1 row)
! 
! select ROW('ABC','DEF') ~>=~ ROW('DEF','ABC') as false;
!  false 
! -------
!  f
! (1 row)
! 
! select ROW('ABC','DEF') ~~ ROW('DEF','ABC') as fail;
! ERROR:  could not determine interpretation of row comparison operator ~~
! LINE 1: select ROW('ABC','DEF') ~~ ROW('DEF','ABC') as fail;
!                                 ^
! HINT:  Row comparison operators must be associated with btree operator families.
! -- Comparisons of ROW() expressions can cope with some type mismatches
! select ROW(1,2) = ROW(1,2::int8);
!  ?column? 
! ----------
!  t
! (1 row)
! 
! select ROW(1,2) in (ROW(3,4), ROW(1,2));
!  ?column? 
! ----------
!  t
! (1 row)
! 
! select ROW(1,2) in (ROW(3,4), ROW(1,2::int8));
!  ?column? 
! ----------
!  t
! (1 row)
! 
! -- Check row comparison with a subselect
! select unique1, unique2 from tenk1
! where (unique1, unique2) < any (select ten, ten from tenk1 where hundred < 3)
!       and unique1 <= 20
! order by 1;
!  unique1 | unique2 
! ---------+---------
!        0 |    9998
!        1 |    2838
! (2 rows)
! 
! -- Also check row comparison with an indexable condition
! explain (costs off)
! select thousand, tenthous from tenk1
! where (thousand, tenthous) >= (997, 5000)
! order by thousand, tenthous;
!                         QUERY PLAN                         
! -----------------------------------------------------------
!  Index Only Scan using tenk1_thous_tenthous on tenk1
!    Index Cond: (ROW(thousand, tenthous) >= ROW(997, 5000))
! (2 rows)
! 
! select thousand, tenthous from tenk1
! where (thousand, tenthous) >= (997, 5000)
! order by thousand, tenthous;
!  thousand | tenthous 
! ----------+----------
!       997 |     5997
!       997 |     6997
!       997 |     7997
!       997 |     8997
!       997 |     9997
!       998 |      998
!       998 |     1998
!       998 |     2998
!       998 |     3998
!       998 |     4998
!       998 |     5998
!       998 |     6998
!       998 |     7998
!       998 |     8998
!       998 |     9998
!       999 |      999
!       999 |     1999
!       999 |     2999
!       999 |     3999
!       999 |     4999
!       999 |     5999
!       999 |     6999
!       999 |     7999
!       999 |     8999
!       999 |     9999
! (25 rows)
! 
! -- Test case for bug #14010: indexed row comparisons fail with nulls
! create temp table test_table (a text, b text);
! insert into test_table values ('a', 'b');
! insert into test_table select 'a', null from generate_series(1,1000);
! insert into test_table values ('b', 'a');
! create index on test_table (a,b);
! set enable_sort = off;
! explain (costs off)
! select a,b from test_table where (a,b) > ('a','a') order by a,b;
!                        QUERY PLAN                       
! --------------------------------------------------------
!  Index Only Scan using test_table_a_b_idx on test_table
!    Index Cond: (ROW(a, b) > ROW('a'::text, 'a'::text))
! (2 rows)
! 
! select a,b from test_table where (a,b) > ('a','a') order by a,b;
!  a | b 
! ---+---
!  a | b
!  b | a
! (2 rows)
! 
! reset enable_sort;
! -- Check row comparisons with IN
! select * from int8_tbl i8 where i8 in (row(123,456));  -- fail, type mismatch
! ERROR:  cannot compare dissimilar column types bigint and integer at record column 1
! explain (costs off)
! select * from int8_tbl i8
! where i8 in (row(123,456)::int8_tbl, '(4567890123456789,123)');
!                                                    QUERY PLAN                                                    
! -----------------------------------------------------------------------------------------------------------------
!  Seq Scan on int8_tbl i8
!    Filter: (i8.* = ANY (ARRAY[ROW('123'::bigint, '456'::bigint)::int8_tbl, '(4567890123456789,123)'::int8_tbl]))
! (2 rows)
! 
! select * from int8_tbl i8
! where i8 in (row(123,456)::int8_tbl, '(4567890123456789,123)');
!         q1        | q2  
! ------------------+-----
!               123 | 456
!  4567890123456789 | 123
! (2 rows)
! 
! -- Check some corner cases involving empty rowtypes
! select ROW();
!  row 
! -----
!  ()
! (1 row)
! 
! select ROW() IS NULL;
!  ?column? 
! ----------
!  t
! (1 row)
! 
! select ROW() = ROW();
! ERROR:  cannot compare rows of zero length
! LINE 1: select ROW() = ROW();
!                      ^
! -- Check ability to create arrays of anonymous rowtypes
! select array[ row(1,2), row(3,4), row(5,6) ];
!            array           
! ---------------------------
!  {"(1,2)","(3,4)","(5,6)"}
! (1 row)
! 
! -- Check ability to compare an anonymous row to elements of an array
! select row(1,1.1) = any (array[ row(7,7.7), row(1,1.1), row(0,0.0) ]);
!  ?column? 
! ----------
!  t
! (1 row)
! 
! select row(1,1.1) = any (array[ row(7,7.7), row(1,1.0), row(0,0.0) ]);
!  ?column? 
! ----------
!  f
! (1 row)
! 
! -- Check behavior with a non-comparable rowtype
! create type cantcompare as (p point, r float8);
! create temp table cc (f1 cantcompare);
! insert into cc values('("(1,2)",3)');
! insert into cc values('("(4,5)",6)');
! select * from cc order by f1; -- fail, but should complain about cantcompare
! ERROR:  could not identify an ordering operator for type cantcompare
! LINE 1: select * from cc order by f1;
!                                   ^
! HINT:  Use an explicit ordering operator or modify the query.
! --
! -- Test case derived from bug #5716: check multiple uses of a rowtype result
! --
! BEGIN;
! CREATE TABLE price (
!     id SERIAL PRIMARY KEY,
!     active BOOLEAN NOT NULL,
!     price NUMERIC
! );
! CREATE TYPE price_input AS (
!     id INTEGER,
!     price NUMERIC
! );
! CREATE TYPE price_key AS (
!     id INTEGER
! );
! CREATE FUNCTION price_key_from_table(price) RETURNS price_key AS $$
!     SELECT $1.id
! $$ LANGUAGE SQL;
! CREATE FUNCTION price_key_from_input(price_input) RETURNS price_key AS $$
!     SELECT $1.id
! $$ LANGUAGE SQL;
! insert into price values (1,false,42), (10,false,100), (11,true,17.99);
! UPDATE price
!     SET active = true, price = input_prices.price
!     FROM unnest(ARRAY[(10, 123.00), (11, 99.99)]::price_input[]) input_prices
!     WHERE price_key_from_table(price.*) = price_key_from_input(input_prices.*);
! select * from price;
!  id | active | price  
! ----+--------+--------
!   1 | f      |     42
!  10 | t      | 123.00
!  11 | t      |  99.99
! (3 rows)
! 
! rollback;
! --
! -- Test case derived from bug #9085: check * qualification of composite
! -- parameters for SQL functions
! --
! create temp table compos (f1 int, f2 text);
! create function fcompos1(v compos) returns void as $$
! insert into compos values (v);  -- fail
! $$ language sql;
! ERROR:  column "f1" is of type integer but expression is of type compos
! LINE 2: insert into compos values (v);  -- fail
!                                    ^
! HINT:  You will need to rewrite or cast the expression.
! create function fcompos1(v compos) returns void as $$
! insert into compos values (v.*);
! $$ language sql;
! create function fcompos2(v compos) returns void as $$
! select fcompos1(v);
! $$ language sql;
! create function fcompos3(v compos) returns void as $$
! select fcompos1(fcompos3.v.*);
! $$ language sql;
! select fcompos1(row(1,'one'));
!  fcompos1 
! ----------
!  
! (1 row)
! 
! select fcompos2(row(2,'two'));
!  fcompos2 
! ----------
!  
! (1 row)
! 
! select fcompos3(row(3,'three'));
!  fcompos3 
! ----------
!  
! (1 row)
! 
! select * from compos;
!  f1 |  f2   
! ----+-------
!   1 | one
!   2 | two
!   3 | three
! (3 rows)
! 
! --
! -- We allow I/O conversion casts from composite types to strings to be
! -- invoked via cast syntax, but not functional syntax.  This is because
! -- the latter is too prone to be invoked unintentionally.
! --
! select cast (fullname as text) from fullname;
!  fullname 
! ----------
! (0 rows)
! 
! select fullname::text from fullname;
!  fullname 
! ----------
! (0 rows)
! 
! select text(fullname) from fullname;  -- error
! ERROR:  function text(fullname) does not exist
! LINE 1: select text(fullname) from fullname;
!                ^
! HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
! select fullname.text from fullname;  -- error
! ERROR:  column fullname.text does not exist
! LINE 1: select fullname.text from fullname;
!                ^
! -- same, but RECORD instead of named composite type:
! select cast (row('Jim', 'Beam') as text);
!     row     
! ------------
!  (Jim,Beam)
! (1 row)
! 
! select (row('Jim', 'Beam'))::text;
!     row     
! ------------
!  (Jim,Beam)
! (1 row)
! 
! select text(row('Jim', 'Beam'));  -- error
! ERROR:  function text(record) does not exist
! LINE 1: select text(row('Jim', 'Beam'));
!                ^
! HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
! select (row('Jim', 'Beam')).text;  -- error
! ERROR:  could not identify column "text" in record data type
! LINE 1: select (row('Jim', 'Beam')).text;
!                 ^
! --
! -- Test that composite values are seen to have the correct column names
! -- (bug #11210 and other reports)
! --
! select row_to_json(i) from int8_tbl i;
!                   row_to_json                   
! ------------------------------------------------
!  {"q1":123,"q2":456}
!  {"q1":123,"q2":4567890123456789}
!  {"q1":4567890123456789,"q2":123}
!  {"q1":4567890123456789,"q2":4567890123456789}
!  {"q1":4567890123456789,"q2":-4567890123456789}
! (5 rows)
! 
! select row_to_json(i) from int8_tbl i(x,y);
!                  row_to_json                  
! ----------------------------------------------
!  {"x":123,"y":456}
!  {"x":123,"y":4567890123456789}
!  {"x":4567890123456789,"y":123}
!  {"x":4567890123456789,"y":4567890123456789}
!  {"x":4567890123456789,"y":-4567890123456789}
! (5 rows)
! 
! create temp view vv1 as select * from int8_tbl;
! select row_to_json(i) from vv1 i;
!                   row_to_json                   
! ------------------------------------------------
!  {"q1":123,"q2":456}
!  {"q1":123,"q2":4567890123456789}
!  {"q1":4567890123456789,"q2":123}
!  {"q1":4567890123456789,"q2":4567890123456789}
!  {"q1":4567890123456789,"q2":-4567890123456789}
! (5 rows)
! 
! select row_to_json(i) from vv1 i(x,y);
!                  row_to_json                  
! ----------------------------------------------
!  {"x":123,"y":456}
!  {"x":123,"y":4567890123456789}
!  {"x":4567890123456789,"y":123}
!  {"x":4567890123456789,"y":4567890123456789}
!  {"x":4567890123456789,"y":-4567890123456789}
! (5 rows)
! 
! select row_to_json(ss) from
!   (select q1, q2 from int8_tbl) as ss;
!                   row_to_json                   
! ------------------------------------------------
!  {"q1":123,"q2":456}
!  {"q1":123,"q2":4567890123456789}
!  {"q1":4567890123456789,"q2":123}
!  {"q1":4567890123456789,"q2":4567890123456789}
!  {"q1":4567890123456789,"q2":-4567890123456789}
! (5 rows)
! 
! select row_to_json(ss) from
!   (select q1, q2 from int8_tbl offset 0) as ss;
!                   row_to_json                   
! ------------------------------------------------
!  {"q1":123,"q2":456}
!  {"q1":123,"q2":4567890123456789}
!  {"q1":4567890123456789,"q2":123}
!  {"q1":4567890123456789,"q2":4567890123456789}
!  {"q1":4567890123456789,"q2":-4567890123456789}
! (5 rows)
! 
! select row_to_json(ss) from
!   (select q1 as a, q2 as b from int8_tbl) as ss;
!                  row_to_json                  
! ----------------------------------------------
!  {"a":123,"b":456}
!  {"a":123,"b":4567890123456789}
!  {"a":4567890123456789,"b":123}
!  {"a":4567890123456789,"b":4567890123456789}
!  {"a":4567890123456789,"b":-4567890123456789}
! (5 rows)
! 
! select row_to_json(ss) from
!   (select q1 as a, q2 as b from int8_tbl offset 0) as ss;
!                  row_to_json                  
! ----------------------------------------------
!  {"a":123,"b":456}
!  {"a":123,"b":4567890123456789}
!  {"a":4567890123456789,"b":123}
!  {"a":4567890123456789,"b":4567890123456789}
!  {"a":4567890123456789,"b":-4567890123456789}
! (5 rows)
! 
! select row_to_json(ss) from
!   (select q1 as a, q2 as b from int8_tbl) as ss(x,y);
!                  row_to_json                  
! ----------------------------------------------
!  {"x":123,"y":456}
!  {"x":123,"y":4567890123456789}
!  {"x":4567890123456789,"y":123}
!  {"x":4567890123456789,"y":4567890123456789}
!  {"x":4567890123456789,"y":-4567890123456789}
! (5 rows)
! 
! select row_to_json(ss) from
!   (select q1 as a, q2 as b from int8_tbl offset 0) as ss(x,y);
!                  row_to_json                  
! ----------------------------------------------
!  {"x":123,"y":456}
!  {"x":123,"y":4567890123456789}
!  {"x":4567890123456789,"y":123}
!  {"x":4567890123456789,"y":4567890123456789}
!  {"x":4567890123456789,"y":-4567890123456789}
! (5 rows)
! 
! explain (costs off)
! select row_to_json(q) from
!   (select thousand, tenthous from tenk1
!    where thousand = 42 and tenthous < 2000 offset 0) q;
!                          QUERY PLAN                          
! -------------------------------------------------------------
!  Subquery Scan on q
!    ->  Index Only Scan using tenk1_thous_tenthous on tenk1
!          Index Cond: ((thousand = 42) AND (tenthous < 2000))
! (3 rows)
! 
! select row_to_json(q) from
!   (select thousand, tenthous from tenk1
!    where thousand = 42 and tenthous < 2000 offset 0) q;
!            row_to_json           
! ---------------------------------
!  {"thousand":42,"tenthous":42}
!  {"thousand":42,"tenthous":1042}
! (2 rows)
! 
! select row_to_json(q) from
!   (select thousand as x, tenthous as y from tenk1
!    where thousand = 42 and tenthous < 2000 offset 0) q;
!     row_to_json    
! -------------------
!  {"x":42,"y":42}
!  {"x":42,"y":1042}
! (2 rows)
! 
! select row_to_json(q) from
!   (select thousand as x, tenthous as y from tenk1
!    where thousand = 42 and tenthous < 2000 offset 0) q(a,b);
!     row_to_json    
! -------------------
!  {"a":42,"b":42}
!  {"a":42,"b":1042}
! (2 rows)
! 
! create temp table tt1 as select * from int8_tbl limit 2;
! create temp table tt2 () inherits(tt1);
! insert into tt2 values(0,0);
! select row_to_json(r) from (select q2,q1 from tt1 offset 0) r;
!            row_to_json            
! ----------------------------------
!  {"q2":456,"q1":123}
!  {"q2":4567890123456789,"q1":123}
!  {"q2":0,"q1":0}
! (3 rows)
! 
! --
! -- IS [NOT] NULL should not recurse into nested composites (bug #14235)
! --
! explain (verbose, costs off)
! select r, r is null as isnull, r is not null as isnotnull
! from (values (1,row(1,2)), (1,row(null,null)), (1,null),
!              (null,row(1,2)), (null,row(null,null)), (null,null) ) r(a,b);
!                                                                                                          QUERY PLAN                                                                                                          
! -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
!  Values Scan on "*VALUES*"
!    Output: ROW("*VALUES*".column1, "*VALUES*".column2), (("*VALUES*".column1 IS NULL) AND ("*VALUES*".column2 IS NOT DISTINCT FROM NULL)), (("*VALUES*".column1 IS NOT NULL) AND ("*VALUES*".column2 IS DISTINCT FROM NULL))
! (2 rows)
! 
! select r, r is null as isnull, r is not null as isnotnull
! from (values (1,row(1,2)), (1,row(null,null)), (1,null),
!              (null,row(1,2)), (null,row(null,null)), (null,null) ) r(a,b);
!       r      | isnull | isnotnull 
! -------------+--------+-----------
!  (1,"(1,2)") | f      | t
!  (1,"(,)")   | f      | t
!  (1,)        | f      | f
!  (,"(1,2)")  | f      | f
!  (,"(,)")    | f      | f
!  (,)         | t      | f
! (6 rows)
! 
! explain (verbose, costs off)
! with r(a,b) as
!   (values (1,row(1,2)), (1,row(null,null)), (1,null),
!           (null,row(1,2)), (null,row(null,null)), (null,null) )
! select r, r is null as isnull, r is not null as isnotnull from r;
!                         QUERY PLAN                        
! ----------------------------------------------------------
!  CTE Scan on r
!    Output: r.*, (r.* IS NULL), (r.* IS NOT NULL)
!    CTE r
!      ->  Values Scan on "*VALUES*"
!            Output: "*VALUES*".column1, "*VALUES*".column2
! (5 rows)
! 
! with r(a,b) as
!   (values (1,row(1,2)), (1,row(null,null)), (1,null),
!           (null,row(1,2)), (null,row(null,null)), (null,null) )
! select r, r is null as isnull, r is not null as isnotnull from r;
!       r      | isnull | isnotnull 
! -------------+--------+-----------
!  (1,"(1,2)") | f      | t
!  (1,"(,)")   | f      | t
!  (1,)        | f      | f
!  (,"(1,2)")  | f      | f
!  (,"(,)")    | f      | f
!  (,)         | t      | f
! (6 rows)
! 
--- 1 ----
! psql: FATAL:  the database system is in recovery mode

======================================================================

*** /home/claudiofreire/src/postgresql.work/src/test/regress/expected/returning.out	2016-09-05 20:45:48.952033237 -0300
--- /home/claudiofreire/src/postgresql.work/src/test/regress/results/returning.out	2016-09-12 12:14:51.887413916 -0300
***************
*** 1,357 ****
! --
! -- Test INSERT/UPDATE/DELETE RETURNING
! --
! -- Simple cases
! CREATE TEMP TABLE foo (f1 serial, f2 text, f3 int default 42);
! INSERT INTO foo (f2,f3)
!   VALUES ('test', DEFAULT), ('More', 11), (upper('more'), 7+9)
!   RETURNING *, f1+f3 AS sum;
!  f1 |  f2  | f3 | sum 
! ----+------+----+-----
!   1 | test | 42 |  43
!   2 | More | 11 |  13
!   3 | MORE | 16 |  19
! (3 rows)
! 
! SELECT * FROM foo;
!  f1 |  f2  | f3 
! ----+------+----
!   1 | test | 42
!   2 | More | 11
!   3 | MORE | 16
! (3 rows)
! 
! UPDATE foo SET f2 = lower(f2), f3 = DEFAULT RETURNING foo.*, f1+f3 AS sum13;
!  f1 |  f2  | f3 | sum13 
! ----+------+----+-------
!   1 | test | 42 |    43
!   2 | more | 42 |    44
!   3 | more | 42 |    45
! (3 rows)
! 
! SELECT * FROM foo;
!  f1 |  f2  | f3 
! ----+------+----
!   1 | test | 42
!   2 | more | 42
!   3 | more | 42
! (3 rows)
! 
! DELETE FROM foo WHERE f1 > 2 RETURNING f3, f2, f1, least(f1,f3);
!  f3 |  f2  | f1 | least 
! ----+------+----+-------
!  42 | more |  3 |     3
! (1 row)
! 
! SELECT * FROM foo;
!  f1 |  f2  | f3 
! ----+------+----
!   1 | test | 42
!   2 | more | 42
! (2 rows)
! 
! -- Subplans and initplans in the RETURNING list
! INSERT INTO foo SELECT f1+10, f2, f3+99 FROM foo
!   RETURNING *, f1+112 IN (SELECT q1 FROM int8_tbl) AS subplan,
!     EXISTS(SELECT * FROM int4_tbl) AS initplan;
!  f1 |  f2  | f3  | subplan | initplan 
! ----+------+-----+---------+----------
!  11 | test | 141 | t       | t
!  12 | more | 141 | f       | t
! (2 rows)
! 
! UPDATE foo SET f3 = f3 * 2
!   WHERE f1 > 10
!   RETURNING *, f1+112 IN (SELECT q1 FROM int8_tbl) AS subplan,
!     EXISTS(SELECT * FROM int4_tbl) AS initplan;
!  f1 |  f2  | f3  | subplan | initplan 
! ----+------+-----+---------+----------
!  11 | test | 282 | t       | t
!  12 | more | 282 | f       | t
! (2 rows)
! 
! DELETE FROM foo
!   WHERE f1 > 10
!   RETURNING *, f1+112 IN (SELECT q1 FROM int8_tbl) AS subplan,
!     EXISTS(SELECT * FROM int4_tbl) AS initplan;
!  f1 |  f2  | f3  | subplan | initplan 
! ----+------+-----+---------+----------
!  11 | test | 282 | t       | t
!  12 | more | 282 | f       | t
! (2 rows)
! 
! -- Joins
! UPDATE foo SET f3 = f3*2
!   FROM int4_tbl i
!   WHERE foo.f1 + 123455 = i.f1
!   RETURNING foo.*, i.f1 as "i.f1";
!  f1 |  f2  | f3 |  i.f1  
! ----+------+----+--------
!   1 | test | 84 | 123456
! (1 row)
! 
! SELECT * FROM foo;
!  f1 |  f2  | f3 
! ----+------+----
!   2 | more | 42
!   1 | test | 84
! (2 rows)
! 
! DELETE FROM foo
!   USING int4_tbl i
!   WHERE foo.f1 + 123455 = i.f1
!   RETURNING foo.*, i.f1 as "i.f1";
!  f1 |  f2  | f3 |  i.f1  
! ----+------+----+--------
!   1 | test | 84 | 123456
! (1 row)
! 
! SELECT * FROM foo;
!  f1 |  f2  | f3 
! ----+------+----
!   2 | more | 42
! (1 row)
! 
! -- Check inheritance cases
! CREATE TEMP TABLE foochild (fc int) INHERITS (foo);
! INSERT INTO foochild VALUES(123,'child',999,-123);
! ALTER TABLE foo ADD COLUMN f4 int8 DEFAULT 99;
! SELECT * FROM foo;
!  f1  |  f2   | f3  | f4 
! -----+-------+-----+----
!    2 | more  |  42 | 99
!  123 | child | 999 | 99
! (2 rows)
! 
! SELECT * FROM foochild;
!  f1  |  f2   | f3  |  fc  | f4 
! -----+-------+-----+------+----
!  123 | child | 999 | -123 | 99
! (1 row)
! 
! UPDATE foo SET f4 = f4 + f3 WHERE f4 = 99 RETURNING *;
!  f1  |  f2   | f3  |  f4  
! -----+-------+-----+------
!    2 | more  |  42 |  141
!  123 | child | 999 | 1098
! (2 rows)
! 
! SELECT * FROM foo;
!  f1  |  f2   | f3  |  f4  
! -----+-------+-----+------
!    2 | more  |  42 |  141
!  123 | child | 999 | 1098
! (2 rows)
! 
! SELECT * FROM foochild;
!  f1  |  f2   | f3  |  fc  |  f4  
! -----+-------+-----+------+------
!  123 | child | 999 | -123 | 1098
! (1 row)
! 
! UPDATE foo SET f3 = f3*2
!   FROM int8_tbl i
!   WHERE foo.f1 = i.q2
!   RETURNING *;
!  f1  |  f2   |  f3  |  f4  |        q1        | q2  
! -----+-------+------+------+------------------+-----
!  123 | child | 1998 | 1098 | 4567890123456789 | 123
! (1 row)
! 
! SELECT * FROM foo;
!  f1  |  f2   |  f3  |  f4  
! -----+-------+------+------
!    2 | more  |   42 |  141
!  123 | child | 1998 | 1098
! (2 rows)
! 
! SELECT * FROM foochild;
!  f1  |  f2   |  f3  |  fc  |  f4  
! -----+-------+------+------+------
!  123 | child | 1998 | -123 | 1098
! (1 row)
! 
! DELETE FROM foo
!   USING int8_tbl i
!   WHERE foo.f1 = i.q2
!   RETURNING *;
!  f1  |  f2   |  f3  |  f4  |        q1        | q2  
! -----+-------+------+------+------------------+-----
!  123 | child | 1998 | 1098 | 4567890123456789 | 123
! (1 row)
! 
! SELECT * FROM foo;
!  f1 |  f2  | f3 | f4  
! ----+------+----+-----
!   2 | more | 42 | 141
! (1 row)
! 
! SELECT * FROM foochild;
!  f1 | f2 | f3 | fc | f4 
! ----+----+----+----+----
! (0 rows)
! 
! DROP TABLE foochild;
! -- Rules and views
! CREATE TEMP VIEW voo AS SELECT f1, f2 FROM foo;
! CREATE RULE voo_i AS ON INSERT TO voo DO INSTEAD
!   INSERT INTO foo VALUES(new.*, 57);
! INSERT INTO voo VALUES(11,'zit');
! -- fails:
! INSERT INTO voo VALUES(12,'zoo') RETURNING *, f1*2;
! ERROR:  cannot perform INSERT RETURNING on relation "voo"
! HINT:  You need an unconditional ON INSERT DO INSTEAD rule with a RETURNING clause.
! -- fails, incompatible list:
! CREATE OR REPLACE RULE voo_i AS ON INSERT TO voo DO INSTEAD
!   INSERT INTO foo VALUES(new.*, 57) RETURNING *;
! ERROR:  RETURNING list has too many entries
! CREATE OR REPLACE RULE voo_i AS ON INSERT TO voo DO INSTEAD
!   INSERT INTO foo VALUES(new.*, 57) RETURNING f1, f2;
! -- should still work
! INSERT INTO voo VALUES(13,'zit2');
! -- works now
! INSERT INTO voo VALUES(14,'zoo2') RETURNING *;
!  f1 |  f2  
! ----+------
!  14 | zoo2
! (1 row)
! 
! SELECT * FROM foo;
!  f1 |  f2  | f3 | f4  
! ----+------+----+-----
!   2 | more | 42 | 141
!  11 | zit  | 57 |  99
!  13 | zit2 | 57 |  99
!  14 | zoo2 | 57 |  99
! (4 rows)
! 
! SELECT * FROM voo;
!  f1 |  f2  
! ----+------
!   2 | more
!  11 | zit
!  13 | zit2
!  14 | zoo2
! (4 rows)
! 
! CREATE OR REPLACE RULE voo_u AS ON UPDATE TO voo DO INSTEAD
!   UPDATE foo SET f1 = new.f1, f2 = new.f2 WHERE f1 = old.f1
!   RETURNING f1, f2;
! update voo set f1 = f1 + 1 where f2 = 'zoo2';
! update voo set f1 = f1 + 1 where f2 = 'zoo2' RETURNING *, f1*2;
!  f1 |  f2  | ?column? 
! ----+------+----------
!  16 | zoo2 |       32
! (1 row)
! 
! SELECT * FROM foo;
!  f1 |  f2  | f3 | f4  
! ----+------+----+-----
!   2 | more | 42 | 141
!  11 | zit  | 57 |  99
!  13 | zit2 | 57 |  99
!  16 | zoo2 | 57 |  99
! (4 rows)
! 
! SELECT * FROM voo;
!  f1 |  f2  
! ----+------
!   2 | more
!  11 | zit
!  13 | zit2
!  16 | zoo2
! (4 rows)
! 
! CREATE OR REPLACE RULE voo_d AS ON DELETE TO voo DO INSTEAD
!   DELETE FROM foo WHERE f1 = old.f1
!   RETURNING f1, f2;
! DELETE FROM foo WHERE f1 = 13;
! DELETE FROM foo WHERE f2 = 'zit' RETURNING *;
!  f1 | f2  | f3 | f4 
! ----+-----+----+----
!  11 | zit | 57 | 99
! (1 row)
! 
! SELECT * FROM foo;
!  f1 |  f2  | f3 | f4  
! ----+------+----+-----
!   2 | more | 42 | 141
!  16 | zoo2 | 57 |  99
! (2 rows)
! 
! SELECT * FROM voo;
!  f1 |  f2  
! ----+------
!   2 | more
!  16 | zoo2
! (2 rows)
! 
! -- Try a join case
! CREATE TEMP TABLE joinme (f2j text, other int);
! INSERT INTO joinme VALUES('more', 12345);
! INSERT INTO joinme VALUES('zoo2', 54321);
! INSERT INTO joinme VALUES('other', 0);
! CREATE TEMP VIEW joinview AS
!   SELECT foo.*, other FROM foo JOIN joinme ON (f2 = f2j);
! SELECT * FROM joinview;
!  f1 |  f2  | f3 | f4  | other 
! ----+------+----+-----+-------
!   2 | more | 42 | 141 | 12345
!  16 | zoo2 | 57 |  99 | 54321
! (2 rows)
! 
! CREATE RULE joinview_u AS ON UPDATE TO joinview DO INSTEAD
!   UPDATE foo SET f1 = new.f1, f3 = new.f3
!     FROM joinme WHERE f2 = f2j AND f2 = old.f2
!     RETURNING foo.*, other;
! UPDATE joinview SET f1 = f1 + 1 WHERE f3 = 57 RETURNING *, other + 1;
!  f1 |  f2  | f3 | f4 | other | ?column? 
! ----+------+----+----+-------+----------
!  17 | zoo2 | 57 | 99 | 54321 |    54322
! (1 row)
! 
! SELECT * FROM joinview;
!  f1 |  f2  | f3 | f4  | other 
! ----+------+----+-----+-------
!   2 | more | 42 | 141 | 12345
!  17 | zoo2 | 57 |  99 | 54321
! (2 rows)
! 
! SELECT * FROM foo;
!  f1 |  f2  | f3 | f4  
! ----+------+----+-----
!   2 | more | 42 | 141
!  17 | zoo2 | 57 |  99
! (2 rows)
! 
! SELECT * FROM voo;
!  f1 |  f2  
! ----+------
!   2 | more
!  17 | zoo2
! (2 rows)
! 
! -- Check aliased target relation
! INSERT INTO foo AS bar DEFAULT VALUES RETURNING *; -- ok
!  f1 | f2 | f3 | f4 
! ----+----+----+----
!   4 |    | 42 | 99
! (1 row)
! 
! INSERT INTO foo AS bar DEFAULT VALUES RETURNING foo.*; -- fails, wrong name
! ERROR:  invalid reference to FROM-clause entry for table "foo"
! LINE 1: INSERT INTO foo AS bar DEFAULT VALUES RETURNING foo.*;
!                                                         ^
! HINT:  Perhaps you meant to reference the table alias "bar".
! INSERT INTO foo AS bar DEFAULT VALUES RETURNING bar.*; -- ok
!  f1 | f2 | f3 | f4 
! ----+----+----+----
!   5 |    | 42 | 99
! (1 row)
! 
! INSERT INTO foo AS bar DEFAULT VALUES RETURNING bar.f3; -- ok
!  f3 
! ----
!  42
! (1 row)
! 
--- 1 ----
! psql: FATAL:  the database system is in recovery mode

======================================================================

*** /home/claudiofreire/src/postgresql.work/src/test/regress/expected/largeobject.out	2016-09-12 12:14:37.867410181 -0300
--- /home/claudiofreire/src/postgresql.work/src/test/regress/results/largeobject.out	2016-09-12 12:14:51.887413916 -0300
***************
*** 1,473 ****
! --
! -- Test large object support
! --
! -- ensure consistent test output regardless of the default bytea format
! SET bytea_output TO escape;
! -- Load a file
! CREATE TABLE lotest_stash_values (loid oid, fd integer);
! -- lo_creat(mode integer) returns oid
! -- The mode arg to lo_creat is unused, some vestigal holdover from ancient times
! -- returns the large object id
! INSERT INTO lotest_stash_values (loid) SELECT lo_creat(42);
! -- Test ALTER LARGE OBJECT
! CREATE ROLE regress_lo_user;
! DO $$
!   BEGIN
!     EXECUTE 'ALTER LARGE OBJECT ' || (select loid from lotest_stash_values)
! 		|| ' OWNER TO regress_lo_user';
!   END
! $$;
! SELECT
! 	rol.rolname
! FROM
! 	lotest_stash_values s
! 	JOIN pg_largeobject_metadata lo ON s.loid = lo.oid
! 	JOIN pg_authid rol ON lo.lomowner = rol.oid;
!      rolname     
! -----------------
!  regress_lo_user
! (1 row)
! 
! -- NOTE: large objects require transactions
! BEGIN;
! -- lo_open(lobjId oid, mode integer) returns integer
! -- The mode parameter to lo_open uses two constants:
! --   INV_READ  = 0x20000
! --   INV_WRITE = 0x40000
! -- The return value is a file descriptor-like value which remains valid for the
! -- transaction.
! UPDATE lotest_stash_values SET fd = lo_open(loid, CAST(x'20000' | x'40000' AS integer));
! -- loread/lowrite names are wonky, different from other functions which are lo_*
! -- lowrite(fd integer, data bytea) returns integer
! -- the integer is the number of bytes written
! SELECT lowrite(fd, '
! I wandered lonely as a cloud
! That floats on high o''er vales and hills,
! When all at once I saw a crowd,
! A host, of golden daffodils;
! Beside the lake, beneath the trees,
! Fluttering and dancing in the breeze.
! 
! Continuous as the stars that shine
! And twinkle on the milky way,
! They stretched in never-ending line
! Along the margin of a bay:
! Ten thousand saw I at a glance,
! Tossing their heads in sprightly dance.
! 
! The waves beside them danced; but they
! Out-did the sparkling waves in glee:
! A poet could not but be gay,
! In such a jocund company:
! I gazed--and gazed--but little thought
! What wealth the show to me had brought:
! 
! For oft, when on my couch I lie
! In vacant or in pensive mood,
! They flash upon that inward eye
! Which is the bliss of solitude;
! And then my heart with pleasure fills,
! And dances with the daffodils.
! 
!          -- William Wordsworth
! ') FROM lotest_stash_values;
!  lowrite 
! ---------
!      848
! (1 row)
! 
! -- lo_close(fd integer) returns integer
! -- return value is 0 for success, or <0 for error (actually only -1, but...)
! SELECT lo_close(fd) FROM lotest_stash_values;
!  lo_close 
! ----------
!         0
! (1 row)
! 
! END;
! -- Copy to another large object.
! -- Note: we intentionally don't remove the object created here;
! -- it's left behind to help test pg_dump.
! SELECT lo_from_bytea(0, lo_get(loid)) AS newloid FROM lotest_stash_values
! \gset
! -- Ideally we'd put a comment on this object for pg_dump testing purposes.
! -- But since pg_upgrade fails to preserve large object comments, doing so
! -- would break pg_upgrade's regression test.
! -- COMMENT ON LARGE OBJECT :newloid IS 'I Wandered Lonely as a Cloud';
! -- Read out a portion
! BEGIN;
! UPDATE lotest_stash_values SET fd=lo_open(loid, CAST(x'20000' | x'40000' AS integer));
! -- lo_lseek(fd integer, offset integer, whence integer) returns integer
! -- offset is in bytes, whence is one of three values:
! --  SEEK_SET (= 0) meaning relative to beginning
! --  SEEK_CUR (= 1) meaning relative to current position
! --  SEEK_END (= 2) meaning relative to end (offset better be negative)
! -- returns current position in file
! SELECT lo_lseek(fd, 104, 0) FROM lotest_stash_values;
!  lo_lseek 
! ----------
!       104
! (1 row)
! 
! -- loread/lowrite names are wonky, different from other functions which are lo_*
! -- loread(fd integer, len integer) returns bytea
! SELECT loread(fd, 28) FROM lotest_stash_values;
!             loread            
! ------------------------------
!  A host, of golden daffodils;
! (1 row)
! 
! SELECT lo_lseek(fd, -19, 1) FROM lotest_stash_values;
!  lo_lseek 
! ----------
!       113
! (1 row)
! 
! SELECT lowrite(fd, 'n') FROM lotest_stash_values;
!  lowrite 
! ---------
!        1
! (1 row)
! 
! SELECT lo_tell(fd) FROM lotest_stash_values;
!  lo_tell 
! ---------
!      114
! (1 row)
! 
! SELECT lo_lseek(fd, -744, 2) FROM lotest_stash_values;
!  lo_lseek 
! ----------
!       104
! (1 row)
! 
! SELECT loread(fd, 28) FROM lotest_stash_values;
!             loread            
! ------------------------------
!  A host, on golden daffodils;
! (1 row)
! 
! SELECT lo_close(fd) FROM lotest_stash_values;
!  lo_close 
! ----------
!         0
! (1 row)
! 
! END;
! -- Test resource management
! BEGIN;
! SELECT lo_open(loid, x'40000'::int) from lotest_stash_values;
!  lo_open 
! ---------
!        0
! (1 row)
! 
! ABORT;
! -- Test truncation.
! BEGIN;
! UPDATE lotest_stash_values SET fd=lo_open(loid, CAST(x'20000' | x'40000' AS integer));
! SELECT lo_truncate(fd, 11) FROM lotest_stash_values;
!  lo_truncate 
! -------------
!            0
! (1 row)
! 
! SELECT loread(fd, 15) FROM lotest_stash_values;
!      loread     
! ----------------
!  \012I wandered
! (1 row)
! 
! SELECT lo_truncate(fd, 10000) FROM lotest_stash_values;
!  lo_truncate 
! -------------
!            0
! (1 row)
! 
! SELECT loread(fd, 10) FROM lotest_stash_values;
!                   loread                  
! ------------------------------------------
!  \000\000\000\000\000\000\000\000\000\000
! (1 row)
! 
! SELECT lo_lseek(fd, 0, 2) FROM lotest_stash_values;
!  lo_lseek 
! ----------
!     10000
! (1 row)
! 
! SELECT lo_tell(fd) FROM lotest_stash_values;
!  lo_tell 
! ---------
!    10000
! (1 row)
! 
! SELECT lo_truncate(fd, 5000) FROM lotest_stash_values;
!  lo_truncate 
! -------------
!            0
! (1 row)
! 
! SELECT lo_lseek(fd, 0, 2) FROM lotest_stash_values;
!  lo_lseek 
! ----------
!      5000
! (1 row)
! 
! SELECT lo_tell(fd) FROM lotest_stash_values;
!  lo_tell 
! ---------
!     5000
! (1 row)
! 
! SELECT lo_close(fd) FROM lotest_stash_values;
!  lo_close 
! ----------
!         0
! (1 row)
! 
! END;
! -- Test 64-bit large object functions.
! BEGIN;
! UPDATE lotest_stash_values SET fd = lo_open(loid, CAST(x'20000' | x'40000' AS integer));
! SELECT lo_lseek64(fd, 4294967296, 0) FROM lotest_stash_values;
!  lo_lseek64 
! ------------
!  4294967296
! (1 row)
! 
! SELECT lowrite(fd, 'offset:4GB') FROM lotest_stash_values;
!  lowrite 
! ---------
!       10
! (1 row)
! 
! SELECT lo_tell64(fd) FROM lotest_stash_values;
!  lo_tell64  
! ------------
!  4294967306
! (1 row)
! 
! SELECT lo_lseek64(fd, -10, 1) FROM lotest_stash_values;
!  lo_lseek64 
! ------------
!  4294967296
! (1 row)
! 
! SELECT lo_tell64(fd) FROM lotest_stash_values;
!  lo_tell64  
! ------------
!  4294967296
! (1 row)
! 
! SELECT loread(fd, 10) FROM lotest_stash_values;
!    loread   
! ------------
!  offset:4GB
! (1 row)
! 
! SELECT lo_truncate64(fd, 5000000000) FROM lotest_stash_values;
!  lo_truncate64 
! ---------------
!              0
! (1 row)
! 
! SELECT lo_lseek64(fd, 0, 2) FROM lotest_stash_values;
!  lo_lseek64 
! ------------
!  5000000000
! (1 row)
! 
! SELECT lo_tell64(fd) FROM lotest_stash_values;
!  lo_tell64  
! ------------
!  5000000000
! (1 row)
! 
! SELECT lo_truncate64(fd, 3000000000) FROM lotest_stash_values;
!  lo_truncate64 
! ---------------
!              0
! (1 row)
! 
! SELECT lo_lseek64(fd, 0, 2) FROM lotest_stash_values;
!  lo_lseek64 
! ------------
!  3000000000
! (1 row)
! 
! SELECT lo_tell64(fd) FROM lotest_stash_values;
!  lo_tell64  
! ------------
!  3000000000
! (1 row)
! 
! SELECT lo_close(fd) FROM lotest_stash_values;
!  lo_close 
! ----------
!         0
! (1 row)
! 
! END;
! -- lo_unlink(lobjId oid) returns integer
! -- return value appears to always be 1
! SELECT lo_unlink(loid) from lotest_stash_values;
!  lo_unlink 
! -----------
!          1
! (1 row)
! 
! TRUNCATE lotest_stash_values;
! INSERT INTO lotest_stash_values (loid) SELECT lo_import('/home/claudiofreire/src/postgresql.work/src/test/regress/data/tenk.data');
! BEGIN;
! UPDATE lotest_stash_values SET fd=lo_open(loid, CAST(x'20000' | x'40000' AS integer));
! -- verify length of large object
! SELECT lo_lseek(fd, 0, 2) FROM lotest_stash_values;
!  lo_lseek 
! ----------
!    670800
! (1 row)
! 
! -- with the default BLKSZ, LOBLKSZ = 2048, so this positions us for a block
! -- edge case
! SELECT lo_lseek(fd, 2030, 0) FROM lotest_stash_values;
!  lo_lseek 
! ----------
!      2030
! (1 row)
! 
! -- this should get half of the value from page 0 and half from page 1 of the
! -- large object
! SELECT loread(fd, 36) FROM lotest_stash_values;
!                              loread                              
! -----------------------------------------------------------------
!  AAA\011FBAAAA\011VVVVxx\0122513\01132\0111\0111\0113\01113\0111
! (1 row)
! 
! SELECT lo_tell(fd) FROM lotest_stash_values;
!  lo_tell 
! ---------
!     2066
! (1 row)
! 
! SELECT lo_lseek(fd, -26, 1) FROM lotest_stash_values;
!  lo_lseek 
! ----------
!      2040
! (1 row)
! 
! SELECT lowrite(fd, 'abcdefghijklmnop') FROM lotest_stash_values;
!  lowrite 
! ---------
!       16
! (1 row)
! 
! SELECT lo_lseek(fd, 2030, 0) FROM lotest_stash_values;
!  lo_lseek 
! ----------
!      2030
! (1 row)
! 
! SELECT loread(fd, 36) FROM lotest_stash_values;
!                        loread                        
! -----------------------------------------------------
!  AAA\011FBAAAAabcdefghijklmnop1\0111\0113\01113\0111
! (1 row)
! 
! SELECT lo_close(fd) FROM lotest_stash_values;
!  lo_close 
! ----------
!         0
! (1 row)
! 
! END;
! SELECT lo_export(loid, '/home/claudiofreire/src/postgresql.work/src/test/regress/results/lotest.txt') FROM lotest_stash_values;
!  lo_export 
! -----------
!          1
! (1 row)
! 
! \lo_import 'results/lotest.txt'
! \set newloid :LASTOID
! -- just make sure \lo_export does not barf
! \lo_export :newloid 'results/lotest2.txt'
! -- This is a hack to test that export/import are reversible
! -- This uses knowledge about the inner workings of large object mechanism
! -- which should not be used outside it.  This makes it a HACK
! SELECT pageno, data FROM pg_largeobject WHERE loid = (SELECT loid from lotest_stash_values)
! EXCEPT
! SELECT pageno, data FROM pg_largeobject WHERE loid = :newloid;
!  pageno | data 
! --------+------
! (0 rows)
! 
! SELECT lo_unlink(loid) FROM lotest_stash_values;
!  lo_unlink 
! -----------
!          1
! (1 row)
! 
! TRUNCATE lotest_stash_values;
! \lo_unlink :newloid
! \lo_import 'results/lotest.txt'
! \set newloid_1 :LASTOID
! SELECT lo_from_bytea(0, lo_get(:newloid_1)) AS newloid_2
! \gset
! SELECT md5(lo_get(:newloid_1)) = md5(lo_get(:newloid_2));
!  ?column? 
! ----------
!  t
! (1 row)
! 
! SELECT lo_get(:newloid_1, 0, 20);
!                   lo_get                   
! -------------------------------------------
!  8800\0110\0110\0110\0110\0110\0110\011800
! (1 row)
! 
! SELECT lo_get(:newloid_1, 10, 20);
!                   lo_get                   
! -------------------------------------------
!  \0110\0110\0110\011800\011800\0113800\011
! (1 row)
! 
! SELECT lo_put(:newloid_1, 5, decode('afafafaf', 'hex'));
!  lo_put 
! --------
!  
! (1 row)
! 
! SELECT lo_get(:newloid_1, 0, 20);
!                      lo_get                      
! -------------------------------------------------
!  8800\011\257\257\257\2570\0110\0110\0110\011800
! (1 row)
! 
! SELECT lo_put(:newloid_1, 4294967310, 'foo');
!  lo_put 
! --------
!  
! (1 row)
! 
! SELECT lo_get(:newloid_1);
! ERROR:  large object read request is too large
! SELECT lo_get(:newloid_1, 4294967294, 100);
!                                lo_get                                
! ---------------------------------------------------------------------
!  \000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000foo
! (1 row)
! 
! \lo_unlink :newloid_1
! \lo_unlink :newloid_2
! -- This object is left in the database for pg_dump test purposes
! SELECT lo_from_bytea(0, E'\\xdeadbeef') AS newloid
! \gset
! SET bytea_output TO hex;
! SELECT lo_get(:newloid);
!    lo_get   
! ------------
!  \xdeadbeef
! (1 row)
! 
! DROP TABLE lotest_stash_values;
! DROP ROLE regress_lo_user;
--- 1 ----
! psql: FATAL:  the database system is in recovery mode

======================================================================

*** /home/claudiofreire/src/postgresql.work/src/test/regress/expected/with.out	2016-09-05 20:45:49.140033814 -0300
--- /home/claudiofreire/src/postgresql.work/src/test/regress/results/with.out	2016-09-12 12:14:51.887413916 -0300
***************
*** 1,2262 ****
! --
! -- Tests for common table expressions (WITH query, ... SELECT ...)
! --
! -- Basic WITH
! WITH q1(x,y) AS (SELECT 1,2)
! SELECT * FROM q1, q1 AS q2;
!  x | y | x | y 
! ---+---+---+---
!  1 | 2 | 1 | 2
! (1 row)
! 
! -- Multiple uses are evaluated only once
! SELECT count(*) FROM (
!   WITH q1(x) AS (SELECT random() FROM generate_series(1, 5))
!     SELECT * FROM q1
!   UNION
!     SELECT * FROM q1
! ) ss;
!  count 
! -------
!      5
! (1 row)
! 
! -- WITH RECURSIVE
! -- sum of 1..100
! WITH RECURSIVE t(n) AS (
!     VALUES (1)
! UNION ALL
!     SELECT n+1 FROM t WHERE n < 100
! )
! SELECT sum(n) FROM t;
!  sum  
! ------
!  5050
! (1 row)
! 
! WITH RECURSIVE t(n) AS (
!     SELECT (VALUES(1))
! UNION ALL
!     SELECT n+1 FROM t WHERE n < 5
! )
! SELECT * FROM t;
!  n 
! ---
!  1
!  2
!  3
!  4
!  5
! (5 rows)
! 
! -- recursive view
! CREATE RECURSIVE VIEW nums (n) AS
!     VALUES (1)
! UNION ALL
!     SELECT n+1 FROM nums WHERE n < 5;
! SELECT * FROM nums;
!  n 
! ---
!  1
!  2
!  3
!  4
!  5
! (5 rows)
! 
! CREATE OR REPLACE RECURSIVE VIEW nums (n) AS
!     VALUES (1)
! UNION ALL
!     SELECT n+1 FROM nums WHERE n < 6;
! SELECT * FROM nums;
!  n 
! ---
!  1
!  2
!  3
!  4
!  5
!  6
! (6 rows)
! 
! -- This is an infinite loop with UNION ALL, but not with UNION
! WITH RECURSIVE t(n) AS (
!     SELECT 1
! UNION
!     SELECT 10-n FROM t)
! SELECT * FROM t;
!  n 
! ---
!  1
!  9
! (2 rows)
! 
! -- This'd be an infinite loop, but outside query reads only as much as needed
! WITH RECURSIVE t(n) AS (
!     VALUES (1)
! UNION ALL
!     SELECT n+1 FROM t)
! SELECT * FROM t LIMIT 10;
!  n  
! ----
!   1
!   2
!   3
!   4
!   5
!   6
!   7
!   8
!   9
!  10
! (10 rows)
! 
! -- UNION case should have same property
! WITH RECURSIVE t(n) AS (
!     SELECT 1
! UNION
!     SELECT n+1 FROM t)
! SELECT * FROM t LIMIT 10;
!  n  
! ----
!   1
!   2
!   3
!   4
!   5
!   6
!   7
!   8
!   9
!  10
! (10 rows)
! 
! -- Test behavior with an unknown-type literal in the WITH
! WITH q AS (SELECT 'foo' AS x)
! SELECT x, x IS OF (unknown) as is_unknown FROM q;
!   x  | is_unknown 
! -----+------------
!  foo | t
! (1 row)
! 
! WITH RECURSIVE t(n) AS (
!     SELECT 'foo'
! UNION ALL
!     SELECT n || ' bar' FROM t WHERE length(n) < 20
! )
! SELECT n, n IS OF (text) as is_text FROM t;
!             n            | is_text 
! -------------------------+---------
!  foo                     | t
!  foo bar                 | t
!  foo bar bar             | t
!  foo bar bar bar         | t
!  foo bar bar bar bar     | t
!  foo bar bar bar bar bar | t
! (6 rows)
! 
! --
! -- Some examples with a tree
! --
! -- department structure represented here is as follows:
! --
! -- ROOT-+->A-+->B-+->C
! --      |         |
! --      |         +->D-+->F
! --      +->E-+->G
! CREATE TEMP TABLE department (
! 	id INTEGER PRIMARY KEY,  -- department ID
! 	parent_department INTEGER REFERENCES department, -- upper department ID
! 	name TEXT -- department name
! );
! INSERT INTO department VALUES (0, NULL, 'ROOT');
! INSERT INTO department VALUES (1, 0, 'A');
! INSERT INTO department VALUES (2, 1, 'B');
! INSERT INTO department VALUES (3, 2, 'C');
! INSERT INTO department VALUES (4, 2, 'D');
! INSERT INTO department VALUES (5, 0, 'E');
! INSERT INTO department VALUES (6, 4, 'F');
! INSERT INTO department VALUES (7, 5, 'G');
! -- extract all departments under 'A'. Result should be A, B, C, D and F
! WITH RECURSIVE subdepartment AS
! (
! 	-- non recursive term
! 	SELECT name as root_name, * FROM department WHERE name = 'A'
! 	UNION ALL
! 	-- recursive term
! 	SELECT sd.root_name, d.* FROM department AS d, subdepartment AS sd
! 		WHERE d.parent_department = sd.id
! )
! SELECT * FROM subdepartment ORDER BY name;
!  root_name | id | parent_department | name 
! -----------+----+-------------------+------
!  A         |  1 |                 0 | A
!  A         |  2 |                 1 | B
!  A         |  3 |                 2 | C
!  A         |  4 |                 2 | D
!  A         |  6 |                 4 | F
! (5 rows)
! 
! -- extract all departments under 'A' with "level" number
! WITH RECURSIVE subdepartment(level, id, parent_department, name) AS
! (
! 	-- non recursive term
! 	SELECT 1, * FROM department WHERE name = 'A'
! 	UNION ALL
! 	-- recursive term
! 	SELECT sd.level + 1, d.* FROM department AS d, subdepartment AS sd
! 		WHERE d.parent_department = sd.id
! )
! SELECT * FROM subdepartment ORDER BY name;
!  level | id | parent_department | name 
! -------+----+-------------------+------
!      1 |  1 |                 0 | A
!      2 |  2 |                 1 | B
!      3 |  3 |                 2 | C
!      3 |  4 |                 2 | D
!      4 |  6 |                 4 | F
! (5 rows)
! 
! -- extract all departments under 'A' with "level" number.
! -- Only shows level 2 or more
! WITH RECURSIVE subdepartment(level, id, parent_department, name) AS
! (
! 	-- non recursive term
! 	SELECT 1, * FROM department WHERE name = 'A'
! 	UNION ALL
! 	-- recursive term
! 	SELECT sd.level + 1, d.* FROM department AS d, subdepartment AS sd
! 		WHERE d.parent_department = sd.id
! )
! SELECT * FROM subdepartment WHERE level >= 2 ORDER BY name;
!  level | id | parent_department | name 
! -------+----+-------------------+------
!      2 |  2 |                 1 | B
!      3 |  3 |                 2 | C
!      3 |  4 |                 2 | D
!      4 |  6 |                 4 | F
! (4 rows)
! 
! -- "RECURSIVE" is ignored if the query has no self-reference
! WITH RECURSIVE subdepartment AS
! (
! 	-- note lack of recursive UNION structure
! 	SELECT * FROM department WHERE name = 'A'
! )
! SELECT * FROM subdepartment ORDER BY name;
!  id | parent_department | name 
! ----+-------------------+------
!   1 |                 0 | A
! (1 row)
! 
! -- inside subqueries
! SELECT count(*) FROM (
!     WITH RECURSIVE t(n) AS (
!         SELECT 1 UNION ALL SELECT n + 1 FROM t WHERE n < 500
!     )
!     SELECT * FROM t) AS t WHERE n < (
!         SELECT count(*) FROM (
!             WITH RECURSIVE t(n) AS (
!                    SELECT 1 UNION ALL SELECT n + 1 FROM t WHERE n < 100
!                 )
!             SELECT * FROM t WHERE n < 50000
!          ) AS t WHERE n < 100);
!  count 
! -------
!     98
! (1 row)
! 
! -- use same CTE twice at different subquery levels
! WITH q1(x,y) AS (
!     SELECT hundred, sum(ten) FROM tenk1 GROUP BY hundred
!   )
! SELECT count(*) FROM q1 WHERE y > (SELECT sum(y)/100 FROM q1 qsub);
!  count 
! -------
!     50
! (1 row)
! 
! -- via a VIEW
! CREATE TEMPORARY VIEW vsubdepartment AS
! 	WITH RECURSIVE subdepartment AS
! 	(
! 		 -- non recursive term
! 		SELECT * FROM department WHERE name = 'A'
! 		UNION ALL
! 		-- recursive term
! 		SELECT d.* FROM department AS d, subdepartment AS sd
! 			WHERE d.parent_department = sd.id
! 	)
! 	SELECT * FROM subdepartment;
! SELECT * FROM vsubdepartment ORDER BY name;
!  id | parent_department | name 
! ----+-------------------+------
!   1 |                 0 | A
!   2 |                 1 | B
!   3 |                 2 | C
!   4 |                 2 | D
!   6 |                 4 | F
! (5 rows)
! 
! -- Check reverse listing
! SELECT pg_get_viewdef('vsubdepartment'::regclass);
!                 pg_get_viewdef                 
! -----------------------------------------------
!   WITH RECURSIVE subdepartment AS (           +
!           SELECT department.id,               +
!              department.parent_department,    +
!              department.name                  +
!             FROM department                   +
!            WHERE (department.name = 'A'::text)+
!          UNION ALL                            +
!           SELECT d.id,                        +
!              d.parent_department,             +
!              d.name                           +
!             FROM department d,                +
!              subdepartment sd                 +
!            WHERE (d.parent_department = sd.id)+
!          )                                    +
!   SELECT subdepartment.id,                    +
!      subdepartment.parent_department,         +
!      subdepartment.name                       +
!     FROM subdepartment;
! (1 row)
! 
! SELECT pg_get_viewdef('vsubdepartment'::regclass, true);
!                pg_get_viewdef                
! ---------------------------------------------
!   WITH RECURSIVE subdepartment AS (         +
!           SELECT department.id,             +
!              department.parent_department,  +
!              department.name                +
!             FROM department                 +
!            WHERE department.name = 'A'::text+
!          UNION ALL                          +
!           SELECT d.id,                      +
!              d.parent_department,           +
!              d.name                         +
!             FROM department d,              +
!              subdepartment sd               +
!            WHERE d.parent_department = sd.id+
!          )                                  +
!   SELECT subdepartment.id,                  +
!      subdepartment.parent_department,       +
!      subdepartment.name                     +
!     FROM subdepartment;
! (1 row)
! 
! -- Another reverse-listing example
! CREATE VIEW sums_1_100 AS
! WITH RECURSIVE t(n) AS (
!     VALUES (1)
! UNION ALL
!     SELECT n+1 FROM t WHERE n < 100
! )
! SELECT sum(n) FROM t;
! \d+ sums_1_100
!               View "public.sums_1_100"
!  Column |  Type  | Modifiers | Storage | Description 
! --------+--------+-----------+---------+-------------
!  sum    | bigint |           | plain   | 
! View definition:
!  WITH RECURSIVE t(n) AS (
!          VALUES (1)
!         UNION ALL
!          SELECT t_1.n + 1
!            FROM t t_1
!           WHERE t_1.n < 100
!         )
!  SELECT sum(t.n) AS sum
!    FROM t;
! 
! -- corner case in which sub-WITH gets initialized first
! with recursive q as (
!       select * from department
!     union all
!       (with x as (select * from q)
!        select * from x)
!     )
! select * from q limit 24;
!  id | parent_department | name 
! ----+-------------------+------
!   0 |                   | ROOT
!   1 |                 0 | A
!   2 |                 1 | B
!   3 |                 2 | C
!   4 |                 2 | D
!   5 |                 0 | E
!   6 |                 4 | F
!   7 |                 5 | G
!   0 |                   | ROOT
!   1 |                 0 | A
!   2 |                 1 | B
!   3 |                 2 | C
!   4 |                 2 | D
!   5 |                 0 | E
!   6 |                 4 | F
!   7 |                 5 | G
!   0 |                   | ROOT
!   1 |                 0 | A
!   2 |                 1 | B
!   3 |                 2 | C
!   4 |                 2 | D
!   5 |                 0 | E
!   6 |                 4 | F
!   7 |                 5 | G
! (24 rows)
! 
! with recursive q as (
!       select * from department
!     union all
!       (with recursive x as (
!            select * from department
!          union all
!            (select * from q union all select * from x)
!         )
!        select * from x)
!     )
! select * from q limit 32;
!  id | parent_department | name 
! ----+-------------------+------
!   0 |                   | ROOT
!   1 |                 0 | A
!   2 |                 1 | B
!   3 |                 2 | C
!   4 |                 2 | D
!   5 |                 0 | E
!   6 |                 4 | F
!   7 |                 5 | G
!   0 |                   | ROOT
!   1 |                 0 | A
!   2 |                 1 | B
!   3 |                 2 | C
!   4 |                 2 | D
!   5 |                 0 | E
!   6 |                 4 | F
!   7 |                 5 | G
!   0 |                   | ROOT
!   1 |                 0 | A
!   2 |                 1 | B
!   3 |                 2 | C
!   4 |                 2 | D
!   5 |                 0 | E
!   6 |                 4 | F
!   7 |                 5 | G
!   0 |                   | ROOT
!   1 |                 0 | A
!   2 |                 1 | B
!   3 |                 2 | C
!   4 |                 2 | D
!   5 |                 0 | E
!   6 |                 4 | F
!   7 |                 5 | G
! (32 rows)
! 
! -- recursive term has sub-UNION
! WITH RECURSIVE t(i,j) AS (
! 	VALUES (1,2)
! 	UNION ALL
! 	SELECT t2.i, t.j+1 FROM
! 		(SELECT 2 AS i UNION ALL SELECT 3 AS i) AS t2
! 		JOIN t ON (t2.i = t.i+1))
! 	SELECT * FROM t;
!  i | j 
! ---+---
!  1 | 2
!  2 | 3
!  3 | 4
! (3 rows)
! 
! --
! -- different tree example
! --
! CREATE TEMPORARY TABLE tree(
!     id INTEGER PRIMARY KEY,
!     parent_id INTEGER REFERENCES tree(id)
! );
! INSERT INTO tree
! VALUES (1, NULL), (2, 1), (3,1), (4,2), (5,2), (6,2), (7,3), (8,3),
!        (9,4), (10,4), (11,7), (12,7), (13,7), (14, 9), (15,11), (16,11);
! --
! -- get all paths from "second level" nodes to leaf nodes
! --
! WITH RECURSIVE t(id, path) AS (
!     VALUES(1,ARRAY[]::integer[])
! UNION ALL
!     SELECT tree.id, t.path || tree.id
!     FROM tree JOIN t ON (tree.parent_id = t.id)
! )
! SELECT t1.*, t2.* FROM t AS t1 JOIN t AS t2 ON
! 	(t1.path[1] = t2.path[1] AND
! 	array_upper(t1.path,1) = 1 AND
! 	array_upper(t2.path,1) > 1)
! 	ORDER BY t1.id, t2.id;
!  id | path | id |    path     
! ----+------+----+-------------
!   2 | {2}  |  4 | {2,4}
!   2 | {2}  |  5 | {2,5}
!   2 | {2}  |  6 | {2,6}
!   2 | {2}  |  9 | {2,4,9}
!   2 | {2}  | 10 | {2,4,10}
!   2 | {2}  | 14 | {2,4,9,14}
!   3 | {3}  |  7 | {3,7}
!   3 | {3}  |  8 | {3,8}
!   3 | {3}  | 11 | {3,7,11}
!   3 | {3}  | 12 | {3,7,12}
!   3 | {3}  | 13 | {3,7,13}
!   3 | {3}  | 15 | {3,7,11,15}
!   3 | {3}  | 16 | {3,7,11,16}
! (13 rows)
! 
! -- just count 'em
! WITH RECURSIVE t(id, path) AS (
!     VALUES(1,ARRAY[]::integer[])
! UNION ALL
!     SELECT tree.id, t.path || tree.id
!     FROM tree JOIN t ON (tree.parent_id = t.id)
! )
! SELECT t1.id, count(t2.*) FROM t AS t1 JOIN t AS t2 ON
! 	(t1.path[1] = t2.path[1] AND
! 	array_upper(t1.path,1) = 1 AND
! 	array_upper(t2.path,1) > 1)
! 	GROUP BY t1.id
! 	ORDER BY t1.id;
!  id | count 
! ----+-------
!   2 |     6
!   3 |     7
! (2 rows)
! 
! -- this variant tickled a whole-row-variable bug in 8.4devel
! WITH RECURSIVE t(id, path) AS (
!     VALUES(1,ARRAY[]::integer[])
! UNION ALL
!     SELECT tree.id, t.path || tree.id
!     FROM tree JOIN t ON (tree.parent_id = t.id)
! )
! SELECT t1.id, t2.path, t2 FROM t AS t1 JOIN t AS t2 ON
! (t1.id=t2.id);
!  id |    path     |         t2         
! ----+-------------+--------------------
!   1 | {}          | (1,{})
!   2 | {2}         | (2,{2})
!   3 | {3}         | (3,{3})
!   4 | {2,4}       | (4,"{2,4}")
!   5 | {2,5}       | (5,"{2,5}")
!   6 | {2,6}       | (6,"{2,6}")
!   7 | {3,7}       | (7,"{3,7}")
!   8 | {3,8}       | (8,"{3,8}")
!   9 | {2,4,9}     | (9,"{2,4,9}")
!  10 | {2,4,10}    | (10,"{2,4,10}")
!  11 | {3,7,11}    | (11,"{3,7,11}")
!  12 | {3,7,12}    | (12,"{3,7,12}")
!  13 | {3,7,13}    | (13,"{3,7,13}")
!  14 | {2,4,9,14}  | (14,"{2,4,9,14}")
!  15 | {3,7,11,15} | (15,"{3,7,11,15}")
!  16 | {3,7,11,16} | (16,"{3,7,11,16}")
! (16 rows)
! 
! --
! -- test cycle detection
! --
! create temp table graph( f int, t int, label text );
! insert into graph values
! 	(1, 2, 'arc 1 -> 2'),
! 	(1, 3, 'arc 1 -> 3'),
! 	(2, 3, 'arc 2 -> 3'),
! 	(1, 4, 'arc 1 -> 4'),
! 	(4, 5, 'arc 4 -> 5'),
! 	(5, 1, 'arc 5 -> 1');
! with recursive search_graph(f, t, label, path, cycle) as (
! 	select *, array[row(g.f, g.t)], false from graph g
! 	union all
! 	select g.*, path || row(g.f, g.t), row(g.f, g.t) = any(path)
! 	from graph g, search_graph sg
! 	where g.f = sg.t and not cycle
! )
! select * from search_graph;
!  f | t |   label    |                   path                    | cycle 
! ---+---+------------+-------------------------------------------+-------
!  1 | 2 | arc 1 -> 2 | {"(1,2)"}                                 | f
!  1 | 3 | arc 1 -> 3 | {"(1,3)"}                                 | f
!  2 | 3 | arc 2 -> 3 | {"(2,3)"}                                 | f
!  1 | 4 | arc 1 -> 4 | {"(1,4)"}                                 | f
!  4 | 5 | arc 4 -> 5 | {"(4,5)"}                                 | f
!  5 | 1 | arc 5 -> 1 | {"(5,1)"}                                 | f
!  1 | 2 | arc 1 -> 2 | {"(5,1)","(1,2)"}                         | f
!  1 | 3 | arc 1 -> 3 | {"(5,1)","(1,3)"}                         | f
!  1 | 4 | arc 1 -> 4 | {"(5,1)","(1,4)"}                         | f
!  2 | 3 | arc 2 -> 3 | {"(1,2)","(2,3)"}                         | f
!  4 | 5 | arc 4 -> 5 | {"(1,4)","(4,5)"}                         | f
!  5 | 1 | arc 5 -> 1 | {"(4,5)","(5,1)"}                         | f
!  1 | 2 | arc 1 -> 2 | {"(4,5)","(5,1)","(1,2)"}                 | f
!  1 | 3 | arc 1 -> 3 | {"(4,5)","(5,1)","(1,3)"}                 | f
!  1 | 4 | arc 1 -> 4 | {"(4,5)","(5,1)","(1,4)"}                 | f
!  2 | 3 | arc 2 -> 3 | {"(5,1)","(1,2)","(2,3)"}                 | f
!  4 | 5 | arc 4 -> 5 | {"(5,1)","(1,4)","(4,5)"}                 | f
!  5 | 1 | arc 5 -> 1 | {"(1,4)","(4,5)","(5,1)"}                 | f
!  1 | 2 | arc 1 -> 2 | {"(1,4)","(4,5)","(5,1)","(1,2)"}         | f
!  1 | 3 | arc 1 -> 3 | {"(1,4)","(4,5)","(5,1)","(1,3)"}         | f
!  1 | 4 | arc 1 -> 4 | {"(1,4)","(4,5)","(5,1)","(1,4)"}         | t
!  2 | 3 | arc 2 -> 3 | {"(4,5)","(5,1)","(1,2)","(2,3)"}         | f
!  4 | 5 | arc 4 -> 5 | {"(4,5)","(5,1)","(1,4)","(4,5)"}         | t
!  5 | 1 | arc 5 -> 1 | {"(5,1)","(1,4)","(4,5)","(5,1)"}         | t
!  2 | 3 | arc 2 -> 3 | {"(1,4)","(4,5)","(5,1)","(1,2)","(2,3)"} | f
! (25 rows)
! 
! -- ordering by the path column has same effect as SEARCH DEPTH FIRST
! with recursive search_graph(f, t, label, path, cycle) as (
! 	select *, array[row(g.f, g.t)], false from graph g
! 	union all
! 	select g.*, path || row(g.f, g.t), row(g.f, g.t) = any(path)
! 	from graph g, search_graph sg
! 	where g.f = sg.t and not cycle
! )
! select * from search_graph order by path;
!  f | t |   label    |                   path                    | cycle 
! ---+---+------------+-------------------------------------------+-------
!  1 | 2 | arc 1 -> 2 | {"(1,2)"}                                 | f
!  2 | 3 | arc 2 -> 3 | {"(1,2)","(2,3)"}                         | f
!  1 | 3 | arc 1 -> 3 | {"(1,3)"}                                 | f
!  1 | 4 | arc 1 -> 4 | {"(1,4)"}                                 | f
!  4 | 5 | arc 4 -> 5 | {"(1,4)","(4,5)"}                         | f
!  5 | 1 | arc 5 -> 1 | {"(1,4)","(4,5)","(5,1)"}                 | f
!  1 | 2 | arc 1 -> 2 | {"(1,4)","(4,5)","(5,1)","(1,2)"}         | f
!  2 | 3 | arc 2 -> 3 | {"(1,4)","(4,5)","(5,1)","(1,2)","(2,3)"} | f
!  1 | 3 | arc 1 -> 3 | {"(1,4)","(4,5)","(5,1)","(1,3)"}         | f
!  1 | 4 | arc 1 -> 4 | {"(1,4)","(4,5)","(5,1)","(1,4)"}         | t
!  2 | 3 | arc 2 -> 3 | {"(2,3)"}                                 | f
!  4 | 5 | arc 4 -> 5 | {"(4,5)"}                                 | f
!  5 | 1 | arc 5 -> 1 | {"(4,5)","(5,1)"}                         | f
!  1 | 2 | arc 1 -> 2 | {"(4,5)","(5,1)","(1,2)"}                 | f
!  2 | 3 | arc 2 -> 3 | {"(4,5)","(5,1)","(1,2)","(2,3)"}         | f
!  1 | 3 | arc 1 -> 3 | {"(4,5)","(5,1)","(1,3)"}                 | f
!  1 | 4 | arc 1 -> 4 | {"(4,5)","(5,1)","(1,4)"}                 | f
!  4 | 5 | arc 4 -> 5 | {"(4,5)","(5,1)","(1,4)","(4,5)"}         | t
!  5 | 1 | arc 5 -> 1 | {"(5,1)"}                                 | f
!  1 | 2 | arc 1 -> 2 | {"(5,1)","(1,2)"}                         | f
!  2 | 3 | arc 2 -> 3 | {"(5,1)","(1,2)","(2,3)"}                 | f
!  1 | 3 | arc 1 -> 3 | {"(5,1)","(1,3)"}                         | f
!  1 | 4 | arc 1 -> 4 | {"(5,1)","(1,4)"}                         | f
!  4 | 5 | arc 4 -> 5 | {"(5,1)","(1,4)","(4,5)"}                 | f
!  5 | 1 | arc 5 -> 1 | {"(5,1)","(1,4)","(4,5)","(5,1)"}         | t
! (25 rows)
! 
! --
! -- test multiple WITH queries
! --
! WITH RECURSIVE
!   y (id) AS (VALUES (1)),
!   x (id) AS (SELECT * FROM y UNION ALL SELECT id+1 FROM x WHERE id < 5)
! SELECT * FROM x;
!  id 
! ----
!   1
!   2
!   3
!   4
!   5
! (5 rows)
! 
! -- forward reference OK
! WITH RECURSIVE
!     x(id) AS (SELECT * FROM y UNION ALL SELECT id+1 FROM x WHERE id < 5),
!     y(id) AS (values (1))
!  SELECT * FROM x;
!  id 
! ----
!   1
!   2
!   3
!   4
!   5
! (5 rows)
! 
! WITH RECURSIVE
!    x(id) AS
!      (VALUES (1) UNION ALL SELECT id+1 FROM x WHERE id < 5),
!    y(id) AS
!      (VALUES (1) UNION ALL SELECT id+1 FROM y WHERE id < 10)
!  SELECT y.*, x.* FROM y LEFT JOIN x USING (id);
!  id | id 
! ----+----
!   1 |  1
!   2 |  2
!   3 |  3
!   4 |  4
!   5 |  5
!   6 |   
!   7 |   
!   8 |   
!   9 |   
!  10 |   
! (10 rows)
! 
! WITH RECURSIVE
!    x(id) AS
!      (VALUES (1) UNION ALL SELECT id+1 FROM x WHERE id < 5),
!    y(id) AS
!      (VALUES (1) UNION ALL SELECT id+1 FROM x WHERE id < 10)
!  SELECT y.*, x.* FROM y LEFT JOIN x USING (id);
!  id | id 
! ----+----
!   1 |  1
!   2 |  2
!   3 |  3
!   4 |  4
!   5 |  5
!   6 |   
! (6 rows)
! 
! WITH RECURSIVE
!    x(id) AS
!      (SELECT 1 UNION ALL SELECT id+1 FROM x WHERE id < 3 ),
!    y(id) AS
!      (SELECT * FROM x UNION ALL SELECT * FROM x),
!    z(id) AS
!      (SELECT * FROM x UNION ALL SELECT id+1 FROM z WHERE id < 10)
!  SELECT * FROM z;
!  id 
! ----
!   1
!   2
!   3
!   2
!   3
!   4
!   3
!   4
!   5
!   4
!   5
!   6
!   5
!   6
!   7
!   6
!   7
!   8
!   7
!   8
!   9
!   8
!   9
!  10
!   9
!  10
!  10
! (27 rows)
! 
! WITH RECURSIVE
!    x(id) AS
!      (SELECT 1 UNION ALL SELECT id+1 FROM x WHERE id < 3 ),
!    y(id) AS
!      (SELECT * FROM x UNION ALL SELECT * FROM x),
!    z(id) AS
!      (SELECT * FROM y UNION ALL SELECT id+1 FROM z WHERE id < 10)
!  SELECT * FROM z;
!  id 
! ----
!   1
!   2
!   3
!   1
!   2
!   3
!   2
!   3
!   4
!   2
!   3
!   4
!   3
!   4
!   5
!   3
!   4
!   5
!   4
!   5
!   6
!   4
!   5
!   6
!   5
!   6
!   7
!   5
!   6
!   7
!   6
!   7
!   8
!   6
!   7
!   8
!   7
!   8
!   9
!   7
!   8
!   9
!   8
!   9
!  10
!   8
!   9
!  10
!   9
!  10
!   9
!  10
!  10
!  10
! (54 rows)
! 
! --
! -- Test WITH attached to a data-modifying statement
! --
! CREATE TEMPORARY TABLE y (a INTEGER);
! INSERT INTO y SELECT generate_series(1, 10);
! WITH t AS (
! 	SELECT a FROM y
! )
! INSERT INTO y
! SELECT a+20 FROM t RETURNING *;
!  a  
! ----
!  21
!  22
!  23
!  24
!  25
!  26
!  27
!  28
!  29
!  30
! (10 rows)
! 
! SELECT * FROM y;
!  a  
! ----
!   1
!   2
!   3
!   4
!   5
!   6
!   7
!   8
!   9
!  10
!  21
!  22
!  23
!  24
!  25
!  26
!  27
!  28
!  29
!  30
! (20 rows)
! 
! WITH t AS (
! 	SELECT a FROM y
! )
! UPDATE y SET a = y.a-10 FROM t WHERE y.a > 20 AND t.a = y.a RETURNING y.a;
!  a  
! ----
!  11
!  12
!  13
!  14
!  15
!  16
!  17
!  18
!  19
!  20
! (10 rows)
! 
! SELECT * FROM y;
!  a  
! ----
!   1
!   2
!   3
!   4
!   5
!   6
!   7
!   8
!   9
!  10
!  11
!  12
!  13
!  14
!  15
!  16
!  17
!  18
!  19
!  20
! (20 rows)
! 
! WITH RECURSIVE t(a) AS (
! 	SELECT 11
! 	UNION ALL
! 	SELECT a+1 FROM t WHERE a < 50
! )
! DELETE FROM y USING t WHERE t.a = y.a RETURNING y.a;
!  a  
! ----
!  11
!  12
!  13
!  14
!  15
!  16
!  17
!  18
!  19
!  20
! (10 rows)
! 
! SELECT * FROM y;
!  a  
! ----
!   1
!   2
!   3
!   4
!   5
!   6
!   7
!   8
!   9
!  10
! (10 rows)
! 
! DROP TABLE y;
! --
! -- error cases
! --
! -- INTERSECT
! WITH RECURSIVE x(n) AS (SELECT 1 INTERSECT SELECT n+1 FROM x)
! 	SELECT * FROM x;
! ERROR:  recursive query "x" does not have the form non-recursive-term UNION [ALL] recursive-term
! LINE 1: WITH RECURSIVE x(n) AS (SELECT 1 INTERSECT SELECT n+1 FROM x...
!                        ^
! WITH RECURSIVE x(n) AS (SELECT 1 INTERSECT ALL SELECT n+1 FROM x)
! 	SELECT * FROM x;
! ERROR:  recursive query "x" does not have the form non-recursive-term UNION [ALL] recursive-term
! LINE 1: WITH RECURSIVE x(n) AS (SELECT 1 INTERSECT ALL SELECT n+1 FR...
!                        ^
! -- EXCEPT
! WITH RECURSIVE x(n) AS (SELECT 1 EXCEPT SELECT n+1 FROM x)
! 	SELECT * FROM x;
! ERROR:  recursive query "x" does not have the form non-recursive-term UNION [ALL] recursive-term
! LINE 1: WITH RECURSIVE x(n) AS (SELECT 1 EXCEPT SELECT n+1 FROM x)
!                        ^
! WITH RECURSIVE x(n) AS (SELECT 1 EXCEPT ALL SELECT n+1 FROM x)
! 	SELECT * FROM x;
! ERROR:  recursive query "x" does not have the form non-recursive-term UNION [ALL] recursive-term
! LINE 1: WITH RECURSIVE x(n) AS (SELECT 1 EXCEPT ALL SELECT n+1 FROM ...
!                        ^
! -- no non-recursive term
! WITH RECURSIVE x(n) AS (SELECT n FROM x)
! 	SELECT * FROM x;
! ERROR:  recursive query "x" does not have the form non-recursive-term UNION [ALL] recursive-term
! LINE 1: WITH RECURSIVE x(n) AS (SELECT n FROM x)
!                        ^
! -- recursive term in the left hand side (strictly speaking, should allow this)
! WITH RECURSIVE x(n) AS (SELECT n FROM x UNION ALL SELECT 1)
! 	SELECT * FROM x;
! ERROR:  recursive reference to query "x" must not appear within its non-recursive term
! LINE 1: WITH RECURSIVE x(n) AS (SELECT n FROM x UNION ALL SELECT 1)
!                                               ^
! CREATE TEMPORARY TABLE y (a INTEGER);
! INSERT INTO y SELECT generate_series(1, 10);
! -- LEFT JOIN
! WITH RECURSIVE x(n) AS (SELECT a FROM y WHERE a = 1
! 	UNION ALL
! 	SELECT x.n+1 FROM y LEFT JOIN x ON x.n = y.a WHERE n < 10)
! SELECT * FROM x;
! ERROR:  recursive reference to query "x" must not appear within an outer join
! LINE 3:  SELECT x.n+1 FROM y LEFT JOIN x ON x.n = y.a WHERE n < 10)
!                                        ^
! -- RIGHT JOIN
! WITH RECURSIVE x(n) AS (SELECT a FROM y WHERE a = 1
! 	UNION ALL
! 	SELECT x.n+1 FROM x RIGHT JOIN y ON x.n = y.a WHERE n < 10)
! SELECT * FROM x;
! ERROR:  recursive reference to query "x" must not appear within an outer join
! LINE 3:  SELECT x.n+1 FROM x RIGHT JOIN y ON x.n = y.a WHERE n < 10)
!                            ^
! -- FULL JOIN
! WITH RECURSIVE x(n) AS (SELECT a FROM y WHERE a = 1
! 	UNION ALL
! 	SELECT x.n+1 FROM x FULL JOIN y ON x.n = y.a WHERE n < 10)
! SELECT * FROM x;
! ERROR:  recursive reference to query "x" must not appear within an outer join
! LINE 3:  SELECT x.n+1 FROM x FULL JOIN y ON x.n = y.a WHERE n < 10)
!                            ^
! -- subquery
! WITH RECURSIVE x(n) AS (SELECT 1 UNION ALL SELECT n+1 FROM x
!                           WHERE n IN (SELECT * FROM x))
!   SELECT * FROM x;
! ERROR:  recursive reference to query "x" must not appear within a subquery
! LINE 2:                           WHERE n IN (SELECT * FROM x))
!                                                             ^
! -- aggregate functions
! WITH RECURSIVE x(n) AS (SELECT 1 UNION ALL SELECT count(*) FROM x)
!   SELECT * FROM x;
! ERROR:  aggregate functions are not allowed in a recursive query's recursive term
! LINE 1: WITH RECURSIVE x(n) AS (SELECT 1 UNION ALL SELECT count(*) F...
!                                                           ^
! WITH RECURSIVE x(n) AS (SELECT 1 UNION ALL SELECT sum(n) FROM x)
!   SELECT * FROM x;
! ERROR:  aggregate functions are not allowed in a recursive query's recursive term
! LINE 1: WITH RECURSIVE x(n) AS (SELECT 1 UNION ALL SELECT sum(n) FRO...
!                                                           ^
! -- ORDER BY
! WITH RECURSIVE x(n) AS (SELECT 1 UNION ALL SELECT n+1 FROM x ORDER BY 1)
!   SELECT * FROM x;
! ERROR:  ORDER BY in a recursive query is not implemented
! LINE 1: ...VE x(n) AS (SELECT 1 UNION ALL SELECT n+1 FROM x ORDER BY 1)
!                                                                      ^
! -- LIMIT/OFFSET
! WITH RECURSIVE x(n) AS (SELECT 1 UNION ALL SELECT n+1 FROM x LIMIT 10 OFFSET 1)
!   SELECT * FROM x;
! ERROR:  OFFSET in a recursive query is not implemented
! LINE 1: ... AS (SELECT 1 UNION ALL SELECT n+1 FROM x LIMIT 10 OFFSET 1)
!                                                                      ^
! -- FOR UPDATE
! WITH RECURSIVE x(n) AS (SELECT 1 UNION ALL SELECT n+1 FROM x FOR UPDATE)
!   SELECT * FROM x;
! ERROR:  FOR UPDATE/SHARE in a recursive query is not implemented
! -- target list has a recursive query name
! WITH RECURSIVE x(id) AS (values (1)
!     UNION ALL
!     SELECT (SELECT * FROM x) FROM x WHERE id < 5
! ) SELECT * FROM x;
! ERROR:  recursive reference to query "x" must not appear within a subquery
! LINE 3:     SELECT (SELECT * FROM x) FROM x WHERE id < 5
!                                   ^
! -- mutual recursive query (not implemented)
! WITH RECURSIVE
!   x (id) AS (SELECT 1 UNION ALL SELECT id+1 FROM y WHERE id < 5),
!   y (id) AS (SELECT 1 UNION ALL SELECT id+1 FROM x WHERE id < 5)
! SELECT * FROM x;
! ERROR:  mutual recursion between WITH items is not implemented
! LINE 2:   x (id) AS (SELECT 1 UNION ALL SELECT id+1 FROM y WHERE id ...
!           ^
! -- non-linear recursion is not allowed
! WITH RECURSIVE foo(i) AS
!     (values (1)
!     UNION ALL
!        (SELECT i+1 FROM foo WHERE i < 10
!           UNION ALL
!        SELECT i+1 FROM foo WHERE i < 5)
! ) SELECT * FROM foo;
! ERROR:  recursive reference to query "foo" must not appear more than once
! LINE 6:        SELECT i+1 FROM foo WHERE i < 5)
!                                ^
! WITH RECURSIVE foo(i) AS
!     (values (1)
!     UNION ALL
! 	   SELECT * FROM
!        (SELECT i+1 FROM foo WHERE i < 10
!           UNION ALL
!        SELECT i+1 FROM foo WHERE i < 5) AS t
! ) SELECT * FROM foo;
! ERROR:  recursive reference to query "foo" must not appear more than once
! LINE 7:        SELECT i+1 FROM foo WHERE i < 5) AS t
!                                ^
! WITH RECURSIVE foo(i) AS
!     (values (1)
!     UNION ALL
!        (SELECT i+1 FROM foo WHERE i < 10
!           EXCEPT
!        SELECT i+1 FROM foo WHERE i < 5)
! ) SELECT * FROM foo;
! ERROR:  recursive reference to query "foo" must not appear within EXCEPT
! LINE 6:        SELECT i+1 FROM foo WHERE i < 5)
!                                ^
! WITH RECURSIVE foo(i) AS
!     (values (1)
!     UNION ALL
!        (SELECT i+1 FROM foo WHERE i < 10
!           INTERSECT
!        SELECT i+1 FROM foo WHERE i < 5)
! ) SELECT * FROM foo;
! ERROR:  recursive reference to query "foo" must not appear more than once
! LINE 6:        SELECT i+1 FROM foo WHERE i < 5)
!                                ^
! -- Wrong type induced from non-recursive term
! WITH RECURSIVE foo(i) AS
!    (SELECT i FROM (VALUES(1),(2)) t(i)
!    UNION ALL
!    SELECT (i+1)::numeric(10,0) FROM foo WHERE i < 10)
! SELECT * FROM foo;
! ERROR:  recursive query "foo" column 1 has type integer in non-recursive term but type numeric overall
! LINE 2:    (SELECT i FROM (VALUES(1),(2)) t(i)
!                    ^
! HINT:  Cast the output of the non-recursive term to the correct type.
! -- rejects different typmod, too (should we allow this?)
! WITH RECURSIVE foo(i) AS
!    (SELECT i::numeric(3,0) FROM (VALUES(1),(2)) t(i)
!    UNION ALL
!    SELECT (i+1)::numeric(10,0) FROM foo WHERE i < 10)
! SELECT * FROM foo;
! ERROR:  recursive query "foo" column 1 has type numeric(3,0) in non-recursive term but type numeric overall
! LINE 2:    (SELECT i::numeric(3,0) FROM (VALUES(1),(2)) t(i)
!                    ^
! HINT:  Cast the output of the non-recursive term to the correct type.
! -- disallow OLD/NEW reference in CTE
! CREATE TEMPORARY TABLE x (n integer);
! CREATE RULE r2 AS ON UPDATE TO x DO INSTEAD
!     WITH t AS (SELECT OLD.*) UPDATE y SET a = t.n FROM t;
! ERROR:  cannot refer to OLD within WITH query
! --
! -- test for bug #4902
! --
! with cte(foo) as ( values(42) ) values((select foo from cte));
!  column1 
! ---------
!       42
! (1 row)
! 
! with cte(foo) as ( select 42 ) select * from ((select foo from cte)) q;
!  foo 
! -----
!   42
! (1 row)
! 
! -- test CTE referencing an outer-level variable (to see that changed-parameter
! -- signaling still works properly after fixing this bug)
! select ( with cte(foo) as ( values(f1) )
!          select (select foo from cte) )
! from int4_tbl;
!      foo     
! -------------
!            0
!       123456
!      -123456
!   2147483647
!  -2147483647
! (5 rows)
! 
! select ( with cte(foo) as ( values(f1) )
!           values((select foo from cte)) )
! from int4_tbl;
!    column1   
! -------------
!            0
!       123456
!      -123456
!   2147483647
!  -2147483647
! (5 rows)
! 
! --
! -- test for nested-recursive-WITH bug
! --
! WITH RECURSIVE t(j) AS (
!     WITH RECURSIVE s(i) AS (
!         VALUES (1)
!         UNION ALL
!         SELECT i+1 FROM s WHERE i < 10
!     )
!     SELECT i FROM s
!     UNION ALL
!     SELECT j+1 FROM t WHERE j < 10
! )
! SELECT * FROM t;
!  j  
! ----
!   1
!   2
!   3
!   4
!   5
!   6
!   7
!   8
!   9
!  10
!   2
!   3
!   4
!   5
!   6
!   7
!   8
!   9
!  10
!   3
!   4
!   5
!   6
!   7
!   8
!   9
!  10
!   4
!   5
!   6
!   7
!   8
!   9
!  10
!   5
!   6
!   7
!   8
!   9
!  10
!   6
!   7
!   8
!   9
!  10
!   7
!   8
!   9
!  10
!   8
!   9
!  10
!   9
!  10
!  10
! (55 rows)
! 
! --
! -- test WITH attached to intermediate-level set operation
! --
! WITH outermost(x) AS (
!   SELECT 1
!   UNION (WITH innermost as (SELECT 2)
!          SELECT * FROM innermost
!          UNION SELECT 3)
! )
! SELECT * FROM outermost;
!  x 
! ---
!  1
!  2
!  3
! (3 rows)
! 
! WITH outermost(x) AS (
!   SELECT 1
!   UNION (WITH innermost as (SELECT 2)
!          SELECT * FROM outermost  -- fail
!          UNION SELECT * FROM innermost)
! )
! SELECT * FROM outermost;
! ERROR:  relation "outermost" does not exist
! LINE 4:          SELECT * FROM outermost  
!                                ^
! DETAIL:  There is a WITH item named "outermost", but it cannot be referenced from this part of the query.
! HINT:  Use WITH RECURSIVE, or re-order the WITH items to remove forward references.
! WITH RECURSIVE outermost(x) AS (
!   SELECT 1
!   UNION (WITH innermost as (SELECT 2)
!          SELECT * FROM outermost
!          UNION SELECT * FROM innermost)
! )
! SELECT * FROM outermost;
!  x 
! ---
!  1
!  2
! (2 rows)
! 
! WITH RECURSIVE outermost(x) AS (
!   WITH innermost as (SELECT 2 FROM outermost) -- fail
!     SELECT * FROM innermost
!     UNION SELECT * from outermost
! )
! SELECT * FROM outermost;
! ERROR:  recursive reference to query "outermost" must not appear within a subquery
! LINE 2:   WITH innermost as (SELECT 2 FROM outermost) 
!                                            ^
! --
! -- This test will fail with the old implementation of PARAM_EXEC parameter
! -- assignment, because the "q1" Var passed down to A's targetlist subselect
! -- looks exactly like the "A.id" Var passed down to C's subselect, causing
! -- the old code to give them the same runtime PARAM_EXEC slot.  But the
! -- lifespans of the two parameters overlap, thanks to B also reading A.
! --
! with
! A as ( select q2 as id, (select q1) as x from int8_tbl ),
! B as ( select id, row_number() over (partition by id) as r from A ),
! C as ( select A.id, array(select B.id from B where B.id = A.id) from A )
! select * from C;
!         id         |                array                
! -------------------+-------------------------------------
!                456 | {456}
!   4567890123456789 | {4567890123456789,4567890123456789}
!                123 | {123}
!   4567890123456789 | {4567890123456789,4567890123456789}
!  -4567890123456789 | {-4567890123456789}
! (5 rows)
! 
! --
! -- Test CTEs read in non-initialization orders
! --
! WITH RECURSIVE
!   tab(id_key,link) AS (VALUES (1,17), (2,17), (3,17), (4,17), (6,17), (5,17)),
!   iter (id_key, row_type, link) AS (
!       SELECT 0, 'base', 17
!     UNION ALL (
!       WITH remaining(id_key, row_type, link, min) AS (
!         SELECT tab.id_key, 'true'::text, iter.link, MIN(tab.id_key) OVER ()
!         FROM tab INNER JOIN iter USING (link)
!         WHERE tab.id_key > iter.id_key
!       ),
!       first_remaining AS (
!         SELECT id_key, row_type, link
!         FROM remaining
!         WHERE id_key=min
!       ),
!       effect AS (
!         SELECT tab.id_key, 'new'::text, tab.link
!         FROM first_remaining e INNER JOIN tab ON e.id_key=tab.id_key
!         WHERE e.row_type = 'false'
!       )
!       SELECT * FROM first_remaining
!       UNION ALL SELECT * FROM effect
!     )
!   )
! SELECT * FROM iter;
!  id_key | row_type | link 
! --------+----------+------
!       0 | base     |   17
!       1 | true     |   17
!       2 | true     |   17
!       3 | true     |   17
!       4 | true     |   17
!       5 | true     |   17
!       6 | true     |   17
! (7 rows)
! 
! WITH RECURSIVE
!   tab(id_key,link) AS (VALUES (1,17), (2,17), (3,17), (4,17), (6,17), (5,17)),
!   iter (id_key, row_type, link) AS (
!       SELECT 0, 'base', 17
!     UNION (
!       WITH remaining(id_key, row_type, link, min) AS (
!         SELECT tab.id_key, 'true'::text, iter.link, MIN(tab.id_key) OVER ()
!         FROM tab INNER JOIN iter USING (link)
!         WHERE tab.id_key > iter.id_key
!       ),
!       first_remaining AS (
!         SELECT id_key, row_type, link
!         FROM remaining
!         WHERE id_key=min
!       ),
!       effect AS (
!         SELECT tab.id_key, 'new'::text, tab.link
!         FROM first_remaining e INNER JOIN tab ON e.id_key=tab.id_key
!         WHERE e.row_type = 'false'
!       )
!       SELECT * FROM first_remaining
!       UNION ALL SELECT * FROM effect
!     )
!   )
! SELECT * FROM iter;
!  id_key | row_type | link 
! --------+----------+------
!       0 | base     |   17
!       1 | true     |   17
!       2 | true     |   17
!       3 | true     |   17
!       4 | true     |   17
!       5 | true     |   17
!       6 | true     |   17
! (7 rows)
! 
! --
! -- Data-modifying statements in WITH
! --
! -- INSERT ... RETURNING
! WITH t AS (
!     INSERT INTO y
!     VALUES
!         (11),
!         (12),
!         (13),
!         (14),
!         (15),
!         (16),
!         (17),
!         (18),
!         (19),
!         (20)
!     RETURNING *
! )
! SELECT * FROM t;
!  a  
! ----
!  11
!  12
!  13
!  14
!  15
!  16
!  17
!  18
!  19
!  20
! (10 rows)
! 
! SELECT * FROM y;
!  a  
! ----
!   1
!   2
!   3
!   4
!   5
!   6
!   7
!   8
!   9
!  10
!  11
!  12
!  13
!  14
!  15
!  16
!  17
!  18
!  19
!  20
! (20 rows)
! 
! -- UPDATE ... RETURNING
! WITH t AS (
!     UPDATE y
!     SET a=a+1
!     RETURNING *
! )
! SELECT * FROM t;
!  a  
! ----
!   2
!   3
!   4
!   5
!   6
!   7
!   8
!   9
!  10
!  11
!  12
!  13
!  14
!  15
!  16
!  17
!  18
!  19
!  20
!  21
! (20 rows)
! 
! SELECT * FROM y;
!  a  
! ----
!   2
!   3
!   4
!   5
!   6
!   7
!   8
!   9
!  10
!  11
!  12
!  13
!  14
!  15
!  16
!  17
!  18
!  19
!  20
!  21
! (20 rows)
! 
! -- DELETE ... RETURNING
! WITH t AS (
!     DELETE FROM y
!     WHERE a <= 10
!     RETURNING *
! )
! SELECT * FROM t;
!  a  
! ----
!   2
!   3
!   4
!   5
!   6
!   7
!   8
!   9
!  10
! (9 rows)
! 
! SELECT * FROM y;
!  a  
! ----
!  11
!  12
!  13
!  14
!  15
!  16
!  17
!  18
!  19
!  20
!  21
! (11 rows)
! 
! -- forward reference
! WITH RECURSIVE t AS (
! 	INSERT INTO y
! 		SELECT a+5 FROM t2 WHERE a > 5
! 	RETURNING *
! ), t2 AS (
! 	UPDATE y SET a=a-11 RETURNING *
! )
! SELECT * FROM t
! UNION ALL
! SELECT * FROM t2;
!  a  
! ----
!  11
!  12
!  13
!  14
!  15
!   0
!   1
!   2
!   3
!   4
!   5
!   6
!   7
!   8
!   9
!  10
! (16 rows)
! 
! SELECT * FROM y;
!  a  
! ----
!   0
!   1
!   2
!   3
!   4
!   5
!   6
!  11
!   7
!  12
!   8
!  13
!   9
!  14
!  10
!  15
! (16 rows)
! 
! -- unconditional DO INSTEAD rule
! CREATE RULE y_rule AS ON DELETE TO y DO INSTEAD
!   INSERT INTO y VALUES(42) RETURNING *;
! WITH t AS (
! 	DELETE FROM y RETURNING *
! )
! SELECT * FROM t;
!  a  
! ----
!  42
! (1 row)
! 
! SELECT * FROM y;
!  a  
! ----
!   0
!   1
!   2
!   3
!   4
!   5
!   6
!  11
!   7
!  12
!   8
!  13
!   9
!  14
!  10
!  15
!  42
! (17 rows)
! 
! DROP RULE y_rule ON y;
! -- check merging of outer CTE with CTE in a rule action
! CREATE TEMP TABLE bug6051 AS
!   select i from generate_series(1,3) as t(i);
! SELECT * FROM bug6051;
!  i 
! ---
!  1
!  2
!  3
! (3 rows)
! 
! WITH t1 AS ( DELETE FROM bug6051 RETURNING * )
! INSERT INTO bug6051 SELECT * FROM t1;
! SELECT * FROM bug6051;
!  i 
! ---
!  1
!  2
!  3
! (3 rows)
! 
! CREATE TEMP TABLE bug6051_2 (i int);
! CREATE RULE bug6051_ins AS ON INSERT TO bug6051 DO INSTEAD
!  INSERT INTO bug6051_2
!  SELECT NEW.i;
! WITH t1 AS ( DELETE FROM bug6051 RETURNING * )
! INSERT INTO bug6051 SELECT * FROM t1;
! SELECT * FROM bug6051;
!  i 
! ---
! (0 rows)
! 
! SELECT * FROM bug6051_2;
!  i 
! ---
!  1
!  2
!  3
! (3 rows)
! 
! -- a truly recursive CTE in the same list
! WITH RECURSIVE t(a) AS (
! 	SELECT 0
! 		UNION ALL
! 	SELECT a+1 FROM t WHERE a+1 < 5
! ), t2 as (
! 	INSERT INTO y
! 		SELECT * FROM t RETURNING *
! )
! SELECT * FROM t2 JOIN y USING (a) ORDER BY a;
!  a 
! ---
!  0
!  1
!  2
!  3
!  4
! (5 rows)
! 
! SELECT * FROM y;
!  a  
! ----
!   0
!   1
!   2
!   3
!   4
!   5
!   6
!  11
!   7
!  12
!   8
!  13
!   9
!  14
!  10
!  15
!  42
!   0
!   1
!   2
!   3
!   4
! (22 rows)
! 
! -- data-modifying WITH in a modifying statement
! WITH t AS (
!     DELETE FROM y
!     WHERE a <= 10
!     RETURNING *
! )
! INSERT INTO y SELECT -a FROM t RETURNING *;
!   a  
! -----
!    0
!   -1
!   -2
!   -3
!   -4
!   -5
!   -6
!   -7
!   -8
!   -9
!  -10
!    0
!   -1
!   -2
!   -3
!   -4
! (16 rows)
! 
! SELECT * FROM y;
!   a  
! -----
!   11
!   12
!   13
!   14
!   15
!   42
!    0
!   -1
!   -2
!   -3
!   -4
!   -5
!   -6
!   -7
!   -8
!   -9
!  -10
!    0
!   -1
!   -2
!   -3
!   -4
! (22 rows)
! 
! -- check that WITH query is run to completion even if outer query isn't
! WITH t AS (
!     UPDATE y SET a = a * 100 RETURNING *
! )
! SELECT * FROM t LIMIT 10;
!   a   
! ------
!  1100
!  1200
!  1300
!  1400
!  1500
!  4200
!     0
!  -100
!  -200
!  -300
! (10 rows)
! 
! SELECT * FROM y;
!    a   
! -------
!   1100
!   1200
!   1300
!   1400
!   1500
!   4200
!      0
!   -100
!   -200
!   -300
!   -400
!   -500
!   -600
!   -700
!   -800
!   -900
!  -1000
!      0
!   -100
!   -200
!   -300
!   -400
! (22 rows)
! 
! -- data-modifying WITH containing INSERT...ON CONFLICT DO UPDATE
! CREATE TABLE z AS SELECT i AS k, (i || ' v')::text v FROM generate_series(1, 16, 3) i;
! ALTER TABLE z ADD UNIQUE (k);
! WITH t AS (
!     INSERT INTO z SELECT i, 'insert'
!     FROM generate_series(0, 16) i
!     ON CONFLICT (k) DO UPDATE SET v = z.v || ', now update'
!     RETURNING *
! )
! SELECT * FROM t JOIN y ON t.k = y.a ORDER BY a, k;
!  k |   v    | a 
! ---+--------+---
!  0 | insert | 0
!  0 | insert | 0
! (2 rows)
! 
! -- Test EXCLUDED.* reference within CTE
! WITH aa AS (
!     INSERT INTO z VALUES(1, 5) ON CONFLICT (k) DO UPDATE SET v = EXCLUDED.v
!     WHERE z.k != EXCLUDED.k
!     RETURNING *
! )
! SELECT * FROM aa;
!  k | v 
! ---+---
! (0 rows)
! 
! -- New query/snapshot demonstrates side-effects of previous query.
! SELECT * FROM z ORDER BY k;
!  k  |        v         
! ----+------------------
!   0 | insert
!   1 | 1 v, now update
!   2 | insert
!   3 | insert
!   4 | 4 v, now update
!   5 | insert
!   6 | insert
!   7 | 7 v, now update
!   8 | insert
!   9 | insert
!  10 | 10 v, now update
!  11 | insert
!  12 | insert
!  13 | 13 v, now update
!  14 | insert
!  15 | insert
!  16 | 16 v, now update
! (17 rows)
! 
! --
! -- Ensure subqueries within the update clause work, even if they
! -- reference outside values
! --
! WITH aa AS (SELECT 1 a, 2 b)
! INSERT INTO z VALUES(1, 'insert')
! ON CONFLICT (k) DO UPDATE SET v = (SELECT b || ' update' FROM aa WHERE a = 1 LIMIT 1);
! WITH aa AS (SELECT 1 a, 2 b)
! INSERT INTO z VALUES(1, 'insert')
! ON CONFLICT (k) DO UPDATE SET v = ' update' WHERE z.k = (SELECT a FROM aa);
! WITH aa AS (SELECT 1 a, 2 b)
! INSERT INTO z VALUES(1, 'insert')
! ON CONFLICT (k) DO UPDATE SET v = (SELECT b || ' update' FROM aa WHERE a = 1 LIMIT 1);
! WITH aa AS (SELECT 'a' a, 'b' b UNION ALL SELECT 'a' a, 'b' b)
! INSERT INTO z VALUES(1, 'insert')
! ON CONFLICT (k) DO UPDATE SET v = (SELECT b || ' update' FROM aa WHERE a = 'a' LIMIT 1);
! WITH aa AS (SELECT 1 a, 2 b)
! INSERT INTO z VALUES(1, (SELECT b || ' insert' FROM aa WHERE a = 1 ))
! ON CONFLICT (k) DO UPDATE SET v = (SELECT b || ' update' FROM aa WHERE a = 1 LIMIT 1);
! -- Update a row more than once, in different parts of a wCTE. That is
! -- an allowed, presumably very rare, edge case, but since it was
! -- broken in the past, having a test seems worthwhile.
! WITH simpletup AS (
!   SELECT 2 k, 'Green' v),
! upsert_cte AS (
!   INSERT INTO z VALUES(2, 'Blue') ON CONFLICT (k) DO
!     UPDATE SET (k, v) = (SELECT k, v FROM simpletup WHERE simpletup.k = z.k)
!     RETURNING k, v)
! INSERT INTO z VALUES(2, 'Red') ON CONFLICT (k) DO
! UPDATE SET (k, v) = (SELECT k, v FROM upsert_cte WHERE upsert_cte.k = z.k)
! RETURNING k, v;
!  k | v 
! ---+---
! (0 rows)
! 
! DROP TABLE z;
! -- check that run to completion happens in proper ordering
! TRUNCATE TABLE y;
! INSERT INTO y SELECT generate_series(1, 3);
! CREATE TEMPORARY TABLE yy (a INTEGER);
! WITH RECURSIVE t1 AS (
!   INSERT INTO y SELECT * FROM y RETURNING *
! ), t2 AS (
!   INSERT INTO yy SELECT * FROM t1 RETURNING *
! )
! SELECT 1;
!  ?column? 
! ----------
!         1
! (1 row)
! 
! SELECT * FROM y;
!  a 
! ---
!  1
!  2
!  3
!  1
!  2
!  3
! (6 rows)
! 
! SELECT * FROM yy;
!  a 
! ---
!  1
!  2
!  3
! (3 rows)
! 
! WITH RECURSIVE t1 AS (
!   INSERT INTO yy SELECT * FROM t2 RETURNING *
! ), t2 AS (
!   INSERT INTO y SELECT * FROM y RETURNING *
! )
! SELECT 1;
!  ?column? 
! ----------
!         1
! (1 row)
! 
! SELECT * FROM y;
!  a 
! ---
!  1
!  2
!  3
!  1
!  2
!  3
!  1
!  2
!  3
!  1
!  2
!  3
! (12 rows)
! 
! SELECT * FROM yy;
!  a 
! ---
!  1
!  2
!  3
!  1
!  2
!  3
!  1
!  2
!  3
! (9 rows)
! 
! -- triggers
! TRUNCATE TABLE y;
! INSERT INTO y SELECT generate_series(1, 10);
! CREATE FUNCTION y_trigger() RETURNS trigger AS $$
! begin
!   raise notice 'y_trigger: a = %', new.a;
!   return new;
! end;
! $$ LANGUAGE plpgsql;
! CREATE TRIGGER y_trig BEFORE INSERT ON y FOR EACH ROW
!     EXECUTE PROCEDURE y_trigger();
! WITH t AS (
!     INSERT INTO y
!     VALUES
!         (21),
!         (22),
!         (23)
!     RETURNING *
! )
! SELECT * FROM t;
! NOTICE:  y_trigger: a = 21
! NOTICE:  y_trigger: a = 22
! NOTICE:  y_trigger: a = 23
!  a  
! ----
!  21
!  22
!  23
! (3 rows)
! 
! SELECT * FROM y;
!  a  
! ----
!   1
!   2
!   3
!   4
!   5
!   6
!   7
!   8
!   9
!  10
!  21
!  22
!  23
! (13 rows)
! 
! DROP TRIGGER y_trig ON y;
! CREATE TRIGGER y_trig AFTER INSERT ON y FOR EACH ROW
!     EXECUTE PROCEDURE y_trigger();
! WITH t AS (
!     INSERT INTO y
!     VALUES
!         (31),
!         (32),
!         (33)
!     RETURNING *
! )
! SELECT * FROM t LIMIT 1;
! NOTICE:  y_trigger: a = 31
! NOTICE:  y_trigger: a = 32
! NOTICE:  y_trigger: a = 33
!  a  
! ----
!  31
! (1 row)
! 
! SELECT * FROM y;
!  a  
! ----
!   1
!   2
!   3
!   4
!   5
!   6
!   7
!   8
!   9
!  10
!  21
!  22
!  23
!  31
!  32
!  33
! (16 rows)
! 
! DROP TRIGGER y_trig ON y;
! CREATE OR REPLACE FUNCTION y_trigger() RETURNS trigger AS $$
! begin
!   raise notice 'y_trigger';
!   return null;
! end;
! $$ LANGUAGE plpgsql;
! CREATE TRIGGER y_trig AFTER INSERT ON y FOR EACH STATEMENT
!     EXECUTE PROCEDURE y_trigger();
! WITH t AS (
!     INSERT INTO y
!     VALUES
!         (41),
!         (42),
!         (43)
!     RETURNING *
! )
! SELECT * FROM t;
! NOTICE:  y_trigger
!  a  
! ----
!  41
!  42
!  43
! (3 rows)
! 
! SELECT * FROM y;
!  a  
! ----
!   1
!   2
!   3
!   4
!   5
!   6
!   7
!   8
!   9
!  10
!  21
!  22
!  23
!  31
!  32
!  33
!  41
!  42
!  43
! (19 rows)
! 
! DROP TRIGGER y_trig ON y;
! DROP FUNCTION y_trigger();
! -- WITH attached to inherited UPDATE or DELETE
! CREATE TEMP TABLE parent ( id int, val text );
! CREATE TEMP TABLE child1 ( ) INHERITS ( parent );
! CREATE TEMP TABLE child2 ( ) INHERITS ( parent );
! INSERT INTO parent VALUES ( 1, 'p1' );
! INSERT INTO child1 VALUES ( 11, 'c11' ),( 12, 'c12' );
! INSERT INTO child2 VALUES ( 23, 'c21' ),( 24, 'c22' );
! WITH rcte AS ( SELECT sum(id) AS totalid FROM parent )
! UPDATE parent SET id = id + totalid FROM rcte;
! SELECT * FROM parent;
!  id | val 
! ----+-----
!  72 | p1
!  82 | c11
!  83 | c12
!  94 | c21
!  95 | c22
! (5 rows)
! 
! WITH wcte AS ( INSERT INTO child1 VALUES ( 42, 'new' ) RETURNING id AS newid )
! UPDATE parent SET id = id + newid FROM wcte;
! SELECT * FROM parent;
!  id  | val 
! -----+-----
!  114 | p1
!   42 | new
!  124 | c11
!  125 | c12
!  136 | c21
!  137 | c22
! (6 rows)
! 
! WITH rcte AS ( SELECT max(id) AS maxid FROM parent )
! DELETE FROM parent USING rcte WHERE id = maxid;
! SELECT * FROM parent;
!  id  | val 
! -----+-----
!  114 | p1
!   42 | new
!  124 | c11
!  125 | c12
!  136 | c21
! (5 rows)
! 
! WITH wcte AS ( INSERT INTO child2 VALUES ( 42, 'new2' ) RETURNING id AS newid )
! DELETE FROM parent USING wcte WHERE id = newid;
! SELECT * FROM parent;
!  id  | val  
! -----+------
!  114 | p1
!  124 | c11
!  125 | c12
!  136 | c21
!   42 | new2
! (5 rows)
! 
! -- check EXPLAIN VERBOSE for a wCTE with RETURNING
! EXPLAIN (VERBOSE, COSTS OFF)
! WITH wcte AS ( INSERT INTO int8_tbl VALUES ( 42, 47 ) RETURNING q2 )
! DELETE FROM a USING wcte WHERE aa = q2;
!                      QUERY PLAN                     
! ----------------------------------------------------
!  Delete on public.a
!    Delete on public.a
!    Delete on public.b
!    Delete on public.c
!    Delete on public.d
!    CTE wcte
!      ->  Insert on public.int8_tbl
!            Output: int8_tbl.q2
!            ->  Result
!                  Output: '42'::bigint, '47'::bigint
!    ->  Nested Loop
!          Output: a.ctid, wcte.*
!          Join Filter: (a.aa = wcte.q2)
!          ->  Seq Scan on public.a
!                Output: a.ctid, a.aa
!          ->  CTE Scan on wcte
!                Output: wcte.*, wcte.q2
!    ->  Nested Loop
!          Output: b.ctid, wcte.*
!          Join Filter: (b.aa = wcte.q2)
!          ->  Seq Scan on public.b
!                Output: b.ctid, b.aa
!          ->  CTE Scan on wcte
!                Output: wcte.*, wcte.q2
!    ->  Nested Loop
!          Output: c.ctid, wcte.*
!          Join Filter: (c.aa = wcte.q2)
!          ->  Seq Scan on public.c
!                Output: c.ctid, c.aa
!          ->  CTE Scan on wcte
!                Output: wcte.*, wcte.q2
!    ->  Nested Loop
!          Output: d.ctid, wcte.*
!          Join Filter: (d.aa = wcte.q2)
!          ->  Seq Scan on public.d
!                Output: d.ctid, d.aa
!          ->  CTE Scan on wcte
!                Output: wcte.*, wcte.q2
! (38 rows)
! 
! -- error cases
! -- data-modifying WITH tries to use its own output
! WITH RECURSIVE t AS (
! 	INSERT INTO y
! 		SELECT * FROM t
! )
! VALUES(FALSE);
! ERROR:  recursive query "t" must not contain data-modifying statements
! LINE 1: WITH RECURSIVE t AS (
!                        ^
! -- no RETURNING in a referenced data-modifying WITH
! WITH t AS (
! 	INSERT INTO y VALUES(0)
! )
! SELECT * FROM t;
! ERROR:  WITH query "t" does not have a RETURNING clause
! LINE 4: SELECT * FROM t;
!                       ^
! -- data-modifying WITH allowed only at the top level
! SELECT * FROM (
! 	WITH t AS (UPDATE y SET a=a+1 RETURNING *)
! 	SELECT * FROM t
! ) ss;
! ERROR:  WITH clause containing a data-modifying statement must be at the top level
! LINE 2:  WITH t AS (UPDATE y SET a=a+1 RETURNING *)
!               ^
! -- most variants of rules aren't allowed
! CREATE RULE y_rule AS ON INSERT TO y WHERE a=0 DO INSTEAD DELETE FROM y;
! WITH t AS (
! 	INSERT INTO y VALUES(0)
! )
! VALUES(FALSE);
! ERROR:  conditional DO INSTEAD rules are not supported for data-modifying statements in WITH
! DROP RULE y_rule ON y;
! -- check that parser lookahead for WITH doesn't cause any odd behavior
! create table foo (with baz);  -- fail, WITH is a reserved word
! ERROR:  syntax error at or near "with"
! LINE 1: create table foo (with baz);
!                           ^
! create table foo (with ordinality);  -- fail, WITH is a reserved word
! ERROR:  syntax error at or near "with"
! LINE 1: create table foo (with ordinality);
!                           ^
! with ordinality as (select 1 as x) select * from ordinality;
!  x 
! ---
!  1
! (1 row)
! 
--- 1 ----
! psql: FATAL:  the database system is in recovery mode

======================================================================

*** /home/claudiofreire/src/postgresql.work/src/test/regress/expected/xml_1.out	2016-09-05 20:45:49.140033814 -0300
--- /home/claudiofreire/src/postgresql.work/src/test/regress/results/xml.out	2016-09-12 12:14:51.887413916 -0300
***************
*** 1,829 ****
! CREATE TABLE xmltest (
!     id int,
!     data xml
! );
! INSERT INTO xmltest VALUES (1, '<value>one</value>');
! ERROR:  unsupported XML feature
! LINE 1: INSERT INTO xmltest VALUES (1, '<value>one</value>');
!                                        ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! INSERT INTO xmltest VALUES (2, '<value>two</value>');
! ERROR:  unsupported XML feature
! LINE 1: INSERT INTO xmltest VALUES (2, '<value>two</value>');
!                                        ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! INSERT INTO xmltest VALUES (3, '<wrong');
! ERROR:  unsupported XML feature
! LINE 1: INSERT INTO xmltest VALUES (3, '<wrong');
!                                        ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT * FROM xmltest;
!  id | data 
! ----+------
! (0 rows)
! 
! SELECT xmlcomment('test');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlcomment('-test');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlcomment('test-');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlcomment('--test');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlcomment('te st');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlconcat(xmlcomment('hello'),
!                  xmlelement(NAME qux, 'foo'),
!                  xmlcomment('world'));
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlconcat('hello', 'you');
! ERROR:  unsupported XML feature
! LINE 1: SELECT xmlconcat('hello', 'you');
!                          ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlconcat(1, 2);
! ERROR:  argument of XMLCONCAT must be type xml, not type integer
! LINE 1: SELECT xmlconcat(1, 2);
!                          ^
! SELECT xmlconcat('bad', '<syntax');
! ERROR:  unsupported XML feature
! LINE 1: SELECT xmlconcat('bad', '<syntax');
!                          ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlconcat('<foo/>', NULL, '<?xml version="1.1" standalone="no"?><bar/>');
! ERROR:  unsupported XML feature
! LINE 1: SELECT xmlconcat('<foo/>', NULL, '<?xml version="1.1" standa...
!                          ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlconcat('<?xml version="1.1"?><foo/>', NULL, '<?xml version="1.1" standalone="no"?><bar/>');
! ERROR:  unsupported XML feature
! LINE 1: SELECT xmlconcat('<?xml version="1.1"?><foo/>', NULL, '<?xml...
!                          ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlconcat(NULL);
!  xmlconcat 
! -----------
!  
! (1 row)
! 
! SELECT xmlconcat(NULL, NULL);
!  xmlconcat 
! -----------
!  
! (1 row)
! 
! SELECT xmlelement(name element,
!                   xmlattributes (1 as one, 'deuce' as two),
!                   'content');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlelement(name element,
!                   xmlattributes ('unnamed and wrong'));
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlelement(name element, xmlelement(name nested, 'stuff'));
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlelement(name employee, xmlforest(name, age, salary as pay)) FROM emp;
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlelement(name duplicate, xmlattributes(1 as a, 2 as b, 3 as a));
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlelement(name num, 37);
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlelement(name foo, text 'bar');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlelement(name foo, xml 'bar');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlelement(name foo, text 'b<a/>r');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlelement(name foo, xml 'b<a/>r');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlelement(name foo, array[1, 2, 3]);
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SET xmlbinary TO base64;
! SELECT xmlelement(name foo, bytea 'bar');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SET xmlbinary TO hex;
! SELECT xmlelement(name foo, bytea 'bar');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlelement(name foo, xmlattributes(true as bar));
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlelement(name foo, xmlattributes('2009-04-09 00:24:37'::timestamp as bar));
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlelement(name foo, xmlattributes('infinity'::timestamp as bar));
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlelement(name foo, xmlattributes('<>&"''' as funny, xml 'b<a/>r' as funnier));
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlparse(content '');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlparse(content '  ');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlparse(content 'abc');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlparse(content '<abc>x</abc>');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlparse(content '<invalidentity>&</invalidentity>');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlparse(content '<undefinedentity>&idontexist;</undefinedentity>');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlparse(content '<invalidns xmlns=''&lt;''/>');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlparse(content '<relativens xmlns=''relative''/>');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlparse(content '<twoerrors>&idontexist;</unbalanced>');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlparse(content '<nosuchprefix:tag/>');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlparse(document '   ');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlparse(document 'abc');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlparse(document '<abc>x</abc>');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlparse(document '<invalidentity>&</abc>');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlparse(document '<undefinedentity>&idontexist;</abc>');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlparse(document '<invalidns xmlns=''&lt;''/>');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlparse(document '<relativens xmlns=''relative''/>');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlparse(document '<twoerrors>&idontexist;</unbalanced>');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlparse(document '<nosuchprefix:tag/>');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlpi(name foo);
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlpi(name xml);
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlpi(name xmlstuff);
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlpi(name foo, 'bar');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlpi(name foo, 'in?>valid');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlpi(name foo, null);
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlpi(name xml, null);
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlpi(name xmlstuff, null);
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlpi(name "xml-stylesheet", 'href="mystyle.css" type="text/css"');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlpi(name foo, '   bar');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlroot(xml '<foo/>', version no value, standalone no value);
! ERROR:  unsupported XML feature
! LINE 1: SELECT xmlroot(xml '<foo/>', version no value, standalone no...
!                            ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlroot(xml '<foo/>', version '2.0');
! ERROR:  unsupported XML feature
! LINE 1: SELECT xmlroot(xml '<foo/>', version '2.0');
!                            ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlroot(xml '<foo/>', version no value, standalone yes);
! ERROR:  unsupported XML feature
! LINE 1: SELECT xmlroot(xml '<foo/>', version no value, standalone ye...
!                            ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlroot(xml '<?xml version="1.1"?><foo/>', version no value, standalone yes);
! ERROR:  unsupported XML feature
! LINE 1: SELECT xmlroot(xml '<?xml version="1.1"?><foo/>', version no...
!                            ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlroot(xmlroot(xml '<foo/>', version '1.0'), version '1.1', standalone no);
! ERROR:  unsupported XML feature
! LINE 1: SELECT xmlroot(xmlroot(xml '<foo/>', version '1.0'), version...
!                                    ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlroot('<?xml version="1.1" standalone="yes"?><foo/>', version no value, standalone no);
! ERROR:  unsupported XML feature
! LINE 1: SELECT xmlroot('<?xml version="1.1" standalone="yes"?><foo/>...
!                        ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlroot('<?xml version="1.1" standalone="yes"?><foo/>', version no value, standalone no value);
! ERROR:  unsupported XML feature
! LINE 1: SELECT xmlroot('<?xml version="1.1" standalone="yes"?><foo/>...
!                        ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlroot('<?xml version="1.1" standalone="yes"?><foo/>', version no value);
! ERROR:  unsupported XML feature
! LINE 1: SELECT xmlroot('<?xml version="1.1" standalone="yes"?><foo/>...
!                        ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlroot (
!   xmlelement (
!     name gazonk,
!     xmlattributes (
!       'val' AS name,
!       1 + 1 AS num
!     ),
!     xmlelement (
!       NAME qux,
!       'foo'
!     )
!   ),
!   version '1.0',
!   standalone yes
! );
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlserialize(content data as character varying(20)) FROM xmltest;
!  xmlserialize 
! --------------
! (0 rows)
! 
! SELECT xmlserialize(content 'good' as char(10));
! ERROR:  unsupported XML feature
! LINE 1: SELECT xmlserialize(content 'good' as char(10));
!                                     ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlserialize(document 'bad' as text);
! ERROR:  unsupported XML feature
! LINE 1: SELECT xmlserialize(document 'bad' as text);
!                                      ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xml '<foo>bar</foo>' IS DOCUMENT;
! ERROR:  unsupported XML feature
! LINE 1: SELECT xml '<foo>bar</foo>' IS DOCUMENT;
!                    ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xml '<foo>bar</foo><bar>foo</bar>' IS DOCUMENT;
! ERROR:  unsupported XML feature
! LINE 1: SELECT xml '<foo>bar</foo><bar>foo</bar>' IS DOCUMENT;
!                    ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xml '<abc/>' IS NOT DOCUMENT;
! ERROR:  unsupported XML feature
! LINE 1: SELECT xml '<abc/>' IS NOT DOCUMENT;
!                    ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xml 'abc' IS NOT DOCUMENT;
! ERROR:  unsupported XML feature
! LINE 1: SELECT xml 'abc' IS NOT DOCUMENT;
!                    ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT '<>' IS NOT DOCUMENT;
! ERROR:  unsupported XML feature
! LINE 1: SELECT '<>' IS NOT DOCUMENT;
!                ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlagg(data) FROM xmltest;
!  xmlagg 
! --------
!  
! (1 row)
! 
! SELECT xmlagg(data) FROM xmltest WHERE id > 10;
!  xmlagg 
! --------
!  
! (1 row)
! 
! SELECT xmlelement(name employees, xmlagg(xmlelement(name name, name))) FROM emp;
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! -- Check mapping SQL identifier to XML name
! SELECT xmlpi(name ":::_xml_abc135.%-&_");
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlpi(name "123");
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! PREPARE foo (xml) AS SELECT xmlconcat('<foo/>', $1);
! ERROR:  unsupported XML feature
! LINE 1: PREPARE foo (xml) AS SELECT xmlconcat('<foo/>', $1);
!                                               ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SET XML OPTION DOCUMENT;
! EXECUTE foo ('<bar/>');
! ERROR:  prepared statement "foo" does not exist
! EXECUTE foo ('bad');
! ERROR:  prepared statement "foo" does not exist
! SET XML OPTION CONTENT;
! EXECUTE foo ('<bar/>');
! ERROR:  prepared statement "foo" does not exist
! EXECUTE foo ('good');
! ERROR:  prepared statement "foo" does not exist
! -- Test backwards parsing
! CREATE VIEW xmlview1 AS SELECT xmlcomment('test');
! CREATE VIEW xmlview2 AS SELECT xmlconcat('hello', 'you');
! ERROR:  unsupported XML feature
! LINE 1: CREATE VIEW xmlview2 AS SELECT xmlconcat('hello', 'you');
!                                                  ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! CREATE VIEW xmlview3 AS SELECT xmlelement(name element, xmlattributes (1 as ":one:", 'deuce' as two), 'content&');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! CREATE VIEW xmlview4 AS SELECT xmlelement(name employee, xmlforest(name, age, salary as pay)) FROM emp;
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! CREATE VIEW xmlview5 AS SELECT xmlparse(content '<abc>x</abc>');
! CREATE VIEW xmlview6 AS SELECT xmlpi(name foo, 'bar');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! CREATE VIEW xmlview7 AS SELECT xmlroot(xml '<foo/>', version no value, standalone yes);
! ERROR:  unsupported XML feature
! LINE 1: CREATE VIEW xmlview7 AS SELECT xmlroot(xml '<foo/>', version...
!                                                    ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! CREATE VIEW xmlview8 AS SELECT xmlserialize(content 'good' as char(10));
! ERROR:  unsupported XML feature
! LINE 1: ...EATE VIEW xmlview8 AS SELECT xmlserialize(content 'good' as ...
!                                                              ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! CREATE VIEW xmlview9 AS SELECT xmlserialize(content 'good' as text);
! ERROR:  unsupported XML feature
! LINE 1: ...EATE VIEW xmlview9 AS SELECT xmlserialize(content 'good' as ...
!                                                              ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT table_name, view_definition FROM information_schema.views
!   WHERE table_name LIKE 'xmlview%' ORDER BY 1;
!  table_name |                                view_definition                                 
! ------------+--------------------------------------------------------------------------------
!  xmlview1   |  SELECT xmlcomment('test'::text) AS xmlcomment;
!  xmlview5   |  SELECT XMLPARSE(CONTENT '<abc>x</abc>'::text STRIP WHITESPACE) AS "xmlparse";
! (2 rows)
! 
! -- Text XPath expressions evaluation
! SELECT xpath('/value', data) FROM xmltest;
!  xpath 
! -------
! (0 rows)
! 
! SELECT xpath(NULL, NULL) IS NULL FROM xmltest;
!  ?column? 
! ----------
! (0 rows)
! 
! SELECT xpath('', '<!-- error -->');
! ERROR:  unsupported XML feature
! LINE 1: SELECT xpath('', '<!-- error -->');
!                          ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xpath('//text()', '<local:data xmlns:local="http://127.0.0.1"><local:piece id="1">number one</local:piece><local:piece id="2" /></local:data>');
! ERROR:  unsupported XML feature
! LINE 1: SELECT xpath('//text()', '<local:data xmlns:local="http://12...
!                                  ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xpath('//loc:piece/@id', '<local:data xmlns:local="http://127.0.0.1"><local:piece id="1">number one</local:piece><local:piece id="2" /></local:data>', ARRAY[ARRAY['loc', 'http://127.0.0.1']]);
! ERROR:  unsupported XML feature
! LINE 1: SELECT xpath('//loc:piece/@id', '<local:data xmlns:local="ht...
!                                         ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xpath('//loc:piece', '<local:data xmlns:local="http://127.0.0.1"><local:piece id="1">number one</local:piece><local:piece id="2" /></local:data>', ARRAY[ARRAY['loc', 'http://127.0.0.1']]);
! ERROR:  unsupported XML feature
! LINE 1: SELECT xpath('//loc:piece', '<local:data xmlns:local="http:/...
!                                     ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xpath('//loc:piece', '<local:data xmlns:local="http://127.0.0.1" xmlns="http://127.0.0.2"><local:piece id="1"><internal>number one</internal><internal2/></local:piece><local:piece id="2" /></local:data>', ARRAY[ARRAY['loc', 'http://127.0.0.1']]);
! ERROR:  unsupported XML feature
! LINE 1: SELECT xpath('//loc:piece', '<local:data xmlns:local="http:/...
!                                     ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xpath('//b', '<a>one <b>two</b> three <b>etc</b></a>');
! ERROR:  unsupported XML feature
! LINE 1: SELECT xpath('//b', '<a>one <b>two</b> three <b>etc</b></a>'...
!                             ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xpath('//text()', '<root>&lt;</root>');
! ERROR:  unsupported XML feature
! LINE 1: SELECT xpath('//text()', '<root>&lt;</root>');
!                                  ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xpath('//@value', '<root value="&lt;"/>');
! ERROR:  unsupported XML feature
! LINE 1: SELECT xpath('//@value', '<root value="&lt;"/>');
!                                  ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xpath('''<<invalid>>''', '<root/>');
! ERROR:  unsupported XML feature
! LINE 1: SELECT xpath('''<<invalid>>''', '<root/>');
!                                         ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xpath('count(//*)', '<root><sub/><sub/></root>');
! ERROR:  unsupported XML feature
! LINE 1: SELECT xpath('count(//*)', '<root><sub/><sub/></root>');
!                                    ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xpath('count(//*)=0', '<root><sub/><sub/></root>');
! ERROR:  unsupported XML feature
! LINE 1: SELECT xpath('count(//*)=0', '<root><sub/><sub/></root>');
!                                      ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xpath('count(//*)=3', '<root><sub/><sub/></root>');
! ERROR:  unsupported XML feature
! LINE 1: SELECT xpath('count(//*)=3', '<root><sub/><sub/></root>');
!                                      ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xpath('name(/*)', '<root><sub/><sub/></root>');
! ERROR:  unsupported XML feature
! LINE 1: SELECT xpath('name(/*)', '<root><sub/><sub/></root>');
!                                  ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xpath('/nosuchtag', '<root/>');
! ERROR:  unsupported XML feature
! LINE 1: SELECT xpath('/nosuchtag', '<root/>');
!                                    ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! -- Test xmlexists and xpath_exists
! SELECT xmlexists('//town[text() = ''Toronto'']' PASSING BY REF '<towns><town>Bidford-on-Avon</town><town>Cwmbran</town><town>Bristol</town></towns>');
! ERROR:  unsupported XML feature
! LINE 1: ...sts('//town[text() = ''Toronto'']' PASSING BY REF '<towns><t...
!                                                              ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlexists('//town[text() = ''Cwmbran'']' PASSING BY REF '<towns><town>Bidford-on-Avon</town><town>Cwmbran</town><town>Bristol</town></towns>');
! ERROR:  unsupported XML feature
! LINE 1: ...sts('//town[text() = ''Cwmbran'']' PASSING BY REF '<towns><t...
!                                                              ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xmlexists('count(/nosuchtag)' PASSING BY REF '<root/>');
! ERROR:  unsupported XML feature
! LINE 1: ...LECT xmlexists('count(/nosuchtag)' PASSING BY REF '<root/>')...
!                                                              ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xpath_exists('//town[text() = ''Toronto'']','<towns><town>Bidford-on-Avon</town><town>Cwmbran</town><town>Bristol</town></towns>'::xml);
! ERROR:  unsupported XML feature
! LINE 1: ...ELECT xpath_exists('//town[text() = ''Toronto'']','<towns><t...
!                                                              ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xpath_exists('//town[text() = ''Cwmbran'']','<towns><town>Bidford-on-Avon</town><town>Cwmbran</town><town>Bristol</town></towns>'::xml);
! ERROR:  unsupported XML feature
! LINE 1: ...ELECT xpath_exists('//town[text() = ''Cwmbran'']','<towns><t...
!                                                              ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xpath_exists('count(/nosuchtag)', '<root/>'::xml);
! ERROR:  unsupported XML feature
! LINE 1: SELECT xpath_exists('count(/nosuchtag)', '<root/>'::xml);
!                                                  ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! INSERT INTO xmltest VALUES (4, '<menu><beers><name>Budvar</name><cost>free</cost><name>Carling</name><cost>lots</cost></beers></menu>'::xml);
! ERROR:  unsupported XML feature
! LINE 1: INSERT INTO xmltest VALUES (4, '<menu><beers><name>Budvar</n...
!                                        ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! INSERT INTO xmltest VALUES (5, '<menu><beers><name>Molson</name><cost>free</cost><name>Carling</name><cost>lots</cost></beers></menu>'::xml);
! ERROR:  unsupported XML feature
! LINE 1: INSERT INTO xmltest VALUES (5, '<menu><beers><name>Molson</n...
!                                        ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! INSERT INTO xmltest VALUES (6, '<myns:menu xmlns:myns="http://myns.com"><myns:beers><myns:name>Budvar</myns:name><myns:cost>free</myns:cost><myns:name>Carling</myns:name><myns:cost>lots</myns:cost></myns:beers></myns:menu>'::xml);
! ERROR:  unsupported XML feature
! LINE 1: INSERT INTO xmltest VALUES (6, '<myns:menu xmlns:myns="http:...
!                                        ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! INSERT INTO xmltest VALUES (7, '<myns:menu xmlns:myns="http://myns.com"><myns:beers><myns:name>Molson</myns:name><myns:cost>free</myns:cost><myns:name>Carling</myns:name><myns:cost>lots</myns:cost></myns:beers></myns:menu>'::xml);
! ERROR:  unsupported XML feature
! LINE 1: INSERT INTO xmltest VALUES (7, '<myns:menu xmlns:myns="http:...
!                                        ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT COUNT(id) FROM xmltest WHERE xmlexists('/menu/beer' PASSING data);
!  count 
! -------
!      0
! (1 row)
! 
! SELECT COUNT(id) FROM xmltest WHERE xmlexists('/menu/beer' PASSING BY REF data BY REF);
!  count 
! -------
!      0
! (1 row)
! 
! SELECT COUNT(id) FROM xmltest WHERE xmlexists('/menu/beers' PASSING BY REF data);
!  count 
! -------
!      0
! (1 row)
! 
! SELECT COUNT(id) FROM xmltest WHERE xmlexists('/menu/beers/name[text() = ''Molson'']' PASSING BY REF data);
!  count 
! -------
!      0
! (1 row)
! 
! SELECT COUNT(id) FROM xmltest WHERE xpath_exists('/menu/beer',data);
!  count 
! -------
!      0
! (1 row)
! 
! SELECT COUNT(id) FROM xmltest WHERE xpath_exists('/menu/beers',data);
!  count 
! -------
!      0
! (1 row)
! 
! SELECT COUNT(id) FROM xmltest WHERE xpath_exists('/menu/beers/name[text() = ''Molson'']',data);
!  count 
! -------
!      0
! (1 row)
! 
! SELECT COUNT(id) FROM xmltest WHERE xpath_exists('/myns:menu/myns:beer',data,ARRAY[ARRAY['myns','http://myns.com']]);
!  count 
! -------
!      0
! (1 row)
! 
! SELECT COUNT(id) FROM xmltest WHERE xpath_exists('/myns:menu/myns:beers',data,ARRAY[ARRAY['myns','http://myns.com']]);
!  count 
! -------
!      0
! (1 row)
! 
! SELECT COUNT(id) FROM xmltest WHERE xpath_exists('/myns:menu/myns:beers/myns:name[text() = ''Molson'']',data,ARRAY[ARRAY['myns','http://myns.com']]);
!  count 
! -------
!      0
! (1 row)
! 
! CREATE TABLE query ( expr TEXT );
! INSERT INTO query VALUES ('/menu/beers/cost[text() = ''lots'']');
! SELECT COUNT(id) FROM xmltest, query WHERE xmlexists(expr PASSING BY REF data);
!  count 
! -------
!      0
! (1 row)
! 
! -- Test xml_is_well_formed and variants
! SELECT xml_is_well_formed_document('<foo>bar</foo>');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xml_is_well_formed_document('abc');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xml_is_well_formed_content('<foo>bar</foo>');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xml_is_well_formed_content('abc');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SET xmloption TO DOCUMENT;
! SELECT xml_is_well_formed('abc');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xml_is_well_formed('<>');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xml_is_well_formed('<abc/>');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xml_is_well_formed('<foo>bar</foo>');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xml_is_well_formed('<foo>bar</foo');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xml_is_well_formed('<foo><bar>baz</foo>');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xml_is_well_formed('<local:data xmlns:local="http://127.0.0.1"><local:piece id="1">number one</local:piece><local:piece id="2" /></local:data>');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xml_is_well_formed('<pg:foo xmlns:pg="http://postgresql.org/stuff">bar</my:foo>');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xml_is_well_formed('<pg:foo xmlns:pg="http://postgresql.org/stuff">bar</pg:foo>');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xml_is_well_formed('<invalidentity>&</abc>');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xml_is_well_formed('<undefinedentity>&idontexist;</abc>');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xml_is_well_formed('<invalidns xmlns=''&lt;''/>');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xml_is_well_formed('<relativens xmlns=''relative''/>');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT xml_is_well_formed('<twoerrors>&idontexist;</unbalanced>');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SET xmloption TO CONTENT;
! SELECT xml_is_well_formed('abc');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! -- Since xpath() deals with namespaces, it's a bit stricter about
! -- what's well-formed and what's not. If we don't obey these rules
! -- (i.e. ignore namespace-related errors from libxml), xpath()
! -- fails in subtle ways. The following would for example produce
! -- the xml value
! --   <invalidns xmlns='<'/>
! -- which is invalid because '<' may not appear un-escaped in
! -- attribute values.
! -- Since different libxml versions emit slightly different
! -- error messages, we suppress the DETAIL in this test.
! \set VERBOSITY terse
! SELECT xpath('/*', '<invalidns xmlns=''&lt;''/>');
! ERROR:  unsupported XML feature at character 20
! \set VERBOSITY default
! -- Again, the XML isn't well-formed for namespace purposes
! SELECT xpath('/*', '<nosuchprefix:tag/>');
! ERROR:  unsupported XML feature
! LINE 1: SELECT xpath('/*', '<nosuchprefix:tag/>');
!                            ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! -- XPath deprecates relative namespaces, but they're not supposed to
! -- throw an error, only a warning.
! SELECT xpath('/*', '<relativens xmlns=''relative''/>');
! ERROR:  unsupported XML feature
! LINE 1: SELECT xpath('/*', '<relativens xmlns=''relative''/>');
!                            ^
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! -- External entity references should not leak filesystem information.
! SELECT XMLPARSE(DOCUMENT '<!DOCTYPE foo [<!ENTITY c SYSTEM "/etc/passwd">]><foo>&c;</foo>');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! SELECT XMLPARSE(DOCUMENT '<!DOCTYPE foo [<!ENTITY c SYSTEM "/etc/no.such.file">]><foo>&c;</foo>');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
! -- This might or might not load the requested DTD, but it mustn't throw error.
! SELECT XMLPARSE(DOCUMENT '<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd"><chapter>&nbsp;</chapter>');
! ERROR:  unsupported XML feature
! DETAIL:  This functionality requires the server to be built with libxml support.
! HINT:  You need to rebuild PostgreSQL using --with-libxml.
--- 1 ----
! psql: FATAL:  the database system is in recovery mode

======================================================================

*** /home/claudiofreire/src/postgresql.work/src/test/regress/expected/event_trigger.out	2016-09-05 20:45:48.652032317 -0300
--- /home/claudiofreire/src/postgresql.work/src/test/regress/results/event_trigger.out	2016-09-12 12:14:52.015413949 -0300
***************
*** 1,459 ****
! -- should fail, return type mismatch
! create event trigger regress_event_trigger
!    on ddl_command_start
!    execute procedure pg_backend_pid();
! ERROR:  function pg_backend_pid must return type event_trigger
! -- OK
! create function test_event_trigger() returns event_trigger as $$
! BEGIN
!     RAISE NOTICE 'test_event_trigger: % %', tg_event, tg_tag;
! END
! $$ language plpgsql;
! -- should fail, event triggers cannot have declared arguments
! create function test_event_trigger_arg(name text)
! returns event_trigger as $$ BEGIN RETURN 1; END $$ language plpgsql;
! ERROR:  event trigger functions cannot have declared arguments
! CONTEXT:  compilation of PL/pgSQL function "test_event_trigger_arg" near line 1
! -- should fail, SQL functions cannot be event triggers
! create function test_event_trigger_sql() returns event_trigger as $$
! SELECT 1 $$ language sql;
! ERROR:  SQL functions cannot return type event_trigger
! -- should fail, no elephant_bootstrap entry point
! create event trigger regress_event_trigger on elephant_bootstrap
!    execute procedure test_event_trigger();
! ERROR:  unrecognized event name "elephant_bootstrap"
! -- OK
! create event trigger regress_event_trigger on ddl_command_start
!    execute procedure test_event_trigger();
! -- OK
! create event trigger regress_event_trigger_end on ddl_command_end
!    execute procedure test_event_trigger();
! -- should fail, food is not a valid filter variable
! create event trigger regress_event_trigger2 on ddl_command_start
!    when food in ('sandwich')
!    execute procedure test_event_trigger();
! ERROR:  unrecognized filter variable "food"
! -- should fail, sandwich is not a valid command tag
! create event trigger regress_event_trigger2 on ddl_command_start
!    when tag in ('sandwich')
!    execute procedure test_event_trigger();
! ERROR:  filter value "sandwich" not recognized for filter variable "tag"
! -- should fail, create skunkcabbage is not a valid command tag
! create event trigger regress_event_trigger2 on ddl_command_start
!    when tag in ('create table', 'create skunkcabbage')
!    execute procedure test_event_trigger();
! ERROR:  filter value "create skunkcabbage" not recognized for filter variable "tag"
! -- should fail, can't have event triggers on event triggers
! create event trigger regress_event_trigger2 on ddl_command_start
!    when tag in ('DROP EVENT TRIGGER')
!    execute procedure test_event_trigger();
! ERROR:  event triggers are not supported for DROP EVENT TRIGGER
! -- should fail, can't have event triggers on global objects
! create event trigger regress_event_trigger2 on ddl_command_start
!    when tag in ('CREATE ROLE')
!    execute procedure test_event_trigger();
! ERROR:  event triggers are not supported for CREATE ROLE
! -- should fail, can't have event triggers on global objects
! create event trigger regress_event_trigger2 on ddl_command_start
!    when tag in ('CREATE DATABASE')
!    execute procedure test_event_trigger();
! ERROR:  event triggers are not supported for CREATE DATABASE
! -- should fail, can't have event triggers on global objects
! create event trigger regress_event_trigger2 on ddl_command_start
!    when tag in ('CREATE TABLESPACE')
!    execute procedure test_event_trigger();
! ERROR:  event triggers are not supported for CREATE TABLESPACE
! -- should fail, can't have same filter variable twice
! create event trigger regress_event_trigger2 on ddl_command_start
!    when tag in ('create table') and tag in ('CREATE FUNCTION')
!    execute procedure test_event_trigger();
! ERROR:  filter variable "tag" specified more than once
! -- should fail, can't have arguments
! create event trigger regress_event_trigger2 on ddl_command_start
!    execute procedure test_event_trigger('argument not allowed');
! ERROR:  syntax error at or near "'argument not allowed'"
! LINE 2:    execute procedure test_event_trigger('argument not allowe...
!                                                 ^
! -- OK
! create event trigger regress_event_trigger2 on ddl_command_start
!    when tag in ('create table', 'CREATE FUNCTION')
!    execute procedure test_event_trigger();
! -- OK
! comment on event trigger regress_event_trigger is 'test comment';
! -- should fail, event triggers are not schema objects
! comment on event trigger wrong.regress_event_trigger is 'test comment';
! ERROR:  event trigger name cannot be qualified
! -- drop as non-superuser should fail
! create role regress_evt_user;
! set role regress_evt_user;
! create event trigger regress_event_trigger_noperms on ddl_command_start
!    execute procedure test_event_trigger();
! ERROR:  permission denied to create event trigger "regress_event_trigger_noperms"
! HINT:  Must be superuser to create an event trigger.
! reset role;
! -- all OK
! alter event trigger regress_event_trigger enable replica;
! alter event trigger regress_event_trigger enable always;
! alter event trigger regress_event_trigger enable;
! alter event trigger regress_event_trigger disable;
! -- regress_event_trigger2 and regress_event_trigger_end should fire, but not
! -- regress_event_trigger
! create table event_trigger_fire1 (a int);
! NOTICE:  test_event_trigger: ddl_command_start CREATE TABLE
! NOTICE:  test_event_trigger: ddl_command_end CREATE TABLE
! -- regress_event_trigger_end should fire on these commands
! grant all on table event_trigger_fire1 to public;
! NOTICE:  test_event_trigger: ddl_command_end GRANT
! comment on table event_trigger_fire1 is 'here is a comment';
! NOTICE:  test_event_trigger: ddl_command_end COMMENT
! revoke all on table event_trigger_fire1 from public;
! NOTICE:  test_event_trigger: ddl_command_end REVOKE
! drop table event_trigger_fire1;
! NOTICE:  test_event_trigger: ddl_command_end DROP TABLE
! create foreign data wrapper useless;
! NOTICE:  test_event_trigger: ddl_command_end CREATE FOREIGN DATA WRAPPER
! create server useless_server foreign data wrapper useless;
! NOTICE:  test_event_trigger: ddl_command_end CREATE SERVER
! create user mapping for regress_evt_user server useless_server;
! NOTICE:  test_event_trigger: ddl_command_end CREATE USER MAPPING
! alter default privileges for role regress_evt_user
!  revoke delete on tables from regress_evt_user;
! NOTICE:  test_event_trigger: ddl_command_end ALTER DEFAULT PRIVILEGES
! -- alter owner to non-superuser should fail
! alter event trigger regress_event_trigger owner to regress_evt_user;
! ERROR:  permission denied to change owner of event trigger "regress_event_trigger"
! HINT:  The owner of an event trigger must be a superuser.
! -- alter owner to superuser should work
! alter role regress_evt_user superuser;
! alter event trigger regress_event_trigger owner to regress_evt_user;
! -- should fail, name collision
! alter event trigger regress_event_trigger rename to regress_event_trigger2;
! ERROR:  event trigger "regress_event_trigger2" already exists
! -- OK
! alter event trigger regress_event_trigger rename to regress_event_trigger3;
! -- should fail, doesn't exist any more
! drop event trigger regress_event_trigger;
! ERROR:  event trigger "regress_event_trigger" does not exist
! -- should fail, regress_evt_user owns some objects
! drop role regress_evt_user;
! ERROR:  role "regress_evt_user" cannot be dropped because some objects depend on it
! DETAIL:  owner of event trigger regress_event_trigger3
! owner of default privileges on new relations belonging to role regress_evt_user
! owner of user mapping for regress_evt_user on server useless_server
! -- cleanup before next test
! -- these are all OK; the second one should emit a NOTICE
! drop event trigger if exists regress_event_trigger2;
! drop event trigger if exists regress_event_trigger2;
! NOTICE:  event trigger "regress_event_trigger2" does not exist, skipping
! drop event trigger regress_event_trigger3;
! drop event trigger regress_event_trigger_end;
! -- test support for dropped objects
! CREATE SCHEMA schema_one authorization regress_evt_user;
! CREATE SCHEMA schema_two authorization regress_evt_user;
! CREATE SCHEMA audit_tbls authorization regress_evt_user;
! CREATE TEMP TABLE a_temp_tbl ();
! SET SESSION AUTHORIZATION regress_evt_user;
! CREATE TABLE schema_one.table_one(a int);
! CREATE TABLE schema_one."table two"(a int);
! CREATE TABLE schema_one.table_three(a int);
! CREATE TABLE audit_tbls.schema_one_table_two(the_value text);
! CREATE TABLE schema_two.table_two(a int);
! CREATE TABLE schema_two.table_three(a int, b text);
! CREATE TABLE audit_tbls.schema_two_table_three(the_value text);
! CREATE OR REPLACE FUNCTION schema_two.add(int, int) RETURNS int LANGUAGE plpgsql
!   CALLED ON NULL INPUT
!   AS $$ BEGIN RETURN coalesce($1,0) + coalesce($2,0); END; $$;
! CREATE AGGREGATE schema_two.newton
!   (BASETYPE = int, SFUNC = schema_two.add, STYPE = int);
! RESET SESSION AUTHORIZATION;
! CREATE TABLE undroppable_objs (
! 	object_type text,
! 	object_identity text
! );
! INSERT INTO undroppable_objs VALUES
! ('table', 'schema_one.table_three'),
! ('table', 'audit_tbls.schema_two_table_three');
! CREATE TABLE dropped_objects (
! 	type text,
! 	schema text,
! 	object text
! );
! -- This tests errors raised within event triggers; the one in audit_tbls
! -- uses 2nd-level recursive invocation via test_evtrig_dropped_objects().
! CREATE OR REPLACE FUNCTION undroppable() RETURNS event_trigger
! LANGUAGE plpgsql AS $$
! DECLARE
! 	obj record;
! BEGIN
! 	PERFORM 1 FROM pg_tables WHERE tablename = 'undroppable_objs';
! 	IF NOT FOUND THEN
! 		RAISE NOTICE 'table undroppable_objs not found, skipping';
! 		RETURN;
! 	END IF;
! 	FOR obj IN
! 		SELECT * FROM pg_event_trigger_dropped_objects() JOIN
! 			undroppable_objs USING (object_type, object_identity)
! 	LOOP
! 		RAISE EXCEPTION 'object % of type % cannot be dropped',
! 			obj.object_identity, obj.object_type;
! 	END LOOP;
! END;
! $$;
! CREATE EVENT TRIGGER undroppable ON sql_drop
! 	EXECUTE PROCEDURE undroppable();
! CREATE OR REPLACE FUNCTION test_evtrig_dropped_objects() RETURNS event_trigger
! LANGUAGE plpgsql AS $$
! DECLARE
!     obj record;
! BEGIN
!     FOR obj IN SELECT * FROM pg_event_trigger_dropped_objects()
!     LOOP
!         IF obj.object_type = 'table' THEN
!                 EXECUTE format('DROP TABLE IF EXISTS audit_tbls.%I',
! 					format('%s_%s', obj.schema_name, obj.object_name));
!         END IF;
! 
! 	INSERT INTO dropped_objects
! 		(type, schema, object) VALUES
! 		(obj.object_type, obj.schema_name, obj.object_identity);
!     END LOOP;
! END
! $$;
! CREATE EVENT TRIGGER regress_event_trigger_drop_objects ON sql_drop
! 	WHEN TAG IN ('drop table', 'drop function', 'drop view',
! 		'drop owned', 'drop schema', 'alter table')
! 	EXECUTE PROCEDURE test_evtrig_dropped_objects();
! ALTER TABLE schema_one.table_one DROP COLUMN a;
! DROP SCHEMA schema_one, schema_two CASCADE;
! NOTICE:  drop cascades to 7 other objects
! DETAIL:  drop cascades to table schema_two.table_two
! drop cascades to table schema_two.table_three
! drop cascades to function schema_two.add(integer,integer)
! drop cascades to function schema_two.newton(integer)
! drop cascades to table schema_one.table_one
! drop cascades to table schema_one."table two"
! drop cascades to table schema_one.table_three
! NOTICE:  table "schema_two_table_two" does not exist, skipping
! NOTICE:  table "audit_tbls_schema_two_table_three" does not exist, skipping
! ERROR:  object audit_tbls.schema_two_table_three of type table cannot be dropped
! CONTEXT:  PL/pgSQL function undroppable() line 14 at RAISE
! SQL statement "DROP TABLE IF EXISTS audit_tbls.schema_two_table_three"
! PL/pgSQL function test_evtrig_dropped_objects() line 8 at EXECUTE
! DELETE FROM undroppable_objs WHERE object_identity = 'audit_tbls.schema_two_table_three';
! DROP SCHEMA schema_one, schema_two CASCADE;
! NOTICE:  drop cascades to 7 other objects
! DETAIL:  drop cascades to table schema_two.table_two
! drop cascades to table schema_two.table_three
! drop cascades to function schema_two.add(integer,integer)
! drop cascades to function schema_two.newton(integer)
! drop cascades to table schema_one.table_one
! drop cascades to table schema_one."table two"
! drop cascades to table schema_one.table_three
! NOTICE:  table "schema_two_table_two" does not exist, skipping
! NOTICE:  table "audit_tbls_schema_two_table_three" does not exist, skipping
! NOTICE:  table "schema_one_table_one" does not exist, skipping
! NOTICE:  table "schema_one_table two" does not exist, skipping
! NOTICE:  table "schema_one_table_three" does not exist, skipping
! ERROR:  object schema_one.table_three of type table cannot be dropped
! CONTEXT:  PL/pgSQL function undroppable() line 14 at RAISE
! DELETE FROM undroppable_objs WHERE object_identity = 'schema_one.table_three';
! DROP SCHEMA schema_one, schema_two CASCADE;
! NOTICE:  drop cascades to 7 other objects
! DETAIL:  drop cascades to table schema_two.table_two
! drop cascades to table schema_two.table_three
! drop cascades to function schema_two.add(integer,integer)
! drop cascades to function schema_two.newton(integer)
! drop cascades to table schema_one.table_one
! drop cascades to table schema_one."table two"
! drop cascades to table schema_one.table_three
! NOTICE:  table "schema_two_table_two" does not exist, skipping
! NOTICE:  table "audit_tbls_schema_two_table_three" does not exist, skipping
! NOTICE:  table "schema_one_table_one" does not exist, skipping
! NOTICE:  table "schema_one_table two" does not exist, skipping
! NOTICE:  table "schema_one_table_three" does not exist, skipping
! SELECT * FROM dropped_objects WHERE schema IS NULL OR schema <> 'pg_toast';
!      type     |   schema   |               object                
! --------------+------------+-------------------------------------
!  table column | schema_one | schema_one.table_one.a
!  schema       |            | schema_two
!  table        | schema_two | schema_two.table_two
!  type         | schema_two | schema_two.table_two
!  type         | schema_two | schema_two.table_two[]
!  table        | audit_tbls | audit_tbls.schema_two_table_three
!  type         | audit_tbls | audit_tbls.schema_two_table_three
!  type         | audit_tbls | audit_tbls.schema_two_table_three[]
!  table        | schema_two | schema_two.table_three
!  type         | schema_two | schema_two.table_three
!  type         | schema_two | schema_two.table_three[]
!  function     | schema_two | schema_two.add(integer,integer)
!  aggregate    | schema_two | schema_two.newton(integer)
!  schema       |            | schema_one
!  table        | schema_one | schema_one.table_one
!  type         | schema_one | schema_one.table_one
!  type         | schema_one | schema_one.table_one[]
!  table        | schema_one | schema_one."table two"
!  type         | schema_one | schema_one."table two"
!  type         | schema_one | schema_one."table two"[]
!  table        | schema_one | schema_one.table_three
!  type         | schema_one | schema_one.table_three
!  type         | schema_one | schema_one.table_three[]
! (23 rows)
! 
! DROP OWNED BY regress_evt_user;
! NOTICE:  schema "audit_tbls" does not exist, skipping
! SELECT * FROM dropped_objects WHERE type = 'schema';
!   type  | schema |   object   
! --------+--------+------------
!  schema |        | schema_two
!  schema |        | schema_one
!  schema |        | audit_tbls
! (3 rows)
! 
! DROP ROLE regress_evt_user;
! DROP EVENT TRIGGER regress_event_trigger_drop_objects;
! DROP EVENT TRIGGER undroppable;
! CREATE OR REPLACE FUNCTION event_trigger_report_dropped()
!  RETURNS event_trigger
!  LANGUAGE plpgsql
! AS $$
! DECLARE r record;
! BEGIN
!     FOR r IN SELECT * from pg_event_trigger_dropped_objects()
!     LOOP
!     IF NOT r.normal AND NOT r.original THEN
!         CONTINUE;
!     END IF;
!     RAISE NOTICE 'NORMAL: orig=% normal=% istemp=% type=% identity=% name=% args=%',
!         r.original, r.normal, r.is_temporary, r.object_type,
!         r.object_identity, r.address_names, r.address_args;
!     END LOOP;
! END; $$;
! CREATE EVENT TRIGGER regress_event_trigger_report_dropped ON sql_drop
!     EXECUTE PROCEDURE event_trigger_report_dropped();
! CREATE SCHEMA evttrig
! 	CREATE TABLE one (col_a SERIAL PRIMARY KEY, col_b text DEFAULT 'forty two')
! 	CREATE INDEX one_idx ON one (col_b)
! 	CREATE TABLE two (col_c INTEGER CHECK (col_c > 0) REFERENCES one DEFAULT 42);
! ALTER TABLE evttrig.two DROP COLUMN col_c;
! NOTICE:  NORMAL: orig=t normal=f istemp=f type=table column identity=evttrig.two.col_c name={evttrig,two,col_c} args={}
! NOTICE:  NORMAL: orig=f normal=t istemp=f type=table constraint identity=two_col_c_check on evttrig.two name={evttrig,two,two_col_c_check} args={}
! ALTER TABLE evttrig.one ALTER COLUMN col_b DROP DEFAULT;
! NOTICE:  NORMAL: orig=t normal=f istemp=f type=default value identity=for evttrig.one.col_b name={evttrig,one,col_b} args={}
! ALTER TABLE evttrig.one DROP CONSTRAINT one_pkey;
! NOTICE:  NORMAL: orig=t normal=f istemp=f type=table constraint identity=one_pkey on evttrig.one name={evttrig,one,one_pkey} args={}
! DROP INDEX evttrig.one_idx;
! NOTICE:  NORMAL: orig=t normal=f istemp=f type=index identity=evttrig.one_idx name={evttrig,one_idx} args={}
! DROP SCHEMA evttrig CASCADE;
! NOTICE:  drop cascades to 2 other objects
! DETAIL:  drop cascades to table evttrig.one
! drop cascades to table evttrig.two
! NOTICE:  NORMAL: orig=t normal=f istemp=f type=schema identity=evttrig name={evttrig} args={}
! NOTICE:  NORMAL: orig=f normal=t istemp=f type=table identity=evttrig.one name={evttrig,one} args={}
! NOTICE:  NORMAL: orig=f normal=t istemp=f type=sequence identity=evttrig.one_col_a_seq name={evttrig,one_col_a_seq} args={}
! NOTICE:  NORMAL: orig=f normal=t istemp=f type=default value identity=for evttrig.one.col_a name={evttrig,one,col_a} args={}
! NOTICE:  NORMAL: orig=f normal=t istemp=f type=table identity=evttrig.two name={evttrig,two} args={}
! DROP TABLE a_temp_tbl;
! NOTICE:  NORMAL: orig=t normal=f istemp=t type=table identity=pg_temp.a_temp_tbl name={pg_temp,a_temp_tbl} args={}
! DROP EVENT TRIGGER regress_event_trigger_report_dropped;
! -- only allowed from within an event trigger function, should fail
! select pg_event_trigger_table_rewrite_oid();
! ERROR:  pg_event_trigger_table_rewrite_oid() can only be called in a table_rewrite event trigger function
! -- test Table Rewrite Event Trigger
! CREATE OR REPLACE FUNCTION test_evtrig_no_rewrite() RETURNS event_trigger
! LANGUAGE plpgsql AS $$
! BEGIN
!   RAISE EXCEPTION 'rewrites not allowed';
! END;
! $$;
! create event trigger no_rewrite_allowed on table_rewrite
!   execute procedure test_evtrig_no_rewrite();
! create table rewriteme (id serial primary key, foo float);
! insert into rewriteme
!      select x * 1.001 from generate_series(1, 500) as t(x);
! alter table rewriteme alter column foo type numeric;
! ERROR:  rewrites not allowed
! CONTEXT:  PL/pgSQL function test_evtrig_no_rewrite() line 3 at RAISE
! alter table rewriteme add column baz int default 0;
! ERROR:  rewrites not allowed
! CONTEXT:  PL/pgSQL function test_evtrig_no_rewrite() line 3 at RAISE
! -- test with more than one reason to rewrite a single table
! CREATE OR REPLACE FUNCTION test_evtrig_no_rewrite() RETURNS event_trigger
! LANGUAGE plpgsql AS $$
! BEGIN
!   RAISE NOTICE 'Table ''%'' is being rewritten (reason = %)',
!                pg_event_trigger_table_rewrite_oid()::regclass,
!                pg_event_trigger_table_rewrite_reason();
! END;
! $$;
! alter table rewriteme
!  add column onemore int default 0,
!  add column another int default -1,
!  alter column foo type numeric(10,4);
! NOTICE:  Table 'rewriteme' is being rewritten (reason = 6)
! -- shouldn't trigger a table_rewrite event
! alter table rewriteme alter column foo type numeric(12,4);
! -- typed tables are rewritten when their type changes.  Don't emit table
! -- name, because firing order is not stable.
! CREATE OR REPLACE FUNCTION test_evtrig_no_rewrite() RETURNS event_trigger
! LANGUAGE plpgsql AS $$
! BEGIN
!   RAISE NOTICE 'Table is being rewritten (reason = %)',
!                pg_event_trigger_table_rewrite_reason();
! END;
! $$;
! create type rewritetype as (a int);
! create table rewritemetoo1 of rewritetype;
! create table rewritemetoo2 of rewritetype;
! alter type rewritetype alter attribute a type text cascade;
! NOTICE:  Table is being rewritten (reason = 4)
! NOTICE:  Table is being rewritten (reason = 4)
! -- but this doesn't work
! create table rewritemetoo3 (a rewritetype);
! alter type rewritetype alter attribute a type varchar cascade;
! ERROR:  cannot alter type "rewritetype" because column "rewritemetoo3.a" uses it
! drop table rewriteme;
! drop event trigger no_rewrite_allowed;
! drop function test_evtrig_no_rewrite();
! -- test Row Security Event Trigger
! RESET SESSION AUTHORIZATION;
! CREATE TABLE event_trigger_test (a integer, b text);
! CREATE OR REPLACE FUNCTION start_command()
! RETURNS event_trigger AS $$
! BEGIN
! RAISE NOTICE '% - ddl_command_start', tg_tag;
! END;
! $$ LANGUAGE plpgsql;
! CREATE OR REPLACE FUNCTION end_command()
! RETURNS event_trigger AS $$
! BEGIN
! RAISE NOTICE '% - ddl_command_end', tg_tag;
! END;
! $$ LANGUAGE plpgsql;
! CREATE OR REPLACE FUNCTION drop_sql_command()
! RETURNS event_trigger AS $$
! BEGIN
! RAISE NOTICE '% - sql_drop', tg_tag;
! END;
! $$ LANGUAGE plpgsql;
! CREATE EVENT TRIGGER start_rls_command ON ddl_command_start
!     WHEN TAG IN ('CREATE POLICY', 'ALTER POLICY', 'DROP POLICY') EXECUTE PROCEDURE start_command();
! CREATE EVENT TRIGGER end_rls_command ON ddl_command_end
!     WHEN TAG IN ('CREATE POLICY', 'ALTER POLICY', 'DROP POLICY') EXECUTE PROCEDURE end_command();
! CREATE EVENT TRIGGER sql_drop_command ON sql_drop
!     WHEN TAG IN ('DROP POLICY') EXECUTE PROCEDURE drop_sql_command();
! CREATE POLICY p1 ON event_trigger_test USING (FALSE);
! NOTICE:  CREATE POLICY - ddl_command_start
! NOTICE:  CREATE POLICY - ddl_command_end
! ALTER POLICY p1 ON event_trigger_test USING (TRUE);
! NOTICE:  ALTER POLICY - ddl_command_start
! NOTICE:  ALTER POLICY - ddl_command_end
! ALTER POLICY p1 ON event_trigger_test RENAME TO p2;
! NOTICE:  ALTER POLICY - ddl_command_start
! NOTICE:  ALTER POLICY - ddl_command_end
! DROP POLICY p2 ON event_trigger_test;
! NOTICE:  DROP POLICY - ddl_command_start
! NOTICE:  DROP POLICY - sql_drop
! NOTICE:  DROP POLICY - ddl_command_end
! DROP EVENT TRIGGER start_rls_command;
! DROP EVENT TRIGGER end_rls_command;
! DROP EVENT TRIGGER sql_drop_command;
--- 1 ----
! psql: FATAL:  the database system is in recovery mode

======================================================================

*** /home/claudiofreire/src/postgresql.work/src/test/regress/expected/stats.out	2016-09-05 20:45:49.072033605 -0300
--- /home/claudiofreire/src/postgresql.work/src/test/regress/results/stats.out	2016-09-12 12:14:52.019413950 -0300
***************
*** 1,202 ****
! --
! -- Test Statistics Collector
! --
! -- Must be run after tenk2 has been created (by create_table),
! -- populated (by create_misc) and indexed (by create_index).
! --
! -- conditio sine qua non
! SHOW track_counts;  -- must be on
!  track_counts 
! --------------
!  on
! (1 row)
! 
! -- ensure that both seqscan and indexscan plans are allowed
! SET enable_seqscan TO on;
! SET enable_indexscan TO on;
! -- for the moment, we don't want index-only scans here
! SET enable_indexonlyscan TO off;
! -- wait to let any prior tests finish dumping out stats;
! -- else our messages might get lost due to contention
! SELECT pg_sleep_for('2 seconds');
!  pg_sleep_for 
! --------------
!  
! (1 row)
! 
! -- save counters
! CREATE TEMP TABLE prevstats AS
! SELECT t.seq_scan, t.seq_tup_read, t.idx_scan, t.idx_tup_fetch,
!        (b.heap_blks_read + b.heap_blks_hit) AS heap_blks,
!        (b.idx_blks_read + b.idx_blks_hit) AS idx_blks,
!        pg_stat_get_snapshot_timestamp() as snap_ts
!   FROM pg_catalog.pg_stat_user_tables AS t,
!        pg_catalog.pg_statio_user_tables AS b
!  WHERE t.relname='tenk2' AND b.relname='tenk2';
! -- function to wait for counters to advance
! create function wait_for_stats() returns void as $$
! declare
!   start_time timestamptz := clock_timestamp();
!   updated1 bool;
!   updated2 bool;
!   updated3 bool;
! begin
!   -- we don't want to wait forever; loop will exit after 30 seconds
!   for i in 1 .. 300 loop
! 
!     -- With parallel query, the seqscan and indexscan on tenk2 might be done
!     -- in parallel worker processes, which will send their stats counters
!     -- asynchronously to what our own session does.  So we must check for
!     -- those counts to be registered separately from the update counts.
! 
!     -- check to see if seqscan has been sensed
!     SELECT (st.seq_scan >= pr.seq_scan + 1) INTO updated1
!       FROM pg_stat_user_tables AS st, pg_class AS cl, prevstats AS pr
!      WHERE st.relname='tenk2' AND cl.relname='tenk2';
! 
!     -- check to see if indexscan has been sensed
!     SELECT (st.idx_scan >= pr.idx_scan + 1) INTO updated2
!       FROM pg_stat_user_tables AS st, pg_class AS cl, prevstats AS pr
!      WHERE st.relname='tenk2' AND cl.relname='tenk2';
! 
!     -- check to see if updates have been sensed
!     SELECT (n_tup_ins > 0) INTO updated3
!       FROM pg_stat_user_tables WHERE relname='trunc_stats_test';
! 
!     exit when updated1 and updated2 and updated3;
! 
!     -- wait a little
!     perform pg_sleep(0.1);
! 
!     -- reset stats snapshot so we can test again
!     perform pg_stat_clear_snapshot();
! 
!   end loop;
! 
!   -- report time waited in postmaster log (where it won't change test output)
!   raise log 'wait_for_stats delayed % seconds',
!     extract(epoch from clock_timestamp() - start_time);
! end
! $$ language plpgsql;
! -- test effects of TRUNCATE on n_live_tup/n_dead_tup counters
! CREATE TABLE trunc_stats_test(id serial);
! CREATE TABLE trunc_stats_test1(id serial);
! CREATE TABLE trunc_stats_test2(id serial);
! CREATE TABLE trunc_stats_test3(id serial);
! CREATE TABLE trunc_stats_test4(id serial);
! -- check that n_live_tup is reset to 0 after truncate
! INSERT INTO trunc_stats_test DEFAULT VALUES;
! INSERT INTO trunc_stats_test DEFAULT VALUES;
! INSERT INTO trunc_stats_test DEFAULT VALUES;
! TRUNCATE trunc_stats_test;
! -- test involving a truncate in a transaction; 4 ins but only 1 live
! INSERT INTO trunc_stats_test1 DEFAULT VALUES;
! INSERT INTO trunc_stats_test1 DEFAULT VALUES;
! INSERT INTO trunc_stats_test1 DEFAULT VALUES;
! UPDATE trunc_stats_test1 SET id = id + 10 WHERE id IN (1, 2);
! DELETE FROM trunc_stats_test1 WHERE id = 3;
! BEGIN;
! UPDATE trunc_stats_test1 SET id = id + 100;
! TRUNCATE trunc_stats_test1;
! INSERT INTO trunc_stats_test1 DEFAULT VALUES;
! COMMIT;
! -- use a savepoint: 1 insert, 1 live
! BEGIN;
! INSERT INTO trunc_stats_test2 DEFAULT VALUES;
! INSERT INTO trunc_stats_test2 DEFAULT VALUES;
! SAVEPOINT p1;
! INSERT INTO trunc_stats_test2 DEFAULT VALUES;
! TRUNCATE trunc_stats_test2;
! INSERT INTO trunc_stats_test2 DEFAULT VALUES;
! RELEASE SAVEPOINT p1;
! COMMIT;
! -- rollback a savepoint: this should count 4 inserts and have 2
! -- live tuples after commit (and 2 dead ones due to aborted subxact)
! BEGIN;
! INSERT INTO trunc_stats_test3 DEFAULT VALUES;
! INSERT INTO trunc_stats_test3 DEFAULT VALUES;
! SAVEPOINT p1;
! INSERT INTO trunc_stats_test3 DEFAULT VALUES;
! INSERT INTO trunc_stats_test3 DEFAULT VALUES;
! TRUNCATE trunc_stats_test3;
! INSERT INTO trunc_stats_test3 DEFAULT VALUES;
! ROLLBACK TO SAVEPOINT p1;
! COMMIT;
! -- rollback a truncate: this should count 2 inserts and produce 2 dead tuples
! BEGIN;
! INSERT INTO trunc_stats_test4 DEFAULT VALUES;
! INSERT INTO trunc_stats_test4 DEFAULT VALUES;
! TRUNCATE trunc_stats_test4;
! INSERT INTO trunc_stats_test4 DEFAULT VALUES;
! ROLLBACK;
! -- do a seqscan
! SELECT count(*) FROM tenk2;
!  count 
! -------
!  10000
! (1 row)
! 
! -- do an indexscan
! SELECT count(*) FROM tenk2 WHERE unique1 = 1;
!  count 
! -------
!      1
! (1 row)
! 
! -- force the rate-limiting logic in pgstat_report_stat() to time out
! -- and send a message
! SELECT pg_sleep(1.0);
!  pg_sleep 
! ----------
!  
! (1 row)
! 
! -- wait for stats collector to update
! SELECT wait_for_stats();
!  wait_for_stats 
! ----------------
!  
! (1 row)
! 
! -- check effects
! SELECT relname, n_tup_ins, n_tup_upd, n_tup_del, n_live_tup, n_dead_tup
!   FROM pg_stat_user_tables
!  WHERE relname like 'trunc_stats_test%' order by relname;
!       relname      | n_tup_ins | n_tup_upd | n_tup_del | n_live_tup | n_dead_tup 
! -------------------+-----------+-----------+-----------+------------+------------
!  trunc_stats_test  |         3 |         0 |         0 |          0 |          0
!  trunc_stats_test1 |         4 |         2 |         1 |          1 |          0
!  trunc_stats_test2 |         1 |         0 |         0 |          1 |          0
!  trunc_stats_test3 |         4 |         0 |         0 |          2 |          2
!  trunc_stats_test4 |         2 |         0 |         0 |          0 |          2
! (5 rows)
! 
! SELECT st.seq_scan >= pr.seq_scan + 1,
!        st.seq_tup_read >= pr.seq_tup_read + cl.reltuples,
!        st.idx_scan >= pr.idx_scan + 1,
!        st.idx_tup_fetch >= pr.idx_tup_fetch + 1
!   FROM pg_stat_user_tables AS st, pg_class AS cl, prevstats AS pr
!  WHERE st.relname='tenk2' AND cl.relname='tenk2';
!  ?column? | ?column? | ?column? | ?column? 
! ----------+----------+----------+----------
!  t        | t        | t        | t
! (1 row)
! 
! SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
!        st.idx_blks_read + st.idx_blks_hit >= pr.idx_blks + 1
!   FROM pg_statio_user_tables AS st, pg_class AS cl, prevstats AS pr
!  WHERE st.relname='tenk2' AND cl.relname='tenk2';
!  ?column? | ?column? 
! ----------+----------
!  t        | t
! (1 row)
! 
! SELECT pr.snap_ts < pg_stat_get_snapshot_timestamp() as snapshot_newer
! FROM prevstats AS pr;
!  snapshot_newer 
! ----------------
!  t
! (1 row)
! 
! DROP TABLE trunc_stats_test, trunc_stats_test1, trunc_stats_test2, trunc_stats_test3, trunc_stats_test4;
! -- End of Stats Test
--- 1 ----
! psql: FATAL:  the database system is in recovery mode

======================================================================

Attachment: regression.out (application/octet-stream)
#25Peter Geoghegan
pg@heroku.com
In reply to: Claudio Freire (#24)
Re: Tuplesort merge pre-reading

On Mon, Sep 12, 2016 at 8:47 AM, Claudio Freire <klaussfreire@gmail.com> wrote:

> I spoke too soon, git AM had failed and I didn't notice.

I wrote the regression test that causes Postgres to crash with the
patch applied. It tests, among other things, that CLUSTER tuples are
fixed-up by a routine like the current MOVETUP(), which is removed in
Heikki's patch. (There was a 9.6 bug where CLUSTER was broken due to
that.)

It shouldn't be too difficult for Heikki to fix this.
--
Peter Geoghegan

#26Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Claudio Freire (#24)
1 attachment(s)
Re: Tuplesort merge pre-reading

On 09/12/2016 06:47 PM, Claudio Freire wrote:

> On Mon, Sep 12, 2016 at 12:02 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> On Sun, Sep 11, 2016 at 12:47 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> Here's a new version of these patches, rebased over current master. I
>>> squashed the two patches into one, there's not much point to keep them
>>> separate.
>>
>> I don't know what was up with the other ones, but this one works fine.
>>
>> Benchmarking now.
>
> I spoke too soon, git AM had failed and I didn't notice.
>
> regression.diffs attached
>
> Built with
>
> ./configure --enable-debug --enable-cassert && make clean && make -j7
> && make check

Ah, of course! I had been building without assertions, as I was doing
performance testing. With --enable-cassert, it failed for me as well
(and there was even a compiler warning pointing out one of the issues).
Sorry about that.

Here's a fixed version. I'll go through Peter's comments and address
those, but I don't think there was anything there that should affect
performance much, so I think you can proceed with your benchmarking with
this version. (You'll also need to turn off assertions for that!)

- Heikki
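
The commit message of the attached patch says the memory that used to hold pre-read SortTuple slots now goes to larger per-tape read buffers in logtape.c. A minimal sketch of how such a budget might be sized is below. The clamping and rounding follow the rules stated in the patch's LogicalTapeAssignReadBufferSize() (at least one block, at most MaxAllocSize, a whole number of blocks). The even split across tapes, the function names, and the MAX_ALLOC_SIZE constant are assumptions made for illustration, not code taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

#define BLCKSZ 8192								/* PostgreSQL's default block size */
#define MAX_ALLOC_SIZE ((size_t) 0x3fffffff)	/* stand-in for MaxAllocSize */

/*
 * Clamp a per-tape memory budget to [BLCKSZ, MAX_ALLOC_SIZE] and round it
 * down to a whole number of blocks.
 */
static size_t
read_buffer_size_for(size_t avail_mem)
{
	if (avail_mem < BLCKSZ)
		avail_mem = BLCKSZ;
	if (avail_mem > MAX_ALLOC_SIZE)
		avail_mem = MAX_ALLOC_SIZE;
	return avail_mem - avail_mem % BLCKSZ;
}

/*
 * Hypothetical helper: split the total merge memory evenly across the
 * input tapes.  The even split is an assumption of this sketch.
 */
static size_t
per_tape_read_buffer(size_t total_mem, int ntapes)
{
	assert(ntapes > 0);
	return read_buffer_size_for(total_mem / (size_t) ntapes);
}
```

Under these assumptions, 4MB split over 6 tapes gives each tape 696320 bytes, i.e. 85 blocks, and a tape never gets less than one block even when the budget is tiny.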

Attachments:

0001-Change-the-way-pre-reading-in-external-sort-s-merge-.patch (text/x-diff)
From 6101a4b91f537bf483059b0b6e8ff13d6e7be9fa Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 12 Sep 2016 22:02:34 +0300
Subject: [PATCH 1/1] Change the way pre-reading in external sort's merge phase
 works.

Don't pre-read tuples into SortTuple slots during merge. Instead, use the
memory for larger read buffers in logtape.c. We're doing the same number
of READTUP() calls either way, but managing the pre-read SortTuple slots
is much more complicated. Also, the on-tape representation is more compact
than SortTuples, so we can fit more pre-read tuples into the same amount
of memory this way. And we have better cache-locality, when we use just a
small number of SortTuple slots.

Now that we only hold one tuple from each tape in the SortTuple slots, we
can greatly simplify the "batch memory" management. We now maintain a
small set of fixed-sized buffers, to hold the tuples, and fall back to
palloc() for larger tuples. We use this method during all merge phases,
not just the final merge.
---
 src/backend/utils/sort/logtape.c   | 140 +++++-
 src/backend/utils/sort/tuplesort.c | 996 +++++++++++--------------------------
 src/include/utils/logtape.h        |   1 +
 3 files changed, 399 insertions(+), 738 deletions(-)
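
The commit message above describes keeping a small set of fixed-size buffers for tuples during merge and falling back to palloc() for larger tuples. The following is a rough sketch of that general technique, not the patch's implementation: a free list threaded through the unused slots, in the spirit of the MergeTupleBuffer union. The names SlotPool, pool_init, pool_alloc, pool_free and pool_owns are hypothetical, malloc()/free() stand in for palloc()/pfree(), and falling back to malloc() when the pool is exhausted is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SLOT_SIZE 1024			/* same value as MERGETUPLEBUFFER_SIZE */

typedef union Slot
{
	union Slot *nextfree;		/* valid while the slot is on the free list */
	char		buffer[SLOT_SIZE];	/* valid while the slot holds a tuple */
} Slot;

typedef struct SlotPool
{
	Slot	   *slots;			/* one contiguous arena of nslots slots */
	size_t		nslots;
	Slot	   *freelist;		/* head of the list of unused slots */
} SlotPool;

/* Allocate the arena and push every slot onto the free list. */
static int
pool_init(SlotPool *pool, size_t nslots)
{
	assert(nslots > 0);
	pool->slots = malloc(nslots * sizeof(Slot));
	if (pool->slots == NULL)
		return -1;
	pool->nslots = nslots;
	pool->freelist = NULL;
	for (size_t i = 0; i < nslots; i++)
	{
		pool->slots[i].nextfree = pool->freelist;
		pool->freelist = &pool->slots[i];
	}
	return 0;
}

/* Does this pointer fall inside the pool's arena? */
static int
pool_owns(const SlotPool *pool, const void *p)
{
	uintptr_t	addr = (uintptr_t) p;

	return addr >= (uintptr_t) pool->slots &&
		addr < (uintptr_t) (pool->slots + pool->nslots);
}

/*
 * Hand out a pooled slot when the request fits and one is free; otherwise
 * fall back to malloc().
 */
static void *
pool_alloc(SlotPool *pool, size_t len)
{
	Slot	   *slot;

	if (len > SLOT_SIZE || pool->freelist == NULL)
		return malloc(len);
	slot = pool->freelist;
	pool->freelist = slot->nextfree;
	return slot->buffer;
}

/* Return a pooled slot to the free list, or free() a fallback allocation. */
static void
pool_free(SlotPool *pool, void *p)
{
	if (pool_owns(pool, p))
	{
		Slot	   *slot = (Slot *) p;

		slot->nextfree = pool->freelist;
		pool->freelist = slot;
	}
	else
		free(p);
}
```

Because allocation and release of a pooled slot are each a couple of pointer assignments, the common case avoids general-purpose allocator overhead entirely. The address-range check in pool_free() is what lets callers release pooled and fallback allocations through the same call.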

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 7745207..3377cef 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -79,6 +79,7 @@
 
 #include "storage/buffile.h"
 #include "utils/logtape.h"
+#include "utils/memutils.h"
 
 /*
  * Block indexes are "long"s, so we can fit this many per indirect block.
@@ -131,9 +132,12 @@ typedef struct LogicalTape
 	 * reading.
 	 */
 	char	   *buffer;			/* physical buffer (separately palloc'd) */
+	int			buffer_size;	/* allocated size of the buffer */
 	long		curBlockNumber; /* this block's logical blk# within tape */
 	int			pos;			/* next read/write position in buffer */
 	int			nbytes;			/* total # of valid bytes in buffer */
+
+	int			read_buffer_size;	/* buffer size to use when reading */
 } LogicalTape;
 
 /*
@@ -228,6 +232,53 @@ ltsReadBlock(LogicalTapeSet *lts, long blocknum, void *buffer)
 }
 
 /*
+ * Read as many blocks as we can into the per-tape buffer.
+ *
+ * The caller can specify the next physical block number to read, in
+ * datablocknum, or -1 to fetch the next block number from the internal block.
+ * If datablocknum == -1, the caller must've already set curBlockNumber.
+ *
+ * Returns true if anything was read, 'false' on EOF.
+ */
+static bool
+ltsReadFillBuffer(LogicalTapeSet *lts, LogicalTape *lt, long datablocknum)
+{
+	lt->pos = 0;
+	lt->nbytes = 0;
+
+	do
+	{
+		/* Fetch next block number (unless provided by caller) */
+		if (datablocknum == -1)
+		{
+			datablocknum = ltsRecallNextBlockNum(lts, lt->indirect, lt->frozen);
+			if (datablocknum == -1L)
+				break;			/* EOF */
+			lt->curBlockNumber++;
+		}
+
+		/* Read the block */
+		ltsReadBlock(lts, datablocknum, (void *) (lt->buffer + lt->nbytes));
+		if (!lt->frozen)
+			ltsReleaseBlock(lts, datablocknum);
+
+		if (lt->curBlockNumber < lt->numFullBlocks)
+			lt->nbytes += BLCKSZ;
+		else
+		{
+			/* EOF */
+			lt->nbytes += lt->lastBlockBytes;
+			break;
+		}
+
+		/* Advance to next block, if we have buffer space left */
+		datablocknum = -1;
+	} while (lt->nbytes < lt->buffer_size);
+
+	return (lt->nbytes > 0);
+}
+
+/*
  * qsort comparator for sorting freeBlocks[] into decreasing order.
  */
 static int
@@ -546,6 +597,8 @@ LogicalTapeSetCreate(int ntapes)
 		lt->numFullBlocks = 0L;
 		lt->lastBlockBytes = 0;
 		lt->buffer = NULL;
+		lt->buffer_size = 0;
+		lt->read_buffer_size = BLCKSZ;
 		lt->curBlockNumber = 0L;
 		lt->pos = 0;
 		lt->nbytes = 0;
@@ -628,7 +681,10 @@ LogicalTapeWrite(LogicalTapeSet *lts, int tapenum,
 
 	/* Allocate data buffer and first indirect block on first write */
 	if (lt->buffer == NULL)
+	{
 		lt->buffer = (char *) palloc(BLCKSZ);
+		lt->buffer_size = BLCKSZ;
+	}
 	if (lt->indirect == NULL)
 	{
 		lt->indirect = (IndirectBlock *) palloc(sizeof(IndirectBlock));
@@ -636,6 +692,7 @@ LogicalTapeWrite(LogicalTapeSet *lts, int tapenum,
 		lt->indirect->nextup = NULL;
 	}
 
+	Assert(lt->buffer_size == BLCKSZ);
 	while (size > 0)
 	{
 		if (lt->pos >= BLCKSZ)
@@ -709,18 +766,19 @@ LogicalTapeRewind(LogicalTapeSet *lts, int tapenum, bool forWrite)
 			Assert(lt->frozen);
 			datablocknum = ltsRewindFrozenIndirectBlock(lts, lt->indirect);
 		}
+
+		/* Allocate a read buffer */
+		if (lt->buffer)
+			pfree(lt->buffer);
+		lt->buffer = palloc(lt->read_buffer_size);
+		lt->buffer_size = lt->read_buffer_size;
+
 		/* Read the first block, or reset if tape is empty */
 		lt->curBlockNumber = 0L;
 		lt->pos = 0;
 		lt->nbytes = 0;
 		if (datablocknum != -1L)
-		{
-			ltsReadBlock(lts, datablocknum, (void *) lt->buffer);
-			if (!lt->frozen)
-				ltsReleaseBlock(lts, datablocknum);
-			lt->nbytes = (lt->curBlockNumber < lt->numFullBlocks) ?
-				BLCKSZ : lt->lastBlockBytes;
-		}
+			ltsReadFillBuffer(lts, lt, datablocknum);
 	}
 	else
 	{
@@ -754,6 +812,13 @@ LogicalTapeRewind(LogicalTapeSet *lts, int tapenum, bool forWrite)
 		lt->curBlockNumber = 0L;
 		lt->pos = 0;
 		lt->nbytes = 0;
+
+		if (lt->buffer)
+		{
+			pfree(lt->buffer);
+			lt->buffer = NULL;
+			lt->buffer_size = 0;
+		}
 	}
 }
 
@@ -779,20 +844,8 @@ LogicalTapeRead(LogicalTapeSet *lts, int tapenum,
 		if (lt->pos >= lt->nbytes)
 		{
 			/* Try to load more data into buffer. */
-			long		datablocknum = ltsRecallNextBlockNum(lts, lt->indirect,
-															 lt->frozen);
-
-			if (datablocknum == -1L)
+			if (!ltsReadFillBuffer(lts, lt, -1))
 				break;			/* EOF */
-			lt->curBlockNumber++;
-			lt->pos = 0;
-			ltsReadBlock(lts, datablocknum, (void *) lt->buffer);
-			if (!lt->frozen)
-				ltsReleaseBlock(lts, datablocknum);
-			lt->nbytes = (lt->curBlockNumber < lt->numFullBlocks) ?
-				BLCKSZ : lt->lastBlockBytes;
-			if (lt->nbytes <= 0)
-				break;			/* EOF (possible here?) */
 		}
 
 		nthistime = lt->nbytes - lt->pos;
@@ -842,6 +895,22 @@ LogicalTapeFreeze(LogicalTapeSet *lts, int tapenum)
 	lt->writing = false;
 	lt->frozen = true;
 	datablocknum = ltsRewindIndirectBlock(lts, lt->indirect, true);
+
+	/*
+	 * The seek and backspace functions assume a single block read buffer.
+	 * That's OK with current usage. A larger buffer is helpful to make the
+	 * read pattern of the backing file look more sequential to the OS, when
+	 * we're reading from multiple tapes. But at the end of a sort, when a
+	 * tape is frozen, we only read from a single tape anyway.
+	 */
+	if (!lt->buffer || lt->buffer_size != BLCKSZ)
+	{
+		if (lt->buffer)
+			pfree(lt->buffer);
+		lt->buffer = palloc(BLCKSZ);
+		lt->buffer_size = BLCKSZ;
+	}
+
 	/* Read the first block, or reset if tape is empty */
 	lt->curBlockNumber = 0L;
 	lt->pos = 0;
@@ -875,6 +944,7 @@ LogicalTapeBackspace(LogicalTapeSet *lts, int tapenum, size_t size)
 	Assert(tapenum >= 0 && tapenum < lts->nTapes);
 	lt = &lts->tapes[tapenum];
 	Assert(lt->frozen);
+	Assert(lt->buffer_size == BLCKSZ);
 
 	/*
 	 * Easy case for seek within current block.
@@ -941,6 +1011,7 @@ LogicalTapeSeek(LogicalTapeSet *lts, int tapenum,
 	lt = &lts->tapes[tapenum];
 	Assert(lt->frozen);
 	Assert(offset >= 0 && offset <= BLCKSZ);
+	Assert(lt->buffer_size == BLCKSZ);
 
 	/*
 	 * Easy case for seek within current block.
@@ -1002,6 +1073,10 @@ LogicalTapeTell(LogicalTapeSet *lts, int tapenum,
 
 	Assert(tapenum >= 0 && tapenum < lts->nTapes);
 	lt = &lts->tapes[tapenum];
+
+	/* With a larger buffer, 'pos' wouldn't be the same as offset within page */
+	Assert(lt->buffer_size == BLCKSZ);
+
 	*blocknum = lt->curBlockNumber;
 	*offset = lt->pos;
 }
@@ -1014,3 +1089,28 @@ LogicalTapeSetBlocks(LogicalTapeSet *lts)
 {
 	return lts->nFileBlocks;
 }
+
+/*
+ * Set buffer size to use, when reading from given tape.
+ */
+void
+LogicalTapeAssignReadBufferSize(LogicalTapeSet *lts, int tapenum, size_t avail_mem)
+{
+	LogicalTape *lt;
+
+	Assert(tapenum >= 0 && tapenum < lts->nTapes);
+	lt = &lts->tapes[tapenum];
+
+	/*
+	 * The buffer size must be a multiple of BLCKSZ, so round the given
+	 * value down to the nearest multiple. Make sure we have at least one page.
+	 * Also, don't go above MaxAllocSize, to avoid erroring out. A multi-gigabyte
+	 * buffer is unlikely to be helpful, anyway.
+	 */
+	if (avail_mem < BLCKSZ)
+		avail_mem = BLCKSZ;
+	if (avail_mem > MaxAllocSize)
+		avail_mem = MaxAllocSize;
+	avail_mem -= avail_mem % BLCKSZ;
+	lt->read_buffer_size = avail_mem;
+}
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index d600670..3c95b2d 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -162,7 +162,7 @@ bool		optimize_bounded_sort = true;
  * The objects we actually sort are SortTuple structs.  These contain
  * a pointer to the tuple proper (might be a MinimalTuple or IndexTuple),
  * which is a separate palloc chunk --- we assume it is just one chunk and
- * can be freed by a simple pfree() (except during final on-the-fly merge,
+ * can be freed by a simple pfree() (except during merge,
  * when memory is used in batch).  SortTuples also contain the tuple's
  * first key column in Datum/nullflag format, and an index integer.
  *
@@ -191,9 +191,8 @@ bool		optimize_bounded_sort = true;
  * it now only distinguishes RUN_FIRST and HEAP_RUN_NEXT, since replacement
  * selection is always abandoned after the first run; no other run number
  * should be represented here.  During merge passes, we re-use it to hold the
- * input tape number that each tuple in the heap was read from, or to hold the
- * index of the next tuple pre-read from the same tape in the case of pre-read
- * entries.  tupindex goes unused if the sort occurs entirely in memory.
+ * input tape number that each tuple in the heap was read from.  tupindex goes
+ * unused if the sort occurs entirely in memory.
  */
 typedef struct
 {
@@ -203,6 +202,20 @@ typedef struct
 	int			tupindex;		/* see notes above */
 } SortTuple;
 
+/*
+ * During merge, we use a pre-allocated set of fixed-size buffers to store
+ * tuples in, to avoid palloc/pfree overhead.
+ *
+ * 'nextfree' is valid when this chunk is in the free list. When in use, the
+ * buffer holds a tuple.
+ */
+#define MERGETUPLEBUFFER_SIZE 1024
+
+typedef union MergeTupleBuffer
+{
+	union MergeTupleBuffer *nextfree;
+	char		buffer[MERGETUPLEBUFFER_SIZE];
+} MergeTupleBuffer;
 
 /*
  * Possible states of a Tuplesort object.  These denote the states that
@@ -307,14 +320,6 @@ struct Tuplesortstate
 										int tapenum, unsigned int len);
 
 	/*
-	 * Function to move a caller tuple.  This is usually implemented as a
-	 * memmove() shim, but function may also perform additional fix-up of
-	 * caller tuple where needed.  Batch memory support requires the movement
-	 * of caller tuples from one location in memory to another.
-	 */
-	void		(*movetup) (void *dest, void *src, unsigned int len);
-
-	/*
 	 * This array holds the tuples now in sort memory.  If we are in state
 	 * INITIAL, the tuples are in no particular order; if we are in state
 	 * SORTEDINMEM, the tuples are in final sorted order; in states BUILDRUNS
@@ -332,12 +337,40 @@ struct Tuplesortstate
 	/*
 	 * Memory for tuples is sometimes allocated in batch, rather than
 	 * incrementally.  This implies that incremental memory accounting has
-	 * been abandoned.  Currently, this only happens for the final on-the-fly
-	 * merge step.  Large batch allocations can store tuples (e.g.
-	 * IndexTuples) without palloc() fragmentation and other overhead.
+	 * been abandoned.  Currently, this happens when we start merging.
+	 * Large batch allocations can store tuples (e.g. IndexTuples) without
+	 * palloc() fragmentation and other overhead.
+	 *
+	 * For the batch memory, we use one large allocation, divided into
+	 * MERGETUPLEBUFFER_SIZE chunks. The allocation is sized to hold
+	 * one chunk per tape, plus one additional chunk. We need that many
+	 * chunks to hold all the tuples kept in the heap during merge, plus
+	 * the one we have last returned from the sort.
+	 *
+	 * Initially, all the chunks are kept in a linked list, in freeBufferHead.
+	 * When a tuple is read from a tape, it is put in the next available
+	 * chunk, if it fits. If the tuple is larger than MERGETUPLEBUFFER_SIZE,
+	 * it is palloc'd instead.
+	 *
+	 * When we're done processing a tuple, we return the chunk back to the
+	 * free list, or pfree() it if it was palloc'd. We know that a tuple was
+	 * allocated from the batch memory arena if its pointer value is between
+	 * batchMemoryBegin and batchMemoryEnd.
 	 */
 	bool		batchUsed;
 
+	char	   *batchMemoryBegin;	/* beginning of batch memory arena */
+	char	   *batchMemoryEnd;		/* end of batch memory arena */
+	MergeTupleBuffer *freeBufferHead;	/* head of free list */
+
+	/*
+	 * When we return a tuple to the caller that came from a tape (that is,
+	 * in TSS_SORTEDONTAPE or TSS_FINALMERGE modes), we remember the tuple
+	 * in 'readlasttuple', so that we can recycle the memory on next
+	 * gettuple call.
+	 */
+	void	   *readlasttuple;
+
 	/*
 	 * While building initial runs, this indicates if the replacement
 	 * selection strategy is in use.  When it isn't, then a simple hybrid
@@ -358,42 +391,11 @@ struct Tuplesortstate
 	 */
 
 	/*
-	 * These variables are only used during merge passes.  mergeactive[i] is
+	 * This variable is only used during merge passes.  mergeactive[i] is
 	 * true if we are reading an input run from (actual) tape number i and
-	 * have not yet exhausted that run.  mergenext[i] is the memtuples index
-	 * of the next pre-read tuple (next to be loaded into the heap) for tape
-	 * i, or 0 if we are out of pre-read tuples.  mergelast[i] similarly
-	 * points to the last pre-read tuple from each tape.  mergeavailslots[i]
-	 * is the number of unused memtuples[] slots reserved for tape i, and
-	 * mergeavailmem[i] is the amount of unused space allocated for tape i.
-	 * mergefreelist and mergefirstfree keep track of unused locations in the
-	 * memtuples[] array.  The memtuples[].tupindex fields link together
-	 * pre-read tuples for each tape as well as recycled locations in
-	 * mergefreelist. It is OK to use 0 as a null link in these lists, because
-	 * memtuples[0] is part of the merge heap and is never a pre-read tuple.
+	 * have not yet exhausted that run.
 	 */
 	bool	   *mergeactive;	/* active input run source? */
-	int		   *mergenext;		/* first preread tuple for each source */
-	int		   *mergelast;		/* last preread tuple for each source */
-	int		   *mergeavailslots;	/* slots left for prereading each tape */
-	int64	   *mergeavailmem;	/* availMem for prereading each tape */
-	int			mergefreelist;	/* head of freelist of recycled slots */
-	int			mergefirstfree; /* first slot never used in this merge */
-
-	/*
-	 * Per-tape batch state, when final on-the-fly merge consumes memory from
-	 * just a few large allocations.
-	 *
-	 * Aside from the general benefits of performing fewer individual retail
-	 * palloc() calls, this also helps make merging more cache efficient,
-	 * since each tape's tuples must naturally be accessed sequentially (in
-	 * sorted order).
-	 */
-	int64		spacePerTape;	/* Space (memory) for tuples (not slots) */
-	char	  **mergetuples;	/* Each tape's memory allocation */
-	char	  **mergecurrent;	/* Current offset into each tape's memory */
-	char	  **mergetail;		/* Last item's start point for each tape */
-	char	  **mergeoverflow;	/* Retail palloc() "overflow" for each tape */
 
 	/*
 	 * Variables for Algorithm D.  Note that destTape is a "logical" tape
@@ -481,11 +483,33 @@ struct Tuplesortstate
 #endif
 };
 
+/*
+ * Is the given tuple allocated from the batch memory arena?
+ */
+#define IS_MERGETUPLE_BUFFER(state, tuple) \
+	((char *) (tuple) >= (state)->batchMemoryBegin && \
+	 (char *) (tuple) < (state)->batchMemoryEnd)
+
+/*
+ * Return the given tuple to the batch memory free list, or free it
+ * if it was palloc'd.
+ */
+#define RELEASE_MERGETUPLE_BUFFER(state, tuple) \
+	do { \
+		MergeTupleBuffer *buf = (MergeTupleBuffer *) (tuple); \
+		\
+		if (IS_MERGETUPLE_BUFFER(state, buf)) \
+		{ \
+			buf->nextfree = (state)->freeBufferHead; \
+			(state)->freeBufferHead = buf; \
+		} else \
+			pfree(buf); \
+	} while(0)
+
 #define COMPARETUP(state,a,b)	((*(state)->comparetup) (a, b, state))
 #define COPYTUP(state,stup,tup) ((*(state)->copytup) (state, stup, tup))
 #define WRITETUP(state,tape,stup)	((*(state)->writetup) (state, tape, stup))
 #define READTUP(state,stup,tape,len) ((*(state)->readtup) (state, stup, tape, len))
-#define MOVETUP(dest,src,len) ((*(state)->movetup) (dest, src, len))
 #define LACKMEM(state)		((state)->availMem < 0 && !(state)->batchUsed)
 #define USEMEM(state,amt)	((state)->availMem -= (amt))
 #define FREEMEM(state,amt)	((state)->availMem += (amt))
@@ -553,16 +577,8 @@ static void inittapes(Tuplesortstate *state);
 static void selectnewtape(Tuplesortstate *state);
 static void mergeruns(Tuplesortstate *state);
 static void mergeonerun(Tuplesortstate *state);
-static void beginmerge(Tuplesortstate *state, bool finalMergeBatch);
-static void batchmemtuples(Tuplesortstate *state);
-static void mergebatch(Tuplesortstate *state, int64 spacePerTape);
-static void mergebatchone(Tuplesortstate *state, int srcTape,
-			  SortTuple *stup, bool *should_free);
-static void mergebatchfreetape(Tuplesortstate *state, int srcTape,
-				   SortTuple *rtup, bool *should_free);
-static void *mergebatchalloc(Tuplesortstate *state, int tapenum, Size tuplen);
-static void mergepreread(Tuplesortstate *state);
-static void mergeprereadone(Tuplesortstate *state, int srcTape);
+static void beginmerge(Tuplesortstate *state);
+static bool mergereadnext(Tuplesortstate *state, int srcTape, SortTuple *stup);
 static void dumptuples(Tuplesortstate *state, bool alltuples);
 static void dumpbatch(Tuplesortstate *state, bool alltuples);
 static void make_bounded_heap(Tuplesortstate *state);
@@ -576,7 +592,7 @@ static void tuplesort_heap_delete_top(Tuplesortstate *state, bool checkIndex);
 static void reversedirection(Tuplesortstate *state);
 static unsigned int getlen(Tuplesortstate *state, int tapenum, bool eofOK);
 static void markrunend(Tuplesortstate *state, int tapenum);
-static void *readtup_alloc(Tuplesortstate *state, int tapenum, Size tuplen);
+static void *readtup_alloc(Tuplesortstate *state, Size tuplen);
 static int comparetup_heap(const SortTuple *a, const SortTuple *b,
 				Tuplesortstate *state);
 static void copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup);
@@ -584,7 +600,6 @@ static void writetup_heap(Tuplesortstate *state, int tapenum,
 			  SortTuple *stup);
 static void readtup_heap(Tuplesortstate *state, SortTuple *stup,
 			 int tapenum, unsigned int len);
-static void movetup_heap(void *dest, void *src, unsigned int len);
 static int comparetup_cluster(const SortTuple *a, const SortTuple *b,
 				   Tuplesortstate *state);
 static void copytup_cluster(Tuplesortstate *state, SortTuple *stup, void *tup);
@@ -592,7 +607,6 @@ static void writetup_cluster(Tuplesortstate *state, int tapenum,
 				 SortTuple *stup);
 static void readtup_cluster(Tuplesortstate *state, SortTuple *stup,
 				int tapenum, unsigned int len);
-static void movetup_cluster(void *dest, void *src, unsigned int len);
 static int comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 					   Tuplesortstate *state);
 static int comparetup_index_hash(const SortTuple *a, const SortTuple *b,
@@ -602,7 +616,6 @@ static void writetup_index(Tuplesortstate *state, int tapenum,
 			   SortTuple *stup);
 static void readtup_index(Tuplesortstate *state, SortTuple *stup,
 			  int tapenum, unsigned int len);
-static void movetup_index(void *dest, void *src, unsigned int len);
 static int comparetup_datum(const SortTuple *a, const SortTuple *b,
 				 Tuplesortstate *state);
 static void copytup_datum(Tuplesortstate *state, SortTuple *stup, void *tup);
@@ -610,7 +623,6 @@ static void writetup_datum(Tuplesortstate *state, int tapenum,
 			   SortTuple *stup);
 static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
 			  int tapenum, unsigned int len);
-static void movetup_datum(void *dest, void *src, unsigned int len);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
 
 /*
@@ -762,7 +774,6 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
 	state->readtup = readtup_heap;
-	state->movetup = movetup_heap;
 
 	state->tupDesc = tupDesc;	/* assume we need not copy tupDesc */
 	state->abbrevNext = 10;
@@ -835,7 +846,6 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 	state->copytup = copytup_cluster;
 	state->writetup = writetup_cluster;
 	state->readtup = readtup_cluster;
-	state->movetup = movetup_cluster;
 	state->abbrevNext = 10;
 
 	state->indexInfo = BuildIndexInfo(indexRel);
@@ -927,7 +937,6 @@ tuplesort_begin_index_btree(Relation heapRel,
 	state->copytup = copytup_index;
 	state->writetup = writetup_index;
 	state->readtup = readtup_index;
-	state->movetup = movetup_index;
 	state->abbrevNext = 10;
 
 	state->heapRel = heapRel;
@@ -995,7 +1004,6 @@ tuplesort_begin_index_hash(Relation heapRel,
 	state->copytup = copytup_index;
 	state->writetup = writetup_index;
 	state->readtup = readtup_index;
-	state->movetup = movetup_index;
 
 	state->heapRel = heapRel;
 	state->indexRel = indexRel;
@@ -1038,7 +1046,6 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	state->copytup = copytup_datum;
 	state->writetup = writetup_datum;
 	state->readtup = readtup_datum;
-	state->movetup = movetup_datum;
 	state->abbrevNext = 10;
 
 	state->datumType = datumType;
@@ -1884,14 +1891,33 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 		case TSS_SORTEDONTAPE:
 			Assert(forward || state->randomAccess);
 			Assert(!state->batchUsed);
-			*should_free = true;
+
+			/*
+			 * The buffer holding the tuple that we returned in previous
+			 * gettuple call can now be reused.
+			 */
+			if (state->readlasttuple)
+			{
+				RELEASE_MERGETUPLE_BUFFER(state, state->readlasttuple);
+				state->readlasttuple = NULL;
+			}
+
 			if (forward)
 			{
 				if (state->eof_reached)
 					return false;
+
 				if ((tuplen = getlen(state, state->result_tape, true)) != 0)
 				{
 					READTUP(state, stup, state->result_tape, tuplen);
+
+					/*
+					 * Remember the tuple we return, so that we can recycle its
+					 * memory on next call. (This can be NULL, in the Datum case).
+					 */
+					state->readlasttuple = stup->tuple;
+
+					*should_free = false;
 					return true;
 				}
 				else
@@ -1965,74 +1991,63 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 									  tuplen))
 				elog(ERROR, "bogus tuple length in backward scan");
 			READTUP(state, stup, state->result_tape, tuplen);
+
+			/*
+			 * Remember the tuple we return, so that we can recycle its
+			 * memory on next call. (This can be NULL, in the Datum case).
+			 */
+			state->readlasttuple = stup->tuple;
+
+			*should_free = false;
 			return true;
 
 		case TSS_FINALMERGE:
 			Assert(forward);
 			Assert(state->batchUsed || !state->tuples);
-			/* For now, assume tuple is stored in tape's batch memory */
+			/* We are managing memory ourselves, with the batch memory arena. */
 			*should_free = false;
 
 			/*
+			 * The buffer holding the tuple that we returned in previous
+			 * gettuple call can now be reused.
+			 */
+			if (state->readlasttuple)
+			{
+				RELEASE_MERGETUPLE_BUFFER(state, state->readlasttuple);
+				state->readlasttuple = NULL;
+			}
+
+			/*
 			 * This code should match the inner loop of mergeonerun().
 			 */
 			if (state->memtupcount > 0)
 			{
 				int			srcTape = state->memtuples[0].tupindex;
-				int			tupIndex;
-				SortTuple  *newtup;
+				SortTuple	newtup;
 
-				/*
-				 * Returned tuple is still counted in our memory space most of
-				 * the time.  See mergebatchone() for discussion of why caller
-				 * may occasionally be required to free returned tuple, and
-				 * how preread memory is managed with regard to edge cases
-				 * more generally.
-				 */
 				*stup = state->memtuples[0];
-				if ((tupIndex = state->mergenext[srcTape]) == 0)
-				{
-					/*
-					 * out of preloaded data on this tape, try to read more
-					 *
-					 * Unlike mergeonerun(), we only preload from the single
-					 * tape that's run dry, though not before preparing its
-					 * batch memory for a new round of sequential consumption.
-					 * See mergepreread() comments.
-					 */
-					if (state->batchUsed)
-						mergebatchone(state, srcTape, stup, should_free);
 
-					mergeprereadone(state, srcTape);
+				/*
+				 * Remember the tuple we return, so that we can recycle its
+				 * memory on next call. (This can be NULL, in the Datum case).
+				 */
+				state->readlasttuple = stup->tuple;
 
+				/*
+				 * Pull next tuple from tape, and replace the returned tuple
+				 * at top of the heap with it.
+				 */
+				if (!mergereadnext(state, srcTape, &newtup))
+				{
 					/*
-					 * if still no data, we've reached end of run on this tape
+					 * If no more data, we've reached end of run on this tape.
+					 * Remove the top node from the heap.
 					 */
-					if ((tupIndex = state->mergenext[srcTape]) == 0)
-					{
-						/* Remove the top node from the heap */
-						tuplesort_heap_delete_top(state, false);
-						/* Free tape's buffer, avoiding dangling pointer */
-						if (state->batchUsed)
-							mergebatchfreetape(state, srcTape, stup, should_free);
-						return true;
-					}
+					tuplesort_heap_delete_top(state, false);
+					return true;
 				}
-
-				/*
-				 * pull next preread tuple from list, and replace the returned
-				 * tuple at top of the heap with it.
-				 */
-				newtup = &state->memtuples[tupIndex];
-				state->mergenext[srcTape] = newtup->tupindex;
-				if (state->mergenext[srcTape] == 0)
-					state->mergelast[srcTape] = 0;
-				newtup->tupindex = srcTape;
-				tuplesort_heap_replace_top(state, newtup, false);
-				/* put the now-unused memtuples entry on the freelist */
-				newtup->tupindex = state->mergefreelist;
-				state->mergefreelist = tupIndex;
-				state->mergeavailslots[srcTape]++;
+				newtup.tupindex = srcTape;
+				tuplesort_heap_replace_top(state, &newtup, false);
 				return true;
 			}
 			return false;
@@ -2334,7 +2349,8 @@ inittapes(Tuplesortstate *state)
 #endif
 
 	/*
-	 * Decrease availMem to reflect the space needed for tape buffers; but
+	 * Decrease availMem to reflect the space needed for tape buffers, when
+	 * writing the initial runs; but
 	 * don't decrease it to the point that we have no room for tuples. (That
 	 * case is only likely to occur if sorting pass-by-value Datums; in all
 	 * other scenarios the memtuples[] array is unlikely to occupy more than
@@ -2359,14 +2375,6 @@ inittapes(Tuplesortstate *state)
 	state->tapeset = LogicalTapeSetCreate(maxTapes);
 
 	state->mergeactive = (bool *) palloc0(maxTapes * sizeof(bool));
-	state->mergenext = (int *) palloc0(maxTapes * sizeof(int));
-	state->mergelast = (int *) palloc0(maxTapes * sizeof(int));
-	state->mergeavailslots = (int *) palloc0(maxTapes * sizeof(int));
-	state->mergeavailmem = (int64 *) palloc0(maxTapes * sizeof(int64));
-	state->mergetuples = (char **) palloc0(maxTapes * sizeof(char *));
-	state->mergecurrent = (char **) palloc0(maxTapes * sizeof(char *));
-	state->mergetail = (char **) palloc0(maxTapes * sizeof(char *));
-	state->mergeoverflow = (char **) palloc0(maxTapes * sizeof(char *));
 	state->tp_fib = (int *) palloc0(maxTapes * sizeof(int));
 	state->tp_runs = (int *) palloc0(maxTapes * sizeof(int));
 	state->tp_dummy = (int *) palloc0(maxTapes * sizeof(int));
@@ -2478,6 +2486,11 @@ mergeruns(Tuplesortstate *state)
 				svTape,
 				svRuns,
 				svDummy;
+	char	   *p;
+	int			i;
+	int			per_tape, cutoff;
+	long		avail_blocks;
+	int			maxTapes;
 
 	Assert(state->status == TSS_BUILDRUNS);
 	Assert(state->memtupcount == 0);
@@ -2499,6 +2512,42 @@ mergeruns(Tuplesortstate *state)
 	}
 
 	/*
+	 * We no longer need a large memtuples array; during merge, the heap holds
+	 * at most one tuple per tape. Shrink the array to one slot per tape, to
+	 * make the memory available for other use.
+	 */
+	FREEMEM(state, GetMemoryChunkSpace(state->memtuples));
+	pfree(state->memtuples);
+	state->memtupsize = state->maxTapes;
+	state->memtuples = (SortTuple *) palloc(state->maxTapes * sizeof(SortTuple));
+	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
+
+	/*
+	 * If we had fewer runs than tapes, refund buffers for tapes that were never
+	 * allocated.
+	 */
+	maxTapes = state->maxTapes;
+	if (state->currentRun < maxTapes)
+	{
+		FREEMEM(state, (maxTapes - state->currentRun) * TAPE_BUFFER_OVERHEAD);
+		maxTapes = state->currentRun;
+	}
+
+	/* Initialize the merge tuple buffer arena.  */
+	state->batchMemoryBegin = palloc((maxTapes + 1) * MERGETUPLEBUFFER_SIZE);
+	state->batchMemoryEnd = state->batchMemoryBegin + (maxTapes + 1) * MERGETUPLEBUFFER_SIZE;
+	state->freeBufferHead = (MergeTupleBuffer *) state->batchMemoryBegin;
+	USEMEM(state, (maxTapes + 1) * MERGETUPLEBUFFER_SIZE);
+
+	p = state->batchMemoryBegin;
+	for (i = 0; i < maxTapes; i++)
+	{
+		((MergeTupleBuffer *) p)->nextfree = (MergeTupleBuffer *) (p + MERGETUPLEBUFFER_SIZE);
+		p += MERGETUPLEBUFFER_SIZE;
+	}
+	((MergeTupleBuffer *) p)->nextfree = NULL;
+
+	/*
 	 * If we produced only one initial run (quite likely if the total data
 	 * volume is between 1X and 2X workMem when replacement selection is used,
 	 * but something we particular count on when input is presorted), we can
@@ -2514,6 +2563,39 @@ mergeruns(Tuplesortstate *state)
 		return;
 	}
 
+	/*
+	 * Use all the spare memory we have available for read buffers. Divide it
+	 * evenly among all the tapes.
+	 */
+	avail_blocks = state->availMem / BLCKSZ;
+	per_tape = avail_blocks / maxTapes;
+	cutoff = avail_blocks % maxTapes;
+	if (per_tape == 0)
+	{
+		per_tape = 1;
+		cutoff = 0;
+	}
+	for (tapenum = 0; tapenum < maxTapes; tapenum++)
+	{
+		LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
+										(per_tape + (tapenum < cutoff ? 1 : 0)) * BLCKSZ);
+	}
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG, "using %ldkB of memory for read buffers in %d tapes, %ldkB per tape",
+			 (long) (avail_blocks * BLCKSZ) / 1024, maxTapes,
+			 (long) (per_tape * BLCKSZ) / 1024);
+#endif
+
+	USEMEM(state, avail_blocks * BLCKSZ);
+
+	/*
+	 * From this point on, we no longer use the USEMEM()/LACKMEM() mechanism to
+	 * track memory usage of individual tuples.
+	 */
+	state->batchUsed = true;
+
 	/* End of step D2: rewind all output tapes to prepare for merging */
 	for (tapenum = 0; tapenum < state->tapeRange; tapenum++)
 		LogicalTapeRewind(state->tapeset, tapenum, false);
@@ -2544,7 +2626,7 @@ mergeruns(Tuplesortstate *state)
 				/* Tell logtape.c we won't be writing anymore */
 				LogicalTapeSetForgetFreeSpace(state->tapeset);
 				/* Initialize for the final merge pass */
-				beginmerge(state, state->tuples);
+				beginmerge(state);
 				state->status = TSS_FINALMERGE;
 				return;
 			}
@@ -2627,16 +2709,12 @@ mergeonerun(Tuplesortstate *state)
 {
 	int			destTape = state->tp_tapenum[state->tapeRange];
 	int			srcTape;
-	int			tupIndex;
-	SortTuple  *tup;
-	int64		priorAvail,
-				spaceFreed;
 
 	/*
 	 * Start the merge by loading one tuple from each active source tape into
 	 * the heap.  We can also decrease the input run/dummy run counts.
 	 */
-	beginmerge(state, false);
+	beginmerge(state);
 
 	/*
 	 * Execute merge by repeatedly extracting lowest tuple in heap, writing it
@@ -2645,40 +2723,28 @@ mergeonerun(Tuplesortstate *state)
 	 */
 	while (state->memtupcount > 0)
 	{
+		SortTuple stup;
+
 		/* write the tuple to destTape */
-		priorAvail = state->availMem;
 		srcTape = state->memtuples[0].tupindex;
 		WRITETUP(state, destTape, &state->memtuples[0]);
-		/* writetup adjusted total free space, now fix per-tape space */
-		spaceFreed = state->availMem - priorAvail;
-		state->mergeavailmem[srcTape] += spaceFreed;
-		if ((tupIndex = state->mergenext[srcTape]) == 0)
-		{
-			/* out of preloaded data on this tape, try to read more */
-			mergepreread(state);
-			/* if still no data, we've reached end of run on this tape */
-			if ((tupIndex = state->mergenext[srcTape]) == 0)
-			{
-				/* remove the written-out tuple from the heap */
-				tuplesort_heap_delete_top(state, false);
-				continue;
-			}
-		}
+
+		/* recycle the buffer of the tuple we just wrote out, for the next read */
+		RELEASE_MERGETUPLE_BUFFER(state, state->memtuples[0].tuple);
 
 		/*
 		 * pull next preread tuple from list, and replace the written-out
 		 * tuple in the heap with it.
 		 */
-		tup = &state->memtuples[tupIndex];
-		state->mergenext[srcTape] = tup->tupindex;
-		if (state->mergenext[srcTape] == 0)
-			state->mergelast[srcTape] = 0;
-		tup->tupindex = srcTape;
-		tuplesort_heap_replace_top(state, tup, false);
-		/* put the now-unused memtuples entry on the freelist */
-		tup->tupindex = state->mergefreelist;
-		state->mergefreelist = tupIndex;
-		state->mergeavailslots[srcTape]++;
+		if (!mergereadnext(state, srcTape, &stup))
+		{
+			/* we've reached end of run on this tape */
+			/* remove the written-out tuple from the heap */
+			tuplesort_heap_delete_top(state, false);
+			continue;
+		}
+		stup.tupindex = srcTape;
+		tuplesort_heap_replace_top(state, &stup, false);
 	}
 
 	/*
@@ -2711,18 +2777,13 @@ mergeonerun(Tuplesortstate *state)
  * which tapes contain active input runs in mergeactive[].  Then, load
  * as many tuples as we can from each active input tape, and finally
  * fill the merge heap with the first tuple from each active tape.
- *
- * finalMergeBatch indicates if this is the beginning of a final on-the-fly
- * merge where a batched allocation of tuple memory is required.
  */
 static void
-beginmerge(Tuplesortstate *state, bool finalMergeBatch)
+beginmerge(Tuplesortstate *state)
 {
 	int			activeTapes;
 	int			tapenum;
 	int			srcTape;
-	int			slotsPerTape;
-	int64		spacePerTape;
 
 	/* Heap should be empty here */
 	Assert(state->memtupcount == 0);
@@ -2746,517 +2807,47 @@ beginmerge(Tuplesortstate *state, bool finalMergeBatch)
 	}
 	state->activeTapes = activeTapes;
 
-	/* Clear merge-pass state variables */
-	memset(state->mergenext, 0,
-		   state->maxTapes * sizeof(*state->mergenext));
-	memset(state->mergelast, 0,
-		   state->maxTapes * sizeof(*state->mergelast));
-	state->mergefreelist = 0;	/* nothing in the freelist */
-	state->mergefirstfree = activeTapes;		/* 1st slot avail for preread */
-
-	if (finalMergeBatch)
-	{
-		/* Free outright buffers for tape never actually allocated */
-		FREEMEM(state, (state->maxTapes - activeTapes) * TAPE_BUFFER_OVERHEAD);
-
-		/*
-		 * Grow memtuples one last time, since the palloc() overhead no longer
-		 * incurred can make a big difference
-		 */
-		batchmemtuples(state);
-	}
-
 	/*
 	 * Initialize space allocation to let each active input tape have an equal
 	 * share of preread space.
 	 */
 	Assert(activeTapes > 0);
-	slotsPerTape = (state->memtupsize - state->mergefirstfree) / activeTapes;
-	Assert(slotsPerTape > 0);
-	spacePerTape = MAXALIGN_DOWN(state->availMem / activeTapes);
-	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
-	{
-		if (state->mergeactive[srcTape])
-		{
-			state->mergeavailslots[srcTape] = slotsPerTape;
-			state->mergeavailmem[srcTape] = spacePerTape;
-		}
-	}
-
-	/*
-	 * Preallocate tuple batch memory for each tape.  This is the memory used
-	 * for tuples themselves (not SortTuples), so it's never used by
-	 * pass-by-value datum sorts.  Memory allocation is performed here at most
-	 * once per sort, just in advance of the final on-the-fly merge step.
-	 */
-	if (finalMergeBatch)
-		mergebatch(state, spacePerTape);
-
-	/*
-	 * Preread as many tuples as possible (and at least one) from each active
-	 * tape
-	 */
-	mergepreread(state);
 
 	/* Load the merge heap with the first tuple from each input tape */
 	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
 	{
-		int			tupIndex = state->mergenext[srcTape];
-		SortTuple  *tup;
-
-		if (tupIndex)
-		{
-			tup = &state->memtuples[tupIndex];
-			state->mergenext[srcTape] = tup->tupindex;
-			if (state->mergenext[srcTape] == 0)
-				state->mergelast[srcTape] = 0;
-			tup->tupindex = srcTape;
-			tuplesort_heap_insert(state, tup, false);
-			/* put the now-unused memtuples entry on the freelist */
-			tup->tupindex = state->mergefreelist;
-			state->mergefreelist = tupIndex;
-			state->mergeavailslots[srcTape]++;
-
-#ifdef TRACE_SORT
-			if (trace_sort && finalMergeBatch)
-			{
-				int64		perTapeKB = (spacePerTape + 1023) / 1024;
-				int64		usedSpaceKB;
-				int			usedSlots;
-
-				/*
-				 * Report how effective batchmemtuples() was in balancing the
-				 * number of slots against the need for memory for the
-				 * underlying tuples (e.g. IndexTuples).  The big preread of
-				 * all tapes when switching to FINALMERGE state should be
-				 * fairly representative of memory utilization during the
-				 * final merge step, and in any case is the only point at
-				 * which all tapes are guaranteed to have depleted either
-				 * their batch memory allowance or slot allowance.  Ideally,
-				 * both will be completely depleted for every tape by now.
-				 */
-				usedSpaceKB = (state->mergecurrent[srcTape] -
-							   state->mergetuples[srcTape] + 1023) / 1024;
-				usedSlots = slotsPerTape - state->mergeavailslots[srcTape];
-
-				elog(LOG, "tape %d initially used " INT64_FORMAT " KB of "
-					 INT64_FORMAT " KB batch (%2.3f) and %d out of %d slots "
-					 "(%2.3f)", srcTape,
-					 usedSpaceKB, perTapeKB,
-					 (double) usedSpaceKB / (double) perTapeKB,
-					 usedSlots, slotsPerTape,
-					 (double) usedSlots / (double) slotsPerTape);
-			}
-#endif
-		}
-	}
-}
-
-/*
- * batchmemtuples - grow memtuples without palloc overhead
- *
- * When called, availMem should be approximately the amount of memory we'd
- * require to allocate memtupsize - memtupcount tuples (not SortTuples/slots)
- * that were allocated with palloc() overhead, and in doing so use up all
- * allocated slots.  However, though slots and tuple memory is in balance
- * following the last grow_memtuples() call, that's predicated on the observed
- * average tuple size for the "final" grow_memtuples() call, which includes
- * palloc overhead.  During the final merge pass, where we will arrange to
- * squeeze out the palloc overhead, we might need more slots in the memtuples
- * array.
- *
- * To make that happen, arrange for the amount of remaining memory to be
- * exactly equal to the palloc overhead multiplied by the current size of
- * the memtuples array, force the grow_memtuples flag back to true (it's
- * probably but not necessarily false on entry to this routine), and then
- * call grow_memtuples.  This simulates loading enough tuples to fill the
- * whole memtuples array and then having some space left over because of the
- * elided palloc overhead.  We expect that grow_memtuples() will conclude that
- * it can't double the size of the memtuples array but that it can increase
- * it by some percentage; but if it does decide to double it, that just means
- * that we've never managed to use many slots in the memtuples array, in which
- * case doubling it shouldn't hurt anything anyway.
- */
-static void
-batchmemtuples(Tuplesortstate *state)
-{
-	int64		refund;
-	int64		availMemLessRefund;
-	int			memtupsize = state->memtupsize;
-
-	/* Caller error if we have no tapes */
-	Assert(state->activeTapes > 0);
-
-	/* For simplicity, assume no memtuples are actually currently counted */
-	Assert(state->memtupcount == 0);
-
-	/*
-	 * Refund STANDARDCHUNKHEADERSIZE per tuple.
-	 *
-	 * This sometimes fails to make memory use perfectly balanced, but it
-	 * should never make the situation worse.  Note that Assert-enabled builds
-	 * get a larger refund, due to a varying STANDARDCHUNKHEADERSIZE.
-	 */
-	refund = memtupsize * STANDARDCHUNKHEADERSIZE;
-	availMemLessRefund = state->availMem - refund;
-
-	/*
-	 * We need to be sure that we do not cause LACKMEM to become true, else
-	 * the batch allocation size could be calculated as negative, causing
-	 * havoc.  Hence, if availMemLessRefund is negative at this point, we must
-	 * do nothing.  Moreover, if it's positive but rather small, there's
-	 * little point in proceeding because we could only increase memtuples by
-	 * a small amount, not worth the cost of the repalloc's.  We somewhat
-	 * arbitrarily set the threshold at ALLOCSET_DEFAULT_INITSIZE per tape.
-	 * (Note that this does not represent any assumption about tuple sizes.)
-	 */
-	if (availMemLessRefund <=
-		(int64) state->activeTapes * ALLOCSET_DEFAULT_INITSIZE)
-		return;
-
-	/*
-	 * To establish balanced memory use after refunding palloc overhead,
-	 * temporarily have our accounting indicate that we've allocated all
-	 * memory we're allowed to less that refund, and call grow_memtuples() to
-	 * have it increase the number of slots.
-	 */
-	state->growmemtuples = true;
-	USEMEM(state, availMemLessRefund);
-	(void) grow_memtuples(state);
-	state->growmemtuples = false;
-	/* availMem must stay accurate for spacePerTape calculation */
-	FREEMEM(state, availMemLessRefund);
-	if (LACKMEM(state))
-		elog(ERROR, "unexpected out-of-memory situation in tuplesort");
-
-#ifdef TRACE_SORT
-	if (trace_sort)
-	{
-		Size		OldKb = (memtupsize * sizeof(SortTuple) + 1023) / 1024;
-		Size		NewKb = (state->memtupsize * sizeof(SortTuple) + 1023) / 1024;
-
-		elog(LOG, "grew memtuples %1.2fx from %d (%zu KB) to %d (%zu KB) for final merge",
-			 (double) NewKb / (double) OldKb,
-			 memtupsize, OldKb,
-			 state->memtupsize, NewKb);
-	}
-#endif
-}
-
-/*
- * mergebatch - initialize tuple memory in batch
- *
- * This allows sequential access to sorted tuples buffered in memory from
- * tapes/runs on disk during a final on-the-fly merge step.  Note that the
- * memory is not used for SortTuples, but for the underlying tuples (e.g.
- * MinimalTuples).
- *
- * Note that when batch memory is used, there is a simple division of space
- * into large buffers (one per active tape).  The conventional incremental
- * memory accounting (calling USEMEM() and FREEMEM()) is abandoned.  Instead,
- * when each tape's memory budget is exceeded, a retail palloc() "overflow" is
- * performed, which is then immediately detected in a way that is analogous to
- * LACKMEM().  This keeps each tape's use of memory fair, which is always a
- * goal.
- */
-static void
-mergebatch(Tuplesortstate *state, int64 spacePerTape)
-{
-	int			srcTape;
-
-	Assert(state->activeTapes > 0);
-	Assert(state->tuples);
-
-	/*
-	 * For the purposes of tuplesort's memory accounting, the batch allocation
-	 * is special, and regular memory accounting through USEMEM() calls is
-	 * abandoned (see mergeprereadone()).
-	 */
-	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
-	{
-		char	   *mergetuples;
-
-		if (!state->mergeactive[srcTape])
-			continue;
-
-		/* Allocate buffer for each active tape */
-		mergetuples = MemoryContextAllocHuge(state->tuplecontext,
-											 spacePerTape);
-
-		/* Initialize state for tape */
-		state->mergetuples[srcTape] = mergetuples;
-		state->mergecurrent[srcTape] = mergetuples;
-		state->mergetail[srcTape] = mergetuples;
-		state->mergeoverflow[srcTape] = NULL;
-	}
-
-	state->batchUsed = true;
-	state->spacePerTape = spacePerTape;
-}
-
-/*
- * mergebatchone - prepare batch memory for one merge input tape
- *
- * This is called following the exhaustion of preread tuples for one input
- * tape.  All that actually occurs is that the state for the source tape is
- * reset to indicate that all memory may be reused.
- *
- * This routine must deal with fixing up the tuple that is about to be returned
- * to the client, due to "overflow" allocations.
- */
-static void
-mergebatchone(Tuplesortstate *state, int srcTape, SortTuple *rtup,
-			  bool *should_free)
-{
-	Assert(state->batchUsed);
+		SortTuple	tup;
 
-	/*
-	 * Tuple about to be returned to caller ("stup") is final preread tuple
-	 * from tape, just removed from the top of the heap.  Special steps around
-	 * memory management must be performed for that tuple, to make sure it
-	 * isn't overwritten early.
-	 */
-	if (!state->mergeoverflow[srcTape])
-	{
-		Size		tupLen;
-
-		/*
-		 * Mark tuple buffer range for reuse, but be careful to move final,
-		 * tail tuple to start of space for next run so that it's available to
-		 * caller when stup is returned, and remains available at least until
-		 * the next tuple is requested.
-		 */
-		tupLen = state->mergecurrent[srcTape] - state->mergetail[srcTape];
-		state->mergecurrent[srcTape] = state->mergetuples[srcTape];
-		MOVETUP(state->mergecurrent[srcTape], state->mergetail[srcTape],
-				tupLen);
-
-		/* Make SortTuple at top of the merge heap point to new tuple */
-		rtup->tuple = (void *) state->mergecurrent[srcTape];
-
-		state->mergetail[srcTape] = state->mergecurrent[srcTape];
-		state->mergecurrent[srcTape] += tupLen;
-	}
-	else
-	{
-		/*
-		 * Handle an "overflow" retail palloc.
-		 *
-		 * This is needed when we run out of tuple memory for the tape.
-		 */
-		state->mergecurrent[srcTape] = state->mergetuples[srcTape];
-		state->mergetail[srcTape] = state->mergetuples[srcTape];
-
-		if (rtup->tuple)
+		if (mergereadnext(state, srcTape, &tup))
 		{
-			Assert(rtup->tuple == (void *) state->mergeoverflow[srcTape]);
-			/* Caller should free palloc'd tuple */
-			*should_free = true;
+			tup.tupindex = srcTape;
+			tuplesort_heap_insert(state, &tup, false);
 		}
-		state->mergeoverflow[srcTape] = NULL;
-	}
-}
-
-/*
- * mergebatchfreetape - handle final clean-up for batch memory once tape is
- * about to become exhausted
- *
- * All tuples are returned from tape, but a single final tuple, *rtup, is to be
- * passed back to caller.  Free tape's batch allocation buffer while ensuring
- * that the final tuple is managed appropriately.
- */
-static void
-mergebatchfreetape(Tuplesortstate *state, int srcTape, SortTuple *rtup,
-				   bool *should_free)
-{
-	Assert(state->batchUsed);
-	Assert(state->status == TSS_FINALMERGE);
-
-	/*
-	 * Tuple may or may not already be an overflow allocation from
-	 * mergebatchone()
-	 */
-	if (!*should_free && rtup->tuple)
-	{
-		/*
-		 * Final tuple still in tape's batch allocation.
-		 *
-		 * Return palloc()'d copy to caller, and have it freed in a similar
-		 * manner to overflow allocation.  Otherwise, we'd free batch memory
-		 * and pass back a pointer to garbage.  Note that we deliberately
-		 * allocate this in the parent tuplesort context, to be on the safe
-		 * side.
-		 */
-		Size		tuplen;
-		void	   *oldTuple = rtup->tuple;
-
-		tuplen = state->mergecurrent[srcTape] - state->mergetail[srcTape];
-		rtup->tuple = MemoryContextAlloc(state->sortcontext, tuplen);
-		MOVETUP(rtup->tuple, oldTuple, tuplen);
-		*should_free = true;
-	}
-
-	/* Free spacePerTape-sized buffer */
-	pfree(state->mergetuples[srcTape]);
-}
-
-/*
- * mergebatchalloc - allocate memory for one tuple using a batch memory
- * "logical allocation".
- *
- * This is used for the final on-the-fly merge phase only.  READTUP() routines
- * receive memory from here in place of palloc() and USEMEM() calls.
- *
- * Tuple tapenum is passed, ensuring each tape's tuples are stored in sorted,
- * contiguous order (while allowing safe reuse of memory made available to
- * each tape).  This maximizes locality of access as tuples are returned by
- * final merge.
- *
- * Caller must not subsequently attempt to free memory returned here.  In
- * general, only mergebatch* functions know about how memory returned from
- * here should be freed, and this function's caller must ensure that batch
- * memory management code will definitely have the opportunity to do the right
- * thing during the final on-the-fly merge.
- */
-static void *
-mergebatchalloc(Tuplesortstate *state, int tapenum, Size tuplen)
-{
-	Size		reserve_tuplen = MAXALIGN(tuplen);
-	char	   *ret;
-
-	/* Should overflow at most once before mergebatchone() call: */
-	Assert(state->mergeoverflow[tapenum] == NULL);
-	Assert(state->batchUsed);
-
-	/* It should be possible to use precisely spacePerTape memory at once */
-	if (state->mergecurrent[tapenum] + reserve_tuplen <=
-		state->mergetuples[tapenum] + state->spacePerTape)
-	{
-		/*
-		 * Usual case -- caller is returned pointer into its tape's buffer,
-		 * and an offset from that point is recorded as where tape has
-		 * consumed up to for current round of preloading.
-		 */
-		ret = state->mergetail[tapenum] = state->mergecurrent[tapenum];
-		state->mergecurrent[tapenum] += reserve_tuplen;
-	}
-	else
-	{
-		/*
-		 * Allocate memory, and record as tape's overflow allocation.  This
-		 * will be detected quickly, in a similar fashion to a LACKMEM()
-		 * condition, and should not happen again before a new round of
-		 * preloading for caller's tape.  Note that we deliberately allocate
-		 * this in the parent tuplesort context, to be on the safe side.
-		 *
-		 * Sometimes, this does not happen because merging runs out of slots
-		 * before running out of memory.
-		 */
-		ret = state->mergeoverflow[tapenum] =
-			MemoryContextAlloc(state->sortcontext, tuplen);
 	}
-
-	return ret;
-}
-
-/*
- * mergepreread - load tuples from merge input tapes
- *
- * This routine exists to improve sequentiality of reads during a merge pass,
- * as explained in the header comments of this file.  Load tuples from each
- * active source tape until the tape's run is exhausted or it has used up
- * its fair share of available memory.  In any case, we guarantee that there
- * is at least one preread tuple available from each unexhausted input tape.
- *
- * We invoke this routine at the start of a merge pass for initial load,
- * and then whenever any tape's preread data runs out.  Note that we load
- * as much data as possible from all tapes, not just the one that ran out.
- * This is because logtape.c works best with a usage pattern that alternates
- * between reading a lot of data and writing a lot of data, so whenever we
- * are forced to read, we should fill working memory completely.
- *
- * In FINALMERGE state, we *don't* use this routine, but instead just preread
- * from the single tape that ran dry.  There's no read/write alternation in
- * that state and so no point in scanning through all the tapes to fix one.
- * (Moreover, there may be quite a lot of inactive tapes in that state, since
- * we might have had many fewer runs than tapes.  In a regular tape-to-tape
- * merge we can expect most of the tapes to be active.  Plus, only
- * FINALMERGE state has to consider memory management for a batch
- * allocation.)
- */
-static void
-mergepreread(Tuplesortstate *state)
-{
-	int			srcTape;
-
-	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
-		mergeprereadone(state, srcTape);
 }
 
 /*
- * mergeprereadone - load tuples from one merge input tape
+ * mergereadnext - read next tuple from one merge input tape
  *
- * Read tuples from the specified tape until it has used up its free memory
- * or array slots; but ensure that we have at least one tuple, if any are
- * to be had.
+ * Returns false on EOF.
  */
-static void
-mergeprereadone(Tuplesortstate *state, int srcTape)
+static bool
+mergereadnext(Tuplesortstate *state, int srcTape, SortTuple *stup)
 {
 	unsigned int tuplen;
-	SortTuple	stup;
-	int			tupIndex;
-	int64		priorAvail,
-				spaceUsed;
 
 	if (!state->mergeactive[srcTape])
-		return;					/* tape's run is already exhausted */
+		return false;					/* tape's run is already exhausted */
 
-	/*
-	 * Manage per-tape availMem.  Only actually matters when batch memory not
-	 * in use.
-	 */
-	priorAvail = state->availMem;
-	state->availMem = state->mergeavailmem[srcTape];
-
-	/*
-	 * When batch memory is used if final on-the-fly merge, only mergeoverflow
-	 * test is relevant; otherwise, only LACKMEM() test is relevant.
-	 */
-	while ((state->mergeavailslots[srcTape] > 0 &&
-			state->mergeoverflow[srcTape] == NULL && !LACKMEM(state)) ||
-		   state->mergenext[srcTape] == 0)
+	/* read next tuple, if any */
+	if ((tuplen = getlen(state, srcTape, true)) == 0)
 	{
-		/* read next tuple, if any */
-		if ((tuplen = getlen(state, srcTape, true)) == 0)
-		{
-			state->mergeactive[srcTape] = false;
-			break;
-		}
-		READTUP(state, &stup, srcTape, tuplen);
-		/* find a free slot in memtuples[] for it */
-		tupIndex = state->mergefreelist;
-		if (tupIndex)
-			state->mergefreelist = state->memtuples[tupIndex].tupindex;
-		else
-		{
-			tupIndex = state->mergefirstfree++;
-			Assert(tupIndex < state->memtupsize);
-		}
-		state->mergeavailslots[srcTape]--;
-		/* store tuple, append to list for its tape */
-		stup.tupindex = 0;
-		state->memtuples[tupIndex] = stup;
-		if (state->mergelast[srcTape])
-			state->memtuples[state->mergelast[srcTape]].tupindex = tupIndex;
-		else
-			state->mergenext[srcTape] = tupIndex;
-		state->mergelast[srcTape] = tupIndex;
+		state->mergeactive[srcTape] = false;
+		return false;
 	}
-	/* update per-tape and global availmem counts */
-	spaceUsed = state->mergeavailmem[srcTape] - state->availMem;
-	state->mergeavailmem[srcTape] = state->availMem;
-	state->availMem = priorAvail - spaceUsed;
+	READTUP(state, stup, srcTape, tuplen);
+
+	return true;
 }
 
 /*
@@ -3901,38 +3492,30 @@ markrunend(Tuplesortstate *state, int tapenum)
 }
 
 /*
- * Get memory for tuple from within READTUP() routine.  Allocate
- * memory and account for that, or consume from tape's batch
- * allocation.
+ * Get memory for tuple from within READTUP() routine.
  *
- * Memory returned here in the final on-the-fly merge case is recycled
- * from tape's batch allocation.  Otherwise, callers must pfree() or
- * reset tuple child memory context, and account for that with a
- * FREEMEM().  Currently, this only ever needs to happen in WRITETUP()
- * routines.
+ * We use next free buffer from the batch memory arena, or palloc() if
+ * the tuple is too large for that.
  */
 static void *
-readtup_alloc(Tuplesortstate *state, int tapenum, Size tuplen)
+readtup_alloc(Tuplesortstate *state, Size tuplen)
 {
-	if (state->batchUsed)
-	{
-		/*
-		 * No USEMEM() call, because during final on-the-fly merge accounting
-		 * is based on tape-private state. ("Overflow" allocations are
-		 * detected as an indication that a new round or preloading is
-		 * required. Preloading marks existing contents of tape's batch buffer
-		 * for reuse.)
-		 */
-		return mergebatchalloc(state, tapenum, tuplen);
-	}
+	MergeTupleBuffer *buf;
+
+	/*
+	 * We pre-allocate enough buffers in the arena that we should never run out.
+	 */
+	Assert(state->freeBufferHead);
+
+	if (tuplen > MERGETUPLEBUFFER_SIZE || !state->freeBufferHead)
+		return MemoryContextAlloc(state->sortcontext, tuplen);
 	else
 	{
-		char	   *ret;
+		buf = state->freeBufferHead;
+		/* Reuse this buffer */
+		state->freeBufferHead = buf->nextfree;
 
-		/* Batch allocation yet to be performed */
-		ret = MemoryContextAlloc(state->tuplecontext, tuplen);
-		USEMEM(state, GetMemoryChunkSpace(ret));
-		return ret;
+		return buf;
 	}
 }
 
@@ -4101,8 +3684,11 @@ writetup_heap(Tuplesortstate *state, int tapenum, SortTuple *stup)
 		LogicalTapeWrite(state->tapeset, tapenum,
 						 (void *) &tuplen, sizeof(tuplen));
 
-	FREEMEM(state, GetMemoryChunkSpace(tuple));
-	heap_free_minimal_tuple(tuple);
+	if (!state->batchUsed)
+	{
+		FREEMEM(state, GetMemoryChunkSpace(tuple));
+		heap_free_minimal_tuple(tuple);
+	}
 }
 
 static void
@@ -4111,7 +3697,7 @@ readtup_heap(Tuplesortstate *state, SortTuple *stup,
 {
 	unsigned int tupbodylen = len - sizeof(int);
 	unsigned int tuplen = tupbodylen + MINIMAL_TUPLE_DATA_OFFSET;
-	MinimalTuple tuple = (MinimalTuple) readtup_alloc(state, tapenum, tuplen);
+	MinimalTuple tuple = (MinimalTuple) readtup_alloc(state, tuplen);
 	char	   *tupbody = (char *) tuple + MINIMAL_TUPLE_DATA_OFFSET;
 	HeapTupleData htup;
 
@@ -4132,12 +3718,6 @@ readtup_heap(Tuplesortstate *state, SortTuple *stup,
 								&stup->isnull1);
 }
 
-static void
-movetup_heap(void *dest, void *src, unsigned int len)
-{
-	memmove(dest, src, len);
-}
-
 /*
  * Routines specialized for the CLUSTER case (HeapTuple data, with
  * comparisons per a btree index definition)
@@ -4344,8 +3924,11 @@ writetup_cluster(Tuplesortstate *state, int tapenum, SortTuple *stup)
 		LogicalTapeWrite(state->tapeset, tapenum,
 						 &tuplen, sizeof(tuplen));
 
-	FREEMEM(state, GetMemoryChunkSpace(tuple));
-	heap_freetuple(tuple);
+	if (!state->batchUsed)
+	{
+		FREEMEM(state, GetMemoryChunkSpace(tuple));
+		heap_freetuple(tuple);
+	}
 }
 
 static void
@@ -4354,7 +3937,6 @@ readtup_cluster(Tuplesortstate *state, SortTuple *stup,
 {
 	unsigned int t_len = tuplen - sizeof(ItemPointerData) - sizeof(int);
 	HeapTuple	tuple = (HeapTuple) readtup_alloc(state,
-												  tapenum,
 												  t_len + HEAPTUPLESIZE);
 
 	/* Reconstruct the HeapTupleData header */
@@ -4379,19 +3961,6 @@ readtup_cluster(Tuplesortstate *state, SortTuple *stup,
 									&stup->isnull1);
 }
 
-static void
-movetup_cluster(void *dest, void *src, unsigned int len)
-{
-	HeapTuple	tuple;
-
-	memmove(dest, src, len);
-
-	/* Repoint the HeapTupleData header */
-	tuple = (HeapTuple) dest;
-	tuple->t_data = (HeapTupleHeader) ((char *) tuple + HEAPTUPLESIZE);
-}
-
-
 /*
  * Routines specialized for IndexTuple case
  *
@@ -4659,8 +4228,11 @@ writetup_index(Tuplesortstate *state, int tapenum, SortTuple *stup)
 		LogicalTapeWrite(state->tapeset, tapenum,
 						 (void *) &tuplen, sizeof(tuplen));
 
-	FREEMEM(state, GetMemoryChunkSpace(tuple));
-	pfree(tuple);
+	if (!state->batchUsed)
+	{
+		FREEMEM(state, GetMemoryChunkSpace(tuple));
+		pfree(tuple);
+	}
 }
 
 static void
@@ -4668,7 +4240,7 @@ readtup_index(Tuplesortstate *state, SortTuple *stup,
 			  int tapenum, unsigned int len)
 {
 	unsigned int tuplen = len - sizeof(unsigned int);
-	IndexTuple	tuple = (IndexTuple) readtup_alloc(state, tapenum, tuplen);
+	IndexTuple	tuple = (IndexTuple) readtup_alloc(state, tuplen);
 
 	LogicalTapeReadExact(state->tapeset, tapenum,
 						 tuple, tuplen);
@@ -4683,12 +4255,6 @@ readtup_index(Tuplesortstate *state, SortTuple *stup,
 								 &stup->isnull1);
 }
 
-static void
-movetup_index(void *dest, void *src, unsigned int len)
-{
-	memmove(dest, src, len);
-}
-
 /*
  * Routines specialized for DatumTuple case
  */
@@ -4755,7 +4321,7 @@ writetup_datum(Tuplesortstate *state, int tapenum, SortTuple *stup)
 		LogicalTapeWrite(state->tapeset, tapenum,
 						 (void *) &writtenlen, sizeof(writtenlen));
 
-	if (stup->tuple)
+	if (!state->batchUsed && stup->tuple)
 	{
 		FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
 		pfree(stup->tuple);
@@ -4785,7 +4351,7 @@ readtup_datum(Tuplesortstate *state, SortTuple *stup,
 	}
 	else
 	{
-		void	   *raddr = readtup_alloc(state, tapenum, tuplen);
+		void	   *raddr = readtup_alloc(state, tuplen);
 
 		LogicalTapeReadExact(state->tapeset, tapenum,
 							 raddr, tuplen);
@@ -4799,12 +4365,6 @@ readtup_datum(Tuplesortstate *state, SortTuple *stup,
 							 &tuplen, sizeof(tuplen));
 }
 
-static void
-movetup_datum(void *dest, void *src, unsigned int len)
-{
-	memmove(dest, src, len);
-}
-
 /*
  * Convenience routine to free a tuple previously loaded into sort memory
  */
diff --git a/src/include/utils/logtape.h b/src/include/utils/logtape.h
index fa1e992..03d0a6f 100644
--- a/src/include/utils/logtape.h
+++ b/src/include/utils/logtape.h
@@ -39,6 +39,7 @@ extern bool LogicalTapeSeek(LogicalTapeSet *lts, int tapenum,
 				long blocknum, int offset);
 extern void LogicalTapeTell(LogicalTapeSet *lts, int tapenum,
 				long *blocknum, int *offset);
+extern void LogicalTapeAssignReadBufferSize(LogicalTapeSet *lts, int tapenum, size_t bufsize);
 extern long LogicalTapeSetBlocks(LogicalTapeSet *lts);
 
 #endif   /* LOGTAPE_H */
-- 
2.9.3

#27Peter Geoghegan
pg@heroku.com
In reply to: Heikki Linnakangas (#26)
Re: Tuplesort merge pre-reading

On Mon, Sep 12, 2016 at 12:07 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Here's a fixed version. I'll go through Peter's comments and address those,
but I don't think there was anything there that should affect performance
much, so I think you can proceed with your benchmarking with this version.
(You'll also need to turn off assertions for that!)

I agree that it's unlikely that addressing any of my feedback will
result in any major change to performance.

--
Peter Geoghegan

#28Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Peter Geoghegan (#17)
1 attachment(s)
Re: Tuplesort merge pre-reading

Addressed all your comments one way or another, new patch attached.
Comments on some specific points below:

On 09/12/2016 01:13 AM, Peter Geoghegan wrote:

Other things I noticed:

* You should probably point out that typically, access to batch memory
will still be sequential, despite your block-based scheme. The
preloading will now more or less make that the normal case. Any
fragmentation will now be essentially in memory, not on disk, which is
a big win.

That's not true, the "buffers" in batch memory are not accessed
sequentially. When we pull the next tuple from a tape, we store it in
the next free buffer. Usually, that buffer was used to hold the previous
tuple that was returned from gettuple(), and was just added to the free
list.

It's still quite cache-friendly, though, because we only need a small
number of slots (one for each tape).
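
To make the reuse pattern described above concrete, here is a minimal, self-contained sketch of a fixed-size buffer arena with a LIFO free list. It is not the patch's code: `Arena`, `Slot`, `SLOT_SIZE` and the `arena_*` functions are hypothetical names, and `malloc()`/`free()` stand in for palloc()/pfree(). Only the idea is taken from the thread: small tuples take the next free buffer, larger ones fall back to a separate allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SLOT_SIZE 1024			/* hypothetical; plays the role of MERGETUPLEBUFFER_SIZE */

/* A free slot stores the link to the next free slot in its own bytes. */
typedef union Slot
{
	union Slot *nextfree;
	char		data[SLOT_SIZE];
} Slot;

typedef struct Arena
{
	Slot	   *slots;			/* one contiguous allocation of nslots slots */
	size_t		nslots;
	Slot	   *freehead;		/* head of the LIFO free list, NULL if empty */
} Arena;

/* Allocate the arena and thread all slots onto the free list. */
static int
arena_init(Arena *a, size_t nslots)
{
	size_t		i;

	assert(nslots > 0);
	a->slots = malloc(nslots * sizeof(Slot));
	if (a->slots == NULL)
		return -1;
	a->nslots = nslots;
	for (i = 0; i + 1 < nslots; i++)
		a->slots[i].nextfree = &a->slots[i + 1];
	a->slots[nslots - 1].nextfree = NULL;
	a->freehead = &a->slots[0];
	return 0;
}

/* Does this pointer lie inside the arena's contiguous allocation? */
static int
arena_owns(const Arena *a, const void *p)
{
	uintptr_t	addr = (uintptr_t) p;

	return addr >= (uintptr_t) a->slots &&
		addr < (uintptr_t) (a->slots + a->nslots);
}

/* Small requests pop a slot; oversized ones (or an empty list) use malloc. */
static void *
arena_alloc(Arena *a, size_t len)
{
	Slot	   *s;

	if (len > SLOT_SIZE || a->freehead == NULL)
		return malloc(len);
	s = a->freehead;
	a->freehead = s->nextfree;
	return s;
}

/* Push a slot back on the free list, or free a fallback allocation. */
static void
arena_release(Arena *a, void *p)
{
	if (arena_owns(a, p))
	{
		Slot	   *s = p;

		s->nextfree = a->freehead;
		a->freehead = s;
	}
	else
		free(p);
}
```

Because the list is LIFO, the slot released most recently is the first one handed out again, which matches the observation that the next tuple usually lands in the buffer that held the previously returned tuple.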

* I think you should move "bool *mergeactive; /* active input run
source? */" within Tuplesortstate to be next to the other batch memory
stuff. No point in having separate merge and batch "sections" there
anymore.

Hmm. I think I prefer to keep the memory management stuff in a separate
section. While it's true that it's currently only used during merging,
it's not hard to imagine using it when building the initial runs, for
example. Except for replacement selection, the pattern for building the
runs is: add a bunch of tuples, sort, flush them out. It would be
straightforward to use one large chunk of memory to hold all the tuples.
I'm not going to do that now, but I think keeping the memory management
stuff separate from merge-related state makes sense.

* I think that you need to comment on why state->tuplecontext is not
used for batch memory now. It is still useful, for multiple merge
passes, but the situation has notably changed for it.

Hmm. We don't actually use it after the initial runs at all anymore. I
added a call to destroy it in mergeruns().

Now that we use the batch memory buffers for allocations < 1 kB (I
pulled that number out of a hat, BTW), and we only need one allocation
per tape (plus one), there's not much risk of fragmentation.

On Sun, Sep 11, 2016 at 3:13 PM, Peter Geoghegan <pg@heroku.com> wrote:

* Doesn't this code need to call MemoryContextAllocHuge() rather than palloc()?:

@@ -709,18 +765,19 @@ LogicalTapeRewind(LogicalTapeSet *lts, int tapenum, bool forWrite)
Assert(lt->frozen);
datablocknum = ltsRewindFrozenIndirectBlock(lts, lt->indirect);
}
+
+       /* Allocate a read buffer */
+       if (lt->buffer)
+           pfree(lt->buffer);
+       lt->buffer = palloc(lt->read_buffer_size);
+       lt->buffer_size = lt->read_buffer_size;

Of course, when you do that you're going to have to make the new
"buffer_size" and "read_buffer_size" fields of type "Size" (or,
possibly, "int64", to match tuplesort.c's own buffer sizing variables
ever since Noah added MaxAllocSizeHuge). Ditto for the existing "pos"
and "nbytes" fields next to "buffer_size" within the struct
LogicalTape, I think. ISTM that logtape.c blocknums can remain of type
"long", though, since that reflects an existing hardly-relevant
limitation that you're not making any worse.

True. I fixed that by putting a MaxAllocSize cap on the buffer size
instead. I doubt that doing > 1 GB of read-ahead of a single tape will
do any good.

I wonder if we should actually have a smaller cap there. Even 1 GB seems
excessive. Might be better to start the merging sooner, rather than wait
for the read of 1 GB to complete. The point of the OS readahead is that
the OS will do that for us, in the background. And other processes might
have better use for the memory anyway.
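
As an illustration of the cap discussed above, the following sketch clamps a requested read-buffer size. It is written under assumptions and is not the patch's code: `clamp_read_buffer` is a hypothetical helper, `BLOCK_SIZE` stands in for BLCKSZ, `MAX_ALLOC` stands in for MaxAllocSize, and rounding down to whole blocks is an assumption made here so that the buffer always holds an integral number of blocks.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 8192						/* stand-in for BLCKSZ */
#define MAX_ALLOC  ((size_t) 0x3fffffff)	/* stand-in for MaxAllocSize (1 GB - 1) */

/*
 * Clamp a requested read-buffer size (hypothetical helper):
 * - never smaller than one block,
 * - never larger than the allocation cap,
 * - always a whole number of blocks (an assumption of this sketch).
 */
static size_t
clamp_read_buffer(int64_t requested)
{
	size_t		size;

	if (requested < BLOCK_SIZE)
		return BLOCK_SIZE;
	if (requested > (int64_t) MAX_ALLOC)
		requested = (int64_t) MAX_ALLOC;
	size = (size_t) requested;
	size -= size % BLOCK_SIZE;	/* round down to a block boundary */
	return size;
}
```

A smaller cap, as suggested above, would only require changing the `MAX_ALLOC` stand-in; the clamping logic itself stays the same.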

* It couldn't hurt to make this code paranoid about LACKMEM() becoming
true, which will cause havoc (we saw this recently in 9.6; a patch of
mine to fix that just went in):

+   /*
+    * Use all the spare memory we have available for read buffers. Divide it
+    * evenly among all the tapes.
+    */
+   avail_blocks = state->availMem / BLCKSZ;
+   per_tape = avail_blocks / maxTapes;
+   cutoff = avail_blocks % maxTapes;
+   if (per_tape == 0)
+   {
+       per_tape = 1;
+       cutoff = 0;
+   }
+   for (tapenum = 0; tapenum < maxTapes; tapenum++)
+   {
+       LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
+                                       (per_tape + (tapenum < cutoff ? 1 : 0)) * BLCKSZ);
+   }

In other words, we really don't want availMem to become < 0, since
it's int64, but a derived value is passed to
LogicalTapeAssignReadBufferSize() as an argument of type "Size". Now,
if LACKMEM() did happen it would be a bug anyway, but I recommend
defensive code also be added. Per grow_memtuples(), "We need to be
sure that we do not cause LACKMEM to become true, else the space
management algorithm will go nuts". Let's be sure that we get that
right, especially since grow_memtuples() will not actually have the
chance to save us here, as we saw recently. (If there is a bug along
these lines, let's at least make the right "can't happen" error
complaint to the user when it pops up.)

Hmm. We don't really need the availMem accounting at all, after we have
started merging. There is nothing we can do to free memory if we run
out, and we use fairly little memory anyway. But yes, better safe than
sorry. I tried to clarify the comments on that.
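
The defensive check being discussed could look roughly like the sketch below. It mirrors the arithmetic of the quoted hunk (whole blocks per tape, with the remainder spread over the first tapes) and adds an explicit guard against a negative budget before anything is converted to an unsigned size. `split_read_buffers` and `BLOCK_SIZE` are hypothetical names for illustration; the real code would raise an error rather than return -1.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 8192			/* stand-in for BLCKSZ */

/*
 * Divide avail_mem bytes among ntapes read buffers, in whole blocks.
 * Returns 0 on success, or -1 if the budget is negative, which is the
 * "can't happen" case that the caller should report instead of letting
 * a negative value be converted to a huge unsigned size.
 */
static int
split_read_buffers(int64_t avail_mem, int ntapes, size_t *sizes)
{
	int64_t		avail_blocks;
	int64_t		per_tape;
	int64_t		cutoff;
	int			i;

	if (avail_mem < 0 || ntapes <= 0)
		return -1;

	avail_blocks = avail_mem / BLOCK_SIZE;
	per_tape = avail_blocks / ntapes;
	cutoff = avail_blocks % ntapes;	/* first 'cutoff' tapes get one extra block */
	if (per_tape == 0)
	{
		/* every tape needs at least one block, even if that overshoots */
		per_tape = 1;
		cutoff = 0;
	}

	for (i = 0; i < ntapes; i++)
		sizes[i] = (size_t) (per_tape + (i < cutoff ? 1 : 0)) * BLOCK_SIZE;
	return 0;
}
```

With ten blocks and three tapes, for example, the first tape gets four blocks and the other two get three each.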

* It looks like your patch makes us less eager about freeing per-tape
batch memory, now held as preload buffer space within logtape.c.

ISTM that there should be some way to have the "tape exhausted" code
path within tuplesort_gettuple_common() (as well as the similar spot
within mergeonerun()) instruct logtape.c that we're done with that
tape. In other words, when mergeprereadone() (now renamed to
mergereadnext()) detects the tape is exhausted, it should have
logtape.c free its huge tape buffer immediately. Think about cases
where input is presorted, and earlier tapes can be freed quite early
on. It's worth keeping that around (you removed the old way that this
happened, through mergebatchfreetape()).

OK. I solved that by calling LogicalTapeRewind(), when we're done
reading a tape. Rewinding a tape has the side-effect of freeing the
buffer. I was going to put that into mergereadnext(), but it turns out
that it's tricky to figure out if there are any more runs on the same
tape, because we have the "actual" tape number there, but the tp_runs is
indexed by "logical" tape number. So I put the rewind calls at the end
of mergeruns(), and in TSS_FINALMERGE processing, instead. It means that
we don't free the buffers quite as early as we could, but I think this
is good enough.

On Sun, Sep 11, 2016 at 3:13 PM, Peter Geoghegan <pg@heroku.com> wrote:

+   for (tapenum = 0; tapenum < maxTapes; tapenum++)
+   {
+       LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
+                                       (per_tape + (tapenum < cutoff ? 1 : 0)) * BLCKSZ);
+   }

Spotted another issue with this code just now. Shouldn't it be based
on state->tapeRange? You don't want the destTape to get memory, since
you don't use batch memory for tapes that are written to (and final
on-the-fly merges don't use their destTape at all).

logtape.c will only actually allocate the memory when reading.

(Looks again...)

Wait, you're using a local variable maxTapes here, which potentially
differs from state->maxTapes:

+   /*
+    * If we had fewer runs than tapes, refund buffers for tapes that were never
+    * allocated.
+    */
+   maxTapes = state->maxTapes;
+   if (state->currentRun < maxTapes)
+   {
+       FREEMEM(state, (maxTapes - state->currentRun) * TAPE_BUFFER_OVERHEAD);
+       maxTapes = state->currentRun;
+   }

I find this confusing, and think it's probably buggy. I don't think
you should have a local variable called maxTapes that you modify at
all, since state->maxTapes is supposed to not change once established.

I changed that so that it does actually change state->maxTapes. I
considered having a separate numTapes field, that can be smaller than
maxTapes, but we don't need the original maxTapes value after that point
anymore, so it would've been just pro forma to track them separately. I
hope the comment now explains that better.

You can't use state->currentRun like that, either, I think, because
it's the high watermark number of runs (quicksorted runs), not runs
after any particular merge phase, where we end up with fewer runs as
they're merged (we must also consider dummy runs to get this) -- we
want something like activeTapes. cf. the code you removed for the
beginmerge() finalMergeBatch case. Of course, activeTapes will vary if
there are multiple merge passes, which suggests all this code really
has no business being in mergeruns() (it should instead be in
beginmerge(), or code that beginmerge() reliably calls).

Hmm, yes, using currentRun here is wrong. It needs to be "currentRun +
1", because we need one more tape than there are runs, to hold the output.

Note that I'm not re-allocating the read buffers depending on which
tapes are used in the current merge pass. I'm just dividing up the
memory among all tapes. Now that the pre-reading is done in logtape.c,
when we reach the end of a run on a tape, we will already have data from
the next run on the same tape in the read buffer.

Immediately afterwards, you do this:

+   /* Initialize the merge tuple buffer arena.  */
+   state->batchMemoryBegin = palloc((maxTapes + 1) * MERGETUPLEBUFFER_SIZE);
+   state->batchMemoryEnd = state->batchMemoryBegin + (maxTapes + 1) * MERGETUPLEBUFFER_SIZE;
+   state->freeBufferHead = (MergeTupleBuffer *) state->batchMemoryBegin;
+   USEMEM(state, (maxTapes + 1) * MERGETUPLEBUFFER_SIZE);

The fact that you size the buffer based on "maxTapes + 1" also
suggests a problem. I think that the code looks like this because it
must instruct logtape.c that the destTape tape requires some buffer
(iff there is to be a non-final merge). Is that right? I hope that you
don't give the typically unused destTape tape a full share of batch
memory all the time (the same applies to any other
inactive-at-final-merge tapes).

Ah, no, the "+ 1" comes from the need to hold the tuple that we last
returned to the caller in tuplesort_gettuple, until the next call. See
lastReturnedTuple. I tried to clarify the comments on that.
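
The slot accounting behind the "+ 1" can be checked with a small simulation. This is only a sketch under assumptions: `MergeSim` and the `sim_*` functions are hypothetical, tuples are reduced to counters, and tapes are drained in index order rather than sort order, since ordering does not affect how many buffers are in use. It assumes the sequence described in the thread: release the previously returned tuple, hand the next one to the caller, then read a replacement from the same tape.

```c
#include <assert.h>

#define MAX_TAPES 8				/* arbitrary limit for this sketch */

typedef struct MergeSim
{
	int			ntapes;
	int			remaining[MAX_TAPES];	/* tuples not yet read from each tape */
	int			in_heap[MAX_TAPES];		/* 1 if the tape's current tuple holds a slot */
	int			free_slots;				/* slots not currently holding a tuple */
	int			holding_last;			/* 1 if the caller still holds the last tuple */
} MergeSim;

/* Read the next tuple from a tape into a free slot, if the tape has one. */
static void
sim_load(MergeSim *s, int tape)
{
	if (s->remaining[tape] > 0)
	{
		assert(s->free_slots > 0);	/* would fire if the arena were too small */
		s->remaining[tape]--;
		s->free_slots--;
		s->in_heap[tape] = 1;
	}
}

/* Start a merge with ntapes + 1 slots and one tuple loaded per tape. */
static void
sim_begin(MergeSim *s, int ntapes, int tuples_per_tape)
{
	int			t;

	assert(ntapes > 0 && ntapes <= MAX_TAPES);
	s->ntapes = ntapes;
	s->free_slots = ntapes + 1;
	s->holding_last = 0;
	for (t = 0; t < ntapes; t++)
	{
		s->remaining[t] = tuples_per_tape;
		s->in_heap[t] = 0;
		sim_load(s, t);
	}
}

/* Return 1 if a tuple was handed to the caller, 0 once all tapes are empty. */
static int
sim_gettuple(MergeSim *s)
{
	int			t;

	/* The previously returned tuple is only recycled now, on the next call. */
	if (s->holding_last)
	{
		s->free_slots++;
		s->holding_last = 0;
	}
	for (t = 0; t < s->ntapes; t++)
	{
		if (s->in_heap[t])
		{
			/* The tuple leaves the heap, but its slot stays with the caller. */
			s->in_heap[t] = 0;
			s->holding_last = 1;
			sim_load(s, t);		/* replacement needs a slot right away */
			return 1;
		}
	}
	return 0;
}
```

At the moment the replacement is read, up to `ntapes - 1` tuples remain in the heap, one is held by the caller and one is being loaded, so `ntapes + 1` slots are enough and `ntapes` would not be.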

Thanks for the thorough review! Let me know how this looks now.

- Heikki

Attachments:

0001-Change-the-way-pre-reading-in-external-sort-s-merge-2.patch (text/x-patch)
From 19c88b74bd0302185bd24ce1e0abcef796db5afd Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 14 Sep 2016 17:29:11 +0300
Subject: [PATCH 1/1] Change the way pre-reading in external sort's merge phase
 works.

Don't pre-read tuples into SortTuple slots during merge. Instead, use the
memory for larger read buffers in logtape.c. We're doing the same number
of READTUP() calls either way, but managing the pre-read SortTuple slots
is much more complicated. Also, the on-tape representation is more compact
than SortTuples, so we can fit more pre-read tuples into the same amount
of memory this way. And we have better cache-locality, when we use just a
small number of SortTuple slots.

Now that we only hold one tuple from each tape in the SortTuple slots, we
can greatly simplify the "batch memory" management. We now maintain a
small set of fixed-sized buffers, to hold the tuples, and fall back to
palloc() for larger tuples. We use this method during all merge phases,
not just the final merge, and also when randomAccess is requested, and
also in the TSS_SORTEDONTAPE. In other words, it's used whenever we do
an external sort.

Reviewed by Peter Geoghegan.
---
 src/backend/utils/sort/logtape.c   |  153 ++++-
 src/backend/utils/sort/tuplesort.c | 1130 ++++++++++++------------------------
 src/include/utils/logtape.h        |    1 +
 3 files changed, 492 insertions(+), 792 deletions(-)

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 7745207..4152da1 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -52,12 +52,17 @@
  * not clear this helps much, but it can't hurt.  (XXX perhaps a LIFO
  * policy for free blocks would be better?)
  *
+ * To further make the I/Os more sequential, we can use a larger buffer
+ * when reading, and read multiple blocks from the same tape in one go,
+ * whenever the buffer becomes empty. LogicalTapeAssignReadBufferSize()
+ * can be used to set the size of the read buffer.
+ *
  * To support the above policy of writing to the lowest free block,
  * ltsGetFreeBlock sorts the list of free block numbers into decreasing
  * order each time it is asked for a block and the list isn't currently
  * sorted.  This is an efficient way to handle it because we expect cycles
  * of releasing many blocks followed by re-using many blocks, due to
- * tuplesort.c's "preread" behavior.
+ * the larger read buffer.
  *
  * Since all the bookkeeping and buffer memory is allocated with palloc(),
  * and the underlying file(s) are made with OpenTemporaryFile, all resources
@@ -79,6 +84,7 @@
 
 #include "storage/buffile.h"
 #include "utils/logtape.h"
+#include "utils/memutils.h"
 
 /*
  * Block indexes are "long"s, so we can fit this many per indirect block.
@@ -131,9 +137,18 @@ typedef struct LogicalTape
 	 * reading.
 	 */
 	char	   *buffer;			/* physical buffer (separately palloc'd) */
+	int			buffer_size;	/* allocated size of the buffer */
 	long		curBlockNumber; /* this block's logical blk# within tape */
 	int			pos;			/* next read/write position in buffer */
 	int			nbytes;			/* total # of valid bytes in buffer */
+
+	/*
+	 * Desired buffer size to use when reading.  To keep things simple, we
+	 * use a single-block buffer when writing, or when reading a frozen
+	 * tape.  But when we are reading and will only read forwards, we
+	 * allocate a larger buffer, determined by read_buffer_size.
+	 */
+	int			read_buffer_size;
 } LogicalTape;
 
 /*
@@ -228,6 +243,53 @@ ltsReadBlock(LogicalTapeSet *lts, long blocknum, void *buffer)
 }
 
 /*
+ * Read as many blocks as we can into the per-tape buffer.
+ *
+ * The caller can specify the next physical block number to read, in
+ * datablocknum, or -1 to fetch the next block number from the indirect block.
+ * If datablocknum == -1, the caller must've already set curBlockNumber.
+ *
+ * Returns true if anything was read, 'false' on EOF.
+ */
+static bool
+ltsReadFillBuffer(LogicalTapeSet *lts, LogicalTape *lt, long datablocknum)
+{
+	lt->pos = 0;
+	lt->nbytes = 0;
+
+	do
+	{
+		/* Fetch next block number (unless provided by caller) */
+		if (datablocknum == -1)
+		{
+			datablocknum = ltsRecallNextBlockNum(lts, lt->indirect, lt->frozen);
+			if (datablocknum == -1L)
+				break;			/* EOF */
+			lt->curBlockNumber++;
+		}
+
+		/* Read the block */
+		ltsReadBlock(lts, datablocknum, (void *) (lt->buffer + lt->nbytes));
+		if (!lt->frozen)
+			ltsReleaseBlock(lts, datablocknum);
+
+		if (lt->curBlockNumber < lt->numFullBlocks)
+			lt->nbytes += BLCKSZ;
+		else
+		{
+			/* EOF */
+			lt->nbytes += lt->lastBlockBytes;
+			break;
+		}
+
+		/* Advance to next block, if we have buffer space left */
+		datablocknum = -1;
+	} while (lt->nbytes < lt->buffer_size);
+
+	return (lt->nbytes > 0);
+}
+
+/*
  * qsort comparator for sorting freeBlocks[] into decreasing order.
  */
 static int
@@ -546,6 +608,8 @@ LogicalTapeSetCreate(int ntapes)
 		lt->numFullBlocks = 0L;
 		lt->lastBlockBytes = 0;
 		lt->buffer = NULL;
+		lt->buffer_size = 0;
+		lt->read_buffer_size = BLCKSZ;
 		lt->curBlockNumber = 0L;
 		lt->pos = 0;
 		lt->nbytes = 0;
@@ -628,7 +692,10 @@ LogicalTapeWrite(LogicalTapeSet *lts, int tapenum,
 
 	/* Allocate data buffer and first indirect block on first write */
 	if (lt->buffer == NULL)
+	{
 		lt->buffer = (char *) palloc(BLCKSZ);
+		lt->buffer_size = BLCKSZ;
+	}
 	if (lt->indirect == NULL)
 	{
 		lt->indirect = (IndirectBlock *) palloc(sizeof(IndirectBlock));
@@ -636,6 +703,7 @@ LogicalTapeWrite(LogicalTapeSet *lts, int tapenum,
 		lt->indirect->nextup = NULL;
 	}
 
+	Assert(lt->buffer_size == BLCKSZ);
 	while (size > 0)
 	{
 		if (lt->pos >= BLCKSZ)
@@ -709,18 +777,19 @@ LogicalTapeRewind(LogicalTapeSet *lts, int tapenum, bool forWrite)
 			Assert(lt->frozen);
 			datablocknum = ltsRewindFrozenIndirectBlock(lts, lt->indirect);
 		}
+
+		/* Allocate a read buffer */
+		if (lt->buffer)
+			pfree(lt->buffer);
+		lt->buffer = palloc(lt->read_buffer_size);
+		lt->buffer_size = lt->read_buffer_size;
+
 		/* Read the first block, or reset if tape is empty */
 		lt->curBlockNumber = 0L;
 		lt->pos = 0;
 		lt->nbytes = 0;
 		if (datablocknum != -1L)
-		{
-			ltsReadBlock(lts, datablocknum, (void *) lt->buffer);
-			if (!lt->frozen)
-				ltsReleaseBlock(lts, datablocknum);
-			lt->nbytes = (lt->curBlockNumber < lt->numFullBlocks) ?
-				BLCKSZ : lt->lastBlockBytes;
-		}
+			ltsReadFillBuffer(lts, lt, datablocknum);
 	}
 	else
 	{
@@ -754,6 +823,13 @@ LogicalTapeRewind(LogicalTapeSet *lts, int tapenum, bool forWrite)
 		lt->curBlockNumber = 0L;
 		lt->pos = 0;
 		lt->nbytes = 0;
+
+		if (lt->buffer)
+		{
+			pfree(lt->buffer);
+			lt->buffer = NULL;
+			lt->buffer_size = 0;
+		}
 	}
 }
 
@@ -779,20 +855,8 @@ LogicalTapeRead(LogicalTapeSet *lts, int tapenum,
 		if (lt->pos >= lt->nbytes)
 		{
 			/* Try to load more data into buffer. */
-			long		datablocknum = ltsRecallNextBlockNum(lts, lt->indirect,
-															 lt->frozen);
-
-			if (datablocknum == -1L)
+			if (!ltsReadFillBuffer(lts, lt, -1))
 				break;			/* EOF */
-			lt->curBlockNumber++;
-			lt->pos = 0;
-			ltsReadBlock(lts, datablocknum, (void *) lt->buffer);
-			if (!lt->frozen)
-				ltsReleaseBlock(lts, datablocknum);
-			lt->nbytes = (lt->curBlockNumber < lt->numFullBlocks) ?
-				BLCKSZ : lt->lastBlockBytes;
-			if (lt->nbytes <= 0)
-				break;			/* EOF (possible here?) */
 		}
 
 		nthistime = lt->nbytes - lt->pos;
@@ -842,6 +906,22 @@ LogicalTapeFreeze(LogicalTapeSet *lts, int tapenum)
 	lt->writing = false;
 	lt->frozen = true;
 	datablocknum = ltsRewindIndirectBlock(lts, lt->indirect, true);
+
+	/*
+	 * The seek and backspace functions assume a single block read buffer.
+	 * That's OK with current usage. A larger buffer is helpful to make the
+	 * read pattern of the backing file look more sequential to the OS, when
+	 * we're reading from multiple tapes. But at the end of a sort, when a
+	 * tape is frozen, we only read from a single tape anyway.
+	 */
+	if (!lt->buffer || lt->buffer_size != BLCKSZ)
+	{
+		if (lt->buffer)
+			pfree(lt->buffer);
+		lt->buffer = palloc(BLCKSZ);
+		lt->buffer_size = BLCKSZ;
+	}
+
 	/* Read the first block, or reset if tape is empty */
 	lt->curBlockNumber = 0L;
 	lt->pos = 0;
@@ -875,6 +955,7 @@ LogicalTapeBackspace(LogicalTapeSet *lts, int tapenum, size_t size)
 	Assert(tapenum >= 0 && tapenum < lts->nTapes);
 	lt = &lts->tapes[tapenum];
 	Assert(lt->frozen);
+	Assert(lt->buffer_size == BLCKSZ);
 
 	/*
 	 * Easy case for seek within current block.
@@ -941,6 +1022,7 @@ LogicalTapeSeek(LogicalTapeSet *lts, int tapenum,
 	lt = &lts->tapes[tapenum];
 	Assert(lt->frozen);
 	Assert(offset >= 0 && offset <= BLCKSZ);
+	Assert(lt->buffer_size == BLCKSZ);
 
 	/*
 	 * Easy case for seek within current block.
@@ -1002,6 +1084,10 @@ LogicalTapeTell(LogicalTapeSet *lts, int tapenum,
 
 	Assert(tapenum >= 0 && tapenum < lts->nTapes);
 	lt = &lts->tapes[tapenum];
+
+	/* With a larger buffer, 'pos' wouldn't be the same as offset within page */
+	Assert(lt->buffer_size == BLCKSZ);
+
 	*blocknum = lt->curBlockNumber;
 	*offset = lt->pos;
 }
@@ -1014,3 +1100,28 @@ LogicalTapeSetBlocks(LogicalTapeSet *lts)
 {
 	return lts->nFileBlocks;
 }
+
+/*
+ * Set buffer size to use, when reading from given tape.
+ */
+void
+LogicalTapeAssignReadBufferSize(LogicalTapeSet *lts, int tapenum, size_t avail_mem)
+{
+	LogicalTape *lt;
+
+	Assert(tapenum >= 0 && tapenum < lts->nTapes);
+	lt = &lts->tapes[tapenum];
+
+	/*
+	 * The buffer size must be a multiple of BLCKSZ, so round the given value
+	 * down to the nearest multiple of BLCKSZ. Make sure we have at least one
+	 * page. Also, don't go above MaxAllocSize, to avoid erroring out. A
+	 * multi-gigabyte buffer is unlikely to be helpful, anyway.
+	 */
+	if (avail_mem < BLCKSZ)
+		avail_mem = BLCKSZ;
+	if (avail_mem > MaxAllocSize)
+		avail_mem = MaxAllocSize;
+	avail_mem -= avail_mem % BLCKSZ;
+	lt->read_buffer_size = avail_mem;
+}
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index d600670..131dbef 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -14,13 +14,13 @@
  * sorting algorithm.  Historically, we divided the input into sorted runs
  * using replacement selection, in the form of a priority tree implemented
  * as a heap (essentially his Algorithm 5.2.3H -- although that strategy is
- * often avoided altogether), but that can now only happen first the first
+ * often avoided altogether), but now we only do that for the first
  * run.  We merge the runs using polyphase merge, Knuth's Algorithm
  * 5.4.2D.  The logical "tapes" used by Algorithm D are implemented by
  * logtape.c, which avoids space wastage by recycling disk space as soon
  * as each block is read from its "tape".
  *
- * We never form the initial runs using Knuth's recommended replacement
+ * We do not form the initial runs using Knuth's recommended replacement
  * selection data structure (Algorithm 5.4.1R), because it uses a fixed
  * number of records in memory at all times.  Since we are dealing with
  * tuples that may vary considerably in size, we want to be able to vary
@@ -36,11 +36,11 @@
  *
  * In PostgreSQL 9.6, a heap (based on Knuth's Algorithm H, with some small
  * customizations) is only used with the aim of producing just one run,
- * thereby avoiding all merging.  Only the first run can use replacement
+ * thereby avoiding all merging.  Only the first run uses replacement
  * selection, which is why there are now only two possible valid run
  * numbers, and why heapification is customized to not distinguish between
  * tuples in the second run (those will be quicksorted).  We generally
- * prefer a simple hybrid sort-merge strategy, where runs are sorted in much
+ * prefer a simple hybrid sort-merge strategy, where runs are sorted in
  * the same way as the entire input of an internal sort is sorted (using
  * qsort()).  The replacement_sort_tuples GUC controls the limited remaining
  * use of replacement selection for the first run.
@@ -74,7 +74,7 @@
  * the merge is complete.  The basic merge algorithm thus needs very little
  * memory --- only M tuples for an M-way merge, and M is constrained to a
  * small number.  However, we can still make good use of our full workMem
- * allocation by pre-reading additional tuples from each source tape.  Without
+ * allocation, to pre-read blocks from each source tape.  Without
  * prereading, our access pattern to the temporary file would be very erratic;
  * on average we'd read one block from each of M source tapes during the same
  * time that we're writing M blocks to the output tape, so there is no
@@ -84,10 +84,10 @@
  * worse when it comes time to read that tape.  A straightforward merge pass
  * thus ends up doing a lot of waiting for disk seeks.  We can improve matters
  * by prereading from each source tape sequentially, loading about workMem/M
- * bytes from each tape in turn.  Then we run the merge algorithm, writing but
- * not reading until one of the preloaded tuple series runs out.  Then we
- * switch back to preread mode, fill memory again, and repeat.  This approach
- * helps to localize both read and write accesses.
+ * bytes from each tape in turn, and making the sequential blocks immediately
+ * available for reuse.  This approach helps to localize both read and write
+ * accesses.  The pre-reading is handled by logtape.c; we just tell it how
+ * much memory to use for the buffers.
  *
  * When the caller requests random access to the sort result, we form
  * the final sorted run on a logical tape which is then "frozen", so
@@ -162,7 +162,7 @@ bool		optimize_bounded_sort = true;
  * The objects we actually sort are SortTuple structs.  These contain
  * a pointer to the tuple proper (might be a MinimalTuple or IndexTuple),
  * which is a separate palloc chunk --- we assume it is just one chunk and
- * can be freed by a simple pfree() (except during final on-the-fly merge,
+ * can be freed by a simple pfree() (except during merge,
  * when memory is used in batch).  SortTuples also contain the tuple's
  * first key column in Datum/nullflag format, and an index integer.
  *
@@ -191,9 +191,8 @@ bool		optimize_bounded_sort = true;
  * it now only distinguishes RUN_FIRST and HEAP_RUN_NEXT, since replacement
  * selection is always abandoned after the first run; no other run number
  * should be represented here.  During merge passes, we re-use it to hold the
- * input tape number that each tuple in the heap was read from, or to hold the
- * index of the next tuple pre-read from the same tape in the case of pre-read
- * entries.  tupindex goes unused if the sort occurs entirely in memory.
+ * input tape number that each tuple in the heap was read from.  tupindex goes
+ * unused if the sort occurs entirely in memory.
  */
 typedef struct
 {
@@ -203,6 +202,20 @@ typedef struct
 	int			tupindex;		/* see notes above */
 } SortTuple;
 
+/*
+ * During merge, we use a pre-allocated set of fixed-size buffers to store
+ * tuples in, to avoid palloc/pfree overhead.
+ *
+ * 'nextfree' is valid when this chunk is in the free list. When in use, the
+ * buffer holds a tuple.
+ */
+#define MERGETUPLEBUFFER_SIZE 1024
+
+typedef union MergeTupleBuffer
+{
+	union MergeTupleBuffer *nextfree;
+	char		buffer[MERGETUPLEBUFFER_SIZE];
+} MergeTupleBuffer;
 
 /*
  * Possible states of a Tuplesort object.  These denote the states that
@@ -288,41 +301,28 @@ struct Tuplesortstate
 	/*
 	 * Function to write a stored tuple onto tape.  The representation of the
 	 * tuple on tape need not be the same as it is in memory; requirements on
-	 * the tape representation are given below.  After writing the tuple,
-	 * pfree() the out-of-line data (not the SortTuple struct!), and increase
-	 * state->availMem by the amount of memory space thereby released.
+	 * the tape representation are given below.  If !batchUsed, after writing
+	 * the tuple, pfree() the out-of-line data (not the SortTuple struct!),
+	 * and increase state->availMem by the amount of memory space thereby
+	 * released.
 	 */
 	void		(*writetup) (Tuplesortstate *state, int tapenum,
 										 SortTuple *stup);
 
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
-	 * the already-read length of the stored tuple.  Create a palloc'd copy,
-	 * initialize tuple/datum1/isnull1 in the target SortTuple struct, and
-	 * decrease state->availMem by the amount of memory space consumed. (See
-	 * batchUsed notes for details on how memory is handled when incremental
-	 * accounting is abandoned.)
+	 * the already-read length of the stored tuple.  The tuple is stored in
+	 * the batch memory arena, or is palloc'd, see readtup_alloc().
 	 */
 	void		(*readtup) (Tuplesortstate *state, SortTuple *stup,
 										int tapenum, unsigned int len);
 
 	/*
-	 * Function to move a caller tuple.  This is usually implemented as a
-	 * memmove() shim, but function may also perform additional fix-up of
-	 * caller tuple where needed.  Batch memory support requires the movement
-	 * of caller tuples from one location in memory to another.
-	 */
-	void		(*movetup) (void *dest, void *src, unsigned int len);
-
-	/*
 	 * This array holds the tuples now in sort memory.  If we are in state
 	 * INITIAL, the tuples are in no particular order; if we are in state
 	 * SORTEDINMEM, the tuples are in final sorted order; in states BUILDRUNS
 	 * and FINALMERGE, the tuples are organized in "heap" order per Algorithm
-	 * H.  (Note that memtupcount only counts the tuples that are part of the
-	 * heap --- during merge passes, memtuples[] entries beyond tapeRange are
-	 * never in the heap and are used to hold pre-read tuples.)  In state
-	 * SORTEDONTAPE, the array is not used.
+	 * H. In state SORTEDONTAPE, the array is not used.
 	 */
 	SortTuple  *memtuples;		/* array of SortTuple structs */
 	int			memtupcount;	/* number of tuples currently present */
@@ -332,12 +332,40 @@ struct Tuplesortstate
 	/*
 	 * Memory for tuples is sometimes allocated in batch, rather than
 	 * incrementally.  This implies that incremental memory accounting has
-	 * been abandoned.  Currently, this only happens for the final on-the-fly
-	 * merge step.  Large batch allocations can store tuples (e.g.
-	 * IndexTuples) without palloc() fragmentation and other overhead.
+	 * been abandoned.  Currently, this happens when we start merging.
+	 * Large batch allocations can store tuples (e.g. IndexTuples) without
+	 * palloc() fragmentation and other overhead.
+	 *
+	 * For the batch memory, we use one large allocation, divided into
+	 * MERGETUPLEBUFFER_SIZE chunks. The allocation is sized to hold
+	 * one chunk per tape, plus one additional chunk. We need that many
+	 * chunks to hold all the tuples kept in the heap during merge, plus
+	 * the one we have last returned from the sort.
+	 *
+	 * Initially, all the chunks are kept in a linked list, in freeBufferHead.
+	 * When a tuple is read from a tape, it is put into the next available
+	 * chunk, if it fits. If the tuple is larger than MERGETUPLEBUFFER_SIZE,
+	 * it is palloc'd instead.
+	 *
+	 * When we're done processing a tuple, we return the chunk back to the
+	 * free list, or pfree() if it was palloc'd. We know that a tuple was
+	 * allocated from the batch memory arena, if its pointer value is between
+	 * batchMemoryBegin and -End.
 	 */
 	bool		batchUsed;
 
+	char	   *batchMemoryBegin;	/* beginning of batch memory arena */
+	char	   *batchMemoryEnd;		/* end of batch memory arena */
+	MergeTupleBuffer *freeBufferHead;	/* head of free list */
+
+	/*
+	 * When we return a tuple to the caller in tuplesort_gettuple_XXX, that
+	 * came from a tape (that is, in TSS_SORTEDONTAPE or TSS_FINALMERGE modes),
+	 * we remember the tuple in 'lastReturnedTuple', so that we can recycle the
+	 * memory on next gettuple call.
+	 */
+	void	   *lastReturnedTuple;
+
 	/*
 	 * While building initial runs, this indicates if the replacement
 	 * selection strategy is in use.  When it isn't, then a simple hybrid
@@ -358,42 +386,11 @@ struct Tuplesortstate
 	 */
 
 	/*
-	 * These variables are only used during merge passes.  mergeactive[i] is
+	 * This variable is only used during merge passes.  mergeactive[i] is
 	 * true if we are reading an input run from (actual) tape number i and
-	 * have not yet exhausted that run.  mergenext[i] is the memtuples index
-	 * of the next pre-read tuple (next to be loaded into the heap) for tape
-	 * i, or 0 if we are out of pre-read tuples.  mergelast[i] similarly
-	 * points to the last pre-read tuple from each tape.  mergeavailslots[i]
-	 * is the number of unused memtuples[] slots reserved for tape i, and
-	 * mergeavailmem[i] is the amount of unused space allocated for tape i.
-	 * mergefreelist and mergefirstfree keep track of unused locations in the
-	 * memtuples[] array.  The memtuples[].tupindex fields link together
-	 * pre-read tuples for each tape as well as recycled locations in
-	 * mergefreelist. It is OK to use 0 as a null link in these lists, because
-	 * memtuples[0] is part of the merge heap and is never a pre-read tuple.
+	 * have not yet exhausted that run.
 	 */
 	bool	   *mergeactive;	/* active input run source? */
-	int		   *mergenext;		/* first preread tuple for each source */
-	int		   *mergelast;		/* last preread tuple for each source */
-	int		   *mergeavailslots;	/* slots left for prereading each tape */
-	int64	   *mergeavailmem;	/* availMem for prereading each tape */
-	int			mergefreelist;	/* head of freelist of recycled slots */
-	int			mergefirstfree; /* first slot never used in this merge */
-
-	/*
-	 * Per-tape batch state, when final on-the-fly merge consumes memory from
-	 * just a few large allocations.
-	 *
-	 * Aside from the general benefits of performing fewer individual retail
-	 * palloc() calls, this also helps make merging more cache efficient,
-	 * since each tape's tuples must naturally be accessed sequentially (in
-	 * sorted order).
-	 */
-	int64		spacePerTape;	/* Space (memory) for tuples (not slots) */
-	char	  **mergetuples;	/* Each tape's memory allocation */
-	char	  **mergecurrent;	/* Current offset into each tape's memory */
-	char	  **mergetail;		/* Last item's start point for each tape */
-	char	  **mergeoverflow;	/* Retail palloc() "overflow" for each tape */
 
 	/*
 	 * Variables for Algorithm D.  Note that destTape is a "logical" tape
@@ -481,11 +478,33 @@ struct Tuplesortstate
 #endif
 };
 
+/*
+ * Is the given tuple allocated from the batch memory arena?
+ */
+#define IS_MERGETUPLE_BUFFER(state, tuple) \
+	((char *) tuple >= state->batchMemoryBegin && \
+	 (char *) tuple < state->batchMemoryEnd)
+
+/*
+ * Return the given tuple to the batch memory free list, or free it
+ * if it was palloc'd.
+ */
+#define RELEASE_MERGETUPLE_BUFFER(state, tuple) \
+	do { \
+		MergeTupleBuffer *buf = (MergeTupleBuffer *) tuple; \
+		\
+		if (IS_MERGETUPLE_BUFFER(state, tuple)) \
+		{ \
+			buf->nextfree = state->freeBufferHead; \
+			state->freeBufferHead = buf; \
+		} else \
+			pfree(tuple); \
+	} while(0)
+
 #define COMPARETUP(state,a,b)	((*(state)->comparetup) (a, b, state))
 #define COPYTUP(state,stup,tup) ((*(state)->copytup) (state, stup, tup))
 #define WRITETUP(state,tape,stup)	((*(state)->writetup) (state, tape, stup))
 #define READTUP(state,stup,tape,len) ((*(state)->readtup) (state, stup, tape, len))
-#define MOVETUP(dest,src,len) ((*(state)->movetup) (dest, src, len))
 #define LACKMEM(state)		((state)->availMem < 0 && !(state)->batchUsed)
 #define USEMEM(state,amt)	((state)->availMem -= (amt))
 #define FREEMEM(state,amt)	((state)->availMem += (amt))
@@ -553,16 +572,8 @@ static void inittapes(Tuplesortstate *state);
 static void selectnewtape(Tuplesortstate *state);
 static void mergeruns(Tuplesortstate *state);
 static void mergeonerun(Tuplesortstate *state);
-static void beginmerge(Tuplesortstate *state, bool finalMergeBatch);
-static void batchmemtuples(Tuplesortstate *state);
-static void mergebatch(Tuplesortstate *state, int64 spacePerTape);
-static void mergebatchone(Tuplesortstate *state, int srcTape,
-			  SortTuple *stup, bool *should_free);
-static void mergebatchfreetape(Tuplesortstate *state, int srcTape,
-				   SortTuple *rtup, bool *should_free);
-static void *mergebatchalloc(Tuplesortstate *state, int tapenum, Size tuplen);
-static void mergepreread(Tuplesortstate *state);
-static void mergeprereadone(Tuplesortstate *state, int srcTape);
+static void beginmerge(Tuplesortstate *state);
+static bool mergereadnext(Tuplesortstate *state, int srcTape, SortTuple *stup);
 static void dumptuples(Tuplesortstate *state, bool alltuples);
 static void dumpbatch(Tuplesortstate *state, bool alltuples);
 static void make_bounded_heap(Tuplesortstate *state);
@@ -576,7 +587,7 @@ static void tuplesort_heap_delete_top(Tuplesortstate *state, bool checkIndex);
 static void reversedirection(Tuplesortstate *state);
 static unsigned int getlen(Tuplesortstate *state, int tapenum, bool eofOK);
 static void markrunend(Tuplesortstate *state, int tapenum);
-static void *readtup_alloc(Tuplesortstate *state, int tapenum, Size tuplen);
+static void *readtup_alloc(Tuplesortstate *state, Size tuplen);
 static int comparetup_heap(const SortTuple *a, const SortTuple *b,
 				Tuplesortstate *state);
 static void copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup);
@@ -584,7 +595,6 @@ static void writetup_heap(Tuplesortstate *state, int tapenum,
 			  SortTuple *stup);
 static void readtup_heap(Tuplesortstate *state, SortTuple *stup,
 			 int tapenum, unsigned int len);
-static void movetup_heap(void *dest, void *src, unsigned int len);
 static int comparetup_cluster(const SortTuple *a, const SortTuple *b,
 				   Tuplesortstate *state);
 static void copytup_cluster(Tuplesortstate *state, SortTuple *stup, void *tup);
@@ -592,7 +602,6 @@ static void writetup_cluster(Tuplesortstate *state, int tapenum,
 				 SortTuple *stup);
 static void readtup_cluster(Tuplesortstate *state, SortTuple *stup,
 				int tapenum, unsigned int len);
-static void movetup_cluster(void *dest, void *src, unsigned int len);
 static int comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 					   Tuplesortstate *state);
 static int comparetup_index_hash(const SortTuple *a, const SortTuple *b,
@@ -602,7 +611,6 @@ static void writetup_index(Tuplesortstate *state, int tapenum,
 			   SortTuple *stup);
 static void readtup_index(Tuplesortstate *state, SortTuple *stup,
 			  int tapenum, unsigned int len);
-static void movetup_index(void *dest, void *src, unsigned int len);
 static int comparetup_datum(const SortTuple *a, const SortTuple *b,
 				 Tuplesortstate *state);
 static void copytup_datum(Tuplesortstate *state, SortTuple *stup, void *tup);
@@ -610,7 +618,6 @@ static void writetup_datum(Tuplesortstate *state, int tapenum,
 			   SortTuple *stup);
 static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
 			  int tapenum, unsigned int len);
-static void movetup_datum(void *dest, void *src, unsigned int len);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
 
 /*
@@ -662,10 +669,10 @@ tuplesort_begin_common(int workMem, bool randomAccess)
 	 * Caller tuple (e.g. IndexTuple) memory context.
 	 *
 	 * A dedicated child context used exclusively for caller passed tuples
-	 * eases memory management.  Resetting at key points reduces
-	 * fragmentation. Note that the memtuples array of SortTuples is allocated
-	 * in the parent context, not this context, because there is no need to
-	 * free memtuples early.
+	 * eases memory management.  Destroying it once we're done building
+	 * the initial runs reduces fragmentation.  Note that the memtuples array
+	 * of SortTuples is allocated in the parent context, not this context,
+	 * because there is no need to free memtuples early.
 	 */
 	tuplecontext = AllocSetContextCreate(sortcontext,
 										 "Caller tuples",
@@ -762,7 +769,6 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
 	state->readtup = readtup_heap;
-	state->movetup = movetup_heap;
 
 	state->tupDesc = tupDesc;	/* assume we need not copy tupDesc */
 	state->abbrevNext = 10;
@@ -835,7 +841,6 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 	state->copytup = copytup_cluster;
 	state->writetup = writetup_cluster;
 	state->readtup = readtup_cluster;
-	state->movetup = movetup_cluster;
 	state->abbrevNext = 10;
 
 	state->indexInfo = BuildIndexInfo(indexRel);
@@ -927,7 +932,6 @@ tuplesort_begin_index_btree(Relation heapRel,
 	state->copytup = copytup_index;
 	state->writetup = writetup_index;
 	state->readtup = readtup_index;
-	state->movetup = movetup_index;
 	state->abbrevNext = 10;
 
 	state->heapRel = heapRel;
@@ -995,7 +999,6 @@ tuplesort_begin_index_hash(Relation heapRel,
 	state->copytup = copytup_index;
 	state->writetup = writetup_index;
 	state->readtup = readtup_index;
-	state->movetup = movetup_index;
 
 	state->heapRel = heapRel;
 	state->indexRel = indexRel;
@@ -1038,7 +1041,6 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	state->copytup = copytup_datum;
 	state->writetup = writetup_datum;
 	state->readtup = readtup_datum;
-	state->movetup = movetup_datum;
 	state->abbrevNext = 10;
 
 	state->datumType = datumType;
@@ -1884,14 +1886,33 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 		case TSS_SORTEDONTAPE:
 			Assert(forward || state->randomAccess);
 			Assert(!state->batchUsed);
-			*should_free = true;
+
+			/*
+			 * The buffer holding the tuple that we returned in previous
+			 * gettuple call can now be reused.
+			 */
+			if (state->lastReturnedTuple)
+			{
+				RELEASE_MERGETUPLE_BUFFER(state, state->lastReturnedTuple);
+				state->lastReturnedTuple = NULL;
+			}
+
 			if (forward)
 			{
 				if (state->eof_reached)
 					return false;
+
 				if ((tuplen = getlen(state, state->result_tape, true)) != 0)
 				{
 					READTUP(state, stup, state->result_tape, tuplen);
+
+					/*
+					 * Remember the tuple we return, so that we can recycle its
+					 * memory on next call. (This can be NULL, in the Datum case).
+					 */
+					state->lastReturnedTuple = stup->tuple;
+
+					*should_free = false;
 					return true;
 				}
 				else
@@ -1965,74 +1986,70 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 									  tuplen))
 				elog(ERROR, "bogus tuple length in backward scan");
 			READTUP(state, stup, state->result_tape, tuplen);
+
+			/*
+			 * Remember the tuple we return, so that we can recycle its
+			 * memory on next call. (This can be NULL, in the Datum case).
+			 */
+			state->lastReturnedTuple = stup->tuple;
+
+			*should_free = false;
 			return true;
 
 		case TSS_FINALMERGE:
 			Assert(forward);
 			Assert(state->batchUsed || !state->tuples);
-			/* For now, assume tuple is stored in tape's batch memory */
+			/* We are managing memory ourselves, with the batch memory arena. */
 			*should_free = false;
 
 			/*
+			 * The buffer holding the tuple that we returned in previous
+			 * gettuple call can now be reused.
+			 */
+			if (state->lastReturnedTuple)
+			{
+				RELEASE_MERGETUPLE_BUFFER(state, state->lastReturnedTuple);
+				state->lastReturnedTuple = NULL;
+			}
+
+			/*
 			 * This code should match the inner loop of mergeonerun().
 			 */
 			if (state->memtupcount > 0)
 			{
 				int			srcTape = state->memtuples[0].tupindex;
-				int			tupIndex;
-				SortTuple  *newtup;
+				SortTuple	newtup;
+
+				*stup = state->memtuples[0];
 
 				/*
-				 * Returned tuple is still counted in our memory space most of
-				 * the time.  See mergebatchone() for discussion of why caller
-				 * may occasionally be required to free returned tuple, and
-				 * how preread memory is managed with regard to edge cases
-				 * more generally.
+				 * Remember the tuple we return, so that we can recycle its
+				 * memory on next call. (This can be NULL, in the Datum case).
 				 */
-				*stup = state->memtuples[0];
-				if ((tupIndex = state->mergenext[srcTape]) == 0)
+				state->lastReturnedTuple = stup->tuple;
+
+				/*
+				 * Pull next tuple from tape, and replace the returned tuple
+				 * at top of the heap with it.
+				 */
+				if (!mergereadnext(state, srcTape, &newtup))
 				{
 					/*
-					 * out of preloaded data on this tape, try to read more
-					 *
-					 * Unlike mergeonerun(), we only preload from the single
-					 * tape that's run dry, though not before preparing its
-					 * batch memory for a new round of sequential consumption.
-					 * See mergepreread() comments.
+					 * If no more data, we've reached end of run on this tape.
+					 * Remove the top node from the heap.
 					 */
-					if (state->batchUsed)
-						mergebatchone(state, srcTape, stup, should_free);
-
-					mergeprereadone(state, srcTape);
+					tuplesort_heap_delete_top(state, false);
 
 					/*
-					 * if still no data, we've reached end of run on this tape
+					 * Rewind to free the read buffer.  It'd go away at the
+					 * end of the sort anyway, but better to release the
+					 * memory early.
 					 */
-					if ((tupIndex = state->mergenext[srcTape]) == 0)
-					{
-						/* Remove the top node from the heap */
-						tuplesort_heap_delete_top(state, false);
-						/* Free tape's buffer, avoiding dangling pointer */
-						if (state->batchUsed)
-							mergebatchfreetape(state, srcTape, stup, should_free);
-						return true;
-					}
+					LogicalTapeRewind(state->tapeset, srcTape, true);
+					return true;
 				}
-
-				/*
-				 * pull next preread tuple from list, and replace the returned
-				 * tuple at top of the heap with it.
-				 */
-				newtup = &state->memtuples[tupIndex];
-				state->mergenext[srcTape] = newtup->tupindex;
-				if (state->mergenext[srcTape] == 0)
-					state->mergelast[srcTape] = 0;
-				newtup->tupindex = srcTape;
-				tuplesort_heap_replace_top(state, newtup, false);
-				/* put the now-unused memtuples entry on the freelist */
-				newtup->tupindex = state->mergefreelist;
-				state->mergefreelist = tupIndex;
-				state->mergeavailslots[srcTape]++;
+				newtup.tupindex = srcTape;
+				tuplesort_heap_replace_top(state, &newtup, false);
 				return true;
 			}
 			return false;
@@ -2317,13 +2334,6 @@ inittapes(Tuplesortstate *state)
 	/* Compute number of tapes to use: merge order plus 1 */
 	maxTapes = tuplesort_merge_order(state->allowedMem) + 1;
 
-	/*
-	 * We must have at least 2*maxTapes slots in the memtuples[] array, else
-	 * we'd not have room for merge heap plus preread.  It seems unlikely that
-	 * this case would ever occur, but be safe.
-	 */
-	maxTapes = Min(maxTapes, state->memtupsize / 2);
-
 	state->maxTapes = maxTapes;
 	state->tapeRange = maxTapes - 1;
 
@@ -2334,13 +2344,13 @@ inittapes(Tuplesortstate *state)
 #endif
 
 	/*
-	 * Decrease availMem to reflect the space needed for tape buffers; but
-	 * don't decrease it to the point that we have no room for tuples. (That
-	 * case is only likely to occur if sorting pass-by-value Datums; in all
-	 * other scenarios the memtuples[] array is unlikely to occupy more than
-	 * half of allowedMem.  In the pass-by-value case it's not important to
-	 * account for tuple space, so we don't care if LACKMEM becomes
-	 * inaccurate.)
+	 * Decrease availMem to reflect the space needed for tape buffers, when
+	 * writing the initial runs; but don't decrease it to the point that we
+	 * have no room for tuples. (That case is only likely to occur if sorting
+	 * pass-by-value Datums; in all other scenarios the memtuples[] array is
+	 * unlikely to occupy more than half of allowedMem.  In the pass-by-value
+	 * case it's not important to account for tuple space, so we don't care
+	 * if LACKMEM becomes inaccurate.)
 	 */
 	tapeSpace = (int64) maxTapes *TAPE_BUFFER_OVERHEAD;
 
@@ -2359,14 +2369,6 @@ inittapes(Tuplesortstate *state)
 	state->tapeset = LogicalTapeSetCreate(maxTapes);
 
 	state->mergeactive = (bool *) palloc0(maxTapes * sizeof(bool));
-	state->mergenext = (int *) palloc0(maxTapes * sizeof(int));
-	state->mergelast = (int *) palloc0(maxTapes * sizeof(int));
-	state->mergeavailslots = (int *) palloc0(maxTapes * sizeof(int));
-	state->mergeavailmem = (int64 *) palloc0(maxTapes * sizeof(int64));
-	state->mergetuples = (char **) palloc0(maxTapes * sizeof(char *));
-	state->mergecurrent = (char **) palloc0(maxTapes * sizeof(char *));
-	state->mergetail = (char **) palloc0(maxTapes * sizeof(char *));
-	state->mergeoverflow = (char **) palloc0(maxTapes * sizeof(char *));
 	state->tp_fib = (int *) palloc0(maxTapes * sizeof(int));
 	state->tp_runs = (int *) palloc0(maxTapes * sizeof(int));
 	state->tp_dummy = (int *) palloc0(maxTapes * sizeof(int));
@@ -2478,6 +2480,12 @@ mergeruns(Tuplesortstate *state)
 				svTape,
 				svRuns,
 				svDummy;
+	char	   *p;
+	int			i;
+	int64		availBlocks;
+	int64		usedBlocks;
+	int64		blocksPerTape;
+	int			remainder;
 
 	Assert(state->status == TSS_BUILDRUNS);
 	Assert(state->memtupcount == 0);
@@ -2499,6 +2507,67 @@ mergeruns(Tuplesortstate *state)
 	}
 
 	/*
+	 * Delete the tuple memory context.  We've freed all the tuples that we
+	 * previously allocated, and we use the batch memory arena from now on.
+	 */
+	MemoryContextDelete(state->tuplecontext);
+	state->tuplecontext = NULL;
+
+	/*
+	 * We no longer need a large memtuples array; the merge heap needs only
+	 * one slot per tape.  Free the old array now, to make the memory
+	 * available for other use.
+	 */
+	FREEMEM(state, GetMemoryChunkSpace(state->memtuples));
+	pfree(state->memtuples);
+
+	/*
+	 * If we had fewer runs than tapes, forget about the unused tapes.  We
+	 * decrease maxTapes and tapeRange to reflect the actual number of tapes
+	 * used, and refund buffers for tapes that were never allocated.  (We don't
+	 * try to shrink the various arrays that were allocated according to the
+	 * old maxTapes.)
+	 */
+	if (state->Level == 1)
+	{
+		int			numTapes;
+
+		numTapes = state->currentRun + 1;
+		FREEMEM(state, (state->maxTapes - numTapes) * TAPE_BUFFER_OVERHEAD);
+
+		state->maxTapes = numTapes;
+		state->tapeRange = numTapes - 1;
+	}
+
+	/*
+	 * Allocate a new 'memtuples' array, for the heap.
+	 */
+	state->memtupsize = state->maxTapes;
+	state->memtuples = (SortTuple *) palloc(state->maxTapes * sizeof(SortTuple));
+	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
+
+	/*
+	 * Initialize the batch memory arena.  We need one tuple buffer per tape,
+	 * for tuples in the heap, plus one to hold the tuple last returned from
+	 * tuplesort_gettuple.
+	 */
+	state->batchMemoryBegin = palloc((state->maxTapes + 1) *
+									 MERGETUPLEBUFFER_SIZE);
+	state->batchMemoryEnd = state->batchMemoryBegin +
+		(state->maxTapes + 1) * MERGETUPLEBUFFER_SIZE;
+	state->freeBufferHead = (MergeTupleBuffer *) state->batchMemoryBegin;
+	USEMEM(state, (state->maxTapes + 1) * MERGETUPLEBUFFER_SIZE);
+
+	p = state->batchMemoryBegin;
+	for (i = 0; i < state->maxTapes; i++)
+	{
+		((MergeTupleBuffer *) p)->nextfree =
+			(MergeTupleBuffer *) (p + MERGETUPLEBUFFER_SIZE);
+		p += MERGETUPLEBUFFER_SIZE;
+	}
+	((MergeTupleBuffer *) p)->nextfree = NULL;
+
+	/*
 	 * If we produced only one initial run (quite likely if the total data
 	 * volume is between 1X and 2X workMem when replacement selection is used,
 	 * but something we particular count on when input is presorted), we can
@@ -2514,6 +2583,56 @@ mergeruns(Tuplesortstate *state)
 		return;
 	}
 
+	/*
+	 * Use all the spare memory we have available for read buffers. Divide it
+	 * evenly among all the tapes.
+	 *
+	 * We use the number of *input* tapes (tapeRange) here, rather than maxTapes,
+	 * for the calculation.  At all times, we'll be reading from at most tapeRange
+	 * tapes, and one tape is used for output.  But we call
+	 * LogicalTapeAssignReadBufferSize() for every tape, so that when the tape
+	 * is read from, it will use a properly-sized buffer.
+	 */
+	availBlocks = state->availMem / BLCKSZ;
+	blocksPerTape = availBlocks / state->tapeRange;
+	remainder = availBlocks % state->tapeRange;
+
+	/*
+	 * Use one page per tape, even if we are out of memory. tuplesort_merge_order()
+	 * should've chosen the number of tapes so that this can't happen, but better
+	 * safe than sorry.  (This also protects from a negative availMem.)
+	 */
+	if (blocksPerTape < 1)
+	{
+		blocksPerTape = 1;
+		remainder = 0;
+	}
+
+	usedBlocks = 0;
+	for (tapenum = 0; tapenum < state->maxTapes; tapenum++)
+	{
+		int64		numBlocks = blocksPerTape + (tapenum < remainder ? 1 : 0);
+
+		if (numBlocks > MaxAllocSize / BLCKSZ)
+			numBlocks = MaxAllocSize / BLCKSZ;
+		LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
+										numBlocks * BLCKSZ);
+		usedBlocks += numBlocks;
+	}
+	USEMEM(state, usedBlocks * BLCKSZ);
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG, "using " INT64_FORMAT " KB of memory for read buffers among %d tapes",
+			 (int64) ((usedBlocks * BLCKSZ) / 1024), state->maxTapes);
+#endif
+
+	/*
+	 * From this point on, we no longer use the USEMEM()/LACKMEM() mechanism to
+	 * track memory usage of individual tuples.
+	 */
+	state->batchUsed = true;
+
 	/* End of step D2: rewind all output tapes to prepare for merging */
 	for (tapenum = 0; tapenum < state->tapeRange; tapenum++)
 		LogicalTapeRewind(state->tapeset, tapenum, false);
@@ -2544,7 +2663,7 @@ mergeruns(Tuplesortstate *state)
 				/* Tell logtape.c we won't be writing anymore */
 				LogicalTapeSetForgetFreeSpace(state->tapeset);
 				/* Initialize for the final merge pass */
-				beginmerge(state, state->tuples);
+				beginmerge(state);
 				state->status = TSS_FINALMERGE;
 				return;
 			}
@@ -2614,6 +2733,14 @@ mergeruns(Tuplesortstate *state)
 	state->result_tape = state->tp_tapenum[state->tapeRange];
 	LogicalTapeFreeze(state->tapeset, state->result_tape);
 	state->status = TSS_SORTEDONTAPE;
+
+	/* Release the read buffers on all the other tapes, by rewinding them. */
+	for (tapenum = 0; tapenum < state->maxTapes; tapenum++)
+	{
+		if (tapenum == state->result_tape)
+			continue;
+		LogicalTapeRewind(state->tapeset, tapenum, true);
+	}
 }
 
 /*
@@ -2627,16 +2754,12 @@ mergeonerun(Tuplesortstate *state)
 {
 	int			destTape = state->tp_tapenum[state->tapeRange];
 	int			srcTape;
-	int			tupIndex;
-	SortTuple  *tup;
-	int64		priorAvail,
-				spaceFreed;
 
 	/*
 	 * Start the merge by loading one tuple from each active source tape into
 	 * the heap.  We can also decrease the input run/dummy run counts.
 	 */
-	beginmerge(state, false);
+	beginmerge(state);
 
 	/*
 	 * Execute merge by repeatedly extracting lowest tuple in heap, writing it
@@ -2645,52 +2768,31 @@ mergeonerun(Tuplesortstate *state)
 	 */
 	while (state->memtupcount > 0)
 	{
+		SortTuple stup;
+
 		/* write the tuple to destTape */
-		priorAvail = state->availMem;
 		srcTape = state->memtuples[0].tupindex;
 		WRITETUP(state, destTape, &state->memtuples[0]);
-		/* writetup adjusted total free space, now fix per-tape space */
-		spaceFreed = state->availMem - priorAvail;
-		state->mergeavailmem[srcTape] += spaceFreed;
-		if ((tupIndex = state->mergenext[srcTape]) == 0)
-		{
-			/* out of preloaded data on this tape, try to read more */
-			mergepreread(state);
-			/* if still no data, we've reached end of run on this tape */
-			if ((tupIndex = state->mergenext[srcTape]) == 0)
-			{
-				/* remove the written-out tuple from the heap */
-				tuplesort_heap_delete_top(state, false);
-				continue;
-			}
-		}
+
+		/* recycle the buffer of the tuple we just wrote out, for the next read */
+		RELEASE_MERGETUPLE_BUFFER(state, state->memtuples[0].tuple);
 
 		/*
 		 * pull next preread tuple from list, and replace the written-out
 		 * tuple in the heap with it.
 		 */
-		tup = &state->memtuples[tupIndex];
-		state->mergenext[srcTape] = tup->tupindex;
-		if (state->mergenext[srcTape] == 0)
-			state->mergelast[srcTape] = 0;
-		tup->tupindex = srcTape;
-		tuplesort_heap_replace_top(state, tup, false);
-		/* put the now-unused memtuples entry on the freelist */
-		tup->tupindex = state->mergefreelist;
-		state->mergefreelist = tupIndex;
-		state->mergeavailslots[srcTape]++;
+		if (!mergereadnext(state, srcTape, &stup))
+		{
+			/* we've reached end of run on this tape */
+			/* remove the written-out tuple from the heap */
+			tuplesort_heap_delete_top(state, false);
+			continue;
+		}
+		stup.tupindex = srcTape;
+		tuplesort_heap_replace_top(state, &stup, false);
 	}
 
 	/*
-	 * Reset tuple memory.  We've freed all of the tuples that we previously
-	 * allocated, but AllocSetFree will have put those chunks of memory on
-	 * particular free lists, bucketed by size class.  Thus, although all of
-	 * that memory is free, it is effectively fragmented.  Resetting the
-	 * context gets us out from under that problem.
-	 */
-	MemoryContextReset(state->tuplecontext);
-
-	/*
 	 * When the heap empties, we're done.  Write an end-of-run marker on the
 	 * output tape, and increment its count of real runs.
 	 */
@@ -2711,18 +2813,13 @@ mergeonerun(Tuplesortstate *state)
  * which tapes contain active input runs in mergeactive[].  Then, load
  * as many tuples as we can from each active input tape, and finally
  * fill the merge heap with the first tuple from each active tape.
- *
- * finalMergeBatch indicates if this is the beginning of a final on-the-fly
- * merge where a batched allocation of tuple memory is required.
  */
 static void
-beginmerge(Tuplesortstate *state, bool finalMergeBatch)
+beginmerge(Tuplesortstate *state)
 {
 	int			activeTapes;
 	int			tapenum;
 	int			srcTape;
-	int			slotsPerTape;
-	int64		spacePerTape;
 
 	/* Heap should be empty here */
 	Assert(state->memtupcount == 0);
@@ -2746,517 +2843,47 @@ beginmerge(Tuplesortstate *state, bool finalMergeBatch)
 	}
 	state->activeTapes = activeTapes;
 
-	/* Clear merge-pass state variables */
-	memset(state->mergenext, 0,
-		   state->maxTapes * sizeof(*state->mergenext));
-	memset(state->mergelast, 0,
-		   state->maxTapes * sizeof(*state->mergelast));
-	state->mergefreelist = 0;	/* nothing in the freelist */
-	state->mergefirstfree = activeTapes;		/* 1st slot avail for preread */
-
-	if (finalMergeBatch)
-	{
-		/* Free outright buffers for tape never actually allocated */
-		FREEMEM(state, (state->maxTapes - activeTapes) * TAPE_BUFFER_OVERHEAD);
-
-		/*
-		 * Grow memtuples one last time, since the palloc() overhead no longer
-		 * incurred can make a big difference
-		 */
-		batchmemtuples(state);
-	}
-
 	/*
 	 * Initialize space allocation to let each active input tape have an equal
 	 * share of preread space.
 	 */
 	Assert(activeTapes > 0);
-	slotsPerTape = (state->memtupsize - state->mergefirstfree) / activeTapes;
-	Assert(slotsPerTape > 0);
-	spacePerTape = MAXALIGN_DOWN(state->availMem / activeTapes);
-	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
-	{
-		if (state->mergeactive[srcTape])
-		{
-			state->mergeavailslots[srcTape] = slotsPerTape;
-			state->mergeavailmem[srcTape] = spacePerTape;
-		}
-	}
-
-	/*
-	 * Preallocate tuple batch memory for each tape.  This is the memory used
-	 * for tuples themselves (not SortTuples), so it's never used by
-	 * pass-by-value datum sorts.  Memory allocation is performed here at most
-	 * once per sort, just in advance of the final on-the-fly merge step.
-	 */
-	if (finalMergeBatch)
-		mergebatch(state, spacePerTape);
-
-	/*
-	 * Preread as many tuples as possible (and at least one) from each active
-	 * tape
-	 */
-	mergepreread(state);
 
 	/* Load the merge heap with the first tuple from each input tape */
 	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
 	{
-		int			tupIndex = state->mergenext[srcTape];
-		SortTuple  *tup;
-
-		if (tupIndex)
-		{
-			tup = &state->memtuples[tupIndex];
-			state->mergenext[srcTape] = tup->tupindex;
-			if (state->mergenext[srcTape] == 0)
-				state->mergelast[srcTape] = 0;
-			tup->tupindex = srcTape;
-			tuplesort_heap_insert(state, tup, false);
-			/* put the now-unused memtuples entry on the freelist */
-			tup->tupindex = state->mergefreelist;
-			state->mergefreelist = tupIndex;
-			state->mergeavailslots[srcTape]++;
-
-#ifdef TRACE_SORT
-			if (trace_sort && finalMergeBatch)
-			{
-				int64		perTapeKB = (spacePerTape + 1023) / 1024;
-				int64		usedSpaceKB;
-				int			usedSlots;
-
-				/*
-				 * Report how effective batchmemtuples() was in balancing the
-				 * number of slots against the need for memory for the
-				 * underlying tuples (e.g. IndexTuples).  The big preread of
-				 * all tapes when switching to FINALMERGE state should be
-				 * fairly representative of memory utilization during the
-				 * final merge step, and in any case is the only point at
-				 * which all tapes are guaranteed to have depleted either
-				 * their batch memory allowance or slot allowance.  Ideally,
-				 * both will be completely depleted for every tape by now.
-				 */
-				usedSpaceKB = (state->mergecurrent[srcTape] -
-							   state->mergetuples[srcTape] + 1023) / 1024;
-				usedSlots = slotsPerTape - state->mergeavailslots[srcTape];
-
-				elog(LOG, "tape %d initially used " INT64_FORMAT " KB of "
-					 INT64_FORMAT " KB batch (%2.3f) and %d out of %d slots "
-					 "(%2.3f)", srcTape,
-					 usedSpaceKB, perTapeKB,
-					 (double) usedSpaceKB / (double) perTapeKB,
-					 usedSlots, slotsPerTape,
-					 (double) usedSlots / (double) slotsPerTape);
-			}
-#endif
-		}
-	}
-}
-
-/*
- * batchmemtuples - grow memtuples without palloc overhead
- *
- * When called, availMem should be approximately the amount of memory we'd
- * require to allocate memtupsize - memtupcount tuples (not SortTuples/slots)
- * that were allocated with palloc() overhead, and in doing so use up all
- * allocated slots.  However, though slots and tuple memory is in balance
- * following the last grow_memtuples() call, that's predicated on the observed
- * average tuple size for the "final" grow_memtuples() call, which includes
- * palloc overhead.  During the final merge pass, where we will arrange to
- * squeeze out the palloc overhead, we might need more slots in the memtuples
- * array.
- *
- * To make that happen, arrange for the amount of remaining memory to be
- * exactly equal to the palloc overhead multiplied by the current size of
- * the memtuples array, force the grow_memtuples flag back to true (it's
- * probably but not necessarily false on entry to this routine), and then
- * call grow_memtuples.  This simulates loading enough tuples to fill the
- * whole memtuples array and then having some space left over because of the
- * elided palloc overhead.  We expect that grow_memtuples() will conclude that
- * it can't double the size of the memtuples array but that it can increase
- * it by some percentage; but if it does decide to double it, that just means
- * that we've never managed to use many slots in the memtuples array, in which
- * case doubling it shouldn't hurt anything anyway.
- */
-static void
-batchmemtuples(Tuplesortstate *state)
-{
-	int64		refund;
-	int64		availMemLessRefund;
-	int			memtupsize = state->memtupsize;
-
-	/* Caller error if we have no tapes */
-	Assert(state->activeTapes > 0);
-
-	/* For simplicity, assume no memtuples are actually currently counted */
-	Assert(state->memtupcount == 0);
-
-	/*
-	 * Refund STANDARDCHUNKHEADERSIZE per tuple.
-	 *
-	 * This sometimes fails to make memory use perfectly balanced, but it
-	 * should never make the situation worse.  Note that Assert-enabled builds
-	 * get a larger refund, due to a varying STANDARDCHUNKHEADERSIZE.
-	 */
-	refund = memtupsize * STANDARDCHUNKHEADERSIZE;
-	availMemLessRefund = state->availMem - refund;
-
-	/*
-	 * We need to be sure that we do not cause LACKMEM to become true, else
-	 * the batch allocation size could be calculated as negative, causing
-	 * havoc.  Hence, if availMemLessRefund is negative at this point, we must
-	 * do nothing.  Moreover, if it's positive but rather small, there's
-	 * little point in proceeding because we could only increase memtuples by
-	 * a small amount, not worth the cost of the repalloc's.  We somewhat
-	 * arbitrarily set the threshold at ALLOCSET_DEFAULT_INITSIZE per tape.
-	 * (Note that this does not represent any assumption about tuple sizes.)
-	 */
-	if (availMemLessRefund <=
-		(int64) state->activeTapes * ALLOCSET_DEFAULT_INITSIZE)
-		return;
-
-	/*
-	 * To establish balanced memory use after refunding palloc overhead,
-	 * temporarily have our accounting indicate that we've allocated all
-	 * memory we're allowed to less that refund, and call grow_memtuples() to
-	 * have it increase the number of slots.
-	 */
-	state->growmemtuples = true;
-	USEMEM(state, availMemLessRefund);
-	(void) grow_memtuples(state);
-	state->growmemtuples = false;
-	/* availMem must stay accurate for spacePerTape calculation */
-	FREEMEM(state, availMemLessRefund);
-	if (LACKMEM(state))
-		elog(ERROR, "unexpected out-of-memory situation in tuplesort");
-
-#ifdef TRACE_SORT
-	if (trace_sort)
-	{
-		Size		OldKb = (memtupsize * sizeof(SortTuple) + 1023) / 1024;
-		Size		NewKb = (state->memtupsize * sizeof(SortTuple) + 1023) / 1024;
-
-		elog(LOG, "grew memtuples %1.2fx from %d (%zu KB) to %d (%zu KB) for final merge",
-			 (double) NewKb / (double) OldKb,
-			 memtupsize, OldKb,
-			 state->memtupsize, NewKb);
-	}
-#endif
-}
-
-/*
- * mergebatch - initialize tuple memory in batch
- *
- * This allows sequential access to sorted tuples buffered in memory from
- * tapes/runs on disk during a final on-the-fly merge step.  Note that the
- * memory is not used for SortTuples, but for the underlying tuples (e.g.
- * MinimalTuples).
- *
- * Note that when batch memory is used, there is a simple division of space
- * into large buffers (one per active tape).  The conventional incremental
- * memory accounting (calling USEMEM() and FREEMEM()) is abandoned.  Instead,
- * when each tape's memory budget is exceeded, a retail palloc() "overflow" is
- * performed, which is then immediately detected in a way that is analogous to
- * LACKMEM().  This keeps each tape's use of memory fair, which is always a
- * goal.
- */
-static void
-mergebatch(Tuplesortstate *state, int64 spacePerTape)
-{
-	int			srcTape;
-
-	Assert(state->activeTapes > 0);
-	Assert(state->tuples);
-
-	/*
-	 * For the purposes of tuplesort's memory accounting, the batch allocation
-	 * is special, and regular memory accounting through USEMEM() calls is
-	 * abandoned (see mergeprereadone()).
-	 */
-	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
-	{
-		char	   *mergetuples;
-
-		if (!state->mergeactive[srcTape])
-			continue;
-
-		/* Allocate buffer for each active tape */
-		mergetuples = MemoryContextAllocHuge(state->tuplecontext,
-											 spacePerTape);
-
-		/* Initialize state for tape */
-		state->mergetuples[srcTape] = mergetuples;
-		state->mergecurrent[srcTape] = mergetuples;
-		state->mergetail[srcTape] = mergetuples;
-		state->mergeoverflow[srcTape] = NULL;
-	}
+		SortTuple	tup;
 
-	state->batchUsed = true;
-	state->spacePerTape = spacePerTape;
-}
-
-/*
- * mergebatchone - prepare batch memory for one merge input tape
- *
- * This is called following the exhaustion of preread tuples for one input
- * tape.  All that actually occurs is that the state for the source tape is
- * reset to indicate that all memory may be reused.
- *
- * This routine must deal with fixing up the tuple that is about to be returned
- * to the client, due to "overflow" allocations.
- */
-static void
-mergebatchone(Tuplesortstate *state, int srcTape, SortTuple *rtup,
-			  bool *should_free)
-{
-	Assert(state->batchUsed);
-
-	/*
-	 * Tuple about to be returned to caller ("stup") is final preread tuple
-	 * from tape, just removed from the top of the heap.  Special steps around
-	 * memory management must be performed for that tuple, to make sure it
-	 * isn't overwritten early.
-	 */
-	if (!state->mergeoverflow[srcTape])
-	{
-		Size		tupLen;
-
-		/*
-		 * Mark tuple buffer range for reuse, but be careful to move final,
-		 * tail tuple to start of space for next run so that it's available to
-		 * caller when stup is returned, and remains available at least until
-		 * the next tuple is requested.
-		 */
-		tupLen = state->mergecurrent[srcTape] - state->mergetail[srcTape];
-		state->mergecurrent[srcTape] = state->mergetuples[srcTape];
-		MOVETUP(state->mergecurrent[srcTape], state->mergetail[srcTape],
-				tupLen);
-
-		/* Make SortTuple at top of the merge heap point to new tuple */
-		rtup->tuple = (void *) state->mergecurrent[srcTape];
-
-		state->mergetail[srcTape] = state->mergecurrent[srcTape];
-		state->mergecurrent[srcTape] += tupLen;
-	}
-	else
-	{
-		/*
-		 * Handle an "overflow" retail palloc.
-		 *
-		 * This is needed when we run out of tuple memory for the tape.
-		 */
-		state->mergecurrent[srcTape] = state->mergetuples[srcTape];
-		state->mergetail[srcTape] = state->mergetuples[srcTape];
-
-		if (rtup->tuple)
+		if (mergereadnext(state, srcTape, &tup))
 		{
-			Assert(rtup->tuple == (void *) state->mergeoverflow[srcTape]);
-			/* Caller should free palloc'd tuple */
-			*should_free = true;
+			tup.tupindex = srcTape;
+			tuplesort_heap_insert(state, &tup, false);
 		}
-		state->mergeoverflow[srcTape] = NULL;
-	}
-}
-
-/*
- * mergebatchfreetape - handle final clean-up for batch memory once tape is
- * about to become exhausted
- *
- * All tuples are returned from tape, but a single final tuple, *rtup, is to be
- * passed back to caller.  Free tape's batch allocation buffer while ensuring
- * that the final tuple is managed appropriately.
- */
-static void
-mergebatchfreetape(Tuplesortstate *state, int srcTape, SortTuple *rtup,
-				   bool *should_free)
-{
-	Assert(state->batchUsed);
-	Assert(state->status == TSS_FINALMERGE);
-
-	/*
-	 * Tuple may or may not already be an overflow allocation from
-	 * mergebatchone()
-	 */
-	if (!*should_free && rtup->tuple)
-	{
-		/*
-		 * Final tuple still in tape's batch allocation.
-		 *
-		 * Return palloc()'d copy to caller, and have it freed in a similar
-		 * manner to overflow allocation.  Otherwise, we'd free batch memory
-		 * and pass back a pointer to garbage.  Note that we deliberately
-		 * allocate this in the parent tuplesort context, to be on the safe
-		 * side.
-		 */
-		Size		tuplen;
-		void	   *oldTuple = rtup->tuple;
-
-		tuplen = state->mergecurrent[srcTape] - state->mergetail[srcTape];
-		rtup->tuple = MemoryContextAlloc(state->sortcontext, tuplen);
-		MOVETUP(rtup->tuple, oldTuple, tuplen);
-		*should_free = true;
-	}
-
-	/* Free spacePerTape-sized buffer */
-	pfree(state->mergetuples[srcTape]);
-}
-
-/*
- * mergebatchalloc - allocate memory for one tuple using a batch memory
- * "logical allocation".
- *
- * This is used for the final on-the-fly merge phase only.  READTUP() routines
- * receive memory from here in place of palloc() and USEMEM() calls.
- *
- * Tuple tapenum is passed, ensuring each tape's tuples are stored in sorted,
- * contiguous order (while allowing safe reuse of memory made available to
- * each tape).  This maximizes locality of access as tuples are returned by
- * final merge.
- *
- * Caller must not subsequently attempt to free memory returned here.  In
- * general, only mergebatch* functions know about how memory returned from
- * here should be freed, and this function's caller must ensure that batch
- * memory management code will definitely have the opportunity to do the right
- * thing during the final on-the-fly merge.
- */
-static void *
-mergebatchalloc(Tuplesortstate *state, int tapenum, Size tuplen)
-{
-	Size		reserve_tuplen = MAXALIGN(tuplen);
-	char	   *ret;
-
-	/* Should overflow at most once before mergebatchone() call: */
-	Assert(state->mergeoverflow[tapenum] == NULL);
-	Assert(state->batchUsed);
-
-	/* It should be possible to use precisely spacePerTape memory at once */
-	if (state->mergecurrent[tapenum] + reserve_tuplen <=
-		state->mergetuples[tapenum] + state->spacePerTape)
-	{
-		/*
-		 * Usual case -- caller is returned pointer into its tape's buffer,
-		 * and an offset from that point is recorded as where tape has
-		 * consumed up to for current round of preloading.
-		 */
-		ret = state->mergetail[tapenum] = state->mergecurrent[tapenum];
-		state->mergecurrent[tapenum] += reserve_tuplen;
 	}
-	else
-	{
-		/*
-		 * Allocate memory, and record as tape's overflow allocation.  This
-		 * will be detected quickly, in a similar fashion to a LACKMEM()
-		 * condition, and should not happen again before a new round of
-		 * preloading for caller's tape.  Note that we deliberately allocate
-		 * this in the parent tuplesort context, to be on the safe side.
-		 *
-		 * Sometimes, this does not happen because merging runs out of slots
-		 * before running out of memory.
-		 */
-		ret = state->mergeoverflow[tapenum] =
-			MemoryContextAlloc(state->sortcontext, tuplen);
-	}
-
-	return ret;
 }
 
 /*
- * mergepreread - load tuples from merge input tapes
- *
- * This routine exists to improve sequentiality of reads during a merge pass,
- * as explained in the header comments of this file.  Load tuples from each
- * active source tape until the tape's run is exhausted or it has used up
- * its fair share of available memory.  In any case, we guarantee that there
- * is at least one preread tuple available from each unexhausted input tape.
- *
- * We invoke this routine at the start of a merge pass for initial load,
- * and then whenever any tape's preread data runs out.  Note that we load
- * as much data as possible from all tapes, not just the one that ran out.
- * This is because logtape.c works best with a usage pattern that alternates
- * between reading a lot of data and writing a lot of data, so whenever we
- * are forced to read, we should fill working memory completely.
+ * mergereadnext - read next tuple from one merge input tape
  *
- * In FINALMERGE state, we *don't* use this routine, but instead just preread
- * from the single tape that ran dry.  There's no read/write alternation in
- * that state and so no point in scanning through all the tapes to fix one.
- * (Moreover, there may be quite a lot of inactive tapes in that state, since
- * we might have had many fewer runs than tapes.  In a regular tape-to-tape
- * merge we can expect most of the tapes to be active.  Plus, only
- * FINALMERGE state has to consider memory management for a batch
- * allocation.)
+ * Returns false on EOF.
  */
-static void
-mergepreread(Tuplesortstate *state)
-{
-	int			srcTape;
-
-	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
-		mergeprereadone(state, srcTape);
-}
-
-/*
- * mergeprereadone - load tuples from one merge input tape
- *
- * Read tuples from the specified tape until it has used up its free memory
- * or array slots; but ensure that we have at least one tuple, if any are
- * to be had.
- */
-static void
-mergeprereadone(Tuplesortstate *state, int srcTape)
+static bool
+mergereadnext(Tuplesortstate *state, int srcTape, SortTuple *stup)
 {
 	unsigned int tuplen;
-	SortTuple	stup;
-	int			tupIndex;
-	int64		priorAvail,
-				spaceUsed;
 
 	if (!state->mergeactive[srcTape])
-		return;					/* tape's run is already exhausted */
+		return false;					/* tape's run is already exhausted */
 
-	/*
-	 * Manage per-tape availMem.  Only actually matters when batch memory not
-	 * in use.
-	 */
-	priorAvail = state->availMem;
-	state->availMem = state->mergeavailmem[srcTape];
-
-	/*
-	 * When batch memory is used if final on-the-fly merge, only mergeoverflow
-	 * test is relevant; otherwise, only LACKMEM() test is relevant.
-	 */
-	while ((state->mergeavailslots[srcTape] > 0 &&
-			state->mergeoverflow[srcTape] == NULL && !LACKMEM(state)) ||
-		   state->mergenext[srcTape] == 0)
+	/* read next tuple, if any */
+	if ((tuplen = getlen(state, srcTape, true)) == 0)
 	{
-		/* read next tuple, if any */
-		if ((tuplen = getlen(state, srcTape, true)) == 0)
-		{
-			state->mergeactive[srcTape] = false;
-			break;
-		}
-		READTUP(state, &stup, srcTape, tuplen);
-		/* find a free slot in memtuples[] for it */
-		tupIndex = state->mergefreelist;
-		if (tupIndex)
-			state->mergefreelist = state->memtuples[tupIndex].tupindex;
-		else
-		{
-			tupIndex = state->mergefirstfree++;
-			Assert(tupIndex < state->memtupsize);
-		}
-		state->mergeavailslots[srcTape]--;
-		/* store tuple, append to list for its tape */
-		stup.tupindex = 0;
-		state->memtuples[tupIndex] = stup;
-		if (state->mergelast[srcTape])
-			state->memtuples[state->mergelast[srcTape]].tupindex = tupIndex;
-		else
-			state->mergenext[srcTape] = tupIndex;
-		state->mergelast[srcTape] = tupIndex;
+		state->mergeactive[srcTape] = false;
+		return false;
 	}
-	/* update per-tape and global availmem counts */
-	spaceUsed = state->mergeavailmem[srcTape] - state->availMem;
-	state->mergeavailmem[srcTape] = state->availMem;
-	state->availMem = priorAvail - spaceUsed;
+	READTUP(state, stup, srcTape, tuplen);
+
+	return true;
 }
 
 /*
@@ -3438,15 +3065,6 @@ dumpbatch(Tuplesortstate *state, bool alltuples)
 		state->memtupcount--;
 	}
 
-	/*
-	 * Reset tuple memory.  We've freed all of the tuples that we previously
-	 * allocated.  It's important to avoid fragmentation when there is a stark
-	 * change in allocation patterns due to the use of batch memory.
-	 * Fragmentation due to AllocSetFree's bucketing by size class might be
-	 * particularly bad if this step wasn't taken.
-	 */
-	MemoryContextReset(state->tuplecontext);
-
 	markrunend(state, state->tp_tapenum[state->destTape]);
 	state->tp_runs[state->destTape]++;
 	state->tp_dummy[state->destTape]--; /* per Alg D step D2 */
@@ -3901,38 +3519,31 @@ markrunend(Tuplesortstate *state, int tapenum)
 }
 
 /*
- * Get memory for tuple from within READTUP() routine.  Allocate
- * memory and account for that, or consume from tape's batch
- * allocation.
+ * Get memory for tuple from within READTUP() routine.
  *
- * Memory returned here in the final on-the-fly merge case is recycled
- * from tape's batch allocation.  Otherwise, callers must pfree() or
- * reset tuple child memory context, and account for that with a
- * FREEMEM().  Currently, this only ever needs to happen in WRITETUP()
- * routines.
+ * We use next free buffer from the batch memory arena, or palloc() if
+ * the tuple is too large for that.
  */
 static void *
-readtup_alloc(Tuplesortstate *state, int tapenum, Size tuplen)
+readtup_alloc(Tuplesortstate *state, Size tuplen)
 {
-	if (state->batchUsed)
-	{
-		/*
-		 * No USEMEM() call, because during final on-the-fly merge accounting
-		 * is based on tape-private state. ("Overflow" allocations are
-		 * detected as an indication that a new round or preloading is
-		 * required. Preloading marks existing contents of tape's batch buffer
-		 * for reuse.)
-		 */
-		return mergebatchalloc(state, tapenum, tuplen);
-	}
+	MergeTupleBuffer *buf;
+
+	/*
+	 * We pre-allocate enough buffers in the arena that we should never run
+	 * out.
+	 */
+	Assert(state->freeBufferHead);
+
+	if (tuplen > MERGETUPLEBUFFER_SIZE || !state->freeBufferHead)
+		return MemoryContextAlloc(state->sortcontext, tuplen);
 	else
 	{
-		char	   *ret;
+		buf = state->freeBufferHead;
+		/* Reuse this buffer */
+		state->freeBufferHead = buf->nextfree;
 
-		/* Batch allocation yet to be performed */
-		ret = MemoryContextAlloc(state->tuplecontext, tuplen);
-		USEMEM(state, GetMemoryChunkSpace(ret));
-		return ret;
+		return buf;
 	}
 }
 
@@ -4101,8 +3712,11 @@ writetup_heap(Tuplesortstate *state, int tapenum, SortTuple *stup)
 		LogicalTapeWrite(state->tapeset, tapenum,
 						 (void *) &tuplen, sizeof(tuplen));
 
-	FREEMEM(state, GetMemoryChunkSpace(tuple));
-	heap_free_minimal_tuple(tuple);
+	if (!state->batchUsed)
+	{
+		FREEMEM(state, GetMemoryChunkSpace(tuple));
+		heap_free_minimal_tuple(tuple);
+	}
 }
 
 static void
@@ -4111,7 +3725,7 @@ readtup_heap(Tuplesortstate *state, SortTuple *stup,
 {
 	unsigned int tupbodylen = len - sizeof(int);
 	unsigned int tuplen = tupbodylen + MINIMAL_TUPLE_DATA_OFFSET;
-	MinimalTuple tuple = (MinimalTuple) readtup_alloc(state, tapenum, tuplen);
+	MinimalTuple tuple = (MinimalTuple) readtup_alloc(state, tuplen);
 	char	   *tupbody = (char *) tuple + MINIMAL_TUPLE_DATA_OFFSET;
 	HeapTupleData htup;
 
@@ -4132,12 +3746,6 @@ readtup_heap(Tuplesortstate *state, SortTuple *stup,
 								&stup->isnull1);
 }
 
-static void
-movetup_heap(void *dest, void *src, unsigned int len)
-{
-	memmove(dest, src, len);
-}
-
 /*
  * Routines specialized for the CLUSTER case (HeapTuple data, with
  * comparisons per a btree index definition)
@@ -4344,8 +3952,11 @@ writetup_cluster(Tuplesortstate *state, int tapenum, SortTuple *stup)
 		LogicalTapeWrite(state->tapeset, tapenum,
 						 &tuplen, sizeof(tuplen));
 
-	FREEMEM(state, GetMemoryChunkSpace(tuple));
-	heap_freetuple(tuple);
+	if (!state->batchUsed)
+	{
+		FREEMEM(state, GetMemoryChunkSpace(tuple));
+		heap_freetuple(tuple);
+	}
 }
 
 static void
@@ -4354,7 +3965,6 @@ readtup_cluster(Tuplesortstate *state, SortTuple *stup,
 {
 	unsigned int t_len = tuplen - sizeof(ItemPointerData) - sizeof(int);
 	HeapTuple	tuple = (HeapTuple) readtup_alloc(state,
-												  tapenum,
 												  t_len + HEAPTUPLESIZE);
 
 	/* Reconstruct the HeapTupleData header */
@@ -4379,19 +3989,6 @@ readtup_cluster(Tuplesortstate *state, SortTuple *stup,
 									&stup->isnull1);
 }
 
-static void
-movetup_cluster(void *dest, void *src, unsigned int len)
-{
-	HeapTuple	tuple;
-
-	memmove(dest, src, len);
-
-	/* Repoint the HeapTupleData header */
-	tuple = (HeapTuple) dest;
-	tuple->t_data = (HeapTupleHeader) ((char *) tuple + HEAPTUPLESIZE);
-}
-
-
 /*
  * Routines specialized for IndexTuple case
  *
@@ -4659,8 +4256,11 @@ writetup_index(Tuplesortstate *state, int tapenum, SortTuple *stup)
 		LogicalTapeWrite(state->tapeset, tapenum,
 						 (void *) &tuplen, sizeof(tuplen));
 
-	FREEMEM(state, GetMemoryChunkSpace(tuple));
-	pfree(tuple);
+	if (!state->batchUsed)
+	{
+		FREEMEM(state, GetMemoryChunkSpace(tuple));
+		pfree(tuple);
+	}
 }
 
 static void
@@ -4668,7 +4268,7 @@ readtup_index(Tuplesortstate *state, SortTuple *stup,
 			  int tapenum, unsigned int len)
 {
 	unsigned int tuplen = len - sizeof(unsigned int);
-	IndexTuple	tuple = (IndexTuple) readtup_alloc(state, tapenum, tuplen);
+	IndexTuple	tuple = (IndexTuple) readtup_alloc(state, tuplen);
 
 	LogicalTapeReadExact(state->tapeset, tapenum,
 						 tuple, tuplen);
@@ -4683,12 +4283,6 @@ readtup_index(Tuplesortstate *state, SortTuple *stup,
 								 &stup->isnull1);
 }
 
-static void
-movetup_index(void *dest, void *src, unsigned int len)
-{
-	memmove(dest, src, len);
-}
-
 /*
  * Routines specialized for DatumTuple case
  */
@@ -4755,7 +4349,7 @@ writetup_datum(Tuplesortstate *state, int tapenum, SortTuple *stup)
 		LogicalTapeWrite(state->tapeset, tapenum,
 						 (void *) &writtenlen, sizeof(writtenlen));
 
-	if (stup->tuple)
+	if (!state->batchUsed && stup->tuple)
 	{
 		FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
 		pfree(stup->tuple);
@@ -4785,7 +4379,7 @@ readtup_datum(Tuplesortstate *state, SortTuple *stup,
 	}
 	else
 	{
-		void	   *raddr = readtup_alloc(state, tapenum, tuplen);
+		void	   *raddr = readtup_alloc(state, tuplen);
 
 		LogicalTapeReadExact(state->tapeset, tapenum,
 							 raddr, tuplen);
@@ -4799,12 +4393,6 @@ readtup_datum(Tuplesortstate *state, SortTuple *stup,
 							 &tuplen, sizeof(tuplen));
 }
 
-static void
-movetup_datum(void *dest, void *src, unsigned int len)
-{
-	memmove(dest, src, len);
-}
-
 /*
  * Convenience routine to free a tuple previously loaded into sort memory
  */
diff --git a/src/include/utils/logtape.h b/src/include/utils/logtape.h
index fa1e992..03d0a6f 100644
--- a/src/include/utils/logtape.h
+++ b/src/include/utils/logtape.h
@@ -39,6 +39,7 @@ extern bool LogicalTapeSeek(LogicalTapeSet *lts, int tapenum,
 				long blocknum, int offset);
 extern void LogicalTapeTell(LogicalTapeSet *lts, int tapenum,
 				long *blocknum, int *offset);
+extern void LogicalTapeAssignReadBufferSize(LogicalTapeSet *lts, int tapenum, size_t bufsize);
 extern long LogicalTapeSetBlocks(LogicalTapeSet *lts);
 
 #endif   /* LOGTAPE_H */
-- 
2.9.3

#29Peter Geoghegan
pg@heroku.com
In reply to: Heikki Linnakangas (#28)
Re: Tuplesort merge pre-reading

On Wed, Sep 14, 2016 at 10:43 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Addressed all your comments one way or another, new patch attached. Comments
on some specific points below:

Cool. My response here is written under time pressure, which is not
ideal. I think it's still useful, though.

* You should probably point out that typically, access to batch memory
will still be sequential, despite your block-based scheme.

That's not true, the "buffers" in batch memory are not accessed
sequentially. When we pull the next tuple from a tape, we store it in the
next free buffer. Usually, that buffer was used to hold the previous tuple
that was returned from gettuple(), and was just added to the free list.

It's still quite cache-friendly, though, because we only need a small number
of slots (one for each tape).

That's kind of what I meant, I think -- it's more or less sequential.
Especially in the common case where there is only one merge pass.
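
To illustrate the free-list behaviour described above, here is a minimal sketch (not code from the patch). The names `Slot`, `Arena`, `arena_init`, `arena_get` and `arena_put`, and the sizes, are hypothetical; the only point is that a just-released buffer is the next one handed out, so a small set of slots stays hot in cache.

```c
#include <assert.h>
#include <stddef.h>

#define SLOT_SIZE 1024		/* hypothetical fixed slot size */
#define NSLOTS    8			/* hypothetical: e.g. one per tape, plus one */

/* A free slot stores the link to the next free slot in its own memory. */
typedef union Slot
{
	union Slot *nextfree;
	char		data[SLOT_SIZE];
} Slot;

typedef struct Arena
{
	Slot		slots[NSLOTS];
	Slot	   *freehead;
} Arena;

/* Thread all slots onto the free list. */
static void
arena_init(Arena *a)
{
	for (int i = 0; i < NSLOTS - 1; i++)
		a->slots[i].nextfree = &a->slots[i + 1];
	a->slots[NSLOTS - 1].nextfree = NULL;
	a->freehead = &a->slots[0];
}

/* Pop the slot at the head of the free list; NULL if none is free. */
static void *
arena_get(Arena *a)
{
	Slot	   *s = a->freehead;

	if (s != NULL)
		a->freehead = s->nextfree;
	return s;
}

/* Push a slot back; it becomes the next one handed out. */
static void
arena_put(Arena *a, void *p)
{
	Slot	   *s = (Slot *) p;

	assert(p != NULL);
	s->nextfree = a->freehead;
	a->freehead = s;
}
```

Calling `arena_put()` on the buffer of the previously returned tuple and then `arena_get()` for the next tuple yields the same address, which matches the "usually that buffer was used to hold the previous tuple" remark.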

True. I fixed that by putting a MaxAllocSize cap on the buffer size instead.
I doubt that doing > 1 GB of read-ahead of a single tape will do any good.

You may well be right about that, but ideally that could be verified.
I think that the tuplesort is entitled to have whatever memory the
user makes available, unless that's almost certainly useless. It
doesn't seem like our job to judge that it's always wrong to use extra
memory with only a small expected benefit. If it's actually a
microscopic expected benefit, or just as often negative to the sort
operation's performance, then I'd say it's okay to cap it at
MaxAllocSize. But it's not obvious to me that this is true; not yet,
anyway.
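
The trade-off under discussion can be made concrete with a small sketch. It assumes an even split of the remaining memory among the input tapes, capped at the same value as PostgreSQL's MaxAllocSize and rounded down to whole blocks. The function name, the rounding, and the one-block minimum are illustrative assumptions, not the patch's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ALLOC	((int64_t) 0x3fffffff)	/* same value as MaxAllocSize */
#define SKETCH_BLCKSZ		((int64_t) 8192)		/* default PostgreSQL block size */

/*
 * Hypothetical helper: size of each input tape's read buffer when
 * avail_mem bytes are shared evenly among ninput_tapes tapes.
 */
static size_t
per_tape_read_buffer(int64_t avail_mem, int ninput_tapes)
{
	int64_t		share;

	assert(ninput_tapes > 0);

	share = avail_mem / ninput_tapes;
	if (share > SKETCH_MAX_ALLOC)
		share = SKETCH_MAX_ALLOC;	/* the cap being debated */
	share -= share % SKETCH_BLCKSZ;	/* whole blocks only */
	if (share < SKETCH_BLCKSZ)
		share = SKETCH_BLCKSZ;		/* always at least one block */
	return (size_t) share;
}
```

With 70 MB and 7 tapes, each tape gets 10 MB. With 16 GB and 4 tapes, each buffer is capped just under 1 GB and roughly 12 GB stays unused, which is the case Peter would like to see measured before accepting the cap.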

Hmm. We don't really need the availMem accounting at all, after we have
started merging. There is nothing we can do to free memory if we run out,
and we use fairly little memory anyway. But yes, better safe than sorry. I
tried to clarify the comments on that.

It is true that we don't really care about the accounting at all. But,
the same applies to the existing grow_memtuples() case at the
beginning of merging. The point is, we *do* care about availMem, this
one last time. We must at least produce a sane (i.e. >= 0) number in
any calculation. (I think you understand this already -- just saying.)

OK. I solved that by calling LogicalTapeRewind(), when we're done reading a
tape. Rewinding a tape has the side-effect of freeing the buffer. I was
going to put that into mergereadnext(), but it turns out that it's tricky to
figure out if there are any more runs on the same tape, because we have the
"actual" tape number there, but the tp_runs is indexed by "logical" tape
number. So I put the rewind calls at the end of mergeruns(), and in
TSS_FINALMERGE processing, instead. It means that we don't free the buffers
quite as early as we could, but I think this is good enough.

That seems adequate.

Spotted another issue with this code just now. Shouldn't it be based
on state->tapeRange? You don't want the destTape to get memory, since
you don't use batch memory for tapes that are written to (and final
on-the-fly merges don't use their destTape at all).

Wait, you're using a local variable maxTapes here, which potentially
differs from state->maxTapes:

I changed that so that it does actually change state->maxTapes. I considered
having a separate numTapes field, that can be smaller than maxTapes, but we
don't need the original maxTapes value after that point anymore, so it
would've been just pro forma to track them separately. I hope the comment
now explains that better.

I still don't get why you're doing all of this within mergeruns() (the
beginning of when we start merging -- we merge all quicksorted runs),
rather than within beginmerge() (the beginning of one particular merge
pass, of which there are potentially more than one). As runs are
merged in a non-final merge pass, fewer tapes will remain active for
the next merge pass. It doesn't do to do all that up-front when we
have multiple merge passes, which will happen from time to time.

Correct me if I'm wrong, but I think that you're more skeptical of the
need for polyphase merge than I am. I at least see no reason to not
keep it around. I also think it has some value. It doesn't make this
optimization any harder, really.

Hmm, yes, using currentRun here is wrong. It needs to be "currentRun + 1",
because we need one more tape than there are runs, to hold the output.

As I said, I think it should be the number of active tapes, as you see
today within beginmerge() + mergebatch(). Why not do it that way? If
you don't get the distinction, see my remarks below on final merges
always using batch memory, even when there are to be multiple merge
passes (no reason to add that restriction here). More generally, I
really don't want to mess with the long standing definition of
maxTapes and things like that, because I see no advantage.

Ah, no, the "+ 1" comes from the need to hold the tuple that we last
returned to the caller in tuplesort_gettuple, until the next call. See
lastReturnedTuple. I tried to clarify the comments on that.

I see. I don't think that you need to do any of that. Just have the
sizing be based on an even share among activeTapes on final merges
(not necessarily final on-the-fly merges, per my 0002-* patch --
you've done that here, it looks like). However, it looks like you're
doing the wrong thing by only having the check at the top of
beginmerge() -- "if (state->Level == 1)". You're not going to use
batch memory for the final merge just because there happened to be
multiple merges, that way. Sure, you shouldn't be using batch memory
for non-final merges (due to their need for an uneven amount of memory
per active tape), which you're not doing, but the mere fact that you
had a non-final merge should not affect the final merge at all. Which
is the kind of thing I'm concerned about when I talk about
beginmerge() being the right place for all that new code (not
mergeruns()). Even if you can make it work, it fits a lot better to
put it in beginmerge() -- that's the existing flow of things, which
should be preserved. I don't think you need to examine "state->Level"
or "state->currentRun" at all.

Did you do it this way to make the "replacement selection best case"
stuff within mergeruns() catch on, so it does the right thing during
its TSS_SORTEDONTAPE processing?

Thanks
--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#30Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Peter Geoghegan (#29)
Re: Tuplesort merge pre-reading

On 09/15/2016 10:12 PM, Peter Geoghegan wrote:

On Wed, Sep 14, 2016 at 10:43 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Spotted another issue with this code just now. Shouldn't it be based
on state->tapeRange? You don't want the destTape to get memory, since
you don't use batch memory for tapes that are written to (and final
on-the-fly merges don't use their destTape at all).

Wait, you're using a local variable maxTapes here, which potentially
differs from state->maxTapes:

I changed that so that it does actually change state->maxTapes. I considered
having a separate numTapes field, that can be smaller than maxTapes, but we
don't need the original maxTapes value after that point anymore, so it
would've been just pro forma to track them separately. I hope the comment
now explains that better.

I still don't get why you're doing all of this within mergeruns() (the
beginning of when we start merging -- we merge all quicksorted runs),
rather than within beginmerge() (the beginning of one particular merge
pass, of which there are potentially more than one). As runs are
merged in a non-final merge pass, fewer tapes will remain active for
the next merge pass. It doesn't do to do all that up-front when we
have multiple merge passes, which will happen from time to time.

Now that the pre-reading is done in logtape.c, it doesn't stop at a run
boundary. For example, when we read the last 1 MB of the first run on a
tape, and we're using a 10 MB read buffer, we will merrily also read the
first 9 MB from the next run. You cannot un-read that data, even if the
tape is inactive in the next merge pass.
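
The arithmetic in that example can be written down as a small sketch. The function name and parameters are hypothetical; it only models a buffer refill that ignores run boundaries, as described above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model: when a read buffer of bufsize bytes is refilled and
 * only run_remaining bytes of the current run are left, how many bytes
 * belonging to later runs on the same tape get read as well?
 */
static size_t
readahead_past_run_end(size_t run_remaining, size_t bufsize,
					   size_t later_runs_bytes)
{
	size_t		spare;

	assert(bufsize > 0);

	if (run_remaining >= bufsize)
		return 0;				/* the buffer is filled by the current run */

	spare = bufsize - run_remaining;
	return (spare < later_runs_bytes) ? spare : later_runs_bytes;
}
```

For the case in the text, 1 MB left in the run and a 10 MB buffer, the result is 9 MB of the next run, provided the tape holds at least that much more data.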

I don't think it makes much difference in practice, because most merge
passes use all, or almost all, of the available tapes. BTW, I think the
polyphase algorithm prefers to do all the merges that don't use all
tapes upfront, so that the last final merge always uses all the tapes.
I'm not 100% sure about that, but that's my understanding of the
algorithm, and that's what I've seen in my testing.

Correct me if I'm wrong, but I think that you're more skeptical of the
need for polyphase merge than I am. I at least see no reason to not
keep it around. I also think it has some value. It doesn't make this
optimization any harder, really.

We certainly still need multi-pass merges.

BTW, does a 1-way merge make any sense? I was surprised to see this in
the log, even without this patch:

LOG: finished 1-way merge step: CPU 0.62s/7.22u sec elapsed 8.43 sec
STATEMENT: SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER
BY i) t;
LOG: finished 1-way merge step: CPU 0.62s/7.22u sec elapsed 8.43 sec
STATEMENT: SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER
BY i) t;
LOG: finished 1-way merge step: CPU 0.62s/7.22u sec elapsed 8.43 sec
STATEMENT: SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER
BY i) t;
LOG: finished 1-way merge step: CPU 0.62s/7.22u sec elapsed 8.43 sec
STATEMENT: SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER
BY i) t;
LOG: finished 3-way merge step: CPU 0.62s/7.23u sec elapsed 8.44 sec
STATEMENT: SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER
BY i) t;
LOG: finished 6-way merge step: CPU 0.62s/7.24u sec elapsed 8.44 sec
STATEMENT: SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER
BY i) t;
LOG: finished 6-way merge step: CPU 0.62s/7.24u sec elapsed 8.45 sec
STATEMENT: SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER
BY i) t;

- Heikki

#31Peter Geoghegan
pg@heroku.com
In reply to: Heikki Linnakangas (#30)
Re: Tuplesort merge pre-reading

On Thu, Sep 15, 2016 at 1:51 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I still don't get why you're doing all of this within mergeruns() (the
beginning of when we start merging -- we merge all quicksorted runs),
rather than within beginmerge() (the beginning of one particular merge
pass, of which there are potentially more than one). As runs are
merged in a non-final merge pass, fewer tapes will remain active for
the next merge pass. It doesn't do to do all that up-front when we
have multiple merge passes, which will happen from time to time.

Now that the pre-reading is done in logtape.c, it doesn't stop at a run
boundary. For example, when we read the last 1 MB of the first run on a
tape, and we're using a 10 MB read buffer, we will merrily also read the
first 9 MB from the next run. You cannot un-read that data, even if the tape
is inactive in the next merge pass.

I'm not sure that I like that approach. At the very least, it seems to
not be a good fit with the existing structure of things. I need to
think about it some more, and study how that plays out in practice.

BTW, does a 1-way merge make any sense?

Not really, no, but it's something that I've seen plenty of times.

This is seen when runs are distributed such that mergeonerun() only
finds one real run on all active tapes, with all other active tapes
having only dummy runs. In general, dummy runs are "logically merged"
(decremented) in preference to any real runs on the same tape (that's
the reason why they exist), so you end up with what we call a "1-way
merge" when you see one real one on one active tape only. You never
see something like "0-way merge" within trace_sort output, though,
because that case is optimized to be almost a no-op.

It's almost a no-op because, when it happens, mergeruns() itself
directly decrements the number of dummy runs once for each active
tape, completing that "logical merge" with only that simple change in
metadata (that is, the "merge" completes by just decrementing dummy
run counts -- no actual call to mergeonerun()).
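
The rule described above (dummy runs are consumed in preference to real runs) can be sketched as follows. This is a simplified model with hypothetical names, not the tuplesort.c code: it ignores the destination tape and only counts how many real runs take part in one merge step.

```c
#include <assert.h>

/*
 * Consume one run from each input tape for a single merge step.  A tape
 * that still has dummy runs gives up a dummy (nothing is read from it);
 * otherwise it contributes one real run.  Returns the number of real runs
 * involved, i.e. the "N" in "N-way merge".  A result of 0 corresponds to
 * the metadata-only case.
 */
static int
take_runs_for_step(int *real_runs, int *dummy_runs, int ntapes)
{
	int			nway = 0;

	for (int t = 0; t < ntapes; t++)
	{
		if (dummy_runs[t] > 0)
			dummy_runs[t]--;
		else
		{
			assert(real_runs[t] > 0);
			real_runs[t]--;
			nway++;
		}
	}
	return nway;
}
```

With three input tapes holding dummy counts {1, 1, 0}, the first step involves only the third tape's real run, which is what shows up as a "1-way merge" in the trace_sort output.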

We could optimize away "1-way merge" cases, perhaps, so that tuples
don't have to be spilt out one at a time (there could perhaps instead
be just some localized change to metadata, a bit like the all-dummy
case). That doesn't seem worth bothering with, especially with this
new approach of yours. I prefer to avoid special cases like that.

--
Peter Geoghegan

#32Peter Geoghegan
pg@heroku.com
In reply to: Heikki Linnakangas (#30)
Re: Tuplesort merge pre-reading

On Thu, Sep 15, 2016 at 1:51 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

BTW, does a 1-way merge make any sense? I was surprised to see this in the
log, even without this patch:

LOG: finished 1-way merge step: CPU 0.62s/7.22u sec elapsed 8.43 sec
STATEMENT: SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER BY
i) t;
LOG: finished 1-way merge step: CPU 0.62s/7.22u sec elapsed 8.43 sec
STATEMENT: SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER BY
i) t;
LOG: finished 1-way merge step: CPU 0.62s/7.22u sec elapsed 8.43 sec
STATEMENT: SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER BY
i) t;
LOG: finished 1-way merge step: CPU 0.62s/7.22u sec elapsed 8.43 sec
STATEMENT: SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER BY
i) t;
LOG: finished 3-way merge step: CPU 0.62s/7.23u sec elapsed 8.44 sec
STATEMENT: SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER BY
i) t;
LOG: finished 6-way merge step: CPU 0.62s/7.24u sec elapsed 8.44 sec
STATEMENT: SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER BY
i) t;
LOG: finished 6-way merge step: CPU 0.62s/7.24u sec elapsed 8.45 sec
STATEMENT: SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER BY
i) t;

Another thing that I think is worth pointing out here is that the
number of merge passes shown is excessive, practically speaking. I
suggested that we have something like checkpoint_warning for this last
year, which Greg Stark eventually got behind, but Robert Haas didn't
seem to like. Maybe this idea should be revisited. What do you think?

There is no neat explanation for why it's considered excessive to
checkpoint every 10 seconds, but not every 2 minutes. But, we warn
about the former case by default, and not the latter. It's hard to
know exactly where to draw the line, but that isn't a great reason to
not do it (maybe one extra merge pass is a good threshold -- that's
what I suggested once). I think that other database systems similarly
surface multiple merge passes. It's just inefficient to ever do
multiple merge passes, even if you're very frugal with memory.
Certainly, it's almost impossible to defend doing 3+ passes these
days.
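
As a rough picture of such a threshold, the sketch below estimates the number of passes for a plain balanced k-way merge and flags anything beyond one pass. That is an assumption made for simplicity: tuplesort's polyphase merge schedules runs differently, so real pass counts can differ. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Passes needed to reduce nruns sorted runs to one, merging up to fanin
 * runs at a time (balanced merge model, not polyphase).
 */
static int
estimated_merge_passes(long nruns, int fanin)
{
	int			passes = 0;

	assert(nruns >= 1);
	assert(fanin >= 2);

	while (nruns > 1)
	{
		nruns = (nruns + fanin - 1) / fanin;	/* ceiling division */
		passes++;
	}
	return passes;
}

/* Hypothetical policy: warn when more than one merge pass is expected. */
static bool
should_warn_about_passes(long nruns, int fanin)
{
	return estimated_merge_passes(nruns, fanin) > 1;
}
```

Under this model, 6 runs with a fan-in of 6 need a single pass, while 7 runs already need two, so the second case would trigger the warning.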

--
Peter Geoghegan

#33Peter Geoghegan
pg@heroku.com
In reply to: Heikki Linnakangas (#28)
Re: Tuplesort merge pre-reading

On Wed, Sep 14, 2016 at 10:43 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Addressed all your comments one way or another, new patch attached.

So, it's clear that this isn't ready today. As I mentioned, I'm going
away for a week now. I ask that you hold off on committing this until
I return, and have a better opportunity to review the performance
characteristics of this latest revision, for one thing.

Thanks
--
Peter Geoghegan

#34Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Peter Geoghegan (#33)
Re: Tuplesort merge pre-reading

On 09/17/2016 07:27 PM, Peter Geoghegan wrote:

On Wed, Sep 14, 2016 at 10:43 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Addressed all your comments one way or another, new patch attached.

So, it's clear that this isn't ready today. As I mentioned, I'm going
away for a week now. I ask that you hold off on committing this until
I return, and have a better opportunity to review the performance
characteristics of this latest revision, for one thing.

Ok. I'll probably read through it myself once more some time next week,
and also have a first look at your actual parallel sorting patch. Have a
good trip!

- Heikki

#35Peter Geoghegan
pg@heroku.com
In reply to: Heikki Linnakangas (#34)
Re: Tuplesort merge pre-reading

On Sat, Sep 17, 2016 at 9:41 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Ok. I'll probably read through it myself once more some time next week, and
also have a first look at your actual parallel sorting patch. Have a good
trip!

Thanks! It will be good to get away for a while.

I'd be delighted to recruit you to work on the parallel CREATE INDEX
patch. I've already explained how I think this preread patch of yours
works well with parallel tuplesort (as proposed) in particular. To
reiterate: while what you've come up with here is technically an
independent improvement to merging, it's much more valuable in the
overall context of parallel sort, where disproportionate wall clock
time is spent merging, and where multiple passes are the norm (one
pass in each worker, plus one big final pass in the leader process
alone -- logtape.c fragmentation becomes a real issue). The situation
is actually similar to the original batch memory preloading patch that
went into 9.6 (which your new patch supersedes), and the subsequently
committed quicksort for external sort patch (which my new patch
extends to work in parallel).

Because I think of your preload patch as a part of the overall
parallel CREATE INDEX effort, if that was the limit of your
involvement then I'd still think it fair to credit you as my
co-author. I hope it isn't the limit of your involvement, though,
because it seems likely that the final result will be better still if
you get involved with the big patch that formally introduces parallel
CREATE INDEX.

--
Peter Geoghegan

#36Claudio Freire
klaussfreire@gmail.com
In reply to: Claudio Freire (#12)
1 attachment(s)
Re: Tuplesort merge pre-reading

On Fri, Sep 9, 2016 at 9:51 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

On Fri, Sep 9, 2016 at 8:13 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Claudio, if you could also repeat the tests you ran on Peter's patch set on
the other thread, with these patches, that'd be nice. These patches are
effectively a replacement for
0002-Use-tuplesort-batch-memory-for-randomAccess-sorts.patch. And review
would be much appreciated too, of course.

Attached are new versions. Compared to last set, they contain a few comment
fixes, and a change to the 2nd patch to not allocate tape buffers for tapes
that were completely unused.

Will do so

Well, here they are, the results.

ODS format only (unless you've got issues opening the ODS).

The results seem all over the map. Some regressions seem significant
(both in the amount of performance lost and their significance, since
all 4 runs show a similar regression). The worst being "CREATE INDEX
ix_lotsofitext_zz2ijw ON lotsofitext (z, z2, i, j, w);" with 4GB
work_mem, which should be an in-memory sort, which makes it odd.

I will re-run it overnight just in case to confirm the outcome.

Attachments:

logtape_preload_timings.ods (application/vnd.oasis.opendocument.spreadsheet)
#37Claudio Freire
klaussfreire@gmail.com
In reply to: Claudio Freire (#36)
1 attachment(s)
Re: Tuplesort merge pre-reading

On Tue, Sep 20, 2016 at 3:34 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

On Fri, Sep 9, 2016 at 9:51 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

On Fri, Sep 9, 2016 at 8:13 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Claudio, if you could also repeat the tests you ran on Peter's patch set on
the other thread, with these patches, that'd be nice. These patches are
effectively a replacement for
0002-Use-tuplesort-batch-memory-for-randomAccess-sorts.patch. And review
would be much appreciated too, of course.

Attached are new versions. Compared to last set, they contain a few comment
fixes, and a change to the 2nd patch to not allocate tape buffers for tapes
that were completely unused.

Will do so

Well, here they are, the results.

ODS format only (unless you've got issues opening the ODS).

The results seem all over the map. Some regressions seem significant
(both in the amount of performance lost and their significance, since
all 4 runs show a similar regression). The worst being "CREATE INDEX
ix_lotsofitext_zz2ijw ON lotsofitext (z, z2, i, j, w);" with 4GB
work_mem, which should be an in-memory sort, which makes it odd.

I will re-run it overnight just in case to confirm the outcome.

A new run for "patched" gives better results; it seems there was some
kind of glitch in the earlier run (maybe some cron job decided to do
something while those queries were running).

Attached

In essence, it doesn't look like it's harmfully affecting CPU
efficiency. Results seem neutral on the CPU front.

Attachments:

logtape_preload_timings.ods (application/vnd.oasis.opendocument.spreadsheet)
PKL6I�l9�..mimetypeapplication/vnd.oasis.opendocument.spreadsheetPKL6IhC��'.'.Thumbnails/thumbnail.png�PNG


IHDR���KfPLTE###+++333<<<AAAKKKSSS\\\ccckkksss{{{����������������������������������������������������}�-|IDATx��}�b�H�m��������e�N�)���*�(�,Q$��)8(|��o��������,�������������}s^u~�E�en����w2���~U�87k.�OO�wE���z����m����I���]���A�=U�� >~W��1��SD��A�����6�K����A����A����;������}B^�������$��9�V��#�1���o{j���B���I�o�x
�}WG"�&��#J�]q>b#S���#NJ��|�u7��g.����o)����������Jk��C/�%�7�>��{����Gn����%�����|�dn��k-���65�4��&��n������IF��|��0��������x��B�^����.as!�����q�z�>��`��8��/����iy����>}�S�9�b���N�������L��H'xS���E8;���I)@� ���w��z��1�d�)Q��<?me���!��������8�����Y�;�\��I�����1���7����]V*�E�B<�dmY��;�v8�>���x��z�Z3�@�)��(�}�1�A���
�:c$7���>�9+��O��HW���Ro�Lq�b����8{u�Sd��B����mZ�`Tn0��oINz �c4��,4�p������*p����A���kjwUJY�jn�H���x�cS��|��j�zu8{5x[��'��x�NoEH�[�X��"��%-<��A������C�fa�������+t����J��m���'���e��G���))`R�Ok�W�"R*��.s��8�R7�/jV
��[�B(i/�"#
f����n$�YY$���P.$BD������������	�J�D�t�����W�qR1]R��=���d���-|��@�2j<���Uu���6,�B��V��o��L��z�?Y,����h�N�NB�q&7��L-�[�B���G%��m^��t���1p*��]��2��,����� j����d��y]:+Jd���"��m[����#�|�@�Y,��\�-5��:tD����*�	pQ_�4�i���C#b�QbI"(���[�C@���t���������sxG�6F����T�6��1�c�G��}�&�X�e'�lG?�u��5������"��k�#�>������T}�&f�^���9�{�w��Qc59�uR�+|�!�{��{m��-G/l���S[,&O��D�>��u�]�����I���W�����'��[ �
�c[hg�*��t��H���v�GE����2��2(��^
$��V\	��l,h�0Z�0`�n~v����C�E�(�^>��m��|�j1��Hp*j�~{OBoNR�if
��d�I'�g ��;��}�@�B���%�D
5c�����Z�R}�	���+[1�:S���\Vg�9x�����8f�fd�)#k�8�����!�ft���Y����'7[�e�]�Sw�N���v`li�aHk<nNp��X�N�^��ru�� �s���S>��oY5����i�+����u���D����G��!5f��G�>�BL"��Ad����D���������|�	t6�+����V"������,Ws��N��b!N����\{�
w������H��B��9��2I�-0P�����{/�
�>3���e�qyt����x��j)���?;�rPi[����c�GA�%}a'x#���56l�[��u�����B���8����e��
�nn��%�{���W�:���V���}T ��>T@�E��������^��+���
���K���J����s��H/�T��V�P�+���P��5Zf�����^�	��:`6h�W%#0�q�T��N�����r@��r�����L�ge�sU��`
�>F=��Uj�y��_�����S��E$�����
N+�����9������a�N�=Ld�|���9��Mn�/h�wE���(�G���
�����;�a _"��|����7�xW��Z�o
n���^=>�E�����'o�l���3����>;C�G��QL3��>9��]��q���6gYy:����V�*�����
CJ�P�^�������^�����25T��x�UN����}�wbq1��-��^���,�B����yZ���<�����i�I%�C����=
��(]�%��xy����p��l|�tQP<�a�$��xY}$��7OP�h�Q'�����tz��d���-IPo�����g��MO`�/�VH�6NH�67�-%�e��%i�*4�xu��4�66;
���;���N�"�v���r����7~���~'@�|�{�]����Qt:��5���_FyR�����N~/�jh����,x��x�Tst+*�ic�D��	+�6�e���n4�ctfH��O}��RM���Rw��#�GI���A��M���1�����mg�p�a[���i������tqD4�Jf�e�,x�~��O�x����=z�����q��
g�>�m��g������wx�8��v���C��6������?��#��Gxx:W�?�N��8��dk1����X}�� �d}������Q�t�������M�J�f&�a
�r�Z�I{���-�L�I�~`�_�d�3��{�#�8�V,��hQ-��z���������������;v�P^��V�9Wg��%'P��Y�T��n]E#�A��PZ�8��]%!7=���<�����o��U����+d-�T6Dwt8�� �t� ��]X=��<�����$%��
�������8M8�E�0�n��w��8�6��V�����0���hug�q(4��E���b���8G 1M$&
+�f�<��"���x9�����\m$m�Mk�1�O�����g������'�����w����%f���
���~�;-O��|�
���3��P>������4F�+��b������s�m��M|��go�hA�.?Y��DdHr���wze�����KT���f��6F��[�����$�����t�����M��[�����>����>2������>�u��Y}������5����
"'
�>�Pf��=��()�����B�\�Cy�1M�S$��Qg7�^�����4�������>����uLe2^c/�
6�+��4�T��c�X-By�t1L�#@)-�f���g�����a��Dj��>A����)f��U�O^��(
'F�d��j>n����}O�}H�m{di���_�h��1\������#����Pd�9��.l����o�����m�jFh��t���[5��f���{��{r��)�fZ�����O�cl��p�i_p���k��b5��5��1'��<O=�T��fN�w�/:���&�Q�rl��:�m�^��4��S��5���v��\�+��T}Q���U�l�|98jo;��p��h�
�d$��.�x���
w�-:�|�V�j�}���v���[u��5�wQ�L>��k�mF�>s�^;�V>�e����y������q9`�P�\��n:����xK�9��t���Oi���=�\��G���ak��#�(��_p�(�'�N�?x_�Wt����J���s��t��NO�s����<�9�j����(p�mY�:�oP���N�(����V.�
7���k�Sh 2ct������@�����Y"G�N���5&�r�"�C�|��>�x����KMA��5
#9Y��x��7�|[�����L�v{.:�����8Xt�D�F��	n���;����G�N���5fIed�'A�i-%.m�b��5p�7|F��1�w�x���7�|[3
K3M���#�N�������O��J+�D�u��n���t�k�}D�a$WF�1�+%3����*�7sK~���C.66�?@�����k�c��*����*NO�|�����p������M��U�Rj,>�1����U��R�uJN�G~Cc�����7{�yq������:��6`�I�!���\Y����x������`�6�l:������x�XZO�j�D�h��C����Fy
	<A��n\���n��'j��},�����3�l&��]�G�VV�h�S�i~��C~���YL.o��.�2�|
"�-mf<�� ���Z_`��'�[9"���}t�k������4()7^�;��jA����s�#|+�A�d��_`��4��N���M��w�Ps����L)5��t���V�_3Kq��\L�����%�"8\%���-�Np�J�w�����c8Y2��%��`��B�MM�J��%`9fo�x��,5�{�C�j��5��D!�W��<:��>#��o�������j���������:�pg�Z-
������7
=���=����c�Z�3�f��� ������P\�S���t���?�e�l��5
D3����s;�{�=k�3H80�r��c��poW��TI"r����d�[�#�d����|S���8j${�Q(��dZP���=�U�L%�ktv�V��}c���j�	k�5%q�b�����E#�s�b��g�H��.��KFG�U8Cr�H�|S��&�������E����t���<
C���`9�%-����>�	;�z�z�t���6sND��n�-]�
�9��E�9��X�����
7��s��l��
��j�	k�8�����:ho����������W�kHp)5��;�{����9��b�Z2"��|�����%T_+����Tr����X�*��\��8Im"yD����ARG��V�NX��&2 �@��R?��1����YV�Gw���]�9�kH"p��Hc�a_5��n�N9l�	c$T�����N638t�����5��t+�(��6Z���1���������A��,}���K���+�e�=$����l����8�������vB�m�;f�2-�����h3��5��HH��P����6
F���M/���cXZ-;a���!�"J�nOpq3��w�4�#T�Z�Z`�PD���F���!��j��n�N�4��	0@\�.w���6s����3�$�!�f�9��v[S��.	�������j��53�"�������fH����HL.�Fe�`�P�]h=��1���\�3'k�
Q"���]nx)'��r���R��tP���	%�'���1��\�����`o��*�NV�.m������o����m��Y���cW�9���C!EME�f��/�[�.�2(.����
6������R3x+�W4�J��B��s,�E��Ari3��iP[�Y�F!�K�{���U�
�i�(����������;w|Z���������	k&���q��������q�����k�dTX���7�N�j��^���R��(!�\�'dH��KN�����"���Bh ��4�k��T��b��o��?����jpE��3�Id�}\�B��%�]
M4(8��;�5
\j4��CH��&�i��l1!B��9`q;�1>�U����w����'��f�n��!���[�6!xc��c�U�/�����g\��y������Z&Z�s������?^��)v���v��8���L�1�����"3Q� �����������������n�������W�wo���~�����W/�;�!���y:�6
��4N���1��N����(
��<�����i5������<��X����*�Wo[s�!gu��U���J�I�$��U�l����9��9g��m3t��S�v(��1�����j��`�1b��FK���s����Km<ss�%���%��{��f���=lrJ$>����h>_ g���a;9�����c�)����m��
4uVIi
g��!i5��UMl8$d���t��������yH������nN���%i�:>����q�*#�Ut�l7Rg
��c����BR�!8h���f$-�4�o&i��!i���U�nP3^�#S�[.�����C���8��j�7#)qr����	>��T}<���3�&�^z�s������,�tN�N�K�lR��q������J6�8��C�j
�5�qK0Z��(]���������>*~�V��p3��d:����1��f#Z3R��,��,�%���xd�S����)J���#���K������(;��P�Z����4Q6f��(�R��H��t��)�P�'���qj�	nFY���iu>��1���$Z��'���*�R��tp�_�����4����s�����1��������N�'��V������O��p��U��yx�u43���kz�f�n�n@*]��w�c�Z�J����/�c(!���{�������1Cb�)���nN��O}����1D��%Z�YJ!Bd��ci�Y"j�����D%�eSc���,��L.��k3|Q�o�����XN�%jH)w\k�����u�K����M�3�(���������0���hE3�,\�����]����K��������j�l
;\jl�x���P��%^1�+'��ynF���W8����Z��O,g��b����=�G��<�����+J�5�&��}�K9��,�\v�������������6��6K�IK��,���0���x�	C��"��:.�B9B�&�
����C�7���I�'����� �~"t��!ju!�Rb,�AL��vm%���qp'�eD	�-�t<"�7#*��xj�{7���z�x��(�9����t�t��������D��X�����'�QY����'��V��r��$(q��_:����Zp'�`fP e��5�G���V�@��A+Qg��V��re`��7c��T����yO��v�~�#�����(�S����;�{�P�+��e�a�C��)�o�Qix�qT~���KodHDB.om[/3�(�TZm7����c�XMC�m*b�H�4���KWq��18�LO����v��i��rc'�������e���U�h\D1�JQ�g3Ob7C�]��F�<��vmf�Q(����k����1U�OZc�!����1,u�?�~%���qW��O{��m�Q���?c��!\u��n��p��=E:�kf����]��D�t�v�1	��V����C��h�!c5��S�r��X������%ct��TtW?�	�eP3�Q��U�MZc�� D��m����9h
��S�]��F���mW�f�Q(��Ph�6����Zj��9cA���^�������f�o��h���4�5#r��1|����ft�G��>����Yl���3���n���^�?`��Y�������jvIk�.@�!e	�5��t����r�O<�O����=����a��F.�V�P
���V�cC\��!eu��umu�sF(��:�_mg^��tW�K�x��|�����	$kb�8��C��uI+�@�(Db����aC�
v|Tb%�G���f:���[���cHY�0iUO�!xT�bz����MB�:x��Psv�Z�18l��W�C('��"k\/zFD�{���o�Q���O��!�2�������
&��[4�<�f���d���_L���yB�9�yC}�v4P���C��F����Vo3x���EV�_<gA���*��#	��H�]��� n����j����AN�76?����zd��#s�����!.]�������u��Fa$�X�N�1��>Y��HZ	�(A������
~�"�N��z<���1�X�@���FL�1���Ya�II=�_!��K���o�
�i�y�'��g|�72�7K@��E�|,_�chY�2���)z(����5z�"��f�F�Cw!�(�C�x}BZ��pt�J���c8W�0��!�gM>������fv����]
�L
��q�x-��
�}b��OHf�ZV���i
��:���Q�����]
�]��Ac���l3��o��Q�1�#k�0�B
���������.�;
Y������p��v��J������B�\�2y�_F�g�cZjq�$���?��x��/-�Fa���4t��K|�f��b��,KC&�9�"8��Fw>��w���f� �.�f��C�j��k��h(���r���S�8��X�]-�����(�Q�7�W��cW�0yM� &	r��E:�����gt����~����X�����W�~_���\���cJ�B�=��*J������Q83|U��&����D>�O������]��8V��D\��������kH�rg��h��c�W�4y��FD����$/u�:k�'���+��(%E�.��D�7�s��v�q���j5��U���`N)B_3��f6wv�4SJ&v����6
%$�W�,n;�~�k�W��AP�m�",e�g����K8b���Z)�0���4���4�f%n��8�
�C����+���P�%R�����]:�])�y"����=f�Uy��j[f�V;N^1�	�V�@���K���EE�������9�2z@'���y�[i�$�]h��:�G����S����$�$��"&K��K%�i��+��R������`�Pb"G���C@�\��f	�>s�Kc��+%���k�����RN��U�])W��c�g���;���!�p���a03zMC��O.J	$�A��w���o��ZI��p�s0��4�d��F����.�3+x�p��I����diR��Dm��������H�.6vF=��t2��Eh�=����t��'�����P�R5|V'��Iv�4���2q���kH$r�o\U��Q���YQ��#$��Ee��7�uW��T���[l��H%+�*�H���XM��14�J�5�rT&�O>,��]��n���J�U�Y���&N3������<���p��'HIR1H��5P8������N���q��mL&v��#zM��tM�\\BA8k=������7OB�������6
E�<-V���<���p�����H�������?���K9�</��6^"G��1}��F�0z���q�%T��i��1���������nW�B��������<�����c�Z��j%���c��a]�����:K/+V1�)=��R� w1.����Gbx[:��R�nj��e���k{��]'��{�<����r�B+-LK�Bk�B
����q��!]���;N,g9CN�����~3��]��pQ�6k���B����Q�1���Xc�aT�#��yi��������>,��)8�(p�Z�N�1����v��]R@��t�W������>����U3�j����s�V�ZV�
�q�P�LBr�]����^~��Si�����.����t���V���Y|�9�K����e3�>�U����w����'��N����/���{����_��������7��o��O��#u���.`������W�
���������A���������|������_��{M�k0�r�Wn��	9�Z��k:��Xf�f���8�L�uv��w�O�r��������p����������]�����w�7�"�\Hm3�&����R����!���V��	!�J(� =^���fL�;5���X�uG��U9��B��|���
C��X	k+>��9R�@���f��;5����q6��
���U�C��Z	���BM"��3�9����tH�������o2��B�I4Gm���P�z[B����Cie�c����g�����L��8o��>�D�./50z��#�*/�bj�(�>�����}.�m�K�D:��n��a���ST%�����	nE��);h��C�j�	kL1�)z�!{��eO��_`��St'�\Po����V�`H���H�G���1���f�P�_��6�3R_���N�q��!�
a���nN��\�}��C���	+���2��� �.�/q��C?���u5?r��H�� ��Vx�9�����C�j�	k�NA(�48�tN��-;a�+����S�p+�
��X�(�G��VSQX��F�_F� ��%h�����R;a��zt���s��6�v����!���5�U��(� �����|}�'3;_��J\fN4���k�ex�R��G|��c�Z
H���	��������8��s�3��iz/b��ac3��*�������3|M�	W��$y��SM����C�i�\4%,�����n�j@���u���14��$\�t�#z��9k4�u
M��(Ww�^�-�����p�p4qL�[7���14��$\cNR�f��}NK{��4}�w�v���
8i�,p�[���K�2�3�����p�A�D�YR�}�5����M�=���A����K�`�P���
���T�~Vw������!z��9r�������������j�l
������c�Y�I����A�����?�7�}��s�3��yV�hM`3~�����3x?�	W��S*7����E��/Cw~^J
d�����`�P|p1��[��1���#\c>��������13���s5'01Br�[o��p�����n����Ch��(���Uf�FL���L';Iw�~�������}�7%),
�8OR��!i��n��DRN^<
AZ:g^�i�&;M/j�����V^�f4�����>���g����e�u�(�3�tq��N���B��z���{����r}�[mO3|M��(��!� "�IR��:����S3���kY�e:\#'�M e�7zq+v��u(����H9�}H�!���Z�$T:����}w�����0JfG���f�U(�]l����1T��������D;�ga3���N��o���n1ez��OP���rcyx���[u��n�r)�8a���4��7�m���	(��\\2�=$a[)�g��F�=R�x��'���Er���,u�_���Ng�]��>�!��>���(r�CH�1A����tb�M�L�dH�c��z��&e��l���)��5�\�5x$|�U(98���W�c�X9���S�/����IF�������]��+�R���������a�A�:�lT}5��W$P���9k`����@��w[|������x�{���gRH,�)��L��TL�[���p����jWx��0:�n�Ck\0J6�u!"��M/��7C������u�;��t]����k���T��Uc�1�@�P��X}�JT�����9g��)��~�����~I��"�p�p'J�n�-���1���Z�,'$����w^�����2�\�������W���}����h���P*x����V�[B���������[?u��p�f����F?���������JX�c(Y0�fFDNR��������q��J8X~�����N3�((SB[F<�����]hM�$����|��5UM���+�S(!����X��VJH��yn�#|%�%��t�	��	�-���+�7J?��Kc!V�V5�VyK9���p��!]5�P;�!�#��k^3F����]�@s�!4����VB�"�m�V!��C8��#�k�]E��4k���R'��B74�n��%��Am�**�J��F����f��U7��% !f�����jL�����|��K�5v"��Fapp��9��C����5S�`�wD9���:,6,�xz�iz������H�I\�A��
s�����i��!e���*���VD�E%_"�v���7t��KR?���5KT�Q(����V���1���^�	&k�'D���M��Y���]�]��`"��K���L�V�fv��9�	>���5��z�d"�QL{{�������p�E�n���P�*�J]����
��j��5��B�!x({�=bX��n�w�>�s;e�����#�*_f��f��C�j��5��D�}�9/I�Re(�������Cq�q���
�K�#�f��Uw�i#!'*9�W���O��
k�?�������C��b�z��C����5���b��l��b'�%�v��I�.���!;��{L�Q(�yejl�8���.U�LZ����Y|�^�Lk\~	���o�����������������:�p��:R�����F���Vr���K��'��VSM�6�@���c��3����-5�_��J���ke�����Q&`�PB�q�:�z�V��m5��b���R���OS��������lY[���t���!�|t�u��>���j��Xm2k��I�2�����c3��F	S�)��S��5
�����f�z�N�1��n���m#��8D$�a��������
�v�|6���j���A�'�Y�ZX����|Q�'�q�hr�B�������������V���f,<��	kT��FZ�w�P��UNZa���S�r=��������^I����Jd�Z��g�v��$GK�&g*�|Q�5'����f���
!��~��Y���uWI#�L�a�H��&�����i�;�z���V�sBD"/IR�����l��9�T�)�i�$��_U��5r%'R���p)�z��#�&����[�G��V�NZ����"`t�{L�*x3���>(�Spe��1����*h�]n�l=���O�iGV�~"���{��A:4RKr��&����EP_n����#�J#�����;��1$��Y��	���02����	����i$L���-��}4���*g�2*5zv*x��kGV���<��S��(I/������u$#��m{�G��F�>�Z��sD�!iu��
�N����������"�Yw����'O��o�lT����7�o��_u�H�k'h�)k��oM��$���v�F>[��k����F
��s�a��C����~���S�����=iC;=�@Un�I'��"���B�����<���c�W
;�b�T��4���T�����Rz�}</��&�MZ�$OKm�&�l��>���m�3x'�7G��9e�w�	{KW��o{����;��=�X��6/ZA������ ���rd�!GRH)H�)-]
[�H
�9��E�iE0��F�lJ*3�'T�RV�tpT��Q�����"�_������"HK���*�P�0�Qh�>5A�p.W�M^���	KK�D�k���bk+����i���yl�!��F���g���1��.��j��$��%K�U�,��n����ano3A��@�8�f���\������1Dd������9�����B��dd?ey��
��dPm����C�j��k����OT���7�
�l��bB
S��(��b��������kfL!B��5R�}��7����o�����_��~��9�8���p[mu����]�6a��cV�/�����9��O;��o��FxO{�������7~�~����]W��]�fA�Q��p��]��!Q���g.Fv�����s��������n��u����W�wo���~�����W/�K��{B�fC�X���Z����*x1��%w;^�����i��Z�������f�z��|zb"ajn<<���A\����<������1�<������\������;5���1'���9����-��~L�	<���"�WM�B��j&���/��Ss��gT��=�F���V�L��q��A��>���U'$)��cob���a;1������Ci��x��7��>���/`��c(
��+z����@�G!�]�6���)�%������VM��[l4���3|EC�h�	T�}���I.-5!9�:�oh��N�g��'����m�8��(����K=�Qt���(V����$"�}��(=����.�;Ew�.r"3��|c����!��K�����1������=�)���K��q3SU;��P���;_WKj1Lg'����f
��4�n����1|���5��$��rN���X*L��x|'��6��4��m�v���Sp�:�n��!,W�����=�qW��{X�8�+�N�f�Je�}#a�p+�b���G���J���@yIB����T���u�k+]!�f����Np+���s�1??��V*a���HP�/{M=�@��s�6��t=/d�8�9����$�������>����t�){/(���k����$����Nl0�4:�9�<��h�1�vc�>��P�I���1sH�K`�9�y��oe?x��4���x���lEQ�Z�6�N8����`�1�})NC����/f~�����C	�>�Fo��h�d�������1$��$XaM�QKRF�Z'���i�}@�}��2bp7I�8���mB!��\h4J��1���$X�KbNYr�����D�_�����K�����3������:t��g5%�SR���$/{&W��7p���B~N��6vNP+n���R��G�<��p��`�	|F�@��K;�������y�v���
�l�O�K��G�<������`���S����x�>����%61z�A��N�����ki���oz����F�R��n�V��V���yd�b&MYe���p~�2���B�J��Bl��nE�H([�9������`���
jb�B��.B�v��4�M�������p35�Pv8��chZ�E���Q��D���*q��N�/�4sp����p+�r�r�X�|MCu�U��B�������p���S���+�c<gm�n�k����Q�q��^5���H��O^5��z�X*�B���E���iZQ��w7/}P��nq�~H�
�	%f���9��1L�������	�������x3���*��o�����_�I�p~�5bt���C�j�	+�>)D�%e���-��+�f��+�=c���A>��p�"���7^��1T����f\X�tR�W�������~`���}�:M�����(��}��C�j�	�V�I,���������f��+�S(_c��T�U(���5�!��1T�����������g"�s��

���o����w�P�0�a#�+v���&t�k��si����G��&;��Q��}O�}��tk�	k����2���:h�����������6���
��|���5��R�I�:�n�	��0e"{����L1������f��������^i�4L�Q�w[���I\���<�}X
.�mp�.QH�8�D~i��Y�%ra3�g�����s���u�~�'�JJ���#�(���H���3���cY]/��]���epo�}�,��������]/�P��5N���F���w*���3x������B�!�&����2:�L����S� �T��V��7�A���T�JV��yC�����C�j�H�o�Z�.���cp�*�l��J�`�<�u���k:�x/�%���;��v?w!|
!����o��w�[	a���Z��C����5�k�LvNA�,�^�������'rrYu���t��13x����L����bK������;����I���wG�YA(19
�a���du���422�����kH���tW���C!�����BI�q�/������a���]�(e���;��g��u������w����o4?W�M ��b��l���^���oL*�b"�.oLr�[���'��,�fOC/MCQ�o-�Z��nNi9#.�6a>���2V�L\��KN!w��{'�>�����>@�K�Uz3+��-����	�3x��3&�p�����s��{�St��5�>���D����p�p|(]~�q��>���6�t��1R��wH��k�x�4h��x$P�V{2�Vj4���1����������DS9`��=�2�n����������!���9GK;���
C�j��k�3Z
�����I���!m�C]����5
���k���c8W�1qE��Tvar�=���B-(��1f��Ua�Y��-�#�L}$M���G�RV�L\�;����3t�����~���:O4�l.
'�UQ(���y�3x��?&�� �1&�e�Q-|E����������$~����~���:�	��Q��r,��xq��IRv��b��c�Z5qE���GYj�x�[6t�pQ +Mw�<H�Z�vU��fcl�6�����j���K�m������h����t�����@+l+#Op�p�RFp6v?�������Vm ��3s�H�Vh����H"i}�[�V��>N*e��t������m��m�c������w\��-��U��TR|�������*&#Qp�q`�	>����Ck:�P�2&dJ= ��oG�u�2��~�?��B��|k��	;�|��C+|89	{a�R�|�����J>�Jf*�m���J%1d�b�m�|Q�7�Vxs@���AJo}����w���i������Gw7`�-������,�\+x��M�V4��rw?P������<�t���f��tR��|���<~U�r������):j���6�$��1���cHZm;�������&@����\���tY#�)�V#]Z��i�l
8���8�	<�~��Ck����T�S����^�]�]"M�	�������:s�p�pXDJ���D�����������Z�������f�����HP&*�a�u��5�h�K��@f��U��p�@�RT��v������o��g+#%&r���sD)dInQd*��BR��^a��>k���1sW���]��s`WH������c�lJ����lt��_5���R*�)y��vZ}r�K���/��)�����Op+�����jK���1,��^��!�"C$�]�,�.��<	jA���J"A\l����1���^1Y
5E�#1�l�b|9y^ w]\KI�b�����#�H�jn�$�<������t�B*���,��;GK��B�
-?��;��=�8M�h�a1A���.46�9@�p�sxE����dL������^����J��\�����[�3�*���^[�Z�cY]8��y����Z&0�m���m+��U�B~���8mq���.����c8W�7��s�LG���1A������P�2��dd_�3�&�l%��*��:x�!eu��
��e�F����/c�t��JB'�se&��j���
C����5��@<
���9��$^��������\#+�*�.-]�}�<����iR��F)�iVZ��.��<K���^~��+s�����5Scn�}<A�p�Zh���7��)'Y���+j�z�?������
���������_�-���J����������_�����?���u`�����|����M��_n������������o~�~����>Bo_���WR}|}|�����G�N����xz��^����w��������c4�6��5/��.7}K�3�3�s���������rv������p�����O?�Y?�F.���7���w��P~��������/�%<\_?�����YS<���������|���=��{���bO����TeL��?�����yw���u�{�����/�I��/�w�
�����@o�~6�8�@��_G$�%�U_���_K ���?�O��y��r�_���p�����w�S��~<���@�s�1g,#�:�����-��9�gc.�4�cDY2���9:�B}��1{�s�*)t�j2����
��1��L,t�+������������)'���{���~a��Q���@
�%1\j4���3�>�������{�b���������")g��`��|�����	���%{];�{7r�-�.�M?�����]�i"��/�����|��\K��s����4?�.�E������ g�%��0���@��2�����I_��x��&J|Q�A�%������aw����k�5M�v�7��~���#i����}��]:�e"f��1������
�T�&�C�uQBij*��m��
�{4��mb��)w�u�'���<�*Cj���c��^�����+ZB6!�O�X��Y�~���LQz>����	!����rrh��4yP��8����K���2)R Yj?�\����&ZA��s�8	2^-UePkzQ}]���%�c���<l�|��t�����r�������C���ki�[��'�A�E����KT��C]��:[��?�����#b�=�r���ra	��M&���D`��>Q�F��!y�zV��_�y{il�C������[�l��Ja�G����L��a�|;;�x���c�=��&Y��i�LB�����.$2���7t�_�fXe&hh��=��N�y����DKfY�T��|
�|�6#�O����Q��<jE
��P�i�T����z6tA���X9y�!2�.�R[X��u�9r^���6����#�,��]LK]w>R6�����"��~:O�)�P��
�4������z��4o��}����E����tn�������+i��,��n���!_�6z��D'��jI��Er��)q���k�>
��c4B�s��7��B}];O�m�=Y�t����������$e��Z�%���d)-Dd��r��o�:�����y����z����-{��TG���u<me����YS>��5F�7�}aIT�HQW�}K����������F�?��c�$���L?��\�K�X�&�R����`_^��X��v�^��������J	�z�W��3�s�q��L��)Z�Q��*�o}�c�X�EY���`]�A�u��2
��h��i��i1b�2�������I��"R���3-..��|�+���D�G��C���r
��5�.c��g����5�������DM���51��0}���ka*���#K��(����'Q�K��Xo;��Q�AQBC�x�"qJ������P�6�+��������Z��d���$_�-xil��
����}G-�����X7(�����o�8�5��Z�"P�GZ���qma��mA{���$^!R�wP�p=���x�I�M�C
"]S�!�`�`N���	b,�;Ic|������9AW���_l|w�G
����8w�>R�k�\R��Cq=��R��[�GR.����c��K�O-��_,e#sTa��.b����>dZT����y�k��b��mDv���By�}��+{������di �I4#������'0���)���~��	s��x-�����!H��'�������9C�4dmA��7�H�����c`��������.��A(�r��=O�oEA4�����KbnQ�r��C��
4�0�'�rX��/z��c����c�d�k}�[f�f�1
t�H��1�K7���(��F]NS2�r�-��!	�[X�1%_��L���?�������\�Kc�.~�DJ��K�=��	z���"]�E�G��7/u=K�\ "K�8;VA6���K���EoOh��Vlr\bP"����b���M:r�!L��s�Zo1>o_���k.?�P!L�� B2��UbS]S�%-v��O��f�n�S&�����u�����
�h�V���#��5 H$��E�6)�;�����G���k{9�����fA��!d��^l����~�%_	=�sf���2����c�=���G��"`_�c���L�}�,���Gs?^��������_>���?_���_�|�]���\�tw�����o��^�]�������Lu�'i���d����<���P�N��5g������gr�L�19��}q�c�o��O�g�t��G�$�ojb���7��1�����$�d��Auu�`���T����S���%�;�����}�`R>@Oh�zZw|�"�O����Y��}�v����A���5���\|O�!�lp���c�P�O{G�Ke�*L�i���F%���������=���t�/�*E�X���}Y�o,��uXI2�j�M��{�p��e-x v�81w�BH�M�.�]��o]^Om�U#}HaiZ��Y�m[42��s���y��<6}�T%�@�:������g)�!��T�� 4��.K_�F�+�H��-�+���]j�l�Ki�<u��L��1�
EB)'�}����h	��HR���c.}(��!
)�K��Y��z���m���i����]��
�1)%%k�����f�����%,�A��E���2�,y,=���4;4������C����y{"���eh����|�&�r��L7����7Qb�H����Y�w��y�>��z.��CB����Ga�5�[���
���e�����uJ#������.+����h�OZ&��c�X�BS�n�Jr�!h}����Csi�vv������es����}@��������vK������.�g��k��{>�cJS,�8J�D�g�L���������a^���w�1��[�{vB��{�H t
��������V�O�����s_�eo������*+��WP&�_<"%@Y�K��kO4�Le���6��=��r�zu�J �>�{�����-O�(���'�da���T��6�u	7��-��T	K/����|s1�2�03u^���rHf�>@������2�XT&���M=�kOa,�">iz��s�VN�m/�O���j!���.��O��Qk�T���z��b+[!{�>���N���Z���5sF���F�SD���+���i�<"K���,2���+����MO��_K�9�|�A����FNO��F�4�����`�f����fkY{�������#�j�%�X��h<���-G���W ��R��[�H��O��vpO
�x���'-6����P�c�Q�����HS?����	Z�2a�[ ���1�J�l�i!�m�>�F�M�UM�zZLM���x!Q$�����WB����Q�������VW��5���~O`u-��5e)
{��|�d���R�Oi��Pw��z��U�t��E��F�<��D)��#(��7����SI����mkk����	�}��,9c*��z�<&	�~4��T��Q:����[TE����U�Szm7��%3eU�NJ���RH�� ��C�i)4��4�l���z�`]��'�!e	�]�tKZ`[v����c(>���Iz��`��Xi<��o}�n�\>S�@�%1H}986���[C�"���~��on>�_x���������������;�eb���}�c��&���N�xdMs�\F ����e�&���P���[q����t]a���{rg�����1j��r�g��u���H-@���]e����)��W�I}�������R�5G��
���``�
)���1�8�X�����n��=����R�t��kl����!�zW��1����B�Rt�wW&���V��B�O~u����6$f,(�U@�&�IQ�~'����S�W-49!��C�w����yro�n*]�J���1|�[�V��(r������oR����)o���k�cF	��Nw�
��C ��Z�VX���2�����V��
��Mg������7������/������d.#��/��]Os�|��k�_��?���A1?��A1[��n���c~����8����G=O9bP������!@W����0�}~����8h��-��g�\�g!�k�x��������7�^�t{������}�������u����_�z�����������������_n�������7{�������|�����w�o����y�����p����S�_���L���������^�������O
������:�������w����O�����{��/~������S�}u�~.Q)��?�_y������o����������N��O�����R��k����{v������W��O�WF}��V|��oQ�C?�r�?��o�W+>HX�����
-;j�~�yw}���?���7�>^���w7������������2�U?�������x�����dNC}���>������������������W��0�Q%��o����0�<�w7ee*O�.������_>�������7w?��7]��/������/��:�����~������������I��W7���_�{�������?�����;|�)��|B���|�?~d7~�F����/��|�o�����o:}���y��?���u�K�C�T_�����n��^(�y��i���K�z����w��_���<��';�>���MO���~���w�a�0���G���S�:?d�;|���}-
���O^M���S�������~��?����/����9�fNO��/L?a?M�]�������&����^;�E�b�u/�/�Z6��/�w��]3�N��������/����x-3n������7�������C)�����������~����8�O�<Gd��z�h��:���4�'������):;0���~����g����s}S�y���>�7��~M?{��&Rf����W��_c\_����z������������~�E��7������
��7?�t�~,/�Q@����U��B���Y�d��{q�G�^<:�{[���V/��o[>����=�2	����Iv/v���h���-�D�7dh*�
_��ma�4{�����8�������"��������vJ$.YF�,?�-_��D����E��B,3����s���f�7�\F���]���p(�G��|�Ie���2�h�*f��?[f���vJ�-�}q�����l�*��{nX�d�,���+;�h���l��d3������n��L�\m_���S�u��}zu3��W�K�W7��B���Mc�S���M�;A�^��}��!�^�L<�W7}��6Ij7[�W��&�^�T���6M-!,_�4�`���CmCk���\�-����i��6v�U�5�W7}���5���=L�n��l�q���|04�W��(�i��$?�W7\	����������������X(Sa�^>3������bh���J��{h�`i�g��v[����)���F��������M��h�G��w�
�`���e�=���zu��vMrl�@r~���_$y��$��0���JO������w�o�~�������}���woo�����{}��������PKyN�����PKL6I�l9�..mimetypePKL6IhC��'.'.TThumbnails/thumbnail.pngPKL6I!�m�@�%�.settings.xmlPKL6I�@�$
+4styles.xmlPKL6I���b�|;meta.xmlPKL6I��h��7=manifest.rdfPKL6Iv>Configurations2/toolpanel/PKL6I'�>Configurations2/accelerator/current.xmlPKL6I?Configurations2/images/Bitmaps/PKL6IB?Configurations2/popupmenu/PKL6Iz?Configurations2/statusbar/PKL6I�?Configurations2/progressbar/PKL6I�?Configurations2/toolbar/PKL6I"@Configurations2/menubar/PKL6IX@Configurations2/floater/PKL6IS*��!E�@META-INF/manifest.xmlPKL6IyN������Acontent.xmlPKp*�
#38Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Claudio Freire (#37)
Re: Tuplesort merge pre-reading

On 09/22/2016 03:40 AM, Claudio Freire wrote:

On Tue, Sep 20, 2016 at 3:34 PM, Claudio Freire <klaussfreire@gmail.com> wrote:

The results seem all over the map. Some regressions seem significant
(both in the amount of performance lost and their significance, since
all 4 runs show a similar regression). The worst being "CREATE INDEX
ix_lotsofitext_zz2ijw ON lotsofitext (z, z2, i, j, w);" with 4GB
work_mem, which should be an in-memory sort, which makes it odd.

I will re-run it overnight just in case to confirm the outcome.

A new run for "patched" gives better results, it seems it was some
kind of glitch in the run (maybe some cron decided to do something
while running those queries).

Attached

In essence, it doesn't look like it's harmfully affecting CPU
efficiency. Results seem neutral on the CPU front.

Looking at the spreadsheet, there is a 40% slowdown in the "slow"
"CREATE INDEX ix_lotsofitext_zz2ijw ON lotsofitext (z, z2, i, j, w);"
test with 4GB of work_mem. I can't reproduce that on my laptop, though.
Got any clue what's going on there?

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#39Claudio Freire
klaussfreire@gmail.com
In reply to: Heikki Linnakangas (#38)
Re: Tuplesort merge pre-reading

On Thu, Sep 22, 2016 at 4:17 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 09/22/2016 03:40 AM, Claudio Freire wrote:

On Tue, Sep 20, 2016 at 3:34 PM, Claudio Freire <klaussfreire@gmail.com>
wrote:

The results seem all over the map. Some regressions seem significant
(both in the amount of performance lost and their significance, since
all 4 runs show a similar regression). The worst being "CREATE INDEX
ix_lotsofitext_zz2ijw ON lotsofitext (z, z2, i, j, w);" with 4GB
work_mem, which should be an in-memory sort, which makes it odd.

I will re-run it overnight just in case to confirm the outcome.

A new run for "patched" gives better results, it seems it was some
kind of glitch in the run (maybe some cron decided to do something
while running those queries).

Attached

In essence, it doesn't look like it's harmfully affecting CPU
efficiency. Results seem neutral on the CPU front.

Looking at the spreadsheet, there is a 40% slowdown in the "slow" "CREATE
INDEX ix_lotsofitext_zz2ijw ON lotsofitext (z, z2, i, j, w);" test with 4GB
of work_mem. I can't reproduce that on my laptop, though. Got any clue
what's going on there?

It's not present in other runs, so I think it's a fluke the
spreadsheet isn't filtering out. Especially considering that one
should be a fully in-memory fast sort and thus unaffected by the
current patch (z and z2 being integers, IIRC, most comparisons should
be about comparing the first columns and thus rarely involve the big
strings).

I'll try to confirm that's the case though.


#40Peter Geoghegan
pg@heroku.com
In reply to: Heikki Linnakangas (#30)
Re: Tuplesort merge pre-reading

On Thu, Sep 15, 2016 at 9:51 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I still don't get why you're doing all of this within mergeruns() (the
beginning of when we start merging -- we merge all quicksorted runs),
rather than within beginmerge() (the beginning of one particular merge
pass, of which there are potentially more than one). As runs are
merged in a non-final merge pass, fewer tapes will remain active for
the next merge pass. It doesn't do to do all that up-front when we
have multiple merge passes, which will happen from time to time.

Now that the pre-reading is done in logtape.c, it doesn't stop at a run
boundary. For example, when we read the last 1 MB of the first run on a
tape, and we're using a 10 MB read buffer, we will merrily also read the
first 9 MB from the next run. You cannot un-read that data, even if the tape
is inactive in the next merge pass.

I've had a chance to think about this some more. I did a flying review
of the same revision that I talk about here, but realized some more
things. First, I will do a recap.

I don't think it makes much difference in practice, because most merge
passes use all, or almost all, of the available tapes. BTW, I think the
polyphase algorithm prefers to do all the merges that don't use all tapes
upfront, so that the last final merge always uses all the tapes. I'm not
100% sure about that, but that's my understanding of the algorithm, and
that's what I've seen in my testing.

Not sure that I understand. I agree that each merge pass tends to use
roughly the same number of tapes, but the distribution of real runs on
tapes is quite unbalanced in earlier merge passes (due to dummy runs).
It looks like you're always using batch memory, even for non-final
merges. Won't that fail to be in balance much of the time because of
the lopsided distribution of runs? Tapes have an uneven amount of real
data in earlier merge passes.

FWIW, polyphase merge actually doesn't distribute runs based on the
Fibonacci sequence (at least in Postgres). It uses a generalization of
the Fibonacci numbers [1] -- *which* generalization varies according
to merge order/maxTapes. IIRC, with Knuth's P == 6 (i.e. merge order
== 6), it's the "hexanacci" sequence.
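
To make the "generalization of the Fibonacci numbers" concrete, here is a small illustrative sketch of the n-step Fibonacci sequence, where each term is the sum of the previous `order` terms. It is not taken from tuplesort.c; the function name, the MAX_ORDER limit and the seeding convention (order - 1 zeros followed by a 1) are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_ORDER 16            /* arbitrary limit for this sketch */

/*
 * Return the k-th term (0-based) of the n-step Fibonacci sequence of the
 * given order.  order == 2 yields the ordinary Fibonacci numbers,
 * order == 6 the "hexanacci" numbers.  Illustration only, not PostgreSQL code.
 */
uint64_t
nstep_fibonacci(int order, int k)
{
    uint64_t    window[MAX_ORDER] = {0};
    int         i;
    int         j;

    assert(order >= 2 && order <= MAX_ORDER && k >= 0);

    /* Seed: order - 1 zeros followed by a single 1 */
    window[order - 1] = 1;
    if (k < order)
        return window[k];

    for (i = order; i <= k; i++)
    {
        uint64_t    sum = 0;

        /* Next term is the sum of the last "order" terms */
        for (j = 0; j < order; j++)
            sum += window[j];

        /* Slide the window forward by one term */
        for (j = 0; j < order - 1; j++)
            window[j] = window[j + 1];
        window[order - 1] = sum;
    }
    return window[order - 1];
}
```

With order 6 the non-zero terms run 1, 1, 2, 4, 8, 16, 32, 63, 125, and so on: they double until the window is full of non-zero terms, after which growth falls slightly short of doubling.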

The following code is from your latest version, posted on 2016-09-14,
within the high-level mergeruns() function (the function that takes
quicksorted runs, and produces final output following 1 or more merge
passes):

+   usedBlocks = 0;
+   for (tapenum = 0; tapenum < state->maxTapes; tapenum++)
+   {
+       int64       numBlocks = blocksPerTape + (tapenum < remainder ? 1 : 0);
+
+       if (numBlocks > MaxAllocSize / BLCKSZ)
+           numBlocks = MaxAllocSize / BLCKSZ;
+       LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
+                                       numBlocks * BLCKSZ);
+       usedBlocks += numBlocks;
+   }
+   USEMEM(state, usedBlocks * BLCKSZ);

I'm basically repeating myself here, but: I think it's incorrect that
LogicalTapeAssignReadBufferSize() is called so indiscriminately (more
generally, it is questionable that it is called in such a high level
routine, rather than the start of a specific merge pass -- I said so a
couple of times already).
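
For readers following along, the arithmetic in the quoted loop can be tried in isolation. The sketch below is a standalone model, not the patch itself: it assumes that blocksPerTape and remainder are the quotient and remainder of the available blocks divided by the number of tapes (the thread does not show how they are computed), and DEMO_BLCKSZ and DEMO_MAX_ALLOC are stand-ins for BLCKSZ and MaxAllocSize.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_BLCKSZ     8192        /* stand-in for BLCKSZ */
#define DEMO_MAX_ALLOC  0x3fffffff  /* stand-in for MaxAllocSize */

/*
 * Split avail_blocks among ntapes read buffers.  The first "remainder"
 * tapes get one extra block, and every buffer is capped so that it stays
 * below the allocation limit.  Per-tape counts go into blocks_out[];
 * the return value is the total number of blocks handed out.
 */
int64_t
split_read_buffers(int64_t avail_blocks, int ntapes, int64_t *blocks_out)
{
    int64_t     per_tape;
    int64_t     remainder;
    int64_t     used = 0;
    int         tapenum;

    assert(ntapes > 0 && avail_blocks >= 0);

    /* Assumption: plain quotient/remainder split of the available blocks */
    per_tape = avail_blocks / ntapes;
    remainder = avail_blocks % ntapes;

    for (tapenum = 0; tapenum < ntapes; tapenum++)
    {
        int64_t     n = per_tape + (tapenum < remainder ? 1 : 0);

        /* Cap each buffer at the largest whole number of blocks allowed */
        if (n > DEMO_MAX_ALLOC / DEMO_BLCKSZ)
            n = DEMO_MAX_ALLOC / DEMO_BLCKSZ;
        blocks_out[tapenum] = n;
        used += n;
    }
    return used;
}
```

Note that this split is over all tapes, whether or not they take part in a given merge pass, which is the property being questioned here.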

It's weird that you change the meaning of maxTapes by reassigning in
advance of iterating through maxTapes like this. I should point out
again that the master branch mergebatch() function does roughly the
same thing as you're doing here as follows:

static void
mergebatch(Tuplesortstate *state, int64 spacePerTape)
{
    int         srcTape;

    Assert(state->activeTapes > 0);
    Assert(state->tuples);

    /*
     * For the purposes of tuplesort's memory accounting, the batch allocation
     * is special, and regular memory accounting through USEMEM() calls is
     * abandoned (see mergeprereadone()).
     */
    for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
    {
        char       *mergetuples;

        if (!state->mergeactive[srcTape])
            continue;

        /* Allocate buffer for each active tape */
        mergetuples = MemoryContextAllocHuge(state->tuplecontext,
                                             spacePerTape);

        /* Initialize state for tape */
        state->mergetuples[srcTape] = mergetuples;
        state->mergecurrent[srcTape] = mergetuples;
        state->mergetail[srcTape] = mergetuples;
        state->mergeoverflow[srcTape] = NULL;
    }

    state->batchUsed = true;
    state->spacePerTape = spacePerTape;
}

Notably, this function:

* Works without altering the meaning of maxTapes. state->maxTapes is
Knuth's T, which is established very early and doesn't change with
polyphase merge (same applies to state->tapeRange). What does change
across merge passes is the number of *active* tapes. I don't think
it's necessary to change that in any way. I find it very confusing.
(Also, that you're using state->currentRun here at all seems bad, for
more or less the same reason -- that's the number of quicksorted
runs.)

* Does allocation for the final merge (that's the only point that it's
called), and so is not based on the number of active tapes that happen
to be in play when merging begins at a high level (i.e., when
mergeruns() is first called). Many tapes will be totally inactive by
the final merge, so this seems completely necessary for multiple merge
pass cases.
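
The difference between sizing buffers by maxTapes and by the number of active tapes can be shown with a toy calculation. This is a hypothetical helper written for this discussion, not code from master or from the patch; how spacePerTape is actually derived is not shown in the thread, so the simple division below is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: divide avail_mem evenly among the tapes that are
 * active in the current merge pass.  The number of active tapes is stored
 * in *nactive; the return value is the per-tape budget (0 if no tape is
 * active).
 */
int64_t
space_per_active_tape(const bool *mergeactive, int max_tapes,
                      int64_t avail_mem, int *nactive)
{
    int         active = 0;
    int         tapenum;

    assert(max_tapes > 0 && avail_mem >= 0);

    /* Only tapes flagged as active take a share of the memory */
    for (tapenum = 0; tapenum < max_tapes; tapenum++)
    {
        if (mergeactive[tapenum])
            active++;
    }

    *nactive = active;
    return (active > 0) ? avail_mem / active : 0;
}
```

With 7 tapes of which 3 are active and 6000 bytes available, each active tape gets 2000 bytes, whereas dividing by all 7 tapes would give each only 857 bytes and leave the inactive tapes' share unused.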

End of recap. Here is some new information:

I was previously confused on this last point, because I thought that
logtape.c might be able to do something smart to recycle memory that
is bound per-tape by calls to LogicalTapeAssignReadBufferSize() in
your patch. But it doesn't: all the recycling stuff only happens for
the much smaller buffers that are juggled and reused to pass tuples
back to caller within tuplesort_gettuple_common(), etc -- not the
logtape.c managed buffers. So, AFAICT there is no justification that I
can see for not adopting these notable properties of mergebatch() for
some analogous point in this patch. Actually, you should probably not
get rid of mergebatch(), but instead call
LogicalTapeAssignReadBufferSize() there. This change would mean that
the big per-tape memory allocations would happen in the same place as
before -- you'd just be asking logtape.c to do it for you, instead of
allocating directly.

Perhaps the practical consequences of not doing something closer to
mergebatch() are debatable, but I suspect that there is little point
in actually debating it. I think you might as well do it that way. No?

[1]: http://mathworld.wolfram.com/Fibonaccin-StepNumber.html
--
Peter Geoghegan


#41Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Peter Geoghegan (#40)
Re: Tuplesort merge pre-reading

On 09/28/2016 06:05 PM, Peter Geoghegan wrote:

On Thu, Sep 15, 2016 at 9:51 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I don't think it makes much difference in practice, because most merge
passes use all, or almost all, of the available tapes. BTW, I think the
polyphase algorithm prefers to do all the merges that don't use all tapes
upfront, so that the last final merge always uses all the tapes. I'm not
100% sure about that, but that's my understanding of the algorithm, and
that's what I've seen in my testing.

Not sure that I understand. I agree that each merge pass tends to use
roughly the same number of tapes, but the distribution of real runs on
tapes is quite unbalanced in earlier merge passes (due to dummy runs).
It looks like you're always using batch memory, even for non-final
merges. Won't that fail to be in balance much of the time because of
the lopsided distribution of runs? Tapes have an uneven amount of real
data in earlier merge passes.

How does the distribution of the runs on the tapes matter?

+   usedBlocks = 0;
+   for (tapenum = 0; tapenum < state->maxTapes; tapenum++)
+   {
+       int64       numBlocks = blocksPerTape + (tapenum < remainder ? 1 : 0);
+
+       if (numBlocks > MaxAllocSize / BLCKSZ)
+           numBlocks = MaxAllocSize / BLCKSZ;
+       LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
+                                       numBlocks * BLCKSZ);
+       usedBlocks += numBlocks;
+   }
+   USEMEM(state, usedBlocks * BLCKSZ);

I'm basically repeating myself here, but: I think it's incorrect that
LogicalTapeAssignReadBufferSize() is called so indiscriminately (more
generally, it is questionable that it is called in such a high level
routine, rather than the start of a specific merge pass -- I said so a
couple of times already).

You can't release the tape buffer at the end of a pass, because the
buffer of a tape will already be filled with data from the next run on
the same tape.
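
A toy illustration of that point, for anyone following along. This is a minimal sketch and not logtape.c code: ToyTape, toy_fill_buffer() and the tiny block size are made up for the example. The only thing it shows is that a multi-block read buffer is refilled without regard to run boundaries, so by the time one run has been consumed, the buffer may already hold the start of the next run on that tape.

```c
#include <assert.h>
#include <string.h>

#define TOY_BLOCK_SIZE	4		/* hypothetical tiny block size, in bytes */
#define TOY_BUF_BLOCKS	3		/* the read buffer spans three blocks */

typedef struct ToyTape
{
	const char *data;			/* tape contents; each byte names its run */
	size_t		len;			/* total bytes on the tape */
	size_t		next;			/* offset of the next unread byte on tape */
	char		buf[TOY_BLOCK_SIZE * TOY_BUF_BLOCKS];
	size_t		nbytes;			/* valid bytes currently in buf */
} ToyTape;

/*
 * Refill the buffer with as many bytes as fit.  Like a multi-block tape
 * buffer, this knows nothing about where one run ends and the next begins.
 * Returns the number of bytes loaded, 0 at end of tape.
 */
static size_t
toy_fill_buffer(ToyTape *tape)
{
	size_t		left;
	size_t		n;

	assert(tape->next <= tape->len);
	left = tape->len - tape->next;
	n = (left < sizeof(tape->buf)) ? left : sizeof(tape->buf);

	memcpy(tape->buf, tape->data + tape->next, n);
	tape->next += n;
	tape->nbytes = n;
	return n;
}
```

With a tape holding an 8-byte run followed by a 4-byte run, a single fill loads both, so the last block in the buffer belongs to the second run before the first merge step on this tape has finished.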

- Heikki

#42Peter Geoghegan
pg@heroku.com
In reply to: Heikki Linnakangas (#41)
Re: Tuplesort merge pre-reading

On Wed, Sep 28, 2016 at 5:04 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Not sure that I understand. I agree that each merge pass tends to use
roughly the same number of tapes, but the distribution of real runs on
tapes is quite unbalanced in earlier merge passes (due to dummy runs).
It looks like you're always using batch memory, even for non-final
merges. Won't that fail to be in balance much of the time because of
the lopsided distribution of runs? Tapes have an uneven amount of real
data in earlier merge passes.

How does the distribution of the runs on the tapes matter?

The exact details are not really relevant to this discussion (I think
it's confusing that we simply say "Target Fibonacci run counts",
FWIW), but the simple fact that it can be quite uneven is.
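
To give a feel for how uneven the target counts get, here is a minimal sketch of the recurrence from Knuth's Algorithm D (vol. 3, section 5.4.2). The function name polyphase_targets() is made up for illustration and this is not the tuplesort.c code; it also only computes the ideal per-tape run counts and does not model dummy runs at all.

```c
#include <assert.h>

/*
 * Compute the "perfect" polyphase run distribution for the given level.
 *
 * Level 1 is one run on each input tape.  Each further level applies
 *     a = A[0];  A[j] = a + A[j + 1]
 * with the entry past the last input tape taken as 0.
 *
 * 'target' must have room for 'ninput' entries.
 */
static void
polyphase_targets(int ninput, int level, long *target)
{
	int			j;
	int			l;

	assert(ninput >= 1 && level >= 1);

	for (j = 0; j < ninput; j++)
		target[j] = 1;

	for (l = 2; l <= level; l++)
	{
		long		a = target[0];

		/* ascending order, so target[j + 1] is still the old value */
		for (j = 0; j < ninput; j++)
			target[j] = a + ((j + 1 < ninput) ? target[j + 1] : 0);
	}
}
```

For three input tapes this gives (1,1,1), (2,2,1), (4,3,2), (7,6,4) for levels 1 to 4, so the tape with the most runs has nearly twice as many as the one with the fewest.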

This is why I never pursued batch memory for non-final merges. Isn't
that what you're doing here? You're pretty much always setting
"state->batchUsed = true".

I'm basically repeating myself here, but: I think it's incorrect that
LogicalTapeAssignReadBufferSize() is called so indiscriminately (more
generally, it is questionable that it is called in such a high level
routine, rather than the start of a specific merge pass -- I said so a
couple of times already).

You can't release the tape buffer at the end of a pass, because the buffer
of a tape will already be filled with data from the next run on the same
tape.

Okay, but can't you just not use batch memory for non-final merges,
per my initial approach? That seems far cleaner.

--
Peter Geoghegan

#43Peter Geoghegan
pg@heroku.com
In reply to: Peter Geoghegan (#42)
Re: Tuplesort merge pre-reading

On Wed, Sep 28, 2016 at 5:11 PM, Peter Geoghegan <pg@heroku.com> wrote:

This is why I never pursued batch memory for non-final merges. Isn't
that what you're doing here? You're pretty much always setting
"state->batchUsed = true".

Wait. I guess you feel you have to, since it wouldn't be okay to use
almost no memory per tape on non-final merges.

You're able to throw out so much code here in large part because you
give almost all memory over to logtape.c (e.g., you don't manage each
tape's share of "slots" anymore -- better to give everything to
logtape.c). So, with your patch, you would actually only have one
caller tuple in memory at once per tape for a merge that doesn't use
batch memory (if you made it so that a merge *could* avoid the use of
batch memory, as I suggest).

In summary, under your scheme, the "batchUsed" variable contains a
tautological value, since you cannot sensibly not use batch memory.
(This is even true with !state->tuples callers).

Do I have that right? If so, this seems rather awkward. Hmm.

--
Peter Geoghegan

#44Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Peter Geoghegan (#42)
Re: Tuplesort merge pre-reading

On 09/28/2016 07:11 PM, Peter Geoghegan wrote:

On Wed, Sep 28, 2016 at 5:04 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Not sure that I understand. I agree that each merge pass tends to use
roughly the same number of tapes, but the distribution of real runs on
tapes is quite unbalanced in earlier merge passes (due to dummy runs).
It looks like you're always using batch memory, even for non-final
merges. Won't that fail to be in balance much of the time because of
the lopsided distribution of runs? Tapes have an uneven amount of real
data in earlier merge passes.

How does the distribution of the runs on the tapes matter?

The exact details are not really relevant to this discussion (I think
it's confusing that we simply say "Target Fibonacci run counts",
FWIW), but the simple fact that it can be quite uneven is.

Well, I claim that the uneven distribution of runs does not matter.
Can you explain why you think it does?

This is why I never pursued batch memory for non-final merges. Isn't
that what you're doing here? You're pretty much always setting
"state->batchUsed = true".

Yep. As the patch stands, we wouldn't really need batchUsed, as we know
that it's always true when merging, and false otherwise. But I kept it,
as it seems like that might not always be true - we might use batch
memory when building the initial runs, for example - and because it
seems nice to have an explicit flag for it, for readability and
debugging purposes.

I'm basically repeating myself here, but: I think it's incorrect that
LogicalTapeAssignReadBufferSize() is called so indiscriminately (more
generally, it is questionable that it is called in such a high level
routine, rather than the start of a specific merge pass -- I said so a
couple of times already).

You can't release the tape buffer at the end of a pass, because the buffer
of a tape will already be filled with data from the next run on the same
tape.

Okay, but can't you just not use batch memory for non-final merges,
per my initial approach? That seems far cleaner.

Why? I don't see why the final merge should behave differently from the
non-final ones.

- Heikki

#45Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Peter Geoghegan (#43)
Re: Tuplesort merge pre-reading

On 09/28/2016 07:20 PM, Peter Geoghegan wrote:

On Wed, Sep 28, 2016 at 5:11 PM, Peter Geoghegan <pg@heroku.com> wrote:

This is why I never pursued batch memory for non-final merges. Isn't
that what you're doing here? You're pretty much always setting
"state->batchUsed = true".

Wait. I guess you feel you have to, since it wouldn't be okay to use
almost no memory per tape on non-final merges.

You're able to throw out so much code here in large part because you
give almost all memory over to logtape.c (e.g., you don't manage each
tape's share of "slots" anymore -- better to give everything to
logtape.c). So, with your patch, you would actually only have one
caller tuple in memory at once per tape for a merge that doesn't use
batch memory (if you made it so that a merge *could* avoid the use of
batch memory, as I suggest).

Correct.

In summary, under your scheme, the "batchUsed" variable contains a
tautological value, since you cannot sensibly not use batch memory.
(This is even true with !state->tuples callers).

I suppose.

Do I have that right? If so, this seems rather awkward. Hmm.

How is it awkward?

- Heikki

#46Peter Geoghegan
pg@heroku.com
In reply to: Heikki Linnakangas (#45)
Re: Tuplesort merge pre-reading

On Thu, Sep 29, 2016 at 10:49 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Do I have that right? If so, this seems rather awkward. Hmm.

How is it awkward?

Maybe that was the wrong choice of words. What I mean is that it seems
somewhat unprincipled to give over an equal share of memory to each
active-at-least-once tape, regardless of how much that matters in
practice. One tape could have several runs in memory at once, while
another only has a fraction of a single much larger run. Maybe this is
just the first time that I find myself on the *other* side of a
discussion about an algorithm that seems brute-force compared to what
it might replace, but is actually better overall. :-)

Now, that could be something that I just need to get over. In any
case, I still think:

* Variables like maxTapes have a meaning that is directly traceable
back to Knuth's description of polyphase merge. I don't think that you
should do anything to them, on general principle.

* Everything or almost everything that you've added to mergeruns()
should probably be in its own dedicated function. This function can
have a comment where you acknowledge that it's not perfectly okay that
you claim memory per-tape, but it's simpler and faster overall.

* I think you should be looking at the number of active tapes, and not
state->Level or state->currentRun in this new function. Actually,
maybe this wouldn't be the exact definition of an active tape that we
establish at the beginning of beginmerge() (this considers tapes with
dummy runs to be inactive for that merge), but it would look similar.
The general concern I have here is that you shouldn't rely on
state->Level or state->currentRun from a distance for the purposes of
determining which tapes need some batch memory. This is also where you
might say something like: "we don't bother with shifting memory around
tapes for each merge step, even though that would be fairer. That's
why we don't use the beginmerge() definition of activeTapes --
instead, we use our own broader definition of the number of active
tapes that doesn't exclude dummy-run-tapes with real runs, making the
allocation reasonably sensible for every merge pass".

* The "batchUsed" terminology really isn't working here, AFAICT. For
one thing, you have two separate areas where caller tuples might
reside: The small per-tape buffers (sized MERGETUPLEBUFFER_SIZE per
tape), and the logtape.c controlled preread buffers (sized up to
MaxAllocSize per tape). Which of these two things is batch memory? I
think it might just be the first one, but KiBs of memory do not
suggest "batch" to me. Isn't that really more like what you might call
double buffer memory, used not to save overhead from palloc (having
many palloc headers in memory), but rather to recycle memory
efficiently? So, these two things should have two new names of their
own, I think, and neither should be called "batch memory" IMV. I see
several assertions remain here and there that were written with my
original definition of batch memory in mind -- assertions like:

case TSS_SORTEDONTAPE:
Assert(forward || state->randomAccess);
Assert(!state->batchUsed);

(Isn't state->batchUsed always true now?)

And:

case TSS_FINALMERGE:
Assert(forward);
Assert(state->batchUsed || !state->tuples);

(Isn't state->tuples only really of interest to datum-tuple-case
routines, now that you've made everything use large logtape.c preread
buffers?)

* Is it really necessary for !state->tuples cases to do that
MERGETUPLEBUFFER_SIZE-based allocation? In other words, what need is
there for pass-by-value datum cases to have this scratch/double buffer
memory at all, since the value is returned to caller by-value, not
by-reference? This is related to the problem of it not being entirely
clear what batch memory now is, I think.

--
Peter Geoghegan

#47Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#46)
Re: Tuplesort merge pre-reading

On Thu, Sep 29, 2016 at 6:52 AM, Peter Geoghegan <pg@heroku.com> wrote:

How is it awkward?

Maybe that was the wrong choice of words. What I mean is that it seems
somewhat unprincipled to give over an equal share of memory to each
active-at-least-once tape, ...

I don't get it. If the memory is being used for prereading, then the
point is just to avoid doing many small I/Os instead of one big I/O,
and that goal will be accomplished whether the memory is equally
distributed or not; indeed, it's likely to be accomplished BETTER if
the memory is equally distributed than if it isn't.

I can imagine that there might be a situation in which it makes sense
to give a bigger tape more resources than a smaller one; for example,
if one were going to divide N tapes across K worker processes and
make individual workers or groups of workers responsible for sorting
particular tapes, one would of course want to divide up the data
across the available processes relatively evenly, rather than having
some workers responsible for only a small amount of data and others
for a very large amount of data. But such considerations are
irrelevant here.
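
For concreteness, the equal split being discussed can be sketched as below. This is only an illustration modelled on the loop quoted earlier in the thread: divide_read_buffers() is a hypothetical name, and SKETCH_BLCKSZ and SKETCH_MAX_ALLOC are stand-ins assumed to match BLCKSZ and MaxAllocSize. Each tape gets the same number of whole blocks, the first few tapes absorb the remainder, and every buffer is kept between one block and the allocation cap.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_BLCKSZ		8192					/* assumed block size */
#define SKETCH_MAX_ALLOC	((size_t) 0x3fffffff)	/* assumed per-buffer cap */

/*
 * Split 'avail_bytes' into per-tape read buffer sizes, in whole blocks.
 * Stores each tape's buffer size in buf_bytes[] and returns the total
 * number of bytes handed out.
 */
static size_t
divide_read_buffers(size_t avail_bytes, int ntapes, size_t *buf_bytes)
{
	size_t		avail_blocks;
	size_t		per_tape;
	size_t		remainder;
	size_t		max_blocks = SKETCH_MAX_ALLOC / SKETCH_BLCKSZ;
	size_t		used = 0;
	int			i;

	assert(ntapes > 0);

	avail_blocks = avail_bytes / SKETCH_BLCKSZ;
	per_tape = avail_blocks / (size_t) ntapes;
	remainder = avail_blocks % (size_t) ntapes;

	for (i = 0; i < ntapes; i++)
	{
		/* the first 'remainder' tapes get one extra block */
		size_t		nblocks = per_tape + (((size_t) i < remainder) ? 1 : 0);

		if (nblocks < 1)
			nblocks = 1;		/* a tape always needs at least one block */
		if (nblocks > max_blocks)
			nblocks = max_blocks;

		buf_bytes[i] = nblocks * SKETCH_BLCKSZ;
		used += buf_bytes[i];
	}
	return used;
}
```

With ten blocks of memory and three tapes, the tapes get 4, 3 and 3 blocks. Nothing in the split depends on how many runs, real or dummy, each tape holds.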

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#48Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Peter Geoghegan (#46)
1 attachment(s)
Re: Tuplesort merge pre-reading

On 09/29/2016 01:52 PM, Peter Geoghegan wrote:

* Variables like maxTapes have a meaning that is directly traceable
back to Knuth's description of polyphase merge. I don't think that you
should do anything to them, on general principle.

Ok. I still think that changing maxTapes would make sense, when there
are fewer runs than tapes, but that is actually orthogonal to the rest
of the patch, so let's discuss that separately. I've changed the patch
to not do that.

* Everything or almost everything that you've added to mergeruns()
should probably be in its own dedicated function. This function can
have a comment where you acknowledge that it's not perfectly okay that
you claim memory per-tape, but it's simpler and faster overall.

Ok.

* I think you should be looking at the number of active tapes, and not
state->Level or state->currentRun in this new function. Actually,
maybe this wouldn't be the exact definition of an active tape that we
establish at the beginning of beginmerge() (this considers tapes with
dummy runs to be inactive for that merge), but it would look similar.
The general concern I have here is that you shouldn't rely on
state->Level or state->currentRun from a distance for the purposes of
determining which tapes need some batch memory. This is also where you
might say something like: "we don't bother with shifting memory around
tapes for each merge step, even though that would be fairer. That's
why we don't use the beginmerge() definition of activeTapes --
instead, we use our own broader definition of the number of active
tapes that doesn't exclude dummy-run-tapes with real runs, making the
allocation reasonably sensible for every merge pass".

I'm not sure I understood what your concern was, but please have a look
at this new version and see whether the comments in initTapeBuffers()
explain that well enough.

* The "batchUsed" terminology really isn't working here, AFAICT. For
one thing, you have two separate areas where caller tuples might
reside: The small per-tape buffers (sized MERGETUPLEBUFFER_SIZE per
tape), and the logtape.c controlled preread buffers (sized up to
MaxAllocSize per tape). Which of these two things is batch memory? I
think it might just be the first one, but KiBs of memory do not
suggest "batch" to me. Isn't that really more like what you might call
double buffer memory, used not to save overhead from palloc (having
many palloc headers in memory), but rather to recycle memory
efficiently? So, these two things should have two new names of their
own, I think, and neither should be called "batch memory" IMV. I see
several assertions remain here and there that were written with my
original definition of batch memory in mind -- assertions like:

Ok. I replaced "batch" terminology with "slab allocator" and "slab
slots"; I hope this is less confusing. This isn't exactly like e.g. the
slab allocator in the Linux kernel, as it has a fixed number of slots,
so perhaps an "object pool" might be more accurate. But I like "slab"
because it's not used elsewhere in the system. I also didn't use the
term "cache" for the "slots", as might be typical for slab allocators,
because "cache" is such an overloaded term.
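
As a rough picture of the "fixed number of slots" idea, here is a minimal standalone sketch. It is not the code in the patch: the Sketch* names and slab_* functions are made up, it uses malloc()/free() in place of palloc()/pfree(), and the fallback to the general allocator for oversized tuples or an exhausted arena is an assumption about the intended behaviour. Free slots are chained through the slots themselves, which is why the slot is a union.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_SLOT_SIZE 1024	/* mirrors SLAB_SLOT_SIZE in the patch */

typedef union SketchSlot
{
	union SketchSlot *nextfree; /* valid only while on the free list */
	char		buffer[SKETCH_SLOT_SIZE];
} SketchSlot;

typedef struct SketchSlab
{
	SketchSlot *slots;			/* arena of nslots contiguous slots */
	size_t		nslots;
	SketchSlot *freelist;		/* head of the free list, or NULL */
} SketchSlab;

/* Set up the arena; nslots == 0 gives a dummy arena.  Returns 0 on OOM. */
static int
slab_init(SketchSlab *slab, size_t nslots)
{
	size_t		i;

	slab->slots = NULL;
	slab->nslots = nslots;
	slab->freelist = NULL;
	if (nslots == 0)
		return 1;
	slab->slots = malloc(nslots * sizeof(SketchSlot));
	if (slab->slots == NULL)
		return 0;
	for (i = 0; i < nslots; i++)
	{
		slab->slots[i].nextfree = slab->freelist;
		slab->freelist = &slab->slots[i];
	}
	return 1;
}

/* Does ptr point into the arena? */
static int
slab_owns(const SketchSlab *slab, const void *ptr)
{
	uintptr_t	p = (uintptr_t) ptr;
	uintptr_t	start = (uintptr_t) slab->slots;
	uintptr_t	end = start + slab->nslots * sizeof(SketchSlot);

	return slab->nslots > 0 && p >= start && p < end;
}

/* Hand out a slot if the request fits and one is free, else malloc. */
static void *
slab_alloc(SketchSlab *slab, size_t size)
{
	assert(size > 0);
	if (size <= SKETCH_SLOT_SIZE && slab->freelist != NULL)
	{
		SketchSlot *slot = slab->freelist;

		slab->freelist = slot->nextfree;
		return slot;
	}
	return malloc(size);
}

/* Return a slot to the free list, or free() a fallback allocation. */
static void
slab_free(SketchSlab *slab, void *ptr)
{
	if (slab_owns(slab, ptr))
	{
		SketchSlot *slot = (SketchSlot *) ptr;

		slot->nextfree = slab->freelist;
		slab->freelist = slot;
	}
	else
		free(ptr);
}
```

A 0-slot arena simply sends every request to the fallback path, which is one way to keep a single code path for callers that never need slots.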

case TSS_SORTEDONTAPE:
Assert(forward || state->randomAccess);
Assert(!state->batchUsed);

(Isn't state->batchUsed always true now?)

Good catch. It wasn't, because mergeruns() set batchUsed only after
checking for the TSS_SORTEDONTAPE case, even though it set up the batch
memory arena before it. So if replacement selection was able to produce
a single run, batchUsed was false. Fixed: the slab allocator (as it's now
called) is now always used in the TSS_SORTEDONTAPE case.

And:

case TSS_FINALMERGE:
Assert(forward);
Assert(state->batchUsed || !state->tuples);

(Isn't state->tuples only really of interest to datum-tuple-case
routines, now that you've made everything use large logtape.c preread
buffers?)

Yeah, fixed.

* Is it really necessary for !state->tuples cases to do that
MERGETUPLEBUFFER_SIZE-based allocation? In other words, what need is
there for pass-by-value datum cases to have this scratch/double buffer
memory at all, since the value is returned to caller by-value, not
by-reference? This is related to the problem of it not being entirely
clear what batch memory now is, I think.

True, fixed. I still set slabAllocatorUsed (was batchUsed), but the slab
is initialized as a dummy 0-slot arena when !state->tuples.

Here's a new patch version, addressing the points you made. Please have
a look!

- Heikki

Attachments:

0001-Change-the-way-pre-reading-in-external-sort-s-merge-3.patch (text/x-patch)
From a958cea32550825aa0ea487f58ac87c2c3620cda Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 14 Sep 2016 17:29:11 +0300
Subject: [PATCH 1/1] Change the way pre-reading in external sort's merge phase
 works.

Don't pre-read tuples into SortTuple slots during merge. Instead, use the
memory for larger read buffers in logtape.c. We're doing the same number
of READTUP() calls either way, but managing the pre-read SortTuple slots
is much more complicated. Also, the on-tape representation is more compact
than SortTuples, so we can fit more pre-read tuples into the same amount
of memory this way. And we have better cache-locality, when we use just a
small number of SortTuple slots.

Now that we only hold one tuple from each tape in the SortTuple slots, we
can greatly simplify the "batch memory" management. We now maintain a
small set of fixed-sized slots, to hold the tuples, and fall back to
palloc() for larger tuples. We use this method during all merge phases,
not just the final merge, and also when randomAccess is requested, and
also in the TSS_SORTEDONTAPE case. In other words, it's used whenever we do
an external sort.

Reviewed by Peter Geoghegan and Claudio Freire.
---
 src/backend/utils/sort/logtape.c   |  153 ++++-
 src/backend/utils/sort/tuplesort.c | 1208 +++++++++++++-----------------------
 src/include/utils/logtape.h        |    1 +
 3 files changed, 565 insertions(+), 797 deletions(-)

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 7745207..4152da1 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -52,12 +52,17 @@
  * not clear this helps much, but it can't hurt.  (XXX perhaps a LIFO
  * policy for free blocks would be better?)
  *
+ * To further make the I/Os more sequential, we can use a larger buffer
+ * when reading, and read multiple blocks from the same tape in one go,
+ * whenever the buffer becomes empty. LogicalTapeAssignReadBufferSize()
+ * can be used to set the size of the read buffer.
+ *
  * To support the above policy of writing to the lowest free block,
  * ltsGetFreeBlock sorts the list of free block numbers into decreasing
  * order each time it is asked for a block and the list isn't currently
  * sorted.  This is an efficient way to handle it because we expect cycles
  * of releasing many blocks followed by re-using many blocks, due to
- * tuplesort.c's "preread" behavior.
+ * the larger read buffer.
  *
  * Since all the bookkeeping and buffer memory is allocated with palloc(),
  * and the underlying file(s) are made with OpenTemporaryFile, all resources
@@ -79,6 +84,7 @@
 
 #include "storage/buffile.h"
 #include "utils/logtape.h"
+#include "utils/memutils.h"
 
 /*
  * Block indexes are "long"s, so we can fit this many per indirect block.
@@ -131,9 +137,18 @@ typedef struct LogicalTape
 	 * reading.
 	 */
 	char	   *buffer;			/* physical buffer (separately palloc'd) */
+	int			buffer_size;	/* allocated size of the buffer */
 	long		curBlockNumber; /* this block's logical blk# within tape */
 	int			pos;			/* next read/write position in buffer */
 	int			nbytes;			/* total # of valid bytes in buffer */
+
+	/*
+	 * Desired buffer size to use when reading.  To keep things simple, we
+	 * use a single-block buffer when writing, or when reading a frozen
+	 * tape.  But when we are reading and will only read forwards, we
+	 * allocate a larger buffer, determined by read_buffer_size.
+	 */
+	int			read_buffer_size;
 } LogicalTape;
 
 /*
@@ -228,6 +243,53 @@ ltsReadBlock(LogicalTapeSet *lts, long blocknum, void *buffer)
 }
 
 /*
+ * Read as many blocks as we can into the per-tape buffer.
+ *
+ * The caller can specify the next physical block number to read, in
+ * datablocknum, or -1 to fetch the next block number from the internal block.
+ * If datablocknum == -1, the caller must've already set curBlockNumber.
+ *
+ * Returns true if anything was read, 'false' on EOF.
+ */
+static bool
+ltsReadFillBuffer(LogicalTapeSet *lts, LogicalTape *lt, long datablocknum)
+{
+	lt->pos = 0;
+	lt->nbytes = 0;
+
+	do
+	{
+		/* Fetch next block number (unless provided by caller) */
+		if (datablocknum == -1)
+		{
+			datablocknum = ltsRecallNextBlockNum(lts, lt->indirect, lt->frozen);
+			if (datablocknum == -1L)
+				break;			/* EOF */
+			lt->curBlockNumber++;
+		}
+
+		/* Read the block */
+		ltsReadBlock(lts, datablocknum, (void *) (lt->buffer + lt->nbytes));
+		if (!lt->frozen)
+			ltsReleaseBlock(lts, datablocknum);
+
+		if (lt->curBlockNumber < lt->numFullBlocks)
+			lt->nbytes += BLCKSZ;
+		else
+		{
+			/* EOF */
+			lt->nbytes += lt->lastBlockBytes;
+			break;
+		}
+
+		/* Advance to next block, if we have buffer space left */
+		datablocknum = -1;
+	} while (lt->nbytes < lt->buffer_size);
+
+	return (lt->nbytes > 0);
+}
+
+/*
  * qsort comparator for sorting freeBlocks[] into decreasing order.
  */
 static int
@@ -546,6 +608,8 @@ LogicalTapeSetCreate(int ntapes)
 		lt->numFullBlocks = 0L;
 		lt->lastBlockBytes = 0;
 		lt->buffer = NULL;
+		lt->buffer_size = 0;
+		lt->read_buffer_size = BLCKSZ;
 		lt->curBlockNumber = 0L;
 		lt->pos = 0;
 		lt->nbytes = 0;
@@ -628,7 +692,10 @@ LogicalTapeWrite(LogicalTapeSet *lts, int tapenum,
 
 	/* Allocate data buffer and first indirect block on first write */
 	if (lt->buffer == NULL)
+	{
 		lt->buffer = (char *) palloc(BLCKSZ);
+		lt->buffer_size = BLCKSZ;
+	}
 	if (lt->indirect == NULL)
 	{
 		lt->indirect = (IndirectBlock *) palloc(sizeof(IndirectBlock));
@@ -636,6 +703,7 @@ LogicalTapeWrite(LogicalTapeSet *lts, int tapenum,
 		lt->indirect->nextup = NULL;
 	}
 
+	Assert(lt->buffer_size == BLCKSZ);
 	while (size > 0)
 	{
 		if (lt->pos >= BLCKSZ)
@@ -709,18 +777,19 @@ LogicalTapeRewind(LogicalTapeSet *lts, int tapenum, bool forWrite)
 			Assert(lt->frozen);
 			datablocknum = ltsRewindFrozenIndirectBlock(lts, lt->indirect);
 		}
+
+		/* Allocate a read buffer */
+		if (lt->buffer)
+			pfree(lt->buffer);
+		lt->buffer = palloc(lt->read_buffer_size);
+		lt->buffer_size = lt->read_buffer_size;
+
 		/* Read the first block, or reset if tape is empty */
 		lt->curBlockNumber = 0L;
 		lt->pos = 0;
 		lt->nbytes = 0;
 		if (datablocknum != -1L)
-		{
-			ltsReadBlock(lts, datablocknum, (void *) lt->buffer);
-			if (!lt->frozen)
-				ltsReleaseBlock(lts, datablocknum);
-			lt->nbytes = (lt->curBlockNumber < lt->numFullBlocks) ?
-				BLCKSZ : lt->lastBlockBytes;
-		}
+			ltsReadFillBuffer(lts, lt, datablocknum);
 	}
 	else
 	{
@@ -754,6 +823,13 @@ LogicalTapeRewind(LogicalTapeSet *lts, int tapenum, bool forWrite)
 		lt->curBlockNumber = 0L;
 		lt->pos = 0;
 		lt->nbytes = 0;
+
+		if (lt->buffer)
+		{
+			pfree(lt->buffer);
+			lt->buffer = NULL;
+			lt->buffer_size = 0;
+		}
 	}
 }
 
@@ -779,20 +855,8 @@ LogicalTapeRead(LogicalTapeSet *lts, int tapenum,
 		if (lt->pos >= lt->nbytes)
 		{
 			/* Try to load more data into buffer. */
-			long		datablocknum = ltsRecallNextBlockNum(lts, lt->indirect,
-															 lt->frozen);
-
-			if (datablocknum == -1L)
+			if (!ltsReadFillBuffer(lts, lt, -1))
 				break;			/* EOF */
-			lt->curBlockNumber++;
-			lt->pos = 0;
-			ltsReadBlock(lts, datablocknum, (void *) lt->buffer);
-			if (!lt->frozen)
-				ltsReleaseBlock(lts, datablocknum);
-			lt->nbytes = (lt->curBlockNumber < lt->numFullBlocks) ?
-				BLCKSZ : lt->lastBlockBytes;
-			if (lt->nbytes <= 0)
-				break;			/* EOF (possible here?) */
 		}
 
 		nthistime = lt->nbytes - lt->pos;
@@ -842,6 +906,22 @@ LogicalTapeFreeze(LogicalTapeSet *lts, int tapenum)
 	lt->writing = false;
 	lt->frozen = true;
 	datablocknum = ltsRewindIndirectBlock(lts, lt->indirect, true);
+
+	/*
+	 * The seek and backspace functions assume a single block read buffer.
+	 * That's OK with current usage. A larger buffer is helpful to make the
+	 * read pattern of the backing file look more sequential to the OS, when
+	 * we're reading from multiple tapes. But at the end of a sort, when a
+	 * tape is frozen, we only read from a single tape anyway.
+	 */
+	if (!lt->buffer || lt->buffer_size != BLCKSZ)
+	{
+		if (lt->buffer)
+			pfree(lt->buffer);
+		lt->buffer = palloc(BLCKSZ);
+		lt->buffer_size = BLCKSZ;
+	}
+
 	/* Read the first block, or reset if tape is empty */
 	lt->curBlockNumber = 0L;
 	lt->pos = 0;
@@ -875,6 +955,7 @@ LogicalTapeBackspace(LogicalTapeSet *lts, int tapenum, size_t size)
 	Assert(tapenum >= 0 && tapenum < lts->nTapes);
 	lt = &lts->tapes[tapenum];
 	Assert(lt->frozen);
+	Assert(lt->buffer_size == BLCKSZ);
 
 	/*
 	 * Easy case for seek within current block.
@@ -941,6 +1022,7 @@ LogicalTapeSeek(LogicalTapeSet *lts, int tapenum,
 	lt = &lts->tapes[tapenum];
 	Assert(lt->frozen);
 	Assert(offset >= 0 && offset <= BLCKSZ);
+	Assert(lt->buffer_size == BLCKSZ);
 
 	/*
 	 * Easy case for seek within current block.
@@ -1002,6 +1084,10 @@ LogicalTapeTell(LogicalTapeSet *lts, int tapenum,
 
 	Assert(tapenum >= 0 && tapenum < lts->nTapes);
 	lt = &lts->tapes[tapenum];
+
+	/* With a larger buffer, 'pos' wouldn't be the same as offset within page */
+	Assert(lt->buffer_size == BLCKSZ);
+
 	*blocknum = lt->curBlockNumber;
 	*offset = lt->pos;
 }
@@ -1014,3 +1100,28 @@ LogicalTapeSetBlocks(LogicalTapeSet *lts)
 {
 	return lts->nFileBlocks;
 }
+
+/*
+ * Set buffer size to use, when reading from given tape.
+ */
+void
+LogicalTapeAssignReadBufferSize(LogicalTapeSet *lts, int tapenum, size_t avail_mem)
+{
+	LogicalTape *lt;
+
+	Assert(tapenum >= 0 && tapenum < lts->nTapes);
+	lt = &lts->tapes[tapenum];
+
+	/*
+	 * The buffer size must be a multiple of BLCKSZ in size, so round the
+	 * given value down to nearest BLCKSZ. Make sure we have at least one page.
+	 * Also, don't go above MaxAllocSize, to avoid erroring out. A multi-gigabyte
+	 * buffer is unlikely to be helpful, anyway.
+	 */
+	if (avail_mem < BLCKSZ)
+		avail_mem = BLCKSZ;
+	if (avail_mem > MaxAllocSize)
+		avail_mem = MaxAllocSize;
+	avail_mem -= avail_mem % BLCKSZ;
+	lt->read_buffer_size = avail_mem;
+}
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 16ceb30..d19235d 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -74,7 +74,7 @@
  * the merge is complete.  The basic merge algorithm thus needs very little
  * memory --- only M tuples for an M-way merge, and M is constrained to a
  * small number.  However, we can still make good use of our full workMem
- * allocation by pre-reading additional tuples from each source tape.  Without
+ * allocation, to pre-read blocks from each source tape.  Without
  * prereading, our access pattern to the temporary file would be very erratic;
  * on average we'd read one block from each of M source tapes during the same
  * time that we're writing M blocks to the output tape, so there is no
@@ -84,10 +84,10 @@
  * worse when it comes time to read that tape.  A straightforward merge pass
  * thus ends up doing a lot of waiting for disk seeks.  We can improve matters
  * by prereading from each source tape sequentially, loading about workMem/M
- * bytes from each tape in turn.  Then we run the merge algorithm, writing but
- * not reading until one of the preloaded tuple series runs out.  Then we
- * switch back to preread mode, fill memory again, and repeat.  This approach
- * helps to localize both read and write accesses.
+ * bytes from each tape in turn, and making the sequential blocks immediately
+ * available for reuse.  This approach helps to localize both read and  write
+ * accesses. The pre-reading is handled by logtape.c, we just tell it how
+ * much memory to use for the buffers.
  *
  * When the caller requests random access to the sort result, we form
  * the final sorted run on a logical tape which is then "frozen", so
@@ -162,8 +162,8 @@ bool		optimize_bounded_sort = true;
  * The objects we actually sort are SortTuple structs.  These contain
  * a pointer to the tuple proper (might be a MinimalTuple or IndexTuple),
  * which is a separate palloc chunk --- we assume it is just one chunk and
- * can be freed by a simple pfree() (except during final on-the-fly merge,
- * when memory is used in batch).  SortTuples also contain the tuple's
+ * can be freed by a simple pfree() (except during merge, when we use a
+ * simple slab allocator).  SortTuples also contain the tuple's
  * first key column in Datum/nullflag format, and an index integer.
  *
  * Storing the first key column lets us save heap_getattr or index_getattr
@@ -191,9 +191,8 @@ bool		optimize_bounded_sort = true;
  * it now only distinguishes RUN_FIRST and HEAP_RUN_NEXT, since replacement
  * selection is always abandoned after the first run; no other run number
  * should be represented here.  During merge passes, we re-use it to hold the
- * input tape number that each tuple in the heap was read from, or to hold the
- * index of the next tuple pre-read from the same tape in the case of pre-read
- * entries.  tupindex goes unused if the sort occurs entirely in memory.
+ * input tape number that each tuple in the heap was read from.  tupindex goes
+ * unused if the sort occurs entirely in memory.
  */
 typedef struct
 {
@@ -203,6 +202,24 @@ typedef struct
 	int			tupindex;		/* see notes above */
 } SortTuple;
 
+/*
+ * During merge, we use a pre-allocated set of fixed-size slots to hold
+ * tuples, to avoid palloc/pfree overhead.
+ *
+ * Merge doesn't require a lot of memory, so we can afford to waste some,
+ * by using gratuitously-sized slots.  If a tuple is larger than 1 kB, the
+ * palloc() overhead is not significant anymore.
+ *
+ * 'nextfree' is valid when this chunk is in the free list.  When in use, the
+ * slot holds a tuple.
+ */
+#define SLAB_SLOT_SIZE 1024
+
+typedef union SlabSlot
+{
+	union SlabSlot *nextfree;
+	char		buffer[SLAB_SLOT_SIZE];
+} SlabSlot;
 
 /*
  * Possible states of a Tuplesort object.  These denote the states that
@@ -288,41 +305,28 @@ struct Tuplesortstate
 	/*
 	 * Function to write a stored tuple onto tape.  The representation of the
 	 * tuple on tape need not be the same as it is in memory; requirements on
-	 * the tape representation are given below.  After writing the tuple,
-	 * pfree() the out-of-line data (not the SortTuple struct!), and increase
-	 * state->availMem by the amount of memory space thereby released.
+	 * the tape representation are given below.  Unless the slab allocator is
+	 * used, after writing the tuple, pfree() the out-of-line data (not the
+	 * SortTuple struct!), and increase state->availMem by the amount of memory
+	 * space thereby released.
 	 */
 	void		(*writetup) (Tuplesortstate *state, int tapenum,
 										 SortTuple *stup);
 
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
-	 * the already-read length of the stored tuple.  Create a palloc'd copy,
-	 * initialize tuple/datum1/isnull1 in the target SortTuple struct, and
-	 * decrease state->availMem by the amount of memory space consumed. (See
-	 * batchUsed notes for details on how memory is handled when incremental
-	 * accounting is abandoned.)
+	 * the already-read length of the stored tuple.  The tuple is allocated
+	 * from the slab memory arena, or is palloc'd, see readtup_alloc().
 	 */
 	void		(*readtup) (Tuplesortstate *state, SortTuple *stup,
 										int tapenum, unsigned int len);
 
 	/*
-	 * Function to move a caller tuple.  This is usually implemented as a
-	 * memmove() shim, but function may also perform additional fix-up of
-	 * caller tuple where needed.  Batch memory support requires the movement
-	 * of caller tuples from one location in memory to another.
-	 */
-	void		(*movetup) (void *dest, void *src, unsigned int len);
-
-	/*
 	 * This array holds the tuples now in sort memory.  If we are in state
 	 * INITIAL, the tuples are in no particular order; if we are in state
 	 * SORTEDINMEM, the tuples are in final sorted order; in states BUILDRUNS
 	 * and FINALMERGE, the tuples are organized in "heap" order per Algorithm
-	 * H.  (Note that memtupcount only counts the tuples that are part of the
-	 * heap --- during merge passes, memtuples[] entries beyond tapeRange are
-	 * never in the heap and are used to hold pre-read tuples.)  In state
-	 * SORTEDONTAPE, the array is not used.
+	 * H. In state SORTEDONTAPE, the array is not used.
 	 */
 	SortTuple  *memtuples;		/* array of SortTuple structs */
 	int			memtupcount;	/* number of tuples currently present */
@@ -330,13 +334,42 @@ struct Tuplesortstate
 	bool		growmemtuples;	/* memtuples' growth still underway? */
 
 	/*
-	 * Memory for tuples is sometimes allocated in batch, rather than
-	 * incrementally.  This implies that incremental memory accounting has
-	 * been abandoned.  Currently, this only happens for the final on-the-fly
-	 * merge step.  Large batch allocations can store tuples (e.g.
-	 * IndexTuples) without palloc() fragmentation and other overhead.
+	 * Memory for tuples is sometimes allocated using a simple slab allocator,
+	 * rather than with palloc().  Currently, we switch to slab allocation when
+	 * we start merging.  Merging only needs to keep a small, fixed number tuples
+	 * in memory at any time, so we can avoid the palloc/pfree overhead by
+	 * recycling a fixed number of fixed-size slots to hold the tuples.
+	 *
+	 * For the slab, we use one large allocation, divided into SLAB_SLOT_SIZE
+	 * slots.  The allocation is sized to have one slot per tape, plus one
+	 * additional slot.  We need that many slots to hold all the tuples kept in
+	 * the heap during merge, plus the one we have last returned from the sort,
+	 * with tuplesort_gettuple.
+	 *
+	 * Initially, all the slots are kept in a linked list of free slots.  When
+	 * a tuple is read from a tape, it is put into the next available slot, if it
+	 * fits.  If the tuple is larger than SLAB_SLOT_SIZE, it is palloc'd instead.
+	 *
+	 * When we're done processing a tuple, we return the slot back to the free
+	 * list, or pfree() if it was palloc'd.  We know that a tuple was allocated
+	 * from the slab, if its pointer value is between slabMemoryBegin and -End.
+	 *
+	 * When the slab allocator is used, the USEMEM/LACKMEM mechanism of tracking
+	 * memory usage is not used.
+	 */
+	bool		slabAllocatorUsed;
+
+	char	   *slabMemoryBegin;	/* beginning of slab memory arena */
+	char	   *slabMemoryEnd;		/* end of slab memory arena */
+	SlabSlot   *slabFreeHead;		/* head of free list */
+
+	/*
+	 * When we return a tuple to the caller in tuplesort_gettuple_XXX, that
+	 * came from a tape (that is, in TSS_SORTEDONTAPE or TSS_FINALMERGE modes),
+	 * we remember the tuple in 'lastReturnedTuple', so that we can recycle the
+	 * memory on next gettuple call.
 	 */
-	bool		batchUsed;
+	void	   *lastReturnedTuple;
 
 	/*
 	 * While building initial runs, this indicates if the replacement
@@ -358,42 +391,11 @@ struct Tuplesortstate
 	 */
 
 	/*
-	 * These variables are only used during merge passes.  mergeactive[i] is
+	 * This variable is only used during merge passes.  mergeactive[i] is
 	 * true if we are reading an input run from (actual) tape number i and
-	 * have not yet exhausted that run.  mergenext[i] is the memtuples index
-	 * of the next pre-read tuple (next to be loaded into the heap) for tape
-	 * i, or 0 if we are out of pre-read tuples.  mergelast[i] similarly
-	 * points to the last pre-read tuple from each tape.  mergeavailslots[i]
-	 * is the number of unused memtuples[] slots reserved for tape i, and
-	 * mergeavailmem[i] is the amount of unused space allocated for tape i.
-	 * mergefreelist and mergefirstfree keep track of unused locations in the
-	 * memtuples[] array.  The memtuples[].tupindex fields link together
-	 * pre-read tuples for each tape as well as recycled locations in
-	 * mergefreelist. It is OK to use 0 as a null link in these lists, because
-	 * memtuples[0] is part of the merge heap and is never a pre-read tuple.
+	 * have not yet exhausted that run.
 	 */
 	bool	   *mergeactive;	/* active input run source? */
-	int		   *mergenext;		/* first preread tuple for each source */
-	int		   *mergelast;		/* last preread tuple for each source */
-	int		   *mergeavailslots;	/* slots left for prereading each tape */
-	int64	   *mergeavailmem;	/* availMem for prereading each tape */
-	int			mergefreelist;	/* head of freelist of recycled slots */
-	int			mergefirstfree; /* first slot never used in this merge */
-
-	/*
-	 * Per-tape batch state, when final on-the-fly merge consumes memory from
-	 * just a few large allocations.
-	 *
-	 * Aside from the general benefits of performing fewer individual retail
-	 * palloc() calls, this also helps make merging more cache efficient,
-	 * since each tape's tuples must naturally be accessed sequentially (in
-	 * sorted order).
-	 */
-	int64		spacePerTape;	/* Space (memory) for tuples (not slots) */
-	char	  **mergetuples;	/* Each tape's memory allocation */
-	char	  **mergecurrent;	/* Current offset into each tape's memory */
-	char	  **mergetail;		/* Last item's start point for each tape */
-	char	  **mergeoverflow;	/* Retail palloc() "overflow" for each tape */
 
 	/*
 	 * Variables for Algorithm D.  Note that destTape is a "logical" tape
@@ -481,12 +483,34 @@ struct Tuplesortstate
 #endif
 };
 
+/*
+ * Is the given tuple allocated from the slab memory arena?
+ */
+#define IS_SLAB_SLOT(state, tuple) \
+	((char *) tuple >= state->slabMemoryBegin && \
+	 (char *) tuple < state->slabMemoryEnd)
+
+/*
+ * Return the given tuple to the slab memory free list, or free it
+ * if it was palloc'd.
+ */
+#define RELEASE_SLAB_SLOT(state, tuple) \
+	do { \
+		SlabSlot *buf = (SlabSlot *) tuple; \
+		\
+		if (IS_SLAB_SLOT(state, tuple)) \
+		{ \
+			buf->nextfree = state->slabFreeHead; \
+			state->slabFreeHead = buf; \
+		} else \
+			pfree(tuple); \
+	} while(0)
+
 #define COMPARETUP(state,a,b)	((*(state)->comparetup) (a, b, state))
 #define COPYTUP(state,stup,tup) ((*(state)->copytup) (state, stup, tup))
 #define WRITETUP(state,tape,stup)	((*(state)->writetup) (state, tape, stup))
 #define READTUP(state,stup,tape,len) ((*(state)->readtup) (state, stup, tape, len))
-#define MOVETUP(dest,src,len) ((*(state)->movetup) (dest, src, len))
-#define LACKMEM(state)		((state)->availMem < 0 && !(state)->batchUsed)
+#define LACKMEM(state)		((state)->availMem < 0 && !(state)->slabAllocatorUsed)
 #define USEMEM(state,amt)	((state)->availMem -= (amt))
 #define FREEMEM(state,amt)	((state)->availMem += (amt))
 
@@ -553,16 +577,8 @@ static void inittapes(Tuplesortstate *state);
 static void selectnewtape(Tuplesortstate *state);
 static void mergeruns(Tuplesortstate *state);
 static void mergeonerun(Tuplesortstate *state);
-static void beginmerge(Tuplesortstate *state, bool finalMergeBatch);
-static void batchmemtuples(Tuplesortstate *state);
-static void mergebatch(Tuplesortstate *state, int64 spacePerTape);
-static void mergebatchone(Tuplesortstate *state, int srcTape,
-			  SortTuple *stup, bool *should_free);
-static void mergebatchfreetape(Tuplesortstate *state, int srcTape,
-				   SortTuple *rtup, bool *should_free);
-static void *mergebatchalloc(Tuplesortstate *state, int tapenum, Size tuplen);
-static void mergepreread(Tuplesortstate *state);
-static void mergeprereadone(Tuplesortstate *state, int srcTape);
+static void beginmerge(Tuplesortstate *state);
+static bool mergereadnext(Tuplesortstate *state, int srcTape, SortTuple *stup);
 static void dumptuples(Tuplesortstate *state, bool alltuples);
 static void dumpbatch(Tuplesortstate *state, bool alltuples);
 static void make_bounded_heap(Tuplesortstate *state);
@@ -576,7 +592,7 @@ static void tuplesort_heap_delete_top(Tuplesortstate *state, bool checkIndex);
 static void reversedirection(Tuplesortstate *state);
 static unsigned int getlen(Tuplesortstate *state, int tapenum, bool eofOK);
 static void markrunend(Tuplesortstate *state, int tapenum);
-static void *readtup_alloc(Tuplesortstate *state, int tapenum, Size tuplen);
+static void *readtup_alloc(Tuplesortstate *state, Size tuplen);
 static int comparetup_heap(const SortTuple *a, const SortTuple *b,
 				Tuplesortstate *state);
 static void copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup);
@@ -584,7 +600,6 @@ static void writetup_heap(Tuplesortstate *state, int tapenum,
 			  SortTuple *stup);
 static void readtup_heap(Tuplesortstate *state, SortTuple *stup,
 			 int tapenum, unsigned int len);
-static void movetup_heap(void *dest, void *src, unsigned int len);
 static int comparetup_cluster(const SortTuple *a, const SortTuple *b,
 				   Tuplesortstate *state);
 static void copytup_cluster(Tuplesortstate *state, SortTuple *stup, void *tup);
@@ -592,7 +607,6 @@ static void writetup_cluster(Tuplesortstate *state, int tapenum,
 				 SortTuple *stup);
 static void readtup_cluster(Tuplesortstate *state, SortTuple *stup,
 				int tapenum, unsigned int len);
-static void movetup_cluster(void *dest, void *src, unsigned int len);
 static int comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 					   Tuplesortstate *state);
 static int comparetup_index_hash(const SortTuple *a, const SortTuple *b,
@@ -602,7 +616,6 @@ static void writetup_index(Tuplesortstate *state, int tapenum,
 			   SortTuple *stup);
 static void readtup_index(Tuplesortstate *state, SortTuple *stup,
 			  int tapenum, unsigned int len);
-static void movetup_index(void *dest, void *src, unsigned int len);
 static int comparetup_datum(const SortTuple *a, const SortTuple *b,
 				 Tuplesortstate *state);
 static void copytup_datum(Tuplesortstate *state, SortTuple *stup, void *tup);
@@ -610,7 +623,6 @@ static void writetup_datum(Tuplesortstate *state, int tapenum,
 			   SortTuple *stup);
 static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
 			  int tapenum, unsigned int len);
-static void movetup_datum(void *dest, void *src, unsigned int len);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
 
 /*
@@ -662,10 +674,10 @@ tuplesort_begin_common(int workMem, bool randomAccess)
 	 * Caller tuple (e.g. IndexTuple) memory context.
 	 *
 	 * A dedicated child context used exclusively for caller passed tuples
-	 * eases memory management.  Resetting at key points reduces
-	 * fragmentation. Note that the memtuples array of SortTuples is allocated
-	 * in the parent context, not this context, because there is no need to
-	 * free memtuples early.
+	 * eases memory management.  Destroying it once we're done building
+	 * the initial runs reduces fragmentation.  Note that the memtuples array
+	 * of SortTuples is allocated in the parent context, not this context,
+	 * because there is no need to free memtuples early.
 	 */
 	tuplecontext = AllocSetContextCreate(sortcontext,
 										 "Caller tuples",
@@ -705,7 +717,7 @@ tuplesort_begin_common(int workMem, bool randomAccess)
 						ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
 
 	state->growmemtuples = true;
-	state->batchUsed = false;
+	state->slabAllocatorUsed = false;
 	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
 
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
@@ -762,7 +774,6 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
 	state->readtup = readtup_heap;
-	state->movetup = movetup_heap;
 
 	state->tupDesc = tupDesc;	/* assume we need not copy tupDesc */
 	state->abbrevNext = 10;
@@ -835,7 +846,6 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 	state->copytup = copytup_cluster;
 	state->writetup = writetup_cluster;
 	state->readtup = readtup_cluster;
-	state->movetup = movetup_cluster;
 	state->abbrevNext = 10;
 
 	state->indexInfo = BuildIndexInfo(indexRel);
@@ -927,7 +937,6 @@ tuplesort_begin_index_btree(Relation heapRel,
 	state->copytup = copytup_index;
 	state->writetup = writetup_index;
 	state->readtup = readtup_index;
-	state->movetup = movetup_index;
 	state->abbrevNext = 10;
 
 	state->heapRel = heapRel;
@@ -995,7 +1004,6 @@ tuplesort_begin_index_hash(Relation heapRel,
 	state->copytup = copytup_index;
 	state->writetup = writetup_index;
 	state->readtup = readtup_index;
-	state->movetup = movetup_index;
 
 	state->heapRel = heapRel;
 	state->indexRel = indexRel;
@@ -1038,7 +1046,6 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	state->copytup = copytup_datum;
 	state->writetup = writetup_datum;
 	state->readtup = readtup_datum;
-	state->movetup = movetup_datum;
 	state->abbrevNext = 10;
 
 	state->datumType = datumType;
@@ -1838,7 +1845,7 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 	{
 		case TSS_SORTEDINMEM:
 			Assert(forward || state->randomAccess);
-			Assert(!state->batchUsed);
+			Assert(!state->slabAllocatorUsed);
 			*should_free = false;
 			if (forward)
 			{
@@ -1883,15 +1890,35 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 
 		case TSS_SORTEDONTAPE:
 			Assert(forward || state->randomAccess);
-			Assert(!state->batchUsed);
-			*should_free = true;
+			Assert(state->slabAllocatorUsed);
+
+			/*
+			 * The slot that held the tuple that we returned in previous
+			 * gettuple call can now be reused.
+			 */
+			if (state->lastReturnedTuple)
+			{
+				RELEASE_SLAB_SLOT(state, state->lastReturnedTuple);
+				state->lastReturnedTuple = NULL;
+			}
+
 			if (forward)
 			{
 				if (state->eof_reached)
 					return false;
+
 				if ((tuplen = getlen(state, state->result_tape, true)) != 0)
 				{
 					READTUP(state, stup, state->result_tape, tuplen);
+
+					/*
+					 * Remember the tuple we return, so that we can recycle its
+					 * memory on next call.  (This can be NULL, in the !state->tuples
+					 * case).
+					 */
+					state->lastReturnedTuple = stup->tuple;
+
+					*should_free = false;
 					return true;
 				}
 				else
@@ -1965,74 +1992,70 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 									  tuplen))
 				elog(ERROR, "bogus tuple length in backward scan");
 			READTUP(state, stup, state->result_tape, tuplen);
+
+			/*
+			 * Remember the tuple we return, so that we can recycle its
+			 * memory on next call. (This can be NULL, in the Datum case).
+			 */
+			state->lastReturnedTuple = stup->tuple;
+
+			*should_free = false;
 			return true;
 
 		case TSS_FINALMERGE:
 			Assert(forward);
-			Assert(state->batchUsed || !state->tuples);
-			/* For now, assume tuple is stored in tape's batch memory */
+			/* We are managing memory ourselves, with the slab allocator. */
+			Assert(state->slabAllocatorUsed);
 			*should_free = false;
 
 			/*
+			 * The slab slot holding the tuple that we returned in previous
+			 * gettuple call can now be reused.
+			 */
+			if (state->lastReturnedTuple)
+			{
+				RELEASE_SLAB_SLOT(state, state->lastReturnedTuple);
+				state->lastReturnedTuple = NULL;
+			}
+
+			/*
 			 * This code should match the inner loop of mergeonerun().
 			 */
 			if (state->memtupcount > 0)
 			{
 				int			srcTape = state->memtuples[0].tupindex;
-				int			tupIndex;
-				SortTuple  *newtup;
+				SortTuple	newtup;
+
+				*stup = state->memtuples[0];
 
 				/*
-				 * Returned tuple is still counted in our memory space most of
-				 * the time.  See mergebatchone() for discussion of why caller
-				 * may occasionally be required to free returned tuple, and
-				 * how preread memory is managed with regard to edge cases
-				 * more generally.
+				 * Remember the tuple we return, so that we can recycle its
+				 * memory on next call. (This can be NULL, in the Datum case).
 				 */
-				*stup = state->memtuples[0];
-				if ((tupIndex = state->mergenext[srcTape]) == 0)
+				state->lastReturnedTuple = stup->tuple;
+
+				/*
+				 * Pull next tuple from tape, and replace the returned tuple
+				 * at top of the heap with it.
+				 */
+				if (!mergereadnext(state, srcTape, &newtup))
 				{
 					/*
-					 * out of preloaded data on this tape, try to read more
-					 *
-					 * Unlike mergeonerun(), we only preload from the single
-					 * tape that's run dry, though not before preparing its
-					 * batch memory for a new round of sequential consumption.
-					 * See mergepreread() comments.
+					 * If no more data, we've reached end of run on this tape.
+					 * Remove the top node from the heap.
 					 */
-					if (state->batchUsed)
-						mergebatchone(state, srcTape, stup, should_free);
-
-					mergeprereadone(state, srcTape);
+					tuplesort_heap_delete_top(state, false);
 
 					/*
-					 * if still no data, we've reached end of run on this tape
+					 * Rewind to free the read buffer.  It'd go away at the
+					 * end of the sort anyway, but better to release the
+					 * memory early.
 					 */
-					if ((tupIndex = state->mergenext[srcTape]) == 0)
-					{
-						/* Remove the top node from the heap */
-						tuplesort_heap_delete_top(state, false);
-						/* Free tape's buffer, avoiding dangling pointer */
-						if (state->batchUsed)
-							mergebatchfreetape(state, srcTape, stup, should_free);
-						return true;
-					}
+					LogicalTapeRewind(state->tapeset, srcTape, true);
+					return true;
 				}
-
-				/*
-				 * pull next preread tuple from list, and replace the returned
-				 * tuple at top of the heap with it.
-				 */
-				newtup = &state->memtuples[tupIndex];
-				state->mergenext[srcTape] = newtup->tupindex;
-				if (state->mergenext[srcTape] == 0)
-					state->mergelast[srcTape] = 0;
-				newtup->tupindex = srcTape;
-				tuplesort_heap_replace_top(state, newtup, false);
-				/* put the now-unused memtuples entry on the freelist */
-				newtup->tupindex = state->mergefreelist;
-				state->mergefreelist = tupIndex;
-				state->mergeavailslots[srcTape]++;
+				newtup.tupindex = srcTape;
+				tuplesort_heap_replace_top(state, &newtup, false);
 				return true;
 			}
 			return false;
@@ -2317,13 +2340,6 @@ inittapes(Tuplesortstate *state)
 	/* Compute number of tapes to use: merge order plus 1 */
 	maxTapes = tuplesort_merge_order(state->allowedMem) + 1;
 
-	/*
-	 * We must have at least 2*maxTapes slots in the memtuples[] array, else
-	 * we'd not have room for merge heap plus preread.  It seems unlikely that
-	 * this case would ever occur, but be safe.
-	 */
-	maxTapes = Min(maxTapes, state->memtupsize / 2);
-
 	state->maxTapes = maxTapes;
 	state->tapeRange = maxTapes - 1;
 
@@ -2334,13 +2350,13 @@ inittapes(Tuplesortstate *state)
 #endif
 
 	/*
-	 * Decrease availMem to reflect the space needed for tape buffers; but
-	 * don't decrease it to the point that we have no room for tuples. (That
-	 * case is only likely to occur if sorting pass-by-value Datums; in all
-	 * other scenarios the memtuples[] array is unlikely to occupy more than
-	 * half of allowedMem.  In the pass-by-value case it's not important to
-	 * account for tuple space, so we don't care if LACKMEM becomes
-	 * inaccurate.)
+	 * Decrease availMem to reflect the space needed for tape buffers, when
+	 * writing the initial runs; but don't decrease it to the point that we
+	 * have no room for tuples. (That case is only likely to occur if sorting
+	 * pass-by-value Datums; in all other scenarios the memtuples[] array is
+	 * unlikely to occupy more than half of allowedMem.  In the pass-by-value
+	 * case it's not important to account for tuple space, so we don't care
+	 * if LACKMEM becomes inaccurate.)
 	 */
 	tapeSpace = (int64) maxTapes *TAPE_BUFFER_OVERHEAD;
 
@@ -2359,14 +2375,6 @@ inittapes(Tuplesortstate *state)
 	state->tapeset = LogicalTapeSetCreate(maxTapes);
 
 	state->mergeactive = (bool *) palloc0(maxTapes * sizeof(bool));
-	state->mergenext = (int *) palloc0(maxTapes * sizeof(int));
-	state->mergelast = (int *) palloc0(maxTapes * sizeof(int));
-	state->mergeavailslots = (int *) palloc0(maxTapes * sizeof(int));
-	state->mergeavailmem = (int64 *) palloc0(maxTapes * sizeof(int64));
-	state->mergetuples = (char **) palloc0(maxTapes * sizeof(char *));
-	state->mergecurrent = (char **) palloc0(maxTapes * sizeof(char *));
-	state->mergetail = (char **) palloc0(maxTapes * sizeof(char *));
-	state->mergeoverflow = (char **) palloc0(maxTapes * sizeof(char *));
 	state->tp_fib = (int *) palloc0(maxTapes * sizeof(int));
 	state->tp_runs = (int *) palloc0(maxTapes * sizeof(int));
 	state->tp_dummy = (int *) palloc0(maxTapes * sizeof(int));
@@ -2466,6 +2474,104 @@ selectnewtape(Tuplesortstate *state)
 }
 
 /*
+ * Initialize the slab allocation arena, for the given number of slots.
+ */
+static void
+initSlabAllocator(Tuplesortstate *state, int numSlots)
+{
+	if (numSlots > 0)
+	{
+		char	   *p;
+		int			i;
+
+		state->slabMemoryBegin = palloc((state->maxTapes + 1) * SLAB_SLOT_SIZE);
+		state->slabMemoryEnd = state->slabMemoryBegin +
+			(state->maxTapes + 1) * SLAB_SLOT_SIZE;
+		state->slabFreeHead = (SlabSlot *) state->slabMemoryBegin;
+		USEMEM(state, (state->maxTapes + 1) * SLAB_SLOT_SIZE);
+
+		p = state->slabMemoryBegin;
+		for (i = 0; i < state->maxTapes; i++)
+		{
+			((SlabSlot *) p)->nextfree = (SlabSlot *) (p + SLAB_SLOT_SIZE);
+			p += SLAB_SLOT_SIZE;
+		}
+		((SlabSlot *) p)->nextfree = NULL;
+	}
+	else
+	{
+		state->slabMemoryBegin = state->slabMemoryEnd = NULL;
+		state->slabFreeHead = NULL;
+	}
+	state->slabAllocatorUsed = true;
+}
+
+/*
+ * Divide all remaining work memory (availMem) as read buffers, for all
+ * the tapes that will be used during the merge.
+ *
+ * We use the number of possible *input* tapes here, rather than maxTapes,
+ * for the calculation.  At all times, we'll be reading from at most
+ * numInputTapes tapes, and one tape is used for output (unless we do an
+ * on-the-fly final merge, in which case we don't have an output tape).
+ */
+static void
+initTapeBuffers(Tuplesortstate *state, int numInputTapes)
+{
+	int64		availBlocks;
+	int64		blocksPerTape;
+	int			remainder;
+	int			tapenum;
+
+	/*
+	 * Divide availMem evenly among the number of input tapes.
+	 */
+	availBlocks = state->availMem / BLCKSZ;
+	blocksPerTape = availBlocks / numInputTapes;
+	remainder = availBlocks % numInputTapes;
+	USEMEM(state, availBlocks * BLCKSZ);
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG, "using " INT64_FORMAT " KB of memory for read buffers among %d input tapes",
+			 (availBlocks * BLCKSZ) / 1024, numInputTapes);
+#endif
+
+	/*
+	 * Use one page per tape, even if we are out of memory. tuplesort_merge_order()
+	 * should've chosen the number of tapes so that this can't happen, but better
+	 * safe than sorry.  (This also protects from a negative availMem.)
+	 */
+	if (blocksPerTape < 1)
+	{
+		blocksPerTape = 1;
+		remainder = 0;
+	}
+
+	/*
+	 * Set the buffers for the tapes.
+	 *
+	 * In a multi-phase merge, the tape that is initially used as an output
+	 * tape, will later be rewound and read from, and should also use a large
+	 * buffer at that point.  So we must loop up to maxTapes, not just
+	 * numInputTapes!
+	 *
+	 * If there are fewer runs than tapes, we will set the buffer size also
+	 * for tapes that will go completely unused, but that's harmless.
+	 * LogicalTapeAssignReadBufferSize() doesn't allocate the buffer
+	 * immediately, it just sets the size that will be used, when the tape is
+	 * rewound for read, and the tape isn't empty.
+	 */
+	for (tapenum = 0; tapenum < state->maxTapes; tapenum++)
+	{
+		int64		numBlocks = blocksPerTape + (tapenum < remainder ? 1 : 0);
+
+		LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
+										numBlocks * BLCKSZ);
+	}
+}
+
+/*
  * mergeruns -- merge all the completed initial runs.
  *
  * This implements steps D5, D6 of Algorithm D.  All input data has
@@ -2478,6 +2584,8 @@ mergeruns(Tuplesortstate *state)
 				svTape,
 				svRuns,
 				svDummy;
+	int			numTapes;
+	int			numInputTapes;
 
 	Assert(state->status == TSS_BUILDRUNS);
 	Assert(state->memtupcount == 0);
@@ -2499,6 +2607,64 @@ mergeruns(Tuplesortstate *state)
 	}
 
 	/*
+	 * Reset tuple memory.  We've freed all the tuples that we previously
+	 * allocated.  We will use the slab allocator from now on.
+	 */
+	MemoryContextDelete(state->tuplecontext);
+	state->tuplecontext = NULL;
+
+	/*
+	 * We no longer need a large memtuples array.  Free it, to make the memory
+	 * available for other use; a smaller array, with one slot per input
+	 * tape, is allocated below.
+	 */
+	FREEMEM(state, GetMemoryChunkSpace(state->memtuples));
+	pfree(state->memtuples);
+
+	/*
+	 * If we had fewer runs than tapes, refund the memory that we imagined we
+	 * would need for the tape buffers of the unused tapes.
+	 *
+	 * numTapes and numInputTapes reflect the actual number of tapes we will
+	 * use. Note that the output tape's tape number is maxTapes - 1, so the
+	 * tape numbers of the used tapes are not consecutive, and you cannot
+	 * just loop from 0 to numTapes to visit all used tapes!
+	 */
+	if (state->Level == 1)
+	{
+		numInputTapes = state->currentRun;
+		numTapes = numInputTapes + 1;
+		FREEMEM(state, (state->maxTapes - numTapes) * TAPE_BUFFER_OVERHEAD);
+	}
+	else
+	{
+		numInputTapes = state->maxTapes - 1;
+		numTapes = state->maxTapes;
+	}
+
+	/*
+	 * Allocate a new 'memtuples' array, for the heap. It will hold one tuple
+	 * from each input tape.
+	 */
+	state->memtupsize = numInputTapes;
+	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
+
+	/*
+	 * Initialize the slab allocator.  We need one slab slot per input tape, for
+	 * the tuples in the heap, plus one to hold the tuple last returned from
+	 * tuplesort_gettuple.  (If we're sorting pass-by-val Datums, however, we don't
+	 * need to allocate anything.)
+	 *
+	 * From this point on, we no longer use the USEMEM()/LACKMEM() mechanism to
+	 * track memory usage of individual tuples.
+	 */
+	if (state->tuples)
+		initSlabAllocator(state, numInputTapes + 1);
+	else
+		initSlabAllocator(state, 0);
+
+	/*
 	 * If we produced only one initial run (quite likely if the total data
 	 * volume is between 1X and 2X workMem when replacement selection is used,
 	 * but something we particular count on when input is presorted), we can
@@ -2514,6 +2680,27 @@ mergeruns(Tuplesortstate *state)
 		return;
 	}
 
+	/*
+	 * Use all the spare memory we have available for read buffers for the
+	 * tapes.
+	 *
+	 * We do this only after checking for the case that we produced only one
+	 * initial run, because there is no need to use a large read buffer when
+	 * we're reading from a single tape. With one tape, the I/O pattern will
+	 * be the same regardless of the buffer size.
+	 *
+	 * We don't try to "rebalance" the amount of memory among tapes, when we
+	 * start a new merge phase, even if some tapes can be inactive in the
+	 * phase.  That would be hard, because logtape.c doesn't know where one
+	 * run ends and another begins.  When a new merge phase begins, and a tape
+	 * doesn't participate in it, its buffer nevertheless already contains
+	 * tuples from the next run on same tape, so we cannot release the buffer.
+	 * That's OK in practice, merge performance isn't that sensitive to the
+	 * amount of buffers used, and most merge phases use all or almost all
+	 * tapes, anyway.
+	 */
+	initTapeBuffers(state, numInputTapes);
+
 	/* End of step D2: rewind all output tapes to prepare for merging */
 	for (tapenum = 0; tapenum < state->tapeRange; tapenum++)
 		LogicalTapeRewind(state->tapeset, tapenum, false);
@@ -2544,7 +2731,7 @@ mergeruns(Tuplesortstate *state)
 				/* Tell logtape.c we won't be writing anymore */
 				LogicalTapeSetForgetFreeSpace(state->tapeset);
 				/* Initialize for the final merge pass */
-				beginmerge(state, state->tuples);
+				beginmerge(state);
 				state->status = TSS_FINALMERGE;
 				return;
 			}
@@ -2614,6 +2801,14 @@ mergeruns(Tuplesortstate *state)
 	state->result_tape = state->tp_tapenum[state->tapeRange];
 	LogicalTapeFreeze(state->tapeset, state->result_tape);
 	state->status = TSS_SORTEDONTAPE;
+
+	/* Release the read buffers on all the other tapes, by rewinding them. */
+	for (tapenum = 0; tapenum < state->maxTapes; tapenum++)
+	{
+		if (tapenum == state->result_tape)
+			continue;
+		LogicalTapeRewind(state->tapeset, tapenum, true);
+	}
 }
 
 /*
@@ -2627,16 +2822,12 @@ mergeonerun(Tuplesortstate *state)
 {
 	int			destTape = state->tp_tapenum[state->tapeRange];
 	int			srcTape;
-	int			tupIndex;
-	SortTuple  *tup;
-	int64		priorAvail,
-				spaceFreed;
 
 	/*
 	 * Start the merge by loading one tuple from each active source tape into
 	 * the heap.  We can also decrease the input run/dummy run counts.
 	 */
-	beginmerge(state, false);
+	beginmerge(state);
 
 	/*
 	 * Execute merge by repeatedly extracting lowest tuple in heap, writing it
@@ -2645,52 +2836,31 @@ mergeonerun(Tuplesortstate *state)
 	 */
 	while (state->memtupcount > 0)
 	{
+		SortTuple stup;
+
 		/* write the tuple to destTape */
-		priorAvail = state->availMem;
 		srcTape = state->memtuples[0].tupindex;
 		WRITETUP(state, destTape, &state->memtuples[0]);
-		/* writetup adjusted total free space, now fix per-tape space */
-		spaceFreed = state->availMem - priorAvail;
-		state->mergeavailmem[srcTape] += spaceFreed;
-		if ((tupIndex = state->mergenext[srcTape]) == 0)
-		{
-			/* out of preloaded data on this tape, try to read more */
-			mergepreread(state);
-			/* if still no data, we've reached end of run on this tape */
-			if ((tupIndex = state->mergenext[srcTape]) == 0)
-			{
-				/* remove the written-out tuple from the heap */
-				tuplesort_heap_delete_top(state, false);
-				continue;
-			}
-		}
+
+		/* recycle the slot of the tuple we just wrote out, for the next read */
+		RELEASE_SLAB_SLOT(state, state->memtuples[0].tuple);
 
 		/*
 		 * pull next preread tuple from list, and replace the written-out
 		 * tuple in the heap with it.
 		 */
-		tup = &state->memtuples[tupIndex];
-		state->mergenext[srcTape] = tup->tupindex;
-		if (state->mergenext[srcTape] == 0)
-			state->mergelast[srcTape] = 0;
-		tup->tupindex = srcTape;
-		tuplesort_heap_replace_top(state, tup, false);
-		/* put the now-unused memtuples entry on the freelist */
-		tup->tupindex = state->mergefreelist;
-		state->mergefreelist = tupIndex;
-		state->mergeavailslots[srcTape]++;
+		if (!mergereadnext(state, srcTape, &stup))
+		{
+			/* we've reached end of run on this tape */
+			/* remove the written-out tuple from the heap */
+			tuplesort_heap_delete_top(state, false);
+			continue;
+		}
+		stup.tupindex = srcTape;
+		tuplesort_heap_replace_top(state, &stup, false);
 	}
 
 	/*
-	 * Reset tuple memory.  We've freed all of the tuples that we previously
-	 * allocated, but AllocSetFree will have put those chunks of memory on
-	 * particular free lists, bucketed by size class.  Thus, although all of
-	 * that memory is free, it is effectively fragmented.  Resetting the
-	 * context gets us out from under that problem.
-	 */
-	MemoryContextReset(state->tuplecontext);
-
-	/*
 	 * When the heap empties, we're done.  Write an end-of-run marker on the
 	 * output tape, and increment its count of real runs.
 	 */
@@ -2711,18 +2881,13 @@ mergeonerun(Tuplesortstate *state)
  * which tapes contain active input runs in mergeactive[].  Then, load
  * as many tuples as we can from each active input tape, and finally
  * fill the merge heap with the first tuple from each active tape.
- *
- * finalMergeBatch indicates if this is the beginning of a final on-the-fly
- * merge where a batched allocation of tuple memory is required.
  */
 static void
-beginmerge(Tuplesortstate *state, bool finalMergeBatch)
+beginmerge(Tuplesortstate *state)
 {
 	int			activeTapes;
 	int			tapenum;
 	int			srcTape;
-	int			slotsPerTape;
-	int64		spacePerTape;
 
 	/* Heap should be empty here */
 	Assert(state->memtupcount == 0);
@@ -2746,517 +2911,47 @@ beginmerge(Tuplesortstate *state, bool finalMergeBatch)
 	}
 	state->activeTapes = activeTapes;
 
-	/* Clear merge-pass state variables */
-	memset(state->mergenext, 0,
-		   state->maxTapes * sizeof(*state->mergenext));
-	memset(state->mergelast, 0,
-		   state->maxTapes * sizeof(*state->mergelast));
-	state->mergefreelist = 0;	/* nothing in the freelist */
-	state->mergefirstfree = activeTapes;		/* 1st slot avail for preread */
-
-	if (finalMergeBatch)
-	{
-		/* Free outright buffers for tape never actually allocated */
-		FREEMEM(state, (state->maxTapes - activeTapes) * TAPE_BUFFER_OVERHEAD);
-
-		/*
-		 * Grow memtuples one last time, since the palloc() overhead no longer
-		 * incurred can make a big difference
-		 */
-		batchmemtuples(state);
-	}
-
 	/*
 	 * Initialize space allocation to let each active input tape have an equal
 	 * share of preread space.
 	 */
 	Assert(activeTapes > 0);
-	slotsPerTape = (state->memtupsize - state->mergefirstfree) / activeTapes;
-	Assert(slotsPerTape > 0);
-	spacePerTape = MAXALIGN_DOWN(state->availMem / activeTapes);
-	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
-	{
-		if (state->mergeactive[srcTape])
-		{
-			state->mergeavailslots[srcTape] = slotsPerTape;
-			state->mergeavailmem[srcTape] = spacePerTape;
-		}
-	}
-
-	/*
-	 * Preallocate tuple batch memory for each tape.  This is the memory used
-	 * for tuples themselves (not SortTuples), so it's never used by
-	 * pass-by-value datum sorts.  Memory allocation is performed here at most
-	 * once per sort, just in advance of the final on-the-fly merge step.
-	 */
-	if (finalMergeBatch)
-		mergebatch(state, spacePerTape);
-
-	/*
-	 * Preread as many tuples as possible (and at least one) from each active
-	 * tape
-	 */
-	mergepreread(state);
 
 	/* Load the merge heap with the first tuple from each input tape */
 	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
 	{
-		int			tupIndex = state->mergenext[srcTape];
-		SortTuple  *tup;
-
-		if (tupIndex)
-		{
-			tup = &state->memtuples[tupIndex];
-			state->mergenext[srcTape] = tup->tupindex;
-			if (state->mergenext[srcTape] == 0)
-				state->mergelast[srcTape] = 0;
-			tup->tupindex = srcTape;
-			tuplesort_heap_insert(state, tup, false);
-			/* put the now-unused memtuples entry on the freelist */
-			tup->tupindex = state->mergefreelist;
-			state->mergefreelist = tupIndex;
-			state->mergeavailslots[srcTape]++;
-
-#ifdef TRACE_SORT
-			if (trace_sort && finalMergeBatch)
-			{
-				int64		perTapeKB = (spacePerTape + 1023) / 1024;
-				int64		usedSpaceKB;
-				int			usedSlots;
-
-				/*
-				 * Report how effective batchmemtuples() was in balancing the
-				 * number of slots against the need for memory for the
-				 * underlying tuples (e.g. IndexTuples).  The big preread of
-				 * all tapes when switching to FINALMERGE state should be
-				 * fairly representative of memory utilization during the
-				 * final merge step, and in any case is the only point at
-				 * which all tapes are guaranteed to have depleted either
-				 * their batch memory allowance or slot allowance.  Ideally,
-				 * both will be completely depleted for every tape by now.
-				 */
-				usedSpaceKB = (state->mergecurrent[srcTape] -
-							   state->mergetuples[srcTape] + 1023) / 1024;
-				usedSlots = slotsPerTape - state->mergeavailslots[srcTape];
-
-				elog(LOG, "tape %d initially used " INT64_FORMAT " KB of "
-					 INT64_FORMAT " KB batch (%2.3f) and %d out of %d slots "
-					 "(%2.3f)", srcTape,
-					 usedSpaceKB, perTapeKB,
-					 (double) usedSpaceKB / (double) perTapeKB,
-					 usedSlots, slotsPerTape,
-					 (double) usedSlots / (double) slotsPerTape);
-			}
-#endif
-		}
-	}
-}
-
-/*
- * batchmemtuples - grow memtuples without palloc overhead
- *
- * When called, availMem should be approximately the amount of memory we'd
- * require to allocate memtupsize - memtupcount tuples (not SortTuples/slots)
- * that were allocated with palloc() overhead, and in doing so use up all
- * allocated slots.  However, though slots and tuple memory is in balance
- * following the last grow_memtuples() call, that's predicated on the observed
- * average tuple size for the "final" grow_memtuples() call, which includes
- * palloc overhead.  During the final merge pass, where we will arrange to
- * squeeze out the palloc overhead, we might need more slots in the memtuples
- * array.
- *
- * To make that happen, arrange for the amount of remaining memory to be
- * exactly equal to the palloc overhead multiplied by the current size of
- * the memtuples array, force the grow_memtuples flag back to true (it's
- * probably but not necessarily false on entry to this routine), and then
- * call grow_memtuples.  This simulates loading enough tuples to fill the
- * whole memtuples array and then having some space left over because of the
- * elided palloc overhead.  We expect that grow_memtuples() will conclude that
- * it can't double the size of the memtuples array but that it can increase
- * it by some percentage; but if it does decide to double it, that just means
- * that we've never managed to use many slots in the memtuples array, in which
- * case doubling it shouldn't hurt anything anyway.
- */
-static void
-batchmemtuples(Tuplesortstate *state)
-{
-	int64		refund;
-	int64		availMemLessRefund;
-	int			memtupsize = state->memtupsize;
-
-	/* Caller error if we have no tapes */
-	Assert(state->activeTapes > 0);
-
-	/* For simplicity, assume no memtuples are actually currently counted */
-	Assert(state->memtupcount == 0);
-
-	/*
-	 * Refund STANDARDCHUNKHEADERSIZE per tuple.
-	 *
-	 * This sometimes fails to make memory use perfectly balanced, but it
-	 * should never make the situation worse.  Note that Assert-enabled builds
-	 * get a larger refund, due to a varying STANDARDCHUNKHEADERSIZE.
-	 */
-	refund = memtupsize * STANDARDCHUNKHEADERSIZE;
-	availMemLessRefund = state->availMem - refund;
-
-	/*
-	 * We need to be sure that we do not cause LACKMEM to become true, else
-	 * the batch allocation size could be calculated as negative, causing
-	 * havoc.  Hence, if availMemLessRefund is negative at this point, we must
-	 * do nothing.  Moreover, if it's positive but rather small, there's
-	 * little point in proceeding because we could only increase memtuples by
-	 * a small amount, not worth the cost of the repalloc's.  We somewhat
-	 * arbitrarily set the threshold at ALLOCSET_DEFAULT_INITSIZE per tape.
-	 * (Note that this does not represent any assumption about tuple sizes.)
-	 */
-	if (availMemLessRefund <=
-		(int64) state->activeTapes * ALLOCSET_DEFAULT_INITSIZE)
-		return;
-
-	/*
-	 * To establish balanced memory use after refunding palloc overhead,
-	 * temporarily have our accounting indicate that we've allocated all
-	 * memory we're allowed to less that refund, and call grow_memtuples() to
-	 * have it increase the number of slots.
-	 */
-	state->growmemtuples = true;
-	USEMEM(state, availMemLessRefund);
-	(void) grow_memtuples(state);
-	state->growmemtuples = false;
-	/* availMem must stay accurate for spacePerTape calculation */
-	FREEMEM(state, availMemLessRefund);
-	if (LACKMEM(state))
-		elog(ERROR, "unexpected out-of-memory situation in tuplesort");
-
-#ifdef TRACE_SORT
-	if (trace_sort)
-	{
-		Size		OldKb = (memtupsize * sizeof(SortTuple) + 1023) / 1024;
-		Size		NewKb = (state->memtupsize * sizeof(SortTuple) + 1023) / 1024;
-
-		elog(LOG, "grew memtuples %1.2fx from %d (%zu KB) to %d (%zu KB) for final merge",
-			 (double) NewKb / (double) OldKb,
-			 memtupsize, OldKb,
-			 state->memtupsize, NewKb);
-	}
-#endif
-}
-
-/*
- * mergebatch - initialize tuple memory in batch
- *
- * This allows sequential access to sorted tuples buffered in memory from
- * tapes/runs on disk during a final on-the-fly merge step.  Note that the
- * memory is not used for SortTuples, but for the underlying tuples (e.g.
- * MinimalTuples).
- *
- * Note that when batch memory is used, there is a simple division of space
- * into large buffers (one per active tape).  The conventional incremental
- * memory accounting (calling USEMEM() and FREEMEM()) is abandoned.  Instead,
- * when each tape's memory budget is exceeded, a retail palloc() "overflow" is
- * performed, which is then immediately detected in a way that is analogous to
- * LACKMEM().  This keeps each tape's use of memory fair, which is always a
- * goal.
- */
-static void
-mergebatch(Tuplesortstate *state, int64 spacePerTape)
-{
-	int			srcTape;
-
-	Assert(state->activeTapes > 0);
-	Assert(state->tuples);
-
-	/*
-	 * For the purposes of tuplesort's memory accounting, the batch allocation
-	 * is special, and regular memory accounting through USEMEM() calls is
-	 * abandoned (see mergeprereadone()).
-	 */
-	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
-	{
-		char	   *mergetuples;
-
-		if (!state->mergeactive[srcTape])
-			continue;
-
-		/* Allocate buffer for each active tape */
-		mergetuples = MemoryContextAllocHuge(state->tuplecontext,
-											 spacePerTape);
-
-		/* Initialize state for tape */
-		state->mergetuples[srcTape] = mergetuples;
-		state->mergecurrent[srcTape] = mergetuples;
-		state->mergetail[srcTape] = mergetuples;
-		state->mergeoverflow[srcTape] = NULL;
-	}
-
-	state->batchUsed = true;
-	state->spacePerTape = spacePerTape;
-}
-
-/*
- * mergebatchone - prepare batch memory for one merge input tape
- *
- * This is called following the exhaustion of preread tuples for one input
- * tape.  All that actually occurs is that the state for the source tape is
- * reset to indicate that all memory may be reused.
- *
- * This routine must deal with fixing up the tuple that is about to be returned
- * to the client, due to "overflow" allocations.
- */
-static void
-mergebatchone(Tuplesortstate *state, int srcTape, SortTuple *rtup,
-			  bool *should_free)
-{
-	Assert(state->batchUsed);
-
-	/*
-	 * Tuple about to be returned to caller ("stup") is final preread tuple
-	 * from tape, just removed from the top of the heap.  Special steps around
-	 * memory management must be performed for that tuple, to make sure it
-	 * isn't overwritten early.
-	 */
-	if (!state->mergeoverflow[srcTape])
-	{
-		Size		tupLen;
-
-		/*
-		 * Mark tuple buffer range for reuse, but be careful to move final,
-		 * tail tuple to start of space for next run so that it's available to
-		 * caller when stup is returned, and remains available at least until
-		 * the next tuple is requested.
-		 */
-		tupLen = state->mergecurrent[srcTape] - state->mergetail[srcTape];
-		state->mergecurrent[srcTape] = state->mergetuples[srcTape];
-		MOVETUP(state->mergecurrent[srcTape], state->mergetail[srcTape],
-				tupLen);
-
-		/* Make SortTuple at top of the merge heap point to new tuple */
-		rtup->tuple = (void *) state->mergecurrent[srcTape];
-
-		state->mergetail[srcTape] = state->mergecurrent[srcTape];
-		state->mergecurrent[srcTape] += tupLen;
-	}
-	else
-	{
-		/*
-		 * Handle an "overflow" retail palloc.
-		 *
-		 * This is needed when we run out of tuple memory for the tape.
-		 */
-		state->mergecurrent[srcTape] = state->mergetuples[srcTape];
-		state->mergetail[srcTape] = state->mergetuples[srcTape];
+		SortTuple	tup;
 
-		if (rtup->tuple)
+		if (mergereadnext(state, srcTape, &tup))
 		{
-			Assert(rtup->tuple == (void *) state->mergeoverflow[srcTape]);
-			/* Caller should free palloc'd tuple */
-			*should_free = true;
+			tup.tupindex = srcTape;
+			tuplesort_heap_insert(state, &tup, false);
 		}
-		state->mergeoverflow[srcTape] = NULL;
 	}
 }
 
 /*
- * mergebatchfreetape - handle final clean-up for batch memory once tape is
- * about to become exhausted
+ * mergereadnext - read next tuple from one merge input tape
  *
- * All tuples are returned from tape, but a single final tuple, *rtup, is to be
- * passed back to caller.  Free tape's batch allocation buffer while ensuring
- * that the final tuple is managed appropriately.
+ * Returns false on EOF.
  */
-static void
-mergebatchfreetape(Tuplesortstate *state, int srcTape, SortTuple *rtup,
-				   bool *should_free)
-{
-	Assert(state->batchUsed);
-	Assert(state->status == TSS_FINALMERGE);
-
-	/*
-	 * Tuple may or may not already be an overflow allocation from
-	 * mergebatchone()
-	 */
-	if (!*should_free && rtup->tuple)
-	{
-		/*
-		 * Final tuple still in tape's batch allocation.
-		 *
-		 * Return palloc()'d copy to caller, and have it freed in a similar
-		 * manner to overflow allocation.  Otherwise, we'd free batch memory
-		 * and pass back a pointer to garbage.  Note that we deliberately
-		 * allocate this in the parent tuplesort context, to be on the safe
-		 * side.
-		 */
-		Size		tuplen;
-		void	   *oldTuple = rtup->tuple;
-
-		tuplen = state->mergecurrent[srcTape] - state->mergetail[srcTape];
-		rtup->tuple = MemoryContextAlloc(state->sortcontext, tuplen);
-		MOVETUP(rtup->tuple, oldTuple, tuplen);
-		*should_free = true;
-	}
-
-	/* Free spacePerTape-sized buffer */
-	pfree(state->mergetuples[srcTape]);
-}
-
-/*
- * mergebatchalloc - allocate memory for one tuple using a batch memory
- * "logical allocation".
- *
- * This is used for the final on-the-fly merge phase only.  READTUP() routines
- * receive memory from here in place of palloc() and USEMEM() calls.
- *
- * Tuple tapenum is passed, ensuring each tape's tuples are stored in sorted,
- * contiguous order (while allowing safe reuse of memory made available to
- * each tape).  This maximizes locality of access as tuples are returned by
- * final merge.
- *
- * Caller must not subsequently attempt to free memory returned here.  In
- * general, only mergebatch* functions know about how memory returned from
- * here should be freed, and this function's caller must ensure that batch
- * memory management code will definitely have the opportunity to do the right
- * thing during the final on-the-fly merge.
- */
-static void *
-mergebatchalloc(Tuplesortstate *state, int tapenum, Size tuplen)
-{
-	Size		reserve_tuplen = MAXALIGN(tuplen);
-	char	   *ret;
-
-	/* Should overflow at most once before mergebatchone() call: */
-	Assert(state->mergeoverflow[tapenum] == NULL);
-	Assert(state->batchUsed);
-
-	/* It should be possible to use precisely spacePerTape memory at once */
-	if (state->mergecurrent[tapenum] + reserve_tuplen <=
-		state->mergetuples[tapenum] + state->spacePerTape)
-	{
-		/*
-		 * Usual case -- caller is returned pointer into its tape's buffer,
-		 * and an offset from that point is recorded as where tape has
-		 * consumed up to for current round of preloading.
-		 */
-		ret = state->mergetail[tapenum] = state->mergecurrent[tapenum];
-		state->mergecurrent[tapenum] += reserve_tuplen;
-	}
-	else
-	{
-		/*
-		 * Allocate memory, and record as tape's overflow allocation.  This
-		 * will be detected quickly, in a similar fashion to a LACKMEM()
-		 * condition, and should not happen again before a new round of
-		 * preloading for caller's tape.  Note that we deliberately allocate
-		 * this in the parent tuplesort context, to be on the safe side.
-		 *
-		 * Sometimes, this does not happen because merging runs out of slots
-		 * before running out of memory.
-		 */
-		ret = state->mergeoverflow[tapenum] =
-			MemoryContextAlloc(state->sortcontext, tuplen);
-	}
-
-	return ret;
-}
-
-/*
- * mergepreread - load tuples from merge input tapes
- *
- * This routine exists to improve sequentiality of reads during a merge pass,
- * as explained in the header comments of this file.  Load tuples from each
- * active source tape until the tape's run is exhausted or it has used up
- * its fair share of available memory.  In any case, we guarantee that there
- * is at least one preread tuple available from each unexhausted input tape.
- *
- * We invoke this routine at the start of a merge pass for initial load,
- * and then whenever any tape's preread data runs out.  Note that we load
- * as much data as possible from all tapes, not just the one that ran out.
- * This is because logtape.c works best with a usage pattern that alternates
- * between reading a lot of data and writing a lot of data, so whenever we
- * are forced to read, we should fill working memory completely.
- *
- * In FINALMERGE state, we *don't* use this routine, but instead just preread
- * from the single tape that ran dry.  There's no read/write alternation in
- * that state and so no point in scanning through all the tapes to fix one.
- * (Moreover, there may be quite a lot of inactive tapes in that state, since
- * we might have had many fewer runs than tapes.  In a regular tape-to-tape
- * merge we can expect most of the tapes to be active.  Plus, only
- * FINALMERGE state has to consider memory management for a batch
- * allocation.)
- */
-static void
-mergepreread(Tuplesortstate *state)
-{
-	int			srcTape;
-
-	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
-		mergeprereadone(state, srcTape);
-}
-
-/*
- * mergeprereadone - load tuples from one merge input tape
- *
- * Read tuples from the specified tape until it has used up its free memory
- * or array slots; but ensure that we have at least one tuple, if any are
- * to be had.
- */
-static void
-mergeprereadone(Tuplesortstate *state, int srcTape)
+static bool
+mergereadnext(Tuplesortstate *state, int srcTape, SortTuple *stup)
 {
 	unsigned int tuplen;
-	SortTuple	stup;
-	int			tupIndex;
-	int64		priorAvail,
-				spaceUsed;
 
 	if (!state->mergeactive[srcTape])
-		return;					/* tape's run is already exhausted */
+		return false;					/* tape's run is already exhausted */
 
-	/*
-	 * Manage per-tape availMem.  Only actually matters when batch memory not
-	 * in use.
-	 */
-	priorAvail = state->availMem;
-	state->availMem = state->mergeavailmem[srcTape];
-
-	/*
-	 * When batch memory is used if final on-the-fly merge, only mergeoverflow
-	 * test is relevant; otherwise, only LACKMEM() test is relevant.
-	 */
-	while ((state->mergeavailslots[srcTape] > 0 &&
-			state->mergeoverflow[srcTape] == NULL && !LACKMEM(state)) ||
-		   state->mergenext[srcTape] == 0)
+	/* read next tuple, if any */
+	if ((tuplen = getlen(state, srcTape, true)) == 0)
 	{
-		/* read next tuple, if any */
-		if ((tuplen = getlen(state, srcTape, true)) == 0)
-		{
-			state->mergeactive[srcTape] = false;
-			break;
-		}
-		READTUP(state, &stup, srcTape, tuplen);
-		/* find a free slot in memtuples[] for it */
-		tupIndex = state->mergefreelist;
-		if (tupIndex)
-			state->mergefreelist = state->memtuples[tupIndex].tupindex;
-		else
-		{
-			tupIndex = state->mergefirstfree++;
-			Assert(tupIndex < state->memtupsize);
-		}
-		state->mergeavailslots[srcTape]--;
-		/* store tuple, append to list for its tape */
-		stup.tupindex = 0;
-		state->memtuples[tupIndex] = stup;
-		if (state->mergelast[srcTape])
-			state->memtuples[state->mergelast[srcTape]].tupindex = tupIndex;
-		else
-			state->mergenext[srcTape] = tupIndex;
-		state->mergelast[srcTape] = tupIndex;
+		state->mergeactive[srcTape] = false;
+		return false;
 	}
-	/* update per-tape and global availmem counts */
-	spaceUsed = state->mergeavailmem[srcTape] - state->availMem;
-	state->mergeavailmem[srcTape] = state->availMem;
-	state->availMem = priorAvail - spaceUsed;
+	READTUP(state, stup, srcTape, tuplen);
+
+	return true;
 }
 
 /*
@@ -3438,15 +3133,6 @@ dumpbatch(Tuplesortstate *state, bool alltuples)
 		state->memtupcount--;
 	}
 
-	/*
-	 * Reset tuple memory.  We've freed all of the tuples that we previously
-	 * allocated.  It's important to avoid fragmentation when there is a stark
-	 * change in allocation patterns due to the use of batch memory.
-	 * Fragmentation due to AllocSetFree's bucketing by size class might be
-	 * particularly bad if this step wasn't taken.
-	 */
-	MemoryContextReset(state->tuplecontext);
-
 	markrunend(state, state->tp_tapenum[state->destTape]);
 	state->tp_runs[state->destTape]++;
 	state->tp_dummy[state->destTape]--; /* per Alg D step D2 */
@@ -3901,38 +3587,31 @@ markrunend(Tuplesortstate *state, int tapenum)
 }
 
 /*
- * Get memory for tuple from within READTUP() routine.  Allocate
- * memory and account for that, or consume from tape's batch
- * allocation.
+ * Get memory for tuple from within READTUP() routine.
  *
- * Memory returned here in the final on-the-fly merge case is recycled
- * from tape's batch allocation.  Otherwise, callers must pfree() or
- * reset tuple child memory context, and account for that with a
- * FREEMEM().  Currently, this only ever needs to happen in WRITETUP()
- * routines.
+ * We use next free slot from the slab allocator, or palloc() if the tuple
+ * is too large for that.
  */
 static void *
-readtup_alloc(Tuplesortstate *state, int tapenum, Size tuplen)
+readtup_alloc(Tuplesortstate *state, Size tuplen)
 {
-	if (state->batchUsed)
-	{
-		/*
-		 * No USEMEM() call, because during final on-the-fly merge accounting
-		 * is based on tape-private state. ("Overflow" allocations are
-		 * detected as an indication that a new round or preloading is
-		 * required. Preloading marks existing contents of tape's batch buffer
-		 * for reuse.)
-		 */
-		return mergebatchalloc(state, tapenum, tuplen);
-	}
+	SlabSlot   *buf;
+
+	/*
+	 * We pre-allocate enough slots in the slab arena that we should never run
+	 * out.
+	 */
+	Assert(state->slabFreeHead);
+
+	if (tuplen > SLAB_SLOT_SIZE || !state->slabFreeHead)
+		return MemoryContextAlloc(state->sortcontext, tuplen);
 	else
 	{
-		char	   *ret;
+		buf = state->slabFreeHead;
+		/* Reuse this slot */
+		state->slabFreeHead = buf->nextfree;
 
-		/* Batch allocation yet to be performed */
-		ret = MemoryContextAlloc(state->tuplecontext, tuplen);
-		USEMEM(state, GetMemoryChunkSpace(ret));
-		return ret;
+		return buf;
 	}
 }
 
@@ -4101,8 +3780,11 @@ writetup_heap(Tuplesortstate *state, int tapenum, SortTuple *stup)
 		LogicalTapeWrite(state->tapeset, tapenum,
 						 (void *) &tuplen, sizeof(tuplen));
 
-	FREEMEM(state, GetMemoryChunkSpace(tuple));
-	heap_free_minimal_tuple(tuple);
+	if (!state->slabAllocatorUsed)
+	{
+		FREEMEM(state, GetMemoryChunkSpace(tuple));
+		heap_free_minimal_tuple(tuple);
+	}
 }
 
 static void
@@ -4111,7 +3793,7 @@ readtup_heap(Tuplesortstate *state, SortTuple *stup,
 {
 	unsigned int tupbodylen = len - sizeof(int);
 	unsigned int tuplen = tupbodylen + MINIMAL_TUPLE_DATA_OFFSET;
-	MinimalTuple tuple = (MinimalTuple) readtup_alloc(state, tapenum, tuplen);
+	MinimalTuple tuple = (MinimalTuple) readtup_alloc(state, tuplen);
 	char	   *tupbody = (char *) tuple + MINIMAL_TUPLE_DATA_OFFSET;
 	HeapTupleData htup;
 
@@ -4132,12 +3814,6 @@ readtup_heap(Tuplesortstate *state, SortTuple *stup,
 								&stup->isnull1);
 }
 
-static void
-movetup_heap(void *dest, void *src, unsigned int len)
-{
-	memmove(dest, src, len);
-}
-
 /*
  * Routines specialized for the CLUSTER case (HeapTuple data, with
  * comparisons per a btree index definition)
@@ -4344,8 +4020,11 @@ writetup_cluster(Tuplesortstate *state, int tapenum, SortTuple *stup)
 		LogicalTapeWrite(state->tapeset, tapenum,
 						 &tuplen, sizeof(tuplen));
 
-	FREEMEM(state, GetMemoryChunkSpace(tuple));
-	heap_freetuple(tuple);
+	if (!state->slabAllocatorUsed)
+	{
+		FREEMEM(state, GetMemoryChunkSpace(tuple));
+		heap_freetuple(tuple);
+	}
 }
 
 static void
@@ -4354,7 +4033,6 @@ readtup_cluster(Tuplesortstate *state, SortTuple *stup,
 {
 	unsigned int t_len = tuplen - sizeof(ItemPointerData) - sizeof(int);
 	HeapTuple	tuple = (HeapTuple) readtup_alloc(state,
-												  tapenum,
 												  t_len + HEAPTUPLESIZE);
 
 	/* Reconstruct the HeapTupleData header */
@@ -4379,19 +4057,6 @@ readtup_cluster(Tuplesortstate *state, SortTuple *stup,
 									&stup->isnull1);
 }
 
-static void
-movetup_cluster(void *dest, void *src, unsigned int len)
-{
-	HeapTuple	tuple;
-
-	memmove(dest, src, len);
-
-	/* Repoint the HeapTupleData header */
-	tuple = (HeapTuple) dest;
-	tuple->t_data = (HeapTupleHeader) ((char *) tuple + HEAPTUPLESIZE);
-}
-
-
 /*
  * Routines specialized for IndexTuple case
  *
@@ -4659,8 +4324,11 @@ writetup_index(Tuplesortstate *state, int tapenum, SortTuple *stup)
 		LogicalTapeWrite(state->tapeset, tapenum,
 						 (void *) &tuplen, sizeof(tuplen));
 
-	FREEMEM(state, GetMemoryChunkSpace(tuple));
-	pfree(tuple);
+	if (!state->slabAllocatorUsed)
+	{
+		FREEMEM(state, GetMemoryChunkSpace(tuple));
+		pfree(tuple);
+	}
 }
 
 static void
@@ -4668,7 +4336,7 @@ readtup_index(Tuplesortstate *state, SortTuple *stup,
 			  int tapenum, unsigned int len)
 {
 	unsigned int tuplen = len - sizeof(unsigned int);
-	IndexTuple	tuple = (IndexTuple) readtup_alloc(state, tapenum, tuplen);
+	IndexTuple	tuple = (IndexTuple) readtup_alloc(state, tuplen);
 
 	LogicalTapeReadExact(state->tapeset, tapenum,
 						 tuple, tuplen);
@@ -4683,12 +4351,6 @@ readtup_index(Tuplesortstate *state, SortTuple *stup,
 								 &stup->isnull1);
 }
 
-static void
-movetup_index(void *dest, void *src, unsigned int len)
-{
-	memmove(dest, src, len);
-}
-
 /*
  * Routines specialized for DatumTuple case
  */
@@ -4755,7 +4417,7 @@ writetup_datum(Tuplesortstate *state, int tapenum, SortTuple *stup)
 		LogicalTapeWrite(state->tapeset, tapenum,
 						 (void *) &writtenlen, sizeof(writtenlen));
 
-	if (stup->tuple)
+	if (!state->slabAllocatorUsed && stup->tuple)
 	{
 		FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
 		pfree(stup->tuple);
@@ -4785,7 +4447,7 @@ readtup_datum(Tuplesortstate *state, SortTuple *stup,
 	}
 	else
 	{
-		void	   *raddr = readtup_alloc(state, tapenum, tuplen);
+		void	   *raddr = readtup_alloc(state, tuplen);
 
 		LogicalTapeReadExact(state->tapeset, tapenum,
 							 raddr, tuplen);
@@ -4799,12 +4461,6 @@ readtup_datum(Tuplesortstate *state, SortTuple *stup,
 							 &tuplen, sizeof(tuplen));
 }
 
-static void
-movetup_datum(void *dest, void *src, unsigned int len)
-{
-	memmove(dest, src, len);
-}
-
 /*
  * Convenience routine to free a tuple previously loaded into sort memory
  */
diff --git a/src/include/utils/logtape.h b/src/include/utils/logtape.h
index fa1e992..03d0a6f 100644
--- a/src/include/utils/logtape.h
+++ b/src/include/utils/logtape.h
@@ -39,6 +39,7 @@ extern bool LogicalTapeSeek(LogicalTapeSet *lts, int tapenum,
 				long blocknum, int offset);
 extern void LogicalTapeTell(LogicalTapeSet *lts, int tapenum,
 				long *blocknum, int *offset);
+extern void LogicalTapeAssignReadBufferSize(LogicalTapeSet *lts, int tapenum, size_t bufsize);
 extern long LogicalTapeSetBlocks(LogicalTapeSet *lts);
 
 #endif   /* LOGTAPE_H */
-- 
2.9.3

#49Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#48)
1 attachment(s)
Re: Tuplesort merge pre-reading

On 09/29/2016 05:41 PM, Heikki Linnakangas wrote:

Here's a new patch version, addressing the points you made. Please have
a look!

Bah, I fumbled the initSlabAllocator() function. Attached is a fixed
version.

- Heikki
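
For readers following the slab allocator discussion, here is a minimal standalone sketch of the idea behind readtup_alloc() and RELEASE_SLAB_SLOT in the patch: a fixed pool of equal-sized slots threaded onto a free list, with a fallback to the general-purpose allocator for oversized requests. This is not the patch's code. The names Arena, Slot, SLOT_SIZE and the arena_* functions are hypothetical, and malloc()/free() stand in for palloc()/pfree().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SLOT_SIZE 1024			/* hypothetical fixed slot size */

/* A slot holds either tuple bytes or, while free, a link to the next free slot */
typedef union Slot
{
	union Slot *nextfree;
	char		data[SLOT_SIZE];
} Slot;

typedef struct Arena
{
	Slot	   *slots;			/* one contiguous array of nslots slots */
	size_t		nslots;
	Slot	   *freehead;		/* head of the free list, NULL if none free */
} Arena;

/* Allocate the arena and chain every slot onto the free list (nslots > 0) */
static int
arena_init(Arena *a, size_t nslots)
{
	size_t		i;

	a->slots = malloc(nslots * sizeof(Slot));
	if (a->slots == NULL)
		return -1;
	a->nslots = nslots;
	for (i = 0; i < nslots; i++)
		a->slots[i].nextfree = (i + 1 < nslots) ? &a->slots[i + 1] : NULL;
	a->freehead = &a->slots[0];
	return 0;
}

/* Does this pointer fall inside the arena's slot array? */
static int
arena_owns(const Arena *a, const void *p)
{
	uintptr_t	addr = (uintptr_t) p;
	uintptr_t	start = (uintptr_t) a->slots;
	uintptr_t	end = (uintptr_t) (a->slots + a->nslots);

	return addr >= start && addr < end;
}

/* Hand out a free slot, or fall back to malloc() if too large or none free */
static void *
arena_alloc(Arena *a, size_t len)
{
	Slot	   *s;

	if (len > SLOT_SIZE || a->freehead == NULL)
		return malloc(len);
	s = a->freehead;
	a->freehead = s->nextfree;
	return s;
}

/* Return a slot to the free list, or free() a fallback allocation */
static void
arena_release(Arena *a, void *p)
{
	if (arena_owns(a, p))
	{
		Slot	   *s = p;

		s->nextfree = a->freehead;
		a->freehead = s;
	}
	else
		free(p);
}
```

Because a merge holds only one tuple per input tape at a time, a small pool like this is enough, and releasing a slot right after writing its tuple out means the next read from the same tape can reuse it immediately.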

Attachments:

0001-Change-the-way-pre-reading-in-external-sort-s-merge-4.patch (text/x-patch)
From bd74cb9c32b3073637d6932f3b4552598fcdc92a Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 14 Sep 2016 17:29:11 +0300
Subject: [PATCH 1/1] Change the way pre-reading in external sort's merge phase
 works.

Don't pre-read tuples into SortTuple slots during merge. Instead, use the
memory for larger read buffers in logtape.c. We're doing the same number
of READTUP() calls either way, but managing the pre-read SortTuple slots
is much more complicated. Also, the on-tape representation is more compact
than SortTuples, so we can fit more pre-read tuples into the same amount
of memory this way. And we get better cache locality when we use just a
small number of SortTuple slots.

Now that we only hold one tuple from each tape in the SortTuple slots, we
can greatly simplify the "batch memory" management. We now maintain a
small set of fixed-size slots to hold the tuples, and fall back to
palloc() for larger tuples. We use this method during all merge phases,
not just the final merge, as well as when randomAccess is requested and
in the TSS_SORTEDONTAPE state. In other words, it's used whenever we do
an external sort.

Reviewed by Peter Geoghegan and Claudio Freire.
---
 src/backend/utils/sort/logtape.c   |  153 ++++-
 src/backend/utils/sort/tuplesort.c | 1208 +++++++++++++-----------------------
 src/include/utils/logtape.h        |    1 +
 3 files changed, 565 insertions(+), 797 deletions(-)
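
The larger read buffers described in the commit message can be illustrated with a simplified standalone sketch: whenever a tape's buffer runs dry, it is refilled with as many of that tape's blocks as fit, rather than one block at a time. This is only an assumption-laden model of what ltsReadFillBuffer() does, not the patch's code. The Tape struct, BLOCK_SIZE, the in-memory "storage" array and tape_fill_buffer() are hypothetical, and a flat list of block numbers stands in for logtape.c's indirect blocks.

```c
#include <assert.h>
#include <string.h>

#define BLOCK_SIZE 8			/* hypothetical, tiny for illustration */

typedef struct Tape
{
	const char *storage;		/* backing "file", a whole number of blocks */
	const long *blocks;			/* physical block numbers, in logical order */
	long		nblocks;		/* number of blocks on this tape */
	long		next;			/* index of the next logical block to read */
	int			lastBlockBytes; /* valid bytes in the tape's final block */
	char	   *buffer;			/* read buffer, buffer_size bytes */
	int			buffer_size;	/* a multiple of BLOCK_SIZE */
	int			nbytes;			/* valid bytes currently in the buffer */
	int			pos;			/* next read position in the buffer */
} Tape;

/*
 * Refill the buffer with as many consecutive logical blocks as fit.
 * Returns 1 if anything was read, 0 on end of tape.
 */
static int
tape_fill_buffer(Tape *t)
{
	t->pos = 0;
	t->nbytes = 0;

	while (t->next < t->nblocks &&
		   t->nbytes + BLOCK_SIZE <= t->buffer_size)
	{
		long		phys = t->blocks[t->next];

		memcpy(t->buffer + t->nbytes,
			   t->storage + phys * BLOCK_SIZE, BLOCK_SIZE);
		t->next++;

		/* The final block may be only partially filled */
		if (t->next == t->nblocks)
			t->nbytes += t->lastBlockBytes;
		else
			t->nbytes += BLOCK_SIZE;
	}

	return t->nbytes > 0;
}
```

With a two-block buffer and a three-block tape, the first call loads two blocks, the second loads the partial final block, and the third reports end of tape. The tuple-level code keeps calling the same "read N bytes" interface either way; only the refill granularity changes.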

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 7745207..4152da1 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -52,12 +52,17 @@
  * not clear this helps much, but it can't hurt.  (XXX perhaps a LIFO
  * policy for free blocks would be better?)
  *
+ * To further make the I/Os more sequential, we can use a larger buffer
+ * when reading, and read multiple blocks from the same tape in one go,
+ * whenever the buffer becomes empty. LogicalTapeAssignReadBufferSize()
+ * can be used to set the size of the read buffer.
+ *
  * To support the above policy of writing to the lowest free block,
  * ltsGetFreeBlock sorts the list of free block numbers into decreasing
  * order each time it is asked for a block and the list isn't currently
  * sorted.  This is an efficient way to handle it because we expect cycles
  * of releasing many blocks followed by re-using many blocks, due to
- * tuplesort.c's "preread" behavior.
+ * the larger read buffer.
  *
  * Since all the bookkeeping and buffer memory is allocated with palloc(),
  * and the underlying file(s) are made with OpenTemporaryFile, all resources
@@ -79,6 +84,7 @@
 
 #include "storage/buffile.h"
 #include "utils/logtape.h"
+#include "utils/memutils.h"
 
 /*
  * Block indexes are "long"s, so we can fit this many per indirect block.
@@ -131,9 +137,18 @@ typedef struct LogicalTape
 	 * reading.
 	 */
 	char	   *buffer;			/* physical buffer (separately palloc'd) */
+	int			buffer_size;	/* allocated size of the buffer */
 	long		curBlockNumber; /* this block's logical blk# within tape */
 	int			pos;			/* next read/write position in buffer */
 	int			nbytes;			/* total # of valid bytes in buffer */
+
+	/*
+	 * Desired buffer size to use when reading.  To keep things simple, we
+	 * use a single-block buffer when writing, or when reading a frozen
+	 * tape.  But when we are reading and will only read forwards, we
+	 * allocate a larger buffer, determined by read_buffer_size.
+	 */
+	int			read_buffer_size;
 } LogicalTape;
 
 /*
@@ -228,6 +243,53 @@ ltsReadBlock(LogicalTapeSet *lts, long blocknum, void *buffer)
 }
 
 /*
+ * Read as many blocks as we can into the per-tape buffer.
+ *
+ * The caller can specify the next physical block number to read, in
+ * datablocknum, or -1 to fetch the next block number from the indirect block.
+ * If datablocknum == -1, the caller must've already set curBlockNumber.
+ *
+ * Returns true if anything was read, false on EOF.
+ */
+static bool
+ltsReadFillBuffer(LogicalTapeSet *lts, LogicalTape *lt, long datablocknum)
+{
+	lt->pos = 0;
+	lt->nbytes = 0;
+
+	do
+	{
+		/* Fetch next block number (unless provided by caller) */
+		if (datablocknum == -1)
+		{
+			datablocknum = ltsRecallNextBlockNum(lts, lt->indirect, lt->frozen);
+			if (datablocknum == -1L)
+				break;			/* EOF */
+			lt->curBlockNumber++;
+		}
+
+		/* Read the block */
+		ltsReadBlock(lts, datablocknum, (void *) (lt->buffer + lt->nbytes));
+		if (!lt->frozen)
+			ltsReleaseBlock(lts, datablocknum);
+
+		if (lt->curBlockNumber < lt->numFullBlocks)
+			lt->nbytes += BLCKSZ;
+		else
+		{
+			/* EOF */
+			lt->nbytes += lt->lastBlockBytes;
+			break;
+		}
+
+		/* Advance to next block, if we have buffer space left */
+		datablocknum = -1;
+	} while (lt->nbytes < lt->buffer_size);
+
+	return (lt->nbytes > 0);
+}
+
+/*
  * qsort comparator for sorting freeBlocks[] into decreasing order.
  */
 static int
@@ -546,6 +608,8 @@ LogicalTapeSetCreate(int ntapes)
 		lt->numFullBlocks = 0L;
 		lt->lastBlockBytes = 0;
 		lt->buffer = NULL;
+		lt->buffer_size = 0;
+		lt->read_buffer_size = BLCKSZ;
 		lt->curBlockNumber = 0L;
 		lt->pos = 0;
 		lt->nbytes = 0;
@@ -628,7 +692,10 @@ LogicalTapeWrite(LogicalTapeSet *lts, int tapenum,
 
 	/* Allocate data buffer and first indirect block on first write */
 	if (lt->buffer == NULL)
+	{
 		lt->buffer = (char *) palloc(BLCKSZ);
+		lt->buffer_size = BLCKSZ;
+	}
 	if (lt->indirect == NULL)
 	{
 		lt->indirect = (IndirectBlock *) palloc(sizeof(IndirectBlock));
@@ -636,6 +703,7 @@ LogicalTapeWrite(LogicalTapeSet *lts, int tapenum,
 		lt->indirect->nextup = NULL;
 	}
 
+	Assert(lt->buffer_size == BLCKSZ);
 	while (size > 0)
 	{
 		if (lt->pos >= BLCKSZ)
@@ -709,18 +777,19 @@ LogicalTapeRewind(LogicalTapeSet *lts, int tapenum, bool forWrite)
 			Assert(lt->frozen);
 			datablocknum = ltsRewindFrozenIndirectBlock(lts, lt->indirect);
 		}
+
+		/* Allocate a read buffer */
+		if (lt->buffer)
+			pfree(lt->buffer);
+		lt->buffer = palloc(lt->read_buffer_size);
+		lt->buffer_size = lt->read_buffer_size;
+
 		/* Read the first block, or reset if tape is empty */
 		lt->curBlockNumber = 0L;
 		lt->pos = 0;
 		lt->nbytes = 0;
 		if (datablocknum != -1L)
-		{
-			ltsReadBlock(lts, datablocknum, (void *) lt->buffer);
-			if (!lt->frozen)
-				ltsReleaseBlock(lts, datablocknum);
-			lt->nbytes = (lt->curBlockNumber < lt->numFullBlocks) ?
-				BLCKSZ : lt->lastBlockBytes;
-		}
+			ltsReadFillBuffer(lts, lt, datablocknum);
 	}
 	else
 	{
@@ -754,6 +823,13 @@ LogicalTapeRewind(LogicalTapeSet *lts, int tapenum, bool forWrite)
 		lt->curBlockNumber = 0L;
 		lt->pos = 0;
 		lt->nbytes = 0;
+
+		if (lt->buffer)
+		{
+			pfree(lt->buffer);
+			lt->buffer = NULL;
+			lt->buffer_size = 0;
+		}
 	}
 }
 
@@ -779,20 +855,8 @@ LogicalTapeRead(LogicalTapeSet *lts, int tapenum,
 		if (lt->pos >= lt->nbytes)
 		{
 			/* Try to load more data into buffer. */
-			long		datablocknum = ltsRecallNextBlockNum(lts, lt->indirect,
-															 lt->frozen);
-
-			if (datablocknum == -1L)
+			if (!ltsReadFillBuffer(lts, lt, -1))
 				break;			/* EOF */
-			lt->curBlockNumber++;
-			lt->pos = 0;
-			ltsReadBlock(lts, datablocknum, (void *) lt->buffer);
-			if (!lt->frozen)
-				ltsReleaseBlock(lts, datablocknum);
-			lt->nbytes = (lt->curBlockNumber < lt->numFullBlocks) ?
-				BLCKSZ : lt->lastBlockBytes;
-			if (lt->nbytes <= 0)
-				break;			/* EOF (possible here?) */
 		}
 
 		nthistime = lt->nbytes - lt->pos;
@@ -842,6 +906,22 @@ LogicalTapeFreeze(LogicalTapeSet *lts, int tapenum)
 	lt->writing = false;
 	lt->frozen = true;
 	datablocknum = ltsRewindIndirectBlock(lts, lt->indirect, true);
+
+	/*
+	 * The seek and backspace functions assume a single block read buffer.
+	 * That's OK with current usage. A larger buffer is helpful to make the
+	 * read pattern of the backing file look more sequential to the OS, when
+	 * we're reading from multiple tapes. But at the end of a sort, when a
+	 * tape is frozen, we only read from a single tape anyway.
+	 */
+	if (!lt->buffer || lt->buffer_size != BLCKSZ)
+	{
+		if (lt->buffer)
+			pfree(lt->buffer);
+		lt->buffer = palloc(BLCKSZ);
+		lt->buffer_size = BLCKSZ;
+	}
+
 	/* Read the first block, or reset if tape is empty */
 	lt->curBlockNumber = 0L;
 	lt->pos = 0;
@@ -875,6 +955,7 @@ LogicalTapeBackspace(LogicalTapeSet *lts, int tapenum, size_t size)
 	Assert(tapenum >= 0 && tapenum < lts->nTapes);
 	lt = &lts->tapes[tapenum];
 	Assert(lt->frozen);
+	Assert(lt->buffer_size == BLCKSZ);
 
 	/*
 	 * Easy case for seek within current block.
@@ -941,6 +1022,7 @@ LogicalTapeSeek(LogicalTapeSet *lts, int tapenum,
 	lt = &lts->tapes[tapenum];
 	Assert(lt->frozen);
 	Assert(offset >= 0 && offset <= BLCKSZ);
+	Assert(lt->buffer_size == BLCKSZ);
 
 	/*
 	 * Easy case for seek within current block.
@@ -1002,6 +1084,10 @@ LogicalTapeTell(LogicalTapeSet *lts, int tapenum,
 
 	Assert(tapenum >= 0 && tapenum < lts->nTapes);
 	lt = &lts->tapes[tapenum];
+
+	/* With a larger buffer, 'pos' wouldn't be the same as offset within page */
+	Assert(lt->buffer_size == BLCKSZ);
+
 	*blocknum = lt->curBlockNumber;
 	*offset = lt->pos;
 }
@@ -1014,3 +1100,28 @@ LogicalTapeSetBlocks(LogicalTapeSet *lts)
 {
 	return lts->nFileBlocks;
 }
+
+/*
+ * Set buffer size to use, when reading from given tape.
+ */
+void
+LogicalTapeAssignReadBufferSize(LogicalTapeSet *lts, int tapenum, size_t avail_mem)
+{
+	LogicalTape *lt;
+
+	Assert(tapenum >= 0 && tapenum < lts->nTapes);
+	lt = &lts->tapes[tapenum];
+
+	/*
+	 * The buffer size must be a multiple of BLCKSZ, so round the given value
+	 * down to the nearest multiple of BLCKSZ. Make sure we have at least one page.
+	 * Also, don't go above MaxAllocSize, to avoid erroring out. A multi-gigabyte
+	 * buffer is unlikely to be helpful, anyway.
+	 */
+	if (avail_mem < BLCKSZ)
+		avail_mem = BLCKSZ;
+	if (avail_mem > MaxAllocSize)
+		avail_mem = MaxAllocSize;
+	avail_mem -= avail_mem % BLCKSZ;
+	lt->read_buffer_size = avail_mem;
+}
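
A note on the clamping in LogicalTapeAssignReadBufferSize() above: the
following is only a standalone sketch of the same arithmetic, not part of
the patch. BLCKSZ is assumed to be the default 8192, MAX_ALLOC_SIZE is a
stand-in for MaxAllocSize (0x3fffffff), and clamp_read_buffer_size() is a
hypothetical helper name used for illustration.

```c
#include <stddef.h>

#define BLCKSZ 8192				/* assumption: default block size */
#define MAX_ALLOC_SIZE ((size_t) 0x3fffffff)	/* stand-in for MaxAllocSize */

/*
 * Hypothetical helper mirroring the rounding done in the patch: use at
 * least one block, at most MAX_ALLOC_SIZE, and always a whole number of
 * blocks.
 */
static size_t
clamp_read_buffer_size(size_t avail_mem)
{
	if (avail_mem < BLCKSZ)
		avail_mem = BLCKSZ;		/* never less than one page */
	if (avail_mem > MAX_ALLOC_SIZE)
		avail_mem = MAX_ALLOC_SIZE;	/* stay within one allocation */
	return avail_mem - avail_mem % BLCKSZ;	/* round down to a block */
}
```

Under those assumptions, a request of 100 bytes yields 8192, a request of
20000 bytes yields 16384, and anything above the cap yields 0x3fffe000,
which still fits in the int-typed read_buffer_size field.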
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 16ceb30..a80de41 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -74,7 +74,7 @@
  * the merge is complete.  The basic merge algorithm thus needs very little
  * memory --- only M tuples for an M-way merge, and M is constrained to a
  * small number.  However, we can still make good use of our full workMem
- * allocation by pre-reading additional tuples from each source tape.  Without
+ * allocation, to pre-read blocks from each source tape.  Without
  * prereading, our access pattern to the temporary file would be very erratic;
  * on average we'd read one block from each of M source tapes during the same
  * time that we're writing M blocks to the output tape, so there is no
@@ -84,10 +84,10 @@
  * worse when it comes time to read that tape.  A straightforward merge pass
  * thus ends up doing a lot of waiting for disk seeks.  We can improve matters
  * by prereading from each source tape sequentially, loading about workMem/M
- * bytes from each tape in turn.  Then we run the merge algorithm, writing but
- * not reading until one of the preloaded tuple series runs out.  Then we
- * switch back to preread mode, fill memory again, and repeat.  This approach
- * helps to localize both read and write accesses.
+ * bytes from each tape in turn, and making the sequential blocks immediately
+ * available for reuse.  This approach helps to localize both read and write
+ * accesses.  The pre-reading is handled by logtape.c; we just tell it how
+ * much memory to use for the buffers.
  *
  * When the caller requests random access to the sort result, we form
  * the final sorted run on a logical tape which is then "frozen", so
@@ -162,8 +162,8 @@ bool		optimize_bounded_sort = true;
  * The objects we actually sort are SortTuple structs.  These contain
  * a pointer to the tuple proper (might be a MinimalTuple or IndexTuple),
  * which is a separate palloc chunk --- we assume it is just one chunk and
- * can be freed by a simple pfree() (except during final on-the-fly merge,
- * when memory is used in batch).  SortTuples also contain the tuple's
+ * can be freed by a simple pfree() (except during merge, when we use a
+ * simple slab allocator).  SortTuples also contain the tuple's
  * first key column in Datum/nullflag format, and an index integer.
  *
  * Storing the first key column lets us save heap_getattr or index_getattr
@@ -191,9 +191,8 @@ bool		optimize_bounded_sort = true;
  * it now only distinguishes RUN_FIRST and HEAP_RUN_NEXT, since replacement
  * selection is always abandoned after the first run; no other run number
  * should be represented here.  During merge passes, we re-use it to hold the
- * input tape number that each tuple in the heap was read from, or to hold the
- * index of the next tuple pre-read from the same tape in the case of pre-read
- * entries.  tupindex goes unused if the sort occurs entirely in memory.
+ * input tape number that each tuple in the heap was read from.  tupindex goes
+ * unused if the sort occurs entirely in memory.
  */
 typedef struct
 {
@@ -203,6 +202,24 @@ typedef struct
 	int			tupindex;		/* see notes above */
 } SortTuple;
 
+/*
+ * During merge, we use a pre-allocated set of fixed-size slots to hold
+ * tuples, to avoid palloc/pfree overhead.
+ *
+ * Merge doesn't require a lot of memory, so we can afford to waste some,
+ * by using gratuitously-sized slots.  If a tuple is larger than 1 kB, the
+ * palloc() overhead is not significant anymore.
+ *
+ * 'nextfree' is valid when this chunk is in the free list.  When in use, the
+ * slot holds a tuple.
+ */
+#define SLAB_SLOT_SIZE 1024
+
+typedef union SlabSlot
+{
+	union SlabSlot *nextfree;
+	char		buffer[SLAB_SLOT_SIZE];
+} SlabSlot;
 
 /*
  * Possible states of a Tuplesort object.  These denote the states that
@@ -288,41 +305,28 @@ struct Tuplesortstate
 	/*
 	 * Function to write a stored tuple onto tape.  The representation of the
 	 * tuple on tape need not be the same as it is in memory; requirements on
-	 * the tape representation are given below.  After writing the tuple,
-	 * pfree() the out-of-line data (not the SortTuple struct!), and increase
-	 * state->availMem by the amount of memory space thereby released.
+	 * the tape representation are given below.  Unless the slab allocator is
+	 * used, after writing the tuple, pfree() the out-of-line data (not the
+	 * SortTuple struct!), and increase state->availMem by the amount of memory
+	 * space thereby released.
 	 */
 	void		(*writetup) (Tuplesortstate *state, int tapenum,
 										 SortTuple *stup);
 
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
-	 * the already-read length of the stored tuple.  Create a palloc'd copy,
-	 * initialize tuple/datum1/isnull1 in the target SortTuple struct, and
-	 * decrease state->availMem by the amount of memory space consumed. (See
-	 * batchUsed notes for details on how memory is handled when incremental
-	 * accounting is abandoned.)
+	 * the already-read length of the stored tuple.  The tuple is allocated
+	 * from the slab memory arena, or is palloc'd, see readtup_alloc().
 	 */
 	void		(*readtup) (Tuplesortstate *state, SortTuple *stup,
 										int tapenum, unsigned int len);
 
 	/*
-	 * Function to move a caller tuple.  This is usually implemented as a
-	 * memmove() shim, but function may also perform additional fix-up of
-	 * caller tuple where needed.  Batch memory support requires the movement
-	 * of caller tuples from one location in memory to another.
-	 */
-	void		(*movetup) (void *dest, void *src, unsigned int len);
-
-	/*
 	 * This array holds the tuples now in sort memory.  If we are in state
 	 * INITIAL, the tuples are in no particular order; if we are in state
 	 * SORTEDINMEM, the tuples are in final sorted order; in states BUILDRUNS
 	 * and FINALMERGE, the tuples are organized in "heap" order per Algorithm
-	 * H.  (Note that memtupcount only counts the tuples that are part of the
-	 * heap --- during merge passes, memtuples[] entries beyond tapeRange are
-	 * never in the heap and are used to hold pre-read tuples.)  In state
-	 * SORTEDONTAPE, the array is not used.
+	 * H. In state SORTEDONTAPE, the array is not used.
 	 */
 	SortTuple  *memtuples;		/* array of SortTuple structs */
 	int			memtupcount;	/* number of tuples currently present */
@@ -330,13 +334,42 @@ struct Tuplesortstate
 	bool		growmemtuples;	/* memtuples' growth still underway? */
 
 	/*
-	 * Memory for tuples is sometimes allocated in batch, rather than
-	 * incrementally.  This implies that incremental memory accounting has
-	 * been abandoned.  Currently, this only happens for the final on-the-fly
-	 * merge step.  Large batch allocations can store tuples (e.g.
-	 * IndexTuples) without palloc() fragmentation and other overhead.
+	 * Memory for tuples is sometimes allocated using a simple slab allocator,
+	 * rather than with palloc().  Currently, we switch to slab allocation when
+	 * we start merging.  Merging only needs to keep a small, fixed number of
+	 * tuples in memory at any time, so we can avoid the palloc/pfree overhead
+	 * by recycling a fixed number of fixed-size slots to hold the tuples.
+	 *
+	 * For the slab, we use one large allocation, divided into SLAB_SLOT_SIZE
+	 * slots.  The allocation is sized to have one slot per tape, plus one
+	 * additional slot.  We need that many slots to hold all the tuples kept in
+	 * the heap during merge, plus the one we have last returned from the sort,
+	 * with tuplesort_gettuple.
+	 *
+	 * Initially, all the slots are kept in a linked list of free slots.  When
+	 * a tuple is read from a tape, it is put into the next available slot, if it
+	 * fits.  If the tuple is larger than SLAB_SLOT_SIZE, it is palloc'd instead.
+	 *
+	 * When we're done processing a tuple, we return the slot back to the free
+	 * list, or pfree() if it was palloc'd.  We know that a tuple was allocated
+	 * from the slab, if its pointer value is between slabMemoryBegin and -End.
+	 *
+	 * When the slab allocator is used, the USEMEM/LACKMEM mechanism of tracking
+	 * memory usage is not used.
+	 */
+	bool		slabAllocatorUsed;
+
+	char	   *slabMemoryBegin;	/* beginning of slab memory arena */
+	char	   *slabMemoryEnd;		/* end of slab memory arena */
+	SlabSlot   *slabFreeHead;		/* head of free list */
+
+	/*
+	 * When we return a tuple to the caller in tuplesort_gettuple_XXX, that
+	 * came from a tape (that is, in TSS_SORTEDONTAPE or TSS_FINALMERGE modes),
+	 * we remember the tuple in 'lastReturnedTuple', so that we can recycle the
+	 * memory on next gettuple call.
 	 */
-	bool		batchUsed;
+	void	   *lastReturnedTuple;
 
 	/*
 	 * While building initial runs, this indicates if the replacement
@@ -358,42 +391,11 @@ struct Tuplesortstate
 	 */
 
 	/*
-	 * These variables are only used during merge passes.  mergeactive[i] is
+	 * This variable is only used during merge passes.  mergeactive[i] is
 	 * true if we are reading an input run from (actual) tape number i and
-	 * have not yet exhausted that run.  mergenext[i] is the memtuples index
-	 * of the next pre-read tuple (next to be loaded into the heap) for tape
-	 * i, or 0 if we are out of pre-read tuples.  mergelast[i] similarly
-	 * points to the last pre-read tuple from each tape.  mergeavailslots[i]
-	 * is the number of unused memtuples[] slots reserved for tape i, and
-	 * mergeavailmem[i] is the amount of unused space allocated for tape i.
-	 * mergefreelist and mergefirstfree keep track of unused locations in the
-	 * memtuples[] array.  The memtuples[].tupindex fields link together
-	 * pre-read tuples for each tape as well as recycled locations in
-	 * mergefreelist. It is OK to use 0 as a null link in these lists, because
-	 * memtuples[0] is part of the merge heap and is never a pre-read tuple.
+	 * have not yet exhausted that run.
 	 */
 	bool	   *mergeactive;	/* active input run source? */
-	int		   *mergenext;		/* first preread tuple for each source */
-	int		   *mergelast;		/* last preread tuple for each source */
-	int		   *mergeavailslots;	/* slots left for prereading each tape */
-	int64	   *mergeavailmem;	/* availMem for prereading each tape */
-	int			mergefreelist;	/* head of freelist of recycled slots */
-	int			mergefirstfree; /* first slot never used in this merge */
-
-	/*
-	 * Per-tape batch state, when final on-the-fly merge consumes memory from
-	 * just a few large allocations.
-	 *
-	 * Aside from the general benefits of performing fewer individual retail
-	 * palloc() calls, this also helps make merging more cache efficient,
-	 * since each tape's tuples must naturally be accessed sequentially (in
-	 * sorted order).
-	 */
-	int64		spacePerTape;	/* Space (memory) for tuples (not slots) */
-	char	  **mergetuples;	/* Each tape's memory allocation */
-	char	  **mergecurrent;	/* Current offset into each tape's memory */
-	char	  **mergetail;		/* Last item's start point for each tape */
-	char	  **mergeoverflow;	/* Retail palloc() "overflow" for each tape */
 
 	/*
 	 * Variables for Algorithm D.  Note that destTape is a "logical" tape
@@ -481,12 +483,34 @@ struct Tuplesortstate
 #endif
 };
 
+/*
+ * Is the given tuple allocated from the slab memory arena?
+ */
+#define IS_SLAB_SLOT(state, tuple) \
+	((char *) (tuple) >= (state)->slabMemoryBegin && \
+	 (char *) (tuple) < (state)->slabMemoryEnd)
+
+/*
+ * Return the given tuple to the slab memory free list, or free it
+ * if it was palloc'd.
+ */
+#define RELEASE_SLAB_SLOT(state, tuple) \
+	do { \
+		SlabSlot *buf = (SlabSlot *) (tuple); \
+		\
+		if (IS_SLAB_SLOT((state), buf)) \
+		{ \
+			buf->nextfree = (state)->slabFreeHead; \
+			(state)->slabFreeHead = buf; \
+		} else \
+			pfree(buf); \
+	} while(0)
+
 #define COMPARETUP(state,a,b)	((*(state)->comparetup) (a, b, state))
 #define COPYTUP(state,stup,tup) ((*(state)->copytup) (state, stup, tup))
 #define WRITETUP(state,tape,stup)	((*(state)->writetup) (state, tape, stup))
 #define READTUP(state,stup,tape,len) ((*(state)->readtup) (state, stup, tape, len))
-#define MOVETUP(dest,src,len) ((*(state)->movetup) (dest, src, len))
-#define LACKMEM(state)		((state)->availMem < 0 && !(state)->batchUsed)
+#define LACKMEM(state)		((state)->availMem < 0 && !(state)->slabAllocatorUsed)
 #define USEMEM(state,amt)	((state)->availMem -= (amt))
 #define FREEMEM(state,amt)	((state)->availMem += (amt))
 
@@ -553,16 +577,8 @@ static void inittapes(Tuplesortstate *state);
 static void selectnewtape(Tuplesortstate *state);
 static void mergeruns(Tuplesortstate *state);
 static void mergeonerun(Tuplesortstate *state);
-static void beginmerge(Tuplesortstate *state, bool finalMergeBatch);
-static void batchmemtuples(Tuplesortstate *state);
-static void mergebatch(Tuplesortstate *state, int64 spacePerTape);
-static void mergebatchone(Tuplesortstate *state, int srcTape,
-			  SortTuple *stup, bool *should_free);
-static void mergebatchfreetape(Tuplesortstate *state, int srcTape,
-				   SortTuple *rtup, bool *should_free);
-static void *mergebatchalloc(Tuplesortstate *state, int tapenum, Size tuplen);
-static void mergepreread(Tuplesortstate *state);
-static void mergeprereadone(Tuplesortstate *state, int srcTape);
+static void beginmerge(Tuplesortstate *state);
+static bool mergereadnext(Tuplesortstate *state, int srcTape, SortTuple *stup);
 static void dumptuples(Tuplesortstate *state, bool alltuples);
 static void dumpbatch(Tuplesortstate *state, bool alltuples);
 static void make_bounded_heap(Tuplesortstate *state);
@@ -576,7 +592,7 @@ static void tuplesort_heap_delete_top(Tuplesortstate *state, bool checkIndex);
 static void reversedirection(Tuplesortstate *state);
 static unsigned int getlen(Tuplesortstate *state, int tapenum, bool eofOK);
 static void markrunend(Tuplesortstate *state, int tapenum);
-static void *readtup_alloc(Tuplesortstate *state, int tapenum, Size tuplen);
+static void *readtup_alloc(Tuplesortstate *state, Size tuplen);
 static int comparetup_heap(const SortTuple *a, const SortTuple *b,
 				Tuplesortstate *state);
 static void copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup);
@@ -584,7 +600,6 @@ static void writetup_heap(Tuplesortstate *state, int tapenum,
 			  SortTuple *stup);
 static void readtup_heap(Tuplesortstate *state, SortTuple *stup,
 			 int tapenum, unsigned int len);
-static void movetup_heap(void *dest, void *src, unsigned int len);
 static int comparetup_cluster(const SortTuple *a, const SortTuple *b,
 				   Tuplesortstate *state);
 static void copytup_cluster(Tuplesortstate *state, SortTuple *stup, void *tup);
@@ -592,7 +607,6 @@ static void writetup_cluster(Tuplesortstate *state, int tapenum,
 				 SortTuple *stup);
 static void readtup_cluster(Tuplesortstate *state, SortTuple *stup,
 				int tapenum, unsigned int len);
-static void movetup_cluster(void *dest, void *src, unsigned int len);
 static int comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 					   Tuplesortstate *state);
 static int comparetup_index_hash(const SortTuple *a, const SortTuple *b,
@@ -602,7 +616,6 @@ static void writetup_index(Tuplesortstate *state, int tapenum,
 			   SortTuple *stup);
 static void readtup_index(Tuplesortstate *state, SortTuple *stup,
 			  int tapenum, unsigned int len);
-static void movetup_index(void *dest, void *src, unsigned int len);
 static int comparetup_datum(const SortTuple *a, const SortTuple *b,
 				 Tuplesortstate *state);
 static void copytup_datum(Tuplesortstate *state, SortTuple *stup, void *tup);
@@ -610,7 +623,6 @@ static void writetup_datum(Tuplesortstate *state, int tapenum,
 			   SortTuple *stup);
 static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
 			  int tapenum, unsigned int len);
-static void movetup_datum(void *dest, void *src, unsigned int len);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
 
 /*
@@ -662,10 +674,10 @@ tuplesort_begin_common(int workMem, bool randomAccess)
 	 * Caller tuple (e.g. IndexTuple) memory context.
 	 *
 	 * A dedicated child context used exclusively for caller passed tuples
-	 * eases memory management.  Resetting at key points reduces
-	 * fragmentation. Note that the memtuples array of SortTuples is allocated
-	 * in the parent context, not this context, because there is no need to
-	 * free memtuples early.
+	 * eases memory management.  Destroying it once we're done building
+	 * the initial runs reduces fragmentation.  Note that the memtuples array
+	 * of SortTuples is allocated in the parent context, not this context,
+	 * because there is no need to free memtuples early.
 	 */
 	tuplecontext = AllocSetContextCreate(sortcontext,
 										 "Caller tuples",
@@ -705,7 +717,7 @@ tuplesort_begin_common(int workMem, bool randomAccess)
 						ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
 
 	state->growmemtuples = true;
-	state->batchUsed = false;
+	state->slabAllocatorUsed = false;
 	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
 
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
@@ -762,7 +774,6 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
 	state->readtup = readtup_heap;
-	state->movetup = movetup_heap;
 
 	state->tupDesc = tupDesc;	/* assume we need not copy tupDesc */
 	state->abbrevNext = 10;
@@ -835,7 +846,6 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 	state->copytup = copytup_cluster;
 	state->writetup = writetup_cluster;
 	state->readtup = readtup_cluster;
-	state->movetup = movetup_cluster;
 	state->abbrevNext = 10;
 
 	state->indexInfo = BuildIndexInfo(indexRel);
@@ -927,7 +937,6 @@ tuplesort_begin_index_btree(Relation heapRel,
 	state->copytup = copytup_index;
 	state->writetup = writetup_index;
 	state->readtup = readtup_index;
-	state->movetup = movetup_index;
 	state->abbrevNext = 10;
 
 	state->heapRel = heapRel;
@@ -995,7 +1004,6 @@ tuplesort_begin_index_hash(Relation heapRel,
 	state->copytup = copytup_index;
 	state->writetup = writetup_index;
 	state->readtup = readtup_index;
-	state->movetup = movetup_index;
 
 	state->heapRel = heapRel;
 	state->indexRel = indexRel;
@@ -1038,7 +1046,6 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	state->copytup = copytup_datum;
 	state->writetup = writetup_datum;
 	state->readtup = readtup_datum;
-	state->movetup = movetup_datum;
 	state->abbrevNext = 10;
 
 	state->datumType = datumType;
@@ -1838,7 +1845,7 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 	{
 		case TSS_SORTEDINMEM:
 			Assert(forward || state->randomAccess);
-			Assert(!state->batchUsed);
+			Assert(!state->slabAllocatorUsed);
 			*should_free = false;
 			if (forward)
 			{
@@ -1883,15 +1890,35 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 
 		case TSS_SORTEDONTAPE:
 			Assert(forward || state->randomAccess);
-			Assert(!state->batchUsed);
-			*should_free = true;
+			Assert(state->slabAllocatorUsed);
+
+			/*
+			 * The slot that held the tuple that we returned in previous
+			 * gettuple call can now be reused.
+			 */
+			if (state->lastReturnedTuple)
+			{
+				RELEASE_SLAB_SLOT(state, state->lastReturnedTuple);
+				state->lastReturnedTuple = NULL;
+			}
+
 			if (forward)
 			{
 				if (state->eof_reached)
 					return false;
+
 				if ((tuplen = getlen(state, state->result_tape, true)) != 0)
 				{
 					READTUP(state, stup, state->result_tape, tuplen);
+
+					/*
+					 * Remember the tuple we return, so that we can recycle its
+					 * memory on next call.  (This can be NULL, in the !state->tuples
+					 * case).
+					 */
+					state->lastReturnedTuple = stup->tuple;
+
+					*should_free = false;
 					return true;
 				}
 				else
@@ -1965,74 +1992,70 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 									  tuplen))
 				elog(ERROR, "bogus tuple length in backward scan");
 			READTUP(state, stup, state->result_tape, tuplen);
+
+			/*
+			 * Remember the tuple we return, so that we can recycle its
+			 * memory on next call. (This can be NULL, in the Datum case).
+			 */
+			state->lastReturnedTuple = stup->tuple;
+
+			*should_free = false;
 			return true;
 
 		case TSS_FINALMERGE:
 			Assert(forward);
-			Assert(state->batchUsed || !state->tuples);
-			/* For now, assume tuple is stored in tape's batch memory */
+			/* We are managing memory ourselves, with the slab allocator. */
+			Assert(state->slabAllocatorUsed);
 			*should_free = false;
 
 			/*
+			 * The slab slot holding the tuple that we returned in previous
+			 * gettuple call can now be reused.
+			 */
+			if (state->lastReturnedTuple)
+			{
+				RELEASE_SLAB_SLOT(state, state->lastReturnedTuple);
+				state->lastReturnedTuple = NULL;
+			}
+
+			/*
 			 * This code should match the inner loop of mergeonerun().
 			 */
 			if (state->memtupcount > 0)
 			{
 				int			srcTape = state->memtuples[0].tupindex;
-				int			tupIndex;
-				SortTuple  *newtup;
+				SortTuple	newtup;
+
+				*stup = state->memtuples[0];
 
 				/*
-				 * Returned tuple is still counted in our memory space most of
-				 * the time.  See mergebatchone() for discussion of why caller
-				 * may occasionally be required to free returned tuple, and
-				 * how preread memory is managed with regard to edge cases
-				 * more generally.
+				 * Remember the tuple we return, so that we can recycle its
+				 * memory on next call. (This can be NULL, in the Datum case).
 				 */
-				*stup = state->memtuples[0];
-				if ((tupIndex = state->mergenext[srcTape]) == 0)
+				state->lastReturnedTuple = stup->tuple;
+
+				/*
+				 * Pull next tuple from tape, and replace the returned tuple
+				 * at top of the heap with it.
+				 */
+				if (!mergereadnext(state, srcTape, &newtup))
 				{
 					/*
-					 * out of preloaded data on this tape, try to read more
-					 *
-					 * Unlike mergeonerun(), we only preload from the single
-					 * tape that's run dry, though not before preparing its
-					 * batch memory for a new round of sequential consumption.
-					 * See mergepreread() comments.
+					 * If no more data, we've reached end of run on this tape.
+					 * Remove the top node from the heap.
 					 */
-					if (state->batchUsed)
-						mergebatchone(state, srcTape, stup, should_free);
-
-					mergeprereadone(state, srcTape);
+					tuplesort_heap_delete_top(state, false);
 
 					/*
-					 * if still no data, we've reached end of run on this tape
+					 * Rewind to free the read buffer.  It'd go away at the
+					 * end of the sort anyway, but better to release the
+					 * memory early.
 					 */
-					if ((tupIndex = state->mergenext[srcTape]) == 0)
-					{
-						/* Remove the top node from the heap */
-						tuplesort_heap_delete_top(state, false);
-						/* Free tape's buffer, avoiding dangling pointer */
-						if (state->batchUsed)
-							mergebatchfreetape(state, srcTape, stup, should_free);
-						return true;
-					}
+					LogicalTapeRewind(state->tapeset, srcTape, true);
+					return true;
 				}
-
-				/*
-				 * pull next preread tuple from list, and replace the returned
-				 * tuple at top of the heap with it.
-				 */
-				newtup = &state->memtuples[tupIndex];
-				state->mergenext[srcTape] = newtup->tupindex;
-				if (state->mergenext[srcTape] == 0)
-					state->mergelast[srcTape] = 0;
-				newtup->tupindex = srcTape;
-				tuplesort_heap_replace_top(state, newtup, false);
-				/* put the now-unused memtuples entry on the freelist */
-				newtup->tupindex = state->mergefreelist;
-				state->mergefreelist = tupIndex;
-				state->mergeavailslots[srcTape]++;
+				newtup.tupindex = srcTape;
+				tuplesort_heap_replace_top(state, &newtup, false);
 				return true;
 			}
 			return false;
@@ -2317,13 +2340,6 @@ inittapes(Tuplesortstate *state)
 	/* Compute number of tapes to use: merge order plus 1 */
 	maxTapes = tuplesort_merge_order(state->allowedMem) + 1;
 
-	/*
-	 * We must have at least 2*maxTapes slots in the memtuples[] array, else
-	 * we'd not have room for merge heap plus preread.  It seems unlikely that
-	 * this case would ever occur, but be safe.
-	 */
-	maxTapes = Min(maxTapes, state->memtupsize / 2);
-
 	state->maxTapes = maxTapes;
 	state->tapeRange = maxTapes - 1;
 
@@ -2334,13 +2350,13 @@ inittapes(Tuplesortstate *state)
 #endif
 
 	/*
-	 * Decrease availMem to reflect the space needed for tape buffers; but
-	 * don't decrease it to the point that we have no room for tuples. (That
-	 * case is only likely to occur if sorting pass-by-value Datums; in all
-	 * other scenarios the memtuples[] array is unlikely to occupy more than
-	 * half of allowedMem.  In the pass-by-value case it's not important to
-	 * account for tuple space, so we don't care if LACKMEM becomes
-	 * inaccurate.)
+	 * Decrease availMem to reflect the space needed for tape buffers, when
+	 * writing the initial runs; but don't decrease it to the point that we
+	 * have no room for tuples. (That case is only likely to occur if sorting
+	 * pass-by-value Datums; in all other scenarios the memtuples[] array is
+	 * unlikely to occupy more than half of allowedMem.  In the pass-by-value
+	 * case it's not important to account for tuple space, so we don't care
+	 * if LACKMEM becomes inaccurate.)
 	 */
 	tapeSpace = (int64) maxTapes *TAPE_BUFFER_OVERHEAD;
 
@@ -2359,14 +2375,6 @@ inittapes(Tuplesortstate *state)
 	state->tapeset = LogicalTapeSetCreate(maxTapes);
 
 	state->mergeactive = (bool *) palloc0(maxTapes * sizeof(bool));
-	state->mergenext = (int *) palloc0(maxTapes * sizeof(int));
-	state->mergelast = (int *) palloc0(maxTapes * sizeof(int));
-	state->mergeavailslots = (int *) palloc0(maxTapes * sizeof(int));
-	state->mergeavailmem = (int64 *) palloc0(maxTapes * sizeof(int64));
-	state->mergetuples = (char **) palloc0(maxTapes * sizeof(char *));
-	state->mergecurrent = (char **) palloc0(maxTapes * sizeof(char *));
-	state->mergetail = (char **) palloc0(maxTapes * sizeof(char *));
-	state->mergeoverflow = (char **) palloc0(maxTapes * sizeof(char *));
 	state->tp_fib = (int *) palloc0(maxTapes * sizeof(int));
 	state->tp_runs = (int *) palloc0(maxTapes * sizeof(int));
 	state->tp_dummy = (int *) palloc0(maxTapes * sizeof(int));
@@ -2466,6 +2474,104 @@ selectnewtape(Tuplesortstate *state)
 }
 
 /*
+ * Initialize the slab allocation arena, for the given number of slots.
+ */
+static void
+initSlabAllocator(Tuplesortstate *state, int numSlots)
+{
+	if (numSlots > 0)
+	{
+		char	   *p;
+		int			i;
+
+		state->slabMemoryBegin = palloc(numSlots * SLAB_SLOT_SIZE);
+		state->slabMemoryEnd = state->slabMemoryBegin +
+			numSlots * SLAB_SLOT_SIZE;
+		state->slabFreeHead = (SlabSlot *) state->slabMemoryBegin;
+		USEMEM(state, numSlots * SLAB_SLOT_SIZE);
+
+		p = state->slabMemoryBegin;
+		for (i = 0; i < numSlots - 1; i++)
+		{
+			((SlabSlot *) p)->nextfree = (SlabSlot *) (p + SLAB_SLOT_SIZE);
+			p += SLAB_SLOT_SIZE;
+		}
+		((SlabSlot *) p)->nextfree = NULL;
+	}
+	else
+	{
+		state->slabMemoryBegin = state->slabMemoryEnd = NULL;
+		state->slabFreeHead = NULL;
+	}
+	state->slabAllocatorUsed = true;
+}
+
+/*
+ * Divide all remaining work memory (availMem) into read buffers for the
+ * tapes that will be used during the merge.
+ *
+ * We use the number of possible *input* tapes here, rather than maxTapes,
+ * for the calculation.  At all times, we'll be reading from at most
+ * numInputTapes tapes, and one tape is used for output (unless we do an
+ * on-the-fly final merge, in which case we don't have an output tape).
+ */
+static void
+initTapeBuffers(Tuplesortstate *state, int numInputTapes)
+{
+	int64		availBlocks;
+	int64		blocksPerTape;
+	int			remainder;
+	int			tapenum;
+
+	/*
+	 * Divide availMem evenly among the number of input tapes.
+	 */
+	availBlocks = state->availMem / BLCKSZ;
+	blocksPerTape = availBlocks / numInputTapes;
+	remainder = availBlocks % numInputTapes;
+	USEMEM(state, availBlocks * BLCKSZ);
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG, "using " INT64_FORMAT " KB of memory for read buffers among %d input tapes",
+			 (int64) (availBlocks * BLCKSZ) / 1024, numInputTapes);
+#endif
+
+	/*
+	 * Use one page per tape, even if we are out of memory. tuplesort_merge_order()
+	 * should've chosen the number of tapes so that this can't happen, but better
+	 * safe than sorry.  (This also protects from a negative availMem.)
+	 */
+	if (blocksPerTape < 1)
+	{
+		blocksPerTape = 1;
+		remainder = 0;
+	}
+
+	/*
+	 * Set the buffers for the tapes.
+	 *
+	 * In a multi-phase merge, the tape that is initially used as an output
+	 * tape, will later be rewound and read from, and should also use a large
+	 * buffer at that point.  So we must loop up to maxTapes, not just
+	 * numInputTapes!
+	 *
+	 * If there are fewer runs than tapes, we will set the buffer size also
+	 * for tapes that will go completely unused, but that's harmless.
+	 * LogicalTapeAssignReadBufferSize() doesn't allocate the buffer
+	 * immediately, it just sets the size that will be used, when the tape is
+	 * rewound for read, and the tape isn't empty.
+	 */
+	for (tapenum = 0; tapenum < state->maxTapes; tapenum++)
+	{
+		int64		numBlocks = blocksPerTape + (tapenum < remainder ? 1 : 0);
+
+		LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
+										numBlocks * BLCKSZ);
+	}
+}
+
+/*
  * mergeruns -- merge all the completed initial runs.
  *
  * This implements steps D5, D6 of Algorithm D.  All input data has
@@ -2478,6 +2584,8 @@ mergeruns(Tuplesortstate *state)
 				svTape,
 				svRuns,
 				svDummy;
+	int			numTapes;
+	int			numInputTapes;
 
 	Assert(state->status == TSS_BUILDRUNS);
 	Assert(state->memtupcount == 0);
@@ -2499,6 +2607,64 @@ mergeruns(Tuplesortstate *state)
 	}
 
 	/*
+	 * Delete the tuple memory context.  We've freed all the tuples that we
+	 * previously allocated, and will use the slab allocator from now on.
+	 */
+	MemoryContextDelete(state->tuplecontext);
+	state->tuplecontext = NULL;
+
+	/*
+	 * We no longer need a large memtuples array.  Free it, to make the
+	 * memory available for other use; a new array with only one slot per
+	 * input tape is allocated below.
+	 */
+	FREEMEM(state, GetMemoryChunkSpace(state->memtuples));
+	pfree(state->memtuples);
+
+	/*
+	 * If we had fewer runs than tapes, refund the memory that we imagined we
+	 * would need for the tape buffers of the unused tapes.
+	 *
+	 * numTapes and numInputTapes reflect the actual number of tapes we will
+	 * use.  Note that the output tape's tape number is maxTapes - 1, so the
+	 * tape numbers of the used tapes are not consecutive, and you cannot
+	 * just loop from 0 to numTapes to visit all used tapes!
+	 */
+	if (state->Level == 1)
+	{
+		numInputTapes = state->currentRun;
+		numTapes = numInputTapes + 1;
+		FREEMEM(state, (state->maxTapes - numTapes) * TAPE_BUFFER_OVERHEAD);
+	}
+	else
+	{
+		numInputTapes = state->maxTapes - 1;
+		numTapes = state->maxTapes;
+	}
+
+	/*
+	 * Allocate a new 'memtuples' array, for the heap.  It will hold one tuple
+	 * from each input tape.
+	 */
+	state->memtupsize = numInputTapes;
+	state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
+	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
+
+	/*
+	 * Initialize the slab allocator.  We need one slab slot per input tape, for
+	 * the tuples in the heap, plus one to hold the tuple last returned from
+	 * tuplesort_gettuple.  (If we're sorting pass-by-val Datums, however, we don't
+	 * need to allocate anything.)
+	 *
+	 * From this point on, we no longer use the USEMEM()/LACKMEM() mechanism to
+	 * track memory usage of individual tuples.
+	 */
+	if (state->tuples)
+		initSlabAllocator(state, numInputTapes + 1);
+	else
+		initSlabAllocator(state, 0);
+
+	/*
 	 * If we produced only one initial run (quite likely if the total data
 	 * volume is between 1X and 2X workMem when replacement selection is used,
 	 * but something we particular count on when input is presorted), we can
@@ -2514,6 +2680,27 @@ mergeruns(Tuplesortstate *state)
 		return;
 	}
 
+	/*
+	 * Use all the spare memory we have available for read buffers for the
+	 * tapes.
+	 *
+	 * We do this only after checking for the case that we produced only one
+	 * initial run, because there is no need to use a large read buffer when
+	 * we're reading from a single tape.  With one tape, the I/O pattern will
+	 * be the same regardless of the buffer size.
+	 *
+	 * We don't try to "rebalance" the amount of memory among tapes, when we
+	 * start a new merge phase, even if some tapes can be inactive in the
+	 * phase.  That would be hard, because logtape.c doesn't know where one
+	 * run ends and another begins.  When a new merge phase begins, and a tape
+	 * doesn't participate in it, its buffer nevertheless already contains
+	 * tuples from the next run on same tape, so we cannot release the buffer.
+	 * That's OK in practice: merge performance isn't that sensitive to the
+	 * amount of buffers used, and most merge phases use all or almost all
+	 * tapes, anyway.
+	 */
+	initTapeBuffers(state, numInputTapes);
+
 	/* End of step D2: rewind all output tapes to prepare for merging */
 	for (tapenum = 0; tapenum < state->tapeRange; tapenum++)
 		LogicalTapeRewind(state->tapeset, tapenum, false);
@@ -2544,7 +2731,7 @@ mergeruns(Tuplesortstate *state)
 				/* Tell logtape.c we won't be writing anymore */
 				LogicalTapeSetForgetFreeSpace(state->tapeset);
 				/* Initialize for the final merge pass */
-				beginmerge(state, state->tuples);
+				beginmerge(state);
 				state->status = TSS_FINALMERGE;
 				return;
 			}
@@ -2614,6 +2801,14 @@ mergeruns(Tuplesortstate *state)
 	state->result_tape = state->tp_tapenum[state->tapeRange];
 	LogicalTapeFreeze(state->tapeset, state->result_tape);
 	state->status = TSS_SORTEDONTAPE;
+
+	/* Release the read buffers on all the other tapes, by rewinding them. */
+	for (tapenum = 0; tapenum < state->maxTapes; tapenum++)
+	{
+		if (tapenum == state->result_tape)
+			continue;
+		LogicalTapeRewind(state->tapeset, tapenum, true);
+	}
 }
 
 /*
@@ -2627,16 +2822,12 @@ mergeonerun(Tuplesortstate *state)
 {
 	int			destTape = state->tp_tapenum[state->tapeRange];
 	int			srcTape;
-	int			tupIndex;
-	SortTuple  *tup;
-	int64		priorAvail,
-				spaceFreed;
 
 	/*
 	 * Start the merge by loading one tuple from each active source tape into
 	 * the heap.  We can also decrease the input run/dummy run counts.
 	 */
-	beginmerge(state, false);
+	beginmerge(state);
 
 	/*
 	 * Execute merge by repeatedly extracting lowest tuple in heap, writing it
@@ -2645,52 +2836,31 @@ mergeonerun(Tuplesortstate *state)
 	 */
 	while (state->memtupcount > 0)
 	{
+		SortTuple stup;
+
 		/* write the tuple to destTape */
-		priorAvail = state->availMem;
 		srcTape = state->memtuples[0].tupindex;
 		WRITETUP(state, destTape, &state->memtuples[0]);
-		/* writetup adjusted total free space, now fix per-tape space */
-		spaceFreed = state->availMem - priorAvail;
-		state->mergeavailmem[srcTape] += spaceFreed;
-		if ((tupIndex = state->mergenext[srcTape]) == 0)
-		{
-			/* out of preloaded data on this tape, try to read more */
-			mergepreread(state);
-			/* if still no data, we've reached end of run on this tape */
-			if ((tupIndex = state->mergenext[srcTape]) == 0)
-			{
-				/* remove the written-out tuple from the heap */
-				tuplesort_heap_delete_top(state, false);
-				continue;
-			}
-		}
+
+		/* recycle the slot of the tuple we just wrote out, for the next read */
+		RELEASE_SLAB_SLOT(state, state->memtuples[0].tuple);
 
 		/*
 		 * pull next preread tuple from list, and replace the written-out
 		 * tuple in the heap with it.
 		 */
-		tup = &state->memtuples[tupIndex];
-		state->mergenext[srcTape] = tup->tupindex;
-		if (state->mergenext[srcTape] == 0)
-			state->mergelast[srcTape] = 0;
-		tup->tupindex = srcTape;
-		tuplesort_heap_replace_top(state, tup, false);
-		/* put the now-unused memtuples entry on the freelist */
-		tup->tupindex = state->mergefreelist;
-		state->mergefreelist = tupIndex;
-		state->mergeavailslots[srcTape]++;
+		if (!mergereadnext(state, srcTape, &stup))
+		{
+			/* we've reached end of run on this tape */
+			/* remove the written-out tuple from the heap */
+			tuplesort_heap_delete_top(state, false);
+			continue;
+		}
+		stup.tupindex = srcTape;
+		tuplesort_heap_replace_top(state, &stup, false);
 	}
 
 	/*
-	 * Reset tuple memory.  We've freed all of the tuples that we previously
-	 * allocated, but AllocSetFree will have put those chunks of memory on
-	 * particular free lists, bucketed by size class.  Thus, although all of
-	 * that memory is free, it is effectively fragmented.  Resetting the
-	 * context gets us out from under that problem.
-	 */
-	MemoryContextReset(state->tuplecontext);
-
-	/*
 	 * When the heap empties, we're done.  Write an end-of-run marker on the
 	 * output tape, and increment its count of real runs.
 	 */
@@ -2711,18 +2881,13 @@ mergeonerun(Tuplesortstate *state)
  * which tapes contain active input runs in mergeactive[].  Then, load
  * as many tuples as we can from each active input tape, and finally
  * fill the merge heap with the first tuple from each active tape.
- *
- * finalMergeBatch indicates if this is the beginning of a final on-the-fly
- * merge where a batched allocation of tuple memory is required.
  */
 static void
-beginmerge(Tuplesortstate *state, bool finalMergeBatch)
+beginmerge(Tuplesortstate *state)
 {
 	int			activeTapes;
 	int			tapenum;
 	int			srcTape;
-	int			slotsPerTape;
-	int64		spacePerTape;
 
 	/* Heap should be empty here */
 	Assert(state->memtupcount == 0);
@@ -2746,517 +2911,47 @@ beginmerge(Tuplesortstate *state, bool finalMergeBatch)
 	}
 	state->activeTapes = activeTapes;
 
-	/* Clear merge-pass state variables */
-	memset(state->mergenext, 0,
-		   state->maxTapes * sizeof(*state->mergenext));
-	memset(state->mergelast, 0,
-		   state->maxTapes * sizeof(*state->mergelast));
-	state->mergefreelist = 0;	/* nothing in the freelist */
-	state->mergefirstfree = activeTapes;		/* 1st slot avail for preread */
-
-	if (finalMergeBatch)
-	{
-		/* Free outright buffers for tape never actually allocated */
-		FREEMEM(state, (state->maxTapes - activeTapes) * TAPE_BUFFER_OVERHEAD);
-
-		/*
-		 * Grow memtuples one last time, since the palloc() overhead no longer
-		 * incurred can make a big difference
-		 */
-		batchmemtuples(state);
-	}
-
 	/*
 	 * Initialize space allocation to let each active input tape have an equal
 	 * share of preread space.
 	 */
 	Assert(activeTapes > 0);
-	slotsPerTape = (state->memtupsize - state->mergefirstfree) / activeTapes;
-	Assert(slotsPerTape > 0);
-	spacePerTape = MAXALIGN_DOWN(state->availMem / activeTapes);
-	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
-	{
-		if (state->mergeactive[srcTape])
-		{
-			state->mergeavailslots[srcTape] = slotsPerTape;
-			state->mergeavailmem[srcTape] = spacePerTape;
-		}
-	}
-
-	/*
-	 * Preallocate tuple batch memory for each tape.  This is the memory used
-	 * for tuples themselves (not SortTuples), so it's never used by
-	 * pass-by-value datum sorts.  Memory allocation is performed here at most
-	 * once per sort, just in advance of the final on-the-fly merge step.
-	 */
-	if (finalMergeBatch)
-		mergebatch(state, spacePerTape);
-
-	/*
-	 * Preread as many tuples as possible (and at least one) from each active
-	 * tape
-	 */
-	mergepreread(state);
 
 	/* Load the merge heap with the first tuple from each input tape */
 	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
 	{
-		int			tupIndex = state->mergenext[srcTape];
-		SortTuple  *tup;
-
-		if (tupIndex)
-		{
-			tup = &state->memtuples[tupIndex];
-			state->mergenext[srcTape] = tup->tupindex;
-			if (state->mergenext[srcTape] == 0)
-				state->mergelast[srcTape] = 0;
-			tup->tupindex = srcTape;
-			tuplesort_heap_insert(state, tup, false);
-			/* put the now-unused memtuples entry on the freelist */
-			tup->tupindex = state->mergefreelist;
-			state->mergefreelist = tupIndex;
-			state->mergeavailslots[srcTape]++;
-
-#ifdef TRACE_SORT
-			if (trace_sort && finalMergeBatch)
-			{
-				int64		perTapeKB = (spacePerTape + 1023) / 1024;
-				int64		usedSpaceKB;
-				int			usedSlots;
-
-				/*
-				 * Report how effective batchmemtuples() was in balancing the
-				 * number of slots against the need for memory for the
-				 * underlying tuples (e.g. IndexTuples).  The big preread of
-				 * all tapes when switching to FINALMERGE state should be
-				 * fairly representative of memory utilization during the
-				 * final merge step, and in any case is the only point at
-				 * which all tapes are guaranteed to have depleted either
-				 * their batch memory allowance or slot allowance.  Ideally,
-				 * both will be completely depleted for every tape by now.
-				 */
-				usedSpaceKB = (state->mergecurrent[srcTape] -
-							   state->mergetuples[srcTape] + 1023) / 1024;
-				usedSlots = slotsPerTape - state->mergeavailslots[srcTape];
-
-				elog(LOG, "tape %d initially used " INT64_FORMAT " KB of "
-					 INT64_FORMAT " KB batch (%2.3f) and %d out of %d slots "
-					 "(%2.3f)", srcTape,
-					 usedSpaceKB, perTapeKB,
-					 (double) usedSpaceKB / (double) perTapeKB,
-					 usedSlots, slotsPerTape,
-					 (double) usedSlots / (double) slotsPerTape);
-			}
-#endif
-		}
-	}
-}
-
-/*
- * batchmemtuples - grow memtuples without palloc overhead
- *
- * When called, availMem should be approximately the amount of memory we'd
- * require to allocate memtupsize - memtupcount tuples (not SortTuples/slots)
- * that were allocated with palloc() overhead, and in doing so use up all
- * allocated slots.  However, though slots and tuple memory is in balance
- * following the last grow_memtuples() call, that's predicated on the observed
- * average tuple size for the "final" grow_memtuples() call, which includes
- * palloc overhead.  During the final merge pass, where we will arrange to
- * squeeze out the palloc overhead, we might need more slots in the memtuples
- * array.
- *
- * To make that happen, arrange for the amount of remaining memory to be
- * exactly equal to the palloc overhead multiplied by the current size of
- * the memtuples array, force the grow_memtuples flag back to true (it's
- * probably but not necessarily false on entry to this routine), and then
- * call grow_memtuples.  This simulates loading enough tuples to fill the
- * whole memtuples array and then having some space left over because of the
- * elided palloc overhead.  We expect that grow_memtuples() will conclude that
- * it can't double the size of the memtuples array but that it can increase
- * it by some percentage; but if it does decide to double it, that just means
- * that we've never managed to use many slots in the memtuples array, in which
- * case doubling it shouldn't hurt anything anyway.
- */
-static void
-batchmemtuples(Tuplesortstate *state)
-{
-	int64		refund;
-	int64		availMemLessRefund;
-	int			memtupsize = state->memtupsize;
-
-	/* Caller error if we have no tapes */
-	Assert(state->activeTapes > 0);
-
-	/* For simplicity, assume no memtuples are actually currently counted */
-	Assert(state->memtupcount == 0);
-
-	/*
-	 * Refund STANDARDCHUNKHEADERSIZE per tuple.
-	 *
-	 * This sometimes fails to make memory use perfectly balanced, but it
-	 * should never make the situation worse.  Note that Assert-enabled builds
-	 * get a larger refund, due to a varying STANDARDCHUNKHEADERSIZE.
-	 */
-	refund = memtupsize * STANDARDCHUNKHEADERSIZE;
-	availMemLessRefund = state->availMem - refund;
-
-	/*
-	 * We need to be sure that we do not cause LACKMEM to become true, else
-	 * the batch allocation size could be calculated as negative, causing
-	 * havoc.  Hence, if availMemLessRefund is negative at this point, we must
-	 * do nothing.  Moreover, if it's positive but rather small, there's
-	 * little point in proceeding because we could only increase memtuples by
-	 * a small amount, not worth the cost of the repalloc's.  We somewhat
-	 * arbitrarily set the threshold at ALLOCSET_DEFAULT_INITSIZE per tape.
-	 * (Note that this does not represent any assumption about tuple sizes.)
-	 */
-	if (availMemLessRefund <=
-		(int64) state->activeTapes * ALLOCSET_DEFAULT_INITSIZE)
-		return;
-
-	/*
-	 * To establish balanced memory use after refunding palloc overhead,
-	 * temporarily have our accounting indicate that we've allocated all
-	 * memory we're allowed to less that refund, and call grow_memtuples() to
-	 * have it increase the number of slots.
-	 */
-	state->growmemtuples = true;
-	USEMEM(state, availMemLessRefund);
-	(void) grow_memtuples(state);
-	state->growmemtuples = false;
-	/* availMem must stay accurate for spacePerTape calculation */
-	FREEMEM(state, availMemLessRefund);
-	if (LACKMEM(state))
-		elog(ERROR, "unexpected out-of-memory situation in tuplesort");
-
-#ifdef TRACE_SORT
-	if (trace_sort)
-	{
-		Size		OldKb = (memtupsize * sizeof(SortTuple) + 1023) / 1024;
-		Size		NewKb = (state->memtupsize * sizeof(SortTuple) + 1023) / 1024;
-
-		elog(LOG, "grew memtuples %1.2fx from %d (%zu KB) to %d (%zu KB) for final merge",
-			 (double) NewKb / (double) OldKb,
-			 memtupsize, OldKb,
-			 state->memtupsize, NewKb);
-	}
-#endif
-}
-
-/*
- * mergebatch - initialize tuple memory in batch
- *
- * This allows sequential access to sorted tuples buffered in memory from
- * tapes/runs on disk during a final on-the-fly merge step.  Note that the
- * memory is not used for SortTuples, but for the underlying tuples (e.g.
- * MinimalTuples).
- *
- * Note that when batch memory is used, there is a simple division of space
- * into large buffers (one per active tape).  The conventional incremental
- * memory accounting (calling USEMEM() and FREEMEM()) is abandoned.  Instead,
- * when each tape's memory budget is exceeded, a retail palloc() "overflow" is
- * performed, which is then immediately detected in a way that is analogous to
- * LACKMEM().  This keeps each tape's use of memory fair, which is always a
- * goal.
- */
-static void
-mergebatch(Tuplesortstate *state, int64 spacePerTape)
-{
-	int			srcTape;
-
-	Assert(state->activeTapes > 0);
-	Assert(state->tuples);
-
-	/*
-	 * For the purposes of tuplesort's memory accounting, the batch allocation
-	 * is special, and regular memory accounting through USEMEM() calls is
-	 * abandoned (see mergeprereadone()).
-	 */
-	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
-	{
-		char	   *mergetuples;
-
-		if (!state->mergeactive[srcTape])
-			continue;
-
-		/* Allocate buffer for each active tape */
-		mergetuples = MemoryContextAllocHuge(state->tuplecontext,
-											 spacePerTape);
-
-		/* Initialize state for tape */
-		state->mergetuples[srcTape] = mergetuples;
-		state->mergecurrent[srcTape] = mergetuples;
-		state->mergetail[srcTape] = mergetuples;
-		state->mergeoverflow[srcTape] = NULL;
-	}
-
-	state->batchUsed = true;
-	state->spacePerTape = spacePerTape;
-}
-
-/*
- * mergebatchone - prepare batch memory for one merge input tape
- *
- * This is called following the exhaustion of preread tuples for one input
- * tape.  All that actually occurs is that the state for the source tape is
- * reset to indicate that all memory may be reused.
- *
- * This routine must deal with fixing up the tuple that is about to be returned
- * to the client, due to "overflow" allocations.
- */
-static void
-mergebatchone(Tuplesortstate *state, int srcTape, SortTuple *rtup,
-			  bool *should_free)
-{
-	Assert(state->batchUsed);
-
-	/*
-	 * Tuple about to be returned to caller ("stup") is final preread tuple
-	 * from tape, just removed from the top of the heap.  Special steps around
-	 * memory management must be performed for that tuple, to make sure it
-	 * isn't overwritten early.
-	 */
-	if (!state->mergeoverflow[srcTape])
-	{
-		Size		tupLen;
-
-		/*
-		 * Mark tuple buffer range for reuse, but be careful to move final,
-		 * tail tuple to start of space for next run so that it's available to
-		 * caller when stup is returned, and remains available at least until
-		 * the next tuple is requested.
-		 */
-		tupLen = state->mergecurrent[srcTape] - state->mergetail[srcTape];
-		state->mergecurrent[srcTape] = state->mergetuples[srcTape];
-		MOVETUP(state->mergecurrent[srcTape], state->mergetail[srcTape],
-				tupLen);
-
-		/* Make SortTuple at top of the merge heap point to new tuple */
-		rtup->tuple = (void *) state->mergecurrent[srcTape];
-
-		state->mergetail[srcTape] = state->mergecurrent[srcTape];
-		state->mergecurrent[srcTape] += tupLen;
-	}
-	else
-	{
-		/*
-		 * Handle an "overflow" retail palloc.
-		 *
-		 * This is needed when we run out of tuple memory for the tape.
-		 */
-		state->mergecurrent[srcTape] = state->mergetuples[srcTape];
-		state->mergetail[srcTape] = state->mergetuples[srcTape];
+		SortTuple	tup;
 
-		if (rtup->tuple)
+		if (mergereadnext(state, srcTape, &tup))
 		{
-			Assert(rtup->tuple == (void *) state->mergeoverflow[srcTape]);
-			/* Caller should free palloc'd tuple */
-			*should_free = true;
+			tup.tupindex = srcTape;
+			tuplesort_heap_insert(state, &tup, false);
 		}
-		state->mergeoverflow[srcTape] = NULL;
 	}
 }
 
 /*
- * mergebatchfreetape - handle final clean-up for batch memory once tape is
- * about to become exhausted
+ * mergereadnext - read next tuple from one merge input tape
  *
- * All tuples are returned from tape, but a single final tuple, *rtup, is to be
- * passed back to caller.  Free tape's batch allocation buffer while ensuring
- * that the final tuple is managed appropriately.
+ * Returns false on EOF.
  */
-static void
-mergebatchfreetape(Tuplesortstate *state, int srcTape, SortTuple *rtup,
-				   bool *should_free)
-{
-	Assert(state->batchUsed);
-	Assert(state->status == TSS_FINALMERGE);
-
-	/*
-	 * Tuple may or may not already be an overflow allocation from
-	 * mergebatchone()
-	 */
-	if (!*should_free && rtup->tuple)
-	{
-		/*
-		 * Final tuple still in tape's batch allocation.
-		 *
-		 * Return palloc()'d copy to caller, and have it freed in a similar
-		 * manner to overflow allocation.  Otherwise, we'd free batch memory
-		 * and pass back a pointer to garbage.  Note that we deliberately
-		 * allocate this in the parent tuplesort context, to be on the safe
-		 * side.
-		 */
-		Size		tuplen;
-		void	   *oldTuple = rtup->tuple;
-
-		tuplen = state->mergecurrent[srcTape] - state->mergetail[srcTape];
-		rtup->tuple = MemoryContextAlloc(state->sortcontext, tuplen);
-		MOVETUP(rtup->tuple, oldTuple, tuplen);
-		*should_free = true;
-	}
-
-	/* Free spacePerTape-sized buffer */
-	pfree(state->mergetuples[srcTape]);
-}
-
-/*
- * mergebatchalloc - allocate memory for one tuple using a batch memory
- * "logical allocation".
- *
- * This is used for the final on-the-fly merge phase only.  READTUP() routines
- * receive memory from here in place of palloc() and USEMEM() calls.
- *
- * Tuple tapenum is passed, ensuring each tape's tuples are stored in sorted,
- * contiguous order (while allowing safe reuse of memory made available to
- * each tape).  This maximizes locality of access as tuples are returned by
- * final merge.
- *
- * Caller must not subsequently attempt to free memory returned here.  In
- * general, only mergebatch* functions know about how memory returned from
- * here should be freed, and this function's caller must ensure that batch
- * memory management code will definitely have the opportunity to do the right
- * thing during the final on-the-fly merge.
- */
-static void *
-mergebatchalloc(Tuplesortstate *state, int tapenum, Size tuplen)
-{
-	Size		reserve_tuplen = MAXALIGN(tuplen);
-	char	   *ret;
-
-	/* Should overflow at most once before mergebatchone() call: */
-	Assert(state->mergeoverflow[tapenum] == NULL);
-	Assert(state->batchUsed);
-
-	/* It should be possible to use precisely spacePerTape memory at once */
-	if (state->mergecurrent[tapenum] + reserve_tuplen <=
-		state->mergetuples[tapenum] + state->spacePerTape)
-	{
-		/*
-		 * Usual case -- caller is returned pointer into its tape's buffer,
-		 * and an offset from that point is recorded as where tape has
-		 * consumed up to for current round of preloading.
-		 */
-		ret = state->mergetail[tapenum] = state->mergecurrent[tapenum];
-		state->mergecurrent[tapenum] += reserve_tuplen;
-	}
-	else
-	{
-		/*
-		 * Allocate memory, and record as tape's overflow allocation.  This
-		 * will be detected quickly, in a similar fashion to a LACKMEM()
-		 * condition, and should not happen again before a new round of
-		 * preloading for caller's tape.  Note that we deliberately allocate
-		 * this in the parent tuplesort context, to be on the safe side.
-		 *
-		 * Sometimes, this does not happen because merging runs out of slots
-		 * before running out of memory.
-		 */
-		ret = state->mergeoverflow[tapenum] =
-			MemoryContextAlloc(state->sortcontext, tuplen);
-	}
-
-	return ret;
-}
-
-/*
- * mergepreread - load tuples from merge input tapes
- *
- * This routine exists to improve sequentiality of reads during a merge pass,
- * as explained in the header comments of this file.  Load tuples from each
- * active source tape until the tape's run is exhausted or it has used up
- * its fair share of available memory.  In any case, we guarantee that there
- * is at least one preread tuple available from each unexhausted input tape.
- *
- * We invoke this routine at the start of a merge pass for initial load,
- * and then whenever any tape's preread data runs out.  Note that we load
- * as much data as possible from all tapes, not just the one that ran out.
- * This is because logtape.c works best with a usage pattern that alternates
- * between reading a lot of data and writing a lot of data, so whenever we
- * are forced to read, we should fill working memory completely.
- *
- * In FINALMERGE state, we *don't* use this routine, but instead just preread
- * from the single tape that ran dry.  There's no read/write alternation in
- * that state and so no point in scanning through all the tapes to fix one.
- * (Moreover, there may be quite a lot of inactive tapes in that state, since
- * we might have had many fewer runs than tapes.  In a regular tape-to-tape
- * merge we can expect most of the tapes to be active.  Plus, only
- * FINALMERGE state has to consider memory management for a batch
- * allocation.)
- */
-static void
-mergepreread(Tuplesortstate *state)
-{
-	int			srcTape;
-
-	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
-		mergeprereadone(state, srcTape);
-}
-
-/*
- * mergeprereadone - load tuples from one merge input tape
- *
- * Read tuples from the specified tape until it has used up its free memory
- * or array slots; but ensure that we have at least one tuple, if any are
- * to be had.
- */
-static void
-mergeprereadone(Tuplesortstate *state, int srcTape)
+static bool
+mergereadnext(Tuplesortstate *state, int srcTape, SortTuple *stup)
 {
 	unsigned int tuplen;
-	SortTuple	stup;
-	int			tupIndex;
-	int64		priorAvail,
-				spaceUsed;
 
 	if (!state->mergeactive[srcTape])
-		return;					/* tape's run is already exhausted */
+		return false;					/* tape's run is already exhausted */
 
-	/*
-	 * Manage per-tape availMem.  Only actually matters when batch memory not
-	 * in use.
-	 */
-	priorAvail = state->availMem;
-	state->availMem = state->mergeavailmem[srcTape];
-
-	/*
-	 * When batch memory is used if final on-the-fly merge, only mergeoverflow
-	 * test is relevant; otherwise, only LACKMEM() test is relevant.
-	 */
-	while ((state->mergeavailslots[srcTape] > 0 &&
-			state->mergeoverflow[srcTape] == NULL && !LACKMEM(state)) ||
-		   state->mergenext[srcTape] == 0)
+	/* read next tuple, if any */
+	if ((tuplen = getlen(state, srcTape, true)) == 0)
 	{
-		/* read next tuple, if any */
-		if ((tuplen = getlen(state, srcTape, true)) == 0)
-		{
-			state->mergeactive[srcTape] = false;
-			break;
-		}
-		READTUP(state, &stup, srcTape, tuplen);
-		/* find a free slot in memtuples[] for it */
-		tupIndex = state->mergefreelist;
-		if (tupIndex)
-			state->mergefreelist = state->memtuples[tupIndex].tupindex;
-		else
-		{
-			tupIndex = state->mergefirstfree++;
-			Assert(tupIndex < state->memtupsize);
-		}
-		state->mergeavailslots[srcTape]--;
-		/* store tuple, append to list for its tape */
-		stup.tupindex = 0;
-		state->memtuples[tupIndex] = stup;
-		if (state->mergelast[srcTape])
-			state->memtuples[state->mergelast[srcTape]].tupindex = tupIndex;
-		else
-			state->mergenext[srcTape] = tupIndex;
-		state->mergelast[srcTape] = tupIndex;
+		state->mergeactive[srcTape] = false;
+		return false;
 	}
-	/* update per-tape and global availmem counts */
-	spaceUsed = state->mergeavailmem[srcTape] - state->availMem;
-	state->mergeavailmem[srcTape] = state->availMem;
-	state->availMem = priorAvail - spaceUsed;
+	READTUP(state, stup, srcTape, tuplen);
+
+	return true;
 }
 
 /*
@@ -3438,15 +3133,6 @@ dumpbatch(Tuplesortstate *state, bool alltuples)
 		state->memtupcount--;
 	}
 
-	/*
-	 * Reset tuple memory.  We've freed all of the tuples that we previously
-	 * allocated.  It's important to avoid fragmentation when there is a stark
-	 * change in allocation patterns due to the use of batch memory.
-	 * Fragmentation due to AllocSetFree's bucketing by size class might be
-	 * particularly bad if this step wasn't taken.
-	 */
-	MemoryContextReset(state->tuplecontext);
-
 	markrunend(state, state->tp_tapenum[state->destTape]);
 	state->tp_runs[state->destTape]++;
 	state->tp_dummy[state->destTape]--; /* per Alg D step D2 */
@@ -3901,38 +3587,31 @@ markrunend(Tuplesortstate *state, int tapenum)
 }
 
 /*
- * Get memory for tuple from within READTUP() routine.  Allocate
- * memory and account for that, or consume from tape's batch
- * allocation.
+ * Get memory for tuple from within READTUP() routine.
  *
- * Memory returned here in the final on-the-fly merge case is recycled
- * from tape's batch allocation.  Otherwise, callers must pfree() or
- * reset tuple child memory context, and account for that with a
- * FREEMEM().  Currently, this only ever needs to happen in WRITETUP()
- * routines.
+ * We use next free slot from the slab allocator, or palloc() if the tuple
+ * is too large for that.
  */
 static void *
-readtup_alloc(Tuplesortstate *state, int tapenum, Size tuplen)
+readtup_alloc(Tuplesortstate *state, Size tuplen)
 {
-	if (state->batchUsed)
-	{
-		/*
-		 * No USEMEM() call, because during final on-the-fly merge accounting
-		 * is based on tape-private state. ("Overflow" allocations are
-		 * detected as an indication that a new round or preloading is
-		 * required. Preloading marks existing contents of tape's batch buffer
-		 * for reuse.)
-		 */
-		return mergebatchalloc(state, tapenum, tuplen);
-	}
+	SlabSlot   *buf;
+
+	/*
+	 * We pre-allocate enough slots in the slab arena that we should never run
+	 * out.
+	 */
+	Assert(state->slabFreeHead);
+
+	if (tuplen > SLAB_SLOT_SIZE || !state->slabFreeHead)
+		return MemoryContextAlloc(state->sortcontext, tuplen);
 	else
 	{
-		char	   *ret;
+		buf = state->slabFreeHead;
+		/* Reuse this slot */
+		state->slabFreeHead = buf->nextfree;
 
-		/* Batch allocation yet to be performed */
-		ret = MemoryContextAlloc(state->tuplecontext, tuplen);
-		USEMEM(state, GetMemoryChunkSpace(ret));
-		return ret;
+		return buf;
 	}
 }
 
@@ -4101,8 +3780,11 @@ writetup_heap(Tuplesortstate *state, int tapenum, SortTuple *stup)
 		LogicalTapeWrite(state->tapeset, tapenum,
 						 (void *) &tuplen, sizeof(tuplen));
 
-	FREEMEM(state, GetMemoryChunkSpace(tuple));
-	heap_free_minimal_tuple(tuple);
+	if (!state->slabAllocatorUsed)
+	{
+		FREEMEM(state, GetMemoryChunkSpace(tuple));
+		heap_free_minimal_tuple(tuple);
+	}
 }
 
 static void
@@ -4111,7 +3793,7 @@ readtup_heap(Tuplesortstate *state, SortTuple *stup,
 {
 	unsigned int tupbodylen = len - sizeof(int);
 	unsigned int tuplen = tupbodylen + MINIMAL_TUPLE_DATA_OFFSET;
-	MinimalTuple tuple = (MinimalTuple) readtup_alloc(state, tapenum, tuplen);
+	MinimalTuple tuple = (MinimalTuple) readtup_alloc(state, tuplen);
 	char	   *tupbody = (char *) tuple + MINIMAL_TUPLE_DATA_OFFSET;
 	HeapTupleData htup;
 
@@ -4132,12 +3814,6 @@ readtup_heap(Tuplesortstate *state, SortTuple *stup,
 								&stup->isnull1);
 }
 
-static void
-movetup_heap(void *dest, void *src, unsigned int len)
-{
-	memmove(dest, src, len);
-}
-
 /*
  * Routines specialized for the CLUSTER case (HeapTuple data, with
  * comparisons per a btree index definition)
@@ -4344,8 +4020,11 @@ writetup_cluster(Tuplesortstate *state, int tapenum, SortTuple *stup)
 		LogicalTapeWrite(state->tapeset, tapenum,
 						 &tuplen, sizeof(tuplen));
 
-	FREEMEM(state, GetMemoryChunkSpace(tuple));
-	heap_freetuple(tuple);
+	if (!state->slabAllocatorUsed)
+	{
+		FREEMEM(state, GetMemoryChunkSpace(tuple));
+		heap_freetuple(tuple);
+	}
 }
 
 static void
@@ -4354,7 +4033,6 @@ readtup_cluster(Tuplesortstate *state, SortTuple *stup,
 {
 	unsigned int t_len = tuplen - sizeof(ItemPointerData) - sizeof(int);
 	HeapTuple	tuple = (HeapTuple) readtup_alloc(state,
-												  tapenum,
 												  t_len + HEAPTUPLESIZE);
 
 	/* Reconstruct the HeapTupleData header */
@@ -4379,19 +4057,6 @@ readtup_cluster(Tuplesortstate *state, SortTuple *stup,
 									&stup->isnull1);
 }
 
-static void
-movetup_cluster(void *dest, void *src, unsigned int len)
-{
-	HeapTuple	tuple;
-
-	memmove(dest, src, len);
-
-	/* Repoint the HeapTupleData header */
-	tuple = (HeapTuple) dest;
-	tuple->t_data = (HeapTupleHeader) ((char *) tuple + HEAPTUPLESIZE);
-}
-
-
 /*
  * Routines specialized for IndexTuple case
  *
@@ -4659,8 +4324,11 @@ writetup_index(Tuplesortstate *state, int tapenum, SortTuple *stup)
 		LogicalTapeWrite(state->tapeset, tapenum,
 						 (void *) &tuplen, sizeof(tuplen));
 
-	FREEMEM(state, GetMemoryChunkSpace(tuple));
-	pfree(tuple);
+	if (!state->slabAllocatorUsed)
+	{
+		FREEMEM(state, GetMemoryChunkSpace(tuple));
+		pfree(tuple);
+	}
 }
 
 static void
@@ -4668,7 +4336,7 @@ readtup_index(Tuplesortstate *state, SortTuple *stup,
 			  int tapenum, unsigned int len)
 {
 	unsigned int tuplen = len - sizeof(unsigned int);
-	IndexTuple	tuple = (IndexTuple) readtup_alloc(state, tapenum, tuplen);
+	IndexTuple	tuple = (IndexTuple) readtup_alloc(state, tuplen);
 
 	LogicalTapeReadExact(state->tapeset, tapenum,
 						 tuple, tuplen);
@@ -4683,12 +4351,6 @@ readtup_index(Tuplesortstate *state, SortTuple *stup,
 								 &stup->isnull1);
 }
 
-static void
-movetup_index(void *dest, void *src, unsigned int len)
-{
-	memmove(dest, src, len);
-}
-
 /*
  * Routines specialized for DatumTuple case
  */
@@ -4755,7 +4417,7 @@ writetup_datum(Tuplesortstate *state, int tapenum, SortTuple *stup)
 		LogicalTapeWrite(state->tapeset, tapenum,
 						 (void *) &writtenlen, sizeof(writtenlen));
 
-	if (stup->tuple)
+	if (!state->slabAllocatorUsed && stup->tuple)
 	{
 		FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
 		pfree(stup->tuple);
@@ -4785,7 +4447,7 @@ readtup_datum(Tuplesortstate *state, SortTuple *stup,
 	}
 	else
 	{
-		void	   *raddr = readtup_alloc(state, tapenum, tuplen);
+		void	   *raddr = readtup_alloc(state, tuplen);
 
 		LogicalTapeReadExact(state->tapeset, tapenum,
 							 raddr, tuplen);
@@ -4799,12 +4461,6 @@ readtup_datum(Tuplesortstate *state, SortTuple *stup,
 							 &tuplen, sizeof(tuplen));
 }
 
-static void
-movetup_datum(void *dest, void *src, unsigned int len)
-{
-	memmove(dest, src, len);
-}
-
 /*
  * Convenience routine to free a tuple previously loaded into sort memory
  */
diff --git a/src/include/utils/logtape.h b/src/include/utils/logtape.h
index fa1e992..03d0a6f 100644
--- a/src/include/utils/logtape.h
+++ b/src/include/utils/logtape.h
@@ -39,6 +39,7 @@ extern bool LogicalTapeSeek(LogicalTapeSet *lts, int tapenum,
 				long blocknum, int offset);
 extern void LogicalTapeTell(LogicalTapeSet *lts, int tapenum,
 				long *blocknum, int *offset);
+extern void LogicalTapeAssignReadBufferSize(LogicalTapeSet *lts, int tapenum, size_t bufsize);
 extern long LogicalTapeSetBlocks(LogicalTapeSet *lts);
 
 #endif   /* LOGTAPE_H */
-- 
2.9.3

#50Peter Geoghegan
pg@heroku.com
In reply to: Robert Haas (#47)
Re: Tuplesort merge pre-reading

On Thu, Sep 29, 2016 at 2:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Maybe that was the wrong choice of words. What I mean is that it seems
somewhat unprincipled to give over an equal share of memory to each
active-at-least-once tape, ...

I don't get it. If the memory is being used for prereading, then the
point is just to avoid doing many small I/Os instead of one big I/O,
and that goal will be accomplished whether the memory is equally
distributed or not; indeed, it's likely to be accomplished BETTER if
the memory is equally distributed than if it isn't.

I think it could hurt performance if preloading loads runs on a tape
that won't be needed until some subsequent merge pass, in preference
to using that memory proportionately, giving more to larger input runs
for *each* merge pass (giving memory proportionate to the size of each
run to be merged from each tape). For tapes with a dummy run, the
appropriate amount of memory for their next merge pass is zero.

I'm not arguing that it would be worth it to do that, but I do think
that that's the sensible way of framing the idea of giving a uniform
amount of memory to every maybe-active tape up front. I'm not too
concerned about this because I'm never too concerned about multiple
merge pass cases, which are relatively rare and relatively
unimportant. Let's just get our story straight.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#51Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#50)
Re: Tuplesort merge pre-reading

On Thu, Sep 29, 2016 at 11:38 AM, Peter Geoghegan <pg@heroku.com> wrote:

On Thu, Sep 29, 2016 at 2:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Maybe that was the wrong choice of words. What I mean is that it seems
somewhat unprincipled to give over an equal share of memory to each
active-at-least-once tape, ...

I don't get it. If the memory is being used for prereading, then the
point is just to avoid doing many small I/Os instead of one big I/O,
and that goal will be accomplished whether the memory is equally
distributed or not; indeed, it's likely to be accomplished BETTER if
the memory is equally distributed than if it isn't.

I think it could hurt performance if preloading loads runs on a tape
that won't be needed until some subsequent merge pass, in preference
to using that memory proportionately, giving more to larger input runs
for *each* merge pass (giving memory proportionate to the size of each
run to be merged from each tape). For tapes with a dummy run, the
appropriate amount of memory for their next merge pass is zero.

OK, true. But I still suspect that unless the amount of data you need
to read from a tape is zero or very small, the size of the buffer
doesn't matter. For example, if you have a 1GB tape and a 10GB tape,
I doubt there's any benefit in making the buffer for the 10GB tape 10x
larger. They can probably be the same.
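
The even split discussed above can be sketched as follows. This is only an illustration, not code from the patch: the helper name read_blocks_for_tape() and its parameters are hypothetical, and it assumes the memory budget is expressed in whole blocks.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from the patch): number of read-buffer blocks
 * given to input tape 'tape_index' when 'avail_blocks' blocks are split
 * evenly among 'n_input_tapes' tapes, regardless of how much data each
 * tape holds.
 */
static size_t
read_blocks_for_tape(size_t avail_blocks, int n_input_tapes, int tape_index)
{
    size_t      per_tape;
    size_t      remainder;
    size_t      nblocks;

    assert(n_input_tapes > 0);
    assert(tape_index >= 0 && tape_index < n_input_tapes);

    per_tape = avail_blocks / (size_t) n_input_tapes;
    remainder = avail_blocks % (size_t) n_input_tapes;

    /* The first 'remainder' tapes absorb the leftover blocks, one each. */
    nblocks = per_tape + ((size_t) tape_index < remainder ? 1 : 0);

    /* Assumption: a tape can always buffer at least one block. */
    return nblocks > 0 ? nblocks : 1;
}
```

Under this scheme a 1GB tape and a 10GB tape receive the same buffer, which is the behaviour argued for above; a proportional scheme would need the run sizes as an extra input.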

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#52Peter Geoghegan
pg@heroku.com
In reply to: Heikki Linnakangas (#49)
Re: Tuplesort merge pre-reading

On Thu, Sep 29, 2016 at 4:10 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Bah, I fumbled the initSlabAllocator() function, attached is a fixed
version.

This looks much better. It's definitely getting close. Thanks for
being considerate of my more marginal concerns. More feedback:

* Should say "fixed number of...":

+ * we start merging. Merging only needs to keep a small, fixed number tuples

* Minor concern about these new macros:

+#define IS_SLAB_SLOT(state, tuple) \
+   ((char *) tuple >= state->slabMemoryBegin && \
+    (char *) tuple < state->slabMemoryEnd)
+
+/*
+ * Return the given tuple to the slab memory free list, or free it
+ * if it was palloc'd.
+ */
+#define RELEASE_SLAB_SLOT(state, tuple) \
+   do { \
+       SlabSlot *buf = (SlabSlot *) tuple; \
+       \
+       if (IS_SLAB_SLOT(state, tuple)) \
+       { \
+           buf->nextfree = state->slabFreeHead; \
+           state->slabFreeHead = buf; \
+       } else \
+           pfree(tuple); \
+   } while(0)

I suggest duplicating the paranoia seen elsewhere around what "state"
macro argument could expand to. You know, by surrounding "state" with
parenthesis each time it is used. This is what we see with existing,
similar macros.

* Should cast to int64 here (for the benefit of win64):

+       elog(LOG, "using " INT64_FORMAT " KB of memory for read buffers among %d input tapes",
+            (long) (availBlocks * BLCKSZ) / 1024, numInputTapes);

* FWIW, I still don't love this bit:

+    * numTapes and numInputTapes reflect the actual number of tapes we will
+    * use.  Note that the output tape's tape number is maxTapes - 1, so the
+    * tape numbers of the used tapes are not consecutive, so you cannot
+    * just loop from 0 to numTapes to visit all used tapes!
+    */
+   if (state->Level == 1)
+   {
+       numInputTapes = state->currentRun;
+       numTapes = numInputTapes + 1;
+       FREEMEM(state, (state->maxTapes - numTapes) * TAPE_BUFFER_OVERHEAD);
+   }

But I can see how the verbosity of almost-duplicating the activeTapes
stuff seems unappealing. That said, I think that you should point out
in comments that you're calculating the number of
maybe-active-in-some-merge tapes. They're maybe-active in that they
have some number of real tapes. Not going to insist on that, but
something to think about.

* Shouldn't this use state->tapeRange?:

+   else
+   {
+       numInputTapes = state->maxTapes - 1;
+       numTapes = state->maxTapes;
+   }

* Doesn't it also set numTapes without it being used? Maybe that
variable can be declared within "if (state->Level == 1)" block.

* Minor issues with initSlabAllocator():

You call the new function initSlabAllocator() as follows:

+   if (state->tuples)
+       initSlabAllocator(state, numInputTapes + 1);
+   else
+       initSlabAllocator(state, 0);

Isn't the number of slots (the second argument to initSlabAllocator())
actually just numInputTapes when we're "state->tuples"? And so,
shouldn't the "+ 1" bit happen within initSlabAllocator() itself? It
can just inspect "state->tuples" itself. In short, maybe push a bit
more into initSlabAllocator(). Making the arguments match those passed
to initTapeBuffers() a bit later would be nice, perhaps.

* This could be simpler, I think:

+   /* Release the read buffers on all the other tapes, by rewinding them. */
+   for (tapenum = 0; tapenum < state->maxTapes; tapenum++)
+   {
+       if (tapenum == state->result_tape)
+           continue;
+       LogicalTapeRewind(state->tapeset, tapenum, true);
+   }

Can't you just use state->tapeRange, and remove the "continue"? I
recommend referring to "now-exhausted input tapes" here, too.

* I'm not completely prepared to give up on using
MemoryContextAllocHuge() within logtape.c just yet. Maybe you're right
that it couldn't possibly matter that we impose a MaxAllocSize cap
within logtape.c (per tape), but I have slight reservations that I
need to address. Maybe a better way of putting it would be that I have
some reservations about possible regressions at the very high end,
with very large workMem. Any thoughts on that?
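
To illustrate the macro point raised earlier in this review (parenthesizing every use of the "state" argument), here is a minimal standalone sketch. It is not the committed code: SlabState, slab_init() and slab_alloc() are hypothetical stand-ins, malloc()/free() replace palloc()/pfree(), and the SLAB_SLOT_SIZE value is an assumption.

```c
#include <assert.h>
#include <stdlib.h>

#define SLAB_SLOT_SIZE 1024     /* assumed slot size for this sketch */

typedef union SlabSlot
{
    union SlabSlot *nextfree;   /* link used while the slot is free */
    char        buffer[SLAB_SLOT_SIZE];
} SlabSlot;

/* Hypothetical stand-in for the relevant Tuplesortstate fields. */
typedef struct SlabState
{
    char       *slabMemoryBegin;
    char       *slabMemoryEnd;
    SlabSlot   *slabFreeHead;
} SlabState;

/* Every macro argument is parenthesized, so "&st" or "p + 1" expand safely. */
#define IS_SLAB_SLOT(state, tuple) \
    ((char *) (tuple) >= (state)->slabMemoryBegin && \
     (char *) (tuple) < (state)->slabMemoryEnd)

#define RELEASE_SLAB_SLOT(state, tuple) \
    do { \
        SlabSlot *buf_ = (SlabSlot *) (tuple); \
        \
        if (IS_SLAB_SLOT(state, buf_)) \
        { \
            buf_->nextfree = (state)->slabFreeHead; \
            (state)->slabFreeHead = buf_; \
        } \
        else \
            free(buf_);         /* stand-in for pfree() */ \
    } while (0)

/* Thread all slots of 'arena' onto the free list, lowest address first. */
static void
slab_init(SlabState *st, SlabSlot *arena, int nslots)
{
    int         i;

    st->slabMemoryBegin = (char *) arena;
    st->slabMemoryEnd = (char *) (arena + nslots);
    st->slabFreeHead = NULL;
    for (i = nslots - 1; i >= 0; i--)
    {
        arena[i].nextfree = st->slabFreeHead;
        st->slabFreeHead = &arena[i];
    }
}

/* Take a slot from the free list, or fall back to malloc() if it won't fit. */
static void *
slab_alloc(SlabState *st, size_t len)
{
    SlabSlot   *buf;

    if (len > SLAB_SLOT_SIZE || st->slabFreeHead == NULL)
        return malloc(len);
    buf = st->slabFreeHead;
    st->slabFreeHead = buf->nextfree;
    return buf;
}
```

With the unparenthesized form, a call such as RELEASE_SLAB_SLOT(&st, tuple) would expand to &st->slabFreeHead, which parses as &(st->slabFreeHead) and fails to compile; the parenthesized form expands to (&st)->slabFreeHead as intended.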

--
Peter Geoghegan


#53Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Peter Geoghegan (#52)
Re: Tuplesort merge pre-reading

On 09/30/2016 04:08 PM, Peter Geoghegan wrote:

On Thu, Sep 29, 2016 at 4:10 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Bah, I fumbled the initSlabAllocator() function, attached is a fixed
version.

This looks much better. It's definitely getting close. Thanks for
being considerate of my more marginal concerns. More feedback:

Fixed most of the things you pointed out, thanks.

* Minor issues with initSlabAllocator():

You call the new function initSlabAllocator() as follows:

+   if (state->tuples)
+       initSlabAllocator(state, numInputTapes + 1);
+   else
+       initSlabAllocator(state, 0);

Isn't the number of slots (the second argument to initSlabAllocator())
actually just numInputTapes when we're "state->tuples"? And so,
shouldn't the "+ 1" bit happen within initSlabAllocator() itself? It
can just inspect "state->tuples" itself. In short, maybe push a bit
more into initSlabAllocator(). Making the arguments match those passed
to initTapeBuffers() a bit later would be nice, perhaps.

The comment above that explains the "+ 1". init_slab_allocator allocates
the number of slots that was requested, and the caller is responsible
for deciding how many slots are needed. Yeah, we could remove the
argument and move the logic altogether into init_slab_allocator(), but I
think it's clearer this way. Matter of taste, I guess.

* This could be simpler, I think:

+   /* Release the read buffers on all the other tapes, by rewinding them. */
+   for (tapenum = 0; tapenum < state->maxTapes; tapenum++)
+   {
+       if (tapenum == state->result_tape)
+           continue;
+       LogicalTapeRewind(state->tapeset, tapenum, true);
+   }

Can't you just use state->tapeRange, and remove the "continue"? I
recommend referring to "now-exhausted input tapes" here, too.

Don't think so. result_tape == tapeRange only when the merge was done in
a single pass (or you're otherwise lucky).

* I'm not completely prepared to give up on using
MemoryContextAllocHuge() within logtape.c just yet. Maybe you're right
that it couldn't possibly matter that we impose a MaxAllocSize cap
within logtape.c (per tape), but I have slight reservations that I
need to address. Maybe a better way of putting it would be that I have
some reservations about possible regressions at the very high end,
with very large workMem. Any thoughts on that?

Meh, I can't imagine that using more than 1 GB for a read-ahead buffer
could make any difference in practice. If you have a very large
work_mem, you'll surely get away with a single merge pass, and
fragmentation won't become an issue. And 1GB should be more than enough
to trigger OS read-ahead.
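
A rough sketch of what such a per-tape cap amounts to, under stated assumptions: the function name clamp_read_buffer_size() is hypothetical, MAX_ALLOC_SIZE mirrors PostgreSQL's MaxAllocSize (1 GB - 1), BLCKSZ is taken to be the default 8192, and rounding down to whole blocks is an assumption of this example rather than something stated in the thread.

```c
#include <assert.h>
#include <stddef.h>

#define BLCKSZ          8192                    /* assumed default block size */
#define MAX_ALLOC_SIZE  ((size_t) 0x3fffffff)   /* 1 GB - 1, like MaxAllocSize */

/*
 * Hypothetical helper: turn a requested per-tape read-buffer size into one
 * that is at least one block, at most MAX_ALLOC_SIZE, and a whole number of
 * blocks.
 */
static size_t
clamp_read_buffer_size(size_t requested)
{
    size_t      size = requested;

    if (size > MAX_ALLOC_SIZE)
        size = MAX_ALLOC_SIZE;

    /* Round down to a block boundary. */
    size -= size % BLCKSZ;

    /* Never go below a single block. */
    if (size < BLCKSZ)
        size = BLCKSZ;

    return size;
}
```

With these constants, any request above 1 GB ends up as 0x3fffe000 bytes (131071 blocks), so a very large work_mem no longer translates into a larger buffer per tape.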

Committed with some final kibitzing. Thanks for the review!

PS. This patch didn't fix bug #14344, the premature reuse of memory with
tuplesort_gettupleslot. We'll still need to come up with 1. a
backportable fix for that, and 2. perhaps a different fix for master.

- Heikki


#54Peter Geoghegan
pg@heroku.com
In reply to: Heikki Linnakangas (#53)
Re: Tuplesort merge pre-reading

On Mon, Oct 3, 2016 at 3:39 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Can't you just use state->tapeRange, and remove the "continue"? I
recommend referring to "now-exhausted input tapes" here, too.

Don't think so. result_tape == tapeRange only when the merge was done in a
single pass (or you're otherwise lucky).

Ah, yes. Logical tape assignment/physical tape number confusion on my part here.

* I'm not completely prepared to give up on using
MemoryContextAllocHuge() within logtape.c just yet. Maybe you're right
that it couldn't possibly matter that we impose a MaxAllocSize cap
within logtape.c (per tape), but I have slight reservations that I
need to address. Maybe a better way of putting it would be that I have
some reservations about possible regressions at the very high end,
with very large workMem. Any thoughts on that?

Meh, I can't imagine that using more than 1 GB for a read-ahead buffer could
make any difference in practice. If you have a very large work_mem, you'll
surely get away with a single merge pass, and fragmentation won't become an
issue. And 1GB should be more than enough to trigger OS read-ahead.

I had a non-specific concern, not an intuition or suspicion about
this. I think that I'll figure it out when I rebase the parallel
CREATE INDEX patch on top of this and test that.

Committed with some final kibitzing. Thanks for the review!

Thanks for working on this!

PS. This patch didn't fix bug #14344, the premature reuse of memory with
tuplesort_gettupleslot. We'll still need to come up with 1. a backportable
fix for that, and 2. perhaps a different fix for master.

Agreed. It seemed like you favor not changing memory ownership
semantics for 9.6. I'm not sure that that's the easiest approach for
9.6, but let's discuss that over on the dedicated thread soon.

--
Peter Geoghegan


#55Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#1)
Re: Tuplesort merge pre-reading

Heikki Linnakangas <hlinnaka@iki.fi> writes:

I'm talking about the code that reads a bunch of tuples from each tape, loading
them into the memtuples array. That code was added by Tom Lane, back in
1999:

commit cf627ab41ab9f6038a29ddd04dd0ff0ccdca714e
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sat Oct 30 17:27:15 1999 +0000

Further performance improvements in sorting: reduce number of comparisons
during initial run formation by keeping both current run and next-run
tuples in the same heap (yup, Knuth is smarter than I am). And, during
merge passes, make use of available sort memory to load multiple tuples
from any one input 'tape' at a time, thereby improving locality of
access to the temp file.

So apparently there was a benefit back then, but is it still worthwhile?

I'm fairly sure that the point was exactly what it said, ie improve
locality of access within the temp file by sequentially reading as many
tuples in a row as we could, rather than grabbing one here and one there.

It may be that the work you and Peter G. have been doing has rendered
that question moot. But I'm a bit worried that the reason you're not
seeing any effect is that you're only testing situations with zero seek
penalty (ie your laptop's disk is an SSD). Back then I would certainly
have been testing with temp files on spinning rust, and I fear that this
may still be an issue in that sort of environment.

The relevant mailing list thread seems to be "sort on huge table" in
pgsql-hackers in October/November 1999. The archives don't seem to have
threaded that too successfully, but here's a message specifically
describing the commit you mention:

/messages/by-id/2726.941493808@sss.pgh.pa.us

and you can find the rest by looking through the archive summary pages
for that interval.

The larger picture to be drawn from that thread is that we were seeing
very different performance characteristics on different platforms.
The specific issue that Tatsuo-san reported seemed like it might be
down to weird read-ahead behavior in a 90s-vintage Linux kernel ...
but the point that this stuff can be environment-dependent is still
something to take to heart.

regards, tom lane


#56Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#55)
Re: Tuplesort merge pre-reading

I wrote:

Heikki Linnakangas <hlinnaka@iki.fi> writes:

I'm talking about the code that reads a bunch of tuples from each tape, loading
them into the memtuples array. That code was added by Tom Lane, back in
1999:
So apparently there was a benefit back then, but is it still worthwhile?

I'm fairly sure that the point was exactly what it said, ie improve
locality of access within the temp file by sequentially reading as many
tuples in a row as we could, rather than grabbing one here and one there.

[ blink... ] Somehow, my mail reader popped up a message from 2016
as current, and I spent some time researching and answering it without
noticing the message date.

Never mind, nothing to see here ...

regards, tom lane


#57Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#55)
Re: Tuplesort merge pre-reading

On Thu, Apr 13, 2017 at 9:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm fairly sure that the point was exactly what it said, ie improve
locality of access within the temp file by sequentially reading as many
tuples in a row as we could, rather than grabbing one here and one there.

It may be that the work you and Peter G. have been doing has rendered
that question moot. But I'm a bit worried that the reason you're not
seeing any effect is that you're only testing situations with zero seek
penalty (ie your laptop's disk is an SSD). Back then I would certainly
have been testing with temp files on spinning rust, and I fear that this
may still be an issue in that sort of environment.

I actually think Heikki's work here would particularly help on
spinning rust, especially when less memory is available. He
specifically justified it on the basis of it resulting in a more
sequential read pattern, particularly when multiple passes are
required.

The larger picture to be drawn from that thread is that we were seeing
very different performance characteristics on different platforms.
The specific issue that Tatsuo-san reported seemed like it might be
down to weird read-ahead behavior in a 90s-vintage Linux kernel ...
but the point that this stuff can be environment-dependent is still
something to take to heart.

BTW, I'm skeptical of the idea of Heikki's around killing polyphase
merge itself at this point. I think that keeping most tapes active per
pass is useful now that our memory accounting involves handing over an
even share to each maybe-active tape for every merge pass, something
established by Heikki's work on external sorting.

Interestingly enough, I think that Knuth was pretty much spot on with
his "sweet spot" of 7 tapes, even if you have modern hardware. Commit
df700e6 (where the sweet spot of merge order 7 was no longer always
used) was effective because it masked certain overheads that we
experience when doing multiple passes, overheads that Heikki and I
mostly removed. This was confirmed by Robert's testing of my merge
order cap work for commit fc19c18, where he found that using 7 tapes
was only slightly worse than using many hundreds of tapes. If we could
somehow be completely effective in making access to logical tapes
perfectly sequential, then 7 tapes would probably be noticeably
*faster*, due to CPU caching effects.

Knuth was completely correct to say that it basically made no
difference once more than 7 tapes are used to merge, because he didn't
have logtape.c fragmentation to worry about.

--
Peter Geoghegan

VMware vCenter Server
https://www.vmware.com/


#58Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#57)
Re: Tuplesort merge pre-reading

On Thu, Apr 13, 2017 at 10:19 PM, Peter Geoghegan <pg@bowt.ie> wrote:

I actually think Heikki's work here would particularly help on
spinning rust, especially when less memory is available. He
specifically justified it on the basis of it resulting in a more
sequential read pattern, particularly when multiple passes are
required.

BTW, what you might have missed is that Heikki did end up using a
significant amount of memory in the committed version. It just ended
up being managed by logtape.c, which now does the prereading instead
of tuplesort.c, but at a lower level. There is only one tuple in the
merge heap, but there is still up to 1GB of memory per tape,
containing raw preread tuples mixed with integers that demarcate tape
contents.
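
To make the buffer layout described above concrete, here is a minimal sketch of walking a preread buffer of length-prefixed tuples. It is an illustration under assumptions, not logtape.c or tuplesort.c code: the function names are hypothetical, each length word is assumed to count itself plus the tuple body, a zero length word is assumed to mark the end of a run, and the whole run is assumed to fit in the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append one tuple as <length word><body>; returns the new write position. */
static size_t
append_tuple(char *buf, size_t pos, const void *body, unsigned int bodylen)
{
    unsigned int len = bodylen + (unsigned int) sizeof(unsigned int);

    memcpy(buf + pos, &len, sizeof(len));
    memcpy(buf + pos + sizeof(len), body, bodylen);
    return pos + len;
}

/* Append the zero length word that terminates a run. */
static size_t
append_run_end(char *buf, size_t pos)
{
    unsigned int zero = 0;

    memcpy(buf + pos, &zero, sizeof(zero));
    return pos + sizeof(zero);
}

/*
 * Count the tuples in one run held in buf[0..buflen).  Returns -1 if the
 * buffer ends before the end-of-run marker or a length word is implausible.
 */
static int
count_tuples_in_run(const char *buf, size_t buflen)
{
    size_t      pos = 0;
    int         ntuples = 0;

    while (pos + sizeof(unsigned int) <= buflen)
    {
        unsigned int len;

        memcpy(&len, buf + pos, sizeof(len));
        if (len == 0)
            return ntuples;     /* end-of-run marker */
        if (len < sizeof(len) || len > buflen - pos)
            return -1;          /* truncated or corrupt */
        pos += len;             /* skip length word and body */
        ntuples++;
    }
    return -1;                  /* no end-of-run marker seen */
}
```

A real reader would refill the buffer from the tape when a tuple straddles the end of it; that part is left out here.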

--
Peter Geoghegan

VMware vCenter Server
https://www.vmware.com/


#59Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#57)
Re: Tuplesort merge pre-reading

On Fri, Apr 14, 2017 at 1:19 AM, Peter Geoghegan <pg@bowt.ie> wrote:

Interestingly enough, I think that Knuth was pretty much spot on with
his "sweet spot" of 7 tapes, even if you have modern hardware. Commit
df700e6 (where the sweet spot of merge order 7 was no longer always
used) was effective because it masked certain overheads that we
experience when doing multiple passes, overheads that Heikki and I
mostly removed. This was confirmed by Robert's testing of my merge
order cap work for commit fc19c18, where he found that using 7 tapes
was only slightly worse than using many hundreds of tapes. If we could
somehow be completely effective in making access to logical tapes
perfectly sequential, then 7 tapes would probably be noticeably
*faster*, due to CPU caching effects.

I don't think there's any one fixed answer, because increasing the
number of tapes reduces I/O by adding CPU cost, and vice versa.
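
The I/O side of that trade-off can be sketched with a simple pass count. This is a simplification, assuming a balanced merge in which every pass merges groups of up to merge_order runs; polyphase merge, which tuplesort.c uses, does not process every run on every pass, so treat the numbers as rough. The function name is hypothetical.

```c
#include <assert.h>

/*
 * Hypothetical helper: number of merge passes needed to reduce 'nruns'
 * initial runs to one, merging up to 'merge_order' runs at a time.
 */
static int
merge_passes(int nruns, int merge_order)
{
    int         passes = 0;

    assert(nruns >= 1);
    assert(merge_order >= 2);

    while (nruns > 1)
    {
        /* Each pass replaces every group of merge_order runs with one run. */
        nruns = (nruns + merge_order - 1) / merge_order;
        passes++;
    }
    return passes;
}
```

For example, 500 runs take 4 passes at merge order 7 but a single pass at merge order 500; the single pass reads and writes the data fewer times, at the cost of a larger merge heap and more scattered reads.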

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#60Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#59)
Re: Tuplesort merge pre-reading

On Fri, Apr 14, 2017 at 5:57 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I don't think there's any one fixed answer, because increasing the
number of tapes reduces I/O by adding CPU cost, and vice versa.

Sort of, but if you have to merge hundreds of runs (a situation that
should be quite rare), then you should be concerned about being CPU
bound first, as Knuth was. Besides, on modern hardware, read-ahead can
be more effective if you have more merge passes, to a point, which
might also make it worth it -- using hundreds of tapes results in
plenty of *random* I/O. Plus, most of the time you only do a second
pass over a subset of initial quicksorted runs -- not all of them.

Probably the main complicating factor that Knuth doesn't care about is
time to return the first tuple -- startup cost. That was a big
advantage for commit df700e6 that I should have mentioned.

I'm not seriously suggesting that we should prefer multiple passes in
the vast majority of real world cases, nor am I suggesting that we
should go out of our way to help cases that need to do that. I just
find all this interesting.

--
Peter Geoghegan

VMware vCenter Server
https://www.vmware.com/


#61Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Peter Geoghegan (#57)
Re: Tuplesort merge pre-reading

On 04/14/2017 08:19 AM, Peter Geoghegan wrote:

BTW, I'm skeptical of the idea of Heikki's around killing polyphase
merge itself at this point. I think that keeping most tapes active per
pass is useful now that our memory accounting involves handing over an
even share to each maybe-active tape for every merge pass, something
established by Heikki's work on external sorting.

The pre-read buffers are only needed for input tapes; the total number
of tapes doesn't matter.

For comparison, imagine that you want to perform a merge, such that you
always merge 7 runs into one. With polyphase merge, you would need 8
tapes, so that you always read from 7 of them, and write onto one. With
balanced merge, you would need 14 tapes: you always read from 7 tapes,
and you would need up to 7 output tapes, of which one would be active at
any given time.

Those extra idle output tapes are practically free in our
implementation. The "pre-read buffers" are only needed for input tapes;
the number of output tapes doesn't matter. Likewise, maintaining the
heap is cheaper if you only merge a small number of tapes at a time, but
that's also dependent on the number of *input* tapes, not the total
number of tapes.
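
The tape counts in the comparison above reduce to simple arithmetic, sketched below. The function names and the per-tape buffer parameter are hypothetical; the point is only that pre-read memory scales with the number of input tapes, not with the total.

```c
#include <assert.h>
#include <stddef.h>

/* Polyphase merge: merge_order input tapes plus one output tape. */
static int
polyphase_total_tapes(int merge_order)
{
    return merge_order + 1;
}

/* Balanced merge: merge_order input tapes plus up to merge_order outputs. */
static int
balanced_total_tapes(int merge_order)
{
    return 2 * merge_order;
}

/*
 * Pre-read memory depends only on the number of input tapes, so it is the
 * same for both schemes at a given merge order.
 */
static size_t
preread_memory(int merge_order, size_t buffer_per_input_tape)
{
    assert(merge_order > 0);
    return (size_t) merge_order * buffer_per_input_tape;
}
```

For a 7-way merge this gives 8 tapes for polyphase and 14 for balanced merge, with identical pre-read memory in both cases.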

- Heikki
