Scaling up deferred unique checks and the after trigger queue

Started by Dean Rasheed over 16 years ago · 14 messages
#1 Dean Rasheed
dean.a.rasheed@googlemail.com

I've started looking at the following TODO item:

"Improve deferrable unique constraints for cases with many conflicts"

and Tom's suggestion that the rows to be checked can be stored in a
bitmap, which would become lossy when the number of rows becomes large
enough. There is also another TODO item:

"Add deferred trigger queue file"

to prevent the trigger queue from exhausting backend memory.
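
Roughly, the queueing side of that idea could look like the sketch
below (the helper names are invented for illustration; the size cap
passed to tbm_create() is what makes the bitmap go lossy, a page at a
time, instead of growing without bound):

#include "postgres.h"
#include "miscadmin.h"			/* work_mem */
#include "nodes/tidbitmap.h"

/*
 * Sketch: queue after-trigger tuples in a TID bitmap instead of an
 * ever-growing array of event records.  tbm_create() caps the bitmap
 * at maxbytes; beyond that it silently becomes lossy (page-level),
 * so the queue's memory use stays bounded.
 */
static TIDBitmap *
trigger_queue_create(void)
{
	return tbm_create(work_mem * 1024L);
}

static void
trigger_queue_add(TIDBitmap *queue, ItemPointer tid)
{
	/* recheck = false: tuples get re-qualified at firing time anyway */
	tbm_add_tuples(queue, tid, 1, false);
}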

I've got some prototype code which attempts to replace all the
after-triggers-queue stuff with TID bitmaps (not just for deferred
constraint triggers). This would solve the memory-usage problem without
resorting to file storage, and would also make it easier to optimise
constraint checks by doing a bulk check when the number of rows is large enough.

The initial results are encouraging, but I'm still pretty new to a lot of
this code, so I wanted to check that this is a sane thing to try to do.
For UPDATEs, I'm storing the old tid in the bitmap and relying on its ctid
pointer to retrieve the new tuple for the trigger function. AFAICS
heap_update() always links the old and new tuples in this way.
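
Concretely, the firing code could use something like this hypothetical
helper (fetching with SnapshotAny, since the trigger code must be able
to see tuples that our own transaction has since updated or deleted):

#include "postgres.h"
#include "access/heapam.h"
#include "utils/tqual.h"		/* SnapshotAny */

/*
 * Given the old tuple's TID (the one stored in the bitmap), fetch it
 * and then follow its t_ctid link to reach the new version written
 * by heap_update().
 */
static void
fetch_update_pair(Relation rel, ItemPointer old_tid,
				  HeapTuple oldtup, Buffer *oldbuf,
				  HeapTuple newtup, Buffer *newbuf)
{
	ItemPointerCopy(old_tid, &oldtup->t_self);
	if (!heap_fetch(rel, SnapshotAny, oldtup, oldbuf, false, NULL))
		elog(ERROR, "failed to fetch old tuple for AFTER trigger");

	/* heap_update() sets the old tuple's t_ctid to the new TID */
	ItemPointerCopy(&oldtup->t_data->t_ctid, &newtup->t_self);
	if (!heap_fetch(rel, SnapshotAny, newtup, newbuf, false, NULL))
		elog(ERROR, "failed to fetch new tuple for AFTER trigger");
}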

I'm aware that the "triggers on columns" patch is going to be a problem
for this. I haven't looked at it in any detail, but I suspect that it won't
work with a lossy queue, because the information about exactly which
rows to trigger on is only known at update time. So maybe I could fall
back on a tuplestore, spilling to disk in that case?
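
That is, something along these lines (names invented for illustration;
a tuplestore spills to disk once it exceeds work_mem, so it bounds
memory while still keeping exact per-row detail):

#include "postgres.h"
#include "miscadmin.h"			/* work_mem */
#include "utils/tuplestore.h"

/*
 * Hypothetical fallback for triggers that need exact per-row
 * information and so can't live in a lossy bitmap: spool the tuples
 * into a tuplestore, which spills to disk once it exceeds work_mem
 * rather than exhausting backend memory.
 */
static Tuplestorestate *
exact_queue_create(void)
{
	/* no random access, not inter-transaction, cap at work_mem KB */
	return tuplestore_begin_heap(false, false, work_mem);
}

static void
exact_queue_add(Tuplestorestate *store, HeapTuple tuple)
{
	tuplestore_puttuple(store, tuple);	/* copies the tuple */
}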

Thoughts?

- Dean

#2 Dean Rasheed
dean.a.rasheed@googlemail.com
In reply to: Dean Rasheed (#1)
1 attachment(s)
Re: Scaling up deferred unique checks and the after trigger queue

This is a WIP patch to replace the after-trigger queues with TID bitmaps
to prevent them from using excessive amounts of memory. Each round of
trigger executions is a modified bitmap heap scan.
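
In outline, one firing round looks like this (condensed from
AfterTriggerExecuteTupleSet() in the attached patch, where it sits in
trigger.c; the scan uses SnapshotAny, with
AfterTriggerTupleSatisfiesTrigger() as the per-tuple qual):

static void
fire_one_round(AfterTriggerTupleSet ts, EState *estate, Relation rel,
			   TriggerDesc *trigdesc, FmgrInfo *finfo,
			   Instrumentation *instr, MemoryContext per_tuple_context)
{
	BitmapHeapScanState *planstate;

	/* the queued TID bitmap drives a modified bitmap heap scan */
	planstate = ExecInitTriggerBitmapHeapScan(rel, estate, ts->atts_tbm, ts);

	for (;;)
	{
		TupleTableSlot *slot = ExecProcNode((PlanState *) planstate);
		HeapTuple	tuple;
		ItemPointerData ctid1;
		ItemPointerData ctid2;

		if (TupIsNull(slot))
			break;

		tuple = slot->tts_tuple;
		ItemPointerCopy(&tuple->t_self, &ctid1);

		/* for UPDATE events, the new tuple version hangs off t_ctid */
		if ((ts->atts_event & TRIGGER_EVENT_OPMASK) == TRIGGER_EVENT_UPDATE)
			ItemPointerCopy(&tuple->t_data->t_ctid, &ctid2);
		else
			ItemPointerSetInvalid(&ctid2);

		/* fire the set's trigger(s) for this tuple */
		AfterTriggerExecute(ts, rel, trigdesc, finfo, instr,
							per_tuple_context, &ctid1, &ctid2);
	}

	ExecEndBitmapHeapScan(planstate);
}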

Part of the motivation for this is to scale up the deferrable unique
constraints support. I've not done anything to optimise that case
directly yet, but this patch will make it easier to switch over to doing a
bulk check in that case. Even though it still executes the trigger for
each potential violation, performance for large updates is already
greatly improved because of the reduction in the queue's memory footprint.

It passes all the regression tests, except for the copy2 test, which
sometimes fails with rows ordered differently. I believe that the
reason for this failure is that pages are being pruned during the
bitmap heap scan, so the UPDATEs inside later trigger executions start
re-using space at the start of the page, whereas the old code moved on
to the next page. This could be fixed by changing the test to group the
COPYs inside a single transaction, preventing page pruning.

- Dean

Attachments:

after_triggers_queue.patch (text/x-patch; charset=US-ASCII)
*** ./src/backend/commands/copy.c.orig	2009-10-07 08:28:40.000000000 +0100
--- ./src/backend/commands/copy.c	2009-10-07 08:28:39.000000000 +0100
***************
*** 2147,2154 ****
  
  		if (!skip_tuple)
  		{
- 			List *recheckIndexes = NIL;
- 
  			/* Place tuple in tuple slot */
  			ExecStoreTuple(tuple, slot, InvalidBuffer, false);
  
--- 2147,2152 ----
***************
*** 2160,2171 ****
  			heap_insert(cstate->rel, tuple, mycid, hi_options, bistate);
  
  			if (resultRelInfo->ri_NumIndices > 0)
! 				recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
! 													   estate, false);
  
  			/* AFTER ROW INSERT Triggers */
! 			ExecARInsertTriggers(estate, resultRelInfo, tuple,
! 								 recheckIndexes);
  
  			/*
  			 * We count only tuples not suppressed by a BEFORE INSERT trigger;
--- 2158,2167 ----
  			heap_insert(cstate->rel, tuple, mycid, hi_options, bistate);
  
  			if (resultRelInfo->ri_NumIndices > 0)
! 				ExecInsertIndexTuples(slot, &(tuple->t_self), estate, false);
  
  			/* AFTER ROW INSERT Triggers */
! 			ExecARInsertTriggers(estate, resultRelInfo, tuple);
  
  			/*
  			 * We count only tuples not suppressed by a BEFORE INSERT trigger;
*** ./src/backend/commands/trigger.c.orig	2009-10-15 08:36:05.000000000 +0100
--- ./src/backend/commands/trigger.c	2009-10-19 10:23:32.000000000 +0100
***************
*** 29,34 ****
--- 29,35 ----
  #include "commands/trigger.h"
  #include "executor/executor.h"
  #include "executor/instrument.h"
+ #include "executor/nodeBitmapHeapscan.h"
  #include "miscadmin.h"
  #include "nodes/bitmapset.h"
  #include "nodes/makefuncs.h"
***************
*** 74,80 ****
  					MemoryContext per_tuple_context);
  static void AfterTriggerSaveEvent(ResultRelInfo *relinfo, int event,
  					  bool row_trigger, HeapTuple oldtup, HeapTuple newtup,
! 					  List *recheckIndexes, Bitmapset *modifiedCols);
  
  
  /*
--- 75,81 ----
  					MemoryContext per_tuple_context);
  static void AfterTriggerSaveEvent(ResultRelInfo *relinfo, int event,
  					  bool row_trigger, HeapTuple oldtup, HeapTuple newtup,
! 					  Bitmapset *modifiedCols);
  
  
  /*
***************
*** 1711,1717 ****
  
  	if (trigdesc && trigdesc->n_after_statement[TRIGGER_EVENT_INSERT] > 0)
  		AfterTriggerSaveEvent(relinfo, TRIGGER_EVENT_INSERT,
! 							  false, NULL, NULL, NIL, NULL);
  }
  
  HeapTuple
--- 1712,1718 ----
  
  	if (trigdesc && trigdesc->n_after_statement[TRIGGER_EVENT_INSERT] > 0)
  		AfterTriggerSaveEvent(relinfo, TRIGGER_EVENT_INSERT,
! 							  false, NULL, NULL, NULL);
  }
  
  HeapTuple
***************
*** 1758,1770 ****
  
  void
  ExecARInsertTriggers(EState *estate, ResultRelInfo *relinfo,
! 					 HeapTuple trigtuple, List *recheckIndexes)
  {
  	TriggerDesc *trigdesc = relinfo->ri_TrigDesc;
  
  	if (trigdesc && trigdesc->n_after_row[TRIGGER_EVENT_INSERT] > 0)
  		AfterTriggerSaveEvent(relinfo, TRIGGER_EVENT_INSERT,
! 							  true, NULL, trigtuple, recheckIndexes, NULL);
  }
  
  void
--- 1759,1771 ----
  
  void
  ExecARInsertTriggers(EState *estate, ResultRelInfo *relinfo,
! 					 HeapTuple trigtuple)
  {
  	TriggerDesc *trigdesc = relinfo->ri_TrigDesc;
  
  	if (trigdesc && trigdesc->n_after_row[TRIGGER_EVENT_INSERT] > 0)
  		AfterTriggerSaveEvent(relinfo, TRIGGER_EVENT_INSERT,
! 							  true, NULL, trigtuple, NULL);
  }
  
  void
***************
*** 1824,1830 ****
  
  	if (trigdesc && trigdesc->n_after_statement[TRIGGER_EVENT_DELETE] > 0)
  		AfterTriggerSaveEvent(relinfo, TRIGGER_EVENT_DELETE,
! 							  false, NULL, NULL, NIL, NULL);
  }
  
  bool
--- 1825,1831 ----
  
  	if (trigdesc && trigdesc->n_after_statement[TRIGGER_EVENT_DELETE] > 0)
  		AfterTriggerSaveEvent(relinfo, TRIGGER_EVENT_DELETE,
! 							  false, NULL, NULL, NULL);
  }
  
  bool
***************
*** 1894,1900 ****
  												   tupleid, NULL);
  
  		AfterTriggerSaveEvent(relinfo, TRIGGER_EVENT_DELETE,
! 							  true, trigtuple, NULL, NIL, NULL);
  		heap_freetuple(trigtuple);
  	}
  }
--- 1895,1901 ----
  												   tupleid, NULL);
  
  		AfterTriggerSaveEvent(relinfo, TRIGGER_EVENT_DELETE,
! 							  true, trigtuple, NULL, NULL);
  		heap_freetuple(trigtuple);
  	}
  }
***************
*** 1959,1965 ****
  
  	if (trigdesc && trigdesc->n_after_statement[TRIGGER_EVENT_UPDATE] > 0)
  		AfterTriggerSaveEvent(relinfo, TRIGGER_EVENT_UPDATE,
! 							  false, NULL, NULL, NIL,
  							  GetModifiedColumns(relinfo, estate));
  }
  
--- 1960,1966 ----
  
  	if (trigdesc && trigdesc->n_after_statement[TRIGGER_EVENT_UPDATE] > 0)
  		AfterTriggerSaveEvent(relinfo, TRIGGER_EVENT_UPDATE,
! 							  false, NULL, NULL,
  							  GetModifiedColumns(relinfo, estate));
  }
  
***************
*** 2027,2034 ****
  
  void
  ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
! 					 ItemPointer tupleid, HeapTuple newtuple,
! 					 List *recheckIndexes)
  {
  	TriggerDesc *trigdesc = relinfo->ri_TrigDesc;
  
--- 2028,2034 ----
  
  void
  ExecARUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
! 					 ItemPointer tupleid, HeapTuple newtuple)
  {
  	TriggerDesc *trigdesc = relinfo->ri_TrigDesc;
  
***************
*** 2038,2044 ****
  												   tupleid, NULL);
  
  		AfterTriggerSaveEvent(relinfo, TRIGGER_EVENT_UPDATE,
! 							  true, trigtuple, newtuple, recheckIndexes,
  							  GetModifiedColumns(relinfo, estate));
  		heap_freetuple(trigtuple);
  	}
--- 2038,2044 ----
  												   tupleid, NULL);
  
  		AfterTriggerSaveEvent(relinfo, TRIGGER_EVENT_UPDATE,
! 							  true, trigtuple, newtuple,
  							  GetModifiedColumns(relinfo, estate));
  		heap_freetuple(trigtuple);
  	}
***************
*** 2101,2107 ****
  
  	if (trigdesc && trigdesc->n_after_statement[TRIGGER_EVENT_TRUNCATE] > 0)
  		AfterTriggerSaveEvent(relinfo, TRIGGER_EVENT_TRUNCATE,
! 							  false, NULL, NULL, NIL, NULL);
  }
  
  
--- 2101,2107 ----
  
  	if (trigdesc && trigdesc->n_after_statement[TRIGGER_EVENT_TRUNCATE] > 0)
  		AfterTriggerSaveEvent(relinfo, TRIGGER_EVENT_TRUNCATE,
! 							  false, NULL, NULL, NULL);
  }
  
  
***************
*** 2269,2277 ****
   * considerable effort to minimize per-event memory consumption.  The event
   * records are grouped into chunks and common data for similar events in the
   * same chunk is only stored once.
-  *
-  * XXX We need to be able to save the per-event data in a file if it grows too
-  * large.
   * ----------
   */
  
--- 2269,2274 ----
***************
*** 2309,2422 ****
  /*
   * Per-trigger-event data
   *
!  * The actual per-event data, AfterTriggerEventData, includes DONE/IN_PROGRESS
!  * status bits and one or two tuple CTIDs.	Each event record also has an
!  * associated AfterTriggerSharedData that is shared across all instances
!  * of similar events within a "chunk".
!  *
!  * We arrange not to waste storage on ate_ctid2 for non-update events.
!  * We could go further and not store either ctid for statement-level triggers,
!  * but that seems unlikely to be worth the trouble.
!  *
!  * Note: ats_firing_id is initially zero and is set to something else when
!  * AFTER_TRIGGER_IN_PROGRESS is set.  It indicates which trigger firing
!  * cycle the trigger will be fired in (or was fired in, if DONE is set).
!  * Although this is mutable state, we can keep it in AfterTriggerSharedData
!  * because all instances of the same type of event in a given event list will
!  * be fired at the same time, if they were queued between the same firing
!  * cycles.	So we need only ensure that ats_firing_id is zero when attaching
!  * a new event to an existing AfterTriggerSharedData record.
   */
  typedef uint32 TriggerFlags;
  
- #define AFTER_TRIGGER_OFFSET			0x0FFFFFFF		/* must be low-order
- 														 * bits */
- #define AFTER_TRIGGER_2CTIDS			0x10000000
  #define AFTER_TRIGGER_DONE				0x20000000
  #define AFTER_TRIGGER_IN_PROGRESS		0x40000000
  
- typedef struct AfterTriggerSharedData *AfterTriggerShared;
- 
- typedef struct AfterTriggerSharedData
- {
- 	TriggerEvent ats_event;		/* event type indicator, see trigger.h */
- 	Oid			ats_tgoid;		/* the trigger's ID */
- 	Oid			ats_relid;		/* the relation it's on */
- 	CommandId	ats_firing_id;	/* ID for firing cycle */
- } AfterTriggerSharedData;
- 
- typedef struct AfterTriggerEventData *AfterTriggerEvent;
- 
- typedef struct AfterTriggerEventData
- {
- 	TriggerFlags ate_flags;		/* status bits and offset to shared data */
- 	ItemPointerData ate_ctid1;	/* inserted, deleted, or old updated tuple */
- 	ItemPointerData ate_ctid2;	/* new updated tuple */
- } AfterTriggerEventData;
  
! /* This struct must exactly match the one above except for not having ctid2 */
! typedef struct AfterTriggerEventDataOneCtid
  {
! 	TriggerFlags ate_flags;		/* status bits and offset to shared data */
! 	ItemPointerData ate_ctid1;	/* inserted, deleted, or old updated tuple */
! } AfterTriggerEventDataOneCtid;
  
! #define SizeofTriggerEvent(evt) \
! 	(((evt)->ate_flags & AFTER_TRIGGER_2CTIDS) ? \
! 	 sizeof(AfterTriggerEventData) : sizeof(AfterTriggerEventDataOneCtid))
  
- #define GetTriggerSharedData(evt) \
- 	((AfterTriggerShared) ((char *) (evt) + ((evt)->ate_flags & AFTER_TRIGGER_OFFSET)))
  
  /*
!  * To avoid palloc overhead, we keep trigger events in arrays in successively-
!  * larger chunks (a slightly more sophisticated version of an expansible
!  * array).	The space between CHUNK_DATA_START and freeptr is occupied by
!  * AfterTriggerEventData records; the space between endfree and endptr is
!  * occupied by AfterTriggerSharedData records.
   */
- typedef struct AfterTriggerEventChunk
- {
- 	struct AfterTriggerEventChunk *next;		/* list link */
- 	char	   *freeptr;		/* start of free space in chunk */
- 	char	   *endfree;		/* end of free space in chunk */
- 	char	   *endptr;			/* end of chunk */
- 	/* event data follows here */
- } AfterTriggerEventChunk;
- 
- #define CHUNK_DATA_START(cptr) ((char *) (cptr) + MAXALIGN(sizeof(AfterTriggerEventChunk)))
- 
- /* A list of events */
  typedef struct AfterTriggerEventList
  {
! 	AfterTriggerEventChunk *head;
! 	AfterTriggerEventChunk *tail;
! 	char	   *tailfree;		/* freeptr of tail chunk */
  } AfterTriggerEventList;
  
- /* Macros to help in iterating over a list of events */
- #define for_each_chunk(cptr, evtlist) \
- 	for (cptr = (evtlist).head; cptr != NULL; cptr = cptr->next)
- #define for_each_event(eptr, cptr) \
- 	for (eptr = (AfterTriggerEvent) CHUNK_DATA_START(cptr); \
- 		 (char *) eptr < (cptr)->freeptr; \
- 		 eptr = (AfterTriggerEvent) (((char *) eptr) + SizeofTriggerEvent(eptr)))
- /* Use this if no special per-chunk processing is needed */
- #define for_each_event_chunk(eptr, cptr, evtlist) \
- 	for_each_chunk(cptr, evtlist) for_each_event(eptr, cptr)
- 
  
  /*
   * All per-transaction data for the AFTER TRIGGERS module.
   *
   * AfterTriggersData has the following fields:
   *
-  * firing_counter is incremented for each call of afterTriggerInvokeEvents.
-  * We mark firable events with the current firing cycle's ID so that we can
-  * tell which ones to work on.	This ensures sane behavior if a trigger
-  * function chooses to do SET CONSTRAINTS: the inner SET CONSTRAINTS will
-  * only fire those events that weren't already scheduled for firing.
-  *
   * state keeps track of the transaction-local effects of SET CONSTRAINTS.
   * This is saved and restored across failed subtransactions.
   *
--- 2306,2367 ----
  /*
   * Per-trigger-event data
   *
!  * The actual per-event data, AfterTriggerTupleSetData, includes
!  * DONE/IN_PROGRESS status bits. Each record represents a set of tuples
!  * to be fired for one or more triggers.
!  *
!  * Note: atts_query_cmd is the command ID at the start of the query
!  * which caused the triggers to be queued and atts_firing_cmd is the
!  * command ID when we started executing the triggers. We only fire
!  * triggers for tuples updated between these command IDs, and so any
!  * additional triggers queued during the trigger firing round are not
!  * fired until the next round. Tuple sets being fired are locked, so
!  * that any additional triggers queued are added to a new set, to be
!  * fired later.
   */
  typedef uint32 TriggerFlags;
  
  #define AFTER_TRIGGER_DONE				0x20000000
  #define AFTER_TRIGGER_IN_PROGRESS		0x40000000
+ #define AFTER_TRIGGER_UNIQUE_KEY_RECHECK 0x80000000
  
  
! /*
!  * This structure holds the details of a trigger event and the set of
!  * tuples on which the trigger(s) for that event should be fired.
!  */
! typedef struct AfterTriggerTupleSetData
  {
! 	struct AfterTriggerTupleSetData	*atts_next;	/* next in linked list */
! 	Oid				atts_relid;		/* the relation the rows are from */
! 	TriggerEvent	atts_event;		/* the triggering event */
! 	List		   *atts_tgoids;	/* the triggers to fire */
! 	CommandId		atts_query_cmd;	/* cmd ID when the query started */
! 	CommandId		atts_firing_cmd; /* cmd ID when we started firing trigs */
! 	TriggerFlags	atts_flags;		/* status bits (fired/in progress/...) */
! 	TIDBitmap	   *atts_tbm;	/* the relevant tuples */
! } AfterTriggerTupleSetData;
  
! typedef struct AfterTriggerTupleSetData *AfterTriggerTupleSet;
  
  
  /*
!  * A list of events (tuple sets). This is either the triggers to fire at
!  * the end of a command, or the list of deferred triggers.
   */
  typedef struct AfterTriggerEventList
  {
! 	AfterTriggerTupleSet	atel_head;
! 	AfterTriggerTupleSet	atel_tail;
! 	CommandId		atel_query_cmd;
  } AfterTriggerEventList;
  
  
  /*
   * All per-transaction data for the AFTER TRIGGERS module.
   *
   * AfterTriggersData has the following fields:
   *
   * state keeps track of the transaction-local effects of SET CONSTRAINTS.
   * This is saved and restored across failed subtransactions.
   *
***************
*** 2424,2430 ****
   * all subtransactions of the current transaction.	In a subtransaction
   * abort, we know that the events added by the subtransaction are at the
   * end of the list, so it is relatively easy to discard them.  The event
!  * list chunks themselves are stored in event_cxt.
   *
   * query_depth is the current depth of nested AfterTriggerBeginQuery calls
   * (-1 when the stack is empty).
--- 2369,2375 ----
   * all subtransactions of the current transaction.	In a subtransaction
   * abort, we know that the events added by the subtransaction are at the
   * end of the list, so it is relatively easy to discard them.  The event
!  * list tuple sets themselves are stored in event_cxt.
   *
   * query_depth is the current depth of nested AfterTriggerBeginQuery calls
   * (-1 when the stack is empty).
***************
*** 2448,2456 ****
   * depth_stack is a stack of copies of subtransaction-start-time query_depth,
   * which we similarly use to clean up at subtransaction abort.
   *
!  * firing_stack is a stack of copies of subtransaction-start-time
!  * firing_counter.	We use this to recognize which deferred triggers were
!  * fired (or marked for firing) within an aborted subtransaction.
   *
   * We use GetCurrentTransactionNestLevel() to determine the correct array
   * index in these stacks.  maxtransdepth is the number of allocated entries in
--- 2393,2401 ----
   * depth_stack is a stack of copies of subtransaction-start-time query_depth,
   * which we similarly use to clean up at subtransaction abort.
   *
!  * cmd_stack is a stack of copies of subtransaction-start-time command IDs.
!  * We use this to recognize which deferred triggers were fired (or marked
!  * for firing) within an aborted subtransaction.
   *
   * We use GetCurrentTransactionNestLevel() to determine the correct array
   * index in these stacks.  maxtransdepth is the number of allocated entries in
***************
*** 2460,2466 ****
   */
  typedef struct AfterTriggersData
  {
- 	CommandId	firing_counter; /* next firing ID to assign */
  	SetConstraintState state;	/* the active S C state */
  	AfterTriggerEventList events;		/* deferred-event list */
  	int			query_depth;	/* current query list index */
--- 2405,2410 ----
***************
*** 2473,2479 ****
  	SetConstraintState *state_stack;	/* stacked S C states */
  	AfterTriggerEventList *events_stack;		/* stacked list pointers */
  	int		   *depth_stack;	/* stacked query_depths */
! 	CommandId  *firing_stack;	/* stacked firing_counters */
  	int			maxtransdepth;	/* allocated len of above arrays */
  } AfterTriggersData;
  
--- 2417,2423 ----
  	SetConstraintState *state_stack;	/* stacked S C states */
  	AfterTriggerEventList *events_stack;		/* stacked list pointers */
  	int		   *depth_stack;	/* stacked query_depths */
! 	CommandId  *cmd_stack;		/* stacked command IDs */
  	int			maxtransdepth;	/* allocated len of above arrays */
  } AfterTriggersData;
  
***************
*** 2482,2492 ****
  static AfterTriggers afterTriggers;
  
  
- static void AfterTriggerExecute(AfterTriggerEvent event,
- 					Relation rel, TriggerDesc *trigdesc,
- 					FmgrInfo *finfo,
- 					Instrumentation *instr,
- 					MemoryContext per_tuple_context);
  static SetConstraintState SetConstraintStateCreate(int numalloc);
  static SetConstraintState SetConstraintStateCopy(SetConstraintState state);
  static SetConstraintState SetConstraintStateAddItem(SetConstraintState state,
--- 2426,2431 ----
***************
*** 2500,2508 ****
   * ----------
   */
  static bool
! afterTriggerCheckState(AfterTriggerShared evtshared)
  {
- 	Oid			tgoid = evtshared->ats_tgoid;
  	SetConstraintState state = afterTriggers->state;
  	int			i;
  
--- 2439,2446 ----
   * ----------
   */
  static bool
! afterTriggerCheckState(Oid tgoid, TriggerEvent event)
  {
  	SetConstraintState state = afterTriggers->state;
  	int			i;
  
***************
*** 2510,2516 ****
  	 * For not-deferrable triggers (i.e. normal AFTER ROW triggers and
  	 * constraints declared NOT DEFERRABLE), the state is always false.
  	 */
! 	if ((evtshared->ats_event & AFTER_TRIGGER_DEFERRABLE) == 0)
  		return false;
  
  	/*
--- 2448,2454 ----
  	 * For not-deferrable triggers (i.e. normal AFTER ROW triggers and
  	 * constraints declared NOT DEFERRABLE), the state is always false.
  	 */
! 	if ((event & AFTER_TRIGGER_DEFERRABLE) == 0)
  		return false;
  
  	/*
***************
*** 2531,2537 ****
  	/*
  	 * Otherwise return the default state for the trigger.
  	 */
! 	return ((evtshared->ats_event & AFTER_TRIGGER_INITDEFERRED) != 0);
  }
  
  
--- 2469,2475 ----
  	/*
  	 * Otherwise return the default state for the trigger.
  	 */
! 	return ((event & AFTER_TRIGGER_INITDEFERRED) != 0);
  }
  
  
***************
*** 2539,2655 ****
   * afterTriggerAddEvent()
   *
   *	Add a new trigger event to the specified queue.
!  *	The passed-in event data is copied.
   * ----------
   */
  static void
  afterTriggerAddEvent(AfterTriggerEventList *events,
! 					 AfterTriggerEvent event, AfterTriggerShared evtshared)
  {
! 	Size		eventsize = SizeofTriggerEvent(event);
! 	Size		needed = eventsize + sizeof(AfterTriggerSharedData);
! 	AfterTriggerEventChunk *chunk;
! 	AfterTriggerShared newshared;
! 	AfterTriggerEvent newevent;
! 
! 	/*
! 	 * If empty list or not enough room in the tail chunk, make a new chunk.
! 	 * We assume here that a new shared record will always be needed.
! 	 */
! 	chunk = events->tail;
! 	if (chunk == NULL ||
! 		chunk->endfree - chunk->freeptr < needed)
! 	{
! 		Size		chunksize;
! 
! 		/* Create event context if we didn't already */
! 		if (afterTriggers->event_cxt == NULL)
! 			afterTriggers->event_cxt =
! 				AllocSetContextCreate(TopTransactionContext,
! 									  "AfterTriggerEvents",
! 									  ALLOCSET_DEFAULT_MINSIZE,
! 									  ALLOCSET_DEFAULT_INITSIZE,
! 									  ALLOCSET_DEFAULT_MAXSIZE);
! 
! 		/*
! 		 * Chunk size starts at 1KB and is allowed to increase up to 1MB.
! 		 * These numbers are fairly arbitrary, though there is a hard limit at
! 		 * AFTER_TRIGGER_OFFSET; else we couldn't link event records to their
! 		 * shared records using the available space in ate_flags.  Another
! 		 * constraint is that if the chunk size gets too huge, the search loop
! 		 * below would get slow given a (not too common) usage pattern with
! 		 * many distinct event types in a chunk.  Therefore, we double the
! 		 * preceding chunk size only if there weren't too many shared records
! 		 * in the preceding chunk; otherwise we halve it.  This gives us some
! 		 * ability to adapt to the actual usage pattern of the current query
! 		 * while still having large chunk sizes in typical usage.  All chunk
! 		 * sizes used should be MAXALIGN multiples, to ensure that the shared
! 		 * records will be aligned safely.
! 		 */
! #define MIN_CHUNK_SIZE 1024
! #define MAX_CHUNK_SIZE (1024*1024)
! 
! #if MAX_CHUNK_SIZE > (AFTER_TRIGGER_OFFSET+1)
! #error MAX_CHUNK_SIZE must not exceed AFTER_TRIGGER_OFFSET
! #endif
! 
! 		if (chunk == NULL)
! 			chunksize = MIN_CHUNK_SIZE;
! 		else
! 		{
! 			/* preceding chunk size... */
! 			chunksize = chunk->endptr - (char *) chunk;
! 			/* check number of shared records in preceding chunk */
! 			if ((chunk->endptr - chunk->endfree) <=
! 				(100 * sizeof(AfterTriggerSharedData)))
! 				chunksize *= 2; /* okay, double it */
! 			else
! 				chunksize /= 2; /* too many shared records */
! 			chunksize = Min(chunksize, MAX_CHUNK_SIZE);
! 		}
! 		chunk = MemoryContextAlloc(afterTriggers->event_cxt, chunksize);
! 		chunk->next = NULL;
! 		chunk->freeptr = CHUNK_DATA_START(chunk);
! 		chunk->endptr = chunk->endfree = (char *) chunk + chunksize;
! 		Assert(chunk->endfree - chunk->freeptr >= needed);
  
! 		if (events->head == NULL)
! 			events->head = chunk;
  		else
! 			events->tail->next = chunk;
! 		events->tail = chunk;
  	}
  
  	/*
! 	 * Try to locate a matching shared-data record already in the chunk. If
! 	 * none, make a new one.
  	 */
! 	for (newshared = ((AfterTriggerShared) chunk->endptr) - 1;
! 		 (char *) newshared >= chunk->endfree;
! 		 newshared--)
  	{
! 		if (newshared->ats_tgoid == evtshared->ats_tgoid &&
! 			newshared->ats_relid == evtshared->ats_relid &&
! 			newshared->ats_event == evtshared->ats_event &&
! 			newshared->ats_firing_id == 0)
! 			break;
  	}
- 	if ((char *) newshared < chunk->endfree)
- 	{
- 		*newshared = *evtshared;
- 		newshared->ats_firing_id = 0;	/* just to be sure */
- 		chunk->endfree = (char *) newshared;
- 	}
- 
- 	/* Insert the data */
- 	newevent = (AfterTriggerEvent) chunk->freeptr;
- 	memcpy(newevent, event, eventsize);
- 	/* ... and link the new event to its shared record */
- 	newevent->ate_flags &= ~AFTER_TRIGGER_OFFSET;
- 	newevent->ate_flags |= (char *) newshared - (char *) newevent;
  
! 	chunk->freeptr += eventsize;
! 	events->tailfree = chunk->freeptr;
  }
  
  /* ----------
--- 2477,2599 ----
   * afterTriggerAddEvent()
   *
   *	Add a new trigger event to the specified queue.
!  *
!  *	If the tuple set from_ts is non-NULL, then all the rows from that
!  *	tuple set are added to the queue and from_ts is freed. Otherwise,
!  *	just a single entry is added.
   * ----------
   */
  static void
  afterTriggerAddEvent(AfterTriggerEventList *events,
! 					 Oid relid, TriggerEvent event, List *tgoids,
! 					 ItemPointer ctid, AfterTriggerTupleSet from_ts,
! 					 TriggerFlags flags)
  {
! 	AfterTriggerTupleSet ts;
! 
! 	/*
! 	 * Search for the tuple set in which to store this trigger event.
! 	 * Don't add to tuple sets already fired or currently firing.
! 	 *
! 	 * For deferrable triggers, or triggers which may only fire for a
! 	 * smaller subset of the rows (such as deferred uniqueness checks,
! 	 * or FK checks), the list tgoids will contain just one OID.
! 	 *
! 	 * Otherwise, for triggers which fire immediately, and for all
! 	 * rows, we assume that the list of OIDs is immutable for the
! 	 * duration of the statement, so we just compare the first item.
! 	 *
! 	 * Tuple sets which have been fired or are currently being fired
! 	 * are considered to be locked, and no trigger events can be added
! 	 * to them. The event is instead added to another tuple set, to be
! 	 * fired later.
! 	 */
! 	for (ts = events->atel_head; ts != NULL; ts = ts->atts_next)
! 		if (ts->atts_relid == relid &&
! 			ts->atts_event == event &&
! 			linitial_oid(ts->atts_tgoids) == linitial_oid(tgoids) &&
! 			!(ts->atts_flags &
! 			  (AFTER_TRIGGER_DONE | AFTER_TRIGGER_IN_PROGRESS)))
! 			break;
! 
! 	if (ts == NULL)
! 	{
! 		MemoryContext oldContext;
! 
! 		/*
! 		 * Allocate a new tuple set, and add it to the list. The tuple
! 		 * set structure is in TopTransactionContext.
! 		 */
! 		oldContext = MemoryContextSwitchTo(TopTransactionContext);
! 
! 		ts = (AfterTriggerTupleSet) palloc(sizeof(AfterTriggerTupleSetData));
  
! 		ts->atts_next = NULL;
! 		ts->atts_relid = relid;
! 		ts->atts_event = event;
! 		ts->atts_tgoids = list_copy(tgoids);
! 		ts->atts_query_cmd = events->atel_query_cmd;
! 		ts->atts_firing_cmd = 0;
! 		ts->atts_flags = flags;
! 		ts->atts_tbm = NULL;
! 
! 		MemoryContextSwitchTo(oldContext);
! 
! 		if (events->atel_tail != NULL)
! 			events->atel_tail->atts_next = ts;
  		else
! 			events->atel_head = ts;
! 		events->atel_tail = ts;
! 
! 		if (from_ts == NULL && (event & TRIGGER_EVENT_ROW))
! 		{
! 			/*
! 			 * Create the TID bitmap in a separate child context.
! 			 */
! 			if (afterTriggers->event_cxt == NULL)
! 				afterTriggers->event_cxt =
! 					AllocSetContextCreate(TopTransactionContext,
! 										  "AfterTriggerEvents",
! 										  ALLOCSET_DEFAULT_MINSIZE,
! 										  ALLOCSET_DEFAULT_INITSIZE,
! 										  ALLOCSET_DEFAULT_MAXSIZE);
! 
! 			oldContext = MemoryContextSwitchTo(afterTriggers->event_cxt);
! 			ts->atts_tbm = tbm_create(work_mem * 1024L);
! 			MemoryContextSwitchTo(oldContext);
! 		}
  	}
  
  	/*
! 	 * For row triggers, add the tuple(s) to the tuple set's bitmap.
  	 */
! 	if (event & TRIGGER_EVENT_ROW)
  	{
! 		if (from_ts != NULL)
! 		{
! 			if (ts->atts_tbm == NULL)
! 			{
! 				ts->atts_tbm = from_ts->atts_tbm;
! 				from_ts->atts_tbm = NULL;
! 			}
! 			else
! 				tbm_union(ts->atts_tbm, from_ts->atts_tbm);
! 		}
! 		else
! 			tbm_add_tuples(ts->atts_tbm, ctid, 1, false);
  	}
  
! 	/*
! 	 * Free up from_ts, now that we have its data
! 	 */
! 	if (from_ts != NULL)
! 	{
! 		if (from_ts->atts_tbm != NULL)
! 			tbm_free(from_ts->atts_tbm);
! 		if (from_ts->atts_tgoids != NIL)
! 			list_free(from_ts->atts_tgoids);
! 		pfree(from_ts);
! 	}
  }
  
  /* ----------
***************
*** 2661,2677 ****
  static void
  afterTriggerFreeEventList(AfterTriggerEventList *events)
  {
! 	AfterTriggerEventChunk *chunk;
! 	AfterTriggerEventChunk *next_chunk;
  
! 	for (chunk = events->head; chunk != NULL; chunk = next_chunk)
  	{
! 		next_chunk = chunk->next;
! 		pfree(chunk);
  	}
! 	events->head = NULL;
! 	events->tail = NULL;
! 	events->tailfree = NULL;
  }
  
  /* ----------
--- 2605,2626 ----
  static void
  afterTriggerFreeEventList(AfterTriggerEventList *events)
  {
! 	AfterTriggerTupleSet ts;
! 	AfterTriggerTupleSet next_ts;
  
! 	for (ts = events->atel_head; ts != NULL; ts = next_ts)
  	{
! 		next_ts = ts->atts_next;
! 
! 		if (ts->atts_tbm != NULL)
! 			tbm_free(ts->atts_tbm);
! 		if (ts->atts_tgoids != NIL)
! 			list_free(ts->atts_tgoids);
! 		pfree(ts);
  	}
! 
! 	events->atel_head = NULL;
! 	events->atel_tail = NULL;
  }
  
  /* ----------
***************
*** 2685,2694 ****
  afterTriggerRestoreEventList(AfterTriggerEventList *events,
  							 const AfterTriggerEventList *old_events)
  {
! 	AfterTriggerEventChunk *chunk;
! 	AfterTriggerEventChunk *next_chunk;
  
! 	if (old_events->tail == NULL)
  	{
  		/* restoring to a completely empty state, so free everything */
  		afterTriggerFreeEventList(events);
--- 2634,2643 ----
  afterTriggerRestoreEventList(AfterTriggerEventList *events,
  							 const AfterTriggerEventList *old_events)
  {
! 	AfterTriggerTupleSet ts;
! 	AfterTriggerTupleSet next_ts;
  
! 	if (old_events->atel_tail == NULL)
  	{
  		/* restoring to a completely empty state, so free everything */
  		afterTriggerFreeEventList(events);
***************
*** 2696,2715 ****
  	else
  	{
  		*events = *old_events;
! 		/* free any chunks after the last one we want to keep */
! 		for (chunk = events->tail->next; chunk != NULL; chunk = next_chunk)
  		{
! 			next_chunk = chunk->next;
! 			pfree(chunk);
! 		}
! 		/* and clean up the tail chunk to be the right length */
! 		events->tail->next = NULL;
! 		events->tail->freeptr = events->tailfree;
  
! 		/*
! 		 * We don't make any effort to remove now-unused shared data records.
! 		 * They might still be useful, anyway.
! 		 */
  	}
  }
  
--- 2645,2662 ----
  	else
  	{
  		*events = *old_events;
! 		/* free any tuple sets after the last one we want to keep */
! 		for (ts = events->atel_tail->atts_next; ts != NULL; ts = next_ts)
  		{
! 			next_ts = ts->atts_next;
  
! 			if (ts->atts_tbm != NULL)
! 				tbm_free(ts->atts_tbm);
! 			if (ts->atts_tgoids != NIL)
! 				list_free(ts->atts_tgoids);
! 			pfree(ts);
! 		}
! 		events->atel_tail->atts_next = NULL;
  	}
  }
  
***************
*** 2717,2846 ****
  /* ----------
   * AfterTriggerExecute()
   *
!  *	Fetch the required tuples back from the heap and fire one
!  *	single trigger function.
   *
   *	Frequently, this will be fired many times in a row for triggers of
   *	a single relation.	Therefore, we cache the open relation and provide
   *	fmgr lookup cache space at the caller level.  (For triggers fired at
   *	the end of a query, we can even piggyback on the executor's state.)
   *
!  *	event: event currently being fired.
   *	rel: open relation for event.
   *	trigdesc: working copy of rel's trigger info.
   *	finfo: array of fmgr lookup cache entries (one per trigger in trigdesc).
   *	instr: array of EXPLAIN ANALYZE instrumentation nodes (one per trigger),
   *		or NULL if no instrumentation is wanted.
   *	per_tuple_context: memory context to call trigger function in.
   * ----------
   */
  static void
! AfterTriggerExecute(AfterTriggerEvent event,
  					Relation rel, TriggerDesc *trigdesc,
  					FmgrInfo *finfo, Instrumentation *instr,
! 					MemoryContext per_tuple_context)
  {
- 	AfterTriggerShared evtshared = GetTriggerSharedData(event);
- 	Oid			tgoid = evtshared->ats_tgoid;
  	TriggerData LocTriggerData;
  	HeapTupleData tuple1;
  	HeapTupleData tuple2;
  	HeapTuple	rettuple;
  	Buffer		buffer1 = InvalidBuffer;
  	Buffer		buffer2 = InvalidBuffer;
! 	int			tgindx;
  
  	/*
! 	 * Locate trigger in trigdesc.
  	 */
  	LocTriggerData.tg_trigger = NULL;
- 	for (tgindx = 0; tgindx < trigdesc->numtriggers; tgindx++)
- 	{
- 		if (trigdesc->triggers[tgindx].tgoid == tgoid)
- 		{
- 			LocTriggerData.tg_trigger = &(trigdesc->triggers[tgindx]);
- 			break;
- 		}
- 	}
- 	if (LocTriggerData.tg_trigger == NULL)
- 		elog(ERROR, "could not find trigger %u", tgoid);
  
! 	/*
! 	 * If doing EXPLAIN ANALYZE, start charging time to this trigger. We want
! 	 * to include time spent re-fetching tuples in the trigger cost.
! 	 */
! 	if (instr)
! 		InstrStartNode(instr + tgindx);
! 
! 	/*
! 	 * Fetch the required tuple(s).
! 	 */
! 	if (ItemPointerIsValid(&(event->ate_ctid1)))
  	{
! 		ItemPointerCopy(&(event->ate_ctid1), &(tuple1.t_self));
! 		if (!heap_fetch(rel, SnapshotAny, &tuple1, &buffer1, false, NULL))
! 			elog(ERROR, "failed to fetch tuple1 for AFTER trigger");
! 		LocTriggerData.tg_trigtuple = &tuple1;
! 		LocTriggerData.tg_trigtuplebuf = buffer1;
  	}
  	else
  	{
! 		LocTriggerData.tg_trigtuple = NULL;
! 		LocTriggerData.tg_trigtuplebuf = InvalidBuffer;
  	}
  
! 	/* don't touch ctid2 if not there */
! 	if ((event->ate_flags & AFTER_TRIGGER_2CTIDS) &&
! 		ItemPointerIsValid(&(event->ate_ctid2)))
! 	{
! 		ItemPointerCopy(&(event->ate_ctid2), &(tuple2.t_self));
! 		if (!heap_fetch(rel, SnapshotAny, &tuple2, &buffer2, false, NULL))
! 			elog(ERROR, "failed to fetch tuple2 for AFTER trigger");
! 		LocTriggerData.tg_newtuple = &tuple2;
! 		LocTriggerData.tg_newtuplebuf = buffer2;
  	}
! 	else
  	{
! 		LocTriggerData.tg_newtuple = NULL;
! 		LocTriggerData.tg_newtuplebuf = InvalidBuffer;
  	}
  
! 	/*
! 	 * Setup the remaining trigger information
! 	 */
! 	LocTriggerData.type = T_TriggerData;
! 	LocTriggerData.tg_event =
! 		evtshared->ats_event & (TRIGGER_EVENT_OPMASK | TRIGGER_EVENT_ROW);
! 	LocTriggerData.tg_relation = rel;
  
! 	MemoryContextReset(per_tuple_context);
  
! 	/*
! 	 * Call the trigger and throw away any possibly returned updated tuple.
! 	 * (Don't let ExecCallTriggerFunc measure EXPLAIN time.)
  	 */
! 	rettuple = ExecCallTriggerFunc(&LocTriggerData,
! 								   tgindx,
! 								   finfo,
! 								   NULL,
! 								   per_tuple_context);
! 	if (rettuple != NULL && rettuple != &tuple1 && rettuple != &tuple2)
! 		heap_freetuple(rettuple);
  
! 	/*
! 	 * Release buffers
! 	 */
! 	if (buffer1 != InvalidBuffer)
! 		ReleaseBuffer(buffer1);
! 	if (buffer2 != InvalidBuffer)
! 		ReleaseBuffer(buffer2);
  
! 	/*
! 	 * If doing EXPLAIN ANALYZE, stop charging time to this trigger, and count
! 	 * one "tuple returned" (really the number of firings).
! 	 */
! 	if (instr)
! 		InstrStopNode(instr + tgindx, 1);
  }
  
  
--- 2664,2980 ----
  /* ----------
   * AfterTriggerExecute()
   *
!  *	Fetch the required tuples back from the heap and fire one or more
!  *	trigger functions.
   *
   *	Frequently, this will be fired many times in a row for triggers of
   *	a single relation.	Therefore, we cache the open relation and provide
   *	fmgr lookup cache space at the caller level.  (For triggers fired at
   *	the end of a query, we can even piggyback on the executor's state.)
   *
!  *  ts: the tuple set with details of the trigger(s) to be fired.
   *	rel: open relation for event.
   *	trigdesc: working copy of rel's trigger info.
   *	finfo: array of fmgr lookup cache entries (one per trigger in trigdesc).
   *	instr: array of EXPLAIN ANALYZE instrumentation nodes (one per trigger),
   *		or NULL if no instrumentation is wanted.
   *	per_tuple_context: memory context to call trigger function in.
+  *	ctid1: tid of tuple to fetch.
+  *	ctid2: for UPDATE triggers, tid of updated tuple to fetch.
   * ----------
   */
  static void
! AfterTriggerExecute(AfterTriggerTupleSet ts,
  					Relation rel, TriggerDesc *trigdesc,
  					FmgrInfo *finfo, Instrumentation *instr,
! 					MemoryContext per_tuple_context,
! 					ItemPointer ctid1, ItemPointer ctid2)
  {
  	TriggerData LocTriggerData;
  	HeapTupleData tuple1;
  	HeapTupleData tuple2;
  	HeapTuple	rettuple;
  	Buffer		buffer1 = InvalidBuffer;
  	Buffer		buffer2 = InvalidBuffer;
! 	int			ntriggers;
! 	int			*tgindx;
! 	int			i;
  
  	/*
! 	 * Loop over all the relation's triggers working out which one(s) to fire.
  	 */
  	LocTriggerData.tg_trigger = NULL;
  
! 	if (ts->atts_event & TRIGGER_EVENT_ROW)
  	{
! 		ntriggers = trigdesc->n_after_row[ts->atts_event & TRIGGER_EVENT_OPMASK];
! 		tgindx = trigdesc->tg_after_row[ts->atts_event & TRIGGER_EVENT_OPMASK];
  	}
  	else
  	{
! 		ntriggers = trigdesc->n_after_statement[ts->atts_event & TRIGGER_EVENT_OPMASK];
! 		tgindx = trigdesc->tg_after_statement[ts->atts_event & TRIGGER_EVENT_OPMASK];
  	}
  
! 	for (i = 0; i < ntriggers; i++)
! 	{
! 		Trigger    *trigger = &trigdesc->triggers[tgindx[i]];
! 
! 		/* Ignore disabled triggers */
! 		if (SessionReplicationRole == SESSION_REPLICATION_ROLE_REPLICA)
! 		{
! 			if (trigger->tgenabled == TRIGGER_FIRES_ON_ORIGIN ||
! 				trigger->tgenabled == TRIGGER_DISABLED)
! 				continue;
! 		}
! 		else	/* ORIGIN or LOCAL role */
! 		{
! 			if (trigger->tgenabled == TRIGGER_FIRES_ON_REPLICA ||
! 				trigger->tgenabled == TRIGGER_DISABLED)
! 				continue;
! 		}
! 
! 		/* Ignore triggers that aren't in the tuple set */
! 		if (!list_member_oid(ts->atts_tgoids, trigger->tgoid))
! 			continue;
! 
! 		/* Have located a trigger to fire */
! 		LocTriggerData.tg_trigger = trigger;
! 
! 		/*
! 		 * If doing EXPLAIN ANALYZE, start charging time to this trigger. We want
! 		 * to include time spent re-fetching tuples in the trigger cost.
! 		 */
! 		if (instr)
! 			InstrStartNode(instr + tgindx[i]);
! 
! 		/*
! 		 * Fetch the required tuple(s).
! 		 */
! 		if (ItemPointerIsValid(ctid1))
! 		{
! 			ItemPointerCopy(ctid1, &(tuple1.t_self));
! 			if (!heap_fetch(rel, SnapshotAny, &tuple1, &buffer1, false, NULL))
! 				elog(ERROR, "failed to fetch tuple1 for AFTER trigger");
! 			LocTriggerData.tg_trigtuple = &tuple1;
! 			LocTriggerData.tg_trigtuplebuf = buffer1;
! 		}
! 		else
! 		{
! 			LocTriggerData.tg_trigtuple = NULL;
! 			LocTriggerData.tg_trigtuplebuf = InvalidBuffer;
! 		}
! 
! 		if (ItemPointerIsValid(ctid2))
! 		{
! 			ItemPointerCopy(ctid2, &(tuple2.t_self));
! 			if (!heap_fetch(rel, SnapshotAny, &tuple2, &buffer2, false, NULL))
! 				elog(ERROR, "failed to fetch tuple2 for AFTER trigger");
! 			LocTriggerData.tg_newtuple = &tuple2;
! 			LocTriggerData.tg_newtuplebuf = buffer2;
! 		}
! 		else
! 		{
! 			LocTriggerData.tg_newtuple = NULL;
! 			LocTriggerData.tg_newtuplebuf = InvalidBuffer;
! 		}
! 
! 		/*
! 		 * Setup the remaining trigger information
! 		 */
! 		LocTriggerData.type = T_TriggerData;
! 		LocTriggerData.tg_event =
! 			ts->atts_event & (TRIGGER_EVENT_OPMASK | TRIGGER_EVENT_ROW);
! 		LocTriggerData.tg_relation = rel;
! 
! 		MemoryContextReset(per_tuple_context);
! 
! 		/*
! 		 * Call the trigger and throw away any possibly returned updated tuple.
! 		 * (Don't let ExecCallTriggerFunc measure EXPLAIN time.)
! 		 */
! 		rettuple = ExecCallTriggerFunc(&LocTriggerData,
! 									   tgindx[i],
! 									   finfo,
! 									   NULL,
! 									   per_tuple_context);
! 		if (rettuple != NULL && rettuple != &tuple1 && rettuple != &tuple2)
! 			heap_freetuple(rettuple);
! 
! 		/*
! 		 * Release buffers
! 		 */
! 		if (buffer1 != InvalidBuffer)
! 		{
! 			ReleaseBuffer(buffer1);
! 			buffer1 = InvalidBuffer;
! 		}
! 		if (buffer2 != InvalidBuffer)
! 		{
! 			ReleaseBuffer(buffer2);
! 			buffer2 = InvalidBuffer;
! 		}
! 
! 		/*
! 		 * If doing EXPLAIN ANALYZE, stop charging time to this trigger, and count
! 		 * one "tuple returned" (really the number of firings).
! 		 */
! 		if (instr)
! 			InstrStopNode(instr + tgindx[i], 1);
  	}
! 
! 	if (LocTriggerData.tg_trigger == NULL)
! 		elog(ERROR, "could not find trigger %u", linitial_oid(ts->atts_tgoids));
! }
! 
! /* ----------
!  * AfterTriggerTupleSatisfiesTrigger()
!  *
!  *  Test if a tuple satisfies the requirements for the triggers being
!  *  executed. It must have been INSERTed, UPDATEd or DELETEd in the
!  *  current transaction, no earlier than the commandId recorded at the
!  *  start of the query which queued the triggers, and no later than the
!  *  commandId recorded when we started this round of trigger executions.
!  * ----------
!  */
! bool
! AfterTriggerTupleSatisfiesTrigger(void *tg_data, HeapTuple tuple)
! {
! 	AfterTriggerTupleSet ts = (AfterTriggerTupleSet) tg_data;
! 	TransactionId	xmin, xmax;
! 	CommandId		cmin, cmax;
! 
! 	switch (ts->atts_event & TRIGGER_EVENT_OPMASK)
  	{
! 		case TRIGGER_EVENT_INSERT:
! 			/*
! 			 * Check xmin and cmin as described above, and also exclude
! 			 * any tuples created as part of UPDATEs, except if this is
! 			 * a deferred uniqueness re-check trigger, which is always
! 			 * queued as an INSERT (see AfterTriggerAddIndexRecheck).
! 			 */
! 			xmin = HeapTupleHeaderGetXmin(tuple->t_data);
! 			if (!TransactionIdIsCurrentTransactionId(xmin))
! 				return false;
! 
! 			cmin = HeapTupleHeaderGetCmin(tuple->t_data);
! 			if (cmin < ts->atts_query_cmd ||
! 				cmin > ts->atts_firing_cmd)
! 				return false;
! 
! 			if (!(ts->atts_flags & AFTER_TRIGGER_UNIQUE_KEY_RECHECK) &&
! 				(tuple->t_data->t_infomask & HEAP_UPDATED))
! 				return false;
! 
! 			return true;
! 
! 		case TRIGGER_EVENT_DELETE:
! 			/*
! 			 * Check xmax and cmax as described above, and also exclude
! 			 * tuples deleted as part of UPDATEs.
! 			 */
! 			xmax = HeapTupleHeaderGetXmax(tuple->t_data);
! 			if (!TransactionIdIsCurrentTransactionId(xmax))
! 				return false;
! 
! 			cmax = HeapTupleHeaderGetCmax(tuple->t_data);
! 			if (cmax < ts->atts_query_cmd ||
! 				cmax > ts->atts_firing_cmd ||
! 				!ItemPointerEquals(&(tuple->t_self),
! 								   &(tuple->t_data->t_ctid)))
! 				return false;
! 
! 			return true;
! 
! 		case TRIGGER_EVENT_UPDATE:
! 			/*
! 			 * Check xmax and cmax as described above, and also check
! 			 * that the tuple points to a new UPDATEd tuple, so that
! 			 * tuples which were simply DELETEd are excluded.
! 			 */
! 			xmax = HeapTupleHeaderGetXmax(tuple->t_data);
! 			if (!TransactionIdIsCurrentTransactionId(xmax))
! 				return false;
! 
! 			cmax = HeapTupleHeaderGetCmax(tuple->t_data);
! 			if (cmax < ts->atts_query_cmd ||
! 				cmax > ts->atts_firing_cmd ||
! 				ItemPointerEquals(&(tuple->t_self),
! 								  &(tuple->t_data->t_ctid)))
! 				return false;
! 
! 			return true;
  	}
  
! 	elog(ERROR, "invalid after-trigger event code: %d",
! 		 ts->atts_event);
  
! 	return false; /* Keep the compiler happy */
! }
  
! /* ----------
!  * AfterTriggerExecuteTupleSet()
!  *
!  *  Scan a tuple set and find all tuples updated in our transaction,
!  *  in commands within the defined range, and fire the triggers for
!  *  those tuples.
!  * ----------
!  */
! static void
! AfterTriggerExecuteTupleSet(AfterTriggerTupleSet ts, EState *estate,
! 							Relation rel, TriggerDesc *trigdesc,
! 							FmgrInfo *finfo, Instrumentation *instr,
! 							MemoryContext per_tuple_context)
! {
! 	BitmapHeapScanState *planstate;
! 	TupleTableSlot *slot;
! 	HeapTuple		tuple;
! 	ItemPointerData	ctid1, ctid2;
! 
! 	/*
! 	 * Initialise a bitmap heap scan. This uses SnapshotAny so that
! 	 * we can see deleted tuples, and check them ourselves, using
! 	 * AfterTriggerTupleSatisfiesTrigger().
  	 */
! 	planstate = ExecInitTriggerBitmapHeapScan(rel, estate, ts->atts_tbm, ts);
  
! 	for (;;)
! 	{
! 		slot = ExecProcNode((PlanState *)planstate);
! 		if (TupIsNull(slot))
! 			break;
  
! 		/* Fetch the old and new CTIDs, as needed */
! 		tuple = slot->tts_tuple;
! 
! 		switch (ts->atts_event & TRIGGER_EVENT_OPMASK)
! 		{
! 			case TRIGGER_EVENT_INSERT:
! 				ItemPointerCopy(&(tuple->t_self), &ctid1);
! 				ItemPointerSetInvalid(&ctid2);
! 				break;
! 
! 			case TRIGGER_EVENT_DELETE:
! 				ItemPointerCopy(&(tuple->t_self), &ctid1);
! 				ItemPointerSetInvalid(&ctid2);
! 				break;
! 
! 			case TRIGGER_EVENT_UPDATE:
! 				ItemPointerCopy(&(tuple->t_self), &ctid1);
! 				ItemPointerCopy(&(tuple->t_data->t_ctid), &ctid2);
! 				break;
! 
! 			default:
! 				elog(ERROR, "invalid after-trigger event code: %d",
! 					 ts->atts_event);
! 		}
! 
! 		/* Fire the trigger(s) for this tuple */
! 		AfterTriggerExecute(ts, rel, trigdesc, finfo, instr,
! 							per_tuple_context, &ctid1, &ctid2);
! 	}
! 
! 	ExecEndBitmapHeapScan(planstate);
  }
  
  
***************
*** 2848,2854 ****
   * afterTriggerMarkEvents()
   *
   *	Scan the given event list for not yet invoked events.  Mark the ones
!  *	that can be invoked now with the current firing ID.
   *
   *	If move_list isn't NULL, events that are not to be invoked now are
   *	transferred to move_list.
--- 2982,2988 ----
   * afterTriggerMarkEvents()
   *
   *	Scan the given event list for not yet invoked events.  Mark the ones
!  *	that can be invoked now with the current command ID.
   *
   *	If move_list isn't NULL, events that are not to be invoked now are
   *	transferred to move_list.
***************
*** 2863,2885 ****
  					   AfterTriggerEventList *move_list,
  					   bool immediate_only)
  {
! 	bool		found = false;
! 	AfterTriggerEvent event;
! 	AfterTriggerEventChunk *chunk;
  
! 	for_each_event_chunk(event, chunk, *events)
  	{
! 		AfterTriggerShared evtshared = GetTriggerSharedData(event);
! 		bool		defer_it = false;
  
! 		if (!(event->ate_flags &
  			  (AFTER_TRIGGER_DONE | AFTER_TRIGGER_IN_PROGRESS)))
  		{
  			/*
! 			 * This trigger hasn't been called or scheduled yet. Check if we
! 			 * should call it now.
  			 */
! 			if (immediate_only && afterTriggerCheckState(evtshared))
  			{
  				defer_it = true;
  			}
--- 2997,3024 ----
  					   AfterTriggerEventList *move_list,
  					   bool immediate_only)
  {
! 	bool	found = false;
! 	AfterTriggerTupleSet ts;
! 	AfterTriggerTupleSet next_ts;
! 	AfterTriggerTupleSet prev_ts = NULL;
  
! 	for (ts = events->atel_head; ts != NULL; ts = next_ts)
  	{
! 		bool	defer_it = false;
  
! 		next_ts = ts->atts_next;
! 
! 		if (!(ts->atts_flags &
  			  (AFTER_TRIGGER_DONE | AFTER_TRIGGER_IN_PROGRESS)))
  		{
  			/*
! 			 * These triggers haven't been called or scheduled yet. Check
! 			 * if we should call them now.
  			 */
! 			if (immediate_only &&
! 				list_length(ts->atts_tgoids) == 1 &&
! 				afterTriggerCheckState(linitial_oid(ts->atts_tgoids),
! 									   ts->atts_event))
  			{
  				defer_it = true;
  			}
***************
*** 2888,2895 ****
  				/*
  				 * Mark it as to be fired in this firing cycle.
  				 */
! 				evtshared->ats_firing_id = afterTriggers->firing_counter;
! 				event->ate_flags |= AFTER_TRIGGER_IN_PROGRESS;
  				found = true;
  			}
  		}
--- 3027,3034 ----
  				/*
  				 * Mark it as to be fired in this firing cycle.
  				 */
! 				ts->atts_firing_cmd = GetCurrentCommandId(false);
! 				ts->atts_flags |= AFTER_TRIGGER_IN_PROGRESS;
  				found = true;
  			}
  		}
***************
*** 2899,2909 ****
  		 */
  		if (defer_it && move_list != NULL)
  		{
! 			/* add it to move_list */
! 			afterTriggerAddEvent(move_list, event, evtshared);
! 			/* mark original copy "done" so we don't do it again */
! 			event->ate_flags |= AFTER_TRIGGER_DONE;
  		}
  	}
  
  	return found;
--- 3038,3058 ----
  		 */
  		if (defer_it && move_list != NULL)
  		{
! 			/* Remove the tuple set from this list */
! 			if (prev_ts != NULL)
! 				prev_ts->atts_next = next_ts;
! 			if (events->atel_head == ts)
! 				events->atel_head = next_ts;
! 			if (events->atel_tail == ts)
! 				events->atel_tail = prev_ts;
! 
! 			/* ... and add it to move_list (freeing it) */
! 			afterTriggerAddEvent(move_list, ts->atts_relid,
! 								 ts->atts_event, ts->atts_tgoids,
! 								 NULL, ts, ts->atts_flags);
  		}
+ 		else
+ 			prev_ts = ts; /* Previous one, still on this list */
  	}
  
  	return found;
***************
*** 2932,2949 ****
   */
  static bool
  afterTriggerInvokeEvents(AfterTriggerEventList *events,
- 						 CommandId firing_id,
  						 EState *estate,
  						 bool delete_ok)
  {
  	bool		all_fired = true;
- 	AfterTriggerEventChunk *chunk;
  	MemoryContext per_tuple_context;
  	bool		local_estate = false;
  	Relation	rel = NULL;
  	TriggerDesc *trigdesc = NULL;
  	FmgrInfo   *finfo = NULL;
  	Instrumentation *instr = NULL;
  
  	/* Make a local EState if need be */
  	if (estate == NULL)
--- 3081,3097 ----
   */
  static bool
  afterTriggerInvokeEvents(AfterTriggerEventList *events,
  						 EState *estate,
  						 bool delete_ok)
  {
  	bool		all_fired = true;
  	MemoryContext per_tuple_context;
  	bool		local_estate = false;
  	Relation	rel = NULL;
  	TriggerDesc *trigdesc = NULL;
  	FmgrInfo   *finfo = NULL;
  	Instrumentation *instr = NULL;
+ 	AfterTriggerTupleSet ts;
  
  	/* Make a local EState if need be */
  	if (estate == NULL)
***************
*** 2960,3024 ****
  							  ALLOCSET_DEFAULT_INITSIZE,
  							  ALLOCSET_DEFAULT_MAXSIZE);
  
! 	for_each_chunk(chunk, *events)
  	{
! 		AfterTriggerEvent event;
! 		bool		all_fired_in_chunk = true;
! 
! 		for_each_event(event, chunk)
  		{
- 			AfterTriggerShared evtshared = GetTriggerSharedData(event);
- 
  			/*
! 			 * Is it one for me to fire?
  			 */
! 			if ((event->ate_flags & AFTER_TRIGGER_IN_PROGRESS) &&
! 				evtshared->ats_firing_id == firing_id)
  			{
! 				/*
! 				 * So let's fire it... but first, find the correct relation if
! 				 * this is not the same relation as before.
! 				 */
! 				if (rel == NULL || RelationGetRelid(rel) != evtshared->ats_relid)
! 				{
! 					ResultRelInfo *rInfo;
  
! 					rInfo = ExecGetTriggerResultRel(estate, evtshared->ats_relid);
! 					rel = rInfo->ri_RelationDesc;
! 					trigdesc = rInfo->ri_TrigDesc;
! 					finfo = rInfo->ri_TrigFunctions;
! 					instr = rInfo->ri_TrigInstrument;
! 					if (trigdesc == NULL)		/* should not happen */
! 						elog(ERROR, "relation %u has no triggers",
! 							 evtshared->ats_relid);
! 				}
  
! 				/*
! 				 * Fire it.  Note that the AFTER_TRIGGER_IN_PROGRESS flag is
! 				 * still set, so recursive examinations of the event list
! 				 * won't try to re-fire it.
! 				 */
! 				AfterTriggerExecute(event, rel, trigdesc, finfo, instr,
! 									per_tuple_context);
  
! 				/*
! 				 * Mark the event as done.
! 				 */
! 				event->ate_flags &= ~AFTER_TRIGGER_IN_PROGRESS;
! 				event->ate_flags |= AFTER_TRIGGER_DONE;
  			}
! 			else if (!(event->ate_flags & AFTER_TRIGGER_DONE))
  			{
! 				/* something remains to be done */
! 				all_fired = all_fired_in_chunk = false;
  			}
- 		}
  
! 		/* Clear the chunk if delete_ok and nothing left of interest */
! 		if (delete_ok && all_fired_in_chunk)
  		{
! 			chunk->freeptr = CHUNK_DATA_START(chunk);
! 			chunk->endfree = chunk->endptr;
  		}
  	}
  
--- 3108,3178 ----
  							  ALLOCSET_DEFAULT_INITSIZE,
  							  ALLOCSET_DEFAULT_MAXSIZE);
  
! 	for (ts = events->atel_head; ts != NULL; ts = ts->atts_next)
  	{
! 		/*
! 		 * Is it a set for me to fire?
! 		 */
! 		if (ts->atts_flags & AFTER_TRIGGER_IN_PROGRESS)
  		{
  			/*
! 			 * So let's fire them... but first, find the correct relation if
! 			 * this is not the same relation as before.
  			 */
! 			if (rel == NULL || RelationGetRelid(rel) != ts->atts_relid)
  			{
! 				ResultRelInfo *rInfo;
  
! 				rInfo = ExecGetTriggerResultRel(estate, ts->atts_relid);
! 				rel = rInfo->ri_RelationDesc;
! 				trigdesc = rInfo->ri_TrigDesc;
! 				finfo = rInfo->ri_TrigFunctions;
! 				instr = rInfo->ri_TrigInstrument;
! 				if (trigdesc == NULL)		/* should not happen */
! 					elog(ERROR, "relation %u has no triggers",
! 						 ts->atts_relid);
! 			}
  
! 			/*
! 			 * Fire them.  Note that the AFTER_TRIGGER_IN_PROGRESS flag is
! 			 * still set, so recursive examinations of the event list
! 			 * won't try to re-fire them.
! 			 */
! 			if (!(ts->atts_event & TRIGGER_EVENT_ROW))
! 			{
! 				/* Statement trigger(s) to be fired just once */
! 				ItemPointerData ctid1, ctid2;
  
! 				ItemPointerSetInvalid(&ctid1);
! 				ItemPointerSetInvalid(&ctid2);
! 
! 				AfterTriggerExecute(ts, rel, trigdesc, finfo, instr,
! 									per_tuple_context, &ctid1, &ctid2);
  			}
! 			else if (ts->atts_tbm != NULL)
  			{
! 				/* Fire the trigger(s) for all the rows in the tuple set */
! 				AfterTriggerExecuteTupleSet(ts, estate, rel, trigdesc,
! 											finfo, instr,
! 											per_tuple_context);
  			}
  
! 			/*
! 			 * Mark the whole set as done.
! 			 */
! 			ts->atts_flags &= ~AFTER_TRIGGER_IN_PROGRESS;
! 			ts->atts_flags |= AFTER_TRIGGER_DONE;
! 
! 			if (delete_ok && ts->atts_tbm != NULL)
! 			{
! 				tbm_free(ts->atts_tbm);
! 				ts->atts_tbm = NULL;
! 			}
! 		}
! 		else if (!(ts->atts_flags & AFTER_TRIGGER_DONE))
  		{
! 			/* something remains to be done */
! 			all_fired = false;
  		}
  	}
  
***************
*** 3063,3073 ****
  		MemoryContextAlloc(TopTransactionContext,
  						   sizeof(AfterTriggersData));
  
- 	afterTriggers->firing_counter = (CommandId) 1;		/* mustn't be 0 */
  	afterTriggers->state = SetConstraintStateCreate(8);
! 	afterTriggers->events.head = NULL;
! 	afterTriggers->events.tail = NULL;
! 	afterTriggers->events.tailfree = NULL;
  	afterTriggers->query_depth = -1;
  
  	/* We initialize the query stack to a reasonable size */
--- 3217,3226 ----
  		MemoryContextAlloc(TopTransactionContext,
  						   sizeof(AfterTriggersData));
  
  	afterTriggers->state = SetConstraintStateCreate(8);
! 	afterTriggers->events.atel_head = NULL;
! 	afterTriggers->events.atel_tail = NULL;
! 	afterTriggers->events.atel_query_cmd = 0;
  	afterTriggers->query_depth = -1;
  
  	/* We initialize the query stack to a reasonable size */
***************
*** 3083,3089 ****
  	afterTriggers->state_stack = NULL;
  	afterTriggers->events_stack = NULL;
  	afterTriggers->depth_stack = NULL;
! 	afterTriggers->firing_stack = NULL;
  	afterTriggers->maxtransdepth = 0;
  }
  
--- 3236,3242 ----
  	afterTriggers->state_stack = NULL;
  	afterTriggers->events_stack = NULL;
  	afterTriggers->depth_stack = NULL;
! 	afterTriggers->cmd_stack = NULL;
  	afterTriggers->maxtransdepth = 0;
  }
  
***************
*** 3124,3132 ****
  
  	/* Initialize this query's list to empty */
  	events = &afterTriggers->query_stack[afterTriggers->query_depth];
! 	events->head = NULL;
! 	events->tail = NULL;
! 	events->tailfree = NULL;
  }
  
  
--- 3277,3285 ----
  
  	/* Initialize this query's list to empty */
  	events = &afterTriggers->query_stack[afterTriggers->query_depth];
! 	events->atel_head = NULL;
! 	events->atel_tail = NULL;
! 	events->atel_query_cmd = GetCurrentCommandId(false);
  }
  
  
***************
*** 3167,3185 ****
  	 * (is that even possible?).  Be careful here: firing a trigger could
  	 * result in query_stack being repalloc'd, so we can't save its address
  	 * across afterTriggerInvokeEvents calls.
- 	 *
- 	 * If we find no firable events, we don't have to increment
- 	 * firing_counter.
  	 */
  	for (;;)
  	{
  		events = &afterTriggers->query_stack[afterTriggers->query_depth];
  		if (afterTriggerMarkEvents(events, &afterTriggers->events, true))
  		{
- 			CommandId	firing_id = afterTriggers->firing_counter++;
- 
  			/* OK to delete the immediate events after processing them */
! 			if (afterTriggerInvokeEvents(events, firing_id, estate, true))
  				break;			/* all fired */
  		}
  		else
--- 3320,3333 ----
  	 * (is that even possible?).  Be careful here: firing a trigger could
  	 * result in query_stack being repalloc'd, so we can't save its address
  	 * across afterTriggerInvokeEvents calls.
  	 */
  	for (;;)
  	{
  		events = &afterTriggers->query_stack[afterTriggers->query_depth];
  		if (afterTriggerMarkEvents(events, &afterTriggers->events, true))
  		{
  			/* OK to delete the immediate events after processing them */
! 			if (afterTriggerInvokeEvents(events, estate, true))
  				break;			/* all fired */
  		}
  		else
***************
*** 3222,3228 ****
  	 * can't assume ActiveSnapshot is valid on entry.)
  	 */
  	events = &afterTriggers->events;
! 	if (events->head != NULL)
  	{
  		PushActiveSnapshot(GetTransactionSnapshot());
  		snap_pushed = true;
--- 3370,3376 ----
  	 * can't assume ActiveSnapshot is valid on entry.)
  	 */
  	events = &afterTriggers->events;
! 	if (events->atel_head != NULL)
  	{
  		PushActiveSnapshot(GetTransactionSnapshot());
  		snap_pushed = true;
***************
*** 3234,3242 ****
  	 */
  	while (afterTriggerMarkEvents(events, NULL, false))
  	{
! 		CommandId	firing_id = afterTriggers->firing_counter++;
! 
! 		if (afterTriggerInvokeEvents(events, firing_id, NULL, true))
  			break;				/* all fired */
  	}
  
--- 3382,3388 ----
  	 */
  	while (afterTriggerMarkEvents(events, NULL, false))
  	{
! 		if (afterTriggerInvokeEvents(events, NULL, true))
  			break;				/* all fired */
  	}
  
***************
*** 3318,3324 ****
  				palloc(DEFTRIG_INITALLOC * sizeof(AfterTriggerEventList));
  			afterTriggers->depth_stack = (int *)
  				palloc(DEFTRIG_INITALLOC * sizeof(int));
! 			afterTriggers->firing_stack = (CommandId *)
  				palloc(DEFTRIG_INITALLOC * sizeof(CommandId));
  			afterTriggers->maxtransdepth = DEFTRIG_INITALLOC;
  
--- 3464,3470 ----
  				palloc(DEFTRIG_INITALLOC * sizeof(AfterTriggerEventList));
  			afterTriggers->depth_stack = (int *)
  				palloc(DEFTRIG_INITALLOC * sizeof(int));
! 			afterTriggers->cmd_stack = (CommandId *)
  				palloc(DEFTRIG_INITALLOC * sizeof(CommandId));
  			afterTriggers->maxtransdepth = DEFTRIG_INITALLOC;
  
***************
*** 3338,3345 ****
  			afterTriggers->depth_stack = (int *)
  				repalloc(afterTriggers->depth_stack,
  						 new_alloc * sizeof(int));
! 			afterTriggers->firing_stack = (CommandId *)
! 				repalloc(afterTriggers->firing_stack,
  						 new_alloc * sizeof(CommandId));
  			afterTriggers->maxtransdepth = new_alloc;
  		}
--- 3484,3491 ----
  			afterTriggers->depth_stack = (int *)
  				repalloc(afterTriggers->depth_stack,
  						 new_alloc * sizeof(int));
! 			afterTriggers->cmd_stack = (CommandId *)
! 				repalloc(afterTriggers->cmd_stack,
  						 new_alloc * sizeof(CommandId));
  			afterTriggers->maxtransdepth = new_alloc;
  		}
***************
*** 3353,3359 ****
  	afterTriggers->state_stack[my_level] = NULL;
  	afterTriggers->events_stack[my_level] = afterTriggers->events;
  	afterTriggers->depth_stack[my_level] = afterTriggers->query_depth;
! 	afterTriggers->firing_stack[my_level] = afterTriggers->firing_counter;
  }
  
  /*
--- 3499,3505 ----
  	afterTriggers->state_stack[my_level] = NULL;
  	afterTriggers->events_stack[my_level] = afterTriggers->events;
  	afterTriggers->depth_stack[my_level] = afterTriggers->query_depth;
! 	afterTriggers->cmd_stack[my_level] = GetCurrentCommandId(false);
  }
  
  /*
***************
*** 3366,3374 ****
  {
  	int			my_level = GetCurrentTransactionNestLevel();
  	SetConstraintState state;
! 	AfterTriggerEvent event;
! 	AfterTriggerEventChunk *chunk;
! 	CommandId	subxact_firing_id;
  
  	/*
  	 * Ignore call if the transaction is in aborted state.	(Probably
--- 3512,3519 ----
  {
  	int			my_level = GetCurrentTransactionNestLevel();
  	SetConstraintState state;
! 	CommandId	subxact_cmd;
! 	AfterTriggerTupleSet ts;
  
  	/*
  	 * Ignore call if the transaction is in aborted state.	(Probably
***************
*** 3428,3453 ****
  		afterTriggers->state_stack[my_level] = NULL;
  
  		/*
! 		 * Scan for any remaining deferred events that were marked DONE or IN
! 		 * PROGRESS by this subxact or a child, and un-mark them. We can
! 		 * recognize such events because they have a firing ID greater than or
! 		 * equal to the firing_counter value we saved at subtransaction start.
  		 * (This essentially assumes that the current subxact includes all
  		 * subxacts started after it.)
  		 */
! 		subxact_firing_id = afterTriggers->firing_stack[my_level];
! 		for_each_event_chunk(event, chunk, afterTriggers->events)
! 		{
! 			AfterTriggerShared evtshared = GetTriggerSharedData(event);
! 
! 			if (event->ate_flags &
! 				(AFTER_TRIGGER_DONE | AFTER_TRIGGER_IN_PROGRESS))
! 			{
! 				if (evtshared->ats_firing_id >= subxact_firing_id)
! 					event->ate_flags &=
! 						~(AFTER_TRIGGER_DONE | AFTER_TRIGGER_IN_PROGRESS);
! 			}
! 		}
  	}
  }
  
--- 3573,3590 ----
  		afterTriggers->state_stack[my_level] = NULL;
  
  		/*
! 		 * Scan for any deferred triggers marked as fired by this subxact
! 		 * or a child and un-mark them. This is any trigger whose firing_cmd
! 		 * is greater than or equal to the cmd saved at subtransaction start.
  		 * (This essentially assumes that the current subxact includes all
  		 * subxacts started after it.)
  		 */
! 		subxact_cmd = afterTriggers->cmd_stack[my_level];
! 		for (ts = afterTriggers->events.atel_head;
! 			 ts != NULL; ts = ts->atts_next)
! 			if (ts->atts_firing_cmd >= subxact_cmd)
! 				ts->atts_flags &=
! 					~(AFTER_TRIGGER_DONE | AFTER_TRIGGER_IN_PROGRESS);
  	}
  }
  
***************
*** 3765,3772 ****
  
  		while (afterTriggerMarkEvents(events, NULL, true))
  		{
- 			CommandId	firing_id = afterTriggers->firing_counter++;
- 
  			/*
  			 * Make sure a snapshot has been established in case trigger
  			 * functions need one.	Note that we avoid setting a snapshot if
--- 3902,3907 ----
***************
*** 3787,3793 ****
  			 * but we'd better not if inside a subtransaction, since the
  			 * subtransaction could later get rolled back.
  			 */
! 			if (afterTriggerInvokeEvents(events, firing_id, NULL,
  										 !IsSubTransaction()))
  				break;			/* all fired */
  		}
--- 3922,3928 ----
  			 * but we'd better not if inside a subtransaction, since the
  			 * subtransaction could later get rolled back.
  			 */
! 			if (afterTriggerInvokeEvents(events, NULL,
  										 !IsSubTransaction()))
  				break;			/* all fired */
  		}
***************
*** 3815,3822 ****
  bool
  AfterTriggerPendingOnRel(Oid relid)
  {
! 	AfterTriggerEvent event;
! 	AfterTriggerEventChunk *chunk;
  	int			depth;
  
  	/* No-op if we aren't in a transaction.  (Shouldn't happen?) */
--- 3950,3956 ----
  bool
  AfterTriggerPendingOnRel(Oid relid)
  {
! 	AfterTriggerTupleSet ts;
  	int			depth;
  
  	/* No-op if we aren't in a transaction.  (Shouldn't happen?) */
***************
*** 3824,3842 ****
  		return false;
  
  	/* Scan queued events */
! 	for_each_event_chunk(event, chunk, afterTriggers->events)
  	{
- 		AfterTriggerShared evtshared = GetTriggerSharedData(event);
- 
  		/*
  		 * We can ignore completed events.	(Even if a DONE flag is rolled
  		 * back by subxact abort, it's OK because the effects of the TRUNCATE
  		 * or whatever must get rolled back too.)
  		 */
! 		if (event->ate_flags & AFTER_TRIGGER_DONE)
  			continue;
  
! 		if (evtshared->ats_relid == relid)
  			return true;
  	}
  
--- 3958,3975 ----
  		return false;
  
  	/* Scan queued events */
! 	for (ts = afterTriggers->events.atel_head;
! 		 ts != NULL; ts = ts->atts_next)
  	{
  		/*
  		 * We can ignore completed events.	(Even if a DONE flag is rolled
  		 * back by subxact abort, it's OK because the effects of the TRUNCATE
  		 * or whatever must get rolled back too.)
  		 */
! 		if (ts->atts_flags & AFTER_TRIGGER_DONE)
  			continue;
  
! 		if (ts->atts_relid == relid)
  			return true;
  	}
  
***************
*** 3847,3860 ****
  	 */
  	for (depth = 0; depth <= afterTriggers->query_depth; depth++)
  	{
! 		for_each_event_chunk(event, chunk, afterTriggers->query_stack[depth])
  		{
! 			AfterTriggerShared evtshared = GetTriggerSharedData(event);
! 
! 			if (event->ate_flags & AFTER_TRIGGER_DONE)
  				continue;
  
! 			if (evtshared->ats_relid == relid)
  				return true;
  		}
  	}
--- 3980,3992 ----
  	 */
  	for (depth = 0; depth <= afterTriggers->query_depth; depth++)
  	{
! 		for (ts = afterTriggers->query_stack[depth].atel_head;
! 			 ts != NULL; ts = ts->atts_next)
  		{
! 			if (ts->atts_flags & AFTER_TRIGGER_DONE)
  				continue;
  
! 			if (ts->atts_relid == relid)
  				return true;
  		}
  	}
***************
*** 3877,3891 ****
  static void
  AfterTriggerSaveEvent(ResultRelInfo *relinfo, int event, bool row_trigger,
  					  HeapTuple oldtup, HeapTuple newtup,
! 					  List *recheckIndexes, Bitmapset *modifiedCols)
  {
  	Relation	rel = relinfo->ri_RelationDesc;
  	TriggerDesc *trigdesc = relinfo->ri_TrigDesc;
- 	AfterTriggerEventData new_event;
- 	AfterTriggerSharedData new_shared;
  	int			i;
  	int			ntriggers;
  	int		   *tgindx;
  
  	if (afterTriggers == NULL)
  		elog(ERROR, "AfterTriggerSaveEvent() called outside of transaction");
--- 4009,4023 ----
  static void
  AfterTriggerSaveEvent(ResultRelInfo *relinfo, int event, bool row_trigger,
  					  HeapTuple oldtup, HeapTuple newtup,
! 					  Bitmapset *modifiedCols)
  {
  	Relation	rel = relinfo->ri_RelationDesc;
  	TriggerDesc *trigdesc = relinfo->ri_TrigDesc;
  	int			i;
  	int			ntriggers;
  	int		   *tgindx;
+ 	ItemPointerData	ctid;
+ 	List	   *immediate_tgoids = NIL;
  
  	if (afterTriggers == NULL)
  		elog(ERROR, "AfterTriggerSaveEvent() called outside of transaction");
***************
*** 3898,3904 ****
  	 * validation is important to make sure we don't walk off the edge of our
  	 * arrays.
  	 */
- 	new_event.ate_flags = 0;
  	switch (event)
  	{
  		case TRIGGER_EVENT_INSERT:
--- 4030,4035 ----
***************
*** 3906,3920 ****
  			{
  				Assert(oldtup == NULL);
  				Assert(newtup != NULL);
! 				ItemPointerCopy(&(newtup->t_self), &(new_event.ate_ctid1));
! 				ItemPointerSetInvalid(&(new_event.ate_ctid2));
  			}
  			else
  			{
  				Assert(oldtup == NULL);
  				Assert(newtup == NULL);
! 				ItemPointerSetInvalid(&(new_event.ate_ctid1));
! 				ItemPointerSetInvalid(&(new_event.ate_ctid2));
  			}
  			break;
  		case TRIGGER_EVENT_DELETE:
--- 4037,4049 ----
  			{
  				Assert(oldtup == NULL);
  				Assert(newtup != NULL);
! 				ItemPointerCopy(&(newtup->t_self), &ctid);
  			}
  			else
  			{
  				Assert(oldtup == NULL);
  				Assert(newtup == NULL);
! 				ItemPointerSetInvalid(&ctid);
  			}
  			break;
  		case TRIGGER_EVENT_DELETE:
***************
*** 3922,3936 ****
  			{
  				Assert(oldtup != NULL);
  				Assert(newtup == NULL);
! 				ItemPointerCopy(&(oldtup->t_self), &(new_event.ate_ctid1));
! 				ItemPointerSetInvalid(&(new_event.ate_ctid2));
  			}
  			else
  			{
  				Assert(oldtup == NULL);
  				Assert(newtup == NULL);
! 				ItemPointerSetInvalid(&(new_event.ate_ctid1));
! 				ItemPointerSetInvalid(&(new_event.ate_ctid2));
  			}
  			break;
  		case TRIGGER_EVENT_UPDATE:
--- 4051,4063 ----
  			{
  				Assert(oldtup != NULL);
  				Assert(newtup == NULL);
! 				ItemPointerCopy(&(oldtup->t_self), &ctid);
  			}
  			else
  			{
  				Assert(oldtup == NULL);
  				Assert(newtup == NULL);
! 				ItemPointerSetInvalid(&ctid);
  			}
  			break;
  		case TRIGGER_EVENT_UPDATE:
***************
*** 3938,3960 ****
  			{
  				Assert(oldtup != NULL);
  				Assert(newtup != NULL);
! 				ItemPointerCopy(&(oldtup->t_self), &(new_event.ate_ctid1));
! 				ItemPointerCopy(&(newtup->t_self), &(new_event.ate_ctid2));
! 				new_event.ate_flags |= AFTER_TRIGGER_2CTIDS;
  			}
  			else
  			{
  				Assert(oldtup == NULL);
  				Assert(newtup == NULL);
! 				ItemPointerSetInvalid(&(new_event.ate_ctid1));
! 				ItemPointerSetInvalid(&(new_event.ate_ctid2));
  			}
  			break;
  		case TRIGGER_EVENT_TRUNCATE:
  			Assert(oldtup == NULL);
  			Assert(newtup == NULL);
! 			ItemPointerSetInvalid(&(new_event.ate_ctid1));
! 			ItemPointerSetInvalid(&(new_event.ate_ctid2));
  			break;
  		default:
  			elog(ERROR, "invalid after-trigger event code: %d", event);
--- 4065,4083 ----
  			{
  				Assert(oldtup != NULL);
  				Assert(newtup != NULL);
! 				ItemPointerCopy(&(oldtup->t_self), &ctid);
  			}
  			else
  			{
  				Assert(oldtup == NULL);
  				Assert(newtup == NULL);
! 				ItemPointerSetInvalid(&ctid);
  			}
  			break;
  		case TRIGGER_EVENT_TRUNCATE:
  			Assert(oldtup == NULL);
  			Assert(newtup == NULL);
! 			ItemPointerSetInvalid(&ctid);
  			break;
  		default:
  			elog(ERROR, "invalid after-trigger event code: %d", event);
***************
*** 3978,3983 ****
--- 4101,4107 ----
  	for (i = 0; i < ntriggers; i++)
  	{
  		Trigger    *trigger = &trigdesc->triggers[tgindx[i]];
+ 		bool		immediate_all_rows = !trigger->tgdeferrable;
  
  		if (!TriggerEnabled(trigger, event, modifiedCols))
  			continue;
***************
*** 3998,4003 ****
--- 4122,4128 ----
  						/* key unchanged, so skip queuing this event */
  						continue;
  					}
+ 					immediate_all_rows = false;
  					break;
  
  				case RI_TRIGGER_FK:
***************
*** 4018,4023 ****
--- 4143,4149 ----
  					{
  						continue;
  					}
+ 					immediate_all_rows = false;
  					break;
  
  				case RI_TRIGGER_NONE:
***************
*** 4028,4055 ****
  
  		/*
  		 * If the trigger is a deferred unique constraint check trigger,
! 		 * only queue it if the unique constraint was potentially violated,
! 		 * which we know from index insertion time.
  		 */
  		if (trigger->tgfoid == F_UNIQUE_KEY_RECHECK)
! 		{
! 			if (!list_member_oid(recheckIndexes, trigger->tgconstrindid))
! 				continue;		/* Uniqueness definitely not violated */
! 		}
  
  		/*
! 		 * Fill in event structure and add it to the current query's queue.
  		 */
! 		new_shared.ats_event =
! 			(event & TRIGGER_EVENT_OPMASK) |
! 			(row_trigger ? TRIGGER_EVENT_ROW : 0) |
! 			(trigger->tgdeferrable ? AFTER_TRIGGER_DEFERRABLE : 0) |
! 			(trigger->tginitdeferred ? AFTER_TRIGGER_INITDEFERRED : 0);
! 		new_shared.ats_tgoid = trigger->tgoid;
! 		new_shared.ats_relid = RelationGetRelid(rel);
! 		new_shared.ats_firing_id = 0;
  
  		afterTriggerAddEvent(&afterTriggers->query_stack[afterTriggers->query_depth],
! 							 &new_event, &new_shared);
  	}
  }
--- 4154,4266 ----
  
  		/*
  		 * If the trigger is a deferred unique constraint check trigger,
! 		 * don't queue it here (AfterTriggerAddIndexRecheck should have
! 		 * queued it, if necessary).
  		 */
  		if (trigger->tgfoid == F_UNIQUE_KEY_RECHECK)
! 			continue;
  
  		/*
! 		 * Queue the trigger. All non-deferrable triggers, except for the
! 		 * FK triggers above, which don't necessarily fire for all updated
! 		 * rows, are grouped together in a single tuple set, since they
! 		 * are always fired at the same time, and for the same rows.
  		 */
! 		event = (event & TRIGGER_EVENT_OPMASK) |
! 				(row_trigger ? TRIGGER_EVENT_ROW : 0) |
! 				(trigger->tgdeferrable ? AFTER_TRIGGER_DEFERRABLE : 0) |
! 				(trigger->tginitdeferred ? AFTER_TRIGGER_INITDEFERRED : 0);
! 
! 		if (immediate_all_rows)
! 		{
! 			immediate_tgoids = lappend_oid(immediate_tgoids, trigger->tgoid);
! 		}
! 		else
! 		{
! 			List *tgoids = list_make1_oid(trigger->tgoid);
  
+ 			afterTriggerAddEvent(&afterTriggers->query_stack[afterTriggers->query_depth],
+ 								 RelationGetRelid(rel), event, tgoids, &ctid, NULL, 0);
+ 			list_free(tgoids);
+ 		}
+ 	}
+ 
+ 	/*
+ 	 * Queue the immediate triggers which fire for all rows using a single
+ 	 * tuple set.
+ 	 */
+ 	if (immediate_tgoids != NIL)
+ 	{
  		afterTriggerAddEvent(&afterTriggers->query_stack[afterTriggers->query_depth],
! 							 RelationGetRelid(rel), event, immediate_tgoids,
! 							 &ctid, NULL, 0);
! 		list_free(immediate_tgoids);
! 	}
! }
! 
! 
! /* ----------
!  * AfterTriggerAddIndexRecheck()
!  *
!  *	Queue up the trigger to re-check a deferrable unique index which was
!  *	found to potentially violate the uniqueness check.
!  *
!  *	NOTE: this will always queue up the INSERT trigger, even if the trigger
!  *	event was actually an UPDATE. This allows all tuples awaiting
!  *	re-check to be stored in a single bitmap, and avoids fetching the
!  *	old tuple, which the re-check does not need.
!  * ----------
!  */
! void
! AfterTriggerAddIndexRecheck(ResultRelInfo *relinfo,
! 							Relation indexRelation,
! 							ItemPointer tupleid)
! {
! 	Relation	rel = relinfo->ri_RelationDesc;
! 	TriggerDesc *trigdesc = relinfo->ri_TrigDesc;
! 	Trigger	   *trigger = NULL;
! 	bool		found = false;
! 	int			ntriggers;
! 	int		   *tgindx;
! 	int			i;
! 	TriggerEvent event;
! 	List	   *tgoids;
! 
! 	if (afterTriggers == NULL)
! 		elog(ERROR, "AfterTriggerAddIndexRecheck() called outside of transaction");
! 	Assert(afterTriggers->query_depth >= 0);
! 
! 	/* Find the AFTER INSERT row trigger that re-checks this index */
! 	ntriggers = trigdesc->n_after_row[TRIGGER_EVENT_INSERT];
! 	tgindx = trigdesc->tg_after_row[TRIGGER_EVENT_INSERT];
! 
! 	for (i = 0; i < ntriggers; i++)
! 	{
! 		trigger = &trigdesc->triggers[tgindx[i]];
! 
! 		if (trigger->tgfoid == F_UNIQUE_KEY_RECHECK &&
! 			trigger->tgconstrindid == RelationGetRelid(indexRelation))
! 		{
! 			found = true;
! 			break;
! 		}
  	}
+ 
+ 	if (!found)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_UNDEFINED_OBJECT),
+ 				 errmsg("uniqueness re-checking trigger for index \"%s\" does not exist",
+ 						RelationGetRelationName(indexRelation))));
+ 
+ 	/* Queue up the trigger event as an INSERT */
+ 	event = TRIGGER_EVENT_INSERT |
+ 			TRIGGER_EVENT_ROW |
+ 			(trigger->tgdeferrable ? AFTER_TRIGGER_DEFERRABLE : 0) |
+ 			(trigger->tginitdeferred ? AFTER_TRIGGER_INITDEFERRED : 0);
+ 
+ 	tgoids = list_make1_oid(trigger->tgoid);
+ 	afterTriggerAddEvent(&afterTriggers->query_stack[afterTriggers->query_depth],
+ 						 RelationGetRelid(rel), event, tgoids, tupleid,
+ 						 NULL, AFTER_TRIGGER_UNIQUE_KEY_RECHECK);
+ 	list_free(tgoids);
  }
*** ./src/backend/executor/execUtils.c.orig	2009-10-15 09:23:23.000000000 +0100
--- ./src/backend/executor/execUtils.c	2009-10-19 10:24:20.000000000 +0100
***************
*** 45,50 ****
--- 45,51 ----
  #include "access/genam.h"
  #include "access/heapam.h"
  #include "catalog/index.h"
+ #include "commands/trigger.h"
  #include "executor/execdebug.h"
  #include "nodes/nodeFuncs.h"
  #include "parser/parsetree.h"
***************
*** 958,979 ****
   *		doesn't provide the functionality needed by the
   *		executor.. -cim 9/27/89
   *
-  *		This returns a list of OIDs for any unique indexes
-  *		whose constraint check was deferred and which had
-  *		potential (unconfirmed) conflicts.
-  *
   *		CAUTION: this must not be called for a HOT update.
   *		We can't defend against that here for lack of info.
   *		Should we change the API to make it safer?
   * ----------------------------------------------------------------
   */
! List *
  ExecInsertIndexTuples(TupleTableSlot *slot,
  					  ItemPointer tupleid,
  					  EState *estate,
  					  bool is_vacuum_full)
  {
- 	List	   *result = NIL;
  	ResultRelInfo *resultRelInfo;
  	int			i;
  	int			numIndices;
--- 959,975 ----
   *		doesn't provide the functionality needed by the
   *		executor.. -cim 9/27/89
   *
   *		CAUTION: this must not be called for a HOT update.
   *		We can't defend against that here for lack of info.
   *		Should we change the API to make it safer?
   * ----------------------------------------------------------------
   */
! void
  ExecInsertIndexTuples(TupleTableSlot *slot,
  					  ItemPointer tupleid,
  					  EState *estate,
  					  bool is_vacuum_full)
  {
  	ResultRelInfo *resultRelInfo;
  	int			i;
  	int			numIndices;
***************
*** 1087,1099 ****
  		{
  			/*
  			 * The tuple potentially violates the uniqueness constraint,
! 			 * so make a note of the index so that we can re-check it later.
  			 */
! 			result = lappend_oid(result, RelationGetRelid(indexRelation));
  		}
  	}
- 
- 	return result;
  }
  
  /*
--- 1083,1093 ----
  		{
  			/*
  			 * The tuple potentially violates the uniqueness constraint,
! 			 * so queue up the trigger to re-check the index later.
  			 */
! 			AfterTriggerAddIndexRecheck(resultRelInfo, indexRelation, tupleid);
  		}
  	}
  }
  
  /*
*** ./src/backend/executor/nodeBitmapHeapscan.c.orig	2009-10-07 08:41:50.000000000 +0100
--- ./src/backend/executor/nodeBitmapHeapscan.c	2009-10-19 10:24:41.000000000 +0100
***************
*** 38,43 ****
--- 38,44 ----
  #include "access/heapam.h"
  #include "access/relscan.h"
  #include "access/transam.h"
+ #include "commands/trigger.h"
  #include "executor/execdebug.h"
  #include "executor/nodeBitmapHeapscan.h"
  #include "pgstat.h"
***************
*** 48,54 ****
  
  
  static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
! static void bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres);
  
  
  /* ----------------------------------------------------------------
--- 49,56 ----
  
  
  static TupleTableSlot *BitmapHeapNext(BitmapHeapScanState *node);
! static void bitgetpage(BitmapHeapScanState *node,
! 		   HeapScanDesc scan, TBMIterateResult *tbmres);
  
  
  /* ----------------------------------------------------------------
***************
*** 195,201 ****
  			/*
  			 * Fetch the current heap page and identify candidate tuples.
  			 */
! 			bitgetpage(scan, tbmres);
  
  			/*
  			 * Set rs_cindex to first slot to examine
--- 197,203 ----
  			/*
  			 * Fetch the current heap page and identify candidate tuples.
  			 */
! 			bitgetpage(node, scan, tbmres);
  
  			/*
  			 * Set rs_cindex to first slot to examine
***************
*** 333,339 ****
   * interesting according to the bitmap, and visible according to the snapshot.
   */
  static void
! bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres)
  {
  	BlockNumber page = tbmres->blockno;
  	Buffer		buffer;
--- 335,342 ----
   * interesting according to the bitmap, and visible according to the snapshot.
   */
  static void
! bitgetpage(BitmapHeapScanState *node,
! 		   HeapScanDesc scan, TBMIterateResult *tbmres)
  {
  	BlockNumber page = tbmres->blockno;
  	Buffer		buffer;
***************
*** 376,391 ****
  		 * tbmres; but we have to follow any HOT chain starting at each such
  		 * offset.
  		 */
  		int			curslot;
  
  		for (curslot = 0; curslot < tbmres->ntuples; curslot++)
  		{
  			OffsetNumber offnum = tbmres->offsets[curslot];
  			ItemPointerData tid;
  
! 			ItemPointerSet(&tid, page, offnum);
! 			if (heap_hot_search_buffer(&tid, buffer, snapshot, NULL))
! 				scan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
  		}
  	}
  	else
--- 379,412 ----
  		 * tbmres; but we have to follow any HOT chain starting at each such
  		 * offset.
  		 */
+ 		Page		dp = (Page) BufferGetPage(buffer);
  		int			curslot;
  
  		for (curslot = 0; curslot < tbmres->ntuples; curslot++)
  		{
  			OffsetNumber offnum = tbmres->offsets[curslot];
  			ItemPointerData tid;
+ 			ItemId		lp;
+ 			HeapTupleData loctup;
+ 
+ 			if (node->is_tg_scan)
+ 			{
+ 				lp = PageGetItemId(dp, offnum);
+ 				if (!ItemIdIsNormal(lp))
+ 					continue;
+ 				loctup.t_data = (HeapTupleHeader) PageGetItem((Page) dp, lp);
+ 				loctup.t_len = ItemIdGetLength(lp);
+ 				ItemPointerSet(&(loctup.t_self), page, offnum);
  
! 				if (AfterTriggerTupleSatisfiesTrigger(node->tg_data, &loctup))
! 					scan->rs_vistuples[ntup++] = offnum;
! 			}
! 			else
! 			{
! 				ItemPointerSet(&tid, page, offnum);
! 				if (heap_hot_search_buffer(&tid, buffer, snapshot, NULL))
! 					scan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
! 			}
  		}
  	}
  	else
***************
*** 408,414 ****
  				continue;
  			loctup.t_data = (HeapTupleHeader) PageGetItem((Page) dp, lp);
  			loctup.t_len = ItemIdGetLength(lp);
! 			if (HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer))
  				scan->rs_vistuples[ntup++] = offnum;
  		}
  	}
--- 429,442 ----
  				continue;
  			loctup.t_data = (HeapTupleHeader) PageGetItem((Page) dp, lp);
  			loctup.t_len = ItemIdGetLength(lp);
! 
! 			if (node->is_tg_scan)
! 			{
! 				ItemPointerSet(&(loctup.t_self), page, offnum);
! 				if (AfterTriggerTupleSatisfiesTrigger(node->tg_data, &loctup))
! 					scan->rs_vistuples[ntup++] = offnum;
! 			}
! 			else if (HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer))
  				scan->rs_vistuples[ntup++] = offnum;
  		}
  	}
***************
*** 520,532 ****
  	ExecEndNode(outerPlanState(node));
  
  	/*
! 	 * release bitmap if any
  	 */
  	if (node->tbmiterator)
  		tbm_end_iterate(node->tbmiterator);
  	if (node->prefetch_iterator)
  		tbm_end_iterate(node->prefetch_iterator);
! 	if (node->tbm)
  		tbm_free(node->tbm);
  
  	/*
--- 548,561 ----
  	ExecEndNode(outerPlanState(node));
  
  	/*
! 	 * release bitmap if any, except if it is the trigger queue's
! 	 * bitmap, which is owned by the trigger code
  	 */
  	if (node->tbmiterator)
  		tbm_end_iterate(node->tbmiterator);
  	if (node->prefetch_iterator)
  		tbm_end_iterate(node->prefetch_iterator);
! 	if (node->tbm && !node->is_tg_scan)
  		tbm_free(node->tbm);
  
  	/*
***************
*** 536,543 ****
  
  	/*
  	 * close the heap relation.
  	 */
! 	ExecCloseScanRelation(relation);
  }
  
  /* ----------------------------------------------------------------
--- 565,586 ----
  
  	/*
  	 * close the heap relation.
+ 	 *
+ 	 * We skip this for a trigger queue scan, created with
+ 	 * ExecInitTriggerBitmapHeapScan(), since in that case the relation
+ 	 * was passed in to us.
  	 */
! 	if (!node->is_tg_scan)
! 		ExecCloseScanRelation(relation);
! 
! 	/*
! 	 * For a trigger queue scan, release the tuple slots we created
! 	 */
! 	if (node->is_tg_scan)
! 	{
! 		ExecDropSingleTupleTableSlot(node->ss.ps.ps_ResultTupleSlot);
! 		ExecDropSingleTupleTableSlot(node->ss.ss_ScanTupleSlot);
! 	}
  }
  
  /* ----------------------------------------------------------------
***************
*** 574,579 ****
--- 617,623 ----
  	scanstate->prefetch_iterator = NULL;
  	scanstate->prefetch_pages = 0;
  	scanstate->prefetch_target = 0;
+ 	scanstate->is_tg_scan = false;
  
  	/*
  	 * Miscellaneous initialization
***************
*** 644,646 ****
--- 688,777 ----
  	 */
  	return scanstate;
  }
+ 
+ /* ----------------------------------------------------------------
+  *		ExecInitTriggerBitmapHeapScan
+  *
+  *		Cut-down version of ExecInitBitmapHeapScan, tailored
+  *		for trigger execution. This operates with SnapshotAny,
+  *		checking all tuples, dead or alive, using
+  *		AfterTriggerTupleSatisfiesTrigger().
+  * ----------------------------------------------------------------
+  */
+ BitmapHeapScanState *
+ ExecInitTriggerBitmapHeapScan(Relation rel, EState *estate,
+ 							  TIDBitmap *tbm, void *tg_data)
+ {
+ 	BitmapHeapScan *node;
+ 	BitmapHeapScanState *scanstate;
+ 
+ 	/* Assert we are not evaluating PlanQual */
+ 	Assert(estate->es_evTuple == NULL);
+ 
+ 	/*
+ 	 * Create a plan node and the state structure
+ 	 */
+ 	node = makeNode(BitmapHeapScan);
+ 	scanstate = makeNode(BitmapHeapScanState);
+ 	scanstate->ss.ps.plan = (Plan *) node;
+ 	scanstate->ss.ps.state = estate;
+ 
+ 	scanstate->tbm = tbm;
+ 	scanstate->tbmiterator = tbm_begin_iterate(tbm);
+ 	scanstate->tbmres = NULL;
+ 	scanstate->prefetch_iterator = NULL;
+ 	scanstate->prefetch_pages = 0;
+ 	scanstate->prefetch_target = 0;
+ 	scanstate->is_tg_scan = true;
+ 	scanstate->tg_data = tg_data;
+ 
+ 	/*
+ 	 * Create an expression context for the node
+ 	 */
+ 	ExecAssignExprContext(estate, &scanstate->ss.ps);
+ 
+ 	scanstate->ss.ps.ps_TupFromTlist = false;
+ 
+ 	/*
+ 	 * No child expressions
+ 	 */
+ 	scanstate->ss.ps.targetlist = NIL;
+ 	scanstate->ss.ps.qual = NIL;
+ 	scanstate->bitmapqualorig = NIL;
+ 
+ 	/*
+ 	 * Tuples for results
+ 	 */
+ 	scanstate->ss.ps.ps_ResultTupleSlot =
+ 		MakeSingleTupleTableSlot(RelationGetDescr(rel));
+ 	scanstate->ss.ss_ScanTupleSlot =
+ 		MakeSingleTupleTableSlot(RelationGetDescr(rel));
+ 
+ 	/*
+ 	 * Even though we aren't going to do a conventional seqscan, it is useful
+ 	 * to create a HeapScanDesc --- most of the fields in it are usable.
+ 	 *
+ 	 * We use SnapshotAny to return all tuples. The trigger code will pick
+ 	 * the ones it is interested in.
+ 	 */
+ 	scanstate->ss.ss_currentRelation = rel;
+ 	scanstate->ss.ss_currentScanDesc = heap_beginscan_bm(rel,
+ 														 SnapshotAny,
+ 														 0,
+ 														 NULL);
+ 
+ 	/*
+ 	 * Get the scan type from the relation descriptor.
+ 	 */
+ 	ExecAssignScanType(&scanstate->ss, RelationGetDescr(rel));
+ 
+ 	/*
+ 	 * Initialize result tuple type but no projection info.
+ 	 */
+ 	ExecAssignResultTypeFromTL(&scanstate->ss.ps);
+ 
+ 	/* No child nodes */
+ 	outerPlanState(scanstate) = NULL;
+ 
+ 	return scanstate;
+ }
*** ./src/backend/executor/nodeModifyTable.c.orig	2009-10-15 09:39:46.000000000 +0100
--- ./src/backend/executor/nodeModifyTable.c	2009-10-15 09:45:18.000000000 +0100
***************
*** 166,172 ****
  	ResultRelInfo *resultRelInfo;
  	Relation	resultRelationDesc;
  	Oid			newId;
- 	List	   *recheckIndexes = NIL;
  
  	/*
  	 * get the heap tuple out of the tuple table slot, making sure we have a
--- 166,171 ----
***************
*** 247,257 ****
  	 * insert index entries for tuple
  	 */
  	if (resultRelInfo->ri_NumIndices > 0)
! 		recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
! 											   estate, false);
  
  	/* AFTER ROW INSERT Triggers */
! 	ExecARInsertTriggers(estate, resultRelInfo, tuple, recheckIndexes);
  
  	/* Process RETURNING if present */
  	if (resultRelInfo->ri_projectReturning)
--- 246,255 ----
  	 * insert index entries for tuple
  	 */
  	if (resultRelInfo->ri_NumIndices > 0)
! 		ExecInsertIndexTuples(slot, &(tuple->t_self), estate, false);
  
  	/* AFTER ROW INSERT Triggers */
! 	ExecARInsertTriggers(estate, resultRelInfo, tuple);
  
  	/* Process RETURNING if present */
  	if (resultRelInfo->ri_projectReturning)
***************
*** 425,431 ****
  	HTSU_Result result;
  	ItemPointerData update_ctid;
  	TransactionId update_xmax;
- 	List	   *recheckIndexes = NIL;
  
  	/*
  	 * abort the operation if not running transactions
--- 423,428 ----
***************
*** 559,570 ****
  	 * If it's a HOT update, we mustn't insert new index entries.
  	 */
  	if (resultRelInfo->ri_NumIndices > 0 && !HeapTupleIsHeapOnly(tuple))
! 		recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
! 											   estate, false);
  
  	/* AFTER ROW UPDATE Triggers */
! 	ExecARUpdateTriggers(estate, resultRelInfo, tupleid, tuple,
! 						 recheckIndexes);
  
  	/* Process RETURNING if present */
  	if (resultRelInfo->ri_projectReturning)
--- 556,565 ----
  	 * If it's a HOT update, we mustn't insert new index entries.
  	 */
  	if (resultRelInfo->ri_NumIndices > 0 && !HeapTupleIsHeapOnly(tuple))
! 		ExecInsertIndexTuples(slot, &(tuple->t_self), estate, false);
  
  	/* AFTER ROW UPDATE Triggers */
! 	ExecARUpdateTriggers(estate, resultRelInfo, tupleid, tuple);
  
  	/* Process RETURNING if present */
  	if (resultRelInfo->ri_projectReturning)
*** ./src/include/commands/trigger.h.orig	2009-10-15 09:16:34.000000000 +0100
--- ./src/include/commands/trigger.h	2009-10-15 09:18:57.000000000 +0100
***************
*** 132,139 ****
  					 HeapTuple trigtuple);
  extern void ExecARInsertTriggers(EState *estate,
  					 ResultRelInfo *relinfo,
! 					 HeapTuple trigtuple,
! 					 List *recheckIndexes);
  extern void ExecBSDeleteTriggers(EState *estate,
  					 ResultRelInfo *relinfo);
  extern void ExecASDeleteTriggers(EState *estate,
--- 132,138 ----
  					 HeapTuple trigtuple);
  extern void ExecARInsertTriggers(EState *estate,
  					 ResultRelInfo *relinfo,
! 					 HeapTuple trigtuple);
  extern void ExecBSDeleteTriggers(EState *estate,
  					 ResultRelInfo *relinfo);
  extern void ExecASDeleteTriggers(EState *estate,
***************
*** 157,169 ****
  extern void ExecARUpdateTriggers(EState *estate,
  					 ResultRelInfo *relinfo,
  					 ItemPointer tupleid,
! 					 HeapTuple newtuple,
! 					 List *recheckIndexes);
  extern void ExecBSTruncateTriggers(EState *estate,
  					   ResultRelInfo *relinfo);
  extern void ExecASTruncateTriggers(EState *estate,
  					   ResultRelInfo *relinfo);
  
  extern void AfterTriggerBeginXact(void);
  extern void AfterTriggerBeginQuery(void);
  extern void AfterTriggerEndQuery(EState *estate);
--- 156,168 ----
  extern void ExecARUpdateTriggers(EState *estate,
  					 ResultRelInfo *relinfo,
  					 ItemPointer tupleid,
! 					 HeapTuple newtuple);
  extern void ExecBSTruncateTriggers(EState *estate,
  					   ResultRelInfo *relinfo);
  extern void ExecASTruncateTriggers(EState *estate,
  					   ResultRelInfo *relinfo);
  
+ extern bool AfterTriggerTupleSatisfiesTrigger(void *tg_data, HeapTuple tuple);
  extern void AfterTriggerBeginXact(void);
  extern void AfterTriggerBeginQuery(void);
  extern void AfterTriggerEndQuery(EState *estate);
***************
*** 173,178 ****
--- 172,180 ----
  extern void AfterTriggerEndSubXact(bool isCommit);
  extern void AfterTriggerSetState(ConstraintsSetStmt *stmt);
  extern bool AfterTriggerPendingOnRel(Oid relid);
+ extern void AfterTriggerAddIndexRecheck(ResultRelInfo *relinfo,
+ 							Relation indexRelation,
+ 							ItemPointer tupleid);
  
  
  /*
*** ./src/include/executor/executor.h.orig	2009-10-15 09:36:18.000000000 +0100
--- ./src/include/executor/executor.h	2009-10-15 09:37:20.000000000 +0100
***************
*** 309,315 ****
  
  extern void ExecOpenIndices(ResultRelInfo *resultRelInfo);
  extern void ExecCloseIndices(ResultRelInfo *resultRelInfo);
! extern List *ExecInsertIndexTuples(TupleTableSlot *slot, ItemPointer tupleid,
  					  EState *estate, bool is_vacuum_full);
  
  extern void RegisterExprContextCallback(ExprContext *econtext,
--- 309,315 ----
  
  extern void ExecOpenIndices(ResultRelInfo *resultRelInfo);
  extern void ExecCloseIndices(ResultRelInfo *resultRelInfo);
! extern void ExecInsertIndexTuples(TupleTableSlot *slot, ItemPointer tupleid,
  					  EState *estate, bool is_vacuum_full);
  
  extern void RegisterExprContextCallback(ExprContext *econtext,
*** ./src/include/executor/nodeBitmapHeapscan.h.orig	2009-10-07 08:46:08.000000000 +0100
--- ./src/include/executor/nodeBitmapHeapscan.h	2009-10-11 09:26:23.000000000 +0100
***************
*** 17,22 ****
--- 17,24 ----
  #include "nodes/execnodes.h"
  
  extern BitmapHeapScanState *ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags);
+ extern BitmapHeapScanState *ExecInitTriggerBitmapHeapScan(Relation rel, EState *estate,
+ 							  TIDBitmap *tbm, void *tg_data);
  extern TupleTableSlot *ExecBitmapHeapScan(BitmapHeapScanState *node);
  extern void ExecEndBitmapHeapScan(BitmapHeapScanState *node);
  extern void ExecBitmapHeapReScan(BitmapHeapScanState *node, ExprContext *exprCtxt);
*** ./src/include/nodes/execnodes.h.orig	2009-10-15 09:26:04.000000000 +0100
--- ./src/include/nodes/execnodes.h	2009-10-15 09:28:13.000000000 +0100
***************
*** 1182,1187 ****
--- 1182,1189 ----
   *		prefetch_iterator  iterator for prefetching ahead of current page
   *		prefetch_pages	   # pages prefetch iterator is ahead of current
   *		prefetch_target    target prefetch distance
+  *		is_tg_scan		   is this a trigger queue scan?
+  *		tg_data			   trigger data to pass to visibility check fn
   * ----------------
   */
  typedef struct BitmapHeapScanState
***************
*** 1194,1199 ****
--- 1196,1203 ----
  	TBMIterator *prefetch_iterator;
  	int			prefetch_pages;
  	int			prefetch_target;
+ 	bool		is_tg_scan;
+ 	void	   *tg_data;
  } BitmapHeapScanState;
  
  /* ----------------
#3Robert Haas
robertmhaas@gmail.com
In reply to: Dean Rasheed (#2)
Re: Scaling up deferred unique checks and the after trigger queue

On Mon, Oct 19, 2009 at 12:48 PM, Dean Rasheed
<dean.a.rasheed@googlemail.com> wrote:

This is a WIP patch to replace the after-trigger queues with TID bitmaps
to prevent them from using excessive amounts of memory. Each round of
trigger executions is a modified bitmap heap scan.

If the bitmap becomes lossy, how do you preserve the correct semantics?

...Robert

#4Dean Rasheed
dean.a.rasheed@googlemail.com
In reply to: Robert Haas (#3)
Re: Scaling up deferred unique checks and the after trigger queue

2009/10/19 Robert Haas <robertmhaas@gmail.com>:

On Mon, Oct 19, 2009 at 12:48 PM, Dean Rasheed
<dean.a.rasheed@googlemail.com> wrote:

This is a WIP patch to replace the after-trigger queues with TID bitmaps
to prevent them from using excessive amounts of memory. Each round of
trigger executions is a modified bitmap heap scan.

If the bitmap becomes lossy, how do you preserve the correct semantics?

...Robert

The idea is that it filters by the transaction ID and command ID of
modified rows to see what's been updated in the command(s) the trigger
is for...
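
In rough terms the test amounts to something like this (a
much-simplified sketch of the INSERT case only -- the real
AfterTriggerTupleSatisfiesTrigger() in the patch has to handle more
than this):

#include "postgres.h"
#include "access/htup.h"
#include "access/xact.h"
#include "utils/combocid.h"

/*
 * Sketch only: does a tuple found via a (possibly lossy) bitmap page
 * belong to the event set for an INSERT trigger?  query_cmd is the
 * CommandId saved when the query's trigger queue was initialised.
 */
static bool
tuple_in_insert_event_set(HeapTupleHeader tup, CommandId query_cmd)
{
	/* must have been inserted by our own transaction ... */
	if (!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tup)))
		return false;

	/* ... by one of the commands this tuple set was queued for */
	return HeapTupleHeaderGetCmin(tup) >= query_cmd;
}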

- Dean

#5Simon Riggs
simon@2ndQuadrant.com
In reply to: Dean Rasheed (#2)
Re: Scaling up deferred unique checks and the after trigger queue

On Mon, 2009-10-19 at 17:48 +0100, Dean Rasheed wrote:

This is a WIP patch to replace the after-trigger queues with TID bitmaps
to prevent them from using excessive amounts of memory. Each round of
trigger executions is a modified bitmap heap scan.

This is an interesting patch. The justification is fine, the idea is
good, though I'd like to see more analysis of the technique, what other
options exist and some thought about when we should use the technique.

We have a bitmap for each UPDATE statement, I think, but there's no docs
or readme. Why just UPDATE? Is the cost of starting up the bitmap higher
than the existing mechanism? Do we need to look at starting with an
existing mechanism and then switching over to new mechanism? Is the TID
bitmap always a win for large numbers of rows?

The technique relies on these assumptions
* Trigger functions are idempotent
* Trigger execution order is not important (in terms of rows)
* Multiple trigger execution order is not important

All of those seem false in the general case. What will you do?

--
Simon Riggs www.2ndQuadrant.com

#6Jeff Davis
pgsql@j-davis.com
In reply to: Dean Rasheed (#2)
Re: Scaling up deferred unique checks and the after trigger queue

On Mon, 2009-10-19 at 17:48 +0100, Dean Rasheed wrote:

This is a WIP patch to replace the after-trigger queues with TID bitmaps
to prevent them from using excessive amounts of memory. Each round of
trigger executions is a modified bitmap heap scan.

Can you please take a look at my patch here:
http://archives.postgresql.org/message-id/1256499249.12775.20.camel@jdavis

to make sure that we're not interfering with each other? I implemented
deferred constraint checking in my operator exclusion constraints patch
(formerly "generalized index constraints").

After looking very briefly at your approach, I think that it's entirely
orthogonal, so I don't expect a problem.

I have a git repo here:
http://git.postgresql.org/gitweb?p=users/jdavis/postgres.git;a=shortlog;h=refs/heads/operator-exclusion-constraints

which may be helpful if you just want to look at the commit for deferred
constraint checking. Any comments welcome.

I'll also take a look at your patch in the next few days.

Regards,
Jeff Davis

#7Dean Rasheed
dean.a.rasheed@googlemail.com
In reply to: Simon Riggs (#5)
Re: Scaling up deferred unique checks and the after trigger queue

2009/10/25 Simon Riggs <simon@2ndquadrant.com>:

On Mon, 2009-10-19 at 17:48 +0100, Dean Rasheed wrote:

This is a WIP patch to replace the after-trigger queues with TID bitmaps
to prevent them from using excessive amounts of memory. Each round of
trigger executions is a modified bitmap heap scan.

This is an interesting patch. The justification is fine, the idea is
good, though I'd like to see more analysis of the technique, what other
options exist and some thought about when we should use the technique.

We have a bitmap for each UPDATE statement, I think, but there's no docs
or readme. Why just UPDATE? Is the cost of starting up the bitmap higher
than the existing mechanism? Do we need to look at starting with an
existing mechanism and then switching over to new mechanism? Is the TID
bitmap always a win for large numbers of rows?

Thanks for looking at this. It works for all kinds of trigger events,
and is intended as a complete drop-in replacement for the after
triggers queue. I admit that I haven't yet done very much performance
testing. As it stands, there does appear to be a small performance
penalty associated with the bitmaps, but I need to do more testing to
be more specific about that.

I had thought that, for relatively small numbers of rows, I could use
something like a small list of CTID arrays of increasing size, and
then switch over to the new mechanism when this becomes too large.
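
Purely to illustrate the shape of what I mean (hypothetical -- nothing
like this is in the patch yet):

#include "postgres.h"
#include "miscadmin.h"			/* for work_mem */
#include "nodes/tidbitmap.h"
#include "storage/itemptr.h"

/*
 * Hypothetical hybrid set: keep an exact TID array while the event
 * set is small, and switch to a (possibly lossy) TID bitmap once it
 * grows past maxtids.
 */
typedef struct HybridTidSet
{
	int			ntids;			/* entries used while still exact */
	int			maxtids;		/* array size before switching over */
	ItemPointerData *tids;		/* exact list, or NULL once switched */
	TIDBitmap  *tbm;			/* bitmap form, or NULL while exact */
} HybridTidSet;

static void
hybrid_add_tid(HybridTidSet *set, ItemPointer tid)
{
	if (set->tbm == NULL && set->ntids >= set->maxtids)
	{
		int			i;

		/* switch over: load the exact TIDs into a bitmap */
		set->tbm = tbm_create(work_mem * 1024L);
		for (i = 0; i < set->ntids; i++)
			tbm_add_tuples(set->tbm, &set->tids[i], 1, false);
		pfree(set->tids);
		set->tids = NULL;
	}

	if (set->tbm != NULL)
		tbm_add_tuples(set->tbm, tid, 1, false);
	else
		set->tids[set->ntids++] = *tid;
}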

But first, I wanted to get feedback on whether this TID bitmap
approach is actually valid for general trigger operation.

The technique relies on these assumptions
* Trigger functions are idempotent

I don't understand what you're saying here. It should execute the
triggers in exactly the same way as the current code (but possibly in
a different order). Idempotence isn't required.

* Trigger execution order is not important (in terms of rows)

It is true that the order in which the rows are processed will change.
As far as I can tell from the spec, there is nothing to say that the
rows for a given statement should be processed in any particular
order. I guess that I'm looking for feedback from people on this list
as to whether that will be a problem for existing apps.

* Multiple trigger execution order is not important

This patch does not change the order of execution in the case where
there are multiple triggers (at least not for regular non-constraint
triggers). They should still be fired in name order, for each row. All
such triggers share a single TID bitmap, and are processed together.
This is in line with the spec.

Deferrable constraint triggers are a different matter, and these will
be fired in a different order (each set of triggers for a given
constraint will be fired together, rather than being interleaved).
This is not covered by the spec, but if they are genuinely being used
to enforce constraints, the order shouldn't matter.

All of those seem false in the general case. What will you do?

At this point I'm looking for more feedback as to whether any of this
is a show-stopper, before I expend more effort on this patch.

- Dean

#8Dean Rasheed
dean.a.rasheed@googlemail.com
In reply to: Jeff Davis (#6)
1 attachment(s)
Re: Scaling up deferred unique checks and the after trigger queue

2009/10/25 Jeff Davis <pgsql@j-davis.com>:

On Mon, 2009-10-19 at 17:48 +0100, Dean Rasheed wrote:

This is a WIP patch to replace the after-trigger queues with TID bitmaps
to prevent them from using excessive amounts of memory. Each round of
trigger executions is a modified bitmap heap scan.

Can you please take a look at my patch here:
http://archives.postgresql.org/message-id/1256499249.12775.20.camel@jdavis

to make sure that we're not interfering with each other? I implemented
deferred constraint checking in my operator exclusion constraints patch
(formerly "generalized index constraints").

Yes, I've been following this, and I'm looking forward to this new
functionality.

After looking very briefly at your approach, I think that it's entirely
orthogonal, so I don't expect a problem.

I agree. I think that the 2 are orthogonal.

Possibly they could both share some common bulk checking code, but I've
not thought much about how to do that yet.

I have a git repo here:
http://git.postgresql.org/gitweb?p=users/jdavis/postgres.git;a=shortlog;h=refs/heads/operator-exclusion-constraints

which may be helpful if you just want to look at the commit for deferred
constraint checking. Any comments welcome.

I did a quick bit of testing, and I think that there is a
locking/concurrency problem :-(

Attached is a (rather crappy) python script (using PyGreSQL) that I
used to test consistency while I was working on the deferrable
uniqueness constraints patch. Basically it just spawns a bunch of
threads, each of which does random CRUD, with heavy contention and
lots of constraint violations and deadlocks, which are rolled back.

I modified the script to enforce uniqueness with an exclusion constraint,
and the script is able to break the constraint, forcing invalid data into
the table.

I haven't looked at your code in depth, but I hope that this is not a
difficult problem to fix. It seems like it ought to be similar to the btree
code.

- Dean

Attachments:

mt_test.pytext/x-python; charset=US-ASCII; name=mt_test.pyDownload
#9Simon Riggs
simon@2ndQuadrant.com
In reply to: Dean Rasheed (#7)
Re: Scaling up deferred unique checks and the after trigger queue

On Mon, 2009-10-26 at 13:28 +0000, Dean Rasheed wrote:

It works for all kinds of trigger events,
and is intended as a complete drop-in replacement for the after
triggers queue.

All of those seem false in the general case. What will you do?

At this point I'm looking for more feedback as to whether any of this
is a show-stopper, before I expend more effort on this patch.

I see no show stoppers, only for you to look at ways of specifying that
this optimization is possible for particular cases. I think we might be
able to make the general statement that it will work for all after
triggers that execute STABLE or IMMUTABLE functions. I don't think we
can assume that firing order is irrelevant for some cases, e.g. message
queues.

--
Simon Riggs www.2ndQuadrant.com

#10Jeff Davis
pgsql@j-davis.com
In reply to: Dean Rasheed (#8)
Re: Scaling up deferred unique checks and the after trigger queue

On Mon, 2009-10-26 at 13:41 +0000, Dean Rasheed wrote:

I did a quick bit of testing, and I think that there is a
locking/concurrency problem :-(

Unfortunately I can't reproduce the problem on my machine; it always
passes.

If you have a minute, can you try to determine if the problem can happen
with a non-deferrable constraint?

I'll keep looking into it.

Thanks,
Jeff Davis

#11Dean Rasheed
dean.a.rasheed@googlemail.com
In reply to: Simon Riggs (#9)
Re: Scaling up deferred unique checks and the after trigger queue

2009/10/26 Simon Riggs <simon@2ndquadrant.com>:

On Mon, 2009-10-26 at 13:28 +0000, Dean Rasheed wrote:

It works for all kinds of trigger events,
and is intended as a complete drop-in replacement for the after
triggers queue.

All of those seem false in the general case. What will you do?

At this point I'm looking for more feedback as to whether any of this
is a show-stopper, before I expend more effort on this patch.

I see no show stoppers, only for you to look at ways of specifying that
this optimization is possible for particular cases. I think we might be
able to make the general statement that it will work for all after
triggers that execute STABLE or IMMUTABLE functions. I don't think we
can assume that firing order is irrelevant for some cases, e.g. message
queues.

Hmm, thinking about this some more... one thing this patch does is to
separate out the queues for "regular" triggers from those for RI
triggers and deferrable constraint checks. ITSM that row-order only
really matters for the former. It's also the case that for these
triggers there will never be any choice but to execute them one
at a time, so they may as well just spool to a file rather than using
a TID bitmap.
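
For illustration, the spooled record could be as simple as this (a
sketch only, again not in the patch):

#include "postgres.h"
#include "storage/buffile.h"
#include "storage/itemptr.h"

/*
 * Sketch only: a fixed-size on-disk record for one spooled trigger
 * event.  The file itself would come from BufFileCreateTemp(false),
 * since the queue only has to live until end of transaction.
 */
typedef struct SpooledEvent
{
	ItemPointerData ctid1;		/* old (or only) tuple */
	ItemPointerData ctid2;		/* new tuple, for UPDATE events */
	uint32		flags;			/* event type and status bits */
} SpooledEvent;

static void
spool_event(BufFile *spool, const SpooledEvent *ev)
{
	if (BufFileWrite(spool, (void *) ev, sizeof(SpooledEvent)) !=
		sizeof(SpooledEvent))
		elog(ERROR, "could not write trigger event to spool file");
}

BufFile takes care of spilling across temp file segments, and reading
the events back in insertion order preserves row order for these
triggers, which a bitmap can't do.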

The bitmaps are probably only useful for constraint triggers, where a
bulk check can be used instead of executing individual triggers for
each row, if enough rows are modified.

- Dean

#12Dean Rasheed
dean.a.rasheed@googlemail.com
In reply to: Jeff Davis (#10)
Re: Scaling up deferred unique checks and the after trigger queue

2009/10/26 Jeff Davis <pgsql@j-davis.com>:

On Mon, 2009-10-26 at 13:41 +0000, Dean Rasheed wrote:

I did a quick bit of testing, and I think that there is a
locking/concurrency problem :-(

Unfortunately I can't reproduce the problem on my machine; it always
passes.

That's odd. It happens every time on my machine (10 threads, 1000 loops).

If you have a minute, can you try to determine if the problem can happen
with a non-deferrable constraint?

If anything, that seems to make it fail more quickly.

If it's of any relevance, I'm currently using an optimised build, with
assert checking off.
[Linux x86_64, 2 core Intel Core2]

- Dean

#13Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#9)
Re: Scaling up deferred unique checks and the after trigger queue

On Mon, Oct 26, 2009 at 9:46 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Mon, 2009-10-26 at 13:28 +0000, Dean Rasheed wrote:

It works for all kinds of trigger events,
and is intended as a complete drop-in replacement for the after
triggers queue.

All of those seem false in the general case. What will you do?

At this point I'm looking for more feedback as to whether any of this
is a show-stopper, before I expend more effort on this patch.

I see no show stoppers, only for you to look at ways of specifying that
this optimization is possible for particular cases. I think we might be
able to make the general statement that it will work for all after
triggers that execute STABLE or IMMUTABLE functions. I don't think we
can assume that firing order is irrelevant for some cases, e.g. message
queues.

Hmm. After-trigger functions are very unlikely to really be STABLE or
IMMUTABLE, though. Almost by definition, they'd better be modifying
some data somewhere, or there's no point.

...Robert

#14Jeff Davis
pgsql@j-davis.com
In reply to: Dean Rasheed (#12)
Re: Scaling up deferred unique checks and the after trigger queue

On Mon, 2009-10-26 at 17:23 +0000, Dean Rasheed wrote:

If it's of any relevance, I'm currently using an optimised build, with
assert checking off.
[Linux x86_64, 2 core Intel Core2]

Ok, I'm able to reproduce it now. Thanks for looking into it!

Regards,
Jeff Davis